I'm struggling to generalize a machine learning model I built from scratch for OCR on press-printed books

I started this as a class project, but I wanted to complete it as a fully working proof of concept. The goal is an OCR tool targeted at press-printed text from the early 19th century that is more accurate than what you can find packaged for pytesseract or built into scanners and such.

I’m using PyTorch.
I’m sure my architecture is pretty naive, but I’ve proven it can overfit, so the textbook says I should shrink my model to improve generalization (I’ve tried just increasing dropout, but that’s not viable).

My overfitting model definition is:

import torch.nn as nn

class OCRModel(nn.Module):
    def __init__(self, num_classes, hidden_size=235):
        super(OCRModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        # self.norm2 = nn.LayerNorm([32, 250, 185])
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2)
        self.norm3 = nn.LayerNorm([64, 125, 92])
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(kernel_size=2)
        self.norm4 = nn.LayerNorm([128, 62, 46])
        self.relu = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.bilstm = nn.LSTM(46, hidden_size, bidirectional=True, batch_first=True)
        self.fc2 = nn.Linear(hidden_size * 2, num_classes + 1)  # +1 for the CTC blank

    def forward(self, x):
        x = self.relu(self.pool1(self.conv1(x)))
        x = self.relu(self.conv2(x))
        x = self.relu(self.norm3(self.pool3(self.conv3(x))))
        x = self.relu(self.norm4(self.pool4(self.conv4(x))))
        b, c, h, w = x.size()
        x = x.view(b, c * h, w)  # flatten channels and height into the sequence axis
        x, _ = self.bilstm(x)  # pass the sequence through the BiLSTM layer
        x = self.dropout(x)
        x = self.fc2(x)
        return x

I’m looking for suggestions from someone who knows better, because textbooks and ChatGPT can only really give general advice. What I think I should try first is removing a convolution layer and shrinking the LSTM size.

But experts might have a much better idea.

There's not much info here, so I'll just make some general suggestions:

If your course allows it, use a pre-existing model and weights and train on your own data. Training accurate models like this from scratch takes millions of samples before the model will even start to give good results. If you can find open, royalty-free models with published weights and datasets, use one as a jumping-off point and train on your own dataset with very confined labels. Fine-tune, fine-tune, fine-tune.
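As a rough sketch of the fine-tuning idea (everything here is a placeholder, not a real pretrained checkpoint): load a backbone whose weights you would take from a published model, freeze it, and train only a new head sized for your own label set:

```python
import torch.nn as nn

# "backbone" stands in for a pretrained feature extractor; in practice you
# would build the published architecture and load its checkpoint with
# load_state_dict() before freezing.
backbone = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 80)  # new head sized for your own character set (80 is made up)

for p in backbone.parameters():  # freeze the pretrained part...
    p.requires_grad = False      # ...so the optimizer only updates the new head

model = nn.Sequential(backbone, head)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Passing only the trainable parameters (or the whole model, since frozen ones get no gradients) to the optimizer then fine-tunes just the head; once that converges you can unfreeze deeper layers with a lower learning rate.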

Then start your own model from scratch, trained only on your own detailed, labeled dataset, with weights initialized from the first system. For OCR, adding a GAN during training would probably help, especially if your dataset is a single language, for example.

More details of exactly what you have done would help. Have you started on it? Is it ongoing, or is this just brainstorming?

I’m doing this for myself; the class stuff is long over.

I’m not sure how I would go about retraining an existing pretrained model with my dataset in its current form. The first thing I tried was using existing models directly, but they perform very poorly on this text, which wasn’t printed with modern techniques (lots of voids and blotches).

I have the hooks in my decoding methods to plug in a language model. I was planning to implement that once I get the base accuracy to a certain level (not there yet), because it would then act kind of like a keyboard autocorrect.
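The autocorrect analogy can be prototyped without a full language model. For instance, a toy post-processor that snaps each decoded word to the nearest lexicon entry (the lexicon and cutoff below are made-up placeholders, not part of the project):

```python
import difflib

# Toy "language model" hook: snap each decoded word to its closest lexicon
# entry by string similarity; leave the word alone if nothing is close enough.
LEXICON = ["the", "quick", "brown", "fox", "jumps"]

def correct(word, cutoff=0.6):
    match = difflib.get_close_matches(word.lower(), LEXICON, n=1, cutoff=cutoff)
    return match[0] if match else word

decoded = "tne qu1ck brwn fox"          # simulated noisy CTC output
corrected = " ".join(correct(w) for w in decoded.split())
```

A real hook would of course use corpus statistics or an n-gram/neural LM rather than plain edit similarity, but the interface (decoded string in, corrected string out) is the same.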

My main focus on this project when I started was to minimize the amount of manual work I would have to do, because I’m 100% solo on this. So I can’t be cutting out characters individually and labeling them. I have a whole set of page PNG + transcript TXT files that I make the model learn alignment on with CTC.
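For reference, that alignment-free setup matches `nn.CTCLoss`: it expects log-probabilities shaped (time, batch, classes) with a reserved blank index, so a model emitting (batch, seq, classes) like the one above needs a permute first. A sketch with toy, made-up shapes:

```python
import torch
import torch.nn as nn

# Toy dimensions -- T (time steps), N (batch), C (classes incl. blank) are made up
T, N, C = 50, 2, 30
logits = torch.randn(N, T, C)                  # batch-first model output
log_probs = logits.log_softmax(2).permute(1, 0, 2)  # CTCLoss wants (T, N, C) log-probs

targets = torch.randint(1, C, (N, 10))         # label indices; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # zero_infinity guards against inf on short inputs
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

The lengths tensors are what let CTC learn the alignment between whole-page frames and the transcript without per-character labels.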

I have proven that my architecture can overfit the dataset.
So my struggle now is generalization. Just increasing dropout wasn’t enough (it’s the first thing I tried at the start of summer), so I’ve been shrinking the model by cutting out layers and/or reducing the bidirectional LSTM size. This only started after I originally posted.

I’m worried about training performance too, to be honest, because I have encountered a bug with model state-dictionary saving, and I cannot fix it for the life of me. It’s implemented exactly as described in the documentation (and I had tried a number of different setups), but the loaded state doesn’t match the saved state at the end of training. I think somehow it’s not deep-copying the current state at the save point, even though I assumed it would by default.
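For what it's worth, that symptom is consistent with a known `state_dict()` gotcha: the returned tensors are detached but still share storage with the live parameters, so an in-memory "best so far" snapshot silently tracks later training steps unless you deep-copy it. A minimal sketch of the gotcha and the fix:

```python
import copy

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for any nn.Module

best_state = model.state_dict()               # WRONG for snapshots: values alias the live parameters
snapshot = copy.deepcopy(model.state_dict())  # safe: fully detached copy

with torch.no_grad():
    model.weight.add_(1.0)                    # stand-in for "training continues"

# best_state["weight"] has silently changed along with the model;
# snapshot["weight"] still holds the values from the save point.
```

Writing straight to disk with `torch.save(model.state_dict(), path)` at the save point is also safe, since serialization copies the values then; only in-memory snapshots need the `copy.deepcopy`.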

I can give more details on any part if you have more specific questions; I don’t want to end up writing a humongous wall of text in a single response…