I started this as a class project, but I wanted to complete it as a fully working proof of concept. The goal is an OCR tool targeted at press-printed text from the early 19th century, more accurate than what you get from pytesseract or the OCR built into scanners and such.
I’m using PyTorch.
I’m sure my architecture is pretty naive, but I’ve proven it can overfit, so the textbook advice is to shrink the model to improve generalization (I’ve tried just increasing dropout, but that’s not viable).
My overfitting model definition is:
import torch.nn as nn

class OCRModel(nn.Module):
    def __init__(self, num_classes, hidden_size=235):
        super(OCRModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        # self.norm2 = nn.LayerNorm([32, 250, 185])
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2)
        self.norm3 = nn.LayerNorm([64, 125, 92])
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(kernel_size=2)
        self.norm4 = nn.LayerNorm([128, 62, 46])
        self.relu = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.bilstm = nn.LSTM(46, hidden_size, bidirectional=True, batch_first=True)
        self.fc2 = nn.Linear(hidden_size * 2, num_classes + 1)

    def forward(self, x):
        x = self.relu(self.pool1(self.conv1(x)))
        x = self.relu(self.conv2(x))
        x = self.relu(self.norm3(self.pool3(self.conv3(x))))
        x = self.relu(self.norm4(self.pool4(self.conv4(x))))
        b, c, h, w = x.size()
        x = x.view(b, c * h, w)  # channels*height becomes the sequence axis, width the features
        x, _ = self.bilstm(x)    # pass the sequence through the BiLSTM layer
        x = self.dropout(x)
        x = self.fc2(x)
        return x
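For reference, the LayerNorm shapes imply 1×500×370 inputs. Here’s a quick shape sanity check, plus a sketch of CTC-loss wiring (I’m assuming the num_classes + 1 output is the CTC blank; the vocabulary size and target lengths below are made-up placeholders):

import torch
import torch.nn.functional as F

model = OCRModel(num_classes=80)         # 80 = placeholder vocabulary size
images = torch.randn(2, 1, 500, 370)     # batch of 2 grayscale 500x370 crops
logits = model(images)                   # -> (2, 7936, 81): 128*62 timesteps
log_probs = F.log_softmax(logits, dim=2).permute(1, 0, 2)  # CTCLoss wants (T, N, C)

targets = torch.randint(1, 81, (2, 30))  # dummy labels; class 0 reserved for blank
input_lengths = torch.full((2,), logits.size(1), dtype=torch.long)
target_lengths = torch.full((2,), 30, dtype=torch.long)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
print(logits.shape, loss.item())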
I’m looking for suggestions from someone who knows better, because textbooks and ChatGPT only really give general advice. My first instinct is to remove a convolution layer and shrink the LSTM hidden size, but experts might have a much better idea.
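For concreteness, here’s a minimal sketch of the shrink I’m describing: conv4/pool4/norm4 removed and the hidden size roughly halved. The LayerNorm and LSTM input sizes are just recomputed for the smaller network (the feature width after pool3 is 92 on my 500×370 inputs); none of the values are tuned:

import torch.nn as nn

class SmallOCRModel(nn.Module):
    def __init__(self, num_classes, hidden_size=128):  # halved hidden size; untuned guess
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2)
        self.norm3 = nn.LayerNorm([64, 125, 92])
        self.relu = nn.LeakyReLU()
        self.dropout = nn.Dropout(p=0.5)
        self.bilstm = nn.LSTM(92, hidden_size, bidirectional=True, batch_first=True)  # width is 92 after pool3
        self.fc2 = nn.Linear(hidden_size * 2, num_classes + 1)

    def forward(self, x):
        x = self.relu(self.pool1(self.conv1(x)))
        x = self.relu(self.conv2(x))
        x = self.relu(self.norm3(self.pool3(self.conv3(x))))
        b, c, h, w = x.size()
        x = x.view(b, c * h, w)  # sequence length is now 64*125 = 8000
        x, _ = self.bilstm(x)
        x = self.dropout(x)
        return self.fc2(x)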