Over the last month I’ve been working with an organization called Story Squad. Story Squad’s goal is to build an e-learning application that encourages children in 3rd through 5th grade to develop reading comprehension, writing, and illustration skills. I worked on the data science portion of the project, and the problem we attempted to solve was a grueling one: an optical character recognition (OCR) system that can read children’s handwriting. Some adults have bad handwriting, and others have downright horrid handwriting. Children, though, are much too innocent to be told their handwriting is terrible, but the room for improvement is definitely there. So here is the problem: how can I train a model to read a child’s handwriting?
This is the point when you begin to feel like the ‘Receiver of Memory’ in The Giver: no one you know has gone to the ‘outer limits’ of building an OCR for children’s handwriting and come back to tell us about it. How could we approach this task, and what was the most efficient way of doing so?
As I began to research the process of training an OCR to read children’s handwriting, it became apparent that the model would only be as good as the data it is trained on. What does it mean to have clean data for training a handwriting recognition model? First, let me explain what Story Squad would need from the data science team.
After a child reads a story on the Story Squad application, they are prompted to write a short side-story that uses references from the story they just read. The child then takes a picture of this story (most likely using a cell phone), and this image is uploaded into the application. As of now, the application cannot recognize the handwriting, and this is where Tesseract OCR comes in.
How can we develop a preprocessing technique that is robust and doesn’t compromise handwriting integrity?
This was the first technical challenge to address. We needed to preprocess the images to have the least amount of noise while also giving up the least amount of handwriting integrity.
When you’re preprocessing images, it is very easy to drift from a good result to one that Tesseract will no longer recognize. Often it’s only one line of code from decent to disaster.
Below was my first coding attempt in Python to solve this problem.
The problem with this image is not difficult to see: the bottom of the page is completely black. If there was writing at the bottom of this image, as is the case with many of the images, Tesseract would not be able to see it. We needed a processing function that could remove the shading while still maintaining the integrity of the handwriting. I originally tried to get these results with the opencv-python and Pillow modules, but as I continued to research I came across the scikit-image module, which includes an implementation of the Sauvola threshold. Below is the function we created to process images using scikit-image’s Sauvola threshold.
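(The shipped function was also shown as an image. Here is a minimal sketch of the core call, assuming scikit-image’s `threshold_sauvola` with its default `k`; the function name and defaults are mine.)

```python
import numpy as np
from skimage.filters import threshold_sauvola


def sauvola_binarize(gray: np.ndarray, window_size: int = 25,
                     k: float = 0.2) -> np.ndarray:
    """Binarize a grayscale page with Sauvola's local threshold.

    threshold_sauvola computes a separate threshold for every pixel
    from the mean and standard deviation of its local window, so a
    shaded region is judged against its own neighborhood instead of
    the whole page.
    """
    thresh = threshold_sauvola(gray, window_size=window_size, k=k)
    # Pixels brighter than their local threshold become white paper
    # (255); darker pixels become black ink (0).
    return np.where(gray > thresh, 255, 0).astype(np.uint8)
```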
Reading the documentation, I saw that the Sauvola threshold used a method I was familiar with: a dynamic standard deviation. I’ve used dynamic standard deviations in my finance projects, and I always wondered in what other areas besides finance this method could be applied. Another name for a dynamic standard deviation is a rescaled range. The idea comes from Harold Hurst, who was trying to calculate how big a dam spanning the Nile needed to be; long story short, the dam needed to be large enough to account for droughts and heavy rains over certain durations of time. Naturally, I applied the Sauvola threshold to the original image and WOW! I felt like I had found the holy grail of image preprocessing! The dynamic standard deviation strikes again!
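For reference, Sauvola’s per-pixel threshold combines the local mean and that local (dynamic) standard deviation:

```latex
T(x, y) = m(x, y)\left[1 + k\left(\frac{s(x, y)}{R} - 1\right)\right]
```

where m(x, y) and s(x, y) are the mean and standard deviation of the window around pixel (x, y), k is a tuning constant (0.2 by default in scikit-image), and R is the dynamic range of the standard deviation (128 for 8-bit images). Where the local deviation is small — flat shaded paper — the threshold drops below the local mean, so shading is classified as background instead of ink.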
Here is an example of the final preprocessing code we did for Story Squad:
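(The final code was shown as a screenshot as well. A self-contained sketch of the same pipeline — load, grayscale, Sauvola threshold, save as a lossless PNG — might look like the following; the function name and parameter defaults are assumptions.)

```python
import numpy as np
from skimage import io
from skimage.color import rgb2gray
from skimage.filters import threshold_sauvola
from skimage.util import img_as_ubyte


def preprocess_image(input_path: str, output_path: str,
                     window_size: int = 25, k: float = 0.2) -> None:
    """Full preprocessing pass: load, grayscale, Sauvola, save as PNG."""
    image = io.imread(input_path)
    if image.ndim == 3:            # color photo from a phone
        image = rgb2gray(image)    # floats in [0, 1]
    thresh = threshold_sauvola(image, window_size=window_size, k=k)
    binary = image > thresh        # True = paper, False = ink
    io.imsave(output_path, img_as_ubyte(binary))
```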
Passing the Baton to the Next Cohort
By the end of this project we had developed a robust preprocessing technique, and we furthered it by writing shell scripts that convert .jpg to .png to fit Tesseract’s needs and segment each image into lines using Tesseract’s page segmentation modes (psm). For each segmented line we then created a corresponding ground truth .txt file that matched the writing in the segmentation. Below is an example of a segmentation and its corresponding ground truth .txt file.
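(The example was shown as an image in the original post. The pairing convention looks like this — the filenames are illustrative, and the `.gt.txt` suffix is the convention Tesseract’s tesstrain training tooling expects:)

```text
story_0001_line_01.png      # cropped image of one handwritten line
story_0001_line_01.gt.txt   # plain text of exactly what that line says
```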
We wrote a Quickstart guide to get the next cohort up and running and also left a few pages of documentation describing all our processes in detail, with the corresponding code to run at each step. The next cohort should be able to begin cleaning more image segmentations and start the training process for the OCR. We weren’t sure how to implement the ground truth files, but we can say with good confidence that we believe this is the best way to pursue building the model. Ultimately, the model will need more processed data. We began this process, but due to time we were only able to get about one quarter of the images processed.
List of Shipped Features:
- conversion.sh script — converts .jpg to .png
- preprocessing.py — applies preprocessing to original images to make ready for Tesseract
- segmentation.sh — creates line segmentations of entire images
- generating ground truth .txt file using the segmentation.sh script
- process for further editing the segmentation for more accurate training
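(The shipped conversion step was a shell script; as a rough sketch of the same idea in Python — using Pillow, which we also used early on — it would look something like this. The function name is mine, not the shipped script.)

```python
from pathlib import Path
from PIL import Image


def convert_jpgs_to_pngs(folder: str) -> list:
    """Convert every .jpg in `folder` to a .png alongside it.

    Mirrors what conversion.sh does: the Tesseract training pipeline
    wants lossless PNG input rather than JPEG.
    """
    converted = []
    for jpg in sorted(Path(folder).glob("*.jpg")):
        png = jpg.with_suffix(".png")
        Image.open(jpg).save(png)
        converted.append(png.name)
    return converted
```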
Here is a video containing our findings and processes:
(my portion @ 7min — 21min)
The Future of this Project
This project has a lot of room for improvement on the data science side. The preprocessing function is already fairly robust, but it was focused on curating data for training the model. For preprocessing at submission time, a line-removal function could help, since children are usually writing on lined paper, and that same step could then be applied on the training side as well. Data science is both an art and a science, so we need to try different functions and techniques in order to get the desired results. I like to say that we are either building or refining.
Once the training is accurate enough, we could begin implementing spell checks, punctuation counters, word-length measures, or unique-word usage, and from there gather statistics on those features. The reason to gather statistics would be to match children with other children at a similar skill level. Ultimately, I see training accuracy as the biggest challenge going forward, because children’s handwriting is difficult for OCR. In the end, it was a good time learning about optical character recognition and computer vision. Those two terms are going straight on the resume. I think companies will value the experience I have working on an OCR as well as working on a cross-functional team. Collaborating as a team, with the success we had, helped me develop specific skills that will be needed in the workplace.
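(As a sketch of what that feature gathering could look like — a hypothetical helper, not shipped code — the statistics mentioned above are straightforward to compute from OCR output:)

```python
import re


def story_stats(text: str) -> dict:
    """Gather simple writing features from a recognized story.

    Computes punctuation counts, word lengths, and unique-word usage,
    which could later feed a skill-matching step.
    """
    words = re.findall(r"[a-zA-Z']+", text.lower())
    punctuation = re.findall(r"[.,;:!?]", text)
    return {
        "word_count": len(words),
        "unique_words": len(set(words)),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "punctuation_count": len(punctuation),
    }
```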
Some of the feedback I received from my peers during this period was to manage my time better and also not to forget to take care of myself. This was helpful advice from my TPL, because I was wearing myself out on this project. Taking the time to reach out for help and ask for collaboration was a big part of the success we had as a team.