Curtis Wigington and William Barrett, Computer Science
Images of historical documents are being collected and archived much faster than volunteers can possibly index them alone. Improvements in offline handwriting recognition could greatly accelerate the work of indexing by FamilySearch. Offline handwriting recognition has already been shown to be effective in assisting indexers and automating the transcription of historical documents into searchable text. As documents are collected from all around the world, generalized techniques that are robust to damage and noise are needed. Effective stroke extraction and processing are important tools for improving current recognition techniques.
There are two general classifications of handwriting recognition: online and offline. Online recognition uses the motion and order in which the strokes were made (as with a department store e-pen), which must be obtained by recording the handwriting as it is written. Offline handwriting recognition transcribes handwriting using only scanned images, and it is generally considered significantly more difficult than online recognition. Historical documents such as census records must therefore be transcribed using an offline system.
The primary focus of Intelligent Stitching was to improve the robustness of Intelligent Pen, developed by Bauer and Barrett. Their method extracted a stroke using a least-cost search over pixel intensities, taking a consensus path found by sampling at regular intervals. We improved the robustness of the consensus by having all the pixels vote, rather than sampling only at intervals. Each pixel's vote is weighted by heuristics estimating the likelihood that the pixel is part of the stroke. This improved voting scheme allows the algorithm to more consistently extract the entire stroke in images with faint and faded handwriting. Figure 1 shows an example of the pixels that received multiple votes. After voting, only the pixels with a sufficient number of votes are accepted as part of the stroke. Figure 2 shows the accepted stroke.
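The voting step described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the candidate paths, the intensity-based weighting heuristic, and the vote threshold are all assumptions made for the example.

```python
import numpy as np

def weighted_pixel_votes(image, candidate_paths, vote_threshold=2.0):
    """Accumulate weighted votes from candidate stroke paths and keep
    pixels whose total vote clears a threshold (hypothetical sketch)."""
    votes = np.zeros(image.shape, dtype=float)
    for path in candidate_paths:
        for (r, c) in path:
            # Darker pixels (lower intensity) are more likely to be ink,
            # so weight each vote by the inverted, normalized intensity.
            weight = 1.0 - image[r, c] / 255.0
            votes[r, c] += weight
    # Accept only pixels whose accumulated weighted vote is high enough.
    return votes >= vote_threshold
```

Faint strokes still accumulate votes from every path that crosses them, which is why voting over all pixels (rather than sampled intervals) helps on faded handwriting.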
After an individual stroke has been extracted, it typically does not cover the entire handwritten word. To extract the handwriting in its entirety, new seed points must be created. These new seed points are placed at the ends of the extracted stroke, and the process repeats. Once a seed expands and finds no new pixels, it spawns no new seed points and terminates.
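The iterative seed-expansion loop above can be sketched as a worklist algorithm. The helper functions `extract_stroke` and `stroke_endpoints` are hypothetical stand-ins for the single-stroke extraction and endpoint detection described in the text.

```python
def extract_handwriting(initial_seed, extract_stroke, stroke_endpoints):
    """Repeatedly extract strokes from seed points until no seed
    yields new pixels (hypothetical sketch of the expansion loop).

    extract_stroke(seed)      -> set of stroke pixels reached from seed
    stroke_endpoints(pixels)  -> endpoint coordinates of that stroke
    """
    covered = set()
    seeds = [initial_seed]
    while seeds:
        seed = seeds.pop()
        pixels = extract_stroke(seed)
        new_pixels = pixels - covered
        if not new_pixels:
            # This seed found nothing new: terminate this branch.
            continue
        covered |= new_pixels
        # Spawn new seeds at the ends of the stroke and repeat.
        seeds.extend(stroke_endpoints(pixels))
    return covered
```

Because a seed that contributes no new pixels spawns no successors, the loop terminates once the whole word has been covered.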
There has been relatively little research on stroke extraction from noisy historical documents. As a result, most datasets focused on stroke extraction provide ground truth only for very clean handwriting. Such images poorly represent the strengths of our technique, so we have not included quantitative results. Instead, we include a few qualitative results to show how the algorithm performs.
Figure 3 is a signature from George Washington. This example shows how the algorithm can follow the stroke through some of its more faded parts. Figure 4 shows another signature, which contains many dark strokes with faded strokes nearby. In most cases, the algorithm is able to find both the light and the dark strokes. The figure also shows how form lines at the bottom of the image can cause the extraction to jump between the form line and the handwriting.
Conclusion and Discussion
We have developed an algorithm for robust stroke extraction from historical documents that are faded and noisy. The extracted strokes can be passed to existing online recognition techniques to transcribe the handwriting. These techniques become increasingly necessary as more documents are imaged, well beyond the quantity that indexers can transcribe by hand. Future research will include testing a variety of recognition techniques on the strokes found by our algorithm.