/var/tmp
Sat, 27 Dec 2008
tesseract
Anyhow, first I examined how well tesseract translates scanned text. I did this by taking scanned pages from Distributed Proofreaders (DP), running tesseract on them, then manually checking the result. DP scans many different types of books, so we get a range of fonts and printing styles. I convert the PNG from DP to a TIF and then let tesseract run.

One thing I quickly noticed is that tesseract handles " fi" poorly, that is, words that begin with the letters fi. One example is on page 305 of part 1 of 4 of Chambers's Twentieth Century Dictionary. Line 33 is translated as:

"cats proverbially tight till each destroys the other. 1 11111;; ``````"

The characters after "other." are just junk that was OCR'd. Anyhow, this should not OCR as "proverbially tight till" but as "proverbially fight till". You can see what it looks like in the book here:
We can see the same situation on the same page. Further down, line 72 is translated as:
This should actually be:
We can see this in a scan of a different book. Page 120 of Secresy, or, The Ruin on the Rock also has a bad translation of " fi", despite different fonts, typesetting and so forth. Line 4 (line 3 if disregarding the title) is translated as:

"must acknowledge, his Hmmess has not undergone the trial you have"

where the real text is:

"must acknowledge, his firmness has not undergone the trial you have"

Hmmess is actually firmness; once again " fi" is mistranslated.
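The manual check described above (comparing what tesseract produced against what the page really says) could be partly automated. What follows is only a minimal sketch in Python, not what I actually ran: the helper name `word_mismatches` is made up for illustration, the comparison uses the standard library's `difflib`, and the trailing junk characters from the dictionary line are left out to keep the example short.

```python
import difflib


def word_mismatches(ocr_line, reference_line):
    """Return (ocr_words, reference_words) pairs where the two lines differ.

    Hypothetical helper: both lines are split on whitespace and aligned
    word by word with difflib.SequenceMatcher.
    """
    ocr_words = ocr_line.split()
    ref_words = reference_line.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=ref_words, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing run so multi-word differences stay together.
            mismatches.append(
                (" ".join(ocr_words[i1:i2]), " ".join(ref_words[j1:j2]))
            )
    return mismatches


# The two lines quoted in this post: OCR output first, correct text second.
pairs = [
    ("cats proverbially tight till each destroys the other.",
     "cats proverbially fight till each destroys the other."),
    ("must acknowledge, his Hmmess has not undergone the trial you have",
     "must acknowledge, his firmness has not undergone the trial you have"),
]

for ocr, ref in pairs:
    for got, want in word_mismatches(ocr, ref):
        print(f"OCR {got!r} should be {want!r}")
```

For these two lines the only differences reported are tight/fight and Hmmess/firmness, which are exactly the " fi" words. The correct text still has to come from somewhere, for example the proofread version of the same page.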
I have been looking through the tesseract output of these letters and words with the debugger on, and am still doing so.