/var/tmp
   


About
Android, Linux, FLOSS etc.



Thu, 02 Apr 2009

gocr
Cool, my patch to GOCR, which deals with distinguishing a's from d's, made it into the 0.47 release.

[/ocr/gocr] permanent link

Sun, 25 Jan 2009

gocr, ocr
With regard to all things OCR, I did a patch for GOCR in 2006. GOCR would see the letter 'a' when the letter was actually 'd'. There were two reasons for this:

1) Sometimes there would be a serif at the top of the 'd'. When GOCR examines a 'd', it looks for a straight up-and-down line segment on the right side and two horizontal arcs on the left side - the top and bottom of the circle in 'd'. GOCR would see the serif at the top of the up-and-down line segment and get confused. It was not expecting the serif; it expected two arcs and a straight segment to their right, and that's it. So I put in a patch to make GOCR less strict and allow for the serifs one finds in text at the top of a 'd'. This was done in the ocr0_dD() function.

2) The second change improved recognition between a's and d's as well. This was done in the ocr0_aA() function. The letters 'a' and 'd' are printed in different fonts in different texts, and sometimes the only difference is that the up-and-down line segment on the right side extends significantly above the circle on the left for 'd', while with 'a' it stays level with the circle, or extends only slightly above it.

The ocr0_aA() function currently looks into the box struct for x0, x1, y0 and y1. My patch looks at m1 as well. In the 2006 patch, if m1-y0 is greater than or equal to 0, I break, which declares that the character is not an 'a'.
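Here is a minimal sketch of that test, written in Python rather than GOCR's C just to keep it short. The Box class and the rejects_a() name are hypothetical, made up for this illustration; only the field names and the "m1-y0 >= 0" test come from the patch. The sketch also assumes image coordinates where y grows downward, with y0 as the top row of the character box and m1 as the row of the ascender line. That is an assumption of the sketch, not something spelled out above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for the character box; not GOCR's actual struct."""
    x0: int  # left column
    x1: int  # right column
    y0: int  # top row (assumption: y grows downward)
    y1: int  # bottom row
    m1: int  # assumption: row of the ascender line for this text line


def rejects_a(box: Box) -> bool:
    """Return True when the 2006 rule would stop treating the glyph as 'a'."""
    # The character's top edge is at or above the m1 row.
    return box.m1 - box.y0 >= 0
```

With these assumptions, a box whose top is level with m1 is rejected as 'a', while a box whose top sits below m1 is left for the rest of the 'a' tests.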

Every test I ran showed an improvement in GOCR recognition after this. But a week after I sent my patch in, as I became more familiar with gocr, two things occurred to me. The first was that instead of breaking, and declaring that the character was not an 'a', I probably should have used the setac() call instead - diminishing the likelihood that the character was 'a' but not totally eliminating it. The second was that "m1-y0 >= 0" as the condition to break is somewhat arbitrary. How far can the line segment rise before an 'a' becomes a 'd'? I picked the number 0 arbitrarily. I did a number of tests, but more can be done, especially on very small and very large characters. The patch I submitted seemed to break nothing and only fix things, but I decided a better patch would use setac() instead of a break in ocr0_aA(), and would be tested more thoroughly on bigger characters.
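To make the difference between the two approaches concrete, here is a rough sketch in Python, not GOCR code. The a_certainty() function, the 0-100 certainty scale, and the penalty of 50 are all hypothetical values chosen for illustration; setac()'s real signature is not reproduced here. The threshold parameter stands in for the 0 in "m1-y0 >= 0", so it can be varied while testing.

```python
def a_certainty(m1: int, y0: int, base: int = 100,
                threshold: int = 0, penalty: int = 50) -> int:
    """Return a hypothetical 0-100 certainty that the glyph is 'a'.

    Instead of rejecting 'a' outright when m1 - y0 >= threshold,
    lower the certainty so other candidates (such as 'd') can win.
    """
    if m1 - y0 >= threshold:
        # Soft rejection: reduce the score, never below zero.
        return max(0, base - penalty)
    return base
```

A hard break corresponds to a penalty at least as large as the base score; anything smaller leaves 'a' as a possible, but less likely, answer.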

That would take some time, so I contacted Joerg Schulenburg recently and he gave me some encouragement. I am going forward with the new, better patch, as the first one was never applied (and since I felt I could do better, I never pushed for it to be applied). Joerg is busy with things, and I am a little busy as well, but I am less busy than I used to be, and I have the time in the next weeks to do this new and better patch.

Anyhow, Joerg asked for some sample files. First I should say, I just downloaded gocr via cvs (January 24, 2009), patched it, and compared the files in the examples directory between the current cvs and my patched version (4x6.png 5x7.png 5x8.png ocr-a.png ocr-b.png handwrt1.jpg matrix.jpg). There was no change for any of the examples.

What I use to test is page scans I got off of the Distributed Proofreaders website. With my 2006 patch, in every test I did on every scanned image, I did not see any negative effect - my patch did not remove recognition of any correctly labeled 'a' or 'd'. The only changes I saw were to incorrectly labeled a's and d's - usually a character that had been incorrectly identified as an 'a' now being seen correctly as a 'd'.

An example of this is page 83 of the book "Daring and Suffering: A History of the Great Railroad Adventure" by William Pittenger. I got the scan of this from Distributed Proofreaders as it was on the way to Project Gutenberg. Line 8 of that page (to GOCR it is the eighth line of text or whitespace; when reading the text it is the first line) with the current CVS snapshot of GOCR is

_ll of the eigbt _e_ were capt4red, a_a are

With my 2006 patch, the line comes out as:

_ll of the eigbt _e_ were capt4red, a_d are

With my patch, GOCR now recognizes that the last letter of the word "and" is not 'a', but 'd'. A false recognition of the letter is replaced by the correct one.

You can download page 83 of the book yourself, and run it with the current cvs snapshot, and against my patch.
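If you save the text output of both runs, a short script can point out which lines changed. This is just a sketch in Python; changed_lines() is a name I made up, and it assumes both outputs have the same number of lines, so that line numbers match up the way GOCR counts them.

```python
def changed_lines(before: str, after: str) -> list:
    """Return (line_number, old, new) for each line that differs.

    Lines are numbered from 1. Assumes both outputs have the same
    number of lines; any extra trailing lines are ignored.
    """
    result = []
    pairs = zip(before.splitlines(), after.splitlines())
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            result.append((number, old, new))
    return result


if __name__ == "__main__":
    # The line from page 83, before and after the patch.
    old_text = "_ll of the eigbt _e_ were capt4red, a_a are"
    new_text = "_ll of the eigbt _e_ were capt4red, a_d are"
    for number, old, new in changed_lines(old_text, new_text):
        print(number, old, "->", new)
```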

Another example is page 160 of the book "Left End Edwards" by Ralph Henry Barbour. This is another book whose pages I grabbed from Distributed Proofreaders on their way to Project Gutenberg. My patch has a very good effect on this page, fixing an 'a' to a 'd' on four lines, all correctly.

The first line fixed (according to GOCR) is line 8:

a_d tahe_ 9ou o_. Peters say6 _obey _il_ be ais-

becomes:

a_d tahe_ 9ou o_. Peters say6 _obey _il_ be dis-

The patch correctly changes "ais-" to "dis-". You can look at the image and see that this is correct.

On the line which GOCR says is line 12, the line changes from:

'' I ao_'t believe t_ey _ll,'' replied Steve _o-

to

'' I do_'t believe t_ey _ll,'' replied Steve _o-

The 'd' in "don't" is now seen as 'd', not as 'a'.

There are two more corrected lines, 26 and 27, where an 'a' correctly becomes a 'd'. You can see this for yourself: download the page and run it with the current gocr cvs, and then with my 2006 patch applied.

As I said, I pulled these pages randomly from Distributed Proofreaders. It was the only online source of scans I knew of. I also tested this on scans from my own book collection, although my books are not in the public domain, so due to copyright issues I am less inclined to post them. The Distributed Proofreaders books are public domain books.

As I said, I tested many pages on this, and if you want me to post more of my tests I will. Many of my tests simply had no change - the pre-patch gocr was the same as post-patch. On all my tests I saw no negative effect. Only positive ones, usually a 'd' mis-classified as 'a' being properly classified as 'd'.

But as I said, I have been thinking about the patch, and think a setac() would be better than a break for my ocr0_aA() test. The 0 in "m1-y0 >= 0" is also somewhat arbitrary, and I want to run more tests, especially on large characters. So my existing patch seems to only fix things, but I feel I can make it even better.

Here is a copy of the 2006 patch which works against the current (January 25, 2009) CVS snapshot. In the next weeks, I will work to see if I can improve the patch, mostly by calling setac() instead of breaking, and by seeing if the 0 value in "m1-y0 >= 0" is the best number to use once I do more testing against larger characters. So I'll be sending you an improved version of my 2006 patch in the next few weeks. School is starting up for me again Monday and I will be somewhat busy, but I'm fairly sure I will have enough time.

[/ocr/gocr] permanent link

Tue, 17 Jun 2008

/var/tmp
Well, I snapped this domain up again. I had it several years ago and lost it. Now I have it again.

One of my interests is OCR, particularly free software OCR. I spent some time on gocr, even though none of my patches were used and the project has not been updated for over a year. GOCR seemed the best thing to contribute to when I was looking at this a year or two ago, but Google has put Tesseract and Ocropus out there, so I am going to take a look at those now. They are in C++ - a language I knew nothing of two years ago, but I have since taken a class in it, so I am now a little more familiar with it. Apparently Tesseract only does OCR, not layout. Ocropus is a layout plugin.

I'm trying it now...it's pretty good. Better than GOCR probably.

I will attempt to improve it the same way...get a number of samples of different books from Distributed Proofreaders, match the tesseract OCR output to the original text...see if there are any patterns of failure, then fix that in the tesseract code if possible.
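A sketch of how the "patterns of failure" step could be tallied, using Python's standard difflib. The confusions() function is a name I made up for illustration and is not part of tesseract or gocr. It only counts one-for-one character substitutions and skips insertions, deletions and uneven replacements, so it is a rough first pass rather than a full evaluation.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusions(truth: str, ocr: str) -> Counter:
    """Count (expected, recognized) character pairs where the OCR differs."""
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map cleanly character to character.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                counts[(expected, got)] += 1
    return counts


if __name__ == "__main__":
    tally = confusions("will be dis-", "will be ais-")
    for (expected, got), n in tally.most_common():
        print(f"{expected!r} read as {got!r}: {n}")
```

Adding up the counters from many pages would show which substitutions, such as 'd' read as 'a', happen often enough to be worth chasing in the code.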

[/ocr/gocr] permanent link