/var/tmp
   


About
Android, Linux, FLOSS etc.



Fri, 22 Oct 2010

poppler rendering
On Ubuntu, the default method of reading PDFs is with evince, which uses the poppler library as its backend for PDFs.

The bus map PDF for my area is 751K, and it takes 16 seconds to load on my computer. That computer is a 64-bit desktop system running Ubuntu, with an AMD Athlon(tm) II X2 240 2.8GHz processor with two cores, and four gigs of RAM. 16 seconds seems far too long. Epdfview, which also uses poppler, takes 8 seconds to render the PDF. Adobe Reader 9.3.4, on the other hand, takes less than 4 seconds, and I'll use that as a benchmark here of what the render time should be.

So I looked into it. Instead of grabbing all of the necessary source and compiling with gcc's -pg flag for gprof, I compiled an uncompressed kernel and ran oprofile while rendering the PDF. The oprofile report did not show everything fully, so I downloaded the necessary debug symbol packages for poppler, cairo etc.

I rendered the PDF with evince, with epdfview, and with poppler's poppler-glib-demo in a local poppler library cloned from the latest git commit which I compiled manually. For all of them, oprofile pointed to poppler being the library dominating the processor.

So with the Ubuntu package of debug symbols for poppler installed, I had the oprofile report look at poppler. It showed three methods dominating processor time. In order, they were TextBlock::isBeforeByRule1, TextPage::coalesce and TextBlock::visitDepthFirst. This was the case for all three programs using the poppler library backend.

So I start hacking around with the coalesce method in poppler, when I come across this line:

sortPos = blk1->visitDepthFirst(blkList, i, blocks, sortPos, visited);

I look at the visitDepthFirst method and see it is doing topological sorts on the data. So I comment out the above line within coalesce

// sortPos = blk1->visitDepthFirst(blkList, i, blocks, sortPos, visited);

recompile and run it.

So, when rendering the aforementioned PDF with poppler's test program poppler-glib-demo WITH the visitDepthFirst method in coalesce, the fastest rendering I get is 8 seconds. When I render it without that line, the fastest render I get is less than 3 seconds. Removing this method more than halves my render time.
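The post does not say how the render times were taken. One way to get a "fastest of several runs" figure like the ones above is sketched here, assuming the rendering step can be wrapped in a callable. fastestSeconds and work are hypothetical names and not part of poppler or poppler-glib-demo.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` the given number of times and returns the fastest
// wall-clock time in seconds. `work` is a hypothetical stand-in for
// "render the page once"; nothing here calls into poppler.
double fastestSeconds(int runs, const std::function<void()>& work) {
  assert(runs > 0);
  double best = 0.0;
  for (int i = 0; i < runs; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    // Keep the smallest time seen so far.
    best = (i == 0) ? elapsed.count() : std::min(best, elapsed.count());
  }
  return best;
}
```

Taking the minimum rather than the average keeps one-off delays, such as a cold disk cache on the first run, out of the comparison.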

But perhaps this behavior was due to the PDF itself being unusual. I did a quick search through Google for other PDFs. As I had this problem with a (partial) city map, I looked for other city maps, and picked a PDF with a map of Paris at random. I tested its rendering in poppler-glib-demo, with and without the call to the visitDepthFirst method in coalesce. Fastest render time with the call was 3.76 seconds, fastest without it was 2.07 seconds. So this random map PDF did not show as significant an improvement as my map PDF, but this call, which was not doing anything visible to improve the map display, was still adding about 80% to the render time.

As I said, from my cursory look, there was no difference between the map displayed with the call to visitDepthFirst and the one displayed without it. I had seen what the code did and read its comment. For more information on its purpose, I began digging through the logs. I saw that the code came from commit f83b677a8eb44d65698b77edb13a5c7de3a72c0f on November 12, 2009. In November 2009, Brian Ewins made a series of commits whose purpose was to improve text selection in tables. This particular commit changed the method of block sorting to "reading order" via a topological method. Aside from the comments in the git commits, these November 2009 column selection commits were discussed on the poppler mailing list, as well as in Bugzilla.

I reverted poppler back to commit 345ed51af9b9e7ea53af42727b91ed68dcc52370 and compiled epdfview against it. Then I moved to poppler commit f83b677a8eb44d65698b77edb13a5c7de3a72c0f and compiled epdfview against that as well. Commit 345e... is two commits before f83b... When I run epdfview against each of these poppler builds, the one that is two commits earlier, 345e..., renders in less than half the time that commit f83b... does. Commit f83b... more than doubles the time to run, with no noticeable improvement in anything for this use of poppler (displaying a map via epdfview).

I mentioned much of this on Freenode IRC channel #poppler, and how removing the visitDepthFirst method from coalesce improved rendering time enormously. Someone, I believe it was Andrea Canciani, looked at the method and said the two nested loops looked wrong.

There are two for loops in visitDepthFirst. I put a counter on the inside one to see how many times it ran on my 751K PDF. It ran over 196 million times! A 751K file is roughly 6 million bits, so for every bit in my PDF, the inner for loop ran about 32 times. Not only that, if the TextBlock data structure blk3 is not equal to either blk1 or blk2, the inner for loop will make not one but two calls to the isBeforeByRule1 method. No wonder my map is rendering slowly.
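To see how a counter like that can climb so quickly, here is a minimal sketch of a depth-first topological sort with the same general shape: an outer loop over candidate blocks and an inner loop over all blocks. It is not poppler's code. Block, isBefore, visit and sortBlocks are hypothetical names, and the ordering rule is reduced to comparing a single coordinate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a text block: just one coordinate.
struct Block { double y; };

// Hypothetical ordering rule, standing in for a check like isBeforeByRule1.
static bool isBefore(const Block& a, const Block& b) { return a.y < b.y; }

struct SortResult {
  std::vector<std::size_t> order;     // block indices in sorted order
  std::uint64_t innerIterations = 0;  // how often the inner loop body ran
};

// Depth-first visit: emit everything that must come before block i, then i.
static void visit(const std::vector<Block>& blocks, std::size_t i,
                  std::vector<bool>& visited, SortResult& out) {
  if (visited[i]) return;
  visited[i] = true;
  for (std::size_t j = 0; j < blocks.size(); ++j) {    // outer loop
    if (visited[j] || !isBefore(blocks[j], blocks[i])) continue;
    bool direct = true;  // does j come immediately before i?
    for (std::size_t k = 0; k < blocks.size(); ++k) {  // inner loop
      ++out.innerIterations;
      if (k == i || k == j) continue;
      // Two rule checks per candidate k: is k between j and i?
      if (isBefore(blocks[j], blocks[k]) && isBefore(blocks[k], blocks[i])) {
        direct = false;
        break;
      }
    }
    if (direct) visit(blocks, j, visited, out);
  }
  out.order.push_back(i);
}

SortResult sortBlocks(const std::vector<Block>& blocks) {
  SortResult out;
  std::vector<bool> visited(blocks.size(), false);
  for (std::size_t i = 0; i < blocks.size(); ++i) {
    visit(blocks, i, visited, out);
  }
  assert(out.order.size() == blocks.size());
  return out;
}
```

Feeding this sketch n blocks in descending order makes the inner loop body run n × (n − 1) times, so 1,000 blocks already cost about a million iterations. The nesting only guarantees an upper bound of n³, since each of the n visits can run the outer loop n times and the inner loop n times within it.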

So this is where it stands now. 16 seconds seems too long for Ubuntu's default PDF viewer to load my local bus map - especially when in my hacking around I have gotten it to display in a second and a half or so. Whatever that topological commit from November 2009 fixed in terms of selecting text from tables, it has more than doubled the rendering time of some PDFs, especially PDFs with maps. The solution would be a change that would keep the fixes to text selection in tables, but still have something near the speed of rendering prior to that commit. I will look into this more when I have the time.

[/poppler] permanent link

Tue, 22 Dec 2009

I have become interested in the poppler library lately (and one of the programs that depends on it, evince). Poppler is a PDF rendering library. I have been looking through Ubuntu's bug tracking system on launchpad, and people have been complaining how they run Evince on a PDF file and it crashes, or at least it doesn't display the file right away.

The particular bug that was reported through Ubuntu that I am looking at is #497175. The user tried to use evince to look at his PDF, but it did not work. Text that should have been displayed was not displayed. He said Xpdf did work on the PDF file, displaying all text. I downloaded the sample PDF file, and saw indeed it displayed the text with Xpdf and not evince 2.28.1 (using poppler 0.12.0) on Ubuntu 9.10. I tried displaying it with evince 2.22.1.1 based on the poppler 0.6.4 library. That worked. So I figured some time between those early, working evince/poppler versions, and the more recent evince/poppler versions which broke for the bug reporter (and myself), something must have changed that broke this.

I wasted time in two respects looking at this. One is I looked at both evince and poppler. From my kibitzing of evince and poppler over the past months, I have seen over and over that most reported bugs on Ubuntu dealing with PDFs and evince are due to bugs in poppler, not evince. So trying different evince versions was a waste of time.

The second was how I dealt with poppler versions. I knew poppler 0.6.4 worked and poppler 0.12.0 didn't, so I downloaded poppler 0.10.0 (as the bug was reported recently, I leaned towards a more recent poppler version), then compiled evince against it, ran it, saw it worked, then began manually downloading other poppler versions, compiling them, compiling evince against it, testing it and so on. Eventually I saw that poppler 0.11.1 worked and poppler 0.11.2 did not.

However, I was doing a lot of unnecessary work. Poppler uses a git repository. I heard about git when it was announced in 2005, and I have checked out code from git repositories, and have browsed some git source trees over the web, but I have never looked much into it. Git has a cool feature called "bisect". Poppler has each release version tagged with the release name. So what I could have done was a git bisect, marking 0.6.4 as a good version and 0.12.0 as a bad version. Git would then check out a commit halfway between these two tags, and I would test it to see if it was good or bad. If it was bad, git would next bisect at the 25% mark between 0.6.4 and 0.12.0; if it was good, it would bisect at the 75% mark. You keep bisecting until you get to the bad commit.
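The halving that bisect does can be sketched as a binary search. This is a minimal illustration rather than how git implements it: it assumes a strictly linear history numbered from a known-good commit to a known-bad one, and firstBadCommit and isBad are hypothetical names, with isBad standing in for "check out, build and test this commit".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Finds the first bad commit in a linear history, given that commit
// `good` is known to work and commit `bad` is known to be broken.
// `isBad` is a hypothetical callback that tests a single commit.
// If `testsRun` is non-null, it is incremented once per tested commit.
std::size_t firstBadCommit(std::size_t good, std::size_t bad,
                           const std::function<bool(std::size_t)>& isBad,
                           int* testsRun = nullptr) {
  assert(good < bad);
  while (bad - good > 1) {
    const std::size_t mid = good + (bad - good) / 2;  // halfway point
    if (testsRun != nullptr) ++*testsRun;
    if (isBad(mid)) {
      bad = mid;   // the breakage is at mid or earlier
    } else {
      good = mid;  // the breakage is after mid
    }
  }
  return bad;  // first commit at which the test fails
}
```

With 1,000 commits between the two tags, this needs at most 10 build-and-test rounds, compared with the many full releases I compiled by hand.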

I am doing this now, and am down to my last test. I will mark it good or bad, after which we will know which commit caused this problem.

[...]

There, I'm done. Commit ad26e34bede53cb6300bc463cbdcc2b5adf101c2 broke it. It changes the CairoOutputDev.cc file. Before that commit, the text displays, after the commit, it does not. I updated the Ubuntu bug report and reported the commit upstream to poppler.

[/poppler] permanent link

Sat, 19 Dec 2009

poppler bug
I am looking at bug 436197 on the Ubuntu section of Launchpad. The bug is in the poppler library, and usually gets triggered through the evince application. I am able to duplicate it. The bug is a segmentation fault when evince tries to open certain PDF files, or tries to open certain pages in those PDF files. There are several duplicate bug reports, since this problem has been hitting a number of people. The bug has also been reported to poppler. Launchpad has several PDF files which will reproduce the problem.

The segmentation fault happens when the TextWord constructor is called. The reason the segmentation fault happens is because the curFont object has not been created. So without doing much investigation, I simply created the curFont object if it did not exist, and then called a related method. This seemed to solve the problem, the program stopped crashing and the problem pages were displayed seemingly normally (a cursory look shows the problem pages displaying normally, but it is possible some portion of the page is displayed improperly).

git diff TextOutputDev.cc
diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
index 442ace2..9686cc1 100644
--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -1988,6 +1988,11 @@ void TextPage::beginWord(GfxState *state, double x0, double y0) {
     rot = (m[2] > 0) ? 1 : 3;
   }
 
+  if (!curFont) {
+    curFont = new TextFontInfo(state);
+    fonts->append(curFont);
+  }
+
   curWord = new TextWord(state, rot, x0, y0, charPos, curFont, curFontSize);
 }

However, this is really just a hack. I don't have much of an understanding of how the poppler library works or how evince works. The Poppler people point out that this segmentation fault is not tripped by pdftotext, which also uses the poppler library. This is correct, it does not seem to be. Then again, evince calls the poppler_page_render() function in the poppler library, and pdftotext does not seem to do that. So it is questionable how much the pdftotext comparison really tells us.

Right now I am exploring the Gfx class, as backtrace (and following the program logic) shows that the Gfx class is utilized between the call to poppler_page_render() and the failed construction of the curWord object of the TextWord class. Setting the printCommands boolean to true shows debugging information so I am looking at that.

What usually happens with the above patch is that the beginWord method is called many times, with one instance where no curFont object exists (and thus a segmentation fault would happen). I do not know much about the evince code or these libraries, so I am looking into all of this, seeing if I can come up with anything better than the above hack. It is pretty clear this is a poppler problem though. Even if these PDFs are messed up, they don't crash PDF viewers that don't use the poppler library. The same goes for if evince is not doing something right with Cairo before handing it off to poppler. If this is happening 12 calls deep within poppler, it points to poppler being the problem.

[/poppler] permanent link

Thu, 22 Oct 2009

I have a Portable Document Format (PDF) file with a series of pages that evince, the default Gnome PDF reader on Ubuntu (and Gnewsense), crashes on when it opens them. Xpdf can read the pages just fine however.

This led me to start browsing Ubuntu's launchpad web site.

(I saw that file-roller, the default Ubuntu/Gnewsense Gnome archive file application was crashing with segmentation violations a lot. There was not much non-automatic information about this however, aside from some people saying the problem was not always reproducible, but only happened sometimes. Due to this, and due to it using a lot of heavy GTK/GDK stuff that I don't know, I moved on.)

I wanted to look at my evince crash a little more carefully, but I was still running Intrepid Ibex (Ubuntu 8.10) whereas most people were reporting the problem on Jaunty Jackalope (Ubuntu 9.04) or even beta versions of Karmic Koala (Ubuntu 9.10 - beta), the release version of which is supposed to be coming out in eleven days. Well, this indicates the problem has been around for a while, and is still around. So I upgraded to Jackalope. I was a little uneasy about whether to go to the Koala beta, but then I plunged in.

One thing I noticed, which was not around so much on Ubuntu's Hardy Heron (8.04), is apport, a window which pops up when an application crashes and offers to report the crash to Ubuntu automatically. This popped up for me when evince crashed, and I sent in the bug. Later, I marked it as a duplicate of a similar one. Launchpad makes a slight effort to let you see if it's a duplicate while reporting, but that question can be a little complex, and the process doesn't deal with that. So I reported it, and then marked it as a duplicate later.

The Poppler PDF rendering library was partially implicated in the crash, so I downloaded the dpkg for ePDFView, which also uses that library. ePDFView also crashes on these pages. So I reported that as a bug to Ubuntu via apport. Stacktrace shows pretty much the same thing happening, they're both crashing in the JPEG 6.2 library, the call from which can be traced back, via the same route, to a Page::displaySlice call in the poppler library. So it looks like the poppler library (or possibly even the jpeg library) is at fault.

[/poppler] permanent link