Android, Linux, FLOSS etc.

My code

Subscribe to a syndicated RSS feed of my blog.


Fri, 12 Oct 2012

Processing large data files

I guess noticing this thing shows how little I know about programming, but I have now seen this come at me in two different directions and am now more aware of it.

The thing I am talking about is when I am processing a large data file with a program. The large data file is in a certain data format. So initially, since it is theoretically the easy way, I try to download the entire file into memory into a data structure which fits that data format. Once that is done, I start processing that data structure. Usually what I'm doing is in one way or another, translating the data from the form it is in, into another type of data structure, and outputing the new data structure.

The problem is you have this huge data structure in memory, and are trying to manipulate portions of it into another data structure, and it just takes up a lot of resources. The work gets done, but it is too slow. Sometimes all of this memory use starts memory paging, and then the machine slows to a crawl.

My first encounter with this is when I wrote a Java program for my blunder suite of tools - pgn2fen. I would take a PGN (Portable Game Notation) file that was 7 megs or so, load it into memory, and then convert every move of every game in that PGN into a FEN (Forsyth–Edwards Notation) data structure, which represents a chess board position.

Initially, I would load the file as a linked list of Strings, and then send that entire list as a parameter to various classes. As the program began coming together, I made a big improvement in the code. Now I created a second shorter linked list alongside the large linked list. I would then slice off a piece of the list, like a slice of salami or a banana, and send that slice around to the other classes. The large linked list was rid of the data as soon as it was sliced, and the smaller linked list with the slice itself was discarded once it was processed. I would then go on to slice off the next part of the large linked list, and repeat. The code looks like this:

            for (i=0; i < gameSize; i++) {

This change made the program over ten times faster.

I recently faced a similar problem. This time it was with a Perl script translating an RDF file into an XML data structure. In this case, my machine would start swapping and take hours to process the file. Maybe not surprising that it had a larger effect on the machine, as the PGN files were usually less than 10 megs, and this data file is over 240 megs. With my desktop GUI, as well as the RDF data structure, necessary operations and new data structure, my 4 gigs got swamped and my machine started paging. After a few hours the process was done, but I wanted to look into if there was a way to deal with this.

Again, if resources are infinite, it's always programatically easier to just load the entire data structure, do the processing, and output the new data structure. But resources are not infinite. Something I certainly learned doing over a decade of systems administration professionally.

In this case I switched from using the CPAN XML::Simple module, to using CPAN's XML::Twig module. From XML::Twig documentation:

[XML::Twig] allows minimal resource (CPU and memory) usage by building the tree only for the parts of the documents that need actual processing....One of the strengths of XML::Twig is that it let you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10). To do this you can define handlers, that will be called once a specific element has been completely parsed...Once the element is completely processed you can then flush it, which will output it and free the memory. You can also purge it if you don't need to output it.

Which is what I do. RDF elements I need I grab with a handler, process, and then purge. RDF elements I do not need I have the handler purge immediately.

The processing now take much, much less memory. It finishes much faster as well. A lot of the time is probably taken by the instant-purging of RDF elements that will never be processed.

Any how, I now see I have run into the same problem twice. It was solved more or less the same way both times - I processed the large, original data structure one element at a time, and when I was done processing that element I would remove it from memory and go on to the next element. Not the easiest way to do things programatically, but a necessity with large data files and limited resources.

[/programming] permanent link