Android, Linux, FLOSS etc.

My code

Subscribe to a syndicated RSS feed of my blog.


Fri, 12 Oct 2012

Processing large data files

I guess noticing this shows how little I know about programming, but I have now seen the same problem come at me from two different directions, and I am now more aware of it.

The thing I am talking about is processing a large data file with a program. The large data file is in a certain data format, so initially, since it is theoretically the easy way, I load the entire file into memory, into a data structure that fits that format. Once that is done, I start processing that data structure. Usually what I'm doing is, in one way or another, translating the data from the form it is in into another type of data structure, and outputting the new data structure.

The problem is that you have this huge data structure in memory, and are trying to manipulate portions of it into another data structure, and it just takes up a lot of resources. The work gets done, but it is too slow. Sometimes all of this memory use starts the machine paging, and then it slows to a crawl.

My first encounter with this was when I wrote a Java program for my blunder suite of tools - pgn2fen. I would take a PGN (Portable Game Notation) file that was 7 megs or so, load it into memory, and then convert every move of every game in that PGN into a FEN (Forsyth–Edwards Notation) data structure, which represents a chess board position.

Initially, I would load the file as a linked list of Strings, and then send that entire list as a parameter to various classes. As the program began coming together, I made a big improvement in the code: I created a second, shorter linked list alongside the large one. I would slice off a piece of the large list, like a slice of salami or a banana, and send that slice around to the other classes. The large linked list dropped the data as soon as it was sliced off, and the smaller linked list holding the slice was discarded once it was processed. I would then go on to slice off the next part of the large linked list, and repeat. The code looks like this:

            for (int i = 0; i < gameSize; i++) {

This change made the program over ten times faster.
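The slicing idea can be sketched like this - a minimal sketch with made-up names (the real pgn2fen code is organized differently), where one "game" is simply the lines up to a blank separator:

```java
import java.util.LinkedList;

public class SliceDemo {
    // Pull one game's worth of lines off the front of the big list, hand
    // that small slice to the processing code, and let both the removed
    // lines and the finished slice be garbage collected.
    static int processAll(LinkedList<String> lines) {
        int gamesProcessed = 0;
        while (!lines.isEmpty()) {
            LinkedList<String> slice = new LinkedList<>();
            while (!lines.isEmpty()) {
                String line = lines.removeFirst(); // frees it from the big list
                if (line.isEmpty()) break;         // blank line ends the game
                slice.add(line);
            }
            if (!slice.isEmpty()) {
                processSlice(slice); // stand-in for the PGN-to-FEN conversion
                gamesProcessed++;
            }
            // slice goes out of scope here and can be collected
        }
        return gamesProcessed;
    }

    static void processSlice(LinkedList<String> slice) {
        // placeholder: the real code would convert these moves to FEN
    }
}
```

The point is that the large list shrinks as you go, so peak memory stays roughly one slice above the input, instead of input plus output plus everything in between.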

I recently faced a similar problem, this time with a Perl script translating an RDF file into an XML data structure. In this case, my machine would start swapping and take hours to process the file. It is maybe not surprising that it had a larger effect on the machine: the PGN files were usually less than 10 megs, and this data file is over 240 megs. Between my desktop GUI, the RDF data structure, the necessary operations and the new data structure, my 4 gigs got swamped and my machine started paging. After a few hours the process was done, but I wanted to look into whether there was a way to deal with this.

Again, if resources are infinite, it's always programmatically easier to just load the entire data structure, do the processing, and output the new data structure. But resources are not infinite - something I certainly learned over a decade of doing systems administration professionally.

In this case I switched from the CPAN XML::Simple module to the XML::Twig module. From the XML::Twig documentation:

[XML::Twig] allows minimal resource (CPU and memory) usage by building the tree only for the parts of the documents that need actual processing....One of the strengths of XML::Twig is that it let you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10). To do this you can define handlers, that will be called once a specific element has been completely parsed...Once the element is completely processed you can then flush it, which will output it and free the memory. You can also purge it if you don't need to output it.

Which is what I do: RDF elements I need, I grab with a handler, process, and then purge; RDF elements I do not need, the handler purges immediately.
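XML::Twig is Perl, but the same grab-process-discard idea can be sketched in Java with the JDK's built-in StAX pull parser (this is an analogue, not the XML::Twig API, and the feed/title element names are made up):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamDemo {
    // Pull parse events one at a time; keep only the elements we care
    // about and let everything else stream past, so no full document
    // tree is ever built in memory.
    static List<String> titlesOf(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && r.getLocalName().equals("title")) {
                    titles.add(r.getElementText()); // processed, then dropped
                }
            }
            r.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return titles;
    }
}
```

The reader holds only the current event, which is why this style can handle files far larger than memory.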

The processing now takes much, much less memory, and it finishes much faster as well. A lot of the remaining time is probably taken by the instant purging of RDF elements that will never be processed.

Anyhow, I now see I have run into the same problem twice, and it was solved more or less the same way both times: I processed the large, original data structure one element at a time, and when I was done processing an element I removed it from memory and went on to the next. Not the easiest way to do things programmatically, but a necessity with large data files and limited resources.

[/programming] permanent link

Getting Scheme to do REPL in Emacs

I took a college course last year, half of which was learning Lisp and functional programming. I don't feel I learned that much about either Lisp or functional programming in the course. I had taken a previous course with the same instructor, on graph theory, where I felt I did learn a lot, especially once subsequent courses had me learning tree data structures and the like.

Anyhow, I decided to take another crack at Lisp and functional programming. Some of the great and/or successful programmers have a fondness for Lisp and recommend it, even if you don't see it around much any more. As Paul Graham says about his usage of Lisp, "Everyone else was writing their software in C++ or Perl. But we also knew that that didn't mean anything. If you chose technology that way, you'd be running Windows."

Structure and Interpretation of Computer Programs is often touted as a must-read book. When I first browsed through it a few years ago it seemed confusing; I'm not sure why, because when I look at it now it mostly seems simple and clear. I'm still reading the first of its five chapters. The authors are very heavy on the "interpretation" part of their title, going into evaluation, eval and so on. It's not yet clear to me why they emphasize this so much, but perhaps I'll understand as I read through the book.

My college course used Common Lisp. I understand CL is the more real-world dialect, with more libraries, but also more cruft and less simplicity.

Scheme is simpler, more elegant, and easier to understand. Scheme defines functions with the symbol define. CL defines functions with the symbol defun. That alone tells you a lot about the dialects.

One thing I like about Scheme is that it has a small number of primitive expressions, with a few more derived/library expressions built on top of them. I like this simplicity. While these Scheme expressions deal with abstraction and things like that, it reminds me of how almost all number-theoretic functions on the natural numbers derive from three primitive functions - constant, successor and projection - by applying the operations of composition and primitive recursion to them. The functions that cannot be built from those three functions and two operations are rather offbeat, like the Ackermann function, which does little other than prove that not every computable function on the natural numbers can be derived from those rules.
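As a small illustration of building up from those primitives, addition can be derived from successor S alone by primitive recursion:

    add(x, 0)    = x
    add(x, S(y)) = S(add(x, y))

So add(2, 2) = S(add(2, 1)) = S(S(add(2, 0))) = S(S(2)) = 4, with composition doing nothing more than plugging functions into one another.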

I also like that Scheme clearly marks non-functional procedures and forms with exclamation points.

Especially since Lisp is not heavily used nowadays, it seems obvious to me that people should learn Scheme first, as it is the better dialect to learn in. If they want to do some real-world stuff and feel CL has better libraries or whatnot, they can then shift to CL.

So anyhow I've been going through SICP. The initial expressions could mostly be done with the Scheme interpreter guile. It does not have readline support enabled by default the way CL does, so I put into my .guile file:

(use-modules (ice-9 readline))
(activate-readline)

As the programs became more complex, I wanted a more sophisticated REPL. Emacs seems to be what Lisp programmers use. I am not well acquainted with Emacs, even though I first started using it twenty years ago! I usually use vi, or nano, or Gnome gedit, or Eclipse's editor, or the like. Anyhow, doing elisp in Emacs is easy enough, but using Scheme is a little more work. I spent some time looking at it today and got it put together. Oddly, there's not really one place on the web that tells you how to do this.

In my emacs init file I now have:

(setq scheme-program-name "guile")
(global-set-key (kbd "C-x C-e") 'scheme-send-last-sexp)

I also have:

(split-window-vertically)

Just so I don't have to type "Control-x 2" when I start Emacs. If I start using Emacs more for editing, perhaps I'll comment that line out.

So I click the bottom window, type "Escape-x" and then "run-scheme". Then I click the top window and start typing in expressions. I usually do "Control-x Control-e" after each one to evaluate it. It evaluates in the bottom window, which runs guile. I had scheme-program-name set to scm and was running that for a bit, but switched to guile. I don't know much about either, aside from that both seem to be copyrighted by the FSF, but the FSF seems to push guile more, and guile also has a nice (help) facility.

Anyhow, it is running well enough. I'd like to improve my Scheme REPL environment a little more, but it is working OK for now.

[/lisp/scheme] permanent link