/var/tmp
Tue, 27 Mar 2012
Some success...
The breakthrough happened in late January. I have written Android apps from scratch like Bouncer and Love Poems, and I ported an open source Java library to Android with Panacea Database. Looking at a full-fledged open source Android project, FBReaderJ, I noticed some modifications I could make to it to improve it, and that would be for an audience without much overlap with the existing FBReaderJ audience. FBReaderJ is GPL licensed, which worries some people, but myself less. Anyhow, I released my version of the app, "Free books to download & read" on January 24th. By the last day of January, 2425 installs a day were happening, by February 5th, 11000 installs a day were happening. Daily installs ranged from over 8000 to over 11000 a day until February 20th. The install rate is still over 2000 a day. As is normal, the active installs in percent has been going down over time, but it is still over 35%. It currently has over 119600 active device installs. There is currently one ad - right before someone goes to a book - it has been requested from 13000 to 23000 times a day over the course of the past two weeks. Having had success with modifying an open source project, I doubled down, and on February 12th I released a modified version of OI File Manager, another open source Android project. I chose it because it was open source, because I had thought of doing a file manager for a while, and because it had a wide appeal - it is not a niche product like Panacea Database or Bouncer, many people can find it useful. I wanted to release another app with wide appeal to ride the wave of Book Reader. And it did so, it has over 4239 active device installs, which for my five apps is second to only Book Reader. And has been achieved in six weeks, while I have been working on apps like Bouncer for ten months. I do have my eye on one more Android open source project, but I have turned back to doing an original project. It uses Andengine, but is actually an app, not a game. 
It is original as far as I know: nothing else on Android does it in the manner mine will, which is much better than the handful of existing apps related to it. I have to see how much work I am going to do on it before releasing it. It is more of a niche product than a general one, but it is not a small niche. There is much work still to be done on it, although I already have a decent prototype for one implementation of it.

Book Reader was making over $20 a day when the downloads were first flying. I also had an ad on the page seen when the app was opened for the first time, which I no longer have, although I may put it back. I rolled $100 of that Admob money into ads. While I was running my ads, Admob dropped their minimum ad bid to $0.01, so I dropped my bid to that. The money went mainly to buying ads in Brazil for the File Manager. Ads seem to boost downloads from the target market even when they're not running; I don't know all the variables which cause that, although I can guess some of them. I now have over 1000 active users from Brazil for File Manager that I probably would not have had otherwise. Were they worth ten cents a head? Well, the initial buys were overpriced before Admob's price drop. Also, it was something of a test. Also, I want to roll my profits back into the business and couldn't think of a better thing to spend them on.

Even with that $100 spent, I'm still getting over $350 from February Admob profits for Book Reader. Those kinds of dollars came from the initial pop; I'm now more at the $100 a month level, as I said before. If I had more ads in the Book Reader app, I could probably make more, but I want to avoid having ads over the actual book, as that is annoying.

In terms of running Admob ads - you can choose the devices to target, the SDK version, the country (and sometimes more specific location), whether to target mobile, wifi or both, gender and age group. 
Transferring $50 or more from the money Admob owes you into an ad budget gets you a small bonus of free ads. Each campaign is $10 a day minimum. The minimum bid nowadays is 1 cent a click. You can see conversion rates for app installation for app download ads.

The annoying part of Admob is the approval process. First you have to get approved to be able to transfer money from your balance to an ad campaign budget. Then campaigns have to be approved. After I was approved for balance to budget transfers, I transferred $50 and submitted a campaign. A week later it still sat unapproved, so I sent them an e-mail, and then it was approved. Contrast this to Millennial Media, who approved a campaign for me recently within hours. You'd think Admob would be more responsive to me wanting to give them my money.

So on that Millennial Media campaign - I noticed a few days ago that the paltry sum I made in February from Millennial Media had been put into my balance. The sum was paltry because I was not even signed up with Millennial in the beginning of February. Anyhow, I took the dollar or two and put it into a campaign in Norway for File Manager. It was approved within hours, which was the good part. One downside was the minimum 5 cent bid - 5 times Admob's minimum. Also, the targeting is not as precise for kinds of device and such. You can target by country though, which I did. I wonder if "Android" goes out to Kindles, Nooks and the like; I hope not, as it would be wasted money. Anyhow, my $1.20 daily budget was filled and I got 24 clicks. I'll probably do a bigger one next month for MM when my March money clears, maybe for different countries. Another nice thing about MM is I'm not stuck with $10 a day campaigns! But unlike Admob, MM keeps the money you earn for two months plus instead of one month plus, so I may as well roll the money back into ads.

I signed up for Inmobi as well, but you have to talk to them or something to get approved to transfer money from balance to budget. 
It's not worth it at this point. I also might do Adsense for mobile ads. I'll have to see.

I should get the $350+ by the middle of next month, so I have some ideas for the money. I might spend some money for a contractor to do some work on Book Reader - which I plan on using myself and sending back to FBReaderJ as well.

I had used Admob as my sole ad network prior to January. One reason I chose them is they were known to be reliable about sending checks - in fact, they already sent me one last year. Also, they have a low check sending threshold - $20 in a month, which I'm now easily making. They also send the money within one month plus. If I made money on ads on January 1st, or January 30th, that money would get sent to me on March 1st and would usually arrive in Paypal around March 15th. For Millennial Media and Inmobi, the amount of time is longer.

But anyhow, I wanted other ad networks. For the sake of redundancy, for one - if there was some problem with Admob, I'd still have two other sources of income. Also, perhaps I'd get some better deals, or extra functionality, which I have gotten. Also, I like the idea of keeping some competition open for the ad networks - it benefits developers to have a few competing ad networks out there. I read a report which said that, among top Android developers, the top four ad packages are usually Adwhirl, Admob, Inmobi and Millennial Media. That dovetailed with what I had heard already, so I went with Inmobi and Millennial Media.

Inmobi seems to do everything manually, and even over the phone. My app approval seemed to be in limbo until an e-mail back and forth. Then I had a phone conversation, where the rep said they wanted me to push up the number of requests I was getting, as they thought it was too low. This conversation happened a month ago. I said my Book Reader got a lot of hits, so I submitted that. 
It was pending, then they said they wanted more info on my address etc., so I put that in and it is still pending. Not that I mind much, I submitted the app at their urging, to some extent. As I said before, to be able to transfer earnings balance to an ad budget requires manual intervention as well. Well, Admob and Millennial Media are more responsive without hassle, so I'll deal with them more in terms of buying and selling ads for the time being. Inmobi is still the primary target for File Manager ads though, with MM and then Admob as fallback, and 80% of traffic is directed to Inmobi via Adwhirl right off the bat. Aside from responsiveness, I'd need to make $1.67 a day from Inmobi to get a monthly check from them, and right now that is more like 28 cents a day, so I haven't even hit that minimum yet with them (or Millennial, which is about $1.03 a day). I suppose eCPM, RPM, CTR, etc. are important in differentiating ad networks, but one overriding thing is fill rates. Admob and Adsense integration has been increasing as time goes on, other than it taking a day for clicks, CTR, eCPM and revenue to update (but not impressions or fill rate), the two are very integrated. And for normal apps, the fill rate for this is usually over 98%, if not 99%. As opposed to this, Inmobi has had a 21-54% fill rate for me over the past two weeks. Millennial, which is getting a fraction of the direct File Manager traffic Inmobi gets, but which does get its run off, has had a 77-86% fill rate for the past 9 days. The major slackoff from them is for countries like Brazil and Poland, they don't have the presence Google can afford there yet. But for the US, France, Germany, Japan etc., their fill rates have been on par with Admob's. 
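The fallback chain above (Inmobi first, then Millennial Media, then Admob) can be reasoned about with a little arithmetic. The following is a minimal sketch, not anything Adwhirl itself provides: the class and method names are made up for illustration, and it assumes each network fills or misses independently of the others, which real traffic may not obey. The fill rates in `main` are the low ends of the ranges quoted above.

```java
// Sketch only: WaterfallFill and its methods are hypothetical names,
// and independence between networks is an assumption.
class WaterfallFill {

    /** Probability that at least one network in the chain returns an ad. */
    static double combinedFillRate(double[] fillRates) {
        double allMiss = 1.0;
        for (double p : fillRates) {
            allMiss *= (1.0 - p);   // every network so far has missed
        }
        return 1.0 - allMiss;
    }

    /** Expected number of networks asked before an ad shows or the chain runs out. */
    static double expectedRequests(double[] fillRates) {
        double expected = 0.0;
        double reachProbability = 1.0;      // chance the request gets this far
        for (double p : fillRates) {
            expected += reachProbability;   // this network is asked
            reachProbability *= (1.0 - p);  // we continue only on a miss
        }
        return expected;
    }

    public static void main(String[] args) {
        // Inmobi 21%, Millennial 77%, Admob 98% (low ends of the quoted ranges).
        double[] chain = {0.21, 0.77, 0.98};
        System.out.printf("combined fill rate: %.4f%n", combinedFillRate(chain));
        System.out.printf("expected requests per ad slot: %.2f%n", expectedRequests(chain));
    }
}
```

The second number is the one that matters for latency: each extra request in the chain adds waiting time before anything is displayed.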
With Adwhirl, lower fill rates are not as big a deal, but it takes seconds for Adwhirl to miss an Inmobi ad, then a Millennial ad, and then maybe even an Admob ad before putting up an Admob "Adwhirl" ad, and by that time the Activity with the ad may have been clicked off.

Sun, 01 Jan 2012
Happy New Year
Thu, 17 Nov 2011
I looked at Admob today: I have finally pushed past $25 in payments from my Android applications. $25 was the one-time fee I paid to get on Android Market. So I've made $25.16 from my three mobile apps so far, and am now 16 cents in the black. Admob sends you money when you hit $20 for a month, so in December I should be getting a check for October and before. In addition to the Admob money, Samsung was also nice enough to give me a free $500 value 10.1 inch tablet to write tablet-sized apps on. And with my latest update of Bouncer out this morning, all three of my apps now handle "extra-large" displays, as Android calls them.

I was contemplating this morning that I'm now in the black, and felt good about it. My thinking about my business of putting out Android apps revolves around having no recurring capital costs, and if at all possible, no capital costs at all, particularly some web page that an app must contact that I'd have to pay $10 a month or so for. Right now I just code the app, push it to Android Market, and collect the ad money. Aside from the slow wear on my keyboard, mouse, screen etc., the only expense is my time.

I wrote a framework for a spreadsheet, and did a number of spreadsheet features for it. Then I worked on getting pre-2007 Excel files onto it, which I did. Then I worked on getting Excel 2007 and 2010 (.xlsx) files onto it - and got stuck. There are two possible paths to fixing this: an easier one if I can get things down to fewer than 65,536 methods, and a harder one if I can't. I took a shot at the easier path, and it just might not be possible, even though I got rid of a lot of methods. I may be able to pare down a few more. If not, I'll have to go the harder route. Anyhow, I put the code up on Github.

A month ago, I finished rewriting the layout of Panacea Database for all major (and minor) device sizes and screen densities. Then I added a feature to remember the last file opened. 
I did some testing and QA on the last file feature, but perhaps not enough, as there seem to have been some crashes since then which probably pertain to it. I am looking into them. People seem to want column sorting, which I can work on implementing. I might throw in some SQLite stuff, depending on how easy it would be.

So all of my apps have decent layouts for all major (and most minor) devices, which I am happy about. Now I am on to my new apps, as well as fixing bugs and implementing new features in Panacea Database.

Sun, 09 Oct 2011
I released another Android application - Love Poems. It took off initially - by the fourth day there were 442 downloads, with 280 of them active installs. But then that slope of adoption leveled off, it fell in the Market rankings etc. I'm not sure what hurt it - I did an update allowing users to increase or decrease the text size, and around the same time someone gave the app a two rating. It then sank in the Market rankings and downloads leveled off. A few days later I released an update with a few more poems, and also adjusted the text sizes a little. I will do updates in the future, in terms of both poems and display tweaking.

Android is continuing to gain market share. Here is the browser usage seen from various mobile operating systems, according to the web logs of the Internet's 7th most trafficked site, Wikipedia:
As the chart shows, the iPhone and iPad are doing well, as are Android smartphones. Windows Phone 7 is moribund - it is only 0.04% of traffic. There is more Android Honeycomb traffic on Wikipedia (0.05%) than Windows Phone traffic. I guess we'll see how they do with Windows 8 and Mango, which is supposed to launch in 2012, but they are way behind Apple and Google. The modern tablet market is newer than the smartphone market, so maybe they'll have a shot at competing there. I downloaded the Windows 8 preview and developer kit and had a look at it. Their Store is free for developers, although applications are approved first.

I'm currently developing a fourth app. I won't reveal all the details until it's released, but it uses Fragments and the ActionBar. Android's compatibility package provides backward compatibility for Fragments but not for ActionBar, so I am using Jake Wharton's ActionBar Sherlock for backward compatibility in ActionBar usage. I have that all implemented already, actually. I haven't done all the happy stuff you can do with tablets and Fragments yet. We'll see about that; it's not an essential element of the project, but with all the usage of ActionBar and Fragments, redesigning the app to do that will be easier. This new app may use SQLite as well, so I may be looking into SQLite.

I was invited to the Android Developer Lab in New York on August 24th. It was good - I met some interesting people, and they pointed us in the direction of where Android is going, which helps me point my development in that direction.

I've been doing a bit of work on Panacea Database's layout. I moved a lot of stuff into XML. I'm using scale-independent pixels and density-independent pixels as much as possible, as well as adjusting the size of buttons by layout weight and that sort of thing. 
One thing I've been doing - I change how many rows I display when fetching rows from the database, and the scale-independent pixel text size of the display, depending on what screen size I have, what orientation I am in, and to some extent, how many dpi the display has. The way I've been doing this is putting a "gone" TextView in the XML and, from my code, reading the number of rows to display from that. I'm not sure if it's best practice, but it works - if I find a better way I'll do that.

Sat, 09 Jul 2011
According to Alexa.com, Wikipedia is currently the 7th most trafficked web site. They are also one of the few large web sites to allow everyone glimpses of their web log analysis. I mentioned this in a previous blog post. In December 2010, Android devices made up 0.78% of Wikipedia's web traffic. At the end of May 2011 (June numbers are not done yet) that was up to 1.16%. So Android traffic on Wikipedia increased about 48% in six months. Actually, the six month increase of about 48% from December to May was more-or-less matched by the one month increase from November 2010 to December 2010, which was a 47% increase in traffic. I guess a lot of people got Androids in their Christmas stocking, or next to their Hanukkah dreidels...

So anyhow, I released my second Android application, Panacea Database, on June 11th. I definitely followed the Release Early, Release Often philosophy for this one - I got the idea for it on June 7th, and by June 11th it was published. It helps that another party had written a nice Java library, and that someone else had posted a bug report against it seven months before which, once fixed, took care of all the Android bugs. Thanks Miha Pirnat, wherever you are! What the app does is iterate over table rows and run searches on Microsoft Access style files on Android: Access 2000 to 2007, with a lot of Access 2010 working too. I actually just sent a patch in to the library people to fix a bug. 
Or implement a kludge to get around the bug anyhow - until I'm interested in dealing with Attachment data types, they'll have to write a fix. So both my apps have passed through the 500 download point. Bouncer has a 41% active/total install ratio, Panacea Database has a 57% install ratio. Why is that? Well to quote a critic on the Android Market, Silas, "Move to SD card!!" The app has a lot of PNG's and JPG's and is 3.8MB. Maybe I will move some of that to the SD card, who knows? It's an issue I have to figure out how to deal with. My Admob revenue for the last week is 79 cents, $1.52 for the week before that, and $1.28 for the week before that. My first goal is $100 a month in revenues. Whether that be by ads, sales or whatever, it does not matter. Initially I thought of just tossing out apps left and right and seeing what stuck. But you put an app out and you have to maintain it. And I'm just one person. For now anyhow. I don't want lots of one-star ratings for my apps on Android Market. The lowest I've gotten were two three-star ratings for Panacea Database. One wanted me to fix the bug where next-lines in a text data type would make a button disappear. I've partially patched that already, and have a full patch for that (hopefully) that I will release, oops, I mean publish, soon. Mon, 20 Jun 2011
A Guide for the Android Developer Guide
Tue, 31 May 2011
Bouncer, my first Android application
So, I have published my first Android app (the concept for which someone else described to me). What have I learned about Android development and such since then?

My first (unpublished) Android app was heavy on ListView. It was a tree of ListViews really - the top ListView went into sub-trees of ListViews, until a leaf node at the bottom was reached, which might be something else. I filled out the onCreate method, and an onListItemClick method.

The first screen of my new app was initially going to be a GridView. I then gave up on that. I then created two Activities which could go back and forth to one another via clicks (listened to with OnClickListener) using Intents. Then I had them pass information to one another in the Bundles. So now I can pass messages to my sub-trees via the Bundles, and they can be separate Activities.

Having dropped the GridView, I tried out the TableLayout, which I eventually went with. So now I had my grid-like table of letters on the first screen, able to pass which letter was pressed via a Bundle in the Intent to another Activity. I used Buttons for these letters.

I then wanted there to be a tab on the front screen, with the table of buttons in the primary tab, but with people able to tab over to the "About" tab. So I made the first Activity a TabActivity, and opened the Activity with the table with an Intent.

I then wanted to change the color of the buttons, but found out it was not all that simple, and learned about 9-patch drawables and the like. So I created my own buttons, which needed their corner rounding to be specified and the like.

Google suggests you put an End User License Agreement in the application. There is a standard class to do this, so I put it in the application.

Ultimately, I want my app to cover all 50 of the US states, as well as the District of Columbia (Washington D.C.). Currently, it covers 46 of the 50. 
I had the current ID for 46 of the states, and at this point in development I started putting up older licenses that may still be valid.

Most of this time I was designing for a high density, normal size screen in a vertical position. About 17% of Android users have medium density screens, however. Also, some people flip from vertical to horizontal mode; I even encourage this flipping in the application when the full image is about to come on the screen. So I did some work on making it at least function with medium density setups, and for high density setups when viewed horizontally. I get the display metrics, and then call different layouts depending on what the metrics are.

When to release is always an open question. "Release early, release often", agile development and so forth is the popular credo, and I agree with it for most applications. On the other hand, you can't release too early, especially since Android Market has a rating system and so forth. But at this point, I felt I had enough. It didn't look like I would get anything from the last four holdout states in the next few days, so I decided that 46 was enough to be useful, and that the layout looked decent for most phones and was at least usable for almost all phones. So I released.

One thing I did not do when releasing was release the initial version with ads. Why? Because Admob wants to know where the app is on Android Market to give you an ad code, and I had nothing up there yet. I later realized I had misunderstood due to my unfamiliarity with all of this; I could have put an ad in the initial version. Within a few hours of publishing version 1.0.0, I released 1.0.1, which contained Admob ads.

It's been 28 hours since I released the initial version, and 15 hours since I released the version with ads. Thus far I have had 78 downloads of the app from Android Market, and have had 55 ad impressions served.

In subsequent versions I plan to improve the application. 
I will work to get the four missing states, and the District of Columbia. I will put in more information about identification. I might put a bubble up announcing updates, but I wouldn't want it to be too annoying. I also have some kludgey stuff in the layout files which hopefully I can clean up; as I learn the Android API better, these things can be smoother.

Fri, 22 Apr 2011
I have been looking over Android's API and have been writing an Android application with Eclipse. Android use has started to take off in the past months. I have looked at various metrics; one I like is from the Internet's 8th most trafficked site, Wikipedia. It shows the growth of Android use over the past six months:
The graph's y-axis is the percentage of all browsers coming in - mobile, desktop and whatnot. The x-axis is the time period of usage - the past six months. The OS versions are listed in the key, although "Mobile other" is a catch-all. In October 2010, 0.47% of all hits to Wikipedia came from Android phones. In March 2011, 0.98% of all hits to Wikipedia came from Android phones. So that has more than doubled within the past six months.

Wed, 23 Mar 2011
I have been corralled into doing some programming in Python. So at one point I decide to write a do-while loop and learn - Python has no do-while loops. Terrific.

My patch made it into Evince 2.91.92, so I'm officially a GNOME contributor, yay. I patched a bug while chasing down another bug. Carlos couldn't reproduce it - I wonder when it manifests itself. The bug crept in in December, and not many people are running versions of Evince released since then, so the pool of people to try to reproduce it is limited. Carlos fixed up my patch so that it wouldn't cause problems going in. I still have to fix that original bug. Actually, I already did, but the fix is trivial, and I want to look over my code again to make sure it's decent.

I also patched the Evince package for the upcoming Ubuntu 11.04. It was a suggested backport of a commit. Again, my patch had to be massaged in. I changed the Ubuntu documentation for patches so as to point to the complete method of doing a patch.

I know people make use of git branches, but I never really used them until recently. They are very handy, especially if you're doing a lot of work on something. I will surely be using them more in the future.

Thu, 20 Jan 2011
Blunder's PGN-to-FEN converter nearing completion
The minor re-design, or major refactoring, of Blunder's PGN-to-FEN converter was finished three days after my last blog post about it. It went very well, the new code which replaced the old code is more abstract and flexible, looks better and works better. Funny how these things go together - it seems good coding practices solve a lot of the headaches of coding and things begin working automagically. I mentioned problems in Lutz Tautenhahn's PGN-to-FEN converter in my last blog post. After writing it, I decided to e-mail him a bug report. Within 14 hours he fixed the bug and posted new code, which I tested, and both problems were fixed. So Lutz's converter is now working without problem, as far as I can see. I've fixed many things in the PGN-to-FEN converter since the redesign/refactor. I check in every (?) manner if a move would put the king in check. I now handle many (all?) en passant scenarios. I also now deal with PGNs where a FEN position in the middle of a game is given, and where the subsequent moves are from that position (i.e. we start in the middle of the game). I made other changes as well. I just made my most satisfying commit since the redesign/refactoring. It was the fruit of other commits before. First I began marking games on the linked list as I went, not all in the beginning (which caused an initial delay when parsing large PGNs with many games). Then I pushed code into the Game class that I had wanted to push there for a while. All of this allowed me to do the latest commit. I was reading the entire PGN into a linked list in the PGN class, and then pushing the entire linked list into other classes like Game. As Game only needs one game, I created a second, short linked list with only one game, and pushed that to Game. As the original data on the first, long list is no longer needed, I removed it. I am always dealing with the head of the linked list in these cases. 
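The idea of always working at the head of the big list, and handing each game off as a short list of its own, can be sketched roughly as below. This is an illustration rather than Blunder's actual code: `HeadConsumingParser` and `takeNextGame` are hypothetical names, and the rule used to detect a game boundary (a tag line starting with `[` once movetext has been seen) is a simplification that ignores multi-line comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Sketch only: names and the game-boundary rule are assumptions.
class HeadConsumingParser {

    /** Removes one game's lines from the head of the list and returns them. */
    static List<String> takeNextGame(LinkedList<String> lines) {
        List<String> game = new ArrayList<>();
        boolean seenMoves = false;
        while (!lines.isEmpty()) {
            String line = lines.peekFirst();
            if (seenMoves && line.startsWith("[")) {
                break;                  // head is now the next game's first tag
            }
            lines.removeFirst();        // constant time; the big list shrinks as we go
            if (line.trim().isEmpty()) {
                continue;               // blank separator lines are dropped
            }
            if (!line.startsWith("[")) {
                seenMoves = true;       // we are past the tag section
            }
            game.add(line);
        }
        return game;
    }

    public static void main(String[] args) {
        LinkedList<String> pgn = new LinkedList<>(Arrays.asList(
                "[Event \"A\"]", "", "1. e4 e5 1-0", "",
                "[Event \"B\"]", "", "1. d4 d5 0-1"));
        while (!pgn.isEmpty()) {
            List<String> game = takeNextGame(pgn);
            System.out.println(game.size() + " lines in game, " + pgn.size() + " lines left");
        }
    }
}
```

Because each game is removed from the front as soon as it is handed off, nothing ever walks past data that has already been processed, and the memory held by finished games can be reclaimed.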
Anyhow, this process of dealing only with the head of the larger linked list, and shrinking it as the program goes, has made the program over ten times faster.

So what more is there to do? People have their own bizarre implementations of the PGN format. I handle many of them, but there are a few more out there I might do. All of the code is working, but I might clean up some of it so that it is easier to read. I also might work on a user interface other than running the jar file from the command line. I might also discover some edge cases of the en passant sort that I am not dealing with. I have tested tens of thousands of games, and have been looking over FIDE chess rules, the specifications and so forth, so I don't think there will be many more of this type.

The program is in decent enough shape right now. I guess I just want to deal with a few more of the oddball PGN implementations, and fix up the UI a little bit, before I feel this is fully formed in its first generation. But it is working pretty well as it is.

Tue, 04 Jan 2011
Since I have a long way to go before becoming a good programmer, I sometimes refer to Code Complete, The Mythical Man-Month and the like to keep me on the right track. I think I have reached that point, of throwing away the first one built, with the Blunder PGN to FEN chess translation component I have been programming for the past month.

To be honest with myself, I foresaw these design problems back when I originally did the design. I knew I would have to deal with many of the things I am dealing with now (although not totally - checking that a piece is pinned to the king is more important than I thought it would be, if I thought of it at all). The thing is, designing the program with all of that in mind would be "boring". It would be too abstract initially; it wouldn't DO anything until quite a lot of the program was coded. 
The way I programmed this, it worked right off the bat - at least with the first PGN I used as a basis. It translated the first ply of the first move correctly, and then the next ply of the first move, then the first ply of the second move and so on. After that all worked, I tried another PGN. As I sought to get it working for my various PGNs, I added more and more functionality to the program.

The method functionality I need now seems rather abstract, or at least more abstract than the functionality I have now. "Check to see if piece (rook or queen) is pinned to king horizontally", "Check to see if piece (bishop or queen) is pinned to king diagonally", and so on. Things are a little more abstract than I'd like, but if I try to keep things very specific, I will have much, much more coding to do. The program currently does over 95% of PGNs correctly, but there are too many possible corner cases to deal with. The functionality that deals with plies (half-moves), which is most of the program, has to be rewritten.

The main thing I focused on with the initial design was the data structures. I did change things around a bit, especially the Board class, which is my half-way class in the translation from PGN to FEN. I also realized while programming that I needed a Move class. When functionality got to where over nine out of ten PGNs parsed, I wanted to handle PGN files that had multiple games within them - and thus a Game class was created as well.

One nice thing is, aside from the edge cases I have to redesign for, my PGN to FEN converter has some aspects that are superior to the two other converters I've found out there - Lutz Tautenhahn's PGN-to-FEN converter and 7th Sun Green Light Chess's pgn2fen.exe program for Windows (or Linux, with WINE). Tautenhahn's program I tested out more, and I saw two problems. One, castling ability which is disabled due to a rook move is re-enabled if the rook moves back to the square. I'm fairly sure this is not legal with FIDE rules. 
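The castling point can be made concrete with a small sketch. Under the FIDE laws the right to castle is lost once the king or the relevant rook has moved, so a converter can treat the four rights as flags that are only ever cleared, never set again. The class below is a hypothetical illustration, not Blunder's code; `CastlingRights`, `recordMove` and `toFenField` are names invented here.

```java
// Sketch only: a standalone tracker for the FEN castling-availability field.
class CastlingRights {
    private boolean whiteKingside = true;
    private boolean whiteQueenside = true;
    private boolean blackKingside = true;
    private boolean blackQueenside = true;

    /** Records a move given in coordinate form, e.g. from "h1" to "h3". */
    void recordMove(String from, String to) {
        touch(from);   // a king or rook leaving its home square loses the right
        touch(to);     // a rook captured on its home square loses it too
    }

    // Rights are only ever cleared here, so a rook returning home changes nothing.
    private void touch(String square) {
        switch (square) {
            case "e1": whiteKingside = false; whiteQueenside = false; break;
            case "h1": whiteKingside = false; break;
            case "a1": whiteQueenside = false; break;
            case "e8": blackKingside = false; blackQueenside = false; break;
            case "h8": blackKingside = false; break;
            case "a8": blackQueenside = false; break;
            default: break;
        }
    }

    /** The castling field as it appears in a FEN string, or "-" if no rights remain. */
    String toFenField() {
        StringBuilder field = new StringBuilder();
        if (whiteKingside) field.append('K');
        if (whiteQueenside) field.append('Q');
        if (blackKingside) field.append('k');
        if (blackQueenside) field.append('q');
        return field.length() == 0 ? "-" : field.toString();
    }

    public static void main(String[] args) {
        CastlingRights rights = new CastlingRights();
        rights.recordMove("h1", "h3");   // white's king rook leaves home
        rights.recordMove("h3", "h1");   // and comes back
        System.out.println(rights.toFenField());
    }
}
```

Clearing on both the origin and the destination square is a common shortcut: nothing can land on a king's home square while the king is still there, so the extra clearing is harmless.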
Secondly, if a pawn move results in pawn promotion, Tautenhahn's converter does not reset the half-move clock due to the pawn move, but in fact increments it. I believe this is not the case with FIDE rules, but am less sure. As for the Green Light Chess converter, I have not looked at it as much as Tautenhahn's, but I do know it does not mark en passant squares in the FEN. Blunder's converter marks en passant squares, disables castling availability properly, and resets the halfmove clock on all pawn moves - even pawn promotions, which I believe is the correct behavior.

Now I just have to redesign and abstract the methods that deal with converting a ply to a new configuration for my Board object, which is most of the methodology for the program. I might tinker a little more with the data structures, perhaps making them a bit more robust.

Mon, 06 Dec 2010
Blunder, Chess, Java, Architecture and Construction
Blunder is a suite of chess-related tools. Primarily, it helps you go over your games, and see where you made mistakes or missed opportunities. You keep looking at the boards where you made your biggest and/or most recent mistakes, and keep testing that you now know what to do correctly. Most chess teachers say this is one of the main ways to improve your game, and with Blunder it is automated.

Anyhow, the program has been out for almost a year, particularly the main LAMP (Linux, Apache, MySQL, PHP) component. However, one necessary component has been converting files in PGN format (records of games) to FEN format (pictures of individual boards at a set point). I give pointers on how to do this, but have not been happy with any of the existing tools, and have begun writing my own in Java, with GPL version 3 licensing. This was the impetus to put it on Sourceforge, actually.

As I said, Blunder is functional already, particularly the LAMP package for going over games. One necessary component for that to work is PGN to FEN conversion, for which there are tools out there. I am unhappy with them, so I am writing my own in Java. If any Java developers want to send git patches, I'd be happy to get them. This second package within the Blunder project is in pre-alpha right now.

While this has all been done pretty loosely, I decided to try for somewhat stricter good practices in the pre-construction part of the project. I have to report - it worked out very well! I began by cheating on the good practices a little - I coded a method that read the file into an array. It was just a detail I didn't want to bother with once requirements and architecture were done, as I'd want to get right into construction beyond that first step.
My requirements were:
I then did architecture. I sketched out the major classes, their responsibilities and their interactions. Initially there were three classes - Pgn, Board and Fen. I thought about it and realized Pgn should have a helper class, Move, and Pgn would have an array of Move objects. Board is primarily an array of characters representing the board, and Fen is an output String representing the FEN. I think it was helpful thinking about all of this beforehand more than I usually would have. It saved time in the long run. Every minute I spent doing this right off the bat probably saved a multiple of itself so far. One mistake I made is instead of making the Board array something that would be intuitive to me, I tried to fit its data structure to the other existing data structures. I thought this would make "less work". The problem is, Board's data structure then became inscrutable to me, and I had to bend my mind to figure out what it was, and kept making mistakes. I then decided to rearrange Board's data structure to something I could intuitively understand, and then use methods to do the conversion between it and the other two major data structures. This has worked out much better for me. Most of the work left to get the program from pre-alpha to alpha is doing the logic (methods) for the various chess moves. I already have methods for PawnMoveNoCapture and KnightMoveNoCapture. My next method will probably be PawnMoveWithCapture - a move where the pawn captures a piece. The program needs methods for all the various moves - Queen move (capture and no capture), Bishop move (capture and no capture), Castling (Queenside or Kingside) and so on. This will be the bulk of the work to get the program into alpha. I am plowing ahead with those methods right now. 
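The Board idea described above, an array of characters laid out the way a player would read it, with small helper methods translating PGN letters into array indices, can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name BoardSketch, the helpers fileIndex, rankIndex and pieceAt, and the [rank][file] layout are hypothetical, and pawnMoveNoCapture borrows its name from the post while its body is a guess rather than Blunder's actual code.

```java
import java.util.Arrays;

/** Hypothetical sketch of a Board class; not Blunder's actual code. */
final class BoardSketch {
    // squares[rank][file]: index 0 is rank 1 (White's back rank) and the a-file.
    private final char[][] squares = new char[8][8];

    BoardSketch() {
        for (char[] rank : squares) Arrays.fill(rank, '.');
        String backRank = "RNBQKBNR";
        for (int f = 0; f < 8; f++) {
            squares[0][f] = backRank.charAt(f);                        // White pieces
            squares[1][f] = 'P';                                       // White pawns
            squares[6][f] = 'p';                                       // Black pawns
            squares[7][f] = Character.toLowerCase(backRank.charAt(f)); // Black pieces
        }
    }

    /** Converts a PGN file letter 'a'..'h' to an array index 0..7. */
    static int fileIndex(char file) {
        int i = file - 'a';
        if (i < 0 || i > 7) throw new IllegalArgumentException("bad file: " + file);
        return i;
    }

    /** Converts a PGN rank digit '1'..'8' to an array index 0..7. */
    static int rankIndex(char rank) {
        int i = rank - '1';
        if (i < 0 || i > 7) throw new IllegalArgumentException("bad rank: " + rank);
        return i;
    }

    char pieceAt(String square) {
        return squares[rankIndex(square.charAt(1))][fileIndex(square.charAt(0))];
    }

    /** Applies a non-capturing pawn move given only its target square, e.g. "e4". */
    void pawnMoveNoCapture(String target, boolean whiteToMove) {
        int file = fileIndex(target.charAt(0));
        int rank = rankIndex(target.charAt(1));
        char pawn = whiteToMove ? 'P' : 'p';
        int back = whiteToMove ? -1 : 1;      // step from the target toward the mover's side
        int startRank = whiteToMove ? 1 : 6;  // rank index where this side's pawns begin
        if (squares[rank][file] != '.') {
            throw new IllegalArgumentException("target occupied: " + target);
        }
        int from = rank + back;
        if (from < 0 || from > 7) {
            throw new IllegalArgumentException("no pawn can reach " + target);
        }
        if (squares[from][file] != pawn) {
            // Otherwise it must be a two-square advance from the starting rank.
            boolean twoStep = squares[from][file] == '.'
                    && from + back == startRank
                    && squares[startRank][file] == pawn;
            if (!twoStep) throw new IllegalArgumentException("no pawn can reach " + target);
            from = startRank;
        }
        squares[from][file] = '.';
        squares[rank][file] = pawn;
    }

    /** Prints the board with rank 8 at the top, as a player would view it. */
    void print() {
        for (int r = 7; r >= 0; r--) System.out.println(new String(squares[r]));
    }

    public static void main(String[] args) {
        BoardSketch board = new BoardSketch();
        board.pawnMoveNoCapture("e4", true);
        board.pawnMoveNoCapture("e5", false);
        board.print();
    }
}
```

Keeping the letter-to-index conversion in two small static helpers means each move method can call them instead of repeating the arithmetic.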
There is some code duplication within existing methods, but my concern is less with that than with code duplication between methods - I already created a method to convert the letters from the Pgn moves to the numbers the array in Board uses, which both existing chess move methods use. I would like to complete moves for all the pieces, both capturing and non-capturing. Anyone who wants to send in git patches for these Java methods should feel free; the two existing methods can serve as a base. You can grab the code from the project's git page on Sourceforge.
[/projects/blunder] permanent link
Mon, 22 Nov 2010
Linux desktop/smartphone penetration
This Wikipedia article tells you the share of web browsers from different sources, but by clicking through the links you can also see what penetration the OS's running those web browsers have. These web sites give an accounting, from their logs, of what OS's the people they're serving pages to are running.
W3counter has 1.49% running Linux and 0.25% running Android in October 2010.
Clicky gives a daily tally, which is 1.25% for Linux today, and has been hovering around 1.25% for the past weeks.
Statcounter has 0.78% running Linux since September. I am not sure what they're counting as Linux or why their Linux count is so much lower than the others.
Most interesting is Wikimedia, which really breaks down the statistics. They sample 1/1000 of their logs, so every hit they show can be assumed to stand for about 1000. They count Linux, in which they include Android, as 2.04%. The breakdown is 0.75% Ubuntu, 0.47% Android, 0.07% SUSE, 0.06% Fedora, 0.05% Debian, and by the time it gets to Gentoo it is down to 0.02%. Red Hat, CentOS and "Linux Motor" (whatever Wikimedia means by that) make up the rest. There's even a breakdown of the different Ubuntu, Fedora and Android versions. Cool. It gives you a general idea of what the penetration rate is anyway.
Wed, 17 Nov 2010
Epdfview patch
Wed, 03 Nov 2010
Ubuntu and user-focus
I am installing Maverick Meerkat 10.10 into a KVM virtual machine from the ISO. It gives the option to allow network updating while it copies files from CD to disk - smart, it saves the user some time later, very thoughtful. It also pops up a slideshow (which is browsable) showing features of Ubuntu while it is copying from the CD to the disk - nice, if users are in a situation where they don't have much to do while waiting for the install to finish, show them the system features and educate them about it.
There is a division of labor in all enterprises. I run the servers, sometimes I write the code, I investigate problems. I don't normally think about the user's desktop Linux experience much, except in an abstract way, such as that PDF backend library support could be better so that people could render their PDFs better. It's good there are people out there who do.
[/linux/ubuntu] permanent link
Tue, 02 Nov 2010
Building GNOME with jhbuild, a.k.a. pain
Trivia: Who said less than a month ago, "Getting a jhbuild to finish is next to impossible"? Answer: Benjamin Otte. The #1 committer to gtk+ in the last 1500 commits or so. The #1 committer to cairo in the last 500 commits or so.
The problem isn't jhbuild so much, although moduleset options could probably be cleaned up a little bit more. It is with broken stuff in GNOME, or which GNOME depends on. Luckily for me, my build tree is not that large. Poppler is no longer alone in my .jhbuildrc in ignoring gobject introspection stuff during a build - welcome pango! Then gtk+ won't build. It was failing on a dependency on fontconfig, which was broken by a commit on October 6th. Or at least fontconfig's pkg-config metadata file was broken for a number of people who use standard build options (like me), causing gtk+ not to build. I've emailed the person who made the commit. I won't go into stuff higher up on the chain that depends on gtk+. Suffice it to say there's brokenness.
Sat, 30 Oct 2010
So, I'm happy to have my jhbuild building poppler and its dependencies off their latest commit, and then epdfview and evince on top of that. Of course, if anything is broken anywhere down the chain, things fall apart. I've turned off a lot of the gobject introspection for now...
Epdfview is more lightweight than evince, with a few fewer dependencies, so I often use it when testing. Epdfview currently compiles against the latest of its dependencies (thankfully no big breakage in gtk+ or glib, as sometimes happens) and can load some PDFs. But it crashes on a number of PDFs. Gdb says:
Program received signal SIGSEGV, Segmentation fault.
[...]
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:31
31 ../sysdeps/x86_64/multiarch/../strlen.S: No such
file or directory.
in ../sysdeps/x86_64/multiarch/../strlen.S
(gdb) bt
#0 __strlen_sse2 () at
../sysdeps/x86_64/multiarch/../strlen.S:31
#1 0x00007ffff772f502 in g_strdup (str=0x1 <Address 0x1 out
of bounds>)
at gstrfuncs.c:101
#2 0x000000000040ad46 in
ePDFView::IDocument::setLinearized(char*) ()
#3 0x0000000000411680 in
ePDFView::PDFDocument::loadMetadata() ()
Hmmm. It took me a little time to figure out why this was breaking every now and then. I am compiling against the latest glib commit - is someone messing with g_strdup or something? Eventually, I tracked it down to a commit in poppler from September 17th. From the message I guess they knew it would break the API - "PopplerDocument:linearized is now a boolean value rather than string, so this commit breaks the API again." That would explain the str=0x1 in the backtrace: presumably a boolean TRUE was being handed to g_strdup as if it were a string pointer. So that's simple enough. I changed the gchar's to gboolean's, made some other little changes, and sent a patch in to jordi at emmasoft, so maybe it will get applied. My version is working anyhow...
Tue, 26 Oct 2010
jhbuild, evince/poppler etc.
The default jhbuild moduleset is gnome-3.0, but that builds some of the stuff I'm focused on from tarballs, which is supposed to be deprecated in jhbuild now anyhow. So I removed gnome-3.0 from my .jhbuildrc and put all of the devel modulesets into it. But some of the dependencies were missing - they were in the non-devel modulesets. So I put all of those into my own moduleset. As my moduleset is local, I set use_local_modulesets to True - even if that's not necessary, I git pull from jhbuild before I run a jhbuild, so why not do that? I also put
module_autogenargs['evince'] = autogenargs \
+ ' --disable-nautilus '
into my .jhbuildrc to avoid those headaches with evince. I skip a
number of modules people recommend putting in skip, like mozilla,
although I don't believe they're dependencies in my chain. I also add a
few pkgconfig paths to .jhbuildrc, on advice from the net. Of
course, I also install the packages on Ubuntu that the jhbuild web site
recommends for Ubuntu 10.10.
Incidentally, here are the jhbuild dependencies for evince (and epdfview):
Color code for nodes: green are packages in jhbuild "devel" modulesets, red are packages in jhbuild "non-devel" modulesets, brown (libgcrypt and libgpg-error) are also in jhbuild "non-devel" modulesets and they are tarballs there, purple are packages in jhbuild "devel" modulesets which other packages might have a hidden dependency to which is not shown in the current jhbuild modulesets. Finally blue are non-GNOME packages that no GNOME module is dependent on, but which are themselves dependent on some GNOME modules. I made the above dependency tree with graphviz, a tool which makes doing such dependency charts really easy. Everything went pretty swimmingly until I started to reach the top of the chain. Poppler busted on some GObject introspection stuff - I installed gobject-introspection as a jhbuild module and updated the poppler gir include from Gdk-2.0 to Gdk-3.0 and it went sailing along. Next up - gtk+ 3.0 broke. This happened to me a few days before, when I was taking my first stab at jhbuild. At that time, I looked at the recent gtk+ code, saw the stuff breaking had changed recently, and did a hard git reset of gtk+ to a commit from 48 hours before - it installed fine. This time the commit done was the last one. I went on GNOME's IRC network and tracked down the developer who made the bad commit, he fixed it and I was sailing along again. So now I get to evince. A few days ago I had some problems with deprecated combo box calls that had been removed from the dependency libraries, but there were patches for that in bugzilla. After patching that, this time I get an error that a set_scroll_adjustments call is failing. I look in gtk+ and see that they have been mucking with scrolling there recently, and figure it is due to that. I disable the call and compile. Evince comes up and I can look around, but it hangs on loading anything. I check poppler's test programs and they are working. 
So I encapsulate a lightweight PDF viewer that depends on poppler and gtk+, epdfview, into my personal jhbuild moduleset and build it against these libraries. Epdfview comes up, and displays PDFs etc. fine. Ultimately, epdfview and evince are dependent upon almost entirely the same libraries, except evince depends on three more icon-related ones. And epdfview is working. So either evince is broken, or some library it depends on has changed, meaning... evince is broken, for the moment. But Epdfview works. Fri, 22 Oct 2010
poppler rendering
The bus map PDF for my area takes 16 seconds to load in evince on my computer. It is on Ubuntu, on a 64-bit desktop system with a dual-core AMD Athlon(tm) II X2 240 2.8GHz processor and four gigs of RAM. The bus map PDF is 751K. This seems far too long. Epdfview, which also uses poppler, takes 8 seconds to render the PDF. Adobe Reader 9.3.4, on the other hand, takes less than 4 seconds, and I'll use that as a benchmark here of what the render time should be.
So I looked into it. Instead of grabbing all of the necessary source and compiling with gcc's -pg flag for gprof, I compiled an uncompressed kernel and ran oprofile while rendering the PDF. The oprofile report didn't show everything fully, so I downloaded the necessary debug symbol packages for poppler, cairo etc. I rendered the PDF with evince, with epdfview, and with poppler's poppler-glib-demo in a local poppler library which I compiled manually from a clone of the latest git commit. For all of them, oprofile pointed to poppler being the library dominating the processor.
So with the Ubuntu package of debug symbols for poppler installed, I had the oprofile report look at poppler. It showed three methods dominating processor time - in order they were TextBlock::isBeforeByRule1, TextPage::coalesce and TextBlock::visitDepthFirst. This was the case for all three programs using the poppler library backend.
So I start hacking around with the coalesce method in poppler, when I come across this line:
sortPos = blk1->visitDepthFirst(blkList, i, blocks, sortPos, visited);
I look at the visitDepthFirst method and see it is doing topological sorts on the data. So I comment out the above line within coalesce
// sortPos = blk1->visitDepthFirst(blkList, i, blocks, sortPos, visited);
recompile and run it.
So, when rendering the aforementioned PDF with poppler's test program poppler-glib-demo WITH the visitDepthFirst method in coalesce, the fastest rendering I get is 8 seconds.
When I render it without that line, the fastest render I get is less than 3 seconds. Removing this method more than halves my render time. But perhaps this behavior was due to the PDF itself being unusual. I did a quick search through Google for other PDFs. As I had this problem with a (partial) city map, I looked for other city maps. I randomly found a PDF with a map of Paris. I tested its rendering in poppler-glib-demo, with and without the call to the visitDepthFirst method in coalesce. Fastest render time with the method was 3.76 seconds, fastest without it was 2.07 seconds. So this random map PDF did not have as significant improvement as my map PDF, but this call, which was not doing anything visibly to improve the map display, was adding more than 50% more time to the program. As I said, from my cursory look, there was no difference between the displayed map which called visitDepthFirst, and the one which did not call it. I saw what the code did and the comment, for more information on its purpose and so forth I began digging through the logs. I saw that the code came from commit f83b677a8eb44d65698b77edb13a5c7de3a72c0f on November 12, 2009. In November 2009, Brian Ewins made a series of commits whose purpose was to improve text selection in tables. This particular commit changed the method of block sorting to "reading order" via a topological method. Aside from the comments in the git commits, these November 2009 column selection commits were discussed on the poppler mailing list, as well as in Bugzilla. I revert poppler back to commit 345ed51af9b9e7ea53af42727b91ed68dcc52370 and compile epdfview against it. Then I revert to poppler commit f83b677a8eb44d65698b77edb13a5c7de3a72c0f and compiled epdfview against that as well. Commit 345e... is two commits before f83b... When I run epdfview, using either poppler library as a backend, the version which is two commits earlier, 345e... runs in epdfview in less than half the time that commit f83b... runs in. 
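To see why a depth-first topological sort can dominate a profile, here is a minimal sketch that is not poppler's code. Blocks are plain integers, isBefore is a made-up stand-in for an ordering rule such as isBeforeByRule1, and the class name TopoSortCost is hypothetical. The only point is that when the "comes before" relation is recomputed pairwise during every visit, the number of rule evaluations can grow with the square of the number of blocks.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: counts ordering-rule calls in a DFS-style topological sort. */
final class TopoSortCost {
    static long comparisons = 0;

    // Stand-in ordering rule (an assumption): block a precedes block b when a < b.
    static boolean isBefore(int a, int b) {
        comparisons++;
        return a < b;
    }

    // Before emitting block i, first emit every unvisited block that must precede it.
    static void visit(int[] blocks, int i, boolean[] visited, List<Integer> out) {
        if (visited[i]) return;
        visited[i] = true;
        for (int j = 0; j < blocks.length; j++) {
            if (!visited[j] && isBefore(blocks[j], blocks[i])) {
                visit(blocks, j, visited, out);
            }
        }
        out.add(blocks[i]);
    }

    /** Returns the blocks in an order consistent with isBefore, resetting the counter. */
    static List<Integer> sort(int[] blocks) {
        comparisons = 0;
        boolean[] visited = new boolean[blocks.length];
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < blocks.length; i++) {
            visit(blocks, i, visited, out);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int n = 250; n <= 2000; n *= 2) {
            int[] blocks = new int[n];
            for (int i = 0; i < n; i++) blocks[i] = i; // already in order: every pair is still checked
            sort(blocks);
            System.out.println(n + " blocks -> " + comparisons + " rule evaluations");
        }
    }
}
```

In this toy version, doubling the number of blocks roughly quadruples the rule evaluations for already-ordered input. Whether poppler's loops follow the same growth is something only its own source and a profile can confirm.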
Commit f83b... more than doubles the time to run, with no noticeable improvement in anything for that application of poppler (displaying a map via epdfview).
I mentioned much of this on the Freenode IRC channel #poppler, and how removing the visitDepthFirst call from coalesce improved rendering time enormously. Someone, I believe it was Andrea Canciani, looked at the method and said the two nested loops looked wrong. There are two for loops in visitDepthFirst. I put a counter on the inside one to see how many times it ran on my 751K PDF. It ran over 196 million times! For every bit in my 751K PDF, the inner for loop ran 32 times. Not only that, if the TextBlock data structure blk3 is not equal to either blk1 or blk2, the inner for loop will make not one but two calls to the isBeforeByRule1 method. No wonder my map is rendering slowly.
So this is where it stands now. 16 seconds seems too long for Ubuntu's default PDF viewer to load my local bus map - especially when in my hacking around I have gotten it to display in a second and a half or so. Whatever that topological commit from November 2009 fixed in terms of selecting text from tables, it has more than doubled the rendering time of some PDFs, especially PDFs with maps. The solution would be a change that keeps the fixes to text selection in tables, but still has something near the speed of rendering prior to that commit. I will look into this more when I have the time.
Wed, 30 Dec 2009
Android install
Cannot complete the install because one or more required items could not be found.
Software being installed: Android Development Tools 0.9.5.v200911191123-20404 (com.android.ide.eclipse.adt.feature.group 0.9.5.v200911191123-20404)
Missing requirement: Android Development Tools 0.9.5.v200911191123-20404 (com.android.ide.eclipse.adt.feature.group 0.9.5.v200911191123-20404) requires 'org.eclipse.wst.xml.ui 0.0.0' but it could not be found
What this translates to is Android is dependent on another plug-in. So I go to install the webtools/wst xml plug-in, but it needs an EMF plugin. Then it needs a GEF plugin. Finally it will accept the webtools/wst plugin. Then the Android plugin can be installed. This sounds easy, but between Eclipse's junky and non-intuitive GUI and Android's documentation not mentioning their plugin had dependencies, it was not.
Tue, 22 Dec 2009
I have become interested in the poppler library lately (and one of the
The particular bug that was reported through Ubuntu that I am looking at is #497175. The user tried to use evince to look at his PDF, but it did not work. Text that should have been displayed was not displayed. He said Xpdf did work on the PDF file, displaying all text. I downloaded the sample PDF file, and saw that indeed the text displayed with Xpdf and not with evince 2.28.1 (using poppler 0.12.0) on Ubuntu 9.10. I tried displaying it with evince 2.22.1.1 based on the poppler 0.6.4 library. That worked. So I figured that some time between those early, working evince/poppler versions and the more recent evince/poppler versions which broke for the bug reporter (and myself), something must have changed that broke this.
I wasted time in two respects looking at this. One is that I looked at both evince and poppler. From my kibitzing of evince and poppler over the past months, I have seen over and over that most reported bugs on Ubuntu dealing with PDFs and evince are due to bugs in poppler, not evince. So trying different evince versions was a waste of time.
The second was how I dealt with poppler versions. I knew poppler 0.6.4 worked and poppler 0.12.0 didn't, so I downloaded poppler 0.10.0 (as the bug was reported recently, I leaned towards a more recent poppler version), compiled evince against it, ran it, saw it worked, then began manually downloading other poppler versions, compiling them, compiling evince against them, testing and so on. Eventually I saw that poppler 0.11.1 worked and poppler 0.11.2 did not.
However, I was doing a lot of unnecessary work. Poppler uses a git repository. I heard about git when it was announced in 2005, and I have checked out code from git repositories, and have browsed some git source trees over the web, but I have never looked much into it. Git has a cool feature called "bisect". Poppler has each release version tagged with the release name.
So what I could have done was a git bisect - marking 0.6.4 as a good version, and 0.12.0 as a bad version. Git would then have checked out the commit halfway between these two tags, and I would test it to see if it was good or bad. If it was bad, git would next bisect at the 25% mark between 0.6.4 and 0.12.0; if it was good, it would next bisect at the 75% mark. You keep bisecting until you get to the bad commit.
I am doing this now, and am down to my last test. I will mark it good or bad, after which we will know which commit caused this problem. [...] There, I'm done. Commit ad26e34bede53cb6300bc463cbdcc2b5adf101c2 broke it. It made changes to the CairoOutputDev.cc file. Before that commit, the text displays; after the commit, it does not. I updated the Ubuntu bug report and reported it to the poppler upstream.
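The search that git bisect performs is an ordinary binary search over the commit history. The sketch below only illustrates that idea. It is not how git is implemented, the commit numbering and the culprit index 637 are invented for the example, and the names BisectSketch and firstBad are hypothetical. In real use the "test" is a human or script building and checking each commit that git checks out after `git bisect start`, `git bisect good` and `git bisect bad`.

```java
import java.util.function.IntPredicate;

/** Illustrative binary search for the first bad commit in a linear history. */
final class BisectSketch {
    /**
     * Assumes commit 'good' passes, commit 'bad' fails, and once a commit is bad
     * every later commit is bad too. Returns the index of the first bad commit.
     */
    static int firstBad(int good, int bad, IntPredicate isBad) {
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2; // the commit to build and test next
            if (isBad.test(mid)) {
                bad = mid;   // the culprit is at or before mid
            } else {
                good = mid;  // the culprit is after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        int[] tests = {0};
        // Hypothetical history of 1000 commits in which commit 637 introduced the bug.
        int culprit = firstBad(0, 1000, commit -> {
            tests[0]++;
            return commit >= 637;
        });
        System.out.println("first bad commit: " + culprit + ", builds tested: " + tests[0]);
    }
}
```

Each test halves the remaining range, so even a thousand commits need only about ten builds, compared with the release-by-release downloading described above.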
Sat, 19 Dec 2009
poppler bug
The segmentation fault happens when the TextWord constructor is called. The reason the segmentation fault happens is because the curFont object has not been created. So without doing much investigation, I simply created the curFont object if it did not exist, and then called a related method. This seemed to solve the problem, the program stopped crashing and the problem pages were displayed seemingly normally (a cursory look shows the problem pages displaying normally, but it is possible some portion of the page is displayed improperly).
git diff TextOutputDev.cc
diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
index 442ace2..9686cc1 100644
--- a/poppler/TextOutputDev.cc
+++ b/poppler/TextOutputDev.cc
@@ -1988,6 +1988,11 @@ void TextPage::beginWord(GfxState *state, double
x0, double y0) {
rot = (m[2] > 0) ? 1 : 3;
}
+ if (!curFont) {
+ curFont = new TextFontInfo(state);
+ fonts->append(curFont);
+ }
+
curWord = new TextWord(state, rot, x0, y0, charPos, curFont,
curFontSize);
}
However, this is really just a hack. I don't have much of an understanding of how the poppler library works or how evince works. The Poppler people point out that this segmentation fault is not tripped on pdftotext, which also uses the poppler library. This is correct, it does not seem to. Then again, evince is calling the poppler_page_render() call in the poppler library, and pdftotext does not seem to do that. Thus, what that ultimately adds up to is questionable. Right now I am exploring the Gfx class, as backtrace (and following the program logic) shows that the Gfx class is utilized between the call to poppler_page_render() and the failed construction of the curWord object of the TextWord class. Setting the printCommands boolean to true shows debugging information so I am looking at that. What usually happens with the above patch is that the beginWord method is called many times, with one instance where no curFont object exists (and thus a segmentation fault would happen). I do not know much about the evince code or these libraries, so I am looking into all of this, seeing if I can come up with anything better than the above hack. It is pretty clear this is a poppler problem though - even if these pdf's are messed up, they don't crash PDF displayers that don't use the poppler library. The same goes for if evince is not doing something right with Cairo before handing it off to poppler. If this is happening 12 calls within poppler, it points to poppler being the problem. Wed, 21 Oct 2009
I have a Portable Document Format (PDF) file, which has a series of pages
This led me to look over Ubuntu's launchpad web site, which I began browsing. (I saw that file-roller, the default Ubuntu/Gnewsense Gnome archive file application, was crashing with segmentation violations a lot. There was not much non-automatic information about this however, aside from some people saying the problem was not always reproducible, but only happened sometimes. Due to this, and due to it using a lot of heavy GTK/GDK stuff that I don't know, I moved on.)
I wanted to look at my evince crash a little more carefully, but I was still running Intrepid Ibex (Ubuntu 8.10) whereas most people were reporting the problem on Jaunty Jackalope (Ubuntu 9.04) or even beta versions of Karmic Koala (Ubuntu 9.10 - beta), the release version of which is supposed to be coming out in eleven days. Well, this indicates the problem has been around for a while, and is still around. So I upgraded to Jackalope. I was a little uneasy about whether to go to the Koala beta, but then I plunged in.
One thing I noticed, which was not around so much on Ubuntu's Hardy Heron (8.04), is apport, a window which pops up when an application crashes and offers to automatically report it to Ubuntu. This popped up for me when evince crashed and I sent in the bug. Later, I marked it as a duplicate of a similar one. Launchpad makes a slight effort to let you see if it's a duplicate while reporting, but that question can be a little complex, and the process doesn't deal with that. So I reported it, and then marked it as a duplicate later.
The Poppler PDF rendering library was partially implicated in the crash, so I downloaded the dpkg for ePDFView, which also uses that library. ePDFView also crashes on these pages. So I reported that as a bug to Ubuntu via apport. The stack trace shows pretty much the same thing happening: they're both crashing in the JPEG 6.2 library, the call from which can be traced back, via the same route, to a Page::displaySlice call in the poppler library.
So it looks like the poppler library (or possibly even the jpeg library) is at fault.
Thu, 27 Aug 2009
jedit
Sat, 25 Jul 2009
As I said yesterday, I've taken 36 hours of a Java programming 101 class
At first I just wanted to see what a real Java program looked like. So I downloaded the latest jEdit source from Sourceforge. jEdit is the sixth most active project of all time on Sourceforge, has had millions of downloads, and is written in Java. Using ant to compile it was easy enough, and I did a cursory look through the code.
As they say, the best way to learn code is to try to change something. I looked through the bug list for open bugs that were not assigned. Bug 2808363 looked interesting so I took a look at that. As Sergey Zhumatiy states, the file he uploaded to Sourceforge does hang jEdit when one scrolls down to the line jEdit has trouble with (the line doing transliteration). I read through the rest of the thread and repeated some of what the other posters did - I did a thread dump and got the same result as Dale Anson. Denis Dzenskevich simplified the problem by yanking all of the relevant classes, methods, regular expressions etc. and putting them into a short Java file, duplicating the problem, and I ran the program and it hung for me as well. Matthieu Casanova noted the line jEdit was choking on from the uploaded file and mentioned that the regular expression used was in the perl.xml file. Denis Dzenskevich chimed in again, noting a geometric progression in processing time, with a scale factor of 2 for every new character processed. He notes he does not know Perl but posits that perhaps the regular expression could be simplified.
The first thing I did was try to simplify the "in the wild" code that jEdit was stumbling on. I cut out extraneous lines, then I changed the file type from Perl module (pm) to Perl executable (pl), then I simplified the expression even more to where I was translating the a's in the word banana to b's (banana -> bbnbnb). A comment of a few words at the end of the transliteration line still had jEdit stumbling. With this simple line failing, I began to suspect that Denis Dzenskevich was right with regards to the regular expression.
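A doubling of run time per added character is the usual signature of a nested quantifier backtracking through every way of splitting the remaining text. One way to probe that suspicion is to time the pattern from the thread's test program against a variant whose repeated group is reluctant. The sketch below assumes the quantifier in question is the `*` on the `(?:.*?[^\\])` group. The class name QuantifierSketch and its helper methods are hypothetical, and the timings will differ between machines and JVM versions.

```java
import java.util.regex.Pattern;

/** Compares a greedy and a reluctant repetition of the same group. */
final class QuantifierSketch {
    // Pattern as used in the thread's test program: the group is repeated greedily.
    static final Pattern GREEDY = Pattern.compile(
            "tr\\s*\\{.*?[^\\\\]\\}\\s*\\{(?:.*?[^\\\\])*\\}[cds]*");

    // Assumed variant: the same group repeated reluctantly (*? instead of *).
    static final Pattern RELUCTANT = Pattern.compile(
            "tr\\s*\\{.*?[^\\\\]\\}\\s*\\{(?:.*?[^\\\\])*?\\}[cds]*");

    /** Builds the simplified test line with n extra 'z' characters in the comment. */
    static String line(int n) {
        StringBuilder sb = new StringBuilder("tr{a}{b}; # I like ");
        for (int i = 0; i < n; i++) sb.append('z');
        return sb.toString();
    }

    /** Returns the elapsed milliseconds for a single lookingAt() call. */
    static long millis(Pattern p, String s) {
        long start = System.nanoTime();
        p.matcher(s).lookingAt();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Keep the padding small: the greedy pattern may take seconds beyond this.
        for (int n = 0; n <= 8; n++) {
            String s = line(n);
            System.out.println(s.length() + " chars: greedy " + millis(GREEDY, s)
                    + " ms, reluctant " + millis(RELUCTANT, s) + " ms");
        }
    }
}
```

The reluctant group tries the closing brace before consuming more text, so a line that does match is accepted without exploring the comment at all. A line with no valid closing brace can still send either pattern into heavy backtracking.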
I read Sun's information about the Pattern class, and then about the Matcher class. I read Perl documentation about transliteration and the like. I also found a very helpful Javaworld article about out-of-control regular expressions using the java.util.regex package. I realized that the regular expression was using a greedy quantifier within the transliteration statement's second set of curly brackets. If the regular expression was going to match, this was completely pointless, so I added a question mark to the end of the quantifier, changing it to a reluctant quantifier.
My Java test program (based on Denis Dzenskevich's test program) began working for my test Perl files. I did an ant compile of jEdit with the new perl.xml file and suddenly jEdit was able to easily load all those test Perl files it had been hanging on - it could even easily scroll through the original real-world file that had triggered the bug, the one Sergey Zhumatiy had uploaded to Sourceforge.
I also tested Perl files which did use backslashes improperly in the second set of brackets on transliteration lines. Still the same problem. So the bug is still there, but it has been minimized somewhat: instead of stumbling over all kinds of Perl transliteration lines, even proper ones that work, it now only stumbles over lines of Perl where transliteration is done and backslashes are used improperly in the second set of curly braces (if braces are used at all - you can do transliteration with forward slashes in Perl). So my patch partially fixes the problem anyhow.
Fri, 24 Jul 2009
I have taken 36 hours of Java programming 101 in the past few weeks, and
I decided to look at how jEdit, Sourceforge's 6th most active project of all time was put together. To get the latest version from their subversion version control server I did a: svn co https://jedit.svn.sourceforge.net/svnroot/jedit/jEdit/trunk jEdit
I then ran "ant" in the jEdit directory to build a jedit.jar file.
"java -jar jedit.jar" started jEdit up. All of this is fairly simple to
heavy Java users I'm sure, and I've dealt with this stuff on some level
for years, but I am new to Java programming so I am learning by even
doing this simple stuff.
#!/usr/bin/perl
The line it chokes on is the one with the comment; in fact, if the file consisted of just that one line, jEdit would still freeze. It appears jEdit will eventually parse this line, but only after many minutes (possibly even hours, days etc.) So far, this is my small contribution to solving the bug: that even a file with that one line mentioned above will freeze jEdit - they were using the real-world example before, which is too unwieldy.
As someone in the bug thread points out, a thread dump (kill -3 to the process number running "java -jar jedit.jar") has AWT-EventQueue-0 showing that things are stuck in the system class Pattern. The hand-off from user-class land to system is the handleRule method of the user-defined class TokenMarker. I did this thread dump myself. As I said before, I am using the latest (as of July 23rd, 2009) code in their Subversion server.
Also as is pointed out in the bug thread, the code involved is not just
in the TokenMarker.java file, but in the perl.xml file. jEdit has
"modes" to edit different types of files, so that when you edit Perl
in the default jEdit Perl mode, the words if, for and my are displayed
as dark blue, scalars and arrays are green, and so on. The perl.xml
file contains the following XML relevant to this -
So we have the above as the regular expression being matched against. The string being matched against is the line
$a =~ tr{a}{b}; # I like the letter a but not b translate all
We can actually remove everything before the t in tr, but I am not sure that would still be a valid Perl statement (although it would run in Perl without a problem). So we leave it. Someone put a sample program invoking all of this on the jEdit page. I will put it here, except changing the string to my simpler one here.
import java.util.regex.Pattern;
This program hangs for far too long. To go back to another point someone on the thread made, every character added to the end of the comment increases the execution time of the lookingAt method geometrically, with a scale factor of 2. If I run the test file:
import java.util.regex.Pattern;
public class Testy {
    public static void main(String[] args) {
        String str = "tr{a}{b}; # I like ";
        final String regex =
            "tr\\s*\\{.*?[^\\\\]\\}\\s*\\{(?:.*?[^\\\\])*\\}[cds]*";
        for (int i = 0; i < 1000; i++) {
            long start = System.currentTimeMillis(); // start timing
            System.out.println(Pattern.compile(regex).matcher(str).lookingAt());
            long stop = System.currentTimeMillis(); // stop timing
            System.out.println("TimeMillis: " + (stop - start)); // print execution time
            System.out.println("String length is " + str.length());
            System.out.println("String is " + str);
            str = str + "z"; // grow the comment by one character per pass
        }
    }
}
I get the output:
true
TimeMillis: 17
String length is 19
String is tr{a}{b}; # I like
true
TimeMillis: 3
String length is 20
String is tr{a}{b}; # I like z
true
TimeMillis: 9
String length is 21
String is tr{a}{b}; # I like zz
true
TimeMillis: 9
String length is 22
String is tr{a}{b}; # I like zzz
true
TimeMillis: 23
String length is 23
String is tr{a}{b}; # I like zzzz
true
TimeMillis: 40
String length is 24
String is tr{a}{b}; # I like zzzzz
true
TimeMillis: 71
String length is 25
String is tr{a}{b}; # I like zzzzzz
true
TimeMillis: 134
String length is 26
String is tr{a}{b}; # I like zzzzzzz
true
TimeMillis: 266
String length is 27
String is tr{a}{b}; # I like zzzzzzzz
true
TimeMillis: 843
String length is 28
String is tr{a}{b}; # I like zzzzzzzzz
true
TimeMillis: 1055
String length is 29
String is tr{a}{b}; # I like zzzzzzzzzz
true
TimeMillis: 2171
String length is 30
String is tr{a}{b}; # I like zzzzzzzzzzz
true
TimeMillis: 4541
String length is 31
String is tr{a}{b}; # I like zzzzzzzzzzzz
true
TimeMillis: 8796
String length is 32
String is tr{a}{b}; # I like zzzzzzzzzzzzz
true
TimeMillis: 17571
String length is 33
String is tr{a}{b}; # I like zzzzzzzzzzzzzz
true
TimeMillis: 35689
String length is 34
String is tr{a}{b}; # I like zzzzzzzzzzzzzzz
true
TimeMillis: 81209
String length is 35
String is tr{a}{b}; # I like zzzzzzzzzzzzzzzz
true
TimeMillis: 150282
String length is 36
String is tr{a}{b}; # I like zzzzzzzzzzzzzzzzz
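Since a line that is 42 characters long (the 36-character test string above plus the six-character "$a =~ " prefix) takes over 150 CPU seconds to parse, and each extra character doubles the time, a line approaching 80 characters would take over 150,000 CPU years to parse, which is far too long.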
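The doubling claim and the extrapolation can be checked with a little arithmetic. The sketch below copies the last eleven timings from the output above, computes the average growth per added character, and projects the cost of longer lines. The class and method names are made up for illustration, and the projection assumes exactly 150 seconds at 42 characters and exactly one doubling per extra character, which is only an approximation of the measured data.

```java
public class RegexGrowth {
    // Timings in milliseconds copied from the output above (string lengths 26 to 36).
    static final long[] MILLIS = {
        134, 266, 843, 1055, 2171, 4541, 8796, 17571, 35689, 81209, 150282
    };

    // Average growth per added character: geometric mean of the successive ratios.
    static double growthFactor(long[] millis) {
        double overall = (double) millis[millis.length - 1] / millis[0];
        return Math.pow(overall, 1.0 / (millis.length - 1));
    }

    // Projected seconds for a line of the given length, ASSUMING 150 s at
    // 42 characters and exactly one doubling per extra character.
    static double projectedSeconds(int lineLength) {
        return 150.0 * Math.pow(2.0, lineLength - 42);
    }

    public static void main(String[] args) {
        System.out.printf("average growth per character: %.2f%n", growthFactor(MILLIS));
        double secondsPerYear = 365.25 * 24 * 3600;
        for (int length : new int[] {42, 60, 77, 80}) {
            System.out.printf("%d characters: %.3g years%n",
                length, projectedSeconds(length) / secondsPerYear);
        }
    }
}
```

Under these assumptions the average growth factor comes out very close to 2, and the 150,000-year mark is crossed at about 77 characters.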
So where is the problem? I wanted to narrow it down even more, so I ran this test:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Testy {
    public static void main(String[] args) {
        String str = "tr{a}{b}; # I like ";
        final String regex =
            "tr\\s*\\{.*?[^\\\\]\\}\\s*\\{(?:.*?[^\\\\])*\\}[cds]*";
        Pattern p = Pattern.compile(regex); // compile once, outside the timed region
        for (int i = 0; i < 1000; i++) {
            long start = System.currentTimeMillis(); // start timing
            Matcher m = p.matcher(str);
            long mid = System.currentTimeMillis(); // matcher has been created
            System.out.println("TimeMillis1: " + (mid - start)); // time to create the matcher
            System.out.println(m.lookingAt());
            long stop = System.currentTimeMillis(); // stop timing
            System.out.println("TimeMillis2: " + (stop - start)); // total time, including lookingAt
            System.out.println("String length is " + str.length());
            System.out.println("String is " + str);
            str = str + "z"; // grow the trailing comment by one character
        }
    }
}
The tail of the output looked like this:
TimeMillis1: 0
true
TimeMillis2: 2330
String length is 30
String is tr{a}{b}; # I like zzzzzzzzzzz
TimeMillis1: 0
true
TimeMillis2: 4454
String length is 31
String is tr{a}{b}; # I like zzzzzzzzzzzz
TimeMillis1: 0
true
TimeMillis2: 8807
String length is 32
String is tr{a}{b}; # I like zzzzzzzzzzzzz
TimeMillis1: 0
So clearly, all of the time is being spent in Matcher's lookingAt method.
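The slow part of the pattern is the group (?:.*?[^\\])*, where a lazy .*? sits inside a repeated group, so the regex engine can split the trailing comment into pieces in exponentially many ways before it gives up on each of them. One general way to avoid that is to write the brace contents so that each character can be consumed in only one way. The sketch below is my own hypothetical rewrite, not something taken from perl.xml or the bug thread, and it is not exactly equivalent to the original (for example, it also accepts empty braces).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinearTr {
    // Hypothetical rewrite (an assumption, not jEdit's rule): inside each pair
    // of braces, a character is either "anything except backslash or }" or
    // "a backslash plus the character it escapes".
    static final String BODY = "(?:[^\\\\}]|\\\\.)*";
    static final Pattern TR = Pattern.compile(
        "tr\\s*\\{" + BODY + "\\}\\s*\\{" + BODY + "\\}[cds]*");

    // Returns the index where the match ends, or -1 if the line does not
    // start with tr{...}{...}.
    static int matchEnd(String line) {
        Matcher m = TR.matcher(line);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        StringBuilder line = new StringBuilder("tr{a}{b}; # I like ");
        for (int i = 0; i < 60; i++) {
            line.append('z'); // far more padding than the test above could handle
        }
        long start = System.currentTimeMillis();
        int end = matchEnd(line.toString());
        long stop = System.currentTimeMillis();
        System.out.println("match ends at " + end + " after " + (stop - start) + " ms");
    }
}
```

With this form the trailing comment is never scanned at all, because the second brace group stops at the first unescaped }.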
Thu, 02 Apr 2009
gocr
Sat, 24 Jan 2009
gocr, ocr
1) Sometimes there would be a serif at the top of the 'd'. When GOCR examines a 'd', it looks for a straight up-and-down line segment on the right side and two horizontal arcs on the left side - the top and bottom of the circle in the 'd'. GOCR would see the serif at the top of the up-and-down line segment and get confused. It was not expecting the serif to be there; it expected to see mostly two arcs (the circle at the left of the 'd'), then a straight up-and-down segment at the right of the 'd', and that's it. So I put a patch in to make GOCR less strict and allow for the serifs one finds in text at the top of a 'd'. This was done in the ocr0_dD() function.
2) The second change improved discrimination between a's and d's as well. This was done in the ocr0_aA() function. The letters 'a' and 'd' are printed in different fonts in different texts, and sometimes the only difference is that the up-and-down line segment on the right side extends significantly above the circle on the left for 'd', while with 'a' it stays level, or extends only slightly above the circle on the left side of the character. The ocr0_aA() function currently looks into the box struct for x0, x1, y0 and y1. My patch looks at m1 as well. With the 2006 patch I put in, if m1-y0 is greater than or equal to 0, I break.
While every test I ran showed an improvement in GOCR recognition after this, a week after I sent my patch in, as I became more familiar with gocr, two things occurred to me. The first was that instead of breaking, and so declaring that the character was not an 'a', I probably should have used the setac() call instead - diminishing the likelihood that the character was an 'a' but not totally eliminating it. The second was that "m1-y0 >= 0" as the condition for breaking is somewhat arbitrary. Exactly how far can the line segment rise before an 'a' turns into a 'd'? I picked the number 0 arbitrarily.
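The difference between breaking and lowering the likelihood can be shown with a small sketch. This is not GOCR code (GOCR is written in C) and it does not reproduce setac(); the class, the method names, the 0 to 100 confidence scale, the threshold parameter and the penalty percentage are all made up for illustration. Only the test "m1-y0 >= 0" comes from the patch described above.

```java
public class AscenderRule {
    // Hard rule, as the 2006 patch is described above: when m1 - y0 >= 0,
    // stop considering 'a' at all (confidence drops to 0).
    static int hardRule(int confidence, int m1, int y0) {
        return (m1 - y0 >= 0) ? 0 : confidence;
    }

    // Soft rule (hypothetical): keep 'a' as a candidate, but scale its
    // confidence down by penaltyPercent when m1 - y0 >= threshold.
    static int softRule(int confidence, int m1, int y0, int threshold, int penaltyPercent) {
        if (m1 - y0 >= threshold) {
            return confidence * (100 - penaltyPercent) / 100;
        }
        return confidence;
    }

    public static void main(String[] args) {
        int m1 = 12, y0 = 10; // made-up coordinates, for illustration only
        System.out.println("hard rule: " + hardRule(100, m1, y0));
        System.out.println("soft rule: " + softRule(100, m1, y0, 0, 40));
    }
}
```

Making the threshold a parameter also leaves room to test values other than 0 against larger characters.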
I did a number of tests, but more tests can probably be done, especially on very small and very large characters. These concerns made me think I could do an even better patch. The one I submitted seemed to break nothing, and only fix things, but I decided a better patch would use a setac() instead of a break in ocr0_aA(), and that more testing should be done first, especially on bigger characters. I would have to spend some time doing that, so I contacted Joerg Schulenburg recently and he gave me some encouragement. I am going forward with the new, better patch, as the first one was never applied (and since I felt I could do better, I never pushed for it to be applied). Joerg is busy with things, and I am a little busy as well, but I am less busy than I used to be, and have the time in the next weeks to do this new and better patch.
Anyhow, Joerg asked for some sample files. First I should say, I just downloaded gocr via cvs (January 24, 2009), patched it, and compared the results on the files in the examples directory between the current cvs and my patched version (4x6.png 5x7.png 5x8.png ocr-a.png ocr-b.png handwrt1.jpg matrix.jpg). There was no change for any of the examples.
What I use to test is OCR scans I got off of the Distributed Proofreaders website. With my 2006 patch, in every test I did on every OCR image, I did not see any negative effect - my patch did not remove recognition of any correctly labeled 'a' or 'd'. The only changes I saw were incorrectly labeled a's and d's no longer being labeled as such - often with a character incorrectly identified as an 'a' now being seen correctly as a 'd'.
An example of this is page 83 of the book "Daring and Suffering: A History of the Great Railroad Adventure" by William Pittenger. I got the scan of this from Distributed Proofreaders as it was on the way to Project Gutenberg.
Line "8" of that page (to GOCR it is the eighth line of text or whitespace; when reading the text it is the first line) with the current CVS snapshot of GOCR is:
_ll of the eigbt _e_ were capt4red, a_a are
With my 2006 patch, the text comes out as:
_ll of the eigbt _e_ were capt4red, a_d are
With my patch, GOCR now recognizes that the last letter of the word "and" is not 'a', but 'd'. A false recognition of the letter is replaced by the correct one. You can download page 83 of the book yourself, and run it with the current cvs snapshot, and against my patch.
Another example is page 160 of the book "Left End Edwards" by Ralph Henry Barbour. This is another book whose pages I grabbed from Distributed Proofreaders on their way to Project Gutenberg. My patch has a very good effect on this page, fixing four lines, all correctly. The first line fixed (according to GOCR) is line 8:
a_d tahe_ 9ou o_. Peters say6 _obey _il_ be ais-
becomes:
a_d tahe_ 9ou o_. Peters say6 _obey _il_ be dis-
The patch correctly changes "ais-" to "dis-". You can look at the image and see that it is correct. On the line which GOCR says is line 12, the line changes from:
'' I ao_'t believe t_ey _ll,'' replied Steve _o-
to
'' I do_'t believe t_ey _ll,'' replied Steve _o-
The d in don't is seen as d, not as a. There are two more corrected lines as well - an a correctly becomes a d on lines 26 and 27. You can see this for yourself: download the page and run the current gocr cvs against my 2006 patch.
As I said, I pulled these pages randomly from Distributed Proofreaders. It was the only online source of scans I knew of. I also tested this on scans of my own book collection, although my books are not in the public domain, so due to copyright issues I am less inclined to post them. The Distributed Proofreaders books are public domain books. As I said, I tested many pages on this, and if you want me to post more of my tests I will.
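Comparing the pre-patch and post-patch output line by line is easy to automate. The following is a minimal sketch of that comparison, not the procedure I actually used; the class and method names are hypothetical, and the two short lists stand in for the text GOCR would write for one page (the second entry is the page 83 line quoted above, the first is made-up filler).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OcrLineDiff {
    // Returns one report entry for every line number where the two outputs differ.
    static List<String> changedLines(List<String> before, List<String> after) {
        List<String> report = new ArrayList<>();
        int count = Math.max(before.size(), after.size());
        for (int i = 0; i < count; i++) {
            // Treat a missing line as empty so outputs of unequal length still compare.
            String oldLine = i < before.size() ? before.get(i) : "";
            String newLine = i < after.size() ? after.get(i) : "";
            if (!oldLine.equals(newLine)) {
                report.add("line " + (i + 1) + ": " + oldLine + " => " + newLine);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList(
            "unchanged filler line",
            "_ll of the eigbt _e_ were capt4red, a_a are");
        List<String> after = Arrays.asList(
            "unchanged filler line",
            "_ll of the eigbt _e_ were capt4red, a_d are");
        for (String entry : changedLines(before, after)) {
            System.out.println(entry);
        }
    }
}
```

For real pages the two lists could be read from saved output files with java.nio.file.Files.readAllLines, and each reported line then checked by eye against the scan.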
Many of my tests simply had no change - the pre-patch gocr output was the same as the post-patch output. On all my tests I saw no negative effect, only positive ones, usually a 'd' mis-classified as 'a' being properly classified as 'd'. But as I said, I have been thinking about the patch, and think a setac() would be better than a break for my ocr0_aA() test. The 0 in "m1-y0 >= 0" is also somewhat arbitrary, and I want to run more tests, especially on large characters. So my existing patch seems to only fix things, but I feel I can make the patch even better.
Here is a copy of the 2006 patch which works against the current (January 25, 2009) CVS snapshot. In the next weeks, I will work to see if I can improve the patch, mostly in terms of calling setac() instead of breaking, as well as seeing if the 0 value in "m1-y0 >= 0" is the best number to use, especially when I do more testing against larger characters. So I'll be sending you an improved version of my 2006 patch in the next few weeks. School is starting up for me again Monday and I will be somewhat busy, but I'm fairly sure I will have enough time to improve the 2006 patch in the next few weeks.
Sat, 27 Dec 2008
tesseract
Anyhow, first I examined how well tesseract translates text. I did this by taking scanned pages from Distributed Proofreaders, running tesseract on them, then manually checking the result. DP (Distributed Proofreaders) scans different types of books, so we get a range of different fonts and printing styles. I convert the PNG from DP to a TIF and then let tesseract run.
One thing I quickly noticed is that tesseract handles " fi" poorly, that is, words that begin with the letters fi. One example is on page 305 of part 1 of 4 of Chambers's Twentieth Century Dictionary. Line 33 is translated as:
"cats proverbially tight till each destroys the other. 1 11111;; ``````"
The junk after the word other and the period is just junk that was OCR'd. Anyhow, this should not OCR as "proverbially tight till" but as "proverbially fight till". You can see what it looks like in the book here:
We can see the same situation on the same page. Further down, line 72 is
translated as:
This should actually be:
We can see this in a scan of a different book. Page 120 of Secresy, or, The Ruin on the Rock also has a bad translation of " fi", despite different fonts, typesetting and so forth. Line 4 (line 3 if disregarding the title) is translated as:
"must acknowledge, his Hmmess has not undergone the trial you have"
where the real translation is:
"must acknowledge, his firmness has not undergone the trial you have"
Hmmess is actually firmness; once again " fi" is mistranslated.
I have been looking through the tesseract output of these letters and words with the debugger on, and am still doing so.
Mon, 16 Jun 2008
/var/tmp
One of my interests is OCR, particularly free software OCR. I spent some time on gocr, even though none of my patches were used and the project has not been updated for over a year. GOCR seemed the best thing to contribute to when I was looking at this a year or two ago, but Google has put Tesseract and Ocropus out there, so I am going to take a look at those now. They are in C++ - a language I knew nothing of two years ago, but have since taken a class in, so I am now a little more familiar with it. Apparently tesseract only does OCR, not layout. Ocropus is a layout plugin. I'm trying it now...it's pretty good. Better than GOCR probably. I will attempt to improve it the same way...get a number of samples of different books from Distributed Proofreaders, match the tesseract OCR against the original...see if there are any patterns of failure, then fix that in the tesseract code if possible.