Android, Linux, FLOSS etc.


Fri, 27 Jul 2018

Python Scikit-learn and MeanShift, for Android location app

I am writing a MoLo app (Mobile/Local) which might even become a MoLoSo app at some point (Mobile/Local/Social).

Anyhow, the way it works right now is that it runs in the background, and if I am moving around, it sends my latitude and longitude off to my server. So I have a lot of Instance IDs, IP addresses, timestamps, latitudes and longitudes on my server.

How do I take those latitudes and longitudes and cluster them? Well, I am sending the information to the database via a Python REST API script, so I start with that. I change the MariaDB/MySQL calls from insert to select, and pull out the latitudes and longitudes.
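
A minimal sketch of that select step. To keep it runnable anywhere it uses Python's built-in sqlite3 as a stand-in for MariaDB/MySQL, and the table name (locations) and column names are hypothetical, not the real schema.

```python
import sqlite3

# Stand-in for the MariaDB/MySQL database; schema and names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locations ("
    "instance_id TEXT, ip TEXT, ts TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO locations VALUES (?, ?, ?, ?, ?)",
    [
        ("abc", "10.0.0.1", "2018-03-12T09:00:00", 40.760, -73.780),
        ("abc", "10.0.0.1", "2018-03-12T08:00:00", 40.758, -73.779),
    ],
)


def fetch_points(connection):
    """Return (latitude, longitude) pairs ordered by timestamp."""
    cursor = connection.execute(
        "SELECT latitude, longitude FROM locations ORDER BY ts"
    )
    return [(lat, lon) for lat, lon in cursor.fetchall()]


points = fetch_points(conn)
```

Ordering by timestamp matters later on, since I walk through the points in sequence.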

The data I have is all the latitude/longitude points I have been at over the past four months (although the app is not fully robust, and seems to get killed once in a while since an Oreo update, so not every latitude and longitude is covered). I don't know how many clusters I want, so I have the Python ML package scikit-learn do a MeanShift.
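
Roughly what a MeanShift run on latitude/longitude pairs looks like in scikit-learn. This is a sketch with made-up points, and the fixed bandwidth of 0.5 degrees is an assumption for the toy data (leaving bandwidth unset makes scikit-learn estimate one). It also treats degrees as flat x/y coordinates, which is only an approximation.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Made-up sample points: a few near Queens, a few near Burtonsville, Maryland.
points = np.array([
    [40.760, -73.780],
    [40.762, -73.782],
    [40.758, -73.779],
    [40.761, -73.781],
    [39.110, -76.930],
    [39.112, -76.932],
    [39.109, -76.931],
])

# No cluster count is given; MeanShift finds the clusters from the density.
# bandwidth=0.5 (in degrees) is an assumed value chosen for this toy data.
model = MeanShift(bandwidth=0.5)
labels = model.fit_predict(points)

n_clusters = len(model.cluster_centers_)
print(n_clusters, labels)
```

The cluster centers end up in model.cluster_centers_, one latitude/longitude row per cluster.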

One thing I should point out is that in the current regular fast update interval for the app, I only send a location update if the location has changed beyond a limit (so if I am walking around a building, it will not be sending constant updates, but if I am driving in a car it will).
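
The app itself runs on Android, but the distance check can be sketched in Python for illustration. The haversine formula here is a standard great-circle distance; the function names and the 50 meter threshold are hypothetical, not the app's actual values.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def should_send(last_sent, current, min_distance_m=50.0):
    """Send an update only if we moved at least min_distance_m (assumed value)."""
    if last_sent is None:
        return True  # nothing sent yet, so always send the first fix
    return haversine_m(*last_sent, *current) >= min_distance_m
```

With a check like this, walking around inside a building stays under the limit, while driving crosses it again and again.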

Scikit-learn's MeanShift clusters the locations into four categories. Running sequentially through the clusters, doing a predict, the first category starts with 2172 locations. Then 403 locations for category two. Then 925 locations for category three. Then another 410 locations for category two. Then 4490 locations for category one again. Then 403 locations for category four. Then 2541 locations for category one.
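
One way to get that kind of sequential breakdown is to collapse the time-ordered cluster labels into runs. A small sketch using itertools.groupby; the label list is made up, and label_runs is a hypothetical helper name.

```python
from itertools import groupby


def label_runs(labels):
    """Collapse a time-ordered label sequence into (label, run_length) pairs."""
    return [(label, sum(1 for _ in group)) for label, group in groupby(labels)]


# Made-up labels in timestamp order, e.g. from a predict on each point.
labels = [0, 0, 0, 1, 1, 2, 1, 0, 0]
runs = label_runs(labels)
print(runs)
```

Each pair is a cluster label and how many consecutive updates fell into it before the label changed.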

The center of the first cluster is about half a mile west and a quarter mile south of where I live. So I guess I spend more time in Manhattan, Brooklyn and western Queens than in Bayside and Nassau.

The center of the second cluster was near Wilmington, Delaware. The center of the third cluster was Burtonsville, Maryland.

It's due to the aforementioned properties of the app (I only send a location update if the location has changed beyond a small distance limit) that I had 2172 locations from March 12th to April 2nd in one clustered area, and then 1738 locations on April 2nd alone in two different clustered areas. On April 2nd I drove to and from my aunt's funeral in Maryland. That trip created two clusters: one in Wilmington for my drive there and back, and one in Maryland where I drove to the church, the graveyard and to the lunch afterward.

So then I have another 4490 location updates in cluster one, until those 403 updates in cluster four. The center of that cluster is Milford, Connecticut, and it revolves around a trip I made to my other aunt's house near New Haven, Connecticut from May 25th to May 26th. Then it is another 2541 updates back in cluster one.

So...I could exclude by location, but I could also exclude by date, which is easier. So I exclude those three days and do MeanShift clustering again. Now I get six clusters.
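
Excluding by date can be as simple as filtering the rows before clustering. A sketch, assuming each row is a (timestamp, latitude, longitude) tuple; the row layout, the helper name and the year 2018 are my assumptions here, and the sample rows are made up.

```python
from datetime import date, datetime

# The three trip days; the year is an assumption.
EXCLUDED_DATES = {date(2018, 4, 2), date(2018, 5, 25), date(2018, 5, 26)}


def drop_excluded(rows, excluded=EXCLUDED_DATES):
    """Keep only rows whose timestamp does not fall on an excluded date."""
    return [row for row in rows if row[0].date() not in excluded]


# Made-up rows: (timestamp, latitude, longitude)
rows = [
    (datetime(2018, 4, 1, 18, 30), 40.76, -73.78),
    (datetime(2018, 4, 2, 9, 15), 39.74, -75.55),
    (datetime(2018, 5, 25, 14, 0), 41.22, -73.06),
    (datetime(2018, 5, 27, 8, 0), 40.75, -73.97),
]
kept = drop_excluded(rows)
```

The remaining latitude/longitude pairs then go back through MeanShift as before.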

Cluster one is centered about five blocks east, and slightly north of where I live. It has the bulk of entries. Cluster two is centered in east Midtown. Cluster three is centered near both the Woodside LIRR station and the 7 train junction at 74th Street. Cluster four is centered in Mineola, Long Island. Cluster five is centered south of Valley Stream, with 200 updates in three chunks. Cluster six is in Roosevelt, Long Island and only has one update.

MeanShift is good, but I may try other cluster types as well.


Sat, 21 Mar 2015

Preparing for Artificial Intelligence and Machine Learning

I took a class in AI in late 2013, but I only started looking at practical engineering implementations for ML in the past few months.

In looking at things like scikit-learn, I saw that a lot of the algorithms are already coded. You can even automatically test which classifier/model will be best for the data. In looking at the package and examples, I suspected that the hard part was wrangling the in-the-field data into an acceptable form for the algorithms.

I was graciously invited to an event a few months ago by a fellow named Scott, at which there were several people with good field knowledge of AI and ML. I talked to two of them about algorithms and data. Both of them made the point that getting the data wrangled into a suitable form was the hard part. I then went onto the net and read about this more carefully, and others with experience seemed to agree. So it is like other programming, where getting the data structures and data input right is usually the hard part, since if that is done well, implementing the algorithms is usually not much of a chore.

So I began working on my ML project. What does it do? Sometimes I go to local supermarkets, and what I am looking for is out of stock. So this ML project predicts whether the supermarket will have the item I'm looking for in stock.

I architected the data structures (which consist of purchases, and observations that certain products are missing) and programmed the inputs. Then I added Google Maps so I could see where the local supermarkets were. The program would prefer close supermarkets to far ones.

Now I have run into a problem/non-problem. In architecting the solution so that the ML models and algorithms could better understand the problem, I architected a solution so that I could better understand the problem as well. Before, I would pretty much go to my closest supermarket, and if they were out of stock then on to the next closest one, and so forth. Now I have all that data available on my Android, including a map, and deciding which supermarket to go to is trivial. I don't need the ML so much any more. I wonder how often this happens: you build a solution so that AI/ML can be used, but once all the data is recorded in an understandable way, you don't need the AI/ML any more. There can still be situations, though, where there is too much data for someone to keep in their head, but not too much for an ML solution.

Anyhow, I went through enough trouble to put all of this together that I will still go through with writing a program that predicts if the items I want are in stock. I'll also make a map with time/distance information between my home and the supermarkets, and the supermarkets with each other. Then my program will give me advice on which supermarkets to try first.
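
A very rough sketch of what "which supermarkets to try first" could look like. This is not the program itself: the Observation structure, the smoothed in-stock rate and the tie-break on travel time are all assumptions for illustration, and the store names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    store: str
    item: str
    in_stock: bool


def in_stock_rate(observations, store, item):
    """Laplace-smoothed fraction of visits where the item was in stock."""
    relevant = [o for o in observations if o.store == store and o.item == item]
    hits = sum(o.in_stock for o in relevant)
    return (hits + 1) / (len(relevant) + 2)


def rank_stores(observations, item, travel_minutes):
    """Order stores by in-stock rate (highest first), then by travel time."""
    return sorted(
        travel_minutes,
        key=lambda s: (-in_stock_rate(observations, s, item), travel_minutes[s]),
    )


observations = [
    Observation("A", "oat milk", True),
    Observation("A", "oat milk", False),
    Observation("A", "oat milk", False),
    Observation("B", "oat milk", True),
    Observation("B", "oat milk", True),
]
travel_minutes = {"A": 5, "B": 10, "C": 15}  # made-up travel times from home
ranking = rank_stores(observations, "oat milk", travel_minutes)
```

The smoothing keeps a store with no observations at a neutral 0.5 instead of ruling it out or favoring it outright.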
