Fri, 27 Jul 2018
I am writing a MoLo app (Mobile/Local) which might even become a MoLoSo app at some point (Mobile/Local/Social).
Any how, the way it works right now is it runs in the background, and if I am moving around, it sends my latitude and longitude off to my server. So I have a lot of Instance IDs and IP addresses, timestamps and latitude and longitudes on my server.
How to deal with taking those latitudes and longitudes and clustering them? Well I am sending the information to the database via a Python REST API script, so I I start with that. I change the MariaDB/MySQL calls from insert to select, and pull out the latitudes and longitudes.
The data I have is all the latitude/longtitude points I have been at over the past four months (although the app is not fully robust, and seems to get killed once in a while since an Oreo update, so not every latitude and longitude is covered). I don't know how many clusters I want, so I have the Python ML package scikit-learn do a MeanShift.
One thing I should point out is that in the current regular fast update interval for the app, I only send a location update if the location has changed beyond a limit (so if I am walking around a building, it will not be sending constant updates, but if I am driving in a car it will).
Scikit-learn's MeanShift clusters the locations into four categories. Running sequentially through the clusters, doing a predict, the first category starts with 2172 locations. Then 403 locations for category two. Then 925 locations for category three. Then another 410 locations for category two. Then 4490 locations for category one again. Then 403 locations for category four. Then 2541 locations for category one.
The center of the first cluster is about half a mile west and a quarter mile south of where I live. So I guess I spend more time in Manhattan, Brooklyn and western Queens than in Bayside and Nassau.
The center of the second cluster was near Wilmington, Delaware. The center of the third cluster was Burtonsville, Maryland.
It's due to the aforementioned properties of the app (I only send a location update if the location has changed beyond a small distance limit) that I had 2172 locations from March 12th to April 2nd in one clustrered area, and then on April 2nd - 1738 locations in two different clustered areas. On April 2nd I drove to and from my aunt's funeral in Maryland. That trip created two clusters - one in Wilmington for my drive there and back, and one in Maryland where I drove to the church, the graveyard and to the lunch afterward.
So then I have another 4490 location updates in cluster one, until those 403 updates in cluster four. The center of that cluster is Milford, Connecticut, and it revolves a trip I made to my other aunt's house near New Haven, Connecticut from May 25th to May 26th. Then it is another 2541 updates back in cluster one.
So...I could exclude by location, but I could also exclude by date, which is easier. So I exclude those three days and do MeanShift clustering again. Now I get six clusters.
Cluster one is centered about five blocks east, and slightly north of where I live. It has the bulk of entries. Cluster two is centered in east Midtown. Cluster three is centered near both the Woodside LIRR station and the 7 train junction at 74th Street. Cluster four is centered in Mineola, Long Island. Cluster five is centered south of Valley Stream, with 200 updates in three chunks. Cluster six is in Roosevelt, Long Island and only has one update.
MeanShift is good, but I may try other cluster types as well.