Geolocation using multiple services
UPDATE: A Chinese version of this post is available here.
Intro
In a previous post I wrote about PostGIS and ways of querying geographical data.
This post will focus on building a system that queries free geolocation services [1] and aggregates their results.
Overview
In summary, we're making requests to different web services (or APIs), we're doing reverse geocoding on the results and then we'll aggregate them.
Comparing geonames and openstreetmap
To relate to the previous post, here are some differences between geonames and openstreetmap:
| criterion | OSM | geonames |
|---|---|---|
| size | 50 GB compressed | 309 MB compressed |
| entities | 3.25 billion | 11 million |
| has administrative area data | yes | yes |
| has lat/long city data | yes | yes |
| has neighbourhood/district data | yes | yes |
| has region/area polygonal areas | yes | no |
| has intracity-level metadata | yes | no |
| has terrain metadata | yes | no |
They are meant for different purposes. Geonames is meant for city/administrative area/country data and can be used for geocoding. Openstreetmap has much more detailed data (one could probably extract the geonames data from openstreetmap) and can be used for geocoding, route planning and more.
Asynchronous requests to geolocation services
We're using the gevent library to make asynchronous requests to the geolocation services.
```python
import urllib2

import gevent
import gevent.greenlet
from gevent import monkey; gevent.monkey.patch_all()

geoip_service_urls = [
    ['geoplugin'   , 'http://www.geoplugin.net/json.gp?ip={ip}'],
    ['ip-api'      , 'http://ip-api.com/json/{ip}'],
    ['nekudo'      , 'https://geoip.nekudo.com/api/{ip}'],
    ['geoiplookup' , 'http://api.geoiplookup.net/?query={ip}'],
]

# fetch url in asynchronous mode (makes use of gevent)
def fetch_url_async(url, tag, timeout=2.0):
    data = None
    try:
        opener = urllib2.build_opener(urllib2.HTTPSHandler())
        opener.addheaders = [('User-agent', 'Mozilla/')]
        urllib2.install_opener(opener)
        data = urllib2.urlopen(url, timeout=timeout).read()
    except Exception:
        pass
    return [tag, data]

# expects req_data to be in this format: [ ['tag', url], ['tag', url], .. ]
def fetch_multiple_urls_async(req_data):
    # start the threads (greenlets)
    threads_ = []
    for u in req_data:
        (tag, url) = u
        new_thread = gevent.spawn(fetch_url_async, url, tag)
        threads_.append(new_thread)

    # wait for threads to finish
    gevent.joinall(threads_)

    # retrieve threads' return values
    results = []
    for t in threads_:
        results.append(t.get(block=True, timeout=5.0))

    return results

def process_service_answers(location_data):
    # 1) extract lat/long data from responses
    # 2) reverse geocoding using geonames
    # 3) aggregate location data
    #    (for example, one way of doing this would
    #    be to choose the location that most services
    #    agree on)
    pass

def geolocate_ip(ip):
    urls = []
    for grp in geoip_service_urls:
        tag, url = grp
        urls.append([tag, url.format(ip=ip)])
    results = fetch_multiple_urls_async(urls)
    answer = process_service_answers(results)
    return answer
```
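`process_service_answers` is left as a stub above. As a rough sketch of the "choose the location that most services agree on" idea, assuming each service's response has already been parsed and reverse-geocoded into a `(city, country_code)` tuple (the `aggregate_locations` helper and the sample answers below are made up for illustration):

```python
from collections import Counter

def aggregate_locations(resolved_locations):
    """Pick the location that most services agree on.

    resolved_locations: list of (service_tag, (city, country_code))
    tuples, where the second element is the reverse-geocoded answer
    of one geolocation service (None if the service failed).
    """
    votes = Counter(loc for _tag, loc in resolved_locations if loc is not None)
    if not votes:
        return None
    # most_common(1) returns [(location, vote_count)]
    best, _count = votes.most_common(1)[0]
    return best

# example: three services agree, one disagrees
answers = [
    ('geoplugin',   ('Cluj-Napoca', 'RO')),
    ('ip-api',      ('Cluj-Napoca', 'RO')),
    ('nekudo',      ('Floresti', 'RO')),
    ('geoiplookup', ('Cluj-Napoca', 'RO')),
]
print(aggregate_locations(answers))  # -> ('Cluj-Napoca', 'RO')
```

A ties-and-confidence policy (e.g. weighting services by past accuracy) could be layered on top, but a plain majority vote already filters out a single bad answer.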
City name ambiguity
Cities with the same name within the same country
There are many cities with the same name within a country, in different states/administrative regions. There are also cities with the same name in different countries.
For example, according to Geonames, there are 24 cities named Clinton in the US (in 23 different states, with two cities named Clinton in the same state of Michigan).
```sql
WITH duplicate_data AS (
    SELECT
        city_name,
        array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
    city_name,
    ARRAY_LENGTH(dupes, 1) AS duplicity,
    (
        CASE WHEN ARRAY_LENGTH(dupes, 1) > 9
        THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes, ','), 1, 50), '...')
        ELSE ARRAY_TO_STRING(dupes, ',')
        END
    ) AS sample
FROM duplicate_data
LIMIT 5;
```
| city_name | duplicity | sample |
|---|---|---|
| Clinton | 24 | (US,NY),(US,AR),(US,NC),(US,MA),(US,MD),(US,OH),(U… |
| Franklin | 19 | (US,ME),(US,MA),(US,NC),(US,TX),(US,NC),(US,LA),(U… |
| Springfield | 19 | (US,MN),(US,KY),(US,SD),(US,MI),(US,VA),(US,IL),(U… |
| Madison | 18 | (US,CT),(US,MN),(US,NJ),(US,ME),(US,SD),(US,FL),(U… |
| Greenville | 18 | (US,NC),(US,SC),(US,MS),(US,KY),(US,RI),(US,ME),(U… |
Cities with the same name in the same country and region
Worldwide, even in the same region of a country, there can be multiple cities with the exact same name.
Take for example Georgetown, in Indiana. Geonames says there are 3 towns with that name in Indiana. Wikipedia says there are even more:
- Georgetown, Floyd County, Indiana
- Georgetown Township, Floyd County, Indiana
- Georgetown, Cass County, Indiana
- Georgetown, Randolph County, Indiana
```sql
WITH duplicate_data AS (
    SELECT
        city_name,
        array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, region_code, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
    city_name,
    ARRAY_LENGTH(dupes, 1) AS duplicity,
    (
        CASE WHEN ARRAY_LENGTH(dupes, 1) > 9
        THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes, ','), 1, 50), '...')
        ELSE ARRAY_TO_STRING(dupes, ',')
        END
    ) AS sample
FROM duplicate_data
LIMIT 4;
```
| city_name | duplicity | sample |
|---|---|---|
| Plantation | 3 | (US,FL),(US,FL),(US,FL) |
| Georgetown | 3 | (US,IN),(US,IN),(US,IN) |
| Robinwood | 3 | (US,MD),(US,MD),(US,MD) |
| Prospect Park | 2 | (US,NJ),(US,NJ) |
Reverse geocoding
Both (city_name, country_code) and (city_name, country_code, region_name)
tuples have failed as candidates for uniquely identifying a location.
Zip or postal codes could be an option, but most geolocation services don't provide them.
But most geolocation services do offer longitude and latitude, and we can use those to eliminate ambiguity.
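To illustrate how lat/long resolves the ambiguity: given two same-named cities and a point reported by a geolocation service, the great-circle (haversine) distance picks out the nearer one. The coordinates below are approximate and only for illustration:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, long) points, in km."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# two of the many US cities named Clinton (approximate coordinates)
clinton_ny = (43.05, -75.38)
clinton_ar = (35.59, -92.46)

# a point reported by a geolocation service, close to the NY one
reported = (43.00, -75.30)

dist_ny = haversine_km(reported[0], reported[1], *clinton_ny)
dist_ar = haversine_km(reported[0], reported[1], *clinton_ar)
# the closer city wins, so "Clinton, US" is no longer ambiguous
```

This is essentially what the PostgreSQL-side reverse geocoding later in the post does, except done in the database with an index so it scales to millions of cities.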
Geometric data types in PostgreSQL
I looked further into the PostgreSQL docs and found that it also has geometric data types and functions for 2D geometry. Out of the box you can model, store and query points, boxes, paths, polygons and circles.
PostgreSQL has some additional extensions in the contrib directory. They are available out of the box with most Postgres installs.
In this situation we're interested in the cube and earthdistance extensions [2]. The cube extension allows you to model n-dimensional vectors, and the earthdistance extension uses 3-cubes to store vectors and represent points on the surface of the Earth.
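To make that model concrete, here is a rough Python sketch of what `ll_to_earth` and `earth_distance` compute (not the extension's actual code; the 6378168 m radius is, to my knowledge, the default value of the extension's `earth()` function): a lat/long pair is mapped to a 3D point on a sphere, and the great-circle distance is recovered from the straight-line (chord) distance between two such points.

```python
import math

EARTH_RADIUS_M = 6378168.0  # radius assumed by the earthdistance extension

def ll_to_earth(lat, lon):
    """Map (lat, long) in degrees to a 3D point on the sphere's surface."""
    lat, lon = math.radians(lat), math.radians(lon)
    return (EARTH_RADIUS_M * math.cos(lat) * math.cos(lon),
            EARTH_RADIUS_M * math.cos(lat) * math.sin(lon),
            EARTH_RADIUS_M * math.sin(lat))

def earth_distance(p1, p2):
    """Great-circle distance recovered from the chord between two 3D points."""
    chord = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
    # chord = 2R * sin(theta/2)  =>  arc length = R * theta
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, chord / (2 * EARTH_RADIUS_M)))
```

Storing the 3D point rather than raw lat/long is what lets the extension index points with a gist index and answer "within distance d" queries with a bounding-box check (`earth_box`) before the exact distance computation.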
We'll be using the following:
- the `earth_distance` function, which computes the great-circle distance between two points
- the `earth_box` function, to check whether a point is within a certain distance of a reference point
- a gist expression index on `ll_to_earth(lat, long)` to make fast spatial queries and find nearby points
Designing a view for city & region data
Geonames data was imported into 3 tables:
- `geo_geoname` (data from cities1000.zip)
- `geo_admin1` (data from admin1CodesASCII.txt)
- `geo_countryinfo` (data from countryInfo.txt)
Then we create a view that pulls everything together [3]. We now have population data, city/region/country data, and lat/long data, all in one place.
```sql
CREATE OR REPLACE VIEW city_region_data AS (
    SELECT
        b.country   AS country_code,
        b.asciiname AS city_name,
        a.name      AS region_name,
        b.region_code,
        b.population,
        b.latitude  AS city_lat,
        b.longitude AS city_long,
        c.name      AS country_name
    FROM geo_admin1 a
    JOIN (
        SELECT
            *,
            (country || '.' || admin1) AS country_region,
            admin1 AS region_code
        FROM geo_geoname
        WHERE fclass = 'P'
    ) b ON a.code = b.country_region
    JOIN geo_countryinfo c ON b.country = c.iso_alpha2
);
```
Designing a nearby-city query and function
In the most nested SELECT, we're only keeping the cities in a 23km
radius around the reference point, then we're applying a country
filter and city pattern filter (these two filters are optional), and
we're only getting the closest 50 results to the reference point.
Next, we're reordering by population, because geonames sometimes has districts and neighbourhoods around bigger cities [4] and does not mark them in a specific way, so we want to select the larger city rather than one of its districts. For example, if the geolocation service returned a lat/long that resolves to one district of a larger metropolitan area, we'd like to resolve it to the larger city it belongs to.
We're also creating a gist index (the @> operator will make use of the gist index)
which we're using to find points within a radius of a reference point.
This function takes a point (using latitude and longitude) and returns the city, region and country that is associated with that point.
```sql
CREATE INDEX geo_geoname_latlong_idx
    ON geo_geoname
    USING gist(ll_to_earth(latitude, longitude));

CREATE OR REPLACE FUNCTION geo_find_nearest_city_and_region(
    latitude double precision,
    longitude double precision,
    filter_countries_arr varchar[],
    filter_city_pattern varchar
) RETURNS TABLE(
    country_code varchar,
    city_name varchar,
    region_name varchar,
    region_code varchar,
    population bigint,
    _lat double precision,
    _long double precision,
    country_name varchar,
    distance numeric
) AS $$
BEGIN
    RETURN QUERY
    SELECT * FROM (
        SELECT * FROM (
            SELECT
                *,
                ROUND(earth_distance(
                    ll_to_earth(c.city_lat, c.city_long),
                    ll_to_earth(latitude, longitude)
                )::numeric, 3) AS distance_
            FROM city_region_data c
            WHERE earth_box(ll_to_earth(latitude, longitude), 23000) @> ll_to_earth(c.city_lat, c.city_long)
              AND (filter_countries_arr IS NULL OR c.country_code = ANY(filter_countries_arr))
              AND (filter_city_pattern IS NULL OR c.city_name LIKE filter_city_pattern)
            ORDER BY distance_ ASC
            LIMIT 50
        ) d
        ORDER BY population DESC
    ) e
    LIMIT 1;
END;
$$ LANGUAGE plpgsql;
```
Conclusion
We started from the design of a system that queries multiple geoip services, gathers their results, and aggregates them to get a more reliable answer.
We first looked at some ways of uniquely identifying locations and then picked one that eliminates ambiguity. In the second half, we looked at different ways of structuring, storing and querying geographical data in PostgreSQL, and built a view and a function to find cities near a reference point, which allowed us to do reverse geocoding.
Footnotes:

1. By using multiple services (and assuming they use different data sources internally), after aggregation we can get a more reliable answer than from a single service. Another advantage is that these are free services: no setup is required and we don't have to take care of updates, since the services are maintained by their owners. However, querying all these web services will be slower than querying a local geoip data structure; city/country/region geolocation databases are available, some examples being geoip2 from maxmind, ip2location or db-ip.
2. There's a nice post here using the earthdistance module to compute distances to nearby or far away pubs.
3. Geonames has geonameIds as well, which are geonames-specific ids we can use to accurately refer to locations.
4. geonames does not have polygonal data about cities/neighbourhoods or metadata about the type of urban area (like openstreetmap does), so you can't query for all city polygons (excluding districts/neighbourhoods) that contain a given point.