UPDATE: A chinese version of this post is available here.

Intro

In a previous post I wrote about PostGIS and ways of querying geographical data.

This post will focus on building a system that queries free geolocation services 1 and aggregates their results.

Overview

In summary, we're making requests to different web services (or APIs), we're doing reverse geocoding on the results and then we'll aggregate them.

Comparing geonames and openstreetmap

To relate to the previous post, here are some differences between geonames and openstreetmap:

criterion OSM geonames
size 50 GB compressed 309 MB compressed
entities 3.25 billion 11 million
has administrative area data yes yes
has lat/long city data yes yes
has neighbourhood/district data yes yes
has region/area polygonal areas yes no
has intracity-level metadata yes no
has terrain metadata yes no

They are meant for different purposes. Geonames is meant for city/administrative area/country data and can be used for geocoding. Openstreetmap has much more detailed data (one could probably extract the geonames data using openstreetmap) and can be used for geocoding, route planning and more .

Asynchronous requests to geolocation services

We're using the gevent library to make asynchronous requests to the geolocation services.

import gevent
import gevent.greenlet
from gevent import monkey; gevent.monkey.patch_all()

geoip_service_urls=[
        ['geoplugin'    , 'http://www.geoplugin.net/json.gp?ip={ip}' ],
        ['ip-api'       , 'http://ip-api.com/json/{ip}'              ],
        ['nekudo'       , 'https://geoip.nekudo.com/api/{ip}'        ],
        ['geoiplookup'  , 'http://api.geoiplookup.net/?query={ip}'   ],
        ]

# fetch url in asynchronous mode (makes use of gevent)
def fetch_url_async(url, tag, timeout=2.0):
    data = None
    try:
        opener = urllib2.build_opener(urllib2.HTTPSHandler())
        opener.addheaders = [('User-agent', 'Mozilla/')]
        urllib2.install_opener(opener)
        data = urllib2.urlopen(url,timeout=timeout).read()
    except Exception, e:
        pass

    return [tag, data]

# expects req_data to be in this format: [ ['tag', url], ['tag', url], .. ]
def fetch_multiple_urls_async(req_data):

    # start the threads (greenlets)
    threads_ = []
    for u in req_data:
        (tag, url) = u
        new_thread = gevent.spawn(fetch_url_async, url, tag)
        threads_.append(new_thread)

    # wait for threads to finish
    gevent.joinall(threads_)

    # retrieve threads return values
    results = []
    for t in threads_:
        results.append(t.get(block=True, timeout=5.0))

    return results

def process_service_answers(location_data):
    # 1) extract lat/long data from responses
    # 2) reverse geocoding using geonames
    # 3) aggregate location data
    #    (for example, one way of doing this would
    #     be to choose the location that most services
    #     agree on)
    pass

def geolocate_ip(ip):
    urls = []
    for grp in geoip_service_urls:
        tag, url = grp
        urls.append([tag, url.format(ip=ip)])
    results = fetch_multiple_urls_async(urls)
    answer = process_service_answers(results)
    return answer

City name ambiguity

Cities with the same name within the same country

There are many cities with the same name within a country, in different states/administrative regions. There's also cities with the same name in different countries.

For example, according to Geonames, there are 24 cities named Clinton in the US (in 23 different states, with two cities named Clinton in the same state of Michigan).

WITH duplicate_data AS (
    SELECT
    city_name,
    array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
city_name,
ARRAY_LENGTH(dupes, 1) AS duplicity,
( CASE WHEN ARRAY_LENGTH(dupes,1) > 9 
  THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes,','), 1, 50), '...')
  ELSE ARRAY_TO_STRING(dupes,',') END
) AS sample
FROM duplicate_data
LIMIT 5;
city_name duplicity sample
Clinton 24 (US,NY),(US,AR),(US,NC),(US,MA),(US,MD),(US,OH),(U…
Franklin 19 (US,ME),(US,MA),(US,NC),(US,TX),(US,NC),(US,LA),(U…
Springfield 19 (US,MN),(US,KY),(US,SD),(US,MI),(US,VA),(US,IL),(U…
Madison 18 (US,CT),(US,MN),(US,NJ),(US,ME),(US,SD),(US,FL),(U…
Greenville 18 (US,NC),(US,SC),(US,MS),(US,KY),(US,RI),(US,ME),(U…

Cities with the same name in the same country and region

Worldwide, even in the same region of a country, there can be multiple cities with the exact same name.

Take for example Georgetown, in Indiana. Geonames says there are 3 towns with that name in Indiana. Wikipedia says there are even more:

WITH duplicate_data AS (
    SELECT
    city_name,
    array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, region_code, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
city_name,
ARRAY_LENGTH(dupes, 1) AS duplicity,
( CASE WHEN ARRAY_LENGTH(dupes,1) > 9 
  THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes,','), 1, 50), '...')
  ELSE ARRAY_TO_STRING(dupes,',') END
) AS sample
FROM duplicate_data
LIMIT 4;
city_name duplicity sample
Plantation 3 (US,FL),(US,FL),(US,FL)
Georgetown 3 (US,IN),(US,IN),(US,IN)
Robinwood 3 (US,MD),(US,MD),(US,MD)
Prospect Park 2 (US,NJ),(US,NJ)

Reverse geocoding

Both (city_name, country_code) and (city_name, country_code, region_name) tuples have failed as candidates to uniquely identify a location.

We would have the option of using zip codes or postal codes except we can't use those since most geolocation services don't offer that.

But most geolocation services do offer longitude and latitude, and we can use those to eliminate ambiguity.

Geometric data types in PostgreSQL

I looked further into the PostgreSQL docs and found that it also has geometric data types and functions for 2D geometry. Out of the box you can model points, boxes, paths, polygons, circles, you can store them and query them.

PostgreSQL has some additional extensions in the contrib directory. They are available out of the box with most Postgres installs.

In this situation we're interested in the cube and earthdistance extensions 2. The cube extension allows you to model n-dimensional vectors, and the earthdistance extension uses 3-cubes to store vectors and represent points on the surface of the Earth.

We'll be using the following:

  • the earth_distance function is available, and it allows you to compute the great-circle distance between two points
  • the earth_box function to check if a point is within a certain distance of a reference point
  • a gist expression index on the expression ll_to_earth(lat, long) to make fast spatial queries and find nearby points

Designing a view for city & region data

Geonames data was imported into 3 tables:

Then we create a view that pulls everything together 3. We now have population data, city/region/country data, and lat/long data, all in one place.

CREATE OR REPLACE VIEW city_region_data AS ( 
    SELECT
        b.country AS country_code,
        b.asciiname AS city_name,
        a.name AS region_name,
        b.region_code,
        b.population,
        b.latitude AS city_lat,
        b.longitude AS city_long,
        c.name    AS country_name
    FROM geo_admin1 a
    JOIN (
        SELECT *, (country || '.' || admin1) AS country_region, admin1 AS region_code
        FROM geo_geoname
        WHERE fclass = 'P'
    ) b ON a.code = b.country_region
    JOIN geo_countryinfo c ON b.country = c.iso_alpha2
);

Designing a nearby-city query and function

In the most nested SELECT, we're only keeping the cities in a 23km radius around the reference point, then we're applying a country filter and city pattern filter (these two filters are optional), and we're only getting the closest 50 results to the reference point.

Next, we're reordering by population because geonames sometimes has districts and neighbourhoods around bigger cities 4, and it does not mark them in a specific way, so we just want to select the larger city and not a district (for example let's say the geolocation service returned a lat/long that would resolve to one district of a larger metropolitan area. In my case, I'd like to resolve this to the larger city it's associated with)

We're also creating a gist index (the @> operator will make use of the gist index) which we're using to find points within a radius of a reference point.

This function takes a point (using latitude and longitude) and returns the city, region and country that is associated with that point.

CREATE INDEX geo_geoname_latlong_idx ON geo_geoname USING gist(ll_to_earth(latitude,longitude));
CREATE OR REPLACE FUNCTION geo_find_nearest_city_and_region(
    latitude double precision,
    longitude double precision,
    filter_countries_arr varchar[],
    filter_city_pattern  varchar,
) RETURNS TABLE(
    country_code varchar,
    city_name varchar,
    region_name varchar,
    region_code varchar,
    population bigint,
    _lat double precision,
    _long double precision,
    country_name varchar,
    distance numeric
    ) AS $$
BEGIN
    RETURN QUERY
    SELECT *
    FROM (
        SELECT
        *
        FROM (
            SELECT 
            *,
            ROUND(earth_distance(
                   ll_to_earth(c.city_lat, c.city_long),
                   ll_to_earth(latitude, longitude)
                  )::numeric, 3) AS distance_
            FROM city_region_data c
            WHERE earth_box(ll_to_earth(latitude, longitude), 23000) @> ll_to_earth(c.city_lat, c.city_long) AND
                  (filter_countries_arr IS NULL OR c.country_code=ANY(filter_countries_arr)) AND
                  (filter_city_pattern  IS NULL OR c.city_name LIKE filter_city_pattern)
            ORDER BY distance_ ASC
            LIMIT 50
        ) d
        ORDER BY population DESC
    ) e
    LIMIT 1;
END;
$$
LANGUAGE plpgsql;

Conclusion

We've started from the design of a system that would query multiple geoip services, would gather the data and would then aggregate it to get a more reliable result.

We first looked at some ways of uniquely identifying locations.

We've then picked a way that would eliminate ambiguity in identifying them. In the second half, we've looked at different ways of structuring, storing and querying geographical data in PostgreSQL.

Then we've built a view and a function to find cities near a reference point which allowed us to do reverse geocoding.

Footnotes:

1

By using multiple services (and assuming they use different data sources internally) after aggregation, we can have a more reliable answer than if we were using just one.

Another advantage here is that we're using free services, no setup is required, we don't have to take care of updates, since these services are maintained by their owners.

However, querying all these web services will be slower than querying a local geoip data structure. But, city/country/region geolocation databases are available, some examples include geoip2 from maxmind, ip2location or db-ip.

2

There's a nice post here using the earthdistance module to compute distances to nearby or far away pubs.

3

Geonames has geonameIds as well, which are geonames-specific ids we can use to accurately refer to locations.

4

geonames does not have polygonal data about cities/neighbourhoods or metadata about the type of urban area (like openstreetmap does) so you can't query all city polygons (not districts/neighbourhoods) that contain that point.