Mobile Reading Data Exchange

Geolocation and Mobile Data

Posted March 6, 2018
By Jason Young

In the past decade there has been an explosion of digital applications that collect and make use of the geographic location of users (Kitchin, Lauriault, and Wilson 2017). These apps are often used on mobile devices capable of identifying a user’s location with a high degree of accuracy, enabling the delivery of highly customized services based on that location. As a result, location-based services (LBSs) are now used for everything from navigation and provision of government services (e.g., Mattern 2017) to the tracking of global pandemics (Sparke 2011) and the detection of earthquakes (e.g., Young et al. 2013). Although research based on these datasets remains new, it is already clear that spatial data are capable of unlocking many new forms of powerful analysis.

While Worldreader is not explicitly spatial, it does collect and utilize information about the location of users. As mentioned in a previous post, Worldreader tries to give its readers access to books written by local authors. Country-specific laws, norms, and publishing agreements also shape what content is available to specific readers. As a result, Worldreader needs to know where readers are located in order to provide them with the correct mix of content. In the context of this research project, this means that we have access to some geographic data. Specifically, we can see the IP address of the mobile phone used to interact with the Worldreader app, which can then be tied to the location of that mobile phone. As a geographer I was particularly excited by this spatial data, since it had the potential to unlock a wealth of research questions! Some of the questions we thought about included:

Does interest in content vary by geographic region?
Do search terms used by readers to find books vary by geography?
Do reader behaviors differ by geography?

Geographic information can also help us to link the Worldreader data to other forms of geolocated data. For example, we can look for correlations between the number of Worldreader users in a particular country and the population of that country, or its rates of internet penetration and mobile phone use. Even more powerfully, demographers have begun developing techniques that allow researchers to overcome biases in their data by comparing it with ‘ground truthed’ demographic data. As we’ll discuss in a future post, this can be particularly useful for big data research, since these datasets are likely to be affected by selection bias (e.g., Seely-Gant and Frehill 2015).

Despite all of these advantages to geospatial data, you will notice that we will not perform a lot of large-scale geospatial analysis in this project. Nor will we make use of other geospatial datasets to the fullest extent possible. Instead, our geospatial analysis will be restricted to looking at generalized, country-level patterns. This is because the form that our location data takes – that of an IP address taken from a mobile phone – does not support more localized forms of analysis. This post explores this constraint of the dataset, and should be of interest to anyone planning to use location-based mobile phone data.

Methods for Geolocation

The process by which the location of an object is identified is referred to as geolocation, and there are three primary methods by which cell phones are geolocated: the use of a built-in GPS receiver, active probing, and passive, or database-driven, identification. Of the three, the use of a built-in GPS receiver is easily the most accurate approach. Devices running iOS, for example, tend to have a horizontal error of less than 15 meters (Triukose et al. 2012). Applications can even retrieve the estimated accuracy levels of GPS measurements, which adds valuable context to the information. Unfortunately, use of GPS data is also the most restricted form of location-based data associated with mobile phones. Services run through mobile Web browsers often cannot access GPS information at all, and even within dedicated applications the user has the ability to turn off access to the GPS (Triukose et al. 2012). This project does not have access to any location information based on GPS receivers.

Both of the other approaches rely on the IP address of the mobile phone to identify its location. An IP address, or Internet Protocol address, is a number assigned to every device that is connected to a particular digital network. This number acts as the name of the device, and also identifies the location of the device on the network. This allows devices on the network to send information to one another. There are currently two primary versions of IP addresses – version 4 (IPv4) and version 6 (IPv6). IPv4 addresses are 32-bit numbers that take the form 172.16.254.1. Because these addresses are relatively short, the supply has slowly been depleted as more devices are connected to the Internet. IPv6 has therefore been developed and deployed to offer more potential addresses. IPv6 addresses are longer, 128-bit addresses that take the form 2001:0db8:0001:0000:0000:0ab9:C0A8:0102.
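To make the difference between the two formats concrete, here is a minimal sketch using Python's standard ipaddress module. It parses the two example addresses quoted above and reports their width in bits. The variable names are just for illustration.

```python
import ipaddress

# The two example addresses quoted in the text above.
v4 = ipaddress.ip_address("172.16.254.1")
v6 = ipaddress.ip_address("2001:0db8:0001:0000:0000:0ab9:C0A8:0102")

for addr in (v4, v6):
    # max_prefixlen is the address width in bits: 32 for IPv4, 128 for IPv6.
    print(addr.version, addr.max_prefixlen, addr.compressed)

# Underneath the dotted or colon-separated notation, an address is an integer.
v4_as_int = int(v4)

# Number of possible addresses in each version, which shows why IPv4 ran short.
ipv4_space = 2 ** v4.max_prefixlen
ipv6_space = 2 ** v6.max_prefixlen
```

The IPv4 space holds roughly 4.3 billion addresses, which is fewer than the number of people on the planet, let alone the number of connected devices.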

As its name implies, active IP geolocation methods collect additional information about a device’s position on a network in order to actively calculate that device’s geographic location (Ciavarrini et al. 2017). In most cases these methods collect information about the communication between the device in question (often called the target) and other devices with known geographic locations (called landmarks). Specifically, the delay in communication between these devices is collected. Based on this information, plus information about the network itself, researchers can perform geometric calculations to locate the target with a high degree of accuracy. Unfortunately, active measurements require researchers to have all of this information about communication between target devices and known landmarks. Even when this information is known the calculations can take a long time, making the method impractical for very large datasets. This project does not have access to the necessary information to make use of active geolocation.
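The core idea behind delay-based methods can be sketched in a few lines. This is a simplified illustration, not the procedure used in the studies cited above: it assumes signals travel at roughly two-thirds the speed of light (about 200 km per millisecond, a common rule of thumb for fibre), and the landmark names and round-trip times are hypothetical. Each measured delay gives an upper bound on how far the target can be from that landmark.

```python
# Assumption: signals travel at about two-thirds the speed of light,
# i.e. roughly 200 km per millisecond. Real paths are slower and less direct,
# so this gives only an upper bound on distance.
SPEED_KM_PER_MS = 200.0


def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on landmark-to-target distance from a round-trip time."""
    one_way_ms = rtt_ms / 2.0  # the signal travels out and back
    return one_way_ms * SPEED_KM_PER_MS


# Hypothetical round-trip times (in milliseconds) from three landmarks.
rtts = {"landmark_a": 12.0, "landmark_b": 40.0, "landmark_c": 7.5}

# Each landmark constrains the target to a circle of this radius around it.
bounds = {name: max_distance_km(rtt) for name, rtt in rtts.items()}

# The landmark with the smallest bound gives the tightest constraint.
tightest = min(bounds, key=bounds.get)
```

Intersecting the circles from many landmarks narrows down where the target can be, which is why the approach needs measurements to a large set of devices with known locations.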

This leaves passive geolocation methods. Passive methods rely on the use of databases that link blocks of IP addresses to geographic locations. IP addresses are managed by the Internet Assigned Numbers Authority (IANA), which assigns blocks of addresses to regional Internet registries (RIRs). There are five different RIRs:

African Network Information Center (AFRINIC)
American Registry for Internet Numbers (ARIN)
Asia-Pacific Network Information Centre (APNIC)
Latin America and Caribbean Network Information Centre (LACNIC)
Réseaux IP Européens Network Coordination Centre (RIPE NCC)

Each of these RIRs then has its own internal policy that dictates how it will assign IP addresses to its customers. These policies can be used to estimate the geographic footprint of different blocks of IP addresses. As a result, free public databases have emerged that map IP addresses to their presumed geographic location, usually at the city level. Every time a user interacts with the Worldreader application, their device’s IP address is recorded by the Worldreader servers. The passive method allows us to match that address to a geographic location. While this method is possible for this project, the question remains as to how accurate it is.
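At its simplest, a passive lookup is a search through a table of address blocks. The sketch below is illustrative only: the table is hypothetical, the prefixes are ranges reserved for documentation (RFC 5737) rather than real allocations, and the country codes attached to them are made up. Real databases hold millions of blocks and use faster search structures.

```python
import ipaddress
from typing import Optional

# Hypothetical block-to-country table. The prefixes are reserved documentation
# ranges and the country codes are invented for this example.
BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "KE"),
    (ipaddress.ip_network("198.51.100.0/24"), "GH"),
    (ipaddress.ip_network("203.0.113.0/24"), "NG"),
]


def lookup_country(ip: str) -> Optional[str]:
    """Return the country code of the block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for network, country in BLOCKS:
        if addr in network:
            return country
    return None  # address is not covered by any block in the table


# A logged address is matched to whichever block contains it.
example = lookup_country("192.0.2.57")
```

Note that the lookup says nothing about the device itself. It only reports where the block containing the address is presumed to be.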

Passive Geolocation and Accuracy

Unfortunately, it turns out that the passive method is not very accurate – at best, it probably only offers accurate country-level information about device location, rather than the city-level information that it often promises (Balakrishnan et al. 2009; Ciavarrini et al. 2017; Poese et al. 2011). One study of the accuracy of passive geolocation of IPv4 addresses indicates that approximately 70% of locations contain an error of at least 100 km, 50% have an error of 200 km or more, and 10% have errors of greater than 1000 km (Triukose et al. 2012). While IPv6 addresses solve some of the problems associated with IPv4 addresses, as discussed below, methods for mapping them to specific locations are not yet well developed. As a result, locations obtained for IPv6 addresses may contain even greater levels of error (Kester 2016). Additionally, IPv6 adoption remains relatively low across African countries, meaning that a majority of our data relies on IPv4 addresses (Google 2017; Maigron 2017; Tamon 2015). Therefore, the remainder of this section will focus on describing the reasons why geolocation based on IPv4 addresses contains high levels of error.
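Error figures like these are typically computed by comparing the location a database reports against a known location for the same device, and measuring the distance between the two. A minimal sketch of that calculation follows, using the standard haversine formula for great-circle distance. The function names and the sample error values are hypothetical and are not taken from the cited study.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_with_error_at_least(errors_km, threshold_km):
    """Fraction of geolocation errors that meet or exceed a threshold."""
    return sum(e >= threshold_km for e in errors_km) / len(errors_km)


# One degree of longitude at the equator is a little over 111 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)

# Hypothetical per-device errors (km) between true and database locations.
errors = [50.0, 150.0, 250.0, 1200.0]
share_over_100 = share_with_error_at_least(errors, 100.0)
```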

The process by which IP addresses are assigned to devices is complex, and is often based on policies that are neither consistently followed by the organizations and companies that hold the addresses nor made fully public (Poese et al. 2011; Weber and State 2017). One of the largest problems with geolocating IPv4 addresses stems from their finite nature. As these addresses were depleted, Internet Service Providers (ISPs) sought methods to increase their relative availability. One of the most prevalent solutions was the adoption of Network Address Translation (NAT). NAT acts as a middle box between devices on an ISP’s private network and the World Wide Web (Triukose et al. 2012). In other words, an ISP can assign private IP addresses to devices within its own network, and then only assign a public IP address to the router that acts as an interface between these devices and the Web. This allows the ISP to use a single, public IP address to represent many different devices. Critically, the geographic location of this router may not be the same as the locations of the devices that it represents – it could exist anywhere within the ISP’s network. However, because the devices are represented by the router’s public IP address (also referred to as a mobile gateway IP address), they will appear to be in the location of the router. In many instances there will be only a few gateways across a country that push mobile data requests onto the Web. Moreover, the use of NAT can be unpredictable – researchers have found that, while most major ISPs do use NAT to some extent, ISPs will provide publicly visible IP addresses to some devices yet hide other devices behind NAT boxes (Triukose et al. 2012). In fact, a single device may bounce between having its own public IP address and being routed through NAT. A single device may be represented by multiple IP addresses across even the span of 10 minutes (Balakrishnan et al. 2009). These issues are compounded by the intrinsic mobility of mobile phone usage. Mobile phones may keep the same IP address even while roaming across a country, which can give rise to very large errors in location (Triukose et al. 2012).
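The effect of NAT on what a server can observe is easy to illustrate. In the sketch below, every device, gateway, and address is hypothetical. The device addresses come from the 10.0.0.0/8 private range, and the gateway addresses are documentation ranges standing in for real public addresses. The point is that the server only ever sees the gateway's address, so devices that may be far apart collapse onto a single location.

```python
import ipaddress
from collections import defaultdict

# Hypothetical carrier network: each device has a private address and is
# routed through one gateway.
devices = {
    "phone_a": ("10.0.0.11", "gateway_1"),
    "phone_b": ("10.0.7.42", "gateway_1"),
    "phone_c": ("10.3.1.5", "gateway_2"),
}

# Stand-ins for the public addresses of the gateways (documentation ranges).
gateways = {"gateway_1": "192.0.2.1", "gateway_2": "198.51.100.1"}

# Private addresses are only meaningful inside the carrier's own network.
all_private = all(
    ipaddress.ip_address(private_ip).is_private
    for private_ip, _ in devices.values()
)

# Group devices by the only address an outside server would ever record.
seen_by_server = defaultdict(list)
for device, (_, gateway) in devices.items():
    seen_by_server[gateways[gateway]].append(device)
```

Here phone_a and phone_b are indistinguishable by address from the server's side, and both would be geolocated to wherever gateway_1 is presumed to be.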

Additional error may be introduced during the construction of the databases used to attach IP addresses to specific locations. Many of the organizations that create and manage these databases do not publish their methodology, and it is unclear how often they update the databases to reflect changes in IP address assignments (Poese et al. 2011). Researchers have found that these databases often contain considerable degrees of geographic bias, in that they overrepresent a few popular countries. One study found that the United States accounts for an average of 45% of the entries across different IP geolocation databases (Poese et al. 2011). This means that the geographic resolution of these databases is much lower in underrepresented countries. Unfortunately, these are precisely the countries in which we are interested for this project.

Conclusion

These dynamics have considerable implications for any research that relies on IP addresses to perform geospatial analysis of mobile phone use. While common IP geolocation services may suggest that analysis down to the city level is possible, it is likely that such granular analysis contains large levels of error. Not only does this constrain the types of questions that we can ask about mobile phone data, but it also prevents us from combining our data with localized datasets that could provide additional insights. Despite these constraints, most researchers agree that passive geolocation is fairly good at identifying the country in which a device is operating. This allows us to at least create country profiles of Worldreader user behavior and preferences.

Future developments may also provide new opportunities to perform geographic analysis with Worldreader behavior. Given the popularity of location-based services, companies and researchers are constantly exploring new methods for performing geolocation. However, even the newest methods still have difficulty identifying the location of mobile phones (e.g. Chandrasekaran et al. 2015). The greatest enhancement on this front would be the development of a dedicated Worldreader application capable of interacting with the GPS receiver on users’ phones. This is not a panacea, since it relies on users owning smartphones with GPS capabilities and on users granting Worldreader permission to access this location data. It also opens up new questions of research ethics, given that detailed geographic data makes it much easier to identify specific users. Nevertheless, it would open the door to performing many new forms of analysis with this big data set.

Bibliography

Balakrishnan M, I Mohomed, and V Ramasubramanian. 2009. Where’s that Phone: Geolocating IP Addresses on 3G Networks. IMC ’09. Chicago, Illinois, USA.

Chandrasekaran B et al. 2015. Alidade: IP Geolocation without Active Probing. Technical Report CS-TR-2015.01.

Ciavarrini G, V Luconi, and A Vecchio. 2017. Smartphone-based geolocation of Internet hosts. Computer Networks. 116: 22-32.

Google. 2017. Statistics: Per-Country IPv6 Adoption. Google IPv6. https://www.google.com/intl/en/ipv6/statistics.html#tab=per-country-ipv6-adoption&tab=per-country-ipv6-adoption

Kester, J-J. 2016. Comparing the Accuracy of IPv4 and IPv6 Geolocation Databases. 24th Twente Student Conference on IT. January 22, 2016. Enschede, The Netherlands.

Kitchin R, T Lauriault, and M Wilson. 2017. Understanding Spatial Media. London: Sage.

Maigron P. 2017. Regional Internet Registries Statistics. http://www-public.tem-tsp.eu/~maigron/RIR_Stats/

Mattern S. 2017. Urban Dashboards. In: R Kitchin, T Lauriault, and M Wilson’s (eds) Understanding Spatial Media. London: Sage. Pp. 74-83.

Poese I, S Uhlig, MA Kaafar, B Donnet, and B Gueye. 2011. Editorial: IP Geolocation Databases: Unreliable? Computer Communication Review. 41(2): 53-56.

Seely-Gant, K and LM Frehill. 2015. Exploring Bias and Error in Big Data Research. Journal of the Washington Academy of Science. 101(3): 29-37.

Sparke M. 2011. The Look of Surveillance Returns. In: M Dodge’s (ed) Classics in Cartography: Reflections on Influential Articles from Cartographica. New York: Wiley. Pp. 373-386.

Tamon MA. 2015. Why IPv6 Deployment is Slow in Africa & What to Do About It. CircleID. http://www.circleid.com/posts/20151018_why_ipv6_deployment_is_slow_in_africa_what_to_do_about_it/

Triukose S, S Ardon, A Mahanti, and A Seth. 2012. Geolocating IP Addresses in Cellular Data Networks. International Conference on Passive and Active Network Measurement. PAM 2012. Pp. 158-67.

Weber I and B State. 2017. Digital Demography. WWW ’17 Companion. April 3 – 7, 2017, Perth, Australia. Pp. 935 – 939.

Young J, D Wald, P Earle, and L Shanley. 2013. Transforming Earthquake Detection and Science Through Citizen Seismology. Washington DC: Woodrow Wilson International Center for Scholars.