Posted February 6, 2018
By Jason Young
This research is the closest that TASCHA has come to a big data project – prior to now, we’ve mostly been able to share data via email or Google Drive. That is much more challenging with a dataset that contains over 500 million rows, and is 300 GB when uncompressed. And, this is only the static dataset upon which we are focusing our efforts – Worldreader’s live datasets accumulate another million rows of data every day! This might not be the same magnitude of data that employees at Google or Amazon work with every day, but it is certainly large for us!
The size of the dataset has us quite excited, because it helps us expand our skills so that we can better take advantage of the new research opportunities afforded by big data. The excitement around big data can, in fact, be a bit infectious – it feels as though everyone is doing big data research now, and that it opens up limitless possibilities. In the business world, some have gone so far as to compare big data to gold (Alharthi et al. 2017). Beyond these business applications, big data also seem to offer potential benefits for the provision of government services, the operation of NGOs, and the pursuit of academic research and teaching (West and Portenoy 2016).
At the same time, moving into the big data research domain also requires us to overcome a lot of challenges. Not only have we had to change a lot of our research practices and infrastructure, but we’ve also come to recognize that big data sets offer a lot of constraints in terms of the types of research we can do with them. In the future, we’ll write a lot of different blog posts detailing both the challenges and the constraints of big data research. In the meantime, this post offers an initial overview of the challenges of big data.
The past decade has seen an incredible increase in the amount of data produced and consumed in the world. This increase in data availability is now broadly described in relation to ‘big data’ sets. The origin of the term ‘big data’ remains a bit murky, but the term likely emerged in the mid-1990s within the tech sector (Gandomi and Haider 2015). However, it did not gain much traction until much more recently, when it was popularized by companies that focus on data analytics. This provides us with the hint that big data are just as much about transformations in analytical and computational capabilities as they are about the size of datasets themselves (boyd and Crawford 2012).
danah boyd and Kate Crawford (2012) define big data as “a cultural, technological, and scholarly phenomenon that rests on the interplay of” technology, analysis, and mythology (663). While technology and analysis are quite common in many definitions of big data, the mythology component might require a little extra explanation. They define mythology as “the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy” (boyd and Crawford 2012). In other words, big data are not only new forms of technology and analysis, but also an epistemological belief that these new forms of analysis are uniquely valuable. We think that it’s important to keep this component of their definition in mind, because this mythology can sometimes mask the many limitations of research – one of the themes that will emerge across some of our blog posts. This isn’t to say that big data don’t open up new possibilities – only to say that these possibilities remain limited, and are most useful when combined with other, more traditional forms of research.
More common is the 3V definition of big data. Under this definition, big data are data sets that are characterized by high volume, high velocity, and high degrees of variety (Alharthi et al. 2017; Laney 2001). This definition recognizes that the tools and practices that are used to analyze these datasets are also a key component of big data. Others have expanded this definition to include up to seven V’s (Sivarajah et al. 2017):
Notably, approximately 90% of these data are unstructured – which makes them particularly difficult to analyze (Sivarajah et al. 2017).
This definition naturally gives rise to a question of degree. What do we mean by ‘big’, or ‘high volume, velocity, and variety’? Gandomi and Haider (2015) argue that these terms can be pretty subjective:
Universal benchmarks do not exist for volume, variety, and velocity that define big data. The defining limits depend upon the size, sector, and location of the firm and these limits evolve over time. Also important is the fact that these dimensions are not independent of each other. As one dimension changes, the likelihood increases that another dimension will also change as a result. However, a ‘three-V tipping point’ exists for every firm beyond which traditional data management and analysis technologies become inadequate for deriving timely intelligence. The Three-V tipping point is the threshold beyond which firms start dealing with big data. (Gandomi and Haider 2015: 139)
We have hit our ‘three-V tipping point’ in our encounter with the Worldreader dataset, and it’s taught us a lot about the nature of big data research. In the next section we give a brief overview of some of the challenges that we’ve encountered (and continue to struggle with) as we journey past our tipping point.
Based on a review of academic and business literature, there are many different frameworks one can use to understand the types of challenges presented by big data. In this section we focus on five different areas that are most relevant to the challenges we expect to face, or have already grappled with, as part of this project. Those areas include technology infrastructure, training and skills, understanding the data, process and methods, and ethics.
Technology infrastructure. The lack of an IT infrastructure capable of handling large data sets and analysis is probably the challenge that comes most quickly to mind when thinking about big data (Alharthi et al. 2017; Sivarajah et al. 2017). At the most basic level, big data requires that research teams have the hardware and data warehouse architecture to efficiently store and access data for analysis. Alharthi et al. (2017) argue that infrastructure readiness for big data “requires significant investments in software and hardware to support the analysis of hundreds of millions of records in real time,” (288) and this has certainly been the case for us. In the past TASCHA has not required a data infrastructure capable of handling the large volume represented by the Worldreader dataset – we’ve primarily used more traditional forms of social science data, which can easily be stored on an individual computer and even shared via email. At the beginning of this project, we attempted to use a similar approach – Lucas cached data locally for analysis, since he had a workstation powerful enough to perform analysis effectively. He would also query the dataset to create smaller subsets of the data for others on the research team, which he would share via our team Google Drive. However, this became increasingly impractical – it is much more efficient if everyone on the team is able to directly query the entire dataset themselves. We have therefore been exploring other infrastructure options for the project, including setting up our own local server or relying on cloud services like Amazon Web Services (AWS). Future posts will describe these different options in more detail, and explain the final decisions that we made for the project’s technology infrastructure. Importantly, many of the technology decisions that we’ve had to make for this project will help build TASCHA’s long-term capacity to do other big data projects.
Training and skills. Big data research projects are particularly tricky from the perspective of sourcing skilled personnel, because they require many different types of skill sets. The ‘big data’ component of the project requires that researchers have backgrounds in computer science, but also that they have strong statistical skills. However, at the end of the day, we are still using these big data to carry out social science – which means that researchers must also have the traditional social science skills that allow them to ask appropriate questions of the data, and to interpret analysis in a manner that sheds light on broader social processes. As West and Portenoy (2016) point out, it is nearly impossible to find all of these skills in a single person:
The number of skills required of a data scientist is unrealistic. It can’t be expected that all scientists working on computationally intensive research become experts in every aspect of data analytics. These research projects often require teams of people in order to cover all the necessary bases, such as statistics, software development, and domain expertise. (12)
Even if one is able to assemble a research team with all the requisite skills, researchers may face a difficult process of aligning their perspectives across deep disciplinary gaps (Metcalf and Crawford 2016; Zook et al. 2017). For example, computer science and the social sciences have very different histories in relation to human subjects and ethics. This can give rise to substantial disagreements over even basic questions of research ethics. We have been lucky in that we came to this project with a highly interdisciplinary team that included skilled data analysts and social scientists. When combined with the domain expertise of Worldreader staff, as well as the computational support provided by other iSchool units like the DataLab, it made this less of a challenge for us. Of course, this isn’t to say that our team doesn’t ever face disagreements or need to acquire new skills along the way – we’ll be documenting all of this in upcoming posts. It does mean, though, that we have a history of drawing on our varied skills and perspectives to productively engage with the new research challenges that this big data project is throwing at us.
Understanding the data. Traditionally, social scientists have gone out and collected their own data. Ahead of this, they first go through a careful process of reviewing literature, identifying research gaps, formulating research questions or hypotheses, conceptualizing and operationalizing data variables, and designing their research methods to ensure that the collected data (and resulting analysis) can provide answers to their research questions. Big data research tends not to follow this careful process – the data often exist prior to and outside of the research process itself, which means that the researcher has little control over what the data actually are. This has many implications for the types of methods that we can use within a big data project, which we’ll discuss in more detail below. More fundamentally, though, it also means that we need to spend some time figuring out what our data variables actually mean in the context of our research questions. For example, as described in our Data Variables [link] post, one of the variables we have access to is Client_ID. This is a unique ID number that tracks Worldreader usage on a device over time. Ideally, this would allow us to track an individual user’s behavior on the application over time. However, there are a lot of reasons why Client_ID might not actually make a good proxy for an individual user. Multiple people may use the same device, or a device might be sold to a new individual who is also a Worldreader user. On the other hand, a single user might have multiple devices that they regularly use to access Worldreader. All of these possibilities make it difficult to assume that there is a one-to-one relationship between users and Client_IDs. The data cannot tell us this by itself – we need subject matter expertise that allows us to interpret the data appropriately.
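To make the Client_ID problem concrete, here is a minimal sketch of how one might check for a one-to-one relationship, assuming that some rows could be paired with a separate user identifier (for example, from registered accounts). The `user_id` field and the sample values are hypothetical and are not taken from the Worldreader dataset; a check like this would also only cover whatever subset of rows carries such an identifier.

```python
from collections import defaultdict

# Hypothetical (client_id, user_id) pairs; the values are made up for illustration.
pairs = [
    ("c1", "u1"),
    ("c1", "u2"),  # shared device: two users behind one Client_ID
    ("c2", "u3"),
    ("c3", "u3"),  # one user who reads on two devices
    ("c4", "u4"),  # clean one-to-one case
]

users_per_client = defaultdict(set)
clients_per_user = defaultdict(set)
for client_id, user_id in pairs:
    users_per_client[client_id].add(user_id)
    clients_per_user[user_id].add(client_id)

# Client_IDs that map to more than one user, and users with more than one Client_ID.
shared_devices = sorted(c for c, users in users_per_client.items() if len(users) > 1)
multi_device_users = sorted(u for u, clients in clients_per_user.items() if len(clients) > 1)

print(shared_devices)      # ['c1']
print(multi_device_users)  # ['u3']
```

If either list is non-empty, treating Client_ID as a stand-in for an individual reader would overcount or undercount people in that part of the data.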
Future blog posts will continue to describe how we are interpreting the various data variables, as well as the constraints that exist in using these variables for social science research. For instance, we’re currently exploring the IP address variable within the dataset, in relation to whether we can use it to determine the geographic location of users – expect a blog piece very soon on this!
Process and methods. Big data also present researchers with many different challenges related to analysis. Broadly, these challenges can be broken down into three different areas: technical challenges, methodological challenges, and interpretive challenges. First, from a technical perspective, it can be time and memory intensive to perform even basic queries on very large datasets. This has forced Lucas and Bree to be very careful as they script and test the Python code that they use to actually analyze the data. Look for a new post soon, where Lucas will describe some of his initial interactions with the dataset.
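To give a flavor of why care is needed, here is a minimal sketch (not the code that Lucas and Bree actually use) of one common way to keep memory use down: streaming a file row by row and keeping only running tallies, rather than loading everything at once. The column names and the tiny in-memory sample are assumptions made purely for illustration.

```python
import csv
import io
from collections import Counter


def count_rows_per_key(lines, key_column):
    """Stream rows one at a time and tally how many rows each key has."""
    counts = Counter()
    reader = csv.DictReader(lines)
    for row in reader:  # only the current row is held in memory
        counts[row[key_column]] += 1
    return counts


# Tiny in-memory stand-in for a very large CSV export (hypothetical columns).
sample = io.StringIO(
    "client_id,book_id\n"
    "a1,b10\n"
    "a1,b11\n"
    "a2,b10\n"
)

events_per_client = count_rows_per_key(sample, "client_id")
print(events_per_client.most_common())  # [('a1', 2), ('a2', 1)]
```

With a real file, the same function could be given an open file handle instead of the `StringIO` object. Note that the tally itself still grows with the number of distinct keys, so this approach reduces memory pressure without removing it.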
Second, stepping back from the actual code used to interact with the data, there are broader methodological questions related to big data analysis (Brooker et al. 2016; Metcalf et al. 2016; Zook et al. 2017). As mentioned above, big data sets often pre-date the research project designed to analyze them. This means that the researchers are not in control of important aspects of data collection, such as determining the sampling method used to produce the data. More often than not, this means that big data researchers are dealing with convenience samples – which makes it really difficult to generalize the results of statistical analysis to broader populations. This, in turn, makes it difficult to use big data to answer many of the types of questions traditionally asked within the social sciences. For example, we might want to use the results of our analysis to make broader claims about reading patterns or literacy in particular countries, or to make generalizations about the adoption of e-reading applications across the world. However, because our data comes from a convenience sample (e.g., data results from people self-selecting to spend time on the Worldreader application), there is no guarantee that our data is representative of broader reading, e-reading, or literacy patterns. This significantly limits the types of conclusions and generalizations that we can make. Even within the dataset itself, there are constraints related to sampling. Only a small selection of our users register with the Worldreader application, and then only a percentage of those registered users provide additional age or gender information. It would be useful to be able to analyze data from those users that provided demographic information – for instance, to understand gender gaps or age gaps in e-reading – and then generalize the findings to the broader dataset (i.e., to the unregistered users). 
Because we are dealing with a convenience sample, though, we cannot know if some outside factor is shaping our sample in ways that distort our findings. It is possible, for example, that women are less likely to give away personal information about themselves than men, and therefore are less likely to register with the application. This might make a gender gap appear larger within the registered user data than it actually is in the broader dataset. This, of course, is hypothetical in relation to the Worldreader user base – the research into the relationship between gender and digital information-giving behavior remains mixed, and is likely dependent on other variables such as age (Vo 2016). Yet, it illustrates the point that we just cannot know how various demographic variables might shape our dataset. Future posts will explore some of these limitations in more detail, and also examine recent research that attempts to overcome these constraints as they relate to big data projects.
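A small worked example shows how this kind of distortion could arise. Every number below is invented for illustration and says nothing about actual Worldreader users: we assume a user base that is evenly split by gender, and registration rates that differ between women and men.

```python
def share_of_women(n_women, n_men):
    """Fraction of a group that is women."""
    return n_women / (n_women + n_men)


# Hypothetical full user base: evenly split by gender.
all_women, all_men = 1000, 1000

# Assumed registration rates, in percent (women register half as often as men).
reg_pct_women, reg_pct_men = 10, 20

registered_women = all_women * reg_pct_women // 100  # 100
registered_men = all_men * reg_pct_men // 100        # 200

true_share = share_of_women(all_women, all_men)
observed_share = share_of_women(registered_women, registered_men)

print(f"women in full user base: {true_share:.0%}")           # 50%
print(f"women among registered users: {observed_share:.0%}")  # 33%
```

Under these assumed rates, the registered data would suggest a two-to-one gender gap even though none exists in the underlying user base, and nothing in the registered data alone would reveal the difference.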
Third, and related to the section above on ‘understanding the data’, big data researchers face new interpretive challenges. Metcalf et al. (2016) point out how “critics have highlighted how this fast-paced rollout [of big data analytical methods] has bulldozed thoughtful consideration of bias, statistical meaning, and grounded interpretation.” (6) Brooker et al. (2016) similarly worry that while “researchers no longer lack computational tools or theories to help make sense of social media data, […] there remains a paucity of methodologies to make transparent the move from tools to explanations.” (1) At the root of these comments is a concern that researchers are performing computational research simply because they can (because data is available), and not because they have carefully scoped out research questions based on thorough reviews of literature. This provides us with an important lesson – that, despite the allure of simply attacking big data with exploratory and inductive methods, our project needs to be driven by theory, domain expertise, and research needs. In this way we can guarantee that our results offer important findings, rather than simply offering graphs that belong on Spurious Correlations.
Ethics. Finally, there are a lot of ethical questions related to big data research, particularly revolving around privacy and human subjects protections (e.g., boyd and Crawford 2012; Metcalf and Crawford 2016; Metcalf et al. 2016; Zook et al. 2017). Many of these challenges have arisen because computer scientists have historically not had to deal with questions of human subjects and ethics – their work has been more closely aligned with the physical sciences than the social sciences or humanities. However, advanced computational methods increasingly allow us to tie data back to the humans that created them, producing profound questions related to privacy and research. Already we have written a little about this, in our Open Data Ethics (link) piece. But, you can also expect us to publish more on data ethics as this project continues to unfold.
Some of the literature on big data includes one other challenge that we didn’t discuss above – that of building a culture of big data research within your organization. Organizational culture – or, the values, assumptions, and norms that define an organization – can create barriers to big data research, if that culture does not support researchers as they grapple with the other challenges listed above (Alharthi et al. 2017). We, however, would like to see organizational culture more as an opportunity for this project than as a challenge. By engaging in this big data work, we hope that we can build a better culture of big data research here at TASCHA, so that we can effectively take on new big data projects in the future. The future blog pieces hinted at above will document our continued work toward creating that organizational culture.
Alharthi, Abdulkhaliq, Vlad Krotov, and Michael Bowman. 2017. Addressing barriers to big data. Business Horizons. 60: 285-92.
boyd, danah and Kate Crawford. 2012. Critical Questions For Big Data. Information, Communication & Society. 15(5): 662-79.
Brooker, Phillip, Julie Barnett, and Timothy Cribbin. 2016. Doing social media analytics. Big Data & Society. 1-12.
Gandomi, Amir and Murtaza Haider. 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management. 35: 137-44.
Laney, Doug. 2001. 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Delta. File 949. https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Metcalf, Jacob and Kate Crawford. 2016. Where are human subjects in Big Data research? The emerging ethics divide. Big Data & Society. 1-14.
Metcalf, Jacob, Emily F. Keller, and danah boyd. 2016. Perspectives on Big Data, Ethics, and Society. Council for Big Data, Ethics, and Society. Acc. 18 Dec. 2017 http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/
Sivarajah, Uthayasankar, Muhammad Mustafa Kamal, Zahir Irani, and Vishanth Weerakkody. 2017. Critical analysis of Big Data challenges and analytical methods. Journal of Business Research. 70: 263-86.
Vo, Evelyn. 2016. All That Data: User trust, user privacy, commercial stalking and making money. Strictly Literary.
West, Jevin and Jason Portenoy. 2016. The Data Gold Rush in Higher Education. In C. Sugimoto, H. Ekbia, and M. Mattioli (eds.) Big Data is Not a Monolith. MIT Press.
Zook et al. 2017. Editorial: Ten simple rules for responsible big data research. PLOS Computational Biology. 13(3): e1005399.