Big Data: Opportunities for Computational and Social Sciences


Scott Golder recently wrote a blog post at Cloudera entitled “Scaling Social Science with Hadoop” where he accounts for “how social scientists are using large scale computation.” He begins with a delightful quote from George Homans: “The methods of social science are dear in time and money and getting dearer every day.” He then turns to talk about the trajectory of social science:

When Homans — one of my favorite 20th century social scientists — wrote the above, one of the reasons the data needed to do social science was expensive was because collecting it didn’t scale very well. If conducting an interview or lab experiment takes an hour, two interviews or experiments takes two hours. The amount of data you can collect this way grows linearly with the number of graduate students you can send into the field (or with the number of hours you can make them work!). But as our collective body of knowledge has accumulated, and the “low-hanging fruit” questions have been answered, the complexity of our questions is growing faster than our practical capacity to answer them. Things are about to change.

This is his jumping-off point for thinking about how “computational social science” provides new opportunities because of the “large archives of naturalistically-created behavioral data.” And then he makes a very compelling claim for why looking at behavioral data is critical:

Though social scientists care what people think it’s also important to observe what people do, especially if what they think they do turns out to be different from what they actually do.

By and large, I agree with him. Big Data presents new opportunities for understanding social practice. Of course the next statement must begin with a “but.” And that “but” is simple: Just because you see traces of data doesn’t mean you always know the intention or cultural logic behind them. And just because you have a big N doesn’t mean that it’s representative or generalizable. Scott knows this, but too many people obsessed with Big Data don’t.

Increasingly, computational scientists are having a field day with Big Data. This is exemplified by the “web science” community and highly visible in conferences like CHI and WWW and ICWSM and many other communities in which I am a peripheral member. In these communities, I’ve noticed something that I find increasingly worrisome… Many computational scientists believe that because they have large N data, they know more about people’s practices than any social scientist. Time and time again, I see computational scientists mistake behavioral traces for cultural logic. And this both saddens me and worries me, especially when we think about the politics of scholarship and funding. I’m getting ahead of myself.

Let me start with a concrete example. Just as social network sites were beginning to gain visibility, I reviewed a computational science piece (that was never published) where the authors had crawled Friendster, calculated numbers of friends, and used this to explain how social network sites were increasing friendship size. My anger in reading this article resulted in a rant that turned into a First Monday article. As is now common knowledge, there’s a big difference between why people connect on social network sites and why they declare relationships when being interviewed by a sociologist. This is the difference between articulated networks and personal networks.

On one hand, we can laugh at this and say, oh folks didn’t know how these sites would play out, isn’t that funny. But this beast hasn’t yet died. These days, the obsession is with behavioral networks. Obviously, the people who spend the most time together are the REAL “strong” ties, right? Wrong. By such a measure, I’m far closer to nearly everyone that I work with than my brother or mother who mean the world to me. Even if we can calculate time spent interacting, there’s a difference in the quality of time spent with different people.

Big Data is going to be extremely important but we can never lose track of the context in which this data is produced and the cultural logic behind its production. We must continue to ask “why” questions that cannot be answered through traces alone, that cannot be elicited purely through experiments. And we cannot automatically assume that some theoretical body of work on one data set can easily transfer to another data set if the underlying conditions are different.

As we start to address Big Data, we must begin by laying the groundwork, understanding the theoretical foundations that make sense and knowing when they don’t apply. Cherry picking from different fields without understanding where those ideas are rooted will lead us astray.

Each methodology has its strengths and weaknesses. Each approach to data has its strengths and weaknesses. Each theoretical apparatus has its place in scholarship. And one of the biggest challenges in doing “interdisciplinary” work is being able to account for these differences, to know what approach works best for what question, to know what theories speak to what data and can be used in which ways.

Unfortunately, our disciplinary nature makes a mess out of this. Scholars aren’t trained to read in other fields, let alone make sense of the conditions in which that work was produced. Thus, it’s all-too-common to pick and choose from different fields and take everything out of context. This is one of the things that scares me about students trained in interdisciplinary programs.

Now, of course, you might ask: But didn’t you come from an interdisciplinary program? Yes, I did. But there’s a reason that I was in grad school for 8.5 years. The first two were brutal as I received a rude awakening that I knew nothing about social science. And then I did a massive retraining as an ethnographer drawing on sociological and anthropological literatures. At this point, that’s my strength as a scholar. I know how to ask qualitative questions and I know how to employ ethnographic methods and theories to work out cultural practices. I had to specialize to have enough depth.

Of course, there’s one big advantage to an interdisciplinary program: it’s easy to gain an appreciation for diverse methodological and analytical approaches. In my path, I’ve learned to value experimental, computational, and quantitative research, but I’m by no means well trained in any of those approaches. That said, I am confident in my ability to assess which questions can be answered by which approaches. This also means that I can account for the questions I can’t answer.

Now back to Big Data… Big Data creates tremendous opportunities for those who know how to assess the context of the data and ask the right questions of it. But mucking with Big Data alone is not research. And seeing patterns in Big Data is not the same as hypothesis testing. Patterns invite more questions than they answer.

I agree with Scott that there’s the potential for social science to be transformed by Big Data. So many questions that we’ve wanted to ask but haven’t been able to. But I’m also worried that more computationally minded researchers will think that they’re answering social science questions simply by finding patterns in Big Data. It’s the same worry that I have when graph theorists think that they understand people because they can model a narrow kind of information flow given the perfect conditions.

If we’re going to actually attack Big Data, the best solution would be to combine forces between social scientists and computational scientists. In some places, this is happening. But there are also huge issues at play that need to be accounted for and addressed. First, every discipline has its arrogance and far too many scholars think that they know everything. We desperately need a little humility here. Second, we need to think about the differences in publication, collaboration, and validation across fields. Social scientists aren’t going to get tenure on ACM or IEEE publications. Hell, they’re often dismissed for anything that’s not single author. Computational scientists often see no point in the extended review cycles that go into journal publications to help produce solid articles. And don’t get me started on the messy reviewing process involved on both sides.

We need to find a way for people to start working together and continue to get validated in their work. I actually think that the funding agencies are going to play a huge role in this, not just in demanding cross-disciplinary collaboration, but in setting the stage for how research will be published. Given departmental obsessions with funding these days, they have a lot of sway over shaping the future here.

There’s also another path that needs to be used: cross-bred students. Scott Golder, our fearless critic, is a good example of this. He was trained in computational ways before going to Cornell to pursue a PhD in sociology. This is one way of doing it. Another is to start cross-breeding students early on. Computer scientists: teach courses for social scientists on how to think about Big Data from a computational perspective. Social scientists: allow computer scientists into your core courses or teach core courses for them to understand the fundamentals of social science methodology and social theory. And universities: provide incentives for your faculty to teach students outside of their departments and for departments to encourage their students to take classes in other departments.

It’s great that we have Big Data but we need to develop the intellectual apparatus to actually analyze it. Each of us has a piece to the puzzle, but stitching it together is going to take a lot of reworking of old habits. It can be done and it is important. The key is to let go of our grudges and territoriality without letting go of our analytic rigor and depth.

