Hidden biases in Big Data


Harvard Business Review says don’t believe the hype about how big data will usher in an era of Wonderfulness (or maybe Big Brother, depending on your viewpoint). It probably can’t, because of inherent biases in how data are collected and selected. If a data set is biased, then conclusions based on that data will be too. Programmers call this GIGO: Garbage In, Garbage Out.

For example, the greatest number of tweets about Hurricane Sandy came from Manhattan, which gives the faulty impression that Manhattan was the center of the storm. It wasn’t; New Jersey was. The worst-hit areas produced very few tweets because of prolonged power outages, and because residents were a bit preoccupied trying to save their homes.

We can think of this as a “signal problem”: Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.

Boston pothole reporting is another example. Potholes are reported more frequently in upper-income areas because of the higher concentration of smartphones there. Based on the data alone, you might erroneously conclude that wealthy areas have more potholes.
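The pothole effect is easy to see in a toy simulation. The sketch below uses invented numbers: two neighborhoods with the *same* true number of potholes, but different reporting rates standing in for smartphone ownership. The reported counts diverge sharply, even though the underlying reality is identical.

```python
import random

random.seed(0)

# Hypothetical neighborhoods: identical true pothole counts, but
# different reporting rates (a stand-in for smartphone ownership).
# All numbers here are invented for illustration.
neighborhoods = {
    "high_income": {"true_potholes": 100, "report_rate": 0.8},
    "low_income":  {"true_potholes": 100, "report_rate": 0.2},
}

reports = {}
for name, info in neighborhoods.items():
    # Each real pothole gets reported only if someone with a
    # smartphone happens to notice it and file a report.
    reports[name] = sum(
        random.random() < info["report_rate"]
        for _ in range(info["true_potholes"])
    )

print(reports)
```

Judging road quality from `reports` alone would suggest the high-income neighborhood has roughly four times as many potholes, when in fact the data merely reflect who is equipped to do the reporting.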

In the near term, data scientists should take a page from social scientists, who have a long history of asking where the data they’re working with come from, what methods were used to gather and analyze them, and what cognitive biases they might bring to the interpretation.