Later On

A blog written for those whose interests more or less match mine.

Beware of Cheap Data: Loads of low-quality data support low-quality conclusions


Dewey defeats Truman

Michael Byrne reports at Motherboard:

Beware of easy data. The massive, cheap datasets assured by social media pipelines like Twitter are likely offering dangerous distortions of the real world.

This is the conclusion, anyway, of a pair of computer scientists, Derek Ruths and Juergen Pfeffer, based at McGill University and Carnegie Mellon University respectively, as described in the current issue of Science. With thousands of papers based on social media data now being published each year (compared to handfuls just five years ago), the situation might even be viewed as quite dire. Imagine astronomers, newly armed with telescopes, trying to chart the movements and development of galaxies without understanding the influence of black holes, a hidden gravitational influence or, put another way, a hidden bias.

Bias is the key term as we attempt to extract meaningful observations from the non-stop social media avalanche of conversations, pronouncements, locations, images, categories, and on and on. In the face of these sheer volumes, it’s easy to delude oneself into thinking that those volumes are capable of delivering the random (or otherwise specified) sample needed to conduct good research.

Ruths and Pfeffer liken our present state of social media-based inquiry to the early days of telephone polling. Infamously, the Chicago Tribune trusted its new sampling methods, circa 1948, enough to publish the post-presidential election headline “Dewey Defeats Truman,” only to learn shortly thereafter that Truman had actually won in a landslide and that its polling methods had oversampled Dewey supporters enormously.

“Not everything that can be labeled as ‘Big Data’ is automatically great,” Pfeffer notes in a statement. “People want to say something about what’s happening in the world and social media is a quick way to tap into that. You get the behavior of millions of people, for free.”

As Pfeffer and Ruths explain, social science researchers often underestimate the degree to which different social media platforms are favored by certain segments of the population. Instagram, for example, is slanted toward 20-something African-Americans, Latinos, women, and urban dwellers, while Pinterest is big with women in households with incomes greater than $100,000.

Making the situation worse is that social media feeds are usually . . .

Continue reading.

Written by LeisureGuy

29 November 2014 at 10:22 am
