Everybody Lies: Big data, new data and what the internet can really tell us about who we really are
“In the era of Big Data all the world’s a lab.”
There’s a whole world of data out there but where does it come from, how does it benefit organisations and what are the pitfalls?
Big Data can provide insight into many things: unemployment levels, which are reflected in an increase in searches for online games; how consumer spending patterns are influenced; or where flu is spreading.
Stephens-Davidowitz is an economist and former Google data scientist and is convinced that Google searches are the most important dataset ever collected on the human psyche. As a journalist writing on Big Data for the New York Times he has had the opportunity to interview academics, data journalists and entrepreneurs to help form the basis of his book, as well as following the trails people leave as they make their way across the web.
As a graduate student in economics in 2012 he believed what he had read about racism being limited to just a small percentage of Americans – most of them living in the deep South. Then he found Google Trends – a tool which tells users how frequently a word or phrase is searched in different locations at different times. His subsequent dissertation and research told a very different story: there were millions of racist searches, and one in every hundred searches for Obama included ‘kkk’ or another derogatory term. The highest rates were found in upstate New York and other unexpected places, and did not follow the traditional geography he had assumed. While those in the South may be more likely to admit to racism, he was shocked to discover plenty of people in the North have similar attitudes. Stephens-Davidowitz’s research was rejected by mainstream academic publications at the time, but he suggests Trump’s subsequent success makes his findings more plausible. Google searches for ‘how to vote’ or ‘where to vote’ in the weeks before elections can also accurately predict which parts of the country are going to have a big showing at the polls.
What is Big Data?
Stephens-Davidowitz is reluctant to give a precise definition of what Big Data is because he says it is an inherently vague concept – are 18,462 observations Small Data and 18,463 observations Big Data? Much of the new information flows from Google and social media – on an average day in the early part of the twenty-first century, human beings generate 2.5 million trillion bytes of data. But there are other sources too, such as market research, new digitisation of information and projects like Facebook’s Gross National Happiness Index. He warns that the size of a dataset can be over-rated, with too many businesses drowning in data. At Google, major decisions are based on only a tiny sampling of all their data. Stephens-Davidowitz says you need the right data to find important insights, not vast quantities.
A major reason Google searches are so valuable is not that there are so many of them, but that people are so honest in them. Big Data allows us to access this new honest data, to zoom in on small subsets of people, and to conduct controlled experiments rapidly and cheaply.
It is very difficult to know if advertisements work. While there’s no doubt that the products that advertise the most have the highest sales, that correlation does not establish cause. Firms may believe they know how effective their advertising is, but economists are sceptical. Even though it is possible to run a trial comparing areas with and without advertising, companies are reluctant to risk not advertising. However, Stephens-Davidowitz’s Google research into the effectiveness of Super Bowl TV advertising showed that despite the high price, beer and soft drinks advertising delivered a 2.5-to-1 return on investment. This insight was possible because the teams that reach the Super Bowl are not known when the advertising is bought, so sales in the finalists’ home areas can be compared with sales elsewhere without anyone having to set up a trial. These types of natural experiments are powerful and will take on increasing importance in an era of more, better and larger datasets.
Google first started conducting randomised experiments in 2000 as they are low cost and quick to do – measuring mouse movements and clicks with an automated programme to interpret results. These experiments have now been renamed A/B testing and are commonplace. In 2011 Google ran seven thousand A/B tests and Facebook runs thousands every day – more than the entire pharmaceutical industry starts in a year. They can be used highly effectively to test online advertising or political campaigns, with different colours, photos or messages – Obama’s most effective election campaign photo was a picture of him and his family rather than him alone and ‘learn more’ was a much more effective message than ‘sign up’ or ‘join us now’, overall gaining an additional $60 million in funding.
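To make the idea of an A/B test concrete, here is a minimal sketch in Python of how two versions of a page or message might be compared. It is an illustration only, not the method Google, Facebook or the Obama campaign used: the function name, the visitor counts and the choice of a two-proportion z-test are all assumptions made for the example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of variant A and variant B.

    conv_a, conv_b: number of visitors who clicked or signed up (hypothetical counts)
    n_a, n_b: number of visitors shown each variant
    Returns (z, p_value), where p_value is two-sided.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both variants perform the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 'sign up' shown to 1,000 visitors, 'learn more' to another 1,000
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be chance alone, which is why such tests can be run and read automatically at very large scale.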
Netflix learnt a powerful lesson early on – it used to offer people the chance to queue up movies to watch when they had more time. But there was something odd in the data – when they did have the spare time, they rarely watched the films in the queue. While people kept compiling lists with highbrow documentaries or foreign films, in reality they ended up actually choosing lowbrow comedies or romance films. So Netflix stopped asking people what they wanted to see in the future and started building a model based on millions of clicks and views from similar customers, giving a list of suggested films based not on what they claimed to like but what they were likely to view, resulting in much higher viewing figures.
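The general idea of recommending from what similar customers actually watched, rather than from what people say they want, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Netflix’s actual model: the function name, the overlap-based scoring and the viewing data are all hypothetical.

```python
from collections import Counter


def recommend(viewing_history, user, top_n=3):
    """Suggest titles for `user` based on viewers with overlapping histories.

    viewing_history: dict mapping each viewer to the set of titles they watched.
    A title scores higher when it was watched by viewers who share more
    titles with `user` (a deliberately simple similarity measure).
    """
    seen = viewing_history[user]
    scores = Counter()
    for other, titles in viewing_history.items():
        if other == user:
            continue
        overlap = len(seen & titles)
        if overlap == 0:
            continue  # no shared viewing, so no evidence of similar taste
        for title in titles - seen:
            scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]


# Hypothetical viewing data
history = {
    "ann": {"Comedy A", "Romance B"},
    "bob": {"Comedy A", "Romance B", "Comedy C"},
    "cat": {"Comedy A", "Thriller D"},
    "dan": {"Documentary E"},
}
print(recommend(history, "ann"))
```

The point of the sketch is that the input is behaviour (what was watched), never a stated wish list.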
While Stephens-Davidowitz believes that Big Data will herald a revolution, he doesn’t think that Big Data will eliminate the need for small data. Facebook still uses surveys to understand exactly what people think about the news they are presented with, and employs social psychologists, anthropologists and sociologists to find out what the numbers miss. Schools are also realising they can’t rely solely on test results to assess good teaching – student surveys and teacher observations are needed to measure which teachers most improve student learning to avoid detrimental ‘teaching to the test’.
Caution seems to be the best approach for Big Data – be careful what trails you leave on the web (even if data is anonymised), use Big Data with other sources, and consider what you really need to know. It is undoubtedly a hugely important tool – from medicine to business to politics to better understanding ourselves – but only if common sense is applied [with integrity].
About the author
Seth Stephens-Davidowitz is a New York Times contributor, a visiting lecturer at The Wharton School, and a former Google data scientist. He received a BA in philosophy from Stanford and a PhD in economics from Harvard. His research has appeared in the Journal of Public Economics and other prestigious publications. He lives in New York City.
“A whirlwind tour of the human psyche … The empirical findings in “Everybody Lies” are so intriguing that the book would be a page-turner even if it were structured as a mere laundry list. But Mr Stephens-Davidowitz also puts forward a deft argument” Economist
“Everybody Lies is an absorbing, and impassioned examination of new data sources … as an introduction to our fascinating new universe of data, Everybody Lies is hard to beat”
John Thornhill, Financial Times
“This brilliant book is the best demonstration yet of how big data plus cleverness can illuminate and then move the world. Read it and you’ll see life in a new way” Lawrence Summers, President Emeritus and Charles W. Eliot University Professor of Harvard University
Author: Seth Stephens-Davidowitz