When Big Data Fails

I forget how long ago, a couple of years maybe? [edit: actually, it was 2009] There was a lot of fanfare and publicity over the Google Flu Trends tool. Essentially, the wizards at Google had identified a set of search terms related to influenza and tracked how often those terms were searched in each geographic location. In other words, they mapped where people were most frequently searching Google about the flu. From this information, they found a correlation between frequent searching about the disease and increased incidence of flu in those same locations. They even published about this in the journal Nature.

Earlier this year, however, the journal reported that Google's predictions grossly overestimated the flu outbreak in the US. What went wrong? Well, first of all, as we have heard time and time again, correlation does not necessarily mean causation. In other words, people could be searching for flu-related terms for many reasons other than an actual flu outbreak. Even I, as a beginner, could tell you that. In fact, according to the article, researchers hypothesize that a spate of media coverage about the H1N1 virus led to an increase in searches. So basically, media hype, rather than illness, was the impetus for the increased searches.

The interesting thing is that Google feels that by tweaking its algorithm, it will be able to filter out so-called "noise" in the data, such as searches caused by media hype or other unrelated factors, and improve the accuracy of its Flu Trends tool. I am very interested to see how this shakes out. Let's see how it does in 2014!
