How Google can combat the flu

Huge, real-time data sets can provide leading indicators of disease spikes well before they become evident to traditional sampling methods. Could Google search data be a new tool for tracking epidemics?

Published: Friday, November 14th 2008

2 mins (355 words)

I am often to be found banging the drum of “there are no models, there is only data” - thankfully Wired have done a good PR job on this recently, which makes the discussion easier. It also conveniently stops me having to get a little bit “Matrix child” on people and start talking about why “there is no spoon”.

Google, being possibly the most data-rich organisation on Earth (until the LHC starts truly pumping out its 30-odd petabytes of data a day - yes, that’s 30 million gigabytes), has put to rest the idea that you need to craft a model and then test it to within an inch of its life. Rather, you gather vast amounts of data and let it speak for itself by clustering it together; whatever knowledge comes out of that is closer to fact than an inference from the approximation a model gives you.

To this end, Google have put their search data to work to predict rising spikes of flu epidemics in the US a good couple of weeks before the CDC makes the same prediction. It’s been validated by the CDC as highly accurate, and mapped against data going back to 2003 it shows up as a startling leading indicator of the CDC figures throughout.

The CDC customarily collates data from doctors and health professionals, as well as sending people out to localities and conducting surveys, to get a view on current disease trends. This takes extensive resources and time to compile. Google’s view can be virtually instantaneous, based on the current trend of search terms, so it will lead the CDC’s figures because it’s picking up people looking to deal with a disease NOW.
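To make the “leading indicator” idea a bit more concrete, here’s a minimal sketch of how you might measure how far a weekly search-volume series runs ahead of the CDC’s flu figures, using a simple lagged correlation. Everything here is assumed for illustration - the made-up series, the weekly sampling, the eight-week lag window - it’s not Google’s actual method, just the shape of the check.

```python
import numpy as np

def best_lead_weeks(search, cdc, max_lag=8):
    """Find the lag (in weeks) at which the search series best
    predicts the CDC series, using a simple lagged correlation.

    search, cdc: 1-D arrays of weekly values over the same dates.
    A positive lag means searches lead the CDC figures by that many weeks.
    """
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        # Compare search[t] with cdc[t + lag]
        a = search[:-lag] if lag else search
        b = cdc[lag:]
        corr = np.corrcoef(a, b)[0, 1]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr

# Made-up example: the "CDC" curve is just the search curve delayed
# by two weeks plus noise, so the answer should come out near 2.
rng = np.random.default_rng(0)
search_volume = np.sin(np.linspace(0, 6, 104)) + rng.normal(0, 0.1, 104)
cdc_ili_rate = np.roll(search_volume, 2) + rng.normal(0, 0.1, 104)
print(best_lead_weeks(search_volume, cdc_ili_rate))
```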

What would be fantastic is a second layer on top of this, looking at, say, Twitter data for people mentioning the word “flu”, and seeing how well that correlates back to the search and CDC data.
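A rough sketch of what that second layer might look like: count the tweets that mention “flu” week by week, so the counts can be lined up against the search and CDC series with the same kind of lagged correlation. The (date, text) tweet shape and the sample tweets are invented for the example - a real feed would have to come out of Twitter’s search API.

```python
from collections import Counter
from datetime import datetime

def weekly_flu_mentions(tweets):
    """Count tweets mentioning 'flu', bucketed by (ISO year, ISO week).

    tweets: an iterable of (date_string, text) pairs - an assumed
    shape; a real feed would come from Twitter's search API.
    """
    counts = Counter()
    for day, text in tweets:
        if "flu" in text.lower():
            year, week, _ = datetime.strptime(day, "%Y-%m-%d").isocalendar()
            counts[(year, week)] += 1
    return counts

# A handful of invented tweets, just to show the bucketing.
sample = [
    ("2008-11-10", "stuck in bed with the flu, send soup"),
    ("2008-11-11", "flu jabs at the office today"),
    ("2008-11-12", "new phone arrived!"),
    ("2008-11-18", "half the team is off with flu symptoms"),
]
print(weekly_flu_mentions(sample))   # Counter({(2008, 46): 2, (2008, 47): 1})
```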

This really begins to illustrate how powerful large, multivariate data sources are, and why it’s crucial to capture any bit of data you can, because somewhere, at some time, someone will be able to make more sense of it.