Using Social Media to Forecast the Flu

It can be easy for individuals to predict their own likelihood of falling ill from the flu: a red-eyed and drip-nosed neighbor on the T or a bed-ridden child and spouse are pretty good indicators that one can expect the dreaded fever and chills. Mapping the spread of disease on a population level is much more challenging. Hospitals may not be alerted to outbreaks for as much as two weeks, the lag time between the Centers for Disease Control and Prevention’s (CDC) collection of flu data and its dissemination to providers. But new research from the Computational Health Informatics Program at Boston Children’s Hospital and Harvard University may have found a novel approach to tracking the flu in real time. The work combines multiple data sets in a technique known as ensemble modeling and has so far yielded “near perfect” results for flu modeling on a national scale.

The researchers created a machine learning algorithm based on four data sources:

  • Flu Near You, a participatory, public self-report system;
  • electronic health records (EHRs) of patients seeking medical attention for the flu, provided by athenahealth in “near real time”;
  • Twitter activity;
  • Google data about the search volume of relevant keywords (the basis for Google’s discontinued Flu Trends tracker).

“We chose what was available, not strategic and found that what was available represented multiple ways to track the flu,” says Mauricio Santillana, the lead author of an article published today in the journal PLoS Computational Biology explaining their methodology. “Each of them contributes or brings in information that’s valuable…The sum outperforms the individuals.”
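As a rough illustration of how several imperfect signals can be blended into one estimate, here is a minimal sketch of one simple ensemble technique: weighting each source by the inverse of its past error. It is not the researchers’ algorithm, which the article does not detail. The function names, the source labels and every number below are made up for illustration, and the sketch assumes no source has a perfect track record (a zero error would cause a division by zero).

```python
from statistics import mean


def mse(estimates, truth):
    """Mean squared error of one source's past estimates against the official figures."""
    return mean((e - t) ** 2 for e, t in zip(estimates, truth))


def inverse_error_weights(history, truth):
    """Give each source a weight proportional to 1 / its historical error.

    Weights are normalized to sum to 1. Assumes every source has nonzero error.
    """
    inverse = {name: 1.0 / mse(est, truth) for name, est in history.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def ensemble_estimate(latest, weights):
    """Blend this week's per-source estimates into a single now-cast."""
    return sum(weights[name] * latest[name] for name in weights)


# Hypothetical data: percent of doctor visits for flu-like illness over five past weeks.
truth = [1.0, 1.5, 2.0, 3.0, 2.5]  # stand-in for official reports, available after a lag
history = {
    "self_reports": [1.4, 1.9, 2.5, 3.3, 3.0],
    "ehr": [0.9, 1.6, 2.1, 2.9, 2.4],
    "twitter": [0.5, 1.0, 1.4, 2.2, 2.1],
    "search": [1.3, 1.2, 2.4, 3.5, 2.2],
}

weights = inverse_error_weights(history, truth)

# This week's estimates from each source, before any official figure exists.
latest = {"self_reports": 2.6, "ehr": 2.1, "twitter": 1.6, "search": 2.3}
nowcast = ensemble_estimate(latest, weights)

for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:>12}: weight {weight:.2f}")
print(f"ensemble now-cast: {nowcast:.2f}")
```

In this toy setup the source that tracked past official figures most closely gets the most say in the blended estimate, while the noisier sources still contribute a little. A production system would more likely learn the combination with a regression or other machine learning model retrained as new official data arrive.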

Santillana and his fellow researchers compared their predictive mathematical models against actual reports of illness from last flu season. They found that their real-time predictions (called now-casts in the report) and their one-, two- and three-week forecasts predicted the timing and magnitude of disease spread more accurately than models based only on the CDC’s historical illness reports, which Santillana calls the “gold standard” of flu surveillance.

“We did not expect it to be that high and were happily surprised.”

For now, their models work only on a national level, but future plans include customizing the algorithms to regions and states, or as Santillana puts it, “a more actionable scale.”

This article was edited on 29 October 2015. The ensemble modeling approach more accurately predicted timing and magnitude than “models based only on the CDC’s historical illness reports,” not better than “the CDC’s historical reports” as was first reported.  

Paula Sokolska


    Paula is a freelance science writer and strategic communications associate at Health Leads. Formerly a managing editor at MedTech Boston, she has a B.S. in Journalism from Boston University and has worked with the New England Center for Investigative Reporting, Boston Globe, Social Documentary Network, BU Today and several nonprofit organizations. She can be reached at paula.sokolska@gmail.com.
