Abstract We perform a social experiment to investigate, if zombie related twitter posts can used as a reliable indicator for an early warning system. We show how such a system can be set up almost out-of-the-box using R - a free software environment for statistical computing and graphics. Warning: This ...
" />
2016-09-24

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract

We perform a social experiment to investigate, if zombie related twitter posts can used as a reliable indicator for an early warning system. We show how such a system can be set up almost out-of-the-box using R – a free software environment for statistical computing and graphics. Warning: This blog entry contains toxic doses of Danish irony and sarcasm as well as disturbing graphs.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from .

Introduction

Proposing statistical methods is only mediocre fun if nobody applies them. As an act of desperation the prudent statistician has been forced to provide R packages supplemented with a CRAN, github, useR! or word-of-mouth advertising strategy. To underpin efforts, a reproducibility-crisis has been announced in order to scare decent comma-separated scientist from using Excel. Social media marketing strategies of your R package include hashtag #rstats twitter announcements, possibly enhanced by a picture or animation showing your package at its best:

Introducing gganimate: #rstats package for adding animation to any ggplot2 figure https://t.co/UBWKHmIc0e pic.twitter.com/oQhQaYBqOj

— David Robinson (@drob) February 1, 2016

Unfortunately, little experience with the interactive aspect of this statistical software marketing strategy appears to be available. In order to fill this scientific advertising gap, this blog post constitutes an advertisement for the out-of-the-box-functionality of the surveillance package hidden as social experiment. It shows shows what you can do with R when combining a couple of packages, wrangle the data, cleverly visualize the results and then team up with the fantastic R community.

The Setup: Detecting a Zombie Attack

As previously explained in an useR! 2015 lightning talk, Max Brooks’ Zombie Survival Guide is very concerned about the early warning of Zombie outbreaks.



However, despite of extensive research and recommendations, no reliable service appears available for the early detection of such upcoming events. Twitter, on the other hand, has become the media darling to stay informed about news as they unfold. Hence, continuous monitoring of hashtags like #zombie or #zombieattack appears an essential component of your zombie survival strategy.

Tight Clothes, Short Hair and R

Extending the recommendations of the Zombie Survival guide we provide an out-of-the-box (OOTB) monitoring system by using the rtweet R package to obtain all individual tweets containing the hashtags #zombie or #zombieattack.

In particular, the README of the rtweet package provides helpful information on how to create a twitter app to automatically search tweets using the twitter API. One annoyance of the twitter REST API is that only the tweets of the past 7 days are kept in the index. Hence, your time series are going to be short unless you accumulate data over several queries spread over a time period. Instead of using a fancy database setup for this data collection, we provide a simple R solution based on dplyr and saveRDS – see the underlying R code of this post by clicking on the github logo in the license statement of this post. Basically,

all tweets fulfilling the above hashtag search queries are extracted

each tweet is extended with a time stamp of the query-time

the entire result of each query us stored into a separate RDS-files using saveRDS

In a next step, all stored queries are loaded from the RDS files and put together. Subsequently, only the newest time stamped entry about each tweet is kept – this ensures that the re-tweeted counts are up-to-date and no post is counted twice. All these data wrangling operations are easily conducted using dplyr. Of course a full database solution would have been more elegant, but R does the job just as well as long it’s not millions of queries. No matter the data backend, at the end of this pipeline we have a database of tweets.

OOTB Zombie Surveillance

We are now ready to prospectively detect changes using the surveillance R package (Salmon, Schumacher, and Höhle 2016).

We shall initially focus on the #zombie series as it contains more counts. The first step is to convert the data.frame of individual tweets into a time series of daily counts.

It’s easy to visualize the resulting time series using the plotting functionality of the surveillance package.

We see that the counts on the last day are incomplete. This is because the query was performed at 10:30 CEST and not at midnight. We therefore adjust counts on the last day based on simple inverse probability weighting. This just means that we scale up the counts by the inverse of the fraction the query-hour (10:30 CEST) makes up of 24h (see github code for details). This relies on the assumption that queries are evenly distributed over the day.

We are now ready to apply a surveillance algorithm to the pre-processed time series. We shall pick the so called C1 version of the EARS algorithm documented in Hutwagner et al. (2003) or Fricker, Hegler, and Dunfee (2008). For a monitored time point \(s\) (here: a particular day, say 2016-09-23), this simple algorithm takes the previous seven observations before \(s\) in order to compute the mean and standard deviation, i.e. \[

\begin{align*}

\bar{y}_s &= \frac{1}{7} \sum_{t=s-8}^{s-1} y_t, \\

\operatorname{sd}_s &= \frac{1}{7-1} \sum_{t=s-8}^{s-1} (y_t – \bar{y}_s)^2

\end{align*}

\] The algorithm then computes the z-statistic \(\operatorname{C1}_s = (y_s – \bar{y}_s)/\operatorname{sd}_s\) for each time point to monitor. Once the value of this statistic is above 3 an alarm is flagged. This means that we assume that the previous 7 observations are what is to be expected when no unusual activity is going on. One can interpret the statistic as a transformation to (standard) normality: once the current observation is too extreme under this model an alarm is sounded. Such normal-approximations are justified given the large number of daily counts in the zombie series we consider, but does not take secular trends or day of the week effects into account. Note that the calculations can also be reversed in order to determine how large the number of observations need to be in order to generate an alarm.

We now apply the EARS C1 monitoring procedure to the zombie time series starting at the 8th day of the time series. It is important to realize that the result of monitoring a time point in the graphic is obtained by only looking into the past. Hence, the relevant time point to consider today is if an alarm would have occurred 2016-09-25. We also show the other time points to see, if we could have detected potential alarms earlier.

What a relief! No suspicious zombie activity appears to be ongoing. Actually, it would have taken 511 tweets before we would have raised an alarm on 2016-09-25. This is quite a number.

As an additional sensitivity analysis we redo the analyses for the #zombieattack hashtag. Here the use of the normal approximation in the computation of the alerts is more questionable. Still, we can get a time series of counts together with the alarm limits.

Also no indication of zombie activity. The number of additional tweets needed before alarm in this case is: 21. Altogether, it looks safe out there…

Summary

R provides ideal functionality to quickly extract and monitor twitter time series. Combining with statistical process control methods allows you to prospectively monitor the use of hashtags. Twitter has released a dedicated package for this purpose, however, in case of low count time series it is better to use count-time series monitoring devices as implemented in the surveillance package. Salmon, Schumacher, and Höhle (2016) contains further details on how to proceed in this case.

The important question although remains: Does this really work in practice? Can you sleep tight, while your R zombie monitor scans twitter? Here is where the social experiment starts: Please help answer this question by retweeting the post below to create a drill alarm situation. More than 511 (!) and 21 tweets, respectively, are needed before an alarm will sound.

(placeholder tweet, this will change in a couple of minutes!!)

Video recording, slides & R code of our (???) MV Time Series webinar now available at https://t.co/XVtLrjbJKZ #biosurveillance #rstats

— Michael Höhle (@m_hoehle) 21. September 2016

I will continuously update the graphs in this post to see how our efforts are reflected in the time series of tweets containing the #zombieattack and #zombie hashtags. Thanks for your help!

References

Fricker, R. D., B. L. Hegler, and D. A. Dunfee. 2008. “Comparing syndromic surveillance detection methods: EARS’ versus a CUSUM-based methodology.” Stat Med 27 (17): 3407–29.

Hutwagner, L., W. Thompson, G. M. Seeman, and T. Treadwell. 2003. “The bioterrorism preparedness and response Early Aberration Reporting System (EARS).” J Urban Health 80 (2 Suppl 1): 89–96.

Salmon, M., D. Schumacher, and M. Höhle. 2016. “Monitoring Count Time Series in R: Aberration Detection in Public Health Surveillance.” Journal of Statistical Software 70 (10). doi:10.18637/jss.v070.i10.

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Show more