2017-01-26



Publishing data on Kaggle is a way organizations can reach a diverse audience of data scientists with an enthusiasm for learning, knowledge, and collaboration. For Dr. Erin Miller of START, the National Consortium for the Study of Terrorism and Responses to Terrorism, making her organization's Global Terrorism Database available for analysis by Kaggle users has brought new awareness to their cause.

In this Open Data Spotlight, Erin discusses how setting aside agendas and focusing on understanding this unparalleled dataset of over 150,000 attack events allows users to undertake constructive analyses that may defy common conceptions about terrorism. Read on to learn more about the Global Terrorism Database project and the ways users of open data can make valuable contributions to the organizations that make them possible.



Getting Started

What is your background and role with the START Consortium?

I’m a criminologist at the University of Maryland, and I’m currently a Program Manager for the Global Terrorism Database (GTD) project at START. My role started out (more than 12 years ago) as a graduate assistant cleaning raw data, and now I manage the project team, workflow, resources, and interaction with end users and related research projects.

Can you tell us a little bit about the START Consortium?

START was created in 2005 by the US Department of Homeland Security, Office of University Programs as a Center of Excellence. The idea of the COE program is to have multidisciplinary university-based researchers focused on issues related to homeland security, and START’s organizing framework is social science. We develop research, training, and educational resources on the human causes and consequences of terrorism.



The Global Terrorism Database published on Kaggle by START.

Can you describe The Global Terrorism Database?

The GTD is an event-level database on terrorist attacks that have occurred worldwide, dating back to 1970. The “story” of GTD collection is a long one, but the short version is that it currently includes data on more than 150,000 attacks, with more than 100 variables describing when and where the attack happened, who the perpetrators and victims were, what tactics were involved, and what the outcome of the attack was… to the extent that this information is known. It’s based entirely on unclassified information–almost always media reports. Data collection is ongoing and we currently update the database annually.

With the expansion of online media, we’ve developed what we call a “hybrid” data collection strategy. We leverage automated processes (natural language processing, machine learning models) to sift through millions of news articles each month. We leverage human processes (reading thousands of articles about terrorist attacks) to create entries in the database and maximize its accuracy.

Deep in the Data

How do you hope that opening up this dataset to analysis can benefit your work and the world?

Making the GTD available to users has always been a major priority, for both principled and practical reasons. We initially spent a few years digitizing and cleaning tens of thousands of hand-written data records, but ever since the data were presentable we’ve had the GTD posted on START’s website. We've seen a growing interest in objective data on this important topic, and there’s certainly far more potential for the analytical community at large to generate important findings than if we had kept it to ourselves for the last 10 years.

In addition, for any data collection project transparency is critical. It’s important that people understand and can actually see how the data are collected and what individual records look like, in order to promote smart usage and credibility. Finally, making the data available is great for the quality of the dataset itself. The best way to improve the accuracy of the data is to have eyes on it, flagging potential problems for review.

What motivated you to share your dataset on Kaggle?

Two things. First, Kaggle’s platform has much cooler functionality than we do. Allowing users to do custom analysis and then share it with other users is really powerful and promotes collaboration and knowledge growth.

Second, although we’ve shared our data for about a decade on START’s website, our user community seems to overlap with Kaggle’s user community only a little bit. It’s just a different circle of people, with different skill sets and interests. Since Kaggle users might not otherwise stumble across the GTD website, it seemed like a good opportunity to make the GTD available to a broader audience.

The Community

Do you have a favorite analysis done by the community so far?

There’s been so much activity that it’s tough to keep track of. It’s also great that Kaggle is accessible for all different skill levels, even novices who are just looking for some datasets to practice on.

I especially like Umesh's kernel, “Explore Global Terrorism Using Highcharter”, not only because it uses a variety of visualizations, but also because Umesh included some contextual bullet points along with many of the graphics. This illustrates how much depth there is to the data, and how challenging it can be to sum it up in an image.

Kaggler Umesh's kernel, Exploring Global Terrorism Using Highcharter.

“Terrorist attacks around the world” by Pranav Pandya is pretty neat too. And, although I’m pretty familiar with patterns of terrorism, I think new users find analyses of the data for the United States to be really interesting (like Abigail Larion’s kernel), because the results defy conventional wisdom.

What have you been most surprised to learn?

I love how encouraging Kaggle users are to their fellow users. I get to engage with a lot of great analysts one-on-one, but my experience with social platforms (OK, mainly Twitter) is that when the GTD comes up it’s often because people are fighting with each other about terrorism and someone throws down a GTD link in an attempt to “prove” their point.

When you have a group of people who start with an interest in the data, rather than with an agenda, it’s much more constructive. I’ve come to enjoy seeing email alerts from Kaggle when someone tries to help out by answering another user’s question, or just says “great work, thanks!”

How would you like to see the data used in counterterrorism efforts?

There are many ways the GTD can contribute to counterterrorism. These range from providing fairly basic information about what types of threats and tactics are prevalent across various jurisdictions, and how they vary over time… to more sophisticated analyses that attempt to provide insight on what types of counterterrorism strategies are most likely to be effective in a given context. I’m happy for the GTD to contribute rigorous data that decision-makers find useful.

Thoughts on Open Data

In what ways do you see access to open data changing the world?

I think the benefits of access to open data are both vast and pretty obvious, especially to Kaggle readers. So I’ll focus on one of the potentially problematic implications of open data in a changing world, which is the risk that when datasets are aggregated and republished without restriction, users may lose sight of where the data come from and may even take them for granted. This is not unlike the tension between news aggregators and the production of raw journalism.

Over the years, we’ve been fortunate to receive funding from the US Department of Justice, the US Department of Homeland Security, and the US Department of State to collect the GTD, but it’s a pretty labor-intensive operation involving researchers and students at the University of Maryland. START is a non-profit research consortium and ongoing funding for data collection is not a foregone conclusion, despite how widely used the GTD is among data scientists, policymakers, the media, researchers, and educators. So I encourage users of open data: If you find a dataset useful, take the time to learn about where it comes from. If you find it really useful, consider sending a note to the organization that collects the data to provide a testimonial that might help support proposals for ongoing funding.

What is your advice to anyone who may be interested in learning how they can analyze START’s data?

My biggest piece of advice is to take a look at the GTD Codebook. Data on terrorism is far from straightforward, and the Codebook helps explain nuances to both new users and seasoned users.

Those who are especially interested in learning more about how the data are collected can check out the GTD Training Modules. The trainings are designed to demonstrate the strengths of the database as well as some potential pitfalls. We introduce the use of PivotTables in MS Excel for the interactive demonstrations, but the principles apply no matter what analytical tools you use.

Bio

Dr. Erin Miller has been part of the Global Terrorism Database (GTD) team since 2004, developing efficient and effective data collection processes. As co-principal investigator for the GTD, Erin produces reports on patterns of terrorism that provide context for current events, and she frequently consults with end-users of the GTD, including researchers, policymakers, journalists, and students. She serves as key personnel on research projects related to the GTD, and has created training modules to provide GTD users with insight on data collection methodologies and tools for data analysis.
