2014-05-14

By Kevin C. Desouza & Kendra L. Smith

According to IBM, about 2.5 quintillion bytes of data are
created every day—enough to fill about 57.5 billion 32 GB iPads daily.
Some of these data are gathered by scientific instruments measuring
winds, temperatures, and currents around the world. Other data
are captured by computers tracking bond sales, stock trades, and
bank deposits. And other data are input by police officers, probation
officers, and welfare administrators. All of the data, however,
are simply that—data—until they are analyzed and used to inform
decision-making. What will the weather be like next week? What
are the most lucrative investment opportunities? Which neighborhoods
should be receiving more social services?

The term “big data” is used to describe the growing proliferation
of data and our increasing ability to make productive use of
it. A myriad of big data projects have been undertaken in scientific
domains. For instance, in 2012, pharmaceutical company Merck
found through data analysis that allergens would probably lie dormant
throughout March and April 2013 because of unseasonably
cold weather, followed by a sudden May warm-up that would cause
pollen to be released at a higher-than-average rate, thus driving the
potential need for Merck’s allergy medication Claritin. Merck then
modified its marketing strategy to capitalize on the high demand
for allergy relief. Through partnerships with Walmart, they created
personalized promotions based on zip code data to market Claritin
to heavily hit areas, resulting in increased revenue.

The business community has also been a heavy user of big data.
Each month Netflix collects billions of hours of user data to analyze
the titles, genres, time spent viewing, and video color schemes to
gauge customer preferences in order to continually update their recommendation
algorithms and programming to give the customer the
best possible experience.1 In 2013, Netflix launched its first original
series, House of Cards, largely using a mix of customer behavior data
and analytics to help shape the story. Netflix invested $100 million
into the series without testing a pilot or conducting focus groups,
instead banking on the success of an earlier BBC production by the
same name about UK politics, along with what it had learned about
the preferences of its 44 million customers.2House of Cards has been
a great success, bringing in 2 million new subscribers.

Data-driven intelligence has been used successfully in technical
and business endeavors, but a very different situation prevails
in the social arena. There, a large chasm exists between the potential
of data-driven information and its actual use in helping solve
social problems. Some social problems can be readily solved using
big data, such as using traffic data to help ease the flow of highway
traffic or using weather data to predict the next hurricane. But what
if we want to use data to help us solve our most human and critical
social problems, such as homelessness, human trafficking, and
education? And what if we not only want to solve these problems
but do so in a way that the solutions are sustainable for the future?

Social problems are often what are called “wicked” problems.
Not only are they messier than their technical counterparts, they
are also more dynamic and complex because of the number of stakeholders
involved and the numerous feedback loops among inter-related
components. Numerous government agencies and nonprofits
are involved in tackling these problems, with limited cooperation
and data sharing among them. Most of these organizations have
inadequate information technology resources, compared to their
counterparts in the hard sciences who work on technical problems
or in business who have ready access to financial, product, and customer
information.

Beyond the infrastructural impediments that social sector users
of big data face, data itself can be a problem. Oftentimes, data
are missing and incomplete, or stored in silos or in forms that are
inaccessible to automated processing. Then there are policy and
regulatory challenges that need to be faced, such as building data-sharing
agreements, ensuring privacy and confidentiality of data,
and creating collaboration protocols among various stakeholders
tackling the same type of problem.

Whereas there is no doubt that nonprofits, government, and other
organizations will continue to invest in big data technologies and
programs, questions still remain about how beneficial those investments
will turn out to be. The value proposition of big data is clear
for tackling complex technical and business problems, but the jury
is still out on how well big data can tackle complex social problems.

Why Data Is Big

Data, or individual pieces of information, have been gathered and
used throughout history. What’s changed recently is that advances in
digital technology have significantly increased our ability to collect,
store, and analyze data. Consider the US Census Bureau. In 1880,
the United States conducted a national census of 50 million people
that collected demographic information including age, gender, number
of people in the household, ethnicity, birth date, marital status,
occupation, health status, literacy, and place of origin. All of this information
was logged by hand, microfilmed, and sent to be stored
in state archives, libraries, and universities. It took seven to eight
years to properly tabulate census data after the initial collection.

In 1890, the Census Bureau streamlined its data collection
methods by adopting machine-readable punch cards that could be
tabulated in one year. In the most recent US census, conducted in
2010, the bureau used a range of emerging technologies to survey the
populace, including geographic information systems, social media,
videos, intelligent character-recognition systems, and sophisticated
data-processing software.

Today, big data is used to refer to data sets that extend beyond
single data repositories (databases or data warehouses) and are too
large and complex to be processed by traditional database management
and processing tools. Big data can encompass information
such as transactions, social media, enterprise content, sensors,
and mobile devices.

There are multiple dimensions to big data, which are encapsulated
in the handy set of seven “V”s that follow.

Volume: considers the amount of data generated and collected.

Velocity: refers to the speed at which data are analyzed.

Variety: indicates the diversity of the types of data that are
collected.
Viscosity: measures the resistance to flow of data.

Variability: measures the unpredictable rate of flow and types.

Veracity: measures the biases, noise, abnormality, and reliability
in datasets.

Volatility: indicates how long data are valid and should be stored.

Although all seven Vs are increasing, they are not equal. Consider
volume. The world’s collections of data are doubling every 18
months, presenting the public and private sectors with new opportunities
to transform information into insight. As the volume of data
increases along with the tendency to store multiple instances of the
same data across varied devices, the science of information search
and retrieval will have to advance.

The most challenging V for organizations is variety. Organizations
have built information systems to tackle data elements in specific
categories. The challenge for many organizations is to find economical
ways of integrating heterogeneous datasets while allowing for
newer sources of data (in origin and type) to be integrated within
existing systems. Ensuring that the data collected are of sufficient
veracity is also critical. Today, because of the proliferation of social
networks and social media, much of the data being collected needs
to be thoroughly analyzed before decision-making, as the data can
be easily manipulated.

Failing to Use Big Data

When considering big data in the context of social problems, we
arrive at a humbling conclusion: For the most part there is no big
data! When it comes to social problems, data are still highly unstructured
and largely limited to numbers, rather than other types
of data. Take, for instance, human trafficking, a $32 billion global
industry that ensnares an estimated 30 million people annually.
Although considerable momentum exists to combat the problem,
few initiatives have attempted to use big data.

Increasingly, traffickers make use of mobile phones, social media,
online classifieds, and other Internet platforms. Data from these
technologies could be collected and used to identify, track, and prosecute
traffickers, but a few daunting truths remain: The illicit nature
of human trafficking makes it difficult to collect primary data, primary
data collected from some organizations may be unreliable, and
we lack reliable indicators to measure anti-trafficking program and
policy success. Furthermore, most information collected on human
trafficking is stored in a manner that meets organizational needs, but
not global needs. Because of data privacy and security issues, data held
by various organizations are seldom shared in raw form, limiting the
creation of global, and big, datasets.

Making matters worse, agencies combatting trafficking often
compete with each other for scarce resources, whether grants and
gifts or recognition from the press and the community. Because of
this competition, data sharing between agencies—and even between
agencies and the public—is rare. The Polaris Project, for example,
has been working to combat human trafficking using a comprehensive
approach combining advocacy, client services, technical training
and assistance, global programs, and a national resource hotline.
Between 2003 and 2006, Polaris provided hotlines for human trafficking
survivors to call. In 2007, the US Department of Health and
Human Services selected Polaris as the country’s first national human
trafficking resource hotline. Over the years, Polaris is believed
to have logged more than 75,000 calls; nevertheless, access to the
data is limited and little is known about its reliability and its sources.

Think what might be done if the Polaris information was opened
to the public and integrated with other data sources, such as economic
indicators, transportation routes, education statistics, and
victim services. Only when the data are aggregated with other
data, analyzed, visualized, and made accessible to a multitude of
stakeholders will the collection be truly valuable. Only then will
the small data have a chance to grow into big data and help us effectively
combat human trafficking.

One hopeful sign is that in 2012 Google Giving awarded Polaris
and two other international anti-human trafficking organizations
$3 million to fund the aggregation of the data collected from their
three hotlines and to scale their hotlines into an international hotline.
Together, all three organizations have coalesced under the
Global Human Trafficking Hotline Network. This is a positive sign,
but it is yet to be seen what the fruits of this collaboration will be.

Barriers to Creating and Using Big Data

There are four principal reasons for the relative lack of structured
big data for social problems: Data are buried in administrative systems,
data governance standards are lacking, data are often unreliable,
and data can cause unintended consequences.

Data are buried in administrative systems | Most organizations collect
data to meet operational needs, and those data are often buried in
the organization’s administrative systems. To overcome this problem,
organizations are trying to find ways to build large datasets
that can be more widely used. This obstacle needs to be overcome
before we begin thinking of connecting datasets across organizations.
Take the US health care industry, for example. Inefficient
management of big data costs the industry between $100 billion and
$150 billion a year in administrative costs. The biggest problem in
the health care industry is the sheer volume of health and insurance
plans that providers contract and negotiate with to be paid for their
services. Each health or insurance plan supports its own system of
underwriting, claims administration, provider network contracting,
and broker network management—leaving data stored
in multiple formats in multiple
places. The McKinsey Global
Institute estimated that if the
US health care industry were to
transform its use of big data for
more efficiency and quality, the
sector could create more than
$300 billion in value every year.

Data governance standards are
lacking | A second challenge in
our ability to use big data for
social problems is the lack of
adequate data governance standards
that define how data are
captured, stored, and curated
for accountability. As a result,
large inconsistencies exist and
the data being captured are often
not readily suitable for analysis.
In many cases data need to be transformed before they can be
used, and transformation is costly. Analysts often struggle with integrating
different datasets because they lack good metadata (data
that describe data) and the quality of data is poor. An example of
this hindrance is the US government’s 2009 initiative, data.gov, to
make its vast amounts of data readily available to the public so that
nonprofits, businesses, and other organizations can use the data for
innovative purposes. The initiative has been hampered by the difficulty
of ensuring that the data are in a usable format. Data quality
differs heavily from agency to agency, with some agencies, such as
the Environmental Protection Agency, releasing data regularly and
in machine readable formats, whereas other agencies publish data
in difficult-to-manipulate forms such as PDFs or older file formats.3
The number of government datasets being made publicly available
has exploded, but only a handful of these datasets are ever used. The
ones that are being used are, not surprisingly, cases where there is
good metadata, ease of accessibility, and manipulability.

Data are often unreliable | The abundance of data provides great
opportunities to researchers trying to understand and solve social
problems, but unfortunately much of the data is unreliable. Simply
having a lot of data does not necessarily mean that the data
are representative and reliable. For example, in 2011, the Obama
Administration proposed the Keystone XL pipeline project to carry
tar sands oil from Alberta, Canada, to Texas. This proposal raised
concerns among landowners, farmers, ranchers, and environmentalists
who were living in the vicinity of the proposed pipeline. Despite
the concerns, the American Petroleum Institute and its oil lobby allies
were able to manipulate social media sentiment to show support
for the project. They did so by using Twitter to send an inordinate
number of tweets to show support for the project, which did not
accurately represent the overall public sentiment. The Rainforest
Action Network (RAN) discovered this subterfuge, criticizing the
oil companies for using fake Twitter accounts to show support for
the pipeline project. RAN pointed out a sudden spike in the number
(within three minutes on 15 accounts) of tweets favoring the pipeline.
RAN gathered evidence that 14 of 15 accounts were phony and
the tweets were generated by an automated process.

Data can cause unintended consequences | Big data users can find themselves
facing the unintended consequences of exploiting big data with
no regard for data quality, legality, disparate data meanings, and process
quality.4 This was the case when public agencies and a newspaper
in New York came under scrutiny for releasing information about gun
owners. In the wake of the Connecticut school mass shooting, a group
of journalists from The Journal News in White Plains, N.Y., used the
Freedom of Information Act to obtain information regarding gun
owners living in the suburbs of Westchester, Rockland, and Putnam
counties. The journalists published an article about the licensed gun
owners living in the neighborhood and also published an interactive
visual map that provided individual gun owners’ names and addresses.
The information was published to inform the public about who owns
firearms, but that information might also assist criminals who could
use it to target vulnerable homeowners who do not own guns or to
target homeowners who have guns in order to steal them.5

The Promise of Mobile Phones

There is one area where nonprofits have begun to make good use of
big data: mobile phones. In 2010 more than 5 billion mobile phones
were in use, more than 80 percent of them in developing countries.6
The percentage of people owning mobile phones in Sub-Saharan
Africa increased from 32.1 percent in 2008 to 57.1 percent in 2012,
and it is expected to rise to 75.4 percent by 2016.7 This growth has
offered people in developing countries better opportunities to improve
their quality of life.

For example, Cell Life, a South African organization, created a
mass messaging mobile service called Communicate, which reminds
patients to take their medications, links patients to clinics, and offers
peer-to-peer support services such as counseling and monitoring.8
Cell Life also developed Capture, a service that makes it possible for
health care workers in the field to collect and save information in
digital form using their mobile phones.

The rapid proliferation of mobile and Internet usage allows for the
collection of unprecedented amounts of information. Most modern
mobile phones contain global positioning system technology, which
identifies the geographic location of the phone. In addition to location
data, mobile phones contain a treasure trove of information,
such as call logs, SMS messages, and social media postings. A mobile
phone acts as an individual sensor collecting relevant information
from its environment, which when aggregated and analyzed with
information from millions of other mobile phones can lead to the
discovery of important information, which can then be disseminated
back to people on the ground via the same mobile phones.

For example, researchers are studying migration movements
following disasters as a way to understand the spread of infectious
disease. Harvard University epidemiologist Caroline Buckee and
her team use location data from mobile phones to understand the
patterns of people moving around in Kenya and help stop malaria
and other diseases from spreading.

Kenya’s western highlands are equipped with thousands of cell-phone
towers that transmit data on individuals’ phone call and text messaging
activity. Researchers found that people making calls and sending text
messages from a specific tower were making 16 times more trips away
from the area, with significant activity in the malaria hot spot of Lake
Victoria. Information on the patterns of human travel collected from
mobile phone usage are being
used to develop predictive models
to further combat malaria in
the region.9

Steps to Increase Use of Big Data

Big data has enormous potential
to inform decision-making to
help solve the world’s toughest
social problems. But for this to
happen, issues relating to data
collection, organization, and
analysis must first be resolved.
The following four recommendations
have the potential to
create datasets useful for evidence-based decision-making.

Building global data banks on
critical issues | The global community
needs to create large data
banks on complex issues such as human trafficking, global hunger,
and poverty. The data bank would have the capacity to hold multiple
different data types along with metadata that describes the datasets.
For this to happen, multi-sector alliances that promote data sharing
on thematic issues need to be created. At the 2012 G-8 Summit, leaders
of the world’s largest economies and four African heads of state
met to discuss and commit to a new phase of efforts to fight hunger
and food insecurity. Out of that discussion grew the New Alliance for
Food and Nutrition Security, which set its sights on helping 50 million
people out of poverty over the next 10 years through sustained
agricultural growth. As part of this plan, the New Alliance launched
a number of technology- and data-based initiatives. One was the Scaling
and Seeds and Other Technologies Partnership, developed to promote
commercialization, distribution, and adoption of technologies
that would improve seed varieties. The United States’ contribution
to the New Alliance has been chronicled through the Feed the Future
Initiative and website, and it has stayed true to the Alliance’s stance
on data sharing by establishing Agrilinks.org, a data-sharing platform
that is updated consistently. Farmers can tap into Agrilinks.
org to read about new agricultural practices or live tweet from
their mobile phones to ask questions of an agriculture expert. USAID
is offering open data from the Feed the Future initiative on
baseline data pulled from the Bangladesh Integrated Household
Survey dataset,10 baseline surveys of nearly 5,000 households in Ghana
that captured indicators outlined by the Feed the Future Initiative11
and the Women’s Empowerment in Agriculture Index.12

Engaging citizens and citizen science | Big data is not the sole province
of professionals. Citizens can also be enlisted to help create and
analyze these datasets. With the proliferation of data through open
data platforms, more and more citizens are creating new ideas and
products through what has become known as “citizen science.” In
2010, the City of London made government data available to the
public by opening the London Datastore. Managed by the Greater
London Authority, the London Datastore offered citizens the opportunity
to view and use raw data released from city agencies
and civil servants. Information distributed included data on crime
and economics, and real-time data from transit systems. Matthew
Somerville, a Web developer, created an online map app of the City
of London tube that had more than 250,000 hits in a matter of days.
Likewise, Ben Barker, an electronics engineer and cyclist, created
a bike map with information pulled from the London Datastore.13

Build a cadre of data curators and analysts | Today, not only do we have
a shortage of data curators and analysts who can tackle social problems,
we have limited avenues for our existing personnel to receive
the necessary training and build competencies. For the most part, we
have left data science to the sciences and business. The social sciences
have often equipped students simply with the basics of statistics. This
approach is unacceptable if we are to take advantage of big data. We
need to equip students and analysts with the necessary skills to curate
data so as to create large datasets. These skills are often found in
programs in informatics and the traditional degrees of information
and library science. In these programs, students learn about data organization,
preservation, visualization, search and retrieval, and use.
These are valuable skills that go beyond simply searching the Web for
information. In addition to these skills, increasing the capacity for an
analyst to think about what is possible with data is critical. Thinking
about networked relationships among datasets, and how to uncover
latent patterns in datasets, are competencies that need to be developed.

Promoting virtual experimentation platforms | To increase our understanding
of how to use big data for tackling social problems, we
need to promote more experimentation. Virtual experimentation
platforms, which allow individuals to share ideas, interact with others’
ideas, and work collaboratively to find solutions to problems
or take advantage of opportunities, can bring interested parties
together to create large datasets, develop innovative algorithms to
analyze and visualize the data, and develop new knowledge. One
example is Kaggle, a website where competitions on data analysis
are run. Unfortunately, organizations that are tackling social
issues seldom participate on these platforms. Virtual experimentation
platforms are essential if we are going to move the needle on
using big data to tackle social challenges. Initially, these platforms
should stimulate competitions to create large datasets on various
issues. Competitions that generate large datasets will be critical to
help the community realize the challenges associated with the way
the social sector is currently operating. Once a couple of datasets
are created, we can launch competitions that focus on the predictive
analytics and the discovery of novel patterns. The use of open forums
such as wikis and discussion groups can help the community share
lessons learned, collaborate, and advance new solutions.

The Future of Big Data

Business and science have shown that big data’s merits are undeniable.
Social sector organizations must now figure out how they too
can incorporate this type of decision-making capability into their
operations. The potential for growth and innovation exists, but there
are serious obstacles to overcome. The issues that are being tackled
in the social sector are in many ways more complex than they are
in business or science, making the use of big data that much more
difficult. In addition, greater attention must be paid to the rights,
privacy, and dignity of their constituents.

In spite of these obstacles, progress is being made. Public sector
agencies have made it clear that data are an important element of
social innovation. Institutions such as the US government and the
World Bank have made their data available to the public for mining
and further use. Individuals are using the data to create innovations,
mainly apps, to address a particular social problem.

Organizations have been created to help make better use of big
data for social problems. Data Without Borders, for example, matches
scientists and statisticians with nonprofits for pro bono data work to
help overcome the shortage of technology personnel capable of handling
big data projects. Globally, the world’s actors are making efforts
to use open data and big data to develop solutions to social problems
in innovative and collaborative ways. Progress is being made, but the
chasm must still be crossed. It is a challenge worth overcoming.

Show more