2016-12-09

Machine learning models benefit from an increased number of features: as the saying goes, "more data beats better algorithms". In the financial and social domains, macroeconomic indicators are routinely added to models, particularly those that contain a discrete time or date. For example, loan or credit analyses that predict the likelihood of default can benefit from unemployment indicators, and a model that attempts to quantify pay gaps between genders can benefit from demographic employment statistics.

The Bureau of Labor Statistics (BLS) collects information related to the labor market, working conditions, and prices into periodic time series data. Moreover, BLS provides a public API, making it easy to ingest essential economic information into a variety of analytics. However, while they provide raw data and even a few reports that analyze employment conditions in the United States, the tables they provide are better suited to specialists, and the information can be difficult to interpret at a glance.

In this post, we will review simple ingestion of BLS time series data, enabling routine collection so that local models are as up to date as possible. We will then visualize the time series using pandas and matplotlib, exploring the available series with a functional methodology. By the end of this post, you will have a mechanism to fetch data from BLS and to quickly view and explore it using the BLS series id key.

The BLS API

The BLS API currently has two versions; BLS strongly encourages use of the V2 API, which requires registration. Once you register, you will receive an API key that authorizes your requests, ensuring that you get access to as many data sets as possible at as high a frequency as possible.

The API is organized to return data based on the BLS series id, a string that represents the survey type and encodes which version or facet of the data is being represented. To find series ids, I recommend going to the data tools section of the BLS website and clicking on the "top picks" button next to the survey you're interested in; the series id is provided after the series title. For example, the Current Population Survey (CPS), which provides employment statistics for the United States, lists its series; here are a few examples:

Unemployment Rate - LNS14000000

Discouraged Workers - LNU05026645

Persons At Work Part Time for Economic Reasons - LNS12032194

Unemployment Rate - 25 Years & Over, Some College or Associate Degree - LNS14027689

The series ids, in this case, start with LNS or LNU: LNS14000000, LNU05026645, LNS12032194, and LNS14027689. There are two methods to fetch data from the API. You can GET data from a single series endpoint, or you can POST a list of up to 25 ids to fetch multiple time series at a time. Generally, BLS data sets are fetched in groups, so we'll look at the multiple time series ingestion method. Using the requests module, we can write a function that returns a JSON data set for a list of series ids:
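A sketch of such a function is below, assuming the public V2 endpoint; the function name fetch_bls_series and the BLS_API_KEY environment variable are illustrative conventions, not BLS requirements:

```python
import os
import requests

# The public V2 endpoint for BLS time series data.
BLS_ENDPOINT = "https://api.bls.gov/publicAPI/v2/timeseries/data/"

# Look up the API key from the environment; the variable name
# BLS_API_KEY is a convention, not a requirement.
BLS_API_KEY = os.environ.get("BLS_API_KEY")


def fetch_bls_series(series, **kwargs):
    """
    Fetch between 1 and 25 time series from the BLS API, returning
    the parsed JSON response. Keyword arguments such as startyear
    and endyear are passed through as request parameters.
    """
    if not 0 < len(series) <= 25:
        raise ValueError("Must pass in between 1 and 25 series ids")

    # Create the request headers to send and receive JSON data.
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }

    # Construct the payload from the keyword arguments, adding the
    # registration key and the list of series ids.
    payload = kwargs
    payload["registrationkey"] = BLS_API_KEY
    payload["seriesid"] = series

    # POST the request, raise an exception on HTTP errors, and
    # return the parsed JSON data.
    response = requests.post(BLS_ENDPOINT, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()
```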

This script looks up your API key from the environment, a best practice for handling keys, which should not be committed to GitHub or otherwise saved in a place where they can be discovered publicly. You can either change the line to hard code your API key as a string, or you can export the variable in your terminal as follows:
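```bash
# Substitute the registration key you received from BLS.
export BLS_API_KEY="your-registration-key-here"
```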

The function accepts a list of series ids and a set of generic keyword arguments, which are stored as a dictionary in the kwargs variable. The first step of the function is to ensure that we have between 1 and 25 series passed in (otherwise an error is raised). If so, we create our request headers to send and receive JSON data and construct a payload with our request parameters. The payload is constructed from the keyword arguments along with the registration key from the environment and the list of series ids. Finally, we POST the request, check to make sure it returned successfully, and return the parsed JSON data.

To run this function for the series we listed before:
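For example (the year range here is arbitrary):

```python
data = fetch_bls_series(
    ["LNS14000000", "LNU05026645", "LNS12032194", "LNS14027689"],
    startyear="2000", endyear="2015",
)
```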

You should see something similar to the following result:
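The exact contents will depend on your query, but the response is structured roughly as follows (most values elided):

```json
{
  "status": "REQUEST_SUCCEEDED",
  "responseTime": ...,
  "message": [],
  "Results": {
    "series": [
      {
        "seriesID": "LNS14000000",
        "data": [
          {"year": "...", "period": "M01", "periodName": "January",
           "value": "...", "footnotes": [{}]},
          ...
        ]
      },
      ...
    ]
  }
}
```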

From here it is a simple matter to operationalize the routine (monthly) ingestion of new data. One method is to store the data in a relational database like PostgreSQL or SQLite so that complex queries can be run across series. As an example of database ingestion and wrangling, see the github.com/bbengfort/jobs-report repository. This project was a web/D3 visualization of the BLS time series data, but it utilized a routine ingestion mechanism as described in the README of the ingestion module. To simplify data access, we'll use a database dump from that project in the next section, but you can also use the data downloaded as JSON from the API if you wish.

Loading Data into Pandas Series

For this section, we have created a database of BLS data (using the API) that has two tables: a series table that has information describing each time series and a records table where each row is essentially a tuple of (blsid, period, value) records. This allows us to aggregate and query the timeseries data effectively, particularly in a DataFrame. For this section we've dumped out the two tables as CSV files, which can be downloaded here: BLS time series CSV tables.

Querying Series Information

The first step is to create a data frame from the series.csv file such that we can query information about each time series without having to store or duplicate the data.
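A minimal sketch, assuming the file has a header row with columns such as blsid, title, and source:

```python
import pandas as pd

# Load the series information table into a data frame.
info = pd.read_csv('series.csv')

# Peek at the first few rows of series information.
info.head()
```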



Working backward, we can create a function that accepts a BLS ID and returns the information from the info table:
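For example (the name series_info is illustrative):

```python
def series_info(blsid):
    """
    Return the info table rows that match the given BLS series id.
    """
    return info[info.blsid == blsid]
```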

I utilize this function a fair amount to check if I have a time series in my dataset or to look up seemingly related time series. In fact, we can see a pattern starting to emerge from this function and the API fetch function from the last section. Our basic methodology is going to be to create functions that accept one or more BLS series ids and then perform some work on them. Unifying our function signatures in this way and working with our data on a specific key type dramatically simplifies exploratory workflows.

However, the BLS ids themselves aren't necessarily informative, so like the previous function, we need the ability to query the data frame. Here are a few example queries:
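For instance, assuming the info table has a source column:

```python
# Select every series produced by the Local Area Unemployment
# Statistics (LAUS) program.
info[info.source == 'LAUS']
```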



This query returns all of the time series whose source is the Local Area Unemployment Statistics (LAUS) program, which breaks down unemployment by state. However, you'll notice from the previous section that the series id prefixes seem to group related series, though not necessarily by source. We could also query based on the prefix to find related series:
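For example:

```python
# Select every series whose blsid shares the LNS prefix.
info[info.blsid.str.startswith('LNS')]
```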



Combining queries like these into a functional methodology will easily allow you to explore the 3,368 series in this dataset and more as you continue to ingest series information using the API!

Loading the Series

The next step is to load the actual time series data into Pandas. Pandas implements two primary data structures for data analysis — the Series and DataFrame objects. Both objects are indexed, meaning that they contain more information about the underlying data than simple one or two-dimensional arrays (which they wrap). Typically the indices are simple integers that represent the position from the beginning of the series or frame, but they can be more complex than that. For time series analysis, we can use a Period index, which indexes the series values by a granular interval (by month as in the BLS dataset). Alternatively, for specific events you can use the Timestamp index, but periods do well for our data.

To load the data from the records.csv file, we need to construct one Series per BLS time series, creating a collection of them. Here's a function to go about this:
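A sketch of the loader follows; the function name load_series_records is illustrative:

```python
import csv
from itertools import groupby
from operator import itemgetter

import pandas as pd


def load_series_records(path="records.csv"):
    """
    Yield a pd.Series for every BLS time series in the records CSV,
    which must be sorted by blsid and have a header row containing
    blsid, period, and value columns.
    """
    with open(path, "r") as f:
        # DictReader yields each row as a dictionary keyed by the
        # header row of the CSV file.
        reader = csv.DictReader(f)

        # Group rows together by their blsid (requires sorted input).
        for blsid, rows in groupby(reader, itemgetter("blsid")):
            # Load the group into memory, sorted by the period string
            # (e.g. "2015-02-01"), which sorts chronologically.
            rows = sorted(rows, key=itemgetter("period"))

            # Convert day-granularity periods to month granularity
            # and parse the values into floating points.
            periods = [pd.Period(row["period"]).asfreq("M") for row in rows]
            values = [float(row["value"]) for row in rows]

            yield pd.Series(values, index=periods, name=blsid)
```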

In this function we use the csv module, part of the Python standard library, to read and parse each line of our opened CSV file. The csv.DictReader generates rows as dictionaries whose keys are based on the header row of the csv file. Because each record is in the format (blsid, period, value) (with some extra information as well), we can groupby the blsid. This does require the records.csv file to be sorted by blsid since the groupby function simply scans ahead and collects rows into the rows variable until it sees a new blsid.

Once we have our rows grouped by blsid we can load them into memory and sort them by time period. The period value is a string in the form 2015-02-01, which is a sortable format; however, if we create a pd.Period from this string, the period will have day granularity. For each row, we therefore use .asfreq('M') to transform the period to month granularity. Finally, we parse our values into floating points for data analysis and construct a pd.Series object with the values and the monthly period index, assigning the blsid string as its name, which we will continue to use to query our data.

This function uses the yield statement to return a generator of pd.Series objects. We can collect all series into a single data frame, indexed correctly as follows:
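For example (assuming the loader function from above):

```python
# Combine all of the series into a single data frame whose columns
# are the BLS ids and whose index is the union of monthly periods.
series = pd.concat(list(load_series_records("records.csv")), axis=1)
```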

If you're using our data, the series data frame should be indexed by period, and there should be roughly 183 months (rows) in our dataset. There are also 3366 time series in the data frame, represented as columns whose column id is the BLS id. If any of the series did not have a period matched by the global period index, the concat function correctly fills in that value as np.nan. As you can see from a simple series.head(), the data frame contains a wide range of data, and the domain of every series can be dramatically different.

Visualizing Series with Matplotlib

Now that we've gone through the data wrangling hoops, we can start to visualize our series using matplotlib and the plotting library that comes with Pandas. The first step is to create a function that takes a blsid as input and uses the series and info data frames to create a visualization:
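A sketch of such a function, assuming the info and series data frames from the previous section (the name plot_series is illustrative):

```python
import matplotlib.pyplot as plt


def plot_series(blsid, figsize=(10, 4)):
    """
    Plot a single BLS time series, titled with its name from the
    info table.
    """
    # Find the index of the info row for this blsid, then fetch the
    # 'title' column for that row. (In newer versions of pandas, use
    # info.at[idx, 'title'] instead of get_value.)
    idx = info[info.blsid == blsid].index[0]
    title = info.get_value(idx, "title")

    # Fetch the series directly from the series data frame and plot.
    ax = series[blsid].plot(figsize=figsize)
    ax.set_title(title)
    plt.show()
```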

The first thing this function does is look up the title of the series using the series info data frame. To do this it uses the get_value method of the data frame, which returns the value of a particular column for a particular row by index. To look up the title by blsid, we have to query the info data frame for that row, info[info.blsid == blsid], then fetch the index, which we can then use to get the 'title' column. After that we can simply plot the series, fetching it directly from the series data frame and plotting it using Pandas.

Warning: Don't try series.plot(), which will try to plot a line for every series (all 3366 of them); I've crashed a few notebooks that way!

This function is certainly an enhancement of the series_info function from before, allowing us to think more completely about the domain, range, and structure of the time series data for a single blsid. I typically use this function when adding macroeconomic features to datasets to decide if I should use a simple magnitude, or if I should use a slope or a delta, or some other representation of the data based on its shape. Even better though would be the ability to compare a few time series together:
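A sketch under the same assumptions as plot_series (the name plot_multiple_series is illustrative):

```python
def plot_multiple_series(blsids, figsize=(10, 4)):
    """
    Plot several BLS time series on the same axes for comparison.
    """
    fig, ax = plt.subplots(figsize=figsize)

    for blsid in blsids:
        # Label each line with the series title for the legend.
        idx = info[info.blsid == blsid].index[0]
        title = info.get_value(idx, "title")
        series[blsid].plot(ax=ax, label=title)

    # Title the figure with the blsids for later reference.
    ax.set_title("BLS Series {}".format(", ".join(blsids)))
    ax.legend()
    plt.show()
```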

In this function, instead of a single blsid, the argument is a list of blsid strings. We plot each series, adding its title as a label, which allows us to create a legend with all the series names. Finally, we add a title that lists the series blsids for reference later. We can now start making visual series comparisons:
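For example, comparing two of the unemployment rate series from earlier:

```python
plot_multiple_series(["LNS14000000", "LNS14027689"])
```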

One thing to note is that comparing these series worked because they had approximately the same range. However, not all series in the BLS data set share the same range; they can differ by orders of magnitude. One method to combat this is to provide a normalize=True argument to the plot_multiple_series function and then use a normalization method to bring each series into the range [0,1].
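A minimal sketch of such a normalization (min-max scaling):

```python
def normalize(s):
    """
    Scale a series into the range [0, 1] by min-max normalization.
    """
    return (s - s.min()) / (s.max() - s.min())
```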

Conclusion

The addition of macroeconomic features to models can greatly expand their predictive power, particularly when they inform the behavior of the target variable. When instances have some time element that can be mapped to a period, the economic data collected by the Census Bureau and BLS can be easily incorporated into models.

This post was designed to equip you with a workflow for ingesting and exploring macroeconomic data from BLS in particular. By centering our workflow on the blsid of each time series, we were able to create functions that accept an id or a list of ids and work meaningfully with them. This allows us to connect exploration on the BLS data explorer with exploration in our data frames.

Our exploration process was end-to-end with respect to the data science pipeline. We ingested data from the BLS API, then stored and wrangled that data into a database format. Computation on the timeseries involved the pd.Period and pd.Series objects, which were then aggregated into a single data frame. At all points we explored querying and limiting the data, culminating with visualizing single and multiple timeseries using matplotlib.

Helpful Links

BLS Timeseries CSV Tables

BLS Developer Documentation

Jobs Report BLS Database Ingestion

Interactive Exploration of the Employment Situation Report

Acknowledgements

Nicole Donnelly reviewed and edited this post, Tony Ojeda helped wrangle the datasets.
