2014-11-08

I wanted to write a quick post today about a task that most of us do routinely but often think very little about - loading CSV (comma-separated value) data into Python. This simple action has a variety of obstacles that need to be overcome due to the nature of serialization and data transfer. In fact, I'm routinely surprised how often I have to jump through hoops to deal with this type of data, when it feels like it should be as easy as JSON or other serialization formats.

The basic problem is this: CSVs have inherent schemas. In fact, most of the CSVs that I work with are dumps from a database. While the database can maintain schema information alongside the data, the schema is lost when the data is serialized to disk. Worse, if the dump is denormalized (a join of two tables), then the relationships are also lost, making it harder to extract entities. Although a header row can give us the names of the fields in the file, it won't give us the types, and there is nothing structural about the serialization format (as there is with JSON) from which we can infer the types.

That said, I love CSVs. CSVs are a compact data format - one row, one record. CSVs can be grown to massive sizes without cause for concern. I don't flinch when reading 4 GB CSV files with Python because they can be split into multiple files, read one row at a time for memory efficiency, and multiprocessed with seeks to speed up the job. This is in stark contrast to JSON or XML, which have to be read completely from end to end in order to get the full data (JSON has to be completely loaded into memory and with XML you have to use a streaming parser like SAX).

CSVs are the file format of choice for big data appliances like Hadoop for good reason. If you can get past encoding issues, extra dependencies, schema inference, and typing, CSVs are a great serialization format. In this post, I will provide you with a series of pro tips that I have discovered for using and wrangling CSV data.

Specifically, this post will cover the following:

The basics of CSV processing with Python

Avoiding Unicode issues in Python 2.7

Using namedtuples or slots for memory efficiency

Serializing data with a schema using Avro

Although we won't cover it in this post, these techniques also give you a great start toward multiprocessing, letting you dig through a CSV file quickly from many different positions at once. Hopefully this intro has made CSVs sound a bit more exciting, so let's dive in.

The Data

Consider the following data set, a listing of company funding records as reported by TechCrunch. You can download the data set here.

The first ten rows are shown below:

The CSV file has a header row, so we have the field names, but we do have a couple of data type conversions to make. In particular, the fundedDate needs to be transformed to a Python date object and the raisedAmt needs to be converted to an integer. This isn't particularly onerous, but consider that this is just a simple example; more complex conversions can easily be imagined.
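
For instance, assuming the file stores its dates in a form like 1-May-07 (the exact format is an assumption here), the conversions amount to something like this:

```python
from datetime import datetime

# Convert a date string into a Python date, and an amount string into an int.
funded_date = datetime.strptime("1-May-07", "%d-%b-%y").date()  # -> datetime.date(2007, 5, 1)
raised_amt  = int("6850000")                                     # -> 6850000 (illustrative value)
```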

Note: This file was obtained from the following source: SpatialKey Sample Data, on October 21, 2014.

The Basics of CSV Processing

Python has a built-in csv module that handles all the ins and outs of processing CSV files, from dealing with dialects (Excel, anyone?) to quoting fields that may contain the delimiter to handling a variety of delimiters. Very simply, this is how you would read all the data from the funding CSV file:
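
A minimal sketch of that reader looks something like the following; the read_funding_data function name and the data/funding.csv path are placeholders, and the field names used in the print statement are assumed from the file's header:

```python
import csv

FUNDING = "data/funding.csv"   # path to the funding data (illustrative)

def read_funding_data(path):
    """
    Lazily read records from a CSV file with a header row,
    yielding each row as a dictionary keyed by the field names.
    """
    with open(path, 'rU') as data:
        reader = csv.DictReader(data)
        for row in reader:
            yield row

if __name__ == "__main__":
    for idx, row in enumerate(read_funding_data(FUNDING)):
        if idx >= 10:
            break
        print "%(company)s raised %(raisedAmt)s on %(fundedDate)s" % row
```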

A couple of key points with the code above:

Always wrap the CSV reader in a function that returns a generator (via the yield statement).

Open the file in universal newline mode with 'rU' for backwards compatibility.

Use context managers with [callable] as [name] to ensure that the handle to the file is closed automatically.

Use the csv.DictReader class only when headers are present; otherwise just use csv.reader. (You can pass a list of fieldnames, but you'll see it's better just to use a namedtuple, as we discuss below.)

This code allows you to treat the data source as just another iterator or list in your code. In fact if you do the following:
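
For example, continuing with the hypothetical read_funding_data sketch from above:

```python
data = read_funding_data(FUNDING)
print data
# <generator object read_funding_data at 0x10079a960>  (the address will differ)
```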

You'll see that the function "returns" a generator, thanks to the yield statement in the function definition. This means, among other things, that the data is evaluated lazily. The file is not opened, read, or parsed until you need it. No more than one row of the file is in memory at any given time. The context manager in the function also ensures that the file handle is closed when you've finished reading the data, even in the face of an exception elsewhere in the code. Note that this pattern requires the file to be open while you're reading from it; if you attempt to read from the generator after the file has been closed, you will get an error.

This is powerful because it means that even for much larger data sets you will have efficient, portable code. Moreover, as we start looking at wrangling or munging the data from the CSV file, you'll have well-encapsulated code that can handle a variety of situations. In fact, in code that has to read and parse files from a variety of sources, it is common to wrap the csv module in a class so that you can persist statistics about the data and provide multiple reads of the CSV file. In the above example you have to call the function every time you want to read the data or do any computation on it, whereas a class can save some state between reads. Here is an example of persisted data for our funding class:
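
A sketch of such a class might look like the following (the FundingReader name and the 'company' field name are illustrative):

```python
import csv
from collections import Counter

class FundingReader(object):
    """
    Wraps the funding CSV and persists simple statistics between reads:
    the total number of rows and the number of rows per company.
    """

    def __init__(self, path):
        self.path     = path
        self._length  = None
        self._counter = None

    def __iter__(self):
        self._length  = 0
        self._counter = Counter()
        with open(self.path, 'rU') as data:
            for row in csv.DictReader(data):
                # Update the cached statistics as each row streams by.
                self._length += 1
                self._counter[row['company']] += 1
                yield row

    def __len__(self):
        if self._length is None:
            # Force one complete pass to compute the statistics.
            for row in self:
                continue
        return self._length

    @property
    def companies(self):
        if self._counter is None:
            for row in self:
                continue
        return self._counter

if __name__ == "__main__":
    reader = FundingReader("data/funding.csv")
    print len(reader)             # first access reads the entire file
    print len(reader.companies)   # answered from the cache, no second read
```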

This is a pretty simple class that keeps track of the number of rows overall as well as the number of rows for each company. All that's stored in memory is the company name and the number of funding records, but you can see how this might be extremely useful, especially if you're going to filter the data by company. No matter what, you have to make one complete read to get this data, but once that pass is complete the statistics are stored and cached. So when we ask for the length of the data at the end, the file is completely read, and when we ask for the number of companies, we don't return to disk.

Extra Dependencies

The csv module is excellent for reading and writing CSV files, is memory efficient (providing an iterator that reads only one line from the file into memory at a time), and comes included with Python. Why would you need anything else?

The thing is, I regularly see this in the code of my colleagues (especially those who do research):
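
Namely, something along these lines:

```python
import pandas as pd

# The entire CSV is parsed and loaded into memory as a DataFrame in one shot.
df = pd.read_csv('data/funding.csv')
print df.head()
```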

Although it's rarer, I also see things like numpy.genfromtxt used as a production CSV reader.

The utilities for analysis that Pandas gives you, especially DataFrames, are extremely useful, and there is obviously a 1:1 relationship between DataFrames and CSV files. I routinely use Pandas for data analyses, quick insights, and even data wrangling of smaller files. The problem is that Pandas is not meant for production-level ingestion or wrangling systems. It is meant for data analysis that can happen completely in memory. As such, when you run this line of code, the entirety of the CSV file is loaded into memory. Likewise, NumPy arrays are fixed-size structures that are loaded completely into memory. You've just lost your memory efficiency, especially for larger data sets.

And while this may be fine on your MacBook Pro, keep in mind that if you're writing data pipeline code, it will probably be run more routinely on a cloud platform such as Rackspace, AWS, or Google App Engine. Since you have a budget, it will also probably be running on small or micro servers that might have 1 GB of memory, if you're lucky. You don't want to blow through it!

Then there is the issue of dependency management. The Pandas library depends on NumPy, which itself takes a lot of compilation to put together. On my MacBook Pro, pip installing Pandas into a virtualenv takes about 5 minutes. This is a lot of overhead when you're not planning on using DataFrames to their full potential!

So don't do it. Use the native csv module.

Avoiding Unicode Issues

Text encoding is the bane of my existence in so many ways. Mismatched encoding can crash entire programs that have been running continuously for a long time and have passed their unit tests. Even if you do handle encoding, choosing the wrong scheme (usually UTF-8) means that when something Latin-1 comes in, you can get silent errors that never raise an exception. Encoding is always something you should consider when opening up a file from anywhere.

If you're lucky enough to be able to use Python 3, you can skip this section. But as mentioned previously, production ingestion and wrangling systems are usually run on micro or small servers in the cloud, usually on the system Python. So if you're like me and many other data scientists, you're using Python 2.6 or 2.7, which is currently what ships with Linux servers.

If you look at the csv module documentation, it suggests adding three classes to encode, read, and write unicode data by wrapping the file passed to the csv module. These wrappers translate the native file encoding to UTF-8, which the csv module can read because it's 8-bit safe. I won't include those classes here. Instead, you should always do the following when you import the csv module:
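
One common pattern (sketched here; the try/except fallback is optional) is to swap in the drop-in unicodecsv package at import time:

```python
try:
    # A drop-in replacement for csv that reads and writes unicode transparently.
    import unicodecsv as csv
except ImportError:
    import csv
```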

This is a very small dependency that can be easily fetched via pip install unicodecsv, and you can find this module in pretty much every one of my data product repositories' requirements.txt files. To learn more, visit the PyPI page for the latest version: unicodecsv 0.9.4.

Registering Schemas via NamedTuples

Now that we can easily extract data from a CSV file with a memory-efficient and encoding-tolerant method, we can begin to look at actually wrangling our data. CSVs don't provide a lot in terms of types or indications about how to deal with the data that is coming in as strings. As noted previously, if there is a header line, we may get information about the various field names, but that's about it. We have already parsed our rows into Python dictionaries using the built-in csv.DictReader, but this definitely depends on the header row being there. If the header row is not there, you can pass a list of field names to the csv.DictReader and still get dictionaries back.

Dictionaries are great because you can access your data by name rather than by position, and these names may help you parse the data that's coming in. However, dictionaries aren't schemas. They are key-value pairs that are completely mutable. When using a single CSV file, you may expect to have dictionaries with all the same keys, but these keys aren't recorded anywhere in your code, making your code dependent on the structure of the CSV file in an implicit way.

Moreover, dictionaries have a bit more overhead associated with them in order to make them mutable. They use more memory when instantiated, and in order to provide fast lookups they are backed by a hash table (so the keys are unordered). They are great for when you don't know what data is coming in, but when you do, you can use something much better: namedtuples.
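
Here is a sketch of reading the funding data into a namedtuple; the full list of field names is an assumption based on the header of the SpatialKey sample file:

```python
import csv
from collections import namedtuple

# Field names assumed from the header row of funding.csv.
fields = ("permalink", "company", "numEmps", "category", "city", "state",
          "fundedDate", "raisedAmt", "raisedCurrency", "round")

FundingRecord = namedtuple('FundingRecord', fields)

def read_funding_data(path):
    with open(path, 'rU') as data:
        data.readline()             # skip the header row
        reader = csv.reader(data)   # the plain reader, not DictReader
        for row in reader:
            yield FundingRecord(*row)

if __name__ == "__main__":
    for row in read_funding_data("data/funding.csv"):
        print row
        break
```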

Running this code produces a FundingRecord for each row. As you can see, the result is a tuple (an ordered sequence of values) whose elements have names associated with them, much like the keys of a dictionary. Not only that, but the tuple is typed: it isn't just a variable named "FundingRecord"; each row actually is a FundingRecord.

The namedtuple function actually returns a subclass of tuple with the type specified by the typename, the first argument to the function. The type that is returned is immutable once constructed and is extremely lightweight because it contains no internal __dict__. It is very similar to using __slots__ for memory performance and efficiency. As you can see from the above code, you would use the FundingRecord just like any other object, so long as you didn't modify it in place.

This leads to an obvious question: how can you parse the data into various types if you can't assign it back to the object? Well, the nice thing is that the return from namedtuple is an actual type, one that you can subclass to add methods to!
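
Here is one way that might look (the date format and the column positions are assumptions about the file):

```python
import csv
from datetime import datetime
from collections import namedtuple

fields = ("permalink", "company", "numEmps", "category", "city", "state",
          "fundedDate", "raisedAmt", "raisedCurrency", "round")

class FundingRecord(namedtuple("FundingRecord_", fields)):
    """
    A typed record from the funding CSV that knows how to parse itself.
    """

    @classmethod
    def parse(klass, row):
        row = list(row)                                        # make the row mutable
        row[6] = datetime.strptime(row[6], "%d-%b-%y").date()  # fundedDate (format assumed)
        row[7] = int(row[7])                                   # raisedAmt
        return klass(*row)

    def __str__(self):
        date = self.fundedDate.strftime("%d %b, %Y")
        return "%s raised $%i in round %s on %s" % \
            (self.company, self.raisedAmt, self.round, date)

def read_funding_data(path):
    with open(path, 'rU') as data:
        data.readline()             # skip the header row
        reader = csv.reader(data)
        for row in reader:
            yield FundingRecord.parse(row)

if __name__ == "__main__":
    for row in read_funding_data("data/funding.csv"):
        print row
        break
```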

If you run this code, you'll see each row come back as a FundingRecord with the date and amount already converted to native Python types.

We now have a simple, self-documenting methodology for constructing schemas and parsers for CSVs in code that is not only memory efficient and fast but also encapsulates all the work we might want to do on a single record in the CSV file.

A couple of caveats to note:

Because they are immutable, subclasses of tuple don't have an __init__, only a __new__.

When passing the type name to the namedtuple function, use an _ (e.g. "FundingRecord_") so the generated type's name doesn't conflict with the subclass you're defining.

You won't have access to a keys method, since this isn't a dictionary, although there is an internal _fields attribute that returns a similar result.

Keep in mind that the class is immutable, and therefore read-only!

Performance of namedtuples vs dicts

As a side example, I tested the performance in both time and space of using a DictReader versus a namedtuple. The tests operated on a 548 MB CSV file that contained 5,000,000 rows, each a tuple matching the schema (uuid, name, data, page, latitude, longitude). I loaded the entire data set into memory by appending each record (either dict or namedtuple) to a list of records and then monitored the memory usage of the Python process. The code I used is below.
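
A sketch of the DictReader version follows; the file path is a placeholder, the field names are passed in explicitly, and memory usage was observed externally on the running process:

```python
import csv
import time

FIELDS = ('uuid', 'name', 'data', 'page', 'latitude', 'longitude')
PATH   = "data/benchmark.csv"   # the 548 MB, 5,000,000-row test file (path illustrative)

def read_as_dicts(path):
    with open(path, 'rU') as data:
        reader = csv.DictReader(data, fieldnames=FIELDS)
        for row in reader:
            yield row

if __name__ == "__main__":
    start   = time.time()
    records = [row for row in read_as_dicts(PATH)]   # load every row into memory
    delta   = time.time() - start
    # Memory usage of the process was monitored externally while this ran.
    print "Loaded %i dicts in %0.3f seconds" % (len(records), delta)
```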

The result I got on my MacBook Pro, with a 2.8GHz Intel i7 processor and 16 GB of memory, is captured in the summary table at the end of this section.

The namedtuple benchmark code is similar and is as follows:
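
Again, this is a sketch under the same assumptions as above:

```python
import csv
import time
from collections import namedtuple

PATH   = "data/benchmark.csv"   # the same 548 MB test file (path illustrative)
Record = namedtuple('Record', ('uuid', 'name', 'data', 'page', 'latitude', 'longitude'))

def read_as_namedtuples(path):
    with open(path, 'rU') as data:
        reader = csv.reader(data)
        for row in reader:
            yield Record(*row)

if __name__ == "__main__":
    start   = time.time()
    records = [row for row in read_as_namedtuples(PATH)]   # load every row into memory
    delta   = time.time() - start
    print "Loaded %i namedtuples in %0.3f seconds" % (len(records), delta)
```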

Running this code on the same data, on the same machine, produced the namedtuple row of the summary table below.

The namedtuples also used significantly less memory. Here is a summary of the overall performance:

Test          Time        Memory
dict          44.358      6.66 GB
namedtuple    20.682      2.23 GB

Using dictionaries is more than twice as slow, if a little more generic. Worse, using dictionaries takes almost three times the memory because of the aforementioned overhead of a mutable type versus an immutable one. These small adjustments can mean the difference between a performant application and searching for new solutions.

Serializing Data with Avro

These challenges with CSV as a default serialization format may prompt us to find a different solution, especially as we propagate data throughout our data pipelines. My suggestion is to use a binary serialization format like Avro, which stores the schema alongside compressed data. Because the file uses a binary serialization, it is as compact as CSV and can be read in a memory-efficient manner. Because the data is stored with its schema, it can be read from any code and immediately extracted and automatically parsed.

So let's return to our original example with the funding.csv data set. In order to convert our parsed FundingRecord rows, we need to first define a schema. Schemas in Avro are simply JSON documents that define the field names along with their types. The schema for a FundingRecord is as follows:
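
Here is a sketch of what that schema might look like; the exact field list is an assumption based on the columns discussed above, and fundedDate stays a string because of the datetime limitation noted below:

```json
{
    "namespace": "funding",
    "type": "record",
    "name": "FundingRecord",
    "fields": [
        {"name": "permalink",      "type": "string"},
        {"name": "company",        "type": "string"},
        {"name": "numEmps",        "type": ["int", "null"]},
        {"name": "category",       "type": "string"},
        {"name": "city",           "type": "string"},
        {"name": "state",          "type": "string"},
        {"name": "fundedDate",     "type": "string"},
        {"name": "raisedAmt",      "type": "int"},
        {"name": "raisedCurrency", "type": "string"},
        {"name": "round",          "type": "string"}
    ]
}
```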

Save this schema in a file called funding.avsc. To save out our parsed CSV to an Avro serialized format, use the following code:
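
A sketch of that serialization, using the Python 2 avro package; the paths, the deflate codec, and the read_funding_data parser from the earlier namedtuple sketch are assumptions:

```python
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

SCHEMA_PATH = "funding.avsc"
AVRO_PATH   = "data/funding.avro"

def parse_schema(path=SCHEMA_PATH):
    with open(path, 'r') as data:
        return avro.schema.parse(data.read())

def serialize_records(records, outpath=AVRO_PATH):
    avro_schema = parse_schema()
    with open(outpath, 'wb') as out:
        writer = DataFileWriter(out, DatumWriter(), avro_schema, codec='deflate')
        for record in records:
            # Convert the FundingRecord namedtuple back into a dictionary.
            record = dict(record._asdict())
            # Avro has no native datetime type, so store the date as a string.
            record['fundedDate'] = record['fundedDate'].strftime('%Y-%m-%d')
            # Empty employee counts become nulls to satisfy the ["int", "null"] union.
            record['numEmps'] = int(record['numEmps']) if record['numEmps'] else None
            writer.append(record)
        writer.close()

if __name__ == "__main__":
    # read_funding_data is the namedtuple parser sketched earlier in the post.
    serialize_records(read_funding_data("data/funding.csv"))
```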

Unfortunately, Avro doesn't currently support native datetime objects in Python, but hopefully it will soon. In the meantime, we simply convert our namedtuple back into a dictionary (or just use the DictReader directly), format the date as a string, and pass that to the Avro serializer.

Once saved to disk, the savings become apparent. The original file was approximately 92K. After serializing with Avro, we have compressed our data to 64K! Not only do we have a compressed file format, but we also have the schema stored alongside the data. Reading the data back is now easy and requires none of the parsing we did before.
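
A minimal sketch of reading the Avro file back:

```python
from avro.datafile import DataFileReader
from avro.io import DatumReader

def read_avro_records(path="data/funding.avro"):
    with open(path, 'rb') as data:
        reader = DataFileReader(data, DatumReader())
        for record in reader:
            yield record   # each record is a dict, already typed per the schema
        reader.close()

if __name__ == "__main__":
    for record in read_avro_records():
        print record
        break
```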

As you can see, the data in the dictionary has already been converted (at least the two integers), and no extra parsing is required. Avro will also enforce schema constraints and invariants, ensuring that your wrangling succeeds without needing something like a database to enforce the structure.

Not only that, but Avro is a standard for Apache Hadoop. If you're going to be loading this data into HDFS, it's a good idea to consider loading it in the Avro file format rather than as raw CSV. It can make your life a lot easier, especially if you're using Hadoop streaming.

Conclusion

In this post, we looked at several issues that arise when wrangling CSV data in Python. First, we reviewed the basics of CSV processing in Python, taking a look at the csv module and how it compares to Pandas and NumPy for importing and wrangling data stored in CSV files. Next, we highlighted the importance of encoding and how to avoid unicode issues. Then we took a look at the differences between dictionaries and namedtuples, and we saw how namedtuples are better when you know what data you're ingesting. And finally, we covered how to serialize CSV data with Avro so that the data is stored not only in a compact format but with its schema as well.

Hopefully, what you'll take away from this post is a basic data wrangling process that transforms serialized input data into a form your application or data model can use. The process relies on a lot of transformations to and from flat files on disk, but done in batch it is far faster than accessing databases directly. If you have CSVs in multiple formats, this strategy can come in very handy for normalizing the data into a consistent, backwards-compatible form for storing and accessing it in WORM (write once, read many) storage.

The code files to accompany this post can be found on Github.

As always, if you liked this post and you haven't already subscribed to the District Data Labs blog, go to the blog home page, click on the Subscribe button, and enter your email so that you don't miss any of the awesome data science posts we have coming up in the near future.

Finally, below are some additional resources where you can continue exploring some of the stuff we've covered.

Additional Reading

Python for Data Analysis

Apache Avro

Python CSV Module Documentation

Saving 9GB of RAM with Python Slots

Serializing Python Objects

Acknowledgements

I'd like to thank Tony Ojeda for his help with preparing and producing this post. This started as an IPython notebook full of tips for some colleagues and classmates, and without his help it would not be the polished post that you see in front of you now!
