2013-02-25

Perhaps as you read the previous post, you wondered "What can ersatz do for me, besides being fast? (And how fast is it, anyway?)"
We'll cover its capabilities by going through its config options, which may be a little dry, but when we're done you should have a good idea of how to manipulate ersatz into doing what you want. First, though, we'll answer "how fast" with some benchmarks.
Postgres is set up with the following resources:

We're loading a file with 600,000 rows and 15 text-type columns; the same 3 columns are always null, and the rest contain 10 characters each, with no quotes or delimiters in the fields (though we're still running everything with quote detection on). Our table is empty at the beginning of every test.
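(If you want a file with the same shape to play with, something like this quick sketch will do; the filename and which columns stay null are arbitrary choices.)

```python
# Sketch: build a test file shaped like the one described above
# (600,000 rows, 15 text columns, the same 3 always empty, the rest
# 10 characters each, no quotes or delimiters in the data).
import csv
import random
import string

NULL_COLUMNS = {3, 7, 11}  # arbitrary; any fixed 3 of the 15 will do

def random_field():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(10))

with open('test_600k.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for _ in range(600000):
        writer.writerow(['' if i in NULL_COLUMNS else random_field()
                         for i in range(15)])
```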
pgloader's config looks like this:

The time to load looks like this:

Next we run ersatz on the same file:

Hey, not bad. I was actually concerned at this point, since I thought this could go much faster. After all, if we're doing no table manipulations and just run COPY directly we get:

You'll notice we basically lose all the "building buffer" time from the ersatz run. (In my experience, COPY times are variable enough that a random 20% difference from run to run is within expectation.) One thing I considered is that inside ersatz, we're basically reading in and parsing the file as a CSV, which deals with quoting and delimiting, writing the manipulated data out to a buffer using a CSV writer, which *writes* quoting and delimiting, then reading the buffer using COPY's CSV flag, which interprets quoting and delimiting yet again. It seems like there's some redundancy here.
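To make that redundancy concrete, here's a stripped-down sketch of the pattern (this isn't ersatz's actual code; the table name, file name, and psycopg2-as-driver are just assumptions for illustration):

```python
# Stripped-down sketch of the read-as-CSV / write-as-CSV / COPY-as-CSV
# pattern described above. Not ersatz's code; names are placeholders.
import csv
import io
import psycopg2

def transform(row):
    return row  # stand-in for whatever per-row manipulation you need

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

with open('input.csv') as f:
    reader = csv.reader(f)               # parses quoting/delimiting (pass 1)
    buf = io.StringIO()
    writer = csv.writer(buf)             # re-writes quoting/delimiting (pass 2)
    for row in reader:
        writer.writerow(transform(row))
    buf.seek(0)
    # COPY's CSV mode interprets quoting/delimiting one more time (pass 3)
    cur.copy_expert("COPY some_table FROM STDIN WITH CSV", buf)

conn.commit()
```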
Postgres offers 3 modes for COPY: TEXT, CSV, and BINARY.
TEXT assumes raw text split by some delimiter character, but also that any delimiters in the data are escaped with \. This takes you down the usual text-escaping rabbit hole where you have to escape the escape character, which, of course, also shows up in things like newline, backspace, and tab. For pre-processing of files, this incurs a lot of overhead. This is how pgloader processes data before writing it to the buffer. Maybe this can happen fast in C, but in Python it is slow. The benefit is that the actual COPY command is blazingly fast for importing text. Not fast enough to make up for the processing overhead, but definitely the fastest of the COPY modes.
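To get a feel for where that overhead comes from, here's a rough sketch of the kind of per-field escaping a TEXT-mode loader has to do in Python (not pgloader's code, just the general shape of the work):

```python
# Rough sketch of TEXT-mode escaping: every backslash must itself be
# escaped, and delimiter/newline/carriage-return characters in the data
# have to be turned into backslash escapes. NULLs become \N.
def escape_text_field(value, delimiter='\t'):
    if value is None:
        return r'\N'
    value = value.replace('\\', '\\\\')             # escape the escape character first
    value = value.replace(delimiter, '\\' + delimiter)
    value = value.replace('\n', '\\n')
    value = value.replace('\r', '\\r')
    return value

def format_text_row(row, delimiter='\t'):
    return delimiter.join(escape_text_field(v, delimiter) for v in row) + '\n'
```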
BINARY is a specific format for each field written in raw binary data. This, surprisingly, takes a long time to COPY in my benchmarks. I suspect the fact that my data is all text has something to do with this. I leave this case as an exercise for the reader, but check out this stackoverflow question for an instance where someone seems to have found BINARY mode to be faster for numeric data: http://stackoverflow.com/questions/8144002/use-binary-copy-table-from-with-psycopg2. I also found it took a huge amount of time to build the binary format in the buffer in Python. Again, this might be able to happen fast in C.
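For the curious, building COPY BINARY input in Python looks roughly like this; it's a sketch of the PGCOPY format for all-text fields, not code from ersatz or pgloader:

```python
# Sketch of building COPY BINARY input in Python. Each tuple is a 16-bit
# field count followed by, for each field, a 32-bit byte length (-1 for
# NULL) and the raw bytes; the stream starts with a fixed signature and
# ends with a -1 field count. Doing this struct.pack work for every field
# of every row is where the Python-side time goes.
import struct

PGCOPY_HEADER = b'PGCOPY\n\xff\r\n\x00' + struct.pack('!ii', 0, 0)
PGCOPY_TRAILER = struct.pack('!h', -1)

def binary_row(fields):
    out = [struct.pack('!h', len(fields))]
    for value in fields:
        if value is None:
            out.append(struct.pack('!i', -1))
        else:
            data = value.encode('utf-8')  # text columns; other types differ
            out.append(struct.pack('!i', len(data)))
            out.append(data)
    return b''.join(out)
```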
CSV is, surprisingly, the best choice here. Some benchmarks:

Note that the text mode row processing times are meaningless, since they *do not* do escape processing; they simply shoot everything into the buffer naively. Direct COPY processing time here is the time to run Linux's cut utility on the input file. In this case, this is a trivial transformation, since I'm grabbing every column from my input files, but it's worth noting that if you are planning on running direct COPY, you're going to have to preprocess the file in some other way. Obviously, if you've got a simple enough set of transformations, running COPY directly is the way to go. If things are a bit dicier, maybe ersatz can help!
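For reference, if a simple external preprocessor like cut really is all you need, the direct COPY route can be as small as something like this (the column range, table name, and file name here are placeholders):

```python
# Sketch: preprocess with cut and stream the result straight into COPY.
# Column numbers, table name, and file name are made up for illustration.
import subprocess
import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

cut = subprocess.Popen(
    ['cut', '-d', ',', '-f', '1-12', 'input.csv'],
    stdout=subprocess.PIPE,
)
cur.copy_expert("COPY some_table FROM STDIN WITH CSV", cut.stdout)
cut.wait()
conn.commit()
```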

What can ersatz do?
It can do lots of stuff. To run through the available functionality in ersatz, we'll look at the configuration options that set up its fancy capabilities. ersatz uses Python-formatted config files (see Mike's post series on config styles here: http://neworganizing.com/content/blog/python-configs). It has some universal config options and some table-specific config options. Here is a table config example:

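Since every config is just Python, a table config is a dict along these lines (the option names are the ones discussed below; the values, file names, and table names are made up for illustration, and the examples directory in the repo has real, complete configs):

```python
# Illustrative table config; option names are discussed below, values are
# made up. See the examples directory in the ersatzpg repo for real configs.
from collections import OrderedDict

electoral_district_cd_import = {
    'filename': '/data/voterfile.csv',
    'table': 'electoral_district_cd_import',
    'skip_head_lines': 1,             # header rows to skip
    'field_sep': ',',
    'quotechar': '"',
    'copy_every': 100000,             # rows buffered per COPY
    'force_not_null': ('name',),      # passed straight through to COPY
    'udcs': {'source': 'voterfile_2013'},   # constant-valued columns
    'partitions': OrderedDict([('state', ('NH', 'VT', 'ME'))]),
    'partition_table_pattern': 'electoral_district_cd_import_{state}',
    'columns': {
        'name': 23,                   # other source formats are covered below
        'state': 2,
    },
}
```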
This is a lot to digest, but for the moment we'll focus on the columns entry. This dictionary has the destination table columns as keys and the data sources for those columns as values. There are several different ways to define column sources:

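The simplest form is just a csv column index as the value (one entry of the columns dict, sketched):

```python
columns = {'name': 23}
```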
This simply says populate the 'name' table column from column 23 (0 indexed) of the csv.

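A column can also be fed from a named incrementing key. One plausible shape for that entry (the name 'cd_key' is invented here, and the exact way ersatz expects the key reference may differ):

```python
columns = {'id': 'cd_key'}   # 'cd_key' is defined as a key source elsewhere in the config
```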
This populates the 'id' table column from a named incrementing key (defined elsewhere in the config file)

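Function-valued sources use a dict. A sketch, with an invented helper function (the 'function', 'columns', and 'defaults' keys are the real ones described below):

```python
# A hypothetical multi-valued source: make_district is an invented helper
# that builds two table columns from two csv columns plus a constant.
def make_district(district_num, state, country='us'):
    return ('cd', '{}/{}/cd:{}'.format(country, state.lower(), district_num))

columns = {
    ('district_type', 'district_ocdid'): {
        'function': make_district,
        'columns': (4, 7),              # csv indexes supplying the positional arguments
        'defaults': {'country': 'us'},  # constant keyword arguments
    },
}
```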
The key here is a bit different: it is a group of table columns that are populated by the multi-valued value entry, because the value is a function that returns multiple values. Function values are defined using a small dictionary with a standard set of keys: 'function', 'columns', and optionally 'defaults'. The value of 'function' is an actual Python function object (being able to include these is an advantage of using Python-formatted config files). This function can be any Python function, but it must return an iterable of return values, even if it means to return one value (a simple solution for pre-existing functions is to wrap the function in a decorator that converts its return value to a single-valued tuple). The value of 'columns' is the indexes of the csv columns that provide the arguments for the function. Currently, dictionaries of column-sourced function arguments are not implemented; this would be an easy improvement. The value of 'defaults' is a dict of argument names to constant values to pass as those arguments to the function.
These different column definition formats allow a huge amount of flexibility in mapping file data to database table data. Since any function that takes an arbitrary set of file columns as input and an arbitrary set of database columns as output may be used, almost any row-to-row mapping is possible. Of course, one must note what is excluded here. Maps that depend on multiple rows of data are not directly supported (though if you wanted to hack something with globals you could do some things that only require one pass through the data; counting, for instance. The incrementing key capability does something like this.) Also, the speed of import is affected by the functions chosen. Something like the (purely hypothetical) function sketched below is going to add some processing time to your import:

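```python
# A purely hypothetical column function: per-row regex work and string
# building like this runs once for every one of the 600,000 rows.
import re

def normalize_district(raw_district, state):
    match = re.search(r'(\d+)', raw_district)
    number = match.group(1).lstrip('0') if match else ''
    return ('{}-cd-{}'.format(state.lower(), number),)
```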
What about some of the other entries in this config? A few are obvious: 'skip_head_lines' sets the number of rows in the file that are not data. 'quotechar' is the csv quoting character. 'field_sep' is the field separator. 'filename' is the location of the file. 'copy_every' is how many lines to format and push into the buffer before running Postgres' COPY command. 'table' is the name of the database table into which data is being imported. Less obvious is 'force_not_null'. This indicates to Postgres which columns should be populated with empty strings rather than NULL values; it is passed directly to the COPY command and is not used in any ersatz processing. 'udcs' is a dict of constant columns: each key/value entry is a column name and the constant value to fill that column with. This can be useful if you are, say, populating one partition of a split table.
What if we want to import data into partitions but *don't* want to go through the hassle of splitting up the data into separate files and importing each file individually (which seems silly)? That's why 'partitions' and 'partition_table_pattern' exist. These options provide the capability to split up imports into different tables based on explicit values of select columns. (Partitioning on non-explicit values like ranges is not implemented; this seems kind of tricky and like it could invoke some pretty messy configuration syntax.) 'partitions' is an ordered dictionary that maps column names to the set of values that column can partition on. 'partition_table_pattern' provides a table name pattern ready for Python's .format function to be called on it. (Note that this dictates that ersatz can only help you partition if your table names explicitly call out the partition.) This works by inserting column values into the table name format string. So in our example above, if the 'state' column was populated with the value 'NH', the partition_table_pattern would evaluate to 'electoral_district_cd_import_NH' when we ran 'electoral_district_cd_import_{state}'.format(state='NH').
The partitioning capability of ersatz is limited by a few factors. The aforementioned config syntax issues associated with non-explicit-value partitioning are one. Another is that ersatz currently initializes buffers for each table partition before importing any data, so the possible partition values must be explicitly provided ahead of time. This seems like something we might be able to solve with lazy initialization of buffers, which would remove the requirement to state these possible partitioning values in the config file. Of course, the database table partitions and partitioning rules have to be explicitly set up ahead of time, so this doesn't free us up to partition on an unknown set of values.
One question that might come to mind about this partitioning scheme is "COPY cascades through partition rules, so why not just copy into the master table and let Postgres do the work?" Well, it turns out that we can explicitly split up the data faster as we read it in than Postgres can when it partitions a bulk import. Much, much faster. We'll provide benchmarks later when we go into more technical details about how this is accomplished.

The above config example is for one table. Some ersatz config options exist that control universal or cross-table behavior. A config example with these options:

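Again, a sketch rather than a verbatim config; the option names are the ones described below, and the connection details, key names, and table names are placeholders:

```python
# Illustrative top-level options; the option names are described below,
# the values are placeholders.
electoral_district_cd_import = {}   # stand-ins for table config dicts like
voter_import = {}                   # the one sketched earlier

config = {
    'user': 'etl_user',
    'db': 'voterfile',
    'pw': 'secret',
    'debug': False,                  # drop into pdb when something terrible happens
    'use_utf': True,                 # the UTF-8 conversion from the previous post
    'testonly': False,               # roll everything back instead of committing
    'key_sources': {'cd_key': 1},    # shared incrementing keys: name -> initial value
    'parallel_load': (
        {
            'tables': ('electoral_district_cd_import', 'voter_import'),
            'keys': {'cd_key': 'cd_key'},   # key name used in columns -> key source
        },
    ),
    'tables': {
        'electoral_district_cd_import': electoral_district_cd_import,
        'voter_import': voter_import,
    },
}
```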
'user', 'db', and 'pw' are the database connection fields. 'host', 'port', and 'sslmode' are also available but not necessary. 'debug' turns on dropping into the pdb Python debugger if something terrible happens. 'use_utf' turns on the UTF-8 conversion discussed in the previous blog post. 'testonly' rolls back all database imports before quitting, which allows the user to check data for cleanliness before actually committing it to the database. 'tables' lists the table config dicts such as the one discussed above (the key names here are not passed to Postgres; they are used by ersatz and the user to keep track of which config is which).
'parallel_load' and 'key_sources' are somewhat complicated. 'key_sources' is a dictionary of names and initial values for shared incrementing keys. One source can be used by many tables, so that tables loaded separately can be keyed back together. This allows, for example, knitting together data that has been imported from separate files but has some kind of order-based relationship. Key sources are accessed via the parallel_load block.
'parallel_load' is a tuple of dictionaries, each with two fixed keys. Each dictionary defines a set of tables that are to be loaded in parallel. The 'tables' key maps to a tuple of table names (the keys in the 'tables' dict) to be loaded in parallel. The 'keys' key maps to a dict of key names to key sources. The key names in this dict may be used by the tables being loaded to populate columns with shared key values. (NOTE: nothing stops you from trying to reference a key name in a table config that is not listed in a parallel_load block. This is a misconfiguration and will cause ersatz to crash; without the parallel_load block to attach a key source to a key name, the table has no idea where to look for the key value.)
So, what does parallel_load actually do? It links the listed tables together so that rows of each are constructed simultaneously. A row for each table is populated in each table's buffer before moving on to the next row for any of the tables. Aside from sewing together separate files, a useful case for this is when a single file needs to be split into multiple tables, for example if different subsets of columns belong in different tables, as often happens with voter files. Running these together from the same file is much faster than opening the file, grabbing a subset of columns for a table, then re-opening the file for the next table.
(could parallel_load mean parallel computing? Maybe someday in the wistful future.)

Some examples of manipulations we've done with ersatz include loading electoral districts listed in separate columns of a voter file into separate tables based on type, creating electoral district identifiers by combining multiple columns of data, mapping one source's district designations to another by creating a mapping function, importing into specific partitions for separate elections and states, creating import timestamps by wrapping the Python now function and using it to populate a column, and much more! A toy example of ersatz with a complete config file is available here: https://github.com/natgaertner/ersatzpg/tree/master/examples

We hope this has gotten you excited about the possibilities of ersatz for your data. Next we'll discuss some of the intricacies of how ersatz accomplishes these feats. If you think Python features like itertools and coroutines are cool, you might think this is cool!
