2013-07-15

(This article was first published on Gary Sieling » R, and kindly contributed to R-bloggers)

In a previous article I proposed a thought experiment: one might model SQL execution plans in Javascript, and use this to design a database that supports such example plans. Such a system might allow new query languages to be built, or drop-in algorithms to be added to a data processing application (for instance, one might want to be able to do a “create index” for a LINQ query that actually runs against in-memory data).

In this article I will explore the problem from a different angle: a data structure that can handle massive amounts of data without holding it all in memory at once.

In both R and Python/Pandas there is rich support for an object called a DataFrame, which bears a strong resemblance to a table. If you load files into these, you will eventually find one large enough to run out of memory, even if you have a lot of RAM. Once that happens you have to resort to other techniques; it would be preferable to keep the same programming model and have the tools handle the data volume gracefully.

First, it is valuable to have a fairly large database table to test with. The Google n-grams data provides a good example: these are word and phrase counts from Google Books (scanned books), grouped by year. There are many thousands of these files; I chose the file of two-word phrases starting with “th”, which is 30 GB of text.

The following unix command splits the file into small blocks. I determined the block size experimentally, aiming for an average of around 64 KB, as this is a plausible value for a disk block size in a relational database.
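Roughly the following, where the input file name is a stand-in for the actual n-grams file and the line count per chunk is a guess to be tuned until the chunks average about 64 KB:

```sh
# -a 4 gives four-letter suffixes (xaaaa, xaaab, ...); -l is tuned so the
# resulting chunks average roughly 64 KB each.
split -a 4 -l 1500 googlebooks-eng-all-2gram-th.csv
```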

This generates loads of files, named sequentially: xaaaa, xaaab, xaaac, xaaad, and so on.



These files can be loaded trivially into an R dataframe, like so, whereas the full file wouldn’t support this:
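Something along these lines; the raw files are tab-separated with no header row, and the column names below are my own labels rather than anything in the data:

```r
# Read a single ~64 KB block into a data frame.
df <- read.table("xaaaa", sep = "\t", quote = "", stringsAsFactors = FALSE)

# Assumed layout: the phrase, the year, then one or more count columns.
names(df)[1:2] <- c("phrase", "year")

head(df)
```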

Year ranges run from 1800 to 2008; the head() call above shows what a few example rows look like.

You can also run queries on this data, although this obviously only searches the loaded block.
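For example, a simple filter over the loaded block (again using the assumed column names):

```r
# Pull the yearly counts for one phrase -- this only sees whatever rows
# happen to be in the single block we loaded above.
subset(df, phrase == "the dog" & year >= 1950)
```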

A typical filesystem would store blocks in a B+ tree for quick insertion and lookup. These trees are usually short and wide, which reduces the number of hops needed to find a record, and their leaf nodes are linked to one another, which supports scanning ranges of the tree.

I found a workable Javascript implementation of a B+ tree, which I’ve used for demonstration in this article.

The API for this implementation is pretty easy to follow. It has an interesting property that when you do “seek,” it stays in that location until the next seek, so you can traverse along the bottom of the tree if you wish.

To populate this tree, we need to add our block numbers, as this tree requires integers for the keys. Since the block files created above are named with base-26 suffixes (“aaaa”, “aaab”, and so on, after the leading “x”), we can convert each name into a number:
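A sketch of the conversion (the function name is mine):

```javascript
// Convert a split(1) file name like "xaaab" into an integer block id,
// treating the suffix after the leading "x" as a base-26 number.
function blockNameToId(name) {
  var suffix = name.replace(/^x/, '');
  var id = 0;
  for (var i = 0; i < suffix.length; i++) {
    id = id * 26 + (suffix.charCodeAt(i) - 'a'.charCodeAt(0));
  }
  return id;
}

blockNameToId("xaaaa"); // 0
blockNameToId("xaaab"); // 1
```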

It’s pretty easy to go the other way too:
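For instance:

```javascript
// Convert an integer block id back into a file name, padding the base-26
// suffix to four letters to match what split(1) produced.
function idToBlockName(id) {
  var suffix = '';
  while (suffix.length < 4) {
    suffix = String.fromCharCode('a'.charCodeAt(0) + (id % 26)) + suffix;
    id = Math.floor(id / 26);
  }
  return 'x' + suffix;
}

idToBlockName(1); // "xaaab"
```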

Now that we’ve defined these relationships, we can just insert all the numbers in the tree, like so:
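The exact call depends on the B+ tree library; assuming it exposes something like insert(key, value), the population step is roughly:

```javascript
// "tree" is an instance of the B+ tree implementation (construction omitted;
// the insert(key, value) signature is an assumption about its API).
var numBlocks = 10000; // however many chunk files split(1) actually produced
for (var i = 0; i < numBlocks; i++) {
  // Each value starts out empty -- just a name, no data yet.
  tree.insert(i, { name: idToBlockName(i) });
}
```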

At first glance this seems a bit stupid: we have invented the world’s most complicated array. But we have also stumbled across a neat database feature by accident: index-ordered tables. This table happens to be pre-sorted by word and then by year, which gives performance gains to certain queries. For instance, a “count(distinct word)” could be computed without maintaining a large “set” of words in memory, since equal words sit next to each other and a running counter suffices.

In a relational database, storage doesn’t necessarily keep rows in any particular order. Each insert may fill a slot in an existing page, so several sequential inserts may put data in scattered locations. Interestingly, it is possible to influence paging in Oracle during bulk operations: “TRUNCATE TABLE” immediately deallocates all of a table’s pages without inspecting their contents (and cannot be rolled back), and “INSERT /*+ append */” writes into freshly allocated pages, skipping the step of looking for empty slots in existing ones (good for a single massive insert, and very slow for many small inserts).

Note that I’ve left the objects at each node empty: when a node is accessed, its block will be loaded into memory. Since I’m doing this in a browser, I’m going to load a block using Ajax. Once the block has been loaded, the getData method will replace itself with a memoized copy of the data, which helps subsequent queries (this is why a query will run faster the second time you run it, and how one query can affect the performance of subsequent queries):
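A minimal sketch of the idea, using a synchronous XMLHttpRequest for brevity; the URL layout and the helper name are assumptions:

```javascript
// Attach a getData() that fetches the node's block over Ajax on first use,
// then replaces itself with a memoized version that returns the cached rows.
function attachLoader(node) {
  node.getData = function () {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/blocks/' + node.name, false); // synchronous, for simplicity
    xhr.send();
    var rows = xhr.responseText.split('\n');
    node.getData = function () { return rows; };    // memoized replacement
    return rows;
  };
}
```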

This solves the problem of loading and caching data blocks, but we’re still going to run out of RAM if we run a long query; we need a way to kick blocks out of the system and revert them to the ‘unloaded’ state.
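One simple way to do that, keeping at most ten blocks in memory and evicting the oldest (the bookkeeping call would be made whenever a block is first fetched):

```javascript
var loaded = [];      // nodes currently holding data, oldest first
var MAX_LOADED = 10;  // deliberately tiny, so eviction is easy to observe

// Call this right after a node's block has been fetched.
function noteLoaded(node) {
  loaded.push(node);
  if (loaded.length > MAX_LOADED) {
    var victim = loaded.shift();
    // Re-attaching the loader puts the node back in the 'unloaded' state and
    // drops the only reference to its rows, so the GC can reclaim them.
    attachLoader(victim);
  }
}
```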

Now, we can combine this with the tree.
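Continuing to hedge on the exact library API (seek() is assumed to return the value stored at a key):

```javascript
// Fetch the 1000th block through the tree and show its first few lines.
var node = tree.seek(1000);
if (!node.getData) attachLoader(node); // attach the lazy loader on first use
var rows = node.getData();
noteLoaded(node);                      // register it for possible eviction
console.log(rows.slice(0, 5).join('\n'));
```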

This shows the first lines of the 1000th file. If you then load eleven blocks, the first block is kicked out and eventually freed by the garbage collector (the limit is set low for testing purposes; in practice you would obviously allow far more blocks in memory).

This, I suggest, is a good implementation of a DataFrame for large tables – it supports “show me the first 100 rows” queries common in investigation, and scales well to paging through troublesomely large data sets.

One could loop over this table to generate an index: a mapping of entries like {“the dog”: “xamym”} that tells you where to start looking for a specific record. Once such an index was built, lookups would be quite fast. Properly implemented, these indexes could be used by the R or Pandas DataFrame APIs (or a system like LINQ) to support dramatic performance improvements, without re-architecting a system that crumbles under load.
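A rough sketch of building such an index, reusing the assumed helpers from above and taking the first tab-separated field of a row to be the phrase:

```javascript
// Map the first phrase in each block to that block's file name, so a later
// lookup knows which block to start scanning from.
function buildIndex(numBlocks) {
  var index = {};
  for (var i = 0; i < numBlocks; i++) {
    var node = tree.seek(i);
    if (!node.getData) attachLoader(node);
    var rows = node.getData();
    noteLoaded(node);
    var firstPhrase = rows[0].split('\t')[0]; // e.g. "the dog"
    index[firstPhrase] = node.name;           // e.g. {"the dog": "xamym"}
  }
  return index;
}
```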
