pgloader loads data into PostgreSQL. The new version is stable enough
nowadays that it's soon to be released, the last piece of the
3.1.0 puzzle
being full
debian packaging of the tool.
The pgloader logo is a loader truck, just because.
As you might have noticed if you've read my blog before, I decided that
pgloader needed a full rewrite in order for it to be able to enter the
current decade as a relevant tool. pgloader used to be written in the
python programming language, which is used by lots of people and generally
quite appreciated by its users.
Why changing
Still, python is not without problems, the main ones I had to deal with
being
poor performances and lack of threading capabilities. Also, the
pgloader setup design was pretty hard to maintain, and adding compatiblity
to other
loader products from competitors was harder than it should.
As I said in my
pgloader lightning talk at the
7th European Lisp Symposium
last week, in searching for a
modern programming language the best candidate
I found was actually
Common Lisp.
After some basic performances checking as seen in my
Common Lisp Sudoku Solver project where I did get up to
ten times faster
code when compared to python, it felt like the amazing set of features of
the language could be put to good use here.
So, what about performances after rewrite?
The main reason why I'm now writing this blog post is receiving emails from
pgloader users with strange feelings about the speedup. Let's see at the
numbers one user gave me, for some data point:
The raw numbers have been loaded into a PostgreSQL table
So what we see in this quite typical
CSV Loading test case is a best case of
30 times faster import. Which brings some questions on the table, of course.
Wait, you're still using COPY right?
The
PostgreSQL database system provides a really neat
COPY command, which in
turn is only exposing the
COPY Streaming Protocol, that pgloader is using.
So yes,
pgloader is still using
COPY. This time the protocol implementation
is to be found in the Common Lisp
Postmodern driver, which is really great.
Before that, back when pgloader was python code, it was using the very good
psycopg driver, which also exposes the COPY protocol.
So, what did happen here?
Well it happens that pgloader is now built using Common Lisp technologies,
and those are really great, powerful and fast!
Not only is Common Lisp code compiled to
machine code when using most
Common Lisp Implementations such as
SBCL or
Clozure Common Lisp; it's also
possible to actually benfit from
parallel computing and
threads in Common
Lisp.
That's not how I did it!
In the
pgloader case I've been using the
lparallel utilities, in particular
its
queuing facility to be able to implement
asynchronous IOs where a thread
reads the source data and preprocess it, fills up a batch at a time in a
buffer that is then pushed down to the writer thread, that handles the
COPY
protocol and operations.
So my current analysis is that the new thread based architecture used with a
very powerful compiler for the Common Lisp high-level language are allowing
pgloader to enter a whole new field of
data loading performances.
Conclusion
Not only is pgloader so much faster now, it's also full of new capabilities
and supports several sources of data such as
dBase files,
SQLite database files or even
MySQL live connections.
Rather than a configuration file, the way to use the new pgloader is using a
command language that has been designed to look as much like SQL as possible
in the pgloader context, to make it easy for its users. Implementation wise,
it should now be trivial enough to implement compatibility with other
data
load software that some
PostgreSQL competitor products do have.
Also, the new code base and feature set seems to attract way more users than
the previous implementation ever did, despite using a less popular
programming language.
You can already
download pgloader binary packages for
debian based
distributions and
centos based ones too, and you will even find a
Mac OS X
package file (
.pkg) that will make
/usr/local/bin/pgloader available for you
on the command line. If you need a windows binary, drop me an email.
The first stable release of the new
pgloader utility is scheduled to be
named
3.1.0 and to happen quite soon. We are hard at work on packaging the
dependencies for
debian, and you can have a look at the
Quicklisp to debian
project if you want to help us get there!