Charlesleifer.com

Five reasons you should use SQLite in 2016

2016-01-06

If you haven't heard, SQLite is an amazing database capable of doing real work in real production environments. In this post, I'll outline 5 reasons why I think you should use SQLite in 2016.

PS: the site you're reading this post on? Yep, SQLite.

Simple to manage

Have you ever managed a Postgres database? There are quite a few things you need to understand in order to ensure the database server is properly configured (shared buffers, effective cache size, work mem, maintenance work mem, wal buffers...). Upgrading can be a scary process as well: you may need to take your database offline, run a program against it to upgrade it, and hope that it works when you turn things back on. And furthermore, do you know where exactly your Postgres database is stored? Can you point somewhere and say, "that's my database"?

(Now, we all know that there are many situations where nothing but Postgres (or MySQL, Oracle, SQL Server, etc) will do for the requirements of the application. My purpose here is not to say that it is otherwise. Rather I just want to highlight the difference between managing a SQLite db compared to a typical database server.)

SQLite is easy to manage -- it's a single file (or, at times, a file plus a transaction log). The file format is stable across major versions, so if I had a SQLite database file from version 3.0.0 (back in 2004), I could read it using the latest SQLite 3.10.0. If I want to take the database file with me on a thumb drive, I just copy the file, or better yet keep it in my Dropbox folder. If I want to back things up every night, I just sync the database file to S3. If I want to share some data analysis I'm doing with a coworker, I can just send them a copy of the database file and they are ready to go. Having the database in a single file with a stable format is a feature.

What's more, SQLite is very easy to configure. SQLite features are managed in two ways: compilation flags and PRAGMA statements (run-time configuration). There are no configuration files to speak of. You just build the library with the features you want, then configure the run-time options when you create a database connection.
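As a rough illustration of run-time configuration, here is a minimal sketch using Python's standard library `sqlite3` module. The particular settings (`foreign_keys`, `cache_size`) and their values are just examples I picked for illustration, not recommendations:

```python
import sqlite3

# An in-memory database keeps the example self-contained; a file path
# would be configured in exactly the same way.
conn = sqlite3.connect(":memory:")

# Run-time options are set by executing PRAGMA statements on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("PRAGMA cache_size = -8000")  # negative value = size in KiB

# Most PRAGMAs can be read back like an ordinary query.
foreign_keys = conn.execute("PRAGMA foreign_keys").fetchone()[0]
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]

# The compilation flags the library was built with are visible too.
compile_options = [row[0] for row in conn.execute("PRAGMA compile_options")]
```

These settings apply to the connection that ran them, which is why they are typically issued right after connecting.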

Constantly improving yet rock-solid stability

SQLite is actively developed by some truly amazing software engineers. New features of high quality are added at an impressive rate. Just recently SQLite added support for JSON data via the json1 extension (for info on using this with Python check out this post). SQLite also released an improved version of the full-text search extension, which includes result ranking using the BM25 algorithm.

In addition to adding new features, SQLite developers are also working to make the library more performant. In version 3.8.11, the release notes contain this little blurb:

SQLite now runs twice as fast as version 3.8.0 and three times as fast as version 3.3.9

Despite all these changes and enhancements, SQLite rarely introduces bugs. The SQLite test suite is widely regarded as one of the best in the industry, and the How SQLite is tested page frequently appears on Hacker News as developer after developer discovers this impressive document.

Extensible and Hackable

My personal favorite feature of SQLite is its extensibility. Because SQLite is embedded by your application, it runs in the same address space and can execute application code on your behalf. Both the Python standard library SQLite driver, pysqlite, and the alternative driver apsw provide APIs for defining custom SQL functions, aggregate functions, and collations. apsw goes one step further and provides APIs for defining virtual tables and virtual filesystems!

As a practical example, suppose you have a column in a table that is storing URLs and you wish to determine which are the most common hostnames -- if you were using a different database you would be forced to write a complicated regex, a combination of string-manipulation functions, or pull the data into your app and do the calculations in code. With SQLite, you can define a hostname function in Python, and then use it in a simple COUNT query.
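Here is a minimal sketch of that idea using the standard library `sqlite3` module. The `links` table and the example URLs are made up for illustration, and the hostname is taken from `urlparse(...).netloc`, which is one reasonable way to extract it:

```python
import sqlite3
from urllib.parse import urlparse

def hostname(url):
    # SQL NULL arrives as None; pass it through unchanged.
    if url is None:
        return None
    return urlparse(url).netloc

conn = sqlite3.connect(":memory:")

# Register the Python function as a one-argument SQL function.
conn.create_function("hostname", 1, hostname)

# Hypothetical table and data, just for the example.
conn.execute("CREATE TABLE links (url TEXT)")
conn.executemany("INSERT INTO links (url) VALUES (?)", [
    ("http://example.com/blog/",),
    ("http://example.com/about/",),
    ("https://example.org/docs/index.html",),
])

query = """
    SELECT hostname(url) AS host, COUNT(*) AS ct
    FROM links
    GROUP BY host
    ORDER BY ct DESC, host
"""
results = conn.execute(query).fetchall()
```

With the sample rows above, `results` holds one `(host, count)` pair per hostname, most common first.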

You can also create aggregate functions, which accept 0..n values, and produce a single output value. Examples might be calculating the standard deviation, generating a string by processing the values in some way, doing some type of classification, etc.
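To make that concrete, here is a sketch of a standard deviation aggregate registered through the standard library `sqlite3` module. The `StdDev` class, the `samples` table, and the choice of the population (rather than sample) standard deviation are my own assumptions for the example:

```python
import math
import sqlite3

class StdDev:
    """Population standard deviation, accumulated with Welford's method."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared differences from the mean

    def step(self, value):
        # Called once per row; ignore NULLs like the built-in aggregates do.
        if value is None:
            return
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        # Called once at the end to produce the single output value.
        if self.n == 0:
            return None
        return math.sqrt(self.m2 / self.n)

conn = sqlite3.connect(":memory:")
conn.create_aggregate("stddev", 1, StdDev)

# Hypothetical table and data, just for the example.
conn.execute("CREATE TABLE samples (value REAL)")
conn.executemany(
    "INSERT INTO samples (value) VALUES (?)",
    [(v,) for v in (2, 4, 4, 4, 5, 5, 7, 9)],
)
(result,) = conn.execute("SELECT stddev(value) FROM samples").fetchone()
```

The aggregate class only needs a `step` method, which receives each row's value, and a `finalize` method, which returns the result.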

Virtual tables, which are currently only supported by apsw, allow you to define a table in code and then query it as if it were a normal SQL table, even though the backing data might be completely dynamic. For example, I wrote a simple virtual table that allows you to query Redis as if it were a SQL table.

You can also write table-valued functions (implemented as eponymous virtual tables), which are functions that return 0..n rows of results. An example might be a regular-expression search function, which processes input and yields rows of matching tokens. I've written a library, sqlite-vtfunc, which makes it extremely easy to write these types of functions.

Virtually every aspect of SQLite can be manipulated by your application.

Lightning fast

SQLite is very fast. It runs on the same machine, so there is no networking overhead when executing queries or reading results. It runs in the same address space, so there is no wire protocol, serialization or need to communicate via Unix sockets. SQLite runs on mobile devices where resources are scarce and efficiency is crucial. SQLite also supports a huge number of compilation flags that allow you to remove features you don't plan to use.

SQLite's speed makes up for one of its (perceived) greatest shortcomings, which is database-file locking for writes. Because data is written incredibly quickly, database locking only becomes an issue when there are a great many concurrent writers.

WAL Mode

The 3.7.0 release of SQLite added a new journaling method that utilizes a write-ahead log. By itself this isn't really exciting news, but what it means for web application developers (or anyone dealing with concurrency) is that readers no longer block writers, and vice versa. Or to put it another way, reading and writing can both occur simultaneously. Without WAL mode, in order to write to the database, the writer would acquire exclusive access to the database and no reading could occur until the writer was finished.

Here's an example illustrating the difference between the two. Let's say we have two processes, a writer and a reader. The writer opens an exclusive transaction (indicating the intent to write). Next, the reader opens a transaction. The reader then attempts to issue a SELECT statement:

Journal mode = "delete" (the default):

Writer: BEGIN EXCLUSIVE

Reader: BEGIN

Reader: SELECT * FROM foo;
Error: database is locked

Journal mode = "wal":

Writer: BEGIN EXCLUSIVE

Reader: BEGIN

Reader: SELECT * FROM foo;
Returns table contents
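If you'd like to try this yourself, here is a rough sketch that replays the two scenarios above with Python's standard library `sqlite3` module. For simplicity it uses two connections in a single process rather than two separate processes, and the `reader_result` helper, the temporary file name, and the contents of `foo` are all made up for the example:

```python
import os
import sqlite3
import tempfile

def reader_result(journal_mode):
    """Return what a reader gets while a writer holds an exclusive transaction."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.db")

        # isolation_level=None puts the driver in autocommit mode so that we
        # control BEGIN ourselves; timeout=0 makes a blocked connection fail
        # right away instead of waiting for the lock.
        writer = sqlite3.connect(path, isolation_level=None, timeout=0)
        reader = sqlite3.connect(path, isolation_level=None, timeout=0)
        try:
            writer.execute(f"PRAGMA journal_mode = {journal_mode}")
            writer.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, data TEXT)")
            writer.execute("INSERT INTO foo (data) VALUES ('hello')")

            writer.execute("BEGIN EXCLUSIVE")
            reader.execute("BEGIN")
            try:
                return reader.execute("SELECT * FROM foo").fetchall()
            except sqlite3.OperationalError as exc:
                return str(exc)
        finally:
            # Closing a connection rolls back its open transaction.
            reader.close()
            writer.close()

delete_mode_result = reader_result("delete")
wal_mode_result = reader_result("wal")
```

In "delete" mode the reader's SELECT should fail with the "database is locked" error, while in "wal" mode it should return the row that was committed before the writer began its transaction.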

It's worth noting, however, that even if you don't enable WAL mode, writes typically occur in milliseconds. This is such a small amount of time that you will only notice problems if you have very high concurrency or very long write transactions.

Bonus reason: BerkeleyDB

BerkeleyDB's SQLite integration can give application developers needing concurrent database access even better performance, because rather than locking the entire database, BerkeleyDB only needs to lock the individual page(s). This allows BerkeleyDB to scale more efficiently under concurrent database load, provided the transactions aren't contending for the same page of data. BerkeleyDB also supports multi-version concurrency control (MVCC), which allows read operations to continue to occur on a page of data being operated on by a write transaction.

An additional benefit of BerkeleyDB is increased efficiency. In other words, BerkeleyDB may use fewer system resources and execute fewer system calls. You can find more details in this whitepaper and this brief technical overview.

BerkeleyDB's SQL interface is a drop-in replacement for SQLite, and supports the same APIs and features. BerkeleyDB provides a few additional features like replication (SQLite has a backup utility, but my understanding is that it is not as powerful as BDB's), encryption, and of course all the features of BerkeleyDB itself.

One major downside of using BerkeleyDB is that it is very sensitive to configuration values, and getting the right page size, cache size and other settings requires deep knowledge. The other downside is the licensing. To read more about the BerkeleyDB license, check out Oracle's licensing page.

For instructions on compiling the Python SQLite driver for use with BerkeleyDB, check out this post.

The end

I hope you'll give SQLite a try. Don't believe the FUD about it not being production-worthy, or not being suitable for use in web applications.

If you'd like to read more, SQLite itself has a nice document describing when to use SQLite, which also includes a list of situations where another RDBMS would work better. I've also written a shorter post in the same vein called SQLite: Small. Fast. Reliable. Choose any three, which you might also enjoy. Lastly, you can check out my posts tagged "sqlite" for more topical posts.

Thanks so much for taking the time to read. I hope you'll leave a comment if you have any questions, comments or suggestions!