2014-05-27

What if I told you that you could generate the following JSON response
in PostgreSQL?
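Something shaped like this, with the notes carrying tag_ids and the tags side loaded (I'm assuming notes expose a content attribute and tags expose a name; the values are illustrative):

```json
{
  "notes": [
    {"id": 1, "content": "This is note 1", "tag_ids": [1, 2]},
    {"id": 2, "content": "This is note 2", "tag_ids": [3]}
  ],
  "tags": [
    {"id": 1, "name": "tag 1"},
    {"id": 2, "name": "tag 2"},
    {"id": 3, "name": "tag 3"}
  ]
}
```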

What if I told you that it is over 10X faster than plain ActiveModel::Serializers
for small data sets, and 160X faster for larger data sets?

Typically when you have an API serving up JSON responses, your web
framework serializes your data after retrieving it with its ORM. We'll
talk about Rails specifically in this article, but this will generally
apply to most frameworks. So the typical Rails request will roughly
follow this flow (I am purposely brushing over some parts of the
request/response cycle):

1. Rails receives the JSON request from the browser/client
2. Rails applies some business logic and crafts a query via ActiveRecord
3. ActiveRecord serializes its query and sends the query to PostgreSQL
4. PostgreSQL compiles the result set and serializes the records in its protocol format
5. ActiveRecord deserializes the records into a set of row objects
6. ActiveRecord converts the set of rows into a set of model object instances
7. Rails converts the set of model objects into a JSON string
8. Rails sends the JSON string down to the browser

Most of the time in this response cycle is spent in steps 6 and 7. Rails
has to deserialize one format, then store that deserialized content in
memory just to serialize it into a different format. Since PostgreSQL
supports JSON responses, we can use its JSON functions to serialize our
result set. That JSON response will still be serialized in PostgreSQL's
protocol format, but ActiveRecord can deserialize it as a single string
object, instead of a set of objects which it then converts and
reserializes. We end up with this response cycle instead:

1. Rails receives the JSON request from the browser/client
2. Rails applies some business logic and crafts a query via ActiveRecord
3. ActiveRecord serializes its query and sends the query to PostgreSQL
4. PostgreSQL compiles the result set, serializes it as JSON, then serializes the JSON in its protocol format
5. ActiveRecord deserializes the protocol format into a single JSON string
6. Rails sends the JSON string down to the browser

We are only removing 2 steps, but they account for the bulk of the time
spent generating the response. We are also limiting the number of Ruby
objects created, which reduces memory usage and time spent garbage
collecting short-lived Ruby objects used only for JSONification.

What Do We Gain by Generating Massive Queries for PostgreSQL?

It takes a lot of work to tell PostgreSQL to generate a specific
JSON object; what exactly does that buy us?
By doing all this in PostgreSQL, we avoid using CPU cycles
and memory on our web server. I've done some very naive and basic
testing with a new, unoptimized Rails project and a database of 1000
notes, each with 10 unique tags, totalling 10000 tags. When retrieving
all 11000 records with Rails and ActiveModel::Serializers, it took
roughly 9 seconds to generate the response. Most of that time was spent
in the view generating the JSON object in memory, with 657 milliseconds
in ActiveRecord, which (I think, until someone tells me otherwise)
includes creating all the model instances.

When we apply the PostgreSQL technique outlined later in this article to the
same result set, the response only takes 72 milliseconds for the first
request. If we rerun this same request, PostgreSQL caching kicks in and
the response time is 54 milliseconds. That is a ~160X throughput
increase when we use PostgreSQL to generate JSON payloads.

The above numbers are a bit skewed by the size of this test payload.
11000 objects would be completely crazy to present to an end user. If we
pare back our result set to 10 notes and 100 tags, the first and second
response times for Ruby-side JSONification are 187 and 118 milliseconds.
When using PostgreSQL to generate our JSON payload, the response times
are 92 and 12 milliseconds. That is a 2X improvement on the first
request and roughly 10X on the second. By utilizing PostgreSQL, we can
improve our applications' response times and throughput.

Announcing PostgresExt-Serializers

To utilize PostgreSQL this way, we have to generate a fairly complex
query manually. That is, until you include the PostgresExt-Serializers
gem in your project. PostgresExt-Serializers (PES) monkey
patches ActiveModel::Serializers (AMS),
and anywhere an ActiveRecord::Relation is serialized by AMS, PES will
take over and push the work to PostgreSQL. I wanted to build on the
awesome work in AMS's DSL for defining JSON schemas without having to
duplicate that work. I am finding some pain points in terms of
extracting the information I need from AMS to generate the SQL query,
but right now the code for PES is very immature, hence the 0.0.1 release.

Nitty-Gritty Details About How it All Works: Massive PostgreSQL Queries

Let's say we have an Ember application that we are generating the JSON
response for. The Ember app wants the list of notes, along with the tags
associated with those notes, and we will side load the tags. Side
loading means specifying the ids of the tags on each note, and then
including a single list of tags, which will be used to instantiate the
tags on the notes. The benefit of side loading is that it saves
bandwidth: we send tag ids and an array of de-duplicated tags, instead
of embedding duplicate tag objects under each note. We only want notes
with id < 40, which is arbitrary in this example but, as we will see,
has implications for the query we need to execute.

Here is the whole query we need to generate the JSON required, which is
also the example JSON at the beginning of this article:
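(The table and column names below are assumptions for the sake of
illustration: a notes table with id and content columns, and a tags
table with id, name, and a note_id foreign key.)

```sql
WITH note_ids AS (
  -- ids of the notes we will return
  SELECT id
  FROM notes
  WHERE id < 40
), tag_ids_by_note AS (
  -- tag ids aggregated into an array, grouped per note
  SELECT note_id, array_agg(id) AS tag_ids
  FROM tags
  GROUP BY note_id
  HAVING note_id IN (SELECT id FROM note_ids)
), notes_attributes_filter AS (
  -- only the note columns we want to expose
  SELECT notes.id, notes.content,
         coalesce(tag_ids_by_note.tag_ids, '{}') AS tag_ids
  FROM notes
  LEFT OUTER JOIN tag_ids_by_note
    ON tag_ids_by_note.note_id = notes.id
  WHERE notes.id < 40
), notes_as_json_array AS (
  -- all notes as a single JSON array
  SELECT array_to_json(array_agg(row_to_json(notes_attributes_filter))) AS notes,
         1 AS match
  FROM notes_attributes_filter
), tags_attributes_filter AS (
  -- only the tag columns we want to expose
  SELECT id, name
  FROM tags
  WHERE note_id IN (SELECT id FROM note_ids)
), tags_as_json_array AS (
  -- all tags as a single JSON array
  SELECT array_to_json(array_agg(row_to_json(tags_attributes_filter))) AS tags,
         1 AS match
  FROM tags_attributes_filter
)
-- join the two single-row CTEs and emit one JSON object
SELECT row_to_json(results)
FROM (
  SELECT notes, tags
  FROM notes_as_json_array
  JOIN tags_as_json_array
    ON notes_as_json_array.match = tags_as_json_array.match
) results;
```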

Let's break it down. You'll notice that I am making use of Common Table
Expressions (CTEs). CTEs allow you to use temporary table definitions
in queries instead of embedding the subqueries directly in your query.

Gathering our Note Ids

The first important step is getting the note ids of our final result
set, which we do with:
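```sql
-- fragment of the WITH clause from the query above
note_ids AS (
  SELECT id
  FROM notes
  WHERE id < 40
)
```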

We are creating a CTE that represents the ids for our notes; we'll be
using this extensively to generate our tag-related records.

Getting Tag Ids Grouped by Note Ids

From our note_ids, we can assemble a list of tag ids grouped by notes.
This will be used to create the tag_ids attribute on the notes later
on.
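That piece of the big query above looks like this:

```sql
tag_ids_by_note AS (
  SELECT note_id, array_agg(id) AS tag_ids
  FROM tags
  GROUP BY note_id
  HAVING note_id IN (SELECT id FROM note_ids)
)
```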

Our projection is the note_id, plus an array_agg of the ids of the
tags in our grouping. array_agg aggregates the group into an array.
This projection will return something like the following (values
illustrative):
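```
 note_id | tag_ids
---------+---------
       1 | {1,2}
       2 | {3}
```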

In this example, the tags belong_to a note, so we are retrieving this
data from the tags table. If this was a many-to-many relation, this
query would execute against the join table (i.e. notes_tags).

We group our tags by the note_id, and we use the HAVING clause to
only group tags which have a note_id contained in the note_ids CTE
that we created at the beginning.

Generating Our Note Records

Most of the time, we don't want to expose all of our record data to
Ember, since whatever we send to the client will be accessible by the
user, whether we intend it to be or not. We filter down the attributes
sent to Ember by limiting the columns in our projection.
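Assuming we only want to expose id, content, and tag_ids, the CTE looks
like this:

```sql
notes_attributes_filter AS (
  SELECT notes.id, notes.content,
         coalesce(tag_ids_by_note.tag_ids, '{}') AS tag_ids
  FROM notes
  LEFT OUTER JOIN tag_ids_by_note
    ON tag_ids_by_note.note_id = notes.id
  WHERE notes.id < 40
)
```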

Also note that in the projection, we are using coalesce
to ensure that we return an empty array if a specific note has no tag_ids.
We are using a LEFT OUTER JOIN to combine our previously generated
tag id groupings with our notes. We use an OUTER JOIN instead of an
INNER JOIN so that all our notes are returned, even if no tags are
associated with them; an INNER JOIN would only return notes that have
associated tags. We also use the same WHERE predicate in this
query as we did in the note_ids CTE, to ensure our query only returns
the desired records.

Turning Our Note Records into a Single JSON Array

So now that we have our notes records filtered down, we need to create a
JSON array of these records to use in our final query. At this point, we
will use two of PostgreSQL's JSON functions and the array_agg
function that we used earlier. row_to_json takes a PostgreSQL row and
converts it to a JSON object, where the columns of the row are converted
into JSON properties. A row like this (values illustrative):
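```
 id |    content     | tag_ids
----+----------------+---------
  1 | This is note 1 | {1,2}
```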

will be converted to:
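```json
{"id":1,"content":"This is note 1","tag_ids":[1,2]}
```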

So at this point, our result set is a series of rows with a single
column of JSON representing the original PostgreSQL rows from our
notes_attributes_filter CTE. We then use array_agg to turn the
rows of JSON objects into a single row containing a single PostgreSQL
array of JSON objects, so rows like these:
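```
                     row_to_json
------------------------------------------------------
 {"id":1,"content":"This is note 1","tag_ids":[1,2]}
 {"id":2,"content":"This is note 2","tag_ids":[3]}
```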

will be converted to:
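```
{"{\"id\":1,\"content\":\"This is note 1\",\"tag_ids\":[1,2]}","{\"id\":2,\"content\":\"This is note 2\",\"tag_ids\":[3]}"}
```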

Lastly, we use array_to_json to convert the PostgreSQL array of JSON to a JSON array.

After combining these pieces, we get the following query:
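```sql
notes_as_json_array AS (
  SELECT array_to_json(array_agg(row_to_json(notes_attributes_filter))) AS notes,
         1 AS match
  FROM notes_attributes_filter
)
```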

which yields something like:
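```
                           notes                            | match
------------------------------------------------------------+-------
 [{"id":1,"content":"This is note 1","tag_ids":[1,2]}, ...] |     1
```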

We are using the notes_attributes_filter CTE as our source for all the
JSON functions, and adding a match column with a value of 1, which
we will need later.

Aggregating Our Tag Records

We apply the same attribute filtering and aggregation techniques to our
tags table to generate our JSON array of tags. Note that when we
filter the tags attributes, we only include tags that have a note_id
of a note we are returning.
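Assuming we only expose id and name on the tags, those two CTEs look
like this:

```sql
tags_attributes_filter AS (
  SELECT id, name
  FROM tags
  WHERE note_id IN (SELECT id FROM note_ids)
),
tags_as_json_array AS (
  SELECT array_to_json(array_agg(row_to_json(tags_attributes_filter))) AS tags,
         1 AS match
  FROM tags_attributes_filter
)
```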

which yields something like:
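```
                          tags                          | match
--------------------------------------------------------+-------
 [{"id":1,"name":"tag 1"},{"id":2,"name":"tag 2"}, ...] |     1
```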

Combining Our Notes and Tags

So at this point, we have 2 CTEs that represent our notes and tags. We
need to combine these two tables into a single row, so that we can convert
that row to a JSON object with notes and tags properties. This is
the reason we added a match column to both CTEs; we join those two
tables into our final table, which we then call row_to_json on to get
our final JSON object, which mirrors the example at the beginning of
this article.
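Sketched under the same assumptions as above, that final piece is:

```sql
SELECT row_to_json(results)
FROM (
  SELECT notes, tags
  FROM notes_as_json_array
  JOIN tags_as_json_array
    ON notes_as_json_array.match = tags_as_json_array.match
) results;
```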

So there you have it: you could generate this giant query by hand every
time you need to create an API endpoint, or you could use ActiveModel::Serializers
and let the PostgresExt-Serializers optimizations avoid Ruby and
Rails when generating API responses.
