Implementing Auditing
Storing every change to an application’s database allows for sophisticated forensic analysis: usage trends over time, long-range debugging, or data correction features more typically found in version control software, like ‘cherry-pick’ or ‘revert’.
Many products require this in the form of an audit trail, which in the simplest case can be used to see who put bad data in the database.
The typical implementation in Postgres involves one or more triggers to capture changes, as well as secondary storage. Unfortunately, even though this is a common problem, the community hasn’t yet converged on a single implementation that solves even the majority of use-cases, although there are some good partial solutions. Getting this right early in the software development process is very valuable, as it becomes hard to adjust once in production.
Consider the following example:
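(The sketch below assumes a hypothetical accounts table; the audit table, columns, and trigger names are all illustrative.)

    CREATE TABLE accounts_audit (
        audit_id   bigserial PRIMARY KEY,
        action     text NOT NULL,        -- 'INSERT', 'UPDATE', or 'DELETE'
        changed_by text NOT NULL DEFAULT session_user,
        changed_at timestamptz NOT NULL DEFAULT clock_timestamp(),
        account_id bigint,
        name       text,
        balance    numeric
    );

    CREATE OR REPLACE FUNCTION accounts_audit_fn() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO accounts_audit (action, account_id, name, balance)
            VALUES (TG_OP, OLD.account_id, OLD.name, OLD.balance);
        ELSE
            INSERT INTO accounts_audit (action, account_id, name, balance)
            VALUES (TG_OP, NEW.account_id, NEW.name, NEW.balance);
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_audit_trg
        AFTER INSERT OR UPDATE OR DELETE ON accounts
        FOR EACH ROW EXECUTE PROCEDURE accounts_audit_fn();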
Each time a record changes, this implementation saves the data in a separate table, along with context for the operation: the user who executed it, the time, and so on.
While it’s possible to create a single table which contains present and past data, separating current and past data into separate tables provides natural partitioning, which reflects the different uses of these two data sets.
Data Model
I like to keep the triggers as simple as possible, by writing stored procedures that pre-generate the queries in their simplest form; this reduces overall load on the system. To generate code in Postgres stored procedures, the format function works well, allowing you to use multi-line strings as templates – this lets you write the query you want, then stick in the variable blocks.
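For example, format can render the audit INSERT for a given table from a multi-line template (a sketch; the table and column lists are illustrative):

    -- %I is filled in as a quoted identifier, %s as raw text.
    SELECT format($sql$
        INSERT INTO %I (action, changed_by, changed_at, %s)
        VALUES ('UPDATE', session_user, clock_timestamp(), %s)
    $sql$, 'accounts_audit',
           'account_id, name, balance',
           '$1, $2, $3');

A setup script can render one of these per audited table and EXECUTE the result, wrapped in a CREATE FUNCTION for the trigger, so no string-building happens at run time.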
Having set this up, you can insert data into the audit table and retrieve it by timestamp – for example, against the hypothetical accounts_audit table sketched above:
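    -- Pull the history for one account, ordered by when each change
    -- was recorded:
    SELECT changed_at, changed_by, action, name, balance
    FROM accounts_audit
    WHERE account_id = 42
    ORDER BY changed_at;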
Another technique uses the hstore type (a standard contrib extension) to store all audit records in a single table. This is a natural trade-off: a single table is simple to manage and conceptualize, and saves you from having to deal with migrations. On the other hand, having one audit table per main table makes it easy to drop any audit table into any query from your application. Some people have found the hstore implementation to be faster – it also makes it easier to create one single index.
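A sketch of the single-table variant, assuming the hstore extension is available; the audit_log name and columns are illustrative:

    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE audit_log (
        audit_id   bigserial PRIMARY KEY,
        table_name text NOT NULL,
        action     text NOT NULL,
        changed_by text NOT NULL DEFAULT session_user,
        changed_at timestamptz NOT NULL DEFAULT clock_timestamp(),
        row_data   hstore NOT NULL
    );

    CREATE OR REPLACE FUNCTION audit_log_fn() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO audit_log (table_name, action, row_data)
            VALUES (TG_TABLE_NAME, TG_OP, hstore(OLD));
        ELSE
            INSERT INTO audit_log (table_name, action, row_data)
            VALUES (TG_TABLE_NAME, TG_OP, hstore(NEW));
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    -- One trigger function serves every table, and a single GIN index
    -- covers all of the audit data:
    CREATE INDEX audit_log_row_data_idx ON audit_log USING gin (row_data);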
Context
The implementations above show some of the values Postgres makes readily available. Many of these appear useful, but I would consider them of dubious value: most applications connect through a single user account and a connection pooler (e.g. pgbouncer), which obscures much of the useful context.
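For reference, here is a sample of the session context you can read from inside a trigger (a sketch; pick the ones you actually need):

    SELECT session_user,             -- role that connected
           current_user,             -- role currently in effect
           inet_client_addr(),       -- client IP; NULL over a Unix socket
           current_query(),          -- the query that fired the trigger
           txid_current(),           -- current transaction ID
           txid_current_snapshot();  -- transactions in flight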
On the other hand, some of these are incredibly helpful, but it’s not intuitively obvious why until you need them. ‘Action’, for instance, is valuable for giving the database a way to filter rows quickly during forensic analysis, and for discovering when an entity came into being or was deleted. Knowing the query running at the time of execution is useful, but can generate a significant quantity of audit history.
Two options of particular interest are the current transaction ID and the transaction snapshot ID:
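Both are exposed through built-in functions; the values shown here are invented:

    SELECT txid_current();           -- e.g. 411942
    SELECT txid_current_snapshot();  -- e.g. 411938:411942:411939,411941
                                     -- (xmin : xmax : in-progress list)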
The first is a large integer that increments each time a transaction starts, so it can be used to roughly sort updates by when they occurred; note, however, that the underlying 32-bit transaction counter does eventually wrap around. txid_current_snapshot lets you find out which transactions have completed and which are still running, which would let you reconstruct which transactions were in flight when an audit record was written.
While these are useful, the above implementations lack the truly useful and necessary context that a real application has: its own concept of users, an HTTP request ID, the web server executing a request, and so on.
There are a couple of ways to inject context into a trigger. One possible implementation is to build a temp table that holds the request context. Note that if you want the optimizer to know how to deal with it, you have to call analyze manually, as the vacuum process will not see the table.
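A sketch of that approach; the table and column names are invented for illustration:

    -- Application code, at the start of each transaction:
    CREATE TEMP TABLE IF NOT EXISTS request_context (
        app_user   text,
        request_id text,
        web_server text
    ) ON COMMIT DROP;

    INSERT INTO request_context VALUES ('alice', 'req-8c41', 'web-01');
    ANALYZE request_context;  -- autovacuum never touches temp tables

The trigger can then join against request_context and copy its columns into the audit row.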
More simply, you can overload the application_name context parameter. application_name is supposed to identify an application – typical values are “psql” or “PGAdmin” – but it can be overridden safely. One possible use would be to store a context ID there, then send the full context data in an async write to a separate database, to be retrieved at a later time.
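A sketch of the overloading, with a made-up encoding (note that application_name values are truncated to 63 bytes):

    -- Application code: smuggle a request ID into the connection.
    SET application_name = 'myapp req=8c41';

    -- Inside a trigger, read it back:
    SELECT current_setting('application_name');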
Typically it is valuable to add a numeric revision number to each row. This allows the application to detect concurrent modifications, and simplifies sorting in the audit history.
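A sketch of the revision column, reusing the hypothetical accounts table:

    ALTER TABLE accounts ADD COLUMN revision integer NOT NULL DEFAULT 1;

    -- The application bumps the revision on every write and includes the
    -- revision it originally read in the WHERE clause; zero rows updated
    -- means someone else modified the row concurrently.
    UPDATE accounts
    SET balance = 125.00, revision = revision + 1
    WHERE account_id = 42 AND revision = 7;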
If you’re able to inspect and modify audit history without a performance hit, it is valuable to find the previous row and create an ‘effective time range’ interval. This is how you’d do it if you didn’t store it:
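A sketch, deriving the range from each row’s timestamp and its successor’s via the lead window function:

    SELECT account_id,
           tstzrange(changed_at,
                     lead(changed_at) OVER w) AS effective_range
    FROM accounts_audit
    WINDOW w AS (PARTITION BY account_id ORDER BY changed_at);
    -- The newest row for each account gets an open-ended range.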
You can also find the previous row, and set pointers going both directions, which causes the database to resemble a doubly-linked list:
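A sketch over the hypothetical audit table; the application or trigger must update the predecessor each time a new row is inserted, and the IDs below are invented:

    ALTER TABLE accounts_audit
        ADD COLUMN prev_audit_id bigint REFERENCES accounts_audit (audit_id),
        ADD COLUMN next_audit_id bigint REFERENCES accounts_audit (audit_id);

    -- After inserting a new audit row, point its predecessor forward:
    UPDATE accounts_audit
    SET next_audit_id = 1002
    WHERE audit_id = 1001;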
Regardless of whether you create one large table or many smaller ones, it is easy to hide implementation details behind views and stored procedures, which can address many of the above complaints. If you know the column definition you want to return, it’s very easy to make a stored procedure that acts as if it were a view, as shown below:
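For instance, a sketch over the running accounts_audit table:

    CREATE OR REPLACE FUNCTION account_history(p_account_id bigint)
    RETURNS TABLE (changed_at timestamptz, changed_by text,
                   action text, balance numeric) AS $$
        SELECT changed_at, changed_by, action, balance
        FROM accounts_audit
        WHERE account_id = p_account_id
        ORDER BY changed_at;
    $$ LANGUAGE sql STABLE;

    -- Callers treat it like a view:
    SELECT * FROM account_history(42);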
Performance
Performance is a complex topic, and depends on your mix of write volume, total data, and need for forensic analysis. For smaller volumes you can get away with keeping everything in a single system, but as the volume grows, some amount of tuning is required. For instance, I’ve found that you can get away without indexes for a long time if you only do small amounts of forensic analysis, which saves a lot of operational overhead.
Postgres does not support partitioning per se, but table inheritance lets you define a master table and have many child tables share its schema, which gets you part way there. For an auditing system, you’d likely want to split the data into time windows. This requires a script to create new windows, but you might wish to write one anyway, to move old tables into an alternate tablespace on cheaper/slower storage.
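A sketch of monthly windows via inheritance (the dates and the archive_space tablespace are illustrative):

    CREATE TABLE accounts_audit_2014_01 (
        CHECK (changed_at >= DATE '2014-01-01'
           AND changed_at <  DATE '2014-02-01')
    ) INHERITS (accounts_audit);

    -- Later, a maintenance script can move a cold window to cheap disks:
    ALTER TABLE accounts_audit_2014_01 SET TABLESPACE archive_space;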
To avoid inconveniencing your end users, you may wish to move auditing out of transactions entirely. One method for doing this is to record the above information and send it to a message queue (e.g. Kafka).
Alternately, you can use the Postgres async API (e.g. PQsendQuery in libpq) to send the audit queries without blocking on their results.
Audit history typically provides a wealth of information, if you can sift through the noise. Postgres provides a number of built-in features that support sophisticated analysis, but they only work if you plan ahead and store enough up front.
If you’re interested in this subject, check out the Wingspan Auditing library. In future posts I will discuss how to replicate git-style functionality within Postgres.