2016-04-28

PGHoard is the cloud backup and restore solution we're using in Aiven. We started PGHoard development in early 2015 when the Aiven project was launched as a way to provide real-time streaming backups of PostgreSQL to a potentially untrusted cloud object storage.

PGHoard has an extensible object storage interface, which currently works with the following cloud object stores:

Amazon Web Services S3

Google Cloud Storage

OpenStack Swift

Ceph's RADOSGW utilizing either the S3 or Swift drivers

Microsoft Azure Storage (currently experimental)

Data integrity
PostgreSQL backups consist of full database backups, basebackups, plus write ahead logs and related metadata, WAL. Both basebackups and WAL are required to create and restore a consistent database.

PGHoard handles both the full, periodic backups (driving pg_basebackup) as well as streaming the write-ahead-log of the database.  Constantly streaming WAL as it's generated allows PGHoard to restore a database to any point in time since the oldest basebackup was taken.  This is used to implement Aiven's Database Forks and Point-in-time-Recovery as described in our PostgreSQL FAQ.

To save disk space and reduce the data that needs to be sent over the network (potentially incurring extra costs) backups are compressed by default using Google's Snappy, a fast compression algorithm with a reasonable compression ratio. LZMA (a slower algorithm with very high compression ratio) is also supported.

To protect backups from unauthorized access and to ensure their integrity PGHoard can also transparently encrypt and authenticate the data using RSA, AES and SHA256.  Each basebackup and WAL segments gets a unique random AES key which is encrypted with RSA.  HMAC-SHA256 is used for file integrity checking.

Restoring is the key
As noted in the opening paragraph, PGHoard is a backup and restore tool: backups are largely useless unless they can be restored.  Experience tells us that backups, even if set up at some point, are usually not restorable unless restore is routinely tested, but experience also shows that backup restoration is rarely practiced unless it's easy to do and automate.

This is why PGHoard also includes tooling to restore backups, allowing you to create new master or standby databases from the object store archives.  This makes it possible to set up a new database replica with a single command, which first restores the database basebackup from object storage and then sets up PostgreSQL's recovery.conf to fetch the remaining WAL files from the object storage archive and optionally connect to an existing master server after that.

Preparing PostgreSQL for PGHoard
First, we will need to create a replication user account. We'll just use the psql command-line client for this:

postgres=# CREATE USER backup WITH REPLICATION PASSWORD 'secret';
CREATE ROLE

We also need to allow this new user to make connections to the database. In PostgreSQL this is done by editing the pg_hba.conf configuration file and adding a line something like this:

host  replication  backup  127.0.0.1/32  md5

We'll also need to ensure our PostgreSQL instance is configured to allow WAL replication out from the server and it has the appropriate wal_level setting. We'll edit postgresql.conf and edit or add the following settings:

max_wal_senders = 2  # minimum two with pg_receivexlog mode!
wal_level = archive  # 'hot_standby' or 'logical' are also ok

Finally, since we have modified PostgreSQL configuration files, we'll need to restart PostgreSQL to take the new settings into use by running "pg_ctl restart", "systemctl restart postgresql" or "service postgresql restart", etc depending on the Linux distribution being used.  Note that it's not enough to "reload" PostgreSQL in case the WAL settings were changed.

Now we are ready on the PostgreSQL side and can move on to PGHoard.

Installing PGHoard
PGHoard's source distribution includes packaging scripts for Debian, Fedora and Ubuntu.  Instructions for building distribution specific packages can be found in the PGHoard README.  As PGHoard is a Python package it can also be installed on any system with Python 3 by running "pip3 install pghoard".

Taking backups with PGHoard
PGHoard provides a number of tools that can be launched from the command-line:

pghoard - The backup daemon itself, can be run under systemd or sysvinit

pghoard_restore - Backup restoration tool

pghoard_archive_sync - Command for verifying archive integrity

pghoard_create_keys - Backup encryption key utility

pghoard_postgres_command - Used as PostgreSQL's archive_command and restore_command

First, we will launch the pghoard daemon to start taking backups. pghoard requires a small JSON configuration file that contains the settings for the PostgreSQL connection and for the target backup storage. We'll name the file pghoard.json:

{
"backup_location": "./metadata",
"backup_sites": {
"example-site": {
"nodes": [
{
"host": "127.0.0.1",
"password": "secret",
"port": 5432,
"user": "backup"
}
],
"object_storage": {
"storage_type": "local",
"directory": "./backups"
}
}
}
}

In the above file we just list where pghoard keep's its local working directory (backup_location), our PostgreSQL connection settings (nodes) and where we want to store the backups (object_storage). In this example we'll just write the backup files to a local disk instead of a remote cloud object storage.

Then we just need to run the pghoard daemon and point it to our configuration file:

$ pghoard --short-log --config pghoard.json
DEBUG   Loading JSON config from: './pghoard.json', signal: None
INFO    pghoard initialized, own_hostname: 'ohmu1', cwd: '/home/mel/backup'
INFO    Creating a new basebackup for 'example-site' because there are currently none
INFO    Started: ['/usr/bin/pg_receivexlog', '--status-interval', '1', '--verbose', '--directory', './metadata/example-site/xlog_incoming', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'"], running as PID: 8809
INFO    Started: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_0'], running as PID: 8815, basebackup_location: './metadata/example-site/basebackup_incoming/2016-04-28_0/base.tar'
INFO    Compressed 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000025' to 805706 bytes (4%), took: 0.056s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000025', size: 805706, took 0.003s
INFO    Ran: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_0'], took: 0.331s to run, returncode: 0
INFO    Compressed 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000026' to 797357 bytes (4%), took: 0.057s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000026', size: 797357, took 0.011s
INFO    Compressed 80187904 byte file './metadata/example-site/basebackup_incoming/2016-04-28_0/base.tar' to 15981960 bytes (19%), took: 0.335s
INFO    'UPLOAD' transfer of key: 'example-site/basebackup/2016-04-28_0', size: 15981960, took 0.026s

PGHoard automatically connected to the PostgreSQL database server, noticed that we don't have any backups and immediately created a new basebackup and started the realtime streaming of WAL files (which act as incremental backups). Each file stored in the backups was first compressed for optimizing the transfer and storage costs.

As long as you keep PGHoard running, it will make full backups using the default schedule (once per 24 hours) and continuously stream WAL files.

Looking at the contents of the "backups" directory, we see that our backups now contain a full database backup plus a couple of WAL files, and some metadata for each of the files:

$ find backups/ -type f
backups/example-site/xlog/000000010000000000000025
backups/example-site/xlog/000000010000000000000025.metadata
backups/example-site/xlog/000000010000000000000026
backups/example-site/xlog/000000010000000000000026.metadata
backups/example-site/basebackup/2016-04-28_0
backups/example-site/basebackup/2016-04-28_0.metadata

Available backups can be listed with the pghoard_restore tool:

$ pghoard_restore list-basebackups --config pghoard.json
Available 'example-site' basebackups:

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_0            15 MB        76 MB  2016-04-28T06:40:46Z

Looks like we are all set. Now let's try restore!

Restoring a backup
Restoring a backup is a matter of running a single command:

$ pghoard_restore get-basebackup --config pghoard.json --target-dir restore-test
Found 1 applicable basebackup

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_0            15 MB        76 MB  2016-04-28T06:40:46Z
metadata: {'compression-algorithm': 'snappy', 'start-wal-segment': '000000010000000000000026', 'pg-version': '90406'}

Selecting 'example-site/basebackup/2016-04-28_0' for restore
Basebackup complete.
You can start PostgreSQL by running pg_ctl -D restore-test start
On systemd based systems you can run systemctl start postgresql
On SYSV Init based systems you can run /etc/init.d/postgresql start

The pghoard_restore command automatically chooses the latest available backup, downloads, unpacks (and decompresses and decrypts, when those options are used) it to the specified target directory. The end result will be a complete PostgreSQL data directory (e.g. something like /var/lib/postgresql/9.5/main or /var/lib/pgsql/data, depending on the distro), ready to be used by a PostgreSQL instance.

There are more command-line options for more detailed control over the restoration process, for example restoring to a particular point in time or transaction (PITR) or choosing whether the restored database will be acting as a master or a standby.

Backup encryption
In order to encrypt our backups, we'll need to create an encryption key pair. PGHoard provides a handy command for automatically creating a key pair and storing it into our configuration file:

$ pghoard_create_keys --key-id example --config pghoard.json
Saved new key_id 'example' for site 'example-site' in 'pghoard.json'
NOTE: The pghoard daemon does not require the 'private' key in its configuration file, it can be stored elsewhere to improve security

Note that in most cases you will want to extract the private key away from the configuration file and store it safely elsewhere away from the machine that makes the backups. The pghoard daemon only needs the encryption public key. Only the restore tool needs the private key in order to restore backups.

Uploading backups to a cloud
Sending backups to an object storage in a cloud is simple: we just need the cloud's access credentials and we'll modify the object_storage section pghoard.json:

"object_storage": {
"aws_access_key_id": "XXX",
"aws_secret_access_key": "XXX",
"bucket_name": "backups",
"region": "eu-central-1",
"storage_type": "s3"
}

Now when we restart pghoard, the backups are sent to AWS S3 in Frankfurt:

$ pghoard --short-log --config pghoard.json
DEBUG   Loading JSON config from: './pghoard.json', signal: None
INFO    pghoard initialized, own_hostname: 'ohmu1', cwd: '/home/mel/backup'
INFO    Started: ['/usr/bin/pg_receivexlog', '--status-interval', '1', '--verbose', '--directory', './metadata/example-site/xlog_incoming', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'"], running as PID: 8001
INFO    Creating a new basebackup for 'example-site' because there are currently none
INFO    Started: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_1'], running as PID: 8014, basebackup_location: './metadata/example-site/basebackup_incoming/2016-04-28_1/base.tar'
INFO    Ran: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_1'], took: 0.350s to run, returncode: 0
INFO    Compressed and encrypted 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000027' to 799445 bytes (4%), took: 0.406s
INFO    Compressed and encrypted 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000028' to 797784 bytes (4%), took: 0.137s
INFO    Compressed and encrypted 80187904 byte file './metadata/example-site/basebackup_incoming/2016-04-28_1/base.tar' to 15982372 bytes (19%), took: 0.417s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000028', size: 797784, took 0.885sINFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000027', size: 799445, took 1.104s
INFO    'UPLOAD' transfer of key: 'example-site/basebackup/2016-04-28_1', size: 15982372, took 4.911s

The restore tool works the same way regardless of where the backups are stored:

$ pghoard_restore list-basebackups --config pghoard.json
Available 'example-site' basebackups:

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_1            15 MB        76 MB  2016-04-28T09:39:37Z

PostgreSQL 9.2+ and Python 3.3+ required
Today we released PGHoard version 1.2.0 with support for Python 3.3 and PostgreSQL 9.2 plus enhanced support for handling network outages.  These features were driven by external users, in Aiven we always use the latest PostgreSQL versions (9.5.2 at the time of writing) and access object storages near the database machines.

PGHoard in Aiven.io
We're happy to talk more about PGHoard and help you set up your backups with it.  You can also sign up for a free trial of our Aiven.io PostgreSQL service where PGHoard will take care of your backups.

Cheers,
Team Aiven

Show more