PostgreSQL Recovery

August 01, 2010

We ran out of disk space on a database server, crashing PostgreSQL as a result. Unfortunately, PostgreSQL was unable to recover itself upon restart:

root@host# /etc/init.d/postgresql-8.3 start
 * Starting PostgreSQL 8.3 database server
 * The PostgreSQL server failed to start. Please check the log output:
2010-08-01 08:21:19 EDT LOG:  database system was interrupted; last known up at 2010-07-31 23:00:05 EDT
2010-08-01 08:21:19 EDT FATAL:  invalid data in file "00000001000002BD00000072.00000020.backup"
2010-08-01 08:21:19 EDT LOG:  startup process (PID 9203) exited with exit code 1
2010-08-01 08:21:19 EDT LOG:  aborting startup due to startup process failure

00000001000002BD00000072.00000020.backup is a zero-length file in the pg_xlog directory. We tried deleting that file, but then the restart threw a different error:

root@host# /etc/init.d/postgresql-8.3 start
 * Starting PostgreSQL 8.3 database server
 * The PostgreSQL server failed to start. Please check the log output:
2010-08-01 08:26:11 EDT LOG:  database system was interrupted; last known up at 2010-07-31 23:00:05 EDT
2010-08-01 08:26:11 EDT LOG:  could not open file "pg_xlog/00000001000002BD00000072" (log file 701, segment 114): No such file or directory
2010-08-01 08:26:11 EDT LOG:  invalid checkpoint record
2010-08-01 08:26:11 EDT PANIC:  could not locate required checkpoint record
2010-08-01 08:26:11 EDT HINT:  If you are not restoring from a backup, try removing the file "/var/lib/postgresql/8.3/main/backup_label".
2010-08-01 08:26:11 EDT LOG:  startup process (PID 9419) was terminated by signal 6: Aborted
2010-08-01 08:26:11 EDT LOG:  aborting startup due to startup process failure

We concluded that a point-in-time recovery (PITR) was required. First, we saved our current cluster data directory. (In retrospect, we needed the unarchived WAL files from the pg_xlog directory when restoring, so we should have copied them off to the side first.)

cd /var/lib/postgresql/8.3/main/
tar zcvf /var/backups/postgresql/8.3/main/snapshot/2010-08-01-CRASHED.tar.gz .

Next, we restored our last known good snapshot. The one from the morning of the crash looked suspiciously small, so we went back to the previous day:

rm -rf  *
tar jxvf /var/backups/postgresql/8.3/main/snapshot/2010-07-30.tar.bz2 
mkdir -p pg_xlog/archive_status
chown -R postgres.postgres pg_xlog
chmod -R 700 pg_xlog

Now we copied the unarchived WAL files from before the recovery into the pg_xlog directory. We then created /var/lib/postgresql/8.3/main/recovery.conf as follows:

restore_command = 'bzcat /var/backups/postgresql/8.3/main/wal/%f.bz2 > %p'
recovery_target_time = '2010-07-31 06:28:00 EDT'

We also made recovery.conf owned by the postgres user.

Finally, we started PostgreSQL via /etc/init.d/postgresql-8.3 start.