Accidental rm -rf deletes production database.
How a manual PostgreSQL replica repair turned into deletion of GitLab.com's primary database, why the standard pg_dump backups were empty, and why the only usable restore path took more than eighteen hours.
GitLab.com's database redundancy depended on the standby staying close enough to promote. PostgreSQL wrote changes to a primary database and streamed them through the write-ahead log to a hot-standby secondary. When the secondary was current, GitLab had a failover target. When it fell so far behind that the primary had already removed the WAL segments it needed, the standby stopped being redundancy and became a rebuild job.
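For a sense of what "close enough to promote" looks like in practice, the checks below are an illustrative sketch, not GitLab's actual monitoring, using the function names PostgreSQL 9.6 still had at the time:

    # On the primary: how far each standby's replay position trails the current WAL position.
    psql -c "SELECT application_name, state,
                    pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS lag_bytes
             FROM pg_stat_replication;"

    # On the standby: confirm it is in recovery and still receiving and replaying WAL.
    psql -c "SELECT pg_is_in_recovery(),
                    pg_last_xlog_receive_location(),
                    pg_last_xlog_replay_location();"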
That evening, GitLab.com was already under database pressure. Suspected spam and a background job trying to remove an employee account flagged for abuse increased load enough that users had trouble posting comments. Around 23:00 UTC, the secondary's replication process fell behind and could not catch up. The repair was standard PostgreSQL work: empty the secondary's data directory, then run pg_basebackup to copy the primary back over.
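As a sketch of that standard procedure (host names, the replication user, and service commands here are illustrative, not the exact commands run that night):

    # 1. On the standby only: stop PostgreSQL and clear its data directory.
    gitlab-ctl stop postgresql
    rm -rf /var/opt/gitlab/postgresql/data/*

    # 2. Copy the primary's data directory across, streaming WAL during the copy so the
    #    base backup stays consistent, and write a recovery.conf (-R) for streaming replication.
    sudo -u gitlab-psql pg_basebackup -h db1.example.com -U gitlab_replicator \
        -D /var/opt/gitlab/postgresql/data -X stream -R -v -P

    # 3. Start PostgreSQL again; it should come up as a hot standby and begin catching up.
    gitlab-ctl start postgresql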
The repair did not go smoothly. pg_basebackup hung without useful output. The team increased replication sender limits, hit a PostgreSQL restart problem caused by an old max_connections setting, fixed that, and still saw pg_basebackup waiting. One engineer suspected the previous attempts had left files in the secondary's data directory. They ran rm -rf /var/opt/gitlab/postgresql/data on what they believed was db2, the secondary. It was db1, the production primary. The engineer stopped the command after a second or two. Those seconds were enough: about 300 GB had already been removed.
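A check that distinguishes the two hosts takes a few seconds. The commands below are illustrative, not a claim about GitLab's runbooks:

    # Confirm which machine this shell is on and which role its PostgreSQL instance plays.
    hostname
    psql -d template1 -tAc "SELECT pg_is_in_recovery();"   # 't' = standby, 'f' = primary
    # Only clear the data directory if both answers point at the secondary.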
What followed was worse than the initial mistake. The normal pg_dump backups uploaded to S3 were not there. The job had been using PostgreSQL 9.2 binaries against a PostgreSQL 9.6 database, so it failed, and the failure notifications never reached operators because the receiving mail server rejected them under its DMARC policy. Azure disk snapshots existed for other servers, but not for the database servers. Replication was gone because the secondary had already been wiped. The only usable restore point was an LVM snapshot taken at 17:20 UTC for staging, about six hours before the deletion.
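Each of those failures was detectable in advance. The checks below are a hypothetical sketch, with an illustrative bucket name and database name, not part of GitLab's tooling:

    # The backup binaries must match the server's major version, or pg_dump aborts.
    pg_dump --version                                          # 9.2 binaries on the app server...
    psql -d gitlabhq_production -tAc "SHOW server_version;"    # ...against a 9.6 server
    # A 9.2 pg_dump refuses to dump a 9.6 database ("aborting because of server version mismatch").

    # Independently verify that recent, non-empty backup artifacts actually exist.
    aws s3 ls s3://example-gitlab-backups/db/ --recursive | tail -n 5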
Restoring from that snapshot meant copying the staging database back to production over slow Azure classic disks, then restoring webhooks from a second copy because staging snapshots removed them to avoid accidental triggers. GitLab also incremented database sequences before bringing the service back. At 18:00 UTC on February 1, GitLab.com was operating again with database state from 17:20 UTC the previous day. Repositories and wikis were intact, but database changes from the next six hours and ten minutes were gone.
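The sequence step is easy to miss: the snapshot's counters reflect 17:20 UTC, so without a bump, new rows could reuse identifiers already handed out during the lost window, in URLs, e-mails, or external integrations. A minimal sketch, with an example sequence name and an arbitrary margin rather than GitLab's actual values:

    # Advance one sequence well past its restored position; repeat for each sequence.
    psql -d gitlabhq_production -c \
        "SELECT setval('issues_id_seq', (SELECT last_value FROM issues_id_seq) + 100000);"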
GitLab kept a public recovery document, streamed the recovery on YouTube, and used Twitter for updates. The stream peaked around 5,000 viewers. That transparency helped users understand the recovery as it happened, but the postmortem also notes a boundary: the public document initially included the engineer's name, and GitLab said it would redact names in future incidents. Openness helped the response; naming individuals was not necessary to learn from the failure.
The process of both finding and using backups failed completely. // GitLab postmortem, January 2017
From the first signal to all-clear in 18h 30m.
A primary database deletion exposed broken and slow recovery paths.
An engineer ran a destructive command on the wrong host. GitLab's secondary database had fallen too far behind the primary and could not continue streaming changes because required WAL segments were gone. The team needed to wipe the secondary and rebuild it with pg_basebackup. While trying to clear what they believed was the secondary's data directory, the engineer removed the primary's PostgreSQL data directory instead.
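What "fallen too far behind" means is governed by how much WAL the primary keeps before recycling it. A sketch of the relevant knobs on a 9.6 primary, with example values rather than GitLab's settings (on an Omnibus install this file is normally generated from gitlab.rb):

    # Append WAL-retention and sender limits to postgresql.conf (illustrative values).
    {
      echo "wal_keep_segments = 512   # extra WAL segments retained for lagging standbys"
      echo "max_wal_senders = 10      # concurrent WAL sender (replication) connections"
    } >> /var/opt/gitlab/postgresql/data/postgresql.conf
    gitlab-ctl restart postgresql     # max_wal_senders only takes effect after a restart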
The standard backup path had already failed silently. GitLab's pg_dump job ran from an application server where Omnibus selected PostgreSQL 9.2 binaries, while the production database ran PostgreSQL 9.6. pg_dump exited with an error, but the cron failure emails never reached anyone because the receiving mail server rejected them due to DMARC. The S3 backup bucket was empty when the team needed it.
The usable restore path was a staging LVM snapshot taken at 17:20 UTC. That snapshot existed because an engineer had manually refreshed staging before load-testing work, not because it was the primary disaster-recovery path. Restoring it meant copying data back from slow Azure classic disks, recovering webhook data from a second copy, incrementing database sequences, and accepting permanent loss of database changes made after 17:20 UTC.