The rm -rf That Erased GitLab's Production Database
How a manual PostgreSQL replica repair reached the wrong host, why S3 backups and disk snapshots were both unavailable, and why the only working restore took 18 hours at 60 Mbps.
In the seconds it took an engineer to stop a runaway command, approximately 300 GB had been removed from GitLab's production database. Neither database host could be used for recovery. The pg_dump backups in S3 were empty, and disk snapshots had never been enabled on the database servers — three independent safety nets, all gone.
GitLab.com ran one primary PostgreSQL server handling all database load, with a single hot-standby secondary whose only job was failover. That evening, a spam attack and a background job removing a flagged account together pushed database activity high enough that the secondary fell roughly 4 GB behind. In PostgreSQL streaming replication, the secondary replays the change records that the primary writes to its write-ahead log (WAL), which is stored on disk as a series of segment files. If the primary deletes old WAL segments before the secondary has consumed them, the secondary can no longer catch up through streaming alone. GitLab.com had no WAL archiving, so there was no retained history for the secondary to catch up from. The only fix was to wipe the secondary's data directory and rebuild it from scratch.
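The lag described above can be made concrete with a small sketch. PostgreSQL reports WAL positions as log sequence numbers (LSNs) written as two hexadecimal numbers separated by a slash, and the difference between two LSNs is a byte count. The function names and the `retained_segments` parameter below are hypothetical, the 16 MiB segment size is PostgreSQL's default rather than a value confirmed for GitLab.com, and the catch-up test is a deliberate simplification of what the server actually does.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # PostgreSQL's default WAL segment size


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the primary has written that the secondary has not replayed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replay_lsn)


def can_catch_up(lag_bytes: int, retained_segments: int) -> bool:
    """Simplified check: streaming can resume only if the primary still holds
    every segment the secondary is missing (no WAL archive is assumed)."""
    return lag_bytes <= retained_segments * SEGMENT_BYTES


if __name__ == "__main__":
    lag = replication_lag_bytes("1/0", "0/0")  # illustrative LSNs, 4 GiB apart
    print(lag, can_catch_up(lag, retained_segments=64))
```

With only 64 retained segments (1 GiB of WAL in this sketch), a 4 GiB lag cannot be closed by streaming, which is the situation that forces a full rebuild.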
Rebuilding the secondary first meant changing two settings on the primary. max_wal_senders limits how many streaming replication connections the primary accepts. max_connections caps total database connections, and because PostgreSQL allocates shared memory for that limit at startup, changing it requires a restart. The team had to raise max_wal_senders and lower max_connections from 8,000 to 2,000, a change that had been neither documented nor tested beforehand. pg_basebackup is the PostgreSQL tool for copying a primary's data to a standby. It sits silent at startup, with no output and no progress bar, while it waits for the primary to begin sending data. This behavior was not documented in the team's runbooks. Faced with a process that gave no signal, an engineer ran the data-directory removal on the production primary, believing they were connected to the secondary.
pg_dump is a PostgreSQL utility that exports a database to a file; GitLab ran it nightly and stored the output in S3. When responders looked for a pg_dump backup, the S3 bucket was empty. The backup job had been using PostgreSQL 9.2 tooling against a 9.6 database, and every run had terminated with an error. Nobody knew the backups were broken because cron failure emails had been rejected by GitLab's mail setup for weeks.
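The version mismatch can be expressed as a simple pre-flight check. pg_dump refuses to dump from a server whose major version is newer than its own, so a 9.2 client against a 9.6 server fails every time. The sketch below is illustrative only: the function names are hypothetical, and it assumes plain numeric version strings such as "9.6.1" or "10.4".

```python
def pg_major(version: str) -> tuple:
    """Return the major version as a tuple.

    Before PostgreSQL 10 the major version was the first two numbers (9.6);
    from 10 onward it is the first number alone.
    """
    parts = [int(p) for p in version.split(".")]
    return tuple(parts[:2]) if parts[0] < 10 else (parts[0],)


def dump_client_ok(client_version: str, server_version: str) -> bool:
    """True if a pg_dump of client_version can dump a server of server_version."""
    return pg_major(client_version) >= pg_major(server_version)


if __name__ == "__main__":
    print(dump_client_ok("9.2.18", "9.6.1"))  # the mismatch described above
    print(dump_client_ok("9.6.1", "9.6.1"))
```

A check like this only catches one failure mode. It does not replace confirming that a backup file exists, is non-empty, and can actually be restored.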
No single person owned recovery testing, so the version mismatch, the missing disk snapshots, and the broken alert path had all survived undetected. Each failure was independent and each was silent.
An LVM snapshot is a point-in-time copy of a disk volume. The only viable restore point was a snapshot taken manually about six hours before the deletion, at 17:20 UTC. That snapshot lived on staging, a cost-saving environment running Azure Classic disks throttled to about 60 Mbps. Copying it to production took 18 hours. At least 5,000 projects, 5,000 comments, and roughly 700 users were permanently lost: every database change made in the roughly six hours between the snapshot and the deletion. Git repositories and wikis were unavailable during the outage but survived intact because they lived outside the database. GitLab streamed the recovery live on YouTube, posting regular updates as the restore ran.
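The relationship between copy size, throughput, and duration is simple arithmetic, sketched below. Two assumptions are mine rather than the source's: "Mbps" is read as megabits per second, and the 300 GB input is the amount deleted, used only as a stand-in because the size of the copied snapshot is not stated.

```python
def transfer_hours(size_gb: float, rate_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) at rate_mbps (megabits/s)."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 3600


def gb_moved(hours: float, rate_mbps: float) -> float:
    """Decimal gigabytes moved in the given time at rate_mbps (megabits/s)."""
    return rate_mbps * 1e6 * hours * 3600 / 8 / 1e9


if __name__ == "__main__":
    print(round(transfer_hours(300, 60), 1))  # hours for 300 GB at 60 Mbps
    print(round(gb_moved(18, 60)))            # GB moved in 18 hours at 60 Mbps
```

Under these assumptions, 300 GB at a sustained 60 Mbps would take about 11 hours, while 18 hours at that rate corresponds to about 486 GB. The 18-hour figure therefore suggests either a copy larger than 300 GB or an effective rate below the nominal cap; the source does not say which.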
GitLab's published accounts agreed that the loss window began at 17:20 UTC but gave three different endpoints: 23:25, 23:30, and 00:00 UTC.
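The three endpoints translate into three slightly different loss windows. A minimal sketch of the arithmetic follows, assuming all times are UTC and that an end time earlier than the start falls on the following day; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def window_minutes(start: str, end: str) -> int:
    """Minutes between two HH:MM times, rolling past midnight if needed."""
    fmt = "%H:%M"
    begin = datetime.strptime(start, fmt)
    finish = datetime.strptime(end, fmt)
    if finish <= begin:
        finish += timedelta(days=1)  # the endpoint falls on the next day
    return int((finish - begin).total_seconds() // 60)


if __name__ == "__main__":
    for end in ("23:25", "23:30", "00:00"):
        minutes = window_minutes("17:20", end)
        print(end, f"{minutes // 60}h {minutes % 60:02d}m")
```

The windows come out to 6h 05m, 6h 10m, and 6h 40m, so every published endpoint is consistent with a loss window of roughly six hours.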
Self-managed GitLab and GitHost instances were unaffected by the outage and data loss.
WAL-E is an open-source tool that archives PostgreSQL write-ahead log files to object storage, enabling continuous backup and point-in-time recovery. GitLab added hourly LVM snapshots and implemented WAL-E continuous streaming to S3 and Azure Blob, significantly expanding the recovery artifact set. As of the public tracker review, WAL-E monitoring was not yet available and automated restore testing had not been marked complete. Artifacts had been added, but end-to-end recovery assurance had not been verified.
From the first signal to all-clear in 18h 30m.
Before database load testing, an engineer manually took an LVM snapshot of production and loaded it into staging. That snapshot became the only viable recovery artifact six hours later.
Spammers were hammering the database with snippet creation while a background job was removing a flagged employee account. Together, the two pushed database load high enough to make the service unstable.
The load escalated into a full write lockup, taking parts of the service down and signaling that the situation had moved beyond routine database pressure.
Responders were paged because the secondary had fallen roughly 4 GB behind and stopped replicating entirely. The primary had already removed the WAL segments the secondary needed to catch up.
While trying to rebuild the secondary, an engineer ran the data-directory removal on the wrong host. The rebuild process gave no output, making progress impossible to judge. The command was stopped within seconds, but approximately 300 GB was already gone.
After an 18-hour copy from staging's throttled disks, GitLab brought the database back online — the first service available to users since the deletion. Webhooks were excluded from this initial restoration and came back about an hour later.
GitLab finished restoring webhooks and confirmed the service was operating as expected, closing out about 18 hours of downtime.