FM-006 · GitLab · 2017-01-31 · impact 18h 30m · SEV-1

The `rm -rf` That Erased GitLab's Production Database

How a manual PostgreSQL replica repair reached the wrong host, why S3 backups and disk snapshots were both unavailable, and why the only working restore took 18 hours at 60 Mbps.

database postgresql backup data-loss operator-error

case study

In the seconds it took an engineer to stop a runaway command, approximately 300 GB had been removed from GitLab's production database. Neither database host could be used for recovery. The pg_dump backups in S3 were empty, and disk snapshots had never been enabled on the database servers — three independent safety nets, all gone.

GitLab.com ran one primary PostgreSQL server handling all database load, with a single hot-standby secondary whose only job was failover. That evening, concurrent spam and a background job removing a flagged account pushed database activity high enough that the secondary fell roughly 4 GB behind. PostgreSQL streaming replication works by having the secondary replay change records — called WAL segments — that the primary writes. When the primary deletes old WAL segments before the secondary has consumed them, the secondary falls permanently out of sync. GitLab.com had no WAL archiving, so there was no retained history for the secondary to catch up from. The only fix was to wipe the secondary's data directory and rebuild it from scratch.
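The lag described above can be expressed as a byte distance between two WAL positions. PostgreSQL prints such a position (an LSN) as two hexadecimal numbers separated by a slash, the high and low 32 bits of a byte offset. The following is a minimal sketch of that arithmetic. The LSN values and the function names are hypothetical and chosen for illustration; none of them come from the incident record.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


def can_catch_up(replica_lsn: str, oldest_retained_lsn: str) -> bool:
    """A replica can resume streaming only if the primary still holds
    the WAL starting at the replica's replay position."""
    return lsn_to_int(replica_lsn) >= lsn_to_int(oldest_retained_lsn)


# Hypothetical positions, chosen so the lag is exactly 4 GiB.
primary = "2A/00000000"
replica = "29/00000000"
lag = replication_lag_bytes(primary, replica)
print(lag / 2**30)  # 4.0

# If the primary has already recycled WAL past the replica's position,
# streaming cannot resume without an archive.
print(can_catch_up(replica, "29/80000000"))  # False
```

Alerting on the lag alone does not prevent the failure. The second check is the one that matters once the primary starts recycling segments.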

pg_basebackup is the PostgreSQL tool for copying a primary database to a standby. Getting it to run exposed configuration friction that had never been documented or tested. max_wal_senders limits how many streaming replication connections the primary database accepts, and the team had to raise it. max_connections controls total database connections, and because PostgreSQL allocates shared memory for that limit at startup, changing it requires a restart; the team had to lower it from 8,000 to 2,000. Once started, pg_basebackup sits silent (no output, no progress bar) while it waits for the primary to begin sending data. This behavior was not documented in the team's runbooks. Faced with a process that gave no signal, an engineer ran the data-directory removal on the production primary, believing they were connected to the secondary.

pg_dump is a PostgreSQL utility that exports a database to a file; GitLab ran it nightly and stored the output in S3. When responders looked for a pg_dump backup, the S3 bucket was empty. The backup job had been using PostgreSQL 9.2 tooling against a 9.6 database, and every run had terminated with an error. Nobody knew the backups were broken because cron failure emails had been rejected by GitLab's mail setup for weeks.

No single person owned recovery testing, so the version mismatch, the missing disk snapshots, and the broken alert path had all survived undetected. Each failure was independent and each was silent.

An LVM snapshot is a point-in-time copy of a disk volume. The only viable restore point was one taken manually six hours earlier at 17:20 UTC. That snapshot lived on staging, a cost-saving environment running Azure Classic disks throttled to about 60 Mbps, so copying it to production took 18 hours. Every database change made in the six hours between the snapshot and the deletion was permanently lost, including at least 5,000 projects, 5,000 comments, and roughly 700 users. Git repositories and wikis were unavailable during the outage but survived intact because they lived outside the database. GitLab streamed the recovery live on YouTube, posting regular updates as the restore ran.

GitLab's published accounts agreed that the loss window began at 17:20 UTC but gave three different endpoints: 23:25, 23:30, and 00:00 UTC.

Self-managed GitLab and GitHost instances were unaffected by the outage and data loss.

WAL-E is an open-source tool that archives PostgreSQL write-ahead log files to object storage, enabling continuous backup and point-in-time recovery. GitLab added hourly LVM snapshots and implemented WAL-E continuous streaming to S3 and Azure Blob, significantly expanding the recovery artifact set. As of the public tracker review, WAL-E monitoring was not yet available and automated restore testing had not been marked complete. Artifacts had been added, but end-to-end recovery assurance had not been verified.

timeline · UTC

From the first signal to all-clear in 18h 30m.

About 17:20 UTC

A production snapshot is loaded into staging

Before database load testing, an engineer manually took an LVM snapshot of production and loaded it into staging. That snapshot became the only viable recovery artifact six hours later.

18:00 UTC

Spam load destabilizes the database

Spammers hammered the database with snippet creation while a background job was removing a flagged employee account. The combined load made the service unstable.

21:00 UTC

Database writes lock up

The load escalated into a full write lockup, taking parts of the service down and signaling that the situation had moved beyond routine database pressure.

22:00 UTC

Replication falls too far behind

Responders were paged because the secondary had fallen roughly 4 GB behind and stopped replicating entirely. The primary had already removed the WAL segments the secondary needed to catch up.

23:27 UTC

The primary deletion is stopped

While trying to rebuild the secondary, an engineer ran the data-directory removal on the wrong host. The rebuild process gave no output, making progress impossible to judge. The command was stopped within seconds, but approximately 300 GB was already gone.

17:00 UTC, February 1

The database returns without webhooks

After an 18-hour copy from staging's throttled disks, GitLab brought the database back online — the first service available to users since the deletion. Webhooks were excluded from this initial restoration and came back about an hour later.

About 18:00 UTC, February 1

Webhooks and normal operation return

GitLab finished restoring webhooks and confirmed the service was operating as expected, closing out about 18 hours of downtime.

lessons

What to take away.

Monitor backup success by independently checking artifact freshness and route failures through a channel that does not depend on the backup job's own email path. The backup job was failing while its cron emails were rejected, leaving responders unaware and the expected bucket empty. Independent freshness checks catch both job failure and notification failure, but they require explicit age and completeness thresholds and still do not prove restorability; pair them with restore validation where durable recovery matters.

semantic_correctness_monitoring
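A freshness check of the kind described in this lesson can be sketched as a pure function over the newest artifact's metadata. This is an illustration under stated assumptions, not GitLab's monitoring: the `BackupArtifact` type, the 26-hour age limit, and the 1 MB size floor are all hypothetical, and fetching the metadata from object storage is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class BackupArtifact:
    """Metadata about the newest object found in the backup location."""
    name: str
    size_bytes: int
    modified_at: datetime


def check_backup(
    newest: Optional[BackupArtifact],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: nightly job plus slack
    min_size_bytes: int = 1_000_000,           # assumed: smaller dumps are suspect
) -> List[str]:
    """Return a list of problems; an empty list means the artifact looks fresh."""
    if newest is None:
        return ["no backup artifact found"]
    problems = []
    age = now - newest.modified_at
    if age > max_age:
        problems.append(f"{newest.name} is stale: {age} old")
    if newest.size_bytes < min_size_bytes:
        problems.append(f"{newest.name} is too small: {newest.size_bytes} bytes")
    return problems


# An empty bucket is reported as a problem instead of passing silently.
now = datetime(2017, 1, 31, 23, 30, tzinfo=timezone.utc)
print(check_backup(None, now))  # ['no backup artifact found']
```

Because the check inspects the artifact and not the job's exit status or email, it would still fire when both the job and its notification path are broken. It says nothing about whether the dump can be restored.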

Asynchronous replicas need an independent source of retained change history when load spikes can push them beyond the primary's live retention window. The secondary became unrecoverable through normal catch-up after the primary removed required WAL, and the absence of WAL archiving forced a full manual resynchronization. Archived logs add storage, lifecycle, and restore-management costs, so this practice is most valuable where replica rebuild time would violate recovery objectives; it does not replace monitoring replica lag or testing full restores.

layered_recovery_planning

Classify repair steps that destroy state and make them verify the target's current role before execution, especially when primary and replica hosts accept similar commands. A secondary-rebuild action reached the primary because the responder believed they were on the other host, collapsing both database recovery sources. Role-aware wrappers, explicit target checks, or peer confirmation add friction during urgent work and cannot eliminate every operator error; they are warranted for irreversible steps whose accidental execution can remove the last viable copy. The record does not establish which technical guardrails already existed, so this is a recommendation rather than a proven missing control.

repair_safety_classification
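One possible shape for a role-aware wrapper is sketched below. It is an illustration, not a control GitLab is known to have used. The host names, the `role_of` lookup, and the wrapper itself are hypothetical. In practice the lookup might query `SELECT pg_is_in_recovery()`, which returns true on a standby, but a stopped server cannot answer that query, so the sketch treats an unknown role as a refusal.

```python
from typing import Callable, Optional


class WrongRoleError(RuntimeError):
    """Raised when a destructive step targets a host that is not a confirmed standby."""


def run_on_standby_only(
    host: str,
    role_of: Callable[[str], Optional[str]],
    destructive_step: Callable[[str], None],
) -> None:
    """Run destructive_step(host) only when role_of(host) returns "standby".

    role_of is a hypothetical lookup. It should return "primary", "standby",
    or None when the role cannot be determined; None is treated as a refusal.
    """
    role = role_of(host)
    if role != "standby":
        raise WrongRoleError(
            f"refusing destructive step on {host}: role is {role!r}, expected 'standby'"
        )
    destructive_step(host)


# Hypothetical inventory and a stand-in for the real wipe command.
roles = {"db-a.example": "primary", "db-b.example": "standby"}
wiped = []

run_on_standby_only("db-b.example", roles.get, wiped.append)
try:
    run_on_standby_only("db-a.example", roles.get, wiped.append)
except WrongRoleError as err:
    print(err)
print(wiped)  # ['db-b.example']
```

Failing closed means a stopped standby cannot be wiped through this path without another source of truth for its role, which is the friction the lesson warns about.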

Capacity-test the full snapshot restore path against recovery-time objectives, including the storage tier and transfer route used by the recovery source. The selected snapshot minimized known data loss, but staging's throttled disks made copying it to production take approximately 18 hours. Faster recovery storage or pre-positioned copies cost more during normal operation, so teams should size them to explicit recovery objectives and dataset growth; a fresh snapshot is not a timely recovery mechanism if its extraction path is the bottleneck.

snapshot_based_recovery
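The sizing question in this lesson reduces to simple arithmetic, sketched below. The inputs are assumptions: 300 GB is the amount reported deleted, not the stated size of the snapshot, 60 Mbps is read as megabits per second, and the 4-hour objective is hypothetical. The result is a theoretical floor with no protocol or disk overhead, so it should not be read as a reconstruction of the observed 18 hours.

```python
def transfer_hours(size_gb: float, throughput_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) at throughput_mbps (megabits/s)."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (throughput_mbps * 1e6)
    return seconds / 3600


def meets_objective(size_gb: float, throughput_mbps: float, objective_hours: float) -> bool:
    """True if the best-case transfer time fits inside the recovery-time objective."""
    return transfer_hours(size_gb, throughput_mbps) <= objective_hours


# Best case for 300 GB over a 60 Mbps path.
print(round(transfer_hours(300, 60), 1))  # 11.1

# Against a hypothetical 4-hour objective the path fails before any overhead is counted.
print(meets_objective(300, 60, objective_hours=4))  # False
```

Running the same calculation as the dataset grows shows when a recovery path that once met its objective stops doing so.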

Exercise backup jobs against the deployed server and client version matrix whenever packaging or topology can select tooling implicitly. The application-host backup path implicitly selected PostgreSQL 9.2 tooling for a PostgreSQL 9.6 server, so pg_dump terminated instead of producing a recovery artifact. Compatibility checks add matrix maintenance and should focus on combinations the deployment system can actually select; they complement, rather than replace, artifact freshness checks and restore drills.

configuration_matrix_testing
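A compatibility check for one cell of that matrix might look like the sketch below. It relies on two documented PostgreSQL facts: pg_dump refuses to dump from a server newer than its own major version, and major versions have two parts before PostgreSQL 10 (9.2, 9.6) and one part from 10 onward. The function names and the minor version numbers in the example are hypothetical, and the sketch ignores the lower bound on how old a server pg_dump supports.

```python
import re
from typing import Tuple


def major_version(version: str) -> Tuple[int, ...]:
    """Major version found in a PostgreSQL version string.

    Before PostgreSQL 10 the major version has two parts (9.2, 9.6);
    from 10 onward it is the first number alone.
    """
    match = re.search(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"no version number in {version!r}")
    first = int(match.group(1))
    if first >= 10:
        return (first,)
    if match.group(2) is None:
        raise ValueError(f"incomplete pre-10 version in {version!r}")
    return (first, int(match.group(2)))


def client_can_dump(client_version: str, server_version: str) -> bool:
    """pg_dump refuses servers newer than its own major version."""
    return major_version(client_version) >= major_version(server_version)


# Illustrative strings in the shape printed by `pg_dump --version`.
print(client_can_dump("pg_dump (PostgreSQL) 9.2.18", "9.6.1"))  # False
print(client_can_dump("pg_dump (PostgreSQL) 9.6.1", "9.6.1"))   # True
```

Running a check like this on the host that executes the backup, not on the database host, matters here because the mismatch came from which binary that host selected.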

sources

Read the sources.

Postmortem of database outage of January 31 (GitLab)

GitLab.com database incident (GitLab)

[meta] Listing all issues related to Jan 31st outage to track their progress (GitLab)
