~/library/FM-006
FM-006 · GitLab · 2017-01-31 · impact 18h 30m · SEV-1

Accidental rm -rf deletes production database.

How a manual PostgreSQL replica repair turned into deletion of GitLab.com's primary database, why the standard pg_dump backups were empty, and why the only usable restore path took more than eighteen hours.

database · backup · operator-error

GitLab.com's database redundancy depended on the standby staying close enough to promote. PostgreSQL wrote changes to a primary database and streamed them through the write-ahead log to a hot-standby secondary. When the secondary was current, GitLab had a failover target. When it fell so far behind that the primary had already removed the WAL segments it needed, the standby stopped being redundancy and became a rebuild job.
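
What "close enough" means is measurable. The sketch below is a minimal lag check run on the primary; the Omnibus paths, database name, and invocation are assumptions based on a default GitLab install, and the function names are the PostgreSQL 9.6 spellings.

    # Hypothetical lag check on the primary. A standby that needs WAL the primary
    # has already recycled cannot catch up by streaming and must be rebuilt.
    sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql \
      -h /var/opt/gitlab/postgresql -d gitlabhq_production -c "
        SELECT client_addr, state,
               pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS lag_bytes
        FROM pg_stat_replication;"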

That evening, GitLab.com was already under database pressure. Suspected spam and a background job trying to remove an employee account flagged for abuse increased load enough that users had trouble posting comments. Around 23:00 UTC, the secondary's replication process fell behind and could not catch up. The repair was standard PostgreSQL work: empty the secondary's data directory, then run pg_basebackup to copy the primary back over.
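
In outline, that repair looks like the sketch below. The data directory path is the one named later in this report; the primary hostname, replication role, and service commands are assumptions based on a default Omnibus install rather than GitLab's actual runbook.

    # Run on the secondary only, never on the primary.
    sudo gitlab-ctl stop postgresql
    sudo rm -rf /var/opt/gitlab/postgresql/data/*

    # Stream a fresh copy from the primary. --progress prints a byte counter;
    # without it, pg_basebackup can sit silent while the primary finishes a
    # checkpoint and starts sending data.
    sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_basebackup \
      --host=db1 --username=gitlab_replicator \
      --pgdata=/var/opt/gitlab/postgresql/data \
      --xlog-method=stream --progress --verbose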

The repair did not go smoothly. pg_basebackup hung without useful output. The team increased replication sender limits, hit a PostgreSQL restart problem caused by an old max_connections setting, fixed that, and still saw pg_basebackup waiting. One engineer suspected the previous attempts had left files in the secondary's data directory. They ran rm -rf /var/opt/gitlab/postgresql/data on what they believed was db2, the secondary. It was db1, the production primary. The engineer stopped the command after a second or two. Those seconds were enough: about 300 GB had already been removed.

What followed was worse than the initial mistake. The normal pg_dump backups uploaded to S3 were not there. The job had been using PostgreSQL 9.2 binaries against a PostgreSQL 9.6 database, so it failed, and the email notifications never reached operators because DMARC rejected them. Azure disk snapshots existed for other servers, but not for the database servers. Replication was gone because the secondary had already been wiped. The only usable restore point was an LVM snapshot taken at 17:20 UTC for staging, about six hours before the deletion.

Restoring from that snapshot meant copying the staging database back to production over slow Azure classic disks, then restoring webhooks from a second copy because staging snapshots removed them to avoid accidental triggers. GitLab also incremented database sequences before bringing the service back. At 18:00 UTC on February 1, GitLab.com was operating again with database state from 17:20 UTC the previous day. Repositories and wikis were intact, but database changes from the next six hours and ten minutes were gone.

GitLab kept a public recovery document, streamed the recovery on YouTube, and used Twitter for updates. The stream peaked around 5,000 viewers. That transparency helped users understand the recovery as it happened, but the postmortem also notes a boundary: the public document initially included the engineer's name, and GitLab said it would redact names in future incidents. Openness helped the response; naming individuals was not necessary to learn from the failure.

"The process of both finding and using backups failed completely."
GitLab postmortem, January 2017

From the first signal to all-clear in 18h 30m.

17:20 UTC
Fresh production snapshot copied to staging
Before testing pgpool-II in staging, an engineer takes a manual LVM snapshot of the production database. This snapshot later becomes the only usable restore point close to the deletion.
19:00 UTC
Database load spike begins
GitLab.com sees increased database load from suspected spam and a background job trying to remove a GitLab employee account that had been flagged for abuse. Users have trouble posting comments on issues and merge requests.
23:00 UTC
Replica falls too far behind
The PostgreSQL secondary falls behind because required WAL segments have already been removed from the primary. Without WAL archiving, the team must manually resynchronize the secondary by wiping its data directory and running pg_basebackup.
23:30 UTC
rm -rf runs on production primary (db1)
After pg_basebackup hangs silently and the team investigates replication settings, an engineer wipes what they believe is the secondary's data directory. The command runs on db1, the production primary, and removes about 300 GB before the engineer stops it.
23:35 UTC
All hands — backup assessment begins
The team realizes the primary database has been deleted and starts looking for restore options. Replication cannot help because the secondary was already wiped as part of the repair attempt.
01:30 UTC
Standard backups are unusable
The pg_dump backups uploaded to S3 are missing because the backup job used PostgreSQL 9.2 against a PostgreSQL 9.6 database and its email alerts were rejected. Azure disk snapshots were not enabled for database servers. The team chooses the 17:20 LVM snapshot.
17:00 UTC Feb 1
Database restored without webhooks
GitLab restores the database from the staging copy created from the LVM snapshot, but that copy had webhooks removed to protect staging. Engineers create a separate restore from the snapshot to recover webhook data.
18:00 UTC Feb 1
Service restored from LVM snapshot
GitLab finishes restoring webhooks, increments database sequences, and confirms the service is operating. GitLab.com returns with database state from 17:20 UTC, permanently losing changes made between 17:20 and 23:30 UTC.

A primary database deletion exposed broken and slow recovery paths.

An engineer ran a destructive command on the wrong host. GitLab's secondary database had fallen too far behind the primary and could not continue streaming changes because required WAL segments were gone. The team needed to wipe the secondary and rebuild it with pg_basebackup. While trying to clear what they believed was the secondary's data directory, the engineer removed the primary's PostgreSQL data directory instead.

The standard backup path had already failed silently. GitLab's pg_dump job ran from an application server where Omnibus selected PostgreSQL 9.2 binaries, while the production database ran PostgreSQL 9.6. pg_dump exited with an error, but the cron email alerts were rejected by the recipient mail server for failing DMARC checks. The S3 backup bucket was empty when the team needed it.
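
A mismatch of that kind is cheap to catch before the dump runs, because pg_dump refuses to dump a server newer than itself. The pre-flight sketch below uses a placeholder host, role, and output path, not GitLab's actual job.

    #!/usr/bin/env bash
    # Hypothetical pre-flight for a pg_dump backup job: abort loudly when the
    # client binary's major version differs from the server's.
    set -euo pipefail

    DB_HOST=db1                      # primary, as named in this report
    DB_NAME=gitlabhq_production      # default GitLab database name
    DB_USER=gitlab                   # placeholder role

    server=$(psql -h "$DB_HOST" -U "$DB_USER" -tAc 'SHOW server_version')  # e.g. 9.6.1
    client=$(pg_dump --version | awk '{print $NF}')                        # e.g. 9.2.18

    if [ "${server%.*}" != "${client%.*}" ]; then
      echo "pg_dump $client cannot back up server $server; aborting" >&2
      exit 1
    fi

    pg_dump -h "$DB_HOST" -U "$DB_USER" --format=custom "$DB_NAME" > /var/backups/db.dump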

The usable restore path was a staging LVM snapshot taken at 17:20 UTC. That snapshot existed because an engineer had manually refreshed staging before load-testing work, not because it was the primary disaster-recovery path. Restoring it meant copying data back from slow Azure classic disks, recovering webhook data from a second copy, incrementing database sequences, and accepting permanent loss of database changes made after 17:20 UTC.

What turned a bad command into permanent data loss.

01
No unambiguous host indicator in the terminal
The sysadmin was working across multiple tmux panes connected to different hosts and lost track of which was active. The shell prompt gave no clear signal. A hostname-and-environment label in the PS1 — especially one that changes color or style for production hosts — would have been an immediate stop signal before running any destructive command. A sketch of such a prompt follows this list.
02
Backup failure alerts never reached operators
The pg_dump backup job failed because it used PostgreSQL 9.2 binaries against a PostgreSQL 9.6 database. Cronjob notifications existed, but the recipient mail server rejected them for failing DMARC checks. The backup failure was detectable, but the signal died in the alert path.
03
Replica rebuild was manual and under-documented
The team had to manually resynchronize the secondary because WAL archiving was not in use and the required WAL segments were gone. pg_basebackup sat silently while waiting for replication data, and neither GitLab's runbook nor the official documentation made that behavior clear enough for the responders. The unclear tool behavior made the team suspect leftover files and repeat the wipe step.
04
No confirmation gate on destructive operations
The rm -rf command executed immediately against a live database data directory. Wrapping that operation in a tool that requires typing the target hostname and role would add a short verification step at the moment it matters; a sketch of that wrapper also follows this list. GitLab later emphasized recovery over trying to ban every dangerous command, but production-destructive maintenance still needs friction.
05
Recovery depended on slow staging infrastructure
The only usable six-hour-old restore point lived in a staging path hosted on Azure classic disks without Premium Storage. Copying data from staging to production took around 18 hours at about 60 Mbps. The backup existed, but the restore path was too slow for the recovery time GitLab needed.
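
On the first factor, a minimal prompt sketch: the GITLAB_ENV variable, the recovery.conf test, and the colors are assumptions, not GitLab's configuration, but the intent is that a production primary never looks like any other terminal.

    # Hypothetical snippet for the database hosts' shared bashrc.
    pg_role() {
      # In PostgreSQL 9.6 a standby keeps a recovery.conf in its data directory
      # (assumes the operator's shell can read that directory).
      if [ -f /var/opt/gitlab/postgresql/data/recovery.conf ]; then
        echo standby
      else
        echo primary
      fi
    }

    if [ "${GITLAB_ENV:-}" = "production" ]; then
      # Red background: the wrong terminal should look wrong before Enter is pressed.
      PS1='\[\e[1;97;41m\][\h production $(pg_role)]\[\e[0m\] \w \$ '
    else
      PS1='[\h ${GITLAB_ENV:-dev} $(pg_role)] \w \$ '
    fi

    # Mirror the same label into the tmux pane title so it survives pane switching.
    printf '\033]2;%s %s\033\\' "$(hostname -s)" "${GITLAB_ENV:-dev}"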
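
On the fourth factor, a sketch of that confirmation gate. The wrapper name and the role marker file are hypothetical; the point is that the operator retypes the host and its role at the moment the destructive command is about to run.

    #!/usr/bin/env bash
    # Hypothetical "confirm-host" wrapper for destructive maintenance commands.
    # /etc/db-role is an assumed marker file containing "primary" or "secondary".
    set -euo pipefail

    role=$(cat /etc/db-role 2>/dev/null || echo unknown)
    host=$(hostname -s)

    echo "About to run on ${host} (${role}): $*"
    read -rp "Retype this host's name to continue: " typed_host
    read -rp "Retype this host's role to continue: " typed_role

    if [ "$typed_host" != "$host" ] || [ "$typed_role" != "$role" ]; then
      echo "Confirmation mismatch; nothing was run." >&2
      exit 1
    fi

    exec "$@"

Invoked as, say, confirm-host rm -rf /var/opt/gitlab/postgresql/data/*, the wrapper turns the exact mistake from this incident into a failed confirmation instead of a deletion.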

What to take from this incident.

01
Assign ownership for backup restore, not just backup creation.
GitLab had backup procedures, but no one owned regular proof that they could restore production data. A backup owner should track freshness, alert delivery, restore time, and restore correctness. The success condition is not a file in a bucket; it is a tested path back to service.
02
Backup alerts must use a monitored path that cannot silently bounce.
GitLab's backup job failed and tried to report the failure by email, but DMARC rejection dropped the alert before anyone saw it. Backup monitoring should report into the same alerting system as production incidents, with dashboards and paging tied to backup age and last successful restore. A sketch of that wiring follows this list.
03
Shell prompts must distinguish production roles, not just hostnames.
Engineers working across primary and secondary database hosts need an unambiguous signal of where they are and what role that host currently serves. Hostnames, environment labels, database role, and tmux window titles should make the wrong terminal visually wrong before a destructive command runs.
04
Document recovery tool behavior before responders have to infer it.
pg_basebackup appeared stuck, but the postmortem says waiting silently for the primary to send replication data was normal behavior. Recovery runbooks should explain expected pauses, noisy failure modes, and when to keep waiting. Ambiguous tools invite destructive retries during incidents.
05
Design restore paths for the time you can afford to be down.
GitLab's usable restore point came from a staging LVM snapshot, but copying it back from slow disks took around 18 hours. Recovery planning must include restore bandwidth, target environment, data-cleanup steps, and validation work. A backup that restores too slowly may preserve data but still miss the service's recovery objective.
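
For the second of these, a sketch of what a monitored path can look like, assuming a Prometheus Pushgateway at a placeholder address and a placeholder dump directory. The alert keys off backup age instead of an email arriving.

    #!/usr/bin/env bash
    # Hypothetical post-backup hook: publish the age of the newest dump as a metric,
    # so paging can fire when gitlab_backup_age_seconds exceeds the backup interval.
    set -euo pipefail

    PUSHGATEWAY=http://pushgateway.internal:9091    # placeholder address
    latest=$(ls -t /var/backups/*.dump 2>/dev/null | head -n 1 || true)

    if [ -n "$latest" ]; then
      age=$(( $(date +%s) - $(stat -c %Y "$latest") ))
    else
      age=999999999    # no dump found at all: maximally stale
    fi

    printf 'gitlab_backup_age_seconds %s\n' "$age" |
      curl --silent --fail --data-binary @- "$PUSHGATEWAY/metrics/job/db_backup"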

Read the original.

Postmortem of database outage of January 31
about.gitlab.com
← previous
FM-005 · A latent CDN bug, woken by a valid config change
next →
FM-007 · A maintenance script deletes 883 customer sites