FM-007 · Atlassian · 2022-04-05 · impact 14d · SEV-1

A maintenance script deletes 883 customer sites.

How a planned cleanup of the deprecated Insight – Asset Management app turned into the permanent deletion of entire Atlassian Cloud sites, why the deletion API accepted both site and app identifiers without telling them apart, and why restoring 775 customers required Atlassian to rebuild each site one at a time.

cloud · database · operator-error

For an Atlassian Cloud customer, a "site" was the unit that made Jira, Confluence, Opsgenie, and related products feel like one workspace. Underneath, that site pointed at data spread across multiple distributed datastores. As long as the metadata and datastores stayed aligned, the customer had their work. A site deletion fanned out across those stores in one operation. Restoring the site meant rebuilding each store from backup and stitching the tenant back together by hand. The delete path was one step; the restore path was many.
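The asymmetry is easier to see laid out as code. The sketch below is a deliberately toy illustration, assuming invented datastore names and in-memory dictionaries rather than anything Atlassian actually runs: the delete path is one fan-out, while the restore path is a per-store rebuild followed by re-stitching and verification.

```python
# Toy illustration of the delete/restore asymmetry. Store names, the backup
# layout, and the tenant-metadata map are all invented for this sketch.

DATASTORES = {
    "issue_store": {"site-a": "issues"},
    "page_store": {"site-a": "pages"},
    "media_store": {"site-a": "attachments"},
}
BACKUPS = {name: dict(store) for name, store in DATASTORES.items()}
TENANT_METADATA = {"site-a": list(DATASTORES)}


def delete_site(site_id: str) -> None:
    """The delete path: a single operation fans out across every datastore."""
    for store in DATASTORES.values():
        store.pop(site_id, None)
    TENANT_METADATA.pop(site_id, None)


def restore_site(site_id: str) -> None:
    """The restore path: rebuild each store from its backup, then re-stitch
    the tenant metadata and verify that every store actually came back."""
    for name, store in DATASTORES.items():
        store[site_id] = BACKUPS[name][site_id]
    TENANT_METADATA[site_id] = [n for n, s in DATASTORES.items() if site_id in s]
    assert len(TENANT_METADATA[site_id]) == len(DATASTORES), "incomplete restore"


delete_site("site-a")    # one step
restore_site("site-a")   # many steps, per store, then verification
```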

In early 2022, Atlassian was retiring Insight – Asset Management, a standalone app that had been consolidated into Jira Service Management. One team owned the cleanup: take Insight off the customer sites that still had it. They asked an engineering team to run a maintenance script that did the removal. The engineering team ran the script with the list of identifiers it had been given. The identifiers pointed at the full customer sites — not at Insight on those sites — and the deletion API accepted both site and app identifiers without telling them apart.

The script started at 07:38 UTC on April 5, 2022. It ran for twenty-three minutes. The first support ticket arrived at 07:46 UTC, eight minutes in. By the time the script finished at 08:01 UTC, 883 sites had been permanently deleted across 775 customers and 108 Atlassian-owned tenants. Atlassian declared a major incident at 08:17 UTC and confirmed the root cause publicly at 11:13 UTC: the wrong identifiers had been passed to the script, and nothing in the script, the review, or the deletion API itself had caught the mismatch.

What followed was a long recovery, not a complicated one. Atlassian's backups for every affected site existed and met the platform's one-hour recovery point objective. The problem was the shape of the restore pipeline. It had been built around the customer one-off case: a single customer asking to recover a single site. It was not designed to handle hundreds of sites in parallel, to coordinate across all of a site's datastores in bulk, or to be operated by a team running it as a pipeline. The team used it that way anyway, and scripted around it as they went. The first customers came back on April 8. The pipeline became more parallel in the second week as scripted improvements landed. The final customers were restored on April 18, fourteen days after the deletion. Almost all customers lost at most five minutes of data; 57 customers restored early lost more, because their restore-point policies across products were not aligned.

Public communication followed its own timeline. Atlassian's engineers had been working on recovery from the morning of April 5, but the first broad public statement came hours into the day, and the co-CEO's direct apology email to affected customers came about three days later. Customers experienced silence in the gap even though work was happening, and that silence shaped how the rest of the recovery felt.

The deletion happened because two paths, each safe in isolation, shared one underlying API: a customer hard-deleting their own site, with confirmation, and an engineering team running a bulk cleanup, without one. The API did not know which kind of object the caller intended to delete, and the input shape that should have stopped the bulk run was never checked. A delete path that could fan out across hundreds of sites in minutes was paired with a restore path that could only undo one site at a time. The mismatch between the two is the incident.

Instead of providing the IDs of the intended app being marked for deletion, the team provided the IDs of the entire cloud site where the apps were to be deleted.
Atlassian, Post-incident review on the April 2022 outage

From first signal to all-clear in 14 days.

Apr 5, 07:38 UTC
Maintenance script runs against customer sites
An Atlassian engineering team executes a maintenance script as part of deprecating the Insight – Asset Management standalone app. The script is given site identifiers instead of app identifiers. It begins deleting entire customer sites rather than removing only the legacy app from them.
Apr 5, 07:46 UTC
First support ticket
A customer opens the first support ticket reporting that their site is inaccessible — eight minutes after the script started running.
Apr 5, 08:01 UTC
Script finishes running
The script completes after running for twenty-three minutes. By the time it stops, 883 sites have been permanently deleted, spanning 775 customers and 108 Atlassian-owned tenants.
Apr 5, 08:17 UTC
Major incident declared
Atlassian confirms the deletion pattern across its support queues and declares a major incident. The first public Statuspage update follows at 09:03 UTC.
Apr 5, 11:13 UTC
Root cause confirmed and communicated
Engineering confirms the wrong identifiers were passed to the script and that the deletion API accepted both site and app identifiers without telling them apart. Backups exist for every affected site; no customer data is permanently lost, but recovery has to be done site by site.
Apr 8, 01:50 UTC
Co-CEO sends public apology email
Atlassian's co-CEO sends a direct apology to affected customers, around three days into the incident. The communication delay becomes a separate focus of customer criticism.
Apr 8
First wave of customers restored
Engineers complete the first batch of restorations. The team works through the affected population in order, prioritising based on site size and customer impact.
Apr 14
Recovery pipeline parallelised
Atlassian iterates on the restore process during the incident, scripting steps that were manual and running more restores in parallel. The rate of customers restored per day rises sharply in the second week.
Apr 18
Final customers restored
The last of the 775 affected customers regain access to their sites, fourteen days after the deletion. Almost all customers lose at most the five minutes of data immediately preceding the deletion; for 57 customers restored early, data loss in Confluence and Insight tables exceeds five minutes because of inconsistent restore-point policies across products.

A deletion API that accepted any identifier, and a request that did not name an action.

Atlassian was retiring Insight – Asset Management, a standalone app that had been folded into Jira Service Management in 2021. The cleanup team asked an engineering team to remove the deprecated app from a set of customer sites. The engineering team ran a maintenance script with the list of identifiers it had been given. Those identifiers pointed at the full customer sites, not at the app on those sites. The deletion API accepted both kinds of identifier and assumed the input was correct, so the script deleted 883 sites for 775 customers instead of removing the app from them.

The handoff between the two teams did not include a single artifact that named both the action and the target in a form the script could verify against. The peer-review process for the script focused on which endpoint was called and how, not on whether the identifiers being passed in matched the type of object the script was supposed to act on. There was no warning signal in the API to confirm the type of deletion (site or app) being requested.
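The missing guard is small. The sketch below is a hypothetical illustration, not Atlassian's API: the identifier prefixes (site-, app-), the action names, and the exception type are invented, but it shows the shape of a check that refuses a deletion when the identifier type does not match the requested action.

```python
# Hypothetical sketch of the check the deletion API lacked. The identifier
# prefixes, action names, and exception type are invented for illustration.

class IdentifierMismatch(Exception):
    pass


EXPECTED_PREFIX = {"delete_app": "app-", "delete_site": "site-"}


def delete(action: str, object_id: str) -> None:
    """Perform a deletion only if the identifier type matches the action."""
    prefix = EXPECTED_PREFIX[action]
    if not object_id.startswith(prefix):
        raise IdentifierMismatch(
            f"{action} was asked to act on {object_id!r}, "
            f"which does not look like a {prefix}* identifier"
        )
    print(f"{action}: {object_id}")  # the real deletion would happen here


# The April 2022 run, in these terms: an app cleanup fed with site identifiers.
try:
    delete("delete_app", "site-7f3c")
except IdentifierMismatch as exc:
    print(f"refused: {exc}")
```

With a guard like this, a wrong list of identifiers fails loudly on the first call instead of deleting sites for twenty-three minutes.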

Atlassian had backups of every affected site, and the recovery point objective of one hour was met. The problem was the shape of the restore pipeline. It had been built around the customer one-off case — a single customer asking to recover a single site — and not around hundreds of sites in parallel. Each site had to be rebuilt across multiple distributed datastores and multiple co-resident products, then checked, then handed back. With no bulk-restore tool ready, recovery scaled linearly with the number of affected customers and ran for fourteen days.
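The gap in the restore path can be sketched the same way. The snippet below is an outline under stated assumptions, not Atlassian's tooling: restore_single_site stands in for the supported one-customer flow, and restore_in_bulk shows the bounded-parallel wrapper the team effectively had to improvise, with failures tracked so they can be retried rather than silently skipped.

```python
# Sketch of a bulk-restore wrapper around a single-site restore. All names
# and the concurrency limit are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor, as_completed


def restore_single_site(site_id: str) -> str:
    """Stand-in for the supported flow: rebuild each datastore from backup,
    re-stitch tenant metadata, verify, then hand the site back."""
    # ... per-datastore rebuild and verification would happen here ...
    return site_id


def restore_in_bulk(site_ids, max_parallel=20):
    """Run many single-site restores concurrently and keep track of failures
    so they feed a retry queue instead of being lost."""
    restored, failed = [], []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(restore_single_site, s): s for s in site_ids}
        for future in as_completed(futures):
            try:
                restored.append(future.result())
            except Exception:
                failed.append(futures[future])
    return restored, failed


restored, failed = restore_in_bulk([f"site-{i}" for i in range(883)])
```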

What turned a scoped cleanup into a two-week incident.

01
Communication gap between requesting and executing teams
The team that owned the cleanup wanted Insight removed from a set of customer sites. The team that ran the script received a list of identifiers and an endpoint, not a description of the action and the target type. The gap survived the handoff because nothing in the request, the review, or the tooling forced the two to be paired together.
02
Deletion API accepted both site and app identifiers without distinguishing them
The same endpoint that handled customer-initiated site deletions also handled engineer-initiated app removals. It accepted both identifier types and assumed the input was correct. There was no warning signal to confirm what type of object the caller intended to delete, so an app cleanup with site identifiers became a site deletion with no friction.
03
Peer review focused on the call, not the intent
The script passed Atlassian's standard peer-review process, which focused on which endpoint was being called and how. The review did not check that the identifiers being passed in matched the type of object the script was supposed to act on, because that check would have required understanding the intent behind the run, not just the syntax of the call.
04
Restore tooling was built for the customer one-off case
Backups existed for every affected site, but the supported restore tool was designed for a single customer asking to recover a single site. The tool did not coordinate in bulk across the multiple datastores and co-resident products that make up an Atlassian Cloud site, did not run in parallel by default, and relied on manual steps that became bottlenecks at this scale. The deletion path scaled; the restoration path did not.
05
Public communication lagged the operational response
Engineering work on recovery started within hours of the deletion, and the first public status update went out the same morning, but the co-CEO's direct apology email to affected customers came about three days into the incident. In the gap, customers pieced together what was happening from their own support queues and social media, and the absence of early, direct communication made the recovery feel slower than it was.

What to take from this incident.

01
Bulk destructive APIs must distinguish the kinds of objects they delete.
An endpoint that accepts both 'delete site' and 'delete app on site' identifiers needs to refuse the call if the input type does not match the action. The kind of object being deleted is information the API has access to and the operator can be wrong about; the check belongs in the API, not in the operator's head.
02
Require destructive scripts to name the action and the target together.
A request that boils down to a list of identifiers loses the action it was meant to perform. Make scripts that delete production data require an explicit action argument that has to match the identifier type, and reject runs where the two disagree (a minimal sketch of this check follows this list). The same pattern applies to handoffs between teams: the request should describe what is being done, not only what it is being done to.
03
Build restore pipelines that match the worst plausible deletion.
If the deletion path can remove hundreds of sites in minutes, the restore path needs to be able to bring them back at a comparable rate. Mass restore should be a tested, owned capability before it is needed, covering all distributed datastores and co-resident products that share a site, and covering the bulk pipeline rather than just the customer-initiated single-site flow.
04
Publish a holding statement early, even before the scope is clear.
Customers affected by a data-loss incident measure their experience against silence. A short, honest acknowledgment within hours, refined as scope and recovery progress become clearer, preserves trust during a long recovery. Waiting until the technical picture is complete suggests to customers that no work was happening when in fact it was.
05
Map your recovery point objective to every product that shares a site.
Atlassian's one-hour RPO was met across the platform, and almost all affected customers lost at most five minutes of data, but 57 customers restored early lost more than five minutes in Confluence and Insight tables because their restore-point policies were not aligned with the rest of the platform. When products share a tenant, their backup cadences should share an RPO too, or the differences will surface in the worst recovery, not the best one.
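The first two takeaways can be combined into a single caller-side pattern. The sketch below is hypothetical: the flag names, identifier prefixes, and refusal messages are assumptions, but the structure is the point: a destructive script must be told both the action and the target type, and must refuse to run when the identifiers it is given do not match.

```python
# Hypothetical maintenance-script skeleton for lesson 02: name the action and
# the target type together, and refuse mismatched identifiers up front.

import argparse
import sys


def main() -> None:
    parser = argparse.ArgumentParser(description="bulk deletion maintenance script")
    parser.add_argument("--action", required=True, choices=["delete-app", "delete-site"])
    parser.add_argument("--target-type", required=True, choices=["app", "site"])
    parser.add_argument("ids", nargs="+", help="identifiers to act on")
    args = parser.parse_args()

    # The declared action and the declared target type must agree.
    if not args.action.endswith(args.target_type):
        sys.exit(f"refusing to run: {args.action} does not act on {args.target_type}s")

    # Every identifier must look like the declared target type.
    bad = [i for i in args.ids if not i.startswith(f"{args.target_type}-")]
    if bad:
        sys.exit(f"refusing to run: {len(bad)} ids do not look like {args.target_type} ids, e.g. {bad[:3]}")

    # Only now does the destructive call go out.
    for object_id in args.ids:
        print(f"{args.action} {object_id}")


if __name__ == "__main__":
    main()
```

Invoked as, say, `cleanup.py --action delete-app --target-type app app-123 app-456`, the run proceeds; the same command fed a list of site- identifiers exits before anything is deleted.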

Read the original.

Post-incident review on the April 2022 outage
atlassian.com