A maintenance script deletes 883 customer sites.
How a planned cleanup of the deprecated Insight – Asset Management app turned into the permanent deletion of entire Atlassian Cloud sites, why the deletion API accepted both site and app identifiers without telling them apart, and why restoring 775 customers required Atlassian to rebuild each site one at a time.
For an Atlassian Cloud customer, a "site" was the unit that made Jira, Confluence, Opsgenie, and related products feel like one workspace. Underneath, that site pointed at data spread across multiple distributed datastores. As long as the metadata and datastores stayed aligned, the customer had their work. A site deletion fanned out across those stores in one operation. Restoring the site meant rebuilding each store from backup and stitching the tenant back together by hand. The delete path was one step; the restore path was many.
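The asymmetry is easier to see in miniature. The sketch below is a toy model, not Atlassian's architecture: the store names, the SiteRecord type, and the functions are invented to show the shape of the problem, a delete that fans out across every store in one call against a restore that has to rebuild the stores one at a time and re-link the tenant.

```python
# A minimal sketch of the delete/restore asymmetry described above. All names
# here (SiteRecord, DATASTORES, restore_site, ...) are hypothetical, not
# Atlassian's internal APIs.
from dataclasses import dataclass, field

DATASTORES = ["issues", "pages", "permissions", "media", "search_index"]

@dataclass
class SiteRecord:
    site_id: str
    # Each datastore holds a shard of the tenant's data, keyed by site_id.
    stores: dict = field(default_factory=lambda: {name: {} for name in DATASTORES})
    deleted: bool = False

def delete_site(site: SiteRecord) -> None:
    """The delete path: one call fans out across every datastore."""
    for name in DATASTORES:
        site.stores[name].clear()      # each store drops its shard
    site.deleted = True                # tenant metadata marks the site gone

def restore_site(site: SiteRecord, backups: dict) -> None:
    """The restore path: rebuild each store from backup, then re-link the tenant."""
    for name in DATASTORES:
        site.stores[name] = dict(backups[name])   # per-store rebuild, one at a time
    site.deleted = False                          # stitch the tenant back together
```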
In early 2022, Atlassian was retiring Insight – Asset Management, a standalone app that had been consolidated into Jira Service Management. One team owned the cleanup: take Insight off the customer sites that still had it. They asked an engineering team to run a maintenance script that did the removal. The engineering team ran the script with the list of identifiers it had been given. The identifiers pointed at the full customer sites — not at Insight on those sites — and the deletion API accepted both site and app identifiers without telling them apart.
The script started at 07:38 UTC on April 5, 2022. It ran for twenty-three minutes. The first support ticket arrived at 07:46 UTC, eight minutes in. By the time the script finished at 08:01 UTC, 883 sites had been permanently deleted across 775 customers and 108 Atlassian-owned tenants. Atlassian declared a major incident at 08:24 UTC and confirmed the root cause publicly at 11:13 UTC: the wrong identifiers had been passed to the script, and nothing in the script, the review, or the deletion API itself had caught the mismatch.
What followed was a long recovery, not a complicated one. Atlassian's backups for every affected site existed and met the platform's one-hour recovery point objective. The problem was the shape of the restore pipeline. It had been built around the customer one-off case: a single customer asking to recover a single site. It was not designed to handle hundreds of sites in parallel, to coordinate across all of a site's datastores in bulk, or to be operated by a team running it as a pipeline. The team used it that way anyway, and scripted around it as they went. The first customers came back on April 8. The pipeline became more parallel in the second week as scripted improvements landed. The final customers were restored on April 18, fourteen days after the deletion. Almost all customers lost at most five minutes of data; 57 customers restored early lost more, because their restore-point policies across products were not aligned.
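Why restoring early could cost extra data comes down to picking a restore point that every co-resident product can agree on. The sketch below is one way to reason about it, not Atlassian's tooling, and the schedules and timestamps are illustrative: when one product keeps snapshots on a coarser schedule, the newest point in time that all stores share drags the whole site further back.

```python
# A minimal sketch of why misaligned restore-point policies cost extra data.
# Snapshot schedules and timestamps are illustrative, not Atlassian's.
from datetime import datetime, timedelta

def consistent_restore_point(snapshots_by_store: dict[str, list[datetime]]) -> datetime:
    """Newest point in time that every co-resident store can restore to.
    With aligned policies this sits close to the deletion time; one laggard
    store pulls the whole site back to its last snapshot."""
    return min(max(times) for times in snapshots_by_store.values())

deletion_time = datetime(2022, 4, 5, 7, 40)
aligned = {
    "jira":       [deletion_time - timedelta(minutes=5)],
    "confluence": [deletion_time - timedelta(minutes=4)],
}
misaligned = dict(aligned, opsgenie=[deletion_time - timedelta(hours=3)])

print(consistent_restore_point(aligned))     # roughly 5 minutes of loss
print(consistent_restore_point(misaligned))  # dragged back about 3 hours
```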
Public communication followed its own timeline. Atlassian's engineers had been working on recovery from the morning of April 5, but the first broad public statement came hours into the day, and the co-CEO's direct apology email to affected customers came about three days later. Customers experienced silence in the gap even though work was happening, and that silence shaped how the rest of the recovery felt.
The deletion happened because two paths that each looked safe on its own shared one underlying API: a customer hard-deleting their own site, with a confirmation step, and an engineering team running a bulk cleanup with none. The API did not know which kind of object the caller intended to delete, and the shape of the input that should have stopped the bulk run was never checked. A delete path that could fan out across hundreds of sites in minutes was paired with a restore path that could only undo one site at a time. The gap between those two speeds is the incident.
"Instead of providing the IDs of the intended app being marked for deletion, the team provided the IDs of the entire cloud site where the apps were to be deleted."
(Atlassian, Post-incident review on the April 2022 outage)
From first signal to all-clear: 14 days.
A deletion API that accepted any identifier, and a request that did not name an action.
Atlassian was retiring Insight – Asset Management, a standalone app that had been folded into Jira Service Management in 2021. The cleanup team asked an engineering team to remove the deprecated app from a set of customer sites. The engineering team ran a maintenance script with the list of identifiers it had been given. Those identifiers pointed at the full customer sites, not at the app on those sites. The deletion API accepted both kinds of identifier and assumed the input was correct, so the script deleted 883 sites for 775 customers instead of removing the app from them.
The handoff between the two teams did not include a single artifact that named both the action and the target in a form the script could verify against. The peer-review process for the script focused on which endpoint was called and how, not on whether the identifiers being passed in matched the type of object the script was supposed to act on. And the API itself offered no warning and no way to confirm which type of deletion, site or app, was being requested.
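A minimal sketch of the check that was missing, under stated assumptions: the identifier prefixes, the DeletionRequest type, and the validate step are all hypothetical, not Atlassian's internal API. The point is that a request artifact naming both the action and the target type gives the script something mechanical to verify before anything is deleted.

```python
# Hypothetical request artifact that names the action and the target type,
# validated against the identifiers it carries before anything is deleted.
# ID prefixes are assumed for illustration only.
from dataclasses import dataclass

ID_PREFIXES = {"site": "site-", "app": "app-"}

@dataclass(frozen=True)
class DeletionRequest:
    action: str        # e.g. "remove-insight-app"
    target_type: str   # "site" or "app"
    target_ids: list

    def validate(self) -> None:
        prefix = ID_PREFIXES[self.target_type]
        mismatched = [i for i in self.target_ids if not i.startswith(prefix)]
        if mismatched:
            raise ValueError(
                f"{self.action}: expected {self.target_type} ids, "
                f"got {len(mismatched)} that look like something else: {mismatched[:3]}"
            )

# The April 2022 run, reconstructed: the request said "remove the app" but the
# list contained site identifiers. A type check like this refuses to run.
req = DeletionRequest(action="remove-insight-app",
                      target_type="app",
                      target_ids=["site-4821", "site-0057"])
try:
    req.validate()
except ValueError as err:
    print(err)   # the run stops here instead of fanning out a site deletion
```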
Atlassian had backups of every affected site, and the recovery point objective of one hour was met. The problem was the shape of the restore pipeline. It had been built around the customer one-off case — a single customer asking to recover a single site — and not around hundreds of sites in parallel. Each site had to be rebuilt across multiple distributed datastores and multiple co-resident products, then checked, then handed back. With no bulk-restore tool ready, recovery scaled linearly with the number of affected customers and ran for fourteen days.
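The scaling problem has a simple shape too. The sketch below is hypothetical, not the real pipeline: restore_one_site stands in for the whole per-site rebuild and verification, and the parallel wrapper is the kind of scripting the responders added around a tool designed for one site at a time.

```python
# A minimal sketch of linear recovery versus an improvised parallel wrapper.
# Function names and timings are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

def restore_one_site(site_id: str) -> str:
    """Stand-in for the one-off pipeline: rebuild every datastore for one site,
    verify it, hand it back. Treated here as an opaque, slow unit of work."""
    time.sleep(0.01)   # placeholder for hours of per-site work
    return site_id

def restore_serially(site_ids: list) -> list:
    # The pipeline as designed: one customer, one site, one run at a time.
    return [restore_one_site(s) for s in site_ids]

def restore_in_parallel(site_ids: list, workers: int = 8) -> list:
    # The improvised version: drive many independent one-off runs concurrently.
    # This only helps if the per-site steps do not contend for shared state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(restore_one_site, site_ids))
```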