Atlassian’s 13-Day Outage: Lessons in Tech Crisis Management

Published in Bootcamp

4 min read · Sep 15, 2023

In April 2022, Atlassian, the software company behind the project management tool Jira and the collaboration platform Confluence, suffered a 13-day outage that affected a subset of its customers. The incident revealed crucial lessons in incident management, transparency, and disaster recovery. This article covers the details of the incident, Atlassian’s response, and the takeaways for organizations facing similar challenges.

The Prelude: Atlassian’s Journey with Mindville

To understand the incident’s context, we must rewind to 2020, when Atlassian acquired Mindville, a company known for its flagship product, Insight, an asset management application integrated with Jira. By 2021, Atlassian had integrated Insight into its product ecosystem. In 2022, it began removing the old, standalone version of the application from the customer sites where it had previously been installed.

The Catastrophic Mistake: Human Error and Miscommunication

The incident began with a seemingly straightforward task: deleting instances of the old application from customer sites. However, miscommunication between two teams led to a catastrophic error. Team A gave Team B the IDs of the entire customer sites rather than the IDs of the application instances to be removed. Team B ran the deletion script without noticing the mistake, deleting the sites of 775 customers on April 5, 2022.

The consequences were dire. Customers lost access to critical services, including Jira, Confluence, and even Opsgenie, Atlassian’s PagerDuty competitor. The outage began just as Atlassian’s annual conference, Team ’22, was underway in Las Vegas, further complicating the situation.

Atlassian: Complete Tech Management Ecosystem

The Slow Path to Recovery: Challenges and Solutions

Atlassian’s initial response was to form a cross-functional, global incident response team, working 24/7 to resolve the issue. However, they faced numerous challenges:

Lack of Customer Contact Information: The customer contact information had been deleted along with the sites, making it impossible to inform affected customers promptly.
Escalation Management Breakdown: The absence of site IDs disrupted the escalation management dashboards, preventing the correct grouping of tickets.
Complex Restoration Process: Restoring the deleted sites was a complex, multi-step process, with numerous unforeseen dependencies between steps.

Despite these challenges, Atlassian managed to restore 53% of impacted users between April 5th and April 14th. They implemented a code freeze on April 8th to prevent further inconsistencies and proposed a more efficient approach on April 9th, labeled “Restoration 2.”

Restoration 2 reduced complexity and dependencies by reusing each old site’s cloudId identifier, which significantly streamlined the process. Testing the new process in the middle of the incident was risky but essential for a successful recovery.

Throughout the incident, Atlassian leveraged over 450 support engineers worldwide. They optimized their processes incrementally, focusing on automation and asynchronous execution of large steps to expedite recovery.

Remarkably, Atlassian was able to recover data to a point just 5 minutes before the incident, a strong result when measured against a recovery point objective (RPO), thanks to its combination of full and incremental backups.

Fifteen Key Takeaways for Incident Management and Transparency

The Atlassian outage provides valuable insights into handling similar incidents and maintaining trust with customers:

Immediate Actions: In the short term, block bulk deletions and implement soft deletes as a universal practice. Soft deletes involve marking data for deletion before actual deletion, allowing for safe recovery.
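To make the soft-delete idea concrete, here is a minimal sketch in Python. It is not Atlassian’s implementation: the SiteStore class, the 30-day RETENTION window, and the MAX_BATCH cap are all assumptions chosen for illustration. Deleting only sets a timestamp, a separate purge step removes data after the grace period, and oversized batches are rejected outright.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

RETENTION = timedelta(days=30)  # assumed grace period before a hard delete
MAX_BATCH = 10                  # assumed cap on sites per deletion request


@dataclass
class Site:
    site_id: str
    deleted_at: Optional[datetime] = None  # None means the site is live


class SiteStore:
    """Hypothetical in-memory store; a real system would use a database."""

    def __init__(self) -> None:
        self._sites: Dict[str, Site] = {}

    def add(self, site_id: str) -> None:
        self._sites[site_id] = Site(site_id)

    def is_live(self, site_id: str) -> bool:
        site = self._sites.get(site_id)
        return site is not None and site.deleted_at is None

    def soft_delete(self, site_ids: List[str], now: datetime) -> None:
        # Refuse oversized batches and unknown IDs before changing anything.
        if len(site_ids) > MAX_BATCH:
            raise ValueError(f"refusing to delete {len(site_ids)} sites at once")
        unknown = [sid for sid in site_ids if sid not in self._sites]
        if unknown:
            raise KeyError(f"unknown site IDs: {unknown}")
        for sid in site_ids:
            self._sites[sid].deleted_at = now  # mark only; data stays in place

    def restore(self, site_id: str) -> None:
        self._sites[site_id].deleted_at = None

    def purge(self, now: datetime) -> List[str]:
        # Hard-delete only sites whose grace period has fully elapsed.
        expired = [sid for sid, site in self._sites.items()
                   if site.deleted_at is not None
                   and now - site.deleted_at >= RETENTION]
        for sid in expired:
            del self._sites[sid]
        return expired


store = SiteStore()
store.add("site-1")
marked_at = datetime(2022, 4, 5, tzinfo=timezone.utc)
store.soft_delete(["site-1"], marked_at)
store.restore("site-1")  # still recoverable within the grace period
```

The point of the design is that a mistaken deletion costs a restore call rather than a multi-day recovery, as long as someone notices before the grace period ends.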

Disaster Recovery: Ensure disaster recovery processes meet recovery time objectives (RTO) and regularly test them. Accelerate multi-product/multi-site restoration for a larger set of customers.
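A quick back-of-the-envelope calculation shows why restoration speed at scale matters. The sketch below estimates how long a multi-site restore would take when sites are restored in parallel batches, then compares the result with an RTO. The function names and all the numbers in the usage lines are hypothetical, not figures from the incident.

```python
import math
from datetime import timedelta


def estimated_restore_time(site_count: int, per_site: timedelta,
                           parallelism: int) -> timedelta:
    """Estimate total time if `parallelism` sites are restored at a time."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    batches = math.ceil(site_count / parallelism)  # last batch may be partial
    return batches * per_site


def meets_rto(estimate: timedelta, rto: timedelta) -> bool:
    return estimate <= rto


# Hypothetical drill: 100 sites, 2 hours per site, 6-hour RTO.
slow = estimated_restore_time(100, timedelta(hours=2), parallelism=10)
fast = estimated_restore_time(100, timedelta(hours=2), parallelism=50)
print(slow, meets_rto(slow, timedelta(hours=6)))  # 20:00:00 False
print(fast, meets_rto(fast, timedelta(hours=6)))  # 4:00:00 True
```

A process that comfortably meets its RTO for one site can miss it badly for hundreds, which is why drills should cover the multi-site case as well.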

Incident Management: Improve the incident management process for very large-scale incidents, focusing on coordination and communication.

Contact Information: Back up authorized account contact information outside product instances to prevent its loss during an outage.

Unified Escalation System: Implement a unified escalation system that allows for multiple work objects (tickets, tasks) to be stored under a single customer account, ensuring efficient communication.
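One way to picture such a system is to key every ticket on a stable customer account ID instead of a site ID, so tickets can still be grouped when the site itself no longer exists. The sketch below is illustrative only; the Ticket fields and the group_by_account helper are assumed names, not part of any Atlassian product.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    account_id: str          # stable customer key, independent of any site
    site_id: Optional[str]   # may be missing once a site has been deleted
    summary: str


def group_by_account(tickets: List[Ticket]) -> Dict[str, List[Ticket]]:
    """Collect all work objects for a customer under one account key."""
    grouped: Dict[str, List[Ticket]] = defaultdict(list)
    for ticket in tickets:
        grouped[ticket.account_id].append(ticket)
    return dict(grouped)


tickets = [
    Ticket("T-1", "acct-acme", None, "Cannot reach Jira"),
    Ticket("T-2", "acct-acme", None, "Confluence is down"),
    Ticket("T-3", "acct-globex", "site-9", "Billing question"),
]
by_account = group_by_account(tickets)
print(len(by_account["acct-acme"]))  # 2
```

Because grouping never touches site_id, the two tickets from the same customer land together even though their site reference is gone.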

Transparency: Communicate openly and transparently about the incident’s details, including estimated time to resolution (ETR) and any technical complexities. Avoid vague statements that can erode trust.

Prioritize Critical Services: During an outage, prioritize critical services and communicate the recovery plan clearly. Do not neglect critical tools like monitoring solutions.

Customer Impact: Understand that the scale of an incident amplifies its impact, even if only a small percentage of customers is affected.

Competitive Threat: Be prepared for competitors to seize the opportunity and offer incentives to affected customers. Ensure your post-incident response is as strong as your pre-incident marketing.

Transparency Builds Trust: Transparency is key to maintaining trust. Share detailed post-incident updates, action items, and their implementations with your technical audience.

API Design: Ensure your APIs have clear design principles, including unambiguous IDs and confirmation mechanisms to prevent critical errors.
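As a rough illustration of unambiguous IDs plus a confirmation step, the sketch below gives application IDs and site IDs distinct types with distinct prefixes, so that passing a site ID to an app-deletion call fails immediately. The "app-" and "site-" prefixes, the class names, and the delete_app function are assumptions made for this example and do not describe Atlassian’s actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppId:
    value: str

    def __post_init__(self) -> None:
        if not self.value.startswith("app-"):  # assumed prefix convention
            raise ValueError(f"not an application ID: {self.value!r}")


@dataclass(frozen=True)
class SiteId:
    value: str

    def __post_init__(self) -> None:
        if not self.value.startswith("site-"):  # assumed prefix convention
            raise ValueError(f"not a site ID: {self.value!r}")


def delete_app(app_id: AppId, confirm: str) -> str:
    """Schedule one app for deletion; the caller must retype the ID."""
    if not isinstance(app_id, AppId):
        raise TypeError("delete_app accepts only AppId values")
    if confirm != app_id.value:
        raise PermissionError("confirmation text does not match the ID")
    return f"app {app_id.value} scheduled for deletion"


print(delete_app(AppId("app-42"), confirm="app-42"))
# prints: app app-42 scheduled for deletion
```

With this shape, handing over the wrong kind of ID produces an error at the call site instead of a deleted customer site.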

Ownership: Clearly define ownership within your organization to prevent confusion and reduce the risk of errors.

Public Relations: Be mindful of timing when sharing critical information. Avoid prolonged periods of silence on social media during an incident.

Overselling: Avoid overselling your company’s capabilities, as it can lead to a loss of trust when reality falls short of expectations.

Back Up Everything: Maintain comprehensive backups of critical data and regularly test your recovery processes.

The Atlassian outage serves as a reminder that incidents can happen to any organization. How a company responds to these incidents, communicates with affected customers, and implements lessons learned can make a significant difference in preserving trust and reputation. By embracing transparency, robust incident management processes, and disaster recovery strategies, organizations can navigate challenging situations more effectively and minimize the impact on their customers and stakeholders.

Written by Arin
