Update on Azure Storage Service Interruption

Yesterday evening Pacific Standard Time, Azure storage services experienced a service interruption across the United States, Europe and parts of Asia, which impacted multiple cloud services in these regions.  I want to first sincerely apologize for the disruption this has caused.  We know our customers put their trust in us and we take that very seriously.  I want to provide some background on the issue that has occurred.
As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.
Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions. While services are generally back online, a limited subset of customers are still experiencing intermittent issues, and our engineering and support teams are actively engaged to help customers through this time.
When we have an incident like this, our main focus is rapid time to recovery for our customers, but we also work to closely examine what went wrong and ensure it never happens again. We will continually work to improve our customers’ experiences on our platform. We will update this blog with an RCA (root cause analysis) to ensure customers understand how we have addressed the issue and the improvements we will make going forward.
Update to Azure Customers:
We will continue to investigate what led to this event and will drive the needed improvements to avoid similar situations in the future. In the meantime, we believe it’s important to share a current understanding of the status and the gaps we’ve discovered.

Incident Information

Incident ID: 3071402
Incident Title: Microsoft Azure Service Incident: Connectivity to multiple Azure Services – Partial Service Interruption
Service(s) Impacted: Azure Storage, Virtual Machines, SQL Geo-Restore, SQL Import/Export, Websites, Azure Search, Azure Cache, Management Portal, Service Bus, Event Hubs, Visual Studio, Machine Learning, HDInsight, Automation, Virtual Network, Stream Analytics, Active Directory, StorSimple and Azure Backup Services
Incident Start Date and Time: 11/19/2014 12:51:00 AM (UTC)
Date and Time Service was Restored: 11/19/2014 11:45:00 AM (UTC)

Summary
On November 19, 2014, Azure Storage Services were intermittently unavailable across the regions listed in “Affected Regions” below. Microsoft Azure Services and customer services that depend on the affected Azure Storage Services were impacted as well. This includes the Service Health Dashboard and Management Portal. The interruption was due to a bug that was triggered when a configuration change was made in the Azure Storage Front End component, leaving the Blob Front-Ends unable to take traffic.
The configuration change had been introduced as part of an Azure Storage update to improve performance and reduce the CPU footprint for the Azure Table Front-Ends. This change had been deployed to some production clusters for the past few weeks and was performing as expected for the Table Front-Ends.
As part of a plan to improve performance of the Azure Storage Service, the decision was made to push the configuration change to the entire production service.
When applied to the Blob Front-Ends, the configuration change, which had been performing as expected for the Table Front-Ends, exposed a bug in the Blob Front-Ends. This bug caused the Blob Front-Ends to go into an infinite loop, leaving them unable to take traffic.
Unfortunately, the issue was widespread because, due to operational error, the update was made across most regions in a short period of time instead of following the standard protocol of applying production changes in incremental batches.
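The "incremental batches" protocol mentioned above can be sketched roughly as follows. This is only an illustration of the general idea, not the Azure deployment tooling: the function names (`in_batches`, `staged_rollout`) and the `apply_change`, `is_healthy` and `roll_back` callbacks are hypothetical, and the batch size and health check are assumptions.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def in_batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def staged_rollout(
    regions: Sequence[str],
    batch_size: int,
    apply_change: Callable[[str], None],   # hypothetical: push the change to one region
    is_healthy: Callable[[str], bool],     # hypothetical: health probe for one region
    roll_back: Callable[[str], None],      # hypothetical: revert the change in one region
) -> Tuple[bool, List[str]]:
    """Apply a change one batch at a time, stopping at the first unhealthy batch.

    Returns (succeeded, regions_touched). On failure, every region touched
    so far is rolled back and later batches are never modified.
    """
    touched: List[str] = []
    for batch in in_batches(regions, batch_size):
        for region in batch:
            apply_change(region)
            touched.append(region)
        # Gate: do not move to the next batch unless this one is healthy.
        if not all(is_healthy(region) for region in batch):
            for region in reversed(touched):
                roll_back(region)
            return False, touched
    return True, touched
```

The point of the gate is that a bug which only shows up in production stops at the first batch that exhibits it, so the remaining regions keep serving traffic on the old configuration.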
Once the issue was detected, the configuration change was reverted promptly. However, the Blob Front-Ends had entered an infinite loop triggered by the update and couldn’t refresh the configuration without a restart. This caused the recovery to take longer. The Azure team investigated the mitigation steps and validated them. Once the mitigation steps were deployed, most customers started seeing availability improve across regions at 11/19/2014 11:45:00 AM UTC. A subset of customers who were using IaaS Virtual Machines (VMs) reported being unable to connect to their VMs, including through Remote Desktop Protocol (RDP) and SSH.
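Why a revert alone did not restore service can be illustrated with a toy model. The sketch below is not the Azure Storage code; the `FrontEnd` class, the `perf_update` flag and the idea that configuration is only re-read between requests are all assumptions made for illustration.

```python
class FrontEnd:
    """Toy front end that re-reads shared configuration between requests."""

    def __init__(self, config_store: dict) -> None:
        self.config_store = config_store  # shared, externally updated config
        self.stuck = False                # models being trapped in the loop

    def handle_request(self) -> bool:
        """Return True if a request was served, False otherwise."""
        if self.stuck:
            # Trapped in the loop: the refresh point below is never reached,
            # so a reverted configuration is never observed.
            return False
        active = dict(self.config_store)  # refresh point
        if active.get("perf_update"):
            # Hypothetical buggy code path enabled by the new configuration.
            self.stuck = True
            return False
        return True

    def restart(self) -> None:
        """A restart clears the stuck state; config is re-read on the next request."""
        self.stuck = False


store = {"perf_update": False}
fe = FrontEnd(store)
fe.handle_request()          # served under the old configuration
store["perf_update"] = True  # configuration change is pushed
fe.handle_request()          # front end enters the loop
store["perf_update"] = False # change is reverted
fe.handle_request()          # still not serving: the revert is never seen
fe.restart()
fe.handle_request()          # serving again after the restart
```

In this model the revert fixes the stored configuration immediately, but each running process has to be restarted before it picks the fix up, which is consistent with the longer recovery described above.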
Customer Impact
Customers using Azure Storage Services may have experienced timeouts or connectivity issues while attempting to connect to Storage Blobs, Tables and Queues. Azure Services that have a dependency on Storage Services were affected as well. Customers may have experienced unavailability issues with the affected services as well as with IaaS Virtual Machines (VMs).
Communications and support
There was an Azure infrastructure issue that impacted our ability to provide timely updates via the Service Health Dashboard. As a mitigation, we leveraged Twitter and other social media forums. We also provided targeted communication to the Management Portal for affected customers as available. As a result of the impact to the Service Health Dashboard, timely updates were not reflected for approximately the first three hours of the outage.
Downstream support tools that depend on the Service Health Dashboard and Management Portal were also impacted, limiting customers’ ability to create new support cases during the early phase of the outage. Our ability to update impacted customers was also delayed due to high case volumes.
Affected Regions
A subset of customers in the following regions were affected.

Regions
· Central US
· East US
· East US 2
· North Central US
· South Central US
· West US
· North Europe
· West Europe
· East Asia
· Southeast Asia
· Japan East
· Japan West

Root Cause
A bug in the Blob Front-Ends was exposed by the configuration change made as part of the performance improvement update, which caused the Blob Front-Ends to go into an infinite loop.
Next Steps
We are taking steps to improve the Microsoft Azure Platform and our processes to ensure such incidents do not occur in the future. In this case, these include (but are not limited to):
· Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches, so that it is always followed.
· Improve the recovery methods to minimize the time to recovery.
· Fix the infinite loop bug in the CPU reduction improvement for the Blob Front-Ends before it is rolled out into production.
· Improve Service Health Dashboard Infrastructure and protocols.
We apologize for the impact and inconvenience this has caused.
Regards,
The Microsoft Azure Team

http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/