
Azure status history

This page contains root cause analyses (RCAs) of previous service issues, each retained for 5 years. From 20 November 2019, this included RCAs for all issues we communicated about publicly. From 1 June 2022, this includes RCAs for broad issues as described in our documentation.


September 2022

7 September

Post Incident Review (PIR) - Azure Front Door - Connectivity Issues (Tracking ID YV8C-DT0)

What happened? 

Between 16:10 and 19:55 UTC on 07 Sep 2022, a subset of customers using Azure Front Door (AFD) experienced intermittent availability drops, connection timeouts, and increased latency. At its peak, this impacted approximately 25% of the traffic traversing the AFD service, and on average 10% of that traffic, during the impact window. Some customers may have seen higher failure rates if their traffic was concentrated in the edge sites or regions with higher impact. This could also have impacted customers’ ability to access other Azure services that leverage AFD, in particular the Azure management portal and Azure Content Delivery Network (CDN).

What went wrong and why? 

The AFD platform automatically balances traffic across our global network of edge sites. When there is a failure in any of our edge sites, or an edge site becomes overloaded, traffic is automatically moved to healthy edge sites in other regions where we have fallback capacity. Because of this design, customers and end users typically don’t experience issues in the case of localized or regional impact. In addition, every node has built-in protections that shield our platform from unusual traffic spikes for each domain hosted on AFD.

Between 15:15 and 16:44 UTC, we observed three unusual traffic spikes for one of the domains hosted on AFD.

  • The first two spikes for this domain occurred at 15:15 UTC and 16:00 UTC on 07 September 2022 and were fully mitigated by the AFD platform. A third spike, between 16:10 and 16:44 UTC, caused a subset of the environments managing this traffic to go offline.
  • The first two spikes were successfully absorbed by the platform protection mechanisms; however, the mitigations initiated during the third spike did not fully absorb the unexpected increase, due to the nature of its traffic pattern (different from the first two spikes). At this stage in our investigation, we believe that all three traffic spikes were malicious HTTPS flood attacks (Layer 7 volumetric DDoS attacks).
  • The malicious traffic spikes did not originate from a single region; we found that they were coming from all around the world. A combination of the malicious traffic (third spike), a large ramp-up of legitimate traffic for other customers, and degraded customer origins overwhelmed the resources of a few environments, taking them offline and resulting in a 25% drop in overall availability during the third traffic spike.
  • By design, these environments automatically recover and resume taking traffic once healthy. During this incident, users and our systems retried the failed requests, resulting in a larger build-up of requests. This build-up did not allow a subset of the environments time to recover fully, resulting in a subsequent 8% availability drop for more than 3.5 hours following the traffic spike (see the retry back-off sketch after this list).
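
The retry build-up described in the last bullet is a common amplifier during recovery: many clients retrying at once can keep a recovering service saturated. Purely as an illustrative sketch (not AFD's or Azure's code; the callable in the usage comment is a placeholder), a client-side retry helper with exponential backoff and full jitter spreads retries out so they do not arrive in synchronized waves:

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call request_fn, retrying failures with capped exponential backoff.

    Full jitter spreads retries out in time so that a fleet of clients does
    not hammer a recovering endpoint in synchronized waves - the kind of
    request build-up described in this PIR.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))  # exponential, capped
            time.sleep(random.uniform(0, delay))                 # full jitter

# Usage (placeholder callable): call_with_backoff(lambda: my_http_get("https://example.azurefd.net/"))
```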

How did we respond? 

We have automatic protection mechanisms for such events, which mitigate approximately 2,000 DDoS attacks per day; the record number we have mitigated in a single day is 4,296. (More information can be found here: ). In addition, the AFD platform has built-in DDoS protection mechanisms on each node, at both the system and the application layer, which provide further mitigation in such cases. In this instance, these mechanisms significantly helped to absorb the first two spikes without any customer impact.

During the third spike, the platform protection mechanisms were partially effective, mitigating around 40% of the traffic, which significantly helped to limit the global impact. For a longer duration, 8.5% of the overall AFD service, concentrated in some regions, was impacted by this issue. Some customers may have seen higher failure rates if their traffic was concentrated predominantly in the North America, Europe, or APAC regions.

When our telemetry alerted us to the impact on availability, we intervened manually. We first took manual action to further block the attack traffic. In addition, we expedited the AFD load balancing process, which enabled the auto-recovery systems to work as designed by ensuring the most efficient load distribution in regions with a large build-up of traffic. Once the environments recovered, we gradually brought AFD instances back online to resume normal traffic management. We were 100% recovered globally by 19:55 UTC.

How are we making incidents like this less likely or less impactful?

Although the AFD platform has built-in resiliency and capacity, we must continuously strive to improve through lessons learned. We had a few previously planned repair items in flight, either partially deployed or staged for deployment. We believe these repair items would have mitigated the third malicious traffic spike had they been in place before 07 September. We are now expediting these repair items, which were scheduled for later this year, and they should be completed in the next few weeks. These include:

  • Effectively tuning the protection mechanisms in the AFD nodes to mitigate the impact of this class of traffic patterns in future. (Estimated completion September 2022)
  • Addressing issues identified in the current platform environment recovery process. This will reduce time to recover for each environment and will prevent environments from becoming overloaded. (Estimated completion September 2022)
  • Tooling to trigger ‘per customer’ failover until we have fully automated the traffic shifting mechanisms. This work is completed.
  • Improvements to the dynamic rate limiting algorithm to ensure fairness to legitimate traffic (see the sketch after this list). (Estimated completion October 2022)
  • Improvements to the existing proactive automatic communication process, to notify customers more quickly. (Estimated completion October 2022)
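
The PIR does not describe the internals of AFD's protection or rate-limiting mechanisms. Purely as a hedged sketch of the "per-domain fairness" idea referenced in the list above (all names, rates, and limits are illustrative assumptions), a per-domain token bucket keeps one domain's spike from starving the others:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-domain token bucket: a spike against one domain exhausts only that
    domain's tokens instead of starving traffic for every other domain."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Illustrative limits only - one bucket per hosted domain.
buckets = defaultdict(lambda: TokenBucket(rate_per_sec=1000, burst=2000))

def admit(domain: str) -> bool:
    """Return True if a request for this domain should be served, False if throttled."""
    return buckets[domain].allow()
```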

How can we make our incident communications more useful?

Microsoft is piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey:

7 September

Post Incident Review (PIR) - Azure Cosmos DB - North Europe (Tracking ID 3TPC-DT8)

What happened?

Between 09:50 UTC and 17:21 UTC on 07 Sep 2022, a subset of customers using Azure Cosmos DB in North Europe may have experienced issues accessing services. Connections to Cosmos DB accounts in this region may have resulted in an error or timeout.

Downstream Azure services that rely on Cosmos DB also experienced impact during this window - including Azure Communication Services, Azure Data Factory, Azure Digital Twins, Azure Event Grid, Azure IoT Hub, Azure Red Hat OpenShift, Azure Remote Rendering, Azure Resource Mover, Azure Rights Management, Azure Spatial Anchors, Azure Synapse, and Microsoft Purview.

What went wrong and why?

Cosmos DB load balances workloads across its infrastructure, within frontend and backend clusters. Our frontend load balancing procedure had a regression that did not factor in the reduction in available cluster capacity caused by ongoing maintenance. This surfaced during a platform maintenance event in one of the frontend clusters in the North Europe region, causing the availability issues described above.
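
The PIR does not publish the load-balancing code; as a hedged, purely illustrative sketch of the corrected behaviour (all names and thresholds are assumptions), the key idea is to compute headroom against the capacity actually available during maintenance rather than the nominal capacity:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    nominal_capacity: int   # capacity when the cluster is fully healthy
    in_maintenance: int     # capacity temporarily removed by ongoing maintenance
    current_load: int

    @property
    def effective_capacity(self) -> int:
        # The essence of the fix: plan against what is actually available now,
        # not against the cluster's nominal size.
        return self.nominal_capacity - self.in_maintenance

def can_accept(cluster: Cluster, extra_load: int, utilization_cap: float = 0.8) -> bool:
    return cluster.current_load + extra_load <= utilization_cap * cluster.effective_capacity

def pick_target(clusters, extra_load):
    """Choose the cluster with the most remaining headroom, or None if none can accept the load."""
    candidates = [c for c in clusters if can_accept(c, extra_load)]
    return max(candidates, key=lambda c: c.effective_capacity - c.current_load, default=None)
```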

How did we respond?

Our monitors alerted us to the impact on this cluster. We ran two workstreams in parallel – one focused on identifying the underlying cause of the issues, the other on mitigating the customer impact. To mitigate, we load balanced off the impacted cluster by moving customer accounts to healthy clusters within the region.

Given the volume of accounts we had to migrate, it took us time to safely load balance accounts – we had to analyze the state of each account individually, then systematically move each to an alternative healthy cluster in North Europe. This load balancing operation allowed the cluster to recover to a healthy operating state.

Although we have the ability to mark a Cosmos DB region as offline (which would trigger automatic failover activities, for customers using multiple regions) we decided not to do that during this incident – as the majority of the clusters (and therefore customers) in the region were unimpacted.

How are we making incidents like this less likely or less impactful?

Already completed:

  • Fixed the regression in our load balancer procedure, to safely factor in capacity fluctuations during maintenance.

In progress:

  • Improving our monitoring and alerting to detect these issues earlier and apply pre-emptive actions. (Estimated completion: October 2022)
  • Improving our processes to reduce the impact time with a more structured manual load balancing sequence during incidents. (Estimated completion: October 2022)

How can customers make incidents like this less impactful?

Consider configuring your accounts to be globally distributed – enabling multi-region for your critical accounts would allow for a customer-initiated failover during regional service incidents like this one. For more details, refer to:
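
As a hedged example of the multi-region guidance above (assuming the azure-cosmos Python SDK and its preferred_locations option; the endpoint, key, database, and container names are placeholders, not values from this incident):

```python
from azure.cosmos import CosmosClient

# Placeholders - substitute your own account endpoint, key, and names.
ENDPOINT = "https://<your-account>.documents.azure.com:443/"
KEY = "<your-account-key>"

# For a globally distributed account, preferred_locations tells the SDK which
# regions to try first; if North Europe is unavailable, requests fall back to
# the next region in the list.
client = CosmosClient(
    ENDPOINT,
    credential=KEY,
    preferred_locations=["North Europe", "West Europe"],
)

database = client.get_database_client("appdb")       # placeholder database name
container = database.get_container_client("orders")  # placeholder container name
item = container.read_item(item="order-1", partition_key="order-1")
```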

More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:

Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:

How can we make our incident communications more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:

August 2022

30 August

Post Incident Review (PIR) - Canonical Ubuntu issue impacted VMs and AKS (Tracking ID 2TWN-VT0)

What happened?

Between 06:00 UTC on 30 Aug 2022 and 16:00 UTC on 31 Aug 2022, customers running Ubuntu 18.04 (bionic) Virtual Machines (VMs) who had Ubuntu Unattended-Upgrades enabled received a systemd version that resulted in Domain Name System (DNS) resolution errors. This issue was confined to Ubuntu version 18.04, but impacted all Azure regions including public and sovereign clouds.

Downstream Azure services that rely on impacted Ubuntu VMs also experienced impact during this window – including Azure Kubernetes Service (AKS), Azure Monitor, Application Insights, Log Analytics and Microsoft Sentinel. AKS customers could have experienced pod creation errors such as ImagePullBackOff, as the kubelet was unable to resolve the DNS name of the container registry. Customers may have experienced an inability to access Azure Monitor, Application Insights, Log Analytics, and/or Microsoft Sentinel log data, and may have noticed missed or delayed Log Search alerts and/or Microsoft Sentinel alerts.
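
For the AKS symptom described above, the sketch below shows one way to enumerate pods stuck pulling images, using the official Kubernetes Python client. It assumes cluster credentials are already configured locally; it is illustrative, not an official remediation tool:

```python
from kubernetes import client, config

# Assumes AKS credentials are already present in the local kubeconfig.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    for status in (pod.status.container_statuses or []):
        waiting = status.state.waiting
        if waiting and waiting.reason in ("ImagePullBackOff", "ErrImagePull"):
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {waiting.reason} - {waiting.message}")
```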

What went wrong, and why?

At 06:00 UTC on 30 August 2022, a Canonical Ubuntu security update was published – so Azure VMs running Ubuntu 18.04 (bionic) with unattended-upgrades enabled started to download and install the new packages, including systemd version 237-3ubuntu10.54. This led to a loss of their DNS configurations, due to a race-condition bug: .

The bug manifested due to the combination of this update and a previous one. It only affects systems that use a driver name to identify the proper Network Interface Card (NIC) in their network configuration, which is why this issue impacted Azure uniquely and not other major cloud providers. This resulted in DNS resolution failures and network connectivity issues for Azure VMs running Ubuntu 18.04 (bionic). As a result, other services dependent on these VMs were impacted by the same DNS resolution issues.
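
To check whether a given Ubuntu 18.04 VM had picked up the affected build, one could compare the installed systemd version against the one named in this PIR. A minimal, illustrative sketch (not an official Azure or Canonical tool):

```python
import subprocess

AFFECTED = "237-3ubuntu10.54"  # systemd build named in this PIR

def installed_systemd_version() -> str:
    # dpkg-query prints the currently installed version of the systemd package.
    result = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Version}", "systemd"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    version = installed_systemd_version()
    if version == AFFECTED:
        print(f"systemd {version} matches the affected build - see the mitigation guidance below.")
    else:
        print(f"systemd {version} is not the affected build.")
```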

When unattended-upgrades are enabled, security updates are automatically downloaded and applied once per day by default. Considering their criticality, security updates like these do not go through our Safe Deployment Practices (SDP) process. However, we are reviewing this process to ensure that we minimize customer impact during incidents like these.

How did we respond?

Multiple Azure teams detected the issue shortly after the packages were published via production alerts, including our AKS and Azure Container Apps service teams. Upon investigation, we identified the root cause as the bug in Ubuntu mentioned above, and began engaging other teams to explore appropriate mitigations. During this time, incoming customer support cases describing the issue validated that the issues were limited to the Ubuntu versions described above.

There were multiple mitigation and remediation steps, several of which were completed in partnership with Canonical / Ubuntu:

  • This bug and a potential fix have been highlighted on the Canonical / Ubuntu website, which we encouraged impacted customers to read (linked above).
  • For impacted Azure VM instances, we recommended that customers reboot the VM(s) or, if reboot was not an option, run a script to fix it. Azure provided a template script but encouraged customers to test and modify the script as needed before applying: .
  • For the impact to AKS nodes, our AKS team developed an automatic detection and remediation solution, then rolled this out across all regions. This resolved the issue for the vast majority of customers; others required manual mitigation through support.
  • Additional downstream impact to other services was addressed through similar remediation steps for the bug in their specific Ubuntu versions, so some of these services recovered prior to the final mitigation time above.

How are we making incidents like this less likely or less impactful?

Already completed:

  • Improved monitoring of AKS data plane for alerts on the uptick of errors.
  • Reviewed AKS monitoring algorithms to help ensure detection and alerting upon similar VM errors that were experienced in this scenario.
  • Established an improved escalation path to Canonical during outages and included it in our internal Technical Service Guides (TSGs).

Short term:

  • AKS will take full control over the security patch mechanism, versus shared control with Canonical today. This includes additional testing and controlled release of these patches done directly by AKS. (Estimated completion: December 2022).

Medium term:

  • For IaaS VMs, we are working to engage with Canonical to run dedicated tests on proposed packages before they are published for Azure users.

Longer-term:

  • AKS will establish a process with Canonical to close the testing gap for the upgrade scenario. (Estimated completion: March 2023).
  • AKS will provide customers with maintenance window control and environment staging for these patches when they are deemed safe to release. (Estimated completion: December 2023).

How can our customers and partners make incidents like this less impactful?

How can we make our incident communications more useful?

Microsoft is piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey:

27 August

Post Incident Review (PIR) - Datacenter power event - West US 2 (Tracking ID MMXN-RZ0)

What happened?

Between 02:47 UTC on 27 Aug 2022 and 02:00 UTC on 28 Aug 2022, a subset of customers experienced failures when trying to access resources hosted in the West US 2 region. Although initially triggered by a utility power outage that affected all of our datacenters in the region, the vast majority of our backup power systems performed as designed to prevent impact. Failures of a small number of backup power systems led to customer impact in two datacenters. Most customers were recovered by 07:00 UTC on 27 Aug 2022, but small subsets of resources required manual recovery – with the final set being brought back online by 02:00 UTC on 28 Aug 2022.

During this impact window, several downstream Azure services that were dependent on impacted infrastructure also experienced issues – including Storage, Virtual Machines, App Services, Application Insights, Azure Database for PostgreSQL, Azure Red Hat OpenShift, Azure Search, Azure SQL DB, Backup, Data Explorer, ExpressRoute, and NetApp Files. 

What went wrong, and why?

On August 27 at 02:47 UTC, we identified a power event that caused impact to a number of storage and compute scale units in the West US 2 region. The West US 2 region is made up of 10+ datacenters, spread across three Availability Zones on multiple campuses. During this event, the whole region experienced a utility power outage, impacting all datacenters in the region. A failure on major distribution lines caused at least two substations to lose power. That resulted in loss of utility power across a broad area that included all three Availability Zones in the West US 2 region.

In all datacenters except two, our backup power systems performed as designed, transitioning all infrastructure to run briefly on batteries and then on generator power. But in two separate datacenters, two unique but unrelated issues occurred that prevented some of the servers in each datacenter from transitioning to generator power. Since these two datacenters were in two different Availability Zones, customers may have been impacted by both.

In the first datacenter, impact was caused when a small number of server rack level Uninterruptible Power Supply (RUPS) systems failed to stay online during the transition to generator, creating a momentary loss of power to the servers. These servers were immediately re-energized once backup generators started and supported the load.

In the second datacenter, several Primary UPS systems (approximately 12% of the total UPS systems in the datacenter) failed to support the load during the transition to generator, due to UPS battery failures. As a result, the downstream servers lost power until the UPS faults could be cleared and the UPS systems brought back online on utility supply.

The initial trigger to this event was when a high voltage static wire (used to help protect transmission lines against lightning strikes) failed. When the static wire failed, it created a voltage surge on the 230kV lines, causing breakers at two substations (approximately 30 miles apart) within the utility power grid to open. The root cause of the static wire failures is still under investigation by the utility provider. 

How did we respond?

This event was first detected by our EPMS (Electrical Power Monitoring System) in West US 2, which in turn notified our datacenter team of the utility loss issue, and then of equipment failure issues. While the vast majority of datacenters transitioned to backup power without issue, two specific datacenters experienced different UPS issues described above that prevented a full transition to backup power sources.

Due to the nature of this event, the team followed our Emergency Operations Procedure (EOP) to manually restore Mechanical, Electrical, Plumbing (MEP) equipment to its operational state. Once the MEP was returned to an operational state, the racks began to recover. The Public Utility Department (PUD) was able to close their breakers and restore utility power to our datacenters by 03:48 UTC. This enabled the datacenter teams to begin the recovery of the affected equipment and restoration of power to the impacted racks. By 04:46 UTC, power was fully restored to all affected racks, and services continued their recovery.

Four Azure Storage scale units were impacted by the power loss (one Standard, two Premium, one Ultra Disk scale unit), resulting in the data hosted on these becoming inaccessible until power was restored and the scale units recovered to healthy states. The Standard Storage scale unit was fully available by 07:45 UTC, although the vast majority of clients would have seen availability restored by 06:05 UTC. The two Premium Storage scale units were restored by 05:10 UTC. Due to a software bug (the fix for which is already in our deployment pipeline), a small subset of disk requests (<0.5%) may have encountered further errors through 07:30 UTC. Due to a combination of hardware failures and software bugs, the Ultra Disk scale unit was not fully available until 21:40 UTC on 28 Aug. The majority of the data (>99.9%) was available by 05:15 UTC on 27 Aug.

Impacted Azure compute scale sets were brought back online – mostly automatically after storage recovered, but a subset of infrastructure and customer VMs required manual mitigations to ensure they came back online successfully. VMs that were using the Trusted Launch feature, in particular, did not automatically recover and required engineering team intervention to restore – all of these VMs were restored to a functional state by 00:20 UTC on 28 Aug.

How are we making incidents like this less likely or less impactful?

Already completed:

  • We have completed detailed inspections on our generator systems and all generators are in good operating condition. 
  • UPS systems have been inspected and all components are operating and functioning per design specifications. This inspection highlighted that battery replacement is required.
  • After reviewing the entire lineup, we have replaced the batteries in the UPS units that experienced failures.

Short term:

  • Complete deployment of the Storage software fix for the bug that caused a small tail of errors following Premium Storage scale unit recovery.
  • Complete deployment of the software fix for the trusted VM feature, which caused some VMs not to come back online automatically after storage recovery.

Longer term:

  • We are working on several platform, telemetry, and process improvements that will reduce Ultra Disk replica recovery time.
  • We are improving our VM migration times to healthy hosts for faster recovery. This includes telemetry improvements to detect long-running or stuck migration operations, so that issues can be identified more quickly.
  • Microsoft uses multiple equipment vendors and designs – findings will be reviewed against our global fleet and, where necessary, applied beyond the impacted datacenters.
  • Process failure mode effects analysis (PFMEA) review of the processes utilized during the event, applying lessons learned and improving our methodology. This includes assessing human touch points, working to engineer out or automate systems for smoother transition or recovery. Findings will also be applied to our Tabletop/GameDay exercises, ensuring team members are familiar and prepared to respond.

How can customers and partners make incidents like this less impactful?

  • While Availability Zones are designed to reduce correlated failures, they can still occur. We encourage customer and partner Business Continuity & Disaster Recovery (BCDR) plans to include the ability to fail over between regions in case of a region-wide incident. While the likelihood of a failure decreases as its magnitude increases, it never goes to zero:
  • Consider which Storage redundancy options are right for your critical applications. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable:
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more:

How can we make our incident communications more useful?

We are piloting this “PIR” template as a potential replacement for our “RCA” (Root Cause Analysis) template.

You can rate this PIR and provide any feedback using our quick 3-question survey: 

18 August

Post Incident Review (PIR) - Azure Key Vault - Provisioning Failures (Tracking ID YLBJ-790)

What happened?

Between 16:30 UTC on 18 Aug 2022 and 02:22 UTC on 19 Aug 2022, a platform issue caused Azure offerings such as Bastion, ExpressRoute, Azure Container Apps, Azure ML, Azure Managed HSM, Azure Confidential VMs, and Azure Database Services (MySQL – Flexible Server, PostgreSQL – Flexible Server, PostgreSQL – Hyperscale) to experience provisioning failures globally. This issue impacted customers that relied on the provisioning of certificates as part of provisioning an Azure resource. This write-up is a Post Incident Review (PIR) we are providing to summarize what went wrong, how we responded, and the steps Microsoft is taking to learn from this and improve.

What went wrong, and why?

The requesting authority for Azure Key Vault (the underlying platform on which all the described services rely for the creation of certificate resources) was experiencing high latency and a high volume of requests. This resulted in provisioning failures for the impacted services, as those services were not able to acquire certificates within the expected time. During the incident, a backend service that Azure Key Vault relies on became unhealthy due to an unexpected spike in traffic during scheduled hardware maintenance, which caused a build-up of requests in the queue and high latencies when fulfilling new certificate creation requests.

How did Microsoft respond?

We developed and deployed a hotfix to increase the throughput, created new queues for request processing, and drained the queue of accumulated requests to alleviate the overall latency and process requests as expected.

How is Microsoft making incidents like this less likely, or at least less impactful?

  • In the short term, we are implementing request caps and partitioning the request queues, to help prevent lasting failures in the service in similar scenarios (see the sketch after this list).
  • We are also reviewing the backend capacity and the gaps in the maintenance process that led to the loss of availability during this maintenance operation.
  • Based on our learnings from this incident, we are implementing improvements to our health monitoring and operational guidance that will help reduce the time to detect similar issues, and allow us to address them before customers experience impact.
  • In the longer term, we are working to add fine-grained distributed throttling and partitioning, adding additional isolation layers to the backend of this service, which will minimize impact in similar scenarios.
  • Finally, we will work to add more Availability Zones and fault domains in all layers of the stack, along with automatic failover to the service, to help prevent disruption to customer workloads.
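
The PIR does not describe the service's internal queueing design; the following is only a generic sketch of the "request caps plus partitioned queues" idea from the list above, with illustrative names and limits:

```python
import queue

NUM_PARTITIONS = 8        # illustrative
PER_PARTITION_CAP = 500   # illustrative request cap per partition

# One bounded queue per partition: a flood of requests for one key space can
# only fill its own partition instead of backing up every caller's requests.
partitions = [queue.Queue(maxsize=PER_PARTITION_CAP) for _ in range(NUM_PARTITIONS)]

class Rejected(Exception):
    """Raised when a partition is at its cap; callers should back off and retry."""

def enqueue(request_key: str, request) -> None:
    q = partitions[hash(request_key) % NUM_PARTITIONS]
    try:
        q.put_nowait(request)
    except queue.Full:
        raise Rejected(f"partition for {request_key!r} is at capacity")
```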

How can we make our incident communication more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:

12 August

Post Incident Review (PIR) - Azure Communication Services - Multiple Regions (Tracking ID YTYN-5T8)

What happened?

Between 18:13 UTC on 12 Aug 2022 and 03:30 UTC on 13 Aug 2022, customers using Azure Communication Services (ACS) may have experienced authentication failures, or failures using our APIs. As a result, multiple scenarios may have been impacted including SMS, Chat, Email, Voice & Video scenarios, Phone Number Management, and Teams-ACS Interop.

What went wrong, and why?

An Azure resource provider gives customers the ability to create and maintain resources – in this case, for ACS. The ACS resource provider uses backend Cosmos DB instances for resource metadata persistence. Prior to the incident, an increased volume of data-plane-related requests was made by the resource provider to the database, which reached the database throughput limits. At 18:00 UTC on 12 Aug 2022, the database processing capacity was increased to meet the increased demand. This change in database capacity inadvertently exposed a latent code bug in the resource provider, which resulted in a mismatch between the number of database results being returned and what the resource provider could process. ACS is a globally distributed service, and the metadata being retrieved was required for routing calls across different regions during the authentication process. This resulted in ACS authentication failures, and subsequently caused SMS, Chat, Voice & Video, Phone Number Management, and Teams-ACS Interop scenarios to fail.
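
The PIR does not detail the exact latent bug, only that the resource provider could not process the full set of results the database returned. As a generic illustration of this class of issue (not ACS's actual code), the defensive pattern is to drain every page of a paged query rather than assuming the results fit in a single page; the query_page callable below is hypothetical:

```python
def fetch_all(query_page, query, page_size=100):
    """Drain every page of a paged query.

    query_page is a hypothetical callable taking (query, continuation, page_size)
    and returning (items, continuation), where continuation is None on the last
    page. Processing only the first page is the kind of latent bug that surfaces
    when capacity or page-size behaviour changes underneath the caller.
    """
    items, continuation = [], None
    while True:
        page, continuation = query_page(query, continuation, page_size)
        items.extend(page)
        if continuation is None:
            return items
```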

How did we respond?

Automated alerting indicated several failures for different ACS API requests made by customers. We immediately investigated with multiple engineering teams; however, understanding the nature of the issue took time because specific fields used for debugging Cosmos DB issues were not being logged for successful queries. Due to the service configuration, a rollback of the change to the database instance would not have been supported. Once the underlying issue was identified, we developed code fixes to resolve it, then validated and deployed the fix in phases using our Safe Deployment Practices. The hotfix was fully rolled out at 03:30 UTC on 13 Aug 2022, with customers reporting successful operation shortly thereafter.

How are we making incidents like this less likely or less impactful?

Completed:

  • We've completed code updates to address the latent bug and help ensure the resource provider can process all results in similar scenarios.
  • We've added additional logging of backend database requests for the ACS resource provider, to ensure improved traceability in future.
  • We have added additional gates for database configuration updates and hardened our processes for applying such updates. We have mirrored all production configuration templates in our pre-production environment to allow validation of configuration updates before they are deployed to production.
  • We have completed additional Failure Mode Analyses (FMA) across different ACS features. We have created repair items for a resilient service architecture to improve failure recovery time.

How can we make our incident communications more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey: 

July 2022

29 July

Post Incident Review (PIR) – Network Connectivity Issues (Tracking ID 7SHM-P88)

What happened?

Between 08:00 UTC and 13:20 UTC on 29 July 2022, customers may have experienced connectivity issues such as network drops, latency, and/or degradation when attempting to access or manage Azure resources in multiple regions.

The most significant impact would have been experienced in the following regions – Brazil South, Canada Central, East Asia, East US, East US 2, France Central, Japan East, Korea Central, North Central US, South Africa North, South Central US, Southeast Asia, West Europe, and West US. Customers in other regions may have seen an intermittent impact when accessing resources across the Microsoft wide area network (WAN).

What went wrong, and why?

Starting at 08:00 UTC on 29 July, the Azure WAN began to experience a sudden and significant increase in traffic – upwards of 60 Tbps of additional traffic compared to the normal levels carried on the network.

While the event was detected immediately and automated remediation was triggered, the substantially increased bursts of traffic occurring throughout the event affected the ability of the automated mitigations to continue providing the necessary relief to the network. WAN routers then became overloaded and dropped network packets, which resulted in the network connectivity issues experienced by some customers.

This event included impact to both intra-region and cross-region traffic over various network paths, which included ExpressRoute.

Our investigation of this event continues, ensuring our diagnosis of contributing factors is complete and mitigations for this class of incident are finalized. The remaining workstreams are expected to be finished within two months. We will update the status page and Azure Service Health when completed.

How did we respond?

We have several detection and mitigation algorithms that were triggered automatically around 08:00 UTC when an increased burst of traffic occurred. The volume of the traffic surges continued to increase substantially, reaching 10-15 times greater than any traffic volume previously experienced on the network.

While the mitigation mechanisms were successfully triggered to load balance and throttle the traffic bursts to help prevent impact, the sheer volume of traffic on the WAN routers caused these mechanisms to take longer to alleviate the traffic surges and restore traffic to normal levels.

By 13:20 UTC, traffic levels returned to normal as network telemetry confirmed packet drops had reduced to standard levels, which is when customers would have seen resource and service network health restored.

How are we making incidents like this less likely or less impactful?

We are implementing service repairs as a result of this incident, including but not limited to:

Already Completed: 

  • Additional alerting for the specific packet-drop signatures caused by significant traffic bursts.

Work in Progress:

  • Improvements to network device capabilities to help reduce packet drops when handling significant traffic bursts.
  • Changes to the network design for traffic spike detection to help reduce the time to mitigate for similar events.
  • Improvements to the network incident response playbook to better streamline preventative actions performed for similar events.
  • Additional layers of network throttling to help protect network reliability when increased traffic surges occur.

How did we communicate with impacted customers?

Starting around 11:00 UTC, we began to receive some reports of a potential emerging issue. As signals continued to gradually increase, we posted an initial statement to the Azure status page at 11:52 UTC.

Delays in communications via Service Health in the portal were primarily due to challenges in gauging the extent of impact and the affected regions, as the limited telemetry of the developing networking event did not clearly indicate a viable scope of impact. Though other signals from internal and external reports indicated that a platform event was likely in progress, the disparity of signals delayed targeted notifications until a broad networking issue was determined.

Communications were sent via Azure Service Health for Azure services that started to report impact, which were later determined to be affected by the networking event. With further analysis and evidence of regional impact confirmed, broad targeted communications were sent region-wide to customers in the identified affected regions by 13:28 UTC.

Between 13:28 and 15:32 UTC, communications were sent to customers in additional impacted regions as they were identified.

By 15:32 UTC, we began reporting recovery via the status page and Service Health, though monitoring and preventative workstreams persisted; we continued to report on these until they were completed by 19:52 UTC.

How can we make our incident communications more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:

21 July

Post Incident Review (PIR) - SQL Database - West Europe (Tracking ID 3TBL-PD8)

What happened?

Between 03:47 UTC and 13:30 UTC on 21 Jul 2022, customers using SQL Database and SQL Data Warehouse in West Europe may have experienced issues accessing services. During this time, new connections to databases in this region may have resulted in errors or timeouts. Existing connections would have remained available to accept new requests, however if those connections were terminated and then re-established, they may have failed.

New connections to the region and related management operations began failing from 03:47 UTC; partial recovery began at 06:12 UTC, with full mitigation at 13:30 UTC. Although we did not initially declare mitigation until 18:45 UTC, a thorough impact analysis confirms that failure rates had returned to pre-incident levels earlier. No failures that occurred after 13:30 UTC were a direct result of this incident.

During this impact window, several downstream Azure services that were dependent on the SQL Database service in the region were also impacted - including App Services, Automation, Backup, Data Factory V2, and Digital Twins.

Customers that had configured active geo-replication and failover groups would have been able to recover by performing a forced failover to the configured geo-replica - more information can be found here.

What went wrong, and why?

For context, connections to the Azure SQL Database service are received and routed by regional gateway clusters. Each region has multiple gateway clusters for redundancy - traffic is distributed evenly between the clusters under normal operations, and automatically rerouted if one of the clusters becomes unhealthy. Each gateway cluster has a persisted cache of metadata about each database in the system, which is used for connection routing. These caches are used for scaling out gateway nodes, to avoid contention on a single source of metadata. There are multiple caches per gateway cluster, and each node will fetch data from any of the caches that is available. The West Europe region has two gateway clusters, and each of these clusters has two persisted metadata caches.
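
As a hedged, purely illustrative sketch of the lookup pattern described above (not the SQL gateway's actual implementation; the cache interface is assumed), a gateway node tries each persisted metadata cache in turn and can only route new connections if at least one cache is reachable:

```python
class CacheUnavailable(Exception):
    pass

def lookup_routing_metadata(database_name, caches):
    """Fetch routing metadata for a database from any reachable cache.

    `caches` stands in for the persisted metadata caches of a gateway cluster
    (each exposing a hypothetical .get() method). If every cache is unavailable,
    as happened region-wide in this incident, new connections cannot be routed
    even though already-established connections keep working.
    """
    last_error = None
    for cache in caches:
        try:
            return cache.get(database_name)
        except CacheUnavailable as err:
            last_error = err
    raise CacheUnavailable(f"no metadata cache reachable for {database_name!r}") from last_error
```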

An operator error led to an incorrect action being performed in close sequence on all four persisted metadata caches. The action resulted in a configuration change that made the caches unavailable to the regional gateway processes. This resulted in all regional gateway processes in West Europe becoming unable to access connection routing metadata, leading to the regional incident from 03:47 UTC. New connections would have failed as the gateways were not able to read routing metadata, but connections that were already established would have continued to work. Management operations on server and database resources would also have been impacted, as some workflows also rely on connection routing.

A secondary impact of the issue was that our internal telemetry service in the West Europe region became overloaded with queries. This caused the telemetry ingestion to fall behind by a few hours and telemetry queries were also timing out. The telemetry issues contributed to delays in automatically notifying impacted customer subscriptions via Azure Service Health.

As some customers were receiving automatic notifications of impact within 15 minutes, we assumed that the notification pipeline was working as designed. It was later in the event that we understood communications were not reaching all impacted subscriptions. As a result, we broadened our communications to every customer in the region and published an update on the Azure status page.

Additionally, automatic failover for customers who had set up failover groups with auto-failover configuration was also impacted due to the telemetry issues (manual failover was not impacted).

How did we respond?

This regional incident was detected by our availability monitors, and we were on the investigation bridge within 13 minutes of customer impact. We traced the issue to the action that had been performed erroneously and determined a way to reverse it. Another option would have been to rebuild entirely new caches - but since this rebuild would have taken much longer than fixing the caches in place, we proceeded to formulate a method to revive the caches in place.

On applying this initial mitigation, the caches came back up, resulting in a partial recovery of the incident at 06:18 UTC. While success rates improved significantly at this point (~60%), the recovery was considered 'partial' for two reasons. Firstly, a timing issue in applying the mitigation caused gateways in one of the two clusters to cache incorrect cache connection strings. Secondly, the metadata caches were not receiving updates for changes that had happened while the caches were unavailable.

The first issue was mitigated by restarting all the gateway nodes in the cluster, which needed to be done at a measured pace to avoid overloading the recovering metadata caches. As the restarts progressed, we saw success rates continue to improve, steadily reaching 97% around 07:58 UTC, once all restarts had completed. At this point connections to any database that had not undergone changes (i.e., service tier updates) during the incident would have been successful.

The last step was to determine which persistent cache entries were stale (missed updates) and refresh them to a consistent state. We developed and executed a script to refresh cache entries, with the initial refreshes being done manually while the script was being developed. The success rate recovered to 99.9% for the region at 11:10 UTC. We then proceeded to identify and mitigate any residual issues, and also started the process to confirm recovery with customers and downstream impacted Azure services.

Based on login success rate telemetry, the incident mitigation time was determined to be 13:30 UTC. Mitigation communications were sent out to all impacted customers at 19:16 UTC, after a thorough validation that no residual impact remained.

How are we making incidents like this less likely or less impactful?

We are implementing a number of service repairs as a result of this incident, including but not limited to:

Completed:

  • Programmatically blocking any further executions of the action that led to the metadata caches becoming unavailable.

In progress:

  • Implementing stronger guardrails on impactful operations to prevent human errors like the one that triggered this incident.
  • Implementing in-memory caching of connection routing metadata in each gateway process, to further increase resiliency and scalability.
  • Implementing throttling on telemetry readers to prevent ingestion from falling behind.
  • Removing dependency of automatic-failover on telemetry system.
  • Investigating other service resiliency repairs as determined by our internal retrospective of this incident, which is ongoing.

How can our customers and partners make incidents like this less impactful?

Customers who had configured active geo-replication and failover groups would have been able to recover by performing a forced-failover to the configured geo-replica.

More guidance for recovery in regional failure scenarios is available at:

How can we make our incident communications more useful?

We are piloting this "PIR" format as a potential replacement for our "RCA" (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey: