
Azure status history

This page contains root cause analyses (RCAs) of previous service issues, each retained for 5 years. Since November 20, 2019, this has included RCAs for all issues about which we communicated publicly. Since June 1, 2022, it also includes RCAs for broad issues as described in our documentation.


November 2022

2/11

Post Incident Review (PIR) - Enrolling new certificates / Provisioning new resources - Azure Public / Government / China (Tracking ID YTGZ-1Z8)

What happened?

Between 00:42 UTC on 2 November and 05:55 UTC on 3 November 2022, customers experienced failures when attempting to provision new resources in a subset of Azure services including Application Gateway, Bastion, Container Apps, Database Services (MySQL - Flexible Server, Postgres - Flexible Server, and others), ExpressRoute, HDInsight, Open AI, SQL Managed Instance, Stream Analytics, VMware Solution, and VPN Gateway. Provisioning new resources in these services requires creating new certificates. The certificate Registration Authority (RA) that processes new certificate requests experienced a service degradation, which prevented the provisioning of new resources in this subset of Azure services. We’re providing you with this Post Incident Review (PIR) to summarize what went wrong, how we responded, and the steps Microsoft is taking to learn and improve. Communications for this incident were provided under Tracking IDs YTGZ-1Z8 (Azure Public), 7LHZ-1S0 (Azure Government), and YTHP-180 (Azure China).

What went wrong, and why?

From 23:56 UTC on 1 November through 00:52 UTC on 2 November, an internal Certificate Authority (CA) experienced a brief service degradation. At the same time, the Registration Authority (RA) that sends requests to the CA received a burst of certificate renewal requests, resulting in requests queueing at the RA. Once the CA recovered, the request queue started processing. Due to a latent performance bug in the RA, inadvertently introduced in the past month as part of a feature enhancement, the rate of new incoming requests was greater than the rate at which queued requests could be processed. This triggered automatic throttling of incoming requests, but the RA was unable to recover fully because throttled requests generated additional retry traffic. The latent performance bug was not caught during deployment or through health monitoring, due to a test gap: this specific set of conditions was not exercised.
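
To illustrate the failure mode described above, here is a minimal queue-dynamics sketch, not the RA's actual implementation: when the effective arrival rate (new requests plus retries of throttled requests) stays above the rate at which the queue is served, the backlog never drains. All rates, limits, and the retry fraction below are hypothetical.

```python
# Minimal queue-dynamics sketch (hypothetical rates, not RA telemetry).
# Each tick: the backend serves up to `service_rate` requests; new requests
# arrive at `arrival_rate`; anything above `throttle_limit` is rejected, and a
# fraction of rejected requests comes back as retry traffic on the next tick.

def simulate(ticks, arrival_rate, service_rate, throttle_limit, retry_fraction, backlog=5000):
    retries = 0
    for t in range(ticks):
        offered = arrival_rate + retries          # new traffic plus retry traffic
        admitted = min(offered, throttle_limit)   # automatic throttling
        rejected = offered - admitted
        backlog = max(0, backlog + admitted - service_rate)
        retries = int(rejected * retry_fraction)  # throttled callers try again
        if t % 10 == 0:
            print(f"t={t:3d}  backlog={backlog:6d}  offered={offered:6d}  rejected={rejected:6d}")
    return backlog

# Incoming rate above the (bugged) processing rate: the backlog keeps growing
# even with throttling, because retries inflate the offered load.
simulate(ticks=50, arrival_rate=1200, service_rate=1000, throttle_limit=1100, retry_fraction=0.9)

# Blocking new requests entirely (as in the mitigation below) lets the queue drain.
simulate(ticks=50, arrival_rate=0, service_rate=1000, throttle_limit=1100, retry_fraction=0.9)
```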

How did we respond? 

The issue was detected via our internal monitoring, and the relevant engineering team was engaged within one minute of the alert firing. The service was unable to self-heal, so we took the following steps to mitigate the incident. First, we rolled back the change containing the performance bug. Second, we blocked requests to the RA to enable the queue to drain. Finally, once the queue was at a manageable length, we re-enabled traffic slowly and monitored service recovery until all traffic was re-enabled and the RA had returned to a healthy state.
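
As a rough illustration of that final step, the sketch below shows one common pattern for re-enabling traffic in stages while watching a health signal. The stage percentages, soak time, and the set_admission_percentage() and check_backend_health() helpers are hypothetical stand-ins for the RA's real controls.

```python
import time

# Hypothetical hooks standing in for real admission control and health checks.
def set_admission_percentage(percent: int) -> None:
    print(f"admitting {percent}% of incoming traffic")

def check_backend_health() -> bool:
    # In practice this would look at queue length, latency, and error rates.
    return True

def ramp_up(stages=(5, 10, 25, 50, 100), soak_seconds=300):
    """Re-enable traffic in stages, backing off if health degrades at any step."""
    for percent in stages:
        set_admission_percentage(percent)
        time.sleep(soak_seconds)              # let each stage soak before judging it
        if not check_backend_health():
            set_admission_percentage(max(percent // 2, 1))   # fall back to a safer level
            raise RuntimeError(f"health regressed at {percent}% admission")
    print("all traffic re-enabled; backend healthy")

ramp_up(soak_seconds=1)  # short soak for demonstration only
```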

How are we making incidents like this less likely or less impactful?

  • We have increased processing capacity of the RA backend component by 3x. (Completed)
  • We have introduced more granular request throttling, to help smooth similar spikes in traffic; a simplified sketch of this style of per-source throttling follows this list. (Completed)
  • We have extended our end-to-end service monitoring to include upstream request sources, to reduce time to detection. (Completed)
  • We are exploring more fine-grained throttling as an additional isolation layer across the backend of the RA (Estimated completion: December 2022).
  • We are deploying additional dedicated RA capacity for the Azure China and Azure Government cloud environments (Estimated completion: March 2023). 
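
The sketch below illustrates the general per-source throttling technique referenced in the list above, using one token bucket per calling source so that a spike or retry storm from one caller exhausts only that caller's budget. The rates, burst size, and the notion of a "source" key are hypothetical; none of this is drawn from the RA's implementation.

```python
import time
from collections import defaultdict

class PerSourceThrottle:
    """Token bucket keyed by request source (illustrative only)."""

    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)        # current tokens per source
        self.last_seen = defaultdict(time.monotonic)    # last refill time per source

    def allow(self, source: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[source]
        self.last_seen[source] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[source] = min(self.burst, self.tokens[source] + elapsed * self.rate)
        if self.tokens[source] >= 1.0:
            self.tokens[source] -= 1.0
            return True
        return False    # caller should back off rather than hot-retry

throttle = PerSourceThrottle(rate_per_second=5, burst=10)
accepted = sum(throttle.allow("tenant-a") for _ in range(50))
print(f"accepted {accepted} of 50 back-to-back requests from tenant-a")
```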

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

October 2022

26/10

Post Incident Review (PIR) - Azure Cosmos DB - East US (Tracking ID JKWW-JP8)

What happened? 

Between 00:25 UTC and 06:00 UTC on 26 October 2022, a subset of customers using Azure Cosmos DB in the East US region may have experienced issues connecting to the service. Connections to Azure Cosmos DB accounts in this region may have resulted in an error or timeout. Downstream Azure services that rely on Azure Cosmos DB also experienced impact during this window - including Azure Application Insights, Azure Automation, Azure Container Registry, Azure Digital Twins, Azure Policy, Azure Rights Management, Azure Red Hat OpenShift, and Azure Spatial Anchors. 

What went wrong, and why?

A change to the front-end gateway of Azure Cosmos DB, to include additional diagnostic information, was introduced on the affected cluster earlier this month. The change had had no effect on the fidelity of the system since its introduction. On 26 October, a configuration change was applied to the Azure Load Balancer. This change resulted in intermittent network connectivity issues, from which the system can normally recover. However, the diagnostic change resulted in higher-than-expected time spent in the kernel, which caused spikes of high CPU utilization across the cluster. This in turn created an increase in exceptions and even more time spent on kernel locking, which led to timeouts and increased latency of incoming requests, ultimately causing the customer impact described above.
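
As a back-of-the-envelope illustration of why a small amount of additional per-request kernel time can have an outsized effect, the sketch below applies a simple M/M/1 queueing approximation. The request rates and service times are hypothetical and are not drawn from the incident.

```python
# Hypothetical numbers: a small amount of extra per-request time in the hot path
# can push a node past saturation, after which queueing delay (not the extra
# work itself) dominates, showing up as timeouts and high latency.

def mean_latency_ms(arrival_per_sec: float, service_ms: float) -> float:
    """Approximate M/M/1 mean latency; infinite once the node is saturated."""
    utilization = arrival_per_sec * (service_ms / 1000.0)
    if utilization >= 1.0:
        return float("inf")   # requests arrive faster than they complete
    return service_ms / (1.0 - utilization)

arrival = 900  # requests per second per node (hypothetical)
for extra_ms in (0.0, 0.05, 0.1, 0.2):   # added kernel/diagnostic time per request
    latency = mean_latency_ms(arrival, 1.0 + extra_ms)
    print(f"+{extra_ms:.2f} ms per request -> mean latency {latency:.1f} ms")
```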

How did we respond?

Our monitors alerted us to the impact on this cluster. We worked with our customers and partners to trigger mitigation steps while investigating the factors contributing to this issue. To mitigate the incident, accounts were offloaded from the impacted cluster to other clusters in the same region. Given the quantity of accounts to migrate, we systematically moved each database account to an alternative healthy cluster. The resulting lower load improved the state of the impacted cluster and enabled recovery. All customer impact was confirmed mitigated by 06:00 UTC.
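
To make the offload step concrete, here is a minimal sketch of systematically draining accounts from an overloaded cluster to healthy peers until the source recovers. The cluster_load(), pick_healthy_cluster(), and migrate_account() helpers, the account names, and the health threshold are all hypothetical; they are not Cosmos DB's internal APIs.

```python
import random

# Hypothetical stand-ins for real control-plane operations (not Cosmos DB APIs).
def cluster_load(cluster: str) -> float:
    """Utilization of a cluster, 0.0 - 1.0 (stubbed with a random value here)."""
    return random.uniform(0.5, 1.0)

def pick_healthy_cluster(region: str, exclude: str) -> str:
    """Choose a healthy target cluster in the same region (stubbed)."""
    return f"{region}-cluster-{random.randint(2, 9)}"

def migrate_account(account: str, target: str) -> None:
    """Move a database account to the target cluster (stubbed)."""
    print(f"  migrating {account} -> {target}")

def drain(impacted_cluster: str, accounts: list[str], region: str, healthy_threshold: float = 0.7) -> None:
    """Offload accounts one at a time until the impacted cluster recovers."""
    for account in accounts:
        if cluster_load(impacted_cluster) < healthy_threshold:
            print("impacted cluster back under threshold; stopping offload")
            break
        migrate_account(account, pick_healthy_cluster(region, exclude=impacted_cluster))

drain("eastus-cluster-1", [f"account-{i}" for i in range(5)], region="eastus")
```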

How are we making incidents like this less likely or less impactful?

  • We have paused the configuration changes for the Azure Load Balancer. (Completed)
  • We have worked to fix the original regression in the diagnostic stack. (Completed)
  • We are improving our load balancing automation, to speed up recovery in similar circumstances. (Estimated completion: November 2022).

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: