Azure status history

This page contains root cause analyses (RCAs) of previous service issues, each retained for 5 years. From November 20, 2019, this includes RCAs for all issues about which we communicated publicly. From June 1, 2022, it also includes RCAs for broad issues as described in our documentation.

January 2023

25 January 2023

Preliminary Post Incident Review (PIR) – Azure Networking – Global WAN issues (Tracking ID VSG1-B90)

This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far.

After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.

What happened?

Between 07:05 UTC and 12:43 UTC on 25 January 2023, customers experienced networking connectivity issues, manifesting as increased network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. While most regions and services had recovered by 09:00 UTC, intermittent packet loss was not fully mitigated until 12:43 UTC. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud.

What went wrong and why?

We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet and Azure, connectivity across regions, and cross-premises connectivity via ExpressRoute. As part of a planned change to update the IP address on a WAN router, a command issued to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this recomputation, the routers were unable to correctly forward packets traversing them. The command that caused the issue behaves differently on different network devices, and it had not been vetted using our full qualification process on the router on which it was executed.
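
The PIR does not name the routing protocol or the specific command involved, so the following is only a toy illustration of the failure mode: when every router in a link-state network is forced to recompute its shortest-path tree at once, forwarding is disrupted until the new tables converge. The topology, router names, and link costs below are invented.

```python
# Toy link-state model: an event that forces every router to rebuild its
# shortest-path tree disrupts forwarding until the new tables converge.
# Router names and link costs are invented for illustration.
import heapq

TOPOLOGY = {                      # undirected link costs
    "r1": {"r2": 1, "r3": 4},
    "r2": {"r1": 1, "r3": 1, "r4": 2},
    "r3": {"r1": 4, "r2": 1, "r4": 1},
    "r4": {"r2": 2, "r3": 1},
}

def next_hops(source):
    """Run Dijkstra from `source` and return {destination: first hop}."""
    dist, first_hop = {source: 0}, {}
    heap = [(0, source, None)]
    while heap:
        d, node, hop = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                          # stale heap entry
        if hop is not None:
            first_hop.setdefault(node, hop)
        for neighbor, cost in TOPOLOGY[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor, hop if hop else neighbor))
    return first_hop

# Steady state: every router holds a converged forwarding table.
fib = {router: next_hops(router) for router in TOPOLOGY}
print(fib["r1"]["r4"])                                    # forwards via r2

# A WAN-wide update forces all routers to recompute at once; until each
# recomputation finishes, packets traversing those routers are not
# forwarded correctly.
fib = {router: {} for router in TOPOLOGY}                 # tables flushed
print(fib["r1"].get("r4", "no usable route during recomputation"))

# Convergence: the tables are rebuilt and forwarding recovers.
fib = {router: next_hops(router) for router in TOPOLOGY}
print(fib["r1"]["r4"])                                    # forwards via r2
```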

How did we respond?

Our monitoring first detected DNS and WAN-related issues at 07:12 UTC. We began investigating by reviewing all recent changes. By 08:10 UTC, the network started to recover automatically. By 08:20 UTC, as the automatic recovery was happening, we identified the problematic command that triggered the issue. Networking telemetry shows that nearly all network devices had recovered by 09:00 UTC, by which point the vast majority of regions and services had recovered. The final networking equipment recovered by 09:35 UTC.

Due to the WAN impact, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices, and the traffic engineering system for optimizing the flow of data across the network. Because these systems were paused, some paths in the network experienced increased packet loss from 09:35 UTC until the systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC.
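
As a loose, hypothetical sketch of why pausing this kind of automation matters (the device names, loss figures, and threshold below are invented, not taken from the incident), a health sweep that is paused leaves lossy devices in the active path set until it is resumed:

```python
# Hypothetical health-sweep sketch: while automation is paused, devices that
# exceed a packet-loss threshold stay in the active path set, so some flows
# keep traversing lossy links until the sweep is resumed manually.
# Device names, loss figures, and the threshold are invented.
LOSS_THRESHOLD = 0.01                       # 1% packet loss

active_devices = {"wan-a": 0.000, "wan-b": 0.002, "wan-c": 0.080}   # device -> measured loss

def sweep(devices, paused):
    """Return the devices left in service after one health sweep."""
    if paused:
        return dict(devices)                # paused: lossy devices stay in rotation
    return {d: loss for d, loss in devices.items() if loss <= LOSS_THRESHOLD}

print(sorted(sweep(active_devices, paused=True)))    # ['wan-a', 'wan-b', 'wan-c']
print(sorted(sweep(active_devices, paused=False)))   # ['wan-a', 'wan-b']
```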

How are we making incidents like this less likely or less impactful?

  • We have blocked highly impactful commands from being executed on the devices. (Completed)
  • We will require all command execution on the devices to follow safe change guidelines. (Estimated completion: February 2023)

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.

23 January 2023

South Central US - Network issues impacting some Azure services - Mitigated (Tracking ID 7NBR-T98)

Summary of Impact: Between approximately 16:27 UTC and 19:38 UTC on 23 January 2023, a subset of customers in the South Central US region may have experienced increased latency and/or intermittent connectivity issues for some services in the region. Downstream impact to other services in the region also occurred.

Preliminary Root Cause: We identified an unhealthy network device in the regional network gateway in the South Central US region. This device was impacting traffic between Availability Zones and datacenters in South Central US, as well as traffic into and out of the region.

Mitigation: The unhealthy network device was removed from service so that traffic was served through other healthy paths.
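
The summary does not describe the exact mechanism used to take the device out of service. Purely as a generic illustration (the gateway names and flow identifier are invented), draining one unhealthy next-hop from an equal-cost path group causes flows to hash onto the remaining healthy paths:

```python
# Generic illustration (not the actual mitigation tooling): draining one
# unhealthy next-hop from an ECMP group so flows hash onto healthy paths.
# Gateway names and the flow identifier are invented.
import hashlib

ecmp_group = ["gw-1", "gw-2", "gw-3"]

def pick_path(flow_id, paths):
    """Deterministically hash a flow onto one of the available paths."""
    digest = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
    return paths[digest % len(paths)]

flow = "10.0.0.1->10.1.0.9:443"
print(pick_path(flow, ecmp_group))                       # one of gw-1/gw-2/gw-3

# Mitigation: take the unhealthy device out of the group; flows now hash
# only onto the remaining healthy paths.
healthy = [p for p in ecmp_group if p != "gw-2"]
print(pick_path(flow, healthy))                          # gw-1 or gw-3
```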

You can stay informed about Azure service issues, maintenance events, or advisories by creating custom service health alerts (see our video tutorials and how-to documentation), and you will be notified via your preferred communication channel(s).


18 January 2023

Preliminary Post Incident Review (PIR) – Single zone power event – West Europe (Tracking ID 6S_Q-JT8)

This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far.

After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.

What happened?

Between 09:44 and 13:10 UTC on 18 January 2023, a subset of customers using Storage services in West Europe may have experienced higher than expected latency, timeouts or HTTP 500 errors when accessing data stored on Storage accounts hosted in this region. Other Azure services with dependencies on this specific storage infrastructure may also have experienced impact – including Azure Application Insights, Azure Automation, Azure Container Registry, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Red Hat OpenShift, Azure Search, and Azure Virtual Machines (VMs).

What went wrong and why?

We determined that an issue occurred during planned power maintenance, impacting a subset of storage and networking infrastructure in one datacenter of one Availability Zone in West Europe. Storage tenants were affected, and some network devices may have rebooted. This unexpected event was caused by a failure in the electrical systems feeding the affected racks. While all server racks have redundant dual power feeds, one feed was powered down for maintenance, and a failure in the redundant feed caused a complete shutdown of the affected racks. We continue to investigate the nature of this redundant feed failure, to prevent the incident from reoccurring.

How did we respond?

The issue was detected by the datacenter operations team performing the maintenance at the time. We immediately initiated the maintenance rollback procedure and restored power to the affected racks. Concurrently, we escalated the incident and engaged other Azure service stakeholders to initiate and validate service recovery. Most impacted resources recovered automatically following the power event, through automated recovery processes. The storage team identified two storage scale units that did not come back online automatically – nodes were not booting properly, as network connectivity was still unavailable. Networking teams were engaged to investigate and identified a Border Gateway Protocol (BGP) issue. BGP is the standard routing protocol used to exchange routing and reachability information between networks. Since BGP functionality did not recover automatically, 3 of the 20 impacted top-of-rack (ToR) networking switches remained unavailable. Networking engineers restored the BGP sessions manually. One storage scale unit was fully recovered by 10:00 UTC; the other was fully recovered by 13:10 UTC.
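
The PIR does not say how the stuck sessions were identified; as a hypothetical monitoring sketch only (device names, peers, and states are invented), flagging ToR switches whose BGP sessions have not returned to the Established state is one way to surface them for manual recovery:

```python
# Hypothetical monitoring sketch: after power is restored, flag ToR switches
# whose BGP sessions have not re-established so engineers can intervene
# manually. Device names, peers, and states are invented.
from dataclasses import dataclass

@dataclass
class BgpSession:
    device: str
    peer: str
    state: str            # simplified BGP state, e.g. "Idle" or "Established"

sessions = [
    BgpSession("tor-01", "agg-01", "Established"),
    BgpSession("tor-02", "agg-01", "Established"),
    BgpSession("tor-03", "agg-02", "Idle"),       # did not recover automatically
]

def stuck_sessions(sessions):
    """Return sessions that are not exchanging routes and need manual recovery."""
    return [s for s in sessions if s.state != "Established"]

for s in stuck_sessions(sessions):
    print(f"{s.device} -> {s.peer}: BGP {s.state}; escalate for manual recovery")
```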

How are we making incidents like this less likely or less impactful?

Microsoft has an extensive internal retrospective process after incidents, including deep-dive reviews into any issues caused by power distribution systems. All learnings and action items are captured in an incident management system to ensure that they are tracked and closed in a timely manner. Since this incident was triggered by a power event, the datacenter forensics team will evaluate the results and issue service bulletins or process changes globally to prevent reoccurrence, as and when required. Our internal retrospective is ongoing; the Final PIR will include a summary of our learnings from the review process, including any relevant next steps that will make incidents like this less likely, or at least less impactful.

How can customers make incidents like this less impactful?

  • Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations.
  • Consider which Storage redundancy options are right for your critical applications. Zone-redundant storage (ZRS) remains available throughout a zone-localized failure, like the one in this incident. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable (a sketch of creating a ZRS account follows this list).
  • Consider using Azure Chaos Studio to recreate the symptoms of this incident as part of a chaos experiment, to validate the resilience of your Azure applications. Our library of faults includes VM shutdown, network block, and AKS faults that can help to recreate some of the connection difficulties experienced during this incident – for example, by targeting all resources within a single Availability Zone.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review.
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more.
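
As one concrete illustration of the ZRS option above, the following sketch creates a zone-redundant StorageV2 account with the azure-mgmt-storage Python SDK. The subscription ID, resource group, account name, and region are placeholders, and method names may differ slightly across SDK versions.

```python
# Hedged sketch: creating a zone-redundant (ZRS) StorageV2 account with the
# Azure Python SDK. Subscription ID, resource group, account name, and
# region are placeholders; method names reflect recent azure-mgmt-storage
# releases and may differ in older versions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="my-resource-group",          # placeholder
    account_name="mycriticalappdata",                 # placeholder, must be globally unique
    parameters={
        "location": "westeurope",
        "kind": "StorageV2",
        "sku": {"name": "Standard_ZRS"},              # zone-redundant within the region
    },
)
account = poller.result()
print(account.name, account.sku.name)
```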

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.

November 2022

2 November 2022

Post Incident Review (PIR) - Enrolling new certificates / Provisioning new resources - Azure Public / Government / China (Tracking ID YTGZ-1Z8)

What happened?

Between 00:42 UTC on 2 November and 05:55 UTC on 3 November 2022, customers experienced failures when attempting to provision new resources in a subset of Azure services including Application Gateway, Bastion, Container Apps, Database Services (MySQL - Flexible Server, Postgres - Flexible Server, and others), ExpressRoute, HDInsight, OpenAI, SQL Managed Instance, Stream Analytics, VMware Solution, and VPN Gateway. Provisioning new resources in these services requires creating new certificates. The certificate Registration Authority (RA) that processes new certificate requests experienced a service degradation, which prevented the provisioning of new resources in this subset of Azure services. We’re providing you with this Post Incident Review (PIR) to summarize what went wrong, how we responded, and the steps Microsoft is taking to learn and improve. Communications for this incident were provided under Tracking IDs YTGZ-1Z8 (Azure Public), 7LHZ-1S0 (Azure Government) and YTHP-180 (Azure China).

What went wrong, and why?

From 23:56 UTC on 1 November until 00:52 UTC on 2 November, an internal Certificate Authority (CA) experienced a brief service degradation. At the same time, the Registration Authority (RA) that sends requests to the CA received a burst of certificate renewal requests, resulting in requests queueing at the RA. Once the CA recovered, the request queue started processing. Due to a latent performance bug in the RA, inadvertently introduced in the past month as part of a feature enhancement, the rate of new incoming requests was greater than the rate at which requests could be processed from the queue. This triggered automatic throttling of incoming requests, but the RA was unable to recover fully because throttled requests generated additional retry traffic. The latent performance bug was not caught during deployment or through health monitoring, due to a test gap in which this specific set of conditions was not exercised.
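
To make the failure mode concrete, here is a toy queue model (all rates are invented, not RA telemetry): once a fraction of throttled requests retry, effective arrivals exceed the drain rate and the backlog keeps growing, whereas blocking new requests lets it drain.

```python
# Toy queue model (invented rates, not RA telemetry): once a fraction of
# throttled requests retry, effective arrivals exceed the drain rate and
# the backlog grows; blocking new requests lets the queue drain.
def simulate(minutes, new_rate, drain_rate, retry_fraction, block_new=False):
    """Return the backlog size after each simulated minute."""
    backlog, retries, history = 1000, 0, []                   # start with the queued burst
    for _ in range(minutes):
        arrivals = 0 if block_new else new_rate + retries
        throttled = max(0, backlog + arrivals - drain_rate)   # work not yet absorbed
        retries = int(throttled * retry_fraction)             # throttled callers retry
        backlog = max(0, backlog + arrivals - drain_rate)
        history.append(backlog)
    return history

# Retry traffic keeps the backlog growing even though throttling is active...
print(simulate(10, new_rate=90, drain_rate=100, retry_fraction=0.2)[-1])
# ...whereas blocking requests to the service lets the queue drain, as in the mitigation.
print(simulate(10, new_rate=90, drain_rate=100, retry_fraction=0.2, block_new=True)[-1])
```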

How did we respond? 

The issue was detected via our internal monitoring, and the relevant engineering team was engaged within one minute of the alert firing. The service was unable to self-heal, so the following steps were taken to mitigate the incident. Firstly, we rolled back the change containing the performance bug. Secondly, we blocked requests to the RA, to enable the queue to drain. Finally, once the queue was at a manageable length, we slowly re-enabled traffic and monitored service recovery until all traffic was re-enabled and the RA had returned to a healthy state.
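
The gradual re-enablement described above follows a common ramp-up pattern. Purely as an illustrative sketch (the stage fractions, wait interval, and health check are invented, not the actual RA tooling):

```python
# Illustrative ramp-up sketch (stage fractions, wait interval, and health
# check are invented, not the actual RA tooling): re-enable traffic in
# stages, backing off if the service shows signs of falling behind again.
import time

RAMP_STAGES = [0.05, 0.25, 0.50, 1.00]      # fraction of traffic admitted per stage

def service_is_healthy():
    """Placeholder health signal, e.g. queue length and error rate below thresholds."""
    return True

def ramp_up(set_admission_fraction, wait_seconds=300):
    for fraction in RAMP_STAGES:
        set_admission_fraction(fraction)
        time.sleep(wait_seconds)             # let queue length and error rates settle
        if not service_is_healthy():
            set_admission_fraction(0.0)      # back off and let the queue drain again
            return False
    return True

if __name__ == "__main__":
    ramp_up(lambda f: print(f"admitting {f:.0%} of incoming requests"), wait_seconds=0)
```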

How are we making incidents like this less likely or less impactful?

  • We have increased processing capacity by 3x, for the RA backend component. (Completed)
  • We have introduced more granular request throttling, to help smooth similar spikes in traffic (illustrated in the sketch after this list). (Completed)
  • We have extended our end-to-end service monitoring to include upstream request sources, to reduce time to detection. (Completed)
  • We are exploring more fine-grained throttling as an additional isolation layer across the backend of the RA (Estimated completion: December 2022).
  • We are deploying additional dedicated RA capacity for the Azure China and Azure Government cloud environments (Estimated completion: March 2023). 
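
As a generic illustration of the granular throttling item above (not the RA's actual implementation), a per-caller token bucket limits each request source independently, so a burst from one source is throttled without starving the others:

```python
# Generic per-caller token-bucket sketch (not the RA's actual throttling):
# each request source gets its own bucket, so a burst from one source is
# throttled without starving the others. Rates and caller names are invented.
import time
from collections import defaultdict

RATE = 5.0            # tokens replenished per second, per caller
BURST = 10.0          # bucket capacity

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(caller: str) -> bool:
    """Return True if this caller's request is admitted, False if throttled."""
    bucket = _buckets[caller]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False

# A burst from one caller empties its own bucket and is throttled...
print(sum(allow("service-a") for _ in range(25)))    # about 10 requests admitted
# ...while another caller's requests are still admitted.
print(allow("service-b"))                            # True
```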

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.