Azure status history

This page contains all RCAs for incidents that occurred on November 20th, 2019 or later and will - from that date forward - provide a 5-year RCA history. RCAs prior to November 20th, 2019 are not available.

September 2020

9/18

RCA - Azure Storage Premium File Shares - East US (Tracking ID SMSC-FS0)

Summary of Impact: Between 11:30 UTC and 19:51 UTC on 18 Sep 2020, a subset of customers using Azure Storage Premium File Shares in East US may have experienced issues accessing services. Other downstream services may have seen impact or experienced service degradation.


Root Cause: On a single storage scale unit in East US, a feature was applied to optimize the performance of IO operations. The feature contained a code bug in an infrequent error path which, when hit, would cause a storage front end process to become unhealthy. The incident started when a small number of clients entered an invalid state, triggered by a combination of routine network maintenance operations that happened on the storage scale unit at the time and a code bug on the client side. This caused the faulty error path to be hit more frequently. This series of events led to multiple front ends becoming unhealthy, which resulted in failed requests and increased latencies for the duration of the incident.


Mitigation: We mitigated the incident by applying a configuration change to disable the performance optimization feature that introduced the bug. Once the front end processes became healthy again, we applied another configuration change to balance the load across the front ends in order to speed up the recovery.
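
For illustration only, the sketch below (written in Python, with invented names) shows the general shape of this kind of configuration-driven kill switch: a dynamic setting decides whether requests take the optimized IO path or fall back to the well-tested baseline path, so the optimization can be disabled without redeploying code. This is a sketch of the pattern, not the actual storage front end implementation.

    # Hypothetical dynamic configuration; in practice this would be pushed by a
    # configuration service rather than defined in code.
    CONFIG = {"enable_io_fast_path": False}  # set to False to mitigate

    def read_block_baseline(offset, length):
        # Well-tested, slower path.
        return b"\x00" * length

    def read_block_fast_path(offset, length):
        # Optimized path that, in this illustration, contains the faulty error handling.
        return b"\x00" * length

    def read_block(offset, length):
        # Route requests based on the kill switch so the optimization can be
        # turned off instantly if it misbehaves.
        if CONFIG.get("enable_io_fast_path", False):
            return read_block_fast_path(offset, length)
        return read_block_baseline(offset, length)

    print(len(read_block(0, 4096)))  # served by the baseline path while the flag is off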


Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to) the following actions:

1. The performance optimization feature has been temporarily disabled in other storage scale units in order to prevent similar incidents until the code fix is deployed.
2. A code fix has been developed and will be validated and deployed before re-enabling the performance optimization feature.
3. Improving testing and validation to help catch similar issues before they roll out to production.
4. Investigating the reason why the monitoring system did not trigger an early warning alert when the front end processes started failing.


Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

9/14

RCA - Connectivity Issues - UK South (Tracking ID CSDC-3Z8)

Summary of Impact: Between 13:30 UTC on 14 Sep and 00:41 UTC on 15 Sep 2020, a subset of customers in the UK South may have encountered issues connecting to Azure services hosted in this region. Customers leveraging Availability Zones and configured for zone redundancy would not have experienced a loss in service availability. In some instances, the ability to perform service management would have been impacted. Zone Redundant Storage (ZRS) remained available throughout the incident.

Root Cause and Mitigation: On 14 September 2020, a customer impacting event occurred in a single datacenter in UK South due to a cooling plant issue. During a maintenance activity being performed at our facility, the site shut down the water tower makeup pumps via the Building Automation System (BAS). The pumps were shut down in error, which was noticed at approximately 13:30 UTC when our teams began to inspect the plant.

By this time, the issue had begun to impact downstream mechanical systems, resulting in the electrical infrastructure that supports the mechanical systems shutting down. Microsoft operates its datacenters with a 2N design, meaning that we operate with a fully redundant, mirrored system. The 2N design is meant to protect against interruptions that could cause potential downtime; however, in this case, the cascading failures impacted both sides of the electrical infrastructure that supports the mechanical systems. When the thermal event was detected by our internal systems, automation began to power down various resources of the Network, Storage, and Compute infrastructure to protect hardware and data durability. There were portions of our infrastructure that could not be powered down automatically (for example due to connectivity issues); some of these were shut down via manual intervention.

It took approximately 120 minutes for the team to diagnose the root cause and begin to remediate the mechanical plant issues, with cooling being restored at 15:45 UTC. By 16:30 UTC temperatures across the affected parts of the data center had returned to normal operational ranges.

Network recovery began at approximately 16:30 UTC with the power-cycling of network switches to recover them from the self-preservation state they entered when they overheated. Recovery was prioritized to bring Azure management infrastructure online first, followed by Storage clusters and then Compute clusters. Once the network switches providing connectivity to a set of resources had been power-cycled and started to show health, engineers began recovering the other classes of resources. Network recovery was completed at 23:32 UTC. Shortly after this, any impacted Storage and Compute clusters regained connectivity, and engineers took further steps to bring any remaining unhealthy servers back online.
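
As a rough illustration of the phased recovery described above, the Python sketch below (all names and the health check are placeholders, not Azure tooling) power-cycles switch groups in priority order and only moves on to the next tier once the current tier reports healthy.

    # Illustrative only: recover tiers in priority order.
    RECOVERY_ORDER = ["management-infrastructure", "storage-clusters", "compute-clusters"]

    def power_cycle(switch):
        print(f"power-cycling {switch}")

    def is_healthy(switch):
        # Placeholder health probe; a real check would poll the device.
        return True

    def recover(switches_by_tier):
        for tier in RECOVERY_ORDER:
            switches = switches_by_tier.get(tier, [])
            for switch in switches:
                power_cycle(switch)
            # Only proceed to the next tier once every switch in this tier is healthy.
            assert all(is_healthy(s) for s in switches)

    recover({
        "management-infrastructure": ["mgmt-sw-1"],
        "storage-clusters": ["stor-sw-1", "stor-sw-2"],
        "compute-clusters": ["comp-sw-1"],
    })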

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Review the logs and alarms from all affected mechanical and electrical gear to help ensure there was no damage or failed components. This is complete.
  • Review and update Operational Procedures and Change Management to help ensure that the correct checks are in place and that system changes issued via commands across systems are validated visually prior to commencement of work or return to a normal state.
  • Validate and update the discrimination study for the Mechanical and Electrical systems.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

9/3

RCA - Network Latency Issue – West Europe (Tracking ID 8KLC-1T8)

Summary of Impact: Between 09:21 and 17:32 UTC on 03 Sep 2020, a subset of customers may have experienced intermittent latency or issues connecting to resources hosted in West Europe. Retries may have worked during this timeframe.

Root Cause: Two separate events occurred in close succession prior to the start of impact from this incident:

  • Approximately 4 hours before the impact start, some local activity (likely construction) in the vicinity of the data centre caused an increase in the number of packets corrupted during transmission over fiber optic cables between data centres in the West Europe region. These errored packets were detected and dropped, and our networking automation systems took the links out of service and opened tickets with the local site to have them repaired. This is a standard process, and our automated safety checks validated that there was no impact related to this.
  • Separately, between 09:21 and 09:26 UTC a significant fiber cut occurred approximately 5 kilometres from the data centre, on one of the other paths between the data centres. This cut impacted 50% of the capacity for that route but, again, this event on its own would have had no impact on traffic overall in the West Europe region.

Each of these events in isolation would have had no perceptible impact on networking operations for West Europe, but in combination they resulted in 9 links between data centres receiving an unequal share of traffic, becoming congested, and dropping packets (the impact was to less than 2% of the total capacity on the impacted links). Connections that travelled over these congested links would have experienced increased packet loss and latency. Because connections are spread over the available links, services that retried requests by opening new connections were likely to have succeeded.
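
The congestion mechanism can be pictured with a small, self-contained Python simulation (the connection and link counts are invented): connections are spread across the available links by hashing, so taking links out of service concentrates the same traffic onto fewer links, while a connection that is retried hashes afresh and may land on a less congested link.

    from collections import Counter

    # Spread connections over links by hashing, as a stand-in for real load balancing.
    def spread(connections, links):
        return Counter(links[hash(c) % len(links)] for c in connections)

    connections = [f"conn-{i}" for i in range(100_000)]
    all_links = [f"link-{i}" for i in range(16)]

    healthy = spread(connections, all_links)        # all links in service
    degraded = spread(connections, all_links[:7])   # several links taken out of service

    print(max(healthy.values()))   # typical per-link load before the failures
    print(max(degraded.values()))  # noticeably higher per-link load afterwards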

The time to mitigate was extended by the need for on-call engineers to identify that there were multiple causes for down links and identify the best way to reduce congestion and rebalance traffic. During the initial response, the large number of concurrent alerts resulted in on-call engineers taking actions that moved the congestion from one link to another, but did not resolve it.

Mitigation: Mitigation was achieved by engineers manually determining which of the links that had experienced errors could be put back into service and rebalancing traffic across the links in service. Full mitigation was declared at 17:32 UTC, but most customers would have seen improvement in advance of this time. Full restoration was achieved by 02:00 UTC on September 4, when the significant fiber cut was repaired.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Accelerate the readiness of additional fiber paths between these data centres to reduce the impact of future fiber cuts.
  • Improve the tooling used by on-call engineers when responding to complex incidents with multiple causes of downed links, so that they can reduce congestion faster and achieve mitigation more quickly.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

August 2020

8/21

Content Delivery Network (CDN) - Service Degradation - Mitigated (Tracking ID DLYY-ND8)

Summary of Impact: Between 18:05 and 19:55 UTC on 21 Aug 2020, a subset of customers using Azure CDN from Verizon may have experienced service degradation.

Preliminary Root Cause: We determined that a recent deployment task impacted connectivity to origins, causing dynamic or cache miss requests to fail.

Mitigation: The CDN provider rolled out an update that fixed the issue.

Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; refer to the video tutorials and the how-to documentation for guidance.

8/14

RCA - Degraded connectivity to Microsoft Services within the Southeast region of the United States (Tracking ID 9MDM-TT8)

Summary of Impact: Between approximately 02:20 and 03:30 UTC, and again between 04:07 and 04:52 UTC on 14 Aug 2020, a subset of customers connecting through one of Microsoft's edge-nodes in the Southeast United States (US) may have experienced intermittent periods of degraded connectivity when attempting to connect to Azure, Microsoft 365, and Xbox resources.

Root Cause: Microsoft's Global Network consists of edge-nodes that connect to the Internet externally and two or more Backbone sites internally via diverse optical fiber paths for redundancy during failure scenarios.

On 14 Aug 2020 at 02:20 UTC, we experienced a dual fiber path failure that isolated one of our edge-nodes in the Southeastern US. The initial fiber path incident occurred on 13 Aug 2020 at 18:34 UTC due to a fiber cut, causing that path to be removed from Microsoft's Global Network. Traffic was then routed to our secondary fiber path per design. Meanwhile, our fiber provider had dispatched a technician to work on resolving the initial fiber incident. While working on that incident, the technician inadvertently disconnected our secondary fiber path at 02:20 UTC, which resulted in the secondary path being removed from Microsoft's Global Network and isolated this edge-node site.

Our network is designed to withstand site isolation, and all traffic should have rerouted to the next closest edge-node in the region. However, we identified a router in this edge-node site that continued to advertise a few local prefixes to the Internet, which resulted in the blackholing of all Internet traffic destined to those prefixes at the edge-node site. The router should have withdrawn the advertisement of these local prefixes when the site was isolated from Microsoft's Global Network during the secondary fiber path incident, but this did not occur because the site was missing the configuration needed to detect site isolation, resulting in an outage. In addition, customer notification of the event was delayed by the time needed to correlate the event with customer impact.
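
For clarity, the Python sketch below (illustrative logic only, using a documentation prefix, not actual router configuration) captures the intended conditional-advertisement behaviour: site-local prefixes are advertised to the Internet only while at least one backbone path is up, and are withdrawn when the site is isolated so that traffic is not blackholed.

    # Illustrative logic for conditional prefix advertisement and withdrawal.
    LOCAL_PREFIXES = ["203.0.113.0/24"]  # example (documentation) prefix

    def desired_advertisements(backbone_paths_up):
        if backbone_paths_up == 0:
            return []              # site isolated: withdraw everything
        return LOCAL_PREFIXES      # at least one backbone path: keep advertising

    print(desired_advertisements(2))  # ['203.0.113.0/24']
    print(desired_advertisements(0))  # [] -> prefixes withdrawn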

Mitigation: The outage was mitigated when the fiber provider technician completely restored the fiber connectivity at 04:52 UTC on 14 Aug 2020.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Taking steps to prevent dual failures from occurring, reduce the degree of impact, and shorten time-to-mitigate by implementing improved failover operations to backup sites.
  • Modifying our router configurations globally, to implement conditional prefix advertisement and withdrawal to ensure routers disconnect as expected during isolation events.
  • Improving our alert correlation to notify fiber technicians in a timely manner, and to improve the overall notification experience.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

July 2020

7/18

RCA - Azure DNS - Connectivity issues (Tracking ID TTPY-3P0)

Summary of Impact: Between 07:50 and 08:45 UTC (approx.) on 18 Jul 2020, Azure DNS experienced a transient resolution issue which in turn impacted connectivity for some other Azure services. Authoritative and other DNS services were not impacted by this issue.

Root Cause: The decommissioning of a legacy (preview) DNS solution inadvertently caused some data streams for the Azure DNS recursive resolver service to become out of sync with the resolver state. This was detected by a sync pipeline, which triggered a reset of the resolver instances to recover from the stale state. Unfortunately, this reset was not done in a staggered fashion and led to multiple resolver instances rebooting at the same time. This in turn led to degradation of the service and caused DNS resolution failures for queries originating from virtual networks. Azure services dependent on the Azure DNS resolver service also saw degradation of service during this time.
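
As an illustration of the staggered approach referenced in the next steps below, this Python sketch (the batch size and delay are invented) resets instances in small batches with a pause between batches, so that enough resolver capacity stays online to keep answering queries.

    import time

    def staggered_reset(instances, batch_size=2, delay_seconds=1):
        # Reset a few instances at a time rather than all at once.
        for i in range(0, len(instances), batch_size):
            batch = instances[i:i + batch_size]
            for instance in batch:
                print(f"resetting {instance}")
            time.sleep(delay_seconds)  # let the batch come back before continuing

    staggered_reset([f"resolver-{n}" for n in range(8)])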

The impact of the incident was observed across multiple Azure regions to varying degrees. While some instances of the service saw no impact, most impacted instances auto-recovered within 10 minutes, though some instances took up to 30 minutes to recover. The DNS resolution issues were fully auto-mitigated across all regions within 54 minutes. During this time, authoritative Azure DNS service was not impacted and DNS queries originating from the internet for zones hosted on Azure DNS were answered successfully.

Mitigation: The issue self-healed as the restarts completed, and services with dependencies on the recursive DNS service would also have seen functionality restored. The DNS service was fully mitigated at 08:45 UTC, but for services with multiple dependencies it may have taken longer for all customers to see full service restoration.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Fixing the orchestration logic in the sync pipeline to help ensure that resolver instances are reset in a staggered, partitioned fashion
  • Improving the resolver startup sequence to help ensure that a resolver instance can be up and running within 10 minutes after a reset

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

7/7

RCA - Virtual Machines - Virtual machine unexpected restarts (Tracking ID 8S8J-9T8)

Summary of Impact: Between 07:24 UTC on 07 Jul 2020 and 21:16 UTC on 17 Jul 2020, a subset of customers using Virtual Machines (VMs) may have experienced intermittent connection failures when trying to access some virtual machines. These virtual machines may have also restarted unexpectedly.

Root Cause: We determined that an ongoing OS update deployment task inadvertently contained a code configuration error that resulted in a number of previously addressed bug fixes being reverted on a subset of clusters. This manifested as a system deadlock on a subset of host nodes that were running VM workloads with heavy disk I/O. As a result, VMs on those nodes rebooted.
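
As a hypothetical illustration of the kind of regression involved, the Python sketch below (the fix identifiers are invented) shows a pre-rollout gate that refuses to ship an update whose payload no longer contains fixes that are already in production.

    # Fixes known to be in production; shipping a payload without them would revert them.
    SHIPPED_FIXES = {"fix-1021", "fix-1088", "fix-1143"}

    def safe_to_roll_out(candidate_fixes):
        missing = SHIPPED_FIXES - set(candidate_fixes)
        if missing:
            raise RuntimeError(f"update would revert fixes: {sorted(missing)}")
        return True

    print(safe_to_roll_out(["fix-1021", "fix-1088", "fix-1143", "fix-1200"]))  # True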

Mitigation: We stopped the ongoing deployment and subsequently developed and rolled out a new deployment task which contained a code fix to detect that a new patch needed to be applied. This fix was deployed to all impacted clusters, thereby mitigating the VM reboots and customer impact.

In parallel to deploying the permanent fix across all regions, we expedited mitigation for some customers by identifying affected nodes that were hosting the customers' VM workloads and reattaching patches to those nodes.

Next Steps: We understand that the time to mitigate for this incident was longer than desired, and we sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help reduce the duration of such incidents. This includes (but is not limited to):

  • Incorporating the missed combination of software versions and system configurations in our validation matrix before deploying similar updates.
  • Improving rollout monitoring and correlation capabilities to detect such faults and halt the rollout sooner.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

7/1

RCA - Azure SQL Database - Japan East (Tracking ID CLCK-LD0)

Summary of Impact: Between 09:24 and 11:15 UTC on 01 Jul 2020, a subset of customers using Azure SQL Database, Azure SQL Data Warehouse/Synapse Analytics, Azure Database for MySQL, Azure Database for PostgreSQL, and Azure Database for MariaDB in Japan East may have experienced service connection failures or possible timeouts. Services utilizing SQL Databases may have also been impacted.

Root Cause: Connections to Azure SQL Database and related data services go through a load balanced set of front-end nodes (Gateways) that provide directory lookup services and reroute the incoming connections to the intended backend nodes hosting the database. For scalability and zone redundancy purposes, there are multiple active SQL Gateway clusters in a region. During this incident, one of the SQL Gateway clusters became unhealthy, having an intermittent impact on login availability. A specific network traffic pattern combined with a networking stack configuration on the SQL Gateway instances triggered an imbalance on the CPU processing of new connection requests. The persistence of such CPU imbalance over a long period of time caused high response latency and increased timeouts on connection requests. The error condition propagated across multiple instances of the SQL Gateway cluster in this region, sometimes causing a service restart.
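
As a minimal illustration of the gateway role described above (the directory contents and names are invented, not the actual service), a gateway looks up which backend node currently hosts the requested database and reroutes the incoming connection to it:

    # Invented directory mapping databases to the backend nodes hosting them.
    DIRECTORY = {
        "orders-db": "backend-node-17",
        "billing-db": "backend-node-42",
    }

    def route_connection(database_name):
        backend = DIRECTORY.get(database_name)
        if backend is None:
            raise LookupError(f"unknown database: {database_name}")
        return backend  # the gateway reroutes the incoming connection here

    print(route_connection("orders-db"))  # backend-node-17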

Mitigation: Multiple SQL Gateway instances became healthy again after the triggered service restarts. On further investigation, we were able to isolate the specific network pattern and configuration setting that caused this incident, and we reconfigured the traffic to prevent a recurrence.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Fix the underlying issue that causes service restart when such a condition occurs.
  • Improve the alerting logic and add identified telemetry to diagnose this kind of issue faster.
  • Activate a newer SQL Gateway cluster in this region with a more efficient networking stack configuration that reduces the chances of hitting a processing imbalance.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: