
Azure status history

March 2020

15/3

RCA - Service Availability Issue in West Central US (Tracking ID SMR4-D90)

Summary of Impact: Between approximately 20:30 UTC on 15 Mar 2020 and 00:30 UTC on 16 Mar 2020, a subset of customers in West Central US may have experienced issues connecting to resources hosted in this region.

Root Cause: At approximately 20:30 UTC on 15 Mar 2020, an ice storm in the region resulted in a loss of power to both the primary and secondary utility power feeds in West Central US, causing the datacenter to transition to generator power. Although the transition was largely successful, a subset of generators failed, and the UPS units supporting the downstream racks carried the load until their batteries drained. At that point, the affected racks lost power entirely, and customer resources hosted on those racks became unavailable.

Mitigation: Site engineers manually intervened to restore power to the affected infrastructure. Concurrently, impacted Azure services implemented their disaster recovery plans to work around the power failure. Subsequently, engineering worked to recover customer resources as affected infrastructure started to become available again.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Reviewing the sequence of operations and associated control systems to remediate identified single point failures
  • Validating electrical hardware components to ensure proper functionality

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

5/3

RCA - Service Management/Authentication Errors - Azure China (Tracking ID SND4-L80)

Summary of Impact: Between 21:03 CST (UTC+8) on 05 Mar 2020 and 16:03 CST on 06 Mar 2020, a subset of customers in the Azure China regions may have encountered failures when performing service management operations on resources hosted in these regions. Customers may also have experienced authentication failures when attempting to access the Azure portal or other Azure resources in the Azure China regions.

Root Cause: When clients connect to an Azure service, they validate the Transport Layer Security (TLS) certificate of that service. This validation requires access to an Online Certificate Status Protocol (OCSP) service and a Certificate Revocation List (CRL).

Azure in China uses an external Certificate Authority (CA). The CA hosts its OCSP and CRL endpoints in remote locations, mirrored across multiple sites via a Content Distribution Network (CDN). During the incident, those endpoints were not reachable from clients in China. The issue originated with a telecom provider and affected both Azure and other customers of the CA. As a result, clients failed certificate validation, which in turn caused failures when connecting to Azure services.
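
Every TLS certificate carries the URLs of the CRL and OCSP endpoints a client may contact during validation, so this dependency is visible from the client side. The following sketch is purely illustrative (the host name is a placeholder, and this is not the Azure validation path itself): it fetches a server certificate, prints its revocation endpoints, and probes them with a short timeout, which is roughly where clients stalled during this incident.

    # Minimal sketch: inspect and probe the revocation endpoints a TLS client
    # depends on. The host name below is a placeholder, not a real endpoint.
    import ssl
    import urllib.request

    from cryptography import x509
    from cryptography.x509.oid import AuthorityInformationAccessOID

    HOST, PORT = "example.chinacloudapi.cn", 443  # hypothetical endpoint

    # Fetch the certificate presented during the TLS handshake.
    pem = ssl.get_server_certificate((HOST, PORT))
    cert = x509.load_pem_x509_certificate(pem.encode())

    # Collect CRL distribution points and OCSP responder URLs from the certificate.
    urls = []
    try:
        cdp = cert.extensions.get_extension_for_class(x509.CRLDistributionPoints)
        for dp in cdp.value:
            for name in dp.full_name or []:
                urls.append(name.value)
    except x509.ExtensionNotFound:
        pass
    try:
        aia = cert.extensions.get_extension_for_class(x509.AuthorityInformationAccess)
        for desc in aia.value:
            if desc.access_method == AuthorityInformationAccessOID.OCSP:
                urls.append(desc.access_location.value)
    except x509.ExtensionNotFound:
        pass

    # Probe each endpoint; an unreachable endpoint is what blocked validation here.
    for url in urls:
        try:
            urllib.request.urlopen(url, timeout=5)
            print("reachable:  ", url)
        except Exception as exc:
            print("unreachable:", url, exc)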

Mitigation: Engineers updated DNS records to re-route clients to alternate OCSP and CRL endpoints that were reachable. Troubleshooting took time because multiple companies were involved in the network path. Some Azure services were able to mitigate sooner by deploying the latest CRL to their servers out-of-band.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):  

  • Work with our telecom partner in China to address the root cause of why the primary CRL and OCSP endpoints were not reachable. 
  • Improve the resiliency of our certificate infrastructure, including hosting the CRL and OCSP endpoints closer to Azure servers. 

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

4/3

Intermittent latency in US Government Cloud - Mitigated (Tracking ID DLS5-TZ0)

Summary of impact: Between 14:00 and 17:15 EST on 04 Mar 2020, a limited subset of customers in US Gov Texas and US Gov Arizona may have experienced intermittent issues or latency connecting to resources hosted in these regions. A smaller subset of customers in other US Gov regions may have also experienced issues connecting to resources.

Preliminary root cause: After a preliminary investigation, engineers determined that a recent maintenance event on network infrastructure in the US Gov Texas region led to a shift in network traffic, causing a single network device to become congested. As this network device was responsible for routing some network traffic for other US Gov regions, some customers outside of US Gov Texas may have encountered brief periods of high latency, though most of the impact would have been to a limited number of customers with resources in US Gov Texas.

Mitigation: Engineers isolated the impacted network device and rerouted network traffic to mitigate the issue.

Next steps: We apologize for the impact to affected customers. Engineers will continue to investigate to establish the full root cause and prevent future occurrences.

3/3

RCA - Issues connecting to resources in East US (Tracking ID 9SS0-VT8)

Summary of Impact: Between 15:30 and 22:00 UTC on 03 March 2020, a subset of customers in East US may have experienced issues connecting to resources in this region. 

Root Cause: A malfunction in building automation control caused temperatures in multiple rooms of a data center in the East US region to spike, impacting Storage, Compute, Networking, and other dependent services. Although the cooling system has N+1 redundancy, the automation failure resulted in a significant reduction in cooling air flow. This triggered a cascade of events in which network devices became unresponsive, VMs shut down, and some storage hardware went offline.

Mitigation: The malfunction in the building automation control was fixed by resetting the controllers for the cooling system. Due to the nature of the automation failure, each cooling unit had to be manually reset. By 16:00 UTC the cooling controller was back online and ambient temperatures and air flow had returned to normal ranges. Engineers then power cycled and restored failed server hardware in groups to restore services in the region. After recovery of the building and network infrastructure, engineers recovered storage hardware and compute VMs that did not recover automatically.  By 22:00 UTC, 99.999% of affected VMs were back up and running. Availability of all storage data was restored.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Review of the building automation control system responsible for this incident, mitigation of any issues if found, and application of the mitigation to other data centers with the same control system
  • Review of mechanical cooling system, mitigation of any issues if found, and application of the mitigation to other data centers with the same cooling system.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

February 2020

28/2

RCA - Networking Issue in South Central US (Tracking ID DMP0-HT8)

Summary of Impact: Between 12:48 and 19:10 UTC on 28 Feb 2020, a subset of customers in South Central US may have encountered failure notifications when performing service management operations on resources hosted in this region. Additionally, some customers using Azure Data Factory V2 may have seen errors when running pipelines and job executions for dataflow activities, as well as errors when attempting to provision the SQL Server Integration Services (SSIS) Integration Runtime for Azure Data Factory SSIS packages.

Root Cause: During a scale-up operation of the service that manages customer network resources, a new capacity configuration was deployed. This configuration conflicted with an existing backend service configuration and caused an increase in the failure rate of requests to the service used for service discovery. The failure rate rose beyond what the Azure platform's retry logic could absorb without customer impact.
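
For context on the retry behaviour mentioned above: platform-level retries can absorb a transient rise in failures only while the failure rate stays moderate; once most attempts fail, retries add load rather than hide the problem. The sketch below is illustrative only and is not the Azure platform's retry implementation; the lookup_endpoint callable is a hypothetical stand-in for a service-discovery call.

    # Illustrative capped exponential backoff with jitter around a flaky call.
    # lookup_endpoint is a hypothetical stand-in for a service-discovery request.
    import random
    import time

    def call_with_retries(lookup_endpoint, attempts=5, base_delay=0.2, max_delay=5.0):
        """Retry a flaky call a bounded number of times before surfacing the error."""
        for attempt in range(attempts):
            try:
                return lookup_endpoint()
            except Exception:
                if attempt == attempts - 1:
                    raise  # retries exhausted: the failure becomes customer-visible
                # Exponential backoff with jitter to avoid synchronized retries.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))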

Mitigation: Resources deployed during the scale-up operation performed with the incompatible configuration were removed from rotation, allowing the automatic recovery of the backend service.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Improving deployment testing and detection to avoid configuration conflicts
  • Implementing a last-known-good configuration fallback in the event of resource deployment errors, instead of requiring a new configuration deployment

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

24/2

503 errors accessing the Azure Portal - Mitigated (Tracking ID 8_K4-TD0)

Summary of impact: Between approximately 19:45 and 22:15 UTC on 24 Feb 2020, a limited subset of customers may have encountered HTTP 503 error codes when attempting to access the Azure Portal. The availability of individual resources (Virtual Machines, Web Apps, databases, etc.) was not impacted.

Preliminary root cause: After a preliminary investigation, engineers determined that an internal automation workflow was generating a high volume of requests to a backend CosmosDB partition on which the Azure Portal relies. As the volume of incoming requests grew, an automated throttling process began on the impacted partition, preventing requests from completing.
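
For background, Cosmos DB signals request-rate throttling with HTTP 429 responses that include a suggested wait time in the x-ms-retry-after-ms header; a well-behaved caller backs off rather than retrying immediately, whereas the runaway workflow described above kept the partition saturated. The sketch below is a generic illustration wrapped around a caller-supplied request function; it is not the portal's actual client code.

    # Generic sketch: honour 429 throttling responses by waiting the suggested
    # interval before retrying. send_request is a placeholder for whatever
    # function issues the request and returns an object with status_code and
    # headers attributes.
    import time

    def send_with_throttling(send_request, max_retries=5):
        for _ in range(max_retries):
            response = send_request()
            if response.status_code != 429:
                return response
            # The service indicates how long to back off, in milliseconds.
            wait_ms = float(response.headers.get("x-ms-retry-after-ms", 1000))
            time.sleep(wait_ms / 1000.0)
        return response  # still throttled after max_retries attempts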

Mitigation: Engineers manually stopped the internal automation workflow, allowing the volume of requests to fall back below the throttling threshold and the partition throttling to stop.

Next steps: We apologize for the impact to affected customers. Engineers will continue to investigate the underlying cause and take steps to prevent future occurrences.

22/2

RCA - MSA Login Failures (Tracking ID CT05-PC0)

Summary of impact: Between 00:00 UTC and 04:41 UTC on 22 Feb 2020, a subset of our customers may have been unable to sign in with their Microsoft accounts (MSA) to access their Azure resources. Users signing in with non-Microsoft accounts would have been unaffected. Users may also have been unable to create new Microsoft accounts.

Root Cause: Engineers determined that a server authentication component reached an operational threshold which resulted in increased failures and the unavailability of the service metadata required for successful user sign-in. The issue occurred during planned server maintenance and was detected by internal monitoring.

Mitigation: Engineers added additional resources to address the server authentication component reaching its threshold. However, complete service restoration required a full recovery of the domain controllers followed by restarts of the service metadata store, which extended the duration of the issue. Service telemetry was monitored to verify that the issue was fully mitigated.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Adding additional capacity and resiliency features for this component to prevent future occurrences.
  • Updating datacenter decommission and capacity planning documentation.
  • Updating internal monitoring to quickly flag capacity issues when they arise.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

21/2

RCA - Networking - Intermittent Connectivity (Tracking ID CTH0-HZ8)

Summary of Impact: Between 22:17 UTC on 21 Feb 2020 and 00:53 UTC on 22 Feb 2020, Azure customers may have experienced network connection failures for IPv4 traffic. During this time:

  • Network connections that originated from or were destined to West Europe (Amsterdam) or Southeast Asia (Singapore) regions may have resulted in a connection failure. Two-thirds of the new connection attempts would have been impacted; previously established connections between these two regions were unaffected.
  • Traffic going to or from the Internet was not impacted, and IPv6 traffic was not impacted.

Root Cause: The incident was caused by a configuration error pushed to routers in the Microsoft Wide Area Network (WAN) in Singapore and Amsterdam. The configuration change sent all SYN packets (used to initiate a connection) to the router CPU for inspection, which caused the router's built-in denial-of-service protection to drop a subset of packets to protect the router CPU.
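
Routers protect their control-plane CPU with a rate limiter (commonly called control-plane policing): packets punted to the CPU beyond a configured rate are dropped. The sketch below is a simplified token-bucket model of that behaviour, with made-up rates and no relation to the affected devices' configuration; it shows why sending every SYN to the CPU causes a fraction of new connections to fail once the offered rate exceeds the policer rate.

    # Simplified token-bucket model of control-plane policing: packets sent to
    # the router CPU faster than the refill rate are dropped. Purely
    # illustrative; the rates below are made up.
    class TokenBucket:
        def __init__(self, rate_pps, burst):
            self.rate = rate_pps      # tokens added per second
            self.capacity = burst     # maximum bucket depth
            self.tokens = burst
            self.last = 0.0

        def allow(self, now):
            # Refill in proportion to elapsed time, then spend one token if possible.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # Offer 10,000 SYNs over one second to a policer that admits 1,000 packets/sec.
    policer = TokenBucket(rate_pps=1000, burst=100)
    dropped = sum(not policer.allow(now=i / 10000) for i in range(10000))
    print("dropped", dropped, "of 10000 SYN packets in one second")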

Mitigation: Engineers mitigated the incorrect configuration by deploying an updated configuration to the two impacted network routers. As the deployment progressed through the routers, partial recovery started at approximately 20:40 UTC, followed by full recovery at 00:53 UTC on 22 Feb 2020.

Next Steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Improving the network router configuration testing process in the lab for its efficacy at large traffic scales.
  • Improving the alerting to more quickly detect configuration errors.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

5/2

Connectivity errors with Azure Portal - Mitigated (Tracking ID LN__-JC0)

Summary of impact: Between 20:25 and 22:10 UTC on 05 Feb 2020, a subset of customers may have experienced difficulties connecting to the Azure Portal.

Preliminary root cause: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. 

Mitigation: Engineers successfully rerouted traffic to another region, allowing connectivity to the Azure Portal to resume, mitigating the issue.

Next Steps: We sincerely apologize for the impact to the affected customers. Stay informed about Azure service issues by creating custom service health alerts; see our video tutorials and how-to documentation for guidance.

January 2020

30/1

RCA - Authentication issues for Microsoft Government Cloud resources (Tracking ID 0KYQ-PP8)

Summary of impact: Between 14:12 and 19:10 EST on 30 Jan 2020, approximately 6% of customers in US Government service regions experienced sign-in failures when attempting to access their resources for Azure Government & Office 365 Government services. Customers outside of US Government service regions were not impacted.

Root cause: Engineers made back-end updates to the application collection representing Microsoft services in Azure Active Directory in production. The application settings of these services are routinely synchronized from Commercial Microsoft Services to their US Government equivalents. A code defect in the incremental (delta) synchronization pipeline left the applications in an incorrect state in the US Government service regions. As a result, these applications failed to get a token from Azure Active Directory in the US Government service regions, and sign-ins for users of those services failed.

Mitigation: Engineers performed a full synchronization of applications from the Commercial to US Government environment. After the full synchronization cycle was complete, the application state was fixed and the subsequent sign-in requests were successful. To prevent the manifestation of the bug in the US Government service regions, engineers disabled the task which synchronizes the applications from the Commercial to US Government environment. While the task is disabled, applications are kept in sync by making manual updates in both environments.
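
As a simplified illustration of the difference between the incremental path that carried the defect and the full synchronization used to recover (the record shapes below are hypothetical and are not the actual Azure Active Directory application schema):

    # Simplified illustration of delta versus full synchronization.
    # The record shapes are hypothetical, not the real Azure AD schema.

    def apply_delta(target, changes):
        """Apply only the changed fields per application. A defect in this path
        (for example, dropping or mistranslating a field) can leave target
        records in an inconsistent, partially updated state."""
        for app_id, changed_fields in changes.items():
            target.setdefault(app_id, {}).update(changed_fields)

    def full_sync(source, target):
        """Copy every application record wholesale, overwriting whatever state
        the delta path left behind; this mirrors the recovery step in this RCA."""
        target.clear()
        for app_id, record in source.items():
            target[app_id] = dict(record)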

Next Steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Fixing the bug in the synchronization task before re-enabling it.
  • Improving telemetry to more quickly detect and mitigate bugs before they reach the production environment.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

25/1

RCA - SQL Database and dependent services - Service Availability Issues (Tracking ID 5TYQ-DC0)

Summary of Impact: Between 22:00 EST on 24 Jan 2020 and 08:15 EST on 25 Jan 2020, customers in the Azure Government regions may have experienced failures when trying to access Azure SQL Database and Data Warehouse resources or dependent services. Specifically, new connections to databases in these regions may have resulted in an error or timeout, but already established pooled connections continued to work. Some manageability operations such as failovers to geo-redundant regions were also impacted.

Root cause: Connections to Azure SQL Database and Azure SQL Data Warehouse go through a set of load-balanced front-end nodes called gateways. Engineers determined that a recent maintenance activity did not complete successfully, which in turn caused the gateways to hold an incorrect certificate configuration that effectively blocked connections to the associated SQL resources.
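
The impact pattern, in which already-established pooled connections kept working while new connections failed, follows from how connection pooling behaves: the handshake where the bad certificate configuration surfaced happens only when a new physical connection is opened. A minimal sketch of that reuse pattern follows; the open_connection callable is a hypothetical stand-in for a real SQL driver connect call.

    # Minimal connection-pool sketch: reused connections skip the handshake
    # (so they were unaffected), while brand new connections hit the gateway's
    # certificate problem. open_connection is a hypothetical stand-in.
    class ConnectionPool:
        def __init__(self, open_connection):
            self._open = open_connection
            self._idle = []

        def acquire(self):
            if self._idle:
                return self._idle.pop()   # reuse: no new handshake performed
            return self._open()           # new connection: handshake can fail here

        def release(self, conn):
            self._idle.append(conn)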

Mitigation: Engineers mitigated the incorrect configuration by deploying an update to the impacted regions. As the deployment progressed through the regions, partial recovery started at approximately 03:30 EST, followed by full recovery at 08:15 EST on 25 Jan 2020.

Next Steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Improve the alerting to help detect incorrect configurations more quickly.
  • Improve our preventative processes and resiliency of the Azure SQL Database and Azure SQL Data Warehouse to help avoid these types of service disruptions in the future.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

24/1

RCA - Cross-Region Connectivity Issue - North America (Tracking ID 4_YR-FC8)

Summary of Impact: Between 21:03 and 22:28 UTC on 24 Jan 2020, a programming error caused network congestion, dropping around 10% of traffic and causing high latency for services traversing the network between data centers. A software update on SWAN (Software enabled Wide Area Network), Microsoft's backbone network that connects large data center regions together, caused the router forwarding tables to become mis-programmed.

Root Cause: From 21:03 UTC, the SWAN network was unable to generate and program a working Forwarding Information Base (FIB) into the SWAN routers due to a bad configuration push. Incompatible FIB pushes failed on a subset of the routers and resulted in FIB rollbacks. The incompatible FIB push also caused all traffic engineering tunnels on the routers to go down, forcing traffic onto the shortest paths and resulting in congestion drops of around 10% of traffic in the network.

Mitigation: When the incompatible FIB was rolled back to the last known good state, traffic engineering worked again and drops subsided. However, the non-working FIB continued to be generated, and its install/rollback cycle kept causing drops until the configuration change was rolled back. Rollback of the configuration change was slow due to safeguards built into the system. The configuration problem was fully rolled back at 22:28 UTC, completely resolving the traffic drops.
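
As a simplified model of the install-or-rollback behaviour described above (purely illustrative, not the actual SWAN implementation): a candidate forwarding table is validated and installed, and any failure restores the last known good table, which is the rollback loop that kept repeating until the configuration change itself was reverted.

    # Simplified model of install-or-rollback for a forwarding table (FIB).
    # Purely illustrative; the "router table" is just an in-memory dict of
    # prefix -> next hop, and the validation check is a placeholder.

    def validate(fib):
        if not fib:
            raise ValueError("empty FIB")  # stand-in for compatibility checks

    def install_fib(router_table, candidate_fib, last_known_good):
        """Program the candidate FIB; on any failure restore the last known good
        table so traffic keeps flowing (over shortest paths, possibly congested)."""
        try:
            validate(candidate_fib)
            router_table.clear()
            router_table.update(candidate_fib)
            return candidate_fib          # candidate becomes the new last known good
        except Exception:
            router_table.clear()
            router_table.update(last_known_good)  # rollback; drops subside
            return last_known_good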

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Correcting the software defect that prevented router programming
  • Improving testing of SWAN software deployments with the router firmware
  • Improving SWAN rollback procedures to make the process faster and less error-prone

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

22/1

Service Management Failures for Application Gateway, Azure Bastion, and Azure Firewall (Tracking ID HT3R-990)

Summary of Impact: Between 05:55 UTC on 22 Jan and 00:56 UTC on 23 Jan 2020, a subset of customers using Application Gateway/WAF V2 SKU, Azure Firewall, and Azure Bastion services may have received failure notifications when performing service management operations—such as create/scale, update, and delete.

Preliminary Root Cause: Engineers determined that a recent deployment task impacted service-to-service communication which resulted in failure of management requests.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next Steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; see our video tutorials and how-to documentation for guidance.

20/1

RCA - Azure connectivity issues (Tracking ID 0TSQ-TT0)

Summary of impact: Between 15:01 and 16:30 UTC on 20 Jan 2020, a subset of customers in Sweden, Finland, Norway, and Russia may have experienced increased latency or difficulties connecting to Azure services. Impact to customers was based on the location of the customer. Customers in other geographic regions would have continued to be able to access resources during this time.

Root Cause: Microsoft operates numerous "Edge Sites" on its network that greatly enhance connectivity for Microsoft users in the geographic regions where they exist. Starting at 15:01 UTC on 20 Jan 2020, a multiple fiber-cut event isolated the Stockholm and Helsinki Azure Edge sites, impacting internet traffic routed via these sites to the wider Microsoft network. For resiliency, these Edge sites typically have two diverse paths of connectivity, but in this case both paths went offline at the same time. Further investigation determined that the two diverse paths collapsed into the same conduit for a section outside Stockholm, which is where the cut occurred. Traffic to and from the internet toward Microsoft automatically shifted to other Edge sites in Europe, but the change in path and increase in traffic caused longer latency for some customers.

Mitigation: Microsoft properties in the impacted regions were asked to migrate to other locations to minimize impact. Once this migration completed, the impact to customers was mitigated. In addition, the Microsoft WAN team procured capacity from a different fiber path to bring the impacted sites back online. All sites were fully operational by 2020-01-21 21:18 UTC.

Next Steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Improve the speed at which traffic can be routed to other regions to mitigate the impact.
  • Harden the fiber infrastructure in Stockholm and Helsinki to make it more resilient to multiple fiber cuts.
  • Review of Edge Site connectivity architecture to ensure that there are no other circuit-convergences that inadvertently cause a single point of failure.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.