
Azure status history


January 2020

1/20

Azure connectivity issues - Mitigated (Tracking ID 0TSQ-TT0)

Summary of impact: Between 15:00 and 16:30 UTC on 20 Jan 2020, a subset of customers in Sweden, Norway and Russia may have experienced difficulties connecting to Azure services. 
 
Preliminary root cause: Multiple fiber cuts affected network traffic routing for the Nordic region resulting in intermittent connectivity issues during the impacted time.
 
Mitigation: Engineers failed over traffic to alternate sites to mitigate.
 
Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. A full RCA will be provided within approximately 72 hours.

December 2019

12/27

Unable to create new resources in Azure Portal (Tracking ID 4_6S-NC8)

Summary of Impact: Between 15:18 and 16:26 UTC on 27 Dec 2019, customers using the Azure Portal may have received failure notifications or 404 errors when attempting to create new resources through the Azure Portal. Programmatic deployments, such as CLI, PowerShell, and template deployments, would have been successful.
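
As a reference point only, programmatic deployments remained available during this window. The following is a minimal sketch of that kind of workaround using the Python azure-identity and azure-mgmt-resource packages; the subscription ID, resource group name, and location are placeholders, not values from this incident:

    # Sketch: create a resource group through ARM directly, bypassing the Portal UI.
    # Assumes azure-identity and azure-mgmt-resource are installed and that the
    # environment is configured for DefaultAzureCredential (e.g. via az login).
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    subscription_id = "<subscription-id>"  # placeholder
    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

    rg = client.resource_groups.create_or_update(
        "example-rg", {"location": "westeurope"}
    )
    print(rg.name, rg.location)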

Preliminary Root Cause: Engineers determined that a recent deployment to Azure Portal's Marketplace extension caused issues for users creating new resources through the Azure Portal.

Mitigation: Engineers rolled back the recent deployment task to mitigate the issue.

Next Steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

12/12

RCA - Connectivity issue for Azure Resources in North America (Tracking ID HKZT-N88)

Summary of Impact: Between 09:45 and 16:26 UTC on 12 Dec 2019, a subset of customers in North America may have experienced degraded performance, network drops, or timeouts when accessing Azure resources. Customers may also have experienced downstream impact to dependent Azure services.

Root cause: Engineers identified a routing protocol metric change within an ISP backbone network, which resulted in network connectivity degradation for a limited subset of Azure customers. Instead of sending traffic destined for Microsoft to the closest interconnection point, the ISP was sending traffic from across US regions to an interconnection point in California, saturating some of the links in California.
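
To illustrate the failure mode only (the metrics and site names below are hypothetical, not the ISP's actual configuration), a toy model of metric-based path selection shows how lowering the metric toward one interconnection point pulls traffic from every region onto that single set of links:

    # Toy illustration of metric-based interconnection point selection.
    def pick_interconnect(metrics: dict) -> str:
        """Traffic follows the interconnection point with the lowest metric."""
        return min(metrics, key=metrics.get)

    regions = ["us-east", "us-central", "us-west"]

    # Before the change: each region prefers its nearby interconnection point.
    before = {r: pick_interconnect({"nearby-" + r: 10, "california": 50}) for r in regions}

    # After the metric change: California advertises a lower metric than every
    # local interconnection point, so traffic from all regions converges there.
    after = {r: pick_interconnect({"nearby-" + r: 10, "california": 5}) for r in regions}

    print(before)  # every region picks its nearby interconnection point
    print(after)   # every region picks 'california', saturating those links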

Mitigation: Engineers brought down the affected peerings between Azure and the ISP and failed over network traffic in order to mitigate the issue.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Work with the ISP to streamline the engagement process for service incidents to help reduce the time to repair issues in the future.
  • Fine tune Azure monitoring and telemetry to more quickly detect and mitigate events of this nature.
  • Create automated remediation of interconnection points suffering from network congestion.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

12/11

RCA - Azure CDN and Azure Kubernetes Service - Service Availability Issue (Tracking ID LTPP-R98)

Summary of Impact: Between 17:24 and 18:48 UTC on 11 Dec 2019, clients of customers attempting to reach Azure CDN from Verizon endpoints would intermittently receive HTTP (5XX) errors or connection failures instead of expected content.  

Root Cause: Azure CDN providers use staged deployment processes to deploy configuration changes across their global infrastructure Points of Presence (PoPs). A recent change to Verizon's deployment pipeline introduced a latent bug that caused some deployment service health notifications to report incorrect health status. While Verizon was performing maintenance to resolve a delay in a separate configuration deployment, an improperly encoded configuration file was deployed to production. Because the aforementioned bug delayed their service health notifications, the regular safety features in their deployment process did not trigger and allowed the improper configuration to reach global PoPs. This configuration caused service instability across their global PoPs and resulted in customers receiving HTTP (5XX) errors or connection errors when attempting to reach Azure CDN from Verizon endpoints. Verizon's monitoring caught this issue immediately, however, and teams were engaged to resolve it. Once Verizon mitigated the issue, Microsoft services were restored to a healthy state.
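
A generic safeguard for this class of failure (a sketch under assumptions, not Verizon's actual pipeline; the JSON format is illustrative) is to confirm that a configuration artifact decodes and parses cleanly before it is promoted to any PoP:

    # Sketch of a pre-deployment configuration gate.
    # Rejects configs that are not valid UTF-8 or not well-formed before promotion.
    import json

    def validate_config(path: str) -> dict:
        with open(path, "rb") as f:
            raw = f.read()
        try:
            text = raw.decode("utf-8")   # catch an improperly encoded file
            return json.loads(text)      # catch a malformed configuration body
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            raise ValueError(f"refusing to deploy {path}: {exc}") from exc

    # Only a configuration that passes this gate would be handed to the rollout pipeline.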

Mitigation: After determining the root cause to be the improperly encoded configuration file, a new hotfix was developed and deployed globally, after which Verizon's global infrastructure began recovering.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps with our partners to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):  

  • Review all service change management processes and practices to ensure that all health check mechanisms are monitored and interact correctly with deployment staging. 
  • Add additional validation to maintenance and deployment practices to ensure all configuration deployment paths result in valid configuration.  
  • Review all CDN monitoring and alerting services to ensure that all CDN infrastructure availability alerting escalates quickly to Microsoft engineering teams.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey 

November 2019

11/20

RCA - Multiple Services - Downstream impact from Azure Front Door (Tracking ID HLMF-R88)

Summary of Impact: Between 00:56 and 03:40 UTC on 20 Nov 2019, multiple services across Microsoft including Azure, Microsoft 365 and Microsoft Power Platform leveraging the Azure Front Door (AFD) service experienced availability issues resulting from high request failure rates. During this event, some impacted services were able to divert traffic away from the AFD service to mitigate impact for them.

One of the impacted services was the Azure Status Page. Engineering executed the failover plan to the secondary hosting location, but this resulted in a delay in posting status communication updates. Communications were successfully delivered via Azure Service Health, available within the Azure management portal.

Root Cause: Azure Front Door services provide network edge caching and web acceleration services to many of Microsoft’s SaaS services, in addition to the optimization offering direct to Azure customers. A routine, periodic deployment was released through our validation pipeline that, when combined with specific traffic patterns, caused service-wide, intermittent HTTP request failures for all services utilizing the AFD service.

Investigation into the faulting behavior revealed that the combination of a sequenced code deployment, a configuration deployment, and specific traffic patterns triggered a dormant code bug that caused the platform to crash. These deployed changes were tested before being shipped to the broader cloud; however, the specific traffic pattern was not observed during the test and pilot phases.

Azure Front Door deploys to over one hundred points of presence (PoPs) around the globe and deploys customer configuration globally to each of these PoPs, enabling customers to quickly make changes to their service. This is done to ensure customers are able to promptly remove regional components that are out of specification and to update configuration for network security services to mitigate attacks. Through a staged deployment, these changes passed validation and service health checks. Having passed these validations, propagation to global PoPs was quick, by design, to meet the aforementioned service objectives. After propagation, the fault-triggering behavior was instigated only by specific traffic patterns that occurred after the deployment had completed.

This resulted in impacted customers experiencing a high, but intermittent, rate of web request failures globally while accessing shared services across the Azure and Office platforms.

Mitigation: Global monitoring detected the issue and engaged engineers at 01:04 UTC. Engineers confirmed that the issue was primarily triggered by the configuration deployment and identified a fix by 01:27 UTC. Engineers immediately initiated deployment rollback procedures to return the service to a healthy state; the rollback progressed quickly and completed across all global platforms by 02:40 UTC. Many of the impacted Microsoft SaaS services were able to fail over away from the AFD service, providing mitigation to their customers while the underlying AFD mitigation was deployed.
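
The rollback pattern described above, and the automated last-known-good recovery mentioned in the next steps below, can be sketched as follows; the threshold, release names, and error rate are hypothetical placeholders, not AFD's actual signals:

    # Sketch of health-gated rollback to a last-known-good (LKG) release.
    def serving_version(error_rate: float, current: str, lkg: str,
                        threshold: float = 0.05) -> str:
        """Return the release that should be serving traffic."""
        if error_rate > threshold:
            # Request failure rate after the rollout exceeds the acceptable
            # ceiling: revert to the last release known to be healthy.
            return lkg
        return current

    print(serving_version(0.18, current="release-2019-11-20", lkg="release-2019-11-19"))
    # -> "release-2019-11-19": rollback is triggered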

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Verify that the fix deployed globally to AFD, during mitigation, is a stable release and will remain in place until all internal reviews of this issue have been completed.
  • Review all service change management processes and practices to help ensure appropriate deployment methods are used.
  • Review the change validation process to identify and implement the changes required to increase test traffic diversity, improving coverage of the code paths that can trigger such faults.
  • Prioritize deployment of a component independent automated recovery process so impacted deployments, like that experienced during this incident, are automatically returned to the last-known-good (LKG) state at a component layer, quickly and without manual intervention, to help reduce time to mitigate and scope of impact.
  • Investigate and remediate the delay experienced with publishing communications to the Azure Status Page during the impact window.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

11/7

RCA - Storage - West Europe (Tracking ID 1_TX-5D8)

Summary of Impact: Between 02:40 and 10:55 UTC on 07 Nov 2019, a subset of customers using Storage in West Europe experienced service availability issues. In addition, resources with dependencies on the impacted storage scale units may have experienced downstream impact in the form of availability issues, connection failures, or high latency.

Root Cause: Every Azure region has multiple storage scale units that serve customer traffic. We distribute and balance load across the different scale units and add new scale units as needed. The automated load-balancing operations occur in the background to keep all scale units running at healthy utilization levels and are designed to have no impact on customer-facing operations. During this incident, we had just enabled three storage scale units to balance load between them, to keep up with changing utilization on those scale units. A bug in this process caused backend roles to crash across the scale units participating in the load-balancing operations, making them unhealthy. This also impacted services dependent on Storage in the region.
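
For illustration only (the utilization figures are hypothetical and the real system weighs many more signals), the background load-balancing decision can be sketched as pairing over-utilized scale units with under-utilized ones:

    # Toy sketch of planning cross-scale-unit load-balancing moves.
    def plan_moves(utilization: dict, target: float = 0.70) -> list:
        """Pair scale units above the target utilization with those below it."""
        hot = [u for u, v in utilization.items() if v > target]
        cool = [u for u, v in utilization.items() if v <= target]
        return list(zip(hot, cool))

    print(plan_moves({"scale-unit-1": 0.85, "scale-unit-2": 0.60, "scale-unit-3": 0.55}))
    # [('scale-unit-1', 'scale-unit-2')]  -> shift load from unit 1 toward unit 2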

Mitigation: Engineers mitigated the impact to all but one scale unit by deploying a platform hotfix. Mitigation to the remaining scale unit was delayed due to compatibility issues identified when applying the fix but has since been completed.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • The fix has already been deployed to the impacted scale units in the region. We are currently deploying the fix globally to our fleet.
  • We have performed cross-scale-unit load-balancing operations numerous times before without any adverse effect. In the wake of this incident, we are again reviewing our procedures, tooling, and service for such load-balancing operations. We have paused further load-balancing actions in this region until this review is completed.
  • We are rigorously reviewing all our background management processes and deployments to prevent any further impact to customers in this region.
  • We are reviewing gaps in our validation procedures so that such issues are caught in our validation environment.


Provide feedback: Please help us improve the Azure customer communications experience by taking our survey


11/5

RCA - Azure Databricks - Unable to connect to workspaces (Tracking ID KM9W-1T0)

Summary of Impact: Between 14:19 and 15:34 UTC on 05 Nov 2019, a subset of customers using Azure Databricks may have intermittently experienced the error "This URL is temporarily unavailable. Please try again later." when connecting to Databricks workspaces.

Root Cause: Engineers determined that a backend service responsible for user management and access control experienced a service restart on one of its three service replicas. This restart was caused by high CPU usage attributed to other unrelated services sharing the same node. Once one of the replicas restarted, all pending requests were distributed to the remaining two replicas. Eventually, the remaining replicas were throttled by the overflow of requests and restarted.
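
The cascade can be illustrated with hypothetical numbers (the actual request rates and per-replica capacity were not published):

    # Hypothetical illustration of the overload cascade described above.
    total_load = 240            # requests/second arriving at the backend service
    capacity_per_replica = 100  # requests/second one replica can absorb

    replicas = 3
    print(total_load / replicas)   # 80.0  -> each replica is within capacity

    replicas = 2                   # one replica restarts; its load is redistributed
    print(total_load / replicas)   # 120.0 -> the survivors are throttled and restart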

Mitigation: Azure Databricks engineering mitigated the issue by reducing the load on the replicas by temporarily disabling internal jobs and services. In addition, the replicas were migrated to nodes that had additional CPU availability.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Increasing CPU and memory resource allocations for all service replicas based on additional load tests
  • Improving on-call detection and remediation time
  • Increasing the number of replicas to provide additional redundancy
  • Providing isolation between service replica requests made from users and internal services

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

11/5

RCA - Virtual Machines and Storage - West US 2 (Tracking ID LVGX-DP0)

Summary of Impact: Between 02:33 and 05:30 UTC on 05 Nov 2019, a subset of customers in West US 2 may have experienced service availability issues for resources hosted in this region due to an underlying Storage issue.

Root Cause: The Stream layer in an Azure Storage scale unit includes a set of metadata nodes and data nodes, which together store the data on disk and are responsible for distributing and replicating the data to ensure high availability and durability. In this incident, a rare sequence of management operations occurred over several weeks, creating an unexpected state on the metadata nodes in the scale unit. A software bug in the code that tracks the progress of those operations caused the processes on the metadata nodes to crash and restart. This caused customer requests for data located on that scale unit to fail. Once the metadata processes restarted, the storage services on the scale unit only partially recovered until the mitigation was applied.

Mitigation: Once the software bug was identified, a fix was deployed (at approximately 04:30 UTC) to correctly handle the unexpected state, and availability was fully restored for all affected storage accounts. Dependent services (e.g. Virtual Machines) may have taken additional time to recover after storage services were restored.

Resilience options: While we always strive to prevent downtime for customers and to mitigate quickly should impact occur, customers who are sensitive to any service interruption are also recommended to review the latest Azure guidance regarding resilience architecture, which can help prevent impact in scenarios such as this.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • We are closing a gap in testing of the relevant operations that triggered the bug.
  • We have identified improvements to enable metadata processes to restart faster after a crash. These will roll out in a future release and will reduce mitigation time for such incidents.
  • We have already deployed the fix to handle the unexpected state on the impacted storage scale unit.
  • We are rolling out the fix to all production storage scale units in accordance with safe deployment practices. In the interim, no management operations of the type required to trigger the bug will be performed.


Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

11/4

RCA - Azure SQL Database - West Europe (Tracking ID 1S1W-F88)

Summary of Impact: Between 05:00 and 11:30 UTC on 04 Nov 2019, a subset of customers using SQL Database, MySQL Database, PostgreSQL Database, and/or SQL Data Warehouse in West Europe may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in this region. Existing databases would not have been impacted by this issue.

Root Cause: All Azure SQL service management requests (create, update, delete, etc.) are serviced through the SQL Control Plane infrastructure. Engineers determined that during this issue, the control plane service became unhealthy after receiving a significant increase in internally generated operations and reaching an operational threshold. This left service management requests unable to complete, which in turn resulted in timeouts and throttling. Subsequent investigation determined that the issue was due to service requests from an internal Azure group that triggered a bug which caused an excessive number of internally generated operations.

Mitigation: Engineers performed a configuration change to reduce the number of internally generated operations produced by the bug. Engineers then de-prioritized all requests from the internal Azure group so that they completed after standard workloads, allowing normal control plane functionality to resume and the internal requests to complete over an extended timeframe.
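
The de-prioritization approach can be sketched with a simple priority queue; the request names and priorities below are illustrative, not the actual SQL control plane implementation:

    # Sketch: customer-facing control plane requests drain ahead of internal batch work.
    import heapq

    CUSTOMER, INTERNAL = 0, 1   # lower number = served first

    queue = []
    heapq.heappush(queue, (INTERNAL, "internal: bulk operation from internal group"))
    heapq.heappush(queue, (CUSTOMER, "customer: create database"))
    heapq.heappush(queue, (CUSTOMER, "customer: delete database"))

    while queue:
        _, request = heapq.heappop(queue)
        print(request)
    # Customer requests print first; the internal work completes afterwards.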

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Review of thresholds and throttling within the SQL control plane to ensure that sufficient resources remain available to service standard control plane functionality at all times
  • Review of monitoring and alerts to ensure engineers become aware of developing resource exhaustion issues before they become customer impacting
  • Implement a fix to the algorithm that generated the excessive internal operations.
  • Review of auto-mitigation options for control plane resource issues to minimize the potential for customer impact from issues of this nature.

October 2019

10/30

RCA - Storage - West Europe (Tracking ID GKXC-PP8)

Summary of Impact: Between 10:17 and 13:52 UTC on 30 Oct 2019, a subset of customers using Storage, or Azure services with Storage dependencies, in West Europe may have experienced difficulties connecting to resources hosted in this region.

Root Cause: An Azure Storage scale unit internally has a set of data nodes and metadata nodes. In normal operation, the metadata nodes are all able to process query requests in parallel. In this incident, a unique sequence of circumstances triggered a previously unknown software bug that impacted a single storage scale unit. The bug caused the scale unit to enter a state where all metadata query requests were directed to a single server, leading to long queue build-ups on that server, high latency, and protective throttling intended to prevent complete failure of the metadata software role. This in turn affected other software roles within the storage scale unit, and ultimately customer requests for resources located on that scale unit.

Mitigation: Once the issue was understood, a mitigation was applied to correct the system state and allow all metadata nodes to process query requests again. This reduced latency and throttling and enabled recovery. Engineers monitored the scale unit for an extended period to ensure full mitigation had occurred.

Resilience options: While we always strive to prevent downtime for customers, and to mitigate quickly should impact occur, customers who are sensitive to any service interruption are also recommended to review the latest Azure guidance regarding resilience architecture, which can help prevent impact in scenarios such as this.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Adding additional specific high-severity alerting for this mode of degraded metadata software role state.
  • Implementing and deploying automated mitigation for this class of issue.
  • Identifying and deploying the fix for the underlying bug that caused the metadata service to fall back to a single server. While we understand this extremely uncommon failure mode and have a rapid mitigation for it, we are continuing to thoroughly investigate all possible conditions under which the state could be triggered.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

10/29

RCA - Virtual Machines and Azure Databricks (Tracking ID 1V2C-NZ0)

Summary of Impact: Between 23:40 UTC on 29 Oct 2019 and 06:15 UTC on 30 Oct 2019, a limited subset of customers using Virtual Machines and/or Azure Databricks may have received failure notifications or experienced increased latency when performing service management operations - such as create, update, delete - for these resources.

Root Cause: An incorrect network configuration change was rolled out via an automated workflow which unintentionally targeted multiple production regions. The configuration change took effect over a period of hours, as Software Load Balancer (SLB) instances in each region were automatically restarted. Eventually, multiple Azure backend services experienced failures while attempting to establish outbound network connectivity.
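
The kind of strict validation called out in the next steps below can be sketched as a scope check that blocks a change whose targets fall outside the single region it declares; the change format and resource names here are hypothetical:

    # Sketch of a region-scope gate for automated network configuration changes.
    def validate_scope(change: dict) -> None:
        declared = change["declared_region"]
        stray = [t for t in change["targets"] if not t.startswith(declared)]
        if stray:
            raise ValueError(f"change targets resources outside {declared}: {stray}")

    try:
        validate_scope({
            "declared_region": "westus2",
            "targets": ["westus2-slb-01", "eastus-slb-07"],  # second target is out of scope
        })
    except ValueError as exc:
        print(exc)  # the change is blocked before the automated rollout proceeds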

Mitigation: Engineers rolled back the configuration change and manually restarted the affected SLB services to mitigate the issue.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Removing this network configuration change step in the automated workflow
  • Adding strict validation for similar kinds of network configuration changes

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

10/25

Azure Databricks and Data Factory v2 - Workspace and Data Flow Errors (Tracking ID KNQ8-BP0)

Summary of impact: Between approximately 11:00 and 14:40 UTC on 25 Oct 2019, a subset of customers using Azure Databricks may have received 'No Web App available' error notifications when logging into a Databricks workspace. Related API calls may also have failed to return a response. Additionally, a very limited subset of customers using Data Factory v2 may have received failure notifications for Data Flow jobs.

Preliminary root cause: Engineers determined that a backend database used to process workspace access requests became unhealthy, causing requests to fail. As this database supports the control plane for the East US, East US 2, Central US, and North Central US regions, only customers in these regions would have seen impact. Additionally, a small number of Data Factory v2 customers in these regions would have seen downstream impact from this issue.

Mitigation: Engineers redeployed the affected backend database to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.