
Azure status history


November 2019

11/7

RCA - Storage - West Europe

Summary of Impact: Between 02:40 and 10:55 UTC on 07 Nov 2019, a subset of customers using Storage in West Europe experienced service availability issues. In addition, resources with dependencies on the impacted storage scale units may have experienced downstream impact in the form of availability issues, connection failures, or high latency.

Root Cause: Every Azure region has multiple storage scale units that serve customer traffic. We distribute and balance load across the different scale units and add new scale units as needed. These automated load-balancing operations run in the background, are intended to keep all scale units at healthy utilization levels, and are designed to have no impact on customer-facing operations. During this incident, we had just enabled load balancing across three storage scale units to keep up with changing utilization. A bug in this process caused backend roles to crash across the scale units participating in the load-balancing operations, making them unhealthy. This also impacted services dependent on Storage in the region.
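For illustration only (the real load balancer is internal to Azure Storage, and the scale-unit names, capacities, and spread threshold below are invented), a minimal sketch of utilization-driven rebalancing across scale units might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class ScaleUnit:
    name: str
    capacity: float                                   # arbitrary capacity units
    partitions: dict = field(default_factory=dict)    # partition id -> load

    @property
    def utilization(self) -> float:
        return sum(self.partitions.values()) / self.capacity

def rebalance(units, target_spread=0.05):
    """Move partitions from the hottest to the coolest scale unit until the
    utilization gap is within target_spread, refusing moves that overshoot."""
    while True:
        hottest = max(units, key=lambda u: u.utilization)
        coolest = min(units, key=lambda u: u.utilization)
        if hottest.utilization - coolest.utilization <= target_spread:
            return
        pid, load = min(hottest.partitions.items(), key=lambda kv: kv[1])
        if coolest.utilization + load / coolest.capacity >= hottest.utilization:
            return                                    # moving would overshoot; stop
        hottest.partitions.pop(pid)
        coolest.partitions[pid] = load

units = [
    ScaleUnit("su-1", 100, {"p1": 40, "p2": 35}),
    ScaleUnit("su-2", 100, {"p3": 20}),
    ScaleUnit("su-3", 100, {"p4": 10}),
]
rebalance(units)
print({u.name: round(u.utilization, 2) for u in units})
```

A defect anywhere in logic like this (for example, in how partition state is handed between units) can surface on every participating scale unit at once, consistent with all three units involved in this incident becoming unhealthy together.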

Mitigation: Engineers mitigated the impact to all but one scale unit by deploying a platform hotfix. Mitigation to the remaining scale unit was delayed due to compatibility issues identified when applying the fix but has since been completed.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • The fix has already been deployed to the impacted scale units in the region. We are currently deploying the fix globally to our fleet.
  • We have performed cross-scale-unit load-balancing operations numerous times before without any adverse effect. In the wake of this incident, we are again reviewing our procedures, tooling, and service for such load-balancing operations. We have paused further load-balancing actions in this region until this review is completed.
  • We are rigorously reviewing all our background management processes and deployments to prevent any further impact to customers in this region.
  • We are reviewing gaps in our validation procedures so that issues like this are caught in our validation environment.


Provide feedback: Please help us improve the Azure customer communications experience by taking our survey: 


11/5

RCA - Azure Databricks - Unable to connect to workspaces

Summary of Impact

Between 14:19 and 15:34 UTC on 05 Nov 2019, a subset of customers using Azure Databricks may have intermittently experienced the error "This URL is temporarily unavailable. Please try again later." when connecting to Databricks workspaces.

Root Cause and Mitigation

Root Cause: Engineers determined that a backend service responsible for user management and access control experienced a service restart on one of its three service replicas. This restart was caused by high CPU usage attributed to other unrelated services sharing the same node. Once one of the replicas restarted, all pending requests were distributed to the remaining two replicas. Eventually, the remaining replicas were throttled by the overflow of requests and restarted.
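As a rough illustration of this cascade (the request rates and per-replica capacity below are invented, not Azure Databricks' actual figures), a few lines of arithmetic show how losing one of three replicas can push the survivors past their limit:

```python
def cascade_after_restart(load_per_replica: float, capacity: float, replicas: int):
    """Toy model: one replica restarts (e.g. CPU pressure from a noisy
    neighbour); its pending requests spill onto the survivors, and any
    survivor pushed past capacity is throttled and restarts in turn."""
    total = load_per_replica * replicas
    healthy = replicas - 1                     # the first restart has happened
    while healthy > 0:
        per_replica = total / healthy
        print(f"{healthy} healthy, {per_replica:.0f} req/s each "
              f"(capacity {capacity:.0f})")
        if per_replica <= capacity:
            print("survivors absorb the load")
            return
        healthy -= 1                           # another replica gives out
    print("no healthy replicas left -> intermittent errors for users")

# 3 replicas at 70 req/s each with 100 req/s of headroom apiece: losing one
# pushes the remaining two to 105 req/s and the cascade begins.
cascade_after_restart(load_per_replica=70, capacity=100, replicas=3)
```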

Mitigation: Azure Databricks engineering mitigated the issue by reducing the load on the replicas by temporarily disabling internal jobs and services. In addition, the replicas were migrated to nodes that had additional CPU availability.

Next Steps

We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Increasing CPU and memory resource allocations for all service replicas based on additional load tests
  • Improving on-call detection and remediation time
  • Increasing the number of replicas to provide additional redundancy
  • Providing isolation between service replica requests made from users and internal services

Provide Feedback

Please help us improve the Azure customer communications experience by taking our survey

11/5

RCA - Virtual Machines and Storage - West US 2

Summary of Impact: Between 02:33 and 05:30 UTC on 05 Nov 2019, a subset of customers in West US 2 may have experienced service availability issues for resources hosted in this region due to an underlying Storage issue.

Root Cause: The Stream layer in an Azure Storage scale unit includes a set of metadata nodes and data nodes, which together store the data on disk and are in charge of distributing and replicating the data to ensure high availability & durability. In this incident, a rare sequence of management operations occurred over several weeks creating an unexpected state on the metadata nodes in the scale unit. A software bug in the code that tracks the progress of those operations caused the processes on the metadata nodes to crash and restart. This caused customer requests for data located on that scale unit to fail. Once the metadata processes restarted, the storage services on the scale unit partially recovered until the mitigation was applied.

Mitigation: Once the software bug was identified, a fix was deployed (at approximately 04:30 UTC) to correctly handle the unexpected state, and availability was fully restored for all affected storage accounts. Dependent services (e.g. Virtual Machines) may have taken additional time to recover after storage services were restored.
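Purely as an illustration of this class of bug (the state names and transition table below are invented, not Azure Storage's internal code), progress-tracking logic that enumerates only the states it expects can bring down the whole process when a record arrives in an unanticipated state, whereas handling that state explicitly keeps the node up:

```python
from enum import Enum, auto

class OpState(Enum):
    PENDING = auto()
    RUNNING = auto()
    COMPLETE = auto()
    ORPHANED = auto()    # stands in for the rare, unanticipated state

TRANSITIONS = {OpState.PENDING: OpState.RUNNING,
               OpState.RUNNING: OpState.COMPLETE,
               OpState.COMPLETE: OpState.COMPLETE}

def advance_buggy(state: OpState) -> OpState:
    # Before the fix: only expected states are handled, so a record in an
    # unanticipated state raises and crashes the tracking process.
    return TRANSITIONS[state]

def advance_fixed(state: OpState) -> OpState:
    # After the fix: the unexpected state is handled explicitly (here it is
    # simply left in place for repair) instead of bringing down the node.
    return TRANSITIONS.get(state, state)

print(advance_fixed(OpState.ORPHANED))    # OpState.ORPHANED, no crash
```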

Resilience options: While we always strive to prevent downtime for customers and to mitigate quickly should impact occur, customers who are sensitive to any service interruption are also recommended to review the latest Azure guidance regarding resilience architecture, which can help prevent impact in scenarios such as this.
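One client-side pattern consistent with that guidance is to pair redundant deployments (zone- or geo-redundant storage) with bounded, jittered retries so that short windows of unavailability are absorbed. The sketch below is generic Python; read_blob is a hypothetical placeholder for any storage call, not an SDK function:

```python
import random
import time

class TransientError(Exception):
    """Stands in for a 5xx or timeout from a temporarily unhealthy scale unit."""

def with_backoff(operation, attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a zero-argument callable with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise                      # out of retries; surface the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))

# usage, with a hypothetical client call:
# data = with_backoff(lambda: read_blob("container/blob"))
```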

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • We are closing a gap in testing of the relevant operations that triggered the bug.
  • We have identified improvements to enable metadata processes to restart faster after a crash. This will roll-out in a future release and will reduce mitigation time for such incidents.
  • We have already deployed the fix to handle the unexpected state on the impacted storage scale unit.
  • We are rolling out the fix to all production storage scale units in accordance with safe deployment practices. In the interim, no management operations of the type required to trigger the bug will be performed.


Provide Feedback: 
Please help us improve the Azure customer communications experience by taking our survey: 

    October 2019

    10/30

    RCA - Storage - West Europe

    Summary of Impact: Between 10:17 and 13:52 UTC on 30 Oct 2019, a subset of customers using Storage, or Azure services with Storage dependencies, in West Europe may have experienced difficulties connecting to resources hosted in this region.

    Root Cause: An Azure Storage scale unit internally has a set of data nodes and metadata nodes. In normal operation, the set of metadata nodes are all able to process query requests in parallel. In this incident, a unique sequence of circumstances triggered a previously unknown software bug that impacted a single storage scale unit. This bug caused the scale unit to enter a state where all metadata query requests were directed to a single server. This caused long queue build-ups on that server, causing high latency, and also protective throttling to prevent complete failure of the metadata software role. This in turn affected other software roles within the storage scale unit, and ultimately customer requests for resources located on that scale unit.
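To make the protective throttling concrete (the queue bound and class below are invented; the real metadata role is internal to Azure Storage), a node that guards itself once its queue passes a threshold fails new requests fast rather than letting latency grow without bound:

```python
from collections import deque

class MetadataServer:
    """Toy model of protective throttling: when the request queue on a single
    node grows past a threshold (as happens when every metadata query falls
    back to one server), new requests are rejected rather than queued."""
    def __init__(self, max_queue: int = 1000):
        self.queue = deque()
        self.max_queue = max_queue

    def submit(self, request) -> bool:
        if len(self.queue) >= self.max_queue:
            return False                 # throttled: fail fast, protect the node
        self.queue.append(request)
        return True

server = MetadataServer(max_queue=3)
print([server.submit(f"query-{i}") for i in range(5)])   # [True, True, True, False, False]
```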

    Mitigation: Once the issue was understood, a mitigation was applied to correct the system state and allow all metadata nodes to process query requests again. This reduced latency/throttling and enabled recovery. Engineers monitored the scale unit for an extended period to ensure full mitigation had occurred.

Resilience options: While we always strive to prevent downtime for customers, and to mitigate quickly should impact occur, customers who are sensitive to any service interruption are also recommended to review the latest Azure guidance regarding resilience architecture, which can help prevent impact in scenarios such as this.

    Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

    • Adding additional specific high-severity alerting for this mode of degraded metadata software role state.
    • Implementing and deploying an automated mitigation for this class of issue.
    • Identifying and deploying the fix for the underlying bug that caused the metadata service to fall back to a single server. While we understand this extremely uncommon failure mode and have a rapid mitigation for it, we are continuing to thoroughly investigate all possible conditions under which the state could be triggered.


    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

    10/25

    Azure Databricks and Data Factory v2 - Workspace and Data Flow Errors

Summary of impact: Between approximately 11:00 and 14:40 UTC on 25 Oct 2019, a subset of customers using Azure Databricks may have received 'No Web App available' error notifications when logging into a Databricks workspace. Related API calls may also have failed to return a response. Additionally, a very limited subset of customers using Data Factory v2 may have received failure notifications for Data Flow jobs.

    Preliminary root cause: Engineers determined that a backend database used to process workspace access requests became unhealthy, causing requests to fail. As this database supports the control plane for the East US, East US 2, Central US, and North Central US regions, only customers in these regions would have seen impact. Additionally, a small number of Data Factory v2 customers in these regions would have seen downstream impact from this issue.

    Mitigation: Engineers redeployed the affected backend database to mitigate the issue.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

    10/22

    RCA - Service Availability - West Europe

    Summary of impact: Between 21:15 UTC on 22 Oct 2019 and 00:23 UTC on 23 Oct 2019, a subset of customers using Storage in West Europe may have experienced service availability issues. In addition, resources with dependencies on the impacted storage unit may have experienced downstream impact in the form of availability issues or high latency.

    Root cause: Azure Storage service uses an automatic load balancing system to partition and balance customer workloads across different servers within a storage scale unit. A partition master role maintains the map of how the partitions are distributed across the different partition servers. A routine maintenance operation on one of the partitions caused an inconsistency in the partition map due to a bug. This caused the servers handling the inconsistent state to crash and resulted in a single storage scale unit in West Europe becoming unhealthy. This also resulted in downstream availability impact to several dependent services in the region.
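The partition master's internal data structures are not public, but as a rough sketch of the hardening item listed under next steps (server names, the key-space size, and the map shape below are all assumptions), a consistency check can reject a partition map with gaps or overlaps before it is ever served:

```python
KEY_SPACE = 2**32    # assumed key space, for illustration only

def validate_partition_map(assignments):
    """assignments maps partition-server name -> list of (lo, hi) key ranges,
    hi exclusive. The ranges must tile the key space: no gaps, no overlaps."""
    ranges = sorted(r for server_ranges in assignments.values() for r in server_ranges)
    cursor = 0
    for lo, hi in ranges:
        if lo != cursor:
            raise ValueError(f"gap or overlap at key {cursor}: next range starts at {lo}")
        cursor = hi
    if cursor != KEY_SPACE:
        raise ValueError("partition map does not cover the full key space")

# a consistent map passes silently
validate_partition_map({"ps-1": [(0, 2**31)], "ps-2": [(2**31, 2**32)]})
# an inconsistent map (overlapping ranges) is rejected instead of crashing servers:
# validate_partition_map({"ps-1": [(0, 2**31)], "ps-2": [(2**30, 2**32)]})
```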

Note: On 21 Oct 2019, a different storage scale unit in the region was first impacted by this bug. Some customers may have been impacted by both incidents if their workloads were spread across both storage scale units. As part of mitigating the first incident, we had blocked the backend operation that triggered the bug until the fix was deployed. However, we did not identify a second entry point that could trigger the bug, which resulted in the incident on 22 Oct 2019.

    Mitigation: Engineers modified the tool that was created during the incident on 21 Oct 2019 to correct the inconsistency in the partition map specific to the impacted scale unit.

    Engineers also blocked all backend operations that can trigger the inconsistent state until the fix was deployed globally.

    Once the storage backend service health was restored, all dependent services automatically recovered.

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

    • Engineers understand the root cause of the bug which triggered the inconsistent system state and have rolled out a fix across production to help ensure the triggering functionality has been disabled from all entry-points.
    • Engineers are hardening the load-balancing system to detect and reject inconsistencies in the partition map in a safe manner to prevent impact on system health.

    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

    10/22

    Azure Portal Sign In Failure - West Europe - Mitigated

Summary of Impact: Between 06:40 and 09:54 UTC on 22 Oct 2019, a subset of customers may have experienced issues signing in to the Azure Portal in West Europe and would have seen failure notifications when trying to sign in. Accessing the preview Azure Portal would have been successful.

Preliminary Root Cause: Engineers determined that a recent deployment task increased traffic to a backend service, in this case Azure Active Directory (AAD), which in turn reached its operational threshold, preventing requests from completing.

    Mitigation: Engineers performed a change to the service configuration to mitigate the issue. Engineers also rolled back the recent deployment task to further mitigate the issue.

Next Steps: Engineers will review deployment procedures to prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

    10/21

    RCA - Storage - West Europe

    Summary of impact: Between 23:20 UTC on 21 Oct 2019 and 04:32 UTC on 22 Oct 2019, a subset of customers using Storage in West Europe may have experienced service availability issues. In addition, resources with dependencies on the impacted storage unit may have experienced downstream impact in the form of availability issues or high latency.

    Root cause: The Azure Storage service uses an automatic load balancing system to partition and balance customer workloads across different servers within a storage scale unit. A partition master role maintains the map of how the partitions are distributed across the different partition servers. A routine maintenance operation on one of the partitions caused an inconsistency in the partition map due to a bug. This caused the servers handling the inconsistent state to crash and resulted in a single storage scale unit in West Europe becoming unhealthy. Downstream availability impact was then seen to several dependent services in the region.

    Mitigation: Engineers developed a tool to correct the inconsistency in the partition map on the impacted scale unit. The time taken to develop and test this tool resulted in a higher than expected recovery time. Engineers also blocked the backend operation that triggered the inconsistent state until the underlying bug is fixed. Once service health was restored to the backend Storage scale unit, all dependent services automatically recovered.

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

    • Engineers understand the root cause of the bug which triggered the inconsistent system state and have rolled out a fix across production to help ensure the triggering functionality has been disabled from all entry-points.
    • Engineers are hardening the load-balancing system to detect and reject inconsistencies in the partition map in a safe manner to prevent impact on system health.

    Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

      10/21

      RCA - Service Management Errors - East US 2

      Summary of Impact: Between 10:08 and 23:37 UTC on 21 Oct 2019, a subset of customers in East US 2 may have received error notifications or experienced high latency when performing service management operations - such as create, update, or delete - for resources hosted in this region. Additionally, customers using Azure Databricks and/or Data Factory v2 may have encountered service management errors in multiple regions. A very limited subset of customers using Virtual Machines with SQL Server images, or other SQL IaaS offerings, may have also encountered errors performing service management operations on resources hosted in East US 2.

Root Cause: Engineers investigated and found that an internal service, the Service Fabric cluster manager, was near capacity due to the legacy hardware it was running on. Around 09:00 UTC on 21 Oct 2019, a regular spike in operations caused the cluster manager to reach its capacity limit, resulting in queueing and increased latency. The increased latency triggered retries from the Regional Network Manager (RNM), the regional network control-plane service, which entered a feedback loop, causing operations to queue up and forcing RNM into a state from which it could not recover.
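The RNM retry path is internal to Azure, but as a generic illustration of how a retry feedback loop can be broken (the class, thresholds, and cool-down below are illustrative, not Microsoft's implementation), a circuit breaker stops piling retries onto a saturated dependency and sheds load until it recovers:

```python
import time

class CircuitBreaker:
    """Stop calling a dependency once it is clearly saturated; retry only
    after a cool-down instead of queueing work indefinitely."""
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: shedding load instead of retrying")
            self.opened_at = None          # half-open: allow one probe through
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The third repair item below (no longer queueing operations indefinitely) applies the same idea to the RNM feedback loop.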

      Mitigation: Engineers moved the cluster manager from the legacy hardware to new Gen5 hardware to mitigate.

      Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): 

      • Moving the cluster manager to the latest Gen5 hardware in all regions.
      • Implementing caching on the gateway to reduce frequent calls to the cluster manager, which will help with reducing the load.
      • Improving the Regional Network Manager (RNM) operation feedback loops to no longer queue up operations indefinitely.

      Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

      10/18

      RCA - Authentication issues with Azure MFA in North America

      Summary of impact: Between 13:30 UTC and 15:57 UTC on 18 October 2019, customers in North America experienced issues receiving multi-factor authentication (MFA) challenges. Users who had valid MFA claims during the incident were not impacted. However, users who were required to perform an MFA challenge during this incident were unable to complete the challenge. This represented 0.51% of users in North American tenants using the service during the incident.

      Root cause: At 13:30 UTC, severe packet loss was experienced on an external network route between Microsoft and the Apple Push Notification Service (APNS). The packet loss was greater in severity and duration than previously encountered. It also coincided with morning peak traffic in North America. This combination of events caused a build-up of unprocessed requests in the MFA service, leading to service degradation and failures in MFA requests.
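As a hedged sketch of one way such a build-up can be contained (this is not the MFA service's actual design; the queue bound and the fallback challenge type are invented for illustration), an outbound push queue with a hard bound fails over to an alternative challenge rather than accumulating unprocessed requests while the external channel is degraded:

```python
import queue

# Bounded queue in front of the external push channel (APNS in this incident).
push_queue: "queue.Queue[str]" = queue.Queue(maxsize=1000)

def request_mfa_challenge(user_id: str) -> str:
    """Prefer a push notification, but fall back (hypothetically to an OTP
    prompt) as soon as the push pipeline is backed up, so unprocessed
    requests cannot pile up and degrade the whole service."""
    try:
        push_queue.put_nowait(user_id)
        return "push-notification"
    except queue.Full:
        return "otp-fallback"

print(request_mfa_challenge("user-1"))
```

This mirrors the "improved resilience to request build-up" and user self-service items listed under next steps.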

      Mitigation: Service monitors detected the build-up of unprocessed requests in the MFA service at 13:38 UTC and engineers were fully engaged by 13:48 UTC. Engineering confirmed the issue was a loss of network connectivity and began troubleshooting within the Microsoft datacenter networks. Engineering determined that the datacenter networks did not experience loss of connectivity and pinpointed the issue as external to the Microsoft datacenter networks. While further troubleshooting was underway to identify the most impacted network routes, engineering prepared a hotfix to bypass the impacted external service altogether, and to restore MFA functionality. The hotfix was rolled out to one region to validate the effectiveness of the fix. In the meantime, the external network recovered, and packet loss was reduced to normal rates. Engineering paused further rollout of the hotfix. The network issue was confirmed to be mitigated at 15:57 UTC, and the MFA service functionality recovered. The hotfix, which was then redundant, was rolled back. 

      Following mitigation, engineering immediately performed the below tasks to harden the service against a similar degradation in the future:

      • The impacted network path was pinpointed, and traffic was reconfigured to bypass the failure path.
      • Three hotfixes to the service were immediately rolled out to rebalance network traffic and increase resiliency to network failures.
      • Precise alerts were setup to monitor for the specific network condition which triggered the issue. 

      Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future, including, but not limited to:

      1. Fine-grained fault domain isolation work has been accelerated. This work builds on the previous fault domain isolation work which limited this incident to North American tenants. This includes:

      • Additional physical partitioning within each Azure region.
      • Logical partitioning between authentication types.
      • Improved partitioning between service tiers.

      2. Additional hardening and redundancy within each granular fault domain to make them more resilient to network connectivity loss. This includes:

      • Improved resilience to request build-up.
      • Optimizing network traffic to decrease load on network links.
      • Improved instructions to users for self-service in case notifications are not delivered.
      • Service restructuring to decrease service impact of network packet loss.

      3. Enhanced monitoring for networking latency and various resource usage thresholds. This includes:

      • Multi-region and multi-cloud targeted monitoring for the specific type of packet loss encountered.
      • Improved monitors for additional types of resource usage.

      Other categories and action items are being actively identified.

      Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

      10/16

      RCA - Azure Portal

      Summary of Impact: Between 13:45 and 14:59 UTC (approx.) on 16 Oct 2019, a subset of customers may have experienced latency issues with the Azure Portal, Command Line, and Azure PowerShell. Services already running would not have experienced impact during this issue.

      Root Cause: Azure Resource Manager (ARM) is the underlying deployment and management service for Azure, providing the management layer that allows create, update, delete, etc. functions for the platform. Customers interact with ARM every time they use the platform, but the primary interaction points are via PowerShell, Command line, APIs and/or the Azure Portal.

During this incident, a number of scale units in one of our UK South datacenters responsible for processing ARM requests became overloaded. This caused intermittent latency for some customers in this region, as they would have been directed to their nearest ARM endpoints. As platform retry logic and customer-initiated retries ensued, the load on these scale units increased further, and impact to customers became more prevalent.

      The underlying cause was twofold:

      • A routine deployment update for the ARM backend service had taken place overnight in UK South. The update was successful, but as it deployed, it reduced the number of ARM instances, as the code branch for this deployment had not been updated to the new value required in this region.
      • The resources in UK South are normally over-scaled relative to demand and were therefore not configured with the same auto-scale functionality that ARM uses in other regions.

      Following the above deployment, there was no immediate impact to ARM workloads in the region. However, as requests to ARM began to climb during the business day, the lower number of instances, and the lack of an immediate auto-scale response caused this to manifest as customer-impacting latency.
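ARM's real auto-scale configuration is not public, but a minimal sketch of the kind of rule involved (the thresholds, growth step, and per-region minimum below are hypothetical) shows how a minimum instance floor prevents a deployment from leaving a region under-scaled:

```python
def desired_instances(current, cpu_percent, minimum, maximum,
                      scale_out_at=70, scale_in_at=30):
    """Scale out under high load, scale in under low load, and never drop
    below the per-region minimum instance count."""
    if cpu_percent >= scale_out_at:
        target = current + max(1, current // 2)
    elif cpu_percent <= scale_in_at:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))

# A region accidentally reduced to 4 instances is pulled back up to the
# floor of 8 as soon as the rule evaluates, even before load-based growth.
print(desired_instances(current=4, cpu_percent=85, minimum=8, maximum=40))   # 8
print(desired_instances(current=10, cpu_percent=20, minimum=8, maximum=40))  # 9
```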

      Mitigation: Initially engineers manually scaled-out the service to ensure adequate resources were available. There were some delays in the application of these steps, as this is a highly secure environment, and the change-approvals process had recently been updated.

Subsequently, engineers applied a mitigation to ensure the auto-scale functionality was restored, preventing this issue from recurring. Engineers then monitored the service for an extended period following mitigation to ensure all request backlogs cleared and that auto-scale functionality was correctly enabled.

      Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

      • Review of monitoring and alerting relating to resource consumption, and requests success rates in ARM
      • Review of alerting for scale-up functionality underpinning the ARM platform
      • Investigation of auto-mitigation options to prevent this scenario from reoccurring
      • Verifying that all on-call engineers are aware of the updated processes for change requests in this environment
      • Update of the process regarding change of minimum instance thresholds for regions to ensure deployments do not revert to previous values

      Provide feedback: Please help us improve the Azure customer communications experience by taking our survey: 

      10/2

      Log Analytics - USGov Virginia - Mitigated

      Summary of Impact: Between 19:45 EDT on 02 Oct 2019 and 23:45 EDT on 03 Oct 2019 a subset of customers using Log Analytics in USGov Virginia may have received 500-level errors when performing queries on log analytics data.

      Preliminary root cause: Engineers determined that instances of a back-end service responsible for middle-tier authentication became unhealthy, preventing data queries from completing.

      Mitigation: Engineers deployed a platform hotfix to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

      September 2019

      9/25

      RCA - Azure Kubernetes Service (AKS) - Provisioning Failures

      Summary of Impact: Between 20:50 UTC on 25 Sep 2019 and 04:25 UTC on 26 Sep 2019, a subset of customers using Azure Kubernetes Service (AKS) may have received failure notifications when performing create and/or update operations for their resources.

Root Cause: A recent configuration change enabled an update to a backend service responsible for processing service management requests for AKS. The update exposed a code defect, and a fix for the defect had not yet finished rolling out in the impacted regions. This resulted in failures in the service, preventing it from processing requests. Internal monitoring for this issue also did not initially correlate alerts to the complete scope of impact, which delayed the investigation.

      Mitigation: Engineers reverted the configuration change which mitigated the issue.

      Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

      • Monitoring the completion of the fix for the defect in all regions.
      • Updating the monitoring systems to improve alerting accuracy for these types of scenarios.

      Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

      9/16

      RCA - Azure Front Door Service and Azure CDN

Summary of Impact: Between 00:00 UTC on 16 Sep 2019 and 00:30 UTC on 18 Sep 2019, a subset of customers may have seen provisioning failures with Azure CDN and Azure Front Door Service when using new Key Vault certificates.

Root Cause: Engineers determined that a recent deployment task impacted Key Vault API calls, causing the provisioning failures. The deployment change was made to filter out managed secrets for the Key Vault portal extension; however, it also impacted Remote Procedure Calls (RPCs). Customers who depend on these RPCs were unable to see Key Vault managed secrets, impacting their ability to provision CDN and Azure Front Door services.

      Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

      Next Steps:  We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. We have implemented additional test scenarios for listSecrets RPCs to prevent future occurrences.

      Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

      9/11

      RCA - Australia East - Intermittent Connectivity

      Summary of Impact: Between 00:30 and 01:40 UTC on 11 Sep 2019, a subset of customers in Australia East may have experienced intermittent connectivity failures when connecting to resources hosted in this region or when accessing the Azure Portal.

      Root Cause and Mitigation: The Azure Load Balancing service is hosted on multiple servers grouped into rings. During a routine configuration update to a single ring, a health check failed, causing the configuration update to automatically roll back. The rollback process was optimized for speed, and in this case, reduced the capacity of the ring because it updated multiple instances at the same time. When the ring capacity was reduced, there was customer impact due to dropped packets. To prevent this happening in the future, we are adding additional health checks to the rollback process. These checks will ensure that the available capacity does not drop below a minimum specified threshold.
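As an illustrative sketch of the additional health check described here (the data model, batch size, and threshold are assumptions, not the load balancer's actual code), a rollback routine can refuse to proceed whenever taking the next batch out of service would push the ring below a minimum capacity:

```python
def safe_rollback(instances, min_healthy_fraction=0.8, batch_size=1):
    """Roll configuration back in small batches; pause if the next batch
    would drop in-service capacity below the minimum threshold."""
    total = len(instances)
    for i in range(0, total, batch_size):
        batch = instances[i:i + batch_size]
        in_service = sum(1 for inst in instances if inst["healthy"]) - len(batch)
        if in_service / total < min_healthy_fraction:
            raise RuntimeError("rollback paused: capacity would fall below threshold")
        for inst in batch:
            inst["healthy"] = False       # taken out of rotation for the update
            inst["rolled_back"] = True
            inst["healthy"] = True        # returns to service before the next batch

ring = [{"healthy": True, "rolled_back": False} for _ in range(10)]
safe_rollback(ring, min_healthy_fraction=0.8, batch_size=1)   # completes safely
# safe_rollback(ring, batch_size=3) would pause rather than drop below 80% capacity
```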

      Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

      • Capacity health checks during rollback to prevent the rollback process from reducing the capacity below a set threshold.
      • Improve the alerting thresholds to capture packet drops on the Azure Software Load Balancer.

      Provide feedback: Please help us improve the Azure customer communications experience by taking our survey: 

      August 2019

      8/28

      RCA - Azure Email Alerts

      Summary of Impact: Between 20:22 UTC on 28 Aug 2019 and 01:35 UTC on 29 Aug 2019, a subset of customers may not have received email alerts sent from a subset of Azure services during the impact window.

Root Cause: Engineers determined that the backend database instance used for email processing had reached operational thresholds. During scale-up operations, the service became unresponsive because logging operations exceeded thresholds. This prevented requests from downstream Azure services that use the Azure Email service from completing successfully.

      Mitigation: Engineers halted the scale up operation. Operation throttling thresholds were gradually returned to normal to fully mitigate the issue.

      Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

      • Improve throttling performance to help ensure that there are sufficient service resources during maintenance activities
      • Improve resiliency of the backend service responsible for processing email alerts

      Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

      8/23

      RCA - Service Management Operations - West/North Europe

      Summary of Impact: Between 06:20 and 10:46 UTC on 23 Aug 2019, a subset of customers in North Europe and West Europe may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in these regions.

      Root Cause: During this incident, the memory consumption on Azure Resource Manager’s (ARM) worker roles exceeded operational thresholds. This should have initiated a clearing process which attempts to recover additional memory when thresholds are reached. However, the memory cache was unrecoverable at the time given that it was referenced by other (active) objects in the system. CPU utilization on the affected worker roles also increased, which in turn prevented certain processes from completing.

      Analysis has indicated that all threads on the affected worker roles were busy during the impact window, thus manifesting in impact to service management operations to ARM-dependent services and/or resources.

In addition, worker roles recycle on a weekly cadence, which in this scenario further contributed to thresholds being reached. The nature of the issue required manual intervention, which further delayed mitigation.
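A small, generic Python example (not ARM's implementation; the cache and payload type are invented) of why cache entries still referenced by active objects cannot be reclaimed, and how keeping the cache itself on weak references lets unreferenced entries go as soon as nothing else uses them:

```python
import gc
import weakref

class Payload:
    """Stands in for a cached metadata object."""
    def __init__(self, data):
        self.data = data

# A weak-reference cache only keeps entries alive while some active object
# still holds them, so clearing or trimming it actually releases memory.
cache = weakref.WeakValueDictionary()

held = Payload("still referenced by an active request")
cache["in-use"] = held
cache["idle"] = Payload("referenced only by the cache")

gc.collect()                    # the idle entry is reclaimed automatically
print(sorted(cache.keys()))     # ['in-use'], only the pinned entry remains
```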

      Mitigation: Engineers performed a manual restart of the affected worker roles to mitigate the issue.

      Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

      • Automatically recycling worker roles on a more frequent cadence [In Progress]
      • Separating distinct cache types [In Progress]
      • Adding enhanced logging information to understand memory cache usage in worker roles [In Progress]
      • Reducing cache footprint [In Progress]

      Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey .

      8/16

      RCA - Azure Services - South Central US

      Summary of impact: Between 14:16 UTC and 21:36 UTC on 16 Aug 2019, a subset of customers using resources in South Central US may have experienced difficulties connecting to resources hosted in this region.

Root cause: An outage occurred during a regional infrastructure update, which is part of a South Central US datacenter refresh initiative. At approximately 14:00 UTC on 16 Aug 2019, facility engineers proceeded with removal of infrastructure following a pre-approved Method of Procedure (MOP). However, an error in fiber trunk validation resulted in production-impacting connectivity loss.
      The loss of these physical links resulted in connectivity issues with a scale unit in the datacenter that hosted several Azure infrastructure services.

      Mitigation: To mitigate, services with multi-region deployments successfully failed out of the region to minimize impact. Residual impact was mitigated by restoring full fiber connectivity to the affected scale unit and placing it back into production rotation.

      Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

      • Halt infrastructure refresh projects globally until the repair items below are addressed - Complete
      • Detailed review of the Method of Procedure (MOP) as it relates to infrastructure refresh projects - In progress
      • Improve the existing physical infrastructure change management process with additional collaboration and validation between facility and infrastructure engineering - In progress

      Provide feedback: Please help us improve the Azure customer communications experience by taking our survey