Azure status history

This page contains all root cause analyses (RCAs) for incidents that occurred on November 20, 2019 or later. Each RCA will be retained on this page for 5 years. RCAs before November 20, 2019 aren’t available.

April 2022

4/8

Service Management Operation Errors Across Azure Services in East US 2 (Tracking ID Y__5-9C0)

Summary of Impact: Between 12:25 UTC on 08 Apr 2022 and 14:40 UTC on 09 Apr 2022, customers running services in the East US 2 region may have experienced service management errors, delays, and/or timeouts. Customers may have experienced issues that caused GET and PUT errors impacting the Azure portal itself, as well as services including Azure Virtual Machines (VMs), Virtual Machine Scale Sets (VMSS), Azure Data Factory (ADF), Azure Databricks, Azure Synapse, Azure Backup, Azure Site Recovery (ASR), and Azure Virtual Desktop (AVD). Customers may have seen errors including “The network connectivity issue encountered for Microsoft.Compute cannot fulfill the request.” For some downstream services that have auto-scale enabled, this service management issue may have caused data plane impact.

Root Cause: We determined that the Compute Resource Provider (CRP) Gateway service experienced an issue which severely reduced its throughput. The underlying issue was a retry storm triggered by the zonal failure of a related Allocator service. While we were able to recover the Allocator service by restarting the instances in the failed zone, the backlog of work exposed a potential issue with the .NET CLR and Garbage Collector. This caused a large percentage of incoming calls to the CRP Gateway to fail. The retries triggered by the upstream services only made the load situation worse. Under normal circumstances, the Gateway instances are overprovisioned to absorb such retry storms, but the combination of reduced throughput across all instances and continuous retries (some services that normally make 25K calls per minute were making 150K calls in the same period due to retries) resulted in a prolonged impact.
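
For illustration only (this is not a change described in this RCA), the following minimal Python sketch shows the kind of client-side retry policy that avoids this amplification: a capped exponential backoff with jitter, so failed calls back off instead of immediately re-hitting an already overloaded gateway. The 'operation' callable and the delay parameters are hypothetical placeholders.

    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Retry a transient failure with capped exponential backoff and full jitter.

        'operation' is a hypothetical callable standing in for a service management
        request (e.g. a GET/PUT against a resource provider) that raises on failure.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                # Back off exponentially (capped), and add jitter so callers do not
                # retry in synchronized waves against an already overloaded gateway.
                delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
                time.sleep(random.uniform(0, delay))

Without such a policy, a caller that normally issues 25K requests per minute can multiply its own load roughly six-fold during an outage, which is consistent with the 150K calls per minute observed here.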

Deeper investigation into process profile data showed that the process was experiencing a high rate of timeout exceptions for ongoing operations, and that the .NET Garbage Collector was overworked due to high heap churn under the above-mentioned spike in load. The high rate of exceptions and the simultaneous pressure on the .NET GC exposed an unfavorable interaction in the .NET runtime's process-wide lock.
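
As a general illustration of why a high exception rate on a hot path is costly (this is not the specific repair Microsoft describes), a service can report expected failures such as timeouts through return values rather than raised exceptions. The sketch below uses hypothetical names ('GatewayResult', 'forward_request', 'backend_call').

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GatewayResult:
        """Outcome of a hypothetical gateway call, reported without raising."""
        ok: bool
        value: Optional[dict] = None
        error: Optional[str] = None

    def forward_request(request, backend_call, timeout_seconds=5.0):
        """Report expected timeouts as values instead of exceptions.

        'backend_call' is a hypothetical function that returns a response dict,
        or None when it times out. Treating the timeout as a normal return value
        keeps exception construction and stack unwinding off the hot path.
        """
        response = backend_call(request, timeout=timeout_seconds)
        if response is None:
            return GatewayResult(ok=False, error="timeout")
        return GatewayResult(ok=True, value=response)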

Mitigation: To mitigate the issue, the following steps were taken:

  • Two large services were temporarily throttled more aggressively to ensure they did not continue to overload the gateway (a throttling sketch follows this list).
  • Once the underlying issue of the throughput reduction was partially understood, the gateway services were restarted multiple times until they got out of the wedged state.
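
The first step above refers to server-side throttling. A minimal Python sketch of one common approach, a per-caller token bucket, is shown below; the rate and capacity values are hypothetical and only loosely derived from the ~25K calls per minute figure mentioned in the root cause.

    import time

    class TokenBucket:
        """Per-caller token bucket: roughly 'rate' requests per second, with
        bursts up to 'capacity'; anything beyond that is rejected (throttled)."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill in proportion to elapsed time, never exceeding capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    # Hypothetical quota: ~400 requests/second is roughly the 25K-per-minute
    # baseline mentioned above, with a modest burst allowance.
    bucket = TokenBucket(rate=400.0, capacity=800.0)
    if not bucket.allow():
        pass  # e.g. respond 429 with a Retry-After header instead of doing the work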

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • As a long-term fix, we initiated a CRP gateway hotfix that will prevent the gateway from entering the wedged state. The hotfix roll-out is progressing as per our Safe Deployment Practices.
  • We are flighting a configuration change to reduce the pressure on the .NET GC and avoid the interaction with the process-wide lock that surfaces during exception handling.
  • Repair items have been identified to optimize areas of code that were causing heap churn (see the sketch below).
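
As a generic example of the kind of heap-churn reduction the last item refers to (not the actual repair items), the sketch below reuses one preallocated buffer per worker instead of allocating a new one for every request; 'ResponseSerializer' is a hypothetical name.

    class ResponseSerializer:
        """Reuses one preallocated buffer per worker instead of allocating a new
        one for every request, which is one generic way to reduce heap churn."""

        def __init__(self, buffer_size: int = 64 * 1024):
            self._buffer = bytearray(buffer_size)

        def serialize(self, payload: bytes) -> memoryview:
            # Copy the payload into the reusable buffer and return a view over it,
            # avoiding a fresh allocation per call. The view is only valid until
            # the next serialize() call on this instance.
            n = len(payload)
            if n > len(self._buffer):
                self._buffer = bytearray(n)  # grow once to fit the largest payload
            self._buffer[:n] = payload
            return memoryview(self._buffer)[:n]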

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

March 2022

3/16

RCA - Azure AD B2C - Authentication Failures and Error Notifications (Tracking ID TTCR-NTZ)

Summary of Impact: Between 09:13 and 10:22 UTC on March 16, 2022, end-users of customers using Azure Active Directory B2C may have experienced errors and timeouts when attempting to sign in or sign up. Retry attempts were likely to succeed during this incident.

Root Cause: The service experienced a significant increase in workload in the affected region during a planned maintenance operation. As a result, a subset of sign-in requests was queued up by the system, which increased processing time, and in some cases sign-in attempts by end-users may have timed out.

Mitigation: The service automatically scaled up compute resources in response to the increase in workload, which provided partial relief. In addition, we rerouted subsets of the workload to alternate capacity to achieve complete mitigation.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Improve planned maintenance Standard Operating Procedures (SOPs) by pre-provisioning additional capacity in the affected region to help handle unanticipated workload.
  • Improve planned maintenance SOPs to include proactive assessment of similar pre-provisioning in regions beyond the affected region for this service.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

3/1

RCA - Azure Resource Manager - Service Management Operation Failures (Tracking ID ZNRZ-HDG)

Summary of Impact: Between 11:49 EST on 01 March 2022 and 03:08 EST on 03 March 2022, a subset of customers experienced errors when using Azure Resource Manager to perform service management operations in the Azure Government cloud.

Root Cause: A synchronization issue occurred between backend components used to permit certain ARM requests. A configuration change was applied to these backend components, which resulted in some instances of the ARM service becoming unreachable, causing errors for a subset of operation requests.

Mitigation: We rolled out a hotfix to affected components, restoring the ARM service, which allowed operation requests to complete as expected.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Update ARM component configuration methods to help prevent synchronization issues when similar updates are required.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: