Azure status history

This page contains all root cause analyses (RCAs) for incidents that occurred on November 20, 2019 or later. Each RCA will be retained on this page for 5 years. RCAs before November 20, 2019 aren’t available.

November 2021

11/16

Azure Active Directory - Issues when attempting to authenticate - Mitigated (Tracking ID SVT2-VCZ)

Summary of Impact: Between 20:20 UTC on 16 Nov 2021 and 05:00 UTC on 17 Nov 2021, customers using Azure Active Directory External Identities may have experienced issues when attempting to authenticate.

Preliminary Root Cause: We determined that outbound port exhaustion led to service latency and failures.

Mitigation: We have scaled up the service capacity to mitigate this issue.
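
As general context for client-side resilience during transient token-service latency or failures like those described here, below is a minimal sketch, assuming the MSAL Python library, of retrying Azure AD token acquisition with exponential backoff. It is illustrative only and not part of the service-side mitigation; the tenant ID, client ID, and secret are hypothetical placeholders.

```python
# Minimal sketch: retry transient Azure AD token-acquisition failures with
# exponential backoff. Tenant ID, client ID, and secret are placeholders.
import time
import msal

AUTHORITY = "https://login.microsoftonline.com/<your-tenant-id>"  # placeholder
SCOPES = ["https://graph.microsoft.com/.default"]

app = msal.ConfidentialClientApplication(
    client_id="<your-client-id>",               # placeholder
    client_credential="<your-client-secret>",   # placeholder
    authority=AUTHORITY,
)

def acquire_token_with_retry(max_attempts=4, base_delay=1.0):
    """Retry token acquisition when the result carries an error instead of a token."""
    for attempt in range(1, max_attempts + 1):
        result = app.acquire_token_for_client(scopes=SCOPES)
        if "access_token" in result:
            return result["access_token"]
        if attempt < max_attempts:
            # Back off before retrying a transient failure.
            time.sleep(base_delay * (2 ** (attempt - 1)))
    raise RuntimeError(f"Token acquisition failed: {result.get('error_description')}")
```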

Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. 

Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.
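
As one way to act on the suggestion above, the following sketch creates a subscription-scoped Service Health activity log alert through the Azure Resource Manager REST API. The subscription, resource group, alert name, action group ID, and api-version are assumptions; verify the exact schema against current Azure Monitor documentation before relying on it.

```python
# Hedged sketch: create an activity log alert scoped to Service Health events
# via the ARM REST API. All identifiers and the api-version are placeholders.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"              # placeholder
RESOURCE_GROUP = "<resource-group>"             # placeholder
ALERT_NAME = "service-health-alert"             # placeholder
ACTION_GROUP_ID = "<action-group-resource-id>"  # placeholder
API_VERSION = "2020-10-01"                      # assumed; check current docs

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Insights"
    f"/activityLogAlerts/{ALERT_NAME}?api-version={API_VERSION}"
)

body = {
    "location": "Global",
    "properties": {
        "scopes": [f"/subscriptions/{SUBSCRIPTION}"],
        "enabled": True,
        "condition": {"allOf": [{"field": "category", "equals": "ServiceHealth"}]},
        "actions": {"actionGroups": [{"actionGroupId": ACTION_GROUP_ID}]},
        "description": "Notify on Azure Service Health events for this subscription.",
    },
}

resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
```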

11/12

Microsoft Graph - Intermittent 400-level errors accessing Graph APIs - Mitigated (Tracking ID PLT7-RTZ)

Summary of Impact: Between 02:00 UTC on 12 Nov 2021 and 17:00 UTC on 15 Nov 2021, a subset of customers primarily located in the North America and APAC geographies may have encountered intermittent 400-level errors when attempting to access Microsoft Graph APIs.

Preliminary Root Cause: We determined that a recent update to improve the underlying Microsoft Graph API infrastructure created a configuration issue between the Microsoft Graph API interface and its underlying Internet Information Services driver. This configuration issue prevented calls to various APIs from completing as expected.

Mitigation: We failed over our service to a previously known healthy state and rolled back the update to mitigate impact.
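
For callers diagnosing intermittent 400-level responses like those described above, below is a minimal sketch of a Microsoft Graph request that surfaces the structured Graph error body and the request-id header, which helps when reporting such failures. The endpoint and access token are illustrative placeholders.

```python
# Hedged sketch: call a Microsoft Graph endpoint and, on a 400-level response,
# capture the Graph error code, message, and request-id. Token is a placeholder.
import requests

GRAPH_URL = "https://graph.microsoft.com/v1.0/me"
ACCESS_TOKEN = "<access-token>"  # placeholder; obtain via MSAL or azure-identity

resp = requests.get(GRAPH_URL, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
if 400 <= resp.status_code < 500:
    # Graph error payloads carry a structured "error" object; the request-id
    # header identifies the failed call when raising a support case.
    error = resp.json().get("error", {})
    print(
        f"Graph returned {resp.status_code}: "
        f"code={error.get('code')} message={error.get('message')} "
        f"request-id={resp.headers.get('request-id')}"
    )
else:
    resp.raise_for_status()
    print(resp.json())
```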

Next steps: We sincerely apologize for the impact to affected customers. We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

11/9

RCA - Intermittent Failures When Accessing or Using Access Panel (Tracking ID DK83-BDZ)

Summary of Impact: Between 14:03 UTC and 19:28 UTC on 09 Nov 2021, customers using Azure Active Directory’s Access Panel whose traffic was routed through West Central US and Central US may have experienced issues when attempting to access or use Access Panel functionality. Users may have experienced intermittent failures when attempting the following operations:

  • Launching single sign-on (SSO) applications from My Apps or when using direct sign-on links
  • Registering for Multi-Factor Authentication (MFA)
  • Self-Service Password Reset
  • Performing self-service management of groups
  • Accepting terms of use agreements

This incident had no impact on authentication or MFA scenarios other than MFA registration, nor on authentication for applications that were not launched through My Apps or through direct sign-on links.

Root Cause: On 09 Nov 2021 between 03:00 and 05:30 UTC, a code change to Access Panel was deployed that introduced a bug, unrelated to the intended change, impacting a small subset of customers. The bug was not detected through testing but was discovered during telemetry validation in the first deployment stage in West Central US, at which point we failed over traffic from West Central US to Central US at 07:00 UTC to prevent customers from experiencing it. Based on the information available at the time, failing over traffic was deemed the safest and fastest recovery alternative until a fix could be safely deployed the following morning.

Failing over is a standard operating procedure exercised on at least a monthly basis; however, in this case we made the decision to keep the service in a failed-over state while working on a fix to be deployed the following morning. Since this procedure is regularly tested and is an order of magnitude faster than rolling back the change, we determined this path to be the best course of action and did not expect any issues. The difference in this instance was that the failover happened during off-business hours and persisted into business hours. This meant that at the point of failover the Central US datacenter was scaled to handle the low amount of off-business-hours traffic.

As we reached business hours for the region, at 14:03 UTC on 09 Nov 2021, traffic to the Access Panel service in Central US began rising quickly, at a pace that exceeded what the auto-scaling configuration our service relies on to react to traffic fluctuations could handle. The auto-scaling configuration was not equipped to provision capacity fast enough to keep up with the rate of traffic increase from the failover and the incoming morning peak combined. As a result, customer requests to the service timed out and failed.
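
To illustrate the scaling dynamic described above, the following is a small toy model, with purely illustrative numbers rather than Access Panel telemetry, showing how reactive auto-scaling that adds a fixed number of instances per evaluation interval falls behind a traffic ramp that is steeper than its scale-out cadence.

```python
# Toy illustration (illustrative numbers only, not Access Panel telemetry):
# reactive auto-scaling that adds capacity once per evaluation interval falls
# behind when demand ramps faster than new instances can be provisioned.
REQS_PER_INSTANCE = 100      # assumed capacity of one instance per interval
SCALE_OUT_STEP = 2           # instances added per evaluation interval
provisioned = 4              # starting instance count (sized for off-peak traffic)

demand = [400, 700, 1100, 1600, 2100, 2500]  # requests per interval during the ramp

for interval, reqs in enumerate(demand):
    capacity = provisioned * REQS_PER_INSTANCE
    failing = max(0, reqs - capacity)
    print(f"t={interval:2d}  demand={reqs:5d}  capacity={capacity:5d}  failing={failing:5d}")
    if reqs > capacity:
        provisioned += SCALE_OUT_STEP  # reactive scale-out lags the ramp
```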

The issue was detected by automation at 14:09 UTC on 09 Nov 2021 and engineers were engaged at 14:14 UTC. While we investigated and validated the scope of impact, communications were delayed, with the first notification sent at 15:25 UTC.

While the incident was ongoing, we inaccurately scoped the customer impact in our communications to only applications launched through the My Apps portal, without calling out direct sign-on links used outside of My Apps. Upon further investigation, we also found that a workaround was possible for a subset of impacted scenarios: for applications using sign-on methods other than SAML, customers could have accessed the application’s site directly.

Mitigation: Impact was mitigated by rebalancing traffic and manually scaling out the service in the impacted regions, West Central US and Central US. Since the Access Panel service in West Central US had previously been auto-scaled down, due to no incoming traffic following the failover, scaling it back out took significantly longer. The scale-out operations completed at 18:40 UTC on 09 Nov 2021 and mitigated impact for the vast majority of customers. Complete mitigation was achieved when the deployment rollback completed at 19:28 UTC.

Next steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Adjust provisioned capacity and autoscaling configuration for the Access Panel service in all regions to support failover scenarios during peak hours. (To be completed in November 2021)
  • Update Access Panel standard operating procedures to include rapidly increasing capacity during scale-out operations to help shorten mitigation times for similar scenarios. (To be completed in November 2021)
  • Update test cases to include missing scenarios as surfaced in this incident. (To be completed in November 2021)
  • Improve Access Panel tooling and processes to enable rapid rollbacks of configuration changes. (To be completed in December 2021)
  • Improve standard operating procedures for notifying customers to drive down time to notify for this class of incident. (To be completed in March 2022)
  • Plan to onboard the Access Panel service to automated communication to help drive down time to notify for this class of incidents. (To be completed in March 2022)
  • Adjust standard operating procedures to help improve the quality of communications, including details around scope of impact and potential workarounds. (To be completed in March 2022)

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

October 2021

10/27

SMS MFA not being received by multiple tenants (Tracking ID ZKP4-NSG)

Starting at 10:03 UTC on 27 Oct 2021, a subset of customers using SMS for Multi-Factor Authentication in the United States experienced difficulties signing in to Azure resources, such as Azure Active Directory, when Multi-Factor Authentication was required by policy.

Workaround: Customers were recommended to use voice calls or non-telecom authentication methods to complete Multi-Factor Authentication.

Current status: The issue was mitigated as of 13:20 UTC. This issue was caused by a 3rd-party cellular provider in the United States that was experiencing issues, impacting Azure MFA users.

10/13

RCA - Virtual Machines (Tracking ID 0NC_-L9G)

Summary of Impact: Between 06:27 UTC and 12:42 UTC on 13 Oct 2021, a subset of customers using Windows-based Virtual Machines (Windows VMs) may have received failure notifications when performing service management operations such as start, create, update, and delete. Deployments of new VMs and any updates to extensions may have failed. Management operations on Availability Sets and Virtual Machine Scale Sets were also impacted.

Non-Windows Virtual Machines were unaffected; however, services with dependencies on Windows VMs may also have experienced similar failures when creating resources.

Root Cause: Windows-based Virtual Machines utilize the Windows Virtual Machine Agent (VM Agent) extension, which manages interactions between the Virtual Machine and the Azure Fabric.

When creating and updating Windows VMs, the Compute Resource Provider (CRP) has a dependency upon the Platform Image Repository to retrieve download locations for the latest version of the VM Agent package. Using this information, the VM Agent will update itself to the latest version in the VM.

As part of the journey to move all classic resources to Azure Resource Manager (ARM), we are migrating the image and extension publishers to the regional ARM publishing pipeline. Approximately 20% of all extensions have been successfully migrated.

At approximately 06:27 UTC, tooling provided an ARM template for use in performing these migrations. This tooling did not account for an edge case and, as an unintended consequence, marked the Windows VM Agent extension as visible only to the publishing subscription in the regional ARM service after migration. As a result, VM management operations started to fail after receiving zero results from the regional Platform Image Repositories.

The outcome was that service management operations (start, stop, create, delete, etc.) on customers’ Windows VMs were unable to locate the Windows VM Agent extension, and thus unable to complete successfully.
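
To make this failure mode concrete, the sketch below lists the versions of a VM extension image visible in a region through the Compute REST API and treats an empty result as a fatal lookup failure, mirroring the behavior described above. The publisher, extension type, and api-version are hypothetical placeholders rather than the internal identifiers involved in this incident.

```python
# Hedged sketch: list available versions of a VM extension image in a region
# and fail when the repository returns zero results. Publisher, extension type,
# subscription, and api-version are placeholders.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"     # placeholder
LOCATION = "eastus"                    # any region
PUBLISHER = "<extension-publisher>"    # placeholder (not the internal VM Agent publisher)
EXT_TYPE = "<extension-type>"          # placeholder
API_VERSION = "2022-08-01"             # assumed; check current docs

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}/providers/"
    f"Microsoft.Compute/locations/{LOCATION}/publishers/{PUBLISHER}/"
    f"artifacttypes/vmextension/types/{EXT_TYPE}/versions?api-version={API_VERSION}"
)

resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
versions = resp.json()

if not versions:
    # An empty version list means the extension is not visible in this region,
    # the condition that caused VM management operations to fail in this incident.
    raise RuntimeError(f"No versions visible for {PUBLISHER}/{EXT_TYPE} in {LOCATION}")
print([v["name"] for v in versions])
```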

Part of our change management process is to leverage the Safe Deployment Practice (SDP) framework. In this case, some of the functionality of our classic infrastructure is incompatible with the SDP framework. This incompatibility underscores the importance we place on completing the migration to ARM. Once the migration is complete, we will be able to make all changes using the SDP framework, without bespoke tools that support only classic resources.

Mitigation: Determining the root cause took an extended period because multiple releases for Azure components were in flight simultaneously on the platform, each of which had to be investigated. Additionally, engaging subject matter experts (SMEs) for each of the involved components added to this time, as we needed to eliminate multiple possible scenarios to ensure we could triage the underlying cause.

Once we determined the issue and reviewed multiple mitigation options, we mitigated impact by first making the extension public in one region and validating the results, ensuring no further impact would be caused by a surge in requests for Virtual Machines. Once validated, we rolled the change out to the new pipeline region by region, mitigating the issue. Engineers monitored the platform success rate for operations after the changes were completed.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • The migration of remaining packages in this category (including the Linux version of the VM Agent) is on hold until all repairs are in place.
  • Additional pre-checks and post-checks are being developed and implemented.
  • VM operations are being made resilient to failures when the VM Agent extension cannot be found.
  • Engineering is also evaluating other safeguards to flight each extension type and prevent any potential negative impact during the remainder of the migration.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.