Azure status history

This page contains all root cause analyses (RCAs) for incidents that occurred on 20 November 2019 or later. Each RCA will be retained on this page for five years. RCAs before 20 November 2019 aren’t available.

January 2021

15/1

Azure Network Infrastructure service availability issues for customers located in Argentina - Mitigated (Tracking ID DM7S-VC8)

Summary of Impact: Between 17:30 and 20:15 UTC on 15 Jan 2021, customers located in Argentina attempting to access the Azure Portal and/or Azure Resources may have experienced degraded performance, network drops, or timeouts. Customers may also have experienced downstream impact to dependent Azure services due to an underlying networking event.

Preliminary Root Cause: We determined that a network device, affecting network traffic in Argentina, experienced a hardware fault and that network traffic was not automatically rerouted.

Mitigation: We took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.
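
For illustration only, the sketch below shows one generic way a routing layer can take an unhealthy path out of rotation: probe each candidate path and keep sending traffic only through those that respond. The hostnames and probe details are hypothetical and are not Azure's internal tooling.

```python
# Illustrative sketch only (hypothetical hostnames): probe each path and drop
# unresponsive ones from the pool, approximating "take the faulty device out
# of rotation and reroute traffic".
import socket

PATHS = ["edge-device-a.example.net", "edge-device-b.example.net"]  # hypothetical

def is_healthy(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_paths() -> list[str]:
    """Keep only paths that pass the probe; traffic is sent via the survivors."""
    return [path for path in PATHS if is_healthy(path)]

if __name__ == "__main__":
    active = healthy_paths()
    print("Routing traffic via:", active if active else "no healthy paths, escalate")
```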

Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.
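
As a rough sketch of the alerting suggestion above, the example below creates an Activity Log alert scoped to ServiceHealth events using the azure-mgmt-monitor Python SDK. The subscription, resource group, alert name, and action group are placeholders, and the exact parameter and model shapes can vary between SDK versions.

```python
# Sketch: create an Activity Log alert scoped to ServiceHealth events, which is
# how custom service health alerts are modelled in Azure Monitor. All IDs and
# names below are placeholders; parameter shapes may differ by SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

SUBSCRIPTION_ID = "<subscription-id>"      # placeholder
RESOURCE_GROUP = "<resource-group>"        # placeholder
ACTION_GROUP_ID = (                        # placeholder action group that gets notified
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/<action-group>"
)

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

alert = {
    "location": "Global",
    "scopes": [f"/subscriptions/{SUBSCRIPTION_ID}"],
    "condition": {
        # Fire on any ServiceHealth event in the subscription.
        "all_of": [{"field": "category", "equals": "ServiceHealth"}],
    },
    "actions": {"action_groups": [{"action_group_id": ACTION_GROUP_ID}]},
    "enabled": True,
    "description": "Notify on Azure service health events for this subscription",
}

client.activity_log_alerts.create_or_update(
    RESOURCE_GROUP, "service-health-alert", alert
)
```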

December 2020

14/12

RCA - Azure Active Directory - Authentication errors (Tracking ID PS0T-790)

Summary of impact: Between 08:01 and 09:20 UTC on 14 Dec 2020, a subset of users in Europe might have encountered errors while authenticating to Microsoft services and third-party applications. Impacted users would have seen the error message: “AADSTS90033: A transient error had occurred. Please try again”. The impact was isolated to users who were served through one specific back end scale unit in Europe. Availability for Azure Active Directory (AD) authentication in Europe dropped to a 95.85% success rate during the incident. Availability in regions outside of Europe remained within the Service Level Agreement (SLA).

Root Cause: The Azure AD back end is a geo-distributed, partitioned cloud directory store. The back end is partitioned into many scale units, each with multiple storage units distributed across multiple regions. Request processing for one of the back end scale units experienced high latency and timeouts due to high thread contention. The contention was triggered by a particular combination of requests together with a service topology change that had recently been rolled out to that scale unit.
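
Azure AD's scale-unit internals are not public, but the failure mode described here, requests serializing on a contended shared resource until latency exceeds client timeouts, can be illustrated generically. The sketch below is not Azure AD code; it only shows how worst-case latency grows with the number of requests contending for a single lock.

```python
# Generic illustration (not Azure AD code): when every request serializes on one
# shared lock, worst-case latency grows with the number of contending requests
# and eventually exceeds client timeouts.
import threading
import time

shared_lock = threading.Lock()       # stands in for a contended resource on one scale unit
latencies: list[float] = []
latencies_lock = threading.Lock()

def handle_request(work_seconds: float) -> None:
    start = time.perf_counter()
    with shared_lock:                # every request funnels through the same lock
        time.sleep(work_seconds)     # simulated processing while holding the lock
    with latencies_lock:
        latencies.append(time.perf_counter() - start)

def worst_latency(concurrency: int) -> float:
    latencies.clear()
    threads = [threading.Thread(target=handle_request, args=(0.01,))
               for _ in range(concurrency)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return max(latencies)

if __name__ == "__main__":
    # Worst-case latency rises roughly linearly with the number of contenders.
    for n in (1, 10, 50):
        print(f"{n:>3} concurrent requests -> worst latency {worst_latency(n) * 1000:.0f} ms")
```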

Mitigation: To mitigate the problem, engineers updated the back end request routing to spread requests across additional storage units. Engineers also rolled back the service topology change that had triggered the high thread contention.
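
As a generic illustration of the routing change (the storage unit names and request keys are hypothetical), hashing each request over a wider pool of storage units lowers the request rate any single unit has to absorb:

```python
# Sketch (hypothetical storage unit names): hashing each request over a wider
# pool of storage units lowers the per-unit request rate, which is the intended
# effect of the routing change described above.
import hashlib
from collections import Counter

def pick_unit(request_key: str, units: list[str]) -> str:
    """Deterministically map a request to one storage unit by hashing its key."""
    digest = hashlib.sha256(request_key.encode()).hexdigest()
    return units[int(digest, 16) % len(units)]

def load_per_unit(n_requests: int, units: list[str]) -> Counter:
    return Counter(pick_unit(f"tenant-{i}", units) for i in range(n_requests))

if __name__ == "__main__":
    before = ["su-1", "su-2"]                   # hypothetical original pool
    after = ["su-1", "su-2", "su-3", "su-4"]    # pool widened during mitigation
    print("before:", load_per_unit(10_000, before))
    print("after: ", load_per_unit(10_000, after))
```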

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Augment existing load testing to validate the combination of call patterns that caused the problem (a sketch of such a mixed-pattern test follows this list).
  • Complete the root-cause analysis of the thread contention and make the necessary fixes before re-enabling the service topology change.
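
The following is a minimal sketch of the kind of mixed-pattern load test mentioned in the first item; the endpoint, call mix, and weights are hypothetical placeholders, not Azure AD's actual test harness.

```python
# Sketch of a mixed-pattern load test (endpoint, call mix, and weights are
# hypothetical): replay several request types concurrently so interleavings
# that only occur under a specific combination of calls get exercised.
import concurrent.futures
import random
import time
import urllib.request

BASE_URL = "https://auth.example.test"       # hypothetical test endpoint
CALL_MIX = ["/token", "/userinfo", "/keys"]  # hypothetical call pattern mix
WEIGHTS = [0.7, 0.2, 0.1]                    # proportions in which calls are replayed

def one_call() -> float:
    """Issue one weighted-random call and return its wall-clock duration."""
    path = random.choices(CALL_MIX, WEIGHTS)[0]
    start = time.perf_counter()
    try:
        urllib.request.urlopen(BASE_URL + path, timeout=5).read()
    except OSError:
        pass                                 # failed calls still count toward latency
    return time.perf_counter() - start

def run(total_requests: int = 500, concurrency: int = 50) -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = sorted(pool.map(lambda _: one_call(), range(total_requests)))
    print(f"p95 latency: {durations[int(0.95 * len(durations))]:.3f}s")

if __name__ == "__main__":
    run()
```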

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.