Azure status history

This page contains all root cause analyses (RCAs) for incidents that occurred on November 20, 2019 or later. Each RCA will be retained on this page for 5 years. RCAs before November 20, 2019 aren’t available.

February 2021

2/26

RCA Pending - Azure Storage and dependent services - Japan East (Tracking ID PLWV-BT0)

In light of new information, the following RCA remains preliminary and our investigation is continuing. We have not changed any of the text below, but we will provide an update once our investigation concludes.

Summary of Impact: Between 03:29 UTC and 10:02 UTC on 26 Feb 2021, a subset of customers in Japan East may have experienced service degradation and increased latency for resources utilizing Azure Storage, including failure of virtual machine disks. Some Azure services utilizing Storage may have also experienced downstream impact.


Root Cause: Multiple contributing factors combined to cause the customer impact.

Firstly, we had an active deployment in progress on a single storage scale unit. Our safe deployment process normally reserves some resources within a scale unit so that deployments can take place. In addition to this space being reserved for the deployment, some nodes in the scale unit entered an unhealthy state and were therefore removed from service within the scale unit. The final factor was that resource demand on the scale unit was unusually high.

In this case, our resource balancing automation was not able to keep up and spread the load to other scale units. A combination of all these factors resulted in a high utilization of this scale unit causing it to be heavily throttled in order to prevent failure. This resulted in a loss of availability for customers and Azure services attempting to utilize Storage resources within the impacted storage scale unit.
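To make the capacity arithmetic concrete, here is a minimal, purely illustrative sketch (not Azure's implementation; all names and thresholds are hypothetical) of how a deployment reservation and unhealthy nodes shrink a scale unit's usable capacity until throttling engages:

```python
from dataclasses import dataclass

@dataclass
class ScaleUnit:
    # All fields and thresholds are hypothetical, for illustration only.
    total_nodes: int
    reserved_for_deployment: int   # capacity held back by the safe deployment process
    unhealthy_nodes: int           # nodes removed from service
    demand: float                  # requested capacity, in node-equivalents

    @property
    def available_nodes(self) -> int:
        return self.total_nodes - self.reserved_for_deployment - self.unhealthy_nodes

    @property
    def utilization(self) -> float:
        return self.demand / max(self.available_nodes, 1)

def throttle_fraction(unit: ScaleUnit, limit: float = 0.85) -> float:
    """Fraction of incoming requests to reject so the effective load stays at `limit`."""
    if unit.utilization <= limit:
        return 0.0
    return 1.0 - (limit / unit.utilization)

# Example: deployment reservation + unhealthy nodes + unusually high demand.
unit = ScaleUnit(total_nodes=100, reserved_for_deployment=10, unhealthy_nodes=15, demand=90)
print(f"utilization={unit.utilization:.2f}, throttle={throttle_fraction(unit):.0%}")
```

In this hypothetical example, demand equivalent to 90 nodes against 75 usable nodes pushes utilization to 1.20, so roughly 29% of requests would be throttled to protect the scale unit, mirroring the combination of factors described above.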

Mitigation: To mitigate customer impact as fast as possible, unhealthy nodes were recovered which restored resources to the service. In addition, engineers took steps to aggressively balance resource load out of the storage scale unit.

Once Storage services were recovered around 06:56 UTC, dependent services started recovering. We declared full mitigation at 10:02 UTC.

Next steps: We sincerely apologize for the impact this event had on our customers. Next steps include but are not limited to:

  • Improve detection and alerting when auto-balancing is not keeping up, to help quickly trigger manual mitigation steps (see the monitoring sketch after this list).
  • Reduce the maximum allowed resource utilization levels for smaller storage scale units to help ensure increased resource headroom in the face of multiple unexpected events.
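As a sketch of the first item, assuming hypothetical thresholds and metric names (not Azure's actual alerting configuration), detection could page an engineer when utilization stays above the auto-balancer's target for a sustained period:

```python
import time

# Hypothetical thresholds; not Azure's actual alerting configuration.
UTILIZATION_LIMIT = 0.85    # auto-balancing should keep utilization below this
SUSTAINED_SECONDS = 600     # alert only if the breach persists this long

def should_page_oncall(samples):
    """samples: list of (unix_timestamp, utilization) readings, oldest first."""
    breach_start = None
    for ts, utilization in samples:
        if utilization > UTILIZATION_LIMIT:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None
    return breach_start is not None and samples[-1][0] - breach_start >= SUSTAINED_SECONDS

# Example: 15 one-minute samples stuck above the limit -> trigger manual rebalancing.
now = time.time()
readings = [(now - 60 * (15 - i), 0.92) for i in range(1, 16)]
print(should_page_oncall(readings))   # True
```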

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

2/16

Azure Front Door - Europe - Timeouts connecting to resources (Tracking ID ZN8_-VT8)

Summary of Impact: Between approximately 12:00 UTC and 13:30 UTC on 16 Feb 2021, a subset of customers using Azure Front Door in Europe may have experienced timeouts and/or issues connecting to resources.

Root Cause: Engineers determined that a backend network device became unhealthy and traffic was not automatically rerouted. This caused Azure Front Door requests to fail.

Mitigation: We manually removed the faulty backend network device and rerouted network traffic. This mitigated the issue.

Stay informed about Azure service issues by creating custom service health alerts.

2/12

RCA - Azure Cosmos DB connectivity issues affecting downstream services in West US region (Tracking ID CVTV-R80)

Summary of Impact: Between February 11, 23:23 UTC and February 12, 04:30 UTC, a subset of customers using Azure Cosmos DB in West US may have experienced issues connecting to resources. Additionally, other Azure services that leverage Azure Cosmos DB may have also seen downstream impact during this time. The Cosmos DB outage affected user application requests to West US. A small subset of customers using Cosmos DB in other regions saw an impact on their replication traffic into West US. Customer impact for Azure Cosmos DB accounts was dependent on the Geo-Replication configurations in place:

  • Accounts with no Geo-Replication: Read and write requests failed for West US
  • Accounts with Geo-Replicated Single-Write + Multiple-Read regions: Read and write requests failed for West US. The Cosmos DB client SDK automatically redirected read requests to a healthy region – an increased latency may have been observed due to longer geographic distances
  • Accounts with Geo-Replicated Multiple Write + Read regions: Read and write requests may have failed in West US. The Cosmos DB client SDK automatically redirected read and write requests to a healthy region – an increased latency may have been observed due to longer geographic distances
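The per-configuration behavior above can be modeled with a small, hypothetical routing sketch (this is not the Cosmos DB SDK; in real applications the SDK performs this redirection based on the account's region list and preferred locations):

```python
# Hypothetical model of the redirection described above; real applications get
# this behaviour from the Cosmos DB client SDK, not from code like this.
OUTAGE_REGION = "West US"

def route_read(read_regions, failed_region=OUTAGE_REGION):
    """Return the first healthy read region, mirroring SDK read redirection."""
    for region in read_regions:
        if region != failed_region:
            return region
    raise RuntimeError("no healthy read region: reads fail")

def route_write(write_regions, failed_region=OUTAGE_REGION):
    """Writes can only be redirected if the account has multiple write regions."""
    healthy = [r for r in write_regions if r != failed_region]
    if not healthy:
        raise RuntimeError("single write region unavailable: writes fail")
    return healthy[0]

# No geo-replication: route_read(["West US"]) and route_write(["West US"]) both raise.
# Single-write + multiple-read: reads redirect (with added latency), writes fail.
print(route_read(["West US", "East US"]))        # -> East US
# Multiple write + read regions: reads and writes both redirect.
print(route_write(["West US", "East US"]))       # -> East US
```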

Root Cause: On February 11, 10:04 UTC (approximately thirteen hours before the incident impact), a Cosmos DB deployment was completed in West US using safe deployment practices; unfortunately, it introduced a code regression that triggered at 23:11 UTC, resulting in the customer impact described above.

A rare failure condition was encountered in the configuration store for one of the West US clusters. The front-end service (which is responsible for routing customer traffic) is designed to handle this condition; due to the code regression, the cluster's front-end service failed to handle it and crashed.

Front-end services for other clusters in the region also call the impacted cluster's front-end service to obtain configuration. Those calls timed out because the service was unavailable, triggering the same unhandled failure condition and the same crash. This cascading effect impacted most West US Cosmos DB front-end services. Cosmos DB customers in the region would have observed this front-end service outage as a loss of availability.
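A minimal sketch of the cascade, using hypothetical names with no relation to the actual Cosmos DB code, contrasting the regressed path (the timeout surfaces as an unhandled failure and the process crashes) with a guarded path that degrades instead:

```python
class ConfigFetchTimeout(Exception):
    """Stand-in for a timed-out cross-cluster configuration call."""

def fetch_remote_config(peer_healthy: bool) -> dict:
    # Stand-in for the call other front-ends make to the impacted cluster.
    if not peer_healthy:
        raise ConfigFetchTimeout("configuration request timed out")
    return {"partition_map": "..."}

def handle_request_regressed(peer_healthy: bool) -> str:
    # The regression: the timeout is never caught, the process crashes, and the
    # caller itself becomes an unavailable peer for the next cluster over.
    fetch_remote_config(peer_healthy)
    return "request routed"

def handle_request_guarded(peer_healthy: bool) -> str:
    # Correct handling: fall back (for example to cached configuration) so the
    # failure stops at this hop instead of cascading.
    try:
        fetch_remote_config(peer_healthy)
    except ConfigFetchTimeout:
        return "request routed from cached configuration"
    return "request routed"

print(handle_request_guarded(peer_healthy=False))    # degrades gracefully
# handle_request_regressed(peer_healthy=False) would raise and take the process down.
```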

Mitigation: Cosmos DB internal monitoring detected the failures and triggered high severity alerts. The appropriate teams responded to these alerts immediately and began investigating. During the triage process, Engineers noted that the configuration store's failure condition (which led to the unhandled error) was uncommon and not triggered in any other clusters worldwide.

The team applied a configuration change to disable the offending code causing the process crashes. Automated service recovery then restored all cluster operations.

Next Steps: We apologize for the impact on affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Expediting the rollout of a hotfix for the Cosmos DB Gateway application to isolate failures for internal metadata requests, reducing the regional and inter-regional impact (see the failure-isolation sketch after this list)
  • Improving Cosmos DB monitoring to detect unhandled failures
  • Improving the Cosmos DB front-end service to remove dependencies on current configuration store in steady-state
  • Improving publicly available documentation, with the intent of providing more straightforward guidance on the actions customers can take with each account configuration type in the event of partial, regional, or availability zone outages
  • Improving Cosmos DB automated failover logic to accelerate failover progress due to partial regional outages
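As an illustration of the first item, a failure-isolation pattern such as a circuit breaker can keep a flaky internal metadata dependency from compounding an outage. This is a generic sketch with hypothetical thresholds, not the actual Cosmos DB Gateway hotfix:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: after `max_failures` consecutive failures the
    dependency is short-circuited for `reset_after` seconds, so callers fail
    fast instead of piling up timeouts. Thresholds are hypothetical."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: metadata dependency isolated")
            self.opened_at = None              # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Usage sketch: wrap the internal metadata lookup so repeated timeouts trip the
# breaker and callers fall back to cached metadata instead of waiting.
breaker = CircuitBreaker()
```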

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

January 2021

1/15

Azure Network Infrastructure service availability issues for customers located in Argentina - Mitigated (Tracking ID DM7S-VC8)

Summary of Impact: Between 17:30 and 20:15 UTC on 15 Jan 2021, customers located in Argentina attempting to access the Azure Portal and/or Azure Resources may have experienced degraded performance, network drops, or timeouts. Customers may also have experienced downstream impact to dependent Azure services due to an underlying networking event.

Preliminary Root Cause: We determined that a network device, affecting network traffic in Argentina, experienced a hardware fault and that network traffic was not automatically rerouted.

Mitigation: We took the faulty network device out of rotation and rerouted network traffic to mitigate the issue.

Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts.

December 2020

12/14

RCA - Azure Active Directory - Authentication errors (Tracking ID PS0T-790)

Summary of impact: Between 08:01 and 09:20 UTC on 14 Dec 2020, a subset of users in Europe might have encountered errors while authenticating to Microsoft services and third-party applications. Impacted users would have seen the error message: “AADSTS90033: A transient error had occurred. Please try again”. The impact was isolated to users who were served through one specific back end scale unit in Europe. Availability for Azure Active Directory (AD) authentication in Europe dropped to a 95.85% success rate during the incident. Availability in regions outside of Europe remained within the Service Level Agreement (SLA).
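Since AADSTS90033 explicitly asks the caller to try again, client-side handling typically amounts to retrying with backoff. The sketch below is generic and hypothetical; it does not use MSAL or any specific authentication library:

```python
import random
import time

class TransientAuthError(Exception):
    """Stand-in for an AADSTS90033 ('transient error, please try again') response."""

def acquire_token_with_retry(acquire, attempts: int = 4, base_delay: float = 0.5):
    """Retry a token-acquisition callable with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return acquire()
        except TransientAuthError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Example with a fake acquirer that fails twice before succeeding.
calls = iter([TransientAuthError(), TransientAuthError(), "token-abc"])
def fake_acquire():
    result = next(calls)
    if isinstance(result, Exception):
        raise result
    return result

print(acquire_token_with_retry(fake_acquire))   # -> token-abc
```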

Root Cause: The Azure AD back end is a geo-distributed and partitioned cloud directory store. The back end is partitioned into many scale units, with each scale unit having multiple storage units distributed across multiple regions. Request processing for one of the back end scale units experienced high latency and timeouts due to high thread contention. The thread contention was triggered on that scale unit by a particular combination of requests together with a service topology change that had recently been rolled out to it.

Mitigation: To mitigate the problem, engineers updated the backend request routing to spread the requests to additional storage units. Engineers also rolled back the service topology change that triggered high thread contention.
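The routing part of the mitigation can be pictured with a small hash-based sketch; the names, hashing scheme, and storage-unit identifiers are all hypothetical and are not how Azure AD actually routes requests:

```python
import hashlib

def pick_storage_unit(request_key: str, storage_units: list[str]) -> str:
    """Deterministically spread requests across the available storage units."""
    digest = hashlib.sha256(request_key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(storage_units)
    return storage_units[index]

# Before the routing update only one hot unit served these requests; afterwards
# the same keys can land on any of the additional storage units.
before = ["unit-eu-1"]
after = ["unit-eu-1", "unit-eu-2", "unit-eu-3"]
for user in ["alice@contoso.com", "bob@contoso.com", "carol@contoso.com"]:
    print(user, "->", pick_storage_unit(user, before), "|", pick_storage_unit(user, after))
```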

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Augment existing load testing to validate the combination of call patterns that caused the problem.
  • Further investigate the root cause of the thread contention and make the necessary fixes before re-enabling the service topology change.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.