
Azure status history

This page contains root cause analyses (RCAs) for all incidents that occurred on or after 20 November 2019. Each RCA is retained on this page for 5 years. RCAs from before 20 November 2019 are not available.


January 2022

1/13

RCA - Azure Resource Manager - Issues with management and resource operations (Tracking ID 8V39-P9Z)

Summary of Impact: Between 09:00 UTC on 13 Jan 2022 and 20:00 UTC on 14 Jan 2022, a subset of customers using Azure Resource Manager (ARM) to deploy, modify, or remove Azure resources experienced delays, timeouts, and failures, most visible for long-running operations executed on the platform. Impact was most severe for a period of 5 hours starting at 15:30 UTC on 13 Jan and another period of 8 hours starting at 00:00 UTC on 14 Jan, in regions including but not limited to West US, West US 2, South Central US, North Europe, West Europe, East Asia and Southeast Asia.

Impact to customers was broad, as numerous Azure services rely on service management operations orchestrated by the ARM platform. Most customers experienced delays and timeouts, but many also saw deployment or resource management failures.

Root Cause: A code modification that started rolling out on 6 Jan 2022 exposed a latent defect in the infrastructure used to process long-running operations (informally, "jobs"). The code modification resulted in an exception for a tiny fraction of job executions, each of which disabled a small part of the job execution infrastructure. Over the course of hours, job executions shifted entirely away from the regions that had received the new code to their backup paired regions. For a period of 16 hours there was no customer impact, as the backup paired regions executed the jobs as intended. The impact spread to the backup paired regions as the new code was deployed to them, resulting in jobs queuing up, latency delays, and timeouts. In some cases, the jobs executed with such prolonged delays that they were unable to succeed, and customers will have seen failures in these cases.
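As a purely illustrative sketch of the failure pattern described above (hypothetical names and structure, not ARM's actual job infrastructure), the latent defect behaves like a worker that disables its whole partition after a single unhandled job exception and quietly hands the remaining work to its paired-region backup:

    # Illustrative only: a hypothetical job-partition worker in which one rare
    # job failure disables the whole partition instead of failing just that job.
    class JobPartition:
        def __init__(self, name, backup):
            self.name = name
            self.backup = backup      # hypothetical paired-region partition
            self.enabled = True

        def run(self, jobs):
            for job in jobs:
                if not self.enabled:
                    # Work silently drains to the paired region, so local
                    # telemetry looks quiet rather than unhealthy.
                    self.backup.run([job])
                    continue
                try:
                    job()
                except Exception:
                    # Latent defect: disable the partition on any exception.
                    self.enabled = False
                    self.backup.run([job])

A safer pattern would record the per-job error, keep the partition enabled, and alert when the error rate crosses a threshold.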

As a result of the way that the job execution infrastructure was implemented, the compounding failures were not visible in our telemetry - leading to engineers misidentifying the cause initially and attempting mitigations which did not improve the underlying health of the service. The consequence of this was a second period of impact starting at 00:00 UTC on 14 Jan 2022 and extending for approximately 8 hours.

Mitigation: Identifying the source of the problems in this case took time, as some parts of the job infrastructure remained healthy and continued processing jobs while other key parts were being disabled. At the time we were unable to clearly identify the newly released code as correlating with the impact we were seeing. When the nature of the problem became clear we immediately started to roll back to a previous build. This change was rolled out progressively and completed at 20:00 UTC on 14 Jan 2022.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Reviewing and improving our monitoring and alerting strategy for our job execution infrastructure to improve our ability to detect problems like this one before they become customer-impacting.
  • Fixing the underlying problem which allows a single rare exception to disable parts of the job execution infrastructure.
  • Providing better visibility for operators when a paired region has assumed responsibility for job execution, in order to indicate a reduced-redundancy state and signal the need to pause or roll back a deployment.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

1/4

Azure Cosmos DB - East US (Tracking ID 9VT8-HPG)

Summary of Impact: Between 12:30 UTC on 04 Jan 2022 and 07:41 UTC on 05 Jan 2022, customers with Azure Cosmos DB accounts in East US may have experienced connectivity and service availability errors while accessing their Cosmos DB databases. One Cosmos DB cluster in the East US region was unavailable during this time, so both new and existing connections to databases in this region may have resulted in errors or timeouts.

Root Cause: Cosmos DB uses Azure Service Fabric as the underlying platform for providing fault tolerance in the cluster. Service Fabric uses a ring topology, and each node establishes a lease relationship with the nodes in its proximity (i.e. its neighborhood) to detect failures. A set of nodes, known as Arbitrators, is responsible for determining the cluster membership of other nodes. A node that fails to refresh its lease within a timeout period is reported by its neighbors, and the Arbitrators then determine whether the node should leave the cluster. This check is done in a timer callback.
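A minimal, self-contained sketch of the lease-and-arbitration flow described above may help; all names, data structures, and the timeout value below are hypothetical simplifications, not Service Fabric's actual implementation:

    import time

    LEASE_TIMEOUT = 30.0  # illustrative value only

    class Arbitrator:
        """Decides whether a reported node should leave the cluster."""
        def __init__(self):
            self.members = set()
            self.last_heartbeat = {}   # node_id -> last time the node proved liveness

        def heartbeat(self, node_id):
            self.members.add(node_id)
            self.last_heartbeat[node_id] = time.monotonic()

        def report_suspect(self, reporter_id, suspect_id):
            # Evict the suspect only if it has not proven liveness in time.
            last_seen = self.last_heartbeat.get(suspect_id, 0.0)
            if time.monotonic() - last_seen > LEASE_TIMEOUT:
                self.members.discard(suspect_id)

    class Node:
        """Holds leases with its ring neighbors and reports expired leases."""
        def __init__(self, node_id, neighbor_ids):
            self.node_id = node_id
            self.lease_renewed_at = {n: time.monotonic() for n in neighbor_ids}

        def renew_lease_from(self, neighbor_id):
            self.lease_renewed_at[neighbor_id] = time.monotonic()

        def on_lease_timer(self, arbitrator):
            # Timer callback: report any neighbor whose lease has expired and
            # let the arbitrator decide whether it should leave the cluster.
            now = time.monotonic()
            for neighbor_id, renewed_at in self.lease_renewed_at.items():
                if now - renewed_at > LEASE_TIMEOUT:
                    arbitrator.report_suspect(self.node_id, neighbor_id)

The incident described next hinged on this timer callback firing more often than intended.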

During this incident, the timer callback on one of the nodes fired multiple times at a frequency higher than intended. This resulted in the node's neighbors being incorrectly reported as unavailable. By design, the Arbitrators trusted this information, as they did not receive any healthy uptime notification within the stipulated time frame. This continued until quorum was lost and the cluster eventually went down. The cluster came back up once the culprit node was manually rebooted as part of the mitigation efforts.

Mitigation: After the initial investigation, the cluster was marked as offline at 14:08 UTC on 04 Jan 2022 which triggered regional failover for accounts that had multiple regions and automatic failover enabled. Customers that did not have automatic failover enabled continued to be impacted until the cluster was recovered.

The cluster was recovered by rebooting the Service Fabric infrastructure nodes after removing the culprit node. However, recovery of the cluster was delayed due to overload of the configuration store as the service was restarting. Cosmos DB engineers initially tried to reduce the load on the configuration store by delaying the startup of about 20% of the nodes. This approach did not fully resolve the problem. Engineers then manually applied configuration changes to increase the timeout on the requests used to fetch data from the configuration store. This change allowed the recovery to make continuous progress. Availability to the cluster was incrementally restored as service back-end processes started running. Recovery was completed at 07:41 UTC on 05 Jan 2022.


Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Service Fabric team to develop a fix to improve resilience in case of misfired timer(s) reporting incorrect node health status within Azure Service Fabric.
  • Azure Cosmos DB to improve monitoring to better identify culprit nodes early on if this failure pattern reoccurs.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

December 2021

12/16

Mitigated - Microsoft Graph API (Tracking ID ZN7Y-5DG)

Summary of Impact: Between 14:00 UTC and 17:30 UTC on 16 December 2021, a subset of users in North America may have experienced performance issues and timeout errors with Microsoft Graph APIs.

Preliminary Root Cause: We determined that the scale-out process was affected by an internal infrastructure issue that led to high CPU usage. This resulted in the performance issues and timeout errors with Microsoft Graph APIs.

Mitigation: To mitigate the issue we redistributed traffic to other healthy regions in North America.

Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts.


12/16

RCA - Azure Active Directory - Experiencing sign in issues when attempting to access Azure, Dynamics 365, and/or Microsoft 365 Services (Tracking ID S_3M-FZZ)

Summary of impact: Between 01:00 UTC and 02:25 UTC on 16 December 2021, Azure Active Directory (Azure AD) users may have experienced impact when accessing Microsoft 365, Dynamics 365 and Azure services.

Customers using desktop and mobile applications, such as Microsoft Teams, with their work or school accounts experienced minimal disruption as they were automatically routed to the Azure AD Backup Authentication service. 

Some Azure AD B2C users may have experienced impact between 01:00 UTC and 02:25 UTC.

Some Microsoft Account users, using Outlook on iOS to access their email, may have experienced impact between 00:11 UTC and 02:25 UTC.

Root Cause: A regular operating system (OS) update was deployed to endpoints of a backend service in the Azure AD authentication stack, which interacted with the service in an unexpected way, making the service on the updated endpoints unresponsive. As the update rollout progressed, all redundant endpoints were impacted, at which point the service became unavailable, on 16 December 2021 at 00:11 UTC.

All changes, including this one, follow the safe deployment process (SDP), with automated health monitoring in place that is meant to stop a rollout in the event of an issue. In this case, due to a gap in the backend service’s health monitoring, the update rollout was not stopped until all redundant endpoints were impacted. As the backend service became unavailable, this issue started to manifest as sign-in failures for a subset of Microsoft Accounts (personal accounts).
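A staged-rollout gate of the kind an SDP relies on can be pictured roughly as follows; this is a hypothetical sketch, not Azure's deployment tooling, and the threshold is an assumed value:

    # Hypothetical staged-rollout gate: the update only proceeds to the next
    # group of endpoints if the already-updated endpoints still look healthy.
    # A gap in the health signal is what lets a bad update reach every endpoint.
    def staged_rollout(endpoint_groups, apply_update, healthy_fraction, threshold=0.99):
        updated = []
        for group in endpoint_groups:
            for endpoint in group:
                apply_update(endpoint)
                updated.append(endpoint)
            if healthy_fraction(updated) < threshold:
                return f"halted after {len(updated)} endpoints"  # stop and roll back
        return "rollout complete"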

The Azure AD authentication stack is designed with circuit-breakers that isolate failures in service dependencies, which in this case limited the impact to Microsoft Accounts. However, due to a second latent issue in one of the circuit-breakers, requests that were queuing up as a result of sign-in failures were not limited by this circuit breaker, leading to sign-in failures for a subset of users of Azure AD and Azure AD B2C.
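For illustration only (hypothetical names and thresholds, not the Azure AD implementation), a circuit breaker that bounds both consecutive failures and queued work looks roughly like the sketch below; the latent issue described above is the equivalent of the in-flight bound being missing:

    class CircuitBreaker:
        """Illustrative breaker: fails fast when a dependency keeps failing or
        when too many requests are already queued behind it."""
        def __init__(self, failure_threshold=5, max_in_flight=100):
            self.failure_threshold = failure_threshold
            self.max_in_flight = max_in_flight
            self.consecutive_failures = 0
            self.in_flight = 0
            self.open = False

        def call(self, dependency):
            # Reject fast when the breaker is open or too much work is queued,
            # so failures in one dependency do not back up the whole stack.
            if self.open or self.in_flight >= self.max_in_flight:
                raise RuntimeError("circuit open: failing fast")
            self.in_flight += 1
            try:
                result = dependency()
                self.consecutive_failures = 0
                return result
            except Exception:
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    self.open = True
                raise
            finally:
                self.in_flight -= 1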

During this outage, we failed to communicate on the specific impact to Azure AD B2C, in particular marking it as impacted on the Azure Status Page, due to a coordination issue. In addition, our first notification to impacted customers was delayed.

Mitigation:

01:00 UTC – The backup authentication service was automatically activated and started taking traffic. 

02:25 UTC – The backend service was restored.

04:34 UTC – Traffic to the backup authentication service was routed back to the primary authentication infrastructure.

05:44 UTC – The primary authentication infrastructure was monitored to ensure service stability. Further changes to the backend service were disabled and the incident was declared fully mitigated.

Next steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

• Disable all infrastructure changes including OS updates to the active capacity of the backend service – Complete 

• Perform changes to the backend service to help reduce time to detect and recover in similar scenarios – Complete

• Improve the communications process to help account for coverage of all impacted services and scenarios - Complete

• Deploy a change to the circuit breakers to fix the discovered latent issue and help minimize impact caused by this backend service – Complete 

• Investigate and solidify a plan for additional process and tooling improvements to help reduce the notification time for impacted customers – January 2022

• Continue to expand coverage of the Backup authentication service – July 2022


12/13

RCA - Azure Data Factory V2 - West Europe (Tracking ID 8T9M-T9G)

Summary of Impact: Between approximately 06:30 UTC and 12:30 UTC on 13 December 2021, a subset of customers may have experienced intermittent errors accessing Azure Data Factory (ADF) resources in West Europe.

Root Cause: We determined that a backend service responsible for processing API requests became unhealthy. Retry logic from ADF, coupled with this unhealthy service, resulted in a rare combination of transient conditions and led to additional errors. This resulted in intermittent failing API calls for Azure Data Factory resources.
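As a hedged illustration of why retry behaviour matters here (not ADF's actual retry policy), capping the number of attempts and adding jittered exponential backoff is the usual way to keep retries from adding load to a backend that is already unhealthy:

    import random
    import time

    # Illustrative retry helper: exponential backoff with full jitter and a
    # bounded number of attempts, so a struggling backend is not hammered.
    def call_with_retries(operation, max_attempts=4, base_delay=0.5, max_delay=8.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts:
                    raise
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(random.uniform(0, delay))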

Mitigation: We restarted the backend service which mitigated the issue.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Implement an early alert system to notify on failures at this level
  • Create a repair item and work with the backend service team to create a solution to help prevent future occurrences

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

12/9

RCA - Errors connecting to SCM App Service and/or Azure Functions deployments (Tracking ID SNCJ-9ZG)

Summary of Impact: Between approximately 21:00 UTC on 09 Dec 2021 and 00:23 UTC on 10 Dec 2021, a subset of customers using App Service and/or Azure Functions may have encountered issues with deployments using web deploy, managing WebJobs, connecting to Web SSH console, executing certain Diagnostic Tools using Azure portal or while accessing SCM endpoint. App Service resources utilizing other deployment methods were not affected. Standard App Service control plane operations and runtime were not impacted.

Root Cause: As part of introducing certain resiliency measures, a configuration change was implemented to create a redundant Domain Name Resolution zone for App Service SCM endpoints. However, due to a misconfiguration, this change incorrectly affected the name resolution for App Service SCM endpoint, which resulted in a subset of our customers not being able to connect to the SCM endpoints of the service. This misconfiguration was limited in scope to the SCM endpoint and no other data plane or control plane operations experienced any failures.

Mitigation: Engineering was engaged upon receiving reports of failure. In order to mitigate the issue, the misconfigured state of the zone was corrected and persisted for SCM endpoints. Most customers observed resolution within 20 - 40 minutes from when the mitigation was applied. Resolution times varied for customers depending on the Time To Live (TTL) setting of the various networking devices in their traffic pipelines.
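The TTL dependence mentioned above can be checked from the client side. The sketch below assumes the third-party dnspython package and uses a placeholder hostname; the remaining TTL on a cached answer bounds how long resolvers may keep serving the old record after the zone is corrected:

    import dns.resolver  # assumes the third-party dnspython package is installed

    # Placeholder hostname; substitute an actual <app>.scm.azurewebsites.net name.
    answer = dns.resolver.resolve("example.scm.azurewebsites.net", "A")
    print("record:", answer.rrset)
    print("worst-case cache lifetime (seconds):", answer.rrset.ttl)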

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Adding automated detection for SCM endpoint related failures.
  • Adding improved validation for zone configuration updates.
  • Enhancing review measures for critical network configurations.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

November 2021

11/16

RCA - Azure Active Directory - Issues when attempting to authenticate (Tracking ID SVT2-VCZ)

Summary of Impact: Between 20:20 UTC on 16 Nov 2021 and 05:00 UTC on 17 Nov 2021, a subset of customers using Azure Active Directory B2C in the Australia region may have encountered failures while attempting to authenticate. These attempts may have succeeded on retry.

Root Cause: The engineering team has determined that an underlying code defect triggered some service nodes to experience a shortage of outbound ports to complete network tasks. This in turn caused an increase in service latency and timeouts.
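By way of a hedged example (not the affected service's code), the usual client-side way to reduce outbound-port pressure is to reuse pooled connections instead of opening a new connection per request. The sketch below assumes the third-party requests package and a placeholder URL:

    import requests  # assumes the third-party requests package

    URL = "https://example.com/api"  # placeholder endpoint

    def fetch_many_without_reuse(n):
        # Each call opens a new connection, consuming a fresh outbound (SNAT) port.
        for _ in range(n):
            requests.get(URL, timeout=5)

    def fetch_many_with_reuse(n):
        # A pooled session reuses established connections, keeping the number of
        # outbound ports roughly constant under steady load.
        with requests.Session() as session:
            for _ in range(n):
                session.get(URL, timeout=5)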

Mitigation: As an immediate mitigation, traffic was migrated away from the affected nodes into healthy nodes in a different environment. The engineering team also proactively increased capacity in this environment.

Next Steps: We sincerely apologize for the impact to affected customers. We are currently conducting an internal review to take additional steps to continuously improve the Microsoft Azure platform and our processes to ensure such issues do not occur in the future. In this case, this includes (but is not limited to):

• Enhancement of service capacity and automatic scaling rules to include additional dependencies such as outbound port connections

• Improvement of detection logic for cases where long-running network calls affect customer-facing performance

• An overall service capacity review in the affected region, and worldwide

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

11/12

Microsoft Graph - Intermittent 400-level errors accessing Graph APIs - Mitigated (Tracking ID PLT7-RTZ)

Summary of Impact: Between 02:00 UTC on 12 Nov 2021 and 17:00 UTC on 15 Nov 2021, a subset of customers primarily located in the North America and APAC geographies may have encountered intermittent 400-level errors when attempting to access Microsoft Graph APIs.

Preliminary Root Cause: We determined that a recent update to improve the underlying Microsoft Graph API infrastructure created a configuration issue between the Microsoft Graph API interface and its underlying Internet Information Services driver. This configuration issue prevented calls to various APIs from completing as expected.

Mitigation: We failed-over our service to a previously-known healthy state and rolled back the update to mitigate impact.

Next steps: We sincerely apologize for the impact to affected customers. We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts.

11/9

RCA - Intermittent Failures When Accessing or Using Access Panel (Tracking ID DK83-BDZ)

Summary of Impact: Between 14:03 UTC and 19:28 UTC on Nov 9 2021, customers using Azure Active Directory’s Access Panel whose traffic was routed through West Central US and Central US may have experienced issues when attempting to access or use Access Panel functionality. Users may have experienced intermittent failures when attempting the following operations: 

  • Launching Single Sign-On (SSO) applications from My Apps or when using direct sign-on links
  • Registering for Multi-Factor Authentication (MFA)
  • Self-Service Password Reset
  • Performing self-service management of groups
  • Accepting terms of use agreements

This incident had no impact on authentication and MFA scenarios other than MFA registration, nor on authentication for applications that were not launched through My Apps or direct sign-on links.

Root Cause: On 09 Nov 2021 between 03:00 and 05:30 UTC, a code change to Access Panel was deployed that introduced an unrelated bug impacting a small subset of customers. The bug was not caught in testing and was instead discovered during telemetry validation in the first deployment stage in West Central US, at which point we failed traffic over from West Central US to Central US at 07:00 UTC to prevent customers from experiencing it. Based on the information available at the time, failing over traffic was deemed the safest and fastest recovery alternative until a fix could be safely deployed the following morning.

Failing over is a standard operating procedure exercised on at least a monthly basis; however, in this case we made the decision to keep the service in a failed-over state while working on a fix to be deployed the following morning. Since this procedure is regularly tested and is an order of magnitude faster than rolling back the change, we determined this path to be the best course of action and did not expect any issues. The difference in this instance was that the failover happened during off-business hours and persisted into business hours, which meant that at the point of failover the Central US datacenter was scaled to handle only the low volume of off-business-hours traffic.

As we reached business hours for the region, at 14:03 UTC on 09 Nov 2021, we started seeing traffic rise quickly for the Access Panel service in Central US, at a pace that exceeded what the auto-scaling configuration, which our service relies on to react to traffic fluctuations, could handle. Our auto-scaling configuration was not equipped to provision capacity at the pace required to keep up with the combined rate of traffic increase from the failover and the incoming morning peak. As a result, customer requests to the service timed out and failed.
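A rough back-of-envelope sketch with entirely hypothetical numbers (not actual Access Panel figures) shows how a scale-out rate can fall behind a traffic ramp when a region starts from its off-hours footprint:

    # All numbers below are assumptions for illustration only.
    REQUESTS_PER_INSTANCE = 100   # sustainable requests/sec per instance
    STARTING_INSTANCES = 10       # off-hours footprint
    PROVISIONING_RATE = 2         # instances the autoscaler adds per minute

    def demand_rps(minute):
        # Combined failover traffic plus morning ramp, plateauing at peak.
        return min(6000, 1000 + 500 * minute)

    for minute in range(0, 31, 5):
        capacity = (STARTING_INSTANCES + PROVISIONING_RATE * minute) * REQUESTS_PER_INSTANCE
        demand = demand_rps(minute)
        status = "OK" if capacity >= demand else "overloaded (requests time out)"
        print(f"t+{minute:2d} min: capacity {capacity:5d} rps, demand {demand:5d} rps -> {status}")

With these assumed numbers, capacity does not catch up with demand until roughly 25 minutes after the ramp begins, and that gap is the window in which requests time out.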

The issue was detected by automation at 14:09 UTC on 09 Nov 2021 and engineers were engaged at 14:14 UTC. While investigating and validating the scope of impact, communications were delayed, with the first notification being sent at 15:25 UTC.

While the incident was ongoing, we inaccurately scoped the customer impact in our communications to only applications launched through the My Apps portal, without calling out direct sign-on links used outside of My Apps. Upon further investigation, we have also found that a workaround could have been possible for a subset of impacted scenarios, where, for applications using sign-on methods other than SAML, customers could have accessed the application’s site directly.

Mitigation: Impact was mitigated by rebalancing traffic and manually scaling out the service in the impacted regions, West Central US and Central US. Since the Access Panel service in West Central US was previously auto-scaled down, due to no incoming traffic following the service failover, the scaling out operation took a more significant amount of time. The scale out operations completed at 18:40 UTC on 09 Nov 2021 and mitigated impact for the vast majority of customers. Complete mitigation was accomplished when deployment rollback completed at 19:28 UTC.

Next steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Adjust provisioned capacity and autoscaling configuration for the Access Panel service in all regions to support failover scenarios during peak hours. (To be completed in November 2021)
  • Update Access Panel standard operating procedures to include rapidly increasing capacity during scale-out operations to help shorten mitigation times for similar scenarios. (To be completed in November 2021)
  • Update test cases to include missing scenarios as surfaced in this incident. (To be completed in November 2021)
  • Improve Access Panel tooling and processes to enable rapid rollbacks of configuration changes. (To be completed in December 2021)
  • Improve standard operating procedures for notifying customers to drive down time to notify for this class of incident. (To be completed in March 2022)
  • Plan to onboard the Access Panel service to automated communication to help drive down time to notify for this class of incidents. (To be completed in March 2022)
  • Adjust standard operating procedures to help improve the quality of communications, including details around scope of impact and potential workarounds. (To be completed in March 2022)

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: