Azure status history

This page contains all RCAs for incidents that occurred on November 20th, 2019 or later and will - from that date forward - provide a 5-year RCA history. RCAs prior to November 20th, 2019 are not available.

July 2020

18/7

RCA - Azure DNS - Connectivity issues (Tracking ID TTPY-3P0)

SUMMARY OF IMPACT: Between 07:50 and 08:45 UTC (approx.) on 18 Jul 2020, Azure DNS experienced a transient resolution issue which in turn impacted connectivity for some other Azure services. Authoritative and other DNS services were not impacted by this issue.

ROOT CAUSE: The decommissioning of a legacy (preview) DNS solution inadvertently caused some data streams for the Azure DNS recursive resolver service to become out of sync with the resolver state. This was detected by a sync pipeline, which triggered a reset of the resolver instances to recover from the stale state. Unfortunately, this reset was not done in a staggered fashion and led to multiple resolver instances rebooting at the same time. This in turn led to degradation of the service and caused DNS resolution failures for queries originating from virtual networks. Azure services dependent on the Azure DNS resolver service also saw degradation of service during this time.

The impact of the incident was observed across multiple Azure regions to varying degrees. While some instances of the service saw no impact, most impacted instances auto-recovered within 10 minutes, though some instances took up to 30 minutes to recover. The DNS resolution issues were fully auto-mitigated across all regions within 54 minutes. During this time, authoritative Azure DNS service was not impacted and DNS queries originating from the internet for zones hosted on Azure DNS were answered successfully.

MITIGATION: The issue self-healed as the restarts completed, and all services with dependencies on the recursive DNS service would also have seen their functionality restored. The DNS service was fully mitigated at 08:45 UTC, but some services with multiple dependencies may have taken longer to fully restore for all customers.

NEXT STEPS: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
  • Fixing the orchestration logic in the sync pipeline to help ensure that resolver instances are reset in a staggered, partitioned fashion (a simplified sketch follows this list)
  • Improving the resolver startup sequence to help ensure that a resolver instance can be up and running within 10 minutes after a reset
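
For illustration only, the sketch below (in Python, with hypothetical reset_instance and is_healthy helpers rather than Azure's actual orchestration code) shows the difference between a staggered, partitioned reset and the simultaneous reset that caused this incident: instances are reset in small batches, and each batch must report healthy before the next one is touched.

    import time

    # Staggered, partitioned reset (illustrative): reset a small batch at a
    # time and wait for it to become healthy before moving on, so most of
    # the fleet keeps answering queries throughout the rollout.
    def staggered_reset(instances, reset_instance, is_healthy,
                        batch_size=2, health_timeout_s=600):
        batches = [instances[i:i + batch_size]
                   for i in range(0, len(instances), batch_size)]
        for batch in batches:
            for instance in batch:
                reset_instance(instance)          # reset only this partition
            deadline = time.time() + health_timeout_s
            while not all(is_healthy(instance) for instance in batch):
                if time.time() > deadline:
                    raise RuntimeError(f"batch {batch} did not recover; halting rollout")
                time.sleep(5)                     # pause before re-checking health

    # The failure described above is equivalent to calling reset_instance()
    # on every instance at once, leaving no healthy resolvers to serve
    # queries while they all reboot.
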
PROVIDE FEEDBACK: Please help us improve the Azure customer communications experience by taking our survey:

7/7

RCA - Virtual Machines - Virtual machine unexpected restarts (Tracking ID 8S8J-9T8)

Summary of impact: Between 07:24 UTC on 07 Jul 2020 and 21:16 UTC on 17 Jul 2020, a subset of customers using Virtual Machines (VMs) may have experienced intermittent connection failures when trying to access some virtual machines. These virtual machines may have also restarted unexpectedly.

Root cause: We determined that an ongoing OS update deployment task inadvertently contained a code configuration error that resulted in a number of previously addressed bug fixes being reverted on a subset of clusters. This manifested as a system deadlock on a subset of host nodes that were running VM workloads with heavy disk I/O. As a result, VMs on those nodes rebooted.

Mitigation: We stopped the ongoing deployment and subsequently developed and rolled out a new deployment task which contained a code fix to detect that a new patch needed to be applied. This fix was deployed to all impacted clusters, thereby mitigating the VM reboots and customer impact.

In parallel to deploying the permanent fix across all regions, we expedited mitigation for some customers by identifying affected nodes that were hosting the customers' VM workloads and reattaching patches to those nodes.

Next Steps: We understand that the time to mitigate for this incident was longer than desired, and we sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help reduce the duration of such incidents. This includes (but is not limited to):

  • Incorporating the missed combination of software versions and system configurations in our validation matrix before deploying similar updates (a simplified sketch follows this list).
  • Improving rollout monitoring and correlation capabilities to detect such faults and halt the rollout sooner.
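
The first item can be sketched as follows; the version labels, configuration profiles, and validate() placeholder below are purely illustrative and not the matrix Azure uses internally. The point is simply that every software version is exercised against every system configuration, including the heavy disk I/O profile that surfaced the deadlock, before a rollout proceeds.

    from itertools import product

    # Illustrative validation matrix: enumerate every combination of update
    # version and host configuration so that none is missed before rollout.
    os_update_versions = ["update-current", "update-next"]      # hypothetical build labels
    host_configurations = ["standard-io", "heavy-disk-io"]      # hypothetical cluster profiles

    def validate(version, configuration):
        """Placeholder for the actual pre-deployment test of one combination."""
        print(f"validating {version} on {configuration}")
        return True

    results = {(v, c): validate(v, c)
               for v, c in product(os_update_versions, host_configurations)}
    if not all(results.values()):
        raise SystemExit("halt rollout: at least one combination failed validation")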

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

1/7

RCA - Azure SQL Database - Japan East (Tracking ID CLCK-LD0)

Summary of Impact: Between 09:24 and 11:15 UTC on 01 Jul 2020, a subset of customers using Azure SQL Database, Azure SQL Data Warehouse/Synapse Analytics, Azure Database for MySQL, Azure Database for PostgreSQL, and Azure Database for MariaDB in Japan East may have experienced service connection failures or possible timeouts. Services utilizing SQL Databases may have also been impacted.

Root Cause: Connections to Azure SQL Database and related data services go through a load balanced set of front-end nodes (Gateways) that provide directory lookup services and reroute the incoming connections to the intended backend nodes hosting the database. For scalability and zone redundancy purposes, there are multiple active SQL Gateway clusters in a region. During this incident, one of the SQL Gateway clusters became unhealthy, having an intermittent impact on login availability. A specific network traffic pattern combined with a networking stack configuration on the SQL Gateway instances triggered an imbalance on the CPU processing of new connection requests. The persistence of such CPU imbalance over a long period of time caused high response latency and increased timeouts on connection requests. The error condition propagated across multiple instances of the SQL Gateway cluster in this region, sometimes causing a service restart.
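
The gateway pattern above can be sketched as follows; the directory table, database name, and forward callback are hypothetical and stand in for the real directory lookup and connection rerouting that the SQL Gateway performs.

    # Simplified front-end gateway flow (illustrative names only): look up
    # which backend node hosts the requested database, then reroute the
    # incoming connection to it.
    DIRECTORY = {"contoso-db": "backend-node-17"}   # directory lookup table (hypothetical)

    def handle_connection(database_name, forward):
        backend = DIRECTORY.get(database_name)
        if backend is None:
            raise LookupError(f"unknown database: {database_name}")
        # New-connection processing runs on the gateway's own CPUs; if that
        # work lands unevenly across cores (the imbalance described above),
        # latency and timeouts build up even though the backends are healthy.
        return forward(backend, database_name)

    # Example usage with a stand-in for the code that proxies the session.
    handle_connection("contoso-db", lambda node, db: f"{db} rerouted to {node}")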

Mitigation: Multiple SQL Gateway instances became healthy upon the triggered service restart. On further investigation, we were able to isolate the specific network pattern and the configuration setting that caused this incident and were able to reconfigure the traffic to prevent a recurrence. 

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Fix the underlying issue that causes a service restart when such a condition occurs.
  • Improve the alerting logic and add the identified telemetry to diagnose this kind of issue faster.
  • Activate a newer SQL Gateway cluster in this region with a more efficient networking stack configuration that reduces the chances of hitting a processing imbalance.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

June 2020

14/6

RCA - Azure Active Directory - Authentication Errors (Tracking ID PMHH-NS0)

Summary of Impact: Between 23:00 UTC on 14 Jun 2020 and 01:40 UTC on 15 Jun 2020, a subset of customers using Azure Active Directory may have experienced authentication issues when accessing resources. Customers may have received the following error message “AADSTS90033: A Transient error has occurred. Please try again.”
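
Because AADSTS90033 is explicitly transient, client-side retry with exponential backoff is normally enough once capacity recovers. The sketch below assumes a hypothetical acquire_token() callable and TransientAuthError exception rather than any specific authentication library.

    import random
    import time

    class TransientAuthError(Exception):
        """Raised by the hypothetical acquire_token() for errors such as AADSTS90033."""

    def acquire_token_with_retry(acquire_token, max_attempts=5):
        # Retry transient token-acquisition failures with exponential backoff
        # plus jitter; any other failure propagates immediately.
        for attempt in range(max_attempts):
            try:
                return acquire_token()
            except TransientAuthError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep((2 ** attempt) + random.uniform(0, 1))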

Root Cause: An unexpected increase in traffic volume and resource utilization of the infrastructure responsible for acquiring authentication tokens in the region caused regional contention that exceeded operational thresholds, resulting in authentication issues for a subset of customers.

Mitigation: The backend infrastructure was scaled out to increase resources and traffic was redistributed.

Next Steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. This includes (but is not limited to):

  • Improvements to the systems that dynamically scale resources in this scenario, along with corresponding monitoring
  • Improvements to monitoring to better detect increases in traffic for similar scenarios
  • Review and enhance procedures to improve customer communication experience

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

11/6

RCA - Storage - East US (Tracking ID 9VHK-J80)

Summary of impact: Between 11:57 and 14:20 UTC on 11 Jun 2020, a subset of Storage customers in East US may have experienced connection failures when trying to access some of their resources hosted in this region. Services with dependencies on the impacted storage resources, such as Virtual Machines, may also have experienced downstream impact during this time.

Root Cause: Engineers determined that an incident during a planned power maintenance activity at the datacenter impacted a single storage scale unit, which then became unhealthy. The incident caused power to be lost to a subset of racks comprising 60% of this storage scale unit.

The maintenance activity itself did not impact the storage scale unit, but it caused the scale unit to have reduced redundant power options at the time of the incident. All racks and network devices have two sources of power for redundancy, but it is standard procedure in some types of maintenance to isolate some resources to a single source of power for a short period. After the isolation had been completed on this scale unit, but before maintenance could begin, a distribution breaker in the redundant power source tripped open unexpectedly and the power was lost to the subset of racks.

Mitigation: The site engineers paused all maintenance work and inspected the electrical distribution system to ensure there were no apparent equipment faults. They found the tripped breaker and determined that it had failed. Power was restored by closing the other breaker that had previously been opened to commence the isolation for the scale unit, which restored a single power source to the impacted racks. A new breaker was located and fully tested before installation. The failed breaker on the redundant power supply was replaced with the new breaker, and redundant power was then also restored to the affected racks.

Once power was restored to the impacted storage racks, the automated restart process for storage resources began, and restored the scale unit to full operation. The restart process for storage clusters follows a series of structured steps to ensure full integrity of customers' data is preserved, and access to storage resources on this scale unit would have become available over a short period of time. Final mitigation was declared at 14:20, but most customers would have seen recovery prior to this time.

Subsequent testing showed that the breaker had an internal failure on one phase and it has been sent to the manufacturer for full forensic analysis.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Forensic investigation into the breaker that failed.
  • Review of breaker maintenance and testing requirements.
  • Improving monitoring and alerting visibility when in maintenance modes.
  • Ensuring that electrical maintenance activities associated with power switching are monitored closely, with automated processes to manage unintended impact.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

4/6

RCA - Azure Resource Manager - Failures creating or deleting resources (Tracking ID DLZG-7C0)

Summary of Impact: Between 07:45 and 16:57 UTC on 04 Jun 2020, a subset of customers across all public Azure regions may have experienced deployment failures when attempting to create or delete certain service-based resources via the Azure Resource Manager (ARM) deployment and management service, due to an underlying networking issue. While the related networking resources for the impacted services were actually being created or deleted during this time, ARM was not notified of the deployment status and hence failed the service creation or deletion. This issue may also have impacted some GET or READ actions on these resources. Fewer than 0.01% of users would have experienced this issue.

This issue was initially detected an hour after the impact start time, when it was identified and escalated by an underlying service experiencing end-user impact. Once detected, multiple engineering teams were engaged to investigate the cause of the issue and understand what needed to be fixed. By 11:00 UTC, the appropriate networking team was engaged and began investigating. The underlying cause was identified by 13:00 UTC. We identified the appropriate fix and rolled it out to a single region to validate success. We confirmed the success of the rollout and began deploying to other regions in three batches, validating the success of the fix at the end of each batch. By 16:57 UTC, the fix was rolled out to all regions and mitigation was confirmed.

Root Cause: A recent ARM deployment contained a configuration file that stores the URL endpoint that ARM connects to for operation status query calls. The configuration file had an incorrect endpoint for networking resources. Due to this incorrect setting, ARM status queries for networking service management operations failed, which customers saw as failures when attempting to create or delete networking resources. The faulty configuration file was not caught prior to production because the update that caused the network resource failures was applied after testing had been performed on a then-healthy configuration file. When the latest configuration file was picked up for deployment, the faulty file was taken for production rather than for testing, and it was then manually rolled out without testing being performed against the newest configuration, introducing the breaking change.
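
To illustrate the failure mode (the provider key, URL, and validation step below are hypothetical, not ARM's actual configuration or process): a status-query endpoint map of this kind can be probed before rollout, so that a wrong entry fails during testing rather than during live operation-status queries.

    import urllib.request

    # Hypothetical configuration of the kind described above: the URL that
    # operation-status queries are sent to, keyed by resource provider.
    STATUS_QUERY_ENDPOINTS = {
        "Microsoft.Network": "https://networking-status.example.invalid/operations",  # placeholder URL
    }

    def validate_endpoints(endpoints, timeout_s=5):
        # Pre-deployment check: every configured endpoint must at least be
        # reachable; an incorrect URL fails here instead of in production.
        failures = {}
        for provider, url in endpoints.items():
            try:
                urllib.request.urlopen(url, timeout=timeout_s)
            except Exception as exc:
                failures[provider] = str(exc)
        return failures

    unreachable = validate_endpoints(STATUS_QUERY_ENDPOINTS)
    if unreachable:
        raise SystemExit(f"halt configuration rollout; unreachable endpoints: {unreachable}")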

Mitigation: We corrected the incorrect URL endpoint within the configuration file and safely re-deployed to mitigate the issue.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Networking service will onboard to SDP (Safe Deployment Practice) endpoint configuration rollout process immediately, to ensure enough testing is done and enough time occurs between deployment batches to catch any misconfigurations or changes prior to deployment.
  • Networking service will immediately close the testing and monitoring gaps to make sure we promptly identify an issue like this on the networking end, as failures were only seen on the ARM end.
  • Networking service will work with ARM team to streamline configuration rollout process, to guard against errors that may occur with the current manual deployment process.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

May 2020

19/5

RCA - SQL and Open-Source Database Service Management Issues - East US (Tracking ID 8K76-LZ8)

Summary of Impact: Between 13:30 and 16:30 UTC on 19 May 2020, a subset of customers in East US may have intermittently experienced timeouts and latency issues when processing service management operations, such as create, update, and delete, for Azure SQL Database, Azure SQL Database Managed Instance, Azure Database for MariaDB, Azure Database for MySQL, Azure Database for PostgreSQL, and Azure Synapse Analytics services. Some customers may also have encountered issues or experienced latency when loading database management tools or expanding database resources in SQL Server Management Studio (SSMS). Retries may have been successful.

Root Cause: Engineers determined that a bug caused high resource utilization in the internal cluster service that is responsible for receiving and executing service management operations in the East US region. The bug was encountered in all the service instances of the region leading to failures and timeouts for management operations.

Mitigation: Engineers mitigated the high resource consumption by identifying and addressing the trigger condition for the bug. The system returned to normal after pending operations were drained from the queue.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Issuing a hotfix for the bug.
  • Increasing the resiliency of the SQL service management plane to prevent service disruptions due to high resource utilization.
  • Improving the early detection mechanisms for high resource utilization issues.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

18/5

RCA - Multiple Services - Central India (Tracking ID SLN3-HD0)

Summary of Impact: Between 12:41 UTC on 18 May 2020 and 08:30 UTC on 19 May 2020, customers may have experienced difficulties connecting to resources hosted in Central India. A number of Storage and Compute scale units had gone offline, impacting Virtual Machines and other Azure services with dependencies on these.

Root Cause: At 11:25 UTC, a power issue with the regional utility power provider caused a Central India datacenter to transfer to generator power. This transition to generators worked as designed for all infrastructure systems except for the package air handling units (PAHUs) in two of the datacenter’s colocation rooms (colos). As a result, air temperatures for these two colos rose above operational thresholds. Alerts were triggered and on-site teams began efforts to recover the PAHUs. Initial efforts to recover the PAHUs were unsuccessful, and at 13:22 UTC, engineers began shutting down compute, network, and storage resources to protect datacenter equipment from thermal damage.

Mitigation: Engineers undertook various workstreams to restore connectivity. First, technicians determined why the PAHUs had all shut down. By 16:31 UTC, engineers had implemented a workaround for a failed component and sequentially restarted each unit to bring temperatures back into the safe operating range. In parallel, engineers prepared for equipment to come back online. By 16:48 UTC, temperatures were monitored and deemed stable, and the process of service recovery began. Some network devices and servers needed to be manually powered on due to the manner in which they had been powered off. Network was recovered first, with Storage clusters recovering automatically once powered on with the network available. After Storage and Networking were restored, dependent Compute hardware recovery began; the hardware was back online by 19:59 UTC, allowing Virtual Machines and other dependent Azure services to start recovery, which completed by 08:30 UTC on 19 May. Currently, all colos are running on utility power and all PAHUs are running under automatic control.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Forensic investigation into the PAHU components to determine the cause of the shutdown.
  • Once the forensic investigation is complete, elimination of the fault from the system through replacement, engineering it out, or logical changes.
  • If the fault proves to be component-based and systemic, ensuring replacement of the affected component in other datacenters where present.
  • Developing a comprehensive load shedding plan that will keep high-priority services up longer to allow for a graceful shutdown (a simplified sketch follows this list).
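
A simplified sketch of the idea behind such a plan (the service names and priority values are illustrative, not an actual Azure shedding order): lower-priority workloads are shut down first so that the remaining power and cooling headroom keeps higher-priority services running longer and lets them stop gracefully.

    # Illustrative load-shedding order: shed the lowest-priority services
    # first so higher-priority ones stay up longer and can shut down cleanly.
    SERVICES = [
        {"name": "batch-processing", "priority": 3},
        {"name": "internal-tooling", "priority": 2},
        {"name": "customer-storage", "priority": 1},   # shed last
    ]

    def shed_load(services, shutdown):
        for service in sorted(services, key=lambda s: s["priority"], reverse=True):
            shutdown(service["name"])   # graceful shutdown, lowest priority first

    shed_load(SERVICES, lambda name: print(f"gracefully shutting down {name}"))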

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: