Azure status history

This page contains all RCAs for incidents that occurred on November 20th, 2019 or later, and will provide a 5-year RCA history from that date forward. RCAs prior to November 20th, 2019 are not available.

May 2020

5/19

RCA - SQL and Open-Source Database Service Management Issues - East US (Tracking ID 8K76-LZ8)

Summary of Impact: Between 13:30 and 16:30 UTC on 19 May 2020, a subset of customers in East US may have intermittently experienced timeouts and latency when processing service management operations, such as create, update, and delete, for Azure SQL Database, Azure SQL Database Managed Instance, Azure Database for MariaDB, Azure Database for MySQL, Azure Database for PostgreSQL, and Azure Synapse Analytics services. Some customers may have also encountered issues or experienced latency when loading database management tools or expanding database resources in SQL Server Management Studio (SSMS). Retries may have been successful.
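
The note that retries may have succeeded reflects the transient nature of the failures. As a minimal sketch only (not Azure's implementation; the wrapped operation and the error type caught are placeholder assumptions), a service management call can be retried with exponential backoff and jitter:

    import random
    import time

    def with_retries(operation, max_attempts=5, base_delay=2.0):
        # Retry a zero-argument callable on transient failures, backing off
        # exponentially with jitter between attempts.
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except TimeoutError:  # substitute the SDK's transient error types
                if attempt == max_attempts:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))

    # Hypothetical usage: wrap a create/update/delete management call.
    # with_retries(lambda: client.databases.begin_create_or_update(...).result())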

Root Cause: Engineers determined that a bug caused high resource utilization in the internal cluster service responsible for receiving and executing service management operations in the East US region. The bug was encountered in all service instances in the region, leading to failures and timeouts for management operations.

Mitigation: Engineers mitigated the high resource consumption by identifying and addressing the trigger condition for the bug. The system returned to normal after pending operations were drained from the queue.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Issuing a hotfix for the bug.
  • Increasing the resiliency of the SQL service management plane to prevent service disruptions due to high resource utilization.
  • Improving the early detection mechanisms for high resource utilization issues (a minimal sketch of one such check follows this list).
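
For illustration only, an early-detection check of the kind described in the last bullet might alert when utilization stays above a threshold for several consecutive samples; this sketch is not Azure's internal mechanism, and the threshold and interval values are assumptions:

    import psutil  # third-party: pip install psutil

    def watch_cpu(threshold=90.0, consecutive=3, interval_s=60):
        # Alert when CPU utilization exceeds `threshold` percent for
        # `consecutive` sampling intervals in a row.
        breaches = 0
        while True:
            usage = psutil.cpu_percent(interval=interval_s)  # blocks for interval_s
            breaches = breaches + 1 if usage > threshold else 0
            if breaches >= consecutive:
                print(f"ALERT: CPU at {usage:.0f}% for {breaches} consecutive intervals")
                breaches = 0
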
5/18

RCA - Multiple Services - Central India (Tracking ID SLN3-HD0)

Summary of Impact: Between 12:41 UTC on 18 May 2020 and 08:30 UTC on 19 May 2020, customers may have experienced difficulties connecting to resources hosted in Central India. A number of Storage and Compute scale units had gone offline, impacting Virtual Machines and other Azure services with dependencies on these.

Root Cause: At 11:25 UTC, a power issue with the regional utility power provider caused a Central India datacenter to transfer to generator power. This transition to generators worked as designed for all infrastructure systems except the package air handling units (PAHUs) in two of the datacenter’s colocation rooms (colos). As a result, air temperatures in these two colos rose above operational thresholds. Alerts were triggered and on-site teams began efforts to recover the PAHUs. Initial recovery efforts were unsuccessful, and at 13:22 UTC engineers began shutting down compute, network, and storage resources to protect datacenter equipment from thermal damage.

Mitigation: Engineers undertook several workstreams to restore connectivity. First, technicians determined why the PAHUs had all shut down. By 16:31 UTC, engineers had implemented a workaround for a failed component and sequentially restarted each unit to bring temperatures back into the safe operating range. In parallel, engineers prepared for equipment to come back online. By 16:48 UTC, temperatures were monitored and deemed stable, and service recovery began. Some network devices and servers had to be powered on manually because of the manner in which they were powered off. The network was recovered first, and Storage clusters recovered automatically once powered on with the network available. After Storage and Networking were restored, recovery of the dependent Compute hardware began; it came back online by 19:59 UTC, allowing Virtual Machines and other dependent Azure services to start recovering, a process that completed by 08:30 UTC on 19 May. All colos are currently running on utility power, and all PAHUs are running under automatic control.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Forensic investigation into the PAHU components to determine the cause of the shutdown.
  • Once the forensic investigation is complete, elimination of the fault from the system through replacement, engineering it out, or logical changes.
  • If the fault proves component-based and systemic, replacement of the affected component in other datacenters where it is present.
  • Development of a comprehensive load-shedding plan that keeps high-priority services up longer to allow for a graceful shutdown (a minimal sketch of the idea follows this list).
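
To make the load-shedding idea concrete, the following sketch stops services in ascending priority order so the highest-priority workloads stay up longest; the service names and priorities are hypothetical, and a real plan would be driven by power and cooling telemetry:

    # Hypothetical service inventory; higher priority = shut down later.
    services = [
        {"name": "batch-reporting", "priority": 1},
        {"name": "internal-tooling", "priority": 2},
        {"name": "customer-databases", "priority": 3},
    ]

    def shed_load(services, shutdown):
        # Gracefully stop services lowest-priority-first, preserving
        # capacity for the highest-priority workloads as long as possible.
        for svc in sorted(services, key=lambda s: s["priority"]):
            shutdown(svc["name"])

    shed_load(services, shutdown=lambda name: print(f"gracefully stopping {name}"))
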
5/11

Azure Connectivity - United Kingdom - Mitigated (Tracking ID Z_V3-FC8)

Summary of Impact: Between 13:30 and 13:50 UTC on 11 May 2020, a limited subset of customers in the United Kingdom may have experienced connectivity issues to Azure sites and services.

Preliminary Root Cause: Engineers have confirmed the issue was caused by a power event at a regional internet service provider network exchange.

Mitigation: Engineers have confirmed that power was restored, restoring access to resources from the impacted networks and mitigating the issue.

Next Steps: We apologize for any inconvenience this may have caused. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

April 2020

4/29

Issues accessing resources in the Azure Portal - Mitigated (Tracking ID PMN6-7D8)

Summary of Impact: Between 18:41 UTC on 29 Apr 2020 and 11:00 UTC on 30 Apr 2020, a subset of customers may have experienced issues accessing resources in the Azure Portal, receiving an "Access Denied" error message. Programmatic methods, such as PowerShell or the Azure CLI, could be used to list the resources in a subscription and to access individual resources by resource ID.
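
The communication names PowerShell and the Azure CLI as workarounds; an equivalent sketch using the Azure SDK for Python (our substitution, with placeholder subscription and resource IDs) enumerates resources and fetches one directly by resource ID, bypassing the Portal UI:

    from azure.identity import AzureCliCredential
    from azure.mgmt.resource import ResourceManagementClient

    subscription_id = "<subscription-id>"  # placeholder
    credential = AzureCliCredential()      # reuses an existing `az login` session
    client = ResourceManagementClient(credential, subscription_id)

    # List all resources in the subscription without going through the Portal.
    for res in client.resources.list():
        print(res.id)

    # Fetch a single resource directly by its fully qualified resource ID.
    res = client.resources.get_by_id("<full-resource-id>",  # placeholder
                                     api_version="2020-06-01")
    print(res.name, res.type)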

Preliminary Root Cause: Engineers determined that a recent deployment task introduced a software bug that impacted the synchronization of Role-Based Access Control (RBAC) information for resource management services (such as Azure Resource Manager), which the Azure Portal leverages to display resources. This manifested as Azure Portal call failures, with resources not displayed as expected.

Mitigation: Engineers developed and deployed a hotfix to the impacted regions, forcing a refresh of all RBAC-related data for customers in those regions, and subsequently scaled out the service to handle the increased load.

Next Steps: We apologize for the impact to affected customers. Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

4/22

RCA - Managed Database services - China North (Tracking ID ZK36-9P8)

Summary of Impact: Between 17:37 and 21:54 CST on 22 Apr 2020, a subset of customers may have seen issues affecting service management operations for Azure SQL Database, Azure SQL Database Managed Instance, Azure Database for MariaDB, Azure Database for MySQL, Azure Database for PostgreSQL and Azure Synapse Analytics services in China North. Service management operations including create, rename, update and delete may have been impacted. Connectivity to database resources was not impacted.

Root cause: Engineers determined that a manual maintenance operation impacted instances of an internal cluster data service that is responsible for receiving and executing service management operations. The primary instance of the data service became unhealthy, preventing some service management operations from completing.

Mitigation: Engineers paused the maintenance operation and initiated failover of the data service's primary to a healthy instance, thus mitigating the issue. Engineers monitored for an extended period post-mitigation to ensure there were no further occurrences. Engineers also worked to complete the maintenance operations offline and restore all instances to operational rotation.  

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Engineers will continue to review the specific maintenance operation to understand the cause of the impact, and will ensure the operation is fully automated and covered by pre-deployment test cases.
4/22

RCA - Managed Database services - UK South (Tracking ID TS66-1C0)

Summary of Impact: Between 09:37 and 13:54 UTC on 22 Apr 2020, a subset of customers may have seen issues affecting service management operations for Azure SQL Database, Azure SQL Database Managed Instance, Azure Database for MariaDB, Azure Database for MySQL, Azure Database for PostgreSQL, and Azure Synapse Analytics services in UK South. Service management operations including create, rename, update and delete may have been impacted. Connectivity to database resources was not impacted.

Root cause: Engineers determined that a manual maintenance operation impacted instances of an internal cluster data service that is responsible for receiving and executing service management operations. The primary instance of the data service became unhealthy, preventing some service management operations from completing.

Mitigation: Engineers paused the maintenance operation and initiated failover of the data service's primary to a healthy instance, thus mitigating the issue. Engineers monitored for an extended period post-mitigation to ensure there were no further occurrences. Engineers also worked to complete the maintenance operations offline and restore all instances to operational rotation. 

Next steps: We sincerely apologize for the impact to the affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Engineers will continue to review the specific maintenance operation to understand the cause of the impact, and will ensure the operation is fully automated and covered by pre-deployment test cases.

March 2020

3/15

RCA - Service Availability Issue in West Central US (Tracking ID SMR4-D90)

Summary of Impact: Between approximately 20:30 UTC on 15 Mar 2020 and 00:30 UTC on 16 Mar 2020, a subset of customers in West Central US may have experienced issues connecting to resources hosted in this region.

Root Cause: At approximately 20:30 UTC on 15 Mar 2020, an ice storm in the region resulted in a loss of power on both the primary and secondary utility power feeds in West Central US, causing the datacenter to transition to generator power. Although the transition was largely successful, a subset of generators failed, and the UPS units supporting the downstream racks carried the load until their batteries drained. At that point, the affected racks lost power entirely, and any customer resources hosted on them became unavailable.

Mitigation: Site engineers manually intervened to restore power to the affected infrastructure. Concurrently, impacted Azure services implemented their disaster recovery plans to work around the power failure. Subsequently, engineering worked to recover customer resources as affected infrastructure started to become available again.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Reviewing the sequence of operations and associated control systems to remediate identified single point failures
  • Validating electrical hardware components to ensure proper functionality
