Azure status history

This page contains all root cause analyses (RCAs) for incidents that occurred on 20 November 2019 or later. Each RCA will be retained on this page for 5 years. RCAs for incidents that occurred before 20 November 2019 are not available.


October 2021

13 Oct

RCA - Virtual Machines (Tracking ID 0NC_-L9G)

Summary of Impact: Between 06:27 UTC and 12:42 UTC on 13 Oct 2021, a subset of customers using Windows-based Virtual Machines (Windows VMs) may have received failure notifications when performing service management operations such as start, create, update, and delete. Deployments of new VMs and any updates to extensions may have failed. Management operations on Availability Sets and Virtual Machine Scale Sets were also impacted.

Non-Windows Virtual Machines were unaffected; however, services with dependencies on Windows VMs may have also experienced similar failures when creating resources.

Root Cause: Windows-based Virtual Machines utilize the Windows Virtual Machine Agent (VM Agent) extension, which is used to manage interactions between the Virtual Machine and the Azure Fabric.

When creating and updating Windows VMs, the Compute Resource Provider (CRP) has a dependency upon the Platform Image Repository to retrieve download locations for the latest version of the VM Agent package. Using this information, the VM Agent will update itself to the latest version in the VM.
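
For context on how this kind of lookup behaves, the sketch below is a rough client-side approximation (not CRP's internal code) that uses the public azure-mgmt-compute SDK to list the published versions of an extension in one region; the subscription, publisher, and extension type names are placeholders, not the actual Windows VM Agent identifiers. An empty result list corresponds to the symptom the incident produced once the extension was hidden.

```python
# Hedged sketch: list the published versions of a VM extension in one region
# using the public azure-mgmt-compute SDK. The publisher and extension type
# below are hypothetical placeholders, not the real Windows VM Agent names.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"        # placeholder
LOCATION = "westus2"                         # any Azure region
PUBLISHER = "Example.Publisher"              # hypothetical publisher name
EXTENSION_TYPE = "ExampleVMAgentExtension"   # hypothetical extension type

client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Ask the regional repository which versions of the extension are published.
versions = list(
    client.virtual_machine_extension_images.list_versions(
        LOCATION, PUBLISHER, EXTENSION_TYPE
    )
)

if not versions:
    # During the incident the regional repositories returned zero results,
    # which is why management operations on Windows VMs could not complete.
    raise RuntimeError(
        f"No published versions of {PUBLISHER}/{EXTENSION_TYPE} in {LOCATION}"
    )

# Naive "latest" selection by string comparison, purely for illustration.
latest = max(versions, key=lambda v: v.name)
print(f"Latest available version: {latest.name}")
```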

As part of the journey to move all classic resources to Azure Resource Manager (ARM), we are migrating the image and extension publishers to the regional ARM publishing pipeline. Approximately 20% of all extensions have been successfully migrated.

At approximately 06:27 UTC, tooling provided an ARM template for use in performing these migrations. This tooling did not account for an edge case and, as an unintended consequence, marked the Windows VM Agent extension in the regional ARM service as visible only to the publishing subscription after migration. As a result, VM management operations began to fail after receiving zero results from the regional Platform Image Repositories.

The outcome of this was that service management operations (start, stop, create, delete, etc.) on customers' Windows VMs were unable to locate the Windows VM Agent extension and therefore could not complete successfully.

Part of our change management process is to leverage the Safe Deployment Practice (SDP) framework. In this case, some of the functionality of our classic infrastructure is incompatible with the SDP framework. This incompatibility underscores the importance we place on completing the migration to ARM. Once the migration is complete, we will be able to make all changes using the SDP framework without relying on bespoke tools that support only classic resources.

Mitigation: Determining the root cause took an extended period because multiple releases for Azure components were in flight simultaneously on the platform, each of which had to be investigated. Additionally, involving subject matter experts (SMEs) for each of the components involved added to this time, as we needed to eliminate multiple possible scenarios to confirm the underlying cause.

Once we determined the issue and reviewed multiple mitigation options, we mitigated impact by first making the extension public in a single region and validating the results, ensuring that a surge in Virtual Machine requests would not cause further impact. Once validated, we rolled the change out to the new pipeline region by region, mitigating the issue. Engineers monitored the platform success rate for operations after the changes were completed.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • The migration of the remaining packages in this category (including the Linux version of the VM Agent) is on hold until all repairs are in place.
  • Additional pre-checks and post-checks are being developed and implemented.
  • VM operations are being made more resilient to failures when the VM Agent extension cannot be found.
  • Engineering is also evaluating other safeguards to flight each extension type and prevent any potential negative impact during the remainder of the migration.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

August 2021

11 Aug

RCA - Connection errors for resources leveraging Azure Front Door and Azure CDN (Tracking ID 0MQY-NPG)

Summary of Impact: Between 06:30 UTC and 09:30 UTC on 11 Aug 2021, a subset of customers leveraging Azure Front Door and Azure CDN Standard from Microsoft in Japan East, Japan West, Korea South, Korea Central and/or West US regions may have experienced intermittent HTTPS request connectivity failures when trying to reach their applications. During the incident, the average global error rate was ~2.5% and the peak global error rate was ~5%.

Root Cause: Azure Front Door and Azure CDN Standard from Microsoft serve traffic through edge locations around the world. We were in the process of rolling out a software update to prevent the use of TLS session resumption keys older than specific thresholds. The update followed the Azure safe deployment process and was rolling out in phases until it reached the impacted locations. A subset of edge locations in Korea, Japan, and West US were running with stale TLS resumption keys, and the rolled-out update triggered the mechanism that prevents the reuse of stale keys. However, a code defect in the rolled-out version resulted in a race condition in which a few servers in the impacted locations tried to revert to a full TLS handshake, causing those servers to drop HTTPS requests.
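
For context on the mechanism involved, the hedged sketch below uses only Python's standard ssl module against an arbitrary public host to show, from the client side, what TLS session resumption and the fallback to a full handshake look like. It does not reproduce the server-side race condition described above; in normal operation a rejected resumption attempt simply falls back to a full handshake rather than dropping the request.

```python
# Minimal client-side illustration of TLS session resumption, using only the
# Python standard library. The host is arbitrary; nothing here is specific to
# Azure Front Door or Azure CDN.
import socket
import ssl

HOST, PORT = "example.com", 443

context = ssl.create_default_context()
# Pin to TLS 1.2 so the standard library exposes a reusable session object.
context.maximum_version = ssl.TLSVersion.TLSv1_2

def connect(session=None):
    sock = socket.create_connection((HOST, PORT))
    tls = context.wrap_socket(sock, server_hostname=HOST, session=session)
    print(f"resumed={tls.session_reused}, cipher={tls.cipher()[0]}")
    return tls

# First connection: always a full handshake; keep the session for reuse.
first = connect()
saved = first.session
first.close()

# Second connection: offer the saved session. If the server accepts it, the
# handshake is resumed; if it rejects it (for example because the resumption
# key has rotated or is considered stale), the connection transparently falls
# back to a full handshake -- the behavior the code defect broke.
second = connect(session=saved)
second.close()
```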

Mitigation: Our monitoring detected this issue and alerted the service team. To mitigate, we removed unhealthy edge locations from serving traffic, which routed traffic to healthy edge locations. We also rolled back the update that caused the regression.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Deploying the fix for the regression that caused the race condition.
  • Enhancing monitoring to ensure alerting if TLS resumption is off or if the session resumption key is older than the threshold.
  • Enhancing staging environments to account for additional stale TLS resumption failure modes.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

July 2021

28 Jul

RCA - Azure Network Infrastructure service availability issues - Brazil Southeast (Tracking ID LNZM-TZG)

Summary of Impact: Between 13:48 UTC and 15:20 UTC on 27 July 2021, a subset of customers experienced issues connecting to their services in the Brazil Southeast region.

Root Cause: We determined that a degradation in connectivity was caused by packet loss when the metadata on one of our regional-level routers was updated incorrectly. As part of a planned network configuration refresh, an update was being performed on the regional-level routers in the Brazil Southeast region. The regional-level tier of the network is designed with redundancy to allow a subset of the routers at that network tier to be taken off-line (not serving customer traffic) for updates.

During the update, our automated network configuration system applied an incorrect IPv4 network prefix (IP Range) to a regional-level router that was taken off-line. Restoring traffic to this regional-level router resulted in packet loss for some of the Azure services in the region. The incorrect network prefix caused traffic from this region to Azure destinations in other regions and a subset of internet regions to be dropped.
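
As a simplified illustration of why an incorrect prefix drops traffic, the toy longest-prefix-match lookup below (plain Python with the standard ipaddress module, using made-up documentation-range prefixes, not Azure's actual routing stack or address space) shows how a destination that no longer falls under any installed prefix has no matching route and is discarded.

```python
# Toy longest-prefix-match routing lookup. All prefixes and addresses are
# made up (IETF documentation ranges); this only illustrates the general
# mechanism, not the actual router configuration involved in the incident.
import ipaddress

def lookup(route_table, destination):
    """Return the most specific matching prefix, or None if the packet would be dropped."""
    dst = ipaddress.ip_address(destination)
    matches = [net for net in route_table if dst in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None

intended = [ipaddress.ip_network("203.0.113.0/24")]   # prefix the router should have carried
applied  = [ipaddress.ip_network("198.51.100.0/24")]  # incorrect prefix pushed by automation

destination = "203.0.113.25"  # traffic that the intended prefix would have covered

print("intended table:", lookup(intended, destination))  # 203.0.113.0/24 -> forwarded
print("applied table: ", lookup(applied, destination))   # None -> dropped
```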

Mitigation: The device with incorrect prefixes was removed from service. This mitigation took longer than expected because automated safety checks were failing for the entire region, and some human intervention was required to proceed with the traffic rollback. 

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • We have audited all ongoing deployments to identify this error pattern and are enhancing our validation checks to prevent such a combination of errors, including improved detection logic.
  • We are continuously enhancing our alert logic to help identify issues faster and force rollbacks without human intervention.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: