Azure status history

This page contains root cause analyses (RCAs) of previous service issues, each retained for five years. As of 20 November 2019, it includes RCAs for all issues about which we communicated publicly. As of 1 June 2022, it also includes RCAs for broad issues as described in our documentation.

June 2022

29 June

Preliminary Post Incident Review (PIR) - Azure Firewall, Azure Data Explorer, Azure Synapse Analytics, Azure Backup, Azure Search - Multi Region (Tracking ID YVTL-RS0)

What happened?

Starting at 05:26 UTC on 28 Jun 2022, customers who had more than 16 backend IP addresses (also known as destination IPs (DIPs)) behind an Azure Load Balancer may have experienced intermittent connectivity failures and packet loss while trying to connect to their resources. Many services that use an Azure Load Balancer as part of their implementation were impacted, including, but not limited to, Azure Data Explorer, Azure Log Analytics, Azure Backup and Azure Search. This issue impacted Azure Firewall significantly more than other cloud services for two reasons: (1) Azure Firewall scales out the number of instances as load increases and can have more than 16 Virtual Machines behind a Load Balancer when handling large amounts of traffic, and (2) Azure Firewall requires that both the upstream and downstream sides of a connection are handled by the same instance.

Separately, from 02:40 UTC to 20:14 UTC on 29 June 2022, one of the routers in the Microsoft Wide Area Network (WAN) in the London, UK area experienced an issue. Packets for a subset of the endpoints travelling through that router intermittently experienced high loss rates. This impacted traffic being served to and from particular regions: UK West, UK South, Europe West and Europe North. Customers using ExpressRoute and VPN Gateway may have noticed this impact.

What do we know so far?

As discussed above, there were two separate events that resulted in impact to our customers. The coincidental timing of these two events was a contributing factor in how we managed this incident.

The first issue was with our SLB (Software Load Balancer). The SLB is our software-based routing and load balancing service for all network traffic within our datacenters and clusters. At the highest level, it is a set of servers called multiplexers (MUXs) that form an SLB ring. Each ring hosts many VIPs (Virtual IPs) that serve traffic to multiple DIPs (Destination IPs, the IPs of the resources behind the VIP). The MUXs direct traffic to the DIPs that are in service behind the VIP.
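
To make the VIP/DIP relationship concrete, here is a minimal, hypothetical sketch (not the actual Azure SLB MUX code) of how a MUX might consistently map an incoming flow arriving on a VIP to one of the healthy DIPs behind it, using a hash of the flow's 5-tuple:

```python
# Hypothetical illustration only - not the actual Azure SLB implementation.
# A MUX hashes each flow's 5-tuple and picks one of the healthy DIPs behind
# the VIP, so packets of the same flow keep landing on the same backend.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    src_ip: str
    src_port: int
    dst_ip: str      # the VIP
    dst_port: int
    protocol: str

def pick_dip(flow: Flow, healthy_dips: list[str]) -> str:
    """Deterministically map a flow to one of the healthy DIPs behind the VIP."""
    if not healthy_dips:
        raise RuntimeError("no healthy DIPs behind this VIP")
    key = f"{flow.src_ip}:{flow.src_port}-{flow.dst_ip}:{flow.dst_port}-{flow.protocol}"
    digest = hashlib.sha256(key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(healthy_dips)
    return healthy_dips[index]

# Example: a VIP fronting 20 DIPs (more than 16, as in the affected scenario).
dips = [f"10.0.0.{i}" for i in range(1, 21)]
flow = Flow("203.0.113.7", 54321, "52.0.0.10", 443, "TCP")
print(pick_dip(flow, dips))
```

If individual MUXs hold inconsistent views of which DIPs are healthy, the same flow can be steered to different backends by different MUXs, which is the kind of inconsistency at the heart of the bug described below.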

Our latest code deployment contained a bug in which the MUXs ended up holding conflicting information about which DIPs were in a healthy state. The bug would only manifest if a DIP was unhealthy and the synchronization between the MUXs became incorrect. Even when those conditions did arise, the MUXs would only intermittently drop packets. Since a VIP typically sits in front of multiple DIPs, the impact was intermittent and limited to a small share of network traffic. Most network clients also include substantial retry logic, which dampened the impact further. These factors combined to prevent our existing monitoring from producing high-fidelity alerts that we could respond to and investigate. Furthermore, this update only impacted customers or cloud services with more than 16 DIPs behind a VIP. As most customers do not use that many DIPs, our support teams faced challenges isolating the issue when support cases were raised, which delayed correlating customer-reported issues with the behavior of the platform. Lastly, even in cases where a resource in the path of the traffic flow did have more than 16 DIPs, it could be hosted in a different VNET or even a different subscription, which further delayed identification of the SLB bug.

When we deploy updates to our cloud, we follow Safe Deployment Practices (SDP), described here: . Using SDP, Azure code rolls out to a single region or zone at a time, with checks that ensure service health is not degraded by the code or configuration change being deployed. Unfortunately, because of the intermittent nature of this failure and its uncommon trigger conditions, our health checks did not flag any issues, and the code propagated to 17 of our 60+ regions before we correlated customer impact to this specific deployment.
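
To illustrate the shape of a phased, health-gated rollout (a minimal sketch; the phase names, bake time, and health-check function are hypothetical and not the actual SDP tooling), consider:

```python
# Hypothetical sketch of a phased, health-gated rollout - not the real SDP pipeline.
import time

ROLLOUT_PHASES = [
    ["canary-region"],                    # smallest blast radius first
    ["pilot-region-1", "pilot-region-2"],
    ["broad-wave-1"],                     # progressively larger waves
    ["broad-wave-2"],
]

def health_degraded(region: str) -> bool:
    """Placeholder for real health telemetry; an intermittent, low-volume
    failure like the SLB bug can slip under checks like this."""
    return False

def roll_out(build_id: str) -> None:
    for phase in ROLLOUT_PHASES:
        for region in phase:
            print(f"deploying {build_id} to {region}")
        time.sleep(1)  # stand-in for the bake time between phases
        if any(health_degraded(r) for r in phase):
            print("health regression detected - halting and rolling back")
            return
    print("rollout complete")

roll_out("slb-update-2022-06")
```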

The second and completely unrelated issue concerned part of our WAN (Wide Area Network) infrastructure in London. A router in London intermittently started dropping network packets. This caused intermittent network failures for customers whose traffic was flowing through UK West, UK South, Europe West and Europe North. This was characterized as a 'gray' failure: one in which the platform's own health reporting suggests things are healthy even though end users are experiencing failures, which makes the issue hard to detect. (Customers can learn more about gray failures in cloud systems here: )

How did we respond?

For the first issue (SLB impact) we engaged in multiple workstreams in parallel. We used manual techniques to mitigate VIPs for customers that raised support requests to us. Simultaneously, we developed a hotfix that would address the bug and could be applied without further impacting customer traffic (a simple rollback would have left impacted VIPs in a bad state). After the hotfix was tested, it was rolled out using a phased approach as per our Safe Deployment Practices. Additionally, we created automation that would scan all VIPs every 5 minutes and mitigate any VIP in an inconsistent state to minimize the impact to customers while the hotfix rolled out. This script continued to run until we had our sustainable mitigation completely rolled out.
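
The interim automation can be pictured roughly as follows (a minimal sketch under assumptions; the helper functions and the notion of the MUX views "agreeing" are simplified placeholders, not the actual tooling):

```python
# Hypothetical sketch of the interim "scan every 5 minutes and mitigate" automation.
import time

def list_all_vips() -> list[str]:
    """Placeholder: enumerate the VIPs hosted on the SLB rings."""
    return ["vip-a", "vip-b"]

def mux_views_agree(vip: str) -> bool:
    """Placeholder: compare the DIP health state each MUX holds for this VIP."""
    return True

def reset_vip_state(vip: str) -> None:
    """Placeholder: re-sync the MUXs' view of the VIP's healthy DIPs."""
    print(f"mitigated {vip}")

# Runs until the hotfix is fully rolled out, then is retired.
while True:
    for vip in list_all_vips():
        if not mux_views_agree(vip):
            reset_vip_state(vip)
    time.sleep(300)  # re-scan every 5 minutes
```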

For the second issue (WAN impact), we initially spent considerable time looking for software-based causes: failures were being observed across 17 regions (router issues typically impact a small number of regions), and there were a number of concurrent software changes in flight in addition to the SLB issue described above.

As the investigation continued, engineers triangulated that the impacted traffic flowed through WAN routers located near London. That quickly led the engineers to remove the unhealthy router from the network; redundant routers took over the traffic and fully resolved this second issue.

What are our next steps?

We understand the amount of trust customers put in Microsoft cloud services. We have work to do. Over the next 14 days a thorough internal retrospective will be conducted to determine what happened and the steps that we will take to make these kinds of issues both less likely and less impactful – ultimately to earn back the trust of our customers and partners. Once this retrospective has been completed, we will share our final Post Incident Review (PIR).

How can we make our incident communications more useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:


7 June

Post Incident Review (PIR) - Datacenter cooling event - East US 2 (Tracking ID NMB2-ND0)

What happened?

Between 02:41 and 14:30 UTC on 07 Jun 2022, a subset of customers experienced difficulties connecting to resources hosted in one particular Availability Zone (AZ) of the East US 2 region. This issue impacted a subset of storage and compute resources within one of the region’s three Availability Zones. As a result, Azure services with dependencies on resources in this zone also experienced impact.

Since the vast majority of the impacted services already support Availability Zones, customers using always-available and/or zone-redundant services would have observed that this zone-specific incident did not affect the availability of their data and services. Five services (Application Insights, Log Analytics, Managed Identity Service, Media Services, and NetApp Files) experienced regional impact as a result of this zonal issue; these five services are already working towards enabling AZ support. Finally, while App Service instances configured to be zone-redundant would have stayed available from the other AZs, control plane issues were observed regionally that may have prevented customers from performing service management operations during the impact window.

What went wrong, and why?

Microsoft experienced an unplanned power oscillation in one of our datacenters within one of our Availability Zones in the East US 2 region. Components of our redundant power system created unexpected electrical transients, which resulted in the Air Handling Units (AHUs) detecting a potential fault, and therefore shutting themselves down pending a manual reset.

The electrical transients were introduced by anomalous component behavior within Uninterruptible Power Supply (UPS) modules, and cascaded throughout the datacenter electrical distribution system including electrical power supply to the mechanical cooling plant. As a result of the AHU self-protective shutdown, cooling to the datacenter was interrupted. Although the electrical transients did not impact our compute, networking, or storage infrastructure – which did not lose power – the mechanical cooling plant shutdown led to an escalating thermal environment, which induced protective shutdown of a subset of this IT infrastructure prior to the restoration of cooling.

Thorough and detailed analysis resulted in an adjustment to the UPS gain settings, preventing any further oscillations. These oscillations are the power equivalent of having a microphone too close to an amplifier: just as setting the volume too high can trigger a self-sustained sound oscillation, power oscillations can occur when the gain of the UPS is too high. The normal process of adding load to the UPS units results in an increase in gain and, in this case, the gain went high enough to cause the oscillations to occur. Lowering the control gain setting in the UPS returns the units to stable operation across all load values, preventing disruptions to other infrastructure such as the AHUs.
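
To illustrate the analogy numerically (a simplified sketch, not a model of the actual UPS control loop), a discrete feedback loop damps a disturbance when the loop gain is below 1 and produces a growing, sign-alternating oscillation once the gain exceeds 1:

```python
# Simplified illustration of the gain/oscillation analogy - not a model of the real UPS.
def simulate(gain: float, steps: int = 10, disturbance: float = 1.0) -> list[float]:
    """Each step, the controller feeds back -gain * error; with |gain| > 1 the
    correction overshoots and the oscillation grows instead of dying out."""
    error = disturbance
    history = []
    for _ in range(steps):
        error = -gain * error
        history.append(round(error, 3))
    return history

print(simulate(0.8))  # stable: amplitude decays toward zero
print(simulate(1.2))  # unstable: amplitude grows, alternating sign (oscillation)
```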

Subsets of equipment including network, storage, and compute infrastructure were automatically shut down, both to prevent damage to hardware and to protect data durability under abnormal temperatures. As a result, Azure resources and services with dependencies on these underlying resources experienced availability issues during the impact window. A significant factor of downstream service impact was that our storage infrastructure was amongst the hardware most affected by these automated power and thermal shutdowns. Eight storage scale units were significantly impacted – due to thermal shutdowns directly and/or loss of networking connectivity, itself due to thermal shutdowns of corresponding networking equipment. These scale units hosted Standard Storage including LRS/GRS redundant storage accounts, which in turn affected Virtual Machines (VMs) using Standard HDD disks backed by this storage, as well as other services and customers directly consuming blob/file and other storage APIs.

The platform continuously monitors input/output transactions from VMs to their corresponding storage. Even if the scale unit running a VM's underlying compute was operational, when transactions did not complete successfully within 120 seconds (inclusive of retries), connectivity to the virtual disk was considered lost and a temporary VM shutdown was initiated. Any workloads running on these impacted VMs, including first-party Azure services and third-party customer services, would have been impacted because their underlying hosts were either shut down by thermal triggers or had their storage/networking impacted by the same.
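
Conceptually, the watchdog behaves like the following sketch (the 120-second threshold comes from the description above; the function names are illustrative placeholders, not the actual host agent):

```python
# Illustrative sketch of the disk-connectivity watchdog described above;
# function names are placeholders, not the actual host agent code.
import time

DISK_TIMEOUT_SECONDS = 120  # includes retries, per the description above

def io_completed_within(vm_id: str, timeout_seconds: int) -> bool:
    """Placeholder: did the VM's outstanding storage transactions complete
    (including retries) within the timeout?"""
    return True

def monitor_vm(vm_id: str) -> None:
    while True:
        if not io_completed_within(vm_id, DISK_TIMEOUT_SECONDS):
            # Connectivity to the virtual disk is considered lost: shut the VM
            # down temporarily; it is brought back up once storage recovers.
            print(f"{vm_id}: initiating temporary shutdown")
            return
        time.sleep(DISK_TIMEOUT_SECONDS)
```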

How did we respond?

As soon as the AHUs shut themselves down as a result of the power disturbance, alerts notified our onsite datacenter operators. We deployed a team to investigate, who confirmed that the cooling units had shut themselves down pending manual intervention. Following our Standard Operating Procedure (SOP), the team attempted to perform manual resets on the AHUs, but these were not successful. Upon further investigation, the onsite team identified that, due to the nature of this disturbance, recovering safely would require resetting the AHUs while running on backup power sources, to prevent the power oscillation pattern on the utility line from triggering a fault. This meant that two primary steps were required to recover. First, the impacted datacenter was manually transferred from utility power to backup power sources, our onsite generators; doing so changed the characteristics of the power lineup and prevented the oscillations from recurring. Second, the AHUs were manually reset, which restored cooling to the datacenter.

Once temperatures returned to normal levels, some hardware including network switches needed to be manually power cycled to be brought back online. The network hardware and components serve different compute and storage resources for the scale units in this datacenter, including host instances for other applications and services. Onsite engineers then manually reviewed the status of various infrastructure components, to ensure that everything was working as intended. 

Following the restoration of most storage network connectivity, recovery activities included diagnosing and remediating any host nodes that had entered an unhealthy state due to loss of network, and triaging any other hardware failures to ensure that all storage infrastructure could be brought back online. Even after all storage nodes returned to a healthy state, two storage scale units still exhibited slightly lower API availability compared to before this incident. It was determined that this was caused by a limited number of storage software roles being in an unhealthy state – those roles were restarted, which restored full API availability for those scale units.

Since our compute continuously monitors for storage access, the compute VMs automatically started coming back up as storage/networking recovered. This worked as expected in all cases except on one scale unit, where the physical machines had been shut down and did not recover automatically. Because the VMs were originally down due to storage/networking issues, this was only detected once storage recovered, and we manually recycled the nodes to bring them back online. Upon investigation, we found that an issue with the cluster power management unit had prevented automatic recovery.

Two specific Azure services (ExpressRoute and Spatial Anchors) performed manual mitigations to fail customers over to use the other two Availability Zones within the region. Thus, while some impacted services recovered even earlier, full mitigation of this incident was declared at 14:30 UTC.

After cooling was restored and infrastructure was brought back online, our onsite teams opted to leave the datacenter running on backup power sources during additional investigations and testing, both focused on the UPS gain setting. In consultation with our critical environment hardware suppliers, we ran comprehensive testing to confirm the relevant gain settings based on the amount of load across the system. After these settings were deployed, we have since returned the datacenter back to our normal utility power feed.

How are we making incidents like this less likely or less impactful?

Already completed:

  • Updates to the gain setting, described above, have been deployed, and the impacted datacenter is back on utility power. We are confident that this has mitigated the risk of the power oscillation issue that was triggered.
  • Furthermore, our critical environment team has assessed systemic risk across all our datacenters globally, to ensure that none are at risk of the same situation. Of our 200+ Azure datacenters across 60+ regions, we identified only one other datacenter (beyond the impacted datacenter in East US 2) that had a similar power draw that could have potentially triggered a similar oscillation – this risk has since been mitigated with a similar configuration change.

Work in progress:

  • We have identified opportunities to improve our tooling and processes to flag anomalies more quickly, and are in the process of fine-tuning our alerting to inform onsite datacenter operators more comprehensively.
  • We are investigating why a subset of networking switches took longer than expected to recover. Although these were manually mitigated during the incident, we are exploring ways to optimize this recovery to ensure that customer workloads are brought online more quickly.
  • Similarly, we continue to diagnose a small subset of storage and compute nodes that remained in unhealthy states after restoration of networking, to streamline their recovery. This includes addressing a driver-related issue that prevented compute nodes in one scale unit from recovering automatically.
  • We are addressing specific monitoring gaps, including for compute nodes that have not been powered back on after being automatically shut down.

In the longer term:

  • We are developing a plan for fault injection testing relevant critical environment systems, in partnership with our industry partners, to be even more proactive in identifying and remediating potential risks.
  • We are exploring improved supplier diversity in the critical environment space, to minimize potential single points of failure within our hardware lineup.
  • We are investing in improved engineering tooling and processes that will accelerate the identification and remediation of unhealthy node states during incidents of this scale.
  • We have several workstreams in motion that will further improve storage node start-up times; learnings from this incident have validated the need to prioritize these optimizations.
  • Finally, we continue to invest in expanding how many Azure services support Availability Zones, so that customers can opt for automatic replication and/or architect their own resiliency across services:

How can our customers and partners make incidents like this less impactful?

  • Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. To help services be more resilient to datacenter-level failures like this one, each AZ provides independent power, networking, and cooling. Many Azure services support zonal, zone-redundant, and/or always-available configurations:
  • Consider which storage redundancy options are right for your critical applications. Zone-redundant storage (ZRS) remains available throughout a zone-localized failure, like the one in this incident. Geo-redundant storage (GRS) enables account-level failover in case the primary region endpoint becomes unavailable (a minimal provisioning sketch follows this list):
  • Consider using Azure Chaos Studio to recreate the symptoms of this incident as part of a chaos experiment, to validate the resilience of your Azure applications. Our library of faults includes VM shutdown, network block, and AKS faults that can help to recreate some of the connection difficulties experienced during this outage – for example, by targeting all resources within a single Availability Zone:
  • More generally, consider evaluating the reliability of each of your critical Azure applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review:
  • Finally, ensure that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, web-hooks, and more:
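
As a hedged example of the zone-redundant storage option mentioned above, the sketch below shows how a ZRS storage account might be provisioned with the Python management SDK (assuming the track 2 azure-mgmt-storage and azure-identity packages; the subscription ID, resource group, account name, and region are placeholders):

```python
# Hedged sketch: provisioning a zone-redundant (ZRS) storage account with the
# Python management SDK. Names and the subscription ID below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="my-resource-group",
    account_name="mystorageaccount",        # must be globally unique
    parameters=StorageAccountCreateParameters(
        location="eastus2",
        kind="StorageV2",
        sku=Sku(name="Standard_ZRS"),       # zone-redundant within the region
    ),
)
account = poller.result()
print(account.name, account.sku.name)
```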

How can we make our incident communications more useful?

We are piloting this “PIR” format as a potential replacement for our “RCA” (Root Cause Analysis) format.

You can rate this PIR and provide any feedback using our quick 3-question survey:

May 2022

31 May

RCA - Azure Active Directory Sign In logs (Tracking ID YL23-V90)

Summary of impact: Between 21:35 UTC on 31 May and 09:54 UTC on 01 Jun 2022, customers may have experienced significant delays in the availability of logging data, such as sign-in and audit logs, for Azure Active Directory and related Azure services. This impact may also have resulted in missed or misfired alerts, and issues accessing tools such as Microsoft Graph, the Azure portal, Azure Application Insights, Azure Log Analytics, and/or PowerShell. During this time, Azure Resource Manager (ARM) dependent services may also have experienced CRUD (create, read, update, and delete) service operation failures and/or issues communicating with other Azure services.

Root Cause: As part of continuous service improvements, a recent change to the underlying infrastructure that was intended to optimize performance inadvertently caused the observed impact. Although the issue was detected by our telemetry, service health degradation only appeared in the final phase of the rollout. This underlying infrastructure has both regional and global endpoints, to facilitate the varying needs of dependent services. When global endpoints become unhealthy, they have a global impact on services that depend on them. In this incident, the observed impact manifested as the change reached the final phase of the rollout, when it was deployed to high-traffic regions and global endpoints. Prior to that, the change had been rolling out over the preceding week with no issues observed.

Mitigation: Initially we attempted to roll back the impactful change; however, due to internal service dependencies, this approach was unsuccessful. As a result, we deployed a hotfix to the affected underlying infrastructure, which mitigated all customer impact.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Disabling the impactful change globally, across all of the affected underlying infrastructure [Completed]
  • Updating our standard operating procedures (SOP) to include specific mitigation guidance for similar scenarios to avoid rollback and help to reduce time to mitigation [Will be completed by 6/8/2022]
  • Adjusting our rollout plan for similar changes to the underlying infrastructure, to add rollout phases that will help reduce impact to customers in similar scenarios [Will be completed by 6/8/2022]
  • Migrating the underlying infrastructure to a new architecture that will eliminate internal service dependencies which prevented us from rolling back the change. This will also improve the speed of our repair actions, to help significantly shorten time to mitigation during similar impact scenarios [Rollout will begin in July 2022]

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

April 2022

8 April

RCA - Service Management Operation Errors Across Azure Services in East US 2 (Tracking ID Y__5-9C0)

Summary of Impact: Between 12:25 UTC on 08 Apr 2022 and 14:40 UTC on 09 Apr 2022, customers running services in the East US 2 region may have experienced service management errors, delays, and/or timeouts. Customers may have experienced issues that caused GET and PUT errors impacting the Azure portal itself, as well as services including Azure Virtual Machines (VMs), Virtual Machine Scale Sets (VMSS), Azure Data Factory (ADF), Azure Databricks, Azure Synapse, Azure Backup, Azure Site Recovery (ASR), and Azure Virtual Desktop (AVD). Customers may have seen errors including "The network connectivity issue encountered for Microsoft.Compute cannot fulfill the request." For some downstream services that have auto-scale enabled, this service management issue may have caused data plane impact.

Root Cause: We determined that the Compute Resource Provider (CRP) Gateway service experienced an issue which severely reduced its throughput. The underlying issue was a retry storm triggered by the zonal failure of a related Allocator service. While we were able to recover the Allocator service by restarting the instances in the failed zone, the backlog of work exposed a potential issue with the .NET CLR and its Garbage Collector. This resulted in a large percentage of incoming calls to the CRP Gateway failing, and the retries triggered by the upstream services only made the load situation worse. Under normal circumstances the Gateway instances are overprovisioned for such retry storms, but the combination of reduced throughput across all instances and continuous retries (some services which normally make 25K calls per minute were making 150K calls in the same period due to retries) resulted in a prolonged impact.
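
To illustrate the amplification effect (a simplified sketch; the numbers mirror the 25K to 150K example above, and the capped exponential backoff shown is a general mitigation pattern rather than a statement about what the affected services implement):

```python
# Simplified illustration of retry amplification - not the affected services' code.
import random

def load_with_immediate_retries(base_calls: int, retries_per_failure: int,
                                failure_rate: float) -> int:
    """If every failed call is retried immediately N times while the backend is
    unhealthy, offered load multiplies: 25K base calls with 5 retries at ~100%
    failure becomes ~150K calls in the same window."""
    failed = int(base_calls * failure_rate)
    return base_calls + failed * retries_per_failure

def backoff_delays(max_attempts: int = 5, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Capped exponential backoff with jitter spreads retries out in time,
    keeping the per-minute load close to the base rate instead of 6x."""
    return [min(cap, base * (2 ** attempt)) * random.random() for attempt in range(max_attempts)]

print(load_with_immediate_retries(25_000, 5, 1.0))   # ~150,000 calls per minute
print([round(d, 2) for d in backoff_delays()])       # retry delays in seconds
```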

Deeper investigation into process profile data showed that the process was experiencing a high rate of timeout exceptions for ongoing operations, and that the .NET Garbage Collector was overworked due to high heap churn under the above-mentioned spike in load. The high rate of exceptions and the simultaneous pressure on the .NET GC exposed an unfavorable interaction with the .NET runtime's process-wide lock.

Mitigation: To mitigate the situation, the below steps were taken:

  • Two large services were temporarily throttled more aggressively to ensure they did not continue to overload the gateway.
  • Once the underlying cause of the throughput reduction was partially understood, the gateway services were restarted multiple times until they exited the wedged state.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • As a long-term fix, we initiated a CRP gateway hotfix that will prevent the gateway from entering into the wedged state. The hotfix roll-out is progressing as per our Safe Deployment Practices.
  • We are flighting a configuration change to make the .NET GC work less hard and avoid the interaction with the process-wide lock that is surfaced by exception handling.
  • Repair items have been identified to optimize areas of code that were causing heap churn.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: