Azure status history

This page contains all root cause analyses (RCAs) for incidents that occurred on 20 November 2019 or later. Each RCA will be retained on this page for 5 years. RCAs for incidents that occurred before 20 November 2019 are not available.


October 2021

13 Oct

RCA - Virtual Machines (Tracking ID 0NC_-L9G)

Summary of Impact: Between 06:27 UTC and 12:42 UTC on 13 Oct 2021, a subset of customers using Windows-based Virtual Machines (Windows VMs) may have received failure notifications when performing service management operations such as start, create, update, and delete. Deployments of new VMs and any updates to extensions may have failed. Management operations on Availability Sets and Virtual Machine Scale Sets were also impacted.

Non-Windows Virtual Machines were unaffected; however, services with dependencies on Windows VMs may have also experienced similar failures when creating resources.

Root Cause: Windows-based Virtual Machines utilize the Windows Virtual Machine Agent (VM Agent) extension, which is used to manage interactions between the Virtual Machine and the Azure Fabric.

When creating and updating Windows VMs, the Compute Resource Provider (CRP) has a dependency upon the Platform Image Repository to retrieve download locations for the latest version of the VM Agent package. Using this information, the VM Agent will update itself to the latest version in the VM.
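
For context on how this kind of lookup behaves, the sketch below is a rough client-side approximation (not CRP's internal code) that uses the public azure-mgmt-compute SDK to list the published versions of an extension in one region; the subscription, publisher, and extension type names are placeholders, not the actual Windows VM Agent identifiers. An empty result list corresponds to the symptom the incident produced once the extension was hidden.

```python
# Hedged sketch: list the published versions of a VM extension in one region
# using the public azure-mgmt-compute SDK. The publisher and extension type
# below are hypothetical placeholders, not the real Windows VM Agent names.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"        # placeholder
LOCATION = "westus2"                         # any Azure region
PUBLISHER = "Example.Publisher"              # hypothetical publisher name
EXTENSION_TYPE = "ExampleVMAgentExtension"   # hypothetical extension type

client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Ask the regional repository which versions of the extension are published.
versions = list(
    client.virtual_machine_extension_images.list_versions(
        LOCATION, PUBLISHER, EXTENSION_TYPE
    )
)

if not versions:
    # During the incident the regional repositories returned zero results,
    # which is why management operations on Windows VMs could not complete.
    raise RuntimeError(
        f"No published versions of {PUBLISHER}/{EXTENSION_TYPE} in {LOCATION}"
    )

# Naive "latest" selection by string comparison, purely for illustration.
latest = max(versions, key=lambda v: v.name)
print(f"Latest available version: {latest.name}")
```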

As part of the journey to move all classic resources to Azure Resource Manager (ARM), we are migrating the image and extension publishers to the regional ARM publishing pipeline. Approximately 20% of all extensions have been successfully migrated.

At approximately 06:27 UTC, tooling provided an ARM template for use in performing these migrations. This tooling did not account for an edge case and, as an unintended consequence, marked the Windows VM Agent extension in the regional ARM service as visible only to the publishing subscription after migration. As a result, VM management operations began to fail after receiving zero results from the regional Platform Image Repositories.

The outcome of this was that service management operations (start, stop, create, delete, etc.) on customers' Windows VMs were unable to locate the Windows VM Agent extension and therefore could not complete successfully.

Part of our change management process is to leverage the Safe Deployment Practice (SDP) framework. In this case, some of the functionality of our classic infrastructure is incompatible with the SDP framework. This incompatibility underscores the importance we place on completing the migration to ARM. Once the migration is complete, we will be able to make all changes using the SDP framework without relying on bespoke tools that support only classic resources.

Mitigation: Determining the root cause took an extended period because multiple releases for Azure components were in flight simultaneously on the platform, each of which had to be investigated. Additionally, involving subject matter experts (SMEs) for each of the components involved added to this time, as we needed to eliminate multiple possible scenarios to confirm the underlying cause.

Once we determined the issue and reviewed multiple mitigation options, we mitigated impact by first making the extension public in a single region and validating the results, ensuring that a surge in Virtual Machine requests would not cause further impact. Once validated, we rolled the change out to the new pipeline region by region, mitigating the issue. Engineers monitored the platform success rate for operations after the changes were completed.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • The migration of the remaining packages in this category (including the Linux version of the VM Agent) is on hold until all repairs are in place.
  • Additional pre-checks and post-checks are being developed and implemented.
  • VM operations are being made more resilient to failures when the VM Agent extension cannot be found.
  • Engineering is also evaluating other safeguards to flight each extension type and prevent any potential negative impact during the remainder of the migration.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

August 2021

11 Aug

RCA - Connection errors for resources leveraging Azure Front Door and Azure CDN (Tracking ID 0MQY-NPG)

Summary of Impact: Between 06:30 UTC and 09:30 UTC on 11 Aug 2021, a subset of customers leveraging Azure Front Door and Azure CDN Standard from Microsoft in Japan East, Japan West, Korea South, Korea Central and/or West US regions may have experienced intermittent HTTPS request connectivity failures when trying to reach their applications. During the incident, the average global error rate was ~2.5% and the peak global error rate was ~5%.

Root Cause: Azure Front Door and Azure CDN Standard from Microsoft serve traffic through edge locations around the world. We were in the process of rolling out a software update to prevent the use of TLS session resumption keys older than specific thresholds. The update followed the Azure safe deployment process and was rolling out in phases until it reached the impacted locations. A subset of edge locations in Korea, Japan, and West US were running with stale TLS resumption keys, and the rolled-out update triggered the mechanism that prevents the reuse of stale keys. However, a code defect in the rolled-out version resulted in a race condition in which a few servers in the impacted locations tried to revert to a full TLS handshake, causing those servers to drop HTTPS requests.
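
For context on the mechanism involved, the hedged sketch below uses only Python's standard ssl module against an arbitrary public host to show, from the client side, what TLS session resumption and the fallback to a full handshake look like. It does not reproduce the server-side race condition described above; in normal operation a rejected resumption attempt simply falls back to a full handshake rather than dropping the request.

```python
# Minimal client-side illustration of TLS session resumption, using only the
# Python standard library. The host is arbitrary; nothing here is specific to
# Azure Front Door or Azure CDN.
import socket
import ssl

HOST, PORT = "example.com", 443

context = ssl.create_default_context()
# Pin to TLS 1.2 so the standard library exposes a reusable session object.
context.maximum_version = ssl.TLSVersion.TLSv1_2

def connect(session=None):
    sock = socket.create_connection((HOST, PORT))
    tls = context.wrap_socket(sock, server_hostname=HOST, session=session)
    print(f"resumed={tls.session_reused}, cipher={tls.cipher()[0]}")
    return tls

# First connection: always a full handshake; keep the session for reuse.
first = connect()
saved = first.session
first.close()

# Second connection: offer the saved session. If the server accepts it, the
# handshake is resumed; if it rejects it (for example because the resumption
# key has rotated or is considered stale), the connection transparently falls
# back to a full handshake -- the behavior the code defect broke.
second = connect(session=saved)
second.close()
```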

Mitigation: Our monitoring detected this issue and alerted the service team. To mitigate, we removed unhealthy edge locations from serving traffic, which routed traffic to healthy edge locations. We also rolled back the update that caused the regression.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Deploying the fix for the regression that caused the race condition.
  • Enhancing monitoring to ensure alerting if TLS resumption is off or if the session resumption key is older than the threshold.
  • Enhancing staging environments to account for additional stale TLS resumption failure modes.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

July 2021

28 Jul

RCA - Azure Network Infrastructure service availability issues - Brazil Southeast (Tracking ID LNZM-TZG)

Summary of Impact: Between 13:48 UTC and 15:20 UTC on 27 July 2021, a subset of customers experienced issues connecting to their services in the Brazil Southeast region.

Root Cause: We determined that a degradation in connectivity was caused by packet loss when the metadata on one of our regional-level routers was updated incorrectly. As part of a planned network configuration refresh, an update was being performed on the regional-level routers in the Brazil Southeast region. The regional-level tier of the network is designed with redundancy to allow a subset of the routers at that network tier to be taken off-line (not serving customer traffic) for updates.

During the update, our automated network configuration system applied an incorrect IPv4 network prefix (IP Range) to a regional-level router that was taken off-line. Restoring traffic to this regional-level router resulted in packet loss for some of the Azure services in the region. The incorrect network prefix caused traffic from this region to Azure destinations in other regions and a subset of internet regions to be dropped.
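
As a simplified illustration of why an incorrect prefix drops traffic, the toy longest-prefix-match lookup below (plain Python with the standard ipaddress module, using made-up documentation-range prefixes, not Azure's actual routing stack or address space) shows how a destination that no longer falls under any installed prefix has no matching route and is discarded.

```python
# Toy longest-prefix-match routing lookup. All prefixes and addresses are
# made up (IETF documentation ranges); this only illustrates the general
# mechanism, not the actual router configuration involved in the incident.
import ipaddress

def lookup(route_table, destination):
    """Return the most specific matching prefix, or None if the packet would be dropped."""
    dst = ipaddress.ip_address(destination)
    matches = [net for net in route_table if dst in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None

intended = [ipaddress.ip_network("203.0.113.0/24")]   # prefix the router should have carried
applied  = [ipaddress.ip_network("198.51.100.0/24")]  # incorrect prefix pushed by automation

destination = "203.0.113.25"  # traffic that the intended prefix would have covered

print("intended table:", lookup(intended, destination))  # 203.0.113.0/24 -> forwarded
print("applied table: ", lookup(applied, destination))   # None -> dropped
```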

Mitigation: The device with incorrect prefixes was removed from service. This mitigation took longer than expected because automated safety checks were failing for the entire region, and some human intervention was required to proceed with the traffic rollback. 

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • We have audited all ongoing deployments to identify this error pattern and are enhancing our validation checks to prevent such a combination of errors, including improved detection logic.
  • We are continuously enhancing our alert logic to help identify issues faster and force rollbacks without human intervention.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: