Azure status history

This page contains root cause analyses (RCAs) of incidents that occurred on or after 20 November 2019. Each RCA will be retained on this page for 5 years. RCAs from before 20 November 2019 are not available.

June 2021

14/6

RCA - Error notifications for service management operations - West US 2 (Tracking ID LL1H-9CZ)

Summary of Impact: Between 22:00 UTC on 14 Jun 2021 and 11:15 UTC on 15 Jun 2021, a subset of customers with resources hosted in the West US 2 region may have received errors while performing service management operations - such as create, update, and delete - for multiple services. At 09:20 UTC on 15 Jun 2021, mitigation was applied, and services gradually began to recover as load from queued service management requests reduced. Full recovery was confirmed for all impacted services at 11:15 UTC.

Root Cause: We established that there were several factors that contributed to this customer impact:

  • One of the backend access control services specifically serving service management requests in West US 2 experienced a period of unexpected high CPU consumption, because of an anomalous spike in internal traffic. This resulted in requests to the service timing out.
  • Additionally, resources hosting this service became unavailable due to a code defect in a driver. This defect manifested itself under this specific load, which both exacerbated the issue and lengthened mitigation efforts.

Due to the nature of this backend service, automatic scaling is not possible; we rely on stress-testing to predict capacity needs. We identified that our stress tests did not account for the configuration present in the West US 2 region.

Mitigation: Mitigation workstreams continued over an extended period due to the complications of recovering a low-level internal service under high load while the driver was crashing. At 06:39 UTC on 15 Jun 2021, targeted network rules were introduced to block specific internal traffic on a subset of underlying backend service instances to reduce the load. At 06:51 UTC, we applied a configuration change to the infrastructure and removed the impacted driver. At 09:20 UTC, additional capacity was added to the internal infrastructure. This allowed impacted customer-facing services to stabilize, mitigating the issue.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Continuing to investigate the cause of the anomalous traffic spike.
  • Re-running stress tests across all regions to account for configuration differences between regions.
  • Scaling capacity to the levels predicted by the stress tests above.
  • Removing the driver that caused failures during the incident.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: 

May 2021

20/5

RCA - Issues accessing the Azure portal and other Microsoft services (Tracking ID KN22-39Z)

Summary of Impact: Between 06:52 UTC and 16:20 UTC on 20 May 2021, a subset of Azure customers may have experienced intermittent errors when attempting to access the Azure portal and other Microsoft and Azure services. Impact was observed across multiple services and regions to varying degrees. Recovery time varied by service, and most services fully recovered by 16:20 UTC.

Root Cause: We identified a series of transient name resolution issues that impacted a subset of Azure regions. The impact was seen as follows:

  • 06:52 UTC to 07:10 UTC - regions in Europe
  • 09:00 UTC to 09:30 UTC - regions in India
  • 15:53 UTC to 16:20 UTC - regions in Europe (primarily UK)

The name resolution issues were caused by a code regression in a recent deployment to our edge DNS servers. The regression introduced lock contention issues which, when triggered, caused some processes on our edge servers to go into a paused state and stop serving traffic for some time. The paused processes auto-recovered and started serving traffic again. This led to intermittent query drops and degraded service performance. During this time, retries may have been successful. The issue had a low probability of being triggered, and it only started manifesting itself several days after the gradual deployment completed.
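
Purely as an illustration (this is not Azure's DNS server code, and the names below are invented), the sketch shows how a lock held across an unexpectedly slow code path can make otherwise healthy worker threads appear paused: every thread that needs the lock stops serving queries until the holder releases it, after which work resumes on its own, matching the intermittent drop-and-recover pattern described above.

```python
# Illustrative sketch of lock contention pausing workers; hypothetical names,
# not Azure's DNS implementation.
import threading
import time

cache_lock = threading.Lock()   # shared lock guarding a cache structure
served = 0

def slow_maintenance():
    # A rarely triggered code path that holds the lock far too long.
    with cache_lock:
        time.sleep(2.0)         # simulates the regression's long critical section

def serve_queries():
    global served
    for _ in range(5):
        with cache_lock:        # normally held for microseconds
            served += 1         # "answer" a query
        time.sleep(0.1)

if __name__ == "__main__":
    blocker = threading.Thread(target=slow_maintenance)
    workers = [threading.Thread(target=serve_queries) for _ in range(4)]
    blocker.start()             # while this holds the lock, all workers stall
    for w in workers:
        w.start()
    for t in workers + [blocker]:
        t.join()
    print(f"queries served: {served}")  # work completes only after the lock frees up
```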

Mitigation: To resolve the issue, we rolled back the recent deployment to a previously known healthy state using our safe deployment practices (SDP), first in the impacted regions and then globally. After mitigation at 16:20 UTC, we continued to monitor the platform to ensure the stability of the service, both prior to and during the rollback of the deployment.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
  • Introducing software and hardware diversity in our DNS stack to ensure that a code regression does not impact service resiliency.
  • Improving the stress/non-functional test coverage to handle additional fault injection scenarios.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: .

10/5

Microsoft Azure Portal - Intermittent Portal Access Issues (Tracking ID GVD7-RDZ)

Summary of Impact: Between 15:24 UTC and 17:55 UTC on 10 May 2021, customers may have experienced intermittent 500-level errors or intermittent latency when accessing the Azure portal. Azure services were not affected.

Preliminary Root Cause: The Azure portal frontend endpoints in the US North Central region experienced an increase in CPU usage, causing some instances to not serve traffic as fast as expected.

Mitigation: We rerouted traffic around the unhealthy region and scaled out CPU resources in adjacent regions to handle the increase in traffic.

Next steps: We sincerely apologize for the impact to affected customers. We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: for video tutorials and for how-to documentation. 

4/5

Azure Speech Service - West Europe - Mitigated (Tracking ID LLL3-LTZ)

Summary of Impact: Between 06:45 UTC and 11:35 UTC on 04 May 2021, a subset of customers using Azure Speech Service in West Europe may have experienced failures with online transcription, batch transcription, custom speech, and translation.

Preliminary Root Cause: We determined that, during a recent deployment, part of the code lost access to Key Vault, preventing the App Service that Azure Speech Service depends on from running as expected.
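
As a minimal sketch of the kind of dependency described above, the snippet below uses the public Azure SDK for Python (azure-identity and azure-keyvault-secrets); the vault URL and secret name are placeholders. If the identity running the application loses access to the vault, the startup secret fetch fails and the dependent service cannot run as expected.

```python
# Minimal sketch of an app that depends on Key Vault at startup.
# Uses the public azure-identity / azure-keyvault-secrets packages;
# the vault URL and secret name below are placeholders.
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://example-vault.vault.azure.net"  # placeholder

def load_connection_string() -> str:
    credential = DefaultAzureCredential()
    client = SecretClient(vault_url=VAULT_URL, credential=credential)
    try:
        return client.get_secret("service-connection-string").value
    except AzureError as err:
        # If access to the vault is lost (for example, through a misapplied
        # access policy), the app fails here and cannot run as expected.
        raise RuntimeError(f"Key Vault access failed: {err.message}") from err
```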

Mitigation: We restored access to the Key Vault to mitigate this issue and enable the App Service to run as expected, in turn bringing Azure Speech Service back to a healthy state.

Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: for video tutorials and for how-to documentation.

April 2021

30/4

Issues accessing Azure Portal - HTTP 500-level Errors / Performance issues - Mitigated (Tracking ID 0TK3-HPZ)

Summary of Impact: Between 07:30 and 08:45 UTC on 30 Apr 2021, a subset of customers may have experienced intermittent HTTP 500 errors or general latency when trying to access the Azure Portal. There was no impact to Azure services during this time and retries to the portal may have been successful for some customers.

Preliminary Root Cause: At the start of business hours in the UK, the Azure portal frontend endpoints in UK South began scaling up their instances to support the daily traffic. Our initial investigation shows that the scaling process kicked in as expected, but instances did not serve traffic as fast as expected, leading to a degraded customer experience.

Mitigation: The issue self-healed once the new instances were able to serve traffic. Even though our telemetry shows that the traffic patterns for the duration of the incident were similar to those observed during the past week, we provisioned additional instances and also increased the maximum instance count to be used for future scaling operations.
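
As a rough sketch of this kind of change (the scaling helper and the numbers are invented for illustration and are not an Azure interface), raising the minimum keeps more warm instances serving before a ramp begins, while raising the maximum gives reactive autoscaling more headroom:

```python
# Hypothetical sketch of adjusting scaling limits; numbers and helper are
# illustrative, not the portal's actual configuration.
from dataclasses import dataclass

@dataclass
class ScaleLimits:
    min_instances: int   # instances kept warm at all times
    max_instances: int   # ceiling the autoscaler may reach

def clamp_to_limits(desired: int, limits: ScaleLimits) -> int:
    # Reactive autoscaling proposes `desired`; the limits bound what actually runs.
    return max(limits.min_instances, min(desired, limits.max_instances))

before = ScaleLimits(min_instances=4, max_instances=20)
after = ScaleLimits(min_instances=8, max_instances=40)   # more warm capacity + headroom

for desired in (3, 12, 30):   # e.g. quiet hours, morning ramp, peak
    print(f"desired={desired:2d}  before -> {clamp_to_limits(desired, before):2d}"
          f"  after -> {clamp_to_limits(desired, after):2d}")
```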

Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: for video tutorials and for how-to documentation.

20/4

RCA - Intermittent 503 errors accessing Azure Portal (Tracking ID HNS6-1SZ)

Summary of Impact: Between approximately 10:30 and 12:11 UTC, and again between 13:49 and 14:09 UTC on 20 Apr 2021, a subset of customers may have experienced intermittent HTTP 503 errors when trying to access the Azure Portal. There was no impact to Azure services during this time, and retries to the portal may have been successful for some customers.

Root Cause: The Azure portal frontend resources in UK South were taken out of rotation for maintenance the previous day, at 19:08 UTC on 19 Apr 2021. For operational reasons related to an issue with that maintenance, the region was left out of rotation for longer than anticipated. This shifted traffic from UK South to UK West. This scenario was within acceptable operational limits, as the volume of Azure Portal traffic for that part of the world was declining at the end of the working day there.

The next day, the increase in traffic caused our instances in UK West to automatically scale up. The service soon reached the maximum allowed number of instances and stopped scaling up further. The running instances became overloaded, with high CPU and disk activity, to the point where they became unable to process requests and began returning HTTP 503 errors.
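
For illustration only (the instance cap, per-instance throughput, and traffic figures below are hypothetical, not the portal's actual configuration), this sketch shows why hitting an autoscale ceiling matters: once the instance count computed from load is clipped at the maximum, each remaining instance carries more traffic than it is sized for and requests begin to fail.

```python
# Hypothetical autoscale calculation showing the effect of a hard instance cap.
MAX_INSTANCES = 10          # configured ceiling (illustrative number)
REQS_PER_INSTANCE = 500     # load one instance can serve healthily

def target_instances(total_rps: int) -> int:
    desired = -(-total_rps // REQS_PER_INSTANCE)    # ceiling division
    return min(desired, MAX_INSTANCES)              # clipped at the cap

def per_instance_load(total_rps: int) -> float:
    return total_rps / target_instances(total_rps)

for rps in (3_000, 5_000, 9_000):   # traffic roughly doubling after the failover
    n = target_instances(rps)
    status = "OVERLOADED" if per_instance_load(rps) > REQS_PER_INSTANCE else "ok"
    print(f"{rps} rps -> {n} instances, {per_instance_load(rps):.0f} rps each: {status}")
```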

Mitigation: At 12:11 UTC, we removed the region from the global Azure Portal rotation to restore functionality. In tandem, we also scaled out resources in other regions to ensure there was no impact from the load rebalancing.

As part of the mitigation troubleshooting, the UK West instances were initially reimaged and returned to rotation, as the impact from the UK South traffic was not fully understood and it was believed this would resolve the issue. They were briefly brought online at 13:49 UTC, but the lack of scale caused a recurrence of the issue. UK West was taken out of rotation again at 14:09 UTC, pending a full RCA review.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Return UK South and UK West to rotation with increased autoscaling limits. [COMPLETED]
  • Ensure autoscaling rules for adjacent regions are adjusted in the event of a region being taken out of rotation.
  • Raise internal alerts to a higher severity to ensure an earlier response.
  • Raise default thresholds for autoscaling to account for the growth of the Portal.
  • Improve monitoring to take region out of rotation automatically (failures weren't consistent enough to reach the threshold for our alerts).
  • Alert if a region is running at the maximum auto-scale limits.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey:

1/4

RCA - DNS issue impacting multiple Microsoft services (Tracking ID GVY5-TZZ)

Summary of Impact: Between 21:21 UTC and 22:00 UTC on 1 Apr 2021, Azure DNS experienced a service availability issue. This left customers unable to resolve domain names for the services they use, causing intermittent failures when accessing or managing Azure and Microsoft services. Due to the nature of DNS, the impact of the issue was observed across multiple regions. Recovery time varied by service, but the majority of services recovered by 22:30 UTC.

Root Cause: Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches. As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.
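
As a client-side illustration only (the failing lookup and timings below are made up, and this is not Azure DNS client code), the sketch shows a bounded retry policy with exponential backoff and jitter; without such spacing, each failed query turns into several near-immediate retries, which is the kind of load amplification described above.

```python
# Illustrative retry policy for a failing lookup; not Azure DNS client code.
import random
import time

def lookup(name: str) -> str:
    raise TimeoutError("no response")       # stand-in for a timed-out DNS query

def resolve_with_backoff(name: str, attempts: int = 4) -> str:
    delay = 0.1
    for attempt in range(attempts):
        try:
            return lookup(name)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter: each failure waits roughly twice
            # as long, so retries do not pile onto an already overloaded service.
            time.sleep(delay * random.uniform(0.5, 1.5))
            delay *= 2
    raise TimeoutError(name)

if __name__ == "__main__":
    try:
        resolve_with_backoff("example.azurewebsites.net")
    except TimeoutError:
        print("lookup failed after bounded, spaced-out retries")
```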

Mitigation: The decrease in service availability triggered our monitoring systems and engaged our engineers. Our DNS services automatically recovered themselves by 22:00 UTC. This recovery time exceeded our design goal, and our engineers prepared additional serving capacity and the ability to answer DNS queries from the volumetric spike mitigation system in case further mitigation steps were needed. The majority of services were fully recovered by 22:30 UTC. Immediately after the incident, we updated the logic on the volumetric spike mitigation system to protect the DNS service from excessive retries.

Next Steps: We apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Repair the code defect so that all requests can be efficiently handled in cache.
  • Improve the automatic detection and mitigation of anomalous traffic patterns.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey at .