Azure status history

October, 2019

21-10

Service Management Errors - East US 2 - Mitigated

Summary of impact: Between 10:08 and 23:37 UTC on 21 Oct 2019, a subset of customers in East US 2 may have received error notifications or experienced high latency when performing service management operations - such as create, update, or delete - for resources hosted in this region. Impacted services are listed below. Additionally, customers using Azure Databricks and/or Data Factory v2 may have encountered service management errors in multiple regions. A very limited subset of customers using Virtual Machines with SQL Server images, or other SQL IaaS offerings, may also have encountered errors when performing service management operations on resources hosted in East US 2.

Preliminary root cause: After an initial investigation, engineers found that a backend service responsible for processing service management requests experienced a period of high load, causing service management requests to fail or be delayed.
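
For illustration only, the Python sketch below shows one generic way a request-processing service can shed excess load so that requests fail fast instead of queuing and timing out; it is not the actual Azure backend implementation, and all names and limits in it are hypothetical.

    import threading

    class LoadShedder:
        """Admit at most max_in_flight concurrent requests; shed the rest."""
        def __init__(self, max_in_flight):
            self.max_in_flight = max_in_flight
            self._in_flight = 0
            self._lock = threading.Lock()

        def try_acquire(self):
            with self._lock:
                if self._in_flight >= self.max_in_flight:
                    return False          # shed: caller returns a retriable error
                self._in_flight += 1
                return True

        def release(self):
            with self._lock:
                self._in_flight -= 1

    shedder = LoadShedder(max_in_flight=100)

    def handle_request(work):
        if not shedder.try_acquire():
            return "429 Too Many Requests"   # fail fast under high load
        try:
            return work()
        finally:
            shedder.release()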

Mitigation: Engineers deployed a platform hotfix to decrease the high load and verified the service health to mitigate the issue. 

Next steps: A Root Cause Analysis will be provided within 72 hours.

18-10

Authentication issues with Azure MFA in North America

Summary of Impact: Between 13:30 UTC and 15:57 UTC on 18 Oct 2019, a subset of customers in North America may have experienced issues receiving multi-factor authentication (MFA) challenges. The issue was detected by automated monitoring, and the engineering team was alerted and fully engaged by 13:48 UTC. Communications were provided to customers once impact was confirmed.

Preliminary Root cause: The trigger of the issue was severe packet loss between Microsoft and a third-party service. The packet loss occurred in a network external to Microsoft but on the route to the third-party service. The packet loss was greater in severity and duration than previously measured. The combination of severe packet loss and morning peak load in North America resulted in Azure MFA service degradation in North American data centers. Azure MFA in the rest of the world did not experience a degradation of service.

Mitigation: The route recovered from packet loss starting at 15:40 UTC, and by 15:57 UTC the Azure MFA service in North America had recovered.

Next Steps: The team has maintained round-the-clock coverage to ensure service stability and has not seen a recurrence of the degradation that caused the issue. Additionally, changes have been made to improve resilience and throttling to better protect against similar issues connecting to third-party services. Finally, we are accelerating work, already in progress, that is designed to protect against a total loss of connection to the third-party communications service.
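
As a hedged illustration of the resilience pattern described above (and not the MFA service's actual code), the Python sketch below bounds the time and retries spent on a degraded external dependency; call_provider is a hypothetical stand-in for the third-party call.

    import random
    import time

    def call_with_budget(call_provider, attempts=3, per_call_timeout=2.0, base_delay=0.5):
        """Try a third-party call a bounded number of times with backoff and jitter."""
        last_error = None
        for attempt in range(attempts):
            try:
                return call_provider(timeout=per_call_timeout)  # cap time per attempt
            except TimeoutError as exc:
                last_error = exc
                # exponential backoff with jitter avoids hammering a lossy path
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        raise RuntimeError("third-party dependency unavailable") from last_error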

16-10

Azure Portal - Mitigated

Summary of impact: Between approximately 13:45 and 14:59 UTC on 16 Oct 2019, a subset of customers may have experienced latency issues with the Azure Portal, the Azure command-line interface (CLI), and Azure PowerShell.

Preliminary root cause: Engineers determined that some instances of an ARM backend service reached an operational threshold, preventing requests from completing.

Mitigation: Engineers performed a manual scale out of the backend service to mitigate this issue.
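
For illustration only, a minimal Python sketch of the kind of threshold-based scale-out calculation that motivates a manual scale out; the names and numbers are hypothetical, and this is not the ARM backend's autoscaler.

    import math

    def desired_instance_count(current_instances, requests_per_second,
                               target_rps_per_instance, max_instances=50):
        """Scale out when observed load per instance exceeds the target."""
        needed = math.ceil(requests_per_second / target_rps_per_instance)
        return max(current_instances, min(needed, max_instances))

    # Example: 1,200 req/s at a target of 100 req/s per instance -> 12 instances.
    print(desired_instance_count(current_instances=8,
                                 requests_per_second=1200,
                                 target_rps_per_instance=100))  # 12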

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

2-10

Log Analytics - USGov Virginia - Mitigated

Summary of Impact: Between 19:45 EDT on 02 Oct 2019 and 23:45 EDT on 03 Oct 2019, a subset of customers using Log Analytics in USGov Virginia may have received 500-level errors when performing queries on Log Analytics data.

Preliminary root cause: Engineers determined that instances of a back-end service responsible for middle-tier authentication became unhealthy, preventing data queries from completing.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue. 

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts; video tutorials and how-to documentation are available.

September, 2019

25-9

RCA - Azure Kubernetes Service (AKS) - Provisioning Failures

Summary of Impact: Between 20:50 UTC on 25 Sep 2019 and 04:25 UTC on 26 Sep 2019, a subset of customers using Azure Kubernetes Service (AKS) may have received failure notifications when performing create and/or update operations for their resources.

Root Cause: A recent configuration change enabled an update to a backend service responsible for processing service management requests for AKS. The update exposed a code defect in the service, and the fix for that defect had not yet finished rolling out to the impacted regions. This resulted in failures that prevented the service from processing requests. In addition, internal monitoring for this issue did not initially correlate alerts to the complete scope of impact, which delayed the investigation.

Mitigation: Engineers reverted the configuration change which mitigated the issue.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Monitoring the completion of the fix for the defect in all regions.
  • Updating the monitoring systems to improve alerting accuracy for these types of scenarios.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

16-9

RCA - Azure Front Door Service and Azure CDN

Summary of Impact: Between 00:00 UTC on 16 Sep 2019 and 00:30 UTC on 18 Sep 2019, you were identified as a customer who may have seen provisioning failures with Azure CDN and Azure Front Door service when using new Key Vault Certificates.

Root Cause: Engineers determined that a recent deployment task impacted Key Vault API calls, causing the provisioning failures. The deployment change was made to filter out managed secrets for Key Vault portal extensions; however, it also impacted Remote Procedure Calls (RPCs). Customers who depend on RPCs were unable to see Key Vault managed secrets, impacting the ability to provision Azure CDN and Azure Front Door services.

Mitigation: Engineers deployed a platform hotfix to mitigate the issue.

Next Steps:  We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. We have implemented additional test scenarios for listSecrets RPCs to prevent future occurrences.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey:

11-9

RCA - Australia East - Intermittent Connectivity

Summary of Impact: Between 00:30 and 01:40 UTC on 11 Sep 2019, a subset of customers in Australia East may have experienced intermittent connectivity failures when connecting to resources hosted in this region or when accessing the Azure Portal.

Root Cause and Mitigation: The Azure Load Balancing service is hosted on multiple servers grouped into rings. During a routine configuration update to a single ring, a health check failed, causing the configuration update to automatically roll back. The rollback process was optimized for speed and, in this case, reduced the capacity of the ring because it updated multiple instances at the same time. When the ring capacity was reduced, customers were impacted by dropped packets. To prevent this from happening in the future, we are adding additional health checks to the rollback process. These checks will ensure that the available capacity does not drop below a minimum specified threshold.
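
To make the planned safeguard concrete, the Python sketch below shows one way a rollback loop can refuse to take another instance out of rotation when available capacity would fall below a floor; it is an assumption-laden illustration, not the Azure Load Balancing rollback code.

    def safe_rollback(instances, rollback_one, is_healthy, min_healthy_fraction=0.8):
        """Roll back one instance at a time, pausing if capacity would drop too low."""
        total = len(instances)
        for instance in instances:
            healthy_now = sum(1 for i in instances if is_healthy(i))
            # Subtract one for the instance about to be taken out of rotation.
            if (healthy_now - 1) / total < min_healthy_fraction:
                raise RuntimeError("rollback paused: capacity below minimum threshold")
            rollback_one(instance)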

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Adding capacity health checks during rollback to prevent the rollback process from reducing capacity below a set threshold.
  • Improving alerting thresholds to capture packet drops on the Azure Software Load Balancer.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey: 

August, 2019

28-8

RCA - Azure Email Alerts

Summary of Impact: Between 20:22 UTC on 28 Aug 2019 and 01:35 UTC on 29 Aug 2019, a subset of customers may not have received email alerts sent from a subset of Azure services during the impact window.

Root Cause: Engineers determined that an instance of the backend database service used for email processing had reached operational thresholds. During scale-up operations, the service became unresponsive because logging operations exceeded thresholds. This prevented requests from downstream Azure services that use the Azure email service from completing successfully.

Mitigation: Engineers halted the scale-up operation. Operation throttling thresholds were then gradually returned to normal to fully mitigate the issue.
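
For illustration of the throttling improvement mentioned in the next steps, a minimal token-bucket sketch in Python is shown below; it is not the email pipeline's actual throttle, and the rates are hypothetical.

    import time

    class TokenBucket:
        """Allow a bounded rate of expensive operations (e.g. logging writes)."""
        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    bucket = TokenBucket(rate_per_sec=50, burst=100)
    if not bucket.allow():
        pass  # defer or drop the non-critical write rather than overload the backend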

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Improve throttling performance to help ensure that there are sufficient service resources during maintenance activities
  • Improve resiliency of the backend service responsible for processing email alerts

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey

23-8

RCA - Service Management Operations - West/North Europe

Summary of Impact: Between 06:20 and 10:46 UTC on 23 Aug 2019, a subset of customers in North Europe and West Europe may have received failure notifications when performing service management operations - such as create, update, delete - for resources hosted in these regions.

Root Cause: During this incident, memory consumption on Azure Resource Manager's (ARM) worker roles exceeded operational thresholds. This should have initiated a clearing process that attempts to recover additional memory when thresholds are reached; however, the memory cache could not be reclaimed at the time because it was still referenced by other active objects in the system. CPU utilization on the affected worker roles also increased, which in turn prevented certain processes from completing.

Analysis indicated that all threads on the affected worker roles were busy during the impact window, which manifested as impact to service management operations for ARM-dependent services and resources.

In addition, worker roles recycle on a weekly cadence, which, in this scenario, further contributed to thresholds being reached. The nature of the issue required manual intervention, which delayed mitigation.
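
The distinction between a reclaimable and an unreclaimable cache can be illustrated with the short Python sketch below; it is only an analogy for the behaviour described above, not ARM's actual cache implementation.

    import weakref

    class Payload:
        def __init__(self, data):
            self.data = data

    strong_cache = {}                            # entries stay until explicitly evicted
    weak_cache = weakref.WeakValueDictionary()   # entries vanish once unreferenced

    p1, p2 = Payload("entry-a"), Payload("entry-b")
    strong_cache["key"] = p1
    weak_cache["key"] = p2

    del p1, p2                                   # drop the last active references
    print("key" in strong_cache)                 # True  - the dict itself pins p1 in memory
    print("key" in weak_cache)                   # False - p2 was collected, entry removed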

Mitigation: Engineers performed a manual restart of the affected worker roles to mitigate the issue.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this included (but was not limited to):

  • Automatically recycling worker roles on a more frequent cadence [In Progress]
  • Separating distinct cache types [In Progress]
  • Adding enhanced logging information to understand memory cache usage in worker roles [In Progress]
  • Reducing cache footprint [In Progress]

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey.

16-8

RCA - Azure Services - South Central US

Summary of impact: Between 14:16 UTC and 21:36 UTC on 16 Aug 2019, a subset of customers using resources in South Central US may have experienced difficulties connecting to resources hosted in this region.

Root cause: An outage occurred during a regional infrastructure update, which is part of a South Central US datacenter refresh initiative. At approximately 14:00 UTC on 16 Aug 2019, facility engineers proceeded with the removal of infrastructure following a pre-approved method of procedure (MOP). However, an error in fiber trunk validation resulted in production-impacting connectivity loss. The loss of these physical links caused connectivity issues with a scale unit in the datacenter that hosted several Azure infrastructure services.

Mitigation: To mitigate, services with multi-region deployments successfully failed out of the region to minimize impact. Residual impact was mitigated by restoring full fiber connectivity to the affected scale unit and placing it back into production rotation.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Halt infrastructure refresh projects globally until the repair items below are addressed - Complete
  • Detailed review of the Method of Procedure (MOP) as it relates to infrastructure refresh projects - In progress
  • Improve the existing physical infrastructure change management process with additional collaboration and validation between facility and infrastructure engineering - In progress

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

14-8

RCA - USGov Iowa - Service Availability and Downstream Impact

Summary of impact: Between 16:25 EDT on 14 Aug 2019 and 01:00 EDT on 15 Aug 2019, a subset of customers in USGov Iowa may have experienced issues accessing services in that region. Engineers determined that an underlying compute issue was impacting SQL, API Management, Azure Site Recovery, Media Services, Service Bus, and StorSimple.

A subset of App Service customers may have experienced Service Management issues across these regions: USDoD Central, USDoD East, USGov Arizona, USGov Iowa, USGov Texas, and USGov Virginia.

Root cause and mitigation: Automated health monitoring is utilized in all Azure regions to predict and react to telemetry signals that indicate a failure condition has occurred on a resource. In this instance, health monitoring indicated that several physical host nodes in the scale unit had reached a threshold for high memory utilization, which resulted in those host nodes being moved into a repair state. An investigation into the specific cause of the increased memory footprint on these hosts is ongoing, but because taking the nodes out of rotation is the expected behavior of platform health monitoring, that part of the system functioned as designed.

On review, a secondary issue was detected: nodes requiring manual (non-automated) repair were accumulating. When nodes are taken out of rotation and put into repair, automation normally processes each node, performs the required tasks, and returns it to service. In this case, the automation was triggering but was placing the nodes into a manual repair mode, waiting for human intervention. As the number of failing nodes requiring human intervention grew, the Service Fabric seed nodes for Azure SQL DB in this stamp were impacted. Quorum was lost on the seed nodes, and databases hosted on this instance of Azure SQL DB became unavailable. In parallel, the SQL DB and Compute engineering teams worked to restore these services and the nodes in the scale unit. Manually applying the repair to these nodes was successful, after which the cluster fabric and SQL DB services were able to recover, mitigating the issue.

After an initial recovery of the SQL DB instance, additional node recovery efforts mistakenly impacted one of the Service Fabric seed nodes again, leading to quorum loss on the seed nodes. This was due to the method of recovery used for the systems requiring repair; a repair item has been created to address this conflict in the recovery process. Engineers manually recovered the nodes and brought them back online, restoring connectivity to the dependent services and, once those services were restored, mitigating the downstream impact.
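
For readers unfamiliar with quorum, the arithmetic behind the impact can be sketched in a few lines of Python; the node counts are illustrative only and are not the actual size of this seed-node set.

    def quorum(seed_nodes):
        """A strict majority of seed nodes is required to maintain quorum."""
        return seed_nodes // 2 + 1

    def has_quorum(seed_nodes, healthy):
        return healthy >= quorum(seed_nodes)

    print(quorum(5))          # 3 of 5 seed nodes must be healthy
    print(has_quorum(5, 2))   # False - two healthy seed nodes cannot form a majority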

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Improve the automated repair process to track the number of nodes required to operate the Azure SQL DB service and alert when the number of available nodes drops below safe operational thresholds
  • Improve alerting to upstream services for Service Fabric quorum failures when the failure rate exceeds normal safe operational thresholds

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey.

9-8

RCA - Azure Portal - Issues Loading Dashboard Tiles

Summary of impact: Between approximately 16:07 and 19:17 UTC on 09 Aug 2019, customers using the Azure portal may have received failure notifications under the following scenarios:

  • Customers who had pinned tiles that provide a list of resources (such as All Resources or All VMs) to their dashboards would have seen either a "No resources found" message or a grey tile. All other dashboard tiles, such as pinned resources, continued to function as expected.
  • Access to two pages in the Azure Portal: (a) the "My permissions" page, which enables customers to see what permissions they have in a given subscription, and (b) the "Resource Providers" option in the resource menu for any given subscription, which lets customers view, register, and unregister Resource Providers. Attempting to open either page would have resulted in the page load timing out after 30 seconds.

Root cause: The issue was caused by a code change that introduced a bug in an underlying API that the affected experiences relied on. The Azure Portal relies on feature flags to enable or disable features, and this regression occurred only when one such flag was set to the disabled state. While engineering had tests covering these scenarios, those tests were run only with the flag set to enabled, which was not the flag state in production.
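
A hedged sketch of closing this test gap, using pytest to run the same check under both flag states; load_dashboard_tiles and the test itself are hypothetical stand-ins, not the portal's real test suite.

    import pytest

    def load_dashboard_tiles(flag_enabled):
        # Hypothetical system under test; imagine a regression that returns []
        # only when the flag is disabled, which a single-state test never exercises.
        return ["All resources", "All VMs"]

    @pytest.mark.parametrize("flag_enabled", [True, False])
    def test_dashboard_tiles_load(flag_enabled):
        tiles = load_dashboard_tiles(flag_enabled)
        assert tiles, "tile list should render in both flag states"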

Mitigation: Portal engineering received alerts from our monitoring and alerting infrastructure, and a rollback to the previous production build, which did not contain the regression, was initiated. The rollback completed in multiple regions by 19:09 UTC on 09 Aug 2019, at which point the affected experiences were back to working as expected.

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Added new tests that run with all possible states of the feature flag involved in this incident, to ensure that a similar regression is caught before changes are deployed to production.

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

8-8

RCA - Azure CDN - Connectivity Issues

Summary of impact: Between 14:05 and 19:00 UTC on 08 Aug 2019, you were identified as a customer using Akamai CDN who may have experienced HTTP 504 (Gateway Timeout) errors when attempting to connect to resources.

Root cause: Engineers determined that a network update to add new peer servers to the US East region caused connections to fail between Akamai CDN and Azure resources. A network filtering policy on a redundant Microsoft peering device was improperly configured, which caused traffic sourced from Akamai CDN and destined for origins hosted on Azure storage in East US and East US 2 to be intermittently dropped. 

Mitigation: Engineers rolled back the updates to mitigate the issue. After the root cause was identified, the network filtering policy was updated to be consistent across the pair of devices, and traffic was successfully routed over this path without drops.
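
As an illustration of the configuration-drift checks listed in the next steps (not Microsoft's actual tooling), the Python sketch below compares the filtering policy on a redundant device pair and reports asymmetric rules; the rule strings are hypothetical.

    def policy_drift(primary_rules, secondary_rules):
        """Report filtering rules present on only one device of a redundant pair."""
        return {
            "only_on_primary": sorted(set(primary_rules) - set(secondary_rules)),
            "only_on_secondary": sorted(set(secondary_rules) - set(primary_rules)),
        }

    primary = {"permit cdn -> storage-east", "deny any -> mgmt"}
    secondary = {"deny any -> mgmt"}
    print(policy_drift(primary, secondary))
    # {'only_on_primary': ['permit cdn -> storage-east'], 'only_on_secondary': []}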

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Inventory the device fleet to ensure no further inconsistencies exist in network filtering policies
  • Update the device configuration drift platform to account for network filtering policy inconsistencies across device pairs
  • Add additional automation and checks when updating network filtering policies across the device fleet to remove the risk of inconsistencies

Provide feedback: Please help us improve the Azure customer communications experience by taking our survey

July, 2019

30-7

RCA - Issues logging in to the Azure Portal with a Microsoft Account

Summary of Impact: Between 21:00 UTC and 22:24 UTC on 30 Jul 2019, a subset of customers may have experienced intermittent error messages and failures when logging in to the Azure Portal with a Microsoft account (MSA). This issue affected a subset of MSA users, who could not authenticate or manage their accounts. Retries may have been successful.
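
Because retries may have succeeded during this window, a minimal client-side retry sketch in Python is shown below; it is a generic pattern rather than guidance specific to Microsoft account sign-in, and sign_in is a hypothetical callable.

    import random
    import time

    def retry_with_backoff(operation, attempts=4, base_delay=1.0, max_delay=30.0):
        """Retry a transient failure with exponential backoff and jitter."""
        for attempt in range(attempts):
            try:
                return operation()
            except Exception:
                if attempt == attempts - 1:
                    raise
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries

    # Usage (hypothetical): retry_with_backoff(lambda: sign_in(username, password))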

Root Cause: Engineers were performing standard maintenance on a standby module when the active module became unstable. It was determined that a combination of factors, including a device mismatch and a code bug, resulted in the active device becoming unstable.

Mitigation: Engineers routed the user traffic to other redundant paths to recover the service, which restored Microsoft Account services for customers.

Next Steps:
We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Updating processes around the manageability of all devices.
  • Improving and increasing the redundancy of the supporting infrastructure, including the manageability of failed devices.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey