Multiple services down in us-east4

Incident Report for IBM Power for Google Cloud

Postmortem

We deeply apologize for the impact this service interruption caused. This analysis outlines the complete root cause and the sequence of cascading failures that compromised the facility's redundant power systems in the us-east4 regional extension. The findings confirm that the initial external utility event triggered a failure chain within the facility, resulting in the loss of critical services on the IP4G platform. It also details the preventative actions we are taking to reinforce resilience.

Impact Duration: 3 hours, 30 minutes

On Wednesday, 15 October 2025, a major utility power event caused the Google Cloud regional extension data center in us-east4 to transition to generator power. During the automatic transfer from utility power to generator power, the transfer failed for several power blocks. These block failures caused a complete loss of redundant power for multiple customers in the data center facility, including IBM Power for Google Cloud.

Background and Incident Details

IBM Power for Google Cloud operates in Google Cloud regional extension data centers. These data centers are Tier 3 or greater facilities with N+1 cooling and power at minimum. IBM Power for Google Cloud relies on the data center facility to manage utility, generator, and UPS power. When we deploy each region, our engineers collaborate with the facility provider to ensure all systems are supplied by redundant power feeds designated by the facility. All systems in IBM Power for Google Cloud are connected to redundant power sources with no single point of failure. In us-east4, each IBM Power for Google Cloud compute, storage, and network system is connected to two power sources supplied by the facility. Each power source is attached to an independent circuit that is served by an array of uninterruptible power supplies (UPS) in the event of utility power failure. The facility also maintains several generators that supply long-term power in the event of a utility power failure. Generators and UPS units are grouped into blocks that can also provide additional redundancy to adjacent blocks in the event of failures within a block. The overall power system is robust and engineered for critical workloads.
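
As a loose illustration of the redundancy check described above, here is a minimal sketch in Python, assuming a purely hypothetical block layout and system names, that verifies each system draws its two feeds from two different power blocks. It is not our deployment tooling and does not reflect the facility's actual topology.

    # Minimal sketch (hypothetical topology): verify that every system draws its
    # two power feeds from two *different* facility power blocks, so that losing
    # any single block never removes both feeds from a system.
    SYSTEM_FEEDS = {
        # system name: (feed A block, feed B block) -- illustrative values only
        "compute-frame-01": ("block-A", "block-B"),
        "storage-array-01": ("block-A", "block-C"),
        "network-core-01": ("block-B", "block-C"),
    }

    def single_points_of_failure(feeds):
        """Return systems whose two feeds land on the same power block."""
        return [name for name, (a, b) in feeds.items() if a == b]

    offenders = single_points_of_failure(SYSTEM_FEEDS)
    if offenders:
        print("Single point of failure on:", ", ".join(offenders))
    else:
        print("All systems have feeds on two independent power blocks.")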

On October 15, 2025, utility power was lost and systems successfully transferred to multiple UPS blocks as expected. All facility generators received the start command and began running. However, two generators experienced faults that led to a cascading sequence of failures and a loss of power for the data center facility:

Initial Failure: Generator E tripped due to an improper breaker trip setting (a configuration error).

Secondary Failure: Simultaneously, Generator B tripped due to an internal breaker fault.

Total Discharge: With two generators unavailable, four independent UPS blocks exhausted their battery capacity and shut down.

Cascading Failure: The UPS load was automatically transferred to the remaining UPS block, which became overloaded and tripped its output breaker, causing a total loss of critical facility power.

This series of failures, which bypassed the designed redundancy of the facility, ultimately resulted in a total loss of power to all IBM Power for Google Cloud systems and other customers within the affected data center blocks.
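
To make the overload mechanism concrete, the following is a minimal sketch, using purely hypothetical capacity and load figures, of the same chain: two generators drop out, four UPS blocks drain and shut down, and the load transferred onto the one surviving block exceeds its rating and trips its output breaker. The numbers are illustrative only and do not reflect the facility's actual ratings.

    # Hypothetical figures only: illustrate why the simultaneous load transfer
    # overloaded the last surviving UPS block.
    UPS_BLOCKS = {
        # block: (rated capacity in kW, load carried in kW) -- illustrative values
        "block-1": (1000, 600),
        "block-2": (1000, 600),
        "block-3": (1000, 600),
        "block-4": (1000, 600),
        "block-5": (1000, 600),  # the block that initially retained power
    }

    # Generators B and E fault, so blocks 1-4 run on battery until discharge and
    # shut down; their load transfers to the one block still energized.
    failed_blocks = ["block-1", "block-2", "block-3", "block-4"]
    surviving = "block-5"

    transferred = sum(UPS_BLOCKS[b][1] for b in failed_blocks)
    capacity, own_load = UPS_BLOCKS[surviving]
    total = own_load + transferred

    print(f"Load presented to {surviving}: {total} kW against a {capacity} kW rating")
    if total > capacity:
        print("Overload: the output breaker trips and all critical power is lost.")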

Timeline

07:27 AM EDT - The data center facility provider’s UPS block supporting IBM Power for Google Cloud reaches end of battery discharge and shuts down

07:30 AM EDT - IP4G engineering detects loss of access to critical systems and begins troubleshooting

08:00 AM EDT - IP4G incident posted on the Statuspage to notify customers of the incident

08:22 AM EDT - IP4G engineering receives notification of critical power and generator failures in the provider facility

08:28 AM EDT - Utility power and load restored to the UPS block supporting IP4G

08:39 AM EDT - IP4G engineering detects systems coming online and begins the services restoration process

09:08 AM EDT - IP4G Block Storage systems online and healthy

09:51 AM EDT - IP4G Network systems online and healthy

10:48 AM EDT - IP4G Compute systems online and healthy

10:57 AM EDT - IP4G notifies customers that compute workloads can be powered on

11:28 AM EDT - IP4G notifies customers that the incident is resolved and transitions the status to monitoring

01:03 PM EDT - Second utility power loss; customer loads transfer to generator power successfully

02:11 PM EDT - Utility power restored and stable for the final recovery

Root Cause

The confirmed root cause, as identified by our facility provider, is summarized below:

“Utility power failed for the entire building, and load was transferred to generators as expected. However, two generators experienced unexpected failures. The breaker for Generator E tripped due to improper breaker trip settings. The breaker for Generator B failed to remain closed and cycled multiple times due to a breaker fault, which prevented Generator B from supporting any load. All UPS units supported by Generators B and E ran on battery until the end of their discharge time and shut down. The load was then automatically transferred to an alternate UPS in another power block; this simultaneous transfer caused the target UPS to become overloaded and trip the output breaker, ultimately causing a loss of all power.”

Prevention

The failures outlined above demand immediate and decisive action from both the facility provider and our internal IP4G engineering team. Below are the committed steps we are taking to address the root causes and reinforce the resilience of the platform:

Provider Facility Actions

  • The Generator B breaker has been replaced and functionally tested with no issues. The failed breaker has been sent to the OEM for root cause analysis
  • Generator E breaker settings have been updated
  • The facility provider has conducted a facility-wide audit of all generator and UPS breaker settings
  • The facility provider will update maintenance policy to require additional operational testing as part of regular maintenance activity

IBM Power for Google Cloud Actions

  • We are updating runbooks to accelerate compute service restoration in the event of a complete power failure
  • We have updated our facility notification channels to ensure facility events are correctly routed to engineering

This is the final version of the report

Posted Oct 16, 2025 - 21:46 UTC

Resolved

Several customers have reported that they have successfully restored their applications. We have not observed any abnormalities during our monitoring and will transition this incident to resolved. We do not expect any further impact.

Incident Report
We sincerely apologize to our Google Cloud customers whose businesses were impacted by this outage. This does not meet the quality and reliability standards we aim to provide. We are investigating the root cause but would like to provide a summary of the incident based on information we currently have.

On Wednesday, 15 October 2025 at approximately 07:36 EDT, a major utility power event caused the Google Cloud regional extension data center in us-east4 to transition to generator power. During the automatic transfer from utility power to generator power, the transfer failed for several power blocks. These block failures caused a complete loss of redundant power for IBM Power for Google Cloud services.

The data center facility provides N+1 power using multiple feeds, panels, generators, and battery banks, all feeding redundant power to IBM Power for Google Cloud infrastructure. Our current understanding, based on feedback from the facility provider, is that multiple cascading failures occurred during the utility power disruption. This resulted in a complete loss of redundant power and caused impact to all compute, storage, and networking services in us-east4 for IBM Power for Google Cloud.

We are working with our facility provider to understand the complete root cause and will take immediate steps to enhance the platform's resilience in the event of similar failures. We will provide a complete root cause in the following days and post the final analysis and preventative steps to this incident.

If you continue to experience impact or are unable to power on virtual machines, please reach out to IBM Power for Google Cloud Support using https://cloud.google.com/support and reference this incident.
Posted Oct 15, 2025 - 16:32 UTC

Update

Some customers have reported issues with RMC connectivity after booting virtual machines.

We have completed the backend process to assist with restoring RMC connectivity. If you continue to experience connectivity issues for RMC, please review RMC troubleshooting steps in the documentation below.

https://docs.converge.cloud/docs/how-to/aix/aix-troubleshoot-rmc/
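
For customers who prefer to script these checks, below is a minimal Python sketch that shells out to the standard IBM RSCT commands (lssrc, rmcdomainstatus, rmcctrl) generally used to inspect and recycle the RMC subsystem on an AIX LPAR. These command paths come from IBM's RSCT documentation rather than from this incident, so please confirm them against the troubleshooting guide linked above before running anything in your environment.

    # Minimal sketch: run common RSCT/RMC diagnostics on an AIX LPAR.
    # Verify the commands against the linked IP4G troubleshooting guide first.
    import subprocess

    def run(cmd):
        print("$ " + " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr)

    # 1. Confirm the RMC subsystem (ctrmc) is active.
    run(["lssrc", "-s", "ctrmc"])

    # 2. Show the RMC management-domain status as seen from the LPAR.
    run(["/usr/sbin/rsct/bin/rmcdomainstatus", "-s", "ctrmc"])

    # If connectivity is still broken, the documented recovery is typically to
    # recycle RMC (stop, re-add and start, re-enable remote connections):
    #   /usr/sbin/rsct/bin/rmcctrl -z
    #   /usr/sbin/rsct/bin/rmcctrl -A
    #   /usr/sbin/rsct/bin/rmcctrl -p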

If you continue to experience impact or are unable to power on virtual machines, please reach out to IBM Power for Google Cloud Support using https://cloud.google.com/support and reference this incident.

We are continuing to monitor the environment for reported or observed issues.
Posted Oct 15, 2025 - 16:17 UTC

Update

Some customers have reported issues with RMC connectivity after booting virtual machines.

We are running a backend process to help correct this. Additionally, customers should review RMC troubleshooting steps in the documentation below.

https://docs.converge.cloud/docs/how-to/aix/aix-troubleshoot-rmc/
Posted Oct 15, 2025 - 15:39 UTC

Monitoring

The incident impacting us-east4 has been resolved for all impacted customers.

Customers may experience errors while powering on virtual machines in the web UI. We recommend using the CLI to power on virtual machines for now.

Customers should be able to power on virtual machines and restore applications. We recommend powering on priority workloads first, then moving through the remaining virtual machines in appropriate groups based on workload priority.
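
As a sketch of that ordering, the snippet below walks priority groups in sequence and calls a placeholder power_on() for each virtual machine. The group names and the power_on() helper are hypothetical stand-ins for whatever IP4G CLI or API call you normally use; they are not part of the platform interface.

    # Hypothetical sketch of the recommended restart ordering: power on priority
    # workloads first, then work through the remaining VMs group by group.
    # power_on() is a placeholder -- substitute your usual IP4G CLI or API call.
    PRIORITY_GROUPS = {
        1: ["db-prod-01", "db-prod-02"],    # highest priority first (example names)
        2: ["app-prod-01", "app-prod-02"],
        3: ["dev-01", "test-01"],
    }

    def power_on(vm_name):
        """Placeholder: invoke your CLI or API to start the named virtual machine."""
        print("powering on " + vm_name + " ...")

    for priority in sorted(PRIORITY_GROUPS):
        for vm in PRIORITY_GROUPS[priority]:
            power_on(vm)
        # Verify each group is healthy before starting the next one.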

If you continue to experience impact or are unable to power on virtual machines, please reach out to IBM Power for Google Cloud Support using https://cloud.google.com/support and reference this incident.

We are continuing to monitor the environment and will provide another update within 30 minutes.
Posted Oct 15, 2025 - 15:28 UTC

Update

Compute services are coming online now. We expect that some customers will be able to view virtual machines, volumes, or networks in the IBM Power for Google Cloud CLI, API, or UI.

Customers that can view workloads in the CLI, API, or UI can begin to power on virtual machines.

We are continuing to complete health checks but expect compute, storage, and networking to be stable.

Utility power has remained stable after the previous restoration. We are working with the facility provider to understand the root cause of the failed generator transfer.

We will post another update within 30 minutes.
Posted Oct 15, 2025 - 14:57 UTC

Update

Compute services are coming online now as we continue the restoration process.

We expect customers will be able to start VMs within 45 minutes based on current progress. We are working to shorten this timeline as much as possible.

Utility power has remained stable after the previous restoration. We are working with the facility provider to understand the root cause of the failed generator transfer.

We expect that customers are still unable to view virtual machines, volumes, or networks in the IBM Power for Google Cloud CLI, API, or UI. We will provide another update within 30 minutes.
Posted Oct 15, 2025 - 14:41 UTC

Update

Compute services are coming online now as we continue to complete compute service health checks.

Utility power has remained stable after the previous restoration. We are working with the facility provider to understand the root cause of the failed generator transfer.

We expect that customers are still unable to view virtual machines, volumes, or networks in the IBM Power for Google Cloud CLI, API, or UI. We will provide another update within 30 minutes.
Posted Oct 15, 2025 - 14:25 UTC

Update

Storage and networking health checks have been completed. We are beginning restoration of compute services.
Posted Oct 15, 2025 - 13:43 UTC

Update

We are progressing through service health checks and restoration as expected with no abnormalities detected so far.

Currently we expect services to be restored within 45 minutes.

We will provide an incident report and root cause after services are restored.
Posted Oct 15, 2025 - 13:04 UTC

Update

Our data center provider has reported that the root cause of the power disturbance has been identified and resolved. Utility power is now available. The root cause of the failed generator transfer for multiple blocks of power is being investigated.

We are able to access the management and control plane for IBM Power for Google Cloud services and are evaluating environment health.

Currently, we expect services to be available within an hour.
Posted Oct 15, 2025 - 12:43 UTC

Identified

On-site resources have confirmed a significant power event.

The data center provider experienced a utility power disturbance. As a result, customer loads were transferred to generator power, but the transfer failed for several power blocks, causing a major power outage for multiple customers.

We are continuing to work with the facility provider to restore power and services but do not currently have an ETA.
Posted Oct 15, 2025 - 12:31 UTC

Update

We are working with on-site resources to evaluate and restore services.

There is no ETA on service restoration at this time. We will provide another update within 30 minutes.
Posted Oct 15, 2025 - 12:20 UTC

Investigating

We are investigating an incident that impacts connectivity in us-east4.
Posted Oct 15, 2025 - 12:00 UTC
This incident affected: US-East4 (Ashburn, Northern Virginia, USA) (Power Compute, Network Fabric, Block Storage).