How to protect your data centre from battery and power failures

Data centres across Asia are growing at an unprecedented pace, with the region poised to account for 40% of the world’s capacity.

But power failures remain a huge issue for data centres — especially with the increasing use of renewable energy in the industry. According to Uptime Institute’s 2024 report, power problems make up 52% of major data centre outages. Over half of operators say their most recent severe outage cost them more than US$100,000, while 16% reported losses above US$1 million.

While uninterruptible power supply (UPS) systems and lithium-ion batteries play a critical role in keeping operations running during outages, these backup systems come with risks. These include power supply inconsistencies, thermal runaway fires, and generator failures that can shut down facilities for hours or even days.

To help operators navigate these risks, Marsh Asia’s Regional High-Tech Expert, Fred Chuan, and Communications, Media and Technology Industry Leader, Larry Liu, share practical steps to reduce them.

1. Due to the inconsistency of renewable energy, data centre operators rely on battery storage and UPS to ensure a reliable power supply. How can data centres reduce risks from lithium-ion batteries and UPS systems?

Fred

Consistency and reliability of energy sources start with the facility’s design. Proper storage, monitoring, and maintenance of batteries and UPS systems can mean the difference between seamless operations and unexpected outages. Here are some recommendations:

  • Separate battery rooms: Store lithium-ion batteries in a dedicated fire-rated room with a minimum two-hour rating, sprinklers and easy access for firefighters.
  • Controlled environment with a Battery Management System (BMS): Maintain stable temperature and humidity to prevent overheating, with online monitoring for early detection of issues and prolonged battery life.
  • Gas and off-gas detection: Fit hydrogen gas detectors set at 10% of the lower explosive limit (LEL). If triggered, the detectors must automatically run mechanical ventilation and trigger an electrical shutdown of battery chargers. An off-gas detection system also gives operators an early warning to intervene and prevent thermal runaway by electrically isolating the batteries.
  • Certified equipment: Choose UL-listed batteries and UPS systems for tested safety and reliability.
  • Regular maintenance: Inspect and test systems frequently to catch problems early.
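As a rough illustration of the gas detection recommendation above, the alarm logic can be sketched in a few lines of Python. This is a minimal sketch, not a safety-rated implementation: the function and class names are hypothetical, and it assumes a hydrogen LEL of about 4% by volume in air, so that 10% of the LEL corresponds to roughly 0.4% hydrogen by volume.

```python
from dataclasses import dataclass

# Assumption: hydrogen's lower explosive limit is about 4% by volume in air.
H2_LEL_PERCENT_VOL = 4.0
# Alarm setpoint from the recommendation above: 10% of the LEL.
ALARM_FRACTION_OF_LEL = 0.10


@dataclass
class BatteryRoomActions:
    """Hypothetical container for the two automatic responses."""
    ventilation_on: bool
    chargers_enabled: bool


def alarm_setpoint_percent_vol() -> float:
    """Hydrogen concentration (% by volume) at which the alarm trips."""
    return H2_LEL_PERCENT_VOL * ALARM_FRACTION_OF_LEL


def evaluate_hydrogen_reading(h2_percent_vol: float) -> BatteryRoomActions:
    """Map one detector reading to ventilation and charger states."""
    if h2_percent_vol >= alarm_setpoint_percent_vol():
        # Alarm: run mechanical ventilation and shut down the chargers.
        return BatteryRoomActions(ventilation_on=True, chargers_enabled=False)
    return BatteryRoomActions(ventilation_on=False, chargers_enabled=True)


if __name__ == "__main__":
    for reading in (0.1, 0.5):
        print(reading, evaluate_hydrogen_reading(reading))
```

A real installation would typically latch the alarm until an operator resets it and would be wired through certified detection and control hardware; the sketch only shows the threshold arithmetic and the two responses named in the list.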

Why redundancy alone isn’t enough: How an 11-hour outage ensued 

An electrical outage occurred at a data centre due to issues within its UPS system. The main UPS initially experienced power problems and automatically transferred the load to a redundant UPS unit to maintain continuous power. However, the primary transfer switch failed, triggering the Static Transfer Switch (STS) to shift the load to the redundant UPS. When the primary UPS recovered, the STS attempted to switch the load back, but unstable utility power prevented the primary UPS from delivering full power.

This instability caused the STS to rapidly toggle the load between the two UPS units. To prevent damage from this repeated switching, the STS was ultimately locked out. Consequently, the data centre experienced an 11-hour power outage, resulting in significant downtime and operational disruption.


Primary UPS fault → transfer to redundant UPS

STS priority is to switch back to primary UPS

Primary UPS recovers but cannot support full load

STS rapidly toggles load between primary and redundant UPS units

STS locked out → outage

This incident underlines that redundancy alone is not a safeguard. Proper assessments, regular inspections, and rigorous testing are critical to reveal underlying faults and mitigate risks associated with UPS power outages.
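The toggle-then-lockout sequence described in this incident can be reproduced with a toy simulation. The sketch below is not a reconstruction of the actual equipment's control logic: the `ToySTS` class, its parameters (`retransfer_delay_s`, `max_transfers`, `window_s`) and the flapping pattern are all hypothetical. It assumes a switch that prefers the primary source, counts transfers inside a time window, and locks out when the count exceeds a limit. The retransfer delay is included only to show how a setting of this kind changes the outcome in the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySTS:
    """Toy static transfer switch that prefers the primary source."""
    retransfer_delay_s: int  # primary must be healthy this long before switching back
    max_transfers: int       # transfers tolerated inside the window
    window_s: int            # length of the transfer-counting window, in seconds
    active: str = "primary"
    locked_out: bool = False
    transfers: int = 0
    healthy_for_s: int = 0
    recent: List[int] = field(default_factory=list)

    def step(self, t: int, primary_ok: bool) -> None:
        """Advance one second, given whether the primary can carry the load."""
        if self.locked_out:
            return
        # Count consecutive healthy seconds; any bad second resets the count.
        self.healthy_for_s = self.healthy_for_s + 1 if primary_ok else 0
        if self.active == "primary" and not primary_ok:
            self._transfer(t, "redundant")
        elif (self.active == "redundant" and primary_ok
              and self.healthy_for_s >= self.retransfer_delay_s):
            self._transfer(t, "primary")

    def _transfer(self, t: int, target: str) -> None:
        # Keep only transfers that are still inside the counting window.
        self.recent = [x for x in self.recent if t - x < self.window_s]
        self.recent.append(t)
        self.active = target
        self.transfers += 1
        if len(self.recent) > self.max_transfers:
            self.locked_out = True


def simulate(retransfer_delay_s: int, seconds: int = 60) -> ToySTS:
    """Run the switch against a primary source that flaps every two seconds."""
    sts = ToySTS(retransfer_delay_s=retransfer_delay_s, max_transfers=5, window_s=60)
    for t in range(seconds):
        primary_ok = (t // 2) % 2 == 1  # unstable: 2 s bad, 2 s good, repeating
        sts.step(t, primary_ok)
    return sts


if __name__ == "__main__":
    for delay in (0, 30):
        result = simulate(delay)
        print(delay, result.active, result.transfers, result.locked_out)
```

In this model, an immediate switch-back (delay of 0) follows every brief recovery of the primary source and trips the lockout after six transfers, while a 30-second stability requirement keeps the load on the redundant source with a single transfer. Testing of this kind against degraded, rather than simply failed, input power is one way such behaviour might be revealed before an incident.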

2. How can data centres minimise downtime from cooling system failures?

Fred

Cooling systems are an essential part of data centres. Without them, servers overheat and data centres risk equipment and component failures and massive downtime. The good news is that there are practical steps to keep cooling systems resilient.

  • N+1 redundancy for HVAC systems: Heating, ventilation, and air conditioning (HVAC) components control temperature and airflow in the data centre. Provide at least N+1 redundancy for critical components so that cooling remains adequate, and overheating is prevented, if one unit fails.
  • Backup power: Ensure HVAC systems can run during power outages.
  • Fire-resistant insulation and ducts: Use non-combustible Factory Mutual (FM)-approved materials and protect areas that are under large ducts with sprinklers.
  • Regular checks: Inspect and service mechanical and electrical equipment to prevent failures.
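The N+1 recommendation above comes down to simple arithmetic: N is the number of cooling units needed to carry the full heat load, and one more unit is installed as a spare. The sketch below illustrates this; the function names and the load and capacity figures are hypothetical, and real sizing would also account for factors such as room layout and derating.

```python
import math


def units_required(heat_load_kw: float, unit_capacity_kw: float) -> int:
    """N: the fewest cooling units that can carry the full heat load."""
    return math.ceil(heat_load_kw / unit_capacity_kw)


def meets_n_plus_1(installed_units: int, heat_load_kw: float,
                   unit_capacity_kw: float) -> bool:
    """True if the load is still covered with any one unit out of service."""
    return (installed_units - 1) * unit_capacity_kw >= heat_load_kw


if __name__ == "__main__":
    # Hypothetical figures: 900 kW of heat load, 250 kW per cooling unit.
    n = units_required(900, 250)
    print("N =", n, "so install", n + 1)
    print("N units only:", meets_n_plus_1(n, 900, 250))
    print("N+1 units:", meets_n_plus_1(n + 1, 900, 250))
```

With these figures, four units carry the load, so five are installed; with only four, the loss of a single unit would leave the remaining three short of the 900 kW load.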

How a cooling failure stalled 2.5 million bank transactions

In October 2023, a major data centre cooling system failure caused overheating that disrupted operations for two leading banks in Singapore. Their online banking apps were unavailable for close to 14 hours, which affected 2.5 million payments and ATM transactions and caused 810,000 failed login attempts.

The root cause was traced to a contractor error during a planned upgrade. The contractor incorrectly closed valves in the chilled water system, causing temperatures to rise beyond safe limits. Although both banks activated their disaster recovery and business continuity plans, technical issues at their backup data centres — including network misconfiguration and connectivity problems — prevented full recovery.

This outage exceeded regulatory limits on unscheduled downtime for critical systems, resulting in significant penalties and restrictions on IT changes. 

Learn more about the three major risks that data centres face.

3. How can operators protect against losses from UPS and battery failures? What about external power disruptions? 

Larry

Even with robust precautions forming the first line of defence, failures of transformers, switches, UPS systems or on-site batteries can still occur. This is why insurance plays an essential role as the last line of defence. Here’s what we recommend for operators managing risks associated with lithium-ion battery and UPS systems.

Property Damage and Business Interruption (PDBI) covering: 

  • Physical damage such as fire and explosions caused by batteries or UPS failures.
  • Mechanical or electrical breakdowns.
  • Business interruptions such as loss of insurable gross profit and additional operating expenses incurred during downtime.
  • Contingent Business Interruption (CBI): Losses arising from insured physical damage at third-party locations (for example, a fire at a substation on the loop) that interrupts the data centre’s power supply.

PDBI provides operators with the financial resources to repair or replace damaged equipment and to cover the costs of business interruption.

Why Marsh?

Marsh is the trusted broker for more than 80% of the world’s largest cloud service and data centre providers. With deep industry knowledge and experience across Asia, we help data centre operators design safer facilities and transfer risk to keep their business running even during unexpected scenarios.

Ready to strengthen your data centre’s risk management?

Get in touch to learn how we can help you protect your critical infrastructure.