CELS Virtual Helpdesk

CELS Shared Services Systems Group

Outage update: The story so far and when things will come back.

March 29, 2022 by Stacey, Craig

At 2:28 PM today, the lab suffered a partial site-wide power outage. I do not yet have the exact cause, though it was no doubt related to the incredible storm we were seeing. I happened to be recording the storm outside my window right as the power went off, so you can hear my exasperation in the recording: https://www.dropbox.com/s/jkchw3pppkqkg6k/2016-07-28%2019.28.43.mp4?dl=0.
This took out all infrastructure housed in 240 that was not backed up by UPS. The UPS-backed systems stayed up for a short time before the UPS batteries were exhausted and could no longer sustain the load. At that point, those systems failed as well.
CELS Systems maintains a presence in the 221 data center to house critical servers and services so that we can ride out outages of this nature. An overzealous team member, in an attempt to cleanly shut down the systems in 240 that were about to lose UPS power, inadvertently also shut down the systems in 221.
So, while a team of us ran around 240 trying to ascertain the situation, another team hoofed it over to 221 to bring affected systems back online.
At this point, we were dark on all fronts – supercomputers, HPC clusters, core computing, base IT. Through some magic of UPSes and fairy dust, the wifi and phones in some parts of the building remained operational through this outage (until the batteries in those UPSes also went down).
At 4:00 PM, we were informed that the rest of the site had power back, but building 240 still was not powering up, and that the building’s electrical contractor was en route to determine why. Traffic and weather conspired to delay their arrival until well after 5PM.
During this time, the tenants of the 240 datacenter discussed our options. We’ve had some recent experience with power outages and what it takes to successfully come back from a planned outage (and, sadly, some experience with the unplanned outages as well). We determined that it would be counterproductive to try to bring the systems back online tonight, as we had no estimate of when power would be restored. Even once power was restored, a dependency chain would kick in: the chillers that cool the room need enough heat load in the room to run.
Because of the amount of time and effort needed to get ALCF, LCRC, Magellan, and Beagle back into operation, and because there was no firm estimate of when power would return or whether the chillers would be able to return to operation, we made the decision to postpone powering up the room until tomorrow morning, beginning at 6AM. Without the supercomputers and HPC clusters running, there is not enough heat being generated in the room to run the chillers, which causes them to shut down, which in turn causes the few systems that are running to overheat.
At 6AM tomorrow, the building operations staff and the various sysadmins will converge on the data center to bring back the equipment housed there.
We’re still working to fix the niggling items that have not come back from building 221. I’ll send updates as I have them.
—
Craig
