CELS Virtual Helpdesk
Summary of the service outage from March 4th and 5th

March 29, 2022 by Craig Stacey

First of all, the actual cause of the problem was a bad network link.  It was difficult to find, and we were just about to begin migrating the services out of the 221 data center into 240 when it was finally discovered.
Here’s the link to Networking’s summary of what happened: https://slack-files.com/T025KBW9U-F025WFY2W-0c97c9
Services started throwing alerts and misbehaving around 3:25 PM on Tuesday. A few services were slow or unresponsive, and we eventually narrowed the culprit down to the database server for infrastructure services. This database server runs in a very resilient, highly available environment in the 221 data center (which has a generator-backed UPS, unlike 240). As we started debugging, it became evident that the system went unresponsive whenever network hosts tried to contact it. (It became so unresponsive, in fact, that it wouldn't even respond locally as long as it was accepting network connections.)
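
To give a sense of what "unresponsive" means here, the symptom looked like what the small probe below would report: connection attempts to the database host hanging until they timed out. This is an illustrative sketch only, not our actual monitoring; the port number and timeout are placeholders.

    #!/usr/bin/env python3
    # Illustrative probe: does the database host answer a TCP connection
    # promptly, or does it hang until the timeout? (The port and timeout
    # are placeholders, not the real service configuration.)
    import socket
    import time

    HOST = "coredb.mcs.anl.gov"
    PORT = 5432        # placeholder port
    TIMEOUT_S = 5.0

    def probe(host: str, port: int) -> None:
        start = time.monotonic()
        try:
            # A healthy host answers in milliseconds; the sick one hung
            # until the timeout fired.
            with socket.create_connection((host, port), timeout=TIMEOUT_S):
                print(f"{host}:{port} connected in "
                      f"{time.monotonic() - start:.3f}s")
        except OSError as exc:
            print(f"{host}:{port} failed after "
                  f"{time.monotonic() - start:.3f}s: {exc}")

    if __name__ == "__main__":
        probe(HOST, PORT)
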
The redundancy and high availability weren't helping us, since the entire environment in that data center seemed to be affected. We couldn't migrate the service to one of the other hypervisors, because they were all behaving the same way.
We called in the networking team after a couple of hours, once it really started to smell like a network problem, and we all worked on it together. Unfortunately, as outlined in the PDF linked above, the actual problem was not presenting itself in any useful way: by all indications the network was up and was not signaling any errors.
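
For the curious, "not signaling errors" means the usual places you'd look were clean: interface error and drop counters on the hosts and the network gear showed nothing unusual. As a rough illustration (and only that), here's the sort of host-side check involved, reading the standard Linux /proc/net/dev counters; nothing here is specific to our environment.

    #!/usr/bin/env python3
    # Illustrative check: print per-interface error and drop counters from
    # Linux's /proc/net/dev. During the outage, counters like these (and
    # their equivalents on the network hardware) all looked clean.

    def read_counters(path: str = "/proc/net/dev") -> None:
        with open(path) as f:
            lines = f.readlines()[2:]      # skip the two header lines
        for line in lines:
            name, data = line.split(":", 1)
            fields = data.split()
            # Field layout: rx bytes packets errs drop fifo frame ... then
            # tx bytes packets errs drop fifo colls carrier compressed
            rx_errs, rx_drop = fields[2], fields[3]
            tx_errs, tx_drop = fields[10], fields[11]
            print(f"{name.strip():>10}  rx_errs={rx_errs} rx_drop={rx_drop}"
                  f"  tx_errs={tx_errs} tx_drop={tx_drop}")

    if __name__ == "__main__":
        read_counters()
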
After debugging into the evening and rebooting various servers and hypervisors, we called it a night, planning to attack it again early in the morning, including moving the databases from 221 to 240. Just as we were wrapping up the data dump, Networking discovered the bad optics, and things sprang back to life. Many services came back right away; for others we had to clean up the fallout from our troubleshooting steps and reboots during the bad networking. Most things came back okay, though we have one redundant RADIUS server that needs rebuilding.
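
For reference, "the data dump" was the usual dump-and-copy preparation for relocating a database between data centers. The sketch below shows that step in spirit only; it assumes a PostgreSQL-style engine and uses a made-up database name and destination host, since those details aren't part of this summary.

    #!/usr/bin/env python3
    # Illustrative dump-and-copy prep for moving a database out of 221.
    # Assumes a PostgreSQL-style engine; the database name and destination
    # host below are placeholders invented for the example.
    import subprocess
    from datetime import date

    DB_HOST = "coredb.mcs.anl.gov"
    DB_NAME = "exampledb"                 # placeholder database name
    DEST = "db-240.example:/srv/dumps/"   # hypothetical destination host
    DUMP = f"/tmp/{DB_NAME}-{date.today().isoformat()}.dump"

    def dump_and_copy() -> None:
        # Dump in pg_dump's custom format so it can be restored with
        # pg_restore on the new host.
        subprocess.run(["pg_dump", "-h", DB_HOST, "-Fc", "-f", DUMP, DB_NAME],
                       check=True)
        # Ship the dump to the standby location in the 240 data center.
        subprocess.run(["scp", DUMP, DEST], check=True)

    if __name__ == "__main__":
        dump_and_copy()
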
Affected services included most websites we host, many login machines (which mount NFS from those web hosts), trouble tickets (rt.mcs.anl.gov), account management, rdp.mcs.anl.gov (the Windows Terminal Server), and any other service that relies on coredb.mcs.anl.gov.
Thank you all for your patience during the outage, and sorry it took so long to diagnose the actual problem. If you have any questions, by all means let me know.
