Summary of the service outage from March 4th and 5th
First of all, the actual cause of the problem was a bad network link. It was difficult to find; we were just about to begin migrating services out of the 221 data center into 240 when it was finally discovered.
Here’s the link to Networking’s summary of what happened: https://slack-files.com/T025KBW9U-F025WFY2W-0c97c9
Things started throwing alerts and misbehaving around 3:25 PM on Tuesday. A few services were slow or unresponsive, and we eventually narrowed the culprit down to the database server for infrastructure services. This db server is set up in a very resilient, highly available environment in the 221 data center (which has a generator-backed UPS, unlike 240). As we debugged, it became evident that the system went unresponsive whenever network hosts tried to contact it. (It became so unresponsive, in fact, that it wouldn't even respond locally as long as it was accepting network connections.)
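For anyone curious how we told "slow" apart from "silently dead" during triage: a minimal sketch of the kind of reachability probe one could script (this is illustrative, not the exact commands we ran; the host, port, and timeout values are assumptions):

```python
import socket

def probe(host, port, timeout=3.0):
    """TCP connect with a short timeout; classify the result.

    'open'        -> service answered
    'refused'     -> host is up but the service is down (RST came back)
    'no response' -> traffic silently vanished, like our db server did
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except socket.timeout:
        return "no response"
    except OSError:
        return "refused"
    finally:
        s.close()
```

A host that returns "refused" is at least talking to you; the telling symptom here was connections that simply timed out with nothing coming back.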
The redundancy and high availability weren't helping us, since the entire environment in that data center seemed to be affected. We couldn't migrate the service to one of the other hypervisors, since they were all behaving the same way.
We called in the networking team after a couple of hours, when it started to really smell like a network problem, and we all worked on it together. Unfortunately, as outlined in the PDF above, the actual problem was not presenting itself in any useful manner: by all indications the network was working, and it was not signaling any errors.
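Part of what made this so hard is that the usual first stop, per-interface error and drop counters, looked clean even with the bad optics in place. A hedged sketch of that kind of check on a Linux box (reading the sysfs counters; this is a generic illustration, not a tool we used during the outage):

```python
from pathlib import Path

def link_error_counters(iface):
    """Return the error/drop counters for one network interface.

    Reads /sys/class/net/<iface>/statistics (Linux only). Normally a
    failing link drives rx_errors or rx_dropped upward; in our case
    these counters gave no useful signal.
    """
    stats_dir = Path("/sys/class/net") / iface / "statistics"
    return {
        f.name: int(f.read_text())
        for f in stats_dir.iterdir()
        if "err" in f.name or "drop" in f.name
    }
```

When counters like these stay at zero while traffic is clearly being lost, the fault tends to be further down (optics, cabling, or a device on the path), which is roughly where this one turned out to be.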
After debugging into the night, rebooting various servers and hypervisors, we called it a night with a plan to start attacking it early in the morning, including moving the databases from 221 to 240. Just as we were wrapping up the data dump, Networking discovered the bad optics, and things sprang to life. Many services came back right away; for others we had to clean up the repercussions of the troubleshooting steps and reboots we had performed amidst the bad networking. Most things came back okay; we have one redundant RADIUS server that still needs rebuilding.
Affected hosts included most websites we host, many login machines (which mount NFS from said websites), trouble tickets (rt.mcs.anl.gov), account management, rdp.mcs.anl.gov (the Windows Terminal Server), and any other service that relies on coredb.mcs.anl.gov.
Thank you all for your patience during the outage; I'm sorry it took so long to diagnose the actual problem. If you have any questions, by all means let me know.