Things are mostly back
CIS successfully revived our router, and most MCS hosts and services are back online. We’re still going through the monitoring to separate actual issues from transient errors that will self-correct (some servers may not be properly syncing their time via NTP, for example), and we’ll address the real ones as we find them.
If your local machine is misbehaving at this point, your first course of action should be to reboot. If that does not fix the issue, you can contact the help desk ([email protected]) and we’ll get you taken care of.
I’ll post a more in-depth summary of what happened later; in the meantime, a blow-by-blow overview is available at http://www.mcs.anl.gov/systems/blog (aka http://mcssys.wordpress.com). Thanks for your patience. We’ll be sure to let you know what we learn about how to prevent this sort of outage in the future, or at least how to minimize the downtime when there’s a catastrophic hardware failure like the one we had.