Legacy compute environment back online
The file server that backs the legacy environment suffered a failed disk this morning. This is usually non-fatal: the disk in question was swapped and the filesystems began their repair, which also usually goes without incident. At approximately 3:30 PM, almost immediately after the RAID system reported clean, an NFS issue appeared, and various servers and services began reporting OFFLINE on our monitoring system.
The tools we would normally use to investigate were not working, and we were not able to gain access to the operating system. At 4 PM, we chose to reboot the server from the management interface in the hope that a reboot would rectify the issue, and it appears to have done so. Unfortunately, this particular server can take some time to boot because of the number of storage pools, its age, and possibly hidden voodoo we haven’t discovered. This was one of those times: the boot completed 90 minutes later.
The file server in question completed its startup cleanly, and we rebooted the 4 login nodes serving login.mcs.anl.gov as a precautionary measure. If you see issues with other systems, please report them to [email protected] and we’ll look into them. We will also investigate what caused the original issue.