Summary of repo.anl-external.org downtime
Executive summary
A VM hypervisor showed signs of going bad on Friday night, and mitigating steps were taken. The server failed again on Sunday night, too late for anyone to work on it until Monday. The server was restored at 11:30 AM on Monday. The hypervisor is now stable; however, critical services hosted on it will be moved to more resilient hardware to reduce the likelihood of future downtime.
What happened
An MCS Virtual Server hypervisor (hereafter referred to as vserver8) had its system disk go into a bad state, taking down vserver8 and all virtual machines hosted on it. The affected VMs were:
- login1.mcs.anl.gov
- login2.mcs.anl.gov
- buildbot.mcs.anl.gov
- pwca.alcf.anl.gov
- horde.alcf.anl.gov
- repo.anl-external.org
The short-term fix
We noticed instability with the server on Friday night, when the hypervisor went offline and rebooted a couple of times. A reboot seemed to clear the issue with the disk, though there did appear to be corruption in some previously retired VMs. As a first step, we made sure we had a current backup of the data stored on repo. We then moved the IP addresses of login1 and login2 over to login3 and shut those VMs down. buildbot, pwca, and horde remained down to minimize the load on the hypervisor, in the hope that it would stay up through the weekend. Once we were confident the data in repo.anl-external.org was duplicated, we brought the service back up and kept an eye on it, with the goal of migrating it to a new hypervisor on Monday.
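As a rough illustration of the IP move, here is a minimal sketch that adds a service address as an alias on a standby host. The address, prefix length, and interface name are hypothetical placeholders, and the use of the ip command is an assumption rather than the exact procedure we followed on login3.

```python
#!/usr/bin/env python3
"""Illustrative sketch of taking over a service IP by adding it as an alias
on another host's network interface. The address, prefix length, and
interface name below are hypothetical placeholders, not the real ones."""
import subprocess

SERVICE_IP = "192.0.2.10/24"   # hypothetical address normally held by login1
INTERFACE = "eth0"             # hypothetical interface name on login3

# Add the address to login3's interface so traffic for login1 lands here.
# The original holder of the address must be offline (or have released it)
# to avoid an address conflict.
subprocess.run(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE], check=True)
```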
The second failure
On Sunday night, the disk on vserver8 failed in a different manner than before. Unfortunately, nobody was available to handle the situation, so the repair had to wait until morning. First thing in the morning, we attempted to bring the VM back. Due to the configuration of that machine, we were unable to recover from a bad system disk using our usual methods. Ultimately, we had to burn a bootable Linux LiveCD, boot the machine from it, and initiate the data transfer to the new disk.
The initial progress on that transfer suggested it would take about an hour to copy the data to the new disk, at which point the machine would be resurrected as it was before the crash. Had it looked like it would take much longer, our fallback would have been to migrate the data from the repocafe backup made on Friday to a new disk pool and set up a new VM there. The direct copy looked to be the fastest path to recovery, and unlike the Friday backup it would include any changes made over the weekend, so we continued on that route.
The copy slowed down somewhat and ultimately finished just before 11:30 AM, about 1.5 hours later than the initial estimate. Once the copy finished, the server was able to reboot and operate as normal with no loss of data. It currently appears to be healthy.
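For context on what a disk-to-disk transfer of this sort can look like, the sketch below clones a failing disk onto a replacement from a LiveCD environment. The device names, the map file path, and the choice of GNU ddrescue are illustrative assumptions; the report above does not specify the exact tooling used.

```python
#!/usr/bin/env python3
"""Illustrative sketch of cloning a failing disk onto a replacement disk
from a LiveCD environment. Device names, the map file path, and the use of
GNU ddrescue are assumptions for illustration only."""
import subprocess

FAILING_DISK = "/dev/sda"      # hypothetical: the bad system disk
NEW_DISK = "/dev/sdb"          # hypothetical: the replacement disk
MAP_FILE = "/tmp/rescue.map"   # records progress so later passes can resume

# First pass: copy everything readable, skipping problem areas quickly.
subprocess.run(["ddrescue", "-n", FAILING_DISK, NEW_DISK, MAP_FILE], check=True)

# Second pass: retry the areas that could not be read the first time.
subprocess.run(["ddrescue", "-r3", FAILING_DISK, NEW_DISK, MAP_FILE], check=True)
```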
Next steps
The IP addresses of login1 and login2 are still directed at login3. Later today, we will send a notice to users who may be affected when we move the IP addresses back to their original hosts. If you logged into login1 or login2 since Friday evening and are still logged in, you’ll be among the affected users.
repo.anl-external.org is currently up and stable, and we will begin the process of moving it to more resilient VM infrastructure. When we first deployed it, it was intended to be a “best effort” self-service SVN repo to ease collaborations with external users. Because we never made that explicitly clear, and because it is *so* much easier to self-serve these sorts of things, many users gravitated toward using it as their primary SVN repo over the more “production-level” svn.mcs.anl.gov.
Bearing this in mind, we’re reclassifying repo.anl-external.org as a critical service. We’re going to move the VM to hardware that’s better designed to weather these sorts of failures and that lets the VM move trivially from hypervisor to hypervisor as needed, as we currently do with other critical servers. There will be an announced outage on this service as we migrate the last of the repository data to the new server. We’ll do the bulk of the work in the background, so that the outage is only needed to copy last-minute data changes and to move the VM itself. There will be further updates on this as we progress, and we’ll coordinate to ensure the migration does not happen at a critical time.
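To make that plan a bit more concrete, here is a minimal sketch of the background-copy-then-final-sync pattern: a bulk copy that can be repeated while the service stays up, followed by one short pass during the announced outage that transfers only the last-minute changes. The host names, paths, and use of rsync are placeholder assumptions, not the actual migration tooling.

```python
#!/usr/bin/env python3
"""Sketch of a two-phase migration: repeated bulk copies in the background,
then one final sync during the announced outage. Host names, paths, and the
use of rsync are placeholder assumptions."""
import subprocess

SRC = "repo.anl-external.org:/srv/svn/"   # hypothetical current data location
DST = "/srv/svn/"                         # hypothetical path on the new VM

def sync() -> None:
    # -a preserves permissions and timestamps; --delete keeps the copy exact.
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)

# Phase 1: run (and re-run) the bulk copy while the service stays up;
# each pass only transfers what has changed since the previous one.
sync()

# Phase 2: during the outage, stop writes to the repository and run one
# final pass, which picks up only the last-minute changes.
sync()
```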