repo.anl-external.org downtime post-mortem
What happened on Monday:
In the early hours of Monday, May 11, the system disk for the server "repo.anl-external.org" detected major corruption and marked itself as read-only. This was not a physical disk in a physical machine, but a virtual disk used by the virtual machine running the repocafe service. The corruption was not hardware-related; it appeared to be filesystem-level corruption.
Efforts to fix the corruption on Monday morning were not successful, and it became evident that we would need to build a new VM to host the service. Luckily, the virtual disk containing the actual repository data was unaffected. We took advantage of the downtime to move the VM to a different virtual machine host running a more modern build, which should provide higher reliability in the short term (with a longer-term fix in mind, detailed below).
A new VM was built, the repocafe software installed, and the configuration restored from backups. We learned that although the configuration backups were being performed as expected, the database the service uses to track user accounts was not being backed up properly. (We also learned it was hosted locally on the VM rather than on our DB server.)
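A backup gap like the one described above can be caught earlier by a periodic freshness check. The following is a minimal sketch in Python, not the tooling used for this service. The function names, the "*.sql.gz" file pattern, and the 26-hour threshold are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age(backup_dir: Path, pattern: str = "*.sql.gz",
                      now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the newest non-empty backup file.

    Returns None when no non-empty file matches the pattern.
    The pattern is a hypothetical naming convention for database dumps.
    """
    now = time.time() if now is None else now
    mtimes = [
        p.stat().st_mtime
        for p in Path(backup_dir).glob(pattern)
        # Skip zero-byte files: an empty dump is not a usable backup.
        if p.is_file() and p.stat().st_size > 0
    ]
    if not mtimes:
        return None
    return now - max(mtimes)


def backup_is_stale(backup_dir: Path, max_age_hours: float = 26.0,
                    pattern: str = "*.sql.gz",
                    now: Optional[float] = None) -> bool:
    """Report True when there is no usable backup or the newest one is too old.

    The 26-hour default assumes a nightly dump with some slack.
    """
    age = newest_backup_age(backup_dir, pattern, now)
    return age is None or age > max_age_hours * 3600
```

Run from a scheduler, a check of this kind would flag a missing or stale database dump even while the configuration backups continue to succeed. It does not confirm that a dump can actually be restored, which would need a separate test.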
We were eventually able to restore the database from tape and get it functional again in relatively short order. After internal testing showed it to be functioning as expected, we announced the service was back. A user then reported that commit e-mails were not working, which we quickly rectified by installing a missing Perl module. Diagnostics indicated this was the only missing module, and svn check-ins were working properly once this was fixed.
What’s going to happen longer term:
You may recall we had a different failure involving this service in November, and at that time announced we would move it to a more resilient architecture. We were (and are) still working on that, though design and implementation decisions made when the service was initially stood up made the move problematic. We had rectified the issue that caused the November failure, but this most recent failure indicates that we really need to design this system better.
As such, we’re going to undo those hampering design decisions and fully roll the system into our top-tier infrastructure. We’ll be working with repo owners as we get closer to doing this, but it should be complete by mid-summer. This move may involve a name change out of the anl-external.org namespace; however, the self-service nature of repository creation and management will be maintained. As we finalize the details of the move, we’ll have more information for you on how it will look.
For now, continue to use the service as usual, and report any oddities to [email protected].
Thanks!