Earlier today we received a number of notices from Amazon about failing hardware. As we normally do, we took proactive action to replace the servers running on that hardware.
However, provisioning failed for several reasons. The most prominent was that one of the data centres we operate in could not support the SSD volumes we were upgrading to (a limitation of some Amazon accounts, as we found out). We had to roll back some changes and then reprovision the Jenkins masters. To provide availability, we recover servers to different data centres, restoring the Jenkins data onto fresh volumes and fresh servers. This depends on provisioning working, and in this case the unavailability of SSD volumes prolonged the recovery by a few hours for some servers. We are looking at how we can avoid these "older" data centres in future to prevent this.