Background
Starting yesterday at 12:00 GMT, CloudBees has experienced partial downtime of its services.
CloudBees currently hosts its services on AWS, in the East-1 region. Yesterday, AWS started experiencing serious problems with its infrastructure. Most notably, its EBS service (used to store data) was heavily impacted.
While most RUN@cloud applications were able to run properly, DEV@cloud services were impacted by the AWS outage. In addition, our users were no longer able to log into our services.
Actions taken
We quickly took action to restore access to CloudBees services by moving load to less impacted AWS zones. By midnight (GMT), most services were up and running again (login, SSO, GrandCentral, Jenkins, RUN@cloud). However, our forge services (Git, SVN and Maven repositories) were still experiencing difficulties.
Forge Status
We have now been able to resume forge operations. However, the data we have recovered so far predates the initial downtime by 12 hours, i.e. any data stored in Git/SVN/Maven between Thursday 0:00 GMT and Thursday 12:00 GMT is not part of the recovered data. This data is not lost, however: it is being held "hostage" by AWS' recovery procedure while they re-mirror their data. Once that process is complete, our goal is to reconcile the 12 hours of missing data with the current forge data. The procedure will depend on the type of repository:
- Git repositories: these repositories are accessible now. If commits from the 12-hour period are missing, you can simply push from your local repository to restore them.
- Subversion repositories: we have decided to disable write access to these repositories and allow read access only. That way, we will be able to reconcile them properly once AWS makes the data available to us. If you do not want to wait and would rather do the reconciliation yourself now, please open a support ticket.
- Maven repositories: much like the Git repositories, these are accessible in read-write mode and will be reconciled once AWS is fully back online.
Next
We will provide a new report as soon as we have more information available from AWS.
We would like to apologize for this downtime. We will work on processes to improve our availability in the event of a future serious outage at our hosting provider.