Timeframe
11am January 26th 2016 - 5am January 27th 2016 (UTC)
Impact
- Intermittent failure accessing Git repositories from outside AWS networks
- Total failure accessing Git / Subversion / Maven (during the recovery process)
Root Cause
There was an intermittent issue with external networks accessing Git services over SSH and via the Git daemon.
We do not have a complete picture of what change caused this (we had made no changes in this area), but we posit that a change on Amazon's side was incompatible with our networking layer (multiple layers of network address translation).
(At a basic level, around 90% of connections from external systems to our Git-over-SSH ports were hanging, while connections from inside AWS worked normally. The roughly 90% failure rate held for any given client; occasionally a connection would get through.)
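A minimal sketch of the kind of probe that characterises this symptom, repeatedly opening a TCP connection and waiting for the SSH banner to distinguish hangs from healthy connections. The host name, port, and attempt count here are illustrative assumptions, not the actual affected endpoint:

    import socket

    HOST = "git.example.com"  # hypothetical; stands in for the affected Git-over-SSH endpoint
    PORT = 22                 # assumed SSH port
    ATTEMPTS = 50
    TIMEOUT = 5.0             # seconds; a healthy SSH banner arrives well within this

    hung = 0
    for _ in range(ATTEMPTS):
        try:
            # Connect and wait for the SSH version banner; a hang at
            # either step matches the observed symptom.
            with socket.create_connection((HOST, PORT), timeout=TIMEOUT) as s:
                s.settimeout(TIMEOUT)
                banner = s.recv(64)
                if not banner.startswith(b"SSH-"):
                    hung += 1
        except (socket.timeout, OSError):
            hung += 1

    print(f"{hung}/{ATTEMPTS} connections hung or failed "
          f"({100.0 * hung / ATTEMPTS:.0f}%)")

Run from both an external network and an AWS-internal host, a probe like this makes the asymmetry visible: the internal run succeeds while the external run shows the high hang rate described above.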
Data Loss / Security Implications
There were no data-loss or security implications from this outage. The services were rebooted in a controlled fashion.
Complications
Because the affected services had not been restarted for a long period (over a year), a few minor configuration issues needed to be worked through during the restart process. These have now been codified into the system configuration so that future restarts are brisk.
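A sketch of the kind of preflight check that codifying such settings enables, so a restart fails fast on a missing value instead of stalling mid-recovery. The file path and configuration keys are hypothetical, not the actual CloudBees configuration:

    import sys
    import configparser

    # Hypothetical required settings discovered during the recovery,
    # now checked before any restart is attempted.
    REQUIRED = {
        "network": ["external_port", "nat_gateway"],
        "service": ["data_dir", "max_connections"],
    }

    config = configparser.ConfigParser()
    config.read("/etc/scm/service.ini")  # hypothetical path

    missing = [f"{section}.{key}"
               for section, keys in REQUIRED.items()
               for key in keys
               if not config.has_option(section, key)]

    if missing:
        sys.exit(f"refusing to restart; missing settings: {', '.join(missing)}")
    print("preflight configuration check passed")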
Followup
- We are reviewing the changes required to restore stability to the system to see if we can better explain the failure.
- We are planning more frequent fire-drills in this area of the platform.
- We will implement further changes identified in the internal CloudBees Post Outage Review.