Please see http://status.cloudbees.com for status indicators and high level system status information.
For support, please visit support.cloudbees.com or email support@cloudbees.com

Thursday 28 January 2016

Git / Subversion / Maven Service Outage - January 26th 2016

Timeframe
11am January 26th 2016 - 5am January 27th 2016 (UTC)

Impact

  • Intermittent failure accessing Git repositories from outside AWS networks
  • Total failure accessing Git / Subversion / Maven (during the recovery process)

Root Cause

There was an intermittent issue with external networks accessing Git services over SSH and via the Git daemon.

We do not have a complete picture of what change caused this, as we had made no changes in this area. Our working hypothesis is that a change on the Amazon side was incompatible with our networking layer, which uses multiple layers of network address translation.


At a basic level, around 90% of connections from external systems to our Git-over-SSH ports were hanging, while connections from inside AWS were working as normal. The failure rate was around 90% for any given client; occasionally a connection would get through.
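
For readers who want to check for this kind of intermittent hang from their own network, the following is an illustrative sketch only. It is not taken from the incident tooling: the host name `git.example.com`, the port, the timeout, and the function names are all assumptions made for the example. It opens repeated TCP connections to an SSH port, treats a connection as failed if no SSH banner arrives before the timeout, and reports the fraction of failed attempts.

```python
import socket
from typing import Callable


def attempt_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection opens and an SSH banner arrives in time.

    A hung connection shows up as a timeout, which is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(64)
            # SSH servers identify themselves with a line starting "SSH-".
            return banner.startswith(b"SSH-")
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False


def failure_rate(probe: Callable[[], bool], attempts: int = 20) -> float:
    """Run `probe` repeatedly and return the fraction of attempts that failed."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = sum(1 for _ in range(attempts) if not probe())
    return failures / attempts


# Example usage (hypothetical host; replace with the Git host you actually use):
# rate = failure_rate(lambda: attempt_connect("git.example.com", 22))
# print(f"{rate:.0%} of connection attempts failed or hung")
```

Running the same probe from inside and outside AWS would show whether the failures depend on the client's network, which is the pattern described above.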

During the investigation we initiated a full reset of our systems to clear any underlying hardware and network faults. This took our services offline for a period of time.

Data Loss / Security Implications

There were no data-loss or security implications from this outage. The services were rebooted in a controlled fashion.

Complications

Because the affected services had not been restarted for over a year, a few minor configuration issues had to be worked through during the restart process. These have now been codified into the system configuration so that future restarts are quick.

Followup

  • We are reviewing the changes required to restore stability to the system to see if we can better explain the failure.
  • We are planning more frequent fire drills in this area of the platform.
  • We will implement further changes identified in the internal CloudBees Post Outage Review.