Please see http://status.cloudbees.com for status indicators and high level system status information.
For support, please visit support.cloudbees.com or email support@cloudbees.com

Thursday, 28 January 2016

Git / Subversion / Maven Service Outage - January 26th 2016

Timeframe
11am January 26th 2016 - 5am January 27th 2016 (UTC)

Impact

  • Intermittent failure accessing Git repositories from outside AWS networks
  • Total failure accessing Git / Subversion / Maven (during the recovery process)

Root Cause

There was an intermittent issue with external networks accessing Git services over SSH and via the Git daemon.

We do not have a complete picture of what change caused this (as we had made no changes in this area) - but we posit that there was a change in Amazon that was incompatible with our networking layer (multiple layers of network-address-translation).


(At a basic level, around 90% of connections from external systems to our Git over SSH ports were hanging, while connections from inside AWS were working as normal. The failure rate was around 90% for any given client, so connections occasionally got through.)

During the investigations we initiated a full-reset of our systems to clear any underlying hardware and network faults - this took our services off-line for a period of time.
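A failure pattern like the one described above (most connections hanging, a few getting through) can be quantified by probing the port repeatedly from outside the network. The following is only a minimal sketch of that idea, not the tooling we used: the function names, the 5 second timeout and the hostname in the usage note are illustrative assumptions.

```python
import socket
from typing import Callable


def tcp_connects(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection opens within `timeout` seconds.

    A hung connection surfaces as a timeout, which is an OSError subclass.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failure_rate(attempt: Callable[[], bool], tries: int) -> float:
    """Run `attempt` `tries` times and return the fraction that failed."""
    if tries <= 0:
        raise ValueError("tries must be positive")
    failures = sum(1 for _ in range(tries) if not attempt())
    return failures / tries
```

A probe against a hypothetical host would look like `failure_rate(lambda: tcp_connects("git.example.invalid", 22), 20)`. Running it both inside and outside AWS would show the asymmetry described above.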

Data Loss / Security Implications

There were no data-loss or security implications from this outage.  The services were rebooted in a controlled fashion.

Complications

Due to a long period since the last restart of affected services (over a year), there were a few minor configuration issues that needed to be worked through during the restart process - these have now been codified into the system configuration so that future restarts are brisk.

Followup

  • We are reviewing the changes required to restore stability to the system to see if we can better explain the failure
  • We are planning more frequent fire-drills in this area of the platform.
  • We will implement further changes identified in the internal CloudBees Post Outage Review.

Wednesday, 9 December 2015

Outage Report - Authentication Service

Timeframe

2015-12-09 14:12 - 15:00 UTC.

Impact

  • Customers weren't able to login.
  • Customers weren't able to use their DEV@cloud masters.
  • Builds weren't started.

Root Cause

An erroneous config file was deployed to production. The authentication service went down, because the application failed to validate the new config file. Syntactically the config was ok, but it had a non-existent DNS entry which caused the validation failure.
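A config that is syntactically valid can still name a host that does not resolve, so a pre-deploy check can resolve every hostname before the file reaches production. The sketch below illustrates that idea under stated assumptions: it is not the authentication service's own validation, and the function name and the injectable `resolve` parameter are hypothetical.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the hostnames that fail DNS resolution, in input order.

    `resolve` is injectable so the check can be exercised without a network.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

A deployment step could refuse to proceed whenever the returned list is non-empty, so the failure shows up before the service restarts with the new file.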

Data Loss / Security Implications

There are no known data loss or security implications for DEV@cloud customers.

Followup
Stop using dynamically generated DNS entries from EC2 instance tags.

Monday, 7 December 2015

Global Restart for Jenkins 1.609.4.6

Timeframe

Vulnerability public @ November 9th 2015
Vulnerability fixed @ November 6th 2015

Impact

  • Undisclosed at this time

You may restart your Jenkins to be upgraded to 1.609.4.6 immediately. We will automatically restart your Jenkins in the next 48 hours if you have not done so already.

Root Cause

Undisclosed at this time.

There is one DEV@cloud relevant high-severity vulnerability being patched in 1.609.4.6

Further information will not be provided at the current time.

Data Loss / Security Implications

During the global restart, jobs that are queued (but not building) are lost.  We are internally tracking a fix for this issue, however it will not be in place for this release.

There are no other known data loss or security implications for DEV@cloud customers.


Followup
Full information on the security vulnerability will be made available when the Jenkins team publicly announces the list of vulnerabilities included in the security release.

Wednesday, 25 November 2015

Jenkins Master - Upgrade to Java 8

Overview

CloudBees has changed the default configuration of all Jenkins masters to use Java 8.

This modernizes our Java stack and provides a more easily supported environment for our Jenkins engineering team.

Version

  /opt/java8/bin/java -version
  java version "1.8.0_60"
  Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
  Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

Activating Java 8


To activate Java 8, restart your Jenkins.

Regressions / Limitations


At this stage there are no known regressions with running on Java 8 - but you should log a support ticket if you experience any issues.


Reverting to Java 7

To revert to Java 7, please log a support ticket requesting the downgrade.

Wednesday, 11 November 2015

Outage for Jenkins Security Release 1.609.4.3

Timeframe

Vulnerability public @ November 6th 2015 15:00 (UTC)
Vulnerability closed @ November 6th 2015 22:00 (UTC)
Vulnerability fixed @ November 9th 04:00 (UTC)

Impact

  • CLI / OPE connectivity disabled

Root Cause

See https://www.cloudbees.com/jenkins-security-advisory-2015-11-06
The CloudBees response to the vulnerability announcement (see advisory) was to close the OPE/CLI TCP ports - and then remove CLI functionality shortly thereafter. This occurred 7 hours after the vulnerability was made public.

On November 8th, a patch was released to close the vulnerability in the Jenkins server, and we progressively rolled this patch out, and re-activated the OPE/CLI functionality on all Jenkins services.

Data Loss / Security Implications

Indications are:

  1. there was no increase in traffic to the Jenkins servers we checked for breaches
  2. access to the CLI ports was closed 7 hours after the initial announcement
  3. the exploit as written doesn't work due to the network configuration of DEV@cloud
  4. the exploit is based off a commons-collections vulnerability announced early in 2015 - so there may have been unannounced vulnerabilities floating around the internet

Customers need to perform a risk assessment to determine whether they need to reissue credentials in their environment.

Followup

Our status notes are ephemeral - the overall outage notice was written and posted once the release had been completed.

Full information on the security vulnerability is available at https://www.cloudbees.com/jenkins-security-advisory-2015-11-06

Tuesday, 3 November 2015

DEV@cloud global restart - Java 7 update

We will be performing a Java upgrade and global restart of all Jenkins instances in DEV@cloud.

Purpose:

  • patch Java 7 to latest update
  • deploy Java 8 so it can be used by beta customers (in preparation for global rollout)
  • allow individual customers to be switched to Java 8

Window

  • 4th November 7am UTC - 9am UTC

Impact

The outage will be momentary for customers as their Jenkins restarts.

Due to how this patch to the environment is applied it is not possible for us to hold off this restart for individual customers.

Our monitoring systems will tell us if your Jenkins has not come back up cleanly, however in the event that you do experience issues, please raise a support request via the normal means.

Post Outage Review

There were a small number of Jenkins servers in our production environment running an older base operating system.  These older instances did not upgrade to our satisfaction - and so we made the decision to terminate these instances and reprovision customer Jenkins on newer and faster hardware.

While this was not ideal timing, the work was completed largely within the outage window - but not as quickly as we would like.

Improvements

We are reviewing the way we communicate outages with customers - in this case we did not have sufficient time (for operational scheduling reasons) to communicate this particular upgrade.

We are also reviewing the Jenkins behaviour of displaying a stack-trace to the user rather than something more useful.

There are also changes being made to our hosted Jenkins platform to improve the resilience and stability.

Tuesday, 20 October 2015

DEV@cloud CA Certificate Issue - 21 October 2015

Timeframe (UTC)

October 20 2015 4am - October 21 2015 2am

Impact

  • Jenkins master access to HTTPS services using command line tools would fail due to missing Root CA certificate chain

Root Cause

A component on the Jenkins master instances was upgraded - however due to a failure in the package system, the Root CA certificate list (that lives on-disk in a ca-certificates.crt file) was no longer available.

As this file was missing, anything that relied on its existence was no longer able to access HTTPS protected services - this was typically limited to command line tools such as curl and git.

Resolution


The Root CA certificate list was reinstalled.

Data Loss / Security Implications

There are no data-loss or security implications.

Followup

  • We are improving the robustness of our testing and change control processes to help limit and subsequently eliminate failures of this nature in our upgrade process.
  • We are amending our status monitoring to detect this fault (our monitoring jobs all connect to Git over SSH - and hence did not fail under this scenario)
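A monitoring check for this fault can be as simple as confirming that the Root CA bundle exists on disk and still contains certificates. The sketch below is an assumption about what such a check might look like, not the monitoring job we deployed. The function name is hypothetical, and the usual Debian/Ubuntu location /etc/ssl/certs/ca-certificates.crt is assumed rather than taken from our configuration.

```python
from pathlib import Path
from typing import Optional

# Marker that opens every PEM-encoded certificate in a CA bundle.
PEM_MARKER = "-----BEGIN CERTIFICATE-----"


def ca_bundle_problem(path: Path) -> Optional[str]:
    """Return a description of what is wrong with the bundle, or None if it looks usable."""
    if not path.is_file():
        return f"{path} is missing"
    text = path.read_text(errors="replace")
    if PEM_MARKER not in text:
        return f"{path} contains no PEM certificates"
    return None
```

For example, `ca_bundle_problem(Path("/etc/ssl/certs/ca-certificates.crt"))` returning anything other than `None` would raise an alert. This only inspects the file, so it complements rather than replaces an end-to-end HTTPS request from the master.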