Please see http://status.cloudbees.com for status indicators and high-level system status information.
For support, please visit support.cloudbees.com or email support@cloudbees.com.

Wednesday 9 December 2015

Outage Report - Authentication Service

Timeframe

2015-12-09 14:12 - 15:00 UTC.

Impact

  • Customers weren't able to log in.
  • Customers weren't able to use their DEV@cloud masters.
  • Builds could not be started.

Root Cause

An erroneous config file was deployed to production. The authentication service went down because the application failed to validate the new config file. The config was syntactically valid, but it referenced a non-existent DNS entry, which caused the validation failure.
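
A pre-deployment check along the following lines could catch this class of error before a config reaches production. This is a minimal sketch, not the validation the authentication service actually performs: the config shape (a flat mapping of setting names to hostnames), the function names, and the injectable `resolve` parameter are all assumptions made for illustration.

```python
import socket
from typing import Callable, Dict, List


def default_resolve(hostname: str) -> bool:
    """Return True if the system resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def find_unresolvable_hosts(
    config: Dict[str, str],
    resolve: Callable[[str], bool] = default_resolve,
) -> List[str]:
    """Return the config keys whose hostname values do not resolve.

    `config` is a hypothetical mapping of setting name -> hostname;
    the real config format is not described in this report.
    """
    return sorted(key for key, host in config.items() if not resolve(host))
```

A deployment script could call `find_unresolvable_hosts` on the candidate config and refuse to deploy if the returned list is non-empty. Passing a custom `resolve` function makes the check testable without network access.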

Data Loss / Security Implications

There are no known data loss or security implications for DEV@cloud customers.

Followup

Stop using dynamically generated DNS entries from EC2 instance tags.

Monday 7 December 2015

Global Restart for Jenkins 1.609.4.6

Timeframe

Vulnerability public @ November 9th 2015
Vulnerability fixed @ November 6th 2015

Impact

  • Undisclosed at this time

You may restart your Jenkins to be upgraded to 1.609.4.6 immediately. We will automatically restart your Jenkins in the next 48 hours if you have not done so already.

Root Cause

Undisclosed at this time.

One high-severity vulnerability relevant to DEV@cloud is being patched in 1.609.4.6.

Further information will not be provided at the current time.

Data Loss / Security Implications

During the global restart, jobs that are queued (but not building) are lost. We are internally tracking a fix for this issue; however, it will not be in place for this release.

There are no other known data loss or security implications for DEV@cloud customers.


Followup

Full information on the security vulnerability will be made available when the Jenkins team publicly announces the list of vulnerabilities included in the security release.

Wednesday 25 November 2015

Jenkins Master - Upgrade to Java 8

Overview

CloudBees has changed the configuration of all Jenkins masters to use Java 8 by default.

This modernizes our Java stack and provides a more easily supported environment for our Jenkins engineering team.

Version

  /opt/java8/bin/java -version
  java version "1.8.0_60"
  Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
  Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

Activating Java 8


To activate Java 8, restart your Jenkins.

Regressions / Limitations


At this stage there are no known regressions when running on Java 8. If you experience any issues, please log a support ticket.


Reverting to Java 7

To revert to Java 7, please log a support ticket requesting the downgrade.

Wednesday 11 November 2015

Outage for Jenkins Security Release 1.609.4.3

Timeframe

Vulnerability public @ November 6th 2015 15:00 (UTC)
Vulnerability closed @ November 6th 2015 22:00 (UTC)
Vulnerability fixed @ November 9th 2015 04:00 (UTC)

Impact

  • CLI / OPE connectivity disabled

Root Cause

See https://www.cloudbees.com/jenkins-security-advisory-2015-11-06

The CloudBees response to the vulnerability announcement (see advisory) was to close the OPE/CLI TCP ports and then remove CLI functionality shortly thereafter. This occurred 7 hours after the vulnerability was made public.

On November 8th, a patch was released to close the vulnerability in the Jenkins server. We progressively rolled this patch out and re-activated the OPE/CLI functionality on all Jenkins services.

Data Loss / Security Implications

Indications are:

  1. there was no increase in traffic to the Jenkins servers we checked for breaches
  2. access to the CLI ports was closed 7 hours after the initial announcement
  3. the exploit as written doesn't work due to the network configuration of DEV@cloud
  4. the exploit is based on a commons-collections vulnerability announced early in 2015 - so there may have been unannounced vulnerabilities floating around the internet

Customers need to perform a risk assessment to determine whether they need to reissue credentials in their environment.

Followup

Our status notes are ephemeral - the overall outage notice was written and posted once the release had been completed.

Full information on the security vulnerability is available at:

https://www.cloudbees.com/jenkins-security-advisory-2015-11-06

Tuesday 3 November 2015

DEV@cloud global restart - Java 7 update

We will be performing a Java upgrade and global restart of all Jenkins instances in DEV@cloud.

Purpose:

  • patch Java 7 to latest update
  • deploy Java 8 so it can be used by beta customers (in preparation for global rollout)
  • allow individual customers to be switched to Java 8

Window

  • 4th November 7am UTC - 9am UTC

Impact

The outage will be momentary for customers as their Jenkins restarts.

Because of how this patch is applied to the environment, we cannot postpone the restart for individual customers.

Our monitoring systems will tell us if your Jenkins has not come back up cleanly. However, if you do experience issues, please raise a support request via the normal means.

Post Outage Review

There were a small number of Jenkins servers in our production environment running an older base operating system.  These older instances did not upgrade to our satisfaction - and so we made the decision to terminate these instances and reprovision customer Jenkins on newer and faster hardware.

While this was not ideal timing, the work was completed largely within the outage window - but not as quickly as we would have liked.

Improvements

We are reviewing the way we communicate outages to customers - in this case we did not have sufficient time (for operational scheduling reasons) to announce this particular upgrade in advance.

We are also reviewing the Jenkins behaviour of displaying a stack-trace to the user rather than something more useful.

There are also changes being made to our hosted Jenkins platform to improve the resilience and stability.

Tuesday 20 October 2015

DEV@cloud CA Certificate Issue - 21 October 2015

Timeframe (UTC)

October 20 2015 4am - October 21 2015 2am

Impact

  • Access from Jenkins masters to HTTPS services using command-line tools failed because the Root CA certificate chain was missing

Root Cause

A component on the Jenkins master instances was upgraded. However, due to a failure in the package system, the Root CA certificate list (which lives on disk in a ca-certificates.crt file) was no longer available.

As this file was missing, anything that relied on its existence was no longer able to access HTTPS-protected services - this was typically limited to command-line tools such as curl and git.

Resolution


The Root CA certificate list was reinstalled.

Data Loss / Security Implications

There are no data-loss or security implications.

Followup

  • We are improving the robustness of our testing and change control processes to help limit and subsequently eliminate failures of this nature in our upgrade process.
  • We are amending our status monitoring to detect this fault (our monitoring jobs all connect to Git over SSH - and hence did not fail under this scenario).
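
A file-level health check for the certificate list could look like the sketch below. It is an illustration only, not the monitoring CloudBees deployed: the function name is hypothetical, and the full path to the bundle must be supplied by the caller because this report names only the file, ca-certificates.crt, and not its directory.

```python
import os

PEM_MARKER = "-----BEGIN CERTIFICATE-----"


def ca_bundle_problem(path):
    """Return a description of what is wrong with the CA bundle, or None.

    `path` is the location of the ca-certificates.crt file on the host;
    the caller supplies it because it varies between systems.
    """
    if not os.path.isfile(path):
        return f"{path} is missing"
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        content = handle.read()
    if PEM_MARKER not in content:
        return f"{path} contains no PEM certificates"
    return None
```

A check like this only confirms that the file exists and holds at least one PEM block. It complements, rather than replaces, a monitoring job that makes a real HTTPS request with a command-line tool.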

Saturday 17 October 2015

DEV@cloud Build Interruption - 17 October 2015

Timeframe

5am-7am 17th October 2015 (UTC)

Impact

  • New builds unable to launch during outage

Root Cause

Dynamic DNS entries for the CloudBees build system were not updated correctly, taking the build service offline.

Data Loss / Security Implications

There were no data loss or security impacts from this outage.

Followup

  • The impacted DNS update tools are being reviewed to determine the root cause of the faulty updates.
  • The problem is intermittent and ongoing. In the meantime, the system is being run manually (with verification) to maintain availability.
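
The verification step mentioned above could be automated roughly as follows. This is a sketch under stated assumptions, not the tooling CloudBees uses: the function names are hypothetical, only IPv4 records are considered, and the lookup goes through the local system resolver, so cached answers may lag behind the authoritative DNS server.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the system resolver reports."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        return set()
    # For AF_INET, each entry's sockaddr is an (address, port) tuple.
    return {info[4][0] for info in infos}


def dns_update_applied(hostname, expected_addresses, resolve=resolve_ipv4):
    """True only if the name resolves to exactly the expected addresses."""
    return resolve(hostname) == set(expected_addresses)
```

Running `dns_update_applied` after each update, and alerting when it returns False, would flag both missing records and stale addresses. The `resolve` parameter exists so the comparison can be tested without real DNS lookups.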

Tuesday 25 August 2015

DEV@cloud Build Outage - 2015-Aug-25

Timeframe

25th August 2015 - 12:00am UTC to 3:00am UTC.

Scope of outage - DEV@cloud

  • No new builds could be launched on DEV@cloud provided executors.
  • Existing builds were able to complete.
  • On-Premise Executors (OPE) and builds were not impacted.

Root Cause

A network configuration issue prevented communication between build web services.

Data Loss

No builds in flight were lost, and no data was lost during this outage.

Complications

The recovery from the outage took longer than expected because we needed to increase the size of our build farm to catch up with the builds that had stacked up during the outage. Dynamically sizing the farm is not particularly fast, as we usually don't need to ramp up capacity that quickly.
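
The effect of farm size on recovery time can be illustrated with some simple arithmetic. The model below is a deliberately rough sketch: it assumes a fixed number of executors, a constant arrival rate of new builds, and a constant throughput per executor. The function name and all figures used with it are hypothetical and are not measurements from this outage.

```python
import math


def minutes_to_drain(backlog, arrivals_per_min, executors, builds_per_executor_per_min):
    """Estimate minutes needed to clear a build backlog at fixed capacity.

    Returns math.inf when capacity does not exceed the arrival rate,
    because the backlog would then never shrink.
    """
    capacity = executors * builds_per_executor_per_min
    if capacity <= arrivals_per_min:
        return math.inf
    return backlog / (capacity - arrivals_per_min)
```

With a hypothetical backlog of 600 builds, 10 new builds arriving per minute, and each executor finishing one build per minute, 20 executors would need 60 minutes to catch up, whereas 40 executors would need 20. The backlog only shrinks as fast as spare capacity allows, which is why the rate at which the farm can be scaled up matters.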

Proactive Steps

  • catalogue and monitor the impacted internal service directly
  • increase capacity scaling rate
  • implement further changes identified in internal Post Outage Review



Tuesday 21 July 2015

Maven Repository Server (repo.cloudbees.com) Outage

The main Maven repository server which is used by DEV@cloud builds went offline earlier today. This may have resulted in hung or failed builds.

We restarted the server, which recovered after the reboot.

Monday 15 June 2015

Rollback of 1.609.1 Upgrade

As we communicated on Friday, we had planned to roll out a release of Jenkins LTS 1.609.1 to DEV@cloud today. In the course of deploying this, we found a severe regression in the core Jenkins LTS: a deadlock which occurs under high load. Since it will take us some time to verify the fix, we have rolled back the release.

Most instances were not upgraded to 1.609.1. We have automatically downgraded any upgraded instances back to 1.580.3.4, which does not exhibit this issue.

We apologize for any disruption to your business and will be in touch with further details on the release when ready.

Sincerely,

The DEV@cloud Team

Thursday 14 May 2015

Nexus Upgrade

CloudBees' Nexus instance http://repo.cloudbees.com has been upgraded from 1.9.1 to 1.9.2.4.

This should help with intermittent Error 500 issues that were being experienced by a small number of customers.

In addition to the version bump, we have also internally moved from custom Tomcat hosting to the Sonatype-preferred Nexus hosting model using Jetty.

While this should not be visible externally (or cause issues in your builds), any unexpected behaviour should be reported to our Support Team.

Failed builds should be retried.

Tuesday 5 May 2015

Slave Provisioning Outage

We have just recovered from an outage in the DEV@cloud slave provisioning service. This outage was due to a problem with a backend authentication API. This outage also affected the CloudBees Console.

Wednesday 8 April 2015

Maven Repository Server (repo.cloudbees.com) Outage

The main Maven repository server which is used by DEV@cloud builds went offline earlier today. This may have resulted in hung or failed builds.

We have restarted the server and are adding additional monitoring to detect this problem in the future.

Friday 6 March 2015

AWS Maintenance Event Affecting DEV@cloud

This issue is resolved.
---
CloudBees DEV@cloud Jenkins may be intermittently affected by an ongoing maintenance event within Amazon Web Services, which provides the hosting for DEV@cloud. While Amazon has tried to limit the impact of this event, we are seeing issues with internal network routing between some instances.

We are working to move Jenkins instances off these affected EC2 hosts. For the minority of Jenkins instances affected, this will result in a few minutes of downtime during the migration. We apologize in advance for any inconvenience. We are working with Amazon to better understand how to avoid this in the future.

Wednesday 4 February 2015

Sonar service outage

In order to provide a more reliable database backend for our Sonar services, we migrated the Sonar database to a new database server. However, the data migration failed for some Sonar instances.

Sonar instances impacted by this issue were unable to connect to the database and were therefore unavailable during the outage.

To maintain the integrity of affected Sonar instances, we have successfully re-migrated the impacted databases.

Upon restart of Sonar, some instances were in an inconsistent state because plugins had been updated via the UI to versions incompatible with the version of Sonar currently deployed at CloudBees. Our engineers have finished correctly upgrading the impacted Sonar instances.


The Sonar service is now back online.