Please see http://status.cloudbees.com for status indicators and high level system status information.
For support, please visit support.cloudbees.com or email support@cloudbees.com.

Wednesday, 12 November 2014

Jenkins master outage

Earlier today we received a number of notices from Amazon about failing hardware. As we normally do, we took proactive action to replace the servers running on the affected hardware.

However, provisioning failed for several reasons, the most prominent being that one of the data centers we operate in could not support the SSD volumes we were upgrading to (a limitation of some Amazon accounts, as we discovered). We had to roll back some changes and then reprovision the Jenkins masters. To provide availability, we recover servers into different data centers, restoring the Jenkins data onto fresh volumes and fresh servers. This requires provisioning to work, so the unavailability of SSD volumes prolonged recovery by a few hours in some cases. We are looking at how we can avoid these older data centers in the future.
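To illustrate the step that failed: recovery restores a master's data volume from a snapshot into a healthy data center, preferring SSD but needing a fallback where SSD isn't offered. A rough sketch of that idea with boto3 (the function, region, and error handling are illustrative, not our actual provisioning code):

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")  # illustrative region

    def restore_volume(snapshot_id, availability_zone):
        """Restore a data volume from a snapshot, preferring SSD (gp2) but
        falling back to magnetic ("standard") where SSD isn't supported."""
        for volume_type in ("gp2", "standard"):
            try:
                volume = ec2.create_volume(
                    SnapshotId=snapshot_id,
                    AvailabilityZone=availability_zone,
                    VolumeType=volume_type,
                )
                return volume["VolumeId"]
            except ClientError:
                continue  # SSD unsupported here; try the next volume type
        raise RuntimeError("could not provision a volume in " + availability_zone)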

Thursday, 9 October 2014

Jenkins masters restarted to apply security patch

An important security patch was applied just after 1 AM GMT today. This was done as a "soft" restart, in which running jobs were allowed to complete. Some masters, however, had long-running jobs and eventually had to be restarted anyway; if one of your builds was interrupted, re-triggering it is recommended (unless an external automatic trigger will do so), as in the sketch below.
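If you have several interrupted builds to re-trigger, the standard Jenkins remote API can do it in a loop. A minimal sketch in Python (the master URL, job names, and credentials are placeholders, not real values):

    import requests  # third-party HTTP library

    JENKINS = "https://example.ci.cloudbees.com"  # placeholder master URL
    AUTH = ("user", "api-token")                  # placeholder credentials

    # Re-trigger builds that were interrupted by the restart.
    for job in ("my-app", "my-lib"):              # placeholder job names
        resp = requests.post(JENKINS + "/job/" + job + "/build", auth=AUTH)
        resp.raise_for_status()                   # Jenkins answers 201 Created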

Monday, 29 September 2014

Status of CVE-2014-6271 and AWS reboots

We have received several inquiries via our support channels about how CloudBees systems have been affected by CVE-2014-6271 (aka "shellshock") and the ongoing alert we have posted about AWS reboots.

CVE-2014-6271 status
CloudBees systems including Forge (Git/SVN), RUN@cloud (Apps/Databases) and DEV@cloud Jenkins masters have been patched against CVE-2014-6271. DEV@cloud slaves are already hardened to safely run arbitrary build-script processes in isolated containers, but they are being patched as an additional precaution.
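For anyone who wants to verify their own bash installations (for example on custom build images), the well-known check is to smuggle a function definition with a trailing command through an environment variable. A small sketch in Python; a vulnerable bash prints the extra line, a patched one does not:

    import subprocess

    # CVE-2014-6271: a vulnerable bash imports the function definition from
    # the environment variable and then (incorrectly) executes the trailing
    # command at startup, printing "VULNERABLE" before the real command runs.
    result = subprocess.run(
        ["bash", "-c", "echo shellshock test complete"],
        env={"x": "() { :;}; echo VULNERABLE"},
        capture_output=True,
        text=True,
    )
    print(result.stdout, end="")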

Side note: the Forge outage on Sept 24 was a result of maintenance required to perform these security upgrades.

AWS reboot status
There is an active alert on status.cloudbees.com warning about a massive set of reboots that Amazon is performing on its AWS systems (AWS is the primary provider of CloudBees computing resources). These reboots are not related to the shellshock alert, but they may result in some small windows of service disruption. Where possible, we are rebooting servers ourselves ahead of the scheduled reboots to minimize disruption.
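For reference, scheduled maintenance events are visible through the EC2 API, which is how a proactive reboot can be scripted. A rough sketch with boto3 (region and filtering are illustrative, and pagination is omitted; our actual tooling differs):

    import boto3  # AWS SDK for Python

    ec2 = boto3.client("ec2", region_name="us-east-1")  # illustrative region

    # Instance status includes any scheduled maintenance events,
    # e.g. "system-reboot" or "instance-reboot".
    status = ec2.describe_instance_status(IncludeAllInstances=True)
    to_reboot = [
        s["InstanceId"]
        for s in status["InstanceStatuses"]
        if any(e.get("Code", "").endswith("reboot") for e in s.get("Events", []))
    ]

    if to_reboot:
        # Reboot on our own schedule, ahead of Amazon's maintenance window.
        ec2.reboot_instances(InstanceIds=to_reboot)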

Wednesday, 24 September 2014

Forge Service restored

In the process of applying a critical security patch, the Forge SVN, WebDAV and Maven repositories were made inaccessible starting at roughly 19:00 GMT. Service was mostly restored by 19:15, with all services available by 19:50.

Thursday, 14 August 2014

OS X build service returned to normal

We had a problem in which half of our OS X build capacity was unavailable and the load spilled over onto the remaining servers. This caused some jobs to back up as storage usage grew.

Service is now restored. We have also had to disable access to OS X builds for free accounts; they were only officially available to paid accounts, but we are now enforcing that.

Friday, 1 August 2014

Cross system outage

Earlier today there was an outage in a core system used to check entitlements, which had a cascading effect: builds and deployments were unavailable, and the console was also down for a short time while the service was restored.

Monday, 28 April 2014

API and console timeouts resolved

Earlier today there were intermittent timeouts, and general slowness, of the main API and thus the consoles.

This was due to internal systems accessing an administrative database too aggressively; the underlying cause has been fixed.

If you were attempting to deploy apps at the time and saw failures, feel free to try again.

Tuesday, 8 April 2014

Resolved: OS X Build Service Offline

This issue is resolved.

The colocation facility hosting our OS X build service is experiencing an outage. They are working to restore service as soon as possible.

ClickStarts currently struggling to load

At this time we are looking into issues with ClickStarts being slow to load. If you are trying a ClickStart, please try again shortly.

Heartbleed bug and SSL

Work has been done to address the Heartbleed SSL vulnerability; some ELB-based services are still pending updates.

You can read more about this issue here.

We will have more updates soon, as well as any suggested actions (expect an email if you are possibly affected).

Tuesday, 1 April 2014

Resolved: Forge Outage

This issue is resolved.
---
Forge repositories are undergoing an unexpected outage. Failover instances are unreachable, and we are attempting to restore service. We are investigating this problem and will provide the next update in 15 minutes.

Migration to new DEV@cloud Cloud Slaves service

As announced on our blog two weeks ago, and communicated via email to customers last week, we are in the process of transitioning accounts to the new CloudBees Cloud Slaves service. As a result of this change, customers should generally see cloud slaves come online faster, run faster, and have additional speed options.

If you run into any problems, please see our known issues page.

Friday, 21 March 2014

API outage causing delayed masters and executors resolved

Earlier today there was an extended EC2 API outage. This delayed the provisioning of masters (mainly for new accounts), as well as build services that were not using the new "mansion"-based service. Service should be restored now, but there may be a short backlog as everything recovers.

Resolved: Database service outage

A problem with our infrastructure provider this evening affected slave-less CloudBees database clusters during the nightly database backup routine. Customers with apps connecting to the affected databases were likely to experience database connection hangs and timeouts. If apps are not coded with proper timeouts and retries, they may need to be restarted to re-establish connections to the database.
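As a general pattern, connections should be opened with explicit timeouts and retried with backoff, so a brief database outage does not hang the app. A minimal sketch in Python with PyMySQL (the timeout and retry values are illustrative):

    import time
    import pymysql  # third-party MySQL driver

    def connect_with_retry(host, user, password, database, attempts=5):
        """Open a MySQL connection with explicit timeouts, retrying with
        exponential backoff instead of hanging on a dead server."""
        for attempt in range(attempts):
            try:
                return pymysql.connect(
                    host=host, user=user, password=password, database=database,
                    connect_timeout=5,    # fail fast instead of hanging
                    read_timeout=30,
                    write_timeout=30,
                )
            except pymysql.MySQLError:
                if attempt == attempts - 1:
                    raise                 # give up after the last attempt
                time.sleep(2 ** attempt)  # back off before retrying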

This problem did not affect customers paying for dedicated databases with slave configurations, or customers using our recently added ClearDB multi-tenant database clusters, which are configured with redundancy by default.

This was a problem we had never encountered before, and identifying and recovering from the underlying cause was slower than we aim for. We apologize for the downtime and are working to identify ways to avoid this specific problem and to improve our recovery time.

Tuesday, 18 March 2014

Resolved: SauceLabs Downtime Affecting Some Jenkins Users

This issue is resolved.
---
SauceLabs is experiencing an extended outage which is affecting Jenkins users (CloudBees DEV@cloud or otherwise) who have installed the SauceLabs plugin. If users have configured the SauceLabs Badge column in their main view, any attempt to view the Jenkins index pages or jobs will hang. This is due to a bug in the SauceLabs plugin.

You can temporarily disable the SauceLabs plugin by navigating to <Jenkins URL>/pluginManager/installed, searching for the word "Sauce", unchecking the checkbox, and then pressing "Restart when no jobs are running" at the bottom of the page. After Jenkins restarts, you should be able to view your jobs, but you may lose Sauce-related configuration if you edit any jobs that contain it.

You can follow the SauceLabs Ops Team on Twitter.

Wednesday, 12 March 2014

Revproxy service problems resolved

Around 9 PM UTC on 11 March, there was a problem with the revproxy service that serves most non-SSL apps in the US region. This was resolved shortly afterwards, but it did result in some apps intermittently and incorrectly returning 500 errors.

Thursday, 27 February 2014

ec2-50-19-213-178 shared database server restarted

Around 11 PM GMT the ec2-50-19-213-178 shared database service encountered some problems and was restarted (the restart took some time). Everything has now been restored to normal service.

Thursday, 23 January 2014

50.19.213.178 multi-tenant database cluster

The connectivity issue with the 50.19.213.178 multi-tenant database cluster has been resolved.