Acceptance and Production environments unavailable
Incident Report for Tradecloud One
Postmortem

Tradecloud One production outage between 12 May 2021 01:43 CEST and 13 May 2021 09:01 CEST

What was the issue?

On Wednesday 12 May 2021 at 01:43 CEST, the Tradecloud Cassandra cluster serving the acceptance and production environments became completely unavailable due to failed external block storage.

The Cassandra cluster is a distributed database used by Tradecloud services to read and write data. The Cassandra data is persisted on a block storage cluster that is hosted by our infrastructure provider Equinix Metal, but supported and maintained by Datera.

What was the impact?

Most operations in the Tradecloud One platform were not possible: users were not able to log in to the portal and services were not able to process requests. Connectors queued incoming messages for later processing.

How did Tradecloud respond?

The incident was reported to customers at 08:29 CEST on our status page, and customers were personally informed around Wed, 12 May 2021 10:00 CEST. After that, we provided regular updates via https://status.tradecloud1.com.

On 12 May at 12:53 CEST Equinix Metal identified the incident. At 22:09 CEST Equinix Metal confirmed that there were two failed drives across two nodes in the storage cluster. At 23:41 CEST Equinix Metal confirmed that both storage cluster nodes had been reloaded. The long lead time was due to the dependency on Datera, which works in the US West Coast time zone.

We are awaiting a postmortem by Equinix Metal, so we do not yet know the root cause of the storage cluster failure. Tradecloud chose to use a storage cluster because it should be highly redundant, and two failed disks should not have been an issue. Tradecloud was also not aware that the support and maintenance of the cluster had been delegated to the third party Datera.

On 13 May at 00:24 CEST Tradecloud managed to bring the Cassandra cluster back up and a full repair was started. At 08:37 CEST the Cassandra cluster repair was completed, and at 09:01 CEST the acceptance and production environments were available again.
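
For readers interested in what such a repair involves: after the nodes come back, a full (non-incremental) repair makes all replicas consistent again. The sketch below is a minimal illustration of that step using Cassandra's nodetool utility, with placeholder host names; it is a simplified example, not our exact recovery procedure.

    import subprocess

    # Placeholder host names; they do not reflect the actual cluster topology.
    NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]

    # First verify that every node reports itself as up again.
    for node in NODES:
        subprocess.run(["nodetool", "-h", node, "status"], check=True)

    # Then run a full repair node by node, so replicas that diverged during
    # the storage outage are brought back in sync without overloading the cluster.
    for node in NODES:
        subprocess.run(["nodetool", "-h", node, "repair", "--full"], check=True)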

What actions does Tradecloud take to prevent this from happening again?

As a first measure, we are moving the Cassandra data from the storage cluster to the local disks of the Cassandra servers as soon as possible, so that we are no longer dependent on the external storage cluster and Datera.

We are already in the process of migrating from Equinix Metal to the Google Cloud Platform as our new infrastructure provider. The Google Cloud Platform is one of the most reliable infrastructure platforms in the world. Instead of a single data center, we will be using three availability zones spread across data centers in Eemshaven and Frankfurt. We have already migrated our test environments successfully to the Google Cloud Platform and will be migrating production during May and June.

As a second measure, until we are fully migrated to GCP, we will run an additional Cassandra cluster on the Google Cloud Platform and replicate data to it continuously from Equinix Metal. This provides an additional real-time backup as well as an easy way to migrate the data.
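
Conceptually, this works by adding the GCP cluster as a second Cassandra data center and configuring each keyspace to keep replicas in both locations, so every write on the Equinix Metal side is also streamed to GCP. The sketch below illustrates the idea with Cassandra's NetworkTopologyStrategy via the Python driver; the contact point, keyspace name and data center names are placeholders, not our actual configuration.

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    # Placeholder contact point in the existing cluster.
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()

    # Keep three replicas in each data center; from then on, writes arriving
    # in either data center are replicated to the other in near real time.
    session.execute("""
        ALTER KEYSPACE tradecloud
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'equinix': 3,
            'gcp': 3
        }
    """)

    cluster.shutdown()

Existing data is then streamed once to the new data center (for example with nodetool rebuild), after which both sides stay in sync automatically.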

With these two measures we ensure that Tradecloud will not have a similarly long outage again and that we are no longer dependent on a third party for our storage. We are very enthusiastic about our experience with the Google Cloud Platform so far; it will provide excellent reliability, scalability and security for the Tradecloud One platform.

Please send any questions to support@tradecloud1.com

Posted May 18, 2021 - 08:50 CEST

Resolved
This incident has been resolved.
Posted May 13, 2021 - 09:42 CEST
Monitoring
Everything is back at normal operating capacity and we are monitoring platform status.
Posted May 13, 2021 - 09:01 CEST
Update
Our cloud provider continues to troubleshoot the issue but unfortunately cannot give us an ETR. We expect that the current situation will last until at least tomorrow morning.
Posted May 12, 2021 - 22:00 CEST
Update
In order to restore our data storage we require further support from our cloud provider. We expect that the current situation will last until at least 16:00 CEST.
Posted May 12, 2021 - 11:00 CEST
Update
Our API v2 Connector, SAP Webservices Connector and Webhook Connector are back online and available for our integrated customers.
Any incoming messages will be stored in queue and processed once we are able to restore our data storage and bring up the rest of the platform.
Posted May 12, 2021 - 10:23 CEST
Update
We are in contact with our cloud provider and working on recovering our data storage.
Posted May 12, 2021 - 08:59 CEST
Identified
As of May 12 01:43 CEST, the Tradecloud One Acceptance and Production environments are unavailable due to an outage at our cloud provider.
We have suspended all services until this issue has been resolved. We are currently awaiting response and mitigation from our cloud provider.
Posted May 12, 2021 - 08:29 CEST
This incident affected: Tradecloud One Portal, Connectors (API v2 Connector, Isah SCI Connector, SAP Webservices Connector, Webhook Connector), and Acceptance Test Environment (Portal, API v2 Connector, Connectors).