On Wednesday 12 May 2021 at 01:43 CEST, the Tradecloud Cassandra cluster serving the acceptance and production environments became completely unavailable because the external block storage it depends on failed.
The Cassandra cluster is a distributed database used by Tradecloud services to write and read data. The Cassandra data is persisted on a block storage cluster, hosted by our infrastructure provider Equinix Metal, but supported and maintained by Datera.
Most operations on the Tradecloud One platform were not possible: users were not able to log in to the portal and services were not able to process requests. Connectors queued incoming messages for later processing.
The incident was reported to customers on our status page at 08:29, and customers were informed personally around 10:00 CEST on Wednesday 12 May 2021. After that we provided regular updates via https://status.tradecloud1.com.
On 12 May at 12:53 CEST, Equinix Metal identified the incident. At 22:09 CEST, Equinix Metal confirmed that two drives had failed across two nodes in the storage cluster. At 23:41 CEST, Equinix Metal confirmed that both storage cluster nodes had been reloaded. The long lead time was due to the dependency on Datera, which operates in the US West Coast time zone.
We are awaiting a postmortem from Equinix Metal and do not yet know the root cause of the storage cluster failure. Tradecloud chose to use a storage cluster because it was supposed to be highly redundant, so two failed disks should not have caused an outage. Tradecloud was also not aware that support and maintenance of the cluster had been delegated to a third party, Datera.
On 13 May at 00:24 CEST, Tradecloud managed to bring the Cassandra cluster up and a full repair started. At 08:37 CEST the Cassandra cluster repair was completed, and at 09:01 CEST the acceptance and production environments were available again.
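The timestamps above can be turned into durations with a short calculation. The following is a minimal sketch in Python, assuming the times in this report are exact; the label names and helper functions are illustrative and not part of any Tradecloud tooling.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2. A fixed offset is sufficient because all timestamps fall in May 2021.
CEST = timezone(timedelta(hours=2), "CEST")

# Timestamps copied from this report; the keys are illustrative labels.
timeline = {
    "outage_start": datetime(2021, 5, 12, 1, 43, tzinfo=CEST),
    "provider_identified": datetime(2021, 5, 12, 12, 53, tzinfo=CEST),
    "storage_nodes_reloaded": datetime(2021, 5, 12, 23, 41, tzinfo=CEST),
    "cassandra_up": datetime(2021, 5, 13, 0, 24, tzinfo=CEST),
    "repair_completed": datetime(2021, 5, 13, 8, 37, tzinfo=CEST),
    "service_restored": datetime(2021, 5, 13, 9, 1, tzinfo=CEST),
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two labelled events in the timeline."""
    return timeline[end] - timeline[start]


def format_duration(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes, e.g. '8h 13m'."""
    hours, minutes = divmod(int(delta.total_seconds()) // 60, 60)
    return f"{hours}h {minutes:02d}m"


if __name__ == "__main__":
    print("Total outage:", format_duration(elapsed("outage_start", "service_restored")))
    print("Until provider identified the incident:",
          format_duration(elapsed("outage_start", "provider_identified")))
    print("Cassandra full repair:",
          format_duration(elapsed("cassandra_up", "repair_completed")))
```

Under these assumptions the total outage comes to 31 hours and 18 minutes, of which 8 hours and 13 minutes were spent on the full Cassandra repair.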
As a first measure, we are moving the Cassandra data from the storage cluster to the local disks of the Cassandra servers as soon as possible, so that we no longer depend on the storage cluster and Datera.
We are already in the process of migrating from Equinix Metal to the Google Cloud Platform as our new infrastructure provider. The Google Cloud Platform is one of the most reliable infrastructure platforms in the world. Instead of one data center, we will be using three availability zones spread over the Eemshaven and Frankfurt data centers. We have already migrated our test environments successfully to the Google Cloud Platform, and will be migrating production during May and June.
As a second measure, until we are fully migrated to GCP, we will run an additional Cassandra cluster on the Google Cloud Platform and continuously replicate data to it from Equinix Metal. This will provide an additional real-time backup and an easy way to migrate the data.
With these two measures we ensure that Tradecloud will not have a similar long outage again and is no longer dependent on a third party. We are very enthusiastic about our experience with the Google Cloud Platform, which will provide excellent reliability, scalability and security for the Tradecloud One platform.
Please send any questions to email@example.com