Degraded order processing (Resolved)

Incident Report for Tradecloud One

Postmortem

Tradecloud One order service outage on 4 Oct between 13:15 and 15:21 CEST

What was the issue?

On Tuesday, 4 Oct 2022 between 13:15 and 15:21 the order service crashed and was restarted every minute in the production environment.

A customer ERP integration had been sending two rogue order messages every 15 minutes for a long time. Each message contained a new document, and each document was appended to the order header, which had grown to contain 3600+ documents. At 13:15, when appending another document to the order, the order service crashed. After each restart the order service tried to process the same order message again and crashed again.
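As a rough illustration of the timescale involved, the figures above can be turned into an estimate. This is a sketch only: it assumes a constant rate of two messages per 15 minutes, one document per message, and a header that started empty, none of which the report confirms. The function name is hypothetical.

```python
# Figures taken from the report; the constant-rate assumption is ours.
DOCUMENTS_IN_HEADER = 3600   # "3600+ documents"
MESSAGES_PER_INTERVAL = 2    # two rogue messages ...
INTERVAL_MINUTES = 15        # ... every 15 minutes


def days_to_accumulate(documents: int, per_interval: int, interval_minutes: int) -> float:
    """Estimate how many days it takes to accumulate `documents`
    when `per_interval` documents arrive every `interval_minutes`."""
    intervals = documents / per_interval
    total_minutes = intervals * interval_minutes
    return total_minutes / (60 * 24)


estimate = days_to_accumulate(DOCUMENTS_IN_HEADER, MESSAGES_PER_INTERVAL, INTERVAL_MINUTES)
print(f"Roughly {estimate:.2f} days of rogue messages")
```

Under these assumptions the header would have taken a little under three weeks to reach that size.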

What was the impact?

Users could not perform order operations in the portal, such as accepting orders, during this time window. It was still possible to view orders in the portal.

Only some order messages were processed during this time. The remaining order messages were queued and processed after the incident was resolved.

How did Tradecloud respond?

At 13:23 the Tradecloud operations system reported an incident and notified the on-call engineer, who started investigating the issue.

At 14:36 a major incident was announced internally and three engineers were involved in communicating, investigating and fixing the issue.

At 14:47 the incident was reported to customers on our status page. The root cause was identified.

At 15:08 a hotfix that improved the performance of appending documents to an order was created.

At 15:23 the hotfix was released to production and monitored to verify that it worked as expected.

At 15:34 the hotfix was confirmed to be working. Order operations were possible again, the remaining order messages were processed, and the incident was marked as resolved on our status page.

What actions is Tradecloud taking to prevent this from happening again?

We are adding safety limits to orders for documents, lines, delivery schedule and delivery history. These limits are well above current usage and will not impact any order. The limits will be documented in the API manual.

This measure will ensure that the order service can always process order messages, regardless of their size.
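A minimal sketch of what such a safety limit could look like for documents. The limit value, the class names and the function name are all hypothetical; this report does not state the actual limits or how the order service implements them.

```python
from dataclasses import dataclass, field

# Hypothetical value; the real limits will be documented in the API manual.
MAX_DOCUMENTS_PER_ORDER = 1000


class OrderLimitExceeded(Exception):
    """Raised when a message would push an order past a safety limit."""


@dataclass
class Order:
    order_id: str
    documents: list = field(default_factory=list)


def append_document(order: Order, document: str,
                    max_documents: int = MAX_DOCUMENTS_PER_ORDER) -> None:
    """Append a document to the order, rejecting it once the limit is reached.

    Rejecting the message up front keeps the order at a bounded size,
    so processing cost cannot grow without limit.
    """
    if len(order.documents) >= max_documents:
        raise OrderLimitExceeded(
            f"order {order.order_id} already has {len(order.documents)} documents"
        )
    order.documents.append(document)
```

With a check like this, a rogue integration that keeps sending documents gets a clear rejection once the limit is reached, instead of the order growing until processing fails.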

Please send any questions to support@tradecloud1.com

Posted Oct 06, 2022 - 09:07 CEST

Resolved

A fix is in place and the root cause has been resolved.

Posted Oct 04, 2022 - 15:34 CEST

Monitoring

The fix has been deployed and we are currently monitoring the situation.

Posted Oct 04, 2022 - 15:27 CEST

Update

We found a way to mitigate the issue and are currently deploying a fix.

Posted Oct 04, 2022 - 15:12 CEST

Identified

We are currently experiencing a technical issue in our order service.

Due to this, some operations may not be possible through the portal.

Some of our customers may experience that orders sent to Tradecloud through the API Connector are currently not processed. These order messages are queued and will be processed once the issue has been resolved.

We have found the root cause and are investigating ways to mitigate the issue.

Posted Oct 04, 2022 - 14:47 CEST

This incident affected: Tradecloud One Portal and Connectors (API v2 Connector).