On Tuesday, 4 October 2022, between 13:15 and 15:21, the order service in the production environment crashed and was restarted every minute.
A customer ERP integration had been sending two rogue order messages every 15 minutes for a long time. Each message contained a new document, and each document was appended to the order header. The order header had grown to contain more than 3600 documents. At 13:15, while appending a new document to the order, the order service crashed. After each restart, the order service tried to process the same order message again.
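To put "a long time" in perspective, the arithmetic can be sketched as follows. This is an estimate only: it assumes the message rate was constant, that every rogue message added exactly one document, and that the order header started without documents. The function name is illustrative and not part of any Tradecloud API.

```python
def days_to_accumulate(documents: int, messages_per_interval: int, interval_minutes: int) -> float:
    """Estimate how many days it takes to accumulate `documents` documents.

    Assumes one new document per message and a constant message rate.
    """
    documents_per_day = messages_per_interval * (24 * 60 / interval_minutes)
    return documents / documents_per_day


# Two messages every 15 minutes, 3600 documents on the order header.
estimated_days = days_to_accumulate(documents=3600, messages_per_interval=2, interval_minutes=15)
print(f"Roughly {estimated_days:.2f} days of rogue messages")
```

Under these assumptions the header would have needed a little under 19 days to reach 3600 documents.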
Order operations by users from the portal, like accepting orders, were not possible during this time window. It was still possible to see orders in the portal.
Order messages were partly processed during this time. The remaining order messages were queued and processed after the incident was resolved.
At 13:23 an incident was reported by the Tradecloud operations system. The on-call engineer was notified and started investigating the issue.
At 14:36 a major incident was announced internally and three engineers were involved in communicating, investigating and fixing the issue.
At 14:47 the incident was reported to customers on our status page. The root cause was identified.
At 15:08 a hotfix, improving the order document appending performance, was created.
At 15:23 the hotfix was released to production and monitored to verify that it worked as expected.
At 15:34 the hotfix was confirmed to be working. Order operations were possible again, the remaining order messages were processed and the incident was marked as resolved on our status page.
We are adding safety limits to orders for documents, lines, delivery schedule and delivery history. These limits are beyond current usage and will not impact any order. The limits will be documented in the API manual.
This measure will ensure that the order service can always process order messages, regardless of the size.
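A safety limit of this kind could look roughly like the minimal sketch below. It is not the actual order service code: the limit value, the class names and the function names are all hypothetical, and the real limits are the ones that will be documented in the API manual. The idea shown is that a message exceeding the limit is rejected cleanly instead of crashing the service and being retried.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical value for illustration; the real limit is defined in the API manual.
MAX_DOCUMENTS_PER_ORDER = 5000


class OrderLimitExceeded(Exception):
    """Raised when a message would push an order past a safety limit."""


@dataclass
class OrderHeader:
    documents: List[str] = field(default_factory=list)


def append_document(header: OrderHeader, document: str,
                    max_documents: int = MAX_DOCUMENTS_PER_ORDER) -> None:
    """Append a document unless the order already holds the maximum number."""
    if len(header.documents) >= max_documents:
        raise OrderLimitExceeded(
            f"order already has {len(header.documents)} documents (limit {max_documents})"
        )
    header.documents.append(document)


def handle_order_message(header: OrderHeader, document: str,
                         max_documents: int = MAX_DOCUMENTS_PER_ORDER) -> bool:
    """Return True if the message was applied, False if it was rejected."""
    try:
        append_document(header, document, max_documents)
        return True
    except OrderLimitExceeded:
        # Reject the message instead of crashing, so it is not retried forever.
        return False
```

With a limit of 2, for example, the first two calls to `handle_order_message` return `True` and the third returns `False`, leaving the order with two documents.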
Please send any questions to email@example.com