3CX instances partially unavailable
Incident Report for Dstny
Postmortem

Background info

On the 23rd of September we experienced a hardware failure in one of our switch stacks. This stack is responsible for connectivity between part of our compute node infrastructure and our storage systems. The failure resulted in a loss of storage connectivity on part of our compute node cluster for 3CX bundles.

Normally, a redundant path should take over all storage connectivity between our compute node infrastructure and our storage systems, but this takeover failed for a couple of compute nodes.

After an onsite engineer arrived, we started restoring storage connectivity on the affected compute nodes. Once connectivity was restored, we immediately started investigating all affected 3CX instances and began restoring them to a full working state.

During the restore process, we kept encountering storage connectivity errors, which we fixed along the way. Unfortunately, this significantly delayed the estimated restore time.

 

Root cause

Due to the loss of connectivity between our compute node infrastructure and our storage systems, around 1/7 of our 3CX bundle instances could no longer connect to their disks.

This triggered a fail-safe mechanism on those 3CX instances, and we had to take individual action on each affected instance to resolve it.

 

Action plan

Investigations have been completed, and the remaining devices in the switch stack are 100% operational. All affected compute nodes have been investigated and tested for second-path takeover.

We are now in the process of reviewing the storage connectivity infrastructure and implementing changes to enhance reliability and further strengthen our redundancy.

Posted Oct 13, 2021 - 12:39 CEST

Resolved
This incident will be closed. In the short term, maintenance will follow to restore full redundancy on this switch stack.
Posted Sep 28, 2021 - 16:02 CEST
Update
Unfortunately, the spare switch failed, and we are bringing some maintenance work forward to regain full redundancy.
Posted Sep 27, 2021 - 15:43 CEST
Update
The switch that caused the issue, which has been down since last night, will be replaced. No impact is expected.
Posted Sep 24, 2021 - 22:03 CEST
Update
We are continuing to monitor for any further issues.
Posted Sep 23, 2021 - 13:40 CEST
Monitoring
All 3CX servers should be back online now. If you still experience issues, please contact support.
Posted Sep 23, 2021 - 10:54 CEST
Update
While checking, we found some more 3CX servers to be offline. We are now recovering these servers as well.
Posted Sep 23, 2021 - 09:56 CEST
Identified
Due to the incident: https://status.destiny.nl/incidents/4dppmx3x1rzm

We are still recovering some 3CX instances that are in a locked state. We should have this resolved within the hour.
Posted Sep 23, 2021 - 07:25 CEST
This incident affected: Hosting (3CX).