Background info
On 23 September we experienced a hardware failure in one of our switch stacks, which is responsible for connectivity between part of our compute node infrastructure and our storage systems. This resulted in a loss of storage connectivity for part of the compute node cluster that hosts 3CX bundles.
Under normal circumstances, a redundant path takes over all storage connectivity between our compute node infrastructure and our storage systems, but this failover did not occur on a number of compute nodes.
After an onsite engineer arrived, we began restoring storage connectivity on the affected compute nodes. Once connectivity was restored, we immediately started investigating all affected 3CX instances and began restoring them to a fully working state.
During the restore process we kept encountering storage connectivity errors, which we fixed along the way. Unfortunately, this significantly pushed back the estimated restore time.
Root cause
Due to the loss of connectivity between our compute node infrastructure and our storage systems, around one in seven of our 3CX bundle instances could no longer reach their disks.
This triggered a fail-safe mechanism on those 3CX instances, and we had to take individual action on each instance to resolve it.
Action plan
Investigations have been completed and the remaining devices in the switch stack are fully operational. All affected compute nodes have been checked and tested for second-path takeover.
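For illustration, the second-path check on a node amounts to confirming that every storage volume still has more than one live path. The sketch below is a minimal, hypothetical example (not our actual tooling) for Linux nodes using dm-multipath; the two-path threshold and the output parsing are assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: flag multipath devices with fewer than two live paths."""
import re
import subprocess
import sys

MIN_PATHS = 2  # assumed redundancy requirement: at least two live paths per LUN


def live_paths_per_device(output: str) -> dict:
    """Count 'active ready running' path lines under each device in 'multipath -ll' output."""
    counts = {}
    current = None
    for line in output.splitlines():
        # Device headers start at column 0 and contain the dm-N mapping,
        # e.g. "mpatha (3600508b4...) dm-2 VENDOR,PRODUCT"
        if line and line[0] not in " |`" and " dm-" in line:
            current = line.split()[0]
            counts[current] = 0
        elif current and re.search(r"\bactive ready running\b", line):
            counts[current] += 1
    return counts


def main() -> int:
    out = subprocess.run(["multipath", "-ll"], capture_output=True, text=True, check=True).stdout
    degraded = {dev: n for dev, n in live_paths_per_device(out).items() if n < MIN_PATHS}
    for dev, n in degraded.items():
        print(f"WARNING: {dev} has only {n} live path(s)")
    return 1 if degraded else 0


if __name__ == "__main__":
    sys.exit(main())
```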
We are now in the process of reviewing the storage connectivity infrastructure and implementing changes to enhance reliability and further strengthen our redundancy.