Calls are dropped or cannot be established (formerly OZMO)
Incident Report for Dstny
Postmortem

1. Incident Information

Incident Slogan: Calls are interrupted or cannot be established

Priority: Critical

Start Time: 17/02/2021 15:22 CET (approximately)

Stop Time: 17/02/2021 19:45 CET (approximately)

EXECUTIVE SUMMARY

The activation of an improved backup process for our SQL cluster resulted in unexpected behavior that prevented the Swyx telephony systems from connecting to the SQL database. As a result, the connection to the database was lost and calls could not be established.

The unexpected behavior also caused the SQL cluster to go out of sync, which had to be restored manually.

2. Incident Detail

The activation of an improved backup process for our SQL cluster resulted in unexpected behavior that prevented the Swyx telephony systems from connecting to the SQL database. Because the connection to the database was lost, new calls could not be established and ongoing calls were interrupted.

3. Root Cause Identification

After troubleshooting the incident, our engineers established that the SQL servers required for the Swyx configuration were showing a high number of simultaneous connections. This high number of connections was caused by the backups of the SQL database cluster. To decrease the load we temporarily stopped the backup process and then had to stop the individual backup jobs as well. Most of these could be stopped automatically, but a few had to be stopped manually.
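
The report does not state how the connection load was measured. As a minimal diagnostic sketch, assuming the cluster runs Microsoft SQL Server (as Swyx deployments commonly do) and that the pyodbc driver is available, the simultaneous connections per client program could be counted as shown below; the server name and connection settings are illustrative, not taken from the incident.

    # Minimal diagnostic sketch (assumption: Microsoft SQL Server, pyodbc available).
    # Counts active sessions per client program so that a runaway backup job
    # shows up as an unusually high connection count.
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sql-cluster.example.local;"   # hypothetical listener name
        "DATABASE=master;"
        "Trusted_Connection=yes;"
    )

    QUERY = """
    SELECT program_name, host_name, COUNT(*) AS session_count
    FROM sys.dm_exec_sessions
    WHERE is_user_process = 1
    GROUP BY program_name, host_name
    ORDER BY session_count DESC;
    """

    with pyodbc.connect(CONN_STR) as conn:
        for program, host, count in conn.cursor().execute(QUERY):
            print(f"{count:5d}  {program or '<unknown>'}  ({host})")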

 

Due to the high number of connections, another issue occurred, which we later identified as the SQL cluster being out of sync. This resulted in the connection IP address being assigned to the secondary SQL server while the databases were active on the primary SQL server. To get the SQL cluster back into sync, our engineers first tried to restart the service; when this did not help, they restarted the services by rebooting the SQL cluster. Unfortunately, this did not have the desired effect.
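
The report does not name the clustering technology involved. Assuming a SQL Server Always On Availability Group, a quick check of which node currently holds the primary role and whether the replicas are healthy could look like the sketch below; all server names are hypothetical.

    # Sketch (assumption: SQL Server Always On Availability Group, pyodbc available).
    # Shows, per replica, whether it currently holds the PRIMARY role and whether
    # it is healthy, which makes a "listener IP on one node, databases primary on
    # the other" mismatch visible at a glance.
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sql-node-01.example.local;"   # hypothetical node name
        "DATABASE=master;Trusted_Connection=yes;"
    )

    QUERY = """
    SELECT ar.replica_server_name,
           ars.role_desc,                 -- PRIMARY / SECONDARY
           ars.synchronization_health_desc
    FROM sys.availability_replicas AS ar
    JOIN sys.dm_hadr_availability_replica_states AS ars
      ON ar.replica_id = ars.replica_id;
    """

    with pyodbc.connect(CONN_STR) as conn:
        for server, role, health in conn.cursor().execute(QUERY):
            print(f"{server:30s} {role:10s} {health}")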

 

Our engineers were finally able to get the IP address back on the correct server, but the data was still out of sync. They added an SQL expert to the crisis team and were able to manually get the data back into sync. Only a few databases needed manual actions to bring those systems back online.
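
The exact commands used to resynchronize the data are not described in the report. If the affected databases were part of an Always On Availability Group whose data movement had been suspended, a manual resync step could look like the following sketch; the database and server names are illustrative.

    # Sketch (assumption: an Always On Availability Group database whose data
    # movement was suspended; the database name is illustrative).
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sql-node-01.example.local;"   # hypothetical node name
        "DATABASE=master;Trusted_Connection=yes;"
    )

    with pyodbc.connect(CONN_STR, autocommit=True) as conn:
        # ALTER DATABASE ... SET HADR RESUME restarts data movement so the
        # secondary replica can catch up with the primary again.
        conn.cursor().execute("ALTER DATABASE [SwyxConfig] SET HADR RESUME;")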

4. Action Plan

All involved parties will get together later today to compile a plan of action. The plan of action will be presented to our Managing Director for approval.

Due to the nature of the recent interruptions, our Managing Director has decided to put all migrations, changes, and planned improvements on hold for the time being.

5. Timeline

11:30 - Activated an improved backup process to optimize the backups of our SQL cluster.

15:22 - The activation of the backup resulted in unexpected behavior that caused the number of simultaneous connections to increase over time, which in turn caused the SQL cluster to go out of sync. As a result, the Swyx telephony servers were unable to connect to the database and could not set up new calls.

16:30 - After troubleshooting, we identified the problem in the SQL backup. We disabled the backup and rolled back the configuration. Unfortunately, this did not resolve the outage.

17:00 - Further investigation showed that the high number of simultaneous connections had caused the SQL cluster to go out of sync: the secondary server held the connection IP address while the databases were active on the primary SQL server. Our engineers tried to get the SQL cluster back into sync by restarting the services. Unfortunately, this did not have the desired effect.

17:45 - Our engineers rebooted the SQL servers in the hope of restarting the services. Unfortunately, this did not have the desired effect either.

18:30 - Our engineers were able to get the IP address back on the correct server. Unfortunately, the data was still out of sync.

18:35 - Added an SQL expert to the engineering team.

18:45 - Our engineers were able to manually get the data back into sync.

19:30 - A few databases needed manual actions to restore the connection.

19:45 - All services were restored.

Posted Feb 18, 2021 - 16:25 CET

Resolved
We kept this notification open longer than previously announced because we received reports of disruptions; after troubleshooting, these reports turned out to be unrelated to this outage. The outage has been resolved. We are working on a Reason For Outage (RFO) in which we will explain the outage and the steps we will take to improve the stability of the platform. We apologize for the inconvenience.
Posted Feb 18, 2021 - 10:55 CET
Update
We kept all systems under monitoring overnight; all systems look good and are operational. We will keep this notification open until 09:00 to monitor the morning start-up peak. If you are still experiencing problems, please contact our service desk.
Posted Feb 18, 2021 - 07:47 CET
Monitoring
The outage has been resolved. A few systems still require additional action; these are currently being worked through. We are keeping the systems under monitoring.
Posted Feb 17, 2021 - 19:37 CET
Update
We have been able to restore part of the environments, but not all environments are online yet. Our engineers are still working on the recovery.
Posted Feb 17, 2021 - 18:33 CET
Identified
The adjustments did not have the desired effect. We are investigating the outage further.
Posted Feb 17, 2021 - 16:25 CET
Monitoring
We are receiving reports that telephony has been restored. We will continue to monitor telephony.
Posted Feb 17, 2021 - 16:20 CET
Identified
The problem appears to have been found. We have taken action to resolve the outage. We are also receiving reports again that telephony is working.
Posted Feb 17, 2021 - 16:14 CET
Investigating
We are currently receiving reports that calls are being dropped or cannot be established. Our engineers are investigating these reports. We apologize for the inconvenience.
Posted Feb 17, 2021 - 15:41 CET
This incident affected: Hosting (Swyx).