VPS Offline
Incident Report for Dstny
Postmortem

Incident time duration
15/09/2021 07:40 - 15/09/2021 13:00

Symptom
VPS hosts were unreachable

Status
Resolved

Timeline
07:40 Received reports that our VPS hosting was not reachable for several of our customers.

Investigation started. During this time the team worked on isolating the problem and troubleshooting the root cause.

09:55 Decision taken to roll back the previous evening's maintenance work, which had caused the problem.

11:00 Rollback was successfully completed. However, not all VPS machines could restore their connection automatically. In some cases we had to restart the machine manually.

13:00 Created a solution that made it possible to reboot the affected machines via Flux to apply the fix.

13:00 – rest of the day: Monitored the situation and assisted customers with rebooting their VPS.

Background info
We are currently phasing out our AMS5 datacenter. This environment was originally built and managed with an old orchestration tool, which is no longer in use. During our maintenance we moved the message bus (RabbitMQ) of this environment. This should not have had any impact on our live hosting.

We prepared a rollback scenario in which we could temporarily provide the old RabbitMQ machines with an IP address.

During the maintenance everything was successfully moved and running. However, we did find an issue with spawning new VPS instances. At that time it did not appear to be related to the change.

In the early morning we received notice that several VPS instances were not reachable. Later in the morning, after isolating and troubleshooting the problem, we decided to roll back the change made the night before. In the meantime we also found out what was causing these problems.

Root cause
When we changed the IP addresses of the old RabbitMQ nodes, the old orchestration tool (which had not been in use for years) came back online. This resulted in:

  • An old storage service worker came back online, along with an old config that was no longer up to date.
  • In addition, 9 different hypervisor network configs were changed back to the old configuration.
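
The kind of configuration drift described above could in principle be detected by comparing each hypervisor's current network config against a known-good fingerprint. The following is only a minimal sketch under stated assumptions, not a description of our tooling: the host names, config contents, and the function names `fingerprint` and `find_drift` are all hypothetical.

```python
import hashlib


def fingerprint(config_text: str) -> str:
    """Return the SHA-256 hex digest of a config file's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def find_drift(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """List hosts whose current config no longer matches the baseline fingerprint.

    baseline maps host name -> known-good fingerprint.
    current maps host name -> config text as read from the host.
    """
    return sorted(
        host
        for host, text in current.items()
        if baseline.get(host) != fingerprint(text)
    )


# Hypothetical example data: two hypervisors, one reverted to an older config.
known_good = {
    "hv-01": "bridge=br0 vlan=120\n",
    "hv-02": "bridge=br0 vlan=121\n",
}
baseline = {host: fingerprint(text) for host, text in known_good.items()}
current = dict(known_good)
current["hv-02"] = "bridge=br0 vlan=12\n"  # simulated rollback to an old config

print(find_drift(baseline, current))
```

Run periodically, or right after a maintenance window, a check like this would flag hosts whose configuration changed unexpectedly.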

Action plan
In future migrations, once everything has been completed, the old hosting environment and its orchestration tool will be deactivated/removed from our new setup.

Posted Sep 16, 2021 - 16:07 CEST

Resolved
This morning's VPS reachability issues are resolved.

If your VPS is still having issues, it is best to restart the machine, or turn it off and start it again, within Flux. If you need assistance, please contact our support department.
Posted Sep 15, 2021 - 14:45 CEST
Monitoring
Rollback was successful.

If you still have a problem with your VPS, please contact us so we can restore it manually.
Posted Sep 15, 2021 - 11:05 CEST
Identified
We are currently rolling back yesterday evening's maintenance work. We expect to provide a new update in 20-30 minutes.
Posted Sep 15, 2021 - 09:52 CEST
Update
We are continuing to investigate this issue.
Posted Sep 15, 2021 - 08:22 CEST
Investigating
We are investigating an issue where some users are reporting that their VPS is not online.
Posted Sep 15, 2021 - 07:51 CEST
This incident affected: Hosting (VPS cloud).