Incident duration
15/09/2021 07:40 - 15/09/2021 13:00
Symptom
VPS hosts were unreachable
Status
Resolved
Timeline
07:40 Received reports that our VPS hosting was not reachable for several of our customers.
Investigation started (during this time the team worked on isolating the problem and troubleshooting the root cause).
09:55 Decision taken to roll back the previous maintenance work that caused the problem.
11:00 Rollback completed successfully. However, not all VPS machines restored their connection automatically; in some cases we had to restart the machine manually.
13:00 Created a solution that allows the affected machines to be rebooted via Flux to apply the fix.
13:00 – rest of the day | Monitored the situation and assisted customers with rebooting their VPS.
Background info
We are currently phasing out our AMS5 datacenter. This environment was originally built with an old orchestration tool, which is no longer in use. During our maintenance we moved the message bus (RabbitMQ) of this environment. This should not have had any impact on our live hosting.
We prepared a rollback scenario in which we could temporarily assign an IP address to the old RabbitMQ machines.
During the maintenance everything was moved successfully and was running. However, we did find an issue with spawning new VPS instances. At that time it did not appear to be related to the change.
In the early morning we were notified that several VPS instances were not reachable. Later in the morning, after isolating and troubleshooting the problem, we decided to roll back the change made the night before. In the process we also found out what was causing these problems.
Root cause
When we changed the IP addresses of the old RabbitMQ nodes, the old orchestration tool (which had not been in use for years) came back online. This resulted in:
Action plan
In future migrations, once everything has been completed, the old hosting environment and its orchestration tool will be deactivated and removed from our new setup.