On August 18th, we experienced an issue with the critical hosting infrastructure that supports a range of our EMEA GateManager servers. It was the longest and most severe incident we have experienced to date, and it affected many of you and your customers. We are truly sorry about this, and we understand that an explanation is needed.
What happened
On August 18th at 13:22 CEST, our surveillance systems alerted us that we had lost both our VPN lines to our external hosting provider and public access to the servers.
We discovered later that at 13:19 an unannounced power outage had occurred in the region of the hosting center. Such outages are a normal occurrence and are typically handled by the power redundancy systems at the hosting center.
Unfortunately, an error in the redundancy setup of the battery-powered Uninterruptible Power Supply (UPS) system meant that the UPS could not supply power long enough for the diesel generator to kick in, and systems went down abruptly. Bringing systems back online after an abrupt shutdown is typically not an issue either, but it required manual assistance, and during that process it was discovered that an essential backbone switch had failed and had to be replaced.
This chain of events was not fully uncovered until the evening of August 18th, when we could also confirm that all GateManager servers were fully operational and that no data were lost. Later, the License servers were also fully operational and started processing the queued orders.
Our response
When our surveillance system first alerted us about the incident, our hosting team contacted the hosting center staff, and when the severity was known, we immediately went on site. We made the decision to focus our efforts on locating the errors, getting a grasp of the situation, and assisting where we could. This meant that it took us longer than we would have liked to update our customer-facing server status page with information about the unavailability of the affected servers.
It also meant that a more detailed bulletin on the situation came much later than we would have liked, as we didn’t have any reliable update to post, and we didn’t want to make any promises that we couldn’t keep.
At the same time, it became apparent from incoming calls that our server status page was not easy enough to find on our website, which meant that many of you were unaware of the incident. We understand the frustration you must have felt, and we apologize for this.
What will happen now
We have started a thorough analysis of the events and will take action to improve the following areas:
1. Ensure that the redundancy measures at the hosting center are optimized.
2. Ensure that the disaster recovery processes are optimized to limit downtime.
3. Optimize our processes around information flow towards customers.
We have received confirmation from our hosting provider that work on the first two items is already in progress.
Please visit status.secomea.com for further details and updates on the current situation.