Re: Forward Feral: summary of the move from November 19th
We'd like to share more about the events following our move in November and the subsequent disruption to service. We greatly regret the problems customers experienced and would like to describe what happened, what we'll do to make sure it never happens again, and how we will grow from the experience.
Moving to the Netherlands
The move itself went very well. Equipment was packed up, shipped across the continent, and unpacked professionally without any major problems. The few minor issues that did arise were easily solved by checking progress on-site. Very little hardware failed as a direct result of the move; the largest failure was two RAM modules.
It took a week to bring all servers back on-line, overshooting the original estimate. This was the result of a few slip-ups by third parties. Once back up, we could see that traffic levels had dropped. We immediately began investigating and found a number of issues.
The move required us to become our own ISP and effectively announce "we're over here", a change requiring a simple handover from our existing ISP. A few minutes after the handover we dropped off the Internet; our ISP closed the support ticket, assuming the issue was on our side. We subsequently diagnosed the problem as being within their network: elements that should have been removed during the handover had been left behind.
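A simple external reachability probe, run from outside the new network, would have flagged a drop like this within minutes of the handover rather than relying on the ISP's ticket queue. A minimal sketch in Python (the host, port, and alerting thresholds are illustrative placeholders, not our real infrastructure):

```python
import socket
import time


def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host: str, port: int, interval: float = 60.0, max_failures: int = 3) -> None:
    """Probe host:port forever; alert after max_failures consecutive failures."""
    failures = 0
    while True:
        if is_reachable(host, port):
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                print(f"ALERT: {host}:{port} unreachable for {failures} probes")
        time.sleep(interval)
```

Running such a probe from a vantage point outside our own routing announcement is the important part; a check from inside the network would have kept passing while the outside world could not reach us.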
Beyond this, the physical setup had changed, resulting in a different distribution of traffic. That, coupled with the way our routers handle lossless networking (a method we use to ensure data on the network is not lost), resulted in slower speeds. We ultimately had to choose between having half the servers at full speed with the rest off-line, or all of them at half speed. We opted for the latter.
This drop in performance resulted in a large increase in latency and a drop in throughput. Customers further away were affected the worst. We now have a greater understanding of how our routers operate and have made changes accordingly.
To further complicate matters, we found that under certain circumstances the servers went off-line or experienced very high latency regardless of network operations. This finding surprised us and we will work to provide a permanent fix; it has also given us insight into how to better monitor operations on the servers.
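Better monitoring here means watching latency, not just up/down state, since a server can answer probes while still being far too slow. One way to do that is to time TCP connects and alert on packet loss or high-percentile latency; a sketch in Python (the loss and latency thresholds are illustrative assumptions):

```python
import socket
import statistics
import time


def connect_latency(host: str, port: int, timeout: float = 5.0):
    """Time a TCP connect to host:port in milliseconds; None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None


def latency_report(samples, loss_limit: float = 0.05, threshold_ms: float = 200.0):
    """Summarise a list of probe results (floats or None for failed probes)."""
    ok = [s for s in samples if s is not None]
    report = {
        "loss": 1.0 - len(ok) / len(samples),
        "p50": statistics.median(ok) if ok else None,
        # 19th of 20 cut points = 95th percentile; needs at least 2 samples.
        "p95": statistics.quantiles(ok, n=20)[-1] if len(ok) >= 2 else None,
    }
    report["alert"] = report["loss"] > loss_limit or (
        report["p95"] is not None and report["p95"] > threshold_ms
    )
    return report
```

Tracking the 95th percentile rather than the average matters: the intermittent stalls we saw would disappear into an average but stand out clearly at the tail.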
Finally, we saw changes in speeds back home for some customers. Traffic profiles changed as people presumably moved about for Christmas, meaning that some ISPs saw overloaded connections to our ISP. This is not the fault of our ISP or our network, but a reality that made diagnosing things a little harder. We understand our responsibility to mitigate such issues and will continue to connect to more upstream ISPs in the future.
Additional problems included unexpected default settings on routers, the wrong goods being delivered, and deliveries taking longer than expected (probably because of the Christmas period). Relatively speaking, these were minor issues as they were easily solved.
In total, there was one week of downtime and full speeds were restored three weeks later.
Communication during the move
We know from prior experience that customers like to know what is happening and to be kept in the loop. We started the move with this understanding and said we would post candid updates to our Twitter feed. We did exactly that, posting everything as it happened and what we were working on.
Once doing so would not detract from working on the problems, the Twitter feed was replaced with a status page. That page was then updated to include the latest information, and a link to it was placed on every page of our website.
It had become evident quite quickly that Twitter was the wrong medium, as its back-and-forth nature hid the real updates from those not following the feed closely. We also saw that some updates were too technical, while others were not technical enough.
For the future we will keep the dedicated status page to better reflect disruptions and offer more transparency. We will expand its functionality to provide insight into ongoing issues.
Why we moved
The move itself was a definite requirement. As we were housed in one of our previous ISP's data centres, we could only connect to that one ISP. The limitations of this were very apparent: we were powerless to act on any issues affecting latency or throughput to customers' own ISPs.
The size we reached gave us sufficient purchasing power to negotiate prices at some of the best data centres in the world. I strongly believe we have found the right place: a top facility with a 99.999% power SLA. We have also been given our own room, with space to expand two to three times over. We're not moving out.
We're now in a good position to build our own network. Our number one goal is to improve throughput for everyone without increasing the cost of our services. We already have stronger links within Europe and plan to improve the quality of routes to US ISPs by the end of the month.
Last, but certainly not least, I would like to personally apologise for the events following the move. I understand how frustrating it can be to purchase a service and have it not work, or worse, not live up to your expectations, particularly during a period when you wish to use it most. We will endeavour to put the lessons learnt into practice as soon as possible, to ensure this never happens again.
We would like to do what's right and will compensate customers for the period we were not at full service. You can redeem compensation by following this link.
The Feral Hosting team