Latest Update 11:55 Thursday 26th January: All shared email services have been online and active since 3am and service has been stable. We continue to monitor the platform and will post further updates if needed.
We are currently investigating performance issues on our QMail service which are resulting in slower than usual mail delivery and issues with mailbox connections. Our engineering team are working to resolve the issue at present. Resolving it fully may require us to fail over one of the storage nodes, which would involve a maintenance window of up to 1 hour during which mail would be unavailable.
We will update this post again shortly with more information and advise if a maintenance window is required.
Update 12:30: We are taking mail offline for emergency maintenance to fail over one storage node. We are allowing up to one hour for this maintenance window, starting now.
Update 13:27: We have failed over a storage node and are working to stabilise the system before we enable customer connections. We will update again in 30 minutes.
Update 14:01: We continue to work on getting services restored and will update this post in 30 minutes.
Update 14:30: We are continuing to work with our Storage vendors to stabilise the platform. We will update again in 30 minutes.
Update 15:04: Our vendor believes that a single SSD in the ZIL mirror is causing the performance issues. We’re working with them to diagnose this fully. It’s an extremely complex system and is slow to diagnose. Our current ETA is 16:00 GMT.
Update 16:00: We have brought services back online, but due to the large number of connections being made at once, access is intermittent at present. We continue to work with the vendor to find a full resolution.
Update 16:40: We are still working on this. POP/IMAP services are working but not optimally. Webmail services are disabled while we continue to stabilise the mail system.
Update 16:50: We need to shut down services completely again for about 30 minutes. The estimate for this is around 17:10 until around 17:40.
Update 17:53: The reboot of the systems has completed but the platform remains unstable. We are working with the vendor to resolve this. We will update again in 30 minutes.
Update 17:25: Email is currently totally disabled. Our vendor is examining the situation carefully and trying to find a cause for the slowness. It boils down to this: data written by the mail servers (which your email client uses) via a network file share (NFS) is taking a long time to be written to the file system on the backend. This is obviously very unusual behaviour and has led them to gather other experts within their organisation (which is substantial) to find the cause.
Update 19:35: Our senior engineers are still working with our vendor to identify the root cause.
Update 20:30: From our senior management to our support staff, we’re working on this issue with the highest possible priority. We keep relieving the inbound queues and delivering email to mailboxes, which means that some people with forwarders to outside mailboxes will get those emails. It also means that as soon as the issue is fully resolved all your email will be in your inbox immediately.
Update 22:30: Firstly, apologies for the tardy update; head stuck firmly in the clouds. We have identified the cause of the issue. We have a fix that involves moving a lot of mailbox data around, so it’s slow. However, we should be able to turn mail servers back on in the next couple of hours and, in parallel, complete the data migration, which will help alleviate the problem.
Update 23:50: Large volumes of data are currently being migrated, but this is going to take time to complete.
Day changes to Thursday 26/1.
Update 01:40: We continue to migrate data and we’re seeing improved performance across the system; however, it’s not currently live for any customers.
Update 03:25: All mail services are back up and running. At this point the system is considered ‘at risk’ until we see full live traffic at peak times. However, since moving a lot of data from old vdevs to new vdevs, our metaslab contention issues are far less prominent than before.
Translation: stuff got slow because free slots on each mirror within our storage pool were difficult for the system to find. A mirror is a pair of disks. Prior to today we had 15 mirrors; we now have 31 mirrors, as we added 32 new disks to the system this evening to take the load. Part of the process of fixing things was copying mailboxes from one folder to another. This allowed the system to rebalance data somewhat and free up allocations. It’s akin to having to run disk defragmenter in Windows 95/98 back in the day; however, the performance issues we were presented with around midday Wednesday did not point to this as the cause of the bottleneck. The new disks bring not only additional metaslab space but also additional disk space across the entire mail system, and increased performance, because there are now 72 disks versus the original 40.
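For anyone who wants to check the numbers in the paragraph above, here is a small arithmetic sketch. It only uses the figures quoted in this post; the variable names are ours and purely illustrative.

```python
# Figures quoted in the update above.
MIRRORS_BEFORE = 15       # mirrors in the storage pool before the incident
NEW_DISKS = 32            # disks added during the evening
DISKS_PER_MIRROR = 2      # "a mirror is a pair of disks"
TOTAL_DISKS_BEFORE = 40   # original disk count quoted for the system

# Each new mirror consumes two of the new disks.
new_mirrors = NEW_DISKS // DISKS_PER_MIRROR
mirrors_after = MIRRORS_BEFORE + new_mirrors
total_disks_after = TOTAL_DISKS_BEFORE + NEW_DISKS

# Disks from the original 40 that were not part of the 15 data mirrors.
# The post does not itemise what these are used for.
other_original_disks = TOTAL_DISKS_BEFORE - MIRRORS_BEFORE * DISKS_PER_MIRROR

print(f"new mirrors:          {new_mirrors}")
print(f"mirrors after:        {mirrors_after}")
print(f"total disks after:    {total_disks_after}")
print(f"other original disks: {other_original_disks}")
```

The results agree with the post: 16 new mirrors bring the pool from 15 to 31 mirrors, and the total disk count goes from 40 to 72.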
Update 06:20: Mail has remained stable since around 3am. As mentioned before, we can’t generate production traffic to test the changes that were made; however, we’re confident that the system is working. It also has more headroom than it had yesterday, so we expect it to cope, and most users will see a notable performance boost after a few hours (once caches warm up fully).
Update 09:10: The system is fully online and at present remains stable. We continue to monitor performance.
Question asked: What are ‘caches’? A cache (pronounced cash) is typically a record of frequently accessed files, data or email held in memory or on high speed solid state disks. Blacknight’s email system uses both RAM and SSD: there is 128GB of RAM cache and 800GB of high speed SAS solid state cache.
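To make the idea of a cache "warming up" a little more concrete, here is a minimal sketch of a two-tier read cache: a small fast tier standing in for RAM, a larger tier standing in for SSD, and a backing store standing in for the regular disks. This is a toy model with a simple least-recently-used eviction rule, not a description of how our storage platform actually manages its caches; the class name, slot counts and message keys are all made up for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy read cache: RAM tier first, then SSD tier, then the backing disks."""

    def __init__(self, ram_slots, ssd_slots, backing_store):
        self.ram = OrderedDict()   # smallest and fastest tier
        self.ssd = OrderedDict()   # larger, slower tier
        self.ram_slots = ram_slots
        self.ssd_slots = ssd_slots
        self.backing_store = backing_store  # stands in for the regular disks

    def read(self, key):
        """Return (value, tier), where tier names the place the value was found."""
        if key in self.ram:
            self.ram.move_to_end(key)  # mark as most recently used
            return self.ram[key], "ram"
        if key in self.ssd:
            value, tier = self.ssd.pop(key), "ssd"
        else:
            value, tier = self.backing_store[key], "disk"
        self._put_ram(key, value)      # promote to the fast tier
        return value, tier

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:
            # Demote the least recently used RAM entry to the SSD tier.
            old_key, old_value = self.ram.popitem(last=False)
            self.ssd[old_key] = old_value
            if len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)  # drop the oldest SSD entry


store = {f"msg{i}": f"body {i}" for i in range(5)}
cache = TieredCache(ram_slots=1, ssd_slots=2, backing_store=store)

cold = cache.read("msg0")     # first read has to go to the disks
warm = cache.read("msg0")     # second read is served from RAM
cache.read("msg1")            # pushes msg0 out of RAM into the SSD tier
demoted = cache.read("msg0")  # now served from the SSD tier
print(cold[1], warm[1], demoted[1])
```

A cache that has just been restarted is empty, so every first read goes to the slow tier. That is why performance improves over a few hours as the caches fill with frequently used mail.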
Question asked: Is email that is being sent to me in a queue or was it bounced back to the sender? Answer: All email is queued on our end.
Question asked: What is IMAP? Many of our customers use their phones or an email client such as Apple Mail, Thunderbird or Outlook. The client collects the email from us using one of two protocols, typically POP or IMAP. IMAP syncs the email to the client but always leaves it on the server. POP typically downloads the email and requests its deletion from the server, so the email is only held by your computer, tablet or other device.
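The difference can be sketched with a toy model in which a mailbox is just a list of message names. This does not speak the real POP or IMAP protocols; the function names and messages are hypothetical and only illustrate where the mail ends up in the typical configurations described above.

```python
def pop_download(server_mailbox):
    """POP-style (typical setup): download everything, then delete it from the server."""
    local = list(server_mailbox)
    server_mailbox.clear()
    return local


def imap_sync(server_mailbox, local_copy):
    """IMAP-style: copy anything missing locally; the server keeps every message."""
    for message in server_mailbox:
        if message not in local_copy:
            local_copy.append(message)
    return local_copy


# POP: once the phone has collected the mail, the server copy is gone.
server_a = ["invoice", "newsletter"]
phone = pop_download(server_a)
print(phone, server_a)

# IMAP: several devices can sync the same mailbox, and the server still holds it.
server_b = ["invoice", "newsletter"]
laptop = imap_sync(server_b, [])
tablet = imap_sync(server_b, [])
print(laptop, tablet, server_b)
```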
Question asked: What is ZIL and SSD? The ZIL disks in our email storage system are a pair of high performance disks that all data is written to before it’s committed to our normal disks. An SSD is a solid state disk and it’s typically anything from 100 to 1000 times faster than a regular drive. In the case of ZIL disks, they’re mirrored log drives for the ZFS file system. All writes get put onto this mirror before being written to their final destination. As mentioned, this is an extremely complex system.
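Here is a rough sketch of that idea: writes land on a mirrored pair of fast log devices first and are moved to the main disks later. It follows the simplified description above rather than the internals of ZFS, which are considerably more involved. The class, the function and the latency figures are hypothetical. The last part illustrates why one slow disk in a mirror can slow every write: a mirrored write is only complete once both disks have it.

```python
class LoggedStore:
    """Toy model: writes go to a mirrored log first, then to the main disks."""

    def __init__(self):
        self.log_a = []  # first fast log device
        self.log_b = []  # its mirror, holding the same records
        self.main = {}   # stands in for the normal disks

    def write(self, key, value):
        record = (key, value)
        self.log_a.append(record)  # both sides of the mirror must
        self.log_b.append(record)  # receive the record
        return "acknowledged"

    def flush(self):
        """Move logged writes to their final destination and empty the log."""
        for key, value in self.log_a:
            self.main[key] = value
        flushed = len(self.log_a)
        self.log_a.clear()
        self.log_b.clear()
        return flushed


def mirrored_write_latency_ms(device_latencies_ms):
    """A mirrored write finishes only when the slowest device has finished."""
    return max(device_latencies_ms)


store = LoggedStore()
store.write("inbox/1", "hello")
store.write("inbox/2", "world")
pending_before = len(store.log_a)
flushed = store.flush()

# Hypothetical latencies: one healthy SSD and one misbehaving SSD.
healthy_pair = mirrored_write_latency_ms([1, 1])
degraded_pair = mirrored_write_latency_ms([1, 20])
print(pending_before, flushed, healthy_pair, degraded_pair)
```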
Question asked: What is ZFS? ZFS is a file system originally from the good old days of Solaris from Sun Microsystems. It’s a pretty cool file system and it’s commonly used by storage companies because of the flexibility and ease of use it offers: on-the-fly snapshots, expanding space while the system is running, visibility of many performance metrics and so on. We use a Nexenta SAN and they use ZFS.