Updates related to Incident 4743
Email Cluster A is Online
All users now have full access to mail, and systems are performing normally. Undelivered spam is now being delivered, and we expect that queue to be cleared later today.
We have begun a thorough root cause analysis with our storage vendor NetApp and will provide you with an incident summary by end of business day on Monday.
Once again, we apologize to you and your customers for the inconvenience.
Update: Monday, January 12, 2009:
Incident Summary:
We strongly suspect that a long-existing hardware defect caused this incident. Our technical team is continuing its analysis, and we will follow up with more details in the coming days. Our vendor is simultaneously conducting its own root cause analysis.
NetApp released a shelf controller firmware upgrade on November 17, 2008. This release fixes 3 software bugs, a number of which are pertinent and critical to our environment. Two of these critical bugs affected the controller reliability in the past.
Our approach to all releases is to allow some time for vendors to release updates and patches. We followed this approach with the November 17 NetApp release.
We first rolled this shelf controller firmware upgrade to our Development QA environment in November and subsequently rolled the upgrade out to our pre-production environment in December. These devices are the exact same models as our production environment. We conducted this shelf controller upgrade on over a hundred devices without incident.
Our primary OpenSRS technicians working on these upgrades are NetApp certified specialists. Based on testing, the time we've been running NetApps, and our previous experience making 10 successful firmware upgrades on this type of component, we were confident in undertaking maintenance on Cluster A of our email service.
During our scheduled maintenance on Cluster A (January 10, 2009: 06:00 UTC - 09:00 UTC), the shelf controller firmware upgrade of our NetApp caused a failure of the controlling disk head in the storage pool. The controlling disk head lost access to a shelf and handed control over to the second disk head which then triggered a rebuild of disks on that shelf. This directly affected 3 mailstores.
For the first 2.5 hours, all Cluster A mailboxes were offline for testing and to alleviate stress on the one functioning disk head while the rebuild occurred. At approximately January 10, 2009: 12:30 UTC, after thorough testing and confirmation from NetApp, we reactivated the offline disk head, which restored access for 50% of customer mailboxes on the cluster as well as forward-only and filter-only accounts. These customers were able to access their mailboxes and send/receive mail.
The rebuild continued on the 3 affected mailstores. The rebuild process is sequential: in order to restore services, each of the affected mailstores must be rebuilt in turn. This meant that 50% of customer mailboxes remained offline for a period of approximately 20 hours. We assessed that keeping these mailboxes offline would let the rebuild run more quickly, restore service efficiently and maintain service for the 50% of mailboxes that were unaffected.
Our technical team closely monitored the rebuild process throughout the day. We also worked closely with NetApp to analyze the issue and determine possible ways to speed up the rebuild process without adversely affecting the system. The first mailstore was rebuilt on January 10: 16:55 UTC, the second on January 11: 04:05 UTC and the third on January 11: 14:20 UTC.
Full access was provided to the remainder of customers by January 11, 2009: 05:00 UTC, when the second mailstore finished rebuilding. The third mailstore rebuild was a single disk rebuild which could run in the background without affecting mailbox access. Inbound mail began flowing for most of the affected mailboxes. We determined that inbound mail for 10% of customers would need to be queued so as not to impede the final mailstore rebuild. These customers were able to access their mailboxes and send mail. Inbound mail was delayed an additional 10 hours for these customers.
Inbound mail (ham) was delivered throughout the night, with queues flushing at January 11, 2009: 17:00 UTC. At this time, Cluster A was fully online.
Spam email delivery was enabled and continued for the remainder of the day.
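The timestamps in the summary above can be lined up to show the elapsed time to each milestone. The following is a minimal sketch in Python, not part of the original report. It assumes the start of the scheduled maintenance window (06:00 UTC) as the reference point, because the summary does not state the exact moment of the failure. The names `utc`, `MILESTONES` and `timeline` are illustrative.

```python
from datetime import datetime, timezone


def utc(day, hour, minute):
    """Build a UTC timestamp in January 2009."""
    return datetime(2009, 1, day, hour, minute, tzinfo=timezone.utc)


# Assumed reference point: start of the scheduled maintenance window.
WINDOW_START = utc(10, 6, 0)

# Milestones as reported in the incident summary (12:30 is approximate).
MILESTONES = [
    ("disk head reactivated, 50% of mailboxes restored", utc(10, 12, 30)),
    ("first mailstore rebuilt", utc(10, 16, 55)),
    ("second mailstore rebuilt", utc(11, 4, 5)),
    ("full mailbox access restored", utc(11, 5, 0)),
    ("third mailstore rebuilt", utc(11, 14, 20)),
    ("inbound queues flushed, Cluster A online", utc(11, 17, 0)),
]


def elapsed_hours(start, end):
    """Return the time between two timestamps in hours."""
    return (end - start).total_seconds() / 3600


def timeline(start=WINDOW_START, milestones=MILESTONES):
    """Return (label, hours since start) for each milestone."""
    return [(label, round(elapsed_hours(start, ts), 2)) for label, ts in milestones]


if __name__ == "__main__":
    for label, hours in timeline():
        print(f"{hours:6.2f} h  {label}")
```

Measuring from a different reference point, such as the end of the maintenance window, only requires passing another `start` value to `timeline`.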
Update: January 16, 2009
Dear Customers -
On behalf of all of the members of the OpenSRS team, please accept our sincere and deepest apologies for the service disruption on Cluster A this past weekend.
Many of you have asked, “How could we have let this happen again?” We initially were led to believe that we had a software problem. We have now determined that the string of service problems on Cluster A is related to a hardware problem inside one of our NetApp devices.
Below is a letter of explanation I received from Jeff Goldstein, General Manager at NetApp Canada.
We are not without fault in this situation. Network-attached storage is complex and we trusted our vendor to provide us with accurate advice related to our problems. In hindsight, we should have pressed earlier for replacement hardware.
Please rest assured that we are dedicated to providing a reliable email service and will be working tirelessly to restore your confidence in us. An incident report is available at OpenSRS Status.
Sincerely,
Elliot Noss,
President and CEO, Tucows
Dear Elliot Noss,
I am writing today regarding the recent outage that occurred this past weekend with Cluster A of the OpenSRS Email Service.
As you are aware, Cluster A of the OpenSRS Email Service has experienced a number of service degradations related to issues with our NetApp storage device. Our engineers here at NetApp worked closely with the technical operations and development teams at OpenSRS to trouble-shoot and resolve these issues. In each of the cases, we believed a software fault was the cause. The intermittent problem turned out to be due to the hardware shelf controller as well as firmware in one of our NetApp storage devices, which caused the issues on Cluster A.
We are deeply sorry for the inconvenience that resulted from these hardware and email service issues.
One of the promises we make to our customers is that our solutions provide highly available data management and in this case we let you down.
To begin to resolve this issue, we’re taking immediate action to replace the hardware and firmware in Cluster A at our expense. Our engineers will then test and evaluate the components involved to determine what specifically went wrong and apply those findings back into our own quality control teams.
Our two companies have been working together for the past nine years. We value our relationship and will work hard to restore your confidence in NetApp and our solutions.
Again, please accept our sincere apologies.
Regards,
Jeff Goldstein
Canadian General Manager
NetApp Canada
This update is related to Incident 4743
Email Cluster A is Degraded
All systems are now normal other than a small queue of inbound messages still to be delivered to about 10% of users. Once this is complete we will be returning Cluster A to "Online" status.
Note that delivery of queued spam messages to spam folders will continue for several hours.
We will now begin conducting a thorough root cause analysis with our storage vendor NetApp in order to provide you with an incident summary by end of business day Monday.
Once again, we apologize to you and your customers for the inconvenience.
This update is related to Incident 4743
Email Cluster A is Degraded
All customers have access to their mailboxes and can send/receive mail.
We continue to monitor the inbound mail queue which is largely cleared now. It is possible that some users may still receive older messages queued on sending servers rather than on our system. For approximately 10% of users we expect it will take an additional two to four hours for all their mail to be delivered.
After this mail has been delivered we will begin delivering queued spam messages.
Once again, we apologize to you and your customers for the inconvenience.
This update is related to Incident 4743
Email Cluster A is Degraded
All customers have access to their mailboxes and can send/receive mail.
We continue to monitor the inbound mail queue which is decreasing slightly faster than expected and is still on track to be cleared in the next few hours. For approximately 10% of users we expect it will take an additional four to six hours for all their mail to be delivered.
After this mail has been delivered we will begin delivering queued spam messages.
Once again, we apologize to you and your customers for the inconvenience.
This update is related to Incident 4743
Email Cluster A is Degraded
All customers have access to their mailboxes and can send/receive mail.
Mail that had been queued is now being delivered and we expect that most customers' mailboxes will be up-to-date in the next few hours. For approximately 10% of users we expect a slightly longer time before all mail has been delivered - possibly up to 8 hours.
Once mail has been delivered we will begin delivering queued spam messages.
We will be leaving the status as degraded until all inbound mail has been delivered and we are sure that the system is handling peak loads.
Once again, we apologize to you and your customers for the inconvenience.
This update is related to Incident 4743
Email Cluster A is Degraded
All customers have access to their mailboxes and can send/receive mail.
We have fully tested and restored the previously affected mailstores. No mail or data was lost.
Mail that we've been queuing is now being delivered. With just under one million messages to deliver this process will take several more hours. Spam messages will be delivered to spam folders after the main queue has been cleared.
While services have been fully restored, we will be leaving the status as degraded until all inbound mail has been delivered and we are sure that the system is handling prime time load (as Europe wakes up).
Our technical team will be conducting a thorough root cause analysis with our storage vendor NetApp. We will provide you with an incident summary by end of business day Monday.
This update is related to Incident 4743
Email Cluster A is Degraded
We are currently on track to complete the rebuild of the mailstores by 11:00 P.M. EST. We expect that we will be able to restore the services soon after it is complete.
Once access has been restored we will begin delivering queued inbound mail. We expect that all mail will be delivered overnight.
At this time, 50% of customer mailboxes have full access and are able to send/receive mail.
We are scheduling our next update in conjunction with the rebuild schedule, so we will provide a more detailed update at approximately 11:00 P.M. EST.
This update is related to Incident 4743
Email Cluster A is Degraded
Rebuilding of affected mailstores is continuing. We've adjusted our estimate for completion of the rebuild for those remaining mailstores to 11 P.M. EST. Once the rebuild is complete, we should be able to bring affected users back online fully.
There are no other changes to report. To summarize:
- 50% of mailboxes on Cluster A are online and able to send and receive email normally.
- The other 50% of mailboxes are offline and users have no access to either send or receive email.
- No mail has been lost, and all mail to affected users is being queued and will be delivered once full service is restored.
- Forward-only and filter-only accounts are functional and mail is being delivered to those users. Some filter-only users are unable to log into the Spam Quarantine.
We will keep you updated. The next update will be provided at approximately 8:30 P.M. EST.
This update is related to Incident 4743
Email Cluster A is Degraded
Rebuilding of affected mailstores is continuing. The current estimate for completion of the rebuild for those remaining mailstores remains 10 P.M. EST.
There are no other changes to report. To summarize:
- 50% of mailboxes on Cluster A are online and able to send and receive email normally.
- The other 50% of mailboxes are offline and users have no access to either send or receive email.
- No mail has been lost, and all mail to affected users is being queued and will be delivered once full service is restored.
- Forward-only and filter-only accounts are functional and mail is being delivered to those users. Some filter-only users are unable to log into the Spam Quarantine.
We will keep you updated. The next update will be provided at approximately 6:30 P.M. EST.
This update is related to Incident 4743
Email Cluster A is Degraded
Currently, 50% of customer mailboxes on this cluster are fully available and those customers are able to send and receive email normally. Forward-only and filter-only accounts are also functioning and mail is being delivered. However, some filter-only users will be unable to log into the Spam Quarantine.
The remaining 50% of customer mailboxes are offline and those users have no access to their mailboxes. Those customers are unable to send or receive mail at this time. Inbound mail is being queued for these customers and will be delivered once service is restored. Affected users logging into webmail may see a "Service Unavailable" error message. Users with email clients will not be able to send or receive new mail. They will receive timeout errors.
As anticipated, upon completion of the rebuild of the first affected mailstore, it was determined that we should not restore full service to those users. Restoring service for those users while the other mailstores were being rebuilt would have had too great an impact on the service overall.
Rebuilding of the other affected mailstores is continuing. The current estimate for completion of the rebuild for those remaining mailstores remains 10 P.M. EST.
We will keep you updated. The next update will be provided at approximately 4:30 P.M. EST.
Additionally, we have some more detailed information for you on the nature of the fault that led to this incident:
Q: What happened to cause this degradation?
A: During our scheduled maintenance, the firmware upgrade of our NetApp caused a failure of a controlling disk head in the storage pool.
Q: Why didn't you just roll back the firmware upgrade?
A: During the firmware upgrade, a number of disks became marked as 'bad', triggering a RAID-level rebuild by the system. Once this rebuild is triggered, it must complete. Restoring the previous firmware would do nothing to change this situation. Exactly why the firmware update triggered this rebuild is not known. We are working with NetApp to determine a root cause. We have performed many dozens of firmware upgrades to this type of module in many other filers of the same model in the past, and have never experienced a similar result.
Q: Is this related to the Cluster 'A' service interruptions you experienced last year?
A: We were upgrading the firmware on the NetApp storage devices to address issues related to last year's service interruptions. We tested the firmware upgrade and had previously upgraded firmware like this with no issue. We are working with NetApp to investigate why both disk heads reacted to cause a rebuild. We will provide you with a full incident summary when we have those answers.
This update is related to Incident 4743

