Monday, December 21, 2009

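Rackspace Hosting's cloud service went down for about 40 minutes because of a system failure. Even if reducing the frequency of outages is difficult, customer response is something that can keep getting better.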

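Going forward, providers may be required to report the cause of an outage and how it was handled, and to explain to customers the improvements that will keep it from recurring. Since this is a service business, I would like to see the industry standardize, at least to some degree, what counts as a necessary and sufficient explanation to customers. I also think the low standard in this area is connected to the results of the IDC market survey mentioned earlier.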


Rackspace Outage Has Limited Impact

Rackspace experienced an outage yesterday--a recurring issue this year for the hosted data center provider--which took down a number of high-profile sites including the popular blog site TechCrunch. No network is impervious to outages, but a company like Rackspace needs to provide consistent and reliable service.

Aside from TechCrunch, a number of other services and blog sites were impacted by the Rackspace outage, including 37signals, Brizzly, Robert Scoble's blog, sites hosted by Laughing Squid, Tumblr, and Mashable.

The Rackspace blog describes the root cause: "The issues resulted from a problem with a router used for peering and backbone connectivity located outside the data center at a peering facility, which handles approximately 20% of Rackspace's Dallas traffic."

The blog post goes on to explain that the router configuration error was part of final testing for data center integration between the Chicago and Dallas facilities, and that it should not have impacted operation during normal business hours. "The network integration of the facilities was scheduled to take place during the monthly maintenance window outside normal business hours, and today's incident occurred during final preparations."

The outage left many Rackspace customers saying "Hey! Who turned off the cloud?"

While a data center outage that impacts popular and well-known sites is a black eye for cloud computing in general, the scope of the impact from this outage was relatively small. As this blog points out, "Rackspace is small potatoes. Now it's a fast growing bag of potatoes, but still dinky. And the other catch: Rackspace is more about hosting than the cloud."

For customers that rely on Rackspace to host their servers--especially Web servers--it may seem very much as if the Internet went down when the Rackspace data center was unavailable. However, cloud computing services like Amazon EC2 and Microsoft Azure, and Internet keystones like Google and Amazon, were not impacted at all by the Rackspace outage.

Mistakes happen, but customers of Rackspace have a right to question the repeated outages and service interruptions. At least one Rackspace customer is also upset about a related issue pertaining to customer notification of network issues like this outage.

The customer's hosted servers were affected by the Rackspace outage, and the customer found out from its own customers' complaints that its site had been unavailable for two hours. In a comment, the customer stated, "We also pay Rackspace extra for a constant monitoring service that is supposed to immediately notify me by email or phone call if our server becomes inaccessible at any time. I was HIGHLY disturbed to find out that Rackspace actually SUPPRESSED these notifications from being sent to their customers for some strange reason."

The comment offers no evidence to support the claim that Rackspace intentionally withheld notification, and I have not had any feedback from Rackspace to confirm or deny the accusation. If it turns out to be true, it would damage Rackspace's credibility and customer service reputation.

The bottom line, though, is that Rackspace determined the cause of the problem and fixed it relatively quickly, and it provided status updates on the blog to keep customers informed. Even brief outages seem devastating to those affected by them, but they will happen, and when they do this is pretty much how you want them handled.

Tony Bradley tweets as @PCSecurityNews, and can be contacted at his Facebook page.

http://www.pcworld.com/businesscenter/article/185171/rackspace_outage_has_limited_impact.html