Wednesday, October 14, 2009

A rebuttal from the cloud industry regarding the Microsoft/Danger incident: this was not a cloud problem! The cause was forgetting the basics of IT

The Microsoft/Danger/T-Mobile Sidekick Fiasco is NOT a Failure of Cloud Computing!

Over the past few days, I have seen a lot of articles, tweets and commentary claiming that the recent loss of T-Mobile Sidekick data within Danger (which Microsoft purchased about a year ago) was "the Cloud's fault," and this really bothered me. As Microsoft is poised to do something with the Danger brand ("Project Pink") and is also about to release its Cloud Computing platform, Azure, this could not have come at a worse time for the company. There is obviously a lot of attention being paid to the cell phone marketplace as the Android platform tries to position itself to dethrone Apple's iPhone. The Danger (now Microsoft) Sidekick was a device that provided great functionality "back in the day" (I actually went through quite a few generations of Sidekicks, from the B&W version up to a few color ones a few years ago). The Sidekick has a tiny market share, and its user demographic skews much younger (e.g., teens) than the iPhone/Android/Blackberry crowd.

Last week, the Danger data network started experiencing degraded service, and users were unable to access their data. A quick side note about the Sidekick: unlike other data-centric cell phones, the Sidekick stores all of its data (contacts, appointments, pictures, etc.) in a network datastore rather than on the device itself. Most users rely solely on this service and don't back up their data to a local computer. Other "smarter" phones like the Blackberry and iPhone rely on synchronization with a physical computer or an Exchange server to reliably back up their data. In my opinion, this is where the failure of the Sidekick started: a single, remote source of data and nothing else.

Details on the data issue are still being revealed (recently there has been talk of "dogfooding" or even "sabotage," with Microsoft supposedly wanting to replace the existing technology with its own; I will let the conspiracy-theory experts battle that one out), but my understanding is that Microsoft wanted to upgrade the SAN (Storage Area Network) that powered the Sidekick data network and contracted with Hitachi to get the job done. Unfortunately, for reasons unknown, no backup of the data was performed prior to the upgrade attempt (Failure #2). The SAN upgrade proceeded without a backup in place, the data was "destroyed," and thousands of Sidekick users were left stuck without their data. As of this writing, some users have actually been able to recover data (e.g., if they didn't power off their device, or if they did a "reverse sync" from their Sidekick back to the Danger servers; I don't have details on this, so please don't try anything without doing some research first).

This brings me back to the title of this post: this fiasco is NOT a failure of Cloud Computing; it is simply a failure to follow standard IT practices, ones that even an average computer user knows. Back up your data, your servers and your infrastructure regularly, and store those backups securely in different locations.

It is somewhat understandable (and unfortunate) that mainstream media and even the tech community jump so quickly to the conclusion that the Cloud is at fault here. Cloud Computing is relatively new, and as with any new technology or service, people are looking for any and all holes in it. The same could be said about the launch of eCommerce back in the mid-1990s. There were failures, fraud and other issues associated with it, and the naysayers were quick to point out only the negatives of the movement. Today, people use eCommerce for everything and could not live without it (there are still issues with fraud and security, but the technology has evolved and stabilized). Cloud Computing is now going through a similar hype cycle, and we are in the phase where many are adopting and using it wholeheartedly while others lie in wait, hoping for some sort of failure they can use to point out its disadvantages.

With the recent Gmail failures, users were quick to blame the Cloud. Gmail is a great example of a SaaS application (which many, including me, call a "Cloud Application"). However, Gmail has been around longer than the term "Cloud Computing," so have we simply compartmentalized it into a Cloud Application category? It is not a huge issue if we have. What DOES bother me, however, is when a failure happens there and people simply say, "oh, it's the Cloud's fault." Sorry, but what would we have said if a similar failure had happened 4 years ago? "Oh, it's a failure of SaaS" and "SaaS is evil"?

Let's face it: hardware fails. It is not bulletproof. Power outages happen. Generators don't turn on. Code has bugs (after all, to err is human, and hardware and software are created by humans). We know these things to be facts, and there is no way to avoid them entirely. What you CAN do is work to minimize your risk, downtime and disruptions by following standard IT practices. And you can even use the Cloud as a means to help you achieve reliability, stability and uptime.

Here is a list of 5 things you might want to consider as part of your IT "Best Practices":

  1. Backups – Back up often. Set up automatic as well as manual backup procedures. Store your data locally AND somewhere completely geographically distinct from your infrastructure. On GoGrid, for example, you have persistent storage within your VM; that can be one place for a backup to reside. But also use Cloud Storage (which is a redundant, SAN-like device) to store other backups. Lastly, and I always recommend this to any GoGrid users (or anyone who has a website or web application, for that matter), have a 3rd-party backup solution. There are many out there; some are free and some cost money. Remember that you really do need to pay a premium in order to ensure reliability and dependability. (A small backup sketch follows this list.)
  2. Redundancy – Physical servers AND virtualized servers encounter issues. You would never put all of your eggs in one basket, so why do it with your infrastructure? You should set up a "high availability" (HA) infrastructure where you have 2 (or more) of everything, whether they are all active or serve as hot or warm standbys. That way, if something fails, you can use the other hardware (virtualized or physical) to minimize disruption (see the health-check sketch after this list). I participated in the writing/testing of the article "How to Set Up a Load Balanced and Redundant LAMP Web Application on GoGrid" on our GoGrid wiki, which is a great starting point for setting up an HA environment.
  3. Failovers – Unfortunately, having a failover environment comes at a premium as well. Most people, unless they are hugely successful, decide to put off setting up a Disaster Recovery (DR) environment due to the cost and the time it takes to do so. That is, until their primary site goes down for hours or days (hey T-Mobile/Microsoft…what happened here?); then DR suddenly moves to the top of the list. GoGrid, for example, has partnered with Stratonomic to provide DR solutions that won't break the bank and that give you some definite peace of mind.
  4. "Hybrid Hosting" – One of the unique and market-first offerings that GoGrid provides is "Cloud Connect" which is the ability to connect cloud front-end infrastructure with physical back-end infrastructure. The advantage of this might not be immediately apparent to many users, however, as a "best practice" sometimes you need to think "outside the box" whether that box is physical or virtual. By setting up your front-end environment using the cloud (scalable, dynamic, elastic, etc.), you can optimize your web server environment for traffic and redundancy. Using physical boxes in the backend allows you to have additional services (like managed backups or security enhancements), thus making your infrastructure more secure and reliable.
  5. Due Diligence – Regardless of your infrastructure, datacenter or hosting environment, take some time right now to figure out your IT strategy and Best Practices. Have you covered the 4 points listed above? The Cloud might not be the end-all solution for you but it does provide an alternative to traditional methodologies and practices. If you are only in the Cloud, think about putting some of your data or services on physical hardware or backing it up to a remote location. If you are using purely physical servers, you might want to think of using the Cloud as a failover or secondary site. One way or another, look to diversify your infrastructure.
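To make point #1 above a bit more concrete, here is a minimal backup sketch in Python. The directory paths, the offsite host and the use of rsync over SSH are all assumptions for illustration; any offsite copy mechanism (a cloud storage client, scp, etc.) would serve the same purpose. The point is simply that a local archive plus a geographically separate copy should be produced automatically, on a schedule.

```python
#!/usr/bin/env python3
"""Nightly backup sketch: archive a data directory, keep a local copy,
and push a second copy to a geographically separate location.
All paths and hosts below are placeholders, not real infrastructure."""
import datetime
import os
import subprocess
import tarfile

DATA_DIR = "/var/www/app-data"           # what we want to protect (assumed path)
LOCAL_BACKUP_DIR = "/backups/local"      # copy #1: on-box persistent storage
OFFSITE_TARGET = "backup@dr-site.example.com:/backups/"  # copy #2: offsite host (placeholder)


def make_archive():
    """Create a timestamped .tar.gz of the data directory in the local backup dir."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    os.makedirs(LOCAL_BACKUP_DIR, exist_ok=True)
    archive_path = os.path.join(LOCAL_BACKUP_DIR, f"app-data-{stamp}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(DATA_DIR, arcname="app-data")
    return archive_path


def ship_offsite(archive_path):
    """Copy the archive to a machine in a different facility via rsync over SSH."""
    subprocess.check_call(["rsync", "-az", archive_path, OFFSITE_TARGET])


if __name__ == "__main__":
    path = make_archive()
    ship_offsite(path)
    print(f"Backed up {path} locally and offsite")
```

Run something like this from cron (or any scheduler) so the backup happens without anyone having to remember it; the manual procedure is then only a fallback.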
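And for point #2, a minimal health-check sketch that probes each node of a redundant web tier. The backend URLs are placeholders, and the "remove from rotation" step is left as a comment because the right action depends on your load balancer; the idea is simply that redundancy only pays off if something is watching for failures and reacting to them.

```python
#!/usr/bin/env python3
"""Probe each backend of a redundant web tier and report nodes that
should be pulled from the load balancer. Hostnames are placeholders."""
import urllib.error
import urllib.request

BACKENDS = [
    "http://web1.internal/healthz",  # assumed health-check URL on node 1
    "http://web2.internal/healthz",  # assumed health-check URL on node 2
]


def is_healthy(url, timeout=3):
    """Return True if the backend answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    for url in BACKENDS:
        if is_healthy(url):
            print(f"ok: {url}")
        else:
            # In a real HA setup this would call the load balancer's API
            # or page an operator; here we only report the failure.
            print(f"UNHEALTHY: {url} -- remove from rotation / fail over")
```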

Failures will happen, Cloud or not. My hope, however, is that before people start blaming Cloud Computing for issues that are obviously NOT related to the Cloud, they sit back and think about what they are saying. It's easy nowadays to find a scapegoat and blame something that is relatively new. In my opinion, by not fully understanding the complexity of the issues and jumping straight to conclusions, users are actually doing themselves a disservice and come off sounding inexperienced, or as lacking full knowledge of the situation or even the entire picture. The Cloud is not a crutch, nor a panacea for all things IT. It is, however, a viable option and strategy that IT professionals should seriously consider when evaluating their offerings and re-architecting their solutions for resiliency.