Thursday, February 26, 2009

The Case Against Cloud Computing, Part Three

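4) Lack of SLAs (Service Level Agreements)

The author points out that it is a mistake to read an SLA as a contract that guarantees the uptime of a cloud computing service.
The essential purpose of an SLA, the author argues, is to spell out how the service provider will compensate for losses when a failure occurs, and one needs to recognize that the amount basically never exceeds the service fee.

The following steps are said to be important:
  • Cloud computing uptime is a major issue, but one should also assess what it costs (personnel costs in particular) to maintain the same level of uptime in the in-house IT environment, and evaluate the cost-effectiveness of cloud computing against that figure.
  • It is important to sort in-house applications clearly into those that are mission critical and those that are not, and to select the ones that can be moved to cloud computing. Applications that are not mission critical yet carry very high operating costs deserve particular attention.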
 
 

In parts 1 and 2 of this series, I discussed two common objections to cloud computing: difficulty of application migration and heightened risk. In this posting, I want to address another common objection to cloud computing, the one that has to do with service-level agreements. I call it:

SLA: MIA

One of the most common concerns regarding cloud computing is the potential for downtime, that is, time the system isn't available for use. This is a critical issue for line-of-business apps, since every minute of downtime is a minute that some important business function can't be performed. Key business apps include those for taking orders, interacting with customers, managing work processes, and so on. Certainly ERP systems would fall into this category, as would vertical applications for many industries; for example, computerized machining software for a manufacturing firm, or software monitoring sensors in industries like oil and gas, power plants, and so on.

Faced with the importance of application availability, many respond to the potential use of cloud-based applications with caution or even horror. This concern is further exacerbated by the fact that some cloud providers don't offer SLAs and some offer inadequate SLAs (in terms of guaranteed uptime).

Underlying all of these expressed concerns is the suspicion that one cannot really trust cloud providers to keep systems up and running; one might almost call it a limbic apprehension at depending upon an outside organization for key business continuity. And, to be fair, cloud providers have suffered outages. Salesforce endured several in recent years, and Amazon also has had one or two not so long ago.

Put this way, it's understandable that organizations might describe the concern regarding this all-important meeting of critical business systems with cloud provider reliability as an SLA issue.

Is that the best way to comprehend the issue, or even to characterize it, though?

If one looks at the use of SLAs in other contexts, they are sometimes part of commitments within companies—when, say, the marketing department has IT implement a new system, IT guarantees a certain level of availability. More commonly, though, SLAs are part of outsource agreements, where a company selects an external provider like EDS to operate its IT systems.

And certainly, there's lots of attention on SLAs in that arena. A Google search on "outsource SLA" turns up pages of "best practices," institutes ready to assist in drafting contracts containing SLAs, and advice articles on the key principles of SLAs: a panoply of assistance in creating air-tight SLA requirements. A Google search for "outsource SLA success," unfortunately, turns up nary a link. So one might assume that an SLA doesn't necessarily assist in obtaining high-quality uptime, but rather provides the basis for conflict negotiation when things don't go well, something like a pre-nuptial agreement.

So if the purpose of an SLA is more to provide after-the-fact conflict resolution guidelines, the implication is that many of the situations "covered" by SLAs don't go very well; in other words, after all the best practices seminars, all the narrow-eyed negotiation (BTW, doesn't it seem incredibly wasteful that these things are negotiated on a one-off basis for every contract?), and all the electrons sacrificed in articles about SLAs, they don't accomplish that much regarding the fundamental requirement: system availability. Why could that be?

First, the obvious problem I've just alluded to: the presence of an SLA doesn't necessarily change actual operations; it just provides a vehicle to argue over. The point is system uptime, not having a contract point to allow lawyers to fulfill their destiny.

Second, SLAs, in the end, don't typically protect organizations from what they're designed to: loss from system downtime. SLAs are usually limited to the cost of the hosting service itself, not the opportunity cost of the outage (i.e., the amount of money the user company lost or didn't make). So besides being ineffective, SLAs don't really have any teeth when it comes to financial penalty for the provider. I'll admit that for internal SLAs, the penalty might be job loss for the responsible managers, which is pretty emotionally significant, but the SLA definitely doesn't result in making the damaged party whole. After all, having the IT department pay the marketing department is just transferring money from one pocket to another.
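
To make that gap concrete, here is a minimal Python sketch. Every figure in it (the monthly fee, the credit percentage, the outage length, the revenue per hour) is a made-up assumption for illustration, not a number from any real provider's SLA, and the function names are hypothetical.

```python
def sla_credit(monthly_fee, credit_pct):
    """Credit owed under the SLA: a percentage of the fee, capped at the fee itself."""
    return min(monthly_fee, monthly_fee * credit_pct / 100)


def outage_loss(hours_down, revenue_per_hour):
    """Opportunity cost of the outage: business the customer lost or didn't make."""
    return hours_down * revenue_per_hour


# Hypothetical inputs, chosen only to illustrate the shape of the problem.
monthly_fee = 2_000        # what the customer pays for the service
credit_pct = 10            # service credit the SLA grants for a missed target
hours_down = 4             # length of the outage
revenue_per_hour = 5_000   # business lost per hour of downtime

credit = sla_credit(monthly_fee, credit_pct)
loss = outage_loss(hours_down, revenue_per_hour)
uncovered = loss - credit

print(f"SLA credit:      {credit:,.0f}")
print(f"Business loss:   {loss:,.0f}")
print(f"Left uncovered:  {uncovered:,.0f}")
```

Because the credit is capped at the fee, even a 100 percent refund in this sketch would recover only a tenth of the assumed loss.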

Finally, the presence of an SLA gives the providing organization an incentive to behave in a way that meets the letter of the agreement, but may not meet the real needs of the user; moreover, the harder the negotiating went, the more likely the provider is to "work to rule," meaning fulfill the bare requirements of the agreement rather than solve the problem. There's nothing more irritating than coming to an outside service provider with a real need and having it dismissed as outside the scope of the agreement. Grrrr!

Given these challenges of SLAs (if not outright shortcomings), does their absence or questionable quality among cloud computing providers mean nothing?

No.

However, one should keep the service levels of cloud computing in perspective, with or without an SLA in place.

Remember, the objective is service availability, not a contractual commitment that is only loosely tied to the objective. So here are some recommendations:

First, look for an SLA, but remember it's a target to be aimed for, not an ultimatum that magically makes everything work. And keep in mind that cloud providers, just like all outsourcers, write their SLAs to minimize their financial exposure by limiting payment to the cost of the lost service, not the financial effect of the lost service.

Second, use an appropriate comparison yardstick. The issue isn't what cloud providers will put in writing; it's how a cloud provider stacks up against the available alternatives. If you're using an outsourcer that consistently fails to meet its uptime commitments, surely it makes sense to try something new? And if the comparison is the external cloud provider versus your internal IT group, the same evaluation makes sense.

Third, remember that the quality of internal uptime is directly related to the sophistication of the IT organization. While large organizations can afford significant IT staffs and sophisticated data centers, much of the world limps by with underfunded data centers, poor automation, and shorthanded operations staffs. They run from emergency to emergency, and uptime is haphazard. For these kinds of organizations, a cloud provider may be a significant improvement in quality of service.

Fourth, even if you're satisfied with the quality of your current uptime, examine what it costs you to achieve it. If you're using lots of manual intervention, people on call, staffing around the clock, you may be meeting uptime requirements very expensively. A comparison of uptime and cost between the cloud and internal efforts (or outsourced services) may be instructive. I spoke to a fellow from Google operations who noted that at the scale it operates, manual management is unthinkable; nothing goes into production until it's been fully automated. If you're getting uptime the old-fashioned way—plenty of elbow grease—it may be far better, economically speaking, to consider using the cloud.
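
One rough way to frame that comparison is to add the expected cost of downtime to what each option costs to run. The Python sketch below does this under stated assumptions: the availability figures, operating costs, and cost per hour of downtime are all invented for illustration, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availability_pct):
    """Hours per year a service is expected to be down at a given availability."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


def total_annual_cost(operating_cost, availability_pct, cost_per_down_hour):
    """Operating cost plus the expected business cost of downtime."""
    downtime_cost = expected_downtime_hours(availability_pct) * cost_per_down_hour
    return operating_cost + downtime_cost


# Hypothetical inputs: an internal setup that buys high uptime with heavy
# staffing, versus a cloud option with lower uptime and lower running cost.
cost_per_down_hour = 1_000
internal = total_annual_cost(600_000, 99.9, cost_per_down_hour)
cloud = total_annual_cost(150_000, 99.5, cost_per_down_hour)

print(f"Internal, all-in: {internal:,.0f}")
print(f"Cloud, all-in:    {cloud:,.0f}")
```

With these particular assumptions the cloud option comes out cheaper despite its lower availability. The outcome flips for an application whose cost per hour of downtime is high enough, which is why the inputs need to be your own.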

Fifth, and a corollary to the last point, even if there are some apps that absolutely, positively have to be managed locally due to stringent uptime requirements, recognize that this does not cover the entirety of your application portfolio. Many applications do not impose such strict uptime requirements; managing them in the same management framework and carrying the same operations costs as the mission-critical apps is financially irresponsible. Examine your application portfolio, both current and future, and sort them according to hard uptime requirements. Evaluate whether some could be migrated to a lower-cost environment whose likely uptime capability will be acceptable—and then track your experience with those apps to get a feel for real-world outcomes regarding cloud uptime. That will give you the data to determine whether more critical apps can be moved to the cloud as well.
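
As an illustration of that sorting exercise, here is a small Python sketch. The application names, uptime requirements, operating costs, and the assumed cloud uptime are all hypothetical; ranking candidates by operating cost is one possible design choice, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    required_uptime_pct: float  # the hard uptime requirement for this app
    annual_ops_cost: int        # what it currently costs to operate


def cloud_candidates(apps, cloud_uptime_pct):
    """Apps whose uptime requirement the cloud is likely to meet, costliest first."""
    eligible = [a for a in apps if a.required_uptime_pct <= cloud_uptime_pct]
    return sorted(eligible, key=lambda a: a.annual_ops_cost, reverse=True)


# Hypothetical portfolio, for illustration only.
apps = [
    App("order entry", 99.99, 400_000),
    App("expense reports", 99.0, 250_000),
    App("internal wiki", 98.0, 40_000),
    App("plant sensor monitoring", 99.95, 300_000),
]

# 99.5 is an assumed figure for the cloud's likely uptime.
candidates = cloud_candidates(apps, cloud_uptime_pct=99.5)
for app in candidates:
    print(f"{app.name}: needs {app.required_uptime_pct}%, costs {app.annual_ops_cost:,}")
```

Listing the costliest eligible apps first puts the ones that are expensive to run but not mission-critical at the top, since those are where a move is most likely to pay off.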

In a sense, the last recommendation is similar to the one in the "Risk" posting in this series. One of the recommendations in that posting is to evaluate your application portfolio according to its risk profile to identify those applications that can safely be migrated to a cloud infrastructure. This uptime assessment is another evaluation criterion to be applied in an assessment framework.

So "cloud SLA" is not an oxymoron; neither is it a reason to avoid experimenting and evaluating how cloud computing can help you perform IT operations more effectively.