Tuesday, January 26, 2010

Reuven Cohen on Amazon's oversubscription practice: parallels with the airline and hotel industries

A number of recent blog posts have reported response-time problems with Amazon Web Services. According to a report from Cloudkick, a service that measures cloud performance, ping latency to AWS had averaged 50 ms but has been exceeding 1000 ms since Christmas.
 
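Around that time, suspicion began to surface in various quarters that Amazon was oversubscribing (signing up more customers than its capacity can serve). The issue was first raised on Alan Williamson's blog, and after several follow-up articles the discussion has reached Reuven Cohen's blog post below.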
 
Amazon was quick to rebut the claim, commenting that it always works with users who report problems to provide a fix.
 
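The article below argues that Amazon EC2 routinely oversubscribes, while noting that the practice is nothing unusual and is standard in the airline and hotel industries.
It also points out that oversubscribing only pays off when it is backed by effective estimation techniques, and speculates that in Amazon's rapidly growing market those estimates are failing to keep up, which may be behind the recent performance problems.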
 
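When you sign up for Amazon EC2 instances, your allotment is sometimes capped by a group of servers called a quota (around 20 instances). This too is a very important technique for estimating how many servers each user will consume.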

Oversubscribing the Cloud

There's been a bit of a debate raging over whether or not Amazon EC2 has been oversubscribed and is suffering from performance problems because of it. The discussion started when Alan Williamson wrote a blog post on Tuesday that said he was experiencing growing performance problems while running a large EC2 deployment for one of his customers. The post accused Amazon of oversubscribing their environment, which meant he needed to buy larger instances to maintain the same level of performance, in turn increasing his client's costs.

The debate goes to the heart of the complexities involved in trying to deploy cost-effective, revenue-generating, public-use infrastructure-as-a-service platforms. I've been saying this for a while -- one of the hardest parts of creating a public cloud service is estimating your customers' demand while trying to remain competitive, which really means having prices that are on par with or better than Amazon EC2.

Amazon was quick to respond saying "We do not have over-capacity issues. -- When customers report a problem they are having, we take it very seriously. Sometimes this means working with customers to tweak their configurations or it could mean making modifications in our services to assure maximum performance."

The problem with Amazon's vague response is that it does very little to address a potentially major issue. In a sense they're saying we'll help you (if you're big enough) while providing no real insight into how their cloud is built, deployed or run. They do imply there are issues, but not ones relating to over-capacity: it's the fault of how their customers are deploying on EC2, not how their cloud itself is deployed or run. On one hand Amazon has stated they don't have "over-capacity issues", but on the other hand they are far from saying that they don't oversubscribe their environment. Let's be realistic, how else do you expect Amazon to achieve their ridiculously low price points? The very fact they can offer EC2 at such a low cost is to me indirect proof they do oversubscribe their environment. And hell, why not oversubscribe? In fact I'll go as far as to say that it is a good thing.

Amazon isn't alone in using oversubscribing or overbooking techniques for their service. The concept is common within a variety of industries where multiple users share a common resource. These resources can range from hotel rooms, to airline seats, to more technical commodities such as bandwidth, storage, shared servers or even energy. The oversubscription model is dependent on the ratio of the allocated commodity, which in turn is estimated on a per user / usage basis. The key is to have a well defined model which accounts for a standard deviation (or how much variation there is from the "average" usage). This typically guarantees the quality of a service for a particular user. Underlying the oversubscription model is the fact that statistically few users will attempt to utilize their full allotment of resources simultaneously. This allows you to offer more resources than you actually have available. The concept applies well to public cloud infrastructure environments, and probably is the most important aspect of any competitive pricing model.
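To make the "average plus standard deviation" idea concrete, here is a minimal Python sketch of how a provider might size an oversubscribed pool. It is an illustration only, not Amazon's or anyone's actual model: it assumes customers use instances independently of one another and that total demand is roughly normally distributed, and the function names and all numbers (`mean_use`, `std_use`, a capacity of 1000, a 0.1% overflow target) are hypothetical.

```python
import math
from statistics import NormalDist


def overflow_probability(customers, mean_use, std_use, capacity):
    """Probability that total demand exceeds physical capacity.

    Assumption: customers are independent, so means add up and the
    standard deviation grows with the square root of the customer count.
    """
    if customers < 1 or std_use <= 0:
        raise ValueError("need at least one customer and a positive std_use")
    total_mean = customers * mean_use
    total_std = math.sqrt(customers) * std_use
    return 1.0 - NormalDist(total_mean, total_std).cdf(capacity)


def max_customers(mean_use, std_use, capacity, max_overflow=0.001):
    """Largest customer count whose overflow probability stays within target."""
    if mean_use <= 0:
        raise ValueError("mean_use must be positive")
    n = 0
    # Overflow probability rises with every added customer, so stop at the
    # first count that breaks the target.
    while overflow_probability(n + 1, mean_use, std_use, capacity) <= max_overflow:
        n += 1
    return n


if __name__ == "__main__":
    quota = 20        # instances each customer is allowed to launch
    capacity = 1000   # instances the provider can physically run
    n = max_customers(mean_use=4, std_use=3, capacity=capacity)
    print(f"customers: {n}, allotted: {n * quota}, physical: {capacity}")
```

Under these made-up figures the provider can promise several times more instances than it physically has, yet almost never run out. The independence assumption is the weak point: if many customers spike at the same moment, actual overflow will be far more frequent than the model predicts.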

But there are problems with the oversubscription model. The problem occurs because there seems to be a non-linear relationship between the amount of capacity and the amount of customer demand you have. Or to put it another way, just adding more servers as customer demand increases doesn't automatically guarantee the same level of service across your cloud deployment, something Amazon's recent dramatic growth & performance issues seem to prove.

This brings us to the concept of quotas. Have you ever wondered why when you sign up for an "unlimited" cloud infrastructure service such as EC2, you are given an initial allotment of servers? For EC2 it's something like 20 instances. The reason is simple: the hardest part of an oversubscription model is capacity planning. That is, a quota system is an extremely important aspect of any cloud capacity / resource planning you will be doing when launching and running your own public cloud service.

As an example, for the Enomaly ECP our quota system was developed to provide a predetermined level of deviation across a real or hypothetical pool of customers. Yes, it was developed to allow our hosting / cloud service provider customers to oversubscribe their environments. But it also allows for a variety of pricing & costing schemes to be implemented, including models such as tiers of usage, quality of service tiers, and even the ability to provide additional quota increases for "good behavior", like when you receive an automatic increase to the credit limit on your credit card. Without this type of quota functionality, it is practically impossible to adequately run a revenue-positive public cloud service.
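As a rough illustration of the quota idea described above, the following Python sketch caps each account at an initial allotment and lets the provider raise it later. This is not the Enomaly ECP API or EC2's actual mechanism: the `Account` class and its methods are hypothetical names, and the default of 20 simply mirrors the EC2 figure mentioned earlier.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical customer account with a per-account instance quota."""
    quota: int = 20    # initial allotment (assumed; mirrors the EC2 figure)
    running: int = 0   # instances currently in use

    def launch(self, count: int) -> bool:
        """Start `count` instances unless that would exceed the quota."""
        if count <= 0:
            raise ValueError("count must be positive")
        if self.running + count > self.quota:
            return False  # request denied: the quota bounds worst-case demand
        self.running += count
        return True

    def terminate(self, count: int) -> None:
        """Stop up to `count` running instances."""
        self.running = max(0, self.running - count)

    def raise_quota(self, extra: int) -> None:
        """Grant a quota increase, e.g. as a reward for 'good behavior'."""
        if extra <= 0:
            raise ValueError("extra must be positive")
        self.quota += extra


if __name__ == "__main__":
    acct = Account()
    print(acct.launch(15))   # within the initial allotment
    print(acct.launch(10))   # would exceed the quota, so it is refused
    acct.raise_quota(10)
    print(acct.launch(10))   # fits after the increase
```

Because each account's maximum demand is bounded by its quota, the provider knows the worst case it has to plan for, and can price tiers or grant increases account by account.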

So the real question we need to ask Amazon is -- are their oversubscription models keeping up with the growth and scale of the underlying platform? Prove it.
