Tuesday, March 10, 2009

How Amazon builds the world's most scalable storage

An overview of S3, Amazon's storage service, and how it handles a wide range of failures.

The cloud storage market is accelerating fast - despite naysayers and alarmists - and Amazon's S3 is leading the charge. Storing over 40 billion files for 400,000 customers, Amazon is the one to beat. How do they do it for pennies per GB a month? Read on.

I attended FAST '09, the best storage conference around, where Alyssa Henry, S3's GM, gave a keynote. Amazon doesn't talk much about how their technology works, so even the little Alyssa added was welcome.

Aggressive goals
Running a multi-billion dollar business and one of the world's largest websites, Amazon's engineers understand the problem. Their goals reflect both technical and market requirements:

  • Durable
  • 99.99% availability
  • Support an unlimited number of web scale apps
  • Use scale as an advantage - linear scalability
  • Vendors won't engineer for the 1% - only the 80% - DIY
  • Secure
  • Fast
  • Straightforward APIs
  • Few concepts to learn
  • AWS handles partitioning - not customers
  • Cost-effective

One key: Amazon writes the software and builds massive scale on commodity boxes. Reliability at low cost is achieved through engineering, experience and scale.

With many components come many failures
10,000+ node clusters mean failures happen frequently - even unlikely events happen.

  • Disk drives fail
  • Power and cooling failure
  • Corrupted packets
  • Techs pull live fiber
  • Bits rot
  • Natural disasters

Amazon deals with failure with a few basic techniques:

-Redundancy
Increases durability, availability, cost, and complexity. Example: plan for the catastrophic loss of an entire data center; store data in multiple data centers.

Expensive, but once paid for, costly small-scale features like RAID aren't needed.
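Here's a rough sketch of what the redundant write path might look like, with hypothetical data center names and an in-memory dict standing in for each facility - Amazon hasn't published S3's internals, so this just illustrates the idea of writing every object to multiple data centers:

    # Toy illustration (not Amazon's code): replicate each object into several
    # "data centers" so the catastrophic loss of any one doesn't lose data.
    # Each data center here is just an in-memory dict standing in for real storage.
    DATA_CENTERS = {"dc-east": {}, "dc-west": {}, "dc-central": {}}

    def durable_put(key, blob, min_copies=2):
        stored = 0
        for name, copies in DATA_CENTERS.items():
            try:
                copies[key] = blob         # stand-in for a cross-data-center write
                stored += 1
            except Exception:
                continue                   # one facility down? keep writing the others
        if stored < min_copies:
            raise IOError(f"only {stored} of {min_copies} copies written for {key!r}")
        return stored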

-Retry
Just like disk drives - it's quicker for Amazon to retry than it is for customers. Leverage redundancy - retry from different copies.

-Idempotency
This is cool. An idempotent action's result doesn't change even if the action is repeated - so there's no harm in doing it twice if the response is too slow.

For example, reading a customer record can be repeated without changing the result, so a slow read can simply be retried - ideally against a different copy (see the sketch below). To keep those retries from piling up, though, there's surge protection.
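A minimal sketch of how retry and idempotency fit together - repeating a read is harmless, and thanks to redundancy each attempt can be aimed at a different copy. The replica layout and key name are made up for illustration, not S3's actual API:

    # Toy illustration: reads are idempotent, so a slow or failed read can be
    # repeated - and each attempt can hit a different copy of the data.
    REPLICAS = [{"cust-42": b"record"}, {}, {"cust-42": b"record"}]  # stand-in copies

    def read_with_retry(key, attempts=3):
        last_error = None
        for i in range(attempts):
            copy = REPLICAS[i % len(REPLICAS)]    # rotate across copies
            try:
                return copy[key]                  # repeating this changes nothing
            except KeyError as err:
                last_error = err                  # this copy is missing the data; try another
        raise last_error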

-Surge protection
Rate limiting is a bad idea - build the infrastructure to handle uncertainty. Don't burden already stressed components with retries. Don't let a few customers bring down the system.

Surge management techniques include exponential backoff (like CSMA/CD) and caching TTL (time to live) extensions.
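Here's a minimal sketch of exponential backoff with jitter; the base delay, cap and jitter are common-practice assumptions rather than anything Amazon specified:

    import random
    import time

    # Toy illustration of exponential backoff: each failed attempt waits roughly
    # twice as long as the last, with random jitter so clients don't retry in lockstep.
    def call_with_backoff(operation, attempts=5, base=0.1, cap=10.0):
        for attempt in range(attempts):
            try:
                return operation()
            except Exception:
                if attempt == attempts - 1:
                    raise                                   # out of attempts; surface the error
                delay = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, delay))        # jitter spreads retries out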

-Eventual consistency
Amazon sacrifices some consistency for availability. And it sacrifices some availability for durability.
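From the client's side, eventual consistency means a read just after a write may be stale. A small sketch of the usual coping pattern - poll until the copies converge - which illustrates the general idea, not S3's specific consistency contract:

    import time

    # Toy illustration: some copies lag after a write, so a client that must see
    # the new value re-reads until the expected value shows up or patience runs out.
    def read_until(read_fn, key, expected, timeout=5.0, interval=0.2):
        deadline = time.time() + timeout
        while True:
            value = read_fn(key)
            if value == expected or time.time() >= deadline:
                return value             # converged, or out of patience
            time.sleep(interval)         # still stale; wait and re-read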

-Routine failure
Everything fails - so every failure handling code path must work. Avoid unused/rarely used code paths since they are likely to be buggy.

Amazon routinely fails disks, servers, data centers. For data center maintenance they just turn the data center off to exercise the recovery system.

-Diversity
Monocultures are risky. For software there is version diversity: they engineer systems so different versions are compatible.

Likewise with hardware. One lot of drives from a vendor all failed. A shipment of power cords was faulty. Correlated failures happen.

Diversity of workloads: interleave customer workloads for load balancing.

-Integrity checking
Identify corruption inbound, outbound, at rest. Store checksums and compare at read - plus scan all the data at rest as a background task.
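A minimal sketch of that integrity loop: store a checksum alongside each object, verify on every read, and scrub everything at rest in the background. SHA-256 is an assumption here - the talk didn't say which checksum Amazon uses:

    import hashlib

    store = {}   # key -> (blob, checksum); stand-in for data at rest

    def put(key, blob):
        store[key] = (blob, hashlib.sha256(blob).hexdigest())

    def get(key):
        blob, expected = store[key]
        if hashlib.sha256(blob).hexdigest() != expected:
            raise IOError(f"corruption detected reading {key!r}")   # repair from another copy
        return blob

    def scrub():
        """Background task: re-verify every object at rest, not just those being read."""
        return [key for key, (blob, expected) in store.items()
                if hashlib.sha256(blob).hexdigest() != expected]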

-Strong instrumentation
Internal, external. Real time, historical. Per host, aggregate. When things go wrong, you need history to see why.

-Get people out of the loop
Human processes fail. Humans are slow. If a human screws up an Amazon system, it isn't the human's fault. It's the system's.

Final thought
Storage is a lasting relationship that requires trust.

The Storage Bits take
Amazon is the world leader in scale out system engineering. Google may have led the way, but the necessity to count money and ship products set a higher bar for Amazon.

Amazon Web Services will dwarf their products business within a decade. I'd like to see them open the kimono more in the future.

Comments welcome, of course. There's a longer version of this on StorageMojo. And there's the Amazon CompSci paper Dynamo: Amazon's Highly Available Key-value Store. Not S3 specific, but close.