The most recent three-day failure of the Amazon cloud to deliver customer services highlights the importance of understanding what promised levels of service actually mean.
The Amazon EC2 SLA guarantees 99.95% availability of services within a Region over a trailing 365-day period. If availability falls below that level, a customer qualifies for SLA Service Credits against the fees and charges the customer would otherwise incur.
When signing a Service Level Agreement the following needs attention:
1. A 99.95% availability over a 365-day period means that Amazon will deliver contracted-for services except for 262.8 minutes per year (or about 4.38 hours), since 0.05% of the 525,600 minutes in a year is 262.8 minutes.
2. Failure rates are not uniformly distributed. Data center downtime statistics show a steeply declining exponential curve: there are few large failures and many small failures. Therefore the chance of a 263-minute failure is at least 37 times greater than the calculated average outage.
3. The Amazon SLA hedged its promise by offering availability “within a computing Region”, which was defined as a physically distinct, independent infrastructure backing up within the region. Contractually Amazon did not have a failure because more than one processing region was involved.
4. The amount of liability for damages that Amazon incurred was not specified. The Amazon SLAs compensate only for unbilled fees. The damage from lost revenue will always exceed the costs of transaction processing. Liability insurance must be included in the costs of delivering cloud services.
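The downtime budget implied by an availability percentage, as in item 1, is simple arithmetic. The following is a minimal sketch; the function name `allowed_downtime_minutes` and its `period_days` parameter are illustrative assumptions, not part of any Amazon tooling.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, period_days: int = 365) -> float:
    """Minutes of downtime permitted by an availability percentage over a period.

    Hypothetical helper for illustration only.
    """
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    annual = allowed_downtime_minutes(99.95)
    # 0.05% of 525,600 minutes in a 365-day year
    print(f"{annual:.1f} minutes per year ({annual / 60:.2f} hours)")
    # The same percentage expressed over a single day
    print(f"{allowed_downtime_minutes(99.95, period_days=1):.2f} minutes per day")
```

Changing `period_days` shows how much the measurement interval matters: the same percentage that allows more than four hours of outage over a year allows well under a minute when measured over one day.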
SLAs should be defined by means of uptime statistics. The uptime should be stated in days and not as an annual interval. The customer is buying online support for web-based applications. Availability in support of web services is critical. What is missing in the Amazon SLAs is a contractual statement of the time it will take to completely reconstitute operations, measured in minutes, not days, when critical business operations are involved.
Also missing from the SLAs is the all-important guarantee of the latency the cloud service will provide. The standard here is Google, with average response to searches in 30 milliseconds and rarely peaking over 80 milliseconds. Cloud vendors should not be allowed to trade off uptime against latency.
Uptime (with defined geographic limits), time to recover (hours), time to reconstitute to full operations (minutes) and min-max latency (milliseconds) are the metrics for cloud operations, whether public or private. A clear definition of measurements, accompanied by rigorous definitions of SLA terms, is a necessity.
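One way to make such terms rigorous is to write them down as explicit thresholds and check measurements against them. The sketch below assumes hypothetical names (`SlaTerms`, `Measured`, `violations`) and hypothetical threshold values; only the 99.95% uptime and the 80-millisecond peak latency figures are taken from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlaTerms:
    """Contracted thresholds (hypothetical structure for illustration)."""
    min_uptime_pct: float
    max_recovery_hours: float
    max_reconstitution_minutes: float
    max_latency_ms: float


@dataclass(frozen=True)
class Measured:
    """Values observed over one reporting period."""
    uptime_pct: float
    recovery_hours: float
    reconstitution_minutes: float
    peak_latency_ms: float


def violations(terms: SlaTerms, measured: Measured) -> List[str]:
    """Return the name of every metric that breaches its contracted threshold."""
    breached = []
    if measured.uptime_pct < terms.min_uptime_pct:
        breached.append("uptime")
    if measured.recovery_hours > terms.max_recovery_hours:
        breached.append("time to recover")
    if measured.reconstitution_minutes > terms.max_reconstitution_minutes:
        breached.append("time to reconstitute")
    if measured.peak_latency_ms > terms.max_latency_ms:
        breached.append("latency")
    return breached


if __name__ == "__main__":
    # Recovery and reconstitution limits are assumed values, not contract terms.
    terms = SlaTerms(min_uptime_pct=99.95, max_recovery_hours=4.0,
                     max_reconstitution_minutes=30.0, max_latency_ms=80.0)
    period = Measured(uptime_pct=99.99, recovery_hours=1.0,
                      reconstitution_minutes=20.0, peak_latency_ms=120.0)
    print(violations(terms, period))
```

Because each metric is checked separately, a period with excellent uptime but poor latency is still reported as a breach, which is the point of not letting a vendor trade one off against the other.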
Performance of computing services used to be evaluated by polling customer opinions. Such an approach, similar to a beauty contest, can no longer be applied. Cloud services are the engines of global trade, processing trillions of dollars of transactions every day. Performance metrics and liability for losses must be included in SLA negotiations.