Thursday, August 6, 2009

Dr. Dobb's | Databases in the Cloud: Elysian Fields or Briar Patch?

A long, in-depth article on data management in cloud computing.
Viewing the cloud as a serious option for data storage, it gives a good summary of the several approaches that exist.

The cornucopia of products for handling distributed data in the cloud includes everything from lightweight key-value stores to industrial-strength databases

By Ken North,  Dr. Dobb's Journal
August 03, 2009
URL:http://www.ddj.com/database/218900502

Ken North is an author, consultant and industry analyst. He teaches seminars and chaired the XML Devcon 200x conference series, Nextware, LinkedData Planet and DataServices World 200x conferences.


Cloud computing is the latest sea change affecting how we develop and deploy services and applications and fulfill the need for persistent information and database solutions. Database technology evolves even as new computing models emerge, inevitably raising questions about selecting the right database technology to match the new requirements.

The cloud is an elastic computing and data storage engine, a virtual network of servers, storage devices, and other computing resources. It's a major milestone in on-demand or utility computing, the evolutionary progeny of computer timesharing, high-performance networks and grid computing. The computer timesharing industry that emerged four decades ago pioneered the model for on-demand computing and pay-per-use resource sharing of storage and applications. More recently, Ian Foster and Carl Kesselman advanced the concept of the grid to make large-scale computing networks accessible via a service model. Like computer timesharing and the grid, cloud computing often requires persistent storage, so open source projects and commercial companies have responded with data store and database solutions.

  • Public clouds include commercial enterprises that can host applications and databases, offering Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS) and Database as a Service (DaaS). Infrastructure providers include Amazon Elastic Compute Cloud (EC2), GoGrid, Rackspace Mosso, and Joyent, whereas Microsoft Azure, Google AppEngine, Force.com, Zoho, and Facebook are platform providers. There are also providers targeting specific classes of cloud users, such as HP CloudPrint and IBM LotusLive for collaboration services and social networking for businesses. Other SaaS providers include Birst and SAS for on-demand business intelligence (BI), Salesforce.com and Zoho for customer relationship management (CRM), and Epicor, NetSuite, SAP Business ByDesign and Workday for enterprise resource planning (ERP) suites. The DaaS providers include EnterpriseDB, FathomDB, Longjump, and TrackVia.

  • Private clouds, like server consolidation, clusters, and virtualization, are another evolutionary step in data center and grid technology. Gartner Research predicted government will have the largest private clouds, but any organization having thousands of servers and massive storage requirements is a likely candidate. Security and reliability are the appeal of private clouds for large enterprises that can afford the infrastructure. Public cloud computing does not provide the 99.99% uptime that enterprise data center managers desire for service level agreements. The fact that a private cloud sits behind a firewall mitigates the risk from exposing data to the cloud. The private cloud also alleviates concerns about data protection in multi-tenancy cloud environments. One issue in the private versus public cloud debate is the diversity of APIs used to invoke cloud services. This has spurred interest in creating a standard, but the Eucalyptus initiative took a different approach. Assuming the Amazon APIs to be a de facto standard, it developed private cloud software that's largely compatible with Amazon EC2 APIs.

When evaluating the suitability of a database solution for cloud computing, there are multiple considerations.

  • First, you must consider the class of applications that will be served: business intelligence (BI), e-commerce transactions, knowledge bases, collaboration and so on.
  • Second, you must determine suitability for public and/or private clouds.
  • Third, you must consider ease of development.

And, of course, budget is not to be overlooked.

Mission: What Will Run In the Cloud?

Selecting a database manager should be a function of the mission and applications it must support, not just budget and whether it will run in the enterprise or a private or public cloud.

Some organizations use a cloud provider as a backup for mission-critical applications or databases, not as the primary option for deploying applications or services. Oracle database users can run backup software that uses the Amazon Simple Storage Service (S3) for Oracle database backups. For an even bigger safety net, organizations can look to cloud computing as a disaster recovery option.

The New York Times project that created the TimesMachine, a web-accessible digital archive, is a prime example of a one-off cloud project requiring massively scalable computing. But factors besides on-demand elasticity come into play when the goal is hosting applications in the cloud on a long-term basis, particularly database applications.

Cloud users are often looking to deploy applications and databases with a highly-scalable, on-demand architecture, often on a pay-per-use basis. Common scenarios for using the cloud include startups and project-based, ad hoc efforts that want to ramp up quickly with minimal investment in infrastructure. But Amazon's public cloud has also been used to support e-business web sites, such as NetFlix.com, eHarmony.com and Target.com. E-mail is a backbone of modern business, and companies such as the Boston Celtics have gone to a cloud computing model for e-mail and collaboration software. Companies can also opt to use a cloud to host ERP or CRM suites that operate with SQL databases, such as open source ERP suites (Compiere, Openbravo, SugarCRM) and BI solutions (Jasper, Pentaho). Because data warehouses use source data from operational systems, organizations using the cloud to host operational databases are likely to do the same for data warehouses and business intelligence.

On a pay-as-you-go basis, the cloud handles provisioning on demand. Machine images, IP addresses, and disk arrays are not permanently assigned, but databases on a public cloud can be assigned to persistent storage. This saves having to bulk load a database each time you fire up machine instances and run your application. But it also puts a premium on database security and the cloud provider having a robust security model for multi-tenancy storage.

The cloud is particularly well-suited for processing large data sets and compute-intensive applications that benefit from parallel processing, such as rendering video and data analysis. Early Amazon EC2 users have included biomedical researchers and pharmaceutical, bioengineering and banking institutions. They were early adopters of grid computing for purposes such as financial modeling, drug discovery and other research. Medical research often requires massive simulations of genetic sequencing and molecular interactions. This has been done by grids, often using Basic Local Alignment Search Tool (BLAST) programs, and more recently by clouds. Researchers have also used MapReduce software in the cloud for genetic sequence analysis. Eli Lilly uses Amazon EC2 for processing bioinformatics sequence information.

Cloud computing is also used for other purposes, such as integrating SaaS and enterprise systems. Players in the on-demand integration space include Boomi, Cast Iron Systems, Hubspan, Informatica, Jitterbit and Pervasive Software. Business intelligence (BI) activity, such as analytics, data warehousing and data mining, requires horsepower and a capital outlay that might be prohibitive for small and medium businesses. Cloud computing offers an attractive pay-per-use alternative and there appears to be a large potential BI-on-demand market.

The marriage of cloud computing and business intelligence can be accomplished by several means. One option is to have data and applications hosted by a SaaS provider. Another is to create cloud-based applications hosted by an infrastructure provider. A third alternative is to do both and use data replication or a data integration suite.

PaaS, Public, Private Clouds

The Platform-as-a-Service (PaaS) solution bundles developer tools and a data store, but users who opt to use an infrastructure provider or build a private cloud have to match the data store or database to their application requirements and budget. There are open source and commercial products that have a wide range of capabilities, from scalable simple data stores to robust platforms for complex query and transaction processing.

Databases, data stores, and data access software for cloud computing must be evaluated for suitability for both public and private clouds and for the class of applications supported. For example, Amazon Dynamo was built to operate in a trusted environment, without authentication and authorization requirements. Whether the environment supports multi-tenant or multi-instance applications also influences the database decision.

Databases and Data Stores

Data management options for the cloud include single format data stores, document databases, column data stores, semantic data stores, federated databases and object-relational databases. The latter group includes "Swiss Army Knife" servers from IBM, Microsoft, OpenLink, and Oracle that process SQL tables, XML documents, RDF triples and user-defined types.

Building a petabyte size web search index is a very different problem from processing an order or mapping wireless networks. The requirements of the application and data store for those tasks are quite different. For new applications hosted in the cloud, developers will look primarily to several classes of data store:

  • SQL/XML (object-relational) databases
  • Column data stores
  • Distributed hash table (DHT), simple key-value stores
  • Tuple space variants, in-memory databases, entity-attribute-value stores and other non-SQL databases having features such as filtering, sorting, range queries and transactions.

Because this cornucopia of data stores has diverse capabilities, it's important to understand application requirements for scalability, load balancing, consistency, data integrity, transaction support and security. Some newer data stores are an exercise in minimalism. They avoid joins and don't implement schemas or strong typing, instead storing data as strings or blobs. Scalability with very large data set operations is a requirement for cloud computing, which has contributed to the recent enthusiasm for the DHT and distributed key-value stores.

Associative arrays, dictionaries, hash tables, rings, and tuple spaces have been around for years, as have entity-attribute-value (EAV) stores, database partitions and federated databases. But cloud computing puts an emphasis on scalability and load balancing by distributing data across multiple servers. The need for low-latency data stores has created an Internet buzz about key-value stores, distributed hash tables (DHT), entity-attribute-value stores and data distribution by sharding.

Tuple spaces are a distributed shared memory solution that originated with the Linda effort at Yale, which spawned more than 20 implementations, including Object Spaces, JavaSpaces, GigaSpaces, LinuxTuples, IBM TSpaces, and PyLinda. You can find the GigaSpaces eXtreme Application Platform as a pay-per-use service on Amazon EC2. It includes a local and distributed Jini transaction manager, Java Transaction API (JTA) and JDBC support, with b-tree and hash-based indexing capabilities. Amazon SimpleDB also provides standard tuple space interfaces, but adds secondary indexing and support for additional query operators.
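For readers new to the model, the minimal sketch below shows an in-process tuple space in Python with the write/take (Linda "out"/"in") operations that JavaSpaces, GigaSpaces and PyLinda build on. It is purely illustrative: the TupleSpace class and the order tuples are invented for the example and do not reflect any vendor's API, which adds distribution, transactions and indexing.

    import threading

    class TupleSpace(object):
        """Toy tuple space: None in a template acts as a wildcard field."""
        def __init__(self):
            self._tuples = []
            self._cond = threading.Condition()

        def write(self, tup):                      # Linda 'out'
            with self._cond:
                self._tuples.append(tup)
                self._cond.notify_all()

        def _match(self, template):
            for tup in self._tuples:
                if len(tup) == len(template) and all(
                        t is None or t == v for t, v in zip(template, tup)):
                    return tup
            return None

        def take(self, template):                  # Linda 'in': remove and return
            with self._cond:
                while True:
                    tup = self._match(template)
                    if tup is not None:
                        self._tuples.remove(tup)
                        return tup
                    self._cond.wait()              # block until a match is written

    space = TupleSpace()
    space.write(("order", 42, "new"))
    print(space.take(("order", 42, None)))         # -> ('order', 42, 'new')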

For large data sets and databases, partitioning data has been a facilitator of parallel query processing and load balancing. Horizontal partitioning, referred to as sharding, has caught the attention of developers looking to build multi-terabyte cloud databases because of its success at Amazon, Digg, eBay, Facebook, Flickr, Friendster, Skype, and YouTube.

SQLAlchemy and Hibernate Shards, object-relational mappers for Python and Java, respectively, provide sharding that's useful for cloud database design. Google developed Hibernate Shards for data clusters before donating it to the Hibernate project. You can do manual sharding for a platform such as Google AppEngine, use SQLAlchemy or Hibernate Shards for Python or Java development, or use a cloud data store such as MongoDB that provides administrative commands for creating shards.
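To make the sharding idea concrete, here is a rough hash-based sharding sketch in Python. It stands in for what SQLAlchemy, Hibernate Shards or MongoDB's shard commands do against real infrastructure; the shard files, the users table and the CRC32 routing function are assumptions made for the example.

    import sqlite3
    import zlib

    # Each entry stands in for a separate database server (shard).
    SHARD_DSNS = ["shard0.db", "shard1.db", "shard2.db", "shard3.db"]

    def shard_for(user_id):
        """Pick a shard deterministically from the sharding key."""
        return zlib.crc32(str(user_id).encode("utf-8")) % len(SHARD_DSNS)

    def insert_user(user_id, name):
        conn = sqlite3.connect(SHARD_DSNS[shard_for(user_id)])
        conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
        conn.commit()
        conn.close()

    def get_user(user_id):
        # The same routing function sends reads to the shard that holds the row.
        conn = sqlite3.connect(SHARD_DSNS[shard_for(user_id)])
        row = conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,)).fetchone()
        conn.close()
        return row

    insert_user(1001, "alice")
    print(get_user(1001))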

Distributed Hash Table, Key-Value Data Stores

Distributed hash tables and key-value stores are tools for building scalable, load balanced applications, not for enforcing rigid data integrity, consistency and Atomic Consistent Isolated Durable (ACID) properties for transactions. They have limited applicability for applications doing ad hoc query and complex analytics processing.

Products in this group include memcached, MemcacheDB, Project Voldemort, Scalaris, and Tokyo Cabinet. Memcached is ubiquitous and a popular caching solution for database-powered web sites. It's a big associative array accessed with get and put functions, using a key that uniquely identifies the data. It's particularly useful for caching information produced by expensive SQL queries, such as counts and aggregate values. MemcacheDB is a distributed key-value data store that conforms to the memcached protocol but uses Berkeley DB for data persistence.
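A small sketch of that cache-aside pattern follows, using the python-memcache client: check the cache for a precomputed aggregate before running the expensive SQL query. The orders table, the key naming and the five-minute expiry are assumptions, and db is presumed to be an existing DB-API connection.

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def order_count_for(customer_id, db):
        key = "order_count:%d" % customer_id
        count = mc.get(key)
        if count is None:
            # Cache miss: fall through to the database and cache the result.
            cur = db.cursor()
            cur.execute("SELECT COUNT(*) FROM orders WHERE customer_id = %s",
                        (customer_id,))
            count = cur.fetchone()[0]
            mc.set(key, count, time=300)   # expire after 5 minutes
        return count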

Scalaris is a distributed key-value store, implemented in Erlang, which has a non-blocking commit protocol for transactions. Using the Web interface, you can read or write a key-value pair, with each operation being an atomic transaction. Using Java, you can execute more complex transactions. Scalaris has strong consistency and supports symmetric replication, but does not have persistent storage.

The open source Tokyo Cabinet database library is causing a buzz in online discussions about key-value stores. It's blazingly fast, capable of storing 1 million records in 0.7 seconds using the hash table engine and 1.6 seconds using the b-tree engine. The data model is one value per key and it supports LZW compression. When keys are ordered, it can do prefix and range matching. For handling transactions, it features write-ahead logging and shadow paging. Tokyo Tyrant is a database server version of Tokyo Cabinet that's been used to cache large SQL databases for high-volume applications.

Some products of this group support queries over ranges of keys, but ad hoc query operations and aggregate operations (sum, average, grouping) require programming because they are not built-in.

Hadoop MapReduce

Hadoop MapReduce would be a nominee for the Academy Award for parallel processing of very large data sets, if one existed. It's fault-tolerant and has developed a strong following in the grid and cloud computing communities, including developers at Google, Yahoo, Microsoft, and Facebook. Open source Hadoop is available from Apache, a commercial distribution is available from Cloudera, and Amazon offers an Elastic MapReduce service based on Hadoop.

MapReduce operates over the Hadoop Distributed File System (HDFS), with file splits and data stored as key-value pairs. HDFS enables partitioning data across multiple machines to do parallel processing of batches and reduce processing time. MapReduce is suitable for processing very large data sets for purposes such as building search index engines or data mining, but not for online applications requiring sub-second response times. Frameworks built on top of Hadoop, such as Hive and Pig, are useful for extracting information from databases for Hadoop processing. The eHarmony.com site is an example of the marriage of an Oracle database and Amazon Elastic MapReduce, using the latter for analytics involving millions of users.
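To make the key-value flow concrete, the sketch below shows word-count mapper and reducer scripts of the kind commonly run with Hadoop Streaming, which lets MapReduce jobs be written in scripting languages such as Python. In practice the two functions live in separate files; the jar name and HDFS paths in the comment are placeholders.

    # Typically saved as mapper.py and reducer.py and submitted with
    # something like:
    #   hadoop jar hadoop-streaming.jar -input /data/pages -output /data/wc \
    #       -mapper mapper.py -reducer reducer.py
    import sys

    def mapper(stdin=sys.stdin):
        # Emit one "word<TAB>1" line per word in the input split.
        for line in stdin:
            for word in line.split():
                print("%s\t1" % word)

    def reducer(stdin=sys.stdin):
        # Streaming sorts map output by key, so counts for a word arrive adjacent.
        current, total = None, 0
        for line in stdin:
            word, count = line.rstrip("\n").split("\t", 1)
            if word != current and current is not None:
                print("%s\t%d" % (current, total))
                total = 0
            current = word
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))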

Entity-Attribute-Value Datastores

EAV stores are derived from data management technology that pre-dates the relational model for data. They do not have the full feature set of an SQL DBMS, such as a rich query model based on a non-procedural, declarative query language. But they are more than a simple key-value data store. EAV data stores from major cloud computing providers include Amazon SimpleDB, Google AppEngine datastore and Microsoft SQL Data Services. And one type, the RDF datastore used for knowledge bases and ontology projects, has been deployed in the cloud.

Google Bigtable uses a distributed file system and it can store very large data sets (petabyte size) on thousands of servers. It's the underlying technology for the Google AppEngine datastore. Google uses it, in combination with MapReduce, for indexing the Web and for applications such as Google Earth. Bigtable is a solution for projects that require analyzing a large collection, for example the one billion web pages and 4.78 billion URLs in the ClueWeb09 data set from Carnegie Mellon University. For those seeking an open source alternative to Bigtable for use with Hadoop, Hypertable and HBase have developed a following. Hypertable runs on top of a distributed file system, such as HDFS. HBase data is organized by table, row and multi-valued columns, and there's an iterator-style interface for scanning a range of rows. Hypertable is implemented in C++, whereas HBase is implemented in Java.

The Google AppEngine includes a schemaless data store that's optimized for reading, supports atomic transactions and consistency, and stores entities with properties. It permits filtering and sorting on keys and properties. It has 21 built-in data types, including list, blob, postal address and geographical point. Applications can define entity groupings as the basis for performing transactional updates and use GQL, a SQL-like query language. Access to the Google AppEngine datastore is programmable using Python interfaces for queries over objects known as entities. The datastore is also programmable using Java Data Objects (JDO) and Java Persistence API. Although AppEngine bundles a data store, the AppScale project provides software for operating with data stores such as HBase, Hypertable, MongoDB and MySQL.
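A short sketch of that Python datastore API is shown below, assuming the AppEngine SDK is available; the Order model, its properties and the query thresholds are invented for illustration.

    from google.appengine.ext import db

    class Order(db.Model):
        # Typed properties on an otherwise schemaless entity.
        customer = db.StringProperty(required=True)
        total    = db.FloatProperty()
        placed   = db.DateTimeProperty(auto_now_add=True)

    def record_order(customer, total):
        order = Order(customer=customer, total=total)
        order.put()                      # single-entity write
        return order.key()

    def recent_big_orders(customer):
        # GQL, the SQL-like query language; filters and sort orders are
        # backed by datastore indexes.
        return db.GqlQuery(
            "SELECT * FROM Order WHERE customer = :1 AND total > :2 "
            "ORDER BY total DESC", customer, 100.0).fetch(20)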

Amazon Platform

Amazon SimpleDB is a schemaless, Erlang-based, eventually consistent data store suited for high-availability applications. The data model provides domains of large collections of items, which are hash tables containing attributes that are key-value pairs. Attributes can have multiple values and there are no joins. The query language provides queries that can return an itemName, all attributes, the attribute count or an attribute list. Data is stored in a single format (untyped strings), without applying constraints, so all predicate comparisons are lexicographical. Therefore for accurate query results you must store data in an ordered format, for example padding numbers with leading zeroes and using dates in ISO 8601:2004 format.
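The sketch below, using the boto library's SimpleDB binding, illustrates the padding and date-format conventions just described; the access keys, the domain name and the attribute names are placeholders.

    import boto

    sdb = boto.connect_sdb("ACCESS_KEY", "SECRET_KEY")
    domain = sdb.create_domain("orders")

    domain.put_attributes("order-1001", {
        "customer": "alice",
        "total":    "%010d" % 4200,               # zero-padded: "0000004200"
        "placed":   "2009-08-03T14:30:00Z",       # ISO 8601 sorts chronologically
    })

    # Lexicographic comparison now behaves like a numeric comparison.
    for item in domain.select(
            "select * from orders where total > '0000001000'"):
        print(item.name, item["total"])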

Azure Services Platform

Microsoft's Windows Azure, like Google AppEngine and Force.com, offers a platform for cloud computing that includes a data store and other features for application development. Microsoft .NET Services provide a service bus and authentication, and Live Services are application building blocks. Microsoft also offers SharePoint Services and Dynamics CRM Services in the Azure cloud. Like Amazon S3 and EC2, communication using the Azure Services Platform is based on the web services model, with Microsoft supporting SOAP and REST. Microsoft Azure bundles SQL Data Services (SDS) and exposes Azure Table Storage via ADO.NET Data Services. The database Azure currently offers is a single instance of SQL Server that's limited to 10 gigabytes of storage. For a larger requirement it's necessary to partition data to scale horizontally.

For those with a history of using industrial-strength databases, a big adjustment to the new EAV stores is the lack of strong typing. SimpleDB uses string values to store everything, so comparisons and sorting require that you pad numbers with leading zeros. Microsoft SQL Data Services provides Base64, Boolean, datetime, decimal, and string. With more than 20 types, Google AppEngine has more built-in types than SimpleDB or SQL Data Services.

RDF and Semantic Data Stores

Social networking and e-commerce have shown us there are classes of web applications that must operate with massive data stores and support a user base measured in millions. Cloud computing is often touted as a vehicle for scaling out that type of site and powering Web 3.0 applications. Tim Berners-Lee has said a web of linked data will evolve from the web of linked documents. This has produced a surge of interest in data stores that can handle very large knowledge bases and data sets encoded to impart semantics using the W3C Resource Description Framework (RDF) and in the W3C SPARQL query language.

Interest in RDF, microformats and linked data has raised awareness of the capabilities and capacity of RDF data stores. Because there are a number of RDF data stores, the benchmark wars are reminiscent of the Transaction Processing Performance Council (TPC) benchmark competition among SQL vendors. RDF data is stored as subject-predicate-object triples. The leading RDF data stores often store additional information for versioning and temporal queries, but they are capable of storing and querying over billions of triples. A W3C wiki identifies more than a dozen triple stores, about half citing deployments or benchmarks with 1 billion triples or more.
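For readers new to the triple model, here is a small sketch with rdflib, an open source Python RDF library: it adds a few subject-predicate-object triples and runs a SPARQL query over them. The example.org vocabulary is invented for the illustration, and exact query support varies by rdflib version.

    from rdflib import Graph, Namespace, URIRef, Literal

    EX = Namespace("http://example.org/schema/")
    g = Graph()

    # Two triples about the same subject.
    g.add((URIRef("http://example.org/people/alice"), EX.worksFor,
           URIRef("http://example.org/org/acme")))
    g.add((URIRef("http://example.org/people/alice"), EX.name, Literal("Alice")))

    results = g.query("""
        PREFIX ex: <http://example.org/schema/>
        SELECT ?name WHERE {
            ?person ex:worksFor <http://example.org/org/acme> .
            ?person ex:name ?name .
        }
    """)
    for row in results:
        print(row[0])    # -> "Alice"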

Sesame, Jena, and Mulgara are popular open source solutions. OpenLink Virtuoso is a universal server that in a recent benchmark loaded 110,500 triples per second. The Virtuoso Universal Server (Cloud Edition) is a pre-packaged AMI for EC2. In addition to SQL and XML databases, it provides online backup to Amazon S3 buckets and installable RDFizer cartridges. Franz AllegroGraph RDFStore offers a vehicle for building RDF-based federated knowledge stores in the cloud. It supports SPARQL queries, Prolog and RDFS++ reasoning. On Amazon EC2, it stored and indexed a 10-billion triple data set in 6.19 hours using 10 large EC2 instances. SQL/XML products, including Oracle 11g and IBM Boca for DB2, can also store RDF triples. On the patent front, Microsoft has been active with applications for methods to store RDF triples and convert SPARQL queries to SQL.

Document Stores, Column Stores

Storing data by column, rather than the row-orientation of the traditional SQL platform, does not subvert the relational model. But when combined with data compression and a shared-nothing, massively parallel processing (MPP) architecture, it can sustain high-performance applications doing analytics and business intelligence processing. By using a Sybase IQ or Vertica column store with a cloud computing service, organizations can roll their own scalable BI solutions without a heavy capital outlay for server hardware. Sybase IQ processes complex analytics queries, accelerates report processing and includes a word index for string processing, such as SQL LIKE queries. It provides connectivity via standard data access APIs and its Rcube schemas provide a performance advantage over the star schema typically used for relational data warehouses and data marts. Vertica Analytic Database is a solution from a company co-founded by Michael Stonebraker. Vertica supports a grid architecture, terabyte-sized databases, and standards-based connectivity. It makes pay-as-you-go analytics available to Amazon EC2 users, with a large AMI instance, drivers for ODBC, JDBC, Python, and Ruby, and a database size of 1 terabyte per node as you scale out to multiple nodes.

Apache CouchDB is a schema-less, fault-tolerant data store that appeals to developers building HTTP applications for which a document paradigm is useful. It supports the JavaScript Object Notation (JSON) and AtomPub data formats and it provides a REST-style API for reading and updating named documents. To ensure data consistency it has ACID properties and does not overwrite committed data. It uses a document ID and sequence number to write a b-tree index, with sequence number providing the ability to generate a document's version history. CouchDB provides a peer-based distributed database solution that supports bi-directional replication.
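A minimal sketch of that REST/JSON interaction follows, using only the Python standard library and assuming a CouchDB server on localhost:5984; the database and document names are illustrative.

    import httplib
    import json

    conn = httplib.HTTPConnection("localhost", 5984)

    conn.request("PUT", "/articles")                     # create the database
    print(conn.getresponse().read())

    doc = {"title": "Databases in the Cloud", "author": "Ken North"}
    conn.request("PUT", "/articles/ddj-cloud", json.dumps(doc),
                 {"Content-Type": "application/json"})
    print(conn.getresponse().read())                     # response includes the _rev

    conn.request("GET", "/articles/ddj-cloud")
    print(json.loads(conn.getresponse().read()))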

SQL/XML Databases

The SQL database has survived every paradigm shift critics said would be the death of SQL, including object-oriented programming (OOP), online analytical processing (OLAP), Internet computing and the World Wide Web. Some have suggested SQL platforms are not sufficiently scalable for large workloads or data volumes, but there's ample evidence to the contrary. The UPS shipping system central database processes 59 million transactions per hour. It has a table that contains more than 42 billion rows and has achieved a peak workload of more than 1 billion SQL statements per hour with IBM DB2. The data warehouse at eBay, running on a Teradata system, contains 5 petabytes of data. LGR Telecommunications derives information from call records to feed a 310 TB Oracle data warehouse. At a recent conference, Microsoft reported Hotmail has 300 million users and processes more than 2 billion non-spam messages per day with Microsoft SQL Server running on a 10,000 server farm.

The SQL/XML database platforms provide a rich query model, supporting SQL, XQuery, XPath expressions, and SPARQL queries. Typically a key-value store requires logic in the application to perform record-oriented query processing. But instead of procedural programming, the SQL solution offers a declarative programming solution that relies on the query optimizer to generate the access path to the data. The SQL platforms offer mature administrative tools and standards-based connectivity, but the highest capacity SQL configurations have not yet been seen in the pay-per-use cloud.
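The contrast is easy to see in a toy example: below, the same aggregation is expressed once as a declarative SQL statement (using SQLite for brevity) and once as the procedural loop a bare key-value store would push into application code. The orders data is made up for illustration.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (customer TEXT, total REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("alice", 40.0), ("bob", 15.0), ("alice", 25.0)])

    # SQL: the optimizer picks the access path; the query states only the result wanted.
    print(db.execute(
        "SELECT customer, SUM(total) FROM orders GROUP BY customer").fetchall())

    # Key-value store: the same aggregation coded by hand over opaque values.
    kv = {"order:1": ("alice", 40.0), "order:2": ("bob", 15.0), "order:3": ("alice", 25.0)}
    sums = {}
    for customer, total in kv.values():
        sums[customer] = sums.get(customer, 0.0) + total
    print(sums)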

IBM DB2 gives you a hybrid storage engine that supports transaction processing, business intelligence, and XML document processing. It currently holds several TPC benchmark records, including 1 million TPC-C transactions per minute on an 8-processor/64-core cluster running Red Hat Linux Advanced Server. But the ready-to-run Amazon EC2 AMIs aren't configured for that type of workload. The bundled AMIs are for running IBM DB2 Express Edition or Workgroup Edition and Informix Dynamic Server Express Edition and Workgroup Edition. For heavier lifting, you'll need to move your own DB2 Enterprise Edition or Informix Dynamic Server (IDS) licenses to EC2. Besides DB2 and Informix Dynamic Server, there are pre-packaged AMIs for IBM Lotus Web Content Management and WebSphere sMash. For DB2 or IDS development, IBM provides Developer AMIs for EC2 that have no DB2 or IDS usage charge.

Oracle users can transfer licenses to EC2 for Oracle 11g database, Fusion Middleware and Enterprise Manager. The company also provides ready-to-run AMIs and the Oracle Secure Backup Cloud Module can create compressed and encrypted database backups using Amazon S3. The S3 backups easily integrate with Oracle Recovery Manager using its SBT interface. The Oracle EC2 AMIs are pre-configured to use Enterprise Linux. The selection includes Oracle Database 10g Express Edition, Oracle Database 11g Enterprise Edition, Oracle Database 11g SE and WebLogic Server 10g. Oracle's licensing policy permits moving Fusion Middleware to EC2, including WebLogic Server, JRockit (Java VM), Coherence and the Tuxedo transaction processing monitor.

Oracle Coherence is an in-memory, distributed data grid that stores key-value pairs. It provides linear scalability (reportedly deployed in a 5000-node grid), replication, caching and transparent failover for distributed data. Coherence supports analysis and aggregation over the entire grid and it's available for C++, Java and .NET development. Oracle Real Application Clusters are not currently available on a public cloud provider.

MySQL Enterprise is a platform suitable for cloud computing scenarios, such as scaling out with multiple servers and using master/slave replication. Some MySQL users have created a high-availability solution for Amazon EC2 by using a multi-instance master-master replication cluster. MySQL Enterprise subscribers can sign up for 24x7 support services for EC2, with different levels of support available from Sun. With the Platinum subscription, you get an enterprise dashboard, replication monitor, connectors, caching (memcached) and partitioning with MySQL Advanced. Continuent and Sun are working on making MySQL clustering technology available on cloud computing services such as GoGrid, Rackspace Mosso, and Amazon EC2.

EnterpriseDB Postgres Plus Cloud Edition is a version of PostgreSQL with enhancements such as GridSQL, replication, asynchronous pre-fetch for RAID, and a distributed memory cache. GridSQL uses a shared-nothing architecture to support parallel query processing for highly-scalable environments such as grids and clouds. For its cloud edition, EnterpriseDB partnered with Elastra, which had a SaaS offering with PostgreSQL and MySQL on Amazon and a product for management of clustered data warehouses. Elastra used Amazon S3 as the persistence solution to the ephemeral disk storage problem when trying to manage databases using EC2 instances.

In-Memory Databases, Cache

For applications that require extremely high throughput, in-memory databases and caches can be deployed to deliver performance. One solution is to pair an in-memory database with a disk-resident SQL database, with the former acting as a cache for the latter. TimesTen and solidDB are robust in-memory products that were acquired by Oracle and IBM respectively. Oracle TimesTen is an embeddable, in-memory database that supports ODBC and JDBC data access. It can provide real-time caching for and automatic synchronization with Oracle 11g databases. IBM solidDB maintains redundant copies of the database at all times and provides JDBC and ODBC query capabilities. It can scale using partitioning for instances, act as cache for SQL databases and do periodic snapshots (checkpoints to disk). The GigaSpaces XAP builds on the tuple spaces shared memory model and offers a JDBC capability for SQL queries.

To improve responsiveness, high-volume web sites typically use cache to reduce the number of queries against SQL databases. Ehcache is a distributed Java cache used by LinkedIn. The memcached server provides a distributed object cache often used for MySQL applications. JBoss Cache has been integrated with GridGain in the Open Cloud Platform. Grid Dynamics demonstrated linear scalability with the GridGain platform from 2 to 512 nodes using Amazon EC2 to run Monte Carlo simulations. Quetzall CloudCache is targeted at cloud applications hosted on EC2. CloudCache returns data in JSON or XML format and it can run in multiple EC2 regions. It offers a REST-style API and there are bindings for Ruby, Python, PHP, and Java development. Microsoft is currently previewing a distributed, in-memory cache code-named Velocity. It supports retrieving data by key or tag, optimistic and pessimistic concurrency, and it includes an ASP.NET session provider.

Federated Data

The federated database provides a solution when data is distributed because volume, workload or other considerations make it impractical to combine it into a single database. Open SkyQuery and Flickr have been showcases for federation.

SkyQuery runs distributed queries over federated astronomical data sources. Flickr uses sharding to support billions of queries per day over federated MySQL databases used for data management of 2 billion photos. That type of success and the scalability requirements of cloud computing have put new emphasis on federated data and sharding. Mergers and acquisitions also may force the creation of federated data stores to permit execution of business intelligence and other queries against disparate CRM databases.

IBM has been using GaianDB, based on Apache Derby, to test performance of a lightweight federated database engine. It distributed the database over 1000 nodes, which GaianDB was able to query in 1/8 second. Fetching a million rows took five seconds.

Platform and API Issues

Database options for public cloud computing can be limited by the choice of cloud provider. PaaS providers, such as Google AppEngine and Force.com, offer a specific platform for development, including predefined APIs and data stores. But private clouds and infrastructure providers, such as GoGrid, Joyent and Amazon EC2, enable the cloud user to match the software, database environment and APIs to requirements.

Besides cloud storage APIs, developers can program to diverse APIs for data stores and standards-based APIs for SQL/XML databases. The programmer developing applications for the cloud can work with SQL APIs and APIs for cloud services. For Amazon that involves using Web Services Description Language (WSDL) and invoking specific web services. For projects that use the cloud to power rich Internet applications (Web 2.0), developers might be looking to use JavaScript Object Notation (JSON) and the Atom Publishing protocol (AtomPub). More than one guru considers AtomPub to be the de facto standard for accessing cloud-based data.

Ease of development is an important aspect for a cloud database solution, with application programming interfaces (API) being a major factor. Some data access programming for the cloud can be done with familiar APIs, such as Open Database Connectivity (ODBC), JDBC, Java Data Objects (JDO) and ADO.NET.
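For instance, a SQL database hosted on a cloud instance is reachable through the same interfaces used on premises. The sketch below uses ODBC via the pyodbc module; the EC2 host name, driver string and credentials are placeholders for whatever the instance actually exposes.

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={MySQL ODBC 5.1 Driver};"
        "SERVER=ec2-203-0-113-10.compute-1.amazonaws.com;"
        "DATABASE=crm;UID=app;PWD=secret")

    cursor = conn.cursor()
    cursor.execute("SELECT customer, SUM(total) FROM orders GROUP BY customer")
    for customer, total in cursor.fetchall():
        print(customer, total)
    conn.close()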

Security

For certain classes of applications, security is an obstacle to using public cloud services, but it's not an insurmountable obstacle. Current thinking on the subject emphasizes encryption, authorization, authentication, digital certificates, roles and policy-based security controls. Database backups to the cloud can be encrypted. Communications can use secure networking and encrypted data.

Java and .NET offer robust cryptographic solutions for applications and services accessing databases. Operating systems and robust SQL databases offer additional layers of security. SQL databases provide features such as row-level encryption and role-based assignment of privileges and access to data. But even with multi-level security, one serious threat to databases in the public cloud and the corporate data center is a breach of hypervisor security by an authorized employee. For helping to ensure data security, Amazon EC2 provides for the definition of security groups. But you must use an Amazon API function to manually monitor the security group descriptions. And there is no logging function to monitor failed attempts at authentication.
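As one concrete example, the boto sketch below creates a security group for database instances, authorizes only the database port from an application-tier address range, and dumps the group descriptions for the kind of manual review mentioned above. The access keys, port and CIDR block are placeholders.

    import boto

    ec2 = boto.connect_ec2("ACCESS_KEY", "SECRET_KEY")

    db_group = ec2.create_security_group("db-tier", "database instances")
    db_group.authorize(ip_protocol="tcp", from_port=3306, to_port=3306,
                       cidr_ip="10.0.0.0/16")      # MySQL port, app-tier subnet only

    # Security group descriptions must be inspected manually via the API.
    for group in ec2.get_all_security_groups():
        print(group.name, group.description, [str(rule) for rule in group.rules])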

There are differences in security depending on whether you use SaaS, a platform provider or infrastructure provider. Because SaaS providers offer a bundle with tools, APIs and services, the SaaS user is not caught up in choosing the optimal data store and security model. However, those creating private clouds or using an infrastructure provider must select a data management solution that's consistent with the application's security requirement.

Salesforce.com hosts applications on Oracle databases using a multi-tenancy model. On the other hand, Amazon EC2 is an example of multi-instance security. If you fire up an AMI running Oracle, DB2 or Microsoft SQL Server, you have a unique instance that does not serve other tenants. The process of authorizing database users, defining roles and granting privileges is your responsibility when using IaaS.

Fault-Tolerance and Cloud Failover

One of the exciting possibilities introduced by cloud service providers is being able to configure fault-tolerant, highly-available systems and hot backups for disaster recovery. It's possible to configure and operate a private cloud for a fairly seamless failover to Amazon EC2, for example. It would require replicating data in the private and public cloud, implementing the Amazon APIs and availability zones, IP assignment and load balancing for the private cloud, and using server configurations compatible with Amazon instances. The latter would be necessary to avoid breaking applications or services due to changes in endianness, the Java heap size and other dissimilarities. A recent dialogue with IBM about the inverse scenario, deploying databases in a public cloud and moving them to a private cloud, revealed this would be more of a challenge.

Final Thoughts

The SQL database became predominant even though an earlier generation of databases delivered ACID properties and excellent performance on Create, Read, Update, Delete (CRUD) operations. But they, like some of the software mentioned here, required a programmer to write code to navigate through data in order to perform queries, such as an aggregation query. SQL platforms, by contrast, provided an ad hoc query solution that did not require procedural programming because they used a declarative query language and provided built-in aggregation functions. The logic that must be programmed in an application or service, versus built in with the database engine, is an element of the total cost of ownership (TCO) of a data store solution.

The direction an organization takes on cloud computing, whether to go the private cloud route or use a public cloud, will determine what options are available for data management. For those who walk the PaaS path, the focus will be on the platform's capabilities, not the data store per se. Those who walk the private cloud or IaaS paths will have to choose a hardware and software configuration, including a data store that fits the business goals and requirements of applications running in the cloud. Many factors will influence the choice of one or more of a spectrum of cloud database solutions, ranging from simple data stores to platforms that support complex queries and transaction processing.

Not every project requires the full functionality of the SQL database managers so there's a definite need for lightweight, fast, scalable data stores.