This post analyses the following paper published in 2006:

Bigtable: A Distributed Storage System for Structured Data, 7th USENIX Symposium on Operating Systems Design and Implementation – OSDI

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber.

Key points

  • Proposal: Bigtable, a distributed storage system created by Google.
  • Merit: presents a new distributed storage system that scales to very large amounts of data (petabytes) across thousands of machines while meeting diverse latency requirements with high performance.
  • Validation: set up a clustered environment loaded with data in Bigtable and compare operations in that environment against the same data in other distributed storage systems. Such validation is difficult in practice because Bigtable is proprietary, closed-source technology; an alternative is to use HBase, whose design is based on Bigtable.
  • Perspectives: support for secondary indexes, infrastructure for cross-data-center replicated Bigtables with multiple master replicas, and handling of resource sharing within Bigtable itself as the number of services using a cluster grows.

Synopsis

Bigtable is a distributed storage system designed to scale to petabytes of data and thousands of machines with broad applicability, performance, and availability. Unlike a full relational database, it provides clients with a simple data model that supports dynamic control over the layout and format of the data. Clients can control the locality of their data through choices in their schemas, including whether data is served from memory or from disk.

The Bigtable data model is a sparse, distributed, persistent multidimensional sorted map, indexed by row key, column key, and timestamp. Data is maintained in lexicographic order by row key, and the row range of a table is dynamically partitioned; each row range is called a tablet. Column keys are grouped into column families, which are the basic unit of access control and have the syntax family:qualifier. Timestamps allow each cell in a Bigtable to contain multiple versions of the same data.
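
To make this model concrete, below is a minimal in-memory sketch in Python of such a map. It only mirrors the description above (lexicographically ordered row keys, family:qualifier column keys, versioned cells, row ranges as the unit of partitioning); the class and method names are invented for illustration, and this is in no way the real distributed, persistent implementation.

```python
import bisect
import time


class ToyBigtable:
    """Toy, single-machine model of the data model only: a sorted map from
    (row key, column key, timestamp) to an uninterpreted string of bytes."""

    def __init__(self):
        self._rows = []    # row keys kept in lexicographic order
        self._cells = {}   # (row, "family:qualifier") -> [(timestamp, value), ...]

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time_ns()
        if row not in self._rows:
            bisect.insort(self._rows, row)
        versions = self._cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest version first

    def get(self, row, column, timestamp=None):
        # Return the most recent version at or before `timestamp`.
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return ts, value
        return None

    def row_range(self, start_row, end_row):
        # Rows in [start_row, end_row): the contiguous ranges that Bigtable
        # dynamically partitions into tablets.
        lo = bisect.bisect_left(self._rows, start_row)
        hi = bisect.bisect_left(self._rows, end_row)
        return self._rows[lo:hi]


# Example in the spirit of the paper's Webtable: the row key is a reversed
# URL and the "anchor" column family has one qualifier per referring site.
table = ToyBigtable()
table.put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
print(table.get("com.cnn.www", "anchor:cnnsi.com"))
print(table.row_range("com.a", "com.z"))
```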

The Bigtable API provides functions for creating and deleting tables and column families, as well as for changing cluster, table, and column-family metadata, such as access control rights. Clients can iterate over multiple column families, and scans can be limited by row, column, and timestamp.
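
As a rough sketch of the shape of such an interface, the Python stub below maps one method to each capability described above. All names here (BigtableLikeClient and its methods) are hypothetical; they do not reproduce the actual Bigtable API (or HBase's).

```python
from typing import Iterator, Optional, Protocol, Tuple


class BigtableLikeClient(Protocol):
    """Hypothetical interface covering the operations listed above."""

    # Table and column-family administration, including access control.
    def create_table(self, name: str) -> None: ...
    def delete_table(self, name: str) -> None: ...
    def create_column_family(self, table: str, family: str) -> None: ...
    def delete_column_family(self, table: str, family: str) -> None: ...
    def set_family_acl(self, table: str, family: str, acl: str) -> None: ...

    # Single-cell write with an optional explicit timestamp (version).
    def write(self, table: str, row: str, column: str,
              value: bytes, timestamp: Optional[int] = None) -> None: ...

    # Scan over a row range; results can be restricted to selected column
    # families and to a timestamp range, as described above.
    def scan(self, table: str, start_row: str, stop_row: str,
             families: Optional[list] = None,
             time_range: Optional[Tuple[int, int]] = None,
             ) -> Iterator[Tuple[str, str, int, bytes]]: ...
```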

The Google File System (GFS) is used to store logs and data files, and Bigtable uses Google's SSTable file format to store data internally. It also depends on Chubby, a distributed lock service, for concurrency control and coordination tasks such as ensuring there is a single active master and tracking tablet servers. Unfortunately, the two systems are tightly coupled: if Chubby becomes unavailable for an extended period, Bigtable becomes unavailable as well.
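
The SSTable idea itself is simple enough to sketch: an immutable file of sorted key/value pairs that, once written, is only ever read (via an index and binary search). The toy below keeps everything in memory purely to illustrate the access pattern; the real format is an on-disk sequence of blocks plus a block index, and the name ToySSTable is, of course, invented.

```python
import bisect


class ToySSTable:
    """Immutable, sorted key/value map built once and then only read."""

    def __init__(self, entries):
        # Sorted at build time and never modified afterwards. Immutability
        # is what lets Bigtable share SSTables between tablets and simply
        # garbage-collect obsolete ones instead of updating them in place.
        items = sorted(entries)
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start_key, stop_key):
        # All entries with start_key <= key < stop_key.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


sst = ToySSTable([("com.cnn.www", b"<html>..."), ("com.bbc.www", b"<html>...")])
print(sst.get("com.cnn.www"))
print(sst.scan("com.a", "com.z"))
```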

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. The master server handles metadata and balances the tablet workload across the tablet servers. Each tablet server performs the local read and write work on the tablets assigned to it. For data, clients communicate directly with tablet servers rather than with the master; as a result, the master server is lightly loaded.
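
The sketch below gives a rough, hypothetical picture of how these pieces relate (all names are invented, and a plain in-memory index stands in for the tablet-location metadata, which the real system keeps in a METADATA table whose root location is stored in Chubby): the client library resolves and caches tablet locations and then sends reads and writes straight to the owning tablet server, keeping the master off the data path.

```python
import bisect


class ToyTabletServer:
    """Serves reads and writes for the tablets assigned to it."""

    def __init__(self, name):
        self.name = name
        self.cells = {}

    def apply(self, row, column, value):
        self.cells[(row, column)] = value


class ToyLocationIndex:
    """Stand-in for the tablet-location metadata: maps a row to the tablet
    server owning the tablet whose row range contains that row."""

    def __init__(self):
        self.starts = []  # sorted start rows of tablets
        self.owner = {}   # start row -> ToyTabletServer

    def assign(self, start_row, server):
        # In the real system the master performs assignment and balancing.
        if start_row not in self.owner:
            bisect.insort(self.starts, start_row)
        self.owner[start_row] = server

    def locate(self, row):
        i = bisect.bisect_right(self.starts, row) - 1
        return self.owner[self.starts[i]]


class ToyClientLibrary:
    """Resolves and caches tablet locations, then talks to tablet servers
    directly, so the master is not on the read/write path."""

    def __init__(self, locations):
        self.locations = locations
        self.cache = {}

    def write(self, row, column, value):
        server = self.cache.get(row)
        if server is None:
            server = self.locations.locate(row)
            self.cache[row] = server
        server.apply(row, column, value)  # a direct RPC in the real system


# Two tablets split at row "m", assigned to two tablet servers.
locations = ToyLocationIndex()
locations.assign("", ToyTabletServer("ts1"))
locations.assign("m", ToyTabletServer("ts2"))
client = ToyClientLibrary(locations)
client.write("com.cnn.www", "anchor:cnnsi.com", b"CNN")
```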

The following improvements were made in the implementation of Bigtable to increase its performance:

  • locality groups – clients can group multiple column families into a locality group so that reads touching only those families are more efficient;
  • compression – compression algorithms reduce the space occupied by the data;
  • caching for read performance – tablet-server caches speed up repeated reads of the same data and reads of data close to recently read data;
  • Bloom filters – per-SSTable filters reduce the number of disk accesses needed for reads of rows or columns that are not present (see the sketch after this list);
  • commit-log implementation – a single commit log per tablet server allows servers to be reconstituted after failures;
  • speeding up tablet recovery – before the master moves a tablet between tablet servers, the source server compacts the tablet so that the new server does not need to replay any commit-log entries; and
  • exploiting immutability – because SSTables are immutable, obsolete SSTables must be garbage-collected, but tablets can be split quickly by sharing SSTables.
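
To illustrate the Bloom-filter refinement referenced above: Bigtable can keep a Bloom filter for each SSTable, so a read for a row/column pair that is not present can usually be answered without touching disk. The sketch below is a generic toy Bloom filter, not Bigtable's actual implementation; the sizes and hash scheme are arbitrary.

```python
import hashlib


class ToyBloomFilter:
    """Toy Bloom filter: a membership test with no false negatives, so a
    negative answer lets a tablet server skip the disk read for an SSTable."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent": safe to skip this SSTable.
        return all(self.bits[pos] for pos in self._positions(key))


bf = ToyBloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
print(bf.might_contain("com.example/anchor:foo"))        # almost surely False
```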

Bigtable's performance is evaluated with random and sequential read and write operations, and its scalability is analyzed by varying the number of tablet servers. On a single tablet server, random reads are slower than all other operations; random reads from memory are much faster; random and sequential writes perform better than random reads; sequential reads perform better than random reads; and scans are faster still. With more tablet servers, aggregate throughput increases dramatically, but not linearly, due to load imbalance in multi-server configurations.

Bigtable is used in several Google applications, such as Google Analytics, Google Earth, and Personalized Search.

In the process of designing, implementing, maintaining, and supporting Bigtable, some lessons were learned:

1. large distributed systems are vulnerable to many types of failures;

2. it is important to delay adding new features until it is clear how they will be used;

3. proper system-level monitoring is important;

4. most importantly, simple designs are valuable (for example, clarity of code and design is of immense help in maintaining and debugging code).

Some distributed storage solutions and parallel databases have components that overlap, in some way, with those of Bigtable and the services it builds on (Chubby and GFS), although on the whole they differ. Some of these solutions are the Boxwood project, CAN, Chord, Tapestry and Pastry, D2B, C-Store, Sybase IQ, SenSage, and KDB+.

In August 2006, Bigtable was being used in more than 60 Google projects, and its users positively evaluated its performance, its high availability, and its ability to scale cluster capacity by adding more machines. In addition, by developing its own storage solution, Google gained more flexibility in the design of the data model.

Analysis

The paper presents the successful implementation of Bigtable, a distributed data storage system that is scalable and reliable and has an innovative data model, supported by representation techniques such as immutable SSTables, Bloom filters, and compression. It also explains how the implementation behaves at scale and how bottlenecks are avoided through the interactions between Chubby, the master server, and the tablet servers.

The solution created by Google is relevant and adequate given the need to store and manage a very large amount of data (petabytes) with high performance and availability. With Bigtable, high performance can be achieved on low-cost commodity hardware; the system is also very flexible (the article mentions several projects that use the solution), and the results show its ability to handle different workloads.

The experiments demonstrate Bigtable's efficiency (except between one and 50 tablet servers, a range in which per-server performance drops because load is not evenly balanced across the servers). However, the performance evaluation would benefit from considering other parameters, since the article reports results only as a function of the number of tablet servers. Comparing performance with other storage systems would also be valuable, as the experiments cover only the solution created by Google.