Google is the world’s most popular search engine and one of the largest technology companies. With billions of searches performed every day, Google requires a robust infrastructure to support its massive data processing needs. One key component of this infrastructure is the database systems that store and organize Google’s vast trove of data.
Google’s Database Requirements
Google has unique database requirements compared to most companies due to the sheer amount of data it handles. Some key factors that influence Google’s choice of database technology include:
- Scale – Commonly cited estimates put Google at over 3.5 billion searches per day, which averages to roughly 40,000 queries per second, and some estimates run as high as 63,000 queries per second. This creates enormous data storage and processing needs.
- Speed – Users expect results in fractions of a second, so Google’s databases need to be highly performant.
- Reliability – With millions of users constantly accessing services, downtime is unacceptable. Google requires resilient database systems.
- Flexibility – Google operates a diverse range of services with different data models and access patterns. The database infrastructure must be flexible.
Very few database solutions can meet the unique demands of Google’s infrastructure. Google utilizes a variety of database technologies customized for different applications across the company’s products and services.
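Daily and per-second search figures are related by simple arithmetic, which makes the scale numbers above easy to sanity-check. The sketch below is only an illustration; the function names are hypothetical and the inputs are the publicly quoted estimates, not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(searches_per_day: float) -> float:
    """Average queries per second implied by a daily total."""
    return searches_per_day / SECONDS_PER_DAY


def daily_total(queries_per_second: float) -> float:
    """Daily total implied by a sustained per-second rate."""
    return queries_per_second * SECONDS_PER_DAY


print(round(average_qps(3.5e9)))  # 40509
print(daily_total(63_000))        # 5443200000
```

In other words, 3.5 billion searches per day is an average of about 40,000 per second, while a sustained 63,000 per second would add up to roughly 5.4 billion per day.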
Spanner – Google’s Globally Distributed Database
One of Google’s most well-known in-house databases is Spanner. Publicly described in a 2012 paper, Spanner popularized the globally distributed NewSQL database. It combines the schemas, SQL, and transactions of relational systems with the horizontal scalability associated with NoSQL stores.
Some key facts about Spanner:
- Horizontally scalable across data centers, so capacity and throughput grow as machines are added.
- Strong consistency and transactional support, unlike eventually consistent NoSQL databases.
- Automated replication across continents for high availability.
- Automatic sharding to distribute and parallelize data across nodes.
- Fine-grained time synchronization that allows consistent snapshot reads across regions without blocking writes.
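Spanner achieves efficient multi-region distribution through TrueTime, a clock service backed by GPS receivers and atomic clocks that reports the current time as an interval with bounded uncertainty. By waiting out that uncertainty before a commit becomes visible, Spanner can assign timestamps that reflect a global ordering of transactions, the key innovation behind its globally consistent transactions.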
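The timestamping idea can be illustrated with a toy model of the "commit wait" rule described in the Spanner paper. This is a minimal sketch under stated assumptions, not Spanner's implementation: `TrueTimeSim`, `TTInterval`, and `commit` are hypothetical names, the 5 ms uncertainty bound is an arbitrary illustrative value, and the local monotonic clock stands in for GPS and atomic-clock time sources.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # true time is assumed to be no earlier than this
    latest: float    # ... and no later than this


class TrueTimeSim:
    """Toy stand-in for TrueTime: local clock plus a fixed uncertainty bound."""

    def __init__(self, epsilon: float = 0.005, clock=time.monotonic):
        self.epsilon = epsilon  # assumed bound; illustrative value only
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True once t has definitely passed, even allowing for uncertainty."""
        return self.now().earliest > t


def commit(tt: TrueTimeSim, sleep=time.sleep) -> float:
    # Pick a timestamp no earlier than the latest possible current time.
    s = tt.now().latest
    # Commit wait: do not report the commit until s is guaranteed to be past.
    while not tt.after(s):
        sleep(tt.epsilon / 10)
    return s


tt = TrueTimeSim()
first = commit(tt)
second = commit(tt)  # starts only after the first commit has returned
print(first < second)  # True
```

Because each commit waits until its timestamp is definitely in the past, any transaction that starts afterwards receives a strictly larger timestamp. The cost is a wait of about twice the uncertainty bound per commit, which is why keeping that bound small matters.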
Google uses Spanner across many of its core services, such as AdWords (now Google Ads) and Google Play, to store structured relational data. Spanner supports the SQL queries, schemas, and ACID transactions that developers know from traditional relational databases.
Feature | Description |
---|---|
Scalability | Designed to scale across millions of machines in hundreds of data centers and trillions of database rows. |
Availability | Replicated synchronously across data centers to survive region-wide failures and natural disasters. |
Consistency | Offers strong transactional consistency across regions using TrueTime for global time ordering. |
Distribution | Automatically sharded and replicated across clusters and continents. |
Timing | Uses atomic clocks and GPS for global TrueTime ordering of transactions. |
Bigtable – Google’s Distributed Big Data Store
Google Bigtable is a highly scalable NoSQL database for big data. It provides petabyte-scale sparse table storage for the large datasets used across Google products.
Bigtable was one of Google’s earliest in-house databases designed specifically for web-scale applications. It inspired development of open-source NoSQL databases like Apache HBase.
Some key characteristics of Bigtable:
- Wide-column (column-family) NoSQL data model designed for high performance.
- Originally built on the Google File System (and later its successor, Colossus) for distributed storage across cheap commodity servers.
- Workloads are distributed across tablets using row key range partitioning.
- Low-latency random access to petabytes of structured and semi-structured data.
- Millions of operations per second throughput on inexpensive hardware.
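Row key range partitioning can be sketched with a sorted list of split keys and a binary search. This is a minimal illustration only: `TabletMap`, the split keys, and the tablet names are hypothetical and do not reflect Bigtable's actual metadata structures.

```python
import bisect


class TabletMap:
    """Maps row keys to tablets using sorted, contiguous key ranges."""

    def __init__(self, split_keys, tablet_ids):
        # split_keys[i] is the first row key owned by tablet_ids[i + 1].
        assert len(tablet_ids) == len(split_keys) + 1
        assert list(split_keys) == sorted(split_keys)
        self.split_keys = list(split_keys)
        self.tablet_ids = list(tablet_ids)

    def locate(self, row_key: str) -> str:
        # Count how many split keys are <= row_key to find the owning tablet.
        return self.tablet_ids[bisect.bisect_right(self.split_keys, row_key)]


tablets = TabletMap(
    ["g", "n", "t"],
    ["tablet-0", "tablet-1", "tablet-2", "tablet-3"],
)
print(tablets.locate("com.google.maps"))  # tablet-0
print(tablets.locate("org.example"))      # tablet-2
```

Because rows are kept in key order, keys that sort near each other land in the same tablet, so a range scan touches few tablets. The flip side is that sequential keys (timestamps, for example) can concentrate load on a single tablet.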
Bigtable remains a foundational technology for Google’s data pipelines. It is particularly well suited to high-throughput, low-latency workloads on huge datasets, such as those behind services like Google Maps, Gmail, and Search.
Feature | Description |
---|---|
Storage | Distributed across low-cost commodity servers using Google File System (GFS). |
Data Model | Sparse table with dynamic columns organized by row key, column key, and timestamp. |
Tablets | Tables are horizontally partitioned by row range into tablets, each roughly 100-200 MB by default according to the Bigtable paper. |
Compression | Data blocks are compressed on disk, typically with fast codecs such as Snappy. |
Caching | Extensively caches data in memory across machines. |
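The data model in the table above, a sparse map indexed by row key, column key, and timestamp, can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `SparseTable` and its methods are hypothetical names, and the row and column names are only examples.

```python
from collections import defaultdict


class SparseTable:
    """Toy (row key, column key, timestamp) -> value map."""

    def __init__(self):
        # row -> column -> {timestamp: value}; cells never written take no space.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None


table = SparseTable()
table.put("com.cnn.www", "contents:", 3, "<html>v3")
table.put("com.cnn.www", "contents:", 5, "<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")

print(table.get("com.cnn.www", "contents:"))     # <html>v5
print(table.get("com.cnn.www", "contents:", 4))  # <html>v3
print(table.get("com.cnn.www", "anchor:other"))  # None
```

Each row can carry a different set of columns, and each cell can hold several timestamped versions, which is what "sparse table with dynamic columns" means in practice.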
Other Notable Google Databases
Spanner and Bigtable represent just two of the many database technologies developed at Google. Some other notable examples include:
Megastore
Megastore is a semi-relational datastore for interactive online services. It provides ACID transactions within partitions of data called entity groups, layered on top of Bigtable’s scalable NoSQL storage.
Dremel
Dremel is a columnar query engine for interactive analysis of web-scale datasets. It is the technology behind Google BigQuery.
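The benefit of a columnar layout is that an aggregate over one field reads only that field's values. The sketch below illustrates the idea with flat records; the field names and data are hypothetical, and Dremel's handling of nested and repeated fields is deliberately left out.

```python
# Row-oriented records: every field of a record is stored together.
rows = [
    {"country": "US", "latency_ms": 120},
    {"country": "DE", "latency_ms": 80},
    {"country": "US", "latency_ms": 100},
]


def to_columns(records):
    """Pivot flat records (all sharing the same fields) into one list per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# The average touches only the latency_ms column, never the country values.
latencies = columns["latency_ms"]
print(sum(latencies) / len(latencies))  # 100.0
```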
F1
F1 is a distributed SQL database built on Spanner infrastructure. It combines high availability with strong consistency and transactional support.
Colossus (GFS2)
Colossus, also known as GFS2, is the successor to the Google File System. It provides faster distributed storage for Google’s evolving infrastructure needs.
SSTable
An SSTable (Sorted String Table) is a persistent, immutable file format that stores key-value pairs sorted by key. It is the on-disk building block for Bigtable and other Google storage systems.
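The core idea of an SSTable, an immutable map kept sorted by key, can be sketched in memory. This is only an illustration: the class below is a hypothetical toy that ignores the on-disk block layout, indexes, and compression of the real format.

```python
import bisect


class SSTable:
    """Toy immutable sorted map: built once, then only read."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())
        # Tuples keep the contents read-only after construction.
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key, default=None):
        # Binary search works because keys are stored in sorted order.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


sst = SSTable({"b": 2, "a": 1, "d": 4})
print(sst.get("a"))        # 1
print(sst.get("c"))        # None
print(sst.scan("a", "c"))  # [('a', 1), ('b', 2)]
```

Sorted order gives both fast point lookups and cheap range scans, and immutability means readers never need to coordinate with writers. Updates are handled by writing new tables and merging them later rather than by editing an existing one.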
Open Source Databases
While most of Google’s internal database technologies remain proprietary, some have open source equivalents:
- Bigtable -> Apache HBase, Apache Cassandra
- Megastore -> Apache HBase, Apache Kudu
- Dremel -> Apache Drill
- Spanner -> CockroachDB
These open source projects emulate some aspects of Google’s databases, but they do not reproduce Google’s full scale or its proprietary features.
Cloud Database Options
Google also offers fully-managed public database services through its Google Cloud Platform (GCP):
- Cloud Bigtable – Petabyte-scale NoSQL database
- Cloud Spanner – Horizontally-scalable relational database
- Cloud SQL – Managed MySQL, PostgreSQL, and SQL Server databases
- Firestore – Serverless NoSQL document database
- BigQuery – Serverless enterprise data warehouse
These services run on Google’s infrastructure, and several of them (Cloud Bigtable, Cloud Spanner, and BigQuery) expose Google’s in-house database technologies while hiding the implementation details.
Conclusion
Google uses a diverse ecosystem of purpose-built databases to power its products and services at web scale. Spanner and Bigtable have proven to be foundational technologies enabling Google’s approach to distributed data management. While most of Google’s systems are proprietary, open source alternatives and managed cloud services now provide a glimpse into Google’s data storage and processing capabilities.