LinkedIn is a popular professional networking platform used by over 700 million members worldwide. With that many members generating data around the clock, LinkedIn relies on powerful database systems to store and organize its information. But what database does LinkedIn actually use? Let’s take a closer look.
The Evolution of LinkedIn’s Database Infrastructure
In its early days, LinkedIn used the open-source MySQL database for its core functionality. MySQL is a popular relational database management system used by many major web applications, and it gave LinkedIn a cost-effective, flexible database solution while the company was getting started.
However, as LinkedIn began to scale and accumulate more member data, they started hitting limitations with MySQL. Specifically, MySQL constrained LinkedIn in two key areas:
- Scale – MySQL struggled to scale with the rapid growth of LinkedIn’s membership and usage.
- Flexibility – MySQL was not well suited to newer, less structured data types such as social-graph connections.
To resolve these issues, LinkedIn began supplementing MySQL with NoSQL database systems around 2009–2010. NoSQL databases such as Cassandra and Voldemort were designed to be highly scalable and flexible for large-scale web applications.
Cassandra for Scalability
In 2010, LinkedIn adopted Apache Cassandra to help scale their core infrastructure while still using MySQL as the primary operational database. Cassandra is a distributed NoSQL database designed to handle large amounts of structured data across clusters of commodity servers.
Some key advantages of Cassandra for LinkedIn:
- Linear scalability – Easy to add more nodes without downtime.
- High availability – No single point of failure, built-in replication.
- Performance with large datasets – Fast writes optimized for heavy workloads.
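The linear scalability in that list comes from consistent hashing, the partitioning technique behind Cassandra's token ring: adding a node relocates only the keys that fall on the new node's arcs, not the whole dataset. The sketch below is purely illustrative (MD5 and the node names are stand-ins, not Cassandra internals, which use the Murmur3 partitioner):

```python
import hashlib
from bisect import bisect

def ring_position(key: str) -> int:
    # Place a key on a 128-bit hash ring (MD5 stands in for
    # Cassandra's Murmur3 purely for illustration).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node claims many points ("virtual nodes") on the ring,
        # which evens out how much of the keyspace each node owns.
        self.points = sorted(
            (ring_position(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.hashes = [h for h, _ in self.points]

    def owner(self, key: str) -> str:
        # A key belongs to the first node clockwise from its position.
        i = bisect(self.hashes, ring_position(key)) % len(self.points)
        return self.points[i][1]

keys = [f"member:{i}" for i in range(10_000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Only keys landing on node-d's arcs change owner; the rest stay put.
moved = sum(1 for k in keys if before.owner(k) != after.owner(k))
print(f"{moved / len(keys):.0%} of keys moved")
```

With a naive `hash(key) % num_nodes` scheme, adding a node would remap nearly every key; here only roughly a quarter of them move, which is what makes adding capacity cheap.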
Cassandra allowed LinkedIn to reliably store and manage the growing volumes of core user profile data as they added more members. By 2011, LinkedIn reported handling over 1 billion writes per day with Cassandra.
Voldemort for Flexibility
LinkedIn also built its own distributed key-value NoSQL database, Project Voldemort, which it open-sourced in 2009. Inspired by Amazon’s Dynamo design, Voldemort was a flexible, highly available store for large volumes of non-relational data.
LinkedIn used Voldemort for:
- News feed data
- Social graph connections
- Recommendation engines
Voldemort provided low-latency performance and horizontal scaling for these less structured data types. It also replicated data across multiple servers for high availability.
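That replication scheme can be sketched as a toy key-value store. This is illustrative only; the names and parameters are invented, and the real Voldemort adds Dynamo-style versioning with vector clocks and read repair:

```python
import hashlib

class ReplicatedStore:
    """Toy Voldemort-style store: every key is written to N replica
    servers, so reads survive individual node failures."""

    def __init__(self, num_servers=5, replication=3):
        self.servers = [{} for _ in range(num_servers)]
        self.replication = replication

    def replicas(self, key):
        # Choose `replication` consecutive servers starting at the
        # key's hash slot, wrapping around the cluster.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.servers)
        return [(start + i) % len(self.servers) for i in range(self.replication)]

    def put(self, key, value):
        # Write the value to every replica for the key.
        for i in self.replicas(key):
            self.servers[i][key] = value

    def get(self, key, down=frozenset()):
        # Serve the read from the first replica that is reachable.
        for i in self.replicas(key):
            if i not in down:
                return self.servers[i].get(key)
        raise RuntimeError("all replicas for this key are unavailable")

store = ReplicatedStore()
store.put("member:42:connections", ["member:7", "member:19"])
primary = store.replicas("member:42:connections")[0]
# The read still succeeds even with the primary replica "down".
print(store.get("member:42:connections", down={primary}))
```

The design choice to favor availability (answer from any live replica) over strict consistency is exactly the Dynamo trade-off Voldemort made for social-graph and feed data.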
By leveraging both Cassandra and Voldemort, LinkedIn now had a scalable and flexible NoSQL backend to power their growing social network in tandem with MySQL as the primary transactional database.
LinkedIn’s Database Infrastructure Today
Today, LinkedIn operates a complex database infrastructure that enables its social network capabilities and business solutions. Here are some key components of their current database landscape:
Operational Databases
- MySQL – Still used as the primary OLTP database for core functions like user profiles, feeds, messaging, etc. LinkedIn has heavily optimized MySQL for efficiency at their scale.
- MongoDB – Document-based operational database used for some newer services and products. Provides more flexibility than MySQL.
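The flexibility difference is easiest to see side by side. Below, SQLite (from the Python standard library) stands in for MySQL and a plain dict for a document store; this is a schematic contrast, not LinkedIn's actual schema:

```python
import sqlite3

# Relational store (SQLite standing in for MySQL): the schema is fixed
# up front, and adding a new attribute means an ALTER TABLE migration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT, headline TEXT)")
db.execute("INSERT INTO profiles VALUES (1, 'Ada', 'Engineer')")

# Document store (a dict standing in for MongoDB): each record is
# self-describing, so documents can carry different fields per record
# with no schema migration at all.
documents = {
    1: {"name": "Ada", "headline": "Engineer"},
    2: {"name": "Grace", "skills": ["COBOL", "compilers"]},  # new field, no ALTER
}

row = db.execute("SELECT name FROM profiles WHERE id = 1").fetchone()
print(row[0], documents[2]["skills"])
```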
Analytical Databases
- Azure Synapse Analytics (formerly Azure SQL Data Warehouse) – Microsoft’s fully managed, petabyte-scale cloud data warehouse. Used for BI and analytics applications.
- Druid – Columnar database optimized for real-time analytics on time-series data from high velocity event streams.
- Vertica – Analytic SQL database designed to manage and analyze structured and semi-structured data at scale.
Distributed NoSQL Databases
- Cassandra – Manages high volumes of structured core social data across clusters, powering core apps and features.
- Voldemort – Stores semi-structured data like social graphs and news feeds for low-latency access.
- Espresso – In-house distributed, document-oriented NoSQL store built on top of MySQL storage nodes, optimized for LinkedIn’s workloads.
- Elasticsearch – Search and analytics engine that leverages Lucene for full-text search capabilities.
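Lucene-style full-text search rests on an inverted index: a map from each term to the documents containing it, so a multi-term query becomes a set intersection rather than a scan. A toy sketch (not Lucene's actual data structures, which add analyzers, scoring, and compressed posting lists):

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: term -> set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "software engineer at linkedin",
    2: "data engineer building pipelines",
    3: "product manager",
}
index = build_index(docs)

# A multi-term query intersects the posting sets for each term.
hits = index["engineer"] & index["data"]
print(sorted(hits))  # [2]
```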
Caching Layers
LinkedIn leverages caching technologies like Memcached and Redis to reduce database loads and improve performance:
- Memcached – In-memory object caching system to reduce database queries.
- Redis – Key-value cache and store supporting more complex data structures.
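The usual way these caches sit in front of a database is the cache-aside pattern: check the cache first, fall back to the database on a miss, and populate the cache for later reads. A minimal sketch with an in-process dict standing in for Memcached or Redis (the TTL value and function names are invented for illustration):

```python
import time

cache = {}       # stand-in for Memcached/Redis
db_queries = 0   # counts how often we actually hit the "database"
TTL = 30.0       # seconds a cached entry stays fresh

def query_database(member_id):
    # Simulated expensive database lookup.
    global db_queries
    db_queries += 1
    return {"id": member_id, "name": f"member-{member_id}"}

def get_profile(member_id):
    entry = cache.get(member_id)
    if entry and time.monotonic() - entry["at"] < TTL:
        return entry["value"]              # cache hit: no DB round trip
    value = query_database(member_id)      # cache miss: go to the DB
    cache[member_id] = {"value": value, "at": time.monotonic()}
    return value

for _ in range(100):
    get_profile(42)
print(db_queries)  # 1 -- the other 99 reads were served from cache
```

For a read-heavy workload like profile views, this is why a cache layer slashes database load: one query serves an arbitrary number of reads until the entry expires or is invalidated.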
Data Pipeline and Workflow Tools
LinkedIn uses various data pipeline and workflow orchestration frameworks to move data between their databases and applications:
- Apache Kafka – Distributed publish-subscribe messaging system, originally created at LinkedIn, used for streaming data ingestion and processing.
- Apache Airflow – Workflow orchestration framework to programmatically author, schedule and monitor workflows.
- Apache NiFi – Automated dataflow between systems, enabling movement, processing, and monitoring of data.
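Kafka's core abstraction is an append-only log per topic that many consumers read at their own offsets. A toy in-process sketch of that model (not Kafka's real API; the topic and consumer names are invented):

```python
from collections import defaultdict

class Broker:
    """Toy Kafka-style broker: an append-only log per topic, with each
    consumer advancing its own read offset independently."""

    def __init__(self):
        self.logs = defaultdict(list)
        self.offsets = defaultdict(int)  # (topic, consumer) -> next index

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def poll(self, topic, consumer):
        # Hand back everything this consumer has not yet seen. Because
        # each consumer keeps its own offset, a slow consumer never
        # blocks a fast one -- the key decoupling pub-sub provides.
        pos = self.offsets[(topic, consumer)]
        batch = self.logs[topic][pos:]
        self.offsets[(topic, consumer)] = len(self.logs[topic])
        return batch

broker = Broker()
broker.publish("profile-updates", {"member": 42, "field": "headline"})
broker.publish("profile-updates", {"member": 7, "field": "skills"})

first = broker.poll("profile-updates", "search-indexer")   # both messages
second = broker.poll("profile-updates", "search-indexer")  # [] -- caught up
other = broker.poll("profile-updates", "analytics")        # independent offset
print(len(first), len(second), len(other))
```

This is how one stream of database changes can feed the search index, analytics, and caches at the same time without the producers knowing about any of them.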
Data Lakes
LinkedIn uses data lakes to collect and consolidate raw data for big data analytics:
- HDFS – The Hadoop Distributed File System underpins LinkedIn’s large on-premises Hadoop data lake.
- Azure Data Lake Storage – Managed cloud data lake also used by LinkedIn.
Key Takeaways
Here are some key points on LinkedIn’s database infrastructure evolution:
- Started with MySQL as their primary relational database.
- Adopted Cassandra and Voldemort NoSQL databases to scale write throughput and support newer data types.
- Currently uses a hybrid infrastructure with MySQL, MongoDB, Cassandra, Voldemort, and other data stores.
- Leverages caching, data pipelines, workflow tools, and data lakes like HDFS to optimize their architecture.
- Migrating analytics workloads to cloud data warehouses such as Azure Synapse Analytics.
- Continually optimizing databases like MySQL and developing custom solutions like Espresso to meet their specific use cases.
By combining the flexibility of NoSQL with the reliability of MySQL, adding in powerful caching, analytics, and data ingestion technologies, LinkedIn has built a world-class database infrastructure capable of powering their professional network at massive scale.
While the details may evolve, LinkedIn will continue relying on this powerful database foundation to support their expanding products and services for years to come.