LinkedIn is one of the largest professional social networks in the world, with over 800 million members as of October 2022. With so many users and so much data to manage, LinkedIn requires a robust and scalable database infrastructure to support its operations. In this article, we will explore the key requirements LinkedIn’s database needs to fulfill and the technical decisions the company has made regarding its database technologies.
Some of the key factors that influence LinkedIn’s choice of database include:
– Scalability – With hundreds of millions of users, LinkedIn generates immense amounts of data that must be stored, accessed, and analyzed. The database needs to scale as that volume grows.
– Speed – The database needs to be able to handle millions of read/write operations per second to support real-time updates. Slow performance will negatively impact user experience.
– Reliability – As a social network with users across the globe, LinkedIn needs to maintain 24/7 availability. The database needs to have high uptime and quick failover when issues occur.
– Flexibility – Given LinkedIn’s diverse data requirements across profiles, posts, jobs, skills, and more, the database needs flexibility in data models and query capabilities.
– Cost – LinkedIn operates on a freemium business model and needs to maintain profitability. The database solution needs to be cost-effective at scale.
Based on these requirements, LinkedIn utilizes a combination of relational and NoSQL database systems to power its platform.
Relational Databases
For its foundational database needs, LinkedIn relies on relational database management systems (RDBMS). An RDBMS stores data in tables that are linked together via common keys. Some benefits of the relational model that make it suitable for parts of LinkedIn’s infrastructure are:
– Structured data – The tabular relational model can efficiently organize LinkedIn profile data, jobs, skills, and other structured content.
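– ACID compliance – Relational databases support ACID (Atomicity, Consistency, Isolation, Durability) transactions: a transaction either applies completely or not at all, concurrent transactions do not corrupt each other’s results, and committed data survives failures.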
– Query capabilities – LinkedIn can use complex SQL queries to derive insights from relational data. This facilitates analytics and reporting.
– Ease of use – Relational systems have been around for decades, making them a mature technology with a large pool of experienced engineers.
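To make the idea of tables linked via common keys concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are illustrative assumptions, not LinkedIn's actual schema, and SQLite stands in for a production RDBMS.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (member_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE skills (skill_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- Join table: links members to skills through their keys.
    CREATE TABLE member_skills (
        member_id INTEGER NOT NULL REFERENCES members(member_id),
        skill_id  INTEGER NOT NULL REFERENCES skills(skill_id),
        PRIMARY KEY (member_id, skill_id)
    );
""")
conn.executemany("INSERT INTO members VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO skills VALUES (?, ?)", [(10, "SQL"), (11, "Python")])
conn.executemany("INSERT INTO member_skills VALUES (?, ?)", [(1, 10), (1, 11), (2, 11)])


def members_with_skill(skill_name):
    """Return the names of members who list the given skill, sorted by name."""
    rows = conn.execute(
        """
        SELECT m.name
        FROM members AS m
        JOIN member_skills AS ms ON ms.member_id = m.member_id
        JOIN skills AS s ON s.skill_id = ms.skill_id
        WHERE s.name = ?
        ORDER BY m.name
        """,
        (skill_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `members_with_skill("Python")` walks from the skill to its members through the join table, which is the kind of key-based lookup the relational model is built for.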
For its critical relational database needs, LinkedIn uses MySQL, a popular open-source RDBMS. MySQL is known for its performance and reliability, and it can be scaled horizontally across commodity servers through replication and sharding. To handle billions of queries per day, LinkedIn runs a massive deployment of MySQL across multiple data centers.
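One common way to spread a relational dataset across many servers is hash-based sharding, where each row's key determines which server holds it. The sketch below assumes a fixed list of hypothetical shard names and a `shard_for` helper invented for illustration; it is not a description of how LinkedIn actually routes queries.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to hosts.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(member_id, shards=SHARDS):
    """Pick a shard for a member ID using a stable hash of the key."""
    digest = hashlib.sha256(str(member_id).encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it modulo the shard count.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

The same member ID always maps to the same shard, so reads and writes for one member land on one server. The trade-off of plain modulo hashing is that changing the number of shards remaps most keys.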
Key Relational Data in LinkedIn
Some examples of relational data stored by LinkedIn in MySQL include:
Data Type | Includes |
---|---|
User profiles | Name, contact info, education, experience, skills, recommendations |
Jobs | Title, description, requirements, compensation, company |
Skills | Name, category, related skills |
Posts and articles | Text, media, metadata, comments |
This structured data fits naturally into relational tables optimized for online transaction processing (OLTP).
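A typical OLTP operation touches several rows and must succeed or fail as a unit. The following sketch, again using `sqlite3` as a stand-in and a hypothetical `jobs`/`applications` schema, shows a transaction that is rolled back when a constraint is violated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (
        job_id     INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        open_slots INTEGER NOT NULL CHECK (open_slots >= 0)
    );
    CREATE TABLE applications (
        member_id INTEGER NOT NULL,
        job_id    INTEGER NOT NULL REFERENCES jobs(job_id),
        PRIMARY KEY (member_id, job_id)
    );
    INSERT INTO jobs VALUES (1, 'Data Engineer', 1);
""")


def apply_to_job(member_id, job_id):
    """Record an application and claim a slot atomically; return True on success."""
    try:
        # The connection context manager commits on success
        # and rolls back if any statement raises.
        with conn:
            conn.execute(
                "INSERT INTO applications VALUES (?, ?)", (member_id, job_id)
            )
            conn.execute(
                "UPDATE jobs SET open_slots = open_slots - 1 WHERE job_id = ?",
                (job_id,),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK or PRIMARY KEY constraint failed; nothing was persisted.
        return False
```

When the second applicant would push `open_slots` below zero, the update fails and the already-inserted application row is rolled back with it, so the two tables never disagree.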
NoSQL Databases
While relational databases form the core of its infrastructure, LinkedIn also utilizes NoSQL databases for certain use cases. NoSQL databases generally use non-tabular data models and are typically designed to run distributed across commodity servers. Some benefits of NoSQL databases include:
– Flexible schemas – NoSQL databases can dynamically accommodate new types of unstructured and semi-structured data.
– Scalability – Databases like Cassandra easily scale by adding cheap commodity servers without downtime.
– High availability – Many NoSQL systems replicate data across nodes so that the failure of a single node does not bring the system down.
– Faster reads – Certain NoSQL databases are optimized for low-latency reads to serve live user traffic.
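Scaling by adding servers works well when only a small share of the data has to move each time a node joins. Consistent hashing, a technique used by Dynamo-style stores, has that property. The sketch below is a simplified illustration with a hypothetical `HashRing` class and made-up node names, not the implementation of any particular database.

```python
import bisect
import hashlib


def _token(value):
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Simplified consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns several positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_token(f"{node}#{i}"), node))

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # The first ring position clockwise from the key's token owns the key.
        idx = bisect.bisect_right(self._ring, (_token(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added, the only keys that change owner are those that now fall to the new node; every other key stays where it was, which is what allows capacity to grow without reshuffling the whole dataset.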
Based on these advantages, LinkedIn deploys NoSQL databases such as Cassandra and Voldemort, alongside Hadoop for analytics, for the following needs:
Cassandra for the Activity Stream
LinkedIn uses Cassandra for its activity streams that show real-time user actions. Cassandra’s fast writes and low-latency reads provide the speed needed to serve activity streams to millions of concurrent users. Its linear scalability helps it absorb spikes in traffic smoothly.
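Activity feeds are commonly modeled in Cassandra-style stores as one partition per member, with rows kept sorted by time so the newest entries can be read first. The in-memory sketch below imitates that layout; the `ActivityStore` class and its methods are hypothetical and only illustrate the access pattern.

```python
import bisect
from collections import defaultdict


class ActivityStore:
    """In-memory imitation of a partition-per-member, newest-first layout."""

    def __init__(self):
        # member_id -> list of (-timestamp, activity), kept sorted on insert.
        self._partitions = defaultdict(list)

    def append(self, member_id, timestamp, activity):
        # Negating the timestamp makes ascending order equal newest-first.
        bisect.insort(self._partitions[member_id], (-timestamp, activity))

    def latest(self, member_id, limit=10):
        """Return up to `limit` (timestamp, activity) pairs, newest first."""
        rows = self._partitions.get(member_id, [])
        return [(-neg_ts, activity) for neg_ts, activity in rows[:limit]]
```

Because each feed read touches a single partition and takes a prefix of already-sorted rows, the cost of showing the latest items does not depend on how long the member's history is.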
Voldemort for Cached Objects
Frequently accessed, read-mostly data like posts and profiles is cached in Voldemort, a distributed NoSQL key-value store. Voldemort’s low-latency key lookups make it well suited to caching layers.
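A caching layer in front of a slower system of record is often built as a read-through cache. The sketch below uses a plain dictionary in place of the key-value store and a hypothetical `load_from_source` callback in place of the primary database; it does not use Voldemort's client API.

```python
class ReadThroughCache:
    """Serve reads from a key-value cache, loading from the source on a miss."""

    def __init__(self, load_from_source):
        self._load = load_from_source  # hypothetical loader for the primary store
        self._store = {}               # stands in for the key-value store
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: fetch from the source and remember the result.
        self.misses += 1
        value = self._load(key)
        self._store[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the source record changes."""
        self._store.pop(key, None)
```

Only the first read of a key reaches the source; later reads are served from the cache until the entry is invalidated.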
Hadoop for Analytics
LinkedIn uses Hadoop’s distributed file system along with tools like Hive, Pig, and Spark to run analytics on petabytes of user interaction data, with Kafka carrying the event streams that feed these jobs. This helps derive business insights that improve LinkedIn’s recommendation algorithms, among other data products.
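The core pattern behind Hadoop-style analytics is to compute partial results on each slice of the data and then merge them. The following single-process sketch shows that map and reduce shape on a few made-up interaction events; the event fields and function names are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def map_partition(events):
    """Map step: count actions within one partition of the event log."""
    return Counter(event["action"] for event in events)


def reduce_counts(partials):
    """Reduce step: merge per-partition counts into global totals."""
    return reduce(lambda left, right: left + right, partials, Counter())


# Two hypothetical partitions, as if stored on different machines.
partition_a = [{"action": "view"}, {"action": "like"}, {"action": "view"}]
partition_b = [{"action": "view"}, {"action": "share"}]

totals = reduce_counts([map_partition(partition_a), map_partition(partition_b)])
```

Each partition can be processed independently, so the map step parallelizes across machines and only the small per-partition summaries need to be combined.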
Graph Databases
In addition to relational databases and NoSQL stores, LinkedIn leverages graph databases like Neo4j in use cases where relationship queries are frequent. As a professional social network, much of LinkedIn’s core value lies in its professional connection graph. Some benefits of using graph databases include:
– Intuitive representation of connections as nodes and relationships.
– Powerful traversal of complex relationship graphs using graph queries.
– Ability to rapidly evolve data models without expensive schema migrations.
Graph databases make it practical to generate recommendations from a member’s professional network and to identify relationships between entities.
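A simple form of network-based recommendation is to suggest second-degree connections ranked by the number of mutual connections. The sketch below runs that traversal over an adjacency-set dictionary; the `suggest_connections` function and the graph shape are hypothetical, and a graph database would express the same idea as a two-hop query.

```python
from collections import Counter


def suggest_connections(graph, member, limit=3):
    """Rank second-degree connections by how many mutual connections they share.

    `graph` maps each member to the set of members they are connected to.
    """
    direct = graph.get(member, set())
    mutual = Counter()
    for connection in direct:
        for candidate in graph.get(connection, set()):
            # Skip the member themselves and people they already know.
            if candidate != member and candidate not in direct:
                mutual[candidate] += 1
    # Most mutual connections first; break ties alphabetically for stable output.
    ranked = sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

Each suggestion comes with its mutual-connection count, which can be shown to the user or fed into a more elaborate ranking model.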
Conclusion
To summarize, LinkedIn utilizes:
– Relational databases like MySQL for storing core structured entities.
– NoSQL systems like Cassandra and Voldemort for scaling writes, reads, and caching.
– Hadoop-based solutions for analytics on massive interaction datasets.
– Graph databases like Neo4j for storing and querying professional connection graphs.
The production engineering teams at LinkedIn carefully evaluate trade-offs between database types while architecting solutions. No single database can meet every requirement cost-effectively at LinkedIn’s scale. The company’s growing technical maturity and expertise allow it to match the right technology to each data storage and processing need across its vast and diverse ecosystem.