LinkedIn is the world’s largest professional network with over 800 million members worldwide. As a platform that connects professionals with each other and with opportunities, LinkedIn deals with massive amounts of data on a daily basis. This is where data engineers come in.
Data engineers at LinkedIn are responsible for building and maintaining the infrastructure that allows the company to store, process, and analyze all of this data at scale. They work closely with data scientists, analysts, and other roles to ensure that LinkedIn can leverage its data to provide value to its members and customers.
Building Data Infrastructure
One of the core responsibilities of data engineers at LinkedIn is building and maintaining the data infrastructure. This includes:
- Designing and implementing data pipelines to move data between systems
- Building and optimizing data processing systems like Hadoop, Spark, etc.
- Developing data lakes and data warehouses to store structured, unstructured, and semi-structured data
- Setting up data modeling frameworks such as schemas and ETL (Extract, Transform, Load) logic
- Establishing frameworks for data quality, data governance, and metadata management
To accomplish this, data engineers at LinkedIn use various tools and technologies like Kafka, Airflow, dbt, Looker, etc. The infrastructure they build allows LinkedIn to store and process the massive amounts of data generated by over 800 million members and over 30 million companies with LinkedIn pages.
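As a rough illustration of the extract, transform, load pattern mentioned above, here is a minimal sketch using only the Python standard library. The sample data, the `events` table, and the function names are hypothetical and chosen for illustration; they are not taken from LinkedIn's actual pipelines, which run on distributed systems rather than an in-memory database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: one row per profile-view event.
RAW_CSV = """member_id,event,ts
1,view,2024-01-01T10:00:00
2,view,2024-01-01T10:05:00
,view,2024-01-01T10:06:00
1,VIEW,2024-01-02T09:00:00
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows without a member_id and normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["member_id"]:
            continue  # basic data-quality rule: skip incomplete rows
        cleaned.append((int(row["member_id"]), row["event"].lower(), row["ts"]))
    return cleaned


def load(conn, records):
    """Write cleaned records into the target table and return its row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (member_id INTEGER, event TEXT, ts TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(RAW_CSV)))
print(loaded)  # 3 rows survive; the row with no member_id is dropped
```

Keeping the three stages as separate functions makes each one easy to test and replace, which is the same idea an orchestrator such as Airflow applies at larger scale.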
Enabling Advanced Analytics
In addition to building data storage and processing systems, data engineers also focus on enabling advanced analytics use cases. Examples of how they do this include:
- Building data marts and cubes optimized for analytics and BI
- Preparing data for use in machine learning and AI systems
- Creating aggregated views and summary tables to speed up queries
- Implementing performance tuning, indexing, and partitioning to optimize analytics
- Providing self-service access to data through BI tools, SQL interfaces, etc.
This allows data scientists, business analysts, and other roles at LinkedIn to extract meaningful insights from the data through analytics, reporting, and AI applications.
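To make the idea of a summary table concrete, the sketch below pre-aggregates a raw event table into a small daily rollup that analysts could query instead of scanning every event. The table and column names (`page_views`, `daily_views`) are assumptions made for this example, and SQLite stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw fact table: one row per page view.
conn.execute("CREATE TABLE page_views (member_id INTEGER, view_date TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-01"), (1, "2024-01-01"), (3, "2024-01-02")],
)


def build_daily_summary(conn):
    """Rebuild a pre-aggregated table so dashboards avoid scanning raw events."""
    conn.execute("DROP TABLE IF EXISTS daily_views")
    conn.execute(
        """
        CREATE TABLE daily_views AS
        SELECT view_date,
               COUNT(*) AS views,
               COUNT(DISTINCT member_id) AS unique_members
        FROM page_views
        GROUP BY view_date
        """
    )
    # An index on the date column speeds up lookups for a given day.
    conn.execute("CREATE INDEX idx_daily_views_date ON daily_views (view_date)")


build_daily_summary(conn)
summary = conn.execute(
    "SELECT view_date, views, unique_members FROM daily_views ORDER BY view_date"
).fetchall()
print(summary)  # [('2024-01-01', 3, 2), ('2024-01-02', 1, 1)]
```

The trade-off is freshness: the summary is only as current as its last rebuild, so the rebuild usually runs on a schedule after the raw data lands.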
Monitoring and Maintaining Data Infrastructure
Since data infrastructure has many moving parts and the data itself is always growing, data engineers have to monitor and maintain the systems they build. Key responsibilities here include:
- Continuously monitoring data pipelines, warehouses, etc. for issues
- Conducting performance tuning and troubleshooting as needed
- Architecting scalable and fault-tolerant infrastructure
- Implementing changes to schema and data models as required
- Upgrading infrastructure components like Hadoop, Spark, etc.
- Ensuring stability, reliability, and efficiency of data infrastructure
Data engineers rely on logs, metrics, alerts, and other tools to monitor the health of the data ecosystem. They are on the frontlines of maintaining LinkedIn’s data infrastructure and preventing issues that could impact productivity.
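A simple health check of the kind described above might compare a pipeline's last successful run and output volume against thresholds. The following is only a sketch: the function name, the two-hour freshness window, and the alert labels are assumptions for illustration, not a description of LinkedIn's monitoring stack.

```python
from datetime import datetime, timedelta


def check_pipeline_health(last_success, row_count, now,
                          max_age=timedelta(hours=2), min_rows=1):
    """Return a list of alert labels; an empty list means the pipeline looks healthy."""
    alerts = []
    if now - last_success > max_age:
        alerts.append("stale")  # no successful run within the freshness window
    if row_count < min_rows:
        alerts.append("low_volume")  # the run produced fewer rows than expected
    return alerts


now = datetime(2024, 1, 1, 12, 0)
print(check_pipeline_health(datetime(2024, 1, 1, 11, 30), 500, now))  # []
print(check_pipeline_health(datetime(2024, 1, 1, 8, 0), 0, now))  # ['stale', 'low_volume']
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check deterministic and easy to test.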
Collaborating Across Teams
Since data engineering sits at the heart of LinkedIn’s data operations, data engineers have to collaborate closely with various stakeholders. This includes:
- Working with product and program managers to understand data needs
- Partnering with data scientists and analysts to ensure infrastructure meets analytics use cases
- Coordinating with engineering teams on APIs, service integration, and more
- Collaborating with IT and DevOps on deployment, maintenance, and monitoring
- Discussing improvements and optimizations with stakeholders
- Promoting data quality, governance, and access best practices
LinkedIn data engineers don’t work in a silo. They frequently communicate and collaborate with various internal teams via meetings, working sessions, project coordination, and more. This cross-functional collaboration is key to ensuring the data infrastructure scales to meet LinkedIn’s needs as it continues growing.
Developing Data Products and Features
In addition to foundational data engineering, LinkedIn data engineers also get opportunities to create data products and platforms leveraging the infrastructure they build. Examples include:
- Building internal data tools and web applications for analytics and data access
- Creating platforms like Orbiter that democratize data access via self-service
- Developing data APIs for use by various applications and services
- Implementing custom data pipelines for specific product initiatives
- Prototyping new data science applications using the infrastructure
This allows data engineers to work on higher-value deliverables beyond foundational data infrastructure. LinkedIn provides opportunities to get involved in building interesting data products at scale.
Automating and Optimizing
A core part of the data engineering role is relentlessly automating manual processes and optimizing performance. Data engineers at LinkedIn do this through:
- Automating repetitive tasks related to data processing, testing, and deployment
- Designing automated alerting and monitoring systems
- Implementing automated rollbacks, failovers, and recovery procedures
- Continuously optimizing data querying and pipelines as data volumes increase
- Leveraging techniques like partitioning, caching, and compaction to optimize workflows
- Staying on top of advancements in data engineering tools and techniques
Automation and optimization are critical given the scale at which LinkedIn operates. Data engineers play a key role here since they understand the data infrastructure best. Their work has a big impact on productivity.
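One common form of automated recovery is retrying a failed task with exponential backoff before escalating. The sketch below shows the general technique; `run_with_retries`, its parameters, and the `flaky_task` used to exercise it are hypothetical and are not drawn from any specific LinkedIn tool.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**(attempt - 1) and try again.

    The exception from the final attempt is re-raised so that callers
    (or an alerting system) still see the failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical task that fails twice and then succeeds.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky_task)
print(result, calls["count"])  # ok 3
```

Retries suit transient failures such as brief network errors. A task that fails deterministically will simply fail again, so the attempt limit matters as much as the backoff.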
Adhering to Best Practices
As stewards of LinkedIn’s data assets, it’s critical for data engineers to adhere to industry best practices around:
- Security – Implementing data security including access controls, encryption, etc.
- Testing – Developing automated test suites for data transformations, ETL, etc.
- CI/CD – Implementing CI/CD pipelines for infrastructure deployments
- Documentation – Creating technical design docs, runbooks, wikis, etc.
- Code Quality – Following style guidelines, code reviews, static analysis, etc.
- Monitoring – Setting up logging, metrics collection, alerting, etc.
Adherence to best practices ensures data infrastructure is scalable, reliable, and audit-ready as per industry standards. This is a priority for data engineers at LinkedIn.
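The testing practice listed above can be as simple as a unit test around a single transformation. In this sketch, `normalize_title` is a hypothetical cleaning function invented for the example, and the tests use Python's built-in `unittest` module.

```python
import unittest


def normalize_title(raw):
    """Collapse repeated whitespace and apply title case to a job title."""
    return " ".join(raw.split()).title()


class NormalizeTitleTest(unittest.TestCase):
    def test_collapses_whitespace_and_fixes_case(self):
        self.assertEqual(normalize_title("  data   ENGINEER "), "Data Engineer")

    def test_empty_input_stays_empty(self):
        self.assertEqual(normalize_title(""), "")


# Run the suite programmatically so the outcome can be inspected in code.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTitleTest)
test_result = unittest.TextTestRunner(verbosity=0).run(suite)
print(test_result.wasSuccessful())
```

In a CI/CD setup, a suite like this would typically run on every change so that a broken transformation is caught before it reaches production data.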
Mastering Data Engineering Tools and Technologies
To be able to architect, build, and run data infrastructure at scale, LinkedIn data engineers need to master a wide array of tools and technologies including:
- Big data tech like Hadoop, Spark, Kafka for processing & pipelines
- Relational databases like MySQL and distributed ones like Vitess
- Data warehousing tools like Snowflake, Redshift, BigQuery, etc.
- Workflow schedulers like Azkaban, Airflow, Luigi, etc.
- Containerization with Docker and Kubernetes
- Infrastructure as code tools like Terraform and Ansible
- Monitoring and alerting tools like Prometheus, Nagios, etc.
LinkedIn uses cutting-edge tools, and its data engineers have access to them. Becoming power users of these tools allows them to operate at scale.
Education and Background
Due to the technical nature of the role, data engineers at LinkedIn often have the following education and backgrounds:
- Bachelor’s or higher degree in computer science, engineering, or a related field
- Hands-on experience with programming languages like Python, Java, Scala, etc.
- Knowledge of operating systems, networks, and distributed systems fundamentals
- Experience in software engineering and architecture
- Strong hands-on experience with data tools and technologies
While not mandatory, having a background in software engineering, infrastructure engineering, or similar roles is valued. Many data engineers start their careers as software engineers or developers before specializing in data.
Key Skills
Here are some of the key technical and soft skills needed to succeed as a data engineer at LinkedIn:
Technical Skills
- Expertise in data processing tools like Hadoop, Spark, Hive, etc.
- Proficiency in data warehousing and databases
- ETL and workflow orchestration skills
- Coding skills in Python, Scala, Java for engineering tasks
- Infrastructure as code skills using Terraform, Ansible, etc.
- Containerization skills with Docker and Kubernetes
- Monitoring, logging, and alerting skills
- Troubleshooting and performance tuning abilities
Soft Skills
- Communication and collaboration skills
- Problem solving and analytical thinking
- Ability to operate at scale with minimal supervision
- Willingness to learn and master new technologies
- Attention to detail since data is sensitive
- Teamwork and leadership skills
Day-to-Day Responsibilities
Here is an overview of the typical day-to-day responsibilities and activities of a data engineer at LinkedIn:
- Meetings – Attend Scrum meetings, sync up with cross-functional teams
- Tickets – Work on sprint tickets for tasks related to data infrastructure
- Tracking – Monitor data pipelines, workflows, warehouse operations
- Support – Troubleshoot issues, provide guidance on data challenges
- Documentation – Document architectures, APIs, processes, etc.
- Testing – Develop unit tests, integration tests, conduct QA
- Deployments – Deploy infrastructure changes, new data tooling, etc.
Data engineers don’t work in isolation. Collaboration with data users and technical teams is a significant part of their daily responsibilities.
Challenges
Some of the challenges faced by data engineers at LinkedIn include:
- Supporting complex legacy infrastructure with multi-year histories
- Coordinating changes across distributed systems and teams
- Constantly evolving scale and performance requirements
- Debugging and optimizing complex pipelines with cascading failures
- Mastering new open source and proprietary data technologies
- Maintaining high quality as data volume and pipelines grow exponentially
There are no “easy” days as a data engineer at web-scale companies like LinkedIn. The challenges are fulfilling for engineers who enjoy solving scalability and distributed systems problems.
Impact
As a data engineer at LinkedIn, you enable high-impact deliverables like:
- Building data democratization platforms like WhereHows
- Enabling the LinkedIn feed with real-time engagement data
- Powering personalized recommendations to millions of users
- Allowing insights that improve networking and job seeking experiences
- Building Hadoop and Spark clusters that process petabytes of data
Data engineers have a tangible impact on LinkedIn’s products and mission. The scale of impact is immense given LinkedIn’s extensive user base.
Career Growth and Progression
Data engineers at LinkedIn can progress their careers in multiple directions including:
- Becoming senior data platform engineers or architects
- Leading teams of data engineers as managers
- Cross-functional progression into data science, analytics or engineering management roles
- Shifting focus to specialized areas like applied ML, analytics, etc.
- Moving into solution architecture, strategic engineering, or technical program management
There are abundant opportunities to take on greater responsibilities and advance to technical leadership positions as a data engineering veteran.
Conclusion
Data engineers serve as the foundational pillar of LinkedIn’s data ecosystem. They build and maintain the complex infrastructure that powers LinkedIn’s analytics, products, and datasets at massive scale. Data engineers get exposure to cutting-edge tools and technologies while collaborating cross-functionally. For engineers who enjoy solving scalability challenges and building robust distributed data systems, being a data engineer at LinkedIn offers rich opportunities for career growth and impact.