LinkedIn DataHub is an open source metadata platform, originally developed at LinkedIn, that helps organizations catalog, discover and govern the data spread across their storage and processing systems. DataHub does not move or process the data itself. Instead, it collects metadata about datasets, pipelines and other data assets, which provides a way to build a metadata-driven data architecture and makes data easier to discover and trust.
What problems does LinkedIn DataHub solve?
As organizations grow, their data infrastructure and architecture can become complex and disjointed. This leads to some key data management challenges:
- Data silos – Data stored and managed in isolation in different systems/tools
- Lack of data visibility – No centralized catalog to easily find and understand available datasets
- Manual data processes – Documenting datasets and tracking how they change requires engineering effort
- Data governance issues – No common data definitions, standards and policies
- Distrust in data – Uncertainty about data accuracy, lineage, etc. caused by the issues above
LinkedIn DataHub aims to solve these problems by providing:
- A unified metadata store – All metadata in one place describing available data assets
- Automated metadata ingestion – Metadata automatically collected from source systems
- Intuitive UI – Search, browse and understand datasets via a rich UI
- Common vocabulary – Standardized taxonomy and ontology for data
- Shared data model – Common definitions for key data elements
- Data lifecycle support – Lineage, quality and ownership metadata that help users trust data
- Extensible architecture – Easy integration with other data systems
By providing these capabilities on top of existing data infrastructure, DataHub makes it much easier for organizations to overcome their data management challenges.
What are the key features of LinkedIn DataHub?
Some of the main features and capabilities provided by LinkedIn DataHub include:
Metadata management
- Centralized metadata repository – Stores all metadata and acts as a single source of truth
- Automated ingestion – Crawls source systems and extracts technical, operational and business metadata
- Schema management – Defines the metadata model and relationships between objects
- Taxonomy support – Manage glossary and ontology for data elements
- User-friendly UI – Browse and search metadata through an intuitive web UI
Data lifecycle management
- Data lineage – Visualize flow of data from sources downstream to consumers
- Data profiling – Analyze data quality, structure and relationships
- Ownership & governance – Define owners, processes and policies for data
- Workflow integration – Trigger actions in other systems based on metadata
Platform extensibility
- Modular architecture – Integrate other data systems via flexible extensibility framework
- Connectors – Pre-built connectors for data sources like Hive, Kafka, Snowflake, etc.
- APIs – GraphQL and REST APIs to access metadata programmatically
- Plugins – Optional custom plugins to extend functionality
- Notifications – Alert users of metadata changes via webhooks
With these capabilities working together, DataHub provides a scalable, centralized and automated way to manage data and metadata for the entire organization.
How does LinkedIn DataHub work?
At a high level, LinkedIn DataHub works as follows:
- Integrate with source systems – DataHub connectors extract metadata from underlying data systems like databases, data warehouses, file storage etc.
- Load metadata into repository – Extracted metadata is modeled and loaded into the DataHub metadata repository.
- Enrich metadata – Additional metadata such as ownership, tags and glossary terms is added to provide context.
- Surface metadata via UI – The metadata catalog UI allows users to easily search and browse metadata.
- Analyze & act on metadata – Users and applications can analyze metadata to gain insights and trigger actions.
- Keep metadata updated – As source systems change, metadata is continuously re-ingested to keep DataHub updated.
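The last step in the list above, keeping the catalog in sync with changing sources, can be sketched as a diff between what the repository already holds and what a fresh extraction returns. This is a minimal illustration and not DataHub's implementation: the `SyncPlan` type, the `plan_sync` function and the sample table names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    """Hypothetical result type: what a re-ingestion run would need to do."""
    added: frozenset
    changed: frozenset
    removed: frozenset


def plan_sync(stored: dict, extracted: dict) -> SyncPlan:
    """Compare stored schemas with freshly extracted ones, keyed by dataset name."""
    stored_keys, new_keys = set(stored), set(extracted)
    changed = {k for k in stored_keys & new_keys if stored[k] != extracted[k]}
    return SyncPlan(
        added=frozenset(new_keys - stored_keys),
        changed=frozenset(changed),
        removed=frozenset(stored_keys - new_keys),
    )


# Sample data, invented for the example: dataset name -> column names.
stored = {
    "sales.orders": ["id", "amount"],
    "sales.customers": ["id", "name"],
    "sales.legacy": ["id"],
}
extracted = {
    "sales.orders": ["id", "amount", "currency"],  # a column was added
    "sales.customers": ["id", "name"],             # unchanged
    "sales.refunds": ["id", "order_id"],           # new table
}

plan = plan_sync(stored, extracted)
print("added:", sorted(plan.added))
print("changed:", sorted(plan.changed))
print("removed:", sorted(plan.removed))
```

Only the added and changed entries need to be written again, while removed entries can be flagged for review rather than deleted outright.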
Key components and modules involved in this workflow include:
Connectors
DataHub provides a connector framework to integrate with various data systems such as MySQL, Kafka, Snowflake and Hive. Connectors use the APIs and libraries of the source systems to extract relevant metadata and load it into the metadata repository.
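To make the idea of a connector concrete, here is a minimal sketch that extracts table and column metadata from a SQLite database using only the Python standard library. It is not DataHub's connector framework: the function name `extract_table_metadata` and the shape of the emitted records are assumptions made for illustration.

```python
import sqlite3
from typing import Iterator


def extract_table_metadata(conn: sqlite3.Connection) -> Iterator[dict]:
    """Yield one metadata record per table (hypothetical record shape)."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        yield {
            "platform": "sqlite",
            "name": table_name,
            "fields": [
                {"name": col[1], "type": col[2], "nullable": not col[3]}
                for col in columns
            ],
        }


# A throwaway in-memory database stands in for a real source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER NOT NULL, amount REAL, note TEXT)")

records = list(extract_table_metadata(conn))
for record in records:
    print(record["name"], [f["name"] for f in record["fields"]])
```

A real connector follows the same pattern at larger scale: it queries the source system's own catalog or API and translates the result into the repository's metadata model.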
Metadata repository
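DataHub stores all metadata in its centralized repository. The primary store is a relational database such as MySQL or PostgreSQL, and a search index (Elasticsearch) serves search and browse queries. The metadata is modeled as entities (for example datasets, dashboards and users), each described by a set of aspects, plus the relationships between entities.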
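The entity-and-relationship model can be illustrated with a small in-memory store. DataHub identifies entities by URNs and attaches named groups of metadata, called aspects, to them. The `dataset_urn` helper below follows DataHub's documented dataset URN format, but the `MetadataStore` class, its method names and its versioning scheme are a simplified, hypothetical stand-in for the real storage layer.

```python
from collections import defaultdict


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in DataHub's urn:li:dataset:(...) format."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


class MetadataStore:
    """Hypothetical in-memory store: versioned aspects plus typed edges."""

    def __init__(self) -> None:
        self._aspects = defaultdict(list)  # (urn, aspect name) -> versions
        self._edges = set()                # (source urn, relationship, target urn)

    def upsert_aspect(self, urn: str, aspect: str, value: dict) -> int:
        """Append a new version of an aspect and return its version number."""
        versions = self._aspects[(urn, aspect)]
        versions.append(value)
        return len(versions) - 1

    def latest(self, urn: str, aspect: str):
        """Return the newest version of an aspect, or None if absent."""
        versions = self._aspects.get((urn, aspect))
        return versions[-1] if versions else None

    def relate(self, source: str, relationship: str, target: str) -> None:
        self._edges.add((source, relationship, target))

    def related(self, source: str, relationship: str) -> list:
        return sorted(t for s, r, t in self._edges if s == source and r == relationship)


store = MetadataStore()
orders = dataset_urn("hive", "sales.orders")
store.upsert_aspect(orders, "schema", {"fields": ["id", "amount"]})
store.upsert_aspect(orders, "schema", {"fields": ["id", "amount", "currency"]})
store.relate(orders, "OwnedBy", "urn:li:corpuser:jdoe")  # example user, invented

print(orders)
print(store.latest(orders, "schema"))
print(store.related(orders, "OwnedBy"))
```

Keeping earlier versions of each aspect is what makes it possible to show how a schema or an ownership assignment changed over time.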
Web UI
DataHub provides a web-based UI allowing users to search, browse, visualize and edit metadata from the repository. The UI provides a business-friendly metadata catalog for discovering and understanding data.
REST API
DataHub exposes GraphQL and REST APIs which enable programmatic access to the metadata. This allows other applications to read, search and update DataHub metadata.
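As a rough sketch of programmatic access, the snippet below builds (but does not send) an HTTP search request using only the standard library. DataHub's documented interfaces include a GraphQL endpoint alongside REST-style ones. The server address, the token placeholder and the exact shape of the search query are assumptions here and should be checked against the API schema of the deployed version.

```python
import json
import urllib.request

# Assumed address of a local deployment; adjust for a real environment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Assumed query shape; verify it against the server's GraphQL schema.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""


def build_search_request(text: str, token: str, count: int = 10) -> urllib.request.Request:
    """Build a POST request that searches datasets for the given text."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_search_request("orders", token="<personal-access-token>")
print(request.get_method(), request.full_url)

# Sending is left out so the sketch runs without a server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```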
Identity & access
DataHub integrates with organizational identity providers through protocols such as LDAP, SAML and OIDC to manage users, groups and permissions for accessing metadata.
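Once users and groups come from an identity provider, access decisions reduce to matching a user's groups against policies. The following is a hypothetical sketch of such a check. The `Policy` class, the privilege names `VIEW` and `EDIT`, and the group names are invented for illustration and do not reflect DataHub's actual policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: groups holding a privilege on matching resources."""
    privilege: str
    groups: frozenset
    resource_prefix: str = ""  # an empty prefix matches every resource


def is_allowed(user_groups, privilege: str, resource_urn: str, policies) -> bool:
    """Return True if any policy grants the privilege to one of the user's groups."""
    return any(
        policy.privilege == privilege
        and resource_urn.startswith(policy.resource_prefix)
        and bool(policy.groups & set(user_groups))
        for policy in policies
    )


HIVE_PREFIX = "urn:li:dataset:(urn:li:dataPlatform:hive,"
policies = [
    Policy("VIEW", frozenset({"analysts", "data-platform"})),
    Policy("EDIT", frozenset({"data-platform"}), HIVE_PREFIX),
]

hive_orders = HIVE_PREFIX + "sales.orders,PROD)"
print(is_allowed({"analysts"}, "VIEW", hive_orders, policies))
print(is_allowed({"analysts"}, "EDIT", hive_orders, policies))
```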
Plugins
Custom plugins can be developed to extend DataHub’s capabilities, for example by integrating with data governance tools or adding custom metadata models.
Notifications
Webhooks and email notifications keep stakeholders updated about changes and events in DataHub like new metadata, ownership changes etc.
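A webhook notification is essentially a JSON payload posted to a subscriber's URL. The sketch below builds such a payload and signs it so the receiver can verify where it came from. Signing with HMAC is a common webhook practice and not something this article attributes to DataHub. The event fields, the header name and the function names are all hypothetical.

```python
import hashlib
import hmac
import json


def build_webhook(event: dict, secret: bytes) -> tuple:
    """Serialize an event and return (body, headers) with an HMAC signature."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Signature-SHA256": signature,  # hypothetical header name
    }
    return body, headers


def verify_webhook(body: bytes, headers: dict, secret: bytes) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers.get("X-Signature-SHA256", ""))


# Example event, invented for illustration.
event = {
    "type": "OWNERSHIP_CHANGED",
    "entity": "urn:li:dataset:(urn:li:dataPlatform:hive,sales.orders,PROD)",
    "new_owner": "urn:li:corpuser:jdoe",
}
secret = b"shared-secret-for-demo-only"

body, headers = build_webhook(event, secret)
print(verify_webhook(body, headers, secret))
```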
With these modules working together, DataHub provides a comprehensive, automated, metadata-driven approach to managing data at enterprise scale.
What are the benefits of using LinkedIn DataHub?
Some of the key benefits organizations can realize using LinkedIn DataHub include:
- Faster discovery of data – An intuitive interface for searching and browsing metadata helps users find relevant datasets.
- Improved data understanding – Rich technical, operational and business metadata provides insights into data.
- Trusted data quality – Data profiling metadata highlights quality issues to increase trust.
- Greater reuse of data – Discovering similar datasets avoids duplicating data.
- Enhanced governance – Common definitions and standards enable consistent data governance.
- Increased productivity – Users find and access data faster instead of rebuilding existing datasets.
- Lower cost – Automation reduces the manual effort of curating metadata.
- Better collaboration – Democratized metadata helps teams align and work together.
By leveraging LinkedIn DataHub, data teams are better equipped to deliver quality, trusted data that meets business needs and fuels data-driven decision making, analytics and AI initiatives.
What types of metadata does LinkedIn DataHub manage?
LinkedIn DataHub manages different types of metadata spanning technical, operational and business contexts. Key metadata managed includes:
Technical metadata
- Dataset properties – Schema, size, compression, storage format
- Data pipeline attributes – ETL jobs, scripts, transformations
- Infrastructure details – Hosts, servers, databases
- Source system metadata – Tables, columns, data types
Operational metadata
- Owners and contacts for data assets
- Lineage – Upstream sources and downstream usage
- Pipeline execution details – Run times, status, performance
- Data quality stats and metrics – Profiling, accuracy measures
Business metadata
- Glossary terms and definitions for data elements
- Taxonomy and ontology mapping of assets
- Tags and classifications for discovery
- Domain context – Business meaning, purpose
This rich metadata from multiple perspectives provides a holistic understanding of data assets in a single platform.
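The data quality statistics listed under operational metadata are typically produced by profiling sample rows. As a minimal sketch of what such profiling computes, the hypothetical `profile_rows` function below derives row counts, null counts and distinct counts per column. The sample rows are invented for the example, and real profilers compute many more measures.

```python
def profile_rows(rows: list) -> dict:
    """Compute simple per-column statistics from a list of row dictionaries."""
    columns = sorted({key for row in rows for key in row})
    profile = {"row_count": len(rows), "columns": {}}
    for column in columns:
        # A missing key is treated the same as an explicit None.
        values = [row.get(column) for row in rows]
        non_null = [value for value in values if value is not None]
        null_count = len(values) - len(non_null)
        profile["columns"][column] = {
            "null_count": null_count,
            "null_fraction": null_count / len(values),
            "distinct_count": len(set(non_null)),
        }
    return profile


rows = [
    {"id": 1, "country": "DE"},
    {"id": 2, "country": None},
    {"id": 3, "country": "DE"},
    {"id": 4},
]

profile = profile_rows(rows)
for column, stats in profile["columns"].items():
    print(column, stats)
```

Stored alongside the schema, statistics like these let a catalog user spot a mostly empty column before building on it.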
How to get started with LinkedIn DataHub?
Here are some tips to help get started with LinkedIn DataHub:
Install
- Choose deployment option – On-prem, containerized or cloud
- Allocate resources – CPU, memory, storage
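- Configure prerequisites – Java, MySQL, Kafka, Elasticsearch
- Run the DataHub quickstart or deployment scripts (for example, the datahub docker quickstart command for a local Docker-based setup)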
Ingest metadata
- Prioritize critical systems for initial ingestion
- Configure connectors for each source system
- Model metadata structure for sources
- Iteratively tune and enhance metadata
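Connector configuration in DataHub is usually written as a recipe that names a source to read from and a sink to write to. Recipes are normally YAML files run with the datahub ingest command. The sketch below expresses the same structure as a Python dictionary and adds a small sanity check. The host, database and credential values are placeholders, and the `validate_recipe` helper is hypothetical and not part of DataHub.

```python
# Recipe structure: a source section and a sink section, each with a type
# and a config mapping. All values below are placeholders.
recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "host_port": "localhost:3306",
            "database": "sales",
            "username": "datahub_reader",
            "password": "${MYSQL_PASSWORD}",  # placeholder, not a real secret
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def validate_recipe(candidate: dict) -> list:
    """Return a list of structural problems (empty if the recipe looks sane)."""
    problems = []
    for section in ("source", "sink"):
        block = candidate.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing '{section}' section")
            continue
        if not block.get("type"):
            problems.append(f"'{section}' needs a 'type'")
        if not isinstance(block.get("config", {}), dict):
            problems.append(f"'{section}.config' must be a mapping")
    return problems


print(validate_recipe(recipe))
print(validate_recipe({"source": {"config": {}}}))
```

Starting with one recipe per critical system keeps the initial ingestion small and makes it easier to tune each source independently.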
Explore metadata
- Browse top level datasets to get started
- Leverage search to find relevant entities
- Follow relationships between objects
- View lineage flows between assets
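Following lineage between assets amounts to walking a directed graph. The sketch below uses a breadth-first search to list everything downstream of a dataset along with its distance in hops, which is the core of a simple impact analysis. The function name `downstream_impact` and the sample assets are hypothetical.

```python
from collections import deque


def downstream_impact(edges: dict, start: str) -> dict:
    """Return {asset: hops} for every asset reachable downstream of start."""
    hops = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:  # guards against cycles and repeat visits
                seen.add(child)
                hops[child] = depth + 1
                queue.append((child, depth + 1))
    return hops


# Sample lineage, invented for the example: upstream asset -> direct consumers.
edges = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue", "mart.orders_daily"],
    "mart.revenue": ["dashboard.revenue"],
}

impact = downstream_impact(edges, "raw.orders")
for asset, distance in impact.items():
    print(f"{asset}: {distance} hop(s) downstream")
```

Running the same traversal over reversed edges answers the opposite question: which upstream sources feed a given dashboard.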
Govern metadata
- Onboard stakeholders to manage specific domains
- Establish roles for owners, editors, viewers of metadata
- Develop policies and processes for metadata changes
- Track audit logs of changes
Automate
- Schedule recurring ingestion jobs to keep metadata fresh
- Use APIs and webhooks to integrate with other systems
- Publish metadata to consumers and stakeholders
With these steps, organizations can start realizing value from automated metadata management and accelerate their data journey with LinkedIn DataHub.
What are some real world use cases and examples of LinkedIn DataHub?
LinkedIn DataHub powers metadata-driven data management at numerous organizations. Some examples include:
Expedia
- Centralized metadata from databases, data warehouses, file storage
- Enabled self-service access to accurate, trusted data
- Improved reliability of analytics pipelines
- Enhanced governance through consistent metadata
Comcast
- Ingested technical and business metadata for 150+ systems
- Standard taxonomies for consistent data definition
- Integrations with data lineage, catalog, quality tools
- Raised data maturity via metadata-driven management
TIAA
- Central portal for data discovery and access
- Automated large-volume ETL metadata pipelines
- Improved collaboration across teams
- Enabled innovation by sharing data efficiently
Ford
- Accelerated model development cycles for AI/ML teams
- Ensured optimal data quality for advanced analytics
- Democratized data access with self-service catalog
- Scalable metadata management as data volume grew
As these examples illustrate, organizations are using DataHub to modernize data architectures, empower users with access to trusted data and enable advanced analytics at scale.
What are some key capabilities planned for DataHub?
The LinkedIn DataHub project has an active open source roadmap. Some key features on the roadmap include:
- Scalability enhancements – For larger metadata volumes
- Finer-grained access control – Row- and attribute-level security
- Advanced lineage analysis – Impact analysis, visualization
- Deeper ecosystem integrations – BigQuery, Snowflake, dbt
- Flexible deployment options – Fully managed SaaS offering
- Alerting framework – Configurable alerts and notifications
- Extension framework – Streamlined build of custom plugins
- Advanced workflow integration – Trigger actions in other systems
By continuously investing in the open source roadmap, DataHub aims to stay at the forefront in supporting organizations on their data journey.
Conclusion
LinkedIn DataHub provides a metadata-driven solution for enterprise data management. By providing automated and scalable metadata collection, a business-friendly catalog UI, and an extensible architecture, DataHub helps organizations deliver reliable, accurate data to users in a governed way. Adoption by leading companies highlights the value of DataHub’s metadata-first approach. With continued open source momentum, DataHub is poised to be a key enabler for advanced analytics and AI use cases powered by trusted data.