DataHub is an open source data discovery, management, and collaboration tool originally created at LinkedIn. It provides tools to discover, understand, transform, and share data across teams and organizations in a frictionless, self-service manner.
What are the key features of DataHub?
Some of the key features and capabilities of DataHub include:
- Metadata management – DataHub provides a centralized metadata repository to register, describe, and organize datasets and data assets. This enables discovery and understanding of available data.
- Data lineage – DataHub automatically captures lineage between data assets to understand upstream sources and downstream usage and dependencies.
- Search and discovery – Intuitive search and filtering capabilities make it easy to find relevant datasets.
- Collaboration – DataHub supports sharing, notifications, and access controls to foster collaboration.
- Customizability – DataHub is designed to be extended and customized for specific environments and use cases.
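To make the metadata-management idea concrete, a dataset entry in a central repository can be pictured as a structured record. The sketch below is a toy illustration in plain Python; the field names are hypothetical and do not reflect DataHub's actual entity schema.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Unique name of the dataset, e.g. "warehouse.orders"
    name: str
    # Platform the dataset lives on, e.g. "snowflake"
    platform: str
    # Free-text description shown in search and discovery
    description: str = ""
    # Upstream datasets this one is derived from (lineage)
    upstreams: list = field(default_factory=list)
    # Searchable tags, e.g. ["pii", "finance"]
    tags: list = field(default_factory=list)

orders = DatasetMetadata(
    name="warehouse.orders",
    platform="snowflake",
    description="All customer orders, refreshed nightly",
    upstreams=["raw.orders_events"],
    tags=["finance"],
)
```

A record like this carries enough structure to power search (name, description, tags) and lineage (upstreams) at the same time, which is why centralizing it enables the features listed above.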
What are the benefits of DataHub?
Some key benefits that DataHub provides include:
- Accelerating discovery and understanding of data assets across the organization.
- Enabling self-service access to data without complex warehousing.
- Reducing redundancy by revealing hidden duplicate datasets.
- Understanding impact of changes via end-to-end lineage.
- Promoting collaboration and re-use instead of duplication.
- Enhancing organizational productivity by connecting people, processes, and data.
What are the key components of DataHub architecture?
DataHub has a modular, microservices-based architecture. Some of the key components include:
- WebApp – React-based user interface and frontend
- Metadata Service – REST API layer handling metadata CRUD operations
- Metadata Graph Service – Underlying graph database containing entity metadata
- Metadata Events Service – Asynchronous events and notifications
- DataHub Ingestion Framework – For importing metadata from different sources
- Entity Services – Modules providing functionality around specific entities
- Authentication and Authorization – Providing access controls
These components are loosely coupled and communicate via REST APIs. This enables extensibility and flexibility in the overall architecture.
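The loose coupling between components can be sketched as follows. This is a toy in-memory model, not DataHub's real code: the store stands in for the graph database, the service for the REST CRUD layer, and the listener callbacks for the asynchronous events service.

```python
class MetadataStore:
    """Toy stand-in for the graph database holding entity metadata."""
    def __init__(self):
        self._entities = {}

    def upsert(self, urn, properties):
        self._entities[urn] = properties

    def get(self, urn):
        return self._entities.get(urn)

class MetadataService:
    """Toy stand-in for the REST layer handling metadata CRUD."""
    def __init__(self, store, event_listeners=None):
        self.store = store
        # Listeners emulate the decoupled, asynchronous events service
        self.event_listeners = event_listeners or []

    def create_entity(self, urn, properties):
        self.store.upsert(urn, properties)
        for listener in self.event_listeners:
            listener({"type": "ENTITY_CREATED", "urn": urn})

events = []
service = MetadataService(MetadataStore(), event_listeners=[events.append])
service.create_entity("urn:li:dataset:orders", {"owner": "data-team"})
```

Because the service only knows the store and listeners through narrow interfaces, either side can be swapped out independently, which is the property the microservices split is designed to preserve.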
What technologies is DataHub built on?
Some of the key technologies used to build DataHub include:
- Backend
- Java – Primary language for core backend services
- Spring Boot – Framework for building Java services
- Neo4j – Graph database used to store metadata
- Elasticsearch – Used for free text search
- MySQL – Used to store service metadata
- Frontend
- TypeScript – Primary language for web UI
- React – Frontend JavaScript framework
- Apollo GraphQL – For connecting UI to backend GraphQL API
- Infrastructure
- Docker – For containerization and microservices
- Kubernetes – For container orchestration
- NGINX – Open source reverse proxy and load balancer
This stack provides scalability, extensibility, and rapid development capabilities.
How does the metadata ingestion framework work?
The metadata ingestion framework provides configurable “recipes” and connectors for importing metadata from different sources into DataHub. Some key aspects include:
- Support for different source systems like databases, data warehouses, BI tools, etc.
- Common interface for adding new connectors and recipes.
- Recipes describe how to extract metadata from sources.
- Base classes provide reusability across recipes.
- Connectors interface with sources and emit metadata events.
- Orchestrator runs ingestion jobs based on recipes and connectors.
Some examples of supported systems include Hive, BigQuery, Snowflake, Looker, MySQL, Postgres, SQL Server, and more.
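The connector/recipe/orchestrator split can be sketched in a few lines of Python. This is a simplified illustration of the pattern, not the actual DataHub ingestion framework; the class and field names are made up for clarity.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Base class: connectors interface with a source and emit metadata."""
    @abstractmethod
    def extract(self, recipe):
        """Yield metadata records according to the recipe's settings."""

class MySQLConnector(Connector):
    def extract(self, recipe):
        # A real connector would query information_schema here;
        # two hard-coded tables stand in for illustration.
        for table in ["orders", "customers"]:
            yield {"source": recipe["host"], "table": table}

# Common interface: new source types register a connector here
CONNECTORS = {"mysql": MySQLConnector}

def run_ingestion(recipe):
    """Toy orchestrator: pick a connector from the recipe and run it."""
    connector = CONNECTORS[recipe["type"]]()
    return list(connector.extract(recipe))

records = run_ingestion({"type": "mysql", "host": "db.internal"})
```

The recipe declares *what* to ingest, the connector knows *how* to talk to a particular source, and the orchestrator wires the two together; adding a new source type means adding one connector class without touching the rest.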
How does DataHub integrate with other systems?
DataHub provides various integration points to ingest metadata from and export metadata to other systems including:
- REST API – For CRUD operations on entities.
- Webhooks – For notification of metadata events.
- GraphQL API – To query metadata and relationships.
- Kafka – For publishing metadata change events.
- DataHub React UI – JS library for embedding DataHub UI.
DataHub focuses on metadata management while integrating with downstream systems for analytics, governance, lineage, and more.
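As a hedged illustration of the REST integration point, the snippet below builds (but does not send) an HTTP search request using only the Python standard library. The `/api/search` path and payload shape are hypothetical; consult the DataHub API documentation for the real routes.

```python
import json
import urllib.request

def build_search_request(base_url, query):
    """Construct a POST request for a metadata search.

    The /api/search path and payload fields are illustrative,
    not DataHub's actual API contract.
    """
    payload = json.dumps({"input": query, "entity": "dataset"}).encode()
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_search_request("http://localhost:8080", "orders")
# req could then be sent with urllib.request.urlopen(req)
```

The same request-building pattern applies to CRUD operations on entities; only the path and payload change per endpoint.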
How does DataHub support data governance?
DataHub supports common data governance use cases in the following ways:
- Discovery – Find data through semantic search and navigation.
- Profiling – View profiles of data quality, usage, and performance.
- Cataloging – Register new datasets and capture technical metadata.
- Lineage – Understand upstream sources and downstream usage.
- Policies – Associate policies and standards to datasets.
- Issue tracking – Record data issues visible across teams.
DataHub integrates with other systems like data catalogs, DQ tools, and BI tools to provide a unified data governance solution.
How does DataHub support collaboration?
DataHub enables collaboration through the following capabilities:
- Spaces – To organize assets and teams into groups and projects.
- Sharing – To control asset visibility across users and groups.
- Comments – To discuss assets and ask questions.
- Notifications – To inform users of relevant activity.
- Access Controls – To manage user and group permissions.
- Usage Stats – To see how often assets are accessed.
These features alongside integrations with Slack, Teams, and email facilitate seamless data collaboration.
How can I customize or extend DataHub?
DataHub provides various extension points to customize for your specific needs:
- REST APIs – To build custom experiences on top of DataHub metadata.
- Plugins – To add functionality around ingestion and metadata.
- Recipes – To ingest from new data sources.
- React UI Library – To create embedded UIs.
- Kafka Events – To subscribe and react to metadata changes.
- Configuration – To tune performance and customize behavior.
DataHub is designed to be adapted through these supported extension points rather than by modifying core code, which makes it easier to evolve your solution as needs change.
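Subscribing to metadata change events is one of the lightest-touch extension points. In DataHub these events flow through Kafka; the sketch below substitutes a toy in-memory bus so the pattern is visible without any infrastructure, and the event field names are invented for illustration.

```python
class MetadataChangeBus:
    """In-memory stand-in for the Kafka metadata-change event stream."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)

seen = []
bus = MetadataChangeBus()

# A custom extension that reacts only to tag additions,
# e.g. to alert a governance team when "pii" appears
def on_tag_added(event):
    if event["change"] == "TAG_ADDED":
        seen.append(event)

bus.subscribe(on_tag_added)
bus.publish({"change": "TAG_ADDED", "urn": "urn:li:dataset:orders", "tag": "pii"})
bus.publish({"change": "OWNER_CHANGED", "urn": "urn:li:dataset:orders"})
```

Because the extension only consumes events, it can be deployed, versioned, and removed independently of DataHub itself.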
How does DataHub support data security and compliance?
DataHub helps address security and compliance requirements in the following ways:
- Authentication – Support for LDAP, SAML, OIDC, and database-based authentication.
- Access controls – Fine-grained control over asset visibility, modification, and usage.
- Encryption – Encryption of sensitive attributes like passwords and connection strings.
- Audit logs – Detailed logs of changes, access, and usage.
- Policy association – Linking policies and standards to datasets.
- Markdown support – For capturing security and compliance related documentation.
Additionally, being an open source project, DataHub provides complete transparency over the solution code and architecture.
How does DataHub differ from traditional data catalog tools?
| DataHub | Traditional Data Catalogs |
| --- | --- |
| Focus on organizational knowledge graph | Focus on technical metadata |
| Evolving, shared understanding of data landscape | Static, point-in-time metadata |
| Connections between people, processes, and data | Catalog of datasets |
| Captures relationships, not just definitions | Manual, user-driven cataloging |
| Data centric | Tool centric |
DataHub takes a more holistic, evolving, and connected approach to metadata management versus traditional hierarchical data catalogs.
What are some key advantages of DataHub over alternatives?
Some key advantages of DataHub include:
- Comprehensive – Broad metadata coverage across many systems.
- Extensible – Customizable without need for forks or custom code.
- Usable – Intuitive UI requiring minimal training.
- Community – Open source with thriving community support.
- Cloud native – Made for modern containerized environments.
- Interoperable – Integrates with existing stack vs. rip-and-replace.
Additionally, being open source, DataHub benefits from rapid innovation and transparency.
What types of organizations use DataHub?
DataHub is well suited for organizations with the following characteristics:
- Many disparate data systems and pipelines.
- Culture embracing open source and community collaboration.
- Focus on automation and reducing manual processes.
- Transitioning to a cloud native technology stack.
- Desire for flexibility and customizability in their stack.
- Need for connecting people, data, and insights.
Leading technology companies including LinkedIn, Expedia, and PayPal use DataHub for metadata management.
What are some examples of how companies use DataHub?
- LinkedIn – DataHub serves as the metadata platform providing broad visibility and understanding into the hundreds of data systems at LinkedIn, where the project originated.
- Expedia – Uses DataHub as the metadata layer to fuel self-serve data discovery and democratization across brands.
- PayPal – Leverages DataHub for artifact and lineage metadata management as part of its data mesh implementation.
- Experian – Relying on DataHub to catalog and govern usage of data products feeding into marketing analytics.
- Morningstar – Uses DataHub to index investment data products for easy discovery and governance.
These examples highlight how DataHub provides value across domains in different ways.
How can I learn more about DataHub?
There are several resources to learn more about DataHub:
- DataHub documentation – https://datahubproject.io/docs/
- DataHub GitHub repo – https://github.com/linkedin/datahub
- DataHub demo video – https://www.youtube.com/watch?v=0j2IDe6L-VM
- DataHub community Slack – https://slack.datahubproject.io/
Additionally, the open source nature of DataHub allows diving into the source code to really understand how it works.
What are best practices for implementing DataHub?
Some best practices for adopting DataHub include:
- Start with a specific pain point vs. a broad initiative.
- Focus on change management and user engagement.
- Begin ingesting with high value datasets.
- Phase rollout to prove value before expanding.
- Start small but plan for scale long term.
- Invest in maintenance to keep metadata current.
- Iteratively enhance with governance capabilities.
- Extend incrementally vs. customizing directly.
Following these principles will lead to a successful DataHub deployment yielding real value.
What types of skills are needed to use and customize DataHub?
Key skills needed for different DataHub roles include:
| Role | Required Skills |
| --- | --- |
| User | Data analysis, business skills |
| Admin | Linux, Docker, Java |
| Developer | Java, Spring Boot, React |
| Data Engineer | Python, Kafka, ETL |
DataHub leverages common skill sets and has a gentle learning curve, even for non-technical users.
Conclusion
DataHub provides an enterprise-grade open source metadata and data discovery platform. With robust community support and its origins at LinkedIn, DataHub offers a feature-rich solution built on modern data stacks. By registering, connecting, and governing data assets, DataHub accelerates analytics and democratizes data across the organization.