DataHub is a metadata management service that lets you build a knowledge graph of your data assets. It provides a single place to discover, search, and understand data and to trace its lineage. DataHub helps data consumers find relevant data easily and helps data producers publish metadata and build connections between disparate data.
What problems does DataHub solve?
Here are some of the key problems that DataHub aims to solve:
- Data discovery – Finding relevant datasets in large organizations is difficult due to data spread across different systems and lack of metadata.
- Data governance – Without metadata and lineage, ensuring compliance with regulations like GDPR is challenging.
- Data context – Consumers often lack context about a dataset, such as its accuracy, freshness, or security level.
- Data democratization – Self service access to data is limited without a catalog of available datasets.
By building a knowledge graph with semantic metadata, DataHub addresses these challenges and allows easy discovery, governance and access to data at scale.
Key capabilities of DataHub
Here are some of the key capabilities provided by DataHub:
Metadata ingestion
DataHub provides an extensible framework to ingest technical, business and operational metadata from data sources such as databases, data warehouses, data lakes and BI tools. Both batch and event based ingestion are supported.
Knowledge graph
The metadata is used to build a knowledge graph of data assets and their connections. This includes entities like datasets, columns, dashboards, ML models etc. and relationships like lineage, ownership, tags etc.
Search and discovery
The metadata powers a robust search experience to allow users to easily find datasets by name, description, tags, lineage etc. Search relevancy is optimized using techniques like synonym matching.
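To make synonym matching concrete, here is a minimal sketch of a keyword index over dataset names, descriptions and tags. It is not DataHub's search implementation: the `SYNONYMS` table, the sample datasets and every function name are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical synonym table; a real deployment would curate its own.
SYNONYMS = {"customer": {"client"}, "revenue": {"sales", "income"}}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand(token):
    """Return the token plus its synonyms, matching in both directions."""
    terms = {token}
    for key, values in SYNONYMS.items():
        if token == key or token in values:
            terms |= {key} | values
    return terms


def build_index(datasets):
    """Map each token in a dataset's name, description or tags to dataset names."""
    index = defaultdict(set)
    for ds in datasets:
        text = " ".join([ds["name"], ds["description"], *ds["tags"]])
        for token in tokenize(text):
            index[token].add(ds["name"])
    return index


def search(index, query):
    """Return datasets that match every query token or one of its synonyms."""
    results = None
    for token in tokenize(query):
        hits = set()
        for term in expand(token):
            hits |= index.get(term, set())
        results = hits if results is None else results & hits
    return sorted(results or [])


# Illustrative sample data, not taken from any real catalog.
DATASETS = [
    {"name": "orders", "description": "Daily sales per client", "tags": ["finance"]},
    {"name": "users", "description": "Registered customer accounts", "tags": ["pii"]},
]
INDEX = build_index(DATASETS)
```

With this sample data, `search(INDEX, "customer")` returns both datasets, because `orders` mentions the synonym "client" even though it never contains the word "customer".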
Data lineage
DataHub automatically stitches upstream and downstream dataset connections to provide visibility into data lineage across systems. This is useful for both data governance and impact analysis.
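Conceptually, lineage is a directed graph, and impact analysis is a traversal of it. The sketch below assumes a simple in-memory graph; the `LineageGraph` class and the dataset names are hypothetical and do not reflect how DataHub stores lineage internally.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of dataset dependencies (illustrative only)."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, upstream, downstream):
        """Record that `downstream` is derived from `upstream`."""
        self._downstream[upstream].add(downstream)
        self._upstream[downstream].add(upstream)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal returning every node reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # guard against cycles that lead back to start
        return sorted(seen)

    def downstream_of(self, dataset):
        """Everything that could break if `dataset` changes (impact analysis)."""
        return self._walk(dataset, self._downstream)

    def upstream_of(self, dataset):
        """Everything `dataset` ultimately depends on (root-cause analysis)."""
        return self._walk(dataset, self._upstream)


graph = LineageGraph()
graph.add_edge("mysql.orders", "s3.raw_orders")
graph.add_edge("s3.raw_orders", "snowflake.orders_clean")
graph.add_edge("snowflake.orders_clean", "snowflake.orders_agg")
graph.add_edge("snowflake.orders_clean", "looker.revenue_dashboard")
```

Calling `graph.downstream_of("mysql.orders")` lists every asset across the four systems that a schema change in the source table could affect.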
Browse and exploration
The knowledge graph enables intuitive browsing and exploration of datasets by entity relationships like columns, tags, owners etc. Interactive graph traversals allow serendipitous data discovery.
Custom metadata
DataHub allows defining custom metadata models to capture domain specific metadata like data quality, SLA, compliance status etc. This unlocks advanced governance workflows.
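As a rough illustration of what a custom metadata model might capture, the sketch below defines a data quality record in plain Python. It does not use DataHub's own modeling syntax; the `DataQualityAspect` class, its fields and the allowed status values are all assumptions.

```python
from dataclasses import asdict, dataclass

# Hypothetical set of compliance states for this example.
ALLOWED_STATUSES = {"unreviewed", "compliant", "non_compliant"}


@dataclass(frozen=True)
class DataQualityAspect:
    """Hypothetical custom metadata attached to one dataset."""

    dataset: str
    freshness_sla_hours: int
    completeness_pct: float
    compliance_status: str = "unreviewed"

    def __post_init__(self):
        # Validate on construction so bad metadata never enters the catalog.
        if self.freshness_sla_hours <= 0:
            raise ValueError("freshness_sla_hours must be positive")
        if not 0.0 <= self.completeness_pct <= 100.0:
            raise ValueError("completeness_pct must be between 0 and 100")
        if self.compliance_status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown compliance_status: {self.compliance_status}")


def violates_sla(aspect, hours_since_update):
    """Return True when the dataset is staler than its SLA allows."""
    return hours_since_update > aspect.freshness_sla_hours


aspect = DataQualityAspect("sales.orders", freshness_sla_hours=24, completeness_pct=99.5)
record = asdict(aspect)  # plain dict, ready to serialize
```

A governance workflow could then flag datasets where `violates_sla` returns true or where `compliance_status` is still `unreviewed`.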
Notifications and automation
Webhooks and emails can be configured to automatically trigger actions when metadata changes. For example, send notifications on ownership changes or trigger downstream jobs on data availability.
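The pattern behind this kind of automation is a dispatcher that routes change events to handlers. The following sketch assumes hypothetical event names (`ownership_changed`, `data_available`) and uses in-memory lists in place of a real email sender or job scheduler; it is not DataHub's notification API.

```python
from collections import defaultdict


class ChangeNotifier:
    """Routes metadata change events to registered handlers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_type, handler):
        """Register a handler, a callable that takes the event payload."""
        self._handlers[event_type].append(handler)

    def emit(self, event_type, payload):
        """Call every handler for this event type and return their results."""
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


outbox = []     # stands in for an email or webhook sender
job_queue = []  # stands in for a scheduler that runs downstream jobs


def notify_new_owner(event):
    message = f"{event['dataset']} is now owned by {event['new_owner']}"
    outbox.append(message)
    return message


def trigger_downstream_job(event):
    job = f"refresh:{event['dataset']}"
    job_queue.append(job)
    return job


notifier = ChangeNotifier()
notifier.on("ownership_changed", notify_new_owner)
notifier.on("data_available", trigger_downstream_job)

notifier.emit("ownership_changed", {"dataset": "sales.orders", "new_owner": "alice"})
notifier.emit("data_available", {"dataset": "sales.orders"})
```

Keeping handlers as plain callables means a new reaction, such as posting to a chat channel, can be added without touching the dispatcher.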
How does DataHub work?
At a high level, DataHub follows a bottom-up, metadata-driven architecture:
- Metadata is ingested from underlying data systems through batch or event based mechanisms.
- The metadata is used to build a knowledge graph of entities and relationships.
- The graph powers search, discovery, lineage and other experiences through DataHub’s UI and APIs.
- Users can view metadata, explore relationships and discover data in a self-service manner.
This architecture provides flexibility to extend metadata support to new systems and build custom interfaces leveraging DataHub’s APIs.
Metadata ingestion
DataHub supports ingesting technical, operational and business metadata from a wide variety of systems like:
- Databases – MySQL, Postgres, SQL Server, Snowflake etc.
- Data warehouses – BigQuery, Redshift, Athena
- Data lakes – S3, ADLS
- BI tools – Looker, Tableau, PowerBI
- ML platforms – SageMaker, DataRobot
- Custom systems via REST APIs
Both batch and event based ingestion are supported. Batch ingestion is useful for initial loads and full refreshes while event-based ingestion captures real time changes.
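The difference between the two modes can be sketched with a small in-memory store: a batch load replaces everything previously ingested from a source, while an event changes a single dataset. The `MetadataStore` class and the event shape (`type`, `source`, `dataset`, `columns`) are assumptions for illustration, not DataHub's actual event format.

```python
class MetadataStore:
    """Toy metadata store keyed by '<source>.<dataset>' (illustrative only)."""

    def __init__(self):
        self.datasets = {}

    def batch_load(self, source, snapshot):
        """Full refresh: drop what was ingested from `source`, then reload it."""
        self.datasets = {
            key: value
            for key, value in self.datasets.items()
            if value["source"] != source
        }
        for dataset, columns in snapshot.items():
            self.datasets[f"{source}.{dataset}"] = {
                "source": source,
                "columns": list(columns),
            }

    def apply_event(self, event):
        """Incremental update: apply one change without touching other datasets."""
        key = f"{event['source']}.{event['dataset']}"
        if event["type"] == "upsert":
            self.datasets[key] = {
                "source": event["source"],
                "columns": list(event["columns"]),
            }
        elif event["type"] == "delete":
            self.datasets.pop(key, None)
        else:
            raise ValueError(f"unsupported event type: {event['type']}")


store = MetadataStore()

# Initial load from a full snapshot of the source system.
store.batch_load("mysql", {"orders": ["id", "total"], "users": ["id", "email"]})

# Later changes arrive one at a time as events.
store.apply_event(
    {"type": "upsert", "source": "mysql", "dataset": "orders",
     "columns": ["id", "total", "currency"]}
)
store.apply_event({"type": "delete", "source": "mysql", "dataset": "users"})
```

A periodic batch load also corrects any drift caused by missed events, which is one reason the two modes are commonly used together.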
Knowledge graph
The metadata ingested from different systems is used to build a comprehensive knowledge graph within DataHub. The key aspects include:
- Entities – Datasets, columns, dashboards, ML models, tags etc.
- Relationships – Lineage, upstream/downstream, ownership, tags, SLA etc.
- Identity stitching – Matching entities ingested from multiple systems.
- Search indexing – Optimizing text search across entities.
This knowledge graph powers all downstream experiences in DataHub like search, discovery, lineage etc.
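The aspects listed above can be sketched as a small graph in which entities are keyed by a normalized identifier, so that the same dataset reported by two systems collapses into one node. The `KnowledgeGraph` class and its lowercase normalization rule are assumptions for illustration; DataHub's real identity scheme is more involved.

```python
class KnowledgeGraph:
    """Tiny entity-relationship graph with naive identity stitching."""

    def __init__(self):
        self.entities = {}  # key -> merged properties
        self.edges = set()  # (source_key, relationship, target_key)

    @staticmethod
    def key(entity_type, name):
        """Normalize names so differently cased references map to one entity."""
        return f"{entity_type}:{name.strip().lower()}"

    def upsert(self, entity_type, name, **properties):
        """Create the entity or merge new properties into an existing one."""
        entity_key = self.key(entity_type, name)
        self.entities.setdefault(entity_key, {}).update(properties)
        return entity_key

    def relate(self, source_key, relationship, target_key):
        """Add a typed edge such as 'owned_by' or 'tagged_with'."""
        self.edges.add((source_key, relationship, target_key))

    def neighbors(self, entity_key, relationship):
        """Return the targets of one relationship type for an entity."""
        return sorted(
            target
            for source, rel, target in self.edges
            if source == entity_key and rel == relationship
        )


kg = KnowledgeGraph()

# The warehouse and the BI tool describe the same table with different casing.
from_warehouse = kg.upsert("dataset", "ANALYTICS.ORDERS", platform="snowflake")
from_bi_tool = kg.upsert("dataset", "analytics.orders", description="Cleaned orders")

owner = kg.upsert("user", "alice")
tag = kg.upsert("tag", "finance")
kg.relate(from_warehouse, "owned_by", owner)
kg.relate(from_warehouse, "tagged_with", tag)
```

Because both `upsert` calls resolve to the same key, the resulting entity carries the platform from one system and the description from the other.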
User experience
DataHub provides the following user experiences on top of the knowledge graph:
- Search – Search datasets by name, description, tags, lineage etc.
- Browse – Navigate the graph by entity relationships.
- Lineage – View upstream and downstream datasets.
- Glossary – Curated definitions of business terms.
- Usage – View dataset usage across BI tools, notebooks etc.
- Impact analysis – Understand dependencies before changes.
These experiences help both data producers and consumers find, understand and trust data easily.
Benefits of using DataHub
Here are some of the key benefits of using DataHub:
Faster data discovery
Intuitive search and browsing allow both data producers and consumers to easily find relevant datasets for their needs.
Improved data governance
Lineage, traceability and controls help organizations comply with regulations and improve trust in data.
Enhanced productivity
Easy discovery and self-service reduce the time spent locating datasets and understanding their context.
Democratized data access
The catalog and glossary give novice data users the information they need to access and use data.
Reduced technical debt
Rich metadata improves developer productivity and reduces the time spent resolving data issues.
Centralized metadata
Unified metadata from all systems avoids siloed, disjoint metadata spread across tools.
Extensible and scalable
Open architecture allows extending DataHub to new data sources and custom implementations.
Key components of DataHub architecture
DataHub follows a microservices based architecture. The key components are:
Metadata services
Core services for ingesting, storing, managing and querying metadata.
- Metadata service – Stores metadata as entities and relationships.
- Metadata Jobs – Batch ingestion pipelines for metadata.
- Message Broker – Handles events for real-time updates.
- Search service – Provides full text search across entities.
Backend services
Backend services for managing workflows, policies, users etc.
- Auth service – Authentication and authorization.
- Entity service – Lifecycle management for entities.
- Workflow service – Manage metadata ingestion pipelines.
- Policy service – Apply rules and policies to metadata.
Frontend apps
Browser based UIs for end user experiences.
- Web React App – Main user experience for search, browse etc.
- GMS React App – Data governance workflows.
- Metadata Ingestion UI – Configure pipelines.
External integrations
Integration modules for different data systems.
- Connectors – Extract metadata from warehouses, lakes etc.
- Bots – Index usage info from BI tools.
- Hooks – Listen for metadata change events.
This architecture provides a robust, scalable and extensible metadata platform.
How to use DataHub?
Here is a quick overview of steps to use DataHub:
- Installation – Deploy the containerized DataHub stack on Kubernetes or a VM.
- Configuration – Configure ingestion pipelines for required systems.
- Ingestion – Run ingestion to extract metadata into DataHub.
- Search – Use search to find datasets based on metadata.
- Browse – Navigate dataset relationships like lineage and tags.
- Explore – Leverage glossary and other metadata to understand datasets.
Advanced uses such as custom metadata, policies and automation can build on this foundation.
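The configuration step usually comes down to a recipe that names a source to read from and a sink to write to. The dictionary below follows that general shape as an assumption; the specific keys (`host_port`, `database`, `server`), the hostnames and the `validate_recipe` helper are illustrative and should be checked against the DataHub documentation before use.

```python
# Hypothetical recipe: read metadata from a MySQL database and send it to a
# DataHub server. All values are placeholders.
RECIPE = {
    "source": {
        "type": "mysql",
        "config": {"host_port": "localhost:3306", "database": "sales"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def validate_recipe(recipe):
    """Return a list of structural problems; an empty list means it looks sane.

    This only checks the shape assumed in this example. It does not verify
    connector names or connector-specific settings.
    """
    problems = []
    for section in ("source", "sink"):
        block = recipe.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        if not block.get("type"):
            problems.append(f"{section}: missing type")
        if not isinstance(block.get("config"), dict):
            problems.append(f"{section}: missing config")
    return problems


problems = validate_recipe(RECIPE)
```

Checking the structure up front catches typos in a recipe before an ingestion run is started against a production system.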
Deployment options
DataHub is packaged as Docker containers and can be deployed in different environments:
- Docker – Run standalone or orchestrated via Docker Compose.
- Kubernetes – Leverage Helm charts for deployment on Kubernetes.
- Cloud – Deploy on AWS ECS, Azure AKS etc.
This provides flexibility to deploy on cloud or on-premises per your requirements.
Ingestion sources
DataHub provides various ingestion connectors out of the box for common systems:
Source | Connector |
---|---|
MySQL | MySQL |
Postgres | Postgres |
BigQuery | BigQuery |
Snowflake | Snowflake |
Redshift | Redshift |
S3 | S3 |
ADLS | ADLS Gen1, Gen2 |
Looker | LookML |
Tableau | TMS |
DataRobot | DataRobot |
Custom connectors can also be built using DataHub’s framework to extend support for other systems.
Access and security
DataHub provides the following mechanisms for access control:
- Authentication – Via LDAP, SAML etc.
- Authorization – User/group based ACL for metadata.
- Encryption – Secure communication via HTTPS and TLS.
- Redaction – Hide sensitive data like PII in metadata.
- Anonymization – Aliasing for entities containing sensitive data.
This ensures only authorized users get access to appropriate metadata.
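How authorization and redaction combine can be sketched as two small checks applied before metadata is returned. The policy table, the list of sensitive fields and the default-allow rule below are all assumptions made for this example, not DataHub's policy model.

```python
# Hypothetical access policies: dataset -> users and groups allowed to view it.
POLICIES = {
    "finance.payroll": {"users": {"alice"}, "groups": {"finance"}},
}

# Hypothetical metadata fields that should never be shown verbatim.
SENSITIVE_FIELDS = {"sample_values", "owner_email"}


def can_view(user, groups, dataset):
    """Check user and group ACLs for a dataset.

    Assumption: datasets without a policy are visible to everyone.
    """
    policy = POLICIES.get(dataset)
    if policy is None:
        return True
    return user in policy["users"] or bool(set(groups) & policy["groups"])


def redact(metadata, sensitive=SENSITIVE_FIELDS):
    """Return a copy of the metadata with sensitive fields masked."""
    return {
        field: "<redacted>" if field in sensitive else value
        for field, value in metadata.items()
    }


def view_metadata(user, groups, dataset, metadata):
    """Authorize first, then redact what an authorized user may see."""
    if not can_view(user, groups, dataset):
        raise PermissionError(f"{user} may not view {dataset}")
    return redact(metadata)


payroll = {
    "description": "Monthly payroll runs",
    "owner_email": "payroll-team@example.com",
    "sample_values": ["4200.00", "3900.00"],
}
visible = view_metadata("bob", {"finance"}, "finance.payroll", payroll)
```

Authorization decides whether a user sees a dataset at all, while redaction limits what even an authorized user can read, so the two checks are kept separate.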
Comparison with alternatives
Here is how DataHub compares with some alternate solutions:
Data catalogs
DataHub provides more extensive metadata management beyond just a catalog. Key differences:
Feature | DataHub | Data Catalogs |
---|---|---|
Search | ✅ | ✅ |
Lineage | ✅ | ❌ |
Impact Analysis | ✅ | ❌ |
Glossary | ✅ | ❌ |
Custom Metadata | ✅ | ❌ |
Scale | Enterprise | Departmental |
Data governance tools
DataHub focuses on technical metadata management rather than only on governance policies and procedures. Key differences:
Feature | DataHub | Governance Tools |
---|---|---|
Metadata Ingestion | ✅ | ❌ |
Knowledge Graph | ✅ | ❌ |
Lineage | ✅ | Manual |
Policies | ✅ | ✅ |
Workflows | ✅ | ✅ |
Limitations of DataHub
While DataHub provides extensive metadata management capabilities, it also has some limitations:
- Learning curve – Getting started with metadata modeling requires some ramp up time.
- Maturity – As an open source project, DataHub may lag commercial tools in maturity.
- Scale – Extreme scales (>1B entities) may require optimization.
- Security – Additional integrations needed for enterprise security.
- Support – Support relies on the open source community and self-service.
Organizations should evaluate these factors based on their specific use cases and requirements.
Conclusion
In summary, DataHub provides a comprehensive metadata management platform for building a knowledge graph of data assets. It enables critical use cases like discovery, governance, democratization and developer productivity through its metadata ingestion, entity relationship graph and flexible architecture.
For most enterprises struggling with a fragmented metadata landscape, DataHub can serve as a strategic metadata platform that delivers value across multiple business and technology domains.