DataHub is an open source data discovery and metadata platform that helps organizations maximize the value of their data assets. As data volumes grow exponentially, data teams struggle to keep track of what data exists, where it is located, who owns it, and how to access it. DataHub provides a unified metadata catalog and search interface to find datasets across data warehouses, lakes and other siloed repositories.
DataHub has a microservices architecture and can be deployed on Kubernetes or as Docker containers. Docker provides a simple way to deploy DataHub in a self-contained environment without needing to install all the dependencies natively on the host system. Here is a step-by-step guide on how to install DataHub using Docker.
Prerequisites
Before starting with the DataHub installation, you need the following prerequisites:
- Docker installed on the host system. Docker provides operating-system-level virtualization by containerizing applications. Refer to the Docker installation instructions for your operating system.
- Docker Compose installed for defining and running multi-container Docker applications. Refer to the official Docker Compose installation instructions.
- A Git client to clone the DataHub repo. Refer to the Git installation instructions for your operating system.
- A browser to access the DataHub UI. DataHub has been tested on the latest versions of Chrome, Firefox, Safari and Edge.
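You can quickly verify the prerequisites from a terminal before proceeding:
docker --version
docker-compose --version
git --version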
Get the DataHub Docker Source Code
DataHub is an open source project hosted on GitHub. We will clone the DataHub repo to get the necessary docker files.
Run the following commands on your terminal to clone the DataHub repo:
git clone https://github.com/linkedin/datahub.git
cd datahub
This will create a datahub directory containing the source code.
Configure DataHub Properties
Before starting DataHub, we need to configure some properties like the host name and ports by editing the docker/.env file:
DATAHUB_GMS_HOST=datahub-gms
DATAHUB_FRONTEND_HOST=datahub-frontend
DATAHUB_KAFKA_ENABLED=false
DATAHUB_EMBEDDED_ES=true
- DATAHUB_GMS_HOST – Hostname to use for the GMS container
- DATAHUB_FRONTEND_HOST – Hostname to use for the Frontend container
- DATAHUB_KAFKA_ENABLED – Set to false to disable Kafka
- DATAHUB_EMBEDDED_ES – Set to true to use embedded Elasticsearch
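Once the file is updated, a quick check that the values are in place before starting the containers:
grep '^DATAHUB_' docker/.env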
Start the DataHub Containers
With the configuration in place, we are now ready to start DataHub!
Simply run the following Docker Compose command:
docker-compose up -d
This will start the DataHub containers in detached mode. The main containers that run are:
- datahub-gms – Generalized Metadata Service (GMS) backend
- datahub-frontend-react – DataHub React frontend
- datahub-elasticsearch – Elasticsearch for powering search & metadata
- datahub-mysql – MySQL database backend
Check the status of the containers by running:
docker-compose ps
You should see the containers in Up state once they are ready.
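If a container stays in a restarting state, following its logs is usually the fastest way to see why. For example, to tail the GMS backend (service names match the container list above):
docker-compose logs -f datahub-gms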
Access the DataHub UI
The DataHub UI can be accessed at:
http://localhost:9002
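If the page does not load right away, the containers may still be initializing. A quick way to check from the terminal (any HTTP status line in the response means the frontend is up):
curl -I http://localhost:9002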
Log in using the default credentials:
- Username: datahub
- Password: datahub
That’s it! You should now have DataHub installed and ready for use 🎉
Ingest Sample Metadata
After logging in, you will notice there is not much metadata to explore. Let’s ingest some sample metadata and datasets to get started.
Run the following command to ingest the sample metadata:
docker run --network datahub_default linkedin/datahub-gma:latest --quickstart
This ingests sample metadata such as databases, dashboards, and datasets. Refresh your browser to see the updated metadata!
Configure Users and Access
By default, DataHub comes with a single admin user account. You can create additional users and provide them permissions via groups.
To create a new user:
1. Click “Settings” and select “Users” from the left menu.
2. Click the “+ New User” button and fill in the details such as name and email.
3. Click “Generate” to auto-generate a password. Share these credentials with the user.
To create a new user group:
1. Click “Settings” and select “Groups” from the left menu.
2. Click the “+ New Group” button.
3. Enter a name and description for the group.
4. Use the toggles to assign permissions to the group, such as Edit, View, and Manage Tags.
5. Click “Save” to create the group.
To add a user to a group:
1. Edit the user from the Users section.
2. Under “Member Of” groups, select the groups you want the user to be added to.
3. Click “Save” to update the user’s group memberships.
The user should now inherit the permissions of the groups they are members of!
Enable Authentication
By default, DataHub uses simple credentials-based authentication. You can integrate it with your organization’s authentication provider, such as LDAP or SAML.
Refer to the DataHub authentication guides and enable the enterprise authentication system that matches your use case.
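For example, single sign-on via OIDC is configured through environment variables on the datahub-frontend container. The variable names below come from DataHub’s OIDC guide; the values are placeholders to replace with details from your identity provider, and where they go depends on how your Compose setup passes environment variables to the frontend:
AUTH_OIDC_ENABLED=true
AUTH_OIDC_CLIENT_ID=<your-client-id>
AUTH_OIDC_CLIENT_SECRET=<your-client-secret>
AUTH_OIDC_DISCOVERY_URI=https://your-idp.example.com/.well-known/openid-configuration
AUTH_OIDC_BASE_URL=http://localhost:9002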
Integrations
A key strength of DataHub is the number of integrations available to ingest metadata from various data systems.
Here are some popular integrations:
| Source System | Integration |
|---|---|
| MySQL | MySQL metadata extractor |
| BigQuery | BigQuery metadata extractor |
| Snowflake | Snowflake metadata extractor |
| Hive | Hive metadata extractor |
| Looker | LookML ingestion |
| Superset | Superset ingestion |
| Data Build Tool (dbt) | dbt ingestion |
Check the DataHub integrations guide for details on setting up the various integrations.
The integrations extract metadata from source systems on a scheduled basis and load it into DataHub’s metadata graph.
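If you run the ingestion framework outside of Docker, an integration is enabled by installing the matching plugin for the acryl-datahub CLI package. For example, to enable the MySQL source (a sketch assuming a local Python environment rather than the ingestion Docker image used later):
pip install 'acryl-datahub[mysql]'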
Configure Metadata Ingestion Pipeline
To regularly ingest metadata from different sources, we need to configure metadata ingestion pipelines. This consists of:
- Configuring metadata sources like database connections
- Extractors to pull metadata from sources
- Recipes that tie together the sources and extractors to ingest from
- Orchestrator to schedule and run the pipelines
Here are the key steps to configure ingestion:
- Update the docker/.env file to configure metadata sources like database credentials, API keys etc.
- Define recipes under metadata-ingestion/examples, like mysql_to_file.yml, containing the sources and extractors to use (a minimal recipe is sketched after this list).
- Run recipes manually using the command:
docker run --network datahub_default linkedin/datahub-ingestion:latest ingest -c metadata-ingestion/examples/recipes/mysql_to_file.yml
- Schedule recipes by editing docker/docker-compose.yml and updating the ingestion service.
- Commit updated recipes, sources etc to GitHub to persist your pipelines.
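For reference, a minimal recipe simply pairs a source with a sink. A rough sketch of what mysql_to_file.yml contains is shown below; the connection values are placeholders for your own MySQL instance, and exact option names can vary between DataHub versions:
# Pull metadata from a MySQL instance (placeholder credentials)
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub
    password: datahub
# Write the extracted metadata events to a local JSON file
sink:
  type: file
  config:
    filename: ./mysql_metadata.json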
Over time, you can build a comprehensive metadata ingestion pipeline to keep DataHub continually up to date!
Troubleshooting
Here are some common troubleshooting tips:
- Check container logs using docker-compose logs for errors.
- Try restarting containers using docker-compose restart [container].
- Delete all containers using docker-compose down and restart.
- Ensure the ports configured in .env are free on the host.
- Check network connectivity between containers.
- Review DataHub repo issues on GitHub for similar problems.
- Join the DataHub Slack community to ask questions.
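If all else fails, you can reset to a clean slate. Note that the -v flag also removes the Docker volumes, so any ingested metadata will be lost:
docker-compose down -v
docker-compose up -d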
Conclusion
Docker provides a quick way to get DataHub up and running. Once you have a basic understanding of how to deploy DataHub with Docker, you can easily customize it to suit your environment.
With the growing popularity of data platforms, investing in a metadata and governance strategy is key for organizations. DataHub provides an open source option to build a knowledge graph of your data assets and democratize data discovery.
Some ways to build on this Docker setup:
- Integrate with enterprise authentication systems.
- Ingest metadata from all your data sources.
- Customize the frontend and theming.
- Deploy on Kubernetes for scale and high availability.
As your metadata needs grow, DataHub provides the flexibility to expand via its microservices architecture. The active open source community also keeps improving DataHub with new features and integrations.
So don’t let your metadata remain isolated in silos. Start your DataHub journey today towards a centralized metadata catalog!