Databricks connector

Modified on: Mon, 1 Jun, 2026 at 9:57 AM

This article explains how the Databricks connector for DataGalaxy works.

This connector is available in the following modes:

SaaS Online mode ✅

This connector supports the following import modes:

Standard mode ✅

⚠ A recent breaking change in Databricks' REST API affects the lineage around notebooks in the current version of the connector. You may miss some lineage links when notebooks are run through a job (also known as a workflow).
We're currently working on the next version of the connector, which uses a different and more precise approach for retrieving the lineage. This new version is currently in test phase and will be released soon. The new approach will only be available in URN mode.

Scope, attributes and mapping with DataGalaxy

Objects

Some of the attributes listed here may not be present by default in your objects' screens configuration. To make them appear in DataGalaxy screens, it may be necessary to adapt the screens of the concerned objects before running the connector. See this article to learn more about screen customization.

Instance

A Databricks Instance is represented by a Relational DB in the Dictionary module, and by a Data Flow in the Data Processing module.

The URN follows this syntax:

urn:databricks-1:instance

The following attributes are retrieved from the connection configuration:

DataGalaxy attribute	Source/Value
Technical name	Instance hostname set up in the connector

Catalog

A Catalog is represented by a Model.

The URN follows this syntax:

urn:databricks-1:instance:catalog

The list of Catalogs is retrieved using the JDBC connection and the SHOW CATALOGS statement. The following attributes are retrieved using the DESCRIBE CATALOG EXTENDED statement:

DataGalaxy attribute	Source/Value
Technical name	catalog
Summary	Comment
Creation date of the source object	Created At
Last modification date of the source object	Updated At

Note: the system and __databricks_internal catalogs are filtered out implicitly.
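The catalog discovery described above can be sketched in a few lines of Python. This is only an illustration, not the connector's actual code: the function names are hypothetical, the case-insensitive comparison is an assumption, and the input list stands in for the rows returned by SHOW CATALOGS.

```python
# Catalogs the article says are filtered out implicitly.
IMPLICIT_CATALOGS = {"system", "__databricks_internal"}


def visible_catalogs(show_catalogs_rows):
    """Keep the catalog names from SHOW CATALOGS, minus the implicit ones.

    The case-insensitive comparison is an assumption of this sketch.
    """
    return [name for name in show_catalogs_rows
            if name.lower() not in IMPLICIT_CATALOGS]


def describe_statements(catalogs):
    """Build one DESCRIBE CATALOG EXTENDED statement per remaining catalog."""
    return [f"DESCRIBE CATALOG EXTENDED `{name}`" for name in catalogs]


# Illustrative catalog names only.
catalogs = visible_catalogs(["main", "system", "__databricks_internal", "sales"])
for statement in describe_statements(catalogs):
    print(statement)
```

Each generated statement would then be executed over the JDBC connection to read the Comment, Created At and Updated At values.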

Schema

A Schema is represented by a Model.

The URN follows this syntax:

urn:databricks-1:instance:catalog:schema

The list of Schemas is retrieved using the JDBC connection and the SHOW SCHEMAS statement. The following attributes are retrieved using the DESCRIBE SCHEMA EXTENDED statement:

DataGalaxy attribute	Source/Value
Technical name	databaseName
Summary	Comment

Note: the INFORMATION_SCHEMA schemas are filtered out implicitly.

Table (Managed or External)

A Table is represented by a Table.

The URN follows this syntax:

urn:databricks-1:instance:catalog:schema:table

The list of Tables is retrieved using the JDBC connection and the SHOW TABLES statement. The following attributes are retrieved using the DESCRIBE TABLE EXTENDED statement (some attributes may not be present depending on the type of Table):

DataGalaxy attribute	Source/Value
Technical name	tableName
Summary	Comment
Technical type	Type
External Id	Id*
Creation date of the source object	Created At
Last modification date of the source object	Updated At
Link to source	Location
Query	Query
Current storage size	sizeInBytes*
Is partitioned	"# Partition Information" present in the table's metadata*

*This information is only available using the metadata retrieval method "DESC TABLE".

View (including Materialized View)

A View is represented by a View.

The URN follows this syntax:

urn:databricks-1:instance:catalog:schema:view@view

The list of Views is retrieved using the JDBC connection and the SHOW TABLES statement. The following attributes are retrieved using the DESCRIBE TABLE EXTENDED statement (some attributes may not be present depending on the type of View):

DataGalaxy attribute	Source/Value
Technical name	tableName
Summary	Comment
Technical type	"VIEW"
External Id	Id*
Creation date of the source object	Created At
Last modification date of the source object	Updated At
Link to source	Location
Query	Query
Current storage size	sizeInBytes*
Is partitioned	"# Partition Information" present in the table's metadata*

*This information is only available using the metadata retrieval method "DESC TABLE".

Column

A Column is represented by a Column.

The URN follows this syntax:

urn:databricks-1:instance:catalog:schema:table:column

The following attributes are retrieved at the same time as the Table's metadata:

DataGalaxy attribute	Source/Value
Technical name	col_name
Summary	Comment
Technical type	data_type

The following attributes are calculated:

DataGalaxy attribute	Source/Value
Order	Position of the Column in the Columns list
Is partition key	Column present in the "# Partition Information" section of table's metadata
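As a rough illustration of how the calculated attributes could be derived, the sketch below walks through rows shaped like DESCRIBE TABLE EXTENDED output (col_name, data_type, comment). The sample rows, the function name and the 1-based Order value are assumptions made for this example; the connector's real parsing may differ.

```python
# Illustrative rows only: (col_name, data_type, comment).
SAMPLE_ROWS = [
    ("id", "int", None),
    ("day", "date", "Load day"),
    ("# Partition Information", "", ""),
    ("# col_name", "data_type", "comment"),
    ("day", "date", "Load day"),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Catalog", "main", ""),
]


def parse_describe_rows(rows):
    """Return (columns, is_partitioned) from DESCRIBE TABLE EXTENDED rows."""
    columns, partition_keys = [], set()
    is_partitioned = False
    for name, data_type, comment in rows:
        name = (name or "").strip()
        if name == "# Partition Information":
            is_partitioned = True          # later rows list partition columns
            continue
        if name.startswith("# Detailed Table Information"):
            break                          # table-level properties start here
        if not name or name.startswith("#"):
            continue                       # blank separators and sub-headers
        if is_partitioned:
            partition_keys.add(name)
        else:
            columns.append({
                "technical_name": name,
                "technical_type": data_type,
                "summary": comment,
                "order": len(columns) + 1,  # position in the Columns list
            })
    for column in columns:
        column["is_partition_key"] = column["technical_name"] in partition_keys
    return columns, is_partitioned


columns, is_partitioned = parse_describe_rows(SAMPLE_ROWS)
print(is_partitioned, [c["technical_name"] for c in columns if c["is_partition_key"]])
```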

Workspace Directory

A Workspace Directory is represented by a Data Flow.

The URN follows this syntax:

urn:databricks-1:instance:Workspace@workspace:directory

The following attributes are retrieved using the Databricks' REST API List contents (GET /api/2.0/workspace/list) endpoint:

DataGalaxy attribute	Source/Value
Technical name	name
External Id	object_id

Notebook

A Notebook is represented by a Data Processing.

The URN follows this syntax:

urn:databricks-1:instance:Workspace@workspace:directory:notebook@notebook

The following attributes are retrieved using the Databricks' REST API List contents (GET /api/2.0/workspace/list) endpoint:

DataGalaxy attribute	Source/Value
Technical name	name
External Id	object_id
Technical type	object_type
Summary	language
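The mapping of a List contents response to Workspace Directories and Notebooks might look like the following sketch. The sample payload is illustrative, the function name is hypothetical, and deriving the name from the last segment of the `path` field is an assumption of this example rather than documented connector behavior.

```python
import json
from posixpath import basename

# Illustrative response body for GET /api/2.0/workspace/list.
SAMPLE_RESPONSE = json.dumps({
    "objects": [
        {"object_type": "DIRECTORY", "path": "/Shared/etl", "object_id": 101},
        {"object_type": "NOTEBOOK", "path": "/Shared/load_orders",
         "language": "PYTHON", "object_id": 102},
    ]
})


def split_workspace_objects(response_body):
    """Split a workspace/list response into directories and notebooks."""
    directories, notebooks = [], []
    for obj in json.loads(response_body).get("objects", []):
        entry = {
            # Assumption: the name is the last segment of the object's path.
            "technical_name": basename(obj["path"]),
            "external_id": obj["object_id"],
        }
        if obj["object_type"] == "DIRECTORY":
            directories.append(entry)
        elif obj["object_type"] == "NOTEBOOK":
            entry["technical_type"] = obj["object_type"]
            entry["summary"] = obj.get("language")
            notebooks.append(entry)
    return directories, notebooks


directories, notebooks = split_workspace_objects(SAMPLE_RESPONSE)
print(directories)
print(notebooks)
```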

Workflow

Note: the Workflows are only supported in URN mode.

A Workflow is represented by a Data Processing.

The URN follows this syntax:

urn:databricks-1:instance:Workflows@workflows:workflowId

The following attributes are retrieved using the Databricks' REST API List jobs (GET /api/2.2/jobs/list) and Get a single job (GET /api/2.2/jobs/get) endpoints:

DataGalaxy attribute	Source/Value
Technical name	job_id
Functional name	name
Summary	description
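The URN syntaxes listed for each object type can be summarized with a few helper functions. This is a sketch under assumptions: the lowercase words in the syntaxes (instance, catalog, schema, and so on) are treated as placeholders, the helper names are hypothetical, and any escaping of special characters or handling of nested directories is not covered.

```python
PREFIX = "urn:databricks-1"


def instance_urn(instance):
    return f"{PREFIX}:{instance}"


def table_urn(instance, catalog, schema, table):
    return f"{instance_urn(instance)}:{catalog}:{schema}:{table}"


def view_urn(instance, catalog, schema, view):
    return f"{instance_urn(instance)}:{catalog}:{schema}:{view}@view"


def column_urn(instance, catalog, schema, table, column):
    return f"{table_urn(instance, catalog, schema, table)}:{column}"


def directory_urn(instance, directory):
    return f"{instance_urn(instance)}:Workspace@workspace:{directory}"


def notebook_urn(instance, directory, notebook):
    return f"{directory_urn(instance, directory)}:{notebook}@notebook"


def workflow_urn(instance, workflow_id):
    return f"{instance_urn(instance)}:Workflows@workflows:{workflow_id}"


# Illustrative names only.
print(column_urn("my-instance", "main", "sales", "orders", "order_id"))
print(notebook_urn("my-instance", "Shared", "load_orders"))
```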

Links

The links created by the Databricks connector are lineage links between structures in the Dictionary and, where applicable, Data Processing objects in the Data Processing module. Retrieving the lineage is optional: the "Get Lineage" option has to be selected in the configuration of the connector. Granularity can then be configured at table or column level*. The method for retrieving the lineage can be configured too*; the connector offers two options:

The new option, recommended by Databricks, is to use Databricks' System Tables. An administrator of the Databricks workspace may need to configure them to make them available. This option provides the most precise lineage. With this method, the connector reads the system.access.table_lineage and system.access.column_lineage views to retrieve the lineage.
The legacy option is to use Databricks' lineage REST API (GET /api/2.0/lineage-tracking/table-lineage endpoint). This API is less precise than the System Tables. For instance, it cannot get the full mapping between input and output objects around Notebooks and Workflows, which is possible using the System Tables.

These two options are only available in URN mode.
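To give an idea of what reading the System Tables involves, the sketch below builds a query against the table lineage view, limited to a lookback period. The function name, the selected columns and the filters are assumptions made for illustration; the query actually issued by the connector is not documented here.

```python
def table_lineage_query(history_days, system_catalog="system"):
    """Build a table-level lineage query limited to a lookback period.

    history_days mirrors the "Lineage history depth (in days)" setting and
    system_catalog the "Custom system catalog" setting.
    """
    history_days = int(history_days)
    if history_days < 0:
        raise ValueError("history_days must be zero or positive")
    return (
        "SELECT source_table_full_name, target_table_full_name, "
        "entity_type, entity_id, event_time "
        f"FROM {system_catalog}.access.table_lineage "
        f"WHERE event_date >= date_sub(current_date(), {history_days}) "
        "AND source_table_full_name IS NOT NULL "
        "AND target_table_full_name IS NOT NULL"
    )


print(table_lineage_query(30))
```

Restricting on the event date keeps the scan bounded, which matters because lineage retrieval adds to the compute time of the import.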

When creating the links around Notebooks and Workflows, the following behavior is implemented to get the most complete and precise lineage possible in DataGalaxy:

If the "Get Notebooks" option is selected, the lineage links are created around all Notebooks which are part of the scope of the connector. If a Notebook is not in the scope (filtered out by the path filter or belonging to another Databricks Workspace), then the links are created directly between the involved Dictionary structures.
If the "Get Workflows" option is selected, the lineage links are created around all Workflows which are part of the scope of the connector. If a Workflow is not in the scope (belonging to another Databricks Workspace), then the links are created directly between the involved Dictionary structures.
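The behavior described above can be expressed as a small routing rule. This is a simplified sketch with hypothetical names, not the connector's implementation: a lineage event goes through the Notebook or Workflow when that object is in scope, and otherwise becomes a direct link between the Dictionary structures.

```python
def lineage_links(source, target, processing=None, in_scope=frozenset()):
    """Return the (from, to) links to create for one lineage event.

    processing is the Notebook or Workflow behind the event, if any;
    in_scope is the set of Data Processing objects imported by the connector.
    """
    if processing is not None and processing in in_scope:
        # Links are created around the Data Processing object.
        return [(source, processing), (processing, target)]
    # Out of scope (filtered out or from another workspace): direct link.
    return [(source, target)]


# Illustrative object names only.
scope = {"/Shared/load_orders"}
print(lineage_links("main.raw.orders", "main.sales.orders",
                    "/Shared/load_orders", scope))
print(lineage_links("main.raw.orders", "main.sales.orders",
                    "/Users/someone/scratch", scope))
```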

When a Data Processing object is involved in the lineage, the Databricks connector leverages the Data Processing Items in order to provide the most precise mapping between input and output objects. The name of the Data Processing Items created by the connector is a technical name built by analyzing the names of the input and output objects; it does not come from Databricks and does not correspond to any real-world object. This name remains stable over time, as long as the input and output objects don't change.

Note: the Orphaned Objects Handling mechanism doesn't support the Data Processing Items yet. This means that old Data Processing Items will not be cleaned up by the Orphaned Objects Handling feature. The team has identified this limitation and is currently working on an evolution to address it.

Detailed scope

Input (dictionary module)

Catalog, schema, table and view

From the home page of your Databricks account, these items are visible in the “Catalog” section on the left

Column

By clicking on a given table or view, you will see details of the columns that comprise it

Input (Data Processing module)

Folders

From the home page of your Databricks account, the folders are visible in the “Workspace” section on the left. The folders contained in the “Workspace” folder will be imported. The “Users” and “Repos” folders will be ignored during the import.
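A minimal sketch of the folder scoping follows, combining the ignored folders with the optional path prefix filter of the connection settings. The function name and the exact matching rules are assumptions made for this example; the ignored roots use the /Users path form given for the path filter.

```python
# Folders ignored during the import, written as workspace paths.
IGNORED_ROOTS = ("/Users", "/Repos")


def folder_in_scope(path, prefix=""):
    """Tell whether a workspace folder would be imported.

    prefix mirrors the "Filter - Path filter (prefix)" setting; an empty
    prefix keeps every folder that is not ignored.
    """
    for root in IGNORED_ROOTS:
        if path == root or path.startswith(root + "/"):
            return False
    return path.startswith(prefix)


# Illustrative paths only.
for path in ("/Shared/etl", "/Users/someone/scratch", "/Repos/project"):
    print(path, folder_in_scope(path, prefix="/Shared"))
```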

Notebooks

Notebooks are visible by clicking on a folder. They will then appear in the central part of the screen.

Workflows (only with the Unity Catalog version of Databricks in URN mode)

From the home page of your Databricks account, workflows are visible in the “Jobs and Pipelines” section on the left.

By clicking on a given job, you will see the details associated with it, including the associated lineage, with the upstream and downstream tables (which in DataGalaxy will correspond to links to tables).

You will also find this information by opening the “lineage” tab of a given table in the “Catalog” section.

Output (dictionary module)

Catalog, schema, table, view and column

Output (Data Processing module)

Folders and Notebooks

Configuration of a connection

On Databricks' side

The Databricks connector uses the JDBC driver provided by Databricks and the Unity Catalog REST API. Connecting to a Databricks instance therefore requires a cluster to execute SQL commands via the JDBC driver. You can use either an interactive cluster or an SQL Warehouse cluster. Access to the cluster connection information is available here. To optimize processing times, you can start the cluster before launching the connector.

There are three modes available for authenticating on the Databricks cluster:

The token authentication

The detailed procedure for generating a token is available here. The token is associated with a user who must have access to the tables you want to upload via the connector (in other words, when you log in with the account associated with the token, if you do not see what you want to upload, then the import with this token will not upload the missing objects either).

To generate a token, follow these steps:

Log in using the User you want to associate with the token
From the home page, click on the user icon in the top right corner, then on “Settings”
Open the “Developer” menu, then click on the "Manage" button for “Access tokens”
Generate your token by assigning it a description and a lifetime
Keep the generated token; you can now use it to configure your DataGalaxy connection

Authentication using Entra ID (Azure AD) Service Principal

In order to use service principals on Azure Databricks, an admin user must first create a new Microsoft Entra ID (formerly Azure AD) application by following these steps:

Go to the Azure portal (for example, by clicking on the user icon from your Databricks account and then on “Azure Portal”)
Once in the Azure portal, find and click on “Microsoft Entra ID” in the search bar
Then click on “+ Add” and then “App registration”
Enter the necessary information, remembering to select the option “Accounts in this organizational directory only (Single tenant)” in the “Supported account types” section
Once the application has been created, remember to copy/paste the “Application (client) ID” and the “Directory (tenant) ID” before clicking on “Certificates and secrets” on the left
In “Certificates and secrets,” generate a secret using “+ New client secret”. In the window on the right, enter a description and an expiration date before clicking “Add”
Keep the generated secret; you can now use it to configure your DataGalaxy connection

Once the application has been created in Microsoft Entra ID, you will need to link it to your Databricks account by following these steps:

From the home page, click on the user icon in the top right corner, then click on “Settings”
Open the “Identity and Access” menu, then click on the "Manage" button under “Service Principals”
You will then be able to create a service principal using the “Add service principal” button
All you need to do is copy the Microsoft Entra Application ID to associate your Azure application with the Azure Databricks account
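For reference, the Tenant ID, Client ID and client secret collected above are the inputs of a standard Microsoft Entra ID client-credentials request. The sketch below only builds that request without sending it; the function name is hypothetical, the scope uses the well-known Azure Databricks resource ID, and nothing here is meant to describe how the connector authenticates internally.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Well-known application ID of the Azure Databricks resource.
AZURE_DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def entra_token_request(tenant_id, client_id, client_secret):
    """Build (without sending) the client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{AZURE_DATABRICKS_RESOURCE_ID}/.default",
    }).encode()
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder values only.
request = entra_token_request("my-tenant-id", "my-client-id", "my-secret")
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would return a JSON document containing an access token, which is a quick way to check that the application and its secret are valid.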

Authentication using a Databricks Service Principal

A Service Principal is a specialized identity used for automatic access and scheduled operations. You can manage access for a Databricks Service Principal in the same way you manage access for a user. To create one, follow these steps:

From the home page, click on the user icon in the top right corner, then on “Settings”
Open the “Identity and Access” menu, then click on the "Manage" button under “Service Principals”
You will then be able to create a service principal using the “Add service principal” button
Once the service principal has been created, click on it to access its details, including the “Secrets” tab where you will find the “Generate secret” button. As with the token, you will be asked to enter a lifetime for this secret
Keep the generated secret; you can now use it to configure your DataGalaxy connection
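The client ID and secret of a Databricks Service Principal can be checked with an OAuth machine-to-machine token request against the workspace. The sketch below builds that request without sending it; the function name is hypothetical and the example is not a description of the connector's internal authentication code.

```python
import base64
from urllib.request import Request


def databricks_token_request(server, client_id, client_secret):
    """Build (without sending) the workspace-level OAuth token request."""
    credentials = base64.b64encode(
        f"{client_id}:{client_secret}".encode()
    ).decode()
    return Request(
        f"https://{server}/oidc/v1/token",
        data=b"grant_type=client_credentials&scope=all-apis",
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# Placeholder values only.
request = databricks_token_request(
    "adb-XXXXXXXXXXXXXXXX.X.azuredatabricks.net", "my-client-id", "my-secret"
)
print(request.get_method(), request.full_url)
```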

Details of the rights required to obtain metadata

We will now detail the permissions associated with the different types of metadata, divided into categories

Level 1: Catalog, Schema, Table/View

To grant the necessary rights to your Databricks or Azure Service Principal, the process is the same for both:

Select “Catalog” on the left
Select the source you want to grant access to
Go to the “Permissions” tab and click on “Grant”
In the window that opens, add your Service Principal, then check the USE CATALOG, USE SCHEMA and SELECT privileges before confirming (if you use the metadata retrieval method "INFORMATION_SCHEMA", replace SELECT with BROWSE)
For finer control, you can grant USE SCHEMA and SELECT at schema and table level instead.
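The same Level 1 privileges can also be granted with SQL rather than through the interface. The helper below generates the corresponding GRANT statements at catalog level; its name and parameters are hypothetical, and the principal is whatever identifier your workspace uses for the Service Principal (typically its application ID).

```python
def grant_statements(catalog, principal, fetch_method="DESC TABLE"):
    """Build catalog-level GRANT statements for the Level 1 scope.

    fetch_method mirrors the "Table/View metadata fetching method" setting:
    "INFORMATION_SCHEMA" only needs BROWSE, "DESC TABLE" needs SELECT.
    """
    read_privilege = "BROWSE" if fetch_method == "INFORMATION_SCHEMA" else "SELECT"
    privileges = ("USE CATALOG", "USE SCHEMA", read_privilege)
    return [
        f"GRANT {privilege} ON CATALOG `{catalog}` TO `{principal}`"
        for privilege in privileges
    ]


# Placeholder catalog and principal.
for statement in grant_statements("main", "my-service-principal-id"):
    print(statement)
```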

Option 1: Folder, Notebook (Feature - Get Notebooks)

To grant the necessary rights to your Databricks or Azure Service Principal, the procedure is the same for both:

Select “Workspace” on the left
Select the “Workspace” folder to which you want to grant access
Click on the “Share” button
In the window that opens, enter the name of your Service Principal and grant it the “Can View” right

Option 2: Workflow (Feature - Get Workflows*)

Select “Jobs and Pipelines” on the left
Select the element you want to grant access to
In the panel on the right, scroll down to "Permissions" and click on "Edit permissions"
In the window that opens, enter the name of your Service Principal and grant it the “Can View” right

Option 3: Unity Lineage (Feature - Get Lineage from Unity)

When "Feature - Get Lineage from Unity" is activated, having the previous authorizations on all the relevant tables, notebooks and workflows is enough to retrieve all the lineage information attached to them. In both modes (REST API and System Tables), your workspace must be enabled for Unity Catalog.

Suboption 1: Lineage granularity
1. Table
2. Column
Suboption 2: Lineage retrieval method
1. Auto
2. System Tables (more information here): this retrieval method provides more lineage information than the REST API, such as a mapping around the objects of the scope (see images below, comparing both options at column level)
3. REST API Unity (deprecated by Databricks)
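The "Auto" retrieval method is described in the connection settings as testing both methods and picking the one that works. A possible shape for that logic is sketched below; the function name, the probe callables and the order in which the methods are tried are all assumptions of this example.

```python
def pick_lineage_method(probes):
    """Return the name of the first method whose probe succeeds.

    probes maps a method name to a zero-argument callable that raises
    when the method is unavailable; methods are tried in insertion order.
    """
    for name, probe in probes.items():
        try:
            probe()
        except Exception as error:
            print(f"{name} unavailable: {error}")
            continue
        return name
    return None


def system_tables_probe():
    # Stand-in for a test query on the lineage system tables.
    raise PermissionError("no access to the system catalog")


def rest_api_probe():
    # Stand-in for a test call to the lineage REST API.
    return None


print(pick_lineage_method({
    "System Tables": system_tables_probe,
    "REST API": rest_api_probe,
}))
```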

Compute resource

To grant your Databricks or Azure Service Principal the right to use the compute resource required for importing to DataGalaxy, follow these steps:

Select “SQL Warehouses” on the left
Select the resource you want to use for your import or create it
Click on “Permissions” on the right
In the window that opens, choose your Service Principal and give it the “Can Use” right
Finally, in the “Connection details” tab, you will find everything you need to configure the connection on the DataGalaxy side, namely the “Server hostname” and the “HTTP path”

For more details about cluster permissions, please refer to the following table. The permissions management documentation in Databricks is available here.

Summary

In URN mode the connector provides several metadata retrieval options

Databricks object	DataGalaxy object	Level 1	Option 1	Option 2
Catalog	Model	✅	✅	✅
Schema (Database)	Model	✅	✅	✅
Table	Table	✅	✅	✅
View	View	✅	✅	✅
Column	Column	✅	✅	✅
Folder	DataFlow		✅	✅
Notebook	DataProcessing		✅	✅
Workflows*	DataProcessing			✅

* Only available in URN mode

On DataGalaxy's side

The following information is required to set up a connection:

Parameter		Mandatory	Description
Server		Yes	Databricks server hostname, example: adb-XXXXXXXXXXXXXXXX.X.azuredatabricks.net
Port		Yes	Connection port to Databricks server, example: 443
HTTP Path		Yes	URL of Databricks calculation resources, example: sql/protocolv1/o/XXXX/0125-105531-okp9kyqn
Auth mode		Yes	Databricks token, Azure AD Service Principal or Databricks Service Principal
Azure AD Service Principal	Tenant ID	Yes	Azure Tenant ID
Azure AD Service Principal	Client ID	Yes	Azure Client ID
Azure AD Service Principal	Client secret	Yes	Azure Client Secret
Databricks Service Principal	Client ID	Yes	Databricks Service Principal Client ID
Databricks Service Principal	Client secret	Yes	Databricks Service Principal Secret
Databricks token	Password	Yes	Databricks Access Token
Filter - Catalog		No	Limit the scope to one or more catalogs
Feature - Get Notebooks		No	Notebooks will be represented in DataGalaxy as Data Processing objects
Filter - Path filter (prefix)		No	For notebooks: limits the scope to a given folder based on the prefix entered, example value: /Shared. Note: the /Users folder is implicitly filtered out.
Feature - Get Workflows*		No	Workflows will be represented in DataGalaxy as Data Processing objects
Feature - Table/View metadata fetching method		Yes	DESC TABLE: "usual" metadata retrieval method, requires the "SELECT" grant on tables for the selected catalogs. INFORMATION_SCHEMA: this method retrieves less metadata and only works in URN mode with Unity Catalog activated, but it only requires the "BROWSE" grant instead of "SELECT".
Feature - Get Lineage from Unity		No	Retrieves lineage provided by Unity.
Subfeature - Lineage granularity		No	Table and column levels are available (only Table level in non-URN mode)
Subfeature - Lineage retrieval method		No	Auto (tests both methods and picks the one that works), System Tables or REST API (deprecated by Databricks). Only the REST API is available in non-URN mode.
Subfeature - Lineage history depth (in days)		No	Number of days for the lookback period for lineage events
Subfeature - Custom system catalog		No	Use a custom system catalog for the lineage retrieval
Feature - JDBC driver client		No	Use Thrift client (= default mode if Auto): configures UseThriftClient=1 in the JDBC driver. Databricks will deprecate this option in the future. Use Statement Execution APIs: configures UseThriftClient=0 in the JDBC driver. Use this if you encounter issues with the Thrift client. This client is still young on Databricks' side and side effects are possible, so it is not the default option so far. More information in Databricks' documentation.

⚠ Retrieving the lineage significantly increases the run time of the connector, and therefore the cost of the corresponding compute cluster.

* Only available in URN mode
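To show how the Server, Port, HTTP Path and token parameters fit together, the sketch below assembles a JDBC URL in the format documented for the Databricks JDBC driver with token authentication. The function name is hypothetical and the example is not the connector's code; the UseThriftClient property is only added when a client is explicitly chosen.

```python
def jdbc_url(server, http_path, token, port=443, use_thrift_client=None):
    """Assemble a Databricks JDBC URL for token authentication.

    use_thrift_client mirrors the "JDBC driver client" setting:
    True gives UseThriftClient=1, False gives UseThriftClient=0,
    None leaves the driver default untouched.
    """
    parts = [
        f"jdbc:databricks://{server}:{port}",
        f"httpPath={http_path}",
        "AuthMech=3",   # user name / password mechanism
        "UID=token",    # literal user name used with access tokens
        f"PWD={token}",
    ]
    if use_thrift_client is not None:
        parts.append(f"UseThriftClient={1 if use_thrift_client else 0}")
    return ";".join(parts)


# Placeholder values only; never print a real token.
print(jdbc_url(
    "adb-XXXXXXXXXXXXXXXX.X.azuredatabricks.net",
    "sql/protocolv1/o/XXXX/0125-105531-okp9kyqn",
    "<access-token>",
    use_thrift_client=False,
))
```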

From Standard to URN mode

Differences

In Standard mode, the name of your root object is the one you give it when you create the connection (or the root object of the Dictionary module you target). In URN mode, the name of the root object is the name of the Databricks server used when setting up the connection.
In Standard mode, in the "Data Processing" module, your "Data Flow" and "Data Processing" items are found directly under your root object. In URN mode, these same objects are grouped one level lower in the hierarchy, under an object named "Workspace". An extra object named "Workflows" also appears at the same level as the "Workspace" object.

Migration guide

The aim of this guide is to show you how to switch your root object and all the Databricks objects it contains from Standard mode to URN mode. Once you've completed these steps, you'll be able to perform all your future imports in URN mode and take advantage of the new features associated with this mode.

Move the objects contained in your root object in the "Data Processing" module one level down:
- Open the menu associated with your root object ("Databricks" here) and pick the option "+ Create a child". Create it with the type "Data Flow" and name it "Workspace"
- Once this is done, move all your other objects ("Shared" and "Test" here) by opening their associated menus and choosing the "Move" option. Target the previously created "Workspace" object
- If you skip this step, the final URN import on your root object will create duplicates of all the objects retrieved from Databricks
If it is not already the case, associate the "URN" attribute with the "Database" sources of the "Dictionary" module. Do the same for your "Data Flow" objects in the "Data Processing" module
Assign the right URN to your root objects in the "Dictionary" and "Data Processing" modules
- To avoid errors, we advise you to follow these steps:
  - Perform a new import in URN mode, which will create a new root object in each module with its URN attribute filled
  - Copy the URN attributes
  - Delete the root objects you just imported in URN mode, along with all their children (since a URN must be unique, the platform returns an error if you try to assign the URN of an existing root object to another object)
  - Paste the URNs into the URN attribute fields of your root objects in both modules that are still in Standard mode
Do a final import in URN mode
- This time, all the URN attributes of the child objects under your root objects should be filled

Congratulations, you switched from the Standard mode to the URN mode and can now enjoy all the new features it offers!

Execution of the connector

Step 1: Installation

Download DataGalaxy connector from the portal (see here)
Extract the connector archive in the directory of your choice
Download the Databricks plug-in from the portal and copy it into the /lib directory of the connector

Step 2: Run connector

After starting the connector, access the connectors of the Dictionary or Data Processing categories

If it was correctly installed, the Databricks plug-in will appear

Fill the corresponding fields using the connection information from above

DATABRICKS ACCESS TOKEN:

Azure AD Service Principal:

Databricks Service Principal:

Click on "Test" to test the connection
Once the connection test has passed, follow the steps to finalize your import

This connector is also available in online mode, more information on this page: [How to] Online Connector operating mode

Releases

Date	Plugin Version	DataGalaxy release	Desktop connector version (minimum)	Description
28/05/2026	6.7.0	v3.345.0	5.15.9	Path of the system catalog made configurable
05/05/2026	6.6.1	v3.337.0	5.15.9	Lineage history depth configurable in days
24/04/2026	6.5.3	v3.332.1	5.15.9	Updated internal dependencies
14/04/2026	6.5.2	v3.329.3	5.15.8	Adding option to use the new Statement Execution APIs client instead of Thrift client in the Databricks' JDBC driver
19/03/2026	6.4.6	v3.322.0	5.15.7	Bugfix regarding jobRunId parameter
03/11/2025	6.4.2	v3.273.1	5.13.0	Improve connector's resilience when retrieving lineage information
17/10/2025	6.4.1	v3.268.2	5.13.0	Allow the user to choose between two metadata retrieval methods
03/10/2025	6.3.1	v3.262.0	5.13.0	Fix a bug preventing from authenticating to EntraID in CLI mode with the --password argument
23/09/2025	6.3.0	v3.254.0	5.13.0	Addition of the option to filter out or not the "/Users" folder
25/08/2025	6.2.0	v3.245.0	5.13.0	Addition of new retrieval option for lineage
04/08/2025	6.1.3	v3.228.1	5.13.0	Fix issue with Notebook retrieval
31/07/2025	6.1.2	v3.220.1	5.7.8	Fixed a bug related to the retrieval of the lineage in standard mode
07/06/2025	6.0.15	v3.178.1	5.6.2	- Fixed http proxy configuration with JDBC driver - Fixed unnecessary creation of Processing root object in URN mode even if no children have to be created below
27/05/2025	6.0.13	v3.172.5	5.6.1	- Fixed a bug related to retrieving lineage from another Unity workspace - Fixed a bug related to views
20/05/2025	6.0.11	v3.171.0	5.5.13	- New lineage behavior: all lineage can be imported, independently of choosing to create Notebooks and Workflows in DataGalaxy. - Activated the possibility of using URN imports for everybody
04/04/2025	5.1.0	v3.154.6	5.5.5	Optimized how data is handled in URN mode
21/01/2025	4.0.12	v3.125.0	5.2.9	Improved resiliency of the connector
09/01/2025	4.0.11	v3.116.1	5.2.8	Fixed a bug regarding CSV imports and improved logs
16/10/2024	3.0.3	v3.85.1	5.2.6	System catalogs are now filtered out and error logging is improved
20/09/2024	3.0.2	v3.77.1	5.2.6	Fixed a bug regarding external tables that are views
23/08/2024	3.0.1	v3.69.0	5.2.3	Updated the logger to show more information when using verbose mode
26/07/2024	3.0.0	v3.62.0	5.0.3	Migrated from Java 11 to Java 17
04/07/2024	2.4.2	v3.56.0		Fixed a bug where some connection fields were not loaded from a saved connection
04/07/2024	2.4.1	v3.56.0		Updated a dependency
15/05/2024	2.4.0	v3.46.0		Addition of Databricks Service Principal authentication
16/04/2024	2.3.0	v3.40.0		Addition of Entra ID (Azure AD) Service Principal authentication
