Definitions
To prevent misunderstandings, let’s define the terms used on this page.
Asset
A data asset is any digital object or entity made up of data. It could be a dataset, document, visualization or data service and can include tools designed to make that data available and usable. For example, a dashboard that displays real-time data on a website is a data asset.
Data system
A platform or technology that stores, transforms or observes data assets. These are the systems that connectors connect to in order to exchange information: at minimum to map the assets they contain in DataGalaxy, and possibly for deeper integration scenarios. Examples: Databricks, Power BI, dbt, Bigeye…
Sometimes known as “data platform” or “source system”.
Object
A DataGalaxy object, with its type (depending on the module), its attributes and its location in the hierarchy (“path”). Some objects represent an asset from a data system (like a Power BI Report) and will likely be created by a connector. Others add information (like Glossary objects) or a layer of abstraction (like Data Products) and will likely be managed manually.
Also known as “Entity”.
Orphaned Object
An object in DataGalaxy which was created to represent an asset from a data system, but became "orphaned" because this asset doesn’t exist anymore.
Why did we build this feature?
When mapping their data systems and assets in DataGalaxy, users want as much automation as possible together with the highest level of accuracy and freshness. Because those systems are dynamic, new assets are constantly created, some are modified and some are removed.
Those changes should be properly reflected in DataGalaxy, so that users benefit from an up-to-date and trustworthy representation of all their assets.
What is Orphaned objects handling?
The connectors take a snapshot of the source system at each run, usually on a schedule. As the source system has kept changing since the last run, the connector can face three situations:
- A new asset is discovered, which is not known in DataGalaxy → the connector has to create a new object to represent this asset
- An asset is recognized as existing in DataGalaxy → the connector has to update the corresponding object
- An asset which is represented as an object in DataGalaxy doesn’t exist anymore in the source system → a decision has to be made about what to do with this object, which has become “orphaned”, meaning it no longer represents any asset from the system.
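The three situations above amount to a set comparison between the current snapshot and what DataGalaxy already knows. The following is only a minimal sketch of that idea, not the connectors' actual code: the function name `classify_run` and the use of plain strings as asset identifiers are assumptions made for illustration.

```python
def classify_run(source_assets: set[str], known_objects: set[str]) -> dict[str, set[str]]:
    """Compare the snapshot of the source system with the objects already known.

    Both arguments are hypothetical: they hold one identifier per asset/object.
    """
    return {
        # In the source, not yet in DataGalaxy: a new object must be created.
        "create": source_assets - known_objects,
        # In both: the existing object must be updated.
        "update": source_assets & known_objects,
        # In DataGalaxy only: the object has become orphaned.
        "orphaned": known_objects - source_assets,
    }


snapshot = {"sales.orders", "sales.customers"}
already_known = {"sales.customers", "sales.legacy"}
result = classify_run(snapshot, already_known)
```

Here `sales.orders` would be created, `sales.customers` updated, and `sales.legacy` flagged as orphaned.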
The purpose of the Orphaned objects handling feature is to identify those objects and let the user choose which action to apply to them. There are two main reasons why the connector asks the user for the action instead of simply deleting the objects:
- The connector may be wrong regarding some assets: in most systems, renaming an asset will be interpreted by the connector as a new asset with the new name and a deleted asset with the old name. As many enrichments may have been manually added to the corresponding object in DataGalaxy, the user could prefer renaming it instead of deleting it and losing the enrichments.
- Even if the asset has been deleted, the user might prefer that readers can still find it and be informed that it's obsolete, rather than having it disappear entirely from DataGalaxy.
So the user has the ability to configure three different behaviors regarding orphaned objects:
- Do nothing: the objects will remain in DataGalaxy without modification.
- Change status to Obsolete: only the objects’ statuses will change, making it easier to identify the objects (using a filtered view for instance) and apply manual actions later on if necessary.
- Delete them: if you’re sure you’ll not lose any enrichment and want to keep your DataGalaxy automatically synchronized with your data system without effort, it’s the right option.
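To make the difference between the three behaviors concrete, here is a small sketch. Everything in it is hypothetical (the `OrphanAction` enum, the `apply_action` function and the dictionary-based object store); it only mirrors the options described above and is not the platform's API.

```python
from enum import Enum


class OrphanAction(Enum):
    DO_NOTHING = "do_nothing"
    MARK_OBSOLETE = "mark_obsolete"
    DELETE = "delete"


def apply_action(objects: dict[str, dict], orphaned: set[str], action: OrphanAction) -> dict[str, dict]:
    """Return a new object store after applying the chosen action to orphaned objects."""
    if action is OrphanAction.DO_NOTHING:
        # Objects remain without modification.
        return dict(objects)
    if action is OrphanAction.MARK_OBSOLETE:
        # Only the status changes; enrichments such as descriptions are kept.
        return {
            path: ({**obj, "status": "Obsolete"} if path in orphaned else obj)
            for path, obj in objects.items()
        }
    if action is OrphanAction.DELETE:
        # Orphaned objects are removed together with their enrichments.
        return {path: obj for path, obj in objects.items() if path not in orphaned}
    raise ValueError(f"Unknown action: {action}")


store = {
    "sales.customers": {"status": "Validated", "description": "All customers"},
    "sales.legacy": {"status": "Validated", "description": "Old export"},
}
orphaned = {"sales.legacy"}

unchanged = apply_action(store, orphaned, OrphanAction.DO_NOTHING)
flagged = apply_action(store, orphaned, OrphanAction.MARK_OBSOLETE)
cleaned = apply_action(store, orphaned, OrphanAction.DELETE)
```

With the Obsolete option, the description of `sales.legacy` survives; with Delete, it is gone, which is why that option is best kept for cases where no manual enrichment is at stake.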
How exactly does it work?
Let’s take a closer look at how this feature works.
When a connector runs, a GUID is created for this run. This GUID is associated with all objects created or updated by the connector during this run, in their importGUID attribute. Then, two conditions have to be met to identify which objects in DataGalaxy are considered orphaned after this run:
- The objects have to be in the scope of the current connection
- The objects’ importGUID has to be different from the one of the current run, meaning they have not just been created or updated during this run.
After importing objects into the platform, the connector sends a command to the platform to apply the action chosen by the user to the orphaned objects, providing the current run scope and importGUID: this is the “reconcile” process.
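The two conditions can be sketched as a simple filter. This is an illustration under assumptions, not the platform's implementation: the `DgObject` class, its `in_scope` flag (standing in for whatever scope check the connection applies) and the `find_orphaned` function are all hypothetical names.

```python
import uuid
from dataclasses import dataclass


@dataclass
class DgObject:
    path: str
    import_guid: str  # GUID of the last run that created or updated the object
    in_scope: bool    # assumed to be precomputed by the connection's scope check


def find_orphaned(objects: list[DgObject], run_guid: str) -> list[DgObject]:
    """Keep objects that are in scope but were not touched by the current run."""
    return [o for o in objects if o.in_scope and o.import_guid != run_guid]


previous_run = str(uuid.uuid4())
current_run = str(uuid.uuid4())

objects = [
    DgObject("db/orders", current_run, in_scope=True),     # updated by this run
    DgObject("db/legacy", previous_run, in_scope=True),    # not seen by this run
    DgObject("other/table", previous_run, in_scope=False), # belongs to another scope
]

orphaned_paths = [o.path for o in find_orphaned(objects, current_run)]
```

Only `db/legacy` is returned: it is in scope and still carries the GUID of an earlier run.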
Comparing the importGUIDs is straightforward. To determine whether an object belongs to the scope of the current connection, two versions of this feature have been implemented. The first one is implemented in the standard mode of the connectors (which is still the most used) and has some limitations worth noting. A new, smarter approach has been implemented in the new URN mode of the connectors, but this mode is not available for all connectors so far.
Scope of the connections in standard mode and limitations
In standard mode, the connectors rely on root objects which are defined by the user for each module in which the connector has to create objects. The first implementation of the scope of the connections is pretty simple: all objects which are in the hierarchy tree below one of the root objects linked to the connection are considered part of the scope of this connection.
This brings some known limitations that are worth taking into consideration to prevent unexpected behavior and data loss (especially if the action selected for orphaned objects handling is to delete them).
⚠️ If more than one connection uses the same root objects, they will conflict with each other. All objects from connection A which are not imported by connection B will be considered orphaned by connection B (and vice versa), as they will have a different importGUID and the same scope (= the same root object as parent).
⚠️ The same issue appears when running a connection after changing its filter to a more restrictive one: all objects which were previously imported but are not part of the new filter will be considered orphaned. The same happens if the service account used by the connector to connect to the source has been restricted in terms of permissions and sees fewer assets than before.
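The first warning can be reproduced with a toy model of the standard-mode scope. The sketch below makes several assumptions for illustration only: paths are written as `/`-separated strings, and `DgObject`, `in_standard_scope` and `orphaned_for_run` are hypothetical names rather than platform functions.

```python
from dataclasses import dataclass


@dataclass
class DgObject:
    path: str
    import_guid: str


def in_standard_scope(path: str, root_paths: list[str]) -> bool:
    """An object is in scope if it sits below one of the connection's root objects."""
    return any(path.startswith(root + "/") for root in root_paths)


def orphaned_for_run(objects: list[DgObject], root_paths: list[str], run_guid: str) -> list[str]:
    return [
        o.path
        for o in objects
        if in_standard_scope(o.path, root_paths) and o.import_guid != run_guid
    ]


# Connections A and B were both configured with the same root object.
shared_root = ["Dictionary/Warehouse"]

objects = [
    DgObject("Dictionary/Warehouse/sales/orders", "run-A"),  # imported by connection A
    DgObject("Dictionary/Warehouse/hr/employees", "run-B"),  # imported by connection B
    DgObject("Glossary/Revenue", "manual"),                  # outside the root
]

# After connection B runs, it treats A's object as orphaned.
orphaned_seen_by_b = orphaned_for_run(objects, shared_root, "run-B")
```

The object imported by connection A is flagged even though its asset still exists, because it shares B's root and carries a different importGUID. With the Delete action selected, it would be removed.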
Smarter approach for the scope in URN mode
In URN mode, there are no root objects configured by the user anymore. Objects can be imported anywhere in the platform, and the same connector can create objects from multiple technologies. So we had to implement a smarter approach, and we took the opportunity to fix the issues of the first version.
In URN mode, each connection considers the following parameters to define its scope:
- The connector itself, knowing which objects and attributes it can bring to the platform and which ones it can’t: the Snowflake connector knows it can’t consider any Azure SQL object as part of its scope.
- The filters configured by the user in this connection: a Snowflake connection filtered on database MY_DB knows it can’t consider any object of another Snowflake database as part of its scope.
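As a rough illustration of these two parameters, the scope check can be thought of as a predicate on the technology and on the configured filter. The sketch below does not reproduce the real URN format or the connectors' logic: `ObjectRef`, `Connection` and `in_urn_scope` are hypothetical, and the filter is simplified to a set of database names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRef:
    technology: str
    database: str
    name: str


@dataclass(frozen=True)
class Connection:
    technology: str            # what the connector itself is able to import
    databases: frozenset[str]  # filter configured by the user (simplified)


def in_urn_scope(conn: Connection, obj: ObjectRef) -> bool:
    """In scope only if the connector owns the technology and the filter matches."""
    return obj.technology == conn.technology and obj.database in conn.databases


snowflake_my_db = Connection("snowflake", frozenset({"MY_DB"}))

in_filter = ObjectRef("snowflake", "MY_DB", "ORDERS")
other_database = ObjectRef("snowflake", "OTHER_DB", "ORDERS")
other_technology = ObjectRef("azure_sql", "MY_DB", "ORDERS")

scope_flags = [
    in_urn_scope(snowflake_my_db, ref)
    for ref in (in_filter, other_database, other_technology)
]
```

Only the first reference is in scope. Objects from another Snowflake database or from Azure SQL can never be flagged as orphaned by this connection, whatever their importGUID.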
Note: some questions remain regarding the scope, especially since we implemented the ability for connectors to create objects from other technologies. For instance, a Power BI connector can create a Snowflake Table because it discovers a Dataset in Power BI which gets its data from this Table in Snowflake. Suppose that, in a future run, the Power BI connector doesn’t find this Table anymore: it just means that no Dataset gets its data from this Table anymore, not that the Table no longer exists. The decision to consider this Table orphaned (and so eventually make it Obsolete or delete it in DataGalaxy) can depend on the use case. We’re still working on this topic to give control to the user. Today, these objects are not considered part of the scope by the Power BI connector, so no orphaned objects handling action will be applied to them.