Many initiatives around data use, optimization or valorization start with a mapping exercise. After all, how can we manage an asset whose volume or complexity we do not know how to evaluate?
But when it comes to mapping data, we quickly run into that very complexity and volume. We hit a wall: how can we display, in a comprehensible way, something that is constantly evolving and that can be seen from so many points of view? How can we help people understand this data?
Dictionary, Glossary, Catalog ... what is all this?
It can be tempting to lump these different terms together. Here is how we see them:
- The dictionary should be understood as a collection of words and expressions, each defined through essential and non-essential attributes. For the record, the German term for dictionary is "Wörterbuch", which simply means "book (Buch) of words (Wörter)". Wikipedia goes further by introducing the notion of reference, i.e. a corpus that aims to teach and inform.
- The glossary is understood as a collection of "glosses", i.e. rare or foreign words, each associated with a definition. The objective is to describe or comment on terms in a semantic way. Unlike lexicons, which are intended to describe all the words of a language, the glossary aims to describe the words of a specific domain.
This means that we have two objects which, although they pursue a common objective (explaining words), are not built in the same way:
- The dictionary is intended to be exhaustive, referential and precise
- The glossary aims at popularization, making terms accessible to a wider audience
What does it mean for data mapping?
If we adapt these terms to data management, we can construct the following definitions.
Data dictionary
"Inventory of the intangible asset of data, which allows users to discover and explore all available data sets, improve their understanding of that data, facilitate collaboration with other users to enrich the quality of those assets, and create more value from that data." (Decideo)
This definition is close to the data management tradition, and therefore leans toward the IT side. The idea is:
- To carry out a data inventory that is as complete and detailed as possible: for each data element, this means having a brief description, technical attributes (size, format, type...), a storage location (which data set or database does it belong to?)...
- To be able to manage this data: what is its quality? Does it need to be protected? Is it sensitive? Is it subject to particular constraints?
- To allow people to explore this dictionary and collaborate around it: having a repository that makes data known is good, but letting everybody take ownership of it is better!
The notion of value is set aside here, as we would rather place it in the glossary.
In practice, a data dictionary will often be a list of the data present in the organization, each element having:
- Its classification or hierarchy: let's take a dictionary entry as an analogy, as Wikipedia does: "the dog is an animal of the class of mammals, order of carnivores and family of canids". In the same way, a classification for data management would consist in saying that the data belongs to a more or less complex chain of containers: table, container, database...
- A definition: what does this data mean, and how should it be described?
- Attributes to understand its management context.
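The three components listed above (classification, definition, attributes) can be sketched as a small record. This is only an illustration under stated assumptions: the class name `DictionaryEntry`, its fields and the sample values (`hr_db`, `payroll`, `employee_contract`) are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """One technical data element, as a data dictionary might record it (hypothetical model)."""
    name: str
    # Classification chain, from the broadest container to the narrowest,
    # e.g. (database, container, table).
    location: tuple[str, ...]
    definition: str
    # Free-form management attributes: type, format, sensitivity...
    attributes: dict[str, str] = field(default_factory=dict)

    def path(self) -> str:
        """Return the full classification path of the element."""
        return ".".join((*self.location, self.name))


# Illustrative values only.
entry = DictionaryEntry(
    name="contract_end_date",
    location=("hr_db", "payroll", "employee_contract"),
    definition="Date on which the employment contract ends.",
    attributes={"type": "date", "format": "YYYY-MM-DD", "sensitive": "yes"},
)

print(entry.path())
```

The `location` tuple plays the role of the "class, order, family" chain in the dog analogy: it says where the data sits, not what it means to the business.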
The glossary
Defined above as a tool for popularizing data, it is to be understood as a synthesis of the various terms of the organization, defined in a semantic manner, i.e. understandable from a functional point of view. It is not intended to translate all the data contained in the dictionary. Its main purpose is to render and describe the main terms used in the organization.
In practical terms, the glossary aims:
- To define, in business language, the terms used in the organization: what does this indicator mean, and how is it calculated?
- To offer a model of the data that places it in a usage context understandable by all. Data often depends on a context, and several points of view can be used to describe it: by data silos or verticals? By organizations?
- To provide management attributes: operating domains, management constraints, intrinsic quality of the information.
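By contrast with a dictionary entry, a glossary entry carries business meaning rather than storage details. The sketch below is one possible shape, assuming hypothetical names (`GlossaryTerm`, `domain`, `calculation`, `constraints`) and an invented indicator used purely as an example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GlossaryTerm:
    """One business term, as a glossary might record it (hypothetical model)."""
    term: str
    business_definition: str
    domain: str                        # operating domain that owns the term
    calculation: Optional[str] = None  # only relevant for indicators
    constraints: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render the term as a one-line, business-readable description."""
        text = f"{self.term} ({self.domain}): {self.business_definition}"
        if self.calculation:
            text += f" Calculated as: {self.calculation}."
        return text


# Illustrative indicator, not taken from a real glossary.
headcount = GlossaryTerm(
    term="Headcount",
    business_definition="Number of employees at a given date.",
    domain="Human Resources",
    calculation="count of employees whose contract end date is in the future",
    constraints=["personal data"],
)

print(headcount.summary())
```

Nothing here refers to a table or a column: that is the point of the glossary, which stays readable for someone who never opens the database.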
Dictionary AND Glossary
When talking about defining data, we have two complementary points of view:
- The dictionary, which aims to define the company's data exhaustively and from a technical point of view
- The glossary, which offers a business-oriented popularization of the data
Each one covers part of the need: what is the point of knowing the exact definition of an employee if I do not know the different elements managed in my information system? Take this "employee" object:
- In my glossary, I will understand that the definition of an employee is "a natural person who has an employment contract with the organization with an end date in the future". Several attributes can be used to describe this object: management responsibility, personal nature of the data... It can also be linked to several other types of information: address, banking information, etc.
- In my dictionary, I will understand that my employee is in fact a concept potentially scattered across several tables or even several systems.
Linking the elements of the dictionary with those of the glossary provides greater added value to this data mapping process.
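A minimal sketch of such a link, assuming dictionary entries are identified by a hypothetical `system.table.column` path and that the sample paths below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical links: glossary term -> dictionary entries (system.table.column).
links = {
    "Employee": [
        "hr_db.employee.id",
        "hr_db.employee_contract.end_date",
        "crm.contact.employee_ref",
    ],
    "Address": ["hr_db.employee_address.street"],
}


def terms_by_system(links: dict[str, list[str]]) -> dict[str, list[str]]:
    """Group glossary terms by the system in which their data is stored."""
    result: dict[str, set[str]] = defaultdict(set)
    for term, paths in links.items():
        for path in paths:
            system = path.split(".", 1)[0]  # first path segment is the system
            result[system].add(term)
    return {system: sorted(terms) for system, terms in result.items()}


print(terms_by_system(links))
# {'hr_db': ['Address', 'Employee'], 'crm': ['Employee']}
```

Read in one direction, the links show that the business concept "Employee" is spread over two systems. Read in the other, they tell a technical team which business terms are affected when a given system changes.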
Beyond the semantic distinctions discussed above, it is worth noting that data mapping is modernizing. It follows the underlying trend in data management: opening up to business communities so that they can take better ownership of the data. The current challenge is to know how to model this glossary so that it is understandable and valuable for everyone.
Note: this article is a simplified version of an article published by Gauthier Coponat on May 27, 2020.