Define and represent data quality

"The quality of data is based on how well it meets the requirements of data consumers. Metadata defines what the data represents. Data quality is about meeting expectations. Metadata is a primary means of clarifying expectations." (DMBOK)

A metadata catalog is therefore an essential tool for highlighting data quality. This usually involves three steps:

  • Defining data quality
  • Representing these indicators 
  • Documenting the data quality processes

Defining data quality

According to the DMBOK, several types of dimensions can be used to define data quality:

Objective dimensions

These are the dimensions that can be calculated:

  1. Completeness: The proportion of data stored against the potential for 100%.
  2. Uniqueness: No entity instance (thing) will be recorded more than once based upon how that thing is identified.
  3. Timeliness: The degree to which data represent reality from the required point in time.
  4. Validity: Data is valid if it conforms to the syntax (format, type, range) of its definition.
  5. Accuracy: The degree to which data correctly describes the ‘real world’ object or event being described.
  6. Consistency: The absence of difference, when comparing two or more representations of a thing against a definition.
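
To illustrate how some of these dimensions can be calculated, here is a minimal Python sketch covering completeness, uniqueness, and validity. The sample records, the field names (customer_id, email), and the email pattern are assumptions made for the example; they do not come from DataGalaxy or the DMBOK, and real rules would follow each field's own definition.

```python
import re

# Hypothetical sample records; field names are illustrative only.
records = [
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "bob@example"},
    {"customer_id": 3, "email": "cy@example.org"},
]

# Deliberately simple syntax rule, assumed for this example.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows in which the field is populated (not None or empty)."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, key):
    """Share of distinct key values among all rows (1.0 means no duplicates)."""
    if not rows:
        return 0.0
    return len({r[key] for r in rows}) / len(rows)


def validity(rows, field, pattern):
    """Share of populated values that match the expected syntax."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 0.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


email_completeness = completeness(records, "email")
id_uniqueness = uniqueness(records, "customer_id")
email_validity = validity(records, "email", EMAIL_PATTERN)
```

Each function returns a ratio between 0 and 1, which makes the results easy to store as indicators and to compare over time.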

Subjective dimensions

These are dimensions that depend on the user's perception at the time of consumption, such as usability, stability, flexibility, value, and trust.

Contextual dimensions

We can add a third type of dimension: those representing the context. A piece of data can be essential in one case and much less so in another. For example, a wrong date of birth matters from a marketing point of view (a campaign) but not for billing.

Representing these indicators

In DataGalaxy, you can separate the functional and technical aspects of data quality:

The business point of view

This represents the requirements that must be met for business users to be able to use the data with confidence. To document this aspect, we recommend using the glossary, so you can:

  • Describe the data you have
  • Link it to business requirements if necessary:
    • Either by using dedicated attributes
    • Or by attaching it to objects in the glossary that support these requirements

For example, we can specify that there is a business term "date of birth" attached to the "customer" concept and that, regardless of the underlying system, the date of birth cannot be prior to 01/01/1900.  
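
As an illustration only, such a business rule could be checked in the same way whatever system the data comes from. In the Python sketch below, the record layout, the system names, and the function name are hypothetical; only the 01/01/1900 threshold comes from the example above.

```python
from datetime import date

# Business rule attached to the "date of birth" term:
# the date of birth cannot be prior to 01/01/1900.
MIN_BIRTH_DATE = date(1900, 1, 1)


def violates_birth_date_rule(birth_date):
    """Return True when the date of birth is earlier than 01/01/1900."""
    return birth_date < MIN_BIRTH_DATE


# Hypothetical customer records coming from two different systems.
customers = [
    {"system": "crm", "customer_id": 1, "date_of_birth": date(1985, 3, 15)},
    {"system": "billing", "customer_id": 2, "date_of_birth": date(1899, 12, 31)},
]

# The same rule applies to every record, regardless of the source system.
violations = [c for c in customers if violates_birth_date_rule(c["date_of_birth"])]
```

Because the rule is defined once at the business-term level, every system that stores a date of birth is held to the same requirement.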

In the example below, the notion of nomenclature is used to represent the functional requirements.

The technical point of view 

The dictionary can be used to display all the measures that can be calculated for a field or a column. We can thus specify the quality of each field in a database according to the dimensions described above and track how it changes over time using time series indicators.

Linking the points of view

Between these two points of view, lineage can highlight the different sources and the level of reputation using, for example, golden links.

Documenting the quality processes

Once these steps have been completed, you can focus on documenting the data quality processes in the processing module. A quality control process typically:

  • Collects data from a source A
  • Transforms it, for example to enforce a format
  • Loads it into a "quality" container: a different table, a different system...
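
The three steps above can be sketched as follows. This is a minimal, hypothetical Python example rather than a DataGalaxy feature: the source rows, the dd/mm/yyyy input format, and the in-memory "quality" container are all assumptions chosen to keep the example self-contained.

```python
from datetime import datetime


def collect(source):
    """Step 1: collect the rows from source A (here, any iterable of dicts)."""
    return list(source)


def transform(rows):
    """Step 2: enforce a format; dates become ISO 8601, bad rows are rejected."""
    valid, rejected = [], []
    for row in rows:
        try:
            parsed = datetime.strptime(row["birth_date"], "%d/%m/%Y")
        except (ValueError, TypeError, KeyError):
            # Unparseable, missing, or non-string dates are set aside.
            rejected.append(row)
            continue
        valid.append({**row, "birth_date": parsed.date().isoformat()})
    return valid, rejected


def load(rows, container):
    """Step 3: load the conforming rows into the "quality" container."""
    container.extend(rows)
    return container


# Hypothetical source A and target container.
source_a = [
    {"id": 1, "birth_date": "15/03/1985"},
    {"id": 2, "birth_date": "31/02/1990"},  # not a real calendar date
    {"id": 3, "birth_date": None},
]
quality_table = []

valid_rows, rejected_rows = transform(collect(source_a))
load(valid_rows, quality_table)
```

Keeping the rejected rows separate, instead of dropping them silently, makes it possible to report on them and to trace them back to their source.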

These processes can be:

  • Prefixed, so they can be easily identified from the line-up
  • Linked to the application that carries out these quality processes

Next steps 

As the DMBOK states, a data catalog is essential to highlighting data quality projects, from defining management requirements to restoring trust in the data. It is possible to go further by using the notion of process. By defining the data consumed by each business process, it becomes possible to identify:

  • Upstream: the processes generating incorrect data, and therefore possible remedies
  • Downstream: any risk of using poor-quality data, and therefore its potential impact
