Data Profiling

DSLAB GLOBAL
2 min read · Mar 4, 2021


1. What is Data Profiling?

Data profiling is a technique for detecting and investigating data quality issues such as duplication, inconsistency, inaccuracy, and incompleteness. It works by analyzing one or more data sources and collecting metadata that describes the state of the data, so that data managers can investigate the causes of data errors. The resulting statistics, such as the degree of redundancy and the percentage breakdown of attribute values, can be viewed in tabular or graphical form.

Data profiling typically involves:

  • Collecting descriptive statistics such as min, max, count, and sum
  • Collecting data types, lengths, and recurring patterns
  • Tagging data with keywords, descriptions, or categories
  • Assessing data quality and the risk of performing joins on the data
  • Discovering metadata and assessing its accuracy
  • Identifying distributions, key candidates, foreign key candidates, functional dependencies, and implicit value dependencies, and performing cross-table analysis
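
To make the first few activities concrete, here is a minimal sketch of a column profiler written with the Python standard library only. The function name `profile_column` and the sample `ages` column are hypothetical and chosen purely for illustration; a real profiling tool would collect far more metadata.

```python
from collections import Counter

def profile_column(values):
    """Return basic profiling metadata for one column given as a list."""
    # Treat None and empty strings as missing values (an assumption).
    non_null = [v for v in values if v is not None and v != ""]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(counts),
        # Number of values that repeat an earlier value in the column.
        "duplicates": sum(c - 1 for c in counts.values()),
        "types": sorted({type(v).__name__ for v in non_null}),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "sum": sum(numeric) if numeric else None,
        "max_length": max((len(str(v)) for v in non_null), default=0),
    }

# Hypothetical sample column with one missing and one repeated value.
ages = [34, 51, None, 34, 28]
report = profile_column(ages)
print(report)
```

Running the same function over every column of a table gives a simple tabular summary of completeness, redundancy, and value ranges.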

2. Types of Data Profiling

There are three main types of data profiling.

  • Structure discovery

This process checks that the data is consistent and correctly formatted, and runs mathematical checks on it (e.g. sum, min, or max). Structure discovery helps you understand how well your data is structured. For example, you can find out what percentage of phone numbers in your database have the wrong number of digits.
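
The phone number example can be sketched as a simple pattern check. The expected format `NNN-NNN-NNNN` is an assumption made for illustration, as are the names `PHONE_PATTERN` and `invalid_share`; the right pattern depends on the conventions of your own data.

```python
import re

# Assumed target format: three digits, three digits, four digits.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")

def invalid_share(values, pattern=PHONE_PATTERN):
    """Percentage of non-empty values that do not fully match the pattern."""
    present = [v for v in values if v]
    if not present:
        return 0.0
    invalid = [v for v in present if not pattern.fullmatch(v)]
    return 100.0 * len(invalid) / len(present)

# Hypothetical sample: two of the four values break the assumed format.
phones = ["555-123-4567", "555-1234", "555-987-6543", "5559876543"]
pct = invalid_share(phones)
print(f"{pct:.1f}% of phone numbers are malformed")
```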

  • Content discovery

This technique examines individual data records to detect errors. Content discovery identifies which specific rows in a table have problems, as well as systemic problems in the data (e.g. phone numbers without area codes).
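
A row-level check of this kind might look like the sketch below. The field names, the `find_row_issues` helper, and the rule that a complete phone number has at least 10 digits are all assumptions made for the example.

```python
def find_row_issues(rows, required=("name", "phone")):
    """Return (row index, field, issue) tuples for problematic records."""
    issues = []
    for i, row in enumerate(rows):
        # Flag required fields that are absent, None, or blank.
        for field in required:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                issues.append((i, field, "missing"))
        # Assumption: fewer than 10 digits means the area code is missing.
        phone = row.get("phone") or ""
        digits = [c for c in phone if c.isdigit()]
        if phone and len(digits) < 10:
            issues.append((i, "phone", "no area code"))
    return issues

# Hypothetical sample records.
rows = [
    {"name": "Kim", "phone": "555-123-4567"},
    {"name": "", "phone": "123-4567"},
    {"name": "Lee", "phone": None},
]
issues = find_row_issues(rows)
for issue in issues:
    print(issue)
```

Because each issue carries its row index, the output points directly at the records that need to be fixed, while counting issues by type reveals the systemic problems.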

  • Relationship discovery

This process discovers how parts of your data are interrelated, for example key relationships between database tables, or references between cells or tables in a spreadsheet. Understanding these relationships is important for reusing data: related data sources need to be either integrated or brought together in a way that preserves the important relationships.
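
As a rough illustration, the sketch below tests whether a column could serve as a key and whether a reference between two tables holds for every row. The `customers` and `orders` tables and both helper functions are hypothetical.

```python
def is_key_candidate(rows, column):
    """A column is a key candidate if every value is present and unique."""
    values = [row.get(column) for row in rows]
    return None not in values and len(set(values)) == len(values)

def orphan_values(child_rows, child_col, parent_rows, parent_col):
    """Values in the child column that have no match in the parent column."""
    parent_keys = {row[parent_col] for row in parent_rows}
    return sorted({row[child_col] for row in child_rows} - parent_keys)

# Hypothetical sample tables.
customers = [
    {"id": 1, "city": "Seoul"},
    {"id": 2, "city": "Busan"},
    {"id": 3, "city": "Seoul"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 4},
]

id_is_key = is_key_candidate(customers, "id")
city_is_key = is_key_candidate(customers, "city")
orphans = orphan_values(orders, "customer_id", customers, "id")
print(id_is_key, city_is_key, orphans)
```

An empty orphan list suggests that the child column is a plausible foreign key candidate; any orphans are references that would be lost or mismatched when the tables are joined.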

3. Data profiling with CLICK AI

CLICK AI examines the individual elements of a database to check the quality of the data. This lets you find areas with null, invalid, or ambiguous values and then remove or fix them.
