Data Quality: The 3 Keys To Developing A Strategy You Can Really Trust
Part 3: Keys To Developing A Data Quality Strategy You Can Really Trust: Data Profiling
Data quality must be at the forefront of any data warehouse and analytics project to guarantee validity and value within the information you receive.
The key is to focus on 3 main areas to build a solid data quality best practice standard.
The 3 main areas of a successful data quality strategy include:
- Data Terminology
- Data Governance
- Data Profiling
Now that we’ve discussed the importance of a sound data governance foundation and data terminology in parts 1 and 2 of this blog series, today we will dive into the MOST important aspect of data quality…data profiling.
What exactly is Data Profiling?
Data profiling is the process of examining data from the source, collecting statistics and creating useful summaries about that data. It is performed through systematic, detailed analysis of the data source. Like it or not, you can’t rely solely on copybooks, data models, or source system experts. Regardless of how hard we may try, errors inevitably find their way into our systems. The end result: poor data quality.
Basically, you have to know your data before you can fix it.
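To make "collecting statistics and creating useful summaries" concrete, here is a minimal sketch in plain Python of what a single-column profile might look like. The function name profile_column, the sample data, and the rule that None and empty strings count as missing are all assumptions for illustration; they are not taken from any particular profiling product.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, min, max, most common."""
    # Assumption for this sketch: None and empty strings both count as missing.
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(counts),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }

# Hypothetical sample: an "age" column pulled from a source system.
ages = [34, 29, None, 41, 29, "", 57]
profile = profile_column(ages)
print(profile)
```

Even a summary this small surfaces questions worth asking the source system owners, such as why two of the seven rows have no age at all.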
Why is Data Profiling so Important?
Data profiling is important because reliable data processing and trusted analysis cannot happen without it. When data profiling is used at the start of a project, it can significantly shorten the development cycle by identifying source system data anomalies and accelerating the team's understanding of source system data. Profiling at the start should be standard practice: it reveals whether the data is suitable for analysis and supports a “go / no go” decision on the project.
Data warehouse and business intelligence projects depend on data profiling to uncover data quality issues and determine what needs to be corrected during the extract-transform-load (ETL) process. Data profiling can also highlight problem sources and, in many cases, identify the reason behind the issues (e.g. user inputs, errors in interfaces, data corruption). When profiling the data, analysts can review the structure of the data by checking formats and basic statistics (min, max and sum), delve into content by looking at individual data records to discover errors, analyze the completeness of the data, and examine the relationships between tables, spreadsheets and other sources.
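The structure, completeness and relationship checks described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (format_violations, completeness, orphaned_keys), the ZIP code pattern, and the customer and order data are all hypothetical.

```python
import re

def format_violations(values, pattern):
    """Structure check: return populated values that do not fully match the expected format."""
    regex = re.compile(pattern)
    return [v for v in values if v is not None and not regex.fullmatch(v)]

def completeness(values):
    """Completeness check: fraction of values that are populated (not None or empty)."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v not in (None, ""))
    return filled / len(values)

def orphaned_keys(child_keys, parent_keys):
    """Relationship check: keys used by the child table that are missing from the parent."""
    return sorted(set(child_keys) - set(parent_keys))

# Hypothetical ZIP code column; the pattern assumes US 5-digit or ZIP+4 format.
zip_codes = ["30301", "3030", "30303-1234", None, "ABCDE"]
bad_zips = format_violations(zip_codes, r"\d{5}(-\d{4})?")
zip_completeness = completeness(zip_codes)

# Hypothetical customer table keys and the customer IDs referenced by orders.
customer_ids = [1, 2, 3]
order_customer_ids = [1, 2, 2, 4, 7]
orphans = orphaned_keys(order_customer_ids, customer_ids)

print(bad_zips, zip_completeness, orphans)
```

Findings like malformed ZIP codes or orders pointing at nonexistent customers are exactly what needs to be known before the ETL design is finalized, since each one implies a cleansing or rejection rule.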
What is the Best Practice for Data Profiling?
Like most things in life, there are numerous ways to accomplish this. The old way (also known as the long way) required a skilled resource to manually query data. The new way (and our recommendation) is to use data profiling software. For our projects, we use the Oracle Enterprise Data Quality Profile and Audit software product. With advanced data profiling software, you increase the speed at which data profiling can be completed. This not only saves time and money, it also allows your project to move forward with sound, trustworthy data. Data profiling software also creates a common repository that can be used by the entire team. Finally, it provides a more thorough analysis of the data sources than manual processes, which typically query only a subset of the data.
Data Profiling, Data Terminology and Data Governance: The Key to Real Data Quality
If someone asked if your data is sound, would you say yes? What if I asked you to bet your job on it? I bet you paused before answering, didn’t you? A robust data quality process involving data profiling, data terminology standardization and data governance is the only way to overcome skepticism and distrust in your data. Using these 3 data quality techniques will have you answering “yes” in no time at all.