Data Profiling: The Unsung Hero in Your Data Quality Strategy
Share this Session:
  Vernon Gomes   Vernon Gomes
Consultant
Protiviti
 


 

Monday, April 3, 2017
05:15 PM - 06:00 PM

Level:  Introductory


What is data profiling?:
  • Definition of data profiling
  • Process of doing it
  • How does it relate to Data Quality Management?
  • Why is data quality important?
  • Vern the business guy - He cares about getting his information and getting it quickly and accurately.
  • Vernon the IT guy - He cares about maintaining the security of the data (reliable) and keeping his infrastructure scalable (when there’s more data) and available.

Data Profiling and Data Quality Management:

  • Doesn’t directly add data quality
  • Data Profiling is just a tool that can help meet your data quality objectives!
  • Has to be combined with your business users (usually who owns the data)
  • Not using a technical fix to solve what might be a business problem
  • Example: ERP System with a Customer table and Address Table
  • Customer with Address line deleted
  • First IT thought: Let’s go back and add an address to the Address table or remove the Customer Table.
  • First Business User thought: Is this no longer a valid Address or Customer?
  • Generally only thought about during a data migration
  • Legacy to New (CRM, ERP, QMS, etc.) system
  • Creating a data mart/warehouse
  • But, the reality is…..
  • Data Quality Management is an ongoing process!
  • Something that is easily forgotten
  • How often should I be stewarding my data?

Data Profiling Job:

  • Aggregate functions
  • Min, Max values (both on Strings and numeric values)
  • Min, Max value count
  • Average, Median values (both on Strings and numeric values)
  • Min, Max string length
  • Distinct Values
  • Null values
  • Null value count
  • Null percentage
  • Zero percentage
  • Blanks percentage
  • Pattern analysis
  • Do values maintain a certain format?
  • yyyy-mm-dd, mm-dd-yyyy, etc. for dates
  • Should using a consistent pattern be a business rule?
  • 5 Digit Zip Code vs 5 Digit + 4 (30096-4682)
  • Casing format for names (Vernon Gomes vs VERNON GOMES)
  • European vs. American format for Dates (09/24/2016 vs 24/09/2016)
  • Can test for integrity of data
  • Referential Integrity Test
  • Check if a foreign key exists in a parent table.
  • Remember the ERP Customer-Address example used before.
  • How does this information help me?
  • You can make key business decisions that can help you maintain your data quality initiatives.
  • “To prevent having so many empty Address 1 fields, we’ll make it a required field in our ERP and CRM.”
  • “I have users adding information like email addresses into an Address 2 field. This might be a training opportunity or an opportunity to add a new field.”

Various Data Profiling Tools:

  • SAP Business Objects Data Services
  • InformationSteward
  • Talend

Points to Take Home:

  • Data Profiling is a powerful tool that can assist you in improving your data quality.
  • But it requires the collaboration with the business to improve your data quality.
  • It’s not a “one-and-done.” This is an ongoing process!
  • Can be used for more than just your Data Warehousing or migration projects
  • Data Profiling involves using tools to quickly analyze table(s) and determine the type of data and quality of said data.
  • You can use this information to make business decisions to maintain your data quality.


Vernon Gomes is a former IT industry Systems Administrator turned Consultant. He is currently a Data and Analytics Consultant at Protiviti and specializes in ETL, data modeling, and systems architecture.


   
Close Window