Previous posts in this seven-part series have examined product changes, seasonal and other behavioral factors, and mix shift. This post explores data quality, which is often at the heart of sudden metric changes.
WHY IT MATTERS
In recent years, corporate scandals, regulatory changes and the collapse of major financial institutions have all brought much-needed attention to the quality of enterprise information. Facebook, for example, has had multiple problems with errors in its data. Poor data quality corrodes the confidence and trust of the entire community around a product, including customers, investors and product builders—and makes it difficult to get an accurate picture of product health.
The most common manifestation of a data quality problem is a sudden and drastic change that cannot otherwise be easily explained. By better understanding the underlying sources of data quality issues, we can develop action plans to address them. (We will discuss such action plans in Part 7 of this series.) To do this, we must first identify the problems and quantify their extent. Both tasks can be challenging.
DATA LOGGING ISSUES
Discrepancies in data are often due to errors in the way the data is logged (or recorded). To resolve such issues, identify all points at which a logging error could have occurred:
Missing data Perhaps you have recently launched a new product, or launched in a new country from which you are not yet logging data. When data is not logged, you will underestimate the aggregate value of key metrics. These errors are often difficult to detect because their effect on the aggregate is small at first and compounds over time.
Duplicate logging In some cases, the ETL (extract, transform, load) process may log a value more than once, thus artificially inflating aggregate values. Again, these errors are difficult to detect until enough time has passed to make the effects of duplication substantial.
Incorrect logging Often, the source of a data quality problem is incorrectly logged data—for example, logging for Variable 1 instead of Variable 2, logging incorrect values to Variable 1, etc.
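The three logging failure modes above can be screened for mechanically. Below is a minimal sketch of such an audit; the event schema, field names, country list and valid value range are all illustrative assumptions, not part of any real pipeline.

```python
from collections import Counter

# Hypothetical event log: each record carries an ID, a country and a value.
events = [
    {"event_id": "e1", "country": "US", "value": 10},
    {"event_id": "e2", "country": "US", "value": 12},
    {"event_id": "e2", "country": "US", "value": 12},  # duplicate logging
    {"event_id": "e3", "country": "BR", "value": -5},  # incorrectly logged value
]

def audit(events, expected_countries, valid_range):
    # Duplicate logging: the same event ID recorded more than once.
    counts = Counter(e["event_id"] for e in events)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    # Missing data: expected segments from which nothing was logged.
    seen = {e["country"] for e in events}
    missing = sorted(expected_countries - seen)
    # Incorrect logging: values outside the plausible range.
    lo, hi = valid_range
    bad_values = [e["event_id"] for e in events if not lo <= e["value"] <= hi]
    return {"duplicates": duplicates, "missing_countries": missing,
            "bad_values": bad_values}

report = audit(events, expected_countries={"US", "BR", "DE"}, valid_range=(0, 100))
print(report)
# flags 'e2' as duplicated, 'DE' as missing and 'e3' as out of range
```

Checks like these are cheap to run on every load and catch the slow-growing errors described above before they materially distort aggregates.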
DATA TRANSFORMATION ISSUES
While “transforming” raw data makes it more usable, it may also introduce errors: duplicated records, incorrect joins, incorrect relationships between objects (e.g., using the same name for two different objects), integration of incorrect sources and aging issues (e.g., inconsistencies between older and newer data sets). Proper checks and quality control at each touchpoint along the path can help ensure that data transformation problems are identified. Some issues will be easier (and less costly) to detect and correct than others, but all will benefit from the best practices listed below.
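To make the join problem concrete, here is a toy sketch of how a duplicated record silently inflates an aggregate after a join, and how deduplicating before aggregating corrects it. The tables and column names are hypothetical.

```python
# Hypothetical tables: a user dimension and a purchase log
# in which one purchase was logged twice.
users = {"u1": "US", "u2": "BR"}
purchases = [("u1", 30), ("u1", 30), ("u2", 15)]  # u1's purchase duplicated

def purchases_by_country(rows):
    totals = {}
    for uid, amount in rows:
        country = users[uid]
        totals[country] = totals.get(country, 0) + amount
    return totals

# Naive join-and-sum double-counts the US purchase.
naive = purchases_by_country(purchases)            # {"US": 60, "BR": 15}

# Deduplicating identical records before the join fixes the aggregate.
deduped = list(dict.fromkeys(purchases))
fixed = purchases_by_country(deduped)              # {"US": 30, "BR": 15}
print(naive, fixed)
```

In practice the deduplication key matters: dropping exact-duplicate rows is safe, whereas two legitimate identical purchases would need a unique transaction ID to distinguish them.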
DATA QUALITY BEST PRACTICES
Best practices for data quality fall into three broad categories: proper logging, identification of issues and addressing problems.
Logging Early in the development of a product, it is crucial that you understand what to log, how changes will be made and how those changes will manifest in the data. You should also document any business and technical rules that may affect data quality, which will help you more easily identify problems down the line.
Identifying issues When monitoring for data quality issues, be proactive rather than reactive. Monitor from two perspectives, bottom-up (record-level checks) and top-down (aggregate metrics); alerting systems at both levels are the single most effective tool for catching problems.
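A top-down alert can be as simple as flagging any day whose metric deviates sharply from a trailing baseline. The sketch below assumes a hypothetical daily-signups series and an arbitrary 30% threshold; real systems would tune both the window and the threshold.

```python
def alert(series, window=7, threshold=0.3):
    """Return indices whose value deviates from the trailing-window
    mean by more than `threshold` (as a fraction of the baseline)."""
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if baseline and abs(series[i] - baseline) / baseline > threshold:
            flagged.append(i)
    return flagged

# Made-up daily signups with a sudden drop on the last day.
daily_signups = [100, 102, 98, 101, 99, 103, 100, 55]
print(alert(daily_signups))  # flags index 7, the day of the drop
```

Paired with the bottom-up record checks described earlier, an aggregate alert like this catches the "sudden and drastic change" pattern that most often signals a data quality problem.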
Fixing problems Once you’ve identified the source of a data quality problem, fixing it is generally relatively easy. It is important to also assess the problem’s business impact and cost to the organization; these downstream effects are often difficult to quantify, especially when the problem is itself difficult to detect.
Effective implementation of solutions requires organizational commitment and long-term vision but will help foster a sense of attention to quality and excellence company-wide.
- Data quality issues often stem from logging errors. Checking for missing data, duplicate logging or incorrect logging will help diagnose the problem.
- The process of transforming raw data can also lead to errors.
This work is a product of Sequoia Capital's Data Science team. Jamie Cuffe, Avanika Narayan, Chandra Narayanan, Hem Wadhar and Jenny Wang contributed to this post. Please email email@example.com with questions, comments and other feedback.