The Six Dimensions of Data Quality

Good data is essential for any data-driven initiative. Whether you’re generating a marketing report, training a machine learning model, or making strategic decisions, data quality directly affects your outcomes. In this article, I’ll walk through the six dimensions of data quality—accuracy, completeness, uniqueness, consistency, timeliness, and validity—along with real-time use cases, the business impact of each dimension, and practical demos using a toy dataset in Pandas.


A Sample Dataset

Let’s create a small, fictional dataset that intentionally contains data quality issues. We’ll use it to illustrate how each dimension can be checked using Pandas.

Python
import pandas as pd

# Create a toy dataset with various data quality issues
data = {
    "Name": ["John Doe", "Jane Smith", "Johnathan Doe", "Jake Brown", "Jake Brown"],
    "Email": [
        "john@example.com",
        "jane@example,com",  # Notice the comma instead of '.' in the domain
        "johnathan.example.com",  # Missing '@'
        "jake.brown@example.com",
        "jake.brown@example.com"  # Duplicate
    ],
    "EyeColor": ["Blue", "Green", "Blue", "Pale", "Blue"],  # "Pale" is invalid
    "Age": [30, None, 50, 45, 45],  # Missing age for Jane Smith
    "Height": [173, 160, 15, 180, 180],  # 15 cm is suspiciously small
    "DateOfBirth": [
        "1989-07-11",
        "1992-03-05",
        "1970-07-11",
        "1977-02-10",
        "1977-02-10"  # Duplicate
    ],
    "Address": [
        "123 Fake St",
        "456 Real Rd",
        "123 Fake St",
        "789 Another Pl",
        "789 Another Pl"
    ],
    "LastUpdated": [
        "2025-01-01",
        "2024-12-31",
        "2022-05-25",
        "2023-10-01",
        "2023-10-01"  # Same date for potential duplication
    ]
}

df = pd.DataFrame(data)
df

We’ll refer to this dataset in the examples below.
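
Before examining the individual dimensions, it helps to get a quick structural overview of the DataFrame to confirm column types and spot missing values early:

Python
# Quick structural overview: column dtypes and non-null counts
df.info()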


1. Accuracy

Definition

  • Accuracy measures how well data reflects real-world facts.
  • A practical example is ensuring that a customer’s address, contact information, or measured attributes (like height) are correct and up to date.

Real-Time Use Case

  • Autonomous Vehicles: Sensors must capture accurate data about road conditions and other vehicles in real time. Inaccurate sensor readings could lead to incorrect maneuvers, potentially causing accidents.
  • Business Impact: If a company relies on inaccurate customer addresses for shipping, it incurs higher shipping costs, returned packages, and customer dissatisfaction.

Demo: Checking Suspicious Height Values

In our sample dataset, 15 cm for an adult is suspiciously low.

Python
# Convert Height to numeric in case it's not
df["Height"] = pd.to_numeric(df["Height"], errors="coerce")

# Check for suspicious values: flag heights below 50 cm or above 250 cm
suspect_height = df[(df["Height"] < 50) | (df["Height"] > 250)]
suspect_height

The rows identified here likely require further investigation or correction.
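
One possible follow-up, assuming out-of-range heights should be treated as unknown rather than trusted, is to null them out until they can be verified:

Python
# Work on a copy so the original demo DataFrame stays intact
cleaned = df.copy()

# Treat implausible heights as missing until they can be verified
cleaned.loc[(cleaned["Height"] < 50) | (cleaned["Height"] > 250), "Height"] = None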


2. Completeness

Definition

  • Completeness focuses on whether all the critical pieces of information are present.
  • It doesn’t require every field to be filled—only those that are essential for the dataset’s intended purpose.

Real-Time Use Case

  • Healthcare Systems: When a patient is admitted, critical fields (like allergies, blood type, medications) must be completed. Missing any of these can lead to life-threatening treatment errors.
  • Business Impact: Incomplete data in customer relationship management (CRM) systems might result in lost sales opportunities, since missing contact details or purchase history data can cripple marketing efforts.

Demo: Checking Missing Values

Let’s see which columns in our dataset have missing entries:

Python
df.isnull().sum()

For our sample, Age is missing for Jane Smith. Depending on the use case—say, an insurance premium calculation—missing age data could invalidate the entire record for underwriting.
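
A minimal completeness flag, assuming Name, Email, and Age are the critical fields for our hypothetical underwriting use case, could look like this:

Python
# Flag records missing any critical field (hypothetical critical set)
critical_fields = ["Name", "Email", "Age"]
df["Complete"] = df[critical_fields].notnull().all(axis=1)

# Show the incomplete records
df[~df["Complete"]]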


3. Uniqueness

Definition

  • Uniqueness checks that each record appears only once within a dataset (unless there is a valid reason for duplicates).
  • Duplicates often arise when merging data sources or through repeated entries.

Real-Time Use Case

  • Fraud Detection: Credit card companies constantly check uniqueness (and other parameters) in real time to spot suspicious patterns. If a single user has multiple active credit card accounts with the same details, it might be a sign of fraud.
  • Business Impact: Duplicate records in a marketing database lead to overcounting and wasted resources, such as multiple promotional emails sent to the same person, which damages brand reputation and may violate privacy regulations.

Demo: Finding Duplicate Rows

We might consider a combination of columns—Name and DateOfBirth—as unique identifiers:

Python
duplicates = df.duplicated(subset=["Name", "DateOfBirth"], keep=False)
df[duplicates]

The rows that appear here need to be examined. They might represent genuine duplicates or valid but similar records.
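
If investigation confirms that the extra rows are true duplicates, a common remediation is to keep only the first occurrence:

Python
# Keep the first occurrence of each (Name, DateOfBirth) pair
deduped = df.drop_duplicates(subset=["Name", "DateOfBirth"], keep="first")
deduped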


4. Consistency

Definition

  • Consistency ensures that data does not conflict across different fields or systems.
  • A classic example is ensuring that a ZIP or postal code in one dataset lines up with the corresponding city in another.

Real-Time Use Case

  • Supply Chain Management: If a product is marked as “In Stock” in one system but is actually “Out of Stock” in another, online orders can’t be fulfilled properly. Real-time consistency checks are crucial for accurate inventory management.
  • Business Impact: Inconsistent data across reporting systems can lead to misalignment between departments, resulting in poor forecasting, double-counting of sales, and operational inefficiencies.

Demo: Checking for Inconsistent Age and Date of Birth

We can do a basic check to ensure that the reported age and date of birth align (with some tolerance).

Python
import datetime

today = datetime.date(2025, 3, 25)

def calculate_age(dob_str):
    dob = datetime.datetime.strptime(dob_str, "%Y-%m-%d").date()
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

df["CalculatedAge"] = df["DateOfBirth"].apply(calculate_age)
df["AgeDifference"] = df["CalculatedAge"] - df["Age"].fillna(-9999)

# Check if difference is more than 1 year
inconsistent_age = df[df["AgeDifference"].abs() > 1]
inconsistent_age

When discrepancies appear, we’d investigate or update the data accordingly.
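
If one field is known to be more authoritative than the other, the reconciliation can be automated. As a sketch, assuming DateOfBirth is the trusted source here:

Python
# Assume DateOfBirth is authoritative and recompute Age where they disagree
mask = df["AgeDifference"].abs() > 1
df.loc[mask, "Age"] = df.loc[mask, "CalculatedAge"]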


5. Timeliness

Definition

  • Timeliness measures if data is available when needed and whether it is kept up to date.
  • Different use cases have different timeliness requirements. Near-real-time data is crucial for some operations, whereas monthly or quarterly updates might suffice for others.

Real-Time Use Case

  • Stock Trading: In algorithmic trading, stock price data needs to be updated in near real time. Even a few seconds’ delay can result in significant financial losses.
  • Business Impact: Outdated customer or product information could lead to poorly targeted marketing campaigns, decreased sales, and inefficiencies. In fast-paced sectors like e-commerce, old data directly translates to lost revenue.

Demo: Checking Data Freshness

Let’s say we want our records to have been updated within the last year (relative to 2025-03-25).

Python
df["LastUpdated"] = pd.to_datetime(df["LastUpdated"], errors="coerce")

threshold_date = pd.to_datetime("2024-03-25")
df["Timely"] = df["LastUpdated"] >= threshold_date
df

A False value in the “Timely” column indicates that the record may be stale and needs a refresh or validation.
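
Beyond a binary flag, it can be useful to quantify staleness, for example in days since the last update:

Python
# Days elapsed since each record was last updated (as of 2025-03-25)
df["DaysSinceUpdate"] = (pd.Timestamp("2025-03-25") - df["LastUpdated"]).dt.days
df[["Name", "LastUpdated", "DaysSinceUpdate", "Timely"]]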


6. Validity

Definition

  • Validity checks whether data values conform to expected formats, types, or ranges.
  • For instance, a valid email address should contain the ‘@’ symbol, a product ID should follow a specified alphanumeric format, and an age should be within a reasonable range (0–120, for example).

Real-Time Use Case

  • Online Form Submissions: Real-time form validation ensures only correct data types are entered—e.g., an email field must contain “@,” a phone number must be numeric, and so on.
  • Business Impact: Invalid data (like a wrong product SKU format) can break downstream systems, cause order fulfillment errors, and lead to lost revenue or customer dissatisfaction.

Demo: Checking Email and Eye Color Formats

We’ll do a simple check for the presence of ‘@’ in emails and validate eye colors against a controlled set.

Python
# Basic email check
df["ValidEmail"] = df["Email"].apply(lambda x: "@" in x)

# Eye color validity
valid_eye_colors = ["Blue", "Green", "Brown", "Hazel"]
df["ValidEyeColor"] = df["EyeColor"].isin(valid_eye_colors)

df[["Name", "Email", "EyeColor", "ValidEmail", "ValidEyeColor"]]

Rows flagged as invalid can be corrected or removed, depending on the context.
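
Note that the ‘@’-only check still passes "jane@example,com" despite the comma in the domain. A stricter pattern catches it; the regex below is a deliberately simplified sketch, not production-grade email validation:

Python
import re

# Simplified pattern: one '@', a dotted domain, no spaces or commas
email_pattern = re.compile(r"^[^@\s,]+@[^@\s,]+\.[^@\s,]+$")
df["ValidEmailStrict"] = df["Email"].apply(lambda x: bool(email_pattern.match(x)))
df[["Email", "ValidEmail", "ValidEmailStrict"]]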


Pulling It All Together

Data quality is not a one-time project; it’s an ongoing process. By proactively measuring and addressing these six dimensions—accuracy, completeness, uniqueness, consistency, timeliness, and validity—you minimize errors, build trust in your analytics, and create a solid foundation for decision-making.

Key Takeaways

  1. Identify Critical Fields—Focus on the data that directly impacts your key business operations.
  2. Choose Relevant Dimensions—Not every data field needs every check, but a critical subset does.
  3. Monitor Continuously—Data changes rapidly; what was complete last month may be incomplete now.
  4. Fix Issues at Their Source—Identify patterns in recurring issues and prevent them from re-entering your systems.

Whether you’re dealing with real-time sensor data in logistics or customer records in a CRM, applying these data quality dimensions ensures that your data remains trustworthy and actionable. Incorporate regular checks into your data pipelines using Pandas (and other tools) to catch issues early and keep your analyses on track.
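
To operationalize these checks, one option is to roll the individual flags into a single scorecard. The sketch below assumes the flag columns created in the demos above and reports the share of rows passing each check:

Python
# Share of rows passing each check built in the demos above
quality_flags = pd.DataFrame({
    "ValidEmail": df["ValidEmail"],
    "ValidEyeColor": df["ValidEyeColor"],
    "Timely": df["Timely"],
    "AgePresent": df["Age"].notnull(),
    "Unique": ~df.duplicated(subset=["Name", "DateOfBirth"], keep=False),
})
quality_flags.mean().rename("PassRate")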


Wrapping Up

That wraps up our exploration of the six dimensions of data quality. By systematically applying these checks, you’ll maintain high-quality data, ensuring that any analytics, machine learning models, or operational decisions you make will stand on solid ground.
