Ensuring data quality is crucial for any data-driven organization, and Great Expectations, an open-source Python library, provides a powerful solution to this challenge. It allows you to validate and profile your data, ensuring that it meets your quality standards.
By incorporating Great Expectations into your data pipeline, you can automate the detection and resolution of data issues, boosting confidence in the data and the decisions based on it.
This blog will guide you through using Great Expectations to validate CSV data as a data source, enabling a streamlined, reliable approach to data quality management.
For organizations looking to enhance their data pipelines, hiring a skilled Data Engineer from Vivasoft Limited can significantly accelerate the integration of tools like Great Expectations, ensuring that your data remains accurate, consistent, and trustworthy across all your operations.
Project Overview
In this blog post, we'll walk you through the steps for using the open-source Python package Great Expectations to ensure data quality.
With Great Expectations, you can validate, document, and profile your data to make sure it satisfies your quality standards. Here, we'll focus on using GX (Great Expectations) to validate CSV data.
Technology Used
- Python
- CSV
Documentation - python_greatExpectation
Objectives
Establish Robust Data Quality Checks:
- Use thorough, automated data quality checks to find and fix problems early in the data pipeline
- Ensure that data is accurate, consistent, and complete across different data sources
Increase Confidence in Data-Driven Decisions:
- Validate data against established standards and expectations to increase trust in it
- Provide stakeholders with validated data so they can make informed business decisions
Simplify Data Quality Management:
- Use Great Expectations to streamline the process of defining, managing, and updating data quality rules
- Reduce the manual effort and human error involved in conventional data quality management
Enable Continuous Data Monitoring:
- Incorporate data validation into continuous integration / continuous deployment (CI/CD) pipelines for ongoing data quality assurance
- Schedule regular data quality checks to find and fix problems promptly
Improve Data Documentation and Transparency:
- Automatically generate comprehensive, readable data documentation
- Give data engineers, analysts, and other stakeholders clear visibility into validation results and data quality metrics
Enhance Collaboration Across Teams:
- Establish shared data quality expectations so that data scientists, data engineers, and business analysts can work together more effectively
- Promote an organizational culture of data quality awareness and accountability
Adapt to Evolving Data Requirements:
- Retain the flexibility to adjust data quality expectations as business needs and data sources change
- Continuously update data validation procedures to keep pace with the ever-changing data landscape
Minimize Risk of Data-Related Issues:
- Reduce the likelihood of data errors and inconsistencies that could affect decision-making and business operations
- Ensure compliance with legal and industry standards for data quality
Optimize Data Pipeline Performance:
- Identify and resolve data quality issues that could impact the reliability and effectiveness of data pipelines
- Improve the overall effectiveness and efficiency of data processing workflows
Leverage Open Source Capabilities:
- Benefit from the rich feature set and community support offered by the free, open-source Great Expectations library
- Participate in, and benefit from, the ongoing innovation within the Great Expectations community
Technology Stack
Great Expectations
- Data Validation
- Generates Data Docs viewable in a web UI
CSV
Top Level Workflow
Setting Up Great Expectations
Prepare the Environment
- Managing dependencies in a virtual environment is recommended
python3 -m venv gx_venv
source gx_venv/bin/activate
Install Great Expectations
pip install great_expectations
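To confirm the installation succeeded (optional), you can print the installed version from the command line:
python3 -c "import great_expectations as gx; print(gx.__version__)"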
Initial Configuration
Data Source
- Keep a sample CSV file as source data in the project directory (this post uses employee_data.csv)
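The exact contents are up to you; a minimal employee_data.csv that matches the expectations used later in this post (an id column with unique values between 1 and 10, at most 3 columns, and fewer than 100 rows) might look like this, where the name and department values are purely illustrative:
id,name,department
1,Alice,Engineering
2,Bob,Marketing
3,Carol,Finance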
Create a .py file
- This file will contain all data validation logic and configuration
Validate and Generate Docs
- Run the Python file
python3 *.py
Project Directory Structure
Local environment → project dir → *.py, *.csv
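For reference, a minimal project layout could look like the following (the script name is illustrative; any .py file name works):
project_dir/
├── validate_employee_data.py
└── employee_data.csv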
Create Validation: Step-by-Step Guide
Import GX
project_dir → *.py
import great_expectations as gx
Create Context
context = gx.get_context()
Connect to Data
validator = context.sources.pandas_default.read_csv("employee_data.csv")
Create Expectations
# check that the column exists
validator.expect_column_to_exist(column="id")
# check the total number of columns
validator.expect_table_column_count_to_be_between(min_value=1, max_value=3)
# check the total number of rows
validator.expect_table_row_count_to_be_between(min_value=1, max_value=100)
# check for null values
validator.expect_column_values_to_not_be_null(column="id", notes="**identification** of each employee")
# check that column values fall within a range
validator.expect_column_values_to_be_between(column="id", min_value=1, max_value=10)
# check that values are distinct
validator.expect_column_values_to_be_unique(column="id")
# check the count of distinct values
validator.expect_column_unique_value_count_to_be_between(column="id", min_value=1, max_value=10)
# save the expectation suite
validator.save_expectation_suite(discard_failed_expectations=False)
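Each expect_* call runs immediately against the loaded data and returns a validation result object, so you can inspect individual checks as you go. A minimal sketch (the result variable name is just illustrative):
# run a single expectation and inspect whether it passed
result = validator.expect_column_values_to_be_unique(column="id")
print(result.success)  # True if every value in the "id" column is unique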
Full Code
import great_expectations as gx
# create a data context
context = gx.get_context()
# connect to the data
validator = context.sources.pandas_default.read_csv("employee_data.csv")
# create expectations
validator.expect_column_to_exist(column="id")
validator.expect_table_column_count_to_be_between(min_value=1, max_value=3)
validator.expect_table_row_count_to_be_between(min_value=1, max_value=100)
validator.expect_column_values_to_not_be_null(column="id", notes="**identification** of each employee")
validator.expect_column_values_to_be_between(column="id", min_value=1, max_value=10)
validator.expect_column_values_to_be_unique(column="id")
validator.expect_column_unique_value_count_to_be_between(column="id", min_value=1, max_value=10)
validator.save_expectation_suite(discard_failed_expectations=False)
# create a checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="my_quickstart_checkpoint",
    validator=validator,
)
# run the checkpoint and read the overall validation result
checkpoint_result = checkpoint.run()
result = dict(checkpoint_result)["_success"]
print(f"checkpoint result: {result}")
# visualize the results as an HTML representation (optional)
context.view_validation_result(checkpoint_result)
DOCS
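When the checkpoint runs, Great Expectations renders the validation results as human-readable Data Docs. The context.view_validation_result(...) call above opens them directly; as an alternative sketch, you can also build and open the Data Docs site explicitly through the Data Context:
# build the HTML Data Docs and open them in the default browser
context.build_data_docs()
context.open_data_docs()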
Conclusion
By following these steps, you can use Great Expectations to make sure your data satisfies your quality standards. Regular data validation helps preserve data reliability and integrity, enabling better analytics and decision-making.