Ensuring data quality is crucial for any data-driven organization, and Great Expectations, an open-source Python library, provides a powerful solution to this challenge. It allows you to validate and profile your data, ensuring that it meets your quality standards.
By incorporating Great Expectations into your data pipeline, you can automate the detection and resolution of data issues, boosting confidence in the data and the decisions based on it.
This blog will guide you through using Great Expectations to validate CSV data as a data source, enabling a streamlined, reliable approach to data quality management.
For organizations looking to enhance their data pipelines, hiring a skilled Data Engineer from Vivasoft Limited can significantly accelerate the integration of tools like Great Expectations, ensuring that your data remains accurate, consistent, and trustworthy across all your operations.
Project Overview
In this blog post, we'll walk you through the steps for using the open-source Python package Great Expectations to ensure data quality.
With Great Expectations, you can validate, document, and profile your data to make sure it satisfies your quality standards. Here, we'll focus on using GX (Great Expectations) to validate CSV data.
Technology Used
- Python
- CSV
Documentation - python_greatExpectation
Objectives
Establish Robust Data Quality Checks:
- Use thorough, automated data quality checks to find and fix problems early in the data pipeline
- Ensure that data is accurate, consistent, and complete across different data sources
Increase Confidence in Data-Driven Decisions:
- Validate data against established standards and expectations to increase trust in it
- Provide stakeholders with validated data so they can make informed business decisions
Simplify Data Quality Management:
- Use Great Expectations to streamline the process of defining, managing, and updating data quality rules
- Reduce the manual effort and human error involved in conventional data quality management
Enable Continuous Data Monitoring:
- Incorporate data validation into continuous integration / continuous deployment (CI/CD) pipelines for ongoing data quality assurance
- Schedule regular data quality checks to find and fix problems promptly
Improve Data Documentation and Transparency:
- Automatically generate comprehensive, readable data documentation
- Give data engineers, analysts, and other stakeholders clear visibility into validation results and data quality metrics
Enhance Collaboration Across Teams:
- Establish shared data quality expectations so that data scientists, data engineers, and business analysts can work together more effectively
- Promote an organizational culture of data quality awareness and accountability
Adapt to Evolving Data Requirements:
- Retain the flexibility to adjust data quality expectations as business needs and data sources change
- Continuously update data validation procedures to keep pace with the ever-changing data landscape
Minimize Risk of Data-Related Issues:
- Reduce the likelihood of data errors and inconsistencies that could affect decision-making and business operations
- Ensure compliance with legal and industry standards for data quality
Optimize Data Pipeline Performance:
- Identify and resolve data quality issues that could impact the reliability and effectiveness of data pipelines
- Improve the overall effectiveness and efficiency of data processing workflows
Leverage Open Source Capabilities:
- Benefit from the rich feature set and community support offered by the free, open-source Great Expectations library
- Participate in, and benefit from, the ongoing innovation within the Great Expectations community
Technology Stack
Great Expectations
- Data Validation
- Generates Data Docs viewable in a web UI
CSV
Top Level Workflow
Setting Up Great Expectations
Prepare the Environment
- Managing dependencies in a virtual environment is recommended
python3 -m venv gx_venv
source gx_venv/bin/activate
Install Great Expectations
pip install great_expectations
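To confirm the installation succeeded (optional), you can print the installed version from the command line:
python3 -c "import great_expectations as gx; print(gx.__version__)"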
Initial Configuration
Data Source
- Keep a sample CSV file as source data in the project directory (this post uses employee_data.csv)
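The exact contents are up to you; a minimal employee_data.csv that matches the expectations used later in this post (an id column with unique values between 1 and 10, at most 3 columns, and fewer than 100 rows) might look like this, where the name and department values are purely illustrative:
id,name,department
1,Alice,Engineering
2,Bob,Marketing
3,Carol,Finance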
Create a .py file
- This file will contain all data validation logic and configuration
Validate and Generate Docs
- Run the Python file
python3 *.py
Project Directory Structure
Local environment → project dir → *.py, *.csv
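For reference, a minimal project layout could look like the following (the script name is illustrative; any .py file name works):
project_dir/
├── validate_employee_data.py
└── employee_data.csv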
Create Validation: Step-by-Step Guide
Import GX
project_dir → *.py
import great_expectations as gx
Create Context
context = gx.get_context()
Connect to Data
validator = context.sources.pandas_default.read_csv("employee_data.csv")
Create Expectations
# check that the column exists
validator.expect_column_to_exist(column="id")
# check the total number of columns
validator.expect_table_column_count_to_be_between(min_value=1, max_value=3)
# check the total number of rows
validator.expect_table_row_count_to_be_between(min_value=1, max_value=100)
# check for null values
validator.expect_column_values_to_not_be_null(column="id", notes="**identification** of each employee")
# check that column values fall within a range
validator.expect_column_values_to_be_between(column="id", min_value=1, max_value=10)
# check that values are distinct
validator.expect_column_values_to_be_unique(column="id")
# check the count of distinct values
validator.expect_column_unique_value_count_to_be_between(column="id", min_value=1, max_value=10)
# save the expectation suite
validator.save_expectation_suite(discard_failed_expectations=False)
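Each expect_* call runs immediately against the loaded data and returns a validation result object, so you can inspect individual checks as you go. A minimal sketch (the result variable name is just illustrative):
# run a single expectation and inspect whether it passed
result = validator.expect_column_values_to_be_unique(column="id")
print(result.success)  # True if every value in the "id" column is unique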
Full Code
import great_expectations as gx
# create a data context
context = gx.get_context()
# connect to the data
validator = context.sources.pandas_default.read_csv("employee_data.csv")
# create expectations
validator.expect_column_to_exist(column="id")
validator.expect_table_column_count_to_be_between(min_value=1, max_value=3)
validator.expect_table_row_count_to_be_between(min_value=1, max_value=100)
validator.expect_column_values_to_not_be_null(column="id", notes="**identification** of each employee")
validator.expect_column_values_to_be_between(column="id", min_value=1, max_value=10)
validator.expect_column_values_to_be_unique(column="id")
validator.expect_column_unique_value_count_to_be_between(column="id", min_value=1, max_value=10)
validator.save_expectation_suite(discard_failed_expectations=False)
# create a checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="my_quickstart_checkpoint",
    validator=validator,
)
# run the checkpoint and read the overall validation result
checkpoint_result = checkpoint.run()
result = dict(checkpoint_result)["_success"]
print(f"checkpoint result: {result}")
# visualize the results as an HTML representation (optional)
context.view_validation_result(checkpoint_result)
DOCS
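When the checkpoint runs, Great Expectations renders the validation results as human-readable Data Docs. The context.view_validation_result(...) call above opens them directly; as an alternative sketch, you can also build and open the Data Docs site explicitly through the Data Context:
# build the HTML Data Docs and open them in the default browser
context.build_data_docs()
context.open_data_docs()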
Conclusion
By following these steps, you can use Great Expectations to make sure your data satisfies your quality standards. Regular data validation helps preserve data reliability and integrity, enabling better analytics and decision-making.