Transform and Ensure Data Quality Workflow with Great Expectations

Last Update: October 4, 2024
Contributors: Vivasoft Data Engineering Team

Ensuring data quality is crucial for any data-driven organization, and Great Expectations, an open-source Python library, provides a powerful solution to this challenge. It allows you to validate, document, and profile your data, ensuring that it meets your quality standards.

By incorporating Great Expectations into your data pipeline, you can automate the detection and resolution of data issues, boosting confidence in the data and the decisions based on it.

This blog will guide you through using Great Expectations to validate CSV data as a data source, enabling a streamlined, reliable approach to data quality management.

For organizations looking to enhance their data pipelines, hiring a skilled Data Engineer from Vivasoft Limited can significantly accelerate the integration of tools like Great Expectations, ensuring that your data remains accurate, consistent, and trustworthy across all your operations.

Project Overview

In this blog post, we’ll guide you step by step through using the open-source Python package Great Expectations to ensure data quality.

With the aid of Great Expectations (GX), you can validate, document, and profile your data to make sure it satisfies your quality standards. Here, we’ll focus on using GX to validate CSV data.

Technology Used

  • Python
  • CSV

Documentation: python_greatExpectation

Objectives


Establish Robust Data Quality Checks:

  • Use thorough and automated data quality checks to find and fix problems early in the data pipeline
  • Make sure that the data is accurate, consistent, and comprehensive across different data sources


Increase Confidence in Data-Driven Decisions:

  • By comparing data to established standards and expectations, you can increase trust in the data
  • Provide stakeholders with validated data so they can make informed business decisions


Simplify Data Quality Management:

  • Use Great Expectations to streamline the process of defining, managing, and updating data quality rules
  • Minimize the manual labor and human error involved in conventional data quality management techniques


Enable Continuous Data Monitoring:

  • For continuous data quality assurance, incorporate data validation into continuous integration / continuous deployment (CI/CD) pipelines
  • Schedule regular data quality inspections to find and fix problems promptly


Improve Data Documentation and Transparency:

  • Produce comprehensive, readable data documentation automatically
  • Give data engineers, analysts, and other stakeholders clear visibility into validation results and data quality indicators


Enhance Collaboration Across Teams:

  • By establishing common data quality expectations, data scientists, data engineers, and business analysts can work together more effectively
  • Promote an organizational culture that values accountability and knowledge of data quality


Adapt to Evolving Data Requirements:

  • As business needs and data sources change, ensure the flexibility to adjust expectations for data quality
  • Continuously update data validation procedures to keep up with the ever-changing data landscape


Minimize Risk of Data-Related Issues:

  • Minimize the possibility of errors and inconsistencies in data that could affect decision-making and business operations
  • Make sure that the legal and industry standards for data quality are being followed


Optimize Data Pipeline Performance:

  • Determine and resolve any issues with data quality that could have an impact on the reliability and effectiveness of data pipelines
  • Improve the overall effectiveness and efficiency of data processing workflows


Leverage Open Source Capabilities:

  • Benefit from numerous features and community support offered by the Great Expectations library, which is available for free
  • Participate in and gain from the continuous innovation and progress that occurs within the Great Expectations community

Technology Stack

Great Expectations

  • Data Validation
  • Generate Data Docs in a web UI

CSV

Top Level Workflow

Figure: Top-level GX workflow

Setting Up Great Expectations

Prepare the Environment

  • Managing dependencies in a virtual environment is recommended

python3 -m venv gx_venv
source gx_venv/bin/activate

Install Great Expectations

pip install great_expectations
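Note: the code in this post uses the pre-1.0 “fluent” GX API (context.sources.pandas_default, add_or_update_checkpoint, validator.save_expectation_suite). Great Expectations 1.0 reorganized this API, so if a newer release rejects these calls, pinning to a 0.x version is a reasonable workaround:

pip install "great_expectations<1.0"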

Initial Configuration

Data Source

  • Keep a sample CSV file (here, employee_data.csv) as source data in the project directory

Create a .py File

  • This file will contain all data validation and configuration code

Validate and Generate Docs

  • Run the Python file

python3 *.py

Project Directory Structure

Local environment → project dir → *.py, *.csv
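For example, the layout might look like this (the script name is a placeholder):

project_dir/
├── validate_employees.py    # your validation script
└── employee_data.csv        # sample source data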

Create Validation: Step by Step guide

Import GX

project_dir → *.py

import great_expectations as gx

Create Context

context = gx.get_context()
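gx.get_context() returns a Data Context, the entry point to the GX API. As a quick sanity check you can inspect which kind of context you received; with no initialized GX project directory, an in-memory (ephemeral) context is the likely result (class names here are from the 0.x API):

print(type(context).__name__)  # e.g. EphemeralDataContext or FileDataContext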

Connect to Data

validator = context.sources.pandas_default.read_csv("employee_data.csv")
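The post doesn’t show the contents of employee_data.csv, but given the expectations below (an "id" column with unique, non-null values between 1 and 10, at most 3 columns, and 1 to 100 rows), a minimal hypothetical file could be generated like this:

import pandas as pd

# Hypothetical sample data that satisfies every expectation used in this post
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],                         # unique, non-null, within [1, 10]
    "name": ["Ava", "Ben", "Cara", "Dan", "Eve"],
    "department": ["HR", "IT", "IT", "Sales", "HR"],
})
df.to_csv("employee_data.csv", index=False)        # 3 columns, 5 rows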

Create Expectations

# check that a column exists
validator.expect_column_to_exist(column="id")

# check the total number of columns
validator.expect_table_column_count_to_be_between(min_value=1, max_value=3)

# check the total number of rows
validator.expect_table_row_count_to_be_between(min_value=1, max_value=100)

# check for null values
validator.expect_column_values_to_not_be_null(
    column="id",
    notes="**identification** of each employee",
)

# check that column values fall within a range
validator.expect_column_values_to_be_between("id", min_value=1, max_value=10)

# check that values are distinct
validator.expect_column_values_to_be_unique(column="id")

# check the distinct-value count
validator.expect_column_unique_value_count_to_be_between(
    column="id",
    min_value=1,
    max_value=10,
)

validator.save_expectation_suite(discard_failed_expectations=False)
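Each expect_* call runs immediately against the loaded data and returns a validation result, so you can debug a failing expectation interactively (attribute access as in the 0.x API):

# Re-run a single expectation and inspect its outcome
result = validator.expect_column_values_to_be_unique(column="id")
print(result.success)  # True if every "id" value is distinct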

Full Code

import great_expectations as gx

# create data context
context = gx.get_context()

# connect to data
validator = context.sources.pandas_default.read_csv("employee_data.csv")

# create expectations
validator.expect_column_to_exist(column="id")

validator.expect_table_column_count_to_be_between(min_value=1, max_value=3)

validator.expect_table_row_count_to_be_between(min_value=1, max_value=100)

validator.expect_column_values_to_not_be_null(
    column="id",
    notes="**identification** of each employee",
)

validator.expect_column_values_to_be_between("id", min_value=1, max_value=10)

validator.expect_column_values_to_be_unique(column="id")

validator.expect_column_unique_value_count_to_be_between(
    column="id",
    min_value=1,
    max_value=10,
)

validator.save_expectation_suite(discard_failed_expectations=False)

# create a checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="my_quickstart_checkpoint",
    validator=validator,
)

# run the checkpoint and read the overall validation result
checkpoint_result = checkpoint.run()
print(f"checkpoint result: {checkpoint_result.success}")

# visualize results as an HTML representation (optional)
context.view_validation_result(checkpoint_result)
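With the hypothetical sample data above, every expectation passes and the script (name assumed) prints the overall checkpoint status:

python3 validate_employees.py
checkpoint result: True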

Docs

Figure: GX data validation results in the Data Docs UI
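context.view_validation_result() builds the Data Docs site and opens it in your browser. To rebuild or reopen the docs later without re-running the whole script, the 0.x context also exposes these calls:

# Rebuild the static Data Docs site from stored results
context.build_data_docs()

# Open the site in the default browser
context.open_data_docs()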

Conclusion

By following these steps, you can use Great Expectations to make sure your data satisfies your quality standards. Frequent data validation helps preserve data reliability and integrity, enabling better analytics and decision-making.
