Transform and Ensure Data Quality Workflow with SODA

Last Update: September 24, 2024
Contributors
Vivasoft Data Engineering Team

In today’s data-driven world, data quality is extremely important. Poor data leads to inaccurate analysis, incorrect conclusions, and missed opportunities. That’s why strong validation procedures are essential, especially when using data in key applications such as machine learning models or analytics.

This article focuses on Soda, a powerful tool for automating data quality checks throughout data pipelines. Soda enables enterprises to proactively monitor and evaluate data, ensuring its accuracy, completeness, and consistency.

By integrating Soda, you can detect and address data issues early, safeguarding downstream processes. For small teams and large enterprises alike, Soda helps ensure that the data behind your insights and decisions is accurate.

To ensure seamless data quality management and pipeline optimization, consider hiring a Data Engineer from Vivasoft Limited. Our experts are well-equipped to implement cutting-edge tools like Soda, enabling your business to maintain data accuracy and gain a competitive edge.

Project Overview

In this project, we will focus on ensuring data quality and validation for a Soda-based data pipeline. The purpose is to assess the quality of data before it is used in downstream applications such as machine learning models, dashboards, or reports.

Technology Used

  • Python
  • BigQuery
  • PostgreSQL
  • soda-core

The main goal of this project is to use Soda to automate the validation and monitoring of data quality across the pipeline, ensuring that the data is accurate, complete, and consistent.

This helps detect and resolve anomalies, inconsistencies, and missing information early in the pipeline, before they affect downstream analysis and decision-making.

Data Source Integration:

  • The data comes from a variety of sources, including Dask/Pandas DataFrames and direct queries to platforms such as BigQuery, Snowflake, and Databricks.

  • Each source may bring unique data quality difficulties, such as missing data, mismatched data types, or duplicate records.

Data Quality Rules Definition:

To ensure the quality of incoming data, we define a set of rules using Soda’s scan YAML files. These rules are designed to meet the project’s specific requirements. Examples include:

  • Ensuring there are no missing values in crucial columns.

  • Ensuring that numeric fields fall within expected ranges.

  • Checking for duplicates in primary identifier fields (such as customer IDs).

  • Validating the data’s freshness (for example, checking that data was ingested within the last 24 hours).
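To make the four rule types above concrete, here is a minimal plain-Python sketch of what each one measures. It is an illustration only and does not use Soda: the function names, the sample rows, and the column names (customer_id, amount, ingested_at) are all hypothetical, and the duplicate rule is defined here as "distinct values that occur more than once", which may differ from Soda's own metric definition.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def missing_count(rows, column):
    """Count rows where the column is absent or None."""
    return sum(1 for row in rows if row.get(column) is None)


def out_of_range_count(rows, column, low, high):
    """Count non-null values outside the inclusive range [low, high]."""
    values = (row.get(column) for row in rows)
    return sum(1 for v in values if v is not None and not (low <= v <= high))


def duplicated_value_count(rows, column):
    """Count distinct non-null values that occur more than once."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sum(1 for n in counts.values() if n > 1)


def is_fresh(rows, column, now, max_age=timedelta(hours=24)):
    """True if the newest timestamp in the column is at most max_age old."""
    stamps = [row[column] for row in rows if row.get(column) is not None]
    return bool(stamps) and now - max(stamps) <= max_age


# Hypothetical sample data: one missing amount, one negative amount,
# and a repeated customer_id.
now = datetime(2024, 9, 24, 12, 0, tzinfo=timezone.utc)
rows = [
    {"customer_id": 1, "amount": 25.0, "ingested_at": now - timedelta(hours=2)},
    {"customer_id": 2, "amount": None, "ingested_at": now - timedelta(hours=30)},
    {"customer_id": 2, "amount": -5.0, "ingested_at": now - timedelta(hours=3)},
]

results = {
    "no missing amount": missing_count(rows, "amount") == 0,
    "amount within 0..1000": out_of_range_count(rows, "amount", 0, 1000) == 0,
    "customer_id unique": duplicated_value_count(rows, "customer_id") == 0,
    "ingested in last 24h": is_fresh(rows, "ingested_at", now),
}
for name, passed in results.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

In a real pipeline these rules would be declared in the Soda checks file and evaluated by Soda against the data source, rather than computed by hand like this.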

Data Quality Checks:

  • Soda scans are performed on a regular basis at various stages of the data
    pipeline, and are triggered automatically when fresh data is imported.

  • These scans evaluate the data quality rules specified in the YAML files and produce reports indicating whether each check passed or failed.

SODA Cloud Dashboard:

  • The Soda Cloud platform has a centralized dashboard for visualizing the results of all data quality checks. This allows stakeholders to see the health of their data in real time.

  • The dashboard has features such as anomaly detection, trend visualization, and issue tracking. Alerts can be set up to notify the team when a data quality check fails, allowing for a timely resolution.

  • Collaborative capabilities in Soda Cloud allow teams to assign issues to individual members, track resolution status, and view historical data quality trends.

Integration with Apache Airflow:

  • Generate extensive and readable data documentation automatically.

  • Give data engineers, analysts, and other stakeholders complete visibility into validation results and data quality metrics.

Example Use Case:

  • A telecom company collects customer call logs, which are ingested into a data lake every day. The data is then processed and loaded into Google BigQuery for further analysis.

  • Before the data is loaded into BigQuery, Soda examines the call records to ensure that all key fields (e.g., call duration, timestamps) are filled in and that there are no duplicates or unusual values.

  • If Soda detects a problem, notifications appear in the Soda Cloud dashboard, allowing the team to take corrective action before the data is used for customer analytics or billing.
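The "validate before load" pattern in this use case can be sketched in a few lines of plain Python. This is not Soda code: it is a hedged illustration of the gating idea, and the field names (call_id, call_duration, started_at) and the loader callback are assumptions made for the example.

```python
def find_call_record_issues(records, required=("call_id", "call_duration", "started_at")):
    """Return a list of human-readable problems found in the call records."""
    issues = []
    seen_ids = set()
    for index, record in enumerate(records):
        # Key fields must be present and non-empty.
        for field in required:
            if record.get(field) in (None, ""):
                issues.append(f"row {index}: missing {field}")
        # Each call_id may appear only once.
        call_id = record.get("call_id")
        if call_id is not None:
            if call_id in seen_ids:
                issues.append(f"row {index}: duplicate call_id {call_id}")
            seen_ids.add(call_id)
        # A negative duration is treated as an unusual value.
        duration = record.get("call_duration")
        if isinstance(duration, (int, float)) and duration < 0:
            issues.append(f"row {index}: negative call_duration {duration}")
    return issues


def load_if_valid(records, load):
    """Call load(records) only when no issues are found; return (loaded, issues)."""
    issues = find_call_record_issues(records)
    if issues:
        return False, issues
    load(records)
    return True, []
```

Here load stands in for whatever writes to the warehouse. With Soda in place, the issue detection is replaced by a scan, and the same gate decides whether the load step runs.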

Why Use SODA?

Soda is a modern, scalable solution that lets data engineers and data scientists monitor and assess data quality across a variety of data sources, including data lakes, warehouses, and pipelines.

What is Soda?

Soda is a program that automates the validation of data quality by performing predefined checks on your data. It detects data anomalies, validates accuracy, and assures consistency throughout your workflow.

Real-Time Monitoring

Soda assists you in setting up continuous monitoring of your data, allowing teams to be notified of concerns as they arise rather than discovering them after the data has been used in decision-making. It works by allowing users to establish quality rules (for example, ensuring no null values in crucial columns or meeting data volume thresholds), which are automatically reviewed as new data arrives.

Soda Cloud

The premium version, Soda Cloud, goes a step further by including a collaborative dashboard that allows teams to observe and monitor data quality trends over time, receive alerts when anomalies are found, and track issue resolution. This visibility is especially beneficial for teams that work with several data sources and pipelines, since it provides a central hub for managing data quality initiatives.

Integration with Modern Data Stacks

Soda integrates with prominent data platforms including Google BigQuery, AWS Redshift, Snowflake, and Apache Airflow. This integration allows data to be validated as it is ingested, processed, and stored, so that errors are identified early in the pipeline.

Soda enables data engineers and data scientists to automate the implementation of data quality criteria, decrease manual oversight, and ensure that only high-quality data reaches crucial downstream processes such as reporting or machine learning models.

Technology Stack

SODA

PostgreSQL, BigQuery

Top Level Workflow

Toplevel SODA Architecture by SODA

Setting Up SODA

Prepare the Environment

  • Managing dependencies in a virtual environment is recommended.
				
python3 -m venv soda_venv
source soda_venv/bin/activate
				
			

Install SODA

				
pip install soda-core
pip install -i https://pypi.cloud.soda.io soda-postgres
pip install -i https://pypi.cloud.soda.io soda-bigquery
				
			

SODA Cloud

  • Create a Soda Cloud account with your organization email.
  • Create an API key and save it.

Initial Configuration

Setup configuration

  • Create a file named configuration.yml in the directory where you will run Soda.
  • Fill in the configuration following this example, replacing the placeholder values with your own:
				
data_source my_bigquery_source:
  type: bigquery
  account_info_json: '{
      "type": "service_account",
      "project_id": "gcp project id",
      "private_key_id": "from service account json file",
      "private_key": "from service account json file",
      "client_email": "service account mail from service account json file",
      "client_id": "from service account json file",
      "auth_uri": "from service account json file",
      "token_uri": "from service account json file",
      "auth_provider_x509_cert_url": "from service account json file",
      "client_x509_cert_url": "from service account json file",
      "universe_domain": "googleapis.com"
  }'
  auth_scopes:
    - https://www.googleapis.com/auth/bigquery
    - https://www.googleapis.com/auth/cloud-platform
    - https://www.googleapis.com/auth/drive
  project_id: "dummy_project"
  dataset: BigQuery_dataset_name


soda_cloud:
  host: cloud.us.soda.io
  # Example values only: use the API key ID and secret created in Soda Cloud
  api_key_id: ad16c3e0-ea78-4cc3-8ba9-2490ebc10
  api_key_secret: oJEreCPi1AHA_CHO_20XVeeWiQfMLz0IOdbncpgRUyFSLWmx
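
Because the soda_cloud section holds credentials, you may prefer not to keep the key values in a file that is committed to version control. One possible approach, sketched below, is to render that section from environment variables with Python's standard library. The variable names SODA_API_KEY_ID and SODA_API_KEY_SECRET and the helper function are assumptions made for this example, not part of Soda.

```python
import os
from string import Template

# Hypothetical template covering only the soda_cloud section; the
# $PLACEHOLDER names are illustrative, not something Soda defines.
SODA_CLOUD_TEMPLATE = Template(
    "soda_cloud:\n"
    "  host: cloud.us.soda.io\n"
    "  api_key_id: $SODA_API_KEY_ID\n"
    "  api_key_secret: $SODA_API_KEY_SECRET\n"
)


def render_soda_cloud_section(env=None):
    """Fill the placeholders from a mapping (default: the process environment).

    Raises KeyError if one of the variables is not set.
    """
    values = os.environ if env is None else env
    return SODA_CLOUD_TEMPLATE.substitute(values)
```

The rendered text can then be written into configuration.yml at deploy time, so the secret lives only in the environment of the machine that runs the scan.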
				
			

Run the following command to test the connection. The name passed to -d must match the data source defined in configuration.yml:


soda test-connection -d my_bigquery_source -c configuration.yml -V
				
			

Create Check File to Validate Data

  • Create a file named checks.yml in the same directory as configuration.yml.
  • Define the checks for each table, following this example:
				
checks for actor:
  - row_count > 0
  - duplicate_count(actor_id) = 0
  # anomaly detection
  - anomaly detection for row_count

checks for address:
  - row_count > 0

checks for payment:
  - row_count > 0
  - duplicate_count(payment_id) = 0
  - duplicate_count(customer_id) >= 2
  - sum(amount) >= 10000.0
  - percentile(amount, 0.95) >= 5

checks for city:
  - row_count > 0
  - min(country_id) >= 1
  - max(country_id) <= 200
				
			
  • Run the following command to execute the checks against the data source defined in configuration.yml:

soda scan -d my_bigquery_source -c configuration.yml checks.yml
				
			
  • Open your Soda Cloud account and review the results on the dashboard.
  • Done
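
If the scan needs to run from a scheduled job rather than by hand, the same command can be wrapped in a small Python helper. The sketch below only shells out to the soda CLI with the flags used above; the helper names are hypothetical, and the exit code meanings are taken from Soda's documentation as an assumption that should be confirmed for the version you have installed.

```python
import subprocess

# Exit code meanings as described in Soda's documentation (assumption:
# confirm them for the Soda version you have installed).
EXIT_CODE_MEANINGS = {
    0: "all checks passed",
    1: "at least one check raised a warning",
    2: "at least one check failed",
    3: "the scan hit a runtime error",
}


def build_scan_command(data_source, config_path, checks_path):
    """Assemble the same `soda scan` invocation used on the command line."""
    return ["soda", "scan", "-d", data_source, "-c", config_path, checks_path]


def describe_exit_code(code):
    """Translate a scan exit code into a short description."""
    return EXIT_CODE_MEANINGS.get(code, f"unrecognized exit code {code}")


def run_scan(data_source, config_path="configuration.yml", checks_path="checks.yml"):
    """Run the scan and return (exit_code, meaning, captured_output)."""
    command = build_scan_command(data_source, config_path, checks_path)
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, describe_exit_code(completed.returncode), completed.stdout
```

A scheduler task could call run_scan with the data source name from configuration.yml and stop the pipeline whenever the returned exit code is 2 or 3.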

Dashboard

Homepage

SODA Dashboard Homepage

Inspection

SODA Dashboard Data Quality Inspection

Conclusion

To summarize, data quality is crucial for accurate insights, dependable machine learning models, and sound decision-making.

Soda automates data validation, ensuring consistency and scalability across pipelines while preventing bad data from affecting downstream processes. Soda Cloud’s real-time monitoring and notifications enable teams to solve data concerns proactively.

Whether you’re a small team or an organization, Soda improves data dependability, saving up time for critical objectives. Begin automating your data quality checks with Soda today to ensure reliable, actionable results.
