Build a Comprehensive Data Engineering Pipeline for the adTech Company using Apache Airflow

Picture of Vivasoft Team
Vivasoft Team
Published on
24.09.2024
Time to Read
4 min
Data Engineering Pipeline for the adTech Company
Table of Contents

The project comprised building and setting up an extensive data infrastructure for the adTech business.

The main objective was to set up the data warehouse, expedite the analysis of different campaign data, and integrate BigQuery with the company’s Application Tracking System (ATS). We’ll talk about designing data pipelining in this section.

Technology Used

Python, Apache Airflow, BigQuery, GCP Composer, dbt

Technology Stack

Apache Airflow

DAGs for Social Media Data Sources

To manage the data extraction, loading, and transformation (ELT) operations for every social media data source (Tiktok, Reddit, Snapchat, LinkedIn, Facebook), specific Directed Acyclic Graphs (DAGs) were implemented within Airflow.

This method made process management simple and allowed for modularization.

Parallel Data Processing

Airflow’s ELT task parallelism characteristics are used to analyze data from several sources concurrently, maximize throughput, and shorten processing times because each data source is independent.

Fault Tolerance with Backup DAGs

Implemented backup DAGs in Airflow to ensure fault tolerance and continuity of data pipelines. These backup DAGs could automatically resume operations in case of interruptions or failures, minimizing downtime and data loss.

File Cleanup and Maintenance

In order to optimize storage resources and preserve data hygiene, Integrated Airflow tasks were used to carry out routine file cleanup and maintenance. This involved making sure that any extraneous or leftover files (CSV) from data processing jobs during ELT were swiftly eliminated.

Google BigQuery

Columnar Database

Using BigQuery’s columnar storage format to optimize query performance, LinkedIn has more than 190 data sources, while some social media data sources, like Facebook, have more than 500 data fields.

This makes it possible to quickly retrieve and analyze specific columns within large datasets, which is critical in the adtech industry where quick insights are critical.

Database Partitioning and Clustering Mechanism

  • By date, the campaign database is divided using it.
  • Scheduling and storing options for SQL queries.

GCP Composer

Dynamic Configuration Management

Utilized the adaptability of Cloud Composer to quickly set up and modify Apache Airflow configurations, Python package dependencies (PyPI libraries), and environment variables as needed. This made it possible to adjust to changing project requirements without experiencing any downtime.

Auto Scaling

Utilized the adaptability of Cloud Composer to quickly set up and modify Apache Airflow configurations, Python package dependencies (PyPI libraries), and environment variables as needed. This made it possible to adjust to changing project requirements without experiencing any downtime.

Worker, Scheduler, and Webserver Configuration

Airflow components, such as workers for task execution, schedulers for job scheduling, and the webserver for user interface interaction, were configured and managed using Cloud Composer. Tasks related to deployment and maintenance were expedited by this consolidated management.

Trigger Configuration

created trigger settings in Cloud Composer to automate the execution of workflows depending on preset events or circumstances. This reduced the need for manual intervention by increasing workflow automation and efficiency.

dbt

N.B: It will be covered in this blog’s third(final) portion.

Project Implementation

Workflow

DAGs for Social Media Data Sources

For each social media data source, separate Directed Acyclic Graphs (DAGs) were implemented in Airflow to manage ELT activities. With its abundance of social media data sources, Windsor.ai was used. in order to avoid having to build separate APIs for each type of data source. This method made process management simple and allowed for modularization.

Parallel Data Processing

Since each social media source is autonomous, the task parallelism characteristics of Airflow were used to analyze data concurrently, optimizing overall throughput and cutting processing times.

Fault Tolerance with Backup DAGs

Airflow backup DAGs were implemented to guarantee data pipeline continuity and fault tolerance. In the event of disruptions or failures, these backup DAGs might immediately resume operations, reducing downtime and data loss.

Since each DAG operates on a daily basis, the backup DAG will backfill the data from that specific date in the event that there is a data pipeline failure.

File Cleanup and Maintenance

Integrated Airflow tasks were used to carry out regular file maintenance and cleanup, making sure that any remaining or superfluous files from data processing operations were immediately deleted in order to maximize storage capacity and preserve data hygiene.

While obtaining data from the API, these airflow DAGs generate a few extra CSV files. Thus, it must be removed once the DAGs have completed running.

Project Directory Structure

  • dags/
    • windsor_data_reddit_composer.py
    • windsor_data_tiktok_composer.py
    • windsor_data_snapchat_composer.py
    • windsor_data_linkedin_composer.py
    • windsor_data_facebook_composer.py
    • schema_data/
      • campaign_data/
        • reddit/
          • reddit_data.csv  (temp)
        • tiktok/
          • tiktok_data.csv  (temp)
        • snapchat/
          • snapchat_data.csv  (temp)
        • linkedin/
          • linkedin_data.csv  (temp)
        • fb/
          • facebook_data.csv  (temp)
      • table_schema/
        • tiktok_schema,json
        • snapchat_schema.json
        • reddit_schema.json
        • facebook_schema.json
        • linkedin_schema.json

Variables and connections

Variables

We can save sensitive data and use it later without having to create new data thanks to airflow variables.

  • api_key: Windsor API key
  • api_url: base URL
  • gcp_project_name: GCP project id
  • bigQuery_dataset_name: BigQuery dataset
  • gcp_service_account: gcp service account JSON file

Connections

Using an airflow connection instead of starting from scratch makes it easier to establish many complex data source connections.

  • is_windsor_api_avaiable: It is used to verify whether the Windsor service is up and running prior to beginning the data intake phase.
  • weslect_bigquery_gcp_conn: JSON file content for GCP service accounts.

BigQuery Schema Design

A schema must be created in advance in order to store data. Let’s take a look at the Snapchat schema, for instance. It’s available in bucket → dags → schema_data → table_schema:

				
					[
   {"name": "row_id", "type": "STRING", "mode": "REQUIRED", "description": "row unique id", "default_value_expression": "GENERATE_UUID()"},   
   {"name": "account_id", "type": "STRING", "mode": "NULLABLE", "description": ""},
   {"name": "account_name", "type": "STRING", "mode": "NULLABLE", "description": ""},
   {"name": "campaign", "type": "STRING", "mode": "NULLABLE", "description": ""},
   {"name": "clicks", "type": "FLOAT64", "mode": "NULLABLE", "description": ""},
   {"name": "spend", "type": "FLOAT64", "mode": "NULLABLE", "description": ""},
   {"name": "impressions", "type": "FLOAT64", "mode": "NULLABLE", "description": ""},
   {"name": "date", "type": "DATE", "mode": "NULLABLE", "description": ""},
   {"name": "source", "type": "STRING", "mode": "NULLABLE", "description": ""},
   {"name": "ad_id", "type": "STRING", "mode": "NULLABLE", "description": ""},
   {"name": "ad_name", "type": "STRING", "mode": "NULLABLE", "description": ""},
   {"name": "campaign_id", "type": "STRING", "mode": "NULLABLE", "description": ""},
   {"name": "conversion_ad_click", "type": "FLOAT64", "mode": "NULLABLE", "description": ""},
   {"name": "quartile_1", "type": "FLOAT64", "mode": "NULLABLE", "description": ""}
]
				
			
  • row_id : Automatically created field. The row’s unique key is it, and the remaining fields are derived from the API response in accordance with the needs of the business.

BigQuery Database Design

Raw Data

      • Dataset name: campaign
        • Table name
          • tiktok_campaign
          • snapchat_campaign
          • reddit_campaign
          • linkedin_campaign
          • facebook_campaign

Processed Data

    • Dataset name: campaign_analysis
      • Table name
        • src_fb_campaign
        • src_linkedin_campaign
        • src_tiktok_campaign
        • src_reddit_campaign
        • Src_snapchat_campaign

Dimension Data

    • Dataset name: campaign_analysis_dim
      • Table name
        • dim_all_campaign
        • dim_linkedin_campaign
        • dim_tiktok_campaign
        • dim_reddit_campaign
        • dim_snapchat_campaign

Fact Data

    • Dataset name: campaign_analysis_fact
      • Table name
        • fact_fb_campaign
        • fact_linkedin_campaign
        • dim_tiktok_campaign
        • dim_reddit_campaign
        • dim_snapchat_campaign

Explore the next part of our comprehensive series on Apache Airflow setup. Learn to automate and streamline social media data processing with Python, BigQuery, and GCP Composer.

If you face challenges in building a data engineering pipeline for your adTech company, Vivasoft‘s experts can help.

Our team of data engineering experts can create a targeted solution and optimize your data processing and get valuable insights. Contact us today to discuss your needs, and let us deliver a personalized solution that elevates your adTech operations.

50+ companies rely on our top 1% talent to scale their dev teams.
Excellence Our minimum bar.
It has become a prerequisite for companies to develop custom software.
We've stopped counting. Over 50 brands count on us.
Our company specializes in software outsourcing and provides robust, scalable, and efficient solutions to clients around the world.
klikit

Chris Withers

CEO & Founder, Klikit

Klikit-logo

Heartfelt appreciation to Vivasoft Limited for believing in my vision. Their talented developers can take any challenges against all odds and helped to bring Klikit into life.appreciation to Vivasoft Limited for believing in my vision. Their talented developers can take any challenges.

Start with a dedicated squad in 7 days

NDA first, transparent rates, agile delivery from day one.

Blogs You May Love

Don’t let understaffing hold you back. Maximize your team’s performance and reach your business goals with the best IT Staff Augmentation