Build a Comprehensive Data Engineering Pipeline for the adTech Company using Apache Airflow

Last Update: July 4, 2024

The project comprised building and setting up an extensive data infrastructure for the adTech business.

The main objective was to set up the data warehouse, speed up the analysis of campaign data from different platforms, and integrate BigQuery with the company’s Application Tracking System (ATS). This section covers the design of the data pipeline.

Technology Used

Python, Apache Airflow, BigQuery, GCP Composer, dbt

Technology Stack

Apache Airflow

DAGs for Social Media Data Sources

To manage the data extraction, loading, and transformation (ELT) operations for every social media data source (TikTok, Reddit, Snapchat, LinkedIn, Facebook), a dedicated Directed Acyclic Graph (DAG) was implemented for each source within Airflow.

This method made process management simple and allowed for modularization.

Parallel Data Processing

Because each data source is independent, Airflow’s task parallelism is used to process data from several sources concurrently, maximizing throughput and shortening processing times.

Fault Tolerance with Backup DAGs

Implemented backup DAGs in Airflow to ensure fault tolerance and continuity of data pipelines. These backup DAGs could automatically resume operations in case of interruptions or failures, minimizing downtime and data loss.

File Cleanup and Maintenance

To optimize storage resources and preserve data hygiene, integrated Airflow tasks carry out routine file cleanup and maintenance, ensuring that any extraneous or leftover files (temporary CSVs) from ELT processing jobs are swiftly removed.

Google BigQuery

Columnar Database

BigQuery’s columnar storage format is used to optimize query performance. LinkedIn alone exposes more than 190 data fields, and some social media data sources, such as Facebook, expose more than 500.

This makes it possible to quickly retrieve and analyze specific columns within large datasets, which is essential in the adtech industry, where fast insights are critical.

Database Partitioning and Clustering Mechanism

  • The campaign tables are partitioned by date.
  • BigQuery scheduled queries are used to schedule recurring SQL queries and store their results.
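As an illustration of partitioning and clustering together (the table and column names below are assumptions, not the project’s real schema), the DDL for a date-partitioned, clustered campaign table can be rendered like this:

```python
def campaign_table_ddl(table, partition_col="date", cluster_cols=("campaign_id",)):
    """Render BigQuery DDL for a campaign table partitioned by date and
    clustered on the given columns (column names are illustrative)."""
    return (
        f"CREATE TABLE IF NOT EXISTS campaign.{table} (\n"
        f"  row_id STRING,\n"
        f"  campaign_id STRING,\n"
        f"  {partition_col} DATE,\n"
        f"  spend FLOAT64\n"
        f")\n"
        f"PARTITION BY {partition_col}\n"
        f"CLUSTER BY {', '.join(cluster_cols)}"
    )


print(campaign_table_ddl("tiktok_campaign"))
```

Partitioning by date lets daily queries scan only the relevant partition, and clustering further sorts data within each partition for cheaper filtered reads.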

GCP Composer

Dynamic Configuration Management

Utilized the adaptability of Cloud Composer to quickly set up and modify Apache Airflow configurations, Python package dependencies (PyPI libraries), and environment variables as needed. This made it possible to adjust to changing project requirements without experiencing any downtime.

Auto Scaling

Cloud Composer’s autoscaling was used so that the number of Airflow workers grows and shrinks with the workload. During heavy ELT runs, additional workers are provisioned automatically; when the load drops, they are scaled back down, keeping costs under control.

Worker, Scheduler, and Webserver Configuration

Airflow components, such as workers for task execution, schedulers for job scheduling, and the webserver for user interface interaction, were configured and managed using Cloud Composer. Tasks related to deployment and maintenance were expedited by this consolidated management.

Trigger Configuration

Created trigger configurations in Cloud Composer to automate the execution of workflows based on preset events or conditions. This increased workflow automation and efficiency while reducing the need for manual intervention.

dbt

N.B: It will be covered in this blog’s third(final) portion.

Project Implementation


Workflow

DAGs for Social Media Data Sources

For each social media data source, a separate Directed Acyclic Graph (DAG) was implemented in Airflow to manage ELT activities. Windsor.ai, which aggregates a wide range of social media data sources behind a single API, was used so that separate API integrations did not have to be built for each platform. This approach kept the pipelines modular and simple to manage.
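Sketched below is how a task might assemble a Windsor.ai request from an API base URL and key. The endpoint shape, parameter names, and field list are assumptions on our part, so check the Windsor.ai documentation for the real contract:

```python
# Sketch of a DAG task's request-building step; endpoint and parameter
# names are assumptions, not the documented Windsor.ai contract.
from urllib.parse import urlencode
from urllib.request import urlopen


def build_windsor_url(api_url, source, api_key, fields, day):
    """Assemble the per-source, per-day request URL."""
    query = urlencode({
        "api_key": api_key,
        "fields": ",".join(fields),
        "date_from": day,
        "date_to": day,
    })
    return f"{api_url}/{source}?{query}"


def fetch_source_payload(url):
    """Download the raw payload; in the DAG this is written to a temp CSV."""
    with urlopen(url) as resp:  # network call, not executed here
        return resp.read()
```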

Parallel Data Processing

Since each social media source is autonomous, the task parallelism characteristics of Airflow were used to analyze data concurrently, optimizing overall throughput and cutting processing times.

Fault Tolerance with Backup DAGs

Airflow backup DAGs were implemented to guarantee data pipeline continuity and fault tolerance. In the event of disruptions or failures, these backup DAGs can automatically resume operations, reducing downtime and data loss.

Since each DAG runs on a daily basis, the backup DAG backfills the data for the affected date whenever a pipeline run fails.
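The backfill decision reduces to a small date-arithmetic helper (a sketch; the function and argument names are ours, not the project’s): given the dates already loaded into BigQuery, find the daily partitions that a failure left missing.

```python
# Sketch of the backup DAG's core logic: compare the expected daily
# partitions against what actually landed, and return the gaps.
from datetime import date, timedelta


def missing_dates(loaded, start, end):
    """Return every date in [start, end] with no loaded data."""
    days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(days))
    return [d for d in expected if d not in loaded]
```

The backup DAG would then re-run the extract/load steps for each returned date only, rather than re-ingesting the whole table.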

File Cleanup and Maintenance

Integrated Airflow tasks were used to carry out regular file maintenance and cleanup, making sure that any remaining or superfluous files from data processing operations were immediately deleted in order to maximize storage capacity and preserve data hygiene.

While fetching data from the API, these Airflow DAGs generate a few extra CSV files, which must be removed once the DAGs have finished running.
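A minimal sketch of such a cleanup step (the function name and directory layout are illustrative), which a final DAG task could call to delete the leftover temporary CSVs:

```python
# Sketch of a cleanup callable for a final DAG task: delete temporary
# CSVs left under the campaign_data directory after a run completes.
from pathlib import Path


def cleanup_temp_csvs(base_dir):
    """Remove leftover *.csv files below base_dir; return how many."""
    removed = 0
    for csv_file in Path(base_dir).rglob("*.csv"):
        csv_file.unlink()
        removed += 1
    return removed
```

Non-CSV artifacts, such as the schema JSON files, are untouched because the glob matches only `*.csv`.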

Project Directory Structure

  • dags/
    • windsor_data_reddit_composer.py
    • windsor_data_tiktok_composer.py
    • windsor_data_snapchat_composer.py
    • windsor_data_linkedin_composer.py
    • windsor_data_facebook_composer.py
    • schema_data/
      • campaign_data/
        • reddit/
          • reddit_data.csv  (temp)
        • tiktok/
          • tiktok_data.csv  (temp)
        • snapchat/
          • snapchat_data.csv  (temp)
        • linkedin/
          • linkedin_data.csv  (temp)
        • fb/
          • facebook_data.csv  (temp)
      • table_schema/
        • tiktok_schema.json
        • snapchat_schema.json
        • reddit_schema.json
        • facebook_schema.json
        • linkedin_schema.json

Variables and connections

Variables

Airflow Variables let us store configuration and sensitive values once and reuse them across DAGs without hard-coding them.

  • api_key: Windsor API key
  • api_url: base URL
  • gcp_project_name: GCP project id
  • bigQuery_dataset_name: BigQuery dataset
  • gcp_service_account: gcp service account JSON file
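Inside a task, these values are read with `Variable.get("api_key")`. Airflow can also resolve Variables from environment variables named `AIRFLOW_VAR_<KEY>`; the stdlib sketch below mimics that lookup for illustration:

```python
# Illustrative stand-in for Airflow's environment-variable Variable
# lookup (Airflow checks AIRFLOW_VAR_<KEY> before its metadata database).
import os


def get_airflow_variable(key, default=None):
    """Mimic Airflow's AIRFLOW_VAR_* environment-variable resolution."""
    return os.environ.get(f"AIRFLOW_VAR_{key.upper()}", default)
```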

Connections

Airflow connections make it easy to establish and reuse complex data source connections instead of configuring them from scratch in every DAG.

  • is_windsor_api_avaiable: Used to verify that the Windsor service is up and running before the data ingestion phase begins.
  • weslect_bigquery_gcp_conn: Holds the JSON key content of the GCP service account.
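The availability check behind the first connection boils down to an HTTP probe. Here is a plain-Python sketch of that probe (in the project an Airflow HTTP sensor using the connection would perform it; the injectable `opener` below exists purely for testability):

```python
# Sketch of the health probe an availability sensor performs before
# data ingestion starts. `opener` is injectable so it can be faked.
from urllib.request import urlopen


def service_is_up(url, opener=urlopen):
    """Return True when the endpoint answers with an HTTP 2xx status."""
    try:
        with opener(url, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False
```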

BigQuery Schema Design

A schema must be created in advance in order to store data. Let’s take a look at the Snapchat schema, for instance. It’s available in bucket → dags → schema_data → table_schema:

  • row_id: An automatically generated field that serves as the row’s unique key. The remaining fields are derived from the API response according to business requirements.
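A minimal loader for these schema files (the JSON layout assumed here is BigQuery’s standard schema format of `name`/`type`/`mode` objects; the validation step is our own addition, not the project’s code):

```python
# Sketch: read a per-source schema JSON file and sanity-check that each
# field carries the name/type keys BigQuery requires.
import json
from pathlib import Path


def load_table_schema(path):
    """Load a BigQuery schema JSON file and validate its fields."""
    fields = json.loads(Path(path).read_text())
    for field in fields:
        missing = {"name", "type"} - field.keys()
        if missing:
            raise ValueError(f"schema field {field} is missing {missing}")
    return fields
```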

BigQuery Database Design


Raw Data

      • Dataset name: campaign
        • Table name
          • tiktok_campaign
          • snapchat_campaign
          • reddit_campaign
          • linkedin_campaign
          • facebook_campaign

Processed Data

    • Dataset name: campaign_analysis
      • Table name
        • src_fb_campaign
        • src_linkedin_campaign
        • src_tiktok_campaign
        • src_reddit_campaign
        • src_snapchat_campaign

Dimension Data

    • Dataset name: campaign_analysis_dim
      • Table name
        • dim_all_campaign
        • dim_linkedin_campaign
        • dim_tiktok_campaign
        • dim_reddit_campaign
        • dim_snapchat_campaign

Fact Data

    • Dataset name: campaign_analysis_fact
      • Table name
        • fact_fb_campaign
        • fact_linkedin_campaign
        • fact_tiktok_campaign
        • fact_reddit_campaign
        • fact_snapchat_campaign

Explore the next part of our comprehensive series on Apache Airflow setup. Learn to automate and streamline social media data processing with Python, BigQuery, and GCP Composer.

If you face challenges in building a data engineering pipeline for your adTech company, Vivasoft‘s experts can help.

Our team of data engineering experts can create a targeted solution, optimize your data processing, and surface valuable insights. Contact us today to discuss your needs, and let us deliver a personalized solution that elevates your adTech operations.
