Project Overview
Best practices and advanced setups for creating and managing a strong DBT project are covered in this project. Tests, documentation, seed, package, incremental models, macros, and model configurations are all covered in detail. You may design an effective and scalable DBT project that guarantees data integrity and quality by using this type of approach.
Prior to delving into this advanced guide, we advise anyone who are unfamiliar with DBT to begin with the beginner guide in order to have a basic grasp.
You can find the beginner guide here.
Technology Used
Toplevel Workflow
dbt Environment
├──dbt_project
| └──analyses: save SQL for reporting purpose
| └──checks.yml
| └──dbt_packages: all dbt libraries
| └──logs
| └──dbt.log: dbt logs when each dbt runs models
| └──macros
| └──date_operation.sql(example): fn to use across dbt project
| └──models: Core logic to build the data warehouse
| └──source.yml: Top level database configuration to use
| └──schema.yml: Top level documentation
| └──seeds: To export file as table
| └──snapshots: For SCD(slowly changing dimensions)
| └──target: All complied SQL files dbt executes
| └──tests: Custom test SQL to ensure data quality
| └──source.yml: Top level schem
└──generic
└──equal_row_count.sql: custom test functionality
dbt_project.yml
vars: # to use the variable across the dbt project
my_variable: 'value'
models: # core components of dbt
dbt_poject_name:
materialized: view # top level configuration
src: # sub folder name in models dir
+schema: src # db schema name → _src
+materialized: table # table type
+ tags: [‘src’, ‘table’, ‘incremental’] # optional → good for dbt doc
dim: # sub folder name in models dir
+schema: dim # db schema name → dbt_dim
+materialized: view # view type
+ tags: [‘dim’, ‘view’] # optional → good for dbt doc
fact: # sub folder name in models dir
+schema: fact # db schema name → dbt_fact
+materialized: table # table type
+ tags: [‘fact’, ‘table’, ‘incremental’] # optional → good for dbt doc
packages.yml
Dependencies on other DBT packages are managed via the file in a DBT project. With the use of this file, you can quickly install and manage external DBT packages that offer more features, models, macros, and more.
[doc]
Configuration
packages:
- package: dbt-labs/dbt_utils
version: 1.3.0
Install
//To install, run:
dbt deps
macros
Key Components
- Purpose of Macros: It makes your code more modular and maintainable by enabling you to write reusable SQL segments that can be called from several models.
Creating Macros
- Macros are defined in .sql files within the macros directory.
- A macro is defined using the Jinja macro keyword.
Example
-- trim and lower a col val
{% macro lower_and_trim(col) %} – name of macro
LOWER(TRIM({{col}})) – sql operation
{% endmacro %}
seeds
Configuration
- Upload a CSV file (e.g., country_codes.csv) under seeds directory
- Create a properties.yml file
version: 2
seeds:
- name: country_codes
config:
quote_columns: true
delimiter: ','
column_types:
FIFA: text
Global Code: text
dbt_project.yml Configuration
seeds:
+schema: external_data
+ful_refresh: true
Load csv data as table
dbt seed
tests
Key Components
1. Purpose of Tests
- Tests verify your data’s assumptions, ensuring data integrity and quality.
- They assist in the early detection of mistakes and discrepancies in the data stream.
2. Types of Tests
- Schema Tests: Verify your data’s constraints, including uniqueness, non-null values, and column connections.
- Data Tests: custom SQL queries that verify the accuracy of the data and business logic.
test file configuration
- Create a custom test file named equal_row_count.sql under tests/generic folder
- Sample code:
{% test equal_row_count(model, column_name, src) %}
SELECT
COUNT(1)
FROM
{{ model }}
HAVING
COUNT(1) != (SELECT COUNT(1) FROM {{ src }})
{% endtest %}
3. Create schema.yml file under tests folder
version: 2
data_tests:
- name: equal_row_count
description: >
check actor model row count != dvdrental.public.actor
version: 2
models:
- name: actor
description: Incremental table with unique actor id
- name: actor_id
description: actor unique identification.
data_tests:
- unique # dbt buitin test
- not_null # dbt buitin test
- equal_row_count: # dbt custom test
src: source('stg_data', 'actor')
Test dbt
dbt test