Marketing Analytics Pipeline (GCP + dbt)
Data Engineering
Marketing Analytics
GCP
Modern data engineering architecture for marketing KPI tracking
Context & Problem
Business question: how do we build modern infrastructure to track marketing KPIs in near real time?
Marketing teams need quick access to their performance metrics to make informed decisions. This project demonstrates a complete ETL pipeline built with modern data engineering best practices.
Architecture
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Sources    │────▶│   Extract    │────▶│   BigQuery   │────▶│     dbt      │
│    (CSV)     │     │   (Python)   │     │    (Raw)     │     │ (Transform)  │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                                                                      │
                                                                      ▼
                               ┌──────────────┐     ┌──────────────┐
                               │  Streamlit   │◀────│   BigQuery   │
                               │ (Dashboard)  │     │ (Analytics)  │
                               └──────────────┘     └──────────────┘
```
Components
- Extract: Python script generating realistic simulated marketing data
- Load: loading into Google BigQuery (dataset `marketing_raw`)
- Transform: dbt models to calculate KPIs (dataset `marketing_analytics`)
- Visualize: interactive Streamlit dashboard
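The Load step has no code sample in this write-up; a minimal sketch of what `src/load_bigquery.py` could look like, assuming the official `google-cloud-bigquery` client (the project name is a placeholder you would supply):

```python
# src/load_bigquery.py -- illustrative sketch, not the project's exact script
import pandas as pd

DATASET_ID = "marketing_raw"    # raw dataset named in this document
TABLE_ID = "daily_performance"  # table referenced by the dbt source

def load_to_bigquery(df: pd.DataFrame, project: str) -> None:
    """Load a DataFrame into BigQuery, replacing the table on each run."""
    # Imported inside the function so the module can be inspected
    # without GCP credentials installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,  # infer the schema from the DataFrame
    )
    job = client.load_table_from_dataframe(
        df, f"{project}.{DATASET_ID}.{TABLE_ID}", job_config=job_config
    )
    job.result()  # block until the load job finishes
```

`WRITE_TRUNCATE` keeps the raw table idempotent: re-running the pipeline replaces yesterday's load instead of appending duplicates.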
Implementation
1. Data Extraction
```python
# src/extract_data.py
import pandas as pd
import numpy as np

def generate_marketing_data(start_date, end_date, seed=42):
    """Generate realistic simulated marketing data."""
    np.random.seed(seed)
    channels = {
        'Google Ads': {'sessions': 1500, 'conv_rate': 0.05, 'spend': 3000},
        'Facebook Ads': {'sessions': 1200, 'conv_rate': 0.04, 'spend': 2500},
        'LinkedIn': {'sessions': 400, 'conv_rate': 0.06, 'spend': 1500},
        'Email': {'sessions': 800, 'conv_rate': 0.08, 'spend': 500},
        'Direct': {'sessions': 2000, 'conv_rate': 0.03, 'spend': 0}
    }
    # Generation with realistic variability:
    # - Poisson distribution for sessions (no negative values)
    # - Weekend effect (-30% traffic on Saturday/Sunday)
    rows = []
    for date in pd.date_range(start_date, end_date):
        weekend_factor = 0.7 if date.dayofweek >= 5 else 1.0
        for channel, p in channels.items():
            sessions = np.random.poisson(p['sessions'] * weekend_factor)
            conversions = np.random.binomial(sessions, p['conv_rate'])
            rows.append({
                'date': date.date(),
                'source': channel,
                'sessions': int(sessions),
                'conversions': int(conversions),
                # Illustrative revenue model: ~100 currency units per conversion
                'revenue': round(conversions * np.random.uniform(80, 120), 2),
                'spend': round(p['spend'] * weekend_factor, 2),
            })
    return pd.DataFrame(rows)
```

2. dbt Transformation
```sql
-- dbt_marketing/models/staging/stg_marketing_daily.sql
WITH source AS (
    SELECT * FROM {{ source('marketing_raw', 'daily_performance') }}
),

transformed AS (
    SELECT
        date,
        DATE_TRUNC(date, WEEK) AS week_start_date,
        DATE_TRUNC(date, MONTH) AS month_start_date,
        source AS marketing_source,
        sessions,
        conversions,
        revenue,
        spend,
        -- Calculated KPIs
        SAFE_DIVIDE(revenue, spend) AS roas,
        SAFE_DIVIDE(spend, conversions) AS cost_per_conversion,
        SAFE_DIVIDE(conversions, sessions) AS conversion_rate,
        SAFE_DIVIDE(revenue, conversions) AS revenue_per_conversion,
        SAFE_DIVIDE(revenue, sessions) AS revenue_per_session,
        -- Flags
        spend = 0 AS is_organic_channel,
        EXTRACT(DAYOFWEEK FROM date) IN (1, 7) AS is_weekend
    FROM source
)

SELECT * FROM transformed
```

3. Streamlit Dashboard
The dashboard presents:
- Global KPIs: Total Spend, Revenue, Conversions, ROAS
- Time trend: Spend vs Revenue by day
- Performance by channel: Marketing source comparison
- ROAS by source: Return on ad spend
- Conversion funnel: Sessions → Conversions
Results
Calculated KPIs
| Metric | Formula | Usage |
|---|---|---|
| ROAS | Revenue / Spend | Overall effectiveness |
| CPA | Spend / Conversions | Acquisition cost |
| Conversion Rate | Conversions / Sessions | Traffic quality |
| Revenue/Session | Revenue / Sessions | Traffic value |
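These formulas rely on BigQuery's `SAFE_DIVIDE`, which returns NULL instead of raising an error on division by zero (important for organic channels with zero spend). A small Python equivalent, useful for sanity-checking the dbt output:

```python
def safe_divide(numerator, denominator):
    """Mimic BigQuery's SAFE_DIVIDE: None when the denominator is 0."""
    return numerator / denominator if denominator else None

def channel_kpis(sessions, conversions, revenue, spend):
    """Compute the four KPIs from the table for one channel/day row."""
    return {
        "roas": safe_divide(revenue, spend),
        "cpa": safe_divide(spend, conversions),
        "conversion_rate": safe_divide(conversions, sessions),
        "revenue_per_session": safe_divide(revenue, sessions),
    }

# Organic channel: zero spend yields None for ROAS rather than a crash
print(channel_kpis(sessions=2000, conversions=60, revenue=6000, spend=0))
```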
Generated Data
- Period: 84 days (Sept-Nov 2024)
- Channels: 5 marketing sources
- Volume: ~420 rows per execution (84 days × 5 channels)
Demo
The dashboard allows you to:
- Filter by date range
- Select channels to display
- Visualize trends
- Compare performance
Technologies
| Component | Technology | Version |
|---|---|---|
| Language | Python | 3.13+ |
| Data Warehouse | Google BigQuery | Cloud |
| Transformation | dbt-core + dbt-bigquery | 1.10.3 |
| Dashboard | Streamlit | 1.51.0 |
| Visualization | Plotly + Altair | - |
| Dependency Management | Poetry | - |
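An illustrative Poetry fragment matching the versions listed above (the exact pins and extras in the real `pyproject.toml` may differ; the `dbt-bigquery` pin is an assumption based on the shared version in the table):

```toml
# pyproject.toml -- illustrative fragment, not the project's exact file
[tool.poetry]
name = "marketing-data-pipeline"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.13"
dbt-core = "1.10.3"
dbt-bigquery = "1.10.3"  # assumption: pinned to match dbt-core
streamlit = "1.51.0"
pandas = "*"
plotly = "*"
altair = "*"
google-cloud-bigquery = "*"
```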
Project Structure
```
marketing-data-pipeline/
├── src/
│   ├── extract_data.py          # Data generation
│   ├── load_bigquery.py         # BigQuery loading
│   └── dashboard.py             # Streamlit dashboard
├── dbt_marketing/
│   ├── dbt_project.yml          # dbt config
│   └── models/
│       └── staging/
│           ├── sources.yml
│           └── stg_marketing_daily.sql
├── data/
│   └── raw/
│       └── marketing_data.csv
├── pyproject.toml               # Poetry dependencies
└── README.md
```
Execution
```bash
# 1. Extraction
python src/extract_data.py

# 2. BigQuery loading
python src/load_bigquery.py

# 3. dbt transformation
cd dbt_marketing && dbt run

# 4. Dashboard
streamlit run src/dashboard.py
```

Learnings
This project illustrates Modern Data Stack best practices:
- Separation of concerns: Distinct Extract, Load, Transform
- Infrastructure as Code: dbt for versioned transformations
- Reproducibility: Fixed seed for simulated data
- Documentation: Documented dbt sources and models
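As an illustration of the documented-sources point, a `sources.yml` consistent with the `source('marketing_raw', 'daily_performance')` call in the staging model might look like this (descriptions are sketched, not copied from the project):

```yaml
# dbt_marketing/models/staging/sources.yml -- illustrative sketch
version: 2

sources:
  - name: marketing_raw
    description: "Raw marketing data loaded from CSV via load_bigquery.py"
    tables:
      - name: daily_performance
        description: "One row per marketing channel per day"
        columns:
          - name: date
            description: "Reporting date"
          - name: source
            description: "Marketing channel (e.g. Google Ads, Email)"
```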