Marketing Analytics Pipeline (GCP + dbt)

Data Engineering
Marketing Analytics
GCP
Modern data engineering architecture for marketing KPI tracking

Context & Problem

Business question: how do you build modern data infrastructure that keeps marketing KPIs fresh enough for day-to-day decisions?

Marketing teams need quick access to their performance metrics to make informed decisions. This project demonstrates setting up a complete ELT pipeline (extract, load raw data into the warehouse, then transform with dbt) using modern data engineering best practices.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Sources   │────▢│   Extract   │────▢│  BigQuery   │────▢│     dbt     β”‚
β”‚   (CSV)     β”‚     β”‚  (Python)   β”‚     β”‚   (Raw)     β”‚     β”‚ (Transform) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                   β”‚
                                                                   β–Ό
                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β”‚  Streamlit  │◀────│  BigQuery   β”‚
                                        β”‚ (Dashboard) β”‚     β”‚ (Analytics) β”‚
                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components

  1. Extract: Python script generating realistic simulated marketing data
  2. Load: Loading to Google BigQuery (dataset marketing_raw)
  3. Transform: dbt to calculate KPIs (dataset marketing_analytics)
  4. Visualize: Interactive Streamlit dashboard

Implementation

1. Data Extraction

# src/extract_data.py
import pandas as pd
import numpy as np

def generate_marketing_data(start_date, end_date, seed=42):
    """Generate realistic simulated marketing data."""
    rng = np.random.default_rng(seed)

    channels = {
        'Google Ads': {'sessions': 1500, 'conv_rate': 0.05, 'spend': 3000},
        'Facebook Ads': {'sessions': 1200, 'conv_rate': 0.04, 'spend': 2500},
        'LinkedIn': {'sessions': 400, 'conv_rate': 0.06, 'spend': 1500},
        'Email': {'sessions': 800, 'conv_rate': 0.08, 'spend': 500},
        'Direct': {'sessions': 2000, 'conv_rate': 0.03, 'spend': 0}
    }

    rows = []
    for day in pd.date_range(start_date, end_date, freq='D'):
        # Weekend effect: traffic and spend drop by 30% on Saturday/Sunday
        factor = 0.7 if day.dayofweek >= 5 else 1.0
        for source, p in channels.items():
            # Poisson distribution for realistic day-to-day session variability
            sessions = rng.poisson(p['sessions'] * factor)
            conversions = rng.binomial(sessions, p['conv_rate'])
            # Clip at zero so noisy spend/revenue never go negative
            spend = max(rng.normal(p['spend'] * factor, p['spend'] * 0.1), 0.0)
            revenue = conversions * max(rng.normal(100, 20), 0.0)  # assumed avg order value
            rows.append({'date': day.date(), 'source': source,
                         'sessions': sessions, 'conversions': conversions,
                         'spend': round(spend, 2), 'revenue': round(revenue, 2)})

    return pd.DataFrame(rows)

2. dbt Transformation

-- dbt_marketing/models/staging/stg_marketing_daily.sql
WITH source AS (
    SELECT * FROM {{ source('marketing_raw', 'daily_performance') }}
),

transformed AS (
    SELECT
        date,
        DATE_TRUNC(date, WEEK) AS week_start_date,
        DATE_TRUNC(date, MONTH) AS month_start_date,
        source AS marketing_source,
        sessions,
        conversions,
        revenue,
        spend,

        -- Calculated KPIs
        SAFE_DIVIDE(revenue, spend) AS roas,
        SAFE_DIVIDE(spend, conversions) AS cost_per_conversion,
        SAFE_DIVIDE(conversions, sessions) AS conversion_rate,
        SAFE_DIVIDE(revenue, conversions) AS revenue_per_conversion,
        SAFE_DIVIDE(revenue, sessions) AS revenue_per_session,

        -- Flags
        spend = 0 AS is_organic_channel,
        EXTRACT(DAYOFWEEK FROM date) IN (1, 7) AS is_weekend

    FROM source
)

SELECT * FROM transformed
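The `{{ source('marketing_raw', 'daily_performance') }}` reference resolves through the project's `models/staging/sources.yml`. A minimal sketch of that file (descriptions are illustrative, not the project's actual wording):

```yaml
# dbt_marketing/models/staging/sources.yml
version: 2

sources:
  - name: marketing_raw
    description: "Raw marketing performance data loaded from CSV"
    schema: marketing_raw
    tables:
      - name: daily_performance
        description: "One row per day per marketing source"
```

Declaring the raw table as a dbt source keeps the dataset name in one place and lets `dbt docs` trace lineage from the staging model back to the load step.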

3. Streamlit Dashboard

The dashboard presents:

  • Global KPIs: Total Spend, Revenue, Conversions, ROAS
  • Time trend: Spend vs Revenue by day
  • Performance by channel: Marketing source comparison
  • ROAS by source: Return on ad spend
  • Conversion funnel: Sessions β†’ Conversions
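The global KPI cards reduce to a simple aggregation over the analytics table. A minimal sketch with pandas, using a hypothetical in-memory frame in place of the actual BigQuery query result:

```python
import pandas as pd

# Stand-in for the result of querying marketing_analytics.stg_marketing_daily
df = pd.DataFrame({
    'source': ['Google Ads', 'Email', 'Direct'],
    'spend': [3000.0, 500.0, 0.0],
    'revenue': [9000.0, 4000.0, 6000.0],
    'conversions': [75, 64, 60],
})

total_spend = df['spend'].sum()
total_revenue = df['revenue'].sum()
total_conversions = df['conversions'].sum()
# Global ROAS: recompute from totals rather than averaging per-row ratios
roas = total_revenue / total_spend if total_spend else None
```

Recomputing ROAS from the totals matters: averaging per-channel ratios would weight a small channel as heavily as a large one.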

Results

Calculated KPIs

Metric            Formula                  Usage
ROAS              Revenue / Spend          Overall effectiveness
CPA               Spend / Conversions      Acquisition cost
Conversion Rate   Conversions / Sessions   Traffic quality
Revenue/Session   Revenue / Sessions       Traffic value
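All of these formulas share a division-by-zero edge case (e.g. organic channels with zero spend), which the dbt model handles with `SAFE_DIVIDE`. A Python equivalent, handy for sanity-checking model output, might look like:

```python
def safe_divide(numerator, denominator):
    """Mirror BigQuery's SAFE_DIVIDE: return None instead of raising on zero."""
    return numerator / denominator if denominator else None

# Example row: an organic channel (spend = 0)
row = {'sessions': 2000, 'conversions': 60, 'revenue': 6000.0, 'spend': 0.0}

roas = safe_divide(row['revenue'], row['spend'])                     # None, not an error
cpa = safe_divide(row['spend'], row['conversions'])                  # 0.0
conversion_rate = safe_divide(row['conversions'], row['sessions'])   # 0.03
revenue_per_session = safe_divide(row['revenue'], row['sessions'])   # 3.0
```

Returning `None` (NULL in BigQuery) rather than 0 keeps ROAS for zero-spend channels out of averages instead of silently dragging them down.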

Generated Data

  • Period: 84 days (Sept-Nov 2024)
  • Channels: 5 marketing sources
  • Volume: ~420 rows per execution (84 days × 5 channels)

Demo

View Streamlit Dashboard

The dashboard allows you to:

  • Filter by date range
  • Select channels to display
  • Visualize trends
  • Compare performance

Technologies

Component              Technology                Version
Language               Python                    3.13+
Data Warehouse         Google BigQuery           Cloud
Transformation         dbt-core + dbt-bigquery   1.10.3
Dashboard              Streamlit                 1.51.0
Visualization          Plotly + Altair           -
Dependency Management  Poetry                    -

Project Structure

marketing-data-pipeline/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ extract_data.py      # Data generation
β”‚   β”œβ”€β”€ load_bigquery.py     # BigQuery loading
β”‚   └── dashboard.py         # Streamlit dashboard
β”œβ”€β”€ dbt_marketing/
β”‚   β”œβ”€β”€ dbt_project.yml      # dbt config
β”‚   └── models/
β”‚       └── staging/
β”‚           β”œβ”€β”€ sources.yml
β”‚           └── stg_marketing_daily.sql
β”œβ”€β”€ data/
β”‚   └── raw/
β”‚       └── marketing_data.csv
β”œβ”€β”€ pyproject.toml           # Poetry dependencies
└── README.md

Execution

# 1. Extraction
python src/extract_data.py

# 2. BigQuery loading
python src/load_bigquery.py

# 3. dbt transformation
cd dbt_marketing && dbt run

# 4. Dashboard
streamlit run src/dashboard.py

Learnings

This project illustrates Modern Data Stack best practices:

  1. Separation of concerns: Distinct Extract, Load, Transform
  2. Infrastructure as Code: dbt for versioned transformations
  3. Reproducibility: Fixed seed for simulated data
  4. Documentation: Documented dbt sources and models
