Data Pipeline
An ETL (Extract, Transform, Load) pipeline for processing tourism data from multiple sources. Built to power analytics dashboards and insights for South Tyrol's tourism industry through the Open Data Hub.
Overview
South Tyrol's tourism industry generates data from hundreds of sources: hotels, ski lifts, museums, events, weather stations, and more. This data is valuable for understanding visitor patterns, optimizing resources, and planning future investments.
This pipeline collects data from these diverse sources, cleans and normalizes it, applies business rules, and loads it into a data warehouse optimized for analytical queries.
Key Features
Multi-Source Ingestion
Connects to REST APIs, SFTP servers, and databases. Handles rate limiting, retries, and authentication.
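A minimal sketch of how a REST extractor might handle auth, rate limiting, and retries, using Python's requests library. The endpoint, token handling, and backoff parameters are illustrative, not the pipeline's actual configuration:

```python
import time

import requests


def fetch_json(url: str, token: str, max_attempts: int = 5, timeout: int = 30):
    """GET a JSON endpoint with bearer auth, honoring rate limits and
    retrying transient failures with exponential backoff."""
    headers = {"Authorization": f"Bearer {token}"}
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
        except requests.ConnectionError:
            time.sleep(2 ** attempt)       # transient network error: back off
            continue
        if resp.status_code == 429:
            # Respect the server's rate limit if it says how long to wait.
            time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
            continue
        if resp.status_code >= 500:
            time.sleep(2 ** attempt)       # server-side error: back off, retry
            continue
        resp.raise_for_status()            # any other 4xx is a hard failure
        return resp.json()
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```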
Data Quality Checks
Validates data against schemas, checks for anomalies, and quarantines bad records for review.
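A simplified sketch of the validate-and-quarantine flow. The required fields and the anomaly rule are hypothetical stand-ins for the pipeline's real per-source schemas:

```python
from datetime import datetime, timedelta

# Hypothetical schema: the real pipeline validates per-source schemas.
REQUIRED_FIELDS = {"source_id": str, "timestamp": str, "value": (int, float)}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if record.get(field) is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    # One example anomaly check: a timestamp in the future is almost always bad.
    if not problems:
        try:
            ts = datetime.fromisoformat(record["timestamp"]).replace(tzinfo=None)
            if ts > datetime.now() + timedelta(hours=1):  # allow small clock skew
                problems.append("timestamp is in the future")
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems


def partition(records):
    """Split a batch into clean records and quarantined ones, keeping the reasons."""
    clean, quarantined = [], []
    for rec in records:
        problems = validate(rec)
        if problems:
            quarantined.append({"record": rec, "problems": problems})
        else:
            clean.append(rec)
    return clean, quarantined
```

Keeping the rejection reasons alongside each quarantined record makes the later human review much faster than a bare "failed validation" flag.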
Scheduled Execution
Airflow-managed DAGs with configurable schedules, dependencies, and retry policies.
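Roughly what one of these DAG definitions looks like, assuming Airflow 2.4+. The DAG id, schedule, and task callables are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_sources(**context):
    """Placeholder for the real extraction logic."""
    ...


def transform_and_load(**context):
    """Placeholder for the real transform/load logic."""
    ...


default_args = {
    "retries": 3,                          # retry transient failures
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,     # back off between attempts
}

with DAG(
    dag_id="tourism_etl_daily",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="0 4 * * *",                  # daily, after upstream sources update
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_sources)
    load = PythonOperator(task_id="transform_and_load",
                          python_callable=transform_and_load)

    extract >> load                        # load depends on extraction
```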
Incremental Loading
Tracks processed records to enable incremental updates, minimizing processing time and resource usage.
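A sketch of watermark-based incremental extraction. It uses a JSON state file purely for illustration; a metadata table in the warehouse would serve the same role:

```python
import json
from pathlib import Path

# Hypothetical state location.
STATE_FILE = Path("state/watermarks.json")


def load_watermark(source: str) -> str:
    """Return the high-water mark (last processed timestamp) for a source."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get(source, "1970-01-01T00:00:00")
    return "1970-01-01T00:00:00"


def save_watermark(source: str, value: str) -> None:
    """Persist the new high-water mark; call this only after the load commits."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[source] = value
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))


def extract_incremental(source: str, fetch):
    """Fetch only records newer than the stored watermark, then advance it."""
    since = load_watermark(source)
    records = fetch(updated_after=since)   # fetch is a source-specific callable
    if records:
        save_watermark(source, max(r["updated_at"] for r in records))
    return records
```

Advancing the watermark only after a successful load means a failed run simply reprocesses the same window on retry, rather than silently skipping records.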
Data Sources
The pipeline currently ingests data from over 30 sources, including:
Accommodation: Hotel occupancy, booking platforms, guest registrations
Activities: Ski lift usage, museum visits, event attendance
Infrastructure: Traffic counters, parking occupancy, public transit
Environment: Weather stations, air quality, webcam metadata
The pipeline processes approximately 2 million records daily, with peak loads during ski season reaching 5 million records per day.
Tech Stack
What I Learned
This project taught me that data engineering is 80% handling edge cases. Every source has its quirks: inconsistent date formats, unexpected nulls, schema changes without notice, and API outages.
I also learned about designing for observability: comprehensive logging, metrics, and alerts that make it easy to diagnose issues when (not if) they occur.
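One pattern that came out of this is wrapping each pipeline step so it emits structured logs. A sketch of the idea; the field names are illustrative:

```python
import json
import logging
import time

log = logging.getLogger("pipeline")


def run_step(name: str, func, **context):
    """Run a pipeline step, emitting one structured log line per outcome
    so dashboards and alerts can key off step name, status, and duration."""
    start = time.monotonic()
    try:
        result = func()
        log.info(json.dumps({"step": name, "status": "ok",
                             "duration_s": round(time.monotonic() - start, 3),
                             **context}))
        return result
    except Exception as exc:
        log.error(json.dumps({"step": name, "status": "failed",
                              "error": str(exc),
                              "duration_s": round(time.monotonic() - start, 3),
                              **context}))
        raise
```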