Data Pipeline

Python
ETL, Data Engineering, Tourism

An ETL (Extract, Transform, Load) pipeline that processes tourism data from multiple sources. It was built to power analytics dashboards and insights for South Tyrol's tourism industry through the Open Data Hub.

Overview

South Tyrol's tourism industry generates data from hundreds of sources: hotels, ski lifts, museums, events, weather stations, and more. This data is valuable for understanding visitor patterns, optimizing resources, and planning future investments.

This pipeline collects data from these diverse sources, cleans and normalizes it, applies business rules, and loads it into a data warehouse optimized for analytical queries.

Key Features

Multi-Source Ingestion

Connects to REST APIs, SFTP servers, and databases. Handles rate limiting, retries, and authentication.
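As an illustration of the retry handling, here is a minimal sketch of retrying a fetch with exponential backoff. It is not the project's actual code: `TransientSourceError`, `fetch_with_retries`, and the default attempt count and delay are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientSourceError(Exception):
    """Hypothetical error for failures worth retrying (timeouts, rate limits)."""


def fetch_with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient failures with exponential backoff.

    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    The sleep function is injectable so the logic can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientSourceError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only errors marked as transient are retried; anything else propagates immediately, so a permanent failure such as bad credentials is not retried pointlessly.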

Data Quality Checks

Validates data against schemas, checks for anomalies, and quarantines bad records for review.
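The validate-and-quarantine idea can be sketched roughly as follows. The schema, field names, and the occupancy range rule are assumptions made up for this example, not the pipeline's real definitions.

```python
from typing import Any

# Hypothetical schema for a hotel occupancy record: field name -> expected type.
OCCUPANCY_SCHEMA = {
    "hotel_id": str,
    "date": str,
    "occupied_rooms": int,
    "total_rooms": int,
}


def find_problems(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, expected_type in OCCUPANCY_SCHEMA.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif not isinstance(value, expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Simple anomaly rule, checked only once the fields themselves are sound.
    if not problems and not 0 <= record["occupied_rooms"] <= record["total_rooms"]:
        problems.append("occupied_rooms outside 0..total_rooms")
    return problems


def split_records(records: list[dict[str, Any]]):
    """Separate valid records from quarantined ones, keeping the reasons."""
    valid, quarantined = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            valid.append(record)
    return valid, quarantined
```

Storing the reasons alongside each quarantined record makes the later manual review quicker, since the reviewer does not have to rerun the checks.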

Scheduled Execution

Airflow-managed DAGs with configurable schedules, dependencies, and retry policies.

Incremental Loading

Tracks processed records to enable incremental updates, minimizing processing time and resource usage.
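One common way to track what has already been processed is a high-water mark on a timestamp column. The sketch below assumes each record carries an `updated_at` field and that the watermark is stored between runs; both are assumptions for illustration, and the real pipeline may track progress differently.

```python
from datetime import datetime
from typing import Any


def select_new_records(
    records: list[dict[str, Any]], watermark: datetime
) -> tuple[list[dict[str, Any]], datetime]:
    """Return records updated after the watermark, plus the advanced watermark.

    The watermark is the latest updated_at value seen in a previous run.
    If nothing is new, it is returned unchanged.
    """
    new_records = [r for r in records if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_records), default=watermark)
    return new_records, new_watermark
```

Because the comparison is strict, a record that arrives late with exactly the watermark timestamp would be skipped; a real implementation would need a tie-breaker such as a record ID or a small overlap window.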

Data Sources

The pipeline currently ingests data from over 30 sources including:

Accommodation: Hotel occupancy, booking platforms, guest registrations
Activities: Ski lift usage, museum visits, event attendance
Infrastructure: Traffic counters, parking occupancy, public transit
Environment: Weather stations, air quality, webcam metadata

The pipeline processes approximately 2 million records daily, with peak loads during ski season reaching 5 million records per day.

Tech Stack

Python, Apache Airflow, pandas, SQLAlchemy, PostgreSQL, TimescaleDB, Docker

What I Learned

This project taught me that data engineering is 80% handling edge cases. Every source has its quirks: inconsistent date formats, unexpected nulls, schema changes without notice, and API outages.

I also learned about designing for observability: comprehensive logging, metrics, and alerts that make issues easy to diagnose when (not if) they occur.