Inside Open Data Hub's Architecture

I work at NOI Techpark on the Open Data Hub, an open data platform that collects and provides access to datasets about mobility, tourism, weather, and other domains in South Tyrol. What started as a regional project has grown into a sophisticated data platform serving millions of API requests.

This article explores the architecture behind Open Data Hub—the design decisions, trade-offs, and lessons learned from building a platform that must handle diverse data sources, varying update frequencies, and real-time requirements.

The Problem Space

Open Data Hub aggregates data from dozens of heterogeneous sources:

  • Real-time traffic sensors updating every minute
  • Public transit data in GTFS and GTFS-RT formats
  • Tourism statistics updated daily
  • Weather stations with varying refresh rates
  • Parking availability changing by the second
  • Event calendars updated sporadically

Each source has its own format, update frequency, reliability profile, and quirks. Some provide well-documented REST APIs. Others offer CSV files uploaded to FTP servers. A few require scraping websites. The challenge is presenting all of this through a unified, reliable API.

High-Level Architecture

The system follows a pipeline architecture with three main stages:

  1. Data Collection — Ingest data from external sources
  2. Data Processing — Transform, validate, and enrich
  3. Data Serving — Expose through APIs and other interfaces

This separation allows each stage to scale independently and fail gracefully. If a data collector crashes, historical data remains available. If the API is overloaded, collection continues unaffected.

Data Collection Layer

Data collectors are standalone Java services, each responsible for one or more data sources. They run on scheduled intervals (cron-like) and write to a shared data store.

@Scheduled(fixedRate = 60000)  // Every minute
public void collectTrafficData() {
    List<TrafficSensor> sensors = trafficApi.getSensors();

    for (TrafficSensor sensor : sensors) {
        Measurement m = Measurement.builder()
            .stationId(sensor.getId())
            .value(sensor.getCurrentFlow())
            .timestamp(Instant.now())
            .build();

        measurementRepository.save(m);
    }
}

Design decisions:

  • Pull over push: We pull data from sources rather than accepting pushes. This gives us control over rate limiting and error handling.
  • Idempotent writes: Collectors can safely retry failed operations. We use upserts keyed on source ID and timestamp (a sketch follows this list).
  • Source isolation: Each collector runs in its own process. A misbehaving source can't affect others.
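
To make the idempotent-write point concrete, here is a minimal sketch of such an upsert using Spring's JdbcTemplate and PostgreSQL's ON CONFLICT clause. The table and columns mirror the schema shown further down; the class and method names are illustrative, not the actual collector code.

import java.sql.Timestamp;
import java.time.Instant;
import java.util.UUID;

import org.springframework.jdbc.core.JdbcTemplate;

public class MeasurementUpsertRepository {

    private final JdbcTemplate jdbc;

    public MeasurementUpsertRepository(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Replaying the same batch after a failed run overwrites identical rows
    // instead of duplicating them, so retries are safe.
    public void upsert(UUID stationId, Instant timestamp, String dataType, double value) {
        jdbc.update("""
            INSERT INTO measurements (station_id, timestamp, data_type, value)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (station_id, timestamp, data_type)
            DO UPDATE SET value = EXCLUDED.value
            """,
            stationId, Timestamp.from(timestamp), dataType, value);
    }
}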

Handling unreliable sources:

External APIs fail in creative ways. Our collectors implement:

  • Exponential backoff with jitter (a sketch follows this list)
  • Circuit breakers to avoid hammering failing endpoints
  • Staleness detection—if a source hasn't delivered fresh data within its expected window, its records are marked as stale
  • Alerting when sources go offline for extended periods
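
To make the first of these concrete, here is a minimal sketch of exponential backoff with full jitter; the helper and its bounds are illustrative rather than the production retry code.

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public final class Retry {

    // Sleeps a random amount in [0, base * 2^attempt], capped at maxDelay,
    // before each retry; after maxAttempts the failure is propagated so the
    // circuit breaker and alerting can take over.
    public static <T> T withBackoff(Supplier<T> fetch, int maxAttempts,
                                    Duration base, Duration maxDelay) throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                return fetch.get();
            } catch (RuntimeException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e;
                }
                long capped = Math.min(maxDelay.toMillis(),
                                       base.toMillis() << Math.min(attempt, 16));
                Thread.sleep(ThreadLocalRandom.current().nextLong(capped + 1));
            }
        }
    }
}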

Data Storage

We use PostgreSQL as the primary data store, with TimescaleDB for time-series data. This combination handles both the relational aspects (stations, metadata, relationships) and the time-series aspects (measurements, events).

Why not a dedicated time-series database? We evaluated InfluxDB and other specialized solutions. PostgreSQL with TimescaleDB won because: (1) we already had PostgreSQL expertise, (2) it handles the relational queries we need, and (3) the operational simplicity of one database technology outweighed the performance gains of specialized solutions for our scale.

Schema design:

The core schema follows a star pattern with stations at the center:

-- Stations are the core entity
CREATE TABLE stations (
    id UUID PRIMARY KEY,
    source_id VARCHAR NOT NULL,
    name VARCHAR NOT NULL,
    station_type VARCHAR NOT NULL,
    coordinates GEOGRAPHY(POINT),
    metadata JSONB,
    UNIQUE(source_id, station_type)
);

-- Measurements are time-series data
CREATE TABLE measurements (
    station_id UUID REFERENCES stations(id),
    timestamp TIMESTAMPTZ NOT NULL,
    data_type VARCHAR NOT NULL,
    value DOUBLE PRECISION,
    PRIMARY KEY (station_id, timestamp, data_type)
);

-- TimescaleDB hypertable for automatic partitioning
SELECT create_hypertable('measurements', 'timestamp');

API Layer

The public API is a Spring Boot application that translates REST requests into database queries. It's stateless, allowing horizontal scaling behind a load balancer.

Query flexibility:

Consumers have diverse needs. A mobile app wants current parking availability. A researcher wants historical traffic patterns. A dashboard wants aggregated statistics. We address this with a flexible query language:

GET /v2/stations/TrafficSensor?where=active.eq.true
                              &select=id,name,coordinates
                              &limit=100

GET /v2/measurements/TrafficSensor?from=2025-01-01
                                  &to=2025-01-31
                                  &select=avg(value)
                                  &groupBy=hour

This design—inspired by PostgREST—gives consumers power without requiring custom endpoints for each use case.
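
To give a feel for the translation, here is a hypothetical sketch of what a request such as select=avg(value)&groupBy=hour can boil down to on the database side, using TimescaleDB's time_bucket function. The service class and method are illustrative, not the actual API code.

import java.time.OffsetDateTime;
import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;

public class MeasurementQueryService {

    private final JdbcTemplate jdbc;

    public MeasurementQueryService(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Hourly averages for one station type: the rough SQL equivalent of
    // select=avg(value)&groupBy=hour in the REST query language.
    public List<Map<String, Object>> hourlyAverages(String stationType,
                                                    OffsetDateTime from, OffsetDateTime to) {
        return jdbc.queryForList("""
            SELECT time_bucket('1 hour', m.timestamp) AS hour,
                   avg(m.value) AS avg_value
            FROM measurements m
            JOIN stations s ON s.id = m.station_id
            WHERE s.station_type = ?
              AND m.timestamp BETWEEN ? AND ?
            GROUP BY hour
            ORDER BY hour
            """, stationType, from, to);
    }
}

Because measurements is a hypertable, the timestamp predicate lets TimescaleDB exclude entire chunks before aggregating, which keeps this kind of query fast even over long time ranges.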

Caching strategy:

We use a multi-layer caching approach:

  1. HTTP caching: Standard Cache-Control headers. CDNs handle most static data.
  2. Application cache: Redis caches expensive queries. TTL varies by data type—real-time data has a short TTL, historical data a long one (see the configuration sketch after this list).
  3. Query result caching: Identical queries within a time window return cached results.
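
Here is a minimal configuration sketch of the application-cache layer, assuming Spring Data Redis; the cache names and TTLs are illustrative, the point is only that each data type gets its own expiry.

import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;

@Configuration
public class CacheConfig {

    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory connectionFactory) {
        return RedisCacheManager.builder(connectionFactory)
            // Real-time data goes stale quickly, so keep it only briefly.
            .withCacheConfiguration("latest-measurements",
                RedisCacheConfiguration.defaultCacheConfig().entryTtl(Duration.ofSeconds(30)))
            // Historical aggregates rarely change and can live much longer.
            .withCacheConfiguration("historical-aggregates",
                RedisCacheConfiguration.defaultCacheConfig().entryTtl(Duration.ofHours(12)))
            .build();
    }
}

Query methods then opt in with @Cacheable("latest-measurements") or @Cacheable("historical-aggregates") and inherit the matching TTL.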

Real-Time Data: GTFS-RT

Public transit data requires special handling. GTFS-RT (General Transit Feed Specification - Realtime) provides real-time updates to scheduled transit data: vehicle positions, trip updates, and service alerts.

The challenge: GTFS-RT uses Protocol Buffers and assumes you have the static GTFS data loaded. Our pipeline:

  1. Periodically fetch and parse static GTFS feeds
  2. Build in-memory indexes for fast lookup
  3. Poll the GTFS-RT feeds (typically every 10-30 seconds)
  4. Merge real-time updates with scheduled data
  5. Expose through both native GTFS-RT and our REST API

I built a separate service for this—gtfs-transformer—in Go, because the real-time processing benefits from Go's concurrency model and lower memory footprint compared to Java.
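
The production gtfs-transformer is written in Go, but the shape of the real-time side is easy to show with the official gtfs-realtime-bindings protobuf classes. The feed URL and the printed output below are placeholders; the real service merges these updates into the in-memory GTFS indexes instead of printing them.

import java.net.URI;

import com.google.transit.realtime.GtfsRealtime.FeedEntity;
import com.google.transit.realtime.GtfsRealtime.FeedMessage;

public class TripUpdatePoller {

    public static void main(String[] args) throws Exception {
        // Placeholder feed URL; real feeds come from the local transit operators.
        var feedUrl = URI.create("https://example.org/gtfs-rt/trip-updates").toURL();
        FeedMessage feed = FeedMessage.parseFrom(feedUrl.openStream());

        for (FeedEntity entity : feed.getEntityList()) {
            if (!entity.hasTripUpdate()) {
                continue;
            }
            String tripId = entity.getTripUpdate().getTrip().getTripId();
            // Delay (in seconds) of the first stop-time update, if the producer sets one.
            int delay = entity.getTripUpdate().getStopTimeUpdateCount() > 0
                ? entity.getTripUpdate().getStopTimeUpdate(0).getArrival().getDelay()
                : 0;
            System.out.printf("trip %s delayed by %d s%n", tripId, delay);
        }
    }
}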

Monitoring and Observability

With data flowing from dozens of sources, visibility is essential. We instrument everything:

  • Metrics: Prometheus scrapes all services. Grafana dashboards show collection rates, API latency, and error rates (an instrumentation sketch follows this list).
  • Logs: Structured logging (JSON) shipped to Elasticsearch. Every data record has a trace ID from source to API.
  • Alerts: PagerDuty alerts for critical issues. Slack notifications for warnings.
  • Data quality: Automated checks for anomalies—sudden drops in data volume, unusual values, gaps in time series.
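
To give a sense of the instrumentation, here is a small sketch using Micrometer, which Spring Boot exposes to Prometheus out of the box; the metric and tag names are illustrative, not the ones we actually use.

import java.util.concurrent.Callable;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class CollectorMetrics {

    private final Counter recordsCollected;
    private final Timer fetchTimer;

    public CollectorMetrics(MeterRegistry registry, String source) {
        // Tagged per source so a single failing provider stands out on the dashboards.
        this.recordsCollected = Counter.builder("odh_collector_records_total")
            .tag("source", source)
            .register(registry);
        this.fetchTimer = Timer.builder("odh_collector_fetch_seconds")
            .tag("source", source)
            .register(registry);
    }

    public <T> T timedFetch(Callable<T> fetch) throws Exception {
        return fetchTimer.recordCallable(fetch);
    }

    public void recordSaved(int count) {
        recordsCollected.increment(count);
    }
}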

Lessons Learned

1. External sources are the hardest part

The most challenging work isn't our infrastructure—it's handling the unpredictability of external data sources. APIs change without notice. Formats are inconsistent. Documentation lies. Building resilience into collectors is more important than optimizing the happy path.

2. Schema evolution is constant

New data sources mean new fields, new types, new relationships. We've learned to design for flexibility. JSONB columns for metadata that changes frequently. Feature flags to hide incomplete integrations. Migration scripts that work forward and backward.

3. Start simple, optimize later

Our first version used a single PostgreSQL instance with no caching. It worked fine until it didn't. The modular architecture let us add caching, read replicas, and eventually TimescaleDB without rewriting the entire system.

4. Open data means open source

Almost all of Open Data Hub is open source. This transparency builds trust with data providers and consumers. It also attracts contributions—we've merged pull requests from researchers, students, and developers who use the platform.

What's Next

Open Data Hub continues to evolve. Current focus areas:

  • GraphQL API for more flexible queries
  • Event streaming for real-time subscriptions
  • Improved data lineage and provenance tracking
  • AI/ML integration for anomaly detection and prediction

Working on this platform has taught me more about data systems than any textbook could. The intersection of reliability, performance, and usability—while handling data you don't control—is endlessly interesting.


Open Data Hub is developed at NOI Techpark in Bolzano, Italy. The platform is open source and available at github.com/noi-techpark. The API is public and free to use at opendatahub.com.