Detection speed (silent-dead): <6 min
False positives / 72 hrs: 2 (→ 0 after fix)
Sources SILENT_CRITICAL at audit: 6/10
Corrupt rows deleted (bars_1d): 1,315
CHAPTER 01
By late April 2026, the Argus data layer held over 800 million rows across bars_1m, bars_1d, and downstream tables. Ten exchange ingest processes ran as separate systemd-supervised Rust binaries. On the surface, the system appeared healthy: ps aux showed all 10 processes running. In practice, 6 of the 10 had been silently producing zero writes to ClickHouse for periods ranging from 14 hours to 35 days.
The failure mode was not a crashed process. Crashed processes would have been caught by systemd auto-restart. The failure mode was a living process with an established WebSocket connection that no longer wrote data. Binance and Bybit processes produced 0 bytes of log output. KuCoin logged active reconnect cycles every 60 seconds (INFO: Got KuCoin WebSocket token, connecting with 45 pairs) but wrote no rows to the database. The standard process health check (is the PID alive?) was completely blind to this class of failure.
A second freshness problem emerged at the database layer. The bars_1d table contained rows with timestamps in year 2299 and year 2030, remnants of the QuestDB-to-ClickHouse migration in April 2026. The max(ts) query on bars_1d returned 2299-12-31 23:55:48 rather than the actual latest ingested date of 2026-04-24. Any freshness check that relied on max(ts) without a validity filter reported the table as infinitely fresh while actual current data was 24 hours stale. The downstream effect: the public Avo site's quote API served yesterday's equity prices while the system claimed to be real-time.
A third category of staleness was domain-specific. The argus.signals table was genuinely fresh (latest entry within 60 seconds of audit time). But argus.bars_1d was critically stale for US equities (24+ hours behind for AAPL, SPY, and all major indices). argus.bars_1m was fresh for crypto sources only. Each table had a different expected freshness window: minute bars should be under 5 minutes old; daily bars under 26 hours (accounting for market close at 20:00 UTC); signals under 5 minutes; macro data under 3 days. A single threshold applied across all tables would produce both false positives (daily macro data is fine at 48 hours) and missed alerts (daily bars are broken at 26 hours).
---
CHAPTER 02
The freshness monitor was implemented in argus-health, a standalone Rust binary compiled alongside the 23 other Argus binaries. It ran independently of all ingest and computation processes, polling ClickHouse on a 60-second interval. The core design was a table registry: a data structure mapping each table name to three parameters. The first was a validity window for the timestamp filter (used to exclude corrupt future-dated rows from the max(ts) calculation). The second was a staleness threshold (how old the latest valid row could be before triggering an alert). The third was an expected source set (for per-source freshness on tables like bars_1m where data arrives from 10 different exchanges, each with its own freshness guarantee).
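The registry described above can be sketched as a plain map from table name to its three parameters. This is a minimal std-only illustration; the field names, types, and the two entries shown are assumptions, not the actual argus-health source:

```rust
use std::collections::HashMap;

/// One registry entry per monitored table (illustrative field names).
struct TableSpec {
    /// Validity window: rows outside it are excluded from max(ts).
    valid_from: &'static str,
    valid_until: &'static str,
    /// Maximum age of the latest valid row before alerting.
    staleness_threshold_secs: u64,
    /// Sources expected to write this table (empty = table-level check only).
    expected_sources: Vec<&'static str>,
}

fn registry() -> HashMap<&'static str, TableSpec> {
    let mut r = HashMap::new();
    r.insert("argus.bars_1m", TableSpec {
        valid_from: "2020-01-01",
        valid_until: "2099-01-01",
        staleness_threshold_secs: 5 * 60, // minute bars: under 5 minutes
        expected_sources: vec!["binance", "bybit", "okx", "kraken", "kucoin", "coinbase"],
    });
    r.insert("argus.bars_1d", TableSpec {
        valid_from: "2020-01-01",
        valid_until: "2099-01-01",
        staleness_threshold_secs: 26 * 3600, // daily bars: under 26 hours
        expected_sources: vec![],
    });
    r
}

fn main() {
    let reg = registry();
    println!("{} tables registered", reg.len());
    assert_eq!(reg["argus.bars_1d"].staleness_threshold_secs, 93_600);
}
```

The per-source set on bars_1m is what lets one table carry ten independent freshness contracts.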
The argus-alert-dispatcher binary handled delivery: Slack, email, and structured JSON logs depending on severity. WARN fired at 1x the staleness threshold. CRITICAL fired at 2x. SILENT_CRITICAL fired when a source had been stale for more than 24 hours (indicating the ingest process had likely been dead for an extended period, not just temporarily disconnected).
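The three severity tiers reduce to a small pure function over lag and threshold. A sketch under the rules stated above (WARN at 1x, CRITICAL at 2x, SILENT_CRITICAL past 24 hours); names and exact boundary handling are assumptions:

```rust
/// Severity tiers as described in the text; the real argus-health
/// enum and boundary semantics may differ.
#[derive(Debug, PartialEq)]
enum Status { Healthy, Warn, Critical, SilentCritical }

fn classify(lag_seconds: u64, threshold_seconds: u64) -> Status {
    const SILENT_CUTOFF: u64 = 24 * 3600; // 24 hours
    if lag_seconds >= SILENT_CUTOFF {
        Status::SilentCritical
    } else if lag_seconds >= 2 * threshold_seconds {
        Status::Critical
    } else if lag_seconds >= threshold_seconds {
        Status::Warn
    } else {
        Status::Healthy
    }
}

fn main() {
    // 5-minute threshold, as used for crypto minute-bar sources.
    assert_eq!(classify(120, 300), Status::Healthy);
    assert_eq!(classify(400, 300), Status::Warn);
    assert_eq!(classify(700, 300), Status::Critical);
    assert_eq!(classify(30 * 3600, 300), Status::SilentCritical);
}
```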
---
ARCHITECTURE OVERVIEW
PRODUCERS
Rust 1.84
Tokio 1.40
STREAM
ClickHouse 24.3
ordered / durable
CONSUMERS
Redis 7.2
systemd
SINK
Storage
ack + replay
CHAPTER 03
Validity-filtered freshness queries. For every table, the freshness query applied both a lower bound (2020-01-01 to exclude epoch-zero corruption) and an upper bound (2099-01-01 to exclude year-2299 corruption):
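The query text itself is not reproduced here, but its shape follows directly from the description: per-source max(ts) bounded on both sides. A sketch that builds such a query string (the exact SQL in argus-health is an assumption; toDateTime is ClickHouse's timestamp constructor):

```rust
/// Builds a validity-filtered per-source freshness query of the shape
/// the text describes. Reconstructed from the prose, not the real code.
fn freshness_query(table: &str, lower: &str, upper: &str) -> String {
    format!(
        "SELECT source, max(ts) AS latest \
         FROM {table} \
         WHERE ts >= toDateTime('{lower}') AND ts < toDateTime('{upper}') \
         GROUP BY source"
    )
}

fn main() {
    let q = freshness_query(
        "argus.bars_1m",
        "2020-01-01 00:00:00",
        "2099-01-01 00:00:00",
    );
    // Both bounds present: epoch-zero rows and year-2299 rows are excluded.
    assert!(q.contains("ts >= toDateTime('2020-01-01 00:00:00')"));
    assert!(q.contains("ts < toDateTime('2099-01-01 00:00:00')"));
    println!("{q}");
}
```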
This query returned the correct per-source freshness even when the table's unfiltered max(ts) showed 2299-12-31. The validity bounds were configurable per table; for argus.signals, the lower bound was set to 2025-01-01 to avoid spurious results from early development data.
Per-source staleness thresholds. bars_1m tracked each exchange source independently. OKX had a 5-minute threshold (expected lag: sub-minute). Kraken had a 5-minute threshold. Binance and Bybit had 5-minute thresholds as well, meaning a WARN fired within 5 minutes of either ingest process dying and a CRITICAL at the 10-minute (2x) mark. At the prior audit, Binance had been 25 days stale and Bybit 35 days stale, evidence that no freshness monitor had been running before this implementation.
Lag monitoring output structure. Each polling cycle produced a structured record per source:
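The record itself is not shown in this write-up; a sketch of its shape, using the five fields named here and in the health_events schema below (the serialization format is an assumption, hand-rolled to keep the example dependency-free):

```rust
/// One per-source record per polling cycle. Field names follow the text
/// (source, measured_ts, lag_seconds, status, process_alive); the real
/// binary's serialization may differ.
struct HealthRecord<'a> {
    source: &'a str,
    measured_ts: &'a str,
    lag_seconds: u64,
    status: &'a str,
    process_alive: bool,
}

impl<'a> HealthRecord<'a> {
    fn to_json_line(&self) -> String {
        format!(
            "{{\"source\":\"{}\",\"measured_ts\":\"{}\",\"lag_seconds\":{},\"status\":\"{}\",\"process_alive\":{}}}",
            self.source, self.measured_ts, self.lag_seconds, self.status, self.process_alive
        )
    }
}

fn main() {
    // The silent-dead signature: alive process, critically stale data.
    let rec = HealthRecord {
        source: "binance",
        measured_ts: "2026-04-25T10:00:00Z",
        lag_seconds: 2_160_000, // roughly 25 days
        status: "SILENT_CRITICAL",
        process_alive: true,
    };
    println!("{}", rec.to_json_line());
}
```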
The process_alive: true field alongside status: SILENT_CRITICAL was the key diagnostic signal: the process was running but producing no data. This combination uniquely identified the silent-dead failure mode as distinct from a process crash (process_alive: false) or a temporary disconnect (process_alive: true, lag_seconds under 600).
Redis-based freshness cache. After each polling cycle, the health binary wrote a summary to Redis under key argus:freshness:summary with a 120-second TTL. The Avo site's API routes queried this key to populate freshness indicators on the data-sources page. If the key was absent (health binary crashed), the site displayed a degraded-mode banner rather than serving stale indicators. This made the monitor's own health visible to the product surface.
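The site-side fallback reduces to one decision: if the cached summary is absent (key expired or health binary down), show the degraded banner rather than a possibly stale indicator. A minimal sketch with the Redis lookup stubbed out as an Option (the function name and banner string are hypothetical):

```rust
/// Site-side fallback: a missing argus:freshness:summary key means the
/// monitor itself may be down, so degrade visibly. Redis access is stubbed.
fn indicator_state(cached_summary: Option<&str>) -> &str {
    match cached_summary {
        // Present implies < 120 s old, guaranteed by the key's TTL.
        Some(s) if !s.is_empty() => s,
        // Absent or empty: never render stale indicators as live.
        _ => "degraded-mode banner",
    }
}

fn main() {
    assert_eq!(indicator_state(None), "degraded-mode banner");
    assert_eq!(indicator_state(Some("bars_1m: HEALTHY")), "bars_1m: HEALTHY");
}
```

The TTL doing the staleness enforcement (rather than the reader checking timestamps) is what keeps the site logic this small.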
Replay capability for the monitor itself. The full polling history was written to a dedicated ClickHouse table, argus.health_events, with columns for source, measured_ts, lag_seconds, status, and process_alive. This table served two purposes: historical debugging (when did a source first go stale?) and SLA reporting (what percentage of polling cycles was each source fresh during a given period?). During the Binance investigation, the health_events table showed the exact poll cycle where Binance's last valid row timestamp stopped advancing, which correlated with a socket auth token expiry window on March 31.
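The SLA calculation over health_events is a simple ratio of fresh polling cycles. In production this aggregation would run in ClickHouse; the sketch below shows the arithmetic in Rust over an in-memory slice (status strings as in the table):

```rust
/// Percentage of polling cycles in which a source was fresh, i.e. the
/// SLA number described above. Runs over in-memory statuses for the sketch.
fn fresh_pct(statuses: &[&str]) -> f64 {
    if statuses.is_empty() {
        return 0.0;
    }
    let fresh = statuses.iter().filter(|s| **s == "HEALTHY").count();
    100.0 * fresh as f64 / statuses.len() as f64
}

fn main() {
    let cycles = ["HEALTHY", "HEALTHY", "WARN", "HEALTHY"];
    assert_eq!(fresh_pct(&cycles), 75.0);
}
```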
Timestamp corruption remediation. The year-2299 corruption in bars_1d and bars_1m was addressed in a separate cleanup pass. For bars_1d, 1,315 corrupt rows were deleted. For bars_1m, no corrupt rows existed in the validated range (the future-dated tail was in a separate partition). Deletion was preceded by a SELECT to confirm the rows were exclusively corrupt timestamps with no valid data co-located in the same partition. This adhered to the data retention policy (no deletion during migrations; dedup only; confirm before any row deletion).
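The confirm-before-delete sequence can be expressed as a pair of statements: a SELECT that counts and brackets what the predicate matches, then the delete itself. The SQL is reconstructed from the prose (ALTER TABLE ... DELETE is ClickHouse's mutation syntax); the real cleanup scripts are not shown:

```rust
/// Confirm-then-delete per the retention policy: inspect what the
/// predicate matches before issuing any deletion. Predicate and statement
/// text are reconstructions, not the actual cleanup scripts.
fn cleanup_statements(table: &str) -> (String, String) {
    let predicate = "ts >= toDateTime('2099-01-01 00:00:00')";
    (
        // Step 1: confirm row count and timestamp range of candidates.
        format!("SELECT count(), min(ts), max(ts) FROM {table} WHERE {predicate}"),
        // Step 2: only after manual review, delete via a mutation.
        format!("ALTER TABLE {table} DELETE WHERE {predicate}"),
    )
}

fn main() {
    let (confirm, delete) = cleanup_statements("argus.bars_1d");
    println!("{confirm}");
    println!("{delete}");
}
```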
---
CHAPTER 04
- Detection speed for the silent-dead failure mode: after the monitor was deployed, a simulated ingest freeze on the OKX binary was detected within 6 minutes (first polling cycle after the 5-minute threshold elapsed).
- False positive rate: over a 72-hour monitoring window, 2 false-positive WARN alerts were triggered. Both occurred during planned ClickHouse maintenance windows, when query latency spiked and the health binary's polling query timed out, causing it to report stale data when the underlying data was fresh. Adding a retry with 5-second backoff on query timeout reduced false positives to zero in the subsequent 72-hour window.
- Staleness audit findings: at the time of the April 25 audit, 6 of 10 exchange sources were in SILENT_CRITICAL state. 2 sources (OKX: 5-minute lag; Kraken: 1-minute lag) were HEALTHY. 2 sources (KuCoin: 14-hour lag; Coinbase: 14-hour lag) were CRITICAL with process_alive: true and active reconnect loops. The audit also revealed that bars_1d had no fresh data for 2026-04-25 for any US equity symbol, confirming the daily bar downloader had not run that day.
- Corruption quantification: 1,315 rows deleted from bars_1d (year-2299 partition); 0 rows affected in the bars_1m valid range. 7 other tables identified with year-1970 epoch corruption (earnings_calendar 314K rows, fda_events 322K rows, ipo_events 4.5K rows) were retained pending timestamp backfill from source systems, per the no-deletion policy.
- Redis cache hit rate for the Avo site: the argus:freshness:summary key was present and current (under 120 seconds old) for 98.3% of API requests during the monitoring window. The 1.7% miss rate corresponded to the first request after a health binary restart, handled gracefully by the degraded-mode banner.
---
CHAPTER 05
DECISION · 01
Process liveness is a meaningless health signal for streaming ingesters. The audit proved conclusively that all 10 exchange processes could be alive while 8 were producing zero data. The correct health signal was data liveness: is the source writing rows to ClickHouse within the expected freshness window? Replacing supervisor-based health checks with data-driven freshness checks eliminated the monitoring blind spot for the entire silent-dead failure class.
DECISION · 02
Validity bounds on max(ts) queries are not optional at scale. A single corrupt row with a year-2299 timestamp will make a max(ts) query report the entire table as fresh until the year 2299. At 714 million rows, the cost of fixing corruption at query time (two WHERE clauses) was negligible compared to the cost of operating blind to actual staleness. Every freshness query should include both a lower bound and an upper bound on the timestamp column.
DECISION · 03
Per-source thresholds matter because data sources have fundamentally different refresh contracts. Daily macro data from FRED is expected to be 1 to 3 days old. Real-time crypto minute bars should be under 5 minutes old. Applying a single threshold across both would either drown operators in false positives for macro data or miss multi-hour staleness on crypto feeds. The table registry model (threshold as metadata per table per source) was the correct abstraction.
DECISION · 04
Freshness state needs its own visibility surface. Writing freshness summaries to Redis and surfacing them on the Avo site's data-sources page closed the loop between infrastructure health and product credibility. Before the monitor, a visitor to avogrowth.com saw "real-time market data" while the system was serving 24-hour-old equity prices. After the monitor, freshness indicators on the site accurately reflected actual data age, including distinguishing between crypto (sub-minute, live WebSocket) and equity daily bars (end-of-day, 26-hour update cycle).