100% · Signal sources emitting (6 of 6)
209,033 · Regime keys updated to 24-hr TTL
380ms · Signal emit latency (p99)
0 · Duplicate orders (3 PEL replays)
CHAPTER 01
Argus generates trading signals continuously across 6 source categories: mean_reversion, novelty_anomaly, regime_caution, trend_following, high_conviction, and regime_breakout. Under normal conditions, signal emission was healthy. The problem was notification: when a regime break occurred or a novelty anomaly spiked, nothing notified downstream consumers until their next polling interval. The regime pipeline polled every 30 seconds. The feature engine polled every 60 seconds. Apex consumed signals via a separate query path. In fast-moving market conditions, a 30-second or 60-second delay between signal generation and consumer awareness was operationally unacceptable.
A second problem was key persistence. The argus-regime binary wrote regime state into Redis using HSET against keys of the form argus:regime:{symbol}:{timeframe}. By April 25, 2026, 209,033 regime keys existed in Redis. The EXPIRE call was missing from the write path in argus-regime/src/main.rs (lines 233 to 247). If the regime pipeline stopped running, stale regime keys would persist indefinitely, and downstream consumers including Apex would read them as current regime state. A signal generated against a stale regime would carry incorrect regime_confidence values, leading to trades based on outdated market structure.
The third problem was observability. There was no unified view of signal velocity, regime transitions, or alert delivery failures. The health binary (argus-health) queried ClickHouse for data freshness but did not monitor the alerting path itself.
---
CHAPTER 02
The alerting system was built around argus-alert-dispatcher, a dedicated Rust binary separate from the regime and signal generation pipelines. Separation was intentional: alert delivery failures should never block signal generation, and alert processing load should never compete for CPU with the AVX2 SIMD feature engine.
The signal delivery path used Redis 7.2 Streams as the transport layer rather than pub/sub. Pub/sub would have dropped messages delivered while no consumer was connected. Streams persisted messages until acknowledged, enabling Apex to reconnect after a crash and replay all unprocessed alerts from its last acknowledged stream entry. The stream key was argus:signals:stream. Argus published to the stream; Apex subscribed via consumer group apex-consumer.
The regime key TTL problem was addressed in parallel. The fix added a single redis.expire(&redis_key, 86400) call after every redis.hset_multiple() in the regime writer. All 209,033 existing keys were patched in a single pass using Redis SCAN rather than the blocking KEYS command.
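A sketch of that remediation pass, written in Python for brevity (the production system is Rust; `remediate_missing_ttls` and its client parameter are illustrative, not the actual code — any client exposing redis-py-style `scan`/`ttl`/`expire` would work):

```python
def remediate_missing_ttls(client, pattern="argus:regime:*", batch=100, ttl=86400):
    """One cursor-driven SCAN pass (non-blocking, unlike KEYS), setting a
    24-hour TTL on every matching key that currently has no expiry."""
    cursor, patched = 0, 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch)
        for key in keys:
            if client.ttl(key) == -1:      # -1 means "persists forever"
                client.expire(key, ttl)
                patched += 1
        if cursor == 0:                    # SCAN signals completion with cursor 0
            return patched
```

SCAN's incremental cursor is what keeps Redis responsive during the pass: each call touches at most `count` keys, so no single command holds the event loop the way a full KEYS scan over 209,033 keys would.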
The SCAN approach processed keys in batches of 100, issuing EXPIRE on each, and completed the full 209,033-key remediation without an observable Redis latency spike. After the fix, sampled keys confirmed a consistent 24-hour TTL, with zero keys remaining at -1 (no expiry).
---
ARCHITECTURE OVERVIEW
[Diagram: producers (Rust 1.84, Tokio 1.40) publish to Redis 7.2 Streams (ordered / durable); consumers read with ack + replay, feeding the ClickHouse 24.3 storage sink.]
CHAPTER 03
Signal schema. Each alert published to argus:signals:stream carried a complete decision snapshot as JSON. The schema included: signal_id (UUID for idempotency), timestamp (ISO 8601 UTC), symbol, direction (long/short/neutral), confidence (0.0 to 1.0), regime, regime_confidence, source, source_detail (human-readable rule that fired), features (raw values used in the decision), suggested_kelly, max_position_usd, holding_period_hours, exit_rules, backtest_stats, ttl_seconds, and version.
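For illustration, a payload carrying the fields above might look like the following. Only the field names come from the schema; every value here is invented for the example:

```python
import json

example_signal = {
    "signal_id": "a3f1c2d4-0000-0000-0000-000000000000",  # UUID, idempotency key
    "timestamp": "2026-04-25T14:03:07Z",                  # ISO 8601 UTC
    "symbol": "BTC-USDT",
    "direction": "long",            # long / short / neutral
    "confidence": 0.82,             # 0.0 to 1.0
    "regime": "trend",
    "regime_confidence": 0.74,
    "source": "trend_following",
    "source_detail": "momentum crossover with volume confirmation",  # illustrative
    "features": {"mom_20d": 0.031, "vol_ratio": 1.8},
    "suggested_kelly": 0.04,
    "max_position_usd": 25000,
    "holding_period_hours": 48,
    "exit_rules": {"stop_pct": 0.02, "take_profit_pct": 0.05},
    "backtest_stats": {"sharpe": 1.3, "win_rate": 0.56},
    "ttl_seconds": 600,             # crypto signal TTL; macro-catalyst would be 300
    "version": 1,
}

payload = json.dumps(example_signal)  # serialized form published to the stream
```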
The ttl_seconds field was critical. Crypto signals carried a 600-second TTL; macro-catalyst signals carried 300 seconds. Apex validated signal age on receipt and dropped any signal older than its TTL. This prevented stale signals from entering the execution pipeline after a consumer lag incident.
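The age check itself is small; a sketch (function name and the injectable clock are illustrative, not Apex's actual code):

```python
from datetime import datetime, timezone

def is_fresh(signal_timestamp, ttl_seconds, now=None):
    """Return True if the signal is younger than its own ttl_seconds.
    signal_timestamp is an ISO 8601 UTC string like '2026-04-25T14:03:07Z'."""
    now = now or datetime.now(timezone.utc)
    born = datetime.fromisoformat(signal_timestamp.replace("Z", "+00:00"))
    return (now - born).total_seconds() <= ttl_seconds
```

Because the TTL rides inside the payload, the same check works identically for fresh deliveries and for entries replayed from the pending list minutes later.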
At-least-once delivery and idempotency. Redis Streams consumer groups provided at-least-once delivery guarantees. Apex acknowledged each stream entry only after successful execution.
Unacknowledged entries remained in the pending-entries list (PEL). If Apex restarted, it read from the PEL first before consuming new entries. Because the same signal could be processed twice under crash-restart scenarios, execution used signal_id as an idempotency key: before placing any order, Apex queried its executions table for a matching signal_id and short-circuited if found.
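The crash-restart sequence can be sketched as pure logic, with the Redis calls abstracted away. In a real consumer, the pending entries would come from reading the stream at ID `0` for the group, new entries from reading at `>`, and each processed ID would be XACKed; the names below are illustrative:

```python
def process_entries(entries, already_executed, execute):
    """Process stream entries exactly once per signal_id.
    entries: list of (stream_entry_id, signal_dict), PEL entries first.
    already_executed: set of signal_ids already in the executions table.
    execute: callback that places the order. Returns entry IDs to acknowledge."""
    acked = []
    for entry_id, signal in entries:
        sid = signal["signal_id"]
        if sid not in already_executed:    # idempotency short-circuit
            execute(signal)
            already_executed.add(sid)
        acked.append(entry_id)             # ack duplicates too: the work is done
    return acked
```

Note that a duplicate entry is still acknowledged: the order it describes already exists, so leaving it in the PEL would only cause another redundant replay.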
Backpressure. The alert dispatcher used a Tokio bounded channel (capacity 10,000) between the signal detection loop and the Redis publish path. If Redis became temporarily unavailable or slow, channel backpressure caused the detection loop to pause accepting new signals. The channel depth was monitored; sustained depth above 5,000 triggered a WARN log. This bounded the memory impact of a Redis latency spike to approximately 10,000 signals buffered in RAM.
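The same backpressure shape in miniature, using Python's asyncio.Queue as a stand-in for the Tokio bounded channel (capacity and names here are illustrative; the production channel holds 10,000 entries):

```python
import asyncio

async def detector(chan, signals):
    """Detection loop: await chan.put() blocks when the channel is full,
    which is exactly the backpressure that pauses signal intake."""
    for s in signals:
        await chan.put(s)
    await chan.put(None)           # sentinel: no more signals

async def publisher(chan, published):
    """Publish loop: stand-in for the Redis XADD path."""
    while True:
        s = await chan.get()
        if s is None:
            return
        published.append(s)

async def run_pipeline(signals, capacity=10_000):
    chan = asyncio.Queue(maxsize=capacity)
    published = []
    await asyncio.gather(detector(chan, signals), publisher(chan, published))
    return published
```

With a bounded channel, a slow publisher never grows memory without limit; the detector simply stalls at `put()` until the publisher drains a slot.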
Replay capability. Redis Streams retained all entries until explicitly trimmed. The stream was configured with MAXLEN ~ 100000, trimming entries older than the most recent 100,000 in an approximate (non-blocking) fashion. This gave Apex approximately 10 to 15 minutes of replay window during typical signal volumes. For longer replay needs (debugging, backtesting alert behavior), signals were also persisted to ClickHouse's argus.signals table, which accumulated 1.59 million records by April 25. Replay from ClickHouse was available via the argus-api binary at port 8080 with time-range filtering.
Lag monitoring. The argus-health binary tracked two lag metrics: stream consumer lag (how far behind Apex's consumer group was from the head of argus:signals:stream) and signal emission lag (time between the argus-signals pipeline generating a signal and argus-alert-dispatcher publishing it to the stream). Target was under 100 ms emit-to-stream latency. The alert thresholds were: p99 latency above 500 ms triggered WARN; above 2,000 ms triggered CRITICAL; stream consumer lag above 10 seconds triggered CRITICAL; stream consumer dead triggered CRITICAL.
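Those thresholds reduce to a small classifier; a sketch (function and parameter names are illustrative, not argus-health's actual API):

```python
def alert_level(p99_emit_ms, consumer_lag_s, consumer_alive=True):
    """Map the documented lag thresholds onto an alert severity."""
    if not consumer_alive or consumer_lag_s > 10 or p99_emit_ms > 2000:
        return "CRITICAL"
    if p99_emit_ms > 500:
        return "WARN"
    return "OK"
```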
---
CHAPTER 04
- Signal velocity: The argus.signals table showed all 6 sources emitting within a 100-second window at audit time. The freshest source (mean_reversion) was under 1 minute old. The oldest active source (regime_breakout) was 112 seconds old. This is consistent with a 60-second polling interval plus processing overhead.
- Regime key remediation: 209,033 keys updated to 24-hour TTL in a single SCAN pass. Zero keys remained at -1 after the fix. Redis performance was unaffected during the scan (batch size 100, no KEYS command used).
- Signal emit latency: Measured via the timestamp differential between signal.timestamp in argus.signals and the Redis stream entry ID timestamp. Median latency was 62 ms. p99 was 380 ms during the regime pipeline's 30-second polling cycle, with the spike caused by the poll interval itself rather than processing overhead.
- Idempotency effectiveness: During a 6-hour monitoring window, 3 signals were processed twice by Apex due to reconnect events. All 3 were correctly deduplicated via signal_id lookup before order placement. Zero duplicate orders were submitted to OKX or IBKR.
- Consumer group depth: During normal operation, Apex kept the PEL depth at 0 to 3 entries. During the tested crash-restart scenario (SIGKILL to apex-worker), PEL depth reached 47 entries before the consumer restarted and drained to 0 within 8 seconds.
---
CHAPTER 05
DECISION · 01
Missing TTL on Redis keys is a class of silent data corruption. The 209,033 keys without TTL would have served stale regime context to Apex indefinitely after any regime pipeline outage. The fix was a single line of code, but the diagnostic required a deliberate audit. Adding TTL enforcement as a code-level invariant (enforced in the write function, not as a configuration setting) prevents recurrence regardless of future changes to the write path.
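A sketch of that invariant, with Python standing in for the Rust writer (`write_regime_state` and the pipeline parameter are illustrative; the point is that HSET and EXPIRE travel together inside one function, so no caller can write a regime key without an expiry):

```python
def write_regime_state(pipe, key, fields, ttl_seconds=86400):
    """Write a regime hash and its expiry as one unit.
    pipe: anything exposing redis-py-style hset()/expire(), e.g. a pipeline."""
    pipe.hset(key, mapping=fields)
    pipe.expire(key, ttl_seconds)   # invariant: no regime write without a TTL
    return pipe
```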
DECISION · 02
Pub/sub would have been wrong here. Any alert delivery system for a trading pipeline must survive consumer crashes. Redis pub/sub drops messages to consumers that are offline at the time of publish. Streams persisted messages, made replay trivial, and gave the PEL as a built-in pending queue. The operational overhead of consumer groups was minimal relative to the reliability benefit.
DECISION · 03
Signal TTL as a contract between producer and consumer. Embedding TTL in the signal payload rather than in the transport layer made the contract explicit. Apex could always validate signal freshness without querying an external configuration store. When Apex processed a signal from the PEL (possibly several minutes old), the TTL check correctly discarded expired signals rather than executing stale trades.
DECISION · 04
Separation of detection, publication, and execution. The argus-alert-dispatcher was a separate process from argus-signals (detection) and from Apex (execution). This boundary allowed each component to fail and restart independently. A detected signal was durable once written to the Redis stream; it survived a restart of either the dispatcher or the consumer without loss.