We rebuilt the signal scoring pipeline from scratch, fixing look-ahead contamination and adding a top-decile filter that produced a 72.2% win rate on selected signals.
72.2%
Win rate (top-decile signals)
55 to 62%
Model accuracy (post-fix)
32K/sec
Hypothesis tester speed
0.3ms p95
LightGBM prescreener latency
CHAPTER 01
The Apex trading system generated more than 50,000 signals per day across 1,200 crypto pairs and US equities. Nearly all of them bled capital. The initial pipeline funneled signals through a LightGBM model trained with train_test_split(random_state=42) across 10,486 historical trades, producing 48% accuracy, worse than a coin flip on a binary outcome. Profiling the failure revealed three compounding flaws: 12 of the 18 training features were hardcoded constants with zero variance, the column entry_signal_strength was NULL for 100% of training rows, and random shuffling leaked future information by mixing 2026 rows into the training set alongside 2024 rows, so the model trained on data from after some of the trades it was evaluated on.
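The first two flaws are mechanically detectable before training ever starts. A minimal pandas sketch of such a feature audit, using toy data (the column names other than entry_signal_strength are hypothetical, not from the production schema):

```python
import pandas as pd

def audit_features(df: pd.DataFrame, feature_cols: list[str]) -> dict:
    """Flag features that carry no information: constant (zero-variance)
    columns and columns that are NULL for every row."""
    zero_variance = [c for c in feature_cols if df[c].nunique(dropna=True) <= 1]
    all_null = [c for c in feature_cols if df[c].isna().all()]
    return {"zero_variance": zero_variance, "all_null": all_null}

# Toy frame mimicking the failure mode: one hardcoded constant, one all-NULL column.
df = pd.DataFrame({
    "rsi_14": [31.2, 55.0, 48.7],              # hypothetical real feature
    "flag_const": [1, 1, 1],                   # hardcoded constant
    "entry_signal_strength": [None, None, None],
})
print(audit_features(df, list(df.columns)))
```

Running an audit like this as a pre-training gate would have caught 12 of the 18 features immediately.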
The Rust feature engine computed 55 technical indicators per bar using SIMD and wrote to Redis and Postgres every tick. The ML training pipeline ignored it entirely and instead computed proxy features from a JSONL chain that turned out to be 99% duplicates. 2.08 million rows of near-identical records produced a model that had learned nothing about market structure.
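The duplicate problem is equally cheap to measure. A sketch of the check on a toy frame (in production this would load the JSONL chain via pd.read_json(path, lines=True); the 100-row frame here is illustrative only):

```python
import pandas as pd

# Toy stand-in for the JSONL-derived training frame: 99 identical rows, 1 unique.
rows = pd.DataFrame({
    "symbol": ["BTCUSD"] * 99 + ["ETHUSD"],
    "score":  [0.42] * 99 + [0.77],
})

dup_frac = rows.duplicated().mean()   # fraction of rows that are exact repeats
clean = rows.drop_duplicates()
print(f"duplicate fraction: {dup_frac:.2%}, clean rows: {len(clean)}")
```

At 99% duplication, 2.08 million rows collapse to roughly 20,000 distinct records, which explains why the model learned nothing.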
CHAPTER 02
A structured comparison across four candidate architectures was run on the same 10,486 clean trade dataset. The critical finding emerged from a percentile analysis: the top 10% of signals by composite score produced a 72.2% win rate and +0.80% mean PnL per trade. Everything below that threshold produced cumulative losses. The alpha was not in the scoring model. It was in the selectivity filter applied on top of it.
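The percentile analysis described above reduces to a quantile cut over composite scores. A minimal numpy sketch on synthetic data (the score/PnL relationship here is fabricated for illustration, not the case-study dataset):

```python
import numpy as np

def top_decile_filter(scores, pnl, q=0.90):
    """Keep only signals scoring at or above the q-th quantile of the
    composite score; report win rate and mean PnL of the selected set."""
    scores, pnl = np.asarray(scores, float), np.asarray(pnl, float)
    cutoff = np.quantile(scores, q)
    selected = pnl[scores >= cutoff]
    return {
        "cutoff": float(cutoff),
        "n_selected": int(selected.size),
        "win_rate": float((selected > 0).mean()),
        "mean_pnl": float(selected.mean()),
    }

# 1,000 synthetic signals where higher scores are weakly associated with profit.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 1000)
pnl = 0.5 * (scores - 0.55) + rng.normal(0, 0.2, 1000)
print(top_decile_filter(scores, pnl))
```

Even with a weak score-to-profit association, the top-decile cut concentrates the profitable tail, which is the mechanism the case study attributes the alpha to.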
Approach B (Few Strong ensemble combining 7B and 14B parameter models) won because the 161 LoRA adapters in Approach A had a fundamental misconfiguration: prompt.json specified llama3.2:1b while adapter_config.json pointed to a different model. None of the adapters loaded in production despite 1.9 GB of adapter storage.
The production architecture implemented two changes simultaneously. First, the Rust feature engine was wired directly into the LightGBM training pipeline, replacing the 12 zero-variance proxy features with 55 real technical indicators. The training split was changed from random shuffle to a strict chronological boundary. Second, a top-decile filter passed only the top 10% of signals per cycle to execution.
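The chronological boundary can be sketched in a few lines; every test row must be strictly later than every training row. Column names here are hypothetical:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, time_col: str, train_frac: float = 0.8):
    """Split strictly by time so no future row leaks into training."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

trades = pd.DataFrame({
    "entry_time": pd.to_datetime(["2024-03-01", "2026-01-15", "2024-07-09",
                                  "2025-02-20", "2025-11-30"]),
    "label": [1, 0, 1, 0, 1],
})
train, test = chronological_split(trades, "entry_time")
# Invariant the random shuffle violated: training ends before testing begins.
assert train["entry_time"].max() < test["entry_time"].min()
```

Unlike train_test_split with a fixed random_state, this guarantees the 2026 rows can never precede 2024 rows in the training window.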
ARCHITECTURE OVERVIEW
INGEST
LightGBM
FEATURES
Rust 1.84 (SIMD feature engine)
TRAIN
Python 3.12
v1 / v2 / v3
SERVE
Redis 7
Production predictions feed back into the training set on a continuous retraining cadence.
CHAPTER 03
The inference stack was divided into three latency tiers. Fast Brain ran qwen2.5-coder:1.5b via local Ollama, at roughly 12-second inference latency, for per-cycle signal triage. Deep Brain ran qwen2.5-coder:7b every 4 hours for pattern review. Research Brain ran qwen2.5-coder:14b daily for strategy-level analysis. The LightGBM prescreener sat in the hot path at sub-millisecond latency, ahead of any model inference.
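The ordering matters: the sub-millisecond prescreener gates every signal before any LLM tier sees it. A hypothetical routing sketch (the function name, score floor, and queue labels are illustrative, not from the system):

```python
def route_signal(prescreen_score: float, floor: float = 0.5) -> str:
    """Hot-path gate: the LightGBM prescreener score decides whether a signal
    is discarded outright or queued for the slower Fast Brain LLM triage."""
    if prescreen_score < floor:
        return "discard"           # rejected in <1 ms; never reaches an LLM
    return "fast_brain_queue"      # 1.5B-model triage, ~12 s, off the hot path

print(route_signal(0.3))   # discard
print(route_signal(0.8))   # fast_brain_queue
```

Because discards never touch model inference, the 12-second Fast Brain latency applies only to the small surviving fraction of the 50,000 daily signals.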
The LightGBM model trained on the 55 Rust features with the chronological split achieved 55% to 62% accuracy on the held-out test window, up from 48%. More importantly, combining this model's output with the top-decile filter produced a compound selection function that reliably identified the profitable tail. The retraining cadence was set to weekly, triggered early whenever the rolling win rate on the most recent 500 closed trades deviated more than 5 percentage points from the model's expected win rate.
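The drift trigger described above can be sketched directly; the function name and encoding (1 = win, 0 = loss) are assumptions for illustration:

```python
def should_retrain(closed_trades: list[int], expected_win_rate: float,
                   window: int = 500, threshold: float = 0.05) -> bool:
    """Trigger retraining when the rolling win rate over the last `window`
    closed trades drifts more than `threshold` (5 pp) from the model's
    expected win rate."""
    recent = closed_trades[-window:]
    observed = sum(recent) / len(recent)
    return abs(observed - expected_win_rate) > threshold

# 500 closed trades at a 40% win rate vs. an expected 50%: 10 pp drift -> retrain.
print(should_retrain([1] * 200 + [0] * 300, expected_win_rate=0.50))  # True
```

A 48% rolling win rate against the same expectation (2 pp drift) would not trigger, leaving the weekly cadence as the default.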
CHAPTER 04
Applying the score floor plus top-decile filter to the 10,486-trade backtest produced a 72.2% win rate on selected signals with +0.80% mean PnL per trade. The unfiltered population produced 27.1% win rate and negative PnL across all four architectures. The filter reduced execution volume by approximately 90%, which was the intended behavior. Selectivity, not model sophistication, was the mechanism of profit.
The hypothesis tester binary processed 738 hypotheses in 23 milliseconds, roughly 32,000 tests per second, which served as the benchmarking baseline for the broader signal-search infrastructure.
CHAPTER 05
DECISION · 01
The most expensive mistake was treating model accuracy as the primary optimization target. A model at 55% accuracy on a balanced dataset sounds useful. But on a live trading dataset where only 27% of signals are winners in aggregate, a 55%-accurate model helps only if it ranks profitable signals above unprofitable ones and the selectivity filter is aggressive enough to exploit that ordering.
DECISION · 02
The second significant failure was the training data pipeline. Using random splits on time-series data produces optimistic accuracy estimates because models can memorize future context. After fixing the split, the model's true accuracy dropped, but the top-decile selection remained valid because the model still preserved relative rank ordering.
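Whether a recalibrated model "preserves relative rank ordering" is itself testable with a rank correlation. A minimal numpy sketch (the score vectors are fabricated; no-ties data assumed):

```python
import numpy as np

def spearman_rank_corr(a, b) -> float:
    """Spearman rho computed as the Pearson correlation of ranks
    (assumes no tied values)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Post-fix scores are lower in absolute terms but order signals identically.
old_scores = [0.2, 0.9, 0.5, 0.7]
new_scores = [0.1, 0.8, 0.3, 0.6]
print(spearman_rank_corr(old_scores, new_scores))  # 1.0
```

A rho near 1.0 between pre- and post-fix scores is exactly the condition under which a top-decile cut keeps selecting the same profitable tail even as headline accuracy drops.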
DECISION · 03
The local AI hypothesis failed decisively. 1B to 3B parameter models lacked the reasoning depth for market analysis. The production decision was to move all trading-path inference to quantitative signals from the Rust engine plus LightGBM, with cloud LLMs reserved for strategic reasoning tasks that run offline.