Engineering

How We Built a Live AI Trading System in 3 Weeks: An Honest Technical Breakdown

An honest technical breakdown of how AIOKA went from zero to a fully operational multi-agent AI trading council with a live track record in under 3 weeks. Written for builders who want to understand what was actually built.

AIOKA Team, Core Contributors
May 4, 2026
13 min read

Why We Are Writing This

Most companies in the AI trading space talk about their technology in marketing language. Six specialized agents. Proprietary signals. Real-time intelligence. These phrases mean nothing without understanding the actual architecture behind them.

We built AIOKA in public. The track record is live, the methodology is documented, and we think the honest technical story is more interesting than the polished version. This is a breakdown of what we actually built, how we built it, what broke, and what we learned.

If you are a developer, algo trader, or builder thinking about constructing something similar, this is the article we would have wanted to read before we started.


The Architecture in One Paragraph

AIOKA runs a Python/FastAPI backend on Railway, connected to a Neon PostgreSQL database. Every 5 minutes, the system fetches data from 12 external providers, computes 30 signals across 6 categories, runs an 8-state regime classifier, and assembles a briefing package that gets sent to 6 specialized Claude agents running in parallel via asyncio.gather. Each agent returns a verdict in structured JSON. The Chief Judge reads all 6 verdicts and synthesizes a final council decision. If the confidence exceeds the threshold and all 7 entry gates pass, Ghost Trader executes a paper trade on Kraken's simulated endpoint. The entire cycle takes under 5 seconds.

That is the architecture. Now let us explain what each layer actually involves.


The Signal Pipeline: 30 Signals, 6 Categories

The 30 signals AIOKA tracks fall into six categories. We will walk through each.

On-Chain Signals (8 signals): These come from blockchain data and cover wallet-level behavior. The key signals are MVRV Z-Score (market cap vs. realized cap), SOPR (spent output profit ratio), Exchange Net Flow (BTC moving onto vs. off exchanges), Whale Net Flow (large wallet transactions), Hash Ribbon (miner capitulation indicator from hashrate crossovers), and Puell Multiple (miner revenue relative to historical average). We fetch most on-chain data from CryptoQuant via their REST API, with some derived signals computed internally from OHLCV data.

Momentum Signals (6 signals): RSI on multiple timeframes (15m, 1h, 4h), EMA 200 distance (signed, not absolute -- this distinction matters), funding rate from Deribit and Binance, and price momentum over 4h and 24h periods. These come from CoinGecko for price data and exchange APIs for derivatives data.

Macro Signals (5 signals): DXY (US Dollar Index), gold price, treasury yield spread (10Y minus 2Y), VIX (market volatility), and a news blackout gate that reads the FOMC/NFP/CPI/ECB calendar and activates a trading pause 30 minutes before and 60 minutes after scheduled events. The news blackout alone has prevented several entries that would have been executed into high-volatility event windows.

Options Signals (4 signals): Put/Call ratio, Deribit implied volatility index (DVOL), open interest changes, and options volume bias. These come from Deribit's API and required the most debugging of any data source -- their API versioning is inconsistent and several endpoints changed behavior between our initial integration and current production.

Liquidity Signals (4 signals): Stablecoin mint impulse (net USDT/USDC issuance as a proxy for new capital entering the ecosystem), Liquidity Imbalance Ratio from order book data across Binance, Coinbase, and OKX, and two derived signals from the OTC/dark pool flow estimator.

Sentiment Signals (3 signals): Fear and Greed Index from Alternative.me, BTC Dominance (total crypto market share), and the Liquidation Cascade Detector, which monitors large forced-close events across exchanges and flags when a cascade pattern is forming.
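To make the taxonomy concrete, here is a minimal sketch of how a signal snapshot could be represented and grouped by category before it reaches the regime classifier and the agents. The Signal and Category types are illustrative, not the production code.

```python
from dataclasses import dataclass
from enum import Enum

class Category(str, Enum):
    ON_CHAIN = "on_chain"
    MOMENTUM = "momentum"
    MACRO = "macro"
    OPTIONS = "options"
    LIQUIDITY = "liquidity"
    SENTIMENT = "sentiment"

@dataclass(frozen=True)
class Signal:
    name: str          # e.g. "mvrv_z_score"
    category: Category
    value: float
    source: str        # provider the raw reading came from, e.g. "cryptoquant"

def group_by_category(signals: list[Signal]) -> dict[Category, list[Signal]]:
    # The briefing package is essentially the current snapshot of all 30 signals,
    # grouped by category before it is handed to the regime classifier and the agents.
    grouped: dict[Category, list[Signal]] = {category: [] for category in Category}
    for signal in signals:
        grouped[signal.category].append(signal)
    return grouped
```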


The Regime Classifier

Before any of the 30 signals feed into the council, the system must determine what kind of market it is operating in. The same signal that should trigger a BUY recommendation in a WHALE ACCUMULATION regime might correctly trigger HOLD or even SELL in a HIGH VOLATILITY regime.

We use an 8-state regime classifier that runs a Hidden Markov Model trained on OHLCV data, funding rate history, exchange flow history, and MVRV/SOPR data. The states are: BULL TRENDING, BEAR TRENDING, WHALE ACCUMULATION, DISTRIBUTION, HIGH VOLATILITY, LOW VOLATILITY COMPRESSION, MACRO UNCERTAINTY, and RECOVERY.

The HMM is retrained periodically on current data so the model does not become stale. The regime output is passed into every agent's briefing, which means agents do not reason about signals in a vacuum -- they reason about signals in the context of the current market structure.

This is arguably the most important architectural decision we made. A signal-only system without regime context produces inconsistent results because the same signal means different things in different market conditions. Regime-aware analysis is more robust.
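As a rough illustration of the approach -- not the production implementation -- an 8-state Gaussian HMM can be fit on the feature history and queried for the most recent state, assuming a library such as hmmlearn and a feature matrix built from the inputs listed above. Note that HMM states are unlabeled, so mapping them to named regimes requires a separate labeling step (for example, inspecting each state's feature means), which is omitted here.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

REGIMES = [
    "BULL_TRENDING", "BEAR_TRENDING", "WHALE_ACCUMULATION", "DISTRIBUTION",
    "HIGH_VOLATILITY", "LOW_VOLATILITY_COMPRESSION", "MACRO_UNCERTAINTY", "RECOVERY",
]

def classify_regime(features: np.ndarray) -> str:
    """features: shape (n_periods, n_features) of OHLCV returns, funding, flows, MVRV/SOPR."""
    model = GaussianHMM(n_components=len(REGIMES), covariance_type="full", n_iter=200)
    model.fit(features)               # periodic retraining keeps the model from going stale
    states = model.predict(features)  # most likely hidden state for every period
    # Mapping state indices to named regimes is a separate labeling step, omitted here.
    return REGIMES[states[-1]]
```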


The 6 Specialized Agents

The AI council consists of 6 specialized Claude agents, each with a specific domain mandate and context package, and a Chief Judge who synthesizes their outputs.

The Fundamentals Agent receives the on-chain signals, the current MVRV Z-Score, SOPR, and exchange flow data. Its mandate is to assess whether the market's fundamental structure supports or contradicts a long position.

The Momentum Agent receives the technical signals: RSI readings across timeframes, EMA 200 distance, and price momentum. Its mandate is to assess whether momentum conditions favor an entry or suggest waiting.

The Macro Strategist receives the macro signals, the news blackout status, and a briefing on current macro conditions. Its mandate is to flag whether any macro-level risks should prevent entry.

The Dark Pool Analyst receives the OTC flow data, whale net flow, and liquidity signals. Its mandate is to assess what institutional behavior implies about near-term price direction.

The Sentiment Analyst receives the Fear and Greed reading, BTC dominance, and liquidation data. Its mandate is to assess whether sentiment conditions are contrarian-positive, neutral, or a risk.

The Risk Assessor receives the full signal package plus the current portfolio state, including whether there is an open trade, the time since the last trade closed, and the current win/loss streak. Its mandate is to determine whether the risk conditions support taking a new position.

All 6 agents run in parallel using asyncio.gather, which means the council deliberation time is bounded by the slowest single agent, not by the sum of all six. In practice, the council runs in 3 to 4 seconds for all 6 agents.
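A minimal sketch of the fan-out pattern, with a stubbed run_agent standing in for the real Claude call:

```python
import asyncio
import time

AGENTS = [
    "fundamentals", "momentum", "macro_strategist",
    "dark_pool_analyst", "sentiment_analyst", "risk_assessor",
]

async def run_agent(name: str, briefing: dict) -> dict:
    # Stand-in for the real Claude call: each agent would receive its domain slice of
    # the briefing and return a structured verdict (action, confidence, reasoning).
    await asyncio.sleep(0.5)  # simulated network latency
    return {"agent": name, "action": "HOLD", "confidence": 0.7, "reasoning": "..."}

async def deliberate(briefing: dict) -> list[dict]:
    started = time.perf_counter()
    # Fan out all six agents at once: wall-clock time is bounded by the slowest
    # agent, not the sum of all six.
    verdicts = await asyncio.gather(*(run_agent(name, briefing) for name in AGENTS))
    print(f"council deliberated in {time.perf_counter() - started:.2f}s")
    return verdicts

if __name__ == "__main__":
    asyncio.run(deliberate({"regime": "WHALE_ACCUMULATION"}))
```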

The Chief Judge receives all 6 agent verdicts in structured JSON and synthesizes them into a final decision. The Chief Judge's prompt explicitly asks it to identify whether agents are reasoning from their domain expertise or overstepping into other domains, to flag when agents contradict each other significantly, and to weight the dissenting view appropriately when one or two agents vote against the majority.


The 7-Gate Entry Framework

The council can return a BUY recommendation with high confidence, but the trade still does not execute unless all 7 entry gates pass. This is by design.

The gates are:

1. Regime gate: Current regime must not be BEAR TRENDING, HIGH VOLATILITY, or DISTRIBUTION.
2. EMA proximity gate: Signed EMA 200 distance must be within the configured range -- above the floor (not below EMA support) and below the ceiling (not overextended).
3. RSI gate: RSI on the primary timeframe must be within the non-overbought range.
4. News blackout gate: No high-impact event within the exclusion window.
5. Post-trade cooldown gate: Minimum wait time after the previous trade closes (3 hours for a win, 6 hours for a loss).
6. Momentum gate: 4h momentum must not be in a deteriorating pattern.
7. Confidence gate: Council confidence must exceed the configured minimum threshold.

All 7 must pass. Not 6 of 7, not 5 of 7. This strictness was a deliberate choice. We wanted the system to take fewer, higher-quality entries rather than many mediocre ones.
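A minimal sketch of how the all-or-nothing check can be expressed -- the gate names mirror the list above, but the predicates, thresholds, and context fields are illustrative, not the production values:

```python
from typing import Callable, NamedTuple

class GateResult(NamedTuple):
    name: str
    passed: bool

# ctx is whatever snapshot the entry check needs: current regime, signal values,
# council confidence, portfolio state. All thresholds below are illustrative.
GATES: list[tuple[str, Callable[[dict], bool]]] = [
    ("regime",        lambda ctx: ctx["regime"] not in {"BEAR_TRENDING", "HIGH_VOLATILITY", "DISTRIBUTION"}),
    ("ema_proximity", lambda ctx: ctx["ema_floor"] <= ctx["ema200_distance"] <= ctx["ema_ceiling"]),
    ("rsi",           lambda ctx: ctx["rsi_primary"] < ctx["rsi_overbought"]),
    ("news_blackout", lambda ctx: not ctx["blackout_active"]),
    ("cooldown",      lambda ctx: ctx["hours_since_last_close"] >= ctx["required_cooldown_hours"]),
    ("momentum",      lambda ctx: not ctx["momentum_4h_deteriorating"]),
    ("confidence",    lambda ctx: ctx["council_confidence"] >= ctx["min_confidence"]),
]

def check_entry_gates(ctx: dict) -> list[GateResult]:
    return [GateResult(name, predicate(ctx)) for name, predicate in GATES]

def entry_allowed(ctx: dict) -> bool:
    # All 7 must pass -- not 6 of 7, not 5 of 7.
    return all(result.passed for result in check_entry_gates(ctx))
```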


The Stack: Railway, Neon, FastAPI

The backend runs on Railway as a single Python service (though the public API has since been split out into a second service). We use Neon PostgreSQL with SQLAlchemy async sessions throughout -- no synchronous DB calls exist in the hot path. Every write is an async session commit.

Alembic handles database migrations. The migration discipline we now follow came out of several painful lessons during those 3 weeks. Early in the build, we shipped code that referenced new database columns before the migration creating those columns had been applied in production. The result was that the service started successfully on Railway, the health check passed, and the first time the new code path executed, it crashed with UndefinedColumnError.

The rule we now enforce: code that references a new column and the migration creating that column must ship in the same commit. Split deploys are banned. We also enforce that alembic heads shows exactly 1 head before any new migration. Multiple heads means Railway's startup command (alembic upgrade head) will fail silently on one branch.
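A small CI guard along these lines can enforce the single-head rule (a sketch, assuming Alembic is installed and configured at the repo root):

```python
# Fail the build if the Alembic history has more than one head revision.
import subprocess

def test_alembic_has_exactly_one_head() -> None:
    result = subprocess.run(
        ["alembic", "heads"], capture_output=True, text=True, check=True
    )
    heads = [line for line in result.stdout.splitlines() if line.strip()]
    assert len(heads) == 1, f"expected exactly 1 Alembic head, found {len(heads)}: {heads}"
```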

Sentry is wired for error tracking. One lesson we learned: Sentry does not capture errors that are silently swallowed by broad except blocks. Early versions of several provider integrations had except Exception: pass patterns that were eating errors without surfacing them. Adding structured logging before the swallow statement and setting up Sentry capture on critical paths fixed this.
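A before/after sketch of that pattern -- fetch_from_provider is a placeholder, and the log call uses %-format per the logging rule covered below:

```python
import logging
import sentry_sdk

logger = logging.getLogger(__name__)

def fetch_with_visibility(fetch_from_provider):
    # Before the fix, this body was effectively `except Exception: pass`,
    # which hid provider failures from both the logs and Sentry.
    try:
        return fetch_from_provider()
    except Exception as exc:
        # Log with a traceback (stdlib %-format)...
        logger.exception("provider fetch failed: %s", exc)
        # ...and push it to Sentry explicitly, since swallowed exceptions never reach it.
        sentry_sdk.capture_exception(exc)
        return None
```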


The camelCase vs snake_case Incident

This deserves its own section because it wasted a non-trivial amount of time.

With Pydantic models configured for alias serialization, FastAPI outputs camelCase field names when response_model_by_alias=True is set. If you write a TypeScript interface on the frontend based on the Pydantic model definition (which uses snake_case), you get a mismatch that is invisible while the array is empty and breaks everything when real data arrives.
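A minimal reproduction of the mismatch, assuming Pydantic v2's to_camel alias generator (the exact alias mechanism in production may differ):

```python
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

class CouncilVerdict(BaseModel):
    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)

    final_action: str
    council_confidence: float

verdict = CouncilVerdict(final_action="BUY", council_confidence=0.84)
print(verdict.model_dump(by_alias=True))
# {'finalAction': 'BUY', 'councilConfidence': 0.84}
# A TypeScript interface written from the class above ({ final_action, council_confidence })
# will silently miss every field once real data arrives.
```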

The fix is simple: curl the actual endpoint and compare the JSON keys against the TypeScript interface before committing. The rule we now follow: never assume field names from the model definition. Always verify against the actual API response.

This is the kind of lesson that is obvious in retrospect and costs hours when you learn it the hard way.


The Logger Safety Issue

Python's standard library logging module uses %-style string formatting. Structlog, a popular alternative, uses keyword arguments. They are not interchangeable.

Early in the build, logging code was sometimes written as:

logger.info("council_verdict", confidence=0.84, regime="WHALE_ACCUMULATION")

This looks reasonable. On a structlog logger, it works. On a stdlib logger, it raises a TypeError. In our codebase, these calls sat inside try/except blocks that caught the TypeError, which meant the log line appeared to work (no crash) but silently produced no output.

The fix was two-part: a test that AST-scans every Python file for keyword arguments on stdlib logger calls, and a hard rule that all logging in the codebase uses %-format:

logger.info("council_verdict confidence=%s regime=%s", 0.84, "WHALE_ACCUMULATION")

The test runs on every push to main and has caught regressions several times.
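A simplified version of such a check -- this sketch only catches calls on a variable literally named logger and allows the stdlib's own keyword arguments, so the real test is necessarily more thorough:

```python
# Simplified AST scan: flag keyword arguments on calls that look like stdlib logger calls.
import ast
from pathlib import Path

LOG_METHODS = {"debug", "info", "warning", "error", "exception", "critical"}
STDLIB_OK_KWARGS = {"exc_info", "stack_info", "stacklevel", "extra"}

def find_bad_logger_calls(source: str, filename: str) -> list[str]:
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in LOG_METHODS
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "logger"
        ):
            bad = [kw.arg for kw in node.keywords if kw.arg not in STDLIB_OK_KWARGS]
            if bad:
                violations.append(f"{filename}:{node.lineno}: unexpected kwargs {bad}")
    return violations

def test_no_structlog_kwargs_on_stdlib_loggers() -> None:
    violations: list[str] = []
    for path in Path("app").rglob("*.py"):  # "app" is a placeholder for the backend package root
        violations += find_bad_logger_calls(path.read_text(), str(path))
    assert not violations, "\n".join(violations)
```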


2,736 Tests and What They Actually Verify

Our test suite has grown to 2,736 tests across the backend codebase. A number like this sounds impressive, but tests verify code correctness, not feature correctness. They do not tell you whether the Ghost Trader is taking good trades -- they tell you whether the code paths work as written.

The test categories that matter most for a trading system:

Unit tests for signal computation: Does the MVRV Z-Score calculator return the right value given known inputs? Does the ATR calculation match the expected output for a given OHLCV series?

Integration tests for entry gates: Do all 7 gates independently block entries in conditions where they should? Do all 7 pass in conditions where they should? These are the tests that caught the most bugs.

Structural consistency tests: No raw signal keys in user-facing output. No structlog kwargs on stdlib loggers. No imports from frozen modules in new asset code. These tests enforce architectural rules automatically.

Recovery tests: Does the system correctly hydrate all state from the database after a restart? This is the test category that is easiest to skip and most expensive to learn the hard way.
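As an illustration of that last category, a recovery test might look roughly like this; TradingEngine, hydrate_from_db, seed_open_trade, and the db_session fixture are all hypothetical stand-ins, not the actual codebase:

```python
# Hypothetical sketch of a recovery test -- every name below is a stand-in.
# Requires pytest-asyncio for async test support.
import pytest

@pytest.mark.asyncio
async def test_open_trade_survives_restart(db_session):
    # Persist an open trade, as if the process died mid-position.
    trade = await seed_open_trade(db_session, entry_price=64_250.0)

    # A fresh engine instance simulates the process coming back up on Railway.
    engine = TradingEngine(db_session)
    await engine.hydrate_from_db()

    assert engine.open_trade is not None
    assert engine.open_trade.entry_price == trade.entry_price
```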


Key Takeaways

6 specialized Claude agents run in parallel (asyncio.gather), all completing in 3-4 seconds.

30 signals across 6 categories feed an 8-state HMM regime classifier that runs before any agent sees the data.

All 7 entry gates must pass before a trade executes -- the system is designed for quality over quantity.

Alembic migration discipline: code referencing new columns ships in the same commit as the migration. Split deploys cause UndefinedColumnError on Railway restarts.

Logger safety: stdlib %-format only. Keyword arguments on stdlib loggers raise a TypeError that broad try/except blocks can swallow silently.

camelCase vs snake_case: always verify against the actual API JSON response, never against the Pydantic model definition.

Tests verify code correctness, not trade quality. Recovery tests and structural consistency tests matter most.


See What We Built

The track record from the Ghost Trader is publicly available at /track-record. Every trade, every signal state, every council verdict is recorded. The live dashboard shows the current council state updated every 5 minutes.

For a higher-level overview of the system and the team behind it, the about page covers the full picture.

If you are building something similar and want to compare notes, the AIOKA blog documents the lessons as we learn them.


*This article is for informational purposes only and does not constitute financial advice. Past performance of paper trading systems does not guarantee future results with real capital.*
