There’s a thing I do sometimes that feels slightly ridiculous until it works.

I take a problem, write a detailed brief, and then throw it at three different AI models simultaneously, like summoning a panel of advisors who will never meet each other, never argue in real time, and never let ego get in the way of the answer.

Last month, it worked better than I expected.


The Setup

Corvus v2 is my momentum trading system. Thirteen exit rules, each one trying to answer the same basic question: when do you get out of a trade?

After building the initial version and running a backtest across 195 trades, 60 symbols, April through December 2025, the numbers came back… okay. Win rate: 54.9%. Average gain: +3.98%. Not bad for a first pass. Not good enough to trade real money on.

The question was: what do I change?

Thirteen rules. Each with parameters. Some interacting with others. Thousands of possible combinations. I could brute-force it, sure: run a grid search and see what the data spits out. But I wanted to understand why certain rules were misbehaving before I started tweaking numbers.
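For scale, the brute-force option would look something like a plain grid search over rule parameters. A minimal sketch; the parameter names and the scoring stub are invented for illustration, and the real backtest is far heavier than this dummy function:

```python
from itertools import product

# Hypothetical parameter grid for two of the thirteen rules (names invented
# for illustration; each real rule has its own knobs, so the full grid is
# orders of magnitude larger than these nine combinations).
grid = {
    "trailing_stop_pct": [0.08, 0.10, 0.12],
    "time_stop_days": [10, 12, 15],
}

def run_backtest(params):
    # Stand-in for the real backtest: return a score per combination.
    # This is a dummy formula purely so the loop is runnable.
    return params["trailing_stop_pct"] * 100 - params["time_stop_days"] * 0.1

keys = list(grid)
results = []
for values in product(*(grid[k] for k in keys)):
    params = dict(zip(keys, values))
    results.append((run_backtest(params), params))

best_score, best_params = max(results, key=lambda r: r[0])
```

Nine combinations here; with thirteen rules and several parameters each, the grid explodes combinatorially, which is exactly why understanding the rules first mattered.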

So instead of doing what I usually do (ask one model, get one answer, argue with it for an hour), I tried something different.


The Experiment

I spawned three subagents in parallel. Same data. Same backtest results. Same question: analyze these 13 exit rules and tell me what to fix.

The three models: Kimi K2.5, Qwen 3.5 397B, and Claude Opus.

I gave them the full breakdown: rule-by-rule statistics, win rates, false positive rates, average holding periods, everything. Then I stepped back and let them work.
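The fan-out itself is mechanically simple. A sketch of the pattern, with a placeholder `ask_model` standing in for the three real API clients (the model name strings are illustrative, not actual API identifiers):

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["kimi-k2.5", "qwen-3.5-397b", "claude-opus"]  # illustrative names
BRIEF = "Rule-by-rule stats: win rates, false-positive rates, hold times..."

def ask_model(model_name, brief):
    # Placeholder: a real implementation would call each model's own API
    # with the identical brief and return its analysis text.
    return f"{model_name}: analysis of {len(brief)}-char brief"

# Send the same brief to all three models at once. They never see each
# other's answers, which is the whole point of the setup.
with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    answers = dict(zip(MODELS, pool.map(lambda m: ask_model(m, BRIEF), MODELS)))
```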

What came back surprised me. Not because any single model was brilliant. Because of where they agreed and where they didn’t.


Where They All Said the Same Thing

Two rules generated unanimous verdicts.

R2 – Momentum Breakdown was broken. All three models flagged it independently. The false positive rate was 60.4%, meaning more than half of its exit signals fired on positions that went on to recover. It was kicking the system out of good positions on noise. All three said: fix this first.
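For concreteness, here is one way a false-positive rate like R2's 60.4% can be measured: count exits where the position kept running afterward. The recovery criterion below (price up more than 2% within the next 5 bars) is my assumption for illustration, not necessarily the one Corvus uses:

```python
def false_positive_rate(exits, prices, lookahead=5, recovery=0.02):
    """exits: bar indices where the rule fired an exit signal.
    prices: the position's full price series."""
    fp = 0
    for i in exits:
        window = prices[i + 1 : i + 1 + lookahead]
        # If price kept climbing past the recovery threshold after the
        # exit, the signal was premature: count it as a false positive.
        if window and max(window) > prices[i] * (1 + recovery):
            fp += 1
    return fp / len(exits) if exits else 0.0

prices = [100, 103, 101, 106, 110, 108, 109]
rate = false_positive_rate([1, 5], prices)  # exit at 103 recovered to 110
```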

R10 – Parabolic Deceleration was untouchable. Again, unanimous. Best rule in the system. Clear signal, good timing, didn’t fight with the others. “Don’t touch R10” became a kind of refrain. When three different models with different architectures, training data, and reasoning styles all arrive at the same conclusion independently, that’s not coincidence. That’s signal.

Consensus like this is worth a lot. It means the answer isn’t ambiguous. It means even if I don’t know exactly why the rule works or doesn’t, the data is pointing clearly enough that multiple independent reads all land in the same place.


Where They Disagreed

This is where it got interesting.

R1 – Trailing Stop sparked the biggest fight. One model wanted to tighten it to 8%, arguing the system was giving back too much on winning trades. Another said 12%, prioritizing staying in momentum plays longer. The third said forget fixed percentages entirely: make it adaptive based on volatility. Three different philosophies, three different numbers, zero overlap.
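The adaptive camp's idea could look something like the sketch below: scale the trailing stop with recent volatility instead of fixing it at 8% or 12%. The multiplier, window, floor, and cap are all illustrative values, not Corvus's actual parameters, and the volatility proxy is deliberately cheap:

```python
def adaptive_trail_pct(closes, window=14, mult=2.5, floor=0.06, cap=0.15):
    # Mean absolute close-to-close move as a cheap volatility proxy
    # (a real implementation would more likely use ATR from high/low/close).
    moves = [abs(closes[i] - closes[i - 1]) / closes[i - 1]
             for i in range(1, len(closes))]
    recent = moves[-window:]
    vol = sum(recent) / len(recent)
    # Clamp so a quiet tape can't shrink the stop to nothing and a wild
    # one can't widen it into uselessness.
    return min(cap, max(floor, mult * vol))

calm = [100 + 0.2 * i for i in range(20)]  # slow, quiet uptrend
wild = [100, 105] * 10                      # choppy 5% swings
```

In a calm series the stop clamps to the floor; in a choppy one it widens, which is the whole argument for the adaptive approach.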

R7 – Time Stop (exit a trade if it hasn’t moved after N days) saw similar variance. Recommendations ranged from 10 days to 15 days. More interestingly, each model had a different idea of when the time stop should even apply: some wanted to filter it by market conditions, others wanted it flat.
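A flat time stop in the spirit of R7 is only a few lines. The 12-day value below splits the models' 10–15 day range and the 3% "hasn't moved" threshold is my own placeholder:

```python
def time_stop_triggered(entry_price, closes_since_entry,
                        max_days=12, min_move=0.03):
    if len(closes_since_entry) < max_days:
        return False  # not enough time elapsed yet
    change = abs(closes_since_entry[max_days - 1] / entry_price - 1)
    return change < min_move  # still flat after max_days: exit
```

The filtered variants the models proposed would wrap this in a market-condition check, which is where the disagreement lived.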

Regime awareness was the wildest divergence. One model was enthusiastic: add a SPY > 50-DMA multiplier to several rules, tune them tighter in bearish regimes. Another was skeptical, noting that regime filters often overfit in backtests. The third was somewhere in between: add it, but only to R2 and R1, not system-wide.
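As I understand the proposal, the disputed filter flags the market bullish when SPY closes above its 50-day moving average and tightens stops otherwise. A sketch, with a hypothetical 0.75 tightening factor (that multiplier was itself one of the parameters left to backtesting):

```python
def bullish_regime(spy_closes, window=50):
    if len(spy_closes) < window:
        return True  # not enough history: default to no tightening
    dma = sum(spy_closes[-window:]) / window
    return spy_closes[-1] > dma

def regime_adjusted_stop(base_stop_pct, spy_closes, tighten=0.75):
    # Full-width stop in bullish regimes, tightened stop in bearish ones.
    if bullish_regime(spy_closes):
        return base_stop_pct
    return base_stop_pct * tighten

rising = list(range(400, 460))      # SPY trending up: above its 50-DMA
falling = list(range(460, 400, -1)) # SPY trending down: below its 50-DMA
```

Whether this wraps every rule or just R1 and R2 is precisely the question the three models couldn't settle.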

Three minds. One problem. Genuinely different answers.


What I Did With the Mess

I didn’t flip a coin. I didn’t go with the model I trusted most. I used the disagreements as a map.

Where all three agreed → implement immediately, no debate.
Where two agreed and one didn’t → lean toward the majority, but stay curious.
Where all three disagreed → run the backtests and let the data settle it.
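Mechanically, that triage rule is just a vote count. A minimal sketch (the recommendation strings are placeholders):

```python
from collections import Counter

def triage(votes):
    """votes: one recommendation string per model."""
    top, count = Counter(votes).most_common(1)[0]
    if count == 3:
        return f"implement: {top}"          # unanimous: no debate
    if count == 2:
        return f"lean toward: {top}"        # majority: verify in backtest
    return "backtest all options"           # three-way split: data decides
```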

So that’s what I did. Five rounds of iteration, v1 through v5, each one incorporating the consensus changes first, then running experiments on the disputed parameters. Fix R2. Protect R10. Then test the trailing stop options one at a time. Then the time stop. Then the regime filter.

After round five: win rate 56.4%, average gain +6.16%.

That’s a +55% improvement in average gain compared to where I started. Not from a single clever insight. From a structured process of running disagreements through data until they resolved.


The Actual Lesson

Here’s what I keep thinking about.

A research team of humans is useful, but it comes with overhead: ego, groupthink, office politics, the person who talks loudest in the meeting. A single AI model is clean and fast, but it gives you one perspective, and you don’t always know where its blind spots are.

Three models running in parallel gives you something in between. They don’t have egos. They can’t influence each other. When they disagree, it’s not because one is trying to win; it’s because the problem genuinely has multiple reasonable answers, and the data is ambiguous enough to support them.

The disagreements aren’t a failure of the process. They’re the most useful output of it. They tell you exactly where you need more data, more backtesting, more care.

R1’s trailing stop? The models disagreed because trailing stop tightness is genuinely context-dependent. There’s no universally correct answer. That ambiguity was real information.

What I’m calling “multi-model consensus” is basically just triangulation. Surveyors do it. Scientists replicate experiments. Journalists seek multiple sources. None of this is new.

What’s new is that I can do it in ten minutes for free, at 2am, with three models that have read more about financial markets than I ever will. And then I can run the backtests myself and let the data be the tiebreaker.


Corvus v2 is still in development. But it’s better than it was โ€” and I know why it’s better, not just that it is.

That feels like the right way to build something you’re going to trust with real money.

🐦