The Binary Exhaustion Lab is a research workbench for studying whether a deterministic candle-exhaustion screen reader, optionally cross-checked by a multimodal LLM, can predict the directional outcome of the next N bars on a single instrument with a measurable, calibrated edge. Every prediction is auto-resolved against the actual OHLC close and recorded; the lab learns its own win rate over time rather than asserting one.
This is simulation, not advice. No real orders are placed. Wagers are paper, bankroll is fictional, all P&L is for study. The interesting question the lab tries to answer rigorously is: does the heuristic stack have an edge, how big is it, and how does its confidence relate to reality?
On each observation the lab executes a six-stage pipeline. The first stage
is the original deterministic engine; the rest are the "hybrid_v1"
additions distilled from a four-way agent competition held on this lab.
| # | Stage | What it does | Why it's there |
|---|---|---|---|
| 1 | Engine prior | Runs binary_exhaustion_engine.analyze_binary_exhaustion:
consecutive-candle counter, MA stacking (9/20/50/200), distance-from-mean
in ATR units, ZigZag swing context. Outputs a bull-probability in
[0,1]. |
Unmodified domain heuristic, kept as the prior so all later additions are auditable changes. |
| 2 | Regime detection | Computes Kaufman's Efficiency Ratio (ER) on the last 10 and 30 closes;
labels the regime trending,
mixed, or choppy
and produces two smoothly interpolated feature weights
trend_weight and revert_weight. |
Mean-reversion signals are noise in strong trends; momentum signals are noise in chop. The regime gate down-weights each accordingly. |
| 3 | Logit-space ensemble | Adds five regime-gated independent features to the prior: RSI(14) reversion, ATR-normalized stretch from EMA(20), 3-bar ATR momentum, last-bar wick rejection, and a higher-timeframe MTF bias (EMA9 vs EMA21). All contributions are summed in logit space, then squashed through a sigmoid. | No single feature can saturate the estimate (an 88% honesty clamp
caps heuristic-only confidence). Each feature is auditable in the
factors[] response field. |
| 4 | Vision LLM fold | If a screenshot is provided, a multimodal model is asked to reply
with Direction / Confidence / Pattern / Reasoning in a
structured block. Its vote is folded into the ensemble with bounded
weight (~0.35 in logit space) — never as a hard override. |
The LLM sees the picture (gaps, wicks, structure breaks) the deterministic engine cannot. But it is also the noisiest input, so its weight is bounded and clamped. |
| 5 | PAV isotonic calibration | Every resolved (won/lost) observation in MongoDB updates an isotonic regression that maps raw heuristic confidence to realized win rate. Below 25 resolved samples the calibrator is identity; above that it becomes the truth-teller. | "75% confidence" actually meaning 75% over the long run requires a non-parametric, monotone map from claim to reality. Pool-Adjacent-Violators is the standard estimator. |
| 6 | Drawdown-braked ¼-Kelly sizing | Wager sizing inputs are the per-confidence-band Wilson lower bound (so a thin bucket can't claim a fat edge) and the Kelly fraction is multiplied by a drawdown brake (linear ramp, floor 0.25) and a losing-streak damper (×0.5 after 3 consecutive losses). | Kelly assumes p is known. We estimate p. The brake accepts that estimation, which prevents catastrophic ramp-up during structural regime shifts the engine hasn't adapted to yet. |
ER(n) = |close[t] − close[t−n]| / Σ |close[i] − close[i−1]| (i = t−n+1 … t)
ER → 1 is a perfect trend, ER → 0 is pure chop.
We blend ER(10) and ER(30) at 60/40 and map the result to two weights
trend_weight ∈ [0.35, 1.0] and
revert_weight = 1 − (trend_weight − 0.35).
z = 0.55 · logit(p_engine)
+ ∑ contrib_i (each contrib gated by regime weight)
p = σ(z)
p ∈ [0.12, 0.88] (honesty clamp)
Each contrib_i is a small, bounded shift (typically ±0.2 to ±0.5
in logit units). The clamp prevents any combination of heuristics from
claiming more than 88% certainty — that level of confidence has to be
earned by calibration data, not asserted by an indicator.
z' = (1 − w) · logit(p) + w · logit(p_LLM) w = 0.35
Given resolved samples (c_i, y_i) with y_i ∈ {0,1}
sorted by claimed confidence c_i, PAV produces the
monotone non-decreasing function that minimizes squared error.
Algorithmically, sweep left to right and pool any adjacent pair whose right
value is below the left, repeat until monotone. We interpolate linearly
between fitted knots. Identity below 25 samples avoids overfitting.
p_lower(w, l, z=1.96)
= ((p + z²/2n) − z · √(p(1−p)/n + z²/4n²)) / (1 + z²/n)
where p = w/(w+l), n = w+l
This is the conservative true win rate of each confidence band given
n samples. A bucket with n < 50 is marked insufficient
and forces the floor wager regardless of how good p̂ looks.
f_full = (b · p − q) / b # b = payout ≈ 0.85, q = 1−p f = f_full · 0.25 · brake · streak # ¼-Kelly + brake brake = max(0.25, 1 − 2 · drawdown_pct) # 25% floor streak = 0.5 if losing_streak ≥ 3 else 1.0 size = clip(f, floor=0.001, cap=0.02) # 0.1% … 2% of bankroll
The single hardest thing about a confidence number is that it must be true. We make four explicit honesty commitments:
calibrator_active: false). We don't pretend we have an
isotonic fit when we don't.factors[] trail
including each contribution's signed magnitude. Anyone can
reconstruct why a verdict said what it said. Black box, this is not.Sizing answers one question: given everything I know about this confidence band's history, how much would a sane fractional-Kelly bettor risk? The lab's answer is roughly:
hybrid_engine.run_self_consistency()
utility exists to fan out K diversified votes concurrently for
variance reduction, but it costs ~Kx more per observation and is
currently disabled by default.Everything in sections 1–6 above is about directional conviction: "the next bar closes UP (or DOWN)". But every directional engine has the same failure mode — the regime where price is genuinely undecided. On a chart where the moving averages have collapsed into a squeeze and price is ribboning sideways, even an 80%-confidence directional call is, in information-theoretic terms, noise.
The CHOP BUSTER card sitting below the directional wager on the lab page is a deliberate second opinion designed for exactly that regime. It is a non-directional bet: instead of guessing whether price goes up or down, it asks "does price stay between two specific strikes for the next N bars?". When the regime detector (§3.1) reports compressed / squeeze / mixed, the chop-buster arms; otherwise it sits in a dimmed STANDBY state that shows the user what bracket it would propose if conditions flipped. It is never auto-charged.
The chop-buster is a synthetic short strangle expressed as two of our binary atoms with the sign flipped:
By Breeden–Litzenberger (1978), the risk-neutral density of the underlying is recoverable from the second derivative of a continuum of European call prices in strike. Carr–Madan (1998) generalises this: any payoff function on the terminal underlying can be replicated as a portfolio of digitals at every strike. A short strangle is the simplest non-trivial case of this — three regions of payoff (lose, win, lose), constructed from exactly two digital sells. We're not pricing options; we are using the payoff structure as a hypothesis-test instrument.
The pstay figure on the card is not theoretical. It is an empirical, walk-forward sample: for the most recent 60 sliding windows of horizon bars in the current OHLC history, count what fraction stayed strictly inside a ±0.8·ATR bracket around the window's starting close. If the bracket got blown through, count it as a loss; otherwise a win. The card publishes that number front-and-centre, e.g.:
p(stay inside) = 7% · 60-window historical sample · horizon 3 × 1m
A 7% number is loud and embarrassing on purpose. It is telling the user exactly what is true: this specific bracket, at these specific strikes, on this specific chart, gets killed 93% of the time historically — do not fund it with real conviction. The wager sizer agrees and floors the position to 0.1% of bankroll. The card is "ARMED" only in the sense that the regime qualifies (sideways), not in the sense that the math endorses the trade. This distinction is intentional: showing the user the regime-detector state separately from the historical-edge state turns a single confusing recommendation into two pieces of information they can reason about.
When the historical sample does support a bet (pstay ≥ 60% on the 60-window sample), the chop-buster sizes at ½-Kelly (vs. the directional engine's ¼-Kelly) and caps at 1% of bankroll (vs. the directional engine's 2%). The two reasons:
The sizer formula:
b = 0.85 (binary-style payout)
f_full = (b·p_stay − (1 − p_stay)) / b (full-Kelly fraction)
f_half = max(floor_pct, min(cap_pct, (½-Kelly, clamped)
f_full · 0.5))
size$ = bankroll · f_half
The user is reminded by the UI that the directional wager and the chop-buster are two independent opinions on the same chart, not two legs of a hedge. They draw from the same bankroll but are not combined. A typical pattern when the engine fires both: