Binary Exhaustion Lab — hybrid_v1METHODOLOGY

A short, honest whitepaper · 2026-06-11 · ~6 min read

Contents

What this is, and what it is not
The pipeline, top to bottom
The math, briefly
Calibration honesty
Wager sizing
Known limitations
CHOP BUSTER — the non-directional addendum
References

1. What this is, and what it is not

The Binary Exhaustion Lab is a research workbench for studying whether a deterministic candle-exhaustion screen reader, optionally cross-checked by a multimodal LLM, can predict the directional outcome of the next N bars on a single instrument with a measurable, calibrated edge. Every prediction is auto-resolved against the actual OHLC close and recorded; the lab learns its own win rate over time rather than asserting one.

This is simulation, not advice. No real orders are placed. Wagers are paper, bankroll is fictional, all P&L is for study. The interesting question the lab tries to answer rigorously is: does the heuristic stack have an edge, how big is it, and how does its confidence relate to reality?

2. The pipeline, top to bottom

On each observation the lab executes a six-stage pipeline. The first stage is the original deterministic engine; the rest are the "hybrid_v1" additions distilled from a four-way agent competition held on this lab.

#	Stage	What it does	Why it's there
1	Engine prior	Runs `binary_exhaustion_engine.analyze_binary_exhaustion`: consecutive-candle counter, MA stacking (9/20/50/200), distance-from-mean in ATR units, ZigZag swing context. Outputs a bull-probability in `[0,1]`.	Unmodified domain heuristic, kept as the prior so all later additions are auditable changes.
2	Regime detection	Computes Kaufman's Efficiency Ratio (ER) on the last 10 and 30 closes; labels the regime trending, mixed, or choppy and produces two smoothly interpolated feature weights `trend_weight` and `revert_weight`.	Mean-reversion signals are noise in strong trends; momentum signals are noise in chop. The regime gate down-weights each accordingly.
3	Logit-space ensemble	Adds five regime-gated independent features to the prior: RSI(14) reversion, ATR-normalized stretch from EMA(20), 3-bar ATR momentum, last-bar wick rejection, and a higher-timeframe MTF bias (EMA9 vs EMA21). All contributions are summed in logit space, then squashed through a sigmoid.	No single feature can saturate the estimate (an 88% honesty clamp caps heuristic-only confidence). Each feature is auditable in the `factors[]` response field.
4	Vision LLM fold	If a screenshot is provided, a multimodal model is asked to reply with `Direction / Confidence / Pattern / Reasoning` in a structured block. Its vote is folded into the ensemble with bounded weight (~0.35 in logit space) — never as a hard override.	The LLM sees the picture (gaps, wicks, structure breaks) the deterministic engine cannot. But it is also the noisiest input, so its weight is bounded and clamped.
5	PAV isotonic calibration	Every resolved (won/lost) observation in MongoDB updates an isotonic regression that maps raw heuristic confidence to realized win rate. Below 25 resolved samples the calibrator is identity; above that it becomes the truth-teller.	"75% confidence" actually meaning 75% over the long run requires a non-parametric, monotone map from claim to reality. Pool-Adjacent-Violators is the standard estimator.
6	Drawdown-braked ¼-Kelly sizing	Wager sizing inputs are the per-confidence-band Wilson lower bound (so a thin bucket can't claim a fat edge) and the Kelly fraction is multiplied by a drawdown brake (linear ramp, floor 0.25) and a losing-streak damper (×0.5 after 3 consecutive losses).	Kelly assumes p is known. We estimate p. The brake accepts that estimation, which prevents catastrophic ramp-up during structural regime shifts the engine hasn't adapted to yet.

3. The math, briefly

3.1 Kaufman Efficiency Ratio (regime detector)

ER(n) = |close[t] − close[t−n]|  /  Σ |close[i] − close[i−1]|     (i = t−n+1 … t)

ER → 1 is a perfect trend, ER → 0 is pure chop. We blend ER(10) and ER(30) at 60/40 and map the result to two weights trend_weight ∈ [0.35, 1.0] and revert_weight = 1 − (trend_weight − 0.35).

3.2 Logit-space ensemble

z   = 0.55 · logit(p_engine)
    + ∑  contrib_i  (each contrib gated by regime weight)
p   = σ(z)
p   ∈ [0.12, 0.88]                   (honesty clamp)

Each contrib_i is a small, bounded shift (typically ±0.2 to ±0.5 in logit units). The clamp prevents any combination of heuristics from claiming more than 88% certainty — that level of confidence has to be earned by calibration data, not asserted by an indicator.

3.3 Vision fold (when present)

z'  = (1 − w) · logit(p)  +  w · logit(p_LLM)     w = 0.35

3.4 Pool-Adjacent-Violators (PAV) isotonic regression

Given resolved samples (c_i, y_i) with y_i ∈ {0,1} sorted by claimed confidence c_i, PAV produces the monotone non-decreasing function that minimizes squared error. Algorithmically, sweep left to right and pool any adjacent pair whose right value is below the left, repeat until monotone. We interpolate linearly between fitted knots. Identity below 25 samples avoids overfitting.

3.5 Wilson score lower bound (per-bucket truth)

p_lower(w, l, z=1.96)
    = ((p + z²/2n) − z · √(p(1−p)/n + z²/4n²)) / (1 + z²/n)
   where  p = w/(w+l),  n = w+l

This is the conservative true win rate of each confidence band given n samples. A bucket with n < 50 is marked insufficient and forces the floor wager regardless of how good p̂ looks.

3.6 Fractional Kelly with drawdown brake

f_full     = (b · p − q) / b              # b = payout ≈ 0.85, q = 1−p
f          = f_full · 0.25 · brake · streak       # ¼-Kelly + brake
brake      = max(0.25, 1 − 2 · drawdown_pct)      # 25% floor
streak     = 0.5  if losing_streak ≥ 3 else 1.0
size       = clip(f, floor=0.001, cap=0.02)       # 0.1% … 2% of bankroll

4. Calibration honesty

The single hardest thing about a confidence number is that it must be true. We make four explicit honesty commitments:

The 88% logit clamp means no purely-heuristic configuration can claim >88% certainty. Higher confidence has to come from real resolved-sample evidence via PAV.
The per-bucket Wilson lower bound means a 75% point estimate on n=8 samples becomes p_lower≈0.41, which crosses below the Kelly break-even and forces a floor wager. Thin evidence cannot fund a fat bet.
The identity calibrator below 25 samples is openly displayed (calibrator_active: false). We don't pretend we have an isotonic fit when we don't.
Every observation stores the full factors[] trail including each contribution's signed magnitude. Anyone can reconstruct why a verdict said what it said. Black box, this is not.

5. Wager sizing in plain English

Sizing answers one question: given everything I know about this confidence band's history, how much would a sane fractional-Kelly bettor risk? The lab's answer is roughly:

No edge proven (insufficient samples, negative Kelly, or a neutral verdict) → 0.1% of bankroll. That's the "keep the muscle memory alive" floor.
Edge proven (Wilson p_lower > break-even at the assumed 0.85 payout) → ¼-Kelly fraction, capped at 2% of bankroll. ¼-Kelly is widely accepted as the prudent multiplier because full Kelly's growth is fragile to p-estimation error.
In a drawdown → linearly ramp size down to a 25% floor at a 50% drawdown. Compounding losses are the failure mode that ends most martingale-shaped systems; we explicitly prevent the death spiral.
After 3 losses in a row → cut sizing in half. A short streak is a signal that the regime has shifted in a way the engine hasn't adapted to yet.

6. Known limitations

Edge is symbol-specific. The PAV calibrator pools across the session db; if the same instrument trades very differently across market sessions (NY open vs Asia overnight) the calibrator is an average. A per-symbol calibrator is a clear next step.
Payout is approximated at 0.85. Real venue payouts vary; the lab's wager-sizing math assumes the conservative end of common binary-option payouts. Override-able via config.
LLM variance is bounded but real. A single vision call is noisy by construction; the hybrid_engine.run_self_consistency() utility exists to fan out K diversified votes concurrently for variance reduction, but it costs ~Kx more per observation and is currently disabled by default.
Only directional outcome is graded. A 0.01% "win" and a 3% "win" are treated identically. Magnitude-weighted scoring is on the roadmap.
Auto-resolution depends on OHLC fidelity. If the bar provider lags or backfills, resolution can be deferred a tick or two. The background resolver is idempotent so this is harmless, just slow.

7. CHOP BUSTER — the non-directional addendum

Everything in sections 1–6 above is about directional conviction: "the next bar closes UP (or DOWN)". But every directional engine has the same failure mode — the regime where price is genuinely undecided. On a chart where the moving averages have collapsed into a squeeze and price is ribboning sideways, even an 80%-confidence directional call is, in information-theoretic terms, noise.

The CHOP BUSTER card sitting below the directional wager on the lab page is a deliberate second opinion designed for exactly that regime. It is a non-directional bet: instead of guessing whether price goes up or down, it asks "does price stay between two specific strikes for the next N bars?". When the regime detector (§3.1) reports compressed / squeeze / mixed, the chop-buster arms; otherwise it sits in a dimmed STANDBY state that shows the user what bracket it would propose if conditions flipped. It is never auto-charged.

7.1 The math: binary atoms as a complete basis

The chop-buster is a synthetic short strangle expressed as two of our binary atoms with the sign flipped:

Atom A: SOLD digital — pays nothing if price > K_upper, full premium if not.
Atom B: SOLD digital — pays nothing if price < K_lower, full premium if not.

By Breeden–Litzenberger (1978), the risk-neutral density of the underlying is recoverable from the second derivative of a continuum of European call prices in strike. Carr–Madan (1998) generalises this: any payoff function on the terminal underlying can be replicated as a portfolio of digitals at every strike. A short strangle is the simplest non-trivial case of this — three regions of payoff (lose, win, lose), constructed from exactly two digital sells. We're not pricing options; we are using the payoff structure as a hypothesis-test instrument.

7.2 Why this is honest about its own edge

The p_stay figure on the card is not theoretical. It is an empirical, walk-forward sample: for the most recent 60 sliding windows of horizon bars in the current OHLC history, count what fraction stayed strictly inside a ±0.8·ATR bracket around the window's starting close. If the bracket got blown through, count it as a loss; otherwise a win. The card publishes that number front-and-centre, e.g.:

p(stay inside) = 7%  ·  60-window historical sample  ·  horizon 3 × 1m

A 7% number is loud and embarrassing on purpose. It is telling the user exactly what is true: this specific bracket, at these specific strikes, on this specific chart, gets killed 93% of the time historically — do not fund it with real conviction. The wager sizer agrees and floors the position to 0.1% of bankroll. The card is "ARMED" only in the sense that the regime qualifies (sideways), not in the sense that the math endorses the trade. This distinction is intentional: showing the user the regime-detector state separately from the historical-edge state turns a single confusing recommendation into two pieces of information they can reason about.

7.3 Sizing: ½-Kelly, half the cap, two reasons

When the historical sample does support a bet (p_stay ≥ 60% on the 60-window sample), the chop-buster sizes at ½-Kelly (vs. the directional engine's ¼-Kelly) and caps at 1% of bankroll (vs. the directional engine's 2%). The two reasons:

Asymmetric payoff geometry. Selling a strangle wins small and often, but the loss when the bracket is blown is up to the full premium at once. This is the classic short-vol fat-tail problem and the standard Mertonian remedy is a smaller bet size, not a higher one. Halving the cap is the conservative response.
Regime-detector half-life. The "compressed" label that arms the card has a half-life: the engine notices the squeeze after it has already begun. The bet is therefore being taken with a tailwind that may already be fading. Smaller sizing absorbs this without blowing the bankroll.

The sizer formula:

b      = 0.85                              (binary-style payout)
f_full = (b·p_stay − (1 − p_stay)) / b     (full-Kelly fraction)
f_half = max(floor_pct, min(cap_pct,       (½-Kelly, clamped)
             f_full · 0.5))
size$  = bankroll · f_half

7.4 Two opinions, not two bets

The user is reminded by the UI that the directional wager and the chop-buster are two independent opinions on the same chart, not two legs of a hedge. They draw from the same bankroll but are not combined. A typical pattern when the engine fires both:

Strong directional (≥65% conf) + non-ranging regime → directional ARMED, chop-buster STANDBY (educational). Take the directional.
Strong directional + ranging regime → directional reports a number but the engine has admitted the regime is sideways; chop-buster arms. The two opinions are explicitly contradicting; the user must decide whether to trust the directional or the regime detector. Sections 7.2 + 3.1 tell the user how to read this.
Weak directional + ranging regime → directional sized to floor, chop-buster ARMED. The lab is saying: "I have no directional view, and the regime favors selling vol. Consider the bracket — if its historical p_stay is high enough."

7.5 Limits specific to the chop buster

p_stay is a 60-window estimate. On a 1m chart with a thin OHLC history this is a small sample. The chop buster is most reliable on instruments and timeframes that have at least a few hundred bars in the lab's market data buffer.
±0.8·ATR is a heuristic width. On extremely low-ATR instruments (some OTC FX pairs) this becomes a sub-pip bracket that is trivially broken; p_stay will publish a tiny number and the floor-only sizing will protect the bankroll. Future versions will allow the user to widen the bracket explicitly.
Real binary-option venues quote different payouts for the "in" leg of a range bet. The 0.85 assumption is conservative for most listed venues; override-able via config when modelling a specific broker.
The chop-buster is never auto-charged. Auto-charge MC applies only to directional wagers. This is intentional: a non-directional bet whose downside is several multiples of its upside should require an explicit human commit, even in simulation.

References

Kaufman, P.J. (1995). Smarter Trading. McGraw-Hill. (Efficiency Ratio.)
Wilson, E. B. (1927). "Probable inference, the law of succession, and statistical inference." Journal of the American Statistical Association 22 (158): 209–212.
Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., Silverman, E. (1955). "An empirical distribution function for sampling with incomplete information." Annals of Mathematical Statistics 26: 641–647. (PAV / isotonic regression.)
Kelly, J. L. (1956). "A new interpretation of information rate." Bell System Technical Journal 35 (4): 917–926.
Wang, X. et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171. (Vote-strength ensembling of LLM outputs.)
Breeden, D. T., Litzenberger, R. H. (1978). "Prices of State-Contingent Claims Implicit in Option Prices." Journal of Business 51 (4): 621–651. (Risk-neutral density from option prices — the formal justification for treating digitals as the basis of all payoffs.)
Carr, P., Madan, D. (1998). "Towards a Theory of Volatility Trading." In Volatility: New Estimation Techniques for Pricing Derivatives (R. Jarrow, ed.), Risk Books, 417–427. (Static replication of arbitrary payoffs by portfolios of vanillas; foundation of the chop-buster atom.)
Merton, R. C. (1969). "Lifetime Portfolio Selection under Uncertainty: The Continuous-Time Case." Review of Economics and Statistics 51 (3): 247–257. (Why fat-tailed downside payoffs justify smaller Kelly fractions than the textbook full-Kelly answer.)