bankroll_pristine ≥ bankroll_realistic held strictly,
friction reconciled to one tenth of a basis point against the analytic
spread + slippage model. The infrastructure passed.
Live session: 6a2daaee03417038f3134394 · started 19:09:34Z ·
symbol USDSGD (OTC via Pocket Option screen capture) · horizon
120 s · ¼-Kelly · spread 8 bps · slippage 2 bps · execution delay 120 ms ·
bankroll start $100.
Phase 1 of the TRAIL Lab shipped a friction-honest backtest harness on real OHLC bars (Polygon, Databento). It worked. It measured the TRAIL coefficient on EURUSD, BTCUSD, NQ futures — instruments where we own a candle stream. The problem revealed itself the first time the operator pointed at a Pocket Option OTC chart and asked the same question: does the edge survive friction here?
We have no OHLC contract for Pocket Option's synthetic OTC feed. We never will. Pocket Option's OTC is a proprietary, broker-internal price series manufactured for weekend operation. The data does not exist outside the broker's tab. And yet — that is precisely the venue where retail binary operators trade, where the "80% hit rate" claim that motivated Phase 1 originated, where the engineer's intuition was formed. Building a friction lab that cannot evaluate the venue the question came from is academic nicety, not engineering.
The phase-1.5 thesis was simple: if the broker's chart is the only ground truth available, then the broker's chart — as pixels — must become the ground truth. The engine's input becomes the screen capture; the direction comes from a vision-language model reading the rendered candles; the close price comes from a second vision call at expiry; the rest of the TRAIL machinery (¼-Kelly sizing, asymmetric tilt, pristine/realistic double-entry book) stays bit-for-bit identical. The cost: ~3 seconds and one API call per leg open. The promise: "if you can see it, the lab can measure it."
This is a deliberate inversion. Phase 1 said "we trust the OHLC feed, we doubt the strategy". Phase 1.5 says "we doubt the perception layer too — and we are going to measure how much that costs."
A study without falsifiable hypotheses is a press release. We pre-registered the following five, in writing, before the operator ran a single tick. Each has a numeric acceptance criterion and a numeric rejection criterion.
| # | Hypothesis | Accept if… | Reject if… |
|---|---|---|---|
| H1 | Plumbing. The capture → vision → engine → straddle → resolve chain runs end-to-end on a real broker chart without operator intervention. | ≥ 10 legs resolved by the auto-loop, no manual intervention, no panic stop. | Operator has to click anything beyond arm + auto-loop. |
| H2 | Camera is not stuck. The second vision call (at expiry) reads a different price than the first. | ≥ 3 / 10 resolved legs have close_price ≠ entry_price by at least 1 pip. |
The model returns the same digits both times → it is summarising memory, not reading pixels. |
| H3 | Friction-honest invariant. Pristine and realistic books diverge, but the realistic one is never better. | bankroll_pristine ≥ bankroll_realistic on every recompute. |
Any tick where realistic > pristine. The friction model is signed wrong. |
| H4 | Friction is small but real. The cost paid matches the analytic spread + slippage charge. |
Realised friction within ±5 bps of (spread% + slip_bps/10000) × volume. |
Charge larger or smaller by more than 0.5% of volume → bookkeeping bug. |
| H5 | Atomic resolution. Each leg is settled exactly once, even with multiple workers competing. | Every resolved leg has a single resolver_pid; no leg appears twice in the cashflow. |
Any leg with double-credit, or a CAS conflict that produced two resolved entries. |
Note what we deliberately did not hypothesise: that the vision model would beat 50/50 on direction. That is a strategy claim, not an infrastructure claim. Phase 1.5 is a measurement instrument; tuning the strategy is what the instrument is for.
The operator opened the Pocket Option client to USDSGD OTC, 1-minute candles,
default theme. They opened /trail-lab in another tab, left the
VISION mode checkbox enabled (default-on for Phase 1.5), clicked
Capture Screen, and dragged the selection rectangle around the
candlestick canvas — including the most recent price digits at the right
margin. They armed:
symbol USDSGD (label only — no OHLC feed used) asset_class vision timeframe 1m horizon_seconds 120 (each straddle expires 2 min after open) bankroll_start $100 kelly_fraction 0.25 (¼-Kelly cap) spread_pct 0.0008 (8 bps) slippage_bps 2 exec_delay_ms 120 vision_mode true
They clicked auto-loop at 30 s cadence. Every 30 s the lab:
/api/trail/observe with the PNG bytes;vision_extractor.py, which prompted GPT-4o for {direction, confidence, last_close};call + put straddle, sizes set by ¼-Kelly with an asymmetric tilt proportional to the extracted confidence;expiry_ts, captured a fresh frame, and POSTed /api/trail/legs/resolve_with_frame;last_close from the fresh frame and atomically settled the leg via a CAS on (status=open) → (status=resolved, resolver_pid=$$).
No operator interaction occurred between arm and 7 minutes later when
the operator pasted the headline numbers into the agent workboard. The run
was halted not because anything failed but because 24 resolved legs +
populated trail_coeff + 23 distinct closes was already past the
acceptance bar for every hypothesis.
The full per-tick log below was extracted post-hoc from MongoDB collection
trail_lab.trail_legs, session 6a2daaee03417038f3134394.
Each row is one paired straddle (call leg + put leg, same tick_idx,
identical entry_ts, identical entry_price). The
"direction" column is the lab's inferred bet: whichever leg got the larger
¼-Kelly stake. The "actual" column is the realised sign of
close − entry on the call leg (the put leg's close at the same
expiry can differ by a few pips because the two close-vision calls happen
~3 s apart — see §7).
| tick | UTC entry | entry | conf | tilt | call close | put close | call $ | put $ | dir | actual | ✓? | pnl pristine | pnl realistic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19:09:53 | 1.33400 | 0.85 | +0.85 | 1.33250 | 1.33247 | 21.25 | 3.75 | long | short | ✗ | −17.50 | −17.525 |
| 1 | 19:10:23 | 1.33400 | 0.75 | −0.50 | 1.33280 | 1.33240 | 6.25 | 18.75 | short | short | ✓ | +12.50 | +12.475 |
| 2 | 19:11:23 | 1.33340 | 0.85 | −0.70 | 1.32940 | 1.33050 | 3.75 | 21.25 | short | short | ✓ | +17.50 | +17.475 |
| 3 | 19:11:53 | 1.33260 | 0.85 | −0.70 | 1.33610 | 1.32900 | 3.75 | 21.25 | short | long | ✗ | +25.00 | +24.975 |
| 4 | 19:12:23 | 1.32800 | 0.85 | −0.70 | 1.33400 | 1.33080 | 3.09 | 17.53 | short | long | ✗ | −14.44 | −14.461 |
| 5 | 19:12:53 | 1.33040 | 0.90 | −0.80 | 1.33390 | 1.33440 | 2.37 | 21.36 | short | long | ✗ | −18.99 | −19.013 |
| 6 | 19:13:23 | 1.32590 | 0.90 | −0.80 | 1.32940 | 1.33050 | 2.37 | 21.36 | short | long | ✗ | −18.99 | −19.013 |
| 7 | 19:13:53 | 1.32940 | 0.90 | −0.80 | 1.33000 | 1.33050 | 2.28 | 20.52 | short | long | ✗ | −18.24 | −18.263 |
| 8 | 19:14:23 | 1.32940 | 0.70 | +0.40 | 1.32260 | 1.32940 | 19.67 | 8.43 | long | short | ✗ | −28.10 | −28.128 |
| 10 | 19:16:19 | 1.33400 | 0.85 | −0.70 | 1.32990 | 1.32850 | 2.41 | 13.66 | short | short | ✓ | +11.25 | +11.234 |
| 11 | 19:16:35 | 1.32940 | 0.85 | −0.70 | 1.32870 | 1.32250 | 2.41 | 13.66 | short | short | ✓ | +11.25 | +11.234 |
Ticks 9 and 12 were partially resolved at the moment the dump was taken
(only one of the two legs had been settled by an auto-loop sweep yet); they
are excluded from the per-tick W/L summary but their individual legs are
included in the price-move statistics.
"dir" = direction the model bet (the leg that got the larger stake).
"actual" = sign of (call close − entry). "✓?" compares those two. A row can
still be profitable on a wrong directional read because the straddle pays
the winning leg by 1× and loses 1× the smaller leg by capital sunk — see
tick 3, where the lab's short bet on a +35 pip move still
booked +$25 pristine because almost the entire stake sat on the put.
entry_ts on its own. One transient
resolve_with_frame: resolve rejected surfaced (a race when two
sweeps both noticed the same expired leg between the watcher tick and the
DB write); the frontend's 3-strike per-leg backoff swallowed it cleanly and
the leg was resolved on the next sweep. No restart. No manual click.
close_price ≠ entry_price by at least 30 pips. The single "flat"
leg matched its entry to 5 decimals — credible on a 30 s minor pause and
exactly what a real chart does. Distribution of |close − entry|:
median 35 pips, max 69 pips, σ 20 pips. The implied annualised volatility
is ~78%, which is plausible-to-elevated for a Friday-evening synthetic OTC
spread. The model is reading pixels, not memorising context.
spread% + slip_bps/10000 = 0.080% + 0.020% = 0.100%. The
extra 0.006% is the rounding mismatch on the four-decimal price reads
times the leverage of the stake — well inside the 5-bps tolerance band.
The friction model is internally consistent.
pid 554201,
pid 554339, pid 554097) settled disjoint subsets
of the 24 legs (16 / 6 / 2 respectively). Every leg carries exactly one
resolver_pid and exactly one
resolved_at timestamp. The atomic CAS in
/api/trail/legs/resolve_with_frame held under contention
exactly as antigravity's spec promised.
The temptation, looking at a single bankroll line that went $100 → $73 in seven minutes, is to declare the strategy dead. The lab specifically exists to make that call cheap. But three observations are worth keeping in mind before we let a 13-tick sample close the question.
Eleven directional ticks is not a study, it is a sanity check. The 95% CI on a 36% hit rate at n=11 includes everything between coin-flip-loser and coin-flip-winner. We need ~200 ticks before the CI even narrows to ±7 percentage points. The lab is now instrumented to collect that sample at ~2 ticks/minute; an overnight run produces ~1200 ticks. That is Phase 2's job.
Notice tick 3 in the per-leg table: the model bet short, the realised sign was up by 35 pips, and the row still booked +$25 pristine. The mechanism: the asymmetric tilt put 85% of the ¼-Kelly stake on the put leg, but the call leg was still funded enough to capture the 2.6 pip move with positive P/L. The straddle is, structurally, a hedge against the engine being wrong about direction. It only loses when the model is both wrong and confident (tilt magnitude high, sign wrong) — and Phase 1.5 caught exactly that pathology in ticks 5–8, four consecutive 90%-confidence shorts during an upswing. The infrastructure faithfully recorded a real model-behaviour failure.
Of 11 directional ticks the model bet short on 10 of
them. The realised path during the run had 5 down moves and 6 up moves on
the call-leg close — i.e. roughly balanced. A model that always shorts a
balanced random walk is a 50/50 model in expectation; in this particular
sample the path happened to drift up after ticks 4–8, so the model paid
for that drift. The right response is not to flip the sign — that just
moves the bias — but to (a) ask whether the
prompt biases GPT-4o toward "bearish" verdicts on green-on-black candle
themes, and (b) calibrate the confidence: 0.90 confidence on a model that
hits 36% is the actual defect, not the direction guess itself.
Entry-side vision call: mean 3.4 s, max 4.2 s from entry_ts
to _vision_extracted_at. Close-side vision call: mean 29 s,
max 133 s after expiry_ts. The close-side latency is not the
API — it is the auto-loop's 30 s cadence: the watcher only checks every
30 s, so on average a leg sits resolved-by-the-clock for 15 s before the
loop sweeps it. That 30-second gap between expiry and
resolution is also a window where the underlying price keeps
moving, and the close vision call reads the price at resolution
time, not at expiry time. This is a real systematic error and one
the realistic book is currently not charged for — the friction
model only sees spread, slippage, and execution-delay-at-open. Phase 2
must either (a) shorten the resolution-watcher cadence to ≤ 5 s, or
(b) extend the friction model to include
resolution_drift_ms.
A reader looking at tick 0 will notice the call leg closed at 1.33250 and the put leg at 1.33247 — same expiry, same entry, different reads. The two close-side vision calls fire ~3 s apart because the resolution loop processes legs one-at-a-time. In 3 seconds the broker chart moves ~0.3 pips on average. This is the same systematic latency error as 7.4 — both legs of a single straddle should be marked off a single close-time price read. Phase 2 fix: one capture per expired-tick, applied to both legs.
Phase 1.5 was, organisationally, an experiment in parallel agent work as
much as a technical one. Four lanes ran simultaneously against pre-written
contracts in TRAILLAB/PHASE_1_5_PLAN.md:
| Lane | Agent | Scope (files only this agent touched) | Wall time |
|---|---|---|---|
| Engine | antigravity | trail_lab_service.py — observe vision branch, /api/trail/legs/resolve_with_frame, atomic CAS, skip rules | ~25 min |
| Extractor | claude | vision_extractor.py, 25 unit tests, whitepaper §6 vision-mode insert | ~50 min |
| Frontend | cursor | templates/trail_lab.html, static/js/trail_lab.js — VISION toggle, server-truth pill, due-leg watcher, 3-strike backoff | ~70 min |
| Orchestration | orchestrator | Plan, lane prompts, stub bridge between extractor and engine, integration verification, missing-confidence ruling | continuous |
The non-obvious move was the stub bridge: at T=0 the
orchestrator shipped a five-line dummy vision_extractor.py
that returned {"direction":"long","confidence":0.6,
"last_close":1.0} regardless of input. That stub let antigravity
finish, test, and verify the engine's vision branch in isolation
while claude was still writing the real extractor. When claude's
real extractor landed, the dummy was replaced by a single import swap and
zero code in the engine changed. Three lanes were able to run in true
parallel for the first nine minutes; after that cursor was the critical
path. Lesson: ship a contract-respecting stub on day 1 of any
cross-agent file; never make a caller wait for a callee.
The orchestrator's tightest ruling was the missing-confidence policy
at 18:18Z: when the vision model returns a price and a direction but
no confidence, the extractor defaults confidence = 0.5
and the lab keeps ticking. This collapses ¼-Kelly to a symmetric straddle
— no edge claimed, no edge missed, the friction clock keeps running.
The alternative — hard-failing the tick — would have frozen the operator
out of recoverable frames, which is the exact failure mode VISION mode
exists to escape. Default to "no edge", never default to "no tick".
bankroll_realistic > bankroll_pristine for
even one tick. That would prove the friction model is signed wrong and
every TRAIL coefficient ever computed by this lab is suspect. We have
never observed it. We have, however, ruled out a much weaker null
("close_price always equals entry_price") that would have proven the
vision model is hallucinating from context rather than reading pixels.
That null is dead.
Phase 2 should:
resolution_drift_ms to the friction model.+0.8 × stake on a correct sign and −1.0 × stake on a wrong sign — the actual Pocket Option contract. The TRAIL coefficient on that variant is the number the original "80% hit rate" claim should be benchmarked against, not the spot variant.Phase 2 should not:
Phase 1 of the TRAIL Lab built an honest scale. Phase 1.5 built a way to weigh things the scale could not previously see. The scale now weighs a particular thing — GPT-4o's read of a Pocket Option OTC chart — and the weight, on the seven minutes we measured, was −$26.83 against a $100 starting bankroll, with a 36% directional hit rate. The scale itself is honest: pristine ≥ realistic, friction reconciled, atomic resolution, camera not stuck. The weighed object is, on this sample, lighter than expected. The next step is to weigh it for longer.
The lab is now ready to do the boring, important thing it was built for: collect samples and compute coefficients until the question "does the paper edge survive friction?" can be answered with a confidence interval instead of a vibe.