Perf harness

The perf harness is the reproducible way to measure StreamMDX behavior before and after changes. It exists to turn performance claims into comparable local evidence instead of screenshots or isolated anecdotes.

It is designed to answer one question quickly: did this change make streaming better, worse, or unchanged?

What this showcases

Fixed fixture input and deterministic run settings.
A stable metric set you can compare across commits.
A release gate pattern for preventing silent regressions.

Core metrics

Metric	Why it matters	Typical target
First flush (ms)	Time to first visible content	Lower is better
Patch p95 (ms)	Tail update latency under stream load	Lower is better
Long tasks	Main-thread jank indicator	Near zero
Coalescing (%)	How much patch merging reduced churn	Stable range
Memory (peak)	Risk of runaway allocations	No upward drift
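
Patch p95 is a tail statistic, so it helps to be explicit about how it could be derived from raw samples. The following is a minimal TypeScript sketch using the nearest-rank method. The function name `percentile` and the sample durations are illustrative assumptions, and the harness itself may compute its percentile differently (for example, with interpolation).

```typescript
// Nearest-rank percentile over a list of patch durations in milliseconds.
// Assumption: this is one common definition, not necessarily the harness's.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Smallest rank whose share of the samples is at least p percent.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Illustrative samples: nine fast patches and one slow outlier.
const patchDurationsMs = [1.2, 0.9, 1.1, 4.6, 1.0, 1.3, 0.8, 1.4, 1.1, 1.0];
const patchP95Ms = percentile(patchDurationsMs, 95); // 4.6, the outlier
const patchP50Ms = percentile(patchDurationsMs, 50); // 1.1
```

The gap between the median and p95 in this sample is why the table tracks the tail rather than the average: a single slow patch is invisible in the median but dominates p95.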

Standard run matrix

Use the same matrix for every benchmark pass so results stay comparable:

Fixture: Naive Bayes article (default demo fixture)
Rate: 12000 chars/s
Tick: 5 ms
Runs: 3
Theme: both light and dark
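
One way to keep the matrix from drifting is to hold it in a single typed constant and derive the command arguments from it. This is a sketch only: the `RunMatrix` shape, its field names, and `perfDemoArgs` are hypothetical, not the harness's own config format. It emits only the `--rate`, `--tick`, `--runs`, and `--out` flags; fixture and theme are recorded but not passed, because no flags for them are documented here.

```typescript
// Hypothetical shape for the standard run matrix (illustrative field names).
interface RunMatrix {
  fixture: string;
  rateCharsPerSec: number;
  tickMs: number;
  runs: number;
  themes: readonly string[];
}

const STANDARD_MATRIX: RunMatrix = {
  fixture: "Naive Bayes article",
  rateCharsPerSec: 12000,
  tickMs: 5,
  runs: 3,
  themes: ["light", "dark"],
};

// Build the arguments that follow `npm run perf:demo --`.
function perfDemoArgs(matrix: RunMatrix, outPath: string): string[] {
  return [
    "--rate", String(matrix.rateCharsPerSec),
    "--tick", String(matrix.tickMs),
    "--runs", String(matrix.runs),
    "--out", outPath,
  ];
}

const baselineArgs = perfDemoArgs(STANDARD_MATRIX, "tmp/perf-baseline/main.json");
```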

Capture workflow

Build worker + docs assets.
Start docs with automation API enabled.
Capture a baseline JSON.
Apply your change.
Capture candidate JSON.
Diff baseline vs candidate.

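The commands below start the docs with the automation API enabled, then capture the baseline and candidate JSON. Run the second capture only after applying your change.

NEXT_PUBLIC_STREAMING_DEMO_API=true npm run docs:dev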
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/main.json
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/candidate.json
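
The final diff step can be sketched as a small pure function. This assumes each capture can be reduced to a summary with `firstFlushMs`, `patchP95Ms`, and `longTasks`; the actual capture JSON may be shaped differently, and `PerfSummary`, `changePct`, and `diffSummaries` are hypothetical names.

```typescript
// Assumed per-capture summary; the real capture file may differ.
interface PerfSummary {
  firstFlushMs: number;
  patchP95Ms: number;
  longTasks: number;
}

type Delta = { baseline: number; candidate: number; changePct: number };

// Percent change relative to baseline. A zero baseline yields 0 when the
// candidate is also zero and Infinity when the candidate is positive.
function changePct(baseline: number, candidate: number): number {
  if (baseline === 0) return candidate === 0 ? 0 : Infinity;
  return ((candidate - baseline) / baseline) * 100;
}

function diffSummaries(
  baseline: PerfSummary,
  candidate: PerfSummary,
): Record<keyof PerfSummary, Delta> {
  const keys: (keyof PerfSummary)[] = ["firstFlushMs", "patchP95Ms", "longTasks"];
  const out = {} as Record<keyof PerfSummary, Delta>;
  for (const key of keys) {
    out[key] = {
      baseline: baseline[key],
      candidate: candidate[key],
      changePct: changePct(baseline[key], candidate[key]),
    };
  }
  return out;
}

// Illustrative values only.
const delta = diffSummaries(
  { firstFlushMs: 32, patchP95Ms: 4.1, longTasks: 0 },
  { firstFlushMs: 35, patchP95Ms: 4.6, longTasks: 0 },
);
// delta.firstFlushMs.changePct is 9.375 (35 ms against 32 ms)
```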

Regression policy

A practical default policy for release readiness:

Fail if first flush regresses by more than 15%.
Fail if patch p95 regresses by more than 20%.
Fail if long-task count more than doubles.
Warn (but do not fail) if peak memory increases by less than 10%.
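
The policy above could be encoded roughly as follows. This is a sketch under stated assumptions: the `PerfSummary` shape, the `peakMemoryBytes` field, and the `"warn"` and `"fail"` status strings are hypothetical. Two readings are also assumed: "more than 2x" is taken as candidate greater than twice the baseline (so any long task fails against a zero baseline), and memory increases of 10% or more are still reported as warnings because the policy does not say how to treat them.

```typescript
interface PerfSummary {
  firstFlushMs: number;
  patchP95Ms: number;
  longTasks: number;
  peakMemoryBytes: number; // hypothetical field name
}

type Verdict = { status: "pass" | "warn" | "fail"; notes: string[] };

// Percent increase over baseline; Infinity when a zero baseline becomes positive.
function regressionPct(baseline: number, candidate: number): number {
  if (baseline === 0) return candidate > 0 ? Infinity : 0;
  return ((candidate - baseline) / baseline) * 100;
}

function evaluatePolicy(baseline: PerfSummary, candidate: PerfSummary): Verdict {
  const failures: string[] = [];
  const warnings: string[] = [];

  if (regressionPct(baseline.firstFlushMs, candidate.firstFlushMs) > 15) {
    failures.push("first flush regressed by more than 15%");
  }
  if (regressionPct(baseline.patchP95Ms, candidate.patchP95Ms) > 20) {
    failures.push("patch p95 regressed by more than 20%");
  }
  // Assumed reading of "more than 2x": candidate count exceeds twice the baseline.
  if (candidate.longTasks > 2 * baseline.longTasks) {
    failures.push("long-task count more than doubled");
  }
  // Assumption: any memory increase warns; only the under-10% case is specified.
  const memPct = regressionPct(baseline.peakMemoryBytes, candidate.peakMemoryBytes);
  if (memPct > 0) {
    warnings.push(
      memPct < 10
        ? "peak memory increased by less than 10%"
        : "peak memory increased by 10% or more (not covered by the stated policy)",
    );
  }

  if (failures.length > 0) return { status: "fail", notes: failures.concat(warnings) };
  if (warnings.length > 0) return { status: "warn", notes: warnings };
  return { status: "pass", notes: ["within thresholds"] };
}

// Illustrative values: small regressions that stay inside every threshold.
const base: PerfSummary = { firstFlushMs: 32, patchP95Ms: 4.1, longTasks: 0, peakMemoryBytes: 100 };
const verdict = evaluatePolicy(base, { ...base, firstFlushMs: 35, patchP95Ms: 4.6 });
// verdict.status is "pass"
```

Keeping the thresholds in one function makes the release gate auditable: a policy change shows up as a code diff rather than a changed habit.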

How to interpret results

Use the harness in two modes:

Claim-grade comparisons with fixed fixture and scenario settings.
Exploratory runs for diagnosing scheduler or workload cliffs.

Do not mix those two uses. If the settings differ, the result is diagnostic, not a published baseline.
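
The rule above can be made mechanical by labeling each result from its settings. The sketch below is an illustration, not part of the harness: `RunSettings`, `classifyRun`, and the field names are hypothetical, and the standard values are taken from the standard run matrix.

```typescript
// Hypothetical record of the settings a result was captured with.
interface RunSettings {
  scenario: string;
  rateCharsPerSec: number;
  tickMs: number;
  runs: number;
}

type RunKind = "claim-grade" | "diagnostic";

const STANDARD_SETTINGS: RunSettings = {
  scenario: "naive-bayes-default",
  rateCharsPerSec: 12000,
  tickMs: 5,
  runs: 3,
};

// A run is claim-grade only when every setting matches the standard exactly.
function classifyRun(settings: RunSettings, standard: RunSettings): RunKind {
  const same =
    settings.scenario === standard.scenario &&
    settings.rateCharsPerSec === standard.rateCharsPerSec &&
    settings.tickMs === standard.tickMs &&
    settings.runs === standard.runs;
  return same ? "claim-grade" : "diagnostic";
}

const standardRun = classifyRun({ ...STANDARD_SETTINGS }, STANDARD_SETTINGS);
const fasterRun = classifyRun({ ...STANDARD_SETTINGS, rateCharsPerSec: 24000 }, STANDARD_SETTINGS);
// standardRun is "claim-grade"; fasterRun is "diagnostic"
```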

Example comparison output

{
  "scenario": "naive-bayes-default",
  "baseline": { "firstFlushMs": 32, "patchP95Ms": 4.1, "longTasks": 0 },
  "candidate": { "firstFlushMs": 35, "patchP95Ms": 4.6, "longTasks": 0 },
  "result": { "status": "pass", "notes": ["within thresholds"] }
}

Common failure patterns

Large syntax/highlight updates: patch p95 spikes after code-heavy sections.
Over-eager UI effects: long tasks increase during stream bursts.
Unbounded plugin work: memory climbs per run.

When one of these patterns appears, rerun with one feature disabled at a time (math, mdx, html) to isolate the cost.
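
The one-at-a-time isolation can be enumerated with a short helper. This sketch only generates the flag combinations; how features are actually toggled in the harness or demo is not specified here, and `isolationRuns` is a hypothetical name.

```typescript
type IsolationRun = { disabled: string; flags: Record<string, boolean> };

// For each feature, produce a run where only that feature is turned off.
function isolationRuns(features: readonly string[]): IsolationRun[] {
  return features.map((disabled) => {
    const flags: Record<string, boolean> = {};
    for (const name of features) {
      flags[name] = name !== disabled;
    }
    return { disabled, flags };
  });
}

const isolation = isolationRuns(["math", "mdx", "html"]);
// Three runs: math off, then mdx off, then html off, with the other two on.
```

Comparing each of these runs against the all-features run points to the feature whose removal brings the metric back within range.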

Guardrails for CI

Run the perf harness on PRs that touch packages/* or docs renderer code.
Store the baseline in source control or artifact storage with commit metadata.
Require explicit approval when metrics exceed thresholds.
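
The first guardrail amounts to a path check on the files a PR changes. A minimal sketch, assuming the CI system can supply that file list: `needsPerfRun` is a hypothetical helper, and `docs/renderer/` is a placeholder prefix because the actual location of the docs renderer code is not given here.

```typescript
// Returns true when any changed file falls under packages/ or one of the
// supplied renderer prefixes (the caller provides the real renderer paths).
function needsPerfRun(
  changedFiles: readonly string[],
  rendererPrefixes: readonly string[],
): boolean {
  const prefixes = ["packages/", ...rendererPrefixes];
  return changedFiles.some((file) =>
    prefixes.some((prefix) => file.startsWith(prefix)),
  );
}

// "docs/renderer/" is a placeholder, not a confirmed path in the repository.
const RENDERER_PREFIXES = ["docs/renderer/"];
const touchesPackages = needsPerfRun(["packages/core/src/index.ts"], RENDERER_PREFIXES);
const readmeOnly = needsPerfRun(["README.md"], RENDERER_PREFIXES);
// touchesPackages is true; readmeOnly is false
```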

Next steps

Benchmarks hub: Benchmarks
Integration guide: Perf harness
Change log discipline: Perf quality changelog
Scheduler interpretation: Scheduling and jitter