Perf harness
The perf harness is the reproducible way to measure StreamMDX behavior before and after changes. It exists to turn performance claims into comparable local evidence instead of screenshots or isolated anecdotes.
It is designed to answer one question quickly: did this change make streaming better, worse, or unchanged?
What this showcases
- Fixed fixture input and deterministic run settings.
- A stable metric set you can compare across commits.
- A release gate pattern for preventing silent regressions.
Core metrics
| Metric | Why it matters | Typical target |
|---|---|---|
| First flush (ms) | Time to first visible content | Lower is better |
| Patch p95 (ms) | Tail update latency under stream load | Lower is better |
| Long tasks | Main-thread jank indicator | Near zero |
| Coalescing (%) | How much patch merging reduced churn | Stable range |
| Memory (peak) | Risk of runaway allocations | No upward drift |
Standard run matrix
Use the same matrix for every benchmark pass so results stay comparable:
- Fixture: Naive Bayes article (default demo fixture)
- Rate:
12000chars/s - Tick:
5ms - Runs:
3 - Theme: both light and dark
Capture workflow
- Build worker + docs assets.
- Start docs with automation API enabled.
- Capture a baseline JSON.
- Apply your change.
- Capture candidate JSON.
- Diff baseline vs candidate.
bash
NEXT_PUBLIC_STREAMING_DEMO_API=true npm run docs:dev
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/main.json
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/candidate.jsonRegression policy
A practical default policy for release readiness:
- Fail if first flush regresses by more than
15%. - Fail if patch p95 regresses by more than
20%. - Fail if long-task count increases by more than
2x. - Warn (but do not fail) on memory increase under
10%.
How to interpret results
Use the harness in two modes:
- claim-grade comparisons with fixed fixture/scenario settings
- exploratory runs when you are diagnosing scheduler or workload cliffs
Do not mix those two uses. If the settings differ, the result is diagnostic, not a published baseline.
Example comparison output
json
{
"scenario": "naive-bayes-default",
"baseline": { "firstFlushMs": 32, "patchP95Ms": 4.1, "longTasks": 0 },
"candidate": { "firstFlushMs": 35, "patchP95Ms": 4.6, "longTasks": 0 },
"result": { "status": "pass", "notes": ["within thresholds"] }
}Common failure patterns
- Large syntax/highlight updates: patch p95 spikes after code-heavy sections.
- Over-eager UI effects: long tasks increase during stream bursts.
- Unbounded plugin work: memory climbs per run.
When this happens, test with one feature disabled at a time (math, mdx, html) to isolate cost.
Guardrails for CI
- Run perf harness on PRs that touch
packages/*or docs renderer code. - Store baseline in source control or artifact storage with commit metadata.
- Require explicit approval when metrics exceed thresholds.
Next steps
- Benchmarks hub: Benchmarks
- Integration guide: Perf harness
- Change log discipline: Perf quality changelog
- Scheduler interpretation: Scheduling and jitter