# Perf harness
The perf harness is the reproducible way to measure StreamMDX behavior before and after changes.
It is designed to answer one question quickly: did this change make streaming better, worse, or unchanged?
## What this showcases
- Fixed fixture input and deterministic run settings.
- A stable metric set you can compare across commits.
- A release gate pattern for preventing silent regressions.
## Core metrics
| Metric | Why it matters | Typical target |
|---|---|---|
| First flush (ms) | Time to first visible content | Lower is better |
| Patch p95 (ms) | Tail update latency under stream load | Lower is better |
| Long tasks | Main-thread jank indicator | Near zero |
| Coalescing (%) | How much patch merging reduced churn | Stable range |
| Memory (peak) | Risk of runaway allocations | No upward drift |
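Patch p95 and the long-task count can be derived from the raw per-patch timings the harness records. A minimal sketch, assuming a plain array of millisecond durations; the helper names here are illustrative, not part of the StreamMDX API:

```typescript
// Summarize raw timing samples into the metrics above.
function percentile(durations: number[], p: number): number {
  if (durations.length === 0) return 0;
  const sorted = [...durations].sort((a, b) => a - b);
  // Nearest-rank method: the p-th percentile is the value at this index.
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank))];
}

function countLongTasks(durations: number[], thresholdMs = 50): number {
  // The browser Long Tasks API flags main-thread work over 50 ms by default.
  return durations.filter((d) => d > thresholdMs).length;
}
```

For example, `percentile(patchDurations, 95)` yields the patch p95 column; the 50 ms long-task threshold matches the browser default but is an assumption here, not a harness setting.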
## Standard run matrix
Use the same matrix for every benchmark pass so results stay comparable:
- Fixture: Naive Bayes article (default demo fixture)
- Rate: 12000 chars/s
- Tick: 5 ms
- Runs: 3
- Theme: both light and dark
## Capture workflow
- Build worker + docs assets.
- Start docs with automation API enabled.
- Capture a baseline JSON.
- Apply your change.
- Capture candidate JSON.
- Diff baseline vs candidate.
```bash
# Start docs with the automation API enabled
NEXT_PUBLIC_STREAMING_DEMO_API=true npm run docs:dev

# Capture baseline, then candidate, with identical settings
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/main.json
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/candidate.json
```
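The final diff step reduces to comparing the two captures field by field. A sketch of that comparison as a pure function, assuming the metric field names shown in the example output (`firstFlushMs`, `patchP95Ms`, `longTasks`); this is not a documented StreamMDX schema:

```typescript
// Per-run metrics as captured in the baseline/candidate JSON files.
interface Metrics {
  firstFlushMs: number;
  patchP95Ms: number;
  longTasks: number;
}

// Percentage change relative to baseline; guards against divide-by-zero.
function percentDelta(baseline: number, candidate: number): number {
  if (baseline === 0) return candidate === 0 ? 0 : Infinity;
  return ((candidate - baseline) / baseline) * 100;
}

function diffMetrics(baseline: Metrics, candidate: Metrics) {
  return {
    firstFlushPct: percentDelta(baseline.firstFlushMs, candidate.firstFlushMs),
    patchP95Pct: percentDelta(baseline.patchP95Ms, candidate.patchP95Ms),
    longTasksDelta: candidate.longTasks - baseline.longTasks,
  };
}
```

Reading the two JSON files and pretty-printing the deltas is then a few lines of `fs.readFileSync` plus `JSON.parse` around this function.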
## Regression policy
A practical default policy for release readiness:
- Fail if first flush regresses by more than 15%.
- Fail if patch p95 regresses by more than 20%.
- Fail if long-task count increases by more than 2x.
- Warn (but do not fail) on memory increase under 10%.
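The policy above can be encoded as a small gate function. This is a hypothetical sketch: the `Deltas` shape and the `gate` helper are assumptions, and note that the stated policy only covers memory increases under 10%, so larger increases also surface here as a warning rather than a failure:

```typescript
type Verdict = "pass" | "warn" | "fail";

interface Deltas {
  firstFlushPct: number; // % change in first flush vs baseline
  patchP95Pct: number;   // % change in patch p95 vs baseline
  longTaskRatio: number; // candidate long tasks / baseline long tasks
  memoryPct: number;     // % change in peak memory vs baseline
}

function gate(d: Deltas): Verdict {
  if (d.firstFlushPct > 15) return "fail"; // first flush regressed > 15%
  if (d.patchP95Pct > 20) return "fail";   // patch p95 regressed > 20%
  if (d.longTaskRatio > 2) return "fail";  // long tasks more than doubled
  if (d.memoryPct > 0) return "warn";      // any memory growth is a warning
  return "pass";
}
```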
## Example comparison output
```json
{
  "scenario": "naive-bayes-default",
  "baseline": { "firstFlushMs": 32, "patchP95Ms": 4.1, "longTasks": 0 },
  "candidate": { "firstFlushMs": 35, "patchP95Ms": 4.6, "longTasks": 0 },
  "result": { "status": "pass", "notes": ["within thresholds"] }
}
```
## Common failure patterns
- Large syntax/highlight updates: patch p95 spikes after code-heavy sections.
- Over-eager UI effects: long tasks increase during stream bursts.
- Unbounded plugin work: memory climbs per run.
When this happens, rerun with one feature disabled at a time (math, mdx, html) to isolate the cost.
## Guardrails for CI
- Run the perf harness on PRs that touch `packages/*` or docs renderer code.
- Store baselines in source control or artifact storage with commit metadata.
- Require explicit approval when metrics exceed thresholds.
## Next steps
- Benchmarks hub: Benchmarks
- Integration guide: Perf harness
- Change log discipline: Perf quality changelog