SMStreamMDXStreaming renderer
HomeShowcasePerf Harness
Showcase

Perf harness

Reproducible perf runs and guardrails for streaming metrics.

performancetestingsafety

Perf harness

The perf harness is the reproducible way to measure StreamMDX behavior before and after changes. It exists to turn performance claims into comparable local evidence instead of screenshots or isolated anecdotes.

It is designed to answer one question quickly: did this change make streaming better, worse, or unchanged?

What this showcases

  • Fixed fixture input and deterministic run settings.
  • A stable metric set you can compare across commits.
  • A release gate pattern for preventing silent regressions.

Core metrics

MetricWhy it mattersTypical target
First flush (ms)Time to first visible contentLower is better
Patch p95 (ms)Tail update latency under stream loadLower is better
Long tasksMain-thread jank indicatorNear zero
Coalescing (%)How much patch merging reduced churnStable range
Memory (peak)Risk of runaway allocationsNo upward drift

Standard run matrix

Use the same matrix for every benchmark pass so results stay comparable:

  • Fixture: Naive Bayes article (default demo fixture)
  • Rate: 12000 chars/s
  • Tick: 5 ms
  • Runs: 3
  • Theme: both light and dark

Capture workflow

  1. Build worker + docs assets.
  2. Start docs with automation API enabled.
  3. Capture a baseline JSON.
  4. Apply your change.
  5. Capture candidate JSON.
  6. Diff baseline vs candidate.
bash
NEXT_PUBLIC_STREAMING_DEMO_API=true npm run docs:dev
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/main.json
npm run perf:demo -- --rate 12000 --tick 5 --runs 3 --out tmp/perf-baseline/candidate.json

Regression policy

A practical default policy for release readiness:

  • Fail if first flush regresses by more than 15%.
  • Fail if patch p95 regresses by more than 20%.
  • Fail if long-task count increases by more than 2x.
  • Warn (but do not fail) on memory increase under 10%.

How to interpret results

Use the harness in two modes:

  • claim-grade comparisons with fixed fixture/scenario settings
  • exploratory runs when you are diagnosing scheduler or workload cliffs

Do not mix those two uses. If the settings differ, the result is diagnostic, not a published baseline.

Example comparison output

json
{
  "scenario": "naive-bayes-default",
  "baseline": { "firstFlushMs": 32, "patchP95Ms": 4.1, "longTasks": 0 },
  "candidate": { "firstFlushMs": 35, "patchP95Ms": 4.6, "longTasks": 0 },
  "result": { "status": "pass", "notes": ["within thresholds"] }
}

Common failure patterns

  • Large syntax/highlight updates: patch p95 spikes after code-heavy sections.
  • Over-eager UI effects: long tasks increase during stream bursts.
  • Unbounded plugin work: memory climbs per run.

When this happens, test with one feature disabled at a time (math, mdx, html) to isolate cost.

Guardrails for CI

  • Run perf harness on PRs that touch packages/* or docs renderer code.
  • Store baseline in source control or artifact storage with commit metadata.
  • Require explicit approval when metrics exceed thresholds.

Next steps