Comparisons and Benchmarks
This article is a roadmap for making fair, reproducible comparisons against other renderers (streamdown, react-markdown, or any custom pipeline). It focuses on testing methodology and on how to interpret the metrics StreamMDX records.
What to compare
- Time to first visible render: first flush time with a realistic streaming scenario.
- Total stream duration: time until the final render is stable.
- Stutter and jank: long-task p95 and the p95 of requestAnimationFrame (RAF) frame deltas.
- Memory growth: peak heap size over the full stream.
- Output correctness: HTML regression snapshots for the same fixture.
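The percentile metrics above can be computed with a nearest-rank sketch like the following. This is an illustrative helper, not part of the StreamMDX harness; the function names and the sample timestamps are assumptions.

```javascript
// Hypothetical helper: derive frame deltas (ms) from a list of
// requestAnimationFrame timestamps, then take a percentile of them.
function frameDeltas(timestamps) {
  const deltas = [];
  for (let i = 1; i < timestamps.length; i++) {
    deltas.push(timestamps[i] - timestamps[i - 1]);
  }
  return deltas;
}

function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method: smallest value with at least p of the data at or below it.
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[idx];
}

// Example: ten frames, nine smooth (16 ms) and one long 50 ms gap.
const ts = [0, 16, 32, 48, 98, 114, 130, 146, 162, 178, 194];
const p95 = percentile(frameDeltas(ts), 0.95);
// One bad frame out of ten is enough to dominate the p95 here,
// which is why p95 is a good stutter signal.
```

The same `percentile` helper applies to long-task durations and any other per-sample metric the harness records.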
Recommended workflow
- Capture StreamMDX baselines locally using the perf harness.
- Re-run the same fixture against competitors using their recommended API.
- Keep scenario definitions and chunk/tick sizes identical.
- Compare deltas with the harness comparator and log results.
- Validate final HTML output against baseline snapshots.
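The "compare deltas" step boils down to percentage differences between a baseline run and a candidate run. A minimal sketch, assuming each run is summarized as a flat object of numeric metrics (the metric names below are illustrative, not the harness's actual output schema):

```javascript
// Compute percentage deltas between a baseline run and a candidate run.
// A positive delta means the candidate is slower / larger on that metric.
function compareRuns(baseline, candidate) {
  const deltas = {};
  for (const key of Object.keys(baseline)) {
    if (typeof candidate[key] !== 'number' || baseline[key] === 0) continue;
    deltas[key] = ((candidate[key] - baseline[key]) / baseline[key]) * 100;
  }
  return deltas;
}

// Hypothetical summaries of two runs of the same fixture/scenario.
const baseline = { firstFlushMs: 120, longTaskP95Ms: 38, memPeakMb: 64 };
const candidate = { firstFlushMs: 90, longTaskP95Ms: 42, memPeakMb: 64 };
const deltas = compareRuns(baseline, candidate);
// deltas.firstFlushMs === -25 (candidate reaches first flush 25% faster)
```

Averaging deltas across multiple runs (after warmup) keeps a single noisy run from dominating the comparison.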
Fixtures and scenarios (current local baselines)
- naive-bayes: S1_slow_small, S2_typical, S4_chunky_network
- table-large: S2_typical, S6_extreme
StreamMDX perf harness
Start the docs server (required for harness runs):
```bash
NEXT_PUBLIC_STREAMING_DEMO_API=true npm run docs:dev
```

Run the harness (examples):
```bash
npm run perf:harness -- --fixture naive-bayes --scenario S1_slow_small --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture naive-bayes --scenario S2_typical --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture naive-bayes --scenario S4_chunky_network --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture table-large --scenario S2_typical --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture table-large --scenario S6_extreme --runs 3 --warmup 1 --out tmp/perf-baselines
```

Compare candidates:
```bash
npm run perf:compare -- --base tmp/perf-baselines/<baseline> --candidate tmp/perf-baselines/<candidate>
```

Record results in /docs/perf-quality-changelog and keep run paths up to date in /docs/perf-harness.
Notes on interpretation
- First flush is the most user-visible latency metric.
- Long task p95 tracks stutter; lower is better.
- A RAF delta p95 near 16-17 ms means frames are staying within the 60 fps budget, so animation and scrolling feel smooth.
- Memory peak is most important for multi-stream dashboards.
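These interpretation rules can be turned into a quick triage pass over a run summary. The thresholds below are illustrative defaults (the 50 ms value mirrors the browser Long Tasks API definition), not StreamMDX recommendations, and the metric names are assumptions.

```javascript
// Hedged sketch: flag a run summary using the interpretation notes above.
function triage(metrics) {
  const flags = [];
  // ~16-17 ms per frame is the 60 fps budget; above that, frames are dropping.
  if (metrics.rafP95Ms > 17) flags.push('janky-frames');
  // 50 ms is the threshold at which the browser reports a "long task".
  if (metrics.longTaskP95Ms > 50) flags.push('long-tasks');
  return flags;
}

triage({ rafP95Ms: 16.4, longTaskP95Ms: 12 }); // smooth run: no flags
```

A flagged run is a prompt to look at the raw samples, not a verdict by itself; a single GC pause can push p95 over a threshold on an otherwise smooth run.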
When the comparison is not apples-to-apples
If another renderer doesn't support MDX, HTML sanitization, or streaming patches, call that out. Use a "reduced" fixture if needed, but keep a second "full" fixture to show StreamMDX's complete feature coverage.