Comparisons and Benchmarks
This article is a roadmap for making fair, reproducible comparisons against other renderers (streamdown, react-markdown, or any custom pipeline). It focuses on testing methodology and on how to interpret the metrics StreamMDX records.
What to compare
- Time to first visible render: first flush time with a realistic streaming scenario.
- Total stream duration: time until the final render is stable.
- Stutter and jank: long-task p95 and the p95 of requestAnimationFrame (RAF) frame deltas.
- Memory growth: peak heap size over the full stream.
- Output correctness: HTML regression snapshots for the same fixture.
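The percentile metrics above can be computed with a nearest-rank sketch like the following. This is an illustrative helper, not part of the StreamMDX harness; the function names and the sample timestamps are assumptions.

```javascript
// Hypothetical helper: derive frame deltas (ms) from a list of
// requestAnimationFrame timestamps, then take a percentile of them.
function frameDeltas(timestamps) {
  const deltas = [];
  for (let i = 1; i < timestamps.length; i++) {
    deltas.push(timestamps[i] - timestamps[i - 1]);
  }
  return deltas;
}

function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method: smallest value with at least p of the data at or below it.
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[idx];
}

// Example: ten frames, nine smooth (16 ms) and one long 50 ms gap.
const ts = [0, 16, 32, 48, 98, 114, 130, 146, 162, 178, 194];
const p95 = percentile(frameDeltas(ts), 0.95);
// One bad frame out of ten is enough to dominate the p95 here,
// which is why p95 is a good stutter signal.
```

The same `percentile` helper applies to long-task durations and any other per-sample metric the harness records.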
Recommended workflow
- Capture StreamMDX baselines locally using the perf harness.
- Re-run the same fixture against competitors using their recommended API.
- Keep scenario definitions and chunk/tick sizes identical.
- Compare deltas with the harness comparator and log results.
- Validate final HTML output against baseline snapshots.
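The "compare deltas" step boils down to percentage differences between a baseline run and a candidate run. A minimal sketch, assuming each run is summarized as a flat object of numeric metrics (the metric names below are illustrative, not the harness's actual output schema):

```javascript
// Compute percentage deltas between a baseline run and a candidate run.
// A positive delta means the candidate is slower / larger on that metric.
function compareRuns(baseline, candidate) {
  const deltas = {};
  for (const key of Object.keys(baseline)) {
    if (typeof candidate[key] !== 'number' || baseline[key] === 0) continue;
    deltas[key] = ((candidate[key] - baseline[key]) / baseline[key]) * 100;
  }
  return deltas;
}

// Hypothetical summaries of two runs of the same fixture/scenario.
const baseline = { firstFlushMs: 120, longTaskP95Ms: 38, memPeakMb: 64 };
const candidate = { firstFlushMs: 90, longTaskP95Ms: 42, memPeakMb: 64 };
const deltas = compareRuns(baseline, candidate);
// deltas.firstFlushMs === -25 (candidate reaches first flush 25% faster)
```

Averaging deltas across multiple runs (after warmup) keeps a single noisy run from dominating the comparison.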
Fixtures and scenarios (current local baselines)
- naive-bayes: S1_slow_small, S2_typical, S4_chunky_network
- table-large: S2_typical, S6_extreme
StreamMDX perf harness
Start the docs server (required for harness runs):
```bash
NEXT_PUBLIC_STREAMING_DEMO_API=true npm run docs:dev
```

Run the harness (examples):
```bash
npm run perf:harness -- --fixture naive-bayes --scenario S1_slow_small --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture naive-bayes --scenario S2_typical --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture naive-bayes --scenario S4_chunky_network --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture table-large --scenario S2_typical --runs 3 --warmup 1 --out tmp/perf-baselines
npm run perf:harness -- --fixture table-large --scenario S6_extreme --runs 3 --warmup 1 --out tmp/perf-baselines
```

Compare candidates:
```bash
npm run perf:compare -- --base tmp/perf-baselines/<baseline> --candidate tmp/perf-baselines/<candidate>
```

Record results in /docs/perf-quality-changelog and keep run paths up to date in /docs/perf-harness.
Notes on interpretation
- First flush is the most user-visible latency metric.
- Long task p95 tracks stutter; lower is better.
- A RAF delta p95 near 16-17 ms means frames are staying within the 60 fps budget, so animation and scrolling feel smooth.
- Memory peak is most important for multi-stream dashboards.
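These interpretation rules can be turned into a quick triage pass over a run summary. The thresholds below are illustrative defaults (the 50 ms value mirrors the browser Long Tasks API definition), not StreamMDX recommendations, and the metric names are assumptions.

```javascript
// Hedged sketch: flag a run summary using the interpretation notes above.
function triage(metrics) {
  const flags = [];
  // ~16-17 ms per frame is the 60 fps budget; above that, frames are dropping.
  if (metrics.rafP95Ms > 17) flags.push('janky-frames');
  // 50 ms is the threshold at which the browser reports a "long task".
  if (metrics.longTaskP95Ms > 50) flags.push('long-tasks');
  return flags;
}

triage({ rafP95Ms: 16.4, longTaskP95Ms: 12 }); // smooth run: no flags
```

A flagged run is a prompt to look at the raw samples, not a verdict by itself; a single GC pause can push p95 over a threshold on an otherwise smooth run.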
When the comparison is not apples-to-apples
If another renderer doesn't support MDX, HTML sanitization, or streaming patches, call that out. Use a "reduced" fixture if needed, but keep a second "full" fixture to show StreamMDX's complete feature coverage.