@kentonv commented Oct 11, 2025

Hello! I am the lead engineer of Cloudflare Workers. We've been analyzing this test all week, and it helped us find a lot of things we could optimize. Thanks for that!

------------------

This commit fixes two issues with the test itself:

* The next.js benchmark used force-dynamic on Vercel but not Cloudflare. It should use force-dynamic on both. Ironically, this discrepancy should have given Cloudflare an advantage, but we found that Open Next had significant performance bugs in the non-dynamic code path that actually made it worse. Among other things, streaming of the response body was essentially disabled in this mode. We are fixing those bugs, but obviously it's fairest for both platforms to use the same dynamic setting. (A minimal sketch of the one-line change follows this list.)

* The react-ssr benchmark was not setting process.env.NODE_ENV to "production", so React was running in dev mode. When using a higher-level framework, this is normally handled by the framework, but the react-ssr benchmark seems to call lower-level libraries directly. Vercel normally sets this as an actual environment variable in prod, but Workers does not. (Maybe we should...)
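For reference on the first fix, opting a route into dynamic rendering in the Next.js App Router is a one-line export from the page module. A minimal sketch, assuming a bench page shaped like this (the file path and component body are placeholders, not the repo's actual code):

```tsx
// app/bench/page.tsx -- sketch only; the real page's body differs.
// Opt this route out of static rendering so every request is rendered
// dynamically, matching the Vercel edition's configuration.
export const dynamic = "force-dynamic";

export default function BenchPage() {
  // Stand-in for the benchmark's actual rendering work.
  return <p>rendered at {new Date().toISOString()}</p>;
}
```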

This commit also includes some housekeeping, which likely has little impact:

* Updated wrangler to the latest version on all benchmarks. A few of them were set to extremely outdated versions, though we're not aware of any specific issues.

* Updated compatibility date on all benchmarks. We haven't seen this make a difference, but some compatibility dates were six months old and lots has changed in node-compat since then.

* Set `minify: true` in all wrangler.jsonc files. We haven't observed much difference from this, but as some of the bundle sizes are fairly large it could improve cold start time slightly. (A combined config sketch follows this list.)
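Put together, the wrangler.jsonc changes above (plus the NODE_ENV fix from the first list) amount to something like the following. This is a sketch: the worker name, entry point, and compatibility flags are assumptions for illustration, not copied from the repo:

```jsonc
{
  // Hypothetical name and entry point.
  "name": "react-ssr-bench",
  "main": "src/index.ts",
  // Fresh compatibility date to pick up recent node-compat changes.
  "compatibility_date": "2025-10-01",
  "compatibility_flags": ["nodejs_compat"],
  // Minify the bundle; large bundles may cold-start slightly faster.
  "minify": true,
  // Workers doesn't set NODE_ENV for you; without this, React runs in dev mode.
  "vars": {
    "NODE_ENV": "production"
  }
}
```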

------------------

We found many performance issues in Open Next this week, and have fixed several. So, this commit also bumps the version number of Open Next to get those improvements.

That said, this work is ongoing: we expect to land more improvements in the future. Open Next is not as mature as Next.js, it seems.

Separately, Cloudflare has made some changes to our production environment which should significantly improve performance. In particular:

* We corrected a problem where CPU-heavy requests would tend to queue up on a single worker instance per colo, causing excess latency when running concurrent CPU-heavy requests driven from a single client location. (That said, it is still possible for requests to be randomly assigned to the same isolate and block each other, but this should be less common now.)

* We found that we had tuned the V8 garbage collector too far in the direction of favoring memory usage over execution speed. A small adjustment made a big difference in performance, especially in these tests which do a lot of memory allocation.

These two changes are already live for all Workers.

We'll have a blog post about all these changes later.

------------------

Finally, we have a few suggestions about how to run and interpret these benchmarks:

* The "shitty sine benchmark" is indeed suffering from a missing optimization in Node, penalizing Vercel. [We are fixing it](nodejs/node#60153), but it will presumably take some time for this Node change to find its way to Vercel. In the meantime, we agree this benchmark is silly and shouldn't be included.

* We think it is more appropriate to test with a Vercel instance using 1 vCPU rather than 2. [The CTO of Vercel argues there should be no difference since the workload is fundamentally single-threaded](https://x.com/cramforce/status/1975656443954274780), and [he is publishing pricing comparisons on the assumption that only 1 vCPU was actually used](https://x.com/cramforce/status/1975652040195084395). These pricing comparisons are only fair if the assumption is correct. We honestly think he is correct, so, to avoid any questions, we think the test should be run with 1 vCPU. (I realize this sounds like some sort of trick, but it isn't. We haven't had a chance to test the difference ourselves. I just honestly think the 2 vCPU configuration creates confusion that would be nice to avoid.)

* This benchmark still contains a significant "luck" factor in terms of what hardware you get assigned to. Cloudflare has several different generations of hardware in our fleet, and we would expect Vercel / AWS does as well. Different CPUs may have surprisingly different single-threaded performance (example: my 16-core Ryzen 9 9950X personal desktop is 1.7x faster than my 44-core Xeon w9-3575X corp workstation, for single-threaded workloads). Noisy neighbors can also have significant impact by consuming memory bandwidth that is shared by all tenants on the machine. We have seen runs of the test where Cloudflare wins across the board, and others where Vercel wins across the board, presumably as a result of this noise -- and it's not just Cloudflare's performance that varies, but also Vercel's. Note, though, that simply running more iterations of the benchmark does not correct for this "luck", because once instances are assigned to machines, they tend to stay on those machines. Additionally, noisy neighbor effects can be driven by other factors like time of day, regional load imbalances, etc., that don't go away with additional iterations. To get a better sense of the average speed on Cloudflare, we would recommend running tests from many different global locations to hit different Cloudflare colos and thus different machines, but admittedly that's a lot of work.
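For context on the first suggestion, the sine benchmark reduces to a tight trigonometric loop. A toy reduction of that kind of hot loop, with made-up bounds and constants (this is not the repo's actual code):

```ts
// Hypothetical stand-in for the sine benchmark's hot loop. The pending Node
// optimization mostly affects repeated Math.sin calls, so platforms differ
// mainly in how this loop gets compiled and executed.
export function sineBurn(iterations: number): number {
  let acc = 0;
  for (let i = 0; i < iterations; i++) {
    acc += Math.sin(i * 0.001); // dominates the CPU profile
  }
  return acc;
}

// e.g. sineBurn(10_000_000) approximates one CPU-bound request's work.
```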
@greptile-apps bot left a comment

Greptile Summary

This PR makes comprehensive improvements to benchmark fairness and accuracy in the Cloudflare vs Vercel comparison repository. The lead engineer of Cloudflare Workers addresses two critical testing inconsistencies: the Next.js benchmark now uses force-dynamic on both platforms (previously only Vercel used it), and the React SSR benchmark now properly sets NODE_ENV=production for Cloudflare Workers (matching Vercel's default behavior). Additionally, the PR updates all Cloudflare benchmark configurations with housekeeping improvements including upgrading Wrangler CLI to version 4.42.2, updating compatibility dates to October 2025, enabling minification, and most significantly bumping @opennextjs/cloudflare from version 1.3.0 to 1.10.1 to incorporate performance fixes including restored response body streaming. These changes ensure both platforms are tested under identical conditions, eliminating unfair advantages that were ironically hurting Cloudflare's performance due to bugs in the non-dynamic code path.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| next-bench/cf-edition/src/app/bench/page.tsx | 5/5 | Added force-dynamic export to ensure dynamic rendering matches Vercel configuration |
| react-ssr-bench/cf-edition/wrangler.jsonc | 5/5 | Fixed critical NODE_ENV production setting and added minification for fair React benchmarking |
| next-bench/cf-edition/package.json | 5/5 | Updated @opennextjs/cloudflare to 1.10.1 and wrangler to 4.42.2 for performance improvements |
| next-bench/cf-edition/wrangler.jsonc | 5/5 | Updated compatibility date and enabled minification for optimization |
| sveltekit-bench/cf-edition/package.json | 5/5 | Minor wrangler CLI version bump from 4.41.0 to 4.42.2 |
| sveltekit-bench/cf-edition/wrangler.jsonc | 5/5 | Updated compatibility date and enabled minification for consistency |
| vanilla-bench/cf-edition/package.json | 5/5 | Major wrangler update from 3.110.1 to 4.42.2 |
| vanilla-bench/cf-edition/wrangler.jsonc | 5/5 | Updated compatibility date and enabled minification |
| react-ssr-bench/cf-edition/package.json | 5/5 | Updated wrangler from 3.110.1 to 4.42.2 for tooling consistency |

Confidence score: 5/5

  • This PR is extremely safe to merge with minimal risk as it addresses critical fairness issues in benchmarking
  • Score reflects the systematic approach to fixing test inconsistencies and the clear rationale from subject matter experts
  • No files require special attention as all changes are well-documented configuration updates and dependency upgrades

Sequence Diagram

```mermaid
sequenceDiagram
    participant Developer
    participant CF_Package as "CF Package Manager"
    participant CF_Wrangler as "CF Wrangler"
    participant CF_Worker as "CF Worker Runtime"
    participant OpenNext as "OpenNext Library"
    participant React_SSR as "React SSR Engine"

    Developer->>CF_Package: "Update package.json dependencies"
    CF_Package->>CF_Package: "Upgrade @opennextjs/cloudflare to ^1.10.1"
    CF_Package->>CF_Package: "Update wrangler to ^4.42.2"

    Developer->>CF_Wrangler: "Update wrangler.jsonc config"
    CF_Wrangler->>CF_Wrangler: "Set compatibility_date: 2025-10-01"
    CF_Wrangler->>CF_Wrangler: "Enable minify: true"

    Developer->>React_SSR: "Configure production environment"
    React_SSR->>React_SSR: "Set process.env.NODE_ENV: 'production'"

    Developer->>OpenNext: "Apply force-dynamic setting"
    OpenNext->>OpenNext: "Enable dynamic rendering mode"

    Developer->>CF_Worker: "Deploy updated benchmarks"
    CF_Worker->>CF_Worker: "Apply performance optimizations"
    CF_Worker->>CF_Worker: "Fix CPU-heavy request queuing"
    CF_Worker->>CF_Worker: "Tune V8 garbage collector"
```

9 files reviewed, no comments


@t3dotgg merged commit 28106ba into t3dotgg:main Oct 12, 2025
@cramforce

> We think it is more appropriate to test with a Vercel instance using 1 vCPU rather than 2. The CTO of Vercel argues there should be no difference since the workload is fundamentally single-threaded, and he is publishing pricing comparisons on the assumption that only 1 vCPU was actually used. These pricing comparisons are only fair if the assumption is correct. We honestly think he is correct, so, to avoid any questions, we think the test should be run with 1 vCPU. (I realize this sounds like some sort of trick, but it isn't. We haven't had a chance to test the difference ourselves. I just honestly think the 2 vCPU configuration creates confusion that would be nice to avoid.)

The way our pricing works, it always makes sense to select the faster CPU when you have a CPU-bound workload: you only pay extra for the memory (which is relatively cheap), and that gets you faster execution. Net, you save money going with more memory.

It's more nuanced on IO-bound workloads but also often a win.
