src: enable libm trig functions in V8 #60153
base: main
Conversation
Should provide better performance on some platforms.
Force-pushed from 34e8bf7 to e02de4a.
just a note of interest... several years ago I tried applying libm optimizations to some operations in V8 (though, not the trig functions) and we eventually found the output to be bad enough in certain cases to necessitate reverting it.
lgtm
@targos ... that's why I wanted to be sure to tag you here ... :-) ... it definitely seemed to have an impact locally but I couldn't quite figure out if it was completely wired in correctly. What additional changes do you think would be necessary?
Don't we have a 37% gain in one case? Or do I misread what @jasnell posted?
That was what I saw locally but it's possible something else is happening there. Need to be certain.
We would need to port at least:

* Lines 207 to 208 in f0aa073
* Lines 1417 to 1419 in f0aa073

This should go in https://github.com/nodejs/node/blob/f0aa073907fc88f64ca95e4821eb7fdf49b133e2/tools/v8_gypfiles/features.gypi, similar to many other build flags, with a variable declaration at the top and a new condition for the define.

And possibly:

* Lines 6930 to 6932 in f0aa073
* Lines 6937 to 6957 in f0aa073

This is less trivial. We don't have …
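For reference, a minimal sketch of what that wiring typically looks like in tools/v8_gypfiles/features.gypi, following the pattern of the existing flags there. The variable and define names (`v8_use_libm_trig_functions` / `V8_USE_LIBM_TRIG_FUNCTIONS`) are assumptions mirroring V8's GN option, not something confirmed by this PR:

```
# features.gypi-style sketch (GYP files use Python literal syntax).
# Flag and define names below are assumptions mirroring V8's GN argument.
{
  'variables': {
    # Off by default; a platform/arch condition could enable it where libm wins.
    'v8_use_libm_trig_functions%': 0,
  },
  'target_defaults': {
    'conditions': [
      ['v8_use_libm_trig_functions==1', {
        'defines': ['V8_USE_LIBM_TRIG_FUNCTIONS'],
      }],
    ],
  },
}
```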
@targos ... yeah, that's pretty much what I suspected but was taking the optimistic approach initially ;-) ... If this is going to require the …
Hello! I am the lead engineer of Cloudflare Workers. We've been analyzing this test all week, and it helped us find a lot of things we could optimize. Thanks for that!

------------------

This commit fixes two issues with the test itself:

* The next.js benchmark used force-dynamic on Vercel but not Cloudflare. It should use force-dynamic on both. Ironically, this discrepancy should have given Cloudflare an advantage, but we found that Open Next had significant performance bugs in the non-dynamic code path that actually made it worse. Among other things, streaming of the response body was essentially disabled in this mode. We are fixing those bugs, but obviously it's most fair for both platforms to use the same dynamic setting.
* The react-ssr benchmark was not setting process.env.NODE_ENV to "production", so React was running in dev mode. When using a higher-level framework, this is normally handled by the framework, but the react-ssr benchmark seems to call lower-level libraries directly. Vercel normally sets this as an actual environment variable in prod, but Workers does not. (Maybe we should...) (Both fixes are sketched just after this comment.)

This commit also includes some housekeeping, which likely has little impact:

* Update wrangler to latest on all benchmarks. A few of them were set to extremely outdated versions, though we're not aware of any specific issues.
* Updated compatibility date on all benchmarks. We haven't seen this make a difference, but some compatibility dates were six months old and lots has changed in node-compat since then.
* Set `minify: true` in all wrangler.jsonc files. We haven't observed much difference from this, but as some of the bundle sizes are fairly large it could improve cold start time slightly.

------------------

We found many performance issues in Open Next this week, and have fixed several. So, this commit also bumps the version number of Open Next to get those improvements. That said, this work is ongoing: we expect to land more improvements in the future. Open Next is not as mature as Next.js, it seems.

Separately, Cloudflare has made some changes to our production environment which should significantly improve performance. In particular:

* We corrected a problem where CPU-heavy requests would tend to queue up on a single worker instance per colo, causing excess latency when running concurrent CPU-heavy requests driven from a single client location. (That said, it is still possible for requests to be randomly assigned to the same isolate and block each other, but this should be less common now.)
* We found that we had tuned the V8 garbage collector too far in the direction of favoring memory usage over execution speed. A small adjustment made a big difference in performance, especially in these tests which do a lot of memory allocation.

These two changes are already live for all Workers. We'll have a blog post about all these changes later.

------------------

Finally, we have a few suggestions about how to run and interpret these benchmarks:

* The "shitty sine benchmark" is indeed suffering from a missing optimization in Node, penalizing Vercel. [We are fixing it](nodejs/node#60153), but it will presumably take some time for this Node change to find its way to Vercel. In the meantime, we agree this benchmark is silly and shouldn't be included.
* We think it is more appropriate to test with a Vercel instance using 1vcpu rather than 2. [The CTO of Vercel argues there should be no difference since the workload is fundamentally single-threaded](https://x.com/cramforce/status/1975656443954274780), and [he is publishing pricing comparisons on the assumption that only 1 vcpu was actually used](https://x.com/cramforce/status/1975652040195084395). These pricing comparisons are only fair if the assumption is correct. We honestly think he is correct, so, to avoid any questions, we think the test should be run with 1vcpu. (I realize this sounds like some sort of trick, but it isn't. We haven't had a chance to test the difference ourselves. I just honestly think the 2vcpu thing creates confusion that would be nice to avoid.)
* This benchmark still contains a significant "luck" factor in terms of what hardware you get assigned to. Cloudflare has several different generations of hardware in our fleet, and we would expect Vercel / AWS does as well. Different CPUs may have surprisingly different single-threaded performance (example: my 16-core Ryzen 9 9950X personal desktop is 1.7x faster than my 44-core Xeon w9-3575X corp workstation, for single-threaded workloads). Noisy neighbors can also have significant impact by consuming memory bandwidth that is shared by all tenants on the machine. We have seen runs of the test where Cloudflare wins across the board, and others where Vercel wins across the board, presumably as a result of this noise -- and it's not just Cloudflare's performance that varies, but also Vercel's. Note, though, that simply running more iterations of the benchmark does not correct for this "luck", because once instances are assigned to machines, they tend to stay on those machines. Additionally, noisy neighbor effects can be driven by other factors like time of day, regional load imbalances, etc., that don't go away with additional iterations. To get a better sense of the average speed on Cloudflare, we would recommend running tests from many different global locations to hit different Cloudflare colos and thus different machines, but admittedly that's a lot of work.
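To make the first two fixes above concrete, here is a minimal sketch of each; the file paths are illustrative, not taken from the benchmark repo:

```js
// Next.js App Router page (illustrative path: app/page.jsx).
// Opt the page out of static rendering so both platforms exercise the same
// dynamic code path.
export const dynamic = 'force-dynamic';
```

```js
// react-ssr entry point (illustrative). Ensure React runs in production mode
// when no framework sets NODE_ENV; this must run before React is loaded.
if (!process.env.NODE_ENV) {
  process.env.NODE_ENV = 'production';
}
```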

Should provide better performance on some platforms.
Context: @t3dotgg put together some benchmarks that show the trig functions running in node.js aren't as fast as they could be. We did some digging and think this may help. https://github.com/t3dotgg/cf-vs-vercel-bench/blob/main/vanilla-bench/vercel-edition/api/
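For illustration, a hypothetical reduction of the kind of trig-heavy hot loop those benchmarks exercise (not the actual benchmark code). With this change, Math.sin/Math.cos should be able to use the platform libm on platforms where that is faster:

```js
// sine-bench.mjs -- hypothetical micro-benchmark, not taken from the linked repo.
const ITERATIONS = 5_000_000;

let acc = 0;
const start = process.hrtime.bigint();
for (let i = 0; i < ITERATIONS; i++) {
  // Math.sin/Math.cos bottom out in V8's trig routines; this PR lets them
  // use the platform libm instead where available.
  acc += Math.sin(i) * Math.cos(i);
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

// Print acc so the loop can't be optimized away entirely.
console.log(`acc=${acc.toFixed(3)}, ${elapsedMs.toFixed(1)} ms`);
```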
There's a possibility this won't work on all platforms/archs, so running some tests in CI to verify first. Appears to be fine in CI! /cc @nodejs/v8