
Conversation

@AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Aug 23, 2025

⚙️ Optimization

Resolves #3857. Divides by 100 instead of by 10, as proposed.

There's a similar place in to_chars, skipped for now.
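The idea can be sketched roughly as follows: a 200-character table holds the two-character representations of 0..99, so each division by 100 emits two digits at once. This is a minimal standalone sketch, not the actual PR code; `make_digit_pairs` and `two_digit_to_buff` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Build "000102...9899" at compile time: two chars per value 0..99.
constexpr auto make_digit_pairs() {
    struct Pairs {
        char data[200];
    } result{};
    for (int value = 0; value < 100; ++value) {
        result.data[2 * value]     = static_cast<char>('0' + value / 10);
        result.data[2 * value + 1] = static_cast<char>('0' + value % 10);
    }
    return result;
}

inline constexpr auto digit_pairs = make_digit_pairs();

// Writes the decimal digits of value, ending just before buffer_end,
// two digits per iteration; returns a pointer to the first digit.
char* two_digit_to_buff(char* buffer_end, std::uint64_t value) {
    while (value >= 100) {
        buffer_end -= 2;
        const std::size_t pair = static_cast<std::size_t>(value % 100) * 2;
        buffer_end[0] = digit_pairs.data[pair];
        buffer_end[1] = digit_pairs.data[pair + 1];
        value /= 100;
    }
    if (value >= 10) { // two digits remain
        buffer_end -= 2;
        const std::size_t pair = static_cast<std::size_t>(value) * 2;
        buffer_end[0] = digit_pairs.data[pair];
        buffer_end[1] = digit_pairs.data[pair + 1];
    } else { // one digit remains
        *--buffer_end = static_cast<char>('0' + value);
    }
    return buffer_end;
}
```

The point of dividing by 100 is that the number of (relatively expensive) divisions is halved, at the cost of a small table lookup per pair of digits.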

🏁 Benchmark

Large and small numbers, like numbers naturally seen when counting things.
Generated via a log-normal distribution, as @statementreply suggested.
Picked some arbitrary parameters to approximately fit the integer ranges.

Also benchmarked std::_UIntegral_to_buff separately, to see how much the optimization helps on its own, avoiding the #1024 limitation.
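The input generation could look roughly like this (a hypothetical sketch, not the benchmark's actual code; `make_inputs`, the fixed seed, and the rejection clamp are my assumptions — the `<mean, sigma>` pairs mirror the template parameters in the benchmark names below):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Generate count benchmark inputs from a log-normal distribution,
// rejecting samples that don't fit the target unsigned integer type.
template <class UInt>
std::vector<UInt> make_inputs(double mean, double sigma, std::size_t count) {
    std::mt19937_64 engine{12345}; // fixed seed for reproducible runs
    std::lognormal_distribution<double> dist{mean, sigma};
    std::vector<UInt> values;
    values.reserve(count);
    while (values.size() < count) {
        const double sample = dist(engine);
        if (sample <= static_cast<double>(std::numeric_limits<UInt>::max())) {
            values.push_back(static_cast<UInt>(sample));
        }
    }
    return values;
}
```

A log-normal distribution makes the digit counts roughly normally distributed, which approximates "numbers naturally seen when counting things" better than a uniform distribution (which would be dominated by maximum-length values).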

⏱️ Benchmark results

i5-1235U P cores:

| Benchmark | Before | After | Speedup |
| --- | --- | --- | --- |
| `internal_integer_to_buff<uint8_t, 2.5, 1.5>` | 2.30 ns | 3.42 ns | 0.67 |
| `internal_integer_to_buff<uint16_t, 5.0, 3.0>` | 3.70 ns | 2.64 ns | 1.40 |
| `internal_integer_to_buff<uint32_t, 10.0, 6.0>` | 4.69 ns | 2.86 ns | 1.64 |
| `internal_integer_to_buff<uint64_t, 20.0, 12.0>` | 10.5 ns | 5.29 ns | 1.98 |
| `integer_to_string<uint8_t, 2.5, 1.5>` | 5.87 ns | 5.44 ns | 1.08 |
| `integer_to_string<uint16_t, 5.0, 3.0>` | 6.79 ns | 6.32 ns | 1.07 |
| `integer_to_string<uint32_t, 10.0, 6.0>` | 8.11 ns | 7.28 ns | 1.11 |
| `integer_to_string<uint64_t, 20.0, 12.0>` | 14.5 ns | 14.2 ns | 1.02 |
| `integer_to_string<int8_t, 2.5, 1.5>` | 6.64 ns | 5.96 ns | 1.11 |
| `integer_to_string<int16_t, 5.0, 3.0>` | 6.23 ns | 5.88 ns | 1.06 |
| `integer_to_string<int32_t, 10.0, 6.0>` | 7.58 ns | 6.33 ns | 1.20 |
| `integer_to_string<int64_t, 20.0, 12.0>` | 17.8 ns | 18.8 ns | 0.95 |

i5-1235U E cores:

| Benchmark | Before | After | Speedup |
| --- | --- | --- | --- |
| `internal_integer_to_buff<uint8_t, 2.5, 1.5>` | 4.14 ns | 4.79 ns | 0.86 |
| `internal_integer_to_buff<uint16_t, 5.0, 3.0>` | 8.08 ns | 4.76 ns | 1.70 |
| `internal_integer_to_buff<uint32_t, 10.0, 6.0>` | 11.4 ns | 5.41 ns | 2.11 |
| `internal_integer_to_buff<uint64_t, 20.0, 12.0>` | 23.8 ns | 13.9 ns | 1.71 |
| `integer_to_string<uint8_t, 2.5, 1.5>` | 17.2 ns | 12.7 ns | 1.35 |
| `integer_to_string<uint16_t, 5.0, 3.0>` | 17.1 ns | 13.6 ns | 1.26 |
| `integer_to_string<uint32_t, 10.0, 6.0>` | 18.3 ns | 14.0 ns | 1.31 |
| `integer_to_string<uint64_t, 20.0, 12.0>` | 36.6 ns | 29.4 ns | 1.24 |
| `integer_to_string<int8_t, 2.5, 1.5>` | 17.8 ns | 12.0 ns | 1.48 |
| `integer_to_string<int16_t, 5.0, 3.0>` | 20.0 ns | 13.4 ns | 1.49 |
| `integer_to_string<int32_t, 10.0, 6.0>` | 21.5 ns | 15.1 ns | 1.42 |
| `integer_to_string<int64_t, 20.0, 12.0>` | 39.7 ns | 35.0 ns | 1.13 |

🥉 Results interpretation

I'm not even sure if this is worth doing.

Allocating the string and copying the result into it takes roughly half of the time, so the effect of a micro-optimization in digit generation is small.

However, the internal function seems to show an improvement. This looks like an indication that the #1024 improvement would help here. It could be that performance is limited by failed store-to-load forwarding, as individual character stores are followed by a bulk memcpy; in that case, the improvement may be somewhat negated by a longer stall.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner August 23, 2025 19:42
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Aug 23, 2025
@StephanTLavavej StephanTLavavej added performance Must go faster decision needed We need to choose something before working on this labels Aug 24, 2025
@StephanTLavavej StephanTLavavej self-assigned this Aug 24, 2025
@StephanTLavavej

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

@AlexGuteniev

This comment was marked as outdated.

@AlexGuteniev AlexGuteniev force-pushed the integers branch 2 times, most recently from 672f1db to 7ea6121 Compare August 25, 2025 13:09
@StephanTLavavej StephanTLavavej removed their assignment Nov 14, 2025
@StephanTLavavej
Member

I think I would be more comfortable with this if we didn't have to drag in the Ryu tables into <xmemory>.

It would be fairly simple to completely reimplement the digit tables with a constexpr helper to emit the digits.

@AlexGuteniev
Contributor Author

It would be fairly simple to completely reimplement the digit tables with a constexpr helper to emit the digits.

Do we want that? I've inserted a predefined array for now, which should be better for throughput.

@AlexGuteniev
Contributor Author

I think I would be more comfortable with this if we didn't have to drag in the Ryu tables into <xmemory>.

After doing this, and eliminating the if constexpr, I was able to do some code rearrangement; the results are now somewhat better. I've updated the tables in the description.

@StephanTLavavej
Member

My concern was around keeping Ryu-derived code in separate files. If we generate the tables with a freshly written constexpr function, then there's no question that they aren't a derived work. That said, the digit tables are very obviously data/facts with no creativity.

@AlexGuteniev
Contributor Author

Ok, generated the table.



Development

Successfully merging this pull request may close these issues.

<string>: to_string() for integers could be faster
