Commit bd4a55a

add qualcomm 8cx gen 3 benchmark + minor tuning (#48)
* simplify arm code
* some tuning
* adding qualcomm results
1 parent eb73315 commit bd4a55a

2 files changed, +33 -9 lines changed

README.md: +16
@@ -162,6 +162,22 @@ faster than the standard library.
 | Russian-Lipsum | 3.3 | 0.95 | 3.5 x |


+On a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance
+boost as the Neoverse V1.
+
+| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
+|:----------------|:-----------|:--------------------------|:-------------------|
+| Twitter.json | 15 | 10 | 1.5 x |
+| Arabic-Lipsum | 4.0 | 2.3 | 1.7 x |
+| Chinese-Lipsum | 4.0 | 2.9 | 1.4 x |
+| Emoji-Lipsum | 4.0 | 0.9 | 4.4 x |
+| Hebrew-Lipsum | 4.0 | 2.3 | 1.7 x |
+| Hindi-Lipsum | 4.0 | 1.9 | 2.1 x |
+| Japanese-Lipsum | 4.0 | 2.7 | 1.5 x |
+| Korean-Lipsum | 4.0 | 1.5 | 2.7 x |
+| Latin-Lipsum | 50 | 20 | 2.5 x |
+| Russian-Lipsum | 4.0 | 1.2 | 3.3 x |
+
 One difficulty with ARM processors is that they have varied SIMD/NEON performance. For example, Neoverse N1 processors, not to be confused with the Neoverse V1 design used by AWS Graviton 3, have weak SIMD performance. Of course, one can pick and choose which approach is best and it is not necessary to apply SimdUnicode is all cases. We expect good performance on recent ARM-based Qualcomm processors.

 ## Building the library

src/UTF8.cs: +17 -9
@@ -1388,20 +1388,28 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
                     prevIncomplete = Vector128<byte>.Zero;
                     // Often, we have a lot of ASCII characters in a row.
                     int localasciirun = 16;
-                    if (processedLength + localasciirun + 64 <= inputLength)
+                    if (processedLength + localasciirun + 16 <= inputLength)
                     {
-                        for (; processedLength + localasciirun + 64 <= inputLength; localasciirun += 64)
+                        Vector128<byte> block = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
+                        if (AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(AdvSimd.And(block, v80))).ToScalar() == 0)
                         {
-                            Vector128<byte> block1 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
-                            Vector128<byte> block2 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 16);
-                            Vector128<byte> block3 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 32);
-                            Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
-                            Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
-                            if (AdvSimd.Arm64.MaxAcross(or).ToScalar() > 127)
+                            localasciirun += 16;
+                            for (; processedLength + localasciirun + 64 <= inputLength; localasciirun += 64)
                             {
-                                break;
+                                Vector128<byte> block1 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
+                                Vector128<byte> block2 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 16);
+                                Vector128<byte> block3 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 32);
+                                Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
+                                Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
+
+                                if (AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(AdvSimd.And(or, v80))).ToScalar() != 0)
+                                {
+                                    break;
+                                }
                             }
+
                         }
+
                         processedLength += localasciirun - 16;
                     }
                 }
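
The control flow of the tuned loop in the src/UTF8.cs hunk can be read as "probe one 16-byte block first, and only enter the 64-byte unrolled loop if that probe is pure ASCII". The following is a minimal scalar sketch of that pattern, not the library's code: the names `AsciiRunSketch`, `IsAsciiBlock`, and `CountAsciiRun` are hypothetical, the NEON intrinsics are replaced by a plain byte loop so it runs on any platform, and the `localasciirun - 16` offset bookkeeping from the diff is deliberately left out.

```csharp
using System;

static class AsciiRunSketch
{
    // Hypothetical helper: true if none of the `count` bytes starting at
    // `offset` has its high bit set. ORing all bytes and testing bit 0x80
    // once mirrors the OR-then-mask idea in the diff (AdvSimd.Or + And with v80).
    static bool IsAsciiBlock(ReadOnlySpan<byte> data, int offset, int count)
    {
        int acc = 0;
        for (int i = 0; i < count; i++)
        {
            acc |= data[offset + i];
        }
        return (acc & 0x80) == 0;
    }

    // Hypothetical helper: number of bytes starting at `start` that were
    // verified as ASCII in whole blocks (one 16-byte probe, then 64-byte blocks).
    public static int CountAsciiRun(ReadOnlySpan<byte> data, int start)
    {
        // Cheap probe: if fewer than 16 bytes remain, or the first block is
        // not ASCII, skip the unrolled loop entirely.
        if (start + 16 > data.Length || !IsAsciiBlock(data, start, 16))
        {
            return 0;
        }
        int run = 16;
        // Unrolled phase: consume 64 bytes per iteration while they stay ASCII.
        for (; start + run + 64 <= data.Length; run += 64)
        {
            if (!IsAsciiBlock(data, start + run, 64))
            {
                break;
            }
        }
        return run;
    }
}
```

For a 200-byte all-ASCII buffer, `CountAsciiRun(buffer, 0)` in this sketch stops at 144 bytes (16 + 64 + 64), because the next 64-byte block would run past the end. The apparent design choice is that the single-block probe avoids paying for four loads and three ORs when the ASCII run ends almost immediately.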
