Skip to content

JIT_CountProfile32 incorrect native codegen on linux #89340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
AndyAyersMS opened this issue Jul 22, 2023 · 10 comments · Fixed by #89350
Closed

JIT_CountProfile32 incorrect native codegen on linux #89340

AndyAyersMS opened this issue Jul 22, 2023 · 10 comments · Fixed by #89350
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

This method needs to have a bsr in it to work properly. From libcoreclr.so in .NET 8 Preview 6:

(lldb) di -n JIT_CountProfile32
libcoreclr.so`JIT_CountProfile32:
libcoreclr.so[0x33f390] <+0>:  push   rbp
libcoreclr.so[0x33f391] <+1>:  mov    rbp, rsp
libcoreclr.so[0x33f394] <+4>:  push   rbx
libcoreclr.so[0x33f395] <+5>:  push   rax
libcoreclr.so[0x33f396] <+6>:  mov    rbx, rdi
libcoreclr.so[0x33f399] <+9>:  cmp    dword ptr [rdi], 0x0
libcoreclr.so[0x33f39c] <+12>: jle    0x33f3cb                  ; <+59> [inlined] InterlockedAdd at jithelpers.cpp:6015
libcoreclr.so[0x33f39e] <+14>: lea    rdi, [rip + 0x33f193]
libcoreclr.so[0x33f3a5] <+21>: call   0x65f740                  ; symbol stub for: __tls_get_addr
libcoreclr.so[0x33f3aa] <+26>: mov    ecx, dword ptr [rax]
libcoreclr.so[0x33f3b0] <+32>: mov    edx, ecx
libcoreclr.so[0x33f3b2] <+34>: shl    edx, 0xd
libcoreclr.so[0x33f3b5] <+37>: xor    edx, ecx
libcoreclr.so[0x33f3b7] <+39>: mov    ecx, edx
libcoreclr.so[0x33f3b9] <+41>: shr    ecx, 0x11
libcoreclr.so[0x33f3bc] <+44>: xor    ecx, edx
libcoreclr.so[0x33f3be] <+46>: mov    edx, ecx
libcoreclr.so[0x33f3c0] <+48>: shl    edx, 0x5
libcoreclr.so[0x33f3c3] <+51>: xor    edx, ecx
libcoreclr.so[0x33f3c5] <+53>: mov    dword ptr [rax], edx
libcoreclr.so[0x33f3cb] <+59>: lock
libcoreclr.so[0x33f3cc] <+60>: inc    dword ptr [rbx]
libcoreclr.so[0x33f3ce] <+62>: add    rsp, 0x8
libcoreclr.so[0x33f3d2] <+66>: pop    rbx
libcoreclr.so[0x33f3d3] <+67>: pop    rbp
libcoreclr.so[0x33f3d4] <+68>: ret

by way of comparison, here is the 64 bit version which does basically the same thing:

libcoreclr.so`JIT_CountProfile64:
libcoreclr.so[0x33f3e0] <+0>:   push   rbp
libcoreclr.so[0x33f3e1] <+1>:   mov    rbp, rsp
libcoreclr.so[0x33f3e4] <+4>:   push   r14
libcoreclr.so[0x33f3e6] <+6>:   push   rbx
libcoreclr.so[0x33f3e7] <+7>:   mov    rbx, rdi
libcoreclr.so[0x33f3ea] <+10>:  mov    rax, qword ptr [rdi]
libcoreclr.so[0x33f3ed] <+13>:  mov    r14d, 0x1
libcoreclr.so[0x33f3f3] <+19>:  test   rax, rax
libcoreclr.so[0x33f3f6] <+22>:  jle    0x33f446                  ; <+102> [inlined] InterlockedAdd64 at jithelpers.cpp:6044
libcoreclr.so[0x33f3f8] <+24>:  bsr    rax, rax
libcoreclr.so[0x33f3fc] <+28>:  cmp    eax, 0xd
libcoreclr.so[0x33f3ff] <+31>:  jb     0x33f446                  ; <+102> [inlined] InterlockedAdd64 at jithelpers.cpp:6044
libcoreclr.so[0x33f401] <+33>:  xor    rax, 0x3f
libcoreclr.so[0x33f405] <+37>:  mov    cl, 0x33
libcoreclr.so[0x33f407] <+39>:  sub    cl, al
libcoreclr.so[0x33f409] <+41>:  shl    r14, cl
libcoreclr.so[0x33f40c] <+44>:  lea    rdi, [rip + 0x33f125]
libcoreclr.so[0x33f413] <+51>:  call   0x65f740                  ; symbol stub for: __tls_get_addr
libcoreclr.so[0x33f418] <+56>:  mov    ecx, dword ptr [rax]
libcoreclr.so[0x33f41e] <+62>:  mov    edx, ecx
libcoreclr.so[0x33f420] <+64>:  shl    edx, 0xd
libcoreclr.so[0x33f423] <+67>:  xor    edx, ecx
libcoreclr.so[0x33f425] <+69>:  mov    ecx, edx
libcoreclr.so[0x33f427] <+71>:  shr    ecx, 0x11
libcoreclr.so[0x33f42a] <+74>:  xor    ecx, edx
libcoreclr.so[0x33f42c] <+76>:  mov    edx, ecx
libcoreclr.so[0x33f42e] <+78>:  shl    edx, 0x5
libcoreclr.so[0x33f431] <+81>:  xor    edx, ecx
libcoreclr.so[0x33f433] <+83>:  mov    dword ptr [rax], edx
libcoreclr.so[0x33f439] <+89>:  lea    eax, [r14 - 0x1]
libcoreclr.so[0x33f43d] <+93>:  test   eax, edx
libcoreclr.so[0x33f43f] <+95>:  je     0x33f446                  ; <+102> [inlined] InterlockedAdd64 at jithelpers.cpp:6044
libcoreclr.so[0x33f441] <+97>:  pop    rbx
libcoreclr.so[0x33f442] <+98>:  pop    r14
libcoreclr.so[0x33f444] <+100>: pop    rbp
libcoreclr.so[0x33f445] <+101>: ret
libcoreclr.so[0x33f446] <+102>: lock
libcoreclr.so[0x33f447] <+103>: add    qword ptr [rbx], r14
libcoreclr.so[0x33f44a] <+106>: pop    rbx
libcoreclr.so[0x33f44b] <+107>: pop    r14
libcoreclr.so[0x33f44d] <+109>: pop    rbp
libcoreclr.so[0x33f44e] <+110>: ret

I noticed this because of some odd looking profile counts while investigating #87194 (comment).

In particular for EnumerateUsingIndexer as run under BDN, with both interlocked (precise) and scalable counts enabled (interlocked counts are the upper value), note how the second count below is 1 instead of ~12800 or so.

;; linux 32 bit  counts

@@@ codehash 0xE86C4ED7 methodhash 0x84E51BC6 ilSize 0x00000026 records 0x00000002
MethodName: System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer
Signature: instance void  ()
Schema InstrumentationKind 385 ILOffset 16 Count 2 Other 33
3846300
4372481
Schema InstrumentationKind 385 ILOffset 37 Count 2 Other 0
12821
1

;; linux 64 bit counts

@@@ codehash 0xE86C4ED7 methodhash 0x84E51BC6 ilSize 0x00000026 records 0x00000002
MethodName: System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer
Signature: instance void  ()
Schema InstrumentationKind 386 ILOffset 16 Count 2 Other 33
3799800 0
3817984 0
Schema InstrumentationKind 386 ILOffset 37 Count 2 Other 0
12666 0
12734 0

 ;; windows 32 bit counts

 @@@ codehash 0xE86C4ED7 methodhash 0x84E51BC6 ilSize 0x00000026 records 0x00000002
MethodName: System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer
Signature: instance void  ()
Schema InstrumentationKind 385 ILOffset 16 Count 2 Other 33
4156800
4110336
Schema InstrumentationKind 385 ILOffset 37 Count 2 Other 0
13856
13826

Not clear yet what the problem is, either there is some sort of preprocessor mixup, or else a bug in clang or LLVM?

@ghost ghost added untriaged New issue has not been triaged by the area owner area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jul 22, 2023
@ghost
Copy link

ghost commented Jul 22, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This method needs to have a bsr in it to work properly. From libcoreclr.so in .NET 8 Preview 6:

(lldb) di -n JIT_CountProfile32
libcoreclr.so`JIT_CountProfile32:
libcoreclr.so[0x33f390] <+0>:  push   rbp
libcoreclr.so[0x33f391] <+1>:  mov    rbp, rsp
libcoreclr.so[0x33f394] <+4>:  push   rbx
libcoreclr.so[0x33f395] <+5>:  push   rax
libcoreclr.so[0x33f396] <+6>:  mov    rbx, rdi
libcoreclr.so[0x33f399] <+9>:  cmp    dword ptr [rdi], 0x0
libcoreclr.so[0x33f39c] <+12>: jle    0x33f3cb                  ; <+59> [inlined] InterlockedAdd at jithelpers.cpp:6015
libcoreclr.so[0x33f39e] <+14>: lea    rdi, [rip + 0x33f193]
libcoreclr.so[0x33f3a5] <+21>: call   0x65f740                  ; symbol stub for: __tls_get_addr
libcoreclr.so[0x33f3aa] <+26>: mov    ecx, dword ptr [rax]
libcoreclr.so[0x33f3b0] <+32>: mov    edx, ecx
libcoreclr.so[0x33f3b2] <+34>: shl    edx, 0xd
libcoreclr.so[0x33f3b5] <+37>: xor    edx, ecx
libcoreclr.so[0x33f3b7] <+39>: mov    ecx, edx
libcoreclr.so[0x33f3b9] <+41>: shr    ecx, 0x11
libcoreclr.so[0x33f3bc] <+44>: xor    ecx, edx
libcoreclr.so[0x33f3be] <+46>: mov    edx, ecx
libcoreclr.so[0x33f3c0] <+48>: shl    edx, 0x5
libcoreclr.so[0x33f3c3] <+51>: xor    edx, ecx
libcoreclr.so[0x33f3c5] <+53>: mov    dword ptr [rax], edx
libcoreclr.so[0x33f3cb] <+59>: lock
libcoreclr.so[0x33f3cc] <+60>: inc    dword ptr [rbx]
libcoreclr.so[0x33f3ce] <+62>: add    rsp, 0x8
libcoreclr.so[0x33f3d2] <+66>: pop    rbx
libcoreclr.so[0x33f3d3] <+67>: pop    rbp
libcoreclr.so[0x33f3d4] <+68>: ret

by way of comparison, here is the 64 bit version which does basically the same thing:

libcoreclr.so`JIT_CountProfile64:
libcoreclr.so[0x33f3e0] <+0>:   push   rbp
libcoreclr.so[0x33f3e1] <+1>:   mov    rbp, rsp
libcoreclr.so[0x33f3e4] <+4>:   push   r14
libcoreclr.so[0x33f3e6] <+6>:   push   rbx
libcoreclr.so[0x33f3e7] <+7>:   mov    rbx, rdi
libcoreclr.so[0x33f3ea] <+10>:  mov    rax, qword ptr [rdi]
libcoreclr.so[0x33f3ed] <+13>:  mov    r14d, 0x1
libcoreclr.so[0x33f3f3] <+19>:  test   rax, rax
libcoreclr.so[0x33f3f6] <+22>:  jle    0x33f446                  ; <+102> [inlined] InterlockedAdd64 at jithelpers.cpp:6044
libcoreclr.so[0x33f3f8] <+24>:  bsr    rax, rax
libcoreclr.so[0x33f3fc] <+28>:  cmp    eax, 0xd
libcoreclr.so[0x33f3ff] <+31>:  jb     0x33f446                  ; <+102> [inlined] InterlockedAdd64 at jithelpers.cpp:6044
libcoreclr.so[0x33f401] <+33>:  xor    rax, 0x3f
libcoreclr.so[0x33f405] <+37>:  mov    cl, 0x33
libcoreclr.so[0x33f407] <+39>:  sub    cl, al
libcoreclr.so[0x33f409] <+41>:  shl    r14, cl
libcoreclr.so[0x33f40c] <+44>:  lea    rdi, [rip + 0x33f125]
libcoreclr.so[0x33f413] <+51>:  call   0x65f740                  ; symbol stub for: __tls_get_addr
libcoreclr.so[0x33f418] <+56>:  mov    ecx, dword ptr [rax]
libcoreclr.so[0x33f41e] <+62>:  mov    edx, ecx
libcoreclr.so[0x33f420] <+64>:  shl    edx, 0xd
libcoreclr.so[0x33f423] <+67>:  xor    edx, ecx
libcoreclr.so[0x33f425] <+69>:  mov    ecx, edx
libcoreclr.so[0x33f427] <+71>:  shr    ecx, 0x11
libcoreclr.so[0x33f42a] <+74>:  xor    ecx, edx
libcoreclr.so[0x33f42c] <+76>:  mov    edx, ecx
libcoreclr.so[0x33f42e] <+78>:  shl    edx, 0x5
libcoreclr.so[0x33f431] <+81>:  xor    edx, ecx
libcoreclr.so[0x33f433] <+83>:  mov    dword ptr [rax], edx
libcoreclr.so[0x33f439] <+89>:  lea    eax, [r14 - 0x1]
libcoreclr.so[0x33f43d] <+93>:  test   eax, edx
libcoreclr.so[0x33f43f] <+95>:  je     0x33f446                  ; <+102> [inlined] InterlockedAdd64 at jithelpers.cpp:6044
libcoreclr.so[0x33f441] <+97>:  pop    rbx
libcoreclr.so[0x33f442] <+98>:  pop    r14
libcoreclr.so[0x33f444] <+100>: pop    rbp
libcoreclr.so[0x33f445] <+101>: ret
libcoreclr.so[0x33f446] <+102>: lock
libcoreclr.so[0x33f447] <+103>: add    qword ptr [rbx], r14
libcoreclr.so[0x33f44a] <+106>: pop    rbx
libcoreclr.so[0x33f44b] <+107>: pop    r14
libcoreclr.so[0x33f44d] <+109>: pop    rbp
libcoreclr.so[0x33f44e] <+110>: ret

I noticed this because of some odd looking profile counts while investigating #87194 (comment).

In particular for EnumerateUsingIndexer as run under BDN, with both interlocked (precise) and scalable counts enabled (interlocked counts are the upper value), note how the second count below is 1 instead of ~12800 or so.

;; linux 32 bit  counts

@@@ codehash 0xE86C4ED7 methodhash 0x84E51BC6 ilSize 0x00000026 records 0x00000002
MethodName: System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer
Signature: instance void  ()
Schema InstrumentationKind 385 ILOffset 16 Count 2 Other 33
3846300
4372481
Schema InstrumentationKind 385 ILOffset 37 Count 2 Other 0
12821
1

;; linux 64 bit counts

@@@ codehash 0xE86C4ED7 methodhash 0x84E51BC6 ilSize 0x00000026 records 0x00000002
MethodName: System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer
Signature: instance void  ()
Schema InstrumentationKind 386 ILOffset 16 Count 2 Other 33
3799800 0
3817984 0
Schema InstrumentationKind 386 ILOffset 37 Count 2 Other 0
12666 0
12734 0

 ;; windows 32 bit counts

 @@@ codehash 0xE86C4ED7 methodhash 0x84E51BC6 ilSize 0x00000026 records 0x00000002
MethodName: System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer
Signature: instance void  ()
Schema InstrumentationKind 385 ILOffset 16 Count 2 Other 33
4156800
4110336
Schema InstrumentationKind 385 ILOffset 37 Count 2 Other 0
13856
13826

Not clear yet what the problem is, either there is some sort of preprocessor mixup, or else a bug in clang or LLVM?

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jul 22, 2023
@AndyAyersMS AndyAyersMS self-assigned this Jul 22, 2023
@AndyAyersMS AndyAyersMS added this to the 8.0.0 milestone Jul 22, 2023
@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Jul 22, 2023

Source code is here:

HCIMPL1(void, JIT_CountProfile32, volatile LONG* pCounter)
{
FCALL_CONTRACT;
FC_GC_POLL_NOT_NEEDED();
LONG count = *pCounter;
LONG delta = 1;
if (count > 0)
{
DWORD logCount = 0;
BitScanReverse(&logCount, count);
if (logCount >= 13)
{
delta = 1 << (logCount - 12);
const unsigned rand = HandleHistogramProfileRand();
const bool update = (rand & (delta - 1)) == 0;
if (!update)
{
return;
}
}
}
InterlockedAdd(pCounter, delta);
}
HCIMPLEND

It looks like there's not much else in the runtime that depends up on BitScanReverse working properly, though it is used by CreateCastCache. The JIT and GC use BitScanReverse but have their own slightly different implementations.

cc @VSadov

@AndyAyersMS
Copy link
Member Author

Preprocessed output looks reasonable (with the exception of odd expansion of rand, which is immaterial here):

inline
unsigned char
__attribute__((visibility("default")))
BitScanReverse(
           PDWORD Index,
       UINT qwMask)
{





    int lzcount = __builtin_clzl(qwMask);
    *Index = (DWORD)(31 - lzcount);
    return qwMask != 0;
}

...

void JIT_CountProfile32(volatile LONG* pCounter) { LPVOID __me; __me = 0;
{
    { }; { }; { };
                           ;

    LONG count = *pCounter;
    LONG delta = 1;

    if (count > 0)
    {
        DWORD logCount = 0;
        BitScanReverse(&logCount, count);

        if (logCount >= 13)
        {
            delta = 1 << (logCount - 12);
            const unsigned Do_not_use_rand = HandleHistogramProfileRand();
            const bool update = (Do_not_use_rand & (delta - 1)) == 0;
            if (!update)
            {
                return;
            }

        }
    }

    InterlockedAdd(pCounter, delta);
}
; }

@AndyAyersMS
Copy link
Member Author

cc @dotnet/jit-contrib

@BruceForstall
Copy link
Contributor

Looks like you should rename the rand local to avoid having this kick in:

#define rand Do_not_use_rand

@jakobbotsch
Copy link
Member

It seems like __builtin_clzl should be __builtin_clz, or 31 - lzcount computed in the next line should be 63 - lzcount.

@markples
Copy link
Contributor

I think __builtin_clzl is a bug because long is 64 bits here, not 32. When LONG is 32 and long is 64, and value of __builtin_clzl is guaranteed to be in [32,63]. The 31 - lzcount then goes negative but is cast to a large unsigned. The conditional can be optimized out, but the side effect of the random computation needs to be retained.

@markples
Copy link
Contributor

@jakobbotsch Aw, I typed too much, and you beat me to it :-)

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jul 22, 2023
The existing PAL code was using `__builtin_clzl` which is intended for platforms
where `long` is 64 bits. Instead use `__builtin_clz`.

The GC version had a similar issue so I've changed that too. The JIT version
was already using `__builtin_clz`.

Fixes dotnet#89340.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 22, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 22, 2023
AndyAyersMS added a commit that referenced this issue Jul 22, 2023
The existing PAL code was using `__builtin_clzl` which is intended for platforms
where `long` is 64 bits. Instead use `__builtin_clz`.

The GC version had a similar issue so I've changed that too. The JIT version
was already using `__builtin_clz`.

Fixes #89340.
@VSadov
Copy link
Member

VSadov commented Jul 24, 2023

It looks like there's not much else in the runtime that depends up on BitScanReverse working properly, though it is used by CreateCastCache.

If the concern is about CastCache having dependency on some BitScanReverse peculiarity - the underlying array in the cache is never empty, so bsr will not be used with 0

@AndyAyersMS
Copy link
Member Author

It looks like there's not much else in the runtime that depends up on BitScanReverse working properly, though it is used by CreateCastCache.

If the concern is about CastCache having dependency on some BitScanReverse peculiarity - the underlying array in the cache is never empty, so bsr will not be used with 0

It was not just the 0 case; BitScanReverse was generally broken for non-windows. But it looks like maybe CastCache would be unlikely to have hit this as we don't support any 32 bit non-windows platforms. BitScanReverse64 was ok.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants