Skip to content

Vector256.Shuffle() produces bad results in release mode #86432

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ayende opened this issue May 18, 2023 · 9 comments
Closed

Vector256.Shuffle() produces bad results in release mode #86432

ayende opened this issue May 18, 2023 · 9 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Milestone

Comments

@ayende
Copy link
Contributor

ayende commented May 18, 2023

Description

Run the following code with release and debug configuration, it will give different outputs:

Correct

$ dotnet run -c Debug
<0, 0, 21, 2063, 2066, 3095, 0, 0>

Wrong

dotnet run -c Release
<0, 0, 21, 2063, 0, 17, 0, 0>

Note that the last 4 elements are getting what appear to be random values. Repeated runs get different values.

Reproduction Steps

Run:

using System.Runtime.Intrinsics;

Span<uint> a = stackalloc uint[8] { 0, 21, 2063, 2066, 3095, 0, 0, 0 };
var d = Vector256.Shuffle(
    Vector256.LoadUnsafe(ref a[0]), 
    Vector256.Create(8u, 0, 1, 2, 3, 4, 5, 6)
);

Console.WriteLine( d);

In release and debug mode.

Expected behavior

Should generate the same output in both cases

Actual behavior

Wrong output

Regression?

No response

Known Workarounds

No response

Configuration

dotnet --version
7.0.203

Tested on x64 Windows machine.

Other information

No response

@ghost ghost added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels May 18, 2023
@ghost
Copy link

ghost commented May 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Run the following code with release and debug configuration, it will give different outputs:

** Correct **

$ dotnet run -c Debug
<0, 0, 21, 2063, 2066, 3095, 0, 0>

** Wrong **

dotnet run -c Release
<0, 0, 21, 2063, 0, 17, 0, 0>

Note that the last 4 elements are getting what appear to be random values. Repeated runs get different values.

Reproduction Steps

Run:

using System.Runtime.Intrinsics;

Span<uint> a = stackalloc uint[8] { 0, 21, 2063, 2066, 3095, 0, 0, 0 };
var d = Vector256.Shuffle(
    Vector256.LoadUnsafe(ref a[0]), 
    Vector256.Create(8u, 0, 1, 2, 3, 4, 5, 6)
);

Console.WriteLine( d);

In release and debug mode.

Expected behavior

Should generate the same output in both cases

Actual behavior

Wrong output

Regression?

No response

Known Workarounds

No response

Configuration

dotnet --version
7.0.203

Tested on x64 Windows machine.

Other information

No response

Author: ayende
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@hypeartist
Copy link

hypeartist commented May 18, 2023

7.0.400-preview.23225.8 (Debug):

vmovupd			ymm0, ymmword ptr[reloc @RWD00]
vmovupd			ymm1, ymmword ptr[reloc @RWD32]
vpermd			ymm1,ymm1,ymmword ptr [rcx]  
vpand			ymm1,ymm0,ymm1  
vxorps			ymm2,ymm2,ymm2  
vpandn			ymm0,ymm0,ymm2  
vpor			ymm6,ymm1,ymm0 
vextractf128	xmm7,ymm6,1 
vinsertf128		ymm6,ymm6,xmm7,1  
vmovupd			ymmword ptr [rax+8],ymm6 

7.0.400-preview.23225.8 (Release):

vmovupd			ymm0, ymmword ptr[reloc @RWD00]
vmovupd			ymm1, ymmword ptr[reloc @RWD32]
vpermd			ymm1,ymm1,ymmword ptr [rcx]  
vpand			ymm1,ymm0,ymm1  
vxorps			ymm2,ymm2,ymm2  
vpandn			ymm0,ymm0,ymm2  
vpor			ymm0,ymm1,ymm0  
vmovupd			ymmword ptr [rax+8],ymm0  

8.0.100-preview.4.23260.5 (Debug/Release):

vmovups			ymm0, ymmword ptr [reloc @RWD00]
vpermd			ymm0, ymm0, ymmword ptr [rax]
vpand			ymm0, ymm0, ymmword ptr [reloc @RWD32]
vmovups			ymmword ptr [rcx], ymm0

@EgorBo EgorBo added the bug label May 18, 2023
@EgorBo EgorBo added this to the 8.0.0 milestone May 18, 2023
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label May 18, 2023
@hypeartist
Copy link

hypeartist commented May 18, 2023

@EgorBo Annotating method containing the code above with [MethodImpl(MethodImplOptions.NoInlining)] breaks 7.0.x Debug results too

	[MethodImpl(MethodImplOptions.NoInlining)]
	private static Vector256<uint> Sh()
	{
		Span<uint> a = stackalloc uint[8] { 0, 21, 2063, 2066, 3095, 0, 0, 0 };
		var d = Vector256.Shuffle(
			Vector256.LoadUnsafe(ref a[0]),
			Vector256.Create(8u, 0, 1, 2, 3, 4, 5, 6)
		);
		return d;
	}

@EgorBo
Copy link
Member

EgorBo commented May 18, 2023

The problem that Vector256.Shuffle emits an unnecessary vpand here in gtNewSimdShuffleNode when it's expected to just do Avx2.PermuteVar8x32

@ayende
Copy link
Contributor Author

ayende commented May 18, 2023

As long as we are talking about it, I was expecting that it would do something like:

var shiftedPrev = Vector256.Shuffle(prev, Vector256.Create(7u, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff));

Translated into:

var shiftedPrev = Avx2.PermuteVar8x32(prev, Vector256.Create(7u, 0, 0, 0, 0, 0, 0, 0)) &
            Vector256.Create(uint.MaxValue, 0, 0, 0, 0, 0, 0, 0);

@tannergooding
Copy link
Member

This is a duplicate of #85132

There was a relatively simple issue with the codegen for out of bounds indices that needs to be backported to .NET 7 (we were only copying the lower 128-bits of the zeroing mask)

The problem that Vector256.Shuffle emits an unnecessary vpand here in gtNewSimdShuffleNode when it's expected to just do Avx2.PermuteVar8x32

The and is expected and necessary so that out of bounds indices zero out the corresponding element in the result.

As long as we are talking about it, I was expecting that it would do something like:

There are a few optimizations for Shuffle that got missed in .NET 7 due to the support coming in so late. They are on my backlog to still complete/improve for .NET 8

@EgorBo
Copy link
Member

EgorBo commented May 18, 2023

The and is expected and necessary so that out of bounds indices zero out the corresponding element in the result.

why? the indexes are compile-time known, aren't they?

@tannergooding
Copy link
Member

why? the indexes are compile-time known, aren't they?

Because out of bounds indices are documented as zeroing the corresponding result element; the underlying permute instruction doesn't do any zeroing, that's only the behavior of pshufb (and only if the most significant bit is set). Other permute instructions mask the index to always be "in range" instead.

@ayende
Copy link
Contributor Author

ayende commented May 18, 2023

My intent was to actually do just that, zero the unimportant elements, by the way

@ghost ghost locked as resolved and limited conversation to collaborators Jun 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Projects
None yet
Development

No branches or pull requests

4 participants