-
Notifications
You must be signed in to change notification settings - Fork 13.4k
incorrect vector codegen #42148
New issue
Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? No Sign in to your account
Comments
By "bad" do you mean incorrect or not optimal? |
Incorrect. The middle element is false, and the program panics. |
Here is a version that can be built with current zig (rather than my branch), and reproduces the problem, but its more difficult to follow, so I don't recommend looking at it. https://godbolt.org/z/o6lm3Z |
I reproduced it, and reduced the test case on ARM: pub export fn start() void { ======================= arget datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128" %"[]u8" = type { i8*, i64 } @0 = internal unnamed_addr constant [3 x i1] [i1 false, i1 true, i1 false], align 1 ; Function Attrs: nobuiltin noreturn nounwind WhileCond: ; preds = %WhileCond, %Entry ; Function Attrs: nobuiltin nounwind Then: ; preds = %Entry Else: ; preds = %Entry EndIf: ; preds = %Else ; Function Attrs: argmemonly nounwind attributes #0 = { nobuiltin noreturn nounwind "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" } =========
panic: // @panic
|
Do you know if this reproduces with trunk? |
I reproduced it with 9.0.0-svn362869-1~exp1 in Debian experimental on arm64. The generated assembly appeared to be identical. |
Are you compiling with sse disabled? That's the only way I could get the assembly you were showing. The issue seems to be a bug in the handling of vector loads that don't fit exactly into some number of bytes. We get confused and end up treating all 3 bits as being in bit 0 of the same byte. |
Zig uses -march=native and this laptop supports AVX-256. I also reproduced it on aarch64. |
Ignore that. These are all with optimizations disabled, -O0. I could not reproduce with optimizations enabled. |
Target-independent legalization for load and store operations on vector types with non-byte-sized element types has been a known problem in the past... it wasn't clear how we wanted to represent them. We've settled on tightly packing them (so <3 x i1> is one byte), and we fixed most of the legalization code to correctly honor that. But maybe there's something that still isn't fixed. I can't reproduce the issue with the AArch64 testcase; can you give complete steps for building an executable that reproduces the issue? |
Eli, on X86 at least with sse disabled. What I see happened in that the v3i1 store needs to be widened to v4i1 during type legalization. The call to FindMemType in the load widening code ends up returning i1 as the type because no other types evenly divide into WidenWidth which is 4. This causes the code to emit 3 identical i1 loads since the pointer doesn't increment. These of course get CSEd to the same node. This is then assembled together to make a v3i1 vector. Since the loads are just i1 the element offset within in the byte is now lost. |
Oops that should have said "v3i1 load needs to be widened". v3i1 stores already scalarize, but not loads. |
DAGTypeLegalizer::WidenVecOp_STORE has a special case for non-byte-sized stores. (See https://reviews.llvm.org/D42100 .) Looks like we never applied an equivalent fix to DAGTypeLegalizer::WidenVecRes_LOAD. |
This is still an open downstream bug with LLVM 10 rc1. ziglang/zig#3246 (comment) |
Candidate patch https://reviews.llvm.org/D74590 Though I tend to think that zig should not be generating vectors of i1 in memory. |
All the vector comparison operators return vectors of i1. On x86 this ends up as a mask registers, and on PPC or ARM this ends up as a i1 sign-extended to the vector width of the operators. |
"should not", meaning that you're going to get terrible quality code, and likely always will. SIMD instruction sets tend to have wildly different ways of representing compare results depending on the target and the type of the compare operands. |
LLVM's icmp instruction when used on vectors returns . Why design the instructions this way if the return type is not recommended? |
Making icmp return is semantically meaningful. And it gives the backend the flexibility to easily do the right thing across all the targets LLVM supports if the result is used in a select, or sign-extended, or something like that. The point at which code generation becomes tricky is when you try to store the vector to memory. |
Ah, I see, thanks. Zig is storing vectors of i1 in memory only in the same way that clang stores function parameters in memory - in order to emit debug info, and so that programmers can name intermediate values. It's fully expected that mem2reg would remove these. I'll double check that this is happening. Thanks |
Maybe fixed after a5fa778. I only checked the output I didn't run anything. |
Extended Description
Using llvm 8 (Ubuntu Disco, 8.0.0-3 (tags/RELEASE_800/final)) The following zig program:
pub export fn start() void {
var a: [3]bool = []bool{ false, true, false};
var x2: @Vector(3, bool) = a;
if (x2[1] != true) unreachable;
}
const builtin = @import("builtin");
pub fn panic(msg: []const u8, error_return_trace: ?*builtin.StackTrace) noreturn {
while (true) {}
}
compiles to this llvm-ir (full file included)
@0 = internal unnamed_addr constant [3 x i1] [i1 false, i1 true, i1 false], align 1
; Function Attrs: nobuiltin nounwind
define void @_start() #2 !dbg !136 {
Entry:
%a = alloca [3 x i1], align 1
%x2 = alloca <3 x i1>, align 4
%0 = bitcast [3 x i1]* %a to i8*, !dbg !147
call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %0, i8* align 1 bitcast ([3 x i1]* @0 to i8*), i64 3, i1 false), !dbg !147
call void @llvm.dbg.declare(metadata [3 x i1]* %a, metadata !140, metadata !DIExpression()), !dbg !147
%1 = load [3 x i1], [3 x i1]* %a, !dbg !148
%vector_to_array = extractvalue [3 x i1] %1, 0, !dbg !148
%2 = insertelement <3 x i1> undef, i1 %vector_to_array, i32 0, !dbg !148
%vector_to_array1 = extractvalue [3 x i1] %1, 1, !dbg !148
%3 = insertelement <3 x i1> %2, i1 %vector_to_array1, i32 1, !dbg !148
%vector_to_array2 = extractvalue [3 x i1] %1, 2, !dbg !148
%4 = insertelement <3 x i1> %3, i1 %vector_to_array2, i32 2, !dbg !148
store <3 x i1> %4, <3 x i1>* %x2, align 4, !dbg !149
call void @llvm.dbg.declare(metadata <3 x i1>* %x2, metadata !145, metadata !DIExpression()), !dbg !149
%5 = load <3 x i1>, <3 x i1>* %x2, align 4, !dbg !150
%6 = extractelement <3 x i1> %5, i32 1, !dbg !150
%7 = icmp ne i1 %6, true, !dbg !151
br i1 %7, label %Then, label %Else, !dbg !152
Then: ; preds = %Entry
tail call fastcc void @panic(%"[]u8"* @2, %builtin.StackTrace* null), !dbg !153
unreachable, !dbg !153
Else: ; preds = %Entry
br label %EndIf, !dbg !152
EndIf: ; preds = %Else
ret void, !dbg !154
}
Which produces the following x86_64 (disassembled)
Dump of assembler code for function _start:
0x0000000000201010 <+0>: push %rbp
0x0000000000201011 <+1>: mov %rsp,%rbp
=> 0x0000000000201014 <+4>: sub $0x10,%rsp
0x0000000000201018 <+8>: mov -0xef8(%rip),%al # 0x200126 <__unnamed_1+2>
0x000000000020101e <+14>: mov %al,-0x2(%rbp)
0x0000000000201021 <+17>: mov -0xf04(%rip),%cx # 0x200124 <__unnamed_1>
0x0000000000201028 <+24>: mov %cx,-0x4(%rbp)
0x000000000020102c <+28>: mov -0x2(%rbp),%al
0x000000000020102f <+31>: mov -0x4(%rbp),%dl
0x0000000000201032 <+34>: mov -0x3(%rbp),%sil
0x0000000000201036 <+38>: add %sil,%sil
0x0000000000201039 <+41>: or %sil,%dl
0x000000000020103c <+44>: shl $0x2,%al
0x000000000020103f <+47>: or %al,%dl
0x0000000000201041 <+49>: and $0x7,%dl
0x0000000000201044 <+52>: mov %dl,-0xc(%rbp)
0x0000000000201047 <+55>: mov -0xc(%rbp),%al
0x000000000020104a <+58>: mov %al,-0x8(%rbp)
0x000000000020104d <+61>: mov -0x8(%rbp),%al
0x0000000000201050 <+64>: and $0x1,%al
0x0000000000201052 <+66>: neg %al
0x0000000000201054 <+68>: test $0x1,%al
0x0000000000201056 <+70>: jne 0x20106d <_start+93>
0x0000000000201058 <+72>: jmp 0x20105a <_start+74>
0x000000000020105a <+74>: xor %eax,%eax
0x000000000020105c <+76>: mov %eax,%esi
0x000000000020105e <+78>: movabs $0x200140,%rdi
0x0000000000201068 <+88>: callq 0x201000
0x000000000020106d <+93>: jmp 0x20106f <_start+95>
0x000000000020106f <+95>: add $0x10,%rsp
0x0000000000201073 <+99>: pop %rbp
0x0000000000201074 <+100>: retq
The text was updated successfully, but these errors were encountered: