00:00 karolherbst: so far only not equal is
00:01 karolherbst: ehh
00:08 karolherbst: sooo. let's test
00:09 anholt: pushed some new bugfixes to https://crates.io/crates/gpu-trace-perf
00:09 karolherbst: airlied: ehh.. nir_validate is also a bit... well... :D
00:13 karolherbst: jenatali: 1: divide fp32 ................Wimp pass 2.00 @ {0x1.ee8cc8p-56, -0x1.eed3bp+4}
00:13 karolherbst: :)
00:13 jenatali: Nice :)
00:14 karolherbst: jenatali, jekstrand: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/c21d90832f6dae5ee67caa3c589dc9af56e0624f
00:15 karolherbst: you really want to predicate that stuff though I guess?
00:15 karolherbst: although fmul is quite cheap compared to the fdiv + frcp anyway?
00:15 karolherbst: uhm
00:15 karolherbst: frcp
00:15 karolherbst: dunno
00:15 jenatali: That's not too bad actually
00:16 karolherbst: yeah..
00:16 karolherbst: I'd like to have a more generic approach though
00:16 jenatali: karolherbst: Does that fix any other tests for you?
00:16 karolherbst: because this scaling stuff is quite common
00:17 karolherbst: jenatali: I check fdiv in non wimpy right now
00:17 jenatali: I'm curious how much is depending on the accuracy of fdiv
00:17 karolherbst: most of them? :D
00:17 jenatali: That'd be nice :)
00:18 karolherbst: also, we should get rid of those SpvCapabilityLinkage, SpvCapabilityFloat16Buffer warnings
00:18 karolherbst: jenatali: " 1: divide fp32 ................passed 2.00 @ {0x1.ee8cc8p-56, -0x1.eed3bp+4}" \o/
00:18 jenatali: karolherbst: I've got a series I need to port over still for supporting the vload/vstore_half opcodes which removes the Float16Buffer one
00:18 jenatali: But yeah... I guess we do kinda support linkage in that we can consume libclc
00:18 jenatali: We don't really support it outside of that though
00:18 karolherbst: ehhh...
00:19 karolherbst: now it fails differently
00:19 karolherbst: but it does look better overall
00:20 jenatali: I'll take that as a win
00:20 karolherbst: "ERROR: acos16: -nan ulp error at 0x0p+0 (0x00000000): *0x1.921fb6p+0 vs. -nan" mhh
00:20 jenatali: Only for 16-component vectors?
00:20 karolherbst: but I think something's up with vec16
00:20 jenatali: Sounds like you've got some hardcoded 4s you need to find
00:21 jenatali: I think I got everything along the paths that we're hitting
00:21 karolherbst: jenatali: -1 -w https://gist.githubusercontent.com/karolherbst/7be503b48ce8f006158f1b5bff7d7699/raw/e026756af34d8ccdfbcdbb37b712d45d9f611ccd/gistfile1.txt
00:21 jenatali: Not bad
00:21 karolherbst: isfinite and the others crash because of bool lowering
00:22 karolherbst: we need a solution for that
00:22 karolherbst: jenatali: yeah.. up to vec4 it looks fine
00:22 karolherbst: vec8 is also worse
00:23 jenatali: Interesting
00:23 karolherbst: do you still have some vec fixes?
00:23 jenatali: Considering how many bugs I already found in vec8/vec16, I'm not surprised
00:23 jenatali: I don't think so
00:23 karolherbst: I'll run it non wimpy just to rule it out though
00:24 karolherbst: jenatali: anyway.. would be cool if one of you could fix that isfinite and co with bool lowering.. wither we just always lower in vtn or... well.. handle them :p
00:24 jenatali: I can try to take a look. Not sure I understand what's going on though yet
00:28 daniels: cwabbott: re Marge - she does Piglit as well, and it's one MR at a time, regardless of which tree. I think your timeout was due to not needing any rebase or commit message rewrite, which leads to the pipeline never being triggered
00:29 karolherbst: jenatali: dest bit_size switches over to 32
00:29 karolherbst: but the opcode is bool1
00:29 karolherbst: so it fails to validate
00:29 jenatali: karolherbst: Hm, probably my bad
00:29 karolherbst: mhh "ERROR: acos4: -4.061432 ulp error at 0x1.2f955ep-1 (0x3f17caaf): *0x1.df47fp-1 vs. 0x1.df47e8p-1"
00:29 jenatali: Wonder how this works for us... maybe I just haven't run those tests with validation enabled...
00:30 karolherbst: ehhh.. I have some weirdo nouveau bug
00:30 karolherbst: if the test runs to long the hw starts rejecting command submissions... mhhh
00:30 karolherbst: anyway.. it looks much better with the fdiv fix
00:30 karolherbst: I just expect either more of that or something else
00:31 jenatali: :)
00:31 karolherbst: and it needs a fix for fp64 I guess....
00:31 karolherbst: *sighs*
00:31 jenatali: Just don't expose fp64? :)
00:31 jenatali: (That's our solution for now)
00:31 karolherbst: ehh...
00:31 karolherbst: fp64 is slow anyway except on 10k€ GPUs your customers will all use :p
00:38 karolherbst: oh well..
00:38 karolherbst: I will continue tomorrow I think :)
00:38 karolherbst: maybe I figure out the other fails
00:39 karolherbst: heh..
00:39 karolherbst: pow fails
00:39 karolherbst: not fails
00:39 karolherbst: rsqrt as well
00:40 karolherbst: sqrt
00:40 karolherbst: those are all quite basic
00:40 karolherbst: fmod
00:40 karolherbst: *sigh*
00:43 jenatali: karolherbst: Do you set the lower_fmod?
00:43 jenatali: There's a libclc implementation for that
03:02 jekstrand: karolherbst: NIR has an fdiv opcode. It's just usually lowered.
03:04 jekstrand: karolherbst: Intel even has an fdiv, I think.
03:04 jekstrand: karolherbst: Maybe we need an fdiv_for_real opcode
03:05 imirkin: or just selective lowering?
03:11 jekstrand: I think the lowering already is selective, we just lower to mul+rcp because its just as fast and leaves more CSE opportunities.
03:11 jekstrand: But we could totally turn that off for kernels
03:12 jenatali: jekstrand: Yeah the lowering is definitely selective
03:12 jenatali: We're using the fdiv opcode in DXIL, which is why WARP is able to pass the CL tests
03:12 jekstrand: right
03:12 jekstrand: Like I said, I'm very sure we have fdiv in hardware, it's just not a good idea when fastmath is allowed
03:14 imirkin: jekstrand: yeah, makes sense.
03:16 jekstrand: It'd probably take me all of 20 minutes to wire it through the back-end
03:25 jenatali: jekstrand: In addition to the comment you just left, there's still one outstanding comment on the images MR regarding address mode - thoughts?
03:25 jekstrand: jenatali: What are the rules about images in CL? Are you allowed to do crazy things like stash them in variables?
03:25 jenatali: jekstrand: Very explicitly no, let me find it
03:26 jenatali: https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#restrictions b)
03:28 jekstrand: jenatali: Given that giant pile of restrictions, I think it should be fine.
03:28 jekstrand: jenatali: Have you seen it cause problems?
03:28 jenatali: No, optimizations end up coalescing all the derefs
03:29 jekstrand: Given the restrictions, I think that's guaranteed to always happen.
03:29 jenatali: Sorry, to be clear, I have seen the logical format cause problems
03:29 jenatali: IIRC it ends up causing nir validation to blow up
03:29 jenatali: Let me try it again, it's been a while
03:30 jekstrand: I could imagine it causing problems with the way that we're whacking all pointers to 64-bit
03:31 jekstrand: Specifically, if something gets confused as to how many bits your image deref is supposed to be
03:31 jekstrand: But that should be solvable.
03:31 jenatali: I expect we could pick any address mode that has a physical pointer type and it'd be fine
03:31 jenatali: But yeah, let me double check exactly where it falls over, building now
03:31 jekstrand: Or make a logical64
03:32 jekstrand: I kind-of want to move the address formats from spirv_options to nir_shader
03:33 jenatali: jekstrand: Handling of SpvOpSampledImage, trying to create a vec2 of a 32bit deref and a 64bit deref
03:33 jekstrand: Then nir_builder could always do the right thing and we wouldn't need that weird bitsize hack.
03:33 jekstrand: jenatali: We just need both samplers and images to use the same format
03:34 jenatali: The images are either function_temp or shader_in (don't recall) since they're function parameters
03:34 jekstrand: They should be uniform too
03:35 jenatali: That gets tricky, because you need them to be in sequence with the rest of the kernel args
03:35 jenatali: So that you can deal with clSetKernelArg(n) and know that it's an image
03:35 jekstrand: Ugh
03:35 jekstrand: right
03:35 jekstrand: Honestly, I kind-of wish we'd made kernel args uniforms
03:36 jekstrand: I think that maps better
03:36 mareko: jekstrand: does intel use the khronos-internal glcts anymore or is the public version enough?
03:36 jenatali: Yeah... it kind of does actually
03:36 jekstrand: mareko: I don't think we use internal much
03:36 jenatali: jekstrand: That'd be a good cleanup to apply, but not as part of this series :)
03:36 jekstrand: mareko: We might have a few hudred tests we run from it.
03:36 jekstrand: jenatali: Then maybe make them inputs?
03:37 jekstrand: I guess that might cause problems though
03:37 jekstrand: with the counting
03:37 jenatali: Yeah... that could work, as long as the backend removes them from the inputs list before counting
03:38 jenatali: Er, by backend, I guess I mean API layer (Clover/CLOn12 compiler)
03:38 jekstrand: jenatali: nir_variable::location is an integer; just make it -1 for constant samplers.
03:39 jenatali: jekstrand: Good point, there's a bit there that we can just use to exclude them from counting
03:41 jekstrand: That's even the documented use:
03:41 jekstrand: * If the variable is a uniform, shader input, or shader output, and the
03:41 jekstrand: * slot has not been assigned, the value will be -1.
03:41 jenatali: jekstrand: Some details are coming back to me... inside of the actual kernel, the SPIR-V storage class is UniformConstant for images. Currently the kernel wrapper creates variables as nir_var_shader_in, and derefs them to pass to the kernel. That works as long as the pointer size is the same
03:41 jekstrand: roughly
03:42 jekstrand: Right.... That makes sense.
03:42 jenatali: I can just grab whatever the vtn type is for UniformConstant and use that
03:42 jekstrand: Another option is that we can make images nir_var_uniform and other things nir_var_shader_in for now and, as long as we have a unified location space, we're fine.
03:42 jekstrand: We can move inputs over to uniform as a cleanup
03:42 jenatali: IIRC the wrapper doesn't set location
03:42 jenatali: It probably should though
03:42 jekstrand: It very much does. I'm looking at it right now. :)
03:43 jenatali: Ooh it does
03:43 jenatali: ... why aren't we using that?
03:43 jekstrand: I don't have any idea. What are you using?
03:43 jenatali: I think re-counting based on the inputs list
03:44 jenatali: Before your variable list merge, at least
03:44 jekstrand: Oh, that's a very bad plan. Using locations is a much better plan. :-)
03:44 jenatali: I agree
03:44 jenatali:makes note for tomorrow
03:46 jenatali: Yeah, i++ rather than using location. That's a shame
03:53 airlied: mareko: you still need the kc-cts pieces for conformance
03:55 airlied: there is no internal cts anymore, there is the cts and there is the kc-cts plugin which you get separate
03:55 airlied: gles3.2 conformance no longer needs kc-cts apparently
04:01 jenatali: jekstrand: I'm pretty sure things are working through luck only for me right now. I'm passing image pointers as function args via nir_build_deref_var, which just builds a pointer-sized arg (64bit) despite the fact that the vtn address mode for the variable should be logical (32bit)
04:02 jenatali: I think I need to put some more effort into the variable modes used for this stuff
04:02 jekstrand: jenatali: :-/
04:04 jenatali: Ah well, I'll have better code at the end of the day :)
04:04 jekstrand: Yeah, doing things more properly is usually worth it eventually
04:09 jenatali: Amazing how such a small comment can lead to such a large re-work. Ah well. Such is life
04:09 jenatali: Anyway, that's a problem for future me, for now it's bedtime
04:12 jekstrand: :)
04:24 jekstrand:is very confused by our v_load_store implementation
04:24 airlied:goes to make sure I didn't write it
04:26 airlied: yah I did partly :-P
04:26 jekstrand: It goes out of its way to split the load/store into components. Why?
04:26 airlied: alignment
04:27 jekstrand: It should literally be ((gentypen *)p)[offset]
04:27 airlied: you can do crappy things
04:27 jekstrand: It's required to be component-aligned
04:27 jekstrand: Good enough for us
04:27 airlied: no it isn't
04:27 airlied: you can load a vec3 from strange places
04:27 airlied: revert 5471ef7532a9fda81f69ebefde3805028a1850d7
04:27 airlied: and see :)
04:27 jekstrand: "The computed address must be 8-bit aligned if gentype is char or uchar; 16-bit aligned if gentype is short or ushort; 32-bit aligned if gentype is int, uint, or float; and 64-bit aligned if gentype is long or ulong."
04:28 airlied: yeah so it was vectors it had issues with
04:28 airlied: it's not required to be vector aligned
04:28 jekstrand: Sure, that's fine
04:28 jekstrand: Our back-end will deal
04:28 airlied: it wasn't and it didn't :)
04:28 jekstrand: You're sure you hit this problem with iris?
04:28 jenatali: jekstrand: Everything else in CL (except packed structs...) requires things to be vector-aligned for CL
04:29 jekstrand: So say it loads a u16vec4 and has an alignment of 2. If the back-end needs that to be split, it can split it.
04:30 jenatali: And until bbrezillon's packed struct series, alignment isn't propagated from SPIR-V, it's assumed based on the type being accessed
04:30 airlied: my memory doesn't say iris or llvmpipe
04:30 airlied: but it's a trivial recvert I expect, and the piglit tests hit it
04:30 airlied: so I'm sure CTS does
04:30 jekstrand: jenatali: Right... Yeah, we need to fix alignment propagataion.
04:30 jenatali: jekstrand: Help land that series ;)
04:30 jekstrand: jenatali: It's on my list. I'm reviewing as fast as I can.
04:30 jenatali: :P Yeah, I know
04:31 jekstrand:feels like he's done nothing this week but review other people's patches.
04:31 jenatali: But look how good CL is getting!
04:31 jekstrand: Oh, I know. It's 100% worth it.
04:31 airlied: like when I commited that vload fix it was likely we'd undo it when everything was fixed
04:32 jekstrand: sure
04:32 jekstrand:is very confused why the vstore tests are failing
04:44 jekstrand: Totally an iris bug!
04:47 airlied: bleh it's not a simple clean revert
04:49 jekstrand: It helps if you re-upload your kernel inputs every dispatch instead of assuming they're probably the same as the previous ones.
04:51 jekstrand: Pass 94 Fails 5 Crashes 3 Timeouts 4
04:51 jekstrand: Still not quite as good as nouveau
04:59 airlied: nice
05:00 jekstrand: Turns out load of a u64vec16 from scratch didn't quite work. :)
05:00 jekstrand: That takes 32 loads; the array had length 8. :)
05:01 airlied: I'd be interested if try and revert the vload/store to see if it been fixed
05:01 jekstrand: airlied: I think we need to land some patches from bbrezillon first
05:01 airlied: jekstrand: ah cool
05:01 jekstrand: airlied: I think that's on my review queue for tomorrow.
05:03 jekstrand: I think most of my remaining crash/fail in the basic tests will be fixed once karolherbst reworks workgroup system values.
05:04 jekstrand: \o/
07:49 pq: melissawen, did you know about "pixel blend mode" property in https://www.kernel.org/doc/html/latest/gpu/drm-kms.html#plane-composition-properties btw.?
08:36 cwabbott: karolherbst: jekstrand: jenatali: btw, on amd mareko made nir_op_ffma map to mul+add on hw before gfx10, because apparently that's more efficient
08:36 cwabbott: so nir_op_ffma already effectively has the Vulkan/OpenGL/D3D semantics where it may or may not be actually fused
08:37 cwabbott: I think I'd okay with renaming it to nir_op_fmad and adding a "no, really, this is actually fma" opcode for OpenCL
08:40 cwabbott: or maybe we could add an option and then make make glsl_to_nir and vtn map it to mul+add if the option is true, and then amd sets that option and stops mapping fma to mul+add in the backend
09:23 tanty: tomeu, I'm trying to make my mind about how to approach uploading images to minio
09:23 tanty: whether to just have a python script to help with that in mesa/.gitlab-ci
09:23 tanty: or actually add a couple of commands to the ci-fairy
09:23 tomeu: tanty: yep, that's what I was thinking
09:24 tanty: opinion?
09:24 tomeu: well, that sounds good as well :)
09:24 tanty: ci-fairy -> upload random file, not just images, obviously
09:24 tomeu: hopefully we can use the same script to upload artifacts from other jobs such as deqp, piglit, etc
09:24 tanty: also, I'm quite newby to minio and our fdo instance
09:25 tanty: how could I test against it?
09:25 tanty: or should I get some local minio and punch it
09:25 tanty: ?
09:26 tanty: test against it -> basically, how to get credentials (?)
09:26 tomeu: easiest is to just test by pushing to gitlab
09:27 tanty: does it need to be a mesa/mesa job or is it ok from any personal/project place?
09:27 tomeu: otherwise you can indeed install your own minio instance (fahien is doing that), or bug daniels to give you a longer running token
09:27 tomeu: failed images can be uploaded from your fork
09:27 tanty: OK
09:28 tanty: I'll try with that by now
09:28 tomeu: but reference images need to be uploaded from mesa/mesa indeed :/
09:28 tanty: thanks!
09:28 tomeu: but the only difference should be in the bucket name, so if it works for failed images, it should also work for reference?
09:28 tomeu: yw!
09:28 karolherbst: jenatali: I suspect that even with dxils fdiv it doesn't work on all devices
09:30 karolherbst: cwabbott: CL allows ffma or fmad to be used for optimizing fmul and fadd
09:30 karolherbst: but there is also a flag to disallow that...
09:30 karolherbst: it's a bit annoying
09:31 cwabbott: karolherbst: sure, I guess implementing that would mean splitting up "exact" into "can fuse" and "assume no NaN's" or something like that
09:32 karolherbst: mhh.. I have an idea
09:32 cwabbott: but does it have a fma() function which is guaranteed to be fused?
09:32 karolherbst: yes
09:32 karolherbst: as does glsl
09:32 cwabbott: no, glsl doesn't
09:32 karolherbst: ohh..
09:32 karolherbst: right..
09:32 karolherbst: it's a bit weaker if you put the precise modifier on it
09:33 karolherbst: and totall weak if you don't
09:33 karolherbst: *totally
09:33 cwabbott: right, precise just means "be consistent please"
09:33 karolherbst: but yeah, CL has a precise one
09:33 karolherbst: but we have libclc lowering for that
09:34 cwabbott: yeah, you'll probably want a separate nir opcode for targets that can actually do fma
09:34 karolherbst: yeah...
09:34 cwabbott: even on amd they support fma, it's just slower until gfx10
09:34 karolherbst: I htink that would improve nirs optimization stuff actually
09:34 karolherbst: so the driver could tell nir what they got: fmad or ffma or both
09:35 karolherbst: and nir can just optimize accordingly
09:35 karolherbst: so instead of not merging fmul and fadd on precise ones, we could optimize it to fmad if the driver sets has_fmad
09:35 cwabbott: I think the first step would be s/ffma/fmad/ in mesa
09:36 karolherbst: no
09:36 karolherbst: ffma is really more like ffma
09:36 cwabbott: why?
09:36 cwabbott: ffma *in nir* is currently actually fmad
09:36 karolherbst: why is that?
09:36 karolherbst: if precise is set, we don't merge fmul and fadd to ffma
09:37 karolherbst: even though if ffma would be fmad that would be legal
09:37 cwabbott: because of how amd implements it, and how glsl and vulkan spir-v specify it
09:37 cwabbott: it de-facto is fmad
09:37 karolherbst: in nouveau I treat ffma as ffma
09:37 karolherbst: and that works perfectly
09:37 cwabbott: and that's perfectly fine
09:38 cwabbott: but ffma is a valid lowering for fmad
09:38 karolherbst: sure
09:38 karolherbst: but also in CL
09:38 karolherbst: except you specify you don't want that
09:38 karolherbst: and we treat ffma as ffma and don't lower fmad to it
09:39 cwabbott: what do you mean by "don't lower fmad to it"? there is no opcode called fmad in NIR currently
09:40 cwabbott: there is ffma, which is treated by AMD as fmad
09:40 karolherbst: I meant merging fmul and fadd
09:40 karolherbst: sure, but renaming it, fixing all opts just to readd the same ffma opts again doesn't make much sense to me ;p
09:40 cwabbott: would you actually have re-add all the opts?
09:40 karolherbst: as we do it today makes perfectly sense if ffma is ffma
09:40 karolherbst: because they are correct
09:41 karolherbst: they are not if you treat ffma as fmad
09:41 karolherbst: well "correct" would be the wrong term here
09:41 karolherbst: but we would have to change some stuff
09:41 cwabbott: the thing is, amd really wants an fmad opcode
09:41 karolherbst: we wouldn't have to change a thing if we treat it as ffma
09:41 karolherbst: sure
09:41 karolherbst: and nouveau as well
09:41 karolherbst: but only for older hw
09:41 cwabbott: currently we treat ffma as fmad, and everything works
09:41 karolherbst: right
09:41 karolherbst: but
09:42 karolherbst: you could allow more optimizations
09:42 karolherbst: that precise thing means nothing if it's all fmad
09:42 karolherbst: and that's more or less my point
09:42 karolherbst: so if you rename it, you would remove the precise flags on all fmad related opts
09:42 karolherbst: then add ffmad and readd the original opts with those flags back
09:43 karolherbst: just easer to add fmad and add relaxed optimizations :)
09:44 cwabbott: I still don't see what you mean... if we added a fmad opcode to nir, we'd want to treat it exactly the same way as we do ffma now, with the same optimizations etc.
09:44 karolherbst: no
09:44 cwabbott: ?
09:45 karolherbst: example
09:45 karolherbst: (('~ffma', 0.0, a, b), b), is correct, but if ffma would be fmad, that could be (('ffma', 0.0, a, b), b),
09:45 karolherbst: without causing issues
09:45 karolherbst: but the reason we do not means that ffma is not just the unfused thing
09:46 karolherbst: well.. ffma is really jsut a "fmad _or_ ffma" both at once
09:46 karolherbst: and we just assume it's the wrong one in those precise optimizations
09:47 karolherbst: but because it's about optimizations and usually about fmul and/or fadd treating it as ffma makes more sense as with ffma, that ~ modifier has to stay
09:47 karolherbst: where for fmad it can go away
09:47 cwabbott: but when you say "fmad _or_ ffma" both at once, that's really what fmad is
09:47 cwabbott: something that could be fused or could not be
09:47 karolherbst: I wouldn't treat fmad as fused
09:47 karolherbst: if we have both opcodes we really should make it a clear cut
09:48 cwabbott: but that's what every other frontend other than OpenCL wants
09:48 cwabbott: they all want something that could be or could not be fused
09:48 karolherbst: well.. even CL wants that I think
09:48 karolherbst: just not always
09:49 karolherbst: I just think we can make better decissions if we don't have a "can be both" optimization
09:49 karolherbst: uhm
09:49 karolherbst: opcopde
09:49 cwabbott: no, I think "can be both" can result in better optimizations
09:49 karolherbst: we could have (('~ffma', a, b, 0.0), ('fmul', a, b)) and (('fmad', a, b, 0.0), ('fmul', a, b))
09:49 karolherbst: that's not possible if fmad can be both
09:50 cwabbott: (('fmad', a, b, 0.0), ('fmul', a, b)) still isn't right
09:50 cwabbott: because of the signedness of 0
09:50 cwabbott: like like how ('fadd' a, 0) -> a has to be inexact
09:50 karolherbst: mhhhh
09:51 karolherbst: right..
09:51 karolherbst: we can also go for the classical one
09:51 karolherbst: (('~fadd', ('fmul', a, b), c), ('ffma', a, b, c), 'options->fuse_ffma'), but for fmad we could just use (('fadd', ('fmul', a, b), c), ('fmad', a, b, c), 'options->fuse_fmad')
09:52 karolherbst: it's just more work in the end I think
09:52 cwabbott: I guess there's kinda room for both
09:53 karolherbst: I think being more explicit has its benefit as then it's super clear what happens
09:53 karolherbst: and you can always duplicate passes handling it for ffma and fmad
09:53 karolherbst: it's just more work
09:54 cwabbott: if you have an opcode that can be both, then you can decompose it into fmul + fadd when inexact, even when the backend implements it as fma -- once you lower to ffma or (definitely-not-fused) fmad then you've lost that information
09:54 karolherbst: that's why we need a usa_fmad and use_ffma flag and have both opts
09:55 cwabbott: although I can't think of any optimizations off the top of my head that would actually benefit from that
09:55 karolherbst: the idea is, if you have both and you define them precisely, you never loose information
09:55 karolherbst: fmad was always fmul + fadd originally (or the API has an opcode for that)
09:55 cwabbott: you do though... you lose the information that it could be either
09:56 cwabbott: you're narrowing the semantics
09:56 karolherbst: but that's up for the backend to decide if they care or not
09:56 cwabbott: and narrowing the semantics always results in transforms that are no longer valid
09:57 karolherbst: if a backend specifices use_fmad and use_ffma, they just deal with everything you throw at them and you won't miss optimization oppotrunities
09:57 karolherbst: and they say use_ffma but not use_fmad, you still optimize to ffma for the non precise case
09:57 karolherbst: you just can't do it for precise
09:57 cwabbott: no, that's not what I'm talking about
09:58 karolherbst: I know what you mean, I just think we can resolve that by the backend telling nir what it wants and cares about
09:58 cwabbott: no, you can't
09:59 karolherbst: I still don't see an example where it would actually hurt
10:00 cwabbott: yeah, now that I'm thinking about it - an inexact ffma would be the same as an inexact could-be-either opcode, right?
10:00 karolherbst: yep
10:02 cwabbott: so we'd introduce fmad, have a late optimization that does (('fadd' ('fmul' a, b) c), ('fmad' a, b, c), 'use_fmad') and (('~ffma' a, b, c), ('fmad', a, b, c), 'use_fmad')
10:02 cwabbott: and amd would set use_fmad on pre-gfx10
10:02 karolherbst: yeah, I guess you could do that
10:03 cwabbott: I think I'd avoid having yet another equivalent thing in the core
10:03 cwabbott: like how we also create fsub late
10:07 karolherbst: yeah.. I mean, I just see it as the pain you get for having both and I am sure every solution has a valid argumentf or why it's a bad one :/
10:21 melissawen: pq, hey, I didn't know that. I read it now quickly, and it seems very useful for that patch.
10:22 melissawen: I'll need to take a closer look at the doc to understand it well.
10:22 melissawen: thank you :)
10:23 pq: melissawen, the important bit however is, what is the blend mode when that property does not exist. I would assume *that* is never documented. ;-)
10:23 pq: but I suppose IGT should get it right
10:24 pq: btw. does anyone run IGT on big-endian?
10:26 melissawen: Do you mean, this: "Current DRM assumption is that alpha is premultiplied" ? or what is got by the userspace?
10:28 pq: oh, it's written somewhere? That's good.
10:29 pq: <rant>as an aside, it's funny how "everything" uses premultiplied, then it's the wrong thing to do from color correctness perspective</rant>
10:35 melissawen: pq, about endianness, there is a ifdef that I think is handling big-endian host on drm code: https://cgit.freedesktop.org/drm/drm-misc/tree/include/drm/drm_fourcc.h
10:36 pq: is VKMS using the HOST formats or the normal formats?
10:37 melissawen: normal, ARGB/XRGB8888
10:37 pq: then the #ifdef does nothing as I can see
10:39 pq: the question I always striggle with is: for a certain normal DRM format, between little- and big-endian hosts, which one stays identical? The bits of a word, or the binary data in memory.
10:39 pq: *struggle
10:39 pq: by binary data I mean as an array of bytes
10:40 pq: if it's array of bytes, then accessing through an array of bytes produces always the right interpretation
10:41 emersion: DRM formats are LE
10:41 emersion: even on BE machines IIRC
10:41 pq: emersion, yes, but what does that mean
10:42 pq: ok, so the array of bytes in memory is identical between LE and BE for the same format?
10:42 emersion: yes, i think so
10:42 emersion: never tested it, but that's what i've been told iirc
10:43 pq: I've never had a BE machine either
10:43 pq: hence would be really nice to hear that someone is running IGT on a BE machine
10:44 emersion: agreed
10:44 pq: melissawen, so, I think if your pixel interpretation is correct on LE, it is also correct on BE, and sorry for bothering :-)
10:45 emersion: and probably facing heaps of errors, since last time i heard someone said it works, it was because two parts of the stack got it wrong, making the result look okay
10:46 pq: yeah, hence it would be really important to have IGT and VKMS correct on BE, so that other drivers can be compared to them.
10:46 pq: it's really far too easy to have two mistakes cancel each other out
10:47 ccr: wouldn't it be rather a question of whether the CPU and "GPU" are of same endianess
10:47 pq: ccr, no GPU involved
10:47 ccr: ok
10:48 pq: that's a whole another rabbit hole
10:48 ccr: undoubtably
10:49 emersion: hm hm https://patchwork.kernel.org/patch/9692037/
10:49 pq: if a display driver chooses to consistently lie about formats, that's fine if no-one can access the VRAM data directly.
10:49 pq: bypassing the lies, that is
10:50 pq: emersion, a bit old, 2017
10:50 emersion: yeah, and no clear conclusion
10:51 pq: I think HOST formats are newer than that?
10:51 pq: though HOST formats are kernel-internal only, right?
10:51 melissawen: pq, no problem. in fact, thanks for advice and new info :) I still have a lot to learn
10:51 emersion: i'm not familiar with HOST formats
10:52 pq: they're in melissa's link above
10:52 emersion: oh
10:52 emersion: that sounds quite new
10:53 emersion: ah, no, 2018
10:54 emersion: ah, as you said, it's not in uapi/
10:55 emersion: okay, i now understand what DRM_FORMAT_BIG_ENDIAN does
10:56 ccr: better document it for the future generations? :P
10:56 emersion: it's kind of documented
10:56 emersion: #define DRM_FORMAT_BIG_ENDIAN (1U<<31) /* format is big endian instead of little endian */
10:56 emersion: i just didn't understood you could just flip the flag on any format
10:57 pq: this seems to be one of those things where all terse explanations are also ambiguous :-P
10:58 pq: https://afrantzis.com/pixel-format-guide/
11:00 pq: so the format definition uses bits of a little-endian word, and the word is not host-endian but little-endian specifically, which means that the pixel data as an array of bytes is the same on both LE and BE CPU hosts
11:01 MrCooper: is there any userspace which uses DRM_FORMAT_BIG_ENDIAN at all yet, let alone correctly?
11:02 emersion: "correctly" is an important precision https://github.com/swaywm/wlroots/blob/1dbcfdaf81778fcd4635c6ecd62b89477f69f0d8/render/gles2/texture.c#L261
11:02 MrCooper: it falls down for formats where components straddle byte boundaries at the latest
11:02 emersion: ah, in this case it's just a whitelist
11:02 MrCooper: *breaks down
11:02 emersion: so it's not a big deal
11:05 pq: MrCooper, no, I think those formats is where it's actually necessary, because e.g. DRM_FORMAT_RGB565 need a byteswap on BE CPU before the bits can be accessed. The flag is the only way to define a RGB565 format doesn't need a byteswap on BE.
11:07 pq: to me, all the 8 bpc formats are the exceptions, not the ones that straddle byte boundaries
11:07 emersion: tbh, i think defining a proper separate format for this one would have been less misleading
11:08 pq: and all the other 16-bit formats?
11:08 pq: and 101010 formats
11:08 pq: well, all non-8-bpc formats
11:09 emersion: this is only used for things like cursor planes, right?
11:09 emersion: i mean: DRM_FORMAT_BIG_ENDIAN is only used for dumb FBs?
11:10 pq: I don't think it's used at all, which is the problem :-P
11:11 emersion: ahah
11:11 pq: It's a format definition, it's used everywhere. If it's definition depended on where it is used, chaos would ensue. WAS THAT A DRAGON??!?
11:12 emersion: yeah but i assumed the use-case was letting drivers advertise it on BE systems, to allow for more efficient CPU rendering in user-space
11:12 pq: traditionally KMS drivers didn't advertise any formats...
11:13 pq: I think?
11:14 emersion: the formats array is part of the CRTC struct
11:14 emersion: so that'd surprise me a little
11:14 pq: oh, CRTC advertises formats? never realized that
11:17 emersion: err, no
11:17 emersion: it's planes
11:17 pq: universal planes?
11:17 emersion: drmModePlane
11:18 emersion: not familiar enough with KMS history to know when drmModePlane has been introduced
11:18 pq: so overlays, in the time before universal planes
11:18 emersion: yeah, it predates CAP_UNIVERSAL_PLANES for sure
11:19 pq: and AddFB did not even *have* a format, just depth and bpp
11:20 pq: well, AddFB didn't have problems because it didn't use formats, so I went a little too far back there :-)
11:28 karolherbst: jenatali: so this scaling thing we really need for like.. most native ops :D
11:28 karolherbst: sqrt and rsqrt as well
11:28 karolherbst: but it's not _that_ painful to add
11:28 karolherbst: it's pretty straight forward and can be implemented in a way it works for a lot of opcodes with shared code
11:28 karolherbst: annoying is just that I use bcsel :/
11:31 daniels: tanty: it's pretty easy to run your own local MinIO + OPA, let me know if you want some instructions
11:32 tanty: daniels: is there some place in which it is documented/scripted any custom changes applied for the MinIO that is running in fdo ?
11:33 daniels: tanty: https://gitlab.freedesktop.org/freedesktop/helm-gitlab-config/-/tree/master/gitlab-minio-provision is everything we have
11:34 daniels: it takes a little bit of clicking to get through the layers of helmfile -> helm -> k8s tho :P
11:34 daniels: saving you the clicks, https://gitlab.freedesktop.org/freedesktop/helm-gitlab-config/-/blob/master/gitlab-minio-provision/values/minio/fdo-minio/default.yaml.gotmpl is the config, which mostly just a) sets the JWK URL so we can validate the JWTs, and b) sets the OPA URL so it knows where to look for policy
11:34 tanty: :D
11:35 daniels: then https://gitlab.freedesktop.org/freedesktop/helm-gitlab-config/-/blob/master/gitlab-minio-provision/values/minio/fdo-opa/fdo-policy.rego is where we define all the OPA policy
11:35 tanty: great. Thanks!
11:35 daniels: np
11:37 daniels: lmk if you have any questions, I braindumped quite a bit on fahien yesterday so I can just copy & paste a giant tract :P
11:50 icecream95: airlied: I see phoronix picked up on your blog post...
12:07 daniels: heh, icecream needs a more stable IRC connection
12:14 karolherbst: airlied, jenatali: didn't we had some "exact" think stuff for libclc?
12:14 karolherbst: *thing
12:14 karolherbst: even constant folding isn't always correct
12:21 cwabbott: huh... in https://gitlab.freedesktop.org/cwabbott0/mesa/-/jobs/4210680 it seems like deqp passed but the job failed... anyone know why?
12:22 cwabbott: it says:
12:22 cwabbott: 2020-08-21T12:09:39 + echo deqp: pass
12:22 cwabbott: 2020-08-21T12:09:39 deqp: pass
12:24 tomba: Could someone update drm-misc-fixes to -rc1? I got an important fix I'd like to merge (drm/omap: fix incorrect lock state)
12:24 karolherbst: jenatali: ahh.. the vec16 issue is running out of TLS space indeed :)
12:24 tomba: Although the fix applies even on 5.8. But the issues is not in 5.8, so feels wrong to apply on top of 5.8...
12:34 karolherbst: jenatali: math_brute_force: Pass 85 Fails 13 Crashes 2 Timeouts 0 :)
12:34 karolherbst: and that's with subnormal support
12:35 karolherbst: "ERROR: asin: 4.666239 ulp error at 0x1.ffd8p-1 (0x3f7fec00): *0x1.8bcc94p+0 vs. 0x1.8bcc9ep+0"
12:35 karolherbst: that feels like ffma vs fmad
12:36 karolherbst: maybe something else
12:36 karolherbst: maybe asin just needs more precision in libclc.. mhh
12:37 karolherbst: but it does so much mad
12:42 MrCooper: pq: formats which don't boil down to an array of bytes could be defined per the CPU byte order by default, with some mechanism for explicit LE/BE variants
12:55 jenatali: karolherbst: That's great progress :)
12:56 jenatali: Personally, I'd like to avoid introducing extra ops as required in vtn for scaling - if we want them there they can be optional
12:56 jenatali: Nir optimizations don't seem to introduce problems, it's just a matter of the resulting opcode either being lowered to something not precise enough, or the driver/hardware not using a precise enough op
12:57 jenatali: But both of those end up being driver-specific, and I think that's where we should solve the problem
12:57 karolherbst: yeah...
12:57 karolherbst: but it's a bit messy
12:57 jenatali: E.g. the driver has a check for `if (KERNEL) { use more precise op }
12:57 karolherbst: we don't want to add the additional requiernments for _all_ opcodes
12:57 pq: MrCooper, could be, but are not, but are in practice used like that? :-)
12:57 jenatali: For CL you kinda do though
12:58 karolherbst: because after we consumed the API stuff it's all fair game to use lower precision internally
12:58 pq: and then the HOST formats are explicitly CPU-endian
12:58 karolherbst: just the API operations have to be precise
12:58 karolherbst: but if you end up emiting fdivs internally for other stuff that shouldn't need the scaling
12:58 jenatali: Sure, but what internal ops are you talking about that wouldn't end up bubbling out to affect the API?
12:58 karolherbst: driver internal lowering
12:58 karolherbst: could be anything really
12:59 karolherbst: what I'd like to do is to run a lowering pass inside clover I think?
12:59 karolherbst: and just fix up some ops
12:59 jenatali: That could work
12:59 karolherbst: yeah..
12:59 karolherbst: I think that's much cleaner than having drivers to care about the API
13:00 karolherbst: ehh something in exp also needs scaling: ERROR: exp: -8377325.500000 ulp error at -0x1.5d5ap+6 (0xc2aead00): *0x1.ff4fb8p-127 vs. 0x0p+0
13:01 karolherbst: anyway
13:01 karolherbst: jenatali: I think if we do a fix for fdiv/sqrt/rsqrt it looks good enough
13:01 karolherbst: a lot of those passes use fdiv
13:01 jenatali: Cool :)
13:01 karolherbst: and the remaining fails are quite insignificant
13:02 karolherbst: I am sure they might even pass all with disabled subnormals...
13:02 karolherbst: :D
13:02 jenatali: Yeah, libclc was pretty close to a magic bullet
13:02 karolherbst: yep.. exp passes without denorm support
13:03 karolherbst: asin still fails, but that looks like something else
13:04 karolherbst: okay.. so let me port that fixup into a proper pass we can call from clover or you can call quite early
13:04 karolherbst: I expect some nir opts to need some precise handling though?
13:04 karolherbst: jenatali: wasn't there a patch to mark everything as precise or something?
13:04 jenatali: Not sure - like I said, when we send stuff to WARP we're able to pass bruteforce
13:04 jenatali: karolherbst: Yeah, not mine, let me find it
13:05 jenatali: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6358
13:05 karolherbst: ahhh
13:05 karolherbst: that's required for you as well?
13:05 jenatali: Nope
13:05 karolherbst: I see
13:06 jenatali: Though we're still not quite sync'd up to master, it's possible more optimizations were introduced to master that cause problems
13:06 karolherbst: yeah.. but I am really not as concerned about the complicated opcodes to fail
13:06 karolherbst: fdiv/sqrt/rsqrt were just red flags
13:07 karolherbst: some of the fails are even nouveau or other bugs
13:08 karolherbst: seeins some out of bounds and invalid register errors
13:08 karolherbst: but fixing all that will be part of fixing general CTS stuff anyway
13:08 jenatali: Yep, makes sense
13:09 karolherbst: list of fails: https://gist.githubusercontent.com/karolherbst/2bb73d305973483b44f33c0670e55d67/raw/200b9bb308510ebf1d98d7ebc6e63bcea6bb2cfa/gistfile1.txt
13:09 karolherbst: those _cr ones are just annoying, we can just ignore those :D
13:23 jenatali: karolherbst: What're we waiting for with https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5891 ?
13:24 karolherbst: nothing I guess
13:24 karolherbst: marge it :p
13:25 jenatali: Sure, here goes
13:50 alyssa: Is the number of textures/samplers used by a program known at compile-time?
13:50 cwabbott: this is a fun game of "spot the bug": https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/compiler/nir/nir_lower_io_arrays_to_elements.c#L64
13:51 alyssa: shader_info here seems incomplete, I'm trying to figure out if that's an inherent issue stemming from arrays of sampler/textures/indirect access
13:53 cwabbott: alyssa: shader_info::num_textures should have that info
13:53 alyssa: cwabbott: even with separate samplers etc?
13:54 cwabbott: alyssa: we currently only support separate samplers with vulkan, and there the concept of num_samplers/num_textures doesn't make sense
13:55 alyssa: well, it's a good thing we don't support vk
13:55 alyssa:glances nervously at panv
13:56 alyssa: thanks :)
14:02 alyssa: num_textures does seem to work, makes me feel a little nervous but, uh, that's what CI is for right? :p
14:03 jenatali: CL needs separate samplers too, for what it's worth
14:04 alyssa: jenatali: CL needs a lot more than that to work on Panfrost, so we'll cross that bridge then ;)
14:04 jenatali: :)
14:37 MrCooper: mattst88: how did you think I could help on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6031 (that you couldn't yourself :) ?
15:07 alyssa: wc
15:24 jekstrand: cwabbott: Yeah, I think that's roughly what we want to do.
15:24 jekstrand: cwabbott: It's been on my list of things to fix for a while but never towards the top
15:25 cwabbott: jekstrand: what are you responding to?
15:28 jekstrand: cwabbott: Your fmad/ffma comments
15:29 jekstrand: jenatali: Sorry for interrupting your MR. Generally, if someone's given significant feedback but not yet signed off, it's good to make sure they're ok with it before assigning to marge even if someone else has reviewed.
15:29 karolherbst: cwabbott: btw, I'd like to merge https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6377 if that's okay
15:29 jenatali: jekstrand: No worries, I'm glad you were able to catch it that I wasn't doing the right thing
15:29 jenatali: Not trying to rock the boat, just still learning the ropes :)
15:30 jekstrand: jenatali: Yup, I figured. :)
15:32 karolherbst: heh, although it was me telling him to marge it :p ohh, but it seems like I missjudged as I thought it already have more reviews than just mine...
15:32 karolherbst: my bad then
15:32 jenatali: No harm done at lesat
15:32 jenatali: least*
15:33 cwabbott: karolherbst: btw, I'd much prefer if we named ordered-not-equal fneo rather than fne like your commit message seems to imply
15:33 cwabbott: I think that's what jekstrand's branch did
15:33 karolherbst: yeah.. I was thinking a little about it
15:34 karolherbst: I think it leads to nicier to read code as things are still aligned properly :D
15:34 cwabbott: 99% of the time, when you want != you'll want fneu
15:34 karolherbst: right
15:34 cwabbott: so calling the almost-never-used thing "fne" without a suffix is guaranteed to get people confused
15:34 karolherbst: yeah..
15:34 karolherbst: I have nothing against being explicit about ordering
15:37 jekstrand: jenatali, karolherbst: Can num_work_groups actually be > UINT32_MAX?
15:37 karolherbst: for CL sure
15:37 jenatali: Yeah, I think so
15:37 karolherbst: global_work_size is unlimited afaik
15:37 karolherbst: except we all missed something
15:38 karolherbst: "CL_​INVALID_​GLOBAL_​WORK_​SIZE if any of the values specified in global_work_size[0], …​ global_work_size[work_dim - 1] exceed the maximum value representable by size_t on the device on which the kernel-instance will be enqueued."
15:38 karolherbst: and that's it
15:38 jekstrand: cwabbott: How would you feel about keeping nir_builder helpers for nir_feq -> feqo and nir_fne -> fneu? That way we have versions that map to C but actually generate properly ordered opcodes.
15:39 jekstrand: karolherbst: Ok, then.....
15:39 jekstrand: size_t, eh?
15:39 jekstrand: *sigh
15:39 karolherbst: yep....
15:39 karolherbst: jekstrand: that's even more fun on GPUs which only have 2 dimensional grids like nv50 :)
15:40 karolherbst: we can eunque 512 thread big blocks, but if you enqeue 512x1x2 as the global_size you are out of lock
15:40 cwabbott: jekstrand: yeah, that might work I guess... although it is a little tricky
15:40 karolherbst: so... the runtime needs to lower this into several dispatches :/
15:41 karolherbst: mhhh
15:41 karolherbst: yeah.. the functions are all generated
15:42 karolherbst: cwabbott: maybe have both versions and we just add fne to nir_builder.h?
15:42 karolherbst: as a "symlink" to fneu
15:42 jekstrand: cwabbott: Yeah... The problem is that I also don't expect people to remember that != in C is the unique unordered comparison with out a little help.
15:42 jekstrand: karolherbst: That's what I was proposing.
15:43 karolherbst: ahh
15:43 karolherbst: yeah I guess that would be fine
15:43 jekstrand: Also, FYI, my proposal was assuming that we would be adding unordered versions lf all the others as per my MR from a year ago.
15:43 jekstrand: So we'd have both for everything and then helpers for the C versions.
15:43 karolherbst: yeah, that's my plan as well
15:44 jekstrand: My MR also has helpers for lowering if drivers want that.
15:44 jekstrand: I think we can actually express most of them on Intel HW without too much trouble.
15:44 karolherbst: yeah.. I just want to land the rename first
15:44 karolherbst: in order to... not run into issues
15:44 jekstrand: The only one that's tricky is fequ and fneo
15:44 jenatali: LLVM/DXIL has both of them as well
15:44 jekstrand: As does SPIR-V
15:44 karolherbst: what hw doesn't?
15:45 jekstrand: karolherbst: Intel HW doesn't make it easy to implement fequ or fneo
15:45 jekstrand: karolherbst: See also my MR
15:45 karolherbst: right.. I remember there was something ugly about it
15:45 jekstrand: All the others are easy. We just flip the comparison and add a ! which we get for free with our predication system.
15:45 karolherbst: mainly wondering if I should add the fne helper already and remove some bits of the MR?
15:46 karolherbst: equ indeed seems abit messy :/
15:53 jekstrand: karolherbst: After scanning through your MR again, I donj't think a nir_fne helper helps
15:54 jekstrand: karolherbst: There are not that many places we emit an fne and 50% of them are fneu(x, x) to do a NaN check and we very much want that unordered. :)
15:54 karolherbst: right
15:54 jekstrand: karolherbst: Most of the churn in your MR is in backends handling the opcodes and helpers don't help them.
15:55 jekstrand: So, yeah, the helper would be right/useful about 3 places. Not worth it.
15:55 karolherbst: at this point I am more concerned which hw does not support unordered properly anyway
15:55 jekstrand: All hardware should support fneu
15:55 jekstrand: fequ is what's more concerning
15:55 karolherbst: I meant the others
15:55 jekstrand: We add a lowering pass for those
15:55 karolherbst: but that's unrelated to the MR anyway
15:55 karolherbst: right
15:55 jekstrand: karolherbst: Seriously, just revive my MR and rebase it. :-)
15:56 jekstrand: It's got all the patches.
15:56 karolherbst: right
15:56 karolherbst: I just want to land the rename first so I know I didn't mess up renaming :)
15:56 karolherbst: but on top I'd just take your MR and work through that
15:56 jekstrand: Including a lowering pass and an implementation of fequ/fneo on Intel.
15:57 jekstrand: karolherbst: That's fine. I'm pretty sure your patch is exactly the first patch of my old MR. It's just rebased such that it applies to all the back-ends that have been added since.
15:57 jekstrand: The re-name of fneu is the first step
15:57 karolherbst: yeah, probably
15:57 jekstrand: Rename of the others to feqo etc. is the second.
15:58 karolherbst: anyway.. I'd go ahead and marge it then and then work through your MR to implement the other ones
15:58 karolherbst: maybe even add some optimizations? dunno yet...
15:58 jekstrand: I guess my MR doesn't rename things to feqo but we should.
15:58 karolherbst: we make use of unordered in codegen
15:59 karolherbst: jekstrand: the thing is also that inot flt is fgeu actually but I think we do that only for !precise sources?
15:59 karolherbst: that's probably stuff I'd like to add as well
15:59 karolherbst: correct opts I mean
16:00 karolherbst: ahh yeah
16:00 karolherbst: (('~inot', ('flt', a, b)), ('fge', a, b)),
16:00 karolherbst: (('~inot', ('flt', a, b)), ('fge', a, b), 'lower_unordered'),
16:00 karolherbst: (('inot', ('flt', a, b)), ('fgeu', a, b)),
16:01 karolherbst: or something then?
16:01 karolherbst: oh well
16:01 jekstrand: karolherbst: Right. Because right now we feq and fge are both ordered
16:01 jekstrand: inot(flto(a, b)) = fgeu(a, b)
16:01 karolherbst: yep
16:01 jekstrand: So once we have proper opts, we do those optimizations exactly
16:02 karolherbst: yeah
16:02 karolherbst: we have those in codegen but I think they are not correct
16:02 karolherbst: but anyway
16:02 karolherbst: I would like to do the opts after rebasing as well and verify it actually works :)
16:03 karolherbst: okay anyway. merging the rename now to at least resolve the confusion and then I go ahead and rebase your MR on top :)
16:12 jenatali: Hm, for some reason GitLab likes to turn off notifications for MRs sometimes...
16:12 jenatali: Odd...
16:35 jekstrand: karolherbst: Does NV have an actual fdiv opcode?
16:35 karolherbst: nope
16:35 jekstrand: hrm....
16:35 jekstrand: karolherbst: We do on Intel. I'm wondering if it's precise enough for what you pointed out on the CLC MR
16:36 karolherbst: dunno
16:36 karolherbst: what's weird is that nvidia has two version of fdiv inside ptx
16:36 karolherbst: approx and full
16:36 jekstrand: I suspect it is; otherwise why would they waste gates on it.
16:36 jekstrand: If it was just rcp+mul, it'd be pointless
16:36 karolherbst: both do upscaling, full does downscaling on top
16:36 karolherbst: or the other way around
16:37 jekstrand: karolherbst: Does CLC depend on the precision of fdiv for things?
16:37 jekstrand: jenatali: ^^
16:37 jenatali: Yep
16:37 karolherbst: jekstrand: anyway, my plan was to move that into a new nir pass and run it once in clover after we linked clc in
16:37 karolherbst: yes
16:37 karolherbst: libclc does :D
16:37 jenatali: As does CL in general
16:37 karolherbst: yeah, but for clc it is quite cirtical
16:37 jenatali: Math bruteforce has a test for div
16:38 karolherbst: it even has a more strict one which _requires_ correct rounding I still don't pass
16:39 karolherbst: but that's optional
16:39 jenatali: karolherbst: You should just not set the API flag that says you support correct rounding
16:39 karolherbst: I know
16:39 karolherbst: I just set it to test things
16:39 jenatali: Ah
16:39 karolherbst: we don't even advertise denorm support in clover yet
16:41 karolherbst: jenatali: I will fix up my patch and push it (and add review tags while at it)
16:42 jenatali: karolherbst: Great - I'll wait until that's done to apply jekstrand's feedback, unless you want to do that too (splitting the refactor)
16:43 jekstrand: karolherbst: Yeah... having a prcise fdiv lowering pass seems reasonble.
16:43 jekstrand: At some point, I'll have to test the intel fdiv opcode and see if it's precise enough.
16:43 karolherbst: I also planned to add sqrt and rsqrt handling in there
16:43 jekstrand: I'm guessing jenatali is getting saved by WARP
16:43 karolherbst: all of that requires the same scaling, just slighly different
16:43 jekstrand: karolherbst: I think we have different SQRT and RSQ in our hardware too. :)
16:44 karolherbst: :)
16:44 karolherbst: yeah.. but if scaling works for all hw I think that's good enough
16:44 jekstrand: Our math box is quite nice sometimes
16:44 karolherbst: worst case we change the constants a bit
16:44 karolherbst: or make it optional for hw which is precise
16:44 jekstrand: karolherbst: Sure, but if Intel doesn't need the scaling.....
16:44 jenatali: jekstrand: Yep, WARP's fdiv is accurate, but we do get failures running on hardware
16:44 karolherbst: jekstrand: mind giving me patches for iris?
16:44 karolherbst: I can test myself then :p
16:44 jekstrand: karolherbst: I'm working on trying to sure those up right now.
16:45 jekstrand: I ran into some annoying issues
16:45 jenatali: Let me find my Intel run and see if fdiv was one of the failures...
16:45 jekstrand: Somehow my patches brok iris pretty bad.
16:45 karolherbst: ahh
16:45 jekstrand: jenatali: It probably is. Likely the D3D12 driver does mul+rcp.
16:45 jekstrand: jenatali: I suspec the OpenCL driver is the only one which actually uses those bits of the math box.
16:46 jenatali: jekstrand: Looks like it crashed during that run before it got to divide :P my machine with an Intel GPU is currently running the CTS on the NVIDIA GPU so I don't really want to interrupt it
16:47 jekstrand: fair enough. :)
16:47 jenatali: The CTS takes way too long...
16:47 jenatali: At least, when trying to actually run it for certification purposes
16:47 karolherbst: jenatali: pushed
16:47 jenatali: Thanks
16:48 karolherbst: regarding using uniforms.. I think that's mostly fine. and I don't actually think there is any drawback to it
16:49 karolherbst: I think on nv50 we can use shared memory instead as well, but I also don't know the actual reasons for doing so
16:49 jekstrand: jenatali: Are there cases in CLC where it'd like to take different paths based on fdiv or sqrt/rsq like for ffma?
16:49 jenatali: jekstrand: Not that I remember seeing
16:49 jenatali: I think the only things in libclc were denorm control (which we set to false for SPIR-V right now... maybe that should be another builtin) and ffma
16:49 jekstrand: karolherbst: It's more that uniforms are a better match in terms of how these things are handled in graphics.
16:50 jenatali: karolherbst: That could be why you are getting failures with denorms
16:50 karolherbst: jekstrand: yeah, I totally get that
16:50 jekstrand: karolherbst: shader_in has always been odd because in 3D it's exclusively used for cross-geometry-stage I/O
16:50 karolherbst: jenatali: ahh, maybe
16:50 jekstrand: It has the right name but the wrong semantics most of the time
16:50 karolherbst: yeah...
16:50 karolherbst: I don't mind changing it
16:50 jekstrand: Whereas uniform is almost an exact semantic match even if the name is weird.
16:50 karolherbst: we have our own intrinsic to load that
16:50 karolherbst: it was just easier to do for shader_in at the time
16:50 karolherbst: I think
16:51 karolherbst: with mem_const that all is much easier to handle anyway
16:51 jenatali: At the end of the day it does a KERNEL check before converting to load_kernel_input I think
16:51 jekstrand: Yeah
16:51 jekstrand: At the time, I think we were both trying to figure out what all this stuff meant too.
16:51 jekstrand: Now we've got a much clearer picture, I think.
16:52 jenatali: Indeed
16:52 karolherbst: yep
16:52 karolherbst: I remember the time when I tried to make kernel parameters to work...
16:52 karolherbst: that was ugly
16:52 jekstrand: Yeah
16:52 jekstrand: We had a lot of back-and-forth on that at the time
16:52 jekstrand: With lots of me floundering around not really understanding kernels
16:53 karolherbst: yeah... I think the biggest issue is just that spirv isn't explicit about kernels in the end
16:53 karolherbst: it just says "that is supposed to be called", but besides that? nothing :p
16:54 karolherbst: but yeah...
16:54 jenatali: It is very much a dance between all the various specs - CL C, CL API, SPIR-V, CL environment for SPIR-V, CL extensions
16:54 jenatali: And throw SPIR into the mix sometimes too :P
16:54 karolherbst: jenatali: the issue is rather that kernel parameters are not specified
16:54 jenatali: Hm?
16:54 karolherbst: where is your kernel parameter in the spirv? :p
16:55 jenatali: Function parameters for the kernel entrypoint?
16:55 karolherbst: nope, those are function parameters
16:55 karolherbst: not kernel parameters
16:55 karolherbst: it doesn't work if you assume it's the same
16:55 karolherbst: you can call _any_ kernel functions inside your kenrel
16:55 karolherbst: so how do you deal with the address space missmatch?
16:56 karolherbst: it just starts to make sense once you agree that there are no kernel parameters in the spirv
16:56 karolherbst: and runtimes have to create their kernel function on the fly
16:56 jenatali: Sure, I suppose
16:56 karolherbst: at first we tried to make it work without a wrapper
16:56 karolherbst: and that totally failed :p
16:56 jenatali: Oof...
16:57 karolherbst: yeah.. took some times
16:57 jekstrand: I think, in theory, I could imagine a stack with some sort of function call ABI where it would make sense
16:57 jekstrand: But, yeah, that's not what we have.
16:57 jekstrand: So we need wrappers
16:57 karolherbst: even then it doesn't as function parameters live in function temp memory
16:57 karolherbst: kernel args don't
16:58 jekstrand: You could have some piece of hardware that pre-loads them somehow
16:58 karolherbst: maybe some hw can make it the same
16:58 karolherbst: like hw with push constants
16:58 karolherbst: and have registers as the function call ABI
16:58 jekstrand: And that hardware could be a little bit of shader code which pulls them from uniforms and.....
16:58 jekstrand: Yeah, I just invented wrappers. :-P
16:58 jenatali: Hey like a wrapper
16:58 karolherbst: :p
16:59 karolherbst: I think on hw which uses the same way getting kernel args into the shader and uses the same stuff for function calling can make it work
16:59 karolherbst: but relevant hw doesn't do that :p
17:00 jekstrand: karolherbst: Any reason to not land " clover/nir: use offset for temp memory
17:00 jekstrand: ?
17:00 jekstrand: karolherbst: Feel free to throw my RB on it and marge it
17:01 karolherbst: what MR was it?
17:02 jekstrand: karolherbst: I don't know. It's in your cl_wip branch
17:02 karolherbst: ahh
17:02 karolherbst: then I just didn't do another round of creating MRs :p
17:02 jekstrand: karolherbst: Trying to figure out how we can trim down that stack.
17:02 karolherbst: merging MRs :p
17:02 jekstrand: karolherbst: Yeah, I know. I'm working on it the bits I can help with.
17:03 karolherbst: let's see
17:03 karolherbst: yeah.. I think once offsets land I will probably create a new MR with some patches
17:03 karolherbst: stuff conflicts/depends on the patch adding that clover nir pass
17:04 jekstrand: karolherbst: Are there any others of yours that I should be reviewing?
17:04 karolherbst: nope
17:04 jekstrand: karolherbst: You keep saying review MRs but I don't know which ones you're referring to. :-)
17:04 karolherbst: only https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6064 maybe.. but maybe I like this memcpy approach doing the right thing
17:04 karolherbst: I am still super unsure about it
17:04 jekstrand: karolherbst: I think there's a couple from bbrezillon I should loook at today. The alignments one in particular.
17:04 karolherbst: :D
17:04 karolherbst: yeah..
17:05 mattst88: :q
17:05 mattst88: derp
17:05 karolherbst: jekstrand: I think getting mem_constant is also somethign I should be reviewing.. let's see
17:06 karolherbst: libclc is what I still want to figure out and work on.. mhh
17:06 karolherbst: printf.. mhh
17:06 karolherbst: images.. packed/alignment stuff...
17:07 karolherbst: then we also have spirv support in clover mhhh
17:07 jenatali: Don't forget memcpy
17:07 karolherbst: ahh memcpy
17:07 karolherbst: did jekstrand look over that yet?
17:08 jenatali: Don't think so - it depends on the packed struct MR to provide alignment info IIRC
17:08 karolherbst: okay
17:08 jenatali: You can actually get pretty far without supporting memcpy
17:08 karolherbst: yeah
17:08 jenatali: I think it's only generated for struct -> struct copies
17:08 karolherbst: yeah
17:08 karolherbst: or well..
17:08 karolherbst: memory to memory :p
17:08 jenatali: Sure, but CL C doesn't have a memcpy function
17:08 karolherbst: I think you'd end up with that for big arrays, no?
17:09 jenatali: Yeah but you can't assign one array to another unless they're embedded in struct, right?
17:09 karolherbst: ahh, could be
17:09 jenatali: The async copy opcodes are dedicated as well
17:09 karolherbst: the old "wrap it in a struct" trick
17:10 emersion: does anyone know how to list fourcc formats/modifiers that can be used for rendering
17:10 emersion: from EGL
17:11 emersion: should i just assume anything returned by the EGL_EXT_image_dma_buf_import_modifiers extension is fine?>
17:12 emersion: or maybe only those with "external_only" set to false?
17:13 bnieuwenhuizen: emersion: rendering with which API? :)
17:13 emersion: rendering with opengl
17:14 bnieuwenhuizen: if you mean GL, I think external_api would not be allowed but otherwise there are nor estrictions specced (so they should all be valid if they'd be renderable non-shared?)
17:14 emersion: ok, thanks!
17:14 emersion: non-shared?
17:16 Vanfanel: emersion: an atomic commit where I try to set a cursor or move it on a cursor plane always fails with EINVAL (-22) if I do it while the PRIMARY plane has CRTC_ID and FB_ID set to ZERO. Is that supposed to happen on all drivers? I mean, I am trying to operate on the cursor plane, not the PRIMARY one.
17:20 bnieuwenhuizen: emersion: if you come with a format that is not renderable normally in GL usage, I wouldn't expect EGL to make it renderable? Though maybe I'm too optimistic/Vulkan-minded there
17:51 Lyude: seanpaul: want to review the last un-reviewed patch on https://patchwork.freedesktop.org/series/80542/ ?
18:09 jekstrand: karolherbst, jenatali: To me the three biggest issues are: constants, images, and anything that sures up pointers. Alignments being one of the biggest ticket items for #3.
18:10 karolherbst: yeah
18:10 jenatali: Yep
18:10 karolherbst: constants sound like a huge win
18:10 karolherbst: I just have _huge_ TLS usage issues with libclc
18:10 jekstrand: CLC is important but I think we need constants to be solid first.
18:10 karolherbst: and I am sure that's caused by the inferior optimization handling
18:10 jenatali: Agreed
18:11 jekstrand: I'm going to go dig up the alignment MR and start reading.
18:11 jenatali: Cool :)
18:11 karolherbst: wuhu.. I am down to 72 patches now
18:11 karolherbst: ...
18:11 jenatali: Down?
18:11 karolherbst: LD
18:11 karolherbst: :D yeah...
18:11 karolherbst: I started to pile stuff up quite a lot
18:12 karolherbst: most of them are just nouveau things
18:12 karolherbst: like shader cache
18:12 karolherbst: and if we land offsets, it would be another 8 patches gone
18:13 jekstrand: karolherbst: FYI: I just force-pushed my wip/shamrock branch with the latest. I'm hoping, if Kayden gets a chance to read through !6405, that I can land all but the top 3 today.
18:13 jekstrand: karolherbst: That plus your cl_mem branch gets me "Pass 95 Fails 5 Crashes 1 Timeouts 5" on basic on iris.
18:13 karolherbst: ahh, that's required for running clover on iris I guess?
18:13 karolherbst: nice
18:13 karolherbst: 59 commits :O
18:13 karolherbst: ahh
18:13 karolherbst: there is stuff from me
18:14 jenatali: And me :D
18:14 karolherbst: jekstrand: yeah.. ping me once you get a clean MR with iris patches I can pull to start testing libclc on iris as well
18:14 karolherbst: I am willing to go through the pain of debugging all of that :D
18:16 karolherbst: jenatali: uhm.. what do I need for event stuff?
18:17 karolherbst: just libclc?
18:17 jenatali: karolherbst: The libclc series adds handling for event types, which is needed to be able to parse libclc
18:17 jenatali: But I've got another couple patches I haven't ported over to actually use the async methods from libclc
18:17 jenatali: Let me track them down...
18:20 karolherbst: jekstrand: heh... somehow I got regressions and I think our pass rate is equal now
18:21 jenatali: karolherbst: https://gitlab.freedesktop.org/kusma/mesa/-/merge_requests/112/diffs?commit_id=c2d5a8dc2596a7d2d50792f038cd33d6c9feecdf
18:21 karolherbst: ehh. I don't like regressions
18:21 jenatali: It won't apply cleanly, but hopefully shouldn't be *too* bad?
18:24 karolherbst: ohh wait
18:24 karolherbst: maybe I should stick with 1.1...
18:28 jekstrand: karolherbst: :-/
18:30 karolherbst: "Pass 96 Fails 5 Crashes 1 Timeouts 4" heh...
18:30 jekstrand: Yeah, that's about the same
18:30 karolherbst: local mem is busted.. let's see
18:30 jekstrand: hiloeo sometimes crashes and sometimes times out for me.
18:30 karolherbst: mhh
18:30 karolherbst: let's see
18:33 karolherbst: I think I messed up alignment again...
18:34 karolherbst: or something stupid..
18:37 jenatali: I just love the name hiloeo
18:44 karolherbst: okay... now it passes again.. so what broke it :)
18:44 jenatali: Which tests?
18:45 karolherbst: vload_local
18:45 karolherbst: something broke local memory in unexpected ways...
18:45 karolherbst: trying to figure out what
18:45 karolherbst: I suspect that libclc could mess things up
18:45 karolherbst: but.. let's see
18:46 karolherbst: maybe the shader cache is broken.. who knows
18:46 karolherbst: would be good to know as well though
18:46 karolherbst: ahh yeah...
18:46 karolherbst: ehh
18:46 karolherbst: jekstrand: it would be fine if shader cache is also broken for you in regards to local memory :)
18:47 karolherbst: s/fine/fun/
18:47 jekstrand: karolherbst: I don't think I have that problem...
18:48 karolherbst: what was failing besides the tests complaining about defined __IMAGE whatever?
18:48 jekstrand: fails:
18:48 jekstrand: basic bufferreadwriterect
18:48 jekstrand: basic global_work_offsets
18:48 jekstrand: basic kernel_numeric_constants
18:48 jekstrand: basic kernel_preprocessor_macros
18:48 jekstrand: basic loop
18:48 jekstrand: crashes:
18:48 karolherbst: ohhh
18:48 jekstrand: basic work_item_functions
18:48 jekstrand: timeouts:
18:48 jekstrand: basic async_copy_global_to_local
18:48 jekstrand: basic async_copy_local_to_global
18:48 jekstrand: basic async_strided_copy_global_to_local
18:48 jekstrand: basic async_strided_copy_local_to_global
18:48 karolherbst: ohhh
18:48 jekstrand: basic hiloeo
18:49 karolherbst: I think I know why bufferreadwriterect might fail..
18:49 karolherbst: let's see
18:49 karolherbst: yeah...
18:49 karolherbst: jekstrand: you need to specify allocation size
18:49 karolherbst: PIPE_COMPUTE_CAP_MAX_MEM_ALLOC_SIZE eg
18:49 karolherbst: and base it on VRAM
18:49 jekstrand: karolherbst: I do but maybe it's not big enough
18:49 emersion: bnieuwenhuizen: yeah i'd expect that too
18:49 karolherbst: I suspect it's too big :p
18:50 karolherbst: max alloc is like 1/4 of VRAM usually
18:50 jekstrand: karolherbst: ?
18:50 jekstrand: karolherbst: I think I advertise 1G right now
18:50 jekstrand: so not too big
18:50 jenatali: That's the minimum allowed
18:50 karolherbst: how much memory can you allocate ?
18:50 karolherbst: memory actively used
18:50 jekstrand: all your RAM
18:50 karolherbst: at once
18:50 karolherbst: jekstrand: you can use all RAM at once?
18:50 jekstrand: sure
18:50 karolherbst: ahh, okay
18:51 karolherbst: anyway, back to "Pass 99 Fails 2 Crashes 1 Timeouts 4" :D
18:51 jekstrand: karolherbst: What'd you fix?
18:51 karolherbst: removed nouveaus shader cache
18:51 jekstrand: heh
18:51 jekstrand: I wonder if we don't have an issue with specifying the wrong scratch size
18:51 karolherbst: I wanted to test it :)
18:51 karolherbst: yeah...
18:52 karolherbst: I plan on passing all tests today
18:52 karolherbst: the crashw e have is annoying to fix
18:52 karolherbst: but..
18:52 karolherbst: everything else is possible
18:55 karolherbst: jekstrand: mhh.. actually applying the patches alone is enough :)
18:55 karolherbst: I suspect we overwrite something we get handed it...
18:59 karolherbst: nice.. I broke it :)
19:00 jekstrand: karolherbst: oh?
19:00 karolherbst: ehhh...
19:00 karolherbst: no
19:00 karolherbst: git messed up
19:00 karolherbst: the heck
19:03 karolherbst: yeah.. whtvr
19:03 karolherbst: guess cherry-picking patches sometimes does break things
19:09 jekstrand: ugh... packed....
19:09 jenatali: Yup
19:09 jekstrand: I can't even figure out what the semantics are supposed to be
19:10 jenatali: jekstrand: Taking a closer look at addressing modes for images, I think the only way this works out sanely is if we switch to using uniform for kernel inputs
19:10 jekstrand: jenatali: do it!
19:10 jekstrand: jenatali: It'll require some minor changes to nouveau and iris but meh
19:11 jenatali: Yeah, you added an assert in your constant mem patch that's highlighting a flaw in what I'm doing
19:11 karolherbst: why would rquire changes?
19:11 jenatali: You're asserting that deref casts can't change deref modes, just make them more specific - I'm totally not doing that right now
19:11 jekstrand: karolherbst: s/load_kernel_input/load_uniform
19:11 karolherbst: we can't use load_uniform
19:11 jenatali: jekstrand: We could keep it as load_kernel_input?
19:11 karolherbst: or... well
19:11 jekstrand: Or we can make nir_lower_io emit load_kernel_input
19:11 karolherbst: making load_uniform byte addressing could be super painful
19:11 jenatali: Yeah I like that one
19:11 karolherbst: dunno how to even start?
19:12 jekstrand: karolherbst: load_uniform already is byte addressing for us
19:12 karolherbst: ahh.. then I need to know how to switch over :D
19:12 jenatali: If we keep it using load_kernel_input, then the switch only really matters during intermediate steps before lower_explicit_io
19:12 jekstrand: That seems reasonable to me
19:12 karolherbst: I mean.. I am all for making everything byte addressing by default
19:12 jekstrand: Eventually, I want gallium to handle all this stuff and just give us cbuf0
19:12 karolherbst: :p
19:13 karolherbst: yeah
19:13 karolherbst: sooo
19:13 karolherbst: "Pass 101 Fails 0 Crashes 1 Timeouts 4"
19:13 jenatali: \o/
19:13 jekstrand: karolherbst: \o/
19:13 karolherbst: what a terrible hack
19:13 jekstrand: karolherbst: Are any of those fixes going to fix things for me? :)
19:14 karolherbst: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/d4931ee7fd58cdfb6a5addee5091be3722d643ef
19:14 karolherbst: this one will
19:14 karolherbst: fixes kernel_numeric_constants and kernel_preprocessor_macros
19:14 karolherbst: now async :D
19:14 karolherbst: uggg
19:15 karolherbst: jenatali: mind rebasing your async stuff and add the patches to the libclc MR?
19:15 karolherbst: I mean.. I could do it myself but..
19:15 jenatali: karolherbst: Yeah I'll handle it :)
19:15 karolherbst: given that it's in the downstream fork
19:15 karolherbst: cool, thanks
19:15 jenatali: Should I just add it in, or should I create another MR based on it?
19:15 karolherbst: just add it
19:15 karolherbst: if we think the MR is too big and we want to land stuff earlier we can always split it up again
19:15 karolherbst: but at this point clc isn't quite near getting merged anyway
19:16 jenatali: True
19:17 jekstrand: karolherbst: Can the client get pointers to kernel memory?
19:17 jekstrand: kernel input memory
19:17 jenatali: jekstrand: Kernel inputs are in private memory
19:17 jenatali: You can't get pointers to the addresses that our wrapper uses, but you can get pointers to the private variables in the actual kernel
19:18 karolherbst: jenatali: nope
19:18 jekstrand: What I'm wondering is if we even need to care about the actual layout of the stuff our wrapper creates
19:18 karolherbst: jekstrand: &&
19:18 karolherbst: ...
19:18 jenatali: I don't think so
19:18 karolherbst: jekstrand: we do have to
19:18 karolherbst: more or less
19:18 karolherbst: sooo..
19:19 karolherbst: there are some requiernments on how much fits in
19:19 jekstrand: Well, those aren't the same answer. :)
19:19 karolherbst: but in the end I think it's quite up to us.. just alignment has to be correct
19:19 karolherbst: so vec16 stuff needs to be vec16 aligned
19:19 jekstrand: Why?
19:19 jenatali: To be clear, there may be driver requirements for the layout, but not API requirements
19:20 karolherbst: jekstrand: because the CTS tests that
19:20 jekstrand: jenatali: Ok, that's the precise question I'm trying to answer.
19:20 karolherbst: it takes the address of arguments and verifies alignment
19:20 jenatali: karolherbst: But the things it takes the address of are inside the kernel, not the wrapper
19:20 jenatali: Meaning they're private variables, i.e. scratch
19:20 karolherbst: jenatali: we want to load propagate though
19:20 karolherbst: if we use different layouts it can all go boom
19:20 jenatali: So there's API requirements on the layout of scratch, but not on the kernel inputs
19:21 karolherbst: right.. private memory
19:21 karolherbst: but I don't think we should have different rules
19:21 karolherbst: eg if struct sizes changes...
19:21 karolherbst: that would be annoying
19:21 jekstrand: karolherbst: Should and can are different things. :-)
19:21 karolherbst: right...
19:21 karolherbst: but I thought it's just easier to follow the same rules :)
19:21 jekstrand: I think it generally is
19:21 karolherbst: there is _one_ issue though
19:22 jekstrand: I'm specifically trying to decide if we have to care about packed there
19:22 karolherbst: applications can pass in by value
19:22 karolherbst: so if you have this 3kb struct you pass in
19:22 karolherbst: it has to have the same layout rules
19:22 karolherbst: you can add padding how much you want between args
19:22 jenatali: Right, so the local variable we create in the wrapper to load into needs to be packed
19:22 karolherbst: but the args themselves have to fit
19:23 jenatali: And in order to load data into it correctly, the kernel arg needs to be packed as well
19:23 karolherbst: yep
19:23 karolherbst: well
19:23 karolherbst: no
19:23 jenatali: Since the runtime doesn't have type info to decompress the arg while loading it into the buffer
19:23 karolherbst: only if the kernel takes an address
19:23 karolherbst: but the data type itself still needs to be packed
19:23 jenatali: karolherbst: Sure, technically true
19:23 karolherbst: or we are smart and unpack
19:23 karolherbst: we know the layout of the data actually
19:23 karolherbst: so we could fix it up
19:23 jenatali: Eh I suppose...
19:24 karolherbst: I just don't think it's worth it
19:24 jenatali: I agree
19:24 karolherbst: if applications thing using packed is the way to go
19:24 karolherbst: they can suffer from bad perf
19:24 karolherbst: I don't care
19:24 karolherbst: *think
19:24 karolherbst: I would even make it obviously slow and if they complain I'd say: don't pack
19:25 karolherbst: but yeah.. the problem starts with structs passed in by value
19:36 jenatali: karolherbst: Pushed what should be a working async commit
19:37 karolherbst: jenatali: just that one single commit?
19:37 jenatali: Yep
19:37 karolherbst: let's see..
19:37 jenatali: On top of the rest of the series that should be all you need
19:38 karolherbst: ehhh.. why is mvsc so broken and makes it impossible to use enums? ...
19:38 jenatali: Hm?
19:39 jenatali: Oh the enum -> uint32_t change? That's because it's not necessarily an extension opcode anymore
19:39 jenatali: The async functions are core opcodes for whatever reason...
19:39 karolherbst: ohhh
19:39 karolherbst: I see
19:39 karolherbst: but I thought enums are signed anyway?
19:39 karolherbst: on msvc
19:39 jenatali: Yeah
19:39 karolherbst: anyway you left a "enum uint32_t opcode"
19:39 jenatali: :P
19:39 jenatali: Oops
19:39 jenatali: How did that build for me? :)
19:40 karolherbst: mhhh
19:40 karolherbst: I bet because of msvc :p
19:40 jenatali: Probably
19:40 karolherbst: guess you figured out how to declared unsigned enums :D
19:40 karolherbst: *declare
19:40 jenatali: Hah
19:40 karolherbst: Can't find clc function _Z29async_work_group_strided_copyPU3AS3Dv3_fPU3AS1KS_mm9ocl_event
19:41 jenatali: Hmmm
19:41 jenatali: One sec
19:43 jenatali: Hm, it shouldn't be looking for that one, that's a vec3, it should be swapping it for a vec4
19:43 jenatali: Give me a minute
19:46 jekstrand: Ugh... Took me like an hour to decide that bbrezillon's interpretation of "packed" was correct.
19:46 jenatali: :D
19:46 jekstrand: Dang, there are a lot of corners in there.
19:46 jekstrand: Also, packed is evil.
19:46 jenatali: Agreed
19:47 jenatali: At least we don't need to special-case packed on members - LLVM does that for us by inserting padding fields where necessary
19:49 jekstrand: And... now I remember the beef I had with that MR
19:49 jenatali: ?
19:51 jekstrand: There's a patch in there that mixes size_align funcs with explicit types
19:51 jenatali: Ah
19:53 jenatali: karolherbst: Found it - when switching the mangler to use vtn types instead of glsl types I missed a pointer level of indirection on this special case
19:53 jenatali: New version is pushed
20:02 karolherbst: cool, thanks
20:03 karolherbst: what do we do about that "Unsupported SPIR-V capability: SpvCapabilityGroups (18)" warning?
20:04 karolherbst: ohh wait.. we still don't support subgroups. mhh
20:06 jenatali: Yeah, I left an issue on the translator, those ops shouldn't be tagged with requiring that capability
20:06 karolherbst: ahh
20:06 jenatali: https://github.com/KhronosGroup/SPIRV-LLVM-Translator/issues/595
20:08 karolherbst: mhhh
20:08 karolherbst: something is odd with those tests
20:08 karolherbst: they seem to crash after a while
20:09 jenatali: They're long-running tests, but they do pass for me
20:09 karolherbst: yeah.. wondering on what's up actually
20:09 karolherbst: I suspect a nouveau bug
20:09 karolherbst: I am seeing some OOR_ADDR errors...
20:09 jenatali: jekstrand: FYI if you grab the last patch on the libclc series you can fix the 4 timeouts
20:10 karolherbst: I already pushed my branch :D
20:10 jenatali: :)
20:10 karolherbst: mhh
20:10 karolherbst: async_copy_local_to_global fails
20:10 karolherbst: mhhhh
20:11 airlied: subgroups are the one thing I tnik GL 3 adds
20:12 karolherbst: soo.. why do I get out of bound reads...
20:12 jenatali: Good question
20:13 imirkin: karolherbst: OOR_ADDR means you're reading constbuf out of bounds iirc
20:13 imirkin: karolherbst: double-check the encoding.
20:14 karolherbst: imirkin: that would surprise me: https://gist.github.com/karolherbst/17744460ee409a3c223bd6a897a398a9
20:14 karolherbst: it's probably shared mem
20:15 imirkin: mmmmm
20:15 imirkin: maybe.
20:15 imirkin: i don't know if shared gets you that error. makes sense that it might?
20:15 karolherbst: heh
20:15 karolherbst: "shared: 0"
20:16 karolherbst: let's see if the API provided value is sane
20:16 imirkin: ah. well yeah. maybe that's not great.
20:16 karolherbst: that's just the shader provided one...
20:16 karolherbst: we have to deal with.. both
20:17 karolherbst: uhm..
20:17 karolherbst: SHARED_MEMORY_SIZE : 0x6000
20:17 karolherbst: mhhh
20:17 karolherbst: that's... a lot
20:17 imirkin: sharing is caring
20:17 jenatali: Heh
20:17 karolherbst: 24k.. but that should be fine
20:17 imirkin: there are some settings that determine how much shared is generally available, i.e. the l2/l1 split
20:18 karolherbst: yeah...
20:18 karolherbst: let's use more and see where this gets me
20:18 karolherbst: nhhh
20:18 karolherbst: "24544 -> [b5 ab e2 40 ] != [ 0 0 0 0 ]"
20:18 karolherbst: so the last 0x20 bytes are broken
20:19 karolherbst: suspicious
20:19 karolherbst: if I set it to 0x600 it fails starting with 2560
20:20 karolherbst: which is 0xa00...
20:20 karolherbst: ehhh
20:22 karolherbst: ohhhh
20:23 karolherbst: seems like we don't configure the split for gp100
20:23 karolherbst: right..
20:23 karolherbst: because we can't
20:23 karolherbst: the heck
20:24 karolherbst: ehh...
20:27 karolherbst: mhhh
20:27 karolherbst: imirkin: any ideas?
20:31 imirkin: karolherbst: maybe L1_CONFIGURATION ?
20:31 imirkin: oh, that's not there.
20:31 karolherbst: yeah
20:31 karolherbst: only for 1.7
20:31 karolherbst: but we use 2.1
20:32 karolherbst: heh...
20:32 karolherbst: I think it's a runtime issue
20:33 karolherbst: I limited the size and now it fails on earlier addresses
20:35 krh: ajax: my sympathies
20:35 krh: ajax: I thought it was a pun on glx relay though
20:41 jekstrand: anholt: Why are you suddenly reviewing OpenCL MRs?
20:42 karolherbst: lol...
20:46 karolherbst: mhhh
20:56 karolherbst: a shader debugger would be nice now...
21:03 jekstrand: Sometimes I really want to put the alignment stuff on nir_deref_cast
21:05 anholt: jekstrand: I tend to trawl nir mrs
21:06 karolherbst: okay.. so
21:06 karolherbst: async is only broken for vec3...
21:07 jenatali: karolherbst: The spec requires vec3 to behave as if it was vec4
21:07 karolherbst: right...
21:10 karolherbst: sooo
21:10 karolherbst: we have an undefined value being the 4th component
21:10 karolherbst: and we store that
21:10 karolherbst: but that shouldn't matter as far as I can tell
21:11 jenatali: Hm, I wonder if I'm missing a cast somewhere - you should actually be reading and writing 4 components
21:11 karolherbst: I am
21:11 karolherbst: just the 4th is undefined
21:11 jenatali: It shouldn't be undefined, it should be read from the source of the copy
21:11 karolherbst: ohh, yeah, that's correct
21:12 karolherbst: just the initial write to shared I mean
21:12 karolherbst: jenatali: but that could be better...
21:12 karolherbst: soo...
21:12 karolherbst: ehhh
21:12 karolherbst: yeah...
21:13 jenatali: karolherbst: Could be better? That's what the spec requires
21:13 karolherbst: jenatali: soo... you read always a vec4
21:13 jenatali: "async_work_group_copy and async_work_group_strided_copy for 3-component vector types behave as async_work_group_copy and async_work_group_strided_copy respectively for 4-component vector types."
21:13 karolherbst: no
21:13 karolherbst: right. but that's not the issue
21:13 karolherbst: wait a moment
21:14 karolherbst: jenatali: localBuffer[ get_local_id( 0 )*copiesPerWorkItem+i ] = src[ get_global_id( 0 )*copiesPerWorkItem+i ];
21:14 karolherbst: const __global char3 *src
21:14 karolherbst: _but_
21:14 karolherbst: I see a 4 component read_global
21:14 karolherbst: that's wrong
21:14 jenatali: That's not the async code then
21:15 karolherbst: yeah, I know
21:15 jenatali: Oh ok
21:15 karolherbst: but that kernel probably traps
21:15 karolherbst: and so some part of the execution gets skipped
21:15 karolherbst: and that's probably the OOR_ADDR I see
21:15 karolherbst: but nhhh
21:15 karolherbst: ahh yeah
21:15 karolherbst: it's global memory
21:15 karolherbst: gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 1a000e [OOR_ADDR]
21:16 karolherbst: I think...
21:16 imirkin: global memory definitely doesn't get you OOR_ADDR
21:16 imirkin: there's no range to be out of.
21:17 imirkin: (oor = out of range)
21:19 karolherbst: mhhh, yeah..
21:19 karolherbst: it's odd
21:21 ajax: krh: gonna edit that pun into the doc and force-push, no one will ever know
21:21 ajax: on the plus side i think i almost understand glamor now
21:21 imirkin: karolherbst: iirc if you hit like the holes in gmem, you can end up with weird shit
21:27 karolherbst: annoying when you get the kernel channel to die...
21:42 karolherbst: ehhh...
21:42 karolherbst: that didn't went well
21:46 pmoreau: curro_: OpenCL 3.0 adds some better versioning (https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_versioning; there is cl_khr_extended_versioning for OpenCL 1.0 and above) and add new version queries that return the version as an unsigned int, or a vector of { uint version; string name; } for supported versions for example. I’m planning on changing the internals in clover to that new API.
21:48 pmoreau: curro_: Would you be okay if I used the new OpenCL types inside core/, or should I create new types to avoid pulling in the OpenCL API in core?
21:59 jekstrand: jenatali: Were you going to actually RB my mem_constant MR?
21:59 jenatali: jekstrand: Right, I should do that :) give me a few
21:59 jekstrand: karolherbst: I'd really like at least an ACK for you (RB on the clover and SPIR-V patchs) before landing anything.
21:59 jekstrand: karolherbst: I don't want to land it and break stuff underneath you.
22:00 karolherbst: which MR?
22:00 jekstrand: karolherbst: But I'm also ok not waiting for the constants-as-inputs
22:00 jekstrand: karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6379
22:00 karolherbst: maybe I should rebase and at least check if stuff breaks or not
22:01 jekstrand: karolherbst: That'd be good
22:01 jekstrand: It's kind-of a pain to regression test right now. :-/
22:01 karolherbst: yeah...
22:01 karolherbst: I still want this 0 fail test_basic :D
22:02 karolherbst: then we can at least say: as long as it doesn't regress test_basic :p
22:02 karolherbst: but that async stuff is broken in some way
22:02 jekstrand: I was more referring to the fact that getting good regression tests involves back-porting 50+ patches on top of whatever branch before testing. :-(
22:02 karolherbst: ahh
22:03 karolherbst: I just rebase my full thing on top of whater MR :p
22:03 jekstrand: The iris situation is going to improve as soon as my last Jenkins run comes back.
22:03 karolherbst: *whatever
22:03 karolherbst: but we can also land offsets now, no?
22:03 jekstrand: what do you mean by "offsets"?
22:03 jenatali: System values
22:03 karolherbst: yeah
22:03 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5891
22:04 jekstrand: yup
22:04 jenatali: Cool, I'll assign it
22:04 karolherbst: jenatali: add tags first :p
22:04 jenatali: I did?
22:04 jenatali: Unless I missed something?
22:04 karolherbst: ohh
22:04 karolherbst: didn't see that
22:04 karolherbst: gitlab is sometimes.. weird?
22:04 karolherbst: dunno
22:04 jekstrand: I see tags
22:04 karolherbst: jekstrand: I think you messed up
22:05 jekstrand: karolherbst: oh?
22:05 jekstrand: What'd I mess up?
22:05 karolherbst: constant mem is broken
22:05 jekstrand: bother
22:05 karolherbst: "Pass 46 Fails 0 Crashes 60 Timeouts 0" :/
22:05 jekstrand: uh oh
22:05 jekstrand: What's crashing?
22:05 karolherbst: let me see..
22:05 jekstrand: I'm guessing it's those new asserts jenatali asked me to add. :P
22:05 jenatali: :D
22:05 karolherbst: yeah....
22:05 karolherbst: test_basic: ../src/compiler/nir/nir_lower_io.c:1333: lower_explicit_io_deref: Assertion `addr->bit_size == deref->dest.ssa.bit_size' failed.
22:05 jekstrand: Yup, those are thoe ones!
22:06 karolherbst: let's see
22:06 jenatali: Hopefully it's catching a bug that just so happened to work before?
22:06 karolherbst: maybe
22:07 karolherbst: I shouldn't have tried with "if"...
22:07 karolherbst: hah
22:08 karolherbst: sizeof crashes as well
22:08 karolherbst: that's like 10 instructions total
22:08 karolherbst: mhh
22:08 karolherbst: vec1 32 ssa_5 = load_const (0x00000000 /* 0.000000 */)
22:08 jekstrand: What nir_variable_mode is it dying on?
22:09 karolherbst: shader_in
22:09 jekstrand: Figured
22:09 karolherbst: yeah... so
22:09 karolherbst: it wants to set a 64 bit address on load_kernel_input
22:09 karolherbst: mhhh...
22:09 jekstrand: Only because you're giving it a 64-bit addr_format
22:10 karolherbst: right..
22:10 karolherbst: actually
22:10 karolherbst: nir_address_format_32bit_offset
22:10 jekstrand: maybe spirv_to_nir is building 64-bit derefs?
22:10 karolherbst: yep, it is
22:11 karolherbst: guess we should use nir_address_format_32bit_offset_as_64bit instead
22:11 jekstrand: That would work
22:13 karolherbst: jekstrand: mind fixing that in your commit? it changes 8f7784ee8da3
22:13 karolherbst: ehhh
22:14 karolherbst: use spirv_options.temp_addr_format instead of nir_address_format_32bit_offset for nir, nir_lower_explicit_io, nir_var_shader_in,
22:14 karolherbst: I think that should be alright
22:14 jekstrand: That seems wrong
22:14 karolherbst: I mean sure.. but you can also add another if 64 bit else thing
22:15 jekstrand: Yeah....
22:15 jekstrand: hrm.
22:15 jekstrand: bah
22:15 karolherbst: the thing is just, that with libclc we move the construction of the struct into its own function :/
22:15 karolherbst: so it becomes a bit annoying
22:15 karolherbst: anyway
22:16 karolherbst: another test run
22:16 jekstrand:is really tempted to put address formats in nir_shader_compiler_options
22:17 jekstrand: Or maybe straight in nir_shader.
22:17 karolherbst: somebody will cmplain that the shader cache takes 0.5% space more then :p
22:17 jekstrand: :P
22:17 jekstrand: I just deleted 5 linked lists from nir_shader; I think I have the right. :P
22:18 karolherbst: :D
22:18 karolherbst: anyway, with that fixed: "Pass 103 Fails 1 Crashes 2 Timeouts 0"
22:18 jekstrand: Ok, cool.
22:18 karolherbst: now what's up with vec3 and async
22:18 anholt: jekstrand: not sure if I thanked you before, but I love the single list.
22:18 jekstrand: anholt: I'm glad. I like it too.
22:18 jekstrand: I like it more every time I add an address mode. :)
22:18 jekstrand: variable_mode, rather
22:19 karolherbst: :D
22:19 karolherbst:thinks there are too many references of nir_variables in codegen
22:20 jekstrand: karolherbst: brw_fs_nir doesn't know what a nir_variable is
22:20 karolherbst: yeah...
22:20 karolherbst: we need it for input/output slot assignment :/
22:21 karolherbst: but I already cleaned some stuff up recently
22:21 karolherbst: getting rid if nir_deref_instr was once
22:22 anholt: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6329 is the NIR improvements mr I can't wait for right now
22:22 karolherbst: I think the next bigger rework I would like to do is to move to byte addressing all the way
22:23 karolherbst: It slowly feels like that working on all that CL support isn't a burden to nir, but actually made it more sane :D but maybe that's just a coincedence and most of the changes would have happened anyway
22:25 jenatali: karolherbst: Offsets/system values is merged, feel free to rebase and drop that from your private branch
22:25 karolherbst: nice
22:25 jekstrand: karolherbst: Just added a clover patch for you
22:27 karolherbst: cool
22:28 karolherbst: somehow I am doing it wrong...
22:28 jekstrand: karolherbst: ?
22:28 karolherbst: remeber that I said I was down to 72 patches?
22:29 karolherbst: now that system values + offsets are merged and I rebased, I am still at 72 :D
22:29 jenatali: :D
22:29 karolherbst: I should really flush stuff out into a new MR
22:29 jenatali: jekstrand: I really like the variable list refactoring, except that it's really difficult to cherry-pick into our fork without just doing a full rebase, which is also a ton of work :(
22:30 karolherbst: jenatali: you have to rebase sooner or later anyway :p
22:30 jekstrand: jenatali: Yeah....
22:30 karolherbst: that's how people stay at llvm 3.9
22:30 karolherbst: :p
22:30 jenatali: karolherbst: Yeah, I know, we'll get there eventually... it'll be easier once our patches to common code are upstream
22:31 jekstrand: That's the exact reason why I landed it right before a release branch-point. I didn't want backport hell for stable fixes. :)
22:31 karolherbst: jenatali: that's why they are saying as well :p
22:31 jenatali: :P
22:31 karolherbst: anyway... vec3 async
22:32 jenatali: We're close... I'm hopeful that within another few weeks we can get to the point where the only things downstream are a chunk of our CLC frontend, and maybe libclc if we're not ready to merge that yet
22:32 karolherbst: yeah.. hopefully
22:32 karolherbst: huh...
22:32 karolherbst: mhhh
22:33 karolherbst: mhhhh
22:33 karolherbst: instead of a shl a 0x2 I see a shl a 0x5
22:33 karolherbst: that could mess things up
22:33 karolherbst: especially when used in a load_shared
22:33 karolherbst: 0x2 with vec4 vs 0x5 with vec3
22:34 karolherbst: and that's essentially the only difference between the kernels
22:34 karolherbst: mhhh
22:34 jenatali: That seems... wrong
22:34 karolherbst: align use on NPOT maybe?
22:34 karolherbst: let's see...
22:35 karolherbst: at least I know what to look out for now
22:38 karolherbst: ahhhh
22:38 karolherbst: something cases tu ulong4
22:38 karolherbst: *casts
22:38 karolherbst: but the tests uses uchar4
22:39 karolherbst: jenatali: does that ring a bell?
22:39 jenatali: That... could be a mangling problem
22:39 jenatali: Give me a sec
22:39 jenatali: Yep, I see the bug
22:39 karolherbst: :) cool
22:41 jenatali: Pushed - the replace_vector_type should be done on src_types[i]->deref->type, not src_types[i]->type
22:41 karolherbst: :)
22:41 karolherbst: PASSED test.
22:42 karolherbst: okay...
22:42 karolherbst: new run :)
22:44 jenatali: :D
22:44 jenatali: Sorry about that - too many patches in flight
22:45 karolherbst: ... uuiuiui
22:45 karolherbst: please...
22:45 jenatali: ?
22:46 karolherbst: yay
22:46 karolherbst: "Pass 106 Fails 0 Crashes 0 Timeouts 0" \o/
22:46 jenatali: \o/
22:46 jekstrand: karolherbst: \o/
22:47 jekstrand: karolherbst: Is that with CLC?
22:47 karolherbst: yeah
22:47 jekstrand: neato
22:47 karolherbst: it was required to fix those async fails
22:47 jekstrand: Right
22:47 jenatali: I mean we probably could've written it in nir, but the libclc version is there so why not use it :D
22:48 jekstrand: I just need to get over my fear of __builtin so we can land CLC. :-P
22:48 karolherbst: jenatali: I doubt it
22:49 karolherbst: at this point you wouldn't be able to use the nir cfg helper
22:49 karolherbst: so you have to construct unstructured blocks
22:49 jenatali: Ooh...
22:49 karolherbst: .. yeah...
22:49 karolherbst: guess why I use bcsel for fdiv/fsqrt
22:49 karolherbst: ...
22:49 jenatali: Yeah, ok, I didn't think about that, I'm glad libclc has implementations of these
22:49 karolherbst: main reason I want to move the lowering out of vtn :D
22:49 karolherbst: but...
22:49 karolherbst: a mul is probably cheaper than if
22:50 karolherbst: but you could also predicate...
22:50 karolherbst: it's... annoying
22:50 karolherbst:sometimes wished nir could predicate instructions
22:50 jekstrand: karolherbst: No, you don't wish for that
22:50 karolherbst: :D
22:50 karolherbst: maybe I do
22:51 jekstrand: No, you really don't.
22:51 karolherbst: I don't think it would be _that_ bad
22:51 karolherbst: it's really just a special source on instructions
22:51 karolherbst: and if it's required to be a bool I could move more optimiations into nir
22:51 jekstrand: And something that looks like a phi node afterwards. :)
22:51 karolherbst: and actually use bool types
22:51 jenatali:is glad it can't
22:51 jekstrand: Because SSA+predication is a disaster
22:52 karolherbst: mhhhhhh....
22:52 karolherbst: we can work it out :p
22:52 karolherbst: you know what we have in codegen for that?
22:52 jekstrand: Nope
22:52 karolherbst: OP_UNION, which is like phi, just without being a phi :p
22:53 jekstrand: cwabbott and I have been talking about how we'd do predication in SSA for as long as we've been working on NIR.
22:53 karolherbst: so it declares sources, and only one is actually set
22:53 karolherbst: one has a true predicate, all others false
22:53 jekstrand: I've thought about doing things in the back-end such as, if something's only used by a bcsel, predicating it on the bcsel condition.
22:53 karolherbst: so it's like phi, just for predication instead of cfg
22:54 jekstrand: Oh, yeah, it's doable.
22:54 karolherbst: the thing is.. we can predicate everything
22:54 jekstrand: There are quite a few papers and other examples of predication with SSA
22:54 jekstrand: We can also predicate everything
22:54 jekstrand: And so can AMD
22:54 karolherbst: yeah...
22:55 karolherbst: it might make sense and it might be doable in a non messy way and it's probably work
22:55 karolherbst: a lot
22:55 karolherbst: but it might be worth it?
22:55 karolherbst: dunno
22:55 jekstrand: For us, at least, though, it performs worse.
22:55 karolherbst: really?
22:55 jekstrand: Because predicated instructions take more cycles than non-predicated.
22:55 karolherbst: that's odd
22:55 karolherbst: uhu...
22:55 karolherbst: your predicationg is busted :p
22:55 jekstrand: So unless you're predicating something really expensive like a texture lookup, it's just not worth the bother.
22:56 karolherbst: for us it's for free
22:56 jekstrand: Do you know that? Have you benchmarked it with enough granulatiry to be sure?
22:56 karolherbst: nvidia uses it wherever they can :p
22:56 karolherbst: I also saw a block of 50 instructions predicated :D
22:56 jekstrand: Oh, predication is definitely cheaper than an if
22:56 karolherbst: yeah
22:57 karolherbst: but it's also used on alu instructions a lot
22:57 karolherbst: essentially for small ifs all the time
22:57 jekstrand: I wonder if the predication actually reduces latency
22:57 karolherbst: the bigger the ifs get the more likely it is for nvidia to use jumps
22:57 karolherbst: it keeps thread converged
22:57 jekstrand: Sure
22:57 karolherbst: so you don't have to converge them manually later
22:57 jekstrand: We do that too, we just don't bother to predicate
22:58 jekstrand: Because why
22:58 karolherbst: jumps are not for free?
22:58 jekstrand: I guess predicating would let us drop the bcsel at the end
22:58 karolherbst: yeah
22:58 jekstrand: Oh, I mean we flatten ifs
22:58 jekstrand: We just don't bother to predicate them
22:58 karolherbst: ahh
22:58 jekstrand: We just bcsel at the end
22:58 karolherbst: so you bcsel.. mhh
22:59 karolherbst: I never actually tried, but we just trust nvidia doing the right thing
22:59 jekstrand: There are sure
22:59 jekstrand: Especially on older nvidia hardware, divergence was death. I don't think it ever rejoined.
22:59 jekstrand: I'm not sure where they got reconvergence
23:00 jekstrand: Ours has always re-converged fine so the only real reason to flatten ifs is scheduling
23:00 jekstrand: Well, that, and the 16-deep limit on our call stack on Gen4-5
23:01 HdkR: Volta+ you can now even force reconvergence with sync now if the hardware doesn't happen to :P
23:01 karolherbst: jekstrand: do you have "clover/spirv: Don't call llvm::regularizeLlvmForSpirv" in an MR somewhere?
23:02 jekstrand: karolherbst: I thought I landed that
23:02 karolherbst: I still have it in my branch
23:02 jekstrand: karolherbst: I guess not. No, it's not in an MR anywhere
23:02 jekstrand: Let me make one quick.
23:02 jekstrand: I've got another one or two that should land
23:02 karolherbst: ahh
23:02 karolherbst: I will pick it up anyway, planning to do another MR with random stuff
23:06 jenatali: jekstrand, karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6306
23:06 jekstrand: karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6432
23:06 karolherbst: ahhh
23:06 karolherbst: jekstrand: r-by me and marge 6306 :p
23:07 karolherbst: jekstrand: let me test the O0 patch in 6432 :D
23:09 jekstrand: karolherbst: Yeah, without -O0, LLVM generates some truely fantastic code.
23:09 karolherbst: I can imagine
23:10 karolherbst: but maybe if doesn't produce switches anymore then :(
23:10 karolherbst: would be quite sad to break switches :D
23:10 jekstrand: How would that be a bad thing?
23:10 jekstrand: I mean, I'm all for good test cases but yeah....
23:11 karolherbst: anyway.. we probably want to land the spirv support for clover
23:11 jekstrand: "optimizing" ifs to switches isn't a good thing in general.
23:11 karolherbst: and then we can unit test or test spirvs or wahtever
23:11 jekstrand: Yeah
23:11 jekstrand: That's a much better plan
23:11 jekstrand: That should be pretty close to free modulo passing tests
23:13 jekstrand: daniels: Thanks for all your CI fixing. I've been marging a lot today and it's going really smooth once again! \o/
23:14 Lyude: *homer voice* marge...
23:15 imirkin: well marge, i don't want to lie to you. bye!
23:15 jekstrand: Someone needs to change Marge's avatar....
23:16 karolherbst: jekstrand: :D you asked for it, didn't you? :D
23:16 jekstrand: karolherbst: Asked for what?
23:16 pmoreau: karolherbst: Re “we probably want to land the spirv support for clover”: currently working on rebasing the series on top of master, cause that hasn’t been done in a while and was also needed after !5038.
23:16 karolherbst: jekstrand: "I couldn't merge this branch: Gitlab refused to merge this request and I don't know why!" :p
23:17 mattst88: jekstrand: I'll happily give up my "rage guy" avatar if someone promises to use it for marge :D
23:17 karolherbst: I can do an avatar like I did for envytools, just with the mesa logo :p
23:17 karolherbst: mhh
23:17 mattst88: I think it would be particularly funny in conjunction with some of marge's errors :)
23:17 karolherbst: but black background is annoying
23:17 jekstrand: karolherbst: Yeah, that one's my fault. I force-pushed with your RB after I'd assigned marge. That usually confuses things.
23:17 karolherbst: ahhh
23:17 karolherbst: I see
23:18 jekstrand: I tried to reset by unassigning and re-assigning but I guess it didn't work
23:19 HdkR: `[drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:66:crtc-2] flip_done timed out` One of these days I'll spend some time figuring this problem out. Today is not this day :P
23:19 jekstrand: karolherbst: How's -O0 working for you?
23:20 karolherbst: ehhh
23:20 karolherbst: not good
23:20 jekstrand: ?
23:20 karolherbst: one test crashed
23:20 karolherbst: ahh mhh
23:20 karolherbst: I don't handle pack_32_2x16 in nouveau
23:20 karolherbst: *sigh*
23:20 jenatali: nir_lower_pack?
23:20 karolherbst: I thought I enabled all of those :p
23:21 jenatali: It's a pass, not an option, IIRC
23:21 karolherbst: we also have options
23:21 jekstrand: Yeah but you may need the whole pass
23:22 pmoreau: Grrr, hitting that bug again in the CTS where it ignores the last reported extension… I guess I should send a patch since the issue has stayed opened since May without any updates.
23:22 karolherbst: :/
23:24 jekstrand: pmoreau: That's aggrivating...
23:25 chrisf: that is generally how things get fixed in the CTS though
23:25 chrisf: pmoreau: ticket number?
23:26 pmoreau: https://github.com/KhronosGroup/OpenCL-CTS/issues/790
23:26 jekstrand: pmoreau: Typically, the way it works is that you get annoyed with how long the issue has been open, file an MR or gerrit, that gets ignored for a month or two, and then someone else comes along and makes a comletely different fix which gets landed inside of 3 days.
23:26 chrisf:relieved that it's CL
23:26 jenatali: Sounds about right
23:27 jenatali: Still waiting for this one to land: https://github.com/KhronosGroup/OpenCL-CTS/pull/827
23:27 karolherbst: jekstrand: ehhh.. but lower_pack loweres everything :/
23:27 jenatali: Not sure if this is Windows-specific or if just nobody supports RGB 32bpc images
23:27 pmoreau: Yeah :-/
23:28 chrisf: does CL CTS have a single gatekeeper like VK-GL-CTS?
23:28 jekstrand: jenatali: That one looks like it's blocked on you though.
23:28 jenatali: jekstrand: I pushed changes 25 days ago
23:28 pmoreau: jekstrand: It feels like you suffered from that a few times at least. :-/
23:28 jekstrand: jenatali: Oh, so you did.
23:28 jekstrand: pmoreau: Yeah....
23:29 pmoreau: chrisf: There are usually two persons doing reviews, so almost double the throughput! :-D
23:32 karolherbst: mhhhh...
23:33 karolherbst: guess I just implement it then
23:34 jekstrand:runs upstream clover on iris just for grins
23:35 imirkin: benchmark it against the avx512 :)
23:36 jenatali: I'm still pretty excited about karolherbst's full pass of test_basic :)
23:36 jekstrand: imirkin: Can I test avx512 buy running clover on iris on ksim?
23:36 imirkin: is there anything else it's good for?
23:36 imirkin: i assumed that was why they added it
23:36 jekstrand: avx512? Totally. They just didn't think krh512 would work for marketing
23:37 jekstrand: karolherbst: Pass 89 Fails 5 Crashes 7 Timeouts 5
23:37 jekstrand: karolherbst: That's upstream :D
23:37 jenatali: Nice!
23:37 jekstrand: Well, plus my 3 patches to enable clover
23:37 jekstrand: We've made some serious headway this week.
23:38 karolherbst: ehhh...
23:38 karolherbst: those pack instructions all annoying except the 64_2x32 ones :p
23:39 jekstrand: karolherbst: depends on how much hardware support you have for different types. :)
23:39 karolherbst: not much
23:39 jekstrand: karolherbst: pack16 is just (int32)y << 16 | x
23:39 karolherbst: sure
23:39 karolherbst: we just have 32 bit registers
23:40 jekstrand: Yeah
23:40 karolherbst: best part is this: vec2 = load_global, vec1 = pack
23:40 karolherbst: ....
23:40 karolherbst: right...
23:41 daniels: jekstrand: thanks! super glad it's helped :)
23:41 jekstrand: daniels: It's so much better.
23:41 karolherbst: ehhh...
23:41 karolherbst: now that's just broken
23:41 karolherbst: I run nir_lower_pack
23:41 karolherbst: that generates pack_32_2x16_split
23:41 jekstrand: daniels: No more babysitting and checking an hour later to see if it worked. Just assign and walk away. Just like it's supposed to be. :D
23:41 karolherbst: ohh ehh
23:41 jenatali: karolherbst: There's algebraic opts for lowering the pack_split variants
23:41 karolherbst: :/
23:42 jenatali: To bitmasks and shifts
23:42 karolherbst: yeah..I guess I should turn on lower_pack_32_2x16_split
23:42 karolherbst: I thought there were all turned on
23:42 jekstrand: Yeah options->lower_pack_split should do it as long as you run nir_lower_packing to get _split ones first
23:42 daniels: jekstrand: luckily it'll stay that way forever with no further maintenance required!
23:42 jekstrand: Packing is such a mess.
23:42 karolherbst: yeah
23:43 jekstrand: daniels: How many beers do you want to bet on that?
23:43 jekstrand: daniels: Just ask craftyguy; I have a way with CI systems....
23:44 daniels: jekstrand: a negative number? don't want to have to be fixing CI and _buying_ beers as well; prefer a win-win where either it all works with no work, or I do work and then beers arrive
23:49 jekstrand: daniels: That's fair. :)
23:49 karolherbst: ehhh
23:49 karolherbst: nir_lower_bool_to_int32 returns always true... :/
23:50 jekstrand: karolherbst: Uh... That sounds wrong
23:50 jekstrand: You'd think that a pass which handles bools would....
23:50 jekstrand: karolherbst: It looks like it has correct tracking to me
23:50 karolherbst: I'll check
23:53 jekstrand: Ugh... Why is clover so lazy?
23:53 karolherbst: :D
23:53 jekstrand: It has set_compute_resources() just so it doesn't have to create image views
23:54 karolherbst: jekstrand: ufff...
23:54 karolherbst: lower_alu_instr is wrong
23:54 jekstrand: karolherbst: how?
23:55 karolherbst: op == mov
23:55 karolherbst: then it just falls to the end
23:55 jekstrand: It still changes the bit-size. THat's a change.
23:55 karolherbst: no
23:55 karolherbst: it doesn't
23:55 jekstrand: Oh....
23:55 karolherbst: yeah...
23:55 jekstrand: I see it now.
23:55 jekstrand: We need an actual progress bool
23:55 karolherbst: I'll fix it :)
23:55 jekstrand: Or something
23:56 jekstrand: Good catch
23:56 karolherbst: mhhhh
23:57 karolherbst: I don't like my solution, but it works
23:57 jekstrand: karolherbst: Oh?
23:58 jekstrand: karolherbst: Just do "if (alu->dest.dest.ssa.bit_size > 1) return false;" in the mov/vec case.
23:58 karolherbst: nope
23:58 karolherbst: you'll see
23:58 jekstrand:waits with bated breath
23:59 karolherbst: jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6433/diffs?commit_id=720b9573cf09eae1f29ac47eceb7b6baf391589c
23:59 jekstrand: karolherbst: Oh, that is gross
23:59 karolherbst: yep
23:59 jekstrand: I like my if a lot better
23:59 karolherbst: yeah... if I move it up, maybe
23:59 jekstrand:writes a patch