02:57 jekstrand: airlied, karolherbst: What would you think about wiring clover into llvmpipe?
02:57 jekstrand: Mostly so that we can have some NIR OpenCL thing in CI.
02:57 jekstrand: Don't need to run the whole OpenCL CTS or anything but it'd be good to have some coverage so I don't break the universe on people.
04:43 airlied: jekstrand: already done :-)
04:44 airlied: jekstrand: build debug build, LP_DEBUG=cl
04:46 airlied: just need to think about wiring it up to CI, at least the piglit tests
05:58 airlied: jekstrand: a quick run shows it's barely working at the moment, not sure if it was recent breakage :-P
06:11 airlied: jekstrand: seeing a fair amount of "build_explicit_io_load: Assertion `addr_format_is_offset(addr_format)' failed. "
07:19 MrCooper: bnieuwenhuizen: to be clear, I'm not saying everybody should test clang/... before creating an MR; just that when CI catches a problem, it should be investigated and fixed locally as much as possible
07:19 MrCooper: in particular, Marge is not there for bouncing off iterative fixes
07:46 tomeu: anholt: just stared at the job logs and code, and got lucky at the second try or so
07:46 tomeu: anholt: do you have logs?
07:48 baedert: I have a user who reported a shader using too many input components. I tried reproducing but radeonsi_dri.so just crashes on glLinkProgram. Is radeonsi known to have missing error reporting in these cases? How would I reliably implement shader fallback then?
07:56 MrCooper: it's probably just a bug
08:04 baedert: ok, I'll try to get some debugging symbols
09:39 karolherbst: jekstrand: yeah.. without the structurizer it's quite pointless to do at the moment :)
09:55 bbrezillon: jekstrand: did you push this "vector insert/extract" branch to your repo?
12:48 karolherbst: jekstrand: huh.. I now get load_scratch_base_ptr in combination with load_kernel_input :/
12:49 karolherbst: but I guess I could just treat it as a 0?
12:51 karolherbst: but I don't think we want to have that for shader_in :)
12:54 karolherbst: or maybe that's fine.. mhh
13:15 alyssa: what's the story with genxml vs envytools vs roll-your-own?
13:16 alyssa: amd/registers is a 3rd
13:17 karolherbst: alyssa: I guess people never agreed on one system
13:18 alyssa: karolherbst: panfrost doesn't even have a system, we just have handcoded packed bitfields in a big .h *sweats*
13:18 karolherbst: alyssa: I am thinking about writing a python parser for header files we get from nvidia ... :/
13:18 karolherbst: I think it's all a bit messy at this point
13:18 alyssa: woof
13:18 alyssa: v3d seems to be an in-tree fork of genxml
13:19 tzimmermann: are there rules-of-thumb for crashing a gpu?
13:19 alyssa: tzimmermann: don't do it ideally? =p
13:19 karolherbst: tzimmermann: invalid pointers usually do the trick
13:19 tzimmermann: alyssa, he he
13:19 karolherbst: or did you mean a real crash?
13:22 tzimmermann: i'm bug-hunting within ast and did frequent modesets to test the driver. at some point the display froze and dmesg was full of messages about invalid edid data. the hw cursor still worked. after ~10 sec the screen came back to normal. i guess the gpu had an internal error and restarted itself
13:26 tzimmermann: never happened before
13:37 TheRealJohnGalt: I have a single application that on all vulkan drivers (radv/aco, radv/llvm, and amdvlk) ends up eating vram and it's never freed until a reboot. What would the best way to debug this be?
13:43 danvet: alyssa, karolherbst genxml iirc was originally meant to use the .xml files intel uses internally
13:44 danvet: with the goal to outright publish these
13:44 danvet: but never worked out, so we had to handwrite them still :-/
13:45 danvet: karolherbst, yeah from amd/intel it seems impossible to get out the actual raw data used to generate headers and decoders and all that
13:54 MrCooper: TheRealJohnGalt: even restarting Xorg doesn't free it? If so, sounds like a leak in the kernel
13:56 TheRealJohnGalt: MrCooper: well usually afterward I can't even start X back up. But when I can, yes.
13:56 danvet: TheRealJohnGalt, /proc/meminfo as a first step
13:56 MrCooper: TheRealJohnGalt: yes what?
13:56 danvet: also /proc/slabinfo might show where this all went
13:56 TheRealJohnGalt: Oh yes it doesn't free it.
13:57 TheRealJohnGalt: And I'll check, thank you.
13:57 danvet: just watch these and observe which number goes up
13:57 danvet: compared to not running that program
13:58 danvet: might also be worth it to recompile the entire kernel with kmemleak, but might also be a logical leak somewhere
14:00 alyssa: danvet: understood.
14:00 alyssa: also, if I check in Rust into Mesa how quickly will I have my fd.o credentials revoked? =P
14:01 danvet: I think there's a bunch of people who think rust is a lot better idea than c++
14:01 danvet: and we have c++ :-P
14:01 alatiera: rust works wonders in gstreamer ^^
14:02 danvet: alyssa, another thing with generated headers is that sometimes the raw .xml from hw engineers is a bit garbage for writing a multi-generation driver
14:02 danvet: e.g. intel people love renaming stuff way too much, but only for the next chip
14:02 alyssa: Heh
14:02 danvet: that was another reason we kinda gave up the direct route
14:04 TheRealJohnGalt: danvet: will do, I appreciate the help.
14:04 alyssa: danvet: I happen to not like xml okay? ;P
14:05 TheRealJohnGalt: Also, would there be a chance of a game-specific workaround being merged allowing barrier in nested control flow in glsl? Or is this just too hacky, since it's out of spec, to ever be considered for merging?
14:06 danvet: alyssa, well you only suffer that if your source material is xml
14:06 alyssa: danvet: [insert expanding brain meme about RE here]
14:28 jekstrand: karolherbst: Yeah, treat it as 0
14:28 jekstrand: karolherbst: Or use offset32 instead of global32
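A minimal sketch of the "treat it as 0" option as a NIR lowering, assuming scratch is addressed as a plain offset so the base pointer is zero; the pass and helper names here are made up, and the exact nir_shader_lower_instructions / nir_ssa_def API has shifted slightly between Mesa versions:

    #include "nir.h"
    #include "nir_builder.h"

    static bool
    is_scratch_base_ptr(const nir_instr *instr, const void *data)
    {
       return instr->type == nir_instr_type_intrinsic &&
              nir_instr_as_intrinsic((nir_instr *)instr)->intrinsic ==
                 nir_intrinsic_load_scratch_base_ptr;
    }

    static nir_ssa_def *
    lower_scratch_base_ptr(nir_builder *b, nir_instr *instr, void *data)
    {
       nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
       /* With an offset-based scratch address format the base pointer is 0. */
       return nir_imm_zero(b, intrin->dest.ssa.num_components,
                           intrin->dest.ssa.bit_size);
    }

    static bool
    lower_scratch_base_ptr_to_zero(nir_shader *shader)
    {
       return nir_shader_lower_instructions(shader, is_scratch_base_ptr,
                                            lower_scratch_base_ptr, NULL);
    }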
14:29 karolherbst: jekstrand: for temp memory, right?
14:29 karolherbst: I think I'll go with that indeed... although mhh
14:29 karolherbst: oh well..
14:29 karolherbst: let's see where that breaks
14:31 karolherbst: ehh
14:31 karolherbst: I see the problem
14:34 jekstrand: karolherbst: yup.
14:34 jekstrand: karolherbst: I didn't think global32 or global64 worked before so I'm confused how I broke them.
14:34 jekstrand: Unless you have patches somewhere.
14:35 karolherbst: it's on top of master
14:35 jekstrand: karolherbst: This also lets you, if you want to, calculate an actual 64-bit scratch base pointer and then scratch becomes global memory and generic pointers "just work" apart from shared.
14:35 karolherbst: I'll show you the patch
14:36 karolherbst: jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6059
14:36 karolherbst: I think that's how it should look like in the end
14:36 karolherbst: no idea if there are issues with that
14:37 jenatali: karolherbst: We use global pointers for kernel inputs
14:37 karolherbst: jenatali: mhh.. why though?
14:38 karolherbst: they are more like uniforms
14:38 jenatali: Hmm
14:38 karolherbst: jenatali: also.. _no_ application will rely on more than 6 const buffers anyway :p
14:38 karolherbst: (CL demands at least 8, but using more than 6 leads to perf issues)
14:39 jenatali: Yeah, I don't feel strongly about that I guess
14:39 karolherbst: well, using const buffers should be faster :p
14:39 jenatali: We can always convert it after the lower_io pass
14:39 karolherbst: even if it doesn't really matter all that much
14:40 karolherbst: jenatali: couldn't you just back the storage by an UBO?
14:40 karolherbst: or normal uniforms or whatever that is in dxil?
14:40 jenatali: karolherbst: We do back it by a UBO
14:40 karolherbst: ahh
14:40 jenatali: Yeah a constant buffer
14:40 karolherbst: okay.. so with that patch it would actually be closer to what you need in the end?
14:41 jenatali: Eh, IIRC we convert it to 32bit index/offset indexing
14:41 jenatali: Let me re-read what we do with them :P
14:41 karolherbst: :D
14:41 karolherbst: but index/offset stuff should still work I guess
14:42 karolherbst: anyway.. I tested a struct passed by value as kernel input, and it seems to work just fine
14:42 jenatali: Yeah, we use 32bit global but it's essentially an offset anyway, so your patch is fine
14:42 karolherbst: cool
14:43 karolherbst: I got all of that working before any of this was in as sane a state as it is today :p
14:43 karolherbst: :D
14:43 karolherbst: but it's also a nice perf opt for me, less 64 bit int math
14:44 karolherbst: jenatali: do you actually have any idea on how to support 64 bit pointers?
14:44 jenatali: Ah ok, we do a stupid thing - we lower_explicit_io on shader_in as 32bit global, then do another lowering pass to convert those instructions to ubo loads with the "global" ptr as the offset, then do another lower_explicit_io on mem_ubo
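In Mesa terms such staged lowering boils down to repeated nir_lower_explicit_io calls with a per-mode address format; a generic sketch of the shape (the actual variable modes and formats are the driver's choice, not necessarily these):

    /* shared memory as plain 32-bit offsets, global memory and UBOs as
     * 32-bit index + offset pairs; kernel inputs end up backed by a UBO. */
    NIR_PASS_V(nir, nir_lower_explicit_io, nir_var_mem_shared,
               nir_address_format_32bit_offset);
    NIR_PASS_V(nir, nir_lower_explicit_io, nir_var_mem_global,
               nir_address_format_32bit_index_offset);
    NIR_PASS_V(nir, nir_lower_explicit_io, nir_var_mem_ubo,
               nir_address_format_32bit_index_offset);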
14:45 jenatali: karolherbst: What do you mean?
14:45 karolherbst: jenatali: well.. 32 bit is a bit limiting
14:45 jenatali: Yep
14:45 karolherbst: and I doubt you can just do 64 bit pointers
14:45 jenatali: What do you mean "just do" 64 bit pointers?
14:45 jenatali: Our backend is using 64bit pointers
14:46 karolherbst: ohh, really?
14:46 karolherbst: I thought there is no way of using 64 bit pointers with dx at this point?
14:46 jenatali: Psh, we don't even *have* pointers
14:46 jenatali: Which is why we need 64bit CL pointers, so we have enough bits to store index/offset
14:46 karolherbst: right...
14:47 karolherbst: mhhh
14:47 karolherbst: annoying
14:47 jenatali: A bit :P
14:47 jenatali: I'm still not sure what your question was - was there a specific place you were seeing problems with 64bit pointers?
14:47 karolherbst: ahh no, just I was generally wondering what your plans are on how to support that properly
14:48 karolherbst: like if you have a 24GB buffer bound
14:48 jenatali: Heh, no we don't support 24GB buffers
14:48 karolherbst:has a GPU with 24GB VRAM waiting in the office
14:48 karolherbst: :p
14:48 jenatali: But with 32bit pointers, we wouldn't even be able to get to 2GB, since we'd need to shave some bits to store a buffer index
14:49 karolherbst: mhh
14:49 jenatali: And then you trade off how many buffers vs how big they can be
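The trade-off is simply how the 64 bits get split between a buffer index and a byte offset. A hypothetical illustration with a 32/32 split and made-up helper names:

    #include <stdint.h>

    /* Fake "pointer" for an API with no real pointers: the high bits pick a
     * bound buffer, the low bits are a byte offset into it. */
    static inline uint64_t pack_ptr(uint32_t buf_index, uint32_t byte_offset)
    {
       return ((uint64_t)buf_index << 32) | byte_offset;
    }

    static inline uint32_t ptr_buf_index(uint64_t ptr) { return ptr >> 32; }
    static inline uint32_t ptr_offset(uint64_t ptr)    { return (uint32_t)ptr; }

Reserving fewer index bits raises the per-buffer limit at the cost of how many buffers can be addressed, which is the tweak bbrezillon suggests below.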
14:49 karolherbst: yeah.. I am just sure at some point applications want to use bigger buffers
14:50 jenatali: CL's minimum spec'd max buffer size is 1GB
14:50 karolherbst: huh?
14:50 karolherbst: I am sure that's not true
14:50 karolherbst: ohh.. minimum
14:50 karolherbst: well
14:50 karolherbst: yeah
14:50 jenatali: https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_DEVICE_MAX_MEM_ALLOC_SIZE
14:51 bbrezillon: jenatali: couldn't we just tweak the index/offset split if that ever becomes an issue?
14:51 bbrezillon: I mean, we have 32bit for the index right now
14:51 jenatali: bbrezillon: Yeah the 64bit pointers aren't the limitation honestly, there's other places where 32bit is assumed in the API/driver stack
14:51 bbrezillon: it shouldn't be too complicated to only reserve 8bits and keep the rest for the offset
14:51 bbrezillon: oh, ok
14:53 karolherbst: bbrezillon: I guess the tricky bits are overflows into the next ssbo buffer
14:53 karolherbst: but I guess even that is somehow manageable
14:53 karolherbst: is there a limit on the amount of ssbo buffers?
14:54 jenatali: karolherbst: Yeah, which is why we added dedicated address formats that do the math in 32bits as part of https://gitlab.freedesktop.org/kusma/mesa/-/merge_requests/83/commits - one of the ones I need to prep for upstream
14:54 karolherbst: okay, cool
14:54 jenatali: karolherbst: Uh... some old hardware that'll never run our CLOn12 layer only supports 8
14:54 jenatali: Almost all supports 64, and a bunch has no limit
14:55 karolherbst: okay
14:56 karolherbst: well.. nv has no native support for that anyway :D
14:56 jenatali: For that?
14:56 karolherbst: ssbos
14:57 karolherbst: so we have to store the size/address of ssbos in the driver const buffer and fetch + bound check at runtime
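Per access that amounts to something like the following sketch; the descriptor layout and the return-0 out-of-bounds behaviour are illustrative, not nouveau's actual generated code:

    #include <stdint.h>

    struct ssbo_desc {
       uint64_t address;   /* GPU VA of the buffer, from the driver const buffer */
       uint32_t size;      /* buffer size in bytes, for the bounds check */
    };

    static uint32_t load_ssbo_u32(const struct ssbo_desc *d, uint32_t offset)
    {
       if (offset + 4 > d->size)
          return 0;         /* out-of-bounds reads yield a defined value */
       return *(const uint32_t *)(uintptr_t)(d->address + offset);
    }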
14:57 jenatali: Ah
15:11 karolherbst: jenatali: I think I will finish the offset stuff quickly.. that's probably not a lot of work
15:11 karolherbst: uhh.. maybe I should fix 64 bit pointers first with clover
15:21 jenatali: karolherbst: I still need to send the nir/spirv bits of https://gitlab.freedesktop.org/kusma/mesa/-/merge_requests/83 upstream, but I was going to try to grab jekstrand's !5278 into our downstream fork first, since I think it'll affect how that MR ends up looking upstream and I want to be able to test it
15:22 karolherbst: jenatali: ohh. I was more talking about gallium itself
15:22 karolherbst: the interfaces right now are.. 32 bit only
15:22 jenatali: Oh
15:22 karolherbst: so even if you allocate a buffer with a 64 bit address gallium just messes up
15:22 karolherbst: the interface is a bit stupid :/
15:23 jenatali: Got it. Yeah that's what I was saying when I mentioned 64bit pointers weren't our limitation :P
15:23 jenatali: We've got similar problems carried forward from older D3D versions
17:17 jenatali: jekstrand: If I've got nir passes/changes that are used by our CL compiler, but would be unused by existing gallium drivers (I guess clover could pick them up if they wanted?), do you think it would make sense to get those reviewed/landed ahead of the full CL compiler? Or as part of it?
17:38 karolherbst: jenatali: what passes? might be easy to just wire them up in clover :p
17:40 jenatali: printf, splitting 64bit phis, memcpy, extensions to lower_explicit_io for the address modes we're using, adding alignment info through vtn and lower_explicit_io, etc
17:40 jenatali: There might be more, those are the ones in our tracking list that quickly jump out :P
17:43 karolherbst: mhhh
17:43 karolherbst: printf might be interesting :p
17:43 karolherbst: and memcpy
17:43 karolherbst: jenatali: yeah.. I think memcpy is the easiest to upstream for now
17:43 karolherbst: create an MR, tell me which CTS tests hits this and I wire it up for clover
17:44 jenatali: Mmkay, makes sense - I have no idea which CTS tests hit it :P we wired it up before even trying the CTS
17:46 karolherbst: the alignment info stuff also seems interesting, but that's not a nir pass, is it?
17:47 jenatali: No, I was just saying, changes to nir/vtn and passes
17:54 karolherbst: okay... second try: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6064
17:54 karolherbst: or I just rewrite the clover bits...
17:54 karolherbst: the current stuff is slightly annoying
17:57 karolherbst: but at least this should allow 64 bit GPUs with 32 bit applications and 32 bit GPUs with 64 bit applications to work without issues :)
18:00 karolherbst: jenatali: also.. I think I will check the offset stuff today :)
18:00 karolherbst: should be fairly straightforward
18:01 jenatali: Which offset stuff?
18:01 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5891
18:01 jenatali: Ah, awesome
18:02 karolherbst: jenatali: do you know where all the offset tests are? I'd assume basic and api, but maybe there are more?
18:11 jekstrand: 64-bit phi splitting sounds interesting
18:12 karolherbst: I am actually wondering why that is needed though? ohh.. dxil is ssa?
18:12 jekstrand: It's the one part of 64-bit lowering that NIR can't quite do yet.
18:12 jekstrand: You still end up with 64-bit SSA values and/or registers any time something goes through a phi
18:13 jekstrand: We handle that in our back-end by just allowing 64-bit select and MOV
18:14 karolherbst: mhhh
18:16 karolherbst: our hw is very weird in this regard
18:16 karolherbst: we only have 32 bit registers, but we do have 64 bit operations
18:17 karolherbst: it's annoying
18:17 karolherbst:really wants to port over to util/ra to fix codegens RA
18:18 jekstrand: util/ra is pretty nice
18:18 karolherbst: yeah.. and codegen's graph colouring is broken beyond repair :)
18:18 jekstrand: :-/
18:18 karolherbst: yeah... algorithm is wrong
18:19 anholt:wishes we had easier ways to handle contiguous register groups in util/ra
18:19 karolherbst: the highlight is this TGSI: https://gist.github.com/karolherbst/704a4f74639e244f3ff97da020a926cd
18:19 karolherbst: so.. we have 63 32 bit registers
18:19 karolherbst: _but_
18:19 karolherbst: codegen decides some values might have to get spilled
18:19 karolherbst: but they are illegal to get spilled
18:20 jekstrand: anholt: Me too
18:20 karolherbst: even though in the end nothing has to get spilled
18:20 jenatali: karolherbst: The global offsets are tested in basic IIRC. Local offsets would only be tested by thread_dimensions if you implement looping to handle huge dispatch sizes
18:20 jekstrand: anholt: For IBC, I was forced to generate register classes at compile time
18:20 imirkin: virtual nodes can't get spilled...
18:21 karolherbst: imirkin: ¯\_(ツ)_/¯ it's not even the code which is wrong, just the algorithm we use to decide if values need to be spilled
18:21 karolherbst: and if I have to replace the algorithm I can also just move over to util/ra and remove some pain
18:22 imirkin: graph coloring is never perfect
18:22 imirkin: i don't know that the algo is wrong
18:22 karolherbst: our approach is just ... wrong
18:22 karolherbst: we decide what has to be spilled without having allocated even one register
18:22 karolherbst: but we check for all values
18:23 karolherbst: so we do a "worst case" approach
18:23 karolherbst: and assume the worst case happens when we finally allocate
18:23 karolherbst: so... even in such a small TGSI, codegen thinks something might get spilled
18:24 karolherbst: and if that's virtual nodes or whatever it screws up
19:38 Plagman: curro: sorry for the belated follow-up, was kinda slammed with other stuff, but i just tested dota here and getting 120fps
19:38 Plagman: i'm really interested in what might be going different
19:44 airlied: jekstrand: do we want to know what vector compute is :-)
19:44 airlied: :-P
19:45 jekstrand: airlied: Compute via vec4!
19:45 jekstrand: airlied: I genuinely have no idea
19:45 airlied: jekstrand: I wondered if it was some strange compute via vec4 or we added some vector instructions to your scalar instructions so you could vector while you scalar
19:46 airlied: jekstrand: I only ask because phoronix pointed it out as a SPIRV extension
19:46 jekstrand:has to read this
19:47 airlied: looks like maybe a SPIR-V ext to make SYCL on CPUs better but not sure
19:48 jekstrand: Wow, that SPIR-V extension is weird
19:49 jekstrand: I love how they don't describe how anything works whatsoever
19:50 jekstrand: airlied: Yup, I'm just as in the dark as you are, I'm afraid.
19:50 karolherbst: airlied: when in doubt, it has something to do with AVX512 :p
19:52 curro: Plagman: hmm, i'd love to know too but it's kind of hard without access to the engine's source code... can you confirm that the event loop is supposed to sleep once per frame waiting for events? what kind of event is supposed to wake it up when there is no input?
19:53 Plagman: i can look, yeah
19:53 airlied: karolherbst: AVX1024 :-P
19:53 Plagman: will put it on my list
19:54 danvet: jekstrand, airlied together with the inline assembly I'm betting on pass-through magic of hand-optimized shaders and lolz for media and speciality libraries on top
19:54 danvet: that's the usual thing at least
19:54 danvet: intel-mkl and whatever they're all called
19:55 jekstrand: danvet: That sounds like a good guess
19:55 jekstrand: There's some call stack stuff in there too which makes no sense to me.
19:55 danvet: it would explain the utter lack of documentation
19:55 danvet: *public documentation
20:06 anholt: tomeu: tried to restart what looked like an intermittent fail in llvmpipe-traces, got a 403: https://gitlab.freedesktop.org/anholt/mesa/-/jobs/3794628
20:16 karolherbst: airlied: yeah.. thinking about this a little, it was always painful to vectorize OpenCL through avx, because.. how'd you even do that reliably
20:16 karolherbst: so now, they just push that down to OpenCL and let the application do it
20:17 jekstrand: Didn't you know? Our GPUs are basically just CPUs with a different microcode which turns graphics into AVX512 :P
20:17 karolherbst: there are still people thinking that AVX is just as competent as a real GPU :p
20:17 jekstrand:is a troll
20:17 imirkin: for compute, it's not _far_ from the truth
20:17 karolherbst: jekstrand: ehhh... I still have nightmares of that intel guy comparing AVX512 Xeons to nvidia gpus.. :D
20:18 jekstrand: Depending on what you're doing, a Phi cluster can match or beat an Nvidia cluster for compute.
20:18 jekstrand: Very much depends on what you're doing
20:18 jekstrand: And how good you are at optimizing for AVX512 vs. CUDA
20:19 karolherbst: yeah.. I guess
20:19 karolherbst: but it was about normal Xeons, not phis
20:19 iive: avx512 is horrible to optimize for... it's not even risc anymore.
20:19 jekstrand: I'm not saying AVX512 is amazing
20:20 karolherbst: anyway... that probably describes the reason for why that extension exists the best
20:20 airlied: once someone says auto vectorizing compiler they drink
20:20 jekstrand: But depending on memory patterns, how much you can actually trivially parallelize, etc., GPUs can end up being pretty terrible.
20:20 jekstrand: airlied: Yeah....
20:21 jekstrand: airlied: That's part of what made CRAY big back in the day, though.
20:21 karolherbst: I am quite happy that I won't have to think about vectorization :D
20:21 airlied: karolherbst: was that the article where someone asked how nvidia does vectorisation?
20:21 anholt: on the subject of auto vectorizing: tried scalarizing ntt at the top before optimizing and vectorizing, and we gain ~30% instr count.
20:22 jekstrand: anholt: Yeah. Vectorizing is hard
20:22 karolherbst: airlied: no idea what article you are talking about
20:22 anholt: yep. our pass is definitely a win compared to nothing, but I was kind of hoping we'd do better.
20:23 jekstrand: anholt: Part of the problem is that a good vectorizer needs to undo optimizations.
20:23 airlied: https://pharr.org/matt/blog/2018/04/18/ispc-origins.html
20:24 airlied: karolherbst: that one
20:24 jekstrand: Like if you multiply by (1, 2, 3, 4), we'll get rid of the multiply by 1 but you'd want to keep it for the vector form.
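Concretely, after scalarizing and optimizing "r = v * vec4(1, 2, 3, 4)" the multiply by 1.0 is gone, and a vectorizer would have to reconstruct it before it could re-form a single vec4 multiply:

    static void mul_by_1234(const float v[4], float r[4])
    {
       r[0] = v[0];            /* v[0] * 1.0f was folded to a plain copy */
       r[1] = v[1] * 2.0f;
       r[2] = v[2] * 3.0f;
       r[3] = v[3] * 4.0f;
    }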
20:24 airlied: it's great, and showed that intel compiler ppl really didn't get gpus at the time
20:24 karolherbst: airlied: ahh yeah.. :D
20:24 karolherbst: same
20:25 karolherbst: the only thing which gets vectorized is loads/stores
20:26 karolherbst: the actual annoying part of that intel guy wasn't that he was comparing perf, but he said AVX512 can just be considered to be 16 threads, so.. you have your intel CPU threads * 16 :p done
20:26 karolherbst: kind of fits the mindset described in that article
20:29 HdkR: Let's just get ARM SVE with 2048bit registers wired up for software rasterization, surely that'll be good as a GPU </s>
20:29 karolherbst: *sigh*
20:29 jekstrand: HdkR: Nvidia's talking about buying ARM. Maybe that's their plan? :-P
20:30 HdkR: plez no
20:30 karolherbst: getting rid of SVE? good plan
20:31 HdkR: I'd be happier with Google purchasing ARM so at the very least Mali-VDGC doesn't /immediately/ get tossed in the trash
20:31 karolherbst: jekstrand: at least the gpu hw engineers are now convinced that vectorization is probably not working out all that well :p
20:31 jekstrand: Yeah, I don't know what would happen to the Mali team if Nvidia bought ARM.
20:32 jekstrand: I kind-of expect they'd immediately spin it off as a separate company and just keep the CPU part.
20:32 danvet: yeah anyone buying arm is a regulatory nightmare
20:33 jekstrand: I mean, ARM and Nvidia aren't really competing much.
20:33 jekstrand: But I don't know what it would do to the mobile world. Nvidia might be much less happy to sell IP.
20:35 jekstrand: Even though Mali isn't nearly as common as some, ARM CPU cores are everywhere.
20:43 robclark: the majority of non-qcom SoCs are using mali.. so it is pretty common. It's kinda the default go-to for someone making a SoC that needs a gpu
20:44 robclark: and for things where power consumption matters, nv doesn't really have an alternative
20:44 robclark: (but I also don't think nv would end up buying arm.. I don't really see how that could work)
20:53 jekstrand: There's some ARM+Vivante for super-cheap chips
20:53 jekstrand: And the odd IMG
20:54 jekstrand: But, yeah, Mali is pretty popular in the non-qualcomm market
20:56 karolherbst: jenatali: ehh.. can't get the hw support to work (or maybe it's something else).. will try your lowering :)
20:56 jenatali: karolherbst: Cool, sounds good
21:17 karolherbst: jenatali: soo.. if I want the full stuff, I disable the lowering and enable the offset thingies, right?
21:24 jenatali: karolherbst: Yeah, disable the lowering, and then implement the system values to hook up to whatever you want
21:24 jenatali: We have them loading out of a UBO
21:28 karolherbst: yeah.. we do that for some stuff as well
21:52 karolherbst: jenatali: I assume I also have to do some calculations based on the offset?
21:53 jenatali: karolherbst: All the calculations should be taken care of by the lowering of sysvals
21:53 karolherbst: right... but I meant the API only accepts the global_id_offset, no?
21:53 jenatali: Yeah?
21:54 karolherbst: but your lowering adds two intrinsics: one for the invocation and one for the group
21:54 jenatali: The calculation of the global ID is lowered to use the local IDs + the global offset
21:55 jenatali: And the local IDs are computed with the invocation offset baked in
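Per dimension, the arithmetic the lowering bakes in works out to roughly the following; the names are OpenCL concepts rather than actual intrinsics, and launch_group_base only becomes non-zero when a big enqueue has to be split into several HW dispatches:

    #include <stdint.h>

    static uint64_t global_id(uint32_t group_id, uint32_t launch_group_base,
                              uint32_t local_size, uint32_t local_id,
                              uint64_t global_work_offset)
    {
       /* get_global_id() for one dimension */
       return (uint64_t)(group_id + launch_group_base) * local_size +
              local_id + global_work_offset;
    }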
21:59 karolherbst: jenatali: right, so the runtime has to still prepare two values, right?
21:59 jenatali: Right
21:59 jenatali: One for sure, since the API can set the global offset
21:59 danvet: jenatali, btw was there anything from your (not just you personally) side about the dma-fence fun we have on linux?
22:00 jenatali: The other one only if huge dispatches are used and you need to loop them
22:00 jenatali: danvet: I don't think so - I haven't heard anything at least
22:00 jenatali: Thanks for keeping us in the loop though :)
22:00 danvet: hm, maybe next step when we try to figure out how to get out of this might be more interesting
22:00 jenatali: Yeah, I'm sure Steve'll have some insights he'd be happy to share
22:00 danvet: or maybe the discussion at xdc
22:01 jenatali: Yeah, or that
22:17 jenatali: danvet: Actually one thing we didn't quite understand is why there's a need for global ordering of scheduled work. On Windows we have that with GDI, where all drawing has to be ordered. Is that in-order requirement coming from X? What about a system running without X?
22:18 karolherbst: jenatali: heh.. somehow I think your MR is way too complicated...
22:18 jenatali: Hm?
22:18 karolherbst: jenatali: I set work group offsets to 0 and I pass all tests
22:19 jenatali: D3D has a limit of 65k thread groups per dispatch
22:19 karolherbst: ohhh
22:19 jenatali: If we want to hit the max limits that CL can use we need to loop dispatches and use those
22:19 karolherbst: so for gallium I can just ignore that one
22:19 jenatali: Yep, probably
22:19 karolherbst: okay
22:20 jenatali: Which is why there's a cap to lower it away :)
22:20 karolherbst: jenatali: fun fact.. num_work_groups is in a ubo for us...
22:20 karolherbst: hw can't tell
22:20 karolherbst: (and doesn't care)
22:20 jenatali: Cool
22:21 karolherbst: but we still have a thread id..
22:21 karolherbst: it's weird
22:21 karolherbst: and grid id
22:21 karolherbst: wait.. that was different pre kepler
22:22 karolherbst: oh well.. nvm
22:29 karolherbst: jenatali: top three commits: https://gitlab.freedesktop.org/karolherbst/mesa/-/commits/cl_wip
22:30 karolherbst: the first one you might want to pick :)
22:31 karolherbst: curro: is there a good reason we have the offset as a kernel input arg? I am sure it makes more sense to have it inside pipe_grid_info as hw might support it natively (I am still convinced nv does, just have to figure out how)
22:35 karolherbst: jenatali: so.. if I understand stuff correctly.. the workgroup id stays 0 for the first thread even though you have an invoc offset of like 100k and 10 threads per block on x?
22:37 jenatali: So in DXIL all we have is group ID or thread ID
22:37 karolherbst: right
22:37 jenatali: and those reset to 0 whenever we start a new dispatch
22:38 karolherbst: right
22:38 karolherbst: same as we essentially :p
22:38 curro: karolherbst: hm sorry the offset of what?
22:38 jenatali: So if we want to make it look like we're doing one dispatch, and support the work_group_id functions *not* resetting to 0 in the middle of a CL kernel dispatch, then we add offsets for all the ones after the first
22:38 karolherbst: curro: when you enqueue a kernel you can specify an offset
22:38 karolherbst: global_work_offset in clEnqueueNDRangeKernel
22:39 karolherbst: jenatali: ahhh... I see
22:39 karolherbst: mhhhh
22:39 karolherbst: I think that could break for us indeed?
22:39 karolherbst: let me think
22:39 jenatali: But if you can do ridiculously huge dispatches all at once, and don't need to emulate them by looping dispatches, then you should be fine?
22:39 karolherbst: jenatali: so essentially there is a difference between enqueuing a kernel once where the driver needs to split it, and enqueuing it 100k times?
22:40 jenatali: Right
22:40 karolherbst: heh.. well
22:41 karolherbst: we can do ridiculously huge dispatches, but there is always more...
22:41 karolherbst: kepler plus we are limited to "{ 0x7fffffff, 65535, 65535 }"
22:42 karolherbst: for the grid size
22:42 karolherbst: which.. well
22:42 karolherbst: cl can do more :p
22:42 karolherbst: I just don't know if it makes sense to even care
22:42 karolherbst: because if you launch something even bigger, you either need to pass down the machine to your children anyway or you just do it for fun
22:45 karolherbst: jenatali: I think at this point I'll risk it and wait for bug reports :p
22:45 curro: karolherbst: ah, that offset. it's an input parameter in order to keep the compute dispatch interface minimal since the pipe driver doesn't necessarily need to know that information nor bother to process it as long as we plumb it through to the kernel
22:45 jenatali: Sure
22:45 karolherbst: curro: right.. but what if the hw has native support for it
22:49 curro: karolherbst: i wouldn't have any objection against moving that responsibility into the pipe driver if it's measurably helpful for some hardware
22:49 karolherbst: okay
22:49 curro: doesn't seem very likely though since it would save you little more than a single vector addition per thread invocation
22:50 karolherbst: well.. 64 bit math is still annoying
22:51 bnieuwenhuizen: curro: if there is a loop of dispatches (kinda like what jenatali said earlier) it would still save you from copying and uploading more UBO data no?
22:52 curro: bnieuwenhuizen: possibly, if your hardware doesn't have any more efficient way to provide one-time input data to a kernel
22:53 curro: even if it's a pipelined UBO write it's still constant overhead per thread block
22:53 karolherbst: bnieuwenhuizen: the issue is more that the offset calculation has to be inserted into the kernel even if you don't use it
22:54 curro: oh? that's not true if your backend supports some form of DCE
22:54 karolherbst: curro: huh? how does that matter, you either upload the offset into the kernel args or your driver ubo, so for hw not supporting it, there shouldn't be any difference
22:54 karolherbst: curro: you don't know the value at compile time
22:54 karolherbst: unless you recompile each time you enqueue
22:55 curro: karolherbst: you don't need to know the value at compile time in order to know whether the value is statically used by the kernel
22:55 karolherbst: it's always used when you use the global_id
22:56 karolherbst: which... a lot of kernels do use
22:56 jekstrand: Vulkan even has a thing to provide this offset in vkCmdDispatchBase
22:56 jekstrand: We use a push constant for it
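For reference, the Vulkan entry point mentioned here (core since 1.1, originally from VK_KHR_device_group) takes the base workgroup ID directly; a driver without a dedicated HW field for it lowers it much like the CL global offset being discussed. Wrapper below is just an illustration of the call:

    #include <vulkan/vulkan.h>

    static void dispatch_with_base(VkCommandBuffer cmd,
                                   uint32_t base_x, uint32_t base_y, uint32_t base_z,
                                   uint32_t nx, uint32_t ny, uint32_t nz)
    {
       /* baseGroup* becomes the first workgroup ID seen by the shader */
       vkCmdDispatchBase(cmd, base_x, base_y, base_z, nx, ny, nz);
    }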
22:56 curro: karolherbst: indeed, so they need the offset parameter in that case (with the current interface)
22:56 jekstrand: Also, a single add in a shader doesn't matter.
22:56 karolherbst: curro: right, and without knowing what llvm does, it might be broken today :)
22:56 karolherbst: anyway
22:57 curro: jekstrand: yes, providing the global offset in a push constant-like interface is what the current API is designed for
22:57 karolherbst: offsets affect each get_global_id call
22:57 jekstrand: karolherbst: So?
22:57 karolherbst: so you have to always insert the addition
22:57 jekstrand: Yeah
22:57 jekstrand: And why's that a problem?
22:57 karolherbst: jekstrand: we discuss if you want it to be an argument or part of pipe_grid_info and driver handled
22:57 karolherbst: where the latter might be a benefit
22:57 karolherbst: because there might be hw supporting it natively
22:58 jekstrand: Yeah
22:58 karolherbst: (I think nv does, I just didn't figure out how to use it)
22:58 jekstrand: For us, we'd just make it a push constant
22:58 jekstrand: I think on some HW we even have a HW thing for it
22:58 karolherbst: right
22:58 karolherbst: jekstrand: but for you it doesn't make a difference whether you lower it in your compiler or read it from the kernel input, right?
22:58 jekstrand: We have a ThreadGroupIDStartingX I'm not sure what it's for
22:59 karolherbst: yeah.. sounds like the stuff
22:59 karolherbst: _but_
22:59 karolherbst: we have NVC1C0_QMDV01_07_CTA_RASTER_WIDTH_RESUME
22:59 karolherbst: and I thought maybe that's also that stuff :)
22:59 bnieuwenhuizen: same here, have a COMPUTE_START_{X,Y,Z}
22:59 karolherbst: yeah...
23:00 karolherbst: okay
23:00 jekstrand: I should really figure that out and stop pushing it for Vulkan
23:00 karolherbst: so it indeed might make sense to move it into pipe_grid_info
23:00 jekstrand: I think that's the conclusion
23:00 karolherbst: jekstrand: if you give me the details I could write the patch for iris :p
23:00 karolherbst: (still figures out how to do that for nouveau though)
23:01 jekstrand: karolherbst: If it's the bit I think it is, you need to set ThreadGroupIDStartingX/Y/Z in GPGPU_WALKER
23:01 jekstrand: Hrm...
23:02 jekstrand: Maybe not?
23:02 karolherbst: yeah...
23:02 jekstrand: Nah, I think that's probably right
23:02 karolherbst: I will try to figure it out for nouveau first though :)
23:02 jekstrand: Yeah, I think that's right
23:02 karolherbst: I probably just need to launch the kernel differently or so
23:02 jekstrand: For Z it's called resume and it's weird.
23:02 karolherbst: :D
23:02 karolherbst: maybe not supported
23:02 jekstrand: I think it's the same
23:02 karolherbst: and resume is something else
23:02 jekstrand: It's just got a different name
23:03 karolherbst: jekstrand: well.. there is a problem
23:03 curro: i doubt that's the conclusion. as i said earlier, that would be an acceptable change, but the overhead it would save is marginal (a constant amount of time per thread invocation); it would certainly make sense to do if you can show you can measure *any* improvement from it, which i'm skeptical about
23:03 karolherbst: and this makes it all slightly annoying
23:03 curro: you must also admit that there is a value in the state tracker taking care of that responsibility for all drivers
23:03 jekstrand: curro: If literally all hardware we care about has a dedicated bit for it, though.
23:04 jekstrand: That changes the equation
23:04 karolherbst: jekstrand: the issue is, if you lower huge workgroups, you need to adjust the work group id; if you don't and just specify the offset through the API, the work group id starts at 0
23:04 curro: if we can't measure any improvement from a hardware feature and it costs effort/code to use it, why bother?
23:04 karolherbst: soo..
23:04 karolherbst: resume _could_ be the lowering
23:04 karolherbst: and the other stuff, the normal offset or something
23:05 karolherbst: it's all annoying
23:05 karolherbst: curro: because the arg removes space from the kernel input buffer :p
23:05 karolherbst: (not that it matters much, but still)
23:06 karolherbst: the problem is just that we add overhead to _trivial_ kernels as well
23:06 karolherbst: with the current approach
23:06 karolherbst: and one time CPU overhead, vs per thread overhead... I choose the former any time
23:12 jekstrand: karolherbst: I just kicked a Vulkan CTS run off to Jenkins with the HW fields used to implement baseGroupX/Y/Z.
23:12 jekstrand: karolherbst: It should show up on mesa-ci.01.org in an hour or so.
23:12 karolherbst: heh, cool :)
23:13 karolherbst: hah!
23:13 jekstrand: karolherbst: https://gitlab.freedesktop.org/jekstrand/mesa/-/commit/35a5de3c51244ef33b72f0945d2c6e0adced0c4d
23:13 karolherbst: I was able to trigger a difference in one test
23:13 jekstrand: karolherbst: If that works, I can drop a pile of annoying crap from ANV
23:13 jenatali: A difference?
23:13 karolherbst: :)
23:13 jekstrand: A whole NIR pass, even.
23:13 karolherbst: jenatali: I set width_resume to 1 and global_work_offsets fails
23:14 karolherbst: jekstrand: \o/ question now: does all hw support it? :p
23:14 karolherbst: although.. you could probably just reuse jenatali's work or something? dunno how complex the lowering is for you
23:17 karolherbst: ahh okay... we have a bug anyway
23:17 karolherbst: skeggsb: QMD defaults to 1.7 on pascal :p
23:17 karolherbst: not 2.1
23:25 bnieuwenhuizen: jekstrand: don't forget to drop the add of the offset, otherwise you run into doubling the offset?
23:58 jekstrand: bnieuwenhuizen: I didn't. :-)
23:59 jekstrand: bnieuwenhuizen: Well, I didn't drop the add. I just pushed (0, 0, 0)