00:00 airlied: ah it sank some loads but not their addr calcs
00:11 airlied: okay managed to avoid some spills by allowing a bunch of alu instrs to move
00:16 airlied: -Scratch: 34816 bytes per wave
00:16 airlied: +Scratch: 5120 bytes per wave
00:16 airlied: seems like a bit of a win
07:18 MrCooper: airlied: understatement of the week 8-O
07:52 Lynne: which kernel branch should I use if I want to play around with nvk a little?
07:56 Lynne: they've silently added ada's gsp firmware now
08:03 airlied: Lynne: there really isn't one with everything
08:03 airlied: if you have ada i think skeggsb's latest gsp-rm branch is the only hope
08:20 Lynne: seems a bit old, anything more recent I can just rebase that commit on?
09:05 airlied: Lynne: https://gitlab.freedesktop.org/skeggsb/nouveau/-/tree/00.02-gsp-rm?ref_type=heads
09:05 airlied: that one?
09:07 Lynne: no, https://gitlab.freedesktop.org/skeggsb/mesa/-/commits/02.00-ad10x was what I was looking at
09:07 Lynne: thanks, I'll check it out later
11:19 pq: emersion, Emil Velikov might be a better person to look at DRM device minor number stuff than me. I've never looked into that side.
11:20 emersion: me neither! :P
11:20 emersion: but noted
11:20 pq: IIRC Emil (xexaxo) has done things around libdrm device identification
11:32 Danyil: Hello people, I was wondering if there's any progress or info regarding hardware accelerated GPU scheduling on linux. I haven't seen any projects utilizing it.
11:32 Danyil: It makes a noticeable difference on the windows desktop in terms of input latency, so I wanted to test it on linux but haven't found any info about it...
11:34 Danyil: https://www.tomshardware.com/news/microsoft-explains-gpu-hardware-scheduling-aims-to-improve-input-lag
11:34 Danyil: https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/
11:47 alyssa: eric_engestrom: samuelig: great news, you're off the hook :p
11:47 alyssa: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24153
12:03 alyssa: airlied: it's a hard problem \shrug/
13:24 alyssa: nir_opt_preamble why >_>
13:31 alyssa: I'm trying to refine this to reconstruct if's when needed but not when not and oof this is starting to feel like homework for my algorithms class :p
14:06 eric_engestrom: alyssa: nice, thanks gfxstrand 🤗
16:07 karolherbst: doesn't AMD hardware have an instruction to fetch the workgroup size?
16:09 karolherbst: so.. I have to redefine nir_intrinsic_load_workgroup_size to be the workgroup_size of the _current_ grid item, not just any, so that the last block (which might be smaller) can return its actual size
16:09 karolherbst: but radeonsi currently lowers it and just sets it to the normal block size, which is actually not what I need here
16:10 karolherbst: dschuermann: maybe you have any ideas what can be done here and if AMD hardware does have some instructions to deal with last_block stuff?
16:11 mareko: karolherbst: the shader doesn't have that information
16:12 karolherbst: that's.... weird?
16:13 karolherbst: lowering that could get pretty messy if you have to figure out inside the shader if you are one of the last blocks
16:13 karolherbst: because that's a thing on each dimension
16:13 mareko: the last block feature was not meant to be used by frontends
16:13 karolherbst: well.. I need it for CL
16:14 karolherbst: and I already have it working on llvmpipe and radeonsi, just that the workgroup_size reported is wrong
16:14 karolherbst: on radeonsi that is
16:15 alyssa: what.. was it for?
16:15 karolherbst: anyway, all of that nonsense is lowered in the frontend (and the optimized paths we have today are taken if last_block is disabled), so it's zero cost if it's not used
16:15 alyssa: if not frontends?
16:15 karolherbst: _but_
16:15 mareko: driver blits
16:15 karolherbst: I rely on drivers to report proper workgroup sizes
16:15 alyssa: okie
16:16 karolherbst: well.. not anymore, because I'm planning to use it in a frontend
16:17 karolherbst: CL is just a bit annoying with this feature, as it has CLC queries to get the _current_ and the _enqueued_ workgroup size (and other things)
16:17 karolherbst: so I need both, just the enqueued one can be perfectly lowered in the frontend
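For reference, the two OpenCL C queries being discussed (a minimal sketch, assuming a 1D enqueue of 100 items with an enqueued local size of 32, so the last group only has 4 live items; the packing into out is purely illustrative):

    kernel void show_sizes(global uint *out)
    {
        size_t gid = get_global_id(0);
        /* size of *this* work-group: 32 in the full groups, 4 in the last one */
        size_t cur = get_local_size(0);
        /* size that was enqueued: 32 in every group, remainder or not */
        size_t enq = get_enqueued_local_size(0);
        out[gid] = (uint)(cur | (enq << 16));
    }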
16:17 mareko: you basically have to lower it except that you don't have to put the whole shader into a conditional block
16:18 karolherbst: lower what?
16:18 mareko: the last block
16:18 karolherbst: I sure won't lower it
16:18 karolherbst: well.. if radeonsi needs it lowered, it's radeonsi's business
16:18 karolherbst: but I don't see why the frontend should lower it
16:19 mareko: whatever
16:19 karolherbst: there is hardware supporting it natively, so there is that
16:20 alyssa: karolherbst: tbh I'm on team "rusticl lowers it"
16:20 karolherbst: alyssa: last_block?
16:20 alyssa: yeah
16:20 karolherbst: why
16:20 alyssa: behind a compute cap
16:21 karolherbst: ahh
16:21 alyssa: because there's piles of hw that doesn't support it natively
16:21 alyssa: and it's not a thing in gl
16:21 alyssa: so either it's 1 lowering call in rusticl or N calls in every gallium driver that wants cl
16:21 karolherbst: it's an optional feature though
16:21 alyssa: in that case, compute cap and don't advertise on radeonsi?
16:22 karolherbst: well.. the hardware can actually do it
16:22 mareko: "can"
16:23 karolherbst: so the restriction on AMD hardware is that there are no proper interfaces for those system values?
16:23 mareko: it wouldn't be difficult to lower it in radeonsi
16:23 karolherbst: I mean, lowering the system value is a different story than lowering the entire feature
16:24 alyssa: lowering the whole feature is just "round up and put the shader in a big conditional", right?
16:24 karolherbst: no
16:24 mareko: yes
16:24 karolherbst: there is more to it
16:24 alyssa:watches rock paper scissors rematch
16:24 karolherbst: you have to calculate the workgroup size according to the disabled work items
16:24 karolherbst: so a simple x * y * z thing won't do
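A sketch of what lowering the whole feature amounts to, in OpenCL-C-flavoured pseudocode; enq, grid, group_id and local_id are illustrative names, and the per-dimension min() is exactly the part a plain x * y * z misses:

    /* the grid is rounded up to a multiple of enq, so the last group in
     * each dimension may contain disabled (padding) items */
    uint3 actual = min(enq, grid - group_id * enq);  /* per-dimension clamp */
    if (all(local_id < actual)) {
        /* original kernel body; get_local_size() must now return actual,
         * not enq */
    }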
16:25 DemiMarie: Danyil: I don’t believe any of the existing GPU drivers for Linux use firmware scheduling. The in-development Xe and Asahi drivers do use it.
16:25 karolherbst: also the local group id has to be in order and everything
16:25 mareko: need "load_last_workgroup_size" in radeonsi (easy), then a hw-agnostic NIR pass which lowers it
16:26 mareko: it seems easy
16:26 karolherbst: yeah, something like that I guess
16:27 karolherbst: thing is, it's just quite a bit of code and it would impact everything using nir_intrinsic_load_workgroup_size where nir_intrinsic_load_enqueued_workgroup_size isn't a proper replacement
16:28 karolherbst: but if there is enough hardware which actually doesn't have either of those it's getting a bit messy and I might have to redesign things
16:29 mareko: messy how? all drivers just need load_last_wg_size and the NIR pass can do the rest
16:30 mareko: wg_size.x = wg_id.x == num_wg.x - 1 ? last_wg_size.x : wg_size.x;
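Roughly, as a nir_builder sketch; nir_load_last_workgroup_size() is the hypothetical new intrinsic from above, and the other helpers are named from memory, so treat it as pseudocode:

    static bool
    lower_wg_size(nir_builder *b, nir_intrinsic_instr *intr, void *data)
    {
       if (intr->intrinsic != nir_intrinsic_load_workgroup_size)
          return false;

       b->cursor = nir_before_instr(&intr->instr);

       /* wg_size = (wg_id == num_wg - 1) ? last_wg_size : enqueued_wg_size,
        * component-wise, so one ieq/bcsel pair covers all three dimensions */
       nir_ssa_def *is_last = nir_ieq(b, nir_load_workgroup_id(b),
                                      nir_iadd_imm(b, nir_load_num_workgroups(b), -1));
       nir_ssa_def *size = nir_bcsel(b, is_last,
                                     nir_load_last_workgroup_size(b), /* hypothetical */
                                     nir_load_enqueued_workgroup_size(b));

       nir_ssa_def_rewrite_uses(&intr->dest.ssa, size);
       nir_instr_remove(&intr->instr);
       return true;
    }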
16:31 karolherbst: yeah.. and on some drivers wg_id also explodes into more code
16:31 karolherbst: but I guess if there is no better alternative that's what needs to happen
16:31 karolherbst: I just wish drivers/hardware had a native system value for that
16:32 mareko: we don't even have a native value for num_wg
16:32 mareko: and wg_size
16:32 karolherbst: yeah, but that's cheap to put into a ubo
16:33 mareko: not cheap if the shader is tiny
16:33 karolherbst: just anything depending on the current block id is.. well.. I wish it were in hardware
16:33 karolherbst: or push constants or whatever driver prefer to use there
16:33 mareko: user data SGPRs
16:34 mareko: any load would be slower
16:34 karolherbst: yeah.. I just forget that most hardware doesn't have UBOs with GPR access speed
16:35 karolherbst: I'd have to play around on nvidia as well with this feature, but I think nvidia has the stuff for it
16:35 karolherbst: iris will be interesting to figure out
16:36 alyssa: karolherbst: where are we at with deleting clover
16:36 karolherbst: somebody needs to figure out r600
16:36 alyssa: ah..
16:37 mareko: r600 is not compute hw
16:37 alyssa: there, official word from AMD, r600 is not compute hw
16:38 alyssa:deletes clover
16:38 karolherbst: :D
16:38 alyssa: mareko: that's ok, mali isn't graphics hw ;)
16:38 karolherbst: there are apparently still users
16:38 karolherbst: and it kinda works, just somebody needs to figure out the remaining issues
16:38 karolherbst: probably a week of work? dunno
16:40 mareko: we should actually emulate the last block completely
16:40 mareko: in radeonsi
16:40 karolherbst: why?
16:41 karolherbst: seems to work well enough at least, or rather the shader header thing does what I need
16:42 mareko: there may be perf penalty in some hw
16:42 karolherbst: mhhh
16:42 karolherbst: I'd kinda prefer to have more data on that
16:43 mareko: it will be revealed eventually, now is not the time
16:43 karolherbst: I think for now it's probably fine to use whatever there is until somebody has time to actually look into it
16:43 karolherbst: fair enough
16:44 karolherbst: the annoying part is simply that in CL, if you explicitly compile a CL2.0 or CL3.0 kernel, the compiler has to assume the last_block feature will be used unless it's disabled by the application, so I'd kinda prefer not to add overhead if it's not actually needed
16:51 mareko: it's just isub, ieq, bcsel per dimension, and other-than-radeonsi drivers also need UBO loads
16:52 karolherbst: yeah.. I guess it's not that bad given how little num_workgroups is used (probably)
16:52 karolherbst: I can add code for it to lower_system_values, that's not the big problem here
16:53 karolherbst: just have to rethink all the lowering here
16:56 karolherbst: but anyway, that's for next week.
18:31 AndrewR: ...it seems I compiled embree/luxcorerender for 32-bit Slackware. It even outputs something with
18:31 AndrewR: RUSTICL_ENABLE=lp RUSTICL_DEVICE_TYPE=all LP_CL=1 bin/luxcoredemo
18:33 AndrewR: https://pastebin.com/SQqWJRCZ
18:35 AndrewR: https://pasteboard.co/QzOj2LyLZbim.png
19:00 AndrewR: ah, no it was RUSTICL_ENABLE=lp RUSTICL_DEVICE_TYPE=gpu LP_CL=1 bin/luxcoredemo probably ..
19:02 AndrewR: https://pastebin.com/CHugRiFT
19:07 karolherbst: don't need LP_CL=1
19:07 karolherbst: but yeah.. such an output is expected on the CPU if it's not fast enough
19:08 karolherbst: the quality of the image increases over time
19:08 AndrewR: karolherbst, for some reason (with both clover and rusticl active) Luxcorerender does not print opencl info w/o that variable set ...
19:08 karolherbst: it's a bit buggy...
19:09 karolherbst: luxcore doesn't query the devices correctly and then also has internal bugs, it's a bit of a problem... just have only one impl active and it mostly works
19:11 AndrewR: karolherbst, yeah .... but for now workaround seems to work for me :)
19:13 AndrewR: https://github.com/LuxCoreRender/LuxCore/blob/2f35684a04d9e1bd48d6ffa88b19a88871e90942/src/luxrays/devices/ocldevice.cpp#L89 - it was failing here ...
19:13 karolherbst: mhhh
19:14 karolherbst: ohh right.. if clover doesn't advertise devices it fails :)
22:45 cambrian_invader: sanity check: I'm getting EACCES from DRM_IOCTL_GEM_OPEN because it doesn't have DRM_RENDER_ALLOW set and it's being called with the gpu_fd from lima_bo_import
22:46 cambrian_invader: the obvious fix is to use kms_fd instead
22:46 cambrian_invader: but shouldn't this get caught by anyone who runs lima?
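That is the expected render-node behaviour: render clients may only call ioctls explicitly whitelisted with DRM_RENDER_ALLOW, and DRM_IOCTL_GEM_OPEN (flink open) is not. From memory, the check in drm_ioctl_permit() in drivers/gpu/drm/drm_ioctl.c is roughly:

    /* render clients may only call ioctls marked DRM_RENDER_ALLOW;
     * everything else gets -EACCES on a render-node fd */
    if (unlikely(!(flags & DRM_RENDER_ALLOW) &&
                 drm_is_render_client(file_priv)))
            return -EACCES;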
23:02 edt: Just installed an am5 mb with an rx7700 and an rx6600xt gpu. When looking in nvtop, I often see both gpus (builtin and 6600) active. This happens in both games and video playback. I am using mesa 21.1.3 on linux 6.4.3. How is mesa utilizing the builtin gpu? and for what?
23:03 edt: thats a ryzen 7700
23:17 penguin42: edt: I don't know the answer, but does radeontop allow you to see activity on each?
23:21 edt: never figured out how to get radeontop to show the builtin gpu - nvtop generally gives better info
23:21 penguin42:didn't realise nvtop did anything on Radeon
23:22 penguin42: edt: I see radeontop has an option -b for bus and -p for path
23:22 penguin42:only has the one card
23:22 edt: -p /dev/dri/card0 dies, card1 shows the rx6600xt
23:23 edt: card0 is the builtin
23:23 edt: I suggest you try nvtop - it surprised me how well it works with radeon (radv)
23:34 edt: apparently nv in nvtop stands for Neat Videocard
23:34 penguin42: haha
23:35 edt: and it works for amd, intel and nvidia
23:36 penguin42: hmm very pretty
23:37 HdkR: nvtop's name changed once it gained support for more than NVIDIA
23:37 HdkR: It also supports Adreno :)
23:40 edt: It's interesting that it shows both gpus (builtin on ryzen 7700 (2 CUs), rx6600xt (32 CUs)) as used. I'd like to understand what is happening.
23:41 edt: btw I use two displays, both connected to the rx6600xt
23:41 edt: one via DP and the other via HDMI
23:42 penguin42: edt: So do you ever see the builtin doing anything on nvtop ?
23:46 edt: I just seen the gpu & memory traces moving up & down (like the other gpu but not matching it). It's probably good. An extra 2 CUs probably helps in some cases. I'd just like to understand how/why this happens. From what I read 'crossfire' does not work on linux. Looks like something like it does though...
23:46 edt: I just see...
23:54 edt: searching a bit shows crossfire was replaced by xDMA, and there are patches for xdma in linux 6.3. They are for AMD-Xilinx Alveo cards. Wonder if these patches are also helping/working here???