00:00airlied: ah it sank some loads but not their addr calcs
00:11airlied: okay managed to avoid some spills by allowing a bunch of alu instrs to move
00:16airlied: -Scratch: 34816 bytes per wave
00:16airlied: +Scratch: 5120 bytes per wave
00:16airlied: seems like a bit of a win
07:18MrCooper: airlied: understatement of the week 8-O
07:52Lynne: which kernel branch should I use if I want to play around with nvk a little?
07:56Lynne: they've silently added ada's gsp firmware now
08:03airlied: Lynne: there really isn't one with everything
08:03airlied: if you have ada i think skeggsb latest gsprm branch is the only hope
08:20Lynne: seems a bit old, anything more recent I can just rebase that commit on?
09:05airlied: Lynne: https://gitlab.freedesktop.org/skeggsb/nouveau/-/tree/00.02-gsp-rm?ref_type=heads
09:05airlied: that one?
09:07Lynne: no, https://gitlab.freedesktop.org/skeggsb/mesa/-/commits/02.00-ad10x was what I was looking at
09:07Lynne: thanks, I'll check it out later
11:19pq: emersion, maybe Emil Velikov might be a better person to look at DRM device minor number stuff than me. I've never looked into that side.
11:20emersion: me neither! :P
11:20emersion: but noted
11:20pq: IIRC Emil (xexaxo) has done things around libdrm device identification
11:32Danyil: Hello people, I was wondering if there's any progress or info regarding hardware accelerated GPU scheduling on linux. I haven't seen any projects utilizing it.
11:32Danyil: It makes a noticeable difference on windows desktop in terms of input latency, so I wanted to test it on linux but haven't found any info about it...
11:34Danyil: https://www.tomshardware.com/news/microsoft-explains-gpu-hardware-scheduling-aims-to-improve-input-lag
11:34Danyil: https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/
11:47alyssa: eric_engestrom: samuelig: great news, you're off the hook :p
11:47alyssa: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24153
12:03alyssa: airlied: it's a hard problem \shrug/
13:24alyssa: nir_opt_preamble why >_>
13:31alyssa: I'm trying to refine this to reconstruct if's when needed but not when not and oof this is starting to feel like homework for my algorithms class :p
14:06eric_engestrom: alyssa: nice, thanks gfxstrand 🤗
16:07karolherbst: doesn't AMD hardware have an instruction to fetch the workgroup size?
16:09karolherbst: so.. I have to redefine nir_intrinsic_load_workgroup_size to be the workgroup_size of the _current_ grid item, not just any, so that the last_block (which might be smaller) can return its actual size
16:09karolherbst: but radeonsi currently lowers it and just sets it to the normal block size, which is actually not what I need here
16:10karolherbst: dschuermann: maybe you have any ideas what can be done here and if AMD hardware does have some instructions to deal with last_block stuff?
16:11mareko: karolherbst: the shader doesn't have that information
16:12karolherbst: that's.... weird?
16:13karolherbst: lowering that could get pretty messy if you have to figure out inside the shader if you are one of the last blocks
16:13karolherbst: because that's a thing on each dimension
16:13mareko: the last block feature was not meant to be used by frontends
16:13karolherbst: well.. I need it for CL
16:14karolherbst: and I already have it working on llvmpipe and radeonsi, just that the workgroup_size reported is wrong
16:14karolherbst: on radeonsi that is
16:15alyssa: what.. was it for?
16:15karolherbst: anyway, all of that nonsense is lowered in the frontend (and the optimized paths are taken, as today, if last block is disabled), so it's zero cost if it's not used
16:15alyssa: if not frontends?
16:15karolherbst: _but_
16:15mareko: driver blits
16:15karolherbst: I rely on drivers to report proper workgroup sizes
16:15alyssa: okie
16:16karolherbst: well.. not anymore, because I'm planning to use it in a frontend
16:17karolherbst: CL is just a bit annoying with this feature, as it has CLC queries to get the _current_ and the _enqueued_ workgroup size (and other things)
16:17karolherbst: so I need both, just the enqueued one can be perfectly lowered in the frontend
16:17mareko: you basically have to lower it except that you don't have to put the whole shader into a conditional block
16:18karolherbst: lower what?
16:18mareko: the last block
16:18karolherbst: I sure won't lower it
16:18karolherbst: well.. if radeonsi needs it lowered, it's radeonsi's business
16:18karolherbst: but I don't see why the frontend should lower it
16:19mareko: whatever
16:19karolherbst: there is hardware supporting it natively, so there is that
16:20alyssa: karolherbst: tbh I'm on team "rusticl lowers it"
16:20karolherbst: alyssa: last_block?
16:20alyssa: yeah
16:20karolherbst: why
16:20alyssa: behind a compute cap
16:21karolherbst: ahh
16:21alyssa: because there's piles of hw that doesn't support it natively
16:21alyssa: and it's not a thing in gl
16:21alyssa: so either it's 1 lowering call in rusticl or N calls in every gallium driver that wants cl
16:21karolherbst: it's an optional feature though
16:21alyssa: in that case, compute cap and don't advertise on radeonsi?
16:22karolherbst: well.. the hardware can actually do it
16:22mareko: "can"
16:23karolherbst: so the restriction on AMD hardware is that there are no proper interfaces for those system values?
16:23mareko: it wouldn't be difficult to lower it in radeonsi
16:23karolherbst: I mean, lowering the system value is a different story than lowering the entire feature
16:24alyssa: lowering the whole feature is just "round up and put the shader in a big conditional", right?
16:24karolherbst: no
16:24mareko: yes
16:24karolherbst: there is more to it
16:24alyssa:watches rock paper scissors rematch
16:24karolherbst: you have to calculate the workgroup size according to the disabled work items
16:24karolherbst: so a simple x * y * z thing won't do
16:25DemiMarie: Danyil: I don’t believe any of the existing GPU drivers for Linux use firmware scheduling. The in-development Xe and Asahi drivers do use it.
16:25karolherbst: also the local group id has to be in order and everything
16:25mareko: need "load_last_workgroup_size" in radeonsi (easy), then a hw-agnostic NIR pass which lowers it
16:26mareko: it seems easy
16:26karolherbst: yeah, something like that I guess
16:27karolherbst: thing is, it's just quite a bit of code and it would impact everything using nir_intrinsic_load_workgroup_size where nir_intrinsic_load_enqueued_workgroup_size isn't a proper replacement
16:28karolherbst: but if there is enough hardware which actually doesn't have either of those it's getting a bit messy and I might have to redesign things
16:29mareko: messy how? all drivers just need load_last_wg_size and the NIR pass can do the rest
16:30mareko: wg_size.x = wg_id.x == num_wg.x - 1 ? last_wg_size.x : wg_size.x;
16:31karolherbst: yeah.. and on some drivers wg_id also explodes into more code
16:31karolherbst: but I guess if there is no better alternative that's what needs to happen
16:31karolherbst: I just wish drivers/hardware had a native system value for that
16:32mareko: we don't even have a native value for num_wg
16:32mareko: and wg_size
16:32karolherbst: yeah, but that's cheap to put into a ubo
16:33mareko: not cheap if the shader is tiny
16:33karolherbst: just anything depending on the current block id is.. well.. I wish it were in hardware
16:33karolherbst: or push constants or whatever driver prefer to use there
16:33mareko: user data SGPRs
16:34mareko: any load would be slower
16:34karolherbst: yeah.. I just forget that most hardware doesn't have UBOs with GPR access speed
16:35karolherbst: I'd have to play around on nvidia as well with this feature, but I think nvidia has the stuff for it
16:35karolherbst: iris will be interesting to figure out
16:36alyssa: karolherbst: where are we at with deleting clover
16:36karolherbst: somebody needs to figure out r600
16:36alyssa: ah..
16:37mareko: r600 is not compute hw
16:37alyssa: there, official word from AMD, r600 is not compute hw
16:38alyssa:deletes clover
16:38karolherbst: :D
16:38alyssa: mareko: that's ok, mali isn't graphics hw ;)
16:38karolherbst: there are apparently still users
16:38karolherbst: and it kinda works, just somebody needs to figure out the remaining issues
16:38karolherbst: probably a week of work? dunno
16:40mareko: we should actually emulate the last block completely
16:40mareko: in radeonsi
16:40karolherbst: why?
16:41karolherbst: seems to work good enough at least, or rather the shader header thing does what I need
16:42mareko: there may be perf penalty in some hw
16:42karolherbst: mhhh
16:42karolherbst: I'd kinda prefer to have more data on that
16:43mareko: it will be revealed eventually, now is not the time
16:43karolherbst: I think for now it's probably fine to use whatever there is until somebody has time to actually look into it
16:43karolherbst: fair enough
16:44karolherbst: the annoying part is simply that in CL, if you explicitly compile a CL2.0 or CL3.0 kernel, the compiler has to assume the last_block feature will be used unless it's disabled by the application, so I kinda prefer to not have to add overhead if it's not actually needed
16:51mareko: it's just isub, ieq, bcsel per dimension, and other-than-radeonsi drivers also need UBO loads
16:52karolherbst: yeah.. I guess it's not that bad given how little num_workgroups is used (probably)
16:52karolherbst: I can add code for it to lower_system_values, that's not the big problem here
16:53karolherbst: just have to rethink all the lowering here
16:56karolherbst: but anyway, that's for next week.
18:31AndrewR: ...it seems I compiled embree/luxcorerender for 32-bit Slackware. It even outputs something with
18:31AndrewR: RUSTICL_ENABLE=lp RUSTICL_DEVICE_TYPE=all LP_CL=1 bin/luxcoredemo
18:33AndrewR: https://pastebin.com/SQqWJRCZ
18:35AndrewR: https://pasteboard.co/QzOj2LyLZbim.png
19:00AndrewR: ah, no it was RUSTICL_ENABLE=lp RUSTICL_DEVICE_TYPE=gpu LP_CL=1 bin/luxcoredemo probably ..
19:02AndrewR: https://pastebin.com/CHugRiFT
19:07karolherbst: don't need LP_CL=1
19:07karolherbst: but yeah.. such an output is expected on the CPU if it's not fast enough
19:08karolherbst: the quality of the image increases over time
19:08AndrewR: karolherbst, for some reason (with both clover and rusticl active) Luxcorerender does not print opencl info w/o that variable set ...
19:08karolherbst: it's a bit buggy...
19:09karolherbst: luxcore doesn't query the devices correctly and then also has internal bugs, it's a bit of a problem... just have only one impl active and it mostly works
19:11AndrewR: karolherbst, yeah .... but for now workaround seems to work for me :)
19:13AndrewR: https://github.com/LuxCoreRender/LuxCore/blob/2f35684a04d9e1bd48d6ffa88b19a88871e90942/src/luxrays/devices/ocldevice.cpp#L89 - it was failing here ...
19:13karolherbst: mhhh
19:14karolherbst: ohh right.. if clover doesn't advertise devices it fails :)
22:45cambrian_invader: sanity check: I'm getting EACCES from DRM_IOCTL_GEM_OPEN because it doesn't have DRM_RENDER_ALLOW set and it's being called with the gpu_fd from lima_bo_import
22:46cambrian_invader: the obvious fix is to use kms_fd instead
22:46cambrian_invader: but shouldn't this get caught by anyone who runs lima?
23:02edt: Just installed an am5 mb with an rx7700 and an rx6600xt gpu. When looking in nvtop, I often see both gpus (builtin and 6600) active. This happens in both games and video playback. I am using mesa 21.1.3 on linux 6.4.3. How is mesa utilizing the builtin gpu? and for what?
23:03edt: that's a ryzen 7700
23:17penguin42: edt: I don't know the answer, but does radeontop allow you to see activity on each?
23:21edt: never figured out how to get radeontop to show the builtin gpu - nvtop generally gives better info
23:21penguin42:didn't realise nvtop did anything on Radeon
23:22penguin42: edt: I see radeontop has an option -b for bus and -p for path
23:22penguin42:only has the one card
23:22edt: -p /dev/dri/card0 dies, card1 shows the rx6600xt
23:23edt: card0 is the builtin
23:23edt: I suggest you try nvtop - surprised me how well it works with radeon (radv)
23:34edt: apparently nv in nvtop stands for Neat Videocard
23:34penguin42: haha
23:35edt: and it works for amd, intel and nvidia
23:36penguin42: hmm very pretty
23:37HdkR: nvtop's name changed once it gained support for more than NVIDIA
23:37HdkR: It also supports Adreno :)
23:40edt: It's interesting that it shows both gpus (builtin on ryzen 7700 (2 cus), rx6600xt (32 cus)) as used. I'd like to understand what is happening.
23:41edt: btw I use two displays, both are connected to the rx6600xt
23:41edt: one via DP and the other via HDMI
23:42penguin42: edt: So do you ever see the builtin doing anything on nvtop ?
23:46edt: I just seen the gpu & memory traces moving up & down (like the other gpu but not matching it). It's probably good. An extra 2 cus probably help in some cases. I'd just like to understand how/why this happens. From what I read 'crossfire' does not work on linux. Looks like something like it does though...
23:46edt: I just see...
23:54edt: searching a bit shows crossfire was replaced by xDMA, and there are patches for xdma in linux 6.3. They are for AMD-Xilinx Alveo cards. Wonder if these patches are also helping/working here???