15:02kwizart: hi there, any reason why nouveau wouldn't be able to turn off the gpu on an optimus system ?
15:03coderobe: kwizart: depends on the gpu i suppose
15:08kwizart: why would pm depend on a given gpu ? at least if using a kernel without secure boot enabled (or any lockdown feature) I don't see a reason for nouveau not to disable the hw when not needed
15:18pmoreau: The functionality is not always implemented. For example on Macs, where there is some interaction with the apple_gmux driver (at least for old laptops, not sure about recent ones), it is not supported upstream (though patches do exist).
15:20pmoreau: Outside of Macs, automatic suspend should occur, as long as 1) you/your distribution did not disable automatic suspend, 2) no program used the card recently (i.e. in the last 5-10s IIRC).
15:23pmoreau: I think that in some cases, you can end up with some non-existent active connectors that prevent the GPU from being suspended. I don’t remember whether that comes from misinterpreting data from the VBIOS, or something else.
15:23pmoreau: karolherbst: Did you see my message yesterday/Saturday about SPIRV-Tools?
15:28kwizart: pmoreau, thx for those infos! Could the cases where suspend is prevented be worked around by using something like bbswitch ?
15:33pmoreau: kwizart: I don’t have any experience with bbswitch, so no idea. Could you please share the output from `dmesg`?
15:34kwizart: well, I will try to document it and reference https://nouveau.freedesktop.org/wiki/Optimus/ instead, I expect that bbswitch might be less reliable
15:37pmoreau: If you look at the last section on that page, you could try running `xrandr` to rule out whether it is the VGA port keeping the GPU on.
15:38kwizart: pmoreau, here my dmesg https://paste.fedoraproject.org/paste/f2VSpy0xIrDuzJApijSexw
15:39kwizart: for the record, I'm trying to keep the nvidia driver installed (but unused) but keep using nouveau
15:40kwizart: so I can switch easily between floss (where I hope suspend will work) and nvidia
15:42pmoreau: Could you try uninstalling the NVIDIA driver and see if you still have the issue? You can see from the logs that it’s trying to load the NVIDIA driver quite often; that might end up waking up the card.
15:43kwizart: pmoreau, yep, but it stops after some time, I think I can prevent that easily
15:43kwizart: also seems to work as appropriate in my case; https://paste.fedoraproject.org/paste/LlZcUkHv0~MCXezwkpc7dg
15:44pmoreau: Otherwise the logs look good and vga_switcheroo gets enabled, so it should auto-suspend. How are you checking whether the GPU is still on or not?
15:44kwizart: but I was asking for another case and more generally if it's reliable to rely on switcheroo to disable the gpu if unused
15:44pmoreau: You could try `cat /sys/kernel/debug/vgaswitcheroo/switch` and see whether it’s marked as DynOff or DynOn.
15:47kwizart: yep it's DynOff for me
15:47pmoreau: In most cases, it is fine to rely on switcheroo to disable the GPU if unused. The cases where it doesn’t work are Macs and some Pascal cards which fail to resume properly.
15:47pmoreau: karolherbst should know more about it.
15:48pmoreau: IIRC, he was suspecting that sometimes the GPU was marked as off but still consuming more power than expected. I don’t think it is a widespread issue though.
18:40karolherbst: pmoreau: I think runpm is busted on all laptops with skylake and kabylake CPUs :/
18:40karolherbst: no idea why though
18:44cyberpear: karolherbst: could that explain why my kernel that used to work w/ your runpm_fixes branch no longer works? -- I thought it was the weirdest thing that I had a kernel that worked one day, but didn't another later day (I'm now thinking an intel cpu firmware update caused the breakage?)
18:46karolherbst: probably not? no idea
18:48cyberpear: I'd booted to a newer kernel (w/ your fixes) and it didn't work, but when I booted back to the older kernel (w/ your fixes) that had worked, it no longer worked... I'd suspected hw failure, but when I went back to nvidia blob, it worked again using the blob... was the strangest thing
19:12pmoreau: karolherbst: Try SPIRV-Tools master to avoid the vloadn pointer type validation error.
19:13karolherbst: pmoreau: uff.. I kind of hoped that would have been fixed in 2019.3 by now :D
19:13karolherbst: pmoreau: anyway, I will probably push my changes tomorrow to the clover branch... got a bit annoying but I think I've figured it out
19:13pmoreau: The fix was merged shortly after 2019.3 IIRC, and that was released a couple of months ago already.
19:13karolherbst: ahh, I see
19:14karolherbst: I kind of think curro won't like my new patches either... but we will see then...
19:15pmoreau: Hopefully he will, but we will see.
19:15karolherbst: mhhh... maybe
19:32karolherbst: hakzsam: you were looking into stg/ldg once, right? are you sure it's only for non-coherent memory?
19:45karolherbst: st and ld have this second predicate thing...
19:48karolherbst: pmoreau: nvdisasm-10.1 fixed this stupid sched thing
19:48karolherbst: it now accepts a 0x0 sched opcode for maxwell
19:50pmoreau: What was the issue? What does a sched opcode of 0x0 do: no stall at all?
19:53karolherbst: it's invalid
19:53karolherbst: so nvdisasm bailed
19:54karolherbst: pmoreau: currently trying to figure out what's the difference between ld.g/ldg and st.g/stg....
19:54karolherbst: but... it doesn't seem like they are
19:54karolherbst: the generic ld/st have this second weirdo predicate though
19:57karolherbst: there is a difference
19:57karolherbst: st/ld can take a 32 bit immediate, stg/ldg only 24
19:57karolherbst: so this leaves 8 bit for random stuff
19:59karolherbst: huhm... stg/ldg need 12 bits more for the encoding
19:59curro: karolherbst: i like it about as much as your last one :P
20:00karolherbst: curro: :p.. yeah
20:00karolherbst: but there is a problem I actually have to address
20:00karolherbst: but yeah
20:00karolherbst: I don't think that clover's design right now actually represents CL that well, so that's quite a clash
20:01karolherbst: but I have an idea now to make everybody happy.. just have to see how that works out
20:03karolherbst: pmoreau: mhh st mask is 0xe000000 stg mask is 0xfff80000 :/
20:03karolherbst: that makes no sense
20:03karolherbst: why using stg over st...
20:05karolherbst: nvidia even prefers using stg if I need a bigger offset than 24 bit
20:05karolherbst: HdkR: you don't know if the st vs stg thing is anything public?
20:05karolherbst: the nvdisasm docs are always a bit too vague on that stuff
20:14curro: karolherbst: what do you think isn't represented well?
20:18karolherbst: the fact that we don't have to compile down to the hardware at cl_program creation, but at the latest at cl_kernel creation time.
20:18karolherbst: well... I guess you could compile later if you don't want to give different data through clGetKernelInfo
20:18karolherbst: uhm.. clGetKernelWorkGroupInfo I mean
20:19karolherbst: right now the only reasonable thing to do is to compile down to the hardware at cl_program creation time, so that you can return adjusted values for clGetKernelWorkGroupInfo
20:20karolherbst: and this kind of requires drivers to be able to serialize their shaders, which... isn't exactly trivial. Was looking at adding this for nouveau, but it's quite a lot of work
20:20karolherbst: so.. but the cl API is totally fine if we return an IR or _something_ for the binary info stuff
20:21karolherbst: so my idea would be to kind of handle the multiple entry point stuff inside clover... maybe even compile several nirs when the cl_program object is created... serialize it and save it
20:22karolherbst: when cl_kernel is created, do a bind on the shader object and pass the serialized NIR or spirv into the driver... bind again when the kernel gets launched, which usually skips compilation as long as the same cso is used
20:22karolherbst: + add a gallium callback to retrieve the information needed for clGetKernelWorkGroupInfo
20:29karolherbst: at least that's my current plan
20:30karolherbst: robclark wrote something towards this: https://github.com/karolherbst/mesa/commit/894280761822d4440bdd59e2019f17bd82f1a837
20:30HdkR: karolherbst: Nvidia never publicly documents the details of the instructions :P
20:30karolherbst: but I would make some bigger changes to it
20:31karolherbst: HdkR: I know.. but usually in the developer forums all kind of questions are answered
20:31karolherbst: and usually there is always that person wondering about those crazy differences
20:31karolherbst: and sometimes they even get answers
21:03curro: karolherbst: it's designed that way because of... the CL API. as you probably know the problem is that at cl_program compilation time we have no cl_command_queue and therefore no pipe_context to compile the program with. that's why compilation is done the first time a kernel is launched. ideally we would have a pipe_screen entry-point allowing us to compile a program into a native binary which we could bind later on to a compute
21:03curro: state object by using PIPE_SHADER_IR_NATIVE, but that's not possible without some rework of the Gallium API
21:04karolherbst: well, right now for llvm based drivers the compilation is already finished after the cl_program thing was fully linked
21:05karolherbst: and at kernel bind time the appropriate entry point is simply selected by radeonsi/r600
21:05karolherbst: but they don't compile anymore
21:05karolherbst: that's already done by clover
21:07karolherbst: also compiling at kernel launch time is too late anyway, so clover can't rely on that
21:07curro: karolherbst: because LLVM has a back-end able to codegen for them so Clover doesn't depend on the Gallium interface in order to compile a program...
21:07karolherbst: compilation has to be done before clGetKernelWorkGroupInfo can be called
21:07curro: karolherbst: you say it's a lot of effort to serialize the shader binaries [why? don't you have to represent the program as a string of bits eventually before you give it to the hardware?], but it's the kind of effort that would likely pay off because you'd also be able to use the same serialization scheme for driver binary caching
21:07karolherbst: curro: meta data
21:08karolherbst: sure.. the serialization part itself isn't all that complicated, but the format has to be valid for the shader cache as well
21:08karolherbst: because drivers will use it for that as well
21:09karolherbst: but the painful part is we have a huge struct which contains input and output data of the compilation process
21:09karolherbst: and all that stuff has to be split up and everything
21:09karolherbst: I am mostly through this, but it's more work than one would expect
21:10curro: hm, that's unfortunate
21:10karolherbst: the problem is: compiling at bind time is too late
21:11karolherbst: I kind of like the idea robclark had with his patch... it just needs some work and maybe I can fit it quite well into the current thing
21:11curro: karolherbst: i mean, it works fine if you're willing to give conservative values via clGetKernelWorkGroupInfo, which is mildly inefficient but fully compliant. but yeah i see the benefit of compiling at CL linking time somebody just needs to do the Gallium API work
21:12karolherbst: and have a CAP or a way to select a path for "clover compiles" or "driver compiles"
21:12karolherbst: ohh, it requires an API rework anyway
21:12karolherbst: or at least some additions
21:12karolherbst: the idea would be to create a cso with compiled binary at cl_kernel creation time
21:12karolherbst: and query the info with a new cb
21:13karolherbst: so get_workgroup_info(pipe, cso, &out_data) or something
21:13karolherbst: just need to try out a few things
21:15karolherbst: ohh it's called cs inside core::kernel, not cso
21:18curro: karolherbst: it seems like a bit of a hack compared to adding kernel compilation entry-points to pipe_screen, but i'm okay with the dummy context approach as long as you add an interface to pipe_context to export serialized binaries (which you could later on import using IR NATIVE) instead of the hacks for sharing CSOs across contexts
21:19karolherbst: yeah.. I am not entirely sure about this part either, but when the kernel is created there is no context anyway :(
21:20karolherbst: I think I will skip the exporting serialized binaries though, because that's really not required
21:20karolherbst: it's good enough to store the spir-v/serialized nir inside clover and export/import that through the CL API
21:20karolherbst: and then we just have the spirv/nir -> binary compilation inside the driver
21:21karolherbst: and I was thinking about adding a new gallium driver cb which does the nir deserialization for every driver and then calls into the driver to compile nir_shader -> binary, in order to fix this .so boundary issue as well
21:21karolherbst: + it keeps the deserialization in common code
21:22karolherbst: + we might not have to add a NIR_SERIALIZED IR type
21:22karolherbst: but that's what I want to figure out tomorrow
21:23karolherbst: and this cb would be helpful for other state trackers using the dynamic pipe loader, if new ones show up
21:27curro: karolherbst: hmf... i don't think a callback for deserializing NIR would solve any real problems...
21:28karolherbst: well, the alternative is to add a new IR type and let drivers do the deserialization ...
21:28curro: karolherbst: this was under the assumption that you have a dummy context available in the clover::device, so yeah you would have a context at that time and at the time the CL program is built
21:28karolherbst: but I dislike the new IR type...
21:29curro: karolherbst: I don't see the need for a new IR type, i think we should make the serialized form the default one
21:29karolherbst: I think everybody else would complain then
21:29karolherbst: because right now drivers use it for the nir_shader thing
21:30curro: karolherbst: why would anybody complain about a simplification of the state tracker
21:30karolherbst: and adding a new IR type would kind of be a valid compromise
21:30curro: +API cleanup + optimization
21:30karolherbst: why is it simpler? why is it an optimization?
21:31karolherbst: right now, for the driver it's simple: it gets the nir_shader * and can just start using it
21:32curro: karolherbst: sigh, i argued these three points in detail in the static pipe loader MR you sent, i don't have the time to do it again here ATM
21:33karolherbst: right... I am just not convinced it's an optimization. It might be a cleanup.. but for drivers it would be more complicated as they would have to deserialize in addition ;) I agree that the runtime overhead of doing that might not be significant though
21:33karolherbst: and it might not matter
21:34karolherbst: but then somebody wants to reduce API overhead again... and sees this and thinks: why are we passing the serialized nir into the driver, if we could pass in the nir_shader* directly?
21:34karolherbst: and then we have the argument later again
21:34karolherbst: especially because radeonsi might actually ditch the llvm backend in the future and go NIR + ACO only
21:35curro: karolherbst: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/833#note_160375
21:35curro: it would have net negative overhead in the typical case
21:37karolherbst: mhh.. right.. forgot the cache hit case..
21:37karolherbst: curro: I think with shader variants you might actually get to the point where you have multiple deserializations though
21:37karolherbst: on the same shader
21:38karolherbst: but that's highly driver specific
21:38karolherbst: but true.. the same would be cloned anyway
21:39karolherbst: I think I will just write the patch and check what's actually changing in the code. Just to know how that would look in the end
21:42karolherbst: and maybe even do some benchmarks with shader-db
21:45curro: sounds good. i think it has a good chance of allowing you to remove a bunch of useless if (nir) cases from the state tracker