07:44pmoreau: karolherbst: May I take in your local mem patch in my series instead of my own patch? (https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/d43a0fee1c60a072e1b7505e063dda8f2e393c3b)
07:50pmoreau: Also, what is the most recent family supported for OpenCL? It looks like Mesa supports up to Turing, so I’m guessing same thing for OpenCL?
08:17shining: Hello, I just installed Debian on a laptop that has 2 graphic cards, one intel and one nvidia. It's the first time I have a laptop with hybrid graphics and am quite confused by it. It seems like nouveau is not very happy because it triggers timeout and WARNING logs in dmesg. And it shows up as DynOff in vgaswitcheroo/switch. But then why does Xorg talk only about nouveau and not about i915 at all ? In
08:17shining: the end I get llvmpipe for mesa/opengl. More info at https://pastebin.com/7bRzrv1E
08:27shining: To add to the confusion, gdm switches back to wayland by default, which is why xrandr --listproviders did not work. I do see 2 providers when on xorg : https://pastebin.com/N6ccPYVe
08:27shining: By the way do you know how to get any logs on wayland, any equivalent of Xorg.0.log ?
08:34pmoreau: Hello, regarding your last question I sadly have no idea.
08:35pmoreau: Going through the logs, Xorg does see both cards:
08:35pmoreau: > modeset(0): using drv /dev/dri/card0 <-- Presumably the Intel one
08:35pmoreau: > modeset(G0): using drv /dev/dri/card1 <-- NVIDIA one
08:35pmoreau: But somehow it does not end up picking Intel for card0, but instead gets llvmpipe
08:35pmoreau: > modeset(0): Refusing to try glamor on llvmpipe
08:36pmoreau: Maybe something related to this?
08:36pmoreau: > VGA arbiter: cannot open kernel arbiter, no multi-card support
08:38pmoreau: Since you have a laptop with two GPUs and this was detected, the discrete GPU (the NVIDIA one) will be automatically suspended to save power, and will automatically resume if you try to use it.
08:39pmoreau: shining: ^
08:42shining: ah yes thanks, that's clearer, I guess I can ask in intel then
08:43shining: About the timeout and WARNING logs in dmesg, that's related to the suspend then ?
08:44pmoreau: No, I believe it is related to the PMU firmwares missing
08:44pmoreau: > nouveau 0000:2d:00.0: pmu: firmware unavailable
08:45pmoreau: I’m not sure whether the timeout is expected, though.
08:46pmoreau: I don’t think NVIDIA has released any PMU firmwares for that GPU.
08:49shining: it's not that ? https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=2567e092339cd3403d697dc2e0967c31b7acb989
08:51pmoreau: No, you would need the "Pmu" ones (which are available for the gp10b, but not for the gp108).
08:54shining: and what do the PMU firmwares handle exactly ? it's still possible to use the card without them ?
08:54pmoreau: You would need something in nvidia/gp108/pmu
08:55shining: yep I have nothing there indeed.
08:55pmoreau: PMU stands for Power Management Unit IIRC, so anything from reclocking the card to changing the fan speed.
08:56pmoreau: It’s possible that the fans aren’t controlled by the GPU since it’s a laptop.
08:58shining: oh I found a patch for the timeout : https://lkml.org/lkml/2021/2/14/1
08:59shining: but no answer and no merge apparently
09:00pmoreau: skeggsb: ^ Did you see that patch coming by?
09:02pmoreau: I am not sure if removing that code for all cards and not just GP108 is a good idea.
09:03pmoreau: You should probably still be able to use the card, even with the timeout.
09:04shining: What about xrandr --listproviders output ? it looks bad for both cards, doesn't it ?
09:04shining: with name:modesetting for both
09:05pmoreau: I can’t seem to find the xrandr output in your paste
09:06shining: see the second pastebin link
09:07pmoreau: Ah, missed it
09:08pmoreau: Both use the modesetting DDX, but I believe you should still be able to configure it as you want.
09:08pmoreau: Atm in X, what do you get if you run `DRI_PRIME=0 glxinfo -B` and `DRI_PRIME=1 glxinfo -B`?
09:12shining: llvmpipe in both : https://pastebin.com/rfTQyn4U
09:13pmoreau: Oh, I missed that provider 0 does not have the sink offload capability.
09:17pmoreau: You might want to first figure out how to get the Intel driver to be used instead of llvmpipe for card0, and then we can have another look at the offloading.
09:20shining: yep, thanks for your help !
09:44pmoreau: karolherbst: Small error: `param_max_size()` returns the max size in bytes that can be occupied by all args to a kernel, but 128 is the max number of arguments to a kernel when `param_max_size() == 1024`. So comparing `param_max_size()` to 128 does not make sense to me. https://gitlab.freedesktop.org/karolherbst/mesa/-/blob/rusticl/src/gallium/frontends/rusticl/core/device.rs#L62
11:38karolherbst: pmoreau: yeah.. mhh, it's bad wording in the spec I think? let me see..
11:38karolherbst: pmoreau: from the CL 1.0 spec: "Max size in bytes of the arguments that can be passed to a kernel. The minimum value is 256."
11:38karolherbst: CL 2.1: "Max size in bytes of all arguments that can be passed to a kernel. The minimum value is 1024 for devices that are not of type CL_DEVICE_TYPE_CUSTOM. For this minimum value, only a maximum of 128 arguments can be passed to a kernel."
11:39karolherbst: so.. dunno
11:39karolherbst: I'd assume bytes :D
11:39karolherbst: number of parameters makes literally no sense
11:39karolherbst: because a 1GB struct can be one parameter
11:40karolherbst: and you have to support 128 of them?
11:40karolherbst: good luck
11:41karolherbst: pmoreau: we should probably open an MR against the CL spec and change the phrasing
11:44karolherbst: pmoreau: and yeah, turing supports CL alright
11:46karolherbst: shining: your GPU crashed
11:46karolherbst: or is off..
11:46karolherbst: not sure what is happening there either
11:46karolherbst: ehh wait. it didn't crash
11:48karolherbst: what the....
11:48karolherbst: shining: your system seems super broken tbh
11:49karolherbst: "[ 6764.587] (EE) modeset(0): glamor initialization failed" -> that is intel
11:49karolherbst: ahh, pmoreau already said something like this
12:27shining: karolherbst: yes it looks like the first issue is intel, I tried to blacklist nouveau but it does not help. I still have glamor initialization failed.
12:28shining: what else is broken ?
13:03pmoreau: karolherbst: Re “and you have to support 128 of them?” note that it says a **maximum** of 128 arguments, not minimum.
13:05pmoreau: And good to know for Turing! I was planning on replying on the ML regarding OpenCL support, unless you want to do that.
13:20Tom^-laptop: changing pstate on NVIDIA Corporation GF108 [GeForce GT 440] isn't implemented yet i assume? getting "/sys/kernel/debug/dri/0/pstate: Function not implemented" when doing it.
13:28Tom^-laptop: or does karolherbst keep some experimental repo around for that like he did with kepler in the past :p
13:48pmoreau: Tom^-laptop: skeggsb had some experimental Fermi reclocking (https://github.com/skeggsb/nouveau/tree/devel-clk) but IIRC it was not working for everyone at the time, and it hasn’t been updated in a while. So current status is indeed: not implemented yet.
13:56Tom^-laptop: pmoreau: ah ok
14:11karolherbst: shining: dunno.. hard to say. I'd ask in #intel-gfx
14:12karolherbst: pmoreau: nono, it's a cap
14:12karolherbst: but yeah...
14:12karolherbst: it's stupid
14:12karolherbst: I'd just assume bytes
14:12pmoreau: Right, so you don’t need to support 128 arguments. You could only support 10.
14:12karolherbst: the 128 is for _custom_ devices anyway
14:13karolherbst: pmoreau: sure, but only for custom devices
14:13karolherbst: but I think it's wrongly phrased anyway
14:13karolherbst: for anything not custom it's the size in bytes
14:13karolherbst: and why is it different for custom devices?
14:14pmoreau: I do not read it that way
14:15pmoreau: The minimum for non custom devices is 1024 (bytes). For that minimum size (of 1024 bytes), up to 128 arguments can be passed to a kernel.
14:15pmoreau: This is what I understand.
14:15pmoreau: It says nothing about what the requirements are for custom devices, only for non-custom ones.
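
A minimal sketch of the reading pmoreau describes, assuming CL_DEVICE_MAX_PARAMETER_SIZE is a byte budget for all kernel arguments together. The helper name `kernel_args_fit` is hypothetical and is not rusticl or Mesa code; argument alignment and padding are ignored for brevity.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: decides whether a kernel's arguments fit under a
// device's CL_DEVICE_MAX_PARAMETER_SIZE. `max_parameter_size` is treated as
// a byte budget for all arguments together; the number of arguments is
// deliberately never compared against it.
bool kernel_args_fit(const std::vector<std::size_t> &arg_sizes,
                     std::size_t max_parameter_size)
{
    const std::size_t total =
        std::accumulate(arg_sizes.begin(), arg_sizes.end(), std::size_t{0});
    return total <= max_parameter_size;
}
```

Under this reading, 128 eight-byte arguments add up to exactly 1024 bytes, while a single 1024-byte struct uses the whole budget on its own.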
14:18karolherbst: pmoreau: then I'd ignore this 128 value
14:18karolherbst: because.. then it doesn't tell you anything
14:18pmoreau: Yeah, I would too
14:18karolherbst: you can always support more :D
14:18karolherbst: I think it's related to int16 size or whatever?
14:19karolherbst: it makes no sense actually
14:19pmoreau: I’m guessing it’s there because there were some devices out there that had that weird requirement and they still wanted to be able to support 1.2 on them.
14:20pmoreau: I haven’t been able to find a single reference in the whole spec, apart to this very instance, that says anything about how many arguments one can pass to a kernel.
14:21pmoreau: *apart from
14:21karolherbst: yeah, because it makes no sense
14:21pmoreau: Even in the OpenCL C spec nothing is said about it.
14:21pmoreau: But SPIR-V has a limit on it, and the OpenCL SPIR-V Env spec says nothing about lifting those restrictions.
14:21karolherbst: maybe it was left by accident
14:29pmoreau: Speaking of accidents: the current 1.0 and 1.1 specs say that CL_DEVICE_LOCAL_MEM_SIZE should be at least 1 KB, when technically the 1.0 spec should say 16 KB and the 1.1 one should say 32 KB…
14:29pmoreau: And the 1.1 spec still says 256 bytes for CL_DEVICE_MAX_PARAMETER_SIZE instead of 1024.
14:46pmoreau: Disregard my previous comment: I hadn’t realised I was reading the table for the embedded profile… 🤦
17:02pmoreau: karolherbst: I don’t think you answered (or if you did, I’m sorry and missed it): may I replace my own local mem alignment with yours, which you linked to in the MR?
17:37karolherbst: pmoreau: sure, I am not sure what the best approach is, but even though I don't like sharing the one field for both, in the end I think it's still fine, as local pointers have to be aligned for $silly reasons
17:37karolherbst: *pointer aligned
17:39pmoreau: If you already had converged with curro that this was the way to go, I don’t mind following it even if I’d prefer not sharing.
17:39karolherbst: yeah.. we had a discussion somewhere about it
17:39karolherbst: one MR and then I dropped the patch again or something
17:41pmoreau: I’ll push an updated version after dinner; I went ahead and added all the constraints including for 3.0; for the 2.x ones I have them in a separate patch but I don’t think I’ll submit that one in this MR and keep it for when some OpenCL 2.x features get added.
18:51imirkin: pmoreau: fwiw that fermi reclock stuff isn't just "not working for everyone", it works for ~no one. using the same gpu as skeggsb i couldn't get it to (properly) work.
18:54RSpliet: I can confirm that my branch is no better; in fact, it might be worse. It only worked half on that one Fermi card I owned
18:54RSpliet: But a careful combination of the two branches may even work on three different cards
19:08pmoreau: I see… 😅
19:09imirkin: RSpliet: either that, or will make it work on zero person's cards ;)
19:09RSpliet: that's the wrong kind of careful
19:22pmoreau: Is there a way to tell the register allocation to use a lower number of registers and spill if it goes above that?
19:22imirkin: pmoreau: yeah
19:22imirkin: we do that for fermi+
19:22imirkin: based on number of threads
19:23pmoreau: That’s exactly what I’m after 🙂
19:23imirkin: let me find it...
19:23imirkin: it's not obvious how it's done
19:23imirkin: i wasn't super-proud of how i did it
19:23imirkin: but i also couldn't come up with something materially cleaner
19:23pmoreau: Found it
19:25pmoreau: I didn’t think the number of threads would be stored in `Target`
19:26imirkin: yeah, exactly - not super-proud of it ;)
19:26imirkin: numThreads was the thing to search for
19:26imirkin: iirc with opencl, the number of threads isn't baked into the shader
19:26imirkin: so you're stuck either (a) picking a "safe" max value or (b) recompiling
19:26pmoreau: It can be via attributes, but does not have to.
19:27pmoreau: At the same time, we currently compile right before launching the kernel so we should know the actual value
19:28imirkin: this was made for (a) GL shaders which explicitly specify and (b) additional support for the "variable" local size ext (whose name escapes me)
19:28imirkin: in the latter ext, you're allowed to expose a lower number of threads
19:28pmoreau: We will need to change that, as OpenCL has an API to query some data about the kernel before launching it, for example to allow the application to pick a saner block size based on the compilation results.
19:28imirkin: ah ok
19:29imirkin: yeah, probably more gallium APIs will be required to expose that sort of info
19:29imirkin: as all these things are completely opaque atm
19:30pmoreau: So right now I am going with: let’s go with what we have and use the final amount of threads since we have it, to avoid launching kernels using too many threads.
19:31imirkin: that should already be picked up
19:31imirkin: if (threads == 0)
19:31imirkin: threads = info->target >= NVISA_GK104_CHIPSET ? 1024 : 512;
19:31imirkin: but the nv50 target has to respect that
19:31imirkin: look at how the nvc0 target does it
19:31pmoreau: Right, the nv50 target completely ignores the amount of threads or shared memory used for now.
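
One plausible way a thread count can turn into a register limit is to divide the register file among the resident threads and clamp to what the ISA can encode. This is a sketch only: `RegFileInfo`, `gpr_budget`, and any numbers passed in are assumptions for illustration, not the nvc0 target's actual code.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative description of a register file; real values depend on the chipset.
struct RegFileInfo {
    std::uint32_t regs_per_sm;         // registers shared by all resident threads
    std::uint32_t max_regs_per_thread; // upper bound imposed by the ISA encoding
};

// Hypothetical helper: how many registers each thread may use so that
// `threads` threads can be resident at once. A `threads` value of 0 means
// "unknown" and falls back to `default_threads` (which must be non-zero),
// similar in spirit to the `threads == 0` fallback quoted in the log.
std::uint32_t gpr_budget(const RegFileInfo &rf, std::uint32_t threads,
                         std::uint32_t default_threads)
{
    if (threads == 0)
        threads = default_threads;
    return std::min(rf.max_regs_per_thread, rf.regs_per_sm / threads);
}
```

The register allocator would then treat the returned value as the size of the register file and spill anything that does not fit.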
19:32pmoreau: OpenCL allows 3 different setups: 1) give the block size at run-time with no hints for compile-time, 2) give the block size at run-time and some hints for compile-time, 3) give the block size at run-time and compile-time.
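
The three setups, combined with imirkin's "safe max or recompile" trade-off, could be sketched as a small policy. The names below are hypothetical; the `required` and `hint` fields presumably correspond to the OpenCL C `reqd_work_group_size` and `work_group_size_hint` kernel attributes, flattened here to a total thread count.

```cpp
#include <cstdint>

// Compile-time knowledge about the work-group size; 0 means "not specified".
struct WorkGroupInfo {
    std::uint32_t required; // size fixed at compile time (setup 3)
    std::uint32_t hint;     // compile-time hint only (setup 2)
};

// Hypothetical policy: thread count to assume when compiling the kernel.
// With no compile-time information (setup 1), fall back to a safe maximum.
std::uint32_t threads_for_compile(const WorkGroupInfo &wg, std::uint32_t safe_max)
{
    if (wg.required != 0)
        return wg.required;
    if (wg.hint != 0)
        return wg.hint;
    return safe_max;
}

// A hint is not binding, so a launch may ask for more threads than the
// kernel was compiled for; in that case the register budget is too generous
// and the kernel has to be recompiled.
bool needs_recompile(std::uint32_t compiled_for, std::uint32_t launch_threads)
{
    return launch_threads > compiled_for;
}
```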
19:32imirkin: well, you'll work it out. i actually gtg
19:32imirkin: back in a small while
19:33pmoreau: No worries
20:09pmoreau: There are some interesting things going on 😀 (yes, those two instructions are right after each other)
20:09pmoreau: > st b32 l[0x10] $r0
20:09pmoreau: > ld b32 $r0 l[0x10]
20:10tertl3: i set up a RHEL in gnome boxes
20:10tertl3: i guess there's no way to get gpu passthrough on it?
20:11tertl3: usually when I try red hat, i never get past the subscription thing
20:11imirkin: pmoreau: yeah, that happens ... spilling is imperfect.
20:11pmoreau: (`l[0x10]` is accessed multiple times afterwards, so it does make sense for it to be spilled.)
20:11imirkin: pmoreau: normally we try to fold that sort of thing together, but in some instances we fail
20:12pmoreau: Well that will be part of trying to optimise things further, once we get something working.
20:12imirkin: yeah, i've tried to slay that dragon several times
20:19imirkin: iirc there are problems with phi nodes/etc
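
The store/load pair pmoreau pasted is the kind of thing a peephole pass can fold. A minimal sketch, assuming straight-line code, equal access widths, and a toy instruction type (`Insn`, `Op`, and `fold_store_load` are invented for illustration and are not the nv50 IR):

```cpp
#include <cstdint>
#include <vector>

enum class Op { Store, Load, Other };

// Toy instruction: `reg` is the register read by a Store or written by a
// Load, `addr` is the offset of the l[] slot, e.g. 0x10.
struct Insn {
    Op op;
    int reg;
    std::uint32_t addr;
};

// Drops a load that directly follows a store of the same register to the
// same slot: the register still holds the value, so the load is redundant.
// The store itself is kept because later code may still read the slot.
std::vector<Insn> fold_store_load(const std::vector<Insn> &in)
{
    std::vector<Insn> out;
    for (const Insn &i : in) {
        if (i.op == Op::Load && !out.empty()) {
            const Insn &prev = out.back();
            if (prev.op == Op::Store && prev.reg == i.reg && prev.addr == i.addr)
                continue; // value is already in the register
        }
        out.push_back(i);
    }
    return out;
}
```

This handles only the adjacent case; anything crossing basic blocks or involving phi nodes needs considerably more care, which is the difficulty imirkin alludes to.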
20:55pmoreau: I didn’t know/had forgotten that NOUVEAU_MESA_DEBUG was a thing.
20:56pmoreau: I think the errors I am now getting on those kernels that used to use too many regs before constraining them, are due to some misconfiguration of tls.
21:32imirkin: pmoreau: i'm currently working on another dragon right now ... i realized that the way addresses are loaded on nv50 is not optimal, so will see if i can improve it
21:32imirkin: basically a few ops can work on address regs directly
21:33imirkin: but in most cases we end up doing them on regular regs, and then move into addr reg