15:51js: Not sure if this was already reported, but in case it's not, here's a success story: Nouveau works fine on an RTX 3060 Laptop GPU with an external display. 3440x1440 at 100 Hz worked (monitor could do 160, though), but it doesn't reach anywhere near 100 FPS, meaning the desktop is choppy. Same on a 2560x1440 60 Hz display, it's more like 30 FPS. But absolutely usable for coding, so great work! I'm guessing this is just the missing reclocking support?
15:51js: Even stuff like dragging windows between the AMD iGPU and the external nouveau display work fine, even Quake runs with nouveau 🙂
15:52js: Ha I jinxed it. Just after writing this message it crashed. Never did that before. Odd.
15:55karolherbst: js: it's all software rendered so far. We have some patches you could try out if you are interested though
15:56karolherbst: that reminds me.. I should at least get the mesa patches in asap
15:56karolherbst: mhh although in this case it's probably more the PCIe and memory limitations
15:57karolherbst: anyway... we are planning to use the nvidia release gsp firmware for that GPU, so it all might be fixed "soon"
16:12js: karolherbst: No way this is software rendering? I get 30 FPS in Quake in 2560x1440 and stable 60 FPS in 1280x720
16:12js: and yeah happy to try patches 🙂
16:12karolherbst: which quake?
16:12js: let me try Q2
16:12karolherbst: CPU is probably fast enough :)
16:12karolherbst: or it's intel
16:13js: no Intel in this machine except WiFi 😉
16:13js: but yeah, AMD for the iGPU
16:13karolherbst: you can check with "glxinfo" but I'd assume it uses llvmpipe
16:13js: you think the iGPU is rendering to a buffer that is displayed by nouveau?
16:13karolherbst: we don't have any official GL support for ampere at this point
16:13js: how useful is glxinfo on a Wayland system? 🙂
16:13karolherbst: you still have xwayland
16:14js: Extended renderer info (GLX_MESA_query_renderer):
16:14js: Vendor: AMD (0x1002)
16:14js: Device: AMD RENOIR (LLVM 14.0.0, DRM 3.46, 5.18.11-200.fc36.x86_64) (0x1638)
16:14karolherbst: then it uses the AMD gpu
16:14js: huh, so it's the AMD GPU rendering into a buffer on the Nvidia?
16:14js: so the problem is probably too low bandwidth on the copying?
16:14karolherbst: on linux we render everything with the "main" GPU by default
16:14karolherbst: you can switch over with "DRI_PRIME=1 glxinfo" e.g.
16:14karolherbst: but because we don't support GL with nouveau on ampere at this point
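The check discussed above (which GPU actually renders, and whether Mesa fell back to llvmpipe) could be scripted roughly as follows. This is a minimal sketch and not something from the conversation: the helper names `parse_renderer_info`, `is_software_rendered`, and `run_glxinfo` are made up for illustration, and it assumes `glxinfo` is installed and prints `Vendor:` and `Device:` lines in the form js pasted.

```python
import os
import re
import subprocess


def parse_renderer_info(glxinfo_output):
    """Extract the first Vendor/Device pair from glxinfo output."""
    info = {}
    for line in glxinfo_output.splitlines():
        # search() rather than match(), so prefixed lines (e.g. a chat log) also work
        found = re.search(r"\b(Vendor|Device):\s*(.+)", line)
        if found:
            key = found.group(1).lower()
            # Keep only the first occurrence of each key.
            info.setdefault(key, found.group(2).strip())
    return info


def is_software_rendered(info):
    """True if the reported device looks like Mesa's llvmpipe software rasterizer."""
    return "llvmpipe" in info.get("device", "").lower()


def run_glxinfo(prime=False):
    """Run glxinfo, optionally with DRI_PRIME=1 to ask for the secondary GPU."""
    env = dict(os.environ)
    if prime:
        env["DRI_PRIME"] = "1"
    result = subprocess.run(["glxinfo"], env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout


# Sample taken from the lines pasted in the conversation.
SAMPLE = """Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD RENOIR (LLVM 14.0.0, DRM 3.46, 5.18.11-200.fc36.x86_64) (0x1638)
"""
```

Calling `parse_renderer_info(run_glxinfo())` gives the default renderer, while `run_glxinfo(prime=True)` is the scripted form of `DRI_PRIME=1 glxinfo`.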
16:14js: I mean, TBQH, I only care about the nvidia GPU for external display, not for performance 😉
16:15karolherbst: the limiting factor you run into is PCIe bandwidth and memory bandwidth
16:15karolherbst: the new nvidia GSP firmware should solve this as with that we are capable of speeding up the GPU
16:15js: I think it's not only that, because the 3440x1440@100 display actually worked smoother than the 2560x1440@60
16:16karolherbst: I worked on speeding up the PCIe link on previous gens, but have no idea if the same code still works on ampere. Not worth investigating it though
16:16karolherbst: that would be slightly odd
16:16karolherbst: js: same connector?
16:16js: I noticed the higher I set the Hz, the more I get - but I never get the 100%
16:16js: yeah, USB-C
16:16karolherbst: that could use the AMD GPU
16:16js: nope, that's connected directly to the NVidia
16:17js: only the HDMI is directly connected to the AMD 🙂
16:17js: yes, 100%
16:17js: display also remains black without `modprobe nouveau`
16:17karolherbst: why is that different on each laptop :D
16:17karolherbst: on intel + nvidia ones it's usually the case that USB-C is connected with intel and the other ports with nvidia
16:17js: (I have nouveau blacklisted, because otherwise it wakes the GPU and wastes a lot of battery)
16:17karolherbst: we should be able to runpm the nvidia GPU
16:17js: I think it's just this notebook - and one of the reasons I bought it 🙂
16:17karolherbst: maybe you run into the issue that the HDA controller stays awake
16:17js: because it meant I would always have a blob-free external display 😄
16:18js: w /sys/bus/pci/devices/0000:01:00.0/remove - - - - 1
16:18js: w /sys/bus/pci/devices/0000:01:00.1/remove - - - - 1
16:18js: I just have this in tmpfiles.d
16:18js: and then it stops wasting battery
16:18js: when I wanted to use it, I just rescan the PCI bus and modprobe nouveau
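The remove / rescan / modprobe cycle js describes can be sketched in Python as below. This is an illustration under assumptions, not js's actual setup: the function names and the `sysfs_root` parameter are hypothetical (the parameter exists only so the code can be pointed at a scratch directory), and writing to the real `/sys/bus/pci` requires root.

```python
import subprocess
from pathlib import Path


def remove_device(device, sysfs_root="/sys/bus/pci"):
    """Write '1' to the device's 'remove' file, as the tmpfiles.d 'w' lines do."""
    (Path(sysfs_root) / "devices" / device / "remove").write_text("1")


def rescan_bus(sysfs_root="/sys/bus/pci"):
    """Ask the kernel to rescan the PCI bus so removed devices reappear."""
    (Path(sysfs_root) / "rescan").write_text("1")


def bring_up_dgpu(sysfs_root="/sys/bus/pci", load_module=True):
    """Rescan the bus, then load nouveau (needs root on a real system)."""
    rescan_bus(sysfs_root)
    if load_module:
        subprocess.run(["modprobe", "nouveau"], check=True)
```

On the machine from the conversation, the two functions to remove would be `0000:01:00.0` (the GPU) and `0000:01:00.1`.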
16:18karolherbst: what's /sys/bus/pci/devices/0000:01:00.0/power/control by default?
16:18karolherbst: same for .1
16:19karolherbst: 1:00.0 should be auto
16:19karolherbst: but .1 might be "on"
16:19karolherbst: which also has to be auto otherwise we won't be able to turn the GPU off
16:19js: both auto
16:19karolherbst: and status in the same directory?
16:19karolherbst: pointless to check with a display connected
16:20js: in any case, removing it from the PCI bus reduces power draw by ~ 20W
16:20karolherbst: ohh, it's "runtime_status"
16:20karolherbst: yeah.. but we should be able to do the same without having to remove it
16:21js: let me disconnect the screen and see what it says
16:21karolherbst: this whole runpm business is very annoying because some distributions still don't get it and have broken setups there
16:21js: runpm? This is Fedora 36
16:21js: runtime_status is now suspended
16:21karolherbst: (the PCIe controller the GPU is connected to also needs the same settings: control set to auto, and runtime_status needs to change to suspended)
16:21karolherbst: for both .0 and .1?
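The sysfs checks asked for here (`power/control` should be `auto`, `power/runtime_status` should become `suspended`, for every function of the GPU) can be gathered in one small script. This is a hedged sketch: the helper names and the `sysfs_root` parameter are invented for illustration, and the device addresses are the ones mentioned in the conversation.

```python
from pathlib import Path

DEFAULT_ROOT = "/sys/bus/pci/devices"


def read_runtime_pm(device, sysfs_root=DEFAULT_ROOT):
    """Return the 'control' and 'runtime_status' values for one PCI function."""
    power_dir = Path(sysfs_root) / device / "power"
    state = {}
    for name in ("control", "runtime_status"):
        path = power_dir / name
        # A missing file is reported as None instead of raising.
        state[name] = path.read_text().strip() if path.is_file() else None
    return state


def functions_blocking_suspend(devices, sysfs_root=DEFAULT_ROOT):
    """List PCI functions whose power/control is not 'auto'."""
    return [dev for dev in devices
            if read_runtime_pm(dev, sysfs_root)["control"] != "auto"]


if __name__ == "__main__":
    for dev in ("0000:01:00.0", "0000:01:00.1"):
        print(dev, read_runtime_pm(dev))
```

As noted in the conversation, the result is only meaningful with no display connected, since an attached display keeps the GPU in use.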
16:22js: disconnected AC as well now, let's wait a bit to read power draw
16:22karolherbst: do you have a .2 and/or .3 on the gpu? I guess not though
16:22js: linux averages it over the last few minutes and says it's 0W when AC was connected - which is technically correct, since it wasn't running on battery 😉
16:22karolherbst: the pcie bus controller might still be on. Or we don't really support the ACPI bits, but then I wonder why it works with removing stuff
16:22js: nope, no .2 or .3
16:22js: power draw is fine now
16:23karolherbst: maybe you checked with the display connected or something
16:23js: either 5.18 fixed stuff, or it's because I had the display attached before?
16:23karolherbst: or there was a bug and it's now fixed
16:23karolherbst: who knows
16:23js: nope checked after a fresh boot without display before
16:23js: I already had that problem before nouveau would even recognize the GPU at all 🙂
16:23karolherbst: in that case you have to manually change it to auto
16:23js: I think on 5.17 it started detecting it, to my surprise 🙂
16:23karolherbst: it works without loading nouveau though
16:24karolherbst: just nothing flips the option
16:24js: ok, total draw right now is 11W, which is fine
16:24js: usually with the dGPU not removed it would be > 30
16:24karolherbst: less hacks :P
16:24karolherbst: but yeah
16:24js: I'll reboot and see how it is after a fresh boot
16:24karolherbst: good idea
16:25karolherbst: might take some time because the GPU is in use for a short time (or until the boot process all settles)
16:34js: ok this is extremely odd
16:34js: if I immediately after reboot rescan and modprobe nouveau, power draw is fine
16:34js: if I remove all my hacks and do nothing, it draws ~ 30W
16:34js: currently drawing 44W (!)
16:35karolherbst: what are the files saying now?
16:35karolherbst: could also be that the firmware is in a weird state or something
16:35js: also noticed nouveau crashing in dmesg, and the system becoming unresponsive
16:35karolherbst: that's probably the cause
16:36js: [ 9.684513] nouveau 0000:01:00.0: mc: intr 00000040
16:36js: many of those
16:36js: [ 273.558494] nouveau 0000:01:00.0: i2c: aux 0007: begin idle timeout ffffffff
16:36js: and those
16:36karolherbst: okay.. that means the GPU is dead
16:36karolherbst: or well
16:36karolherbst: "stopped responding"
16:36js: followed by this every now and then:
16:36js: [ 230.915572] nouveau 0000:01:00.0: DRM: failed to idle channel 0 [DRM]
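To see how often each kind of nouveau message shows up in a longer dmesg dump, the lines can be grouped by the subsystem tag (`mc`, `i2c`, `DRM`). The following is only a sketch: the regular expression is fitted to the three lines pasted above and may not cover other nouveau messages, and `count_nouveau_messages` is a made-up name.

```python
import re
from collections import Counter

# Matches e.g. "[ 9.684513] nouveau 0000:01:00.0: mc: intr 00000040"
NOUVEAU_LINE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+nouveau\s+(?P<device>\S+):\s+"
    r"(?P<subsystem>\w+):\s+(?P<message>.*)"
)


def count_nouveau_messages(dmesg_text):
    """Count nouveau log lines per subsystem tag."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        found = NOUVEAU_LINE.search(line)
        if found:
            counts[found.group("subsystem")] += 1
    return counts


# The three lines quoted in the conversation.
SAMPLE = """[ 9.684513] nouveau 0000:01:00.0: mc: intr 00000040
[ 273.558494] nouveau 0000:01:00.0: i2c: aux 0007: begin idle timeout ffffffff
[ 230.915572] nouveau 0000:01:00.0: DRM: failed to idle channel 0 [DRM]
"""

if __name__ == "__main__":
    for subsystem, count in count_nouveau_messages(SAMPLE).items():
        print(subsystem, count)
```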
16:36karolherbst: yeah.. so I've heard about some users having issues on AMD systems with all that runtime power management stuff
16:37js: interesting that it works when I enable the card later
16:37js: timing issues?
16:37karolherbst: the system firmware is involved in the process
16:37js: both are still auto btw
16:37karolherbst: and on intel systems we need a weird workaround to make it work
16:38karolherbst: we might not fully understand on how all of that works and I saw that nvidia has tons of workarounds for various systems in this regard
16:38js: but runtime status on .0 is active and on .1 suspended
16:38karolherbst: what happened is that the GPU stops responding to any request
16:38karolherbst: and the code can't deal with it, so it's stuck
16:38karolherbst: more or less
16:39karolherbst: that can happen when starting to use the GPU after it got powered down
16:39karolherbst: I never had access to such a system so I couldn't really investigate
16:39js: this machine's ACPI tables have a weird function to enable/disable the GPU - that keeps state between reboots even
16:39js: but usually, that makes it disappear entirely from the PCI bus
16:40karolherbst: more "fun stuff" to figure out I guess
16:40karolherbst: I'd file a bug for that against PCI/ACPI components of the kernel
16:40js: ~> cat /sys/devices/platform/asus-nb-wmi/dgpu_disable
16:40js: let me try if I can still disable it in this broken state
16:40karolherbst: sounds like we need special ways of querying for firmware disabled devices or something..
16:40js: karolherbst: well, it is enabled 🙂
16:41js: and if I set this to 1, it completely disappears from lspci etc.
16:41karolherbst: yeah.. as the PCI device needs to be removed
16:41js: yep, disabling it makes the power draw go down, while the /remove no longer worked
16:41js: now, let's see what happens when reenabling it
16:42js: then the power draw immediately goes up again
16:42karolherbst: well.. obviously
16:42js: well, I was hoping for nouveau to recover when the GPU is reset to a clean state 🙂
16:42js: let me reboot
16:42karolherbst: question is just how nouveau copes with that
16:43karolherbst: sometimes it can I think
16:44js: oh THAT is fun.
16:44js: after reboot it's disabled again
16:44karolherbst: I am sure the GPU was actually off all the time
16:44js: I remember there being some bug where you needed to enable it twice so it stays...
16:44js: nah, it was on because power draw increased 🙂
16:44js: there's two things this ACPI call does
16:45js: immediately toggle the dGPU on/off and update the BIOS setting that is not user accessible
16:45js: I think it brought back the GPU but didn't update the firmware setting, so after a reboot it was gone again
16:45js: (no longer in lspci)
16:45js: I think there was a weird issue where the Windows software also does the ACPI call twice - apparently you need to call it again once the dGPU is actually back for the firmware to get updated
16:45karolherbst: so the thing is, the PCI device detection code doesn't detect turned off devices
16:46js: [ 175.484403] acpi device:03: Cannot transition to power state D0 for parent in D3cold
16:46karolherbst: you can ignore it if you don't see it three times
16:47karolherbst: or something
16:47js: ok, interesting. After the dGPU is back, I need to modprobe nouveau manually, despite having removed the blacklist
16:47karolherbst: _sometimes_ it takes longer for devices to transition than expected by linux
16:47js: but power draw is high
16:47js: now let me plug in the external display and remove it again
16:48js: cool, now the external display doesn't work oO
16:48js: ok, now with the card hopefully re-enabled in firmware, let me reboot once more sigh
16:48karolherbst: yeah.. that should work
16:48karolherbst: if the GPU doesn't get turned off while nouveau is loaded, it should all work fine
16:59js: ok weird
16:59js: after setting dgpu_disable to 0 a few more times, it sticks
16:59js: but as soon as nouveau loads, same error as above
16:59js: so I added it to the rd.driver.blacklist, which makes it load at ~ 20 seconds, still the same
17:00js: but never had that issue when I loaded it later on with a display attached
17:00karolherbst: try loading with nouveau.runpm=0
17:00js: let me reboot once more but with a display attached...
17:02js: that seems to have made a difference
17:02js: no crash now
17:02js: but power usage this time is still high
17:02js: ah, actually, no, not that high
17:02js: might just be because I just booted
17:02js: so, it seems, what made all my problems go away is booting with a display attached 😉
17:03karolherbst: yeah, so runpm=0 disables runtime powermanagement, so the GPU will stay on
17:03js: didn't try runpm=0
17:03karolherbst: if the display is attached, the GPU stays in use and never gets turned off
17:03js: this was just booted with display attached
17:03karolherbst: but runpm=0 should have the same effect as having a display attached
17:03js: booting with display attached and then unplugging -> everything works as expected
17:03js: booting with no display attached -> crash and high power draw
17:03js: ok let me try runpm
17:04js: but then power draw will always be high, right?
17:04js: that's not exactly what I want 😄
17:04js: so, maybe some weird state the dGPU is in if there's no display attached that nouveau doesn't like?
17:04karolherbst: js: seemingly unrelated question, but might allow me to suggest something :D are you by any chance a student? Could do a GSoC or EVoC and work on that issue if you are interested/motivated enough :D
17:05js: nope, I'm too old for that 😄
17:06karolherbst: seriously having an issue with people working on nouveau, because either people get hired and work on other things or leave, because of stupid reasons. So I can't really focus on issues like this and have to deal with such affecting even more users
17:07karolherbst: and this topic might be a fun one, as with nvidia's released source code there might not be a huge issue figuring out how to make all that runpm stuff work
17:08js: TBH, I don't think this is related to runpm as much as in Nvidia on laptops generally being weird
17:09karolherbst: well, yeah, but apparently it's all working with their driver afaik
17:09js: like, if I do dGPU pass through to a VM, the GPU is only detected if I plug in a display
17:09js: and that is with the nvidia proprietary driver!
17:09js: I wouldn't know - never ran their blob on my Linux system 😄
17:10js: as said earlier, the HDMI being directly attached to the AMD GPU was one of the reasons for buying this notebook: Because it guaranteed I could use at least one external display on Linux without blobs 😉
17:10karolherbst: virtualization is totally weird as well, true
17:11js: in any case, I think I will revert to my hack where I blacklist nouveau and only load it manually once a display is attached. And now I know that I don't even need to unload + remove the device again in that case 😉
17:11karolherbst: anyway, they open sourced their kernel driver by using a huge blob on the GPU instead, but at least all that runpm stuff is public
17:11karolherbst: I just don't really have the time to look into it until it becomes a really pressing issue
17:11karolherbst: it's all very annoying :(
17:12karolherbst: yeah, probably the best workaround
17:13karolherbst: anyway.. I'll probably try to upstream all the ampere bits next week or the week after to at least give users GL with ampere
17:13karolherbst: now that Ben has finished up the enablement
17:14js: ugh, now nouveau crashed my system again, with no display attached
17:15js: could still type dmesg in the terminal, but all other I/O no longer worked
17:15js: last I could see was a backtrace from nouveau, but couldn't save it ☹️
17:15js: I guess best for now is to keep nouveau disabled unless I need it, and then save all important work before enabling it 😄
17:16js: at least it allows me to use an external display at all, which is impressive work
17:16js: karolherbst: btw, how useful would the new open source kernel module from nvidia be for just getting an external display to work?
17:16js: (without blobs, of course)
17:16karolherbst: _might_ be good enough, but I don't think they officially support that yet
17:17karolherbst: they do have wayland support and it might even be turned on by default on fedora if using nvidia's rpms
17:17karolherbst: but never tried myself
17:17js: well nvidia's rpms are all blob 😉
17:17karolherbst: ohhh, right
17:19js: are you guys disassembling the blob or just moving it into a VM and intercepting all calls to the hardware?
17:20karolherbst: we don't need to start a VM in order to do that
17:20karolherbst: we can use valgrind to intercept ioctls and can use mmiotrace to intercept any read/writes to the GPU from the kernel driver
17:20js: ah 🙂 so no decompilation at all being used?
17:21js: for fear of it not being clean room, I suppose?
17:21karolherbst: decompilation is also a legal risk for some people in some countries, so we don't want to rely on that
17:21karolherbst: yeah mostly that
17:22js:lives in a country where it would be perfectly legal, because decompilation is fully allowed for interoperability, even if an EULA forbids it
17:23karolherbst: yeah.. in theory, and I am sure nvidia wouldn't do anything
17:24js: I once posted on an official Apple mailing list that I just disassembled their stuff, despite their EULA forbidding it, and nothing happened, because, what should happen? Law explicitly says that the software vendor can't forbid it.
17:24karolherbst: yeah.. most vendors don't really care
17:24karolherbst: or some
17:24karolherbst: depends on the business
17:24karolherbst: if all you do is software, then that's a different situation
17:25karolherbst: nvidia on the other hand is very very protective over their IP
17:25karolherbst: and they are more of a sw company than hw anyway :P
17:25js: to the point where one has to start to wonder what dirty laundry they have in their blob 😉
17:26karolherbst: they started to provide documentation for their hardware and stuff, but there is literally no way they'll ever open source cuda
17:26karolherbst: as this is their core business, not selling GPUs
17:26karolherbst: well.. that as well, but the focus is on cuda
17:27js: yeah, I think the entire open source kernel driver is motivated by making CUDA easier available in the DC
17:27karolherbst: well, and secure boot easier to deploy :P
17:27js:uses his own secure boot keys anyway 🙂
17:27karolherbst: there are businesses caring about out of the box secure boot
17:28js: with the MS 3rd party key installed, I wouldn't call it "secure" boot
17:28js: so many binaries have been signed with it that will happily load unsigned code
17:28karolherbst: more secure than the current situation
17:28karolherbst: aren't those revoked?
17:28karolherbst: ehh well..
17:28karolherbst: that relies on actually working revocation mechanism in firmware
17:29js: my dbx had a few entries from the factory
17:29karolherbst: wow, that's more than I expected
17:29karolherbst: uefi comes with internet access though, so it could in theory download new revocation lists....
17:29js: I doubt my UEFI does internet
17:30karolherbst: all uefi does internet
17:30js: mine doesn't even netboot
17:30karolherbst: that's strange
17:30js: AMD system with an Intel WiFi card
17:30js: also no vendor locked WiFi card, so I can even replace it
17:31karolherbst: yeah, still
17:31karolherbst: worst case you use USB-C + ethernet
17:31js: the WiFi card would need to have an option rom then
17:31js: well, given I don't have the MS 3rd party key, that Option ROM wouldn't be loaded either 😉
17:32karolherbst: anyway.. in theory uefi _could_ download revocation lists :P
17:32js: or the OS could do it
17:32karolherbst: well, that's too late isn't it?
17:32js: if you have new dbx entries signed by the KEK, it should work
17:32karolherbst: not if the malware whatever it is intercepts it
17:33js: true, but you could update it via the OS as soon as it gets revoked, hopefully before you get malware 😉