01:30mangix: karolherbst: different error turns out. you got any idea? https://gist.github.com/neheb/36c45ba194d52a9ea42cc483227f0cff
07:49karolherbst: mangix: just that we do something stupid in mesa :/
07:50karolherbst: mangix: what GPU is that with?
07:50karolherbst: and what version of libdrm do you have installed?
07:53mangix: karolherbst: 750 ti and 2.4.105-1.fc34
07:54karolherbst: mhh sad
07:55karolherbst: mangix: with 106 I've added some debugging features we can use to dump the entire command stream into a file and investigate it
07:55karolherbst: maybe it can help in situations like this
07:55karolherbst: but I suspect something else is going wrong.. mhh
07:56karolherbst: I think this "validating bo list" thing has something to do with it
07:57karolherbst: I think I even see what's up..
07:58karolherbst: we should be able to catch that in userspace...
07:58karolherbst: mhh interesting
07:58mangix: The session is Wayland if that helps.
07:59karolherbst: not really
07:59karolherbst: so I think we are just messing up access settings for our buffer objects
08:00mangix: Is it a userspace issue?
08:00karolherbst: I think so
08:00karolherbst: the kernel validates the list of buffers we want to do stuff with and something in there is messed up
08:00karolherbst: I'd just need a simple reproducer, that's all
08:02mangix: IIRC, I just launched dota 2. Crashed in 1 second.
08:04mangix: fedora 34, if it makes a difference
08:04mangix: kernel 5.12.11
08:15karolherbst: mangix: same error?
08:30karolherbst: mangix: yep.. happening here as well
08:30karolherbst: mhh "Xwayland: Unknown handle 0x00001548"
08:32karolherbst: mangix: can you verify that disabling the steam overlay fixes the issue?
08:32mangix: As soon as this update finishes
08:33mangix: New update was released today
08:34karolherbst: no worries.. I had to download the entire game anyway :D
08:35karolherbst: but I think I know what bug that is
08:35karolherbst: so if disabling the ingame overlay helps then I am quite positive about it
08:38mangix: Hmm no crash yet.
08:38mangix: Yep that fixed it
08:40karolherbst: so I think that's just the multithreading issue we have
08:41karolherbst: so the game and the overlay use separate GL contexts for rendering and if we do that we get memory corruptions due to multiple threads accessing the same stuff
08:41karolherbst: I have a branch which should fix that
08:41karolherbst: just need to figure out some last issues
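The failure mode karolherbst describes (two contexts on separate threads touching the same driver state) can be illustrated with a toy model. This is only a sketch of the general locking idea, not how Mesa or the pending nouveau branch is structured; `SharedPushBuffer`, `run_contexts` and `batches_are_contiguous` are hypothetical names invented for the illustration.

```python
import threading


class SharedPushBuffer:
    """Toy stand-in for driver state shared by several contexts (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.commands = []

    def submit(self, context_name, count):
        # Hold the lock for the whole batch so that another thread cannot
        # interleave its own commands in the middle of this one.
        with self._lock:
            for i in range(count):
                self.commands.append((context_name, i))


def run_contexts(buffer, names, count):
    """Submit one batch per context, each from its own thread."""
    threads = [
        threading.Thread(target=buffer.submit, args=(name, count))
        for name in names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def batches_are_contiguous(commands, count):
    """Check that every run of `count` entries comes from a single context, in order."""
    for start in range(0, len(commands), count):
        chunk = commands[start:start + count]
        if len({name for name, _ in chunk}) != 1:
            return False
        if [i for _, i in chunk] != list(range(count)):
            return False
    return True
```

With the lock held per batch, the "game" and "overlay" batches always land in the buffer as two intact runs; without it, nothing would guarantee that ordering.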
08:41mangix: Any relation to Vulkan?
08:42karolherbst: yeah.. mhh vulkan is difficult because that requires changes in the kernel UAPI and it's kind of WIP
08:42karolherbst: just not enough people to do all the things
08:43karolherbst: hw enablement is taking a large amount of time and I am trying to fix those huge bugs atm
08:45mangix: I was experiencing pretty heavy input lag with Wayland. Switching to X11 improved it but there's a visual glitch now.
08:47karolherbst: I mean.. X11 does tear
08:47karolherbst: but if it's something else we might want to fix it
08:47mangix: I made a video
08:48mangix: something appears and disappears on screen
08:48mangix: doesn't happen with XWayland
08:48mangix: let me see where I can upload it
08:49mangix: I just remembered. For this GPU, doesn't the clock have to be manually set higher?
08:50karolherbst: it has to
08:50karolherbst: and I think we are terrible when it comes to the memory bandwidth being too low
08:51mangix: Uploaded on youtube: https://www.youtube.com/watch?v=i4xA0DFdopU
08:55karolherbst: ahh in dota
08:55karolherbst: mangix: mind doing an apitrace? If it's reproducible when replaying we can look into it
08:55mangix: i tried again, didn't happen, weird
08:55mangix: probably some heisenbug
08:56mangix: Hmmmmmm the game settings are not being saved, meaning I can't switch to Vulkan
08:57karolherbst: we don't support vulkan anyway
08:57mangix: consequence of using the Steam Flatpak I guess
08:57karolherbst: I suspect the game just checks and reverts to gl
08:57mangix: Oh OK
08:59karolherbst: *sigh*.. now I have written this awesome valgrind suppression file and it doesn't find any actual races which matter, but when I run that stuff without valgrind I hit a memory corruption.. so annoying
09:02mangix: Alright, back to Wayland. Nouveau crashed on the desktop again
09:07mangix: Huh interesting. I have the nouveau X11 driver installed. Thought I removed it.
09:08karolherbst: I never had those random crashes...
09:08karolherbst: it's super annoying
09:32mangix: Hmmm how do I overclock this thing?
09:51mangix: found it
09:51mangix: surprisingly playable
09:52mangix: I don't dare take this online though
12:44karolherbst: RSpliet: so pcercuei is hitting your runtime pm issue as well: https://pastebin.com/VSCu67Pu
12:44karolherbst: do you know if there is any work done on that?
12:45RSpliet: I have a runtime PM issue?
12:45karolherbst: RSpliet: wasn't it you or did you simply discuss it?
12:45karolherbst: the one where the audio device has no codecs
12:45karolherbst: or was it pmoreau[m]?
12:45RSpliet: Oh yes. Mine was fixed a few kernels ago, but there's a more subtle interaction that bugs others differently
12:46karolherbst: I was under the impression you were in the loop there though
12:46RSpliet: I chimed in on the ML because I was a tad annoyed at the lack of action for an issue that should have a fairly simple workaround
12:46karolherbst: anyway.. were there any workarounds for users or would one have to patch the kernel or so?
12:47RSpliet: Knowing I shouldn't express annoyances because everything is high prio and there's like two devs :-P
12:47karolherbst: RSpliet: mind pointing it out to pcercuei? I think it would be good if more users complain about it
12:47karolherbst: or maybe pcercuei can fix it even :p
12:47pcercuei: I can maybe fix it if I understand the problem :)
12:48karolherbst: pcercuei: well.. basically the snd_hda_intel fails to initialize the device or so and then runpm never happens
12:48RSpliet: pcercuei: can you first explain your symptoms?
12:48karolherbst: RSpliet: audio device stays active
12:49karolherbst: there is this "snd_hda_intel 0000:01:00.1: no codecs initialized" message which reminded me of that bug
12:49pcercuei: RSpliet: what I noticed was that the nouveau driver never auto-suspends my nvidia GPU when it's supposed to be idle
12:49RSpliet: pcercuei: and if you manually unmap the HDA driver from the device, it suddenly plunges into s3?
12:50karolherbst: pcercuei: is the gpu set to suspended though?
12:50karolherbst: "cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_status"
12:50karolherbst: RSpliet: ohh right.. this was it
12:50RSpliet: The last message on the ML thread is here btw: https://lists.freedesktop.org/archives/nouveau/2021-April/038450.html
12:51karolherbst: echo "0000:01:00.1" > '/sys/bus/pci/devices/0000:01:00.1/driver' was the command or something?
12:51pcercuei: RSpliet: correct! If I unmap the HDA driver, then nouveau puts the hardware to sleep
12:51karolherbst: echo "0000:01:00.1" > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
12:51karolherbst: pcercuei: nice...
12:51karolherbst: okay, so you have your workaround :D
12:51pcercuei: karolherbst: yep, I figured out the command ;)
12:51RSpliet: pcercuei: yep, then this issue affects you too
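The check and workaround discussed above can be sketched in Python. This is a minimal sketch, not a supported tool: the helper names and the `sysfs_root` parameter are hypothetical (the parameter only exists so the helpers can be tried against a scratch directory), and the PCI addresses in the usage note are the ones quoted in the chat, which may differ on other machines.

```python
from pathlib import Path

# Default sysfs location of PCI devices; overridable so the helpers can be
# exercised against a scratch directory instead of real hardware.
SYSFS_PCI = Path("/sys/bus/pci/devices")


def runtime_status(device, sysfs_root=SYSFS_PCI):
    """Return the runtime PM state of a PCI function, e.g. 'active' or 'suspended'."""
    status_file = Path(sysfs_root) / device / "power" / "runtime_status"
    return status_file.read_text().strip()


def unbind_driver(device, sysfs_root=SYSFS_PCI):
    """Detach the driver bound to `device` by writing its address to driver/unbind."""
    unbind_file = Path(sysfs_root) / device / "driver" / "unbind"
    # Same effect as: echo "<device>" > /sys/bus/pci/devices/<device>/driver/unbind
    unbind_file.write_text(device + "\n")
```

On the machine from the chat this would be used as `runtime_status("0000:01:00.0")` before and after `unbind_driver("0000:01:00.1")`; the unbind write needs root, and the unbinding does not persist across reboots.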
12:52pcercuei: Now I wonder, why doesn't snd_hda_intel find a codec?
12:53RSpliet: pcercuei: there are legit GPUs out there that don't have a codec. The HDA device is part of the GPU, the codec is presumably a separate chip. If the GPU doesn't have display outputs, the codec can be omitted by the OEM
12:55pcercuei: So if I understand it right, the HDA device is there, but is useless, correct?
12:56RSpliet: In your laptop it's presumably useless yes
12:56RSpliet: There were some voices around the codec hiding itself if nothing is plugged into the HDMI port or DP, but so far I haven't heard a confirmation of that
12:58RSpliet: There was something along the lines of the codec bitmask read returning an MMIO read error, and snd_hda_intel interpreting the return value as a mask anyway?
12:59RSpliet: Anyway, slightly out of my domain, ultimately a fix will need to land in snd_hda_intel
12:59RSpliet: Which is alsa stuff. Not that the nouveau lot shouldn't have feedback into the fix ofc...
12:59pcercuei: I do have a MMIO read error, yes
13:01pcercuei: karolherbst: thanks for the help btw
13:01pcercuei: I'll take it from here
13:02karolherbst: pcercuei: I suspect that the problem is that your GPU doesn't have any outputs
13:02karolherbst: but the OEM forgot to disable the audio part of the GPU or so
13:02karolherbst: maybe it's for weird streaming stuff.. dunno
13:03karolherbst: but you don't have any codecs so it's pointless anyway
13:03RSpliet: karolherbst: we always unconditionally enable the HDA. That's what the PCI quirk is for
13:04RSpliet: ultimately the snd_hda_intel design is shite and is prone to all these sorts of errors. Its probing is split up into two phases, and once it starts actually doing the leg work (part 2) it can't unbind itself anymore on errors.
13:05RSpliet: I wonder how much of this can be done in a better way if it starts using deferred probing
13:06RSpliet: But for now the best we can do is catch all possible errors and corner cases, and make sure it doesn't stop snd_hda_intel from finishing its initialisation so that it at least continues to take care of S3
13:06karolherbst: mhh yeah...
13:06karolherbst: I guess we could skip enabling the audio device if we have no connectors...
13:06karolherbst: but uff...
13:07RSpliet: The PCI quirk runs before nouveau. Don't think that's a good plan
13:07RSpliet: There's apparently a reason why this is a quirk and not just sth that nouveau has to enable
13:08RSpliet: Oh, it has to happen before probing all the PCI devices, otherwise snd_hda_intel doesn't get loaded at all unless manually triggered... or sth like that
13:08RSpliet: Perhaps that's more an issue for the blob than for us
13:09karolherbst: RSpliet: the reason is nvidia I think :D
13:09karolherbst: or maybe the probing order or something...
13:10karolherbst: it's a shitty situation
13:10RSpliet: aka a shituation
14:04mrnuke: Hi. My Vega M GL laptop is in service, and Dell says they can't fix it. They're offering a replacement with a GTX 1650 Ti 4GB GDDR6, and I'm wondering how well it is supported by nouveau
14:04karolherbst: well.. OpenGL should work with it, but of course there can be always bugs, and you won't be able to get much perf out of it anyway
14:05karolherbst: and if you rely on 4K to work I wouldn't use nouveau, because of performance
14:05karolherbst: might be better with higher end GPUs though
14:07orbea: mrnuke: i wouldn't trust dell technicians, they'll string you along with broken refurbished parts until your warranty runs out
14:08orbea: rather, dell management above them...
14:08karolherbst: orbea: I've heard both ways :D
14:09orbea: heh, i've only had terrible experiences, maybe others were more lucky
14:09karolherbst: I know a case where the support screwed up so much, that that person got a state of the art new machine
14:09mrnuke: karolherbst: Thank you! Rest assured, It's either nouveau or IGP -- I'll stay away from nvidia's bloatware
14:09karolherbst: orbea: the biggest problem are contractors
14:09karolherbst: but once you get to the proper dell support it should be better
14:10karolherbst: but dell might assign a contractor to you from what I've heard
14:10karolherbst: and sometimes they are just super terrible
14:10mrnuke: orbea: Well, Extended warranty is only ~$80/year with dell, so I don't mind paying that if they give me faster parts when the machine breaks (I intentionally said "when" instead of "if")
14:11orbea: i had failed mobos, they replaced it maybe 3 times where I basically got the same mobo I already had
14:11karolherbst: duh.. :(
14:12karolherbst: but apparently it's harder to get different parts because of $rules
14:12mrnuke: orbea: hehe! I have a precision workstation here that got its mainboard replaced about that many times -- never fixed the issue though. It's a firmware bug. Dell sent several technicians to my office to replace it live
14:15orbea: never learned what my issue was, it was also the kind of hardware that failed to boot with most distros, opensuse did work for some reason.
14:16mrnuke: orbea: did it work on cold boot, then failed to initialize the display?
14:17orbea: it would fail during init, dont remember more than that
14:17orbea: basically just crash or freeze
14:18orbea: and at some point opensuse became unstable crashing every 5 minutes, even windows did that...
14:18mrnuke: those are always fun to debug
14:18mrnuke: I'm sad to see that Vega M GL go to the landfill. I played DXMK on it, and it was butter smooth. I'm assuming I should drop all hopes of that if they send me an nvidia
14:19mrnuke: deus ex - mankind divided
14:20orbea: that is dx11 or dx12?
14:21orbea: and since nouveau has no vulkan and by extension no dxvk or vkd3d then I wouldn't get my hopes up
14:21mrnuke: I honestly have no idea -- I think they have an OpenGL linux port
14:21orbea: i see
14:22orbea: they have some kind of opengl support, that might work, no idea at what speed
14:22orbea: no doubt amd is a better choice if you want to game
14:25mrnuke: if you can get radeon. Intel made that CPU with a radeon chip on the same package, back in 2018. And that's the last time I've seen a decent radeon in a laptop
14:26orbea: how about a pure intel laptop?
14:26orbea: at least you would get vulkan
14:29mrnuke: And battery life, yes. I have one of those too.
14:30mrnuke: I guess I could just run a VM with PCI passthrough for the nvidia card. Oh wait, nvidia destroyed that too!
14:38mrnuke: orbea: Actually, I think they use DRI_PRIME, so I could just run off the IGP and ignore the nvidia GPU if the drivers aren't there
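The DRI_PRIME idea mrnuke mentions can be sketched as a small helper that prepares the environment for a launched program. This is a sketch under assumptions: `env_for_gpu` is a hypothetical name, and it assumes the usual Mesa behaviour where leaving `DRI_PRIME` unset renders on the default GPU (typically the IGP on such laptops) and `DRI_PRIME=1` asks for the other one.

```python
import os


def env_for_gpu(use_secondary, base_env=None):
    """Return a copy of the environment that selects a GPU via Mesa's DRI_PRIME."""
    env = dict(os.environ if base_env is None else base_env)
    if use_secondary:
        env["DRI_PRIME"] = "1"  # ask Mesa for the non-default GPU
    else:
        env.pop("DRI_PRIME", None)  # stay on the default GPU
    return env
```

The result can be passed as `env=` to `subprocess.run`, for example with `glxinfo -B` to see which renderer was actually picked.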
19:34RSpliet: We were talking about Dell? https://eclypsium.com/2021/06/24/biosdisconnect/
19:39mrnuke: RSpliet: that's one of the reasons I don't use pre-installed OSes
19:41cyberpear: I've got a Pascal card in my laptop. Works great w/ nouveau. I do have an occasional problem, though. Sometimes when I wake the screen after it's been turned off due to inactivity, my ViewSonic monitor won't detect the signal...
19:41mrnuke: RSpliet: You'll like this: https://www.youtube.com/watch?v=5N7aYtkzKJc
19:41RSpliet: mrnuke: AFAIK the OS is not involved in any of these CVEs. It's all baked into the BIOS
19:42RSpliet: cyberpear: anything suspicious in the dmesg?
19:42cyberpear: two workarounds I've found are 1) to unplug the monitor power for 15 minutes then plug it back in or 2) to plug in my USB-C-to-Displayport to my Android phone or a Windows computer.
19:43cyberpear: (after either of the two above items, i can plug it back into the laptop and things work fine)
19:43RSpliet: both sound dodgy. 15 minutes is a long time
19:43mrnuke: RSpliet: "Dell SupportAssist [...] comes preinstalled on most Windows-based Dell machines." Doesn't sound like BIOS alone would be involved
19:44karolherbst: mrnuke: wasn't the bug that you can trigger a malicious firmware update?
19:45karolherbst: so if fwupd would have a security issue, this could be used to take control over the hw on a firmware level, no?
19:46karolherbst: or well.. just having root access
19:46mrnuke: karolherbst: My understanding is that you need to run windoze with dell's bloatware to initiate the process
19:46karolherbst: that's not my understanding
19:47karolherbst: why would you need a firmware update in this case?
19:47karolherbst: the firmware itself is buggy, the tool could just be used to fake being dell.com
19:47karolherbst: but you can use other ways of attacking the firmware
19:48mrnuke: karolherbst: you need dell's windows tool to initiate the update, right?
19:48karolherbst: UEFI has an API to do that without any tools
19:48karolherbst: "vendor" tools
19:48karolherbst: you can use fwupd on linux as well
19:50mrnuke: karolherbst: are you aware of the DisplayPort link training quirks in linux for Dell Monitors?
19:50karolherbst: but I think in this case one just triggered a "os recovery" or whatever and then the firmware thought you were dell.com to initiate random commands
19:50cyberpear: RSpliet: the monitor-won't-see-signal issue happens about once a week or so
19:50karolherbst: mrnuke: no
19:50cyberpear: re fwupd and Dell https://twitter.com/hughsient/status/1408086110786535428
19:52mrnuke: karolherbst: https://github.com/torvalds/linux/blob/4a09d388f2ab382f217a764e6a152b3f614246f6/drivers/gpu/drm/drm_dp_helper.c#L253
19:53mrnuke: karolherbst: so, the same monitors that Linux has to have a quirk for, aren't initialized properly by Dell's own BIOS
19:53karolherbst: what has the dells BIOS to do with any of that
19:54karolherbst: that's usually in the domain of the GPU and its firmware
19:54mrnuke: karolherbst: I have a workstation machine that boots to a blank screen and freezes. Dell has replaced the mainboard several times instead of fixing the BIOS
19:54cyberpear: RSpliet: dmesg shows `nouveau 0000:01:00.0: disp: outp 03:0006:0f81: training failed`
19:55RSpliet: cyberpear: ah! That's interesting. Slightly dodgy DP cable perhaps? Not claiming there's no bugs in nouveau or anything :-P But that's my first suspect
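Since the problem only shows up occasionally, it can help to scan saved dmesg output for the message cyberpear quotes. The following is a minimal sketch: the regular expression is modelled only on that single quoted line, so other kernels may format the message differently, and `find_training_failures` is a hypothetical helper name.

```python
import re

# Matches lines shaped like the one quoted in the chat:
#   nouveau 0000:01:00.0: disp: outp 03:0006:0f81: training failed
TRAINING_FAILED = re.compile(
    r"nouveau (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"disp: outp (?P<outp>[0-9a-f:]+): training failed"
)


def find_training_failures(log_text):
    """Return (pci_address, output_id) for each link-training failure in a dmesg dump."""
    return [
        (m.group("pci"), m.group("outp"))
        for m in TRAINING_FAILED.finditer(log_text)
    ]
```

Feeding it the text of a saved log gives one tuple per failure, which makes it easy to tell whether a given wake-up actually hit the training error.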
19:56RSpliet: cyberpear: btw, which kernel is this?
19:56cyberpear: same exact cable when plugging into Windows or Android to "fix" the issue and make the monitor no-longer-braindead
19:58cyberpear: it's not a new issue; I've just recently been annoyed enough to seek a better solution... the "plug it into Windows or Android" is already much better than my older "remove power from monitor for 15 minutes" workaround
19:59cyberpear: (this is a USB-C to DisplayPort cable)
19:59RSpliet: Ehm... which part is USB-C? The bit that goes into nouveau?
19:59RSpliet: Didn't know we even supported that!
20:00RSpliet: Either way I think Lyude is the closest we're going to get to a display expert, perhaps she has some hints or ideas...
20:01cyberpear: it looks like DisplayPort to the OS, I think. The port is technically ThunderBolt-3, IIUC.
20:02mrnuke: cyberpear: Laptop - USB-C <- cable -> DisplayPort - Monitor ?
20:02cyberpear: mrnuke: yes
20:04mrnuke: hmm, might be displayport signals over USB-C connector. Could be connected to a DP encoder on the GPU, hence the nouveau dmesg
21:24Lyude: USB-C typically uses displayport
21:24Lyude: same for TB3
21:24Lyude: and TB4 they're both the same thing and still use DP fwiw
21:25Lyude: cyberpear: does this monitor always work on windows?
21:27imirkin: it's entirely likely that we're supposed to do something to the usb-c port that we don't
21:28Lyude: imirkin: maybe, but I'm a lot more likely to blame our displayport training first. most of the issues with usb-c that i thought might have been the result of something like ucsi have turned out to be different issues
21:28Lyude: well, actually first i'm blaming the monitor because i've had monitors that are like this and are just kind of broken
21:29imirkin: Lyude: what's the up/down detection thing?
21:29Lyude: imirkin: you don't typically deal with that on modern usb-c, I think that was only ever something to deal with on some of the very very early machines with usb-c
21:30Lyude: at least I think? bentiss would actually know the answer to this I think
21:31imirkin: Lyude: ah ok. i know so little about this stuff...
21:31imirkin: i do know that you have to do some level of orientation detection
21:32imirkin: and possibly that's why turning monitor off for 15 mins "fixes" it, as well as another OS which tells it to flip
21:32imirkin: but this is all based on ... no actual information
21:33Lyude: yeah, I'm pretty sure orientation detection just happens on the fw level now for the most part, except maybe for some embedded systems somewhere out there
21:34Lyude: oh, probably should clarify the logic for blaming the monitor a bit more: the whole "needing to unplug for 15 minutes" thing is definitely something I've seen some very problematic monitors do
21:35imirkin: yeah, i mean clearly the issue is coz of something "resetting" in the monitor
21:35imirkin: but what could it be
21:35imirkin: maybe SCDC?
21:35imirkin: er no, that's hdmi
21:36Lyude: I would just peg it to firmware issues, and likely being either one of "the monitor is broken" or "the monitor is broken, but can be made to work and we need to avoid doing some very specific silly thing so it doesn't get upset"
21:37Lyude: I just realized I am making some assumptions that there's MST here, so if there isn't that would definitely change my theory a bit
21:41imirkin: i doubt there's MST involved... why would there be MST? i guess some in-laptop MST hub you think?
21:41Lyude: no, just because TB3 is involved
21:41imirkin: if it works reliably with other things, clearly there's something we could/should do differently though
21:42Lyude: imirkin: yeah of course, that's why I mentioned the latter kind of issue. which, pretty much all of these issues turn into
21:42Lyude: well, most of them
21:42Lyude: I do have one monitor that is legitimately just, kind of broken
21:42imirkin: yeah, i hear some earlier dell DP-MST monitors are busted-ish
21:43Lyude: imirkin: bingo, that's exactly the monitor I'm thinking of :)
21:43imirkin: esp the 4k variety
21:43Lyude: 4k dell mst monitor
21:43imirkin: but i might have assumed those types of issues would be largely sorted by now
21:43Lyude: imirkin: yeah, it's dependent on how old this display is though
21:44Lyude: but it's -very- suspicious that there's a 15 minute timeout. that's a bit much even for a confused mst hub
21:44Lyude: (although honestly, I wonder if 30-60 seconds would be long enough as well)
21:45Lyude: cyberpear: I'd file an issue on the nouveau gitlab and message me on it, but I unfortunately can't make any guarantees on when I'll have a chance to look at it
21:46cyberpear: what debug info should I include?
21:47Lyude: cyberpear: boot with log_buf_len=20M drm.debug=0x116 nouveau.debug=disp=trace, reproduce the issue and grab the full dmesg logs from the same boot (no trimming needed, it's better to give more debuginfo than we need than not enough ;)
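Before waiting a week for the issue to reappear, it is worth confirming that the parameters Lyude lists actually made it onto the kernel command line. A minimal sketch, assuming the parameters are passed exactly as written above; `REQUESTED` and `missing_params` are hypothetical names.

```python
# Parameters requested in the chat for the debug boot.
REQUESTED = ("log_buf_len=20M", "drm.debug=0x116", "nouveau.debug=disp=trace")


def missing_params(cmdline, wanted=REQUESTED):
    """Return the requested parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]
```

On a running system the input would be the contents of `/proc/cmdline`; an empty list means all three parameters are active for the current boot.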
21:47cyberpear: I hadn't tried different "unplug times" than 10 seconds and 15 minutes.
21:47Lyude: cyberpear: yeah i'd think 30-60 seconds should probably work too, gives enough time for stuff to discharge enough to clear state
21:48cyberpear: but temporarily plugging in windows or android avoids needing to unplug at all
21:48Lyude: Oh. ok, then we're definitely doing something wrong
21:50Lyude: cyberpear: any chance you might have something running linux with an intel or amd gpu?
21:51cyberpear: this has Intel built-in, but the USB-C is hardwired to the nVidia card
21:52Lyude: dang, not totally necessary but would have helped answer a couple of questions (in particular, I'm wondering what kind of link training settings windows/android end up going with and whether they differ from what we're selecting in nouveau or not)
21:52cyberpear: my older laptop had the Optimus tech where the bios would allow turning off nVidia IIRC.
21:53cyberpear: but the problem only happens once a week or so. can't reproduce on demand
21:53Lyude: mhh, yeah that makes me slightly more suspicious of link training
21:54Lyude: we will have to see from the logs
21:54Lyude: also - feel free to poke me every now and then if the bug goes cold, although in the next few months I'm actually hoping I will finally be free and able to focus more on upstream work again
21:59cyberpear: thanks! will file the bug when I'm back at my desk.
21:59mangix: There are nvidia headers in the kernel now
22:00Lyude: mangix: which ones?
22:03mangix: Lyude: the ones in include/nvhw/class
22:03mangix: they weren't there last I looked at nouveau
22:03Lyude: mangix: oh you mean the nouveau headers
22:04Lyude: yeah - nvidia's actually been providing us some of that stuff as of late :)
22:04mangix: that's nice
22:04Lyude: crcnotif??? wait that's new
22:05Lyude: ah, just moving some crc defs around
22:11mangix: hmm what's nvidia's relationship to nouveau? Is it for tegra?
22:12imirkin: Lyude: i guess the notifier contents
22:12imirkin: mangix: they release display-related docs to make sure that nouveau can light up displays on linux.
22:12imirkin: and additional docs that help them further their interests somehow
22:13imirkin: on rare occasions, they've been helpful in actual graphics stuff too
22:13mangix: last I looked at the driver, Alexandre Courbot was still at nvidia
22:13imirkin: my guess is the result of someone helpful overstepping their bounds
22:13imirkin: afaik gnurou left nvidia some years back
22:14mangix: right. it's been a while :)
22:14Lyude: imirkin: yeah I just went through it and it's the same stuff we had before, just in a different header format
22:14imirkin: tegra stuff is maintained, again, for display