IRC Logs of #dri-devel on irc.freenode.net for 2024-01-31

04:28 kurufu: https://registry.khronos.org/EGL/extensions/EXT/EGL_EXT_device_query.txt suggests eglQueryDisplayAttribEXT should never return EGL_TRUE and EGL_NO_DEVICE_EXT, but I am experiencing this and quite confused how this might occur without my EGLDisplay being totally busted.
05:37 kurufu: Is it possible that glvnd is somehow introducing this, debugging with minimal mesa symbols suggests I get into eglQueryDisplayAttribEXT in eglapi.c, and it writes to value, but it writes 0.
05:37 kurufu: but i dont have symbols to inspect disp.
05:38 HdkR: kurufu: I'd recommend getting symbols
05:41 kurufu: If symbols suggest that device is 0, whats left?
05:43 HdkR: Confirmation at least :)
05:45 kurufu: The write is confirmed at least, by the time it returns to glvnd indeed it has written nothing and returned something that seems to be forbidden by the spec.
05:45 kurufu: but yea ill build mesa later and confirm.
05:54 kurufu: `$5 = (_EGLDevice *) 0x0` gdb seems to agree the device is 0.
06:36 kurufu: I guess its worth a bug, it seems dri2_setup_device happens and has the device and sets it, but the same display later has the device zeroed out...
09:58 ity: Hi, is there a channel for asking for kernel driver bugs, namely amdgpu? I got a serious the-driver-crashes-blender-in-the-kernel-on-7900xtx issue :/ I am aware this channel is for mesa, but I thought you all might know a place I could ask for kernel stuff too.
10:01 pepp: ity: you can report your issue here: https://gitlab.freedesktop.org/drm/amd/-/issues
10:35 sima: ity, this is also for kernel stuff, it's kinda the general gpu driver channel
10:35 sima: ity, agd5f and hwentlan are two people here who can help with amdgpu kernel issues
10:57 ity: Ooh, oki, should I send the stacktrace + info here?
11:00 ity: Right, so, it occurs when I open Preferences in blender (4.0.2), the blender window freezes, and it does not respond to SIGKILL, which makes me think it's stuck in a syscall. Imma copy the log and send a link right away, gimme a second
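A process that ignores SIGKILL is typically in uninterruptible sleep (state `D`), which would fit ity's guess that blender is stuck inside a syscall. Below is a minimal Python sketch for checking this on Linux. `parse_state` and `describe_pid` are hypothetical helper names, not part of any tool mentioned in the log, and the state table is only a subset of what proc(5) documents.

```python
from pathlib import Path

# Single-letter task states as documented in proc(5); only a subset is listed.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually blocked inside the kernel)",
    "Z": "zombie",
    "T": "stopped",
}


def parse_state(stat_line: str) -> str:
    """Return the state letter from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' rather than on spaces.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe_pid(pid: int) -> str:
    """Read /proc/<pid>/stat (Linux only) and describe the task state."""
    state = parse_state(Path(f"/proc/{pid}/stat").read_text())
    return f"{pid}: {state} ({STATE_NAMES.get(state, 'other')})"
```

Calling `describe_pid` with the PID of the frozen blender process would report `D` if the task is blocked in the kernel, which is the case where SIGKILL has no visible effect until the blocking call returns.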
11:01 ity: https://hastebin.skyra.pw/ekijuvoxuc dmesg output
11:02 ity: uname -a `Linux ity-pc 6.7.0-arch3-1 #1 SMP PREEMPT_DYNAMIC Sat, 13 Jan 2024 14:37:14 +0000 x86_64 GNU/Linux` , Arch Linux-patched kernel
11:03 ity: Do note that both Vulkan & OpenGL seem to work, I tested a few random games + my own few Vulkan test apps, as well as HIPBLAS with llama.cpp.
11:04 ity: Lemme also try stracing blender
11:06 ity: write(2, "HIP hipInit: Invalid device\n", 28HIP hipInit: Invalid device
11:06 ity: ) = 28
11:06 ity: ioctl(12, DRM_IOCTL_AMDGPU_GEM_CREATE, 0x7ffd050651e0) = 0
11:06 ity: ioctl(12, DRM_IOCTL_V3D_PERFMON_CREATE
11:06 ity: Last output in strace
11:11 ity: Unfortunately I do not have any experience debugging kernel drivers whatsoever, so this is as much info as I can provide right away. I do not have the option to passthrough the GPU to a QEMU VM and attach a debugger to the kernel at the current moment unfortunately. This might be a regression, as it did not happen a few months ago, though I had firmware issues with this particular
11:11 ity: combo of mobo & GPU (/sys/class/dmi/id/board_name = PRO Z690-A WIFI DDR4(MS-7D25))
11:12 javierm: vsyrjala: at some point the kernel stable process changed from opt-in to opt-out :(
11:13 javierm: now it seems that even commits without a Fixes: tag are getting pulled, I guess yours was due the "Fix" in the subject line?
11:15 pepp: ity: does your kernel have this revert https://patchwork.freedesktop.org/patch/573129/?
11:16 pepp: if not you probably want to add it because it's likely the fix for your issue
11:18 ity: Actually my browser just crashed and I am unable to open it, and random utilities seem to refuse to start up
11:18 ity: Namely doas, firefox, chromium, nix all refuse to cooperate now that the blender crash occurred, no idea how to fix that one up
11:18 ity: Not sure if it's related to the GPU bug
11:19 ity: Random apps that worked before are now crashing, I think I might have to reload my session
11:23 ity: Add `login` to the list of things affected, had to reboot :/
11:24 ity: Lemme check the patch
11:25 ity: Is the patch mainlined?
11:26 ity: pepp:
11:50 ity: Trying to compile the kernel with the patch applied
12:02 sima: mripard, some neat discussion going on on #wayland about the totally broken SAND format modifiers in vc4
12:15 karolherbst: mhhhh... hitting an LTO related bug in radeonsi? :')
12:23 karolherbst: mareko: any idea what's going wrong here? https://gist.githubusercontent.com/karolherbst/266a8ce86e4cf021cc835d086b962f28/raw/bf99cef3af9616b19453e9533ddf55189f68a002/gistfile1.txt
12:23 karolherbst: getting an `LLVM ERROR: Cannot select: 0x7f91140ba7e0: v4i32 = bitcast 0x7f91140b5940`
13:17 ity: weechat wiped my IRC history :/. With that said, seems that on kernel 6.7.2 the GPU seems to be fully out of order, same with 6.7.0 with the patch applied. The kernel never modesets stuff correctly if a monitor is connected to the dGPU's output. The dGPU is also not visible to *anything* now. There might be another thing that happened that broke it to this degree, but my GPU that
13:17 ity: was semi-working yesterday outside of blender is no longer working at all :( There is a bunch of amdgpu stacktraces in dmesg
13:31 karolherbst: ity: maybe your cable isn't properly connected or something? unplugging an eGPU isn't really supported, or rather, pretty much not tested, so a flaky cable could trigger all sorts of weird issues
13:31 karolherbst: best to reboot with it connected and use the `sysfs` `remove` file to remove the eGPU before unplugging
13:31 ity: Wdym by unplugging?
13:32 karolherbst: the eGPU?
13:32 zamundaaa[m]: karolherbst: unplugging eGPUs works completely fine with amdgpu + Plasma Wayland
13:32 ity: I am confused
13:32 zamundaaa[m]: But ity wrote dGPU
13:33 karolherbst: ehh wait..
13:33 karolherbst: ohh yeah...
13:33 karolherbst: my fault 🙃
13:33 karolherbst: brain is silly today
13:33 ity: Why would I unplug the dGPU, like I have before while debugging but the computer has rebooted quite a few times since then
13:33 karolherbst: nevermind me
13:33 ity: Ah
13:34 karolherbst: I read "eGPU" not "dGPU"
13:34 ity: OH
13:35 ity: I kinda hoped when I bought the 7900 XTX that it will be a smooth experience on Linux :/ So far it has given me so much more trouble than even Nvidia :/
13:35 ity: This computer has never 100% worked since I bought it
13:36 pq: ity, are you sure it's not a hardware fault?
13:36 karolherbst: mhh yeah.. new gen issues probably
13:37 ity: Well, I stress-tested the GPU on another computer yesterday, and absolutely no issues happened
13:37 karolherbst: or hw being faulty, though in 99% of the cases where that's assumed it's actually a sw bug :P
13:37 pq: ity, like maybe your PSU is not big enough?
13:37 ity: Is 1000W not enough? :P
13:37 pq: I don't know what is, or what you have.
13:37 karolherbst: ity: maybe want to ask inside #radeon .. I think..
13:38 ity: #radeon for questions about amdgpu? I could try
13:38 karolherbst: yeah
13:38 ity: pq: it's the amount of watts the power supply has
13:38 karolherbst: checked the bug tracker? could be there are others with the same issue
13:39 pq: ity, I don't know how much is enough, or how much you have.
13:39 ity: I did say I have 1000W
13:39 pq: sure
13:40 pq: ity, you said the machine has never worked properly? even without the dGPU?
13:40 ity: karolherbst: I haven't yet tbh, this whole situation is extremely stressful for me :/ I just want my computer to at least go back to only crashing inside blender. Could this be a 6.7.2 regression? And downgrading to 6.7.0 would fix it? I dunnooooooooo :/
13:40 ity: The machine never worked *fully*, but the issues it had changed ~monthly
13:40 ity: Namely the dGPU *used* to work ~5 months ago
13:41 pq: so it has always been somewhat unstable?
13:41 karolherbst: ity: tried 6.6?
13:42 karolherbst: but yeah.. I'd file a bug report on gitlab or at least see if others have similar issues. Could also be that it's just HIP being HIP or something...
13:42 ity: haven't tested 6.6 recently yet nope
13:42 karolherbst: are you only seeing those issues when using HIP/ROCm, or generally?
13:43 ity: Currently the dGPU is 100% out of order, it fails to modeset
13:43 ity: HIP/ROCm used to work
13:43 karolherbst: ahh
13:43 karolherbst: on a fresh boot? but it works in a different machine?
13:44 karolherbst: yeah.. I'd try downgrading the kernel release first.. try 6.6 or 6.5 and see if that's better
13:44 ity: Yep on a fresh boot. It works on a different machine running Windows (I have no Windows machines at home so I drove to someone else's place to test it)
13:44 ity: Hmm, downgrading, I guess lemme try downgrading to 6.7.0 first
13:45 ity: Do note I have no IRC bouncer set up, so I won't be able to see any messages while I am rebooting
13:50 ity: Back, the kernel that used to work yesterday, 6.7.0, doesn't anymore
13:50 ity: oftc seems to be having some connection issues :/ took 8 tries to connect
13:50 ity: [ 5.997772] amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000006 SMN_C2PMSG_82:0x00000000
13:50 ity: [ 5.997774] amdgpu 0000:03:00.0: amdgpu: Failed to enable requested dpm features!
13:50 ity: [ 5.997775] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
13:50 ity: [ 5.997775] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <smu> failed -62
13:50 ity: [ 5.997973] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
13:50 ity: [ 5.997974] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
13:50 ity: [ 5.997976] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
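The `-62` in the `hw_init` line above is a negated Linux errno. In the kernel's generic errno table 62 is `ETIME` ("Timer expired"), which would be consistent with the SMU not answering the previous command in time. Below is a small Python sketch that pulls the failing IP block and error name out of dmesg text. The function name and the regular expression are assumptions made for illustration, and the errno table is hardcoded because Python's `errno` module reflects whatever OS the script runs on.

```python
import re

# Linux errno values relevant here (asm-generic numbering).
# Hardcoded on purpose: Python's errno module reflects the host OS.
LINUX_ERRNO = {62: "ETIME", 110: "ETIMEDOUT"}

# Matches lines such as: "hw_init of IP block <smu> failed -62"
HW_INIT_RE = re.compile(
    r"hw_init of IP block <(?P<block>\w+)> failed (?P<code>-?\d+)"
)


def find_hw_init_failures(dmesg_text):
    """Return a list of (ip_block, errno_name) for each hw_init failure."""
    failures = []
    for line in dmesg_text.splitlines():
        match = HW_INIT_RE.search(line)
        if match:
            code = abs(int(match.group("code")))
            name = LINUX_ERRNO.get(code, f"errno {code}")
            failures.append((match.group("block"), name))
    return failures


sample = (
    "[ 5.997775] [drm:amdgpu_device_init [amdgpu]] "
    "*ERROR* hw_init of IP block <smu> failed -62"
)
print(find_hw_init_failures(sample))  # prints [('smu', 'ETIME')]
```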
13:50 ity: I should probably post this in #radeon too I guess?
13:52 karolherbst: ity: https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2024-01-31
13:53 karolherbst: ity: yeah.. I'd check the bug tracker if there is something going on with this
13:53 karolherbst: https://gitlab.freedesktop.org/drm/amd/-/issues/
13:53 karolherbst: but looks like a kernel issue
13:53 karolherbst: I'd try 6.6 or older as well
13:54 karolherbst: ity: maybe this is yours? https://gitlab.freedesktop.org/drm/amd/-/issues/3140
13:54 karolherbst: anyway..
13:54 karolherbst: there seem to be regressions in 6.7
13:56 ity: Hmm, I do not *think* so, as 6.7.0 *worked* yesterday, and the error seems slightly diff. I will try with 6.6 though, in case some magic thing triggered the 6.7 regression
13:56 ity: Thanks for the IRC logs btw
13:57 karolherbst: ohh wait
13:57 karolherbst: ity: https://gitlab.freedesktop.org/drm/amd/-/issues/3135 maybe that's yours?
13:58 karolherbst: same GPU at least 🙃
13:58 karolherbst: looks like a linux-firmware update broke it
13:58 karolherbst: which would explain why it broke on 6.7 if your firmware files were updated in the meantime
13:59 ity: What oughta do it, I did not downgrade linux-firmware
13:59 ity: Though that does not seem to be the exact issue either
13:59 ity: My GPU fails to modeset, rather than random hangs
13:59 ity: Random hangs *did* happen before but they only happened ~once a week, ~5 months ago or so
14:01 ity: If the hangs are patched tho then that'd be nice
14:01 ity: Though rn my priority is getting the GPU to not fail initialization
14:01 ity: Ig firmware would explain "*ERROR* hw_init of IP block <smu> failed -62" ?
14:02 karolherbst: yeah...
14:02 karolherbst: sometimes those issues cause random errors to appear
14:03 karolherbst: I'd try if you can get linux-firmware-git or something installed
14:03 karolherbst: and see if that solves it
14:03 ity: https://gitlab.freedesktop.org/drm/amd/-/issues/3110 this might be related? Idk
14:03 karolherbst: seems like those files were pushed a week ago
14:03 karolherbst: maybe
14:03 karolherbst: could be a duplicate of the other
14:03 karolherbst: who knows :)
14:04 ity: This is torture :)
14:04 karolherbst: welcome to kernel development/debugging
14:04 ity: I can try fetching linux-firmware-git
14:04 ity: Thank you :D This is indeed the 12th ring of hell
14:04 ity: (11th ring being X11)
14:05 ity: I am gonna try -git + booting with no screen plugged in
14:05 ity: Thank god I can blast loud music outta my speakers to try to destress a bit while doing all this lmao
14:06 karolherbst: mood
14:06 ity: Listening to an English cover of Cruel Angel's Thesis haha.
14:06 ity: While installing the AUR package
14:08 ity: In order to have a change of pace from listening to the German cover
14:12 ity: Eyy it's done, reboot time!!
14:35 ity: Aaaand now not even the iGPU modesets if the dGPU is plugged in, I had to plug out the dGPU to make the system boot, even into firmware settings
14:40 mareko: karolherbst: you need to set shader_info::image_buffers
14:40 karolherbst: ahh
14:41 mareko: it's only set by GLSL right now
14:41 karolherbst: yeah.. let me try that
14:44 karolherbst: okay cool, this seems to work, it crashes a bit later on a test with GL_RGBA16F mhh...
14:45 karolherbst: but GL_RGBA16UI_EXT works..
14:45 karolherbst: oh well
14:45 karolherbst: okay, fixing the image_buffers thing first properly and debug the other bug after that :)
14:45 ity: karolherbst: After booting with linux-firmware-git, somehow now my iGPU isn't modesetting either if the dGPU is plugged in :/ Any ideas what to do now? I plugged it out so I can use my computer at all, but like, :( I don't want my 7900 XTX to only be an expensive paperweight...
14:46 karolherbst: ity: made sure your initramfs was regenerated and everything?
14:46 karolherbst: but yeah.. no ideas besides commenting on the bug with "yeah, same issue here" or so
14:47 ity: mkinitcpio ran yea, hmm
14:47 karolherbst: I'd try 6.6 and older linux-firmware (from when it used to work or so) to get you going at least
14:49 ity: I am now highly afraid to touch this pile of cards, but I will try that in a bit haha
14:49 ity: So, kernel 6.6 and which linux-firmware ver?
14:50 ity: Like, by messing with it seemingly random shit broke
14:50 ity: Eg my mouse no longer works properly :(
14:52 ity: I feel like I am cursed
14:52 ity: I always run into the most obscure bugs ever
14:54 ity: Gonna try with firmware 2023-09-18 & kernel 6.6.0
14:54 ity: Or is that a bad combo?
14:54 karolherbst: ity: probably good to follow the changes here: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/amdgpu
14:54 karolherbst: ity: there can't be a bad combo as it's always supposed to work "somewhat"
14:55 karolherbst: seems like there was a big update on 2023-09-28
14:55 ity: Hmm
14:55 karolherbst: so maybe try before/after that
14:55 karolherbst: maybe before first and see if that works
14:55 ity: Oki
14:56 ity: Time to shut this thing down, plug in the dGPU, and try if it works :/
14:56 ity: Imma report back in hopefully in a bit
15:02 karolherbst: mareko: thanks, properly setting image_buffers makes it all work :)
15:11 ity: karolherbst: good news, system boots, OpenGL & Vulkan work, monitor is connected to the dGPU. Bad news, opening Blender Preferences freezes the system. I am unsure whether the kernel and background services are still up or if it died, I need to test that.
15:12 ity: Time to try the newer firmware package
15:12 karolherbst: ity: yeah.. so.. firmware updates happen for a reason, so it might be they tried to address some issues, but then regressed or something
15:12 karolherbst: soo yeah...
15:12 ity: 2023-10-30 namely, which is the oldest package newer than 2023-09-18 & also newer than 2023-09-28
15:13 ity: Yea :/
15:13 ity: Lemme also test if other HIP stuff works
15:14 ity: Well, ROCm, (I still have no idea what is the diff between ROCm, HIP, and BLAS)
15:15 ity: Seems that llama.cpp works, it does some ROCm HIPBLAS stuff
15:15 karolherbst: BLAS is an API, ROCm is what AMD calls their current compute stack and HIP is a CUDA clone in bad
15:16 ity: A compute API, smth like OpenCL?
15:16 karolherbst: nah
15:16 karolherbst: more on a primitive level
15:16 ity: Oh
15:16 karolherbst: like...
15:16 ity: O?
15:16 karolherbst: implementing algorithms
15:16 ity: I am confused
15:16 karolherbst: BLAS is a collection of common algorithms, and there are many implementations for various compute APIS
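To make "a collection of common algorithms" concrete, here is a pure-Python sketch of two classic BLAS operations, AXPY (level 1) and GEMV (level 2). These are reference illustrations only. Real BLAS implementations such as hipBLAS operate on typed arrays, usually update `y` in place, and take extra stride and layout arguments that are left out here.

```python
def axpy(alpha, x, y):
    """Level 1 BLAS: return alpha * x + y for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 BLAS: return alpha * (A @ x) + beta * y.

    `a` is a row-major list of rows; each row must be as long as `x`,
    and there must be one row per element of `y`.
    """
    if len(a) != len(y) or any(len(row) != len(x) for row in a):
        raise ValueError("dimension mismatch")
    return [
        alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
        for row, yi in zip(a, y)
    ]
```

The same operation names and semantics are what the various backends (CPU, CUDA, HIP, OpenCL) implement, which is why an application can target "BLAS" and swap the library underneath.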
15:17 ity: OOOH
15:17 karolherbst: that reminds me.. I wanted to try llama.cpp on top of CL 🙃
15:17 ity: So HIPBLAS is a part of ROCm and is AMD's impl of BLAS on top of their compute API called HIP which is like OpenCL but AMD specific?
15:17 ity: CL?
15:17 ity: ~~You don't mean Common Lisp right~~
15:18 karolherbst: OpenCL
15:18 ity: OH
15:18 ity: Which GPU do you have?
15:18 karolherbst: that's difficult to answer
15:18 ity: Oh lmfao
15:19 karolherbst: because I don't have one GPU, I have like 40 🙃
15:19 karolherbst: maybe 50
15:19 karolherbst: but one of the discrete AMDs I have is a 6700 XT I think
15:21 ity: Wait why do you have 50 GPUs, are you a data center O.O
15:21 tnt: karolherbst: did you get an Arc one btw ?
15:21 karolherbst: I haven't
15:21 karolherbst: ity: developer more like
15:22 karolherbst: *driver developer
15:22 ity: OH
15:22 ity: Makes sense
15:22 karolherbst: though most are Nvidia ones :D
15:22 ity: Where do you get the money for it though
15:22 karolherbst: that's the neat part, I don't
15:22 ity: O.O
15:22 karolherbst: my employer does :P
15:22 karolherbst: well
15:22 karolherbst: some GPUs are also just lendings, so there is that
15:22 ity: Oooh, might I ask who is your employer?
15:22 karolherbst: red hat
15:22 ity: Oooh
15:23 ity: Which drivers do you work on?
15:23 karolherbst: mhhh.. I used to work primarily on nouveau, but lately I've been found working on a couple of drivers for various reasons (mostly fixing OpenCL related issues or adding new features or something)
15:24 ity: Ooooh
15:25 ity: Sounds fun tbh haha, I would like to work on drivers but it seems kinda impenetrable lol.
15:25 ity: Well, I am reading driver code at random and trying to understand how the stuff works, but it has been...
15:25 ity: Very hard :D
15:38 ity: So, the new firmware seems to have the regression in that the amdgpu does not modeset but it does not prevent the igpu (intel) from modesetting. "*ERROR* hw_init of IP block <smu> failed -62". The regression seems to have happened somewhere between 2023-09-18 (blender crash) and 2023-10-30 (hw_init of IP block)
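With a known-working and a known-broken firmware date, the remaining search is a bisection over the snapshots in between. Below is a minimal Python sketch of that search. `first_bad` and the `is_good` callback are hypothetical names, and in practice "testing" a snapshot means installing it, regenerating the initramfs, rebooting and checking whether the dGPU initializes, so the callback here only stands in for that manual step.

```python
def first_bad(versions, is_good):
    """Return the index of the first version for which is_good() is False.

    Assumes versions are ordered oldest to newest, versions[0] is good,
    versions[-1] is bad, and the breakage persists once introduced
    (the usual bisection assumption).
    """
    lo, hi = 0, len(versions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid  # breakage was introduced after mid
        else:
            hi = mid  # breakage was introduced at or before mid
    return hi
```

Each loop iteration corresponds to one install-and-reboot cycle, so the number of reboots grows with the logarithm of the number of snapshots rather than linearly.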
15:39 ity: It is also possible that the blender thing is a regression in the amdgpu driver rather than the firmware like someone mentioned before, idk.
15:40 ity: Though the patch mentions "since v6.6.1. Revert it to fix blender again.", though I am on v6.6.0
15:42 ity: I am not 100% sure how I should report this on the issue tracker
15:43 ity: There is no working version, just diff versions get diff issues
15:43 ity: Which one do I report
15:44 karolherbst: yeah... maybe just file a new issue and describe the entire situation
15:45 ity: Honestly I kinda lost track of the situation partway through myself, there is just so much like, random things happening :/
15:45 ity: Let's see what I remember
15:47 ity: Latest kernel and firmware, unable to boot at all iirc? v6.6.0, firmware-20230918, ROCm, VK & GL work, but Blender Preferences bring down the kernel. Nothing interesting in journalctl --boot=-1. firmware-2023-10-30, the AMD gpu fails to modeset but Intel iGPU modesets properly. Latest kernel & firmware on arch, computer refuses to boot at all, I have no further confirmed
15:47 ity: information.
15:48 karolherbst: ity: I think I'd ignore the blender thing for now
15:48 ity: Oh I am dum I repeated myself at the beginning and end of the message, wtf is with my short term memory
15:48 karolherbst: could be a rocm bug or something
15:48 ity: Hmm
15:48 karolherbst: if your system boots, that's a baseline for the kernel
15:48 karolherbst: userspace misbehaving can bring down the GPU
15:48 karolherbst: and GPU reset isn't the most reliable thing on AMD
15:48 ity: I mean it *used* to work at *some* point, I don't know when though
15:48 ity: Could also be a blender regression
15:48 ity: Ooh
15:48 karolherbst: yeah.. so if your kernel accesses a NULL pointer with bad luck this can bring down your system as well. it shouldn't but...
15:49 karolherbst: those things aren't easy to blame the kernel or userspace for without investigating
15:49 ity: Ah
15:49 karolherbst: so if your system boots, that's your working state :D
15:49 ity: I mean rn the blender thing is decently important for me
15:49 karolherbst: if updating linux-firmware breaks it -> bug, if updating your kernel breaks it -> bug
15:49 karolherbst: yeah...
15:49 karolherbst: but that's a different bug probably
15:49 ity: Might be
15:50 karolherbst: should either file against blender or ROCm
15:50 ity: Should I try older blender versions perhaps hmm
15:50 karolherbst: yeah.. maybe
15:50 karolherbst: or older rocm
15:50 ity: Hmm
15:51 ity: There's 2 HIPs in arch repos for ROCm O.O
15:52 ity: rocm-hip-runtime and hip-runtime-amd
15:53 ity: Hmm
15:53 ity: The arch rocm version is from 2023-11-12
15:53 ity: Well, llama.cpp works so I don't *think* it's rocm? But might also be a diff code path between blender and llama, idk
15:54 karolherbst: yeah..
15:54 karolherbst: it can easily be triggered by a kernel doing weird things
15:54 karolherbst: or something
15:55 ity: I mean I presume that just opening Preferences wouldn't actually run a compute kernel... Right???
15:55 ity: Is this naive hope
15:55 karolherbst: it probably is :D
15:55 ity: Oh no...
15:56 karolherbst: could try to identify capabilities by running stuff, who knows
15:56 karolherbst: or runtime initialization or something
15:56 ity: That's a lovely test, if you hit a negative nothing happens, if you hit a positive the computer blows up :D The 7900 XTX also has a horrid POST time of 20 seconds, so each reboot is costly
15:57 karolherbst: that's quite a bit
15:57 ity: I forgot that nix on non-nixos has problems with graphics acceleration and tried to downgrade blender with nix :D
15:57 ity: Yea :/
15:59 ity: Oh fuck "blender: error while loading shared libraries: libOpenColorIO.so.2.2: cannot open shared object file: No such file or directory"
16:07 jani: drm-tip seems to have a wrong conflict resolution for drivers/gpu/drm/bridge/samsung-dsim.c http://paste.debian.net/1305887/
16:17 ity: Back, oftc took a few minutes to decide to stop timing out. So, I tried blender inside a flatpak, and also no dice, it can't see the GPU at all there, but at least it doesn't crash
16:44 MrCooper: is it intentional that code-validation stage CI jobs run automatically in Mesa fork pipelines?
17:05 airlied: jani: yes it does, there's a thread, I should get to fixing it up
17:06 jani: airlied: thanks
17:39 airlied: hwentlan: https://keithp.com/blogs/MST-monitors/
18:18 hwentlan: airlied: thanks
23:43 airlied: jani: okay should be fixed now
23:44 airlied: as soon as tip rebuilds here