02:27 fdobridge_: <e​nigma9o7> Hello, just checking in again. I have issues with nouveau causing my machine to lockup, and trying to learn how to get the appropriate logs, and what to do with them.
02:28 fdobridge_: <e​nigma9o7> I was advised earlier to try a newer kernel, which I did, and now it locks up way more often, so it's easier to catch an error in journalctl; I've posted a couple of screenshots but no feedback yet. I'm wondering what kinda logs might help, and what should be my next step?
02:28 fdobridge_: <e​nigma9o7> I went back to my 5.4 kernel cuz it only locks up every day or two, with the 6.6 it happens every few hours or less....
02:30 fdobridge_: <e​nigma9o7> I was also advised to get logs, so I have found this journalctl command, and wonder if that's the logs you want, or something else....
02:31 fdobridge_: <e​nigma9o7> (This is on old hardware, gt330m, no lockups with nvidia-340 driver)
02:32 fdobridge_: <e​nigma9o7> (And a laptop, with no other gpu)
04:39 fdobridge_: <e​nigma9o7> https://cdn.discordapp.com/attachments/1034184951790305330/1186165818208292985/shot-2023-12-17_20-38-57.png?ex=65924200&is=657fcd00&hm=02f5114ef2d90b350611a0a6bc4a5ae430370396fe3106f4050b546374feff5e&
04:40 fdobridge_: <e​nigma9o7> Okie dokie, just had my first lockup since I switched back to 5.4. Please let me know if these logs are useful, or what exactly you're looking for, and if there's a better place to post for support.
04:41 fdobridge_: <e​nigma9o7> https://cdn.discordapp.com/attachments/1034184951790305330/1186166445827166239/shot-2023-12-17_20-41-26.png?ex=65924296&is=657fcd96&hm=bad2c091971e0e8a103d652c344a4066f238bfa1f54ba309680daa625e022bd4&
04:42 fdobridge_: <e​nigma9o7> https://cdn.discordapp.com/attachments/1034184951790305330/1186166662358106162/shot-2023-12-17_20-42-25.png?ex=659242ca&is=657fcdca&hm=da7f462d6d1feb023f6e4ae5857d1a97459613e5da8be1523b4cfd71d172e6ad&
04:43 fdobridge_: <e​nigma9o7> https://cdn.discordapp.com/attachments/1034184951790305330/1186166922274934824/shot-2023-12-17_20-43-26.png?ex=65924308&is=657fce08&hm=a146dccf9616197b72b04a5e657bdf66308a12404cf8265b8ae49063e63894b8&
04:44 fdobridge_: <e​nigma9o7> https://cdn.discordapp.com/attachments/1034184951790305330/1186167179486449747/message.txt?ex=65924345&is=657fce45&hm=621d6062ca1e77fcd8cf9d34abde4502ad6f9d75c082bda52cc4c4adf36535b1&
04:47 fdobridge_: <e​nigma9o7> https://cdn.discordapp.com/attachments/1034184951790305330/1186167882762158100/message.txt?ex=659243ed&is=657fceed&hm=48fc990986f9a7f5b958999336913d524ae12aa83b993ec194eaa5233fcab429&
04:48 fdobridge_: <e​nigma9o7> Okay that textblock shows it from the beginning.
04:49 fdobridge_: <e​nigma9o7> The screenshots are just bits and pieces, there's obviously lotsa stuff that goes crazy. From the user's perspective, the mouse keeps moving, but I can't do anything, can't switch to a tty, and pushing the power button doesn't appear to work (but it does if I wait).
06:01 fdobridge_: <a​irlied> that stuff isn't really useful, it just says the gpu hung
06:30 fdobridge_: <e​nigma9o7> so what is useful, that's the question....
06:31 fdobridge_: <e​nigma9o7> It happens every day or two, so I can do whatever is necessary to capture it, or if its already logged can dig it out easily....
06:33 fdobridge_: <e​nigma9o7> The reason for "Dec 17 20:32:50 VPCF115FM systemd-logind[908]: System is powering down" is cuz I pushed the power button cuz the computer had been completely unresponsive since "Dec 17 20:31:41 VPCF115FM kernel: nouveau 0000:01:00.0: FSBroker2209[1550]: failed to idle channel 16 [FSBroker2209[1550]]"; only the mouse moves, but I can't switch apps, can't do anything, can't open a tty with ctrl-alt-f2, etc.
06:34 fdobridge_: <e​nigma9o7> So the problem is this 'failed to idle channel' stuff. I've googled for FSBroker with no clue what that is.
06:37 fdobridge_: <e​nigma9o7> so what is useful, that's the question @airlied ? (edited)
06:51 fdobridge_: <a​irlied> I've no idea what FSBroker is or why it's creating a channel
06:51 fdobridge_: <a​irlied> I don't really have a good idea what is useful, finding GPU hangs is messy work, you have to work out what application is causing it, and see if you can then work out if it is doing something strange
07:36 fdobridge_: <e​nigma9o7> That sounds hard.
07:40 fdobridge_: <e​nigma9o7> To "cause" this, all I have to do is use nouveau for a day or two. I can collect logs as directed, and describe what happens. I can build and try other kernels. But that's about it. I dunno how to do more than that, but I'm happy to test and report until it's resolved.... or if advised, give up and go back to nvidia-340 and cry cuz I can't upgrade OS.
07:45 fdobridge_: <r​edsheep> Does it ever crash when you aren't interacting with it at all? If not it could be worth disabling gpu acceleration for any applications you use that can tolerate the performance loss, and if you stop crashing then it was connected to one of those.
07:46 fdobridge_: <r​edsheep> If and when you know it is a particular application then that's at least a start on a useful path to finding the root cause
09:50 fdobridge_: <S​id> @airlied definitely a nouveau-i915 interaction thing, reproduced with icd specification as well
09:51 fdobridge_: <S​id> ```
09:51 fdobridge_: <S​id> [ 599.058607] nouveau 0000:01:00.0: gsp:msg fn:4123 len:0x24/0x4 res:0x0 resp:0x0
09:51 fdobridge_: <S​id> [ 599.058615] msg: 00000000: 0b 00 00 00 ....
09:51 fdobridge_: <S​id> [ 599.196137] nouveau 0000:01:00.0: gsp:msg fn:4123 len:0x24/0x4 res:0x0 resp:0x0
09:51 fdobridge_: <S​id> [ 599.196145] msg: 00000000: 0c 00 00 00 ....
09:51 fdobridge_: <S​id> [ 660.751412] nouveau 0000:01:00.0: Metal.exe[7610]: job timeout, channel 24 killed!
09:51 fdobridge_: <S​id> [ 660.751556] [drm:nouveau_job_submit [nouveau]] *ERROR* Trying to push to a killed entity
09:51 fdobridge_: <S​id> [ 671.201624] Asynchronous wait on fence 0000:00:02.0:kwin_wayland[883]:d0e8 timed out (hint:intel_atomic_commit_ready [i915])
09:51 fdobridge_: <S​id> [ 671.201837] Asynchronous wait on fence drm_sched:nouveau_sched:833 timed out (hint:submit_notify [i915])
09:51 fdobridge_: <S​id> [ 682.295123] Asynchronous wait on fence 0000:00:02.0:kwin_wayland[883]:d0f4 timed out (hint:intel_atomic_commit_ready [i915])
09:51 fdobridge_: <S​id> [ 682.295330] Asynchronous wait on fence drm_sched:nouveau_sched:833 timed out (hint:submit_notify [i915])
09:51 fdobridge_: <S​id> [ 693.175673] Asynchronous wait on fence 0000:00:02.0:kwin_wayland[883]:d100 timed out (hint:intel_atomic_commit_ready [i915])
09:51 fdobridge_: <S​id> [ 693.175873] Asynchronous wait on fence drm_sched:nouveau_sched:833 timed out (hint:submit_notify [i915])
09:51 fdobridge_: <S​id> [ 704.269646] Asynchronous wait on fence 0000:00:02.0:kwin_wayland[883]:d102 timed out (hint:intel_atomic_commit_ready [i915])
09:51 fdobridge_: <S​id> [ 704.269875] Asynchronous wait on fence drm_sched:nouveau_sched:833 timed out (hint:submit_notify [i915])
09:51 fdobridge_: <S​id> [ 715.150325] Asynchronous wait on fence 0000:00:02.0:kwin_wayland[883]:d104 timed out (hint:intel_atomic_commit_ready [i915])
09:51 fdobridge_: <S​id> [ 715.150537] Asynchronous wait on fence drm_sched:nouveau_sched:833 timed out (hint:submit_notify [i915])
09:51 fdobridge_: <S​id> [ 768.274502] Asynchronous wait on fence 0000:00:02.0:kwin_wayland[883]:d110 timed out (hint:intel_atomic_commit_ready [i915])
09:51 fdobridge_: <S​id> [ 768.274706] Asynchronous wait on fence drm_sched:nouveau_sched:833 timed out (hint:submit_notify [i915])
09:51 fdobridge_: <S​id> ```
09:51 fdobridge_: <S​id> however not reproduced when running on an external display over hdmi
10:15 fdobridge_: <S​id> repro'd on external display over hdmi as well
10:15 fdobridge_: <!​DodoNVK (she) 🇱🇹> Can you try a NVIDIA-only setup?
10:16 fdobridge_: <S​id> sadly not
10:17 fdobridge_: <!​DodoNVK (she) 🇱🇹> I mean disconnecting the iGPU from the compositor 🔌
10:17 fdobridge_: <S​id> actually, let me try
10:17 fdobridge_: <S​id> ?
10:18 fdobridge_: <S​id> not sure I follow
10:18 fdobridge_: <!​DodoNVK (she) 🇱🇹> There's a variable called `KWIN_DRM_DEVICES` on KDE (and `WLR_DRM_DEVICES` on wlroots)
10:20 fdobridge_: <S​id> ah
10:20 fdobridge_: <!​DodoNVK (she) 🇱🇹> You have to set it to the path to the DRI device for the NVIDIA card (in my case this would be `KWIN_DRM_DEVICES=/dev/dri/card0`)
10:22 fdobridge_: <!​DodoNVK (she) 🇱🇹> So in my case I have to run `KWIN_DRM_DEVICES=/dev/dri/card0 startplasma-wayland` in a tty when the DE isn't running
10:24 fdobridge_: <S​id> on it
10:24 fdobridge_: <!​DodoNVK (she) 🇱🇹> You should connect an external monitor to the NVIDIA GPU if you want to try this (otherwise you won't see anything on the laptop screen unless you have a MUX switch)
10:25 fdobridge_: <S​id> yup, that's what I'm doing
10:36 fdobridge_: <S​id> just played through quake 2's tutorial level, dxvk, 60fps (edited)
10:37 fdobridge_: <S​id> first level is at 25-40 fps but no system freezes
10:39 fdobridge_: <S​id> metal hellsinger itself freezes but doesn't cause the system to go unresponsive
10:42 fdobridge_: <S​id> even control (dx11) runs at 12-15 fps but doesn't cause a system freeze
11:23 fdobridge_: <h​untercz122> @tiredchiku does it solve the freezes which needs hard reset?
11:27 fdobridge_: <S​id> it does, yeah
11:27 fdobridge_: <S​id> so it's definitely something with nouveau/i915 kernel module interaction
12:44 diagonal10x: What I was saying yes I liked the girl who turned out to be a fuck up in any reasonable event, I really can not do anything hence, since I can not recommend them, they can not manage things, they would ruin everything, hence the decision was to just block the sexual attraction based thingy, and I say I can not rely on what they do either not only broken attraction but also broken friendship cause for me they are too big fuck ups to
12:44 diagonal10x: maintain friendship, but you are opposite here, you are geniuses. One extreme to other.
12:47 Ermine: karolherbst: requesting banhammer
12:50 diagonal10x: I just can't really do anything wherever they go only problems that I am not willing to solve all my life, I do not even know how to :(
12:58 diagonal10x: This was some critics I got that I backed off from people after staying with them under my financial sanctions period, I backed off again, I could not solve their issues they every day caused.
13:01 diagonal10x: So my idea was ok there's no love frankly, then I tried to say we have extreme technology cause one subject appeared to have organically head that should process, but that was not enough, problems kept coming from everywhere they went and did.
13:05 diagonal10x: I could not handle those after every day solving or trying for 2.5 years in a row, simple as that I overestimated my strength there and had to quit alive .
13:05 diagonal10x: While alive
13:46 fdobridge_: <h​untercz122> diagonal10x
18:31 fdobridge_: <e​nigma9o7> Thanks for reply Redsheep. I have in fact enabled webgl in firefox, which I think was disabled by default and I had to force enable, so I could test if that prevents it. And I do always have firefox running, but this doesn't always happen when firefox is in focus, it's happened when I'm using thunar for file management, and I doubt it does anything accelerated... I'll start paying attention to what app I'm using when it crashes next tim
18:32 fdobridge_: <e​nigma9o7> Also I think it's noteworthy that it happens way way way more with the 6.6 kernel than the 5.4, so it seems like some regression. If there's some kernel in between that would be worth testing, I could do that.
18:33 fdobridge_: <e​nigma9o7> like maybe there were minor fixes for a while, then in some version there was some major change, if i could try a version before the major change.... etc.
18:37 fdobridge_: <e​nigma9o7> Could hardware video decoding be related? Maybe I set that up wrong. I was unsure what to do, I kinda mixed a couple versions of tutorials myself and so possibly did it wrong? I installed ubuntu's nouveau-firmware package, but also followed the instructions on the freedesktop page to extract from nvidia proprietary driver and copy manually.... and I used 340.108 driver instead of the 325 that is suggested.... could any of this be an iss
18:38 fdobridge_: <e​nigma9o7> (I'm referring to the instructions on https://nouveau.freedesktop.org/VideoAcceleration.html under "firmware")
18:41 fdobridge_: <e​nigma9o7> I *really really* want to be able to use nouveau. I've said this before, but the nvidia-340 even tho it doesn't lockup, more and more apps are refusing to work with it cuz they're using EGL instead of GLX and nvidia-340's EGL sucks....
18:42 fdobridge_: <e​nigma9o7> (for example can't run most modern gnome apps, including the gtk4 demo app, or the epiphany browser, etc)
18:43 fdobridge_: <e​nigma9o7> even etr doesn't work on nvidia-340 anymore (unless i use old version) and who can live without extreme tux racer!
21:25 fdobridge_: <g​fxstrand> Okay, now I'm even more confused about double-precision. `mufu.rcp64h` works as described in the PTX docs. I've got an amber test that tests all the edge cases and just that op passes.
21:25 fdobridge_: <g​fxstrand> That means the NaN is coming from the `drcp` somehow.
21:25 fdobridge_: <g​fxstrand> WTH?!?
21:25 fdobridge_: <k​arolherbst🐧🦀> mhhhh
21:26 fdobridge_: <k​arolherbst🐧🦀> you mean the lowering?
21:27 fdobridge_: <k​arolherbst🐧🦀> what happens if you use `mufu.rcp64h` directly for the impl and nothing else?
21:29 fdobridge_: <g​fxstrand> Ah! Found it. It's that `dfma 0 inf %x` returns NaN
21:32 fdobridge_: <g​fxstrand> So the approximation is fine. It's the newton steps that are blowing us up
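A minimal sketch of why the Newton steps blow up, assuming the refinement uses the common `y' = y + y*(1 - a*y)` form built from fused multiply-adds. The function name `newton_rcp_step` is hypothetical and this is not the actual NAK or NIR lowering; it only illustrates the `dfma 0 inf %x` case mentioned above.

```rust
/// One Newton-Raphson refinement step for y ≈ 1/a, written with fused
/// multiply-adds. Hypothetical helper, for illustration only.
fn newton_rcp_step(a: f64, y: f64) -> f64 {
    // e = 1 - a*y, the residual error of the current estimate.
    // For a = 0 and y = inf this is fma(-0, inf, 1), and 0 * inf is NaN
    // under IEEE 754, so e becomes NaN.
    let e = (-a).mul_add(y, 1.0);
    // y' = y + y*e. A NaN in e propagates into the refined result.
    y.mul_add(e, y)
}
```

With an ordinary input such as `newton_rcp_step(4.0, 0.24)` the step moves the estimate toward 0.25. With `newton_rcp_step(0.0, f64::INFINITY)`, where the seed is already the correct answer, the step returns NaN.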
21:34 fdobridge_: <k​arolherbst🐧🦀> figures
21:34 fdobridge_: <k​arolherbst🐧🦀> wait...
21:35 fdobridge_: <k​arolherbst🐧🦀> mhh
21:35 fdobridge_: <k​arolherbst🐧🦀> what's the expected IEEE result from this?
21:36 fdobridge_: <g​fxstrand> NaN
21:36 fdobridge_: <g​fxstrand> So the real question is how on earth does this work for everyone else. 😂
21:36 fdobridge_: <k​arolherbst🐧🦀> maybe they don't return NaN?
21:36 fdobridge_: <g​fxstrand> The problem appears to be that NVIDIA is too correct. 🙃
21:38 fdobridge_: <k​arolherbst🐧🦀> ohh btw.. FP64 instructions are fixed and variable latency if I haven't said so already
21:39 fdobridge_: <k​arolherbst🐧🦀> fixed on those uber compute cards with proper FP64 alu, variable on consumer cards (my guess)
21:39 fdobridge_: <g​fxstrand> Yeah, we already handle that
21:39 fdobridge_: <k​arolherbst🐧🦀> okay, cool
21:39 fdobridge_: <k​arolherbst🐧🦀> that means, that barriers are ignored on those GPUs where it's fixed
21:39 fdobridge_: <g​fxstrand> Well, we set barriers for them
21:39 fdobridge_: <g​fxstrand> Oh, so we're getting them wrong there.
21:39 fdobridge_: <k​arolherbst🐧🦀> you have to account for both
21:39 fdobridge_: <g​fxstrand> What's the latency on said uber GPUs?
21:39 fdobridge_: <k​arolherbst🐧🦀> set the wait as if it's fixed, and set barrier as if it's variable
21:40 fdobridge_: <k​arolherbst🐧🦀> I don't know 🙂
21:40 fdobridge_: <g​fxstrand> 🤡
21:40 fdobridge_: <k​arolherbst🐧🦀> but I have such a GPU
21:40 fdobridge_: <k​arolherbst🐧🦀> try double?
21:40 fdobridge_: <k​arolherbst🐧🦀> I actually have 3 of those GPUs.. I think 😄
21:40 fdobridge_: <g​fxstrand> 🤷🏻‍♀️
21:40 fdobridge_: <g​fxstrand> It's not hard to fix
21:40 fdobridge_: <k​arolherbst🐧🦀> ohh actually just one
21:40 fdobridge_: <g​fxstrand> It sucks that they ignore the barriers
21:41 fdobridge_: <k​arolherbst🐧🦀> only the GV100 has proper FP64 support
21:41 fdobridge_: <k​arolherbst🐧🦀> yeah.. once you are done with it, just ping me and I test it on the volta one
21:42 fdobridge_: <k​arolherbst🐧🦀> yeah.. let me read the section carefully once more, just in case
21:43 fdobridge_: <k​arolherbst🐧🦀> yeah..
21:43 fdobridge_: <k​arolherbst🐧🦀> they don't touch the rd/wr scoreboards
21:43 fdobridge_: <p​avlo_it_115> how will the hardware video accelerator question be resolved? Will there be some kind of translator/emulator for this?
21:43 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand at least it's only a concern for `DFMA`
21:44 fdobridge_: <k​arolherbst🐧🦀> (and `DMUL`, `DADD`)
21:44 fdobridge_: <k​arolherbst🐧🦀> afaik
21:45 fdobridge_: <g​fxstrand> Yeah, it's mostly that I need to make my `has_fixed_latency()` more fuzzy. It needs to return `true`, `false`, and `maybe`. 😂
21:45 fdobridge_: <k​arolherbst🐧🦀> 😄
21:47 fdobridge_: <k​arolherbst🐧🦀> ohh wait...
21:47 fdobridge_: <k​arolherbst🐧🦀> same is true for some `fp16` ops
21:47 fdobridge_: <p​avlo_it_115> how will the problem with the hardware video accelerator be solved (vaapi, vdpau)? Will there be some kind of translator through the vulkan api for this? (edited)
21:47 fdobridge_: <k​arolherbst🐧🦀> so I guess `HFMA`
21:48 fdobridge_: <k​arolherbst🐧🦀> ehh
21:48 fdobridge_: <k​arolherbst🐧🦀> `HFMA2`
21:48 fdobridge_: <k​arolherbst🐧🦀> and potentially `HMNMX2`?
21:48 fdobridge_: <k​arolherbst🐧🦀> I have no idea if that's a concern on volta+
21:50 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand however... as we don't really ship GPU binaries, we could also at runtime detect what the GPU has and just choice one option :ferrisUpsideDown:
21:50 fdobridge_: <k​arolherbst🐧🦀> *choose
21:50 fdobridge_: <g​fxstrand> Yeah
21:50 fdobridge_: <k​arolherbst🐧🦀> but that might be error prone
21:50 fdobridge_: <k​arolherbst🐧🦀> not sure if we even have a uapi bit for this and if the kernel knows
21:53 fdobridge_: <g​fxstrand> I suppose `Option<bool>` would work...
21:53 fdobridge_: <g​fxstrand> Seems a little funky...
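A rough sketch of the `Option<bool>` idea, following the advice above to "set the wait as if it's fixed, and set barrier as if it's variable" when the answer is "maybe". Everything here is an assumption for illustration: the `Op` subset, the `Deps` struct, `plan_deps`, and the `ASSUMED_FIXED_DELAY` value are hypothetical and are not NAK's real types or real latencies.

```rust
/// Hypothetical opcode subset, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    IAdd,
    Ld,
    DFma,
    DMul,
    DAdd,
}

/// Some(true)  = fixed latency everywhere,
/// Some(false) = variable latency everywhere,
/// None        = depends on the GPU (the "maybe" case from the chat).
fn has_fixed_latency(op: Op) -> Option<bool> {
    match op {
        Op::IAdd => Some(true),
        Op::Ld => Some(false),
        Op::DFma | Op::DMul | Op::DAdd => None,
    }
}

/// What the dependency pass would attach to an instruction.
#[derive(Debug, PartialEq)]
struct Deps {
    delay_cycles: Option<u8>,
    needs_barrier: bool,
}

/// Placeholder value; the real latency is unknown in the discussion.
const ASSUMED_FIXED_DELAY: u8 = 6;

fn plan_deps(op: Op) -> Deps {
    match has_fixed_latency(op) {
        Some(true) => Deps { delay_cycles: Some(ASSUMED_FIXED_DELAY), needs_barrier: false },
        Some(false) => Deps { delay_cycles: None, needs_barrier: true },
        // "Maybe": cover both cases, since GPUs with fixed-latency FP64
        // are said to ignore the barrier.
        None => Deps { delay_cycles: Some(ASSUMED_FIXED_DELAY), needs_barrier: true },
    }
}
```

In this sketch `plan_deps(Op::DFma)` requests both a delay and a barrier, which costs a little on either kind of GPU but avoids needing to know at compile time which one is present.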
22:03 fdobridge_: <g​fxstrand> What about `DSETP`?
22:04 fdobridge_: <g​fxstrand> How would that work?!? If they don't touch the rd/wr scoreboard but then something waits, it'll hang the GPU.
22:06 fdobridge_: <k​arolherbst🐧🦀> why?
22:07 fdobridge_: <g​fxstrand> Because if you wait on a read or write that was never signaled, you hang
22:07 fdobridge_: <k​arolherbst🐧🦀> that's because you don't know how it works internally :ferrisUpsideDown:
22:07 fdobridge_: <k​arolherbst🐧🦀> it shouldn't matter in practice, unless you hit such a case
22:08 fdobridge_: <k​arolherbst🐧🦀> the barrier is just a inc of the value and once it becomes 0 the barrier is signaled or something
22:08 fdobridge_: <k​arolherbst🐧🦀> but if it stays 0, nothing bad should happen
22:09 fdobridge_: <k​arolherbst🐧🦀> `DEPBAR` e.g. allows you to wait on a specific value of that barrier
22:09 fdobridge_: <g​fxstrand> Hrm...
22:09 fdobridge_: <g​fxstrand> That's not what I've observed
22:09 fdobridge_: <k​arolherbst🐧🦀> and is a "less or" compare
22:09 fdobridge_: <k​arolherbst🐧🦀> mhh
22:09 fdobridge_: <k​arolherbst🐧🦀> *less or equal
22:10 fdobridge_: <g​fxstrand> I've seen issues with my deps pass where it waits on something but fails to set it. Insta-hang every time.
22:10 fdobridge_: <k​arolherbst🐧🦀> interesting
22:10 fdobridge_: <k​arolherbst🐧🦀> but anyway.. in theory instructions "signalling" them just up them by 1 and once the hardware is done, it gets decreased by one. `DEPBAR` can be used to wait until a specific value besides 0 is reached
22:11 fdobridge_: <k​arolherbst🐧🦀> `wt` on the instructions are equivalent to `DEPBAR` on that barrier with a wait cnt of 0
22:11 fdobridge_: <g​fxstrand> Hrm...
22:11 fdobridge_: <k​arolherbst🐧🦀> at least that's how it should work in theory
22:11 fdobridge_: <k​arolherbst🐧🦀> the hw might disagree 😄
22:11 fdobridge_: <g​fxstrand> Yeah...
22:12 fdobridge_: <g​fxstrand> Maybe something else was going on in those cases? 🤷🏻‍♀️
22:12 fdobridge_: <k​arolherbst🐧🦀> maybe
22:13 fdobridge_: <k​arolherbst🐧🦀> but the drain token (max wait, all wire/read, all wait) would also be more complex to implement if it's not done like described
22:13 fdobridge_: <k​arolherbst🐧🦀> *write
22:13 fdobridge_: <k​arolherbst🐧🦀> like the hardware just ignoring the bits and moving on
22:14 fdobridge_: <k​arolherbst🐧🦀> but again, that's literally how `DEPBAR` works
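A toy model of the counter behaviour described above, purely as a sketch of the theory: the `Scoreboard` type and its methods are hypothetical, and real hardware may behave differently (hangs were reported above when waiting on a barrier that was never set).

```rust
/// Toy model of a single scoreboard barrier: a counter that goes up when
/// an instruction "signals" the barrier and down when the hardware is done.
#[derive(Default)]
struct Scoreboard {
    pending: u32,
}

impl Scoreboard {
    /// An instruction that sets this barrier was issued.
    fn issue(&mut self) {
        self.pending += 1;
    }

    /// The hardware finished one of the outstanding instructions.
    fn complete(&mut self) {
        self.pending = self.pending.saturating_sub(1);
    }

    /// DEPBAR-style check: proceed once the counter is <= `max_pending`
    /// (the "less or equal" compare from the chat).
    fn depbar_ready(&self, max_pending: u32) -> bool {
        self.pending <= max_pending
    }

    /// Wait mask on an instruction: same as DEPBAR with a count of 0.
    fn wait_ready(&self) -> bool {
        self.depbar_ready(0)
    }
}
```

In this model a barrier that was never signalled stays at 0, so `wait_ready()` is immediately true and nothing blocks. After two `issue()` calls and one `complete()`, `depbar_ready(1)` is true while `wait_ready()` is still false.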
22:15 fdobridge_:<g​fxstrand> really needs those docs...
22:15 fdobridge_: <k​arolherbst🐧🦀> I wonder if you can deadlock it...
22:16 fdobridge_: <d​adschoorse> ofc amd would return NaN for 0*Inf too
22:21 fdobridge_: <d​adschoorse> fix_inv_result should make 0*Inf in the Newton-Raphson irrelevant
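A sketch of how a final fixup can make the NaN from the Newton step irrelevant: the refined value is computed unconditionally, then replaced based on the input alone. This is an assumption about the general idea, not Mesa's actual `fix_inv_result`; `rcp_with_fixup` and `newton_rcp_step` are hypothetical names.

```rust
/// One Newton-Raphson step for y ≈ 1/a (hypothetical helper).
fn newton_rcp_step(a: f64, y: f64) -> f64 {
    let e = (-a).mul_add(y, 1.0);
    y.mul_add(e, y)
}

/// Refine a seed, then override the result for special inputs so that
/// whatever the Newton step produced for them (e.g. NaN) is discarded.
fn rcp_with_fixup(a: f64, seed: f64) -> f64 {
    let refined = newton_rcp_step(a, seed);
    if a == 0.0 {
        // 1/±0 is an infinity carrying the sign of the input.
        f64::INFINITY.copysign(a)
    } else if a.is_infinite() {
        // 1/±inf is a zero carrying the sign of the input.
        0.0_f64.copysign(a)
    } else {
        // Ordinary inputs (and NaN, which stays NaN) keep the refined value.
        refined
    }
}
```

Here `rcp_with_fixup(0.0, f64::INFINITY)` returns positive infinity even though the intermediate Newton step yields NaN.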
22:23 fdobridge_: <g​fxstrand> Oh... right.
22:57 fdobridge_: <d​adschoorse> get_signed_inf looks broken with some denorm inputs if denorms are flushed, but that's probably not your issue I guess