IRC Logs of #nouveau on irc.freenode.net for 2024-01-29

02:12 fdobridge_: <airlied> https://lore.kernel.org/dri-devel/20240129015053.1687418-1-airlied@gmail.com/T/#u
02:12 fdobridge_: <airlied> @gfxstrand and anyone who has the i915 prime deadlock, that patch on top of 6.8-rc2 might fix sync problems a bit better
02:14 fdobridge_: <Sid> i'll try it out
02:21 fdobridge_: <Sid> compiling
02:23 fdobridge_: <gfxstrand> Does that go on top of the other patch? Replace it?
02:24 fdobridge_: <airlied> replaces the rip it all out patch
02:25 fdobridge_: <airlied> and if you aren't on rc2 you have to revert the previous nouveau_fence.c patch
02:26 fdobridge_: <airlied> so if you have eacabb5462717a52fccbbbba458365a4f5e61f35 and it isn't reverted, you have to revert it first
02:27 fdobridge_: <gfxstrand> Okay.
02:27 fdobridge_: <Sid> I applied it on top of the rc2 tag
02:28 fdobridge_: <gfxstrand> I can give it a try at some point. I'm more interested in this fault at the moment, though.
02:33 fdobridge_: <airlied> yeah I gotta fix last weeks problem before I can get to this weeks
02:34 fdobridge_: <airlied> I'd also like to know if this has any possible effects on that, I'd stop testing on the last patch I gave you or on the previous deadlock fix, since they are both wrong
02:35 fdobridge_: <gfxstrand> Okay. I'll give it a try then
02:37 fdobridge_: <airlied> I'll get my ada updated and see if I can reproduce your other thing
02:43 fdobridge_: <gfxstrand> Building now...
03:00 fdobridge_: <gfxstrand> Doesn't fix the fault
03:15 fdobridge_: <Sid> no system freezes on a prime rendering setup (edited)
03:16 fdobridge_: <Sid> oh shoot I spoke too soon e-e
03:16 fdobridge_: <Sid> freeze as soon as I alt-tabbed
03:17 fdobridge_: <Sid> yup
03:18 fdobridge_: <airlied> okay using repro.txt isn't faulting for me here so far
03:18 fdobridge_: <Sid> patch doesn't fix the i915/prime sync issues
03:18 fdobridge_: <Sid> for me at least
03:18 fdobridge_: <airlied> okay guess I better drag my prime machine back out
03:18 fdobridge_: <Sid> I'm not even able to switch to a different tty 😅
03:20 fdobridge_: <Sid> ...now I don't know how much of it could be attributed to me using plasma 6.0-rc1
03:24 fdobridge_: <gfxstrand> If it really is some sort of overflow somewhere, it's probably pretty chip-specific.
03:28 fdobridge_: <Sid> wait
03:28 fdobridge_: <Sid> one reboot (and a changed kernel param) later I don't have the sync issue
03:28 fdobridge_: <Sid> I was using modprobe.blacklist=nvidia to block it, but I still saw a fair bit of NVRM output in my dmesg last boot
03:29 fdobridge_: <Sid> this boot, things seem to be ok
03:29 fdobridge_: <airlied> I just get 100/218533 (0.0%)
03:29 fdobridge_: <airlied> @gfxstrand same numbers for you?
03:30 fdobridge_: <Sid> I'm now using module_blacklist=nvidia
03:31 fdobridge_: <Sid> I even managed to trigger an Xid 13 without a system freeze...
03:36 fdobridge_: <Sid> yeah, can confirm, I'm unable to get the system to freeze again
03:36 fdobridge_: <Sid> it was likely whatever was triggering those NVRM logs that was messing with things
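The two parameters Sid mentions behave differently: `module_blacklist=` is enforced by the kernel's module loader, while `modprobe.blacklist=` is read by modprobe and only suppresses automatic, alias-based loading, which would be consistent with NVRM output still showing up in dmesg. A minimal sketch for checking which of the two a boot actually used; the function name `blacklisted_modules` and the sample command line are illustrative assumptions, not something from the log:

```python
def blacklisted_modules(cmdline):
    """Collect module names from the two blacklist-style kernel parameters.

    `cmdline` is the kernel command line as a single string, for example
    the contents of /proc/cmdline.
    """
    result = {"module_blacklist": set(), "modprobe.blacklist": set()}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in result:
            # Both parameters take a comma-separated list of module names.
            result[key].update(name for name in value.split(",") if name)
    return result


# Hypothetical command line, for illustration only.
sample = "root=/dev/sda2 quiet module_blacklist=nvidia nouveau.runpm=0"
found = blacklisted_modules(sample)
print(found["module_blacklist"])    # modules the kernel refuses to load
print(found["modprobe.blacklist"])  # modules modprobe will not auto-load
```

On a running system the input would come from `open("/proc/cmdline").read()`.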
03:47 fdobridge_: <airlied> @gfxstrand so you've seen this on all gpus, but just reproduced it easier on the ada?
03:48 fdobridge_: <gfxstrand> Without knowing for sure what "this" is, yes.
03:48 fdobridge_: <gfxstrand> It's got the same repro pattern on Turing and Ampere with the WSI tests enabled. (edited)
03:48 fdobridge_: <gfxstrand> But on Ada I can hit it way faster
03:48 fdobridge_: <airlied> does it happen on ada if you haven't set DISPLAY or WAYLAND_DISPLAY?
03:49 fdobridge_: <gfxstrand> Yeah, no WSI tests are in the repro list
03:49 fdobridge_: <airlied> no I mean if CTS never sees the window system
03:49 fdobridge_: <airlied> I'm not sure there's a difference, but I don't trust CTS 😛 (edited)
03:50 fdobridge_: <airlied> I'm running it inside my wayland session on a turing now in a loop
03:54 fdobridge_: <gfxstrand> Yup! Just ran with `WAYLAND_DISPLAY= DISPLAY= ` and it still blows up.
03:56 fdobridge_: <airlied> can you boot with nouveau.debug=trace ? it might give some info we aren't seeing (it might also just consume all of the log space on the planet)
03:57 fdobridge_: <airlied> probably want to do it without a desktop session running
03:57 fdobridge_: <airlied> since it should in theory provide cleaner logs
03:58 fdobridge_: <gfxstrand> Desktop is running on Intel so it shouldn't spam at all, actually.
04:02 fdobridge_: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1201376834667954206/dmesg?ex=65c9985d&is=65b7235d&hm=d3e39e3a89152b75b768b1d69c9f92bc6ba86ccac1e9d8ea2ca5efe475d2730a&
04:02 fdobridge_: <gfxstrand> Have fun! (edited)
04:04 fdobridge_: <airlied> oh can you boot with nouveau.runpm=0 as well?
04:05 fdobridge_: <gfxstrand> Sure
04:12 fdobridge_: <gfxstrand> That makes my Ada fail go away...
04:13 fdobridge_: <airlied> yay logs worked 😛
04:13 fdobridge_: <airlied> also explains why lots of tests doing nothing causes it
04:13 fdobridge_: <gfxstrand> So, something with not waking the GPU up properly?
04:13 fdobridge_: <gfxstrand> Yup! Just enough time to shut the GPU off
04:14 fdobridge_: <gfxstrand> So, this one's probably not the same as WSI then.
04:14 fdobridge_: <airlied> yeah some race somewhere on gpu resume, probably fence related
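One way to confirm that the GPU really is being runtime-suspended between idle tests is to read the PCI device's runtime PM files in sysfs. A minimal sketch, assuming the standard `power/control` and `power/runtime_status` attributes; the helper name `runtime_pm_state` and the PCI address in the usage line are hypothetical:

```python
from pathlib import Path


def runtime_pm_state(device_dir):
    """Read the runtime PM attributes of one device directory in sysfs.

    Returns a dict with the raw contents of power/control and
    power/runtime_status, or None for an attribute that does not exist.
    """
    power = Path(device_dir) / "power"

    def read(name):
        path = power / name
        return path.read_text().strip() if path.exists() else None

    return {
        "control": read("control"),
        "runtime_status": read("runtime_status"),
    }


if __name__ == "__main__":
    # Hypothetical PCI address; substitute the GPU's own bus address.
    print(runtime_pm_state("/sys/bus/pci/devices/0000:01:00.0"))
```

A `runtime_status` of `suspended` while the test run is still in progress would match the "just enough time to shut the GPU off" theory.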
04:14 fdobridge_: <airlied> so your WSI one is against a i915 wayland session?
04:14 fdobridge_: <airlied> I thought it was all nouveau before, explains why I haven't seen it
04:15 fdobridge_: <airlied> guess I should switch my desktop to intel primary
04:15 fdobridge_: <gfxstrand> It was against nouveau on Turing and Ampere
04:16 fdobridge_: <airlied> check if my kernel patch has any effect on it, I haven't seen any wsi death
04:18 fdobridge_: <gfxstrand> I'm going to kick off a no-WSI Ada run tonight with runpm=0 and a WSI Ampere run.
04:38 fdobridge_: <gfxstrand> Running.
05:38 fdobridge_: <airlied> @karolherbst "Jan 29 15:21:46 mobius kernel: pcieport 0000:00:01.0: retraining failed" is that one of the ones you've seen in the past?
06:00 liquidben: The indexing part is not as interesting, but i still try to cover it, the idea behind subtracting the tag/offset/index or however called is to manipulate the order output, so if permutes do not combine with masks to a limit but fall short let's call them immature, to make them mature you add index to either a value or mask, or use any amount of fields like tag offset index, this talk does not take long at a conference 15 minutes is enough to
06:00 liquidben: put all important on the table, however investigating the compiler tasks and routines more closely, one could talk for days , but overall index fields are filled in the runtime, and holed at compiled time.
06:52 fdobridge_: <airlied> uggh, my ampere laptop isn't coming out of runpm, will check my turing
07:41 fdobridge_: <!DodoNVK (she) 🇱🇹> You're already advertising Vulkan 1.3 though
07:58 fdobridge_: <airlied> okay I've reproduced the runpm fail on my turing at least, will dig in a bit tomorrow
10:13 fdobridge_: <karolherbst🐧🦀> mhhhh.... I _think_ but I can't remember
10:14 fdobridge_: <karolherbst🐧🦀> @airlied but anything PCI related requires copying whatever nvidia does. They apply tons of bridge workarounds and other things
10:16 fdobridge_: <airlied> Yeah I can't find any for this bridge, but I'll copy your quirk for it tomorrow and test it, then dig into opengpu
10:16 fdobridge_: <airlied> I do see some online complaints from people doing passthrough with same error
10:17 fdobridge_: <karolherbst🐧🦀> ahh
10:17 fdobridge_: <karolherbst🐧🦀> maybe it's new enough so there is no workaround yet 🥲
13:12 fdobridge_: <zmike.> is it a known issue that devices using nouveau can't suspend?
13:12 fdobridge_: <zmike.> I'm on kernel 6.6
13:18 fdobridge_: <!DodoNVK (she) 🇱🇹> I got D3cold after unplugging my external monitor so 🤷‍♀️
13:30 fdobridge_: <zmike.> okay it was a kernel bug...updated to latest 6.6 and now it works \o/
14:11 fdobridge_: <gfxstrand> Yeah, but there's these kernel problems that are preventing me from actually doing a formal submission.
14:33 fdobridge_: <gfxstrand> Yeah, no dice.
14:51 fdobridge_: <gfxstrand> Ada x86 completed okay. i686 died but it looks like a flake so I'm running i686 again.
14:55 fdobridge_: <gfxstrand> I don't know how?!? I can hit them pretty reliably (though coming up with a minimal reproducer has been difficult).
14:56 fdobridge_: <gfxstrand> I do have a theory but I've gotta think some about how I want to test it.
14:58 fdobridge_: <karolherbst🐧🦀> oh no... I found something nvidia removed in Ampere from the ISA 🥲
14:58 fdobridge_: <karolherbst🐧🦀> fp16 instructions don't support fp32 sources anymore
14:59 fdobridge_: <gfxstrand> That's fine
14:59 fdobridge_: <gfxstrand> Mixed data types is semantic hell inside the compiler anyway.
15:00 fdobridge_: <karolherbst🐧🦀> I hope that the source mods still have the same values 😄
15:05 fdobridge_: <marysaka> still the same values :aki_thonk:
15:05 fdobridge_: <marysaka> (at least for nvdisasm might be wrong)
15:08 fdobridge_: <marysaka> It decodes as invalid on SM90+
15:32 fdobridge_: <dadschoorse> how did fp16 instruction with fp32 sources work?
15:37 fdobridge_: <karolherbst🐧🦀> truncated to fp16 and then used for both lanes
15:40 fdobridge_: <dadschoorse> cool, so like the inverse of amd's v_fma_mix which can use fp16 dst/sources but the math is fp32
15:41 fdobridge_: <karolherbst🐧🦀> well.. those fp16 ops can also produce a fp32 result
15:41 fdobridge_: <karolherbst🐧🦀> then it just uses the first lane
17:58 fdobridge_: <gfxstrand> Yes, but what does "truncated" mean? What rounding mode?
17:58 fdobridge_: <gfxstrand> At least with 32-bit results, there's no rounding so the type conversion is well-defined.
18:05 fdobridge_: <karolherbst🐧🦀> round to 0
18:07 fdobridge_: <gfxstrand> So, not what you want most of the time. 😢
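To illustrate why round-toward-zero is "not what you want most of the time", the sketch below compares it with round-to-nearest-even when narrowing a value to fp16. It only demonstrates the rounding modes on Python floats and makes no claim about how the hardware implements the conversion; the function names are hypothetical, and out-of-range inputs are not handled (Python's `struct` raises `OverflowError` for them).

```python
import struct


def to_f16_bits_nearest(x):
    """Narrow x to fp16 with round-to-nearest-even and return the raw 16 bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def f16_bits_to_float(bits):
    """Interpret 16 raw bits as an fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_f16_nearest(x):
    return f16_bits_to_float(to_f16_bits_nearest(x))


def to_f16_toward_zero(x):
    """Narrow x to fp16, rounding toward zero (truncation)."""
    bits = to_f16_bits_nearest(x)
    if abs(f16_bits_to_float(bits)) > abs(x):
        # Nearest rounding moved away from zero: step back one ULP.
        # Decrementing the bit pattern shrinks the magnitude and leaves
        # the sign bit untouched.
        bits -= 1
    return f16_bits_to_float(bits)


# fp16 has a 10-bit mantissa, so the spacing just above 1.0 is 2**-10.
x = 1.0 + 0.75 * 2**-10
print(to_f16_nearest(x))      # closest representable fp16 value
print(to_f16_toward_zero(x))  # always the one with smaller magnitude
```

Truncation always errs in the same direction, so the error can approach a full ULP instead of being bounded by half of one.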
18:09 fdobridge_: <gfxstrand> I'm going to see if `nouveau.runpm=0` affects this WSI issue
18:09 fdobridge_: <gfxstrand> It's got a similar shape where a bunch of tests get skipped
18:09 fdobridge_: <gfxstrand> But I wouldn't think so because why would it be shutting down a PCI card?
18:30 fdobridge_: <redsheep> Is it that it's shutting off the card, or that it's just shutting off portions of the chip?
18:31 fdobridge_: <gfxstrand> I have no idea!
18:31 fdobridge_: <airlied> Some motherboards may support it, but I'd be mildly surprised
18:31 fdobridge_: <airlied> Is it also an mmu fault?
18:31 fdobridge_: <redsheep> Unused parts of the chip getting power/clock gated would sound normal, and it doesn't seem weird to me for things to break if it's used before it manages to wake all the way up again
18:32 fdobridge_: <gfxstrand> Yup. And there's also a bunch of skipped tests right before if that matters.
18:32 fdobridge_: <gfxstrand> We'll know in another 15-20min
18:34 fdobridge_: <redsheep> It might be a hack, but would it be viable to add code that tries to intentionally wake things back up before it's actually used? I suppose it's probably hard to know when it's likely gone to sleep
18:38 fdobridge_: <airlied> runpm is all the way off
18:39 fdobridge_: <airlied> The whole PCI device
18:44 fdobridge_: <redsheep> Wasn't this one of the issues related to prime? Without multiple gpus it would not be possible to have the entire card shut off given you need display out to remain active, right?
18:45 fdobridge_: <gfxstrand> No dice. 😢
18:46 fdobridge_: <airlied> Okay so need to figure out exactly what you are running again and see can I reproduce, I've done complete deqp runs and ones with wsi and sync tests with no problems
18:46 fdobridge_: <gfxstrand> vulkan-cts-1.3.7.3
18:47 fdobridge_: <gfxstrand> nvk/1.3-conformance
18:47 fdobridge_: <gfxstrand> `./deqp-vk --deqp-caselist-file="${MUSTPASS_DIR}/vk-default.txt" --deqp-fraction-mandatory-caselist-file="${MUSTPASS_DIR}/vk-fraction-mandatory-tests.txt" --deqp-log-images=disable --deqp-log-shader-sources=disable --deqp-fraction=${shard_1},${SHARDS} --deqp-log-filename="${REPORT_DIR}/TestResults-${arch}-${shard}-of-${SHARDS}.qpa"`
18:47 fdobridge_: <gfxstrand> It always fails in shard 1
18:48 fdobridge_: <gfxstrand> Which, to be clear, is `--deqp-fraction=0,4`
18:48 fdobridge_: <gfxstrand> File names are 1-indexed, fractions are 0-indexed. 🙄
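The off-by-one between the two numbering schemes is easy to get wrong when scripting a run, so a minimal sketch of the mapping may help. The helper names are hypothetical; the option and file-name formats are taken from the command line quoted above.

```python
def deqp_fraction_arg(shard, shards):
    """Build the --deqp-fraction option for a 1-indexed shard number.

    Shards are numbered 1..shards in file names, while the first value
    passed to --deqp-fraction is 0-indexed.
    """
    if not 1 <= shard <= shards:
        raise ValueError(f"shard must be in 1..{shards}, got {shard}")
    return f"--deqp-fraction={shard - 1},{shards}"


def results_filename(arch, shard, shards):
    """Build the 1-indexed results file name used in the command above."""
    return f"TestResults-{arch}-{shard}-of-{shards}.qpa"


# Shard 1 of 4, the one reported as always failing.
print(deqp_fraction_arg(1, 4))
print(results_filename("x86_64", 1, 4))  # "x86_64" is a placeholder arch
```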
18:49 fdobridge_: <airlied> Okay I'll throw it at a few machines later and see if anything happens
18:51 fdobridge_: <gfxstrand> Oh, and you'll need this patch on top of `vulkan-cts-1.3.7.3`
18:51 fdobridge_: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1201600452312186900/0001-Re-bind-state-after-executing-secondary-command-buff.patch?ex=65ca68a0&is=65b7f3a0&hm=f961f1a2563115dbb460d13a1215813794bd417652db3cf65df6e45e37bfad9e&
18:51 fdobridge_: <airlied> See if on both i915 and nouveau primary GPUs?
18:51 fdobridge_: <gfxstrand> I'm reproducing it with nouveau as the primary
18:52 fdobridge_: <gfxstrand> Running GNOME
18:52 fdobridge_: <gfxstrand> F35
18:52 fdobridge_: <gfxstrand> Updated today
18:52 fdobridge_: <gfxstrand> deqp-vk is running in a screen session over SSH so it's not spamming the compositor
18:53 fdobridge_: <!DodoNVK (she) 🇱🇹> Why isn't this needed for other drivers?
18:53 fdobridge_: <gfxstrand> IDK. Maybe it is.
18:54 fdobridge_: <gfxstrand> It wasn't needed for us until I started force resetting some state
18:54 fdobridge_: <gfxstrand> Not that it was correct before. Just that it worked by accident.
19:20 fdobridge_: <airlied> F38?
19:21 fdobridge_: <airlied> Also it reproduces on 6.8-rc2 with sync patch?
19:38 fdobridge_: <!DodoNVK (she) 🇱🇹> Why could NVK be causing malloc() issues in Vita3K emulator? :triangle_nvk:
19:39 fdobridge_: <rhed0x> wat
19:44 fdobridge_: <karolherbst🐧🦀> memory corruptions?
19:44 fdobridge_: <!DodoNVK (she) 🇱🇹> RADV works in the same exact case but NVK weirdly gets a malloc() error (I'm not sure if I can test lavapipe)
19:46 fdobridge_: <!DodoNVK (she) 🇱🇹> I tested the game with both NAK and codegen (even ripping out the NAK code) and I still get the same error (so it must be inside the main NVK code if it exists)
19:47 fdobridge_: <pixelcluster> address sanitizer time
19:48 fdobridge_: <!DodoNVK (she) 🇱🇹> Here's the gdb backtrace (I had to recompile the emulator with debug symbols which was a bit of a pain)
19:48 fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1201614928365244559/message.txt?ex=65ca761b&is=65b8011b&hm=339de212e31912bffe239de1fb73001aa92ebb91778076b32a7c49535a157dc9&
19:49 fdobridge_: <nanokatze> dw that pain wasn't in vain because you're about to recompile that emulator with -fsanitize=address
20:42 fdobridge_: <gfxstrand> Yes to both
20:50 fdobridge_: <!DodoNVK (she) 🇱🇹> I tried LD_PRELOADing it (I got some weird ASAN errors that didn't close the emulator and then it aborted at the segfault you can see above)
20:51 fdobridge_: <karolherbst🐧🦀> ohh, looks like they have their own allocator
20:57 fdobridge_: <!DodoNVK (she) 🇱🇹> Here's what trace I got with -fsanitize=address enabled in the emulator instead
20:57 fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1201632341219618816/message.txt?ex=65ca8653&is=65b81153&hm=9c7e90fd722a5478be75cac35c527a25cad3f207d7ea06b42cbf2deab282dd2e&
20:58 fdobridge_: <karolherbst🐧🦀> it's more important what asan prints
20:58 fdobridge_: <karolherbst🐧🦀> if at all
20:59 fdobridge_: <!DodoNVK (she) 🇱🇹> (it didn't print anything this time)
20:59 fdobridge_: <!DodoNVK (she) 🇱🇹> Let's enable ASAN on NVK too
21:01 fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1201633264402698321/hacky-nak-disable.diff?ex=65ca872f&is=65b8122f&hm=ba2be5107a6aedd072ffabf98ba680acd7b3b5c7126eea77ac3f34b16b920fbd&
21:07 fdobridge_: <!DodoNVK (she) 🇱🇹> I get no change
21:08 fdobridge_: <!DodoNVK (she) 🇱🇹> If I preload libasan.so now (after enabling it for both NVK and Vita3K) I get this instead: `Your application is linked against incompatible ASan runtimes.`
21:19 fdobridge_: <karolherbst🐧🦀> need to `LD_PRELOAD` it
21:22 fdobridge_: <!DodoNVK (she) 🇱🇹> I did that above
21:23 fdobridge_: <karolherbst🐧🦀> yeah, but before that nvk wasn't compiled with asan
21:23 fdobridge_: <karolherbst🐧🦀> but anyway, it can also be that the application is simply buggy
21:27 fdobridge_: <!DodoNVK (she) 🇱🇹> I compiled it with ASAN and `LD_PRELOAD`ed ASAN and I get no ASAN output (at least inside gdb)
21:27 fdobridge_: <karolherbst🐧🦀> yeah.. then it's probably just a bug in the application
21:32 fdobridge_: <!DodoNVK (she) 🇱🇹> I do get some ASAN output outside gdb though 🤔
21:33 fdobridge_: <karolherbst🐧🦀> what's the asan output?
21:43 fdobridge_: <!DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1201643843565654216/message.txt?ex=65ca9109&is=65b81c09&hm=de4256b473f5096c8cbc9fc389d2c9d48e10551281f083abfc4797d29edd7366&
21:44 fdobridge_: <karolherbst🐧🦀> ah yeah.. those are just crashes
21:44 fdobridge_: <karolherbst🐧🦀> or well..
21:44 fdobridge_: <karolherbst🐧🦀> maybe emulator doing emulator things
21:46 fdobridge_: <!DodoNVK (she) 🇱🇹> That's what I was thinking
22:08 fdobridge_: <gfxstrand> This is getting old...
22:08 fdobridge_: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1201649986715271218/image.png?ex=65ca96c2&is=65b821c2&hm=7361c2ea78452c579b0829faaeadb85e299f63445d02fb636e81ad51190c8b42&
22:08 fdobridge_: <gfxstrand> That's running with 2 shards instead of 4
22:08 fdobridge_: <gfxstrand> Let's try 8
23:12 fdobridge_: <airlied> so are the wsi tests even connecting to a display in those runs?
23:26 fdobridge_: <gfxstrand> yes
23:26 fdobridge_: <gfxstrand> Well, a pikvm
23:26 fdobridge_: <gfxstrand> but it acts like a display
23:27 fdobridge_: <gfxstrand> I've found another sequence that blows up, though, so I'm going to start poking from this new angle.
23:27 fdobridge_: <gfxstrand> If I run shard 1/8, it fails a bunch of workgroup memory tests.
23:27 fdobridge_: <gfxstrand> I don't think WSI is involved this time