00:00karolherbst[d]: like I even check the alignment at runtime before the load
00:00karolherbst[d]: I even check if the 0xfff low bits are zero and even if they are it's an unaligned load fault
00:01karolherbst[d]: maybe it's some weird encoding fuckery...
00:03karolherbst[d]: anyway.. the offset has 0x8 alignment and that's guaranteed through shifts
00:03karolherbst[d]: it's also weird that `fault_addr:0000003ffded0005`
00:03karolherbst[d]: I don't even think it's my matrix
00:04karolherbst[d]: mhhhh
00:07karolherbst[d]: mhhhhh
00:07karolherbst[d]: somehow it seems that for one thread the address does appear to be `0x0000003ffded0005` 🥲
00:09karolherbst[d]: and for a single thread it's `0x000003ffdec0005` on the store side
00:11airlied: does commenting out the stores stop it?
00:11karolherbst[d]: yeah
00:12karolherbst[d]: but it's weird...
00:13karolherbst[d]: like those are the addresses I'm seeing: https://gist.githubusercontent.com/karolherbst/bc0940058bdf033374c8af2499eb0388/raw/be75b34f54d42983bb25d284db89b2b553571c35/gistfile1.txt
00:13karolherbst[d]: also for store, but I forgot to update the print
00:15karolherbst[d]: like I'm sure that my math is correct and the matrix row/col has to be aligned by enough
00:28karolherbst[d]: okay.... now that's weird
00:28karolherbst[d]: I ignore the load on `0x0000003ffded0000` and ignore the store on `0x000003ffdec0000` and now it doesn't fault anymore
00:28karolherbst[d]: even passes the test 🙃
00:28karolherbst[d]: I wonder....
00:29karolherbst[d]: are there rogue threads doing nasty things?!?
00:34karolherbst[d]: ohhhhhhhh
00:34karolherbst[d]: AHHHHH
00:34karolherbst[d]: man
00:34karolherbst[d]: it was the passing test that caused the fault
00:34karolherbst[d]: and I was looking at the wrong one
00:34karolherbst[d]: but like it passes
00:34karolherbst[d]: and the next submission runs into a fault
00:35mhenning[d]: Which test?
00:36karolherbst[d]: `dEQP-VK.spirv_assembly.instruction.compute.untyped_pointers.vulkan_memory_model.cooperative_matrix.mixed.load.a.row_major.uint8`
00:36karolherbst[d]: that also passes if I just don't do any loads or stores 🙃
00:37karolherbst[d]: anyway..
00:37karolherbst[d]: `push_sync` causes it to fail
00:37karolherbst[d]: given that that one operates on int8 values, an address of `0000003ffded0005` does actually make sense 😄
00:43karolherbst[d]: let's see...
00:45karolherbst[d]: ohhh... mhh
00:45karolherbst[d]: yeah.. I think that test could execute unaligned stores...
00:45karolherbst[d]: sysval 33 gets added to the base
00:46karolherbst[d]: which is TID_X
00:46karolherbst[d]: mhhh
00:46karolherbst[d]: yeah...
00:47karolherbst[d]: but mhh
00:47karolherbst[d]: that shouldn't be possible or rather it's weird the code is like this
00:48karolherbst[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1437604876652908594/mma-8816-A-i8.png?ex=6913d960&is=691287e0&hm=3e626f0160971364ce9d1ac5682bd3577168d52f96a92a2bde6c79d4a27a3446&
00:48karolherbst[d]: because the matrix looks like this:
00:49karolherbst[d]: maybe some of the offset code is wrong but in a way it never showed up as buggy
00:50karolherbst[d]: ohhhh wait....
00:50karolherbst[d]: is it a bug in the test...
00:51karolherbst[d]: I don't believe this...
00:51karolherbst[d]: it's a bug in the test 🙃
00:51karolherbst[d]: they offset the ssbo by the global invocation id
00:52karolherbst[d]: which... you know isn't aligned to the matrix row/col or 16B anymore
00:52karolherbst[d]: man....
00:53tdaven[d]: at least the universe makes sense now
00:53karolherbst[d]: I hate the universe
00:53tdaven[d]: shhh..it might hear you
00:56karolherbst[d]: sooo.. maybe I need a different pair of eyes
00:56karolherbst[d]: does https://gist.githubusercontent.com/karolherbst/3c16b8cdef1a130718aa867ed26311d1/raw/0d7bdbe836d4c38c654ee73b7dc5fbf03eea1499/gistfile1.txt
00:56karolherbst[d]: violate `VUID-RuntimeSpirv-OpCooperativeMatrixLoadKHR-08986
00:56karolherbst[d]: For OpCooperativeMatrixLoadKHR and OpCooperativeMatrixStoreKHR instructions, the Pointer and Stride operands must be aligned to at least the lesser of 16 bytes or the natural alignment of a row or column (depending on ColumnMajor) of the matrix (where the natural alignment is the number of columns/rows multiplied by the component size)`
00:59karolherbst[d]: it's a super weird test anyway
01:15karolherbst[d]: anyway.. I think my MR is fine, just causing more regressions, because...
01:36mhenning[d]: karolherbst[d]: what are the values of %rows and %cols ?
01:36mhenning[d]: I ask because they're OpSpecConstant
01:40karolherbst[d]: at least 8 and 8
01:41karolherbst[d]: so the minimum alignment should be 8B
01:41karolherbst[d]: but like it indexes an int8 array with the global invocation id, which means an alignment of 0x1
01:42karolherbst[d]: like the load is fine, the store isn't
01:43karolherbst[d]: mhh maybe the load is also not fine...
01:48karolherbst[d]: but also could be that I'm missing something here
01:48karolherbst[d]: anyway.. that's for tomorrow to file a bug...
01:51mhenning[d]: Yeah, I think that's right
01:51mhenning[d]: I think both the load and store violate that VUID
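[editor's note] The alignment argument above can be sketched in a few lines. This is only an illustration of the rule quoted from the VUID, not code from the driver or the test: the helper names, the subgroup size of 32, and the model of the pointer as "base plus global invocation id" into an int8 array are assumptions taken from the discussion; the base address is the one seen in the log.

```python
# Sketch of VUID-RuntimeSpirv-OpCooperativeMatrixLoadKHR-08986 as quoted above.
# All names here are hypothetical helpers, not part of any real API.

def required_alignment(elems_per_line: int, component_size: int) -> int:
    """Lesser of 16 bytes or the natural alignment of one row/column."""
    return min(16, elems_per_line * component_size)

def is_aligned(addr: int, alignment: int) -> bool:
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0

BASE = 0x0000003FFDED0000   # load address from the log, low bits zero
COLS = 8                    # "at least 8 and 8" per the discussion
COMPONENT_SIZE = 1          # int8/uint8 components
SUBGROUP_SIZE = 32          # assumption, not stated in the log

align = required_alignment(COLS, COMPONENT_SIZE)  # 8 bytes for an 8-wide int8 line

# Model the test's pointer: an int8 array indexed by the global invocation id.
misaligned = [gid for gid in range(SUBGROUP_SIZE)
              if not is_aligned(BASE + gid, align)]

print(f"required alignment: {align} bytes")
print(f"misaligned invocations: {len(misaligned)} of {SUBGROUP_SIZE}")
```

Under these assumptions only invocation ids that are multiples of 8 produce a pointer that satisfies the rule, which is consistent with the faulting address ending in `0005`.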
01:52karolherbst[d]: it's kinda weird that the base address isn't uniform within a subgroup...
01:52karolherbst[d]: like it already starts at that
01:53karolherbst[d]: `All the operands to this instruction must be dynamically uniform within every instance of the Scope of the cooperative matrix. ` 😄
01:53karolherbst[d]: ah yes
01:58karolherbst[d]: so it's not just the VUID, but also the base address isn't dynamically uniform, and the scope is subgroup
01:58karolherbst[d]: oh well.. a problem for tomorrow me
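[editor's note] The uniformity point can be illustrated the same way. This is a simplified model, assuming "dynamically uniform within the subgroup" just means every invocation in the subgroup passes the same value; the helper name and the subgroup size of 32 are hypothetical, and the base address is the one from the log.

```python
# Simplified model of the "dynamically uniform" requirement quoted above.
# is_uniform is a hypothetical helper, not part of any real API.

def is_uniform(values) -> bool:
    """True if every invocation supplies the same operand value."""
    return len(set(values)) <= 1

SUBGROUP_SIZE = 32          # assumption, not stated in the log
BASE = 0x0000003FFDED0000   # load address from the log

# What the test appears to do: offset the pointer by the invocation id.
per_invocation = [BASE + gid for gid in range(SUBGROUP_SIZE)]

# What the requirement expects: one pointer shared by the whole subgroup.
shared = [BASE] * SUBGROUP_SIZE

print("per-invocation pointer uniform:", is_uniform(per_invocation))
print("shared pointer uniform:", is_uniform(shared))
```

In this model the per-invocation pointer fails the check for any subgroup with more than one invocation, independent of the alignment question.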
07:12phomes_[d]: mhenning[d]: the screenshot in the qt issue you filed looks kind of similar to the CS issue we had earlier
07:12phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1437701613891485757/Screenshot_From_2025-11-11_08-10-34.png?ex=69143377&is=6912e1f7&hm=3b635469cc38b582f69ec91ea2b6db7536cd1c84d994e0755792e1941cfdb26e&
07:12phomes_[d]: here they are side by side:
17:36mhenning[d]: I'm pretty certain they have different underlying issues
17:36mhenning[d]: but yes, you can see the blocks along the triangle edges in both cases
18:46_lyude[d]: Is alexander courbet in this channel at all?
19:56notthatclippy[d]: _lyude[d]: gnurou
20:01steel01[d]: He is? Oh cool. Maybe I won't have to bug him over email anymore. 😛
20:16notthatclippy[d]: Doubt he actively reads it, but maybe he'll see the pings.
20:17notthatclippy[d]: I was curious and skimmed the user list. There's at least 15 NVIDIA people I recognize there. Don't resist the assimilation.
20:24steel01[d]: Gnurou used to be pretty active with tegra nouveau stuff back in 2016, 2017 era. He's said that he's pretty busy with nova on the clock now and life in general off the clock. So, he may not read much now.
22:45gnurou[d]: yo 👋
22:47gnurou[d]: yeah sorry I don't actively read, so please ping me if there is anything 🙇
22:47gnurou[d]: also most of my memory about Nouveau somehow got wiped out during the time I wasn't active on it 😬😅
22:49gnurou[d]: and steel01[d] please accept my apologies for not replying to your last email! 🙇 I just noticed I left it hanging in my Inbox. Not that I had useful answers anyway 😓
22:51steel01[d]: gnurou[d]: He does live! 😛 And eh, I got it working. Much easier than I ever expected it to be. But I've got to meme on the multi-week response delays anyways. 😉
22:54gnurou[d]: steel01[d]: Do put on me the Shame I deserve! 😁
22:58steel01[d]: steel01[d]: Oh, since you're here, gnurou[d]. Let me bounce this off you real quick. Do you have any idea why gp10b would do this? It's only with nouveau, swiftshader via tegra-drm without the gpu is fine. And gm20b with nouveau is fine. But I get the quoted issue on gp10b with nouveau. So far, neither me nor anyone else even knows where to start looking.
23:02gnurou[d]: Uh, so lower clock is less broken? Interesting. I have no idea. 😅
23:03steel01[d]: Yeah, that's the weirdest part. So the underflow theory doesn't seem to work. So... confusion.
23:03gnurou[d]: Maybe you get more FPS that way, which starves another resource? (e.g. mem bandwidth?)
23:03steel01[d]: Mmm. I think I've cranked emc clock to check and it didn't make a difference. Any other pieces of the soc that would be involved?
23:06gnurou[d]: Nothing comes to mind atm, unless the DC is another clock that has been overlooked?
23:07gnurou[d]: (DC might have several clocks. I completely forgot the details though)
23:09steel01[d]: I tried to look at the dc clocks and got confused... But it looks like they're all just set to max on mainline anyways.
23:09steel01[d]: But alright, thanks. I'll keep poking around.
23:10steel01[d]: Maybe I should take another look through tegradc and see if there's any bandwidth handling in the nvdisp part.