00:33airlied[d]: skeggsb9778[d]: any more ideas on hopper? just getting back to poking it now
01:16skeggsb9778[d]: where did you get to? was it "just" nonstall not being received?
01:27airlied[d]: 2nd channel submits never kick off
01:27airlied[d]: though I think nonstall is still broken, but I've hacked around that a bit
01:29skeggsb9778[d]: the channels are on different engines?
01:30airlied[d]: don't think so, the kernel test passes, but when userspace creates a channel and tries to exec, nothing happens
01:33airlied[d]: guess either gpfifo or the doorbell is broken
01:33airlied[d]: if I pass a bad vma address to the exec ioctl, I don't see any vm faults
01:34skeggsb9778[d]: the engine tests should catch the doorbell
02:03skeggsb9778[d]: did you rule out aarch64-specific stuff btw?
02:04airlied[d]: no, and there could be some memory mapping or barrier related problems, not sure how best to figure that out
02:41kar1m0[d]: orowith2os[d]: I don't think they care and even if they did I don't care
03:05airlied[d]: skeggsb9778[d]: MEMDESC_FLAGS_MAP_SYSCOH_OVER_BAR1 something we need to do?
03:40gfxstrand[d]: Is `Box<T>` guaranteed to be the size of a pointer?
03:41mhenning[d]: gfxstrand[d]: I think so, yes
03:45gfxstrand[d]: Then I think we can make `SSARef` a union. I just need to think about some details.
04:01airlied[d]: skeggsb9778[d]: hmm that plus bBar1Mapping seem relevant
04:01airlied[d]: if (hClass >= HOPPER_USERMODE_A && pAllocParams != NULL)
04:12skeggsb9778[d]: airlied[d]: hmm, maybe - though i wonder wtf that's all about
04:14skeggsb9778[d]: kfifoConstructUsermodeMemdescs_GH100 also calls the _GV100 version (which is the BAR0 mapping of it)
04:14skeggsb9778[d]: gb20x seems to use the _GH100 version too, though, gb20x does work
04:14skeggsb9778[d]: still, probably worth digging into that
04:15airlied[d]: the gv100 path fills out pRegVF and the gh100 path also fills out pBar1PrivVF and pBar1VF
04:15airlied[d]: then the usrmodeConstruct_IMPL picks one
04:16airlied[d]: looks like on gh100 and newer it picks one of the newer ones for bar1
04:16airlied[d]: just looks like it applies to the doorbell page
04:17skeggsb9778[d]: yeah indeed - does it help?
04:17airlied[d]: ```c
NV_HOPPER_USERMODE_A_PARAMS hopperParams = { 0 };
if (userModeClasses[i].classNumber != VOLTA_USERMODE_A) {
    allocParams = &hopperParams;
    // The BAR1 mapping is used for (faster and more efficient) writes
    // to perform work submission, but can't be used for reads.
    // If we ever want to read from the USERMODE region (e.g., to read
    // PTIMER) then we need a second mapping.
    hopperParams.bBar1Mapping = NV_TRUE;
}
```
04:18airlied[d]: haven't had a chance to figure out where to hack it in yet, had some other work distract me
04:21airlied[d]: though seems strange if that would break it, since I'm just doing writes, but I'll start there
04:21skeggsb9778[d]: yeah, gives me less hope that it's related if it's a perf optimisation, but worth trying
04:23skeggsb9778[d]: i was going to send a pull req, but i'll hold off another few hours and see if you/we figure anything out for this
04:33orowith2os[d]: gfxstrand[d]: If T isn't repr(transparent) over a slice or something similar, at least
04:34orowith2os[d]: You can embed a DST in a T without a reference-y type, in which case T itself will become a DST and need a reference-y type around it
04:36orowith2os[d]: Either some const shenanigans or putting a transmute somewhere would warn you if something goes awry
04:37airlied[d]: skeggsb9778[d]: there was a change made to the r535 headers for counting stuff around the registry table, and I think those got lost somewhere for r570, it results in runtime warnings
04:38skeggsb9778[d]: airlied[d]: oh, thanks for the reminder. someone (Timur I think) mentioned it already. i did attempt to preserve the fix, but didn't do it properly it seems
04:39orowith2os[d]: orowith2os[d]: I think putting a trait bound for T: Sized should be enough too
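As an aside on the `Box<T>` question above: for any `T: Sized`, `Box<T>` is documented to have the same size and ABI as a raw pointer, and the caveat only kicks in for unsized `T`, where the box becomes a fat (two-word) pointer. A quick self-contained check, using an illustrative type:

```rust
use std::mem::size_of;

#[allow(dead_code)]
struct Foo {
    a: u64,
    b: u32,
}

fn main() {
    // For a Sized T, Box<T> is a single non-null pointer, so it is
    // exactly pointer-sized.
    assert_eq!(size_of::<Box<Foo>>(), size_of::<usize>());
    // Option<Box<T>> is also pointer-sized thanks to the null niche,
    // which is what makes union tricks with it viable.
    assert_eq!(size_of::<Option<Box<Foo>>>(), size_of::<usize>());
    // A DST like [u32] makes Box a fat (two-word) pointer instead,
    // which is the case orowith2os is warning about.
    assert_eq!(size_of::<Box<[u32]>>(), 2 * size_of::<usize>());
    println!("ok");
}
```

A `T: Sized` bound, as suggested above, rules out the fat-pointer case at compile time.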
04:59skeggsb9778[d]: skeggsb9778[d]: that at least should be fixed now
07:42airlied[d]: skeggsb9778[d]: also the blackwell random hangs running cts
07:43airlied[d]: ./deqp-vk --deqp-case=dEQP-VK.api.copy_and_blit.*
08:06mohamexiety[d]: Speaking of Blackwell what nvk patches do I need to get it to run again? I got all the qmd stuff so want to fill it in today and see if it works
08:06mohamexiety[d]: My older branch theoretically should have worked but last time I tried it on ssh it didn’t work. Didn’t try it on this 5070 tho
09:54airlied[d]: whatever is in my branch is probably the latest,
10:01mohamexiety[d]: got it, thanks!
13:21karolherbst[d]: I've debugged a "used 0x4 instead of 0xc as a mask" bug for 4 hours :blobcatnotlikethis:
13:31karolherbst[d]: subgroups are hard (tm)
13:32pixelcluster[d]: last week i debugged an "asm is 0xbf" instead of "0xb0" bug for 3 days
13:32karolherbst[d]: oof
13:33pixelcluster[d]: the disassembler showed the same text for both :D
13:33karolherbst[d]: 🥲
14:00karolherbst[d]: okay.. running the CTS slowed down pretty massively now...
14:02karolherbst[d]: Test run totals:
14:02karolherbst[d]: Passed: 2769/31410 (8.8%)
14:02karolherbst[d]: Failed: 0/31410 (0.0%)
14:02karolherbst[d]: Not supported: 28614/31410 (91.1%)
14:02karolherbst[d]: Warnings: 27/31410 (0.1%)
14:02karolherbst[d]: a bit annoying that there are warnings we can't do anything about
14:05karolherbst[d]: gfxstrand[d]: `HMMA.884` has a `.STEP0..3` attribute, but it's not on the other `.HMMA` variants. Do you prefer it being two different opcodes in NAK or does it kinda depend on how different the rest of the encoding is (it's identical)?
14:06karolherbst[d]: with the `HMMA.884` only parts of the threads are actually doing something, so it needs multiple executions to do a full hmma
14:10karolherbst[d]: mhhhhhh...
14:11karolherbst[d]: I wonder if it really needs a 256 bit source...
14:23gfxstrand[d]: <a:shrug_anim:1096500513106841673> What does that attribute do?
14:23gfxstrand[d]: Are all the others STEP1?
14:24karolherbst[d]: the matrix operation is split across those steps
14:25karolherbst[d]: so for a fp16 8x8x4 matrix you need to execute .STEP0 and .STEP1
14:25karolherbst[d]: and apparently use different halves of the C input
14:25karolherbst[d]: for a fp32 8x8x4 you apparently use 4 steps
15:04gfxstrand[d]: Oh, then just assert step==0 in the encoder in that case.
15:10karolherbst[d]: yeah.. that's what I'm doing atm
16:07kar1m0[d]: karolherbst[d]: will running dEQP with nvidia drivers help somehow?
16:07karolherbst[d]: I don't see how
16:11mhenning[d]: gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33306 is ready for review when you get a chance
16:32mohamexiety[d]: hm as a quick sanity check I tried the shader ptr/cbuf test on ada and I noticed that shader ptr goes 1 => 2 => 1, while cbuf0 ptr changes every time, but cbuf1 ptr goes 1 => 2 => 1
16:33mohamexiety[d]: upper addresses for all of these are constant throughout
16:43karolherbst[d]: gfxstrand[d]: okay... so apparently we don't need vec8 inputs. The "vec8" input only exists as the full thing across all the steps. But like for fp16xfp16+fp32 -> fp32 HMMA.8x8x4, the input gets split across all the 4 steps, so you have a vec2 for each and not a full vec8. _however_, the doc specifies that the full vector across all steps is supposed to be consecutive and aligned to its full size,
16:43karolherbst[d]: and I think that's relevant to the destination only tho... but the docs imply some funkyness and that there _might_ be internal state stored across the steps somewhere somehow, because it does mention "please execute them all in order and with no other HMMA in between" and stuff
17:18mhenning[d]: Destination being aligned and full size probably means that we do want to model it as a full vec8 in nak
17:27karolherbst[d]: yeah.... probably
17:27karolherbst[d]: atm I'm already splitting it up in nir, but I was considering lowering in nak
17:28karolherbst[d]: but so far I'm not running into any issues, but could also be that I just get lucky with the RA
17:51gfxstrand[d]: Yeah, I think so. This is all sounding very magic.
18:00karolherbst[d]: the big work isn't really the MMA instructions anyway, it's just dealing with all the various different matrix layouts...
18:00gfxstrand[d]: Like it's possible that it works unaligned but I suspect it's doing some magic register pass-through that wants things aligned for $reasons.
18:00karolherbst[d]: maybe
18:01karolherbst[d]: though it does state you can execute random instructions in between
18:01karolherbst[d]: just no hmma
18:03karolherbst[d]: also.. there are a few things I'm not entirely sure on.. the HMMA.884 one can be executed with fp16 inputs, but with a fp32 result, and I'm not sure you can even express that in Vulkan...
18:03karolherbst[d]: well.. vulkan maybe, but not in glsl
18:20airlied[d]: I think the hmma pipe is separate so you can do things on other pipes fine
18:22karolherbst[d]: right
18:23karolherbst[d]: probably the 884 pipe is also independent from the 16x8x8 one
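A minimal sketch of the stepping model discussed above, assuming the full accumulator is a contiguous register vector aligned to its full size and each `.STEPn` covers a two-register slice of it. The function and register numbering here are hypothetical, for illustration only, not NAK's actual representation:

```rust
use std::ops::Range;

// Hypothetical model of HMMA.884 stepping: the full accumulator is a
// contiguous vector aligned to its total size; each .STEPn reads and
// writes a two-register slice of that vector.
fn step_slice(base_reg: u32, total_regs: u32, step: u32) -> Range<u32> {
    let per_step = 2; // each step covers a vec2 slice
    // The docs reportedly require the full vector to be size-aligned.
    assert_eq!(base_reg % total_regs, 0, "full vector must be size-aligned");
    let start = base_reg + step * per_step;
    start..start + per_step
}

fn main() {
    // fp32 8x8x4 result: a vec8 accumulator split across 4 steps.
    assert_eq!(step_slice(8, 8, 0), 8..10);
    assert_eq!(step_slice(8, 8, 1), 10..12);
    assert_eq!(step_slice(8, 8, 3), 14..16);
    // fp16 8x8x4 result: a vec4 accumulator split across 2 steps.
    assert_eq!(step_slice(4, 4, 1), 6..8);
    println!("ok");
}
```

Modeling the destination as the full vec8 in NAK, as mhenning suggests, would make the alignment requirement fall out of RA automatically rather than relying on luck.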
18:26mohamexiety[d]: how does it all work anyways? have always been curious what the tensor cores are like
18:26gfxstrand[d]: Yeah, that all sounds sane but we're going to have to figure out some way to model that in the scheduler so it doesn't slide hmmas past each other.
18:26skeggsb9778[d]: airlied[d]: i'm not sure i trust the nvk patches yet 😛 even vkcube dies with a mmu fault here
18:27skeggsb9778[d]: i do see the deqp thing though
18:29mhenning[d]: gfxstrand[d]: I don't think that's too bad. Naively, they could just be SideEffect::Memory and then they'll be serialized. A better implementation might be to add a new side effect type and then serialize them the same way memory is serialized
18:30mhenning[d]: neither of which is too much work
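A toy sketch of the serialization idea just described: give HMMA steps their own side-effect class and forbid the scheduler from sliding two instructions past each other when they share a serializing class, the same way memory ops are kept in order. The `SideEffect` and `Instr` types are made up for illustration and are not NAK's real scheduler model:

```rust
// Illustrative side-effect classes: Memory is serialized today; a new
// HmmaStep class would serialize HMMA steps against each other while
// still letting unrelated instructions move freely.
#[derive(Clone, Copy, PartialEq)]
enum SideEffect {
    None,
    Memory,
    HmmaStep,
}

struct Instr {
    name: &'static str,
    effect: SideEffect,
}

/// Two instructions may be reordered past each other unless they share
/// a serializing side-effect class.
fn can_reorder(a: &Instr, b: &Instr) -> bool {
    !(a.effect != SideEffect::None && a.effect == b.effect)
}

fn main() {
    let h0 = Instr { name: "hmma.step0", effect: SideEffect::HmmaStep };
    let h1 = Instr { name: "hmma.step1", effect: SideEffect::HmmaStep };
    let mov = Instr { name: "mov", effect: SideEffect::None };
    // The two HMMA steps must stay in program order...
    assert!(!can_reorder(&h0, &h1));
    // ...but other pipes can still slide past them.
    assert!(can_reorder(&h0, &mov));
    println!("{} stays ordered against {}", h0.name, h1.name);
}
```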
18:33airlied[d]: skeggsb9778[d]: I don't trust nvk, but those tests randomly fail with a kernel fencing problem, normally when nvk screws up it's consistently in the same place
18:33airlied[d]: like vkcube dying
18:34skeggsb9778[d]: well, i'm not sure it's a fencing problem, it looks like the channel has just hung (i put printks to read the fence seq in kill)
18:34airlied[d]: we don't even get a gpu hang, it's just like a lost irq or fence and it's random, it kinda looks similar to hopper just more inconsistent
18:34skeggsb9778[d]: shaders getting stuck in loops?
18:35skeggsb9778[d]: i've seen that happen randomly on previous gens with messed up sched stuff
18:36airlied[d]: I just don't see shader loops being that random, also not sure those tests use a lot of shaders, but yeah could be uninitialised data somewhere
18:36airlied[d]: until we get qmds I'm kinda ignoring blackwell
18:37airlied[d]: yeah the last hang I see is an all copy engine test
18:38airlied[d]: https://paste.centos.org/view/raw/c931f181 has the last few submitted push bufs
18:50karolherbst[d]: gfxstrand[d]: thing is.. we could also just ignore the 8x8x4 HMMA ones, because they are gone on Ampere anyway. The only argument one could make for them is that you don't have anything else on Volta, but Volta won't have great performance anyway
18:52airlied[d]: I expect most apps won't take advantage of them
18:52karolherbst[d]: I was mostly just curious about how it works, but nvidia also decided to not advertise them and I think they are also significantly slower than the 16x8x8 one
18:52karolherbst[d]: airlied[d]: yeah... probably not
19:08gfxstrand[d]: mhenning[d]: Yup
19:09gfxstrand[d]: karolherbst[d]: Yeah....
19:14karolherbst[d]: looks like the 884 fp16 one gets placed at 2..4 and 4..6 and the result is correct
19:14karolherbst[d]: but something is messed up with the fp32 one.. but uhm..
19:14karolherbst[d]: could be anything
19:38karolherbst[d]: anyway.. I can just keep the code in, because it's not harming anything really.. no idea what's wrong with the 884 fp32 one, but I suspect I pass the wrong channels to the steps, because the 884 fp32 layout is entirely messed up and I have 0 docs on what needs to be passed into the steps, and can't be bothered figuring out what nvidia is doing 😄
19:49airlied[d]: skeggsb9778[d]: turing + r570 regresses on parallel cts runs
19:53airlied[d]: I should test the 570 branch with 535 paths and bisect I suppose
20:04airlied[d]: hmm 535 seems fine
21:02airlied[d]: skeggsb9778[d]: okay cleaned up the machine a bit, latest CTS, running parallel, runs fine on 535, same branch 570 I get
21:02airlied[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1366519956811939840/message.txt?ex=68113e62&is=680fece2&hm=6ea39cea85750bb5e00ae7bdefc86fe4d2e6f6220da8530132b164c39a9c02c9&
21:17airlied[d]: now I'm running 24 deqp jobs in parallel in the test, running 4 doesn't seem to trigger it
21:31airlied[d]: 16 threads goes fine, I got one lost irq
21:35airlied[d]: seems to fail above 21 threads
21:37airlied[d]: oh 21 dies a bit later
21:38airlied[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1366528919636938752/run-deqp.sh?ex=681146bb&is=680ff53b&hm=6cb8223f3651bd8977f04a06b6996e5d84e7189b220dbc48a1b3f92fdfbbad5f&
21:38airlied[d]: This is Faith's script that I'm using, just point it at a vk-gl-cts build and change the jobs number