00:22airlied[d]: notthatclippy[d]: _lyude[d] as I said in the wrong channel, it looks like the bEnteringGcoffState is causing the s/r problems
00:32airlied[d]: I'd have thought since we don't support GC6 we'd always be hitting Gcoff
01:28airlied[d]: if I've got to pass the runtime state through nvkm this will be a monster patch
03:22airlied[d]: _lyude[d]: https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/nouveau-hacks
03:51_lyude[d]: OMG my power bill is going to be so happy
03:51_lyude[d]: thank youuuuu
05:29airlied[d]: okay, two patchsets: fix suspend/resume on r570 https://lore.kernel.org/nouveau/20260203052431.2219998-1-airlied@gmail.com/T/#t
05:29airlied[d]: fix spte/lpte handling: https://lore.kernel.org/nouveau/20260203052823.2220053-1-airlied@gmail.com/T/#t
06:51airlied[d]: gfxstrand[d]: in correct channel, where did you see blackwell has uniform float ops? I just see f2fp in my docs, but I might be missing something
07:20mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1468144054624718970/image.png?ex=6982f333&is=6981a1b3&hm=72327c3a38c290a053f5cc607d95a2c6bd87db36c364712f9f4f008a9f58a4ab&
07:20mohamexiety[d]: airlied[d]:
07:20mohamexiety[d]: from: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#blackwell-instruction-set
07:21mohamexiety[d]: this doc does confuse sm100 and sm120 though so might be a sm100 thing. but i'd find it really strange since it would make more sense for 120 to have it than 100
07:21airlied[d]: it might also be the other way around
07:21airlied[d]: I don't see them in my blackwell docs, but I've no idea if they are for datacenter GPUs only
07:22mohamexiety[d]: check if you have tensor memory instructions
07:22mohamexiety[d]: those are datacenter blackwell exclusive
07:23airlied[d]: not in my hopper docs either
07:24mohamexiety[d]: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory this stuff
07:24airlied[d]: yeah I have LDTM/STTM
07:24mohamexiety[d]: yeah if that corresponds to those ptx intrinsics then that's sm100
07:24airlied[d]: guess someone needs to RE the timings for UF ops then
07:30mohamexiety[d]: we'd probably still need sm120 docs since all the matmul instructions should have different timings. not sure on vector
07:36airlied[d]: table out of memory fc: 0 nd: 1048576 ma: 1048576 a:1048576
07:36airlied[d]: MESA: error: ZINK: vkCreateBufferView failed (VK_ERROR_OUT_OF_HOST_MEMORY)
08:10marysaka[d]: airlied[d]: ran this for the night (for ~10h) no crash in sight
08:11marysaka[d]: will pull the new series and see if I get any issues but hopefully we are all good now 😄
08:13airlied[d]: excellent, good timing if I get both bugs in a day 🙂
10:29marysaka[d]: airlied[d]: tested the series, it fixes the bug for sure on mel's reproducer, the games I tested were fine... but tried a deqp-runner and got this in the first 2 seconds: https://paste.centos.org/view/9e724b7e
10:30marysaka[d]: going to try to get the test group that can trigger it
10:43marysaka[d]: Actually got another one while trying to trim down... testgroup is `dEQP-VK.api.buffer.suballocation.transfer_src.transfer_dst.uniform_texel.storage_texel.uniform.storage.index.vertex.indirect.*` causing https://paste.centos.org/view/00a1f1bd
10:43marysaka[d]: The branch I'm testing on is main + revert the patch disabling compression <https://gitlab.freedesktop.org/marysaka/mesa/-/commits/nvk-reenable-compression>
11:32airlied[d]: Maybe drop the last patch and see if it persists, I might have made a mistake in refactoring
13:35marysaka[d]: reverted top patch and it is still broken so I guess the refactoring has some issue 😅
14:48karolherbst[d]: I have a workload that sees -10% performance due to the prepass instruction scheduler 😢
14:55gfxstrand[d]: mohamexiety[d]: Yeah, it's based on that. We'll have to experiment to see if we actually have them in 50xx.
14:56gfxstrand[d]: But it doesn't seem like something they'd make DC-only. But given how DC Blackwell and consumer Blackwell aren't quite in linear order, it could go either way.
14:57mohamexiety[d]: given Dave’s docs have ldtm/sttm which are DC exclusive I guess that confirms they’re consumer only
14:57gfxstrand[d]: I think I also saw them in the disassembler R/E dumps we did like a year ago and I trust those more than the docs.
14:57karolherbst[d]: DC cards are usually the older models
14:57mohamexiety[d]: oh yeah they were in Mel’s disassembly
14:57gfxstrand[d]: But yeah, I suspect sm120
14:58karolherbst[d]: and they also often add more features to consumer cards
14:58karolherbst[d]: so it makes sense if those are just sm120+ only
14:58gfxstrand[d]: karolherbst[d]: *more 3D features
14:58mohamexiety[d]: it really depends. DC Blackwell has a much more featureful tensor core for example. Hopper also had wgmma and such
14:58karolherbst[d]: also instructions
14:59gfxstrand[d]: They'll happily add more AI shit to DC cards
14:59karolherbst[d]: it happened in the past and it will happen in the future
14:59karolherbst[d]: or they add stuff to tegra before it reaches DC cards
14:59gfxstrand[d]: Heh
14:59karolherbst[d]: fp16 alu support is sm53+
15:00gfxstrand[d]: Apart from a few things like ASTC, it usually all converges by the next generation but who gets it first depends on where it's most important.
15:00karolherbst[d]: I think the MUFU.sqrt also was consumer first?
15:01karolherbst[d]: but yeah... uniform float instructions is a bit of a bigger change than those 😄
15:01karolherbst[d]: yeah.. mufu.sqrt is SM52+
15:02gfxstrand[d]: Intel's atom line is such a mess, too. Bay Trail is mostly Ivy bridge but with Haswell vertex fetch and the Broadwell sampler.
15:02karolherbst[d]: well the PTX ISA is linear at least
15:02karolherbst[d]: I don't think they ever added instructions only to DC cards
15:03karolherbst[d]: but yeah, I think in the classes it's more chaotic than that
15:06karolherbst[d]: mohamexiety[d]: those are DC exclusive? Mhhh.. I guess that kinda explains why starting with sm100 or sm120 the PTX SM versioning became a mess?
15:07mohamexiety[d]: karolherbst[d]: yep. sm100, sm103, sm110, and sm120.
15:07mohamexiety[d]: sm120 isn’t a superset of the older ones
15:07karolherbst[d]: rough
15:07mohamexiety[d]: anything tensor is very different on 100/103/110 compared to 120
15:07mohamexiety[d]: it’s like a completely different architecture
15:07karolherbst[d]: I guess the uniform tensor stuff is DC only?
15:07mohamexiety[d]: the other stuff shouldn’t be affected though, at least not much
15:08mohamexiety[d]: karolherbst[d]: not sure actually. I only know about tensor memory specifically
15:08karolherbst[d]: funny how they broke with the linearity due to tensor stuff 🙃
15:08mohamexiety[d]: but it probably is
15:08karolherbst[d]: yeah uniform tensor ops operate on tensor memory
15:09karolherbst[d]: `UTCHMMA` e.g.
15:09mohamexiety[d]: ah yeah
15:38karolherbst[d]: soo how can one create fossils? 😄
15:39karolherbst[d]: ahh vulkan layer.. of course
15:56karolherbst[d]: ... we don't advertise Vulkan 1.4 with the drmshim 🥲
15:57karolherbst[d]: ehh wait..
15:57karolherbst[d]: I never passed in `NOUVEAU_CHIPSET=0x170` with my fossil stats. oops
15:59karolherbst[d]: `Static cycle count: 2252617 -> 2541686 (+12.83%); split: -0.02%, +12.86%` with the prepass scheduler.. impressive
16:00karolherbst[d]: some shaders regress by +34.69%
16:06karolherbst[d]: interestingly enough they also improve warps/SM, but less than static cycle count
17:06mhenning[d]: gfxstrand[d]: Yeah, I'm seeing uniform float ops on sm120 but not sm100 based on nvdisasm output
17:07mhenning[d]: karolherbst[d]: yeah, the prepass scheduler is designed to prioritize occupancy before static cycle count
17:08mhenning[d]: I've also seen examples where the prepass scheduler improves things slightly but then RA does a much worse job
17:09karolherbst[d]: yeah...
17:09karolherbst[d]: I'll be looking into what happens and if there is a simple enough solution, but yeah... I'm just surprised it can have such a big negative impact there
17:20karolherbst[d]: mhenning[d]: I do wonder if the pass should be less aggressive within loops... or maybe the answer is just `nir_opt_licm`, which I haven't wired up yet :').. let me actually check if that opt helps
17:24karolherbst[d]: yeah mh doesn't seem to help in this example...
17:24karolherbst[d]: but yeah.. the shader I'm looking at is going from 32 to 40 warps, but 40k cycles to 52.5k
17:26mhenning[d]: Is the instruction count the same?
17:28karolherbst[d]: yes
17:29karolherbst[d]: it's just some inner loop that gets hurt with higher delays due to the reordering
17:38mhenning[d]: You could try hacking get_schedule_types to not return the lowest register count setting. If that fixes it we might need a better heuristic to make that trade-off
17:49karolherbst[d]: mhenning[d]: Doesn't seem to matter what is returned there
17:50karolherbst[d]: originally it was `[RegLimit(37), RegLimit(45), RegLimit(53), RegLimit(61)]`, and now I forced it to `[RegLimit(69)]` and get the same result
17:54mhenning[d]: Oh, that's odd. It's supposed to bail out if the new schedule is worse and the old schedule meets the register limits
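The bail-out rule described here (keep the old schedule when it already meets the register limit and the new one is no faster) can be sketched like this; `Schedule` and `pick_schedule` are hypothetical names for illustration, not the actual nak code:

```rust
// Sketch of the intended bail-out heuristic: prefer the old schedule
// when it already fits the register limit and the new schedule does
// not improve static cycle count.
#[derive(Debug, PartialEq)]
struct Schedule {
    static_cycles: u64,
    num_regs: u32,
}

fn pick_schedule<'a>(old: &'a Schedule, new: &'a Schedule, reg_limit: u32) -> &'a Schedule {
    if old.num_regs <= reg_limit && new.static_cycles >= old.static_cycles {
        old
    } else {
        new
    }
}

fn main() {
    // Numbers loosely taken from the shader discussed above.
    let old = Schedule { static_cycles: 40_000, num_regs: 60 };
    let new = Schedule { static_cycles: 52_500, num_regs: 45 };
    // Old schedule fits RegLimit(69) and is faster, so it should win.
    assert_eq!(pick_schedule(&old, &new, 69), &old);
    // Under a tighter limit the old schedule no longer fits, so the
    // slower but leaner new schedule is taken.
    assert_eq!(pick_schedule(&old, &new, 45), &new);
    println!("ok");
}
```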
17:55karolherbst[d]: original stats: https://gist.github.com/karolherbst/a323d2818c2baa46ed5514e9201367eb
18:01karolherbst[d]: mhhhhh
18:01karolherbst[d]: mhhhh
18:01karolherbst[d]: I think there is a bug in the code
18:03karolherbst[d]: ohh wait no..
18:04karolherbst[d]: I was confused by the loop
18:04karolherbst[d]: btw.. a loop can return a value passed to `break` which I think could simplify the code there a bit 😄
18:06mhenning[d]: Oh, I didn't know that.
18:08karolherbst[d]: I think it might even return the last value if it's at the end..
18:11karolherbst[d]: ahh no.. has to be explicit then
18:15cubanismo[d]: Yeah, got excited and thought maybe for loops could return stuff
18:15cubanismo[d]: But nope
18:15cubanismo[d]: Makes sense after thinking about it more.
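For the record, the Rust feature under discussion: only the bare `loop` form is an expression that `break` can feed a value into; `for` and `while` always evaluate to `()`, which is why the value has to be explicit and for loops can't return anything.

```rust
fn main() {
    let mut counter = 0;
    // `loop` is an expression; `break <expr>` supplies its value.
    let result = loop {
        counter += 1;
        if counter == 10 {
            break counter * 2;
        }
    };
    assert_eq!(result, 20);
    // `for` and `while` loops always evaluate to (), so `break value`
    // is rejected by the compiler there.
    println!("ok");
}
```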
19:03karolherbst[d]: I wonder if I want to play around with prepass scheduling heuristics, but I also feel like it's a lot of time spent on something that's not gonna get me a clear win over what we have today 😄
19:35karolherbst[d]: I just found something interesting: `UMEMSETS:`
19:35karolherbst[d]: "Initialize Shared Memory"
19:35karolherbst[d]: Seems to be Blackwell+
19:37HdkR: Interesting, attempting to solve the problem of leaking data through shared memory?
19:38karolherbst[d]: Well.. Vulkan has a feature for initializing workgroup memory
19:38karolherbst[d]: but you have to do it by hand without it
19:38HdkR: ah
19:39HdkR: That's pretty nice then
19:39karolherbst[d]: but it would be interesting to know how much faster the instruction is
19:39karolherbst[d]: memory fills could be implemented pretty efficiently, if it's a proper memory fill op
19:40karolherbst[d]: but that stuff is kinda more relevant if you hit real memory due to the bandwidth
19:41karolherbst[d]: `VK_KHR_zero_initialize_workgroup_memory` is the vulkan ext
19:42karolherbst[d]: we have `nir_zero_initialize_shared_memory` that is already pretty smart as it balances the instruction across the invocations
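The balancing that `nir_zero_initialize_shared_memory` does can be modeled as a strided clear, one stripe per invocation; this is a simplified single-threaded sketch of the access pattern, not the actual NIR lowering:

```rust
// Each invocation clears every `workgroup_size`-th word starting at
// its own local id, so the stores are spread evenly across the
// workgroup (a barrier would follow in the real lowering).
fn zero_shared_for_invocation(shared: &mut [u32], local_id: usize, workgroup_size: usize) {
    let mut i = local_id;
    while i < shared.len() {
        shared[i] = 0;
        i += workgroup_size;
    }
}

fn main() {
    // Deliberately not a multiple of the workgroup size.
    let mut shared = vec![0xdead_beefu32; 70];
    let workgroup_size = 32;
    for id in 0..workgroup_size {
        zero_shared_for_invocation(&mut shared, id, workgroup_size);
    }
    assert!(shared.iter().all(|&w| w == 0));
    println!("ok");
}
```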
19:56HdkR: I wonder if it only needs one invocation to execute it, or if it balances all invocations that use it.
19:58karolherbst[d]: well the nvidia instruction appears to be a uniform one
19:59karolherbst[d]: but obviously no idea how it works internally
20:01HdkR: Ah right, U* class, I read that as part of the instruction name. So only the provoking thread executes it, got it.
20:23airlied[d]: marysaka[d]: diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
20:23airlied[d]: index ea1191386c6e..19a7407cf702 100644
20:23airlied[d]: --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
20:23airlied[d]: +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
20:23airlied[d]: @@ -422,7 +422,7 @@ nvkm_vmm_sparse_unref_ptes(struct nvkm_vmm_iter *it, bool pfn, u32 ptei, u32 pte
20:23airlied[d]: memset(&pt->pde[ptei], 0x00, sizeof(pt->pde[0]) * ptes);
20:23airlied[d]: else
20:23airlied[d]: if (it->desc->type == LPT)
20:23airlied[d]: - memset32(&pt->pte[ptei].u, 0x00, sizeof(pt->pte[0]) * ptes);
20:23airlied[d]: + memset32(&pt->pte[ptei].u, 0x00, ptes);
20:23airlied[d]: return nvkm_vmm_unref_ptes(it, pfn, ptei, ptes);
20:23airlied[d]: }
20:23airlied[d]: is one bug I spotted
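The bug in the diff above: the kernel's `memset32()` takes a count of 32-bit words, not a byte length, so `sizeof(pt->pte[0]) * ptes` cleared four times the intended range. A minimal Rust mimic (the helper here is a stand-in for illustration, not the kernel code) shows the difference:

```rust
// Mimic of the kernel's memset32() semantics: fill `count` 32-bit
// words with `val` -- count is a word count, not a byte length.
fn memset32(buf: &mut [u32], val: u32, count: usize) {
    for w in &mut buf[..count] {
        *w = val;
    }
}

fn main() {
    let ptes = 4;
    let mut pte = [0xffff_ffffu32; 16];
    // Correct call: the count is the number of PTEs.
    memset32(&mut pte, 0, ptes);
    assert!(pte[..4].iter().all(|&w| w == 0));
    assert!(pte[4..].iter().all(|&w| w == 0xffff_ffff));
    // The buggy call passed sizeof(pte[0]) * ptes = 16 as the count,
    // which would have zeroed 16 words instead of 4.
    println!("ok");
}
```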
20:28marysaka[d]: I missed that one :EstelleFacepalm:, it seems to be running fine so far, will see how the CTS ends
20:35karolherbst[d]: HdkR: yeah, though not sure how relevant that is, because it could all be handled by something else. Invocations only submit the memory requests, but like something else executes them
20:36karolherbst[d]: but yeah.. no idea if you are supposed to execute that one in parallel or not
20:36karolherbst[d]: maybe there is something in the ptx docs...
20:37HdkR: karolherbst[d]: Yea, because it is U* class, even if you have multiple invocations active, it'll only be hit once. So sounds like whatever pipeline/coprocessor scales out the initialization :D
20:37HdkR: Uniform class was a great addition
20:38karolherbst[d]: `st.bulk instruction initializes a region of shared memory starting from the location specified by destination address operand a.` I think that's the thing on the PTX side
20:38karolherbst[d]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-st-bulk
20:39karolherbst[d]: `The only numeric value allowed for operand initval is 0` lol
20:39HdkR: A max of 16kb is pretty cool
20:39karolherbst[d]: so you need at most 14 of them?
20:40karolherbst[d]: I think 224k is the max atm
20:40karolherbst[d]: I wonder how much of that are PTX limitations and how much are hw ones
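The instruction-count arithmetic above, assuming the 16 KiB-per-op cap mentioned here: clearing a whole shared memory allocation takes ceil(size / 16 KiB) `st.bulk` operations.

```rust
// Number of 16 KiB st.bulk operations needed to clear `shared_kib`
// KiB of shared memory (assumes the 16 KiB cap discussed above).
fn st_bulk_ops_needed(shared_kib: u32) -> u32 {
    shared_kib.div_ceil(16)
}

fn main() {
    assert_eq!(st_bulk_ops_needed(224), 14); // the "at most 14" above
    assert_eq!(st_bulk_ops_needed(228), 15); // with a 228 KiB max
    assert_eq!(st_bulk_ops_needed(100), 7);  // typical consumer cards
    println!("ok");
}
```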
20:40HdkR: Kernels that zero all their shared memory, plus having the maximum allocated, sounds nuts :D
20:40karolherbst[d]: that initval has to be 0 is an odd restriction
20:41karolherbst[d]: I'd have to check if the constant is part of the encoding on the hw ISA level
20:41karolherbst[d]: ohh wait
20:42karolherbst[d]: it's uhm...
20:42karolherbst[d]: well
20:42karolherbst[d]: it's not a constant on the ISA
20:42karolherbst[d]: it's a uniform reg, that has to be `URZ` 🙃
20:42HdkR: hah
20:43karolherbst[d]: but anyway, good to know there is a PTX instruction for it, so that makes it easier to RE
20:43airlied[d]: so no poisoning the memory
20:44HdkR: Looking forward to the microbenchmark of zeroing 224k of shared memory with 14 instructions versus a loop :D
20:44karolherbst[d]: it feels like they have future plans to allow uniform fill values, but for now it's always zero...
20:45karolherbst[d]: most consumer cards only have 100k shared memory sadly
20:45karolherbst[d]: but yeah...
20:46karolherbst[d]: ohh the max is 228, not 224, because uhm.. they have this weirdo 100k value
20:46karolherbst[d]: 64 -> 100 -> 132 are the steps... it's kinda weird
20:47HdkR: Wacky
20:49karolherbst[d]: I wonder if there are many shaders out there that zero initialize shared memory repeatedly..
20:49karolherbst[d]: or well.. sub ranges of it