00:12 gfxstrand[d]: Holding steady at 52
00:17 redsheep[d]: So you said you want to potentially default zink on those older gens with 25.2? Are they going to have nvk by default with 25.1, or 25.2?
00:18 gfxstrand[d]: NVK enabled in 25.1 and maybe Zink on 25.2 or 25.3?
00:19 gfxstrand[d]: We've got a month before 25.1 branches. Plenty of time to land the fixes.
00:19 gfxstrand[d]: Won't be officially conformant until after the branch point but that's fine.
00:20 redsheep[d]: Hmm. Was just thinking, having the choice to not have any released version where nvk and nouveau gl coexist seems like it has some merit
00:20 redsheep[d]: It's probably not so easy to be sure that nvk+zink works well enough in just one month though, and maybe not worth the wait for nvk to be enabled. I dunno
00:21 gfxstrand[d]: Is it likely to work worse than nouveau GL? <a:shrug_anim:1096500513106841673>
00:22 redsheep[d]: I bet it's fine, yeah. That's what I'm kind of getting at, we can reduce the needed surface area for testing by making those gens never have nouveau gl and nvk together
00:22 gfxstrand[d]: Honestly, the thing that concerns me most is perf. Not that anyone's going to get perf on most Maxwell cards but NVK+Zink is almost certainly worse, especially given that Zink loves VK_EXT_descriptor_buffer.
00:22 redsheep[d]: Though... flatpak probably negates any kind of control over that, so nevermind
00:22 gfxstrand[d]: Yup
00:23 gfxstrand[d]: And I don't expect Maxwell to have any new NVK+NGL issues
00:23 gfxstrand[d]: gfxstrand[d]: I guess I could disable EDB pre-Turing. There's no good way to make it not suck there. <a:shrug_anim:1096500513106841673>
00:25 redsheep[d]: That sounds reasonable, and is probably the simplest solution to avoid perf implications and complexity on the nvk side
00:25 redsheep[d]: Just means the non-edb path is more load bearing in zink
00:25 redsheep[d]: But iirc it already was due to img+zink needing that path?
00:26 redsheep[d]: Or maybe it was another driver. It was something important that has to work without it, I forget exactly what
00:30 redsheep[d]: I'm curious how well a znvk session actually works without reclocking. Sounds slow.
00:30 gfxstrand[d]: You could always try it out! 😛
00:32 redsheep[d]: I would if I had a card. I don't intend on buying one though. I also have just the one computer, no test bench or anything.
00:33 redsheep[d]: I maybe have the space but I try to be somewhat minimalist and a whole pc for this... isn't minimal
00:33 gfxstrand[d]: Yeah, no need to go spend your hard-earned money on a massive GPU that runs like mud.
00:33 gfxstrand[d]: Unless you're me. 🙃
00:33 redsheep[d]: That sounds like an expense for the company card
00:33 mohamexiety[d]: It pays for itself at least, hopefully :KEKW:
00:34 gfxstrand[d]: I think a boot-clocked Volta probably takes the cake for the most GPU/performance. 😂
00:34 mohamexiety[d]: :KEKW:
00:34 gfxstrand[d]: mohamexiety[d]: In the sense that I make enough to afford it, yes. No worries there.
00:35 mohamexiety[d]: I guess you could cheat and do a 3090 without gsp to beat that but that’s cheating
00:35 gfxstrand[d]: Yeah, that's cheating
00:36 redsheep[d]: Then you send that 3090 to orbit a supermassive black hole
00:55 airlied[d]: do we have anything to validate the code, getting unaligned register access but have no idea what is causing it
01:04 mhenning[d]: be sure you're building with assertions enabled, but otherwise no, we don't have much validation right now
01:15 gfxstrand[d]: Typically that's caused by a mismatch between how many registers the instruction thinks it's getting and how many we're allocating. RA should never give an actually unaligned register.
01:16 gfxstrand[d]: But if it thinks it's getting a vec4 and we're only providing a vec2, it may not be vec4 aligned
01:34 zmike[d]: gfxstrand[d]: perf will be alright, but without descriptor buffer you will be subject to mid-frame shader compiles
01:35 gfxstrand[d]: zmike[d]: Given how much time those GPUs are likely to take between frames when running at boot clocks, I'm not sure I should care. 😆
01:35 zmike[d]: it's probably fine
01:35 zmike[d]: and who knows, maybe we'll get something in vulkan that works better than descriptor buffer someday
01:37 gfxstrand[d]: Perhaps. But will anyone working on designing such a thing care about making it fast Pre-Turing? 🙃
01:37 zmike[d]: not my problem, that's for sure
01:38 zmike[d]: you're all trying to figure out how GPUs work and I'm not even in this galaxy anymore
01:39 gfxstrand[d]: I think it's possible to make fast. I'm just not sure I care enough.
02:29 airlied[d]: gfxstrand[d]: is there any reason deref_array couldn't take a 32-bit index on a 64-bit array ptr?
02:30 orowith2os[d]: If I get my own place, or some space for a server rack, it would be fun to buy one GPU from each generation (except the newest gens. I ain't dealing with allat) and setting up separate VMs (one for each GPU) to run CI and CTS stuff on :hammy:
02:31 orowith2os[d]: gfxstrand[d]: (the idea that comes to mind here ^)
02:43 gfxstrand[d]: airlied[d]: Because it's defined to have to match?
02:43 gfxstrand[d]: That decision was made long ago when we first started doing non-32-bit types.
02:45 airlied[d]: I just feel we throw away some information that might be useful to know later on without having to do some sort of range tracker
02:46 airlied[d]: reverse engineering how the 64-bit load_global value is built is definitely the wrong way to be doing this πŸ˜›
02:46 gfxstrand[d]: One of the problems you run into when you try to allow that is: Is it sign extended? We could have picked a convention. We could have picked either convention. I had to argue with people in the SPIR-V group to get them to pick one and actually write it down. (Array indices are signed in SPIR-V.) With NIR, we decided to just require the bit size to match and you have to do an explicit up-cast if
02:46 gfxstrand[d]: you want to use a smaller type.
02:47 gfxstrand[d]: airlied[d]: No argument there
02:48 gfxstrand[d]: But also, if there were something useful you could do with 32-bit array indices, you can do that same thing with an array index that's `i2i(x)`. It's no less detectable now.
02:49 gfxstrand[d]: Backing it out of addresses is a pain, granted.
02:50 airlied[d]: the problem is that where the i2i gets introduced blocks a lot of the work of figuring out which parts of the address calculation are convergent or divergent
02:51 airlied[d]: since adds after the i2i in 64-bit space are hard to work back to being something you could add in a convergent manner to the original address
02:51 airlied[d]: but maybe I could make a rewrite pass to just try and figure all of it out and leave it in a state that nak can use the correct instructions
02:57 gfxstrand[d]: But you're going to get the i2i anyway. That's necessary to implement SPIR-V semantics.
02:57 gfxstrand[d]: Whether it's explicit or implicit in the array deref doesn't change that.
02:58 gfxstrand[d]: And yeah, that makes it all a pain.
02:58 gfxstrand[d]: What we really need is better integer range analysis in NIR to know when things will never be negative or will never overflow. That's what we're really missing.
03:00 gfxstrand[d]: Like if something is an array index to an in-bounds access chain that automatically gives us some range information. Right now, we have no way of tagging that and preserving/propagating it.
03:02 gfxstrand[d]: (Though that assumption can only really be propagated to the address calculation and only in particular ways. Compiler assumptions are tricky.)
03:05 gfxstrand[d]: Looks like nir_lower_explicit_io already tries to be smart about this (and gets it wrong. 🤦🏻‍♀️ )
03:05 airlied[d]: this is definitely where I run into my compiler knowledge limits
03:08 airlied[d]: I feel like for this hw the right thing is actually having convergent and divergent address components and never mixing them until you get to the load, doing any calcs on each addr as needed, but I don't think we have the divergence info available at those times
03:15 gfxstrand[d]: So, `nir_lower_explicit_io()` tries to be clever here with `amul`. Maybe we can use that?
03:18 airlied[d]: I might play around a bit with that, see where I end up
03:18 gfxstrand[d]: In theory, we could roll our own `nir_lower_explicit_io` which takes divergence information into account. 😬
03:18 gfxstrand[d]: If it really matters that much
03:19 gfxstrand[d]: But also, that gets really tricky when you have "real" pointers like in CL
03:20 gfxstrand[d]: Do pointers suddenly become `u64vec4`s or something crazy like that?
03:21 gfxstrand[d]: The thing we probably could do is some sort of mass re-association. At the end of the day, the address is going to be a bunch of adds. We could scrape them all up into one mega-add, sort it into divergent, uniform, and immediate parts, and then create a new set of adds which groups them nicely.
03:21 airlied[d]: I expect on CL you won't get the benefits of convergence that often anyways
03:22 gfxstrand[d]: gfxstrand[d]: Oh, actually the `amul` trick might be hurting us here!
03:22 gfxstrand[d]: Or maybe not? Unclear
03:22 airlied[d]: that suggestion is where I meant a big rewrite pass
03:22 airlied[d]: just go back and reassociate everything once we have diveregence info, move things like shifts up as into both sides
03:23 gfxstrand[d]: Yeah, we could do that post-divergence-analysis.
03:23 gfxstrand[d]: And just preserve the information
03:23 airlied[d]: so x = add con, div, shl x, 16 goes to con << 16, div << 16 etc
03:23 airlied[d]: and then see if things just end up falling out of it
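(A minimal C sketch of the regrouping idea being discussed, purely illustrative; none of these names exist in NAK or NIR. It collects address terms tagged as uniform, divergent, or immediate, distributes a shift across them, and sums each group separately, which is the "con << 16, div << 16" point above.)
```c
#include <stdint.h>
#include <stdio.h>

enum term_kind { TERM_UNIFORM, TERM_DIVERGENT, TERM_IMM };

struct addr_term {
   enum term_kind kind;
   uint64_t value; /* stand-in for an SSA value or an immediate */
};

struct grouped_addr {
   uint64_t uniform;   /* candidate for the UR operand */
   uint64_t divergent; /* stays in a regular GPR */
   uint64_t imm;       /* candidate for the immediate slot */
};

/* "Mega-add": distribute the shift into every term, then group by kind. */
static struct grouped_addr
regroup(const struct addr_term *terms, unsigned count, unsigned shift)
{
   struct grouped_addr g = {0};
   for (unsigned i = 0; i < count; i++) {
      uint64_t v = terms[i].value << shift;
      switch (terms[i].kind) {
      case TERM_UNIFORM:   g.uniform   += v; break;
      case TERM_DIVERGENT: g.divergent += v; break;
      case TERM_IMM:       g.imm       += v; break;
      }
   }
   return g;
}

int main(void)
{
   const struct addr_term terms[] = {
      { TERM_UNIFORM,   0x1000 }, /* e.g. a descriptor base */
      { TERM_DIVERGENT, 0x20   }, /* e.g. a per-invocation index */
      { TERM_IMM,       0x4    }, /* e.g. a loop-unroll offset */
   };
   struct grouped_addr g = regroup(terms, 3, 4);
   printf("uniform=0x%llx divergent=0x%llx imm=0x%llx\n",
          (unsigned long long)g.uniform,
          (unsigned long long)g.divergent,
          (unsigned long long)g.imm);
   return 0;
}
```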
03:24 gfxstrand[d]: I committed some really bad sins trying to improve this for DA:TV and I did get 1 FPS out of it.
03:24 gfxstrand[d]: My patches weren't correct, though.
03:24 gfxstrand[d]: And that was without even doing anything with UGPRs. Just better use of immediates.
03:24 airlied[d]: this shader I'm concentrating on seems like it'll care
03:25 gfxstrand[d]: care about UGPRs?
03:25 airlied[d]: yes doing maximal UGPR address calcs seems like it's good for a TFlop or 2
03:26 gfxstrand[d]: Yeah
03:26 gfxstrand[d]: I think they all run into roughly the same set of problems, unfortunately.
03:27 gfxstrand[d]: https://gitlab.freedesktop.org/gfxstrand/mesa/-/commit/82ed3581af6d7be4b0b2137994f72fc78232b7f6
03:27 airlied[d]: I think once I get my hacks closer to the NVIDIA TFlops (they're at 27, I'm at 19), I'll have to consider more how to do things properly
03:29 gfxstrand[d]: 19 is still pretty impressive, TBH.
03:29 gfxstrand[d]: But yeah, there's a lot to optimize here.
03:29 gfxstrand[d]: Predication will probably shave away a good chunk, too, if it's getting bounds checked anywhere.
03:30 airlied[d]: yeah predication is the other big difference I can spot their shader using
03:30 airlied[d]: though I think it's mostly just to avoid doing some stuff in the last iteration of a loop
03:30 gfxstrand[d]: Predication gave me about 1-2 FPS in DA:TV and cut my shader compile time in half. Smashing on ldg immediates gave me another 1 FPS.
03:31 gfxstrand[d]: I'm sure using more of the different ldg forms will help, too.
03:31 airlied[d]: nearly all my ldg/stg are using ur now
03:31 gfxstrand[d]: \o/
03:31 airlied[d]: though it's via naive peepholes 🙂
03:32 gfxstrand[d]: sure
03:32 gfxstrand[d]: Figuring out what shader we need to generate is the first step. Figuring out how to do all the necessary optimizations correctly can be the second step.
03:32 airlied[d]: but they are doing 64-bit UR + 64-bit R, whereas NVIDIA seem to do a lot of 64-bit UR + 32-bit R
03:32 airlied[d]: which is why I'm where I am πŸ™‚
03:33 gfxstrand[d]: airlied[d]: Right. So we should basically have that most of the time since we're using the 32-bit offset form
03:33 airlied[d]: yup I just have to reextract that info at the end after it's been through a few i2i, iadd and shifts
03:33 gfxstrand[d]: Almost everything should be a UR base (since it comes from the descriptor load) plus an offset that's been cast from 32-bit
03:34 gfxstrand[d]: Even if there's some divergent stuff that goes into that 32-bit value, we should still be able to save the iadd(base, u2u64(offset))
03:34 airlied[d]: yeah the two cases are UR base + offset, or sometimes after loop unroll I think UR base + fixed offset + offset
03:35 airlied[d]: currently the loop unroll fixed offset ends up treated as divergent needlessly
03:35 gfxstrand[d]: Yeah so that last one is tricky because the offset+imm might overflow and that gets sticky
03:36 gfxstrand[d]: IDK what NV does about overflow when you have 3 address sources
03:36 gfxstrand[d]: Even just for 64+32, you still have to decide on sign extended or not
03:36 gfxstrand[d]: I think they probably sign-extend everything because immediates are signed, IIRC.
03:36 gfxstrand[d]: (Except for certain ldc cases where they're unsigned for reasons.)
03:37 airlied[d]: oh they have 24-bit imms, so yeah amul might be useful then
03:38 airlied[d]: ah Ra + signed 24-bit or Rz + unsigned at least
03:38 gfxstrand[d]: Right. It's rz that changes the signedness
03:38 gfxstrand[d]: Which is really mean
03:38 airlied[d]: so it's signed 24-bit when being added to anything
03:38 gfxstrand[d]: The only reason we're at all anywhere close to correct right now is because rZ doesn't get needlessly propagated into I/O instructions
03:38 airlied[d]: so any R or UR makes it signed 24-bit
03:39 airlied[d]: otherwise it's unsigned
03:39 gfxstrand[d]: So cursed
03:40 airlied[d]: also the UR + R , the R is unsigned
03:40 gfxstrand[d]: *sigh*
03:40 gfxstrand[d]: Why do you do this to us, NVIDIA?!?
03:40 gfxstrand[d]: I mean, they're doing the useful thing. I get that. But still, it makes compilers hard.
03:42 gfxstrand[d]: airlied[d]: Okay, so what if you do UR64+R32+imm? Who does roll-over where?
03:43 gfxstrand[d]: Is it all extended (R32 unsigned and imm 24-bit signed) and then added in 64 bits?
03:43 gfxstrand[d]: That's the sane thing to do
03:43 gfxstrand[d]: Or does it try to do an add in 32 bits and then unsigned extend that and add to the 64-bit value?
03:43 airlied[d]: the docs are written as Ra.U32 + URb + immS24
03:44 airlied[d]: I've no idea if that is how it's done or just written down
03:44 gfxstrand[d]: So presumably it extends everything
03:44 gfxstrand[d]: That would be the sane thing to do and NVIDIA is usually sane
03:44 gfxstrand[d]: And it's not like it really adds that many bits to your adders to do the right thing
03:45 gfxstrand[d]: You just have to add the u32 and s24 in s33
03:45 gfxstrand[d]: or maybe s34
03:45 airlied[d]: "The effective address is equal to the sum of URb (or URb+1:URb), Ra (or Ra+1:Ra) and the immediate
03:45 airlied[d]: operand. An omitted immediate offset is assembled as zero. All offsets are in bytes."
03:45 airlied[d]: nothing more specific
03:45 gfxstrand[d]: Yeah....
03:45 gfxstrand[d]: Let's assume they have enough bits in their adders to get it correct as if it's all extended first.
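(A worked example of that assumption in plain C: "extend everything, then add in 64 bits" vs. "do the 32-bit add first, then zero-extend". Only the Ra.U32 + URb + immS24 description quoted above comes from the docs; the rest is illustrative.)
```c
#include <stdint.h>
#include <stdio.h>

static uint64_t addr_extend_first(uint64_t ur, uint32_t r, int32_t imm_s24)
{
   /* Each source is extended to 64 bits (R unsigned, imm signed), then added. */
   return ur + (uint64_t)r + (int64_t)imm_s24;
}

static uint64_t addr_add32_first(uint64_t ur, uint32_t r, int32_t imm_s24)
{
   /* The 32-bit sources are added with wrap-around, then zero-extended. */
   uint32_t lo = r + (uint32_t)imm_s24;
   return ur + (uint64_t)lo;
}

int main(void)
{
   /* With a negative immediate and a small R, the two interpretations differ
    * by exactly 2^32, which is the roll-over question being raised here. */
   uint64_t ur = 0x100000000ull;
   uint32_t r = 0x10;
   int32_t imm = -0x20;
   printf("extend-first: %#llx\n", (unsigned long long)addr_extend_first(ur, r, imm));
   printf("add32-first:  %#llx\n", (unsigned long long)addr_add32_first(ur, r, imm));
   return 0;
}
```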
03:47 gfxstrand[d]: So peeling off the UR is easy. We just look for `iadd(uniform, u2u64(x))` which we'll see 90% of the time thanks to `addr_to_global()` in `nir_lower_io.c`.
03:48 gfxstrand[d]: But then we need to look at `x` and see if it's an `iadd(y, imm24)` and know that the add doesn't overflow in funky ways.
03:49 gfxstrand[d]: Because what we really want to prove is that `u2u64(x) == iadd(u2u64(y), i2i64(imm24))`
03:50 gfxstrand[d]: Which pretty much means that `imm24 >= 0` and `(uint32_t)y <= INT32_MAX`
03:51 gfxstrand[d]: There are other cases where the overflow still works but they're not something I think we can prove statically.
03:51 gfxstrand[d]: Even checking that `y` is non-negative when signed is tricky enough
03:52 gfxstrand[d]: Ugh....
03:53 gfxstrand[d]: I think we do technically have the needed information with **OpInBoundsAccessChain** in SPIR-V but propagating it all is a pain
03:53 gfxstrand[d]: Actually....
03:53 gfxstrand[d]: Maybe `imm24 >= 0` is enough
03:54 gfxstrand[d]: No. It isn't. We need that and we need to know that the `iadd` doesn't overflow.
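(A small self-checking C illustration of that condition: for a wrapping 32-bit x = y + imm24, u2u64(x) == iadd(u2u64(y), i2i64(imm24)) holds whenever imm24 >= 0 and (uint32_t)y <= INT32_MAX, since the unsigned sum then stays below 2^31 + 2^23 < 2^32. Names are made up, not NIR code.)
```c
#include <assert.h>
#include <stdint.h>

static int split_is_safe(uint32_t y, int32_t imm24)
{
   return imm24 >= 0 && y <= (uint32_t)INT32_MAX;
}

int main(void)
{
   for (uint64_t y = 0; y <= UINT32_MAX; y += 0x10001) {
      for (int32_t imm = 0; imm < (1 << 23); imm += 0x3fff) {
         if (!split_is_safe((uint32_t)y, imm))
            continue;
         uint32_t x = (uint32_t)y + (uint32_t)imm;      /* the 32-bit iadd */
         uint64_t split = (uint64_t)(uint32_t)y + (int64_t)imm;
         assert((uint64_t)x == split);                  /* u2u64(x) matches */
      }
   }
   return 0;
}
```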
03:55 redsheep[d]: airlied[d]: What is the theoretical throughput for the hardware on the calculations you're doing? Is this your 2070 super?
03:55 gfxstrand[d]: `nir_opt_offsets()` may help us here, though I don't know what all magic it's doing.
03:56 airlied[d]: redsheep[d]: no idea, whatever NVIDIA gets on the same benchmark is my goal, this is some laptop turing I'm playing on πŸ™‚
03:56 gfxstrand[d]: Oh, no, that doesn't do anything helpful. 😭
03:56 airlied[d]: Quadro RTX4000 Mobile it calls itself
03:57 airlied[d]: NVIDIA have written the second coopmat extension to get closer to CUDA at least, but baby steps and all that
03:58 gfxstrand[d]: I think maybe we just need to teach `nir_lower_explicit_io` about `no_[un]signed_wrap`?
03:59 gfxstrand[d]: So if `x = iadd(y, imm24)` and `.no_unsigned_wrap` is set and `imm24 >= 0` then I think we're good to optimize.
03:59 gfxstrand[d]: That should be something we can generate from `lower_explicit_io()`.
04:00 redsheep[d]: Ok so a TU104. Yeah I can't really see how that number would map cleanly at all to any specs, so I am guessing the throughputs of different parts of the shader are mixed, or they're not achieving peak theoretical. Interesting.
04:00 airlied[d]: we have nir_iadd_imm_nuw but only amd use it
04:02 gfxstrand[d]: Yeah, I'm looking at it a bit now. I really should go to bed, though.
04:03 gfxstrand[d]: I think we can definitely do something there. I just need to look at the helper functions and make a decision or two
04:03 airlied[d]: go sleep, I'm not getting anywhere fast today anyways, wanted to get above 20Tf but I'm distracting myself
04:04 gfxstrand[d]: Heh. Yeah, nerd-sniping me with address calculations at 10 PM isn't nice. 😛
04:04 gfxstrand[d]: (Actually 11 now)
04:05 redsheep[d]: Is this the nvidia shaders you're seeing 27 tflops from them on? https://github.com/jeffbolznv/vk_cooperative_matrix_perf
04:06 airlied[d]: redsheep[d]: yes, just using the first shmem16 shader as a test
04:07 redsheep[d]: Ok I am curious to see what performance ada achieves. There was a news article accompanying this where they even say outright that they only achieve 60% of peak theoretical which is fascinating. https://developer.nvidia.com/blog/machine-learning-acceleration-vulkan-cooperative-matrices/
04:08 airlied[d]: there is a talk at Vulkanized 2025 a couple of weeks back
04:09 airlied[d]: about coop mat 2
04:09 redsheep[d]: I'll have to find that and give it a watch
04:10 airlied[d]: https://www.vulkan.org/user/pages/09.events/vulkanised-2025/T47-Jeff-Bolz-NVIDIA.pdf
04:10 airlied[d]: slides 23-24
04:10 airlied[d]: and 25-26
04:10 airlied[d]: for the tl;dr
04:11 redsheep[d]: Oh wow they're getting close to cuda
04:12 redsheep[d]: Sometimes beating it is insane
04:13 redsheep[d]: aww I don't have 575 to play with the newest toys
04:14 redsheep[d]: I could install the vulkan beta driver but I haven't had the best success running my daily driver machine off of those in the past 😛
04:38 redsheep[d]: Hmm. I am getting a load of errors building that project, saying there's a bunch of undefined types. The readme is pretty bare and I'm not sure what I am missing. Guess I won't be benchmarking that on my system for now. Maybe it's an issue with my vulkan sdk but surely 1.4.309 is good enough?
04:50 airlied[d]: should be
07:07 airlied[d]: oh looks like a 5080 is starting a long journey towards me
07:34 redsheep[d]: airlied[d]: tiredchiku[d] Helped me build it, the project has build issues and needed changes.
07:34 redsheep[d]: Which of these results is the one you're looking at? I assume the big number that comes first?
07:34 redsheep[d]: ```shader: shaders/shmemfp16.spv
07:34 redsheep[d]: cooperativeMatrixProps = 16x16x16 A = float16_t B = float16_t C = float16_t D = float16_t scope = subgroup
07:34 redsheep[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 256.415958 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=1 workgroupSize=256 90.177123 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 63.880527 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=256, TILE_K=32 BColMajor=1 workgroupSize=256 111.548538 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 140.818600 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=128, TILE_K=32 BColMajor=1 workgroupSize=256 76.635973 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 110.472593 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=1 workgroupSize=256 118.420604 TFlops
07:34 redsheep[d]: shader: shaders/shmemfp16.spv
07:34 redsheep[d]: cooperativeMatrixProps = 16x8x16 A = float16_t B = float16_t C = float16_t D = float16_t scope = subgroup
07:34 redsheep[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 119.720343 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=1 workgroupSize=256 70.815619 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 92.551484 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=256, TILE_K=32 BColMajor=1 workgroupSize=256 59.536042 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 133.956095 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=128, TILE_K=32 BColMajor=1 workgroupSize=256 69.659885 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 71.668641 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=1 workgroupSize=256 70.546635 TFlops
07:34 redsheep[d]: shader: shaders/shmemfp16.spv
07:34 redsheep[d]: cooperativeMatrixProps = 16x8x8 A = float16_t B = float16_t C = float16_t D = float16_t scope = subgroup
07:34 redsheep[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 82.526092 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=1 workgroupSize=256 72.842354 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 84.964734 TFlops
07:34 redsheep[d]: TILE_M=128 TILE_N=256, TILE_K=32 BColMajor=1 workgroupSize=256 107.516978 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 78.317257 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=128, TILE_K=32 BColMajor=1 workgroupSize=256 86.499436 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=0 workgroupSize=256 71.871021 TFlops
07:34 redsheep[d]: TILE_M=256 TILE_N=256, TILE_K=32 BColMajor=1 workgroupSize=256 69.790765 TFlops```
07:35 Sid127: that was painful on the IRC side :P
07:35 tiredchiku[d]: hm
07:36 Sid127: oh the bridge just took a whole minute
07:36 redsheep[d]: Yeah, sorry, I kind of knew it would be. The thing I wanted to see was whether later generations did a better job of saturating the hardware with this test, and 256 TF is quite a lot closer to theoretical than the 27 on that TU104 would be, or the 36 NVIDIA mentioned in their article.
07:36 tiredchiku[d]: jeez
07:37 tiredchiku[d]: karolherbst[d]: 30-50 second delay for irc->discord bridging
07:37 tiredchiku[d]: discord->irc is instant
07:41 airlied[d]: I'm looking at the very first value
07:41 redsheep[d]: Huh. All of the lower numbers on this test are not consistently getting my gpu to clock up.
07:42 airlied[d]: TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 256.415958 TFlops
07:42 redsheep[d]: Stuff like this is insight I wish I had on nouveau 😛
07:42 airlied[d]: but I only really care about relative speed here of nvk vs nvidia
07:43 redsheep[d]: Yeah I will try out your branch in the next couple days if you have it on gitlab or something
07:43 redsheep[d]: Um, 225 is the accurate number for what will be comparable. That 256 was affected by overclock
07:44 redsheep[d]: Wow I got 334 on fp8
08:00 mupuf: FYI, 8x RTX 4060 are on their way to a datacenter for Nouveau use
08:01 mupuf: pre-merge testing for NVK is coming!
08:07 mupuf: and I'll try switching to linux 6.13.7 next week
08:34 asdqueerfromeu[d]: gfxstrand[d]: <https://gitlab.freedesktop.org/gfxstrand/mesa/-/commit/82ed3581af6d7be4b0b2137994f72fc78232b7f6>: Is that NIR optimization only a misdemeanor (or a felony)? 🐸
08:56 Jasper[m]: @mupuf weird flex but okay
08:59 mupuf: Jasper[m]: lol, slight misunderstanding here. This is for Mesa CI, and this work and hardware is sponsored by Valve and the plan was agreed in private with some NVK devs. This was not a flex, just a status report 😅
09:00 Jasper[m]: No I know, I'm joking hahaha
09:00 Jasper[m]: You explained it perfectly fine, just thought it was a funny message when you deliberstely remove it from its context
09:01 mupuf: ha, lol 😅 I've been removed from nouveau development for toooooo long to know better at this point
09:01 Jasper[m]: s/deliberstely/deliberately/
09:01 mupuf: yeah 😂
09:09 asdqueerfromeu[d]: I'm surprised nouveau hasn't been tested with a display hotplug stress test 😅
09:15 Jasper[m]: ngl I haven't used an nvidia card on Linux in years, but when I still had my GTX770 I couldn't even boot with more than one display attached
09:44 mupuf: asdqueerfromeu[d]: that would require a chamelium, and these are rare devices...
09:50 Mary: pre-merge CI is going to be amazing to have!
11:03 mohamexiety[d]: yooo this is so cool. thanks a lot!
12:18 tiredchiku[d]: looks like I've been beaten to the punch
12:18 tiredchiku[d]: https://discuss.haiku-os.org/t/haiku-nvidia-porting-nvidia-gpu-driver/16520
12:18 tiredchiku[d]: > I finally managed to make initial port NVRM kernel driver to Haiku and added initial NVRM API support to Mesa NVK Vulkan driver, so NVRM and NVK can work together. Some simple Vulkan tests are working.
12:20 asdqueerfromeu[d]: tiredchiku[d]: We'll see if that support works on Linux too (once the source code gets released of course)
12:21 avhe[d]: no code?
12:21 avhe[d]: also I published my own little driver a couple days ago: <https://github.com/averne/Envideo>
12:22 tiredchiku[d]: https://github.com/X547/mesa/commits/mesa-nvk/
12:22 avhe[d]: it probably doesn't have all the primitives a full vk driver would need, but what it has is working rather well
12:23 tiredchiku[d]: gonna try pulling their code into my tree
12:24 tiredchiku[d]: can always mark them as co-author πŸ˜…
12:25 tiredchiku[d]: but it's nice to see other OSs also benefiting from openrm
12:27 tiredchiku[d]: would've been nice to have gotten there myself, but alas
12:27 tiredchiku[d]: real life constantly took priority :Stitch_sad:
12:34 mohamexiety[d]: Could pull into your tree or reach out and collaborate with them
12:34 tiredchiku[d]: yeah, trying to find an email
12:34 mohamexiety[d]: Yep. Having someone to work with could be a lot of help
12:34 tiredchiku[d]: did clone their tree to poke into the code
12:35 tiredchiku[d]: but git log has their email listed as `danger_mail@list.ru`
12:35 tiredchiku[d]: :doomthink:
12:39 tiredchiku[d]: and there's no way to initiate a DM on haiku's forum
12:57 tiredchiku[d]: oh I see
12:57 tiredchiku[d]: they don't use libdrm
12:58 tiredchiku[d]: interesting
12:59 tiredchiku[d]: that's a few ways different from what I was trying to do
13:06 tiredchiku[d]: yeah I'm gonna continue with mine, maybe use theirs as a reference
13:06 tiredchiku[d]: I'd like to continue using libdrm
13:29 pavlo_kozlenko[d]: tiredchiku[d]: https://tenor.com/view/the-office-interested-reaction-steve-carell-michael-scott-gif-3915584
13:36 tiredchiku[d]: yeah, very suspicious
13:46 gfxstrand[d]: tiredchiku[d]: Yeah, if it's advertised by the kernel as a DRM driver, there's no reason to avoid libdrm. Of course, there may be some code we want to share with a theoretical WDDM2 backend and that shouldn't depend on libdrm. But it's fine for smashing on ioctls.
13:46 tiredchiku[d]: I understand haiku dropping libdrm to get nvrm working but
13:46 tiredchiku[d]: that's not something we should be doing upstream I think
14:01 gfxstrand[d]: Volta survived but failed. 😭
14:05 gfxstrand[d]: I was running on a build that didn't have my fp64 fixes. 😭
14:08 pavlo_kozlenko[d]: not a big loss
14:32 Jasper[m]: @_oftc_gfxstrand[d]:matrix.org I get that Tegra is about 900th on the list of things to fix as far as priorities go, but is there anything you want me to check specifically on the hardware itself?
14:33 Jasper[m]: To aid whenever you do work on it again basically.
14:44 snowycoder[d]: I have a rather interesting failure with hw-tests on kepler:
14:44 snowycoder[d]: The 129th invocation of `Runner::run` always fails with an OOB write.
14:44 snowycoder[d]: How can I better pinpoint the failure?
15:10 gfxstrand[d]: snowycoder[d]: Sounds like the buffer is too small
15:10 gfxstrand[d]: 128 is a nice round number
15:11 gfxstrand[d]: Oh, wait... Those are independent runs so they should just be re-using the buffer.
15:11 gfxstrand[d]: Weird...
15:12 snowycoder[d]: Yep, doing the same run with a giant buffer does nothing, but even doing sanity test 129 times crashes
15:13 gfxstrand[d]: Jasper[m]: It's not quite that far down. 😅 But no, there's not much that just a bit of testing will help. I need to figure out a couple Arm CPU corner cases and that will inform the rest of the design. Then it should be pretty close to good to land.
15:13 gfxstrand[d]: Once things are landed, some app testing might be good.
15:15 Jasper[m]: That's fair, thank you! Ping me if you need anything
15:26 tiredchiku[d]: the haiku guys getting some stuff going has filled me with a competitive spirit :LUL:
15:26 tiredchiku[d]: I may or may not have been putting some studies aside to work on my code for the past few hours
15:31 tiredchiku[d]: notthatclippy[d]: I see a NV2080_CTRL_FB_INFO_INDEX_RAM_SIZE and a NV2080_CTRL_FB_INFO_INDEX_HEAP_SIZE in the headers, with the latter having a HEAP_FREE but not the former
15:32 tiredchiku[d]: I wanna get how much VRAM is in use, would it be reasonable to subtract HEAP_FREE from TOTAL_RAM_SIZE?
16:14 gfxstrand[d]: Probably?
16:20 gfxstrand[d]: snowycoder[d]: Does the same happen on other GPUs?
16:21 snowycoder[d]: gfxstrand[d]: On my turing GPU it runs smoothly
16:21 gfxstrand[d]: okay. That's good to know at any rate
16:22 snowycoder[d]: also, in dmesg I see `fault 01 [WRITE] at 000000000005b000`.
16:22 snowycoder[d]: Maybe we need to set some kind of memory base addr?
16:22 snowycoder[d]: (addr changes from test to test)
16:25 gfxstrand[d]: I'm wondering if you're getting a bad QMD
16:27 gfxstrand[d]: That's one of the issues I had to fix on Maxwell A to get it to pass the CTS
16:27 gfxstrand[d]: I don't really want to add that to the NAK runner but we can if we have to
16:28 snowycoder[d]: What should I check to fix that? I have very low experience with QMDs
16:30 gfxstrand[d]: So, the problem is that the hardware has a QMD cache but no way to manage said cache. On Maxwell B they added a cache invalidate method for it. Prior to that, it's just sort of there and screwing things up for you in the background.
16:31 snowycoder[d]: gfxstrand[d]: Ah, wonderful 😂
16:31 gfxstrand[d]: On Turing, what this looked like was that I had a bug where after a while it would just run the wrong compute shader. That was hell to debug! Eventually, I found the invalidate and off we go.
16:32 gfxstrand[d]: On Maxwell A, I just added a thing (not yet pushed anywhere) which makes us allocate QMDs from a heap. There it seems sufficient to just re-use the same memory so QMD addresses don't suddenly go poof.
16:32 gfxstrand[d]: IDK what kepler's deal is but it sounds like the QMD cache again.
16:33 gfxstrand[d]: 128 sounds like a very convenient number of QMDs to be able to cache.
16:33 gfxstrand[d]: We probably need something similar in the NAK runner where we allocate a blob of memory up-front and all QMDs ever always come from that blob and we hope that recycling is okay.
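(Rough sketch, in C rather than the runner's Rust, of that pre-allocated QMD pool idea: every QMD comes from one up-front blob and freed slots get recycled, so addresses the QMD cache may have seen never disappear. Sizes and names are invented, not the actual hw_runner code.)
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QMD_SIZE   0x100
#define QMD_SLOTS  128

struct qmd_pool {
   uint8_t *blob;           /* one GPU-visible allocation, made up front */
   bool in_use[QMD_SLOTS];
};

static void *qmd_alloc(struct qmd_pool *p)
{
   for (unsigned i = 0; i < QMD_SLOTS; i++) {
      if (!p->in_use[i]) {
         p->in_use[i] = true;
         return p->blob + (size_t)i * QMD_SIZE;
      }
   }
   return NULL; /* caller has to wait for a slot to be recycled */
}

static void qmd_free(struct qmd_pool *p, void *qmd)
{
   size_t idx = ((uint8_t *)qmd - p->blob) / QMD_SIZE;
   p->in_use[idx] = false;  /* slot is recycled, its address never goes away */
}
```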
16:35 snowycoder[d]: Hold on, why does reusing the same address for QMDs play nicer with caches?
16:36 gfxstrand[d]: In the case of Maxwell A, it seems the hardware is okay at picking up the new QMDs we've written from the CPU but somehow evicting something from the cache causes a problem.
16:36 gfxstrand[d]: There's a hell of a lot of details in there that I don't understand
16:37 snowycoder[d]: That's cursed.
16:37 snowycoder[d]: So we should write a memory allocator for QMDs 0_o
16:42 snowycoder[d]: I'll just implement the remaining instructions for now 😅
16:45 snowycoder[d]: Using the same QMD addr and running tests with `--test-threads=1` works, it's confirmed:blobcatnotlikethis:
16:48 snowycoder[d]: for hw_tests we could set up a locked pool of 128 addresses, no idea how we could handle that in real vulkan code though
17:46 i509vcb[d]: Assuming this uses a GSP and somehow magically the money appeared it would certainly be neat to have nouveau on a DGX Spark
17:46 i509vcb[d]: https://www.nvidia.com/en-us/products/workstations/dgx-spark/
17:46 i509vcb[d]: Yes it's $4000, I doubt many will be bought by the normal user
17:47 i509vcb[d]: But the proposition there for an aarch64 desktop is certainly interesting just from the CPU side
18:06 gfxstrand[d]: I've already reserved one.
18:07 tiredchiku[d]: bah
18:08 gfxstrand[d]: (Or, rather, my boss has. It'll get to spend some time on my desk.)
18:08 tiredchiku[d]: I replaced our nvtypes.h with the one from openrm, and I can't get past this build error
18:08 tiredchiku[d]: https://pastebin.com/YE21ENcG
18:08 tiredchiku[d]: been at it for an hour now
18:10 gfxstrand[d]: That looks like the kind of thing that might come from an X.h conflict
18:10 mhenning[d]: Sounds like something made a typedef of Status, which conflicts with declaring a variable Status
18:10 tiredchiku[d]: bleh
18:11 tiredchiku[d]: it's 2340, I'll figure this out tomorrow
19:52 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1352731695492759605/0001-nak-Improve-WS-abstractions-in-hw_runner.patch?ex=67df1510&is=67ddc390&hm=1ec0e62a576dc4f9dd194d67525da92002cb6696929e98dd209bf0e6ed344a06&
19:52 gfxstrand[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1352731695807463605/0002-nak-Add-a-QMD-heap-to-hw_runner.patch?ex=67df1510&is=67ddc390&hm=7f8557942d8a003c9cda280294a69ab445779958850fc087022f9bba147bfb99&
19:52 gfxstrand[d]: snowycoder[d]
19:55 gfxstrand[d]: Untested but I typed something
20:56 asdqueerfromeu[d]: avhe[d]: I guess I didn't look at the Repositories tab then (or the push was quite recent)
21:27 snowycoder[d]: gfxstrand[d]: Thanks, this should work until we test with 128 threads (there is no limit to the QMDs we are creating so we can allocate more than 128 if there are enough workers)
21:31 snowycoder[d]: the easiest solution I see is a per-sm limit and a `std::sync::Condvar` to wake up threads when a QMD is available?
21:33 gfxstrand[d]: I'm not sure we're actually limited to 128
21:33 gfxstrand[d]: I think 128 is just where the cache starts evicting things and doing funky stuff
21:39 snowycoder[d]: Ok so you're saying that if we squish all the QMDs together we could fit the cache better, right?
21:39 gfxstrand[d]: Not sure
21:39 snowycoder[d]: We could still crash at some point at runtime though :/
21:40 gfxstrand[d]: I haven't spent enough time breaking the cache to understand just how it all works
21:40 snowycoder[d]: Ok, thanks for the patches!
21:41 gfxstrand[d]: Like, I know there's a cache and I'm 90% confident invalidating it like we are on Maxwell is enough to get things correct.
21:42 gfxstrand[d]: But on Maxwell A and earlier, there's a lot I don't know.
21:43 gfxstrand[d]: The NAK runner might actually provide a well-controlled enough environment to sort it out if we really want to poke at things.
21:48 mohamexiety[d]: hmm this is funny. relaxing the GART page size thing gives us a record amount of mmu faults per second but the desktop experience actually is somewhat stable :thonk:
21:51 mohamexiety[d]: weird thing is even things that _shouldn't_ care like `vulkaninfo` mmu fault
21:51 mohamexiety[d]: lets try text mode only I guess
21:52 gfxstrand[d]: creating a device does execute some stuff on init
21:58 snowycoder[d]: sm32 is passing all hw tests! 🎉
21:58 snowycoder[d]: Now I just need to implement the other 60% of instructions and test them on the field
21:58 mohamexiety[d]: hm are there cases where nouveau.debug=mmu=debug doesn’t output anything? 😦
21:59 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1352763662456066108/IMG_0239.jpg?ex=67df32d6&is=67dde156&hm=7cec5530b87132e26fd94597e73f4244d6646d65e56b14f1c73c4101f5733485&
22:01 gfxstrand[d]: snowycoder[d]: Good progress! 💜
22:03 gfxstrand[d]: snowycoder[d]: A lot of the others are hard to unit test but having ALU you can actually trust will go a long way towards making the rest of it implementable.
22:03 gfxstrand[d]: You start going a little insane when you can't trust add and shifts
22:03 snowycoder[d]: gfxstrand[d]: Yep, thanks!
22:03 snowycoder[d]: I thought about adding Foldable for floats but ugh, rounding modes seem hard
22:04 gfxstrand[d]: Yeah...
22:04 gfxstrand[d]: And you have to deal with NaN which may be slightly different on the CPU. It's a bit of a mess.
22:05 mohamexiety[d]: mohamexiety[d]: skeggsb9778[d] if it doesn’t output anything, does this give a hint where things failed at? p much anything vulkan gives 4+ mmu faults
22:11 mhenning[d]: snowycoder[d]: Yeah, nir has some of this solved for constant folding already, but it's all a little annoying.
22:12 mhenning[d]: at this point, you could probably start running some small subsets of CTS and getting them working.
22:15 gfxstrand[d]: Yup
22:15 airlied[d]: mohamexiety[d]: you need r570 to get addresses for mmu faults
22:16 mohamexiety[d]: yeah but it worked before :thonk:
22:16 gfxstrand[d]: Not with GSP
22:16 mohamexiety[d]: mohamexiety[d]: like take this
22:17 gfxstrand[d]: mhenning[d]: I recommend starting out with `dEQP-VK.ssbo.*`
22:19 tiredchiku[d]: ~~super smash brothers oltimate~~
22:21 airlied[d]: mohamexiety[d]: that is just the cpu side checker catching an illegal state before the gpu faults
22:21 airlied[d]: whatever is causing it to fault now isn't breaking any of the userside checks
22:22 mohamexiety[d]: ahh so I guess I need to build on top of ben's tree
22:24 mohamexiety[d]: before proceeding though, just want to verify something:
22:24 mohamexiety[d]: ```c
22:24 mohamexiety[d]: if ((domain & NOUVEAU_GEM_DOMAIN_GART) &&
22:24 mohamexiety[d]: (!vmm->page[i].host || vmm->page[i].shift > PAGE_SHIFT))
22:24 mohamexiety[d]: continue;
22:24 mohamexiety[d]: ```
22:24 mohamexiety[d]: we need to completely remove this, right? since the idea is that GART doesn't restrict us to sysmem mappings and also doesn't enforce a 4KiB cap anymore.
22:31 gfxstrand[d]: gfxstrand[d]: Honestly, start with `dEQP-VK.ssbo.*` then `dEQP-Vk.api.smoke.triangle` then run the whole CTS and see what blows up.
22:34 gfxstrand[d]: Cross your fingers but I think Volta is about to pass the CTS. 😄
22:34 gfxstrand[d]: (It already passed 64-bit. Working on 32-bit)
22:35 mohamexiety[d]: <a:vibrate:1066802555981672650>
23:09 tiredchiku[d]: <a:excited:1022260893846872076>
23:11 gfxstrand[d]: gfxstrand[d]: Passed! Running the verification script now. Then I'll upload it.
23:12 gfxstrand[d]: With that, everything Maxwell+ will be submitted.
23:12 gfxstrand[d]: Actual conformance in 30 days
23:13 snowycoder[d]: gfxstrand[d]: Will do, just need to encode all the ops first since I'm missing everything float or tex-related
23:14 gfxstrand[d]: I'd leave tex ops for a bit. Focus on float first and see how much you can get passing.
23:14 gfxstrand[d]: Encoding the tex ops will be easy. But we're going to have to write new NIR lowering code in nak_nir_lower_tex.c
23:15 gfxstrand[d]: I'm pretty sure everything changed on Maxwell
23:15 gfxstrand[d]: So get other stuff working first so that when you start attacking tex ops you can actually run all the tex tests.
23:16 gfxstrand[d]: At bare minimum you need basic VS/FS working
23:16 gfxstrand[d]: None of that is to scare you off of it. Just that tex ops are one of the few places where you can't yet trust that what nak::from_nir is giving you will more or less work already.
23:17 gfxstrand[d]: In other news, I think 570 is much more stable for mass CTSing. I'm on my 10th run since switching to the 570 kernel and it hasn't died yet.
23:17 gfxstrand[d]: mupuf[d]: eric_engestrom ^^
23:17 gfxstrand[d]: skeggsb9778[d]: ^^
23:22 snowycoder[d]: gfxstrand[d]: I think I'll ask for a bit of guidance when I get there then
23:24 gfxstrand[d]: snowycoder[d]: It's not too hard. The NIR lowering is really straightforward. It's just that all the fields moved around between Kepler and Maxwell.
23:24 eric_engestrom[d]: gfxstrand[d]: as in gtx 570? /me is not quite sure what we're talking about here
23:25 gfxstrand[d]: eric_engestrom[d]: GSP firmware version 570. skeggsb9778[d] is working on updating the nouveau kernel to use that instead of 565. With 565, I get random GSP crashes every few CTS runs. With 570, things seem way more stable. Also maybe faster? (I'm not sure why things sped up, TBH.)
23:25 gfxstrand[d]: But it would require the CI containers to boot with a custom kernel and linux-firmware.
23:26 eric_engestrom[d]: ah ok, I see
23:26 eric_engestrom[d]: custom kernel & firmware is the current situation already
23:27 gfxstrand[d]: https://gitlab.freedesktop.org/bskeggs/nouveau/-/tree/03.00-r570?ref_type=heads
23:27 gfxstrand[d]: https://gitlab.freedesktop.org/bskeggs/linux-firmware/
23:27 mohamexiety[d]: gfxstrand[d]: wait there's a newer one: https://gitlab.freedesktop.org/bskeggs/nouveau/-/tree/03.01-gb20x?ref_type=heads
23:28 mohamexiety[d]: I'd guess this one has more development
23:29 eric_engestrom[d]: mupuf[d]: has been playing with the amd & nvidia kernels lately (in hte context of upreving them, not switching to another branch, but that's not so different), I'll let him look at these links
23:31 gfxstrand[d]: Sounds good
23:31 gfxstrand[d]: I'm just concerned that if we enable CI on 565, it'll add instability to pre-merge.
23:32 gfxstrand[d]: But 570 is looking really solid
23:32 mhenning[d]: gfxstrand[d]: oh, faster is interesting. A month or two ago I benchmarked 565 vs 570 in a few games and didn't find a difference
23:33 gfxstrand[d]: mhenning[d]: If it's because of the kernel, I suspect something is faster with context creation and/or GSP locking. Wouldn't likely affect games but it would make a huge difference with the CTS.
23:33 gfxstrand[d]: I just did a full CTS run in 25 min. I think that's a record.
23:33 redsheep[d]: Will you still be doing CTS before Marge as well?
23:34 gfxstrand[d]: That's the plan. The Valve lab has 4 4060s incoming so we can enable pre-merge.
23:34 mhenning[d]: gfxstrand[d]: ah, yeah I wasn't benchmarking context creation
23:34 redsheep[d]: I meant like doing it on your end, before having their lab try it
23:35 redsheep[d]: Either way, sounds great
23:35 gfxstrand[d]: Oh, yeah, I'll still CTS myself
23:35 gfxstrand[d]: But when other people make MRs, I'll probably just assign and trust the CI
23:35 gfxstrand[d]: Right now I run most MRs myself
23:39 eric_engestrom[d]: gfxstrand[d]: actually 8, in 4 hosts
23:39 eric_engestrom[d]: but yeah, looking forward to making that post-merge job into a merge job 🙂
23:42 gfxstrand[d]: Are you planning to run 2 vms per host with pass-through or shard across GPUs?
23:44 airlied[d]: mohamexiety[d]: yes I think you can drop that
23:45 mohamexiety[d]: alright then, thanks
23:46 mohamexiety[d]: (everything exploded after removing that which is why I became hesitant and asked :KEKW:)
23:50 gfxstrand[d]: gfxstrand[d]: Submitted! :transtada128x128:
23:51 gfxstrand[d]: Now I can put my Volta away for a while
23:52 mohamexiety[d]: awesome work!