08:23chikuwad[d]: hm, where can I find the requirements for vulkan 1.0 conformance?
08:50marysaka[d]: chikuwad[d]: you mean features/properties? that's in the spec itself there are multiple tables (same goes for image formats)
08:51chikuwad[d]: yeah I did find https://docs.vulkan.org/spec/latest/appendices/versions.html#versions-1.0
09:08chikuwad[d]: I also found the vk1.0 spec pdf :ha:
13:36karolherbst[d]: mhh on my Turing it should give me 3 times the workgroups, but the coop-matrix stuff only gets 25% faster.. probably running into other bottlenecks there
13:37karolherbst[d]: but maybe also not surprising because nvidia ain't that much faster either
13:38karolherbst[d]: me: `TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 52.674748 TFlops` nvidia: `TILE_M=128 TILE_N=128, TILE_K=32 BColMajor=0 workgroupSize=256 64.936902 TFlops`
14:39gfxstrand[d]: chikuwad[d]: Generally, if there's a feature bit then it's not required. The exception here is `robustBufferAccess`, which has a feature bit so that apps can choose not to turn it on, not because it's optional for drivers. For limits, look at the limits table.
14:42chikuwad[d]: :meowsalute:
14:42chikuwad[d]: thanks
15:23gfxstrand[d]: Writing a Fermi driver?
15:24esdrastarsis[d]: fervk confirmed
15:24huntercz122[d]: *my gt 430 peeking from the shelf*
15:25chikuwad[d]: gfxstrand[d]: no 😅
15:26chikuwad[d]: just curious, and for an unrelated project
15:27gfxstrand[d]: "Unrelated project"... Hrm... 🤔
15:27gfxstrand[d]: It's always the unrelated projects that are the most interesting.
15:29chikuwad[d]: :3
16:01chikuwad[d]: okay, I think I understand _what_ I have to do with atomic float16-vector
16:02chikuwad[d]: I have to wire up standard nir_lower* lowering passes for f16vec{2,4}
16:02chikuwad[d]: get rid of whatever custom intrinsics exist in NAK for it, if any
16:03chikuwad[d]: and emit the standard NIR intrinsic
16:03chikuwad[d]: (please correct me if I'm wrong)
17:04gfxstrand[d]: Yup
17:04gfxstrand[d]: But you only need the lowering for shared memory. There are real atomics for global.
17:08chikuwad[d]: :wahoo:
17:08chikuwad[d]: progress :ha:
17:11karolherbst[d]: blackwell gets really fun for those
17:11karolherbst[d]: supports fp32x2 global atomics apparently
17:11karolherbst[d]: and x4
17:11karolherbst[d]: and fp16x4 and fp16x8 and 🙃
17:12snowycoder[d]: How many bytes do we *actually* need to store the index for blocks, instructions and src/dst? (bonus points: delay?)
17:12karolherbst[d]: practically? I've seen shaders with millions of instructions
17:13karolherbst[d]: do you want to fail to compile those? also yes
17:13snowycoder[d]: Wait, why?
17:13karolherbst[d]: RA is kinda a pain on those
17:14karolherbst[d]: might take a while to complete
17:14karolherbst[d]: though I think a u32 limit for defs is reasonable.. going above u16 does happen from time to time
17:14karolherbst[d]: if you need u16 for blocks? mhh
17:14karolherbst[d]: doubtful
17:15karolherbst[d]: you can get away with a u8 for per instruction delays
17:15karolherbst[d]: like the highest we've seen is like 27? or 34 or so?
17:15snowycoder[d]: karolherbst[d]: Even storing u32 instead of usize for both instr_idx and block_idx would save us a lot of pointers
17:16karolherbst[d]: yeah
17:16karolherbst[d]: like millions of instructions is also only something I've seen with CL
17:16karolherbst[d]: with a raytracer
17:16karolherbst[d]: where we inline everything
17:17karolherbst[d]: the inlining just exploded that one and some drivers take like 16 hours to compile that
17:17karolherbst[d]: and like I dunno 40 GiB of RAM or so
17:17karolherbst[d]: point is... you don't want to bother with shaders _that_ huge
17:18snowycoder[d]: karolherbst[d]: Oh wow
17:20karolherbst[d]: coop matrix stuff
17:20karolherbst[d]: it doesn't even encode and you need to insert nops
17:20karolherbst[d]: but luckily you do enough coop matrix stuff in one go that you hide it completely almost always
17:21snowycoder[d]: Yeah I was worried about that, theoretically you could also use 255 and insert a lot of nops (I used that to isolate some instructions)
17:21karolherbst[d]: right...
17:22karolherbst[d]: but like no actual instruction needs it
17:23gfxstrand[d]: Yeah, for absolute cycle numbers, u32. For deltas between instructions, u8 with saturation is fine.
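A minimal sketch of compact types along the lines being discussed (the names are hypothetical, not NAK's actual ones): u32 indices for instructions and blocks, and a saturating u8 for per-instruction delays.
```rust
// Hypothetical compact index/delay types, sized per the discussion above:
// u32 covers even multi-million-instruction shaders, and per-instruction
// delays (observed in the ~27-34 range) fit comfortably in a u8.
#[derive(Copy, Clone, PartialEq, Eq)]
struct InstrIdx(u32);

#[derive(Copy, Clone, PartialEq, Eq)]
struct BlockIdx(u32);

/// Cycle delta to the next instruction, saturated at u8::MAX.
#[derive(Copy, Clone)]
struct Delay(u8);

impl Delay {
    fn from_cycles(cycles: u32) -> Delay {
        // Saturate instead of panicking on an out-of-range delta.
        Delay(cycles.min(u8::MAX as u32) as u8)
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<InstrIdx>(), 4);
    assert_eq!(std::mem::size_of::<BlockIdx>(), 4);
    assert_eq!(Delay::from_cycles(300).0, 255);
}
```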
17:28chikuwad[d]: am I dumb
17:28chikuwad[d]: was the answer just
17:28chikuwad[d]: `OPT(nir, nir_lower_fp16_casts, nir_lower_fp16_split_fp64);` in `nak_nir.c:optimize_nir()` all along
17:28chikuwad[d]: only one way to find out
17:30snowycoder[d]: A more practical question: I need to track a list of both reads and writes in RegUse, right now I'm using two Vecs, it might not seem much but doing it for each register means almost 17KB of data per block.
17:30snowycoder[d]: I could store both of them in a single Vec + index with a bit more complexity and that would halve the memory use, is it worth it?
17:31karolherbst[d]: are they always equally long?
17:31karolherbst[d]: ehh wait
17:31karolherbst[d]: they aren't
17:31karolherbst[d]: What you could do is reuse the Vecs from previous blocks?
17:31karolherbst[d]: or do you need to save them all?
17:32snowycoder[d]: You mean something lazy like a `Cow`?
17:32snowycoder[d]: They could be mutated (e.g., this block adds an Add)
17:32snowycoder[d]: *Read
17:33snowycoder[d]: We could yes, but most of the vecs are empty anyways, it just takes a lot of memory to store two Vecs (ptr+capacity+length) for each register
17:35mhenning[d]: if you're not modifying their lengths over time one trick is to store a boxed slice, which brings it down to ptr + length
17:36snowycoder[d]: That is really slow on reads, we only need to add one element
17:37mhenning[d]: I don't know what that means
17:38snowycoder[d]: Mmmmh, let's go the simple route and bench before worrying
17:40mhenning[d]: Box<[T]> acts very similar to a vec that you cannot resize
17:41mhenning[d]: but yes, doing the simple thing and then seeing where it lands perf-wise is reasonable
17:41karolherbst[d]: snowycoder[d]: no. You just keep the `Vec` object around but empty it so you keep the allocation
17:42snowycoder[d]: mhenning[d]: Yes but for each src of an instruction we need to add an element to that "Vec", it might be counter-productive to reallocate for each register read usage
17:43mhenning[d]: oh that's what you mean by "read". Yeah, like I said, it's not a good idea if you're modifying the length
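For reference, a rough sketch of the two layouts being weighed here (field names made up, not the actual RegUse): two Vecs cost 2 × 24 bytes of header per register on a 64-bit target before any elements, while one Vec plus a split index costs 24 + 4 (padded to 32), at the price of shifting the reads whenever a write is inserted.
```rust
// Two separate growable lists: 2 * (ptr + len + cap) = 48 bytes of header
// per register on 64-bit, before storing a single element.
#[allow(dead_code)]
struct RegUseTwoVecs {
    reads: Vec<u32>,
    writes: Vec<u32>,
}

// One growable list plus a split point: writes live in uses[..split],
// reads in uses[split..]. Header cost drops to 24 + 4 bytes (32 padded),
// but inserting a write has to shift the reads over by one.
struct RegUseSplit {
    uses: Vec<u32>,
    split: u32,
}

impl RegUseSplit {
    fn add_write(&mut self, idx: u32) {
        self.uses.insert(self.split as usize, idx);
        self.split += 1;
    }

    fn add_read(&mut self, idx: u32) {
        self.uses.push(idx);
    }

    fn writes(&self) -> &[u32] {
        &self.uses[..self.split as usize]
    }

    fn reads(&self) -> &[u32] {
        &self.uses[self.split as usize..]
    }
}

fn main() {
    let mut r = RegUseSplit { uses: Vec::new(), split: 0 };
    r.add_write(0);
    r.add_read(3);
    r.add_read(7);
    assert_eq!(r.writes(), &[0]);
    assert_eq!(r.reads(), &[3, 7]);
}
```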
17:43chikuwad[d]: I was wrong, but I learned a thing!
17:43karolherbst[d]: me every day
17:44chikuwad[d]: nak_nir_lower_f16vec4_atomic_intrin() lowers f16vec4 to 2x f16vec2, and I think all I have to do is use shared lowering for f16vec2 here
17:46gfxstrand[d]: Yes
17:47gfxstrand[d]: NVIDIA doesn't have f16 atomics. It has f16v2 modes for the 32-bit atomics. It's still a 32-bit atomic under the hood, just with different math.
17:47chikuwad[d]: interesting
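To illustrate the "32-bit atomic with different math" point, a rough sketch (not driver code; it assumes the external `half` crate just to stand in for hardware f16 math) of an f16vec2 add done as a plain 32-bit compare-exchange loop, which is roughly the shape a shared-memory lowering ends up with:
```rust
use std::sync::atomic::{AtomicU32, Ordering};

use half::f16; // assumption: external `half` crate stands in for hw f16 math

// f16vec2 atomic add expressed as a 32-bit CAS loop: the two halves are
// packed low/high in one word, recomputed, and swapped in together.
fn atomic_f16x2_add(word: &AtomicU32, lo: f16, hi: f16) -> u32 {
    let mut old = word.load(Ordering::Relaxed);
    loop {
        let new_lo = (f16::from_bits(old as u16) + lo).to_bits() as u32;
        let new_hi = (f16::from_bits((old >> 16) as u16) + hi).to_bits() as u32;
        let new = new_lo | (new_hi << 16);
        match word.compare_exchange_weak(old, new, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => return prev,
            Err(cur) => old = cur,
        }
    }
}

fn main() {
    let w = AtomicU32::new(0); // both halves start at +0.0
    atomic_f16x2_add(&w, f16::from_f32(1.5), f16::from_f32(-2.0));
    let v = w.load(Ordering::Relaxed);
    assert_eq!(f16::from_bits(v as u16).to_f32(), 1.5);
    assert_eq!(f16::from_bits((v >> 16) as u16).to_f32(), -2.0);
}
```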
17:50karolherbst[d]: I'm mildly curious why the shared mem splits are so weird on Ampere+
17:51karolherbst[d]: like I kinda understand why steps get bigger, but why the 64 -> 100 step? no clue
17:51karolherbst[d]: maybe some hw engineer can explain why it makes sense
18:21chikuwad[d]: hm
18:23chikuwad[d]: I've gotten this far: https://textbin.net/quhl1qlfpt
18:23chikuwad[d]: which isn't very far tbh
18:24chikuwad[d]: and I'm hitting this assert
18:24chikuwad[d]: panicked at ../mesa/src/nouveau/compiler/nak/from_nir.rs:430:13:
18:24chikuwad[d]: assertion failed: vec.len() == bits.div_ceil(32)
18:25chikuwad[d]: I guess that makes sense if we're doing f16vec4 -> 2x f16vec2, there's probably two 32-bit SSAValues
18:30chikuwad[d]: augh, this is for tomorrow
18:31karolherbst[d]: airlied[d]: mind testing https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37135 on Blackwell? (or somebody else?)
18:43mhenning[d]: chikuwad[d]: Yeah, nak's SSAValues are always 32-bit for gprs
18:48chikuwad[d]: I printed out the vec.len() and the bits right before the assert and this was printed right before the assert popped:
18:48chikuwad[d]: vec.len() = 1
18:48chikuwad[d]: bits = 64
18:49chikuwad[d]: good ol printf debugging
18:49chikuwad[d]: tomorrow I'll try to look at the test's spirv to see why that's happening and where I'm going wrong
18:50ermine1716[d]: So bits.div_ceil(32) = 2?
18:51mhenning[d]: chikuwad[d]: you probably just need to pass a larger vec into set_ssa - look at the call site from the stack trace
18:52chikuwad[d]: hang on I saved the backtrace
18:52chikuwad[d]: I've already shut down my pc for the night however 😅
18:59mhenning[d]: yeah, it's probably an issue in parse_intrinsic which means the relevant line is mesa/src/nouveau/compiler/nak/from_nir.rs:2962:22
18:59mhenning[d]: parse_intrinsic because that's what handles the individual instruction in this case
19:00mhenning[d]: (for when you take a look at it next)
19:01chikuwad[d]: thanks mel :saigeheart:
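For context on that assert, a toy illustration (not the actual from_nir.rs code): NAK's GPR SSA values are 32 bits each, so a value that is `bits` wide has to be handed over as `bits.div_ceil(32)` of them, which is why a vec of length 1 trips it for a 64-bit value.
```rust
// Toy version of the invariant behind the failing assert.
fn ssa_count_for_bits(bits: u32) -> usize {
    bits.div_ceil(32) as usize
}

fn main() {
    assert_eq!(ssa_count_for_bits(32), 1); // e.g. a packed f16vec2
    assert_eq!(ssa_count_for_bits(64), 2); // the failing case: vec.len() was 1
}
```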
19:09airlied[d]: karolherbst[d]: blows up with a divide by zero in the qmd code
19:10airlied[d]: ERROR - Test dEQP-VK.pipeline.monolithic.bind_point.graphics_compute.template_push_template_push.setup_cs_gp_gs_cp.cmd_draw_dispatch: Crash: See "smem/c12.r1.log"
19:10airlied[d]: ERROR - dEQP error:
19:10airlied[d]: ERROR - dEQP error: thread '<unnamed>' panicked at ../src/nouveau/compiler/nak/qmd.rs:372:38:
19:10airlied[d]: ERROR - dEQP error: attempt to divide by zero
19:10airlied[d]: ERROR - dEQP error: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
19:10airlied[d]: ERROR - dEQP error:
19:10airlied[d]: ERROR - dEQP error: thread '<unnamed>' panicked at library/core/src/panicking.rs:218:5:
19:10karolherbst[d]: oh shooo...
19:12karolherbst[d]: replace it with `size.max(1)`
19:12karolherbst[d]: will push in a moment
19:13karolherbst[d]: done
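A sketch of the kind of guard described above (a hypothetical function, not the actual qmd.rs code): clamp the divisor to at least 1 so a zero shared-memory size can't trigger the divide-by-zero panic.
```rust
// Hypothetical helper showing the size.max(1) guard in isolation.
fn div_ceil_guarded(total: u32, size: u32) -> u32 {
    total.div_ceil(size.max(1))
}

fn main() {
    assert_eq!(div_ceil_guarded(4096, 0), 4096); // no panic when size == 0
    assert_eq!(div_ceil_guarded(4096, 1024), 4);
}
```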
19:22airlied[d]: weird, I wrote a message somewhere wrong, anyways CTS is running, it takes 10 hours on this laptop
19:25gfxstrand[d]: I'll kick it off. Takes about 30 min on my Blackwell and 25 for Ampere.
19:29gfxstrand[d]: Running now
19:40gfxstrand[d]: I should update my scripts to also build on the test machine and then I can just throw a branch at it. 🤔
19:42gfxstrand[d]: I had something like that for piglit back in the day where I could just give it N branch names or SHAs and it would build and run them all and give me a summary. It even cached results so I could do a bunch and it would only re-run the new ones.
19:47mhenning[d]: caches? fancy
20:04karolherbst[d]: maybe next week I'll manage to clean up that gpr+ugpr mess 😄
20:07gfxstrand[d]: karolherbst[d]: `dEQP-VK.compute.*.zero_initialize_workgroup_memory.max_workgroup_memory.*`
20:08karolherbst[d]: oof
20:08karolherbst[d]: only regression or still running?
20:08gfxstrand[d]: I think that's the only regression
20:08karolherbst[d]: but how 🙃 that's gonna be interesting to figure out
20:08gfxstrand[d]: Or it could be really easy
20:09karolherbst[d]: the scary part is it sounds plausible
20:09gfxstrand[d]: Ampere is still running. It looks like it has more crashes
20:10gfxstrand[d]: Oh, and `maxComputeSharedMemorySize` is 0 so that might be breaking things
20:10gfxstrand[d]: One of the limits tests caught that one
20:11karolherbst[d]: ohhh...
20:12karolherbst[d]: ohhh damn
20:12karolherbst[d]: I noticed that you pasted wrong code, but then I forgot a minute later and just copied your suggestion anyway 🙃
20:12gfxstrand[d]: oops
20:13karolherbst[d]: needs to be `info->sm_smem_sizes_kB[info->sm_smem_size_count - 1] * 1024`
20:13karolherbst[d]: I love how my testing didn't catch that one either
20:13gfxstrand[d]: Anyway, I just reverted that commit and I'll run again
20:15karolherbst[d]: pushed a fix
20:15karolherbst[d]: `maxComputeSharedMemorySize = 49152` yep..
21:08gfxstrand[d]: Okay, with the right shared memory size, Blackwell and Ampere are both happy
21:27karolherbst[d]: nice
22:00gfxstrand[d]: I kinda feel like we should at least touch test a few others but it's early in the release cycle and I don't feel like playing GPU roulette this week.
22:02karolherbst[d]: Turing might be good tho
22:02karolherbst[d]: on earlier gens it should be a noop
22:02karolherbst[d]: or rather.. no change at all
22:03karolherbst[d]: _though_ I suspect we could do the same on e.g. kepler
22:03karolherbst[d]: maxwell + pascal don't have this issue at all
22:33mhenning[d]: mangodev[d]: hey, want to try something? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37153 could possibly fix some issues with flickering
22:38airlied[d]: watching YT in firefox on f42 on turing I get a flashing black rectangle in like 1/64th of the image every 5-10 frames
22:41mhenning[d]: airlied[d]: That's with the patch?
22:42airlied[d]: no just in normal f42, I'm not sure I want to replace the nvk on this box right now 🙂
22:43mhenning[d]: fair enough. flickering in a small portion of the image might be a different issue, although it depends on exactly how the app renders.