13:21 zmike[d]: snowycoder[d]: Yes GL supports
13:24 snowycoder[d]: zmike[d]: How? I don't know anything to control image layout in opengl
13:25 zmike[d]: Oh you meant layout
13:25 zmike[d]: Technically EXT external objects supports layout but nobody actually does anything with it
13:26 zmike[d]: So you're good
13:27 snowycoder[d]: Yep, linear filtering is handled by tex instruction, I only need to re-implement raw image load/stores
13:51 gfxstrand[d]: mhenning[d]: Ugh... I'm really not liking how `as_u32()` is playing out. I'm going to play around.
15:07 gfxstrand[d]: Hooray for having my 30 min CTS runs back. 😄
15:11 karolherbst[d]: vulkan spec question... if we don't support coop matrix on a GPU, does vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR also have to return an empty list?
15:11 karolherbst[d]: or is that all within UB?
15:18 orowith2os[d]: karolherbst[d]: The NV one says it can be NULL, in which case you'd probably send 0 as the len
15:19 orowith2os[d]: Either way, 0
15:20 gfxstrand[d]: karolherbst[d]: Yeah, it probably should. As long as it's all extensions, we can do whatever because you need the extension to get at the query function. But the moment someone pulls it into some version of core, even if optional, we probably have to support the query.
15:21 karolherbst[d]: yeah fair
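A minimal sketch of the "return an empty list" behaviour being discussed, using hypothetical Rust names rather than NVK's actual entry point (the Vulkan prototype is `vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, pPropertyCount, pProperties)`):
```rust
use std::ffi::c_void;

/// Hypothetical driver-side sketch: when cooperative matrix isn't supported on
/// this GPU, report zero properties and succeed, so callers see an empty list.
unsafe fn get_physical_device_cooperative_matrix_properties_khr(
    supports_coop_matrix: bool,
    p_property_count: *mut u32,
    _p_properties: *mut c_void, // stand-in for VkCooperativeMatrixPropertiesKHR*
) -> i32 {
    if !supports_coop_matrix {
        *p_property_count = 0;
        return 0; // VK_SUCCESS
    }
    // ...otherwise fill in the supported configurations, following the usual
    // two-call count-query / count-clamp idiom...
    0
}
```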
16:48 gfxstrand[d]: Woof. Large SSARef MR reviewed.
16:49 karolherbst[d]: ~~could do the coop matrix one next~~
17:28 gfxstrand[d]: Yeah. That one's coming soon
17:28 gfxstrand[d]: And I should be getting my Blackwell on Wednesday so I'll start landing that stuff
17:29 gfxstrand[d]: mhenning[d]: If you can look at !34818, we can rebase the large SSARef one on top of it and drop the one patch I didn't like.
17:29 gfxstrand[d]: Then I'll try to land both.
17:30 gfxstrand[d]: Do you want me to CTS it for you? I've got my big machine back.
17:31 karolherbst[d]: mhh.. I should make sure that the code doesn't break on newer gens....
17:31 mhenning[d]: gfxstrand[d]: yep, reviewing it right now
17:32 mhenning[d]: It'll be faster if you cts it, but I don't have a strong preference either way
17:32 gfxstrand[d]: Okay, I'll do that.
17:33 karolherbst[d]: gfxstrand[d]: what are you doing that you get 30 minutes tho...
17:34 gfxstrand[d]: 36 threads, 2 GPUs
17:34 karolherbst[d]: oof
17:35 gfxstrand[d]: 2 GPUs lets me reduce the GSP lock contention
17:36 karolherbst[d]: yeah...
17:36 karolherbst[d]: that one is really bad
17:36 x512[m]: Maybe threads will be implemented in GSP in the future?
17:37 karolherbst[d]: in theory it's really simple to get nvidia to fix bugs
17:37 karolherbst[d]: be a customer with 1000 trillion dollars who wants a bug fixed, done
17:41 karolherbst[d]: okay.. let's check if ampere just passes 😄
17:42 gfxstrand[d]: karolherbst[d]: Collabora doesn't pay me that much
17:42 karolherbst[d]: skill issue
17:42 notthatclippy[d]: The way I’d _want_ to fix those issues is to just not hit GSP at all when running a vulkan app. We’ll see how feasible that is
17:43 karolherbst[d]: mhhhh
17:43 karolherbst[d]: is that even feasible?
17:43 karolherbst[d]: though...
17:43 karolherbst[d]: mhhhh
17:43 karolherbst[d]: I have an idea...
17:43 gfxstrand[d]: Uh oh...
17:43 karolherbst[d]: could the kernel just have ~10 unused contexts ready and refill a bucket in the background?
17:43 karolherbst[d]: like a process asking for a context just gets one handed out
17:43 karolherbst[d]: and then the kernel refills in the background
17:43 notthatclippy[d]: Something roughly like that, yes. _Very_ roughly.
17:44 karolherbst[d]: obviously can't reuse them so you need to nuke them on deallocation
17:44 karolherbst[d]: though that won't necessarily help with the CTS 😄
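A rough sketch of that bucket idea, with purely hypothetical types and names (nothing like this exists in the kernel today): keep a small pool of pre-created contexts, hand one out immediately on request, and top the bucket back up in the background.
```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a GSP-backed context; creating one is where the
/// slow RPC round-trips to GSP would happen.
struct GspContext;

fn create_context_via_gsp() -> GspContext {
    GspContext
}

struct ContextPool {
    ready: Mutex<VecDeque<GspContext>>,
    target: usize,
}

/// Hand out a pre-created context if one is ready, otherwise fall back to the
/// slow path; either way, kick off a background refill of the bucket.
fn acquire(pool: &Arc<ContextPool>) -> GspContext {
    let ctx = pool.ready.lock().unwrap().pop_front();
    refill_in_background(pool);
    ctx.unwrap_or_else(create_context_via_gsp)
}

fn refill_in_background(pool: &Arc<ContextPool>) {
    let pool = Arc::clone(pool);
    thread::spawn(move || {
        while pool.ready.lock().unwrap().len() < pool.target {
            let ctx = create_context_via_gsp();
            pool.ready.lock().unwrap().push_back(ctx);
        }
    });
}

fn main() {
    let pool = Arc::new(ContextPool {
        ready: Mutex::new(VecDeque::new()),
        target: 10,
    });
    refill_in_background(&pool);
    // Early callers may still hit the slow path; once the bucket fills up,
    // acquire() hands out pre-created contexts immediately.
    let _ctx = acquire(&pool);
    // Contexts can't be reused, so on release they'd be destroyed rather than
    // returned to the pool; the kernel just refills the bucket afterwards.
}
```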
17:44 notthatclippy[d]: Ben’s recent patches for blackwell support should reduce contention, btw
17:45 karolherbst[d]: ahh, cool
17:45 notthatclippy[d]: It removes a few RPCs per app
17:45 karolherbst[d]: gfxstrand[d]: how strongly do you feel about having warnings in the CTS result? 😄
17:46 gfxstrand[d]: uh...
17:46 gfxstrand[d]: The big thing we need to do is sort out a new GPU topology API so I can stop creating contexts just to enumerate physical devices. We can nuke like half the context creations right there.
17:47 karolherbst[d]: it's something super silly and only happens with 8x8x4 matrices, because each value is replicated in 4 threads, so it complains that the coop matrix size doesn't really align with the subgroups
17:48 karolherbst[d]: oh no.. I have fails on ampere `Failed: 672/32116 (2.1%)` ...
17:48 karolherbst[d]: well.. will look into it later
18:29 karolherbst[d]: ohh those fails are all instruction latency stuff
19:19 gfxstrand[d]: karolherbst[d]: That's... Odd
19:19 karolherbst[d]: yeah well.. that's how they work
19:19 karolherbst[d]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-884-f16
19:19 karolherbst[d]: or do you mean the CTS complaining? 😄
19:20 airlied[d]: It might be part of why NVIDIA didn't expose it
19:20 gfxstrand[d]: Does it complain on the NVIDIA driver?
19:20 karolherbst[d]: nvidia doesn't expose it
19:21 gfxstrand[d]: Ah
19:21 karolherbst[d]: it's also slower and everything
19:21 gfxstrand[d]: Then maybe we should also not expose it?
19:21 karolherbst[d]: I don't have strong feelings about keeping or removing it, I just REed it for fun mostly
19:21 gfxstrand[d]: Fair
19:21 gfxstrand[d]: I'd say don't expose anything Nvidia doesn't
19:21 karolherbst[d]: heh
19:22 karolherbst[d]: but I got to use the entire hardware 🙃
19:23 gfxstrand[d]: That's what Jeff said when he wrote the line stipple spec. 😝
19:23 karolherbst[d]: 😄
19:24 karolherbst[d]: could also just leave the code as comments and in case it ever matters (never) we can bring it back or something
19:35 karolherbst[d]: `Failed: 60/32116 (0.2%)` mhhh
19:35 karolherbst[d]: passes with serial.. how annoying
19:38 airlied[d]: I think on ampere some of the fp16 latencies have not been entirely true, or there is something else we are missing
19:40 karolherbst[d]: mhhh
19:40 karolherbst[d]: I can double check
19:41 karolherbst[d]: maybe I really should keep working on the proc macro 😄 so it's easier to check
19:46 airlied[d]: no, I don't mean I wrote them down wrong, I mean the spreadsheet is wrong
19:46 airlied[d]: or we are missing something else
19:47 airlied[d]: I've commented in the sm80 file in the couple of places I've modified them from the spreadsheet to get CTS to pass
19:47 karolherbst[d]: mhhhh
19:47 airlied[d]: // these next two are 4 in the spreadsheet, 5 passes test
19:47 airlied[d]: // dEQP-VK.spirv_assembly.instruction.graphics.float16.arithmetic_1.fsign_vert
19:47 airlied[d]: // dEQP-VK.glsl.builtin.precision_fp16_storage16b.faceforward.compute.vec3
19:47 airlied[d]: FP16 => 5,
19:47 airlied[d]: FP16_Alu => 5,
19:48 karolherbst[d]: mhhh
19:50 karolherbst[d]: it's funny that the 16x16x16 test passes...
19:50 karolherbst[d]: but not 16x8x8 or 16x8x16
19:52 karolherbst[d]: well.. 16x16x16 is emulated
19:52 karolherbst[d]: so might just hide the issue
20:00 karolherbst[d]: mhhh
20:02 karolherbst[d]: need to use 24 instead of 17... something smells here..
20:26 karolherbst[d]: mhhh
20:26 karolherbst[d]: airlied[d]: :
20:26 karolherbst[d]: r11 = prmt r12 [0x5410] r13 // delay=7 wt=010001
20:26 karolherbst[d]: r10..12 = hmma.m16n8k8.f16 r8..10 r0 r10..12 // delay=7 rd:0 wr:1
20:26 karolherbst[d]: so uhm.. that looks wrong, no?
20:27 karolherbst[d]: though mhh..
20:27 karolherbst[d]: maybe 7 is fine?
20:28 karolherbst[d]: ohh wait..
20:28 karolherbst[d]: the other direction does look wrong
20:28 karolherbst[d]: r10..12 = hmma.m16n8k8.f16 r8..10 r0 r10..12 // delay=7 rd:0 wr:1
20:28 karolherbst[d]: r6 p0 = iadd3 r6 c[0x1][0x30] rZ // delay=4
20:28 karolherbst[d]: r8 = iadd3.x rZ c[0x1][0x34] rZ p0 pF // delay=2 wt=000001
20:28 karolherbst[d]: r3 = prmt r11 [0x10] rZ // delay=1 wt=000010
20:32 karolherbst[d]: mhh
20:33 karolherbst[d]: is the delay after or before executing the instruction?
20:33 karolherbst[d]: I somehow never remember
20:35 karolherbst[d]: guess after..
20:46 karolherbst[d]: I hope it's not something silly as we got the latencies for SM80, but for SM86 they are a little different 🙃
20:51 airlied[d]: delay is before
20:51 karolherbst[d]: mhhh
20:52 karolherbst[d]: yeah well, then it's clearly wrong
20:52 airlied[d]: actually no I'm wrong as well
20:53 karolherbst[d]: yeah.. I always get it wrong
20:53 airlied[d]: I have to read the doc a few times and still miss it 🙂
20:53 karolherbst[d]: yeah, it just doesn't make sense
20:53 airlied[d]: ah yes we pick the instructions that consume the output from another instruction then add the delays of all the intervening instructions
20:54 airlied[d]: and whatever is left over is the delay on the initial instruction
20:54 karolherbst[d]: right
20:54 karolherbst[d]: so.. 13 between the hmma and prmt
20:54 karolherbst[d]: which matches the table
20:54 karolherbst[d]: but somehow I need 15...
20:55 airlied[d]: oh you have to have a wait2 between the wr= rd= and req= scoreboards
20:55 karolherbst[d]: ehh 16
20:55 karolherbst[d]: mhhh
20:55 airlied[d]: I wonder if the cumulative delay is broken for > 15?
20:55 karolherbst[d]: mhhhh
20:55 karolherbst[d]: I mean.. it just needs 13
20:56 karolherbst[d]: it's a collect_x1 mma
20:56 airlied[d]: oh but it works if you give it 15?
20:56 karolherbst[d]: *16
20:56 karolherbst[d]: r10..12 = hmma.m16n8k8.f16 r8..10 r0 r10..12 // delay=10 rd:0 wr:1
20:56 karolherbst[d]: r6 p0 = iadd3 r6 c[0x1][0x30] rZ // delay=4
20:56 karolherbst[d]: r8 = iadd3.x rZ c[0x1][0x34] rZ p0 pF // delay=2 wt=000001
20:56 karolherbst[d]: r3 = prmt r11 [0x10] rZ // delay=1 wt=000010
20:56 karolherbst[d]: that works
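As a sanity check on the arithmetic above, a tiny self-contained sketch (hypothetical helper, not NAK's actual scheduler code): the issue-to-issue distance between a producer and the consumer of its result is the producer's own delay plus the delays of every intervening instruction.
```rust
/// Cycles between issuing a producer instruction and issuing the consumer of
/// its result: the producer's delay plus the delays of all intervening
/// instructions (the consumer's own delay does not count).
fn issue_distance(producer_delay: u32, intervening_delays: &[u32]) -> u32 {
    producer_delay + intervening_delays.iter().sum::<u32>()
}

fn main() {
    // Failing schedule above: hmma delay=7, then iadd3 (4) and iadd3.x (2)
    // before the prmt that reads the hmma result: 7 + 4 + 2 = 13 cycles.
    assert_eq!(issue_distance(7, &[4, 2]), 13);
    // Working schedule: bumping the hmma delay to 10 gives 10 + 4 + 2 = 16.
    assert_eq!(issue_distance(10, &[4, 2]), 16);
}
```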
20:57 karolherbst[d]: ohhhh I just noticed.. we don't use .reuse?
20:57 karolherbst[d]: but anyway..
20:58 karolherbst[d]: it's weird
20:58 karolherbst[d]: however
20:58 karolherbst[d]: I blame it on SM86 needing different latencies, at least that's my bet
20:58 karolherbst[d]: hopper needs 16 between alu and mma_x1_collect 🙃
20:59 HdkR: Oh, you don't use reuse at all right now? Or just for mma things?
20:59 karolherbst[d]: blackwell is completely different again
20:59 airlied[d]: I'd buy that 🙂
20:59 karolherbst[d]: let me check your fp16 thing
20:59 karolherbst[d]: you won't believe this Dave 🙃
21:00 airlied[d]: oh nice, seems like it might make sense then!
21:00 karolherbst[d]: it's a bit more complex
21:00 karolherbst[d]: but hopper has a 5 where ampere has a 4
21:01 karolherbst[d]: fp16 reading from fp16_alu is 5 on hopper, 4 on ampere
21:01 karolherbst[d]: but fp16 from fp16 is still 4
21:01 karolherbst[d]: I wonder...
21:01 karolherbst[d]: let me use the hopper table and see how that works out
21:02 karolherbst[d]: ahh yeah...
21:02 karolherbst[d]: "24"
21:02 karolherbst[d]: that was also fixing the x2 one...
21:02 karolherbst[d]: 🙃
21:02 karolherbst[d]: instead of 17
21:03 karolherbst[d]: also what hopper wants
21:04 gfxstrand[d]: Okay, big SSARef MR is assigned to marge. Sorry for the delay. I was running all over town hunting down a new modem so I could get internet working in the new place.
21:06 airlied[d]: hopefully can get hopper into the userspace this week
21:06 karolherbst[d]: nice
21:07 airlied[d]: it does appear as if x86 h100 can run vulkaninfo without dying, so likely some aarch64-specific issue
21:08 karolherbst[d]: ` Failed: 0/32116 (0.0%)`
21:09 karolherbst[d]: I should wire up 16x8x32 IMMA
21:10 karolherbst[d]: and 16x8x16
21:10 karolherbst[d]: turing only has 8x8x16, so ampere used the same lowering for now...
21:10 karolherbst[d]: maybe I get that part done tomorrow
21:24 mhenning[d]: yeah, `.reuse` isn't used at all yet in nak. It only shaves off a cycle here or there and there have been more pressing compiler tasks
21:24 HdkR: Yea, shaves some cycles and reduces power consumption.
21:24 karolherbst[d]: is it even shaving off a cycle?
21:25 karolherbst[d]: I thought it was only a power consumption thing
21:25 airlied[d]: it also increases memory bw
21:25 mhenning[d]: yes, it can save a cycle for register bank conflicts in some cases
21:25 airlied[d]: or at least cache bw
21:26 karolherbst[d]: ahh, I see
21:26 airlied[d]: I think we'd notice when we get to the high throughput hmma tests
21:26 HdkR: Yea, bank conflicts go away since it goes into a bypass network instead :D
21:27 HdkR: Power consumption is an interesting one as well since it'll let the card boost higher, turning power savings into performance.
21:28 HdkR: Probably only enough to go one or two bins higher, but perf is perf.
21:28 karolherbst[d]: .reuse only works if it's used in the same "unit", no?
21:28 HdkR: I believe so
21:28 HdkR: Since it effectively keeps the previously read data in the read port of the pipeline.
21:30 HdkR: So I guess saying "bypass network" isn't quite right, but that's how my brain still maps it :D
21:32 karolherbst[d]: heh
21:32 karolherbst[d]: bypass is something else 😛
21:33 karolherbst[d]: it's if you pipe the destination into the C argument
21:51 HdkR: Yea, It do be like that.
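A hedged sketch of the kind of peephole a future `.reuse` pass could do (per the discussion above, nothing like this exists in NAK yet, and the real constraints are stricter than this): if two back-to-back instructions read the same register in the same source-operand slot, the first read can be tagged `.reuse` so the second comes from the operand reuse cache instead of the register file, sidestepping a potential bank conflict.
```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct Reg(u32);

struct Instr {
    srcs: [Option<Reg>; 3],
    reuse: [bool; 3],
}

// Mark a source operand for reuse when the next instruction reads the same
// register in the same slot. Purely illustrative; a real pass would also have
// to respect scheduling groups and other hardware restrictions.
fn mark_reuse(instrs: &mut [Instr]) {
    for i in 0..instrs.len().saturating_sub(1) {
        for slot in 0..3 {
            if let (Some(a), Some(b)) = (instrs[i].srcs[slot], instrs[i + 1].srcs[slot]) {
                if a == b {
                    instrs[i].reuse[slot] = true;
                }
            }
        }
    }
}

fn main() {
    let add = |a, b| Instr { srcs: [Some(Reg(a)), Some(Reg(b)), None], reuse: [false; 3] };
    let mut prog = vec![add(1, 2), add(1, 3)];
    mark_reuse(&mut prog);
    assert!(prog[0].reuse[0]); // r1 is read in slot 0 by both instructions
    assert!(!prog[0].reuse[1]);
}
```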
22:06 mhenning[d]: gfxstrand[d]: I can fix the marge failures for https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34794 if you're not currently looking at it
22:09 gfxstrand[d]: I'm already working on it
22:10 gfxstrand[d]: I think that should sort it
22:18 gfxstrand[d]: I added a `Src::ZERO` const and just did full-length array initializers for all the other cases.
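Presumably something in this spirit, assuming a hypothetical, pared-down `Src` (the real NAK type isn't shown here): an associated `ZERO` const plus element-by-element array initializers sidesteps the `Copy` requirement of the `[expr; N]` repeat form.
```rust
// Hypothetical, pared-down Src; the real NAK type carries more than this.
struct Src {
    ssa: Vec<u32>, // stand-in for a non-Copy payload
}

impl Src {
    const ZERO: Src = Src { ssa: Vec::new() };
}

fn four_zero_srcs() -> [Src; 4] {
    // With a non-Copy element type, spelling out every element works
    // unconditionally, and the associated const keeps it readable: each use
    // of the const is a fresh value.
    [Src::ZERO, Src::ZERO, Src::ZERO, Src::ZERO]
}

fn main() {
    let srcs = four_zero_srcs();
    assert_eq!(srcs.len(), 4);
}
```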
22:21 gfxstrand[d]: The fact that rust can't infer that `Dst::None` is copyable is kinda dumb.
22:23 gfxstrand[d]: Same for `Option::None`.
22:24 linkmauve: gfxstrand[d], there is no concept of enum variant at the type level.
22:25 linkmauve: What the compiler sees when resolving traits is Option<Whatever>, not Option::None::<Whatever>.
22:25 gfxstrand[d]: Sure. But there are different rules for things that are `const` and it feels like it should be able to figure out that a variant with no data is automatically `const`.
22:26 gfxstrand[d]: Like if I do
22:26 gfxstrand[d]: ```rust
22:26 gfxstrand[d]: impl EnumType {
22:26 gfxstrand[d]: const VARIANT: EnumType = EnumType::Variant;
22:26 gfxstrand[d]: }
22:26 gfxstrand[d]: ```
22:26 gfxstrand[d]: now `EnumType::VARIANT` is copyable because it's a const but `EnumType::Variant` isn't.
22:27 mhenning[d]: Yeah, I don't think it's copyable so much as whether the compiler recognizes it as const
22:28 mhenning[d]: but yeah knowing rust I wouldn't be surprised if there's an RFC for that that's been open for 4 years
22:29 linkmauve: gfxstrand[d], const here is like copy/paste, so wherever you use VARIANT it’s as if you wrote the full expression Variant. That’s different from acting at the type level.
22:42 karolherbst[d]: okay.. looks like wiring up 16x8x16 needs a bit more than to just enable it... I need to adjust the matrix layout conversion code... that's going to be fun
23:23 gfxstrand[d]: 4th try's a charm?
23:28 orowith2os[d]: gfxstrand[d]: how would it know that, if the type itself can contain non-Copy data, and that type isn't const?
23:29 gfxstrand[d]: It could make non-data enumerants automatically const
23:29 orowith2os[d]: if an `EnumType::Variant` went through four different functions to get to me, all non-const, it could've changed, and become non-Copy (because it might contain data now)
23:29 gfxstrand[d]: Oh, once it goes through functions, all bets are off
23:30 gfxstrand[d]: I mean that I should be able to do `[None; 4]` regardless of what `T` goes in the `Option<T>`.
23:31 karolherbst[d]: not sure that's guaranteed to work
23:31 orowith2os[d]: it's not
23:32 gfxstrand[d]: It doesn't but this does:
23:32 gfxstrand[d]: ```rust
23:32 gfxstrand[d]: const T_NONE: Option<T> = None;
22:32 gfxstrand[d]: let foo = [T_NONE; 4];
22:32 gfxstrand[d]: ```
23:32 orowith2os[d]: T_NONE isn't the same as None, though
23:32 orowith2os[d]: it's Option::<T>::None
23:33 orowith2os[d]: and because T_NONE is const, T_NONE can also be copy here
23:35 orowith2os[d]: it's super weird. I'm having trouble wrapping my head around it myself, but it does make sense
23:36 karolherbst[d]: the issue is that `None` doesn't have a defined layout afaik
23:37 karolherbst[d]: but `Option::<T>::None` always does
23:37 orowith2os[d]: keep in mind, if T_NONE were changed to T_SOME, and had a value, and the struct weren't Copy, but T_SOME were still const, it *would* work out
23:37 orowith2os[d]: ```rust
23:37 orowith2os[d]: #[derive(Clone, Debug)]
23:37 orowith2os[d]: struct Bleh(u32);
23:37 orowith2os[d]: const T_SOME: Option<Bleh> = Some(Bleh(2));
23:37 orowith2os[d]: [T_SOME; 4]
23:37 orowith2os[d]: ```
23:38 karolherbst[d]: sure
23:39 orowith2os[d]: the compiler has some weird little hints in there with const that *let* it do that Copy, and if you could write, basically, `const Option::<T>::None` as a type param, it would work.
23:39 orowith2os[d]: Maybe `const Option::<_>::None`?
23:40 karolherbst[d]: the thing is
23:40 karolherbst[d]: const objects aren't guaranteed to exist once
23:40 karolherbst[d]: so whether they are Copy or not is entirely irrelevant
23:41 orowith2os[d]: by design, they're Copy, pretty much?
23:41 karolherbst[d]: so `[T_NONE; 4]` doesn't even mean it's copying `T_NONE` at all
23:41 karolherbst[d]: they aren't
23:41 karolherbst[d]: the compiler can make 4 instances of `T_NONE` and then they are moved instead of copied
23:42 orowith2os[d]: oh no
23:42 karolherbst[d]: it's also the reason that if you take a reference to const data and compare it with another reference to the same const data, the addresses don't have to be equal
23:44 karolherbst[d]: which normally doesn't matter because rust compares by value, and not by address
23:44 karolherbst[d]: or rather whatever PartialEq is doing
23:46 karolherbst[d]: which.. believe it or not, depending on `PartialEq::eq` a comparison between two references to const data could actually return false. Imagine how I know
23:47 karolherbst[d]: though it's such an edge case to get the compiler to even duplicate const data
23:47 karolherbst[d]: but anyway.. nothing guarantees a single instance of const data
23:49 karolherbst[d]: I think the correct wording is `More specifically, constants in Rust have no fixed address in memory.`
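That wording has an observable consequence, shown in a tiny sketch (plain stable Rust, nothing driver-specific):
```rust
const DATA: [u8; 4] = [1, 2, 3, 4];

fn main() {
    let a = &DATA;
    let b = &DATA;
    // Each use of a const is conceptually a fresh copy of its value, so these
    // two borrows are not *guaranteed* to share an address, even though in
    // practice the compiler usually merges them.
    println!("same address: {}", std::ptr::eq(a, b));
    // Comparison by value (what PartialEq normally does) is what you can
    // actually rely on.
    assert_eq!(a, b);
}
```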
23:57 karolherbst[d]: do we have nir opt passes that turn shuffles of 16 bit values into vec2 variants or packed ones or something? 🙃
23:57 karolherbst[d]: or maybe the reverse would be better
23:58 karolherbst[d]: just shuffle vectors and let some pass split it if needed