00:00airlied: there was some later stuff under NDA, but the driver is mostly considered the documentation unless you can come up with a reason to ask for more details
00:00x512[m]: Asking as a non-Linux kernel driver developer that can't sign an NDA.
00:05airlied: probably wrong channel, but yeah you'd have to say what is missing from the linux kernel driver that you wanted to use I suppose
01:18gfxstrand[d]: The reality is that hardware and software aren't designed in separate vacuums (or at least they shouldn't be). As much as we love a good bit of documentation, the reality is that documentation is expensive. Software and hardware engineers talking to each other is far cheaper. When the software and hardware people work at the same company, why not just do that?
01:18gfxstrand[d]: The reality is that the documentation that open-source types are constantly asking for often doesn't exist, even internally.
01:20gfxstrand[d]: Intel does but, honestly, that's part of why Intel's hardware and software story is shit. Because the hardware designers never talk to the software designers except through specs and documentation, they have no idea what they're doing. They think they can just design hardware based on their understanding of the DX API (as someone who's never written a line of driver code), write a document telling
01:20gfxstrand[d]: the software people how to program the hardware (they never describe how the hardware actually works), and then expect software to just materialize. It doesn't work.
01:21gfxstrand[d]: At AMD and NVIDIA, the software and hardware people actually talk to each other. This is good for having competent designs but it also means a lot of that documentation never even gets written.
01:26notthatclippy[d]: gfxstrand[d]: This is also a very common reason for some of the more cursed HW designs. They make sense on the unified stack, but don't map cleanly to what DRI expects.
01:27gfxstrand[d]: For sure
01:30airlied[d]: seeing a bunch of vm faults with coopmat on blackwell
01:31gfxstrand[d]: yup
01:31gfxstrand[d]: imma throws illegal instruction exceptions and hmma faults
01:32gfxstrand[d]: The hmma faults don't happen if you nop out the opcode. I suspect hmma is messing with something. Either it's stomping more registers than we think or it's mashing predicates or something.
01:32airlied[d]: I fixed a bunch of imma by not exposing 8816
01:32airlied[d]: just pushed it to the same branch in my repo
01:32gfxstrand[d]: 16816 also fails.
01:33gfxstrand[d]: The `hmma` case I was looking at had an hmma right before a pair of `iadd3`s doing an address calculation. The fault looked a lot like the `iadd3`s went sideways or something like that. I haven't fully figured it out.
01:33airlied[d]: I managed to really confuse the gpu, and an unrelated cts test would fail standalone until I rebooted
01:33gfxstrand[d]: womp womp
04:42airlied[d]: didn't get too far, life distraction, will see if I can grab some time later
05:35karolherbst[d]: could also be random other stuff going wrong...
06:28airlied[d]: dang it some idiot installed rhel10 on his ada laptop, guess I have to swap cards instead
06:29airlied[d]: it doesn't make sense the hmma instruction could be causing a vmfault, but also no idea why addr calcs on blackwell would be different
06:32airlied[d]: ah idiot, latencies
06:32airlied[d]: guess I better finish the blackwell latency patches
06:37airlied[d]: NAK_DEBUG=serial ./deqp-vk --deqp-case=dEQP-VK.*coop* passes for me with the 8816 disable patch on my branch
07:45karolherbst[d]: airlied[d]: blackwell added some memory thing for the tensor stuff... I'm kinda curious if something is used under the hood there
07:45karolherbst[d]: airlied[d]: ahh.....
07:45karolherbst[d]: I asked about that 🙃
07:46karolherbst[d]: airlied[d]: yeah.. imma 8816 is gone for real
08:24mohamexiety[d]: I really don’t think tensor memory is a thing on consumer Blackwell/sm_120
08:24mohamexiety[d]: But even if so, it's a dedicated scratchpad, so it needs to be used explicitly
08:31karolherbst[d]: yeah.. anyway, turns out it was just latencies going wrong 🙃
12:29gfxstrand[d]: I figured it was something like that. If we've got numbers, we should get those in ASAP.
12:30gfxstrand[d]: Not sure why I got an illegal instruction encoding, though.
13:01karolherbst[d]: yeah... that part confuses me...
13:01karolherbst[d]: maybe you got really unlucky and something stomped over shaders
13:02karolherbst[d]: in theory we got latencies, but who knows if they are right for SM120
13:02gfxstrand[d]: Maybe that test does multiple sizes?
13:03karolherbst[d]: yeah the CTS tests generally test multiple matrix sizes in one run
13:25karolherbst[d]: mhh
13:34ahuillet[d]: karolherbst[d]: need something checked?
13:34karolherbst[d]: ahuillet[d]: if SM100 == SM120 in regards to MMA
13:35karolherbst[d]: though...
13:35karolherbst[d]: I don't think we have anything anyway
13:35karolherbst[d]: I'm kinda confused how that happened..
13:36karolherbst[d]: gfxstrand: what latencies are picked with blackwell? random ones?
13:37gfxstrand[d]: The old R/E'd ones
13:37gfxstrand[d]: Dave's got a branch. IDK if it's accurate
13:37gfxstrand[d]: Oh, it's with his giant CSV mess
13:37gfxstrand[d]: Maybe I should look at that.
13:37karolherbst[d]: ahh...
13:38karolherbst[d]: yeah I mean without that the latencies will be wrong for sure 🙂
13:38gfxstrand[d]: Yeah. Last I knew he was flailing around with "how to make CSV not suck?"
13:38karolherbst[d]: right
13:39karolherbst[d]: anyway. I'm sure that will land in time on the 25.2 branch
13:46gfxstrand[d]: These CSVs seem sideways...
13:46ahuillet[d]: karolherbst[d]: (for other people reading: discussed in private)
13:47mohamexiety[d]: I would be surprised if they are equal tbh cuz sm100 MMA has _insane_ throughput. it's like 4x as much as sm_120
13:51karolherbst[d]: I'm sure they aren't
13:51karolherbst[d]: without knowing it
13:51karolherbst[d]: like Dave also ran into fp16 related issues on Ampere
13:56karolherbst[d]: gfxstrand[d]: I think we should just use something crazy high for MMA in the fallback latencies, like 10 for waw and 30 raw or something
13:56karolherbst[d]: and maybe that's not even high enough
13:56gfxstrand[d]: probably
13:57gfxstrand[d]: Or just flag them as needing barriers
13:57karolherbst[d]: well.. hardware not needing barriers ignores them
13:57gfxstrand[d]: *sob*
13:57karolherbst[d]: it's great
13:59karolherbst[d]: but the high values I've seen are generally around 25 or something
13:59karolherbst[d]: maybe use 40 to be safe
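A minimal sketch of what such a conservative fallback could look like, with hypothetical names rather than NAK's actual latency API, padding RAW past the ~25-cycle highs mentioned above:

    // Hypothetical fallback latencies for MMA ops when no per-SM table exists.
    // Values follow the discussion above: observed RAW highs are ~25 cycles,
    // so 40 leaves a safety margin; WAW uses the suggested 10.
    enum Dep {
        Raw, // read-after-write: a later op reads an MMA destination
        Waw, // write-after-write: a later op overwrites an MMA destination
    }

    fn fallback_mma_latency(dep: Dep) -> u32 {
        match dep {
            Dep::Raw => 40,
            Dep::Waw => 10,
        }
    }

    fn main() {
        assert_eq!(fallback_mma_latency(Dep::Raw), 40);
        assert_eq!(fallback_mma_latency(Dep::Waw), 10);
    }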
16:39karolherbst[d]: it's funny because they migrated the server to new hardware or something, without telling us up front, because $urgency, oh well
16:39karolherbst[d]: but it's all auto starting, so whenever the bot is online, it's working 😄
18:30gfxstrand[d]: Rebased, fixed up the build system bits to something I don't hate and pushed latencies to nvk/blackwell-latencies in my tree
18:37gfxstrand[d]: Of course it's blowing up. 😅
18:39gfxstrand[d]: No idea if the tables are wrong or if I screwed something up refactoring
18:48gfxstrand[d]: These latencies are insanely low
19:19gfxstrand[d]: Latencies are kinda fine (probably). The scheduler is screwing it up. How? No clue!
19:29gfxstrand[d]: Nothing it's moving looks like it conflicts with anything
19:31gfxstrand[d]: And if I run with serial and the scheduler it gets the same scheduling but passes
19:32gfxstrand[d]: https://tenor.com/view/kikis-delivery-service-tired-exhausted-sleep-sleepy-gif-4730448
19:33mohamexiety[d]: maybe the build process mangles the latencies somehow?
19:34mohamexiety[d]: that or the latencies are for the wrong HW :thonk: (sm_100 instead of sm_120)
19:34gfxstrand[d]: Yeah, but doing `+ 5` on all latencies doesn't fix it
19:34mohamexiety[d]: :nervous: hm
19:41gfxstrand[d]: Okay, `+14` is enough
19:46gfxstrand[d]: Okay, I think something fishy is going on with `ldcu`
19:51karolherbst[d]: if you need some values which are maybe correct, I'm sure I can tell you 😄
19:53karolherbst[d]: oof
19:53karolherbst[d]: LDCU is just oof
19:54gfxstrand[d]: There's something we're not doing right that's making `ldcu` variable latency when it shouldn't be
19:54karolherbst[d]: are you using r2ur for anything?
19:55gfxstrand[d]: not in these shaders
19:55karolherbst[d]: it needs like 15 cycles
19:55karolherbst[d]: but yeah.. ldcu is constant time
19:55gfxstrand[d]: `ldcu ur1, c[0x1][urz]` should be 2 cycles. It's variable latency.
19:55karolherbst[d]: it's not
19:55gfxstrand[d]: The hardware disagrees
19:56karolherbst[d]: well, it is variable, but it also needs higher waits
19:56karolherbst[d]: up to 5
19:56gfxstrand[d]: +10 doesn't fix it
19:56karolherbst[d]: blackwell, right?
19:56gfxstrand[d]: y
19:57karolherbst[d]: sooo
19:57karolherbst[d]: uldc reading from certain operations needs like 12
19:58karolherbst[d]: "certain operations": be like the ones starting with U...
19:59gfxstrand[d]: uldc is reading `c[0x1][urz]`
19:59karolherbst[d]: ohh urz..
19:59karolherbst[d]: what's waiting on the result?
19:59gfxstrand[d]: `uisetp`
19:59karolherbst[d]: mhhh.. that should be 2 cycles of waits...
20:00gfxstrand[d]: Yeah. I know. It's not
20:00karolherbst[d]: could be that desktop blackwell is different
20:01gfxstrand[d]: This fails:
20:01gfxstrand[d]: r0 = ipa.pass a[0x7c] rZ // delay=1 wr:0
20:01gfxstrand[d]: r1 = ipa.pass a[0x80] rZ // delay=1 wr:1
20:01gfxstrand[d]: r0 = mufu.rcp r0 // delay=1 wt=000001 wr:0
20:01gfxstrand[d]: ur0 = ldc.b32 c[0x1][+0x0] // delay=2
20:01gfxstrand[d]: up0 = isetp.eq.i32 ur0 rZ // delay=4
20:01gfxstrand[d]: ur2 = sel !up0 rZ 0x3f800000 // delay=3
20:01gfxstrand[d]: ur0 = mov 0x3f800000 // delay=2
20:01gfxstrand[d]: r3 = mov ur0 // delay=1
20:01gfxstrand[d]: r2 = mov ur2 // delay=1
20:01gfxstrand[d]: r1 = fmul.ftz r1 r0 // delay=1 wt=000011
20:01gfxstrand[d]: r0 = mov ur0 // delay=5
20:01gfxstrand[d]: exit // delay=1
20:01gfxstrand[d]: This doesn't
20:01gfxstrand[d]: r0 = ipa.pass a[0x7c] rZ // delay=1 wr:0
20:01karolherbst[d]: ehh wait.. ldcu?
20:01gfxstrand[d]: r1 = ipa.pass a[0x80] rZ // delay=1 wr:1
20:01gfxstrand[d]: r0 = mufu.rcp r0 // delay=1 wt=000001 wr:0
20:01gfxstrand[d]: ur0 = ldc.b32 c[0x1][+0x0] // delay=2 wr:2
20:01gfxstrand[d]: up0 = isetp.eq.i32 ur0 rZ // delay=4 wt=000100
20:01gfxstrand[d]: ur2 = sel !up0 rZ 0x3f800000 // delay=3
20:01gfxstrand[d]: ur0 = mov 0x3f800000 // delay=2
20:01gfxstrand[d]: r3 = mov ur0 // delay=1
20:01gfxstrand[d]: r2 = mov ur2 // delay=1
20:01gfxstrand[d]: r1 = fmul.ftz r1 r0 // delay=1 wt=000011
20:01gfxstrand[d]: r0 = mov ur0 // delay=5
20:01gfxstrand[d]: exit // delay=1
20:01karolherbst[d]: I checked uldc 🙃
20:01gfxstrand[d]: I think they just renamed it on blackwell
20:02gfxstrand[d]: but maybe they're different ops
20:02karolherbst[d]: stuff waiting on ldcu needs a scoreboard, yeah
20:02gfxstrand[d]: That'll do it!
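A rough sketch of the distinction, with hypothetical types rather than NAK's real scheduler structures: the two dumps above differ only in the ldc getting a wr:2 scoreboard and the isetp a wt=000100 wait, which is what a variable-latency op requires.

    // Hypothetical sketch, not NAK's actual scheduler: fixed-latency ops only
    // need a static delay, while a variable-latency op like Blackwell's ldcu
    // must signal a scoreboard ("wr:N") that every consumer waits on ("wt=...").
    struct DepInfo {
        delay: u8,              // static stall cycles before the next issue
        wr_barrier: Option<u8>, // scoreboard slot the result signals, if any
    }

    fn schedule_const_load(variable_latency: bool, free_slot: u8) -> DepInfo {
        if variable_latency {
            // Blackwell ldcu: arrival time isn't statically known, so hand
            // consumers a scoreboard instead of relying on delay alone.
            DepInfo { delay: 2, wr_barrier: Some(free_slot) }
        } else {
            // Pre-Blackwell uldc behaved as fixed latency: a couple of delay
            // cycles were enough and no scoreboard was allocated.
            DepInfo { delay: 2, wr_barrier: None }
        }
    }

    fn main() {
        let ldcu = schedule_const_load(true, 2);
        assert_eq!(ldcu.delay, 2);
        assert_eq!(ldcu.wr_barrier, Some(2)); // matches "wr:2" in the passing dump
    }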
20:02gfxstrand[d]: What's the opcode for uldc?
20:02karolherbst[d]: no idea 🙂
20:03gfxstrand[d]: Pardon me while I scan the opcode space
20:03karolherbst[d]: ohh.. I think that's just for MMA stuff actually
20:04karolherbst[d]: or uhm.. dunno actually, but it does kinda exist but also kinda not?
20:04gfxstrand[d]: How long can fuzzing 12 bits take?
20:05karolherbst[d]: or maybe it's a rename and they left the old name in some places?
20:05karolherbst[d]: mhhh yeah weird
20:06karolherbst[d]: yeah.. seems like a simple rename afterall
20:07karolherbst[d]: weird...
20:07karolherbst[d]: anyway
20:07karolherbst[d]: reading from ldcu on blackwell needs a scoreboard
20:08gfxstrand[d]: so uldc is a lie?
20:08karolherbst[d]: maybe they renamed it because it needs a scoreboard now and is more like a load instruction, but they didn't remove all references to ULDC
20:08karolherbst[d]: so it is referenced, but I don't think it actually exists
20:08karolherbst[d]: it's also gone from the nvdisasm docs
20:08gfxstrand[d]: Getting rid of the bound cbuf path and not having a fast path in `uldc` makes no sense
20:09karolherbst[d]: no idea
20:09gfxstrand[d]: There's also a whole `uldc` latency class
20:10gfxstrand[d]: And `uldc_mma` which is a different thing
20:10karolherbst[d]: yeah..
20:10karolherbst[d]: I think they forgot to remove it 🙂
20:15mhenning[d]: Pre-blackwell has ULDC. Blackwell has LDCU.
20:15mhenning[d]: We've assumed that it was a simple rename so far
20:15mhenning[d]: (the opcodes also change)
20:15mhenning[d]: we represent them the same in the IR, and just pick the right opcode at encoding time
20:20karolherbst[d]: well.. apparently they aren't
20:20karolherbst[d]: though they do work the same, just different scoreboarding
20:20karolherbst[d]: ohhhhhh
20:20karolherbst[d]: uhm....
20:21karolherbst[d]: gfxstrand[d]: you will love this... on Blackwell LDCU's offsets are always signed
20:21karolherbst[d]: no special handling for URZ
20:22karolherbst[d]: though it's a 17 bit offset instead of 16 for the banked access
20:22karolherbst[d]: but besides that they seem identical
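A small sketch of the offset decode implied by that, treating the widths above as unverified claims from this discussion rather than documented fact:

    // Sketch based on the claims above: on Blackwell, ldcu's banked-access
    // offset is a 17-bit signed immediate rather than uldc's 16-bit one, and
    // it is signed unconditionally, with no special case for URZ.

    /// Sign-extend the low `bits` bits of `raw` to an i32.
    fn sign_extend(raw: u32, bits: u32) -> i32 {
        let shift = 32 - bits;
        ((raw << shift) as i32) >> shift
    }

    fn decode_ldcu_offset(raw_field: u32) -> i32 {
        // 17-bit field per the discussion above; uldc would use 16 here.
        sign_extend(raw_field & 0x1ffff, 17)
    }

    fn main() {
        assert_eq!(decode_ldcu_offset(0x00004), 4);
        // Top bit set -> negative offset, since the field is always signed.
        assert_eq!(decode_ldcu_offset(0x1fffc), -4);
    }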
20:23gfxstrand[d]: There's a fun version of ldcu that takes a predicate
20:23karolherbst[d]: that's not new with blackwell
20:24karolherbst[d]: that one also exists in turing
20:24gfxstrand[d]: `ldcu.256` also looks fun
20:24karolherbst[d]: oh yeah, that's new
20:24karolherbst[d]: wait
20:24karolherbst[d]: .256?
20:24karolherbst[d]: mhh
20:24karolherbst[d]: does SM100 have .256?
20:26gfxstrand[d]: nope
20:26karolherbst[d]: ohh
20:26karolherbst[d]: sooo
20:26karolherbst[d]: that input predicate disables the load and returns 0 if true
20:26gfxstrand[d]: nifty
20:26karolherbst[d]: you know.. for bound checking purposes
20:27karolherbst[d]: yeah looks like the raw access doesn't do bound checks, which isn't surprising
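The semantics described here, sketched as plain Rust with a hypothetical helper (not hardware-accurate):

    // Sketch of the predicated-load behaviour described above: when the input
    // predicate is true the load is skipped and the destination gets 0, which
    // is handy for software bounds checking on the otherwise unchecked path.
    fn ldcu_predicated(out_of_bounds: bool, load: impl Fn() -> u32) -> u32 {
        if out_of_bounds {
            0
        } else {
            load()
        }
    }

    fn main() {
        assert_eq!(ldcu_predicated(true, || 0xdead_beef), 0);
        assert_eq!(ldcu_predicated(false, || 42), 42);
    }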
20:29mhenning[d]: gfxstrand[d]: yeah, there's also a new ldg_256_uniform__Ra32
20:30karolherbst[d]: but that's SM120+, no?
20:32mhenning[d]: Probably? I know it's there on 120, I haven't double checked where it was added
20:34karolherbst[d]: I only know that .128 exists, so I probably have SM100 only docs
20:34karolherbst[d]: haven't checked
20:36karolherbst[d]: wait.. 256 that's a vec8
20:37karolherbst[d]: ohh
20:37karolherbst[d]: the .256 variants take two vec4 dests
20:37karolherbst[d]: mhhh _interesting_
20:39mhenning[d]: yeah
20:39mhenning[d]: I'm guessing it was easier to output two vec4s instead of 1 vec8 at the hardware level
20:41mhenning[d]: but for that reason it should probably be represented as a new instruction in nak - afaik we don't need to fully align the output to a vec8
20:46gfxstrand[d]: Yup
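A hypothetical sketch of how such an op might be shaped in the IR (made-up types, not actual NAK code), with two vec4 destinations so only vec4 alignment has to be satisfied:

    // Hypothetical IR sketch (not NAK's real definitions): model ldcu.256 as
    // writing two vec4 destinations rather than one vec8, so register
    // allocation only needs vec4 alignment for each half.
    struct UVec4 {
        base: u8, // first uniform register index; assumed to be a multiple of 4
    }

    struct OpLdcU256 {
        dsts: [UVec4; 2], // the two vec4 results
        cbuf_idx: u8,
        offset: i32, // signed, per the discussion above
    }

    fn main() {
        // Made-up example: load 256 bits from c[1][0x20] into ur4..7 and ur8..11.
        let op = OpLdcU256 {
            dsts: [UVec4 { base: 4 }, UVec4 { base: 8 }],
            cbuf_idx: 1,
            offset: 0x20,
        };
        for d in &op.dsts {
            assert_eq!(d.base % 4, 0); // vec4 alignment only
        }
        println!(
            "ldcu.256 ur{}, ur{}, c[{}][{:#x}]",
            op.dsts[0].base, op.dsts[1].base, op.cbuf_idx, op.offset
        );
    }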
20:51gfxstrand[d]: I'm starting to think these tables are for sm100, not sm120
20:56mohamexiety[d]: it would be really hilarious tbh if they gave the tables for 100 instead of 120 given sm100 is probably not even vulkan capable at this point :KEKW:
20:58gfxstrand[d]: <a:shrug_anim:1096500513106841673>
21:04mohamexiety[d]: looking at the tables it does kinda feel suspiciously low for some instructions tbh. hmma being 7 cycles is :thonk:
21:05mohamexiety[d]: like if I am not reading this wrong, sm_80 for hmma looks to be 16+ cycles. yeah it's a new tensor core and all but a difference _this_ large is suspicious
21:07mohamexiety[d]: _especially_ given how consumer blackwell TCs have the exact same per clock per TC throughput as ampere - ada. I get that throughput and latency aren't strictly tied but a halving in latency for no change in throughput is :thonk:
21:08gfxstrand[d]: Yeah but it's not just tensors
21:09mohamexiety[d]: yeah ofc. I am only singling out the mma stuff because it's the main one that really sticks out as weird
21:10gfxstrand[d]: At this point I think I'm gonna push, go home, and hope airlied can help sort it out
21:13gfxstrand[d]: At least I sorted out the build system nonsense
23:04karolherbst[d]: gfxstrand[d]: same for the ampere ones
23:04karolherbst[d]: like yes, those are for sm100, not sm120
23:06karolherbst[d]: Dave also fixed fp16 for Ampere, because sm80 and sm86 diverged there slightly
23:06karolherbst[d]: tables for consumer cards are WIP
23:13gfxstrand[d]: Well, I guess we can land this as sm100 and we'll have to wait for new tables for sm120
23:14gfxstrand[d]: Or we can hold off until someone has hardware they can test on
23:29gfxstrand[d]: In any case, the scripts and build system are sorted now. We can add more SMs at will.
23:30gfxstrand[d]: Do we want to convert the older gens?
23:30mhenning[d]: yes please
23:30gfxstrand[d]: If we've got them as XLSX files, I'd love to just import those
23:33mhenning[d]: yeah, I think karol or airlied can copy them in
23:33gfxstrand[d]: I also wish there were an easy way to automate the op mappings
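As a rough illustration of the kind of import being wished for here, a sketch with a made-up CSV format (column names and values invented, and the op-name-to-NAK-opcode matching still left manual):

    // Hypothetical sketch of a latency-table CSV import (format invented for
    // illustration): map each instruction-class row to its RAW/WAW latencies,
    // keyed by an op name that still has to be matched to NAK's opcodes.
    use std::collections::HashMap;

    struct Latency {
        raw: u32,
        waw: u32,
    }

    fn parse_latency_csv(text: &str) -> HashMap<String, Latency> {
        let mut out = HashMap::new();
        for line in text.lines().skip(1) {
            let mut cols = line.split(',');
            let (Some(op), Some(raw), Some(waw)) = (cols.next(), cols.next(), cols.next()) else {
                continue; // skip malformed rows
            };
            if let (Ok(raw), Ok(waw)) = (raw.trim().parse(), waw.trim().parse()) {
                out.insert(op.trim().to_string(), Latency { raw, waw });
            }
        }
        out
    }

    fn main() {
        // Made-up example rows; real tables would come from the CSV/XLSX dumps.
        let csv = "op,raw,waw\nhmma,16,10\nimma,16,10\n";
        let table = parse_latency_csv(csv);
        assert_eq!(table["hmma"].raw, 16);
    }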
23:57gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36183
23:57gfxstrand[d]: Can't merge for two more weeks but at least now it's marked as a release blocker