00:00 mhenning[d]: gfxstrand[d]: okay, well what assumptions would you make?
00:01 mhenning[d]: because we need to write a compiler and it needs to do something
00:01 gfxstrand[d]: Yes we do. Yes it does.
00:01 gfxstrand[d]: NIR's semantics are pretty much poison.
00:06 misyltoad[d]: gfxstrand[d]: same for opkill 😆
00:12 mhenning[d]: alright, I think I've spent enough time thinking about this for today
00:12 mhenning[d]: link for more context: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35100
00:25 gfxstrand[d]: misyltoad[d]: Nah, we settled OpKill. We settled on "Stop using that crap!"
00:27 misyltoad[d]: gfxstrand[d]: then glslang started emitting OpTerminate for discard; until 2 weeks ago :D
00:27 misyltoad[d]: which is even worse ;_;
00:27 gfxstrand[d]: In theory we'll settle OpUndef some time in the next decade. In theory...
00:27 misyltoad[d]: https://github.com/KhronosGroup/glslang/pull/3954
00:28 gfxstrand[d]: That's because OpTerminate is what the GLSL spec says.
00:29 misyltoad[d]: but not what any client ever wants :P
00:30 gfxstrand[d]: Clients want HLSL. 😝
00:34 gfxstrand[d]: Oh, that's right. We deleted discard from NIR.
00:34 gfxstrand[d]: I forgot about that.
00:36 gfxstrand[d]: Or mostly deleted
00:43 karolherbst[d]: mhenning[d]: that reduction in spills tho
00:44 karolherbst[d]: anyway.. got tomorrow off, and maybe I'll figure out the vec phi stuff anyway 🙃 probably will have to understand reg_alloc first tho
06:01 mangodev[d]: mhenning[d]: as a tomorrow question
06:01 mangodev[d]: what is a spill and a fill, and what's the difference between doing so with memory versus a register? isn't a register just part of the larger memory?
06:03 mangodev[d]: beginner questions, ik
06:03 mangodev[d]: i just want to learn instead of staying completely in-the-dark about the inner machinations of nvk
06:39 orowith2os[d]: mangodev[d]: Registers vs memory isn't really something NVK-specific, at least. Registers are closer to where the instructions execute, and are faster to access. I'm sure you can find better write-ups out there.
06:41 orowith2os[d]: Spills and fills are probably about wanting more values in registers than there are registers to hold them.
06:41 orowith2os[d]: So it spills to memory.
08:13 snowycoder[d]: mhenning[d]: How hard would it be to run shaderdb on each MR in CI to catch these regressions?
13:11 gfxstrand[d]: orowith2os[d]: Yeah, it's that. Specifically we count the number of local memory instructions we emit to shuffle values to/from memory as a rough estimate of the cost of spilling.
13:14 gfxstrand[d]: orowith2os[d]: "closer to the instructions" is a good way of putting it. On modern chips, they're all various forms of RAM, effectively. Just some are faster and closer to the chip. What gets called registers is usually one or two cycles to access rather than 20 or 200.
14:29 snowycoder[d]: gfxstrand[d]: I've addressed all the comments in the Kepler image storage MR
14:34 gfxstrand[d]: \o/
14:35 snowycoder[d]: There were a lot, but the code is now a lot more readable! (and it has proper support for both 3D images and 3D sliced images), thanks!
14:35 gfxstrand[d]: Re-building now. I'll test against it
14:36 gfxstrand[d]: snowycoder[d]: This is a big project and your first really big, complex thing. Lots of comments were expected. But you've done a bang-up job. I have comments, not complaints. 💜
14:50 gfxstrand[d]: Okay, playing with my crucible tests now
14:51 gfxstrand[d]: It's writing okay but the writes are definitely going to the wrong addresses
15:04 gfxstrand[d]: Okay, if I hack the NIR lowering up to just return `(0, addr_shifted8)`, it writes to `addr + 0x20`. The hell?!?
15:06 snowycoder[d]: Wait, why? even with no bitfield?
15:06 gfxstrand[d]: No bitfield. No nothing
15:06 gfxstrand[d]: 0x000000: mov r2, c[0x0][0x128] // delay=0
15:06 gfxstrand[d]: 0x000008: mov r3, c[0x0][0x12c] // delay=0
15:06 gfxstrand[d]: 0x000010: ld.e.64 r2, [r2+0x10] // delay=0
15:06 gfxstrand[d]: 0x000018: mov r0, rz // delay=0
15:06 gfxstrand[d]: 0x000020: mov r4, 0x42 // delay=0
15:06 gfxstrand[d]: 0x000028: mov r5, 0x42 // delay=0
15:06 gfxstrand[d]: 0x000030: mov r6, 0x42 // delay=0
15:06 gfxstrand[d]: 0x000038: mov r7, 0x42 // delay=0
15:06 gfxstrand[d]: 0x000040: mov r9, r2 // delay=0
15:06 gfxstrand[d]: 0x000048: mov r8, r0 // delay=0
15:06 gfxstrand[d]: 0x000050: sustga.p.wb.ign [r8], r3, r4, !pt // delay=0
15:06 gfxstrand[d]: 0x000058: exit // delay=0
15:08 snowycoder[d]: I honestly have no idea 0_o
15:08 gfxstrand[d]: And I don't have my sources in the wrong order. It's definitely writing 42 and it's definitely writing it to the image.
15:11 gfxstrand[d]: Maybe something funky with the format specified?
15:12 snowycoder[d]: The strange thing is that sustga/suldga usually take the lower 8-bit offset from the bitfield; if it's 0 they should just add 0
15:12 gfxstrand[d]: Yup
15:13 gfxstrand[d]: And 0x20 is definitely bottom bits
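(Aside: a rough C sketch of the addressing scheme being described here, based purely on my reading of the conversation rather than any documentation; the function and parameter names are made up.)

```c
#include <stdint.h>

/* Hypothetical model of the scheme above: the NIR lowering hands
 * suldga/sustga the surface address pre-shifted right by 8, and the
 * instruction supplies the low 8 bits from its bitfield operand.
 * With the bitfield at 0, the result should equal the original
 * byte address. */
static uint64_t
effective_surface_addr(uint64_t addr_shifted8, uint32_t bitfield)
{
    return (addr_shifted8 << 8) | (bitfield & 0xff);
}
```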
15:14 snowycoder[d]: What does it do with fmt=0?
15:15 gfxstrand[d]: Writes nothing
15:16 snowycoder[d]: Fair enough 😂
15:18 snowycoder[d]: I really don't know, try to see if codegen outputs something different
15:19 snowycoder[d]: or maybe there's something strange with sustga encoding? that's not in the hw_tests
15:34 gfxstrand[d]: I'm suspecting that
15:41 snowycoder[d]: How can I signal to CSE that my new intrinsic is an image intrinsic?
15:48 snowycoder[d]: Right now it's merging `suldga_nv` across barriers, but I don't see where it's making that decision
15:50 snowycoder[d]: Ok nevermind, it follows the CAN_REORDER flag
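(Aside: the flag and table names below are NIR's real ones from nir.h; the wrapper function is just a minimal sketch of how a pass typically consults them.)

```c
#include "nir.h"

/* Sketch: NIR's CSE and code-motion passes only touch an intrinsic if
 * its per-intrinsic info says it can be eliminated/reordered, so an
 * image intrinsic that must stay ordered across barriers simply does
 * not set CAN_REORDER. */
static bool
intrinsic_is_reorderable(const nir_intrinsic_instr *intrin)
{
   const nir_intrinsic_info *info = &nir_intrinsic_infos[intrin->intrinsic];

   return (info->flags & NIR_INTRINSIC_CAN_ELIMINATE) &&
          (info->flags & NIR_INTRINSIC_CAN_REORDER);
}
```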
16:14 karolherbst[d]: gfxstrand[d]: it's .p, so the format matters and it's not byte addressed
16:15 karolherbst[d]: mhhh
16:15 karolherbst[d]: or maybe kepler is different...
16:17 karolherbst[d]: uhhh
16:18 karolherbst[d]: uhm....
16:18 karolherbst[d]: oh no
16:18 karolherbst[d]: sustp on kepler is a frigging mess
16:22 gfxstrand[d]: 😄
16:26 karolherbst[d]: I don't understand what codegen is doing, but it's doing a lot of random nonsense compared to maxwell even for sustp, so... uhm... it looks like a disaster honestly
16:28 gfxstrand[d]: Yeah and snowycoder[d] figured all that out and it's even constant-folded in NAK. It's great. I'm just trying to figure out why Kepler A is more cursed than Kepler B.
16:30 karolherbst[d]: mhhh
16:47 snowycoder[d]: karolherbst[d]: Yeah, I only use formatted stores; the ISA also allows unformatted stores, but formatted seems to do everything
16:49 gfxstrand[d]: There's no reason to do unformatted, TBH.
16:51 snowycoder[d]: Actually, what do they do? Aren't they just load/stores?
16:51 snowycoder[d]: On other archs I assume they take surface coords, but on Kepler they need an actual global address
17:29 gfxstrand[d]: `suld.b` is the same as `suld.p` on SM50+ except that it loads/stores raw bytes instead of looking at the surface format. You can use it to override the format in the shader.
17:45 gfxstrand[d]: Wait... I think I know what things are funky. Maybe
17:45 gfxstrand[d]: I think it's just inter-gob stuff
17:50 gfxstrand[d]: Okay, I think I found the encoding issue. I just need to turn the lowering back on and see if that's it. But first, an appointment at the bank.
17:57 gfxstrand[d]: There's a type specifier for the offset that can be either u8, s8, u32, or s32. We want u8.
18:00 gfxstrand[d]: IDK if Kepler B has that or not. I don't remember seeing it on the NAK op.
18:19 mhenning[d]: mangodev[d]: When a computer remembers a number, there is a physical circuit that actually stores the number. These are organized into a hierarchy of different memories, from (nearby, small, fast memory) to (far away, large, slow memory).
18:20 mhenning[d]: Register allocation is part of deciding, physically, where those numbers go. It generates spills, which move values from fast memory to slow memory, and fills which do the reverse
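(Aside: a compiler-agnostic C sketch of what a spill/fill pair amounts to; the register file and local-memory arrays are made up for illustration and are not NAK internals.)

```c
#include <stdint.h>

/* Conceptual sketch of a spill/fill pair.  When more values are live
 * than there are registers, the allocator parks one in slower local
 * memory and loads it back when it's needed again, i.e. two extra
 * memory operations per spilled value. */
#define NUM_REGS   8          /* hypothetical tiny register file */
#define LOCAL_SIZE 64         /* hypothetical local-memory slots */

static uint32_t regs[NUM_REGS];
static uint32_t local_mem[LOCAL_SIZE];

static void spill(int reg, int slot) { local_mem[slot] = regs[reg]; } /* register -> memory */
static void fill(int reg, int slot)  { regs[reg] = local_mem[slot]; } /* memory -> register */
```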
18:20 snowycoder[d]: gfxstrand[d]: Yep, I just hardcoded it to u8; we might use u32 for buffers, though
18:21 mhenning[d]: snowycoder[d]: Yeah, we should probably put together more tooling for checking for this kind of regression. Running shaderdb during CI isn't too hard, but doing something useful with the numbers is a little more tricky. I'm not sure what other drivers do.
18:36 karolherbst[d]: gfxstrand[d]: According to codegen, u8 seems very invalid for buffer images
18:46 gfxstrand[d]: Eh, we can do whatever we want. It's just a matter of making sure we set the bits we think we're setting.
19:12 karolherbst[d]: some encodings are just invalid.. that reminds me, I wanted to fix cctl, because that's for sure not correct atm 🙃
19:43 gfxstrand[d]: snowycoder[d]: I just pushed three commits to your MR. Please go ahead and squash them in where they belong.
19:43 gfxstrand[d]: With those, most of `dEQP-VK.image.\*` passes
19:49 gfxstrand[d]: Something funky is going on with MSAA (but that may be a more general issue) and 64bit atomics are throwing an illegal instruction encoding
19:49 snowycoder[d]: gfxstrand[d]: General in what way?
19:50 gfxstrand[d]: I think the MSAA layout might be different on Kepler
19:50 gfxstrand[d]: Or Kepler A
19:50 gfxstrand[d]: I saw some MSAA texture fails, too
19:51 snowycoder[d]: KeplerB works with all deqp-vk.image.* tests
19:51 gfxstrand[d]: But I'm not sure. I've spent zero time investigating. I just know it fails without a shader exception and I thought I remembered something similar from a month ago or so
21:24 mangodev[d]: mhenning[d]: i honestly forgot GPUs have caches too
21:24 mangodev[d]: do they have the full stack/heap concept that cpu memory does?
21:25 mangodev[d]: i'd think they have extremely small caches (or mostly instruction cache in L1) given nvidia's obsession over memory speed
21:28 mangodev[d]: is cache mostly used by the driver? since i've never really heard much of gpu cache outside the driver space
21:28 mangodev[d]: i have no idea how much cache my GPU even has, meanwhile i can easily find how much cache my CPU has (L3 on local specs, and L1 and L2 on online spec sheets)
21:30 mangodev[d]: mhenning[d]: good to know what the proper terminology for those moves is, though
21:30 mangodev[d]: are fills a good thing (because they're moving to a lower cache level) or bad (because of potentially unnecessary moves)? i'd assume this is a thing that "depends," but they sound like they'd be good in some contexts rather than bad
21:31 mangodev[d]: i'd assume your pr *reducing* fills is a **good** thing though, given it was moving a ton of memory around, which is likely bad regardless (and seemingly going back and forth between cache and vram)
21:47 snowycoder[d]: mangodev[d]: Exactly.
21:47 snowycoder[d]: if you remove a spill you remove 2 memory operations; generally fewer is always better
21:48 mangodev[d]: snowycoder[d]: why 2? a write and clear, i assume?
21:48 mangodev[d]: or is there a dedicated move op
21:50 snowycoder[d]: A copy from register to memory and a copy from memory back to a register.
21:50 mangodev[d]: does the driver or the hardware handle memory allocations? i'd assume it to be the driver, but I may be mistaken given how surprisingly ""automatic"" the hardware can be
21:51 mangodev[d]: like how the hardware has one op for decoding texture data, no matter what type of (hardware) compression (afaik)
21:55 gfxstrand[d]: The driver
21:57 gfxstrand[d]: mangodev[d]: I mean, stack/heap are a lie on CPUs, too. It's all just memory. The only real magic about stacks is that there is often a stack pointer register and a way to easily increment/decrement it and read relative to it. NVIDIA has that, too. Not all GPUs do but NVIDIA does.
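(Aside: a minimal C sketch of that point, that a stack is just memory plus a pointer you increment/decrement; nothing here is GPU-specific and all names are made up.)

```c
#include <stdint.h>

/* A "stack" is just a flat chunk of memory plus a pointer into it.
 * Push decrements the pointer, pop increments it, and hardware "stack
 * support" mostly means having a register and addressing mode that
 * make these operations cheap. */
#define STACK_BYTES 4096

static uint8_t  stack_mem[STACK_BYTES];
static uint32_t sp = STACK_BYTES;                     /* grows downward */

static void *stack_push(uint32_t nbytes) { sp -= nbytes; return &stack_mem[sp]; }
static void  stack_pop(uint32_t nbytes)  { sp += nbytes; }
```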
21:58 gfxstrand[d]: mangodev[d]: GPU caches are often pretty big, actually. But there are typically fewer levels. And often the cache is segmented with textures using one cache, constant data another, read/write data a third, etc.
21:59 mangodev[d]: gfxstrand[d]: ig that kind of explains why my os can only tell me my L3? i assume normal programs can't access L2 or L1, mayhaps they're OS-reserved memory?
22:00 mangodev[d]: are cache levels actually as physically separate as they're made out to be, or are they just neat categorizations for something sort of homogenous?
22:01 mangodev[d]: gfxstrand[d]: the words "texture cache" explains perfectly that the gpu cache is pretty chunky
22:02 mangodev[d]: so is GPU memory pretty much
22:02 mangodev[d]: - the fast memory
22:02 mangodev[d]: - the slightly less fast memory
22:04 mangodev[d]: are spills and fills just memory moves out of or into the cache region? are they just a range of addresses, or are they more unique than that?
22:07 mohamexiety[d]: mangodev[d]: no they're separate
22:08 mohamexiety[d]: it depends a lot on the actual architecture on how it's all organized but generally if we take the current NVIDIA stuff
22:09 mangodev[d]: mohamexiety[d]: ofc nvidia has changed it in otherwise relatively insignificant generations 🫠
22:09 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1375596631000354847/image.png?ex=683243b2&is=6830f232&hm=3d732752426202fa1fc76c331cb190ede1955a30e18f15e91a08f5b0e97b97b2&
22:09 mohamexiety[d]: this is the building block of your compute. you can see that the instruction cache is split into 4 partitions, each SMSP (the quarters of the SM) gets its own instruction cache. this is physically actually close to these execution units. then there's the L1, which is SM wide. again, it's physically distinct and interconnected differently
22:10 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1375596827356827768/image.png?ex=683243e0&is=6830f260&hm=6dd106d6796358e0ae1cca6d19e7c58ce83f6a4b02338fea2ec91c51b18589c3&
22:10 mohamexiety[d]: finally you get the L2, which is completely elsewhere, and connects everything together, and is directly attached to the memory controller
22:11 orowith2os[d]: gfxstrand[d]: A major point of the heap, being able to allocate memory from it arbitrarily, doesn't really matter for performance or anything if you allocate a giant chunk of it and pull from that in your own program, right?
22:12 mohamexiety[d]: I don't have an annotated die shot of blackwell in similar style, but this is Ada (which has the same memory hierarchy as Blackwell): https://www.nemez.net/die/GPU/Ada/AD102_annotated.png you can see that the different caches are pretty distinct
22:13 mohamexiety[d]: and they have to be distinct, really, because their location and interconnection directly affect their speed (whether that be latency or bandwidth)
22:15 mohamexiety[d]: the only one to _allegedly_ have homogenous cache is Apple starting from the M3. but I haven't really seen people talking about that
22:17 mohamexiety[d]: allegedly, if you take their technical marketing at face value, instead of having a distinct register file and then an L1/L0 cache, they have one giant blob of SRAM that can be used to dynamically increase register capacity at the expense of cache capacity, or the opposite, depending on what the app needs
22:18 mohamexiety[d]: https://developer.apple.com/videos/play/tech-talks/111375/ there's not really much about it beyond this video though sadly
23:14 gfxstrand[d]: orowith2os[d]: I mean, that's all just software. We can write that software in shaders, in theory. (Okay, so we can't talk to the kernel from a shader but still...)
23:15 orowith2os[d]: Reminds me of when Misyl put Linux on TF2
23:15 orowith2os[d]: My point there was, it's really all just the hardware arranged in a way software likes it to be
23:17 gfxstrand[d]: Yeah.
23:18 gfxstrand[d]: The hard part with GPUs is that the shaders are really isolated. They're Turing complete, yes, but memory management and stuff is pretty external to the shader cores and they just run in the context you give them.
23:18 gfxstrand[d]: But also, at the end of the day it's all just memory and you can kinda do whatever. It's "just" a matter of writing the software. See also the Asahi driver.
23:19 orowith2os[d]: Omw to compile brainfuck on an Ada GPU
23:19 orowith2os[d]: :monad:
23:21 orowith2os[d]: Mmm, I guess bf on spir-v is already that. Oh well
23:44 mangodev[d]: orowith2os[d]: …not a typo?
23:44 mangodev[d]: the tf2 community is dedicated as hell
23:45 orowith2os[d]: https://m.youtube.com/watch?v=zi6osAtyaio