01:50gfxstrand[d]: airlied[d]: The hell is .texunpack?
01:50airlied[d]: seems like a magic 32-bit value to get from ldcu to give to texture instruction
01:51gfxstrand[d]: I doubt it's magic
01:51airlied[d]: I've wired it through in my nvk-wip-gb20x branch, but tests still don't pass
01:51airlied[d]: it's weird it's not 64-bit though
01:51gfxstrand[d]: Yeah
01:52airlied[d]: though it could be and I'm just not seeing that
01:52airlied[d]: since there is nothing in the decode to say
01:52gfxstrand[d]: That feels like maybe it's `LDC & 0x000fffff`
01:52gfxstrand[d]: airlied[d]: True
01:53gfxstrand[d]: We might be able to unit test that somehow. We have a cb0 in the unit tests, though it might need extra plumbing.
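A minimal sketch of what the `LDC & 0x000fffff` guess above would mean if it holds. It assumes, without confirmation from the log, that the 32-bit value loaded from the cbuf packs a texture index in its low 20 bits and something else (possibly a sampler index) in the high 12 bits. The function name `split_handle` is hypothetical and only illustrates the masking.

```python
# Hypothetical sketch: split a 32-bit value the way the proposed mask would.
# Assumption (not confirmed in the log): low 20 bits = texture index,
# high 12 bits = some other field such as a sampler index.
TEX_INDEX_MASK = 0x000FFFFF  # the mask suggested in the discussion
TEX_INDEX_BITS = 20


def split_handle(raw: int) -> tuple[int, int]:
    """Return (low 20 bits, high 12 bits) of a 32-bit value."""
    raw &= 0xFFFFFFFF  # clamp to 32 bits
    return raw & TEX_INDEX_MASK, raw >> TEX_INDEX_BITS


if __name__ == "__main__":
    low, high = split_handle(0x00300042)
    print(hex(low), hex(high))
```

A unit test along the lines gfxstrand suggests could load a known value from cb0 and compare the result against this kind of split.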
01:59airlied[d]: https://paste.centos.org/view/raw/04ebd9e4 is the dump
01:59airlied[d]: for dEQP-VK.glsl.texture_functions.texture.sampler2d_fixed_fragment
02:07mangodev[d]: i'm curious
02:07mangodev[d]: what even *is* a vulkan layer?
02:08mangodev[d]: from what little i've seen, it seems like a driver-specific extension usually involving fetching some sort of value (such as the current graphics device or surface contents), but i have little idea what it actually means
02:14airlied[d]: https://docs.vulkan.org/guide/latest/layers.html
02:18airlied[d]: hmm the 64-bit modifier still seems to apply separately though so maybe it is only 32-bit
02:24airlied[d]: hope they don't encode something magically on the state side into the constbuf
02:27gfxstrand[d]: airlied[d]: TBH, that looks like it's just loading a bindless handle. I don't know why the LDC would need to be special.
02:46mhenning[d]: gfxstrand[d]: It's not using .B for bindless though?
02:47mhenning[d]: I didn't figure out the non-bindless encodings since the forms are different on those from earlier gens
02:48mhenning[d]: Maybe the LDCU.TEXUNPACK + TEX is the way they implement cbuf textures now since the old forms for that are gone
02:49gfxstrand[d]: mhenning[d]: I suspect it's that. If they got rid of cbuf from ALU, why keep it for textures? In that case, .B might just be gone.
02:50gfxstrand[d]: Go ahead and disable the cbuf texture optimization for Blackwell+ and we can figure out `ldc.texunpack` later.
02:50mhenning[d]: no, there's still an encoding for .B
02:50gfxstrand[d]: 🤷
02:50mhenning[d]: gfxstrand[d]: I already did that
02:51gfxstrand[d]: Maybe it's some crazy optimized thing, then? The cbuf thing was already "just load a texture handle from a cbuf and use it", though, so IDK what they're optimizing.
02:55mhenning[d]: Yeah, I don't know. It is noteworthy that in an example where the cuda compiler uses .B on older gens, it doesn't generate that anymore on blackwell. I assumed that was just the example not convincing it to use the flag, but I guess it's possible that .B shouldn't be used any more (even though nvdisasm is happy with it)
02:56gfxstrand[d]: I think we need to RE it thoroughly. If we can figure out what `ldc.texunpack` does, we can probably figure out how to plumb it into texture instructions.
02:57gfxstrand[d]: Here's a crazy thought: What if UR4 is a whole vec4? What if it's doing something crazy like loading part of the descriptor or returning the actual descriptor addresses instead of 32-bit handles, hoisting some of that work out of the texture instruction itself.
03:00mhenning[d]: That sounds possible. We do have some 128-bit ureg ops on blackwell, which I think are new
03:02airlied[d]: I suppose I need to dump some things that do multiple textures to see what happens with the URs
03:02airlied[d]: though in this test, it shouldn't matter since nothing else is using the UR range between the two instructions
03:04airlied[d]: if you don't set bit 91, it doesn't decode with nvdisasm
03:31gfxstrand[d]: Yeah but based on alignments and the way other things are RA'd in that shader, it's looking very much like RA is intentionally aligning to 4, which would mean vec3 or vec4.
04:27airlied[d]: not 100% sure but compute dispatch might be using mme now
04:45santosgenesis: 370+328+328+324+362+322−144−72−144−512=1162 370+328+328+324+362+322+1162 3196.00−144−36−1024=1992 370+328+328+324+362+322−1992=42 so short is actually the access method, so those guys talking about compression and blue stuff aren't wrong, so is not skeggsb9778[d] , computation should work too, 370-328 is 42, i am simplifying those terms, HdkR's x86 fex looks quite
04:45santosgenesis: thin (man is also cool ponytail hahaaaaa), but i think wasmati is thinner, so we are going to build a filesystem as well as computation engine for lightning fast computing.
04:46santosgenesis: because that compression is very well possible, i just showed you
04:50airlied[d]: also not seeing inline QMD
04:51gfxstrand[d]: airlied[d]: That wouldn't surprise me. I'd love to dispatch with MME.
04:53airlied[d]: https://paste.centos.org/view/raw/e24ef30b is the MME macro
04:53airlied[d]: I had to hack mme decode to use a compute class
05:09airlied[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12961 did a quick issue summary, not sure if I want to push into this or wait for open docs drop 🙂
05:29airlied[d]: oh there might be a QMD MME macro
05:37airlied[d]: though eyeballing it vs a QMD from nvk doesn't line up much
05:42airlied[d]: oh my eyeballs were looking at wrong thing on nvk side
06:11airlied[d]: okay it is a QMD, but QMD has changed a lot
06:11airlied[d]: I think I might just enter a holding pattern 🙂
07:52airlied[d]: okay we do have hopper texture headers, which might work for us
12:52gfxstrand[d]: QMD should be REable. It's just pain.
14:32asdqueerfromeu[d]: x512[m]: NV2080_CTRL_CMD_GPU_GET_SHORT_NAME_STRING could be used to get the chipset name instead of that nvkmd_nvrm_get_chipset_name() thing (the first 5 characters in the string are what you're looking for)
14:34x512[m]: It returns a slightly different thing, like TU117GL.
14:34x512[m]: But not that important.
14:34asdqueerfromeu[d]: x512[m]: Well that's why getting only the first 5 characters is important here
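The truncation described above can be sketched as follows. This assumes, as the discussion does, that the chipset name is always the first 5 characters of the short-name string; the helper name `chipset_from_short_name` is hypothetical and is not an existing NVK or RM function.

```python
def chipset_from_short_name(short_name: str) -> str:
    """Keep only the first 5 characters, e.g. 'TU117GL' -> 'TU117'.

    Assumption from the chat: chipset names are exactly 5 characters long.
    """
    return short_name[:5]


if __name__ == "__main__":
    print(chipset_from_short_name("TU117GL"))
```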
15:01gfxstrand[d]: mohamexiety[d]: If you want to start REing the QMD, I'd suggest starting with the crucible local-id test and modifying it to poke at QMDs a few different ways:
15:01gfxstrand[d]: 1. Dispatch with different global sizes to figure out where those go. In the Ada QMD, each of the global sizes is just a dword. If you do a few different tests with different, recognizable numbers, you should be able to figure out which goes where.
15:01gfxstrand[d]: 2. Compile two almost identical, but slightly different shaders and then dispatch ABA in the same command buffer, pushing new constants each time. This will hopefully show you where the shader and cbuf0 pointers are. cbuf0 should change every time but the shader pointer should go ABA. If we're lucky, that'll also tell us where all the other cbufs go since they're usually an array.
15:01gfxstrand[d]: 3. Compile the same shader with different local sizes. We need to know where those go.
15:01gfxstrand[d]: 4. Similar shaders but with different register counts. This one's a little harder but there are a few tricks you can use to force a compiler to use lots of registers. We can talk through this one when you get to it.
15:01gfxstrand[d]: 5. Similar shaders but one uses shared memory
15:01gfxstrand[d]: In each case, you run the modified test against the blob, look at the QMDs in the blob's pushbufs, and try to derive information from them. Fortunately, we don't use a lot of the fancier QMD fields for now so we should be able to get working compute shaders without full docs.
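The approach in steps 1, 3 and 5 can be sketched as a small dword-diffing helper. This is only an illustration under stated assumptions: the QMD is treated as a little-endian array of 32-bit dwords, and the dumps, offsets and values below are made up rather than taken from a real pushbuf.

```python
import struct


def dwords(blob: bytes) -> list[int]:
    """Interpret a little-endian byte dump as a list of 32-bit dwords."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}I", blob[: count * 4]))


def find_value(qmd: list[int], value: int) -> list[int]:
    """Dword indices holding a recognizable value (e.g. a global size)."""
    return [i for i, dw in enumerate(qmd) if dw == value]


def changed_dwords(a: list[int], b: list[int]) -> list[int]:
    """Dword indices that differ between two dumps of the same length."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    # Made-up 8-dword "QMDs"; positions 3..5 are hypothetical size fields.
    run1 = dwords(struct.pack("<8I", 0, 7, 0, 0x111, 0x222, 0x333, 0, 9))
    run2 = dwords(struct.pack("<8I", 0, 7, 0, 0x444, 0x222, 0x333, 0, 9))
    print(find_value(run1, 0x222))     # where the second size landed
    print(changed_dwords(run1, run2))  # which dwords moved between runs
```

Using distinct, recognizable numbers for each dimension makes `find_value` unambiguous; `changed_dwords` covers cases where the field is packed or encoded rather than stored verbatim.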
15:06mohamexiety[d]: gfxstrand[d]: alright, will try that out now then. thanks! my first thought was trying out the vk compute cts but this should be much better in terms of controlling for variables
15:06gfxstrand[d]: Yeah. With R/E, it's always better if you can write your own test case.
16:44gfxstrand[d]: Looks like the case for my AMD machine can, in fact, support an RTX 5090. 🥳
16:44gfxstrand[d]: My Intel desktop can't. It's 1mm too short.
16:50orowith2os[d]: that's what angle grinders are for
17:04mohamexiety[d]: hey on the bright side the AMD machine is the better one anyways as the Intel one would bottleneck it
18:16mhenning[d]: gfxstrand[d]: When you get a chance, pls review https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34161
18:16mhenning[d]: It's ~a dozen lines of code for a ~10% improvement in multiple benchmarks. I'd like to see it make it in for 25.1
18:22gfxstrand[d]: Will do. I'll try to review today
19:04mangodev[d]: mhenning[d]: oh yeah, this pr
19:04mangodev[d]: i do love perf boosts like this because even if they're not a lot on their own, they stack up :D
19:06mangodev[d]: speaking of
19:06mangodev[d]: why does the ZCULL branch (reportedly) only run 3% faster than main? i'd think adding support for z prepass would boost performance a lot more than that, especially in the AAA 3D titles it was tested in
19:08mohamexiety[d]: could be we're being held back by something else so the benefits dont show
19:08redsheep[d]: Or that zcull isn't as critical on recent Nvidia hardware as previously assumed
19:50gfxstrand[d]: mhenning[d]: Assigned to Marge. That was easy. It's not like I was going to disagree with the formula you reversed from the blob and which neither of us have the capacity to verify. 😂
21:23airlied[d]: why does tic format fit so nicely into 32-bit, now I have to feel bad about finding ways to break it 😦
21:24airlied[d]: oh maybe I can steal 4 bits from support
21:59mhenning[d]: airlied[d]: Have you seen the differences in clcb97tex.h ? Looks like a bunch of enum variants were removed and a new "v2" texture header format added for hopper
22:05airlied[d]: yeah I've just started writing code for v2 now
22:05airlied[d]: the data type difference is what I needed the 4 bits for
22:28airlied[d]: not 100% sure what maps to what data type, but just getting the basics going first would be a start
22:42gfxstrand[d]: We have hopper TICs?
22:42gfxstrand[d]: airlied[d]: Why do you need to break it?
22:43mhenning[d]: gfxstrand[d]: https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/3d/clcb97tex.h
22:44mhenning[d]: to be honest I don't know what TIC stands for, but we do have texture headers
22:49mohamexiety[d]: I dont remember what it stood for tbh but the TIC was something like a table that stored some relevant values/"settings" of textures. e.g. tic[0] could have been the texture header, tic[1] could have been the texture type, tic[2] could have been row pitch, tic[3] height, etc.
22:50mohamexiety[d]: my memory is a bit hazy since the only time I interacted with it was for linear render stuff a long time ago; and iirc this stuff was in nvc0_tex.c
22:50mohamexiety[d]: (in the gallium driver)
22:51gfxstrand[d]: We should really just rename them to descriptors or headers in the code. IDK that TIC is actually an NVIDIAism. I just copied it from nouveau.
22:52skeggsb9778[d]: those names came long before headers - [T]exture [I]mage/[S]ampler [C]ontrol
22:52gfxstrand[d]: ah
22:53gfxstrand[d]: That makes sense
22:53mohamexiety[d]: yeah
22:53gfxstrand[d]: So back when they were MMIO regs to bind things
22:53skeggsb9778[d]: more just what they got called before nv released any kind of headers
22:53gfxstrand[d]: 👍🏻
22:54gfxstrand[d]: airlied[d]: If I were to dive into Blackwell, where would you like me to start? I don't have hardware of my own at the moment but I could maybe get into matt_schwartz[d]'s box.
22:55gfxstrand[d]: I should have HW in a few weeks but I'm gonna wait until after I move offices.
22:55matt_schwartz[d]: sure i can set you up on it
22:55matt_schwartz[d]: will work on that in a bit
22:55gfxstrand[d]: That'd be swell.
23:01mohamexiety[d]: gfxstrand[d]: I did 1, 3, and 5 btw, but haven't run them on ada yet. will do 2 and 4 tomorrow. for 2 do you mean push constants, or just normal constants in the shaders? (i.e., the difference between the shaders would be the constants). for 3 do you mean different test runs, each run with a different local size? or the same test, but multiple shaders (like 2). and same question for 5 actually;
23:01mohamexiety[d]: same test or different runs?
23:03gfxstrand[d]: for 2, just anything that will force it to update the root descriptor table, i.e. cbuf0. Push constants should do that but descriptor bindings and a few other things probably will, too. Or maybe we don't need to do anything? 🤷 I don't really know. I suppose trying nothing first is a good plan.
23:04gfxstrand[d]: But the idea is to make it so you get the same instruction pointer for two of the runs and 3 different cbuf0 pointers so you can tell which is cbuf0 and which is the instruction pointer.
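The ABA idea can be sketched as a classifier over three dumps. This is an illustration only: the function name `classify_aba` and the sample data are hypothetical, each QMD is assumed to be a list of 32-bit dwords, and a real 64-bit pointer may only differ in one of its two dwords.

```python
def classify_aba(first_a: list[int], b: list[int], second_a: list[int]):
    """Sort dword indices by how they behave across an A, B, A dispatch.

    shader_like: same in both A runs, different in B (instruction pointer?)
    cbuf_like:   different in all three runs (cbuf0 pointer?)
    """
    shader_like, cbuf_like = [], []
    for i, (x, y, z) in enumerate(zip(first_a, b, second_a)):
        if x == z and x != y:
            shader_like.append(i)
        elif x != y and y != z and x != z:
            cbuf_like.append(i)
    return shader_like, cbuf_like


if __name__ == "__main__":
    # Made-up dumps: dword 1 follows the shader, dword 3 changes every time.
    a1 = [0, 0x1000, 5, 0xA000]
    b1 = [0, 0x2000, 5, 0xA100]
    a2 = [0, 0x1000, 5, 0xA200]
    print(classify_aba(a1, b1, a2))
```

Dwords that never change fall into neither list, which keeps the candidates short enough to inspect by hand.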
23:04airlied[d]: gfxstrand[d]: new v2 tic entry has a single data type that is 4bits and encoded different
23:04mohamexiety[d]: ahh I see
23:04airlied[d]: So no per channel data types
23:05gfxstrand[d]: Ah, well that seems fine
23:05airlied[d]: As for Blackwell, qmd and textures are the two big holes, I think ldcu needs more work
23:06gfxstrand[d]: Want me to crack at textures?
23:06airlied[d]: You might want the code ive half written to encode v2
23:06airlied[d]: I'll push wip in an hour once I get back
23:06gfxstrand[d]: Why don't you just leave me your notes at the end of your day today and push somewhere and I'll pick up where you left off tomorrow.
23:07airlied[d]: Was trying to decide between adding another column to the format info or working it out later
23:07airlied[d]: Just adding the column now
23:07gfxstrand[d]: Or maybe we can delete some columns? Having 4 is kinda redundant anyway.
23:07airlied[d]: There may also be Mme work
23:08gfxstrand[d]: gfxstrand[d]: I always hated that it was 4 things anyway
23:10gfxstrand[d]: I can fix that for you easily enough.
23:14mhenning[d]: airlied[d]: I could hack at ldcu encoding if that's helpful. I had assumed it was working on your branch
23:17airlied[d]: I think it's working but Ive low confidence in it esp c[ vs cx[
23:17mhenning[d]: Okay, I'll write some disassembly tests
23:46gfxstrand[d]: gfxstrand[d]: Ugh... Not as redundant as I'd hoped. I think it'll still be okay if we don't care about weird D3D9 formats
23:47gfxstrand[d]: I don't think anything uses those except the Nine state tracker, though, so maybe meh?
23:48gfxstrand[d]: Yup. And literally the only driver to support them is nv50/nvc0. Yeet!
23:48gfxstrand[d]: Oh, and virgl, I guess.
23:49gfxstrand[d]: But I find it hard to believe they actually work there
23:49mhenning[d]: Oh, non-uniform ldc is broken too on blackwell
23:50mhenning[d]: gfxstrand[d]: Yeah, nvc0 has some code in it that has literally never once been tested.
23:50mhenning[d]: It's a toss up whether any given thing is dead code or very important
23:51airlied[d]: gfxstrand[d]: I think the ZS formats need special casing, hence at least initially just adding another column
23:53gfxstrand[d]: I don't think it does.
23:53gfxstrand[d]: I'm about to verify that assumption with the CTS.
23:57gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34415
23:58gfxstrand[d]: First 2 minutes looks good but I'll let the run complete before dropping Draft
23:58gfxstrand[d]: I should probably also throw it at Maxwell over night just to make sure my assumption holds everywhere (it probably does, but let's be safe).