00:37gfxstrand[d]: Someone should figure out host_image_copy on Maxwell/Pascal. I have a crucible test that makes it pretty easy to figure out the GOB structure: https://gitlab.freedesktop.org/mesa/crucible/-/merge_requests/155
00:38gfxstrand[d]: We also need to figure out where Volta and Pascal fall between Maxwell and Turing. I suspect the cutoff is Turing based on the description of the NVIDIA modifiers in fourcc.h but we'll have to test both of them.
00:40gfxstrand[d]: gfxstrand[d]: And by "pretty easy", I mean you just run the tests and it dumps out the GOB layout. Then you add code to nil/copy.rs
03:16dwlsalmeida[d]: finally made the h.264 DPB work in NVK! fyi gfxstrand[d]
03:16dwlsalmeida[d]: we should be able to decode most h.264 videos now
03:16dwlsalmeida[d]: omg finally, this took a bit of time to debug x_x
03:18dwlsalmeida[d]: gfxstrand[d]: hey can I take that? I gotta add a new api to this GOB thing anyway for the video stuff, because we must make the values equal for both Y and UV planes
03:19gfxstrand[d]: dwlsalmeida[d]: Sure. It's a good learning experience if nothing else. And you should still have that Maxwell laptop. karolherbst[d] can run in Volta and Pascal for you when the time comes to verify that they're the same as Maxwell.
03:20gfxstrand[d]: dwlsalmeida[d]: Sweet!
03:20karolherbst[d]: Pascal is probably the same, but Volta is kinda.. uhm.. dunno :
03:20karolherbst[d]: volta is weird
03:20gfxstrand[d]: Yeah
03:20dwlsalmeida[d]: gfxstrand[d]: I still ask myself to this day how did you carry this machine around, this thing weighs 3kg at least
03:21dwlsalmeida[d]: a 3kg laptop, lol
03:21gfxstrand[d]: karolherbst[d]: I think Volta is Maxwell in this case but it's hard to say with Volta.
03:21karolherbst[d]: mhhh
03:22karolherbst[d]: what's `HTEX`?
03:22karolherbst[d]: 😄
03:22gfxstrand[d]: dwlsalmeida[d]: It mostly lived in my roller bag.
03:22karolherbst[d]: `NVC597_TEXHEAD_BL_TEXTURE_TYPE_HTEX_` is new with Turing
03:22gfxstrand[d]: I have no idea
03:24gfxstrand[d]: dwlsalmeida[d]: I never carried it around as a laptop. Hell, some days I regret switching to the Alienware that I've got now from the XPS 13. It's nice having an NVIDIA GPU with me but still...
03:25karolherbst[d]: "per-halfedge texturing (Htex)" does that make any sense?
03:27karolherbst[d]: ahh found something with at nvidia
03:28karolherbst[d]: `When NV_D3D12_RESOURCE_FLAG_HTEX is set, the texels are centered on integer coordinates and filtering and LOD are calculated based on the size minus one, which then allows the edges to filter to the exact texels on the edge, eliminating the border/edge filtering issue. Dimension of next mip level is CEIL(currentMipDimension/2), and size of smallest mip is 2x2.`
03:28karolherbst[d]: https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gga5dc0df8def0c24d308cbe40ce496b562a9bd3b59da57f90b8ce7f8ac0448999fd
03:29karolherbst[d]: not sure if that's interesting for anything though 🙃
03:40gfxstrand[d]: No idea. They've not added that to Vulkan as far as I know. I'm sure Jeff will find and add it one of these days, though. 🙃
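For what it's worth, the `CEIL(currentMipDimension/2)` rule in that NVAPI quote gives a slightly different mip chain than the usual floor-halving down to 1x1. A quick sketch of just the quoted behavior (illustrative only, not tied to any driver code):

```rust
// Mip chain per the NVAPI HTEX description: next dim = ceil(cur / 2),
// and the smallest mip is 2x2 (vs the usual floor(cur / 2) down to 1x1).
fn htex_mip_dims(mut w: u32, mut h: u32) -> Vec<(u32, u32)> {
    let mut dims = vec![(w, h)];
    while w > 2 || h > 2 {
        // (d + 1) / 2 is ceil(d / 2) for unsigned integers
        w = ((w + 1) / 2).max(2);
        h = ((h + 1) / 2).max(2);
        dims.push((w, h));
    }
    dims
}
```

e.g. a 5x5 HTEX gets mips 5x5, 3x3, 2x2, where ordinary floor-halving would go 5x5, 2x2, 1x1.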
12:03dwlsalmeida[d]: gfxstrand[d]: you mean this right https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_host_image_copy.html
12:03dwlsalmeida[d]: I'll work on that later tonight
12:11mohamexiety[d]: dwlsalmeida[d]: yup. it's all wired up and done already, the main thing missing is maxwell/pascal GOB support here: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/nil/copy.rs?ref_type=heads#L161
12:11mohamexiety[d]: (that one has the Turing+ GOB format, you want to add a different implementation for the older cards)
12:12mohamexiety[d]: Faith's test outputs the GOB layout directly so assuming no other weirdness with maxwell, that should be it
15:18gfxstrand[d]: Yes
15:35gfxstrand[d]: I believe I already landed the patch to add a GOBType enum to nil::Tiling so we should be able to make different GOBTypes for Turing and Maxwell.
15:37mohamexiety[d]: yup!
15:38mohamexiety[d]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/nil/tiling.rs?ref_type=heads#L14
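The shape gfxstrand describes could look roughly like this; `GobType` and the method name here are made up for illustration and are not the actual `nil::Tiling` API:

```rust
// Illustrative sketch only: names are hypothetical, not the real nil code.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum GobType {
    Turing,  // Turing+ layout, already handled in nil/copy.rs
    Maxwell, // byte swizzle still to be derived from the crucible test dump
}

impl GobType {
    /// Both generations use 64-byte-wide, 8-row (512 B) GOBs; what differs
    /// is the byte swizzle inside the GOB, so copy code can dispatch on the
    /// enum while sharing all the block-level math.
    fn extent_b(self) -> (u32, u32) {
        match self {
            GobType::Turing | GobType::Maxwell => (64, 8),
        }
    }
}
```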
18:26airlied[d]: dwlsalmeida[d]: nice! H265 is probably not a massive amount more
18:46airlied[d]: Video decode doesn't use shaders
18:56redsheep[d]: Not certain how the firmware relates but I know video stuff goes through a separate hardware queue and hardware block, the shaders aren't involved as airlied said
18:57redsheep[d]: Shader cores wouldn't be good at the job anyway
19:01airlied[d]: Just nvdec and nvenc queues, only works on GSP enabled GPUs since GSP has the fws
19:01avhe[d]: airlied[d]: h265 will require some bitstream parsing though, for this field <https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/video/nvdec_drv.h#L462>
19:02redsheep[d]: We have the gsp
19:03airlied[d]: Pre-GSP, it's doubtful we will ever get it going unless someone really cares and works it out
19:05redsheep[d]: pavlo_kozlenko[d]: I mean if gsp is good enough for nvidia to make opengpu work it's good enough for nouveau
19:07redsheep[d]: Cuda cores don't get used with video either, no shaders or compute iiuc, at least not for decoding anything the dedicated hardware can do
19:07redsheep[d]: Unless you're talking about the weird HDR or upscaling stuff that nvidia has on windows with video
19:08redsheep[d]: That does involve compute
19:13airlied[d]: The cuda cores are just shader cores, or do you mean matrix units, which are also just shader accessible
19:14airlied[d]: mohamexiety[d]: were you looking at coop matrix?
19:15mohamexiety[d]: nah that was marysaka[d] iirc
19:15mohamexiety[d]: mine is compression/large pages. just fell ill a while back so things are slow 😦
19:16marysaka[d]: yeah I'm still on that but mostly got things working... apart from scheduling issues
19:17redsheep[d]: Yes
19:18redsheep[d]: They're just special instructions
19:18redsheep[d]: It's marketing
19:26magic_rb[d]: What can a matrix unit do? Given two matrices, add them, multiply them?
19:26magic_rb[d]: So GPUs which don't have to do matrix multiplication manually have "tensor cores"
19:30magic_rb[d]: Okay so tensor cores and cuda cores are just marketing bs
19:30magic_rb[d]: Thanks
20:01dwlsalmeida[d]: avhe[d]: A Rust parser to go with the other Rust stuff we have in there you say?
20:01avhe[d]: yes, that's one of the places where rust would shine :ferris:
20:02avhe[d]: in my ffmpeg code i simply patched the hevc parser to retrieve that value, but in a vulkan-video driver you're just handed the bitstream so you have to re-do the job
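The parsing job avhe describes boils down to a bit reader plus Exp-Golomb decoding, the core primitive any HEVC header parser needs. A minimal sketch (not cros-codecs or NVK code, and ignoring emulation-prevention bytes):

```rust
// Minimal MSB-first bit reader with ue(v) Exp-Golomb decode, the basic
// building block for pulling individual fields out of an HEVC bitstream.
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // position in bits
}

impl<'a> BitReader<'a> {
    fn read_bit(&mut self) -> u32 {
        let byte = self.data[self.pos / 8];
        let bit = (byte >> (7 - (self.pos % 8))) & 1;
        self.pos += 1;
        bit as u32
    }

    fn read_bits(&mut self, n: u32) -> u32 {
        (0..n).fold(0, |acc, _| (acc << 1) | self.read_bit())
    }

    /// ue(v): count leading zero bits, read that many suffix bits,
    /// value = 2^zeros - 1 + suffix.
    fn read_ue(&mut self) -> u32 {
        let mut zeros = 0;
        while self.read_bit() == 0 {
            zeros += 1;
        }
        (1 << zeros) - 1 + self.read_bits(zeros)
    }
}
```

A real parser would also strip the 0x000003 emulation-prevention bytes before reading, but the field-extraction logic is all variations on the above.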
20:04dwlsalmeida[d]: I have the parser I wrote for cros-codecs, which is in Rust already..I can probably take some things from there I think
20:05dwlsalmeida[d]: Google took over the project and pushed a bunch of Android specific stuff in there, which sucks
20:05avhe[d]: i *think* the av1 structure has something similar but i have no idea how it's computed
20:05dwlsalmeida[d]: otherwise we could have linked against it from NVK
20:56dwlsalmeida[d]: gfxstrand[d]: can anybody explain this GOB business to me? I read from the Tegra TRM that it's a 3d grid of data, right
20:56dwlsalmeida[d]: where two of the dimensions are fixed, and the "height" is variable
20:57dwlsalmeida[d]: how is this different from a 2d tile? Why this extra dimension?
21:01redsheep[d]: magic_rb[d]: The meaning of the word "core" with respect to a GPU is really fuzzy, so I'm not sure I'd call it BS, but it's not exactly meaningful either. The marketing words are just supposed to say something about the throughput of the hardware, and in that way it's not wrong. The tensor/matrix "cores" (the matrix-multiply hardware needed to implement those instructions) do make the GPU faster for
21:01redsheep[d]: that task.
21:02magic_rb[d]: no i get that, i just really dislike marketing in general, and more things, but then we're getting into politics and im not mentally well enough for that
21:06gfxstrand[d]: dwlsalmeida[d]: It's two levels of tiling. Images are made out of tiles. Tiles are made out of GOBs and GOBs are made out of bytes.
21:08gfxstrand[d]: redsheep[d]: It's not BS. Even though they're exposed as instructions in the shader, they go through a different HW unit than normal float or integer math. Doubles (fp64) are also a separate unit. On the higher-end chips, they basically one-to-one match the shader cores and run "full rate". On desktop chips, the hardware is there, there's just less of it and the shader cores have to share, so you
21:08gfxstrand[d]: get a lot lower throughput.
21:10magic_rb[d]: okay so im not denying they exist, what i mean is, call a spade a spade. brand name vs actual name
21:10dwlsalmeida[d]: gfxstrand[d]: Why can they be stacked vertically (according to the tegra docs) ? That’s very weird
21:10gfxstrand[d]: dwlsalmeida[d]: I don't know what you mean by "stack vertically"
21:10HdkR: Pipeplines?! In my GPU?! :D
21:11dwlsalmeida[d]: gfxstrand[d]: Let me quote that, just a min
21:11redsheep[d]: gfxstrand[d]: Doubles are actually separate and not just another mode that only some of the units have?
21:12gfxstrand[d]: redsheep[d]: I believe so, yes.
21:12redsheep[d]: You can't actually get fp32 and fp64 to run in parallel though right? So in the end it's kind of a wash
21:13gfxstrand[d]: Oh, it very much matters. That's why we have to use the variable latency instruction barriers on all double ops.
21:13karolherbst[d]: some GPUs just have very few fp64 units
21:13redsheep[d]: Right ok
21:13dwlsalmeida[d]: gfxstrand[d]: ```
21:13dwlsalmeida[d]: Blocks themselves are arranged from GOBs, which are Groups of Bytes. A Maxwell GOB, as used in Tegra X1, is 512 bytes
21:13dwlsalmeida[d]: arranged as 64x8 bytes. GOBs are stacked vertically to form a block. The number of GOBs stacked vertically in a block is
21:13dwlsalmeida[d]: controlled by an additional surface parameter called the block height. The recommended block height for most buffers on Tegra
21:13dwlsalmeida[d]: X1 is 16 GOBs (128 lines). This supports 8x8 bank interleaving. For buffers for which linear display access is more important
21:13dwlsalmeida[d]: than access by GPU, VIC, or other block-oriented engine, block height may be set to 1 GOB, providing more locality for the
21:13dwlsalmeida[d]: display client.
21:13dwlsalmeida[d]: ```
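The layout that TRM quote describes can be written down directly. A sketch (illustrative only; it deliberately ignores the byte swizzle *inside* each GOB, which is exactly the part the crucible test dumps):

```rust
// Block-linear addressing per the Tegra TRM quote: a Maxwell GOB is
// 64 bytes wide x 8 rows = 512 B, and a block is block_h_gobs GOBs
// stacked vertically. Intra-GOB byte swizzle is not modeled here.
const GOB_W_B: u32 = 64;
const GOB_H: u32 = 8;

/// Byte offset of the start of the GOB containing byte (x_b, y) in a 2D
/// surface that is width_b bytes wide, for a given block height in GOBs.
fn gob_offset(x_b: u32, y: u32, width_b: u32, block_h_gobs: u32) -> u32 {
    let block_h = GOB_H * block_h_gobs; // lines covered by one block
    let blocks_per_row = (width_b + GOB_W_B - 1) / GOB_W_B;
    let block_idx = (y / block_h) * blocks_per_row + x_b / GOB_W_B;
    let gob_in_block = (y % block_h) / GOB_H; // the vertical stacking
    (block_idx * block_h_gobs + gob_in_block) * (GOB_W_B * GOB_H)
}
```

With `block_h_gobs = 1` the GOBs march left-to-right across each 8-line strip (the display-friendly case in the quote); with `block_h_gobs = 16` you get 16 GOBs stacked vertically before moving right, which is the "128 lines" recommended layout.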
21:13karolherbst[d]: so I suspect the barriers are there because the operations are probably queued
21:14gfxstrand[d]: Integer and fp32 are also different and I think fma is magic.
21:14karolherbst[d]: mhhh
21:14karolherbst[d]: not quite
21:14gfxstrand[d]: And scalar might be separate, too.
21:14karolherbst[d]: there is alu and fp32 (mostly ffma, but not only)
21:14gfxstrand[d]: Hardware does not work the way you think it does from taking classes at school.
21:14gfxstrand[d]: 😅
21:15dwlsalmeida[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1313614658157940836/Screenshot_2024-12-03_at_18.14.44.png?ex=6750c67c&is=674f74fc&hm=ed935eb3393b8158f5310229474a57a9dc96a7b26da39128b107b984349bd252&
21:15karolherbst[d]: `FMNMX` is alu e.g.
21:15gfxstrand[d]: dwlsalmeida[d]: Ok, that's just a very weird way of trying to describe the fact that tiles are made out of GOBs. (What I call a tile, the Nvidia docs call a block.)
21:16karolherbst[d]: (and `FSEL`, `FSET`, etc..)
21:16dwlsalmeida[d]: gfxstrand[d]: yeah but in the pic, the GOBs appear one on top of the other, should I just disregard that?
21:17redsheep[d]: gfxstrand[d]: Reminds me of that old quote about naming things being the hardest part of engineering
21:17karolherbst[d]: `IDP` is on the `fma` unit 🙃
21:17gfxstrand[d]: dwlsalmeida[d]: Oh, that's absolutely what it looks like if `Tiling::y_log2 = 1`.
21:17karolherbst[d]: and so is `IMAD`
21:17karolherbst[d]: anyway
21:17karolherbst[d]: it makes no logical sense
21:18gfxstrand[d]: Of course IMAD is on the FMA unit. That's where all the fast multipliers are.
21:18gfxstrand[d]: You wouldn't want to waste them
21:18redsheep[d]: I mean it makes sense, fast integer multiply is limited to 22 or 24 bits or something, right?
21:18karolherbst[d]: alu and fma have the same throughput tho
21:19redsheep[d]: I think amd might have full 32 bit integer multiply at full rate
21:19karolherbst[d]: I'm sure the reason on why things are the way they are is "it was cheaper to put the gates together this way"
21:20redsheep[d]: Yeah why have the hardware twice
21:20HdkR: Stick all the slow stuff inside the SFU where no one needs to look at it
21:20karolherbst[d]: true
21:20karolherbst[d]: ohh good point tho
21:20karolherbst[d]: hah
21:21karolherbst[d]: `MUFU` and `POPC` are on the same thing 🙃
21:21dwlsalmeida[d]: gfxstrand[d]: again, the drawing shows the GOBs on top of each other, I assume that when you say
21:21dwlsalmeida[d]: > Ok, that's just a very weird way of trying to describe that fact that tiles are made out of GOBs.
21:21dwlsalmeida[d]: >
21:21dwlsalmeida[d]: You mean that this "stacking" is just for illustration purposes? or should I consider this as a 3d array?
21:21karolherbst[d]: and conversions as well
21:23gfxstrand[d]: dwlsalmeida[d]: There are three levels of tiling and they nest. An image is a 3D array of blocks, a block is a 3D array of GOBs, and a GOB is a 2D (usually) array of bytes. When you're working with a 2D image with y_log2=1, it looks exactly like that picture with the GOBs in pairs, one above the other.
21:24gfxstrand[d]: karolherbst[d]: Yeah, MUFU is the random weird stuff unit. 🙃
21:26gfxstrand[d]: But circling back to the original topic of CUDA cores vs. tensor cores vs. double precision rate, tensor and doubles are on separate units precisely so that they can put fewer of them into the consumer cards to save silicon at the cost of throughput without having to resort to crazy shader shenanigans to emulate the functionality.
21:26karolherbst[d]: MUFU isn't even real
21:27karolherbst[d]: though "fast" emulation of fp64 on top of f32 is pretty cheap tbh
21:27karolherbst[d]: and good enough
21:28gfxstrand[d]: gfxstrand[d]: Because resorting to shader shenanigans is insane. We had to deal with that on Intel when they deleted FP64 support. It's not fun. Emulating tensor cores would be similarly unfun.
21:28karolherbst[d]: there were some algos on how to do it without losing too much precision
21:28gfxstrand[d]: karolherbst[d]: Legit evaluation is crazy. Like 100-200 instructions for a single FMA.
21:28karolherbst[d]: yeah, I don't mean perfect emulation
21:28karolherbst[d]: just good enough
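The "good enough" scheme being alluded to is the classic float-float representation built on Dekker/Knuth error-free transforms: carry a value as an unevaluated sum of two f32s. A minimal sketch (not driver code, and nowhere near IEEE-correct fp64):

```rust
// Error-free addition: s + err == a + b exactly, with s = fl(a + b).
// This is the standard two-sum building block for float-float arithmetic.
fn two_sum(a: f32, b: f32) -> (f32, f32) {
    let s = a + b;
    let bb = s - a;
    let err = (a - (s - bb)) + (b - bb); // exact rounding error of a + b
    (s, err)
}

/// Add two float-float values (hi, lo); gives roughly double the precision
/// of a single f32, which is the "good enough" regime, not real fp64.
fn df_add(a: (f32, f32), b: (f32, f32)) -> (f32, f32) {
    let (s, e) = two_sum(a.0, b.0);
    let e = e + a.1 + b.1;
    two_sum(s, e) // renormalize so hi holds as much of the sum as possible
}
```

Note this only works when the compiler doesn't contract or reassociate the float ops, which is why it maps cleanly onto shader ISAs with strict FP semantics.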
21:29gfxstrand[d]: Good enough for what? Passing CTS? I doubt it.
21:29karolherbst[d]: yeah, I doubt it's good enough to pass that
21:30karolherbst[d]: it's wild why GL requires fp64 anyway 😄
21:30gfxstrand[d]: Also, that emulation assumes you still have nice things like FP32<->FP64 conversion instructions. Intel deleted those, too.
21:30karolherbst[d]: that's kinda rude tbh
21:30karolherbst[d]: 😄
21:30gfxstrand[d]: The Nvidia plan of just putting in less of the same hardware is the good plan.
21:31karolherbst[d]: the better plan would have been to simply not put it into opengl core 🙃
21:32redsheep[d]: karolherbst[d]: That's the kind of hack you want to leave to applications willing to consider the tradeoffs
21:33gfxstrand[d]: karolherbst[d]: Nah, compute wants it. I mean, it shouldn't have gone into OpenGL as a required feature but having it in the hardware and exposing that hardware makes sense.
21:33karolherbst[d]: it's optional in OpenCL
21:33gfxstrand[d]: Yeah. Because no one wants that shit on mobile
21:34karolherbst[d]: it's also optional for the full profile
21:34gfxstrand[d]: It's really not worth the die area there
21:35karolherbst[d]: though I should properly support fp64 at some point, but.. uhm... claiming to support it is a giant pain, because it's insane
21:37karolherbst[d]: fp64 division requires 0 ulp precision 🙃
21:38karolherbst[d]: though might even make sense if everybody is doing that in software anyway
21:38karolherbst[d]: but anyway.. requiring it for Opengl was a mistake
21:40dwlsalmeida[d]: gfxstrand[d]: how does this relate with the image's width, height and depth? if `depth==1`, is the image still a 3D array of blocks?
21:44gfxstrand[d]: dwlsalmeida[d]: It is but then it's only one block deep so it's kinda 2D at that point.
21:45gfxstrand[d]: You can always treat it as 3D and just let some dimensions be 1.
21:48dwlsalmeida[d]: is the `depth` value equivalent to the number of planes? So, for NV12 data, we'd have `depth == 2`?
22:05gfxstrand[d]: No
22:05gfxstrand[d]: Planes are two separate images.
22:05gfxstrand[d]: Depth is just depth
22:05gfxstrand[d]: 3D images are a thing
22:30airlied[d]: dwlsalmeida[d]: I think we have some stuff in the core vulkan video code that might already do some of that parsing
22:30airlied[d]: though we do parse the hevc bitstream for intel, it's actually not something we should be doing, so we should look at what nvidia do there
22:31airlied[d]: as the bitstream is usually in a vkbuffer that might not even exist at the time of command stream recording
23:22airlied[d]: gfxstrand[d]: considered https://registry.khronos.org/vulkan/specs/latest/man/html/VK_NV_shader_sm_builtins.html ?