00:01HdkR: ooo, rusticl on Zink. That's a powerful OpenCL driver right there
00:03DavidHeidelberg[m]: I wonder if some highend nvidia with Zink -> rusticl would beat radeonsi in LLM
00:26alyssa: HdkR: ruzticl! \ o /
00:27HdkR: :D
00:30zmike: no, no, no, you know the z has to replace a vowel
00:31airlied: rustzcl
00:32idr: rustizl fo shizzle
00:36zmike: let's just shorten to rzl
00:43DemiMarie: Is it just me or is it rather silly for Zink to convert NIR to SPIR-V only for the Vulkan frontend to convert SPIR-V back to NIR again? It seems like there should be some sort of Mesa-only extension that allows short-circuiting this when Mesa is also the Vulkan implementation.
00:47youmukonpaku133: that certainly does sound very silly to do
00:49zmike: it's been a topic of multiple blog posts
00:51alyssa: rzstzcl
00:58DemiMarie: zmike: any links?
01:03youmukonpaku133: DemiMarie: supergoodcode.com you'll find it there
01:33kisak: DemiMarie: doesn't sound silly to me when game devs can bake zink into their game, and shortly after the NIR running inside the game isn't ABI compatible with the NIR in mesa.
01:35DemiMarie: kisak: what about when Zink is built as part of Mesa to support PowerVR? There the ABI _is_ guaranteed to match, so the round trip really is useless.
01:56HdkR: 14
02:02chaserhkj: umm, I might have moved too fast by posting this on the issue tracker: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9685 is this appropriate?
02:03chaserhkj: also, can anyone confirm if we really don't have such bypasses in mesa?
07:32cmarcelo: anyone know if there is a particular reason vkrunner doesn't vendor vulkan headers (like mesa does)? it does make updating make-*.py scripts particularly hard, since they need to support an unspecified range of vulkan header versions...
07:53sima: airlied, can you ping me when you've done the msm pr and backmerge?
07:53sima: so I can rebase the drm-ci branch
07:53sima: robclark, koike ^^ if you have any updates for result files please incremental patch
09:00airlied: sima: oh yeah I was going to do that, put it on top for tmrw
09:09eric_engestrom: alyssa: for dEQP-EGL.functional.create_context.no_config IIRC it's a broken test, it assumes that if any GLES version is supported then all of them must be
09:09eric_engestrom: I think there's an issue opened in the khronos tracker already, let me have a look
09:10eric_engestrom: as for dEQP-EGL.functional.create_context.rgb888_no_depth_no_stencil it doesn't ring a bell, I'll have a look when I have some time
09:15eric_engestrom: alyssa: https://gitlab.khronos.org/Tracker/vk-gl-cts/-/issues/3816 (reported by anholt)
10:14alyssa: eric_engestrom: Alright. How should we proceed then?
10:53karolherbst: soooo... nir_op_vec8 and nir_op_vec16 lowering. `nir_lower_alu_width` doesn't lower them on purpose and I was wondering what would be the best idea to deal with those
10:55karolherbst: I can also just lower them in rusticl though
11:01eric_engestrom: alyssa: for dEQP-EGL.functional.create_context.no_config I don't think there's a way to query which GLES version is supported by a given driver, so I'm inclined to consider the whole test invalid and it should be deleted
11:03eric_engestrom: but that's as far as I got when I looked into that a long time ago, maybe I missed something and it's possible to make this test valid
11:11alyssa: karolherbst: you shouldn't be getting them at all, if you lower width of all ALU and memory and then run opt_dce
11:11karolherbst: yeah.. that's what I thought, but I still have them.. but maybe it's just bad pass ordering... I'll deal with it once I've dealt with other vulkan validation errors :D
11:27alyssa: karolherbst: Still have them in what context?
11:27alyssa: What is reading them?
11:28alyssa: Oh, one other thing, you need to copyprop too
11:29karolherbst: yeah.. I think I do all of this, might just be in the wrong order
11:29karolherbst: or something weird
11:30karolherbst: or maybe something I've done fixed it now... I have to find a test again which runs into this
11:30alyssa: yeah
11:30alyssa: This should work
12:23zamundaaa[m]: robclark: btw what happened to the dma fence deadline stuff?
12:44Lynne: DavidHeidelberg[m]: 250 tokens, 26 seconds for rusticl on zink on a 6000 ada, 19 seconds on 6900XTX (rusticl on radeonsi)
12:46austriancoder: alyssa: which nir pass lowers vec16, vec8 to vec4?
12:46alyssa: none
12:46alyssa: but if you lower everything that reads vec16 into smaller things, then there's nothing left to write vec16 either
12:51austriancoder: ah..
13:05robclark: zamundaaa[m]: all the non-uabi stuff landed.. the uabi stuff is just waiting for someone to implement some userspace support
13:06zamundaaa[m]: I can take care of that
13:19emersion: gfxstrand: would VK_EXT_host_image_copy be useful for a wayland compositor?
13:29Lynne: I can't think of a use for it, besides less blocky image downloads (copy to optimally tiled host image, convert layout on the CPU?)
13:30emersion: i think the big win would be to get rid of the staging buffer on setups where that makes sense
13:32emersion: this ext looks a lot like OpenGL's "here's a pointer to my host memory, plz do whatever to upload to GPU"
13:32Lynne: you can do that already by host mapping
13:32emersion: you mean VK_EXT_external_memory_host?
13:32Lynne: yes
13:33Lynne: you don't even need page size alignment for the source pointer
13:33emersion: i never managed to fully understand how this ext works
13:33Lynne: create a buffer and allocate memory for it, only instead of allocating, chain a struct to import a host memory pointer
13:33emersion: i got some stride alignment issues iirc, and no way to discover the required alignment
13:33emersion: or something
13:33Lynne: that then allows you to let the GPU copy
13:34Lynne: nnnope
13:34emersion: would need to check again
13:34Lynne: you can work around all alignment issues
13:34emersion: also VK_EXT_external_memory_host might require pinned memory when VK_EXT_host_image_copy doesn't?
13:34emersion: not sure
13:34Lynne: pinned memory?
13:35Lynne: host_image_copy is slower, not sure why you'd want to use it
13:35karolherbst: mhh
13:35karolherbst: so the vec8 thing I'm seeing looks like this:
13:35karolherbst: 8x8 %66 = vec8 %51, %53, %55, %57, %59, %61, %63, %65
13:35karolherbst: 8x4 %67 = iadd %49.abcd, %66.abcd
13:36karolherbst: 8x4 %68 = iadd %49.efgh, %66.efgh
13:36emersion: the host memory region can't be evicted from RAM while the driver does the upload, so that makes the kernel unhappy
13:36emersion: i'm not sure about the details
13:36Lynne: it's a mapping, not a buffer
13:37Lynne: it's not even a cached mapping
13:37emersion: VK_EXT_external_memory_host is not a mapping, it's "here's my host pointer, please make a VkBuffer from it", no?
13:37Lynne: yes, it's up to how the driver implements it
13:39Lynne: as for alignment, like I said, although the extension requires the source pointer to be aligned, you can simply pick any nearby address which is aligned, then offset the start point during the copy buffer to image command
13:39Lynne: both libplacebo and ffmpeg have been doing this for years with no bugs on any platform
13:39haasn: yeah it's UB but it works on all platforms I care about
13:40austriancoder: karolherbst: If you find a solution.. tell me
13:40austriancoder: 32x16 %76 = vec16 %19, %22, %25, %28, %31, %35, %39, %43, %47, %51, %55, %59, %63, %67, %71, %75
13:40austriancoder: 32x4 %82 = fmax %76.abcd, %79.xxxx
13:40austriancoder: 32x4 %83 = fmax %76.efgh, %79.xxxx
13:40austriancoder: 32x4 %84 = fmax %76.ijkl, %79.xxxx
13:40austriancoder: 32x4 %85 = fmax %76.mnop, %79.xxxx
13:41karolherbst: I don't, I think doing this on a vectorized shader just isn't supported
13:41emersion: Lynne: but how do you find out what is the required alignment?
13:41emersion: oh no i remember now
13:41emersion: VkPhysicalDeviceExternalMemoryHostPropertiesEXT.minImportedHostPointerAlignment exists
13:42karolherbst: I mean.. what should copy_prop do in this case?
13:42emersion: however
13:42karolherbst: it all neatly works once I just scalarize alus
13:42emersion: there is no way to communicate the stride when importing
13:42emersion: VkImportMemoryHostPointerInfoEXT just has a host pointer and that's it
13:43austriancoder: karolherbst: for my vec4 hw I would love to have 4x vec4 instead of the one vec16
13:43karolherbst: austriancoder: anyway, I kinda prefer dealing with all of this in the frontend; unless you have other use cases for vec8/16 you can probably start dropping the code
13:43karolherbst: yeah...
13:43karolherbst: zink also doesn't want to scalarize
13:43Lynne: emersion: you communicate the stride during the buffer copy command
13:43karolherbst: I suspect we'll have to handle vec8/vec16
13:43Lynne: same as usual
13:43austriancoder: karolherbst: I would love if you can do it in the frontend (rusticl)
13:43emersion: eh
13:43emersion: that's right
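A minimal sketch of the import-plus-offset approach Lynne describes: align the shm pointer down to minImportedHostPointerAlignment, import the rounded-down range with VK_EXT_external_memory_host, and absorb the remainder and the stride in the copy region. Placeholder names (`device`, `cmd`, `image`, `shm_ptr`, `shm_size`, `stride_in_pixels`, `pick_memory_type()`) are assumptions; error handling and buffer creation details are omitted.

```c
/* Assumes minImportedHostPointerAlignment was queried once via
 * VkPhysicalDeviceExternalMemoryHostPropertiesEXT (see above). */
VkDeviceSize align = host_props.minImportedHostPointerAlignment;
uintptr_t base = (uintptr_t)shm_ptr & ~(uintptr_t)(align - 1); /* rounded-down start */
VkDeviceSize offset = (uintptr_t)shm_ptr - base;               /* leftover offset   */
VkDeviceSize import_size = (shm_size + offset + align - 1) & ~(align - 1);

VkMemoryHostPointerPropertiesEXT ptr_props = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT,
};
vkGetMemoryHostPointerPropertiesEXT(device,
    VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
    (void *)base, &ptr_props);

VkImportMemoryHostPointerInfoEXT import = {
    .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT,
    .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
    .pHostPointer = (void *)base,
};
VkMemoryAllocateInfo alloc = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .pNext = &import,
    .allocationSize = import_size,
    .memoryTypeIndex = pick_memory_type(ptr_props.memoryTypeBits), /* placeholder */
};
VkDeviceMemory mem;
/* This can fail (e.g. for pointers that are themselves mapped GPU memory,
 * as discussed below), so fall back to a staging buffer on error. */
vkAllocateMemory(device, &alloc, NULL, &mem);

/* Bind to a VkBuffer created with VkExternalMemoryBufferCreateInfo chained in,
 * then copy; the stride goes in the copy region, not the import. */
VkBufferImageCopy region = {
    .bufferOffset = offset,
    .bufferRowLength = stride_in_pixels, /* placeholder: shm stride */
    .imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
    .imageExtent = { width, height, 1 },
};
vkCmdCopyBufferToImage(cmd, wrapped_buf, image,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
```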
13:43karolherbst: I already have a patch
13:44karolherbst: austriancoder: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24839/diffs?commit_id=49b480d5cf888090d1d47eb97f48150b26a512a4
13:44karolherbst: that should make almost everything work
13:44emersion: last time i looked at this i was a vulkan beginner :S
13:44karolherbst: there are just a handful of tests where it doesn't
13:44emersion: okay, i'll give it another go
13:45emersion: Lynne: also, for a wayland compositor which doesn't want to block, offloading host memory copies to a separate thread would make sense right?
13:46Lynne: for downloads, which are required to be synchronous, yes, otherwise for uploads, no, it's the GPU doing them
13:46Lynne: (not that they're required to be synchronous, it's just that most things you'd like to do with a buffer want it to be fully written already, if you carry around a fence it's not a problem)
13:47Lynne: you can look at the ffmpeg code as a reference on importing: https://github.com/cyanreg/FFmpeg/blob/vulkan/libavutil/hwcontext_vulkan.c#L3424-L3477
13:47emersion: oh very nice
13:48emersion: Lynne: why are downloads required to be synchronous?
13:48emersion: can't you queue the copy to the host VkBuffer?
13:48emersion: and then carry on?
13:48Lynne: they're not, but if the thing on the other end doesn't understand fences, you have to wait
13:48Lynne: for wayland, it's not a problem
13:49emersion: so, the thing on the other side is just my own code, so that's probably fine
13:49Lynne: sure
13:49emersion: i basically want a VkSemaphore which i can export as a sync_file and integrate in my event loop
13:49emersion: to wake a callback when the copy is done
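A minimal sketch of that, assuming VK_KHR_external_semaphore_fd is available; `device`, `queue`, and `cmd` are placeholders.

```c
/* Create a binary semaphore that can be exported as a sync_file, signal it
 * from the submit that performs the copy, then export the fd and poll it
 * from the compositor's event loop. */
VkExportSemaphoreCreateInfo export_info = {
    .sType = VK_STRUCTURE_TYPE_EXPORT_SEMAPHORE_CREATE_INFO,
    .handleTypes = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
};
VkSemaphoreCreateInfo sem_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    .pNext = &export_info,
};
VkSemaphore sem;
vkCreateSemaphore(device, &sem_info, NULL, &sem);

VkSubmitInfo submit = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmd,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &sem,
};
vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

/* SYNC_FD exports have copy transference: the fd snapshots the pending
 * signal operation, so export after the submit that signals the semaphore. */
VkSemaphoreGetFdInfoKHR get_fd = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
    .semaphore = sem,
    .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
};
int sync_fd = -1;
vkGetSemaphoreFdKHR(device, &get_fd, &sync_fd);
/* Add sync_fd to the event loop; it becomes readable when the copy is done. */
```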
13:50Lynne: by the way, if you try to import exported memory, vkGetMemoryHostPointerPropertiesEXT will succeed, but buffer creation will fail
13:50Lynne: so you should deal with failure there
13:50emersion: import exported memory?
13:50austriancoder: karolherbst: thx
13:50Lynne: yes, if earlier on some code maps GPU memory to a pointer and you're passed that pointer, trying to import it will fail
13:50emersion: what does that mean exactly? the ext only does import, not export
13:51karolherbst: but anyway.. I don't see a solution besides handling vec8/16 explicitly. Maybe we need a flag for nir_lower_alu_width to toggle it
13:51Lynne: I've been asking for someone here to fix vkGetMemoryHostPointerPropertiesEXT finally, but no one seems interested :(
13:51karolherbst: and then if you scalarize anyway you just disable vec8/16 handling
13:52emersion: okay, basically, if the pointer i pass to VK_EXT_external_memory_host is "bad", then the VkBuffer creation fails
13:52emersion: and i can fall back to what i'm doing right now
13:52emersion: good to know
13:52karolherbst: or we handle it differently.. dunno
13:52karolherbst: I think the issue with touching vec8/16 there (or any vecs) is that it messes around with vectorized load/stores
13:53emersion: eh i just realized i don't really need to plumb callbacks in my API even
13:54emersion: i can just wait on the fence GPU-side when the texture is first sampled from
13:54emersion: i do need to internally have a callback to release the shm buffer
13:56Lynne: you just have to make sure the buffer exists from the time of the call to the point you wait on the fence
13:56emersion: yup
14:14gfxstrand: emersion: For SHM clients, sure.
14:14gfxstrand: It'd actually work pretty swell for SHM
14:19emersion: gfxstrand: VK_EXT_external_memory_host should probably be preferred when available?
14:19emersion: VK_EXT_external_memory_host, then VK_EXT_host_image_copy, then regular two-stage copy with staging buffer?
14:23karolherbst: gfxstrand: btw, any ideas on how to deal with vec8/16 ops in vectorized nirs? I was playing around with lowering all vec16/vec8 to vec4 before passing it into drivers (or lower to scalar if they request this), but there are some leftovers like this: https://gist.githubusercontent.com/karolherbst/3fda5dd969d2b254b7865458b7d2092b/raw/7897b482743d65433435910d56c5d6b2bb8a149e/gistfile1.txt
14:25robclark: zamundaaa[m]: ok, cool.. I expect the uabi patches should still apply cleanly but lmk if you need me to re-post
14:26zamundaaa[m]: robclark I'm not 100% sure which patches I need to pick. Do you have a branch somewhere to make it easier to test?
14:28Lynne: emersion: sure, if you want three separate versions, but I don't think host_image_copy is popular enough yet, and pretty much everything implements host mapping
14:28emersion: yeah
14:37gfxstrand: emersion: Which is optimal is going to depend on the hardware and the behavior of the app. If it's changing every frame and you can sample directly from it, VK_EXT_external_memory_host is probably faster. If it's going to sit there for a couple of frames, host copy is probably faster.
14:38emersion: gfxstrand: by VK_EXT_external_memory_host, i don't mean sample directly from the host memory VkBuffer
14:38gfxstrand: You mean CopyBufferToImage?
14:38emersion: there would still be a copy to a GPU buffer
14:39gfxstrand: In that case, I'd assume a host image copy is the fastest.
14:39emersion: wrap my shm client's buffer with a VK_EXT_external_memory_host VkBuffer, then copy that buffer to the final image
14:39gfxstrand: host buffers have all sorts of potential perf problems.
14:39emersion: hm, can you elaborate?
14:40karolherbst: or maybe we should just ask drivers to deal with nir_op_vec8/16 as otherwise we might risk devectorizing load/stores or something.. dunno.. maybe it's not a big deal at this vector size. Maybe I should just add a flag to nir_lower_alu_width indeed, or let the filter deal with this
14:40emersion: is there a way i can use them to avoid the problems?
14:40karolherbst: but it's messy
14:40gfxstrand: karolherbst: Currently, even fully scalar drivers have to deal with vec8/16
14:40karolherbst: yeah...
14:40karolherbst: for zink that's just super annoying
14:40gfxstrand: You can break it up, it's just tricky.
14:40karolherbst: sure
14:41karolherbst: but I think `nir_lower_alu_width` is probably the best place for it, no?
14:41karolherbst: or a special pass?
14:41gfxstrand: IDK
14:41gfxstrand: IMO, a Zink-specific pass is probably the way to go
14:41karolherbst: yeah.. probably
14:42gfxstrand: Something that replaces any read of vec8/16 with creating a subvec.
14:42karolherbst: austriancoder: how much pain would it be for you to _only_ deal with nir_op_vec8/16 opcodes and everything else is vec4 at most?
14:42gfxstrand: Then copy-prop should delete the vec8/16
14:42gfxstrand: karolherbst: Oh, right... there are vec4 GPUs that exist. *sigh*
14:42karolherbst: yeah.. it's probably easier to deal with this from the alu op perspective
14:42karolherbst: yeah....
14:42gfxstrand: So, yeah, maybe make it general but we probably do want to make it special.
14:42gfxstrand: For scalar drivers, it's easier to just handle it.
14:43zamundaaa[m]: robclark here's a simple implementation, which sets the deadline for compositing and direct scanout: https://invent.kde.org/plasma/kwin/-/merge_requests/4358
14:43karolherbst: I was also thinking about adding a flag to nir_lower_alu_width and then it moves sources into vec4
14:43karolherbst: or something
14:43karolherbst: like
14:43karolherbst: if an alu references a vec8 source, just extract the relevant channels into a vec4 and consume this
14:43karolherbst: and then hope every vec8/16 gets removed
14:43karolherbst: which it kinda should
14:44robclark: zamundaaa[m]: ok, cool, that is just what I was waiting for to re-post..
14:44karolherbst: anyway, for scalar it already works
14:45karolherbst: or at least lowering vec8/16 to vec4 works, because the scalarize pass scalarizes stuff :D
14:45karolherbst: anyway, will play around with nir_lower_alu_width then
14:46gfxstrand: emersion: Couple of things. One is that it will always live in system RAM and have all the access problems that come with that. In theory a CopyBufferToImage *should* be fast but it also means round-tripping to the kernel to create a BO and kicking something to the GPU and tracking that host image buffer etc. The CopyMemoryToImage case should do a WC write from the CPU and there's no extra
14:46gfxstrand: resources to track. Both should be able to saturate PCIe, I think. The resulting image should be about as fast, probably.
14:46austriancoder: karolherbst: not much .. I think .. but I would spend an hour or so on some nir magic
14:46karolherbst: mhh.. yeah...
14:46karolherbst: for scalar backends all this stuff is trivial
14:46gfxstrand: emersion: So I guess IDK which is going to be faster in an absolute sense. Host copies are certainly easier and involve less tracking and less stuff you have to kick to the GPU.
14:46gfxstrand: Probably depends on where your system is loaded.
14:47karolherbst: but vectorized things are annoying as you can have random swizzles and that nonsense
14:47austriancoder: yeah..
14:47karolherbst: though a pass to materialize num_component > 4 sources into their own vec4 shouldn't be tooo hard
14:47karolherbst: probably pretty straight forward
14:47karolherbst: I'll write something up
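A rough sketch of that idea against roughly this era's NIR helpers (treat exact names like `nir_ssa_alu_instr_src_components` as assumptions). It presumes `nir_lower_alu_width` already split ALU ops down to at most vec4, and, as comes up further down, a plain `nir_channel()` of the wide def can be folded back by copy-prop, hence the branch that resolves channels through the vecN's own sources.

```c
/* For every ALU source that reads a wide (vec8/vec16) def, materialize just
 * the channels it uses into an at-most-vec4 and point the source at that, so
 * nothing reads the wide vecN anymore and DCE can delete it.
 * Wide load_const defs would still need separate handling. */
static bool
lower_wide_alu_srcs(nir_builder *b, nir_instr *instr, void *data)
{
   if (instr->type != nir_instr_type_alu)
      return false;

   nir_alu_instr *alu = nir_instr_as_alu(instr);
   if (nir_op_is_vec(alu->op))
      return false; /* the vecN itself just dies via DCE once unused */

   bool progress = false;
   b->cursor = nir_before_instr(instr);

   for (unsigned i = 0; i < nir_op_infos[alu->op].num_inputs; i++) {
      nir_ssa_def *src = alu->src[i].src.ssa;
      unsigned n = nir_ssa_alu_instr_src_components(alu, i);
      if (src->num_components <= 4 || n > 4)
         continue;

      nir_ssa_def *comps[4];
      for (unsigned c = 0; c < n; c++) {
         unsigned ch = alu->src[i].swizzle[c];
         if (src->parent_instr->type == nir_instr_type_alu &&
             nir_op_is_vec(nir_instr_as_alu(src->parent_instr)->op)) {
            /* Pull the channel straight from the vec8/vec16's own sources,
             * so copy-prop can't fold us back into a wide swizzle. */
            nir_alu_instr *vec = nir_instr_as_alu(src->parent_instr);
            comps[c] = nir_channel(b, vec->src[ch].src.ssa,
                                   vec->src[ch].swizzle[0]);
         } else {
            comps[c] = nir_channel(b, src, ch);
         }
      }

      /* Consume a freshly built, at-most-vec4 source with identity swizzle. */
      nir_src_rewrite(&alu->src[i].src, nir_vec(b, comps, n));
      for (unsigned c = 0; c < n; c++)
         alu->src[i].swizzle[c] = c;
      progress = true;
   }

   return progress;
}

/* Run via nir_shader_instructions_pass(shader, lower_wide_alu_srcs,
 * nir_metadata_block_index | nir_metadata_dominance, NULL), then copy-prop
 * and DCE to actually delete the now-unused vec8/vec16 instructions. */
```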
14:48gfxstrand: karolherbst: So the thing I think scalar back-ends want to avoid is having extra vec instructions that get turned into extra movs that they have to copy-prop out later.
14:48karolherbst: yeah..
14:48karolherbst: this should absolutely be optional
14:48gfxstrand: Anything which actually takes a vector, like a texture op or intrinsic will get a vector that's already the right width.
14:49gfxstrand: They really exist just for weird ALU swizzle cases.
14:49karolherbst: yep
14:49gfxstrand: Which I'm now questioning....
14:49emersion: gfxstrand: hm if i keep the host pointer VkBuffer around, does that help the kernel not re-create the BO over and over again?
14:49gfxstrand: Like, if all your ALU ops are scalar, why can't copy-prop get rid of the vector...
14:49gfxstrand: :thinking:
14:50karolherbst: it works for scalar
14:50gfxstrand: emersion: Yeah. There is weird locking stuff in the kernel, though, and you'll end up paying that cost on every submit, even submits that aren't your copy (thanks, bindless)
14:50gfxstrand: emersion: But we're seriously micro-optimizing at that point.
14:50emersion: i do get that host memory is more work for the driver/kernel, since it's work i'm not doing anymore :P
14:51emersion: ok
14:51emersion: i guess i'll just try and make sure we don't regress
14:51emersion: and see how it goes on my hw collection
14:51gfxstrand: sounds good
14:51gfxstrand: But, generally, the host copy path is probably the path you're going to get in a GL driver when you TexSubImage with properly matching formats.
14:52gfxstrand: (Where by host copy I mean CopyMemoryToImage)
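A minimal sketch of that CopyMemoryToImage path, assuming the hostImageCopy feature is enabled, the image was created with VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT, and GENERAL is among the layouts the implementation reports as valid host-copy destinations; `device`, `image`, `shm_ptr`, `stride_in_pixels`, `width`, and `height` are placeholders.

```c
/* No staging buffer, no command buffer, no queue submit: the driver does a
 * CPU-side (de)tiling write directly into the image. */
VkMemoryToImageCopyEXT region = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_TO_IMAGE_COPY_EXT,
    .pHostPointer = shm_ptr,
    .memoryRowLength = stride_in_pixels, /* 0 = tightly packed */
    .imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
    .imageExtent = { width, height, 1 },
};
VkCopyMemoryToImageInfoEXT info = {
    .sType = VK_STRUCTURE_TYPE_COPY_MEMORY_TO_IMAGE_INFO_EXT,
    .dstImage = image,
    .dstImageLayout = VK_IMAGE_LAYOUT_GENERAL, /* must be a reported copy-dst layout */
    .regionCount = 1,
    .pRegions = &region,
};
vkCopyMemoryToImageEXT(device, &info); /* synchronous, CPU-side */
```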
14:52emersion: right
14:53alyssa: DavidHeidelberg[m]: FWIW, I disabled the T720 job
14:53alyssa: It seems that either the T720 devices are down, or they're absolutely overloaded right now, or a combination of the two
14:53emersion: that's exactly what my mental model was :P "VK_whatever_gl_does"
14:53emersion: gfxstrand: so this host copy path potentially allocates memory?
14:53alyssa: marge pipeline had t720 waiting for a device for 45+ minutes without making progress, so I pushed the disable.
14:54emersion: or would a driver that needs to allocate memory just not implement the ext?
14:54alyssa: that marge pipeline was lost (reassigned, we'll try this again) but no other collateral damage and hopefully the rest of the collabora farm is ok
14:55gfxstrand: emersion: No, it shouldn't allocate memory.
14:56gfxstrand: Well, it might need to malloc() a bit depending on $DETAILS
14:56emersion: ok, good
14:56gfxstrand: But it shouldn't need to allocate GPU memory
14:56emersion: sure, i meant memory for buffer
14:56gfxstrand: And it won't submit anything to the GPU
14:56emersion: ah, interesting
14:56emersion: right
14:56emersion: no VkCommandBuffer
14:56gfxstrand: On Intel, it'll be an isl_tiled_memcpy()
14:57gfxstrand: Feel free to go read that eldritch horror if you wish. I cannot guarantee your sanity will be intact on the other end, though. :joy
14:57gfxstrand: It's even able to use SSE to do BGRA swizzles!
14:59emersion: lol
15:01gfxstrand: I'm hopeful that the NVK implementation will be substantially more sane but that's still TBD.
15:01gfxstrand: Some of it is just because tiling is complicated
15:02gfxstrand: in general
15:15alyssa: gfxstrand: I'm not sure what to make of host copy for agx or mali
15:16alyssa: We have a CPU path for tiled images, but the GL driver almost never uses it
15:16alyssa: since most of the time we use compressed images, which are accessed via a GPU blit to/from a linear image
15:17alyssa: the tiled CPU path is used for ASTC/BCn and that's usually it
15:17alyssa: so I can't think of a situation where host image copy would ever be the right thing to do
15:17alyssa: for this class of hw
15:18alyssa: CopyBufferToImage (or vice-versa) or maybe BlitImage with a Linear image will typically be the optimal path
15:18alyssa: de/compressing on the CPU is a nonstarter
15:19zmike: VkHostImageCopyDevicePerformanceQueryEXT ?
15:19alyssa: Sure
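A short sketch of the query zmike points at, chained into vkGetPhysicalDeviceImageFormatProperties2; the format/usage values and `pdev` are placeholders.

```c
/* Ask the driver whether host image copies would cost the optimal layout
 * (e.g. framebuffer compression) for this image configuration. */
VkHostImageCopyDevicePerformanceQueryEXT perf = {
    .sType = VK_STRUCTURE_TYPE_HOST_IMAGE_COPY_DEVICE_PERFORMANCE_QUERY_EXT,
};
VkImageFormatProperties2 props = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_FORMAT_PROPERTIES_2,
    .pNext = &perf,
};
VkPhysicalDeviceImageFormatInfo2 info = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_IMAGE_FORMAT_INFO_2,
    .format = VK_FORMAT_B8G8R8A8_UNORM,
    .type = VK_IMAGE_TYPE_2D,
    .tiling = VK_IMAGE_TILING_OPTIMAL,
    .usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT,
};
vkGetPhysicalDeviceImageFormatProperties2(pdev, &info, &props);

if (!perf.optimalDeviceAccess) {
    /* Host copies would force a less efficient layout on this hardware;
     * prefer the GPU copy path instead. */
}
```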
15:20alyssa: Mostly I'm surprised there's hardware that *doesn't* benefit from using its framebuffer compression for textures
15:20DavidHeidelberg[m]: alyssa: if not broken completely, always switch it to manual; the `.` is usually forgotten
15:20alyssa: I don't know how to do that
15:21DavidHeidelberg[m]: s/.panfrost-midgard-rules/.panfrost-midgard-manual-rules:/ :)
15:21alyssa: midgard is fine
15:21alyssa: it's just t720 that needs to be whacked
15:21alyssa: oh in the extends
15:21DavidHeidelberg[m]: yup :)
15:21alyssa: ok that doesn't seem terrible i guess
15:21alyssa: still a lot more likely to screw it up
15:25daniels: DavidHeidelberg[m]: well alyssa is gone, but https://gitlab.freedesktop.org/krz/mesa/-/jobs/47878513 went through ~immediately, and those boards are extremely under-loaded, so I think it's just some weird temporal issue; perhaps a recurrence of the old issue where gitlab just wasn't handing out jobs when it should
15:29DavidHeidelberg[m]: it looks like it https://lava.collabora.dev/scheduler/device_type/sun50i-h6-pine-h64
15:29DavidHeidelberg[m]: daniels: I'll revert and run it a few times so we're sure it's fine
15:51Lynne: emersion: the overhead of creating a buffer from host memory wasn't significant in my tests
15:52Lynne: on every frame
15:52emersion: cool!
16:02zmike: DavidHeidelberg[m]: I got another trace for you
16:03zmike: https://gitlab.freedesktop.org/zmike/dumps/-/blob/master/blender.3-trim.trace hits some unique texture upload mechanics that (apparently) nothing else in ci hits
16:06DavidHeidelberg[m]: Nice! I'll check in 2-3 hours ;)
16:06zmike: it's single frame and pretty small
16:24karolherbst: do we have a proper way to replace alu srcs or would I ahve to just recreate the alu instruction?
16:31gfxstrand: karolherbst: What do you mean?
16:32gfxstrand: nir_src_rewrite() should work. You just have to fixup the swizzle yourself.
16:32karolherbst: yeah, already found it
16:35gfxstrand: karolherbst: Okay, I'm now very confused how vec8 ends up in back-ends
16:35gfxstrand: Maybe it's via phis?
16:35karolherbst: nah
16:35karolherbst: it's just silly stuff
16:35karolherbst: did you check out the shader?
16:35gfxstrand: Yeah, that one has actual vec4 ALU stuff
16:36karolherbst: yep
16:36gfxstrand: But if you don't have vec4 ALU stuff, how does it happen?!?
16:36karolherbst: well.. the shader had originally vec8 alu ops
16:37karolherbst: so, nir_lower_alu_width split that up, but we are left with that silly vec8
16:39gfxstrand: But copy-prop should clean up once every ALU source is scalar
16:40karolherbst: but they never become scalar
16:40karolherbst: not for zink at least
16:40gfxstrand: Well, yeah, Zink is vec4 hardware, effectively.
16:40gfxstrand: I'm wondering why Intel has support
16:40karolherbst: yeah
16:40gfxstrand: I know there was a case
16:40karolherbst: uhh.. I've added vec8/16 support to a bunch of drivers (or somebody else did) so a lot of drivers are fine
16:40karolherbst: _but_
16:41karolherbst: if the driver asks for lower_to_scalar then they get scalar alus
16:41karolherbst: and copy prop cleans it all up
16:41gfxstrand: I mean, for RT, I added some spicy wide vec loads for Intel but that's a different issue.
16:41karolherbst: anyway.. got it working :D
16:41karolherbst: I'm sure it's super sketchy
16:43karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24839/diffs?commit_id=30ca5e3fb37e1c7123b41f3c5359b42148f46c98
16:43karolherbst: anyway.. I think it's fair for rusticl to reduce the vecs to vec4 and then we can drop a bunch of vec8/16 code from backends again... or maybe we keep it and drivers can say what they support
16:43karolherbst: or
16:44karolherbst: drivers can lower it themselves
16:44karolherbst: dunno
16:44karolherbst: for now I just want something which works
16:47karolherbst: ehh.. I've committed my prints, oops
16:47karolherbst: and it's also broken
16:47karolherbst: sad
16:50karolherbst: yooooo
16:50karolherbst: copy_prop reverses my thing
16:50karolherbst: :<
16:51karolherbst: yeah so this isn't great
16:52karolherbst: https://gist.githubusercontent.com/karolherbst/bb0405538df026ac91551e8c751ff6eb/raw/5941978624d568864c98c7d1a95d73f9fc4cb2c1/gistfile1.txt
16:52karolherbst: maybe I have to do that later...
16:53karolherbst: and only if the source comes from a vec? mhh...
17:13karolherbst: Pass 2350 Fails 68 Crashes 36 Timeouts 0: 100% and 60 of those fails are ignorable image tests with NEAREST filtering... I think that's probably as good as it needs to be to focus on cleaning up that MR :D
17:13karolherbst: (that's on ANV btw)
17:13gfxstrand: \o/
17:14gfxstrand: karolherbst: Ugh... copy lowering is going to be a mess.
17:14karolherbst: radv is a bit worse, but ACO doesn't support any of the CL nonsense, so that's not surprising
17:14gfxstrand: I think we need a per-stage array of nir_variable_mode which supports indirects.
17:14gfxstrand: We have so many things which are currently keying off of misc things
17:15gfxstrand: If nir_lower_var_copies were happening entirely in drivers, it'd be okay, but it's not.
17:15karolherbst: last radv run was "Pass 2222 Fails 15 Crashes 217 Timeouts 0: 100%", wondering how much that final vec8/16 lowering will help
17:15karolherbst: lvp is a bit better
17:15karolherbst: anyway
17:15karolherbst: good enough on all drivers
17:16gfxstrand: NIR lowering has started becoming an eldritch horror
17:16karolherbst: yeah....
17:16karolherbst: it's a mess
17:17gfxstrand: s/started becoming/become/
17:17gfxstrand: It happened a long time ago
17:17karolherbst: I also procrastinated reworking the nir pipeline, because.......
17:17karolherbst: it works and I don't want to break it
17:17gfxstrand: At some point, I should do an audit and make some opinionated choices.
17:18karolherbst: yeah
17:18gfxstrand: But ugh.....
17:18karolherbst: that would help
17:18karolherbst: :D
17:18gfxstrand: There's no way I don't break the universe in that process.
17:18gfxstrand: And I'd much rather write NVK code
17:18karolherbst: fair enough
17:18karolherbst: you can just comment on bad ordering and I can figure out the details
17:18gfxstrand: Like WTH is st_nir_builtins.c lowering var copies?!?
17:18karolherbst: heh
17:18karolherbst: :D
17:18gfxstrand: It's worse than that
17:18gfxstrand: I can't just drop some comments in places.
17:19karolherbst: there are probably good reasons for it (tm)
17:19gfxstrand: Oh, I'm sure there is a reason
17:19gfxstrand: There's a reason for every line of code
17:19karolherbst: ahh you meant the entire st/mesa stuff?
17:19gfxstrand: Whether or not it's a good reason is TBD.
17:19gfxstrand: Oh, I mean EVERYTHING
17:19karolherbst: uhhh
17:19gfxstrand: Vulkan, OpenCL, st/mesa, various back-ends. The lot.
17:19karolherbst: maybe... work on NAK instead and hope we can just drop gl drivers in the future?
17:19zmike: <gfxstrand> There's a reason for every line of code
17:19karolherbst: :D
17:19zmike: https://i.pinimg.com/736x/fb/4e/07/fb4e07ea3c37c426dd0a2b3344b1ff85.jpg
17:19karolherbst: ahh yeah...
17:20karolherbst: _maybe_ we just need a better model
17:20karolherbst: or rather.. some stages of nir where certain things are allowed/forbidden
17:20karolherbst: but uhhh
17:21karolherbst: no idea really
17:23gfxstrand: I think the #1 thing we need is someone with the appropriate credentials to be opinionated.
17:23gfxstrand: Which basically boils down to that.
17:23gfxstrand: We may also want some helpers to assert said opinions.
17:24gfxstrand: I've mostly looked the other way while the st/mesa stuff was built because, well, Ew, GL.
17:24karolherbst: fair
17:24gfxstrand: That was maybe a mistake.
17:25gfxstrand: In any case, I think it's well past time for a cleanup. Sadly, there are very few people who really understand all of NIR and the various APIs in play and back-end needs well enough to do that clean-up.
17:25gfxstrand: And by "very few" I mean "me, maybe". :joy:
17:26karolherbst: oh wow.. my vec8/16 patch really helped a lot with lvp... from ~200 crashes down to ~50... still a bunch of fails, but nothing serious
17:26karolherbst: so I guess what I have kinda works now...
17:26karolherbst: uhhh
17:26karolherbst: pain
17:26karolherbst: I wish it weren't so hacky
17:30daniels: DavidHeidelberg[m]: huh, is CelShading new on T860? it hangs reliably on your MR …
17:30Lynne: airlied: could you push !24572?
17:40karolherbst: I should probably check if it runs on nvidias driver by now...
17:53karolherbst: uhhh.. there are things like ieq %44.abcd, %45 (0x0).xxxx which my code doesn't account for.. pain
17:54karolherbst: that was easy to fix
17:56anarsoul: what does GL spec say on empty fragment shader, i.e. void main(void) {}, without actually writing any outputs?
17:56gfxstrand: I think you get undefined garbage
17:56Lynne: speaking of rusticl on nvidia on zink, it runs but it generates junk
17:57karolherbst: Lynne: try my most recent rusticl/tmp branch and run with RUSTICL_DEBUG=sync
17:57karolherbst: I won't promise it will work, but it should work better
17:57karolherbst: something is up with the fencing with zink and rusticl and I stll have to figure that out
17:58karolherbst: but that _might_ be enough to run luxball or something
18:02karolherbst: ehhh maybe I've broken it.. try rusticl/zink instead :D
18:04karolherbst: uhhhhhh
18:04karolherbst: load_const
18:04karolherbst: uhhhhh
18:04karolherbst: pain
18:04karolherbst: now copy_prop reverts my vec4 on load_const..
18:04karolherbst: what a pain
18:05karolherbst: I think I need a better model on how to do that lowering
18:05karolherbst: or I just split load_const
18:07zmike: anarsoul: in what sense
18:08Lynne: karolherbst: yup, everything is correct with RUSTICL_DEBUG=sync on rusticl/zink
18:08anarsoul: zmike: see https://gitlab.freedesktop.org/mesa/mesa/-/issues/9689
18:08karolherbst: Lynne: but is it faster than nvidia's CL stack
18:08anarsoul: shaders/tesseract/270.shader_test has a FS with an empty body
18:08Lynne: what was the envvar to select an opencl device again?
18:09karolherbst: uhhh...
18:09zmike: anarsoul: not sure exactly what you're asking
18:09zmike: it just does nothing
18:09karolherbst: you can normally do that in apps
18:09anarsoul: zmike: I'm asking what the expected result of it is. gfxstrand already replied that it's undefined
18:09zmike: you mean the result in the framebuffer?
18:09anarsoul: yeah
18:10karolherbst: do we have a helper to extract a const channel used by an alu instruction?
18:10Lynne: llama.cpp just picks the first device afaik
18:10karolherbst: ehh.. maybe I just use nir_build_imm
18:10zmike: anarsoul: "Any colors, or color components, associated with a fragment that are not written by the fragment shader are undefined."
18:10zmike: from 4.6 compat spec
18:10zmike: but this is in the compat language section
18:11anarsoul: zmike: thanks!
18:13Lynne: found a workaround, yes, nvidia's opencl stack is faster
18:13karolherbst: not the answer I wanted to hear :P
18:13Lynne: around 13 times faster
18:14karolherbst: uhh
18:14karolherbst: annoying
18:14karolherbst: that's quite a bit
18:14karolherbst: which benchmark did you use? luxball?
18:14Lynne: llama.cpp
18:14karolherbst: ahh mayb that's expected
18:14Lynne: I'll try ffmpeg's nlmeans filter
18:15karolherbst: nah, most of those benchmarks actually test something else
18:15karolherbst: the only one I know that reliably tests kernel execution speed is luxmark
18:15karolherbst: the API overhead is..... significant if you do debug builds and it's not very optimized anyway
18:16karolherbst: mostly interested in raw kernel execution speed, because that's probably the hardest part to get really performant
18:16karolherbst: and I also don't know if other benchmarks actually validate the result or not
18:17DavidHeidelberg[m]: <daniels> "David Heidelberg: huh, is..." <- Old, very old. Something is wrong.
18:17Lynne: okay, ffmpeg's nlmeans_opencl filter is a little bit faster on zink on nvidia!
18:17karolherbst: yeah, but is the result correct?
18:17karolherbst: zmike: ^^ you know what to do
18:17zmike: MICHAEL
18:17zmike: MICHAEL WHERE IS THE POST
18:17karolherbst: but being faster with luxball would be impressive
18:18karolherbst: because that's actually 100% kernel throughput
18:18karolherbst: and no funny business
18:18karolherbst: it's basically benchmarking the compiler
18:18Lynne: looks correct
18:19karolherbst: the issue with llama.cpp is that it probably uses fp16 and other funky extensions we don't support, and probably matrix stuff and.. uhh.. cuda on nvidia
18:19karolherbst: so dunno if that's even remotely comparable
18:19karolherbst: also.. it's heavily optimized for nvidia
18:20karolherbst: Lynne: nice :)
18:20karolherbst: okay.. I'm getting close ...
18:20karolherbst: maybe official zink conformance this year would be possible :D
18:20Lynne: nah, llama.cpp's cl code is pretty meh, pretty much all of it is except cuda
18:21karolherbst: fair
18:21karolherbst: but did you compare zink vs cuda or cl?
18:21Lynne: I should try fp16 though, is it supported on zink?
18:21karolherbst: no
18:21karolherbst: it's not supported in rusticl at all atm
18:21karolherbst: for.... annoying reasons
18:21Lynne: I meant through the features flag?
18:21karolherbst: "Pass 2374 Fails 69 Crashes 11 Timeouts 0: 100%" getting close (and again, like 60 fails won't matter)
18:21Lynne: correctness is another chapter
18:22zmike: fp16 is supported in zink
18:22karolherbst: RUSTICL_FEATURES=fp16
18:23karolherbst: I'll probably test it on my GPU later this week :D but for now I want it to work on mesa drivers
18:23Lynne: fp16 output is correct, but it's not any faster
18:23karolherbst: for conformance over zink, would 3 mesa drivers count or do I have to use different vendors?
18:24karolherbst: annoying
18:24karolherbst: maybe it's not using it? maybe it's broken? maybe all of it? I have no idea :)
18:24zmike: you'd need to see what the conformance requirements are for CL
18:24karolherbst: it's mostly blocked on libclc not supporting it
18:24zmike: in GL it's 2
18:24Lynne: ggml_opencl: device FP16 support: true
18:24Lynne: it is using it
18:24karolherbst: ahh
18:24karolherbst: I suspect the slowness comes from something else
18:24karolherbst: like the API
18:25karolherbst: rusticl still is doing a few silly things
18:25zmike: better drop everything and debug this karol
18:25karolherbst: but that llama.cpp actually works is kinda neat
18:25zmike: your p-score is plummeting
18:25karolherbst: yeah I know
18:25karolherbst: that's why I want a higher score on luxball than nvidia
18:25zmike: absolute freefall
18:25Lynne: out for blood, eh? good luck
18:26karolherbst: zmike: right.. but it can't be like 2 mesa drivers, can it?
18:27zmike: if it's a hw implementation then the vendor has to have submitted a conformant driver
18:27zmike: so like zink+anv would need a conformant intel GL driver
18:27zmike: from the company itself
18:27zmike: (iris or proprietary in this case since both are owned by intel)
18:27karolherbst: sure
18:27zmike: so yes, it could be mesa
18:28karolherbst: but would zink+anv and zink+radv be enough or would you have to use a different vk driver
18:28zmike: if it meets the CL requirements it'd be fine
18:28karolherbst: okay
18:28zmike: I only know the GL ones
18:28zmike: just mail neil and ask, not sure there's been many conformant layered impls
18:29zmike: or whoever the CL chair is
18:30karolherbst: mhhh
18:30karolherbst: for CL they don't specify much
18:30karolherbst: I can ask on the next meeting though
18:31karolherbst: but yeah, neil still is the chair
18:31karolherbst: ohh wait, they say two
18:31karolherbst: "For layered implementations, for each OS, there must be Successful Submissions using at least two (if
18:31karolherbst: available) independent implementations of the underlying API from different vendors (if available), at
18:31karolherbst: least one of which must be hardware accelerated."
18:32zmike: sounds like same as GL
18:32karolherbst: yeah, but different vendors :)
18:32karolherbst: so two mesa drivers won't cut it
18:32karolherbst: I guess?
18:32zmike: pretty sure they will
18:32karolherbst: dunno
18:32karolherbst: I can ask
18:32zmike: but again, ask neil
18:33karolherbst: anyway... I should upstream some of the patches, we can leave the vec8/16 lowering out a bit, because that's going to be a pain to get right...
18:33karolherbst: but seems like my current attempt works pretty well
18:47airlied: Lynne: was waiting for Benjamin to update the tags, but I suppose I can do that
19:00DemiMarie: gfxstrand: I saw https://lore.kernel.org/dri-devel/CAOFGe94JC8V2GS5L2iCaD9=X-sbbcvrvijck8ivieko=LGBSbg@mail.gmail.com/ and it makes so much more sense why you are pushing for userspace command submission, despite the security concerns.
19:02gfxstrand: Modern AMD/NV/Intel hardware really is built for userspace submit
19:02gfxstrand: Even modern Mali tries
19:02DemiMarie: And from my (Qubes) perspective userpace submit is absolutely awful
19:03gfxstrand: Eh, yes and no.
19:03gfxstrand: userspace submit doesn't really increase the surface that much
19:04DemiMarie: Can you elaborate?
19:04DemiMarie: Because this is something that has worried me for quite a while.
19:04gfxstrand: The moment you turn on the GPU, you have a massive pile of hardware that's potentially an attack surface.
19:04gfxstrand: The thing that saves you is page tables. The kernel controls those and every bit of GPU memory access goes through them.
19:05DemiMarie: So basically the same as a CPU?
19:05Lynne: airlied: thanks, the ffmpeg side of the patch fixes nvidia, but I'd rather have at least one fully working implementation at any one time
19:05gfxstrand: DemiMarie: Pretty much.
19:06DemiMarie: gfxstrand: from my perspective, the big difference is how much of the GPU firmware is exposed to untrusted input
19:07gfxstrand: Yeah, so that's the thing. How much untrusted input are we talking about? On something like NVIDIA, it's just a bunch of 64b pushbuf packets with maybe another type of packet for signaling a semaphore.
19:07gfxstrand: Those themselves are probably pushbufs (I'm not actually sure)
19:07DemiMarie: Interesting
19:08gfxstrand: On Intel, it's the same command streamer that is used for processing graphics commands. It just has a 3 level call/return stack.
19:08gfxstrand: You can put 3D rendering commands in level0 if you want.
19:08gfxstrand: IDK how AMD's works.
19:09gfxstrand: The only additional thing you're really exposing is the doorbells.
19:09DemiMarie: Does that mean that on Intel there is no difference at all attack surface wise?
19:09DemiMarie: Doorbells are actually a problem for virtualization.
19:09gfxstrand: Well, it depends a bit on the doorbell situation.
19:09gfxstrand: If all you expose is the ring itself and you leave the kernel in charge of doorbells, then I don't think the attack surface is different.
19:10DemiMarie: You can’t expose doorbells (or any MMIO) to untrusted guests due to hardware bugs on Intel
19:10gfxstrand: Sure
19:10gfxstrand: But by "doorbells" I think I mean a regular page of RAM.
19:10DemiMarie: Specifically to fully untrusted guests — a guest with access to MMIO can hang the host
19:10gfxstrand: There's details here I'm missing.
19:10gfxstrand: If you expose doorbells (I don't remember most of the details), then you're opening up a bit more surface but probably not much.
19:11DemiMarie: How could it be a regular page of RAM? Does the firmware just poll for input on its various queues without relying on interrupts?
19:11gfxstrand: I think so
19:11gfxstrand: But, again, I don't remember all the details.
19:12DemiMarie: That’s the other part that is potentially problematic, where by “problematic” I mean “scary”
19:13DemiMarie: The GPU drivers are open source software and at least some of them (Intel IIRC) are being fuzzed continuously by Google for ChromeOS.
19:13DemiMarie: So while I can’t guarantee that it is perfect, it likely at least has no blatantly obvious flaws.
19:14gfxstrand: Where Intel gets scary is the FW interop queues. That's where you can send actual commands to the firmware to do various things. IDK if those are ever expected to be exposable to userspace or if they're always privileged. (Details of GuC interaction are not my strong suit.) If they are, then that's an actual attack surface. If all that is marshalled through the kernel or hypervisor, then I think
19:14gfxstrand: you're fine.
19:14DemiMarie: Are they performance critical?
19:14DemiMarie: Hopefully they are always privileged.
19:14gfxstrand: No, I don't think they're perf critical.
19:14DemiMarie: Then keeping them privileged should be fine.
19:15sima: the guc queues are for less privileged guest kernel drivers, but not for userspace
19:15DemiMarie: Is GPU firmware any scarier than CPU microcode?
19:15gfxstrand: That's what you have to use to suspend a queue if you need to swap out memory or something.
19:15sima: so still attack surface
19:15gfxstrand: And when you run out of doorbells, you also have to use them.
19:15DemiMarie: sima: depends on whether you are using SR-IOV or virtio-GPU native contexts
19:15sima: DemiMarie, well for sriov
19:15DemiMarie: sima: yeah
19:16DemiMarie: native contexts have the advantage that they could work on AMD and Nvidia hardware too
19:16gfxstrand: If you're doing a virtio-GPU type thing then you would just expose the ring and maybe a doorbell page to userspace and pass that straight through to the userspace in the VM.
19:16gfxstrand: Suddenly the only thing any kernel has to get involved with is page migration.
19:16DemiMarie: gfxstrand: interesting
19:17DemiMarie: The native contexts I was thinking of are basically ioctl-level passthrough
19:17DemiMarie: <gfxstrand> If you're doing a virtio-GPU type thing then you would just expose the ring and maybe a doorbell page to userspace and pass that straight through to the userspace in the VM.
19:17DemiMarie: That’s an interesting idea.
19:18DemiMarie: sima: do the GPU vendors do any sort of internal security reviews of their firmware?
19:20DemiMarie: gfxstrand: thanks, this makes me feel better
19:22DemiMarie: Honestly, GPU hardware is actually quite a bit nicer than CPU hardware in that it doesn’t have to have obscene amounts of speculative execution to get good performance.
19:23DemiMarie: Nor does it need to remain compatible with 20-year-old binaries.
19:23gfxstrand: heh
19:24bcheng: airlied: Lynne: ah give me one moment I can update the tags
19:24DemiMarie: tries to figure out if this is the first part of a longer series of comments
19:26HdkR: DemiMarie: Don't worry, ARM hardware is fine with cutting out 20 years of cruft
19:26DemiMarie: HdkR: and that is a good thing, but they still have the speculation issues
19:26airlied: bcheng: oh cool I haven't moved much yet
19:27DemiMarie: overall, how does GPU driver attack surface compare to CPU hypervisor attack surface?
19:28sima: DemiMarie, not sure I should talk about that :-)
19:28DemiMarie: sima: why?
19:28sima: I work for such a vendor ...
19:28sima: pretty sure I'm breaking all the rules if I disclose anything about the hw security process here
19:29bcheng: airlied: I added some new stuff since your last review, including moving some stuff of gallium into util in case you missed it.
19:29sima: like pretty much everything is cool, except hw security issues are absolutely no fun, or the CEO will know your name if you screw up
19:29DemiMarie: sima: okay, fair
19:30DemiMarie: so I will phrase things this way: to what degree would giving a guest access to GPU acceleration make life easier for an attacker?
19:31bcheng: airlied: it's pretty trivial stuff, but not sure if someone else needs to ack it or something
19:32airlied: my assigning it to marge is implicitly ack enough :)
19:32DemiMarie: and can I trust that the company you work for would take a security problem with their GPUs (be it hardware or firmware) just as seriously as they would take a problem with their CPUs?
19:32airlied: but I'll reread it now
19:32DemiMarie: because those two things are ultimately what my users care about sima.
19:34airlied: bcheng: just missed a license header on the new util c file
19:34sima: intel's fixed a ton of device hw security bugs too over the past years (starting with the cpu smeltdown mess more or less), anything more is way beyond my pay grade
19:35sima: (i.e. don't ask more questions, I wont answer them anyway)
19:35DemiMarie: fair
19:36bcheng: airlied: ah ok, and I assume because it's copy-pasted I should just copy the license from there?
19:36sima: (the device hw security fixes have been in a pile of press announcements, otherwise I couldn't tell you about those either)
19:37DemiMarie: sima: “the guc queues are for less privileged guest kernel drivers, but not for userspace” tells me enough
19:38DemiMarie: (in a good way)
19:38DemiMarie: “(i.e. don't ask more questions, I wont answer them anyway)” — thanks for making this explicit
19:38sima: yeah like I said, the rules around hw security bugs are extremely strict, can't even talk about them freely internally
19:39DemiMarie: yikes
19:39sima: if you're not on the list of people who know for a specific hw bug, you don't get to talk about it
19:47DemiMarie: what matters most for me (and likely for my users) is that the GPU is considered a defended security boundary, which it obviously is
19:47DemiMarie: (and to be clear: I do not expect a reply sima, so do not feel bad for being unable to give one)
19:55DemiMarie: Also, if I made you uncomfortable, I apologise.
20:00DemiMarie: sima: thank you for what you did say, it really does give me confidence in Intel’s handling of hardware vulns.
20:23Lynne: bcheng: just copy paste the license and keep the headers the same