00:41 ishitatsuyuki: Lynne, I probably need to understand what kind of structure your algorithm is. Can you write out it in pseudocode? e.g. Step 1: Do prefix sum for x[k][i]..x[k][j] for all k
01:01 Lynne: ishitatsuyuki: the opencl version is easier to read - https://github.com/FFmpeg/FFmpeg/blob/master/libavfilter/opencl/nlmeans.cl
01:02 Lynne: step 1 is load all values needed from the source images and compute the s1 - s2 diff
01:02 Lynne: store that per-pixel vector in the integral image
01:02 Lynne: then compute the horizontal prefix sum, followed by the vertical prefix sum
01:03 Lynne: which gives you the integral image
01:04 Lynne: finally, get the a, b, c, d (a rectangle) vector values from the integral for each individual block at a given offset, compute the weight, and add it in the weights array
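A rough CPU-side sketch of the a/b/c/d lookup and weight step described above, with made-up names (ii is the integral image of the squared s1-s2 differences, h the filtering strength); the real filter does this per offset on vec4s, four offsets at a time:

    #include <math.h>

    /* sum of the squared differences over the rectangle spanned by the
     * integral-image corners a (top-left), b (top-right), c (bottom-left),
     * d (bottom-right) */
    static double rect_sum(const double *ii, int stride,
                           int x1, int y1, int x2, int y2)
    {
        double a = ii[y1 * stride + x1];
        double b = ii[y1 * stride + x2];
        double c = ii[y2 * stride + x1];
        double d = ii[y2 * stride + x2];
        return d - b - c + a;
    }

    static double nlm_weight(const double *ii, int stride,
                             int cx, int cy, int p, double h)
    {
        /* patch distance around (cx, cy) with patch radius p, turned into
         * an NL-means weight */
        double dist = rect_sum(ii, stride, cx - p, cy - p, cx + p, cy + p);
        return exp(-fmax(dist, 0.0) / (h * h));
    }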
01:04 Lynne: opencl does it by having a separate pass for each step, and when I tried to do that, the vulkan version was 20 _times_ slower than opencl, because of all the pipeline barriers
01:05 Lynne: even though I was using a prefix sum algorithm which is around 100 times faster than the naive sum that opencl uses
01:05 Lynne: so I decided to merge the horizontal, vertical and weights passes into a single shader, to eliminate pretty much all of the pipeline barriers
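For reference, a minimal single-workgroup inclusive (Hillis-Steele style) scan of the kind being merged here, written as GLSL embedded in a C string the way these filters mix GLSL into C; scan_tmp and ROW_LEN are made-up names, and the real shader scans a buffer-backed integral image rather than shared memory, which is what the barrier discussion below is about:

    static const char *row_scan_glsl =
        "shared vec4 scan_tmp[ROW_LEN];                                      \n"
        "void row_prefix_sum(uint tid)                                       \n"
        "{                                                                   \n"
        "    memoryBarrierShared(); barrier();  /* row loaded into shared */ \n"
        "    for (uint off = 1u; off < ROW_LEN; off <<= 1u) {                \n"
        "        vec4 add = (tid >= off) ? scan_tmp[tid - off] : vec4(0.0);  \n"
        "        memoryBarrierShared(); barrier(); /* reads of prev step done */\n"
        "        scan_tmp[tid] += add;                                       \n"
        "        memoryBarrierShared(); barrier(); /* step visible to all */ \n"
        "    }                                                               \n"
        "}                                                                   \n";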
01:07 Lynne: merging the horizontal and vertical passes works fine, but merging the weights pass causes any integral image loads after height amount of _horizontal_ pixels to give old, pre-vertical pass data
01:10 Lynne: this is the vulkan port of the code - https://github.com/cyanreg/FFmpeg/blob/vulkan/libavfilter/vf_nlmeans_vulkan.c
01:39 ishitatsuyuki: this does sound like a use case requiring pipeline barrier unfortunately
01:40 ishitatsuyuki: because you need to wait for the entire prefix sum to finish before proceeding to the next stage
01:42 ishitatsuyuki: there is https://w3.impa.br/~diego/projects/NehEtAl11/, which gives you reduced bandwidth for integral image by keeping the horizontal-only prefix sum in shared memory
01:42 ishitatsuyuki: but it's significantly more complicated to implement
01:46 ishitatsuyuki: another approach is to do decoupled lookback but with 2D indices, where you calculate in the order of block (0,0), (1,0),(0,1),(2,0),(1,1),(0,2),... which will give you the block on top and left on each step
01:47 ishitatsuyuki: at the end of day you will still need a barrier between generating the integral image and weight computation
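A small C sketch of the 2D ordering mentioned above: walking anti-diagonals of the block grid guarantees that the block above and the block to the left are already done when a block is processed (nx, ny and the callback are hypothetical):

    /* visit blocks in the order (0,0), (1,0), (0,1), (2,0), (1,1), (0,2), ... */
    static void visit_blocks_wavefront(int nx, int ny,
                                       void (*process)(int bx, int by))
    {
        for (int d = 0; d <= nx + ny - 2; d++) {
            for (int by = 0; by <= d; by++) {
                int bx = d - by;
                if (bx < nx && by < ny)
                    process(bx, by);
            }
        }
    }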
01:47 Lynne: why there and not between the horizontal and vertical prefix sums too?
01:48 ishitatsuyuki: for your current approach you need a barrier there too, but I described alternative algorithms that don't require synchronization or even a memory store between the horizontal/vertical prefix sums
01:50 Lynne: why need a pipeline barrier at all?
01:50 ishitatsuyuki: because computing vertical prefix sum requires the entire horizontal prefix sum to finish?
01:51 Lynne: why is barrier() not enough?
01:51 ishitatsuyuki: barrier only synchronizes within the workgroup
01:51 Lynne: we only have a single workgroup
01:51 ishitatsuyuki: code says s->pl_int_hor.wg_size[2]
01:51 Lynne: always set to 1
01:52 Lynne: just copypaste from other code, I'll remove it to make it clearer
01:52 ishitatsuyuki: a single workgroup does sound like why it's slow though
01:52 ishitatsuyuki: a GPU has many WGPs (in amd terms) and a workgroup runs within a single WGP
01:53 Lynne: it's a max-sized workgroup, 1024, with each invocation handling 4 pixels at a time for a 4k image
01:54 ishitatsuyuki: for a single workgroup, it should work if you put controlBarrier(wg,wg,buf,acqrel | makeavail | makevisible) between each pass
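Spelled out in GLSL with GL_KHR_memory_scope_semantics, the suggested workgroup-scope barrier between passes is roughly the call below (shown as a C string literal, in the GLSL-in-C style the filter uses, and assuming the integral image lives in a storage buffer):

    static const char *pass_barrier_glsl =
        "controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup,  \n"
        "               gl_StorageSemanticsBuffer,             \n"
        "               gl_SemanticsAcquireRelease |           \n"
        "               gl_SemanticsMakeAvailable |            \n"
        "               gl_SemanticsMakeVisible);              \n";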
01:59 Lynne: nope, doesn't, even if I splat it before each prefix_sum call
01:59 Lynne: on neither nvidia nor radv
02:05 Lynne: btw, each dispatch handles 4 displacements (xoffs/yoffs) at a time, and for a research radius of 3, there are 36 dispatches (((2*r)*(2*r) - 1)/4) that have to be done
02:05 Lynne: so we still do multiple wgps, it's just that each dispatch handles one
02:06 ishitatsuyuki: ok, utilization should be fine with that
02:06 Lynne: the default research radius is 15 btw so that's 900ish dispatches, hence you can see why barriers kill performance -_-
02:07 Lynne: needing to do 3 passes would result in 3x the number of dispatches and barriers
02:07 ishitatsuyuki: use the meme split barrier feature that no one uses? :P (vkCmdSetEvent)
02:08 Lynne: that would still need a memory barrier, though, wouldn't it?
02:08 ishitatsuyuki: consider it an async barrier
02:09 Lynne: I'll leave it as a challenge for someone to do better than my current approach
02:09 ishitatsuyuki: yeah fair
02:10 ishitatsuyuki: back to debugging, did you try putting the barrier right after prefix sum as well?
02:11 Lynne: yup, splatted it everywhere, the integral image buffer has coherent flags too
02:13 ishitatsuyuki: i'm afraid i'm out of ideas again
02:16 Lynne: oh, hey, maybe I could test with llvmpipe
02:27 Lynne: well, that's weird, the part which is broken on a GPU is also broken on lavapipe
02:27 Lynne: but the part which is fine on a gpu is pure black on lavapipe
02:29 Lynne: limiting lavapipe to a single thread doesn't help either
02:29 Lynne: do none of the thousands of synchronization options actually do anything in vulkan?
02:29 Lynne: I know a lot of them are there just to satisfy some alien hardware's synchronization requirements, but still
02:31 HdkR: Gitlab having some performance issues right now?
02:32 HdkR: Managing to clone at a blazing 38KB/s
02:57 Lynne: anyone with any ideas or willing to run my code?
02:57 Lynne: it's the last roadblock to merging the entire video patchset in ffmpeg
03:00 zmike: what "synchronization options" are you referring to
03:12 Lynne: barrier(); memoryBarrier(); controlBarrier();
03:13 Lynne: I still don't understand how this can fail so consistently on all hardware
03:21 zmike: sounds like you're using it wrong if it's broken everywhere
03:28 Lynne: I simplified the issue down to loading the same value in all invocations and putting it on a pixel
03:29 Lynne: and I got different values after height amount of invocations*rows
03:55 Lynne: simplified it as much as possible - https://paste.debian.net/1277180/
03:56 Lynne: load integral_img with vec4(1), do a vertical prefix, load the vec4 at 300,300, write the .x to all of weights[]
03:58 Lynne: err, had a typo, https://paste.debian.net/1277181/
03:59 Lynne: making the if on line 252 "if ((gl_GlobalInvocationID.x * 4) < height[0]) {" always true seems to fix the issue
07:23 ishitatsuyuki: oh yeah, a control barrier inside non-uniform control flow is undefined behavior
07:24 ishitatsuyuki: common practice is to hoist the barrier()/controlBarrier() call outside the if condition
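A minimal sketch of that hoist, using the condition from the paste and hypothetical load/store helpers: the barrier is executed unconditionally by every invocation, and only the surrounding work is predicated:

    static const char *hoisted_barrier_glsl =
        "bool active = (gl_GlobalInvocationID.x * 4) < height[0];   \n"
        "vec4 v = vec4(0.0);                                        \n"
        "if (active)                                                \n"
        "    v = load_row_value();      /* hypothetical helper */   \n"
        "barrier();                     /* reached by everyone */   \n"
        "if (active)                                                \n"
        "    store_row_value(v);        /* hypothetical helper */   \n";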
08:08 Lynne: more gotchas than... I have no idea what, there's just too many
10:26 Lynne: finally finished it, code is here - https://github.com/cyanreg/FFmpeg/blob/vulkan/libavfilter/vf_nlmeans_vulkan.c
10:27 Lynne: still 3x slower than opencl, I leave it as a challenge for anyone to make it faster -_-
10:27 Lynne: ishitatsuyuki: could you take a look at it, just in case I missed something?
10:34 ishitatsuyuki: it's hard to guess about performance unless you're obviously underutilizing
10:35 ishitatsuyuki: radv's profiler support requires the app to be presenting, but maybe nvidia has better tools on that side
10:37 ishitatsuyuki: looks like you still have barrier in CF though https://github.com/cyanreg/FFmpeg/blob/aeff8ad1de646f501a7d8d8b769b5533bb4ff08b/libavfilter/vf_nlmeans_vulkan.c#L96
11:15 Lynne: ishitatsuyuki: if (first) is C, though
11:15 ishitatsuyuki: ah ok, nevermind
11:16 ishitatsuyuki: the practice of mixing glsl in C is making me uncomfortable :/
11:16 Lynne: with rdna3, which has atomic float ops, barrier overhead is pretty much zero
11:16 Lynne: it's also not 3x, but 2x slower than opencl
11:17 ishitatsuyuki: well that's a huge red flag
11:17 Lynne: removing the prefix sums boosts fps to 20, which still sounds quite low to me
11:17 ishitatsuyuki: don't use float atomics
11:17 Lynne: it's either barriers, or atomic float adds :/
11:17 ishitatsuyuki: have you tried aggregating through shared memory first so only one thread per workgroup needs to do the global atomic?
11:17 Lynne: 2 per dispatch, so it's not horrible
11:18 Lynne: not quite sure what you mean
11:18 ishitatsuyuki: https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
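A hedged sketch of that aggregation, following the linked histogram technique: accumulate per-workgroup partial sums in shared memory, then flush with one global atomic per bin instead of one per contribution (wg_weights, NB_BINS and weights are illustrative names; the float atomicAdd assumes GL_EXT_shader_atomic_float):

    static const char *shared_agg_glsl =
        "shared float wg_weights[NB_BINS];  /* per-workgroup partial sums */\n"
        "void flush_weights()                                               \n"
        "{                                                                  \n"
        "    /* invocations have already accumulated into wg_weights[];     \n"
        "     * now one global atomic per bin, strided over the workgroup */\n"
        "    memoryBarrierShared(); barrier();                              \n"
        "    for (uint i = gl_LocalInvocationIndex; i < NB_BINS;            \n"
        "         i += gl_WorkGroupSize.x * gl_WorkGroupSize.y)             \n"
        "        atomicAdd(weights[i], wg_weights[i]);                      \n"
        "}                                                                  \n";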
11:20 ishitatsuyuki: i'd avoid float atomics at all cost
11:20 ishitatsuyuki: the alternative is probably a workgroup barrier, not pipeline barrier, right?
11:21 ishitatsuyuki: I don't know how float atomics are emulated but they seem to be insanely low throughput
11:22 ishitatsuyuki: meanwhile, barrier only costs as much as it needs to wait, it's not like the instruction itself has any execution cost
11:25 Lynne: sadly integral image calculations are not separable
11:26 Lynne: I could calculate weights across multiple buffers and just merge them during the final step, plenty of descriptors left
11:27 ishitatsuyuki: is integral image not separable?
11:27 ishitatsuyuki: separable in the filter sense?
11:29 Lynne: separable in that you can't independently compute a horizontal and vertical prefix sum and merge them
11:30 ishitatsuyuki: the usual definition of separability for 2D IIR/FIR filters is a bit different, but ok
11:31 ishitatsuyuki: separability there means if a NxN convolution can be done with a 1xN then Nx1 convolution
11:31 ishitatsuyuki: can you use ints for the atomics somehow?
11:32 ishitatsuyuki: that should still be better than float atomics
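If the weights buffer were switched to 32-bit integers, the usual trick is fixed point, roughly as below (the scale and the weights_i32 buffer are made up; precision and overflow would need checking against the actual weight range):

    static const char *fixed_point_glsl =
        "const float FP_SCALE = 65536.0;                            \n"
        "void add_weight_fixed(uint px, float w)                    \n"
        "{                                                          \n"
        "    atomicAdd(weights_i32[px], int(round(w * FP_SCALE)));  \n"
        "}                                                          \n"
        "float read_weight_fixed(uint px)                           \n"
        "{                                                          \n"
        "    return float(weights_i32[px]) / FP_SCALE;              \n"
        "}                                                          \n";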
11:34 Lynne: removing the atomic float adds makes literally no difference, they're not the bottleneck
11:34 ishitatsuyuki: ok
11:35 ishitatsuyuki: you should probably try getting a profile
11:47 Lynne: guess I could fire up the filter under mpv, the client will be presenting in that case
11:47 Lynne: how do I use the radv profiler?
11:54 Lynne: uuh, I think I found the bottleneck - resolution
11:55 Lynne: I was testing on 4k video, but on 720p, I get 700fps with my code, while opencl gets 70
12:01 ishitatsuyuki: does mpv present with vulkan though
12:01 ishitatsuyuki: for instructions see https://docs.mesa3d.org/envvars.html#envvar-RADV_THREAD_TRACE
12:23 Lynne: it does, how do I analyze the captures?
13:18 ishitatsuyuki: open it with https://gpuopen.com/rgp/
13:19 ishitatsuyuki: you first identify the slow pass (in your case, there should be only a single compute pass), then go to instruction timing
13:20 ishitatsuyuki: the numbers should give you a rough idea of "cost" of instructions
14:01 tleydxdy: when I look at all the fds opened by a vulkan game and their corresponding drm_file, some of them have the correct pid but don't seem to be doing any work, and some have the pid of Xwayland and are doing all the work. does anyone know why that is?
14:01 tleydxdy: I assume it is because those fds are sent over by the X server, but why is it using those to do all the rendering work?
14:05 tursulin: tleydxdy: try if you want https://patchwork.freedesktop.org/patch/526752/?series=109902&rev=4
14:06 tleydxdy: how fun
14:07 tleydxdy: I'm mostly curious about why this pattern exist
14:08 tleydxdy: seeing how the application opens the render node itself anyway
14:09 danvet: tleydxdy, I thought for vk the render operations should always go through a file that's directly opened
14:09 danvet: and just winsys might go through one opened by Xwayland (if it's DRI3 proto)
14:11 danvet: gfxstrand, ^^ or does this work differently?
14:12 tleydxdy: yeah, if I look at the vram used reported by fdinfo for example, the directly opened ones only use 4K while the Xwayland one has >300MiB
14:14 emersion: tleydxdy: that's how the X11 DRI3 proto was designed
14:14 emersion: the Wayland protocol is different, for instance
14:15 emersion: AFAIK, the X11 DRI3 protocol was designed with DRM authentication in mind, where the X11 server would send authenticated DRM FDs to clients
14:15 emersion: before render node existed
14:16 alyssa: jenatali: you've been conscripted to ack https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22123
14:16 alyssa: I think
14:16 alyssa: Strictly I think maybe I could get away with xfb_size if I make my dispatch more complicated...
14:19 tleydxdy: emersion I see, so if the underlying wsi for vulkan is not X11 DRI3 then the pattern should not be seen?
14:19 emersion: on Wayland, all DRM FDs should be opened by the client *
14:20 emersion: ( * except DRM leasing stuff)
14:21 emersion: (but that's only for VR apps, and the DRM FD sent by the compositor is read-only before the DRM lease is accepted by the compositor)
14:21 tleydxdy: got it
14:25 jenatali: alyssa: Ack, but without seeing how it's used it's hard for me to really get why it's needed
14:25 jenatali: XFB is one of those things that's magic from my POV, I don't know any of the implementation details
14:33 danvet: emersion, yeah but I thought for vk you still open the render node
14:33 danvet: since winsys is only set up later
14:33 emersion: what do you mean?
14:34 danvet: emersion, like you can set up your entire render pipeline and allocate all the buffers without winsys connection
14:34 emersion: maybe Mesa will open render nodes, but these won't be coming from the compositor
14:34 danvet: and so no way to get the DRI3 fd
14:34 danvet: and only when you set up winsys will that part happen
14:35 danvet: so I kinda expected that the render node fds would have most of the buffers, and the winsys one opened by xwayland just the winsys buffers
14:35 emersion: sure. the question was about FDs coming from Xwayland though
14:35 emersion: ah
14:35 danvet: but it seems to be the other way round
14:35 danvet: per tleydxdy at least
14:35 emersion: maybe the swapchain buffers are allocated via Xwayland's FD
14:35 danvet: for glx/egl it'll all be on the xwayland fd
14:35 emersion: the swapchain is tied to WSI
14:36 danvet: 300mb swapchain seems a bit much :-)
14:36 tleydxdy: yeah, also all the gfx engine time is logged on the xwayland fd
14:36 tleydxdy: so it's also doing cs_ioctl on that
14:36 emersion: that's weird
14:39 tleydxdy: the game is unity, fwiw, and as far as I can tell it's not doing anything special
14:51 gfxstrand: danvet: That's how things work initially, yes. I think some drivers are trying to get a master and use that if they can these days.
14:51 gfxstrand: They shouldn't be getting it from the winsys, though. That'd be weird.
14:52 danvet: gfxstrand, well the master you only need for display, and for that you pretty much have to lease it
14:52 danvet: unless bare metal winsys
14:52 danvet: no one other than the current compositor should be master
14:54 alyssa: jenatali: purely software xfb implementation, see linked MR :~)
14:55 jenatali: Ah I missed that link
14:57 alyssa: ~~just you wait for geometry shaders~~
14:57 jenatali: Got it, so you run an additional VS with rast discard and just augment the VS to write out the xfb data?
14:58 alyssa: Yep
14:58 alyssa: I mean, that's conceptually how it works for GLES3.0 level transform feedback
14:58 alyssa: and that's what panfrost does
14:59 alyssa: all the real fun comes in when you want the full big GL thing
14:59 jenatali: Hm. Can't you mix VS+XFB+rast?
14:59 alyssa: indexed draws, triangle strips, all that fun stuff
15:00 jenatali: I haven't looked at the ES limitations for XFB so I dunno
15:00 alyssa: GLES3.0 is just glDrawArrays() with POINTS/LINES/TRIANGLES
15:00 alyssa: 1 vertex in, 1 vertex out
15:00 alyssa: which is all panfrost does (and hence panfrost fails a bunch of piglits for xfb)
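A hedged sketch of that scheme for the GLES3.0 case: the vertex shader, run with rasterizer discard for the capture-only pass, writes its outputs to an SSBO indexed by the vertex index. The single captured varying and the helper are illustrative, not the actual panfrost or agx lowering:

    static const char *xfb_vs_glsl =
        "layout(std430, binding = 0) buffer xfb_buf0 { vec4 captured[]; }; \n"
        "void main()                                                       \n"
        "{                                                                 \n"
        "    vec4 pos = compute_position();      /* hypothetical */        \n"
        "    gl_Position = pos;                                            \n"
        "    /* glDrawArrays + POINTS/LINES/TRIANGLES: one record per      \n"
        "     * vertex, stored at the vertex's own index */                \n"
        "    captured[gl_VertexID] = pos;                                  \n"
        "}                                                                 \n";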
15:00 MrCooper: emersion tleydxdy: there are pending kernel patches which will correctly attribute DRM FDs passed from an X server to a DRI3 client to the latter
15:01 alyssa: for full GL there's all sorts of batshit interactions allowed, e.g. indirect indexed draw + primitive restart + TRIANGLE_STRIPS + XFB
15:01 alyssa: how is that supposed to work? don't even worry about it ;-)
15:01 emersion: MrCooper: my preference would've been to fix the X11 protocol, instead of fixing the kernel…
15:01 alyssa: spec has a really inane requirement that you can draw strips/loops/fans but they need to be streamed out like plain lines/triangles
15:02 alyssa: (e.g. drawing 4 vertices with TRIANGLE_STRIPS would emit 6 vertices for streamout, duplicating the shared edge)
15:03 alyssa: in that case the linked MR does the stupid simple approach of invoking the transform feedback shader 6 times (instead of 4) and doing some arithmetic to work out which vertex should be processed in a given invocation
15:03 alyssa: this is suboptimal but hopefully nothing ever hits this other than piglit
15:03 alyssa: (..hopefully)
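The per-invocation arithmetic for the strip case can be sketched in C like this (illustrative only, not the MR's code): expanded invocation i covers triangle i/3, corner i%3, and odd triangles swap their first two vertices so the decomposed triangles keep the winding GL requires:

    /* map an expanded streamout invocation back to a TRIANGLE_STRIP vertex;
     * e.g. 4 strip vertices -> 6 invocations -> vertices 0,1,2, 2,1,3 */
    static unsigned strip_vertex_for_invocation(unsigned i)
    {
        unsigned tri    = i / 3;
        unsigned corner = i % 3;

        if (tri & 1) {                /* odd triangle: flip first two */
            if (corner == 0)
                return tri + 1;
            if (corner == 1)
                return tri;
            return tri + 2;
        }
        return tri + corner;          /* even triangle: t, t+1, t+2 */
    }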
15:14 jenatali: alyssa: Right that all makes sense. But can you not mix XFB+rast?
15:15 jenatali: Or alternatively, does GLES3 not allow SSBOs/atomics in VS?
15:23 alyssa: VS side effects are optional in all the khronos apis
15:23 alyssa: for mali, works on older hw but not newer ones due to arm's questionable interpretations of the spec
15:24 alyssa: for agx, IDK, haven't tried, Metal doesn't allow it and I don't know what's happening internally
15:24 alyssa: (VS side effects are unpredictable on tilers in general, the spec language is very 'forgiving' here)
15:27 alyssa: wouldn't help a ton in every case, consider e.g. GL_TRIANGLE_FANS with 1000 triangles drawn
15:27 alyssa: vertex 0 needs to be written out 1000 times
15:27 alyssa: all other vertices are written out just once
15:29 alyssa: re side effects being unpredictable, I *think* this means the decoupled approach is kosher even if we allow vertex shader side effects
15:29 alyssa: but I'd need to spec lawyer to find out
15:33 karolherbst: alyssa: are SSBO writes a thing in vertex shaders?
15:33 alyssa: 15:23 alyssa: VS side effects are optional in all the khronos apis
15:34 karolherbst: right.. was more like about is it a thing in your driver/hardware
15:35 alyssa: 15:23 alyssa: for mali, works on older hw but not newer ones due to arm's questionable interpretations of the spec
15:35 alyssa: 15:24 alyssa: for agx, IDK, haven't tried, Metal doesn't allow it and I don't know what's happening internally
15:35 alyssa: 15:24 alyssa: (VS side effects are unpredictable on tilers in general, the spec language is very 'forgiving' here)
15:39 daniels: 'newer ones' being anything with IDVS?
15:41 jenatali: I was very surprised when I learned that Vulkan not only allows side effects, but also wave ops and even quad ops in VS. Like wtf is the meaning of a quad of vertex invocations?
15:42 jenatali: FWIW D3D does wave ops, but not quads
15:51 gfxstrand: jenatali: Well, when you render with GL_QUADS...
15:51 gfxstrand:shows herself out
15:51 jenatali: Which Vulkan doesn't have, right?
15:51 jenatali: ... right?
15:52 gfxstrand: There was a quads extension but we killed it. :)
15:53 gfxstrand: Also, quad lane groups in a VS have nothing whatsoever to do with GL_QUAD. I was just making dumb jokes.
15:53 gfxstrand: They're literally just groups of 4 lanes which you can do stuff on.
15:53 jenatali: Yeah I know, I'm just also confirming :)
15:54 gfxstrand: They do make sense with certain CS patterns you can do with the NV derivatives extension, though.
15:54 jenatali: I guess I could lower quad ops to plain wave ops and support them in VS
15:54 jenatali: Yeah D3D defines quad ops + derivatives in CS
15:57 gfxstrand: Yeah, that's really all they are
15:57 gfxstrand: In fact, I think we have NIR lowering for it
15:57 gfxstrand: Yup. lower_quad
15:58 mareko: quad ops work in VS if num_patches == 4 or 2 and TCS is present
15:59 mareko: I mean num_input_cp
16:00 jenatali: Oh cool, I should just run that on non-CS/FS and then I can support quad ops everywhere
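A hedged sketch of what that lowering amounts to: a quad is just an aligned group of 4 subgroup lanes, so quad ops can be rewritten as generic shuffles, conceptually what the NIR lower_quad option does (GLSL shown in a C string; the real lowering operates on NIR, not GLSL):

    static const char *quad_lowering_glsl =
        "float quad_broadcast_lowered(float x, uint lane)                       \n"
        "{                                                                      \n"
        "    uint base = gl_SubgroupInvocationID & ~3u;                         \n"
        "    return subgroupShuffle(x, base | lane);                            \n"
        "}                                                                      \n"
        "float quad_swap_horizontal_lowered(float x)                            \n"
        "{                                                                      \n"
        "    return subgroupShuffleXor(x, 1u); /* vertical: 2u, diagonal: 3u */ \n"
        "}                                                                      \n";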
16:11 macromorgan: so question... I'm trying to troubleshoot a problem with unbalanced regulator disables that happens only on suspend and shutdown.
16:13 macromorgan: As best I can tell, when I try to shut down a panel, mipi_dsi_drv_shutdown is getting called, which runs the panel_nv3051d_shutdown function, which calls drm_panel_unprepare, which calls panel_nv3051d_unprepare. Then I also see panel_bridge_post_disable calling panel_nv3051d_unprepare
16:13 macromorgan: should there not be a shutdown function for the panel?
16:16 macromorgan: a lot of panels have a "prepared" or "enabled" flag, but when I was upstreaming the driver I was told not to do that
16:26 jenatali: Has the branchpoint happened?
16:28 jenatali: Oh yep, there it is. Would be nice to have a dedicated label for the post-branch MR that bumps the version for the next release. eric_engestrom
16:28 jenatali: I'd subscribe to that
17:39 eric_engestrom: jenatali: I've created the label ~mesa-release and I'll write up some doc in a bit, and hopefully we (dcbaker and I) won't forget to use it too often :]
17:40 dcbaker: eric_engestrom: thanks for doing that
17:40 jenatali: Thanks!
18:29 karolherbst: dcbaker: some hackish bindgen_version thing: https://github.com/mesonbuild/meson/pull/11679
18:35 dcbaker: karolherbst: I left you a few comments, it's really annoying that they treat the command line as always up for change
18:36 karolherbst: yeah...
18:37 karolherbst: get_version seems to work alright, nice
18:37 dcbaker: sweet
18:38 karolherbst: it's a bit annoying that rust.bindgen is already taken otherwise it could be some higher level struct and rust.bindgen => rust.bindgen.generate and we could just add rust.bindgen.version_compare().... but oh well...
18:51 dcbaker: Yeah. A long time ago I'd written a find_program() cache, which would have made this a bit simpler since you could have done something like find_program('bindgen').get_version().version_compare(...) and since all calls to find_program would use the same cache the lookup would be effectively free and we could just recommend that
18:51 dcbaker: unfortunately I never got it working quite right
18:52 karolherbst: yeah.. but that's also kinda annoying
18:52 karolherbst: I'd kinda prefer wrapping those things so it's always in control of meson
19:00 karolherbst: dcbaker: anyway... would be cool to get my rust stuff resolved for 1.2 so I only have to bump the version once :D
19:01 karolherbst: do I need to add any kwarg stuff?
19:04 dj-death: oh noes, gitlab 504
19:22 dcbaker: karolherbst: Yeah, I left you a comment, otherwise I think that looks good
19:29 anholt: starting on the 1.3.5.1 CTS update (with a couple extra bugfixes pulled in)
19:41 karolherbst: now I need somebody else or me to figure out that isystem stuff ...
20:10 alyssa: daniels: yeah, Arm's implementation of IDVS is "creative"
20:10 alyssa: gfxstrand: lol at VK_QUADS ops
20:47 karolherbst: quads? reasonable primitives
20:48 alyssa: * Catmull-Clark has entered the chat
21:02 robclark: alyssa: idvs sounds _kinda_ like qcom's VS vs binning shader (except that adreno VS also calcs position/psize)
21:11 alyssa: robclark: same idea, yeah
21:12 alyssa: the problem isn't the concept, it's an implementation detail :~)
21:12 robclark: hmm, ok
22:06 Kayden: there's some mention of nir_register in src/freedreno/ir3/* still, is this dead code now that the backend is SSA?
22:07 anholt: Kayden: indirect temps are still registers
22:07 Kayden: oh, interesting, ok
22:09 gfxstrand: We should convert ir3 to load/store_scratch
22:09 gfxstrand: Unless you really can indirect on the register file and really want to be doing that.
22:14 karolherbst: anholt: or scratch mem if lowered to it
22:15 Kayden: was just a little surprised to see it there still
22:15 Kayden: wasn't sure if it was leftover or still used :)
22:15 karolherbst: I don't know if I or somebody else ported codegen to scratch, but I think it was done...
22:15 karolherbst: ahh nope
22:15 gfxstrand: Kayden: I mean, the Intel vec4 back-end still uses it last I checked... 😭
22:16 karolherbst: or was it...
22:16 karolherbst: mhhh
22:16 Kayden: gfxstrand: not surprised to see it in general, just in ir3 :)
22:16 karolherbst: what was the pass again to lower to scratch?
22:16 gfxstrand: Kayden: Yes, but we should kill NIR register indirects in general.
22:16 Kayden: ah.
22:17 Kayden: yeah, probably
22:17 gfxstrand: I suppose I do have a haswell sitting in the corner over there.
22:17 anholt: Kayden: register indirects turn into register array accesses. large temps get turned into scratch.
22:17 karolherbst: I'm sure it's almost trivial to remove `nir_register` in codegen as it already supports scratch memory
22:17 gfxstrand: NAK won't support it
22:18 karolherbst: yeah, no point in supporting it on nv hw
22:19 gfxstrand: There's no point on Intel, either. They go to scratch in the vec4 back-end it's just that the back-end code to do that has been around for a long time and no one has bothered to clean it up.
22:19 gfxstrand: Technically, Intel can do indirect reads
22:19 gfxstrand: And indirect stores if the indirect is uniform and the stars align.
22:22 alyssa: I don't have a great plan for ir3 nir_register use.
22:46 karolherbst: anybody here ever played around with onednn? I kinda want to know what I need to actually use it
22:46 karolherbst: bonus point: make it non painful
22:51 gfxstrand: That sounds like an Intel invention
22:51 karolherbst: it is
22:51 gfxstrand: Of course...
22:51 gfxstrand: They have to have One of everything...
22:51 karolherbst: but apparently it has a CL backend
22:51 karolherbst: and it can run pytorch
22:51 karolherbst: and stuff
22:51 karolherbst: dunno
22:51 karolherbst: just want to see what CL extensions it needs
22:51 karolherbst: but I think it's INTEL_subgroup and INTEL_UVM
23:22 karolherbst: ehhh..
23:22 karolherbst: why is oneDNN checking if the platform name is "Intel" 🙃
23:24 karolherbst: they even have a vendor id check 🙃
23:25 psykose: because it was made by intel
23:25 karolherbst: you know that this won't stop me!
23:25 karolherbst: (but rusticl not having all the required features will! 🙃)
23:26 gfxstrand: What features are you missing? Intel subgroups you should be able to pretty much just turn on
23:26 gfxstrand: Even on non-intel
23:26 karolherbst: I don't know yet
23:26 karolherbst: the stuff just faults randomly
23:28 karolherbst: but yeah.. maybe I check if there are any tests for that extension and just support it
23:28 karolherbst: the README mentions subgroups and UVM
23:29 karolherbst: the CTS has tests for cl_khr_subgroups
23:29 karolherbst: but there is of course cl_intel_subgroups
23:29 karolherbst: and I don't even know if that's upstream llvm
23:31 karolherbst: at least nowadays those things are open source
23:31 karolherbst: ahh yes
23:31 karolherbst: another "is intel platform" check 🙃
23:35 karolherbst: uhhhhh
23:35 karolherbst: :pain:
23:35 karolherbst: they are doing unholy things
23:36 karolherbst: they fetch the binary of the compiled CL program and check if it's some ELF thing to do more cursed stuff
23:36 karolherbst: welll....
23:36 karolherbst: I guess that's a WONTFIX then
23:36 gfxstrand: ugh
23:36 gfxstrand: Am I surprised? No.
23:36 karolherbst: the file path gave me the rest: "/home/kherbst/git/oneDNN/src/gpu/jit/ngen/ngen_elf.hpp"
23:37 karolherbst: anyway...
23:37 karolherbst: they do seem to support other vendors via SYCL, but.. uhhh
23:37 karolherbst: why...
23:39 gfxstrand: It's a One* product
23:39 gfxstrand: The open-source is a sham. It exists for vendor lock-in.
23:39 karolherbst: I mean.. yes
23:49 anholt: mupuf: are you intentionally rebooting the DUT after a GPU hang? I'm not getting any info on the test that hung, so I can't mark it a skip.
23:53 zmike: I think he's still afk another week