IRC Logs of #dri-devel on irc.freenode.net for 2023-04-12

04:57 kode54: https://gitlab.freedesktop.org/drm/intel/-/issues/8325 hmm, fascinating
04:58 kode54: wouldn't know the first thing to bisect to find what may be causing this
08:03 danvet: emersion, https://lore.kernel.org/dri-devel/20230411222931.15127-2-ville.syrjala@linux.intel.com/ for you I think
08:03 danvet: pq not here ...
08:03 danvet: daniels, ^^ maybe
08:03 danvet: emersion, this makes me wonder, do we need to formalize kms uapi reviewers?
08:04 danvet: it's a bit hard to catch with a MAINTAINERS entry because it's all over
08:21 emersion: ah, i've actually never used CTM
08:21 emersion: JoshuaAshton: ^ does this match your experience?
08:22 emersion: danvet, that would be nice. i think we already ask people to CC wayland-devel for new props?
08:22 danvet: emersion, is that documented anywhere?
08:23 danvet: I'm also wondering whether we could at least make some pattern matches in MAINTAINERS
08:23 danvet: like matching the uapi header file is easy, but many props are just all over
08:24 emersion: hm i thought so, but apparently not
08:24 danvet: plus often the important stuff is in the kerneldoc and I don't think there's a way to match that
08:25 danvet: emersion, I guess at least a "DRM KMS UAPI" MAINTAINERS entry that matches include/uapi/drm/drm_mode.h?
08:25 danvet: as a start at least
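A sketch of the kind of MAINTAINERS entry danvet proposes at 08:25. It assumes the standard MAINTAINERS field tags (L: mailing list, S: status, F: file pattern). The section name, the S: value and the choice to also list wayland-devel (from emersion's remark at 08:22 about CCing it for new props) are assumptions. No M:/R: lines are shown because the log names no reviewers.

```
DRM KMS UAPI
L:	dri-devel@lists.freedesktop.org
L:	wayland-devel@lists.freedesktop.org
S:	Maintained
F:	include/uapi/drm/drm_mode.h
```

With an entry like this, scripts/get_maintainer.pl would add both lists for any patch touching that header. As danvet notes, properties defined in other files and their kerneldoc would still not be matched.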
08:25 jannau: that ctm order is what Apple display processor FW expects and is used by kwin 5.27.3 and later
08:25 emersion: maybe we can move the KMS prop docs to a rst
08:26 emersion: hm, not sure that'd be a welcome change
08:26 emersion: but yeah maybe start with drm_mode.h and we can improve later
09:40 javierm: danvet: why are you looking at fbmem and fbcon again?
10:41 Lynne: how is it possible that a memory load, after dozens of barrier(); and memoryBarrier(); calls and everything else I could think of, still differs between compute local invocations within a workgroup?
10:52 ishitatsuyuki: Lynne: IIUC, neither barrier nor memoryBarrier will flush caches unless the variable is marked as coherent
10:52 ishitatsuyuki: hm, but you mentioned within a workgroup
10:54 ishitatsuyuki: so, technically, barriers only ensure coherency for shared memory
10:54 ishitatsuyuki: but as far as AMD hardware is concerned, memory access within the same workgroup should be coherent
11:19 Lynne: I've tried both nvidia's binary drivers and radv, the behavior is the same
11:19 Lynne: the only way I know I'm somewhat sane is that if I do another dispatch with a memory barrier, the memory looks fine
11:20 Lynne: but I really want to do the processing in a single shader, since barriers completely kill performance
11:21 Lynne: adding coherent doesn't fix the issue either
11:21 Lynne: I'm using BDAs if it matters
11:23 Lynne: this is my shader if anyone's interested - https://paste.debian.net/1277097/
11:24 Lynne: issue is at line 313
11:25 Lynne: the same value should get splatted horizontally, yet after x = height[0], the old value is written instead, as if prefix_sum() wasn't even called
11:26 Lynne: if I put a hardcoded value for a, it works, if I don't call prefix_sum() the second time it works
12:22 ishitatsuyuki: Lynne: if I understand correctly, the decoupled lookback algorithm only waits for indices up to the thread's job to complete before continuing. hence loading v[4096200] feels a bit suspicious to me, unless this is conditioned to only happen for the last workgroup
12:23 Lynne: it's just a convenient index since it's non-zero; it breaks with any index
12:28 ishitatsuyuki: Are the prefix sums global? Is there sufficient synchronization before and after the prefix sum to ensure src/dst writes are visible?
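To illustrate ishitatsuyuki's remark about decoupled lookback, here is a minimal C sketch of the technique under stated assumptions: one POSIX thread per tile instead of one workgroup per tile, C11 atomics in place of GPU atomics, and hypothetical names (NUM_TILES, TILE_LEN, tile(), decoupled_lookback()). Each tile publishes its aggregate, then walks backwards over its predecessors only, adding aggregates until it finds one with a published inclusive prefix. Nothing in the scheme makes a later tile's value ready, which is why loading an arbitrary far-away index looks suspicious unless only the last tile does it. Per the log, Lynne's case turns out to be a single-workgroup dispatch, where this cross-workgroup scheme does not apply.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_TILES 6 /* hypothetical number of tiles (workgroups) */
#define TILE_LEN 4  /* hypothetical elements per tile */

enum { FLAG_X = 0, FLAG_A = 1, FLAG_P = 2 }; /* not ready / aggregate / inclusive prefix */

static const int *g_in;
static atomic_int g_flag[NUM_TILES];
static int64_t g_aggregate[NUM_TILES]; /* valid once flag >= FLAG_A */
static int64_t g_inclusive[NUM_TILES]; /* valid once flag == FLAG_P */
static int64_t g_exclusive[NUM_TILES]; /* result: sum of all earlier tiles */

static void *tile(void *arg)
{
    intptr_t t = (intptr_t)arg;
    int64_t agg = 0, exclusive = 0;

    for (int k = 0; k < TILE_LEN; k++)
        agg += g_in[t * TILE_LEN + k];

    /* Publish the local aggregate; the release store makes it visible to
       any tile that observes FLAG_A with an acquire load. */
    g_aggregate[t] = agg;
    atomic_store_explicit(&g_flag[t], FLAG_A, memory_order_release);

    /* Look back over predecessors only: later tiles are never consulted. */
    for (intptr_t j = t - 1; j >= 0;) {
        int f = atomic_load_explicit(&g_flag[j], memory_order_acquire);
        if (f == FLAG_X) {
            sched_yield(); /* predecessor has published nothing yet: wait */
        } else if (f == FLAG_P) {
            exclusive += g_inclusive[j]; /* full prefix known: stop here */
            break;
        } else {
            exclusive += g_aggregate[j]; /* only its aggregate: keep going */
            j--;
        }
    }

    g_exclusive[t] = exclusive;
    g_inclusive[t] = exclusive + agg;
    atomic_store_explicit(&g_flag[t], FLAG_P, memory_order_release);
    return NULL;
}

/* Computes, for each tile of in[0..NUM_TILES*TILE_LEN), the sum of all
   elements in earlier tiles. */
void decoupled_lookback(const int *in, int64_t *exclusive_out)
{
    pthread_t threads[NUM_TILES];

    g_in = in;
    for (int i = 0; i < NUM_TILES; i++)
        atomic_store_explicit(&g_flag[i], FLAG_X, memory_order_relaxed);
    for (intptr_t i = 0; i < NUM_TILES; i++)
        pthread_create(&threads[i], NULL, tile, (void *)i);
    for (int i = 0; i < NUM_TILES; i++)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NUM_TILES; i++) {
        /* each prefix must equal the previous prefix plus that tile's sum */
        assert(i == 0 || g_exclusive[i] == g_exclusive[i - 1] + g_aggregate[i - 1]);
        exclusive_out[i] = g_exclusive[i];
    }
}
```

The release store of each flag, paired with the acquire load in the looking-back tile, is what makes the aggregate or inclusive value written before that flag visible to whoever observes it.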
12:32 Lynne: global?
12:32 ishitatsuyuki: Are you doing prefix sums across workgroups?
12:32 Lynne: no, there's only a single workgroup dispatch
12:32 ishitatsuyuki: so vkCmdDispatch(1,1,1)?
12:33 Lynne: yup
12:34 ishitatsuyuki: weird
12:36 Lynne: I can push the code somewhere with instructions to run it if you find it interesting enough
12:38 ishitatsuyuki: looks like RDNA can actually have two L0 caches within a dual compute unit, where the same workgroup can be scheduled
12:38 ishitatsuyuki: when you added coherent, where did you add it to?
12:39 Lynne: in both the buffer definition (coherent buffer DataBuffer) and the pushConstants pointer
12:40 ishitatsuyuki: I guess that should do the job
12:41 ishitatsuyuki: busy with other stuff right now but I'll check later to see if we're emitting the correct instruction flags for this case (workgroup running on dual CU)
12:42 ishitatsuyuki: can you try barrier with explicit memory scope on nvidia?
12:43 Lynne: explicit memory scope?
12:44 ishitatsuyuki: lemme look up the syntax, but it's from the vulkan memory model
12:46 ishitatsuyuki: controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup, gl_StorageSemanticsBuffer, gl_SemanticsAcquireRelease)
12:50 Lynne: same output on both radv and nvidia after putting it right after the barrier() call
12:50 Lynne: same == wrong
12:57 ishitatsuyuki: I think you need it before and after the prefix sum too
12:58 ishitatsuyuki: (for the src/dst synchronization I mentioned above)
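ishitatsuyuki's point about synchronizing the scan's source and destination before and after the prefix sum can be illustrated with a CPU-side analogy in C. This is a sketch under stated assumptions, not the shader from the paste: each POSIX thread stands in for one invocation of a single workgroup (as with vkCmdDispatch(1,1,1)), pthread_barrier_wait() stands in for a GLSL barrier, and WG_SIZE, invocation() and workgroup_scan() are hypothetical names. It uses a Hillis-Steele inclusive scan, which may differ from the algorithm in the paste, and it models only execution ordering, not GPU cache coherence (the L0 cache and coherent discussion above).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define WG_SIZE 8 /* hypothetical workgroup size */

static const int *g_input;           /* per-invocation input values */
static int g_scan[WG_SIZE];          /* plays the role of the shared scan buffer */
static int g_total[WG_SIZE];         /* each invocation's copy of the splatted total */
static pthread_barrier_t g_barrier;  /* plays the role of barrier() */

static void *invocation(void *arg)
{
    uintptr_t id = (uintptr_t)arg;

    /* Write the source data, then synchronize BEFORE the scan starts. */
    g_scan[id] = g_input[id];
    pthread_barrier_wait(&g_barrier);

    /* Hillis-Steele inclusive scan. */
    for (uintptr_t off = 1; off < WG_SIZE; off <<= 1) {
        int addend = id >= off ? g_scan[id - off] : 0;
        pthread_barrier_wait(&g_barrier); /* all reads of this step finish first */
        g_scan[id] += addend;
        pthread_barrier_wait(&g_barrier); /* all writes land before the next read */
    }

    /* Splat: every invocation reads the last element. This is only correct
       because the final loop iteration ended with a barrier (the barrier
       AFTER the scan). */
    g_total[id] = g_scan[WG_SIZE - 1];
    return NULL;
}

/* Scans data[0..WG_SIZE) in place and returns the workgroup total. */
int workgroup_scan(int *data)
{
    pthread_t threads[WG_SIZE];

    g_input = data;
    pthread_barrier_init(&g_barrier, NULL, WG_SIZE);
    for (uintptr_t i = 0; i < WG_SIZE; i++)
        pthread_create(&threads[i], NULL, invocation, (void *)i);
    for (int i = 0; i < WG_SIZE; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&g_barrier);

    for (int i = 0; i < WG_SIZE; i++) {
        assert(g_total[i] == g_total[0]); /* every invocation saw the same total */
        data[i] = g_scan[i];
    }
    return g_total[0];
}
```

Removing the barrier before the loop, or either barrier inside it, lets an invocation read a value another invocation has not written yet or has already overwritten, which can produce the kind of symptom described at 11:25 (a splatted value reflecting an earlier state).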
13:04 Lynne: hey, that did something!
13:05 Lynne: still not correct, but it looks a bit more correct now
13:05 HdkR: karolherbst: Congrats on rusticl for radeonsi merge! Next stop rusticl for Freedreno? :D
13:15 Lynne: hmm, in my regular use-case it still looks as wrong as before, unlike in this hacked-up simplified version
13:16 danvet: narmstrong, so you'll apply the bridge maintainer patch with all the fixups?
13:17 narmstrong: danvet: yup, or you can resend as you wish
13:17 danvet: nah happy if you take care :-)
13:17 narmstrong: danvet: sure!
13:18 javierm: danvet: I think narmstrong's suggestion was to fix the author field though
13:18 javierm: while in the past you asked to add both S-o-B tags
13:18 narmstrong: I can do either just tell me
13:21 karolherbst: HdkR: probably just works.. or maybe not as it needs some func pointers, dunno...
13:21 karolherbst: I don't have hardware :D
13:25 narmstrong: danvet: ok to add both SoB then?
13:25 danvet: sure
13:26 mlankhorst: danvet: is the merge window for v6.4-rc1 still open?
13:29 danvet: mlankhorst, I guess one last -next pull or something
13:33 mlankhorst: sent, didn't look too spectacular. :)
13:38 HdkR: karolherbst: Sounds like a good reason to get such hardware :P
14:27 danvet: mlankhorst, did you prep drm-misc-next-fixes already?
14:35 mlankhorst: danvet: hm no, should that not be drm-next after merge?
14:57 danvet: mlankhorst, yup, and that's pushed now so you can forward
15:44 robclark: danvet: how to link to rst?
15:45 danvet: robclark, https://dri.freedesktop.org/docs/drm/doc-guide/sphinx.html#cross-referencing
15:45 danvet: note that functions should auto-link anywhere
15:46 danvet: https://dri.freedesktop.org/docs/drm/doc-guide/kernel-doc.html#cross-referencing-from-restructuredtext
15:46 danvet: there's actually good docs for the kernel's doc system :-)
15:46 robclark: so just use full path then
15:47 danvet: well if you want a specific chapter in an .rst then it gets more tricky
15:49 danvet: need to put a sphinx tag above the heading and reference that
15:52 robclark: full path I think is better.. because it is also something that looks sensible to someone looking at code instead of html ;-)
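A short reStructuredText sketch of the two options discussed: a label placed above a heading and referenced with :ref:, versus a plain path to the .rst file, which (an assumption based on the kernel's automarkup Sphinx extension) gets turned into a link while still reading sensibly in the source. The label name kms-properties is a hypothetical example.

```rst
.. _kms-properties:

KMS Properties
==============

Elsewhere, link to that specific section via its label:
see :ref:`kms-properties`.

Or refer to the whole document by its path, which also reads well in
plain source: see Documentation/gpu/drm-kms.rst.
```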
16:08 robclark: `make htmldocs` does take some time..
16:17 danvet: robclark, incremental is a lot better
17:27 mlankhorst: done
17:28 mlankhorst: drm-misc-next is closed!
17:38 javierm: danvet, jfalempe: any idea how my patch could have made things worse? That's very surprising... https://lists.freedesktop.org/archives/dri-devel/2023-April/399981.html
19:17 DavidHeidelberg[m]: robclark: a630 showing huge performance boost, any particular commit you suspect? :D https://gitlab.freedesktop.org/mesa/mesa/-/issues/7144#note_1865796
19:37 DavidHeidelberg[m]: gallo: performance go brrrrr ^ :)
19:57 DavidHeidelberg[m]: this could be the offender: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22148 but most likely it's this one: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22098
20:22 robclark: more likely the latter
20:24 robclark: there was also a kernel fix for anything doing short duration (<10ms) fence waits.. not sure if that shows up in any traces (stk does this)
20:25 robclark: (hmm, well the kernel fix might actually make the trace appear slower since the trace is just a pre-recorded # of fence waits and nothing to do with reality)
21:25 karolherbst: dcbaker: seems like bindgen is making incompatible changes and we have to know the version in meson :(
22:46 Lynne: ishitatsuyuki: any ideas on how to fix this?