00:10 karolherbst: ecm: mhhh.. I wonder if this is actually maybe a dri2 problem as nouveau still defaults to it
00:17 ecm: so it's coming from egl_dri2.c instead
00:17 karolherbst: you could try to force enable DRI 3 in the xf86-video-nouveau driver and see if that helps with anything
00:17 karolherbst: but it's still kinda weird
00:33 ecm: ok I forced DRI 3 for nouveau, it doesn't show the fd == -1 error, but eglinfo still fails eglInitialize
00:36 ecm: eglInitialize fails for platform x11, sorry
00:36 ecm: doesn't fail for DRI3
00:36 karolherbst: mhhh.. interesting
00:39 ecm`: eglinfo after dri3 enabled: https://0x0.st/HTXE.txt
00:40 karolherbst: something is broken and I have no idea what :)
00:40 ecm`: libEGL debug: EGL user error 0x300c (EGL_BAD_PARAMETER) in eglGetPlatformDisplay is the new error now
01:04 airlied: mlankhorst: care to dequeue drm-misc-fixes?
07:36 MrCooper: AFAICT drm_syncobj fds can't be polled, can they?
07:38 emersion: if you mean poll(), no. i have a patch for that but it needs an IGT
07:40 MrCooper: right, thanks
08:06 RAOF: Ok, amdgpu. Why are you allocating a buffer with tiling mode incompatible with scanout when I ask for a GBM_BO_USE_RENDERING | GBM_BO_USE_SCANOUT surface?!
08:06 RAOF: What am I doing differently to the working case?!
08:09 MrCooper: sounds like a radeonsi bug, it should pick a scanout capable modifier with GBM_BO_USE_SCANOUT
08:10 emersion: how do you figure out that it's not scanout capable?
08:10 emersion: are you sure it's the right device?
08:10 MrCooper: RAOF: see https://gitlab.freedesktop.org/mesa/mesa/-/issues/8729
08:11 RAOF: Because when it's displayed it's garbled in a nice blocky tiling fashion.
08:12 RAOF: It's definitely the right device; I'm only opening a single drm node.
08:12 MrCooper: that sounds like an amdgpu kernel bug then; if the modifier isn't scanout capable, it should refuse to create a KMS FB for it
08:12 emersion: RAOF: are you stripping an explicit modifier by any chance?
08:12 RAOF: Or maybe it's the other way around? Maybe EGL is confused and it's rendering to it as if it's tiled?
08:13 emersion: ie, allocating with with_modifiers(), then importing it without passing the modifier?
08:13 emersion: amdgpu should reject buffers it cannot scanout, in theory
08:13 RAOF: emersion: Nope! I'm deliberately using the non-modifiers path (because there aren't any supported modifiers on amdgpu, at least on this card).
08:14 emersion: ok, GFX8-
08:15 RAOF: It might still be a bug on my end; a different branch (with significantly different code flow) does work, but I can't see any difference in the way I'm allocating the gbm_surface, nor in the way I'm using EGL.
08:16 RAOF: And none of the debugging I've tried has seen any differences, and it's difficult to introspect this state.
08:19 RAOF: If there's any magical MESA_DEBUG environment variable that will make some of these decisions more legible, that'd be awesome 😐
08:23 MrCooper: AMD_DEBUG might be more relevant here, maybe check AMD_DEBUG=help and try some of those which sound related
08:29 MrCooper: lynxeye: interesting plot twist on mesa#8729 :)
08:30 lynxeye: MrCooper: Heh, sorry about that, but I hadn't seen this discussion before you linked to it in here.
08:31 MrCooper: no worries, happens to me all the time
09:33 ishitatsuyuki: i'm new to drm/ttm, but I wonder why the wait-wound business is needed instead of e.g. sorting locks by their pointer or other ID?
10:14 airlied: ishitatsuyuki: because sorting is expensive usually
10:16 karolherbst: anybody ever thought about enforcing DRM API locking rules via `WARN_ON(spin_is_locked(lock))`? at least the dma-fence API looks very hard to actually use correctly and there seem plenty of code around just not caring properly about locks
10:18 airlied: sima likely has
10:18 karolherbst: also.. I'm convinced that `dma_fence_is_signaled` has to go
10:19 karolherbst: (or to properly lock access to fence->flags)
10:20 karolherbst: dma-fence kinda smells too much of "we outsmart data races by using atomics"
10:20 airlied: sima: ^
10:21 lynxeye: karolherbst: Not a fan of the WARN_ON way to do this, but lockdep_assert_held is really useful.
10:22 karolherbst: yeah.. so that's basically the same, it just triggers when lockdep is enabled, right?
10:23 karolherbst: or is it checking for deps?
10:23 karolherbst: because I don't see why lock dependencies matter here at all, it's just enforcing what the API states
10:23 karolherbst: *it should
10:23 lynxeye: karolherbst: Yea, triggers with lockdep and in contrast to spin_is_locked it actually verifies that it's your thread that has the lock and not someone else.
10:23 karolherbst: ahhh
10:23 karolherbst: okay
10:24 karolherbst: I expect this to spam warnings all over the place, so having it behind lockdep might be better here anyway
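[editor's note: a minimal sketch of the lockdep_assert_held() pattern discussed above, assuming a hypothetical driver helper (my_driver_signal_fence is made up; the kernel's dma_fence_signal_timestamp_locked() carries the same assertion, as comes up later in this log):]

```c
#include <linux/dma-fence.h>
#include <linux/lockdep.h>

/* Hypothetical driver helper: document and enforce that the caller holds
 * fence->lock before using the _locked signalling variant. */
static void my_driver_signal_fence(struct dma_fence *fence)
{
	/* Compiles away without CONFIG_LOCKDEP; with lockdep enabled it
	 * splats (once) if the current thread does not hold fence->lock. */
	lockdep_assert_held(fence->lock);

	dma_fence_signal_locked(fence);
}
```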
10:53 cwabbott: jenatali: wow, that spec seems to require some real driver heroics, especially around suspend/resume
10:54 cwabbott: implementing suspend/resume was already painful enough in turnip after they added it to Vulkan dynamic rendering to match DX12
10:55 alyssa: cwabbott: is now a good time to invoke the axiom of QDS?
10:55 cwabbott: now it sounds like they want the driver to compute tile layouts at submit time, because you can't compute the tile layout until you know all of the render passes you want to merge
10:57 cwabbott: alyssa: uhh, what is that?
11:11 alyssa: "now it sounds like they want the driver to compute tile layouts at submit time"
11:11 alyssa: IDK about Qualcomm but this would be extra spicy on Apple (and maybe Mali) because the layouts get baked into the fragment shaders
11:12 alyssa: what's that? you want even more FS prologs + FS epilogs?
11:12 alyssa: and somehow want to defer shader linking until submit time instead of just draw time?
11:12 alyssa: well, if you insist! (-:
11:13 alyssa: (might be possible to push offsets as a uniform, but still.)
11:25 mlankhorst: airlied: sorry about that!
12:02 jenatali: cwabbott: Yeah, that's the impression that I get, but QC was okay with it so 🤷
12:05 alyssa: jenatali: so far i've not been impressed with QC's software (-:
12:06 jenatali: Yeah
12:17 alyssa:writes optimization passes Just For Fun because it's Friday and she did real work all week
12:17 alyssa: Do we have an easy way to find the first unconditional block executed after an instruction? Shouldn't be too hard to walk the cf list
12:17 karolherbst: alyssa: are you bored?
12:17 alyssa: karolherbst: tired
12:18 karolherbst: mhhh
12:18 karolherbst: I still have my subgroup MR, but you already reviewed quite a bit of it
12:18 alyssa: oh I can look at that today
12:18 karolherbst: but it also changed quite a bit
12:18 karolherbst: cool
12:18 alyssa: I feel like what I want is something dominance related maybe?
12:18 alyssa: although maybe not even
12:18 karolherbst: soooo
12:19 karolherbst: I have a fun optimization we need
12:19 karolherbst: but it's also quite a bit of work
12:19 karolherbst: but probably also fun
12:19 karolherbst: ever looked into loop merging?
12:19 alyssa: uh oh
12:19 alyssa: I already know what fun optimization I'm writing :-p
12:19 karolherbst: like merging the inner loop with the outer one
12:19 alyssa: i know better than to write loop opts
12:19 karolherbst: so threads taking longer in the inner loop don't stall threads waiting on the next outer iteration
12:19 karolherbst: :D
12:21 HdkR: That sounds like a dependency tracking hellscape
12:21 karolherbst: it's what nvidia is doing
12:22 karolherbst: HdkR: but it's actually not that hard, you just turn the inner loop into predicated blocks
12:22 karolherbst: and build a little state machine deciding what outer+inner loop iteration the thread is at
12:23 karolherbst: and decouple threads like this
12:23 karolherbst: it's kinda fun from a concept perspective
12:24 karolherbst: HdkR: it also helps with minimizing c/r stack usage in shaders
12:26 HdkR: I see the improvements. Sounds painful :D
12:26 karolherbst: :D
12:26 karolherbst: we have to do it though
12:46 jenatali: Have to?
12:59 alyssa: karolherbst: ...Lol
12:59 alyssa: I googled loop merging and the result is a presentation from a prof at my school :-p
13:00 alyssa:recognized the name
13:03 karolherbst: :D
13:03 karolherbst: it's a sign
13:03 karolherbst: now you have to do it, it's the law
13:03 karolherbst: jenatali: well.. for getting more perf I mean
13:04 jenatali: Got it
13:05 karolherbst: it basically leads to threads getting diverged less often and you even need to converge them less
13:06 alyssa: karolherbst: The school I graduated from and am slowly recovering mentally from? Indeed a sign that I should not write the nir pass :-D
13:06 karolherbst: :D
13:06 karolherbst: oh well.. guess I'll have to do it sooner or later as it's more critical for compute anyway
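[editor's note: a rough, hypothetical C illustration of the loop-merging idea karolherbst describes above (sum_nested/sum_merged are made-up helpers, not an actual NIR pass); both functions compute the same result, and the merged form corresponds to a single loop with a small (i, j) state machine whose "inner" work becomes predicated per thread:]

```c
/* Before: nested loops. On a GPU, threads that finish the inner loop early
 * wait at its exit for the slowest thread in the warp, once per outer
 * iteration, because lens[i] (the inner trip count) is divergent. */
int sum_nested(const int *lens, const int *const *rows, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < lens[i]; j++)
            sum += rows[i][j];
    return sum;
}

/* After: one merged loop. Each "iteration" either does a chunk of inner
 * work or advances the outer index, so threads only need to reconverge
 * once at the very end rather than at every inner-loop exit. */
int sum_merged(const int *lens, const int *const *rows, int n)
{
    int sum = 0, i = 0, j = 0;
    while (i < n) {
        if (j < lens[i]) {
            sum += rows[i][j];  /* inner work, predicated per thread */
            j++;
        } else {
            j = 0;              /* outer bookkeeping: next row */
            i++;
        }
    }
    return sum;
}
```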
13:26 alyssa: karolherbst: there's still https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18399#note_1779418 for loop fun
13:29 karolherbst: GPUs have branch predictors?
13:30 karolherbst: at least I'm sure nvidia GPUs don't have any
13:44 alyssa: not any I know of
13:45 glehmann: does xeon phi count?
13:45 karolherbst: I misread the comment there anyway.. but yeah... I don't really see any benefit of doing this on nvidia hardware
13:45 karolherbst: you'll have predicated branches in either case
13:45 karolherbst: ehh wait.. one non predicated in the original code version
13:47 karolherbst: _but_ it could lead to more optimized code as the if condition and the loop body's instructions live in the same block
13:47 karolherbst: so dunno
13:48 karolherbst: but we probably already optimize that?
13:49 karolherbst: I'd say we should go with whatever benchmarks say
13:51 alyssa: karolherbst: The point is cutting the # of branches executed in half
13:51 karolherbst: but looping is also a branch
13:51 alyssa: loop { if foo { break } ... }
13:51 alyssa: that's 2 branches per iteration
13:52 karolherbst: not necessarily
13:52 alyssa: if bar { do { .. } while (bar) }
13:52 alyssa: that's one branch per iteration (plus one at the start)
13:52 karolherbst: on nvidia hardware you can do all of this without actual branching
13:52 karolherbst: only thread divergence is a problem you need to take into account
13:53 karolherbst: but you can do this without branching
13:54 karolherbst: sure, you jump, but you don't really need to deal with blocks in the native ISA inside the loop
13:55 karolherbst: because there is just one
13:55 karolherbst: (in either case)
13:58 sima: karolherbst, yeah lockdep_assert_held is the best one
13:58 karolherbst: alyssa: something like this: https://gist.githubusercontent.com/karolherbst/69a58884b1b20fe06ee8297386c9cf8e/raw/5a8eab6e0415b41a0af538a815cc8b9bd522dfce/gistfile1.txt
13:58 sima: karolherbst, and yeah part of the dma_fence compromise bikeshed was to make the fastpath fast
13:58 sima: since in some drivers that was just an ordered load for a quick check
13:59 karolherbst: uhhh
13:59 karolherbst: pain
13:59 alyssa: karolherbst: Right. The second case executes half as many jumps as the first.
13:59 sima: karolherbst, we have endless amounts of code that practically assumes dma_fence_is_signaled is dirt cheap once it's signalled
13:59 alyssa: (1 predicated + 1 unconditional --vs-- 1 predicated)
14:00 alyssa: For n iterations, the top one is (2 * n) jumps but the bottom is n + 1
14:00 alyssa: unless there are 0 iterations in which case they are both 1 jump
14:00 karolherbst: ahh right...
14:00 karolherbst: yeah, no you are right, I was looking at it from a block perspective too much
14:00 alyssa: so if it's 0 or 1 iteration it's the same, and if there are multiple iterations the second is strictly better
14:01 alyssa: if you imagine running the loop 100 times... that adds up
14:01 karolherbst: yeah...
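[editor's note: for reference, a hypothetical C rendering of the two loop shapes alyssa compares above (the find_first_negative_* helpers are made up); both return the index of the first negative element, or n if there is none:]

```c
int find_first_negative_top(const int *v, int n)
{
    int i = 0;
    /* Form 1: loop { if (exit) break; ... }
     * Each iteration executes one conditional branch for the break plus one
     * unconditional jump back to the header: 2 * n jumps for n iterations. */
    while (1) {
        if (i >= n || v[i] < 0)
            break;
        i++;
    }
    return i;
}

int find_first_negative_rotated(const int *v, int n)
{
    int i = 0;
    /* Form 2 ("rotated"): if (cond) do { ... } while (cond);
     * One guard branch up front, then one conditional branch per iteration
     * at the bottom: n + 1 jumps. With 0 or 1 iterations both forms cost
     * the same; beyond that the rotated form is strictly cheaper. */
    if (i < n && v[i] >= 0) {
        do {
            i++;
        } while (i < n && v[i] >= 0);
    }
    return i;
}
```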
14:01 sima: karolherbst, imo the fastpath with dma_fence is fairly ok, the real absolute pain is the rcu protected lookups
14:01 karolherbst: sima: it's still quite racy
14:01 sima: because the rules are that the driver may recycle
14:01 sima: so you don't just have data races, you have "this might have become an entirely different dma_fence" races
14:02 karolherbst: yeah...
14:02 karolherbst: it's bad tho :P
14:02 sima: karolherbst, yeah but as far as lockless tricks go, it's kinda standard
14:02 sima: karolherbst, so did you find a bug in there?
14:02 karolherbst: no, but nouveau used dma_fence_is_signaled incorrectly and that led me to look deeper into it
14:02 karolherbst: or rather..
14:03 karolherbst: sima: we had to do this: https://cgit.freedesktop.org/drm/drm/commit/?h=drm-fixes&id=c8a5d5ea3ba6a18958f8d76430e4cd68eea33943
14:03 karolherbst: and it does fix or improve the situation a lot
14:03 karolherbst: anyway.. assuming dma_fence_is_signaled locks data is a wrong assumption
14:04 karolherbst: because currently it doesn't
14:04 karolherbst: or not all data at least
14:04 karolherbst: it has this fast path which can race with other threads
14:05 sima: karolherbst, use-after-free is a refcount issue
14:05 sima: if you don't hold a full refcount, you can _only_ do a _very_ limited set of refcount checks
14:05 sima: this aint a locking issue
14:08 sima: s/refcount checks/rcu protect dma_fence checks/
14:08 sima: (not yet enough oxygen in the brain after work out I guess)
14:10 karolherbst: ehhh.. yes, but I also found another nouveau bug I think...
14:10 sima: unless you do a weak reference protected by a lock or so, if you need locking to fix a use-after-free your design is very cursed
14:10 karolherbst: the fence lock needs to be taken before calling into dma_fence_signal_locked, right?
14:11 karolherbst: nouveau design here is a little cursed anyway
14:12 karolherbst: anyway, my point was rather that those interfaces are hard to use correctly and we should at least try to warn on certain patterns violating API contracts
14:12 karolherbst: I look at this dma-fence code and it immediately rings "people try to outsmart locking" bells
14:13 karolherbst: and what nouveau seems to be doing is to have a global fence list lock and take this instead of taking the fences' own locks
14:14 gfxstrand: uh oh...
14:14 sima: karolherbst, it's maybe just badly documented, but we assume your driver can cope with concurrent calls to this
14:14 gfxstrand: Don't try to outsmart the locking. That will not go well for you.
14:15 karolherbst: well.. dma_fence_signal_locked states "Unlike dma_fence_signal(), this function must be called with &dma_fence.lock held" which nouveau absolutely doesn't :)
14:16 karolherbst: and I'd rather have the kernel warn on violating this rule
14:16 sima: karolherbst, oh that's clearly a bug
14:16 karolherbst: or drop the rule and put whatever is the actual rule
14:16 karolherbst: okay :)
14:16 karolherbst: so we _should_ warn on using it incorrectly
14:16 sima: karolherbst, enable lockdep and it will
14:16 karolherbst: it didn't
14:16 sima: dma_fence_signal_timestamp_locked() has lockdep_assert_held(fence->lock);
14:17 karolherbst: huh...
14:17 karolherbst: maybe I should check that out again, but I was sure that lockdep didn't tell me anything
14:17 karolherbst: but yeah.. dma_fence_signal_timestamp_locked has indeed a assert here...
14:17 sima: karolherbst, you checked it's still running? lockdep gets disabled after the first splat
14:18 karolherbst: uhhh... dunno actually
14:18 sima: tbf it would have surprised me if we'd indeed sucked that much
14:18 sima: I've been sprinkling lockdep_assert_held and encouraged others to do the same for years now
14:18 sima: they're both good documentation and good runtime checks
14:18 karolherbst: yeah.. let me check that out again just to be sure
14:19 karolherbst: maybe the problem was me running the full debug kernel and uhm.. doing other weird things
14:32 sima: karolherbst, so looking at this again I think we're missing a load_acquire barrier before the various test_bit()
14:32 karolherbst: potentially
14:33 karolherbst: but `test_bit` isn't strictly an atomic operation, is it?
14:33 sima: the test_and_set_bit is an rmw which on linux has full barriers
14:33 sima: it is
14:33 karolherbst: ahh, so it is
14:33 sima: linux atomic bitops do not have atomic_ anywhere in their name
14:33 karolherbst: silly
14:33 sima: for entertainment value
14:33 sima: the non-atomic versions have a __ prefix
14:33 karolherbst: ....
14:34 karolherbst: maybe we need a "keep naming closer to C11" patch
14:34 karolherbst: the closest you can get to a CC all on the lkml
14:35 sima: correction, only the rmw ops with a return value have full barrier semantics
14:35 karolherbst: pain
14:35 karolherbst: it still feels wrong
14:36 sima: so yeah we need a pile of smp_mb__after_atomic() I think
14:36 sima: for the "it's signalled already" case
14:36 sima: plus a pile of comments
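[editor's note: a toy sketch of the publish/observe ordering issue under discussion, with a hypothetical my_obj rather than the actual dma-fence code; the reader side uses smp_rmb() to illustrate the acquire ordering sima is asking for on the "already signalled" fast path (the chat suggests smp_mb__after_atomic() in the real dma-fence paths):]

```c
#include <linux/types.h>
#include <linux/bitops.h>
#include <asm/barrier.h>

#define MY_SIGNALED_BIT 0

struct my_obj {
	unsigned long flags;
	int payload;		/* data published before signalling */
};

static void my_signal(struct my_obj *o)
{
	o->payload = 42;
	/* test_and_set_bit() is a value-returning RMW, so on Linux it implies
	 * full barriers: the payload store is ordered before the bit is set. */
	test_and_set_bit(MY_SIGNALED_BIT, &o->flags);
}

static bool my_is_signaled(struct my_obj *o)
{
	if (test_bit(MY_SIGNALED_BIT, &o->flags)) {
		/* test_bit() is a plain, unordered load; without a barrier
		 * here a caller's later reads (e.g. of o->payload) are not
		 * guaranteed to observe the stores made before signalling. */
		smp_rmb();
		return true;
	}
	return false;
}
```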
14:36 karolherbst: yeah so my complaint about dma_fence_signal specifically is that it appears to be a locked operation but strictly isn't
14:37 karolherbst: ehh
14:37 karolherbst: I meant the other one
14:37 karolherbst: dma_fence_is_signaled
14:37 sima: yeah it's only a conditional barrier
14:37 sima: or well, supposed to be, it's a bit buggy in that regard
14:37 karolherbst: yep
14:37 sima: this follows the design of waitqueue and completion and everything else in the linux kernel
14:38 sima: so yeah this is how this works
14:38 karolherbst: pain
14:38 karolherbst: it shouldn't
14:38 sima: imo it's the right semantics
14:38 sima: for completions or anything that looks like one
14:38 karolherbst: I disagree :P
14:38 sima: if your completion needs locking your design seriously smells
14:38 karolherbst: yeah, probably
14:39 karolherbst: I'm sure nouveaus code there is kinda wrong anyway, but oh well
14:39 karolherbst: the future is just to use Lina's abstractions for this probably anyway :P
14:40 sima: yeah
14:40 sima: in general, if the barrier semantics of core primitives (completion, work, kref, anything really) don't work for you
14:40 sima: you're doing something really fishy
14:40 sima: the atomics are lolz because they don't match C11 and have inconsistent naming
14:41 karolherbst: yeah... dunno.. maybe they work, but nouveau doesn't take fence locks but instead its own lock across a list of fences
14:41 karolherbst: so that's kinda fishy
14:41 sima: but the other stuff is imo solid
14:41 karolherbst: maybe it does so in a few places.. I found one where it doesn't
14:41 sima: yeah if that fence list lock keeps the fence alive, then the irq handler might need to take it too
14:41 sima: karolherbst, btw you're volunteering for the dma_fence barrier review patch?
14:41 sima: it's fun, I promise :-P
14:42 karolherbst: https://gitlab.freedesktop.org/drm/nouveau/-/blob/nouveau-next/drivers/gpu/drm/nouveau/nouveau_fence.c#L54
14:42 karolherbst: call to `dma_fence_signal_locked`
14:42 karolherbst: doesn't take fence->base.lock
14:42 karolherbst: so maybe that's the actual bug here and everything just works after solving that
14:42 karolherbst: or it should just use dma_fence_signal instead.. dunno
14:43 karolherbst: I'll check if lockdep screams at me
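[editor's note: a minimal sketch of the contract being discussed, with a hypothetical caller (signal_one_fence is made up, not the nouveau code):]

```c
#include <linux/dma-fence.h>
#include <linux/spinlock.h>

static void signal_one_fence(struct dma_fence *fence)
{
	unsigned long flags;

	/* Take the fence's own lock around the _locked variant ... */
	spin_lock_irqsave(fence->lock, flags);
	dma_fence_signal_locked(fence);
	spin_unlock_irqrestore(fence->lock, flags);

	/* ... or simply call dma_fence_signal(fence), which takes fence->lock
	 * internally.  What the contract does not allow is calling the
	 * _locked variant under some unrelated driver-global lock, as
	 * described above. */
}
```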
14:43 sima: karolherbst, it should splat, so maybe it holds the lock somehow?
14:43 sima: (assuming lockdep is on and all that)
14:43 karolherbst: yeah.. dunno
14:47 karolherbst: it doesn't seem to be held at all.. so that's kinda confusing
14:48 karolherbst: let's see how hard it screams at me...
14:49 karolherbst: sima: uhh... you said it disables after the first splat?
14:50 karolherbst: remember that weird mm lockdep discussion we had a few weeks ago? :D
14:52 sima: karolherbst, no memories of that?
14:52 karolherbst: that stuff happening on driver init loading firmware
14:53 sima: oh request_firmware in the wrong place
14:53 sima: yeah that'll kill lockdep
14:53 karolherbst: :')
14:53 karolherbst: pain
14:53 karolherbst: I'll turn that one into a WARN_ON thing then (the dma-fence one) and see if it triggers
14:55 sima: yeah
14:55 sima: but also, you have to fix lockdep splats
14:55 karolherbst: right...
14:55 sima: or the bugs start piling in real bad, real fast
14:55 karolherbst: let me try to do it over the next few weeks :')
14:55 karolherbst: I hope this doesn't fix like most of the nouveau instability problems
14:56 sima: karolherbst, how old is that fw lockdep splat?
14:58 karolherbst: uhhh... dunno
14:58 karolherbst: potentially old? I have no idea
15:10 sima: karolherbst, just to have a guess of how much pain awaits you
15:10 karolherbst: a lot
15:10 sima: lockdep splats are an excellent canary for bad design and busted data structures
16:40 mbrost: dakr: where is the latest version of gpuva? Also any idea if / when you plan on landing this upstream? We are fairly close to trying to land Xe with it.
19:39 airlied: karolherbst: there was a lockdep splat we were seeing that was being ignored i think
19:39 airlied: pretty sure i dont see it now after the fix
19:39 karolherbst: yeah....
19:39 airlied: but there is a new one to fix
19:40 karolherbst: mind trying to reproduce it, because I sure can't
19:47 jannau: is there a way to tell Xorg/modesetting to not use a kms device? preferably in a generic way like MatchDriver in OutputClass
19:48 jannau: one option might be to patch modesetting to ignore "non-desktop" device
19:50 jannau: we have now a working minimal implementation for the touchbar on apple silicon macbooks
19:54 karolherbst: jannau: I suspect the plan is to have a special daemon take care of displaying stuff on it?
20:00 jannau: karolherbst: yes, that exists now: https://github.com/WhatAmISupposedToPutHere/tiny-dfr
20:00 karolherbst: I wonder if a dirty enough hack would be to ensure it starts before X
20:01 jannau: but sddm/Xorg either grabs the touchbar kms device or fails to start if the daemon is already running
20:01 jannau: unfortunately not
20:01 karolherbst: uhh... right
20:01 karolherbst: there was this drm lease thing, could that be useful?
20:01 karolherbst: not sure it ever landed though
20:02 karolherbst: ahh it did
20:02 karolherbst: wayland docs: https://wayland.app/protocols/drm-lease-v1
20:02 karolherbst: the basic idea behind DRM leases is to hand over control of a device to a client
20:04 karolherbst: but no idea if that also works in X
20:04 airlied: karolherbst: https://paste.centos.org/view/raw/36a38a6b was the classic one I've been seeing, which I think is the fw one you mentioned
20:05 airlied: but I'm not seeing it at the moment in that form
20:05 jannau: it works with gdm + plasma under wayland. either since those respect a multi-seat config or because of "non-desktop" = 1
20:05 karolherbst: airlied: yeah.. it's kinda random it seems
20:06 karolherbst: jannau: right... I think DRM leases would allow you to do this without relying on any kind of configs. But I also see why compositors/X shouldn't use the touchbar anyway, because it's really not a display in the common sense and it probably messes up things
20:06 airlied: but yeah once that goes down you ain't seeing anything else
20:06 karolherbst: but I think with DRM leases the daemon won't need root privileges
20:07 karolherbst: but no idea how mature all of this is
20:08 karolherbst: but yeah.. I guess Xorg shouldn't use `non-desktop` displays at all, but not sure if that's the responsibility of the compositor in an X world or not
20:08 karolherbst: and also not sure if we even care enough
20:09 karolherbst: just let X die already 🙃
20:10 airlied: there's a difference between non-desktop displays and non-desktop kms drivers though
20:18 jannau: we're ready to kill X on apple silicon devices (declaring it broken and unsupported) but we need a sddm release with wayland support first
20:18 DavidHeidelberg[m]: any last reviews before I merge the new farm ON/OFF handling? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23629
20:19 karolherbst: uhh... sddm still doesn't support wayland? mhhh
20:19 karolherbst: annoying
20:19 jannau: airlied: what is a non-desktop kms driver?
20:19 airlied: well a kms driver that is attached to a single small display that isn't for desktop use
20:20 jannau: sddm git supports wayland but the latest release (~2 years old) does not
20:20 karolherbst: ahh, fair
20:21 jannau: fedora has afaik a git snapshot with wayland support
20:24 karolherbst: I think if nobody has something against Xorg ignoring such devices then it's fine, probably
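[editor's note: a hedged sketch of how a display server could check the "non-desktop" connector property jannau mentions, instead of guessing from the mode size; the helper below is hypothetical, plain libdrm userspace C:]

```c
#include <stdbool.h>
#include <string.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Returns true if the connector exposes a non-zero "non-desktop" property,
 * as set by the kernel for displays like the touchbar. */
static bool connector_is_non_desktop(int fd, uint32_t connector_id)
{
	drmModeObjectProperties *props;
	bool non_desktop = false;

	props = drmModeObjectGetProperties(fd, connector_id,
					   DRM_MODE_OBJECT_CONNECTOR);
	if (!props)
		return false;

	for (uint32_t i = 0; i < props->count_props; i++) {
		drmModePropertyRes *prop = drmModeGetProperty(fd, props->props[i]);
		if (!prop)
			continue;
		if (strcmp(prop->name, "non-desktop") == 0)
			non_desktop = props->prop_values[i] != 0;
		drmModeFreeProperty(prop);
	}

	drmModeFreeObjectProperties(props);
	return non_desktop;
}
```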
20:31 karolherbst: gfxstrand: I'm sure you are going to like this: OpMemoryBarrier with non constant semantics
20:33 karolherbst: apparently that's legal in CL SPIR-V :)
20:34 karolherbst: whyyyyyyy
20:39 cmarcelo: karolherbst: ugh :( do you have a test around for that?
20:40 karolherbst: yes, but it's SyCL
20:40 cmarcelo: karolherbst: can I run it using rusticl?
20:41 jenatali: karolherbst: just apply all?
20:41 karolherbst: uhhhh...
20:41 karolherbst: cmarcelo: https://gist.github.com/karolherbst/2965158c53342bb7b3f87b85d97f07e2
20:41 karolherbst: I think this is easier
20:42 gfxstrand: Applying all should be a legal implementation
20:42 karolherbst: maybe we could scan all possible values...
20:42 karolherbst: but uhhhh
20:43 jenatali: I could see a pass that gathers bcsels and phis and ors all the constant values, but you'd have to give up pretty easily once you get out of that pattern
20:43 karolherbst: yeah...
20:44 karolherbst: well... I think it has to resolve to constants
20:44 karolherbst: so it's just a phi of constants in the end
20:44 jenatali: Oh then that's not so bad
20:45 jenatali: Could even do that straight in vtn
20:45 karolherbst: maybe we could do this: if it all resolves to constant, use that, otherwise do all
20:45 karolherbst: jenatali: well.. I suspect it could also be nested phis
20:45 jenatali: Sure
20:45 karolherbst: maybe llvm gets smart and adds alus on it and we get cursed spir-v
20:46 jenatali: And then we give up lol
20:46 jenatali: Unless those alus are bcsels
20:46 karolherbst: I think it would still all constant fold
20:46 karolherbst: maybe we should have a scoped barrier intrinsic with variable semantics
20:46 karolherbst: and then after constant folding we resolve it
20:47 karolherbst: and we shall call it cursed_scoped_barrier
20:48 jenatali: Appropriate name at least
20:48 jenatali: I'd just as soon put it straight in vtn and not try hard at all before giving up
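[editor's note: a hypothetical OpenCL C 2.0 reduction of the pattern being discussed (the actual reproducer linked above is SYCL); the runtime-selected fence flags end up as a select/phi of constants feeding the barrier's semantics operand rather than a single constant:]

```c
kernel void cursed_fence(global int *data, int use_global)
{
	cl_mem_fence_flags flags = use_global ? CLK_GLOBAL_MEM_FENCE
					      : CLK_LOCAL_MEM_FENCE;

	data[get_global_id(0)] += 1;

	/* Non-constant semantics operand after SPIR-V generation: a consumer
	 * can either chase the constants feeding the select/phi, or
	 * conservatively apply all semantics, as discussed above. */
	atomic_work_item_fence(flags, memory_order_acq_rel,
			       memory_scope_work_group);
}
```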
20:48 karolherbst: I'll deal with lower hanging fruits for now
20:48 karolherbst: jenatali: btw, are you subscribed to the OpenCL label?
20:49 jenatali: I am
20:49 karolherbst: ahh, okay
20:49 karolherbst: I have more additions to clc :D
20:49 jenatali: I'll be honest though I haven't been paying too much attention, just following to make sure we don't get broken
20:50 karolherbst: I'm just adding options to the validator
20:50 karolherbst: as apparently the validator checks number of function args and....
20:50 jenatali: Oh I saw that one. I'm not in the office today but remind me on Monday and I can take a look
20:50 karolherbst: cool :)
20:51 karolherbst: I also fixed a bug, because apparently if you pass options, the "src" is a null string killing the logger :)
20:51 jenatali: Hah
20:51 karolherbst: and I was confused about those "(file=" errors :)
21:41 jannau: airlied: looks like rejecting devices with max_width or max_height smaller than 120 pixels (arbitrarily chosen) would be easier than checking the non-desktop connector property
21:46 jannau: a desktop at a resolution smaller than 320x240 would be painful; anything smaller than a quarter of that hopefully hits not too many fringe use cases
22:26 DavidHeidelberg[m]: Merged the updated farm handling, if for some reason you'll see some anomaly when testing MR or running Marge, please ping me, it could be related. Just for sure!
22:27 DavidHeidelberg[m]: *anomaly = missing or having too many farms inside the pipeline
23:20 karolherbst: airlied: wanna review the llvmpipe part of https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22893 ? thanks
23:21 karolherbst: Kayden: I know I kinda hand waved a lot on this topic in the past, but what's the actual deal with the SIMD situation with iris? 8, 16, 32 are generally supported? Or can certain CSOs only support certain SIMD sizes, and how would I know? And it appears that 16 is somehow the one which is preferred?
23:29 Kayden: for compute, it depends on the local workgroup size
23:29 Kayden: the shaders can run in 8, 16, or 32 channels
23:30 Kayden: well...be dispatched in groups of 8, 16, or 32 lanes
23:30 karolherbst: right... so I started to implement CL subgroups and they are kinda annoying
23:31 Kayden: not surprising
23:31 karolherbst: sooo.. on some gpus it appears you can launch 56 subgroups max, which doesn't work with e.g. SIMD16 if you want to launch 1024 threads
23:31 Kayden: yeah. for those, we force to SIMD32 :/
23:31 karolherbst: so I'm mostly trying to understand what the optimal SIMD size is in general
23:31 karolherbst: or rather what gives more perf
23:32 karolherbst: I'm planning to add a cb to get the SIMD size for a given block: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22893/diffs?commit_id=50245f910d14bba01393c871be75c71463a07701
23:32 karolherbst: so atm it's mostly just "what's optimal"
23:32 karolherbst: e.g. Intel seems to limit to 256 threads in CL
23:32 karolherbst: and I suspect it's because of their messy SIMD situation :)
23:32 Kayden: Almost nothing can run in SIMD32, so those programs are basically pairs of SIMD16 instructions for each half
23:32 Kayden: so we only do that when necessary. it can perform better for things like simple pixel shaders
23:33 Kayden: but it's often really bad for register pressure too
23:33 karolherbst: I see
23:33 Kayden: SIMD16 is usually the sweet spot as long as your register pressure isn't terrible
23:33 Kayden: memory access can normally happen 16 channels at a time
23:33 karolherbst: maybe I should do some benchmarks and force a SIMD mode and see how it goes
23:33 karolherbst: okay
23:33 karolherbst: mhhh
23:34 karolherbst: I wonder if I want to limit to max_subgroups * preferred_simd_size (being 16) for iris
23:34 karolherbst: Kayden: btw, "devinfo->max_cs_workgroup_threads" is the amount of subgroups, right?
23:34 karolherbst: or is there a different limit I'm not aware of
23:35 Kayden: I guess so
23:35 karolherbst: I was seeing worse perf compared to Intel's stack, so I'm actually wondering if this has anything to do with the SIMD size forced based on the block size
23:35 Kayden: I was thinking max_cs_threads but it looks like that's basically clamped
23:35 Kayden: err
23:35 Kayden: max_cs_workgroup_threads is a clamped version of max_cs_threads
23:35 Kayden: so it's probably what you want, yeah
23:35 karolherbst: it reports 56 on gen9.5
23:36 karolherbst: and 64 on gen12
23:36 karolherbst: which kinda makes sense
23:36 Kayden: mmm
23:37 Kayden: yeah it's clamped to 64 based on the bit-width of a GPGPU_WALKER field...
23:37 Kayden: you can actually have 112 threads on gen12...
23:37 karolherbst: huh
23:37 Kayden: I think we're dispatching height 1 row N grids, I guess we'd have to do a rectangular grid
23:38 karolherbst: yeah well.. doesn't really matter as long as max_cs_workgroup_threads * 16 stays at or above the max threads
23:38 karolherbst: on my 9.5 it's weird because 56 * 16 < 1024
23:39 karolherbst: so if I report 1024 threads, applications might just enqueue 1024 threads per block and force SIMD32
23:41 karolherbst: but again, intel only reports 256 threads and 8 as the subgroup size, so maybe that allows for even more optimized code? no clue.. it's kinda weird
23:43 karolherbst: the one advantage I have is, that if an application does not specify the block size I can pick freely, so for this alone it would already be helpful to know what's best to pick
23:44 karolherbst: if e.g. SIMD8 is generally the fastest, I'd just use that
23:44 karolherbst: unless the app sets constraints making that impossible
23:48 Kayden: yeah, awkwardly I think SIMD16 is the best, but it really depends :/
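[editor's note: to make the arithmetic above concrete, a toy helper; this is hypothetical and not the actual iris/brw heuristic, which also weighs register pressure and generally prefers SIMD16. A workgroup needs ceil(size / simd_width) hardware threads, and that count has to fit within max_cs_workgroup_threads:]

```c
#include <stdio.h>

/* Hypothetical: pick the smallest SIMD width whose thread count fits. */
static int pick_simd(int workgroup_size, int max_threads)
{
	static const int widths[] = { 8, 16, 32 };

	for (int i = 0; i < 3; i++) {
		int threads = (workgroup_size + widths[i] - 1) / widths[i];
		if (threads <= max_threads)
			return widths[i];
	}
	return -1; /* workgroup size not launchable at all */
}

int main(void)
{
	/* gen9.5: 56 threads, so 1024 invocations force SIMD32
	 * (SIMD16 would need 64 threads; 56 * 16 = 896 < 1024).
	 * gen12: 64 threads, so SIMD16 just fits (64 * 16 = 1024). */
	printf("gen9.5, 1024 invocations: SIMD%d\n", pick_simd(1024, 56));
	printf("gen12,  1024 invocations: SIMD%d\n", pick_simd(1024, 64));
	printf("256 invocations:          SIMD%d\n", pick_simd(256, 56));
	return 0;
}
```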