14:08karolherbst[d]: skeggsb9778: is that an issue where it's worth checking if 570 works? https://gitlab.freedesktop.org/drm/nouveau/-/issues/421
19:02misyltoad[d]: pac85[d]: that's basically how SteamVR works, but pre-TL-sem and pre-BDA
19:02misyltoad[d]: the scene image gets late-latched by a queue wait on the GPU that blocks a write for the new idx
20:53pac85[d]: misyltoad[d]: Ah interesting, and is it done like on async compute? Like I thought about submitting a write packet, but that could get delayed by other workloads if you submit it to gfx
20:55airlied[d]: gfxstrand[d]: mhenning[d] one problem with moving to the latencies I've noticed while rebasing onto main is that according to the docs all the flow control instructions are "decoupled", but we claim they have fixed latencies; not sure where I should solve that discrepancy
21:07gfxstrand[d]: Right now we stall everything at control flow so it kinda doesn't matter.
21:11gfxstrand[d]: What does "decoupled" mean?
21:23misyltoad[d]: pac85[d]: the composition is done on async compute, the late latching is done on gfx queue
21:24misyltoad[d]: in theory one could use transfer queue for late latching actually
21:24pac85[d]: misyltoad[d]: Uhm, so a game could delay the latching?
21:24pac85[d]: misyltoad[d]: Yeah that would be cool but radv doesn't expose it
21:25misyltoad[d]: pac85[d]: I guess technically, but much less likely with VR due to how SteamVR schedules things... there's no backpressure for obvious reasons
21:25pac85[d]: misyltoad[d]: Uhm I see
21:25misyltoad[d]: at least that is my understanding from when I looked at how this worked last time
21:26airlied[d]: gfxstrand[d]: decoupled means use the scoreboard state, coupled means fixed latency, and redirectable means do both
21:31gfxstrand[d]: What other stuff is decoupled? And what all is scoreboarded?
21:32gfxstrand[d]: AFAICT, only barrier registers and maybe some internal state is scoreboarded.
21:33gfxstrand[d]: misyltoad[d]: So you're saying steamvr would like that little extension?
21:33pac85[d]: gfxstrand[d]: Probably any compositor
21:33misyltoad[d]: It'd need us to actually start using TL semaphores in SteamVR... :P
21:33pac85[d]: like if I wrote a compositor it would be 100% bindless
21:33misyltoad[d]: Lots of that code is old
21:33gfxstrand[d]: I'm still not sure if I really want to open the can of worms that drafting and posting that extension might unleash but if it's useful...
21:33misyltoad[d]: I'd like it for Gamescope/Gamescope 2
21:34pac85[d]: gfxstrand[d]: But was your idea to do it in a way different than my bda idea?
21:35gfxstrand[d]: I would probably make it a descriptor type rather than a device address so we can make it explicitly read-only.
21:35airlied[d]: gfxstrand[d]: scoreboards are what the nvidia docs call the rd/wt stuff
21:35pac85[d]: gfxstrand[d]: My take would be to just say "if the user overwrites the value they just lose it and it's their fault"
21:36gfxstrand[d]: The Linux implementation would be a BO attached to the syncobj so the driver can just get the address and do whatever.
21:36pac85[d]: It does mean you have to write per context and can't share the buffer
21:36pac85[d]: gfxstrand[d]: Ah I had something kmd specific in mind
21:37gfxstrand[d]: Oh, the buffer would be implicitly shared. Anyone who has a handle to that syncobj would be able to get at the page.
21:37pac85[d]: Uhm I see
21:37pac85[d]: So a single page of ro stuff you just hand over
21:37gfxstrand[d]: We can't enforce RO.
21:38pac85[d]: Not all hw can map ro?
21:38gfxstrand[d]: Nope
21:38pac85[d]: Like in adreno you'd map rw in ttbr1 and ro in ttbr0 then write from the ring
21:38pac85[d]: gfxstrand[d]: Well if you can't expose ro sharing the buffer sounds bad
21:39gfxstrand[d]: Nvidia can, I think, but I'm not sure it's hooked up. Intel has had the bits since forever but they've only ever worked on one HW generation.
21:39pac85[d]: I see
21:40airlied[d]: "The minimum time between a scoreboard generator (&wr= or &rd=) and the first &req= consumer
21:40airlied[d]: is 2 cycles." might explain the nop 2
21:42pac85[d]: Here is my idea:
21:42pac85[d]: * The kmd uapi has a way of saying "associate a write to this buffer when a particular syncobj is signaled"
21:42pac85[d]: * kmd would just hold some kind of list, then when any submission is made that signals that syncobj it would queue up the write right after it
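A rough sketch of the uapi idea described above, purely hypothetical (no such ioctl exists in any kernel driver today); the struct name and fields are made up for illustration:

```rust
/// Hypothetical ioctl payload: "whenever `syncobj_handle` is signaled,
/// write the signaled point's value into `bo_handle` at `offset`".
#[repr(C)]
pub struct DrmSyncobjAttachWrite {
    pub syncobj_handle: u32, // existing drm_syncobj to watch
    pub bo_handle: u32,      // GEM handle of the buffer that receives the write
    pub offset: u64,         // byte offset of the u64 slot to update
    pub flags: u32,          // e.g. timeline value vs. fixed payload
    pub pad: u32,
}

fn main() {
    // The kernel would keep a per-syncobj list of these attachments and queue
    // the write right after any submission that signals the syncobj; points
    // signaled before the attach are simply lost, as described above.
    let req = DrmSyncobjAttachWrite {
        syncobj_handle: 1,
        bo_handle: 2,
        offset: 0,
        flags: 0,
        pad: 0,
    };
    println!("would attach write: syncobj {} -> bo {}", req.syncobj_handle, req.bo_handle);
}
```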
21:42gfxstrand[d]: airlied[d]: Right. So we don't have the tri-state right now. FP64 wants both scoreboard and latency-aware stalling on some hardware.
21:42airlied[d]: yes and in theory so does fp16 we just haven't found the hw 😛
21:43gfxstrand[d]: 😂
21:43airlied[d]: so the api I've defined is needs_scoreboard, which is mostly what we care about
21:44airlied[d]: the old api is has_fixed_latency which seems to be making stuff up about flow control
21:44airlied[d]: just wondering what API I actually want, needs_scoreboard_and_isnt_branch?
21:45gfxstrand[d]: The other problem is that we can't currently handle the case where A->B is fixed latency but A->C needs scoreboard.
21:46gfxstrand[d]: I've got a plan for that but I need to sit down and basically rewrite the scoreboard code to use Mel's new graph data structure.
21:48gfxstrand[d]: airlied[d]: The API I started drafting last year was one where each of the latency functions returned an enum which was basically `Fixed(N)` or `Min(N)` for scoreboard things.
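A minimal Rust sketch of what that enum-returning latency API could look like; the names and example latencies are placeholders, not the actual NAK code:

```rust
/// Placeholder result of a per-dependency latency query.
#[derive(Clone, Copy, Debug)]
enum DepLatency {
    /// Fixed latency: the consumer can simply issue N cycles later, no scoreboard.
    Fixed(u8),
    /// Scoreboard required: N is the minimum delay before the wait can resolve.
    Min(u8),
}

/// Placeholder query from a producer op to one of its consumers.
fn raw_latency(op: &str) -> DepLatency {
    match op {
        "fadd" => DepLatency::Fixed(6), // made-up value
        "bra" => DepLatency::Min(2),    // decoupled: scoreboard plus the 2-cycle minimum
        _ => DepLatency::Min(2),
    }
}

fn main() {
    // Scheduler side: Fixed(N) becomes a delay, Min(N) becomes a scoreboard wait.
    match raw_latency("bra") {
        DepLatency::Fixed(n) => println!("insert {n} delay cycles"),
        DepLatency::Min(n) => println!("allocate a scoreboard, wait at least {n} cycles"),
    }
}
```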
21:50gfxstrand[d]: I don't think that distinction between decoupled and redirectable is all that interesting. Control flow doesn't "need" fixed latencies because it doesn't write anything besides internal control registers.
21:52airlied[d]: Decoupled doesn't need waits though
21:52airlied[d]: so it is interesting because you need waits on redirectable as well as scoreboards
21:53gfxstrand[d]: pac85[d]: You can already kinda do that today with the copy engine on the client doing the signaling. If I'm reading what you're suggesting right, it sounds like a lot of bookkeeping in the kernel and kinda racy if a different client is allowed to request that after the fact.
21:55gfxstrand[d]: airlied[d]: Wait, what? So it can wait on the scoreboard immediately without the 2 cycle ready delay?
21:57airlied[d]: you have to have the normal delay for the instruction
21:57airlied[d]: as specified in the docs
21:57airlied[d]: sorry, for decoupled it has to be 2
21:57airlied[d]: for redirected it has to be the actual delay value
21:58pac85[d]: gfxstrand[d]: Re copy engine: not all GPUs have that, and if something else uses it, it can delay the write.
21:58pac85[d]: But yes, possible other than that.
21:58pac85[d]: As for it being racy, well the idea is that once you attach an address to the syncobj eventually sync points are written to it. Submissions that happen before attaching wouldn't be accounted for
21:58airlied[d]: mhenning[d]: we have the actual latencies for turing, but not sure how to correspond the docs to your code here
21:58gfxstrand[d]: Right. So how is `Decoupled` not just `Redirected(2)`?
22:00airlied[d]: I dislike hardcoding things that might change in the future, seems a bit arbitrary
22:00gfxstrand[d]: pac85[d]: Doesn't have to be the copy engine. Everything can do `vkCmdFillBuffer()`.
22:00airlied[d]: hmm don't seem to have ampere latencies
22:00airlied[d]: karolherbst[d]: did we get latencies for ampere like the turing ones?
22:01karolherbst[d]: yes
22:01karolherbst[d]: I sent those to you
22:01gfxstrand[d]: airlied[d]: Sure. I'm not saying make it a #define. I just don't get what's special about it and why it needs different handling.
22:01pac85[d]: gfxstrand[d]: Yes but the point is to have control on when it happens. I want the write to happen right after a command buffer from another context for the compositor case.
22:01pac85[d]: For the in-process case then yes, you don't need a special api at all, just that or CmdUpdateBuffer
22:02airlied[d]: karolherbst[d]: not the hazed latencies, the decoupled ones, from the 2nd sheet of the turing excel
22:03karolherbst[d]: ohh..
22:04karolherbst[d]: don't think so
22:04karolherbst[d]: you find those things also on the nvidia partners site if you have an account
22:04karolherbst[d]: sometimes they update things
22:05karolherbst[d]: but I doubt they've added that stuff recently anyway
22:06gfxstrand[d]: pac85[d]: Sure. But now we're back to drivers having to walk this arbitrary list of BOs and do writes to buffers and the first few frames racing because you can't back-patch already submitted commands. Also, since it requires explicit driver support from the driver used by the other client (not the one requesting it), everything is going to fall apart if they're on different kernel drivers and we
22:06gfxstrand[d]: have no way to tell when importing a syncobj where it came from.
22:08pac85[d]: gfxstrand[d]: Uh right, I didn't think about the cross-driver thing.
22:10gfxstrand[d]: What I envisioned was a sort of syncobj listener which mirrors the syncobj timeline in a BO which you can map into your GPU address space or on the CPU. The GPU wouldn't write it directly but it would instead tie into the kernel's internal dma-buf callback mechanism and would get written from the interrupt handler. Not quite as fast as writing from the GPU itself but pretty darn close.
22:11gfxstrand[d]: Now whether you would be guaranteed one per syncobj or be able to create multiple for process isolation purposes is a detail we could bikeshed on. If this were to be a stepping stone on the way toward actual memory fences, we probably only want one.
22:13gfxstrand[d]: In either case, we could implement this for everybody in a couple hundred lines of code.
22:13gfxstrand[d]: Then Mesa drivers would just have to wire up whatever the Vulkan extension looks like and boom! Shader semaphores.
22:17gfxstrand[d]: gfxstrand[d]: In theory, we could also add a way that drivers could say, "nah, I got this" and shave off a couple nanoseconds but, honestly, I think writing from an interrupt context is fast enough.
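A sketch of the consumer side of that listener idea, assuming a hypothetical mirror page the kernel keeps updated from its dma-fence callback / IRQ path; none of these names exist in the real uapi:

```rust
/// Hypothetical layout of the mirrored page: the last signaled timeline point.
#[repr(C)]
pub struct SyncobjMirror {
    pub signaled_point: u64, // written by the kernel, read-only for clients
}

/// CPU-side read of the mirrored value; a shader would do the equivalent
/// load from the same page mapped into the GPU address space.
fn latest_point(mirror: &SyncobjMirror) -> u64 {
    // Volatile read because the kernel updates this behind our back.
    unsafe { std::ptr::read_volatile(&mirror.signaled_point) }
}

fn main() {
    // Stand-in for the mmap'd page a real implementation would hand out.
    let page = SyncobjMirror { signaled_point: 42 };
    println!("latest signaled point: {}", latest_point(&page));
}
```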
22:19gfxstrand[d]: karolherbst[d]: can you give me the purple role?
22:20karolherbst[d]: gfxstrand[d]: you can give it yourself
22:20karolherbst[d]: "channels & roles" and have fun
22:24gfxstrand[d]: There we go
22:27gfxstrand[d]: gfxstrand[d]: airlied[d] I think what I'm saying is that, from the perspective of the scheduler, delay and scoreboard trackers, and everything else, the only things they care about are:
22:27gfxstrand[d]: 1. The minimum number of cycles between instructions for the given dependency
22:27gfxstrand[d]: 2. Whether or not the scoreboard is needed for that dependency or if fixed delays are enough.
22:30gfxstrand[d]: To me that sounds like we want
22:30gfxstrand[d]: ```rust
22:30gfxstrand[d]: struct Delay {
22:30gfxstrand[d]: pub min_cycles: u8,
22:30gfxstrand[d]: pub scoreboard: bool,
22:30gfxstrand[d]: }
22:30gfxstrand[d]: ```
22:31gfxstrand[d]: Possibly with a third `pub estimated: u8`
23:16airlied[d]: okay I'll play around with that a bit more then
23:20snowycoder[d]: Just reporting, seems like pre-Kepler's `fswz` already does quad shuffling and you cannot turn it off.
23:20snowycoder[d]: Second src is taken from the other thread and there's a modifier to select if we want it horizontal (ddx) or vertical (ddy).
23:20snowycoder[d]: I should encode it as a separate NAK IR struct, right?
23:20snowycoder[d]: Also: do you know what the `.NDV` modifier might do in this context? old codegen always keeps it on but it does not seem to affect dEQP correctness.
23:29gfxstrand[d]: So there's just `fswz` and not `fswzadd`?
23:30snowycoder[d]: Doesn't seem so, I cannot find it with nvdisasm and codegen doesn't emit it
23:30gfxstrand[d]: Okay
23:31gfxstrand[d]: Then yeah, separate op
23:31gfxstrand[d]: Wait... At that point, should we just do two shfl?
23:31pac85[d]: Uhm I see, that sounds simpler. Though if you're gonna do it CPU-side it makes me wonder how much better it would be than just having a userspace thread that waits on the fence and does the write. That said it would still be a welcome addition since a thread is not free.
23:32pac85[d]: pac85[d]: Sorry this message lost the reply
23:32pac85[d]: gfxstrand[d]: It was a reply to this
23:32snowycoder[d]: gfxstrand[d]: why two shfl? we can just emit a single fswz for ddx/ddy
23:32gfxstrand[d]: pac85[d]: Interrupt contexts are way faster than waking up a userspace thread. It's typically about as fast as the kernel-internal dependency would be between processes.
23:33gfxstrand[d]: snowycoder[d]: Right. I was about to type that but then got distracted by pac85[d] .
23:33pac85[d]: gfxstrand[d]: Sorryyy
23:33gfxstrand[d]: So yeah, we want a new FSwz op
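A standalone sketch of what such an op might carry, ignoring NAK's actual Src/Dst types and derive macros; everything here is hypothetical except the ddx/ddy selection and the `.NDV` bit discussed above:

```rust
/// Which direction the second source is pulled from within the quad.
#[derive(Clone, Copy, Debug)]
enum FSwzDir {
    Horizontal, // ddx: partner thread along x
    Vertical,   // ddy: partner thread along y
}

/// Hypothetical pre-Kepler quad-swizzle op: the second source is taken from
/// the other thread selected by `dir`; `ndv` mirrors the `.NDV` modifier that
/// old codegen always sets but whose effect is unclear.
#[derive(Debug)]
struct OpFSwz {
    dst: u32, // stand-in for NAK's Dst
    src: u32, // stand-in for NAK's Src
    dir: FSwzDir,
    ndv: bool,
}

fn main() {
    let ddx = OpFSwz { dst: 0, src: 1, dir: FSwzDir::Horizontal, ndv: true };
    println!("{ddx:?}");
}
```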
23:33gfxstrand[d]: pac85[d]: No worries. Sometimes my brain's preemption doesn't work.
23:33pac85[d]: gfxstrand[d]: Yes but we are talking < 100us aren't we? For compositor usecases that's an order of magnitude less than other sources of latency
23:33pac85[d]: gfxstrand[d]: Lol fair
23:34pac85[d]: pac85[d]: Though better consistency is a nice property for sure
23:34pac85[d]: Like yeah I would be for having this in the kernel, even though doing it fully gpu side felt cool
23:34gfxstrand[d]: pac85[d]: Yeah. Doing it all in a userspace thread is probably fine and, honestly, maybe even better than what steamvr does today. But if we wanted it even faster, doing it from the IRQ handler directly is possible.
23:37pac85[d]: gfxstrand[d]: Sounds cool!