03:59 airlied[d]: okay have 580 and 595 scheduling channels on blackwell, might be time to play with spark
17:42 marysaka[d]: If we receive a RC trigger event from GSP, the engine type provided should be valid?
19:41 mohamexiety[d]: also what is our async copy situation like from a kernel pov?
19:46 mhenning[d]: async copy has been implemented already. it requires kernel 7.0+ for some bug fixes but the kernel support itself has been there for ages
19:47 mhenning[d]: oh, it also requires turing+ for now due to kernel bugs on earlier hardware
19:51 marysaka[d]: mhenning[d]: I am currently reconfirming the claim that async copy is actually working because allocating a context with NOUVEAU_FIFO_ENGINE_CE seems to end up on GR0 if I trust the fault I'm forcing
19:52 mhenning[d]: ah, okay it could be buggy then
19:52 marysaka[d]: (testing on current drm-misc-next-fixes)
19:53 mhenning[d]: It's supposed to prefer CEs without a GR attached. Could probably print out what's actually getting selected on the kernel side
19:53 marysaka[d]: currently trying to get that yeah
19:54 marysaka[d]: (I wanted to confirm that async CE doesn't have I2M and got lost in another rabbit hole 😅)
19:57 karolherbst[d]: ohh right.. at that topic.. do we use copy for VRAM to VRAM copies?
19:57 mhenning[d]: Are you using I2M for something? I was just wondering about using it for CmdFillBuffer the other day
19:57 mhenning[d]: karolherbst[d]: yes
19:58 karolherbst[d]: okay, because we shouldn't do that
19:58 karolherbst[d]: copy is fast enough to hide PCIe latencies, but not to saturate VRAM bandwidth.
19:58 mhenning[d]: what do we use instead?
19:59 karolherbst[d]: 2D
19:59 mhenning[d]: really? I had assumed that 2d was slower than copy
19:59 karolherbst[d]: nah, 2D acts with the same access as 3D or compute
19:59 karolherbst[d]: it got even faster on Ada
19:59 karolherbst[d]: or ampere?
20:00 karolherbst[d]: anyway, dma-copy is mostly good for transfers over PCIe, for anything else we should use 2D or something else
20:00 marysaka[d]: mhenning[d]: I have some patches that unify copy and fill operations and also add some I2M util
20:00 marysaka[d]: and I use that for UpdateBuffer
20:00 karolherbst[d]: and even for transfers over PCIe, I2M _might_ be faster for small data
20:01 marysaka[d]: Was thinking of switching QMD upload to that too
20:01 karolherbst[d]: to inline?
20:01 karolherbst[d]: yeah
20:01 marysaka[d]: yes
20:01 karolherbst[d]: I think the threshold is somewhere around UBO sizes
20:01 karolherbst[d]: not sure
20:01 karolherbst[d]: would have to check what nvidia does
20:02 mhenning[d]: alright well I should dust off my 2d engine MR at some point then
20:02 karolherbst[d]: yeah
20:02 karolherbst[d]: would be good to do some benchmarking there
20:03 mhenning[d]: yeah, all my benchmarking so far has been 2d vs 3d. Haven't tried 2d vs CE
20:03 karolherbst[d]: ahh it became faster on Ada
20:04 karolherbst[d]: 4x as fast as previous gens
20:04 marysaka[d]: marysaka[d]: Okay so good news: it actually selects CE2 even for the sub object for XXB5... the bad news is that it always selects GR2 for async
20:04 karolherbst[d]: so prior Ada the perf difference could be slower
20:04 karolherbst[d]: *different
20:05 marysaka[d]: marysaka[d]: we have a `ffs(runl_mask) - 1` that selects it. Not quite sure what we should do here, maybe userspace should be selecting the engine index?
20:05 marysaka[d]: As for Blackwell we do have GR1 to GR3 too now
20:07 mhenning[d]: Maybe that should be a kernel scheduling decision rather than a userspace decision?
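The `ffs(runl_mask) - 1` selection mentioned above always resolves to the lowest set bit of the mask, which would explain why the same engine is picked every time. A minimal sketch of that behaviour, assuming a plain bitmask of candidate runlists; `pick_first_runlist` and `pick_nth_runlist` are hypothetical helpers for illustration, not the actual nouveau code, and the second one only shows what an explicit engine index could look like:

```c
#include <strings.h>  /* ffs() (POSIX) */

/* Hypothetical helper: index of the lowest set bit in runl_mask,
 * or -1 if the mask is empty. Mirrors the ffs(runl_mask) - 1 idiom. */
static int pick_first_runlist(unsigned int runl_mask)
{
    return ffs((int)runl_mask) - 1;
}

/* Hypothetical alternative: index of the n-th set bit (n counted from 0),
 * or -1 if the mask has fewer than n + 1 bits set. */
static int pick_nth_runlist(unsigned int runl_mask, unsigned int n)
{
    while (runl_mask) {
        int idx = ffs((int)runl_mask) - 1;

        if (n == 0)
            return idx;
        n--;
        runl_mask &= runl_mask - 1;  /* clear the lowest set bit */
    }
    return -1;
}
```

With a mask that has bits 2, 3 and 4 set, `pick_first_runlist` can only ever return 2, while `pick_nth_runlist` reaches 3 and 4 as well. Whether that index should come from userspace or from a kernel scheduling decision is the open question in the discussion.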
20:07 mohamexiety[d]: karolherbst[d]: 2D, or compute shaders? the prop driver iirc uses compute shaders for things >128KiB
20:08 karolherbst[d]: mohamexiety[d]: I know that 2D is faster than copy for textures and stuff
20:08 karolherbst[d]: but it might be that compute can be even faster
20:08 karolherbst[d]: 2D does benefit from compression where copy doesn't
20:08 marysaka[d]: mhenning[d]: I think so too, but I am pretty sure that the blobs directly select CE2 to 4 for the transfer queue in userspace
20:08 karolherbst[d]: so it could be that for buffers it matters less
20:09 mohamexiety[d]: i see
20:11 mhenning[d]: using compute does avoid a subc switch if you're already running compute stuff
20:12 karolherbst[d]: but it's still a full shader running instead of just using an engine, but yeah.. it might be better to use shaders for bigger transfers
20:12 karolherbst[d]: AMD seems to have a similar situation
20:12 mhenning[d]: right, I don't know how the relative costs actually play out
20:13 karolherbst[d]: yeah.. and for small things there is always I2M that's actually pretty fast in that case
20:15 mhenning[d]: should we be using 2d for async copy queues too? or is it better to run on a CE?
20:15 karolherbst[d]: well.. it's not async anymore
20:15 karolherbst[d]: but yeah.. no idea..
20:15 mhenning[d]: I guess yeah
20:16 karolherbst[d]: I don't think speed particularly matters for async
20:16 karolherbst[d]: but dunno.. guess it depends how applications are using it
20:17 marysaka[d]: marysaka[d]: Opened some MR, really not happy with the look of the interface but yeah https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40834
20:17 karolherbst[d]: benchmarks benchmarks :ferrisBongo:
20:18 marysaka[d]: For I2M I have a very simple program that just measures the time of the UpdateBuffer. It's not really great but I can push it if anyone wants
20:18 karolherbst[d]: oh I meant games
20:18 karolherbst[d]: 😄
20:19 karolherbst[d]: marysaka[d]: but yeah.. I think it would be interesting to see where copy overtakes I2M
20:19 marysaka[d]: oh well the bit I tested wasn't any different 😄
20:19 marysaka[d]: karolherbst[d]: So it kind of overtakes at the 48KiB mark, knowing that we are splitting some pushbuf at that point
20:19 karolherbst[d]: but I guess that could be difficult to measure?
20:19 karolherbst[d]: marysaka[d]: mhhh
20:20 karolherbst[d]: karolherbst[d]: guess my guess was pretty good then 🙃
20:20 karolherbst[d]: I think if we do 64k, then we always have I2M for UBOs
20:21 karolherbst[d]: marysaka[d]: mhhh
20:21 karolherbst[d]: might want to make it split around 64k then?
20:21 karolherbst[d]: then we never have to sub chan switch to copy for UBO uploads
20:21 marysaka[d]: so UpdateBuffer hard limit by spec is 64k
20:22 marysaka[d]: ah you mean UBOs
20:22 karolherbst[d]: but also wondering if nvidia ever uses something else for UBOs
20:23 karolherbst[d]: marysaka[d]: right.. I mean for normal buffers we could just switch to copy above 64K, but that would allow us a uniform path for UBO, unless I2M gets slower even at 64k without a push buffer split
20:23 karolherbst[d]: but
20:23 karolherbst[d]: I also hope there is no game that constantly reuploads full UBOs
20:23 calim: if there is it would deserve to be slow :P
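The size cutoff floated above (I2M for anything up to 64k so UBO uploads never need a subchannel switch to copy, copy beyond that) could be sketched roughly as follows. The enum, the constant name and `choose_upload_path` are hypothetical, and 64 KiB is only the value proposed in the discussion; the measured crossover was around 48KiB with a pushbuf split in play, so the real number would need benchmarking:

```c
#include <stddef.h>

/* Hypothetical names for illustration only, not the NVK implementation. */
enum upload_path {
    UPLOAD_PATH_I2M,   /* inline-to-memory, data embedded in the pushbuf */
    UPLOAD_PATH_COPY,  /* copy engine */
};

/* Assumed cutoff taken from the discussion, not a measured optimum. */
#define I2M_MAX_BYTES (64u * 1024u)

/* Pick the upload path purely by size: I2M up to and including the
 * cutoff, the copy engine for anything larger. */
static enum upload_path choose_upload_path(size_t size_bytes)
{
    return size_bytes <= I2M_MAX_BYTES ? UPLOAD_PATH_I2M : UPLOAD_PATH_COPY;
}
```

Under this assumption a full 64 KiB upload still goes through I2M, and only larger buffers fall back to copy.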
20:25 karolherbst[d]: marysaka[d]: do we really have to initialize the M2MF engine to use the I2M methods on 3D/compute?
20:25 marysaka[d]: karolherbst[d]: I don't think so but if we use the subchannel we might
20:25 karolherbst[d]: mhhhhh
20:26 marysaka[d]: tho technically with what I typed it should never happen
20:26 marysaka[d]: so I could drop it actually
20:26 karolherbst[d]: right... my assumption was always that I2M is 1:1 copied into 3D and compute
20:26 karolherbst[d]: or rather..
20:26 karolherbst[d]: it's not part of either anyway
20:26 karolherbst[d]: it's just available on those sub channels
20:27 karolherbst[d]: was just wondering skimming over your MR
20:27 marysaka[d]: yeah but the blobs do ask the GSP to allocate the I2M subchannel anyway
20:27 marysaka[d]: also that's one of my diff with NVIDIA blobs UpdateBuffer, they always put it on I2M subchannel
20:27 marysaka[d]: I don't see a reason not to stay on 3D or compute anyway as it should be the same
20:27 karolherbst[d]: mhhhhhhh
20:27 karolherbst[d]: okay...
20:28 karolherbst[d]: I _think_ I know why
20:28 marysaka[d]: (and I need the render override as conditional rendering affects inline copies)
20:28 karolherbst[d]: it might be that using the I2M subchannel won't invoke a sub channel switch
20:28 karolherbst[d]: might be interesting to test this
20:28 marysaka[d]: hmmm
20:29 karolherbst[d]: like I can't think of any other reason why nvidia would do that 😄
20:32 marysaka[d]: yeah that would make sense tbh
20:34 karolherbst[d]: it's going to be more interesting once we get rid of a couple of memcpys
20:34 karolherbst[d]: though not sure if we really do blocking host uploads like that?
20:35 calim: when the nv driver does something it could always be because you don't have a pro card ;)
20:36 karolherbst[d]: I need to get back to my nvpaint program idea to implement a paint GUI app on top of 2D 🙃
20:56 calim: lol that wouldn't have a lot of functionality though would it?
20:56 karolherbst[d]: well
20:57 karolherbst[d]: the hw can do polylines
20:57 karolherbst[d]: no but seriously, it can do quite a lot
20:57 karolherbst[d]: https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/twod/cl902d.h
20:58 karolherbst[d]: `NV902D_SET_COMPRESSION` 🥲
20:58 karolherbst[d]: mohamexiety[d]: I guess for 2D we would have to toggle that to true with compression enabled? I just saw that one
20:59 karolherbst[d]: anyway there are like weird things like `MONOCHROME_PATTERN` I want to RE
21:01 calim: oh it might have been fun to use for drawing PCB routing
21:01 mohamexiety[d]: karolherbst[d]: yeah would make sense
21:02 karolherbst[d]: calim: compositors might be interested in having some accelerated 2D primitives that aren't necessarily like huge beasts like shaders and such
21:03 calim: isn't it secretly doing shaders anyway? or is it really its own dedicated thing
21:03 karolherbst[d]: it's its own thing afaik
21:04 airlied[d]: surprised it hasn't disappeared on newer gpus
21:04 calim: curious, I thought they'd be emulating it these days
21:04 karolherbst[d]: AMD added it back afaik
21:04 calim: like, in firmware
21:04 karolherbst[d]: like modern AMDs apparently have a dedicated 2D engine now
21:04 karolherbst[d]: again
21:04 karolherbst[d]: or so I've heard
21:04 karolherbst[d]: I guess with 8k displays it kinda makes sense 🙃
21:05 airlied[d]: well there's a bunch of video post processing engines
21:05 airlied[d]: amd vpe
21:05 karolherbst[d]: or 16k even
21:05 karolherbst[d]: mhh
21:05 karolherbst[d]: yeah no idea what it's called on AMD, but apparently there is fixed function stuff for 2D ops
21:05 karolherbst[d]: I haven't checked
21:05 calim: 8k displays lol my eyes don't even have 4k resolution
21:06 karolherbst[d]: ~~skill issue~~
21:07 calim: they should add a dedicated webpage scrolling engine
21:07 karolherbst[d]: but as always, the display I have is peak of whatever one needs
21:12 HdkR: Someone needs to ship Karol an 8k TV
21:13 HdkR: A big 65" for your desk? :P
21:13 karolherbst[d]: please don't
21:13 karolherbst[d]: my GPU is already screaming at my current display
21:13 karolherbst[d]: actually wondering if my Intel GPU is even able to keep up
21:14 HdkR: Should be fast enough for xterm
21:16 karolherbst[d]: even at 4K@165? 🙃
21:16 karolherbst[d]: my pre Xe intel GPU wouldn't have been able to tho 😄
21:17 airlied[d]: I think the industry has decided 8k is dead
21:17 airlied[d]: like 3D
21:18 karolherbst[d]: good
21:24 HdkR: Gaming industry at least has decided that 1080p 1060hz panels are more important
21:24 karolherbst[d]: 😄
21:24 karolherbst[d]: 1060?!?
21:24 HdkR: Yea, I think that's currently the peak
21:28 calim: that sounds silly, really you should make hzless displays where you only specify the response time in nanoseconds
21:28 dwfreed: blame esports
21:29 HdkR: esports and 4x framegen :D
21:31 HdkR: 4k240 is definitely the most interesting combination for me. Fast terminals and mouse response. VRR meaning my 9070 XT can just go as high as it can in most games.
21:36 mohamexiety[d]: i love 4k240 (doesn't work on nouveau though rip) but i am eyeing that Samsung 6k165 for coding
21:37 calim: can you type at 240 letters per second?
21:37 HdkR: I currently have 3x 4k144 (although one is locked at 120hz because amdgpu bugs), works pretty well.
21:37 mohamexiety[d]: calim: heh, no but it does feel a lot better for responsiveness and scrolling and such
21:39 calim: this works like a drug, really I never knew I needed more than 60 hz but then they give you the speed and your brain gets addicted
21:39 HdkR: Pixel response of higher refresh panels is /very/ nice
21:40 HdkR: Which is why people love OLEDs so much
21:41 HdkR: I upgraded from some 4k60 Dell panels that had /horrid/ pixel response times. High refresh was a major factor for my current panels because of it.
21:43 mohamexiety[d]: yeah
21:43 mohamexiety[d]: the current 4k240 i am using for coding etc is actually oled
21:43 HdkR: Nice
21:43 mohamexiety[d]: which i know is a bit dumb/destructive buuuuut
21:43 HdkR: Eh, all monitors are consumable
21:44 mohamexiety[d]: yeah, just these are a bit more than the usual hehe
21:44 HdkR: All my IPS panels have some image retention, so my i3wm splits tend to have a line down the center of the screen
21:44 mohamexiety[d]: hopefully the 6k thing comes out soon and then i should be able to spread the static content load better
22:27 airlied[d]: I did start typing frl patches, but never got them to not blow up
22:32 karolherbst[d]: on nvk or radv?
22:32 karolherbst[d]: for nvk we probably will have to support a few MMA types we don't actually support in hardware
22:44 airlied[d]: well nouveau
22:44 airlied[d]: FRL HDMI support
23:11 karolherbst[d]: ohhhhh FRL...
23:11 karolherbst[d]: I thought of FSR 😄