11:09phomes_[d]: While testing multiple graphics and transfer queues, I started tracking how many of each queue type various games create. I’ve added this data to a new tab in the spreadsheet in case anyone else finds it useful
11:15phomes_[d]: I also ran some tests on a patched nvk with 16 graphics, 8 compute, and 2 transfer queues. It seemed to work fine
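(For anyone who wants to repeat the test: a minimal sketch of how an application can check what a patched build actually exposes. This is plain Vulkan 1.0 API, not NVK internals; the counts printed are whatever the driver reports.)
```c
/* Minimal sketch: print the queue families and counts a (possibly patched)
 * driver exposes.  Plain Vulkan 1.0; error handling mostly omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

int main(void)
{
	VkInstanceCreateInfo ici = { .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
	VkInstance inst;
	if (vkCreateInstance(&ici, NULL, &inst) != VK_SUCCESS)
		return 1;

	uint32_t ndev = 0;
	vkEnumeratePhysicalDevices(inst, &ndev, NULL);
	VkPhysicalDevice *devs = calloc(ndev, sizeof(*devs));
	vkEnumeratePhysicalDevices(inst, &ndev, devs);

	for (uint32_t d = 0; d < ndev; d++) {
		uint32_t nqf = 0;
		vkGetPhysicalDeviceQueueFamilyProperties(devs[d], &nqf, NULL);
		VkQueueFamilyProperties *qf = calloc(nqf, sizeof(*qf));
		vkGetPhysicalDeviceQueueFamilyProperties(devs[d], &nqf, qf);

		for (uint32_t i = 0; i < nqf; i++)
			printf("device %u family %u: count=%u gfx=%d compute=%d transfer=%d\n",
			       d, i, qf[i].queueCount,
			       !!(qf[i].queueFlags & VK_QUEUE_GRAPHICS_BIT),
			       !!(qf[i].queueFlags & VK_QUEUE_COMPUTE_BIT),
			       !!(qf[i].queueFlags & VK_QUEUE_TRANSFER_BIT));
		free(qf);
	}
	free(devs);
	vkDestroyInstance(inst, NULL);
	return 0;
}
```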
12:15marysaka[d]: so something to note about the 2 transfer queues that I noticed: the blob actually seems to assign a different CE manually for each one (I can see CE1 and CE2 being selected as the engine type when allocating the AMPERE_DMA_COPY_B object)
12:17marysaka[d]: I am not too sure how nouveau assigns GRs/CEs on the kernel side at the moment, but we might want to make sure we are selecting other CEs when we have more than one (my 4060 seems to have 3 of them, but I didn't really confirm it)
12:17marysaka[d]: Normally we would only see one GR, so that side should be fine; it really just matters for the CEs
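(A rough sketch of the selection scheme being described, spreading transfer queues across however many CEs the kernel reports. The struct and helper are hypothetical, not actual NVK or nouveau API; the class number is the AMPERE_DMA_COPY_B value as far as I recall.)
```c
/* Hypothetical sketch: give each transfer queue a distinct copy engine,
 * wrapping around if there are more queues than CEs.  Neither the struct
 * nor the helper is real NVK/nouveau API. */
#include <stdint.h>

struct dma_copy_alloc {
	uint32_t class_id;   /* DMA copy class, e.g. AMPERE_DMA_COPY_B (0xc7b5, assumed) */
	uint32_t ce_inst;    /* which CE instance to bind: CE0, CE1, ... */
};

static struct dma_copy_alloc
pick_copy_engine(uint32_t transfer_queue_idx, uint32_t num_ces)
{
	struct dma_copy_alloc alloc = {
		.class_id = 0xc7b5,
		.ce_inst  = num_ces ? transfer_queue_idx % num_ces : 0,
	};
	return alloc;
}
```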
13:44karolherbst[d]: there are GPUs with like 10 CEs, so how does that work on those?
13:45karolherbst[d]: but are the hardware CEs statically allocated? I thought the hardware or something schedules async jobs between available units
14:42marysaka[d]: karolherbst[d]: do you have a reference for those? might be worth checking the vulkaninfo reports on gpuinfo.org and seeing whether they expose more queues or not
14:43marysaka[d]: I do know that the channel with GR0 assigned also binds CE0, and I see two stray channels with only CE1 and CE2 respectively in my traces
14:45karolherbst[d]: marysaka[d]: uhm.. some DC GPU has 10 entries inside nouveau
14:45karolherbst[d]: `drivers/gpu/drm/nouveau/nvkm/engine/device/base.c` has the tables for it
14:46karolherbst[d]: ohh wait.. ce got reworked
14:47karolherbst[d]: ohh it's dynamic now
14:47karolherbst[d]: check how `nvkm_inth_add` is used
14:48karolherbst[d]: 50551b15c760b3 was the commit that reworked it..
14:48karolherbst[d]: maybe "10" was a bit high of a guess
14:48karolherbst[d]: `nv140_chipset` had 9 🙃
14:48karolherbst[d]: marysaka[d]: ^^
14:49karolherbst[d]: though it might be that it's abstracted away in hardware from a channel perspective
14:49karolherbst[d]: and it just schedules between available hardware units
14:49karolherbst[d]: anyway, don't know for sure how it all works, just that some GPUs have a lot of copy engines
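(For context, the per-chipset tables in base.c declare each engine as a { bitmask-of-instances, constructor } pair after that rework. Roughly what the GV100 entry looks like, reconstructed from memory, so treat it as illustrative rather than an exact quote:)
```c
/* Abbreviated sketch of the nv140 (GV100) entry in
 * drivers/gpu/drm/nouveau/nvkm/engine/device/base.c.  The first field is a
 * bitmask of present instances: 0x000001ff = bits 0..8 = 9 copy engines. */
static const struct nvkm_device_chip
nv140_chipset = {
	.name = "GV100",
	/* ... other subdevs/engines elided ... */
	.ce   = { 0x000001ff, gv100_ce_new },
	/* ... */
};
```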
14:53marysaka[d]: I see :aki_thonk:
14:56marysaka[d]: GA100 also has 10 of them if I trust the OpenRM defs
15:01marysaka[d]: Seems to still map to 2 transfer queues hmm
15:02marysaka[d]: the 5090 also seems to have 8 CEs
15:06karolherbst[d]: yeah I suspect it's more virtualized from a channel pov
16:30mhenning[d]: Yeah, the proprietary driver seems to always expose 2 CEs
16:31mhenning[d]: The nouveau kmd does things like filter a bunch of runlists in nvif_fifo_runlist_ce, so I assumed it was correctly using all of the device's CEs, but I admit I haven't studied the code too closely
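(The filtering idea, sketched with made-up types since I haven't checked the real nvif structures: collect the runlists whose engine mask includes a copy engine, then spread channels across them.)
```c
/* Hypothetical sketch of "filter runlists that contain a CE"; none of these
 * types, fields, or bit positions are the real nvif/nvkm definitions. */
#include <stdint.h>

struct runlist_info {
	uint32_t id;
	uint64_t engine_mask;   /* one bit per engine type on this runlist */
};

#define ENGINE_BIT_CE (1ull << 1)   /* placeholder bit position */

static uint32_t
collect_ce_runlists(const struct runlist_info *rl, uint32_t count,
		    uint32_t *out_ids)
{
	uint32_t n = 0;
	for (uint32_t i = 0; i < count; i++)
		if (rl[i].engine_mask & ENGINE_BIT_CE)
			out_ids[n++] = rl[i].id;
	return n;
}
```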
16:37mhenning[d]: phomes_[d]: Yeah, we've done the work, so exposing that many queues should be correct; the only reason we don't is that doing so tends to slow things down. I think actually exposing that many queues should wait on us having proper TSGs (and SCG for the compute queues)
16:40phomes_[d]: yes that makes sense. I just wanted to mention that I found no issues while testing it
16:41mhenning[d]: Sure, thanks for testing