17:25 marysaka[d]: mhenning[d]: for the whole "FLUSH_SYSMEM" barrier flag I'm really not sure how to name it to be honest... my understanding of `NV906F_SET_REFERENCE` was that it was flushing any writes from the front-end to "sysmem"
17:29 marysaka[d]: I also noted that they generate a SET_REFERENCE in case of `dst_access_mask & VK_ACCESS_2_COMMAND_PREPROCESS_READ_BIT_EXT` but I don't really have a test app that uses DGC to see how they implement it
17:30 marysaka[d]: marysaka[d]: But I suppose it actually waits for those writes to happen and then flushes, I guess? Really unsure how we could prove the behavior of those to be honest
17:32 mhenning[d]: That's not my understanding at all. I thought it was used to prevent reading ahead in the command buffer.
17:43 marysaka[d]: mhenning[d]: Hmm I was thinking that way mostly because of how it's used on openrm kernel side https://github.com/NVIDIA/open-gpu-kernel-modules/blob/db0c4e65c8e34c678d745ddb1317f53f90d1072b/src/nvidia/src/kernel/gpu/rc/kernel_rc_watchdog.c#L1410
17:43 marysaka[d]: The next pushbuf is set with "sync_wait" (what we also call no prefetch) so that usage wouldn't make much sense...
17:43 marysaka[d]: But at the same time, looking at most of my Ada dumps, SET_REFERENCE is used right before the end of the current command buffer, followed by a command buffer without the sync wait flag, so it might prevent the prefetch in some way, but I wouldn't trust it to invalidate what comes after a SET_REFERENCE (well, up to the amount it can prefetch at least, I think this is documented somewhere in openrm btw)
17:50 marysaka[d]: Hmm no, looking at it, it's totally used right before splitting the command buffer on the blobs too... I'm really having a hard time understanding the whole deal here
17:51 mhenning[d]: marysaka[d]: Looks like open-gpu-doc has some docs for it under "SET_REF"
17:51 mhenning[d]: > Before the reference count value is altered, Host waits for the engine to
17:51 mhenning[d]: > be idle (to have completed executing all earlier methods), issues a SysMemBar
17:51 mhenning[d]: > flush, and waits for the flush to complete.
17:52 mhenning[d]: from manuals/ampere/ga100/dev_pbdma.ref.txt
17:55 marysaka[d]: Ah okay, that at least makes more sense, together with:
17:55 marysaka[d]: > If software desires that the engine finish processing all methods generated from
17:55 marysaka[d]: > one PB segment before a second PB segment is fetched, then software may place
17:55 marysaka[d]: > Host methods that wait until the engine is idle in the first PB segment (like
17:55 marysaka[d]: > WFI, SET_REF, or SEM_EXECUTE with RELEASE_WFI_EN set).
17:56 marysaka[d]: So yeah, it ensures that the next segment entry isn't fetched before the engine is idle too, okay
17:57 marysaka[d]: but it also means that sync wait needs to be set, apparently
18:00 mhenning[d]: right, we typically do a SET_REFERENCE followed by a nvk_cmd_buffer_push_indirect which sets .no_prefetch
18:06 mhenning[d]: Anyway, maybe calling it WFI | FLUSH_SYSMEM does make sense in that case
18:07 marysaka[d]: `HOST_WFI_FLUSH_SYSMEM` I guess?
18:08 mhenning[d]: Sure, that works
18:50 karolherbst[d]: anybody had any chance to benchmark/test my recent MR? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41042
19:08 phomes_[d]: I just got home from a trip and will start benchmarking in a bit
20:22 karolherbst[d]: cool cool
20:22 karolherbst[d]: that also seems to help the surge2 😄
20:23 karolherbst[d]: noticed any artifacts or anything?
20:23 karolherbst[d]: but yeah.. cool that it helps 2 games pretty significantly
20:30 phomes_[d]: no artifacts. I will give it some more testing. I can also try and test it together with the other MRs that helped The Surge 2
20:37 karolherbst[d]: I kinda doubt they'll help in combination...
20:37 karolherbst[d]: tho maybe.. dunno
20:37 karolherbst[d]: the sink one might help
20:40 karolherbst[d]: but still kinda sad that this also doesn't help X4 😄
20:44 karolherbst[d]: anyway.. the main reason I don't think the combination of those three MRs is gonna be better than just one is that they all simply allow more warps to be resident by reducing the number of GPRs used
20:44 phomes_[d]: combining them helped a little. 67 fps
20:45 karolherbst[d]: okay
20:45 karolherbst[d]: seems like reducing the amount of GPRs even more still helps there
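(Editor's sketch, not from the chat: a rough C illustration of why fewer GPRs per thread can mean more resident warps. The register file size, warp width, and warp cap below are assumed round numbers for illustration, not figures for any particular GPU, and allocation granularity is ignored.)

```c
#include <assert.h>

/* Assumed, illustrative limits; real values vary by GPU generation. */
#define SKETCH_REGFILE_GPRS_PER_SM 65536u
#define SKETCH_THREADS_PER_WARP    32u
#define SKETCH_MAX_WARPS_PER_SM    48u

/* Number of warps that fit on one SM when only the register file and the
 * hardware warp cap are considered. */
static unsigned sketch_resident_warps(unsigned gprs_per_thread)
{
    assert(gprs_per_thread > 0);
    unsigned gprs_per_warp = gprs_per_thread * SKETCH_THREADS_PER_WARP;
    unsigned by_regs = SKETCH_REGFILE_GPRS_PER_SM / gprs_per_warp;
    return by_regs < SKETCH_MAX_WARPS_PER_SM ? by_regs
                                             : SKETCH_MAX_WARPS_PER_SM;
}
```

(Under these assumptions a shader using 128 GPRs per thread fits 16 warps, and trimming it to 96 GPRs fits 21. Once the warp cap is reached, further register savings no longer add warps.)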
20:46 phomes_[d]: and yes, poor X4 still remains stuck
20:47 karolherbst[d]: I failed to look at the shaders, because the fossilize tooling didn't like them...
20:47 karolherbst[d]: I think there was a way to get the fossils that steam generates somewhere
20:50 matt_schwartz[d]: STEAM_FOSSILIZE_DUMP_PATH= iirc
20:53 karolherbst[d]: but I think having a replayable trace of the benchmark or what you are doing to test it would help even more
20:57 phomes_[d]: I can do a renderdoc or gfxreconstruct capture if that helps?
21:36 karolherbst[d]: phomes_[d]: yeah, I think that would help