14:42 phomes_[d]: skyrim is another game with poor performance. Prop hits the 60 fps cap. NVK runs at 4 fps
14:43 phomes_[d]: there are ~5k CmdDrawIndexed calls in the frame. One of them is a really bad outlier on NVK, but the avg case is also interesting
14:44 phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1476590644888862821/Screenshot_From_2026-02-26_14-58-36.png?ex=69a1adb3&is=69a05c33&hm=2cbd37e19413b1ef9b601aef16cf6ebc8a35ffa9496c3af84dec95dfe4c9e925&
14:44 phomes_[d]: here is a box plot of the draw duration as measured by renderdoc:
14:47 karolherbst[d]: huh...
14:47 karolherbst[d]: that looks.. "odd"
14:48 karolherbst[d]: but maybe that's just due to log scaling
14:48 karolherbst[d]: but the line around 500µs is ... "odd"
14:49 marysaka[d]: really sounds like we are stalling like crazy
14:49 marysaka[d]: does the capture also show the same performance issue?
14:49 marysaka[d]: and if so could you make a gfxreconstruct one too (I could probably fire up Skyrim to do one actually...)
14:50 phomes_[d]: I can do a gfxreconstruct too and put it all somewhere
14:53 phomes_[d]: the plots are based on measurements done with renderdoc. The fps is from normal gameplay
14:53 marysaka[d]: on a sidenote, renderdoc timestamps when you load the capture are a bit off compared to the blob and don't always align with the actual game performance... I haven't done much digging on that yet but maybe something is different in the pipeline location that we report compared to the blobs
14:54 marysaka[d]: One thing I know is that the pipeline location when we use the compute stage bit is wrong compared to the blobs
14:58 phomes_[d]: do you want just 1 frame or many in the gfxreconstruct capture?
15:02 marysaka[d]: many would be better
15:02 marysaka[d]: we can always trim down to one frame if needed
15:19 phomes_[d]: gfxreconstruct capture is here: https://drive.google.com/drive/folders/1FFBbajSwoKohHAnix4ojGk1ktr58r2MS?usp=sharing
16:57 mhenning[d]: marysaka[d]: oh? what does the blob use?
17:12 marysaka[d]: mhenning[d]: it seems to use LOCATION_ALL but I haven't dug too much there
17:12 marysaka[d]: it also stores the available as part of the payload with one SET_REPORT_SEMAPHORE
17:13 mhenning[d]: Oh, yeah the available thing is something we should take advantage of
17:15 marysaka[d]: yeah, I also think we could possibly handle writing the timestamp for transfer stages with the DMA subchannel
17:16 marysaka[d]: but also mhenning[d] we do not check the subchannel for compute... we could very well use SET_REPORT_SEMAPHORE on compute subchannel
17:16 marysaka[d]: the nvk_event codepath we have handles that but not the query pool side
17:17 mhenning[d]: yeah, that too
17:17 mhenning[d]: I wired it up for the event code path but never bothered with query pool
17:19 mhenning[d]: but yeah, our usage of NV9097_SET_REPORT_SEMAPHORE_D_PIPELINE_LOCATION_NONE for compute should be correct because the pipeline switch forces a stall (and therefore the compute work is done)
17:19 mhenning[d]: but also we ideally wouldn't add extra stalls when measuring perf
17:20 marysaka[d]: yeah now that I think about it having ALL would be weird for compute, technically we do not have the whole async compute setup on that queue so it shouldn't be any different than NONE I think?
17:21 marysaka[d]: as 3D -> Compute would have caused a WFI already on the gfx side
17:25 mhenning[d]: Well, it could be different if you use the compute engine, then the graphics engine, and then run the query
17:25 mhenning[d]: ALL waits on the graphics work but NONE does not
17:43 karolherbst[d]: probably should also look into SCG, because it does mitigate some of the problems here with 3D and compute alternating
17:47 mhenning[d]: Right. We probably need support for multiple queues in the same TSG first though. We've had that discussion before
18:12 karolherbst[d]: TSG?
18:12 karolherbst[d]: but yeah.. not entirely sure how it all works in practical terms, but I don't think you need a second hw channel because SCG appears to work within the same one
18:13 karolherbst[d]: but who knows for sure
18:14 mhenning[d]: Time slice group
18:14 karolherbst[d]: ohh that stuff
18:15 mhenning[d]: karolherbst[d]: I have a patch that tries to set up SCG on a single queue and it makes things significantly slower
18:15 mhenning[d]: I assume that that's because it's really meant for multiple queues and doesn't make sense on just one
18:16 karolherbst[d]: which kind of SCG did you set up?
18:16 karolherbst[d]: because there are like 3 kinds
18:16 mhenning[d]: I just tried to mirror what the blob does
18:16 karolherbst[d]: is the blob using different channels for compute and graphics there?
18:17 karolherbst[d]: but yeah, could be possible that one needs multiple channels, but not quite sure why tho...
18:19 mhenning[d]: What do you mean "not sure why"? The feature lets a compute queue and a graphics queue run at the same time. If you don't have a compute queue you don't benefit.
18:20 karolherbst[d]: well it doesn't on a hw level
18:20 karolherbst[d]: afaik it's just scheduling
18:21 karolherbst[d]: within the same context
18:23 karolherbst[d]: but if nvidia is doing multiple channels for it, then that's probably also what we have to do
18:23 karolherbst[d]: but from my understanding I don't really see why that would be necessary
18:26 karolherbst[d]: but what I do know is, that the scheduling policies in the classes do matter here
19:59 mohamexiety[d]: SCG is just the NV name for async compute which does need multiple queues, otherwise the feature doesn't really do anything since there's nothing to overlap work with
20:05 karolherbst[d]: afaik they also enable SCG even if you don't have a special compute queue
20:10 karolherbst[d]: but in a very limited capacity
20:11 karolherbst[d]: but it could also be that I misremember or that I misinterpreted the data
20:11 karolherbst[d]: so probably best to just check what's going on
20:13 airlied[d]: do we have a trace of a vulkan app on prop that uses two queues to see what ctrl msgs it sends?
20:29 phomes_[d]: I made some notes of games that use different/multiple queues https://docs.google.com/spreadsheets/d/1RuHD3Z_nBKCp618HHC5I9hOu0lqCoFYwQ4FM69M-Ajg/edit?gid=2144951446#gid=2144951446&range=A1
20:29 phomes_[d]: I can make gfxreconstruct or renderdoc trace of either of those if you want