01:32 linkmauve: What is CRC in that context?
01:49 karolherbst: https://deepwiki.com/rib/intel-gpu-tools/4.1.1-crc-based-display-validation
01:49 karolherbst: ehhh...
01:49 karolherbst: nooooo
01:49 karolherbst: I didn't want to link to that...
01:50 karolherbst: don't we have our own docs on what CRC is? 🙃
01:52 karolherbst: I don't even know if it's correct 🙃 but tldr of CRC is, you can kinda verify the display is doing what it's supposed to be doing or something
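A minimal sketch of the CRC idea, assuming a frame is just a buffer of raw pixel bytes. You compute a checksum of the frame that was actually scanned out and compare it with the checksum of a known-good reference. Real display CRCs are computed by the display hardware with hardware-specific algorithms and read back through the kernel. zlib's CRC-32 and the helper names `frame_crc` and `frames_match` are hypothetical stand-ins chosen only for illustration.

```python
import zlib

def frame_crc(pixels: bytes) -> int:
    """CRC-32 (zlib polynomial) over a raw frame buffer."""
    return zlib.crc32(pixels) & 0xFFFFFFFF

def frames_match(reference: bytes, captured: bytes) -> bool:
    # Equal CRCs mean the frames are almost certainly identical;
    # different CRCs prove that they are not.
    return frame_crc(reference) == frame_crc(captured)

# Two tiny 2x2 "frames" of 32-bit pixels; the second differs in one bit.
ref = bytes([0x00, 0xFF, 0x00, 0x00] * 4)
same = bytes(ref)
off_by_one = bytes([0x00, 0xFF, 0x00, 0x00] * 3 + [0x00, 0xFE, 0x00, 0x00])

print(frames_match(ref, same))        # True
print(frames_match(ref, off_by_one))  # False
```

A single-bit difference like this one is always caught by a CRC, which is what makes it a cheap "is the display still showing the expected image" check.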
09:42 karolherbst[d]: phomes_[d]: I just checked something and it seems MSAA has a massive impact on perf.. like I see 60 -> 20 fps drops and not sure if MSAA is supposed to be that heavy...
09:42 karolherbst[d]: so I'm wondering if most of the games are more competitive against nvidia with MSAA disabled
09:44 karolherbst[d]: but also MSAA x8 seems broken in xcom2
09:44 karolherbst[d]: and I think that's a regression
09:46 karolherbst[d]: or at least it would be nice to see if x4 and xcom2 are much better there
09:49 karolherbst[d]: ehh
09:50 karolherbst[d]: going from main to 26.0 -> fixes MSAA x8 _and_ perf goes from 20 -> 35 fps. 60 -> 85 fps without any AA...
09:50 karolherbst[d]: guess that's gonna be interesting
09:50 karolherbst[d]: going to bisect the MSAA x8 breakage first
10:03 karolherbst[d]: I'm sure I broke it 🥲
10:17 phomes_[d]: I can do some tests here with different MSAA soon
10:18 phomes_[d]: I did get a gfxrecapture of a single frame btw
10:18 karolherbst[d]: ahh nice
10:18 karolherbst[d]: but yeah.. let me git bisect the perf regression first
10:18 karolherbst[d]: maybe it's something actionable
10:19 karolherbst[d]: but it's weird...
10:19 karolherbst[d]: you don't have that big of a regression in your sheet
10:19 karolherbst[d]: but it might be turing specific or something
10:24 karolherbst[d]: nvm.. msaa x8 isn't broken, it's just that I still had this test patch to verify the MS index location in my tree for whatever reason 🙃
11:02 karolherbst[d]: ahhh yeah uhm.. the perf regression is compression and I'm still on 6.19 🙃
11:03 karolherbst[d]: phomes_[d]: on which kernel version are you testing? 7.0?
11:24 phomes_[d]: 7.0 yes
11:26 karolherbst[d]: mhhhhh
11:27 karolherbst[d]: given that compression has such a huge impact on xcom2, it's safe to assume that it's prolly memory bandwidth limited...
11:28 karolherbst[d]: and your GPU is an Ada one... I might have some ideas I want to try locally first
11:29 karolherbst[d]: phomes_[d]: ahh nice, anything useful in regards to the pushbuffer dump yet?
11:30 karolherbst[d]: huh.. xcom2 doesn't do compute? mhhhh
11:31 karolherbst[d]: phomes_[d]: for your queue created table, is the graphics queue the "main" one, or really only graphics?
11:32 karolherbst[d]: like if the game creates a single queue that is graphics and compute, would that show up as graphics 1, or graphics 1 and compute 1?
11:41 phomes_[d]: that would just be graphics 1
11:42 phomes_[d]: I can do a full push dump of the gfxr capture. And one from renderdoc with the output push dump when clicking the timer button
11:43 phomes_[d]: I'll upload them but they are huge 🙂
11:44 karolherbst[d]: okay.. soo for xcom2, no AA: 95 fps, MSAA x8: 38 fps (with compression support)
11:45 karolherbst[d]: let's see with the blob driver..
11:51 phomes_[d]: I noticed in renderdoc that the duration of subsequent draw calls within a pass kept increasing. It seems to happen for several passes. I checked with Urban Trial Playground too (where we get 92% perf) and see it there as well. Main difference seems to be that it does a lot fewer draw calls in its passes
11:52 phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1500102556750446653/image.png?ex=69f736e0&is=69f5e560&hm=e996a9e90ed15e6b823fb861732b923dded702dcfdac8722e3dda4c7c0e75a19&
11:52 phomes_[d]: One pass in XCOM2
11:52 phomes_[d]: duration bar is relative to the max duration within the pass
11:53 phomes_[d]: we suspected the timestamps in general. I did some printf debugging of GetQueryPoolResults
11:53 karolherbst[d]: yeah... it's likely that our timestamps are a bit messed up there
11:55 phomes_[d]: We get roughly 2x the number of draw calls + a few other things. From a glance of the code in renderdoc it looks like they collect a timestamp before and after each drawcall. I took all of them and plotted them
11:55 phomes_[d]: here is the timestamps for 2x the draw calls of pass 1 and pass 2. The values are relative to the lowest timestamp
11:56 phomes_[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1500103626734043269/image.png?ex=69f737df&is=69f5e65f&hm=52a49eb74640249218af5fa89e8907ceeb982b17d05212a4be2f747b037deb9e&
11:57 phomes_[d]: so it looks like the timestamps have a start and end that looks sane enough. The place where start and end become the same is where pass 2 starts
11:58 karolherbst[d]: mhhh right..
11:58 karolherbst[d]: but why 2x the number of draw calls? that's kinda weird
11:58 phomes_[d]: one before the draw and one after was my guess
11:59 karolherbst[d]: ahh
12:00 phomes_[d]: so the diff of start and end do indeed increase as we go ahead in the pass
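A sketch of the analysis described above, assuming the query results alternate between a timestamp taken before and one taken after each draw call (the guess made in the chat, not something confirmed against RenderDoc's source). The helper names are hypothetical and the tick values in the example are made up.

```python
from typing import List

def draw_durations(timestamps: List[int]) -> List[int]:
    """Pair up (before, after) timestamps and return per-draw durations.

    Assumes the list alternates before/after for each draw call.
    """
    if len(timestamps) % 2:
        raise ValueError("expected an even number of timestamps")
    base = min(timestamps)
    rel = [t - base for t in timestamps]  # relative to the lowest timestamp
    return [end - start for start, end in zip(rel[0::2], rel[1::2])]

def relative_to_max(durations: List[int]) -> List[float]:
    # Same normalization as the duration bars: 1.0 is the longest draw in the pass.
    peak = max(durations)
    return [d / peak for d in durations] if peak else [0.0] * len(durations)

def is_monotonically_growing(durations: List[int]) -> bool:
    # True if every draw takes at least as long as the one before it.
    return all(b >= a for a, b in zip(durations, durations[1:]))

ts = [1000, 1010, 1015, 1035, 1040, 1070]
d = draw_durations(ts)
print(d)                            # [10, 20, 30]
print(is_monotonically_growing(d))  # True
```

A draw where start and end coincide shows up as a duration of 0, which is the point in the plot where pass 2 begins.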
12:02 karolherbst[d]: why does stuff not work anymore the moment I move to the blob driver...
12:26 phomes_[d]: gfxr capture: https://drive.google.com/file/d/12ZSmy64r2MeCfx6ogNUnZ8zbIGQILY1_/view?usp=sharing
12:27 phomes_[d]: push dump of the renderdoc timing: https://drive.google.com/file/d/1w2M1aD1mXVJ376ZtQ9dJfliNDvzFyVf3/view?usp=sharing
12:28 phomes_[d]: gfxr capture was done with the perf baseline (commit 12f81eaa88d5422219e78b4ed34eee9fd3451eb2)
12:46 karolherbst[d]: nice.. rebooting my machine fixed that issue 🥲
12:47 karolherbst[d]: or not...
13:41 mohamexiety[d]: karolherbst[d]: what do you mean?
13:41 mohamexiety[d]: compression regressed msaa perf in xcom2?
13:41 karolherbst[d]: ahh no, no compression just bad perf
13:42 mohamexiety[d]: ah
13:42 mohamexiety[d]: comp _should_ help with msaa in particular
13:42 karolherbst[d]: yeah but perf is still bad
13:43 karolherbst[d]: nvidia refuses to cooperate now tho, so I can't really test on the blob and I have no idea why steam is messing up....
13:50 karolherbst[d]: anybody know what to do when, with the nvidia blob driver, wine spins at 100% on a CPU core?
13:55 karolherbst[d]: phomes_[d]: lol?
13:55 karolherbst[d]: yeah...
13:55 karolherbst[d]: I see the problem
13:55 karolherbst[d]: it's everything 🙃
13:56 karolherbst[d]: phomes_[d]: is that a single frame or....
13:59 karolherbst[d]: anyway.. that thing copies like crazy
13:59 karolherbst[d]: I'm wondering if dma-copy -> 2D would help a lot there
14:01 karolherbst[d]: phomes_[d]: have you ever checked if https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/38683 helps? Though it might not be hit in your trace.. would have to check what commands it's using
14:11 karolherbst[d]: I should play around with 2D more...
14:12 mrx: hi
14:31 phomes_[d]: Yes 38683 did not change perf
14:32 phomes_[d]: Note that renderdoc changes some things, so I am not sure if what you see is 100% what the game does
14:36 karolherbst[d]: sooooo
14:36 karolherbst[d]: it apparently creates tons of occlusion queries...
14:36 karolherbst[d]: you can normally disable that in graphic settings, no?
14:37 karolherbst[d]: phomes_[d]: do you know if disabling occlusion in the settings helps a lot with perf
14:37 karolherbst[d]: dunno how that game calls it, but.. that might be the thing slowing it down
14:37 karolherbst[d]: it creates 16384 occlusion queries... not sure if that's a good idea with nvk or not 🙃
14:40 karolherbst[d]: okay.. I think Mary got it 🙃
14:41 karolherbst[d]: so apparently `nvk_CmdResetQueryPool` with that many queries ain't good (probably)
14:43 karolherbst[d]: okay.. will try to hack something up later
15:16 phomes_[d]: Not at the computer right now but will check soon
15:17 phomes_[d]: Maybe a push dump from a replay of the gfxreconstruct capture is better? I am not sure if renderdoc does other things by itself when timing
15:18 karolherbst[d]: ahh don't worry
15:18 karolherbst[d]: the reset query thing is indeed a problem
15:18 karolherbst[d]: without a proper fix we just don't know how much
15:18 karolherbst[d]: but I'll write something up later
15:18 karolherbst[d]: like it causes tons of 4 byte writes
15:19 karolherbst[d]: because the way we reset those is really terrible
15:20 marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1500155099232731216/image.png?ex=69f767cf&is=69f6164f&hm=34481d216bfb767c8a0fc98a9110dfb46cc034d835590e874b69f4eb6f626c63&
15:20 marysaka[d]: so yeah, talking with Karol, we *need* to fix the reset path because dxvk allocates a large fixed amount of queries too
15:21 karolherbst[d]: should give us a couple of FPS here and there hopefully
15:22 marysaka[d]: it does track queries properly for reset so it might be *okay*
15:22 karolherbst[d]: oh so dxvk is better in the reset path?
15:23 karolherbst[d]: but still.. a single copy should be a lot better than whatever we are doing...
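To illustrate why per-query resets get expensive, here is a hypothetical model in which resetting a query means zeroing its slot in the pool buffer. The 16-byte `QUERY_STRIDE` and both helpers are assumptions for illustration, not NVK's actual layout or the linked commit. The naive path issues one 4-byte write per dword, while merging adjacent query indices turns a contiguous reset into a single fill.

```python
from typing import List, Tuple

QUERY_STRIDE = 16  # hypothetical bytes per query slot, not NVK's real layout

def naive_reset_writes(count: int, stride: int = QUERY_STRIDE) -> int:
    """Number of 4-byte writes if every dword of every slot is zeroed separately."""
    return count * (stride // 4)

def coalesced_fills(queries: List[int], stride: int = QUERY_STRIDE) -> List[Tuple[int, int]]:
    """Merge query indices into (byte_offset, byte_size) fill ranges."""
    ranges: List[Tuple[int, int]] = []
    for q in sorted(set(queries)):
        offset = q * stride
        if ranges and ranges[-1][0] + ranges[-1][1] == offset:
            start, size = ranges[-1]
            ranges[-1] = (start, size + stride)  # extend the current range
        else:
            ranges.append((offset, stride))      # start a new range
    return ranges

print(naive_reset_writes(16384))            # 65536
print(coalesced_fills(list(range(16384))))  # [(0, 262144)]
print(coalesced_fills([0, 1, 2, 10, 11]))   # [(0, 48), (160, 32)]
```

With the 16384 occlusion queries mentioned above, the naive model needs tens of thousands of small writes, while the coalesced one needs a single fill.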
15:51 mhenning[d]: karolherbst[d]: using 2d for copy commands is something I'm hoping to get to soon
15:56 karolherbst[d]: anyway.. let's optimize this query pool reset thing first 😄
15:58 karolherbst[d]: ehh I can't replay this trace...
15:58 karolherbst[d]: gotta do it blind then
16:03 karolherbst[d]: phomes_[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/e5abde93f14551ce4c56f1a02bf359f6e2573dd0 that _should_ help hopefully
16:04 karolherbst[d]: might cause rendering issues, but I'm more curious about the perf difference
16:17 phomes_[d]: Can’t replay the trace I did in gfxr? It was fine here in a devenv of main
16:17 karolherbst[d]: maybe due to different GPU/setup
16:17 karolherbst[d]: anyway
16:17 karolherbst[d]: it doesn't help in xcom2 at least 🙃
16:21 karolherbst[d]: yeah.. so MSAA x8 seems to have a massive perf impact
16:21 karolherbst[d]: x4 is at 60 fps, x8 at 38
16:21 karolherbst[d]: x2 at 70
16:22 karolherbst[d]: normal at 90 and FXAA at 84
16:22 karolherbst[d]: can't check if it's the same on the blob *sigh*
16:22 karolherbst[d]: anyway, xcom2 doesn't hit this query pool reset path apparently
16:24 karolherbst[d]: phomes_[d]: I'm getting a bus error when copying to GPU memory, so who knows what's up there
16:34 karolherbst[d]: mhhh `Replay adjusted the vkGetPhysicalDeviceSurfaceFormatsKHR array count: capture count = 2, replay count = 0`
17:18 karolherbst[d]: yeah.. I'm mostly just interested if it's heavier on nvk than nvidia
17:24 karolherbst[d]: might be something good to look into next
17:26 karolherbst[d]: do all talos principle games have a benchmark mode?
17:26 karolherbst[d]: heh...
17:28 karolherbst[d]: okay.. my query pool patch crashes sottr 😄
17:34 karolherbst[d]: ohh it's asserting actually
17:41 karolherbst[d]: oh uhhh.. maybe it's not my fault actually :blobcatnotlikethis:
18:02 karolherbst[d]: yeah I think for me sottr just randomly has a hiccup
19:55 phomes_[d]: karolherbst[d]: looks like something more should have been in that commit: `error: implicit declaration of function ‘nvk_cmd_fill_memory’`
19:57 phomes_[d]: my bad. I was testing on top of the baseline I use. This just needs something newer
20:12 karolherbst[d]: yeah.. it depends on new stuff
20:13 karolherbst[d]: phomes_[d]: maybe test X4 first, because it's something we found there explicitly
20:20 karolherbst[d]: mhhhhhhh I'm disappointed 🙃
20:20 karolherbst[d]: mind doing another push buf dump with the patch?
20:21 karolherbst[d]: maybe there is something else...
20:35 phomes_[d]: new push dump: https://drive.google.com/file/d/1fKwbGFz9FVVwTHR-KUIMjNMuARsWS0Wl/view?usp=sharing
20:35 phomes_[d]: much smaller size. I did not look at the diff so maybe something is wrong
20:35 karolherbst[d]: I'll check
20:35 karolherbst[d]: mhhh
20:36 karolherbst[d]: nah it did what I wanted it to do
20:36 karolherbst[d]: but ruling out that something is a perf issue still helps!
20:37 phomes_[d]: it did help serious sam a bit
20:37 karolherbst[d]: ohh nice
20:37 karolherbst[d]: maybe CPU overhead
20:38 karolherbst[d]: phomes_[d]: I hope it's not due to the new base 🙃
20:38 phomes_[d]: push dump from the gfxr capture: https://drive.google.com/file/d/1UrXjeGKCphm5s1rIf5GMSMdpe2Fr-M8Z/view?usp=sharing
20:39 karolherbst[d]: phomes_[d]: the entire thing? how does it compare to the one above?
20:39 phomes_[d]: I am testing with the baseline + your patch + the one commit that adds nvk_cmd_fill_memory
20:39 karolherbst[d]: ahhh I see
20:41 phomes_[d]: I did not compare the push dump from renderdoc and gfxr. I suppose the latter will add less extra stuff like the timestamps from renderdoc
21:11 karolherbst[d]: phomes_[d]: mind trying to see if you can run the gfxreconstruct on the blob and use envyhooks to dump what they are submitting to the GPU? ever used envyhooks?
21:12 phomes_[d]: never did envyhooks but I can give it a try
21:12 karolherbst[d]: but the dump is very interesting:
21:12 karolherbst[d]: 17744 mthd 0400 NVC1B5_OFFSET_IN_UPPER
21:12 karolherbst[d]: 17744 mthd 0404 NV85B5_OFFSET_IN_LOWER
21:12 karolherbst[d]: 17747 mthd 0408 NVC1B5_OFFSET_OUT_UPPER
21:12 karolherbst[d]: 17747 mthd 040c NV85B5_OFFSET_OUT_LOWER
21:12 karolherbst[d]: 18112 mthd 0300 NVC7B5_LAUNCH_DMA
21:12 karolherbst[d]: 18112 mthd 0418 NV85B5_LINE_LENGTH_IN
21:12 karolherbst[d]: 18112 mthd 041c NV85B5_LINE_COUNT
21:12 karolherbst[d]: 20823 mthd 0508 NVC597_LOAD_ROOT_TABLE
21:12 karolherbst[d]: 28274 mthd 1144 NV9097_FLUSH_PENDING_WRITES
21:12 karolherbst[d]: 84836 mthd 1b00 NV9097_SET_REPORT_SEMAPHORE_A
21:12 karolherbst[d]: 84836 mthd 1b04 NV9097_SET_REPORT_SEMAPHORE_B
21:12 karolherbst[d]: 84836 mthd 1b08 NV9097_SET_REPORT_SEMAPHORE_C
21:12 karolherbst[d]: 84836 mthd 1b0c NVC797_SET_REPORT_SEMAPHORE_D
21:12 karolherbst[d]: most called methods
21:12 karolherbst[d]: the semaphore stuff stands out soo much 🙃
21:12 karolherbst[d]: and the dma-copy methods are also prolly not helping given they'd cause WFI constantly
21:13 phomes_[d]: yes
21:16 phomes_[d]: the gfxr is nice. Now I can automate creating a diff of the push dump for new commits
21:16 karolherbst[d]: yeah.. I'm sure my commit did _something_ but apparently not enough
21:17 phomes_[d]: gfxr push dump dropped from 121mb to 91mb
21:18 karolherbst[d]: impressive
21:20 karolherbst[d]: phomes_[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/6ef354020f06026c9763908c8e1344630db2152e does this change anything in X4?
21:24 mhenning[d]: could try zeroing with i2m or the mme to avoid the subc switch away from graphics
21:24 karolherbst[d]: yeah...
21:24 karolherbst[d]: but I don't think it matters
21:24 karolherbst[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1500246567246889011/message.txt?ex=69f7bcfe&is=69f66b7e&hm=5e58ec396886b60ae39f2932d6baea91c76602436e9d24b3b353bb31467567e4&
21:24 karolherbst[d]: it's pretty well grouped
21:24 karolherbst[d]: like yeah.. there are some switches, but... not _that_ many
21:24 mhenning[d]: oh, cool, I haven't thought of using uniq like that
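Presumably the attached output comes from running `uniq -c` without a preceding `sort`. That counts consecutive repeats rather than totals, so each output line is one uninterrupted run and the number of lines hints at how often the stream switches. A Python equivalent using `itertools.groupby`, with a made-up sequence of engine labels (the helper name `runs` is hypothetical):

```python
from itertools import groupby
from typing import List, Tuple

def runs(items: List[str]) -> List[Tuple[int, str]]:
    """Like `uniq -c` without `sort`: count consecutive repeats only."""
    return [(len(list(group)), key) for key, group in groupby(items)]

# Made-up sequence: which engine each pushbuffer method targets.
seq = ["3D", "3D", "3D", "COPY", "COPY", "3D", "3D"]
grouped = runs(seq)
print(grouped)           # [(3, '3D'), (2, 'COPY'), (2, '3D')]
print(len(grouped) - 1)  # 2 switches between engines
```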
21:25 karolherbst[d]: like we can do better of course
21:25 karolherbst[d]: and it might matter
21:25 karolherbst[d]: but those set semaphores commands are a bit more concerning? dunno
21:26 mhenning[d]: yeah, is that 84836 semaphores per frame?
21:26 mhenning[d]: that's a lot of semaphores
21:27 karolherbst[d]: yeah...
21:27 karolherbst[d]: it's all occlusion queries
21:27 karolherbst[d]: $ grep '\.REPORT =' ~/Downloads/gfxr-push-dump.txt | sort | uniq -c | sort -n
21:27 karolherbst[d]: 28290 .REPORT = NONE
21:27 karolherbst[d]: 56546 .REPORT = ZPASS_PIXEL_CNT64
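The same histogram can be built in Python with `collections.Counter`. The sample lines below only imitate the dump format, and the regex is an assumption about that format. Note that the two counts above add up to 28290 + 56546 = 84836, the same number as each `SET_REPORT_SEMAPHORE_*` method in the earlier list.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed line shape: "... .REPORT = <VALUE>" somewhere in the line.
REPORT_RE = re.compile(r"\.REPORT = (\S+)")

def report_histogram(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Rough equivalent of: grep '\\.REPORT =' dump | sort | uniq -c | sort -n"""
    counts: Counter = Counter()
    for line in lines:
        m = REPORT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    # `sort -n` orders by count, smallest first.
    return sorted((n, value) for value, n in counts.items())

sample = [
    "      .REPORT = ZPASS_PIXEL_CNT64",
    "      .REPORT = NONE",
    "      .REPORT = ZPASS_PIXEL_CNT64",
    "mthd 1b00 NV9097_SET_REPORT_SEMAPHORE_A",
]
print(report_histogram(sample))  # [(1, 'NONE'), (2, 'ZPASS_PIXEL_CNT64')]
```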
21:28 karolherbst[d]: the query pool has 16k occlusion queries in it...
21:28 karolherbst[d]: there are even more set semaphores without my patch that uses a copy to init them 🙃
21:29 marysaka[d]: those report = none are likely events I guess
21:35 mhenning[d]: marysaka[d]: some of them are probably us writing the query pool availability info
21:35 marysaka[d]: oh right it... match half of it
21:35 marysaka[d]: as we need one for each end query after all..
21:35 mhenning[d]: also, looks like pascal introduces a REPORT_TILED_ZPASS_PIXEL_CNT64. I wonder if that's useful
21:44 phomes_[d]: karolherbst[d]: not enough to change the fps
21:44 phomes_[d]: XCOM 2 push dump drops from 91mb to 88mb though
21:45 karolherbst[d]: new dump pls tho 🙃
21:45 karolherbst[d]: yeah.. mostly just trying to rule out things still
21:45 phomes_[d]: gfxr dump okay?
21:45 karolherbst[d]: ohh wait..
21:45 karolherbst[d]: that was the `FLUSH_PENDING_WRITES` one...
21:45 karolherbst[d]: mhhh
21:45 karolherbst[d]: I only wanted to know if that's relevant for perf
21:46 phomes_[d]: no difference in X4 and no difference in XCOM 2
21:46 karolherbst[d]: mhhh
21:46 karolherbst[d]: maybe it's WFI after all ..
21:47 phomes_[d]: want me to do a full test of all the games or is that not relevant?
21:47 karolherbst[d]: not relevant
21:47 karolherbst[d]: we are overusing dma-copy anyway, and that's hopefully going to help, so maybe it's best to wait until that's done
21:49 karolherbst[d]: I don't really know if we can optimize this set report thing.. mhh
21:52 karolherbst[d]: but anyway.. good to know that optimizing the pool query reset does help somewhat in serious sam...
21:55 karolherbst[d]: phomes_[d]: mind adding https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41057 to the testing and create a new push buffer dump?
21:55 karolherbst[d]: just for x4
21:56 phomes_[d]: it is already in the sheet but I should retest with the latest push to the MR I suppose
21:57 phomes_[d]: also note that the push dumps so far have been XCOM2, not X4. I can do a capture of X4 and do that too if you want
22:00 karolherbst[d]: ohhh it was xcom2
22:00 karolherbst[d]: that's fine
22:00 karolherbst[d]: but you could have told me, because I do have xcom2 myself and can just test 🙃
22:00 karolherbst[d]: or maybe you did
22:00 karolherbst[d]: and I missed it 😄
22:36 karolherbst[d]: guess I should learn how to create a gfxreconstruct capture.. but yeah.. not sure I found anything yet..
22:40 phomes_[d]: got the XCOM2 push dump here. Sorry it took a while. The MR is ahead of the baseline I am using and the patches do not apply directly on top of only the baseline. So the dump is with some extra commits
22:40 phomes_[d]: https://drive.google.com/file/d/1F9limci7ePDLA44nxkZDTfSgrbhH60TY/view?usp=sharing
22:42 phomes_[d]: for getting gfxr dumps I just built the main branch gfxreconstruct and then put this in steam options of the game `VK_LAYER_PATH=/home/phomes/gfxreconstruct/build/layer:$VK_LAYER_PATH VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct GFXRECON_CAPTURE_TRIGGER=F9 GFXRECON_CAPTURE_TRIGGER_FRAMES=1 GFXRECON_CAPTURE_FILE=/home/phomes/xcom2/test.gfxr %command%`
22:42 phomes_[d]: and then play the game and press F9 when I want the capture
22:42 karolherbst[d]: ahh
22:42 karolherbst[d]: I'd just capture the main menu 😄
22:43 karolherbst[d]: phomes_[d]: ohh have you compared non MSAA vs blob?
22:43 phomes_[d]: to get it to build I had to do this `cmake .. -DCMAKE_BUILD_TYPE=Debug -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DCMAKE_CXX_FLAGS="-Wno-error=deprecated-declarations"`
22:43 karolherbst[d]: I'm kinda wondering how much this is mostly terrible MSAA perf
22:44 phomes_[d]: will do non MSAA now
22:44 karolherbst[d]: sadly blob hates me and it didn't want to render here on the blob driver 😢
22:45 karolherbst[d]: it's enough to just get FPS numbers from the main menu or something, really only interested if the impact of MSAA is kinda the same or worse with nvk
22:45 karolherbst[d]: anyway.. I think I managed to get like +2-3% FPS out of it by ditching some WFI, but....
22:46 karolherbst[d]: I'll have to check traces tomorrow to check how many there are still left
22:46 karolherbst[d]: we have a bit of compute <-> 3D stuff going on... so that's not helping
22:46 karolherbst[d]: and I'm testing on Turing 🥲 should really do that on ampere
22:53 phomes_[d]: MSAA on NVK:
22:53 phomes_[d]: Off: 44
22:53 phomes_[d]: FXAA: 42
22:53 phomes_[d]: 2X: 32
22:53 phomes_[d]: 4X: 25
22:53 phomes_[d]: 8X: 14
23:01 karolherbst[d]: ohh you were testing with FXAA?
23:01 karolherbst[d]: RIP 🥲
23:02 karolherbst[d]: wait.. you have a 4070?
23:02 karolherbst[d]: what resolution? 4k?
23:04 karolherbst[d]: I seriously hope it's 4K 🙃
23:09 phomes_[d]: PROP:
23:09 phomes_[d]: Off: 91
23:09 phomes_[d]: FXAA: 84
23:09 phomes_[d]: 2X: 54
23:09 phomes_[d]: 4X: 37
23:09 phomes_[d]: 8X: 23
23:09 phomes_[d]: it is on 4070
23:09 phomes_[d]: resolution was set... weird. 3200x1800
23:11 phomes_[d]: I think I just set it to max preset originally
23:12 phomes_[d]: I'll retest with 3840x2160
23:16 karolherbst[d]: phomes_[d]: offf
23:16 karolherbst[d]: so maybe MSAA perf isn't so bad actually...
23:16 karolherbst[d]: that's a steep cliff at 2x...
23:17 karolherbst[d]: phomes_[d]: nah.. I was just surprised, because I hit up to 98 fps, but that was on FHD
23:17 karolherbst[d]: so I hope it's on some insane resolution 😄
23:17 karolherbst[d]: my GPU isn't that beefy in comparison
23:18 karolherbst[d]: but anyway.. so we are close to the blob at high MSAA levels?
23:19 karolherbst[d]: 48.3%, 50%, 59.3%, 67.6%, 61%
23:19 karolherbst[d]: I have to think about what that means...
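A small helper that reproduces those ratios from phomes_'s numbers (XCOM 2, 4070, 3200x1800) and also looks at the same data as frame time. Frame time adds up linearly and can be easier to reason about than FPS. The ratios match the percentages quoted above to within rounding (44/91 is about 48.35%). The helper names are hypothetical.

```python
from typing import Dict

# FPS numbers from above: XCOM 2, RTX 4070, 3200x1800.
NVK: Dict[str, float] = {"Off": 44, "FXAA": 42, "2X": 32, "4X": 25, "8X": 14}
PROP: Dict[str, float] = {"Off": 91, "FXAA": 84, "2X": 54, "4X": 37, "8X": 23}

def frame_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps

def ratios(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Per-mode FPS of driver a relative to driver b."""
    return {mode: a[mode] / b[mode] for mode in a}

def aa_cost_ms(fps: Dict[str, float]) -> Dict[str, float]:
    """Extra frame time each mode adds on top of AA "Off"."""
    base = frame_ms(fps["Off"])
    return {mode: frame_ms(v) - base for mode, v in fps.items()}

nvk_vs_prop = ratios(NVK, PROP)
nvk_cost, prop_cost = aa_cost_ms(NVK), aa_cost_ms(PROP)
for mode in NVK:
    print(f"{mode:>4}: NVK/PROP {nvk_vs_prop[mode]:6.1%}  "
          f"AA cost: NVK {nvk_cost[mode]:5.1f} ms, PROP {prop_cost[mode]:5.1f} ms")
```

Comparing the per-mode AA cost in milliseconds separates "NVK is slower overall" from "MSAA specifically costs more on NVK".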