15:37 gfxstrand[d]: Wow. This hardware really doesn't like command buffers being in VRAM. Weird, but okay, I guess.
15:38 gfxstrand[d]: airlied[d]: skeggsb9778 is nouveau.ko doing any sort of read-back of push buffers?
15:43 gfxstrand[d]: The Witness drops from like 160 FPS to 4 the moment I put pushbufs in VRAM
15:46 karolherbst[d]: are you CPU bound?
15:46 karolherbst[d]: I have a hunch, but it's a terrible one
15:46 gfxstrand[d]: nope. 10% CPU according to mangohud
15:46 gfxstrand[d]: oh?
15:47 karolherbst[d]: the push buffer builder does read-back to increment the size, though that _should_ get optimized away
15:47 clangcat[d]: I do have a question about Nouveau relating to resource usage.
15:47 gfxstrand[d]: True...
15:47 clangcat[d]: how much RAM should Nouveau actually be using when the kernel module is running
15:47 gfxstrand[d]: I've been meaning to fix that
15:47 clangcat[d]: Cause for me it's 300MB
15:47 gfxstrand[d]: That might be it, actually
15:47 clangcat[d]: as soon as I call
15:47 clangcat[d]: `modprobe nouveau`
15:48 karolherbst[d]: gfxstrand[d]: yeah, I fear the same
15:48 gfxstrand[d]: Okay, let's try to fix that without breaking the universe
15:48 karolherbst[d]: good luck
15:48 gfxstrand[d]: hehe
15:48 gfxstrand[d]: It should be possible
15:49 karolherbst[d]: yeah.. it should in theory. Maybe it's just some bad places in the code where it actually reads back
15:49 karolherbst[d]: but yeah...
15:52 karolherbst[d]: could also just construct on the CPU and then upload it into VRAM in one big copy
15:52 karolherbst[d]: probably more efficient that way anyway
15:53 karolherbst[d]: clangcat[d]: what GPU?
15:53 karolherbst[d]: maybe it's GSP related
15:53 clangcat[d]: karolherbst[d]: 3050M
15:53 clangcat[d]: karolherbst[d]: mmmm possible.
15:53 karolherbst[d]: we keep the firmware in RAM
15:54 karolherbst[d]: but yeah.. maybe it's also other things...
15:55 clangcat[d]: karolherbst[d]: Well firmware accounts for 40MB on my laptop. I imagine some is just generic data structures you use for managing the card but yea not exactly clear where the rest of the RAM goes.
15:56 clangcat[d]: and sadly I don't really know how to accurately track kernel memory
15:56 clangcat[d]: Other than blacklisting modules and starting them and measuring the usage difference
15:57 karolherbst[d]: could check for memory leaks, but I think that requires building your own kernel
15:57 karolherbst[d]: maybe debug kernel builds have that enabled
15:57 karolherbst[d]: https://docs.kernel.org/dev-tools/kmemleak.html
16:01 gfxstrand[d]: PSA: I updated my kernel branch. It's now based on 6.9.7
16:03 clangcat[d]: karolherbst[d]: Yea don't have kmemleak. But part of it, which I also looked into, is that there is a framebuffer for the `/dev/fb1` of the device
16:03 clangcat[d]: so part of the usage
16:03 clangcat[d]: will also be that framebuffer
16:04 karolherbst[d]: the framebuffer shouldn't be in system RAM
16:04 clangcat[d]: True but the actual driver may have allocated some data structures relating to it.
16:05 karolherbst[d]: right, but those are usually small
16:05 clangcat[d]: Shame debug fs doesn't let me see the firmware that I know of :/
16:05 karolherbst[d]: but how do you get the "300MB" number anyway?
16:06 gfxstrand[d]: karolherbst[d]: Yup! That was it. Getting rid of that gets me back to a little under 160 but my CPU usage is 2x what it was with system RAM for pushbufs
16:06 karolherbst[d]: gfxstrand[d]: sooo... how did you fix it exactly?
16:07 karolherbst[d]: but sounds like you have higher CPU usage now with pushbufs in VRAM?
16:07 gfxstrand[d]: Never mind, CPU is about the same
16:07 clangcat[d]: karolherbst[d]: Like I said I start the nouveau module after boot and measure the difference in usage. Also did it with AMDGPU where with AMD I only saw about a 10MB increase.
16:07 karolherbst[d]: clangcat[d]: ahh
16:08 karolherbst[d]: tried disabling GSP? your GPU should still run without it
16:08 clangcat[d]: (though that makes sense as the AMDGPU replaced the EFI FB and probably freed the EFI fb mem/used the same memory)
16:08 karolherbst[d]: ehh wait
16:08 karolherbst[d]: yours is ampere
16:08 karolherbst[d]: so it's GSP only :blobcatnotlikethis:
16:08 gfxstrand[d]: karolherbst[d]: Actually, vram seems about the same as system ram
16:08 karolherbst[d]: or is it...
16:08 gfxstrand[d]: for both
16:08 clangcat[d]: karolherbst[d]: Does that require rebuilding the kernel?
16:08 karolherbst[d]: clangcat[d]: are you even using GSP? mhh
16:08 clangcat[d]: I am
16:08 gfxstrand[d]: The reality is that we have to push the entire pushbuf across PCI one way or another
16:09 karolherbst[d]: you enable GSP with `nouveau.config=NvGspRm=1`
16:09 karolherbst[d]: so setting it to 0 should disable it
16:09 karolherbst[d]: I think
16:09 clangcat[d]: Okay
16:09 karolherbst[d]: maybe people change those parts
16:09 gfxstrand[d]: So I doubt it'll make much difference living in VRAM vs. system RAM anyway.
16:09 clangcat[d]: karolherbst[d]: Worth a shot.
16:09 karolherbst[d]: gfxstrand[d]: yeah...
16:09 karolherbst[d]: I can only imagine that doing it once in a full copy would help
16:10 clangcat[d]: Either way it's just tricky for people like me with Low RAM systems.
16:10 karolherbst[d]: because atm it's "random" access with either path
16:10 karolherbst[d]: *over PCIe
16:10 gfxstrand[d]: Yeah, but I suspect that, unless there is significant re-use (there isn't in modern engines), the command streamer prefetch is good enough for getting good PCI access patterns.
16:10 karolherbst[d]: ohh right.. there is also a prefetcher..
16:11 karolherbst[d]: yeah well...
16:11 gfxstrand[d]: I'm going to land my patch to not read back. That seems like an improvement either way.
16:11 karolherbst[d]: yeah, would be curious to see how you solve that part
16:11 gfxstrand[d]: I just cache the DW in the nv_push
16:11 karolherbst[d]: though maybe somebody should look at the assembly being compiled again
16:11 karolherbst[d]: ahh
16:11 gfxstrand[d]: It's really dumb
16:11 karolherbst[d]: fair enough
16:11 gfxstrand[d]: What I'd like to do long-term, I think, is build on the stack and then push at the end.
16:11 karolherbst[d]: yeah...
16:12 karolherbst[d]: but...
16:12 karolherbst[d]: is it even worth it if the compiler is smart enough
16:12 karolherbst[d]: 😛
16:12 gfxstrand[d]: Then the compiler has enough visibility to actually optimize everything away
16:12 karolherbst[d]: when I was looking into it, it was smart enough in most cases
16:12 karolherbst[d]: so it only pushed the new value without reading back
16:12 karolherbst[d]: maybe something changed enough so the assembly was bad
16:12 gfxstrand[d]: Yeah, but if we have the header DW cached, I think that'll make a difference
16:13 karolherbst[d]: yeah
16:13 gfxstrand[d]: That helps the compiler A LOT
16:13 karolherbst[d]: yep
16:13 gfxstrand[d]: Side-note: It's kind-of hilarious how being a compiler engineer changes how you think about optimizing CPU code. 😅
16:13 karolherbst[d]: 😄
16:13 karolherbst[d]: mood
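The read-back issue discussed above can be illustrated with a toy model. This is only a sketch and not the actual `nv_push` code: the class names, the `COUNT_SHIFT` header layout, and the read-counting memory are all hypothetical, chosen to show why keeping the header DW cached on the CPU side avoids reading from the mapped push buffer.

```python
COUNT_SHIFT = 16  # hypothetical bit position of the dword count in a header


class CountingMemory:
    """Toy stand-in for a mapped push buffer that counts CPU reads."""

    def __init__(self, size_dw):
        self.dwords = [0] * size_dw
        self.reads = 0

    def read(self, idx):
        self.reads += 1  # on real hardware this is the expensive part
        return self.dwords[idx]

    def write(self, idx, value):
        self.dwords[idx] = value


class ReadBackPush:
    """Builder that re-reads the header from the buffer to bump its count."""

    def __init__(self, mem):
        self.mem = mem
        self.end = 0      # index of the next free dword
        self.hdr_idx = 0  # index of the current header dword

    def begin(self, method):
        self.hdr_idx = self.end
        self.mem.write(self.hdr_idx, method)  # count field starts at 0
        self.end += 1

    def bump_count(self):
        hdr = self.mem.read(self.hdr_idx)  # read-back from the buffer
        self.mem.write(self.hdr_idx, hdr + (1 << COUNT_SHIFT))

    def push(self, value):
        self.bump_count()
        self.mem.write(self.end, value)
        self.end += 1


class CachedPush(ReadBackPush):
    """Same builder, but the header dword is cached in the builder itself."""

    def begin(self, method):
        super().begin(method)
        self.hdr = method  # CPU-side copy of the header

    def bump_count(self):
        self.hdr += 1 << COUNT_SHIFT
        self.mem.write(self.hdr_idx, self.hdr)  # write-only access


def build(push_cls):
    """Emit one header plus three data dwords; return (dwords, read count)."""
    mem = CountingMemory(8)
    push = push_cls(mem)
    push.begin(0x0100)
    for value in (11, 22, 33):
        push.push(value)
    return mem.dwords, mem.reads
```

Both builders produce the same dwords; the difference is that `build(ReadBackPush)` touches the buffer with one read per pushed value, while `build(CachedPush)` only ever writes to it.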
16:15 clangcat[d]: karolherbst[d]: Setting this to 0 seems to fix it.
16:15 karolherbst[d]: I wonder how that stuff is handled in nvidia's driver
16:15 clangcat[d]: Well
16:15 karolherbst[d]: clangcat[d]: okay.. so it's GSP stuff.. mhhh
16:15 clangcat[d]: idk "fix"
16:15 karolherbst[d]: yeah, now your perf is terrible
16:15 clangcat[d]: I now don't have GSP
16:15 clangcat[d]: But
16:15 clangcat[d]: Memory
16:16 karolherbst[d]: how much tho?
16:16 clangcat[d]: karolherbst[d]: almost all of it
16:16 clangcat[d]: there is still about 20MB increase from starting Nouveau
16:16 clangcat[d]: at run time.
16:16 karolherbst[d]: but anyway, thanks for testing. "With GSP ~250MB higher usage" is more actionable than "nouveau uses 300MB" 😄
16:17 clangcat[d]: But some of that is what I would consider random spikes anyway
16:17 clangcat[d]: karolherbst[d]: Well this won't affect me much longer, I have more RAM coming.
16:17 karolherbst[d]: could also be a bad memory leak
16:17 clangcat[d]: But yea it's a lot of RAM.
16:17 clangcat[d]: For low memory systems
16:17 karolherbst[d]: maybe skeggsb9778 has any ideas why using GSP uses that much more RAM and if that's fixable
16:18 karolherbst[d]: yeah.. meanwhile me with 60GB RAM and 68GB zram swap
16:18 karolherbst[d]: lol
16:18 clangcat[d]: karolherbst[d]: Could be that there is more firmware than what I know loading.
16:18 karolherbst[d]: maybe GSP is also operating on RAM and needs it for internal allocations or other silly reasons
16:19 karolherbst[d]: anyway...
16:19 karolherbst[d]: might be good to know what it is
16:19 clangcat[d]: as I am only aware of the files in `/lib/firmware/nvidia/ga107` being loaded
16:20 karolherbst[d]: should be all with GSP
16:20 clangcat[d]: karolherbst[d]: Yea I mean I wouldn't have found it other than just running steam, TBOI and one FF tab made me OOM.
16:20 karolherbst[d]: clangcat[d]: :blobcatnotlikethis:
16:20 clangcat[d]: So I killed like every user space app
16:20 clangcat[d]: save for TTY1
16:20 clangcat[d]: and then went through Kernel modules.
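The before/after measurement described here can be scripted roughly as follows. This is a minimal sketch, assuming `MemAvailable` from `/proc/meminfo` is a good enough proxy (it is only a kernel estimate and fluctuates on its own); the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> number.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    plain counts and are returned as-is.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file."""
    with open(path) as f:
        return parse_meminfo(f.read())


def usage_delta_mib(before, after, field="MemAvailable"):
    """How many MiB of `field` disappeared between two snapshots."""
    return (before[field] - after[field]) / 1024
```

A possible workflow is to take `before = read_meminfo()` on an idle system, load the module, wait a moment, take `after = read_meminfo()`, and compare with `usage_delta_mib(before, after)`. Repeating the measurement a few times helps separate the module's cost from random spikes.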
16:21 clangcat[d]: karolherbst[d]: Welp at least I'll have more RAM soonish
16:21 karolherbst[d]: nice
16:21 karolherbst[d]: how much swap/zram are you using atm?
16:22 karolherbst[d]: maybe you want to increase zram size to 16GB or so
16:23 clangcat[d]: karolherbst[d]: 4GB of zram, and 8GB of swap, which I prefer to avoid using.
16:24 clangcat[d]: karolherbst[d]: Mhmmmm maybe
16:26 karolherbst[d]: mhhh
16:26 karolherbst[d]: if you use storage swap then it's a bit painful
16:26 clangcat[d]: It's storage swap yea I kinda need it at times to avoid just hard lock ups
16:26 karolherbst[d]: I don't know if you can configure `zswap` to only compress for storage swap
16:27 clangcat[d]: when I go a little above my RAM.
16:27 karolherbst[d]: but
16:27 karolherbst[d]: if you use only storage swap, I'd consider enabling zswap
16:28 clangcat[d]: karolherbst[d]: I mean even if I can, wouldn't that involve compressing the swap partition, which would make the storage swap space even slower to access than it already is?
16:28 karolherbst[d]: nah, `zswap` is a compressed cache in RAM
16:28 clangcat[d]: Ahhh mmmm maybe
16:28 karolherbst[d]: and if you run out of RAM, it starts to move things into storage swap
16:29 karolherbst[d]: so you should hit storage swap later
16:29 karolherbst[d]: and be able to use a bit more RAM overall
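The behaviour described above (a compressed cache in RAM that spills to storage swap when it fills up) can be modelled with a small toy. This is a sketch for illustration only, not how the kernel implements `zswap`: the `ToyZswap` class, its byte-based pool limit, and the use of `zlib` are all assumptions.

```python
import zlib
from collections import OrderedDict


class ToyZswap:
    """Toy model: compressed in-RAM pool in front of a slower backing swap."""

    def __init__(self, pool_limit_bytes):
        self.pool_limit = pool_limit_bytes
        self.pool = OrderedDict()  # page id -> compressed bytes, oldest first
        self.pool_bytes = 0
        self.disk = {}             # page id -> uncompressed bytes ("storage swap")

    def swap_out(self, page_id, data):
        """Compress a page into the pool, spilling oldest pages if it is full."""
        blob = zlib.compress(data)
        self.pool[page_id] = blob
        self.pool_bytes += len(blob)
        while self.pool_bytes > self.pool_limit and self.pool:
            old_id, old_blob = self.pool.popitem(last=False)
            self.pool_bytes -= len(old_blob)
            self.disk[old_id] = zlib.decompress(old_blob)

    def swap_in(self, page_id):
        """Fetch a page back, from the pool if possible, else from 'disk'."""
        if page_id in self.pool:
            blob = self.pool.pop(page_id)
            self.pool_bytes -= len(blob)
            return zlib.decompress(blob)
        return self.disk.pop(page_id)
```

With compressible pages, many of them fit in a pool much smaller than their uncompressed size, so the slow path through `disk` is reached later. Nothing is gained for incompressible data, and the pool itself still consumes RAM.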
16:29 clangcat[d]: But there is only so much I can put compressed things in 8GB of RAM.
16:29 karolherbst[d]: yeah...
16:29 clangcat[d]: sadly 8GB is just dying in terms of usefulness.
16:29 karolherbst[d]: yeah...
16:30 karolherbst[d]: compressed RAM helps a lot, but uhhh...
16:31 karolherbst[d]: I'm currently at 21.6GB use :blobcatnotlikethis:
16:31 clangcat[d]: karolherbst[d]: Yea it helps but you can't exactly have all your RAM compressed
16:32 clangcat[d]: as at least some needs to be uncompressed to actually process with.
16:32 karolherbst[d]: apparently my firefox extensions alone use 2GiB 😄
16:32 clangcat[d]: karolherbst[d]: What are you using?
16:32 clangcat[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1259909993679814696/grafik.png?ex=668d661e&is=668c149e&hm=3e8b643ff183d7d34ebb3515abdb35bcbee781b8f19feebfca606a525bdd07dd&
16:32 clangcat[d]: My only extension
16:32 clangcat[d]: and I'm not gonna lie
16:32 clangcat[d]: at times this thing uses a lot
16:32 clangcat[d]: XD
16:33 karolherbst[d]: I have to figure out which one I don't need
16:33 clangcat[d]: Get 8GB of RAM
16:33 clangcat[d]: It'll help you trim the fat
16:33 clangcat[d]: XD
16:34 karolherbst[d]: apparently ublock
16:34 karolherbst[d]: enabled and disabled it -> 1GiB
16:34 clangcat[d]: Yea that doesn't surprise me
16:35 clangcat[d]: sadly it's kinda needed for the modern internet at times
16:54 esdrastarsis[d]: memoryHeaps[1]:
16:54 esdrastarsis[d]: size = 268435456 (0x10000000) (256.00 MiB)
16:54 esdrastarsis[d]: Is this normal? Having such a low memory heap
17:05 karolherbst[d]: yeah
17:06 karolherbst[d]: if you don't have resizeable BAR that's what you get
17:06 karolherbst[d]: though
17:06 karolherbst[d]: why is there only one heap?
17:29 esdrastarsis[d]: karolherbst[d]: There are 3 heaps, I was just unsure about this one (which is the second one)
17:30 karolherbst[d]: yeah, that's fine
17:30 karolherbst[d]: it's system RAM
17:30 karolherbst[d]: and the default BAR size is 256MiB
18:22 notthatclippy[d]: 256mb is also the size of the GSP's carveout (text+data+heap+stack), which is what you want to (re)store during suspend/resume, etc, so that might be where it is coming from.
18:23 notthatclippy[d]: This includes all the other firmwares that are embedded into gsp.bin and that GSP loads onto the appropriate falcons
18:30 karolherbst[d]: notthatclippy[d]: ohh, so my hunch was right...
18:30 karolherbst[d]: could we optimize this a bit? though failing to allocate on suspend is also terrible...
18:30 notthatclippy[d]: I have no idea if nouveau does anything with that though, sorry. That's a Ben question when he wakes up.
18:31 karolherbst[d]: and can this be in VRAM instead? though then we'd have to move it into system RAM on suspend anyway...
18:32 notthatclippy[d]: karolherbst[d]: Yes, NV prop driver has this bug currently...
18:32 notthatclippy[d]: karolherbst[d]: Yes, the 256mb carveout is in VRAM, but on suspend it gets stored to RAM. And if you're outta RAM, well, Bad Things happen.
18:32 karolherbst[d]: yeah, fair
18:32 karolherbst[d]: but I wouldn't be surprised if in nouveau it's system RAM
18:32 karolherbst[d]: which would also explain the huge memory use
18:33 notthatclippy[d]: It's not. Well, it might be _also_ there, but GSP is running out of VRAM
18:33 karolherbst[d]: mhhh
18:33 karolherbst[d]: interesting
19:27 gfxstrand[d]: Got another 10 FPS out of The Witness by putting stuff in VRAM. 😁
19:29 karolherbst[d]: nice
19:30 karolherbst[d]: I really should do the instruction latency stuff and test that with pixmark_piano, because that's like all shader and 0 vram
19:34 gfxstrand[d]: Yeah
19:35 gfxstrand[d]: I'm going to land NVKMD in a few minutes once my current CTS run is done. Then I'm on to the next thing. I'm not sure what that next thing will be yet.
19:35 gfxstrand[d]: I might try to plumb better shader model stuff through NAK
19:52 phomes_[d]: gfxstrand[d]: I just tested again and all the games I mentioned are now performing as well as or better than before 🙂
21:10 gfxstrand[d]: phomes_[d]: Yeah, should be a little better. Putting descriptors in VRAM didn't give as much of a boost as I was hoping but it
21:41 skeggsb9778[d]: karolherbst[d]: umm, no. if it were vram, it'd make sense 😛
21:41 skeggsb9778[d]: perhaps i'm just forgetting something, but i suspect a bug. i'll poke around sometime this morning
21:53 skeggsb9778[d]: yeah, i see 119MiB with GSP, 27MiB without
21:54 karolherbst[d]: not the 300MiB reported, but that's still a significant difference
21:55 karolherbst[d]: maybe something with the firmware handling is really bad?
21:55 esdrastarsis[d]: gfxstrand[d]: merging my MR :happy_gears:
21:58 gfxstrand[d]: esdrastarsis[d]: Sure! Let me go dig it up
21:58 gfxstrand[d]: Thanks for the reminder!
23:00 asdqueerfromeu[d]: skeggsb9778[d]: Poking at DP audio would be nice too 🎧