00:14 dwlsalmeida[d]: By the way, do you guys have any idea where I should start investigating regarding the massive performance drop between the C version from one year ago and what we have now?
00:14 dwlsalmeida[d]: we apparently went from 7s for the H.264 test suite, to 78s
00:15 dwlsalmeida[d]: nothing I wrote justifies a 10x drop, whatever it is
00:16 airlied[d]: in a release build?
00:17 dwlsalmeida[d]: both debug builds
00:17 airlied[d]: rust debug is usually a lot higher overhead, maybe perf record a run and see if anything stands out
00:18 dwlsalmeida[d]: not 10x as much, and a release build doesn't fix it either :/
00:18 airlied[d]: oh so it's still slow on release, yeah then it's probably perf record time
00:57 gfxstrand[d]: I'd try perf first. The other thing I'd look at is if you're ending up with more pushbufs for some reason. You could stick in a little queue submit print for that.
00:59 gfxstrand[d]: IDK why I think that's a thing that could happen. It just seems like it's worth checking.
01:06 dwlsalmeida[d]: I have a separate pushbuf for the rust stuff
01:07 airlied: you could also just stick time printfs at cmd buffer begin/end to rule that out
01:08 dwlsalmeida[d]: What I am trying to say is that I have a separate allocation for the rust stuff, is this a problem ?
01:08 dwlsalmeida[d]: I notice that when you say nv_push_new() it reuses the BO
01:08 dwlsalmeida[d]: Which…I don’t think I do
01:08 dwlsalmeida[d]: It’s probably reallocated on each frame
01:13 dwlsalmeida[d]: I mean this btw:
01:13 dwlsalmeida[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31867/diffs?commit_id=e6dc5469650e019b8be9b0bd9bf9aef23f14be34
01:18 airlied[d]: hmm yeah that might cause some overhead, in theory command buffers should be mostly recyclable using cmd pools and resets
01:19 airlied[d]: also memcpying the whole thing seems suboptimal
01:49 gfxstrand[d]: Yeah, that's extra memcpy but shouldn't be a problem. There's a better way to provide that function that doesn't involve extra rust stuff in the command buffer but we can get to that later.
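(For reference, this is the allocate-once, reset-per-frame pattern airlied and gfxstrand are describing, not the actual NVK/nv_push code; all names here, frame_push, FRAME_PUSH_DWORDS and so on, are hypothetical.)
```c
/* Minimal sketch: the push BO is allocated and mapped once, then reused
 * every frame by rewinding a write cursor, instead of allocating a new
 * BO per recorded frame. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAME_PUSH_DWORDS (64 * 1024)

struct frame_push {
   uint32_t *map;   /* persistently mapped BO, allocated once */
   uint32_t  len;   /* dwords written so far this frame */
};

/* Called once, at command-buffer (or pool) creation time. */
static void frame_push_init(struct frame_push *p, uint32_t *bo_map)
{
   p->map = bo_map;
   p->len = 0;
}

/* Called at begin/reset time: no allocation, no free, just rewind. */
static void frame_push_reset(struct frame_push *p)
{
   p->len = 0;
}

static void frame_push_emit(struct frame_push *p,
                            const uint32_t *dw, uint32_t count)
{
   assert(p->len + count <= FRAME_PUSH_DWORDS);
   memcpy(p->map + p->len, dw, count * sizeof(*dw));
   p->len += count;
}
```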
03:10 esdrastarsis[d]: nvidia announced rtx blackwell, now gsp firmware headache begins :blobcatnotlikethis:
03:19 redsheep[d]: Assuming reviews don't say it's somehow terrible I am probably going to buy that 5090, looks quite appealing
04:53 gfxstrand[d]: Any idea when they're going to be on the shelves?
04:54 gfxstrand[d]: Jan 30 according to the press release.
04:55 tiredchiku[d]: 5070ti looks like a solid option
04:55 tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1326051608647434250/960x0.jpg?ex=677e054c&is=677cb3cc&hm=cbb2ef5a5ebf30718e7c7bbf2e23b3ec42238ee15327a033476fd9dbecbd1dec&
04:55 gfxstrand[d]: esdrastarsis[d]: Hopefully Ben has been working on it. He's on the inside now so there's a good chance he already has a branch.
04:56 tiredchiku[d]: day one support :shocked_horse:
04:56 gfxstrand[d]: 🤷🏻‍♀️
04:56 gfxstrand[d]: I'm going to get one and eagerly await a branch
04:57 gfxstrand[d]: The other question is when will we have PCIe 6.0 capable motherboards to plug it into.
04:59 gfxstrand[d]: A while ago they were bragging that Blackwell would be PCIe 6.0 but I see no mention of it in the press release. Probably need to wait for more CES announcements.
05:00 gfxstrand[d]: I also have no idea if I'm going to have any NVK/NAK fixing to do or not.
05:00 gfxstrand[d]: Hopefully nothing major
05:01 gfxstrand[d]: But we'll find out
05:03 tiredchiku[d]: `4x PCIe 8-pin cables (adapter in box) OR 1x 600 W PCIe Gen 5 cable`
05:03 tiredchiku[d]: for the 5090
05:03 tiredchiku[d]: so, maybe pcie gen 5?
05:03 tiredchiku[d]: maybe gen5 x16
05:08 redsheep[d]: gfxstrand[d]: Is that confirmed for 5090? I only found something saying as much for 5080
05:08 mhenning[d]: gfxstrand[d]: yeah, I was thinking about tinkering with the instruction encodings whenever we get a new disassembler
05:08 redsheep[d]: And the specs list rtx blackwell as gen 5 x16, so new motherboards wouldn't be a thing
05:09 tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1326055044373680159/VYK0JYv.png?ex=677e087f&is=677cb6ff&hm=c4886bbede3703c719fd924f073196025b5ebbacb235310566556f692d1bfdf7&
05:09 tiredchiku[d]: redsheep[d]:
05:09 tiredchiku[d]: nvidia product page: https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/
05:10 tiredchiku[d]: 5080 also on jan 30th
05:10 tiredchiku[d]: 5070 family in feb
05:15 redsheep[d]: Some of the specs look a little weird. Definitely waiting for reviews, even if I can get a decent price for my 4090 second hand
05:16 tiredchiku[d]: somehow I'm still hung up on not wanting to sell the 3070
05:34 gfxstrand[d]: Now with FP4!
05:35 gfxstrand[d]: I'm not 100% sure what that is but I fear it might be 4-bit floats...
05:35 gfxstrand[d]: Yeah, it is. 🙃
05:36 airlied[d]: some nir work to do then? 🙂
05:41 gfxstrand[d]: Likely we'll just add some sort of conversion instruction and treat the ops as a unit. They're going to work on 8 at a time, like DP4.
05:42 gfxstrand[d]: At some point we probably need to do some work on nir_alu_type to support more things; we've been chatting about bfloat16 for a while now.
05:43 gfxstrand[d]: But also there's no SPIR-V support for FP4 yet (though I'm sure Jeff has a draft spec ready) so we may have some time.
05:45 gfxstrand[d]: The problem is that once you get below 16 bits, no one can agree on the encoding. Even 16 has multiple variants: binary16 and bfloat16. 8 and 4 are way worse. I have no idea what Nvidia settled on. We'll have to figure that out.
05:46 gfxstrand[d]: So we need more than "float" and "16 bits" in NIR.
05:46 gfxstrand[d]: Which is really annoying
06:36 karolherbst[d]: that's all going to be fun in regards to opt_algebraic anyway
07:58 redsheep[d]: I tried to look around for how blackwell even does fp4 and for whatever reason it isn't easy to find, but it sounds like it either has 1 mantissa bit, or none at all, which is... interesting. It might not even have infinity or NaN either.
08:10 tiredchiku[d]: > for whatever reason
08:11 tiredchiku[d]: the gen has just been announced .-.
08:13 __ad: hi, trying to fix a backlight issue on a recent lenovo legion, what i see so far is that the edp "dpcd" data seems incorrect. Is this possible?
08:16 __ad: i have DP_EDP_PWMGEN_BIT_COUNT = 8, DP_EDP_BACKLIGHT_BRIGHTNESS_MSB set to 1, but the maximum calculated is 255, while the display wants 4096 (12 bits)
08:16 __ad: So suspecting wrong dpcd info from the panel. Not sure if this could be possible.
08:32 redsheep[d]: tiredchiku[d]: I mean, it's almost certainly the same format as the datacenter blackwell gpus that were announced 9 months ago...
08:37 tiredchiku[d]: hm
08:37 tiredchiku[d]: fair enough
09:03 notthatclippy[d]: redsheep[d]: Dunno about this in particular but in general that's not a safe assumption to make. Our naming scheme sure is something, and GB20X is often closer to AD10X (previous consumer grade generation) than GB10X datacenter grade
09:09 tiredchiku[d]: on an unrelated note
09:09 tiredchiku[d]: capture sdk release when nvidia
09:10 tiredchiku[d]: still 8.0.8, the header still doesn't have anything for egl nvfbc or pipewire nvfbc that was added in recent drivers
11:13 tiredchiku[d]: gfxstrand[d]: is there any specific reason we order our memory types the way we do?
11:13 tiredchiku[d]: radv seems to do
11:13 tiredchiku[d]: 1) DEVICE_LOCAL
11:13 tiredchiku[d]: 2) HOST_VISIBLE | HOST_COHERENT
11:13 tiredchiku[d]: 3) DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT
11:13 tiredchiku[d]: 4) HOST_VISIBLE | HOST_COHERENT | HOST_CACHED
11:13 tiredchiku[d]: nvprop does something similar
11:14 tiredchiku[d]: but NVK does
11:14 tiredchiku[d]: [0] DEVICE_LOCAL
11:14 tiredchiku[d]: [1] DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT
11:14 tiredchiku[d]: [2] HOST_VISIBLE | HOST_COHERENT | HOST_CACHED
11:15 tiredchiku[d]: Hans (vkd3d-proton) says putting rebar early in the list is going to cause a lot of problems, not just in vkd3d-proton
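(Context on why the ordering matters: many apps and translation layers pick a memory type with a simple first-match walk over the list the driver reports, so the order NVK advertises its types in directly affects where allocations land. A minimal sketch of that first-match pattern, not vkd3d-proton's actual code:)
```c
/* Walk the reported memory types in order and take the first one that is
 * allowed by the resource and has all the required property flags. */
#include <stdint.h>
#include <vulkan/vulkan.h>

static int32_t
find_memory_type(const VkPhysicalDeviceMemoryProperties *props,
                 uint32_t memory_type_bits,        /* from VkMemoryRequirements */
                 VkMemoryPropertyFlags required)
{
   for (uint32_t i = 0; i < props->memoryTypeCount; i++) {
      if (!(memory_type_bits & (1u << i)))
         continue;
      if ((props->memoryTypes[i].propertyFlags & required) == required)
         return (int32_t)i;   /* first match wins, so earlier types win */
   }
   return -1;
}
```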
11:19 mohamexiety[d]: notthatclippy[d]: in this case in particular it's tricky because there is normal FP4 and there's the microscaling format stuff. it's safe to assume any new datacenter GPU will support both, but for consumer stuff it's.... yeah
11:20 tiredchiku[d]: unrelated, can be closed:
11:20 tiredchiku[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12326
11:20 tiredchiku[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11177
12:37 tiredchiku[d]: tiredchiku[d]: nvm I get it
15:18 dwlsalmeida[d]: Any ideas what can cause that?
15:18 dwlsalmeida[d]: ERROR: [Loader Message] Code 0 : setup_loader_term_phys_devs: Failed to detect any valid GPUs in the current config
15:18 dwlsalmeida[d]: ERROR at ./vulkaninfo/./vulkaninfo.h:241:vkEnumeratePhysicalDevices failed with ERROR_INITIALIZATION_FAILED
15:18 dwlsalmeida[d]: Being run with:
15:18 dwlsalmeida[d]: #!/bin/sh
15:18 dwlsalmeida[d]: MESA=$HOME/DEV/mesa-nvk/builddir/install \
15:18 dwlsalmeida[d]: LD_LIBRARY_PATH=$MESA/lib64:$MESA/lib:$LD_LIBRARY_PATH \
15:18 dwlsalmeida[d]: LIBGL_DRIVERS_PATH=$MESA/lib64/dri:$MESA/lib/dri \
15:18 dwlsalmeida[d]: VK_ICD_FILENAMES=$MESA/share/vulkan/icd.d/nouveau_icd.x86_64.json \
15:18 dwlsalmeida[d]: VK_ENABLE_BETA_EXTENSIONS=1 \
15:18 dwlsalmeida[d]: exec "$@"
15:18 dwlsalmeida[d]: It's not my machine, otherwise I'd figure this out myself instead of asking
15:19 dwlsalmeida[d]: and I can't reproduce..
15:25 tiredchiku[d]: check kernel module
15:35 gfxstrand[d]: tiredchiku[d]: Aren't those the same, just without uncached GART? We can flip the last two if it helps games.
15:38 gfxstrand[d]: notthatclippy[d]: I suspect it's safe in this case. Y'all don't tend to do engineering work twice and introduce divergences between consumer and data center that might result in a CUDA program producing different results. Not without a damn good reason, anyway. If they've decided on an FP4 format, I expect it'll be consistent.
15:38 gfxstrand[d]: What is it? Now that I don't know. I've got my theories but I'll need to get my hands on either docs or hardware to really know.
15:47 tiredchiku[d]: gfxstrand[d]: yeah, I'm not entirely sure
15:47 tiredchiku[d]: might be best to leave it as is until we can figure out how to do uncached sysmem
15:47 tiredchiku[d]: or until issues pertaining to it actually show up again
15:48 redsheep[d]: I just hope the new world model recognizes that paperclips are less important than people
15:51 redsheep[d]: I got a pretty good chuckle out of the guy introducing Jensen accidentally saying the self driving cars are self aware cars
15:53 Jasper[m]: <redsheep[d]> "I just hope the new world..." <- I understood this reference
15:54 karolherbst[d]: ... they also added FP6?
15:54 karolherbst[d]: why would they do that ...
15:54 karolherbst[d]: imagine the day vulkan gets a non-POT data type added to its core
15:59 gfxstrand[d]: I don't want to think about non-POT data types...
15:59 gfxstrand[d]: NIR is going to need some work before that'll be a thing.
16:00 gfxstrand[d]: LLVM already tried to hand us 24-bit integers once.
16:01 gfxstrand[d]: As for why? Well, sometimes 4 isn't enough, I guess. 🤷🏻‍♀️
16:02 karolherbst[d]: yeah....
16:02 karolherbst[d]: llvm is weird with that
16:02 gfxstrand[d]: I suspect some of this stuff is pretty niche and added because some AI customer is buying enough GPUs that they get to ask for specific hardware.
16:02 karolherbst[d]: for sure
16:03 karolherbst[d]: I'll be happy once this is all finally over
16:03 mohamexiety[d]: yeah all the weird FP data types are due to this
16:03 gfxstrand[d]: FP6 is wonky. Like, you get 5 per register, and that little bit of perf is enough that they didn't just spring for 4x8? I find it hard to believe that it's going to be generally applicable.
16:04 mohamexiety[d]: got to save on VRAM without giving brain damage to the AI I guess :KEKW:
16:04 gfxstrand[d]: FP4 makes some sense anyway
16:04 karolherbst[d]: how does FP4 make any sense
16:04 karolherbst[d]: can you like only encode 4 numbers with it?
16:05 gfxstrand[d]: But if there's one thing we've learned in the GPU world, it's that the tiny memory savings from NPOT isn't worth the bother.
16:05 karolherbst[d]: I wonder if it's more of a cost efficiency thing, like in terms of perf/power
16:06 gfxstrand[d]: karolherbst[d]: For a some models, 0 and +/-1 are enough.
16:06 karolherbst[d]: yeah, but then why not just use ints and bitmask it or something
16:07 gfxstrand[d]: Some AI people are playing around with binary
16:07 gfxstrand[d]: karolherbst[d]: Even if it was i2, you'd still want a matrix multiply instruction.
16:08 karolherbst[d]: mhhhhh
16:08 karolherbst[d]: I guess at that point it doesn't matter because you already use all that die space for the stuff
16:08 gfxstrand[d]: It's not just about memory. It's about ALU throughput. Gotta smash those numbers as fast as possible.
16:08 mohamexiety[d]: also FP6 may actually be the microscale block format stuff now that I think about it (this one: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
16:09 karolherbst[d]: yeah, but like.. INT4 matrix multiply still makes more sense to me than FP4 😄
16:09 gfxstrand[d]: karolherbst[d]: Yeah and FP4 is cheap. It's literally just a lookup table at that point.
16:09 karolherbst[d]: mhhhhhhhhh
16:09 karolherbst[d]: actually....
16:09 karolherbst[d]: uhhhhh
16:09 karolherbst[d]: I had a realization
16:11 gfxstrand[d]: There are also some crazy formats some people have experimented with which are SNORM logarithmic, so it's basically just a mantissa. They apparently work really well for 8-bit and smaller.
16:11 karolherbst[d]: yeah, fair
16:13 gfxstrand[d]: There's a lot of research that's gone into low-bit computation. It's just not had enough money behind it to be put into big hardware until now.
16:13 karolherbst[d]: yeah...
16:14 redsheep[d]: karolherbst[d]: If you don't encode NaN and infinity you can use all 4 bits to represent 16 actual values on fp4.
16:14 karolherbst[d]: I'm surprised that fp4 has that many values
16:15 gfxstrand[d]: gfxstrand[d]: Unfortunately, lots of academic research and experimentation doesn't lead to standardization. The next few years are going to be interesting.
16:15 karolherbst[d]: bfloat6 and float6
16:16 gfxstrand[d]: Please no
16:17 gfxstrand[d]: I'm hoping we get one or two standards for 4 and 8 and some interesting integer ops that give us 1-bit. I hope 6 dies.
16:17 karolherbst[d]: well.. bfloat needs at least 9 bits anyway
16:17 karolherbst[d]: bfloat9
16:17 gfxstrand[d]: No
16:17 gfxstrand[d]: We already have R11G11B10. That's enough damage.
16:17 mohamexiety[d]: gfxstrand[d]: at least for AMD and NVIDIA, both seem to have standardized on the OCP formats which seems good for now
16:18 karolherbst[d]: though bfloat9 would have no fraction part, so I think we need 10
16:21 redsheep[d]: Watch as sentience is discovered to require 9 bits minimum and the AI overlords redesign all computers to have 9 bits to a byte
16:21 gfxstrand[d]: If it is OCP then it's 1.2.1 with no inf/nan.
16:22 gfxstrand[d]: Which makes sense. That's kinda what I guessed.
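(Assuming it really is the OCP MX FP4 (E2M1) encoding from the spec mohamexiety linked, 1 sign, 2 exponent, 1 mantissa bit, bias 1, no inf/NaN, the eight magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6, and decode genuinely is just a 16-entry table as described above. A sketch:)
```c
/* FP4 decode as a straight lookup, assuming the OCP MX E2M1 format. */
#include <stdint.h>

static const float fp4_e2m1_table[16] = {
    0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,   /* sign = 0 */
   -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,   /* sign = 1 */
};

static inline float fp4_e2m1_to_float(uint8_t bits)
{
   return fp4_e2m1_table[bits & 0xf];
}
```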
16:22 karolherbst[d]: my big hope is that there is yet to be money made in the biz if you aren't a hardware vendor
16:22 karolherbst[d]: so maybe it just all implodes and we won't have to bother with those weirdo types
16:23 redsheep[d]: Problem is some of the applications are very valid. This all isn't just going to go away, even if the delusional parts disappeared tomorrow
16:23 karolherbst[d]: yeah, but then you don't need fp4 or so
16:23 gfxstrand[d]: Oh, we need fp4
16:23 karolherbst[d]: pain
16:24 karolherbst[d]: guess fp6 is out the window then
16:24 gfxstrand[d]: That's needed on the end device doing the inference. I expect we'll see it in Mali and similar very soon.
16:24 karolherbst[d]: fair I guess
16:45 gfxstrand[d]: On Nvidia, this shit exists for compatibility and so that you can train truly absurd models. On Mali, Intel, and whoever else, it exists so you can do inference in reasonable time without burning all your system resources (or at all, in the mobile case).
19:47 dwlsalmeida[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1326276110823985295/perf_C.data?ex=677ed661&is=677d84e1&hm=28e688e33a77518b09b8e92ca8aa2ad8e4db4b7c44a29b6f7993193afedb92f3&
19:47 dwlsalmeida[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1326276111155593307/perf_rust.data?ex=677ed661&is=677d84e1&hm=747c87c9f91d48576cff1858229322be17e44c070e29bb313551b8739047abdb&
19:47 dwlsalmeida[d]: Hey guys, I've never used perf before, I wonder if any of you can spot what is broken between these two?
19:48 dwlsalmeida[d]: I am still trying to find why the newer Rust video stuff takes 10x longer
19:57 dwlsalmeida[d]: ugh, the file is too large to send
20:07 dwlsalmeida[d]: https://drive.google.com/file/d/1SIJEnwcUvbrq0AN2uxaXQxpQCbFE7cC1/view?usp=sharing
20:07 dwlsalmeida[d]: https://drive.google.com/file/d/1_bsTqpRO6WkA7abd4S0ajc2fiDNZAOci/view?usp=sharing
20:21 airlied[d]: not even sure how to use those files 🙂
20:21 airlied[d]: normally perf record --call-graph dwarf <testapp> then perf report is my go to
20:35 dwlsalmeida[d]: this is perf archive, so, IIUC, there's a perf.data inside, plus the dwarf business to make it portable/readable between machines
20:35 dwlsalmeida[d]: anyways, I am looking through this with a more experienced colleague
20:35 tanriol: A nice shareable (not sure about file size limits, however) option for perf is converting and uploading with Firefox Profiler https://profiler.firefox.com/docs/#/./guide-perf-profiling
20:36 dwlsalmeida[d]: I don't think there's anything meaningful, apparently the CPU is idle most of the time, which may be us waiting on the fence for the frame currently being decoded?
20:37 dwlsalmeida[d]: tanriol: so, `perf record --call-graph dwarf <testapp> `, then `perf script -g`?
20:38 dwlsalmeida[d]: how do I share that, is it a matter of simply sharing the URL?
20:39 airlied: I use perf report and you should be able to see the largest consumers
20:39 airlied: but if it's actually just all in idle, then perf won't really help
20:39 tanriol: dwlsalmeida[d]: `perf script -F +pid` for converting and IIRC there should be an upload button somewhere in the top right corner
20:53 dwlsalmeida[d]: C version ---> https://share.firefox.dev/4j4UusI
20:53 dwlsalmeida[d]: Rust version -----> https://share.firefox.dev/4j6AHcb
20:55 airlied: I think gst_element_set_state going higher is where I'd start
21:07 dwlsalmeida[d]: gfxstrand[d]: airlied[d] what do you guys think about this?
21:07 dwlsalmeida[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31867/diffs#92b686abc64efd665e8e50b45fbab7f359761ca6_0_92
21:08 dwlsalmeida[d]: all calls to vkCmdBeginVideoCodingKHR and vkCmdDecodeVideoKHR are serialized using this mutex
21:08 dwlsalmeida[d]: I don't think this matters right? only if you were trying to record commands from multiple threads at once
21:08 airlied: I think ffmpeg wants to do that
21:09 dwlsalmeida[d]: ok, but do you think this mutex is related to the performance drop?
21:09 dwlsalmeida[d]: just from initially looking at that code, that is
21:10 airlied[d]: I doubt it, especially if gst is single-threaded
21:10 dwlsalmeida[d]: it is
21:12 gfxstrand[d]: Yeah, mutexes are cheap.
21:13 airlied[d]: uncontested 🙂
21:17 dwlsalmeida[d]: well, they are I think, unless you have multiple threads recording with the same VkVideoSession object at once
21:17 dwlsalmeida[d]: which is not the case here
21:18 dwlsalmeida[d]: alright, after a lot of this `perf` thing, we aren't any better :/
21:26 airlied[d]: there is a definite difference in % for that fn above
21:27 airlied[d]: just can't see perf samples in the driver there
21:31 airlied[d]: if you search for a frame with nvk_CmdDecodeVideoKHR it seems to show a large diff
21:32 dwlsalmeida[d]: ok, but (again IIUC), this is a tiny fraction of the CPU time anyways?
21:32 airlied[d]: https://share.firefox.dev/3W49PzF vs https://share.firefox.dev/3W4a0ej
21:33 airlied[d]: 12ms vs 140ms seems significant over a lot of frames
21:35 dwlsalmeida[d]: sorry, where did you read the 140ms from?
21:36 dwlsalmeida[d]: (not sure the page loaded right here)
21:43 airlied[d]: oh click on the function
21:43 airlied[d]: I clicked off it when I permalinked
21:44 airlied[d]: I think just doing a printk timing around that one function would be enough to demonstrate it and help tell difference between debug/release as well
21:44 airlied[d]: sorry printf time
21:45 airlied[d]: since a lot of the overhead seems to be Rust not inlining
21:46 airlied[d]: though the append is in there as well
21:51 dwlsalmeida[d]: welp, but that doesn't explain why it's not much better on release builds
21:52 dwlsalmeida[d]: but ok, let's try what you said
21:52 dwlsalmeida[d]: also, I don't think Rust can inline across crates if you don't specifically `#[inline]` stuff
21:53 dwlsalmeida[d]: so, the push stuff is a crate, the video stuff is another crate, all calls into `nv_push_rs` are not inlined
21:54 airlied[d]: sounds important to fix
21:54 ndufresne: o/
21:56 dwlsalmeida[d]: @ndufresne has been working with me to try and figure out what's happening
22:11 mhenning[d]: dwlsalmeida[d]: fwiw, I always run perf on release builds - imo having compiler optimizations off distorts the results way more than missing some stack frames in the trace because of inlining
22:11 mhenning[d]: or rather, release with debug info
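(The timings below come from printf instrumentation of the kind airlied suggested. A minimal sketch, assuming CLOCK_MONOTONIC; the TIMED_CALL macro is made up, in the real driver you would just open-code this around the body of the function under suspicion:)
```c
/* Rough wall-clock timing around a single entrypoint. */
#include <stdio.h>
#include <time.h>

static double now_ms(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

#define TIMED_CALL(fn, ...)                                              \
   do {                                                                  \
      double _t0 = now_ms();                                             \
      fn(__VA_ARGS__);                                                   \
      fprintf(stderr, #fn " execution time: %.3f ms\n", now_ms() - _t0); \
   } while (0)
```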
22:12 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.029 ms
22:12 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.002 ms
22:12 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.001 ms
22:12 dwlsalmeida[d]: [… ~30 more frames, all between 0.001 and 0.005 ms …]
22:12 airlied[d]: okay, if the C speed is the same, then it seems unlikely
22:12 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.002 ms
22:12 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.002 ms
22:12 dwlsalmeida[d]: I don't think the problem is here ^
22:14 dwlsalmeida[d]: Ok, this rules out Rust btw
22:14 dwlsalmeida[d]: this is the only function written in Rust basically
22:14 dwlsalmeida[d]: unless the problem is actually flushing the Rust pushbuf to the GPU
22:15 karolherbst[d]: looking at the graphs, it looks like the CPU waits mostly
22:15 karolherbst[d]: so it's not like it's slower in terms of spent CPU cycles, but rather the CPU spends a lot of time in blocked state
22:15 karolherbst[d]: (well the process rather)
22:16 dwlsalmeida[d]: is it possible to say what exactly blocks? It's not the mutex thing otherwise it would show up in the thing I pasted above
22:17 karolherbst[d]: `g_mutex_lock`?
22:17 dwlsalmeida[d]: no, I am referring to the Rust mutex I placed before vkCmdBeginVideoCoding and vkCmdDecodeVideo
22:18 dwlsalmeida[d]: this `g_mutex_lock` is some gstreamer lock somewhere else
22:18 karolherbst[d]: yeah, but it's most likely the thing it blocks on
22:19 karolherbst[d]: could also be `nvcmd_ctx_wait`
22:19 karolherbst[d]: it's kinda weird, because it seems to be various things
22:19 karolherbst[d]: but anyway
22:19 karolherbst[d]: the process is clearly not doing anything a lot of time
22:20 karolherbst[d]: (it's the blue bar in the process/thread view)
22:20 karolherbst[d]: well..
22:20 karolherbst[d]: it's not a bar if the CPU doesn't do anything
22:21 karolherbst[d]: looks like it happens roughly every 5ms
22:22 karolherbst[d]: it's executing stuff for like 0.1ms then 5ms of nothing
22:23 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.029 ms
22:23 dwlsalmeida[d]: nvkmd_ctx_wait execution time: 0.000090 ms
22:23 dwlsalmeida[d]: nvkmd_ctx_wait execution time: 0.000020 ms
22:23 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.002 ms
22:23 dwlsalmeida[d]: nvkmd_ctx_wait execution time: 0.000050 ms
22:23 dwlsalmeida[d]: nvkmd_ctx_wait execution time: 0.000030 ms
22:23 dwlsalmeida[d]: [… same pattern for the remaining frames: decodes at 0.001–0.002 ms, ctx waits well under a microsecond …]
22:23 karolherbst[d]: yeah....
22:23 karolherbst[d]: it doesn't feel like it's actually something in the process
22:23 karolherbst[d]: but something else messing
22:24 karolherbst[d]: you can zoom into those blue segments
22:25 karolherbst[d]: but you can see it in both traces if you zoom in close enough
22:26 karolherbst[d]: but maybe there was also a problem of doing the trace
22:26 karolherbst[d]: dunno
22:27 karolherbst[d]: anyway, profiling is hard and should only be done with release builds
22:28 dwlsalmeida[d]: well, the client has to wait on the fence associated with the `VkImage` containing the decoded data
22:28 dwlsalmeida[d]: I wonder if...that's not related?
22:29 dwlsalmeida[d]: although I did not touch that at all
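(The gst_vulkan_operation_wait numbers further down are essentially timing a fence wait like this; a sketch, not the actual GStreamer code, with illustrative names:)
```c
/* Block until the decode submission that writes the VkImage signals its
 * fence; this is where CPU "idle" time shows up in the profile. */
#include <stdint.h>
#include <vulkan/vulkan.h>

static VkResult
wait_for_decoded_image(VkDevice dev, VkFence decode_fence)
{
   return vkWaitForFences(dev, 1, &decode_fence, VK_TRUE, UINT64_MAX);
}
```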
22:31 karolherbst[d]: kinda want to trace all related processes at the same time to really see what's going on probably. But if there is a GPU in the mix it's getting a bit more complicated to trace it properly and see where it blocks on
22:34 dwlsalmeida[d]: ok, this is how long we have to wait in the Rust version before the VkImage is ready:
22:34 dwlsalmeida[d]: nvk_CmdDecodeVideoKHR execution time: 0.002 ms
22:34 dwlsalmeida[d]: nvkmd_ctx_wait execution time: 0.000080 ms
22:34 dwlsalmeida[d]: nvkmd_ctx_wait execution time: 0.000030 ms
22:34 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.448 ms
22:34 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.457 ms
22:34 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.460 ms
22:35 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 1.104 ms
22:35 dwlsalmeida[d]: [… every frame looks like this: decode ~0.002 ms, ctx waits negligible, gst_vulkan_operation_wait between 0.45 and 1.1 ms …]
22:35 dwlsalmeida[d]: lets see the C version now
22:35 Ermine: did you pipe your stuff into discord?
22:35 HdkR: cat dmesg > discord
22:36 DodoGTA: Ermine: That's definitely a text snippet from Discord
22:36 Ermine: ah, indeed
22:36 HdkR: That's some amazingly bad integration from the bot
22:37 karolherbst[d]: yeah.... that part is kinda weird, but not sure there are many options besides automatically uploading it to some pastebin website
22:37 HdkR: Isn't that what matrix does?
22:38 Ermine: matrix hosts its own pastebin
22:39 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.526 ms
22:39 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.452 ms
22:39 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.449 ms
22:39 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.118 ms
22:39 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.119 ms
22:39 dwlsalmeida[d]: [… remaining frames alternate between ~0.45 ms and ~0.12 ms …]
22:39 dwlsalmeida[d]: eeh...looks similar?
22:39 Jasper[m]: Yeah the matrix bot does not enjoy that
22:40 dwlsalmeida[d]: sorry guys,
22:40 dwlsalmeida[d]: will pastebin stuff from now on
22:40 HdkR: Ermine: Sounds like the discord both needs to host its own pastebin as well then :P
22:40 HdkR: discord bot*
22:42 dwlsalmeida[d]: actually..not really..we seem to take longer in the Rust version
22:44 dwlsalmeida[d]: plenty of frames like this in the C code:
22:44 dwlsalmeida[d]: gst_vulkan_operation_wait execution time: 0.116 ms
22:44 dwlsalmeida[d]: but literally no frame takes less than `0.450` in the Rust version
22:52 asdqueerfromeu[d]: dwlsalmeida[d]: Does that mean that Rust bäd? /s
22:53 redsheep[d]: That still doesn't quite account for a 10x difference
22:53 redsheep[d]: At least it doesn't look like it would
22:54 dwlsalmeida[d]: so total wait time is `151ms` vs `72ms`
22:55 dwlsalmeida[d]: still far off from 10x worse
23:17 dwlsalmeida[d]: ok, finally some progress
23:18 dwlsalmeida[d]: @ndufresne
23:18 dwlsalmeida[d]: vkh26adec output_picture execution time: 0.197 ms
23:18 dwlsalmeida[d]: vs
23:18 dwlsalmeida[d]: vkh26adec output_picture execution time: 5.423 ms
23:19 orowith2os[d]: `#[inline]` everything 🔥 /j
23:20 dwlsalmeida[d]: sometimes as much as 7ms per frame
23:21 dwlsalmeida[d]: yeah, the C version is consistent at 0.2 per frame
23:24 dwlsalmeida[d]: I'll leave this to some other day, but for whatever reason `gst_video_decoder_finish_frame` is taking much more time
23:24 dwlsalmeida[d]: maybe `uftrace` can help here