00:09 redsheep[d]: I haven't missed something huge about Nvidia open sourcing cuda, right? https://youtu.be/tuexWDxpRDk?t=248
00:09 redsheep[d]: Their sources don't seem to mention cuda and I feel like someone would have been in this channel popping champagne
00:22 mohamexiety[d]: He is joking
00:22 mohamexiety[d]: Notice the “or else if it would be pretty hypocritical huh?” right after that line
00:24 mohamexiety[d]: The delivery is a bit confusing though
00:44 phomes_[d]: nvdump segfaults for me on nvk. Is this to be expected?
00:44 redsheep[d]: mohamexiety[d]: Huh. I guess, but it's very odd for the joking statement to be partially true.
00:44 phomes_[d]: it works fine on the prop driver (with my MR update deps)
12:38 pixelcluster[d]: dwlsalmeida[d]: got a used car
12:38 pixelcluster[d]: turbo leaked oil (probably)
12:38 pixelcluster[d]: fucked up catalytic converter too
12:38 pixelcluster[d]: sadness ensued
12:57 dwlsalmeida[d]: pixelcluster[d]: do they sell these small three cylinder turbo engines where you live? curious how common this will be once more cars have turbos
16:05 dwlsalmeida[d]: what's the difference between `nvk_cmd_buffer_upload_alloc` and `nvkmd_mem_map`?
16:06 dwlsalmeida[d]: can I map a memory and write to it using `nvkmd_mem_map` like I can with `nvk_cmd_buffer_upload_alloc`?
16:07 dwlsalmeida[d]: I now need to read something from the GPU, so I think upload_alloc is not the right thing to use
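A minimal sketch of the readback pattern under discussion, assuming a longer-lived host-visible allocation with a persistent CPU mapping; none of the helper names below are real NVK/nvkmd API, they're placeholders:
```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch only. The idea: upload_alloc-style memory is tied to
 * the command buffer and is meant for CPU-written data the GPU consumes,
 * while a readback buffer wants a longer-lived, host-visible allocation
 * with a persistent CPU mapping, read only after the submission completes. */
struct prob_readback {
    uint64_t gpu_addr; /* programmed into AV1_SET_PROB_TAB_WRITE_BUF_OFFSET */
    void    *cpu_map;  /* persistent CPU mapping of the same memory */
};

void wait_for_decode_fence(void); /* placeholder for the real sync point */

static void
read_updated_probs(const struct prob_readback *rb, void *dst, size_t n)
{
    wait_for_decode_fence();     /* the GPU write must have landed first */
    memcpy(dst, rb->cpu_map, n); /* now safe to read on the CPU */
}
```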
16:26 pixelcluster[d]: dwlsalmeida[d]: yea p sure they do (germany)
16:27 dwlsalmeida[d]: avhe[d]: does the size of the av1 coloc buffer show up anywhere?
16:32 avhe[d]: Which one are we talking about? I see AV1_SET_COL_MV0_READ_BUF_OFFSET, AV1_SET_COL_MV1_READ_BUF_OFFSET, AV1_SET_COL_MV2_READ_BUF_OFFSET and AV1_SET_COL_MVWRITE_BUF_OFFSET which are AV1 specific
16:32 avhe[d]: or is it the usual SET_COLOC_DATA_OFFSET
16:45 dwlsalmeida[d]: well, I don't know either, but the trace shows that MV0..2 all point to the same buffer:
16:45 dwlsalmeida[d]: Method 0x0043 (0x20018194): type 1, size 1, subchannel 4, reg 0x00000650 (NVC7B0_AV1_SET_COL_MV0_READ_BUF_OFFSET)
16:45 dwlsalmeida[d]: 0x01200800
16:45 dwlsalmeida[d]: Method 0x0045 (0x20018195): type 1, size 1, subchannel 4, reg 0x00000654 (NVC7B0_AV1_SET_COL_MV1_READ_BUF_OFFSET)
16:45 dwlsalmeida[d]: 0x01200800
16:45 dwlsalmeida[d]: Method 0x0047 (0x20018196): type 1, size 1, subchannel 4, reg 0x00000658 (NVC7B0_AV1_SET_COL_MV2_READ_BUF_OFFSET)
16:45 dwlsalmeida[d]: 0x01200800
16:45 dwlsalmeida[d]: ah, MVWRITE too:
16:45 dwlsalmeida[d]: Method 0x0049 (0x20018197): type 1, size 1, subchannel 4, reg 0x0000065c (NVC7B0_AV1_SET_COL_MVWRITE_BUF_OFFSET)
16:45 dwlsalmeida[d]: 0x01200800
16:47 dwlsalmeida[d]: maybe it's this:
16:47 dwlsalmeida[d]: // AV1 Temporal MV buffer
16:47 dwlsalmeida[d]: #define AV1_TEMPORAL_MV_SIZE_IN_64x64 256 // 4Bytes for 8x8
16:47 dwlsalmeida[d]: #define AV1_TEMPORAL_MV_BUF_SIZE(w, h) ALIGN_UP( ALIGN_UP(w,128) * ALIGN_UP(h,128) / (64*64) * AV1_TEMPORAL_MV_SIZE_IN_64x64, 4096)
16:48 dwlsalmeida[d]: which would match the H.264/5 stuff:
16:48 dwlsalmeida[d]: size_t colmv_size = aligned_w * aligned_h / MB_SIZE;
16:48 dwlsalmeida[d]: or this, I don't know:
16:48 dwlsalmeida[d]: #define AV1_HINT_DUMP_SIZE(w, h) NVDEC_ALIGN(AV1_HINT_DUMP_SIZE_IN_SB128*((w+127)/128)*((h+127)/128)) // always use SB128 for allocation
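A quick sanity check on the first macro, assuming ALIGN_UP rounds up to a multiple (the AV1_HINT_DUMP_SIZE one can't be checked here since AV1_HINT_DUMP_SIZE_IN_SB128 isn't shown); for a 1080p frame it comes out to 139264 bytes:
```c
/* Worked example for AV1_TEMPORAL_MV_BUF_SIZE, assuming ALIGN_UP(x, a)
 * rounds x up to a multiple of a (the usual definition). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))
#define AV1_TEMPORAL_MV_SIZE_IN_64x64 256 /* 4 bytes per 8x8 block */
#define AV1_TEMPORAL_MV_BUF_SIZE(w, h) \
    ALIGN_UP(ALIGN_UP(w, 128) * ALIGN_UP(h, 128) / (64 * 64) * \
             AV1_TEMPORAL_MV_SIZE_IN_64x64, 4096)

/* 1920x1080: ALIGN_UP(1920,128) = 1920, ALIGN_UP(1080,128) = 1152
 * 1920 * 1152 / 4096 = 540 64x64 blocks, 540 * 256 = 138240,
 * aligned up to 4096 -> 139264 bytes (0x22000). */
_Static_assert(AV1_TEMPORAL_MV_BUF_SIZE(1920, 1080) == 139264,
               "1080p temporal MV buffer size");
```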
16:50 dwlsalmeida[d]: there's also this from the V4L2 driver, but I am not sure how that's related:
16:50 dwlsalmeida[d]: https://pastebin.com/y54nqk4E
16:51 dwlsalmeida[d]: Ah, COLOC also points to the same place apparently:
16:51 dwlsalmeida[d]: Method 0x0033 (0x20018105): type 1, size 1, subchannel 4, reg 0x00000414 (NVC7B0_SET_COLOC_DATA_OFFSET)
16:51 dwlsalmeida[d]: 0x01200800
16:57 avhe[d]: I don't see SET_COLOC_DATA_OFFSET being used in the tegra pushbuffer build routine for AV1, maybe this is something specific to a later hardware gen?
16:57 avhe[d]: The maps I see are `SET_DRV_PIC_SETUP_OFFSET`, `SET_SUB_SAMPLE_MAP_OFFSET`, `SET_SUB_SAMPLE_MAP_IV_OFFSET`, `SET_IN_BUF_BASE_OFFSET`, `SET_HISTOGRAM_OFFSET`, `SET_TILE_SIZE_BUF_OFFSET`, `AV1_SET_GLOBAL_MODEL_BUF_OFFSET`, `AV1_SET_SEGMENT_READ_BUF_OFFSET`, `AV1_SET_SEGMENT_WRITE_BUF_OFFSET`, `AV1_SET_COL_MV0_READ_BUF_OFFSET`, `AV1_SET_COL_MV1_READ_BUF_OFFSET`, `AV1_SET_COL_MV2_READ_BUF_OFFSET`,
16:57 avhe[d]: `AV1_SET_COL_MVWRITE_BUF_OFFSET`, `SET_FILTER_BUFFER_OFFSET`, `SET_INTRA_TOP_BUF_OFFSET`, `AV1_SET_PROB_TAB_WRITE_BUF_OFFSET`, `AV1_SET_FILM_GRAIN_BUF_OFFSET`, `SET_NVDEC_STATUS_OFFSET`
17:12 dwlsalmeida[d]: IDK either, it shows up here from what I can tell
17:12 dwlsalmeida[d]: I will try dumping the picture params from the blob
17:12 dwlsalmeida[d]: maybe this will give us more info
18:56 airlied[d]: dwlsalmeida[d]: reading what from the GPU?
18:57 dwlsalmeida[d]: The updated prob context
18:57 dwlsalmeida[d]: Gotta take that, run some algorithms, and then feed it on the next frame
19:01 dwlsalmeida[d]: I was hoping to just run nvkmd_mem_map on the vkDeviceMemory
19:14 airlied[d]: So you have to serialise each frame submit?
19:15 airlied[d]: Could you run the algorithms in a compute shader?
19:20 tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1329168257344733339/Screenshot_20250116-004758.png?ex=67895be7&is=67880a67&hm=bb34a0968581d866f7eef7d66d84b7db0c936a5da9b4b83cc517f4d284bc18a3&
19:20 tiredchiku[d]: re: vkd3d-proton overhead compared to windows
19:20 tiredchiku[d]: I wonder if we'll run into a similar issue on NVK, or if it's more a driver thing than a hardware thing
19:20 tiredchiku[d]: seems to be a driver thing from what I can tell, in which case, I wonder if we can get nv to look into it :p
20:43 dwlsalmeida[d]: airlied[d]: at least for now, yeah.. :/
20:43 dwlsalmeida[d]: airlied[d]: I have no idea how to do that
20:43 dwlsalmeida[d]: maybe?
20:46 airlied[d]: do we have any traces of nvidia doing anything here?
20:48 gfxstrand[d]: Yeah, we really don't want to be ping-ponging between CPU and GPU if we can avoid it. How complicated is this algorithm?
20:54 ndufresne: dwlsalmeida[d]: remember that our nvidia insider told us the update is offloaded to the firmware, perhaps you missed something
20:55 dwlsalmeida[d]: hm, IIRC he said that they're parsing the h265 bitstream to get the skip value, and doing the vp9 prob updates in firmware
20:56 dwlsalmeida[d]: to make it clear:
20:56 dwlsalmeida[d]: a) this is AV1
20:56 dwlsalmeida[d]: b) the hardware expects a pointer to the probability buffer
20:57 dwlsalmeida[d]: BTW: at this point we all know this is hantro, so we can poke at the V4L2 driver for the same IP:
20:57 dwlsalmeida[d]: av1_dec->prob_tbl.cpu = dma_alloc_coherent(vpu->dev,
20:57 dwlsalmeida[d]: ALIGN(sizeof(struct av1cdfs), 2048),
20:57 dwlsalmeida[d]: &av1_dec->prob_tbl.dma,
20:57 dwlsalmeida[d]: GFP_KERNEL);
20:57 dwlsalmeida[d]: if (!av1_dec->prob_tbl.cpu)
20:57 dwlsalmeida[d]: return -ENOMEM;
20:57 dwlsalmeida[d]: av1_dec->prob_tbl.size = ALIGN(sizeof(struct av1cdfs), 2048);
20:57 dwlsalmeida[d]: av1_dec->prob_tbl_out.cpu = dma_alloc_coherent(vpu->dev,
20:57 dwlsalmeida[d]: ALIGN(sizeof(struct av1cdfs), 2048),
20:57 dwlsalmeida[d]: &av1_dec->prob_tbl_out.dma,
20:57 dwlsalmeida[d]: GFP_KERNEL);
20:58 dwlsalmeida[d]: which map to:
20:58 dwlsalmeida[d]: Method 0x004f (0x20018190): type 1, size 1, subchannel 4, reg 0x00000640 (NVC7B0_AV1_SET_PROB_TAB_READ_BUF_OFFSET)
20:58 dwlsalmeida[d]: 0x01200600
20:58 dwlsalmeida[d]: Method 0x0051 (0x20018191): type 1, size 1, subchannel 4, reg 0x00000644 (NVC7B0_AV1_SET_PROB_TAB_WRITE_BUF_OFFSET)
20:58 dwlsalmeida[d]: 0x012002b0
21:01 dwlsalmeida[d]: the entrypoints for the algorithms we have to run are:
21:01 dwlsalmeida[d]: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/media/platform/verisilicon/rockchip_vpu981_hw_av1_dec.c#L1125
21:01 dwlsalmeida[d]: and
21:01 dwlsalmeida[d]: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/media/platform/verisilicon/rockchip_vpu981_hw_av1_dec.c#L1160
21:01 dwlsalmeida[d]: which also call in the stuff at https://elixir.bootlin.com/linux/v6.12.6/source/drivers/media/platform/verisilicon/rockchip_av1_entropymode.c
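For orientation, a very rough sketch of the per-frame CDF bookkeeping those helpers do on the CPU; the names, struct layout, and simplifications below are illustrative, not the actual rockchip code:
```c
#include <string.h>

#define NUM_FRAME_CONTEXTS 8 /* AV1 keeps one saved CDF set per ref slot */

struct av1_cdf_state {
    void  *frame_context[NUM_FRAME_CONTEXTS]; /* saved CDFs per slot */
    size_t cdf_size;                          /* e.g. sizeof(struct av1cdfs) */
};

/* Before decoding a frame: seed the PROB_TAB_READ buffer with either the
 * default CDFs or the CDFs saved from the primary reference frame. */
static void
av1_seed_read_buf(const struct av1_cdf_state *s, void *read_buf,
                  int primary_ref_slot /* -1 == PRIMARY_REF_NONE */,
                  const void *default_cdfs)
{
    const void *src = primary_ref_slot < 0
                    ? default_cdfs
                    : s->frame_context[primary_ref_slot];
    memcpy(read_buf, src, s->cdf_size);
}

/* After the frame: save CDFs into every slot named in refresh_frame_flags.
 * Which CDFs the caller passes in (the hardware-updated ones from the
 * PROB_TAB_WRITE buffer, or the start-of-frame ones) depends on
 * disable_frame_end_update_cdf. */
static void
av1_save_frame_contexts(struct av1_cdf_state *s, const void *cdfs_to_save,
                        unsigned refresh_frame_flags)
{
    for (int i = 0; i < NUM_FRAME_CONTEXTS; i++) {
        if (refresh_frame_flags & (1u << i))
            memcpy(s->frame_context[i], cdfs_to_save, s->cdf_size);
    }
}
```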
21:03 dwlsalmeida[d]: airlied[d]: I don't think we can map these buffers in the tracer; it works by intercepting `mmap` calls, so if it's just a BO that's never been mapped on the CPU, I think it doesn't work, right avhe[d]?
21:03 airlied[d]: just because its hantro at the bottom, doesn't mean they don't do some stuff in their own fw
21:06 jannau: gfxstrand[d]: replicating the av1 cdf updates on the CPU would require parsing the whole frame bitstream including parsing coeffs
21:10 dwlsalmeida[d]: maybe if we just pass the addresses for the buffers the firmware will do its thing then? I assume that's what you're trying to say
21:13 avhe[d]: dwlsalmeida[d]: you could look for mapping/unmapping operations on the prob table buffer, but there's no telling if they just keep a cpu pointer alive at all times
21:14 avhe[d]: maybe you could try finding the cpu address to the prob table, set a hardware watchpoint somewhere inside, and see if you get any hits
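If you go that route, something like this in gdb should do it (the address is a made-up placeholder for wherever the prob table ends up mapped); note a hardware watchpoint only fires on CPU accesses, which is the point here since GPU/DMA writes won't trigger it:
```
(gdb) watch -l *(unsigned int *)0x7f1234560000
Hardware watchpoint 1: -location *(unsigned int *)0x7f1234560000
(gdb) continue
```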
21:14 dwlsalmeida[d]: avhe[d]: I remember trying to read from the intra_top buffer like we do for the scaling lists and others. It segfaults
21:15 airlied[d]: yes, I'd expect nvidia to deal with this on the fw side, and wouldn't worry about it until we prove otherwise
21:15 jannau: probabilities shouldn't be needed for parsing on the CPU, so if the decoder supports different read and write buffers and has some way to synchronize dependent frames, it might be possible to keep the probabilities in the decoder
21:15 avhe[d]: dwlsalmeida[d]: Also, for AV1 I did not see any sign of CPU-side prob update in the tegra code I checked... of course I might've missed something
21:15 avhe[d]: but for VP9 yes this is happening
21:15 gfxstrand[d]: I really doubt the proprietary driver is ping-ponging with the CPU. They might be doing something in firmware or maybe (unlikely) in a shader but I doubt they're doing anything interesting on the CPU.
21:16 airlied[d]: yes I think it would be a major hw design failure if they were
21:17 avhe[d]: Well for VP9, it only happens when backward updates are active which in my experience is quite rare. In fact I didn't implement this initially, because I never ran into it until I shipped the code and a user reported the problem
21:19 avhe[d]: dwlsalmeida[d]: I don't remember exactly how the nvidia.ko uapi works, maybe there is something you have to do before calling mmap
21:22 ndufresne: fun fact, the compressed headers are encrypted in the Widevine implementation, so without a firmware on the secure side to do it, you can't do probability updates ;-P
21:25 ndufresne: imho, parsing some bitstream on the CPU shouldn't be considered such a big deal though, can anyone explain why it's considered so bad?
21:25 ndufresne: (it has to be filled by CPU to start with)
21:27 airlied[d]: we don't have access to the bitstream in vulkan when recording the command buffers
21:27 airlied[d]: anywhere we end up parsing it is a failure of the vulkan API spec or of the hw impl
21:28 dwlsalmeida[d]: also, you can record the decode commands for a lot of frames before submitting; if you need a frame to be fully decoded to get e.g. symbol counts and stuff, that's a problem I suppose
21:32 dwlsalmeida[d]: airlied[d]: why did you include this in `radv_video.c`?
21:32 dwlsalmeida[d]: `#include "ac_vcn_av1_default.h"`
21:33 dwlsalmeida[d]: this contains the default probability tables and other stuff, but I don't see you using this anywhere?
21:34 airlied[d]: oh it's probably leftovers, initially that code wasn't shared, but I moved it all into code shared with radeonsi
21:51 gfxstrand[d]: dwlsalmeida[d]: did you ever make any progress on the rust vs C perf issues?
22:03 dwlsalmeida[d]: gfxstrand[d]: I managed to isolate the problem to downloading the image from the GPU
22:03 dwlsalmeida[d]: So..nothing rust related
22:04 gfxstrand[d]: Cool
22:04 dwlsalmeida[d]: A colleague is investigating further; I've been focusing on getting AV1 out