01:47dwlsalmeida[d]: yeah, apparently anything with more than one frame instantly errors out when submitting
01:48dwlsalmeida[d]: `nouveau/vulkan/nvk_queue_drm_nouveau.c:262] Code 0 : DRM_NOUVEAU_EXEC failed: No such device (VK_ERROR_DEVICE_LOST)`
01:49dwlsalmeida[d]: I'm eyeing this `NVC5B0_SET_PIC_SCRATCH_BUF_OFFSET` business
01:50dwlsalmeida[d]: I wonder if I should use this somehow,
01:50dwlsalmeida[d]: blob doesn't
01:53dwlsalmeida[d]: airlied[d]: what's the deal with P frames?
02:01airlied[d]: need to work out the DPB
02:03airlied[d]: likely involves programming the SET_PICTURE_* offsets
02:03airlied[d]: but would likely need a trace of a second frame
02:07dwlsalmeida[d]: yeah the DPB is right actually, first thing I checked
02:08dwlsalmeida[d]: but regardless, for the SET_PICTURE_OFFSET stuff, any nvk_image address should theoretically work
02:09dwlsalmeida[d]: ` nvh264->dpb[i].col_idx = idx;` I wonder where this buffer is actually allocated
02:19airlied[d]: the app creates the DPB allocation and you get in the slot setup
05:54asdqueerfromeu[d]: dwlsalmeida[d]: So could JPEG decoding work fine? 🖼️
07:47avhe[d]: jpeg de/encoding go through a separate engine (nvjpg) that consumer discrete cards don't have
07:54ahuillet: what is HW accelerated jpg decoding useful for these days?
08:14avhe[d]: nvidia advertises it for machine learning applications
11:29mohamexiety[d]: yeah I remember coming across it and being confused. from what I understood it's mostly intended for machine learning stuff, to help feed images in quickly, since for AI workloads you deal with _a lot_ of them
11:53avhe[d]: exactly
11:53avhe[d]: from the A100 product brief:
11:53avhe[d]: > **NVJPG Decode for DL Training**
11:53avhe[d]: > The A100 GPU adds a new hardware-based JPEG decode feature. One of the fundamental issues in achieving high throughput for DL training / inference for images is the input bottleneck of JPEG decode. CPUs and GPUs are not very efficient for JPEG decode due to the serial operations used for processing image bits. Also, if JPEG decode is done in the CPU, PCIe becomes another bottleneck.
11:53avhe[d]: > A100 addresses these issues by adding a hardware JPEG decode engine. A100 includes a 5-core hardware JPEG decode engine called NVJPG. Applications can batch images into chunks of up to five images and pass onto NVJPG for processing. These images can be of heterogeneous sizes, though for best performance, images of similar sizes should be batched together wherever possible.
11:55avhe[d]: i don't know if that's still the case in the revised nvjpg inside the A100, but on the tegra210 nvjpg can also be programmed to perform yuv->rgb conversion with programmable coefficients (so, removing the need to pipe the output through another accelerator, or to do the conversion on the cpu)
12:04dwlsalmeida[d]: avhe[d]: can you explain what you mean by scratch buffer in your FFmpeg code ?
12:04dwlsalmeida[d]: (Or scratch ref IIRC)
12:06dwlsalmeida[d]: I was looking at the output of your tracer, there’s a section that says “buffers mapped” or something, and I noticed that one of the frames appears twice
12:15avhe[d]: if you're talking about h264, scratch ref is what i call the frame data that gets bound to empty dpb slots
12:16avhe[d]: you probably noticed that each `NVC5B0_SET_PICTURE_(LUMA|CHROMA)_OFFSET\d` is written to, even when decoding a single frame
12:18avhe[d]: dwlsalmeida[d]: that would be the "scratch ref" probably
12:22avhe[d]: off the top of my head, for I-frames the scratch ref will be the frame itself, then what i observed for h264 is that they seem to use the frame_num, but the dpb management code in the library i reversed (TVMR) was extremely painful, so i never bothered to understand the logic in full
12:58dwlsalmeida[d]: Yeah I noticed that all slots are written
13:09dwlsalmeida[d]: Is there a particular meaning to each slot?
13:09dwlsalmeida[d]: I think that is given by SET_CURRENT_PICTURE_INDEX
13:11dwlsalmeida[d]: so luma[0] and chroma[0] should contain the YUV for whatever picture called SET_CURRENT_PICTURE_INDEX(0) and so on
13:12dwlsalmeida[d]: also, I see that in your code you have a single allocation for mbhist, coloc, history and picture data, is that needed?
13:12dwlsalmeida[d]: these are all separate allocations here
13:13avhe[d]: dwlsalmeida[d]: the current frame location is indicated by CurrPicIdx/CurrColIdx in the setup structure
13:13avhe[d]: mind that once a frame is attributed a given slot, it must always be bound to that one until it gets dropped from the dpb
13:14avhe[d]: dwlsalmeida[d]: no, it's just an optimization
13:15avhe[d]: in my implementation, i separate per-frame buffers (setup, cmdbuf, etc) from common ones (coloc, mbhist, etc)
13:15avhe[d]: per-frame buffers are allocated from a pool, while the common buffer is always the same for a decoder instance
13:16avhe[d]: since allocation/mappings need to be page-aligned in size and address, you'd waste some memory by doing per-buffer allocation
13:18dwlsalmeida[d]: if you have the time later, would you be able to take a quick look here?
13:18dwlsalmeida[d]: https://gitlab.freedesktop.org/dwlsalmeida/mesa/-/blob/nvk-vulkan-video-dave/src/nouveau/vulkan/nvk_video.c?ref_type=heads#L425
13:18dwlsalmeida[d]: that's where the picture parameter is set up
13:18dwlsalmeida[d]: I am testing this on videos with only two frames
13:19dwlsalmeida[d]: they come out exactly equal, and we get a failure in a submit somewhere,
13:19dwlsalmeida[d]: I have traces for both the blob and mesa, and they match 100%, let me find them
13:22dwlsalmeida[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1303348641188286565/log_blob.txt?ex=672b6d83&is=672a1c03&hm=8706cd6efbc1b75acf02925c2fb591cbda1692701f232cd80458ce0b98af837c&
13:22dwlsalmeida[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1303348641519374346/log_mesa.txt?ex=672b6d83&is=672a1c03&hm=68d6c8e6f09e5ddcd723176effd9f9d2e0ff2c1b05df0fe9b60906a3ea4eb036&
13:22dwlsalmeida[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1303348641855176714/push_blob.log?ex=672b6d83&is=672a1c03&hm=423c61e2e5b619015c8af0dda5b09ec7fb878e37807ddc5693f7601db8583344&
13:22avhe[d]: the failure is reported in the nvdec_status_s structure? if not, it's probably not related to your codec setup, in my experience
13:22dwlsalmeida[d]: I found nvdec_status_s to be mostly useless
13:23dwlsalmeida[d]: there's no header file containing the #defines for the error codes
13:25avhe[d]: unfortunately not, but it still hints at whether the failure happened in the nvdec microcode/hardware, or whether the submission was handled wrong (in which case it should be 0)
13:27dwlsalmeida[d]: do you know whether there's any particular meaning to the order of the array entries here?
13:27dwlsalmeida[d]: `nvdec_dpb_entry_s dpb[16];`
13:28dwlsalmeida[d]: it asks for the index and col_index, so it shouldn't IMHO
13:28dwlsalmeida[d]:
```c
typedef struct _nvdec_dpb_entry_s // 16 bytes
{
    NvU32 index   : 7; // uncompressed frame buffer index
    NvU32 col_idx : 5; // index of associated co-located motion data buffer
```
13:30avhe[d]: dwlsalmeida[d]: yeah, same as PicIdx/ColIdx, once a frame has been allocated an index, it should get bound to the same one for its entire lifetime
13:31avhe[d]: if you check my code, each frame has its own little structure containing pic_idx (which is written to index/col_idx) and dpb_idx (the index in the dpb array)
13:33avhe[d]: PicIdx/ColIdx can be the same (they are in nvidia's tegra code), but the dpb index needs to be decoupled, because PicIdx is allocated on decode, and dpb_idx when it gets used as a reference (usually one frame later)
13:34avhe[d]: there is a comment in my code explaining this in a bit more detail
13:35avhe[d]: but typically if you mismanage the dpb, you'll get a corrupted frame, not an error
13:40dwlsalmeida[d]: ^ yeah, that's my point from yesterday:
13:40dwlsalmeida[d]: > but regardless, for the SET_PICTURE_OFFSET stuff, any nvk_image address should theoretically work
13:40dwlsalmeida[d]: so long as it's a valid address, it should be good
13:40dwlsalmeida[d]: in terms of not crashing, but it may be corrupted if you mess them up of course
13:48dwlsalmeida[d]: in my setup with two frames,
13:48dwlsalmeida[d]: frame 0 takes `frame_num == CurrPicIdx == CurrPicColIdx == 0`, with `SET_PICTURE_INDEX(0)`, `SET_LUMA_OFFSET0(<VkImage of Pic 0>)`, `SET_CHROMA_OFFSET0(<VkImage of Pic 0>)`
13:48dwlsalmeida[d]: frame 1 then adds frame 0 to `dpb[0]`, with `index==col_idx==0`, and sets
13:48dwlsalmeida[d]: `frame_num == CurrPicIdx == CurrPicColIdx == 1`, with `SET_PICTURE_INDEX(1)`, `SET_LUMA_OFFSET0(<VkImage of Pic 0>)`, `SET_CHROMA_OFFSET0(<VkImage of Pic 0>)`, `SET_LUMA_OFFSET1(<VkImage of Pic 1>)`, `SET_CHROMA_OFFSET1(<VkImage of Pic 1>)`
13:49dwlsalmeida[d]: I have two calls to SET_OBJECT for C5B0 though, I wonder if this clears any context in the GPU
13:54avhe[d]: it looks like the blob only uses this during initial channel setup
13:54avhe[d]: i'm not so familiar with gpfifo stuff, since nvdec on tegra is managed from a different engine (host1x)
14:23dwlsalmeida[d]: gfxstrand[d]: are VkImages always resident?
14:24dwlsalmeida[d]: if you pass an address as a reference frame, and that is not valid in the GPU anymore, that would be a problem
14:24gfxstrand[d]: Yes.
14:24gfxstrand[d]: You don't need to worry about residency.
14:24gfxstrand[d]: Not unless the app is busted anyway
14:25dwlsalmeida[d]: can I map them into some CPU address within NVK?
14:26dwlsalmeida[d]: so that I can dump what's in them
14:39mohamexiety[d]: dwlsalmeida[d]: if it's `HOST_VISIBLE` you should be able to use `nvkmd_mem_map()` on the memory region
14:50dwlsalmeida[d]: mohamexiety[d]: can you force this? i.e.: force HOST_VISIBLE on allocation?
14:53mohamexiety[d]: dwlsalmeida[d]: I don't fully remember the exact entry point you'd be looking for here (either `create_image` or `BindImageMemory`) but yeah you should be able to always set the bit if you change things in `nvk_image.c`
14:54dwlsalmeida[d]: yeah, I'll add a debug option to dump all these images if you're doing video stuff
14:54dwlsalmeida[d]: we shouldn't rely on what GStreamer is showing
20:27avhe[d]: dwlsalmeida[d]: i took a cursory look and didn't see anything glaringly wrong, except stuff that's missing of course
20:27avhe[d]: just one comment, NVC5B0_SET_PICTURE_INDEX is typically just a monotonically increasing integer, i don't think it serves any purpose other than helping debugging
20:28dwlsalmeida[d]: avhe[d]: avhe[d] what is missing you mean?
20:28avhe[d]: well dpb management mostly
20:29avhe[d]: i don't know if you can rely on vk apps feeding you frames at fixed indices
20:29dwlsalmeida[d]: wait what, I was under the impression that was pretty much done
20:30avhe[d]: well i'm not familiar with the vk_video spec, if frames in frame_info->pReferenceSlots are stably bound and so is their slotIndex, then it should work
20:31dwlsalmeida[d]: yeah that's the idea
20:31dwlsalmeida[d]: you bind a VkImage with a slot index, and that remains active until you bind some other VkImage to that same index
20:32avhe[d]: that makes your life much easier then
20:33avhe[d]: there is also interlacing-related stuff that's missing (top/bottom_field_marking in the nvdec dpb array)
20:34dwlsalmeida[d]: I am feeding two frames of progressive content
20:34dwlsalmeida[d]: so this shouldn't apply (for now)
20:35avhe[d]: yeah
22:14dwlsalmeida[d]: I just noticed that the problem apparently is not the pushbuf
22:15dwlsalmeida[d]:
```c
static void
gst_vulkan_device_finalize (GObject * object)
{
  GstVulkanDevice *device = GST_VULKAN_DEVICE (object);
  GstVulkanDevicePrivate *priv = GET_PRIV (device);

  if (device->device) {
    vkDeviceWaitIdle (device->device);   /* <-------- */
    vkDestroyDevice (device->device, NULL);
  }
```
22:15dwlsalmeida[d]: when GStreamer is tearing down the pipeline, it calls vkDeviceWaitIdle, and this is returning -ENODEV internally
22:32airlied[d]: might be a bug hooking up the video queue fences
23:19dwlsalmeida[d]: ^ any idea why this would not manifest with a single frame?
23:19dwlsalmeida[d]: maybe because a single frame doesn't have to wait on anybody else?
23:20airlied[d]: it might do some other operation after all frames that fixes the fences?
23:25airlied[d]: though you'd probably get that if the channel is killed
23:41dwlsalmeida[d]: I always get this on dmesg:
23:41dwlsalmeida[d]:
```
[  687.345483] nouveau 0000:01:00.0: gst-launch-1.0[4627]: channel 32 killed!
[  706.573649] nouveau 0000:01:00.0: gsp: rc engn:00000013 chid:32 type:68 scope:1 part:233
[  706.573656] nouveau 0000:01:00.0: fifo:6606c307:0004:0020:[gst-launch-1.0[4805]] errored - disabling channel
```
23:41dwlsalmeida[d]: Even when it's only one frame, and the frame is actually decoded, and in these cases, we don't get ENODEV in vkDeviceWaitIdle
23:42dwlsalmeida[d]: I am starting to think this is not a problem ^
23:43airlied[d]: yeah that is killing the channel
23:46airlied[d]: 68 is ROBUST_CHANNEL_NVDEC0_ERROR
23:46airlied[d]: which is informative and useless at the same time to know