12:08tomeu: phh: btw, I can confirm that the buffer pointed to by REG_DPU_RDMA_RDMA_BS_BASE_ADDR contains the biases
12:09phh: Ah bs is biases, cool
15:42tomeu: so, I got to that dreaded point in REing in which sending bit-identical buffers to the HW results in a hang
15:42tomeu: phh: any ideas? https://paste.debian.net/1309916/
15:43tomeu: this is what the kernel prints:
15:43tomeu: [ 29.149629] RKNPU: job: ffff00010210ab00, wait_count: 1, continue wait: 0, commit elapse time: 6131398us, wait time: 6131402us, timeout: 6000000us... (full message at <https://matrix.org/_matrix/media/v3/download/matrix.org/jKDPPDnOUoNYlOWwDDLvfDbd>)
15:44tomeu: that causes a hang that requires rebooting the machine
15:44phh: is it possible you omit some leading 0 0 0 after last instr?
15:44phh: it's possible 0 0 0 instruction has some magical meaning to pc
15:45tomeu: could be, indeed, that was causing trouble in the VSI NPU
15:45tomeu: I haven't paid attention to padding yet, and my scripts trim the zeroes from the start and end of the buffers
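
A minimal sketch of how a dump script could record the padding it trims instead of silently dropping it, so a replay can restore any trailing zero words after the last instruction. The helper names and the 32-bit word size are assumptions for illustration; neither comes from the driver or from the scripts discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the all-zero words at the end of a buffer. */
static size_t count_trailing_zero_words(const uint32_t *buf, size_t n_words)
{
    assert(buf != NULL || n_words == 0);

    size_t count = 0;
    while (count < n_words && buf[n_words - 1 - count] == 0)
        count++;
    return count;
}

/* Hypothetical helper: number of words a zero-trimming script would keep.
 * The difference to n_words is the padding that a replay has to re-add. */
static size_t trimmed_length(const uint32_t *buf, size_t n_words)
{
    return n_words - count_trailing_zero_words(buf, n_words);
}
```

Storing both the trimmed length and the original length in the trace would let a replayed buffer match the original one byte for byte, padding included.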
15:55tomeu: no luck so far
15:56phh: why is the input all -?
15:57tomeu: because the blob has zeroes in what I think is the input buffer
15:57tomeu: REG_CNA_FEATURE_DATA_ADDR
15:57tomeu: haven't figured out why yet
15:58phh: you didn't miss any cpu2device memory barrier?
15:58tomeu: not that I know, but I guess I should be dumping other ioctls besides submit
15:59phh: well my test code does ret = ioctl(fd, DRM_IOCTL_RKNPU_MEM_SYNC, &m_sync);
15:59phh: of all gems basically
15:59phh: with RKNPU_MEM_SYNC_TO_DEVICE
15:59tomeu: yeah, I do that on unmap, after writing
15:59phh: ah you unmap. you don't close though?
15:59tomeu: nope
15:59phh: fwiw I don't unmap
16:00tomeu: ah, I don't really unmap either, I just do that in the unmap callback in gallium
16:00tomeu: I only sync there
16:00tomeu: all my BOs have RKNPU_MEM_KERNEL_MAPPING, wonder if that could ever be a problem
16:00tomeu: hmm, guess it could
16:01phh: I don't really see how, but yeah, the only kernel_mapping i have is for task_obj_addr
16:01tomeu: and that one isn't accessed by the HW, so I guess it could be it
16:08phh: yeah it's reasonable it's the only one with kernel mapping, it's just that I fail to see why having a kernel mapping would break this
16:08tomeu: haven't checked, but I would imagine their driver skipping something needed for HW-accessible buffers
16:09tomeu: doesn't seem to be it though :(
16:20phh: you're using three cores, fwiw my test code only used one
16:20phh: so
16:20phh: .subcore_task = {
16:20phh: // Only use core 1, nothing for core 2/3
16:20phh: {
16:20phh: .task_start = 0,
16:20phh: .task_number = 1,
16:20phh: }, { 0, 0}, {0, 0},
16:20phh: },
16:21phh: and .core_mask = 1,
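
The single-core setup above can be written out as a sanity check. This is only a sketch: the structs are a trimmed-down, hypothetical mirror of the fields named in this conversation (not the real driver header), and the rules checked here (total task count equals the sum over subcores, each subcore in use has its bit set in core_mask) are assumptions, not something the driver is known to enforce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_SUBCORES 3

/* Hypothetical mirror of the per-core task range from the snippet above. */
struct subcore_task {
    uint32_t task_start;
    uint32_t task_number;
};

/* Hypothetical mirror of the submit arguments; only fields named in the chat. */
struct submit_args {
    uint32_t task_number;     /* total number of tasks in the job */
    uint64_t task_base_addr;  /* base address, 0 in tomeu's paste */
    uint32_t core_mask;       /* assumed: bit i selects subcore i */
    struct subcore_task subcore_task[N_SUBCORES];
};

/* Returns 1 if the per-core task counts agree with task_number and core_mask. */
static int submit_is_consistent(const struct submit_args *args)
{
    assert(args != NULL);

    uint32_t total = 0;
    for (size_t i = 0; i < N_SUBCORES; i++) {
        uint32_t n = args->subcore_task[i].task_number;
        if (n != 0 && !(args->core_mask & (1u << i)))
            return 0; /* tasks assigned to a core that is not enabled */
        total += n;
    }
    return total == args->task_number;
}
```

Under these assumptions, a job with one task on the first core and core_mask = 1 passes, while the same job with a total task_number of 3 does not.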
16:22phh: I also have task_base_addr = instr_dma (so phys address of task)
16:25phh: I don't understand how task_base_addr can be null, because driver does
16:25phh: 274: REG_WRITE(args->task_base_addr, RKNPU_OFFSET_PC_DMA_BASE_ADDR);
17:07tomeu: I don't think I'm using 3 cores, I checked in the kernel, and rknpu_job_subcore_commit_pc gets called only for core 0
17:07tomeu: why do you say that task_base_addr is NULL?
17:09tomeu: I think next I should add the rest of the ioctls to the trace logs, and probably also all register writes that the kernel driver does
17:10phh: well in your pastebin ` .task_base_addr = 0x0,`
17:10phh: which is null in my book
17:12tomeu: ah, I think it's correct for that to be NULL
17:12tomeu: For feature DMA, weight DMA, DPU DMA, PPU DMA, the address
17:12tomeu: is set as offset address. Final address appear on AXI bus is base
17:12tomeu: address + offset address.
17:12tomeu: wonder why they add this kind of functionality
17:13phh: dirt-cheap dirt-broken iommu?
17:13tomeu: ah, makes sense
17:13tomeu: this is supposed to work without an iommu as well
17:13phh: not really -_-'
17:13phh: especially since it's still only 32bits
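
The addressing rule quoted from the documentation can be sketched as follows. The function name is hypothetical, and the 32-bit width with wrap-around on overflow is an assumption based on the "still only 32bits" remark, not a documented behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Final address on the AXI bus as described in the quoted text:
 * base address + offset address. Unsigned 32-bit arithmetic is assumed. */
static uint32_t axi_address(uint32_t base_addr, uint32_t offset_addr)
{
    return base_addr + offset_addr;
}

/* With a base address of 0, the offset written into the registers is
 * already the final bus address, which is why a zero task_base_addr
 * can be valid. */
static void demo_zero_base(void)
{
    assert(axi_address(0x0u, 0xfff00000u) == 0xfff00000u);
}
```

Conversely, setting the base to the DMA address of a buffer while also programming absolute addresses as offsets would add the two together and point the hardware somewhere else entirely.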
17:14tomeu: well, the whole memory management is bonkers
17:14tomeu: guess the SW is written by HW engineers
17:14phh: I don't understand why I have `tasks[0].regcmd_addr = instr_dma` `tasks[0].regcmd_addr = instr_dma` and result was working iirc...
17:15phh: that must be wrong, I agree that task_base_addr = instr_dma would just break everything according to the documentation
17:20tomeu: tasks[0].regcmd_addr = instr_dma is ok, right?
17:20tomeu: task_base_addr = instr_dma sounds wrong
17:20phh: yeah I agree it sounds very wrong
17:20tomeu: but maybe it is disabled by some bit
17:20phh: maybe my memory is faulty and I never managed to get an IRQ back...
17:21tomeu: but Jasbir got it working, he was comparing the contents of the output BO
17:24phh: did you send RKNPU_ACT_RESET?
17:24tomeu: yep
17:24tomeu: https://gitlab.freedesktop.org/tomeu/mesa/-/tree/rocket?ref_type=heads
17:25tomeu: have pushed here
17:25phh: I don't understand .task_number = 3, your instructions only have one task
17:25tomeu: I will add ioctls and register writes to the traces
17:25tomeu: and see
17:25tomeu: fortunately, there are far fewer ioctl calls and registers than for VSI... shouldn't be too bad