07:27 tomeu: phh: by next week I should get my first HW with Rockchip's NPU, wonder if you have any notes about registers, buffer formats, etc. so I don't redo unnecessary work
07:28 phh: I have nothing better than the TRM really
07:30 tomeu: phh: that is Rockchip RK3588 TRM V1.0-Part1-20220309.pdf and part 2?
07:31 phh: Part2 iirc, but yes
07:32 tomeu: hmm, I'm going through the ToC and cannot find the NPU
07:32 tomeu: do you remember which chapter all that is under?
07:35 phh: k sorry it's part 1 chapter 36 called RKNN
07:36 tomeu: awesome, thanks :)
10:04 phh: tomeu: I'll just give some information that would have helped me understand this thing (you probably already understood it, but telling is cheap): this is not a general-purpose processing unit, it's only fixed-pipeline hardware. It has the "RKNN_pc" IP to make the exchanges between the various pipelines more seamless from the kernel point of view, but I personally think it's easier to ignore it at first and use the fixed pipeline directly.
10:04 phh: RKNN_pc is really just a DMA from RAM to the RKNN_{cna,core,dpu,ppu,ddma,sdma} registers, with feedback from the HW component to know when to switch to the next step. So I think the easiest way to start is by wiring the DPU with its source and destination as RAM and forgetting about every other IP (I think that with the DPU used this way you can only do trivial operations like an add; a proper matmul would require the CNA)
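A minimal sketch of the "skip RKNN_pc, poke the DPU directly" idea above, assuming a memory-mapped register window. Every name and offset below (DPU_SRC_BASE, DPU_DST_BASE, DPU_OP, DPU_ENABLE, DPU_STATUS, dpu_run_once) is a hypothetical placeholder; the real layout is in TRM part 1 chapter 36.

    #include <stdint.h>

    enum {                      /* hypothetical offsets, illustration only */
        DPU_SRC_BASE = 0x4000,  /* where the DPU reads its input in RAM    */
        DPU_DST_BASE = 0x4004,  /* where it writes the result back to RAM  */
        DPU_OP       = 0x4008,  /* e.g. some "element-wise add" opcode     */
        DPU_ENABLE   = 0x400c,
        DPU_STATUS   = 0x4010,
    };

    static void wr32(volatile uint32_t *base, uint32_t reg, uint32_t val)
    {
        base[reg / 4] = val;
    }

    /* Kick one trivial RAM-to-RAM operation and busy-wait for completion;
     * real code would map the register window and take the interrupt. */
    void dpu_run_once(volatile uint32_t *npu, uint32_t src, uint32_t dst)
    {
        wr32(npu, DPU_SRC_BASE, src);
        wr32(npu, DPU_DST_BASE, dst);
        wr32(npu, DPU_OP, 1);
        wr32(npu, DPU_ENABLE, 1);
        while (!(npu[DPU_STATUS / 4] & 1))
            ;
    }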
10:06 tomeu: hmm, so if RKNN_pc isn't programming the DPU registers, who does that?
10:06 tomeu: the kernel directly?
10:06 tomeu: and yeah, please pass on any information you think could be useful :)
10:07 phh: as far as I can tell, CNA is a hardware convolution pipeline (well, CNA as seen from the registers; going by the TRM it's just a controller for the MAC array), DPU is an element-wise operator (as in, apply an operation per value in the vector; there is also a max, though I'm not sure how it works), and PPU is more akin to a blitter
10:09 tomeu: if I compare to VSI's NPU, I would say that the CNA is a systolic array for matmuls, plus controlling logic, and the DPU is another set of systolic arrays plus logic, but wired for tensor transformations
10:12 phh: ah, looks like what I implemented myself is actually using pc
10:12 tomeu: VSI also has something similar, they call it FE (front end)
10:13 tomeu: it has a command for writing a register, plus others for triggering jobs in different units, synchronization, etc.
10:13 tomeu: and facilities for looping
10:13 tomeu: is the ping-pong reference in the kernel about looping in the PC thing?
10:15 phh: https://github.com/phhusson/rknpu-reverse-engineering/blob/main/instrs.h ; https://github.com/phhusson/rknpu-reverse-engineering/blob/main/hello2.c ; the INSTR macro (seen in hello2.c) uses the format RKNN_pc uses: the first argument "TGT" "describes" the targeted IP (one value means cna, one means dpu, etc. IIRC there are also flags to say which RKNN core to target; it is technically redundant with another value), "value" is the value of the register to
10:15 phh: write, and "reg" is the address of the register as described in the TRM. So the first line reads as "write 0xe to register 0x1004 of the CNA IP", which is the RKNN_cna_s_pointer register in the TRM
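A hedged reconstruction of that instruction format in C. Only the three fields (value, register address, target IP) and the 0xe / 0x1004 / CNA example come from the chat; the struct layout and the TGT_* values are assumptions, and instrs.h plus the RKNN_pc_register_amounts documentation are authoritative.

    #include <stdint.h>

    enum rknn_tgt {      /* assumed encodings; one value per target IP  */
        TGT_CNA = 0x1,
        TGT_DPU = 0x2,
        /* ... ppu, ddma, sdma, plus core-select flags per the TRM      */
    };

    struct rknn_instr {
        uint32_t value;  /* value to write into the register            */
        uint32_t reg;    /* register offset as listed in the TRM        */
        uint32_t tgt;    /* which IP (and which RKNN core) is targeted  */
    };

    /* "write 0xe to register 0x1004 of the CNA IP" (RKNN_cna_s_pointer) */
    static const struct rknn_instr first = { 0xe, 0x1004, TGT_CNA };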
10:17 phh: tomeu: I'd need to check again, but yes, I think the idea of ping-pong is that you chain multiple operations: Linux pings RKNN; RKNN_pc writes a set of instructions, waits for the DPU/CNA/... interrupt saying it's finished, sends a pong to Linux, and Linux pings RKNN_pc again to switch to the next operation
10:17 phh: it's possible that ping_pong implies it's done fully in hardware without going through Linux, but I don't think so
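Roughly, the control flow being described, as a sketch. The helper names (pc_load_instructions, pc_ping, wait_for_irq) are invented; only the ping / interrupt / pong / ping-again loop comes from the description above.

    #include <stddef.h>

    struct step;                           /* one RKNN_pc instruction list */
    struct job { size_t n_steps; struct step **steps; };

    /* hypothetical helpers standing in for the real kernel plumbing */
    void pc_load_instructions(struct step *s);
    void pc_ping(void);                    /* tell RKNN_pc to run the step */
    void wait_for_irq(void);               /* DPU/CNA/... completion IRQ   */

    void run_job(struct job *j)
    {
        for (size_t i = 0; i < j->n_steps; i++) {
            pc_load_instructions(j->steps[i]);
            pc_ping();                     /* Linux pings RKNN_pc          */
            wait_for_irq();                /* pong: step done, ping again  */
        }
    }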
10:18 phh: you can find the format of those "instructions" in the documentation of RKNN_pc_register_amounts, but IIRC there were one or two bits used by the proprietary libs that don't appear in the TRM
10:19 phh: tomeu: is that VSI documentation public? Maybe I can try to check whether the concepts you mention indeed match. Based on what you say, yes
10:33 tomeu: ah no, I didn't have any documentation, all was reverse engineered
10:33 phh: ok
10:33 tomeu: but they leaked a lot of info via debug logging
10:33 phh: I was hoping for the meaning of "systolic array" ^^
10:33 tomeu: ah, there is a lot published about that, one sec
10:34 tomeu: https://qengineering.eu/google-corals-tpu-explained.html
10:34 tomeu: this is a good one
10:34 tomeu: and you can find a lot of publications in arxiv.org
10:36 phh: oh, I didn't expect TPUs to be hard-wired IPs, thanks
10:37 tomeu: yeah, if something is programmable, then there is going to be a bottleneck when reading each instruction
10:37 tomeu: and also when each instruction reads its operands and writes its results
10:38 tomeu: with systolic arrays, data flows through the array until the operation is complete
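A toy cycle-level model of that data flow (not modelled on any particular NPU): an output-stationary N by N systolic array computing C = A*B, where each PE does one multiply-accumulate per cycle and only talks to its neighbours, with no instruction fetch or per-operand memory traffic.

    #include <stdio.h>

    #define N 3

    int main(void)
    {
        int A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        int B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
        int C[N][N] = {{0}};       /* accumulator held inside each PE    */
        int a_in[N][N] = {{0}};    /* A operand sitting at PE(i,j) now   */
        int b_in[N][N] = {{0}};    /* B operand sitting at PE(i,j) now   */

        /* 3N-2 cycles lets the skewed wavefronts cross the whole array */
        for (int t = 0; t < 3 * N - 2; t++) {
            /* shift: A operands move one PE right, B operands one PE down */
            for (int i = N - 1; i >= 0; i--)
                for (int j = N - 1; j >= 0; j--) {
                    a_in[i][j] = (j == 0) ? 0 : a_in[i][j - 1];
                    b_in[i][j] = (i == 0) ? 0 : b_in[i - 1][j];
                }
            /* feed the edges, skewed so row i / column j lag by i / j cycles */
            for (int i = 0; i < N; i++)
                if (t - i >= 0 && t - i < N)
                    a_in[i][0] = A[i][t - i];
            for (int j = 0; j < N; j++)
                if (t - j >= 0 && t - j < N)
                    b_in[0][j] = B[t - j][j];
            /* every PE does one MAC per cycle; no memory traffic at all */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += a_in[i][j] * b_in[i][j];
        }

        for (int i = 0; i < N; i++, puts(""))
            for (int j = 0; j < N; j++)
                printf("%4d", C[i][j]);    /* prints A*B */
        return 0;
    }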
10:39 tomeu: I'm reading that NVDLA is supposed to be SIMD instead of an array, but I guess it just means that the instructions are really high level, and are implemented with systolic arrays as well
10:41 tomeu: hmm, apparently NVDLA uses an adder tree for the matmuls, instead of a systolic array: https://charleshong3.github.io/projects/nvdla_v_gemmini.pdf
10:41 tomeu: but it seems to be the same in terms of memory accesses (the data remains inside the array/tree for the duration of the whole operation)
10:43 tomeu: but I guess that difference isn't exposed to us
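For contrast, a software model of the adder-tree style attributed to NVDLA above: N multipliers feed log2(N) adder levels, so one whole dot product is reduced per pass instead of streaming partial sums through a grid. Sizes and values are arbitrary.

    #include <stdio.h>

    #define N 8                          /* lane count; a power of two */

    int main(void)
    {
        int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        int w[N] = {8, 7, 6, 5, 4, 3, 2, 1};
        int lvl[N];

        /* multiplier stage: N products, all in parallel in hardware */
        for (int i = 0; i < N; i++)
            lvl[i] = a[i] * w[i];

        /* log2(N) adder levels, each halving the partial-sum count */
        for (int n = N; n > 1; n /= 2)
            for (int i = 0; i < n / 2; i++)
                lvl[i] = lvl[2 * i] + lvl[2 * i + 1];

        printf("dot = %d\n", lvl[0]);    /* prints 120 */
        return 0;
    }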
13:17 tomeu: phh: btw, have you gone through http://nvdla.org/hw/v1/hwarch.html ? I think at least three quarters of it will be applicable to most NPUs out there
13:28 phh: nope, I'll look thanks
14:24 tomeu: phh: ok, found it: http://nvdla.org/hw/v1/hwarch.html#ping-pong-synchronization-mechanism :)
14:25 phh: lol great
14:25 tomeu: btw, the userspace stack for NVDLA is BSD-3, so I would be surprised if they hadn't reused that as well