14:56f_: tomeu: Seems like the bridge died last night, so you lost operator status. Just gave you back +o.
14:56f_: Consider enabling AUTOOP too.
15:42tomeu: cool, thanks, will give it a look
16:01 hallo1: Currently I'm pretty confused about why all other activation functions are significantly slower than ReLU on RKNPU2.
16:02 hallo1: For example, in my benchmark of Yolov5s, the model is 1.8x faster if every Sigmoid in the model is replaced by ReLU
16:02 phh: because only ReLU is done in hardware
16:03 phh: for other activations it goes to a fallback implementation
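A minimal sketch of the Sigmoid-to-ReLU swap hallo1 describes, assuming a PyTorch YOLOv5s model whose nn.SiLU/nn.Sigmoid activation modules are replaced in place before export; the torch.hub model name is illustrative, and the network would need fine-tuning afterwards since changing the activations changes its outputs:

```python
import torch
import torch.nn as nn

def replace_activations(module: nn.Module) -> None:
    """Recursively swap SiLU/Sigmoid activations for ReLU, in place."""
    for name, child in module.named_children():
        if isinstance(child, (nn.SiLU, nn.Sigmoid)):
            setattr(module, name, nn.ReLU(inplace=True))
        else:
            replace_activations(child)

# Assumption: pulling yolov5s from torch.hub; any nn.Module works the same way.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
replace_activations(model)
# The swapped model would then be fine-tuned and re-exported (e.g. to ONNX)
# so that only the hardware-supported ReLU remains in the graph.
```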
16:03 hallo1: If I understand correctly, the NPU has a LUT for other functions
16:03 hallo1: It shouldn't be that slow
16:06 phh: I haven't seen a LUT in the TRM, but I could have missed it. Either way, as I understand it, for the moment tomeu only wired things for MobileNet8
16:06 hallo1: Yeah, this was tested under the closed-source SDK
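For reference, a rough sketch of how such a per-layer benchmark might be run with Rockchip's RKNN-Toolkit2, the closed-source SDK mentioned above; the calls follow the toolkit's published examples, but the file names, target platform, and exact signatures should be treated as assumptions:

```python
from rknn.api import RKNN  # Rockchip RKNN-Toolkit2 Python API

rknn = RKNN()
rknn.config(target_platform='rk3588')        # assumption: RK3588-class NPU
rknn.load_onnx(model='yolov5s_relu.onnx')    # hypothetical exported model
rknn.build(do_quantization=False)
rknn.export_rknn('yolov5s_relu.rknn')
rknn.init_runtime(target='rk3588')           # run on the attached board
rknn.eval_perf()                             # prints per-layer timings
rknn.release()
```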
16:08 phh: ah ok. that section of the TRM is rather hard to read, but as I understand it only RELU (or rather "RELUX"...) is supported. Technically MUL, ADD and SUB are also listed, but hum, yeah
16:10 hallo1: There are some LUT configuration registers in the DPU part, as seen in the TRM.
16:12 phh: (+ abs + neg + floor + ceil)
16:13 phh: ah yeah you're right, I've always skimmed over those considering how hard they are to understand
16:20 phh: anyway, a LUT requires a memory access, which can't be done in just one clock cycle, so the performance loss is not too surprising
16:20 phh: with the exception of SRAM. It's possible that co-operation between the GPU and NPU could be useful, but even though Android's NNAPI allows that, it doesn't look like tflite does
16:22 f_: I hope they didn't quit because of the +o I set..