13:53mmediouni[m]: hello, I've been thinking about Hexagon specifically (and how it's programmable enough to wish for things like full PyTorch) - would that be a possible fit for this, or too far away from the "embedded" part of the spectrum?
13:54tomeu: Mohamed Mediouni: Hi! I think the Hexagon is a fair target. I'm working right now on a driver for the TI C7x DSP, and it is quite similar in architecture.
13:57mmediouni[m]: Qualcomm has their AI runtime (high-level, without custom kernel support, though supported by ExecuTorch) and a QAIC-only PyTorch eager-mode backend which is proprietary. On the kernel side, the upstream driver for SoC platforms is at drivers/misc/fastrpc.c instead of being part of DRM like the qaic one (same Hexagon device code shareable between the two though)
13:59tomeu: yeah, my hope is that vendors end up rebasing their frameworks on top of the DRM/accel drivers at some point.
14:00tomeu: or companies just ignore the vendors' stacks and use mainline :)
14:04mmediouni[m]: will have to start hacking then :) my fear is more about ending up with a stack that doesn't have PyTorch support (and apparently, one of the possible paths to that for programmable hardware, through an already-defined ABI, is to go PjRT...)
14:05tomeu: I already have an ExecuTorch branch, but the Meta people haven't shown much enthusiasm about having their code packaged in distros. E.g. they don't build a shared library that programs in Debian could link to.
14:06tomeu: Hopefully I will find some time soon to talk with them some more about it.
14:08mmediouni[m]: I'm thinking more along the lines of actual PyTorch rather than ExecuTorch; the latter doesn't really have tons of adoption
14:08tomeu: hmm, so far people have asked about tflite and executorch for edge AI use cases
14:11mmediouni[m]: for CNNs that makes sense; for LLMs and other "generative" use cases the ecosystem is split, with tflite and ExecuTorch not really used all that much
14:12tomeu: ExecuTorch seems to be better suited for those, but I haven't seen much traction for LLMs at the smaller end of the edge
14:12tomeu: I have it in my TODO to look at how LLMs are being used there, though, and see what should be done
14:13tomeu: Ideally, I think I would like to see graphs expressed in SPIRV and submitted via Vulkan. Then we could extend ollama's vulkan backend for that.
14:13tomeu: or other frameworks
14:13mmediouni[m]: Hexagon and AMDXDNA are in that oddball place where they're in things much bigger than embedded, yet are way too far away from GPUs for Vulkan
14:14tomeu: I think Vulkan should be fine. If you can decode video with it...
14:14tomeu: we would "just" need to have SPIRV operations at the ML framework level
14:17mmediouni[m]: Hexagon prior to the recently launched 8 Elite Gen 5 (and X2) has a 32-bit address space, so some tough decisions have to be made on placement instead of just being able to use a memory heap. And with the cherry on top that the DMA engine to the TCM has to be used instead of load/store instructions to get more memory bandwidth, which doesn't match Vulkan's design. XDNA goes further by relying exclusively on the DMA engine for weights access
14:18mmediouni[m]: and the new Hexagons still have a 32-bit memory address space for load/store; anything beyond 4GB can only be accessed through DMA...
14:20tomeu: guess that will be more of a problem for LLMs, right?
14:22mmediouni[m]: Yeah... although the 16KB program memory limit on xdna might impact other device kernel writers too
14:28tomeu: guess with LLMs we are going to have to partition the graph anyway, for more than one capacity limit
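A minimal sketch of the placement problem described above, assuming a 4 GiB direct load/store window with a DMA-staged path for everything beyond it; all constants and names here are illustrative, not taken from any Qualcomm or AMD SDK:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a 32-bit load/store reach, with DMA staging required
 * for anything that does not fit below it. */
#define DIRECT_WINDOW_BYTES (4ULL << 30)

enum placement { PLACE_DIRECT, PLACE_DMA_STAGED };

/* A tensor living at [offset, offset + size) in device memory can be read
 * with plain loads only if it sits entirely below the 4 GiB window;
 * otherwise the runtime has to stage it through the TCM with the DMA engine. */
static enum placement plan_tensor(uint64_t offset, uint64_t size)
{
    return (offset + size <= DIRECT_WINDOW_BYTES) ? PLACE_DIRECT
                                                  : PLACE_DMA_STAGED;
}

int main(void)
{
    /* An LLM-sized weight blob: 6 GiB of weights walked in 1 GiB slices. */
    const uint64_t total = 6ULL << 30, slice = 1ULL << 30;

    for (uint64_t off = 0; off < total; off += slice)
        printf("slice at %llu GiB: %s\n", (unsigned long long)(off >> 30),
               plan_tensor(off, slice) == PLACE_DIRECT ? "load/store"
                                                       : "DMA to TCM");
    return 0;
}
```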
14:29mmediouni[m]: random question, but how hard would it be to support non-power-of-two pointer sizes in Mesa?
14:29mmediouni[m]: amdxdna also needs that...
14:30tomeu: Mesa won't care about that
14:35mmediouni[m]: thank you, guess that I'll have to play around more and see what'd make the most sense to do for those bigger (oddball) machines
14:39tomeu: How big can Hexagon on the Arm SoCs go? TI's can get very big in automotive
14:41mmediouni[m]: iirc the one on the X2 Elite is an 80 TOPS unit, but the standalone Hexagon cards can go much bigger (~570 TOPS on the AI 100 Ultra)
14:46tomeu: cool, that sounds similar
18:12mmediouni[m]: And is lack of IEEE compliance (older Hexagon) considered a blocker, or can it be waived?
18:12mmediouni[m]: old qfloat is... not ideal
18:13mmediouni[m]: uploaded an image: For reference. (625KiB) < https://matrix.org/oftc/media/v1/media/download/AS9I_caajLoOIbFGU43i_TuujvVm0fqtyLwOW9bCIEWYRSRWEomv_CCEtEnyCg0samWoYWmlmNlEpAQl7qB-axdCeawV8stQAG1hdHJpeC5vcmcvSlR6ZHpFSFpnQ2RUR2FjRHBWdUZrZWZN >
18:17mmediouni[m]: (older meaning anything before the 8 Elite which was shipped late last year)
18:17tomeu: I think it should be fine. Most of the hardware in NPUs isn't mathematically exact.
21:08jhugo: There is work happening on hexagon already, just slowly
21:11jhugo: "qaic one (same Hexagon device code shareable between the two though)" - this is not true sadly
21:24mmediouni[m]: jhugo: I mean the QuRT-exposed interfaces are different, but the bulk of the compute code is shareable
21:24mmediouni[m]: different conduit instead of fastrpc..
21:25jhugo: Nope
21:26mmediouni[m]: jhugo: I do have a fair amount of common HVX code between the two that I use quite a bit...
21:27mmediouni[m]: did I miss something in particular (or is it specific to the HMX part?)
21:32jhugo: I guess if you have bare-metal HVX (v68), those instructions would run on both. The AI 100 products don't provide the "skeleton" framework that fastrpc depends on. The multiple DMA engines are different, as are the memory protection, synchronization, doorbells, etc.
21:32jhugo: HMX is a whole other thing (that annoys me frequently)
21:33jhugo: The AI 100 has up to 16 NSPs and the whole SoC is designed to be able to utilize them for a single workload. The "mobile" NPUs are basically single cores that are not designed to have multiple in use for the same workload
21:36mmediouni[m]: jhugo: I've been playing around on the 8cx Gen 3 (on an X13s) and the AI 100, which match each other quite well, at least for compute kernels, even if the OS interfaces are quite different
21:37jhugo: mmediouni[m]: Are you using a particular higher level language? OpenCL?
21:39mmediouni[m]: jhugo: no, at this point just a hacked up tinygrad for the Hexagon side. I don't believe I saw mentions of OpenCL on Hexagon anywhere
21:40jhugo: Ah. Yes, I'm not aware of any official OpenCL support on Hexagon, which IMO is a shame. I'm doing some investigations on the side, but if you hacked something together I was curious to get your thoughts. I can't say much but Tomeu's work has been very inspiring
21:42mmediouni[m]: jhugo: For OpenCL or Vulkan, the numerics are what I'm most worried about. I can't see how it'd be possible to have anything passing there pre-v79
21:42mmediouni[m]: At least if emulating a SIMT model...
21:43mmediouni[m]: the old qfloat unfortunately doesn't mix very well with standard expectations...
22:23jhugo: AI 100 has IEEE HVX support
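For context on what "standard expectations" means in practice: OpenCL and Vulkan conformance typically bound arithmetic error in ULPs against a higher-precision reference. A minimal, generic sketch of such a check, assuming nothing beyond standard C (not taken from any actual CTS):

```c
#include <math.h>
#include <stdio.h>

/* Error in units-in-the-last-place between a device float result and a
 * higher-precision reference -- the usual way OpenCL/Vulkan precision
 * requirements are expressed.  Generic sketch, not from any real test suite. */
static double ulp_error(float device, double reference)
{
    if (isnan(device) || isinf(device) || isnan(reference))
        return INFINITY;

    /* Size of one ULP at the reference's magnitude, measured in float. */
    float ref = (float)reference;
    double one_ulp = (double)nextafterf(ref, INFINITY) - (double)ref;
    if (!(one_ulp > 0.0))
        return INFINITY;

    return fabs((double)device - reference) / one_ulp;
}

int main(void)
{
    double reference = 1.0 / 3.0;   /* computed in double as the reference   */
    float  device    = 1.0f / 3.0f; /* computed in float, standing in for HW */

    /* A unit that flushes denormals or rounds loosely would show up here as
     * a ULP error beyond what the spec allows for the operation. */
    printf("ulp error: %.3f\n", ulp_error(device, reference));
    return 0;
}
```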