10:21 mrsinisalu_: So to speak, I am interested in giving my ideas to the general public, and also interested in talking only over the network, and only about technology and work, until the chips' drivers get fixed for performance.
10:23 mrsinisalu_: FLHerne: you are trolling all the time about bots and AI, which has very little to do with GPU stuff.
10:31 mrsinisalu_: I am also confused about who he/she is; was it one of the guys who works on the alternative SPIR/SPIR-V compiler? Dude, talking something sane sometimes wouldn't hurt either.
10:40 mrsinisalu_: I got some Chinese Cortex-A7-based iPhone SX clone; slightly irrelevant to talk about it, but it has a Mali-400 GPU too. I discovered that there is a path for in-order CPUs too to pump up their performance.
10:41 mrsinisalu_: so this Cortex-A7 is not as weak as people think, if someone were to use a shortened fast pipeline for it; the async queues are there on the debug path.
10:43 mrsinisalu_: on out-of-order processors they are ROBs and reservation stations, but on the in-order Cortex-A5/A7, and even the earlier XScale, they are on the debug pipeline, which isn't optional; it is there, but it can probably be turned off.
10:43 mrsinisalu_: the ARM documentation for this is very large even, and the packet format is something yet to study.
10:44 mrsinisalu_: on the A5 and later, they have the ITM and ETM debug pipelines; they work together in concert.
10:45 mrsinisalu_: but the ITM is the instrumentation pipeline, where SWIT packets from ETM traces can be sent to the issue module.
10:48 mrsinisalu_: yeah, it is fast; they offer some TLB or memory/cache support for this ITM, and the ETM has on-chip storage for that.
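The SWIT packet format mentioned above ("something yet to study") can be sketched roughly as follows. This is a minimal encoder/decoder for the ITM software-instrumentation (SWIT) packet framing as I understand it from ARM's CoreSight/ITM trace documentation: a header byte carrying the stimulus port number in bits [7:3] and a size code in bits [1:0], followed by a little-endian payload. The function names are illustrative, not from any ARM API, and the framing should be checked against the actual spec before relying on it:

```python
# Sketch of ITM SWIT packet framing (hedged; verify against the ARM
# CoreSight / ARMv7-M trace documentation before use).
# header byte = (stimulus_port << 3) | size_code
# size_code: 0b01 -> 1-byte payload, 0b10 -> 2 bytes, 0b11 -> 4 bytes

SIZE_CODE = {1: 0b01, 2: 0b10, 4: 0b11}

def encode_swit(port: int, value: int, size: int) -> bytes:
    """Encode one SWIT packet for stimulus port 0-31."""
    if size not in SIZE_CODE:
        raise ValueError("payload must be 1, 2 or 4 bytes")
    if not 0 <= port < 32:
        raise ValueError("stimulus port must be 0-31")
    header = (port << 3) | SIZE_CODE[size]
    return bytes([header]) + value.to_bytes(size, "little")

def decode_swit(packet: bytes):
    """Decode a SWIT packet back into (port, value, size)."""
    header = packet[0]
    size = {0b01: 1, 0b10: 2, 0b11: 4}[header & 0b11]
    port = header >> 3
    value = int.from_bytes(packet[1:1 + size], "little")
    return port, value, size

pkt = encode_swit(port=1, value=0x41, size=1)
print(pkt.hex())         # header 0x09 (port 1, 1-byte), payload 0x41
print(decode_swit(pkt))  # (1, 65, 1)
```

On real hardware the encoding is done by the ITM itself when software writes to a stimulus-port register; this sketch only shows what comes out on the trace stream.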
10:55 mrsinisalu_: so XScale even had this too, though earlier it was a similar-looking sort of thing called DCC, which also did high-speed transfers to issue, and from issue to the XScale trace buffers on chip.
10:57 mrsinisalu_: but the iPhone clone can also run Android; basically lots of work needs to be performed. The debug pipe for performance is more complex on CPUs than the ROB mechanism, which is basically quite simple.
11:01 mrsinisalu_: I have understood that you have refused to put any of the sane performance lifting into your code; so, to be honest, lots of not-so-deeply-knowledgeable people have been bullying here over time; they make people suffer, and their talk is pretty low-quality too.
11:01 mrsinisalu_: some even blame the hw; well, the hw is mostly good, some minor issues here and there. It is just that I see VLIW as the best possible hw though.
11:03 mrsinisalu_: maybe those osdev and CPU experts have some envy against the Russians; they have their Elbrus CPUs, but most institutions implement VLIWs today, yeah, including that Russian company or whatever. And yeah, when correct code is used on this CPU, it would probably run very fine.
11:46 mrsinisalu_: It is only my hypothesis though; I assume VLIW pipelined CPUs always have queues available, they naturally fit more ALUs into the cores, and they should always implement bypass networks.
12:25 mrsinisalu_: well, it is not classically pipelined, which I'd agree with; however, the bundle can be in a loop.
13:55 mrsinisalu_: yes, this should be correct: VLIW does not start another bundle until the last one has finished, and a branch opcode flushes the bundle and fetches a new one. But it would be pointless to do such a thing with a loop, right.
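The bundle semantics described above (issue one bundle at a time; a taken branch flushes the rest of the current bundle and redirects fetch) can be sketched as a toy interpreter. The bundle/op format and register names below are invented purely for illustration and do not match any real VLIW ISA (Elbrus, TI C6x, etc.):

```python
# Toy model of bundle-at-a-time VLIW issue: a new bundle is not issued
# until every slot of the previous one has finished, and a taken branch
# flushes the remaining slots of its bundle and redirects fetch.
# Hypothetical op format: ('mov', dst, imm), ('add', dst, srcA, srcB),
# ('br', target_bundle_index).

def run(bundles, max_steps=1000):
    """Execute a list of bundles; returns the final register file."""
    regs = {}
    pc = 0
    steps = 0
    while 0 <= pc < len(bundles) and steps < max_steps:
        steps += 1
        next_pc = pc + 1              # default: fall through
        for op in bundles[pc]:
            if op[0] == 'mov':
                regs[op[1]] = op[2]
            elif op[0] == 'add':
                regs[op[1]] = regs.get(op[2], 0) + regs.get(op[3], 0)
            elif op[0] == 'br':
                next_pc = op[1]       # taken branch redirects fetch ...
                break                 # ... and flushes the rest of the bundle
        pc = next_pc
    return regs

program = [
    [('mov', 'r0', 1), ('mov', 'r1', 2)],
    [('add', 'r2', 'r0', 'r1'), ('br', 3), ('mov', 'r3', 99)],  # r3 flushed
    [('mov', 'r4', 7)],                                         # skipped
    [('add', 'r5', 'r2', 'r2')],
]
print(run(program))  # r2 == 3, r5 == 6; r3 and r4 are never written
```

Note how the slot after the branch in bundle 1 never executes, matching the "branch flushes the bundle" behaviour; a real machine would also have per-slot latencies and bypass paths this sketch ignores.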
13:56 mrsinisalu_: but on GPUs things are different