13:21RSpliet: mwk: Risking like I sound like I'm on substances: does NVA0 support indirect branches?
13:23RSpliet: Wait, no, you actually documented the existence of that in envydocs.
13:26RSpliet: I guess I understand why... but what the actual flying duck. You could just be pushing 32 entries on the control stack with a single bra c instruction... which by default is only 16 entries long. Do we even ever bother emitting such madness in codegen?
13:28mupuf: RSpliet: Madness? This is NV50!
13:29mwk: RSpliet: FWIW you could in theory implement it with pushing only a single entry
13:30mwk: just take one path, and push the remaining mask to the stack with the PC pointing to the branch itself
13:30mwk: though I have no idea if it's actually done that way or not
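mwk's single-entry trick above can be sketched as a toy model. Everything here (the data structures, the grouping order) is invented for illustration, not how the hardware actually encodes it:

```python
# Toy model of the single-stack-entry scheme for a divergent indirect branch:
# take one target, execute the threads that agree with it, and push a single
# entry whose mask covers the remaining threads and whose PC points back at
# the branch itself, so the branch simply re-executes until all threads took it.

def run_indirect_branch(targets):
    """targets[i] is thread i's branch target; returns the (target, threads)
    groups in the order they would execute."""
    executed = []
    stack = [list(range(len(targets)))]  # initial mask: all threads active
    while stack:
        mask = stack.pop()
        chosen = targets[mask[0]]        # follow the first active thread
        taken = [t for t in mask if targets[t] == chosen]
        left = [t for t in mask if targets[t] != chosen]
        if left:
            stack.append(left)           # one entry, PC = the branch itself
        executed.append((chosen, taken))
    return executed

groups = run_indirect_branch([0x40, 0x80, 0x40, 0xC0])
# threads 0 and 2 run together; the stack never holds more than one entry here
```

Note the stack depth stays at one per divergence level, instead of one entry per distinct target.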
13:32RSpliet: mwk: well, the indirect branch bra c could imply that the parameter to c is not a register containing the offset in the constbuffer, but rather a fixed number. Not sure they implement the full indirect branch thing?
13:34RSpliet: Docs are a little terse on the topic, and like everything else it's not at the top of my priority list either, sadly
13:37karolherbst: RSpliet: how would a branch be indirect when reading the value from c?
13:37karolherbst: you can't change the value at runtime
13:37karolherbst: it's fixed for the entire invocation
13:38RSpliet: Indirection is "access c, branch to the value found there". Essentially just allowing this "replace immediates with constbuf values" optimisation, but technically it would be indirect
13:38karolherbst: all you change is that instead of having to hard code the value in the shader and recompile, you read from c and save the recompilation
13:39karolherbst: RSpliet: no, that's not what indirect is
13:39RSpliet: Of course it is. Indirect means branch to a target read from a value in memory.
13:39karolherbst: you gain 0 over using a constant directly if you ignore the recompilation
13:39Sarayan: karol: it's a table of uniforms, essentially? For subroutine/switch handling?
13:39karolherbst: RSpliet: a non constant value
13:40karolherbst: indirection with constant values is not indirect at all
13:40karolherbst: sure, you could call it that way
13:40karolherbst: but you have no advantage over the direct approach
13:41karolherbst: it only really gives you an advantage if every thread could jump somewhere else
13:41karolherbst: but that doesn't work with c
13:41karolherbst: again, all you do is save on recompilation
13:43karolherbst: RSpliet: can you actually do a bra "c[$r0+0x10]" thing?
13:44RSpliet: karolherbst: semantics aside, yes I got the implication of not being able to specify a reg as an offset in your c param
13:44RSpliet: envydocs doesn't make note of such a feature
13:45karolherbst: okay, so not even jump tables are possible
13:45RSpliet: I see only limited application for that - in jump tables, and only if it were possible to derive maximum values
13:47karolherbst: HdkR: subroutines are fine, because you can just put the value as a uniform
13:47karolherbst: you have no conditional jumps, right?
13:47karolherbst: ignoring that subroutines are insane anyway and shouldn't have been added to the spec to begin with
13:48HdkR: Sure, but you could do something dumb like have a subroutine array and then `subroutines[laneId]()`
13:48HdkR: and everyone cries
13:49Sarayan: karol: details, details :-)
13:50karolherbst: HdkR: uhm..., that you can do?
13:50HdkR: It's disgusting
13:51karolherbst: HdkR: "This index must be a Dynamically Uniform Expression."
13:51karolherbst: "A dynamically uniform expression is a GLSL expression in which all invocations of that shader within a particular set of invocations will have the same value."
13:51karolherbst: so no
13:51HdkR: hm. The nvidia compiler is getting away with too much then
13:52karolherbst: well, "The gl_DrawID input to the Vertex Shader"
13:52karolherbst: this can be messy
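The spec language karolherbst quotes can be restated as a tiny predicate. A hypothetical sketch, not any real GLSL tooling:

```python
# "Dynamically uniform": all invocations in the set have the same value.
def dynamically_uniform(values_per_invocation):
    return len(set(values_per_invocation)) <= 1

# subroutines[gl_DrawID]() is legal: one draw shares a single gl_DrawID
legal = dynamically_uniform([2] * 32)
# subroutines[laneId]() is not: every lane supplies a different index
illegal = dynamically_uniform(list(range(32)))
```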
13:52Sarayan: the nvidia individual shared programs aren't simd?
13:52HdkR: What is the definition of an "individual shared program" here?
13:53Sarayan: shader, not shared, sorry
13:53Sarayan: bad fingers, bad
13:53RSpliet: karolherbst: I'm not 100% sure jump tables are actually efficient since you must allocate control stack in local memory to deal with the worst case. Not sure what the access latency to local memory is, but it could well be more expensive than the "test, bra, join" overhead that'll let you use the HW stack
13:53HdkR: Nvidia likes to call it SIMT
13:54karolherbst: RSpliet: jump tables inside c
13:54karolherbst: but what do you mean with control stack in local memory? why would you need to do that
13:55Sarayan: HdkR: but you have one PC for multiple parallel computations, right?
13:55RSpliet: because the HW stack is only 16 entries. If every thread requires a different code path, this indirect branch would have to push 32 entries onto the control stack. One for every code path (minus one, but you need to push a reconvergence point too)
13:55HdkR: Sarayan: Yes, one PC for 32 threads in a warp
13:56Sarayan: wow, what a terminology fuckup
13:56karolherbst: RSpliet: I see :/
13:56Sarayan: so a thread is actually one index value in a vector register, and a warp is a thread, right?
13:57RSpliet: Sarayan: no, a warp is a block of (32) threads that on NVIDIA hardware run in SIMD lock-step.
13:58HdkR: The ISA is even visible as scalar
14:00Sarayan: RSpliet: what's the difference with saying that a warp is a thread with vector instructions working on 32-wide vectors?
14:01HdkR: How the execution changes when divergent mostly
14:01RSpliet: Sarayan: you'd confuse everyone in the GPGPU development and research communities, that's not how terminology is agreed upon
14:01HdkR: That too
14:02Sarayan: RSpliet: that was the terminology of 198x and 199x, but heh, pretending you're doing new things by renaming them confusingly is nothing new
14:03RSpliet: HdkR: the difference is smaller than you think. On Intel processors, if you want to handle SSE with divergent paths for your SIMD lanes you'd have to use predication too. It's a bit more laborious, but you'd apply the same principle
14:03HdkR: Oh yea, definitely
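RSpliet's predication point can be shown as a toy: a divergent if/else on a SIMD machine runs both paths over every lane and blends the results with a predicate mask; no lane ever takes a real branch. Purely illustrative, not SSE intrinsics:

```python
# Predicated execution for a divergent branch across SIMD lanes.
def predicated_if(mask, then_fn, else_fn, lanes):
    then_vals = [then_fn(x) for x in lanes]   # all lanes execute the then-path
    else_vals = [else_fn(x) for x in lanes]   # ...and the else-path
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

lanes = [1, -2, 3, -4]
mask = [x > 0 for x in lanes]                 # per-lane predicate
result = predicated_if(mask, lambda x: x * 10, lambda x: -x, lanes)
# each lane keeps the result its predicate selects
```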
14:04RSpliet: Sarayan: it's a matter of perspective. The whole reason why GPUs have been relatively manageable for programmers is because they are allowed to write code as single-threaded. Don't worry about SIMD lanes and predication themselves, just write the C code or otherwise for a single thread as you would for sequential code.
14:05RSpliet: Want to run it on AMD hardware? You don't have to rewrite it because now your SIMD processor is 64 lanes
14:05RSpliet: Run it on ARM Mali T600? Fine, now it's not SIMD at all anymore.
14:07Sarayan: RSpliet: yeah, they came through the "embarassingly parallel" side and optimized for that, including software-wise, which was a really good idea
14:11RSpliet: G80 stepping away from vec4 (more traditional SIMD, as it were) was probably the best decision they've ever made, despite moves in the opposite direction now...
14:11Sarayan: T600 is pure mimd? fun
14:11Sarayan: well, they're vec32 now it seems, right? :-)
14:12HdkR: Mali keeps mixing it up. I hear they are changing it after Bifrost as well
14:13RSpliet: Sarayan: the point is that previously each thread was 4vec. If you look in the OpenCL specification you can find the "float4" data type. On GeForce FX type GPUs (not that they support OpenCL), you'd do float4+float4=float4 in a single instruction in your program, on G80 the compiler transforms that to four separate additions.
14:15RSpliet: Sounds more expensive, until you realise that a lot of times you were only using 2 or 3 components of your float4 (like manipulating RGB values) and they ended up keeping half your hardware idle. The G80 model allowed them to better keep the FPUs and ALUs occupied
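The G80-style lowering RSpliet describes can be sketched directly: a float4 add becomes per-component scalar adds, and the compiler emits only the components actually used, so an RGB operation no longer idles a quarter of the vec4 hardware. The encoding here is invented for illustration:

```python
# Lower a vec4 add into scalar adds, emitting only the used components.
def lower_vec_add(a, b, used):
    """One scalar add per *used* component (op tuples are illustrative)."""
    return [("add", i, a[i] + b[i]) for i in range(used)]

# manipulating RGB: 3 scalar adds instead of one vec4 op with an idle lane
ops = lower_vec_add([1.0, 2.0, 3.0, 0.0], [4.0, 5.0, 6.0, 0.0], used=3)
```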
14:16Sarayan: well, intel did that for a while, istr that at a time (i965?) vector shaders went the fx way and the pixel shaders the g80 way
14:16Sarayan: I wonder if they tried to find the most efficient way for compute shaders or if it's fixed
14:16imirkin: starting with gen8 (broadwell), all stages can do it the "pixel shader" way
14:17imirkin: and that's what mesa prefers doing
14:17RSpliet: Yep. Turns out the G80 way was efficient enough for vector shaders too (because there's tons of threads anyway), and now they essentially use one type of shader core for all types of shaders
14:17Sarayan: makes sense
14:18Sarayan: more poly per object, so it works out
14:18Sarayan: and nobody cares that much about tesselation or geometry shaders
14:18Sarayan: (which tend to be more in the one invocation per object frequency)
14:19RSpliet: the movement back to per-thread vector instructions is mainly the fault of neural nets, that can get away with really low precision. They can't subdivide 32-bit regs into 8-bit values and launch one thread per 8-bit value, so they used packed SIMD for that (and more insane stuff with Volta and the likes)
14:21Sarayan: you mean a nv can't see the 32 floats as 128 bytes?
14:21Sarayan: or they'd need per-byte condition flags?
14:21imirkin: you can fetch values from other threads in various complex ways
14:22Sarayan: (not sure if condition is the term, the flags that disable writes to parts of registers)
14:22imirkin: but there's no "wide" view, like there is on GCN
14:22imirkin: [or intel]
14:22Sarayan: I wonder what is a hardware limit, what is an ISA limit, and what is a cuda limit there
14:23imirkin: not sure how the cuda limit would differ from the combination of hardware and isa limits?
14:23imirkin: and isa is kinda baked into the hardware...
14:25Sarayan: imirkin: sure. w.r.t cuda, it can be hard to express simd-ish concepts where the language does its best to hide the simd aspects
14:25HdkR: CUDA exposes most everything the ISA can do
14:25imirkin: Sarayan: same with ISA :)
14:25HdkR: (Which is why it is such a great resource for RE)
14:28Sarayan: hmmm, so when it is said that my non-booting MX150 chip has 384 cuda cores, that means it has 12 cpu-core equivalents, right?
14:28Sarayan: (I'm trying to understand the numbers waved around)
14:30RSpliet: Sarayan: cutting corners, what they mean is that there are 384 floating point units on the device
14:31karolherbst: RSpliet: *ALU
14:31Sarayan: RSpliet: which is nice :-) And 12 program counters, right? Or they do funkier scheduling than that?
14:31RSpliet: There's not really a "CPU equivalent", because there's no scalar pipeline. Control flow instructions can be perceived as "scalar", but they work in mysterious ways anyway :-P
14:31karolherbst: volta/turing splits the FPU parts out of the ALU
14:32HdkR: Because Volta+ is awesome :)
14:32RSpliet: Sarayan: Well, no. 12*8 program counters is a better approximation I think
14:32RSpliet: karolherbst: I *think* G80 had more FPUs than ALUs... but don't take my word on that one
14:33karolherbst: well, but on the mx150 there is no such difference
14:33Sarayan: RSpliet: so each running program handles vec4s?
14:33karolherbst: there is one ALU for both floating point and integer stuff
14:33karolherbst: on tesla you had the SFU where you could do muls on it
14:33karolherbst: or something like that
14:34karolherbst: I doubt tesla had a FPU/ALU split
14:34karolherbst: (ignoring SFUs)
14:34Sarayan: you can't really have any calculation from any register sent to any fpu, or you'd have a chip that's a centimeter high from the metal layers for the routing :-)
14:35HdkR: tldr, hardware is complex and explaining it with a single blanket statement isn't going to work well. Choose what works best, explain the rest as questions arise.
14:35karolherbst: Sarayan: what do you mean?
14:35Sarayan: there must be *some* local grouping
14:35karolherbst: okay, sure
14:35karolherbst: the hw does weird things; I doubt anybody really knows what it does
14:35RSpliet: Sarayan: what do you mean "handles vec4's"? It handles them by scalarising them and running individual threads on each. a vec4 just allocates 4 "consecutive" registers
14:35karolherbst: except nv engineers
14:36karolherbst: but in the end registers are also just really fast "memory"
14:36karolherbst: it's not like you have 65k tiny separate registers on the hardware somewhere
14:36RSpliet: Imagine the register file being two dimensional. It has columns and rows. You can access it by addressing a column. Each row is hard-wired to a specific FPU (or Core; there is some debate about what each of these things does and doesn't do)
14:36Sarayan: karol: with some minor exceptions (which we tend to call special registers), it's very true
14:37Sarayan: RSpliet: I'd have seen it as 3 or even 4-dimensions
14:38RSpliet: Sarayan: I don't want to overcomplicate matters
14:38karolherbst: how would you even do it on the hw?
14:38RSpliet: But yes, they're banked.
14:38karolherbst: I mean 4 dimensions
14:38Sarayan: karol: Logical dimensions, not physical
14:38RSpliet: karolherbst: Not physically, "mathematically"
14:38RSpliet: logical is a nicer word!
14:38karolherbst: but that means nothing in the real world :p
14:39Sarayan: karol: I would be *extremely* surprised that the registers are *one* memory array for the whole chip
14:39Sarayan: routing, again
14:39RSpliet: Apart from adding the ability to fetch two registers in a single cycle, which is what I experimented with in one of those bit-rotten mesa branches
14:39karolherbst: Sarayan: but that's most likely the case
14:40karolherbst: as you are able to slice the register file as you see fit for the invocation
14:40karolherbst: with constraints of course
14:40karolherbst: like allocating 32 or 28 or 60 registers for each thread
14:40Sarayan: it's the constraints that reveal the actual physical configuration
14:40RSpliet: Sarayan: rows (each 32 bits) are hard-wired to cores, columns are addressable. You'd need 32*32 wires from a register bank to a set of 32 FPUs
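RSpliet's rows-and-columns picture can be modelled as a toy class: each row feeds one core/FPU, and an instruction addresses a column, so "r5" names column 5 in every row at once. All sizes and names here are illustrative:

```python
WARP = 32  # lanes per warp; illustrative sizes throughout

class RegisterFile:
    """Toy 2D register file: rows are hard-wired to lanes, columns are
    the addressable registers."""
    def __init__(self, columns):
        self.cells = [[0.0] * columns for _ in range(WARP)]

    def read(self, reg):
        # one column access delivers a value to every lane's FPU at once
        return [self.cells[lane][reg] for lane in range(WARP)]

    def write(self, reg, values):
        for lane, v in enumerate(values):
            self.cells[lane][reg] = v

rf = RegisterFile(columns=64)
rf.write(5, [float(lane) for lane in range(WARP)])
lane_values = rf.read(5)   # each FPU sees its own lane's copy of "r5"
```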
14:41karolherbst: RSpliet: are you sure those are hard-wired?
14:42RSpliet: karolherbst: they're banked, there's likely to be muxes in between to be able to access multiple banks in the same cycle
14:42Sarayan: so I have 12 packs of 32 fpus.. and they share a 32-wide data bus in a pack?
14:42RSpliet: But I don't think there's shuffle networks to let you pass data from thread to thread
14:42karolherbst: Sarayan: well, you have 3 SM
14:43karolherbst: so 3x 4 packs of 32 fpus
14:43karolherbst: well, alu
14:43Sarayan: SMs are fully isolated from each other except at boundaries I guess?
14:43karolherbst: not fpus
14:43RSpliet: Sarayan: The easiest way to understand the concept is that you'd have 12 packs of 32 FPUs, and each pack has 1024 wires to a dedicated register file.
14:43RSpliet: Yes, there's hierarchy ;-)
14:44Sarayan: so within a SM there's 4x32 fpus (alus?) with a 4x32x32 wide bus to the associated registers?
14:44Sarayan: that's... wide
14:44karolherbst: and then there is the situation that you might limit the total amount of parallel threads depending on how many registers you allocate per thread
14:45RSpliet: There's tons of other fun that make reality even more interesting. Like that thing karolherbst just said^
14:45karolherbst: and you can access 4 regs at once within one instruction
14:45karolherbst: like 4 regs as a quad op
14:45karolherbst: doing a 128 bit operation
14:45karolherbst: consecutive regs though
14:45Sarayan: aligned? or just consecutive?
14:46Sarayan: ok, makes sense, hardware wise
14:46RSpliet: Sarayan: and yep, that's wide. But it's a highly regular structure, so you don't really need many layers of routing for that in theory.
14:46Sarayan: your instruction tickles how many fpus at a time, 32?
14:46karolherbst: the 128 bit one?
14:46RSpliet: In practice we can't look inside the floorplan of the GPU ;-)
14:47karolherbst: those 128 bit ops are for memory operations
14:47karolherbst: and tex
14:47Sarayan: whichever... how many separate instructions a SM can run in parallel?
14:47karolherbst: depends on the workload, but you can have all units busy
14:47karolherbst: and wait until a unit frees up for the next instruction
14:48karolherbst: but you also have some delay in issuing an instruction
14:48RSpliet: Sarayan: to make stuff even more complex :-D then there's... up to 8 program counters per "pack".
14:48karolherbst: like on kepler you were able to issue two instructions at once in one cycle
14:48Sarayan: sure, but there's an intrinsic wideness, isn't it? Or the SM is running 1024 different programs at the same time?
14:49Sarayan: (different can just be that the pc is in the different place due to access delays or conditionals)
14:49Sarayan: and by 1024 I meant 128, sorry
14:49karolherbst: I doubt that works
14:49karolherbst: I think warps can run different programs?
14:49RSpliet: And hardware kind of switches between 8 warps as it seems fit. If one is waiting for a memory request, it picks another warp and continues to run.
14:49karolherbst: and you have like 32 max warps per SM or something
14:49Sarayan: yeah, it's hyperthreading within a SM
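The warp-switching RSpliet describes is easy to see in a toy scheduler: each cycle, issue from any warp that isn't waiting on memory, and with enough resident warps the latency is hidden. Latency and counts are invented, not real hardware numbers:

```python
MEM_LATENCY = 10  # made-up memory round-trip, in cycles

def run(num_warps, instrs_per_warp):
    """Treat every instruction as a load: a warp may issue again only
    MEM_LATENCY cycles later. Returns total cycles to drain all warps."""
    ready_at = [0] * num_warps
    remaining = [instrs_per_warp] * num_warps
    clock = 0
    while any(remaining):
        for w in range(num_warps):       # pick any warp not waiting on memory
            if remaining[w] and ready_at[w] <= clock:
                remaining[w] -= 1
                ready_at[w] = clock + MEM_LATENCY
                break
        clock += 1
    return clock

single = run(1, 4)    # one warp: every load stalls the whole machine
many = run(8, 4)      # eight warps: stalls overlap, 8x the work in ~the same time
```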
14:50karolherbst: Sarayan: what is hyperthreading?
14:50karolherbst: if you want, the entire SM is doing hyperthreading all the time
14:50RSpliet: karolherbst: that's a fairly good analogy
14:50karolherbst: that's just the inherent nature of how things work
14:50karolherbst: picking unused engines
14:51Sarayan: it's separating the flow of execution from the computational units. That's how intel saturates its fpus/alus/etc a little better, by having two threads scheduled on the same set of computational units
14:51RSpliet: I'm out. 't has been fun!
14:51karolherbst: but those threads don't run in complete parallel, right?
14:51karolherbst: or do they?
14:51Sarayan: they do
14:52Sarayan: they have separate sets of registers and pc and privilege and everything, they just share fpu/alu/whatever
14:52Sarayan: so they kinda wait on each other
14:52karolherbst: I guess they also share the same clock?
14:52Sarayan: that's pretty sure, yeah
14:52karolherbst: mhh, okay
14:53karolherbst: on kepler dual issuing is a per thread thing
14:53Sarayan: an intel cpu core is a complete processor, but an ht core has two logical threads running in it
14:53karolherbst: so one thread issues two instructions in one cycle
14:53Sarayan: that's superscalar, issuing multiple instructions at the same time
14:53Sarayan: old school too, but quite useful :-)
14:53karolherbst: well, nvidia decided they won't continue with it
14:54karolherbst: so it's a kepler only thing
14:54Sarayan: well, with an 8-wide hyperthreading, it makes sense to drop superscalar
14:55Sarayan: superscalar requires tracking cross-instruction dependencies, which has its cost
14:56Sarayan: ends up with register renaming and all that stuff
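The dependency tracking Sarayan mentions (and that Kepler pushed into the compiler) boils down to a hazard check between adjacent instructions. The pairing rule here is deliberately simplified for illustration:

```python
# Toy dual-issue legality check: two adjacent instructions may pair only if
# the second neither reads nor overwrites the first's destination register.
def can_dual_issue(first, second):
    """Each instruction is (dest_reg, set_of_source_regs)."""
    d1, _ = first
    d2, srcs2 = second
    return d1 not in srcs2 and d1 != d2

paired = can_dual_issue(("r0", {"r1", "r2"}), ("r3", {"r4"}))  # independent
hazard = can_dual_issue(("r0", {"r1"}), ("r2", {"r0"}))        # RAW hazard
```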
14:56karolherbst: compiler does it
14:56karolherbst: although the hardware is able to detect mistakes
14:56karolherbst: and incurs a penalty for doing so
14:56Sarayan: if you can just run instructions from other "programs" instead, it's much simpler silicon-wise
14:56karolherbst: doesn't matter with maxwell
14:56karolherbst: like seriously
14:56karolherbst: it's all inside the compiler
14:57karolherbst: you have to declare dependencies inside the shader program
14:57karolherbst: if you fail to do so, the result is garbage
14:57Sarayan: you mean you have to say which instruction needs the result of what instruction?
14:57karolherbst: more or less
14:58karolherbst: it is more based on latencies/delays though
14:58Sarayan: within a range I guess
14:58karolherbst: no range
14:58karolherbst: you say: delay for 3 cycles, that's it
14:58karolherbst: or you create barriers
14:58karolherbst: for variable runtime instructions
14:58karolherbst: like 32 bit integer multiplication
14:58karolherbst: there is a 16 bit form with fixed runtime length
14:58Sarayan: oh ok, you have to ensure there are no dependencies then
14:59karolherbst: you have to resolve those by putting delays + barriers
14:59Sarayan: the compiler must be "interesting" to write
14:59karolherbst: if you have memory operations you create a write barrier and the instructions consuming those values have to wait
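karolherbst's "declare the delay or get garbage" rule can be modelled directly: each instruction carries a compiler-declared stall count, and under-stalling makes a consumer read a stale value. The encoding and latency below are invented for illustration:

```python
LATENCY = 6  # invented fixed ALU latency, in cycles

def execute(program):
    """program: list of (dest, src, imm, stall); each instruction computes
    dest = regs[src] + imm, visible only LATENCY cycles after issue.
    `stall` is the compiler-declared wait before issuing the next one."""
    regs, in_flight, clock = {}, [], 0
    for dest, src, imm, stall in program:
        regs.update({r: v for t, r, v in in_flight if t <= clock})
        in_flight = [(t, r, v) for t, r, v in in_flight if t > clock]
        value = regs.get(src, 0) + imm    # stale read if producer not retired
        in_flight.append((clock + LATENCY, dest, value))
        clock += stall
    regs.update({r: v for _, r, v in in_flight})
    return regs

well = execute([("r0", None, 5, 6), ("r1", "r0", 1, 6)])  # stall covers latency
bad = execute([("r0", None, 5, 1), ("r1", "r0", 1, 1)])   # under-stalled
# well["r1"] is 6; bad["r1"] read a stale r0 and got garbage
```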
15:00karolherbst: Sarayan: well, the reverse engineering work is more interesting here
15:00karolherbst: but you try to do a lot of latency hiding inside the shader, so you reorder to minimize stalls
15:01karolherbst: I don't even think the hw itself is able to do so
15:01karolherbst: RSpliet: do you know anything of that? afaik there is no out of order execution
15:13Sarayan: karol: he said he was out :-)
15:14Sarayan: but yeah, no ooo simplifies things a lot, so you can pack more cores
15:15Sarayan: so, the first unit is the SM, which has a number of registers and 4x32 fpus/alus and manages 8 execution threads
15:16Sarayan: how wide are the actual instructions at the isa level, is it float a=b+c or vec4 a=b+c or wider?
15:18HdkR: Which can also be fp16x2
15:19Sarayan: How can you feed 32x4 fpus with only 8 instructions using one each?
15:19HdkR: Oh, it can also do FP64 by pairing a couple of registers
15:20karolherbst: but 64 bit stuff is quite limited
15:20HdkR: and depending on arch there are a couple of SIMD 8bit or 16bit instructions labeled "video" instructions
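The fp16x2 idea HdkR mentions can be emulated in software; this is just the packing arithmetic (two half-precision lanes sharing one 32-bit word), not the hardware instruction:

```python
import struct

def pack_half2(a, b):
    """Pack two fp16 values into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]

def unpack_half2(word):
    return struct.unpack("<2e", struct.pack("<I", word))

def hadd2(x, y):
    """Lane-wise add of two fp16x2 words (software emulation, not HW)."""
    ax, bx = unpack_half2(x)
    ay, by = unpack_half2(y)
    return pack_half2(ax + ay, bx + by)

r = unpack_half2(hadd2(pack_half2(1.0, 2.5), pack_half2(0.5, 0.5)))
```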
15:20karolherbst: I think I got nvidia once to use those
15:20HdkR: Apparently geometry shaders will use them
15:20karolherbst: I was doing wmma stuff
15:21HdkR: Huh, surprising
15:21karolherbst: but I really tried hard to mess things up
15:22karolherbst: Sarayan: you don't wait
15:22karolherbst: a thread _issues_ an instruction
15:22karolherbst: not executes it
15:22karolherbst: so you issue one instruction, apply the scheduling (stalls, barriers), issue the next
15:22karolherbst: even if the first one isn't done
15:23Sarayan: But one issued instruction can tickle how many fpus at a time?
15:23karolherbst: and then you have the warp scheduling stuff
15:23karolherbst: per thread
15:23karolherbst: so if you have 32 threads running, one instruction takes 32 fpus
15:23Sarayan: number of threads = wideness of the instruction then
15:24karolherbst: I think?, wait
15:24karolherbst: let me check
15:24karolherbst: no, doesn't make sense
15:24karolherbst: it takes one FPU
15:24karolherbst: you can have 2k threads per SM resident
15:25Sarayan: errr, 8 warps, 128 fpus, 2048 threads, it's getting weird
15:25karolherbst: 8 is only valid for tesla
15:25karolherbst: on maxwell that's 32
15:26karolherbst: uhm, actually
15:26karolherbst: 64 warps
15:26karolherbst: it's all silly in the end
15:26karolherbst: because there is blocks vs warps
15:26Sarayan: MX150 is.. pascal, great :-)
15:26karolherbst: maxwell == pascal
15:26Sarayan: oh cool
15:27karolherbst: you have 65536 registers per SM in the end
15:27karolherbst: and a thread can take from 4 up to 255 registers
15:27karolherbst: of course you can't have 16384 threads at the same time
15:27karolherbst: so you don't really optimize for having only 4 regs
15:27Sarayan: I'm not yet entirely sure what a thread is in the first place :-)
15:28Sarayan: since the execution unit seems to be the warp
15:28Sarayan: as in, issues instructions
15:28karolherbst: you can have 1024 threads per block, which gives you 64 regs per thread
15:28karolherbst: HdkR: what is actually the difference between block and warp?
15:29karolherbst: Sarayan: you have to deal with divergent threads, like taking different paths
15:29HdkR: A block is a collection of warps
15:29Sarayan: that's usually done through write-nullification bits
15:30karolherbst: so you can have 64 warps on maxwell
15:30Sarayan: but they may have invented something else :-)
15:30karolherbst: Sarayan: yes, single threading mode
15:30Sarayan: we're per-SM there, right?
15:30Sarayan: good, I'm still following then, phew
15:30karolherbst: anyway, I remember that having 32 regs used is the "best"
15:31HdkR: Grouping things in blocks effectively allows you to logically tie warps together and the global ID will offset correctly
15:31karolherbst: so you allocate 32 regs for each thread and be able to run 2048 threads in parallel
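The occupancy arithmetic in this exchange can be written down directly from the numbers quoted (65536 registers per SM, 2048 resident threads). This ignores warp-granularity rounding and the other real-hardware limits (blocks, shared memory):

```python
REGS_PER_SM = 65536           # registers per SM, as quoted above
MAX_THREADS_PER_SM = 2048     # Maxwell/Pascal resident-thread limit

def resident_threads(regs_per_thread):
    """How many threads fit: register allocation caps occupancy."""
    return min(MAX_THREADS_PER_SM, REGS_PER_SM // regs_per_thread)

full = resident_threads(32)     # the "best" point: full occupancy
half = resident_threads(64)     # double the registers, half the threads
heavy = resident_threads(255)   # heavy register use kills occupancy
```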
15:31Sarayan: except you can only do 128 ops in parallel, or 256 if fpu and alu are separate, right?
15:32karolherbst: ALU = float + int
15:32karolherbst: on maxwell
15:32karolherbst: you operate on multiple data at the same time
15:32karolherbst: so all threads execute the same instruction
15:32karolherbst: but on different piece of the register file
15:32Sarayan: yup, that's simd in old parlance
15:33Sarayan: or vector instructions in even older parlance, but with more flexibility
15:33Sarayan: if, again, I follow correctly
15:33karolherbst: well, from the ISA point of view it all looks scalar
15:33karolherbst: but the hw isn't scalar
15:33karolherbst: it just looks like that
15:33Sarayan: that's nice, you can decouple instruction wideness from the instruction itself that way
15:34Sarayan: I don't think intel manages that, it didn't for i965 anyway
15:38karolherbst: nvidia made a big trade-off
15:38karolherbst: simpler hardware, but the compiler team has to do tons of work
15:39karolherbst: which is what you want to have in the end anyway
15:39karolherbst: hardware/software tends to be broken
15:39Sarayan: hmmm, 32 warps, 32 threads per warp, that's 1024 threads max
15:39karolherbst: software is fixable
15:39karolherbst: Sarayan: who says 32 threads per warp?
15:39Sarayan: depends, that's what VLIW was supposed to be, and look at the itanic
15:40HdkR: No dev has to program assembly in VLIW though ;)
15:40karolherbst: ohh, yeah
15:40karolherbst: I am not that strong on the numbers
15:40Sarayan: nobody can be :-)
15:40karolherbst: Sarayan: what I meant is, hardware is broken
15:41karolherbst: intel had to learn it the hard way now
15:41Sarayan: as whitequark would say, everything's cursed, but at least you can fix software without a soldering iron
15:42karolherbst: and if the hardware stays less complex and you can move stuff into software while being actually more power efficient, then that's what you have to do
15:42Sarayan: and shaders are small, so you can throw a lot of compile resources at them
15:42karolherbst: or even pre compile
15:42karolherbst: and cache
15:43karolherbst: intel had the silly situation that x86 is just stupid
15:43HdkR: https://hastebin.com/aluduyawop.sql <---Turing changed some of those numbers
15:43karolherbst: so they worked around a lot of x86 limitations inside hw
15:43Sarayan: the hardware still does the annoying things, like selecting the fpu cores or scheduling in general
15:43karolherbst: it has to
15:43Sarayan: itanic essentially didn't, and urgh
15:44Sarayan: x86 has one good thing: it's very compact
15:44karolherbst: I don't even think that's remotely true
15:44karolherbst: because if you start looking at the actual hardware, you want to run away
15:45Sarayan: compact as in I$ usage
15:45Sarayan: nothing else
15:45karolherbst: okay, sure
15:45HdkR: I'd hate to be the person that has to work on that CPU's frontend
15:45karolherbst: I think the frontend isn't that bad
15:45karolherbst: what's bad is the internal-ISA-to-x86 layer
15:46karolherbst: the internal ISA looks much more like the nv ISA
15:46karolherbst: or well
15:46karolherbst: that's what I know regarding some AMD microcode stuff
15:46HdkR: Isn't that the frontend when it is converting instructions to uops?
15:46karolherbst: might be different on intel
15:46karolherbst: HdkR: ohh, right
15:46karolherbst: I wasn't thinking correctly :p
15:49karolherbst: the big problem is though that you want your CPU applications to be quite stable in the ISA
15:50HdkR: Yea, that is one of the big advantages of x86
15:50karolherbst: all that out of order stuff CPUs are doing is simply caused by x86
15:50karolherbst: out of order isn't a perf feature, it's a workaround
15:50HdkR: GPUs get away with this by having a compilation step late in the process :P
15:51karolherbst: but you also don't want to have a compiler inside the kernel, so you are like super screwed
15:52karolherbst: maybe we just have to give in, into that madness and have a PPU
15:52karolherbst: for program processing unit
15:52karolherbst: and the CPU compiles applications for the GPU and PPU
15:52karolherbst: and the CPU is kernel only
15:53Sarayan: ooo is a perf feature for scalar-type cpus though
15:54Sarayan: in fact speculation is the #1 perf feature, and specifically cache priming through speculation
15:54Sarayan: GPUs can't do memory random access at the scale CPUs do
15:55karolherbst: "speculation" is mainly a hack
15:55karolherbst: it's super hard to get it right
15:55karolherbst: everybody failed to do so
15:56Sarayan: yup, but it's more than a hack
15:56karolherbst: now we workaround that inside software
15:56Sarayan: and the perf hit tells you how necessary it is
15:56karolherbst: okay, it is a sophisticated hack
15:57Sarayan: it's very much all about finding which memory you're going to need as early as possible to maximize the useful bandwidth
15:57karolherbst: you only need this if you can't deal with latencies/stalls
15:57Sarayan: because when you're single threaded, every read is a stopping point
15:58karolherbst: doesn't have to be
15:58karolherbst: even on the GPU it isn't
15:58karolherbst: even if you run single threaded
15:58Sarayan: the only reason GPUs can is because the programs they run are embarassingly parallel
15:58karolherbst: no, it isn't
15:58Sarayan: I'm listening
15:58karolherbst: GPUs can just execute the instruction after the read
15:58karolherbst: it doesn't matter
15:59karolherbst: you stop when you _need_ the value
15:59karolherbst: not when you start the read
15:59karolherbst: and you can put different ops between read and use to hide it
15:59Sarayan: agreed, but in scalar code that comes rather fast
15:59karolherbst: depends on the workload, but yes
15:59Sarayan: especially with branches everywhere
15:59karolherbst: what I meant is though, that on x86 there is no incentive to optimize all that
16:00karolherbst: because it's crap to start with
16:00Sarayan: not enough registers, sure
16:00karolherbst: that's another issue, yes
16:00karolherbst: easily solveable though
16:00karolherbst: you just have to make that cut at some point
16:00karolherbst: and say: here, new CPU gen, deal with it
16:00HdkR: Time to make ARM have 256 GPRs
16:01karolherbst: x86 won't be important in long term
16:01karolherbst: I highly doubt that arm will be the replacement though
16:02Sarayan: HdkR: that doesn't work at all for performance
16:02Sarayan: because the instructions get too big, and because of subroutine calls
16:04Sarayan: in any case, a lot of thanks for all that very very interesting information
16:04Sarayan: I'm starting to understand the organization of all that stuff
16:04Sarayan: blocks are still unclear, but heh :-)
16:05karolherbst: I guess we just have to agree that CPUs suck big times :D
16:05karolherbst: and just don't do anything on it
16:05karolherbst: still impressive what intel/AMD/ARM are able to do with all those limitations
16:06Sarayan: well, GPUs are useful for the easy part, the hard part is still done by the CPU :-P
16:06Sarayan: (random access, branching, single-threaded performance)
16:06HdkR: karolherbst: ARM is trying to turn CPUs in to a GPU with SVE at least
16:09HdkR: Got to love having up to 2048bit vector registers running on the CPU
16:11Sarayan: it's a power management nightmare I suspect
16:11HdkR: nah, just have the frontend break it up in to a million uops
16:11Sarayan: Oh, one instruction per minute?