02:44 imirkin: karolherbst: no - it's not always legal
02:47 imirkin: (about sticking the neg modifier up-front)
10:22 karolherbst: imirkin: yeah... we've got a problem then
10:22 karolherbst: can't just remove OP_SUB
10:22 karolherbst: unless we move the lowering of NEG somewhere else
13:57 imirkin: karolherbst: OP_ADD + OP_NEG
13:58 imirkin: karolherbst: if it's the post-ra lowering, then it's fine to do the modifier thing
13:58 imirkin: i meant pre-ssa, should not stick in modifiers
14:05 karolherbst: imirkin: 64bit types
14:05 karolherbst: we never emit a 64 bit neg when coming from tgsi
14:05 karolherbst: also, OP_NEG lowering is done by using OP_SUB
14:06 karolherbst: which... we want to replace by OP_ADD+OP_NEG
14:06 karolherbst: but yeah, might make sense to move the lowering of OP_NEG somewhere else
14:09 imirkin: that sort of "optimizing" lowering should be done post-RA
14:09 imirkin: or at least in the LegalizeSSA pass
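The OP_SUB replacement discussed above can be sketched roughly as follows. This is only a toy model in Python, not nv50_ir code: the `Insn` record, the `lower_sub` helper, and the `%negN` temporary naming are all hypothetical. It emits an explicit negate instruction instead of a source modifier, matching the point that modifiers should not be stuck in pre-SSA.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    """Hypothetical three-address instruction: op, destination, sources."""
    op: str
    dst: str
    srcs: tuple


def lower_sub(insns):
    """Rewrite `sub d, a, b` into `neg t, b` followed by `add d, a, t`."""
    out = []
    tmp_count = 0
    for insn in insns:
        if insn.op == "sub":
            a, b = insn.srcs
            tmp = f"%neg{tmp_count}"  # fresh temporary (naming is an assumption)
            tmp_count += 1
            out.append(Insn("neg", tmp, (b,)))
            out.append(Insn("add", insn.dst, (a, tmp)))
        else:
            out.append(insn)
    return out
```

A later pass (post-RA in the discussion above) would be the place to fold the explicit negate into a source modifier where the hardware allows it.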
14:20 imirkin: mupuf: could you have a look at https://bugs.freedesktop.org/show_bug.cgi?id=109613 ?
14:20 mupuf: imirkin: hmm.... ok, will check it out
14:21 imirkin: his vbios is pretty wild... 16 connectors!
14:21 imirkin: i wonder if some of the values aren't bogus
19:20 karolherbst: mhhh
19:20 karolherbst: total instructions in shared programs : 9649878 -> 9571262 (-0.81%)
19:20 karolherbst: total gprs used in shared programs : 1058330 -> 1050152 (-0.77%)
19:20 karolherbst: HdkR: ^^
19:25 karolherbst: that's for moving immediates into the driver const buf if they can't be load propagated
19:26 karolherbst: not fully implemented, just the compiler bits to be able to tell how big of a difference that is
19:26 karolherbst: apparently the difference is huge
19:59 HdkR: karolherbst: Nice!
21:06 karolherbst: HdkR: mhh, I think I will implement that for maxwell+ only for now. Way easier to just use a new const buffer than trying to fit it into the driver or uniform one :/
21:07 HdkR: Makes sense
21:12 karolherbst: mhh, for compute shaders we have the issue anyway, sadly :/
21:14 HdkR: Which issue?
21:15 karolherbst: only 8 const buffers
21:15 HdkR: ah, yea
21:33 karolherbst: huh, can fermi take more than 8 cbs in compute shaders?
21:39 karolherbst: ufff
21:39 karolherbst: HdkR: I have to create one buffer for all shader stages :/
21:40 karolherbst: now that's annoying
21:40 karolherbst: or I just limit it to 4k per stage
21:40 karolherbst: and offset it
21:42 karolherbst: mhhhh
21:42 karolherbst: or we just have _one_ driver wide buffer
21:42 karolherbst: and put whatever we get in there
21:42 karolherbst: and just append new values if we find new ones
21:43 karolherbst: we have space for 16384 32 bit immediates on maxwell
21:43 HdkR: could do :P
21:43 karolherbst: and all the short ones are already loaded
21:43 karolherbst: I mean short imms
21:43 karolherbst: so... we really only have to cover like 2**12 immediates anyway
21:44 karolherbst: and if we run out of space, we simply run out...
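The "one driver-wide buffer, append when we find a new value" idea can be sketched as below. This is an illustration under assumptions, not the driver's implementation: `ImmediatePool`, `offset_for`, and `f32_bits` are hypothetical names, and the default 64 KiB capacity simply corresponds to the 16384 32-bit immediates mentioned above.

```python
import struct


class ImmediatePool:
    """Append-only pool of 32-bit immediates backed by one constant buffer."""

    def __init__(self, capacity_bytes=64 * 1024):
        self.capacity_bytes = capacity_bytes
        self.offsets = {}        # raw 32-bit pattern -> byte offset in the buffer
        self.data = bytearray()  # contents that would be uploaded to the buffer

    def offset_for(self, bits):
        """Return the byte offset of `bits`, appending it if it is new.

        Returns None when the pool is full, in which case the caller
        would keep materializing the immediate with a mov.
        """
        bits &= 0xFFFFFFFF
        if bits in self.offsets:
            return self.offsets[bits]
        if len(self.data) + 4 > self.capacity_bytes:
            return None
        offset = len(self.data)
        self.data += struct.pack("<I", bits)
        self.offsets[bits] = offset
        return offset


def f32_bits(value):
    """Raw IEEE-754 bit pattern of a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]
```

Keying on the raw bit pattern means a float and an integer immediate with the same encoding share one slot.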
21:44 karolherbst: biggest issue is just caching the shader, but we don't do that for nv50ir anyway
21:44 karolherbst: we only cache the TGSI I think?
21:44 karolherbst: or do we cache the SASS binary as well?
21:44 karolherbst: no idea
21:45 karolherbst: mhh, that optimization is slowly getting super complex...
21:52 HdkR: Nothing unusual then :P
22:06 karolherbst: HdkR: that one is more annoying than the others :p
22:08 HdkR: Gotta get those shader size saving though :P
22:09 karolherbst: well, the size saving doesn't matter that much :p
22:09 karolherbst: some movs optimized away, so what
22:09 karolherbst: but the gprs usage drop is nice
22:10 HdkR: True, large shaders would end up just having improved occupancy :D
22:13 karolherbst: well, the reduction in mov instructions can be nice though
22:17 karolherbst: HdkR: https://cgit.freedesktop.org/mesa/mesa/commit/?id=d346b8588c36949695f2b01ca76619e84754dd50
22:17 karolherbst: roughly 2% more perf in pixmark_piano
22:18 karolherbst: and that's just due to making use of the limm fma form
22:19 HdkR: ah
22:19 HdkR: If only every instruction had a large immediate form :P
22:21 HdkR: You also gain the same optimization for the other instructions by putting them into a UBO, so it isn't the biggest problem. Just need to write that optimization
22:21 karolherbst: yeah...
22:21 karolherbst: the compiler bits are trivial
22:21 karolherbst: what is not trivial, is the integration part
22:22 HdkR: Sounds like my life
22:35 karolherbst: uff, today was the last day of february
22:36 HdkR: Oh, so it was
22:55 karolherbst: each cb can hold up to 64kb of data, right?
23:15 HdkR: karolherbst: Correct
23:16 HdkR: Gives you quite a bit of room to stuff immediates
23:17 karolherbst: yeah
23:17 karolherbst: I think I will limit those to 4k
23:17 karolherbst: seems enough
23:17 karolherbst: 1k wasn't
23:17 HdkR: If a shader has 4k unique immediates then that shader is an asshole and can materialize them
23:19 karolherbst: bytes
23:19 karolherbst: not immediates
23:19 karolherbst: ;)
23:19 HdkR: Ah, so 1k immediates then
23:19 HdkR: :P
23:19 karolherbst: well there are shaders with more than 256 long immediates
23:20 karolherbst: but I don't check for unique ones
23:20 karolherbst: maybe I should?
23:20 karolherbst: dunno
23:20 HdkR: ah, yea
23:20 karolherbst: first I want to get something working
23:20 HdkR: Do unique checking after the initial implementation
23:20 karolherbst: at least the offsetting works now
23:20 karolherbst: "max ftz f32 $r6 $r6 c16[0x1008]" :)
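The per-stage offsetting behind an operand like c16[0x1008] can be illustrated with a small sketch. The 4k window size comes from the discussion above; the helper names and the reading of 0x1008 as "stage window 1, third 32-bit slot" are assumptions, since the log does not say which stage that shader belonged to.

```python
STAGE_WINDOW_BYTES = 4 * 1024  # 4k of immediates per shader stage


def immediate_offset(stage_index, slot):
    """Byte offset of 32-bit immediate number `slot` in a stage's window."""
    if not 0 <= slot < STAGE_WINDOW_BYTES // 4:
        raise ValueError("slot does not fit in the per-stage window")
    return stage_index * STAGE_WINDOW_BYTES + slot * 4


def split_offset(byte_offset):
    """Inverse mapping: (stage_index, byte offset within that stage's window)."""
    return divmod(byte_offset, STAGE_WINDOW_BYTES)
```

With these assumptions, `immediate_offset(1, 2)` gives 0x1008, and each window holds 1024 32-bit immediates.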
23:20 HdkR: Ends up being the case that people do a #define in the shader for the immediate and use it in multiple locations
23:21 karolherbst: right
23:21 karolherbst: not that a 20k upload per shader invocation matters all that much
23:22 karolherbst: but probably better if we would be able to reduce it
23:22 HdkR: 4k lets you keep it in a page which should be quite quick
23:22 karolherbst: mhh, right
23:22 karolherbst: although, it doesn't matter for const buffers
23:22 karolherbst: or does it?
23:23 karolherbst: it's already cached
23:23 HdkR: Depends on where you cache the data
23:23 karolherbst: and for uploading it doesn't really matter all that much anyway
23:23 karolherbst: HdkR: const buffer...
23:23 karolherbst: const buffers are as fast to read as registers
23:24 HdkR: I mean, where the data backing ends up being. If you have it stored in vram constantly, or are changing it from system ram whenever you swap shaders
23:24 karolherbst: well, that's kind of the issue, right?
23:24 karolherbst: generally
23:24 HdkR: It's either paying an upload time cost or a vram cost for that range for each shader :P
23:24 karolherbst: you can't really do much about that because you have to swap it whenever you bind a new shader
23:25 HdkR: It's flexible enough for doing whatever you want
23:26 karolherbst: if you think about shader caches and SSO?
23:26 HdkR: SSO?
23:26 karolherbst: separate shader objects
23:26 karolherbst: you kind of have to reupload it on every bind
23:26 karolherbst: you can't get around that
23:27 HdkR: Right
23:27 karolherbst: or we just check what are the most common immediates and just use it for those?
23:27 karolherbst: then we never have to rebind
23:28 HdkR: Eh. You'll end up with some overlap but a significant portion will be unique per shader/program
23:29 karolherbst: question is: do we care all that much?
23:29 karolherbst: if we can cover like 95% of all immediates with a static list, that would be fine
23:29 karolherbst: think about constants like PI or e
23:30 HdkR: Not sure if it'll end up being that great, but maybe. Once you have an implementation you can start doing analysis on it at least :P
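The kind of analysis suggested here, checking how much a static list of common immediates would cover, could look roughly like this. It is a sketch under assumptions: `static_list_coverage` is a hypothetical helper, and the input is assumed to be one list of 32-bit immediate bit patterns per shader, collected by some other tool.

```python
from collections import Counter


def static_list_coverage(shader_immediates, list_size):
    """Pick the `list_size` most common immediates and report their coverage.

    `shader_immediates` is an iterable of per-shader lists of immediates.
    Returns (chosen_immediates, fraction of all immediate uses they cover).
    """
    counts = Counter(imm for shader in shader_immediates for imm in shader)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(list_size)
    covered = sum(n for _, n in top)
    return [imm for imm, _ in top], covered / total
```

Running this for a few list sizes over a shader corpus would show whether a figure like 95% is reachable before committing to the static-list design.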