02:44imirkin: karolherbst: no - it's not always legal
02:47imirkin: (about sticking the neg modifier up-front)
10:22karolherbst: imirkin: yeah... we've got a problem then
10:22karolherbst: can't just remove OP_SUB
10:22karolherbst: unless we move the lowering of NEG somewhere else
13:57imirkin: karolherbst: OP_ADD + OP_NEG
13:58imirkin: karolherbst: if it's the post-ra lowering, then it's fine to do the modifier thing
13:58imirkin: i meant pre-ssa, should not stick in modifiers
14:05karolherbst: imirkin: 64bit types
14:05karolherbst: we never emit a 64 bit neg when coming from tgsi
14:05karolherbst: also, OP_NEG lowering is done by using OP_SUB
14:06karolherbst: which... we want to replace by OP_ADD+OP_NEG
14:06karolherbst: but yeah, might make sense to move the lowering of OP_NEG somewhere else
14:09imirkin: that sort of "optimizing" lowering should be done post-RA
14:09imirkin: or at least in the LegalizeSSA pass
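A minimal sketch of the rewrite discussed above: a subtraction is replaced by an explicit negate followed by an add, kept as two separate instructions so that no source modifier is folded in early. The `Instr` class, `lower_sub`, and the `new_temp` callback are hypothetical names for illustration only and are not nv50_ir code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Toy three-address instruction (hypothetical, not the nv50_ir class)."""
    op: str      # "add", "sub" or "neg"
    dst: str     # destination value name
    srcs: tuple  # source value names


def lower_sub(instr, new_temp):
    """Rewrite `dst = a - b` as `tmp = neg b; dst = a + tmp`.

    The negation stays a separate instruction instead of becoming a
    source modifier on the add, so a later pass can decide whether
    folding it is legal.
    """
    if instr.op != "sub":
        return [instr]
    a, b = instr.srcs
    tmp = new_temp()
    return [
        Instr("neg", tmp, (b,)),
        Instr("add", instr.dst, (a, tmp)),
    ]
```

A later pass (post-RA in the discussion above) would be the place to fold the `neg` into a modifier where the target allows it.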
14:20imirkin: mupuf: could you have a look at https://bugs.freedesktop.org/show_bug.cgi?id=109613 ?
14:20mupuf: imirkin: hmm.... ok, will check it out
14:21imirkin: his vbios is pretty wild... 16 connectors!
14:21imirkin: i wonder if some of the values aren't bogus
19:20karolherbst: total instructions in shared programs : 9649878 -> 9571262 (-0.81%)
19:20karolherbst: total gprs used in shared programs : 1058330 -> 1050152 (-0.77%)
19:20karolherbst: HdkR: ^^
19:25karolherbst: that's for moving immediates into the driver const buf if they can't be load propagated
19:26karolherbst: not fully implemented, just the compiler bits to be able to tell how big of a difference that is
19:26karolherbst: apparently the difference is huge
19:59HdkR: karolherbst: Nice!
21:06karolherbst: HdkR: mhh, I think I will implement that for maxwell+ only for now. Way easier to just use a new const buffer than trying to fit it into the driver or uniform one :/
21:07HdkR: Makes sense
21:12karolherbst: mhh, for compute shaders we have the issue anyway, sadly :/
21:14HdkR: Which issue?
21:15karolherbst: only 8 const buffers
21:15HdkR: ah, yea
21:33karolherbst: huh, can fermi take more than 8 cbs in compute shaders?
21:39karolherbst: HdkR: I have to create one buffer for all shader stages :/
21:40karolherbst: now that's annoying
21:40karolherbst: or I just limit it to 4k per stage
21:40karolherbst: and offset it
21:42karolherbst: or we just have _one_ driver wide buffer
21:42karolherbst: and put whatever we get in there
21:42karolherbst: and just append new values if we find new ones
21:43karolherbst: we have space for 16384 32 bit immediates on maxwell
21:43HdkR: could do :P
21:43karolherbst: and all the short ones are already loaded
21:43karolherbst: I mean short imms
21:43karolherbst: so... we really only have to cover like 2**12 immediates anyway
21:44karolherbst: and if we run out of space, we simply run out...
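A rough sketch of the single driver-wide buffer idea, assuming an append-only table keyed by the 32-bit value. 16384 slots of 4 bytes each is 64 KiB. The `ImmediatePool` class and its `offset_for` method are hypothetical names, not existing driver code, and returning `None` when full stands in for "we simply run out".

```python
IMM_SIZE_BYTES = 4      # one 32-bit immediate
MAX_IMMEDIATES = 16384  # slot count mentioned in the discussion
POOL_SIZE_BYTES = MAX_IMMEDIATES * IMM_SIZE_BYTES  # 65536 bytes = 64 KiB


class ImmediatePool:
    """Append-only pool of 32-bit immediates (hypothetical sketch)."""

    def __init__(self, capacity=MAX_IMMEDIATES):
        self.capacity = capacity
        self.offsets = {}  # immediate value -> byte offset in the buffer

    def offset_for(self, value):
        """Return the byte offset of `value`, appending it if it is new.

        Returns None once the pool is full; the caller then has to
        materialize the immediate in a register as before.
        """
        value &= 0xFFFFFFFF
        if value in self.offsets:
            return self.offsets[value]
        if len(self.offsets) >= self.capacity:
            return None
        offset = len(self.offsets) * IMM_SIZE_BYTES
        self.offsets[value] = offset
        return offset
```

Because entries are never moved or removed, an offset handed out once stays valid for every shader compiled later.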
21:44karolherbst: biggest issue is just caching the shader, but we don't do that for nv50ir anyway
21:44karolherbst: we only cache the TGSI I think?
21:44karolherbst: or do we cache the SASS binary as well?
21:44karolherbst: no idea
21:45karolherbst: mhh, that optimization is slowly getting super complex...
21:52HdkR: Nothing unusual then :P
22:06karolherbst: HdkR: that one is more annoying than the others :p
22:08HdkR: Gotta get those shader size saving though :P
22:09karolherbst: well, the size saving doesn't matter that much :p
22:09karolherbst: some movs optimized away, so what
22:09karolherbst: but the gprs usage drop is nice
22:10HdkR: True, large shaders would end up just having improved occupancy :D
22:13karolherbst: well, the reduction in mov instruction can be nice though
22:17karolherbst: HdkR: https://cgit.freedesktop.org/mesa/mesa/commit/?id=d346b8588c36949695f2b01ca76619e84754dd50
22:17karolherbst: roughly 2% more perf in pixmark_piano
22:18karolherbst: and that's just due to making use of the limm fma form
22:19HdkR: If only every instruction had a large immediate form :P
22:21HdkR: You also gain the same optimization for the other instructions by putting them into a UBO, so it isn't the biggest problem. Just need to write that optimization
22:21karolherbst: the compiler bits are trivial
22:21karolherbst: what is not trivial, is the integration part
22:22HdkR: Sounds like my life
22:35karolherbst: uff, today was the last day of february
22:36HdkR: Oh, so it was
22:55karolherbst: each cb can hold up to 64kb of data, right?
23:15HdkR: karolherbst: Correct
23:16HdkR: Gives you quite a bit of room to stuff immediates
23:17karolherbst: I think I will limit those to 4k
23:17karolherbst: seems enough
23:17karolherbst: 1k wasn't
23:17HdkR: If a shader has 4k unique immediates then that shader is an asshole and can materialize them
23:19karolherbst: not immediates
23:19HdkR: Ah, so 1k immediates then
23:19karolherbst: well there are shaders with more than 256 long immediates
23:20karolherbst: but I don't check for unique ones
23:20karolherbst: maybe I should?
23:20HdkR: ah, yea
23:20karolherbst: first I want to get something working
23:20HdkR: Do unique checking after the initial implementation
23:20karolherbst: at least the offsetting works now
23:20karolherbst: "max ftz f32 $r6 $r6 c16[0x1008]" :)
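A small sketch of how an offset like `0x1008` could come out of a 4k-per-stage layout: each stage gets a 0x1000-byte window and each immediate takes 4 bytes. The stage index and slot number used here are assumptions; the log only shows the final operand. `imm_offset` and `format_operand` are hypothetical helpers.

```python
STAGE_WINDOW_BYTES = 0x1000  # 4k reserved per shader stage
IMM_SIZE_BYTES = 4           # one 32-bit immediate


def imm_offset(stage_index, slot):
    """Byte offset of immediate `slot` inside the window of `stage_index`."""
    if slot < 0 or slot * IMM_SIZE_BYTES >= STAGE_WINDOW_BYTES:
        raise ValueError("slot does not fit in the per-stage window")
    return stage_index * STAGE_WINDOW_BYTES + slot * IMM_SIZE_BYTES


def format_operand(cb_index, offset):
    """Format a const buffer operand the way it appears in the log."""
    return f"c{cb_index}[{offset:#x}]"


# Assumed example: stage window 1, third immediate (slot 2).
print(format_operand(16, imm_offset(1, 2)))
```

With 4 bytes per entry, a 4k window holds 1024 immediates per stage.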
23:20HdkR: Ends up being the case that people do a #define in the shader for the immediate and use it in multiple locations
23:21karolherbst: not that a 20k upload per shader invocation matters all that much
23:22karolherbst: but probably better if we would be able to reduce it
23:22HdkR: 4k lets you keep it in a page which should be quite quick
23:22karolherbst: mhh, right
23:22karolherbst: although, it doesn't matter for const buffers
23:22karolherbst: or does it?
23:23karolherbst: it's already cached
23:23HdkR: Depends on where you cache the data
23:23karolherbst: and for uploading it doesn't really matter all that much anyway
23:23karolherbst: HdkR: const buffer...
23:23karolherbst: const buffers are as fast to read as registers
23:24HdkR: I mean, where the data backing ends up being. If you have it stored in vram constantly, or are changing it from system ram whenever you swap shaders
23:24karolherbst: well, that's kind of the issue, right?
23:24HdkR: It's either paying an upload time cost or a vram cost for that range for each shader :P
23:24karolherbst: you can't really do much about that because you have to swap it whenever you bind a new shader
23:25HdkR: It's flexible enough for doing whatever you want
23:26karolherbst: if you think about shader caches and SSO?
23:26karolherbst: separate shader objects
23:26karolherbst: you kind of have to reupload it on every bind
23:26karolherbst: you can't get around that
23:27karolherbst: or we just check what are the most common immediates and just use it for those?
23:27karolherbst: then we never have to rebind
23:28HdkR: Eh. You'll end up with some overlap but a significant portion will be unique per shader/program
23:29karolherbst: question is: do we care all that much?
23:29karolherbst: if we can cover like 95% of all immediates with a static list, that would be fine
23:29karolherbst: think about constants like PI or e
23:30HdkR: Not sure if it'll end up being that great, but maybe. Once you have an implementation you can start doing analysis on it at least :P
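One way the analysis mentioned above could be sketched: count how often each long immediate is used across a shader collection and check what fraction of uses a fixed-size static list would cover. The function name `static_list_coverage` and the input format (a flat list of 32-bit values, one per use) are assumptions; no real shader data is implied.

```python
from collections import Counter


def static_list_coverage(immediate_uses, list_size):
    """Pick the `list_size` most used immediates and report their coverage.

    Returns (static_list, fraction_of_uses_covered).
    """
    counts = Counter(immediate_uses)
    top = counts.most_common(list_size)
    static_list = [value for value, _ in top]
    covered = sum(count for _, count in top)
    total = sum(counts.values())
    return static_list, (covered / total if total else 0.0)
```

Running this with increasing `list_size` over a shader-db style corpus would show whether something like the 95% figure is reachable with a list small enough to never rebind.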