00:37 imirkin: anholt: hmmm ... 20% is a bit more than i expected
00:37 imirkin: not horrible, but more than i expected.
00:38 imirkin: maybe given that shaders get cached it's all good? i def remember that some game load times were highly compiler-bound, but that was before the disk cache days
02:15 karolherbst: imirkin: there is one thing we can do to speed up stuff and that is implementing the finalize_nir callback, which means that st/mesa can be better at caching. There are even more nifty things we can do there: extracting constants into a ubo0 for more optimizations
02:15 imirkin: karolherbst: that's a "shared" opt though
02:15 imirkin: (the ubo immediates thing ... there used to be the beginnings of support for something like that, but i nuked it)
02:16 karolherbst: sure, but we do tell which constants we want st/mesa to put into ubo0
02:16 imirkin: i assume we're talking about the same thing
02:16 karolherbst: yeah, but immediates aren't perfect
02:17 karolherbst: it's useful for things like fmad
02:17 imirkin: right
02:17 karolherbst: I wrote stuff once and got pretty huge gains
02:17 karolherbst: like 1-2% less instructions
02:17 karolherbst: I think..
02:17 karolherbst: could be 0.5%
02:17 imirkin: yeah
02:18 imirkin: anyways, i was talking more about compile time
02:18 karolherbst: I know
02:19 imirkin: basically we have to decide whether we care about a 20% compile time regression
02:19 karolherbst: finalize_nir helps with reducing that afaik
02:19 imirkin: ok. i know nothing about that.
02:19 imirkin: this is on a cold cache
02:19 karolherbst: it's essentially st/mesa doing the opts instead of the driver
02:19 karolherbst: so st/mesa caches post driver opts
02:20 karolherbst: and well.. we don't have to do it as it's already optimized and we won't have to cache it ourselves
02:21 karolherbst: well.. we still have to do our backend stuff, but we'd skip calling nir passes
02:21 karolherbst: imirkin: ahh... btw it was really closer to 0.5%: https://github.com/karolherbst/mesa/commit/ee35bee0c4ed55c5cce5615297847e3fa29d136e
02:21 karolherbst: but that was with a constant constant table
02:22 imirkin: ah right
02:22 imirkin: that was the "baby" version
02:22 karolherbst: yeah... I did a real one and I think it wasn't really much better
02:22 imirkin: sounds familiar
02:22 karolherbst: it does come with the overhead of rebinding ubos and stuff
02:22 imirkin: + overhead of uploading all the stuff
02:23 imirkin: right
02:23 karolherbst: yep
02:23 karolherbst: mhh I also have this version: https://github.com/karolherbst/mesa/commit/a1f4372ea8d001e873368ffc129a80b682723d9b
02:23 karolherbst: ehh that one is newer
02:23 karolherbst: weird
02:23 karolherbst: even though the other one was on a _v2 branch
02:24 karolherbst: I think doing it for everything was like -0.7%
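A minimal sketch of the immediate-to-ubo0 idea being discussed here, assuming a hypothetical driver IR — none of the types, capacities, or the 20-bit inline window below are codegen's, they only stand in for whatever the real encodings allow:

```c
/* Hypothetical sketch: collect immediates that don't fit the inline
 * encoding into a table the driver uploads as constant buffer 0, so the
 * instruction can read cb[0][slot] instead of carrying a wide literal. */
#include <stdint.h>
#include <stdbool.h>

struct imm_table {
   uint32_t data[256];   /* payload later uploaded to ubo0 (no overflow
                          * handling, for brevity) */
   unsigned count;
};

/* Assumed inline-immediate window; the real rules depend on the opcode. */
static bool
fits_inline_encoding(uint32_t bits)
{
   int32_t s = (int32_t)bits;
   return s >= -(1 << 19) && s < (1 << 19);
}

/* Return a dword offset into ubo0 for this literal, deduplicating
 * identical bit patterns so repeated constants share a slot. */
static unsigned
imm_table_slot(struct imm_table *tab, uint32_t bits)
{
   for (unsigned i = 0; i < tab->count; i++)
      if (tab->data[i] == bits)
         return i;
   tab->data[tab->count] = bits;
   return tab->count++;
}
```

The driver would then rewrite each non-fitting source as a constant-buffer read and upload the table at bind time, which is where the rebinding and upload overhead mentioned above comes from.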
02:25 karolherbst: anyway... compile times with caching are kind of a moot point to discuss as games tend to load faster with caching enabled, even coming from a cold cache
02:26 karolherbst: shadow warrior is a good test, it took like 5 minutes to load...
02:26 karolherbst: with caching disabled
02:26 karolherbst: and like 1-2 with it enabled, and that was before nouveau did caching
02:51 mhenning: fwiw, nir_opt_large_constants also does an immediates -> ubo transform
02:53 imirkin: this is for all immediates
02:53 imirkin: basically certain immediates are less likely to fit into the encodings
02:54 anholt: mhenning: well, that one goes to a ubo-like data store that isn't represented in the state tracker
02:54 karolherbst: mhhh.. using nir_opt_large_constants could actually help though
02:54 karolherbst: but I don't like what the pass is doing
02:54 anholt: but it's an important one to enable in your native nir backend (aka make kerbal playable)
02:54 mhenning: anholt: right, I maybe wasn't clear with that
02:55 mhenning: requires additional driver support
02:55 anholt: intel had one that moved constants to uniforms at the mesa/st level, too.
02:55 karolherbst: afaik nir_opt_large_constants doesn't check if the data is accessed indirectly or not, but does it for everything of a certain size, no?
02:55 mhenning: that's what the comment says
02:55 anholt: it's "arrays of a certain size"
02:56 imirkin: we'd want something more general
02:56 karolherbst: yeah.. kind of pointless
02:56 karolherbst: what's bad about having many constants in the shader?
02:56 imirkin: actually freedreno does it
02:56 imirkin: but it has a more natural spot to stick those
02:56 karolherbst: right
02:56 karolherbst: I'd use nir_opt_large_constants if I can say "only for indirects"
02:56 anholt: karolherbst: if you load a pile of constants into an array, and the array doesn't get copy-propagated into just constant loads, then you probably would rather access the array from memory than access it from registers.
02:57 karolherbst: sure, but what if that doesn't happen?
02:57 karolherbst: what if it's just accessed normally
02:58 anholt: if your array got copy-propagated out into normal immediate accesses of the components, then the array gets DCEed and the pass doesn't trigger
02:58 karolherbst: mhh, okay
02:58 anholt: so, effectively, the pass only triggers for indirects assuming that you haven't turned off copy prop
02:58 karolherbst: yeah okay, that makes more sense then
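A toy restatement of that argument in C, with made-up IR types — only an array that still sees a non-constant index after copy-prop and DCE is worth moving to a ubo-like store, which is what effectively makes the pass indirect-only:

```c
/* Sketch of the reasoning above: constant-indexed loads from a constant
 * array get folded to plain immediates by copy-prop, the dead array is
 * removed by DCE, and only arrays that still see an indirect index are
 * worth moving out to a ubo-like store. All types are illustrative. */
#include <stdbool.h>

struct array_access {
   bool index_is_const;   /* literal index vs. computed index */
};

struct const_array {
   const struct array_access *accesses;
   unsigned num_accesses;
   unsigned size_bytes;
};

static bool
should_move_to_ubo(const struct const_array *arr, unsigned size_threshold)
{
   if (arr->size_bytes < size_threshold)
      return false;

   /* If every access has a constant index, copy-prop + DCE will make the
    * array disappear entirely, so there is nothing left to move. */
   for (unsigned i = 0; i < arr->num_accesses; i++)
      if (!arr->accesses[i].index_is_const)
         return true;

   return false;
}
```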
03:11 karolherbst: ehhh.. I wish we would fix fma in mesa
03:11 karolherbst: have to do it for CL anyway
03:11 karolherbst: CL needs _real_ fma, not the fake fma GLSL needs
03:12 imirkin: anholt: btw, just want to verify ... the testing you did was on a regular "plain" x86 box, right?
03:12 imirkin: (the perf testing, that is)
03:12 anholt: it was an amd64 with a lot of cpus, but other than that normal
03:13 imirkin: ok, many-cpu shouldn't really matter here
03:13 karolherbst: anholt: thanks for working on that btw
03:13 karolherbst: anholt: btw.. I do have a collection of... a lot of shaders if you are interested
03:14 anholt: I have the collection we came up with at intel, which includes things like "approximately everything linux-native on steam"
03:14 imirkin: that's ... a strong collection
03:14 karolherbst: that sounds even better than what I have
03:14 anholt: also have a pile of android from robclark
03:15 karolherbst: might want to throw that ntt stuff at those shaders, because the shaders in shader-db are... well...
03:15 karolherbst: games do weird things, so it might be good to catch regressions there
03:15 anholt: that collection is what I used.
03:16 karolherbst: ahh, cool
03:17 imirkin: it's almost like this isn't anholt's first rodeo...
03:17 imirkin: ;)
03:17 karolherbst: true...
03:18 karolherbst: at some point I will get back to improving the nir stuff in nouveau, when perf actually starts to matter
03:18 karolherbst: also loop merging.. this will be fun, unless somebody beats me to it
03:18 airlied: probably start with a new register allocator :-P
03:19 imirkin: RA is actually pretty good...
03:19 karolherbst: airlied: I can show you my todo list of shame
03:19 karolherbst: imirkin: it's broken
03:19 imirkin: meh
03:19 karolherbst: not code broken
03:19 karolherbst: the algo is broken
03:19 airlied: imirkin: the nvidia hw would be quite suitable for the new SSA regallocs that radv and turnip use
03:19 imirkin: oh?
03:19 karolherbst: well code too, but the algo is
03:19 karolherbst: imirkin: yes
03:19 imirkin: airlied: nouveau uses a SSA regalloc...
03:20 airlied: imirkin: oh okay, wasn't aware it did that already
03:20 imirkin: nouveau had by far the most advanced backend
03:20 karolherbst: so.. I have a shader which uses ~20 values and we fail to spill
03:20 imirkin: others have caught up by now
03:20 imirkin: karolherbst: with function calls?
03:20 karolherbst: and debugging RA convinced me that the algo is fundamentally broken
03:20 karolherbst: imirkin: nothing
03:20 karolherbst: no branching
03:20 imirkin: uhhhh
03:21 karolherbst: no weirdo things
03:21 karolherbst: just plain assigns
03:21 imirkin: i don't believe you :p
03:21 imirkin: i've never seen anything like that
03:21 karolherbst: I can show you the TGSI
03:21 imirkin: please do
03:21 karolherbst: if I find it
03:21 imirkin: is it with the vec3 texture results?
03:21 imirkin: if so, i know that one :)
03:21 imirkin: basically a _ton_ of texture().xyz's get stored up
03:21 karolherbst: imirkin: https://gist.github.com/karolherbst/1597d06d931cf649c7ac1d9bb14e0b24
03:21 imirkin: and overflow the regs
03:22 imirkin: but the RA spill logic doesn't deal with it properly
03:22 imirkin: that ... is not this case
03:22 imirkin: karolherbst: for what target chipset?
03:22 karolherbst: let me try
03:22 karolherbst: imirkin: try e6
03:22 imirkin: k
03:22 mhenning: imirkin: we do RA in SSA, but it's not a decoupled RA like radv/turnip
03:22 karolherbst: imirkin: it's funny it even tries to spill :)
03:23 imirkin: interesting - this has a ton of outputs
03:23 imirkin: which are obviously unspillable
03:23 karolherbst: imirkin: yeah soo
03:23 karolherbst: RA sees those scalar values all as vec4s
03:23 karolherbst: sooo
03:23 karolherbst: it needs more than 64 regs
03:23 karolherbst: but of course there is nothing to spill
03:23 karolherbst: so it fails
03:23 imirkin: "those scalar values"?
03:23 karolherbst: ssa values used
03:24 karolherbst: most of them are getting treated as full vec4 values by RA
03:24 karolherbst: with just one live component
03:24 imirkin: i can check it out.
03:24 karolherbst: it's shitty
03:24 karolherbst: yea...
03:24 karolherbst: but I am convinced that our coloring stuff is just broken
03:24 imirkin: i believe you that there are some weird cases
03:24 imirkin: this sounds like one of them
03:24 karolherbst: as this is just an implication of the algorithm we use
03:24 karolherbst: yeah, because our RA is broken
03:25 imirkin: i'll double check to see if we do something particularly dumb
03:25 karolherbst: yes, the algo is dumb :p
03:25 imirkin: the times i've investigated issues
03:25 imirkin: it was because of largely unforced errors, not any core failure
03:25 karolherbst: the code is fine
03:25 karolherbst: yeah
03:25 karolherbst: this one is different
03:25 imirkin: but like i said, i'll double-check what's going on here
03:25 karolherbst: have fun
03:25 imirkin: not right this second, but ... soonish
03:25 karolherbst: maybe you come to a different conclusion
03:26 karolherbst: regardless I'd still like to use the shared reg alloc in mesa and see if it works as well or better and ditch ours
03:26 karolherbst: less code to deal with
03:27 karolherbst: and better support for non 32 bit sizes
03:27 imirkin: i don't think anyone uses that one anymore?
03:27 karolherbst: etnaviv, lima, r300 and vc4 I see including it
03:28 anholt: intel and various other drivers do. but if you have an ssa regalloc I wouldn't recommend moving to the graph coloring
03:28 karolherbst: is there something newer and better than register_allocate.h?
03:28 karolherbst: ahh
03:28 anholt: radv/turnip ssa regalloc is the newer/better thing
03:28 karolherbst: ahh right
03:28 anholt: but unfortunately there's no shared infra, and it's hard to write any.
03:28 karolherbst: I see
03:29 karolherbst: sad
03:29 imirkin: nvidia doesn't actually have a ton of weirdness
03:29 imirkin: our RA supports a bunch of little nice things
03:29 karolherbst: I thought util/register_allocate.h could be used with a SSA form shader, but maybe I was wrong
03:29 karolherbst: this saddens me deeply
03:29 anholt: the really cool thing that radv/turnip got was figuring out if spilling was necessary and doing all the spills in a single pass, rather than the incremental disaster of our graph-coloring RA
03:30 imirkin: like allowing spilling of e.g. flags/predicates to registers rather than local memory
03:30 anholt: util/ra is very much not for ssa
03:30 karolherbst: imirkin: well.. you could do that with util/register_allocate.h afaik
03:30 karolherbst: anholt: :(
03:30 imirkin: we do single-ish pass spill ... we guess how many things to spill, spill them, and then hope that the spilling hasn't caused it to require more spilling
03:30 imirkin: (which we handle, but it leads down a path of sadness)
03:31 anholt: yeah, the neat thing about this ssa RA is that you get an exact decision for spills
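A toy illustration of why that exact decision is possible, assuming a straight-line program in SSA form where liveness can be described as simple intervals (a made-up representation, not any real allocator): the register demand is just the peak number of simultaneously live values, so whether spilling is needed at all is known before any colors are assigned.

```c
/* For straight-line SSA code, registers needed == maximum number of
 * simultaneously live values, so the spill decision can be made up
 * front. Interval-based toy representation; real liveness is per-CFG. */
#include <stddef.h>

struct live_range {
   unsigned def;       /* instruction index that defines the value */
   unsigned last_use;  /* last instruction index that reads it */
};

static unsigned
max_pressure(const struct live_range *ranges, unsigned n, unsigned num_instrs)
{
   unsigned peak = 0;
   for (unsigned ip = 0; ip < num_instrs; ip++) {
      unsigned live = 0;
      for (unsigned i = 0; i < n; i++)
         if (ranges[i].def <= ip && ip <= ranges[i].last_use)
            live++;
      if (live > peak)
         peak = live;
   }
   return peak;   /* spilling is needed iff peak > available registers */
}
```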
03:31 karolherbst: yeah.. not saying our RA code is problematic, I just say the algo is broken and we need a new one :)
03:31 karolherbst: and we kind of do too many things at once
03:32 karolherbst: sometimes I think what we need is to write a new backend compiler, not make the same mistakes, and go for a cleaner design
03:33 karolherbst: handling of textures is.... annoying, and arch specific stuff is spread out through the entire codebase
03:33 imirkin: mostly abstracted away behind the target
03:33 imirkin: or in target-specific lowering logic
03:33 karolherbst: sure, mostly
03:33 karolherbst: also that we have a strict pass ordering and everything
03:34 imirkin: that's the bit i like ;)
03:34 karolherbst: well I don't
03:34 imirkin: we don't have to iterate to a fixed point
03:34 imirkin: the passes are structured so that it all works out
03:34 karolherbst: well it doesn't
03:34 imirkin: it _mostly_ does
03:34 karolherbst: sure, _mostly_ :D
03:34 imirkin: the benefit of not looping
03:34 karolherbst: it's good enough
03:34 imirkin: is higher than the occasional instruction we'd save here and there
03:35 imirkin: at least IMO
03:35 karolherbst: but I am dreaming of a perfect backend compiler so "good enough" is not good enough
03:35 karolherbst: :P
03:35 karolherbst: yeah well
03:35 imirkin: perfect = slow as hell? :p
03:35 karolherbst: with shader caching all of that matters less
03:36 karolherbst: most time is spent on parsing glsl though
03:37 karolherbst: at least for GL
03:37 karolherbst: anyway.. the idea was not to have like a hundred passes looping over each other
03:38 karolherbst: but kind of to get a nir which is kind of good enough and do the final touches in our backend (like all the modifier business or whatever we should be doing) and just focus on what's actually specific to our hardware
03:38 imirkin: go for it
03:38 karolherbst: yeah...... at some point in the future :D
03:38 imirkin: i did spend a _lot_ of effort getting the codegen to be very good
03:38 karolherbst: sure, but it has its limits
03:38 imirkin: it's obv not perfect, but in most cases it's hard to manually improve
03:39 imirkin: in some cases ... it is ;)
03:39 karolherbst: yeah.. but the juicy bits are... impossible to do in codegen
03:39 karolherbst: try implementing loop merging in codegen
03:39 karolherbst: have fun :P
03:39 karolherbst: and we kind of need that for perf
03:40 karolherbst: also to prevent us from having to use the off chip stack
03:40 karolherbst: (which we don't, so we fail sometimes)
03:41 karolherbst: anyway.. no point in users caring as those things are usually only happening in games and stuff
03:43 airlied: how good is support for all those new instruction scheduling bits in the hw isa?
03:43 karolherbst: it's not a problem
03:44 airlied: do we schedule optimally
03:44 airlied: ?
03:44 karolherbst: the bigger problem is, that we don't schedule instructions :D
03:44 karolherbst: nope
03:44 karolherbst: not at all
03:44 karolherbst: but the scheduling bits are quite trivial, the compiler inserts scheduling information instead of the hw doing it itself
03:45 karolherbst: which we are good enough with
03:45 karolherbst: not perfect, but also not terrible
03:46 karolherbst: anyway, instruction scheduling is a problem for all gens
03:46 karolherbst: not just the new ones
03:46 karolherbst: airlied: so you are aware that the isa bits are just metadata, right?
03:46 karolherbst: they don't actually _do_ anything
03:46 karolherbst: you can't reorder stuff with it or anything
03:47 karolherbst: it just tells the hw to stall for x cycles or use a barrier to wait on stuff finishing up
03:48 mhenning: you do miscompile if you don't stall enough, which I'd argue counts as doing something
03:49 karolherbst: yeah well.. if you tell the hw to stall for 3 cycles even though it needs to stall for 4, yeah you are in trouble
03:49 karolherbst: but the hw can't reorder or do any funny business
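A rough sketch of what "the compiler inserts scheduling information" amounts to, with made-up types and latencies — this is not the real nouveau emitter, and the real sched fields also carry barrier and dependency information — given fixed latencies, each instruction gets a stall count so its producers have finished by the time it issues:

```c
#include <stdint.h>

#define MAX_SRCS 3

struct sched_instr {
   unsigned latency;                 /* cycles until the result is ready */
   int      src_producer[MAX_SRCS];  /* index of the defining instr, or -1 */
   unsigned num_srcs;
   uint8_t  stall;                   /* what gets encoded in the sched bits */
};

/* Assign per-instruction stall counts for a straight-line block
 * (assumes count <= 256, toy latencies, no dual issue or barriers). */
static void
assign_stalls(struct sched_instr *instrs, unsigned count)
{
   unsigned ready[256] = {0};   /* absolute cycle each result is available */
   unsigned cycle = 0;

   for (unsigned i = 0; i < count; i++) {
      unsigned earliest = cycle + 1;          /* at least one cycle to issue */
      for (unsigned s = 0; s < instrs[i].num_srcs; s++) {
         int p = instrs[i].src_producer[s];
         if (p >= 0 && ready[p] > earliest)
            earliest = ready[p];
      }
      instrs[i].stall = (uint8_t)(earliest - cycle);
      cycle = earliest;
      ready[i] = cycle + instrs[i].latency;
   }
}
```

Reordering so that independent work fills those gaps is what shrinks the stall counts, which is the scheduling that isn't being done yet.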
03:49 mhenning: anyway, I was looking at the stuff in nir_schedule and it might not be too hard to glue together to get some basic scheduling
03:49 karolherbst: yeah... probably
03:50 karolherbst: it's just that if we optimize the hell out of it in codegen, it's kind of pointless doing it in nir
03:50 karolherbst: it will help I am sure, but...
03:50 mhenning: yeah, it would be better to schedule later
03:50 karolherbst: hence me wanting to move most opts into nir, but then we still can do good opts
03:50 karolherbst: not sure what the state of source modifiers is in nir atm
03:50 anholt: source mods are still very much present, but we all kinda wish they weren't
03:51 karolherbst: mhhh
03:51 karolherbst: I wish they were though, so I can use them in nir and use nir_schedule :)
03:51 karolherbst: :P
03:51 anholt: there are some nice tools for helping you skip the nir lowering of the actual source mods, but tools for efficient sat handling in your backend are kinda missing
03:52 karolherbst: the most annoying bit are immediates though
03:52 anholt: mhenning: if you're looking at nir_schedule to help with register pressure, note that it's been tuned a bit for v3d, but those heuristics can be pretty tricky and it's easy to fall off a cliff. the next-step paper to implement for nir_schedule was "A randomized heuristic approach to register allocation"
03:52 karolherbst: modifiers we can _assume_ we can inline them
03:52 karolherbst: immediates not so much
03:52 anholt: (I believe the methods in that paper would avoid the cliffs I've seen)
03:52 karolherbst: for some instructions we only know after RA if we can use immediates
03:53 * anholt has spent too much time in the scheduler mines over the years
03:54 karolherbst: so for FMA on some gens we can use a 32 bit immediate if the dst and third source are the same register :)
03:54 karolherbst: otherwise we only have 20 bits?
03:54 karolherbst: or something
03:54 karolherbst: but of course we can always inline the ubo access
03:54 karolherbst: so that would be one way out
03:55 karolherbst: but yeah...
03:55 karolherbst: inlining stuff into the instruction is funny business on nv hw
03:55 karolherbst: stuff depends on the instruction, how big the offset is, if you have an indirect, etc...
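A hedged sketch of the kind of post-RA check being described, taking the dst-equals-addend rule and the immediate widths straight from the discussion above rather than from any ISA documentation — purely illustrative:

```c
/* Whether an fma-style op can carry its immediate inline depends on the
 * register assignment and on how wide the inline field is for that form.
 * The widths and the dst==addend rule here are assumptions, not an
 * authoritative ISA description. */
#include <stdint.h>
#include <stdbool.h>

struct ra_fma {
   unsigned dst_reg;
   unsigned src2_reg;     /* the addend operand */
   uint32_t imm_bits;     /* the immediate we'd like to inline */
};

static bool
imm_fits_signed(uint32_t bits, unsigned width)
{
   int32_t s = (int32_t)bits;
   return s >= -(1 << (width - 1)) && s < (1 << (width - 1));
}

static bool
fma_can_inline_imm(const struct ra_fma *op)
{
   /* assumed: the long-immediate form reuses the dst slot for the addend */
   if (op->dst_reg == op->src2_reg)
      return true;                          /* full 32-bit immediate */
   return imm_fits_signed(op->imm_bits, 20); /* short immediate window */
}
```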
03:56 karolherbst: so I kind of think we have to schedule instructions in the backend one way or the other anyway
03:57 mhenning: anholt: I was thinking of using it more for the actual scheduling part (using the callback to set the right instr latencies) as a poor replacement for an actual backend sched pass
03:57 karolherbst: mhenning: sure.. but what do you do with all those movs and load_ubos, etc.. which will just vanish
03:57 mhenning: but I might read that paper anyway :) I certainly read more papers than I implement
03:58 anholt: mhenning: I don't think I understand the backend scheduling needs enough to say if nir_schedule will be useful to you
03:58 karolherbst: maybe we can assume that all movs are gone
03:58 karolherbst: dunno
03:58 anholt: are you just looking for "hide the latency, the hw will stall if you don't fit stuff in between"?
03:58 mhenning: yes
03:58 anholt: ok, not the "if you read your stuff early, you get undef"
03:59 karolherbst: anholt: yeah no, we can do that as well, but it's different
03:59 karolherbst: we have to tell the hw how long to wait
03:59 karolherbst: and that depends on the instructions we've got
03:59 anholt: nir_schedule should be helpful for hiding latency, and has a knob for trying to balance "hide latency" vs "not too many regs thanks"
03:59 karolherbst: so the idea is to reorder, so we stall less
03:59 karolherbst: anholt: can we tell it to ignore load_ubos, load_imms, and all movs? :D
03:59 anholt: though it's kinda limited depending on how different your hw instructions are from the nir instructions
04:00 karolherbst: mhh though we can say those are 0 latency and 0 cycles needed or something?
04:00 anholt: karolherbst: you can set them to no latency, yeah.
04:00 karolherbst: okay
04:00 karolherbst: then maybe that's good enough
04:01 anholt: but I don't think there is "this instr will definitely not take up reg space for the register pressure vs latency decision"
04:01 karolherbst: if we can't inline those we have to stall a little, but not much, and it's still better than not scheduling
04:01 karolherbst: ahh..
04:01 karolherbst: so be it then
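The "treat movs, load_imms and load_ubos as free" idea could look roughly like the classification below; the callback shape is hypothetical and not nir_schedule's actual interface — only the zero-latency bucketing is the point:

```c
/* Anything the backend is expected to fold into its consumers shouldn't
 * be scheduled around; give it zero latency. Latency numbers are made up. */
enum toy_op {
   TOY_OP_MOV,
   TOY_OP_LOAD_IMM,
   TOY_OP_LOAD_UBO,
   TOY_OP_TEX,
   TOY_OP_ALU,
};

static unsigned
toy_instr_latency(enum toy_op op)
{
   switch (op) {
   case TOY_OP_MOV:
   case TOY_OP_LOAD_IMM:
   case TOY_OP_LOAD_UBO:
      return 0;    /* expected to be folded away by the backend */
   case TOY_OP_TEX:
      return 100;  /* long memory latency worth hiding */
   case TOY_OP_ALU:
   default:
      return 4;    /* short fixed ALU latency */
   }
}
```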
04:01 anholt: for handling that, I wanted to finish noltis, then have schedule only consider noltis-selected nodes for register pressure
04:02 anholt: had too much stupidly-regular hardware for noltis to show wins, though.
04:02 * mhenning does not know what noltis is
04:03 anholt: near-optimal linear-time instruction selection. "how do I decide how to map these nice clean little nir instrs to my complex instruction set?"
04:03 anholt: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/629
04:03 karolherbst: I am quite happy that our ISA is not complex
04:04 mhenning: oh, cool
04:04 anholt: should be useful for stuff like "I can only encode one immediate or one constant load per instr"
04:05 karolherbst: mhhh
04:05 karolherbst: we have that.. sometimes
04:05 karolherbst: and less in newer hardware
04:05 anholt: or choosing your fmas, as is so popular here in #nouveau
04:05 karolherbst: we don't support fma in mesa anyway :P
04:05 karolherbst: it's all something
04:05 karolherbst: but not fma
04:05 anholt: fmads
04:05 anholt: whatever
04:05 anholt: same thing
04:05 karolherbst: in mesa, yes
04:05 karolherbst: it's just annoying that for CL it's actually not :(
04:06 karolherbst: so we do have to emulate it
04:06 karolherbst: bye bye performance
04:06 imirkin: anholt: the distinction is that the backend will happily mess with MAD but won't touch FMA
04:06 karolherbst: imirkin: which... isn't correct for glsl
04:07 karolherbst: you can split glsl fmas if you want to
04:07 anholt: imirkin: I'm talking here about the process of turning muls+adds into whatever you call the thing is that is the 3-operand multiply-add
04:07 imirkin: right. so fma is not multiply + add.
04:07 imirkin: as you probably know :)
04:07 karolherbst: why not?
04:07 karolherbst: it actually can be that
04:07 imirkin: fma in nouveau is not that.
04:07 karolherbst: well it can be
04:08 karolherbst: on some hardware
04:08 imirkin: it can't
04:08 imirkin: (i mean, we can make the backend do whatever we like. but it's not the case today)
04:08 karolherbst: yeah okay, because we are more strict than the rest of mesa
04:08 anholt: OP_FMA does not do multiply and add?
04:08 imirkin: anholt: it's a fused multiply-add
04:08 imirkin: i.e. you get the extra bit of precision for the internal operation
04:09 karolherbst: fermi actually got fma and dropped mad, it's annoying, but there we are
04:09 anholt: well, this kind of transform is only something you'd do for !exact, so the distinction is not important then.
04:09 imirkin: whereas MAD is literally the same as multiply then add
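A small runnable illustration of that difference (build with -lm and with contraction disabled, e.g. -ffp-contract=off, so the compiler doesn't fuse the "unfused" line itself): the fused form rounds once, so it recovers the rounding error that the separate multiply already threw away.

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
   float a = 1.0f + FLT_EPSILON;      /* a*a is not exactly representable */
   float prod = a * a;                /* rounded product */

   float fused   = fmaf(a, a, -prod); /* exact a*a minus rounded a*a */
   float unfused = a * a - prod;      /* rounds a*a first, so this is 0 */

   printf("fused:   %g\n", fused);    /* ~1.42e-14, i.e. 2^-46 */
   printf("unfused: %g\n", unfused);  /* 0 */
   return 0;
}
```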
04:09 karolherbst: anholt: ehh, that's not true
04:09 karolherbst: you can split exact fmas
04:09 karolherbst: exact doesn't mean "it has to be fma"
04:10 karolherbst: the point is just that you have to be consistent
04:10 karolherbst: so if you split them all, you can split them all, even if they are exact
04:10 imirkin: that's the higher-level context
04:10 imirkin: however the way that nouveau backend treats it
04:10 imirkin: is that FMA is the fused thing
04:10 karolherbst: ohh sure
04:10 imirkin: and MAD can be whatever it likes and can be variously reassociated/combined/etc
04:11 karolherbst: yeah
04:11 karolherbst: and I think nir has to do the same thing
04:11 karolherbst: it needs ffma + fmad
04:11 karolherbst: there is no working solution with just one of those
04:11 karolherbst: so discussing fma is pointless with the status quo
04:11 imirkin: karolherbst: you could just have ffma and leave fmad alone
04:11 imirkin: and not try to "fuse" things into ffma
04:11 karolherbst: not going to work
04:12 imirkin: and so you just have fmul, fadd, and ffma
04:12 imirkin: where the latter means "fma for realz"
04:12 karolherbst: doesn't work
04:12 anholt: all I was saying was "if you have the problem of needing to turn muls and adds into a single 3-operand instr, then noltis can help with that." I was really not trying to open the "semantics of fma" thing.
04:12 karolherbst: for CL we need "real fma"
04:12 karolherbst: like real real
04:12 imirkin: karolherbst: yes. aka "ffma"
04:12 imirkin: (in my proposal)
04:12 karolherbst: anholt: I see
04:13 karolherbst: so like turning two iadds into an iadd3 :P
04:13 karolherbst: imirkin: I am still convinced we need both though as we do want to optimize fmul+fadd into fmad
04:13 karolherbst: and not lose context
04:13 imirkin: karolherbst: no fmad
04:13 imirkin: karolherbst: you just keep it as fmul + fadd
04:13 anholt: sure. these things look easy to do ad-hoc, but a basic greedy version of it leaves optimization on the table, and intel (for example) has struggled with fallout from that where innocent changes change how their fmas get selected and things go off the rails.
04:13 karolherbst: imirkin: but you want to fuse it
04:14 karolherbst: because optimizations
04:14 imirkin: karolherbst: let the backend sort it out
04:14 imirkin: based on its specific needs
04:14 karolherbst: well.. some people want to do algebraic opts in nir
04:14 karolherbst: but sure, we can tell nir to never fuse it and let backends handle it
04:14 karolherbst: you can have that discussion if you want to :P
04:14 imirkin: anyways, yes, small perturbation to the specific association of fmul + fadd can have big changes
04:15 imirkin: and it's very difficult to control
04:15 anholt: my position is: we need to finish noltis and have backends sort out fma fusing using it or something like it, because ughhhh all the is_used_once() hacks in algebraic are so fragile
04:15 imirkin: since you can end up with long algebraic chains
04:15 karolherbst: anholt: yeah.. but I am convinced the problem isn't the fusing itself, just that we treat fma wrong
04:15 karolherbst: just add fmad and be done with it
04:16 karolherbst: so you want to add like 1.5k loc just because we don't want to fix ffma? that's fine I guess, I just say we should add fmad :P
04:16 karolherbst: and let glsl _never_ emit ffma
04:17 anholt: if you're looking at the MR, things like "v3d: Let NOLTIS turn add(a, -b) into sub(a, b)." are also the point
04:17 anholt: it's the solution to this problem that you keep running into in backends.
04:17 imirkin: what's NOLTIS?
04:17 anholt: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/629/
04:18 imirkin: does it stand for something
04:18 anholt: I had expanded it above. near optimal linear-time instruction selection
04:18 karolherbst: anholt: yeah.... some of that stuff is indeed really annoying and we need a solution for that
04:18 karolherbst: I just say that fma is a different problem
04:18 imirkin: anholt: ah ok
04:18 anholt: https://www.cs.cmu.edu/~dkoes/research/dkoes_cgo08.pdf
04:18 karolherbst: we think we talk about fma where we really don't
04:18 imirkin: thanks
04:20 karolherbst: the fma problem is just annoying because there is hardware having both, where one is slower than the other one....
04:20 imirkin: anholt: interesting side result from that paper -- too much CSE hurts. i've definitely noticed that too.
04:36 karolherbst: yeah.. so in terms of being a better CSE this looks very interesting and useful indeed :)
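For a flavour of the decisions involved, here is a single greedy peephole for the add(a, -b) → sub(a, b) case mentioned earlier, on a made-up DAG representation — this is explicitly not NOLTIS, whose point is to make such choices near-optimally over the whole DAG instead of one node at a time:

```c
#include <stddef.h>

enum node_op { OP_ADD, OP_SUB, OP_NEG, OP_VALUE };

struct dag_node {
   enum node_op op;
   struct dag_node *src[2];
};

/* Rewrite add(a, neg(b)) into sub(a, b) in place. */
static void
select_add_neg(struct dag_node *n)
{
   if (n->op != OP_ADD)
      return;
   for (int i = 0; i < 2; i++) {
      struct dag_node *neg = n->src[i];
      if (neg && neg->op == OP_NEG) {
         n->op = OP_SUB;
         n->src[i] = neg->src[0];
         /* keep the non-negated operand first: sub(a, b) */
         if (i == 0) {
            struct dag_node *tmp = n->src[0];
            n->src[0] = n->src[1];
            n->src[1] = tmp;
         }
         return;
      }
   }
}
```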
04:50 anholt: does anyone have some actual performance tests to throw at !15541? looks like gdm isn't starting up for me, and glmark2-es2-drm is failing to set crtc.
05:29 anholt: ok, got some perf numbers in the commit message for nv92-ntt
05:40 anholt: bummer, can't replay any of the more interesting traces on nv92, it just wedges.
09:10 cwabbott: airlied: imirkin: no, it does not use SSA-based register allocation
09:11 cwabbott: it uses classic graph-coloring, slightly better than util/ra but similar
09:12 cwabbott: what's more concerning is that no one seems willing to fix bugs in it
12:46 imirkin: cwabbott: ok. i guess i don't know what "SSA-based register allocator" is
12:46 imirkin: cwabbott: i meant that it does RA based on SSA.
12:46 cwabbott: imirkin: well, yes and no
12:47 cwabbott: yes, the IR is in SSA form
12:48 cwabbott: but it creates copies and merges SSA values, which is exactly what an out-of-SSA pass would do
12:48 imirkin: it most definitely does use graph coloring
12:48 cwabbott: it then does RA on the merged classes
12:48 imirkin: and yes, it creates merged nodes
12:48 cwabbott: so it's really the same as first replacing the SSA values with registers and then doing RA afterwards
12:49 imirkin: well, except for the phi merging
12:49 cwabbott: no, the phi merging is really the same
12:49 cwabbott: it's similar to what nir_from_ssa does
12:49 imirkin: but yes, any ssa -> not ssa pass requires _some_ sort of register strategy
12:49 imirkin: er, some sort of phi-handling strategy
12:50 cwabbott: yeah, so it does do out-of-SSA, and it's considered part of register allocation, but it happens before RA
12:51 cwabbott: in SSA-based RA phi nodes aren't lowered until afterwards, and each SSA value can get a unique register, so phi deconstruction happens after RA
12:51 imirkin: huh, ok
12:51 imirkin: yeah, that's pretty different.
12:51 imirkin: [than what nouveau does]
12:52 cwabbott: and copy coalescing happens implicitly by trying to assign values involving a phi node to the same register
12:53 cwabbott: the advantage is that it's "optimal" in the sense that it uses the minimal number of registers, which is the maximum register pressure across the program
12:53 cwabbott: but it can create more moves
12:53 cwabbott: so, you can more easily tune how many registers to use
12:54 imirkin: how does it handle the if () { swap(a, b) } type things?
12:54 cwabbott: you need to have a swap instruction, or use the xor trick
12:55 imirkin: ok, but like normally this is handled with phi's after the if, so how do you know that it's a swap?
12:55 imirkin: or you have to detect it?
12:55 cwabbott: well, first I'd recommend you watch our XDC talk :)
12:55 imirkin: ok
12:55 cwabbott: we go over most of this stuff
12:56 cwabbott: but, you compare what registers get assigned to the phi's sources and destination
12:56 imirkin: yeah, i'll check it out
12:56 imirkin: which year? this past one?
12:56 cwabbott: and after RA, insert copies/swaps along each edge to move the sources to the destination
12:56 cwabbott: yes, this one
12:56 imirkin: cool
12:56 imirkin: i didn't get a chance to look at this year's stuff
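A sketch of the copies-and-swaps step for one control-flow edge, assuming each phi source and destination already has a register and that emission is just printf — chains become plain moves, cycles are broken with a swap (or three xors on hardware without one); all names are made up:

```c
#include <stdio.h>
#include <stdbool.h>

#define MAX_COPIES 16

/* Pending moves for one edge: the value in src[i] must end up in dst[i]. */
struct pcopy {
   int src[MAX_COPIES];
   int dst[MAX_COPIES];
   int n;
};

static bool
reg_is_pending_src(const struct pcopy *pc, int reg)
{
   for (int i = 0; i < pc->n; i++)
      if (pc->src[i] == reg)
         return true;
   return false;
}

static void
remove_copy(struct pcopy *pc, int i)
{
   pc->src[i] = pc->src[pc->n - 1];
   pc->dst[i] = pc->dst[pc->n - 1];
   pc->n--;
}

static void
resolve_parallel_copy(struct pcopy *pc)
{
   while (pc->n > 0) {
      bool progress = false;

      /* Emit every copy whose destination no other pending copy reads. */
      for (int i = 0; i < pc->n; ) {
         if (pc->src[i] == pc->dst[i]) {
            remove_copy(pc, i);                    /* already in place */
            progress = true;
         } else if (!reg_is_pending_src(pc, pc->dst[i])) {
            printf("mov r%d, r%d\n", pc->dst[i], pc->src[i]);
            remove_copy(pc, i);
            progress = true;
         } else {
            i++;
         }
      }

      /* Only cycles are left: break one with a swap and redirect the
       * remaining reads to where their values now live. */
      if (!progress && pc->n > 0) {
         int a = pc->src[0], b = pc->dst[0];
         printf("swap r%d, r%d\n", a, b);
         remove_copy(pc, 0);
         for (int i = 0; i < pc->n; i++) {
            if (pc->src[i] == a)
               pc->src[i] = b;
            else if (pc->src[i] == b)
               pc->src[i] = a;
         }
      }
   }
}
```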
13:32 karolherbst: cwabbott: I think I don't fix stuff because I know the bugs I am aware of are a result of the algorithm used, so I'd rather replace the thing :)
13:32 cwabbott: karolherbst: what bugs specifically are those?
13:33 cwabbott: I didn't see anything really extraordinary in nouveau's RA
13:33 cwabbott: I doubt there are any really unsolvable bugs
13:33 karolherbst: cwabbott: this fails to spill: https://gist.github.com/karolherbst/1597d06d931cf649c7ac1d9bb14e0b24
13:33 karolherbst: and it's because we treat those scalar values as vec4 in the graph coloring
13:33 karolherbst: there is nothing to spill
13:33 karolherbst: and that's why it fails
13:34 karolherbst: but not in the "there is nothing to spill" way
13:34 karolherbst: but "we are out of space" way
13:34 cwabbott: you have to spill vec4's then
13:34 karolherbst: well
13:34 karolherbst: nope
13:34 karolherbst: there is nothing to spill
13:34 cwabbott: no, there is
13:34 karolherbst: that shader hardly uses 25 values
13:35 cwabbott: look, there are tons and tons of other allocators using something similar
13:35 karolherbst: all I can say is it's busted
13:35 cwabbott: so if yours can't do something right, it's 100% *not* due to the algorithm
13:36 karolherbst: ehh, well, I didn't meant the original algorithm, but the thing we implemented
13:36 cwabbott: the problem, imo, is a lack of willingness to get down to details and actually figure out what's wrong
13:36 karolherbst: which might or might not be 100% the same
13:36 karolherbst: in order to fix it, one essentially has to rewrite most of it, that was my conclusion
13:37 karolherbst: so I was thinking of just using util/ra.h
13:38 cwabbott: util/ra.h is even worse than what you have
13:38 cwabbott: it's the "training wheels" of RA
13:38 cwabbott: it doesn't handle subregister interference
13:38 cwabbott: don't use it unless you're writing something from scratch and don't have time to understand anything
13:40 cwabbott: again, I'm 100% sure that any problem you have is fixable, if there's a will to actually understand what the code is doing and what needs to be fixed up
13:40 karolherbst: I dug into it and my conclusion was it's not fixable unless one changes the algorithm used
13:40 karolherbst: I couldn't find a bug in the code itself
13:41 karolherbst: or well maybe it's in the way we use that algorithm, but my point is, it's not a bug in the code
13:42 cwabbott: again, there are lots and lots of other backends that use something similar - and util/ra actually uses the same basic idea under the hood
13:42 karolherbst: I can only tell you what I figured out
13:42 cwabbott: it could be that there are some things it refuses to spill that it needs to
13:43 karolherbst: there is nothing which needs spilling
13:43 imirkin: we stick "noSpill" on the merges
13:43 karolherbst: that's literally the shader pre RA: https://gist.githubusercontent.com/karolherbst/6be7f48c02afe0fc82af86dde32429bd/raw/c7333d66c0ea8360067759e143cd7b4fd2a6804e/gistfile1.txt
13:43 cwabbott: nope, you can't say that
13:43 karolherbst: nothing needs spilling there
13:43 karolherbst: I know it just by looking at it
13:43 imirkin: the idea is that you're supposed to spill the values being merged
13:43 imirkin: er, there are diff types of merges
13:43 imirkin: i mean the "vec4" type merges
13:44 karolherbst: imirkin: look at the shader
13:44 imirkin: karolherbst: i'll dig into that one.
13:44 karolherbst: yeah, I've posted the nv50ir one as well
13:44 imirkin: but not right now
13:44 karolherbst: :)
13:45 cwabbott: I mean, graph-based RA can always wind up doing bad things
13:45 karolherbst: if I just hack around the spiller, that shader gets like ~20 registers used and everything works out fine
13:45 karolherbst: cwabbott: sure, but that's why I say the algo is broken, not the code
13:45 karolherbst: the bug in the end is, it tries to spill, but there is nothing to spill and it fails
13:45 cwabbott: and then you need to spill even though there are enough registers
13:45 cwabbott: karolherbst: it's RA-hard
13:46 cwabbott: if that's your definition of a broken algo, then sorry, every algo is broken
13:46 cwabbott: SSA-based register allocation sometimes produces way too many moves
13:46 cwabbott: graph-based can sometimes spill
13:46 cwabbott: pick your poison
13:46 cwabbott: if you watch our talk, daniel also lays this out at the beginning
13:46 karolherbst: okay, so the solution is just to skip over it when it tries to spill but there is nothing to spill, or what?
13:47 cwabbott: the solution is to just always keep spilling
13:47 imirkin: karolherbst: the solution is to dig in and understand what's going on.
13:47 cwabbott: there definitely isn't nothing to spill in that example
13:48 imirkin: karolherbst: if your digging in resulted in "everything is fine", then that's insufficient digging.
13:48 karolherbst: imirkin: well, I am not saying that everything is fine, do I?
13:48 cwabbott: imirkin: yes, exactly
13:48 imirkin: karolherbst: you're saying "everything is working as intended and the algorithm is broken"
13:48 karolherbst: imirkin: sure
13:48 imirkin: which is pretty much the same thing
13:49 karolherbst: it isn't
13:49 cwabbott: for example, it could've spilled %r109, but didn't - why?
13:49 imirkin: but phrased in a way that makes it sound like you completed the task :p
13:49 cwabbott: ask those sort of questions, and then dig in
13:49 cwabbott: rather than just give up and assume the algo is wrong
13:49 karolherbst: so my conclusion was that each of those values filled a full vec4 slot and it ran out of space
13:50 imirkin: so why not spill one of them then
13:50 karolherbst: but as all those slots are really just filled by a single value, not 4...
13:50 imirkin: why are the values taking up a full vec4 slot when they shouldn't be
13:50 imirkin: etc
13:50 karolherbst: because that's how the algo works in codegen
13:50 imirkin: that is definitely not the case
13:50 karolherbst: well, maybe my conclusion is wrong, but that's what I ended up with
13:50 imirkin: i'll try to sort it out ... maybe tonight
13:51 imirkin: but for now, it's off to work i go
13:51 imirkin: ttyl
13:53 karolherbst: cwabbott: I mean, you can always just say "everything is fine, it has to be some weirdo bug", but... I looked into it and that's what my result was, if you think it's something else you are free to look into it
13:54 cwabbott: karolherbst: if that's your result, you didn't look hard enough to actually understand it
13:54 cwabbott: and just came to conclusions
13:54 cwabbott: I'm no expert on nouveau, but I can already tell that's the case
13:55 cwabbott: I have other stuff to do, so I can't look into it myself
13:55 cwabbott: but I know that's the case already
13:55 karolherbst: yeah well, if you say so