01:26imirkin: skeggsb: i don't suppose you heard back from nvidia on the gpio thing?
02:11karolherbst: imirkin: mhhh... either I broke it or I found a bug in FlatteningPass::tryPropagateBranch :/
02:12karolherbst: https://gist.githubusercontent.com/karolherbst/e19907918006ce719b5c71562df9c9b3/raw/927416e3a3f598685fbf16db9b176198a648326e/gistfile1.txt
02:14imirkin: there's also a plain join and a join $b0
02:14karolherbst: well.. we can't do plain joins on volta+ and I thought I didn't break anything :/
02:15karolherbst: right now debugging with chipset forced to 0x50
02:15imirkin: i'm not sure what that's trying to do in the first place.
02:15karolherbst: yeah...
02:15imirkin: if that's nv50, should definitely not be any $b0's
02:15imirkin: :)
02:15karolherbst: well..
02:15karolherbst: right
02:15karolherbst: I meant.. I see what happens with 0x50 to debug this
02:16karolherbst: so the idea is because BB:7 is empty, BB:5 can jump to BB:7 jump target directly
02:16karolherbst: but that obviously won't work out well if there is a join...
02:17karolherbst:really dislikes the fact, that OP_JOIN has multiple meanings
02:18karolherbst: ehhh
02:18karolherbst: it works on tesla as flattening succeeds in predicating the branches *sigh*
02:22imirkin: you can force-disable predication
02:22imirkin: note that on nv50, joins are "on the outside"
02:22imirkin: i.e. you have joinat
02:22imirkin: branches
02:22imirkin: then the branches converge
02:22imirkin: then you have join
02:22karolherbst: right
02:23imirkin: the join does "nothing", so to speak
02:23karolherbst: same on volta
02:23imirkin: you could omit it and everything would work
02:23imirkin: except you'd get shit perf
02:23karolherbst: ohhh
02:23imirkin: whereas on nvc0+, the join flag is actually a jump
02:23karolherbst: ohh, joinat pushed the target and once the threads arrive they join?
02:23imirkin: mmmm
02:24karolherbst: ohh wait, the join is important
02:24karolherbst: nvm then
02:24karolherbst: yeah
02:24karolherbst: same on volta
02:24imirkin: tbh i'm not 100% sure that the joinat target address means anything
02:24imirkin: not sure it's actually encoded
02:24karolherbst: on volta you save the address but it means nothing to the hw :)
02:25karolherbst: anyway.. instead of pushing into some stack we save into those barrier registers
02:25karolherbst: but fundamentally it works like tesla
02:25imirkin: the join is a "run all the sub-threads to this point" indicator
02:25imirkin: whereas otherwise they'd run until program exit
02:25karolherbst: right
02:25imirkin: so it'd effectively run one thread at a time for the whole program
02:25imirkin: which is functionally equivalent, but ... shit-for-perf
02:26karolherbst: yeah... not on volta I couldn't see any difference in perf at all...
02:26karolherbst: at least not with heaven
02:26imirkin: well, volta has a different threading model
02:26karolherbst: right
02:26karolherbst: but some instructions still require threads to converge :)
02:26imirkin: i don't think it would run the shader 32x
02:26imirkin: right
02:26imirkin: most shaders don't include such instructions though
02:26imirkin: (unless you count "exit")
02:28karolherbst: uhh.. mhhh
02:28karolherbst: nv50 has the join flag on instruction, doesn't it?
02:29imirkin: iirc no
02:29imirkin: but it's been a while since i've looked
02:30karolherbst: it does
02:30karolherbst: code[1] |= 0x2; is the encoding.. annoying
02:30karolherbst: yeah.. well, let me turn that off as well
02:31imirkin: just checked -- looks like joinat *does* get the target encoded
02:31karolherbst: imirkin: it might not matter
02:31imirkin: and yeah, join is a flag available on 8-byte encodings
02:32imirkin: maybe not. but it's there.
02:32karolherbst: on volta the target is there for the disassembler only
02:32imirkin: i've never tried to get it wrong to see what happens :)
02:32karolherbst: :)
02:32karolherbst: I bet nothing happens
02:33karolherbst: or maybe it mattered on nv50.. who knows
02:33karolherbst: but probably not
02:34karolherbst: well, if it matters, the hw would have to do something with it, like mapping sync points and offsets.. and I bet making the hw more complicated for no reason is even something nvidia was aware of at that time
02:35imirkin: wow - g200+ have the vote op. i don't think we expose that in nouveau.
02:35imirkin: too bad i have a g84 plugged in, so can't test it
02:37HdkR: vote woo
02:37karolherbst: turing has VOTEU....
02:37karolherbst: but I guess that makes sense :p
02:40HdkR: Sadly still doesn't have warp-reduce ops like AMD
02:41karolherbst: HdkR: what do you mean?
02:41HdkR: reduce in RF rather than to memory
02:42karolherbst: why would you need it on register files?
02:42karolherbst: don't make hw complicated :p
02:42HdkR: So something like RED.U.Min <UReg>, <Reg>
02:43karolherbst: ehh
02:43karolherbst: that would just make the hw complicated with 0 benefit
02:44HdkR: Useful for warp-wide reductions so you don't need to do a shuffle vote reduction dance :)
02:44karolherbst: but this means you need to make registers available cross threads
02:44karolherbst: and I bet that's super complicated to do :p
02:45HdkR: SHFL already does it :)
02:45karolherbst: right, but that's one exception which is a bit easier to implement
02:45karolherbst: doing alu operation in order cross thread?
02:46karolherbst: ufff
02:47HdkR: AMD doing it isn't really fair as a comparison point, since it just falls down their vector pipeline, but people have been expecting it for what. eight years the consoles have been here?
02:48karolherbst: yeah.. no idea?
02:49karolherbst: putting magic instructions into hw was never a good idea :p
02:49karolherbst: keeping insane out is where you win
02:49HdkR: async copy from memory to shared? :P
02:49imirkin: didn't you say there was a "start async copy from gmem to smem, and go get coffee" instruction now?
02:49imirkin: heh
02:50imirkin: we thought of the same one.
02:50karolherbst: HdkR: heh.. at least you don't do it in the GPC but wait on the barrier :p
02:51HdkR: :D
02:51karolherbst: there is also nanosleep and I still have no idea why :p
02:51HdkR: oh yea, that's a great one
02:51karolherbst: (I have great ideas on how to abuse it though)
02:54karolherbst: imirkin: heh.. I can reproduce on nv50 if I hack it up enough...
02:55karolherbst: https://gist.githubusercontent.com/karolherbst/3b5b5e981cd6573345ff660282c250a4/raw/af9d5eca35e0b1df000d4eebdaf1f98c87e7d3a6/gistfile1.txt
02:55karolherbst: so..
02:55karolherbst: if you have a conditional branch jumping to a block with only a join
02:55imirkin: even without your changes? (but with your hacks)
02:55karolherbst: that join might get copied over into BB:6
02:55imirkin: mmmmm
02:55karolherbst: let's see
02:55karolherbst: I can try on master
02:55imirkin: the conditions on branches should get propagated too
02:56imirkin: i fixed that bug a while back :)
02:56karolherbst: that's not the issue though
02:56karolherbst: the join in BB:5 was a bra BB:7 before
02:56imirkin: right
02:56karolherbst: soo.. and because BB:7 ends in a flow instruction it thinks it can just copy
02:56imirkin: some piece of logic is going very wrong
02:56karolherbst: yeah
02:56karolherbst: I know what
02:56karolherbst: just.. annoying
02:56imirkin: that join should never be propagated to BB:5 in the first place
02:57imirkin: there's a bool on the target
02:57imirkin: which indicates anterior joins
02:57imirkin: and the propagate thing should be sensitive to that
02:57karolherbst: happens on master as well
02:57karolherbst: hack was: disable predicated _just_ for BB:4
02:57karolherbst: so the BB:4->BB:7 tree stays not predicated
02:57karolherbst: and after the program was flattened it arrives at BB:5 quite late
02:58imirkin: BB:5 should have a branch to BB:7
02:58karolherbst: yeah.. I guess I'll fix it :)
02:58imirkin: what replaces it with a join?
02:58karolherbst: yes
02:58karolherbst: welll
02:58karolherbst: the code replaces it with the branch of the target
02:58imirkin: it should start out with the branch
02:58imirkin: initially
02:58karolherbst: I'll show you the code in a moment
02:59imirkin: like in tgsi, we put the join outside iirc
02:59imirkin: or rather, in tgsi -> nv50 ir
02:59karolherbst: imirkin: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp#L3372
02:59karolherbst: so you see how that can just go through there and do the wrong shit :)
02:59karolherbst: it's just super unlikely
03:00karolherbst: and I think the OP_JOIN is there for kepler
03:00karolherbst: or gens where OP_JOIN is actually a jump :)
03:01karolherbst: on my branch I can remove the OP_JOIN check though as I made those gens do OP_BRA with join = 1 instead
03:01imirkin: yeah
03:01imirkin: that seems slightly bogus =/
03:01karolherbst: yeah..
03:02karolherbst: I cleaned that up though :p
03:02imirkin: esp by the time this happens
03:02karolherbst: OP_JOIN syncs and never jumps on my branch
03:02karolherbst: I think? maybe I removed that again? let's see
03:04karolherbst: oh ehh.. I think I threw it away again
03:06karolherbst: anyway.. should sleep :D will think about how to make all of this more sane
03:08imirkin: nite
09:53karolherbst: imirkin: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6149
11:36karolherbst: you know what I like the most about Volta/Turing's ISA? the encoding is sooo sane
11:37karolherbst: supporting uniform regs uses the same encoding as non uniform ones
11:37karolherbst: you just need to flip special bits, which are the same for all instructions
11:37karolherbst: 1 << 91 enables uniform sources
11:37karolherbst: 6 << 9 sets src1 as uniform
11:38karolherbst: 7 << 9 sets src2 as uniform
11:38karolherbst: 1 << 7 enables dst as uniform
11:38karolherbst: enabling dst as uniform reuses the non uniform layouts
11:39karolherbst: so 3 << 9 selects c[] as src2 etc...
11:39karolherbst: that's soo nice
11:40karolherbst: "1 << 91 enables uniform sources" also flips from non uniform to uniform c[] indirect
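A rough sketch of the bit flipping described above, taking the quoted constants at face value. The chat only gives the shifted values (`1 << 91`, `6 << 9`, `7 << 9`, `3 << 9`, `1 << 7`); treating bits 9..11 as a single 3-bit "source form" field is an assumption made here for illustration, and every function name below is hypothetical rather than anything from the nouveau codegen.

```python
# Constants quoted in the chat; their exact hardware meaning is not verified here.
UNIFORM_SRC_ENABLE = 1 << 91   # "enables uniform sources"
UNIFORM_DST_ENABLE = 1 << 7    # "enables dst as uniform"

# ASSUMPTION: 6 << 9, 7 << 9 and 3 << 9 are values of one 3-bit field at bit 9.
SRC_FORM_SHIFT = 9
SRC_FORM_MASK = 0x7 << SRC_FORM_SHIFT

SRC_FORM_CONST_SRC2 = 3    # "3 << 9 selects c[] as src2"
SRC_FORM_UNIFORM_SRC1 = 6  # "6 << 9 sets src1 as uniform"
SRC_FORM_UNIFORM_SRC2 = 7  # "7 << 9 sets src2 as uniform"


def set_src_form(insn: int, form: int) -> int:
    """Replace the (assumed) 3-bit source-form field of a 128-bit instruction word."""
    return (insn & ~SRC_FORM_MASK) | ((form & 0x7) << SRC_FORM_SHIFT)


def enable_uniform_sources(insn: int) -> int:
    """Set the bit the chat describes as enabling uniform sources."""
    return insn | UNIFORM_SRC_ENABLE


def enable_uniform_dst(insn: int) -> int:
    """Set the bit the chat describes as enabling a uniform destination."""
    return insn | UNIFORM_DST_ENABLE


# Build an instruction word with src1 marked uniform, starting from all zeros.
word = enable_uniform_sources(set_src_form(0, SRC_FORM_UNIFORM_SRC1))
```

Because the same bits are said to apply to all instructions, helpers like these would not need per-opcode variants, which is the property being praised in the chat.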
11:44HdkR: karolherbst: When your instruction encoding is 128bits, loads of things end up being sane :D
11:45karolherbst: well :D
11:45karolherbst: you only have 96 bits though :p
11:45HdkR: Sure, some bits stolen for scheduling
11:45HdkR: and some bits stolen for opc
11:45karolherbst: mhh
11:45karolherbst: well
11:45karolherbst: how many bits do you have for the opc? 5?
11:45karolherbst: :p
11:46karolherbst: I think it's actually 7, but it's hard to tell
11:46karolherbst: source types are part of the opc so there is a system
11:46HdkR: Think it depends on the family, since some families may reuse some opc bits for other data?
11:47karolherbst: I think they tried but always failed
11:47karolherbst: pre volta
11:47karolherbst: but yeah..
11:47HdkR: Even with some kludge it is still significantly more sane
11:47karolherbst: yes
11:48karolherbst: adding support for uniform regs is quite some work actually..
11:48HdkR: It ends up breaking the scalar model mindset a bit
11:48karolherbst: " 21 files changed, 680 insertions(+), 216 deletions(-)" :/
11:49karolherbst: HdkR: well.. I use nir divergency analysis pass so I don't even have to deal with any of that
11:49karolherbst: I just accept whatever I get :p
11:49HdkR: Ah right, NIR has that
11:50karolherbst: the problem is rather checking for legal combinations and stuff
11:50karolherbst: and fix all the places where codegen just inserts new values and breaks it
11:51HdkR: hehe, new RF always causing woes
11:52karolherbst: yeah...
11:52karolherbst: why does it have to be three new ones :D
11:53HdkR: If you're going to cause pain then might as well as do it all at once
11:53HdkR:pushes Volta under a rug
11:53karolherbst: well.. we have to support the barrier file anyway
11:53karolherbst: but I also already did the work for that :D
11:53karolherbst:still wasn't able to do a full shader-db run without crashing :D
11:56HdkR: Seems like you're doing quite well for Volta and Turing support
11:57karolherbst: yeah
11:57karolherbst: I don't know what doesn't run :p
11:58HdkR: CTS? :P
11:58karolherbst: dunno.. with my local patches I have like 5 fails
11:58karolherbst: and the fails are just annoying at this point :D
11:58HdkR: whoa
11:59HdkR: Almost time to go back to OpenCL 2.0 features?
11:59karolherbst: uff :D
12:01HdkR: The idea of running OpenCL at <100Mhz sounds..great
12:04karolherbst: it's not that bad :D
12:04HdkR: Just make sure you're running the largest GPUs so you take advantage of going wide?
12:05HdkR: Can't imagine someone using it for Blender though
12:06karolherbst: I already look forward to the part where shader-db succeeds and I track down regressions because codegen is silly :p
12:19Ingvix: hey, I've setup reverse prime now. I ran 'xrandr --setprovideroutputsource 1 0' and got external display available. Now though xrandr can't seem to find the mode I'm trying to set for the display though it's listed within the display's modes
12:20Ingvix: I tried creating my own mode with same settings but that isn't found either
12:23Ingvix: I think help would be needed
12:33karolherbst: Ingvix: could be that we don't support the mode or something.. was it there with nvidia?
12:35Ingvix: karolherbst, resolution was and I'm fairly sure the rate was same as well though I'm not absolutely sure
12:36Ingvix: it's 1920x1200 with 59.95 refresh rate
12:36karolherbst: what does xrandr report for the display?
12:37Ingvix: karolherbst, http://vpaste.net/GIYQY
12:39Ingvix: uh, I mean I first tried that mode listed first and then tried to create a new mode cause it wasn't working
12:39Ingvix: and the other modes listed aren't found either
12:39karolherbst: Ingvix: why did you want to create new ones?
12:40Ingvix: to get it to work
12:40Ingvix: just tested if it helped
12:40Ingvix: but it didn't
12:40karolherbst: well, there is no mode selected
12:41karolherbst: why not just use one of the available ones?
12:41Ingvix: as I just said, I tried setting the first one on the list but it was not found
12:41Ingvix: none of them are
12:42karolherbst: what args are you calling xrandr with?
12:42Ingvix: karolherbst, xrandr --output DP-1-1-1 --primary --mode 1920x1200 --pos 1920x0 --output eDP-1 --mode 1920x1080 --pos 0x0
12:43Ingvix: and I get: xrandr: cannot find mode 1920x1200
12:43karolherbst: try "xrandr --output DP-1-1-1 --mode 1920x1200 --rate 59.95"
12:44Ingvix: I get the same error
12:46karolherbst: Ingvix: output of "grep . /sys/class/drm/card*-*//modes" please
12:47Ingvix: http://vpaste.net/tyemi
12:48karolherbst: Ingvix: "xrandr --output DP-1-1-1 --primary --right-of eDP-1" does this do something?
12:48karolherbst: uhm.. wait
12:48karolherbst: "xrandr --output DP-1-1-1 --primary --auto --right-of eDP-1"
12:50Ingvix: nothing changes and I get no errors
12:50karolherbst: ehh.. mhh
12:51karolherbst: add a --verbose to that and paste the output
12:51karolherbst: I am sure that's a bug somewhere, but not sure where at this point
12:52karolherbst: maybe dmesg and Xorg log would help as well
12:53Ingvix: hmm, no output from xrandr by adding --verbose to previous command
12:53karolherbst: strange
12:53karolherbst: so it doesn't try to do anything
12:54Ingvix: should I still fetch dmesg and Xorg?
12:54karolherbst: wait.. I think xrandrs providers are just wrongly setup
12:55Ingvix: alright
12:57karolherbst: no.. should be fine. mhh
12:57karolherbst: yeah, please get dmesg and xorg logs
12:59Ingvix: dmesg: http://vpaste.net/MsP2M Xorg.0.log: http://vpaste.net/fJRFE
13:01karolherbst: ehhh..
13:02karolherbst: something seems fishy but I don't know what
13:02karolherbst: those "present flip failed" are also weird
13:03karolherbst: but the core notifier timeouts could also break things in weird ways
13:04karolherbst: but they shouldn't prevent enabling modes
13:05karolherbst: Ingvix: maybe ask on #xorg? I don't think it's a driver issue, but I also don't know enough at this point to figure out what's up
13:06Ingvix: when I started x for the first time with this setup I could set the mode but the display stayed black. No idea what changed after that if anything
13:06Ingvix: I can do that
13:06karolherbst: HdkR: *sadface* https://gist.github.com/karolherbst/736c022daa1f6807b330eed57acdf9b1
13:07karolherbst: Ingvix: maybe the xrandr setprovideroutputsource did break something
13:07karolherbst: normally you won't have to do that
13:07karolherbst: the display staying black is probably a nouveau issue though
13:07karolherbst: and for that it makes sense to test with a newer kernel just to make sure we didn't fix it already
13:18HdkR: karolherbst: Interesting that it increased
13:19karolherbst: HdkR: well, because of stupid reasons :p
13:19karolherbst: loadpropagation fails
13:19karolherbst: ld u32 %ur56 c0[0x0] (0)
13:19karolherbst: mad f32 %r62 %r54 %ur56 %ur57 (0)
13:19karolherbst: codegen doesn't load propagate %ur56 anymore
13:19HdkR: ah, interesting
13:19karolherbst: and I'd prefer it propagating const memory instead of keeping uniform regs :p
13:19karolherbst: just need to fix that
13:20karolherbst: and I bet it's like 99% of the fallout
13:20karolherbst: ahh
13:20karolherbst: i->src(2).getFile() != FILE_GPR :D
13:20karolherbst: right..
13:23karolherbst: mhhh
13:23karolherbst: now another fallout
13:23karolherbst: in some places I keep LDC instead of MOV from c[] :/
13:24karolherbst: ldc needs a barrier :/
13:27Ingvix: karolherbst, I see. I reboot to the latest one I can get
13:28karolherbst: HdkR: now I expect no shader will use uniform regs :D
13:28karolherbst: maybe a few compute ones
13:35Ingvix: There was one funny thing that both providers listed by xrandr were called "modesetting" but the other one had all the external outputs so figured it's actually nouveau
13:39HdkR: karolherbst: Why only compute?
13:40karolherbst: HdkR: group is is uniform :p
13:40karolherbst: and can't be propagated
13:40karolherbst: *group id
13:41HdkR: Does group id matter in fragment shaders that don't use interlock?
13:43Ingvix: yay, it's working on the latest kernel
13:43karolherbst: cool
13:44Ingvix: though I still need to set the source with xrandr before the external displays become available
13:44karolherbst: Ingvix: aren't you using some desktop environment, but starting stuff yourself?
13:44Ingvix: I'm using dwm, so no de really
13:45karolherbst: there are some proper files where stuff like that could be added to.. normally desktops already handle that themselves
13:46Ingvix: well yeah, I intend to put it in .xinitrc now that it works
13:47Ingvix: I just sort get the assumption that it wouldn't be a necessary step from what you said before but I guess you were just referring that the de usually does it automatically
13:47Ingvix: *got
13:48Ingvix: so no real issue here anymore
13:55karolherbst: Ingvix: yeah.. normally users won't have to bother with any of that, but without a desktop all those goodies are usually not in place
13:56karolherbst: HdkR: maybe I should figure out this: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/nvc0/nvc0_program.c#L649 :D
13:56karolherbst: but I bet that volta needs at least 8 regs :D
13:57karolherbst: or something stupid
13:59HdkR:doesn't remember
13:59HdkR: I very well could have just set that to max when tinkering
13:59karolherbst: :D
13:59karolherbst: there is probably some stupid rule
14:00karolherbst: mhh.. much better but also worse: https://gist.github.com/karolherbst/92d78472b7994810edbb2e805c43f044
14:00HdkR: That's more reasonable
14:00karolherbst: well, it starts getting reasonable when gprs are dropping :p
14:01HdkR: Sadly shader stats won't give you performance improvements
14:01karolherbst: less gprs will
14:02HdkR: Sure, improved occupancy
14:02HdkR: That's not the only improvement that using the uniform datapath gives you though
14:04karolherbst: I bet it's also more of a "less heat" thing
14:04karolherbst: I doubt the operations in themselves are faster
14:05HdkR: Depends on how you measure "faster"
14:05karolherbst: well. if you can run higher clocks for longer that's faster sure :p but I meant same clock and everything
14:05HdkR: Depends on how you measure "faster" ;)
14:05karolherbst: maybe it also reduces memory bandwidth a little.. but those ops don't read from memory...
14:06karolherbst: ahh damn memoryopt
14:10karolherbst: or rather copy prop? mhh
14:10karolherbst: uhm.. load prop
14:11HdkR: Hard to say what the other improvements are without giving it all away :P
14:12karolherbst: I bet :p
14:13karolherbst: I mean.. you could probably disable threads and use resources for something else, but I imagine that would be quite complicated to actually implement in hw
14:15karolherbst: ahh. it's my fault that copy propagation fails
14:15karolherbst: can't if the files don't match
14:55karolherbst: HdkR: now I am getting there :) https://gist.github.com/karolherbst/e68fe51e07e75b594beca0bf3357759f
14:58HdkR: Nice
15:12karolherbst: mhh.. first checking out the maxgpr thing though
15:13karolherbst: I guess that could give a nice boost
15:13karolherbst: it's probably something about OOR gprs for nops or something stupid
15:13karolherbst: or if you got OOR you just don't get 0 anymore but the hw fails..
15:20HdkR: OOR gprs?
15:20karolherbst: if you allocate 8 and use r8
15:21HdkR: oh, out of range
15:21karolherbst: yeah
15:21Ingvix: I have some new issues. mpv and telegram-desktop refuse to create a window. Telegram's process freezes altogether while I can still terminate mpv without problems
15:21karolherbst: maybe allocating at least 8 indeed fixes it
15:21karolherbst: would be better to always add 4
15:22karolherbst: *than
15:23Ingvix: I believe it's nouveau related since they worked fine without nouveau
15:23karolherbst: Ingvix: probably multithreading problems
15:23karolherbst: there are some deeper annoying race conditions we still need to fix
15:24Ingvix: is there any workaround or am I just better off not using nouveau then?
15:25karolherbst: don't use applications doing threaded GL
15:25karolherbst: which.. besides chromium based ones there isn't much
15:26Ingvix: uh, my chromium-based browser works fine though
15:26karolherbst: chromium blacklists nouveau
15:26karolherbst: but that is application controlled afaik
15:26karolherbst: so every application using CEF need their own blacklist or something
15:27Ingvix: mpv and telegram are quite essential for me
15:28karolherbst: I think there might be env variable or flags to disable it... for mpv there is also some specific workaround, like not using the gl backend?
15:28karolherbst: dunno
15:33karolherbst: HdkR: ehh.. don't tell me we have to allocate predicates + zero regs as well :p
15:34HdkR: You don't
15:34karolherbst: mhhh
15:34karolherbst: but we need to allocate more than we use
15:34HdkR: AlignUp(Regs, 8);
15:34karolherbst: nope
15:35karolherbst: ohh...
15:35karolherbst: annoying :D
15:35karolherbst: but yeah
15:35karolherbst: that makes sense
15:38HdkR: :P
15:42karolherbst: and at the same time, doesn't :p
15:42karolherbst: why was adding always 5 enough?
15:42karolherbst: sure it's 8 and not 4?
15:42karolherbst: but yeah.. the alignment stuff does make sense
15:43karolherbst: we didn't before afaik
15:43karolherbst: orrr.. wait
15:43HdkR: I feel like I recall 8, but could be 4
15:44karolherbst: okay.. so for previous gens we have a minimum of 4
15:44karolherbst: but no alignment of the value
15:44karolherbst: volta _could_ be minimum of 8, but 4 aligned
15:44karolherbst: or just aligned
15:44karolherbst: will try the alignment thing first
15:56HdkR: Pretty sure behaviour changed slightly there, needs aligned
15:58HdkR: Probably also min size of the alignment size
15:58HdkR: zero is invalid afaik
16:07karolherbst: HdkR: how can that be invalid? the hw refuses to launch?
16:09HdkR: some sort of crash I believe
16:19karolherbst: I am currently wondering why last_id + 5 was "fine" ...
16:53HdkR: karolherbst: Fixed the problem of < 8 for shaders that used 3 or 4 registers for colour output? Idunno
17:27karolherbst: HdkR: mhh.. I added an align around the +5 and that seems to work: align(info->bin.maxGPR + 5, 4)
17:27karolherbst: +4 doesn't
17:27karolherbst: so maybe the hw aligns with 4 automatically
17:27karolherbst: but needs it 8 aligned actually
17:27karolherbst: so maybe align(info->bin.maxGPR + 8, 8) would be better?
17:31karolherbst: let's see what gives better values, but I guess the 8 aligned stuff does
17:31HdkR: :)
17:36karolherbst: anyway.. for shader-db reporting we want to have the original value mhh
17:36karolherbst: oh well
18:00karolherbst: thinking about this.. if the hw aligns_up by 4 we would have align(info->bin.maxGPR + 8, 8) vs align(info->bin.maxGPR + 8, 4).. HdkR: does that make sense?
18:00karolherbst: just wondering what's the best way to compare old vs new behaviour when I don't know for sure what the hw does :p
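One way to compare the candidate formulas without knowing what the hardware does is to tabulate them side by side. The sketch below assumes `align()` rounds up to the next multiple, and `max_gpr` stands in for `info->bin.maxGPR`; the function names are hypothetical and this is not the Mesa implementation.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (alignment > 0)."""
    return (value + alignment - 1) // alignment * alignment


def candidates(max_gpr: int) -> dict:
    """Register counts each formula from the discussion would request."""
    return {
        "align(maxGPR + 5, 4)": align_up(max_gpr + 5, 4),
        "align(maxGPR + 8, 4)": align_up(max_gpr + 8, 4),
        "align(maxGPR + 8, 8)": align_up(max_gpr + 8, 8),
    }


if __name__ == "__main__":
    # Print the three candidates for a range of highest-used register indices.
    for max_gpr in range(0, 12):
        row = candidates(max_gpr)
        print(max_gpr, row)
```

Rows where the formulas disagree are the shaders worth testing on hardware first, since those are the only cases that can tell the hypotheses apart.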
18:04imirkin: karolherbst: does elemental run ok?
18:04karolherbst: imirkin: generally or with patches?
18:05imirkin: with the latest and greatest
18:05karolherbst: I guess I could check
18:06imirkin: if you want a hard hang, just start up F1 2015
18:06imirkin: ;)
18:06imirkin: i had a trace which hung kepler pretty reliably
18:06karolherbst: right...
18:06karolherbst: I really didn't test against heavy stuff yet
18:06karolherbst: just heaven and realisticrendering are probably the heaviest things I've run
18:06imirkin: that's where a lot of issues arise :)
18:07karolherbst: at this point it's only really useful for 2D acceleration sadly
18:07imirkin: obviously having CTS tests pass gives you some confidence that e.g. addition works
18:07karolherbst: right
18:07karolherbst: well and blits :p
18:07imirkin: but lots of bugs happened in "weird" shader situations
18:07imirkin: which are unlikely to occur with CTS
18:10karolherbst: imirkin: elemental was unreal, correct?
18:10imirkin: yes
18:10imirkin: should be on the unreal demos page
18:11karolherbst: imirkin: I think they reworked it at some point and it required downloading the full thing or something weird
18:11imirkin: it's a big download
18:11imirkin: like 1.5GB iirc
18:11karolherbst: sure.. but I don't think you can download the demo alone anymore
18:12karolherbst: at least I don't find from where
18:12imirkin: hold on
18:13imirkin: if you have somewhere i can scp it to maybe?
18:14karolherbst: I could also boot my other machine and scp it from there
18:14karolherbst: I have elemental on some machine :)
18:14imirkin: ok. well it's 1.1GB ... let's see if people.fd.o has enough space
18:15imirkin: nooope, down to 100MB free in /home (and i already wiped most of my stuff)
18:15imirkin: let me see if i can easily put up a server...
18:16karolherbst: python3 -m http.server $port :p
18:16imirkin: that's not the problem :p
18:16karolherbst: I know :D
18:16karolherbst: I could check where PTS downloads from
18:17karolherbst: ehhh
18:17karolherbst: URL broken
18:17imirkin: ok, let's see if this works...
18:24karolherbst: seems to run alright
18:24karolherbst: just slow
18:24imirkin: run the whole demo
18:24imirkin: in the past there have been parts that run wrong
18:24imirkin: oh also
18:24imirkin: there's a way to change the resolution
18:24imirkin: but i don't remember what it is ;)
18:25karolherbst: well.. I run that on a 4k screen.. but I think it auto scales down because X on wayland
18:25imirkin: yeah, but there's a way to run it in e.g. a 1280x800 window
18:25karolherbst: but yeah.. it runs surprisingly fine
18:27karolherbst: I think there might be one or two issues, but I don't know, would have to compare
18:28karolherbst: ehh.. the channel died at the end
18:28imirkin: add ResX=1280 ResY=720 at the end of the command
18:28imirkin: iirc there was a time when the unreal engine logo with the frosting had a bug rendering at the very end :)
18:29karolherbst: yeah.. I think I'll wait until my piglit run finished, I shouldn't push it :D
18:29imirkin: lol
19:53karolherbst: ehh our align() is already aligning up D: I think I didn't have enough sleep
19:53karolherbst: imirkin: btw, did you see this MR? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6149
19:54imirkin: karolherbst: i did not. i'll have to double-check wtf hasJoin is
19:55karolherbst: not 100% but I think hasJoin is the join modifier on instructions
19:55imirkin: ok, so then that's not the right condition
19:56karolherbst: I think none is
19:56karolherbst: honestly
19:56karolherbst: or you'd have to check the target of the join, _but_ on gens where you don't have join BB: forms you only end up with that NOP.join thing being OP_JOIN so you are actually fine
19:57karolherbst: this all just sucks
19:58karolherbst: so.. we have the "jumpy" OP_JOIN being NOP.join on targets with hasJoin
19:59karolherbst: on targets !hasJoin OP_JOIN is SYNC
20:01karolherbst: imirkin: I think I'll really split it up, it's just causing issues understanding this whole mess
20:03imirkin: the question is "does JOIN jump or not"
20:03imirkin: i believe that's hasAnteriorJoin
20:03imirkin: (or the inverse of that)
20:04karolherbst: how was it on tesla vs fermi?
20:05karolherbst: that's essentially the only gens I don't know how it works
20:05karolherbst: maxwell+ is quite clear
20:05imirkin: fermi was jumpy
20:05imirkin: tesla is not-jumpy
20:05karolherbst: okay
20:05karolherbst: mhhhh
20:06karolherbst: jumpy join == hasJoin && !hasAnteriorJoin, non-jumpy join == hasJoin && hasAnteriorJoin, ssy = !hasJoin && !hasAnteriorJoin
20:06karolherbst: that's how the target objects are setup right now
20:06karolherbst: nv50: hasJoin and hasAnteriorJoin both to true
20:07karolherbst: fermi/kepler: hasJoin but not hasAnteriorJoin
20:07karolherbst: maxwell: both false
20:07imirkin: i think you're making this more confusing
20:07imirkin: than it needs to be
20:07imirkin: hasAnteriorJoin = join does not jump
20:07imirkin: hasJoin = there is a join modifier on instructions
20:08karolherbst: joinAnterior is false for fermi
20:08karolherbst: uhh.. wait
20:08imirkin: correct.
20:08karolherbst: but also false for maxwell
20:08imirkin: correct
20:08karolherbst: yeah, but joins don't jump on maxwell
20:08imirkin: it should only be true on tesla (and maybe now volta?)
20:08imirkin: joins jump on maxwell.
20:09karolherbst: uhhh... right..
20:09imirkin: =]
20:09karolherbst: annoying
20:09imirkin: so i think you want that condition to be OP_JOIN && !hasAnteriorJoin
20:09karolherbst: yeah.. I guess so
20:09karolherbst: joinAnterior is just not used at this point
20:09karolherbst: but yeah
20:09karolherbst: I guess this is what it means
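The two flags as summarized in this exchange can be written down as a small table. This is only a sketch of what was said in the chat (nv50: both true; fermi/kepler: hasJoin only; maxwell: both false), using imirkin's reading that the anterior flag means "join does not jump". The `Target` class and `join_jumps` helper are hypothetical names, not the nv50_ir Target API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    has_join: bool        # there is a join modifier on instructions
    join_anterior: bool   # join does not jump (per imirkin's summary)


# Flag values as stated in the chat; volta is left out because it was only a "maybe".
TARGETS = [
    Target("nv50 (tesla)", has_join=True, join_anterior=True),
    Target("fermi/kepler", has_join=True, join_anterior=False),
    Target("maxwell", has_join=False, join_anterior=False),
]


def join_jumps(target: Target) -> bool:
    """OP_JOIN behaves like a jump exactly when the anterior flag is not set."""
    return not target.join_anterior


if __name__ == "__main__":
    for t in TARGETS:
        print(f"{t.name}: hasJoin={t.has_join} joinAnterior={t.join_anterior} "
              f"join jumps={join_jumps(t)}")
```

Under this reading, the peephole condition proposed above (OP_JOIN && !hasAnteriorJoin) selects fermi/kepler and maxwell and excludes tesla.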
20:10imirkin: it might be used in the tgsi -> nv50 ir?
20:10karolherbst: it's not used
20:10imirkin: src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp: if (prog->getTarget()->joinAnterior) {
20:10karolherbst: mhh.. I thought that's a local change of mine.. guess I was mistaken
20:12imirkin: ok, so the tgsi -> nv50 ir pass sticks the join "anterior" style
20:12karolherbst: I am wondering if it makes sense to rework all of that and make a "join bra" the jumpy one and codegen just has to emit the correct thing
20:12imirkin: (see insertConvergenceOps)
20:12karolherbst: yeah
20:12imirkin: it puts a joinat at the start, and join at the convergence point
20:13karolherbst: right
20:13karolherbst: that's how we need it for volta as well
20:13imirkin: right.
20:13karolherbst: maybe we should rename hasAnteriorJoin to hasJumpyJoin, then it's obvious what's the meaning of it :p
20:13imirkin: well, anterior is an english word
20:14imirkin: with a meaning
20:14imirkin: just not a common word
20:14karolherbst: also doesn't help understanding it
20:14imirkin: actually it means something different than i thought
20:14imirkin: learn something every day
20:14karolherbst: this concept of "joins before executing" is not helping
20:14imirkin: i thought it meant "on the outside"
20:15imirkin: but it actually means "nearer the front"
20:15karolherbst: yeah
20:15karolherbst: let's say we could use a word more people will have an easier time understanding :)
20:16karolherbst: I am wondering if it should get reworked to get the ISA to a more sane state (an op has one and only one meaning) or we keep it like this and just rename stuff
20:19imirkin: eh wtvr
20:19karolherbst: yeah.. I am not too fond of changing stuff which could break everything
20:56robi: hasJoinBeforeExecute ?
20:56robi: hasJoinBeforeExec ?
20:56robi: hasPreExecJoin ?
20:56karolherbst: imirkin: is the output of the default nv-report.py script also misaligned for you?
20:57karolherbst: like this: https://gist.github.com/karolherbst/2382a84328a6157230a8061c7bc84183
20:59karolherbst: ohhhhh
20:59karolherbst: I know why
20:59karolherbst: annoying
21:15karolherbst: imirkin: anything you always disliked about nv-report in shader-db?
21:16karolherbst: https://gitlab.freedesktop.org/mesa/shader-db/-/merge_requests/46
21:17karolherbst: ehh.. wanted to make it python3 and python2 compatible
21:22karolherbst: mhhh
21:22karolherbst: the regex...
21:41imirkin: karolherbst: didn't use to be misaligned
21:41karolherbst: yeah.. but I figured out what's wrong
21:41karolherbst: something with the multiple print calls
21:41karolherbst: it's fixed with my MR
21:41imirkin: anyways, i was happy with it. we could nuke the "bytes" thing, i guess - it only matters for tesla, everywhere else it's 1:1 with instructions
21:42karolherbst: imirkin: I rewrote it so that we have _one_ list with all attributes
21:42karolherbst: and we can just add/remove things as we please without having to change the entire file
21:49karolherbst: mhhhh
21:49karolherbst: RE = dict((k, re.compile(regex)) for k, regex in [("name", r"^(.*) - ")] + [(a, r"%s: (\d+)" % a) for a in ("type",) + attrs])
21:49karolherbst: it does generate the regex list based on the attributes, at the same time, it makes it so that nobody wants to change or even understand that :p
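Unrolled, the one-liner quoted above builds one compiled pattern per key: a "name" pattern for the text before " - ", plus one "<attr>: <digits>" pattern per attribute. A minimal sketch of how such a table could be used, assuming a hypothetical `attrs` tuple and a report line shaped like the "type: 3, local: 0, ..." line quoted later in this log; the `parse_line` helper and the sample shader name are illustrative and not taken from nv-report.py:

```python
import re

# Assumption: attribute names taken from the stats line quoted later in the log.
attrs = ("local", "shared", "gpr", "inst", "bytes")

# Same construction as the one-liner in the log, split up for readability.
patterns = [("name", r"^(.*) - ")]
patterns += [(a, r"%s: (\d+)" % a) for a in ("type",) + attrs]
RE = dict((k, re.compile(regex)) for k, regex in patterns)


def parse_line(line):
    """Hypothetical helper: extract every known field from one report line."""
    result = {}
    for key, pattern in RE.items():
        match = pattern.search(line)
        if match is None:
            continue
        value = match.group(1)
        # Everything except the shader name is a decimal counter.
        result[key] = value if key == "name" else int(value)
    return result


if __name__ == "__main__":
    sample = "example.shader_test - type: 3, local: 0, shared: 0, gpr: 0, inst: 1, bytes: 16"
    print(parse_line(sample))
```

Adding or dropping a reported attribute then only means editing `attrs`, which seems to be the point of keeping a single list.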
21:52karolherbst: I think this is good enough
21:52karolherbst: https://gitlab.freedesktop.org/mesa/shader-db/-/blob/666c096da649eb3514a970ac413c448d2d69ae55/nv-report.py
21:54karolherbst: ehh whitespace messup
21:54karolherbst: ooh no.. it's a loop
21:54karolherbst: imirkin: I was also thinking of adding "aligned" gprs in addition to the normal ones
21:55karolherbst: pre volta the file is 4 aligned where on volta it's 8 aligned apparently
21:55karolherbst: or maybe we should rather report how many threads could run in parallel...
21:56karolherbst: anyway, with that it's trivial to add or remove attributes
22:15karolherbst: anyway... https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6152
22:16karolherbst: "total gpr in shared programs : 1561876 -> 1443744 (-7.56%)" with volta :)
22:16karolherbst: well.. hw gpr count here, not the result of codegen
22:16karolherbst: meaning aligned to whatever the hw expects
22:17imirkin: that's a nice optimization :p
22:17karolherbst: well... :p
22:18karolherbst: at least it has a good more_perf/loc ratio :p
22:18karolherbst: let's see if it actually changes perf as well
22:19karolherbst: but yeah.. before we reported a higher reg count than we needed (assuming the hw aligns to 4 on its own)
22:27imirkin: yeah
23:24karolherbst: imirkin: pixmark_piano 189 -> 194 points :)
23:25karolherbst: pixmark_volplosion 2930 -> 3039
23:25karolherbst: I already feared I will be unlucky and none of those get affected
23:30karolherbst: mhh.. triangle shows OOR_REG errors :/
23:33imirkin: can't win 'em all
23:33HdkR: ?)
23:33HdkR: :)
23:33HdkR: Aligning up by 8 or 4? :P
23:33karolherbst: HdkR: 8
23:34karolherbst: I bet it's something like it has to be at least 16 or so :p
23:34karolherbst: or..
23:34karolherbst: there are no regs
23:34karolherbst: so we have -1 + 1 aligned to 8 == 0
23:34HdkR: time to ensure it is always at least 8 and run it again? :)
23:35karolherbst: ahhh "type: 3, local: 0, shared: 0, gpr: 0, inst: 1, bytes: 16" mhhh
23:35karolherbst: 0 aligned to 8 is 0 :)
23:35HdkR: max(alignup(Regs, 8), 8) woop woop
23:35karolherbst: I can drop that stupid min anyway
23:35karolherbst: 254 is the highest we get and +1 aligned to 8 that's 256 :)
23:36HdkR: :)
23:39HdkR: Good to know that I was right that zero is invalid config though :D
23:40karolherbst: mhhh.. still seeing errors
23:42HdkR: hmmm
23:44karolherbst: good: MAX2(4, align(info->bin.maxGPR + 5, 4))
23:44karolherbst: bad: MAX2(8, align(info->bin.maxGPR + 1, 8))
23:46imirkin: well, max(4, align(x+5,4)) == align(x+5,4)
23:46imirkin: that max isn't doing anything
23:46imirkin: neither is the max(8, ...)
23:47imirkin: since maxGPR + 1 is at least 1, i think
23:47karolherbst: nope
23:47karolherbst: it can be -1
23:47imirkin: or can maxGPR be -1 if none are used?
23:47imirkin: ah
23:47imirkin: good to know.
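The arithmetic in this exchange can be checked with a small sketch. It assumes `align(value, alignment)` rounds up to the next multiple of `alignment` and that `MAX2` is a plain maximum; the function names below are hypothetical stand-ins for the two expressions quoted in the log, not the actual Mesa code:

```python
def align(value, alignment):
    """Round value up to the next multiple of alignment (assumed semantics)."""
    return (value + alignment - 1) // alignment * alignment


def gpr_count_align4(max_gpr):
    # The "good" expression from the log: MAX2(4, align(maxGPR + 5, 4))
    return max(4, align(max_gpr + 5, 4))


def gpr_count_align8(max_gpr):
    # The "bad" expression from the log: MAX2(8, align(maxGPR + 1, 8))
    return max(8, align(max_gpr + 1, 8))


if __name__ == "__main__":
    # max_gpr == -1 models a shader that uses no registers at all.
    for max_gpr in (-1, 0, 4, 7, 254):
        print(max_gpr, gpr_count_align4(max_gpr), gpr_count_align8(max_gpr))
```

Under these assumptions, with maxGPR == -1 the 8-aligned expression has align(0, 8) == 0, so the max is what lifts it to 8, while align(maxGPR + 5, 4) never drops below 4 and its max is redundant. A maxGPR of 4 through 7 gives 12 with the first expression and 8 with the second.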
23:47karolherbst: not often that it happens :)
23:47karolherbst: the tess passthrough is probably the only shader hitting this
23:48karolherbst: mhh.. I have a few shaders which got 12 set before and now 8.. but mhh
23:48karolherbst: I hope it's not something stupid
23:48imirkin: what else could it be?
23:48imirkin: if not something stupid
23:55karolherbst: ehh
23:56karolherbst: I bet it's something as stupid as "vertex needs 16 regs min" or so
23:56karolherbst: anyway.. it's the vertex passthrough shader
23:56karolherbst: https://gist.github.com/karolherbst/40293b668ed862d184606516449cbfe2
23:56karolherbst: let me search for the binary
23:57karolherbst: huh wait...
23:57karolherbst: 1 is fragment
23:57karolherbst: :D