01:26 imirkin: skeggsb: i don't suppose you heard back from nvidia on the gpio thing?
02:11 karolherbst: imirkin: mhhh... either I broke it or I found a bug in FlatteningPass::tryPropagateBranch :/
02:12 karolherbst: https://gist.githubusercontent.com/karolherbst/e19907918006ce719b5c71562df9c9b3/raw/927416e3a3f598685fbf16db9b176198a648326e/gistfile1.txt
02:14 imirkin: there's also a plain join and a join $b0
02:14 karolherbst: well.. we can't do plain joins on volta+ and I thought I didn't break anything :/
02:15 karolherbst: right now debugging with chipset forced to 0x50
02:15 imirkin: i'm not sure what that's trying to do in the first place.
02:15 karolherbst: yeah...
02:15 imirkin: if that's nv50, should definitely not be any $b0's
02:15 imirkin: :)
02:15 karolherbst: well..
02:15 karolherbst: right
02:15 karolherbst: I meant.. I see what happens with 0x50 to debug this
02:16 karolherbst: so the idea is because BB:7 is empty, BB:5 can jump to BB:7 jump target directly
02:16 karolherbst: but that obviously won't work out well if there is a join...
02:17 karolherbst:really dislikes the fact, that OP_JOIN has multiple meanings
02:18 karolherbst: ehhh
02:18 karolherbst: it works on tesla as flattening succeeds in predicating the branches *sigh*
02:22 imirkin: you can force-disable predication
02:22 imirkin: note that on nv50, joins are "on the outside"
02:22 imirkin: i.e. you have joinat
02:22 imirkin: branches
02:22 imirkin: then the branches converge
02:22 imirkin: then you have join
02:22 karolherbst: right
02:23 imirkin: the join does "nothing", so to speak
02:23 karolherbst: same on volta
02:23 imirkin: you could omit it and everything would work
02:23 imirkin: except you'd get shit perf
02:23 karolherbst: ohhh
02:23 imirkin: whereas on nvc0+, the join flag is actually a jump
02:23 karolherbst: ohh, joinat pushed the target and once the threads arrive they join?
02:23 imirkin: mmmm
02:24 karolherbst: ohh wait, the join is important
02:24 karolherbst: nvm then
02:24 karolherbst: yeah
02:24 karolherbst: same on volta
02:24 imirkin: tbh i'm not 100% sure that the joinat target address means anything
02:24 imirkin: not sure it's actually encoded
02:24 karolherbst: on volta you save the address but it means nothing to the hw :)
02:25 karolherbst: anyway.. instead of pushing into some stack we save into those barrier registers
02:25 karolherbst: but fundamentally it works like tesla
02:25 imirkin: the join is a "run all the sub-threads to this point" indicator
02:25 imirkin: whereas otherwise they'd run until program exit
02:25 karolherbst: right
02:25 imirkin: so it'd effectively run one thread at a time for the whole program
02:25 imirkin: which is functionally equivalent, but ... shit-for-perf
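A toy cost model of the point made here, assuming one if/else region followed by a shared tail. The function name and the instruction counts are made up for illustration; this is a sketch of the argument, not a description of how the hardware actually schedules threads.

```python
def issue_slots(then_len, else_len, tail_len, has_join):
    """Count instruction issue slots for one warp that diverges at an if/else.

    Toy model (an assumption, not measured hardware behaviour): each divergent
    path is executed separately while the other path's threads are masked off.
    """
    if has_join:
        # Both paths run up to the join, then the tail runs once for everyone.
        return then_len + else_len + tail_len
    # Without a join each path keeps going on its own until program exit,
    # so the shared tail is executed once per path.
    return (then_len + tail_len) + (else_len + tail_len)


print(issue_slots(4, 6, 100, has_join=True))   # 110
print(issue_slots(4, 6, 100, has_join=False))  # 210
```

The result is the same either way, which matches "functionally equivalent"; only the amount of work issued differs.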
02:26 karolherbst: yeah... not on volta I couldn't see any difference in perf at all...
02:26 karolherbst: at least not with heaven
02:26 imirkin: well, volta has a different threading model
02:26 karolherbst: right
02:26 karolherbst: but some instructions still require threads to converge :)
02:26 imirkin: i don't think it would run the shader 32x
02:26 imirkin: right
02:26 imirkin: most shaders don't include such instructions though
02:26 imirkin: (unless you count "exit")
02:28 karolherbst: uhh.. mhhh
02:28 karolherbst: nv50 has the join flag on instruction, doesn't it?
02:29 imirkin: iirc no
02:29 imirkin: but it's been a while since i've looked
02:30 karolherbst: it does
02:30 karolherbst: code[1] |= 0x2; is the encoding.. annoying
02:30 karolherbst: yeah.. well, let me turn that off as well
02:31 imirkin: just checked -- looks like joinat *does* get the target encoded
02:31 karolherbst: imirkin: it might not matter
02:31 imirkin: and yeah, join is a flag available on 8-byte encodings
02:32 imirkin: maybe not. but it's there.
02:32 karolherbst: on volta the target is there for the disassembler only
02:32 imirkin: i've never tried to get it wrong to see what happens :)
02:32 karolherbst: :)
02:32 karolherbst: I bet nothing happens
02:33 karolherbst: or maybe it mattered on nv50.. who knows
02:33 karolherbst: but probably not
02:34 karolherbst: well, if it matters, the hw would have to do something with it, like mapping sync points and offsets.. and I bet making the hw more complicated for no reason is even something nvidia was aware of at that time
02:35 imirkin: wow - g200+ have the vote op. i don't think we expose that in nouveau.
02:35 imirkin: too bad i have a g84 plugged in, so can't test it
02:37 HdkR: vote woo
02:37 karolherbst: turing has VOTEU....
02:37 karolherbst: but I guess that makes sense :p
02:40 HdkR: Sadly still doesn't have warp-reduce ops like AMD
02:41 karolherbst: HdkR: what do you mean?
02:41 HdkR: reduce in RF rather than to memory
02:42 karolherbst: why would you need it on register files?
02:42 karolherbst: don't make hw complicated :p
02:42 HdkR: So something like RED.U.Min <UReg>, <Reg>
02:43 karolherbst: ehh
02:43 karolherbst: that would just make the hw complicated with 0 benefit
02:44 HdkR: Useful for warp-wide reductions so you don't need to do a shuffle vote reduction dance :)
02:44 karolherbst: but this means you need to make registers available cross threads
02:44 karolherbst: and I bet that's super complicated to do :p
02:45 HdkR: SHFL already does it :)
02:45 karolherbst: right, but that's one exception which is a bit easier to implement
02:45 karolherbst: doing alu operation in order cross thread?
02:46 karolherbst: ufff
02:47 HdkR: AMD doing it isn't really fair as a comparison point, since it just falls down their vector pipeline, but people have been expecting it for what. eight years the consoles have been here?
02:48 karolherbst: yeah.. no idea?
02:49 karolherbst: putting magic instructions into hw was never a good idea :p
02:49 karolherbst: keeping insane out is where you win
02:49 HdkR: async copy from memory to shared? :P
02:49 imirkin: didn't you say there was a "start async copy from gmem to smem, and go get coffee" instruction now?
02:49 imirkin: heh
02:50 imirkin: we thought of the same one.
02:50 karolherbst: HdkR: heh.. at least you don't do it in the GPC but wait on the barrier :p
02:51 HdkR: :D
02:51 karolherbst: there is also nanosleep and I still have no idea why :p
02:51 HdkR: oh yea, that's a great one
02:51 karolherbst: (I have great ideas on how to abuse it though)
02:54 karolherbst: imirkin: heh.. I can reproduce on nv50 if I hack it up enough...
02:55 karolherbst: https://gist.githubusercontent.com/karolherbst/3b5b5e981cd6573345ff660282c250a4/raw/af9d5eca35e0b1df000d4eebdaf1f98c87e7d3a6/gistfile1.txt
02:55 karolherbst: so..
02:55 karolherbst: if you have a conditional branch jumping to a block with only a join
02:55 imirkin: even without your changes? (but with your hacks)
02:55 karolherbst: that join might get copied over into BB:6
02:55 imirkin: mmmmm
02:55 karolherbst: let's see
02:55 karolherbst: I can try on master
02:55 imirkin: the conditions on branches should get propagated too
02:56 imirkin: i fixed that bug a while back :)
02:56 karolherbst: that's not the issue though
02:56 karolherbst: the join in BB:5 was a bra BB:7 before
02:56 imirkin: right
02:56 karolherbst: soo.. and because BB:7 ends in a flow instruction it thinks it can just copy
02:56 imirkin: some piece of logic is going very wrong
02:56 karolherbst: yeah
02:56 karolherbst: I know what
02:56 karolherbst: just.. annoying
02:56 imirkin: that join should never be propagated to BB:5 in the first place
02:57 imirkin: there's a bool on the target
02:57 imirkin: which indicates anterior joins
02:57 imirkin: and the propagate thing should be sensitive to that
02:57 karolherbst: happens on master as well
02:57 karolherbst: hack was: disable predication _just_ for BB:4
02:57 karolherbst: so the BB:4->BB:7 tree stays not predicated
02:58 karolherbst: and after the program was flattened it arrives at BB:5 quite late
02:58 imirkin: BB:5 should have a branch to BB:7
02:58 karolherbst: yeah.. I guess I'll fix it :)
02:58 imirkin: what replaces it with a join?
02:58 karolherbst: yes
02:58 karolherbst: welll
02:58 karolherbst: the code replaces it with the branch of the target
02:58 imirkin: it should start out with the branch
02:58 imirkin: initially
02:58 karolherbst: I'll show you the code in a moment
02:59 imirkin: like in tgsi, we put the join outside iirc
02:59 imirkin: or rather, in tgsi -> nv50 ir
02:59 karolherbst: imirkin: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp#L3372
02:59 karolherbst: so you see how that can just go through there and do the wrong shit :)
02:59 karolherbst: it's just super unlikely
03:00 karolherbst: and I think the OP_JOIN is there for kepler
03:00 karolherbst: or gens where OP_JOIN is actually a jump :)
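The failure mode described above can be mimicked with a small stand-in. This is a hedged sketch only: the block representation, the `propagate_branch` name and the `join_jumps` switch are all hypothetical and are not taken from nv50_ir_peephole.cpp.

```python
def propagate_branch(blocks, src, join_jumps):
    """Toy stand-in for the branch propagation discussed above.

    blocks maps a block name to a list of (op, target) tuples. If src ends in
    a branch to a block that holds a single flow instruction, that instruction
    is copied over the branch. A join is only copied when it acts as a jump.
    """
    op, target = blocks[src][-1]
    if op != "bra":
        return False
    body = blocks[target]
    if len(body) != 1:
        return False
    t_op, t_target = body[0]
    if t_op == "bra" or (t_op == "join" and join_jumps):
        blocks[src][-1] = (t_op, t_target)
        return True
    return False


def example():
    # BB:5 branches to BB:7, which contains nothing but a join.
    return {"BB:5": [("add", None), ("bra", "BB:7")], "BB:7": [("join", None)]}


tesla_like = example()
propagate_branch(tesla_like, "BB:5", join_jumps=False)
print(tesla_like["BB:5"][-1])  # the branch to BB:7 is kept
```

With `join_jumps=True` the same call replaces the branch with the join, which is only sensible on generations where the join really is a jump.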
03:01 karolherbst: on my branch I can remove the OP_JOIN check though as I made those gens do OP_BRA with join = 1 instead
03:01 imirkin: yeah
03:01 imirkin: that seems slightly bogus =/
03:01 karolherbst: yeah..
03:02 karolherbst: I cleaned that up though :p
03:02 imirkin: esp by the time this happens
03:02 karolherbst: OP_JOIN syncs and never jumps on my branch
03:02 karolherbst: I think? maybe I removed that again? let's see
03:04 karolherbst: oh ehh.. I think I threw it away again
03:06 karolherbst: anyway.. should sleep :D will think about on how to make all of this more sane
03:08 imirkin: nite
09:53 karolherbst: imirkin: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6149
11:36 karolherbst: you know what I like the most about Volta/Turing's ISA? the encoding is sooo sane
11:37 karolherbst: supporting uniform regs uses the same encoding as non uniform ones
11:37 karolherbst: you just need to flip special bits, which are the same for all instructions
11:37 karolherbst: 1 << 91 enables uniform sources
11:37 karolherbst: 6 << 9 sets src1 as uniform
11:38 karolherbst: 7 << 9 sets src2 as uniform
11:38 karolherbst: 1 << 7 enables dst as uniform
11:38 karolherbst: enabling dst as uniform reuses the non uniform layouts
11:39 karolherbst: so 3 << 9 selects c[] as src2 etc...
11:39 karolherbst: that's soo nice
11:40 karolherbst: "1 << 91 enables uniform sources" also flips from non uniform to uniform c[] indirect
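A minimal sketch of how the bit values quoted above could be applied to an instruction word. The assumptions are mine: every shift is taken relative to bit 0 of the full 128-bit word, the selector at bit 9 is treated as a 3-bit field (the values 3, 6 and 7 fit in it), and all names are hypothetical.

```python
# Flag bits as quoted in the chat (positions assumed relative to bit 0).
UNIFORM_SRC_ENABLE = 1 << 91
UNIFORM_DST_ENABLE = 1 << 7

# Source selector field, assumed to be 3 bits wide starting at bit 9.
SRC_SEL_SHIFT = 9
SRC_SEL_MASK = 0x7 << SRC_SEL_SHIFT

SRC_SEL_CONST_SRC2 = 3    # c[] as src2
SRC_SEL_UNIFORM_SRC1 = 6  # src1 is a uniform register
SRC_SEL_UNIFORM_SRC2 = 7  # src2 is a uniform register


def set_src_selector(insn, selector):
    """Replace the selector field and leave every other bit untouched."""
    return (insn & ~SRC_SEL_MASK) | (selector << SRC_SEL_SHIFT)


insn = set_src_selector(0, SRC_SEL_UNIFORM_SRC2)
print(hex(insn))  # 0xe00
print(hex(set_src_selector(insn | UNIFORM_DST_ENABLE, SRC_SEL_CONST_SRC2)))  # 0x680
```

Python integers are arbitrary precision, so the 128-bit word can be kept in a single value instead of an array of 32-bit words.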
11:44 HdkR: karolherbst: When your instruction encoding is 128bits, loads of things end up being sane :D
11:45 karolherbst: well :D
11:45 karolherbst: you only have 96 bits though :p
11:45 HdkR: Sure, some bits stolen for scheduling
11:45 HdkR: and some bits stolen for opc
11:45 karolherbst: mhh
11:45 karolherbst: well
11:45 karolherbst: how many bits do you have for the opc? 5?
11:45 karolherbst: :p
11:46 karolherbst: I think it's actually 7, but it's hard to tell
11:46 karolherbst: source types are part of the opc so there is a system
11:46 HdkR: Think it depends on the family, since some families may reuse some opc bits for other data?
11:47 karolherbst: I think they tried but always failed
11:47 karolherbst: pre volta
11:47 karolherbst: but yeah..
11:47 HdkR: Even with some kludge it is still significantly more sane
11:47 karolherbst: yes
11:48 karolherbst: adding support for uniform regs is quite some work actually..
11:48 HdkR: It ends up breaking the scalar model mindset a bit
11:48 karolherbst: " 21 files changed, 680 insertions(+), 216 deletions(-)" :/
11:49 karolherbst: HdkR: well.. I use nir divergency analysis pass so I don't even have to deal with any of that
11:49 karolherbst: I just accept whatever I get :p
11:49 HdkR: Ah right, NIR has that
11:50 karolherbst: the problem is rather checking for legal combinations and stuff
11:50 karolherbst: and fix all the places where codegen just inserts new values and breaks it
11:51 HdkR: hehe, new RF always causing woes
11:52 karolherbst: yeah...
11:52 karolherbst: why does it have to be three new ones :D
11:53 HdkR: If you're going to cause pain then might as well as do it all at once
11:53 HdkR:pushes Volta under a rug
11:53 karolherbst: well.. we have to support the barrier file anyway
11:53 karolherbst: but I also already did the work for that :D
11:53 karolherbst:still wasn't able to do a full shader-db run without crashing :D
11:56 HdkR: Seems like you're doing quite well for Volta and Turing support
11:57 karolherbst: yeah
11:57 karolherbst: I don't know what doesn't run :p
11:58 HdkR: CTS? :P
11:58 karolherbst: dunno.. with my local patches I have like 5 fails
11:58 karolherbst: and the fails are just annoying at this point :D
11:58 HdkR: whoa
11:59 HdkR: Almost time to go back to OpenCL 2.0 features?
11:59 karolherbst: uff :D
12:01 HdkR: The idea of running OpenCL at <100Mhz sounds..great
12:04 karolherbst: it's not that bad :D
12:04 HdkR: Just make sure you're running the largest GPUs so you take advantage of going wide?
12:05 HdkR: Can't imagine someone using it for Blender though
12:06 karolherbst: I already look forward to the part where shader-db succeeds and I track down regressions because codegen is silly :p
12:19 Ingvix: hey, I've setup reverse prime now. I ran 'xrandr --setprovideroutputsource 1 0' and got external display available. Now though xrandr can't seem to find the mode I'm trying to set for the display though it's listed within the display's modes
12:20 Ingvix: I tried creating my own mode with same settings but that isn't found either
12:23 Ingvix: I think help would be needed
12:33 karolherbst: Ingvix: could be that we don't support the mode or something.. was it there with nvidia?
12:35 Ingvix: karolherbst, resolution was and I'm fairly sure the rate was same as well though I'm not absolutely sure
12:36 Ingvix: it's 1920x1200 with 59.95 refresh rate
12:36 karolherbst: what does xrandr report for the display?
12:37 Ingvix: karolherbst, http://vpaste.net/GIYQY
12:39 Ingvix: uh, I mean I first tried that mode listed first and then tried to create a new mode cause it wasn't working
12:39 Ingvix: and the other modes listed aren't found either
12:39 karolherbst: Ingvix: why did you want to create new ones?
12:40 Ingvix: to get it to work
12:40 Ingvix: just tested if it helped
12:40 Ingvix: but it didn't
12:40 karolherbst: well, there is no mode selected
12:41 karolherbst: why not just use one of the available ones?
12:41 Ingvix: as I just said, I tried setting the first one on the list but it was not found
12:41 Ingvix: none of them are
12:42 karolherbst: what args are you calling xrandr with?
12:42 Ingvix: karolherbst, xrandr --output DP-1-1-1 --primary --mode 1920x1200 --pos 1920x0 --output eDP-1 --mode 1920x1080 --pos 0x0
12:43 Ingvix: and I get: xrandr: cannot find mode 1920x1200
12:43 karolherbst: try "xrandr --output DP-1-1-1 --mode 1920x1200 --rate 59.95"
12:44 Ingvix: I get the same error
12:46 karolherbst: Ingvix: output of "grep . /sys/class/drm/card*-*//modes" please
12:47 Ingvix: http://vpaste.net/tyemi
12:48 karolherbst: Ingvix: "xrandr --output DP-1-1-1 --primary --right-of eDP-1" does this do something?
12:48 karolherbst: uhm.. wait
12:48 karolherbst: "xrandr --output DP-1-1-1 --primary --auto --right-of eDP-1"
12:50 Ingvix: nothing changes and I get no errors
12:50 karolherbst: ehh.. mhh
12:51 karolherbst: add a --verbose to that and paste the output
12:51 karolherbst: I am sure that's a bug somewhere, but not sure where at this point
12:52 karolherbst: maybe dmesg and Xorg log would help as well
12:53 Ingvix: hmm, no output from xrandr by adding --verbose to previous command
12:53 karolherbst: strange
12:53 karolherbst: so it doesn't try to do anything
12:54 Ingvix: should I still fetch dmesg and Xorg?
12:54 karolherbst: wait.. I think xrandrs providers are just wrongly setup
12:55 Ingvix: alright
12:57 karolherbst: no.. should be fine. mhh
12:57 karolherbst: yeah, please get dmesg and xorg logs
12:59 Ingvix: dmesg: http://vpaste.net/MsP2M Xorg.0.log: http://vpaste.net/fJRFE
13:01 karolherbst: ehhh..
13:02 karolherbst: something seems fishy but I don't know what
13:02 karolherbst: those "present flip failed" are also weird
13:03 karolherbst: but the core notifier timeouts could also break things in weird ways
13:04 karolherbst: but they shouldn't prevent enabling modes
13:05 karolherbst: Ingvix: maybe ask on #xorg? I don't think it's a driver issue, but I also don't know enough at this point to figure out what's up
13:06 Ingvix: when I started x for the first time with this setup I could set the mode but the display stayed black. No idea what changed after that if anything
13:06 Ingvix: I can do that
13:06 karolherbst: HdkR: *sadface* https://gist.github.com/karolherbst/736c022daa1f6807b330eed57acdf9b1
13:07 karolherbst: Ingvix: maybe the xrandr setprovideroutputsource did break something
13:07 karolherbst: normally you won't have to do that
13:07 karolherbst: the display staying black is probably a nouveau issue though
13:07 karolherbst: and for that it makes sense to test with a newer kernel just to make sure we didn't fix it already
13:18 HdkR: karolherbst: Interesting that it increased
13:19 karolherbst: HdkR: well, because of stupid reasons :p
13:19 karolherbst: loadpropagation fails
13:19 karolherbst: ld u32 %ur56 c0[0x0] (0)
13:19 karolherbst: mad f32 %r62 %r54 %ur56 %ur57 (0)
13:19 karolherbst: codegen doesn't load propagate %ur56 anymore
13:19 HdkR: ah, interesting
13:19 karolherbst: and I'd prefer it propagating const memory instead of keeping uniform regs :p
13:19 karolherbst: just need to fix that
13:20 karolherbst: and I bet it's like 99% of the fallout
13:20 karolherbst: ahh
13:20 karolherbst: i->src(2).getFile() != FILE_GPR :D
13:20 karolherbst: right..
13:23 karolherbst: mhhh
13:23 karolherbst: now another fallout
13:23 karolherbst: in some places I keep LDC instead of MOV from c[] :/
13:24 karolherbst: ldc needs a barrier :/
13:27 Ingvix: karolherbst, I see. I reboot to the latest one I can get
13:28 karolherbst: HdkR: now I expect no shader will use uniform regs :D
13:28 karolherbst: maybe a few compute ones
13:35 Ingvix: There was one funny thing that both providers listed by xrandr were called "modesetting" but the other one had all the external outputs so figured it's actually nouveau
13:39 HdkR: karolherbst: Why only compute?
13:40 karolherbst: HdkR: group is is uniform :p
13:40 karolherbst: and can't be propagated
13:40 karolherbst: *group id
13:41 HdkR: Does group id matter in fragment shaders that don't use interlock?
13:43 Ingvix: yay, it's working on the latest kernel
13:43 karolherbst: cool
13:44 Ingvix: though I still need to set the source with xrandr before the external displays become available
13:44 karolherbst: Ingvix: aren't you using some desktop environment, but starting stuff yourself?
13:44 Ingvix: I'm using dwm, so no de really
13:45 karolherbst: there are some proper files where stuff like that could be added to.. normally desktops already handle that themselves
13:46 Ingvix: well yeah, I intend to put it in .xinitrc now that it works
13:47 Ingvix: I just sort get the assumption that it wouldn't be a necessary step from what you said before but I guess you were just referring that the de usually does it automatically
13:47 Ingvix: *got
13:48 Ingvix: so no real issue here anymore
13:55 karolherbst: Ingvix: yeah.. normally users won't have to bother with any of that, but without a desktop all those goodies are usually not in place
13:56 karolherbst: HdkR: maybe I should figure out this: https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/nouveau/nvc0/nvc0_program.c#L649 :D
13:56 karolherbst: but I bet that volta needs at least 8 regs :D
13:57 karolherbst: or something stupid
13:59 HdkR:doesn't remember
13:59 HdkR: I very well could have just set that to max when tinkering
13:59 karolherbst: :D
13:59 karolherbst: there is probably some stupid rule
14:00 karolherbst: mhh.. much better but also worse: https://gist.github.com/karolherbst/92d78472b7994810edbb2e805c43f044
14:00 HdkR: That's more reasonable
14:00 karolherbst: well, it starts getting reasonable when gprs are dropping :p
14:01 HdkR: Sadly shader stats won't give you performance improvements
14:01 karolherbst: less gprs will
14:02 HdkR: Sure, improved occupancy
14:02 HdkR: That's not the only improvement that using the uniform datapath gives you though
14:04 karolherbst: I bet it's also more of a "less heat" thing
14:04 karolherbst: I doubt the operation in itself are faster
14:05 HdkR: Depends on how you measure "faster"
14:05 karolherbst: well. if you can run higher clocks for longer that's faster sure :p but I meant same clock and everything
14:05 HdkR: Depends on how you measure "faster" ;)
14:05 karolherbst: maybe it also reduces memory bandwidth a little.. but those ops don't read from memory...
14:06 karolherbst: ahh damn memoryopt
14:10 karolherbst: or rather copy prop? mhh
14:10 karolherbst: uhm.. load prop
14:11 HdkR: Hard to say what the other improvements are without giving it all away :P
14:12 karolherbst: I bet :p
14:13 karolherbst: I mean.. you could probably disable threads and use resources for something else, but I imagine that would be quite complicated to actually implement in hw
14:15 karolherbst: ahh. it's my fault that copy propagation fails
14:15 karolherbst: can't if the files don't match
14:55 karolherbst: HdkR: now I am getting there :) https://gist.github.com/karolherbst/e68fe51e07e75b594beca0bf3357759f
14:58 HdkR: Nice
15:12 karolherbst: mhh.. first checking out the maxgpr thing though
15:13 karolherbst: I guess that could give a nice boost
15:13 karolherbst: it's probably something about OOR gprs for nops or something stupid
15:13 karolherbst: or if you got OOR you just don't get 0 anymore but the hw fails..
15:20 HdkR: OOR gprs?
15:20 karolherbst: if you allocate 8 and use r8
15:21 HdkR: oh, out of range
15:21 karolherbst: yeah
15:21 Ingvix: I have some new issues. mpv and telegram-desktop refuse to create a window. Telegram's process freezes altogether while I can still terminate mpv without problems
15:21 karolherbst: maybe allocating at least 8 indeed fixes it
15:21 karolherbst: would be better to always add 4
15:22 karolherbst: *than
15:23 Ingvix: I believe it's nouveau related since they worked fine without nouveau
15:23 karolherbst: Ingvix: probably multithreading problems
15:23 karolherbst: there are some deeper annoying race conditions we still need to fix
15:24 Ingvix: is there any workaround or am I just better off not using nouveau then?
15:25 karolherbst: don't use applications doing threaded GL
15:25 karolherbst: which.. besides chromium based ones there isn't much
15:26 Ingvix: uh, my chromium-based browser works fine though
15:26 karolherbst: chromium blacklists nouveau
15:26 karolherbst: but that is application controlled afaik
15:26 karolherbst: so every application using CEF need their own blacklist or something
15:27 Ingvix: mpv and telegram are quite essential for me
15:28 karolherbst: I think there might be env variable or flags to disable it... for mpv there is also some specific workaround, like not using the gl backend?
15:28 karolherbst: dunno
15:33 karolherbst: HdkR: ehh.. don't tell me we have to allocate predicates + zero regs as well :p
15:34 HdkR: You don't
15:34 karolherbst: mhhh
15:34 karolherbst: but we need to allocate more than we use
15:34 HdkR: AlignUp(Regs, 8);
15:34 karolherbst: nope
15:35 karolherbst: ohh...
15:35 karolherbst: annoying :D
15:35 karolherbst: but yeah
15:35 karolherbst: that makes sense
15:38 HdkR: :P
15:42 karolherbst: and at the same time, doesn't :p
15:42 karolherbst: why was adding always 5 enough?
15:42 karolherbst: sure it's 8 and not 4?
15:42 karolherbst: but yeah.. the alignment stuff does make sense
15:43 karolherbst: we didn't before afaik
15:43 karolherbst: orrr.. wait
15:43 HdkR: I feel like I recall 8, but could be 4
15:44 karolherbst: okay.. so for previous gens we have a minimum of 4
15:44 karolherbst: but no alignment of the value
15:44 karolherbst: volta _could_ be minimum of 8, but 4 aligned
15:44 karolherbst: or just aligned
15:44 karolherbst: will try the alignment thing first
15:56 HdkR: Pretty sure behaviour changed slightly there, needs aligned
15:58 HdkR: Probably also min size of the alignment size
15:58 HdkR: zero is invalid afaik
16:07 karolherbst: HdkR: how can that be invalid? the hw refuses to launch?
16:09 HdkR: some sort of crash I believe
16:19 karolherbst: I am currently wondering why last_id + 5 was "fine" ...
16:53 HdkR: karolherbst: Fixed the problem of < 8 for shaders that used 3 or 4 registers for colour output? Idunno
17:27 karolherbst: HdkR: mhh.. I added an align around the +5 and that seems to work: align(info->bin.maxGPR + 5, 4)
17:27 karolherbst: +4 doesn't
17:27 karolherbst: so maybe the hw aligns with 4 automatically
17:27 karolherbst: but needs it 8 aligned actually
17:27 karolherbst: so maybe align(info->bin.maxGPR + 8, 8) would be better?
17:31 karolherbst: let's see what gives better values, but I guess the 8 aligned stuff does
17:31 HdkR: :)
17:36 karolherbst: anyway.. for shader-db reporting we want to have the original value mhh
17:36 karolherbst: oh well
18:00 karolherbst: thinking about this.. if the hw aligns_up by 4 we would have align(info->bin.maxGPR + 8, 8) vs align(info->bin.maxGPR + 8, 4).. HdkR: does that make sense?
18:00 karolherbst: just wondering what's the best way to compare old vs new behaviour when I don't know for sure what the hw does :p
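To compare the candidate formulas side by side, a small sketch, assuming `align()` rounds up to a power-of-two multiple and that `maxGPR` is the highest register index in use. `align_up` and `candidates` are illustrative names, not Mesa functions.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def candidates(max_gpr):
    """Register allocation sizes produced by the formulas from the chat."""
    return {
        "align(maxGPR + 5, 4)": align_up(max_gpr + 5, 4),
        "align(maxGPR + 8, 4)": align_up(max_gpr + 8, 4),
        "align(maxGPR + 8, 8)": align_up(max_gpr + 8, 8),
    }


for max_gpr in (0, 3, 4, 11):
    print(max_gpr, candidates(max_gpr))
```

Printing the three values for a range of `maxGPR` shows where the formulas diverge, which is where a hardware test would be most informative.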
18:04 imirkin: karolherbst: does elemental run ok?
18:04 karolherbst: imirkin: generally or with patches?
18:05 imirkin: with the latest and greatest
18:05 karolherbst: I guess I could check
18:06 imirkin: if you want a hard hang, just start up F1 2015
18:06 imirkin: ;)
18:06 imirkin: i had a trace which hung kepler pretty reliably
18:06 karolherbst: right...
18:06 karolherbst: I really didn't test against heavy stuff yet
18:06 karolherbst: just heaven and realisticrendering are probably the heaviest things I've run
18:06 imirkin: that's where a lot of issues arise :)
18:07 karolherbst: at this point it's only really useful for 2D acceleration sadly
18:07 imirkin: obviously having CTS tests pass gives you some confidence that e.g. addition works
18:07 karolherbst: right
18:07 karolherbst: well and blits :p
18:07 imirkin: but lots of bugs happened in "weird" shader situations
18:07 imirkin: which are unlikely to occur with CTS
18:10 karolherbst: imirkin: elemental was unreal, correct?
18:10 imirkin: yes
18:10 imirkin: should be on the unreal demos page
18:11 karolherbst: imirkin: I think they reworked it at some point and it required downloading the full thing or something weird
18:11 imirkin: it's a big download
18:11 imirkin: like 1.5GB iirc
18:11 karolherbst: sure.. but I don't think you can download the demo alone anymore
18:12 karolherbst: at least I don't find from where
18:12 imirkin: hold on
18:13 imirkin: if you have somewhere i can scp it to maybe?
18:14 karolherbst: I could also boot my other machine and scp it from there
18:14 karolherbst: I have elemental on some machine :)
18:14 imirkin: ok. well it's 1.1GB ... let's see if people.fd.o has enough space
18:15 imirkin: nooope, down to 100MB free in /home (and i already wiped most of my stuff)
18:15 imirkin: let me see if i can easily put up a server...
18:16 karolherbst: python3 -m http.server $port :p
18:16 imirkin: that's not the problem :p
18:16 karolherbst: I know :D
18:16 karolherbst: I could check where PTS downloads from
18:17 karolherbst: ehhh
18:17 karolherbst: URL broken
18:17 imirkin: ok, let's see if this works...
18:24 karolherbst: seems to run alright
18:24 karolherbst: just slow
18:24 imirkin: run the whole demo
18:24 imirkin: in the past there have been parts that run wrong
18:24 imirkin: oh also
18:24 imirkin: there's a way to change the resolution
18:24 imirkin: but i don't remember what it is ;)
18:25 karolherbst: well.. I run that on a 4k screen.. but I think it auto scales down because X on wayland
18:25 imirkin: yeah, but there's a way to run it in e.g. a 1280x800 window
18:25 karolherbst: but yeah.. it runs surprisingly fine
18:27 karolherbst: I think there might be one or two issues, but I don't know, would have to compare
18:28 karolherbst: ehh.. the channel died at the end
18:28 imirkin: add ResX=1280 ResY=720 at the end of the command
18:28 imirkin: iirc there was a time when the unreal engine logo with the frosting had a bug rendering at the very end :)
18:29 karolherbst: yeah.. I think I'll wait until my piglit run finished, I shouldn't push it :D
18:29 imirkin: lol
19:53 karolherbst: ehh our align() is already aligning up D: I think I didn't have enough sleep
19:53 karolherbst: imirkin: btw, did you see this MR? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6149
19:54 imirkin: karolherbst: i did not. i'll have to double-check wtf hasJoin is
19:55 karolherbst: not 100% but I think hasJoin is the join modifier on instructions
19:55 imirkin: ok, so then that's not the right condition
19:56 karolherbst: I think none is
19:56 karolherbst: honestly
19:56 karolherbst: or you'd have to check the target of the join, _but_ on gens where you don't have join BB: forms you only end up with that NOP.join thing being OP_JOIN so you are actually fine
19:57 karolherbst: this all just sucks
19:58 karolherbst: so.. we have the "jumpy" OP_JOIN being NOP.join on targets with hasJoin
19:59 karolherbst: on targets !hasJoin OP_JOIN is SYNC
20:01 karolherbst: imirkin: I think I'll really split it up, it's just causing issues understanding this whole mess
20:03 imirkin: the question is "does JOIN jump or not"
20:03 imirkin: i believe that's hasAnteriorJoin
20:03 imirkin: (or the inverse of that)
20:04 karolherbst: how was it on tesla vs fermi?
20:05 karolherbst: that's essentially the only gens I don't know how it works
20:05 karolherbst: maxwell+ is quite clear
20:05 imirkin: fermi was jumpy
20:05 imirkin: tesla is not-jumpy
20:05 karolherbst: okay
20:05 karolherbst: mhhhh
20:06 karolherbst: jumpy join == hasJoin && !hasAnteriorJoin, non-jumpy join == hasJoin && hasAnteriorJoin, ssy = !hasJoin && !hasAnteriorJoin
20:06 karolherbst: that's how the target objects are setup right now
20:06 karolherbst: nv50: hasJoin and hasAnteriorJoin both to true
20:07 karolherbst: fermi/kepler: hasJoin but not hasAnteriorJoin
20:07 karolherbst: maxwell: both false
20:07 imirkin: i think you're making this more confusing
20:07 imirkin: than it needs to be
20:07 imirkin: hasAnteriorJoin = join does not jump
20:07 imirkin: hasJoin = there is a join modifier on instructions
20:08 karolherbst: joinAnterior is false for fermi
20:08 karolherbst: uhh.. wait
20:08 imirkin: correct.
20:08 karolherbst: but also false for maxwell
20:08 imirkin: correct
20:08 karolherbst: yeah, but joins don't jump on maxwell
20:08 imirkin: it should only be true on tesla (and maybe now volta?)
20:08 imirkin: joins jump on maxwell.
20:09 karolherbst: uhhh... right..
20:09 imirkin: =]
20:09 karolherbst: annoying
20:09 imirkin: so i think you want that condition to be OP_JOIN && !hasAnteriorJoin
20:09 karolherbst: yeah.. I guess so
20:09 karolherbst: joinAnterior is just not used at this point
20:09 karolherbst: but yeah
20:09 karolherbst: I guess this is what it means
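The flag combinations listed above, written down as a table. This is a sketch: the class and function names are hypothetical, the flag values are the ones quoted in the chat, and the meanings follow imirkin's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetFlags:
    has_join: bool           # a join modifier exists on instructions
    has_anterior_join: bool  # OP_JOIN marks the convergence point, no jump


# Per-generation values as stated in the conversation.
TARGETS = {
    "nv50 (tesla)": TargetFlags(has_join=True, has_anterior_join=True),
    "fermi/kepler": TargetFlags(has_join=True, has_anterior_join=False),
    "maxwell": TargetFlags(has_join=False, has_anterior_join=False),
}


def join_jumps(flags):
    """A join acts as a jump exactly when it is not an anterior join."""
    return not flags.has_anterior_join


def join_is_propagatable(op, flags):
    """The suggested condition: OP_JOIN && !hasAnteriorJoin."""
    return op == "OP_JOIN" and join_jumps(flags)


for name, flags in TARGETS.items():
    print(name, "join jumps:", join_jumps(flags))
```

Only the anterior-join flag decides whether the join jumps; `has_join` is carried along purely to mirror the table.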
20:10 imirkin: it might be used in the tgsi -> nv50 ir?
20:10 karolherbst: it's not used
20:10 imirkin: src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp: if (prog->getTarget()->joinAnterior) {
20:10 karolherbst: mhh.. I thought that's a local change of mine.. guess I was mistaken
20:12 imirkin: ok, so the tgsi -> nv50 ir pass sticks the join "anterior" style
20:12 karolherbst: I am wondering if it makes sense to rework all of that and make a "join bra" the jumpy one and codegen just has to emit the correct thing
20:12 imirkin: (see insertConvergenceOps)
20:12 karolherbst: yeah
20:12 imirkin: it puts a joinat at the start, and join at the convergence point
20:13 karolherbst: right
20:13 karolherbst: that's how we need it for volta as well
20:13 imirkin: right.
20:13 karolherbst: maybe we should rename hasAnteriorJoin to hasJumpyJoin, then it's obvious what's the meaning of it :p
20:13 imirkin: well, anterior is an english word
20:14 imirkin: with a meaning
20:14 imirkin: just not a common word
20:14 karolherbst: also doesn't help understanding it
20:14 imirkin: actually it means something different than i thought
20:14 imirkin: learn something every day
20:14 karolherbst: this concept of "joins before executing" is not helping
20:14 imirkin: i thought it meant "on the outside"
20:15 imirkin: but it actually means "nearer the front"
20:15 karolherbst: yeah
20:15 karolherbst: let's say we could use a word that more people will have an easier time understanding :)
20:16 karolherbst: I am wondering if it should get reworked to get the ISA to a more sane state (an op has one and only one meaning) or we keep it like this and just rename stuff
20:19 imirkin: eh wtvr
20:19 karolherbst: yeah.. I am not too fond of changing stuff which could break everything
20:56 robi: hasJoinBeforeExecute ?
20:56 robi: hasJoinBeforeExec ?
20:56 robi: hasPreExecJoin ?
20:56 karolherbst: imirkin: is the output of the default nv-report.py script also misaligned for you?
20:57 karolherbst: like this: https://gist.github.com/karolherbst/2382a84328a6157230a8061c7bc84183
20:59 karolherbst: ohhhhh
20:59 karolherbst: I know why
20:59 karolherbst: annoying
21:15 karolherbst: imirkin: anything you always disliked about nv-report in shader-db?
21:16 karolherbst: https://gitlab.freedesktop.org/mesa/shader-db/-/merge_requests/46
21:17 karolherbst: ehh.. wanted to make it python3 and python2 compatible
21:22 karolherbst: mhhh
21:22 karolherbst: the regex...
21:41 imirkin: karolherbst: didn't use to be misaligned
21:41 karolherbst: yeah.. but I figured out what's wrong
21:41 karolherbst: something with the multiple print calls
21:41 karolherbst: it's fixed with my MR
21:41 imirkin: anyways, i was happy with it. we could nuke the "bytes" thing, i guess - it only matters for tesla, everywhere else it's 1:1 with instructions
21:42 karolherbst: imirkin: I rewrote it so that we have _one_ list with all attributes
21:42 karolherbst: and we can just add/remove things as we please without having to change the entire file
21:49 karolherbst: mhhhh
21:49 karolherbst: RE = dict((k, re.compile(regex)) for k, regex in [("name", r"^(.*) - ")] + [(a, r"%s: (\d+)" % a) for a in ("type",) + attrs])
21:49 karolherbst: it does generate the regex list based on the attributes, but at the same time it's written so that nobody wants to change or even understand it :p
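The one-liner above is easier to follow when unrolled. The sketch below builds the same dictionary step by step and applies it to one stats line. The `attrs` tuple is an assumption taken from the field names in the stats line quoted later in the log, and `parse_line` and the shader name in `sample` are hypothetical, not part of nv-report.py.

```python
import re

# Assumed attribute list (field names from the stats line quoted in the log).
attrs = ("local", "shared", "gpr", "inst", "bytes")

# Same mapping as the dict(...) one-liner: one compiled regex per field.
RE = {"name": re.compile(r"^(.*) - ")}
for attr in ("type",) + attrs:
    RE[attr] = re.compile(r"%s: (\d+)" % attr)


def parse_line(line):
    """Hypothetical helper: collect every field whose regex matches the line."""
    result = {}
    for key, regex in RE.items():
        match = regex.search(line)
        if match:
            value = match.group(1)
            # The name stays a string, every other field is a counter.
            result[key] = value if key == "name" else int(value)
    return result


# Made-up shader name; the stats part is the line quoted in the log.
sample = "example.shader_test - type: 3, local: 0, shared: 0, gpr: 0, inst: 1, bytes: 16"
stats = parse_line(sample)
print(stats)
```

Adding or removing a reported attribute then only means editing the `attrs` tuple, which matches the goal stated in the log.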
21:52 karolherbst: I think this is good enough
21:52 karolherbst: https://gitlab.freedesktop.org/mesa/shader-db/-/blob/666c096da649eb3514a970ac413c448d2d69ae55/nv-report.py
21:54 karolherbst: ehh whitespace messup
21:54 karolherbst: ooh no.. it's a loop
21:54 karolherbst: imirkin: I was also thinking of adding "aligned" gprs in addition to the normal ones
21:55 karolherbst: pre-volta the file is 4-aligned, whereas on volta it's 8-aligned, apparently
21:55 karolherbst: or maybe we should rather report how many threads could run in parallel...
21:56 karolherbst: anyway, with that it's trivial to add or remove attributes
22:15 karolherbst: anyway... https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6152
22:16 karolherbst: "total gpr in shared programs : 1561876 -> 1443744 (-7.56%)" with volta :)
22:16 karolherbst: well.. hw gpr count here, not the result of codegen
22:16 karolherbst: meaning aligned to whatever the hw expects
22:17 imirkin: that's a nice optimization :p
22:17 karolherbst: well... :p
22:18 karolherbst: at least it has a good more_perf/loc ratio :p
22:18 karolherbst: let's see if it actually changes perf as well
22:19 karolherbst: but yeah.. before we reported a higher reg count than we needed (assuming the hw aligns to 4 on its own)
22:27 imirkin: yeah
23:24 karolherbst: imirkin: pixmark_piano 189 -> 194 points :)
23:25 karolherbst: pixmark_volplosion 2930 -> 3039
23:25 karolherbst: I already feared I would be unlucky and none of those would be affected
23:30 karolherbst: mhh.. triangle shows OOR_REG errors :/
23:33 imirkin: can't win 'em all
23:33 HdkR: ?)
23:33 HdkR: :)
23:33 HdkR: Aligning up by 8 or 4? :P
23:33 karolherbst: HdkR: 8
23:34 karolherbst: I bet it's something like it has to be at least 16 :p
23:34 karolherbst: or..
23:34 karolherbst: there are no regs
23:34 karolherbst: so we have -1 + 1 aligned to 8 == 0
23:34 HdkR: time to ensure it is always at least 8 and run it again? :)
23:35 karolherbst: ahhh "type: 3, local: 0, shared: 0, gpr: 0, inst: 1, bytes: 16" mhhh
23:35 karolherbst: 0 aligned to 8 is 0 :)
23:35 HdkR: max(alignup(Regs, 8), 8) woop woop
23:35 karolherbst: I can drop that stupid min anyway
23:35 karolherbst: 254 is the highest we get and +1 aligned to 8 that's 256 :)
23:36 HdkR: :)
23:39 HdkR: Good to know that I was right that zero is an invalid config though :D
23:40 karolherbst: mhhh.. still seeing errors
23:42 HdkR: hmmm
23:44 karolherbst: good: MAX2(4, align(info->bin.maxGPR + 5, 4))
23:44 karolherbst: bad: MAX2(8, align(info->bin.maxGPR + 1, 8))
23:46 imirkin: well, max(4, align(x+5,4)) == align(x+5,4)
23:46 imirkin: that max isn't doing anything
23:46 imirkin: neither is the max(8, ...)
23:47 imirkin: since maxGPR + 1 is at least 1, i think
23:47 karolherbst: nope
23:47 karolherbst: it can be -1
23:47 imirkin: or can maxGPR be -1 if none are used?
23:47 imirkin: ah
23:47 imirkin: good to know.
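The arithmetic in this exchange can be checked with a few lines. This is a sketch that assumes `align` rounds up to the next multiple of a power-of-two alignment, as such helpers usually do; `align`, `gpr_count_good` and `gpr_count_bad` are illustrative names that mirror the two expressions quoted above, not the driver code.

```python
def align(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def gpr_count_good(max_gpr):
    # "good: MAX2(4, align(info->bin.maxGPR + 5, 4))"
    return max(4, align(max_gpr + 5, 4))


def gpr_count_bad(max_gpr):
    # "bad: MAX2(8, align(info->bin.maxGPR + 1, 8))"
    return max(8, align(max_gpr + 1, 8))


# maxGPR is -1 when a shader uses no registers at all.
print(align(-1 + 1, 8))      # no registers, no max(): aligns to 0
print(gpr_count_bad(-1))     # the max(8, ...) lifts that case to 8
print(align(254 + 1, 8))     # highest maxGPR from the log: 256
```

For every maxGPR of at least -1 the `max(4, ...)` in the first expression never changes the result, as imirkin points out. The `max(8, ...)` in the second only matters in the no-register case.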
23:47 karolherbst: not often that it happens :)
23:47 karolherbst: the tess passthrough is probably the only shader hitting this
23:48 karolherbst: mhh.. I have a few shaders which got 12 set before and now 8.. but mhh
23:48 karolherbst: I hope it's not something stupid
23:48 imirkin: what else could it be?
23:48 imirkin: if not something stupid
23:55 karolherbst: ehh
23:56 karolherbst: I bet it's something stupid like "vertex needs 16 regs min" or so
23:56 karolherbst: anyway.. it's the vertex passthrough shader
23:56 karolherbst: https://gist.github.com/karolherbst/40293b668ed862d184606516449cbfe2
23:56 karolherbst: let me search for the binary
23:57 karolherbst: huh wait...
23:57 karolherbst: 1 is fragment
23:57 karolherbst: :D