00:00 mwk: worth saving for future reference, I suppose
00:02 mwk: though for now I'm keeping most weird things marked as UNK until their function is reasonably proven... which would be about the time I get to RE the ROPs
00:02 mwk: it seems to have a list of texture formats as well, neat
00:55 mwk: well that was a handful
00:55 mwk: so I can begin and end patches now
00:56 mwk: I suppose I could try beginning or ending a transition...
00:57 mwk: swatches are scary; curves are double-scary
02:51 imirkin: skeggsb: let me know if any of that doesn't make sense... i still have the NV5 + NV17 plugged in in case you want me to test something
13:24 rhyskidd: skeggsb: For what it's worth, mmiotracing a GP107 on blob has worked fine with a 4.10+ kernel
13:57 imirkin: anyone with a gt21x or maxwell plugged in, and having mesa 17.1.x, mind sending me the output of `glxinfo -l -s`?
14:00 imirkin: skeggsb: sync-to-vblank doesn't appear to work on NV5 - is that expected?
14:03 RSpliet: imirkin: blob or nouveau?
14:03 imirkin: RSpliet: nouveau
14:04 RSpliet: https://paste.fedoraproject.org/paste/WHSrvrA7Q-uWvrUWlx95JA
14:04 RSpliet: ^ NV117 on Mesa 17.1.5
14:04 imirkin: thanks much
14:05 RSpliet: (using prime offloading. Presumably that doesn't make a difference)
14:05 imirkin: nope
14:06 imirkin: [at least not for the bits i care about, i suppose the GLX stuff could be different]
14:07 imirkin: oh, and also i need fermi
14:28 tobijk: any comments on this patch? it has been floating a while now on the ML and nobody commented: https://patchwork.freedesktop.org/patch/169870/
14:31 imirkin: tobijk: yeah - that clearly shouldn't be needed ;)
14:32 imirkin: i've been trying to fix the spilling logic for a while
14:32 imirkin: with limited success
14:32 imirkin: you can see some of that work on my 'cts' branch on github
14:33 tobijk: imirkin: mh that shader has a big local chunk, so having more spill candidates is not that bad anyway
14:33 imirkin: merges should never need to get spilled
14:34 imirkin: if all of their components are spilled, in effect, so is the merge
14:34 imirkin: however ... there could be something a little wrong with that. dunno.
14:35 tobijk: yeah maybe, that's why i'm asking :)
14:36 tobijk: imirkin: not sure if committing this until ra is fixed is something we should do
14:36 tobijk: at least it gives some more shaders a chance in the meanwhile ;-)
14:37 imirkin: if you can demonstrate to me that this is clearly the right approach, then fine. but i don't think even you believe that.
14:38 imirkin: i'd be more inclined to take a patch which allows spilling of merges/unions outright
14:38 tobijk: imirkin: well i believe that it's the right way to go until ra is fixed :D
14:38 imirkin: "until ra is fixed" isn't a thing that will ever happen
14:38 imirkin: i don't want to make that code unnecessarily complicated, as that will complicate future attempts to improve things
14:39 tobijk: well, until it's improved then :P
14:40 imirkin: having logic like the one you propose in your patch makes it less likely to get fixed up. i won't take hacks.
14:40 tobijk: imirkin: spilling like any other value regresses shader-db by about 9%, that's why i have pursued that way
15:04 imirkin: yeah i'm not overly surprised
15:30 RSpliet: imirkin: NVD9 -> https://paste.fedoraproject.org/paste/5cD4S71jlQlgMA5UDZ8sIg
15:35 imirkin: RSpliet: yay, thanks!
15:35 RSpliet: Don't have access to a GT21x right now unfortunately... just NVAC
15:37 imirkin: no worries
15:50 imirkin: welp, looks like nvidia and freedreno are mostly covered for 17.1 now, which i assume are the only two drivers anyone cares about ...
15:50 imirkin: so i'll make that the default.
16:03 imirkin: [47174.478953] nouveau 0000:09:01.0: gr: intr 00000001 [NOTIFY] nsource 00000800 [STATE_INVALID] nstatus 00003000 [INVALID_STATE BAD_ARGUMENT] ch 1 [X[5561]] subc 4 class 005f mthd 0308 data 012c012c
16:03 imirkin: [47174.492122] nouveau 0000:09:01.0: gr: intr 00000001 [NOTIFY] nsource 00000002 [DATA_ERROR] nstatus 00003000 [INVALID_STATE BAD_ARGUMENT] ch 1 [X[5561]] subc 2 class 0042 mthd 0304 data 20000500
16:03 imirkin: fun. some kind of issue on NV5 at 1920x1200? odd =/
16:03 imirkin: mwk: have you RE'd enough of NV5 to explain what those mean precisely?
16:05 mwk: the first one is a pre-launch fail, ie. you tried to do an operation, but something about your context is wrong, and it's hard to tell what exactly it is from a single line
16:05 mwk: the second... let's see
16:07 imirkin: i see - the 308 command is a "go" command, and there's some sort of incompatibility in the previously-set args?
16:07 mwk: the second means you're trying to set unsupported pitch for 2d destination buffer
16:08 imirkin: what's the max pitch?
16:08 mwk: you're trying to set pitch 0x2000 , specifically
16:08 mwk: and the max supported pitch for NV4 is 0x1fe0
16:08 imirkin: and NV5 too probably. ok.
16:08 mwk: yes, NV5 too
16:09 imirkin: i wonder why it's 0x2000 though.....
16:09 mwk: NV10 bumped that to 0xffe0
16:09 imirkin: should be 0x1e00 unless i've messed up my math
16:09 imirkin: (1920 * 4)
16:10 imirkin: perhaps something helpfully rounds up to 0x1000 somewhere
16:10 mwk: the required rounding is 0x20 for NV4:NV20, 0x40 for NV20: FWIW
16:11 imirkin: ah, i assumed it was 0x40 everywhere
16:11 imirkin: (at least it is for the video overlay)
16:11 mwk: well
16:12 mwk: the alignment/max requirements do differ between various parts of the card
16:12 imirkin: yea
16:12 imirkin: i only played around with it on NV34 and NV4A though
16:12 mwk: they even vary depending on the class you're using, eg. before NV20 you can get away with just 0x10-aligned pitch if you're using the NV1/NV3 classes
16:13 imirkin: and assumed it was the same on earlier boards. but i don't think alignment's such a big issue in any case.
16:14 mwk: overall it's an unholy mess
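The pitch numbers in this exchange can be checked with a small sketch. The limits and alignment are the ones mwk quotes above (max 0x1fe0 on NV4/NV5, max 0xffe0 from NV10, 0x20 alignment before NV20). The function names and the table layout are made up for illustration, and the idea that the destination pitch sits in the upper 16 bits of the method data is an assumption that merely matches mwk reading 0x2000 out of data 20000500; none of this is taken from nouveau itself.

```python
# Limits for the 2D destination surface pitch, as quoted in the conversation.
PITCH_LIMITS = {
    "NV4": {"max": 0x1FE0, "align": 0x20},   # NV5 is said to behave the same
    "NV10": {"max": 0xFFE0, "align": 0x20},
}


def dst_pitch_from_data(data):
    # Assumption: the destination pitch is the upper 16 bits of the data word.
    return (data >> 16) & 0xFFFF


def pitch_ok(chip, pitch):
    # A pitch is accepted if it is within the maximum and properly aligned.
    lim = PITCH_LIMITS[chip]
    return pitch <= lim["max"] and pitch % lim["align"] == 0


def round_up(value, step):
    # Round value up to the next multiple of step.
    return (value + step - 1) // step * step


natural = 1920 * 4                      # bytes per scanline at 32bpp
rounded = round_up(natural, 0x1000)     # imirkin's guess about extra rounding

print(hex(natural), pitch_ok("NV4", natural))
print(hex(rounded), pitch_ok("NV4", rounded), pitch_ok("NV10", rounded))
print(hex(dst_pitch_from_data(0x20000500)))
```

Under these assumptions the natural pitch of 0x1e00 would be accepted on NV4/NV5, while a value rounded up to a multiple of 0x1000 lands on 0x2000 and is rejected there but not on NV10.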
16:15 imirkin: aha. ok, looks like no wait-for-vblank on pre-nv10
16:15 imirkin: pNv->WaitVSyncPossible = (pNv->Architecture >= NV_ARCH_10) &&
16:15 imirkin: (implementation != 0x0100);
16:15 imirkin: (from xf86-video-nv)
16:15 mwk: correct
16:16 imirkin: is that because there's no semaphore method? or something else?
16:16 mwk: no
16:16 mwk: it's about the wait-for-crtc methods
16:17 mwk: and the 0x9f blit / 0x96 celsius classes that introduce these
16:17 mwk: methods 0x120-0x134
16:17 imirkin: ah
16:17 mwk: aka FLIP methods
16:18 imirkin: mwk: do you know anything that might cause an overlay on NV5 to have snow on it? are there clocks we might not be twiddling? or can the HW really not handle a full-screen 1920x1200 overlay?
16:20 mwk: no idea about overlays, sorry
16:25 imirkin: and looks like xf86-video-nv didn't support them on NV4 :(
16:31 imirkin: nv4CalcArbitration has various references to enable_video though
16:31 imirkin: ugh. whoever "simplified" those functions for nouveau ought to be shot.
16:32 mwk: ?
16:33 imirkin: xf86-video-nv has these totally unreadable and yet presumably-working functions which compute various clock values
16:33 imirkin: someone went ahead and "simplified" them for nouveau and removed like half the logic in them
16:33 mwk: ah heh
16:34 imirkin: of course there's now no way to know whether those functions were also originally broken for some stuff
16:35 imirkin: and it's not like i have an abundance of hardware or time to test all the various cases
16:35 imirkin: for an example of the "simplifications" that were made... https://github.com/skeggsb/nouveau/commit/cef5d6a638611e6386abe589d7c9c163b57a2e7f
17:46 mooch2: imirkin, unfortunately, you can't do nouveau tests on my nv5 emulation just yet, because no linux works on 86box currently
17:46 mooch2: except for like, REALLY OLD ONES
17:46 mooch2: like dsl
17:46 imirkin: mooch2: i also suspect that you haven't *precisely* emulated the various clock behaviors
17:47 mwk: even I'm not that insane...
17:47 mooch2: imirkin, not really
17:48 mooch2: all that's implemented is the basic pll behavior
17:48 mooch2: that's mostly all it needs
17:48 mooch2: unless, of course, you're okay with a weird refresh rate
17:48 mooch2: of like 17 hz
17:48 mooch2: when you asked for 60
17:48 mooch2: still have no fucking idea why it does that
17:48 mooch2: the parameters passed are correct, so
17:50 mooch2: also, the only dev that knows the cpu code won't fucking fix linux
17:50 mooch2: ever since she fucking BROKE IT
17:51 mwk: wheee
17:51 mwk: NV40 state load/dump sort-of works
17:51 mooch2: ?
17:51 mooch2: nice
17:51 imirkin: mwk: that's those ctxprog-driven thingies?
17:51 mooch2: pls add moar scan stuff to nv40 debug regs pls
17:52 mwk: mooch2: actually, it seems to be complete
17:52 mwk: you've extracted the scans from NV4E, right?
17:52 mwk: they match perfectly for my NV40, so they're probably the same for all NV40 GPU
17:52 mooch2: weird
17:52 imirkin: well, NV40 and NV41+ are variously different
17:52 mwk: imirkin: no, hwtest loads/dumps all state directly through MMIO
17:53 mooch2: you should verify that that's not ALL the nv40 debug regs
17:53 mwk: though I'll have to start using ctxprogs once I model the NV40 pipeline state
17:53 mooch2: why did they even MAKE ctxprogs?
17:53 mwk: mooch2: it's all NV40 debug regs in 400080:400100 range
17:54 mwk: ie. main debug regs
17:54 mwk: I suppose there are *way* more per-unit debug regs
17:54 mooch2: well then
17:54 mooch2: i can't believe that worked lol
17:54 mwk: but I'll handle these for each unit individually
17:55 mwk: as for the entire rest, right now I've only covered the regs that are also on NV30
17:55 mwk: and NV40 has like 8× more MMIO registers than NV30
17:56 mwk: it seems all pipeline units that were only visible through RDI before suddenly came out of hiding and appeared in MMIO space
17:57 mwk: so... the grand plan now is to hammer the method submission code until it works on NV40, then start modeling all the weirdo state
17:57 mooch: well then
17:57 imirkin: mwk: have you gotten to drawing anything on any of the chips yet?
17:58 mwk: imirkin: I sorta drew a point on NV3
17:58 imirkin: hehehe
17:59 mwk: well
17:59 mwk: I know *all* about drawing points on NV3, actually
17:59 imirkin: the quadratic thingies - not so much though?
17:59 mwk: these are scary
18:00 mwk: I mean, sure, I've drawn them manually
18:00 mwk: but modelling their behavior in hwtest... is not simple
18:03 mwk: imirkin: the high level status is, I have mostly covered non-drawing methods on everything prior to NV40
18:03 mwk: with some missing stuff here or there
18:03 imirkin: cool
18:03 mwk: + solid & blit ROP tests for NV1/NV3
18:04 mwk: + currently finishing up T&L tests for NV1x
18:04 imirkin: yeah, getting all that stuff *properly* modeled is going to be a pain... even plain rast, it's not strictly-defined
18:04 mooch: mwk, when i implemented the RMA bar on nv3, nt4 stopped using the nv3 drivers
18:05 mwk: imirkin: the main problem is that rasterization likes to lock up when given weirdo parameters
18:05 imirkin: hehe
18:05 mwk: matter of fact, all pipeline units like to do that
18:05 imirkin: yeah, that doesn't work well with a randomized fuzzing approach
18:05 mwk: so I have to spend some time figuring out a "safe" state space
18:06 mooch: mwk, why did nt4 stop using the nv3 drivers when i implemented the RMA BAR
18:06 imirkin: right
18:06 mwk: NV20/NV30 are particularly fun, since I have to figure out a safe state for every pipeline unit to test any 3d method
18:06 mooch: it doesn't make sense
18:07 mwk: because they just send a state bundle down the pipeline, which touches *all* 3d units
18:07 mooch: lol
18:08 mwk: it's already an unholy mess to take care of, and I've only modelled like 5% of pipeline state so far
18:09 mwk: the rest is on the default state
18:10 mwk: mooch: well, do you have some sort of mmio trace of the conversation?
18:10 mooch: yeah gimme a sec
18:10 mwk: are you working on 86box or qemu now? I kind of lost track...
18:11 mwk: imirkin: and the worst part, as you've noticed, is that with all that mess, I still haven't drawn a single 3d primitive yet :(
18:11 imirkin: mwk: yes, i've noticed.
18:11 mooch2: mwk, 86box
18:12 imirkin: mwk: feels like you make more progress while doing the easy stuff i assume
18:12 mwk: the closest I've come is the T&L thing
18:12 mwk: imirkin: well, that has to be done first either way
18:13 mooch2: https://hastebin.com/mapoheluze.pas mwk
18:13 imirkin: mwk: sure, but you're just doing the easy stuff for each board
18:13 mwk: one major thing I have now is a map of NV10/NV20/NV30 pipeline state bundles
18:13 imirkin: before doing the difficult stuff for each board :)
18:14 mwk: doing anything about any pipeline unit without having a state map would be kind of impossible :p
18:14 mwk: imirkin: true...
18:15 mwk: but tbh, mapping NV40 state right now feels more interesting than getting NV4 to draw things in 3D
18:15 imirkin: hehe
18:15 mwk: also, the end is in sight
18:15 imirkin: more like on the horizon
18:15 imirkin: which is an imaginary line that moves further away as you attempt to approach it :)
18:16 mwk: once I map NV40 frontend state, the only thing left to do will be actually drawing shit
18:17 mwk: and it appears I can reuse lots of NV20 code here
18:33 mwk: mooch2: could you make it show the returned value on reads?
18:33 mwk: and.. the whole log seems to be from the vbios, does this mean the driver doesn't even start talking to the card?
18:42 karolherbst: imirkin: how well do you know the spilling code of codegen? There is some oddity going on where we have a mov with both values spilled to the same address and we could save the store and load instruction here
18:43 karolherbst: fixing this gives us around 40% more perf in the dolphin uber shaders
18:43 imirkin: yeah, i've tried to tackle that
18:43 karolherbst: I have a silly post RA pass to fix it up, but it's too messy and I kind of get the feeling, that the sources/def links are totally broken
18:44 karolherbst: maybe not totally, but not reliable
18:44 imirkin: karolherbst: https://github.com/imirkin/mesa/commit/e6a123f1f12659c402c3d8d913accad766ff0070
18:44 imirkin: take a look at the self_mov stuff in there
18:45 karolherbst: mhhh
18:45 karolherbst: looks like a lot of implicit things are going on here
18:46 imirkin: it also doesn't completely work
18:46 imirkin: i mean - there are cases where it explodes because a mov doesn't have any sources
18:46 karolherbst: :/
18:47 imirkin: because it's some dumb left-over thing in the middle of being deleted
18:47 karolherbst: maybe we should fix all those inconsistencies first?
18:47 imirkin: the spilling logic needs to be redone wholesale.
18:47 karolherbst: I had some ld/st pointing to non existing movs
18:47 imirkin: the problem is that every time i've done that, i've run into yet-another impossible-to-solve-using-current-approach situation
18:47 imirkin: so i'm running out of ideas.
18:47 karolherbst: I was more talking about keeping the def/src links correct at least
18:48 orbea: karolherbst: for ubershaders this dolphin PR gives a small perf boost including with nouveau. https://github.com/dolphin-emu/dolphin/pull/5880
18:48 imirkin: yeah, well the fact that RA merges stuff when there are merges
18:48 imirkin: makes it ... tricky.
18:48 karolherbst: orbea: obbb
18:49 karolherbst: orbea: this may solve a lot of issues already
18:49 karolherbst: imirkin: the shader looked fun. some BBs with like 40 phis, or a BB with 10 phis with ~8 sources each....
18:49 imirkin: yeah
18:50 imirkin: that's coz of the fact that the thing gets converted into a while loop
18:50 imirkin: with break's
18:50 imirkin: so all those break's point to the post-while BB
18:50 imirkin: hence 40 phi's
18:50 imirkin: er, phi's with 40 sources
18:50 karolherbst: here is my patch to at least try to fix it up Post RA, after that I prefer to fix RA instead: https://github.com/karolherbst/mesa/commit/ee62f048d5d1e42df664e86db84a7f9b6f57b5b8
18:50 karolherbst: yeah
18:50 imirkin: the ldst thing is eliminated using my self_mov thing
18:51 imirkin: i noticed it a *long* time ago
18:51 karolherbst: yeah, I was mainly doing this to get a number on performance
18:51 imirkin: it happens a lot with flags and regs too
18:51 karolherbst: ohh I see
18:51 imirkin: i mean flags being stored in regs, moved to flags, moved to regs, moved to flags
18:51 karolherbst: ouch
18:53 karolherbst: imirkin: doesn't self_mov need some kind of validation of the dst and src? or is it guaranteed, that this is only about those RA movs?
18:54 imirkin: ?
18:54 imirkin: this is for any mov
18:54 imirkin: but it doesn't matter
18:54 imirkin: this is about spilling for a mov that is in effect a mov to itself
18:54 imirkin: which is not required.
18:55 karolherbst: ohhh, mhh okay
18:55 karolherbst: but I don't see this part inside the function, that's why I am asking
18:56 imirkin: right
18:56 karolherbst: or doesn't it matter?
18:56 imirkin: well
18:56 imirkin: nodes get merged
18:56 imirkin: so like
18:56 imirkin: what happens is you have
18:56 imirkin: a = 1;
18:56 imirkin: b = 2;
18:56 imirkin: c = merge(a, b)
18:56 imirkin: now, the merge creates constraint movs
18:56 imirkin: so you end up with
18:56 imirkin: a = 1; b = 2;
18:57 imirkin: c(0) = a; c(1) = b;
18:57 imirkin: now if e.g. a gets spilled
18:57 imirkin: then when spilling a
18:57 imirkin: it'll do store(1); and then c(0) = load(a)
18:57 imirkin: which ... you don't want.
18:57 imirkin: anyways... if you're writing patches without understanding what's going on
18:58 imirkin: then you ... should stop writing patches. sit down and understand WHY it's happening.
18:58 karolherbst: yeah, I know
19:00 RSpliet: Love how academic resources are inconsistent on register bank organisation on NVIDIA GPUs
19:02 RSpliet: 2013 paper: Fermi and Kepler have a {0,1,2,3,4,...} -> {0,1,2,3,0,...} mapping from reg to bank. 2017 paper: GK110 has a {0,1,2,3,4,...} -> {0,1,0,1,2,...} mapping, "which is consistent with the distribution on GTX680 [reference to 2013 paper]"
19:05 RSpliet: Mmm, or the 2013 paper is just vague. Let's test this...
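The two register-to-bank mappings quoted above can be written down side by side. This is only a sketch: the first function is a plain reading of {0,1,2,3,0,...}, and the second assumes the truncated {0,1,0,1,2,...} pattern continues as 0,1,0,1,2,3,2,3 and repeats every 8 registers. Neither the function names nor that continuation come from the papers as quoted here.

```python
def bank_mod4(reg):
    # Reading of the 2013 description: {0,1,2,3,4,...} -> {0,1,2,3,0,...}
    return reg % 4


def bank_paired(reg):
    # Assumed continuation of the 2017 description: 0,1,0,1,2,3,2,3, repeating.
    return 2 * ((reg % 8) // 4) + (reg % 2)


def same_bank(bank, a, b):
    # Two registers conflict when the given mapping puts them in the same bank.
    return bank(a) == bank(b)


def differing_regs(limit):
    # Registers below limit on which the two mappings disagree.
    return [r for r in range(limit) if bank_mod4(r) != bank_paired(r)]


print([bank_mod4(r) for r in range(8)])
print([bank_paired(r) for r in range(8)])
print(differing_regs(8))
```

A test that separates the two readings only needs register pairs that conflict under one mapping and not the other: r0 and r4 share a bank under the mod-4 reading but not under the paired one, and r0 and r2 behave the opposite way.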
20:34 RSpliet: Cool...
20:35 RSpliet: karolherbst: using a greedy approach to avoid register bank conflicts in RA (following that 2017 paper) brings your saints_3 apitrace replay from ~8.2fps to ~9.3fps on my GT640
20:35 nyef`: What was that bit earlier about not having a GT21x to test with?
20:35 karolherbst: RSpliet: sounds like progress
20:36 RSpliet: karolherbst: I'd say. It's fairly trivial as well, so with a bit of refactoring I'd be able to bring it upstream. Might need a bit of help from imirkin with that though...
20:36 RSpliet: don't have a lot of 64-bit benchmarks unfortunately
20:36 karolherbst: RSpliet: well we might have a different layout for every chipset
20:36 karolherbst: so this has to be kept in mind already
20:37 RSpliet: karolherbst: academic sources indicate this bank layout is valid from Fermi to GK110
20:37 RSpliet: (including GK110)
20:37 karolherbst: so not every chipset
20:37 RSpliet: nobody did this reverse engineering work on Maxwell I guess
20:37 karolherbst: but it could be different
20:37 RSpliet: yes
20:37 karolherbst: it would be just nice if the approach includes some chipset abstraction already
20:38 karolherbst: so that it will be less pain later on to adjust it for a certain chipset
20:38 RSpliet: Also, ideally you'd use liveness analysis to estimate the number of registers to somewhat bound the GPR increase if any
20:38 RSpliet: but... I'm tempted to hard-code "28" as an estimate
20:38 karolherbst: mhh
20:39 karolherbst: maybe make it depend on the instruction count?
20:39 RSpliet: Why would that make sense?
20:40 karolherbst: RSpliet: I only think it might make more sense to do something better than just hardcode 28 in it
20:42 RSpliet: well, 28 isn't completely arbitrary of course. It's "less than 32 with a margin" to reduce the odds of accidentally tipping it over the "fewer parallel warps" boundary
20:42 RSpliet: Personally think that's better than assuming a correlation that I can't back up
20:45 imirkin: RSpliet: happy to provide assistance
20:47 karolherbst: RSpliet: the question is, what's the alternative to high GPR, sure we could schedule better, but then we reduce latency hiding if we decrease GPR count, but currently we do neither anyway
20:47 karolherbst: also, I am sure it's perfectly fine for some shaders to use 40 GPRs or even 48 or so
20:47 karolherbst: it's not like if we trip over 32 everything goes bad
20:50 RSpliet: karolherbst: the point is that, for instance, if I need to pick a GPR that conflicts with GPRs in bank 0, 1 and 3 and GPR 0-30 are taken, I'll end up picking GPR 32 (7 warps per SM) rather than 31 (8 warps per SM, but one extra conflict)
20:50 RSpliet: or actually, I'll end up picking GPR 36 in this example...
20:51 RSpliet: Yes, this is pathological, but there's a million situations in which this can happen and it seems like the worse choice
20:51 karolherbst: RSpliet: okay, I see, but when do those conflicts matter? I guess reading/writing only needs to be "enough away", so that we just ignore those conflicts?
20:52 RSpliet: actively and greedily avoiding bank conflicts gave me the 10% perf improvement in your trace
20:52 RSpliet: they matter
20:52 RSpliet: (despite a slightly higher GPR usage)
20:52 karolherbst: could you do it without having a threshold and check how much worse all the shaders go in average, which we have?
20:53 nyef`:remembers a discussion (elsewhere) about setting up the potential cost model for a chunk of code as an HMM-style system and then using Viterbi to do the actual allocation decisions.
20:53 RSpliet: karolherbst: sure, all possible. I don't have a large bunch of benchmarks though
20:53 nyef`: (Okay, the actual context was boxing vs. unboxed operation in a managed runtime.)
20:53 karolherbst: RSpliet: thing is, the higher the GPR count, the less it actually matters if we use more or less, maybe if we simply ignore the threshold for now, it might be fine already
20:54 karolherbst: RSpliet: test it with pixmark_piano
20:54 karolherbst: there you should have higher differences
20:54 karolherbst: hopefully
20:54 RSpliet: hah, yes well.. there is that
20:54 RSpliet: let me try it again
20:54 karolherbst: although it's just one shader with 40+ gprs
20:56 RSpliet: karolherbst: Pixmark piano regresses in perf... probably because I push it over a GPR cliff edge
20:56 RSpliet: (8->7fps... about 5-10ms per frame extra)
20:56 karolherbst: RSpliet: you should check the shader stats then
20:57 RSpliet: karolherbst: got a one-liner for me to poop them out?
20:57 karolherbst: shader-db
20:57 karolherbst: the shaders are in the repository
20:58 RSpliet: hmm, isn't there a simpler way to get mesa to output GPR per shader compile without recompiling with debugging?
20:58 RSpliet: I don't have shader-db ready... should set it up probably
20:58 karolherbst: set the debug context thingy
20:59 karolherbst: it doesn't have to be a debug build
21:06 RSpliet: karolherbst: "nouveau debug context thingy" returns no useful results on google
21:06 karolherbst: opengl
21:07 karolherbst: GL_KHR_debug
21:07 imirkin: RSpliet: did you actually search for that? =]
21:08 imirkin: you can use pipe_debug_message() to emit messages
21:08 imirkin: RSpliet: oh, and on every compilation, it outputs that
21:08 imirkin: so as long as you've hooked up a debug callback, you should get it
21:09 imirkin: http://docs.gl/gl4/glDebugMessageCallback
21:09 imirkin: in combination with some others
21:09 RSpliet: imirkin: I don't think I have any of that... and MESA_DEBUG=context seems to be introduced in Mesa 13.1
21:09 RSpliet: which... means I have another reason to rebase
21:09 karolherbst: another would be: rebasing gets harder over time
21:11 RSpliet: karolherbst: yes. More compellingly: without a rebase there's no way this is going upstream cleanly
21:11 imirkin: RSpliet: wtf is your tree based on?!
21:11 RSpliet: old 13.1
21:11 imirkin: ah yes, as opposed to new 13.1 ;)
21:12 RSpliet: pretty sure it's pre-release
21:12 imirkin: i guess that's the release in second half of 2016... not too much has changed, nouveau compiler-wise
21:17 RSpliet: figures... git am gave zero clucks and just did what it was supposed to do
21:19 imirkin: too bad you didn't get the joy of rebasing
21:21 RSpliet: I still have visions of carrying PM work over one of Ben's kernel rewrites...
21:33 RSpliet: karolherbst: hah interesting... for pixmark_piano GPR usage doesn't rise (49 in both cases), nor does the number of instructions or anything
21:34 karolherbst: RSpliet: interesting
21:34 RSpliet: it remains a benchmark (with a ridiculous shader) rather than a true workload so not entirely sure whether I should care
21:37 karolherbst: well it benchmarks the computing power
21:40 RSpliet: I guess I can be cleverer about this. Instead of "either find me a preferred bank or any will do" it should be "get me a reg in the bank with minimal conflicts - ideally 0"
21:42 karolherbst: yeah
21:42 karolherbst: sounds like a better plan to me
21:43 RSpliet: more work for the compiler... but meh, banks is a constant, so complexity doesn't grow :-P
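The "reg in the bank with minimal conflicts" idea can be sketched as a toy selector. Everything here is hypothetical and not Mesa code: `pick_register`, its parameters, the `soft_limit` cap (standing in for the kind of bound discussed earlier) and the assumed bank mapping 0,1,0,1,2,3,2,3 are all illustrative.

```python
NUM_BANKS = 4


def bank_of(reg):
    # Assumed mapping: 0,1,0,1,2,3,2,3, repeating every 8 registers.
    return 2 * ((reg % 8) // 4) + (reg % 2)


def pick_register(taken, operand_banks, soft_limit=None, max_regs=64):
    """Return the free register whose bank has the fewest conflicts.

    operand_banks lists the banks of the other operands of the instruction.
    If soft_limit is set and a free register exists below it, only registers
    below the limit are considered, so the GPR count does not grow past it.
    """
    free = [r for r in range(max_regs) if r not in taken]
    if soft_limit is not None:
        below = [r for r in free if r < soft_limit]
        if below:
            free = below
    if not free:
        raise ValueError("no free register")
    # Count how many other operands already sit in each bank.
    conflicts = [0] * NUM_BANKS
    for b in operand_banks:
        conflicts[b] += 1
    # Fewest conflicts first, lowest register number as the tie-break.
    return min(free, key=lambda r: (conflicts[bank_of(r)], r))


taken = set(range(31))                       # GPR 0-30 are in use
print(pick_register(taken, [0, 1, 3]))                 # uncapped choice
print(pick_register(taken, [0, 1, 3], soft_limit=32))  # capped choice
```

With GPR 0-30 taken and operands in banks 0, 1 and 3, the uncapped selector under this assumed mapping goes all the way to GPR 36 for a conflict-free bank, while the capped one settles for GPR 31 and accepts a single conflict. Since the number of banks is constant, the extra bookkeeping per pick stays constant as well.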
21:43 karolherbst: bonus question: does dual issue work with bank conflicts?
21:44 RSpliet: it does, but is ideally taken into account
21:44 RSpliet: you could be dual-issuing two instructions with more than 4 SGPRs between them, so conflicts aren't 100% avoidable
21:45 karolherbst: yeah, just wondering how much it affects it
21:45 karolherbst: but I think the worse decision would be to not dual issue
21:45 RSpliet: defo
21:45 RSpliet: it's only one extra cycle for the conflict. issue delays tend to be greater than 3
21:46 RSpliet: but this might be maskable and blahblah it's never trivial is it?
21:46 karolherbst: sometimes it makes more sense to dual issue the next pairs if the third instruction can't be dual issued with the 4th one
21:47 karolherbst: should be fairly easy to test out later
21:47 RSpliet: most optimisations are heuristics anyway... can't always get everything 100% right