03:35 RSpliet: imirkin_: good stuff on mesa git... does any of it notably impact performance?
03:42 mupuf_: RSpliet: I need to setup ezbench on reator!
03:42 mupuf_: it is in a state where I could actually track this kind of things :)
03:42 RSpliet: oh that'd be great stuff
03:42 mupuf_: right
03:43 mupuf_: the auto bisect is not yet here though, I want to understand better how to say with a certain confidence that there is an improvement or regression
03:44 mupuf_: and that is ... messy when the probability density is not at all Gaussian
03:45 mupuf_: depending on the benchmark, if we are cpu or gpu limited, the density looks very different
03:45 mupuf_: and we cannot spend an entire night running the same benchmark over and over again to get any statistical significance unless this is absolutely necessary
03:46 RSpliet: well, but that is future work
03:46 mupuf_: that is WIP work
03:46 mupuf_: every night I collect more data
03:46 RSpliet: yeah exactly
03:46 mupuf_: weekends are the best, running the same benchmark 2k times :D
03:47 RSpliet: and that corresponds to no more than 50 mesa commits usually
03:47 mupuf_: oh, you misunderstood
03:47 mupuf_: I collect information about the variance of one benchmark, with a different bottleneck
03:47 RSpliet: understood is my middle name
03:47 RSpliet: Miss. Understood
03:47 mupuf_: lol
03:47 mupuf_: good one!
03:48 RSpliet: variance of one benchmark, with a different bottleneck
03:48 RSpliet: eg. you artificially introduce bottlenecks
03:48 RSpliet: and then run the same benchmark?
03:48 mupuf_: yes
03:48 RSpliet: oh
03:48 RSpliet: :-P
03:48 mupuf_: being cpu-limited yields funky results
03:49 mupuf_: but being gpu-limited does not yield a single Dirac peak either
03:49 mupuf_: but you usually get the same result at +/- 1%
03:50 mupuf_: with sometimes runs at +/- 5%
03:50 mupuf_: and those group nicely too around a 2% variance
03:51 RSpliet: in other words, a nice normal distribution?
03:51 mupuf_: not sure if it is a platform issue, a general issue, etc..
03:51 mupuf_: no, I get multiple normal distributions
03:51 mupuf_: and they are a bit twisted distributions, but hey, it does not matter
03:51 RSpliet: that's not normal :-P
03:51 mupuf_: hehe
03:51 mupuf_: no shit, sherlock :D
03:52 RSpliet: poop, watson
03:52 mupuf_: ah ah
03:53 mupuf_: so yeah, the auto bisect requires this modelling work to be able to know how many times we need to run
03:53 mupuf_: I am OK with running a bit more than the strict minimum
03:53 mupuf_: but I want to make sure it applies to multiple cases
03:54 mupuf_: cpu-limited cases are horrible, you get a variance of +/-10%
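The "how many runs before we can call it a regression" problem discussed above can be sketched very roughly as a two-sample comparison. This is an editorial illustration, not ezbench's actual model (which, as noted, has to cope with multimodal distributions where a plain mean/stddev test misleads):

```python
import statistics

def significant_change(old_runs, new_runs, z=2.0):
    """Crude check: are the z-sigma standard-error bands around the
    two means disjoint? Only meaningful when each sample is roughly
    unimodal; the multimodal densities described above need real
    modelling, which this sketch does not attempt."""
    m_old, m_new = statistics.mean(old_runs), statistics.mean(new_runs)
    se_old = statistics.stdev(old_runs) / len(old_runs) ** 0.5
    se_new = statistics.stdev(new_runs) / len(new_runs) ** 0.5
    return abs(m_old - m_new) > z * (se_old + se_new)
```

With a +/-1% gpu-limited variance a few runs suffice; with the +/-10% cpu-limited variance the standard errors stay wide and many more runs are needed before this returns True.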
03:54 RSpliet: how do you limit your CPU?
03:54 mupuf_: needless to say that trying to bisect any perf change is .... time consuming
03:55 RSpliet: force freq down to 66 MHz? :-P
03:55 mupuf_: disable turbo, set the min and max freq of the governor to a single frequency ... of your choosing
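The "disable turbo, pin min and max to one frequency" recipe maps onto the standard cpufreq/intel_pstate sysfs knobs. A sketch that builds the list of writes (separated from the actual file I/O so it can be inspected; paths follow the usual Linux sysfs layout, adjust for your platform, and the writes need root):

```python
def freq_pin_plan(khz, cpufreq_dirs):
    """Return (path, value) pairs that pin every CPU to `khz`.
    cpufreq_dirs: e.g. glob('/sys/devices/system/cpu/cpu[0-9]*/cpufreq').
    The intel_pstate no_turbo toggle is global; on other drivers that
    file does not exist and min == max alone does the job."""
    writes = [("/sys/devices/system/cpu/intel_pstate/no_turbo", "1")]
    for d in cpufreq_dirs:
        writes.append((d + "/scaling_min_freq", str(khz)))
        writes.append((d + "/scaling_max_freq", str(khz)))
    return writes
```

Applying the plan is then a loop of `open(path, "w").write(value)` as root, ignoring the no_turbo entry on non-intel_pstate systems.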
03:55 mupuf_: ah ah
03:55 RSpliet: ah right
03:55 RSpliet: does that decrease the cache frequency as well? memory bus is not limited, is it?
03:55 mupuf_: the PUNIT of the intel cpu MAY throttle you, but the kernel is supposed to warn you about that
03:56 mupuf_: I guess it depends on the platform or the PUNIT
03:56 mupuf_: the kernel has no control on this
03:56 RSpliet: you should know :-P
03:56 mupuf_: I need perf counters for that
03:56 karlmag: a question since I'm curious; is it possible (in general) to run auto bisects in parallel? (If you happen to have multiple computers?)
03:56 mupuf_: works on the gpu side, the PUNIT is ... somewhere :D
03:56 mupuf_: karlmag: no
03:56 RSpliet: I take it DDR is not clocked back for the CPU
03:57 RSpliet: so you get an unbalanced device, slow core with megafast memory access times :-P
03:57 mupuf_: well, the gpu is also using it
03:57 karlmag: mupuf_: kind of a pity, since it probably could speed up finding errors in some cases.
03:57 mupuf_: see, lovely time!
03:57 mupuf_: karlmag: don't worry about that
03:57 mupuf_: bisecting is usually not too long
03:57 mupuf_: and having auto-triggering auto-bisect is what you need
03:58 mupuf_: and it will do the perf analysis and the bisecting of issues during the night anyway :p
03:58 karlmag: hehe.. ok.. Well, a thought experiment anyway. But it would be kind of fun if it was possible.
03:58 mupuf_: bisecting is a dichotomy, so not sure how you could do that faster with multiple machines
03:59 mupuf_: unless you just try to predict
03:59 mupuf_: or gamble
03:59 mupuf_: which may work
03:59 karolherbst: mupuf_: my glxspheres64 fps founter is really funky
03:59 mupuf_: but you will not halve your bisecting time though
03:59 mupuf_: founter?
03:59 karolherbst: counter
03:59 mupuf_: brb, meeting time
03:59 mupuf_: what counter?
03:59 karlmag: mupuf_: well, you could set multiple starting points and se (across the cluster) where the errors still appear.
03:59 karolherbst: between 200 and 163 fps
04:00 karolherbst: in a single run
04:00 mupuf_: karlmag: that is clearly suboptimal :p
04:00 mupuf_: ah, intra-run variance
04:00 mupuf_: you are cpu-limited?
04:00 karolherbst: yes
04:00 karolherbst: no
04:00 mupuf_: are you sure?
04:00 karolherbst: 60% cpu load
04:00 karlmag: mupuf_: in real cases, very possibly
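The parallel-bisect exchange above has a clean closed form: with m machines each testing one of m evenly spaced commits per round, every round narrows the suspect range to 1/(m+1) of its size, so the round count drops from log2(n) to log base (m+1) of n. An editorial sketch of the arithmetic (not part of ezbench):

```python
import math

def bisect_rounds(n_commits, machines=1):
    """Rounds of testing needed to isolate one bad commit out of
    n_commits when `machines` midpoints are tested concurrently per
    round. machines=1 is plain git-bisect dichotomy."""
    return math.ceil(math.log(n_commits, machines + 1))
```

This backs up the remark that you will not halve your bisecting time: going from one machine to three only takes 1000 commits from 10 rounds down to 5, and each extra machine helps less than the last.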
04:00 karolherbst: pcie bottleneck
04:00 mupuf_: well, it seems like turbo issues
04:00 karolherbst: you know
04:00 mupuf_: have to go
04:01 karolherbst: k
04:01 mupuf_: maybe you lack vram
04:01 mupuf_: cannot talk about nouveau during the day, sorry :p
04:01 mupuf_: ezbench is fine though
04:06 karolherbst: this is ezbench related
04:06 karolherbst: :p
04:07 karolherbst: measuring fps variance is very important, because more stable fps => good
04:07 karolherbst: should be benchmarked as well
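The intra-run stability karolherbst wants benchmarked is often summarized as the coefficient of variation of per-second fps samples. A minimal editorial sketch (not an ezbench metric):

```python
import statistics

def fps_stability(samples):
    """Coefficient of variation of fps samples: stddev / mean.
    Lower is steadier; the 163-200 fps swings reported above within
    a single run would score noticeably worse than a flat trace."""
    return statistics.stdev(samples) / statistics.mean(samples)
```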
04:29 mupuf_: karolherbst: right
04:30 mupuf_: karolherbst: I hooked the environment dumping to ezbench by the way
05:02 karolherbst: nice :)
05:03 karolherbst: mupuf_: I am thinking about sending an email to nvidia regarding those pdaemon counters, like which bit is available where and what they actually do, somehow I doubt though that I'll get any meaningful answer :/
05:03 karolherbst: ohh right, not ezbench related, sorry for that :D
05:05 mupuf_: karolherbst: you will likely not get any answer, yes
05:09 RSpliet: karolherbst: https://github.com/kfractal/nouveau/blob/26e6e939825034dd9a735249b4d41347eb634c4a/drm/nouveau/include/nvkm/hwref/gf100/nv_pwr_pri_hwref.h ?
05:13 RSpliet: idk, there might be other relevant headers in that repository
05:13 mupuf_: karolherbst: as for the bits available, they almost never changed, as far as I can tell
05:13 RSpliet: that you could consider digging through before sending out an e-mail
05:13 mupuf_: and we can check it out easily
05:13 mupuf_: RSpliet: I am sure he wants to know the semantics behind the signals
05:15 karolherbst: well I wrote this: https://gist.github.com/karolherbst/1dcb58f2a45b34eed529
05:15 karolherbst: maxwell is a bit strange
05:15 karolherbst: FB_PART0_REQ and BFB_PART0_REQ
05:16 karolherbst: but these are the ones used by the blob :/
05:16 karolherbst: I am more interested in the other ones :D
05:18 karolherbst: the NV_PPWR_PMU_IDLE_MASK_GR 0:0 thingy is interesting though
05:18 karolherbst: I already assumed that this bit toggles all GR related stuff
05:19 karolherbst: but I want to know all the others :D
08:11 imirkin: RSpliet: unlikely to have any real perf impact. might cause less stuttering in some situations.
13:36 joujoumen: i don't understand how a source register immediate float can be stored in 6 bits as the docs say? is it an address that stores the value, i.e. loaded from a memory address by the instruction decoder?
13:38 imirkin_: can you elaborate what you're talking about?
13:38 imirkin_: in many cases, instructions store "short" immediates
13:39 imirkin_: although i don't remember any situations where only 6 bits of an immediate are stored...
13:39 joujoumen: https://code.google.com/p/asfermi/wiki/OpcodeInteger#IMAD you see reg1 and reg0 and reg2 are all 6bits
13:40 imirkin_: yeah, coz there are only 6 bits worth of registers.
13:40 imirkin_: aka 64.
13:41 joujoumen: but how does the hw get a value that could be some sort of float with 18 digits, where is that stored?
13:41 joujoumen: obviously 32bit value would not fit into 6bits
13:45 joujoumen: this is over my head, i really can't understand 6 bits worth of registers aka 64?
13:51 joujoumen: ouh, some patents say that register file is connected to data memory
14:08 joujoumen: imirkin_: can i somehow get the whole virtual memory address ranges of register file?
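What imirkin_ is pointing at above: the 6-bit fields in the asfermi IMAD encoding are register *indices* (2**6 = 64 addressable registers), not values. A 32-bit float never lives in those bits; it comes from a wider immediate form or from constant memory, and the 6-bit field just names which register holds an operand. An editorial decode sketch (the bit positions here are purely illustrative, not the real Fermi layout):

```python
def decode_reg_fields(insn_word):
    """Pull three hypothetical 6-bit register-index fields out of an
    instruction word. Each field can only say 'register 0..63'; the
    32-bit data itself lives in the register file, not the opcode."""
    reg0 = (insn_word >> 14) & 0x3F  # illustrative field positions
    reg1 = (insn_word >> 20) & 0x3F
    reg2 = (insn_word >> 26) & 0x3F
    return reg0, reg1, reg2
```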
15:21 imirkin_: mwk: do you know what "prmt" does?
15:21 imirkin_: 00000020: 081c0001 b600042a prmt b32 $r0 $r0 $r1 0x5410
15:21 imirkin_: this appears to do $r0 = $r0 | $r1 << 16
15:23 mwk: ISTR that was covered by PTX documentation
15:23 hakzsam: http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prmt
15:24 imirkin_: ah yes, thanks
15:34 mwk: imirkin_: although I think the hw instruction is slightly different from ptx
15:34 mwk: something like argument order
15:35 imirkin_: i don't see a serious use-case for it in my given scenario
15:35 imirkin_: but i could imagine it being useful in some scenarios
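Per the PTX documentation linked above, prmt treats the two source registers as an 8-byte pool and the control word as four 4-bit byte selectors. A Python model of the default mode (editorial sketch; as mwk notes, the hw instruction may differ slightly from PTX, e.g. in argument order):

```python
def prmt(a, b, sel):
    """PTX-style prmt.b32, default mode: bytes 0-3 of the pool come
    from a, bytes 4-7 from b. Each result byte i is chosen by the
    i-th nibble of sel (low 3 bits: pool index; bit 3: replicate the
    selected byte's sign bit across the whole byte)."""
    pool = [(a >> (8 * i)) & 0xFF for i in range(4)] + \
           [(b >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        s = (sel >> (4 * i)) & 0xF
        byte = pool[s & 7]
        if s & 8:  # sign-replicate mode for this nibble
            byte = 0xFF if byte & 0x80 else 0x00
        out |= byte << (8 * i)
    return out
```

Selector 0x5410 picks pool bytes 0, 1, 4, 5, i.e. the low halves of both sources, which is exactly the `$r0 | $r1 << 16` behavior imirkin_ observed.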
17:32 imirkin: gryffus: Option "DRI" "3", but you need a ddx from git
17:34 gryffus: imirkin: ok, thank you :)
20:26 imirkin: alrighty... we should be able to do shader-db stuff with nouveau now
20:27 imirkin: for both nv50 and nvc0, shaders should end up getting compiled at glLinkProgram() time