00:13agrecascino: oh jesus christ i fucked up
00:13agrecascino: as soon as nouveau loads
00:13agrecascino: everything goes black
00:13agrecascino: what did i do wrong
00:14imirkin_: agrecascino: mmmm... potentially nothing
00:14imirkin_: agrecascino: how are your screens connected? which gpu do you have? can you grab dmesg after this (e.g. by ssh'ing in)?
00:14agrecascino: the only reason i'm here is because i logged in blind
00:14agrecascino: and X used vesa apparently
00:15imirkin_: errrr.... that's weird
00:15imirkin_: dmesg + xorg log would be great
00:17agrecascino: it seems to unload nouveau at one point
00:18imirkin_: it looks like nouveau gets very upset somewhere in the disp logic
00:18imirkin_: which is what causes the screens to go black
00:18imirkin_: however X manages to get it out of its funk?
00:18agrecascino: what i learned:
00:18imirkin_: either way, you have an ancient version of X
00:18agrecascino: bow down to the proprietary overlords
00:18agrecascino: learn how to login blind
00:19imirkin_: with a definitely not-new-enough modesetting + glamor config
00:19imirkin_: er, s/config/logic
00:19imirkin_: the actual moral of the story is "buy amd"
00:20agrecascino: AMD's MXM cards haven't been updated in years, last time i checked
00:20imirkin_: no clue. i do know that amd supports open-source efforts while nvidia does not.
00:20agrecascino: i would be using an amd machine
00:20agrecascino: however, the power supply to go in that machine is too wide for the case
00:21agrecascino: so rip
00:22agrecascino: also yeah
00:22agrecascino: i'm right
00:22agrecascino: the latest MXM cards on AMDs side are all based on 7XXX cards
00:22agrecascino: that's pretty depressing
00:23agrecascino: is nouveau supposed to break text consoles though
00:23agrecascino: i mean
00:24agrecascino: if it fails loading horribly
00:24agrecascino: or 2d acceleration is WIP
00:24agrecascino: i mean that's overkill for failing horribly
00:24imirkin_: text console works... just the display is off
00:24imirkin_: which is of little consolation to you :)
00:24agrecascino: i really love not seeing anything
00:25agrecascino: gives me tons of assurance that everything is ok
00:25imirkin_: either way, no, that's not supposed to happen - getting this stuff right is hard =/
00:26agrecascino: especially when the manufacturer of said product is trying to make it harder to do it right
00:26imirkin_: basically the hw is unhappy with how we're driving it... not a whole lot the driver can do.
00:26imirkin_: it's not some adaptable intelligent actor
00:28agrecascino: any possible workarounds?
00:28imirkin_: provide a detailed bug report and hope ben can figure out what's wrong
00:29imirkin_: (e.g. one that includes your full dmesg, not just a small extract)
00:29agrecascino: where to report it?
00:29imirkin_: bugs.freedesktop.org xorg -> Driver/nouveau
00:30skeggsb: evo is pissed because we're doing a modeset for heads 2/3 (with sors 0/1), and leaving heads 0/1 displaying but without an OR attached
00:31skeggsb: nfi how that'd actually happen, but that's what the state says
00:31imirkin_: skeggsb: note that he has a super-old Xorg + modesetting version
00:33agrecascino: :\ I think I'm going to bow down to NVIDIA.
00:44imirkin_: skeggsb: any prospect of integrating (some of?) karolherbst's work in 4.7?
00:44karolherbst: imirkin_: he told me he is short on time ;)
00:45karolherbst: imirkin_: also for 4.7 it is a bit too late now
00:45imirkin_: meh. he puts the pull request together at the last second usually :)
00:45imirkin_: and it's not like your patches are new
00:47karolherbst: I think there are still some design issues or something
00:47karolherbst: anyway, I would be surprised if ben has nothing to complain about ;)
00:47imirkin_: he _is_ a complainer, that ben :p
00:50airlied: skeggsb: the last second is approaching for -next :)
01:00agrecascino: off topic question but
01:00agrecascino: my cursor is invisible, how do i fix this
01:02karolherbst: agrecascino: usually a software bug. How did it disappear?
01:02agrecascino: it was never there
01:02karolherbst: agrecascino: and usually logout/login fixes it
01:02karolherbst: agrecascino: using ubuntu?
01:03agrecascino: no, i just can't see my console
01:03karolherbst: on an ubuntu machine I found the bug that when at login time there is no cursor, after login you still have none
01:03agrecascino: i'll try it
01:05karolherbst: imirkin_: now that I enabled SSO in like every shader which enables it, most of those "WARNING: value %92 not uniquely defined" warnings are gone and I already thought something went wrong :D
01:05agrecascino: didn't fix
01:05agrecascino: but i got an odd visual
01:07karolherbst: yeah well
01:07karolherbst: "2149818 -> 2149790 (-0.00%)" :/
01:07karolherbst: imirkin_: seems like all the effort is indeed for nothing (for now)
01:08imirkin_: fwiw it did seem quite odd
01:08agrecascino: let me describe the visuals
01:08agrecascino: i guess
01:08agrecascino: i logged out
01:08karolherbst: imirkin_: right
01:08agrecascino: x stayed on screen, even though it was not running
01:08agrecascino: and when i started up x again
01:09agrecascino: the screen got filled with garbage for a few seconds
01:09agrecascino: before coming to xfce
01:09karolherbst: imirkin_: at least my PostRADCE pass finds useful stuff now: helped inst ../nvidia_shaderdb/tomb_raider/8430.shader_test - 1 705 -> 651
01:09imirkin_: figure out why :p
01:10karolherbst: ohh maybe it is fine
01:17imirkin_: karolherbst: you kinda need 20,21 - they're used in 24
01:17imirkin_: your post-ra dce pass needs some help
01:17karolherbst: i thought so
01:18karolherbst: i guess something is missing here
01:18imirkin_: i guess the refcounts are messed up post-ra? dunno.
01:19imirkin_: sorry =/
01:19karolherbst: let me check
01:19imirkin_: (it kinda makes sense)
01:20imirkin_: there used to be a merge constraint
01:20imirkin_: which was being ref'd
01:20imirkin_: which tells the RA to give things the same values
01:20imirkin_: but i guess we don't up the ref count
01:24karolherbst: seems like it
01:34karolherbst: imirkin_: but depending on the shaders linked together, my pass could still optimize some branches away? Or wouldn't that be possible with the current code?
01:35imirkin_: anything's possible
01:35imirkin_: some things are just unlikely.
01:36karolherbst: at least a valley shader benefits a lot
01:36imirkin_: well, the current PostRADCE is a non-starter
01:36imirkin_: it messes up shaders
01:36imirkin_: coz it relies on refcount
01:37imirkin_: which is not accurate post-ra
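A minimal Python sketch of why a refcount-based DCE can misfire after register allocation, under one plausible reading of imirkin_'s explanation: per-value use counts go stale once several values share one wide register. The `Inst` class, the register numbers and both helper functions are hypothetical and only model straight-line code; this is not the nouveau codegen.

```python
from dataclasses import dataclass

@dataclass
class Inst:
    name: str
    dst: tuple         # (first_reg, reg_count), or None for instructions with side effects only
    srcs: list         # list of (first_reg, reg_count)
    refcount: int = 0  # hypothetical stale per-value use count carried over from before RA

def regs(r):
    """Expand (first_reg, reg_count) into the set of physical registers it covers."""
    first, count = r
    return set(range(first, first + count))

def dce_by_refcount(insts):
    """Drop every defining instruction whose recorded refcount is zero."""
    return [i for i in insts if i.dst is None or i.refcount > 0]

def dce_by_liveness(insts):
    """Drop a definition only if none of its registers is read later (straight-line code)."""
    live, kept = set(), []
    for inst in reversed(insts):
        if inst.dst is None or regs(inst.dst) & live:
            kept.append(inst)
            if inst.dst is not None:
                live -= regs(inst.dst)
            for src in inst.srcs:
                live |= regs(src)
    kept.reverse()
    return kept

# r20 and r21 are only read through the wide register r20..r23, so their
# own (stale) refcounts are zero even though the export needs them.
program = [
    Inst("mov r5", (5, 1), []),              # genuinely dead
    Inst("mov r20", (20, 1), []),
    Inst("mov r21", (21, 1), []),
    Inst("export r20q", None, [(20, 4)]),
]

by_refcount = [i.name for i in dce_by_refcount(program)]
by_liveness = [i.name for i in dce_by_liveness(program)]
```

In this toy case the refcount version deletes the definitions that the wide export still reads, while the register-based liveness walk keeps them and removes only the dead `mov r5`.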
01:37karolherbst: only 5 shaders are changing anyway
01:37karolherbst: stupid shader-db
01:37karolherbst: why wasn't that sso stuff merged :D
04:15Tom^: what happens when you run out of vram, is there some sort of oomkiller? or does everything simply crash and burn
04:15imirkin: everything simply crashes and burns
04:16imirkin: the pushbuf submit fails, and we don't handle pushbuf submit failures in mesa
04:17imirkin: which means we end up starting to submit commands that try to read/write to unmapped addresses
04:17imirkin: it's a good system :)
12:57karolherbst: hakzsam: well, I think those ipc metrics are kind of important. In saints row I get usually around 0.5 IPC :/
12:57karolherbst: that sounds low
12:57karolherbst: especially because in pixmark_piano I get above 4
13:32RSpliet: karolherbst: artificial benchmarks usually obtain a much higher IPC
13:33mupuf: RSpliet: you mean ALU-bound benchmarks
13:33mupuf: there are no artificial benchmarks :D
13:33RSpliet: all of them are artificial ;-)
13:33mupuf: that's another way of looking at it :D
13:34mupuf: karolherbst: you get a poor utilisation because ... you know ... memory accesses?
13:35RSpliet: it's a very useful metric
13:35RSpliet: not to compare two applications, but to measure the improvement of your optimisations
13:35mupuf: never doubted that
13:39karolherbst: RSpliet: right, but having 0.5 seems low
13:40karolherbst: RSpliet: especially because in other scenes in the same game I get much higher (usually due to reduced scene complexity)
13:41karolherbst: low IPC usually comes from low dual issuing or stalls in the shader?
13:42mupuf: karolherbst: or spilling
13:42mupuf: or poor instruction scheduling to hide the memory latency
13:43karolherbst: yeah, with the latter I also meant that
13:43karolherbst: in fact, if something has to wait
13:45mupuf: karolherbst: would be good for you to find out which shader is responsible for the biggest perf drop and optimize it :)
13:45mupuf: either you can check the exec time of each draw call
13:45karolherbst: now that I have patched SSO support into shader-db
13:45mupuf: and then map that to the shaders used based on the program id
13:45karolherbst: that should be quite possible
13:46mupuf: what you need is a way to replace shaders :D
13:46karolherbst: yeah, shader-db doesn't support sso
13:46mupuf: separate shader objects
13:46mupuf: either replace shaders or use gl to monitor the length of each drawcall
13:47mupuf: length == exec time
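The mapping mupuf describes (per-draw-call exec time, grouped by program id) could be sketched in Python as below. The record layout and all numbers are made up for illustration; this does not parse real apitrace output.

```python
from collections import defaultdict

def gpu_time_per_program(draw_calls):
    """Sum the GPU time of all draw calls per program id, most expensive first."""
    totals = defaultdict(int)
    for call in draw_calls:
        totals[call["program"]] += call["gpu_ns"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-draw-call records: call number, bound program id, GPU time in ns.
draw_calls = [
    {"call": 101, "program": 7, "gpu_ns": 120_000},
    {"call": 102, "program": 3, "gpu_ns": 900_000},
    {"call": 103, "program": 7, "gpu_ns": 150_000},
    {"call": 104, "program": 3, "gpu_ns": 850_000},
    {"call": 105, "program": 9, "gpu_ns": 40_000},
]

ranking = gpu_time_per_program(draw_calls)
```

The first entry of `ranking` is the program whose shaders are worth looking at first.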
13:47karolherbst: apitrace can do that
13:47mupuf: yep, make sure you are not cpu-limited
13:47karolherbst: I just create a trace and turn on gpu profiling on retrace
13:47karolherbst: that's my smallest concern
13:47mupuf: yop, and you can dump perf counters too
13:47karolherbst: anyway, nouveau does busy waiting on fences
13:47karolherbst: so cpu usage is often pretty high
13:48mupuf: sure, but apitrace can be the bottleneck
13:48karolherbst: not with those traces
13:48mupuf: lucky you then
13:48karolherbst: k. then I will trace SR3 because perf sucks
13:48mupuf: trtt and I will be working again on apitrace and perf counters this summer
13:49karolherbst: yay, I have a 6.6GB sr3 trace already
13:50karolherbst: mupuf: and if I am CPU limited, I can simply downclock my GPU
13:50mupuf: hmm, make sure IO is not a bottleneck too :p
13:50karolherbst: mupuf: 16gb ram
13:50karolherbst: file cache
13:50karolherbst: usually my cache is around 11GB
13:51karolherbst: 70% cpu usage, sounds fine
13:52karolherbst: mupuf: "pixels drawn profiling" does this make sense?
13:52karolherbst: pixels drawn per draw call or something
13:54karolherbst: mupuf: what is odd, is that while I retrace it, I get hardly above 24W
13:54karolherbst: and I idle at 19W on 0f
14:00karolherbst: gpu core load is around 50%
14:01karolherbst: memory load below 5%
14:01karolherbst: it's like the CPU and the GPU are both super bored
14:03karolherbst: maybe if the IPC drops below 100% the pmu counter also drops below it?
14:10mupuf: karolherbst: force the CPU speed
14:11karolherbst: mupuf: what do you mean by that?
14:11mupuf: force the cpu frequency
14:11karolherbst: you mean besides intel_pstate performance governor?
14:11karolherbst: so I should disable sleep state
14:12mupuf: nah, just force the frequency to 100%
14:12karolherbst: yeah, right, and how should I do that otherwise? Because I don't drop below 2.4GHz
14:12mupuf: ok, then it is not the issue
14:13karolherbst: just the effective freqs drop due to sleeping
14:13karolherbst: "Avg_MHz" in the "good" turbostat tool
14:13mupuf: /sys/devices/system/cpu/intel_pstate <-- that is where you can force frequencies
14:14karolherbst: mupuf: well, those things don't do anything for me
14:14karolherbst: maybe due to the performance governor
14:14karolherbst: performance governor means max perf, always
14:14mupuf: no, performance also downclocks
14:15mupuf: but performance is sometimes slower than ondemand, go figure
14:15karolherbst: only between base and max boost
14:15mupuf: anyway, this is not your issue
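For reference, the intel_pstate limits mupuf points at can be inspected with a few lines of Python. This is a read-only sketch: `min_perf_pct`, `max_perf_pct` and `no_turbo` are the attributes the intel_pstate driver normally exposes in that directory, and the helper name is made up.

```python
from pathlib import Path

PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")

def read_pstate_limits(base=PSTATE_DIR):
    """Return the intel_pstate limits that are present under base, as integers."""
    limits = {}
    for name in ("min_perf_pct", "max_perf_pct", "no_turbo"):
        attr = Path(base) / name
        if attr.is_file():
            limits[name] = int(attr.read_text().strip())
    return limits

# Empty dict on machines without the intel_pstate driver.
current_limits = read_pstate_limits()
```

Raising `min_perf_pct` to 100 (as root) is the usual way to pin the frequency range to the top, which is what "force the frequency to 100%" amounts to here.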
14:16karolherbst: mupuf: I can simply get the shaders from qapitrace with the call id and write my own shader_test file right?
14:16mupuf: good luck..
14:17mupuf: you may try juha-pekka's c-file writer for apitrace
14:18karolherbst: why? well I get the shaders
14:18karolherbst: or do I need something else?
14:18mupuf: because running shaders without data is just ... wrong?
14:20karolherbst: yeah, but what kind of information does this give me? My plan was to look at the most expensive draw call and see what the shader looks like
14:26karolherbst: ohhh wait
14:26karolherbst: I saw this once
14:27karolherbst: something is odd with the sched opcode
14:28karolherbst: hakzsam: any reasons why inst_issued2 should give me a value not 0, when I return false in canDualIssue?
14:29karolherbst: mupuf: dual issuing more than the hardware can handle results in a _big_ perf penalty
14:30karolherbst: and with big I mean like 90% perf drops
14:30mupuf: this makes absolutely no sense
14:30mupuf: but I cannot work with you now
14:30mupuf: still at work
14:31karolherbst: this has to be it
14:31karolherbst: the situation: I disabled canDualIssue by returning false, always.
14:31karolherbst: inst_issued is usually around 60M in the trace
14:32karolherbst: but in the frames where the perf is really really bad, inst_issued2 is unusually high
14:32karolherbst: although I explicitly disabled dual issuing in the mesa code
14:33karolherbst: maybe there is a max value for the opcode and by overflowing we dual issue?
14:44karolherbst: fun, I set all sched codes to 0x20 and get dual issuing although we thought 0x04 is the dual issue code
14:46karolherbst: I am wondering now.... could we use the inst_issued1 and inst_issued2 perf counters to get the right sched codes?
14:50karolherbst: hakzsam: do you think there could be an overflow somewhere when getting the inst_issued values?
14:57karolherbst: hakzsam: especially because the per frame values are multiples of 10
15:03karolherbst: when a vertex shader produces OUT 0-6, but the fragment shader only has IN 0-5, OUT6 could be dropped, right?
15:03imirkin: check the semantic... the actual OUT/IN index doesn't matter
15:04karolherbst: ohh okay
15:04karolherbst: one OUT is POSITION the other are GENERICs, do you mean this for example?
15:04imirkin: and it'll be GENERIC
15:04imirkin: and GENERIC matches up to GENERIC
15:04imirkin: irrespective of the IN/OUT index itself
15:04karolherbst: all GENERICs are used in the fragment shader
15:05karolherbst: is POSITION a special case which is always used?
15:05imirkin: it's a semantic
15:05imirkin: just like generic is a semantic
15:05imirkin: not sure what you mean
15:05imirkin: you can check the tgsi docs for what these things mean
15:06karolherbst: I was thinking about something else
15:06karolherbst: what if in a fragment shader, the opts lead to only two of those IN being used
15:06karolherbst: that would mean we could drop the OUTs in the vertex, right?
15:06imirkin: we don't do any linking optimizations
15:06imirkin: but yes, it does.
15:06karolherbst: mhh okay
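A small Python sketch of the matching rule imirkin describes: outputs and inputs pair up by semantic, not by their OUT/IN index. The dictionaries and the helper name are hypothetical. Treating POSITION as always needed is an assumption made here because the rasterizer consumes it even when the fragment shader never reads it.

```python
def unused_vs_outputs(vs_outputs, fs_inputs, always_needed=("POSITION",)):
    """Return the vertex shader OUT slots whose semantic no fragment shader IN consumes.

    Both arguments map a slot index to a (semantic_name, semantic_index) pair.
    """
    consumed = set(fs_inputs.values())
    return sorted(
        slot
        for slot, semantic in vs_outputs.items()
        if semantic not in consumed and semantic[0] not in always_needed
    )

# Made-up declarations: the slot numbers on both sides deliberately do not line up.
vs_outputs = {
    0: ("POSITION", 0),
    1: ("GENERIC", 0),
    2: ("GENERIC", 1),
    3: ("GENERIC", 2),
}
fs_inputs = {
    0: ("GENERIC", 1),
    1: ("GENERIC", 0),
}

droppable = unused_vs_outputs(vs_outputs, fs_inputs)
```

Only `OUT[3]` (GENERIC 2) comes out as droppable. As noted in the chat, such a check is only valid for one specific pairing of shaders, since the stages can be rebound until draw time.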
15:06karolherbst: in the nv50 ir, how are those INs accessed?
15:07imirkin: in a vertex shader, vfetch. in a fragment shader, interp
15:07karolherbst: and then a[0xac] maps to o[0xac]?
15:07imirkin: something like that, yes
15:08karolherbst: mhh okay
15:08karolherbst: then I can at least check if it might make a difference
15:08imirkin: the gallium interface doesn't lend itself nicely to such optimizations
15:09imirkin: since all the shader stages are completely separate
15:09imirkin: until draw time
15:09imirkin: and can be rebound in various ways with diff shaders
15:09imirkin: effectively all gallium shaders are sso
15:09karolherbst: if this could cut effectively in half, then it has to be fixed
15:09imirkin: could? sure. dunno that that would happen very often though.
15:10karolherbst: well, that's what I am checking now
15:13karolherbst: in the one shader "export b128 # o[0x70] $r20q" -> "export b32 # o[0x7c] $r23"
15:13karolherbst: this would be possible
15:14karolherbst: 3 out of 28 b32 values unused
15:15imirkin: that's position
15:15imirkin: chances are things would get unhappy if that were not emitted
15:15imirkin: but not sure
15:16karolherbst: still an interesting thing to do, maybe just complicated to implement due to gallium
15:16karolherbst: but, the game engines also use SSO
15:16karolherbst: and I could think, that devs could just not care enough to optimize that on their end
15:17karolherbst: especially if nvidia already does it
15:21karolherbst: floor ftz f32 $r2 abs $r0; sub ftz f32 $r6 abs $r0 $r2
15:22karolherbst: isn't there some native instruction for this?
15:23imirkin: FRC? no.
15:23karolherbst: mhh but this shader is especially interesting here
15:23karolherbst: set ftz u8 $p0 ge f32 $r0 neg $r0
15:24karolherbst: not $p0 neg ftz f32 $r6 $r6
15:29karolherbst: this is
15:30karolherbst: like neg(frc($r0))
15:34karolherbst: this can be shortened to this:
15:34karolherbst: floor ftz f32 $r2 $r0
15:34karolherbst: add ftz f32 $r6 neg $r0 $r2
15:45karolherbst: mhh okay, doesn't work for negative ones yet
15:51karolherbst: am I stupid or is there no easy way to get the negated fractional part
15:55karolherbst: ohhhh I am stupid
15:55karolherbst: floor(-4.2) is -5
17:06mlankhorst: indeed! :D
17:23karolherbst: but there has to be a way to do fract in less than 4 instructions
17:24karolherbst: or at least in 4 instructions without branching
17:25imirkin_: should be 2 ops iirc
17:26karolherbst: imirkin_: for positive ones, yes
17:26imirkin_: karolherbst: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir_from_tgsi.cpp#n3096
17:27karolherbst: imirkin_: -5.4 - math.floor(-5.4) == 0.6
17:27karolherbst: at least in python
17:27karolherbst: and I doubt that on the gpu floor(-5.4) is -5
17:28imirkin_: hopefully that answers your question? :)
17:28karolherbst: how insane
17:28karolherbst: because the shader wants the real thing
17:29karolherbst: imirkin_: the generated code does something like this: 5.4 => -0.4 and -5.4 => 0.4
17:31karolherbst: and currently nouveau generates floor/sub/set/ predicated neg
17:35karolherbst: that set and that predicated neg could be merged into a slct
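The floor pitfall and the set/predicated-neg to select merge can be checked with a short Python sketch. The per-line comments are a literal reading of the four instructions quoted earlier (destination first, then sources), which is an assumption about the print format. Any negation the real shader applies around this sequence is not modeled, so signs may differ from the values mentioned in the chat.

```python
import math

def fract(x):
    """Fractional part as x - floor(x); floor rounds toward minus infinity."""
    return x - math.floor(x)

def signed_fract_predicated(x):
    """floor / sub / set / predicated neg, one statement per quoted instruction."""
    r2 = math.floor(abs(x))   # floor ftz f32 $r2 abs $r0
    r6 = abs(x) - r2          # sub ftz f32 $r6 abs $r0 $r2
    p0 = x >= -x              # set ftz u8 $p0 ge f32 $r0 neg $r0
    if not p0:                # not $p0 neg ftz f32 $r6 $r6
        r6 = -r6
    return r6

def signed_fract_select(x):
    """Same result with the set + predicated neg folded into a single select."""
    f = abs(x) - math.floor(abs(x))
    return f if x >= -x else -f

samples = [5.4, -5.4, 0.0, 3.0, -0.25]
same = all(signed_fract_predicated(v) == signed_fract_select(v) for v in samples)
```

Note that `fract(-5.4)` is about 0.6 rather than 0.4 because `math.floor(-5.4)` is -6, which is the surprise discussed above.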
17:45imirkin_: karolherbst: it may be beneficial to have a pass that looks for if () x = foo; else x = bar;
17:45imirkin_: it's a pretty standard kind of thing to look for.
17:45imirkin_: known as a "sel peephole" i think.
17:46karolherbst: yeah, currently thinking exactly about that
17:46imirkin_: i would recommend doing that pre-ssa tbh
17:46karolherbst: I already removed the floor/sub thing in my example
17:46imirkin_: since messing with BB's after SSA is done is next to impossible
17:46karolherbst: ohh, I think I got it now
17:47karolherbst: messing with BB's isn't that hard actually
17:47imirkin_: no, it is.
17:47imirkin_: trust me.
17:47karolherbst: well at least I can remove the edges without breaking stuff now
17:47imirkin_: you can't.
17:47imirkin_: you just think you can.
17:47karolherbst: sure I can
17:48karolherbst: doesn't node->cut remove the edges too?
17:48karolherbst: uhh right, phi nodes
17:49karolherbst: well maybe we should clean that CFG mess up a bit and store instruction <=> edge pointers
17:49karolherbst: this would make things so much easier in the end
17:50imirkin_: llvm stores the incident BB as part of the PHI node argument
17:50imirkin_: which i think is the right approach
17:51karolherbst: yeah, but the same problem also exists for branch instructions
17:51karolherbst: you have a bra, but only implicit edge ordering
17:51imirkin_: that's a different issue
17:52karolherbst: looks like same issue, just different side
17:52imirkin_: the phi nodes thing makes edge manipulation a non-starter
17:52imirkin_: i think the bra thing is more easily fixable by removing the implicit stuff
17:52karolherbst: yeah, right, but it is the same issue: depending on edge ordering
17:53imirkin_: right... just with a much simpler fix
17:57karolherbst: imirkin_: for the example above: https://gist.github.com/karolherbst/be50b52ad6e1df0fb47f8f2f861befca
17:59imirkin_: karolherbst: right.
18:00hakzsam: karolherbst, what's the issue with inst_issued2 ?
18:02karolherbst: hakzsam: it isn't 0 when I think it should be
18:03karolherbst: hakzsam: I disabled dualIssuing in the emitter and forced the same sched value for every instruction
18:03hakzsam: and it's not 0?
18:03karolherbst: hakzsam: but then inst_issued2 suddenly was not 0 anymore (depending on inst_issued?)
18:03karolherbst: hakzsam: well it is, but not for the entire trace
18:04hakzsam: I think this perf counter returns the correct result though
18:04karolherbst: tell me what I should check and I will check it :)
18:06hakzsam: well, checking all perf counters is really not trivial, but if you want to do it you have to write a minimal GL test and use GL_AMD_performance_monitor in it to monitor perf counters
18:07hakzsam: and to write a minimal shader of course :)
18:08hakzsam: this is a general solution though
18:08karolherbst: I think the issue only triggers when you have like 10M+ inst per frame
18:09hakzsam: yeah, but if you want to make sure that inst_executed should return N instructions for a shader, you can write it, monitor inst_executed and see what's happening :)
18:09hakzsam: karolherbst, if the "issue" only happens with a big application, that's really hard to figure out if it's a real problem or not
18:10karolherbst: inst_issued2 shouldn't be non 0 when nothing dual issues
18:10karolherbst: hakzsam: but the value doesn't depend on the metric-inst_issued :/
18:11hakzsam: are you sure that nothing should be dual issued in that specific case?
18:12karolherbst: pretty sure
18:12karolherbst: maybe something slips through, but I really doubt that
18:12hakzsam: I mean, maybe the hardware is lying to you, and dual issues even if you don't want it to (for some instructions)
18:12hakzsam: I don't know if this can happen though
18:12hakzsam: imirkin_, is that possible?
18:13karolherbst: I will show you and you will immediately think "wtf..."
18:13imirkin_: i don't know anything about dual-issue
18:13imirkin_: or what those counters are counting
18:13karolherbst: especially the correlation with the fps
18:14hakzsam: I don't know much about dual-issue too, that's why I'm asking and guessing that maybe the hw does something :)
18:14karolherbst: hakzsam: well I return false in "TargetNVC0::canDualIssue" for everything
18:14hakzsam: okay, 110 vs 83M
18:15karolherbst: so maybe something slips through
18:15karolherbst: and maybe this causes this big perf issue?
18:15karolherbst: I know that dual issuing too much can cut off like 90% of perf
18:15imirkin_: probably the blit shaders
18:15imirkin_: they're hardcoded
18:15karolherbst: imirkin_: like in binary hardcoded?
18:16hakzsam: imirkin_, that's a theory yes
18:16karolherbst: because if those instructions go through SchedDataCalculator::setDelay then they are also affected by canDualIssue
18:16hakzsam: they won't reach that point
18:16karolherbst: 00 sched means: we don't know
18:17karolherbst: so they shouldn't get dual issued either
18:17karolherbst: but maybe we could add sched information to those?
18:17hakzsam: 00 is unknown right?
18:17karolherbst: no idea if that makes sense for performance reasons
18:17karolherbst: 0x04 is dual issue
18:17hakzsam: so, how can you be sure that nothing is dual issued? :)
18:18karolherbst: because 0x00 is "default"
18:18karolherbst: hakzsam: https://envytools.readthedocs.io/en/latest/hw/graph/fermi/cuda/isa.html?highlight=dual#notes-about-scheduling-data-and-dual-issue-on-gk104
18:18karolherbst: 00: no scheduling info, suspend warp for 32 cycles
18:18hakzsam: oh okay, so we know 00
18:18hakzsam: I see
18:19hakzsam: makes more sense
18:19karolherbst: hakzsam: yep, cause kepler can't schedule
18:19karolherbst: there is no hw scheduler anymore
18:20hakzsam: karolherbst, can you reproduce the issue with something else? or this really only happens with big traces and ton of instructions ?
18:20karolherbst: also note the "weak" correlation between inst_issued2 and metric-inst_issued
18:20karolherbst: hakzsam: let me check
18:21hakzsam: it's not obvious :)
18:21karolherbst: hakzsam: pixmark_piano: 2.88G inst_issued
18:21karolherbst: 0 inst_issued2
18:21karolherbst: when in doubt, now you know what to check :D
18:22karolherbst: now those 80M look like nothing really
18:22hakzsam: sure, but I don't have any gpus right now (not at home)
18:22hakzsam: it works with pixmark_piano and not with your trace?
18:22hakzsam: what's that game btw?
18:23karolherbst: saints_row 3
18:23karolherbst: which runs at like 15% nvidia performance ;)
18:23karolherbst: maybe this is one of the issues
18:24hakzsam: karolherbst, did you remove the sched instructions from codegen/lib btw?
18:24hakzsam: I don't know if they are used though
18:24karolherbst: hakzsam: yeah
18:24karolherbst: I can even force everything to 0x3f
18:24hakzsam: karolherbst, probably not the only perf issue to be honest
18:24hakzsam: but that's a good start
18:24karolherbst: let me try something
18:24karolherbst: I will dual issue everything
18:25karolherbst: if the perf stays the same...
18:25hakzsam: but seriously 110 vs 83M is nothing
18:25hakzsam: you can call that "noise" or whatever you want :)
18:26karolherbst: 0 vs 3G sounds better :D
18:26hakzsam: so, the perf counter is most likely correct?
18:26hakzsam: I mean it seems to work in different cases
18:26hakzsam: and doesn't with saints row 3
18:26karolherbst: I didn't check
18:27karolherbst: the only thing we know for sure: when I return false in canDualIssue or force a specific sched code in SchedDataCalculator::setDelay
18:27hakzsam: it would be good to check with a different chip
18:27karolherbst: inst_issued2 shows not 0 in sr3
18:27hakzsam: fermi or kepler2
18:27karolherbst: I dual issue everything
18:27karolherbst: same perf
18:28hakzsam: karolherbst, okay, I'll try to remember that thing and test when I will be back home
18:28karolherbst: hakzsam: nope, I think your counter is good
18:28karolherbst: something is odd elsewhere
18:28hakzsam: I'll trust you then :)
18:29karolherbst: the perf should be shit with dual issue everything
18:32hakzsam: we definitely need those graphics perf counters
18:32hakzsam: sorry for the delay :/
18:32karolherbst: hakzsam: in pixmark_piano: 16 -> 2 fps
18:32karolherbst: when I dual issue _everything_
18:32hakzsam: big performance drop
18:33hakzsam: but that makes sense
18:34karolherbst: 14 fps when I dual issue nothing
18:34karolherbst: checking with heaven
18:34hakzsam: Tom^, do you still have your gk110?
18:34Tom^: yes sir.
18:35hakzsam: time to test something?
18:36hakzsam: do you have mesa master? or a mesa not too old like two weeks ago ?
18:36Tom^: built it yesterday iirc.
18:36Tom^: or on saturday
18:36hakzsam: I just want to make sure that enabling compute support by default on GK110 won't break the universe because it's done at initialization time (ie. context creation)
18:37hakzsam: I already asked you few months ago to test something like that but it didn't work correctly
18:37hakzsam: so, you just need to run some application with export NVF0_COMPUTE=1
18:37hakzsam: like heaven
18:37karolherbst: hakzsam: well in heaven performance breaks from 14 to 7 fps
18:37hakzsam: or whatever you want, but not glxgears :)
18:37karolherbst: I guess it highly depends on the IPC value how much it hurts
18:37hakzsam: karolherbst, half!
18:38karolherbst: yeah well half is boring :D
18:38hakzsam: yeah, probably :)
18:38karolherbst: anyway, test with sr3 trace again
18:38hakzsam: Tom^, I remember that heaven did not work correctly with NVF0_COMPUTE=1 before, it hanged your gpu IIRC
18:38hakzsam: Tom^, now, it should just work like a charm
18:38Tom^: works like a charm.
18:39Tom^: as long as it actually got activated i guess
18:39Tom^: no way to confirm its on? :p
18:39hakzsam: glxinfo| grep "core"
18:40hakzsam: it should return 4.3 :)
18:41hakzsam: err, 4.2
18:41karolherbst: hakzsam: well either the nouveau code is soooo bad that it doesn't matter if we dual issue completely wrong or not, or we indeed have a sched issue
18:42Tom^: hakzsam: http://i.imgur.com/PhSIQag.png
18:42karolherbst: our normal sched code vs sched 0x04 everything has the same perf
18:42hakzsam: karolherbst, maybe we have a sched issue
18:42hakzsam: Tom^, and without NVF0_COMPUTE I guess it returns GL 4.1?
18:42Tom^: not really no
18:43hakzsam: oh right, I'm stupid :)
18:43hakzsam: 4.2 in any cases
18:43karolherbst: hakzsam: anyway, inst_issued2 should return 0 if we don't dual issue
18:43karolherbst: hakzsam: one way or another we should fix that
18:43hakzsam: Tom^, but it should expose GL_ARB_compute_shader with NVF0_COMPUTE=1?
18:43hakzsam: karolherbst, right
18:44Tom^: hakzsam: seems so indeed.
18:45Tom^: now i need to benchmark things see if anything happens :P
18:45karolherbst: imirkin_, hakzsam: is there anything which might get invoked 10*n times per frame? or some_constant*n times?
18:45hakzsam: Tom^, cool, thanks for testing
18:45karolherbst: hakzsam: funny enough with GALLIUM_HUD_PERIOD=0 those values are always 10*n
18:45imirkin_: karolherbst: oh, heh. those perf counter compute shaders :)
18:46hakzsam: Tom^, compute support will be enabled by default soon on your chip
18:46karolherbst: imirkin_: so you say, when I disable some, the value should... drop?
18:46hakzsam: karolherbst, yeah, we use compute shaders to read out perf counters
18:46Tom^: hakzsam: cool
18:46imirkin_: which does appear to do dual-issue
18:47hakzsam: oh right
18:47hakzsam: how did I forget that? :)
18:47karolherbst: imirkin_: but why does it only happen in sr3?
18:47hakzsam: karolherbst, so, you know what to do
18:47karolherbst: no, that doesn't make sense though
18:47hakzsam: try to disable them anyway
18:48hakzsam: imirkin_, well, compute support seems good on gk110
18:50karolherbst: imirkin_: nope, that wasn't it
18:50hakzsam: still 110?
18:51karolherbst: hakzsam: same values
18:51karolherbst: hakzsam: did you look at the graph? it isn't like it is a constant value ;)
18:51karolherbst: and that's why it was odd from the beginning that those shaders are counted
18:51karolherbst: they should be counted for every frame then too
18:52karolherbst: and the value shouldn't be 0 at any time
18:52hakzsam: didn't look at the details ;)
18:52hakzsam: != 0 I would say
18:52karolherbst: peak is 890 ;)
18:53hakzsam: maybe you could try to dump generated code and check for sched codes?
18:53hakzsam: that's a pain but heh
18:53Tom^: hakzsam: with it on http://i.imgur.com/2UgD0Ow.png so yea it seems fine as long as unigine-heaven makes any use of it i guess
18:53karolherbst: hakzsam: right
18:53karolherbst: hakzsam: this is my very last approach
18:53hakzsam: Tom^, cool
18:53karolherbst: hakzsam: I could check the emitter though
18:53karolherbst: hakzsam: somewhere is a sched => byte translation
18:54hakzsam: yeah, but make sure that no hardcoded codes (like the blitter one) are used
18:57Tom^: so just GL_ARB_robust_buffer_access_behavior and GL_ARB_shader_image_size and then i have gl 4.3 wohoo :)
18:58karolherbst: hakzsam: well there is a NVC0_DEBUG_SCHED_DATA define :D
19:01hakzsam: Tom^, ARB_shader_image_size should be already exposed
19:02Tom^: hakzsam: oh i was just checking https://mesamatrix.net/
19:02hakzsam: karolherbst, and no NV50_PROG_DUMP=filename :)
19:02hakzsam: karolherbst, misreading
19:02karolherbst: grep "sched 04" :)
19:03hakzsam: karolherbst, but having an envvar which dumps generated code could help at some point
19:03karolherbst: I don't get anything though
19:03karolherbst: maybe it is indeed some binary code
19:04hakzsam: Tom^, yeah, only this robustness thing needs to be done, but it's sort of bullshit ;)
19:05karolherbst: mhh src/gallium/drivers/nouveau/codegen/lib/gk104.asm
19:05hakzsam: I asked you earlier if you removed the sched codes from the codegen lib
19:06karolherbst: ohh, then I misunderstood you
19:06Tom^: hakzsam: oh i also have a gk110b but i guess it's the same card pretty much
19:06Tom^: hakzsam: to a gk110 that is
19:06karolherbst: would be funny if there is something fishy inside those builtins :D
19:07hakzsam: karolherbst, there is probably
19:08karolherbst: couldn't we write those builtins inside TGSI and compile it normally?
19:09karolherbst: hakzsam: and how do I compile those asm files now?
19:09hakzsam: karolherbst, go to codegen/lib and make
19:10karolherbst: yep, looks good
19:10karolherbst: now compiling
19:10hakzsam: the idea of those builtins is to be precompiled
19:11hakzsam: using TGSI will be useless
19:12karolherbst: ahh okay
19:12karolherbst: well at least we shouldn't need to put sched opcodes there
19:13hakzsam: we need to, but maybe some sched codes are not totally correct
19:13hakzsam: karolherbst, so, 0 now?
19:13karolherbst: ohh wait
19:13karolherbst: no, I replaced the 0x04 with 0 :D
19:14hakzsam: what's 0x04?
19:14hakzsam: dual issue?
19:15karolherbst: performance is still pretty bad
19:15karolherbst: but most of those builtins use 0x00 as sched
19:15hakzsam: saints row 3 now returns 0 for inst_issued2?
19:17karolherbst: well i "optimize" those builtins a bit then :D
19:19karolherbst: who needs this anyway :)
19:20hakzsam: all shaders which use OP_CALL ;)
19:22karolherbst: I hope I didn't mess up too much now
19:23karolherbst: well at least it should be easy to find out which builtin is getting used
19:24hakzsam: I think so
19:24karolherbst: yep, I just dual issue one function and see if inst_issued2 increases :D
19:24hakzsam: but if the function is not called...
19:25hakzsam: you won't see anything useful :)
19:26karolherbst: I just wiped out all the functions inside the asm file
19:26karolherbst: and no visual change
19:28karolherbst: yeah, seems like only a minor thing calls those builtins
19:29hakzsam: makes sense
19:29karolherbst: and with this, I am where I was at the beginning :D
19:30hakzsam: but now, you have inst_issued at 0 with saints row 3
19:30karolherbst: which helps me with nothing really
19:30hakzsam: this just confirms that inst_issued2 is correct
19:31karolherbst: hakzsam: the thing is just the GPU is bored (around 50% engine load, <10% memory load), the CPU is bored (~60%)
19:31karolherbst: and something causes the performance to be insanely bad
19:32hakzsam: mmh, that's pretty bad yeah
19:33karolherbst: maybe we stall the gpu like crazy
19:33karolherbst: but then we should have some more memory load
19:34karolherbst: ohhh wait
19:34karolherbst: this is the CPU
19:35karolherbst: while the really bad frames are drawn the GPU core load drops below 10% but cpu is at 100%
19:37karolherbst: hakzsam: maybe in the end it is something as stupid as the compiler compiling every frame
19:37karolherbst: which means....
19:37hakzsam: cpu bound you mean?
19:37karolherbst: how can I disable all opts?
19:37hakzsam: NV50_PROG_OPTIMIZE=0 ?
19:38hakzsam: but TF2 which seems like cpu bound has good performance with nouveau
19:38karolherbst: more perf
19:39karolherbst: well a bit more perf
19:39karolherbst: not so much that I would call it playable though
19:39hakzsam: without OPTIMIZE?
19:39hakzsam: with OPTIMIZE=0 I mean?
19:39karolherbst: not much, but usually 1 or 2 fps more
19:40karolherbst: I also did a --pcpu run
19:40hakzsam: not a big issue I would say
19:40karolherbst: guess what
19:41hakzsam: no idea what those numbers are
19:41hakzsam: are you using apitrace?
19:42karolherbst: # call no gpu_start gpu_dura cpu_start cpu_dura vsize_start vsize_dura rss_start rss_dura pixels program name
19:42karolherbst: column 6: cpu time
19:43karolherbst: those are the most CPU expensive calls in order
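The ranking described here can be reproduced offline from the profile dump. A minimal sketch in Python, assuming each profiled call is one whitespace-separated line laid out exactly like the header quoted above. The sample rows and the helper names (`parse_profile`, `top_cpu_calls`) are made up for illustration and are not part of apitrace.

```python
from collections import Counter

# Column layout copied from the header line quoted in the log.
COLUMNS = ["call", "no", "gpu_start", "gpu_dura", "cpu_start", "cpu_dura",
           "vsize_start", "vsize_dura", "rss_start", "rss_dura",
           "pixels", "program", "name"]


def parse_profile(lines):
    """Yield one dict per 'call ...' line, skipping comments and other output."""
    for line in lines:
        fields = line.split()
        if len(fields) != len(COLUMNS) or fields[0] != "call":
            continue
        row = dict(zip(COLUMNS, fields))
        # Everything between the 'call' tag and the GL function name is numeric.
        for key in COLUMNS[1:-1]:
            row[key] = int(row[key])
        yield row


def top_cpu_calls(lines, limit=500):
    """Return the `limit` calls with the largest cpu_dura (column 6)."""
    rows = sorted(parse_profile(lines), key=lambda r: r["cpu_dura"], reverse=True)
    return rows[:limit]


# Hypothetical sample data, not taken from a real trace.
sample = [
    "# call no gpu_start gpu_dura cpu_start cpu_dura ...",
    "call 1 0 0 100 5000 0 0 0 0 0 0 glLinkProgram",
    "call 2 0 0 200 40 0 0 0 0 0 0 glDrawElements",
    "call 3 0 0 300 9000 0 0 0 0 0 0 glClear",
]

top = top_cpu_calls(sample, limit=2)
per_name = Counter(row["name"] for row in top)
```

Counting names in the top N, as `per_name` does, makes it easy to see which GL entry points dominate CPU time.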
19:43hakzsam: I see
19:43karolherbst: there is a glTexStorage2D at position 214
19:43karolherbst: or a glClear at 229
19:43karolherbst: all those bs stuff
19:44hakzsam: maybe you could replay the trace with perf?
19:44hakzsam: and see if you find some cpu bottlenecks
19:44karolherbst: at 1500+ there are also some glReadPixels calls
19:44karolherbst: at 1900+ some "useful" stuff comes, but yeah
19:45karolherbst: the issue is that those glLinkProgram calls are done like every single frame
19:45karolherbst: and glCompileShader
19:45karolherbst: and a lot of those
19:46karolherbst: maybe an in memory cache might help already?
19:46hakzsam: no clue
19:49karolherbst: mhh maybe we could async those compile calls and join on upload time or something like that
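The in-memory cache idea floated above can be sketched in a few lines. This is only an illustration of the general technique (memoising compile results by a hash of the shader source), not how Mesa or nouveau implement it. `ShaderCache` and `fake_compile` are hypothetical names, and a real driver would have to include much more state in the key.

```python
import hashlib


class ShaderCache:
    """Memoise compile results, keyed by shader stage plus source text."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._binaries = {}
        self.misses = 0  # number of times the real compiler had to run

    def get(self, stage, source):
        # The stage is part of the key so identical text for two stages
        # does not collide.
        key = hashlib.sha256(stage.encode() + b"\0" + source.encode()).hexdigest()
        if key not in self._binaries:
            self.misses += 1
            self._binaries[key] = self._compile(stage, source)
        return self._binaries[key]


def fake_compile(stage, source):
    """Stand-in for an expensive compile step (hypothetical)."""
    return f"bin({stage},{len(source)})"


cache = ShaderCache(fake_compile)
src = "void main() { gl_FragColor = vec4(1.0); }"
first = cache.get("fragment", src)
second = cache.get("fragment", src)  # same source again, served from the cache
```

If an application really resubmits identical sources every frame, only the first submission pays the compile cost under this scheme.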
19:52karolherbst: let's compile with O3 before I trace with perf
19:53RSpliet: karolherbst: seems that potential stability patch thing runs fine, let's find out tomorrow morning whether my 780Ti crashed (angel)
19:53karolherbst: RSpliet: yeah, it seemed okay on mine too
19:57karolherbst: hakzsam: funny, it seems better after I compiled with O3 :D but still awesomely bad
19:58hakzsam: O3 is definitely better than O0, especially when it's cpu bound :)
19:59karolherbst: yeah but it isn't significantly better
19:59karolherbst: well it's better and that's what matters
19:59hakzsam: yeah, this won't increase performance a lot
19:59karolherbst: but now the top 500 changed a lot
20:00karolherbst: 9: call 93884 0 0 797993074 23063609 0 0 0 0 0 0 glClear
20:00karolherbst: hakzsam: 374 glClears among the top 500
20:00karolherbst: 508 among the top 5000
20:01karolherbst: what does glClear do anyway? :D
20:01karolherbst: well anyway, it is so full of glCompileShader, glClear and glLinkProgram that it actually hurts
20:01hakzsam: it clears :)
20:01karolherbst: okay, now perf
20:03karolherbst: perf -g, but what else should I pass?
20:03hakzsam: usually I use 'perf record'
20:03hakzsam: and then 'perf report'
20:03hakzsam: but there are a ton of options
20:04karolherbst: I think -g is enough
20:04urmet: what cards are you perf-testing?
20:04hakzsam: gk106 I guess
20:05hakzsam: piglit is soooo long
20:05urmet: aw. i have gm.. :(
20:05hakzsam: and I need two runs :/
20:05karolherbst: hakzsam: meh, well at least on my GPU I don't need -1 :)
20:06hakzsam: I always use -1
20:06karolherbst: yeah, that's your issue :p
20:07hakzsam: each time I tried to run concurrent tests my gpu hung miserably :)
20:07karolherbst: try RSpliets patch :D
20:08hakzsam: which ones?
20:09karolherbst: RSpliet: do you think it might help with concurrent piglit?
20:11hakzsam: first run is almost done :)
20:12karolherbst: 49.07%-- snappy::RawUncompress :/
20:14karolherbst: maybe I should just start the game and run that under perf
20:16RSpliet: karolherbst: depends on the symptoms
20:16RSpliet: if it's "mouse moves, hangs otherwise, CTX_TIMEOUT in logs" then I hope so
20:16RSpliet: (or... something like CTX timeout, don't remember the exact string)
20:18hakzsam: maybe I should try concurrent piglit then
20:23hakzsam: karolherbst, piglit seems to be much slower on gm107 than gk208 for the same number of tests
20:31hakzsam: second run now
20:33imirkin_: hakzsam: for fewer tests... maxwell doesn't have tess
20:34hakzsam: imirkin_, yeah, whatever it should take 30 minutes or so
20:38hakzsam: imirkin_, btw, what's the issue with tess on maxwell?
20:38imirkin_: i got lazy
20:39imirkin_: fetching inputs and outputs doesn't work (and storing outputs is unlikely to work as well on TCS)
20:40hakzsam: are there some piglit tests which fail?
20:40imirkin_: anything that uses tess
20:40imirkin_: start with nop.shader_test
20:40imirkin_: and move up from there
20:41hakzsam: okay, that's a good start
20:41hakzsam: (it's on my todolist just after images as you already know)
20:42imirkin_: it's just something that'll take a day of concentration from me, and i haven't had the motivation to do it
20:42hakzsam: only one day? :)
20:42imirkin_: in large part because nvidia kinda gave open-source a big "fuck you" with GM20x, which leads me to be less inclined to investigate.
20:42hakzsam: yeah, I see
20:43imirkin_: i obviously won't nack code that implements it... just... i have better things to do in the time i put towards nouveau
20:44imirkin_: and it wasn't trivially easy to do :)
20:44hakzsam: sure, I guess it's not easy yeah
20:44imirkin_: fermi and kepler were pretty similar... i only discovered differences pretty late in the tess development cycle
20:44imirkin_: (to do with indirect accesses)
20:44hakzsam: oh okay
20:44imirkin_: while maxwell accesses these from a totally different place
20:45imirkin_: i think it's semi-similar to how GS works, but i dunno if it's the same (or if GS was even done correctly)
20:45hakzsam: I don't know how GS works on maxwell, but I'll have a look when I work on tess
20:46hakzsam: anyway, if we get images before tess, we will be able to bump from 3.3 to 4.2 in one shot :)
20:46hakzsam: this is for what? GS?
20:47imirkin_: for fetching vertex attribs in GS
20:47imirkin_: since you get it primitive-at-a-time
20:47imirkin_: vfetch takes an extra arg
20:47imirkin_: which is the "lane" to fetch from
20:47imirkin_: or... something
20:47imirkin_: and this is the result of pfetch
20:48imirkin_: however it's different for tess
20:48hakzsam: well, tracing the blob with mmt will help for sure :)
20:48imirkin_: yeah, there are traces already
20:48imirkin_: sanity = sanity.shader_test, quads = quads.shader_test
20:49hakzsam: imirkin_, btw, are you going to have some time to look into the frag/comp issue for images on fermi?
20:50hakzsam: I'll probably have another look later, but I'm a bit lazy right now
20:50hakzsam: especially because I have tried a ton of different things
20:50hakzsam: without any good success
20:50hakzsam: except the patch I pasted you earlier today :)
20:51imirkin_: hakzsam: not today, most likely
20:51imirkin_: hakzsam: maybe tomorrow? not sure.
21:15karolherbst: hakzsam: somehow I get the feeling that perf is completely broken on my system
21:15karolherbst: because perf report doesn't respect any parameters at all
21:35karolherbst: hakzsam: mhh 20% inside the kernel
21:36karolherbst: and another 20% inside libdrm_nouveau
21:53karolherbst: okay, as it seems running under real conditions changes quite a lot
22:18karolherbst: imirkin_: where should be a pass put for: set ge $r1 neg $r1 => set ge $r1 0? Algebraic?
22:18imirkin_: i think so yea
22:19karolherbst: maybe this is enough to trigger other passes
22:26karolherbst: imirkin_: what are those LTU, NEU cond codes? U=unsigned?
22:26imirkin_: if only
22:27imirkin_: iirc they're the stupid unordered things from floating comparisons
22:27karolherbst: the heck
22:27imirkin_: i.e. foo > nan. do you want true or false.
22:28imirkin_: ge = false. geu = true.
22:28karolherbst: ahh okay
22:28imirkin_: or something along those lines. mwk will know for sure.
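The ordered/unordered distinction can be shown with plain floats. A minimal sketch, assuming the `U` condition codes follow the usual IEEE 754 convention (an unordered comparison is true whenever either operand is NaN), which the discussion above only recalls tentatively. The function names `ge` and `geu` are illustrative and are not nouveau API.

```python
import math


def ge(a, b):
    """Ordered >=: false whenever either operand is NaN."""
    return a >= b


def geu(a, b):
    """Unordered >=: true if a >= b, or if either operand is NaN."""
    return math.isnan(a) or math.isnan(b) or a >= b


nan = float("nan")

ordered_result = ge(1.0, nan)     # NaN makes every ordered comparison false
unordered_result = geu(1.0, nan)  # the unordered variant reports true instead

# For non-NaN inputs the two variants agree.
agree_on_numbers = all(
    ge(a, b) == geu(a, b)
    for a in (-1.0, 0.0, 2.5)
    for b in (-1.0, 0.0, 2.5)
)
```

Under this convention `geu(a, b)` is the same as `not (a < b)`, which is why each ordered code has an unordered partner.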
22:28karolherbst: is there a short code for cc == CC_LT || cc == CC_LE || cc == CC_LTU ...
22:28imirkin_: what are you trying to do?
22:28imirkin_: there's reverseCondCode()
22:28imirkin_: and another related helper
22:29karolherbst: well if you compare a number with its negated self, you are usually only testing its sign, so a comparison against 0 is enough
22:29karolherbst: which makes it a bit easier to deal with that instruction
22:30karolherbst: like i >= -i <==> i >= 0
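The identity above is easy to sanity-check before writing a pass around it. A small sketch in Python. The float samples are arbitrary, and `neg_s32` is a hypothetical helper that emulates 32-bit two's-complement negation (Python integers do not wrap on their own). Whether the hardware comparison in question is float or integer is not stated in the discussion, so both are shown.

```python
import math

float_samples = [0.0, -0.0, 1.5, -1.5, math.inf, -math.inf, math.nan,
                 5e-324, -5e-324]

# Floats: x >= -x and x >= 0 agree on every sample, including NaN and -0.0.
float_mismatches = [x for x in float_samples if (x >= -x) != (x >= 0.0)]


def neg_s32(v):
    """Negate v as a signed 32-bit integer, wrapping on overflow."""
    r = (-v) & 0xFFFFFFFF
    return r - (1 << 32) if r >= (1 << 31) else r


INT_MIN = -(1 << 31)
int_samples = [0, 1, -1, 12345, -12345, (1 << 31) - 1, INT_MIN]

# Integers: the only disagreement is INT_MIN, whose negation wraps to itself.
int_mismatches = [v for v in int_samples if (v >= neg_s32(v)) != (v >= 0)]
```

So the rewrite looks safe for floating-point compares, while a fixed-width integer version would need to account for the minimum value.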
22:38karolherbst: imirkin_: and with that we can merge set cc $r0 $r63 + predicated mod insn => slct
22:38karolherbst: or + anything really
22:39karolherbst: and then we have no predicate set anymore (and most likely one instruction less)
22:45karolherbst: or is this already part of the sel peephole and I should just create a new pass and deal with that there completely?
22:49karolherbst: mhh okay, maybe this way: the target of a sel peephole is to optimize simple conditional code into slcts. This can be anything set related + a simple dependent instruction (like mov, abs, neg...)
22:50karolherbst: in short anything like "x = condition ? foo : bar"
22:51karolherbst: yeah, maybe this makes sense
22:51karolherbst: then we can iterate over all BBs and get the last instruction
22:52karolherbst: and if the condition and the result of that condition are easy enough, we can just turn the result into a slct
22:52karolherbst: and then we could also have phi instructions depending on that
22:52karolherbst: yeah.. maybe pre SSA is really the easiest way to do that
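The "x = condition ? foo : bar" idea can be illustrated on a toy instruction list. This is a sketch under heavy assumptions: the `Insn` class, the `select` operand order and the matching rules are invented for the example and do not reflect nouveau's IR or the real semantics of its slct instruction. It only handles a `set` followed directly by two oppositely predicated `mov`s to the same destination, and it drops the `set` only when nothing later reads the predicate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Insn:
    op: str
    dst: str
    srcs: Tuple[str, ...]
    pred: Optional[str] = None  # predicate register guarding the insn
    pred_neg: bool = False      # True: execute when the predicate is false


def fold_to_select(insns):
    """Rewrite set + (p) mov + (!p) mov into one toy 'select' instruction."""
    out = []
    i = 0
    while i < len(insns):
        window = insns[i:i + 3]
        if len(window) == 3:
            s, t, f = window
            rest = insns[i + 3:]
            pred_dead = all(s.dst != r.pred and s.dst not in r.srcs for r in rest)
            if (s.op == "set" and t.op == "mov" and f.op == "mov"
                    and t.pred == s.dst and not t.pred_neg
                    and f.pred == s.dst and f.pred_neg
                    and t.dst == f.dst and pred_dead):
                # select dst, true_value, false_value, cc, lhs, rhs
                out.append(Insn("select", t.dst,
                                (t.srcs[0], f.srcs[0]) + s.srcs))
                i += 3
                continue
        out.append(insns[i])
        i += 1
    return out


# Toy program: x = (a >= zero) ? foo : bar; y = x + one
program = [
    Insn("set", "p0", ("ge", "a", "zero")),
    Insn("mov", "x", ("foo",), pred="p0"),
    Insn("mov", "x", ("bar",), pred="p0", pred_neg=True),
    Insn("add", "y", ("x", "one")),
]
folded = fold_to_select(program)
```

After folding, the predicate-producing `set` and both predicated moves are gone, which matches the "no predicate set anymore and one instruction less" goal mentioned earlier.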