00:00 mwk: hmm, I wonder if I can get an official ELF machine ID assigned for Falcon
00:00 mwk: given the kind of junk that's already there...
00:13 huelter: imirkin, still 0 freezes on 4.6
01:09 mwk: well, overlays are going to be hellishly annoying, that's for sure :(
01:10 mwk: so hm
01:11 mwk: the general idea is that Falcon execs may have a "backing storage" in external memory, aka VRAM
01:11 mwk: if you need to fetch new code, you xcld from it; likewise for data and xdld/xdst
01:14 imirkin: anyone around with a fermi plugged in and wants to test some mesa patches?
01:14 mwk: each segment (ie. PT_LOAD) can be marked as: uploaded manually (loader will poke it to Falcon via code/data port, no backing storage), backed by extmem and auto-uploaded (loader stores to VRAM then runs XFER by IO access), or backed by extmem and not auto-uploaded (loader stores to VRAM, program fetches it on its own)
01:16 mwk: for data (and v0 code), you have p_vaddr describing the address in data/code RAM, p_paddr describing the address in extmem ; for v3+ code you have p_vaddr describing the virtual address and extmem address, p_paddr describing the physical address in RAM
01:17 mwk: it's annoyingly backwards between code and data, but that's unavoidable, p_vaddr must correspond to the address where the program actually sees it
01:20 mwk: well
01:20 mwk: that sounds good, now I only need to convince lld that paddr != vaddr is actually a thing
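The segment scheme mwk outlines above can be sketched roughly as follows. This is only an illustration of the address mapping described in the log; the `Backing` enum, the `Segment` class and the `is_v3_code` flag are hypothetical names, not part of any real ELF or Falcon tooling.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Backing(Enum):
    # Hypothetical names for the three segment kinds described in the log.
    MANUAL = auto()            # loader pokes it in via the code/data port, no extmem copy
    EXTMEM_AUTO = auto()       # loader stores to VRAM, then triggers the transfer itself
    EXTMEM_ON_DEMAND = auto()  # loader stores to VRAM, program fetches it on its own


@dataclass
class Segment:
    p_vaddr: int
    p_paddr: int
    backing: Backing
    is_v3_code: bool = False   # assumed flag: v3+ code segment using virtual addresses

    def ram_address(self) -> int:
        """Address inside Falcon code/data RAM."""
        return self.p_paddr if self.is_v3_code else self.p_vaddr

    def extmem_address(self) -> int:
        """Address in external memory; only meaningful for extmem-backed segments."""
        return self.p_vaddr if self.is_v3_code else self.p_paddr


# Data segment: p_vaddr is the RAM address, p_paddr the extmem address.
data = Segment(p_vaddr=0x100, p_paddr=0x8000, backing=Backing.EXTMEM_AUTO)
# v3+ code segment: the roles are swapped, as mwk notes.
code = Segment(p_vaddr=0x8000, p_paddr=0x100,
               backing=Backing.EXTMEM_ON_DEMAND, is_v3_code=True)
```

In both cases `p_vaddr` is the address the program itself sees, which is why the roles of the two fields flip between data and v3+ code.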
08:47 night199uk: hey
08:47 night199uk: anyone can share some knowledge on the displayport training code?
08:47 night199uk: seems like it is performance sensitive?
08:49 mupuf: performance sensitive?
08:49 night199uk: hard to describe
08:49 night199uk: i have some serial debug code on my displayport training code
08:50 night199uk: and at least with clockrecovery i’m reading different values from the chipset and clockrecovery bombs out
08:50 night199uk: if i disable the serial debug it seems the clockrecovery part gets further
08:50 night199uk: my suspicion right now is that the serial debug is adding enough of a delay to upset clockrecovery
08:51 night199uk: i get different results with serial debug on and off
08:51 night199uk: but it’s pretty hard to tell with all the debugging disabled
08:51 night199uk: so wondered if anyone had any kind of anecdotal yes/no to confirm what i’m seeing
08:52 night199uk: this is in my uefi driver btw not nouveau but there’s not really anyone else to ask :-)
09:00 mupuf: night199uk: sorry, can't help
09:00 night199uk: nah, np
09:00 night199uk: anyone who is expert on the displayport training?
09:02 mupuf: night199uk: you can always check the code of nouveau
09:02 mupuf: but IIRC, the hw assists a lot of the procedure
09:02 night199uk: yeah, i’ve been comparing to nouveau
09:02 night199uk: the code i have is pretty much identical, basically
09:03 night199uk: yeah, this is what i figured so i didn’t believe it would be so timing sensitive
09:04 night199uk: i’ve been flipping between MMIO traces, the DP spec, nouveau code and my code for about 4 days now :-)
09:05 mupuf: sounds like fun :D
09:05 night199uk: haha, i wish :-(
09:05 mupuf: so, my turn to have fun
09:05 night199uk: it was fun for the first two days
09:05 night199uk: now it’s nnnggggg :-/
09:05 mupuf: let's try to see if I can fix the fan BUG
09:05 night199uk: :-)
09:05 night199uk: gl
09:06 night199uk: hey do you know who would be most familiar with DP training in nouveau?
09:06 night199uk: imirkin or skeggsb?
09:09 mupuf: skeggsb
09:09 night199uk: tx :-)
09:54 karolherbst: mhh, inst_executed: 988M -> 985M :/
10:19 pmoreau: karolherbst: You were hoping for a bigger reduction I guess?
10:19 karolherbst: now we are talking! 988M -> 981M in pixmark_piano
10:19 karolherbst: :D
10:19 karolherbst: yep
10:19 pmoreau: :-)
10:19 karolherbst: 1% less instructions executed seems like a good thing
10:19 karolherbst: well
10:19 karolherbst: nearly 1%
10:19 pmoreau: Wellll, depends
10:19 karolherbst: just by moving some instructions into different BBs
10:20 pmoreau: If you are memory limited, removing compute instruction won’t help
10:20 karolherbst: right, but it doesn't hurt either, right?
10:21 pmoreau: Mmmh…
10:21 karolherbst: basically what my current pass does: search for %p bra and check all instructions in its targets, for src, with refCount == 1 and move them into that target
10:22 karolherbst: which basically moves instructions into conditional branches where they are actually needed
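A toy version of the pass karolherbst describes might look like the sketch below. It is not the nv50_ir implementation: `Inst`, `Block` and `sink_single_use_defs` are hypothetical stand-ins, the IR is assumed to be SSA-like, and the use count is recomputed over just two blocks instead of coming from a real `refCount`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Inst:
    op: str
    dst: Optional[str] = None
    srcs: Tuple[str, ...] = ()


@dataclass
class Block:
    name: str
    insts: List[Inst] = field(default_factory=list)


def sink_single_use_defs(pred: Block, target: Block) -> int:
    """Move defs out of `pred` into `target` when `target` holds their only use."""
    # Count uses of every value in both blocks (stand-in for refCount).
    uses = {}
    for inst in pred.insts + target.insts:
        for src in inst.srcs:
            uses[src] = uses.get(src, 0) + 1
    used_in_target = {src for inst in target.insts for src in inst.srcs}

    moved = [i for i in pred.insts
             if i.dst in used_in_target and uses[i.dst] == 1]
    moved_ids = {id(i) for i in moved}
    pred.insts = [i for i in pred.insts if id(i) not in moved_ids]
    # Place the sunk defs at the head of the target, in their original order.
    target.insts = moved + target.insts
    return len(moved)


bb0 = Block("BB:0", [
    Inst("mul", "a", ("x", "y")),   # only used in BB:1, so it can sink
    Inst("add", "b", ("x", "x")),   # also feeds the predicate, so it stays
    Inst("set", "p", ("b",)),
    Inst("bra", None, ("p",)),
])
bb1 = Block("BB:1", [Inst("add", "c", ("a", "b"))])
moved_count = sink_single_use_defs(bb0, bb1)
```

Here only the `mul` is moved, since `b` has a second use in the branch condition.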
10:24 pmoreau: If you remove work, they might stall quicker, prompting the scheduler to swap in another warp, which will also need data, further reducing the available bandwidth? Measuring the perf difference is probably the best thing to estimate the impact.
10:24 karolherbst: yeah, but without a scheduler we can just hope for the best anyway
10:24 pmoreau: I had a case where reducing the register usage, which increased the occupancy, resulted in lower perf.
10:24 karolherbst: yeah
10:25 karolherbst: but that happens quite often
10:25 karolherbst: you also have to check the executed instructions per frame in that case
10:25 karolherbst: reducing gprs doesn't make sense if you increase this
10:26 karolherbst: well, depends on how much in the end and other things
10:27 karolherbst: well I think my pass should also reduce the amount of live values?
10:27 karolherbst: yeah, it should
10:28 pmoreau: I always try to have as small branches as possible, to minimise thread divergence within a warp. And if I can, remove the branch altogether of course. :-)
10:28 pmoreau: Yes, it should
10:28 karolherbst: right
10:29 karolherbst: but small branches don't make sense if that leads to doing conditional work all the time
10:29 karolherbst: and for simple branches we have predicated instructions, slct and selp
10:29 pmoreau: yes
10:31 karolherbst: uhh, a tesseract shader crashes
10:32 pmoreau: :-/
10:32 karolherbst: uhhh
10:32 karolherbst: one branch: 25->47 instructions
10:34 karolherbst: ../../../../../src/gallium/drivers/nouveau/codegen/nv50_ir_util.cpp:119: bool nv50_ir::Interval::extend(int, int): Assertion `a <= b' failed. :/
10:34 karolherbst: I guess I moved stuff a little wrong? odd
10:35 pmoreau: What is this extend function doing?
10:35 karolherbst: live ranges I think
10:35 karolherbst: basically it checks for overlapping stuff or something like that
10:35 karolherbst: it's part of RA
10:36 pmoreau: Ok
10:37 karolherbst: ha
10:37 karolherbst: found it
10:37 karolherbst: 288: mad ftz f32 %r775 %r769 c0[0x154] %r774 (0)
10:37 karolherbst: 293: ld u64 { %r771 %r774 } c0[0x290] (0)
10:37 karolherbst: this sounds wrong :D
10:39 karolherbst: well, I can be smart later then
10:41 karolherbst: mhh: total local used in shared programs : 27789 -> 28929 (4.10%) and total gprs used in shared programs : 286465 -> 287322 (0.30%)
10:41 karolherbst: I guess I don't reduce live value ranges?
10:42 pmoreau: Weird
10:42 karolherbst: yeah
10:42 karolherbst: maybe RA isn't smart enough to drop the live value early enough or something? :/
10:42 karolherbst: hurt gpr shaders/tomb_raider/8468.shader_test - 1 51 -> 63
10:42 karolherbst: this looks good to investigate what's going wrong
10:43 pmoreau: That’s… quite an increase!
10:43 karolherbst: yeah
10:43 pmoreau: At least it doesn’t go above 64, otherwise that would be quite a performance drop.
10:44 karolherbst: ...
10:44 karolherbst: you know, 63 is always 0
10:44 karolherbst: so the highest gpr count is 63
10:46 karolherbst: ahh I see
10:47 karolherbst: BB:0 128 -> 95 instructions, some later BB: 131->164 instructions
10:47 karolherbst: and I keep a lot of live values in between
10:47 karolherbst: either I need to be more aggressive moving stuff or a lot smarter
10:48 pmoreau: That’s true for SM<32, but not above, which is 255: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
10:48 karolherbst: well on mine it is
10:50 pmoreau: Maybe 63 is still zero on those, but that doesn’t prevent you from using 64->254
10:50 karolherbst: gk106 here
10:51 karolherbst: 63 is the limit
10:51 karolherbst: or somebody messed up and it really isn't
10:51 pmoreau: Which SM version is it?
10:52 karolherbst: I would guess 3.0 then, but I try to look it up
10:53 karolherbst: yeah, 3.0
10:54 pmoreau: So, 63 it is
10:57 karolherbst: is it possible to split tex instructions?
10:57 karolherbst: 57: tex 2D $r13 $s0 f32 { %r1614 %r1615 %r1616 %r1617 } %r1568 %r1571 (0) // ==> into { %r1614 %r1615 %r1616 } and %r1617
10:59 yann|work: hello karolherbst
11:00 pmoreau: I have never interacted with those, sorry
11:00 yann|work: just noticed that my 960M is still always reported at DynPwr :(
11:01 karolherbst: yann|work: odd :/ well it does turn off
11:01 karolherbst: yann|work: can you check if logging out helps?
11:01 karolherbst: yann|work: I would guess some application does crazy things all the time
11:01 karolherbst: like something calling lspci every second
11:02 yann|work: ah, checking that...
11:06 yann|work: well, that did not change much - but even without an X session I still have quite a number of things running
11:08 yann|work: karolherbst: what's lspci precisely doing that could cause such a behaviour ? Reading /sys/bus/pci/devices/0000\:01\:00.0/config ?
11:08 karolherbst: yann|work: yeah
11:09 yann|work: I can easily find who would do that, then, wait a minute
11:18 yann|work: karolherbst: bad guess, it would seem - no one seems to open() anything under 0000:01:00.0, except lspci when I launch it manually
11:18 yann|work: hm, my probe probably would not catch openat() calls, however
11:20 karolherbst: yann|work: well maybe you can just kill the processes one by one and wait like 10 seconds in between until the gpu goes off
11:23 yann|work: will try that, but not right now :)
11:53 karolherbst: I am not quite sure, but I think I do something wrong: https://i.imgur.com/IVCvE7k.jpg
11:55 pmoreau: karolherbst: Nah, looks fine! You just added some nice modern art generator directly in the driver! :-D
11:56 karolherbst: exactly! :D
11:59 pmoreau: I would say, give that feature a fancy name, and ship it with Mesa 12.0!
12:01 karolherbst: but unconditionally enabled, right? Sometimes you have to force the user to their luck
12:02 karolherbst: but I still wonder why that happens
12:08 pmoreau: Of course!
12:15 karolherbst: seriously...
12:15 karolherbst: a frigging mov?
12:16 karolherbst: moved:mov u32 %r270 0x3f800000
12:16 pmoreau: Grrrr… my 64-bit mul/mad splitting doesn’t work with signed values…
12:16 karolherbst: I really don't see why I can't move that mov wherever I want it to be...
12:29 pmoreau: How do you negate a value?
12:32 karolherbst: pmoreau: depends
12:32 karolherbst: pmoreau: usually you do src(i).mod ^ Modifier(NV50_IR_MOD_NEG)
12:32 karolherbst: or just add a new NEG instruction
12:32 pmoreau: Which operations support the neg modifier?
12:33 karolherbst: no idea, some do. I think there is a function to check that?
12:33 karolherbst: I know that arithmetic operations usually can
12:33 pmoreau: Cool, let’s try this
12:33 pmoreau: Thanks
12:34 karolherbst: well in the code you need to do src(i).mod = src(i).mod ^ Modifier(NV50_IR_MOD_NEG) ;)
12:34 karolherbst: ^= doesn't work for whatever reasons
12:34 karolherbst: or I messed up
12:34 karolherbst: who knows
12:35 karolherbst: okay, now my issue: when the head of the BB is join, I can't move my instruction into bb->head...
12:35 karolherbst: flattening messes up in such a case
13:34 karolherbst: pmoreau: this is useless without a scheduler :/
13:40 karolherbst: maybe I will have more luck with moving stuff out of loops
13:41 imirkin: i think i remember saying something like that :p
13:44 karolherbst: yeah well, it does decrease the inst-executed count though
13:45 karolherbst: but moving instructions into conditional branches shouldn't have a big impact anyway
13:45 karolherbst: but that loop thing might increase performance even though we don't schedule, especially when we have nested loops
13:52 imirkin: hakzsam: did you get a chance to try my patches that limit RA?
13:52 hakzsam: imirkin, will test right now
13:54 hakzsam: imirkin, well, your patches fix the dmesg errors, but rendering seems to be totally broken
13:54 imirkin: progress ;)
13:55 imirkin: did you refetch btw? i force-pushed a few times last night
13:55 hakzsam: just cloned
13:55 imirkin: oh GRRRRRR
13:55 imirkin: noticed another emission bug in nvc0 this time
13:55 imirkin: hold on.
13:55 imirkin: er actually, i guess it's not an issue
13:56 hakzsam: I could have a look later today too :)
13:58 karolherbst: hakzsam: I also guess you didn't have any time to check the dual issuing on fermi?
13:59 imirkin: hakzsam: can you check if this patch helps anything? http://hastebin.com/macicudivi.coffee
14:01 karolherbst: imirkin: any good way to check if a BB is part of a loop? Otherwise I can just search for continue/breaks or stuff like that
14:01 imirkin: karolherbst: probably, but i'm not an expert on this stuff
14:02 imirkin: karolherbst: i believe what you're looking for is known as "structurizing" in the various literature.
14:04 karolherbst: also sometimes I see stuff like that: https://gist.github.com/karolherbst/64f36ba58aa18e5af5b807548d1c5cb8
14:04 karolherbst: wouldn't it make more sense to just branch here?
14:05 imirkin: branching ain't free
14:05 imirkin: i think we have a heuristic
14:05 imirkin: the thing is ... even though this all *looks* single-threaded, it's all secretly massively SIMD
14:06 imirkin: so any sort of branch where some lanes jump and others don't means "stop doing simd, do these one at a time"
14:06 imirkin: that's what the joinat/join stuff is about - that tells it when it can start doing simd again
14:06 karolherbst: yeah, makes sense
14:07 karolherbst: so in general, it makes more sense to have a more flat cf
14:07 imirkin: yeah
14:07 imirkin: so what the blob does
14:08 imirkin: is before a long $p0 section
14:08 imirkin: it does a vote on $p0
14:08 imirkin: and if all the threads have it set to 0, it just *jumps* over that section
14:08 karolherbst: uhhh
14:08 imirkin: (vote is an op)
14:08 karolherbst: sounds interesting :)
14:08 karolherbst: I guess we don't use it yet?
14:08 imirkin: no
14:09 karolherbst: anyway, this sounds more useful than the stuff I do now anyway
14:09 karolherbst: so basically, before that section there would be something like vote $p0 BB:296 or something
14:09 imirkin: well no
14:09 imirkin: it'd be p1 = vote(p0)
14:10 imirkin: (not p1) bra end-of-bb
14:10 karolherbst: ahh okay
14:10 karolherbst: so another predicate gets set
14:10 imirkin: i think yeah
14:10 imirkin: just look at some blob shaders
14:10 imirkin: it comes up moderately often
14:10 karolherbst: yeah, doing that
14:11 imirkin: you might have to add a new relocation type to jump to the end of the bb rather than to the start of another one
14:11 imirkin: (since the bb might have multiple destinations)
14:11 imirkin: although ... hrm... probably not
14:12 karolherbst: mhh
14:12 imirkin: or you can only do it in the case where there's only one outbound edge
14:12 karolherbst: envydis will call it vote?
14:12 imirkin: no clue
14:12 imirkin: probably?
14:12 karolherbst: because I don't see it in the examples I have here :/
14:13 imirkin: odd
14:15 karolherbst: and I have a full mmt of heaven
14:16 imirkin: maybe they decided it was a bad idea? dunno. i've definitely seen it.
14:17 karolherbst: or maybe since kepler it doesn't exist anymore?
14:17 karolherbst: or it is something really new?
14:18 karolherbst: okay, for gk110 it is called vote
14:18 karolherbst: gf100 too
14:18 karolherbst: mhh
14:27 karolherbst: okay, anyway, in the entire heaven mmt, not one vote
14:30 imirkin: seems unlikely. probably something else wrong.
14:30 imirkin: like it's not showing half the shaders
14:30 imirkin: or something
14:31 karolherbst: maybe
14:31 karolherbst: well I could just try to add that instruction for my gpu and see if it changes something
14:45 karolherbst: imirkin: and I guess if we already know that every thread has the same value, we can simply branch without joinat/join, right?
14:45 imirkin: yes.
14:45 imirkin: well, you can do that anyways, but it'll kill perf if you do, and they're not all the same
14:46 karolherbst: right
14:46 karolherbst: so in the worst case we lose perf by doing the vote
14:47 imirkin: yes.
14:48 karolherbst: ohh, there is already some vote stuff in mesa
14:48 imirkin: i'm adding it in now.
14:48 imirkin: for ARB_shader_group_vote
14:48 karolherbst: :D I see
14:49 karolherbst: and subOp is either all, any or uni?
14:49 imirkin: right.
14:49 imirkin: so you want "ANY" here, and then not the result.
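The vote-then-skip idea can be modelled per warp as in the sketch below. It only simulates the semantics discussed above; `vote` and `can_skip_section` are hypothetical helpers, and the list of booleans stands in for the value of `$p0` in each lane.

```python
from typing import Sequence


def vote(sub_op: str, lanes: Sequence[bool]) -> bool:
    """Combine one predicate per lane into a single warp-wide result."""
    if sub_op == "any":
        return any(lanes)
    if sub_op == "all":
        return all(lanes)
    if sub_op == "uni":
        # Uniform: every lane holds the same value, whichever it is.
        return all(lanes) or not any(lanes)
    raise ValueError(f"unknown vote sub-op: {sub_op}")


def can_skip_section(p0_lanes: Sequence[bool]) -> bool:
    """p1 = vote.any(p0); (not p1) bra: jump over the section if no lane needs it."""
    p1 = vote("any", p0_lanes)
    return not p1
```

Because the branch condition is the same for every lane, taking it never splits the warp, which is why no joinat/join pair is needed around it.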
14:50 karolherbst: yeah, I figured
14:50 karolherbst: so I write a pass, which scans every BB if all instructions are predicated and the last is _not_ bra
14:50 karolherbst: and if there are like 4 or more, I could add that vote in front
14:51 karolherbst: and branch through the only outgoing edge
14:51 karolherbst: uhhh
14:51 karolherbst: adding a bra means I would have to split the BB
14:52 imirkin: wlellll
14:52 imirkin: you could add a different kind of bra
14:52 imirkin: that is not a flow op
14:52 imirkin: and do a new type of relocation
14:52 karolherbst: mhhh
14:52 imirkin: since you're not fundamentally altering the control flow
14:54 karolherbst: well, that sounds like complicating things
14:55 karolherbst: is there a problem by adding a bra and bb->splitAfter ?
14:59 imirkin: dunno
14:59 imirkin: post-ra, maybe not
15:02 karolherbst: well I would do that pre-ra anyway
15:02 karolherbst: ohh wait
15:02 karolherbst: doesn't work there I think
15:03 karolherbst: right, flattening creates those big predicated BBs
15:18 karolherbst: imirkin: how expensive is a bra without joins?
15:19 imirkin: karolherbst: if every invocation takes it - it's free(ish)
15:19 karolherbst: mhh okay
15:19 imirkin: if only some invocations take it, you go into thread-at-a-time mode until each invocation finishes
15:19 karolherbst: yeah well, I only plan to branch depending on the vote
15:19 imirkin: then you're fine
15:19 karolherbst: so, if the BB has 2 instructions, it doesn't matter anymore?
15:19 karolherbst: vote + predicated bra
15:20 imirkin: ?
15:20 karolherbst: well, if I find a BB with like 2 predicated instructions
15:20 karolherbst: mhh
15:20 karolherbst: well I then execute 2 instructions one way or the other anyway...
15:22 imirkin: right
15:22 imirkin: so there's a "sweet spot" where it makes sense
15:23 imirkin: i figure 6 instructions
15:23 imirkin: based on absolutely no scientific thought process
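The selection rule and the trade-off behind the "sweet spot" could be sketched like this. The block representation and both function names are hypothetical, the threshold of 6 is imirkin's admittedly unscientific guess, and the cost model naively assumes one issue slot per instruction.

```python
from typing import List, Optional, Tuple

# Each instruction is (opcode, predicate); predicate is None when unpredicated.
Instr = Tuple[str, Optional[str]]


def should_guard_with_vote(block: List[Instr], threshold: int = 6) -> bool:
    """True if every instruction shares one predicate, the block does not
    already end in a bra, and it is long enough to be worth skipping."""
    if not block or block[-1][0] == "bra":
        return False
    predicates = {pred for _, pred in block}
    return (len(predicates) == 1
            and None not in predicates
            and len(block) >= threshold)


def issued_instructions(section_len: int, any_lane_active: bool) -> int:
    """Instructions issued for a guarded section: vote + bra, plus the
    section itself unless the whole warp jumps over it."""
    return 2 + (section_len if any_lane_active else 0)


short_block = [("mov", "$p0"), ("add", "$p0")]
long_block = [("mad", "$p0")] * 6
```

With two predicated instructions the guard costs as much as it can ever save, which is the case karolherbst worries about above; with six, a fully inactive warp issues 2 instructions instead of 6.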
15:37 karolherbst: mhh, how can I create a predicate value after RA?
15:40 imirkin: yes... RA's a bitch =/
15:41 imirkin: hakzsam: did you get a chance to try my second patch?
15:47 mupuf: imirkin: can you control your fan on your GPU that does crash?
15:47 imirkin: you mean the GF108 that i took out? or the GT215 that's still in?
15:47 imirkin: or the GK208 that i just added?
15:48 imirkin: anyways, give me instructions, as long as they don't involve rebooting, i'm happy to try them out
15:48 mupuf: well, the one with which you have had issues
15:48 karolherbst: imirkin: huh? Assertion `i->src(0).getFile() == FILE_PREDICATE && i->def(1).getFile() == FILE_PREDICATE' failed.
15:48 karolherbst: imirkin: vote has two defs?
15:48 mupuf: sure then
15:48 imirkin: karolherbst: nfc
15:48 imirkin: karolherbst: i'm about to start playing with it too
15:49 karolherbst: guess who wrote that emitVOTE part :D
15:49 karolherbst: hakzsam: I think I need your help a little :D
15:52 imirkin: oh, you should assume it's wrong then
15:52 imirkin: i think he was just playing around
15:53 karolherbst: okay
15:53 imirkin: i do think that VOTE has both a reg and predicate output
15:53 karolherbst: envydis says this: { 0x4800000000000004ull, 0xfc00000000000067ull, N("vote"), N("all"), DST, PDST2, T(pnot1), PSRC1 },
15:54 imirkin: right
15:54 imirkin: so like i said.
15:54 mupuf: imirkin: cd /sys/class/drm/card0/device/hwmon/hwmon*/; echo 1 > pwm1_enable; echo 50 > pwm1
15:54 imirkin: the assert is a bit overzealous
15:54 imirkin: mupuf: and then?
15:54 mupuf: and if you had a dmesg, that would be nice
15:55 mupuf: if the fan speed really changed, tell me
15:55 imirkin: 100 is max right?
15:55 mupuf: if it did not, try going higher, but not going to 100
15:55 mupuf: yep
15:55 mupuf: 75 max
15:55 imirkin: so card0 is my GK208
15:55 imirkin: which is what upset the whole situation
15:55 mupuf: good
15:55 mupuf: do you have the workaround applied?
15:55 imirkin: so i have pwm1_max = 80
15:55 imirkin: yes
15:55 mupuf: good
15:55 imirkin: ;)
15:56 imirkin: otherwise it crashed in under 24h
15:56 imirkin: right now pwm1 is at 41 (min is 40)
15:56 karolherbst: mhhh
15:56 mupuf: well, you are lucky that you never need to increase the fan speed
15:56 karolherbst: imirkin: just adding the vote shouldn't do anything, right? :/
15:57 imirkin: karolherbst: right.
15:57 imirkin: karolherbst: unless you overwrite some reg ;)
15:57 mupuf: because I do not see how fan_tog would ever work in conjunction with automatic fan management
15:57 karolherbst: mhh, well the stuff turned yellowish :D
15:57 mupuf: and I am *very* surprised there are users for it. I thought it was on nv40 cards only :o
15:57 imirkin: mupuf: i'm not 100% sure that the fan speed changes.
15:57 mupuf: right, that makes sense
15:58 imirkin: mupuf: oh, well it could also be the NV34
15:58 mupuf: so, we have two bugs
15:58 imirkin: which i also have plugged in
15:58 mupuf: the nv34 is more likely, yes
15:58 imirkin: well, that one doesn't have a hwmon dir at all
15:58 mupuf: can I see your kernel logs?
15:58 imirkin: well, it has it
15:58 imirkin: but there's nothing there
15:59 imirkin: mupuf: http://hastebin.com/lufupacuvu.css
15:59 mupuf: shit, skeggsb made nouveau way less verbose, so we do not get information about which type of fan it is
15:59 karolherbst: imirkin: guess what, envydis says vmin
15:59 imirkin: i gave you references to the vbioses i think
15:59 mupuf: imirkin: yes, you did
16:00 imirkin: karolherbst: check nvdisasm.
16:01 karolherbst: the emitter was wrong
16:01 karolherbst: remove the stuff except the op and subop: $p0 vote any $r0 $p0 $p0
16:01 imirkin: mupuf: hm, well, i dunno. i just checked my other card, and i'm quite sure fan control works there
16:02 imirkin: mupuf: but i also didn't hear it spinning up and down
16:02 imirkin: so ... could be that fan control broke. or my system is too loud coz it's 30 degC in nyc
16:02 mupuf: which is likely an indication that fan management does not work
16:02 mupuf: fans are obnoxiously loud when they spin up
16:02 imirkin: well, it definitely *used to*
16:02 mupuf: but hwmon would tell you if the RPM changed
16:02 imirkin: no rpm =/
16:03 mupuf: :s
16:05 mupuf: are you sure you changed the mode to 1 for pwm1_enable?
16:05 imirkin: yes.
16:05 mupuf: otherwise, it will just throw a EINVAL
16:05 mupuf: and when you read pwm1 back, you get the value you set?
16:05 imirkin: ye
16:07 mupuf: ok
16:08 imirkin: karolherbst: http://hastebin.com/ususikilak.coffee
16:08 imirkin: this is what i'm about to start testing with
16:09 imirkin: karolherbst: er wait, i missed something. hold on.
16:09 karolherbst: anyway, how can I convert nouveau output so that nvdisasm eats it?
16:10 mupuf: well, from your trace, it looks like it is using fan_toggle which is more than legacy so it suggests we do something wrong with your vbios/gpu. But anyway, I see how you could get a deadlock
16:10 mupuf:is thinking whether we need a lock at all here
16:10 imirkin: karolherbst: http://hastebin.com/koluvuleye.coffee
16:11 imirkin: karolherbst: perl -ane 'foreach (@F) { print pack "I", hex($_) }' > tt; nvdisasm -b SM20 tt
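For reference, the perl one-liner above splits each input line on whitespace, parses every field as a hex number and writes it out as a native 32-bit word. A rough Python equivalent, with the hypothetical helper name `words_to_bytes`, might be:

```python
import struct


def words_to_bytes(text: str) -> bytes:
    """Turn whitespace-separated hex words into raw native-endian 32-bit words."""
    out = bytearray()
    for line in text.splitlines():
        for token in line.split():
            # "=I" is a 4-byte unsigned int in native byte order.
            out += struct.pack("=I", int(token, 16))
    return bytes(out)


blob = words_to_bytes("00001cfc 48400000\n0x00000004")
```

The result would then be written to a file and handed to `nvdisasm -b SM20`, as in the command above. Unlike perl's `hex`, `int(token, 16)` raises on fields that are not valid hex, so any non-hex columns in the dump need to be stripped first.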
16:11 imirkin: adjust to taste
16:11 mupuf: I guess we are not fine with calling kmalloc 10 times per second, right?
16:11 mupuf: fuck I hate this bitbanging using ptimer.alarm
16:12 imirkin: karolherbst: obv you'd adjust the 255 to 63 in your version, and offsets are probably different.
16:12 karolherbst: right
16:13 imirkin: and there will be more peephole things i'll want to do
16:13 imirkin: but that's for later
16:18 karolherbst: ohh crap
16:18 karolherbst: 00000270: 00001cfc ld b32 $r0 c7[$r0]
16:19 karolherbst: I think I split an opcode now...
16:20 karolherbst: ahh, last 4 changed lines
16:20 karolherbst: do I have to change something at the 7 << 16?
16:21 imirkin: probably.
16:21 imirkin: 7 is for P7
16:21 imirkin: aka PT
16:21 karolherbst: in any case, "code[0] |= 63 << 2" is wrong
16:22 karolherbst: for the first thing
16:22 imirkin: yeah
16:22 imirkin: 2 is the right offset for SM35
16:23 gouchi: hi
16:23 gouchi: imirkin: thank you, nouveau.tv_norm fixed the issue ;-)
16:23 imirkin: awesome
16:27 karolherbst: VOTE.ANY R0, P1, P0; /* 0x4840000000001c24 */
16:27 karolherbst: mhh
16:27 imirkin: so that's storing the result of the vote in both R0 and P1
16:27 imirkin: based on the contents of P0
16:28 karolherbst: ahh okay
16:28 karolherbst: so it sets the predicate and the registers
16:30 karolherbst: VOTE.ANY P1, P0; /* 0x48400000000fdc24 */
16:30 karolherbst: much better now :)
16:31 karolherbst: and no visual change anymore
16:34 imirkin: cool
16:36 mupuf:really wonders if it is OK to use workqueues for a never-ending task
16:37 karolherbst: mupuf: only if they are interruptible
16:37 mupuf: sure, they will sleep all the bloody time
16:37 karolherbst: doesn't mean you can interrupt/stop them
16:38 mupuf: but I don't want the workqueue scheduler to wait for me
16:38 karolherbst: change priotiry?
16:38 karolherbst: *priority
16:38 mupuf: yes, they are interruptible, they would not depend on any other task
16:39 mupuf: I mean, it would
16:39 mupuf: but ... it still feels icky
16:40 mupuf: well, maybe I need to suck it up and use kernel threads. I just need to add the compatibility code for the userspace nouveau
16:41 imirkin: mupuf: i think it's ok to use a wq
16:41 imirkin: mupuf: the issue is that you're doing this directly from irq
16:41 imirkin: instead of delaying onto the qp
16:41 imirkin: wq*
16:41 mupuf: well, no, the issue is that we have a deadlock
16:42 mupuf: I do agree that having a workqueue handle the work would be good, but it would not solve anything
16:42 mupuf: or actually, would it?
16:42 mupuf: the issue we have is that, sometimes, when scheduling work, alarm() will automatically start calling other functions scheduled
16:43 mupuf: and that's how you can create loops
16:43 imirkin: mupuf: the issue is that alarm executes in-thread instead of delaying unconditionally
16:43 mupuf: yeah, exactly
16:43 imirkin: mupuf: it does that if the timeout == 0 iirc
16:43 imirkin: which is just broken
16:44 imirkin: remove that junk, and all will be well
16:44 mupuf: well, it is necessary ... because it will never be signaled otherwise
16:45 imirkin: ?
16:45 imirkin: delay into the wq unconditionally seems like the proper course of action.
16:46 imirkin: this is how these things normally work.
16:46 imirkin: and then you have a different work function called from the wq which does the work.
16:46 imirkin: without checking for any times/etc
16:46 karolherbst: we should rename CC_NOT_P to CC_NOT
16:46 karolherbst: ...
16:46 imirkin: no
16:47 mupuf: imirkin: right, that would be an acceptable solution
16:47 mupuf: but then, one thing that I dislike ... when we do this, we need to allocate a different object for every callback
16:47 mupuf: and ... that means kmalloc 10 times per second
16:48 imirkin: (a) not such a huge deal, (b) if you're clever, you can reuse
16:49 mupuf: well, that's the issue, everything is trying to be clever here :s
16:49 mupuf: i'll try to find a way
16:49 mupuf:really considers the entire alarm() thing very fragile
16:50 mupuf: you know what, screw it, therm will create two workqueues
16:50 mupuf: one for temperature polling and one for changing the fan toggle (SW PWM)
16:50 karolherbst: imirkin: mehr :/ the vote stuff in this case has like no effect whatsoever :/
16:50 mupuf: this way, we get rid of this insanity, no kmalloc and a code that is super easy to read
16:51 imirkin: i like that.
16:52 mupuf: now, if only someone could tell me if it is an accepted practice :s
16:52 karolherbst: imirkin: I decided now to always jump above those things... no change
16:54 imirkin: gr, now i need to write piglit tests
16:54 imirkin: I HATE TESTING
16:54 karolherbst: :D
16:59 mupuf: imirkin: think about the peace of mind you will get :)
17:00 imirkin: mupuf: the peace of mind i'll get if i never have to write tests ever again, ever?
17:00 imirkin: that would be so nice
17:02 karolherbst: imirkin: by the way, what does vote give me? If for $subOp thread the condition is true?
17:03 karolherbst: *predicate
17:03 mupuf: imirkin: that would be if you wrote an AI :D
17:03 mupuf: a pretty good one :D
17:06 imirkin: karolherbst: i assume, yes
17:06 karolherbst: imirkin: okay, I enabled that pass for every BB with 2 predicated instructions or more, and no visual change now :)
17:06 karolherbst: I think the pass works now, but I have no idea if it helps _anywhere_ at all
17:09 karolherbst: ohh right, and I also have to deal with the reg.id of that new predicate...
17:16 imirkin: stupid hardware. why doesn't VOTE.EQ work :(
17:17 imirkin: karolherbst: feel like running a shader test on nvidia?
17:17 imirkin: karolherbst: http://hastebin.com/atixonoyud.coffee
17:17 imirkin: (would like to get a mmt if it passes)
17:21 karolherbst: imirkin: yeah, it passes
17:22 imirkin: mmt would be great
17:23 imirkin: coz i'm generating seemingly-correct code
17:23 imirkin: and yet it fails
17:24 karolherbst: imirkin: http://filebin.ca/2iqSSRkQFaB8/tmp.mmt.xz
17:24 imirkin: cool thanks
17:30 imirkin: and i thought our codegen was bad
17:30 imirkin: this must be hitting like every bad corner case in the nvidia compiler
17:31 karolherbst: :D
17:33 imirkin: their atomicadd is way better though
17:33 imirkin: since it does 1 op per simd group instead of 1 per lane
17:34 imirkin: hm interesting
17:34 imirkin: vote doesn't return what i thought in the register
17:35 karolherbst: now that I think of the stuff you said today, maybe in those eon games, we joinat/join too much... this might be a rather good reason why the perf is really bad...
17:35 imirkin: i figured it just returned the "same" as the predicate
17:35 imirkin: but in 0/-1 form
17:35 imirkin: but it's some "real" value... maybe a mask with which threads were true? dunno.
17:35 karolherbst: :O
17:35 karolherbst: well
17:35 karolherbst: it might make sense actually
17:37 mwk: tests, eh
17:37 mwk:needs to write a fuckton of tests for Falcon relocations
17:38 mwk: I'm sort of convinced they work, but... :)
17:38 imirkin: ok, so now i just need to make the same adjustments for nvc0 and gm107 and i'm good to go
17:38 karolherbst: ohh wow
17:39 karolherbst: 26 joinat/join pairs
17:39 karolherbst: in a 520 instruction shader
17:39 karolherbst: this sounds like much pain
17:39 karolherbst: and I am stupid again
17:39 karolherbst: because this was pre ra
17:40 karolherbst: 6 joins, 3 joinats after RA
17:42 karolherbst: mhh, odd? https://gist.github.com/karolherbst/1ff9229ba5cce448e623dd6f0b8d5f0c
18:01 karolherbst: okay, in 270 we know what $p0 is already, we don't need this set for that https://gist.github.com/karolherbst/76a5e562cbecdeb9717817d250680875
18:02 pmoreau: imirkin: vote is probably the same as __ballot() in CUDA (or rather the other way round): http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions
18:03 karolherbst: or __all or __any ;)
18:03 pmoreau: imirkin: If you need some Fermi testing, I can plug in a GF100.
18:03 imirkin: pmoreau: yeah, the register return for VOTE is probably the output of __ballot()
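If the register result really is a ballot-style mask, as guessed above, its relation to the any/all predicates can be sketched as follows. The `ballot` helper is hypothetical and only mirrors the documented behaviour of CUDA's `__ballot()`, where bit N is set when lane N's predicate is true; whether VOTE's register output matches this is the log's assumption, not a confirmed fact.

```python
from typing import Sequence


def ballot(lanes: Sequence[bool]) -> int:
    """Build a bitmask with bit N set when lane N's predicate is true."""
    mask = 0
    for lane, predicate in enumerate(lanes):
        if predicate:
            mask |= 1 << lane
    return mask


lanes = [True, False, True, True]
mask = ballot(lanes)
full = (1 << len(lanes)) - 1

any_result = mask != 0      # equivalent of vote any
all_result = mask == full   # equivalent of vote all
```

This would also explain why the register does not simply hold 0 or -1: the mask carries strictly more information than the predicate output.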
18:03 pmoreau: Except they do not seem to return a mask
18:03 imirkin: huh?
18:03 pmoreau: (for __all and __any
18:03 imirkin: oh right
18:03 imirkin: but that's an API failing
18:04 imirkin: i think the hw supports it
18:04 imirkin: probably.
18:04 pmoreau: Most likely
18:05 karolherbst: or they left it out because it is useless
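Note: a minimal sketch of the warp-vote semantics being discussed, following the CUDA documentation linked above: `__ballot()` yields a mask with one bit per lane whose predicate is true, while `__any()` and `__all()` only report non-zero/zero. The names `ballot`, `vote_any` and `vote_all` are illustrative, and whether the register result of VOTE really is this mask is only a guess in the conversation.

```python
def ballot(preds):
    """Return a bitmask with bit N set when lane N's predicate is true."""
    mask = 0
    for lane, pred in enumerate(preds):
        if pred:
            mask |= 1 << lane
    return mask


def vote_any(preds):
    """True when at least one lane's predicate is true."""
    return ballot(preds) != 0


def vote_all(preds):
    """True when every lane's predicate is true."""
    return ballot(preds) == (1 << len(preds)) - 1


# Four lanes for brevity; a real warp has 32.
lanes = [True, False, True, True]
mask = ballot(lanes)  # bits 0, 2 and 3 set
```

Both `vote_any` and `vote_all` can be derived from the mask, which is why a mask-returning instruction would be a superset of the other two.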
18:06 imirkin: pmoreau: no special need to plug cards in ... hakzsam will work it out
18:07 karolherbst: does it make sense to have anything between joinat and a bra?
18:08 imirkin: doesn't matter
18:08 imirkin: joinat pushes a value onto the callstack
18:08 imirkin: and join pops that off and jumps to that location
18:09 karolherbst: okay
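Note: a toy model of the joinat/join behaviour imirkin describes, as a sketch only. It assumes a single thread and a hypothetical instruction encoding of `(opcode, argument)` tuples; real hardware also tracks per-thread divergence, which is ignored here.

```python
def run(program):
    """Execute a tiny program and return the list of visited instruction indices."""
    callstack = []
    trace = []
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        trace.append(pc)
        if op == "joinat":
            callstack.append(arg)   # push the join target
            pc += 1
        elif op == "join":
            pc = callstack.pop()    # pop the target and jump to it
        elif op == "bra":
            pc = arg                # plain branch
        elif op == "exit":
            break
        else:                       # any other instruction just falls through
            pc += 1
    return trace


PROGRAM = [
    ("joinat", 5),   # 0: push 5
    ("nop", None),   # 1: something between joinat and bra; harmless
    ("bra", 4),      # 2: branch over instruction 3
    ("nop", None),   # 3: skipped
    ("join", None),  # 4: pop 5 and jump there
    ("exit", None),  # 5
]
```

In this model the instruction between `joinat` and `bra` has no effect on where `join` lands, since the target was already pushed.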
18:23 pmoreau: imirkin: Okay
18:29 karolherbst: mhh
18:29 karolherbst: dolphin-emu crashes with nouveau
18:29 karolherbst: but it also uses EGL
18:30 karolherbst: dri2_ctx is NULL
18:30 karolherbst: well
18:30 karolherbst: in src/egl/drivers/dri2/platform_x11_dri3.c:101
18:31 imirkin: where's hdkr when i need to yell at him
18:31 karolherbst: ahh so it is dolphin's fault? :D
18:31 karolherbst: doesn't look like it though
18:32 karolherbst: trace: https://gist.github.com/karolherbst/0f60ea41b686c88aefb1d02c8ee0b75b
18:32 karolherbst: *backtrace
18:53 karolherbst: imirkin: https://bugs.freedesktop.org/show_bug.cgi?id=94925
18:53 karolherbst: :D
19:07 imirkin: pmoreau: oh, looks like there's a ARB_shader_ballot that covers the ballot functionality
19:08 karolherbst: imirkin: should this succeed on nouveau? eglCreateContext(dpy = 0x7f3d6c2586c0, config = 0x7f3d6c2ab760, share_context = NULL, attrib_list = [EGL_CONTEXT_OPENGL_PROFILE_MASK, EGL_CONTEXT_OPENGL_CORE_PROFILE_BIT, EGL_CONTEXT_FLAGS_KHR, EGL_CONTEXT_OPENGL_FORWARD_COMPATIBLE_BIT_KHR, EGL_CONTEXT_MAJOR_VERSION, 4, EGL_CONTEXT_MINOR_VERSION, 3, EGL_NONE])
19:09 imirkin: karolherbst: if the driver you're running it on exposes GL 4.3, yes
19:09 karolherbst: well it does
19:09 karolherbst: mhh okay
19:09 karolherbst: well dolphin devs tried to be smart and create 4.5, if fails create 4.4 and so on
19:09 karolherbst: but it ends up checking for 3.3, and then the last one with no args
19:10 karolherbst: so the only context creation which succeeds is this: eglCreateContext(dpy = 0x7f3d6c2586c0, config = 0x7f3d6c2ab760, share_context = NULL, attrib_list = [EGL_NONE]) = 0x7f3d6c2a4480
19:10 karolherbst: and the others return 0x0
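Note: the fallback pattern karolherbst describes (try the newest version, step down, finally try with no attributes) can be sketched as below. This is not dolphin's code; `create_context` is a hypothetical callable standing in for `eglCreateContext`, returning `None` on failure.

```python
def create_best_context(create_context, versions):
    """Try each (major, minor) in order; fall back to a request with no version."""
    for major, minor in versions:
        ctx = create_context(major, minor)
        if ctx is not None:
            return ctx, (major, minor)
    # Last resort: no version attributes at all, like attrib_list = [EGL_NONE].
    return create_context(None, None), None


def fake_driver(max_version):
    """Build a stand-in create_context that accepts versions up to max_version."""
    def create_context(major, minor):
        if major is None:
            return "default-context"
        if (major, minor) <= max_version:
            return f"context-{major}.{minor}"
        return None
    return create_context


VERSIONS = [(4, 5), (4, 4), (4, 3), (3, 3)]
ctx, version = create_best_context(fake_driver((4, 3)), VERSIONS)
```

With a stand-in that accepts up to 4.3, the loop stops at `(4, 3)`; the trace in the log corresponds to the case where every versioned attempt fails and only the final call returns a context.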
19:12 imirkin: i haven't traced that stuff
19:12 imirkin: and i'm not an EGL expert, neither spec, nor mesa impl
19:12 hakzsam: imirkin, your second patch doesn't improve the thing
19:12 imirkin: hakzsam: ok. i'll push it without cc to stable then.
19:13 hakzsam: which one? the first one?
19:13 imirkin: the second one
19:13 imirkin: before the BF bit was being set
19:13 imirkin: which is unexpected
19:13 imirkin: since it was writing 0x3f800000 instead of 0xffffffff
19:13 imirkin: but in practice it shouldn't matter
19:13 hakzsam: okay
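Note: the two constants imirkin mentions are the IEEE 754 single-precision bit pattern of 1.0 and an all-ones 32-bit word (integer -1). A small sketch to check this, with illustrative helper names:

```python
import struct


def float_bits(value):
    """Return the 32-bit pattern of a single-precision float as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def as_int32(bits):
    """Reinterpret an unsigned 32-bit pattern as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", bits))[0]


one_as_bits = float_bits(1.0)        # 0x3f800000
all_ones = as_int32(0xFFFFFFFF)      # -1
```

So a "true" written as float 1.0 and a "true" written as integer -1 differ in most bits, even though both are non-zero.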
19:14 hakzsam: karolherbst, nope sorry and I'm too tired today for testing this stuff, tomorrow for sure :)
19:14 imirkin: hakzsam: does running without opts help? iirc it uses fewer regs that way
19:14 hakzsam: no
19:14 imirkin: k
19:18 pmoreau: imirkin: Interesting :-)
19:18 imirkin: can't add it coz no int64 in mesa yet
19:18 imirkin: and the spec relies on int64 =/
19:18 imirkin: (i guess to account for 64-bit wide SIMD groups
19:19 imirkin: not sure which hw has those
19:19 pmoreau: I started looking at ARB_gpu_shader_int64 yesterday, since I was working on 64-bit int/uint
19:19 imirkin: s/64-bit/64-/
19:19 imirkin: airlied has a branch which implements parts of it
19:19 pmoreau: But I gave up when I realised there were no piglit tests for it
19:19 pmoreau: Oh, okay. Maybe I’ll have a look
19:20 imirkin: it's in early days, not anywhere close to done
19:20 pmoreau: But I first want to fix my 64-bit MUL/MAD split patch. It’s working for U64, but fails completely if I give it signed numbers
20:07 imirkin: whoa. the flickering in talos is WAY worse with the GK208 than with the GF108...
20:07 imirkin: pmoreau: make judicious use of the mulhi subop.
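Note: one way to picture the mulhi hint is the usual split of a 64-bit multiply into 32-bit pieces. The sketch below is not pmoreau's patch; the helper names are made up, and it only computes the low 64 bits of the product, which in two's complement are the same for signed and unsigned operands.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mul32_lo(a, b):
    """Low 32 bits of a 32x32 multiply."""
    return (a * b) & MASK32


def mulhi_u32(a, b):
    """High 32 bits of an unsigned 32x32 multiply."""
    return ((a * b) >> 32) & MASK32


def mul64(a, b):
    """Low 64 bits of a * b, built only from 32-bit operations."""
    a &= MASK64  # two's-complement view of possibly negative inputs
    b &= MASK64
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    lo = mul32_lo(a_lo, b_lo)
    # The carry out of the low product comes from the unsigned mulhi;
    # the cross terms only contribute their low 32 bits.
    hi = (mulhi_u32(a_lo, b_lo) + mul32_lo(a_lo, b_hi) + mul32_lo(a_hi, b_lo)) & MASK32
    return (hi << 32) | lo
```

The `a_hi * b_hi` term never appears because it only affects bits 64 and above.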
20:17 mupuf: skeggsb: hey, what did you base your latest rebase upon?
20:17 mupuf: drm-next is incompatible
21:10 yann|work: karolherbst: progressively killing nearly everyone gave no result (kept systemd and friends, udisks, rsyslog, dbus), just disabled a couple of services that should not be there permanently while doing that (yes, only some for which stopping them did not change a thing), and now right after boot the nvidia gpu is off, with Xorg session running
21:11 yann|work: go wonder
21:11 karolherbst: yann|work: well, maybe your power manager disables the runtime pm feature
21:12 karolherbst: yann|work: and when you unplug your charger it gets enabled again?
21:13 imirkin: yann|work: my guess is you have a phantom VGA output
21:13 imirkin: yann|work: grep . /sys/class/drm/*/status
21:13 yann|work: karolherbst: bingo - not when I unplug it, but when I replug it
21:14 yann|work: imirkin: only eDP-1 connected
21:14 imirkin: do you have a VGA port in there though?
21:14 imirkin: (in the list)
21:14 yann|work: no
21:15 imirkin: hm ok
21:15 yann|work: DP, eDP, and 2 HDMIs
21:15 imirkin: o well
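Note: the `grep . /sys/class/drm/*/status` check suggested above prints each connector's status file. A rough Python equivalent, as a sketch: the function name is made up, and the `base` parameter exists only so it can be pointed at a different directory.

```python
import glob
import os


def connector_status(base="/sys/class/drm"):
    """Map each connector directory name to the contents of its status file."""
    result = {}
    for path in sorted(glob.glob(os.path.join(base, "*", "status"))):
        name = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            result[name] = handle.read().strip()
    return result
```

A connector reporting `connected` with nothing plugged in would be the "phantom output" imirkin is looking for.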
21:15 karolherbst: yann|work: then your power manager is messing with you
21:15 yann|work: I had noticed that the suspend/resume on this machine had quite a number of issues, this is just one more on the list :(
21:15 karolherbst: yann|work: I think also powertop can do something like that?
21:17 yann|work: karolherbst: right, I can set it off from there
21:18 karolherbst: well maybe laptop-mode-tools does the same, who knows what turns it off now
21:18 karolherbst: something will do this
21:18 yann|work: (by setting runtime pm to good for it)
21:19 yann|work: I have laptop-mode installed too, never tried to play with its settings
21:29 skeggsb: mupuf: the drm-next commit id it's based on is in the log messages somewhere
21:30 mupuf: skeggsb: ok, thanks