00:16 Lyude: EYYYY]
00:16 Lyude: fixed the random disp fail on the P50
00:16 Lyude: karolherbst, skeggsb ^ :)
00:19 karolherbst: Lyude: nice
00:19 karolherbst: Lyude: patch or it didn't happen
00:19 Lyude: karolherbst: hehe, coming up in just a moment
03:30 rhyskidd: Lyude: nice, well done
08:55 diogenes_: Hello guys, i've got an optimus system (intel+nvidia) and it's been always fine to run it with nouveau but for some time, i've noticed that the nvidia card is always active and it shouldn't be active unless i run something with DRI_PRIME=1, i don't know what causes it to stay active all the time so maybe you could help me to figure that out?
08:56 diogenes_: as a result i'm getting a much higher cpu temperature and that's annoying
12:22 karolherbst: diogenes_: can have different causes
12:22 karolherbst: diogenes_: maybe check dmesg
12:29 diogenes_: karolherbst, i just did modprobe.blacklist=nouveau in kernel params so whenever i will need the nvidia card, i'll do modprobe nouveau.
14:25 pendingchaos: karolherbst: any comments on the 4th patch of the 3rd (and latest) version of the xmad patches?
14:44 karolherbst: pendingchaos: I will take a look later today
14:45 pendingchaos:nods
14:45 pendingchaos: after that, I think I'll post a new version with the 4th patch split up
15:17 Subv: hi, in maxwell, does the SSY instruction push the address to some stack (and the SYNC instruction pops from it) or is it a single target register?
15:25 karolherbst: Subv: kind of like that
15:25 Subv: what do you mean?
15:25 karolherbst: Subv: it sets the address a sync jumps to
15:26 karolherbst: I think it is a stack actually, like we have with the prebreak and precont instructions as well
15:26 Subv: ah, i imagine the parameter of the SYNC instruction says whether to actually jump or just pop the stack?
15:27 Subv: there's a few weird things in the SYNC parameter (on envydis) like 'Fcsm_Tr'
15:28 karolherbst: Subv: uhm, which ISA do you mean?
15:28 Subv: gm107
15:29 karolherbst: sync shouldn't have any arguments there in envydis
15:29 karolherbst: ohh wait
15:29 karolherbst: we have that f0f8_0 thing, right
15:30 Subv: yeah
15:31 karolherbst: this is the conditional parameter actually
15:31 Subv: it seems to be used in most control flow instructions
15:31 karolherbst: mhh
15:31 karolherbst: we _could_ use it in mesa actually
15:31 Subv: what does it do?
15:31 karolherbst: conditional execution
15:31 Subv: it seems like a grab bag of condition codes and some other weird stuff
15:31 karolherbst: yes
15:32 Subv: any docs on what they mean and what value they check for the condition?
15:33 karolherbst: doubtful
15:34 Subv: i see, well, do you know who i could ask what they mean?
15:35 karolherbst: nvidia
15:35 karolherbst: ;)
15:35 karolherbst: maybe skeggsb knows something
15:35 karolherbst: but I kind of doubt it
15:37 Subv: how would you implement that in mesa if nobody knows how it works?
15:39 karolherbst: well, we would have to figure that out as well
15:42 Subv: i think i once saw a game use 'exit fcsm_tr' (on the Switch) but i can't remember which one it was
16:14 imirkin: Subv: stack
16:15 imirkin: which has to be unwound in the same order that it's created. so like SSY/PBK/etc push stuff on the stack
16:15 imirkin: and SYNC/BREAK/etc pop stuff off
16:16 imirkin: the entry being popped off has to match the type
16:16 imirkin: or else you get an error
16:16 imirkin: (99% sure of that)
16:16 imirkin: the stack is also not infinitely-sized... i think the size is specified in the shader header.
16:16 imirkin: or perhaps a method. i forget.
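The stack behaviour imirkin describes above can be sketched as a toy model. This is only an illustration under stated assumptions: the class name, the `max_depth` parameter and the error types are hypothetical, the push/pop pairing is limited to the SSY/SYNC and PBK/BREAK pairs named in the log, and divergence and thread masks are ignored entirely.

```python
class ControlFlowStack:
    """Toy model of the control-flow stack (hypothetical, not hardware-accurate)."""

    # Which pushed entry kind each popping instruction expects (assumed pairing).
    EXPECTED_ENTRY = {"SYNC": "SSY", "BREAK": "PBK"}

    def __init__(self, max_depth):
        self.max_depth = max_depth  # the log says the size is bounded
        self.entries = []           # list of (kind, target_address)

    def push(self, kind, target):
        if len(self.entries) >= self.max_depth:
            raise OverflowError("control-flow stack overflow")
        self.entries.append((kind, target))

    def pop(self, op):
        if not self.entries:
            raise RuntimeError(f"{op} with an empty stack")
        kind, target = self.entries[-1]
        if kind != self.EXPECTED_ENTRY[op]:
            # The popped entry has to match the type, or else you get an error.
            raise RuntimeError(f"{op} found a {kind} entry on top of the stack")
        self.entries.pop()
        return target


stack = ControlFlowStack(max_depth=4)
stack.push("SSY", 0x100)   # SSY pushes the address a later SYNC jumps to
stack.push("PBK", 0x200)   # PBK pushes the address a later BREAK jumps to
print(hex(stack.pop("BREAK")), hex(stack.pop("SYNC")))
```

Entries are unwound in the reverse order of their creation, which is why the BREAK has to come before the SYNC here.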
16:22 pendingchaos: imirkin: perhaps launch descriptor's cstack_size/WARP_CSTACK_SIZE?
16:24 pendingchaos: I think there's some commented code in nvc0_program.c for setting something similar
16:25 imirkin: sounds right
16:27 pendingchaos: since it isn't set, I wonder what it defaults to
16:27 imirkin: it's set... can't not be set.
16:28 pendingchaos: I don't see where?
16:29 imirkin: probably with hdr[x]=y
16:30 pendingchaos: with NV50_PROG_DEBUG=1 on a random shader: HDR[08] = 0x00000000
16:30 pendingchaos: I think I'm looking at the right word?
16:32 pendingchaos: after adding non-uniform control flow, it seems to still be zero
16:37 pendingchaos: I think it's meant to be HDR[0c] and the commented code is wrong, though it still doesn't seem to be set
16:48 Subv: i see
16:49 Subv: any idea what tab5c40_1 means in the lop instruction, btw? (gm107)
16:49 Subv: values seem to be {"t", "z", "nz"} https://github.com/envytools/envytools/blob/master/envydis/gm107.c#L1336
17:16 karolherbst: imirkin: if a shader faults, we basically just have a dead channel and userspace has no way to figure out which shader was causing it, right?
17:20 karolherbst: pendingchaos: regarding that 4th xmad patch, did you do a perf benchmark with that patch vs only mul->1 shift + mul with smallimm => 2 xmad?
17:22 karolherbst: basically your current branch vs what you posted
17:23 karolherbst: also I would be kind of interested if those "(2^shl + 1)" cases indeed give us a perf improvement as well (if affected games show any differences that is)
17:25 karolherbst: pendingchaos: one thing: we can get 64 bit muls there as well
17:25 karolherbst: but your code only handles 32 bit ones
17:25 karolherbst: (the imm argument is a 32 bit int)
17:25 karolherbst: (and other things, like creating 32 bit ops)
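The "mul with smallimm => 2 xmad" case karolherbst mentions can be checked with a little arithmetic. The sketch below assumes XMAD behaves as a 16x16-bit multiply plus a 32-bit add, with an optional left shift of the product by 16; the helper names `xmad` and `mul_small_imm` are hypothetical and this is not the code from the patches. It only shows why two such operations suffice for a 32-bit value times an immediate that fits in 16 bits.

```python
MASK32 = 0xFFFFFFFF


def xmad(a16, b16, c32, shift_product=False):
    """Assumed XMAD model: (a16 * b16 [<< 16]) + c32, truncated to 32 bits."""
    product = (a16 & 0xFFFF) * (b16 & 0xFFFF)
    if shift_product:
        product <<= 16
    return (product + c32) & MASK32


def mul_small_imm(a, imm):
    """32-bit a * imm using two xmad steps, valid when imm fits in 16 bits."""
    assert 0 <= imm <= 0xFFFF
    # a = hi * 2**16 + lo, so a * imm = lo * imm + ((hi * imm) << 16)
    low_part = xmad(a & 0xFFFF, imm, 0)
    return xmad(a >> 16, imm, low_part, shift_product=True)


a, imm = 0x12345678, 1000
print(mul_small_imm(a, imm) == (a * imm) & MASK32)
```

A 64-bit multiplication does not fit this scheme, which matches the remark that the code only handles 32-bit ones.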
17:32 karolherbst: or mhh, the code kind of looks fishy anyway, let me read some code
17:32 karolherbst: I mean the original one
17:37 imirkin: pendingchaos: hrmph. odd.
17:37 imirkin: karolherbst: you can have a trap handler
17:37 karolherbst: imirkin: yeah, that was my plan to work on in the end
17:38 karolherbst: first stap would be to just identify the faulting shader though
17:38 imirkin: ok.
17:38 karolherbst: *step
17:38 imirkin: fermi is a bit different than kepler+
17:38 karolherbst: so, current idea was to install a pseudo trap handler and just stop execution
17:38 karolherbst: but what happens if eg the geometry shader faults
17:38 imirkin: the trap handler is *in* the shader
17:38 imirkin: not external
17:38 karolherbst: I know
17:39 imirkin: when it faults, it just exits
17:39 imirkin: so same as if you had an EXIT right there
17:39 karolherbst: I was more thinking about whether the next stage is executed nonetheless, or does the execution abort and we can check the exit status?
17:39 imirkin: [without a trap handler]
17:39 karolherbst: I see
17:39 karolherbst: so exit means RET + skip next stages?
18:09 Subv: aha, managed to get nvcc to generate a LOP.AND.NZ, sadly i can't get it to generate a LOP.AND.Z or a LOP.AND.T
19:16 pendingchaos: karolherbst: I did a comparison of with the 3rd patch vs with both the 3rd and 4th patch
19:16 pendingchaos: but not the individual transforms in the 4th patch
19:16 pendingchaos: I probably could compare with and without the near-power-of-two optimization
19:16 pendingchaos: it might be more or less useful with pre-Maxwell cards than Maxwell/Pascal though (I'm not sure)
19:16 pendingchaos: karolherbst: I'm not sure how useful 64-bit handling would be?
19:16 pendingchaos: karolherbst: what looks fishy?
19:18 karolherbst: pendingchaos: it looks like the current code is broken if it encounters 64 bit int muls there
19:19 pendingchaos: yeah, looks like it
19:22 karolherbst: oh well
19:22 pendingchaos: seems it was always broken and things seemed to work fine
19:22 pendingchaos: so I'm not sure if optimizing 64-bit multiplications would be very useful
19:23 karolherbst: well, it would be very much so
19:23 karolherbst: 64 bit mul doesn't exist
19:24 karolherbst: but yeah, 64 bit int muls in graphics are super rare
19:24 karolherbst: maybe even non existent
19:24 karolherbst: dunno
19:25 karolherbst: ....
19:25 karolherbst: we really need to rework the peephole file
19:25 pendingchaos: it's too big?
19:25 karolherbst: no
19:25 karolherbst: it isn't fit for the future
19:25 karolherbst: like not at all
19:26 karolherbst: pendingchaos: see the 64 bit int lowering
19:26 karolherbst: there we create Split64BitOpPreRA
19:26 karolherbst: ...
19:26 karolherbst: the Split64BitOpPreRA pass
19:26 karolherbst: there we create more muls/mads
19:26 karolherbst: and we _could_ optimize those to xmads
19:27 karolherbst: or uhm
19:27 karolherbst: or wait
19:27 pendingchaos: I think most of them would be?
19:27 pendingchaos: not NV50_IR_SUBOP_MUL_HIGH ones though
19:28 karolherbst: ahh right, xmad is inside late algebraic opts
19:28 karolherbst: so yeah
19:28 karolherbst: but still
19:28 karolherbst: even having two late algebraic opts there
19:28 karolherbst: and even having separate "lowering" passes
19:28 karolherbst: this is just not good
19:28 karolherbst: conceptually, lowering of alu ops and algebraic optimizations are fundamentally the same
19:29 karolherbst: this already bites us with Volta
19:30 karolherbst: we always have to trade off keeping the original op to optimize it and lower later (+ leaving unoptimized code) against lowering first and having to add a bunch of new opts for new opcodes every time
19:33 karolherbst: pendingchaos: anyway, you use 64 bit ops in the last patch, even though the input is 32 bit
19:34 pendingchaos: the llabs, util_is_power_of_two_or_zero64 and util_logbase2_64?
19:34 karolherbst: yes
19:34 pendingchaos: I think some of those could be made 32-bit
19:34 karolherbst: I mean, it is fine, because we want to handle 64 bit muls there in the end
19:34 karolherbst: I am just wondering if it is really worth the trouble
19:36 pendingchaos: if handling 64-bit multiplication for the near-power-of-two optimization is worth it?
19:37 pendingchaos: (and the mul -> 2 xmad optimization)
19:37 karolherbst: I don't even know if a 64 bit shladd is even handled
19:37 pendingchaos: I don't think it is
19:38 karolherbst: more fun then
19:38 karolherbst: I think we should just skip that optimization for 64 bit muls
19:38 pendingchaos: I used llabs() instead of abs() because abs(-2147483648) > 2147483647 btw
19:39 karolherbst: right
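The llabs() point can be reproduced with a small model. The sketch below uses `ctypes` to truncate results to a fixed width; `abs32` and `abs64` are hypothetical helper names, and the 32-bit case models the usual two's complement wraparound (in C itself, `abs(INT_MIN)` is formally undefined behaviour).

```python
import ctypes

INT_MIN = -2147483648
INT_MAX = 2147483647


def abs32(x):
    """Model of a 32-bit abs(): the result is truncated back to int32."""
    return ctypes.c_int32(abs(x)).value


def abs64(x):
    """Model of llabs(): the result is kept in a 64-bit integer."""
    return ctypes.c_int64(abs(x)).value


print(abs32(INT_MIN))  # wraps around and stays negative
print(abs64(INT_MIN))  # representable, and larger than INT_MAX
```

The magnitude of -2147483648 is one more than INT_MAX, so only the 64-bit variant can hold it.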
19:39 pendingchaos: and also skip the mul -> 2 xmad optimization with 64-bit?
19:40 karolherbst: you end up with tons of more xmads with 64 bit muls anyway
19:40 karolherbst: thing is, we can't just generate 32 bit ops in case of a 64 bit int mul
19:48 karolherbst: the patch itself looks fine, we should just add some guards for 64 bit int muls there
19:48 pendingchaos: so, for the next version:
19:48 pendingchaos: - make the power-of-two optimization handle 64-bit multiplications again
19:48 pendingchaos: - only do the near-power-of-two optimization for 32-bit multiplications
19:48 pendingchaos: - only do the two xmad optimization for 32-bit multiplications
19:48 pendingchaos: looks good?
19:50 pendingchaos: oh and also split the 4th patch into separate shl, shladd and xmad patches
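The plan above can be summarised as a small decision function. This is a hedged sketch, not the code from the patches: `choose_mul_lowering` and its return labels are hypothetical, only positive immediates are considered (sign handling is left out), "near-power-of-two" is taken to mean the 2^shl + 1 case mentioned earlier, and the 16-bit limit for the two-xmad case is an assumption.

```python
def is_pow2(value):
    """True for 1, 2, 4, 8, ..."""
    return value > 0 and value & (value - 1) == 0


def choose_mul_lowering(imm, bits):
    """Pick a rewrite for 'x * imm' where x is 'bits' wide and imm > 0."""
    if is_pow2(imm):
        # x * 2**n == x << n; kept for both 32-bit and 64-bit multiplications
        return ("shl", imm.bit_length() - 1)
    if bits == 32:
        if is_pow2(imm - 1):
            # x * (2**n + 1) == (x << n) + x
            return ("shladd", (imm - 1).bit_length() - 1)
        if imm <= 0xFFFF:
            return ("xmad2", None)
    # 64-bit multiplications skip the last two rewrites
    return ("keep", None)


for imm, bits in [(8, 64), (9, 32), (9, 64), (1000, 32), (1000, 64)]:
    print(imm, bits, choose_mul_lowering(imm, bits))
```

Putting the width check in one place keeps the 64-bit guard explicit instead of relying on each rewrite to reject 64-bit operands on its own.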
19:51 pendingchaos: I think I'll also be removing https://github.com/pendingchaos/mesa/blob/nv-xmad-v4/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp#L968-L969 and rely on LateAlgebraicOpt to create a SHLADD
19:51 pendingchaos: it makes it slightly smaller and seems to help shader-db numbers slightly overall
19:52 karolherbst: yeah, maybe a good idea
19:53 karolherbst: I mean the current code should fail with 64 bit muls anyway
19:53 karolherbst: or maybe it is fine
19:54 pendingchaos: I think the current code is fine
19:54 karolherbst: "i->setSrc(1, new_ImmediateValue(prog, imm0.reg.data.u32));" creates a 32 bit value
19:54 karolherbst: mhh but
19:55 pendingchaos: it should never be shifting left by more than 2**32 bits though
19:55 pendingchaos: unless OP_SHL is broken when given operands of differing sizes
19:55 karolherbst: mhh no, actually that should be fine
19:55 karolherbst: I kind of forgot about that second arg 32 bit thing
19:56 karolherbst: at least I think this is how the hw works
19:56 karolherbst: okay, so current code is probably fine for 64 bit muls
19:57 pendingchaos: just to be sure: current code = without the 4th patch?
20:22 pendingchaos: the 5th, and hopefully final, version of the xmad patches has been sent
20:32 karolherbst: current code is whatever is in master
20:32 pendingchaos:nods
20:32 Lyude: rhyskidd: thanks! I will probably have a patch on monday, as it looks like I needed to apply the fix I found to all of the display channels getting left on before we start initializing them
20:33 Lyude: i'm almost wondering if we should just run ->fini on disp channels before ->init, as it seems like that would fix any other possible problems like this
20:44 karolherbst: Lyude: sooo, hw was in weirdo state and was firing silly interrupts?
20:46 Lyude: karolherbst: seems like it, that would explain the spurious interrupt when nouveau doesn't load on time. on boots where I saw that spurious interrupt, nouveau would always hit the disp init fail if I loaded it later, and 0x610490 (some unknown ctrl register for controlling the core disp channel) would always have 0x490a009b on those boots instead of the normal 0x48000088 or 0x48070088
20:47 Lyude: it seems like that's the case for the rest of the channel control registers as well, so the way to workaround it is to just check if we need to shut down each channel before we start setting it up
20:47 Lyude: *disp channel control registers
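The workaround Lyude describes (shut a channel down first if it was left on, then set it up) might look roughly like this. Everything here is illustrative: `bring_up_channel`, the callback parameters and the idea of comparing against a set of known-clean values are assumptions. The log only gives the register address and the observed values for the core channel, and the real check is presumably on specific status bits that the log does not name.

```python
CORE_CHANNEL_CTRL = 0x610490              # address quoted in the log
KNOWN_CLEAN = {0x48000088, 0x48070088}    # "normal" values quoted in the log


def bring_up_channel(read32, fini, init, ctrl_addr=CORE_CHANNEL_CTRL):
    """Run fini() first if the control register is not in a known-clean state."""
    if read32(ctrl_addr) not in KNOWN_CLEAN:
        fini()  # the channel was left on by something earlier, e.g. the BIOS
    init()


# Hypothetical usage with fake hardware callbacks:
actions = []
bring_up_channel(
    read32=lambda addr: 0x490A009B,       # the bad value seen on failing boots
    fini=lambda: actions.append("fini"),
    init=lambda: actions.append("init"),
)
print(actions)
```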
20:49 Lyude: luckily I think that is the last major P50 bug I've hit
20:49 karolherbst: I see
20:49 karolherbst: makes sense kind of
20:50 karolherbst: although it is weird that the state is a bit undefined before loading nouveau
20:51 Lyude: karolherbst: not really. on good well behaving machines, one can expect the state to be clean, but I've seen similar issues to this on other machines with intel before (one instance of some dell tower was leaving on various parts of the display hw by the time i915 got its hands on it for instance)
20:51 Lyude: karolherbst: my best guess is that the bios is probably interacting (directly or indirectly) with the gpu before we load the OS to try to probe for displays it can turn on in POST
20:52 karolherbst: sure, but why does it matter when we load nouveau
20:52 karolherbst: ohh wait
20:52 karolherbst: no
20:52 karolherbst: this is a random value depending on the boot
20:52 karolherbst: okay
20:52 Lyude: karolherbst: if the bios doesn't clean up after itself, yeah
20:53 Lyude: i do see a "post delay" option for detecting monitors in the bios settings on this P50 as well, so it seems plausible
20:54 Lyude: karolherbst: funnily enough: nvidia's driver doesn't seem to be able to recover from that at all, lol
20:54 karolherbst: no surprise
20:54 karolherbst: nvidia's driver sucks at recovering
20:54 Lyude: it throws up some unrelated "can't read vbios" error
20:55 karolherbst: nvidia's driver just simply doesn't cause the GPU to crash or get messed up at all, that's why they don't need to ;)
20:55 Lyude: hehe
20:56 HdkR: lol
20:56 karolherbst: you can even switch certain vbios bits and everything breaks, even in super unimportant tables
20:56 karolherbst: or well, tables not worth crashing the driver
20:56 Lyude: lol
20:57 Lyude: Yeah like, I have had very few times where setting up the nvidia driver did not immediately ensue with me having to figure out why nvidia's driver isn't loading properly
20:57 karolherbst: well
20:57 karolherbst: some_array[vbios[some_byte]] ;)
20:58 karolherbst: better if you mix mmio access in them
20:58 karolherbst: vbios.some_table.entry[mmio(some_address) & 0x1f] ;)
20:58 Lyude: hehe. interestingly enough though nvabios can't read the bios on this either; which has prompted me to work on some small patches for adding strap_peek into our debugfs so we can just grab the vbios and strap peek directly from nouveau
20:59 Lyude: *vbios
20:59 karolherbst: why error check if your documentation states it is an invalid value :D
20:59 karolherbst: Lyude: huh? nvbios should be able to read it for pre pascal GPUs
20:59 Lyude: karolherbst: yeah; i'm not sure, oh, hm
21:00 Lyude: let me make sure I didn't accidentally try getting the vbios before it was powered on
21:03 Lyude: karolherbst: yeah; https://paste.fedoraproject.org/paste/IrsgOLCuyDjsveaYUjmZOA
21:03 Lyude: doesn't work, tbh though I'd still like to have strap_peek in our debugfs anyway
21:03 karolherbst: Lyude: ohhh, you mean nvagetbios
21:04 Lyude: yeah
21:04 Lyude: nouveau works 100% fine with that
21:04 karolherbst: yeah, on some laptops it's inside the ACPI
21:04 karolherbst: see the various extracting methods nouveau supports ;)
21:06 karolherbst: there are like 7 or so
21:06 karolherbst: ACPI slow/fast, OpenFirmware, PCI, PROM, PRAMIN, PCI?
21:06 karolherbst: should be all of them
21:06 karolherbst: uhh PCI twice
21:06 karolherbst: Platform is the other one
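Trying several extraction methods in turn can be sketched as a simple fallback loop. This is a minimal illustration, not nouveau's implementation: `fetch_vbios` and the fake reader functions are hypothetical, the source names are the ones listed above, and the order shown is arbitrary rather than the order nouveau actually uses.

```python
def fetch_vbios(sources):
    """Try each (name, reader) pair in order; return the first image found."""
    for name, reader in sources:
        try:
            image = reader()
        except OSError:
            continue            # this method is unavailable on this machine
        if image:
            return name, image
    return None, None


# Hypothetical readers standing in for the real extraction methods:
def read_prom():
    raise OSError("PROM not readable")


def read_pramin():
    return b""                  # nothing useful found


def read_acpi():
    return b"\x55\xaa"          # placeholder bytes, not a real image


name, image = fetch_vbios([("PROM", read_prom), ("PRAMIN", read_pramin), ("ACPI", read_acpi)])
print(name, len(image))
```

A tool that only implements one of these methods fails on machines where the image is reachable only through another, which fits the nvagetbios failure on this laptop.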
21:13 nyef: Maybe one of the PCI things is for getting it via a credit card reader? (-:
23:50 karolherbst: nice
23:50 karolherbst: imirkin: for pascal we actually have the information on where to write the trap handler address :)
23:51 karolherbst: actually since tesla (but only for compute shaders)