09:22Aziroshin: Hello. :)
09:23Aziroshin: It's possible I have a hardware-failed GTX 780 Ti on my hands, and I'd like to find out what's wrong with it at some point, if that is at all possible. It's failing in different ways depending on the system (and driver I suppose). With a stock live USB Kubuntu 17.04, I get lines such as the following as spam:
09:23Aziroshin: ** printk messages dropped ** [ nouveau 0000:01:00.0: gr: GPC2/TPC0/MP trap: global 00000000  warp 0000
09:24Aziroshin: If anyone has a hint as to what direction to head in terms of further investigation based on such slim evidence, I'd be grateful. :)
09:26Aziroshin: As to the failure itself, it happened during playing World of Warcraft with wine, nvidia binary drivers (different computer), when the picture all of a sudden froze and the computer followed suit in a hard lockup about 2-3 seconds later. The card would still display a console after reboot, but the X server (nvidia-persistenced fail) wouldn't initialize anymore.
09:28Aziroshin: One thing I am wondering is whether the on-board firmware could potentially have been compromised (if that's at all possible during a lockup), and re-flash of the firmware (if at all possible) might help matters.
09:50RSpliet: Aziroshin: unlikely. Apart from the VBIOS all firmware is uploaded during boot
09:51RSpliet: the most likely component to die in a system is the fan (killing the GPU slowly), followed by DRAM. Or you got unlucky. Either way nouveau doesn't support any kind of BIST routines to test hardware
11:14RSpliet: codimwk: before I make changes to envydis, a brief consultation: I know that some ops are displayed wrong (the one I care about is fp atomic add - or rather "RED.E.ADD.F32.FTZ.RN" as nvdisasm displays it being displayed as "add" in demmt), and I'd like to correct some of that
11:15RSpliet: Is there a preferred display format?
11:18mwk: not really
11:19mwk: just try to make it consistent with previous ISAs if possible
12:35karolherbst: mwk: did you read my message regarding breakpoints on falcons?
12:59mwk: karolherbst: I might've missed it
13:09pmoreau: xexaxo1: You can completely disregard my comments from yesterday evening: I was running autogen from the wrong git worktree, so yeah, of course it would not pick up my changes… --"
13:22smusiland: Allah is doing
13:59karolherbst: mwk: somehow whenever I let the falcon hit the breakpoint I set, the falcon stops instead of pausing, so that I can't continue or single step from there
14:13mwk: what kind of break?
14:31RSpliet: tea break, apparently
14:32RSpliet: I'm sure all of this is supposed to be parameterisable...
14:34karolherbst: mwk: a breakpoint
14:35karolherbst: mwk: falcon+0x098
14:35karolherbst: it indeed stops at the address I put into it, but it hard stops
14:35karolherbst: and can't be continued
14:39RSpliet: karolherbst: what's bit 29 in that reg for?
14:48karolherbst: RSpliet: dunno
14:51karolherbst: RSpliet: did you see it through nvascan?
14:51RSpliet: well, no, I didn't want to ruin my system
14:51RSpliet: but nvapoke 0x10a098 0x7fffffff, nvapeek 0x10a098
14:55karolherbst: well it is very unlikely that you hit the breakpoint at 0x3ffffff
14:56karolherbst: maybe bit 29 is something like "pause on break" or so
14:56karolherbst: will check it out today at home
14:56RSpliet: well, the mask was 0x60ffffff
14:56karolherbst: ohh right, makes sense
14:57karolherbst: PC is 0-23
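The write-all-ones-and-read-back trick RSpliet used above can be sketched in a few lines of Python. This is only an illustration of how to interpret the two values quoted in the log (0x7fffffff written, 0x60ffffff read back); the function names are hypothetical and nothing here touches hardware.

```python
# Interpret a "write all ones, read back" probe of a register.
# Values are the ones quoted in the log; helper names are made up for this sketch.

WRITTEN = 0x7fffffff   # value poked into falcon+0x098 (0x10a098)
READBACK = 0x60ffffff  # value peeked afterwards ("the mask")


def writable_bits(readback, width=32):
    """Return the bit positions that kept their value after the write."""
    return [bit for bit in range(width) if readback & (1 << bit)]


def low_field_width(readback):
    """Width of the contiguous run of set bits starting at bit 0."""
    width = 0
    while readback & (1 << width):
        width += 1
    return width


bits = writable_bits(READBACK & WRITTEN)
pc_width = low_field_width(READBACK)
extra_bits = [bit for bit in bits if bit >= pc_width]

print("PC field: bits 0-%d" % (pc_width - 1))
print("other writable bits:", extra_bits)
```

Since bit 31 was not set in the written value, this probe says nothing about it either way.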
15:04Aziroshin: RSpliet: Thanks for your reply. :)
15:08jamm: hakzsam, pmoreau: check out https://hastebin.com/toberuripi.bash line 25
15:08jamm: i just can't seem to get proper results with (st 0x6 wt 0x1), it seems to screw up all my fonts and window shadows become big squares
15:09jamm: everything else works perfectly so far
15:09jamm: i'm surely missing something above or below
15:10jamm: (see st 0x0, it should be (st 0x6 wt 0x1) i think
15:10RSpliet: jamm: are these shaders automatically appended with nops?
15:18pmoreau: jamm: Do you have a link to the solution for the previous shader? I don't remember where the rd were needed
15:20pmoreau: I am not sure about the wt 0x4 in (st 0xf wr 0x2 wt 0x4) line 20
15:21pmoreau: I'll have another look in a bit
15:26karolherbst: jamm: just curious: were you thinking about moving that logic into the assembler so that nobody needs to do that by hand anymore?
15:49jamm: RSpliet: not sure, i haven't looked at what the assembler does exactly yet
15:49jamm: karolherbst: haven't thought of that yet
15:51jamm: pmoreau: well, with the (st 0x0) on fmul ftz $r3 $r3 $r4, it seems to work well
15:52jamm: karolherbst: it's probably nice to have something like that, but i'm not sure if it's justified given that it's just 8 shaders with most of them almost identical to each other
16:03mwk: karolherbst: have you set the break mask through the debug port?
16:03mwk: it selects which events pause to debug port vs causing an in-processor trap
16:03karolherbst: ohh, okay
16:04karolherbst: I think I forgot to do that then
16:04karolherbst: how does the mask work? I thought there are only two hw breakpoints?
16:05karolherbst: or what kind of value do I set with the break mask?
16:05mwk: you can set break on lots of events, not just breakpoints
16:05karolherbst: ohhhhh, I see
16:05mwk: but umm
16:05mwk: seems I haven't REd it :(
16:05karolherbst: so this 16-31 value in DEBUG_CMD has only two bits for the breakpoints
16:05mwk: anyhow, bit 7 controls the breakpoints
16:05mwk: according to my notes
16:06karolherbst: you mean value 7 in CMD
16:06karolherbst: <value value="0x7" name="SET_BREAK_MASK"/>
16:06karolherbst: and the mask is in 16-31
16:06mwk: bit 7 of the mask
16:06karolherbst: so you mean 16+7?
16:06karolherbst: and 16+8 is breakpoint 2?
16:06mwk: not sure
16:06mwk: maybe bit 7 controls both...
16:06karolherbst: okay, I will figure that out then
16:06mwk: I only have bit 7 mentioned
16:06karolherbst: why though....
16:06mwk: but if I had to guess
16:07mwk: you could also set breaks on other conditions
16:07mwk: interrupts, CPU exceptions
16:07karolherbst: I will play around with it a bit then
16:07mwk: maybe the weirdo syscall insns
16:07mwk: ISTR I found more bits, but envytools mentions only the breakpoint
16:07karolherbst: maybe in the end I will be able to write a debugger where we can even execute random code
16:08karolherbst: I meant, code we put into the debugger
16:08mwk: that'd be nice
16:08karolherbst: div by 0 might be a good reason to break as well :)
16:09mwk: ordinarily div by 0 just returns ffffffff and doesn't trap
16:09mwk: unless they changed it on v4+ or something
16:09mwk: I need to finish my falcon hwtest
16:10mwk: got up to opcode 0x33, but failed on the fused compare+branch insns
16:10karolherbst: but the debugging interface is somewhat fun to work with :)
16:10mwk: my hwtest uses it :p
16:10karolherbst: it already helped me a lot
17:03smusiland: Allah is doing
17:04mupuf: I'm sure he is doing great, but unless he is willing to do Nouveau development, summoning him is no interest to us and thus, we do not want to hear about this
17:05karolherbst: mwk: mhh, I set a breakpoint to 0x810, but the falcon stopped at 0x420....
17:05mupuf: close enough :D
17:05karolherbst: now it stopped at 0x813
17:06karolherbst: mhhh interesting
17:06karolherbst: it stopped
17:06smusiland: Nouveau can't be developed without Allah permission
17:06karolherbst: now at 0x400, mhhh interesting
17:07mupuf: smusiland: then, he obviously gave us the permission. Can you please go talk about this in a religion-oriented channel?
17:08smusiland: Smusiland can't go talking in religion channels without Allah permission
17:09AllahOneTrueGod: smusiland: I hereby gave you the permission to go hang out somewhere else
17:10smusiland: Smusiland can't talk to fake Allahs without Allah permission
17:16karolherbst: mwk: somehow that doesn't work as expected
17:18karolherbst: poke falcon+0x200 0x00800007 ; poke falcon+0x98 0x80000810
17:19karolherbst: and it stops kind of between 0x400 and 0x410
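The two poke values above can be rebuilt from the pieces discussed earlier in the log. The sketch below is an assumption-laden illustration: the command sitting in the low half and the argument in bits 16-31 follows karolherbst's description, bit 7 of the mask is from mwk's notes, and treating bit 31 of the breakpoint register as the enable bit is a guess based on the value poked. All names are hypothetical.

```python
# Rebuild the values poked into falcon+0x200 and falcon+0x98.
# Field layout is taken from the discussion; treat it as unverified.

SET_BREAK_MASK = 0x7            # <value value="0x7" name="SET_BREAK_MASK"/>
BREAK_ON_BREAKPOINT = 1 << 7    # "bit 7 controls the breakpoints" (mwk's notes)
BREAKPOINT_ENABLE = 1 << 31     # assumption: inferred from the 0x80000810 poke
PC_MASK = 0x00ffffff            # "PC is 0-23"


def debug_cmd(cmd, arg):
    """Pack a command into the low half and its argument into bits 16-31."""
    return ((arg & 0xffff) << 16) | (cmd & 0xffff)


def breakpoint_value(pc, enable=True):
    """Pack a breakpoint address, optionally with the assumed enable bit."""
    value = pc & PC_MASK
    if enable:
        value |= BREAKPOINT_ENABLE
    return value


print(hex(debug_cmd(SET_BREAK_MASK, BREAK_ON_BREAKPOINT)))
print(hex(breakpoint_value(0x810)))
```

Both results match the values in the poke line, which at least shows the field layout described in the channel is self-consistent.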
17:19mupuf: karolherbst: how do you read PC?
17:19karolherbst: through 0x200
17:19karolherbst: and then I read the special pc register
17:19mupuf: oh, cool
17:19smusiland: Sun is not doing, Allah is doing
17:20karolherbst: mupuf: I can also write registers already :) and data+io
17:20karolherbst: and single stepping also works, and breaking current execution
17:20karolherbst: just breakpoints don't
17:20mupuf: Ah ah, I guess they remembered to lock it down, when in HS mode :D
17:20karolherbst: I am on my kepler
17:21karolherbst: ohh, I guess you tried it on maxwell or so
17:21karolherbst: k, I see
17:21karolherbst: well it makes sense that they locked it down
17:22mupuf: karolherbst: no, I said: I guess they remembered to lock it down ;)
17:22mupuf: Would be fun if they had forgotten about this :D
17:23karolherbst: well maybe after I understood how it works on mine, I might go and check out other GPUs
17:25karolherbst: like always, I wrote my debugger in bash
17:25karolherbst: but maybe if I find some time I rewrite it to have a nice ncurses based client
17:25karolherbst: maybe something gdb alike
17:27karolherbst: okay, so something is funky with the breakpoints
17:28karolherbst: uhhh... I think I know
17:29karolherbst: maybe not
17:29karolherbst: I was thinking, maybe, this enable bit is important
17:29karolherbst: doesn't seem to be
17:30karolherbst: RSpliet: thank you for pointing out that one bit
17:30karolherbst: although now hell breaks loose
17:31karolherbst: no clue, at least the falcon stopped at 0x810
17:31karolherbst: then it jumped to 0xf7
17:32karolherbst: which is intr:
17:32karolherbst: maybe it makes sense
17:32RSpliet: well, easy enough. check the interrupt status register when you trigger it and see :-)
17:32karolherbst: I think it says: hey breakpoint triggered
17:33karolherbst: it says WATCHDOG
17:33karolherbst: silly watchdog
17:34karolherbst: after the interrupt handler I am back at 0x810
17:35karolherbst: and finally able to step from there
17:36karolherbst: this is only slightly annoying
17:39karolherbst: step from code :D
17:39karolherbst: I like it that you can continue from any random address you want
17:40karolherbst: of course you can also continue from 0x811 even if the instruction is at 0x810
17:40karolherbst: cause the falcon doesn't really care as long as there is something valid starting at 0x811
17:48smusiland: Allah is doing
18:07mwk: karolherbst: try stuffing 0xffff to the break mask
18:07Lyude: huh, don't think I see imirkin in here. that's a new one
18:07mwk: I'd be interested to know what the other breaks are
18:08mwk: ISTR the interrupts were there, somehow
18:14karolherbst: mwk: well, it kind of works, but I need to step through the interrupt handler first, or I just step out of it
18:15karolherbst: there is a flag for interrupts
18:15karolherbst: let me find it
18:17karolherbst: 0x0100 is break on interrupt
18:20mwk: which interrupt?
18:21mwk: so, both iv0 and iv1?
18:21karolherbst: I think only iv0?
18:21mwk: ISTR they were separate
18:21karolherbst: well the falcon breaks on intr when I set it to 0x100
18:21karolherbst: I could check for iv1
18:21pmoreau: Lyude: It happens, very very rarely, but it is not unheard of. :-)
18:23Lyude: pmoreau: hehe
18:24karolherbst: mwk: iv0 and iv1 are USER1 and USER2?
18:24karolherbst: ohh wait
18:24karolherbst: that makes no sense
18:24karolherbst: INTR_ROUTING :)
18:26mwk: routing is complicated
18:26karolherbst: well but if I route USER1 and USER2 to M2, I should get the interrupt handler behind iv1 executed, right?
18:27karolherbst: or not?
18:28karolherbst: 0200 is IV1
18:29mwk: and that belongs in falcon.xml :)
18:29karolherbst: should I call them IV0 and IV1?
18:29mwk: AFAICT there should also be breaks for trap 0 - trap 3, TLB miss, TLB multihit, invalid opcode
18:30karolherbst: mhh okay
18:30mwk: and who knows what else
18:32karolherbst: breakmask: 0xfcff and I still step into the interrupt handler
18:33karolherbst: I mean the falcon breaks at 0x810
18:33karolherbst: and single stepping gets me into the interrupt handler
18:34karolherbst: and with 0xfc7f the falcon stops
18:34karolherbst: so I guess there is no flag for handling this alright
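To make the two masks tried above easier to read, here is a small decoder for the break mask bits identified so far in this log. The table is tentative: 0x0080 comes from mwk's notes, 0x0100 and 0x0200 from karolherbst's experiments (with 0x0100 only presumed to be IV0), and every other bit is unknown. The names are hypothetical.

```python
# Decode a Falcon break mask using only the bits mentioned in this log.
# The mapping is tentative and incomplete.

KNOWN_BREAK_BITS = {
    0x0080: "breakpoint",
    0x0100: "interrupt (IV0, presumed)",
    0x0200: "interrupt (IV1)",
}


def describe_break_mask(mask):
    """Return the names of the known break sources enabled in mask."""
    return [name for bit, name in sorted(KNOWN_BREAK_BITS.items()) if mask & bit]


def unknown_bits(mask):
    """Return the part of a 16-bit mask that has no known meaning yet."""
    known = 0
    for bit in KNOWN_BREAK_BITS:
        known |= bit
    return mask & 0xffff & ~known


for mask in (0xffff, 0xfcff, 0xfc7f):
    print(hex(mask), describe_break_mask(mask), hex(unknown_bits(mask)))
```

Read this way, 0xfcff keeps the breakpoint source but drops both interrupt sources, and 0xfc7f drops the breakpoint source as well.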
18:34Lyude: mind if I ask what kind of code you're trying to debug on the falcon out of curiosity?
18:34karolherbst: Lyude: the PMU code I wrote myself for dynamic reclocking
18:35Lyude: no way, you actually got it to load?
18:35karolherbst: what do you mean?
18:35karolherbst: I wrote that myself for my kepler
18:35karolherbst: of course it loads
18:35Lyude: i mean like, you got the falcon to actually execute the code in high security mode? or are you still working on that
18:35Lyude: oh I thought you were talking about pascal
18:35Lyude: got my hopes up, whoops. lol
18:36karolherbst: I have much better lines for that, like "mupuf check your maxwell2 fan :O" (keeping it spin up and down alternately)
18:38Lyude: wow, good to see nvidia hasn't stopped complaining about the VGA console in their kernel blob since I stopped buying from them
18:38karolherbst: yeah well
18:38karolherbst: their fault basically
18:38Lyude: like I said I won't ever report any bugs I find with their driver :)
18:38Lyude: good luck figuring out your broken i2c stack nvidia
18:39karolherbst: hey, we might need it without bugs to RE stuff :O
18:39Lyude: eh, it's not something that would prevent us from doing that
18:40karolherbst: well depends on how much broken it is
18:40Lyude: i mean, it only really is a bug with broken hardware I happen to have that you can't really buy
18:40Lyude: but i am sure somewhere out there something else hits it
18:41karolherbst: mwk: will figure out invalid opcode now :)
18:41karolherbst: that sounds like the best fun to me right now
18:41mwk: karolherbst: I suppose the correct thing to do would be to disable watchdog :p
18:41Lyude: i need to continue playing with power saving stuff
18:41karolherbst: mwk: funny though, my code I need to debug is triggered via watchdog
18:41mwk: I mean, interrupts are enabled, so single stepping should enter them by definition...
18:42mwk: that's an interesting question
18:42mwk: maybe these extra $flags bits on Falcon v4+ are related to debugging somehow?
18:42karolherbst: I think it is the watchdog though
18:42mwk: like "please break on interrupt return, kthx"
18:42karolherbst: not quite sure what #timer does
18:42karolherbst: yeah, it configures some watchdog stuff
18:43karolherbst: mwk: maybe there is a difference between ret and iret?
18:43karolherbst: but I like your idea
18:43karolherbst: k, iret/ret first, then invalid opcode
18:44mwk: please check with various flags combinations if you're doing that :0
18:44karolherbst: you mean $flags?
18:47karolherbst: okay mhh I doubt they have break on ret
18:47karolherbst: $flags is 0x00110501
18:49karolherbst: in the docs are some fields missing
18:49karolherbst: are those not used?
18:49karolherbst: like 12-15
18:53karolherbst: okay, they do not exist indeed
18:55karolherbst: mwk: funny though, if I fully step through the interrupt handler, the falcon won't jump back to it
18:55karolherbst: ohh it does
18:55karolherbst: I still need to wait
20:52pmoreau: jamm: No clue, sorry, It seems good to me.
21:09karolherbst: mwk: odd, either I hit an instruction which we don't know but exists, or there is no breakpoint on invalid instruction
21:44mwk: karolherbst: what opcode are you using?
21:44mwk: you need to use an invalid first byte of an opcode to trigger the trap
21:44mwk: Falcon only triggers the exception when it can't decode the instruction's form
21:45karolherbst: ohh, I see
21:45karolherbst: I am currently checking the traps, I could do that after that
21:45mwk: if it can decode the form, but the opcode is invalid, it's just a nop
21:45karolherbst: what is a good example?
21:45mwk: or maybe an instruction that clears dst to 0
21:45mwk: depends on opcode
21:45mwk: hmm, Falcon v4 or v5?
21:45mwk: try 0x3f
21:46karolherbst: returning with iret from a trap handler makes the falcon totally confused
21:46mwk: 0xee is definitely considered valid
21:54karolherbst: mwk: there is one bit for all traps
21:54karolherbst: and it is 0x8000
21:55karolherbst: mhh at least I thought, odd
21:57karolherbst: ohhh, I am stupid
21:59mwk: by all traps, you mean the trap X instructions?
21:59karolherbst: but it doesn't work
21:59karolherbst: the falcon just stopped
22:00karolherbst: but... I found something else
22:01mwk: then what does it do?
22:01karolherbst: a breakpoint hit at ticks_from_ns_quit
22:01karolherbst: after "mov b32 $r14 $r12"
22:02karolherbst: breakmask is 0xfc7f
22:03mwk: what is ticks_from_ns_quit?
22:03karolherbst: mov b32 $r14 $r12; push....
22:04karolherbst: $pc is 0x200 though
22:04karolherbst: odd number
22:04mwk: wonders if there are memory breakpoints on that thing
22:04karolherbst: then it wouldn't explain why it triggers there
22:04karolherbst: $sp is 5fb8
22:09karolherbst: now it is at 0x3ef, but same $sp
22:11karolherbst: okay, I don't get the software traps and I think that other thing might be a memory breakpoint indeed, although this feels odd
22:15karolherbst: maybe I can't set the regs from the PMU itself
22:16karolherbst: but it indicates it executed it
22:16karolherbst: 0x3f works by the way
23:07mwk: what the flying fuck.
23:08mwk: hwtest running on my GM107's PVDEC engine says that div and mod instructions are NOPs iff their destination register is %r0
23:09mwk: at least in the "divide by imm8" forms
23:09mwk: that.... is potentially very good to know
23:13phoenixz: Hello there, I know this is the channel for the open source nvidia driver, is there also a channel for the closed source binary drivers?
23:14mupuf: mwk: ah ah
23:14mupuf: that's the sort of things one would not catch without randomized testing like you do
23:15mwk: but... why :(