00:00 imirkin: kepler was different, so ... maybe the trap handler is directly writeable, as an offset from the code segment
00:00 karolherbst: kind of, but different
00:00 imirkin: gf100_sw_chan_mthd -- case 0x644: /* MP.TRAP_WARP_ERROR_EN */
00:01 imirkin: dunno if that's related.
00:01 karolherbst: "PUSH_DATA (push, screen->text->offset + 0x7f700 + 0x700);" is what I have to write into it. 0x700 is the offset taken from the builtins, 0x7f700 is the offset to the first builtin
00:01 imirkin: but that could be outside of 32-bit...
00:01 karolherbst: I doubt it
00:02 imirkin: it's a VA... VA is 40-bit...
00:02 karolherbst: "screen->text->offset" is the offset of some bo
00:02 karolherbst: afaik
00:02 imirkin: which is 40-bit.
00:02 karolherbst: ohh, mhh
00:02 karolherbst: that is annoying
00:02 imirkin: unless screen->text is a nv04_resource?
00:02 imirkin: then it's the offset into the text bo
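[editor's note] The concern in the exchange above is whether the value pushed for the trap handler fits in a single 32-bit word. A minimal sketch of that check in C, using only the two offsets quoted in the log (0x7f700 and 0x700); the helper names are hypothetical and are not mesa or nouveau API, and the 40-bit limit is taken from imirkin's remark:

```c
#include <assert.h>
#include <stdint.h>

/* Offsets quoted in the log: the first builtin sits at 0x7f700,
 * and the trap handler is 0x700 into the builtins. */
#define BUILTIN_BASE        0x7f700u
#define TRAP_HANDLER_OFFSET 0x700u

/* Hypothetical helper: handler address for a given text base.
 * text_base == 0 models "offset relative to the code segment". */
static uint64_t trap_handler_addr(uint64_t text_base)
{
    /* Assumption from the discussion: GPU virtual addresses are 40-bit. */
    assert(text_base < (1ull << 40));
    return text_base + BUILTIN_BASE + TRAP_HANDLER_OFFSET;
}

/* True if the address can be written with a single 32-bit push. */
static int fits_in_u32(uint64_t addr)
{
    return (addr >> 32) == 0;
}

/* Otherwise the value has to be split into a high and a low word. */
static uint32_t addr_hi(uint64_t addr) { return (uint32_t)(addr >> 32); }
static uint32_t addr_lo(uint64_t addr) { return (uint32_t)addr; }
```

With a base of 0 the result is 0x7fe00 and fits in one word; with a full 40-bit VA as the base it may not, which is the case imirkin is pointing at.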
00:03 imirkin: anyways... it's possible that calim had worked out the nve4 trap stuff, but i doubt the stuff in mesa was it
00:04 imirkin: have a look at the fermi stuff, even if it's different
00:04 imirkin: it should give you an idea of what sorts of things are out there
00:05 karolherbst: "MP+0x10: BPT_CONTROL" that sounds useful
12:12 karolherbst: nice
12:13 karolherbst: imirkin: I got the Trap handler to execute :)
12:13 karolherbst: so apparently I don't need that "screen->text->offset" offset. I just got misled by what value nvidia put into it (I guess I misinterpreted the trace)
12:14 imirkin: that makes more sense
12:14 karolherbst: was able to trigger it with a "nvapoke 504610 80000001"
12:14 imirkin: all code executes relative to the global text segment
12:14 karolherbst: with HMM that is super easy to debug as you can just malloc memory and let the TRAP handler write into it and simply dump the memory in your C code
12:15 karolherbst: and nouveau can mess up as much as it wants, you still get that memory content :)
12:15 imirkin: yay
12:17 karolherbst: mhh, I need to invoke it for all MPs actually, so maybe if I do that the application will stop correctly
12:20 karolherbst: nice
12:20 karolherbst: the application stopped :)
12:20 karolherbst: and the CL code has an infinite loop
12:20 karolherbst: sooooo
12:20 karolherbst: maybe this is something we want to do regardless. Like when we request killing an application, but the MPs are not stopping, so we just trap all of them?
12:21 imirkin: that's one for skeggsb
12:25 karolherbst: mhh
12:25 karolherbst: no out of range faults for going beyond max reg in a trap handler
12:29 karolherbst: mhhhh
12:29 karolherbst: imirkin: one problem is, the TRAP handler has to be bug free, otherwise it might not actually get to the "rtt"
12:31 imirkin: guessing you found that out the hard way? :)
12:31 karolherbst: I guess so
12:33 karolherbst: uhm
12:33 karolherbst: imirkin: the GPU has fallen off the bus ....
12:33 karolherbst: now I have to reboot, annoying
12:38 karolherbst: ohh wait, a suspend helps as well I think
12:41 Lekensteyn: so the runtime resume/drm_unload/drm_device_remove functions all try to access the device which may result in a lockup. I just patched some calls out (obviously leaking resources like memory and workqueue stuff) which allows me to recover from an inaccessible device
12:42 Lekensteyn: is it feasible to implement a recovery path where all device accesses (nvkm_rdXX) are skipped, only taking care of resources outside the GPU (memory, workqueues, drm stuff, ...)?
12:44 Lekensteyn: karolherbst: btw, patching PCI to prevent the PMCSR write for D0 -> D3 is not acceptable, Windows 7/10 do not do this (verified this via a VFIO-PCI trace)
13:20 karolherbst: Lekensteyn: yeah, I know
13:20 karolherbst: I am sure this is some nouveau bug, just no idea what causes that
13:21 karolherbst: Lekensteyn: well, recovery paths should be fine, skeggsb just doesn't seem to be very interested in having these
13:21 karolherbst: I disagree, but...
13:36 karolherbst: imirkin: do you know where/how I can extract the compute firmware uploaded inside the mmiotrace? I only see that "GP104_COMPUTE.FIRMWARE[0x4] = 0x419e10", but that value looks suspiciously like a mmio register...
13:38 karolherbst: mhh, I should just be able to figure that out from mesa actually
13:58 pqatsi: Hello folks
13:58 Tom^: karolherbst: i had a 780ti , not 750 :p
13:59 pqatsi: No progress with GP108M and nouveau here - even with runpm=0. Is there a chance of better support in kernel 4.18, or something I can try?
14:53 karolherbst: imirkin, pendingchaos: uhm, what was it I have to do so that I can compile the mme stuff?
14:56 pendingchaos: uncomment a line in macro.c
14:56 pendingchaos: perhaps line 145?
14:57 pendingchaos: no I think it was line 120
14:57 pendingchaos: but line 145 is related to the problem
14:58 pendingchaos: https://pastebin.com/raw/VDrNKqsG
15:04 karolherbst: pendingchaos: thanks
15:08 karolherbst: okay, now I have to figure out how to enable debugging on all MPs through a macro
15:11 karolherbst: 419e10 is the reg.. mhh
15:11 karolherbst: pendingchaos: by any chance, do you know how to write into a per channel mmio reg through a macro?
15:12 karolherbst: I am sure there is some kind of mapping, but I sadly don't know it
15:14 karolherbst: ohhhhhhh
15:15 pendingchaos: no, I don't think I do
15:15 karolherbst: imirkin: https://gist.githubusercontent.com/karolherbst/b0cfd51317cee883bac4a1f6d73fb8ac/raw/309e14c91e67f190a9d9775f29c13a177ac5e9dc/gistfile1.txt
15:15 karolherbst: guess what
15:15 karolherbst: 0x419e10: PGRAPH.GPC_BROADCAST.TPC_ALL.MP.BPT_CONTROL
15:17 karolherbst: "MARK 337434.205727" set TRAP_HANDLER. "MARK 337434.219457" set BPT_CONTROL
15:18 karolherbst: I guess this is related :)
15:18 karolherbst: I don't know what those scratch values are. but 0x1 is what I planned to write into it
15:54 karolherbst: ahhh
15:54 karolherbst: I can just use nvidia's stuff, as we have the same firmware interface, nice
15:59 karolherbst: mhhhh
16:00 karolherbst: sooo
16:00 karolherbst: enabling debugging isn't enough apparently
16:00 karolherbst: not even a bpt trap helps
17:33 Lyude: skeggsb: btw; any decision on the ->shutdown() patch I've got on the ML?
18:34 karolherbst: mwk: maybe you know this. Any idea what could cause the TRAP_HANDLER to be not invoked for internal errors (MP traps.., bpt trap, whatever), but it gets executed whenever I request it through the BPT_CONTROL register?
18:36 mwk: hmm
18:36 mwk: that was ages ago
18:36 mwk: maybe there's some sort of condition mask for trap enabling? I don't remember
18:36 karolherbst: I expect some setting up inside the mmio space where all the traps are enabled or whatever
18:37 karolherbst: but it looks like we do the same as nvidia there
18:37 karolherbst: mwk: mhh, I literally use the same method nvidia uses for enabling debugging
18:37 karolherbst: maybe they gave us crappy firmware, who knows
18:38 karolherbst: SCRATCH[0] = 0, SCRATCH[1] = 0x1, SCRATCH[2] = 0x7, FIRMWARE[4] = 0x419e10, is what I do inside mesa on the channel
18:38 karolherbst: and it seems to write 0x1c01 into 0x419e10 (PGRAPH.GPC_BROADCAST.TPC_ALL.MP.BPT_CONTROL)
18:42 karolherbst: I guess I should verify that we indeed set up all the trap mmio registers the same way
18:56 karolherbst: ohh, seems like we miss something, checking if that makes a difference
19:11 orbea: anyone mind helping determine if this is a RetroArch, nouveau or mesa bug? In short RetroArch KMS no longer starts after mesa commit 753f603b52db5eb38e27e1842fa43299a348998b see this issue for more info. https://github.com/libretro/RetroArch/issues/7119
19:11 orbea: I can't test with the llvmpipe or swrast because those haven't worked for much longer...
23:24 karolherbst: skeggsb: any objections to us having a basic trap handler inside mesa, which currently only quits, so that we can manually trap all running shaders within a channel when they are inside an infinite loop, as this may cause nouveau to freeze otherwise?
23:24 karolherbst: or do you know another way on how to do that?