00:00imirkin: kepler was different, so ... maybe the trap handler is directly writeable, as an offset from the code segment
00:00karolherbst: kind of, but different
00:00imirkin: gf100_sw_chan_mthd -- case 0x644: /* MP.TRAP_WARP_ERROR_EN */
00:01imirkin: dunno if that's related.
00:01karolherbst: "PUSH_DATA (push, screen->text->offset + 0x7f700 + 0x700);" is what I have to write into it. 0x700 is the offset taken from the builtins, 0x7f700 is the offset to the first builtin
00:01imirkin: but that could be outside of 32-bit...
00:01karolherbst: I doubt it
00:02imirkin: it's a VA... VA is 40-bit...
00:02karolherbst: "screen->text->offset" is the offset of some bo
00:02imirkin: which is 40-bit.
00:02karolherbst: ohh, mhh
00:02karolherbst: that is annoying
00:02imirkin: unless screen->text is a nv04_resource?
00:02imirkin: then it's the offset into the text bo
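A minimal C sketch of the concern raised above: a 40-bit GPU virtual address does not fit in a single 32-bit data word, so it would have to be emitted as a high word and a low word. The helper names (va_fits_32, va_hi, va_lo) are hypothetical illustrations, not nouveau or mesa functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GPU_VA_BITS 40 /* VA width mentioned in the discussion */

/* Hypothetical helper: true when the address fits in one 32-bit word. */
static bool va_fits_32(uint64_t va)
{
    return (va >> 32) == 0;
}

/* Hypothetical helper: upper bits (bits 32..39) of a 40-bit VA. */
static uint32_t va_hi(uint64_t va)
{
    assert((va >> GPU_VA_BITS) == 0); /* reject anything wider than 40 bits */
    return (uint32_t)(va >> 32);
}

/* Hypothetical helper: lower 32 bits of the VA. */
static uint32_t va_lo(uint64_t va)
{
    return (uint32_t)(va & 0xffffffffu);
}
```

An offset into a buffer object, as opposed to a full VA, would normally pass the va_fits_32 check; a full 40-bit address may not.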
00:03imirkin: anyways... it's possible that calim had worked out the nve4 trap stuff, but i doubt the stuff in mesa was it
00:04imirkin: have a look at the fermi stuff, even if it's different
00:04imirkin: it should give you an idea of what sorts of things are out there
00:05karolherbst: "MP+0x10: BPT_CONTROL" that sounds useful
12:13karolherbst: imirkin: I got the Trap handler to execute :)
12:13karolherbst: so apparently I don't need that "screen->text->offset" offset. I just got misled by what value nvidia put into it (I guess I misinterpreted the trace)
12:14imirkin: that makes more sense
12:14karolherbst: was able to trigger it with a "nvapoke 504610 80000001"
12:14imirkin: all code executes relative to the global text segment
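Assuming, as stated here, that the handler address is relative to the start of the code segment, the value can be sketched from the two constants quoted earlier in the log (0x7f700 for the first builtin, 0x700 for the handler within the builtins). The macro and function names are hypothetical, not mesa identifiers.

```c
#include <stdint.h>

/* Constants as quoted in the log; their meaning is taken from that description. */
#define FIRST_BUILTIN_OFFSET 0x7f700u /* offset of the first builtin in the code segment */
#define TRAP_BUILTIN_OFFSET  0x700u   /* offset of the trap handler within the builtins */

/* Hypothetical helper: trap handler location relative to the code segment start. */
static uint32_t trap_handler_offset(void)
{
    return FIRST_BUILTIN_OFFSET + TRAP_BUILTIN_OFFSET;
}
```

With these constants the result is 0x7fe00, which fits in a single 32-bit word once the 40-bit base address is left out.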
12:14karolherbst: with HMM that is super easy to debug as you can just malloc memory and let the TRAP handler write into it and simply dump the memory in your C code
12:15karolherbst: and nouveau can mess up as much as it wants, you still get that memory content :)
12:17karolherbst: mhh, I need to invoke it for all MPs actually, so maybe if I do that the application will stop correctly
12:20karolherbst: the application stopped :)
12:20karolherbst: and the CL code has an infinite loop
12:20karolherbst: maybe this is something we want to do regardless. Like when we request killing an application, but the MPs are not stopping, so we just trap all of them?
12:21imirkin: that's one for skeggsb
12:25karolherbst: no out of range faults for going beyond max reg in a trap handler
12:29karolherbst: imirkin: one problem is, the TRAP handler has to be bug free, otherwise it might not actually get to the "rtt"
12:31imirkin: guessing you found that out the hard way? :)
12:31karolherbst: I guess so
12:33karolherbst: imirkin: the GPU has fallen off the bus ....
12:33karolherbst: now I have to reboot, annoying
12:38karolherbst: ohh wait, a suspend helps as well I think
12:41Lekensteyn: so the runtime resume/drm_unload/drm_device_remove functions all try to access the device which may result in a lockup. I just patched some calls out (obviously leaking resources like memory and workqueue stuff) which allows me to recover from an inaccessible device
12:42Lekensteyn: is it feasible to implement a recovery path where all device accesses (nvkm_rdXX) are skipped, only taking care of resources outside the GPU (memory, workqueues, drm stuff, ...)?
12:44Lekensteyn: karolherbst: btw, patching PCI to prevent the PMCSR write for D0 -> D3 is not acceptable, Windows 7/10 do not do this (verified this via a VFIO-PCI trace)
13:20karolherbst: Lekensteyn: yeah, I know
13:20karolherbst: I am sure this is some nouveau bug, just no idea what causes that
13:21karolherbst: Lekensteyn: well, recovery paths should be fine, skeggsb just doesn't seem to be very interested in having these
13:21karolherbst: I disagree, but...
13:36karolherbst: imirkin: do you know where/how I can extract the compute firmware uploaded inside the mmiotrace? I only see that "GP104_COMPUTE.FIRMWARE[0x4] = 0x419e10", but that value looks suspiciously like an mmio register...
13:38karolherbst: mhh, I should just be able to figure that out from mesa actually
13:58pqatsi: Hello folks
13:58Tom^: karolherbst: i had a 780ti , not 750 :p
13:59pqatsi: No progress with GP108M and nouveau here - even with runpm=0. Is there a chance of better support in kernel 4.18, or something else I can try?
14:53karolherbst: imirkin, pendingchaos: uhm, what was it I had to do so that I can compile the mme stuff?
14:56pendingchaos: uncomment a line in macro.c
14:56pendingchaos: perhaps line 145?
14:57pendingchaos: no I think it was line 120
14:57pendingchaos: but line 145 is related to the problem
15:04karolherbst: pendingchaos: thanks
15:08karolherbst: okay, now I have to figure out how to enable debugging on all MPs through a macro
15:11karolherbst: 419e10 is the reg.. mhh
15:11karolherbst: pendingchaos: by any chance, do you know how to write into a per channel mmio reg through a macro?
15:12karolherbst: I am sure there is some kind of mapping, but I sadly don't know it
15:15pendingchaos: no, I don't think I do
15:15karolherbst: imirkin: https://gist.githubusercontent.com/karolherbst/b0cfd51317cee883bac4a1f6d73fb8ac/raw/309e14c91e67f190a9d9775f29c13a177ac5e9dc/gistfile1.txt
15:15karolherbst: guess what
15:15karolherbst: 0x419e10: PGRAPH.GPC_BROADCAST.TPC_ALL.MP.BPT_CONTROL
15:17karolherbst: "MARK 337434.205727" set TRAP_HANDLER. "MARK 337434.219457" set BPT_CONTROL
15:18karolherbst: I guess this is related :)
15:18karolherbst: I don't know what those scratch values are, but 0x1 is what I planned to write into it
15:54karolherbst: I can just use nvidia's stuff, as we have the same firmware interface, nice
16:00karolherbst: enabling debugging isn't enough apparently
16:00karolherbst: not even a bpt trap helps
17:33Lyude: skeggsb: btw; any decision on the ->shutdown() patch I've got on the ML?
18:34karolherbst: mwk: maybe you know this. Any idea what could cause the TRAP_HANDLER not to be invoked for internal errors (MP traps.., bpt trap, whatever), even though it gets executed whenever I request it through the BPT_CONTROL register?
18:36mwk: that was ages ago
18:36mwk: maybe there's some sort of condition mask for trap enabling? I don't remember
18:36karolherbst: I expect some setting up inside the mmio space where all the traps are enabled or whatever
18:37karolherbst: but it looks like we do the same as nvidia there
18:37karolherbst: mwk: mhh, I literally use the same method nvidia uses for enabling debugging
18:37karolherbst: maybe they gave us crappy firmware, who knows
18:38karolherbst: SCRATCH = 0, SCRATCH = 0x1, SCRATCH = 0x7, FIRMWARE = 0x419e10, is what I do inside mesa on the channel
18:38karolherbst: and it seems to write 0x1c01 into 0x419e10 (PGRAPH.GPC_BROADCAST.TPC_ALL.MP.BPT_CONTROL)
18:42karolherbst: I guess I should verify that we indeed set up all the trap mmio registers the same way
18:56karolherbst: ohh, seems like we miss something, checking if that makes a difference
19:11orbea: anyone mind helping determine if this is a RetroArch, nouveau or mesa bug? In short RetroArch KMS no longer starts after mesa commit 753f603b52db5eb38e27e1842fa43299a348998b see this issue for more info. https://github.com/libretro/RetroArch/issues/7119
19:11orbea: I can't test with the llvmpipe or swrast because those haven't worked for much longer...
23:24karolherbst: skeggsb: any objections to us having a basic trap handler inside mesa, which currently only quits, so that we can manually trap all running shaders within a channel when they are inside an infinite loop, as this may cause nouveau to freeze otherwise?
23:24karolherbst: or do you know another way to do that?