06:46imanho: SM20 does not have the sched blocks in between instructions?
06:47imirkin: they were optional on SM30/SM35, and required on SM50+
06:50imanho: I finally managed to exfiltrate what I presume is most likely device code from gpu (and should be SM20) but nvdisasm chokes on it
06:51imanho: 0x10100248B38 0C 14 07 01 00 00 D0 81
06:51imanho: 0x10100248B40 F1 41 20 3E 00 C8 1D 00 0C 12 07 01 00 00 D0 A4
06:51imanho: 0x10100248B50 0C 12 07 11 00 00 D0 A4 08 14 07 02 00 00 D0 81
06:51imanho: 0x10100248B60 F1 42 20 5E 00 C8 3D 00 08 12 07 02 00 00 D0 A4
06:51imanho: 0x10100248B70 08 12 07 12 00 00 D0 A4
06:51imirkin: try it with envydis i guess?
06:51imirkin: (SM20 = fermi btw)
06:51imirkin: (in case there's any confusion)
06:52imirkin: pretty sure that first op is wrong though
06:53imirkin: perhaps a boundary issue
06:53imirkin: 00000000: 81d00000 3e2041f1 $p0 slct b32 $r0 $r29 c0[0x7c60] gtu f32 $r16
06:53imirkin: at least that looks like a proper op?
06:53imanho: hmm.. cuobjdump says "arch = sm_20", it's a gtx 1060 though, why would cuobjdump tell me sm_20
06:53imirkin: no clue
06:54imanho: I just did "cuobjdump --sass prog" and it was compiled by "nvcc prog.cu"
06:54imirkin: pascal definitely _does_ have scheduling info
06:54imirkin: 8 bytes out of every 32 are scheduling info
06:55imirkin: instructions are aligned to 32-byte boundaries, first 8 are sched info, then 3x 8-byte ops
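A minimal sketch of the layout imirkin describes, assuming a little-endian stream of 64-bit words: every 32-byte group starts with one scheduling control word followed by three 8-byte opcodes. The type and function names here are made up for illustration, not a real nvdisasm/envydis API.

```c
#include <stdint.h>
#include <stddef.h>

/* One 32-byte SM60 (Pascal) instruction group: 8 bytes of scheduling
 * info, then three 64-bit ops. */
typedef struct {
    uint64_t sched;   /* scheduling control word for the 3 ops below */
    uint64_t op[3];   /* three 64-bit instructions */
} sm60_group;

/* Split a raw little-endian word stream into instruction groups.
 * Returns the number of complete groups written to 'out'; a trailing
 * partial group is dropped. */
static size_t sm60_split(const uint64_t *words, size_t nwords,
                         sm60_group *out, size_t max_groups)
{
    size_t n = 0;
    while ((n + 1) * 4 <= nwords && n < max_groups) {
        const uint64_t *g = words + n * 4;
        out[n].sched = g[0];
        out[n].op[0] = g[1];
        out[n].op[1] = g[2];
        out[n].op[2] = g[3];
        n++;
    }
    return n;
}
```

This also illustrates imanho's boundary issue above: if the dump does not start on a 32-byte boundary, the sched words and ops get misidentified, which is why re-dumping from 0x10100248B00 made nvdisasm happy.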
06:57imirkin: imanho: https://www.toptal.com/developers/hastebin/emiseziyih.apache
06:57imanho: they seem to be valid for SM32/SM35/SM37.
06:57imirkin: seems to make sense.
06:57imirkin: you want SM60
06:57imirkin: SM50 = maxwell
06:57imirkin: SM60 = pascal
06:59imirkin: SM3x = kepler (SM35+ is the "kepler2" isa which brings 256 regs, available on GTX 780 / Titan)
07:05imanho: I was dumping 0x100 bytes of memory starting at 0x10100248B38, changed it to 0x10100248B00, saved to binary and nvdisasm SM60 and finally, beautiful code:
07:05imanho: /*0038*/ LD.E.CG.128 R12, [R20+0x10], P0; /* 0x81d000000107140c */
07:05imanho: /* 0x001dc8003e2041f1 */
07:05imanho: /*0048*/ ST.E.128 [R18+0x10], R12, P1; /* 0xa4d000000107120c */
07:05imanho: /*0050*/ ST.E.128 [R18+0x110], R12, P1; /* 0xa4d000001107120c */
07:05imirkin: matches up with what i had decoded :)
07:07imanho: wait you decoded it manually?
07:07imirkin: did you not see the paste?
07:07imanho: I saw it
07:07imirkin: that was with envydis
07:08imirkin: i just typed in the bytes from what you pasted above, starting with 0x8b40
07:08imanho: ahh.. envydis is really cool then. I was under the impression nvdisasm >> envydis
07:08imirkin: generally, nvdisasm is more accurate yea
07:09imirkin: but (a) envydis can also assemble (with envyas) and (b) nvdisasm can be extremely picky
07:09imanho: nvdisasm also doesn't show any sched info, it just skips it.
07:09imirkin: but e.g. the sm50+ stuff in envydis is purely from fuzzing nvdisasm
07:09imanho: those "sched (st 0x0 wr 0x0 rd 0x0 ru 0x8) (st 0xe wr 0x0 rd 0x4 wt 0x31 ru 0x7) (st 0x0 yl wr 0x0 rd 0x0 wt 0x31 ru 0x7)
07:09imanho: " actually mean something right?
07:09imirkin: there are barriers
07:09imirkin: each instruction can read/write them through that sched info
07:10imirkin: (and wait on them, etc)
07:10imirkin: iirc it's explained at the top of gm107.c what they mean
07:10imirkin: imanho: https://github.com/envytools/envytools/blob/master/envydis/gm107.c#L55
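A hedged sketch of how those `st`/`yl`/`wr`/`rd`/`wt`/`ru` fields could be unpacked. The bit layout is taken from the envytools/maxas notes on SM50+ control words and should be treated as an assumption: three 21-bit sections at bits 0, 21 and 42, each holding, from the low bit up, stall(4), yield(1), write barrier(3), read barrier(3), wait mask(6), reuse(4).

```c
#include <stdint.h>

/* One per-instruction scheduling section, matching envydis' field names:
 * each op can set a write/read barrier, wait on barriers, stall, yield,
 * and flag operand reuse. */
typedef struct {
    unsigned st;  /* stall count */
    unsigned yl;  /* yield flag */
    unsigned wr;  /* write barrier index */
    unsigned rd;  /* read barrier index */
    unsigned wt;  /* wait-on-barrier mask */
    unsigned ru;  /* register reuse flags */
} sched_section;

/* Decode section 'which' (0..2) of a 64-bit control word.
 * Bit offsets are an assumption based on envytools/maxas notes. */
static sched_section sched_decode(uint64_t ctrl, int which)
{
    uint64_t s = (ctrl >> (which * 21)) & 0x1FFFFF;
    sched_section out = {
        .st = (unsigned)( s        & 0xF),
        .yl = (unsigned)((s >>  4) & 0x1),
        .wr = (unsigned)((s >>  5) & 0x7),
        .rd = (unsigned)((s >>  8) & 0x7),
        .wt = (unsigned)((s >> 11) & 0x3F),
        .ru = (unsigned)((s >> 17) & 0xF),
    };
    return out;
}

/* Inverse of sched_decode, handy for round-trip checks. */
static uint64_t sched_encode(sched_section s, int which)
{
    uint64_t v = (uint64_t)(s.st & 0xF)
               | ((uint64_t)(s.yl & 0x1)  <<  4)
               | ((uint64_t)(s.wr & 0x7)  <<  5)
               | ((uint64_t)(s.rd & 0x7)  <<  8)
               | ((uint64_t)(s.wt & 0x3F) << 11)
               | ((uint64_t)(s.ru & 0xF)  << 17);
    return v << (which * 21);
}
```

With that layout, the second section of the sched word quoted above would decode to something like `(st 0xe wr 0x0 rd 0x4 wt 0x31 ru 0x7)`.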
17:22karolherbst: imirkin: do you think you will find some time to look over some of the mt fixes I think are good? I removed the juicy bits, because I want to play around a little with a different approach that I think is ultimately the one we want to go with long term. But those patches should already fix a bit, and I plan to rework my patch fixing races with fencing on top of those in the MR
17:23imirkin: will try to look tonight
17:23karolherbst: okay, cool
17:23karolherbst: For quite some time I was really worried I messed up fencing, but it seems like nobody complained that I broke something :)
17:25imirkin: can you also write a paragraph somewhere (perhaps you already have) which explains what the general approach that you took is?
17:26karolherbst: yeah, planning to do that. The patches in there are mostly refactorings though, except the last fencing bits I guess
17:27karolherbst: mostly just adding wrappers for easier locking
17:28imirkin: ah ok
17:28imirkin: well if it's all self-explanatory, no need for extra explanation :)
17:28imirkin: i didn't look at the changes
17:28imirkin: i just figured i'd need some commentary to properly grok them
17:29karolherbst: yeah... my general idea was that if you need to take locks, all code touching libdrm should assert on the locks being taken (or not), etc...
17:30karolherbst: e.g. https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/b222a9aed741fcb18e8e2500566a9e6485dad3b2#3a93f5a5256f174da7d6dc7879c8b3145674f284
17:30karolherbst: that's the bit I need to rework
17:31karolherbst: (well, the patch, but it explains why I wrap everything)
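A hypothetical sketch of the "wrappers for easier locking" pattern being described: a lock that remembers its owner, so any code touching shared libdrm state can assert it is called with the lock held. None of these names exist in libdrm or the linked MR; this only illustrates the pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <pthread.h>

/* A mutex that tracks its current owner, so callers can assert
 * "this function must run under the lock". */
struct tracked_lock {
    pthread_mutex_t mtx;
    pthread_t owner;
    bool held;
};

static void tl_lock(struct tracked_lock *l)
{
    pthread_mutex_lock(&l->mtx);
    l->owner = pthread_self();
    l->held = true;
}

static void tl_unlock(struct tracked_lock *l)
{
    l->held = false;
    pthread_mutex_unlock(&l->mtx);
}

/* Put this at the top of every function that requires the lock;
 * it fires if the caller forgot to lock, or a different thread holds it. */
static void tl_assert_held(struct tracked_lock *l)
{
    assert(l->held && pthread_equal(l->owner, pthread_self()));
}
```

The payoff is that locking bugs turn into immediate assertion failures at the call site instead of rare misrendering or crashes.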
17:32karolherbst: still not 100% sure on how to deal with races on pushbufs, so that's the part I still need to toy around with
17:37imirkin: are you still going the "separate pushbuf" way?
17:37imirkin: or did you realize that it gains you nearly nothing?
17:37karolherbst: there is a third option actually
17:38karolherbst: but I haven't really tried it out yet. We can fill up context-private bos with commands and enqueue them into a global pb when flushing commands out via nouveau_pushbuf_data
17:40karolherbst: so yeah.. some of the patches there might not make sense anymore once I am done figuring out what's the best approach
17:42karolherbst: so the idea I had was: 1. lock state, 2. collect data, 3. create command stream, 4. submit... atm we don't do 1, and 2 and 3 are one step. One idea is to use multiple pushbuffers, but the new approach wouldn't really rely on that
17:42karolherbst: there are just more races within libdrm when using one, and that's the annoying part
17:43imirkin: well, that's what separate pushbufs are
17:43imirkin: each pushbuf has its own bo
17:43imirkin: and then you submit it, it's added to the IB
17:44imirkin: so you "gain" on the time it takes to write out the commands to the pushbuf, but you lose on having more complexity :)
17:44imirkin: given that pushbufs (normally) live in gart
17:44imirkin: writing them isn't a huge time sink, i think?
17:44karolherbst: the annoying part is just that if you share the client+pushbuf between multiple threads, there are weird races that are annoying to fix
17:45imirkin: the list of races doesn't change
17:45imirkin: they just become more apparent
17:45karolherbst: no, it actually changes
17:45imirkin: since one way you get misrendering due to incorrect ordering of commands
17:45karolherbst: waiting on bos e.g.
17:45imirkin: the other way you get much worse crashes
17:45imirkin: you still have to wait on the same things.
17:45karolherbst: libdrm only waits on bos from the same client object
17:45imirkin: oh, you mean having a diff client for each thread?
17:46imirkin: then yea, that's totally different
17:46imirkin: sorry, i missed that
17:46karolherbst: but that kind of requires having a pb for each thread as well
17:46imirkin: so separate clients but a single VM?
17:46karolherbst: so it's more of an implication
17:46karolherbst: imirkin: correct
17:46karolherbst: single channel even
17:46karolherbst: nouveau_client is just a weirdo libdrm wrapper for associating objects
17:47imirkin: well, there's a client in the kernel too
17:47karolherbst: yeah, but that's different :)
17:47karolherbst: and completely unrelated
17:48imirkin: i thought they were 1:1
17:48imirkin: but maybe not
17:48karolherbst: nouveau_client_new doesn't even call any ioctls
17:48karolherbst: it's just libdrm internal stuff
17:49karolherbst: it matters for nouveau_bo_wait e.g.
17:49karolherbst: push = cli_push_get(client, bo)
17:49karolherbst: that line changes behavior
17:50karolherbst: if the bo is associated with a different pb from a different client, this will return NULL
17:50karolherbst: and you won't kick the pb of that different client anymore
17:50karolherbst: -> no race
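A toy model of the behavior described above, with hypothetical names that only mirror, not reproduce, the real libdrm structures: each bo is associated with at most one client's pushbuf, and a wait only kicks a pushbuf when the waiting client is the one the bo is associated with.

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified stand-ins for libdrm's structures (names are illustrative). */
struct pushbuf { int pending; };           /* pending != 0: unsubmitted commands */
struct client  { struct pushbuf *push; };  /* one pushbuf per client/thread */
struct bo      { struct client *assoc; };  /* client this bo's commands belong to */

/* Mirrors the "push = cli_push_get(client, bo)" line quoted above:
 * returns the client's pushbuf only if the bo is associated with
 * that same client, NULL otherwise. */
static struct pushbuf *cli_push_get(struct client *cli, struct bo *bo)
{
    return bo->assoc == cli ? cli->push : NULL;
}

/* Waiting on a bo kicks (submits) pending commands -- but only from the
 * caller's own pushbuf.  A bo tied to another client's pushbuf yields
 * NULL above, so that other pushbuf is never touched from this thread:
 * that is the "-> no race" part.  Returns true if a kick happened. */
static bool bo_wait(struct client *cli, struct bo *bo)
{
    struct pushbuf *push = cli_push_get(cli, bo);
    if (push && push->pending) {
        push->pending = 0;   /* "kick" the pending commands */
        return true;
    }
    return false;
}
```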
17:52karolherbst: so my next idea was to just deal with bos instead of pushbuffer objects and just have the global pb locked for a very very short time when enqueuing commands or when dealing with other things.. but maybe we want to share a pb between different clients, not sure if that even works though
17:52karolherbst: anyway.. some things to try out
17:53karolherbst: also.. keep in mind that creating command buffers is generally considered a hot path, and if you end up waiting on a bo... you lost anyway. So if we only have to lock when waiting on the GPU and not when we are busy with important things, I think that's in the end the best approach
17:53imirkin: ok. been a while since i looked at that stuff
17:57karolherbst: but anyway.. before deciding what to do about pushbuffers, we can already fix races on nouveau_mm.h and nouveau_fence.h stuff I think :)
17:57imirkin: for sure
17:58imirkin: don't have to fix all races in one go
17:58imirkin: certain things which are unambiguously right, or move the locking into the direction we want to go, i'm all for
17:59imirkin: i also need to review that dude's compiler changes
17:59imirkin: i feel bad for taking so long
17:59karolherbst: which MR?
18:00imirkin: the person's name is M Hennig or Henning or something like that
18:01karolherbst: that L1 stuff, right?
18:01imirkin: there are a couple of changes
18:01imirkin: but yes, that's one of them