06:17 anholt: imirkin: nice
06:17 imirkin: anholt: just going to be plopping in some "fails" files
06:18 anholt: imirkin: yep, I've found it to be quite helpful to have them in git.
06:18 imirkin: anholt: a bit tricky to put in toml files since they're pretty base-path-dependent
06:18 anholt: haven't had a chance this week to debug the nano any more, but I am still looking forward to getting some nouveau in the pipelines.
06:18 imirkin: that'd be nice
06:18 anholt: if anyone wants to debug the nano, I can forward my bits.
06:18 imirkin: unfortunately i have no useful advice on getting nano to work
06:19 imirkin: i know karol did get one working. but sounds like he flashed some firmware on it or something?
06:21 imirkin: if you have trouble with any desktop GPUs, let me know, i can usually help with those :)
06:22 imirkin: [oh, and before you decide to be clever and pick up an arm board with PCIe and plug one of those in ... don't. people have had a variety of problems.]
06:28 anholt: the worst part is I had the nano working at one point
06:28 anholt: and trying to get it working a second time has been equally hard
06:28 imirkin: =/
06:28 imirkin: perhaps you could convince tagr to take pity on you
06:34 anholt: tagr: one possibly easy question for you: should I be concerned about "Failed to set up IOMMU for device 57000000.gpu; retaining platform DMA ops", or is that normal? (lots of devices have a message like that)
06:34 anholt: the following lines at boot are:
06:34 anholt: 22-01-30 06:33:44 [ 8.560346] nouveau 57000000.gpu: NVIDIA GM20B (12b000a1)
06:34 anholt: 22-01-30 06:33:44 [ 8.565815] nouveau 57000000.gpu: imem: using IOMMU
06:34 anholt: 22-01-30 06:33:45 [ 8.637711] nouveau 57000000.gpu: bus: MMIO read of 00000000 FAULT at 137000 [ TIMEOUT ]
06:34 anholt: (and then more faults and then a timeout in falcon_v1_wait_for_halt)
06:43 imirkin: anholt: fyi, that register has to do with clock stuff
06:43 imirkin: at least on fermi and kepler. but these things tend to stay in place.
07:07 anholt: hmm. seems there are some gm200 falcon commits just after this 5.16 kernel. picking a couple gets me a null deref in init instead of that fail, but maybe I need to pick more.
07:07 anholt: anyway, sleep
07:09 imirkin: hrm. when west coast goes to sleep, maybe that's a sign for me to hit the hay too...
14:40 imanho: Hi. This is not a nouveau question per se: I'm running CUDA [so the proprietary driver] and I'm wondering if I can get the memory mapping of a CUDA context? [I want to know what _virtual address_ contains what, e.g. where the code of this kernel is.] I can get a dump of _physical_ memory with envytools, but that wouldn't help me with the VA view of a given context.
14:43 imanho: I've tried cuda-gdb, and then 'info memory mappings' [which only works when _outside_ a kernel] which shows things like: "0x7fffc6000000 0x7fffc6200000 0x200000 0x7fffc6000000 /dev/nvidia-uvm" and then that region can be dumped but I'm not sure what this data is really.
16:17 imanho: & it's very weird, the dump of SASS code [by nvdisasm] shows _device_ function calls as "JCAL 0x0 ; /* 0xe220000000000040 */" which doesn't make sense! But examining it at runtime with cuda-gdb, it still reads out as "JCAL 0x0", yet x/8b shows "0x40 0x00 0x00 0x0c 0x7e 0x02 0x20 0xe2" which clearly isn't JCAL 0x0!
16:24 imirkin: the compiler could fill in the relocs at upload time
16:25 imirkin: nouveau definitely does "binary" fixups for various things, i don't see why blob couldn't as well
16:25 imirkin: i forget if JCAL is relative or absolute
16:25 imirkin: but if it's the absolute one, or if it's calling something which is built/uploaded separately, then the offset would be dependent on where exactly it's placed in the code page
16:26 imirkin: (this is definitely a thing we do on nv50, where the hw wants absolute addresses in the code page iirc? i forget)
16:35 imanho: JCAL is absolute indeed. & yup! the fixup happens, but what's surprising is that "cuda-gdb x/i" still says JCAL 0x0 _after_ the fixup!
16:36 imirkin: is that command looking at the actual code page?
16:36 imirkin: nvidia GPUs (at least the gens I'm most familiar with, not sure about volta+) execute shaders from a fixed code page
16:37 imirkin: or more like ... code "area"
16:37 imirkin: can be (and usually is) multiple pages
16:37 imirkin: i suspect that cuda-gdb is showing you the shader's original data, and it's being fixed up at upload time into that code area
16:37 imirkin: and you're not actually looking at the contents of that code area
16:37 imirkin: but i have no idea
16:37 imanho: honestly, I don't know how cuda-gdb works under the hood. I do "x/4i $pc" and the call is a call to "printf" from a __device__ function. [but it's the same for calls to other user-defined __device__ functions]
16:37 imirkin: just a guess
16:38 imirkin: hm
16:38 imirkin: what's $pc?
16:38 imirkin: i.e. what value does $pc have?
16:38 imanho: program counter
16:38 imanho: it's some address like 0x555...
16:38 imirkin: is it relative to the start of the code area? or is it an absolute address?
16:38 imanho: it is absolute [at least seems like, it's a big address]
16:39 imirkin: can you check in /proc/maps where that goes?
16:39 imirkin: specifically, /proc/<pid>/maps
16:39 imanho: ooo... "0x5555555cb000 0x5555555cc000 0x1000 0x76000 /home/iman/projs/zuda/of.elf" I think your theory is correct
16:39 imanho: it's just looking back at where that address is in the original elf and not the actual runtime device memory
16:40 imirkin: yeah, makes sense. i think the raw $pc value is actually an offset from code area start
16:41 imanho: and it does this JCAL thing for printf but _not_ for other user-defined __device__ code (those become relative CALs and it doesn't do the fixup for them; I can see it's a CAL 0xd00)
16:41 imirkin: yeah, so "printf" is a pretty unnatural thing to do on a gpu
16:42 imirkin: it's most likely in a library which is uploaded globally
16:42 imirkin: but i guess they don't fix the location of that library? (in nouveau, we do)
16:42 imanho: exactly. that's what I'm looking for actually
16:42 imanho: supposedly, for _some_ specific chunk there is no ASLR (so that chunk with ~200 functions is always loaded at the same place) and I'm trying to find it
16:43 imanho: it's this paper: https://www.sciencedirect.com/science/article/pii/S0167404820303886
16:43 imirkin: probably either at start or end
16:43 imirkin: (that's where we stick it :) )
16:44 imanho: if all this stuff is happening somewhere.. is there a way to instrument the kernel module (which is open source) and put printk statements somewhere that would spew out when that chunk is allocated?
16:44 imirkin: no
16:44 imirkin: it's not super-global
16:44 imirkin: it's just global within the application
16:45 imirkin: instead of bundling printf into every shader that uses it
16:45 imirkin: it sticks printf into a single area for all shaders that the application executes to use
16:46 imanho: hmm.. btw, when I do "strace" I see a bunch of ioctls, these are what "libcudart.so" says to the gpu?
16:46 imirkin: right
17:49 imanho: "memcpy(buf,(void*)magicaddress,16);" would not lead to illegal memory access, if magicaddress pointed to some code right? So 1 way is to just keep trying this for different addresses right?
17:50 imanho: [The memcpy -in a device function- would translate to code which copies from SRC to registers then to DST]
17:50 imirkin: i mean ... that assumes that the code page is mmap'd into your process
17:50 imirkin: which it might or might not be
17:50 imanho: oh.. you mean that might happen "on demand"?
17:51 imirkin: or not at all
17:51 imirkin: and the code would be uploaded via a temp staging buffer
17:51 imirkin: to vram
17:51 imirkin: it's faster to write system memory than vram
17:51 imirkin: so the usual thing is to write to a system buffer, and then get the GPU to do a copy into vram while the cpu does other things
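(For illustration only, not the blob's actual upload path: the same write-to-system-memory-then-DMA pattern can be sketched at the CUDA-runtime level, roughly like this.)

    #include <cstring>
    #include <cuda_runtime.h>

    int main() {
        const size_t sz = 1 << 20;
        void *staging = nullptr, *vram = nullptr;
        cudaMallocHost(&staging, sz);      // pinned system memory: cheap for the CPU to write
        cudaMalloc(&vram, sz);             // destination buffer in vram
        memset(staging, 0xc3, sz);         // CPU fills the staging buffer

        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(vram, staging, sz, cudaMemcpyHostToDevice, s);  // GPU copy engine moves it
        /* ... CPU is free to do other work while the copy runs ... */
        cudaStreamSynchronize(s);

        cudaStreamDestroy(s);
        cudaFree(vram);
        cudaFreeHost(staging);
        return 0;
    }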
17:54 imanho: wait, A) the memcpy here is device_to_device. B) the fabled magical code section, I assume, is accessible (on the gpu) at that point, because "printf" works right before that "memcpy", so the code for that gpu-"printf" should be somewhere in the VA space of that cuda context (process), right?
17:55 imirkin: oh. i thought you were just doing memcpy in the cpu process
17:55 imanho: no no, it's in a __device__ function
17:55 imirkin: if it's a memcpy running in the shader
17:55 imirkin: then that _should_ work, but doesn't _have_ to
17:55 imirkin: iirc there can be some slightly subtle differences between memory that's shader-accessible vs just mapped in the vm
17:55 imirkin: but in practice, it shouldn't matter
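A minimal sketch of the shader-side probe being discussed (probe and candidate_va are made-up names, the address is just a placeholder guess), assuming the code area is shader-accessible as imirkin notes it usually is:

    #include <cstdio>
    #include <cstdint>
    #include <cuda_runtime.h>

    __global__ void probe(const uint8_t *candidate, uint8_t *out) {
        for (int i = 0; i < 16; ++i)      // SRC -> registers -> DST, byte by byte
            out[i] = candidate[i];
    }

    int main() {
        uint8_t *out = nullptr;
        cudaMallocManaged(&out, 16);
        const uint64_t candidate_va = 0x555555000000ULL;   // placeholder guess to sweep over
        probe<<<1, 1>>>(reinterpret_cast<const uint8_t *>(candidate_va), out);
        cudaError_t err = cudaDeviceSynchronize();          // an unmapped guess surfaces here
        if (err != cudaSuccess) {
            // "illegal memory access" etc.; the context is now unusable, start a new process
            printf("miss: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("hit: %02x %02x %02x %02x ...\n", out[0], out[1], out[2], out[3]);
        return 0;
    }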
17:57 imanho: & it's very inefficient to search through such a large addr space like this, given that |that code region| << |addr. space|
17:58 imanho: and you can't even do it in parallel; if any of those probes glitches, it corrupts the whole context! You have to start a _new_ host process to try again
17:58 imirkin: and the address isn't fixed
17:59 imirkin: you can use valgrind-mmt to find out what the base gets set to
17:59 imirkin: and hope that the allocations stay relatively constant across runs?
18:00 imirkin: (although it's just as likely that valgrind-mmt won't work with newer blob versions ... they tend to change uapi and we tend not to update too often)
18:16 imanho: mmm.. "/usr/lib/x86_64-linux-gnu/openmpi/include/mpi.h:322:57: error: expected expression before ‘_Static_assert’" is there any step other than what's seen on valgrind-mmt to get it to build?
18:23 imanho: purged my openmpi* and it worked!
18:27 imanho: " ? v a l g r i n d : t h e ' i m p o s s i b l e ' h a p p e n e d : ? " means "broken due to binary blob" right?
18:38 imirkin: hehe
18:39 imirkin: well, valgrind-mmt is designed specifically for the binary blob
18:39 imirkin: did you follow the steps over at
18:39 imirkin: https://nouveau.freedesktop.org/Valgrind-mmt.html
18:40 imirkin: imanho: there's a mmt-3.16.1 btw - probably best to use the latest...
18:40 imirkin: oh. and it's the default branch. so you can skip the "git checkout" step there.
18:44 imirkin: aaand it's gone.
18:52 imanho: I updated to 3.16 & now I get "valgrind: failed to start tool 'mmt' for platform 'amd64-linux': No such file or directory" hmm.. I just did the exact same thing but with 3.16.1
18:52 imirkin: do a clean?
18:53 imirkin: i assume you stuck it into a prefix btw
18:53 imirkin: you don't want to overwrite the system valgrind...
18:58 imanho: [sure. always-use-prefix is one of the 0xA commandments ;) ] things got interesting... With this new valgrind,
18:58 imanho: when I run the program, it hangs
18:58 imanho: for like ~5 minutes, before spitting out "illegal memory access encountered"
19:00 imanho: there are millions of lines of things like "--0000-- w 40:0x8e740, 0x00000000 " in the trace
19:00 imanho: couple of "==550953== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints."
19:01 imirkin: did you try it with like glxgears?
19:01 imirkin: the trace is supposed to be a binary file
19:02 imanho: yes this stuff I'm saying is the decoded version of that binary
19:02 imirkin: oh ok
19:02 imirkin: there's a tool called 'demmt' which can help analyze the trace
19:02 imanho: and it ran on a hello-world cuda which does a memcpy
19:02 imirkin: it's in envytools
19:02 imirkin: or upload the trace somewhere, i can take a look to see if it's usable
19:10 imanho: thanks. https://drive.google.com/file/d/1wAwu29QI0cf3hymk-nJdwla3YK_kl-WU/view?usp=sharing
19:13 imirkin: erm, demmt totally doesn't like that
19:13 imirkin: that's the original log file?
19:14 imirkin: demmt says "Bad system call"
19:14 imirkin: so either valgrind-mmt needs help, or demmt needs help, or most likely, both need help
19:29 imanho: when I do " $mmt_bin2dedma < mytrace.log > mytrace.txt ", I get: https://drive.google.com/file/d/1y-cOoZoPPklNw_rUSNtXMs3qim3DegCh/view?usp=sharing
19:31 imirkin: the dedma stuff is the "old" bits, but i guess it could work
19:32 imirkin: it's simpler, which means it's more finicky, but it can be semi-usable
19:32 karolherbst: imirkin: I did add support for the uvm stuff at some point but never pushed it because it depends on driver headers
19:33 imirkin: boo
19:33 karolherbst: https://github.com/karolherbst/envytools/commits/UVM
19:34 karolherbst: but yeah... it's quite annoying to keep up with nvidia on that front, so my thinking was it's easier to just use their headers
19:34 karolherbst: but supporting multiple versions would be painful unless we can just import the headers
19:36 imanho: did you see the .txt? it looks fine.
19:37 karolherbst: mhh well.. one needs to fix valgrind as well :/
19:42 imanho: demmt works fine (no error) for me on that log I sent. Are you sure your demmt is same version?
19:43 karolherbst: depends on what your expectations are
19:43 karolherbst: you'll miss out on a lot of stuff without handling uvm ioctls
19:45 imirkin: oh. i probably haven't updated in ages
19:46 imanho: yeah, by working I just meant I am not getting errors like imirkin :)
19:46 karolherbst: ahh :)
19:46 karolherbst: sadly for the valgrind changes I've made I also used nvidias headers :D
19:46 karolherbst: https://github.com/karolherbst/valgrind/commit/e29d6ef2b3de297f20c7f52756f3de50ad9461ba
19:46 imanho: but still.. uvm is probably not an issue for me here, right? the device code is not related to the unified addressing stuff, right? because the host never really touches it
19:47 karolherbst: it is
19:47 karolherbst: nvidia starts to use uvm ioctls for mapping memory
19:47 karolherbst: so you miss out on a lot of memory mappings
19:47 karolherbst: and those might or might not be used for code uploading
19:47 karolherbst: I wrote those patches because I wasn't seeing any kernels/shaders uploaded
19:48 karolherbst: see the uvm ioctls I explicitly handled in that valgrind patch
19:48 karolherbst: (which demmt also picks up)
19:49 imirkin: imanho: look for CODE_ADDRESS
19:49 imirkin: if you find that, that's the GPU VM address of the code page
19:49 karolherbst: imirkin: we don't have that on newer GPUs btw
19:49 imirkin: oh
19:50 imirkin: yeah, i don't know how volta+ works
19:50 karolherbst: you put in the address directly
19:50 imirkin: ah
19:50 imanho: fortunately I couldn't care less about Volta :D
19:50 imirkin: turing/ampere as well
19:50 imanho: D'oh!
19:50 imirkin: volta+ means "volta and later"
19:51 imirkin: tesla, fermi, kepler, maxwell, pascal, volta, turing, ampere
19:51 karolherbst: imirkin: check gv100_compute_setup_launch_desc
19:51 karolherbst: NVC3C0_QMDV02_02_VAL_SET(qmd, PROGRAM_ADDRESS_LOWER, entry & 0xffffffff);
19:51 karolherbst: it's part of the qmd stuff now
19:51 imirkin: karolherbst: ah ok
19:52 karolherbst: it still gets uploaded, so demmt should be able to tell, but... :D
19:52 karolherbst: memory mappings also work completely different with uvm
19:53 karolherbst: _anyway_ normally there is no need to go the valgrind route just to look at some kernels
19:53 karolherbst: imanho: do you want to know what code nvidia generates or is there a more complicated issue?
19:53 karolherbst: you can just disassemble cuda binaries
19:58 karolherbst: imanho: also.. opcodes are 128 bit
19:58 imanho: actually, I don't care about my cuda code. turns out, the driver _always_ loads some chunk of device code including things like 'printf' and 'malloc' at a specific address (no randomization) and I want to find that address and dump that chunk
20:00 karolherbst: imanho: do you have the full 128 bit of that JCAL?
20:03 imanho: <$_Z4kernPii$_Z7normal1v+8>: 0x1007f004 0x010ce200 0xfff7f005 0x01000007
20:04 karolherbst: mhhh
20:04 karolherbst: that looks strange
20:05 imanho: (x/4w $pc) I was confused because, nvdisasm shows it as "JCAL 0x0 ; /* 0xe220000000000040 */" (made me think it's 64 bits)
20:05 karolherbst: right.. but +8 is an odd offset
20:06 karolherbst: mind dumping the first 16 32 bit values of that function?
20:08 imanho: like this? (x/16w $pc)
20:08 imanho: 0x555555a06ea8 <$_Z4kernPii$_Z7normal1v+8>: 0x1007f004 0x010ce200 0xfff7f005 0x01000007
20:08 imanho: 0x555555a06eb8 <$_Z4kernPii$_Z7normal1v+24>: 0x0ff70006 0x5c980780 0xffa007f6 0x001fc400
20:08 imanho: 0x555555a06ec8 <$_Z4kernPii$_Z7normal1v+40>: 0x0ff70007 0x5c980780 0x0c000040 0xe220027e
20:08 imanho: 0x555555a06ed8 <$_Z4kernPii$_Z7normal1v+56>: 0x0ff70004 0x5c980780 0xffe007f0 0x001ff400
20:09 karolherbst: mhh, heh
20:09 imanho: something is wrong here
20:09 karolherbst: yeah..
20:09 karolherbst: you are on turing or ampere, right? not that I assume the wrong gen here
20:10 imanho: (no, it's gtx 1060)
20:10 karolherbst: ohhh
20:10 karolherbst: okay.. then opcodes are 64 bit.. sorry for that
20:11 karolherbst: but...
20:11 karolherbst: on pascal the code comes in groups of 4 64-bit words: 3 64-bit opcodes + scheduling stuff
20:12 imanho: this is the normal1() function:
20:12 imanho: __device__ __noinline__ unsigned long normal1(){
20:12 imanho:     printf("NORMAL\n");
20:12 imanho:     return 0;
20:12 imanho: }
20:12 karolherbst: yeah...
20:12 karolherbst: but
20:12 imanho: and this is what x/8i shows:
20:12 imanho: 0x555555a06ea8 <$_Z4kernPii$_Z7normal1v+8>: MOV32I R4, 0x0
20:12 imanho: 0x555555a06eb0 <$_Z4kernPii$_Z7normal1v+16>: MOV32I R5, 0x0
20:12 imanho: 0x555555a06eb8 <$_Z4kernPii$_Z7normal1v+24>: MOV R6, RZ
20:12 imanho: 0x555555a06ec0 <$_Z4kernPii$_Z7normal1v+32>:
20:12 imanho: 0x555555a06ec8 <$_Z4kernPii$_Z7normal1v+40>: MOV R7, RZ
20:12 imanho: 0x555555a06ed0 <$_Z4kernPii$_Z7normal1v+48>: JCAL 0x0
20:12 imanho: 0x555555a06ed8 <$_Z4kernPii$_Z7normal1v+56>: MOV R4, RZ
20:12 imanho: 0x555555a06ee0 <$_Z4kernPii$_Z7normal1v+64>:
20:12 karolherbst: okay...
20:12 imanho: some are empty?
20:12 karolherbst: scheduling info
20:12 imanho: ohh ohh got it
20:13 karolherbst: let's see...
20:13 imanho: and here is bytes for the JCAL part: <$_Z4kernPii$_Z7normal1v+48>: 0x40 0x00 0x00 0x0c 0x7e 0x02 0x20 0xe2
20:13 karolherbst: bunog
20:13 karolherbst: *bingo
20:13 karolherbst: /*0010*/ JCAL 0x27e0c0 ; /* 0xe220027e0c000040 */
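(Aside, since this resolves the earlier x/8b confusion: those eight bytes, read as one little-endian 64-bit word, are exactly that raw opcode. A tiny host-side check, just for illustration:)

    #include <cstdio>
    #include <cstring>
    #include <cstdint>

    int main() {
        const uint8_t b[8] = {0x40, 0x00, 0x00, 0x0c, 0x7e, 0x02, 0x20, 0xe2};  // bytes from x/8b
        uint64_t op;
        memcpy(&op, b, sizeof op);                       // x86 is little-endian
        printf("0x%016llx\n", (unsigned long long)op);   // prints 0xe220027e0c000040 -> JCAL 0x27e0c0
        return 0;
    }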
20:15 karolherbst: wondering why cuda-gdb prints it wrongly though
20:15 imanho: & yes, when I run again -> address of a local variable changes but THIS address remains the same
20:15 imanho: bingo.
20:16 karolherbst: imanho: btw.. https://gist.githubusercontent.com/karolherbst/2b5f19c5a8f020c81fd1276828e098b2/raw/1482ac90a070d90e8c600da201ea907d41017cde/hex_nvdisasm.sh
20:16 karolherbst: use it like echo "0xffa007f6 0x001fc400 0x0ff70007 0x5c980780 0x0c000040 0xe220027e 0x0ff70004 0x5c980780" | hex_nvdisasm SM60
20:16 karolherbst: but you have to be careful about alignment
20:17 karolherbst: need those 4x 64 bit blocks
20:17 karolherbst: and start 256-bit (32-byte) aligned
20:17 karolherbst: because of sched opcodes and whatnot
20:17 imanho: oh this is a godsend! thanks!
20:17 karolherbst: so you can't start at +8, but have to dump at +0 or +32 and so on
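(So for the dump above, the block to feed is the one starting at +32: 0xffa007f6 0x001fc400 0x0ff70007 0x5c980780 0x0c000040 0xe220027e 0x0ff70004 0x5c980780, which is exactly what the example command shown earlier does.)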
20:18 karolherbst: thank imirkin, who wrote the script I think?
20:18 imirkin: perl is fun.
20:18 imanho: yeah, thanks to both of you for the _discussion_ & all. really appreciated
20:19 imanho: this is great because now we have _1_ address in that fixed-addr blob.. we can just start expanding both ways until we fault, and that would be our _magic_ chunk
20:20 karolherbst: yeah...
20:20 imanho: but isn't 0x27e0c0 too small?
20:20 karolherbst: why?
20:20 karolherbst: it's an offset
20:21 karolherbst: before volta you have a huge memory area with all the code
20:21 karolherbst: and absolute jumps jump within this area
20:21 karolherbst: so JCAL 0x27e0c0 is jumping to code_memory_va + 0x27e0c0
20:21 imanho: I thought it was "CAL relative", "JCAL absolute", but the absolute in JCAL is still relative with respect to that base. got it!
20:22 karolherbst: yep
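(Concretely: if the code area base were, say, 0x10101d90000, that JCAL 0x27e0c0 would land at 0x10101d90000 + 0x27e0c0 = 0x1010200e0c0; the unknown is the base.)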
20:23 karolherbst: if I had to guess.. the code_memory starts at 0x555555a00000
20:23 karolherbst: but...
20:23 imanho: is that fixed too or randomized?
20:23 karolherbst: well.. should be easy to find
20:23 karolherbst: imanho: randomized.. but
20:24 karolherbst: + 0x27e0c0 should be a function :D
20:24 karolherbst: try 0x555555900000 or 0x555555800000 or something until you find something
20:24 karolherbst: I think...
20:24 karolherbst: not sure how much the area is aligned, but I'd assume pretty much
20:25 imanho: ohh... for testing this though
20:25 imanho: cuda-gdb
20:25 imanho: won't cut it. we have to do our memcpy stuff in the device function
20:26 karolherbst: why though?
20:26 imanho: so we were also wondering why cuda-gdb gets it as JCAL 0x0
20:27 karolherbst: mhh maybe they hid it ... could be.. but you should be able to print code at random addresses, no?
20:27 imanho: and imirkin's ansatz was that cuda-gdb is looking at this _pre_ fixup (so it's not really looking at device memory)
20:27 karolherbst: mhh
20:27 karolherbst: worth trying though
20:27 imirkin: i dunno whether it is or not
20:27 imirkin: just a thought
20:27 imirkin: anyways, an "absolute" jump is def relative to the code page
20:28 karolherbst: I'd assume they just clear absolute jumps :D
20:28 imanho: I think it's right. cuda-gdb doesn't see changes
20:28 karolherbst: ahh
20:28 imanho: I write all over the memory and do x/8i and get the same things
20:28 imanho: as if I haven't changed them
20:28 karolherbst: mhh
20:28 karolherbst: call dir once and try again?
20:28 karolherbst: (or whatever gdb function might force it to drop caches)
20:29 imanho: tried it. still spits out JCAL 0x0.
20:29 karolherbst: annoying
20:30 karolherbst: but then again.. it's nvidia. wouldn't surprise me if they hide internal code
20:30 karolherbst: imanho: maybe break at that JCAL and single step and see what happens? :D
20:30 imanho: yea.. so on brand :D
20:30 imanho: and yea this is funny. I did that too
20:30 imanho: it then goes to the next instruction
20:30 karolherbst: figures
20:30 imanho: as if there was no JCAL at all
20:31 imirkin: make sure you "step" rather than "next"?
20:32 imanho: yup. I do "si"
20:43 imirkin: imanho: the relative jumps are relative to PC
20:43 imirkin: (just in case that wasn't clear)
21:05 imanho: I swept that range, yet the memcpy [in __device__, copying from addr+0x27e0c0 to a buffer] always failed:
21:05 imanho: adr = 0x555555000000
21:05 imanho: for x in range(10000):
21:05 imanho:     res, errs = test(hex(adr))   # test() runs the probe and collects errors
21:05 imanho:     if len(errs) < 2:            # check before bumping adr, so adr is the hit
21:05 imanho:         print("BINGO!")
21:05 imanho:         break
21:05 imanho:     adr += 0x1000
21:08 imanho: BINGOOOOOOO! addr: "0x10101d90000"
21:09 imanho: (memcpy didn't fail - bunch of zeroes though, but still great!)
23:20 imanho: I found out that for addr in the interval [0x10101d81f40, 0x10101f81f30], addr+0x27e0c0 can be memcpied from without error. But then I scoured this range, _all_ of it, and the memcpies didn't throw errors BUT it's all 0x00s there. Why would a region of that size be empty?
23:21 imanho: (I started to look into the ~0x1010XXXXXXX range based on Table 4 of that mind control paper)