02:33imirkin: anholt: just want to clarify your comment on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14797 -- you just want me to add a blah-skips.txt file, yes?
04:50anholt: that's what I would do, but by no means required.
04:50imirkin: nah, it's a reasonable idea. i'll set it up
04:51imirkin: (but not right this minute. probably a task for tomorrow night)
15:12karolherbst: anholt: btw.. figured out the issues with the jetson?
16:34imirkin: skeggsb: this guy posted a book-length comment, but it's really good analysis of a really weird issue. basically resuming from suspend fails across a lot of nv50 GPUs since some time due to init_on_alloc=1 (setting it to 0 makes it "work") -- https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547#note_1242815
16:35anholt: karolherbst: nope
16:37imirkin: skeggsb: ideas welcome ;)
16:40imirkin: i'm having trouble thinking of scenarios where init_on_alloc=1 would *break* stuff while init_on_alloc=0 init_on_free=1 still works.
16:50karolherbst: imirkin: I just hope that this issue wasn't already fixed :D
16:50imirkin: karolherbst: tons of people are having this issue
16:50imirkin: and i think it's still current
16:50karolherbst: yeah... I guess so
16:51karolherbst: that init_on_alloc makes a difference just means we make use of uninitialized memory somewhere
16:52imirkin: and we just get lucky on average?
16:52imirkin: that it's not zero'd?
16:52karolherbst: seems that way
16:52imirkin: but with init_on_free=1 ...
16:53karolherbst: I'd run it with kasan enabled and see what happens
16:53imirkin: feel free to make that suggestion
16:56karolherbst: ehh.. I guess kasan won't hit it but a kernel version of ubsan
16:57karolherbst: let's see what ubsan checks we do have
16:59karolherbst: ohh no.. kasan is the right one :D
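[editor's note] For reference, enabling KASAN as suggested here is a kernel config change along these lines (generic KASAN; note that KASAN catches out-of-bounds and use-after-free, while plain reads of uninitialized memory are a different sanitizer class, so it may or may not hit this particular bug):

```
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_INLINE=y
```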
17:03karolherbst: imirkin: so all tesla gpus are affected?
17:04imirkin: i mean ... all? dunno. but many.
17:09karolherbst: well, I do have a few, so maybe I hit something on some
17:10imirkin: i sorta feel bad coz i have all this hardware. but i've been wfh for the past while, can't be messing with what is essentially my work comp
17:11karolherbst: yeah... understandable
19:21imanho: In a valgrind-mmt trace, an entry reads "--0000-- call method 0x5c000001:0x00800280"
19:21imanho: how should I _read_ this?
19:22imanho: isn't 0x5c000001 odd?
19:23imirkin: certainly not even...
19:24imanho: I recall you guys saying it _misses_ some ioctls, but I'm having trouble understanding those ioctls that _are_ being caught
19:25imanho: (and the calls)
19:25imanho: " --0000-- create gpu object 0x00000000:0x00000000 type 0x0000 () "
19:28imanho: the only thing I could see in the trace and was interesting was "--0000-- got new mmap at 0x200200000, len: 0x00200000, offset: 0x0, serial: 1, fd: 12" because that 0x200200.. address is actually among the addresses which don't fault
19:28imirkin: sorry, it's been _ages_ since i've looked at these details
19:29imanho: and since the authors mention envytools but not valgrind, I don't think this is what they did either. It's probably not the way to go anyway.
19:32imanho: "Specifically, we launched a CUDA application with a root privilege to directly access the GPU page table. In our CUDA application, the host process sets the candidate bits in PTE one at a time, and the GPU kernel tries to write to the corresponding memory address." -> here they are tweaking the PTE from the host process.
19:33imanho: how can I tweak the PTE of the GPU?! what sorcery is this?
20:04karolherbst: imanho: the PTE is just in VRAM
20:04karolherbst: and you can map VRAM into the CPUs VM
20:04karolherbst: we have some docs on that somewhere
20:04karolherbst: knock yourself out: https://github.com/NVIDIA/open-gpu-doc
20:05karolherbst: I think there was something better somewhere
20:06karolherbst: ehh no.. wait... let me check
20:06imanho: :karolherbst thank you thank you. this would be _really_ awesome! yea, I'm exactly looking for this: where that PTE is & how to map it into host VA.
20:06karolherbst: there it is
20:06karolherbst: e.g. https://github.com/NVIDIA/open-gpu-doc/blob/master/manuals/turing/tu104/dev_ram.ref.txt
20:07karolherbst: I think this describes the full stack (including GPU contexts and VA per context and shit)
20:08karolherbst: when in doubt.. read the nouveau kernel code.. we have to do the same stuff anyway
20:10karolherbst: imanho: when booting with iomem=relaxed you can even read the GPU's PCI memory (a.k.a. GPU registers) to pull out some of those pointers, but be aware that some of them (especially the GR parts) are all context switched
22:58imanho: I can see the layout of stuff in the docs, but where do these structures live? to bring them into the VA space of a process, is it as easy as mmap'ing the region with the data I want from "/dev/nvidia0"?
23:41pabs: imirkin, karolherbst: that init issue sounds like a use-after-free scenario?
23:43imirkin: pabs: wouldn't init_on_free=1 defeat that?
23:44imirkin: i guess i dunno if it runs on kfree
23:44imirkin: or only at a larger "free" event
23:44imirkin: should probably check precisely what that init_on_* does
23:46pabs: you said init_on_free=1 fixes the issue. I assume init_on_free=1 means the free overwrites the used data, which means the next use of that data will find zeros, which means the issue will not trigger
23:50imirkin: init_on_alloc=0 fixes the issue
23:50imirkin: init_on_alloc=1 init_on_free=1 --> broken
23:50imirkin: (and the default)
23:57pabs: and init_on_alloc=1 init_on_free=0 = working?
23:58imirkin: init_on_alloc=0 -> working
23:58imirkin: like i said, need to check precisely when init_on_free does its thing
23:58imirkin: perhaps use-after-free does explain it