02:20esdrastarsis[d]: https://github.com/KhronosGroup/Vulkan-Docs/commit/87e6442f335fc08453b38bbd092ca67c57bfd3ab
02:20esdrastarsis[d]: VK_EXT_descriptor_heap finally
04:04sonicadvance1[d]: Wow, literally everyone at Khronos worked on that one
04:04sonicadvance1[d]: A powerful contributor list
04:05esdrastarsis[d]: finally Vulkan 2
04:05sonicadvance1[d]: min-spec, VK_KHR_descriptor_heap.
05:27rinlovesyou[d]: Happy to see nvidia put out a developer driver for it as well
05:28HdkR: I'm surprised that no one opened a mesa MR yet, but I guess it released at an odd time.
06:38airlied[d]: I finally managed to reproduce mhenning[d]'s script crash on one box without my patch; trying now with it
06:41marysaka[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1464147976300658762/image.png?ex=69746990&is=69731810&hm=33014dab5528effe2f813b0131a24b268a9a68bd24e29e7e98fdbacacf310c55&
06:41marysaka[d]: airlied[d]: I reproduced the crash after ~1h
06:41marysaka[d]: the Python wrapper around mel's shell script:
06:41marysaka[d]: ```py
import subprocess
import os
import sys
import time

# Give each run this long before assuming it hung instead of crashing.
TIMEOUT = 30

env = os.environ.copy()
# NVK_DEBUG=sync makes NVK synchronize on each submission;
# MESA_VK_ABORT_ON_DEVICE_LOSS aborts the process on device loss.
env["NVK_DEBUG"] = "sync"
env["MESA_VK_ABORT_ON_DEVICE_LOSS"] = "1"

i = 1
print(f"Trying to reproduce (timeout set to {TIMEOUT}s)")
while True:
    try:
        # A run that finishes before the timeout means the abort fired.
        subprocess.run(["./test_comp_internal.sh"],
                       stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL,
                       timeout=TIMEOUT, env=env)
        print(f"Attempt {i} successfully crashed, ending here")
        break
    except subprocess.TimeoutExpired:
        print(f"Attempt {i} failed, retrying")
        i += 1
        subprocess.run(["killall", "deqp-vk"])
        time.sleep(1)
        continue
    except KeyboardInterrupt:
        print(f"\nAttempt {i} cancelled by user, ending here")
        subprocess.run(["killall", "deqp-vk"])
        sys.exit(0)
print(f"Crashed in {TIMEOUT * (i - 1)}~{TIMEOUT * i}s!")
```
06:42airlied[d]: Is that with the same test you wrote before?
06:42marysaka[d]: no, it's with your patch only
06:42airlied[d]: Does your mmu test crash still?
06:42marysaka[d]: it does not
06:42marysaka[d]: I will try to make the mmu test loop forever to actually see
06:43marysaka[d]: want me to try with the locking patch I had too?
06:43marysaka[d]: maybe we are triggering a case there, I don't know
06:46airlied[d]: Those patches will just work around the underlying races, I think; best to figure out the offending sequences of parallel operations
06:47airlied[d]: Like it might be unfixable without a lock, but I'd like to see the race
06:52marysaka[d]: airlied[d]: no, I think the first patch's fix is needed too, because if we don't take the VMM main lock then we could race between unref/ref and unmap/map operations
06:52marysaka[d]: so we could end up corrupting the page tables that way, I think
06:53marysaka[d]: (and the non-"raw" methods of the VMM code do take that lock for get/put/unmap; map takes the lock indirectly)
06:53marysaka[d]: but it could be something else too, hard to tell
06:54marysaka[d]: I will try to rerun it with mmu debug logs enabled and see if I can get something out of it before reapplying that patch
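To make the suspected race concrete, here is a hypothetical interleaving sketched as a C comment; the raw-method behaviour and the exact timing are assumptions, not traced from the actual nouveau code:

```c
/*
 * Hypothetical interleaving of the suspected race, assuming the raw
 * map/unmap paths skip vmm->mutex.vmm:
 *
 *   thread A (raw unmap)              thread B (put/unref)
 *   --------------------              --------------------
 *   walk page tables for range
 *                                     lock vmm->mutex.vmm
 *                                     free page-table pages
 *                                     unlock vmm->mutex.vmm
 *   clear PTEs in freed pages    <-- use-after-free, corrupted tables
 */
```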
06:55airlied: marysaka[d]: as I've said, that should all be taken care of by the higher layers; taking locks in there isn't correct
06:57airlied: ref/unref can take locks, map/unmap shouldn't take them
06:57airlied: all memory allocations need to happen at ref/unref time
06:57airlied: then map/unmap should just be filling in PTEs
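As a minimal sketch of the contract airlied is describing (all names here are illustrative, not the real nvkm entry points):

```c
/* ref time: may lock and allocate; every page-table page the range
 * will ever need is allocated up front. */
int vmm_ref_range(struct vmm *vmm, u64 addr, u64 size)
{
	int ret;

	mutex_lock(&vmm->mutex);
	ret = alloc_page_tables(vmm, addr, size);   /* may sleep */
	mutex_unlock(&vmm->mutex);
	return ret;
}

/* map time: no locks, no allocation -- the tables already exist,
 * so this is nothing but PTE writes. */
void vmm_map_range(struct vmm *vmm, u64 addr, u64 size, u64 phys)
{
	write_ptes(vmm, addr, size, phys);
}
```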
07:02Mary: I'm saying that because if you look at nvkm_vmm_put, nvkm_vmm_get, and nvkm_vmm_unmap, they all lock vmm->mutex.vmm, while none of the raw methods end up locking it anywhere
07:02Mary: that's one of the main differences between before and after we introduced VM_BIND, but I could very well be mistaken on that one
07:02airlied: yes vm bind does things very differently
07:03airlied: the old API isn't designed for VM_BIND and is incompatible with it
07:03airlied: the old API locked at the lower levels; the VM_BIND API locks higher up and has different expectations
07:03airlied: anything that doesn't work is a bug in the lower-level code; the plan pre-nova was to rewrite all of it
07:04Mary: I see, so a different lock, sorry about that... it's quite hard to navigate between the layers ^^'
07:04airlied: it's a completely different architecture
07:05airlied: it relies on dma fence to sync stuff
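A rough sketch of what that looks like, assuming a scheduler-driven bind job (struct bind_job and vmm_map_range are illustrative; dma_fence_signal is the real kernel primitive):

```c
static void bind_job_run(struct bind_job *job)
{
	/* The scheduler only runs this once every dependency fence has
	 * signalled, so the map is ordered against in-flight GPU work
	 * by fences rather than by a low-level lock. */
	vmm_map_range(job->vmm, job->addr, job->size, job->phys);

	/* Let whoever depends on this mapping proceed. */
	dma_fence_signal(&job->done_fence);
}
```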
20:25_lyude[d]: btw airlied[d] I am probably going to put together proper patches for adding some of the suspend/resume flags plus your other fixes. While it doesn't seem like we have AD102 sleeping yet, if I'm not crazy I think they have actually made runtime s/r more reliable on my laptop. I was able to suspend/resume the GPU before, but after a few cycles I wouldn't be able to actually render anything on it (weirdly enough, runtime s/r still seemed to work fine) and everything would just cause DMA channel timeouts. I think part of that was a now-fixed mesa bug, but with all of the AD102 patches we have so far it doesn't cause timeouts any longer.