00:56dakr: airlied, remember the fix from yesterday to check chan->killed in nouveau_ioctl_exec() to keep subsequent jobs from hanging while waiting for the fence to signal?
00:57dakr: there could also be already queued up exec jobs..
00:59dakr: e.g. job C depends on B depends on A, when A fails and the channel is killed job B will hang waiting for the fence..
01:00airlied: dakr: yeah I think we have to work out a plan for those as well
01:00airlied:is in training today so no planning :-P
01:00dakr: in userspace the corresponding syncobj will time out and the client will destroy the channel sooner or later, which will also kill the fence context, which will signal the fence and job C will run and cause a fault on the deleted fence context
01:02airlied: dEQP-VK.memory.*
01:02airlied: Passed: 2535/3205 (79.1%)
01:02airlied: Failed: 20/3205 (0.6%)
01:02airlied: not too bad
01:02dakr: yeah..
01:02dakr: I'd have two ideas how to fix this issue:
01:03dakr: first, we could tear down the sched_entity in nouveau_channel_killed() right before nouveau_fence_context_kill() which will signal all pending fences.
01:04dakr: second, reject emitting a new fence in nouveau_fence_emit() when the fence context is killed.
01:04dakr: second would still let the jobs run even though they can't succeed, but they wouldn't block anymore.
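The second idea can be sketched as a small userspace model. This is only an illustration under simplifying assumptions (no locking, no hardware): `model_fence_ctx`, `model_fence`, `model_fence_ctx_kill()` and `model_fence_emit()` are hypothetical names, not the nouveau structures or functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence context; not the nouveau structure. */
struct model_fence_ctx {
    bool killed;            /* set once the channel is dead */
    unsigned int next_seq;  /* last sequence number handed out */
};

struct model_fence {
    unsigned int seq;
    bool emitted;
};

/* Mark the context dead, as would happen when the channel is killed. */
void model_fence_ctx_kill(struct model_fence_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->killed = true;
}

/*
 * Emit a fence, but refuse when the context is already dead, so the
 * caller fails right away instead of waiting on a fence that can
 * never signal.
 */
int model_fence_emit(struct model_fence_ctx *ctx, struct model_fence *fence)
{
    assert(ctx != NULL && fence != NULL);

    if (ctx->killed)
        return -ENODEV;

    fence->seq = ++ctx->next_seq;
    fence->emitted = true;
    return 0;
}
```

In this model a job queued behind a failed one gets `-ENODEV` back from the emit step and can be completed with an error rather than blocking.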
01:05airlied: I'd try the first and see how it goes
01:07dakr: yeah, I agree, and now we're getting there..
01:07dakr: we already did, remember this one? https://paste.centos.org/view/raw/df0fe16b
01:08dakr: for me the backtrace of nouveau_channel_killed() looks different and isn't in atomic context
01:08airlied: dakr: oh I merged some of Ben's code into my tree
01:09airlied: so I could get the stability fixes
01:11dakr: if we run nouveau_channel_killed() with preemption disabled we can only tear down the sched_entity by scheduling a work item, which makes it racy and hence we'd need to take the second approach.
01:12dakr: airlied, I wonder what's the reason for the change, can you point me to the commits?
01:15airlied: dakr: https://gitlab.freedesktop.org/skeggsb/nouveau/-/tree/00.06-gr-ampere is the tree
01:15airlied: it's got a lot of stuff in it
01:30dakr: airlied, with those changes nouveau_channel_killed() is called with two spinlocks acquired..
01:34dakr: so, probably need to refuse emitting a fence when the fence context / channel is killed and tear down the sched entity based on that
01:39dakr: skeggsb_, you have another idea to deal with that given your recent changes?
03:09fdobridge: <jekstrand> airlied: At some point, I should probably take your branches and make them do the interesting things and see how badly things fall over. IDK that you're ready for that yet, though. 😅
03:09fdobridge: <jekstrand> I'm also in no rush.
03:10fdobridge: <jekstrand> Rest of this week will be code review and little stuff. Starting in on NAK on Monday. That's the plan anyway.
03:12airlied: jekstrand: not ready yet, though I'm trying the first full cts parallel run now and it hasn't fallen over yet
03:13airlied: jekstrand: have to think about alignment a bit more as well
03:14airlied: having to align images to what the hw requires esp when compression kinds are used
03:14airlied: might need some nil enhancements
05:00airlied: Pass: 197325, Fail: 12522, Crash: 1704, Warn: 4, Skip: 1395119, Timeout: 5, Missing: 5749, Flake: 24828, Duration: 3:25:02, Remaining: 0 from a run on uapi
06:32airlied: dakr: timeline semaphores seems to function too!
06:34airlied: dakr: oh the nouveau exec ioctl needs a u32 padding between the u32 and u64
06:35airlied: as does the bind one
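The padding point can be illustrated with a generic argument struct. This is a sketch, not the actual nouveau uapi: `example_exec_args` and its field names are made up. The assumption behind it is that a 64-bit field following a single 32-bit field gets different offsets on ABIs that align `uint64_t` to 4 bytes (such as 32-bit x86) and on those that align it to 8, so an explicit pad keeps the layout identical everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ioctl argument struct; names are illustrative only. */
struct example_exec_args {
    uint32_t channel;
    uint32_t pad;       /* explicit padding: keeps push_ptr at offset 8 */
    uint64_t push_ptr;  /* user pointer passed as a 64-bit integer */
};

/* With the explicit pad the layout no longer depends on the ABI. */
static_assert(offsetof(struct example_exec_args, push_ptr) == 8,
              "push_ptr must start at offset 8");
static_assert(sizeof(struct example_exec_args) == 16,
              "struct must be 16 bytes without implicit padding");
```

Kernel code would typically also check that the pad field is zero so it can be given a meaning later.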
06:39airlied: okay timeline semaphores kinda work, one test hung weird
07:31airlied: jekstrand, dakr : I've stuck the new uapi changes for nvk into an MR
07:31airlied: jekstrand: I'd appreciate any pre-review of any of it you have time for
07:31airlied: there's quite a few changes outside the obvious syncobj/mem allocs
13:08vliaskov: hi skeggsb_ karolherbst is there a bleeding edge branch that I can try ampere tests on instead of mainline? E.g. https://gitlab.freedesktop.org/skeggsb/nouveau/-/commits/00.06-gr-ampere and https://gitlab.freedesktop.org/skeggsb/nouveau/-/commits/01.01-gsp-rm have a new interrupt tree for TU/GA and reworked mc ampere component (among many other unrelated changes). I am not yet interested in hwaccel / gr, just stable
13:08vliaskov: display and runpm. Still trying to debug "mc: intr 00000040" message and stacktraces seen on https://pastebin.com/8PHpuBus
13:09vliaskov: I also wonder what nouveau branch this tweet refers to: https://twitter.com/jekstrand_/status/1560648097197608961 perhaps that wouldn't be ampere related though
15:43jekstrand: vliaskov: No, not ampere-related.
15:44jekstrand: airlied: Yeah, I'll try to take a look this afternoon. Need to use my good brain time for SPIR-V CFG review
15:58dakr: airlied, sounds good!
15:59dakr: airlied, regarding the hung test, is it one that makes the channel fail?
16:00dakr: I fixed the fault we've got in such a case, but syncobj might still be blocked by the fence never getting signaled.
16:02dakr: Another option to fix that would be to use the scheduler timeout handler to unblock such a job, but I still think we probably shouldn't insert a new fence on a killed channel / fence context.
16:10dakr: This would also give us the option to set the dma fence's status to -ENODEV, indicating to userspace that the channel is dead.
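A rough userspace model of signaling pending fences with an error status follows. `sketch_fence`, `sketch_fence_fail()` and `sketch_fail_pending()` are hypothetical names for illustration; in the kernel the error would presumably be recorded with `dma_fence_set_error()` before the fence is signaled, and locking is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence model; not the kernel's struct dma_fence. */
struct sketch_fence {
    bool signaled;
    int error;  /* 0 on success, negative errno on failure */
};

/*
 * Record the error first and signal afterwards, so a waiter never
 * observes a signaled fence without its final status.
 */
void sketch_fence_fail(struct sketch_fence *fence, int error)
{
    assert(fence != NULL && error < 0);

    if (fence->signaled)
        return;  /* already completed, leave its status alone */

    fence->error = error;
    fence->signaled = true;
}

/* Fail every still-pending fence; returns how many were failed. */
size_t sketch_fail_pending(struct sketch_fence *pending, size_t count)
{
    size_t failed = 0;

    for (size_t i = 0; i < count; i++) {
        if (!pending[i].signaled) {
            sketch_fence_fail(&pending[i], -ENODEV);
            failed++;
        }
    }
    return failed;
}
```

A waiter that wakes up can then read the status and tell a dead channel apart from a normal completion.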
17:51TimurTabi: @jekstrand: aren't you talking about the 01.01-gsp-rm branch?
17:52airlied: probably the ampere one
17:52airlied: it's the one with all the stability work, the gsprm branch is based on it
17:58fdobridge: <gouz> hello all,
17:58fdobridge: <gouz> i am trying to debug a GL app with valgrind-mmt (followed https://nouveau.freedesktop.org/Valgrind-mmt.html)
17:58fdobridge: <gouz> but i am getting errors such as:
17:58fdobridge: <gouz> ERROR: nvrm_ioctl_host_map56: cannot find object 0xc1d0004e 0xbeefc360
17:58fdobridge: <gouz> ERROR: nvrm_mmap: couldn't find object/space offset: 0x0000000000000000
17:58fdobridge: <gouz> ERROR: Can't detect chipset, you need to use -m option or regenerate trace with newer mmt (> Sep 7 2014)
17:58fdobridge: <gouz> i am running on the proprietary nvidia driver on an rtx 2070
17:59fdobridge: <karolherbst🐧🦀> yeah.. that stuff doesn't work anymore with newest drivers
17:59fdobridge: <gouz> 😩
17:59fdobridge: <karolherbst🐧🦀> I have some WIP patches to wire up UVM, but that's very painful
17:59fdobridge: <gouz> ahh ok thanks!
18:00fdobridge: <karolherbst🐧🦀> https://github.com/karolherbst/valgrind/commits/uvm
18:00fdobridge: <karolherbst🐧🦀> and https://github.com/karolherbst/envytools/commits/UVM
18:01fdobridge: <gouz> i am going to try it!
18:01fdobridge: <karolherbst🐧🦀> not really sure it's worth it though as now we have the open source driver and could just dump stuff there
18:01fdobridge: <karolherbst🐧🦀> also probably needs to be rebased and adjusted to the newer driver versions
18:01fdobridge: <gouz> i probably should explain what led me to this
18:02fdobridge: <gouz> i was experimenting with instanced drawing in nvk, some tests do not pass because there is no support for gl_BaseInstance in the shaders
18:02fdobridge: <gouz> for gallium nouveau these values are passed with a ubo i think
18:03fdobridge: <karolherbst🐧🦀> yeah.. but if you have questions about what the ISA supports I can also answer those
18:03fdobridge: <gouz> oh nice!
18:03fdobridge: <gouz> do you think there exists such a value as a system value?
18:04fdobridge: <karolherbst🐧🦀> let me check
18:07fdobridge: <karolherbst🐧🦀> nope, doesn't seem to exist
18:07fdobridge: <karolherbst🐧🦀> guess you'll have to copy what we do in gallium
18:08fdobridge: <karolherbst🐧🦀> but jekstrand has some cheating ways of dumping command buffers
18:09fdobridge: <gouz> for testing, i managed to pass it through the nvk_root_descriptor_table and it works, but this is not the correct solution i think
18:10fdobridge: <gouz> good for experimenting though 🙂
18:11fdobridge: <jekstrand> Eh, it's as good as anything
18:11fdobridge: <gouz> but that way i cannot support gl_DrawID (i.e. on multidraw)
18:11fdobridge: <jekstrand> Right
18:14fdobridge: <jekstrand> For gl_DrawID, I think we want to have a spot reserved in the root descriptor but then emit a `LOAD_CONSTANT_BUFFER` from the MME to set the value.
18:18fdobridge: <gouz> I think I understand
18:19fdobridge: <gouz> Should I try implementing the gl_BaseInstance case as said above?
18:22fdobridge: <jekstrand> Sure but if you get into nouveau/codegen, walk away. I'm going to start typing us a new compiler on Monday so I don't want to add more weird interdependencies between NVK and codegen.
18:22fdobridge: <jekstrand> In particular, if it has to be plumbed through some hard-coded constant buffer or similar.
18:24fdobridge: <gouz> Ok nice! I will try something out and add a merge request for review
18:25fdobridge: <jekstrand> cool
22:15airlied: jekstrand: always syncing is actually a bit painful, since I have to make a syncobj :-P
22:15jekstrand: airlied: Yeah, well...
22:16jekstrand: airlied: There's something to be said for having a syncobj per queue which we always signal so that we have a way to do waitIdle().
22:17jekstrand: Not actual vkQueueWaitIdle(). That's already handled in common code.
22:17jekstrand: But for things like "OMG we need to swap out the texture pool"
22:17jekstrand: Which, incidentally, you need to do. IDK if you have code for that yet.
22:19airlied:has no idea what "swapping out the texture pool" entails thankfully
22:48jekstrand: airlied: If anything ever changes in nvk_queue_state_update(), we need to idle the queue
22:48jekstrand: airlied: It should only happen a single digit number of times per app launch so idling is fine and saves us from having to reference count.