00:15fdobridge: <gfxstrand> `Pass: 402160, Fail: 510, Crash: 26, Skip: 1730079, Timeout: 2, Flake: 593, Duration: 1:10:01`
00:15fdobridge: <gfxstrand> Fixed 324 tests, sadly none of them were flakes or causing flakes/hangs.
00:33fdobridge: <airlied> @gfxstrand 24592 says it fixes some crashers
02:11fdobridge: <airlied> @gfxstrand I moved your s8 support to a new MR, rebased it, fixed kind selection, and made it Turing+ for now; clears/copies work but it's still causing GPU hangs with draws
02:12fdobridge: <esdrastarsis> Oh, Debian already has a GSP firmware package available
02:14fdobridge: <airlied> I wonder how they picked a version
02:25fdobridge: <gfxstrand> Thanks!
02:26fdobridge: <gfxstrand> I'm doing a serial run now. We'll see if I can figure out the hangs from that tomorrow.
02:26fdobridge: <gfxstrand> It's about a 6h run if deqp-runner is to be believed
02:27fdobridge: <gfxstrand> @airlied Do you have IRC mod on #Nouveau?
02:28fdobridge: <airlied> I have ops I think
02:29fdobridge: <airlied> so do you
02:30fdobridge: <gfxstrand> I do? 😂
02:32fdobridge: <gfxstrand> I should probably learn how to use them
02:33fdobridge: <airlied> both you and me are in the list as CHANOP in chanserv
02:36fdobridge: <gfxstrand> k
03:00airlied: dakr: sent you a lockdep trace on rh email
03:01fdobridge: <airlied> I wonder what magic s8 needs
03:13fdobridge: <airlied> okay, the locking issue is there both pre-GSP on Ampere+ and with GSP
03:18fdobridge: <airlied> hmm, push_sync is unfortunately broken with the new uapi; we don't get the failure until the next push submit
03:18fdobridge: <airlied> since exec returns errors, but we don't send errors on sync objs as we know that can cause bad stuff
03:29fdobridge: <gfxstrand> If we had an ioctl to check for channel death, there's a vk_device hook for it that gets called periodically.
03:30fdobridge: <gfxstrand> In particular it gets called before `vkWaitSemaphore()` and `vkWaitForFences()` return.
03:31fdobridge: <gfxstrand> We can still implement Vulkan without it, it just lets us know a little earlier.
03:33fdobridge: <airlied> null exec is that ioctl really
03:34fdobridge: <airlied> granted we could make it nullier
03:34fdobridge: <airlied> so maybe an alternate ioctl would be useful
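A rough sketch of how such a query could plug into the common Vulkan runtime's device-loss check. DRM_IOCTL_NOUVEAU_CHANNEL_STATUS, its struct, and the `nvk_device`/`dev->fd` fields are all made-up names for illustration, not real uAPI or real NVK code:

```c
/* Hypothetical only: DRM_IOCTL_NOUVEAU_CHANNEL_STATUS and its struct are
 * invented names for the "is any channel on this fd dead?" query discussed
 * above; nvk_device and dev->fd are likewise illustrative.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include "vk_device.h"
#include "util/macros.h"

struct drm_nouveau_channel_status {
   uint32_t channel;   /* in: channel id, or ~0 to ask about any channel */
   uint32_t dead;      /* out: nonzero once the channel has been killed */
};

static VkResult
nvk_check_status(struct vk_device *vk_dev)
{
   struct nvk_device *dev = container_of(vk_dev, struct nvk_device, vk);
   struct drm_nouveau_channel_status st = { .channel = ~0u };

   /* DRM_IOCTL_NOUVEAU_CHANNEL_STATUS does not exist; placeholder only. */
   if (ioctl(dev->fd, DRM_IOCTL_NOUVEAU_CHANNEL_STATUS, &st) < 0 || st.dead)
      return vk_device_set_lost(vk_dev, "channel died");

   return VK_SUCCESS;
}

/* Wired up at device creation as dev->vk.check_status = nvk_check_status;
 * the runtime then calls it before vkWaitSemaphores()/vkWaitForFences()
 * return, which is the hook mentioned above.
 */
```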
03:43fdobridge: <airlied> @gfxstrand https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/bd644d97b300001f6a69a0efd47f1da89e5cf505.patch first pass
03:45fdobridge: <gfxstrand> Being able to query for any dead channel would be useful too. Maybe return the first dead one?
03:46fdobridge: <gfxstrand> Also, are you planning to take a lock before you walk that linked list? 😅
03:51fdobridge: <airlied> abi16 get/put is the lock afaik
03:52fdobridge: <airlied> will we ever really have more than 1 channel on a fd?
03:53fdobridge: <airlied> ah well, the s8 traces from the prop driver are telling me they definitely do more stuff for s8 than for d24s8, but what more they do is not very obvious
03:53fdobridge: <airlied> but there appears to be some s8 clears to 0xff before clears to 0x0
03:53fdobridge: <airlied> but not sure if that is just some ZCULL workarounds
04:21fdobridge: <gfxstrand> Yeah. Channel is per-queue, fd is per-device.
04:22fdobridge: <gfxstrand> Weird...
04:30fdobridge: <airlied> ah real queues 😛
04:30fdobridge: <gfxstrand> I'm not surprised there might be workarounds. What surprised me is that rendering just seemed to not work. Maybe it was the PTE thing, though.
04:31fdobridge: <gfxstrand> Yeah, pretty sure we already support those properly. Have for a while now, actually.
04:32fdobridge: <gfxstrand> I should probably plumb stuff through a bit so we don't allocate subchannels we don't need. Like, a DMA-only queue doesn't need 3D or compute subchannels. 🤔
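As an illustration of that plumbing, a hypothetical per-queue subchannel setup driven by the Vulkan queue flags; `nvk_queue_bind_subchannel()` and the SUBC_* ids are invented for this sketch, not the actual NVK helpers:

```c
/* Purely illustrative: decide which subchannels to bind based on the queue's
 * capabilities.  nvk_queue, nvk_queue_bind_subchannel() and the SUBC_* ids
 * are made up for this sketch and are not the real NVK names.
 */
static VkResult
nvk_queue_init_subchannels(struct nvk_queue *queue, VkQueueFlags flags)
{
   /* Every queue we expose can do transfers, so copy is always bound. */
   VkResult result = nvk_queue_bind_subchannel(queue, SUBC_COPY);
   if (result != VK_SUCCESS)
      return result;

   if (flags & VK_QUEUE_COMPUTE_BIT) {
      result = nvk_queue_bind_subchannel(queue, SUBC_COMPUTE);
      if (result != VK_SUCCESS)
         return result;
   }

   if (flags & VK_QUEUE_GRAPHICS_BIT) {
      result = nvk_queue_bind_subchannel(queue, SUBC_3D);
      if (result != VK_SUCCESS)
         return result;
   }

   /* A transfer-only queue ends up with just the copy subchannel. */
   return VK_SUCCESS;
}
```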
04:32fdobridge: <airlied> the PTE fixed clears for me, but rendering still doesn't work
04:32fdobridge: <airlied> gr: DATA_ERROR 00000135 [] ch 4 [01ffe43000 deqp-vk[3181]] subc 0 class c597 mthd 0d78 data 00000006
04:32fdobridge: <airlied> on the first draw to it
04:32fdobridge: <gfxstrand> Ah, yes, the universal error message. 😅
04:33fdobridge: <airlied> yeah something went wrong, please try again
04:34fdobridge: <airlied> the d24s8 version of the test works fine, and the command streams are the same apart from the stencil bit
04:34fdobridge: <airlied> whereas nvidia does it a bit differently
04:35fdobridge: <gfxstrand> 🙃
04:36fdobridge: <airlied> https://people.freedesktop.org/~airlied/scratch/nv-traces/ has traces, stencil-op and ds-op are s8 and d24s8 from nvidia
04:38fdobridge: <airlied> added ptrs in the MR
04:43fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Where?
06:05fdobridge: <gfxstrand> `[109149.353018] nouveau 0000:17:00.0: deqp-vk[77832]: pushbuf push count exceeds limit: 514 max 512`
06:05fdobridge: <gfxstrand> @airlied I thought you said there was no limit. 😛
06:11fdobridge: <gfxstrand> Fortunately, with the new UAPI, splitting pushbufs is relatively easy
06:12fdobridge: <gfxstrand> But I'm wondering if I really need to split or if we just need to improve the kernel
06:19fdobridge: <airlied> Didn't think we limited it. But maybe the ring space limits it
06:19fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> So the new uAPI was a lie? 🤔
06:19fdobridge: <gfxstrand> nouveau_exec.c:390
06:20fdobridge: <gfxstrand> No, with the new uAPI, splitting stuff up in NVK is trivial. It's still way better.
06:20fdobridge: <gfxstrand> I mean, it's not totally trivial with all that old UAPI code in the way but, once we delete that, it'll be really easy.
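For context, a minimal sketch of what that userspace split could look like against the new EXEC uAPI. The struct layout is reproduced from memory and may not match the final header exactly, and the wait/signal placement follows the idea discussed further down (in-fences on the first chunk, out-fences on the last):

```c
/* Hypothetical splitter: submit an arbitrarily long push list as several
 * EXEC ioctls of at most MAX_PUSH entries each.  MAX_PUSH mirrors the 512
 * limit from the dmesg message above; the struct fields are assumed, not
 * copied from the header.
 */
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>
#include <drm/nouveau_drm.h>

#define MAX_PUSH 512

static int
exec_split(int fd, uint32_t channel,
           struct drm_nouveau_exec_push *push, uint32_t push_count,
           struct drm_nouveau_sync *waits, uint32_t wait_count,
           struct drm_nouveau_sync *sigs, uint32_t sig_count)
{
   for (uint32_t i = 0; i < push_count; i += MAX_PUSH) {
      const bool first = (i == 0);
      const bool last = (i + MAX_PUSH >= push_count);
      struct drm_nouveau_exec req = {
         .channel = channel,
         .push_count = last ? push_count - i : MAX_PUSH,
         .push_ptr = (uintptr_t)&push[i],
         /* In-fences only on the first chunk, out-fences only on the last. */
         .wait_count = first ? wait_count : 0,
         .wait_ptr = first ? (uintptr_t)waits : 0,
         .sig_count = last ? sig_count : 0,
         .sig_ptr = last ? (uintptr_t)sigs : 0,
      };

      int ret = drmIoctl(fd, DRM_IOCTL_NOUVEAU_EXEC, &req);
      if (ret)
         return ret;
   }
   return 0;
}
```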
06:25fdobridge: <gfxstrand> @airlied, dakr: I started a serial CTS run 4 hours ago and it's got 3 hours left. I have yet to see a single GPU fault. I see a dozen segmentation faults, two hung shaders, and two of those push count messages. No GPU faults.
06:25fdobridge: <gfxstrand> Oh, and so far only 1 flake.
06:26fdobridge: <gfxstrand> @airlied, dakr: I suspect the remaining faults (and corresponding flakes) are also kernel bugs. Maybe something wrong with eviction that's causing it to not wait properly on pending work before swapping stuff out from underneath us?
06:57fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> I see a NvGspRm nouveau option in Ben's kernel (is it used to disable the GSP support?)
07:50fdobridge: <airlied> @gfxstrand hey does dEQP-VK.memory.allocation.basic.size_8KiB.forward.count_4000 fail for you?
07:50fdobridge: <airlied> I'm hitting a 4k limit on allocations and I'm not sure what's at fault; tomorrow-me will try to work it out
07:51fdobridge: <airlied> I think we can burn that limit check
07:51airlied: dakr: ^
09:24fdobridge: <esdrastarsis> It's to enable the support
09:36fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> I didn't need this to use GSP before
09:38fdobridge: <esdrastarsis> But now you need it; since the support is still WIP, GSP is disabled by default
10:15fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> I wonder if the early GSP blobs are MIT-licensed (because they're binary blob firmware) :nouveau:
10:52fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Anyway I packaged the Turing/Ampere GSP blobs in an Arch package :nouveau:
10:57fdobridge: <butterflies> > if the early GSP blobs are MIT-licensed
10:57fdobridge: <butterflies> Yes (as in, the bootstrap)
14:48fdobridge: <gfxstrand> Here's the full dmesg from my overnight serial run. No faults. Only two shader hangs.
14:48fdobridge: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1139208671935598602/message.txt
15:03fdobridge: <karolherbst🐧🦀> yeah... so I don't exactly know what it is, but I _think_ nouveau is kinda able to mess up command submissions if multiple processes submit things at the same time. I sadly have no solid evidence of this...
15:04fdobridge: <karolherbst🐧🦀> threaded runs were always more crashy, though; one reason for that is that a dead channel could wipe out other channels as well if the gr needs rebooting
15:05fdobridge: <karolherbst🐧🦀> anyway.. I'm sure it's something silly
15:10dakr: airlied: dEQP-VK.memory.allocation.basic.size_8KiB.forward.count_4000 does pass on my end
15:45fdobridge: <gfxstrand> It's not that. I have exactly 2 channel deaths in that run. That doesn't explain 500 flakes.
15:46fdobridge: <gfxstrand> `Pass: 402772, Fail: 501, Crash: 13, Skip: 1730079, Timeout: 2, Flake: 3, Duration: 7:08:52`
15:46fdobridge: <karolherbst🐧🦀> the point is, if the runner tries a failed test again and it passes, the original failure may have been its channel getting killed as collateral from another test's channel dying
15:47fdobridge: <karolherbst🐧🦀> and then running it again fixes it
15:47fdobridge: <gfxstrand> Yes, I know. But there are exactly 2 resets in that entire run.
15:47fdobridge: <gfxstrand> 2 resets * 18 threads is a maximum of 36 tests affected by that
15:47fdobridge: <karolherbst🐧🦀> right, I think some of that threading stuff is in the channel recovery, but I also think there are other threading issues besides those
15:48fdobridge: <karolherbst🐧🦀> sometimes it feels like contexts mess up each other's states
15:49fdobridge: <gfxstrand> Yeah, given that most of what we're seeing at this point is faults, I suspect one of two things:
15:49fdobridge: <gfxstrand> 1. We have a binding race in NVK or the CTS that somehow doesn't show up EVER in a serial run.
15:49fdobridge: <gfxstrand> 2. We have a kernel bug either pertaining to multiple contexts in flight or to memory pressure and eviction
15:49fdobridge: <gfxstrand> Given the statistics of 400k tests, I really don't think 1 is likely.
15:50fdobridge: <karolherbst🐧🦀> yeah.. sooo.. command submission is a bit of a pita and I think it involves forcing/switching to a gpu context
15:50fdobridge: <karolherbst🐧🦀> but I don't actually know for sure
15:50fdobridge: <karolherbst🐧🦀> I'd have to read up on that code, but I wouldn't be surprised if the kernel can submit to a wrong context
15:50fdobridge: <karolherbst🐧🦀> how many dead channels do you see in a threaded run?
15:52fdobridge: <gfxstrand> IDK about the dead channel message itself but I see hundreds of faults
15:52fdobridge: <karolherbst🐧🦀> mhh yeah...
15:52fdobridge: <karolherbst🐧🦀> so I think there is the possibility to mess up other channels
15:53fdobridge: <karolherbst🐧🦀> It also kinda feels like something is racing on the channel ids as well
15:53fdobridge: <karolherbst🐧🦀> for whatever reason
15:54fdobridge: <karolherbst🐧🦀> maybe I just write some userspace code to fuzz that stuff, just submit hundreds of things or something
15:55fdobridge: <gfxstrand> That doesn't sound like a bad plan. You can put that code in IGT
15:55fdobridge: <gfxstrand> 🙂
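Something along these lines, as a very rough sketch (not actual IGT code; channel and device creation are elided, and the exec struct layout is assumed as in the earlier sketch):

```c
/* Very rough fuzzer sketch (not real IGT code): one thread per channel
 * hammering null execs (push_count == 0) to look for channels interfering
 * with each other.  Channel/device setup is deliberately elided.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <xf86drm.h>
#include <drm/nouveau_drm.h>

struct hammer_arg {
   int fd;            /* DRM fd the channel was created on */
   uint32_t channel;  /* channel id returned at channel creation */
};

static void *
hammer(void *data)
{
   struct hammer_arg *a = data;

   for (unsigned i = 0; i < 100000; i++) {
      struct drm_nouveau_exec req = { .channel = a->channel };

      if (drmIoctl(a->fd, DRM_IOCTL_NOUVEAU_EXEC, &req)) {
         fprintf(stderr, "channel %u: exec failed at iteration %u\n",
                 a->channel, i);
         break;
      }
   }
   return NULL;
}

/* Usage idea: create N DRM fds + channels, fill one hammer_arg each, spawn
 * pthread_create(&tid[n], NULL, hammer, &arg[n]) for all of them and join;
 * any cross-channel fault or unexpected channel death points at a kernel bug.
 */
```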
15:55fdobridge: <karolherbst🐧🦀> what concerns me is that we literally put the channel id in the submission ioctl.. what if...
15:56fdobridge: <gfxstrand> Wait... The nouveau kernel isn't assuming that's a globally unique key is it? 🤯
15:56fdobridge: <karolherbst🐧🦀> that's my working theory
15:56fdobridge: <gfxstrand> That would be v. bad
15:56fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Like not resetting a variable's value 🐸
15:56fdobridge: <karolherbst🐧🦀> yeah...
16:14fdobridge: <gfxstrand> Looks like the one shader hang will get fixed by reclocking.
17:17fdobridge: <gfxstrand> Hrm... If there are lots of long shaders, it's possible that shaders which complete in plenty of time in the serial run end up timing out when we run in parallel. If those resets break other channels, that could explain it, too.
17:21fdobridge: <gfxstrand> I would say that the solution is to do my threaded runs with GSP except I have a feeling that might not be such a good idea. 😂
17:48fdobridge: <karolherbst🐧🦀> ahh yeah.. might run into some weird timeouts
17:48fdobridge: <karolherbst🐧🦀> I've seen some issues going away on older GPUs when the users clocked up the GPU
17:48fdobridge: <karolherbst🐧🦀> nah, GSP is fine.. probably 😄
17:49fdobridge: <karolherbst🐧🦀> I've done a CTS run on Ada with GSP
17:49fdobridge: <karolherbst🐧🦀> threaded it was kinda hit and miss, but serial it should be totally fine
17:54fdobridge: <airlied> I'm doing threaded GSP runs on Ampere most days, I think my flake numbers are similar
17:54fdobridge: <gfxstrand> 🫤
17:54fdobridge: <airlied> Also GSP has one explosion mode that takes out the test run randomly
17:55fdobridge: <gfxstrand> The faults make me really suspect context juggling or eviction.
17:55fdobridge: <airlied> And I think this 4000 alloc limit might be gsp related
17:56fdobridge: <gfxstrand> Why? I don't get an alloc limit at all. It's all just page tables. We should be able to point PTEs at as many things as we want.
17:58fdobridge: <airlied> Yeah no idea just noticed the tests failing
17:58fdobridge: <airlied> Like I don't think I'm running out of vram after 4000 8k allocs
17:59fdobridge: <airlied> Were the old uAPI flakes about the same?
18:02fdobridge: <airlied> https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/6aff9ad3d1f69502d43f8d83c7f5297470f9adb6.patch was something I wrote before
18:02fdobridge: <airlied> But I doubt it will make any difference
18:31fdobridge: <airlied> @gfxstrand care to paste the failures.csv from the single thread run?
18:37airlied: dakr: actually if we do remove that limit on pushbuf, I think there's a chance u_memcpya can be overflowed
18:37airlied: but it can probably be overflowed now with the signal/waits stuff
18:39fdobridge: <gfxstrand> https://cdn.discordapp.com/attachments/1034184951790305330/1139266761909354576/message.txt
18:42fdobridge: <airlied> thanks!
20:13dakr: airlied: Oh! Guess, you mean because of "size *= nmemb"?
20:13airlied: yup
20:13airlied: I sent a second patch
20:14airlied: I think we need vmemdup_user_array
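The overflow in question is the unchecked `size *= nmemb` in u_memcpya; an overflow-safe replacement could look roughly like this (the helper name follows the chat, and the one that eventually lands upstream may be spelled differently):

```c
/* Sketch of an overflow-checked copy of a userspace array, in the spirit of
 * the helper proposed above.
 */
#include <linux/overflow.h>
#include <linux/string.h>
#include <linux/err.h>

static void *vmemdup_user_array(const void __user *src, size_t n, size_t size)
{
	size_t bytes;

	/* Fail instead of silently wrapping around like `size *= nmemb`. */
	if (check_mul_overflow(n, size, &bytes))
		return ERR_PTR(-EOVERFLOW);

	return vmemdup_user(src, bytes);
}
```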
20:15dakr: where did you send them?
20:21airlied: dri-devel + nouveau
20:22airlied: https://patchwork.freedesktop.org/series/122304/
22:46dakr: gfxstrand, airlied: since the pushbuf limit of 512 wasn't enough, what's the worst case amount to expect?
22:51fdobridge: <airlied> Not sure if Vulkan advertises any limit; do we have any limitation other than kmalloc?
22:51fdobridge: <airlied> The ring might have a limit I suppose
22:53dakr: airlied: Just trying to figure out if I want to get rid of an extra copy of this array..
22:56dakr: However, I think we can get rid of it either way. I think I took this over from how I initialize VM_BIND jobs. There I wasn't sure if we'd ever have the case where we need to generate jobs from the kernel side and hence don't deal with userptrs, so I abstracted this part.
23:07dakr: airlied: for the channel it seems to be 1023, but I'm not entirely sure tbh..
23:09fdobridge: <gfxstrand> We can split in userspace if we know the limit. We couldn't before, not practically, but we can now.
23:10fdobridge: <gfxstrand> But so can the kernel
23:10fdobridge: <gfxstrand> Worst case, split the job into N jobs and put the in-fences on the first one and the out-fences on the last one.
23:29dakr: gfxstrand: I think doing it in the kernel would be nice; that's nothing userspace should be bothered with.
23:37dakr: My initial idea for jobs larger than the channel size would be to return a dummy fence to the scheduler in run_job(), fill up the ring, emit a real fence and, in its fence callback, fill the ring up again. Once there is nothing left to fill the ring with, signal the dummy fence.
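Roughly, that idea in code; all the job_* helpers are hypothetical, and fence refcounting plus dma_fence_add_callback() error handling (e.g. -ENOENT for an already-signalled fence) is omitted:

```c
/* Structural sketch of the plan above, not real nouveau code. */
#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <linux/types.h>

struct split_job {
	struct dma_fence *done;		/* dummy fence handed to the scheduler */
	struct dma_fence_cb cb;
	/* ... pending push entries, channel pointer, etc. ... */
};

/* Hypothetical helpers:
 *  job_fill_ring()        - push as many pending entries as fit; returns
 *                           true if anything was pushed
 *  job_emit_hw_fence()    - emit a real HW fence after what was just pushed
 *  job_alloc_done_fence() - allocate the dummy fence returned from run_job()
 */
bool job_fill_ring(struct split_job *job);
struct dma_fence *job_emit_hw_fence(struct split_job *job);
struct dma_fence *job_alloc_done_fence(struct split_job *job);

static void split_job_resume(struct dma_fence *f, struct dma_fence_cb *cb)
{
	struct split_job *job = container_of(cb, struct split_job, cb);

	if (job_fill_ring(job)) {
		/* More entries got pushed: fence them and continue later. */
		dma_fence_add_callback(job_emit_hw_fence(job), &job->cb,
				       split_job_resume);
	} else {
		/* Nothing was left to push: the whole job has executed. */
		dma_fence_signal(job->done);
	}
}

static struct dma_fence *split_job_run(struct split_job *job)
{
	job->done = job_alloc_done_fence(job);

	/* First batch: push what fits, fence it, continue from the callback. */
	job_fill_ring(job);
	dma_fence_add_callback(job_emit_hw_fence(job), &job->cb,
			       split_job_resume);

	return job->done;
}
```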