02:25 HdkR: Missed the Pascal Tegra, but considering that one didn't get much upstream support it's probably fine
02:26 HdkR: I personally adored how quirky the CPU arrangement was on Tegra X2, even if the performance wasn't great :D
02:43 airlied[d]: gfxstrand[d]: btw if you are doing a lot of CTS, this might shave some time off https://paste.centos.org/view/37b22571 though I'm not sure if it's a lot
02:45 gfxstrand[d]: Are things guaranteed to be aligned?
02:45 gfxstrand[d]: I guess x86 doesn't care as much as some
02:49 airlied[d]: like maybe, but on balance I think my time has gone from over 4:10 hrs to 3:50 or so
02:49 airlied[d]: also, if it's occasionally unaligned on arm, I doubt that outweighs the benefits
02:50 airlied[d]: mhenning[d]: I've dropped the turing patch, so maybe if I can get ack/r-b for the remaining two
02:50 airlied[d]: there's a bunch of wastage in the sparse tests as well,
02:50 airlied[d]: u8 compares in a loop that could be u32
02:51 mhenning[d]: airlied[d]: Sure, I'll take a look tomorrow
03:19 gfxstrand[d]: airlied[d]: My runs are about 35-40 minutes. 😝
03:24 airlied[d]: okay it might shave 5 more mins off then
03:41 HdkR: Ideal conversion for loading a packed uint32_t into float32x4_t on arm64 would be... ldr+uxtl+uxtl+scvtf. Hopefully that change gets close to that :D
03:43 HdkR: Or I guess ucvtf actually
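A minimal sketch of the conversion HdkR describes, using ARM NEON intrinsics and assuming the packed uint32_t holds four u8 components; the helper name is hypothetical, and a decent compiler should lower it to roughly ldr+uxtl+uxtl+ucvtf:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical helper: widen four packed u8s to a float32x4_t. */
static inline float32x4_t u8x4_to_f32x4(uint32_t packed)
{
    uint8x8_t  b = vreinterpret_u8_u32(vdup_n_u32(packed)); /* load/dup        */
    uint16x8_t h = vmovl_u8(b);                             /* uxtl: u8 -> u16  */
    uint32x4_t w = vmovl_u16(vget_low_u16(h));              /* uxtl: u16 -> u32 */
    return vcvtq_f32_u32(w);                                /* ucvtf: u32 -> f32 */
}
```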
03:50 gfxstrand[d]: Are you planning to submit that change to upstream?
03:54 airlied[d]: probably not as-is, but maybe as a bug report
04:09 gfxstrand[d]: Okay
04:09 gfxstrand[d]: Seems reasonable
04:09 gfxstrand[d]: I'm not a huge fan of carrying patches but if it shaves 10%...
04:12 airlied[d]: let me know if it's noticeable for you, I'll do a few more runs. I'd like to shave the horror that is the sparse tests as well
04:12 gfxstrand[d]: Oh, I know what's wrong with those.
04:14 gfxstrand[d]: They create a new instance and device for every test. I improved them significantly a while ago but there's more shaving to do there.
04:14 gfxstrand[d]: Or maybe they don't have that problem anymore? I'm not 100% sure. It's been a bit.
04:14 airlied[d]: they do a huge u8 to u8 comparison loop, it's all cache misses
04:14 gfxstrand[d]: Ah
04:14 gfxstrand[d]: Memcpy?
04:15 airlied[d]: I suspect memcpying it once and doing the compare in cache might be a better option
04:15 gfxstrand[d]: Are they peeking at VRAM?
04:15 gfxstrand[d]: Because that would be terrible
04:15 airlied[d]: not 100% sure, haven't dug that deeply yet, just have a profile of where all the test time goes. It could be a VRAM load actually, I wouldn't be shocked
04:16 gfxstrand[d]: PCIe is evil. That's what I learned this week. 😂
04:19 gfxstrand[d]: The e stands for evil
04:20 airlied[d]: just plug it in over thunderbolt and really slow it down
04:21 airlied[d]: like I love that CTS does LTO and all this stuff to try and help things, then just fails to not suck at basics
04:21 gfxstrand[d]: Yup
04:22 gfxstrand[d]: At least they sped up the fuzzy image compare stuff. That was destroying us in the 1.0 or 1.1 days.
04:23 airlied[d]: okay, it at least doesn't appear to be over PCIe; adding Coherent to the output buffer doesn't seem to help
04:23 gfxstrand[d]: Add cached
04:23 gfxstrand[d]: VRAM is coherent if it's WC
04:24 gfxstrand[d]: I don't know if anyone seriously does non-coherent anymore. We dropped it on Intel after we screwed it up one too many times.
04:25 airlied[d]: yup that indeed does fix it
04:25 gfxstrand[d]: Womp womp
04:25 airlied[d]: guess I should file that one
04:25 airlied[d]: off to figure out Gerrit again
04:26 gfxstrand[d]: Yeah. Memcpy may help. Then it's at least doing wide loads and not blasting it with u8
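As a rough illustration of the suggestion (not the actual CTS code): comparing byte-by-byte straight out of a WC or uncached mapping pays an expensive read per u8, while copying once into a heap buffer turns the readback into wide streaming loads and lets the compare run from cache:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical readback check: one bulk copy out of the mapping,
 * then compare in cacheable system memory. */
static bool compare_readback(const void *mapped, const void *expected,
                             size_t size)
{
    void *local = malloc(size);
    if (!local)
        return false;
    memcpy(local, mapped, size);                  /* wide loads, one pass */
    bool ok = memcmp(local, expected, size) == 0; /* runs in cache        */
    free(local);
    return ok;
}
```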
04:27 gfxstrand[d]: Another option is to reorder heaps in NVK and put GART before WC VRAM.
04:27 gfxstrand[d]: I think we confused VKD3D for a while with that
04:27 airlied[d]: I'm actually testing the hacks on radv first
04:32 gfxstrand[d]: But yeah, choosing mappable memory really should be !device unless it's all device in which case grab the first mappable.
04:33 gfxstrand[d]: Or maybe look for cached+coherent first?
04:33 gfxstrand[d]: IDK.
04:33 gfxstrand[d]: Just grabbing the first host visible thing isn't great
04:37 airlied[d]: yup adding coherent/cached drops one set of tests from 10m to 3m
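For reference, the kind of memory-type picker being suggested might look like this; a sketch, not dEQP's actual code: prefer HOST_CACHED|HOST_COHERENT for readbacks, and only fall back to plain HOST_VISIBLE (which may be WC VRAM):

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Hypothetical helper: pick a memory type index for readback buffers. */
uint32_t pick_readback_type(const VkPhysicalDeviceMemoryProperties *props,
                            uint32_t allowed_type_bits)
{
    const VkMemoryPropertyFlags want =
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT |
        VK_MEMORY_PROPERTY_HOST_CACHED_BIT;

    /* First pass: cached+coherent host-visible memory. */
    for (uint32_t i = 0; i < props->memoryTypeCount; i++) {
        if ((allowed_type_bits & (1u << i)) &&
            (props->memoryTypes[i].propertyFlags & want) == want)
            return i;
    }
    /* Fallback: any host-visible type (possibly WC VRAM, much slower). */
    for (uint32_t i = 0; i < props->memoryTypeCount; i++) {
        if ((allowed_type_bits & (1u << i)) &&
            (props->memoryTypes[i].propertyFlags &
             VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT))
            return i;
    }
    return UINT32_MAX; /* no host-visible type */
}
```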
04:39 virtual-image: Has anyone experienced this issue before: the kernel loads nouveau, the X server loads nouveau, but querying OpenGL via glxinfo or equivalent says a software renderer is in use?
04:44 airlied[d]: I've filed an issue, will see if I can get motivated to deal with Gerrit later
05:01 gfxstrand[d]: airlied[d]: Does dEQP have a standard memory type picker? Because it should and it should try to get cached.
05:02 gfxstrand[d]: I imagine there are probably a lot of tests that suffer from that. I know dwlsalmeida was struggling with perf on the video tests and I suspect that's related.
05:03 gfxstrand[d]: Although it should probably be using VRAM whenever possible for sparse tests and just use cached for upload/download buffers.
05:12 airlied[d]: I don't think it's very smart about readbacks at all
05:12 airlied[d]: it might also be why my earlier 32-bit hacks work better
05:14 gfxstrand[d]: Yup
05:14 gfxstrand[d]: It's a known fact that dEQP kinda sucks on PCIe cards. Maybe this problem is just everywhere?
05:16 airlied[d]: maybe I should trick mupuf into looking at it under the guise of saving CI time 😛
05:17 gfxstrand[d]: Do it!
05:18 gfxstrand[d]: IDK if he's the best choice but it has the potential to significantly reduce CI resources for AMD and Nvidia testing (and I guess Intel now, too).
05:20 mupuf[d]: is more into finding ways to save his limited time nowadays :D
05:21 mupuf[d]: But if there is something actionable, I know who I could persuade to work on it... Igalia!
05:22 airlied[d]: I'm going to file a bigger issue in the khronos tracker
05:22 mupuf[d]: that sounds like a good idea
05:23 mupuf[d]: I'm sure some ifdefs for the happy cases would work just as well for us
05:29 airlied[d]: this is not the Yak I was supposed to be shaving today
08:25 airlied[d]: okay, got back to turing latencies. My old code was bogus in that I forgot to wire it up properly, but I'm well into CTS now and working it out
09:17 dwlsalmeida[d]: gfxstrand[d]: I need to find the energy to debug that
09:20 karolherbst[d]: airlied[d]: I'd be interested in perf numbers there
09:51 marysaka[d]: marysaka[d]: btw airlied[d] ^ in case you missed it, do you need SM86 output too?
10:11 airlied[d]: Not sure yet, stuck in the latency rabbit hole
16:12 gfxstrand[d]: Okay, I think rZ on Tex ops might be working. I should probably plug in a Maxwell and see if it blows up.
16:13 gfxstrand[d]: That'll at least get rid of some of that RA annoyance.
16:14 gfxstrand[d]: I wish Maxwell could survive a CTS run. 😢
16:22 karolherbst[d]: gfxstrand[d]: only for unused arguments or generally?
16:37 gfxstrand[d]: unused
16:38 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33716
16:39 karolherbst[d]: yeah, for unused ones that's correct afaik
17:03 gfxstrand[d]: Maxwell isn't exploding so I think it's probably okay
17:04 gfxstrand[d]: And Ampere was happy too
17:05 gfxstrand[d]: IDK why I didn't do that earlier, TBH. I think I was just paranoid about too much rZ going places.
17:28 gfxstrand[d]: Ugh... Clearly no one is using Maxwell. It's broken AF right now and we didn't get any bugs filed.
17:28 karolherbst[d]: well yeah.. because it's slow
17:29 mhenning[d]: gfxstrand[d]: It's still behind I_WANT_A_BROKEN_VULKAN_DRIVER, right?
17:29 karolherbst[d]: maybe we should just switch to zink by default
17:29 karolherbst[d]: 😄
17:29 karolherbst[d]: that's going to give you tons of bugs
17:30 mhenning[d]: Honestly, I've been wondering about turning on zink by default for volta+
17:36 redsheep[d]: mhenning[d]: I thought this was only kepler
17:37 mhenning[d]: I think it's anything pre-volta
17:41 gfxstrand[d]: I'm seeing a pile of faults on Maxwell, typically in the query copy tests. They don't happen when I run the tests individually. I wonder if we have some sort of race somewhere.
17:41 gfxstrand[d]: Yeah, Maxwell and Pascal are still listed as broken. Volta might be as well.
17:42 gfxstrand[d]: Basically anything non-GSP is currently too unstable
17:42 gfxstrand[d]: See also the universe faulting
17:43 redsheep[d]: Ok the reason I thought it was only kepler was that it's the only thing left on codegen
17:47 mhenning[d]: Oh, yeah. We have nak on maxwell+pascal, but we still haven't turned on the driver by default
17:48 pavlo_kozlenko[d]: What are the plans for Fermi?
17:49 gfxstrand[d]: The plan is to not support Fermi with NVK
17:50 pavlo_kozlenko[d]: Even 1.0?
17:51 gfxstrand[d]: I mean, if someone wants to write the code, I don't mind having it in tree if it's not too disruptive.
17:51 gfxstrand[d]: But there's no one planning to work on it AFAIK
17:51 gfxstrand[d]: Maxwell is a "sometimes I feel like throwing it in and fixing the bugs" level of support. Kepler doesn't exist yet and no one's really started on it. Fermi needs Kepler as a prereq.
17:52 gfxstrand[d]: Also, Fermi is missing bindless textures which is going to be a real PITA
17:54 redsheep[d]: mhenning[d]: I think this is getting close to being reasonable but probably not quite there. My zink testing last week was massively improved but still somewhat mixed
17:56 mhenning[d]: redsheep[d]: Okay, is it worse than nouveau gl? Have you filed bugs for any ways that it is worse than nouveau gl?
18:01 gfxstrand[d]: !33716 is now turning into a "Maxwell is broken" MR. 😭
18:03 gfxstrand[d]: Most of the fails are either in primitive generated queries for XFB or memory model tests.
18:04 gfxstrand[d]: But I have no idea why
18:04 gfxstrand[d]: But they always pass when run by themselves.
18:05 gfxstrand[d]: `[ 5031.288624] nouveau 0000:01:00.0: fifo: fault 01 [WRITE] at 0000003effde0000 engine 00 [gr] client 1d [HUB/DFALCON] reason 02 [PTE] on channel 37 [007f72d000 deqp-vk[40038]]`
18:07 karolherbst[d]: hey
18:07 karolherbst[d]: look at that number
18:07 redsheep[d]: mhenning[d]: Yes, still worse than nouveau gl in a few ways. I have some I filed way back that probably aren't even correct anymore. I'll file some later this week if I get time
18:08 gfxstrand[d]: redsheep[d]: Thanks! We really need to try and close the gap.
18:19 redsheep[d]: gfxstrand[d]: I think part of it miiiiight be that I'm running an Ada card, whereas Sid and some other testers are on Ampere and Turing. If you could try it with your 4060, maybe it would be more capable of replicating my issues
18:20 gfxstrand[d]: Okay, I can repro the fault. I just have to run a bunch of tests. Woof.
18:20 redsheep[d]: Cuz from Sid it almost sounds like it's just working
18:20 tiredchiku[d]: yeah, zink has been smooth sailing for me
18:20 gfxstrand[d]: I've got a 4060 in my laptop and in my dev desktop
18:20 gfxstrand[d]: Maybe once I get done playing with Maxwell?
18:20 redsheep[d]: It could also be that I have multiple monitors
18:20 tiredchiku[d]: could be
18:21 gfxstrand[d]: That could be, actually. Xorg has very different behavior when you plug in a second monitor.
18:21 tiredchiku[d]: we both tried wayland
18:21 gfxstrand[d]: Because it has to render to one big screen, since it's X11
18:21 tiredchiku[d]: x11 is less smooth sailing for me actually
18:21 gfxstrand[d]: Regrettably, I have worked on that code. :frog_weary:
18:22 tiredchiku[d]: I should file a bug for that
18:22 redsheep[d]: I did have some of my issues disappear with just one, but certainly not all
18:23 redsheep[d]: I'll probably have to take a video of what discord does on Wayland, it's this whole relabel party
18:25 tiredchiku[d]: <a:SCrainBowDance:858113374998626344>
18:25 redsheep[d]: I should probably try the 570 kernel branch. I think some of what has been killing me has potentially been GSP bugs related to modesetting that somehow don't happen under nouveau GL
18:25 mhenning[d]: Faith did some zink fixes last week, so might also be worth retesting depending on when you last tried
18:26 gfxstrand[d]: Those shouldn't affect most apps but it's worth a try, I suppose.
18:26 gfxstrand[d]: It was mostly around GL<->Vulkan interop
18:27 mhenning[d]: that was hitting chromium, so electron apps could plausibly be affected, right?
18:27 gfxstrand[d]: If they use Vulkan
18:28 gfxstrand[d]: But right now you have to smash flags in `chrome:flags` if you want Vulkan enabled on NVK
18:28 mhenning[d]: fair enough, maybe not then
18:29 gfxstrand[d]: But generally, testing with latest is a good idea. There were also some issues Mike fixed pretty recently.
18:48 gfxstrand[d]: Okay, so the fault address literally never appears in any VA that's ever allocated.
18:49 gfxstrand[d]: So it's not that something's getting freed early. We're concocting a bogus address somehow.
18:52 gfxstrand[d]: gfxstrand[d]: Okay, no. That's a lie. I can't read my own printfs.
19:06 gfxstrand[d]: I think it's a bad address calculation in my CopyQueryPoolResults shader
19:08 gfxstrand[d]: But how/why? That is the question...
19:14 gfxstrand[d]: I wish there was some easier way to get the driver into exactly the right state than running 330 tests
19:17 gfxstrand[d]: Could also be a caching issue
19:17 gfxstrand[d]: Or a shader issue
19:23 gfxstrand[d]: Yeah, it's some sort of funky dependency bug
19:29 gfxstrand[d]: If I stick an `if ((addr & ~0xffff) == 0x3feffe12000) st.global 0 res`, not only does it never trigger (no NULL fault) but all the tests pass.
19:30 gfxstrand[d]: So somehow sticking control flow between an address calculation and a `st.global` is fixing it.
19:30 gfxstrand[d]: aaaaaarrrrrrrggggggg
20:15 gfxstrand[d]: gfxstrand[d]: Or maybe adding that extra shader code is making the shader bigger, causing memory allocation to be different and the bug to not trigger. That would be cursed.
20:27 gfxstrand[d]: I'm not liking any of the thoughts running through my brain about this...
20:28 gfxstrand[d]: I guess the shader could also be stomping itself
20:28 gfxstrand[d]: But that seems a little nuts
20:29 mhenning[d]: gfxstrand[d]: You could try adding NOPs to see if it's the shader size
20:29 gfxstrand[d]: Yeah. That's a bit of a pain but possible
20:31 mhenning[d]: I assume you've already tried NAK_DEBUG=serial
20:36 gfxstrand[d]: Yeah. That breaks other tests in the set for some reason. :blobcatnotlikethis:
20:40 gfxstrand[d]: I may also be able to trim down the test list a bit but it seems to need just the right memory VA layout.
20:40 gfxstrand[d]: Which makes me think it's a 64-bit add rollover problem
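A sketch of that suspicion: on these GPUs a 64-bit add is built from two 32-bit adds (IADD plus IADD.X consuming the carry), so if the carry out of the low word is ever dropped, any base+offset that crosses a 4 GiB boundary wraps to a bogus address. In C terms (illustrative only, not the actual NAK code):

```c
#include <stdint.h>

/* What a dropped carry between the two 32-bit halves of a 64-bit
 * address add would look like. */
static uint64_t addr_add_broken(uint64_t base, uint32_t offset)
{
    uint32_t lo = (uint32_t)base + offset;  /* carry-out silently lost    */
    uint32_t hi = (uint32_t)(base >> 32);   /* should be hi + carry (IADD.X) */
    return ((uint64_t)hi << 32) | lo;       /* wrong whenever lo wraps    */
}
```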
23:04 phomes_[d]: oh. We need llvm to build nvk now
23:11 mhenning[d]: Yeah, that's a consequence of https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33362
23:12 mhenning[d]: It's only required at compile time, not at runtime
23:16 gfxstrand[d]: airlied[d], mohamexiety[d] and I are gonna need Danilo's help on this VMM stuff.
23:16 gfxstrand[d]: Trying to crawl through this totally undocumented mess just the two of us isn't working.
23:16 gfxstrand[d]: Or maybe skeggsb9778[d] ?
23:16 gfxstrand[d]: Someone
23:17 gfxstrand[d]: mohamexiety[d] has something that kinda works and doesn't crash, but it's also clearly broken and he doesn't know why.
23:18 gfxstrand[d]: I've got some vague theories as to what's going wrong, but without knowing how the code is intended to work, it's kinda hard to really chase them.
23:22 gfxstrand[d]: gfxstrand[d]: It is. :blobcatnotlikethis:
23:23 gfxstrand[d]: Curses
23:27 gfxstrand[d]: gfxstrand[d]: Which isn't to say it's bad. I just don't know how the code is designed to work or what any of the expectations are.
23:32 skeggsb9778[d]: gfxstrand[d]: i've been helping a bit over DM already, but i'm also not too familiar with that bit of code. i'm not sure if it's going to be enough to try and magic up the page size selection as it stands, or if userspace should be specifying that (i suspect probably the latter, but it *might* be possible to get away without it)
23:32 gfxstrand[d]: Yeah, that was my concern as well.
23:33 skeggsb9778[d]: i also suspect some of the gpuvm callbacks, particularly "remap" etc, are going to need extra logic to do the right thing with mixed page sizes
23:33 gfxstrand[d]: I'm also not sure if you can just map with one page size and unmap with another.
23:33 gfxstrand[d]: I'm reasonably happy to have userspace specify the page size when it allocates a new VA
23:33 gfxstrand[d]: If that makes things simpler
23:33 skeggsb9778[d]: the backend interface is pretty simple (if you ignore the layers in between) - get() allocates the page tables themselves for a given page size, map() writes the PTEs into them, unmap()/put() do the reverse
23:34 skeggsb9778[d]: get()+map() have to be at the same page size, because for 4k vs 64k pages, a different tree of page tables needs to be invented, so you can't just map() 64k pages into an area that has get() called on it for 4k pages
23:34 skeggsb9778[d]: balance all those things and it should work
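Restated as a toy interface (these are not the real nouveau signatures, just the shape of the contract skeggsb describes): get() and map() must agree on the page size, and unmap()/put() must mirror them, because 4K and 64K pages live in different page-table trees:

```c
#include <stdint.h>

/* Toy model of the backend contract; names and signatures are
 * illustrative, not nouveau's actual API. */
struct toy_vmm_ops {
    /* Allocate page tables covering [addr, addr+size) at one page size. */
    int  (*get)(void *vmm, uint8_t page_shift, uint64_t addr, uint64_t size);
    /* Write PTEs; must use the same page_shift that get() used. */
    int  (*map)(void *vmm, uint8_t page_shift, uint64_t addr, uint64_t size,
                void *mem);
    /* Reverse operations, again at the matching page_shift. */
    void (*unmap)(void *vmm, uint8_t page_shift, uint64_t addr, uint64_t size);
    void (*put)(void *vmm, uint8_t page_shift, uint64_t addr, uint64_t size);
};
```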
23:34 gfxstrand[d]: Right. I suspected that
23:37 gfxstrand[d]: I'm wondering if we need to add some tracking somewhere. It doesn't look like there is any at present and so I don't know how we can ensure consistency.
23:42 gfxstrand[d]: skeggsb9778[d]: mohamexiety[d] ^^ I suspect this is the problem. We need to somehow track what page size we're using for different VA allocations and make sure we map/unmap/put with the same page size as we used for get.
23:43 skeggsb9778[d]: i think the nouveau_uvma would be the place to stash this, it's where the per-VA 'kind' value is stored already
23:43 gfxstrand[d]: Unfortunately, we don't have a very handy object for doing that, at least not one that I can see.
23:43 gfxstrand[d]: Ah, Okay, maybe we can stick it there.
23:45 mohamexiety[d]: This is what my branch does actually (with help from Ben, thanks! :catheart:): page_size is in nouveau_uvma, but there's probably a piece I missed
23:46 gfxstrand[d]: I'm seeing you re-compute it every time
23:47 gfxstrand[d]: Oh, maybe I misread
23:47 mohamexiety[d]: So unmap and remap then retrieve the uvma object and read the size, but I couldn't figure out a way to avoid recomputing it for the split/merge map case. Otherwise everything should be using the OG value from op_map_prepare
23:48 gfxstrand[d]: Hrm... I'm not sure nouveau_uvma helps. I think that's a transient object that just contains the operation so it can be prepared, enqueued, and executed.
23:48 gfxstrand[d]: What we need is something persistent that shadows the state of the page tables and actually tracks stuff.
23:49 gfxstrand[d]: Maybe uvma_region?
23:50 mohamexiety[d]: gfxstrand[d]: It does contain a pointer to the memory object and should last as long as the map (until it gets a call to unmap)
23:59 gfxstrand[d]: Okay, right. So nouveau_uvma is a subclass of drm_gpuva
23:59 gfxstrand[d]: Things are starting to make more sense
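Putting the pieces together, the shape being discussed looks roughly like this (a simplified kernel-side sketch; nouveau_uvma really does embed drm_gpuva and a per-VA kind, while the page_shift field is the hypothetical addition from this conversation):

```c
/* Sketch only: stash the page size chosen at op_map_prepare() next to
 * the existing per-VA 'kind' so unmap()/put() can replay the same page
 * size that get()/map() used. Field name is hypothetical. */
struct nouveau_uvma {
    struct drm_gpuva va;   /* existing: nouveau_uvma subclasses drm_gpuva */
    u8 kind;               /* existing: per-VA PTE kind                   */
    u8 page_shift;         /* proposed: page size from op_map_prepare()   */
};
```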