00:42 redsheep[d]: gfxstrand[d]: Enjoy https://gitlab.freedesktop.org/mesa/mesa/-/issues/11279
00:54 redsheep[d]: ahuillet[d]: You had been helping me debug that issue ^ a few weeks ago, so you might be interested to see the added detail I have on there now. The full backtrace is huge.
01:29 airlied[d]: karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29542 might be useful
02:55 tiredchiku[d]: hmmm
02:55 tiredchiku[d]: something is horribly wrong with steam and zink
02:58 tiredchiku[d]: and prime render offload
02:58 redsheep[d]: Crazy flashing and flicker, I hope? I want to not be the only one having that issue anymore 😛
02:58 tiredchiku[d]: no, my plasma session just doesn't load
03:00 tiredchiku[d]: nothing in the dmesg...
03:02 tiredchiku[d]: which is weird, because my session is run on Intel
03:03 redsheep[d]: Wayland session, or x11?
03:04 tiredchiku[d]: wl
03:04 redsheep[d]: Assuming you are on systemd, do you have a bunch of crashes in `coredumpctl list -r`
03:04 tiredchiku[d]: yup
03:04 tiredchiku[d]: that's how I identified it was steam
03:05 tiredchiku[d]: even with NOUVEAU_USE_ZINK=0, steam just doesn't put up its window
03:05 tiredchiku[d]: just quietly stays in the tray
03:05 tiredchiku[d]: on last night's HEAD + 29154 + 29504
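For context, a minimal coredumpctl workflow for digging into crashes like the ones above (the `steam` match is just an example; use whatever `coredumpctl list -r` actually shows):
```
coredumpctl list -r      # newest crashes first
coredumpctl info steam   # metadata plus the captured backtrace for the latest match
coredumpctl debug steam  # open the core dump in gdb (older systemd calls this `coredumpctl gdb`)
```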
03:06 redsheep[d]: Have you seen this across multiple boots? There might be a confounding factor of an issue with a plasma update; just earlier I had some crazy session issues I had never seen before, without having updated mesa in a week
03:06 tiredchiku[d]: yup
03:10 redsheep[d]: Does it happen on main?
03:14 tiredchiku[d]: I am on main
03:14 tiredchiku[d]: with 2 MRs applied
03:17 redsheep[d]: ... I didn't miss that, I mean does it happen without the two other MRs? Also, I went to look what those are, why are you applying an MR for v3d, and an MR for intel lunar lake?
03:18 tiredchiku[d]: ..wait
03:18 tiredchiku[d]: did I remember the MR numbers wrong
03:18 tiredchiku[d]: :derproo:
03:18 tiredchiku[d]: hang on let me check
03:19 tiredchiku[d]: ah yes I did
03:19 tiredchiku[d]: 29194 and 29405
03:20 tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1247751858261327962/image.png?ex=66612af9&is=665fd979&hm=639b41cbe32c7e201ff6ee25e5071fd7756bb320e69401272886a95c5c13b574&
03:24 redsheep[d]: Hmm I don't think either of us have tested these, and 29194 in particular doesn't really seem done, so I wouldn't be crazy surprised if it managed to break steam. Though, if you are on prime, I guess steam should be running on intel? Do the shared buffers used when you are on prime involve copies? I would suspect so
03:24 redsheep[d]: Testing on plain jane main probably has value
03:26 tiredchiku[d]: that's what I'm building rn
03:26 tiredchiku[d]: I just trashed my old PKGBUILD, however
03:26 tiredchiku[d]: because it insisted on doing a clean build every time
06:54 airlied[d]: karolherbst[d]: btw seeing an mmu fault in a pure nvc0 env, looks like userspace destroys a bo before the hw is done with it; pushbuf sync seems to make it go away. adding printfs to prove that's what's happening, but it isn't reproducing yet
06:55 airlied[d]: gnome-shell + emacs on f40, but getting a consistent reproducer is painful
06:55 karolherbst[d]: yeah...
06:55 karolherbst[d]: that's like the case with a lot of those bugs
07:14 ahuillet[d]: redsheep[d]: stack overflow is it not?
07:15 ahuillet[d]: tiredchiku[d]: I wouldn't trust 29194 ATM. 29405 might also be broken (may need 1 << 22, left a comment, needs testing because can't find a definite HW doc reference)
07:15 tiredchiku[d]: mhm, I just wanted to take them for a ride
07:19 phomes_[d]: yes 29194 is just my experiments to get post depth coverage to pass cts by making nvk do what nv does for that specific test
07:22 phomes_[d]: I wish we had a table of games/apps that use specific vulkan extensions. We have several cts tests for pdc but it is mostly the same test run under different conditions
07:24 ahuillet[d]: +1 on the wish
07:24 phomes_[d]: It would be great to also have games to test it under real conditions
07:25 ahuillet[d]: phomes_[d]: sorry, I had not realized you were the author. I planned to review your change and dig into internal docs to make sense of what's supposed to be required for the feature.
07:25 ahuillet[d]: I see you dumped SPH, did you dump all the bits, or only those of SPHv3?
07:28 phomes_[d]: I used nvdump for nv and nvk_shader_dump() on nvk. I am not sure if they dump all the bits, but I can take a look today
07:37 redsheep[d]: ahuillet[d]: I guess? I didn't look terribly closely at the backtrace
07:41 redsheep[d]: Once I managed to get one that didn't look like complete garbage I called it good. Takes some really annoying stuff to get good data out of a wine app.
07:42 ahuillet[d]: yes
07:45 redsheep[d]: Google search results have gotten so bad, it's nearly impossible to use it to learn how to use gdb. Took an embarrassingly long time to figure out how to get it to continue properly so the application would actually reach the crash, because it was signaling SIGUSR1 over and over
07:46 tiredchiku[d]: it's not just google
07:46 tiredchiku[d]: I was struggling to find info on rust 32 bit compilation as well
07:46 tiredchiku[d]: I use ecosia .-.
07:59 ahuillet[d]: redsheep[d]: handle SIGUSR pass nostop noprint
07:59 ahuillet[d]: and yeah I know that wasn't your point and you're right.
08:00 redsheep[d]: Hmm, I was just doing `handle SIGUSR1 nostop`; I will have to look at what that other stuff does. I can guess at it though
08:08 ahuillet[d]: pass is default so doesn't matter, and noprint does what it implies
08:08 ahuillet[d]: Unity games like SIGPWR and SIGXCPU for some reason, so you get a flood of those
08:09 HdkR: That's their garbage collector logic
08:10 redsheep[d]: But hey, now I can actually kind of sort of do the debugging you wanted from me two months ago 😛
08:10 HdkR: bdwgc uses those signals to stop the world, and then sigxcpu to restart it
08:11 ahuillet[d]: ah, good to know. it never made much sense to me why you'd use these signals
08:15 asdqueerfromeu[d]: redsheep[d]: SIGUSR1 is used by some thread stuff inside Wine
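Putting the gdb advice above together, a setup along these lines keeps the signal noise from a Wine/Unity app out of the way (a sketch; adjust the signal list to whatever the app actually raises):
```
(gdb) handle SIGUSR1 nostop noprint pass
(gdb) handle SIGPWR  nostop noprint pass
(gdb) handle SIGXCPU nostop noprint pass
(gdb) set logging on        # writes everything to gdb.txt
(gdb) run
(gdb) bt full               # once the real crash is hit
```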
14:57 tiredchiku[d]: it appears plasma wayland just doesn't like plain mesa-git
14:57 tiredchiku[d]: with nvk running
14:57 tiredchiku[d]: both with and without NOUVEAU_USE_ZINK, it just fails to launch
14:58 tiredchiku[d]: everything is peachy on sway, even steam
15:01 tiredchiku[d]: either that, or I need to try again after uninstalling nvprop
16:47 pavlo_kozlenko[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10981
16:47 pavlo_kozlenko[d]: I'll try to tackle it, wish me luck 😆
17:02 gfxstrand[d]: tiredchiku[d]: I worry plasma also has explicit sync bugs. 😢
17:02 gfxstrand[d]: It's hard to verify but it seems plausible.
17:02 tiredchiku[d]: x11 loads fine, fwiw
17:02 tiredchiku[d]: but yeah, could be
17:09 pavlo_kozlenko[d]: gfxstrand[d]: Do you mind if I bother you with questions often?
17:09 pavlo_kozlenko[d]: I'm just a little confused
17:12 karolherbst[d]: gfxstrand[d]: does vulkan have something like `ARB_compute_variable_group_size`?
17:14 karolherbst[d]: though I guess not, because zink implements that via spec constants afaik :ferrisUpsideDown:
17:15 cwabbott: even if it did, you really wouldn't want to use it - performance with the maximum group size is typically terrible, and the driver has to assume the worst case
17:16 karolherbst[d]: ~~it could compile variants~~
17:16 cwabbott: it doesn't
17:16 karolherbst[d]: in either case, OpenCL is variable by default
17:17 gfxstrand[d]: cwabbott: No it doesn't
17:17 gfxstrand[d]: I mean, it depends on your HW but NV really doesn't care.
17:17 gfxstrand[d]: Intel does but I think we compile 2 shaders there in that case.
17:17 gfxstrand[d]: So you pay for it with extra compile time but not at runtime.
17:17 cwabbott: really? there's no register sharing?
17:17 gfxstrand[d]: It's all automagic in the hardware
17:18 karolherbst[d]: you limit the amount of threads running in parallel
17:18 gfxstrand[d]: We compile exactly the same shader regardless.
17:18 karolherbst[d]: or rather, the hardware limits it
17:18 cwabbott: you can't have it be automagic if you want all the warps to be resident at once
17:18 cwabbott: unless the register file isn't divided dynamically
17:19 karolherbst[d]: it is
17:19 gfxstrand[d]: Register files on NVIDIA are HUGE.
17:19 karolherbst[d]: that as well
17:19 gfxstrand[d]: And the 1024 limit really isn't that big
17:19 karolherbst[d]: but the amount of registers allocated per thread just changes how many threads run in parallel anyway
17:20 cwabbott: on qcom, iirc you only get something like 32 registers at a group size of 1024
17:20 cwabbott: so you really, really don't want to do that
17:20 gfxstrand[d]: But you can reach 32 simultaneous resident threads with the full 253 registers
17:20 karolherbst[d]: nvidia has 64k regs per block
17:21 karolherbst[d]: *warp/whatever you want to call it
17:21 gfxstrand[d]: On Intel, we had to use SIMD32 on some HW to get the full 1024 invocations resident. It sucked.
17:21 karolherbst[d]: ehh wait
17:21 karolherbst[d]: warp is smaller
17:21 karolherbst[d]: anyway
17:21 karolherbst[d]: there are 64k registers per block :ferrisUpsideDown:
17:22 cwabbott: I think amd has a larger register file but it does the same thing, you'd be somewhat limited at the max size
17:22 karolherbst[d]: nvidia just doesn't guarantee that your entire block runs at the same time
17:23 cwabbott: that's not possible with barrier() - so yeah, without barriers it doesn't matter
17:23 cwabbott: qualcomm will also let you have only some of the workgroup resident if there are no barriers
17:25 cwabbott: anyway, if you want a competent rusticl on zink, you'll need to shepherd through a vulkan extension to bring the CL-style unsized workgroups where the max size is decided by the compiler
17:25 karolherbst[d]: probably
17:25 karolherbst[d]: atm zink just sets a spec constant
17:25 karolherbst[d]: it's good enough
17:26 karolherbst[d]: though I'd have to figure out what nvidia is doing for 1024 threads with 255 registers :ferrisUpsideDown:
17:27 karolherbst[d]: maybe just context switching or something silly
17:27 cwabbott: qualcomm has a crazy thing where 4 waves share a hw context
17:27 cwabbott: and they take turns executing the stretches between barriers
17:28 karolherbst[d]: heh
17:28 karolherbst[d]: though I don't know if the 64k reg limit is real, it's just what cuda says
17:28 cwabbott: so you have to spill everything around a barrier, but you get 4x the occupancy
17:28 karolherbst[d]: maybe that's why shaders have to declare whether they use barriers on nvidia 😄
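Rough math behind the residency question above, assuming the CUDA-documented figures (64K 32-bit registers per SM, 255 GPRs max per thread, 32-wide warps) and ignoring other limits such as warp slots and shared memory; this is a hypothetical back-of-the-envelope helper, not NVK/NAK code:
```rust
/// Register-file-limited thread residency, rounded down to warp granularity.
/// Ignores register allocation granularity and per-SM warp caps.
fn regfile_limited_threads(regs_per_sm: u32, gprs_per_thread: u32) -> u32 {
    let warps = regs_per_sm / (gprs_per_thread * 32); // registers are allocated per warp
    warps * 32
}

fn main() {
    // 65536 / (255 * 32) = 8 warps -> 256 threads, so a 1024-thread workgroup
    // at max GPRs can never be fully resident at once.
    println!("{}", regfile_limited_threads(65536, 255));
    // At 32 GPRs per thread: 64 warps -> 2048 threads resident.
    println!("{}", regfile_limited_threads(65536, 32));
}
```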
18:09 karolherbst[d]: `Without documentation, how do you know the semantics precisely enough to use them correctly?` :ferrisUpsideDown:
18:10 HdkR: That's what reverse engineering is!
18:11 gfxstrand[d]: Yay! I found my maximal reconvergence UGPR bug!
18:11 karolherbst[d]: nice
18:11 karolherbst[d]: what was it?
18:13 gfxstrand[d]: Oh, I just messed up bssy placement in my CF lowering pass when I refactored it to track block uniformity
18:14 karolherbst[d]: ahh
18:14 gfxstrand[d]: I think I have a reasonable dep tracker now, too. Doing a full CTS run without UGPRs before I try UGPRs again.
18:15 karolherbst[d]: I'm considering entering some latencies for very basic things to get some more perf once you are done with the tracker rework :ferrisUpsideDown:
18:16 karolherbst[d]: most of the basic alu can be done in 4 or 5, so that should give some very basic speed ups over using 6
18:17 gfxstrand[d]: I'm also debating whether or not I want to have a shader model abstraction of some sort.
18:17 gfxstrand[d]: It feels like instruction latencies and encoding really should go in the same file.
18:17 karolherbst[d]: maybe
18:17 gfxstrand[d]: Or at least grouped together into a module
18:17 gfxstrand[d]: And legalization, too.
18:17 karolherbst[d]: there are differences between turing and ampere with the latencies though
18:18 karolherbst[d]: but that's already in the same file anyway
18:18 gfxstrand[d]: That's fine.
18:18 gfxstrand[d]: The latency code will have access to the SM number
18:18 karolherbst[d]: right
18:18 karolherbst[d]: sadly I only have access to the ampere stuff for now :ferrisUpsideDown: and my ampere card is kinda fucked
18:18 gfxstrand[d]: It's just that the more I mess around with things, the less sense it makes to have any of that stuff be methods on `Op` or `Instr`.
18:19 karolherbst[d]: heh
18:19 karolherbst[d]: does nak print cycle estimations?
18:19 karolherbst[d]: in some shader stat debugging thing?
18:20 gfxstrand[d]: Not right now
18:21 gfxstrand[d]: gfxstrand[d]: But I think I want to see the tables in some form before I make major decisions about how to structure it all.
18:21 karolherbst[d]: yeah, fair enough
18:22 gfxstrand[d]: What I'd really like is some sort of `SM70Instr` trait that has encoding, legalization, and latency stuff all in one place. IDK if that's tractable, though.
18:22 gfxstrand[d]: I hate having legalization and encoding separate
18:23 gfxstrand[d]: Especially for SM50 where all the immediate stuff is so closely tied.
18:23 gfxstrand[d]: But SM70+ isn't much better, really
18:23 gfxstrand[d]: But I'm going to focus on UGPRs now and we'll do that restructure later
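One possible shape for the shader-model abstraction being discussed; the idea (encoding, legalization, and latencies grouped per SM) comes from the chat, but everything below is hypothetical rather than actual NAK code, and it sketches a per-SM trait instead of the per-instruction `SM70Instr` trait mentioned above:
```rust
// Placeholder IR types standing in for NAK's real Op/Instr (hypothetical).
pub struct Op;
pub struct Instr;

/// One implementation per shader model, so encoding, legalization and
/// latency information for an SM live in the same module.
pub trait ShaderModel {
    fn sm(&self) -> u8;
    /// Rewrite sources/immediates into forms this SM can encode.
    fn legalize(&self, instr: &mut Instr);
    /// Encode one instruction into the SM-specific binary format.
    fn encode(&self, instr: &Instr, out: &mut Vec<u32>);
    /// Issue-to-read latency in cycles, used when filling in delays.
    fn latency(&self, op: &Op) -> u32;
}

pub struct Sm70;

impl ShaderModel for Sm70 {
    fn sm(&self) -> u8 { 70 }
    fn legalize(&self, _instr: &mut Instr) { /* Volta/Turing-specific rules */ }
    fn encode(&self, _instr: &Instr, _out: &mut Vec<u32>) { /* 128-bit encoding */ }
    fn latency(&self, _op: &Op) -> u32 { 6 } // the conservative default mentioned earlier
}

fn main() {
    let sm = Sm70;
    assert_eq!(sm.latency(&Op), 6);
}
```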
18:39 pac85[d]: I would like to understand exactly how nvidia differs from, say, amd, wrt register pressure and occupancy. On amd from what I know the hw is told how many VGPRs and SGPRs are used and I think it partitions the register file automatically from there. How does nvidia differ? (What's more magic about it?)
18:41 gfxstrand[d]: That's pretty much the same.
18:41 gfxstrand[d]: If anything differs, it's the size of the register file. NVIDIA's is big enough that you can get enough threads with maximum register counts. IDK about AMD.
18:41 HdkR: Do SGPRs affect occupancy on AMD? That sounds rough
18:42 gfxstrand[d]: I would be surprised if they affect it much. SGPRs are so dang cheap.
18:43 gfxstrand[d]: On NVIDIA, you're required to program the register count to GPRs+2 on Volta+ because it assumes you're always burning 63 UGPRs.
18:43 dadschoorse[d]: HdkR: not since rdna
18:43 HdkR: Oh good
18:44 gfxstrand[d]: Yeah, when your whole sGPR pool costs the equivalent of 2 vGPRs, there's not much point in making it a scaling factor.
18:47 dadschoorse[d]: gfxstrand[d]: 256 vgprs is really not great for memory latency hiding. although big rdna3 chips have 50% more physical vgprs (yes, non pot regfile), so it might not be as bad anymore
18:48 gfxstrand[d]: Yeah
18:48 gfxstrand[d]: That's something I haven't even looked into yet
18:48 gfxstrand[d]: Trying to make a tradeoff of a little spilling in exchange for better latency hiding.
18:48 dadschoorse[d]: does nak even have a scheduler atm?
18:49 gfxstrand[d]: No
18:49 gfxstrand[d]: On the ToDo list
18:50 dadschoorse[d]: so you are not even taking advantage of all the nice things that ssa reg alloc gives you 😄
18:53 gfxstrand[d]: Oh, yeah. NAK is very much a correctness-first implementation.
18:58 karolherbst[d]: gfxstrand[d]: I'm not entirely sure that's actually correct, but it's surprisingly close to the truth
18:59 gfxstrand[d]: Yeah, I don't know for 100% either but it makes enough sense that I'm willing to say it's the truth unless someone from NVIDIA wants to correct me and provide a correct explanation.
18:59 karolherbst[d]: I know cases where you can reduce it a little, but...
19:00 karolherbst[d]: "hey, we could reduce the gpr count by one in those cases" isn't really giving us that much 😄
19:00 gfxstrand[d]: Nope
19:01 gfxstrand[d]: One thing I do need to fix is that NAK always uses at least 16 GPRs and I really should trim that down a bit.
19:01 gfxstrand[d]: It's needed for RA in certain texture instruction cases
19:01 karolherbst[d]: I don't think it matters at all
19:02 karolherbst[d]: there was some lower bound below which using fewer registers is just a waste
19:02 gfxstrand[d]: Do we know the minimum at which it matters?
19:02 karolherbst[d]: no, but I think 16 is close
19:02 gfxstrand[d]: Reducing to 14 might be good, then. Then +2 is 16
19:02 karolherbst[d]: ahh yeah
19:03 gfxstrand[d]: 16 is a bit conservative
19:03 gfxstrand[d]: We don't need to be that tight
19:03 gfxstrand[d]: *can be tighter
19:03 karolherbst[d]: on some hardware you can have up to 2k threads resident per SM
19:03 gfxstrand[d]: I also need to smash .yld all over everywhere
19:03 karolherbst[d]: gives you 32 regs per thread
19:04 karolherbst[d]: though I'm also not 100% sure how it all fits together
19:05 karolherbst[d]: however
19:05 karolherbst[d]: the hardware might partition or might be able to run blocks with high and low reg counts in parallel?
19:05 karolherbst[d]: :ferrisUpsideDown:
19:05 karolherbst[d]: there are some magic things like that going on where the hw can interleave stuff
19:07 gfxstrand[d]: I don't know if the same SM can run programs with different register counts but I know you can have different compute shaders in-flight.
19:07 gfxstrand[d]: And I think the same SM can run programs with different shared memory sizes so maybe?
19:09 karolherbst[d]: yeah.. well.. there are SM counters which could give us hints on that stuff
19:09 karolherbst[d]: but yeah.. the hardware is a bit more dynamic in those cases. I might actually have a doc I could look that kind of thing up in
19:10 karolherbst[d]: maybe
19:10 dadschoorse[d]: amd can run waves with different register allocations
19:11 karolherbst[d]: yeah.. it would make perfect sense if that means you could run threads with low and high reg counts in parallel, because otherwise you are just wasting something
19:12 gfxstrand[d]: They probably just round-robin it or something
19:13 gfxstrand[d]: Once you've got the logic to carve up and indirect the register file, it's not too hard to handle those cases.
19:13 gfxstrand[d]: It's only if you make stupid assumptions like powers of two that things get problematic.
19:13 karolherbst[d]: yeah
19:31 mhenning[d]: gfxstrand[d]: My understanding is that the .yld bit actually disables yield, so by not setting the bit NAK currently allows the hardware to yield almost everywhere
19:32 karolherbst[d]: huh.. interesting
19:32 mhenning[d]: I experimented a little with a heuristic to set the bit sometimes, but it breaks shaders in some cases and I couldn't figure out why
19:33 karolherbst[d]: nvidia doesn't call it yield anyway
19:34 karolherbst[d]: or uhm...
19:34 karolherbst[d]: at which bit is yield?
19:36 karolherbst[d]: btw, the last 4 bits are a union :ferrisUpsideDown:
19:36 karolherbst[d]: and meaning depends on the instruction
19:37 karolherbst[d]: mhenning[d]: was yield at bit 2?
19:38 mhenning[d]: I think yield is right above stall cycles in the scheduling info
19:39 karolherbst[d]: mhhh
19:39 karolherbst[d]: our understanding of the layout is wrong anyway
19:40 karolherbst[d]: ehh wait
19:40 karolherbst[d]: I need to shift
19:40 karolherbst[d]: okay...
19:40 karolherbst[d]: got it
19:40 karolherbst[d]: the wait count starts at bit 3 (in turing)
19:40 karolherbst[d]: and
19:41 karolherbst[d]: yield at 3 + 4 is part of the wait count
19:41 karolherbst[d]: go figure 😛
19:41 karolherbst[d]: (yes, the wait count has a high bit with a special meaning)
19:41 karolherbst[d]: okay.. so what does it mean
19:42 karolherbst[d]: mhhhh
19:42 karolherbst[d]: OHHHHHH
19:42 karolherbst[d]: duh
19:42 karolherbst[d]: :ferrisUpsideDown:
19:42 gfxstrand[d]: ?
19:43 karolherbst[d]: mhenning[d]: what we call "yield" invalidates .reuse
19:43 gfxstrand[d]: huh?
19:43 mhenning[d]: I knew that, but does it only invalidate reuse?
19:43 karolherbst[d]: it disallows and invalidates the content
19:44 mhenning[d]: And that's all it does?
19:44 karolherbst[d]: I don't know yet
19:44 gfxstrand[d]: I know you have to set it on `bar` and `bsync` or the hardware wedges
19:44 karolherbst[d]: but yeah
19:44 karolherbst[d]: sooo
19:44 karolherbst[d]: `.yield` means that the next instruction can't make use of any collected values
19:45 karolherbst[d]: gfxstrand[d]: yes
19:45 karolherbst[d]: so
19:45 karolherbst[d]: any instruction which might impact the thread active mask acts like that
19:46 karolherbst[d]: but I think bar and bsync are special
19:46 gfxstrand[d]: That probably has to do with when the hardware is able to wander off
19:46 karolherbst[d]: instructions which might modify the mask do it implicitly though
19:47 karolherbst[d]: and always
19:47 gfxstrand[d]: Specifically, re-used values are probably cached locally (which is why they're fast) so they aren't going to survive a thread-switch.
19:47 karolherbst[d]: yeah
19:47 karolherbst[d]: well
19:47 karolherbst[d]: they remain in the unit
19:47 karolherbst[d]: and some units share the cache
19:48 karolherbst[d]: e.g. fma and alu
19:49 karolherbst[d]: there is more to it though
19:49 karolherbst[d]: so.. there is the concept of "groups"
19:49 karolherbst[d]: and you can suggest the scheduler to switch to something else
19:50 karolherbst[d]: (something else == another warp)
19:51 karolherbst[d]: but anyway
19:51 karolherbst[d]: the reason `.yield` might break things (it's really not called that though) is that it invalidates the collectors, and that might have messed things up
19:51 karolherbst[d]: everything else is just perf hints
19:52 mhenning[d]: By "reuse", I assumed you meant the reuse flags at the top of the control info, and we don't currently set any of those
19:53 karolherbst[d]: in codegen we do, no?
19:53 karolherbst[d]: `emitReuse`
19:54 karolherbst[d]: it's super restricted
19:54 karolherbst[d]: but also potentially wrong :ferrisUpsideDown:
19:54 mhenning[d]: My experiment was run on nak and I don't remember us setting any there.
19:54 karolherbst[d]: yeah
19:54 karolherbst[d]: it's definitely wrong
19:54 karolherbst[d]: ohh wait
19:54 karolherbst[d]: there is `isReuseSupported`
19:55 karolherbst[d]: mhhh
19:55 karolherbst[d]: might be fine....
19:55 karolherbst[d]: actually
19:55 mhenning[d]: Anyway, I need to go afk
20:15 ahuillet[d]: for correctness, I would suggest killing reuse entirely: its perf benefits aren't significant enough vs. tightening/getting delays right
20:26 ahuillet[d]: mhenning[d]: if I understand correctly what bit you are referring to, then your understanding is rather correct
20:27 ahuillet[d]: but, this is only a hint to the scheduler, so it is not relevant for correctness EXCEPT that it messes with register reuse
20:27 ahuillet[d]: (also, this bit isn't called that, and isn't even a bit, because some wait values aren't valid when it is set)
20:34 ahuillet[d]: I suggest that it's not worth bothering with what you call .yld for the time being, and never set it. that also kills all register reuse. and this lets you focus on using the correct tightest latencies possible
20:34 ahuillet[d]: it will have to be dealt with eventually of course, and before register reuse can be made useful
20:43 ahuillet[d]: mhenning[d]: oh, another potential reason of course is if your scoreboards or wait durations were wrong, you were potentially hiding it without the flag, and suddenly you create correctness issues by exposing the incorrect dependency tracking
20:43 ahuillet[d]: would have to see the shader to try and provide analysis.
20:45 ahuillet[d]: ahuillet[d]: (like, maybe you were getting lucky and seeing your shader kick off the SM every time you were hitting the problematic instruction, thereby potentially compensating for some of the latency error)
20:50 gfxstrand[d]: ahuillet[d]: Cool! I'll forget about it then.
20:56 karolherbst[d]: yeah... once the scheduling is correct, we get `.reuse` implemented correctly for free, as it depends on a correct implementation of the other bits anyway :ferrisUpsideDown:
21:11 ahuillet[d]: I wonder if it impacts your RA or if it's a post-RA optimization pass
21:12 ahuillet[d]: I suspect the latter but don't know what the blob does.
21:13 karolherbst[d]: I _think_ we might have something like that in codegen
21:13 karolherbst[d]: we certainly do for weird things like the fma long-immediate form pre-volta
21:19 tiredchiku[d]: sounds like code depending on patches that aren't in the kernel snuck into 24.1.0
21:19 tiredchiku[d]: look at mesa/mesa issue #11270
22:52 airlied[d]: Sounds like the fallback path for when the kernel doesn't support the feature might be broken
22:53 tiredchiku[d]: that is also possible, yeah
23:00 esdrastarsis[d]: tiredchiku[d]: #11166 too, I was hitting this.
23:03 gfxstrand[d]: I suspect it's a Mesa WSI bug that's only getting hit with the combination of explicit sync and no modifiers.
23:05 gfxstrand[d]: I should boot into a distro kernel and poke about
23:06 airlied[d]: is it plasma only?
23:07 gfxstrand[d]: esdrastarsis[d]: Can you repro and give me a backtrace to a `nouveau_ws_bo_new_tiled_locked()` call that fails?
23:07 gfxstrand[d]: the OOM should be easier to diagnose
23:11 gfxstrand[d]: Are folks hitting it with prime setups or stand-alone?
23:17 gfxstrand[d]: I have one uniform ALU bug left.... in 6-nested loops: dEQP-VK.reconvergence.maximal.compute.nesting6.1.2 FML...
23:20 gfxstrand[d]: Who wants to bet that it's because bmov doesn't actually do what I think it does?
23:22 gfxstrand[d]: This sounds like a problem for future Faith.
23:23 gfxstrand[d]: So tomorrow I get to fix that and write the optimization pass
23:23 gfxstrand[d]: redsheep[d]: If you wanted to try out the branch, it's probably safe now. I wouldn't expect any perf wins from it just yet, though.
23:24 gfxstrand[d]: It may even hurt
23:24 redsheep[d]: Which branch?
23:24 gfxstrand[d]: uniform-alu
23:25 redsheep[d]: Ah, okay sounds good, I will do some testing later today
23:25 gfxstrand[d]: That's not a request for testing. But you did seem to be rather excited to try it out. 😅
23:26 redsheep[d]: Yep, same page
23:41 esdrastarsis[d]: gfxstrand[d]: How? With gdb?
23:42 tiredchiku[d]: hoi is anyone interested in testing out nvfbc for me
23:42 tiredchiku[d]: on 555.52.04
23:48 esdrastarsis[d]: esdrastarsis[d]: vkcube crashes at `wsi_common_get_images` here (seg fault)
23:53 redsheep[d]: Probably want to `set logging on` and `bt full` that
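Concretely, something along these lines, assuming a Mesa build with debug symbols (vkcube is just the reproducer mentioned above):
```
gdb --args vkcube
(gdb) set logging on                         # backtraces land in gdb.txt
(gdb) break nouveau_ws_bo_new_tiled_locked
(gdb) run
(gdb) bt full                                # at the breakpoint or at the crash itself
(gdb) continue                               # repeat until the failing call / crash shows up
```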
23:55 gfxstrand[d]: esdrastarsis[d]: Ugh... Those might be different then.
23:58 airlied[d]: swapchain creation fails, then it crashes there
23:59 gfxstrand[d]: Ah