00:06 mhenning[d]: code review
00:10 gfxstrand[d]: Which is to say, me. :blobcatnotlikethis:
00:11 karolherbst[d]: classic
00:12 karolherbst[d]: I did some testing with a bunch of passes thrown at it: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33306?commit_id=0e00659abe6f6f05bc7e9b4d67b8c52716b2a494#note_3032638
00:13 karolherbst[d]: I should check if that "Static cycle count: 42989117200 -> 30666014127 (-28.67%); split: -31.45%, +2.78%" translates to real world gaming perf 😄
00:13 gfxstrand[d]: heh
00:13 karolherbst[d]: that's with your loop thing tho
00:14 karolherbst[d]: and passes that move things out of loops
00:15 karolherbst[d]: maybe I should just check with shapez2 actually...
00:15 gfxstrand[d]: I should rebase my predication patches and try those with the scheduler
00:15 gfxstrand[d]: That'll probably yield real game perf
00:22 karolherbst[d]: ahh predication yeah...
00:22 karolherbst[d]: which branch/MR?
00:22 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33676
00:23 gfxstrand[d]: It needs some heavy rebasing, though. I'm sure it'll make a mess out of the latency stuff
00:24 karolherbst[d]: uhh yeah...
00:25 karolherbst[d]: but still.. I don't think shader perf is a big bottleneck these days with games... probably want to apply all the compression stuff first to see a bigger impact there
00:25 gfxstrand[d]: Yeah, compression is the bigger bottleneck for sure
00:25 karolherbst[d]: yeah.. so with all my patches, it shows almost no changes 🙃
00:25 gfxstrand[d]: But nvidia is well enough balanced that it kinda all matters at least some
00:26 karolherbst[d]: I'm also getting to the point where even for the coop matrix stuff nothing matters much
00:26 karolherbst[d]: like..
00:26 karolherbst[d]: not being silly with membars helps
00:26 gfxstrand[d]: The biggest advantage to my predication MR is that it cuts compile times in half. 🙃
00:26 karolherbst[d]: nice
00:31 karolherbst[d]: membars are a huge thing in those coop matrix shaders tho.. it's impressive.. on my turing it's literally 15 -> 30 TFlops
00:32 karolherbst[d]: though all the other opts get me up to 50 TFlops 🙃
00:32 gfxstrand[d]: From scheduling?
00:33 karolherbst[d]: and a bunch of other stuff
00:34 karolherbst[d]: scheduling doesn't really help that much there
00:34 karolherbst[d]: it's just 75% address calculation
00:34 gfxstrand[d]: Woof
00:34 karolherbst[d]: yeah...
00:35 karolherbst[d]: soo.. I did some range analysis to be able to fold a lot of those into the load instructions
00:35 karolherbst[d]: that help
00:35 karolherbst[d]: ed
00:35 karolherbst[d]: but membar is a huge thing.. and ldsm obviously
00:35 gfxstrand[d]: yeah
00:36 karolherbst[d]: well with just ldsm + scheduling + sink/move + licm + offset opts I'm still at 15 TFlops
00:36 karolherbst[d]: 😄
00:37 karolherbst[d]: ehh wait...
00:37 karolherbst[d]: it's testing a different matrix type
00:39 karolherbst[d]: mhhhh...
00:40 karolherbst[d]: seems like on my turing the membar thing isn't _as_ important as on my ampere...
00:41 karolherbst[d]: one thing does help a bit tho
00:42 karolherbst[d]: lower latencies on the last instruction of a block, and membar
00:42 karolherbst[d]: okay.. I got my branch figured out
00:42 karolherbst[d]: it's a disaster
00:42 airlied[d]: blackwell seems to be doing fine on bra.u
00:43 karolherbst[d]: nice nice
00:43 karolherbst[d]: okay... soo
00:46 karolherbst[d]: all my changes ~25 -> ~30TFlops, not doing membar (results are still correct): -> 45 TFlops 🙃
00:47 karolherbst[d]: it's just ridiculous
00:54 karolherbst[d]: not doing membar on my turing: ~35TFlops -> 60TFlops
00:54 karolherbst[d]: *ampere
00:56 karolherbst[d]: the shader: https://gist.githubusercontent.com/karolherbst/003c0627608f85c411ea00fc49490712/raw/266444184be167b5b451077d44eccba11f9d12b5/gistfile1.txt
00:56 karolherbst[d]: like it's clear that those barriers are kinda pointless
00:57 karolherbst[d]: the big question is just.. how to optimize those
00:57 karolherbst[d]: `nir_opt_barrier_modes` only works before lowered IO
00:57 karolherbst[d]: but those barriers are the result of loop unrolling
00:57 karolherbst[d]: and that happens after lowered IO
00:58 karolherbst[d]: anybody any smart ideas what we could do about those?
01:01 karolherbst[d]: mhhhhhhhh
01:02 karolherbst[d]: so nak emits them as `membar.sc.gpu`
01:03 karolherbst[d]: welll.....
01:03 karolherbst[d]: we shouldn't 🙃
01:03 karolherbst[d]: why is it .gpu?!?
01:03 gfxstrand[d]: Yeah, .gpu is a bit much
01:03 karolherbst[d]: you know what it's doing?
01:04 karolherbst[d]: like the entire warp halts until it's done with the membar
01:04 karolherbst[d]: but that's also true for VC
01:05 karolherbst[d]: let me check if .SM makes a huge difference there
01:09 karolherbst[d]: ohhh interesting
01:11 karolherbst[d]: actually having this example it clearly shows the impact, maybe I should print numbers
01:14 karolherbst[d]: .sys/.gpu/.vc -> 35TFlops
01:14 karolherbst[d]: .sm/.cta -> 50 TFlops
01:14 gfxstrand[d]: 🙃
01:15 gfxstrand[d]: Well, .gpu might be kinda right depending on stuff
01:15 karolherbst[d]: yeah.. so sys/gpu/vc make the warp go idle until it's done
01:15 karolherbst[d]: right..
01:15 karolherbst[d]: sooo...
01:16 karolherbst[d]: on the nir level we have `@barrier (execution_scope=NONE, memory_scope=WORKGROUP, mem_semantics=ACQ|VISIBLE, mem_modes=shared)`
01:17 karolherbst[d]: sooo..
01:17 karolherbst[d]: I think instead of doing .GPU for workgroup, we should just do CTA
01:17 gfxstrand[d]: Is shared per-CTA?
01:17 karolherbst[d]: yes
01:18 karolherbst[d]: CTA is the workgroup
01:18 gfxstrand[d]: Yeah, so we should probably map workgroup to CTA in general
01:18 karolherbst[d]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays
01:18 karolherbst[d]: there is also .SM which is for uhm...
01:18 karolherbst[d]: all the other CTAs
01:18 karolherbst[d]: so the entire grid
01:19 karolherbst[d]: SCOPE_QUEUE_FAMILY should probably be .VC
01:19 karolherbst[d]: SCOPE_QUEUE_FAMILY is within the same context, right?
01:19 gfxstrand[d]: Yeah
01:19 karolherbst[d]: yeah.. that's .VC
01:19 gfxstrand[d]: But that's not a valid scope in Vulkan
01:19 karolherbst[d]: ahh
01:20 karolherbst[d]: what does device mean in vulkan?
01:20 karolherbst[d]: `SCOPE_DEVICE` I mean
01:21 karolherbst[d]: but anyway.. doing CTA for workgroup is probably gonna help a lot already
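The scope mapping being proposed here can be written down as a small lookup. This is only a sketch of the idea in the chat: the function and table names are hypothetical and not NAK's actual code, only the workgroup and queue-family entries were suggested in the discussion, and falling back to `.gpu` for every other scope is an assumption made to keep the sketch conservative.

```python
# Hypothetical sketch: names and table are illustrative, not NAK's real code.

# Entries suggested in the discussion above.
PROPOSED_MEMBAR_SCOPE = {
    "workgroup": "cta",     # shared memory is per-CTA and the CTA is the workgroup
    "queue_family": "vc",   # suggested in the chat; noted as not a valid Vulkan scope
}

# Assumption: anything not listed keeps the wide scope that was emitted before.
CONSERVATIVE_SCOPE = "gpu"


def membar_scope(memory_scope: str) -> str:
    """Pick a membar scope suffix for a NIR memory scope name."""
    return PROPOSED_MEMBAR_SCOPE.get(memory_scope, CONSERVATIVE_SCOPE)


if __name__ == "__main__":
    for scope in ("workgroup", "queue_family", "device"):
        print(f"{scope:>12} -> .{membar_scope(scope)}")
```

The point of keeping a wide default is that a too-narrow scope is a correctness bug, while a too-wide one only costs performance (the 35 vs 50 TFlops difference quoted above).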
01:22 karolherbst[d]: but you probably want to CTS that one on all the gens 😄
01:24 karolherbst[d]: anyway.. with that the idea of just skipping all membars becomes less interesting: ~50 -> 60 TFlops 😄
01:24 karolherbst[d]: just optimizing them well enough gets me there as well
01:25 karolherbst[d]: yeah...
01:27 karolherbst[d]: ohhh
01:27 karolherbst[d]: it seems like the so called "yield" flag is relevant to membars
01:28 karolherbst[d]: ehh different flag
01:30 karolherbst[d]: there is a way to force instructions to execute even after a membar.VC (or higher) gets executed
01:30 karolherbst[d]: it's part of the scoreboarding
01:36 karolherbst[d]: ohhh `QueueFamily` is part of vulkan 1.5
01:36 karolherbst[d]: ehh spir-v 1.5
01:36 karolherbst[d]: not that it matters much
01:45 karolherbst[d]: looks like nvidia added "cluster" as something between workgroups and grids that can communicate via shared memory, interesting
01:49 karolherbst[d]: anyway.. gonna CTS it properly tomorrow
01:50 karolherbst[d]: a quick `*memory_model*` run shows no fails so far
06:47 mohamexiety[d]: karolherbst[d]: yeah new addition with hopper
09:03 karolherbst[d]: I wonder if I should play around a bit more with membar and see if we even need it for SCOPE_INVOCATION or SCOPE_SUBGROUP... but I'll test just the CTA change first, because that's gonna help the most
09:15 karolherbst[d]: mohamexiety[d]: funky.. can't use `LDS` to load from shared memory from other CTAs in the same cluster
09:15 karolherbst[d]: need to use `LD`
09:16 mohamexiety[d]: Huh that’s weird
09:16 karolherbst[d]: I mean.. it makes sense
09:16 karolherbst[d]: each CTA has its own window into shared memory
09:16 mohamexiety[d]: I’d have thought the whole point was to allow you to use normal shared mem ops tho
09:16 karolherbst[d]: that isn't impacted by that
09:17 karolherbst[d]: yeah sooo
09:17 karolherbst[d]: you use LD and specify the target CTA's id in the high bits
09:18 mohamexiety[d]: I see
09:19 karolherbst[d]: I think it uses the mapped shared memory address.. kinda weird 😄
09:19 karolherbst[d]: I don't see LD being able to operate on raw shared memory addresses
09:20 karolherbst[d]: funky
09:21 karolherbst[d]: even worse, you need to specify the CTA ID also in LDS but it has to match the local one
09:21 karolherbst[d]: but if the CTA isn't part of a cluster, it must be 0
09:22 karolherbst[d]: this sounds like pain
10:51 phomes_[d]: karolherbst[d]: I have done some game testing with the memscope MR. VKD3D games gained 1-2 fps. DXVK and native vulkan games were unaffected
13:15 gfxstrand[d]: karolherbst[d]: Shouldn't need anything for invocation. Subgroup is :shrug_anim:
13:15 karolherbst[d]: yeah....
13:15 karolherbst[d]: I just want the CTA stuff done first, because that matters for the $current_thing (tm)
13:18 karolherbst[d]: I kinda hate my current version of the MR, but it does align with what nvidia is doing
14:58 phomes_[d]: Results from testing the compression MR:
14:58 phomes_[d]: | API | GAME | Git main | Compression | Prop 575 |
14:58 phomes_[d]: |--------|------------------------|----------|-------------|----------|
14:58 phomes_[d]: | vkd3d | Age of Empires IV | 197 | 222 | 357 |
14:58 phomes_[d]: | vkd3d | Atomic Heart | 39 | 48 | 89 |
14:58 phomes_[d]: | vkd3d | Sniper Elite 5 | 33 | 33 | 63 |
14:58 phomes_[d]: | vkd3d | Deep Rock Galactic | 62 | 79 | 143 |
14:58 phomes_[d]: | dxvk | Recipe for Disaster | 129 | 154 | 211 |
14:58 phomes_[d]: | dxvk | Urban Trial Playground | 44 | 53 | 62 |
14:58 phomes_[d]: | dxvk | X-Com 2 | 21 | 37 | 86 |
14:58 phomes_[d]: | vulkan | Serious Sam 2017 | 360 | 360 | 719 |
14:58 phomes_[d]: | vulkan | Sniper Elite 5 | 35 | 40 | 77 |
14:58 phomes_[d]: | vulkan | The Surge 2 | 52 | 52 | 110 |
14:58 phomes_[d]: | vulkan | X4 Foundations | 21 | 22 | 48 |
14:58 karolherbst[d]: nice
14:58 mangodev[d]: nice
15:01 mohamexiety[d]: phomes_[d]: awesome!! thank you so much
15:01 mohamexiety[d]: so looks to be overall 20-30% with some outliers not changing at all and one changing massively. did you notice any graphical glitches while doing this?
15:02 phomes_[d]: no issues whatsoever
15:03 mohamexiety[d]: nice! given no modifiers I know desktop will be horribly broken but wasn't sure about games. it's a great relief to hear that visually things are fine
15:03 gfxstrand[d]: mohamexiety[d]: Without modifiers, desktop should be fine. It just won't get compression (unless we screwed something up).
15:04 mohamexiety[d]: hmm.. not sure then what's up. marysaka[d] tried it with Weston and it was really... trippy
15:04 mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1400494327385292861/image.png?ex=688cd77b&is=688b85fb&hm=ed4182dffa4a84b944ff68a9ff2a889963d58186f049f3779a6484aff268e2ab&
15:04 mohamexiety[d]: peak compositing
15:05 phomes_[d]: this is on a desktop computer with only 1 gpu. Running gnome-shell on the system installed mesa and the compression MR in meson devenv
15:05 mohamexiety[d]: yeah that's how I did it too
15:44 gfxstrand[d]: If we're trying to share compressed images, I could easily see that leading to insanity
15:51 karolherbst[d]: wouldn't they just need to use a different modifier and it would be fine? Or is there more to it?
15:51 mohamexiety[d]: Yeah working on wiring that up today
15:52 gfxstrand[d]: karolherbst[d]: They definitely need modifier changes
15:52 gfxstrand[d]: But also I'm not sure how all the tricks we're doing to get compression working will play with sharing across processes
16:03 mohamexiety[d]: If those synchronization2.cross_instance failures I am seeing are any indication probably not very well 🐸
16:04 mohamexiety[d]: But not sure of a good idea for dealing with that
16:04 marysaka[d]: gfxstrand[d]: I was explicitly disallowing compression in case of DRM modifiers
16:04 marysaka[d]: (otherwise we were asserting in NIL later on with the mismatch on compression)
16:06 mohamexiety[d]: mohamexiety[d]: These
16:17 karolherbst[d]: uhh.. I'm silly.. I can totally do this opt without causing correctness issues 🙃
16:17 gfxstrand[d]: which opt?
16:17 karolherbst[d]: div 32 %211 = iadd %202, %210 (0x40000)
16:17 karolherbst[d]: div 32 %212 = ushr %211, %203 (0x3)
16:17 karolherbst[d]: like.. that's iadd(ushr(a), const >> 3)
16:17 karolherbst[d]: and it wouldn't change the result
16:18 gfxstrand[d]: yup
16:18 karolherbst[d]: but I need to know the constant is big enough, but I do have the infra for that...
16:18 karolherbst[d]: I just thought yesterday all day I couldn't
16:18 gfxstrand[d]: Uh... No
16:18 gfxstrand[d]: Overflow
16:18 karolherbst[d]: what overflow?
16:18 karolherbst[d]: ohhh
16:18 karolherbst[d]: right...
16:18 gfxstrand[d]: 🙂
16:18 karolherbst[d]: 🙃
16:19 karolherbst[d]: I do this opt for `iadd.nuw`
16:20 karolherbst[d]: it's part of my range analysis stuff
16:20 karolherbst[d]: uhh...
16:20 karolherbst[d]: annoying
16:22 karolherbst[d]: the best part is, the iadd is actually nuw, just it depends on a loop variable and....
16:25 gfxstrand[d]: yup
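A small worked example of the overflow problem pointed out above, modelling 32-bit wrapping arithmetic with explicit masks. The helper names are made up for illustration; the constant `0x40000` and the shift of 3 are taken from the NIR snippet in the chat.

```python
MASK32 = 0xFFFFFFFF


def shift_after_add(a: int, c: int, s: int) -> int:
    """ushr(iadd(a, c), s) with a wrapping 32-bit add, as in the original NIR."""
    return ((a + c) & MASK32) >> s


def add_after_shift(a: int, c: int, s: int) -> int:
    """iadd(ushr(a, s), c >> s): the proposed rewrite."""
    return ((a >> s) + (c >> s)) & MASK32


C = 0x40000  # constant from the iadd in the chat
S = 3        # shift amount from the ushr in the chat

# No wraparound: both forms agree (the low 3 bits of C are zero).
small = 0x1000
assert shift_after_add(small, C, S) == add_after_shift(small, C, S)

# With wraparound in the add, the carry out of bit 31 is lost before the
# shift in the original, but not in the rewrite, so the results differ.
wrapping = 0xFFFFFFF8
print(hex(shift_after_add(wrapping, C, S)))
print(hex(add_after_shift(wrapping, C, S)))
```

This is why the rewrite is only safe when the add is known not to wrap, which is what the `iadd.nuw` flag expresses.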
16:29 karolherbst[d]: does being able to do a 64 bit shift in 32 bit win us anything?
16:30 karolherbst[d]: like for patterns like these:
16:30 karolherbst[d]: div 32 %393 = iand %94, %392 (0x3)
16:30 karolherbst[d]: div 64 %396 = u2u64 %393
16:30 karolherbst[d]: div 64 %414 = ishl %396, %390 (0x2)
16:30 karolherbst[d]: could move the u2u64 down
16:31 mhenning[d]: I think that would save an instruction
16:31 karolherbst[d]: but I don't think a 32 bit shift helps with much...
16:31 karolherbst[d]: ohh it's two instructions..
16:31 karolherbst[d]: yeah then it helps
16:32 mhenning[d]: Yeah, iirc the 64-bit shift is two instructions, the 32-bit is one
16:32 karolherbst[d]: okay, let me try that and see if that helps much
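A sketch of why the `u2u64` can be moved below the shift in this particular pattern: the `iand` with `0x3` bounds the value, so a 32-bit shift by 2 cannot lose bits. The function names are hypothetical and the masks only model 32-bit and 64-bit registers; this is not the actual pass.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def widen_then_shift(x: int, and_mask: int = 0x3, shift: int = 2) -> int:
    """Original order: iand (32-bit), u2u64, then a 64-bit ishl."""
    masked = x & and_mask & MASK32
    widened = masked  # u2u64 is a zero extension
    return (widened << shift) & MASK64


def shift_then_widen(x: int, and_mask: int = 0x3, shift: int = 2) -> int:
    """Reordered: iand (32-bit), 32-bit ishl, then u2u64."""
    masked = x & and_mask & MASK32
    shifted = (masked << shift) & MASK32  # bits above bit 31 would be dropped here
    return shifted  # zero extension keeps the value


def reorder_is_safe(and_mask: int, shift: int) -> bool:
    """Safe when the largest possible masked value still fits in 32 bits after the shift."""
    return ((and_mask & MASK32) << shift) <= MASK32


if __name__ == "__main__":
    print(reorder_is_safe(0x3, 2))          # the pattern from the chat
    print(reorder_is_safe(0xFFFFFFFF, 2))   # an unbounded value would not be safe
```

Without a known upper bound on the shifted value the two orders can disagree, which is where range analysis comes in.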
16:33 karolherbst[d]: it's just annoying that uub is also restricted to 32 bits 🙃
16:34 karolherbst[d]: ohh, I have an idea
17:04 karolherbst[d]: 739 -> 736 instructions, yay
17:06 karolherbst[d]: mhhh the resulting shader is very interesting...
17:12 cubanismo[d]: Uhg. Just realized I probably broke NV5x support for linear modifiers when I added format modifier support to nouveau years ago.
17:13 cubanismo[d]: Only noticed because the same issue broke it on Blackwell with my pending new format modifier changes.
17:14 cubanismo[d]: I don't know how. I had to buy an NV50 card on eBay just to test that series since it was already so old at that point that I didn't have any left.
17:14 cubanismo[d]: So I should have noticed.
17:22 karolherbst[d]: gfxstrand[d]: okay... I know why nvidia inserts those nops with a wait of 2 🙂
17:23 karolherbst[d]: it's kinda silly honestly...
17:30 karolherbst[d]: I don't see a reason why it would always be required, just in some cases
17:32 karolherbst[d]: sadly those two cycles matter in my testing...
17:34 karolherbst[d]: uhhh.. pain
17:36 karolherbst[d]: so I mentioned that on a membar.GPU (or VC or sys) the CTA halts executing instructions? Well, that's true unless it's followed by coupled instructions. If you need to force a wait, you'll have to use a nop
17:42 karolherbst[d]: there is a bit more going on...
17:44 karolherbst[d]: I don't think we'll need it on membar.CTA, but I'll have to check what nvidia is doing there
19:11 gfxstrand[d]: karolherbst[d]: Oh, lovely...
19:11 karolherbst[d]: it gets worse
19:11 gfxstrand[d]: I knew it would
19:11 karolherbst[d]: the membar doesn't complete at the membar
19:12 karolherbst[d]: it checks with each following instruction
19:12 karolherbst[d]: until it's done
19:12 karolherbst[d]: implicitly
19:12 gfxstrand[d]: uh... what?
19:12 karolherbst[d]: and you can excempt instructions from waiting for the completion
19:12 karolherbst[d]: *exempt
19:12 karolherbst[d]: like
19:12 karolherbst[d]: the membar itself only sets up the thing to do the wait
19:13 karolherbst[d]: and then does the check as an implicit barrier wait on the next one
19:13 karolherbst[d]: so the nop just forces it to happen on the nop
19:13 gfxstrand[d]: Oh. Neat
19:14 karolherbst[d]: still don't understand all the details 100%, but I might be able to predict when we need it and when not
19:14 karolherbst[d]: anyway, that explains why the nop exist
19:14 karolherbst[d]: and the 2 wait on it are just the implicit min wait on scoreboards
19:15 karolherbst[d]: the other barrier instructions can opt-in into this behavior as well, just membar always does it
19:16 karolherbst[d]: `DEFER_BLOCKING` is the thing, it's implicit on membar
19:18 karolherbst[d]: there is a list of instructions that would wait on it and those which wouldn't, but you can e.g. exempt a load instruction not having inter thread dependencies to skip the wait imposed by the membar
19:18 karolherbst[d]: anyway, that's kinda the tldr on what's going on there
19:20 karolherbst[d]: and the 6 cycles on the membar are just how long it takes to set it up
19:21 karolherbst[d]: so I _think_ we could move instructions between the membar and the first memory operation and cut down on the wait
19:25 karolherbst[d]: sooo with the membar.CTA I'm at 40TFlops, but messing with them more gets me to 45... mhhh troublesome
19:27 karolherbst[d]: ohh maybe not...
19:27 karolherbst[d]: maybe it was another change...
19:29 karolherbst[d]: yeah.. it's the membars but a bit less of an impact
19:30 karolherbst[d]: I .. have an idea
19:31 karolherbst[d]: `max_unroll_iterations = 1024` gives me 60 tflops, right...
19:33 karolherbst[d]: `55172` cycle count after aggressive loop unrolling @ 60 tflops, `66583` cycles @ 52 tflops mhhh
19:33 karolherbst[d]: the second is without my barrier hack
19:34 karolherbst[d]: it looks like it's not the barrier, it just gets faster because less cycles...
19:34 karolherbst[d]: 10% less cycles, 10% more speed, makes sense
19:34 karolherbst[d]: ehh..
19:35 karolherbst[d]: well.. 20% less cycles, but 15%? more speed or something
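A quick check of the numbers quoted above, assuming (as a rough model only) that throughput scales inversely with the static cycle count. The variable names are just for illustration.

```python
# Figures quoted in the chat.
fast_cycles, fast_tflops = 55172, 60.0   # aggressive unrolling, with the barrier hack
slow_cycles, slow_tflops = 66583, 52.0   # without the barrier hack

# Fraction of cycles saved going from the slow to the fast shader.
cycle_reduction = 1.0 - fast_cycles / slow_cycles

# Speedup predicted if performance were exactly proportional to 1 / cycles.
predicted_speedup = slow_cycles / fast_cycles

# Speedup actually measured.
measured_speedup = fast_tflops / slow_tflops

print(f"cycle reduction:   {cycle_reduction:.1%}")
print(f"predicted speedup: {predicted_speedup:.3f}x")
print(f"measured speedup:  {measured_speedup:.3f}x")
```

Under that model the cycle count predicts a somewhat larger speedup than what was measured, which matches the "fewer cycles, but a bit less extra speed" observation.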
19:38 karolherbst[d]: ohh.. I'll do that on the fossils I have and see how close Faith's loop-aware cycle counting gets
19:47 karolherbst[d]: 2.85s -> 163.4s compile times.. "fun"
19:48 cubanismo[d]: Does Mesa still support the gallium nouveau driver for NV50/GF100/etc. older GPUs?
19:51 gfxstrand[d]: Yeah
19:51 mhenning[d]: cubanismo[d]: It's supposed to, yes
19:52 karolherbst[d]: 👀
19:52 gfxstrand[d]: Well as much "support" as it's ever gotten
19:52 cubanismo[d]: Trying to test NV50
19:52 karolherbst[d]: I _think_ I might have broken it
19:52 cubanismo[d]: Instant segfault because karolherbst[d] deleting some pipe context functions
19:52 cubanismo[d]: Saying it's all dead code
19:52 gfxstrand[d]: oops
19:52 mhenning[d]: in practice bugs don't get fixed very quickly
19:52 cubanismo[d]: On ToT kernels, just panics trying to submit the very first pushbuffer
19:52 karolherbst[d]: it's the framebuffer stuff isn't it?
19:52 cubanismo[d]: Had to drop back to like 6.12
19:52 cubanismo[d]: Haven't bisected it.
19:53 karolherbst[d]: mhhh
19:53 karolherbst[d]: let me give you a commit
19:53 cubanismo[d]: I think I see the mesa commit
19:53 cubanismo[d]: But I did a quick revert of a tiny part. Gonna see if it gets it running kmscube at least.
19:53 karolherbst[d]: try on c96003305ee7b9014e129ddd63ba02d33ed4011f
19:53 karolherbst[d]: but I really should just fix it.. gimme a sec
19:55 cubanismo[d]: Ah, I think it was broken long before that. One second.
19:56 cubanismo[d]: I don't dig the GeForce 8800 GTX out of my closet very often
19:56 mohamexiety[d]: impressed it still works, damn
19:56 karolherbst[d]: heh...
19:56 karolherbst[d]: I know that nv30 works
19:56 karolherbst[d]: people fixed bugs so it runs on gnome 40!
19:57 karolherbst[d]: vertex stuff was broken since it existed 😄
19:57 cubanismo[d]: My nv30 finally died ~8 years ago or so.
19:57 karolherbst[d]: I still have a few that work
19:57 cubanismo[d]: I have a working NV4 somewhere I think
19:57 karolherbst[d]: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/c5463cf25b6d8a2ed5b241dc9c90081cdce58098 should make it work on
19:57 cubanismo[d]: I bought both of those pre-working-at-NVIDIA
19:57 karolherbst[d]: heh
19:57 marysaka[d]: what is a Geforce 6600 mapping to btw
19:57 cubanismo[d]: NV42 or something
19:57 karolherbst[d]: yeah...
19:57 marysaka[d]: I still have one that should be alive-ish
19:58 cubanismo[d]: Those were good cards. NV3x, well, was also a thing
19:58 marysaka[d]: well one of the two, the other lost a cap
19:58 karolherbst[d]: nv3x was fixed function pipeline, no?
19:58 karolherbst[d]: nv4x was the first gen with shaders or something?
19:58 cubanismo[d]: No, they both had shaders
19:58 cubanismo[d]: But it was pre-unified shaders
19:58 karolherbst[d]: ohh indeed...
19:58 cubanismo[d]: NV2x was the last register combiners-only thing IIRC.
19:59 karolherbst[d]: anyway... if the commit above does work, then the commit linked above should fix it
20:00 karolherbst[d]: I thought I tested on nv50....
20:02 karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36500 please leave comment if it fixes it 😄
20:04 cubanismo[d]: There seems to be some other issue where after I run kmscube once, it takes ~2 minutes for it to start the second time, so need to reboot every time.
20:04 cubanismo[d]: Ah, kernel side is throwing a fit.
20:04 karolherbst[d]: mhhh
20:05 cubanismo[d]: One sec, have to reboot to test your patch. My version worked though 🙂
20:06 karolherbst[d]: cubanismo[d]: might be modifier/zbuf related or something weird? Depends on the error the kernel throws...
20:06 karolherbst[d]: but those older GPUs are kinda weird...
20:06 karolherbst[d]: never know if they are kinda broken or not
20:06 karolherbst[d]: have some nv40 which just don't work and I'm like.. is this a mesa regression? Switching to a different nv40 and it just magically works
20:07 karolherbst[d]: and I think I have one or two broken nv50 as well...
20:07 cubanismo[d]: The commit that broke things AFAICT was facb048cdbbe1acffb41cdfbebc9042c1d539cd4
20:07 karolherbst[d]: yeah...
20:07 karolherbst[d]: so my MR should fix that
20:08 cubanismo[d]: Yeah, looks like it does.
20:08 karolherbst[d]: k thanks I'll merge it 😄
20:08 karolherbst[d]: but that should only fix CPU side issues
20:08 karolherbst[d]: I'm sure the entire programming the push buffer parts have all sorts of funky bugs
20:09 karolherbst[d]: uhh it's on 25.2.. like.. I really should merge it before the release
20:10 karolherbst[d]: which is in a week, nice
20:10 cubanismo[d]: I think I have a GF106 or something too. Want me to check that as well while I have my DVI cable out?
20:10 karolherbst[d]: nah, that uses the nvc0 driver
20:10 cubanismo[d]: Right
20:10 karolherbst[d]: I mean, you can test it if you want, but I'm more sure that works, because I've done more testing there
20:10 karolherbst[d]: recently
20:11 cubanismo[d]: K, won't bother then
20:15 karolherbst[d]: anybody wants to run this MR on kepler NVK? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36482
20:15 karolherbst[d]: maybe I should... 🙃
20:17 cubanismo[d]: FWIW, here's what I get trying to start kmscube a second time on NV50: https://pastebin.com/uNbtGmhe
20:18 karolherbst[d]: I'd guess that also happens before facb048cdbbe1acffb41cdfbebc9042c1d539cd4
20:19 cubanismo[d]: I can check real fast.
20:19 karolherbst[d]: that's what we call `NV50_3D_TEX_LIMITS`
20:19 cubanismo[d]: But I assume so.
20:19 karolherbst[d]: data 54...
20:20 karolherbst[d]: so NV50_3D_TEX_LIMITS(2) set with samplers_log2: 4 and textures_log2: 5
20:20 karolherbst[d]: no idea why that would cause errors...
20:21 cubanismo[d]: Is it validating against the bound surface or something?
20:21 karolherbst[d]: this happens as part of channel initialization
20:21 cubanismo[d]: I'm not clear what's originating the error: The HW or some checks in nouveau
20:21 karolherbst[d]: like.. before any substantial GL api call is even handled
20:21 karolherbst[d]: like at dlopen time basically
20:22 cubanismo[d]: Yeah, that matches what I observe as far as program progression
20:22 karolherbst[d]: so either the previous context upset the hardware enough and the kernel can't handle it
20:22 karolherbst[d]: or the kernel somehow fails to set up a new context?
20:24 cubanismo[d]: MF that unicode-indent build thing
20:24 karolherbst[d]: I'd assume it's some kernel regression maybe...
20:25 karolherbst[d]: like this breaks on first submission just pushing static context initialization stuff...
20:25 cubanismo[d]: Yeah
20:25 cubanismo[d]: Well as I say, it's better than 6.15, where it doesn't even get through the first run
20:26 karolherbst[d]: 🥲
20:26 cubanismo[d]: I can't put much more time into it right now
20:26 cubanismo[d]: But if someone has ideas, I can get this card back out to test.
20:27 karolherbst[d]: anyway.. I plugged in my titan with the cool brightness adjustable LEDs 🙃
20:27 cubanismo[d]: Ha
20:28 cubanismo[d]: I think someone had to go add a bunch of stuff to our Linux control panel to handle those at one point
20:28 karolherbst[d]: heh
20:28 karolherbst[d]: I think there was a maxwell? Titan where you could even change the colors?
20:29 cubanismo[d]: I saw one of those when I was digging the NV50 out of the bin-o-GPUs.
20:30 cubanismo[d]: but it's probably an engineering board. Might not have fancy LEDs.
20:31 cubanismo[d]: OK, yeah, the kernel issues happen even with a build from before that Mesa commit
20:31 cubanismo[d]: As expected.
20:31 cubanismo[d]: Gonna put that card away now. Thanks for the help
20:33 karolherbst[d]: yeah... might be more lucky with a different gpu
20:34 airlied[d]: not sure I want to plug in the nv50
21:06 karolherbst[d]: ur3 = iadd3 ur3 0x20000 rZ // delay=4
21:06 karolherbst[d]: ur3 = iadd3 ur3 ur0 rZ // delay=4
21:06 karolherbst[d]: I think those can be merged...
21:07 mhenning[d]: I think we're already supposed to merge them in nir
21:08 karolherbst[d]: yeah...
21:08 karolherbst[d]: I'm sure something weird is going on there
21:08 karolherbst[d]: I was just looking at the sass output for a chance and just saw this
21:08 karolherbst[d]: *change
21:11 karolherbst[d]: okay.. found it in the nir:
21:11 karolherbst[d]: con 32 %235 = load_const (0x00020000 = 131072)
21:11 karolherbst[d]: con 32 %236 = iadd %234, %235 (0x20000)
21:11 karolherbst[d]: con 32 %237 = iadd %236, %71
21:11 karolherbst[d]: yeah.. why isn't that an iadd3...
21:13 karolherbst[d]: you won't believe it
21:14 karolherbst[d]: soo.. the nir late algebraic opt only triggers if the two adds have two non constant sources (okay, we have that) and if the constant is at most 16 bit
21:15 karolherbst[d]: guess I'll just duplicate it and remove that restriction 😄
21:17 karolherbst[d]: ` ur4 = iadd3 ur4 0x20000 ur0 // delay=4` nice
21:19 mhenning[d]: Yeah, not sure why we would have that restriction
21:20 mhenning[d]: looks like we have 32 bits for the immediate on sm70
21:20 karolherbst[d]: intel reasons
21:20 karolherbst[d]: like intel added it to opt_algebraic_late and for them it only works with 16 bits
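A toy model of why the fold did not trigger for `0x20000`. The helper names are hypothetical, and treating the limit as an unsigned bit-width check is an assumption about how the restriction is expressed; this is not the actual `opt_algebraic_late` rule.

```python
def fits_in_bits(value: int, bits: int) -> bool:
    """True if value is representable as an unsigned integer of the given width."""
    return 0 <= value < (1 << bits)


def try_fold_iadd3(a: str, const: int, b: str, max_const_bits: int):
    """Fold iadd(iadd(a, const), b) into iadd3(a, const, b) if the constant is small enough."""
    if fits_in_bits(const, max_const_bits):
        return ("iadd3", a, const, b)
    return ("iadd", ("iadd", a, const), b)


if __name__ == "__main__":
    # Operands from the NIR snippet in the chat: %234 + 0x20000 + %71.
    print(try_fold_iadd3("%234", 0x20000, "%71", max_const_bits=16))
    print(try_fold_iadd3("%234", 0x20000, "%71", max_const_bits=32))
```

`0x20000` needs 18 bits, so a 16-bit limit leaves the two adds in place while a 32-bit immediate allows the single `iadd3`.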
21:21 mhenning[d]: ohh. for some reason I thought that was in nak_nir_algebraic already
21:21 karolherbst[d]: only lea is
21:21 karolherbst[d]: (and imin/imax lowering)
21:21 mhenning[d]: and isub for some reason
21:22 karolherbst[d]: yeah..
21:22 karolherbst[d]: I have a couple of other trivial algebraic opts, should probably put them into an MR later
21:28 karolherbst[d]: I wonder if I want to wire up the 4th source of lea...
21:29 mhenning[d]: we already have 4 source lea?
21:29 mhenning[d]: assuming you're talking about the 64-bit stuff
21:29 karolherbst[d]: ohh wait.. it already is
21:29 karolherbst[d]: yeah I meant the high bits c source
21:29 karolherbst[d]: but...
21:29 mhenning[d]: yeah, we have that at home
21:29 karolherbst[d]: I meant it more in a "you can do funky opts with them" meaning
21:29 karolherbst[d]: soo...
21:30 mhenning[d]: oh, yeah it could be used for rotates or whatever
21:30 karolherbst[d]: r4 = shf.r.w.u32.hi rZ r4 0x3 // delay=1
21:30 karolherbst[d]: r14 p1 = lea r4 4 cx[ur2..4][0x0] // delay=1
21:30 mhenning[d]: we only use it for 64-bit lea so far
21:30 karolherbst[d]: mhh maybe looking at the nir is easier there
21:31 karolherbst[d]: div 32 %69 = ushr %68, %16 (0x3)
21:31 karolherbst[d]: div 64 %70 = u2u64 %69
21:31 karolherbst[d]: div 64 %71 = lea_nv %28, %70, %4 (0x4)
21:32 karolherbst[d]: lea_nv could probably just shift the thing into the high bits and fold %68 in, no?
21:33 karolherbst[d]: mhhh
21:33 karolherbst[d]: we'd need to shift again..
21:33 karolherbst[d]: uhh
21:34 mhenning[d]: I don't think that works
21:34 mhenning[d]: since (x >> 3) << 3 clears the low bits
21:34 karolherbst[d]: right...
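The objection can be checked with plain integers: a right shift followed by a left shift by the same amount is a mask, not the identity, so the `ushr` cannot simply be absorbed into a later left shift. A minimal illustration (the function name is made up):

```python
def shift_down_then_up(x: int, s: int = 3) -> int:
    """(x >> s) << s, which clears the low s bits of x."""
    return (x >> s) << s


if __name__ == "__main__":
    x = 0b10111
    print(bin(shift_down_then_up(x)))    # low three bits are gone
    print(shift_down_then_up(x) == x)    # so the round trip is not the identity
    print(shift_down_then_up(x) == (x & ~0b111))
```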
21:36 karolherbst[d]: ohhh.. I just remembered something..
21:36 karolherbst[d]: right...
21:36 karolherbst[d]: soo...
21:36 karolherbst[d]: but that's unrelated to this.. well.. kinda
21:37 karolherbst[d]: since turing there is a multiplier on LDS and STS
21:37 karolherbst[d]: basically a free shift
21:38 mhenning[d]: only on shared?
21:38 karolherbst[d]: yes
21:38 karolherbst[d]: it makes sense
21:38 mhenning[d]: weird
21:38 karolherbst[d]: it only shifts the register, not the offset
21:38 karolherbst[d]: so Ra << 2 + offset
21:38 karolherbst[d]: supports only << 2, 3 and 4, not 1
21:39 karolherbst[d]: I should see if I can make use of it here..
21:39 karolherbst[d]: yep
21:39 karolherbst[d]: nice...
21:41 karolherbst[d]: div 32 %115 = ishl.nuw %94, %4 (0x4)
21:41 karolherbst[d]: @store_shared (%185, %115) (base=0, wrmask=xyzw, align_mul=16, align_offset=0)
21:41 karolherbst[d]: ->
21:41 karolherbst[d]: `LDS %185 << 4 + base`
21:41 karolherbst[d]: `@store_shared (%186, %115) (base=5120, wrmask=xyzw, align_mul=16, align_offset=0)` would mean -> `LDS [%185 << 4 + 0x1400]`
21:41 karolherbst[d]: ehh %115
21:42 karolherbst[d]: but you get the idea
21:42 karolherbst[d]: ehh.. %94 actually
21:42 karolherbst[d]: yeah.. maybe I should wire this up, could be fun
21:45 karolherbst[d]: `STS.128 [R10.X8+0x1400], R24 ;` okay, found it
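A sketch of the address form described above, where only the register is shifted and the immediate offset is added unshifted. The function name is hypothetical, and the set of allowed shifts (2, 3 and 4) is taken from the chat rather than from hardware documentation.

```python
ALLOWED_SHIFTS = (2, 3, 4)  # per the chat: << 2, 3 and 4 are supported, << 1 is not


def shared_address(reg_value: int, shift: int, offset: int) -> int:
    """Effective shared-memory address: (reg << shift) + offset."""
    if shift not in ALLOWED_SHIFTS:
        raise ValueError(f"unsupported shift {shift}, expected one of {ALLOWED_SHIFTS}")
    return (reg_value << shift) + offset


if __name__ == "__main__":
    # store_shared with base=5120 (0x1400) and an index that was shifted left by 4.
    print(hex(shared_address(3, 4, 0x1400)))
```

Folding the shift this way removes the separate `ishl` only when both the constant offset and the constant shift can be found, which is the difficulty mentioned right after.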
21:46 karolherbst[d]: I wonder if I should do it in from_nir.rs or make it a nak pass...
21:46 karolherbst[d]: there is already `get_io_addr_offset`...
21:46 karolherbst[d]: and adding that makes it even worse
21:48 karolherbst[d]: mhhh.. I kinda hate it now
21:49 karolherbst[d]: and it's useless for current upstream, because if you need to find the constant offset _and_ a constant shift you'll be going crazy
21:49 karolherbst[d]: unless you check for lea...
21:49 karolherbst[d]: but lea can't take a constant offset..
21:50 karolherbst[d]: yeah.. I'd have to upstream the nir_opt_offset stuff first
21:53 karolherbst[d]: yeah whatever.. I should upstream stuff
21:55 Solak: has a problem with the nv-driver: the graphics-environment freezes completely at random moments, but I can still access the machine (ssh).
21:57 Solak: I checked the faq, and I have an error-log. The proprietary-driver doesn't have the problem, but Debian doesn't support this driver since "Bookworm".
22:02 mhenning[d]: what card do you have?
22:02 Solak: The gfx-card is: "VGA compatible controller: NVIDIA Corporation GF116 [GeForce GTX 550 Ti] (rev a1)"
22:07 Solak: according to the logs it is a read-fault, page not present on channel 2 (which is then killed).
22:14 mhenning[d]: that might be a tricky one. maybe file a bug report on https://gitlab.freedesktop.org/mesa/mesa/-/issues including card name, logs, and what software you're running when it happens
22:19 karolherbst[d]: would be cool to get more reviews on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36465 and https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36363
22:19 Solak: regarding the software, most of the time it occurs running applications that make more than average use of the card, like mplayer/vlc, geany and vice.
23:51 gfxstrand[d]: karolherbst[d]: I'll get to them
23:51 gfxstrand[d]: I've been distracted by Arm boards again. :blobcatnotlikethis:
23:52 HdkR: Speaking of ARM boards, Thor ordering is opening up. Just a cool $3500 to engage with it :P
23:53 gfxstrand[d]: Woof
23:53 gfxstrand[d]: Maybe I can get someone at NVIDIA to send me one.
23:53 gfxstrand[d]: Though I'd need the kernel to work on it first
23:53 HdkR: Apparently it is going to be using kernel 6.8, so quite old out of the gate.
23:54 gfxstrand[d]: woo
23:57 gfxstrand[d]: At least 6.8 has the new uAPI so in theory...
23:57 HdkR: :BlobSweat: