04:31 anholt: gfxstrand[d]: oof. I guess with reassociate, you can at least hope to see addresses that are base + iadd (con, div) and not have to go grubbing around too far in the add chain.
05:05 mangodev[d]: i hate myself
05:05 mangodev[d]: part of my lag issues were the fact that i had my mouse plugged into usb2 instead of usb3, and the usb2 port couldn't handle the 1000hz polling rate properly :/
05:05 mangodev[d]: the entire time i thought it was frame pacing
05:05 mangodev[d]: still partially is, but it's noticeably better
05:07 redsheep[d]: That's very weird, I've never had an issue with 1000hz on usb2
05:09 mangodev[d]: i think it could either be with my motherboard or my cpu
05:09 redsheep[d]: I would think it's more likely that the controller for the 2.0 port sucks. Regardless, glad to hear you found a partial fix
05:09 mangodev[d]: either way, it feels smoother on the usb3
05:09 mangodev[d]: still have some cursor issues, but it's good to see the constant stutter mostly gone
05:10 mangodev[d]: feels more consistent
05:10 mangodev[d]: i wish i could pinpoint where the cursor jitter bug happens though
05:10 mangodev[d]: but mesa project structure is just… :blobcatnotlikethis:
05:11 mangodev[d]: i tried looking for hardware sprite rendering stuff in zink, vk runtime, and nvk, but came up empty-handed
05:12 mangodev[d]: ~~took me 10 minutes to find how to search for code *in a project* in gitlab instead of searching *for projects*~~
05:12 redsheep[d]: Did the broken 32x32 hardware cursor ever get addressed? If it did you should be able to use the hardware cursor and might see no issue there
09:37 karolherbst[d]: I wished the static cycle counts would take loops into account 🙂
11:15 gfxstrand[d]: redsheep[d]: _lyude and I looked at it but I don't remember if she ever came up with patches and/or what happened to them.
11:42 karolherbst[d]: okay.. let's make NAK factor in loops with static cycle reporting, so opts like licm don't look horrible on paper 🙃
11:43 karolherbst[d]: NAK doesn't know the loop depth of a block, right? I wonder if I can just optimistically pass that on from nir and just assign a loop depth thing to each block...
11:44 karolherbst[d]: not sure it even makes sense...
11:44 karolherbst[d]: not sure how much NAK messes with the CFG
11:45 karolherbst[d]: though lowering CFG also throws this information away..
11:45 karolherbst[d]: result of shader-db + licm https://gist.githubusercontent.com/karolherbst/d58cf6ca1a52b5fa36271488ed6b61cd/raw/2e169d43868c928249e27a92c41cbab727e9e4d6/gistfile1.txt
11:46 karolherbst[d]: but it does speed things up
11:47 karolherbst[d]: or maybe we don't care about the stats there... yet, it'd be interesting to know those things
11:49 karolherbst[d]: mhh there is detect_loops
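A minimal sketch of the idea being floated here: weight each block's static cycles by its loop depth, so that hoisting work out of a loop (as licm does) shows up as a win instead of a wash. The `BlockCost` type, the `ASSUMED_TRIP_COUNT` constant, and the trip-count-to-the-depth weighting are all assumptions for illustration, not something NAK is known to do.

```rust
/// Hypothetical per-block summary: static cycles plus loop nesting depth.
#[derive(Clone, Copy)]
struct BlockCost {
    cycles: u64,
    loop_depth: u32,
}

/// Assumed iteration count per loop level; a real heuristic would need tuning.
const ASSUMED_TRIP_COUNT: u64 = 8;

/// Plain static cycle count: every block counts exactly once.
fn flat_cycles(blocks: &[BlockCost]) -> u64 {
    blocks.iter().map(|b| b.cycles).sum()
}

/// Loop-weighted count: a block at depth d counts ASSUMED_TRIP_COUNT^d times.
fn weighted_cycles(blocks: &[BlockCost]) -> u64 {
    blocks
        .iter()
        .map(|b| b.cycles * ASSUMED_TRIP_COUNT.pow(b.loop_depth))
        .sum()
}
```

With a 10-cycle preheader and a 20-cycle loop body, moving 4 cycles out of the body leaves `flat_cycles` at 30 either way, while `weighted_cycles` drops from 170 to 142.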
12:12 karolherbst[d]: I should upstream the bra.u stuff instead, because that's actually useful 🙃
13:18 karolherbst[d]: could fold a boolean iand into branch instructions, but...
13:19 karolherbst[d]: maybe should check in from_nir how often that happens
13:25 karolherbst[d]: for bra.u: `CodeSize: 922229056 -> 912026800 (-1.11%); split: -1.11%, +0.00%` nice
13:25 karolherbst[d]: well. bra.u with a upred source specifically
13:37 karolherbst[d]: ohh.. blackwell has a bra.u.any thing
13:37 karolherbst[d]: like all threads branch if any of them evaluates to true
13:40 gfxstrand[d]: We can figure out loop depth from the cfg
13:41 gfxstrand[d]: Is just how many times you can call `loop_header()`
13:44 karolherbst[d]: ahh
13:45 karolherbst[d]: apparently I'm now running into shaders where the upreds get spilled 🙃
13:46 karolherbst[d]: meaning I have to wire up bra.u with a non uniform pred as well *sigh*
13:48 gfxstrand[d]: gfxstrand[d]: You could even add a `loop_depth()` helper:
13:48 gfxstrand[d]: ```rust
13:48 gfxstrand[d]: fn loop_depth(&self, idx: usize) -> usize {
13:48 gfxstrand[d]:     let mut idx = idx;
13:48 gfxstrand[d]:     let mut depth = 0;
13:48 gfxstrand[d]:     loop {
13:48 gfxstrand[d]:         if let Some(hdr) = self.loop_header_index(idx) {
13:48 gfxstrand[d]:             depth += 1;
13:48 gfxstrand[d]:             idx = hdr;
13:48 gfxstrand[d]:         } else {
13:48 gfxstrand[d]:             return depth;
13:48 gfxstrand[d]:         }
13:48 gfxstrand[d]:     }
13:48 gfxstrand[d]: }
13:48 gfxstrand[d]: ```
13:49 karolherbst[d]: ahh cool, will play around with it
13:49 gfxstrand[d]: I'm typing
13:49 gfxstrand[d]: I haven't written NAK code in a minute
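For readers who want to try the depth-counting idea outside of NAK, here is a self-contained toy version. The `LoopInfo` type and its two tables are hypothetical stand-ins, not NAK's real CFG. How the real `loop_header_index()` treats a header block itself is not shown in the log, so the toy stores an explicit parent per loop instead of stepping through header indices.

```rust
/// Toy loop-nesting info; NOT NAK's real CFG type.
struct LoopInfo {
    /// For each block, the innermost loop containing it (if any).
    block_loop: Vec<Option<usize>>,
    /// For each loop, the loop that immediately encloses it (if any).
    loop_parent: Vec<Option<usize>>,
}

impl LoopInfo {
    /// Number of loops enclosing block `idx`.
    fn loop_depth(&self, idx: usize) -> usize {
        let mut depth = 0;
        let mut cur = self.block_loop[idx];
        // Walk outwards through the enclosing loops, counting each one.
        while let Some(l) = cur {
            depth += 1;
            cur = self.loop_parent[l];
        }
        depth
    }
}
```

The walk terminates as long as the parent table forms a forest, which is the property any real implementation would also have to guarantee.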
13:53 karolherbst[d]: my bra.u code is a disaster and I don't like it
13:56 karolherbst[d]: `Spills to reg: 67982 -> 68532 (+0.81%); split: -0.02%, +0.83%` 🙃 why
13:57 karolherbst[d]: but who cares: `Static cycle count: 222856984 -> 217707273 (-2.31%); split: -2.31%, +0.00%`
13:58 karolherbst[d]: some shaders just shrink by 12%.... impressive
13:58 karolherbst[d]: and like big ones
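The percentages in the stat lines above can be reproduced from the raw before/after numbers. A small sketch, with `percent_change` and `format_change` being hypothetical helper names and not part of any shader-db tooling:

```rust
/// Relative change from `before` to `after`, in percent.
fn percent_change(before: u64, after: u64) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}

/// Format with an explicit sign and two decimals, like the stat lines above.
fn format_change(before: u64, after: u64) -> String {
    format!("{:+.2}%", percent_change(before, after))
}
```

Feeding in the quoted numbers gives `-1.11%` for CodeSize, `+0.81%` for spills to reg, and `-2.31%` for static cycle count.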
14:13 karolherbst[d]: ohh shoooo.. Src::is_uniform() isn't doing what I think it's doing 🙃
14:19 gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36466
14:21 gfxstrand[d]: karolherbst[d]: Yeah, it's just checking the file
14:22 karolherbst[d]: any quick way to check if a Src is a UPred?
14:24 gfxstrand[d]: `src.is_predicate() && src.is_uniform()`
14:24 karolherbst[d]: mhhh
14:24 gfxstrand[d]: Which should return true for `SrcRef::True` and `SrcRef::False`.
14:24 karolherbst[d]: yeah, I want it to return false for those
14:25 karolherbst[d]: bra has a upred encoding similar to alus having a ureg encoding
14:25 karolherbst[d]: cursed part is.. it only adds a upred (and leaves the pred as it is as a normal pred)
14:26 karolherbst[d]: changes the opcode tho, but only if you want the upred
14:26 karolherbst[d]: but upred opcode doesn't support non .u
14:26 gfxstrand[d]: `src.src_ref.as_ssa().is_some_and(|s| s.file() == Some(RegFile::UPred))`
14:27 karolherbst[d]: mhhh I wished for something shorter, but I guess that's better than nothing
14:27 karolherbst[d]: maybe I just add a `is_upred()` helper
14:28 gfxstrand[d]: In the backend, that'd be `src.src_ref.as_reg().is_some_and(|s| s.file() == RegFile::UPred)`
14:28 gfxstrand[d]: karolherbst[d]: `is_upred_reg()`, please.
14:28 karolherbst[d]: okay
14:28 gfxstrand[d]: Otherwise, it's not clear that it does funky stuff with constants
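A toy illustration of what an `is_upred_reg()` helper would distinguish. The `RegFile`, `SrcRef`, and `Src` types below are heavily simplified stand-ins written for this example, not NAK's real definitions; only the behaviour discussed above (true for an actual UPred value, false for the `True`/`False` constants) is modelled.

```rust
/// Simplified register files; the real list is longer.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum RegFile {
    GPR,
    UGPR,
    Pred,
    UPred,
}

/// Simplified stand-in for a source reference: constants, SSA values, registers.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum SrcRef {
    True,
    False,
    Ssa(RegFile),
    Reg(RegFile),
}

struct Src {
    src_ref: SrcRef,
}

impl Src {
    /// True only for an actual UPred value (SSA or allocated register),
    /// never for the constant True/False sources.
    fn is_upred_reg(&self) -> bool {
        matches!(
            self.src_ref,
            SrcRef::Ssa(RegFile::UPred) | SrcRef::Reg(RegFile::UPred)
        )
    }
}
```

The `_reg` suffix in the name signals exactly this: constants are excluded even though they would be valid in a uniform predicate slot.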
14:28 karolherbst[d]: my laptop's DNS thing stopped resolving *.lan and now I'm 🙃
14:29 karolherbst[d]: :blobcatnotlikethis:: I mean
14:29 karolherbst[d]: right..
14:29 karolherbst[d]: ohh maybe it's the VPN taking prio again
14:33 gfxstrand[d]: But yeah if we can do `bra.upred`, that would be awesome. The spilling for `bra` sucks
14:33 karolherbst[d]: yeah..
14:33 karolherbst[d]: it's ampere+ only tho
14:35 gfxstrand[d]: Wait... for upred, we have to use the source and not the predicate?
14:35 gfxstrand[d]: What does the source do?
14:35 karolherbst[d]: the bra takes the branch if all sources (there is a pred and upred) and the guard predicates are true
14:36 gfxstrand[d]: Funky
14:36 karolherbst[d]: upred only with .U and its own opcode
14:36 karolherbst[d]: otherwise just a pred
14:36 karolherbst[d]: it's on all control flow instructions
14:36 karolherbst[d]: so you can fold a boolean iand in
14:36 karolherbst[d]: or upred and pred bool
14:36 karolherbst[d]: or upred and pred and pred
14:37 gfxstrand[d]: What does `.u` do if the source isn't actually uniform?
14:37 karolherbst[d]: .u just means either all threads branch (if they all agree) or none
14:37 gfxstrand[d]: So it's `.all`?
14:37 karolherbst[d]: hopper added a .any flag, where all threads branch if one evaluates to true
14:37 karolherbst[d]: yeah
14:38 karolherbst[d]: the .u doesn't stand for uniform, but unanimity
14:39 karolherbst[d]: anyway.. using it with a upred is trivial, it gets complicated if we want to be smarter about it
14:39 karolherbst[d]: even if the upred gets spilled to pred the .u remains valid
14:50 karolherbst[d]: there is another form with a UReg where the thread participates if its thread mask bit is set in the ureg, but it has some really weirdo semantics which are such a niche thing to use...
14:53 gfxstrand[d]: Yeah, I'm not too excited about that
14:53 gfxstrand[d]: I don't feel like re-implementing ACO on top of NAK
14:55 karolherbst[d]: heh
14:55 karolherbst[d]: it's not that tho
14:56 karolherbst[d]: it's related to convergence and stuff, but my brain doesn't really parse what it's useful for
14:57 karolherbst[d]: like the mask says which thread takes part in the branching _decision_ and then after the decision is reached, threads do something
14:57 gfxstrand[d]: weird
14:58 karolherbst[d]: there is a .CONV and .DIV flag.. let's see if ptx has it
14:58 karolherbst[d]: it doesn't
15:19 karolherbst[d]: gfxstrand[d]: one thing I'm wondering about is whether we should set .u on all the bras that always jump, but I have no idea 1. if nvidia does and 2. if it matters one way or the other
15:20 gfxstrand[d]: I doubt it matters and I would say no unless we have a good reason to do so
15:21 karolherbst[d]: but also not in the mood of wiring it up on each gen and test it 😄
15:22 gfxstrand[d]: Yeah, no need
15:23 gfxstrand[d]: If we ever decide we care about the any/all behavior, I can plug in some cards and test. But for now I'm fine with just hooking it up enough to get rid of the extra `UPred` -> `Pred` spills that are everywhere.
15:24 mohamexiety[d]: Well that’s interesting; the reproducible mmu fault on the hacky branch with the binding model CTS no longer reproduces
15:24 gfxstrand[d]: Though folding `vote.any` into `bra` might be cool.
15:24 mohamexiety[d]: So I am running deqp-runner on the compression MR
15:24 mohamexiety[d]: Let’s see if it blows up
15:24 mohamexiety[d]: I am guessing format modifiers will but beyond that hopefully things are fine
15:26 karolherbst[d]: gfxstrand[d]: it's just a hopper+ thing anyway
15:26 karolherbst[d]: .any that is
15:26 gfxstrand[d]: Yeah
15:26 gfxstrand[d]: You can do .all with two branches. 🤡
15:26 karolherbst[d]: though I'm surprised we can't do anything better on turing?
15:28 karolherbst[d]: there is a `PSETP` that takes an input upred
15:28 karolherbst[d]: but you use that one to convert
15:28 karolherbst[d]: could probably fold the psetp into the source
15:28 karolherbst[d]: or something
15:29 gfxstrand[d]: I'm not sure how useful folding an and in will be
15:29 gfxstrand[d]: I think it's probably more than zero usefulness
15:31 karolherbst[d]: probably similar to the ampere changes except you have more pressure on the pred file
15:36 mohamexiety[d]: mohamexiety[d]: Sadly it did blow up in 4 minutes. Bleeeeeh
15:38 karolherbst[d]: oof
15:38 karolherbst[d]: I'm also doing a CTS run, maybe I can beat your 4 minutes
15:39 karolherbst[d]: though the bra.u thing is something I expect to either blow up after 5 seconds or work without regressions 🙃
15:39 karolherbst[d]: `Pass: 13674, Skip: 14326, Duration: 1:24, Remaining: 2:21:55` good enough
15:40 mohamexiety[d]: [ 6202.521609] nouveau 0000:07:00.0: gsp: mmu fault queued
15:40 mohamexiety[d]: [ 6202.739316] nouveau 0000:07:00.0: gsp: rc engn:00000001 chid:8 gfid:0 level:2 type:31 scope:1 part:233 fault_addr:0000003fff800000 fault_type:00000002
15:40 mohamexiety[d]: [ 6202.739324] nouveau 0000:07:00.0: fifo:c00000:0008:0008:[Xorg[3029]] errored - disabling channel
15:40 mohamexiety[d]: [ 6202.739327] nouveau 0000:07:00.0: Xorg[3029]: channel 8 killed!
15:40 mohamexiety[d]: [ 6202.756424] nouveau 0000:07:00.0: Xorg[3029]: error fencing pushbuf: -19
15:40 mohamexiety[d]: last thing it wrote before it all froze up skeggsb9778[d]
15:40 mohamexiety[d]: (full parallel CTS with deqp-runner)
15:43 mohamexiety[d]: it's a bit interesting it mmu faults and kills xorg rather than e.g. one of the deqp instances :thonk:
15:47 karolherbst[d]: gfxstrand[d]: ohh I know what the second pred source can be useful for. If you predicate entire blocks, and they themselves contain an even smaller if/else, you could move the guard pred to a source one for the inner bras and end up with something like `p0 bra p1 L3`
15:47 mhenning[d]: karolherbst[d]: A while back I tried to get nvcc to use those flags and I didn't succeed
15:48 karolherbst[d]: mhenning[d]: yeah, I'm sure you only succeed if you already know what they are doing
15:48 karolherbst[d]: they might not even have it set up in the frontend
15:49 mhenning[d]: well, I guess my point is that if nvcc doesn't typically use them then I assume they're not all that important
15:49 karolherbst[d]: I think they could be helpful with workgroup collective functions or something weird?
15:50 karolherbst[d]: I'm 99.99999% confident it's entirely out of scope for anything vulkan allows 😄
15:53 mohamexiety[d]: #define NV_PFAULT_FAULT_TYPE_UNSUPPORTED_APERTURE 0x0000000a /* */
15:53 mohamexiety[d]: what does this mean
15:54 karolherbst[d]: aperture is uhm.. isn't it like something scanout related or even gart or both?
15:54 mohamexiety[d]: I am confused yeah. it's in context of something in the virtual mem management
15:55 karolherbst[d]: like aperture is this firmware framebuffer window or something
15:55 karolherbst[d]: or "system aperture" that is
15:56 karolherbst[d]: heh there seems to be a physical and virtual aperture thing
15:57 karolherbst[d]: `manuals/ampere/ga100/dev_pbdma.ref.txt` has some docs apparently
15:57 tagr: I think you'd get this unsupported aperture thing on Tegra's iGPUs in certain cases
16:00 tagr: and I guess also on the dGPU if the MMU isn't properly programmed
16:09 marysaka[d]: karolherbst[d]: getting this on `dEQP-VK.memory_model.transitive.coherent.atomic_fence.payload_local.physbuffer.guard_nonlocal.image.transvis` with the compression patches
16:09 skeggsb9778[d]: that can happen if a PTE kind not supported on sysmem is used
16:13 tagr: ah... I was wondering if maybe new aperture types had been added, because my recollection was that dGPU supported all 3 (or so) that existed
16:15 tagr: sysmem doesn't support things like compression, right?
16:15 avhe[d]: karolherbst[d]: i've seen that word used to mean mmio region in the tegra kernel
16:15 skeggsb9778[d]: yeah, that's what i'm suspecting is happening here
16:16 skeggsb9778[d]: karolherbst[d]: in this case it refers to vidmem, coherent sysmem or non-coherent sysmem
16:16 skeggsb9778[d]: ie. where the PTE points at
16:16 skeggsb9778[d]: "peer" is also a thing, but i don't know much about it
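A toy model of the rule being suspected in this exchange: a compressed PTE kind is only valid when the PTE points at vidmem, and using one with a sysmem aperture faults. The `Aperture`, `Pte`, and `Fault` types and the `check_pte` function are all hypothetical and heavily simplified. "Peer" is left out because the log says nothing about how it behaves.

```rust
/// Where a PTE points, per the three targets named above ("peer" omitted).
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Aperture {
    VidMem,
    SysMemCoherent,
    SysMemNonCoherent,
}

/// Hypothetical, heavily simplified PTE: a target plus a "compressed kind" bit.
struct Pte {
    aperture: Aperture,
    compressed_kind: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum Fault {
    UnsupportedAperture,
}

/// Models the suspected rule: compressed kinds are only valid in vidmem.
fn check_pte(pte: &Pte) -> Result<(), Fault> {
    match (pte.aperture, pte.compressed_kind) {
        (Aperture::VidMem, _) => Ok(()),
        (_, false) => Ok(()),
        (_, true) => Err(Fault::UnsupportedAperture),
    }
}
```

Under this model, a compressed buffer that ends up backed by system memory is enough to produce the fault, which fits the question of how sysmem is getting compression at all.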
16:16 mohamexiety[d]: yep, can repro. now the question is how is sysmem even getting compression because in theory in nvk we're explicitly marking it as VRAM only :thonk:
16:16 mohamexiety[d]: it shouldnt even be evictable
16:20 karolherbst[d]: also why is my turing slower than my ampere 😄
16:21 redsheep[d]: Isn't that business as usual?
16:21 karolherbst[d]: well apparently the ampere I have is theoretically twice as fast
16:22 karolherbst[d]: but it kinda looks almost equally fast compared to my turing and I got the impression that most games don't run faster on the ampere one
16:22 karolherbst[d]: it's kinda weird
16:23 redsheep[d]: What cards? We're talking games, and not your matrix stuff?
16:23 karolherbst[d]: RTX 6000 and RTX a6000 😄
16:24 karolherbst[d]: those are quadro ones
16:24 karolherbst[d]: and yeah
16:24 karolherbst[d]: I meant in games
16:24 karolherbst[d]: though it might also be that the ampere one just ran into nvidia driver issues
16:24 mohamexiety[d]: mohamexiety[d]: or not, I am a big dumdum :blobcatnotlikethis:
16:24 karolherbst[d]: maybe the hardware is bad, but dunno
16:26 karolherbst[d]: I think I've optimized enough out of those shaders anyway... I suspect the other big perf gaps are in regards to command submission or so...
16:26 karolherbst[d]: something odd is going on
16:30 redsheep[d]: I just checked over the specs cuz those are a bit different from the gaming configurations, those chips aren't hugely different in terms of specs really. The ampere only has like 15% more memory bandwidth and SMs
16:32 redsheep[d]: Got double the rated tflops and tensor and all that but lots of possible bottleneck scenarios have those performing very similarly
16:36 mohamexiety[d]: eh the 6000 is a Titan RTX with a lower power limit and the A6000 is a RTX 3090 with double the VRAM, lower power limit, and G6 instead of G6X
16:36 mohamexiety[d]: it should still be a ~ 30% or so difference in games
16:36 karolherbst[d]: well it doesn't really
16:38 redsheep[d]: mohamexiety[d]: Should be, but that's far from double and varies of course
16:38 karolherbst[d]: I think real world benchmarks show ~10%
16:39 redsheep[d]: That is a bit oddly low. Is this on nvk or Nvidia prop?
16:39 karolherbst[d]: nvidia prop
16:40 karolherbst[d]: I also tested on Cyberpunk 2077 a while ago and it was just faster on the turing
16:41 karolherbst[d]: lol
16:41 karolherbst[d]: https://www.gpu-monkey.com/en/compare_gpu-nvidia_rtx_a6000-vs-nvidia_rtx_6000_ada
16:41 karolherbst[d]: great in benchmarks, the only game "battlefield 5": same perf
16:41 karolherbst[d]: ehh wait
16:41 karolherbst[d]: wrong gpu
16:43 karolherbst[d]: anyway.. from my own testing in games it didn't really matter which one I choose
16:50 redsheep[d]: Kinda sounds like your a6000 isn't working right, bad paste or something
17:40 cubanismo[d]: Unless the games use ridiculous amounts of memory, they aren't going to be much faster with workstation cards.
17:43 karolherbst[d]: it's about comparing two work station cards 😄
17:43 karolherbst[d]: one the high end ampere one and the other the almost highest end turing one
17:43 karolherbst[d]: like one generation in between isn't gonna make much of a difference anyway
17:50 ermine1716[d]: Workstation cards = Quadro stuff?
18:06 mohamexiety[d]: yeah
18:07 HdkR: Except they don't call them Quadro anymore, which is just confusing.
18:08 mohamexiety[d]: cubanismo[d]: they were usually slower actually on windows. not sure when this changed but I know someone with the RTX 6000 (turing) and even on game ready drivers he wouldnt get any of the game specific optimizations or such so sometimes it ran a lot worse than GeForce. current workstation cards work fine though and get the optimizations (someone with a Pro 6000 Blackwell tested it)
18:09 mohamexiety[d]: HdkR: yeah. the evolution of quadro since turing :KEKW:
18:09 mohamexiety[d]: Quadro RTX 6000
18:09 mohamexiety[d]: RTX A6000
18:09 mohamexiety[d]: RTX 6000 Ada
18:09 mohamexiety[d]: RTX PRO 6000 Blackwell
18:11 kar1m0[d]: mohamexiety[d]: how about a pro max
18:11 mohamexiety[d]: best I can do is `RTX PRO 6000 Blackwell Max-Q Edition`
18:14 karolherbst[d]: HdkR: whatever they are doing, I hope it's better than quadro RTX 6000, quadro RTX a6000, quadro RTX 6000 ada 😄
18:14 karolherbst[d]: mohamexiety[d]: marketing people
18:14 karolherbst[d]: I loved when they used random letters like
18:14 karolherbst[d]: GT, or GTX or XTX
18:15 karolherbst[d]: XTX might be AMD
18:15 karolherbst[d]: ahh it was GTS
18:16 karolherbst[d]: `GeForce FX 5900 ZT`, `GeForce FX 5900 XT`, `GeForce FX 5900`, `GeForce FX 5900 Ultra`, `GeForce PCX 5900`, best time
18:16 karolherbst[d]: it's in order of speed even
18:16 karolherbst[d]: no, actually the PCX one was the slowest
18:18 karolherbst[d]: they did this at some point: `6800 LE`, `6800 XT`, `6800`, `6800 GTO`, `6800 GS`, `6800 GT`, `6800 Ultra`, `6800 Ultra Extreme Edition`, 🙃
18:36 cubanismo[d]: karolherbst[d]: My bad, our marketing managed to throw me off yet again.
18:38 cubanismo[d]: I *think* RTX A6000 (as opposed to RTX 6000) is Ampere, not Turing though, so Ampere <-> Ada comparison. Having two "A" names right in sequence caused plenty of confusion as well since that tends to get baked into various internal naming schemes.
18:39 chikuwad[d]: techpowerup's database confirms that (RTX A6000 being ampere)
18:39 cubanismo[d]: None of us know which card is which. We just label and refer to them all by their 4-5 digit codename and have to go look them up in some complicated translator when end users report issues.
18:39 chikuwad[d]: GA102
18:39 cubanismo[d]: The external databases and wiki pages are easier to use, but the internal one is more accurate in corner cases.
18:39 chikuwad[d]: makes sense
18:40 karolherbst[d]: cubanismo[d]: yeah A stands for ampere
18:40 karolherbst[d]: there was RTX 6000 for Turing, and RTX 6000 Ada for Ada (obviously)
18:41 karolherbst[d]: anyway
18:41 karolherbst[d]: what's strange about RTX 6000 (turing) vs RTX a6000 is that the performance is kinda the same
19:09 notthatclippy[d]: Yeah, that sounds all sorts of wrong, considering the Ampere one also has a higher TDP too. Is the utilization % the same (and not single digits)?
19:09 karolherbst[d]: well it might also be just unoptimized drivers or something. The nvidia driver also managed to hang/crash the GPU constantly
19:09 karolherbst[d]: so maybe I should try again with like the newest drivers
19:11 karolherbst[d]: though the ampere one is twice as fast with coop matrix operations 😄
19:11 karolherbst[d]: I just know that at least in the past it didn't matter much for gaming
19:11 karolherbst[d]: or well they were kinda equally fast
19:17 marysaka[d]: skeggsb9778[d]: about the PTE faults, I cannot reproduce on my side with a surfaceless VKCTS run, not sure if that can help, I know that mohamexiety[d] was running stuff with X and on another display that isn't tied to the card
19:30 skeggsb9778[d]: i'm a bit out of date in these areas, but, how are surfaces shared between the vk driver and X/wayland?
19:31 skeggsb9778[d]: wondering if some care needs to be taken with compressed surfaces there
19:31 chikuwad[d]: VK_KHR_*_surface, if I'm not mistaken
19:31 chikuwad[d]: {wayland,xcb,xlib}
19:40 gfxstrand[d]: karolherbst[d]: Just to be clear, `exit` has a form that takes a pred source but not one that takes a upred source?
19:40 gfxstrand[d]: skeggsb9778[d]: dma-buf
19:40 gfxstrand[d]: with modifiers
19:40 gfxstrand[d]: so both drivers are aware of the compression
19:40 karolherbst[d]: gfxstrand[d]: it only has one form that always has the input pred, but yes, no upred
19:41 karolherbst[d]: `e.set_field(87..90, 0x7_u8); // TODO: Predicate` from sm70_encode.rs
19:41 karolherbst[d]: that thing
19:42 karolherbst[d]: the additional upred is a bra.u specific addition
19:42 karolherbst[d]: so in fact it's `p0 bra.u p1 up0` and the branch is taken if all of those are true for all threads
19:43 skeggsb9778[d]: gfxstrand[d]: that's what i thought. i didn't see anything in current patches to handle modifiers, though maybe i missed it / nvk handles it already
19:44 karolherbst[d]: which means if we ever end up with a `p0 bra.u up0` it's gonna have really really weird semantics, because even though up0 _might_ be true for all, if there is a single thread where p0 is false, it's gonna be not taken for all the threads
19:44 karolherbst[d]: I think...
19:44 karolherbst[d]: let me read it very very very carefully
19:45 chikuwad[d]: gfxstrand[d]: huh
19:45 chikuwad[d]: what do those extensions do then
19:47 gfxstrand[d]: skeggsb9778[d]: Well, we likely need something for that. In particular, we're going to need to force a dedicated allocation in the case where it's modifier + compressed. There's still a bit of sorting out that needs to be done around the image creation flow and compression.
19:47 karolherbst[d]: yes.. so for all _active_ threads, the branch is taken if _all three_ predicates do evaluate to true
19:48 karolherbst[d]: (for all threads)
19:48 gfxstrand[d]: Are the threads for which the instruction predicate is false considered active?
19:48 karolherbst[d]: yes
19:48 karolherbst[d]: active is more for things like killed threads
19:48 gfxstrand[d]: Okay. That's... odd.
19:49 karolherbst[d]: or threads not being MACTIVE
19:49 karolherbst[d]: yeah..... it's kinda a dangerous instruction if you aren't careful
19:50 karolherbst[d]: for the branch instructions there is no difference between the guard predicate and the input predicate(s)
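A small model of the `bra.u` behaviour as described in this log, for a hypothetical `p0 bra.u p1 up0`. It is a sketch of the stated semantics only, not something checked against hardware. The `Thread` struct and `bra_u_taken` are made-up names, and the upred is stored per thread purely for simplicity.

```rust
/// Per-thread state as seen by a hypothetical `p0 bra.u p1 up0`.
#[derive(Clone, Copy)]
struct Thread {
    active: bool, // false for e.g. killed threads
    guard: bool,  // instruction (guard) predicate, p0
    pred: bool,   // pred source, p1
    upred: bool,  // upred source, up0
}

/// Branch decision under the semantics described above: taken only if all
/// three predicates are true on every active thread. The guard is treated
/// like any other input, so a thread with a false guard still vetoes.
/// With no active threads this is vacuously true, a case the log does not discuss.
fn bra_u_taken(warp: &[Thread]) -> bool {
    warp.iter()
        .filter(|t| t.active)
        .all(|t| t.guard && t.pred && t.upred)
}
```

This makes the hazard concrete: one active thread with `p0` false is enough to keep the whole warp from branching, even when `up0` is true everywhere.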
19:54 gfxstrand[d]: I hate that I just installed Ubuntu on a computer just so I can install Ubuntu on another computer. Have I ever mentioned how much I hate working on Arm?
19:55 HdkR: That's just the life of working with ARM.
19:56 HdkR: Although NVIDIA Jetpack is particularly frustrating since it is very opinionated about working on only particular Ubuntu versions.
19:56 gfxstrand[d]: Yes. And what do you think I'm trying to use? :silvy_sweat:
19:56 HdkR: I made an assumption :P
19:57 mohamexiety[d]: skeggsb9778[d]: I completely forgot about modifiers so we currently dont handle that at all and it blows up. it's easy to fix though I think, I am just looking into the host memory compression stuff first since that's a bit problematic
20:00 avhe[d]: gfxstrand[d]: the worst thing is that their sdk manager tool only supports outdated versions of ubuntu
20:01 avhe[d]: last time i did it from a live usb, except their tool uses so much storage that it exhausted my 32gb of ram, so i also had to setup swap space
20:04 gfxstrand[d]: They have versions that run in docker except I can't get them to work.
20:04 gfxstrand[d]: I tried running in a VM but the flashing process involves rebooting the Tegra device which kills the USB pass-through and it doesn't have a long enough timeout for me to click buttons.
20:05 avhe[d]: i tried pass-through and couldn't get it working, that's why i settled for the live iso
20:05 gfxstrand[d]: So I pulled out my NUC and stuck 20.04 on it
20:06 avhe[d]: just yesterday i had to compile vulkan-loader because it's outdated vs the api version the jetson driver actually supports
20:06 avhe[d]: just ubuntu things
20:07 HdkR: You can also /technically/ manually flash it, but the documentation is so poor that doing so has a chance of just making things worse :D
20:08 gfxstrand[d]: Well, and this is an Xavier which is apparently fun for other reasons
20:09 avhe[d]: HdkR: on orin you must go through the sdk manager to install to an ssd
20:09 avhe[d]: manual install only works for sd cards
20:09 HdkR: Ouchies
20:43 mohamexiety[d]: does anyone know what `dEQP-VK.synchronization2.cross_instance.dedicated.* ` does?
21:16 mhenning[d]: i suppose that guessing "cross-instance synchronization" won't be helpful?
21:49 mohamexiety[d]: mhenning[d]: my guess is it shares images between VkDevices or such but I am not sure
21:49 karolherbst[d]: sooo.. I have a theory of what I'm seeing with the bra.u upred thing... Does the spiller in NAK prefer to spill values with a long value range? That at first sounds like a good thing to do for registers, but what if it's a pred spilled to a reg? Now there is a reg that lives really long being an actual pred and potentially hurting gprs? Though what I'm seeing is an increase in spills/fills not necessarily GPRs... maybe it just ends up picking a value that is used more often? Dunno tbh...
21:49 karolherbst[d]: does somebody want to take a look as well?
21:50 mohamexiety[d]: skeggsb9778[d]: do you know something about compression types? Was looking into wiring up modifiers and noticed that modifiers encode compression type. They’re listed here and I am not sure tbh which ones map to what
21:50 mohamexiety[d]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/nil/modifiers.rs?ref_type=heads#L87
21:51 karolherbst[d]: at least without bra.u upred, it would spill unspill around the branch... but it's still kinda weird
21:52 karolherbst[d]: not that it matters given there are no shaders with an actual increase in gpr and only a single shader with 4 more instructions 🙃
21:53 karolherbst[d]: but there is one spilling to memory more often and that's a bit weird...
21:53 mohamexiety[d]: We don’t really do anything special beyond just passing in compressible PTE kinds and the HW does everything automatically so it’s a bit interesting modifiers have a type
21:54 cubanismo[d]: Just ignore the other types
21:55 cubanismo[d]: The PTE flags are all "ROP" compression
21:55 mhenning[d]: karolherbst[d]: we try to avoid spilling values that will be used soon, which can mean longer lived values are spilled, yes
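The description above reads like a furthest-next-use choice: when something has to be spilled, pick the value whose next use is furthest away. A toy sketch of that selection, with `LiveValue` and `pick_spill` as hypothetical names; this is not NAK's spiller.

```rust
/// A live value and the distance (in instructions) to its next use.
struct LiveValue {
    name: &'static str,
    next_use: usize,
}

/// Furthest-next-use choice: spill the value that is needed latest.
/// Returns None when nothing is live.
fn pick_spill(live: &[LiveValue]) -> Option<&LiveValue> {
    live.iter().max_by_key(|v| v.next_use)
}
```

A long-lived value that is not needed for a while is the natural victim under this rule, which is why long live ranges tend to get spilled.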
21:55 karolherbst[d]: mhenning[d]: how easy is it to change the heuristics just for preds?
21:56 mohamexiety[d]: cubanismo[d]: Alright, thanks a lot!
21:56 mhenning[d]: I'm not convinced we want to change the heuristics just for preds?
21:56 karolherbst[d]: mhenning[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36465
21:56 karolherbst[d]: see the stats
21:56 mhenning[d]: But it's not too hard. Each register file is a different spilling pass
21:57 karolherbst[d]: like this change literally skips moving a upred to a pred, and now I see more spills/fills to regs?
21:57 karolherbst[d]: that's... kinda odd
21:57 mhenning[d]: I don't consider a (+0.01%) increase in spills to memory to be that big of a deal
21:58 karolherbst[d]: ohh I'm more curious about the spills to regs
21:58 mhenning[d]: karolherbst[d]: spills/fills are only the ones generated by the register spiller
21:58 mhenning[d]: so yeah, if you don't move it then the spiller might move it instead
21:58 mhenning[d]: and that's more spills
21:58 karolherbst[d]: ahhhh
21:58 karolherbst[d]: I see
21:59 mhenning[d]: tbh I don't consider "Spills to reg" to be very important
21:59 karolherbst[d]: so what I'm seeing there is mostly some of the benefits to CodeSize turned into spills/fills from/to regs
22:00 mhenning[d]: yeah, Static cycle count and Code Size are both a lot more important than spills/fills from/to regs
22:00 karolherbst[d]: I see
22:00 karolherbst[d]: yeah not saying it should block the MR at all
22:00 karolherbst[d]: it just looked curious
22:02 mhenning[d]: Yeah, I wouldn't worry about it
22:02 karolherbst[d]: yeah.. it's barely any shader with a real negative impact
22:03 karolherbst[d]: one with 2 more instructions, one using 4 bytes more SLM
22:03 karolherbst[d]: there are some with static cycle being worse, but instruction scheduling should take care of that, because that one was always flaky
22:06 karolherbst[d]: it's a bit sad this is only for ampere+
22:09 gfxstrand[d]: Does nouveau.ko work on orin?
22:10 karolherbst[d]: gfxstrand[d]: should I just delete all of BranchOp for now? Might be for the best because it's just pointless if we don't actually care
22:14 i509vcb[d]: I'm not sure Orin has been implemented in nouveau.ko in the first place
22:14 gfxstrand[d]: You could, for sure
22:18 gfxstrand[d]: I hate everything! :blobcatnotlikethis:
22:18 gfxstrand[d]: The good news about Xavier is that it actually does UEFI so installing isn't quite as much of a mess as it looks like. The bad news is that no nouveau.
22:18 gfxstrand[d]: Orin same
22:19 gfxstrand[d]: Kepler and Maxwell ones have nouveau but the overall board situation is trash
22:19 gfxstrand[d]: 😭
22:20 karolherbst[d]: yeah....
22:20 i509vcb[d]: Yeah I tried to get Orin working but found I'd be practically rewriting the clock setup code
22:21 gfxstrand[d]: 😭
22:21 gfxstrand[d]: Maybe I should just go back to hacking on Switches
22:21 gfxstrand[d]: Or give up on arm again. 🙃
22:26 karolherbst[d]: heh
22:31 mohamexiety[d]: Definitely give up on arm, risc-v will make it irrelevant anytime now! (Fact checked by true armchair forum experts)
22:33 gfxstrand[d]: But really, the delta between where we are now and decent NVK Tegra support isn't huge. Not in userspace, anyway. In fact I've got most of it typed. I just can't test it and finish off the last few bits until I get a board to boot and bring up nouveau.
22:34 mohamexiety[d]: Yeeeah ;-;
22:35 sonicadvance1[d]: Would sure be nice if they just did ACPI boot with normal Arm64 server images. Gotta wait for Thor for that :headempty:
22:36 karolherbst[d]: is thor even a _real_ tegra 😛 I thought they just streamlined stuff and it's literally a normal nvidia dGPU
22:37 sonicadvance1[d]: nvlink-c2c instead of pci, must be Tegra. Gotta get me those Tegra Grace SoCs 😛
22:37 mohamexiety[d]: If Thor boots with normal generic images it will be a bit funny how NV beats QC to this
22:55 karolherbst[d]: okay.. now licm ..
23:00 gfxstrand[d]: I'm installing on Xavier right now off a thumb drive
23:00 gfxstrand[d]: Stock aarch64 boot iso works.
23:01 gfxstrand[d]: The only problem is that you need a serial console because display doesn't work.
23:01 gfxstrand[d]: But the rest of the system comes up, network and all
23:05 karolherbst[d]: mhhhh
23:05 karolherbst[d]: is display supposed to work? not sure how much the uefi stuff works there tbh
23:35 karolherbst[d]: mhenning[d]: what's holding back the instruction scheduling pass atm?