00:15gfxstrand[d]: Sorry if I'm getting a little ranty. I think that means it's weekend time.
00:53redsheep[d]: gfxstrand[d]: I don't think anyone here will ever complain about you explaining what you're thinking. Enjoy your weekend!
00:53anarsoul: gfxstrand[d]: time to open the fridge and get a beer!
01:00gfxstrand[d]: anarsoul: Ooh! I think I have a really good blackberry cider in there.
01:02redsheep[d]: gfxstrand[d]: That sounds good, I should go buy some of those
01:02esdrastarsis[d]: gfxstrand[d]: On which gpu?
01:35snowycoder[d]: redsheep[d]: If it was an hour-long blog post I would gladly read it all
01:47gfxstrand[d]: esdrastarsis[d]: RTX 4060
01:48gfxstrand[d]: snowycoder[d]: Maybe one day, once we land on a final solution.
01:53orowith2os[d]: nobody said it had to be *one* blog post. Two or more hour-long blog posts, anybody?
01:55gfxstrand[d]: 🤣
09:48karolherbst[d]: gfxstrand[d]: yeah.. 2. is kinda what codegen did, though I suspect there are bugs in this area 🙃
12:40karolherbst[d]: gfxstrand[d]: fyi https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33688
12:42karolherbst[d]: there is a `rust_2024_compatibility` lint to enable all of those automatically, but distros will hate us for it 🙃
12:50asdqueerfromeu[d]: karolherbst[d]: Why are unsafe blocks used in unsafe functions though? 🤔
12:50karolherbst[d]: so you don't forget your `// SAFETY:` comments
12:51karolherbst[d]: usually the outer function is unsafe because you call into unsafe stuff you can't guarantee locally, also it makes the compiler warn/error on other unsafe things you are doing you aren't aware of
12:52snowycoder[d]: When you write unsafe functions 80% of the time only part of the function requires unsafe, the rest would benefit from compiler safe-checks
12:54snowycoder[d]: I've sometimes seen an unsafe function that calls a safe function with unsafe code inside to get around that, but this language-level support is much better
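A minimal sketch of the pattern being discussed, assuming the lint in question is `unsafe_op_in_unsafe_fn` (that is my reading of the conversation, not something the log states). `sum_raw` and `sum` are made-up functions for illustration and do not come from Mesa.

```rust
/// Sums `len` values starting at `ptr`.
///
/// # Safety
/// `ptr` must be non-null, aligned, and valid for reads of `len` `u32`s.
#[deny(unsafe_op_in_unsafe_fn)]
unsafe fn sum_raw(ptr: *const u32, len: usize) -> u32 {
    // SAFETY: the caller promises `ptr` is valid for `len` reads.
    let slice = unsafe { std::slice::from_raw_parts(ptr, len) };

    // Everything below is ordinary safe code and is checked as such.
    slice.iter().fold(0u32, |acc, &v| acc.wrapping_add(v))
}

/// Safe wrapper: the slice itself proves the pointer/length pair is valid.
fn sum(values: &[u32]) -> u32 {
    // SAFETY: `as_ptr()` and `len()` come from the same live slice.
    unsafe { sum_raw(values.as_ptr(), values.len()) }
}
```

With the lint denied, removing the inner `unsafe { ... }` block in `sum_raw` turns the `from_raw_parts` call into a compile error, which is what forces each unsafe operation to carry its own `// SAFETY:` comment.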
12:55snowycoder[d]: Unrelated question,
12:55snowycoder[d]: `CBuf` means "constant buffer" right?
12:57karolherbst[d]: yes
15:13gfxstrand[d]: karolherbst[d]: If you want to drop that list into NIL, NAK, and the compiler common code, I would ack.
15:14gfxstrand[d]: snowycoder[d]: Yes
15:15gfxstrand[d]: karolherbst[d]: Yeah, Mel has patches for 1. 2 is what codegen did and probably part of why it can't RA in SSA and 3 is what I just did in NAK yesterday.
15:15karolherbst[d]: you make it sound like codegen can RA outside of SSA 🙃
15:16karolherbst[d]: there are awful bugs in there.. oh well
15:16snowycoder[d]: Tld "mask"? It seems to be a bitmask for which components should be loaded, but it's never printed in DisplayOp (same with tld4, txq, tld...)
15:20karolherbst[d]: though codegen did the predicate stuff more like 1. I think... I don't think UNION is used on predicates... maybe they are, but the thing is, the if/else -> predicate opt was done after RA, anyway. I hope you'll find a good solution to this problem
15:22gfxstrand[d]: snowycoder[d]: Yeah, that's what it is. It's missing in the printer
15:22gfxstrand[d]: That's just a bug in the print code.
15:24gfxstrand[d]: karolherbst[d]: I'm pretty happy with what I typed out yesterday. It actually ended up being less invasive than I thought and Rust caught most of the refactoring issues for me.
15:25karolherbst[d]: you just treat the predicate as another source?
15:25karolherbst[d]: more or less?
15:27gfxstrand[d]: The predicate itself is a special source. It doesn't have the full generality of the other sources but whatever. NAK has had that since the dawn of time.
15:29gfxstrand[d]: What I did yesterday is handle predicated destinations. In SSA, each `SSARef` is paired with an `Option<SSARef>` source, which is the value taken if the predicate is false. So the predicated things chain together, with each one possibly replacing the value. This is still totally valid SSA and RA just ensures that the two values end up in the same register.
15:30karolherbst[d]: mhhhhh, interesting
15:31gfxstrand[d]: If `prev == None` the previous value is effectively an undef.
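A toy model of the predicated-destination idea described above, as a sketch only. `Ssa`, `PredDst`, `Env`, `exec_predicated`, and `chain` are hypothetical names invented here; they are not NAK's actual types. A missing map entry stands in for an undef value.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an SSA value name (not NAK's real `SSARef`).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Ssa(u32);

/// A predicated destination: `dst` is defined by the instruction, and
/// `prev` names the value `dst` takes when the predicate is false.
struct PredDst {
    dst: Ssa,
    prev: Option<Ssa>,
}

/// Values known so far; a missing entry models an undef.
type Env = HashMap<Ssa, u32>;

/// "Executes" one predicated instruction that would write `result`.
fn exec_predicated(env: &mut Env, pred: bool, d: &PredDst, result: u32) {
    let value = if pred {
        Some(result)
    } else {
        // Predicate false: fall back to `prev`; `None` means undef.
        d.prev.and_then(|p| env.get(&p).copied())
    };
    if let Some(v) = value {
        env.insert(d.dst, v);
    }
}

/// Chains two predicated writes: v1 = @p 10, then v2 = @q 20 (else v1).
fn chain(p: bool, q: bool) -> Option<u32> {
    let mut env = Env::new();
    exec_predicated(&mut env, p, &PredDst { dst: Ssa(1), prev: None }, 10);
    exec_predicated(&mut env, q, &PredDst { dst: Ssa(2), prev: Some(Ssa(1)) }, 20);
    env.get(&Ssa(2)).copied()
}
```

Every value is still defined exactly once, so the program stays in SSA form; the only extra constraint is the one mentioned in the log, that `dst` and `prev` must land in the same register.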
15:33karolherbst[d]: guess you can take phi nodes and create those pairs and then propagate it through the dependencies?
15:33gfxstrand[d]: If we want to do something like codegen with a sort of phi node or if we just want an optimization that tries to get rid of `sel`, we can. It'll just optimize to that.
15:34karolherbst[d]: mhhh
15:35karolherbst[d]: I think the idea with sel sounds like a good plan
15:36gfxstrand[d]: Like
15:36gfxstrand[d]: op1 a b c
15:36gfxstrand[d]: op2 x y z
15:36gfxstrand[d]: sel q p a x
15:36gfxstrand[d]: could be optimized to
15:36gfxstrand[d]: @p op1 a b c
15:36gfxstrand[d]: @!p op2 q?a y z
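The rewrite above can be sanity-checked with a small sketch. This assumes `q?a` means "destination `q`, keeping the value of `a` when the predicate is false" (my reading of the notation), and uses arbitrary stand-in arithmetic for `op1` and `op2`.

```rust
/// Stand-ins for the two arbitrary ops in the example.
fn op1(b: u32, c: u32) -> u32 {
    b.wrapping_add(c)
}

fn op2(y: u32, z: u32) -> u32 {
    y.wrapping_mul(z)
}

/// Unpredicated form: run both ops, then `sel q p a x`.
fn with_sel(p: bool, b: u32, c: u32, y: u32, z: u32) -> u32 {
    let a = op1(b, c);
    let x = op2(y, z);
    if p { a } else { x }
}

/// Predicated form: `@p op1 a b c` followed by `@!p op2 q?a y z`.
fn with_predication(p: bool, b: u32, c: u32, y: u32, z: u32) -> u32 {
    // `a` is only written when `p` is true; otherwise it is undef (None).
    let a = if p { Some(op1(b, c)) } else { None };
    // `q` is written by op2 when `!p`, and otherwise keeps the value of `a`.
    let q = if !p { Some(op2(y, z)) } else { a };
    q.expect("q is defined on both paths")
}
```

Both forms produce the same `q` for either value of `p`, but the predicated form executes only one of the two ops per thread and needs no separate `sel`.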
15:36karolherbst[d]: kinda want to do the if/else -> sel conversion in nir... I think? Needs to analyze to find side-effects or common expressions or whatever (e.g. if the start and end of both paths is different, but they share a bit in the middle)
15:37karolherbst[d]: I see a lot of funky cfg based opt passes to make this tractable 🙃
15:37karolherbst[d]: could also split an if/else tree into several ones to extract common code
15:38gfxstrand[d]: Depends on how much we're trying to flatten
15:38karolherbst[d]: yeah..
15:38gfxstrand[d]: I'm not nearly as scared of control flow as you seem to be.
15:38karolherbst[d]: heh
15:39karolherbst[d]: it matters for perf tho
15:39gfxstrand[d]: But we can't schedule shit right now because every memory load we get out of VKD3D is in its own if.
15:39karolherbst[d]: yeah... I think it's fair to focus on bound checks for now and make it more general later
15:39karolherbst[d]: but that optimization matters quite a bit
15:39karolherbst[d]: the flattening in general
15:52gfxstrand[d]: I don't think it matters nearly as much on Maxwell+
15:53gfxstrand[d]: But I'm sure control flow and warp barriers have a real cost.
15:54gfxstrand[d]: Like, just flattening SSBO loads gave me 2 more FPS in DA:TV and cleaning up address calculations gave me one more.
15:54gfxstrand[d]: And that's without a scheduler
15:54karolherbst[d]: yeah...
15:55karolherbst[d]: it's more costly the deeper you go, but avoiding non uniform jumps is important
15:55karolherbst[d]: though I guess in your case it was all uniform?
15:57karolherbst[d]: but yeah.. `bsync` kinda "preempts" the block if there are threads to wait on
15:58karolherbst[d]: well.. which in your case also shouldn't happen
15:58karolherbst[d]: unless the game likes to access OOB
15:59gfxstrand[d]: Assuming I haven't screwed up bsync too badly (I have a bit but only if you go to crazy levels of nesting), we should never diverge more than is absolutely necessary to implement the client's shader and we always reconverge whenever possible. We don't have a runaway divergence problem.
16:00gfxstrand[d]: That was the problem with Fermi and friends. Once you diverged, you were stuck with it and perf sucked.
16:00gfxstrand[d]: That's not a problem on Maxwell+
16:00karolherbst[d]: sure, but even then predicates are cheaper than branching, often, at least in the average case. I know that at least in codegen the flatten opt was always one of those giving the most performance
16:00gfxstrand[d]: Yes
16:01karolherbst[d]: also I don't see why it's different on fermi, because you converge all the same
16:02gfxstrand[d]: There are lots of reasons why predication is better than small branches. It is on basically all hardware, even if you're just doing both bits of math and then doing `sel`.
16:02karolherbst[d]: (well.. as long as you have enough stack that is)
16:02gfxstrand[d]: Maybe Fermi is okay
16:02gfxstrand[d]: Some of the older ones had a runaway problem. I don't remember where the cutoff is
16:02karolherbst[d]: the stuff on volta+ is really not all that different to previous gens, it's just moving the stack into something being more like registers
16:03karolherbst[d]: mhh though the details did change from time to time
16:03karolherbst[d]: and for more complex shaders you had to spill to VRAM, which sucks of course
16:04karolherbst[d]: might be that the runaway problem was mostly avoiding the cost of spilling I'd assume
16:04karolherbst[d]: so you rather not converge all the time, because using VRAM for the stack would suck even more
16:05gfxstrand[d]: 🤷🏻♀️
16:05gfxstrand[d]: In any case, predication is good
16:05karolherbst[d]: yep
16:05karolherbst[d]: sadly it's such an annoying thing for RA and SSA and stuff 😄
16:07karolherbst[d]: though I think optimizing sel to predicates might not always be better
16:08gfxstrand[d]: Sorry. People worrying too much about runaway divergence is one of my pet peeves in graphics programming. Nvidia made some kinda shitty hardware back in the early GL2 / D3D9 days and now that's stuck in everyone's heads. People do some truly absurd nonsense to avoid branching in shaders and I'm like, "It's really not that bad, guys."
16:09karolherbst[d]: heh
16:09karolherbst[d]: I agree that app developers shouldn't care as much, because compilers can be smart instead
16:10karolherbst[d]: though pre nv50 hardware had funky limitations
16:11gfxstrand[d]: Game developers get very surprised when I tell them that Intel has always reconverged since forever and AMD literally can't diverge. Breaks their brains. 😂
16:11snowycoder[d]: gfxstrand[d]: Back in the days when I was learning opengl with my dad we used some crazy multiplications to avoid writing an if in the shaders, so yeah, it's "common knowledge"
16:11karolherbst[d]: gfxstrand[d]: 🙃
16:12karolherbst[d]: yeah...
16:12karolherbst[d]: anyway.. predicates in themselves are spensy, it's absurd
16:12gfxstrand[d]: snowycoder[d]: Yeah, it's so common that people just take it as fact.
16:12karolherbst[d]: it's mostly scheduling latency which is just high tho
16:12gfxstrand[d]: I should write a blog about control flow
16:13gfxstrand[d]: Something to go along with my "descriptors are hard" but for control flow.
16:14snowycoder[d]: It would be pretty awesome to know what divergence really is. I've taken a basic CPU architecture course and I was mindblown when I heard that nvidia can diverge branches
16:14karolherbst[d]: well....
16:15karolherbst[d]: about that
16:15karolherbst[d]: CPUs are getting funky themselves there
16:15karolherbst[d]: and modern CPUs reach the point where avoiding branches makes your code go faster 🙃
16:15karolherbst[d]: as in "calculating both branches and select" can be cheaper than branching
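A small sketch of what "calculating both branches and select" looks like on the CPU side. The function names and the arithmetic on each side are invented for illustration. Whether the select form is actually faster depends on the CPU and on what the compiler does with either version, so treat this as a picture of the idea rather than a performance claim.

```rust
/// Branching version: the CPU has to predict which way `cond` goes.
fn pick_branchy(cond: bool, a: u32, b: u32) -> u32 {
    if cond {
        a.wrapping_mul(3)
    } else {
        b.wrapping_add(7)
    }
}

/// "Compute both and select": both sides are evaluated unconditionally and
/// a mask picks the result, so the source has no data-dependent jump.
fn pick_select(cond: bool, a: u32, b: u32) -> u32 {
    let taken = a.wrapping_mul(3);
    let not_taken = b.wrapping_add(7);
    // All ones when `cond` is true, all zeros otherwise.
    let mask = (cond as u32).wrapping_neg();
    (taken & mask) | (not_taken & !mask)
}
```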
16:15snowycoder[d]: Well, yep, I studied all about pipelining and branch prediction
16:16karolherbst[d]: yeah...
16:16karolherbst[d]: it's only getting worse
16:16snowycoder[d]: coff. spectre.
16:16karolherbst[d]: well that as well, but also on the hardware level
16:17gfxstrand[d]: snowycoder[d]: https://www.collabora.com/news-and-blog/blog/2024/04/25/re-converging-control-flow-on-nvidia-gpus/
16:17snowycoder[d]: gfxstrand[d]: 0_0
16:17snowycoder[d]: I completely missed this, thanks!
16:18gfxstrand[d]: There was also a great talk at Vulkanised last week. It'll be a few weeks before the videos go up on YouTube, though.
16:18mhenning[d]: gfxstrand[d]: all the way back to nv50, the hardware can fully reconverge. there are some artificial limitations in codegen, but that's not a hardware issue
16:20gfxstrand[d]: karolherbst[d]: Branch prediction is really frickin' hard. Get too many wrong and, yeah, you're toast. The Pentium 4 was so bad that they added hyperthreading to fix it.
16:21karolherbst[d]: yeah...
16:21karolherbst[d]: sadly it's required to get impressive IPC numbers
16:22karolherbst[d]: ~~branch predictors for GPUs when~~
16:23gfxstrand[d]: And for a primarily scalar architecture, you kind of have to. It's the only way to make pipelining work. With a SIMT architecture, you have a lot more ways you can hide your pipeline.
16:23snowycoder[d]: fun read about branch prediction going wrong: https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array
16:23karolherbst[d]: I wonder if branching on a uniform register makes any difference...
16:24gfxstrand[d]: karolherbst[d]: I don't know if you can. That's an area I'd like to dig into a bit. Right now we always spill to a vector predicate to branch.
16:24karolherbst[d]: wild question: are you converging around uniform jumps?
16:24mhenning[d]: snowycoder[d]: I really like Matt Pharr's blog posts on ispc, which talks about implementing a gpu-like programming model on x86 cpus https://pharr.org/matt/blog/2018/04/18/ispc-origins
16:24mhenning[d]: karolherbst[d]: no
16:24gfxstrand[d]: karolherbst[d]: Not if everything stays uniform
16:25gfxstrand[d]: There's no reason to. Not unless the hardware does mean things like forcing a divergence at memory loads.
16:25karolherbst[d]: ohh right.. the `CALL` instruction was funky there
16:25karolherbst[d]: supports all the things as call targets
16:26gfxstrand[d]: I haven't scrubbed through the fuzzed isa to see if there's a `ubra`. It would be nice...
16:26karolherbst[d]: bra only takes uniform regs as non constant targets
16:27karolherbst[d]: uniform reg + constant offset
16:27gfxstrand[d]: Yeah
16:27karolherbst[d]: but call is funky...
16:27gfxstrand[d]: Uniform registers are annoyingly limited on Nvidia
16:27karolherbst[d]: no float ops?
16:27mhenning[d]: there's a flag on branch instructions called uniform or something, but I can't get cuda to generate it so I didn't bother wiring anything up
16:28gfxstrand[d]: We should probably be more conservative in when we use them
16:28karolherbst[d]: mhenning[d]: it's something else
16:29karolherbst[d]: the flags only matter with the input predicate
16:29gfxstrand[d]: karolherbst[d]: No float ops. No uniform branch instruction.
16:29gfxstrand[d]: The only reason uniform predicates exist is to predicate uniform things.
16:29mhenning[d]: oh, do you mean a branch with a upred?
16:29karolherbst[d]: it's like vote
16:29karolherbst[d]: just built into bra
16:30karolherbst[d]: you can make all threads jump independently, or when they all have the same value, etc...
16:30gfxstrand[d]: mhenning[d]: Yeah. I would love a bra with a upred.
16:30karolherbst[d]: it's funky stuff
16:31gfxstrand[d]: Most vector ALU can take uniform sources, even float ops, so that's not too bad. But the moment you hit a uniform branch, we have to insert an op to splat it out.
16:31gfxstrand[d]: Branches and memory ops are the annoying ones there.
16:31mhenning[d]: There are some things along those lines that we aren't using yet. eg LDG on ampere+ has an address format that's constant+ureg+reg
16:31karolherbst[d]: there is some weird interaction with the guard predicate, the input predicate and the masks of active threads
16:32karolherbst[d]: and the .U, .CONV and .DIV flags influence the behavior there
16:32gfxstrand[d]: mhenning[d]: Yeah, figuring that out might be nice. I see a lot of ureg being copied just so we can feed it into ldg
16:32karolherbst[d]: ohh.. and apparently the input is a thread mask 🙃
16:34mhenning[d]: We also do some really silly things right now eg. I've seen us feed a ureg as the input to `r2ur`. I suspect teaching copy-prop about some cases like that could shave off some instructions
16:36karolherbst[d]: anyway.. the funky bits of bra are basically VOTE folded into the bra
16:36karolherbst[d]: hope that helps to get nvidia to generate it 😄
16:37mhenning[d]: gfxstrand[d]: Looks like we have `0x547 0x0 bra_uniform_pred_`
16:38mhenning[d]: Not on turing, but it's there on ampere
16:38snowycoder[d]: mhenning[d]: Huh, do you have some test cases for that?
16:39mhenning[d]: I don't remember where I saw that, sorry
16:39snowycoder[d]: Don't worry, I'll try to reproduce :3
16:42mhenning[d]: One way you could do that is to just add a little code that panics in that case and run either cts or shaderdb to see what fails
16:45gfxstrand[d]: mhenning[d]: Oh neat!
16:46gfxstrand[d]: mhenning[d]: That can happen with spilling but I thought we did a decent job of propagating that away...
16:47mhenning[d]: I don't think so? I think copy-prop is somewhat naive in this case
16:48mhenning[d]: I've also seen plenty of places where we load a constant into a ureg and then just fail to copy-prop that constant into later instructions. My impression is that copy-prop is terribly naive with uregs right now
16:50gfxstrand[d]: Copy prop doesn't care about ureg vs reg
16:50gfxstrand[d]: But I don't know if it knows what r2ur is
16:51gfxstrand[d]: This is an area that definitely needs work, though.
16:51snowycoder[d]: reading the code, it handles r2ur only if the source reg is uniform
16:51snowycoder[d]: so if you're copying ur 2 ur
16:51gfxstrand[d]: gfxstrand[d]: Like, if copy prop couldn't propagate a ureg, you wouldn't get immediates in ALU basically ever.
16:55gfxstrand[d]: snowycoder[d]: That's because if we blindly propagate reg into sources of things that expect ureg, that causes problems.
16:56gfxstrand[d]: Also, blindly dropping r2ur when it's actually uniformizing is a behavior change. Typically, if it's app-desired behavior, it'll be a shfl, though.
16:57gfxstrand[d]: I'm sure there are cases where it's safe to drop, we just have to be careful with it.
16:58mhenning[d]: Maybe I'm misremembering some of the details, but I maintain that we have some pretty dumb cases here right now.
16:59gfxstrand[d]: Oh, I'm sure.
16:59gfxstrand[d]: We should start collecting them in issues and fixing them.
16:59mhenning[d]: Like, here's part of vkcube which is literally the first thing I tried:
16:59mhenning[d]: ur0 = mov rZ // delay=1
16:59mhenning[d]: ...
16:59mhenning[d]: r0 = mov ur0 // delay=6
16:59mhenning[d]: r6..8 r4..6 = tex.2d c[0x1][0x10] r2..4 r0 // delay=6 wr:0
16:59gfxstrand[d]: One thing I know we need to do is be more aggressive about zero
17:01snowycoder[d]: wait, there's nothing for opmov in copy-prop?
17:01gfxstrand[d]: Also, spilling needs to learn it can just rematerialize immediates and bound cbuf values and doesn't need to actually spill them. And maybe should even have a heuristic to prefer spilling (by which I mean rematerializing) them.
17:01gfxstrand[d]: snowycoder[d]: OpMov is basically never emitted except by copy lowering.
17:02gfxstrand[d]: And it has funky mask semantics
17:02gfxstrand[d]: We could but I opted to add OpCopy instead
17:04gfxstrand[d]: mhenning[d]: Right so that's legalization. Copy prop makes the `tex` use `ur0` and then legalize says, "no, you can't do that" and adds a copy.
17:04gfxstrand[d]: Legalize is good and important but there's lots of stuff that should probably just not do the illegal thing in the first place.
17:05mhenning[d]: gfxstrand[d]: Right, but copy-prop shouldn't make it use ur0, it should make it use rZ
17:05snowycoder[d]: That should be resolved by adding copies and mov to the copy-prop
17:06gfxstrand[d]: mhenning[d]: You can't stuff misc rZ into a tex op. You might be able to if the whole vector source is all zeros but not if it's just one component.
17:07mhenning[d]: It's not the vector source
17:07gfxstrand[d]: Right. In this case it isn't.
17:08mhenning[d]: Also, isn't that supposed to be legalize's problem, not copy-prop's problem?
17:09gfxstrand[d]: Sort of? The way we express the rules via source types could probably be improved. I'm not at all happy with how the non-ALU ones work today.
17:09gfxstrand[d]: But also, not everything can take rZ. 🫤
17:10gfxstrand[d]: Which is really annoying
17:10gfxstrand[d]: I'm not sure if it's an alignment issue where 255 can only be a scalar or if it's something more fundamental. I just know some things need a real gpr.
17:12karolherbst[d]: gfxstrand[d]: I think it's just that
17:12mhenning[d]: If it's an alignment issue, shaders that use less than the full number of regs could just encode something aligned and out of bounds
17:12mhenning[d]: iirc the semantics are that any out of bounds reg is zero
17:12karolherbst[d]: I think it still triggers an error, no?
17:13gfxstrand[d]: And part of the problem is that nothing but the SSA and Reg forms of SrcRef communicate a number of components. So if we start propagating zeroes into things that are theoretically vectors, we lose information. There's probably something we can do to sort that out, though.
17:13karolherbst[d]: though I think that can be configured
17:13mhenning[d]: karolherbst[d]: Oh, does it? Maybe I'm wrong
17:13karolherbst[d]: I mean.. more as in a warning printed on dmesg
17:13karolherbst[d]: but it wouldn't kill the shader
17:13karolherbst[d]: I _think_
17:13mhenning[d]: Right, but if it's a warning we still don't want to be triggering that all the time
17:14gfxstrand[d]: karolherbst[d]: Except FP64 ops can totally use rZ, even though it's not aligned.
17:14karolherbst[d]: yeah
17:14karolherbst[d]: mhhh
17:14gfxstrand[d]: I'm pretty sure this is special cased all over the hardware. 🙃
17:14karolherbst[d]: might only be the tex units
17:15karolherbst[d]: they also added that `.lz` flag 🙃 I'm sure it's something silly there
17:17gfxstrand[d]: 🤷🏻♀️
17:19karolherbst[d]: mhhhhhh
17:19karolherbst[d]: actually
17:19karolherbst[d]: sooo
17:19karolherbst[d]: what the hardware does is to align the register
17:20karolherbst[d]: so if you put in r255 in a fp64 op, it will access r254
17:20gfxstrand[d]: Maybe we should have a `SrcType::Tex` which means "is a vector and it should usually be GPRs except rZ is okay if it's all zeros"? 🤷🏻♀️
17:20karolherbst[d]: I _think_
17:20karolherbst[d]: might want to verify this
17:20karolherbst[d]: but I think the hardware just ignores the bits
17:21karolherbst[d]: let me check
17:21gfxstrand[d]: karolherbst[d]: 🤷🏻♀️ All I know is that FP64 on rZ works.
17:21karolherbst[d]: try allocating all registers... and uhm.. wait.. that's not possible anymore
17:21gfxstrand[d]: And I'm pretty sure I've seen the blob use it
17:21karolherbst[d]: r253 is max, no?
17:21karolherbst[d]: ehh r252
17:22karolherbst[d]: maybe it's legal since volta
17:22karolherbst[d]: because you know, it's never available
17:22gfxstrand[d]: I don't totally buy the explanation that rZ is implemented as just an OOB read. Do the docs actually explicitly say that or is it conjecture?
17:23karolherbst[d]: so RZ is used for unused tex parameters, but that much you already knew
17:24karolherbst[d]: anyway, yes
17:24karolherbst[d]: it's a tex specific thing
17:24karolherbst[d]: no RZ if the reg is used
17:25karolherbst[d]: surface ops also can't use RZ for the handle
17:26karolherbst[d]: RZ apparently never does an actual read
17:26karolherbst[d]: so I guess it sounds like more of a special case
17:27karolherbst[d]: but yeah.. don't know
17:31gfxstrand[d]: Yeah, so we could probably make the tex lowering smarter and avoid src2 entirely if it's not needed. Right now we always stick something in there.
17:31karolherbst[d]: at least it's well documented once you have access
17:31gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/compiler/nak_nir_lower_tex.c#L231
17:32karolherbst[d]: also how scalar works and all that
17:32gfxstrand[d]: That shouldn't be a hard fix
17:32karolherbst[d]: you'd need to know what arguments exist in hardware
17:32karolherbst[d]: because of implicit lods and all that
17:33gfxstrand[d]: Tex lowering was written in a hurry and I didn't have enough working to feel reliable just throwing stuff at the CTS.
17:33karolherbst[d]: yeah...
17:33karolherbst[d]: you want to wait until you got the NDA stuff sorted out and just read through it
17:33karolherbst[d]: it's different for every tex op, but you already know that
17:34karolherbst[d]: and you want to use `.SCR` anyway
17:34gfxstrand[d]: I think the NIR lowering is pretty good. At this point, I'd be willing to throw it at the CTS and trust the results, I think.
17:34gfxstrand[d]: .scr?
17:34karolherbst[d]: .SCR allows you to fill gaps, non .SCR has fixed positions for the argument types
17:34karolherbst[d]: scr == scalar
17:34gfxstrand[d]: Ah
17:35gfxstrand[d]: Yeah, we probably want that eventually
17:35karolherbst[d]: reduces register pressure, also it balances between the two sources
17:35karolherbst[d]: so you can use 2 vec2 instead of a single vec4
17:35snowycoder[d]: karolherbst[d]: Hold on, where? can you access documentation without signing NDAs?
17:35karolherbst[d]: you need an NDA
17:35gfxstrand[d]: You can't
17:35karolherbst[d]: but yeah, it's neatly explained
17:36gfxstrand[d]: Not for shader stuff, anyway. There's some Tegra docs you can get without it but they're mostly about memory layouts and display programming and stuff.
17:37snowycoder[d]: That's not a bad start, where can I find it?
17:37karolherbst[d]: like if you do a 1D load with an explicit lod, you can use two scalar registers
17:37karolherbst[d]: with .SCR
17:38karolherbst[d]: no need to align to vec4, so yeah, it's great
17:41gfxstrand[d]: Yeah, that's nice
17:42mhenning[d]: snowycoder[d]: A lot of the docs we have are here: https://github.com/NVIDIA/open-gpu-doc Mostly just names for things we can put in command buffers or for memory mapped registers. Doesn't cover the shader isa at all
17:43gfxstrand[d]: karolherbst[d]: If you want to use your magic docs and sort all that out, go nuts.
17:44gfxstrand[d]: The NIR lowering is pretty straightforward and there's no longer like 3 layers of lowering to sort through like with codegen.
17:44snowycoder[d]: mhenning[d]: ah yes, the classes headers
17:44mhenning[d]: Eg. https://github.com/NVIDIA/open-gpu-doc/tree/master/classes/3d has stuff for the 3d engine
17:44mhenning[d]: snowycoder[d]: Yeah, that's the main thing there. But there's also a grab bag of other stuff
17:44karolherbst[d]: gfxstrand[d]: yeah.... I don't think I'll find enough time anytime soon, but yeah...
17:45gfxstrand[d]: Fair
17:45karolherbst[d]: at least it's less of a mess than pre volta
17:45gfxstrand[d]: Gotta do the things that make rent.
17:45karolherbst[d]: scalar on maxwell was just pure agony
17:45gfxstrand[d]: karolherbst[d]: We can decide not to do things that are agony. 😁
17:46karolherbst[d]: the thing with maxwell was, that the encoding doesn't have enough bits
17:46karolherbst[d]: soo...
17:46karolherbst[d]: they became creative
17:46gfxstrand[d]: 🙃
17:46mhenning[d]: yeah good thing we don't support maxwell *ducks*
17:46karolherbst[d]: check out `isScalarTexGM107` if you want to see the fun
17:46karolherbst[d]: (and the emitter part)
17:47karolherbst[d]: tons of exceptions
17:47karolherbst[d]: they merged bits into a common thing and well...
17:47karolherbst[d]: a number means a certain combination of flags and dimension
17:47gfxstrand[d]: Woo
17:47karolherbst[d]: including the special lod flags
17:48karolherbst[d]: so there is e.g. only `TEXS.1D.LZ`
17:48karolherbst[d]: but not `TEXS.1D`
17:48karolherbst[d]: it's wild
17:48gfxstrand[d]: Womp womp
17:48karolherbst[d]: anyway... it helped with RA and using less registers...
17:48karolherbst[d]: volta+ doesn't have this insanity
17:48karolherbst[d]: it's just a different encoding
17:48gfxstrand[d]: Yeah, I'm sure it'll help some
17:51gfxstrand[d]: But yeah, the only thing really missing from NAK for that is another flag bit, code to use rZ when the source is missing, and some typing in the NIR pass.
17:52karolherbst[d]: do the nak instructions encode the sources as a high level thing or just a list of values?
17:52karolherbst[d]: `.SCR` works closer to how nir designs tex instructions
17:53karolherbst[d]: though it's kinda between having a fixed list and nir
17:54karolherbst[d]: there is a fixed order per instruction, but you only collect the sources you need and skip the others
17:54karolherbst[d]: and then check if you need scalars or vec2 or whatever
17:55karolherbst[d]: and I _think_ `.SCR` is limited to 4 sources in total
17:55karolherbst[d]: yeah..
17:55karolherbst[d]: 4 sources max, otherwise you fall back to the vec4 encoding
17:56karolherbst[d]: anyway.. I need to do grocery shopping 🙃
18:26gfxstrand[d]: I just kicked nak/tex-src at CTS
18:26gfxstrand[d]: I'm sure it'll blow up
18:26gfxstrand[d]: But we'll see
18:31gfxstrand[d]: Oh, yeah. It's blowing up everything. 🙂
18:32gfxstrand[d]: Nothing on dmesg, though. That's weird.
18:50redsheep[d]: gfxstrand[d]: Yeah it would be nice if games used branching just enough not to need an actual million shader variants that take up 30 GB of disk space and need to spend half an hour recompiling on every game update.
18:51karolherbst[d]: well that part actually makes sense
18:52karolherbst[d]: more code in the same shader, especially if it increases the amount of registers used does decrease performance
18:55redsheep[d]: Sure, I am not saying every game should be doing enormous shaders everywhere, but I have seen a few game devs saying that most of the stuff causing the huge numbers of variants should just be branches
18:55karolherbst[d]: well if it costs you 0.5FPS you might use multiple shaders
18:55gfxstrand[d]: Yeah, it's a trade-off. Uniform branches are generally fine if you don't blow up register pressure. Shader switching has a cost.
18:56redsheep[d]: Not saying have fewer shaders in your code necessarily, just to not cause them to compile 32768 times because your code could have had 15 branches but didn't
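The 32768 figure is just 2^15: if each of 15 independent on/off features is baked into the shader at compile time instead of being a runtime branch, every combination becomes its own variant. A tiny sketch of that arithmetic, with `variant_count` as a made-up helper name:

```rust
/// Number of precompiled variants when each of `toggles` independent on/off
/// features is specialized at compile time. `toggles` must be below 64.
fn variant_count(toggles: u32) -> u64 {
    1u64 << toggles
}
```

Turning even a few of those toggles into uniform branches halves the count each time, which is the trade-off being argued over here.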
18:56karolherbst[d]: you pay the cost of the worst case branch
18:56karolherbst[d]: always
18:56gfxstrand[d]: But hardware is generally very heavily optimized for shader switching and that's a cost you pay once per draw at most, not once per pixel.
18:56karolherbst[d]: GPUs don't rename registers
18:57gfxstrand[d]: karolherbst[d]: That's a little oversimplified but yes.
18:57karolherbst[d]: so if your shader ends up using 8 registers more, then you run fewer threads in parallel -> worse performance
18:57karolherbst[d]: yeah.. I'm oversimplifying. Meant it in terms of parallelism
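A back-of-the-envelope sketch of the register-pressure argument, under loudly stated assumptions: the register file is split evenly between resident warps, and the only limits are the register file size and a cap on resident warps. `warps_by_registers` and all of the numbers used with it are illustrative and are not the specs of any particular GPU; real hardware also has allocation granularity and other limits that this ignores.

```rust
/// Rough upper bound on resident warps, limited only by register use.
/// `regs_per_thread` and `threads_per_warp` must be non-zero.
fn warps_by_registers(
    regfile_regs: u32,
    threads_per_warp: u32,
    regs_per_thread: u32,
    max_warps: u32,
) -> u32 {
    // Registers are allocated per thread, so one warp needs this many.
    let regs_per_warp = regs_per_thread * threads_per_warp;
    (regfile_regs / regs_per_warp).min(max_warps)
}
```

With a hypothetical 65536-entry register file, 32-wide warps, and a cap of 64 warps, going from 32 to 40 registers per thread drops the bound from 64 warps to 51, which is the "8 registers more, fewer threads in parallel" effect in miniature.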
18:57gfxstrand[d]: Nvidia has a thing where we can reduce the register count mid-shader. I've never looked into how it works, though.
18:57karolherbst[d]: ohhhh.. that's new, isn't it?
18:57karolherbst[d]: do you know what it's called?
18:58gfxstrand[d]: Turing or Ampere, I think.
18:58karolherbst[d]: system value?
18:58gfxstrand[d]: I've seen it in the CUDA docs. Never investigated further.
18:59karolherbst[d]: mhh system values are read only..
18:59gfxstrand[d]: It would be good to do that if we can get into a low-key register region uniformly and prove we never leave it. It would help with ubershaders.
19:00gfxstrand[d]: It'll be annoying to plumb through RA, though.
19:00karolherbst[d]: mhhhhhh
19:01karolherbst[d]: is it part of the barrier stuff?
19:01karolherbst[d]: mhh
19:01karolherbst[d]: I can't find it, maybe it's super secret
19:01karolherbst[d]: maybe they just switch invocations in cuda 🙃
19:02karolherbst[d]: I know you can tell the compiler to cap the register usage tho
19:04karolherbst[d]: https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg
19:04karolherbst[d]: I found it
19:05redsheep[d]: Not sure on the technical details of how the prop drivers make it work well, but I remember a specific interview between an id engineer and digital foundry where he explained that they opted to branch a lot more in order to not end up with a stutter game that freezes to compile shaders all the time. If I remember right he was saying that it didn't hurt performance enough for the alternative to be worth it for them
19:05redsheep[d]: This was in regards to the development of doom eternal specifically
19:05karolherbst[d]: yeah..
19:06karolherbst[d]: that's what you do if you don't precompile
19:06gfxstrand[d]: karolherbst[d]: .inc looks scary! 👻
19:06karolherbst[d]: trap handler probably
19:06karolherbst[d]: at least that's where I'd guess you might need it.
19:07karolherbst[d]: redsheep[d]: dolphin emulator had this ubershader thing to kill stuttering, but it tanked perf a lot
19:07karolherbst[d]: if you compile shaders on the fly, then yes, you might want to branch more often
19:07karolherbst[d]: but other games just precompile everything
19:07redsheep[d]: I want to say the ubershaders in dolphin are much much larger than what the id engineer was advocating for
19:08redsheep[d]: Those hurt perf quite a bit. It's really impressive that those shaders work at all
19:09karolherbst[d]: yeah, but if you can precompile everything, because you don't compile shaders midgame, that problem goes away
19:09redsheep[d]: Anyway I think dolphin is still doing that, but now by default it just uses them to mask compilation stutter, not outright prevent compiling more
19:10gfxstrand[d]: redsheep[d]: The id folks have really optimized their shader compilation stuff. It's impressive. It was way less impressive in the Vulkan 1.0 days but they've figured some things out
19:11marysaka[d]: redsheep[d]: it's a stop gap to have something before the actual shader is compiled yeah
19:11karolherbst[d]: funky
19:11karolherbst[d]: `ptxas test.ptx, line 34; error : Argument 1 of instruction 'setmaxnreg.dec': value '8' out of range, expected to be in range [24..256]`
19:11karolherbst[d]: `ptxas info : (C7508) Potential Performance Loss: 'setmaxnreg' ignored; unable to determine register count at entry.`
19:11karolherbst[d]: anyway...
19:11karolherbst[d]: it's SM90a and newer
19:11karolherbst[d]: so blackwell?
19:11gfxstrand[d]: 🤭
19:12karolherbst[d]: anyway
19:12karolherbst[d]: it doesn't appear as an instruction, but my test ptx also doesn't use more than 8 regs 🙃
19:13karolherbst[d]: maybe it's some funky QMD level stuff
19:14redsheep[d]: karolherbst[d]: I believe some engines have quite a hard time telling what possible states there will be that they need to precompile against, so even if they wanted to wait around they couldn't effectively do so.
19:14karolherbst[d]: yeah...
19:14karolherbst[d]: if the specific case isn't costing you that much, then whatever
19:14karolherbst[d]: but there is a cost, unless everything is perfectly balanced
19:15redsheep[d]: https://tenor.com/view/thanos-balanced-gif-3226445978144001245
19:16tiredchiku[d]: it also helps id's case that their engine is built exactly for the kinds of games they make
19:16tiredchiku[d]: where a lot of engineering effort is put in to make the renderer efficient, which I feel doesn't happen enough with studios using more "general purpose" engines like UE
19:17karolherbst[d]: meanwhile sometimes I have to argue with people who say "well.. in the future applications just have to ship all GPU binaries, because compiling at runtime is just bad"
19:18redsheep[d]: That's a terrifying position
19:18karolherbst[d]: "just compile the package for each GPU arch, how many will there be at most? 100? sounds reasonable to me"
19:19karolherbst[d]: the worst part is, there are people serious about it
19:19tiredchiku[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1342938852888809582/veQBsO4.png?ex=67bb74c5&is=67ba2345&hm=167f1a32b82a9a1b21b9b59c6c36337f9c95dc6e21520ca4e2f47621bbe40a54&
19:19tiredchiku[d]: :cat_pray:
19:19tiredchiku[d]: I believe that also exists on windows now
19:19tiredchiku[d]: (I could be wrong, haven
19:19karolherbst[d]: impressive
19:20tiredchiku[d]: I have it disabled because I bounce between drivers, but it's nice for people who stick to a driver
19:20redsheep[d]: Also I wasn't trying to say branch always, do variants never. I just mean it would be good for devs to check when and where it's cheap and cut most of the variants if they can. Even if average fps is lower the perception of smooth gameplay has more to do with less stutter than good averages
19:21karolherbst[d]: well you don't really want to compile mid-game, so not many games actually have that issue
19:22karolherbst[d]: but anyway.. it's all a trade-off
19:22redsheep[d]: karolherbst[d]: Lots of games that do precompile well do it on the loading screens between levels. Cutting that time down is also worthwhile for players' enjoyment
19:22karolherbst[d]: oh for sure
19:23tiredchiku[d]: tfw open world
19:23karolherbst[d]: if you increase compilation time by 10x to get a 0.5% perf increase it probably isn't worth it, right
19:30airlied[d]: Shader pre cache is great for things like steam deck
19:31redsheep[d]: It's incredible on the deck. Shaders take so long to compile on that thing
19:32gfxstrand[d]: karolherbst[d]: If you listen to the phoronix forums, GPUs are all going to run SPIR-V soon.
19:33karolherbst[d]: gfxstrand[d]: well.. yeah
19:33karolherbst[d]: but I didn't mean it like that 😄
19:34gfxstrand[d]: I know
19:34karolherbst[d]: it's a hot take from HPC folks, I mean.. I get it for HPC use cases, but then they also want the same for linux desktops or something and it's a bit annoying sometimes
19:35gfxstrand[d]: But I don't think most people understand that GPUs make architectural changes to their ISA every generation. You can only really precompile on console. And even there, people want backwards compatibility so uh...
19:35karolherbst[d]: tell that to HPC folks
19:35gfxstrand[d]: Oh, I know
19:36gfxstrand[d]: I don't know what Sony and Nintendo are planning to do for that. I wouldn't be surprised if Nintendo has paid a pile of cash to get Ampere with some Maxwell cores attached.
19:36karolherbst[d]: yeah...
19:36redsheep[d]: gfxstrand[d]: It's really interesting how AMD made the rdna variant for the ps5 pro able to still run binaries compiled for rdna 2 well
19:36karolherbst[d]: it's going to be interesting
19:36karolherbst[d]: ohh wait
19:37karolherbst[d]: gfxstrand[d]: you need to recompile the games for the switch 2
19:37karolherbst[d]: unless they don't ship gpu binaries
19:37karolherbst[d]: sooo....
19:37gfxstrand[d]: 🤷🏻‍♀️
19:37gfxstrand[d]: If they require recompilation, that's easier.
19:37karolherbst[d]: but yeah.. games need to be made compatible and stuff
19:37gfxstrand[d]: Sony doesn't, AFAICT.
19:37karolherbst[d]: and it's tbd which games will run on the switch 2
19:38karolherbst[d]: mhhh
19:38redsheep[d]: gfxstrand[d]: I really doubt the ps6 will do what the pro did. In the interviews Mark Cerny seemed to make it pretty clear that stretching like that for compatibility was only something they intend for the pro variants
19:38karolherbst[d]: well the PS is also bigger than the switch
19:38gfxstrand[d]: Writing a Maxwell to Ampere transpiler should be possible if the shader uses less than 250 registers or so.
19:38karolherbst[d]: yeah.. but thing is, they decided to not do it at runtime either
19:39karolherbst[d]: some games just use GL or vulkan, so those are fine
19:39mohamexiety[d]: they did specify that not all games are going to be forwards compatible iirc
19:39karolherbst[d]: others need to be recompiled and tested afaik
19:39karolherbst[d]: mohamexiety[d]: yeah... I think I've heard more details somewhere
19:40karolherbst[d]: but yeah.. there isn't 100% compatibility
19:41karolherbst[d]: the other part is, that you also need to hot patch command submission
19:41marysaka[d]: karolherbst[d]: games that use NVN always ship with the binaries and that's what most things use
19:41marysaka[d]: but if I remember correctly the container of those blobs is the same as what we parse
19:42marysaka[d]: and those have IR around sooo it should be easier I guess
19:42karolherbst[d]: but yeah.. if they just need to recompile and there is some promise of things not breaking, might not be too bad
19:42marysaka[d]: there is also command recording on newer versions of NVN and uuum that might be fun for them to handle ™️
19:43karolherbst[d]: maybe they have a "good enough" transpiler which works for 90% of the cases
19:43karolherbst[d]: but it really sounds like they made it the game publishers' problem
19:44redsheep[d]: If you know for sure your console will sell then forcing a bit of work from the devs to bring it to the new generation isn't a bad solution, I think
19:45karolherbst[d]: might also allow you to replace some textures with higher quality ones and stuff
19:45karolherbst[d]: 😄
19:46marysaka[d]: I'm pretty sure they are swapping the shared library of the SDK / sub-sdks for original Switch stuff
19:46tiredchiku[d]: nExT gEn uPdaTe
19:46marysaka[d]: and might have some "addons" for other stuff (aka game devs get involved)
19:46marysaka[d]: but we'll see in time 😄
19:48redsheep[d]: I look forward to someone figuring out how to install linux on switch 2 and make nvk work
19:49tiredchiku[d]: inb4 TURNING THE SWITCH 2 INTO A STEAM DECK :shocked:
19:49tiredchiku[d]: on youtube
19:49redsheep[d]: Hacking a switch 2 into being a twisted steam deck seems like it should be a lot more viable than it ever could have been on switch 1
19:53redsheep[d]: Between it having a newer nvidia gpu, not anemic cpu cores, and having a mouse-like input method, it might even be good
20:16snowycoder[d]: I've put my "parser" work in this MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33691
20:16snowycoder[d]: It's in an early stage, but there should be enough info to derive most Ops, for now I just derive `DisplayOp`
20:17snowycoder[d]: Please comment if there's anything I could do better
20:21snowycoder[d]: The next step is to begin parsing, I've seen that etnaviv uses pest, but it requires generating both a PEG file and code to initialize the structs from the parsed data, I think it would be better to use nom
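For comparison with the pest and nom options mentioned above, here is a minimal sketch of what a hand-written, dependency-free parser could look like using only the Rust standard library. Everything in it is an assumption made for illustration: the `<dst> = <mnemonic> <src>, <src>` line syntax, the `ParsedOp` struct, and the `parse_op` function are hypothetical and are not taken from NAK or from the MR.

```rust
// Hypothetical textual op syntax, NOT the real NAK format:
//     <dst> = <mnemonic> <src>, <src>, ...
#[derive(Debug, PartialEq)]
struct ParsedOp {
    dst: String,
    mnemonic: String,
    srcs: Vec<String>,
}

// Parse one line into a ParsedOp using only std string helpers.
fn parse_op(line: &str) -> Result<ParsedOp, String> {
    // Everything before the first '=' is the destination.
    let (dst, rest) = line
        .split_once('=')
        .ok_or_else(|| format!("missing '=' in {line:?}"))?;
    let dst = dst.trim();
    if dst.is_empty() {
        return Err(format!("missing destination in {line:?}"));
    }

    // The first whitespace-delimited token after '=' is the mnemonic;
    // whatever follows is the comma-separated source list.
    let rest = rest.trim();
    let (mnemonic, src_list) = match rest.split_once(char::is_whitespace) {
        Some((m, s)) => (m, s),
        None => (rest, ""),
    };
    if mnemonic.is_empty() {
        return Err(format!("missing mnemonic in {line:?}"));
    }

    // Split sources on commas, dropping empty entries.
    let srcs = src_list
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect();

    Ok(ParsedOp {
        dst: dst.to_string(),
        mnemonic: mnemonic.to_string(),
        srcs,
    })
}
```

Calling `parse_op("r1 = iadd r2, r3")` yields a `ParsedOp` with `dst` set to `r1`, `mnemonic` set to `iadd`, and two sources. A combinator library such as nom would mostly buy better error reporting and composability for a larger grammar; for a flat one-op-per-line format, plain `split_once` may be enough.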
20:30airlied[d]: HPC folks are the reason AMD screwed up ROCm so badly
20:52kayliemoony[d]: howso, im not familiar
21:12gfxstrand[d]: airlied[d]: And yet they still didn't make something HPC folks want to use! 🤡
21:12gfxstrand[d]: snowycoder[d]: I'll take a look next week. I think mhenning[d] and her scheduler are first in line.
21:13snowycoder[d]: gfxstrand[d]: Yep, thanks!
21:13gfxstrand[d]: I think it sounds pretty cool, though. 😎
21:14gfxstrand[d]: I think the scheduler plus better dep tracking plus predication will get us a good bit.
21:14gfxstrand[d]: Now that we're not totally murdering the PCI bus and little things like shaders matter. 🙃
21:14snowycoder[d]: Just a question, is it a problem if I add a dependency (nom)?
21:15gfxstrand[d]: Does it have other deps?
21:16gfxstrand[d]: Yeah, it does. 😫
21:16snowycoder[d]: One 😦
21:16snowycoder[d]: lol, I'll check if I can do without
21:16gfxstrand[d]: If it's just for building tests or tools, it may be okay
21:17gfxstrand[d]: But yeah... Dependencies kinda suck with meson right now
21:17gfxstrand[d]: I need to try to get us moved over to the automatic cargo thing at some point. That'll make it easier.
21:18snowycoder[d]: automatic cargo thing? 0_o
21:18gfxstrand[d]: Yeah, it tries to read cargo files and scrape them into meson.
21:19gfxstrand[d]: It should make deps nothing but a 4-line wrap. Right now, we're carrying full meson for everything.
21:20gfxstrand[d]: Like, meson pulls it down and subs in a totally different build system. It's not the world's most robust solution.