00:00fdobridge: <karolherbst🐧🦀> and a pipe is essentially a fifo
00:00fdobridge: <karolherbst🐧🦀> never really thought about how one could implement it, but that inc/dec nvidia has really would come in handy
00:01fdobridge: <karolherbst🐧🦀> as it's essentially a ring buffer indeed
00:01fdobridge: <karolherbst🐧🦀> wondering if cuda has something like that
00:02fdobridge: <karolherbst🐧🦀> funky
00:22fdobridge: <gfxstrand> It still seems kinda pointless when you could just make it a power of two. 🤷🏻♀️
00:22fdobridge: <karolherbst🐧🦀> sure, but CL pipes have custom sizes handed in by the application :3
00:25fdobridge: <karolherbst🐧🦀> anyway.. at least it allows you to get rid of some `and` instructions even with power-of-two values
00:28fdobridge: <gfxstrand> You should be able to round up to a power of two. But, yeah...
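The two wrap-around strategies in this exchange can be sketched roughly as follows. This is an illustration only: `wrap_inc` follows the semantics CUDA documents for `atomicInc` (wrap to 0 once the old value has reached the limit), but it is a plain, non-atomic function, and all function names here are made up for the example.

```rust
/// Non-atomic model of a wrapping increment: returns the next index,
/// wrapping to 0 once `old` has reached `limit` (the last valid index).
/// Works for any capacity, not just powers of two.
pub fn wrap_inc(old: u32, limit: u32) -> u32 {
    if old >= limit { 0 } else { old + 1 }
}

/// Power-of-two alternative: round the requested capacity up so that the
/// wrap becomes a single AND with `capacity - 1`.
pub fn pot_capacity(requested: u32) -> u32 {
    requested.next_power_of_two()
}

/// Advance an index in a ring whose capacity is a power of two.
pub fn pot_inc(old: u32, capacity: u32) -> u32 {
    debug_assert!(capacity.is_power_of_two());
    (old + 1) & (capacity - 1)
}
```

For a pipe the application sized at 5 packets, `wrap_inc(i, 4)` keeps the exact size, while rounding up gives a capacity of 8 and trades three unused slots for the cheaper mask.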
03:01fdobridge: <gfxstrand> `Pass: 307186, Fail: 4765, Crash: 1525, Skip: 1673845, Flake: 54, Duration: 1:22:59`
13:29fdobridge: <gfxstrand> @Mr Fall🐧 So, this reclocking stuff you were talking about last week with the firmware juggling... Is it all theoretical or do you have patches somewhere? I'd really like to be able to demo stuff at the meetup in 1.5 weeks. I don't need something clean and upstreamable, just something that works well enough. (edited)
13:29fdobridge: <karolherbst🐧🦀> let me figure it out on the weekend and I'll throw something at you
13:30fdobridge: <gfxstrand> Cool. Thanks!
13:42fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Could it be possible to do it on Turing too? 😅
13:44fdobridge: <karolherbst🐧🦀> with GSP
13:44fdobridge: <karolherbst🐧🦀> though I'm not quite sure... does reclocking already work on the GSP branches?
13:49fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> `fail ttm_validate` is a big showstopper 🐸
13:50fdobridge: <gfxstrand> Then we need to fix that
13:50fdobridge: <gfxstrand> I mean, if the hacks work on Turing, may as well, but long-term we want GSP.
13:51fdobridge: <karolherbst🐧🦀> they don't
13:51fdobridge: <karolherbst🐧🦀> 1. turing is different 2. turing has gddr6
13:52fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Airlie said that GSP consumes a lot of memory (but I only use half of my VRAM when running a pretty intensive game with GSP on NVIDIA driver)
13:55fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> BTW why does the new nouveau kernel API require changes outside of drm/nouveau? 🍩
14:00fdobridge: <gfxstrand> I think it's because @airlied is also trying to build some common infrastructure for page table management so that other drivers can get competent uAPIs as well.
15:48fdobridge: <gfxstrand> I don't understand barriers....
15:49fdobridge: <gfxstrand> The good news is that it's no longer throwing ILLEGAL_PARAM
15:49fdobridge: <karolherbst🐧🦀> what kind of barriers?
16:02fdobridge: <gfxstrand> barrier()
16:02fdobridge: <gfxstrand> I think I figured out what we were doing wrong
16:02fdobridge: <gfxstrand> Well, with that
16:03fdobridge: <gfxstrand> IDK why my shared memory tests still aren't passing
16:04fdobridge: <karolherbst🐧🦀> soooo
16:04fdobridge: <karolherbst🐧🦀> you probably know that, but I'll say it anyway: some instructions have to be executed with all threads converged
16:04fdobridge: <karolherbst🐧🦀> barriers are one of those
16:04fdobridge: <karolherbst🐧🦀> they don't work if not all threads are there yet
16:05fdobridge: <karolherbst🐧🦀> if you have divergent control flow, you have to sync up the threads
16:06fdobridge: <gfxstrand> Yes
16:06fdobridge: <gfxstrand> Hrm...
16:06fdobridge: <gfxstrand> I'm not doing a warp sync
16:06fdobridge: <gfxstrand> Ugh
16:06fdobridge: <karolherbst🐧🦀> not talking about warp sync tho
16:07fdobridge: <karolherbst🐧🦀> well
16:07fdobridge: <karolherbst🐧🦀> it's part of it
16:07fdobridge: <karolherbst🐧🦀> but often not even needed
16:07fdobridge: <karolherbst🐧🦀> have you wired up that barrier register file yet?
16:07fdobridge: <karolherbst🐧🦀> B2R specifically
16:07fdobridge: <gfxstrand> no
16:07fdobridge: <karolherbst🐧🦀> ehh not B2R
16:08fdobridge: <karolherbst🐧🦀> BMOV
16:08fdobridge: <karolherbst🐧🦀> `MACTIVE` is the mask of active threads and can be read from or written to
16:09fdobridge: <karolherbst🐧🦀> otherwise you can only use warpsync outside of cfg
16:09fdobridge: <karolherbst🐧🦀> as you wouldn't know what threads have to participate
16:11fdobridge: <karolherbst🐧🦀> check `enum TSSemantic` for all the available things
16:12fdobridge: <karolherbst🐧🦀> WARPSYNC will ignore exited threads tho
16:13fdobridge: <gfxstrand> This may be too big of a project for today. 😕 (edited)
16:18fdobridge: <karolherbst🐧🦀> yeah.. I still haven't done it fully for GL yet
16:18fdobridge: <karolherbst🐧🦀> but I needed it for some lowering partly
16:18fdobridge: <karolherbst🐧🦀> texgrad on 3d
16:19fdobridge: <gfxstrand> As for the barrier register file, that shouldn't be hard. It's pretty easy to add files in NAK.
16:19fdobridge: <gfxstrand> The only real pain is that `RegFile` will go from 2 bits to 3 in a couple places. (edited)
16:19fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11061
16:19fdobridge: <gfxstrand> wishes Rust had bitfields some days
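Without language-level bitfields, a wider file tag means adjusting shifts and masks by hand. A minimal sketch of what that looks like, assuming a register reference packed into a single `u32`. The `RegFile` variants, the `PackedReg` type, and the bit layout are all hypothetical and are not taken from NAK's actual code.

```rust
/// Hypothetical register files; five entries no longer fit in 2 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RegFile {
    Gpr = 0,
    UGpr = 1,
    Pred = 2,
    UPred = 3,
    Bar = 4,
}

const FILE_BITS: u32 = 3; // was 2 before the barrier file was added
const FILE_SHIFT: u32 = 32 - FILE_BITS;
const INDEX_MASK: u32 = (1 << FILE_SHIFT) - 1;

/// A register reference packed into one u32: file tag in the top 3 bits,
/// register index in the remaining 29 (an assumed layout).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PackedReg(u32);

impl PackedReg {
    pub fn new(file: RegFile, index: u32) -> Self {
        assert!(index <= INDEX_MASK, "index does not fit in 29 bits");
        PackedReg(((file as u32) << FILE_SHIFT) | index)
    }

    /// Decode the file tag from the top bits.
    pub fn file(self) -> RegFile {
        match self.0 >> FILE_SHIFT {
            0 => RegFile::Gpr,
            1 => RegFile::UGpr,
            2 => RegFile::Pred,
            3 => RegFile::UPred,
            4 => RegFile::Bar,
            tag => panic!("unused file tag {}", tag),
        }
    }

    /// Decode the register index from the low bits.
    pub fn index(self) -> u32 {
        self.0 & INDEX_MASK
    }
}
```

Keeping the widths in named constants means a 2-to-3-bit change touches `FILE_BITS` and the decode `match`, rather than every shift in the code.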
16:20fdobridge: <karolherbst🐧🦀> anyway.. you have 16 general purpose barrier registers and 16 of those fixed ones and they live inside the same space
16:20fdobridge: <gfxstrand> kk
16:21fdobridge: <gfxstrand> That's a little annoying...
16:21fdobridge: <gfxstrand> But I think I can deal with it
16:21fdobridge: <karolherbst🐧🦀> the special ones start at 16
16:21fdobridge: <karolherbst🐧🦀> so 0-15: general purpose, 16-31: special ones
16:21fdobridge: <gfxstrand> That makes it a bit easier
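The index layout described here (0-15 general purpose, 16-31 special, one shared space) could be modelled along these lines. The type and function names are invented for illustration; which special index corresponds to which name such as `MACTIVE` is not stated in the discussion, so the sketch leaves that out.

```rust
/// Number of general-purpose barrier registers (B0..B15), per the discussion.
pub const NUM_GENERAL_BARRIERS: u8 = 16;
/// Total size of the shared index space (general + special).
pub const NUM_BARRIER_REGS: u8 = 32;

#[derive(Debug, PartialEq, Eq)]
pub enum BarrierReg {
    /// General-purpose register, index 0..=15.
    General(u8),
    /// Special register; the payload is the offset from 16, so 0..=15.
    Special(u8),
}

/// Map a raw index in the shared space to its kind, or None if out of range.
pub fn classify_barrier(index: u8) -> Option<BarrierReg> {
    if index < NUM_GENERAL_BARRIERS {
        Some(BarrierReg::General(index))
    } else if index < NUM_BARRIER_REGS {
        Some(BarrierReg::Special(index - NUM_GENERAL_BARRIERS))
    } else {
        None
    }
}
```

Because the special registers start exactly at 16, the split is a single comparison (or one bit test), which is what makes the shared space easier to deal with.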
16:25fdobridge: <karolherbst🐧🦀> e.g. you force the shader to run in quads by doing a BMOV.PQUAD on MACTIVE with a value you usually read out from BMOV with MACTIVE
16:26fdobridge: <karolherbst🐧🦀> BMOV B0, MACTIVE
16:26fdobridge: <karolherbst🐧🦀> BMOV.PQUAD MACTIVE, B0
16:26fdobridge: <karolherbst🐧🦀> ... code requiring quads
16:26fdobridge: <karolherbst🐧🦀> BMOV MACTIVE, B0
16:27fdobridge: <karolherbst🐧🦀> but I doubt you'll need this any time soon 🙂
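A toy model of the save / widen / restore sequence above, working on a 32-bit lane mask. It rests on an assumption not confirmed in the discussion: that `.PQUAD` turns on every lane of any quad (4 consecutive lanes) that has at least one active lane. Function names are hypothetical.

```rust
/// Expand a 32-lane active mask so that every quad with at least one
/// active lane becomes fully active. ASSUMED model of `.PQUAD`.
pub fn expand_to_quads(active: u32) -> u32 {
    let mut out = 0u32;
    for quad in 0..8 {
        let quad_mask = 0xFu32 << (quad * 4);
        if (active & quad_mask) != 0 {
            out |= quad_mask;
        }
    }
    out
}

/// Model of the three BMOVs; returns (mask while the quad code runs,
/// mask after the restore).
pub fn run_quad_region(mactive: u32) -> (u32, u32) {
    let b0 = mactive;                 // BMOV B0, MACTIVE
    let during = expand_to_quads(b0); // BMOV.PQUAD MACTIVE, B0
    let after = b0;                   // BMOV MACTIVE, B0
    (during, after)
}
```

The point of saving the mask in `B0` first is that the helper lanes enabled for the quad code are switched off again afterwards.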
16:28fdobridge: <karolherbst🐧🦀> before volta the hardware managed the masks itself and you had like QUADON/QUADPOP instructions doing it instead
16:28fdobridge: <karolherbst🐧🦀> and other things to handle breaks/conts/whatever
16:31fdobridge: <gfxstrand> Are there docs on any of this?
16:31fdobridge: <karolherbst🐧🦀> of course there aren't 🙂
16:31fdobridge: <karolherbst🐧🦀> well
16:32fdobridge: <karolherbst🐧🦀> at least nothing I could give to you 😄
16:32fdobridge: <gfxstrand> *grumble*
16:32fdobridge: <karolherbst🐧🦀> but what I have isn't useful besides that anyway
16:32fdobridge: <karolherbst🐧🦀> I know what PQUAD and MACTIVE do
16:32fdobridge: <karolherbst🐧🦀> everything else I only got the names
16:33fdobridge: <karolherbst🐧🦀> welll.. I know what WARPSYNC does and you can just use it with a 0xffffffff mask
16:33fdobridge: <karolherbst🐧🦀> but... that won't work inside non uniform CFG
16:33fdobridge: <karolherbst🐧🦀> if you sync the threads up outside of any CFG you'd be good for now
16:34fdobridge: <karolherbst🐧🦀> @gfxstrand did you try using BAR.SYNC btw?
16:36fdobridge: <karolherbst🐧🦀> though I have no idea if that works the same or differently.. mhh probably different
16:37fdobridge: <karolherbst🐧🦀> there is also BSSY+BSYNC to set convergence points
16:37fdobridge: <gfxstrand> I'm using BAR.SYNC right now
16:38fdobridge: <karolherbst🐧🦀> so you have a BSSY B0, $address_to_after_cfg, then some divergent control flow and then BSYNC B0
16:38fdobridge: <karolherbst🐧🦀> and then the threads are converged
16:40fdobridge: <karolherbst🐧🦀> the address argument on BSSY is ignored by the hardware tho, it's just useful for debuggers
16:40fdobridge: <karolherbst🐧🦀> or general debugging
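The BSSY/BSYNC pairing can be pictured with a small bookkeeping model: BSSY records which lanes must reconverge, and BSYNC releases them once all of those lanes have arrived. This is a simplified sketch of the idea only, not a description of the real hardware state; the struct and method names are invented.

```rust
/// Toy model of one convergence barrier register (e.g. B0).
#[derive(Debug, Default)]
pub struct ConvergenceBarrier {
    expected: u32, // lanes that must reach the sync point
    arrived: u32,  // lanes that have reached it so far
}

impl ConvergenceBarrier {
    /// Models `BSSY`: remember which lanes were active before diverging.
    pub fn set_sync(&mut self, active_mask: u32) {
        self.expected = active_mask;
        self.arrived = 0;
    }

    /// Models one lane reaching `BSYNC`. Returns true once every expected
    /// lane has arrived, i.e. the threads are converged again.
    pub fn arrive(&mut self, lane: u32) -> bool {
        assert!(lane < 32);
        self.arrived |= (1u32 << lane) & self.expected;
        self.arrived == self.expected
    }
}
```

In this model the lanes that took different branches call `arrive` in any order, and only the last one sees `true`.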
16:40fdobridge: <gfxstrand> Yeah, the trick is how to do it in uniform control-flow
16:41fdobridge: <gfxstrand> Seems like the Nvidia compiler falls back to just BAR.SYNC for that
16:41fdobridge: <karolherbst🐧🦀> yeah.. probably
16:41fdobridge: <gfxstrand> I've been playing around with it a bit
16:41fdobridge: <karolherbst🐧🦀> I'm not 100% on when it's actually needed and when BAR.SYNC is enough and stuff
16:41fdobridge: <gfxstrand> Yeah, I need to spend some time with it
16:41fdobridge: <karolherbst🐧🦀> anyway, I tried to summarize what I know about this and what we currently use it for
16:42fdobridge: <gfxstrand> Thanks
16:42fdobridge: <karolherbst🐧🦀> that stuff is more solid pre volta and the general logic applies, it's just managed by a hardware stack instead of barrier registers
16:43fdobridge: <karolherbst🐧🦀> funky is that you have an on-chip stack, but if that's not enough you can spill the stack to VRAM, but you have to know beforehand how much you need and stuff. Luckily Volta+ doesn't have that, but you have to deal with that barrier stuff 🙂
17:38fdobridge: <mhenning> oh, that reminds me - I actually did some REing on kepler and I think I figured out how to size the hardware stack correctly... and then I never actually followed through with those patches to codegen
17:39fdobridge: <mhenning> I forget if I broke something or just got distracted. maybe I should resurrect that branch
17:39fdobridge: <karolherbst🐧🦀> the hardware stack is fixed in size
17:39fdobridge: <karolherbst🐧🦀> what you can size is the software one
17:39fdobridge: <karolherbst🐧🦀> and it's part of TLS
17:40fdobridge: <mhenning> yeah I meant the spill space for the stack
17:40fdobridge: <karolherbst🐧🦀> we actually have a game running out of space 😢
17:40fdobridge: <karolherbst🐧🦀> so what nvidia does to combat the need of a big stack is to do loop merging
17:41fdobridge: <mhenning> yeah I might have gone down a rabbit hole of trying to conserve stack slots
17:42fdobridge: <karolherbst🐧🦀> I think I figured it out at some point as well, but codegen is annoying
17:42fdobridge: <mhenning> eg. you only need one slot for a switch case, but nir lowers that to nested ifs that naively consume a bunch of slots, so I started dreaming up ways of backing that info out
17:43fdobridge: <karolherbst🐧🦀> also being a multiple of 512b doesn't help
17:43fdobridge: <karolherbst🐧🦀> or whatever the size was
17:43fdobridge: <mhenning> yeah, codegen is super annoying
17:43fdobridge: <karolherbst🐧🦀> could have an array with jump targets *hides*
17:44fdobridge: <mhenning> haha you can implement that one on codegen, I'll watch
17:46fdobridge: <karolherbst🐧🦀> I just spill those arrays to a constbuf
17:46fdobridge: <karolherbst🐧🦀> that reminds me...
17:48fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/ee35bee0c4ed55c5cce5615297847e3fa29d136e