02:56fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> You should try to get gamescope working (because basically all the required features are supported)
03:09fdobridge: <gfxstrand> Mesa/mesa branch: main.
09:13fdobridge: <karolherbst🐧🦀> @gfxstrand uhm.... you want to call `nir_lower_int64` 🙃
09:13fdobridge: <karolherbst🐧🦀> volta+ relies on that
09:14fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_from_nir.cpp#L3469
09:16fdobridge: <karolherbst🐧🦀> though we can do a bit better than that...
09:19fdobridge: <karolherbst🐧🦀> but yeah... the problem wasn't the 64 bit source, but that the result is always 32 bit
09:19fdobridge: <karolherbst🐧🦀> `71: shf (SUBOP:4) s64 $r2d $r8d 0x0000000000000004 $r255 (16)` the emitted `shf` only returns 32 bits
09:59fdobridge: <karolherbst🐧🦀> though I'm not so sure about the 64 bit source even... but the ISA doc suggests those are two 32 bit values. I'm mostly confused around the `.S64` and `.U64` modifiers...
10:16fdobridge: <karolherbst🐧🦀> yeah.. so a CL 64 bit shift is `LOP32I.AND R5, R4, 0x3f ; SHF.R.S64 R2, R2, R5, R3 ; SHR R3, R3, R5 ;`
10:19fdobridge: <karolherbst🐧🦀> mhhhh
10:24fdobridge: <karolherbst🐧🦀> the shift source is always 32 bit no matter what
10:27fdobridge: <karolherbst🐧🦀> it kinda feels like that the 32 or 64 bit of the type thing doesn't matter one bit
10:28fdobridge: <karolherbst🐧🦀> I think the intention was to enable/disable one of the input sources, but...
10:28fdobridge: <karolherbst🐧🦀> anyway
10:28fdobridge: <karolherbst🐧🦀> all registers are always 32 bit
10:29fdobridge: <karolherbst🐧🦀> we might want to add a nir alu instruction `nir_alu_shf` or something to express that in nir long term
10:30fdobridge: <karolherbst🐧🦀> ehh `nir_op_shf`
10:30fdobridge: <karolherbst🐧🦀> and lower 64 bit ops to that
10:30fdobridge: <karolherbst🐧🦀> `SHR` and `SHL` are just alias to `SHF`
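(Editor's sketch, not part of the log.) The three-instruction CL sequence quoted above can be modelled in C on pairs of 32-bit registers. This is a minimal sketch under stated assumptions: that `SHF.R.S64` shifts the concatenated hi:lo value right and returns only the low 32 bits, and that the 32-bit `SHR` clamps shift amounts above 31. The names `reg_pair`, `asr32`, `shf_r_s64` and `shr64_signed` are hypothetical and do not come from Mesa or the ISA documentation.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit registers (hypothetical helper type). */
typedef struct { uint32_t lo, hi; } reg_pair;

/* Arithmetic shift right of one 32-bit register. Shift amounts above 31
 * are clamped (an assumption), giving all copies of the sign bit. */
static uint32_t asr32(uint32_t v, uint32_t s)
{
    uint32_t sign = (v >> 31) ? 0xffffffffu : 0u;
    if (s > 31) s = 31;
    if (s == 0) return v;
    return (v >> s) | (sign << (32 - s));
}

/* Assumed model of SHF.R.S64: shift hi:lo right by s, keep the low 32 bits. */
static uint32_t shf_r_s64(uint32_t lo, uint32_t hi, uint32_t s)
{
    assert(s < 64);
    if (s == 0) return lo;
    if (s < 32) return (lo >> s) | (hi << (32 - s));
    return asr32(hi, s - 32);   /* result comes entirely from the high half */
}

/* Signed 64-bit shift right built from the two 32-bit operations. */
static reg_pair shr64_signed(reg_pair v, uint32_t shift)
{
    uint32_t s = shift & 0x3f;          /* LOP32I.AND R5, R4, 0x3f   */
    reg_pair r;
    r.lo = shf_r_s64(v.lo, v.hi, s);    /* SHF.R.S64  R2, R2, R5, R3 */
    r.hi = asr32(v.hi, s);              /* SHR        R3, R3, R5     */
    return r;
}
```

The point of the sketch is that every operand and result is a single 32-bit register; the "64" only describes how the two source halves are combined.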
10:42HdkR: Everyday I'm shuffling?
11:04karolherbst: mhhh, I'm actually confused how nvk got so far not calling into int64 lowering :D
11:07karolherbst: HdkR: dunno if you can talk about it, but do you have any information you can share on how to use more than 8 constant buffers in compute shaders? I've seen nvidia generate instructions using the 0xa and 0xc one and I'm kinda confused on how to configure those in compute...
11:09HdkR: Last I knew you could still only bind 8. Maybe it got resolved in some new generation?
11:09karolherbst: it generated that code for all gens
11:09karolherbst: but granted.. I played with PTX 2.1 code and explicit ubo bindings
11:09karolherbst: the entire result was weird, maybe it's something silly and they live patch it away
11:10karolherbst: but anyway.. I've seen ptxas generating such kernels
11:10karolherbst: tried every gen from sm_35 up to sm_80
11:10karolherbst: mhh
11:10karolherbst: fermi has this though: NV90C0_BIND_CONSTANT_BUFFER_SHADER_SLOT 12:8
11:10karolherbst: 5 bits
11:11HdkR: Weird. Maybe it does some voodoo
11:11karolherbst: maybe more than 8 were supported pre kepler
11:12karolherbst: ptx _had_ 11 binding slots for constant buffers, but it got deprecated with ptx 2.2
11:14HdkR: Oh, maybe. I never looked at anything older than Maxwell
11:14karolherbst: ptx 2.0 added support for sm_20 which is fermi
11:14karolherbst: and I think 2.1 or 2.2 is where kepler was added
11:15karolherbst: ohh the ptx version number follows the sm number
11:15karolherbst: convenient
11:15karolherbst: kepler is sm_30 mhh
11:16karolherbst: but yeah.. I think it's related to ditching that in kepler. Let me check if we used more than 8 with fermi
11:17karolherbst: indeed
11:17karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/nouveau/nvc0/nvc0_program.c#L619
11:18karolherbst: yeah, so that explains it
11:18karolherbst: I still wonder why they generate that for newer gens, but maybe it also would just cause issues at runtime :D
11:18HdkR: Neat
11:19HdkR: I definitely complained about the lack of binding slots in compute. Was kind of hoping it got fixed
11:20karolherbst: yeah... though it might not matter as much.. dunno
11:21karolherbst: I'm still confused on how real all of that ubo stuff is, and if they use the storage for other things... but it also sounds like there is this 4k/8k cache and also explicit dram space for all the 18 slots
11:21HdkR: Since latest hardware has Uniform SSBO access that matches UBO access perf for the most part, it kind of matters less
11:21HdkR: Obviously can't embed an SSBO slot in an instruction encoding, but whatever
11:21karolherbst: and there are also bindless UBOs
11:22karolherbst: soo.. just use those? :P
11:22HdkR: true
11:22karolherbst: I still wonder if those just fill the unused UBO slots or if other magic is going on
11:22karolherbst: the bindless address does contain the size of the ubo
11:23HdkR: Should be other magic
11:23karolherbst: maybe
11:24HdkR: If it works like I think it does anyway :)
11:24karolherbst: yeah.. I mean the benefit you get here is, that the hardware can load the entire UBO in one go and doesn't have to bother with memory latency anymore
11:25HdkR: Indeed
11:25karolherbst: and I suspect LDG.CONSTANT is just some caching optimization
11:26HdkR: :D
11:26karolherbst: or maybe uses the same thing
11:26karolherbst: who knows
11:26karolherbst: though can't really, because the size isn't known
11:32HdkR: Just prefetch forward until maximum UBO size or fault. EZ
11:32karolherbst: well.. I'd choose the page, but yeah :P
11:33HdkR: Isn't the GPU page size 64KB just like the max UBO size? :D
11:33karolherbst: good question
11:33karolherbst: well.. nvidia actually gave us the MMU stuff
11:33karolherbst: https://nvidia.github.io/open-gpu-doc/pascal/gp100-mmu-format.pdf
11:34karolherbst: `Dropped support for 128KB Big Pages.` :D
11:34HdkR: ooo, fancy
11:34karolherbst: looks like 64kb are second level already
11:34karolherbst: or something
11:35karolherbst: I think it mostly exists for PPC
11:39HdkR: Main thing is that it needs to support 4k because of CPU mapping. So GPU pages will always be second level
12:14fdobridge: <georgeouzou> The failing CTS tests use 2 queries: one VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT and one VK_QUERY_TYPE_PRIMITIVES_GENERATED_EXT.
12:14fdobridge: <georgeouzou> For tests that use a single query the tests seem to pass.
12:14fdobridge: <georgeouzou> If i change the transform feedback's availability semaphore location from PIPELINE_LOCATION_ALL to PIPELINE_LOCATION_STREAMING_OUTPUT then all tests pass
12:30fdobridge: <georgeouzou> It's like this:
12:30fdobridge: <georgeouzou> - begin_primitives_generated_query
12:30fdobridge: <georgeouzou> - begin_xfb_query
12:30fdobridge: <georgeouzou> - draw
12:30fdobridge: <georgeouzou> - end_primitives_generated_query
12:30fdobridge: <georgeouzou> - end_xfb_query
12:30fdobridge: <georgeouzou> - get_primitives_generated_query
12:30fdobridge: <georgeouzou> - get_xfb_query
12:30fdobridge: <georgeouzou> - check if they are equal
12:53fdobridge: <gfxstrand> Yeah, idk how we want to structure the NIR ops but I think we do want new NIR ops for this. Shifts are actually kinda painful to split into normal 32-bit shifts
12:53fdobridge: <karolherbst🐧🦀> yeah...
12:53fdobridge: <karolherbst🐧🦀> I could add support for that in codegen as it's literally just two ops instead of one
12:53fdobridge: <karolherbst🐧🦀> I'm under the impression that somebody even wrote the code
12:54fdobridge: <karolherbst🐧🦀> maybe it was pierre? I couldn't find it in my repo
12:54fdobridge: <gfxstrand> Yeah, shouldn't be hard.
12:54fdobridge: <karolherbst🐧🦀> let's ping pierre...
12:54fdobridge: <karolherbst🐧🦀> mhh where is pierre anyway
12:54fdobridge: <gfxstrand> 🤷🏻♀️
12:54fdobridge: <karolherbst🐧🦀> but anyway.. nvk should call into int64 and float64 lowering anyway 😄 we can optimize the shift lowering after that
12:55fdobridge: <gfxstrand> Honestly, I'm a bit inclined to just handle it in nv50_ir_from_nir.cpp for now.
12:56fdobridge: <gfxstrand> Shifts and adds are really annoying to describe efficiently in NIR.
12:56fdobridge: <gfxstrand> Multiply is okay because it's just a like of mul_2x32_64
12:56fdobridge: <karolherbst🐧🦀> I've added a function to do this "early" lowering: LoweringHelper
12:57fdobridge: <karolherbst🐧🦀> should probably just go in there
12:57fdobridge: <gfxstrand> And comparisons...
12:58fdobridge: <gfxstrand> Really, NV has most of what you want for 64-bit integers. It's just very cleverly shaped to not actually use 64-bit registers most of the time.
12:58fdobridge: <karolherbst🐧🦀> yeah
12:58fdobridge: <karolherbst🐧🦀> I can write the code tomorrow or something
12:59fdobridge: <gfxstrand> We need to lower multiply and maybe bit count/shift nonsense.
12:59fdobridge: <gfxstrand> I can probably type the code, too.
12:59fdobridge: <mohamexiety> does the HW have 64-bit registers?
12:59fdobridge: <gfxstrand> No
12:59fdobridge: <gfxstrand> It uses pairs of registers when 64-bit is actually required
13:00fdobridge: <mohamexiety> I see, thanks!
13:00fdobridge: <karolherbst🐧🦀> MUL is already handled, no?
13:00fdobridge: <karolherbst🐧🦀> ehh maybe not for volta+
13:01fdobridge: <gfxstrand> I don't think so but also the NIR lowering will be what we want for that.
13:01fdobridge: <karolherbst🐧🦀> IMAD has a .WIDE modifier to make it 64 bit
13:02fdobridge: <karolherbst🐧🦀> funky
13:02fdobridge: <karolherbst🐧🦀> but only for the ADD part
13:02fdobridge: <karolherbst🐧🦀> so you have a multiplication of two 32 bit values + 64 bit
13:03fdobridge: <karolherbst🐧🦀> but yeah.. it also has a .HI flag
13:03fdobridge: <karolherbst🐧🦀> so I think we can straight support most of that actually indeed
13:03fdobridge: <karolherbst🐧🦀> well `nir_lower_imul_high64` at least
13:03fdobridge: <karolherbst🐧🦀> (and drop it)
13:04fdobridge: <karolherbst🐧🦀> just if we really want to fight codegen here, because fixing those opts passes is sometimes annoying 😄
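(Editor's sketch, not part of the log.) The `IMAD.WIDE` behaviour described above, a multiply of two 32-bit values plus a 64-bit addend, is enough to build the low 64 bits of a 64x64 multiply on register pairs. The C below is a rough model under that assumption; `reg_pair`, `imad_wide_u32` and `mul64_lo` are hypothetical names, and the unsigned interpretation is chosen only for illustration.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit registers (hypothetical helper type). */
typedef struct { uint32_t lo, hi; } reg_pair;

/* Assumed model of IMAD.WIDE (unsigned): 32 x 32 -> 64 multiply,
 * then a 64-bit add of the register pair c. */
static reg_pair imad_wide_u32(uint32_t a, uint32_t b, reg_pair c)
{
    uint64_t addend = ((uint64_t)c.hi << 32) | c.lo;
    uint64_t sum = (uint64_t)a * b + addend;
    reg_pair r = { (uint32_t)sum, (uint32_t)(sum >> 32) };
    return r;
}

/* Low 64 bits of a 64 x 64 multiply: one wide multiply for lo*lo,
 * plus the two cross terms, which only affect the high register. */
static reg_pair mul64_lo(reg_pair a, reg_pair b)
{
    reg_pair zero = { 0u, 0u };
    reg_pair r = imad_wide_u32(a.lo, b.lo, zero);
    r.hi += a.lo * b.hi + a.hi * b.lo;   /* wraps modulo 2^32 by design */
    return r;
}
```

The `hi * hi` term is dropped because it only contributes to bits 64 and above.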
13:17fdobridge: <gfxstrand> I don't care too much about fixing codegen at the moment. I care a bit about shifts because this seems to be the #1 source of faults right now which is impacting CTS stability.
13:19fdobridge: <karolherbst🐧🦀> right... I think the code probably handles shifts properly enough so we can lower 64 bit ones
13:19fdobridge: <karolherbst🐧🦀> just the other bits.. not so much
13:25fdobridge: <gfxstrand> I need to find myself another intern... There's something I've wanted since about forever and haven't taken the time to write: a NIR opcode hardware fuzzer.
13:25fdobridge: <karolherbst🐧🦀> ahh, would be a cool project
13:27fdobridge: <mohamexiety> ~~_I mean, I am still here..._~~
13:27fdobridge: <mohamexiety> :p
13:27fdobridge: <gfxstrand> Basic idea would be to back-door either GLSL or SPIR-V somehow to let you run arbitrary NIR ALU ops and then write a bit of GL or Vulkan code to drive the front-end. I want to be able to fuzz the shit out of opcodes and find the corners.
13:29fdobridge: <gfxstrand> You are but I could probably keep you going for a while without really getting into compiler stuff.
13:29fdobridge: <gfxstrand> Unless you want to. 🤷🏻♀️
13:33fdobridge: <mohamexiety> ah. yeah that's fine too, thought there wasn't anything else
14:06fdobridge: <karolherbst🐧🦀> @gfxstrand https://gist.githubusercontent.com/karolherbst/2ba57ee0e64dc834d4c25f7235fd0aa7/raw/38332db7732e72b2228e76f616ef65a991cac287/tmp.patch
14:06fdobridge: <karolherbst🐧🦀> I'll clean it up tomorrow
14:06fdobridge: <karolherbst🐧🦀> and I hope I didn't mess up
14:06fdobridge: <karolherbst🐧🦀> at least the test still passes with the old uapi here
14:20fdobridge: <georgeouzou> Is there a way to move MRs from nouveau to mesa? Or i need to close and open a new one ?
14:20fdobridge: <karolherbst🐧🦀> I think you can retarget it
14:21fdobridge: <karolherbst🐧🦀> maybe not...
14:22fdobridge: <georgeouzou> it seems that i can change the target branch but on the same repo
14:26fdobridge: <gfxstrand> Yeah, you probably have to recreate it. 🫤
14:26fdobridge: <gfxstrand> You can move issues but I don't know about MRs.
14:30fdobridge: <georgeouzou> Ok thanks!
14:55fdobridge: <gfxstrand> Test passes. No idea if that's actually correct, though, and the test is just shifting 1 by 4 bits to the left so it wouldn't exhibit all the possible bugs.
14:58fdobridge: <karolherbst🐧🦀> yeah.. but at least we don't emit 64 bit sources/dests on `SHF` anymore.. I'll clean up the patch and send it out tomorrow
14:58fdobridge: <karolherbst🐧🦀> could run the CTS on it, I don't think the patch is wrong and my clean ups won't change the result
15:00fdobridge: <gfxstrand> Yeah, you could always throw rusticl at it and the CL ALU tests for shift
15:00fdobridge: <karolherbst🐧🦀> ahh.. good idea actually
15:00fdobridge: <karolherbst🐧🦀> thing is.. gallium calls int64 lowering, so I'd have to disable that first 🙂
15:01fdobridge: <gfxstrand> I'm running Vulkan CTS right now but I don't have int64 enabled so I'm not running the shift tests
15:01fdobridge: <gfxstrand> You can control what gallium lowers
15:01fdobridge: <gfxstrand> It's in the `nir_shader_compiler_options`
15:01fdobridge: <karolherbst🐧🦀> yeah. I know 🙂
15:01fdobridge: <karolherbst🐧🦀> I'd disable shift lowering if everything is fine with that patch
15:02fdobridge: <karolherbst🐧🦀> anyway.. I've sent my initial draft for CL prog variables in case you feel bored next week 😄
15:03fdobridge: <karolherbst🐧🦀> it was surprisingly easy when ignoring init/fini kernels
15:04fdobridge: <mhenning> I think this actually has some similarities to the gk20a lowering, although I haven't compared too carefully https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_lowering_nvc0.cpp#L268 (edited)
15:04fdobridge: <karolherbst🐧🦀> yeah.. it's mostly the same
15:04fdobridge: <karolherbst🐧🦀> gk20a+ just use OP_SHL/OP_SHR opcodes instead of OP_SHF
15:04fdobridge: <karolherbst🐧🦀> so maybe the solution here is to streamline everything
15:05fdobridge: <karolherbst🐧🦀> but that's part of "cleanup"
15:05fdobridge: <mhenning> Fair enough
15:05fdobridge: <karolherbst🐧🦀> just wanted to give Faith something to verify it fixes it
15:05fdobridge: <georgeouzou> just got a freeze on deqp-runner
15:06fdobridge: <georgeouzou> with the new uapi
15:06fdobridge: <karolherbst🐧🦀> I did try to use the gk20a lowering, but I ran into more issues and this was simply the path of least resistance 😄
15:15fdobridge: <gfxstrand> This dEQP run already seems much happier than previous runs. Also, maybe even a bit faster. 🧐
15:15fdobridge: <karolherbst🐧🦀> well.. less crashes and all that 😛
15:15fdobridge: <gfxstrand> Yup
15:16fdobridge: <karolherbst🐧🦀> anyway, this was some evil undefined behavior thing 🙃 I wouldn't be surprised if enabling int64 lowering fixes more things on the side
15:30fdobridge: <georgeouzou> xx
15:56fdobridge: <gfxstrand> Yeah, probably. And I should do that, too. I just wanted to figure out why this fairly obvious thing wasn't working.
15:57fdobridge: <gfxstrand> Also, the fact that codegen attempts to compile something clearly wrong is a little disturbing. 😕
15:58fdobridge: <georgeouzou> with the new uapi
15:58fdobridge: <georgeouzou> https://pastebin.com/8vt47ELr (edited)
15:59fdobridge: <karolherbst🐧🦀> mhh.. yeah well.. codegen has almost zero validation of anything
16:33fdobridge: <gfxstrand> `Pass: 400879, Fail: 1074, Crash: 86, Skip: 1730149, Timeout: 2, Flake: 1180, Duration: 1:20:58`
16:36fdobridge: <mohamexiety> nice! most crashes gone now
16:37fdobridge: <gfxstrand> Yeah, still seeing some texture tests crashing and IDK why.
16:37fdobridge: <gfxstrand> And a bunch of the cross-device synchronization tests fail
16:37fdobridge: <gfxstrand> I suspect the latter causes the former
16:42fdobridge: <gfxstrand> I added a lower_int64 call and I'm running that now. If it passes, I'll hand it to Marge.
16:43fdobridge: <dadschoorse> MR Label Maker needs to learn about the nvk label
16:57fdobridge: <gfxstrand> Yeah, it does. IDK who's in charge of that
17:13fdobridge: <gfxstrand> Really should figure out what's going on with the 1k flakes...
17:14fdobridge: <dadschoorse> looks like it may be as simple as opening a MR that changes <https://gitlab.freedesktop.org/freedesktop/mr-label-maker/-/blob/main/mr_label_maker/mesa.py>?
17:22fdobridge: <gfxstrand> https://gitlab.freedesktop.org/freedesktop/mr-label-maker/-/merge_requests/15
17:22fdobridge: <gfxstrand> Feel free to double-check I didn't mistype any path names and RB it.
18:22fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> The int64 change got merged when I was looking at it :triangle_nvk:
19:48fdobridge: <karolherbst🐧🦀> there is this repo in the mr label maker description you can add rules to
19:48fdobridge: <karolherbst🐧🦀> ahh.. you already did
19:49fdobridge: <airlied> Also we use the vma alloc in low addr mode, hoping to fix it with NAK
19:55fdobridge: <airlied> For some reason that msg didn't send yesterday
20:01fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> Is NIR lowering the cheat code of codegen? :nouveau:
20:02fdobridge: <gfxstrand> No, not using enough NIR is the failure of codegen. 😂
20:08fdobridge: <gfxstrand> Uh... If you set that bit, it got lost somewhere.
20:09fdobridge: <gfxstrand> Bah! Lost it in 0b6afbc407fb4a08ce5cdd234b729db662b944fe
20:10fdobridge: <gfxstrand> I'm going to set it again and see if it fixes things
20:17fdobridge: <airlied> oh yeah that might explain some pain
20:18fdobridge: <airlied> just retargeted conditional rendering https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24520
20:19fdobridge: <gfxstrand> Running now.
20:19fdobridge: <gfxstrand> I thought there was a regression but I didn't know why. 😅
22:41fdobridge: <gfxstrand> `Pass: 400714, Fail: 1084, Crash: 87, Skip: 1730145, Timeout: 2, Flake: 1338, Duration: 1:26:00`
22:45fdobridge: <gfxstrand> So, not much improvement now that 64-bit lowering is on
22:46fdobridge: <![NVK Whacker] Echo (she) 🇱🇹> How close are we to "no failures, no flakes, it just passe-"? :triangle_nvk:
22:48fdobridge: <gfxstrand> A month or two of debugging. Oh, and half a compiler. 😛
23:03fdobridge: <airlied> I definitely wouldn't go for total polish with codegen, seems like wasted effort
23:11fdobridge: <gfxstrand> Yeah
23:15fdobridge: <airlied> I assume 1.1 is probably NAK blocked if we need to get proper subgroups
23:18fdobridge: <airlied> also scalarBlockLayout which zink quite wants
23:19fdobridge: <gfxstrand> Yeah
23:19fdobridge: <gfxstrand> And 1.2 really wants memory model
23:19fdobridge: <gfxstrand> (Not actually required until 1.3)