02:56 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> You should try to get gamescope working (because basically all the required features are supported)
03:09 fdobridge: <g​fxstrand> Mesa/mesa branch: main.
09:13 fdobridge: <k​arolherbst🐧🦀> @gfxstrand uhm.... you want to call `nir_lower_int64` 🙃
09:13 fdobridge: <k​arolherbst🐧🦀> volta+ relies on that
09:14 fdobridge: <k​arolherbst🐧🦀> https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_from_nir.cpp#L3469
09:16 fdobridge: <k​arolherbst🐧🦀> though we can do a bit better than that...
09:19 fdobridge: <k​arolherbst🐧🦀> but yeah... the problem wasn't the 64 bit source, but that the result is always 32 bit
09:19 fdobridge: <k​arolherbst🐧🦀> `71: shf (SUBOP:4) s64 $r2d $r8d 0x0000000000000004 $r255 (16)` the emitted `shf` only returns 32 bits
09:59 fdobridge: <k​arolherbst🐧🦀> though I'm not so sure about the 64 bit source even... but the ISA doc suggests those are two 32 bit values. I'm mostly confused around the `.S64` and `.U64` modifiers...
10:16 fdobridge: <k​arolherbst🐧🦀> yeah.. so a CL 64 bit shift is `LOP32I.AND R5, R4, 0x3f ; SHF.R.S64 R2, R2, R5, R3 ; SHR R3, R3, R5 ;`
10:19 fdobridge: <k​arolherbst🐧🦀> mhhhh
10:24 fdobridge: <k​arolherbst🐧🦀> the shift source is always 32 bit no matter what
10:27 fdobridge: <k​arolherbst🐧🦀> it kinda feels like the 32 vs 64 bit part of the type doesn't matter one bit
10:28 fdobridge: <k​arolherbst🐧🦀> I think the intention was to enable/disable one of the input sources, but...
10:28 fdobridge: <k​arolherbst🐧🦀> anyway
10:28 fdobridge: <k​arolherbst🐧🦀> all registers are always 32 bit
10:29 fdobridge: <k​arolherbst🐧🦀> we might want to add a nir alu instruction `nir_alu_shf` or something to express that in nir long term
10:30 fdobridge: <k​arolherbst🐧🦀> ehh `nir_op_shf`
10:30 fdobridge: <k​arolherbst🐧🦀> and lower 64 bit ops to that
10:30 fdobridge: <k​arolherbst🐧🦀> `SHR` and `SHL` are just alias to `SHF`
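A minimal C sketch of the split discussed above, assuming a funnel shift that treats two 32-bit registers as one 64-bit value and returns only the low 32 bits. The function names are hypothetical, and the model is a guess at the behavior described in the chat (mask the amount with `0x3f`, one `SHF` for the low half, one `SHR` for the high half), not a statement of the exact hardware semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a funnel shift right: view hi:lo as one 64-bit
 * value, shift right by `shift` (0..63), return only the low 32 bits.
 * `is_signed` selects sign fill vs. zero fill, loosely like .S64/.U64. */
static uint32_t funnel_shr(uint32_t lo, uint32_t hi, uint32_t shift, int is_signed)
{
    assert(shift < 64);
    uint32_t fill = (is_signed && (hi >> 31)) ? 0xffffffffu : 0u;
    if (shift == 0)
        return lo;
    if (shift < 32)
        return (lo >> shift) | (hi << (32 - shift));
    if (shift == 32)
        return hi;
    /* 32 < shift < 64: remaining bits come from hi plus the fill. */
    return (hi >> (shift - 32)) | (fill << (64 - shift));
}

/* Hypothetical model of a 32-bit shift right that clamps large amounts. */
static uint32_t shr32_clamped(uint32_t x, uint32_t shift, int is_signed)
{
    uint32_t fill = (is_signed && (x >> 31)) ? 0xffffffffu : 0u;
    if (shift >= 32)
        return fill;
    if (shift == 0)
        return x;
    return (x >> shift) | (fill << (32 - shift));
}

/* 64-bit arithmetic shift right built from the two 32-bit operations. */
static int64_t ashr64_split(int64_t value, uint32_t amount)
{
    uint32_t lo = (uint32_t)(uint64_t)value;
    uint32_t hi = (uint32_t)((uint64_t)value >> 32);
    uint32_t s = amount & 0x3f;                  /* the LOP32I.AND step */
    uint32_t new_lo = funnel_shr(lo, hi, s, 1);  /* the SHF.R.S64 step  */
    uint32_t new_hi = shr32_clamped(hi, s, 1);   /* the SHR step        */
    return (int64_t)(((uint64_t)new_hi << 32) | new_lo);
}
```

Every value in the sketch is 32 bits wide except the final repacking, which is the point made in the chat: the 64-bit result only exists as a pair of registers.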
10:42 HdkR: Everyday I'm shuffling?
11:04 karolherbst: mhhh, I'm actually confused how nvk got so far not calling into int64 lowering :D
11:07 karolherbst: HdkR: dunno if you can talk about it, but do you have any information you can share on how to use more than 8 constant buffers in compute shaders? I've seen nvidia generate instructions using the 0xa and 0xc one and I'm kinda confused on how to configure those in compute...
11:09 HdkR: Last I knew you could still only bind 8. Maybe it got resolved in some new generation?
11:09 karolherbst: it generated that code for all gens
11:09 karolherbst: but granted.. I played with PTX 2.1 code and explicit ubo bindings
11:09 karolherbst: the entire result was weird, maybe it's something silly and they live patch it away
11:10 karolherbst: but anyway.. I've seen ptxas generating such kernels
11:10 karolherbst: tried every gen from sm_35 up to sm_80
11:10 karolherbst: mhh
11:10 karolherbst: fermi has this though: NV90C0_BIND_CONSTANT_BUFFER_SHADER_SLOT 12:8
11:10 karolherbst: 5 bits
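For reference, a small C sketch of how a `12:8` field yields 5 bits, assuming the usual inclusive high:low bit notation. The helper names are hypothetical and are not taken from the nouveau headers. A 5-bit field can encode slot indices 0 through 31; whether the hardware accepts all of them is not established by the field width alone.

```c
#include <assert.h>
#include <stdint.h>

/* Width in bits of an inclusive hi:lo field; 12:8 gives 5. */
static inline unsigned field_width(unsigned hi, unsigned lo)
{
    assert(hi >= lo && hi < 32);
    return hi - lo + 1;
}

/* Unshifted mask for the field; this sketch only handles widths below 32. */
static inline uint32_t field_mask(unsigned hi, unsigned lo)
{
    unsigned width = field_width(hi, lo);
    assert(width < 32);
    return (UINT32_C(1) << width) - 1;
}

/* Extract the field value from a 32-bit method word. */
static inline uint32_t field_get(uint32_t word, unsigned hi, unsigned lo)
{
    return (word >> lo) & field_mask(hi, lo);
}

/* Store a value into the field, leaving the other bits untouched. */
static inline uint32_t field_set(uint32_t word, unsigned hi, unsigned lo,
                                 uint32_t value)
{
    uint32_t mask = field_mask(hi, lo);
    assert((value & ~mask) == 0); /* value must fit in the field */
    return (word & ~(mask << lo)) | (value << lo);
}
```

With these helpers, `field_set(0, 12, 8, 0xc)` places slot `0xc` in bits 12:8, which is one of the slot numbers mentioned above.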
11:11 HdkR: Weird. Maybe it does some voodoo
11:11 karolherbst: maybe more than 8 were supported pre kepler
11:12 karolherbst: ptx _had_ 11 binding slots for constant buffers, but it got deprecated with ptx 2.2
11:14 HdkR: Oh, maybe. I never looked at anything older than Maxwell
11:14 karolherbst: ptx 2.0 added support for sm_20 which is fermi
11:14 karolherbst: and I think 2.1 or 2.2 is where kepler was added
11:15 karolherbst: ohh the ptx version number follows the sm number
11:15 karolherbst: convenient
11:15 karolherbst: kepler is sm_30 mhh
11:16 karolherbst: but yeah.. I think it's related to ditching that in kepler. Let me check if we used more than 8 with fermi
11:17 karolherbst: indeed
11:17 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/nouveau/nvc0/nvc0_program.c#L619
11:18 karolherbst: yeah, so that explains it
11:18 karolherbst: I still wonder why they generate that for newer gens, but maybe it also would just cause issues at runtime :D
11:18 HdkR: Neat
11:19 HdkR: I definitely complained about the lack of binding slots in compute. Was kind of hoping it got fixed
11:19 karolherbst: yeah... though it might not matter as much.. dunno
11:21 karolherbst: I'm still confused on how real all of that ubo stuff is, and if they use the storage for other things... but it also sounds like there is this 4k/8k cache and also explicit dram space for all the 18 slots
11:21 HdkR: Since latest hardware has Uniform SSBO access that matches UBO access perf for the most part, it kind of matters less
11:21 HdkR: Obviously can't embed an SSBO slot in an instruction encoding, but whatever
11:21 karolherbst: and there are also bindless UBOs
11:22 karolherbst: soo.. just use those? :P
11:22 HdkR: true
11:22 karolherbst: I still wonder if those just fill the unused UBO slots or if other magic is going on
11:22 karolherbst: the bindless address does contain the size of the ubo
11:23 HdkR: Should be other magic
11:23 karolherbst: maybe
11:24 HdkR: If it works like I think it does anyway :)
11:24 karolherbst: yeah.. I mean the benefit you get here is, that the hardware can load the entire UBO in one go and doesn't have to bother with memory latency anymore
11:25 HdkR: Indeed
11:25 karolherbst: and I suspect LDG.CONSTANT is just some caching optimization
11:26 HdkR: :D
11:26 karolherbst: or maybe uses the same thing
11:26 karolherbst: who knows
11:26 karolherbst: though can't really, because the size isn't known
11:32 HdkR: Just prefetch forward until maximum UBO size or fault. EZ
11:32 karolherbst: well.. I'd choose the page, but yeah :P
11:33 HdkR: Isn't the GPU page size 64KB just like the max UBO size? :D
11:33 karolherbst: good question
11:33 karolherbst: well.. nvidia actually gave us the MMU stuff
11:33 karolherbst: https://nvidia.github.io/open-gpu-doc/pascal/gp100-mmu-format.pdf
11:34 karolherbst: `Dropped support for 128KB Big Pages.` :D
11:34 HdkR: ooo, fancy
11:34 karolherbst: looks like 64kb are second level already
11:34 karolherbst: or something
11:35 karolherbst: I think it mostly exists for PPC
11:39 HdkR: Main thing is that it needs to support 4k because of CPU mapping. So GPU pages will always be second level
12:14 fdobridge: <g​eorgeouzou> The failing CTS tests use 2 queries: one VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT and one VK_QUERY_TYPE_PRIMITIVES_GENERATED_EXT.
12:14 fdobridge: <g​eorgeouzou> For tests that use a single query the tests seem to pass.
12:14 fdobridge: <g​eorgeouzou> If I change the transform feedback's availability semaphore location from PIPELINE_LOCATION_ALL to PIPELINE_LOCATION_STREAMING_OUTPUT then all tests pass
12:30 fdobridge: <g​eorgeouzou> It's like this:
12:30 fdobridge: <g​eorgeouzou> - begin_primitives_generated_query
12:30 fdobridge: <g​eorgeouzou> - begin_xfb_query
12:30 fdobridge: <g​eorgeouzou> - draw
12:30 fdobridge: <g​eorgeouzou> - end_primitives_generated_query
12:30 fdobridge: <g​eorgeouzou> - end_xfb_query
12:30 fdobridge: <g​eorgeouzou> - get_primitives_generated_query
12:30 fdobridge: <g​eorgeouzou> - get_xfb_query
12:30 fdobridge: <g​eorgeouzou> - check if they are equal
12:53 fdobridge: <g​fxstrand> Yeah, idk how we want to structure the NIR ops but I think we do want new NIR ops for this. Shifts are actually kinda painful to split into normal 32-bit shifts
12:53 fdobridge: <k​arolherbst🐧🦀> yeah...
12:53 fdobridge: <k​arolherbst🐧🦀> I could add support for that in codegen as it's literally just two ops instead of one
12:53 fdobridge: <k​arolherbst🐧🦀> I'm under the impression that somebody even wrote the code
12:54 fdobridge: <k​arolherbst🐧🦀> maybe it was pierre? I couldn't find it in my repo
12:54 fdobridge: <g​fxstrand> Yeah, shouldn't be hard.
12:54 fdobridge: <k​arolherbst🐧🦀> let's ping pierre...
12:54 fdobridge: <k​arolherbst🐧🦀> mhh where is pierre anyway
12:54 fdobridge: <g​fxstrand> 🤷🏻‍♀️
12:54 fdobridge: <k​arolherbst🐧🦀> but anyway.. nvk should call into int64 and float64 lowering anyway 😄 we can optimize the shift lowering after that
12:55 fdobridge: <g​fxstrand> Honestly, I'm a bit inclined to just handle it in nv50_ir_from_nir.cpp for now.
12:56 fdobridge: <g​fxstrand> Shifts and adds are really annoying to describe efficiently in NIR.
12:56 fdobridge: <g​fxstrand> Multiply is okay because it's just a like of mul_2x32_64
12:56 fdobridge: <k​arolherbst🐧🦀> I've added a function to do this "early" lowering: LoweringHelper
12:57 fdobridge: <k​arolherbst🐧🦀> should probably just go in there
12:57 fdobridge: <g​fxstrand> And comparisons...
12:58 fdobridge: <g​fxstrand> Really, NV has most of what you want for 64-bit integers. It's just very cleverly shaped to not actually use 64-bit registers most of the time.
12:58 fdobridge: <k​arolherbst🐧🦀> yeah
12:58 fdobridge: <k​arolherbst🐧🦀> I can write the code tomorrow or something
12:59 fdobridge: <g​fxstrand> We need to lower multiply and maybe bit count/shift nonsense.
12:59 fdobridge: <g​fxstrand> I can probably type the code, too.
12:59 fdobridge: <m​ohamexiety> does the HW have 64-bit registers?
12:59 fdobridge: <g​fxstrand> No
12:59 fdobridge: <g​fxstrand> It uses pairs of registers when 64-bit is actually required
13:00 fdobridge: <m​ohamexiety> I see, thanks!
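To illustrate the register-pair idea, here is a minimal C sketch of a 64-bit add carried out on two 32-bit halves with an explicit carry. The `reg_pair` type and function names are hypothetical and only model the concept; they do not correspond to codegen or NIR structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit value held in two 32-bit registers. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} reg_pair;

static reg_pair pair_from_u64(uint64_t v)
{
    reg_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

static uint64_t pair_to_u64(reg_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit add as two 32-bit adds: the low add produces a carry that the
 * high add consumes. */
static reg_pair add64_pair(reg_pair a, reg_pair b)
{
    reg_pair r;
    r.lo = a.lo + b.lo;
    uint32_t carry = r.lo < a.lo; /* unsigned wrap means a carry out */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* Small usage example: the carry crosses from the low to the high half. */
static void add64_pair_example(void)
{
    reg_pair sum = add64_pair(pair_from_u64(0xffffffffu), pair_from_u64(1));
    assert(sum.lo == 0 && sum.hi == 1);
}
```

The carry is the awkward part to express as plain 32-bit ALU ops, which fits the remark above that adds are annoying to describe efficiently.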
13:00 fdobridge: <k​arolherbst🐧🦀> MUL is already handled, no?
13:00 fdobridge: <k​arolherbst🐧🦀> ehh maybe not for volta+
13:01 fdobridge: <g​fxstrand> I don't think so but also the NIR lowering will be what we want for that.
13:01 fdobridge: <k​arolherbst🐧🦀> IMAD has a .WIDE modifier to make it 64 bit
13:02 fdobridge: <k​arolherbst🐧🦀> funky
13:02 fdobridge: <k​arolherbst🐧🦀> but only for the ADD part
13:02 fdobridge: <k​arolherbst🐧🦀> so you have a multiplication of two 32 bit values + 64 bit
13:03 fdobridge: <k​arolherbst🐧🦀> but yeah.. it also has a .HI flag
13:03 fdobridge: <k​arolherbst🐧🦀> so I think we can straight support most of that actually indeed
13:03 fdobridge: <k​arolherbst🐧🦀> well `nir_lower_imul_high64` at least
13:03 fdobridge: <k​arolherbst🐧🦀> (and drop it)
13:04 fdobridge: <k​arolherbst🐧🦀> just if we really want to fight codegen here, because fixing those opts passes is sometimes annoying 😄
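A rough C model of the multiply shapes mentioned above, assuming `.WIDE` means a 32x32 multiply with a 64-bit addend and `.HI` means the upper 32 bits of the 32x32 product. The function names are hypothetical, and the 64x64 decomposition is the textbook one, not necessarily what codegen or the NIR lowering emits.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a wide multiply-add: 32 x 32 -> 64, plus a 64-bit addend. */
static uint64_t imad_wide_u32(uint32_t a, uint32_t b, uint64_t c)
{
    return (uint64_t)a * b + c;
}

/* Model of a high multiply: upper 32 bits of the 32 x 32 product. */
static uint32_t imul_hi_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Low 64 bits of a 64 x 64 multiply using only 32-bit inputs.
 * x*y mod 2^64 = xl*yl + (((xl*yh + xh*yl) mod 2^32) << 32) */
static uint64_t mul64_lo(uint64_t x, uint64_t y)
{
    uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
    uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);
    uint32_t cross = xl * yh + xh * yl; /* wraps at 32 bits on purpose */
    return imad_wide_u32(xl, yl, (uint64_t)cross << 32);
}

/* Small usage example comparing against the native 64-bit multiply. */
static void mul64_example(void)
{
    uint64_t x = UINT64_C(0x0000000100000001);
    uint64_t y = UINT64_C(0x0000000200000003);
    assert(mul64_lo(x, y) == x * y);
    assert(imul_hi_u32(0xffffffffu, 0xffffffffu) == 0xfffffffeu);
}
```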
13:17 fdobridge: <g​fxstrand> I don't care too much about fixing codegen at the moment. I care a bit about shifts because this seems to be the #1 source of faults right now which is impacting CTS stability.
13:19 fdobridge: <k​arolherbst🐧🦀> right... I think the code probably handles shifts properly enough so we can lower 64 bit ones
13:19 fdobridge: <k​arolherbst🐧🦀> just the other bits.. not so much
13:25 fdobridge: <g​fxstrand> I need to find myself another intern... There's something I've wanted since about forever and haven't taken the time to write: a NIR opcode hardware fuzzer.
13:25 fdobridge: <k​arolherbst🐧🦀> ahh, would be a cool project
13:27 fdobridge: <m​ohamexiety> ~~_I mean, I am still here..._~~
13:27 fdobridge: <m​ohamexiety> :p
13:27 fdobridge: <g​fxstrand> Basic idea would be to back-door either GLSL or SPIR-V somehow to let you run arbitrary NIR ALU ops and then write a bit of GL or Vulkan code to drive the front-end. I want to be able to fuzz the shit out of opcodes and find the corners.
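The fuzzing idea can be sketched on the CPU, assuming the "hardware" side is replaced by a candidate C function. Everything here is hypothetical scaffolding: a real version would run the op through the driver via GL or Vulkan instead of calling `impl` directly. The sketch compares a reference 64-bit left shift against a split 32-bit version and against a deliberately broken one that only produces 32 bits of result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*binop64)(uint64_t a, uint32_t b);

/* Reference: native 64-bit shift with the amount masked to 6 bits. */
static uint64_t ref_shl64(uint64_t a, uint32_t b) { return a << (b & 63); }

/* Candidate: the same shift built from 32-bit halves. */
static uint64_t split_shl64(uint64_t a, uint32_t b)
{
    uint32_t lo = (uint32_t)a, hi = (uint32_t)(a >> 32), s = b & 63;
    uint32_t new_lo, new_hi;
    if (s == 0) {
        new_lo = lo;
        new_hi = hi;
    } else if (s < 32) {
        new_lo = lo << s;
        new_hi = (hi << s) | (lo >> (32 - s));
    } else {
        new_lo = 0;
        new_hi = lo << (s - 32);
    }
    return ((uint64_t)new_hi << 32) | new_lo;
}

/* Deliberately broken: keeps only the low 32 bits of the result. */
static uint64_t buggy_shl64(uint64_t a, uint32_t b)
{
    return (uint32_t)(a << (b & 63));
}

/* xorshift64 PRNG; state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Run corner cases and random inputs; return the number of mismatches. */
static size_t fuzz_binop(binop64 ref, binop64 impl, uint64_t seed, size_t iterations)
{
    static const uint64_t corners[] = {
        0, 1, UINT64_C(0xffffffff), UINT64_C(0x100000000),
        UINT64_C(0x8000000000000000), UINT64_MAX,
    };
    size_t mismatches = 0;
    assert(ref && impl);

    /* Corner values against every shift amount, including out-of-range. */
    for (size_t i = 0; i < sizeof(corners) / sizeof(corners[0]); i++)
        for (uint32_t s = 0; s < 128; s++)
            mismatches += ref(corners[i], s) != impl(corners[i], s);

    uint64_t state = seed ? seed : 1;
    for (size_t i = 0; i < iterations; i++) {
        uint64_t a = xorshift64(&state);
        uint32_t b = (uint32_t)xorshift64(&state);
        mismatches += ref(a, b) != impl(a, b);
    }
    return mismatches;
}
```

Note that `buggy_shl64(1, 4)` still returns the right answer, so a single test of that shape would not catch it; the corner sweep does.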
13:29 fdobridge: <g​fxstrand> You are but I could probably keep you going for a while without really getting into compiler stuff.
13:29 fdobridge: <g​fxstrand> Unless you want to. 🤷🏻‍♀️
13:33 fdobridge: <m​ohamexiety> ah. yeah that's fine too, thought there wasn't anything else
14:06 fdobridge: <k​arolherbst🐧🦀> @gfxstrand https://gist.githubusercontent.com/karolherbst/2ba57ee0e64dc834d4c25f7235fd0aa7/raw/38332db7732e72b2228e76f616ef65a991cac287/tmp.patch
14:06 fdobridge: <k​arolherbst🐧🦀> I'll clean it up tomorrow
14:06 fdobridge: <k​arolherbst🐧🦀> and I hope I didn't mess up
14:06 fdobridge: <k​arolherbst🐧🦀> at least the test still passes with the old uapi here
14:20 fdobridge: <g​eorgeouzou> Is there a way to move MRs from nouveau to mesa? Or do I need to close it and open a new one?
14:20 fdobridge: <k​arolherbst🐧🦀> I think you can retarget it
14:21 fdobridge: <k​arolherbst🐧🦀> maybe not...
14:22 fdobridge: <g​eorgeouzou> it seems that i can change the target branch but on the same repo
14:26 fdobridge: <g​fxstrand> Yeah, you probably have to recreate it. 🫤
14:26 fdobridge: <g​fxstrand> You can move issues but I don't know about MRs.
14:30 fdobridge: <g​eorgeouzou> Ok thanks!
14:55 fdobridge: <g​fxstrand> Test passes. No idea if that's actually correct, though, and the test is just shifting 1 by 4 bits to the left so it wouldn't exhibit all the possible bugs.
14:58 fdobridge: <k​arolherbst🐧🦀> yeah.. but at least we don't emit 64 bit sources/dests on `SHF` anymore.. I'll clean up the patch and send it out tomorrow
14:58 fdobridge: <k​arolherbst🐧🦀> could run the CTS on it, I don't think the patch is wrong and my clean ups won't change the result
15:00 fdobridge: <g​fxstrand> Yeah, you could always throw rusticl at it and the CL ALU tests for shift
15:00 fdobridge: <k​arolherbst🐧🦀> ahh.. good idea actually
15:00 fdobridge: <k​arolherbst🐧🦀> thing is.. gallium calls int64 lowering, so I'd have to disable that first 🙂
15:01 fdobridge: <g​fxstrand> I'm running Vulkan CTS right now but I don't have int64 enabled so I'm not running the shift tests
15:01 fdobridge: <g​fxstrand> You can control what gallium lowers
15:01 fdobridge: <g​fxstrand> It's in the `nir_shader_compiler_options`
15:01 fdobridge: <k​arolherbst🐧🦀> yeah. I know 🙂
15:01 fdobridge: <k​arolherbst🐧🦀> I'd disable shift lowering if everything is fine with that patch
15:02 fdobridge: <k​arolherbst🐧🦀> anyway.. I've sent my initial draft for CL prog variables in case you feel bored next week 😄
15:03 fdobridge: <k​arolherbst🐧🦀> it was surprisingly easy when ignoring init/fini kernels
15:04 fdobridge: <m​henning> I think this actually has some similarities to the gk20a lowering, although I haven't compared too carefully https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_lowering_nvc0.cpp#L268 (edited)
15:04 fdobridge: <k​arolherbst🐧🦀> yeah.. it's mostly the same
15:04 fdobridge: <k​arolherbst🐧🦀> gk20a+ just use OP_SHL/OP_SHR opcodes instead of OP_SHF
15:04 fdobridge: <k​arolherbst🐧🦀> so maybe the solution here is to streamline everything
15:05 fdobridge: <k​arolherbst🐧🦀> but that's part of "cleanup"
15:05 fdobridge: <m​henning> Fair enough
15:05 fdobridge: <k​arolherbst🐧🦀> just wanted to give Faith something to verify it fixes it
15:05 fdobridge: <g​eorgeouzou> just got a freeze on deqp-runner
15:06 fdobridge: <g​eorgeouzou> with the new uapi
15:06 fdobridge: <k​arolherbst🐧🦀> I did try to use the gk20a lowering, but I ran into more issues and this was simply the path of least resistance 😄
15:15 fdobridge: <g​fxstrand> This dEQP run already seems much happier than previous runs. Also, maybe even a bit faster. 🧐
15:15 fdobridge: <k​arolherbst🐧🦀> well.. less crashes and all that 😛
15:15 fdobridge: <g​fxstrand> Yup
15:16 fdobridge: <k​arolherbst🐧🦀> anyway, this was some evil undefined behavior thing 🙃 I wouldn't be surprised if enabling int64 lowering fixes more things on the side
15:30 fdobridge: <g​eorgeouzou> xx
15:56 fdobridge: <g​fxstrand> Yeah, probably. And I should do that, too. I just wanted to figure out why this fairly obvious thing wasn't working.
15:57 fdobridge: <g​fxstrand> Also, the fact that codegen attempts to compile something clearly wrong is a little disturbing. 😕
15:58 fdobridge: <g​eorgeouzou> with the new uapi
15:58 fdobridge: <g​eorgeouzou> https://pastebin.com/8vt47ELr (edited)
15:59 fdobridge: <k​arolherbst🐧🦀> mhh.. yeah well.. codegen has almost zero validation of anything
16:33 fdobridge: <g​fxstrand> `Pass: 400879, Fail: 1074, Crash: 86, Skip: 1730149, Timeout: 2, Flake: 1180, Duration: 1:20:58`
16:36 fdobridge: <m​ohamexiety> nice! most crashes gone now
16:37 fdobridge: <g​fxstrand> Yeah, still seeing some texture tests crashing and IDK why.
16:37 fdobridge: <g​fxstrand> And a bunch of the cross-device synchronization tests fail
16:37 fdobridge: <g​fxstrand> I suspect the latter causes the former
16:42 fdobridge: <g​fxstrand> I added a lower_int64 call and I'm running that now. If it passes, I'll hand it to Marge.
16:43 fdobridge: <d​adschoorse> MR Label Maker needs to learn about the nvk label
16:57 fdobridge: <g​fxstrand> Yeah, it does. IDK who's in charge of that
17:13 fdobridge: <g​fxstrand> Really should figure out what's going on with the 1k flakes...
17:14 fdobridge: <d​adschoorse> looks like it may be as simple as opening a MR that changes <https://gitlab.freedesktop.org/freedesktop/mr-label-maker/-/blob/main/mr_label_maker/mesa.py>?
17:22 fdobridge: <g​fxstrand> https://gitlab.freedesktop.org/freedesktop/mr-label-maker/-/merge_requests/15
17:22 fdobridge: <g​fxstrand> Feel free to double-check I didn't mistype any path names and RB it.
18:22 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> The int64 change got merged when I was looking at it :triangle_nvk:
19:48 fdobridge: <k​arolherbst🐧🦀> there is this repo in the mr label maker description you can add rules to
19:48 fdobridge: <k​arolherbst🐧🦀> ahh.. you already did
19:49 fdobridge: <a​irlied> Also we use the vma alloc in low addr mode, hoping to fix it with NAK
19:55 fdobridge: <a​irlied> For some reason that msg didn't send yesterday
20:01 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> Is NIR lowering the cheat code of codegen? :nouveau:
20:02 fdobridge: <g​fxstrand> No, not using enough NIR is the failure of codegen. 😂
20:08 fdobridge: <g​fxstrand> Uh... If you set that bit, it got lost somewhere.
20:09 fdobridge: <g​fxstrand> Bah! Lost it in 0b6afbc407fb4a08ce5cdd234b729db662b944fe
20:10 fdobridge: <g​fxstrand> I'm going to set it again and see if it fixes things
20:17 fdobridge: <a​irlied> oh yeah that might explain some pain
20:18 fdobridge: <a​irlied> just retargeted conditional rendering https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24520
20:19 fdobridge: <g​fxstrand> Running now.
20:19 fdobridge: <g​fxstrand> I thought there was a regression but I didn't know why. 😅
22:41 fdobridge: <g​fxstrand> `Pass: 400714, Fail: 1084, Crash: 87, Skip: 1730145, Timeout: 2, Flake: 1338, Duration: 1:26:00`
22:45 fdobridge: <g​fxstrand> So, not much improvement now that 64-bit lowering is on
22:46 fdobridge: <!​[NVK Whacker] Echo (she) 🇱🇹> How close are we to "no failures, no flakes, it just passe-"? :triangle_nvk:
22:48 fdobridge: <g​fxstrand> A month or two of debugging. Oh, and half a compiler. 😛
23:03 fdobridge: <a​irlied> I definitely wouldn't go for total polish with codegen, seems like wasted effort
23:11 fdobridge: <g​fxstrand> Yeah
23:15 fdobridge: <a​irlied> I assume 1.1 is probably NAK blocked if we need to get proper subgroups
23:18 fdobridge: <a​irlied> also scalarBlockLayout which zink quite wants
23:19 fdobridge: <g​fxstrand> Yeah
23:19 fdobridge: <g​fxstrand> And 1.2 really wants memory model
23:19 fdobridge: <g​fxstrand> (Not actually required until 1.3)