00:26 fdobridge_: <g​fxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26617
04:14 fdobridge_: <g​fxstrand> Okay, so it looks like ldc doesn't like to do indirects with anything but 8 and 32-bit. 64-bit throws misaligned addr and I've had problems with 16-bit, too.
04:15 fdobridge_: <g​fxstrand> @karolherbst Do you have docs that could put a finer point on that understanding?
04:25 fdobridge_: <g​fxstrand> @asdqueerfromeu FYI: I just made you a "Reporter" in Mesa so you can now apply labels to issues (not sure about MRs) and you're a bit easier to tag. IDK how useful that'll be but you've been trying out and reporting enough stuff that it seemed like a good idea. (edited)
05:24 fdobridge_: <g​fxstrand> Okay, now I'm very confused. More misaligned addr errors and all my LDCs are `ldc.b32`. 🙃
05:28 fdobridge_: <g​fxstrand> Ugh... Okay, it appears that I have an instruction dep bug. `NAK_DEBUG=serialize` makes some of my problems go away. 😩
05:36 fdobridge_: <g​fxstrand> Maybe instructions that read cbufs have variable latency?
05:46 fdobridge_: <g​fxstrand> That doesn't seem to fix it. Maybe something weird with LDC? IDK.
05:47 fdobridge_: <g​fxstrand> Bumping all latencies to 15 doesn't do anything either
07:04 fdobridge_: <!​DodoNVK (she) 🇱🇹> Does nouveau KMD support cache snooping? :nouveau:
07:05 fdobridge_: <!​DodoNVK (she) 🇱🇹> Apparently it's needed for memory to be both cached and coherent
07:13 fdobridge_: <!​DodoNVK (she) 🇱🇹> More precisely I get `ROBUST_CHANNEL_FIFO_ERROR_MMU_ERR_FLT`
07:15 fdobridge_: <!​DodoNVK (she) 🇱🇹> :cursedgears:
07:15 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1183305926200004668/message.txt?ex=6587da85&is=65756585&hm=7b523cccb8137d730e3cbc3a59f208c2fccd17a004cbfe1553f87e89df7c3408&
07:36 diagonal3x: There are open course MIT lectures, now one interesting lecture there is a data structure called van Emde Boas trees, i also looked FFT, awesome academics teacher is doing them on board and the lectures are recorded with video.
07:50 diagonal3x: So we can do a lot yet with memory and performance, it looks similar to fusion and sardine trees. So you can insert delete and successor query
08:03 fdobridge_: <!​DodoNVK (she) 🇱🇹> The test that causes it is `test_uav_counter_null_behavior_dxbc` (and the `dxil` version too)
08:28 fdobridge_: <!​DodoNVK (she) 🇱🇹> Here's a singlethreaded test run (for some reason I don't get a CPU hang anymore)
08:28 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1183324229601599508/message.txt?ex=6587eb91&is=65757691&hm=b079ae2de936a6d628664c23844fbbb1f632b973f089b1931886b78b285a2dff&
09:27 diagonal3x: i am on 50 min of that lecture, but that is just for data access, my version is algorithmic structures, it generates the data access constants, and the time is logloglogu imo, cause there isn't any not even one recursive call, but they explain that lowestbound is loglogu or n
09:28 diagonal3x: its inherent to the time complexity, they just assume we are working on arrays
09:41 diagonal3x: cause operation system is indeed working on array , so that makes sense, cause it just handles gazillion of structures and it just iterates over program counter anyhow
09:41 diagonal3x: so lower bound in time complexity is the one they introduced which is loglogn
10:07 diagonal3x: https://en.wikipedia.org/wiki/Fusion_tree actually 2014 they already had logµn time
10:08 diagonal3x: but this is not yet reflected on this lecture
10:09 diagonal3x: 7 years ago was that lecture so 2016, two years earlier they proposed a new time domain access
11:26 fdobridge_: <k​arolherbst🐧🦀> sadly no
12:43 fdobridge_: <k​arolherbst🐧🦀> however.. I have examples of `LDC.64` using registers
12:44 fdobridge_: <k​arolherbst🐧🦀> ehh wait.. they use `RZ` there.. mhh
12:46 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand actually... I'll play around with OpenCL to PTX to SASS stuff as this should make it easy to figure out what they are doing there :ferrisUpsideDown:
12:51 fdobridge_: <k​arolherbst🐧🦀> `LDC.64.IL R2, c[0x0][R2]`
12:51 fdobridge_: <k​arolherbst🐧🦀> `LDC.U16.IL R5, c[0x0][R0]`
12:52 fdobridge_: <k​arolherbst🐧🦀> `LDC.U16.IL R5, c[0x0][R5+0x6]`
12:53 fdobridge_: <k​arolherbst🐧🦀> `LDC.64.IL R2, c[0x0][R2+0x18]`
12:53 fdobridge_: <k​arolherbst🐧🦀> that's all that nvidia produces
13:22 fdobridge_: <k​arolherbst🐧🦀> okay.. vulkaninfo at least doesn't crash on volta
13:23 fdobridge_: <!​DodoNVK (she) 🇱🇹> Now try the cube
13:23 fdobridge_: <k​arolherbst🐧🦀> I just go straight to the VTS
13:23 fdobridge_: <k​arolherbst🐧🦀> *CTS
13:24 fdobridge_: <k​arolherbst🐧🦀> mhh `Error: Finding QPA test start delimiter for dEQP-VK.info.device`
13:24 fdobridge_: <!​DodoNVK (she) 🇱🇹> Vendor Test Suite 🐸
13:24 fdobridge_: <k​arolherbst🐧🦀> maybe I shall try the cube first
13:25 fdobridge_: <k​arolherbst🐧🦀> okay uhhh
13:25 fdobridge_: <k​arolherbst🐧🦀> I think GL is broken 🥲
13:25 fdobridge_: <k​arolherbst🐧🦀> or uhm...
13:25 fdobridge_: <k​arolherbst🐧🦀> maybe it's kms
13:25 fdobridge_: <k​arolherbst🐧🦀> everything is yellow 😄
13:25 fdobridge_: <k​arolherbst🐧🦀> the heck
13:26 fdobridge_: <k​arolherbst🐧🦀> yeah.. okay.. kernel bug
13:27 fdobridge_: <k​arolherbst🐧🦀> let's try 6.5 and 6.4 just to be sure
13:27 fdobridge_: <k​arolherbst🐧🦀> funky...
13:28 fdobridge_: <k​arolherbst🐧🦀> it's also yellow on 6.5 and 6.4
13:28 fdobridge_: <k​arolherbst🐧🦀> but it's okay in grub
13:28 fdobridge_: <k​arolherbst🐧🦀> *sigh*
13:29 karolherbst: airlied, Lyude: kms is entirely broken on my GV100, like.. it's all yellow :)
13:29 karolherbst: like... not solid yellow, but just well.. only yellow
13:38 fdobridge_: <k​arolherbst🐧🦀> anyway.. the CTS doesn't run with the runner...
14:05 diagonal3x: http://web.stanford.edu/class/archive/cs/cs166/cs166.1196/lectures/16/Small16.pdf stanford deals with it too,they call patricia codes such compression. But i go step further i pack alus too.
14:21 diagonal3x: I think that is a bit party ruining to go straight into highest performance, but it gives benefits too for users,people are bored they want to communicate and hold down to some set of dogmas. HOWEVER i try to roll public domain implementation of everything and make a product.
15:17 fdobridge_: <g​fxstrand> Well, there's two directions snooping can go. I think generally PCI devices can snoop the host CPU cache. IDK if nouveau maps them that way, though. If so, GART memory should be both host visible and host cached. Then there's where the host can see through GPU caches. I think only Intel can do that.
15:19 fdobridge_: <g​fxstrand> Wait, what?!? Those are all the sizes I've had trouble with. 😭
15:29 fdobridge_: <!​DodoNVK (she) 🇱🇹> I wanted to set both HOST_CACHED and HOST_COHERENT flags for the GTT heap
15:32 fdobridge_: <g​fxstrand> Yeah, I'm not sure we can.
15:35 fdobridge_: <g​fxstrand> Well, we probably get a write-combined map which isn't quite the same as fully cached.
15:35 fdobridge_: <g​fxstrand> We really should have a memory property for write-combined but we don't.
15:36 fdobridge_: <!​DodoNVK (she) 🇱🇹> So should I make two memory types for GTT (one with HOST_VISIBLE_BIT and HOST_COHERENT_BIT and the other with HOST_VISIBLE_BIT and HOST_CACHED_BIT)? 🐸
15:36 fdobridge_: <g​fxstrand> WC maps are coherent, fast to write and slow AF to read. They're what you want for data upload.
15:37 fdobridge_: <!​DodoNVK (she) 🇱🇹> RADV sets HOST_VISIBLE, HOST_CACHED and HOST_COHERENT bits for the GTT memory type
15:38 fdobridge_: <g​fxstrand> For GART memory, we want CACHED+VISIBLE+COHERENT.
15:39 fdobridge_: <g​fxstrand> (assuming system memory BOs are cached which I'm pretty sure they are. We should double-check with @airlied or dakr, though )
15:41 fdobridge_: <g​fxstrand> For VRAM, I think we can say VISIBLE+COHERENT but we probably want two different types, one with host and one without and set the MMAP flag when allocating accordingly.
15:42 fdobridge_: <!​DodoNVK (she) 🇱🇹> gamescope needs VISIBLE and DEVICE_LOCAL memory (while vkd3d-proton needs VISIBLE and CACHED memory)
15:43 fdobridge_: <g​fxstrand> Yeah, I think the above should satisfy both. DXVK really wants DEVICE_LOCAL+VISIBLE, too. Data uploads will go way faster if we turn that on.
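A minimal sketch of the memory-type layout discussed above, in plain Python. The flag values match `VkMemoryPropertyFlagBits` in the Vulkan headers; the three-entry table and the `find_memory_type` helper are hypothetical illustrations of the proposal, not NVK's actual table.

```python
from enum import IntFlag


class MemProp(IntFlag):
    # Values match VkMemoryPropertyFlagBits in the Vulkan headers.
    DEVICE_LOCAL = 0x1
    HOST_VISIBLE = 0x2
    HOST_COHERENT = 0x4
    HOST_CACHED = 0x8


# Hypothetical layout following the discussion above (an assumption).
MEMORY_TYPES = [
    # VRAM that is never mapped by the host.
    MemProp.DEVICE_LOCAL,
    # VRAM that can be mapped (VISIBLE+COHERENT).
    MemProp.DEVICE_LOCAL | MemProp.HOST_VISIBLE | MemProp.HOST_COHERENT,
    # GART / system memory (CACHED+VISIBLE+COHERENT).
    MemProp.HOST_VISIBLE | MemProp.HOST_COHERENT | MemProp.HOST_CACHED,
]


def find_memory_type(required, type_bits=~0):
    """Return the first allowed type index containing every required flag."""
    for i, props in enumerate(MEMORY_TYPES):
        if type_bits & (1 << i) and (props & required) == required:
            return i
    return None
```

With this table, a request for DEVICE_LOCAL+VISIBLE (gamescope, DXVK) lands on type 1 and a request for VISIBLE+CACHED (vkd3d-proton) lands on type 2.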
15:44 fdobridge_: <!​DodoNVK (she) 🇱🇹> I added the CACHED bit to the GTT type and vkd3d-proton tests have far fewer failures
16:05 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26621
16:12 fdobridge_: <g​fxstrand> @asdqueerfromeu Beat me by 6 minutes
16:12 fdobridge_: <g​fxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26622
16:16 fdobridge_: <g​fxstrand> As for tegra, I honestly have no idea. It's not a PCI device so IDK what the snooping story is. A lot of iGPUs need uncached memory but sometimes they can snoop. I'll need to chat with dakr and @airlied about it. I've pinged them on my MR and hopefully we can get our story straight.
16:16 fdobridge_: <g​fxstrand> @asdqueerfromeu It's long past time we sorted out heaps so thanks for bringing it up again.
16:17 HdkR: Tegra changes depending on the SoC
16:17 fdobridge_: <g​fxstrand> Of course it does
16:17 HdkR: Some of the newer ones can have the GPU snoop the CPU caches, but not the other way around
16:17 fdobridge_: <g​fxstrand> Do we have that information exposed through the UAPI somehow so we can configure the bits properly?
16:18 fdobridge_: <g​fxstrand> Yeah, I didn't figure. Only Intel is crazy enough to have a shared cache.
16:18 fdobridge_: <g​fxstrand> Which is awesome, BTW. It's one of the few really nice things about working on Intel GPUs.
16:18 HdkR: You /might/ get lucky with Orin, but does anyone actually care about that SoC? :)
16:19 HdkR: Grace is probably coherent both ways, but no consumer is buying that
16:19 fdobridge_: <g​fxstrand> What I want is the ability to control snooped vs. uncached. We want uncached for more "device local" things and cached for mapped things.
16:20 fdobridge_: <g​fxstrand> That makes sense for a compute-focused SoC.
16:20 HdkR: Hopefully when you program the MMIO those bits are controlled explicitly
16:20 HdkR: er...IOMMU
16:21 fdobridge_: <g​fxstrand> Yeah, and NVK isn't really advertised as working on Tegra right now so I'm not too worried about it.
16:21 fdobridge_: <g​fxstrand> I should probably throw Tegra behind the BROKEN_VULKAN_DRIVER flag
16:22 HdkR: Only support the Tegra X1 and you've already covered 90% of the userbase :P
16:23 fdobridge_: <!​DodoNVK (she) 🇱🇹> The Nintendo Switch definitely increases userbase a lot
16:24 fdobridge_: <g​fxstrand> Yeah. I want to support Tegra eventually. Just not today.
16:25 fdobridge_: <k​arolherbst🐧🦀> did you try it with `.IL`?
16:25 fdobridge_: <k​arolherbst🐧🦀> at this point, everything can be relevant 😄
16:25 fdobridge_: <k​arolherbst🐧🦀> maybe there are some mmio regs we also have to flip.. or something on the header... or maybe compute is different?
16:25 fdobridge_: <k​arolherbst🐧🦀> who knows
16:26 fdobridge_: <g​fxstrand> Yeah. I think I need to sort out my instruction deps bug first
16:28 fdobridge_: <k​arolherbst🐧🦀> also.. might be worth putting an `iand` in front of those `LDC` just to verify the alignment is indeed correct
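A minimal sketch of that alignment check, written in plain Python rather than NIR or SASS. The helper names are hypothetical, and the idea that `LDC` wants natural alignment (2 bytes for `.U16`, 8 bytes for `.64`) is an assumption the log does not confirm.

```python
def align_down(offset: int, size: int) -> int:
    """Clear the low bits so offset becomes a multiple of size.

    size must be a power of two; this is the AND (iand) mask trick.
    """
    assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
    return offset & ~(size - 1)


def is_aligned(offset: int, size: int) -> bool:
    """True when masking changes nothing, i.e. offset is already aligned."""
    return align_down(offset, size) == offset
```

The immediates in the NVIDIA-generated examples above are consistent with this: `0x6` is 2-byte aligned for `LDC.U16` and `0x18` is 8-byte aligned for `LDC.64`. If a masked load stops faulting where the unmasked one faults, the offset itself was misaligned.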
16:29 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand anyway, any idea why deqp-runner refuses to run? Also.. in case it's a CTS thing, what git commit are you on?
16:29 fdobridge_: <g​fxstrand> Are you setting `NVK_I_WANT_A_BROKEN_VULKAN_DRIVER=1`?
16:30 fdobridge_: <g​fxstrand> I'm on `vulkan-cts-1.3.7.0`
16:30 fdobridge_: <k​arolherbst🐧🦀> I do
16:31 fdobridge_: <g​fxstrand> Does `./deqp-vk -n dEQP-VK.api.smoke.triangle` work?
16:31 fdobridge_: <k​arolherbst🐧🦀> let's try that tag and see if it works
16:32 fdobridge_: <k​arolherbst🐧🦀> ~~no~~ yes
16:32 fdobridge_: <k​arolherbst🐧🦀> `assertion failed: sm >= 75`... but that's the part I will deal with 😄
16:32 fdobridge_: <g​fxstrand> hehe
16:32 fdobridge_: <k​arolherbst🐧🦀> but yeah.. that's on gfxstrand/nvk/volta
16:33 fdobridge_: <g​fxstrand> I think most `sm >= 75` can become `sm >= 70` except for the parts about UGPRs.
16:33 fdobridge_: <g​fxstrand> Oh, and Volta doesn't have IMnMx for some reason
16:33 fdobridge_: <k​arolherbst🐧🦀> yeah.... I'll figure all those things out 🙂
16:35 fdobridge_: <g​fxstrand> Hrm... I don't think the CTS is liking that branch... IDK why. It's not like we don't mmap VRAM all over the driver internally.
16:36 fdobridge_: <g​fxstrand> Maybe we're hitting BAR limits that the kernel doesn't know how to handle gracefully?
16:44 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand mhhh.. same error when running the tag you are using..
16:44 fdobridge_: <k​arolherbst🐧🦀> I'm sure it's something silly and me running it through SSH...
16:46 fdobridge_: <k​arolherbst🐧🦀> ehh wait
16:47 fdobridge_: <k​arolherbst🐧🦀> it was something silly 🙂
16:47 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand I used the GL deqp runner instead of the vk one 🥲
16:47 fdobridge_: <k​arolherbst🐧🦀> (in my script)
16:48 fdobridge_: <k​arolherbst🐧🦀> it "works": `Pass: 5, Crash: 45, Skip: 450, Duration: 31, Remaining: 70:00:46`
16:49 fdobridge_: <k​arolherbst🐧🦀> dmesg going crazy with that assert fixed 🥲
16:50 fdobridge_: <k​arolherbst🐧🦀> I suspect it's the shader header thing...
16:56 fdobridge_: <k​arolherbst🐧🦀> okay.. works with codegen
16:58 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand Volta uses the old header format btw
16:58 fdobridge_: <k​arolherbst🐧🦀> meaning its size is 0x50, not 0x80
17:01 fdobridge_: <k​arolherbst🐧🦀> @marysaka are the shader header changes extractable from your nak maxwell support stuff?
17:02 fdobridge_: <k​arolherbst🐧🦀> would be a good idea to land those independently as volta needs it 🙂
17:02 fdobridge_: <m​arysaka> didn't Faith merge the branch into main now?
17:02 fdobridge_: <k​arolherbst🐧🦀> ohh..
17:02 fdobridge_: <m​arysaka> but I don't think I have any specific changes for Maxwell on SPH :aki_thonk:
17:02 fdobridge_: <k​arolherbst🐧🦀> maybe I need to rebase Faith's volta branch then
17:02 fdobridge_: <k​arolherbst🐧🦀> size and offset
17:02 fdobridge_: <k​arolherbst🐧🦀> the shader instructions need to be aligned, not the header 🙂
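A minimal sketch of "align the instructions, not the header", in plain Python. The header sizes 0x50 and 0x80 come from the discussion above; the 0x80 code alignment and the `place_shader` helper are assumptions made for illustration.

```python
def place_shader(base: int, header_size: int, code_align: int):
    """Return (header_addr, code_addr) for a header directly followed by code.

    The code start is rounded up to code_align, and the header is placed
    immediately before it, so the header itself may end up unaligned.
    """
    assert code_align > 0 and code_align & (code_align - 1) == 0
    code_addr = (base + header_size + code_align - 1) & ~(code_align - 1)
    return code_addr - header_size, code_addr
```

Under these assumptions a 0x50-byte header starts at offset 0x30 so that the first instruction lands on 0x80, while a 0x80-byte header needs no padding.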
17:03 fdobridge_: <k​arolherbst🐧🦀> uhhh.. I have to disable clippy with nak :blobcatnotlikethis:
17:05 fdobridge_: <!​DodoNVK (she) 🇱🇹> gamescope needs the KHR_present_id/wait stuff (this should be easy but requires adding DRI option support)
17:07 fdobridge_: <!​DodoNVK (she) 🇱🇹> :cursedgears:
17:07 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1183454980414382202/Screenshot_20231210_190623.png?ex=65886556&is=6575f056&hm=4d353424071d0a519d9f594b38b996b0e1b05fcc3a823a3167495bd2feb1d3e4&
17:07 fdobridge_: <k​arolherbst🐧🦀> @marysaka but yeah.. it's in main 🙂
17:08 fdobridge_: <e​sdrastarsis> @prop_energy_ball
17:09 fdobridge_: <p​rop_energy_ball> well that's in common code
17:09 fdobridge_: <p​rop_energy_ball> you just need plasma 6 for that (edited)
17:10 fdobridge_: <!​DodoNVK (she) 🇱🇹> I'm on Plasma 5
17:11 Lyude: karolherbst: i can look into it, not super surprised
17:12 Lyude: really few people have access to gv100
17:12 Lyude: or at least ones who would run nouveau on a desktop :P
17:15 karolherbst: :D
17:26 fdobridge_: <p​rop_energy_ball> probably needs some stuff for foreign queue transitions or whatever
17:26 fdobridge_: <p​rop_energy_ball> although that looks suspiciously linear
17:26 fdobridge_: <p​rop_energy_ball> so perhaps modifiers arent hooked up too?
17:27 fdobridge_: <g​fxstrand> I don't think much actually changes there. Just the version. All the stuff after 0x50 should be zero pre-Turing anyway. It's just used for barycentrics and stuff like that.
17:28 fdobridge_: <k​arolherbst🐧🦀> it's needed for proper alignment of the shader
17:28 fdobridge_: <k​arolherbst🐧🦀> but yeah...
17:28 fdobridge_: <k​arolherbst🐧🦀> I rebased your branch on main and will go from there
17:32 fdobridge_: <k​arolherbst🐧🦀> mhh now I'm getting ILLEGAL_INSTR_ENCODING: ')
17:33 fdobridge_: <!​DodoNVK (she) 🇱🇹> NVK doesn't really use the NVIDIA-specific modifiers (and they can't be used anyway in PRIME setups)
17:33 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand what env vars do I have for debugging?
17:34 fdobridge_: <k​arolherbst🐧🦀> (and could somebody document them inside `envvars.rst`)
17:34 fdobridge_: <g​fxstrand> `NAK_DEBUG=print`
17:35 fdobridge_: <g​fxstrand> There's also `serialize` but I don't think that changed between Volta and Turing.
17:37 fdobridge_: <!​DodoNVK (she) 🇱🇹> If I use my AMD GPU to run gamescope and NVIDIA GPU to run vkcube then vkcube looks fine
17:37 fdobridge_: <g​fxstrand> Yeah, I have a very old branch for modifiers. @mohamexiety is going to work on it after he gets sparse working.
17:37 fdobridge_: <p​rop_energy_ball> cool
17:38 fdobridge_: <g​fxstrand> I think I already got most of the typing but it's been a bit since I've looked at the branch. The big hole is render to linear.
17:38 fdobridge_: <p​rop_energy_ball> When you get modifiers + foreign queue working properly you will have officially beaten the NVIDIA driver in terms of being able to actually run SteamOS
17:38 fdobridge_: <p​rop_energy_ball> which is kinda depressing
17:39 fdobridge_: <k​arolherbst🐧🦀> yeah... it's the header :ferrisUpsideDown:
17:39 fdobridge_: <m​ohamexiety> is it a matter of the hardware not playing well with that stuff, or is it that they simply don't care?
17:39 fdobridge_: <p​rop_energy_ball> Maybe I should peep at HDR signalling for nouveau if it isn't already done
17:39 fdobridge_: <p​rop_energy_ball> There are bugs and I am guessing they are low priority
17:40 fdobridge_: <g​fxstrand> I'm hopeful that having NVK be competent will finally let the Linux graphics stack move forward easier. Even though I'm sure NVIDIA will try to hold on to their blob, NVK is going to give us a hell of a lot more leverage. (edited)
17:40 fdobridge_: <p​rop_energy_ball> My last email said they fixed stuff in the latest Vulkan dev driver, but the foreign queue transition stuff still seems broken
17:40 fdobridge_: <p​rop_energy_ball> (to be clear foreign queue transitions are only broken on async compute queues on the blob driver)
17:41 fdobridge_: <!​DodoNVK (she) 🇱🇹> I wonder if we can beat them at CUDA too
17:42 fdobridge_: <g​fxstrand> Like, distro can reasonably say, "Yeah, we're going to ship $FEATURE" without really worrying about breaking shit for NVIDIA blob users. Suddenly it becomes NVIDIA's problem to support all the things, not our problem to work around them.
17:42 fdobridge_: <k​arolherbst🐧🦀> `nvdisasm error : Unrecognized operation for functional unit 'uC' at address 0x00000110` pain
17:42 fdobridge_: <k​arolherbst🐧🦀> `IMNMX.U32 R8, R1, 0xffff, PT ;` 😄
17:42 fdobridge_: <k​arolherbst🐧🦀> okay
17:42 fdobridge_: <k​arolherbst🐧🦀> yeah soo the rebase actually fixed that
17:42 fdobridge_: <k​arolherbst🐧🦀> just need to lower min/max
17:43 fdobridge_: <g​fxstrand> That'll do it. 😂 There might be NIR lowering or you can just emit the two instructions.
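A minimal sketch of the min/max lowering, using the search/replace tuple style of Mesa's `nir_algebraic.py`. The opcode names (`imin`, `umin`, `ilt`, `ult`, `bcsel`) are real NIR opcodes; the list name `lower_iminmax`, the tiny evaluator, and the guess that "the two instructions" means a compare plus a select are assumptions.

```python
# Search/replace pairs: integer min/max becomes compare + select.
lower_iminmax = [
    (('imin', 'a', 'b'), ('bcsel', ('ilt', 'a', 'b'), 'a', 'b')),
    (('imax', 'a', 'b'), ('bcsel', ('ilt', 'a', 'b'), 'b', 'a')),
    (('umin', 'a', 'b'), ('bcsel', ('ult', 'a', 'b'), 'a', 'b')),
    (('umax', 'a', 'b'), ('bcsel', ('ult', 'a', 'b'), 'b', 'a')),
]

MASK32 = 0xffffffff


def to_signed(x: int) -> int:
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


# Reference semantics for the opcodes used on the replace side.
OPS = {
    'ilt': lambda a, b: to_signed(a) < to_signed(b),
    'ult': lambda a, b: (a & MASK32) < (b & MASK32),
    'bcsel': lambda c, a, b: a if c else b,
}


def evaluate(expr, env):
    """Evaluate a replace-side expression tree with variables from env."""
    if isinstance(expr, str):
        return env[expr]
    op, *srcs = expr
    return OPS[op](*(evaluate(s, env) for s in srcs))


def lowered(op: str, a: int, b: int) -> int:
    """Run the lowered form of op on two 32-bit values."""
    for search, replace in lower_iminmax:
        if search[0] == op:
            return evaluate(replace, {'a': a, 'b': b})
    raise KeyError(op)
```

The evaluator only exists to check the rules on concrete values; in Mesa the tuples would be fed to the algebraic pass generator instead.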
17:43 fdobridge_: <m​ohamexiety> yup. and NVK working well would take most gamers away from the blob anyways
17:43 fdobridge_: <k​arolherbst🐧🦀> yeah, I'll see what we can do
17:43 fdobridge_: <k​arolherbst🐧🦀> but yeah.. the rebase fixed the header being too big issue or something
17:44 fdobridge_: <k​arolherbst🐧🦀> there is `FMNMX` 🥲
17:45 fdobridge_: <p​rop_energy_ball> I wonder how much adjustment the vblankmanager in Gamescope is going to need to work on nouveau nicely
17:45 fdobridge_: <g​fxstrand> Of course, that assumes that NVK is competent. If games are still 2x as fast on the blob, people will still use it. 🫤
17:45 fdobridge_: <p​rop_energy_ball> AMDGPU needs a scary amount of redzone
17:46 fdobridge_: <g​fxstrand> To be fair, that one is more instructions to emulate.
17:46 fdobridge_: <p​rop_energy_ball> So if nouveau ends up needing more that'd be scary 🐸 But I have not tested (edited)
17:46 fdobridge_: <k​arolherbst🐧🦀> yeah..
17:46 fdobridge_: <g​fxstrand> Only one more instruction but still
17:46 fdobridge_: <k​arolherbst🐧🦀> sooo uhm... does anybody else needs imin/imax lowering?
17:47 fdobridge_: <k​arolherbst🐧🦀> 😄
17:47 fdobridge_: <k​arolherbst🐧🦀> *need
17:47 fdobridge_: <k​arolherbst🐧🦀> intel has it in the backend...
17:49 fdobridge_: <k​arolherbst🐧🦀> ohh bifrost as a nir_lower_algebraic thing for it
17:49 fdobridge_: <k​arolherbst🐧🦀> *has
17:49 fdobridge_: <k​arolherbst🐧🦀> (for 8 bit)
17:50 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand mind if we add a nak specific lower algebraic thing or do you want this into global nir straight away?
17:50 fdobridge_: <k​arolherbst🐧🦀> or lowering inside nak?
17:51 fdobridge_: <g​fxstrand> We should probably have a nak-specific algebraic thing. You can move the fmul needed by sin/cos to it while you're at it.
17:51 fdobridge_: <k​arolherbst🐧🦀> okay
17:52 fdobridge_: <g​fxstrand> IDK if we can plumb the nak_shader into it, though. 🤔
17:53 fdobridge_: <g​fxstrand> For now you might need to have it generate a volta-specific pass for IMnMx. 🙃
17:53 fdobridge_: <g​fxstrand> But that's easy
17:53 fdobridge_: <k​arolherbst🐧🦀> I was thinking of passing the sm as an option
17:54 fdobridge_: <k​arolherbst🐧🦀> or a custom struct
17:54 fdobridge_: <k​arolherbst🐧🦀> depending on if we want the logic to be inside the lowering pass or whoever calls it
17:54 fdobridge_: <g​fxstrand> I don't remember what plumbing we have for that. I'd be open to adding a thing to let backends pass extra parameters if we don't already.
17:55 fdobridge_: <k​arolherbst🐧🦀> mhhh... good question..
17:55 fdobridge_: <k​arolherbst🐧🦀> yeah.. there isn't, but that should be trivial to add
17:55 fdobridge_: <g​fxstrand> It looks like there is no such thing but it'd be really easy to add
17:55 fdobridge_: <k​arolherbst🐧🦀> :ferrisUpsideDown:
17:56 HdkR: Annoy open source driver developers by deleting this one simple instruction :)
17:56 fdobridge_: <g​fxstrand> Just make it default to `[]` or something so we don't have to change everything everywhere.
17:56 fdobridge_: <k​arolherbst🐧🦀> yeah
17:57 fdobridge_: <g​fxstrand> And then pass a `const struct nak_compiler *nak`
17:57 fdobridge_: <k​arolherbst🐧🦀> but honestly, which other vendor changes their ISA 7 times :ferrisUpsideDown:
17:57 fdobridge_: <g​fxstrand> Have you seen Intel?
17:57 fdobridge_: <g​fxstrand> Or Arm?
17:57 fdobridge_: <k​arolherbst🐧🦀> I meant like complete new encodings
17:57 fdobridge_: <k​arolherbst🐧🦀> technically nvidia has like 30 isas
17:57 fdobridge_: <k​arolherbst🐧🦀> (versions)
17:58 fdobridge_: <g​fxstrand> Intel has totally changed encodings 4 or 5 times and that doesn't count vec4.
17:58 fdobridge_: <k​arolherbst🐧🦀> yeah, nvidia did it more often 😄
17:58 fdobridge_: <k​arolherbst🐧🦀> but fair
17:58 fdobridge_: <g​fxstrand> And every version has new region restrictions, often backwards-incompatible, and sometimes varying by SKU.
17:58 fdobridge_: <k​arolherbst🐧🦀> though the ISA kinda stayed similar enough
17:59 fdobridge_: <g​fxstrand> Trust me, NVIDIA is a nice ISA.
17:59 fdobridge_: <k​arolherbst🐧🦀> 😄
17:59 fdobridge_: <k​arolherbst🐧🦀> fair enough
17:59 fdobridge_: <g​fxstrand> Panfrost has 4 compilers and that's just for like 5 hardware versions.
18:00 fdobridge_: <g​fxstrand> Not 4 different encoding back-ends. 4 different compilers.
18:00 HdkR: Just don't look at the old NVIDIA ISAs :D
18:00 fdobridge_: <g​fxstrand> Oh, I'm sure I don't want to look at anything from the GT 7000 days
18:01 fdobridge_: <!​DodoNVK (she) 🇱🇹> :triangle_nvk:
18:01 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://cdn.discordapp.com/attachments/1034184951790305330/1183468537122132150/Screenshot_20231210_195950.png?ex=658871f7&is=6575fcf7&hm=5365753495d0db3bac598ee5a4488af1c63216a9551acb67593989125182f651&
18:04 HdkR: Only ever look at Maxwell and newer and it'll be smooth sailing
18:06 fdobridge_: <k​arolherbst🐧🦀> pain
18:06 fdobridge_: <k​arolherbst🐧🦀> yeah soo.. we have two compilers :ferrisUpsideDown:
18:06 fdobridge_: <k​arolherbst🐧🦀> one pre nv50 and one nv50+
18:06 fdobridge_: <k​arolherbst🐧🦀> it was uhhh.. a vec ISA obviously
18:06 fdobridge_: <k​arolherbst🐧🦀> however
18:07 fdobridge_: <k​arolherbst🐧🦀> nvidia going fully scalar in 2006 was kinda ground breaking for the time 😄
18:10 fdobridge_: <k​arolherbst🐧🦀> mhhhh
18:11 fdobridge_: <!​DodoNVK (she) 🇱🇹> So was my 9500 GT futuristic?
18:11 fdobridge_: <b​utterflies> CUDA 0.7 public beta
18:12 fdobridge_: <b​utterflies> early CUDA also had an interpreter publicly shipped with it
18:12 fdobridge_: <b​utterflies> so you didn't have to have a GPU to prototype programs
18:12 fdobridge_: <b​utterflies> NVIDIA got rid of that quite quickly though
18:13 fdobridge_: <k​arolherbst🐧🦀> NOOOOO
18:13 fdobridge_: <k​arolherbst🐧🦀> god dammit
18:14 fdobridge_: <k​arolherbst🐧🦀> I have to make a late algebraic lowering pass...
18:14 fdobridge_: <k​arolherbst🐧🦀> what a pain
18:15 fdobridge_: <k​arolherbst🐧🦀> opt_algebraic reverted my totally working imin/imax lowering 🥲
18:16 fdobridge_: <k​arolherbst🐧🦀> `Pass (Pass)` :3
18:16 fdobridge_: <k​arolherbst🐧🦀> let's gooo!
18:19 fdobridge_: <k​arolherbst🐧🦀> `Pass: 176, Fail: 52, Crash: 93, Skip: 1679, Duration: 2:09, Remaining: 71:48:20` mhh
18:19 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand running into `ALU srcs must be legalized explicitly` a lot
18:21 fdobridge_: <g​fxstrand> There's probably something that's not getting lowered. What op is triggering it?
18:24 fdobridge_: <k​arolherbst🐧🦀> gdb is crashing 🥲
18:24 fdobridge_: <k​arolherbst🐧🦀> imul, but: https://gist.github.com/karolherbst/888dd576a11a3f64e0e92fe856a23826
18:27 fdobridge_: <k​arolherbst🐧🦀> mhh actually...
18:27 fdobridge_: <k​arolherbst🐧🦀> shouldn't that have become imad?
18:29 fdobridge_: <k​arolherbst🐧🦀> ohh yeah
18:29 fdobridge_: <k​arolherbst🐧🦀> 😄
18:30 fdobridge_: <k​arolherbst🐧🦀> `if self.sm() > 70 {`
18:30 fdobridge_: <k​arolherbst🐧🦀> 🥲
18:32 fdobridge_: <k​arolherbst🐧🦀> okay,.. that's better 😄
18:33 fdobridge_: <k​arolherbst🐧🦀> I think I'm ready for a proper CTS run now
18:35 fdobridge_: <k​arolherbst🐧🦀> still getting tons of `OOR_REG`... mhh
18:35 fdobridge_: <k​arolherbst🐧🦀> maybe compute is busted
18:37 fdobridge_: <k​arolherbst🐧🦀> `Pass: 142, Fail: 68, Crash: 23, Skip: 1266, Flake: 1, Duration: 5:14, Remaining: 232:00:04`
18:38 fdobridge_: <k​arolherbst🐧🦀> yeah.. compute is broken
18:38 fdobridge_: <k​arolherbst🐧🦀> that has to wait until after dinner
19:01 fdobridge_: <a​irlied> Lyude: I gave Ajax a Volta, you might need to get it off him
19:15 RedSheep: I am not sure I'm in the right place, but I have been testing my 4090 with kernel 6.7rc4 to see what is left for me to be able to daily drive the open source drivers now that GSP firmware is a thing, and I haven't had any luck getting it to run my displays at 4k120. I am able to run 120hz on AMD and the nvidia blob, but I'm not sure if that rules out my distro or DE being at fault.
19:16 RedSheep: What should I check, or where should I open an issue, if that would make sense to do?
19:26 fdobridge_: <k​arolherbst🐧🦀> mhhhh
19:26 fdobridge_: <k​arolherbst🐧🦀> the shader uses r16 at most, but setting 17 regs doesn't work.. but 18 does
19:31 fdobridge_: <k​arolherbst🐧🦀> now we are talking...
19:31 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand `Pass: 388, Crash: 7, Skip: 2105, Duration: 13, Remaining: 5:59:40` and the crash is the indirect compute assert
19:33 fdobridge_: <k​arolherbst🐧🦀> `Pass: 5303, Crash: 97, Skip: 29600, Duration: 2:14, Remaining: 4:12:19` and dmesg is still clean
19:34 fdobridge_: <k​arolherbst🐧🦀> maybe I should run with more threads than just with 4? 😄
19:34 fdobridge_: <g​fxstrand> Volta probably needs the +2 as well. It doesn't have UGPRs AFAIK but it may still reserve them.
19:35 fdobridge_: <g​fxstrand> Or maybe it has them. 🤷🏻‍♀️
19:35 fdobridge_: <k​arolherbst🐧🦀> yeah, that's what I decided to do
19:35 fdobridge_: <k​arolherbst🐧🦀> and is also what we do in gallium
19:35 fdobridge_: <k​arolherbst🐧🦀> anyway.. the only thing left seems to be indirect dispatch
19:36 diagonal3x: So that example should work so: you add biggest power to value and it's inverse, so that one busts, removing the power again from both results and summing together guarantees a range half of max to max -1 , so in other words removing that biggest power once more, you know that inverse is larger, and can use subtract to subtract a constant to later compensate after stashing this to value, but this gets you the smaller value always which can
19:36 diagonal3x: be focal value or inverse, now final riddle is to understand which one it was
19:36 fdobridge_: <k​arolherbst🐧🦀> let's toast this GPU
19:36 fdobridge_: <k​arolherbst🐧🦀> 20 threads lets go
19:37 fdobridge_: <k​arolherbst🐧🦀> not a lot faster `Pass: 5066, Crash: 90, Skip: 28344, Duration: 59, Remaining: 1:56:55` 😄
19:41 fdobridge_: <k​arolherbst🐧🦀> kinda pain volta uses the old macro stuff :ferrisUpsideDown:
19:42 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand what's the problem with indirect compute pre turing? just the macro running out of regs or the mme being less capable?
19:43 fdobridge_: <g​fxstrand> The MME can't multiply so we can't get the counters right.
19:43 fdobridge_: <k​arolherbst🐧🦀> oof
19:44 fdobridge_: <g​fxstrand> IDK how NVIDIA does it
19:44 fdobridge_: <k​arolherbst🐧🦀> I'd say we should get the counters wrong and still do indirects :ferrisUpsideDown:
19:44 fdobridge_: <g​fxstrand> Maybe repeated add?
19:44 fdobridge_: <k​arolherbst🐧🦀> mhhhhhh
19:44 fdobridge_: <g​fxstrand> Yeah, that's probably fine for now
19:45 fdobridge_: <k​arolherbst🐧🦀> yeah.. at least we test indirects this way
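A minimal sketch of the "repeated add" idea in plain Python: multiplication built from nothing but addition and a countdown loop. The function names are hypothetical, and whether the pre-Turing MME can run such a loop within its register and time budget is not established in the log.

```python
def mul_by_repeated_add(a: int, b: int) -> int:
    """Multiply two non-negative integers using only add and decrement."""
    assert a >= 0 and b >= 0
    acc = 0
    while b > 0:
        acc += a   # one add per iteration
        b -= 1
    return acc


def group_count_product(x: int, y: int, z: int) -> int:
    """Product of three dispatch dimensions without a multiply."""
    return mul_by_repeated_add(mul_by_repeated_add(x, y), z)
```

The loop runs once per unit of the second operand, so the cost grows with the dispatch size; that is the obvious drawback of this approach compared with a shader-based preprocess.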
19:46 fdobridge_: <g​fxstrand> The other thing I've considered is sticking an atomic at the start of the shader that increments a bit of memory.
19:53 fdobridge_: <g​fxstrand> Or just fire off a compute shader that preprocesses the indirect draw buffer and do it in a shader.
19:54 fdobridge_: <k​arolherbst🐧🦀> mhhh... maybe @marysaka could dump what nvidia is doing and take a look?
19:55 fdobridge_: <k​arolherbst🐧🦀> `Pass: 117516, Fail: 9, Crash: 1872, Warn: 1, Skip: 653102, Duration: 19:15, Remaining: 1:20:09` btw
20:22 fdobridge_: <m​arysaka> there is an MR opened by @lucidawnamberaeonjaharmonia I think
20:23 fdobridge_: <m​arysaka> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26230
21:08 diagonal2x: from that point the solution is easy, you had the value before inverting, you subtract from that (the number you know that came larger) and earlier you added 1, 1 is mapped to answer smaller all other values are mapped for bigger, so that one got solved , and now if all modules get combined there is an engine of super execution. So no sign, no wrap around, no control flow, no over/under flow handling, and only sub/add its still able to
21:08 diagonal2x: compress everything.
21:19 fdobridge_: <p​homes_> the patch with HOST_CACHED_BIT lets X4 Foundations start and shows the intro. It initially shows a garbled image and crashes quickly after the intro, but at least it starts now
21:19 fdobridge_: <p​homes_> https://cdn.discordapp.com/attachments/1034184951790305330/1183518482336333824/Screenshot_from_2023-12-10_22-13-26.png?ex=6588a07a&is=65762b7a&hm=502663bfc857e2378c4ea0cd22737bd00b96e6e366f81e087eb88bfafbd48d04&
21:48 fdobridge_: <k​arolherbst🐧🦀> run is almost done: `Pass: 557801, Fail: 45, Crash: 9090, Warn: 4, Skip: 3112051, Timeout: 1, Flake: 8, Duration: 1:33:14, Remaining: 7:51` :ferrisBongo:
21:52 fdobridge_: <k​arolherbst🐧🦀> oh no...
21:52 fdobridge_: <k​arolherbst🐧🦀> the GPU died...
21:55 fdobridge_: <k​arolherbst🐧🦀> mhhh
21:55 fdobridge_: <k​arolherbst🐧🦀> or rather something dead-locked
21:55 fdobridge_: <k​arolherbst🐧🦀> well.. the GPU is also toast
22:27 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand pushed my changes to the nouveau branch and I think the code is more or less ready to be reviewed. I'll try to figure out the remaining issues over the next week
22:28 fdobridge_: <k​arolherbst🐧🦀> the optional params to nir_algebraic weren't even painful
22:59 fdobridge_: <e​sdrastarsis> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26626
22:59 fdobridge_: <e​sdrastarsis> moar perf
22:59 fdobridge_: <k​arolherbst🐧🦀> yo
23:00 fdobridge_: <k​arolherbst🐧🦀> maybe 32 is a bit too aggressive, but okay 🙂
23:08 fdobridge_: <m​henning> Oh, I just picked 32 because it seems like the most common value other drivers use for max_unroll_iterations
23:08 fdobridge_: <m​henning> I don't actually have an opinion there - we could set it lower if you think it's justified
23:08 fdobridge_: <k​arolherbst🐧🦀> ohh no, it's fine I think...
23:09 fdobridge_: <k​arolherbst🐧🦀> that reminds me... nak misses an important optimization :ferrisUpsideDown:
23:09 fdobridge_: <k​arolherbst🐧🦀> lowering cfg to predication
23:09 fdobridge_: <k​arolherbst🐧🦀> so instead of doing if/else branches, just predicate both sides
23:09 fdobridge_: <k​arolherbst🐧🦀> or is nak already doing that?
23:09 fdobridge_: <m​henning> it isn't doing that yet
23:10 fdobridge_: <m​henning> At some point the sched code was a little sketchy for predicates but that might have been fixed
23:10 fdobridge_: <k​arolherbst🐧🦀> I see
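A rough illustration of the if/else-to-predication lowering mentioned above, in plain Rust. Both function names are hypothetical, and a final select stands in for per-instruction predicates; this is not NAK or codegen code. It assumes both sides are cheap and free of side effects, which is the case where flattening is safe.

```rust
/// Control-flow form: only one side runs, which on a GPU means a
/// branch (and possibly reconvergence) when threads disagree.
pub fn branchy(cond: bool, a: u32, b: u32) -> u32 {
    if cond {
        a.wrapping_add(1)
    } else {
        b.wrapping_mul(2)
    }
}

/// Flattened form: both sides are computed unconditionally and the
/// condition only chooses which result is kept. On hardware each
/// side's instructions would carry the predicate (or its negation)
/// instead of this final select.
pub fn flattened(cond: bool, a: u32, b: u32) -> u32 {
    let then_val = a.wrapping_add(1);
    let else_val = b.wrapping_mul(2);
    if cond { then_val } else { else_val }
}
```

The two forms return the same value for every input; the flattened one trades some redundant ALU work for straight-line code with no branch to reconverge from.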
23:12 fdobridge_: <m​henning> If you switch to predication, you presumably also don't want to emit reconvergence but we currently do that in nir so you'd either need to find and delete the reconvergence ops or find a way to avoid emitting them in the first place
23:12 fdobridge_: <k​arolherbst🐧🦀> yeah.. that's the idea
23:13 fdobridge_: <k​arolherbst🐧🦀> in codegen the pass flattening cfg kills the reconvergence ops
23:13 fdobridge_: <m​henning> Yeah, I've seen that and I'm not the biggest fan of how it works out in terms of tracking down the ops to delete
23:14 fdobridge_: <k​arolherbst🐧🦀> 🙂
23:14 fdobridge_: <k​arolherbst🐧🦀> yeah.. same
23:14 fdobridge_: <k​arolherbst🐧🦀> I think all those opts should happen first
23:14 fdobridge_: <k​arolherbst🐧🦀> then we decide if reconvergence is needed and where
23:14 fdobridge_: <k​arolherbst🐧🦀> we also don't always have to reconverge anyway
23:16 fdobridge_: <m​henning> Yeah, the pass for inserting them is already reasonably smart, just imo it should happen later than it currently does
23:21 fdobridge_: <k​arolherbst🐧🦀> Num GPRs: 4 *doubt*
23:21 fdobridge_: <k​arolherbst🐧🦀> https://gist.githubusercontent.com/karolherbst/eee3d0f37bc2e695d2b52a9cffbf54ee/raw/1aadc0524fb8914632399b15ef3204f7855976d8/gistfile1.txt
23:22 fdobridge_: <k​arolherbst🐧🦀> ohh wait...
23:22 fdobridge_: <k​arolherbst🐧🦀> I forgot to do something
23:25 fdobridge_: <k​arolherbst🐧🦀> found it
23:32 fdobridge_: <k​arolherbst🐧🦀> anyway.. another run..
23:37 fdobridge_: <g​fxstrand> Yeah, I've been meaning to do that. 😅 Last I checked it failed some CTS tests.
23:37 fdobridge_: <g​fxstrand> They might be fixed by https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26526 but I haven't gotten around to checking.
23:38 fdobridge_: <g​fxstrand> Should be fixable
23:39 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand soo.. volta hits all those vertex robustness issues 🥲
23:39 fdobridge_: <k​arolherbst🐧🦀> or was it vertex buffer?
23:39 fdobridge_: <g​fxstrand> Oh fun...
23:39 fdobridge_: <k​arolherbst🐧🦀> like the thing....
23:39 fdobridge_: <m​henning> There's another patch in that MR that fixes the test failures
23:39 fdobridge_: <g​fxstrand> Oh, cool
23:39 fdobridge_: <k​arolherbst🐧🦀> anyway... I forgot to reduce the gprs from 255 to 253 for volta 🥲
23:39 fdobridge_: <k​arolherbst🐧🦀> nak might want to make sure that stuff doesn't overflow
23:40 fdobridge_: <g​fxstrand> Oh derp... Texture indices.
23:41 fdobridge_: <g​fxstrand> NIR doesn't know that NV clamps.
23:41 fdobridge_: <k​arolherbst🐧🦀> `max(4, s.info.num_gprs + 2)` where `num_gprs` was 255, so that wrapped to 4
23:41 fdobridge_: <g​fxstrand> I've been meaning to wire up the conversion intrinsic we made for CL. That would let us communicate that.
23:41 fdobridge_: <k​arolherbst🐧🦀> do we have a good plan on how to handle those kinda fails if `checked_add` is used as it returns `Option`?
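A small sketch of the wrap described above and one possible way to surface it with `checked_add`. It assumes the counter is a `u8` (consistent with 255 + 2 ending up as 4 after the `max`); the function names and the error type are hypothetical and not taken from NAK.

```rust
/// Reproduces the reported behaviour: in a u8, 255 + 2 wraps to 1,
/// so the max(4, ..) clamp silently yields 4.
pub fn gpr_alloc_wrapping(num_gprs: u8) -> u8 {
    std::cmp::max(4, num_gprs.wrapping_add(2))
}

/// Same computation, but overflow becomes an error the caller has
/// to deal with instead of a bogus small register count.
pub fn gpr_alloc_checked(num_gprs: u8) -> Result<u8, String> {
    num_gprs
        .checked_add(2) // Option<u8>: None on overflow
        .map(|n| n.max(4))
        .ok_or_else(|| format!("num_gprs = {} leaves no room for 2 extra registers", num_gprs))
}
```

Turning the `Option` into a `Result` lets the caller propagate it with `?`; whether a compile failure, a panic, or a lower register limit is the right response is a separate design decision.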
23:41 fdobridge_: <k​arolherbst🐧🦀> ohh yeah.. big mood
23:43 fdobridge_: <m​henning> Oh, is there already an intrinsic? I looked for something in the algebraic ops and didn't see what we needed
23:44 fdobridge_: <g​fxstrand> Yeah, there's an intrinsic that has all the bells and whistles. I think we can implement basically all of it in hardware.
23:44 fdobridge_: <g​fxstrand> I just haven't wired it up. GL and Vulkan don't generate it. Only OpenCL.
23:45 fdobridge_: <k​arolherbst🐧🦀> well.. not all combinations, but yeah
23:45 fdobridge_: <k​arolherbst🐧🦀> at least the rounding modes all exist
23:45 fdobridge_: <k​arolherbst🐧🦀> sometimes the hw is a bit picky
23:47 fdobridge_: <k​arolherbst🐧🦀> `Pass: 70840, Fail: 3, Crash: 1160, Skip: 394997, Duration: 14:35, Remaining: 1:50:05` okay that looks a bit more promising