00:06fdobridge: <airlied> the kernel has more to cleanup
00:06fdobridge: <airlied> now whether we could store that piece of info in the kernel side so userspace doesn't care is probably a good question
00:07airlied: dakr: ^ you may have some thoughts
00:08fdobridge: <airlied> @gfxstrand I think we have is so if you have a sparse range, and map things non-sparse into it, when you unmap those things it goes back to sparse state,
00:08fdobridge: <airlied> then you unmap the sparse region separately
00:08fdobridge: <airlied> to get back to unbacked
00:09fdobridge: <airlied> for normal operation, you just map/unmap normally, for sparse operations, you map sparse at bringup, then all the queued map/unmap and then unmap sparse at the end
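In rough code terms that lifecycle might look like the following sketch; every struct, op, and flag name here is an illustrative assumption, not the actual in-flight uAPI:

```c
#include <stdint.h>
#include <stdio.h>

enum bind_op { OP_MAP, OP_UNMAP };
#define BIND_SPARSE (1u << 0)     /* marks a sparse reservation */

struct bind_req {
    enum bind_op op;
    uint32_t     flags;
    uint32_t     handle;   /* GEM handle for backed maps, 0 otherwise */
    uint64_t     addr;     /* GPU VA */
    uint64_t     range;    /* size in bytes */
};

/* Stub standing in for the real VM_BIND submission path. */
static void submit_bind(int fd, const struct bind_req *r)
{
    printf("fd %d: %s%s va 0x%llx + 0x%llx\n", fd,
           r->op == OP_MAP ? "map" : "unmap",
           (r->flags & BIND_SPARSE) ? " (sparse)" : "",
           (unsigned long long)r->addr, (unsigned long long)r->range);
}

static void sparse_lifecycle(int fd, uint64_t va, uint64_t size,
                             uint32_t bo, uint64_t off, uint64_t pg)
{
    /* Bring-up: reserve the range as sparse (unbacked, non-faulting). */
    submit_bind(fd, &(struct bind_req){
        .op = OP_MAP, .flags = BIND_SPARSE, .addr = va, .range = size });

    /* Steady state: queue backed maps/unmaps inside the reservation;
     * unmapping drops those PTEs back to the sparse state. */
    submit_bind(fd, &(struct bind_req){
        .op = OP_MAP, .handle = bo, .addr = va + off, .range = pg });
    submit_bind(fd, &(struct bind_req){
        .op = OP_UNMAP, .addr = va + off, .range = pg });

    /* Teardown: unmap the sparse reservation itself to get back to
     * fully unbacked; the kernel rejects improper nesting. */
    submit_bind(fd, &(struct bind_req){
        .op = OP_UNMAP, .flags = BIND_SPARSE, .addr = va, .range = size });
}
```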
00:10fdobridge: <gfxstrand> What happens if they aren't properly bracketed?
00:10fdobridge: <airlied> the kernel rejects you
00:10fdobridge: <gfxstrand> What happens if they aren't properly nested? (edited)
00:10fdobridge: <gfxstrand> Okay. That works, I guess.
00:21fdobridge: <gfxstrand> Was there a particular reason for that choice? Does the kernel reserve page table space as part of the sparse setup which gets torn down as part of the sparse unmap?
00:38fdobridge: <airlied> yes
00:39fdobridge: <airlied> the kernel has to fill in sparse page tables entries different from unbacked
00:49fdobridge: <karolherbst🐧🦀> uhhh... I entirely forgot that in GL we just allocate the TLS buffer once and never resize it.. and the initial allocation is apparently big enough for the CTS to not fall over...
01:25fdobridge: <airlied> @gfxstrand I added a comment in a the branch that might clarify it
01:27fdobridge: <gfxstrand> Okay, so if we've reserved a sparse range, we're guaranteed that bind can't fail?
01:28fdobridge: <gfxstrand> Does the kernel also validate that all queued binds are to sparse reserved ranges?
01:31fdobridge: <airlied> no we don't reserve all the page table entries
01:32fdobridge: <airlied> a sparse entry might be a single entry covering a huge amount of address space, whereas individual page entries might require smaller grained ptes
01:34fdobridge: <gfxstrand> So what's the point of the nesting? Or is that just the API folks picked?
01:34fdobridge: <airlied> the nesting allows unbinding
01:35fdobridge: <gfxstrand> Like, there's three states a page can be in: fully unbound, sparse unbound, and bound.
01:35fdobridge: <airlied> at a PTE level
01:35fdobridge: <airlied> there are higher level page descriptors
01:37fdobridge: <airlied> we've gone around on this a few times, and I think the kernel patches document some of it, but the ranges were to allow unbinding of bound regions back to sparse to be merged
01:38fdobridge: <airlied> so if you create sparse 0..20, then map 13, 14, 15, 16, and unmap 13, 14, 15, 16 you end up with a single entry again for 0..20
01:38fdobridge: <airlied> not 0..12,13,14,15,16,17..20
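A toy model of that merge-back, assuming nothing about the kernel's real region tracking: if the sparse view is derived as the complement of the backed intervals inside the reservation, it re-coalesces by construction:

```c
#include <stdint.h>
#include <stdio.h>

struct iv { uint64_t start, end; };  /* half-open [start, end) */

/* Toy model only (no relation to the kernel's actual data structures):
 * the sparse "view" of a reservation is whatever isn't currently
 * backed, so unmapping the backed ranges restores one single span. */
static int sparse_view(struct iv res, const struct iv *backed, int n,
                       struct iv *out)
{
    int m = 0;
    uint64_t pos = res.start;
    for (int i = 0; i < n; i++) {        /* backed: sorted, disjoint */
        if (backed[i].start > pos)
            out[m++] = (struct iv){ pos, backed[i].start };
        pos = backed[i].end;
    }
    if (pos < res.end)
        out[m++] = (struct iv){ pos, res.end };
    return m;
}

int main(void)
{
    struct iv res = { 0, 21 }, out[4];
    struct iv backed[] = { { 13, 17 } };   /* pages 13..16 mapped */
    int m = sparse_view(res, backed, 1, out);   /* -> 0..13, 17..21 */
    m = sparse_view(res, NULL, 0, out);         /* unmap 13..16     */
    printf("%d span(s): %llu..%llu\n", m,       /* -> 1 span, 0..21 */
           (unsigned long long)out[0].start,
           (unsigned long long)out[0].end);
    return 0;
}
```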
01:43fdobridge: <gfxstrand> Okay, that's believable.
01:44fdobridge: <gfxstrand> So there's sort of a fake BO under there and we treat it like huge pages?
01:44fdobridge: <airlied> the comment in nouveau_exec.c in the kernel talks about it
01:45fdobridge: <gfxstrand> This seems like it should be solvable without that but I can believe it at least makes it easier.
01:45fdobridge: <airlied> it gets messy around the boundaries
01:45fdobridge: <gfxstrand> I'm also trying to better understand the semantics. It sounds like it's probably fine.
01:45fdobridge: <airlied> between sparse and non-sparse regions
01:46fdobridge: <gfxstrand> Right
01:46fdobridge: <gfxstrand> So do we want to over-align sparse stuff to encourage big full-sparse regions?
01:47fdobridge: <airlied> https://gitlab.freedesktop.org/nouvelles/kernel/-/blob/new-uapi-drm-next-fixes/drivers/gpu/drm/nouveau/nouveau_exec.c#L41 is the best comment
01:47fdobridge: <airlied> probably not, I'd let userspace run its own show as much as possible
01:49fdobridge: <airlied> I suspect we might have to deal with some alignment around bo_kind though
01:53fdobridge: <gfxstrand> Okay, that helps me understand the model better. I think it works okay.
01:54fdobridge: <gfxstrand> So what happens if the kernel goes to exec a job and there's not enough memory for page tables? Evict other stuff to system ram?
01:55fdobridge: <airlied> currently the kernel driver can't swap page tables, but yes eventually that would be the plan
01:56fdobridge: <airlied> it might also make sense to have a copy engine do the async page table updates
01:56fdobridge: <gfxstrand> Oh, I'm not worried about the "what if you have too many page tables for VRAM?" case.
01:57fdobridge: <airlied> but yeah if we run out of vram we will kick something else out to make space now
01:57fdobridge: <gfxstrand> This isn't DG1 where the hardware is stupid and requires half your resources to be pinned to VRAM or it just falls over.
01:57fdobridge: <gfxstrand> Another question: VM's are per-file, yes?
01:58fdobridge: <gfxstrand> Like there isn't a separate VM object like I did for Xe
01:58fdobridge: <airlied> currently, for a while we had vm_id, but we didn't find a great reason for it
01:59airlied: dakr: you remember the answer to that one^
01:59fdobridge: <gfxstrand> Okay. That's fine. It just means we need to move the winsys device to `nvk_device` which I've been meaning to do anyway.
01:59fdobridge: <airlied> radv doesn't do that at all
01:59fdobridge: <airlied> it has always had vm at physical device
01:59fdobridge: <airlied> and nobody has complained
01:59fdobridge: <gfxstrand> Then RADV is broken
02:00fdobridge: <gfxstrand> Vulkan requires that separate `VkDevice`s can't touch each other.
02:00fdobridge: <gfxstrand> That's a pretty hard requirement of the robustness chapter.
02:01fdobridge: <gfxstrand> Vulkan guarantees very little in terms of footgun safety but that is where it draws the line.
02:03fdobridge: <gfxstrand> And thanks to buffer device address, it's *really* easy to write that crucible/CTS test.
02:03fdobridge: <gfxstrand> Well, it requires invoking undefined behavior so you have to process isolate it but still.
02:04fdobridge: <airlied> maybe we should bring back vm_id then
02:05fdobridge: <airlied> or at least make space in the uapi structs for it
02:08fdobridge: <gfxstrand> We can either bring back vm_id or we can pull the DRM file up to the `VkDevice`. I don't care which.
02:17fdobridge: <gfxstrand> It's not like it's that expensive to open the DRM file back up per-device, though. That's what we do in ANV and it's fine.
02:18fdobridge: <airlied> yeah it might be cleaner to just do that
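A minimal sketch of that per-VkDevice reopen approach (path and helper name are illustrative): since VMs are per-file, a fresh open() of the render node gives each VkDevice its own address space, which is exactly the isolation the robustness chapter requires:

```c
#include <fcntl.h>

/* Sketch, assuming per-file VMs: a fresh open() creates a fresh
 * drm_file, hence a fresh GPU VA space, so separate VkDevices can't
 * touch each other's memory (the node path here is illustrative). */
static int nvk_open_device_fd(void)
{
    return open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
}
```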
02:22fdobridge: <airlied> going to give it a run on my kepler today as well
02:24fdobridge: <gfxstrand> Cool
02:24fdobridge: <gfxstrand> I'll look into re-opening the device tomorrow
02:31fdobridge: <airlied> gotta drag a lot of stuff up to device from ws for that
02:33dakr: airlied: as you said, there wasn't really a reason for having it.
02:35fdobridge: <gfxstrand> Not really. I think I've already done the work for that.
02:35fdobridge: <gfxstrand> Everything that we would want should be in `nv_device_info` and if it's not it needs to be added there. (edited)
02:37fdobridge: <gfxstrand> Looks like we're missing chipset but that's easy to add.
02:38fdobridge: <gfxstrand> Though I don't get why some of those chipset checks aren't `cls_3d` checks
02:40fdobridge: <airlied> I think we'd have to drag bo create up
02:41fdobridge: <airlied> or at least make vma allocation happen consistently higher up
02:44fdobridge: <gfxstrand> We'd just pull the whole `nouveau_ws_device` up
02:44fdobridge: <gfxstrand> I've got a plan
02:45fdobridge: <gfxstrand> I've already started typing. I'll post the MR tomorrow.
02:45fdobridge: <airlied> ah cool
02:46fdobridge: <gfxstrand> Feel free to continue doing your thing. I'm headed for bed now. I'll rebase whatever's in the branch in the morning on top of the rework.
02:50fdobridge: <airlied> your idea sounds better 🙂
03:00fdobridge: <airlied> okay my kepler dies like your pascal
03:05fdobridge: <airlied> time to dig into pte_kind then
03:09airlied: dakr: you tested on your pascal in recent times?
03:09airlied: (with vulkan userspace)
03:09dakr: yes, but just a subset of the VK CTS.
03:12airlied: vulkaninfo works?
03:14dakr: airlied: yeah, prints out a bunch of stuff, looking for anything specific?
03:16airlied: nope, just doesn't crash
03:17airlied: with latest nvk branch it's crashing trying to create a sparse binding
03:18dakr: With mesa I'm still on 07d3a08603810d5f7abe996bd4a9e978be7ae28d
03:18dakr: kernel is my latest patch series.
03:24airlied: okay I think I can see where I messed up
03:35airlied: nope wasn't that one :-P
04:27airlied: okay solved kepler, since it won't do sparse :)
04:28airlied: not sure on maxwell/pascal though
04:33fdobridge: <airlied> I wonder if we care about sparseBinding-only support on kepler
04:36airlied: dakr: did we ban non-sparse async bindings?
05:05fdobridge: <airlied> okay I've separated out sparseBinding vs sparseResidencyBuffer, since kepler has only the former, but kepler passes the tests now
05:05fdobridge: <airlied> @gfxstrand need to narrow down what is going on with your secondary card, since kepler works now, would be interested to see how it goes
05:06fdobridge: <airlied> though I think for sparse residency we might need to tune things a bit more for the pre-turing gpus
05:06fdobridge: <airlied> pushed out an updated branch
11:27fdobridge: <karolherbst🐧🦀> I'm currently wondering if I want to move drm/nouveau to nouveau/nouveau to have it all in one place
13:31dakr: airlied: the kernel should do that
14:10fdobridge: <karolherbst🐧🦀> okay... this local memory issue also exists on turing 🙃 now it all makes sense. Running the CTS to see if my fixes have any regressions now
16:22Mangix: meson: error: unrecognized arguments: --verbose
17:00fdobridge: <gfxstrand> PSA: https://gitlab.freedesktop.org/nouveau/mesa/-/issues/74
17:04fdobridge: <gfxstrand> Not that I want someone to take that as a task. Just clean up as you find stuff if you feel so inclined.
17:04fdobridge: <gfxstrand> And please don't land new code with the wrong names.
17:05fdobridge: <karolherbst🐧🦀> uhhh.. turing has 64 warps per sm, but CUDA docs say 32 🙃
17:06fdobridge: <karolherbst🐧🦀> I'll have to double check ampere as well
17:16fdobridge: <karolherbst🐧🦀> mhhh `Pass: 371271, Fail: 8, UnexpectedPass: 7, ExpectedFail: 2985, Skip: 1632322, Flake: 61, Duration: 35:59, Remaining: 0`
17:18fdobridge: <karolherbst🐧🦀> ahh, those 8 fails are also flakes
17:18fdobridge: <karolherbst🐧🦀> nice
17:18fdobridge: <karolherbst🐧🦀> I'll have to check out ampere tomorrow
18:59fdobridge: <gfxstrand> @airlied https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/233
18:59fdobridge: <gfxstrand> CTSing it now
19:13fdobridge: <karolherbst🐧🦀> how common is it that applications even create more than one VkDevice?
19:27fdobridge: <karolherbst🐧🦀> huh.. on Ampere it needs to be threads * 96 * 2 * tpc count... now I'm super confused
19:27fdobridge: <gfxstrand> Rare
19:28fdobridge: <gfxstrand> So far it's looking like my CTS run is going to take about 36 minutes, which is about the same as before that MR, so I really don't think device creation is going to be a problem.
19:29fdobridge: <gfxstrand> The CTS does like creating devices. 😅
19:29fdobridge: <karolherbst🐧🦀> yeah... it shouldn't be super expensive
19:34fdobridge: <gfxstrand> If it is we can do something crazy where we stash the `ws_dev` in the `nvk_physical_device` and have the first `nvk_device` to be created steal it.
19:34fdobridge: <gfxstrand> But we'll only do that if someone can prove there's a serious device creation perf problem to solve and I very much doubt that will ever happen.
19:34fdobridge: <karolherbst🐧🦀> the ws_dev doesn't allocate the VM though
19:35fdobridge: <gfxstrand> It opens the DRM file so it does
19:35fdobridge: <karolherbst🐧🦀> ehh yeah, my bad, I was confused with the GPU context
19:38fdobridge: <gfxstrand> Okay, I think I'm going to merge that and rebase the uAPI branch
19:39fdobridge: <mohamexiety> max threads per SM? or what's that? 😮
19:41fdobridge: <karolherbst🐧🦀> yeah
19:41fdobridge: <karolherbst🐧🦀> well
19:41fdobridge: <karolherbst🐧🦀> per the entire thing
19:42fdobridge: <gfxstrand> @karolherbst Do you have any especially strong feelings about the SM patch in that MR?
19:42fdobridge: <gfxstrand> It's been sitting in my NAK branch for a while.
19:43fdobridge: <karolherbst🐧🦀> I don't actually know, I guess once nvidia uses it as hex we can change it back
19:45fdobridge: <gfxstrand> Given that they have SM75 and SM89, I doubt they're thinking in hex
19:45fdobridge: <gfxstrand> We made that same mistake in Mesa with Intel gen numbers. 🙄
19:46fdobridge: <karolherbst🐧🦀> btw, I figured out that num_gpr stuff
19:46fdobridge: <gfxstrand> Nice!
19:46fdobridge: <karolherbst🐧🦀> not that I figured out _why_ but it makes more sense now
19:47fdobridge: <karolherbst🐧🦀> https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/de3653e00798f4475073e40522d63d30b1a79b1e
19:47fdobridge: <karolherbst🐧🦀> survived a CTS run on Turing
19:47fdobridge: <karolherbst🐧🦀> there are two less gprs available and it always uses 2 internally it seems
19:48fdobridge: <karolherbst🐧🦀> sooo... dunno for what, maybe it's the barrier stuff
19:48fdobridge: <karolherbst🐧🦀> something something
19:48fdobridge: <karolherbst🐧🦀> it's less bad like this I think 😄
19:48fdobridge: <karolherbst🐧🦀> but ugprs kinda makes sense
19:49fdobridge: <karolherbst🐧🦀> there are 32 threads per subgroup and there are 64 ugprs
19:49fdobridge: <karolherbst🐧🦀> so 2 gprs per thread
19:49fdobridge: <karolherbst🐧🦀> still figuring that TLS nonsense...
19:53fdobridge: <karolherbst🐧🦀> mhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
19:53fdobridge: <karolherbst🐧🦀> 64 -> 96 warps per SM for Ada
19:54fdobridge: <karolherbst🐧🦀> also have to take into account that we have 2 mps per tpc..
19:54fdobridge: <karolherbst🐧🦀> but I was sure that yesterday I concluded the difference is 1.5, not 2.5
19:59fdobridge: <karolherbst🐧🦀> it kinda makes no sense
20:05fdobridge: <karolherbst🐧🦀> mhhhh
20:18fdobridge: <karolherbst🐧🦀> okay, so on kepler it's definitely 32 * 64 * tpc_count
20:19fdobridge: <karolherbst🐧🦀> same on maxwell 2nd gen..
20:19fdobridge: <karolherbst🐧🦀> so nvidia has this "Maximum number of resident warps per SM" thing in their CUDA docs and up to SM70 it's 64
20:20fdobridge: <karolherbst🐧🦀> SM70 is volta
20:21fdobridge: <mhenning> I assume gl will need the num_gprs fix too?
20:21fdobridge: <karolherbst🐧🦀> yeah
20:22fdobridge: <karolherbst🐧🦀> though the problem is really just when that value overflows to 0
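A hedged sketch of that quirk (names and the clamp are illustrative, not the actual commit): the hardware appears to consume 2 GPRs internally, so the programmed count needs +2 of headroom and must not wrap a u8 field to 0 when a shader already uses close to the full register file:

```c
#include <stdint.h>

/* Illustrative only: 2 GPRs seem to be used internally by the HW, so
 * add headroom and clamp rather than letting the u8 wrap to 0. */
static uint8_t programmed_num_gprs(uint32_t gprs_used)
{
    uint32_t n = gprs_used + 2;          /* 2 GPRs consumed internally */
    return (uint8_t)(n > 255 ? 255 : n); /* avoid overflow to 0 */
}
```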
20:22fdobridge: <karolherbst🐧🦀> mhh.. pascal also needs 32 * 64 * tpc_count
20:22fdobridge: <karolherbst🐧🦀> I wish I had a GP100 because that has two SMs per TPC, which other Pascals don't
20:23fdobridge: <mohamexiety> should be similar to Volta/Turing, no?
20:24fdobridge: <karolherbst🐧🦀> it's not, that's the weird part
20:24fdobridge: <mohamexiety> huh O_O
20:24fdobridge: <karolherbst🐧🦀> sooo...
20:24fdobridge: <karolherbst🐧🦀> mhh
20:24fdobridge: <karolherbst🐧🦀> on pascal each SM contains 4 partitions of stuff
20:25fdobridge: <karolherbst🐧🦀> so you have 4 register files, 4 * 32 "cuda" cores, etc...
20:25fdobridge: <karolherbst🐧🦀> 4 instruction schedulers, whatever
20:25fdobridge: <karolherbst🐧🦀> page 7 and 8 in https://www.es.ele.tue.nl/~heco/courses/ECA/GPU-papers/GeForce_GTX_1080_Whitepaper_FINAL.pdf
20:27fdobridge: <mhenning> you might be able to spin up a cloud instance with a P100
20:27fdobridge: <karolherbst🐧🦀> it gets interesting with Turing
20:27fdobridge: <karolherbst🐧🦀> and figuring out how to unload nvidia and load nouveau?
20:28fdobridge: <karolherbst🐧🦀> turing whitepaper: https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
20:28fdobridge: <karolherbst🐧🦀> page 8 is the GPU
20:28fdobridge: <karolherbst🐧🦀> so there you clearly see there are 2 SMs per TPC
20:28fdobridge: <karolherbst🐧🦀> but each SM still has the same amount of execution units
20:29fdobridge: <karolherbst🐧🦀> well kinda
20:29fdobridge: <mhenning> my hunch is that if you pick one of the standard image types you won't need to deal with proprietary driver being installed
20:29fdobridge: <mhenning> I've never tried it though
20:30fdobridge: <karolherbst🐧🦀> anyway.. the most interesting part of turing (SM75) is that "Maximum number of resident warps per SM" is 32
20:30fdobridge: <karolherbst🐧🦀> but you have two SMs per TPC, so it should even out
20:30fdobridge: <karolherbst🐧🦀> but it doesn't
20:31fdobridge: <karolherbst🐧🦀> ehh wait...
20:31fdobridge: <karolherbst🐧🦀> I messed up
20:31fdobridge: <karolherbst🐧🦀> ehh what
20:32fdobridge: <karolherbst🐧🦀> YOOOOO
20:32fdobridge: <karolherbst🐧🦀> it's an alignment problem
20:32fdobridge: <mohamexiety> nice! how so, tho?
20:32fdobridge: <karolherbst🐧🦀> dahsjdhakjdhakdjhsd
20:33fdobridge: <karolherbst🐧🦀> `uint64_t bytes_per_mp = bytes_per_warp * 32 * 2; bytes_per_mp = ALIGN(bytes_per_mp, 0x8000); uint64_t size = bytes_per_mp * dev->pdev->dev->tpc_count;` <= works
20:33fdobridge: <karolherbst🐧🦀> `uint64_t bytes_per_mp = bytes_per_warp * 32; bytes_per_mp = ALIGN(bytes_per_mp, 0x8000); uint64_t size = bytes_per_mp * dev->pdev->dev->tpc_count * 2;` <= doesn't
20:34fdobridge: <karolherbst🐧🦀> I wasted 4 hours on this
20:34fdobridge: <karolherbst🐧🦀> 🙃
20:35fdobridge: <mohamexiety> I am sorry 😦
20:35fdobridge: <mohamexiety> still.. at least you got it now!
20:37fdobridge: <karolherbst🐧🦀> ahh yeah.. because the per mp value is used elsewhere as well
20:38fdobridge: <karolherbst🐧🦀> but it needs to be the per tpc one I think
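Consolidated as a sketch (parameter names are illustrative; per the snippets above, on Turing mps_per_tpc = 2 and max_warps_per_mp = 32, and the bug was applying the 0x8000 alignment to the per-SM size instead of the per-TPC one):

```c
#include <stdint.h>

#define ALIGN_POT(v, a) (((v) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Sketch of the TLS (local memory) sizing that worked above: 32
 * threads per warp, a per-SM resident-warp limit, the SM-per-TPC
 * count, and the 0x8000 alignment applied to the per-TPC size. */
static uint64_t
tls_size(uint64_t bytes_per_thread, uint32_t max_warps_per_mp,
         uint32_t mps_per_tpc, uint32_t tpc_count)
{
    uint64_t bytes_per_warp = bytes_per_thread * 32;
    uint64_t bytes_per_tpc  = bytes_per_warp * max_warps_per_mp *
                              mps_per_tpc;
    return ALIGN_POT(bytes_per_tpc, 0x8000) * tpc_count;
}
```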
20:41fdobridge: <gfxstrand> @airlied Rebased your branch on !233
20:42fdobridge: <gfxstrand> @karolherbst Uh, what?
20:42fdobridge: <gfxstrand> @karolherbst Is it the align that's a problem?
20:42fdobridge: <karolherbst🐧🦀> nah...
20:42fdobridge: <karolherbst🐧🦀> on turing it's fine
20:42fdobridge: <karolherbst🐧🦀> I'm just making it all work properly
20:42fdobridge: <karolherbst🐧🦀> soo.. turing doesn't have 64 warps_per_mp, it has 32
20:43fdobridge: <karolherbst🐧🦀> but it also has 2 mps per tpc
20:43fdobridge: <karolherbst🐧🦀> so the end result is the same
20:43fdobridge: <karolherbst🐧🦀> but
20:43fdobridge: <karolherbst🐧🦀> on ampere it's 48 warps per mp
20:43fdobridge: <karolherbst🐧🦀> and nouveau reports the tpc count to us, not mp count
20:43fdobridge: <karolherbst🐧🦀> so I just want to get rid of this constant `64`
20:44fdobridge: <gfxstrand> ugh
20:44fdobridge: <karolherbst🐧🦀> it will all make sense once you see the patch
20:47fdobridge: <airlied> @gfxstrand you should try dropping the turing+ hack and see if vulkaninfo works
20:50fdobridge: <gfxstrand> Seems to. I'll drop that patch
20:56fdobridge: <karolherbst🐧🦀> @gfxstrand https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/234
20:58fdobridge: <karolherbst🐧🦀> there is a table for those limits in the cuda docs, e.g.: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
20:58fdobridge: <karolherbst🐧🦀> the second one
20:58fdobridge: <karolherbst🐧🦀> "Maximum number of resident warps per SM" -> `max_warps_per_mp_for_sm`
20:58fdobridge: <karolherbst🐧🦀> the 2 SMs per TPC was extracted from all the whitepapers 🙃
20:58fdobridge: <karolherbst🐧🦀> now it all makes sense
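A sketch of what that lookup might look like; the function name follows the message above, the values are copied from the "Maximum number of resident warps per SM" row of the CUDA compute-capability table, and everything else is illustrative (the real patch is in the MR linked above):

```c
#include <stdint.h>

/* SM75 (Turing) is the 32-warp outlier, which its 2 SMs per TPC
 * even out; consumer Ampere and Ada sit at 48. */
static uint32_t max_warps_per_mp_for_sm(uint32_t sm)
{
    switch (sm) {
    case 75:                    return 32;  /* Turing */
    case 86: case 87: case 89:  return 48;  /* GA10x, Orin, Ada */
    default:                    return 64;  /* Kepler..Volta, GA100 */
    }
}
```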
21:00fdobridge: <karolherbst🐧🦀> @airlied with that MR you should see those local temp size errors go away on ampere
21:03fdobridge: <karolherbst🐧🦀> I hope.. I only tested turing and Ada
21:03fdobridge: <karolherbst🐧🦀> ehhh.. rebase compilation problem..
21:05fdobridge: <karolherbst🐧🦀> yep.. also fixes it on ampere 🙂
21:09fdobridge: <karolherbst🐧🦀> sadly this MR only fixes like 7 tests....
21:09fdobridge: <karolherbst🐧🦀> ehh.. on turing
21:10fdobridge: <karolherbst🐧🦀> should run on ampere as well
21:14fdobridge: <karolherbst🐧🦀> @mhenning https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24261
21:56fdobridge: <gfxstrand> Ugh.... sparse is busted...
22:16fdobridge: <airlied> since you rebased or just in general?
22:31fdobridge: <gfxstrand> In general
22:31fdobridge: <gfxstrand> Planes and z32s8 don't work
22:31fdobridge: <gfxstrand> I'm working on it but offsets are a giant PITA
23:01fdobridge: <gfxstrand> @airlied https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/150/diffs?commit_id=0fce82fcf090f9d380b267cd42d34051e788c67e
23:02fdobridge: <gfxstrand> With that, I think all the new uAPI stuff is RB
23:02fdobridge: <gfxstrand> I should give it all another read through tomorrow
23:04fdobridge: <gfxstrand> Once we drop the old code, there's a few things I'd like to clean up but I don't think any of it particularly annoys me anymore.
23:05fdobridge: <gfxstrand> Specifically, we should get rid of the push_builder limits. They're okay for now, though.
23:08fdobridge: <karolherbst🐧🦀> I wonder how long I'm in the mood to support the old UAPI in gl, because I'd hate if we'd continue to use libdrm there 🥲
23:08fdobridge: <karolherbst🐧🦀> but I suspect the old one stays for 10+ years anyway...
23:12fdobridge: <gfxstrand> The real question is which will survive longer? Old uAPI support in GL or the GL driver. 😛
23:12fdobridge: <karolherbst🐧🦀> well... depends on how long we support Fermi and older 😛
23:13fdobridge: <gfxstrand> I could see us deprecating the old API on everything pre-GSP at some point
23:13fdobridge: <karolherbst🐧🦀> ohh totally
23:13fdobridge: <karolherbst🐧🦀> but I'm sure this will take a while
23:13fdobridge: <karolherbst🐧🦀> I still get bugs from users where their 20 years old GPU regresses 🥲
23:14fdobridge: <karolherbst🐧🦀> though almost nobody seems to use Fermi which is funny
23:15fdobridge: <karolherbst🐧🦀> it's kinda painful that Fermi is such an oddball, otherwise I'd say support Fermi and NVK and just nuke the entire `nvc0` driver
23:15fdobridge: <karolherbst🐧🦀> and then just porting over `nv50` isn't all that painful, and `nv30` could just be put into legacy mode
23:17fdobridge: <karolherbst🐧🦀> maybe it would be worth figuring out how bad it would be to support fermi in NVK, but it can't use the copy class which is my biggest concern there
23:18fdobridge: <karolherbst🐧🦀> anyway... the biggest reason for me to use the fancy new winsys is the command submission stuff
23:19fdobridge: <karolherbst🐧🦀> port that over, fix random bugs while doing that and drop our custom headers
23:22fdobridge: <gfxstrand> Yeah... I don't want to implement copies...
23:22fdobridge: <gfxstrand> I mean, people are building `vk_meta` code for it so we could...
23:22fdobridge: <karolherbst🐧🦀> ohh.. Fermi has copies, but it's using a different class
23:22fdobridge: <gfxstrand> Oh, well that's not a real problem.
23:23fdobridge: <karolherbst🐧🦀> yeah.. just fermi only code
23:23fdobridge: <gfxstrand> I'm more worried with how crappy storage image support gets
23:23fdobridge: <karolherbst🐧🦀> and being broken all the time 😛
23:23fdobridge: <karolherbst🐧🦀> https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/memory-to-memory-format/cl9039.h
23:23fdobridge: <karolherbst🐧🦀> that's what Fermi would have to use
23:23fdobridge: <gfxstrand> I really need to figure out why the sparse branch takes 2x as long to run...
23:25fdobridge: <airlied> more tests?
23:25fdobridge: <airlied> there might be some kernel ioctl overhead, but I doubt it could be that bad
23:26fdobridge: <karolherbst🐧🦀> _maybe_ I'll get bored and see how painful it would be to support fermi well enough to run zink
23:32fdobridge: <airlied> @gfxstrand can we add an explicit ack/rb to the MR so I can say the userspace has been reviewed enough to land the kernel side?
23:33fdobridge: <gfxstrand> Yeah. I'm going to do one more read tomorrow and then I'll stick an RB on the MR for the permanent record.
23:34fdobridge: <airlied> sounds good
23:34fdobridge: <gfxstrand> Unless you're really itching to merge the kernel patches yet today
23:34fdobridge: <airlied> nope no great hurry
23:35fdobridge: <gfxstrand> Should I also pull it into nvk/main? Or do you want to leave it in a branch until the kernel parts are merged?
23:36fdobridge: <gfxstrand> We definitely won't merge nvk/main into mesa/main until after the kernel has landed.
23:38fdobridge: <airlied> we could pull it in with the flag off until we land the kernel bits
23:41fdobridge: <gfxstrand> Okay
23:41fdobridge: <gfxstrand> I guess I don't care too much. It's mostly a matter of whether or not it's useful for folks.
23:42fdobridge: <gfxstrand> I am still a bit concerned about stability
23:42fdobridge: <gfxstrand> Not being able to get through whole runs concerns me
23:44fdobridge: <airlied> I'd need some debugging on where it goes wrong, I've been doing pretty full runs on two machines here
23:44fdobridge: <airlied> I might turn off some kernel debug things to see if less debug overhead makes it fall over
23:48fdobridge: <karolherbst🐧🦀> maybe I should also run it and see if I can make it fall apart
23:49fdobridge: <gfxstrand> I'm about an hour into this run and it's still going so we'll see.
23:49fdobridge: <gfxstrand> Maybe my d32s8 fixes added enough stability to get us back to where we were?
23:53fdobridge: <karolherbst🐧🦀> @airlied do you think you'll find some time to run my MR through ampere? I'd be curious if there are any regressions there and how many issues it fixes, but now I have to sleep anyway 😄 if you didn't run it by tomorrow I'll just do it anyway
23:53fdobridge: <airlied> I'll throw it at it today
23:54fdobridge: <karolherbst🐧🦀> cool