00:12gfxstrand[d]: Okay, I think I'm starting to see the shape of things
00:12gfxstrand[d]: This might not be too bad
00:12airlied[d]: I think adding asserts to check various things are true in the backend might help, but my only idea is to reproduce it here and dig in
00:12gfxstrand[d]: This is a lot of DRM to page in. (pun intended)
00:13airlied[d]: if I had to guess I'd look closer at the LPT/SPT interactions, but that's also because that code is quite intricate and I don't really understand it even after I rewrote it for nova-core
00:13gfxstrand[d]: Well, that gives me hope. :frog_sweat:
00:15gfxstrand[d]: mohamexiety[d]: I pulled your branch and I'm building now
00:16gfxstrand[d]: I'm building across all 84 x86 cores in my office so it shouldn't take too long.
00:21gfxstrand[d]: I probably shouldn't be building full Fedora configs.
00:21airlied[d]: make localmodconfig
00:22gfxstrand[d]: Yeah, I know. But then you don't have drivers for random USB things you want to plug in later
00:22airlied[d]: is there an easy way to say an instruction has a bindless arg?
00:22gfxstrand[d]: airlied[d]: Context?
00:24airlied[d]: I need to classify instructions that are in bindless form differently
00:24gfxstrand[d]: Oh, for timings?
00:24airlied[d]: Yes
00:25airlied[d]: Just have to iterate sources for cbuf then check if they are bindless?
00:26gfxstrand[d]: For image ops, they're always bindless right now. For texture ops, look at TexRef
00:27airlied[d]: lots of other ops can take them as well
00:32mhenning[d]: "instructions that are in bindless form" includes anything that uses a bindless cbuf?
00:35gfxstrand[d]: Oh, that's a totally different thing
00:35airlied[d]: The docs don't join all the dots, so I'm assuming that's what it means; it's used for ureg latencies
00:35gfxstrand[d]: For that you have to walk the srcs
00:36airlied[d]: Okay, yes, I think it means it uses bindless cbuf notation. It's not explained in detail, but it makes sense
00:38gfxstrand[d]: gfxstrand[d]: Back to maxwell blowing up... Yes, me adding shader code causes VAs to be different because of the way the upload queue works. :blobcatnotlikethis:
00:41mhenning[d]: ALU operations have a bindless cbuf form - there's a regular encoding where each op has a few different forms that determine their argument types. The actual form is determined in encode_alu_base in sm70.rs, but just walking sources and looking for a CBuf::BindlessSSA or CBuf::BindlessUGPR is probably what you want for your case
00:42gfxstrand[d]: Yup
00:42mhenning[d]: That should be right assuming the docs are saying that applies to ALU operations. If it's instead referring to texture operations, take a look at what Faith said above
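A minimal sketch of that source walk, assuming NAK-style `Src`/`CBuf` types and a `srcs_as_slice()`-style accessor on the instruction (the exact names may differ):

```rust
// Sketch: does any source of this instruction read a bindless cbuf?
// Assumes NAK's CBuf::BindlessSSA / CBuf::BindlessUGPR variants.
fn has_bindless_cbuf(instr: &Instr) -> bool {
    instr.srcs_as_slice().iter().any(|src| match &src.src_ref {
        SrcRef::CBuf(cb) => matches!(
            cb.buf,
            CBuf::BindlessSSA(_) | CBuf::BindlessUGPR(_)
        ),
        _ => false,
    })
}
```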
00:54gfxstrand[d]: mohamexiety[d]: I'm about to call it quits for the day but I've got the kernel built now and tomorrow I'll do a little typing and see if my idea makes any sense.
00:54gfxstrand[d]: As is often the case, I need to poke at things and break them before I'll know if I'm on the right track.
00:55mohamexiety[d]: Yep, relatable
00:56gfxstrand[d]: I think we may need UAPI for this
00:56gfxstrand[d]: But I'm not 100% sure yet
00:56airlied[d]: do you want to have userspace pick the page size?
00:57gfxstrand[d]: We may have to
00:57gfxstrand[d]: I'm not sure yet
00:58gfxstrand[d]: If we do that, we'll need to also have userspace pick the page size when it creates the BO so that's a bit awkward
00:58gfxstrand[d]: I think we're going to end up with a BO having a max page size based on what it was able to do at allocation time and then the VA will have to clamp down to that.
00:58skeggsb9778[d]: right now i believe it'll choose the max page size that doesn't exceed the bo size, for the memory itself
00:58skeggsb9778[d]: that's fine, it can be used for small pages too
00:59gfxstrand[d]: We have an alignment parameter that gets passed to BO creation. I don't think it's used with VM_BIND but we have it if we need it.
01:00gfxstrand[d]: Ideally, userspace would just allocate a BO with a nicely aligned size and then VM_BIND it at a nicely aligned offset and range and everything works by magic.
01:00gfxstrand[d]: But things get tricky the moment we have a remap operation that splits things with an alignment other than the original bind alignment.
01:01gfxstrand[d]: map and unmap are okay, I think, because the alignment gets picked on map and unmap is just unmap.
01:01gfxstrand[d]: It's remap where everything is going to fall apart if we're not careful
01:01skeggsb9778[d]: yeah, remap() would (potentially) have to put() the overlapping part, and re-get() it
01:01gfxstrand[d]: And if that's not atomic...
01:03gfxstrand[d]: I think the big danger (and why we might need userspace's help) is if you map and the kernel decides it can do 2M pages and then you unmap a bit in the middle and now you have to split a 2M page and you can't do that atomically.
01:03airlied[d]: though for apps that don't use sparse I don't see where it would go wrong
01:03gfxstrand[d]: If userspace declares page sizes, we can shoot it if it tries to do that.
01:20gfxstrand[d]: But I also don't know where we would put the page size in the UAPI. It's not like we pre-declare non-sparse bindings.
01:20gfxstrand[d]: Actually... We could hang it off the BO. Whenever we map, we'll have a BO and we can map with whatever page size the BO wants.
01:20gfxstrand[d]: Unmap is easy
01:21gfxstrand[d]: When we go to remap, the unmap operation in the middle of the map region will have to follow the alignment rules of whatever BO is currently mapped there so we know the resulting split map is going to be aligned.
01:22gfxstrand[d]: It does make rules checking on unmap a little more complicated but I think that's okay.
01:23gfxstrand[d]: This scheme does mean we'll never get 2M pages unless the client does a dedicated allocation but I think that's okay.
01:24gfxstrand[d]: Hrm...
01:25gfxstrand[d]: But I really want 2M pages for the default VA we allocate alongside the BO
01:26gfxstrand[d]: Or maybe we can have it be a lazier rule? The client provides a page size when it does a BIND and then any UNBIND that overlaps that range has to be aligned.
01:26gfxstrand[d]: That might be annoying to check, too.
01:26gfxstrand[d]: But maybe not so bad?
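A minimal sketch of that lazier rule; this is purely hypothetical, since no such UAPI rule exists yet:

```rust
// Hypothetical check, not real nouveau UAPI: an UNBIND that overlaps a
// bound range must be aligned to the page size the range was bound with.
fn unbind_is_aligned(bound_page_shift: u32, addr: u64, range: u64) -> bool {
    let page_mask = (1u64 << bound_page_shift) - 1;
    (addr & page_mask) == 0 && (range & page_mask) == 0
}
```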
01:30redsheep[d]: What would be the benefit of making it possible to have pages that huge anyway?
01:30redsheep[d]: It kind of sounds like most of the benefits will arrive from just making 64k work
01:31gfxstrand[d]: 2M pages are way faster
01:31gfxstrand[d]: It lets you skip a whole level of page table lookups on every access
01:31redsheep[d]: Ooo ok I like that
01:32gfxstrand[d]: Yeah, 64K pages are just a necessary requirement to handle compression but you still have a 4-level page table walk.
01:32gfxstrand[d]: With 2M pages, it's 3-level.
01:32redsheep[d]: So that means lower actual latency going out to vram?
01:32gfxstrand[d]: yes
01:33airlied[d]: I think reduced TLB pressure is often the main benefit
01:33gfxstrand[d]: Yeah, it thrashes those caches a lot less, too
01:34gfxstrand[d]: I doubt it'll be a massive boost but it'll probably be measurable.
01:34redsheep[d]: Well take all the boosts available
01:34kayliemoony[d]: it's only lower latency on TLB miss, and then 512 (? assuming 4k min page size) fewer tlb entries
01:35redsheep[d]: How big of an issue is tlb capacity in GPU land typically?
01:35kayliemoony[d]: so if you're doing random access over that whole 2M, you've completely erased the TLB lookups for it, cost-wise
01:35kayliemoony[d]: (i'd happily wager your tlb cache is not larger than 512 entries)
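(For concreteness: 2 MiB / 4 KiB = 512, so one 2 MiB PTE covers what would otherwise take 512 TLB entries at 4 KiB pages, or 32 at 64 KiB.)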
01:42skeggsb9778[d]: gfxstrand[d]: you can do 512MiB pages on hopper/blackwell too
01:44redsheep[d]: Nobody has any Blackwell cards anyway :p
01:45redsheep[d]: I swear the 5090 is impossible to purchase, I have set up stock pings and followed up on most of them, more than I should, and it's all been futile
01:46skeggsb9778[d]: yeah, youtube keeps showing me videos of people complaining about that
01:46skeggsb9778[d]: i've got a 5080 at least
01:47redsheep[d]: Ok so I just need to get hired at nvidia then
01:49redsheep[d]: You got the kernel/gsp side pretty much sorted, right? At least in a branch?
01:49redsheep[d]: Have you tried nvk to see how hard it explodes?
01:50skeggsb9778[d]: i'm still debugging why channels aren't working for me, so haven't tried nvk yet, though i had a poke around to get a feel for where the chipset/class etc checks live
01:51gfxstrand[d]: skeggsb9778[d]: That's fun. I assume Blackwell has 5-level tables, at least on the server cards, though.
01:51skeggsb9778[d]: https://gitlab.freedesktop.org/bskeggs/nouveau/-/commits/03.01-gb20x?ref_type=heads
01:51skeggsb9778[d]: gfxstrand[d]: pre-Hopper has 5, Blackwell has 6
01:51gfxstrand[d]: That tracks
01:52gfxstrand[d]: Yeah, you want really big pages then
01:54redsheep[d]: I wonder if the prop driver ever actually uses those in games
01:54skeggsb9778[d]: Yes. Nouveau does too with GL
01:57gfxstrand[d]: And we can with NVK. We just need to do a bit of VM_BIND work
01:58skeggsb9778[d]: Yeah, the GL case is much simpler to handle
02:07HdkR: Dang, six levels so it can map 5 levels of CPU and also still have room for GPU?
02:08HdkR: Everyone scared of the TLB fetches
02:10gfxstrand[d]: Oh, on a GB rack, it's basically all the same and you can map anything in the rack into that GPU VA. IDK what mapping over NVLink looks like exactly, though.
02:11HdkR: With six levels do we finally get the full 64-bit VA space? So top level is the remaining bits from 57-bit or something?
02:13HdkR: Or does it also reserve a bit still for kernel space in x86 land :D
02:15HdkR: Very wacky GPUs
02:17skeggsb9778[d]: It's 57 bits now
02:18redsheep[d]: Isn't that whole giant supercomputer thing Jensen talked about all one GPU, one address space?
02:18redsheep[d]: the 4 elephants one
02:18redsheep[d]: I forget what it's called cuz I tune out when the HPC and AI stuff starts
02:24HdkR: Interesting, 57-bits with six level even though Intel/AMD can get away with 57-bits using five levels?
02:26HdkR: Interestingly Arm64 introduced 56-bit but I'm not sure how many levels that is
02:27HdkR: Roughly equivalent to the x86 side since ARM doesn't reserve a bit for kernel things.
02:30HdkR: ...The ARM ARM says it is six levels, at lookup levels -2,-1,0,1,2,3.
02:34airlied[d]: also things started to use the other bits for storing interesting information
02:35airlied[d]: get some 128-bit pointers going
02:35HdkR: ARM Morello would be so happy
04:06airlied[d]: gfxstrand[d]: mhenning[d] we only do CCTL for data caches right now?
04:06gfxstrand[d]: Yeah
05:52butterflies[d]: redsheep[d]: Up to 256 GPUs sharing memory
05:52butterflies[d]: But shipped configuration is "only" up to 72 GPUs
11:22fooishbar[d]: gfxstrand[d]: heh, that would be the one I dumped up there just before Christmas since I wanted it as far away from me as possible
11:30mohamexiety[d]: sounds like it was traumatic :KEKW:
11:32snowycoder[d]: Parsing question!
11:32snowycoder[d]: When a Src prints "-X" it could either be INeg or FNeg, how can we know which one it is?
11:32snowycoder[d]: We could use SrcType but sometimes it's GPR and it doesn't help
11:46marysaka[d]: fooishbar[d]: Volta is cursed as hell... but that's also why I love it too
11:47marysaka[d]: but hey, if we can get syncpoints working on Xavier/Volta, we could test Maxwell for real if we wanted, without abysmal perf
11:47marysaka[d]: *looks at her 3 TX1 devkits collecting dust*
11:50Jasper[m]: I once bid on 10 TK1s for fun to set up CI/CD, but it got snagged from under me, sadly
11:51karolherbst[d]: syncpoints... pain
11:51fooishbar[d]: Jasper[m]: anholt did set a bunch of TK1s up for CI
11:51fooishbar[d]: unfortunately that didn't seem to improve Kepler support
11:51Jasper[m]: I'm assuming that's still due to the syncpoints issue that has hampered Tegra for a bit :p
11:52Jasper[m]: I do also have the TX2
11:53karolherbst[d]: the tegra situation is just not that great overall
11:53karolherbst[d]: it's basically unmaintained, because nobody is working on it
11:55Jasper[m]: I've seen stuff ranging from it not working at all (wayland's status) to it starting to work on mesa 19 (iirc) to it magically being fixed with some xorg.conf
11:55Jasper[m]: And other than direct code contributions (which I cannot do sadly) the only way to contribute is to set up ci/cd I guess?
11:55karolherbst[d]: it needs explicit modifier support
11:56Jasper[m]: Jasper[m]: "it" being nouveau btw, not nvk
13:14gfxstrand[d]: snowycoder[d]: When is it GPR?!?
13:17gfxstrand[d]: We really need to clean some of that up and reevaluate the SrcType rules.
14:05snowycoder[d]: gfxstrand[d]: In a lot of them: OpFSwzAdd (should be float?), OpShf, OpShl, OpShr, and plenty of others. Sometimes it's also ALU, so it still doesn't carry any type info
14:05gfxstrand[d]: OpFSwzAdd is wonky. Shifts I didn't think accepted modifiers.
14:08snowycoder[d]: But how should a parser handle it? should it default to INeg if it can't be sure or should it maybe throw a parse error?
14:10snowycoder[d]: Also, sorry to ask, but what is exactly a SrcType? it seems to carry some type information but it also has GPR, SSA and ALU that seem to be catch-all?
14:14gfxstrand[d]: It's per-source metadata that's currently massively underdefined. I really need to go through and do an audit and clean it up.
14:15gfxstrand[d]: The technical mechanism we're using to track it (decorations) is good but there's a lot of stuff that doesn't quite follow the same rules as everything else.
14:15gfxstrand[d]: FSwzAdd, for instance, allows FNeg and maybe FAbs but doesn't accept cbuf sources.
14:16gfxstrand[d]: GPR vs SSA is also wonky
14:16gfxstrand[d]: ALU is meant to mean cbuf/imm but no modifiers
14:17gfxstrand[d]: But I think it gets slapped in the wrong spot sometimes
14:17gfxstrand[d]: And then there's things like IAdd3 which have really funky rules. It can take two INeg modifiers but not three.
14:18snowycoder[d]: WAT
14:19gfxstrand[d]: For the purposes of a parser, I think we can assume INeg is only on I32 sources and FNeg/Abs are only on float sources.
14:19snowycoder[d]: Can it accept INeg/FNeg on GPR?
14:19gfxstrand[d]: snowycoder[d]: It's because of the way negatives are handled internally and the fact that it's only 34 bits.
14:19gfxstrand[d]: snowycoder[d]: No. GPR means no modifiers
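A sketch of how a parser could apply those rules; the variant names follow NAK's `SrcType`/`SrcMod`, but the exact set is an assumption:

```rust
// Sketch: resolve a leading "-" on a parsed source using the declared
// SrcType, per the rule above: INeg only on I32 sources, FNeg only on
// float sources, no modifiers on GPR/SSA/ALU-typed sources.
fn neg_mod_for(src_type: SrcType) -> Option<SrcMod> {
    match src_type {
        SrcType::I32 => Some(SrcMod::INeg),
        SrcType::F16 | SrcType::F16v2 | SrcType::F32 | SrcType::F64 => {
            Some(SrcMod::FNeg)
        }
        // GPR and the other untyped kinds take no modifiers, so a
        // "-X" there is a parse error.
        _ => None,
    }
}
```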
14:21gfxstrand[d]: OMG this stuff all needs an audit so bad.
14:22gfxstrand[d]: It's all mostly fine on most of the ALU stuff but then there are the corner cases...
14:22snowycoder[d]: Welp, the parser doesn't need to catch every case
14:23snowycoder[d]: Even if we can parse some incorrect cases it's not the end of the world
14:23gfxstrand[d]: Yup
14:26gfxstrand[d]: The biggest thing it's used for right now is so that copy-prop can propagate modifiers safely but even that has special cases when integers are involved.
14:27gfxstrand[d]: And part of the problem is that the rules change a bit per-SM so there's a limit as to how definitive a struct member definition can be.
14:28snowycoder[d]: Oh no :blobcatnotlikethis:
14:30gfxstrand[d]: But it should be good enough for "what source modifiers are valid?"
14:30snowycoder[d]: BNot can only be applied to predicates (or maybe also Carry?)
14:44gfxstrand[d]: And PLop3
14:44gfxstrand[d]: Not carry
14:46gfxstrand[d]: And maybe Lop2? I don't remember.
14:58gfxstrand[d]: We probably need a B32 type.
15:00karolherbst[d]: mhh, not sure it's a great idea because the meaning of 32-bit booleans depends on the type (int vs float)
15:00karolherbst[d]: or rather.. the representation changes
15:02snowycoder[d]: We have a B32, it's the type for Lop2
15:03karolherbst[d]: the B prefix is usually used as "untyped" data tho
15:03karolherbst[d]: though maybe you meant it as untyped data?
15:06gfxstrand[d]: karolherbst[d]: I mean bitwise. I don't give a crap about all that float Boolean nonsense.
15:07gfxstrand[d]: snowycoder[d]: Okay, then BNot is allowed on B32 and Pred
15:08gfxstrand[d]: IAdd2/3X also take B32 sources which is fun.
15:13gfxstrand[d]: https://mastodon.gamedev.place/@gfxstrand/111020866052957274 if you want the full saga of how I figured out IAdd3
15:16gfxstrand[d]: gfxstrand[d]: There may be some tiny benefit in a couple of cases to using FSet but NIR does a good enough job of cleaning up that shit and Nvidia GPUs are powerful enough that I'm not worried about micro-optimizing D3D9.
15:24karolherbst[d]: IADD3 is just a special case tho; it's really annoying
15:25snowycoder[d]: gfxstrand[d]: Ohh, so copy-prop does nothing for OpIAdd3X since it has B32 src_type 0_0
15:25snowycoder[d]: I'm starting to understand
15:26snowycoder[d]: gfxstrand[d]: We should also add some comments to the opcodes; some are a bit obscure
15:33asdqueerfromeu[d]: How much work would be needed for this?: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12685
15:41gfxstrand[d]: What is RMV?
15:42gfxstrand[d]: Perfetto?
15:46zmike[d]: https://gpuopen.com/rmv/
15:51gfxstrand[d]: Ah, yeah, we can probably add whatever's needed for that. I don't know what's needed for that but we can try.
16:15gfxstrand[d]: skeggsb9778[d]: Will the VMM back-end atomically split pages if I `nvif_vmm_get_raw()` with a smaller page size?
16:24gfxstrand[d]: That's the crucial question, I think: Can we split a large page live? If so, how do we do that?
16:24gfxstrand[d]: Because if we can split a large page live, then we can implement whatever we want. If we can't, then we need a very careful UAPI dance.
16:30mohamexiety[d]: gfxstrand[d]: when I was looking into it I was told that no, the calls to get() have to be with the correct page size. it doesn't handle it by itself
16:31mohamexiety[d]: but _if_ I understood correctly, we should be able to handle splitting and such in the upper layers :thonk:
16:32gfxstrand[d]: Yeah, I'm crawling through that now
16:33gfxstrand[d]: I think at the very least we want a `DRM_NOUVEAU_VM_BIND_64K` flag to require the kernel to either do at least a 64K binding or fail the VM_BIND op.
16:33mohamexiety[d]: I guess to avoid compression failing?
16:34gfxstrand[d]: Yup
16:34mohamexiety[d]: I see, makes sense
16:34gfxstrand[d]: Of course, the kernel still needs to handle smashing PTE kinds for when stuff gets paged out to system RAM
16:37mohamexiety[d]: yup!
16:53gfxstrand[d]: asdqueerfromeu[d]: Probably not a lot. The first thing we need to do is figure out what NVK is missing. There's a memory report extension but I'd be a bit surprised if it needs that since it's mostly for driver-internal stuff.
17:57gfxstrand[d]: mohamexiety[d]: The more code I read, the more and simultaneously less sense it all makes.
17:57gfxstrand[d]: Damnit! I was trying very hard NOT to become a kernel developer!
17:57mohamexiety[d]: relatable :nervous:
17:58mohamexiety[d]: yeah I explicitly said no kernel dev at first, then welp... :KEKW:
18:07snowycoder[d]: I did one too many rust crimes, now I hit a rustc hang :'3
18:08gfxstrand[d]: infinite loop in your proc macro?
18:09snowycoder[d]: nope, macro terminates and if I add an error in the non-macro code rustc catches it
18:09snowycoder[d]: with gdb there seems to be some infinite recursion with traits
18:12redsheep[d]: gfxstrand[d]: Surely you mean like generation, and not different SMs on the same GPU?
18:12gfxstrand[d]: Yeah, I mean "Shader Model"
18:12gfxstrand[d]: Like SM70, SM80, etc.
18:13redsheep[d]: Ah, right. If the GPU could have different kinds of SMs mixed that would be cursed to oblivion
18:14redsheep[d]: So many overloaded acronyms
20:10airlied[d]: gfxstrand[d]: mohamexiety[d] the question I have, though, is: for allocations that don't use sparse, why would this cause the problems that are being seen?
20:10airlied[d]: do we hit that many remaps in that case?
20:11gfxstrand[d]: I don't think we will. I just want to make sure what we're doing makes sense and that the kernel API is solid.
20:12airlied[d]: I'd like to understand why mohamexiety[d]'s current code is broken for the normal allocation patterns though, because I expect that might be more important to solve
20:12airlied[d]: though I expect if we do need some heavy rework it might be in the "wait for nova" bucket
20:12gfxstrand[d]: I suspect it's just calculating page_shift in too many places and things are inconsistent
20:13airlied[d]: we've discussed moving all the raw VM mgmt into the drm side of the driver in the past and put the idea on hold for nova
20:13airlied[d]: since doing it in the nvkm piece violates locking rules
20:14skeggsb9778[d]: it's not like it's a bad thing to do in the meantime though, especially since it should be a decent win - and, it'll still be useful for pre-Turing after nova exists
20:14airlied[d]: it's just a question of how much surgery is required. If it's "we can fix this with a flag", I'm good; if it's "we need to rewrite raw VM handling", then less so
20:15skeggsb9778[d]: i think it can all be done in nouveau_uvmm.c basically
20:15snowycoder[d]: gfxstrand[d]: Confirmed compiler bug https://github.com/rust-lang/rust/issues/137636
20:15snowycoder[d]: I may have built a very cursed parser
20:22mohamexiety[d]: airlied[d]: good question. honestly I don't know and am just as confused about it, but I don't really know how to test it further
20:22mohamexiety[d]: only have one system here so can't really do anything beyond a full boot with DE and such
20:22mohamexiety[d]: airlied[d]: isn't nvkm part of drm?
20:29gfxstrand[d]: snowycoder[d]: Oh fun...
20:30airlied[d]: mohamexiety[d]: It is in the driver but has its own plane of existence, esp when it comes to locking
20:32mohamexiety[d]: airlied[d]: but why/how is that problematic for locks?
20:53gfxstrand[d]: gfxstrand[d]: This is the part that has me concerned with the current code. With the legacy BO path, we choose a page size based on the buffer size and alignment. It's pretty straightforward. With VM_BIND, everything is more dynamic and can potentially change if we bind a big region of a BO and then unbind part of it. That might not be a thing that happens often in practice, as airlied[d] points out, but the current UAPI supports it and we need to either figure out how to make it work or figure out some restrictions we can add which make binding with larger pages practical.
20:55gfxstrand[d]: I tried poking at mohamexiety[d] 's branch this morning and plumbing things through differently. I found one pretty clear case but the others have proven more elusive. I also am still not sure what happens if we try to do a remap on something with a page size that's too big for the remap. I think it's probably possible, I just don't know how to make it work.
21:32airlied[d]: gfxstrand[d]: if the UAPI supports something that isn't possible with the Vulkan sparse interfaces, then I'd rather tighten up the uapi than support it
21:38gfxstrand[d]: Sure but what do we tighten? That's the question.
21:42airlied[d]: mohamexiety[d]: not sure there's enough space in discord to describe the interactions between dma fences, memory allocations and locking hierarchies
21:45airlied[d]: but ideally all the vmbind page table handling would be under the gpuvm lock and no further locking would be required, we get most of the way there with raw but not all
21:48mohamexiety[d]: hmm I see, thanks
22:00airlied[d]: mohamexiety[d]: might be tricky, but you could boot into console and run some CTS tests
22:08airlied[d]: smoke triangle seems to fault here on your branch
22:11gfxstrand[d]: Back to poking at Maxwell...
22:12gfxstrand[d]: The fault I'm seeing is `HUB/FALCON` but if I smash a shader write to NULL, I see `GPC0/LTP_UTLB_1`
22:12gfxstrand[d]: So it's not an invalid shader address.
22:14mohamexiety[d]: airlied[d]: does anything stand out about the fault? I did notice that vkcube was faulting as well, but wasn't sure if that was related to things running in a DE or not (since e.g. the Files app would also fault)
22:15mohamexiety[d]: airlied[d]: if it's still useful sure. haven't done this before though so is there anything special that needs to be done to boot into console?
22:16airlied[d]: depends on distro, usually putting 3 on the kernel commandline avoids starting a gui
22:17airlied[d]: you can also of course boot to gdm and then stop the gdm service
22:19mohamexiety[d]: which CTS tests would be applicable here?
22:20airlied[d]: dEQP-VK.api.smoke.asm_triangle faults for me
22:21gfxstrand[d]: Also, if you boot straight to console, you can run with VM debugging on and that'll give a lot more information.
22:22mohamexiety[d]: gfxstrand[d]: how do I enable that?
22:25gfxstrand[d]: Not sure
22:25gfxstrand[d]: `nouveau.debug=vm` or something?
22:41gfxstrand[d]: Okay, so QMDs are loaded through `HUB/FALCON`
22:41skeggsb9778[d]: debug=mmu=debug
22:41skeggsb9778[d]: =trace for more verbose
22:42skeggsb9778[d]: you might need CONFIG_NOUVEAU_DEBUG_MMU too
22:42mohamexiety[d]: Hmm let me reboot then
22:42mohamexiety[d]: skeggsb9778[d]: Where does this one go? Is it an env var?
22:42skeggsb9778[d]: kernel config option
22:44mohamexiety[d]: That’s during build right?
02:45mohamexiety[d]: I'll try debug=mmu=debug first and if it doesn't change things I'll change the build config
23:02airlied[d]: something seems wrong with the page pickers, I think I'm seeing size 16 with host pages
23:03airlied[d]: actually ignore that, I misread the printk
23:17mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1344085945477234780/IMG_0225.jpg?ex=67bfa115&is=67be4f95&hm=a61f60c10d063b86d2235030ac78a682a59a107578e8dbd7c6bd2c85129c8953&
23:17mohamexiety[d]: Hmm, nothing here despite using debug=mmu=debug
23:20skeggsb9778[d]: mohamexiety[d]: sorry, i might have been unclear. if you're putting it in your bootloader config, it'll be nouveau.debug=mmu=debug (you can drop the "nouveau." when calling modprobe etc though)
23:21mohamexiety[d]: Yup that’s what I wrote
23:21mohamexiety[d]: nouveau.debug=mmu=debug=trace
23:22skeggsb9778[d]: drop the "=trace", it's one or the other
23:24mohamexiety[d]: Woops sorry about that then
23:30mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1344089078001434684/IMG_0226.jpg?ex=67bfa400&is=67be5280&hm=8f4b2c28b55531d16812ac9d4f253258494923f518d277e7411cbb7a0910cb27&
23:30mohamexiety[d]: Ok this is deqp-vk.api.smoke.triangle
23:31skeggsb9778[d]: that's saying you're trying to map memory, allocated at 4k page size, into a vmm with 64k pages
23:32skeggsb9778[d]: args are: virt address, size, offset in memory block, virt page size, memory block page size
23:36mohamexiety[d]: I see, thanks! That’s a bit odd. It’s not a sysmem mapping that got wrongfully assigned 64k since the code checks for the location and sysmem mappings would always get 4k. So now I am curious where/how it got allocated at 4k
23:37mohamexiety[d]: The page sizing logic is also in the map() path so in theory it should catch everything early on
23:37skeggsb9778[d]: nouveau_bo_alloc() decides that, it's stored in nouveau_bo.page
23:37skeggsb9778[d]: probably need to check that in your page sizing logic too
23:38mohamexiety[d]: Got it, I didn’t look anywhere outside _uvmm.c since I figured that was for the old path only
23:38airlied[d]: where is the memory block page size used?
23:38mohamexiety[d]: Iirc nouveau_bo has a similar page sizing loop as well, so it's interesting that there's a mismatch
23:39skeggsb9778[d]: airlied[d]: nvkm_vmm checks it (via nvkm_memory_page()) to avoid doing things like mapping 4k blocks into 64k PTEs
23:40skeggsb9778[d]: you can map 64k blocks as 4k PTEs etc though still
23:40airlied[d]: but is it a real thing?
23:40airlied[d]: like is it used by the hw anywhere
23:40skeggsb9778[d]: no, it's used by the vram allocator to make sure all the blocks in the list are aligned at alloc time
23:40skeggsb9778[d]: to avoid having to check every region when you map it
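Putting those constraints together, a rough sketch (illustrative names, not the actual nouveau code): the mapping's page shift must divide the VA, the offset, and the range, and must not exceed the page size the backing memory was allocated with (nouveau_bo.page).

```rust
// Candidate page shifts: 2M, 64K, 4K.
const PAGE_SHIFTS: [u32; 3] = [21, 16, 12];

// Pick the largest page shift that satisfies both the mapping's
// alignment (addr/offset/range) and the allocation's own page size.
// A 64K block can still be mapped with 4K PTEs, but not the reverse.
fn pick_map_page_shift(addr: u64, offset: u64, range: u64, bo_page_shift: u32) -> u32 {
    for shift in PAGE_SHIFTS {
        let mask = (1u64 << shift) - 1;
        if shift <= bo_page_shift
            && (addr & mask) == 0
            && (offset & mask) == 0
            && (range & mask) == 0
        {
            return shift;
        }
    }
    12 // fall back to 4K
}
```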