00:22 redsheep[d]: snowycoder[d]: By pre-kepler are you talking about supporting fermi? I thought you were only building kepler support
00:24 gfxstrand[d]: Kepler A and Fermi have the same ISA
00:24 redsheep[d]: Ok, Kepler was before my time. I didn't realize there even was a Kepler A
00:25 redsheep[d]: I guess that's 600 series?
00:26 gfxstrand[d]: As if NVIDIA GPU generations mapped to marketing numbers... 😂
00:27 gfxstrand[d]: 600 series contains Fermi, Kepler A, and Kepler B.
00:28 gfxstrand[d]: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_600_series
00:28 redsheep[d]: gfxstrand[d]: Fair, though I've tended to only care about parts that are pretty high end and with only a couple rare exceptions that has correlated well for quite a few years
00:29 gfxstrand[d]: Yeah. It's generally fine for Pascal+
00:30 gfxstrand[d]: Maxwell was a little funky, but not nearly as bad as Kepler.
00:30 redsheep[d]: Maxwell hit the huge speedbump, yeah
00:33 redsheep[d]: gfxstrand[d]: To be clear same ISA doesn't mean supporting Kepler A in nvk is the same difficulty as supporting Fermi, right? There were still missing features on fermi that are important for vulkan, no?
00:35 gfxstrand[d]: Yes
00:35 gfxstrand[d]: Bindless textures are the big one
00:36 gfxstrand[d]: And it has a different copy engine
00:37 gfxstrand[d]: I'm debating what to do about bound textures and UBOs pre-Turing. What we have today technically works for UBOs but it splits the push A LOT. It doesn't work for textures. I suspect that if we want pre-Turing to actually perform, we'll need CPU descriptors. We could probably use that same infrastructure for Fermi.
00:38 gfxstrand[d]: Well, ish. The Fermi descriptors will be different because it doesn't use the bindless texture/sampler heaps.
00:39 redsheep[d]: So supporting all of pre-turing well and supporting fermi at all have a lot of overlap in terms of the legwork?
00:39 gfxstrand[d]: <a:shrug_anim:1096500513106841673> It's more work the further you go back.
00:39 gfxstrand[d]: There's some overlap, yes. But there's a bunch of work that'll be unique to Fermi.
00:40 gfxstrand[d]: There's a part of me that's inclined to go ahead and build Kepler and Fermi support and then cut the driver in half and have a Turing+ driver and a pre-Turing driver.
00:41 gfxstrand[d]: But IDK that there's actually a good reason to do that.
00:41 gfxstrand[d]: But there's a good chance we'll want to do that in a few years.
00:41 redsheep[d]: By modern standards Fermi really isn't fast. I don't see why it would be worth bothering, especially if stretching to support it makes the driver complicated enough that a split becomes worthwhile
00:42 gfxstrand[d]: For now there's probably enough benefit of pre-Turing picking up bugfixes from Turing+ that we should keep things together.
00:42 gfxstrand[d]: redsheep[d]: When you account for reclocking troubles, Fermi is faster than Maxwell or even Volta.
00:42 redsheep[d]: On the other hand, the pre and post turing worlds could be divided to have one side only care about the nouveau kmd and the other only care about nova
00:43 gfxstrand[d]: redsheep[d]: That would be the other reason. Once Nova is out and working, we would probably drop nouveau.ko support after a few years.
00:45 redsheep[d]: It's honestly pretty crazy that even with Volta as a bridge, they managed to overturn so much about how everything works with Turing
00:45 gfxstrand[d]: But also, hopefully NVKMD will alleviate most of that pain so I'm not too worried at the moment.
00:46 redsheep[d]: Turing is remembered among quite a few people as being pretty bad but the fact they pulled it off at all is impressive
00:46 gfxstrand[d]: redsheep[d]: It's pretty clear that Turing was more than one generation worth of R&D.
00:47 redsheep[d]: I think they said at one point it was started like a decade prior, yeah
00:47 gfxstrand[d]: Oh, I wouldn't believe that either
00:48 redsheep[d]: I wonder if they're waiting so much longer than usual to drop maxwell because they intend to have their entire software stack go through a similar transformation in a year or two and completely drop everything pre-turing at once
00:48 gfxstrand[d]: <a:shrug_anim:1096500513106841673>
00:48 gfxstrand[d]: Dropping Kepler was likely at least partially because it can't do Vulkan 1.3.
00:49 gfxstrand[d]: And there's a bunch of stuff that was added on Maxwell B
00:49 gfxstrand[d]: So Kepler/Maxwell is a pretty reasonable cut.
00:49 gfxstrand[d]: But I wouldn't be surprised if they dropped everything pre-Volta in one go
00:49 gfxstrand[d]: Or maybe even pre-Turing.
00:49 gfxstrand[d]: Pascal is pretty uninteresting
00:50 redsheep[d]: I mean, just in terms of timing it was about on schedule. They had been dropping things after 8-9 years like clockwork for quite a few years and then the expected date for maxwell to die came and went without a word
00:50 gfxstrand[d]: There's no reason to drop Maxwell if they keep Pascal.
00:50 gfxstrand[d]: Pascal is Maxwell C
00:51 gfxstrand[d]: The big speed bump on Pascal came from a process shrink and better memory, I think. The GPU design didn't change much. I don't think NVK has a single Pascal check.
00:51 gfxstrand[d]: Okay, we have like 6
00:52 redsheep[d]: They did quite a bit with the cache or ROPs AFAICT, but yeah, not as much of a revolution
00:52 redsheep[d]: It was the enormous jump from 28nm planar to 16nm FinFET
00:52 redsheep[d]: Hugely better electrical properties
00:53 airlied[d]: gfxstrand[d]: should I be seeing a bunch of buffer related asserts in a CTS run?
00:53 gfxstrand[d]: gfxstrand[d]: The totality of the Pascal changes in NVK are:
00:53 gfxstrand[d]: 1. Max surface size went from 4k to 8k
00:53 gfxstrand[d]: 2. They added "real" instanced draws.
00:53 gfxstrand[d]: That's literally it.
00:54 gfxstrand[d]: airlied[d]: Cherry-pick CTS commit 046343f46f7d39d53b47842d7fd8ed3279528046
00:54 gfxstrand[d]: Or just pull tip of tree
00:55 redsheep[d]: gfxstrand[d]: I think #1 was just the VRAM being big enough, and yeah, #2 I think coincides with their changes to the ROPs IIUC
00:57 redsheep[d]: Trianglebin shows pretty different-looking rasterization on Pascal, and then you plug in a Turing and it's back to looking a lot like Maxwell and Ampere, until you do Ada and it's huge and pretty different again
00:57 redsheep[d]: I wonder how it looks on Blackwell. I would be surprised if it's not just more of the same stuff as Ada
00:58 gfxstrand[d]: Yeah, they reworked the rasterizer on Pascal
00:59 gfxstrand[d]: `.degenerateTrianglesRasterized = info->cls_eng3d >= PASCAL_A,`
01:05 redsheep[d]: Oh, possible reason #3 to split NVK: the pre-Turing half would be free to forever live out its days in blissful ignorance of ray tracing existing, and if the Turing+ half needs to be reworked for RT, it could be done without having to worry about old hardware
01:07 gfxstrand[d]: yeah
01:08 gfxstrand[d]: karolherbst[d]: Do you remember why we claim 4k framebuffers pre-Pascal?
01:08 gfxstrand[d]: `.maxFramebufferHeight = info->cls_eng3d >= PASCAL_A ? 0x8000 : 0x4000,`
01:08 gfxstrand[d]: Is it just because of the silly copy engine limitations?
01:13 karolherbst[d]: I think so
01:30 gfxstrand[d]: Okay, now it's all documented (and Pascal A is fixed): https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34281
01:31 gfxstrand[d]: Glad redsheep[d] made me look at Pascal. 😅
01:36 gfxstrand[d]: That should also prevent us from ever advertising 1.3+ or `vulkanMemoryModel` on Kepler. That'll save snowycoder[d] some annoying pointless debugging.
01:42 pavlo_kozlenko[d]: we should have at least 1.2
01:43 redsheep[d]: Yeah 1.2 is what nvidia advertises and should work fine once Snowy is done
01:43 pavlo_kozlenko[d]: this is enough for dxvk 1.10 and opengl 4.5
01:44 pavlo_kozlenko[d]: and the Vulkan renderer can be used for KWin
01:46 pavlo_kozlenko[d]: because performance with Gallium OpenGL is terrible; it seems like everything is running on one thread
01:48 pavlo_kozlenko[d]: This also applies to screen recording: when recording via OBS, everything starts to lag. That doesn't happen at all on the proprietary driver.
01:48 pavlo_kozlenko[d]: even on the GT series
01:49 redsheep[d]: I've personally never seen any kind of screen recording on Linux work as well as it does on Windows, with the exception of using NvFBC on the NVIDIA proprietary driver
01:51 pavlo_kozlenko[d]: I use the CPU (software encoding), although you can also use VA-API
02:03 skeggsb9778[d]: gfxstrand[d]: that can be fixed... the method interface is defined in fw. i don't think any userspace uses fermi ce on nouveau yet (hopefully?), wouldn't be a bad idea to rework the method interface to match the official class
02:04 skeggsb9778[d]: or just use nv's fw, should that be possible
02:05 skeggsb9778[d]: https://gitlab.freedesktop.org/bskeggs/nouveau/-/blob/03.00-r570/drivers/gpu/drm/nouveau/nvkm/engine/ce/fuc/com.fuc?ref_type=heads#L72
02:26 orowith2os[d]: What OpenGL version does Nvidia advertise on Kepler again?
02:26 orowith2os[d]: 4.5 or 4.6?
02:27 orowith2os[d]: Would NVK be able to expose enough from Vulkan 1.2 for Zink to get 4.6 running?
02:30 gfxstrand[d]: skeggsb9778[d]: Ooh, interesting...
02:30 gfxstrand[d]: orowith2os[d]: Yeah, Vulkan 1.2 is fine all the way back.
02:31 orowith2os[d]: Ngh, is there a table somewhere with each GPU gen and what features they expose (on Nvidia prop)?
02:32 gfxstrand[d]: https://gpuinfo.org/
02:33 tiredchiku[d]: beat me to it
02:34 orowith2os[d]: I should've thought of gpuinfo. Thanks :ferrisClueless:
02:34 gfxstrand[d]: skeggsb9778[d]: Sounds like whoever brings up Fermi has a fun project. 😅
02:34 gfxstrand[d]: I think I'd rather fix the firmware than write new NVK code.
02:36 gfxstrand[d]: (Assuming it doesn't break nouveau GL)
02:37 redsheep[d]: Since when is fixing the firmware an option
02:38 orowith2os[d]: Since it doesn't require signed firmware
02:38 orowith2os[d]: :wires:
02:38 orowith2os[d]: Give me a couple Fermi GPUs, I'll either hack something together or brick every one of em
02:39 gfxstrand[d]: The good news is that you can get Fermis for like $15 each!
02:39 gfxstrand[d]: Brick as many as you'd like. 😂
02:42 pavlo_kozlenko[d]: orowith2os[d]: Will a GT 440 do?
02:42 pavlo_kozlenko[d]: :happy_gears:
02:42 orowith2os[d]: Is anybody using anything older than Fermi? :akipeek:
02:42 pavlo_kozlenko[d]: orowith2os[d]: it's broken
02:42 orowith2os[d]: Straight-up?
02:43 pavlo_kozlenko[d]: the system just hangs
02:43 pavlo_kozlenko[d]: gt100-300
02:43 orowith2os[d]: Hmm
02:43 pavlo_kozlenko[d]: Tesla, maybe
02:43 orowith2os[d]: Now I really am considering getting a jump on getting my own place so I can set up a dev workstation to do some hacking to at least get nicer DRM drivers for older Nvidia GPUs.
02:44 pavlo_kozlenko[d]: on GT/GTS/GTX 8xxx–9xxx everything seems to be working normally
02:44 gfxstrand[d]: gfxstrand[d]: And the more you brick the fewer Fermis in existence that we have to support.
02:44 mhenning[d]: airlied[d]: I'm not sure I'm fully following this conversation but it sounds like the two cycles should be modeled as a RAW latency to me, which I think matches what Faith suggested
02:46 gfxstrand[d]: mhenning[d]: I'm not sure I fully followed either. 😂
02:46 mangodev: o/!
02:48 mangodev: i presume since gfxstrand is here, that this channel also covers vulkan-nouveau?
02:48 gfxstrand[d]: Yes
02:48 mangodev: :D
02:49 gfxstrand[d]: This is where all the NVK folks hang out
02:49 airlied[d]: I don't understand why, if it's a RaW latency, we see NOPs with 2 in them though
02:49 gfxstrand[d]: You don't need a NOP with 2 if you just add a delay
02:49 mhenning[d]: ohh wait if we allocate a waw barrier we also need to wait the two cycles don't we
02:49 gfxstrand[d]: you only need a NOP for a delay > 15
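As a rough illustration of the NOP-vs-delay point above (a minimal sketch assuming the encoded delay field tops out at 15 cycles; the helper name is hypothetical and this is not the actual NAK code):
```rust
// Hypothetical sketch: if a required stall exceeds what the instruction's
// delay field can encode (assumed here to be 15 cycles), the remainder is
// carried by NOPs that each encode their own delay.
const MAX_ENCODED_DELAY: u8 = 15;

fn apply_stall(instr_delay: &mut u8, nop_delays: &mut Vec<u8>, required: u8) {
    if required <= MAX_ENCODED_DELAY {
        *instr_delay = required;
    } else {
        *instr_delay = MAX_ENCODED_DELAY;
        let mut remaining = required - MAX_ENCODED_DELAY;
        while remaining > 0 {
            let d = remaining.min(MAX_ENCODED_DELAY);
            nop_delays.push(d); // emit a NOP carrying this delay
            remaining -= d;
        }
    }
}
```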
02:49 airlied[d]: gfxstrand[d]: why the comment in the code then?
02:49 gfxstrand[d]: Yeah, WaW is annoying like that
02:49 airlied[d]: that says NVIDIA seems to add the NOP after all exec latency
02:49 gfxstrand[d]: airlied[d]: Which comment?
02:50 mhenning[d]: the one you wrote
02:50 gfxstrand[d]: Oh, exec latency... Yeah, I don't know what's going on with that.
02:50 gfxstrand[d]: The comment even says that I'm confused! Why are you asking for clarification? 😛
02:51 airlied[d]: because I'm wondering if that 2 is the same 2 we have to add for the delay, but for some reason they add it in a NOP
02:51 gfxstrand[d]: Right
02:51 gfxstrand[d]: That's a good question
02:51 mhenning[d]: yeah, codegen always waits two cycles for a barrier to become active
02:52 gfxstrand[d]: NAK does for Bar, MemBar, and CCtl
02:52 gfxstrand[d]: But we don't have any such thing for control flow
02:52 gfxstrand[d]: Pre-Volta, we have to for all the control-flow
02:53 gfxstrand[d]: But things there might be a bit worst-case. I got it working but IDK that it's the right or minimal number
02:53 mhenning[d]: oh I forgot that exec_latency > 1 is different from variable latency
02:54 gfxstrand[d]: Yeah, `exec_latency()` is a thing where we have to wait before executing anything.
02:54 mhenning[d]: yeah, just remembered that
02:54 gfxstrand[d]: But also, I don't fully understand what the requirements are there. For a memory barrier, is it 6 cycles before the next instruction or just 6 cycles before the next memory op
02:55 tiredchiku[d]: gfxstrand[d]: brutal
02:56 mangodev: i'm curious on how to dissect a nouveau crash report in journalctl
02:56 mhenning[d]: Right, I think airlied is talking about the 2 cycles for a barrier to become active, which is different from exec_latency > 1
02:56 gfxstrand[d]: Yeah so that we currently have as a special case in the delay code
02:56 gfxstrand[d]: But there's a reason for that!
02:57 tiredchiku[d]: mangodev: the stacktrace should print out where it crashes
02:57 gfxstrand[d]: Because of barrier re-use, the barrier insert code may add additional dependencies which are not expressed as register dependencies, and we need to take those into account.
02:58 gfxstrand[d]: So basically we have a new barrier latency thing which isn't a RaW dependency.
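Roughly, the special case being described could be sketched like this (a minimal illustration based only on the "two cycles for a barrier to become active" figure mentioned above; names and the constant are hypothetical, not the actual NAK code):
```rust
// Cycles before a freshly allocated scoreboard barrier can be waited on,
// per the "two cycles to become active" figure discussed above.
const BARRIER_ACTIVATION_CYCLES: u32 = 2;

/// Minimum stall after an instruction, accounting for the fact that a
/// barrier it allocates isn't active immediately -- a dependency that
/// never shows up as a register dependency.
fn min_stall_after(allocates_barrier: bool, reg_delay: u32) -> u32 {
    if allocates_barrier {
        reg_delay.max(BARRIER_ACTIVATION_CYCLES)
    } else {
        reg_delay
    }
}
```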
02:59 mangodev: i can't tell if the crash was from steam fossilize and nvk errored from force closing an application, or if nvk was the cause
02:59 mangodev: although i'm *maybe* leaning toward driver-side because the display locked up?
02:59 tiredchiku[d]: you can also use https://github.com/torvalds/linux/blob/master/scripts/faddr2line to get the line it crashes on as opposed to the offset, assuming the crash is in the kernel
03:00 mangodev: there was no kernel related crashes
03:00 mangodev: the only applications reporting errors are 'fossilize_repla' and 'nouveau'
03:00 tiredchiku[d]: ..right, journalctl logs everything
03:01 mangodev: from nouveau, i got a gsp error and a fifo error
03:01 mhenning[d]: gfxstrand[d]: Right, that makes sense. instr_sched_postpass doesn't model that right now, which I think is reasonable
03:01 airlied[d]: mhenning[d]: I think the short summary of what I was asking is: the docs put branch instructions into the decoupled category, which means they want scoreboards, but NAK's fixed-latency check says they are fixed latency
03:02 airlied[d]: I want to keep the sm75_instr_latencies docs-aligned, so I was wondering where to keep the difference
03:02 airlied[d]: I've just put a check in my tree for it now and seeing what CTS does
03:02 gfxstrand[d]: I like keeping things docs-aligned
03:03 gfxstrand[d]: At least in the per-SM code
03:03 mhenning[d]: Oh, iirc karol said that control flow is always variable latency, regardless of what bits you set on the instruction
03:03 mangodev: tiredchiku[d]: is 'engn:' the viewport? assuming so because of the amount of digits in the id
03:04 mhenning[d]: I'm not sure that control flow actually needs to show up in the scoreboards though
03:04 mhenning[d]: in which case we might not want to allocate a scoreboard
03:04 gfxstrand[d]: airlied[d]: Right, so I'm up for changing that. The old back-end had a needs_barrier helper and I made NAK have fixed_latency and neither of them are really quite accurate.
03:04 orowith2os[d]: gfxstrand[d]: in Mesa, or nouveau? :P
03:04 gfxstrand[d]: Because it's a tri-state, not binary
03:05 gfxstrand[d]: orowith2os[d]: Pretty sure if you brick it, we don't have to support it anywhere.
03:06 gfxstrand[d]: And even if you look at it as a tri-state, there's also FP64 which is sometimes fixed-latency and sometimes not depending on the GPU. :frog_upside_down:
03:07 gfxstrand[d]: But for control-flow instructions, the current model assumes it doesn't matter because they don't ever return anything and they only ever take predicates so there's no scoreboards to set.
03:07 gfxstrand[d]: I think there are forms that take registers, though, and those probably do need scoreboarding.
03:07 gfxstrand[d]: Like indirect jump and stuff like that
03:08 mhenning[d]: gfxstrand[d]: not in the sense of "control flow writes a scoreboard" though, right?
03:08 gfxstrand[d]: Yeah, I don't think it ever writes a scoreboard
03:08 mhenning[d]: I don't think we ever need to wait on a control flow instruction's scoreboard
03:09 gfxstrand[d]: Maybe for a funky WaR case?
03:09 gfxstrand[d]: But I think if you execute an instruction after a branch, then by definition the branch either was or wasn't taken.
03:09 mhenning[d]: yeah
03:10 gfxstrand[d]: So I don't think there's a case in which branch would write a scoreboard
03:10 gfxstrand[d]: So maybe that's what "decoupled" means
03:10 mhenning[d]: which is why it's set up as "fixed latency" in the compiler even though it takes a variable number of cycles
03:10 tiredchiku[d]: mangodev: that sounds like an MMU fault, though maybe someone else here can explain it better
03:10 gfxstrand[d]: Yeah...
03:10 gfxstrand[d]: Okay, I think I'm convinced that we should make it a tristate.
03:11 gfxstrand[d]: I don't think the current code is broken but I think it's probably better to be explicit and docs-matching.
03:11 gfxstrand[d]: Even if the NVIDIA names seem weird
03:13 airlied[d]: I've pushed my branch to at least clean up where I've got to, still not tri-state, but at least CTS passes again after mhenning[d] scheduler work
03:13 gfxstrand[d]: Sweet!
03:14 mhenning[d]: gfxstrand[d]: I think NVIDIA names put BRA and LDG both as "decoupled". I don't think it's a tristate in the sense you seem to suggest it is
03:14 gfxstrand[d]: I'm only around Monday and Tuesday this next week but I'll try to take a look tomorrow. I really want to know how NVIDIA is modeling things so I can update my mental model.
03:14 gfxstrand[d]: mhenning[d]: Uh...
03:15 gfxstrand[d]: Now I'm back to being confused
03:15 mhenning[d]: My understanding is that nvidia has 3 categories: fixed latency, variable latency, and fp64
03:15 airlied[d]: unconditional BRA is coupled
03:15 gfxstrand[d]: Right. That's what I was saying at the top (roughly)
03:15 mhenning[d]: but they don't call them that
03:16 airlied[d]: conditional BRA is decoupled
03:16 airlied[d]: coupled, decoupled and redirected
03:16 gfxstrand[d]: Right, so those names do actually make sense
03:16 gfxstrand[d]: Just in a weird way
03:16 tiredchiku[d]: ~~what's BRA again~~
03:17 mhenning[d]: airlied[d]: Does the "coupled"/"decoupled" here change anything about the actual instruction encoding?
03:17 gfxstrand[d]: coupled is "uses the one clock", decoupled is "it's off on its own" and redirected means "could be either"
03:17 mhenning[d]: tiredchiku[d]: branch
03:17 airlied[d]: HMMA/IMMA are also in the redirected category
03:17 gfxstrand[d]: tiredchiku[d]: Branch.
03:17 gfxstrand[d]: airlied[d]: Yeah, that's what I figured
03:17 airlied[d]: mhenning[d]: doesn't seem to
03:17 tiredchiku[d]: ah, ty
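For reference, the tri-state being discussed could be sketched as an enum using the names from NVIDIA's docs (illustrative only; this is not how NAK actually represents it):
```rust
// Hypothetical sketch of the tri-state, using NVIDIA's doc terminology.
#[derive(Clone, Copy, PartialEq, Eq)]
enum LatencyClass {
    /// Fixed latency: results are ready after a statically known delay,
    /// so only the delay field is needed.
    Coupled,
    /// Variable latency: completion is signaled through a scoreboard
    /// that consumers must wait on.
    Decoupled,
    /// Could be either, depending on the instruction/GPU
    /// (e.g. HMMA/IMMA, FP64 on some parts).
    Redirected,
}

fn needs_scoreboard(class: LatencyClass) -> bool {
    // Treat "redirected" conservatively as needing a scoreboard.
    !matches!(class, LatencyClass::Coupled)
}
```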
03:18 gfxstrand[d]: gfxstrand[d]: Okay, so we're right back to my `Delay` struct.
03:18 airlied[d]: but we don't have encoding docs, so I can't be 100% 🙂
03:18 mhenning[d]: airlied[d]: I'd lean towards keeping branches as "uses_scoreboard = false" in that case
03:19 mhenning[d]: but I don't have a strong opinion on the representation
03:19 gfxstrand[d]: airlied[d]: Honestly, if you just make `has_fixed_latency(Op::Bra)` return false, I don't think it'll break anything.
03:19 airlied[d]: well I've done needs_scoreboards with branch saying it doesn't right now
03:20 airlied[d]: I just want to make sure the code is clear between what the docs say and what is just how things are in NAK right now
03:20 mhenning[d]: Yeah, that makes sense to me
03:20 mhenning[d]: tiredchiku[d]: If you want a full list of instruction names, it's here: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html
03:21 tiredchiku[d]: thank you :D
03:21 gfxstrand[d]: airlied[d]: That's fair. Anything that's copy+paste of docs should match the docs. If we need to special case `OpBra` for some reason, we can do that in something that wraps the docs thing.
03:22 airlied[d]: okay that's what's there now then
03:22 gfxstrand[d]: I'll have much better ideas once I've actually read the MR.
03:22 tiredchiku[d]: will pore over them on sunday
03:22 gfxstrand[d]: I've been too busy with CTS runs and life to look at it yet. I'm going to read it tomorrow. I think that's the next thing in the priority queue.
03:23 gfxstrand[d]: But for now I need to go to bed so my brain works tomorrow when I'm reading your MR. 😛
03:24 mhenning[d]: gfxstrand[d]: yeah, it might just mean we have a few if statements where we check `instr.is_branch()`
03:24 airlied[d]: oh still have little bit more cleanup to do
03:24 gfxstrand[d]: mhenning[d]: That's fine
03:25 gfxstrand[d]: And we can have a wrapper which handles that and gives us something more NAKy
03:25 gfxstrand[d]: If that's helpful
03:25 gfxstrand[d]: But I kinda don't think it's going to matter in the end. `OpBra` doesn't write anything so no one cares when it "completes".
03:26 gfxstrand[d]: But also, it tends to be a special case for reasons so it's probably a wash in the end.
03:26 gfxstrand[d]: But let me read the MR and then I'll have more detailed ideas.
03:30 airlied[d]: okay found a bug in it, needs to fix and CTS again anyways
03:32 pavlo_kozlenko[d]: https://youtube.com/shorts/nhY0nXZMMSc?si=aLQ5PJbN7iMEk7ab
03:32 pavlo_kozlenko[d]: kepler codegen
03:35 mangodev: how do you find parts of the driver that run slower than expected?
03:36 airlied[d]: mhenning[d]: okay now I've hooked it up properly to your scheduler, I hit the assert at end of calc_instr_deps
03:36 airlied[d]: in some shaders
03:38 mhenning[d]: Ah, that means we're modeling latencies differently between the scheduler and calc_deps
03:39 mhenning[d]: The assert isn't critical - you could comment it out for now
03:40 airlied[d]: I wonder if it's the reguse none case in calc_instr_deps
03:40 airlied[d]: since I don't change that for the sm75 deps, as I wasn't sure what was correct, but perhaps I should
03:42 airlied[d]: not sure anything should be directly calling instr_latency anymore, but I'm not sure what the intent behind it is
03:44 mhenning[d]: instr_latency is used in some cases where we need an upper bound
03:44 airlied[d]: it's broken though
03:44 mhenning[d]: so, eg. if you execute a branch, we're not currently smart enough to figure out a raw hazard across the branch
03:45 airlied[d]: I think getting the raw_latency will usually get the highest one
03:45 mhenning[d]: so we use instr_latency to get the maximum to wait for any reader
03:45 mhenning[d]: airlied[d]: you can't call raw_latency without a reader
03:45 airlied[d]: I know, but I think we should fake a worst case reader
03:46 airlied[d]: probably redirectedfp64
03:46 mhenning[d]: right, so we currently call "instr_latency" to get a value for the worst case reader, but maybe it would make sense to make those functions take an Option for the reader or something
03:46 mhenning[d]: or add more functions
03:48 x512[m]: Anybody knows what was the first Nvidia model with channel system and user-mappable FIFO?
03:49 mhenning[d]: take a look at envytools they have some stuff on older gpus
03:50 mhenning[d]: oh, wait user-mappable FIFO. uh, not sure but also envytools might still have the answer
03:54 airlied[d]: adding option to read doesn't seem too bad
03:57 mhenning[d]: airlied[d]: yeah, although we technically do want the maximum of waw, raw, war
04:03 gfxstrand[d]: RaW should always be the highest, though.
04:04 gfxstrand[d]: But maybe a separate thing would be the cleanest interface? <a:shrug_anim:1096500513106841673>
04:05 gfxstrand[d]: I don't love `instr_latency` as a name but I also don't have a better plan.
04:07 gfxstrand[d]: Actually, I think paw is the highest
04:08 gfxstrand[d]: But not everything writes predicates
04:10 gfxstrand[d]: But maybe if we have `instr_latency`, we can get rid of `paw_latency`?
04:11 gfxstrand[d]: The only real reason for PaW, IIRC, was because read takes a source index and an op and that doesn't make sense for predicates since those are not an indexed source and are handled by the front-end.
04:15 gfxstrand[d]: But if the read op and index were optional, PaW could just use that, I think. <a:shrug_anim:1096500513106841673>
04:15 airlied[d]: I've added a worst_latency to the list of apis and wrapped it
04:15 gfxstrand[d]: Lol. That works.
04:16 airlied[d]: ah yes, paw is that now; I picked the worst-case reader
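Something along these lines is presumably what's meant by taking an optional reader plus a worst-case fallback (a hedged sketch; the trait name, signatures, and the use of opcode strings are placeholders, not the actual NAK API):
```rust
// Illustrative only: opcode strings stand in for the real instruction types.
trait SmLatencies {
    /// Cycles from `write` producing `dst_idx` until `read` may consume
    /// it at `src_idx`.
    fn raw_latency(&self, write: &str, dst_idx: usize, read: &str, src_idx: usize) -> u32;

    /// Upper bound used when the consumer is unknown (e.g. across a
    /// branch): pretend the reader is the slowest-to-consume class.
    fn worst_latency(&self, write: &str, dst_idx: usize) -> u32;

    /// Take the reader as an Option and fall back to the worst case.
    fn latency(&self, write: &str, dst_idx: usize, read: Option<(&str, usize)>) -> u32 {
        match read {
            Some((op, src_idx)) => self.raw_latency(write, dst_idx, op, src_idx),
            None => self.worst_latency(write, dst_idx),
        }
    }
}
```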
04:20 mangodev: hmmmmm, i can't tell whether there's an underlying issue with the way the driver is talking to the compositor or not
04:21 mangodev: today, i've noticed a journalctl log FLOODED with 'kwin_wayland_drm: Page flip failed: Invalid argument'
04:21 mangodev: strange occurrence, as that wasn't happening before i updated mesa to git last night
04:22 mangodev: not sure what it's from, because it has since stopped
04:23 mangodev: although i think i discovered an *actual* issue while looking at logs
04:25 mangodev: for some reason the graphics card briefly disconnected?
04:35 HdkR: Oops, GPU fell off the bus.
04:58 airlied[d]: mhenning[d]: commented it out with a pointer to the test that explodes
04:59 mhenning[d]: okay, I can probably take a look tomorrow
05:39 tiredchiku[d]: mangodev: that might be a compositor bug, I've seen people encounter it even on amdgpu and nvidia-drm
05:54 mangodev: interesting
07:34 skeggsb9778[d]: x512[m]: at least nv3 - though they had PIO channels (bash "registers" for each subchannel's method instead of using a push buffer)
07:34 skeggsb9778[d]: nv4 is the first to have dma push buffers intended for userspace
07:34 skeggsb9778[d]: i don't know much about the hw prior to nv3
08:51 mwk: x512[m]: literally all nvidia hardware had it
08:51 mwk: starting from nv1
08:52 mwk: it was just very primitive on the early GPUs (involving kernel traps for context switching)
09:16 x512[m]: I mean user-writable FIFO ring buffer.
09:17 karolherbst: well, that's just mapping those memory ranges to user-space, no? Though I think you need real virtual memory support to do so safely
09:18 karolherbst: on the GPU that is
09:18 karolherbst: anything that needs relocations need the kernel to well.. handle relocations
09:18 x512[m]: Not just mapping, but being able to allocate arbitrarily many FIFOs and switch between them.
09:19 karolherbst: my point is rather, that any GPU new enough to have virtual memory also has this
09:19 karolherbst: I think nv50 got virtual memory, it's also the first gen supporting CUDA
09:20 karolherbst: and you probably don't really care about older GPUs anyway
09:20 x512[m]: AMDGPU still has no user FIFOs in GFX12 and uses a global kernel FIFO... at least for OpenGL/Vulkan, not ROCm.
09:22 karolherbst: yeah... I think I know for sure that Kepler can handle it
09:22 karolherbst: but I don't see any reasons why it wouldn't work on anything nv50 and newer
10:30 ristovski[d]: Hmm, is the GSP used with the Windows drivers as well? And if so, since what version?
13:06 mohamexiety[d]: they started shipping the binary since 565 iirc
13:06 mohamexiety[d]: but I heard conflicting things about it being active on windows or not
18:15 gfxstrand[d]: airlied[d]: , mhenning[d] Please give a read through https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34302. I think this cleans up the interfaces a good bit and I'm happy to do the work to rebase Dave's branch on top of it if we all like the API.
18:20 gfxstrand[d]: I'm just trying to avoid the "generic helper thing has a big SM switch to call per-SM thing which should have been routed through the trait in the first place" problem.
19:23 airlied[d]: the commit that moves stuff to a separate file doesn't have the separate file 🙂
19:25 gfxstrand[d]: damn...
19:25 gfxstrand[d]: Fixed
19:27 gfxstrand[d]: I've pushed a rebase of your MR to nak/latency-rebase
19:27 gfxstrand[d]: I'm gonna run it on the SM73 card that just showed up in my mailbox.
19:35 gfxstrand[d]: Can I fit 8x CTS in 4GB of VRAM? We're gonna find out...
19:36 gfxstrand[d]: I'm glad I splurged for the 4GB T400. 2GB would be hell to run the CTS on.
19:40 gfxstrand[d]: Does anyone know what shader model Orin reports?
19:42 mhenning[d]: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ says 87
19:43 gfxstrand[d]: mhenning[d]: Right. I missed that. That page is helpful but it's missing a few GPUs.
19:43 gfxstrand[d]: which is really annoying
19:46 gfxstrand[d]: OpenRM also has SM 7.1 and 7.2 in their headers: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/c5e439fea4fe81c78d52b95419c30cabe44e48fd/src/common/sdk/nvidia/inc/ctrl/ctrl2080/ctrl2080gr.h#L306
19:47 pavlo_kozlenko[d]: Why do you all talk about CUDA so much?
19:47 mohamexiety[d]: for Orin the main thing I know will be different is tiling. iirc we were working based on the orin manual's tiling parameters for host_image_copy and it turned out to be pretty wrong for dGPUs
19:47 mohamexiety[d]: so I wouldn't be surprised if there are other funny differences too :thonk:
19:49 gfxstrand[d]: Okay, according to openrm sources, SM 7.0 and SM 7.2 are Volta and SM 7.3 and SM 7.5 are Turing:
19:49 gfxstrand[d]: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/c5e439fea4fe81c78d52b95419c30cabe44e48fd/src/common/unix/nvidia-3d/src/nvidia-3d-init.c#L288
19:49 gfxstrand[d]: pavlo_kozlenko[d]: What do you mean?
19:49 pavlo_kozlenko[d]: mhenning[d]: .
19:50 pavlo_kozlenko[d]: karolherbst: .
19:50 pavlo_kozlenko[d]: mhenning[d]: .
19:50 gfxstrand[d]: The CUDA docs are often the best source of public documentation for NVIDIA GPUs, especially for anything having to do with shaders.
19:50 gfxstrand[d]: So we reference the CUDA docs a lot
19:51 gfxstrand[d]: The shader models, for instance, aren't just a made-up CUDAism; they're basically the version numbers for the ISA
19:52 gfxstrand[d]: gfxstrand[d]: I have no idea what cards are actually SM 7.2. Maybe the workstation Volta? Or maybe some crazy server part?
19:54 gfxstrand[d]: airlied[d]: The Ampere patches you have include Ada (SM 8.9) in Ampere. Is that intended?
19:55 gfxstrand[d]: I can pull out my Ada card and run it after a bit
19:59 airlied[d]: No I hadn't deliberately done Ada because I had no indications on whether it was compatible, so that is a mistake
20:01 gfxstrand[d]: Okay. I'm adding some helpers which will make that mistake less likely
20:04 mhenning[d]: gfxstrand[d]: Ampere binaries can run unmodified on ada (source https://docs.nvidia.com/cuda/ada-compatibility-guide/#compatibility-between-ampere-and-ada ) so it's correct to use ampere latencies on ada
20:05 mhenning[d]: I thought that part was intentional when I was reviewing it
20:07 gfxstrand[d]: Okay
20:08 airlied[d]: oh okay then leave it in 🙂
20:08 gfxstrand[d]: airlied[d]: mhenning[d] Are you okay with my refactors? If so, I may push my rebase and a few fixes to <@1048069888792596531>'s branch
20:09 airlied[d]: yes it looks good to me
20:10 mhenning[d]: gfxstrand[d]: Haven't looked yet, will look later today
20:11 asdqueerfromeu[d]: mhenning[d]: They also share the same USERMODE/GPFIFO class 🐸
20:23 gfxstrand[d]: Okay, in my rebase, it now uses arch names so it's clear that ampere and ada are taking the same path
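For illustration, routing by architecture name rather than raw SM number might look something like this (hypothetical names; the SM-to-arch mapping follows the openrm headers cited above, and reusing Ampere latencies for Ada follows the compatibility guide mhenning linked):
```rust
// Illustrative sketch only; names are hypothetical, not the NAK API.
enum LatencyArch {
    Turing, // SM 7.3 / 7.5
    Ampere, // SM 8.0 / 8.6 (and Ada, see below)
}

fn latency_arch_for_sm(sm: u8) -> Option<LatencyArch> {
    match sm {
        73 | 75 => Some(LatencyArch::Turing),
        80 | 86 => Some(LatencyArch::Ampere),
        // Ada (SM 8.9) runs Ampere binaries unmodified, so it can
        // explicitly reuse the Ampere latency tables.
        89 => Some(LatencyArch::Ampere),
        _ => None,
    }
}
```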
20:23 airlied[d]: gfxstrand[d]: I'll await you pushing to my branch before I do any review changes
20:24 gfxstrand[d]: Okay. I also tweaked the long delays stuff in my branch. It's an extra commit that should be squashed in with the other two
20:24 gfxstrand[d]: Overall, I think I'm pretty happy with it.
22:17 x512[m]: gfxstrand[d]: Which one should be used in NVK: smVersion or spaVersion? They sometimes mismatch in the openrm code.
22:32 gfxstrand[d]: SmVersion
22:33 gfxstrand[d]: I just looked at the remap function today