00:23airlied[d]: _lyude[d]: yeah I talked myself out of it again after I found the fence channel bug 🙂
01:31gfxstrand[d]: We have a fence channel bug?
01:37airlied: cross-device
01:38airlied: also won't affect vulkan
05:47skeggsb9778[d]: gfxstrand[d]: Working on it!
05:48skeggsb9778[d]: gfxstrand[d]: As airlied[d] mentioned, it overlaps a lot with blackwell. Will very likely come "for free" on the kernel side
05:49skeggsb9778[d]: Minus fancy features like confidential compute etc, of course
06:11gfxstrand[d]: Yeah, I'm not worried about those. I'm mostly worried about whether or not the digits box I plan to have on my desk in May is going to work.
06:12gfxstrand[d]: I'm also interested in Thor eventually but I'm happy to kick that down the road a bit.
08:56misyltoad[d]: Just curious, what's next on the table perf wise? Barrier stuff? RA/scheduling?
09:00karolherbst[d]: proper instruction latencies + RA are kinda important
13:57mohamexiety[d]: and compression/DCC
15:19gfxstrand[d]: And once I get my 5090, I need to figure out our scaling problem. I suspect that's hitting the small cards too but the big ones are getting destroyed by it.
15:19gfxstrand[d]: And I need to figure out our UBO story. It's really not great today.
15:25redsheep[d]: gfxstrand[d]: I'll probably get one as well, can't wait to see the scaling problem be even worse 🤣
15:26redsheep[d]: I feel like the prop drivers on the 4090 have trouble getting good scaling out of it. Even matching that sounds tough as I imagine that's one of the things they've thrown the most engineering hours at.
15:30redsheep[d]: Based on what Jensen said on stage it sounds like GB202 probably has just as much INT as FP again, with independent performance for the two the way Turing had it. Still, getting 125 tflops out of that arrangement sounds like a pretty crazy huge gpu
15:46pavlo_kozlenko[d]: ||The hedgehogs pricked themselves, but continued to eat the cactus.||
15:47dwlsalmeida[d]: gfxstrand[d]: The 5090 will also double up as a nice room heater, so two products for the price of one I guess
15:48dwlsalmeida[d]: It's not that cold here in Brazil, so I've been to places with 600w heaters, a mere 25w more than this thing lol
16:21gfxstrand[d]: Yeah. Combine that with a big CPU and my dual-3060 box running CTS tests and I'm going to need an AC for my office in winter time.
16:39redsheep[d]: gfxstrand[d]: Yeah that 10980xe isn't exactly efficient. Also pretty likely to end up being the weak point if that's the chip you're pairing with the 5090, though maybe that's a good thing for helping expose things to improve with CPU overhead?
16:41redsheep[d]: For context it's not rare at all for me to be running into a CPU bottleneck on a 7950x3d with the 4090 at 4k, and this chip is often close to double the gaming perf of Skylake-X
17:23gfxstrand[d]: Nah, the 5090 is going in a brand new Ryzen box that I have yet to spec out.
17:48mhenning[d]: misyltoad[d]: I'm currently working on better instruction scheduling - this time, a prepass scheduler inspired by Goodman & Hsu 1988 that I'm hoping will reduce register spilling on shaders where that's an issue
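The prepass idea mhenning describes can be sketched roughly like this. This is a hypothetical illustration of the Goodman & Hsu 1988 approach, not the actual NAK code: run list scheduling in two modes, picking for latency while register pressure is low, and switching to picking instructions that free registers once pressure nears a limit. The `schedule` function, its arguments, and both tie-break heuristics are invented for the sketch.

```python
# Hedged sketch of a pressure-aware prepass list scheduler in the spirit
# of Goodman & Hsu 1988. Not NAK's implementation; names and heuristics
# are made up for illustration.

def schedule(instrs, deps, kills, pressure_limit):
    """instrs: list of instruction ids.
    deps: id -> set of ids that must be scheduled first.
    kills: id -> net registers freed by scheduling it (frees minus defines).
    pressure_limit: live-register count at which we switch strategies."""
    scheduled, live = [], 0
    remaining = {i: set(deps[i]) for i in instrs}
    ready = {i for i in instrs if not remaining[i]}
    while ready:
        if live >= pressure_limit:
            # Pressure mode: prefer the candidate that frees the most registers.
            pick = max(ready, key=lambda i: kills[i])
        else:
            # Latency mode: stand-in heuristic (earliest id); a real scheduler
            # would use instruction latencies / critical-path depth here.
            pick = min(ready)
        ready.remove(pick)
        scheduled.append(pick)
        live -= kills[pick]
        # Release successors whose dependencies are now all satisfied.
        for j in instrs:
            if pick in remaining[j]:
                remaining[j].discard(pick)
                if not remaining[j] and j not in scheduled:
                    ready.add(j)
    return scheduled
```

The point of the two modes is exactly the spill trade-off mentioned above: scheduling purely for latency tends to stretch live ranges, so once pressure climbs the scheduler deliberately gives up some latency hiding to keep the RA from spilling.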
18:01gfxstrand[d]: Do I really care about 3D vcache? That seems to be all AMD did this year.
18:03gfxstrand[d]: I don't think it's worth waiting until March...
18:04mohamexiety[d]: it's only good for games
18:04mohamexiety[d]: so if you want to absolutely minimize your CPU bottleneck in games, it's worth it. otherwise, no
18:46redsheep[d]: gfxstrand[d]: It does help in games quite a lot, and the 9800x3d is already available. If your priority is more towards fast builds you probably want more than 8 cores
18:47gfxstrand[d]: Yeah, I want cores
18:47redsheep[d]: And I don't recommend the 9950x3d or 7950x3d if consistency is a priority, as I'm fairly sure some weird results of mine have come from it sometimes being on vcache, and sometimes not
18:48redsheep[d]: So on balance you probably want a 9950x
18:52redsheep[d]: I'm pretty disappointed they didn't opt to put the extra cache on both dies with 9950x3d as it would have removed the tradeoff, and the extra frequency from removing vcache isn't very big anymore so the argument that the normal die is useful for frequency bound loads doesn't make as much sense
18:53redsheep[d]: It's more clearly a cost savings this time around imo
19:07mohamexiety[d]: redsheep[d]: they were very honest about it this time around https://x.com/aschilling/status/1876842395473441158?s=61
19:08mohamexiety[d]: but yeah I'd recommend either the 9950X or the 9950X3D if you don't mind waiting for March (but again, gaming only)
19:09mohamexiety[d]: there's also the 285K if compiles are your only concern (and you get more lanes for more GPUs etc) but gaming perf will be _bad_ to say the least
21:30magic_rb[d]: mohamexiety[d]: why will the 285k have bad gaming perf? i dont see anything obviously wrong with the chip
21:30mohamexiety[d]: it doesn't really do very well in benchmarks from reviews, and there are regressions vs Intel 14th/13th gen in that department
21:31mohamexiety[d]: plus the platform as a whole is a bit inconsistent atm
21:31magic_rb[d]: ah so subtle things, the more you know
21:31gfxstrand[d]: I'm going with a 9950X, I think. I haven't had an AMD box in a while.
21:32gfxstrand[d]: The only reason I got the giant i9 was because I still worked for Intel when I bought it and could get it half price.
21:34mohamexiety[d]: yeah it should be good. the main thing I don't like about AMD this time around is the platform is kinda lacking in the IO department (in terms of PCIe lanes and such) but it's fine for most people
21:35gfxstrand[d]: The 285k isn't amazing either.
21:35gfxstrand[d]: No one can do more than 16 lanes of PCIe5
21:35gfxstrand[d]: Not in the desktop space, anyway.
21:39gfxstrand[d]: Motherboard shopping was so depressing. There's so many caveats to everything. This thing shares lanes with that thing. These USB ports go to the CPU, that one goes through the chipset.
21:39mohamexiety[d]: yeah it only becomes a concern in super niche stuff.
21:39mohamexiety[d]: e.g. if you wanted double or more GPUs and would be fine with a drop to x8, Intel usually gets you x8/x8 for less sacrifice into other IO (either M.2 slots or USB ports). alternatively if you want >4 M.2 SSDs without sacrifice (usually on X870E you end up eating into GPU lanes), etc. things like that
21:40gfxstrand[d]: Two GPUs is my standard test config...
21:40gfxstrand[d]: Finding motherboards with 2x PCIe 5.0 was a pain.
21:41gfxstrand[d]: For the time being, I'm probably just going to have the 5090 in it. But I want options if this box ever gets downgraded to a test rig.
21:42redsheep[d]: gfxstrand[d]: Do you have a link for the one you picked?
21:43gfxstrand[d]: https://www.amazon.com/gp/aw/d/B0BDTM7VP5/ref=ox_sc_act_title_2?smid=ATVPDKIKX0DER&psc=1
21:44gfxstrand[d]: Amazon doesn't do a good job of listing specs but they had the good price.
21:44gfxstrand[d]: https://www.newegg.com/asus-proart-x670e-creator-wifi/p/N82E16813119589
21:44redsheep[d]: I was worried it would be Asus. They're probably my very last pick on am5
21:44redsheep[d]: My experience with a similar board has been pretty bad
21:45gfxstrand[d]: Yeah, but there really aren't many options.
21:45redsheep[d]: You probably can't run xmp or expo on that board without 60-80 second boot time
21:46mohamexiety[d]: yeeeah it's a bit annoying. there's the ASRock X870E Taichi which has x8/x8, but the second slot is a bit too low so if you don't have a case with 8 slots the second GPU will be forced to be single slot
21:46gfxstrand[d]: Do I care? I have no intention to overclock.
21:46redsheep[d]: mohamexiety[d]: That one is a good pick if you just get an enormous case though
21:46gfxstrand[d]: mohamexiety[d]: Yup. That's a real problem
21:46mohamexiety[d]: redsheep[d]: fwiw this seems to have gotten fixed now. HUB tests have had all ASUS boards booting in like 5 seconds
21:47redsheep[d]: And ASRock strangely is the vendor with things figured out best on am5, as far as I can tell
21:47gfxstrand[d]: gfxstrand[d]: Everything bad I read in the reviews was tied to the overclock stuff. I'm fine leaving all that off.
21:47redsheep[d]: mohamexiety[d]: I'll try a new bios again but I've been testing new ones every month or two for the last like two years with no success.
21:48redsheep[d]: gfxstrand[d]: Especially if you don't have vcache the performance hit from not enabling the not-really-overclock default memory overclock is pretty big
21:49gfxstrand[d]: 🫤
21:49mohamexiety[d]: eh it's not that bad
21:50redsheep[d]: mohamexiety[d]: Last I saw it's quite often 10-15%, no?
21:51gfxstrand[d]: I can eat that
21:51gfxstrand[d]: Stability is way more important than max perf on the CPU side.
21:51mohamexiety[d]: redsheep[d]: it's worth it given how janky DDR5 OCing can be, specially on a dev system
21:52redsheep[d]: If that's an acceptable loss then yeah the Asus will probably otherwise be fine
21:53redsheep[d]: And it's possible it works. I think whether Asus is still broken depends on memory kit
21:53gfxstrand[d]: The thing is that there are all of like 5 am5 motherboards with 2x PCIe5 and 3 of them are Asus.
21:53redsheep[d]: Yeah, it's not a great situation
21:54redsheep[d]: You could pretty easily just get zen 4 threadripper instead and have as many GPUs as you want
21:55gfxstrand[d]: There's a nice MSI with an even nicer price tag. And there's a gigabyte with a really low second slot.
21:55mohamexiety[d]: the ProArt is nice tbh and would have been my choice as well
21:56mohamexiety[d]: either the X670E version or the X870E one, both are fine
21:56redsheep[d]: redsheep[d]: Epyc or threadripper is of course extremely expensive though
21:57gfxstrand[d]: Yeah... I don't want that CPU to cost more than the 5090...
22:01redsheep[d]: Whether the 5090 is worth buying for me is still kind of dubious, but I can probably make up half the cost with my used 4090 so 🤷
22:01redsheep[d]: There are some bad signs like the significantly dropped clockspeeds
22:02redsheep[d]: The next two years of tinkering is probably worth $1000 more
22:03redsheep[d]: I do feel a bit like a crazy person though
22:20gfxstrand[d]: You can tinker with a smaller card. 😝
22:21gfxstrand[d]: And if Jensen is to be believed, the 5070 is just as fast as the 4090 if you turn on enough AI shit.
22:22Jasper[m]: My headcanon is that they're doing another GTX780 where the die size is nowhere near close to the last gen and instead of compensating with IPC improvements and a shrunk node it's just a bunch of fucking AI algorithms
22:22Jasper[m]: Doesn't account for the 575W TDP, but I'm assuming the AI cores may be smaller than the CUDA stuff
22:23redsheep[d]: If the Wikipedia page is to be believed, at least the gb202 die on the 5090 is truly massive at 744 mm2
22:24Jasper[m]: hmmm
22:24redsheep[d]: And even at that size it's pretty remarkable they made the specs fit. But yeah they do seem to be leaning way too hard on multi framegen to make things look good
22:28redsheep[d]: gfxstrand[d]: Also, just to interpret a joke far too seriously, framegen is probably a feature that won't be possible without being on the prop driver for the foreseeable future, right?
22:31mohamexiety[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1327042119126351872/u9v98iu9wlac1.png?ex=67819fc8&is=67804e48&hm=dcec06d1bce478ebc9cdbce9079dd7ec192645bbaa326f4f2729cdeb9ab984bb&
22:31mohamexiety[d]: re: EXPO/RAM scaling
22:32mohamexiety[d]: so it's not really that much unless you spend too much time on tweaking things
22:34redsheep[d]: Good to know
22:34gfxstrand[d]: redsheep[d]: Yeah, there's no open solution for DLSS, much less temporal/predictive.
22:36gfxstrand[d]: There's a lot of R&D work that needs to be done there and then a LOT of frame collection and training. No idea how to do that reasonably in open-source.
22:37redsheep[d]: redsheep[d]: I hope things settle on a more open solution, but it might just settle on a universal API to use proprietary backends like dlss. Much as I'm usually down to bash AI I actually think Nvidia's ML centric approach to performance improvement has some merit, and if features like dlss and framegen keep getting better it will eventually present as big a problem for their driver to have those features as it is for it to be faster without them
22:37redsheep[d]: Just in terms of people wanting to use an open driver over a closed one
22:38gfxstrand[d]: Yeah. We need an open solution.
22:39gfxstrand[d]: But how to build one is an interesting question. Nvidia has been collecting data for like 8 years now. We don't have access to any of that.
22:41gfxstrand[d]: And collecting an open repository of game renders is problematic from a copyright PoV. Same reason we don't have an open repository of traces.
22:41gfxstrand[d]: And you need a LOT of data to train an AI.
22:42redsheep[d]: It's not super clear at the moment which features are per-game trained
22:42mohamexiety[d]: I guess it wouldnt be too bad to figure out a way to run DLSS without the prop driver instead :thonk:
22:43redsheep[d]: It seems like the reflex 2 inpainting is, but the temporal upscaling is less clear afaict
22:43mohamexiety[d]: it's hard to say given how vague everything is but even something like FSR seems to be moving to an exclusive ML based model as well with FSR4 (no clue if they'll open source it like FSR 3 but it seems a bit unlikely)
22:44gfxstrand[d]: redsheep[d]: DLSS 1 was entirely per-game, or at least per-game tuned. I'm not sure about 2; 3 and 4 are generic AFAIK.
22:45gfxstrand[d]: The predictive stuff might need training or maybe their models are just better now.
22:45redsheep[d]: gfxstrand[d]: Yeah dlss 1 was pretty terrible and that approach was scrapped. They talked about how dlss2 isn't per game as a brag
22:45redsheep[d]: It kind of seems like 4 might have some of that back though?
22:45redsheep[d]: It's weird
22:47redsheep[d]: Ray reconstruction might be per game as that was mentioned to be a CNN I believe
22:47gfxstrand[d]: And DLSS 4 requires Blackwell so they're probably depending on FP4 or FP6 to use a bigger model.
22:47redsheep[d]: Multi frame gen is the only part that is exclusive
22:47redsheep[d]: The transformer model for upscaling at least has a version for older cards
22:48gfxstrand[d]: Yeah
22:48redsheep[d]: Something about Blackwell having better frame delivery? Sounds like the GPU can flip by itself now but I find that confusing since I thought present from compute already existed
22:50gfxstrand[d]: redsheep[d]: I'm not sure what they're doing with that.
22:53gfxstrand[d]: redsheep[d]: Flipping from userspace could be interesting if we could figure out how to make that work with KMS. It would require leases at the very least but also IDK what the HW API looks like to lock things down and then unlock them for the one client. Seems kinda crazy.
22:57redsheep[d]: gfxstrand[d]: That does sound crazy. I feel like as the runway starts to run out for GPUs getting faster by brute force things are getting increasingly weird.
22:58gfxstrand[d]: Yeah. And a bunch of it will be useless/impractical and probably die off as we find better ways. Hard to predict, though.
22:59gfxstrand[d]: Flip from GPU isn't as crazy as it sounds if you're doing VR or something like that.
22:59redsheep[d]: That seems like a good reason to try to make it so nvk can work with Nvidia's existing effort on dlss and framegen. They have the capacity to waste the effort on something that will be obsolete before long
23:00redsheep[d]: OSS, not quite so much
23:01gfxstrand[d]: But IDK if we can work with their DLSS without scraping stuff they probably care about keeping secret from their driver.
23:03gfxstrand[d]: I mean, I can run a demo through envyhooks and scrape their shader and their model data. IDK where that falls between clean room and pissing them off, y'know.
23:05gfxstrand[d]: But it could be that we can take one of the existing public models for image generation and somehow distill it down for super scaling needs. 🤷🏻♀️ I haven't done enough research to know how bullshit that idea is. (Probably very.)
23:06redsheep[d]: Hmm, I dunno. I'm not clear enough on how it actually works to say. I'd have said a year ago it's probably good enough for fsr2 to work on nvk and call it a day, but that approach seems to have really fallen behind and I don't think it's a given that games will continue to implement it
23:10gfxstrand[d]: Oh, it's definitely moving into drivers and/or libraries. That I'm pretty confident in.
23:12gfxstrand[d]: But the question is whether or not we can come up with a solution for Mesa.
23:13gfxstrand[d]: And I think we need a solution that isn't "make Nvidia's CUDA thing work".
23:14gfxstrand[d]: And anything that's shader based and baked into apps we should make work too.
23:16misyltoad[d]: gfxstrand[d]: I don't think you'd need to, in Proton, it just uses VK_NVX_BINARY_IMPORT; if you implement that, it should just work... assuming the stuff is there.
23:17misyltoad[d]: Obviously "just implement that" is a big task
23:17gfxstrand[d]: I'm not sure how practical it is to make that work. It might be possible. But where does it get the binary from?
23:17misyltoad[d]: But in theory for DLSS nothing else is needed other than that for it to work from the driver side
23:18misyltoad[d]: gfxstrand[d]: The binary is shipped in the prop driver
23:19gfxstrand[d]: Wait, so the binary comes from the prop driver but the app passes it in to the prop driver?
23:20misyltoad[d]: there is some dlss_dll (shipped in prop driver, as well as the shader binary in prop driver) -> dxvk nvapi (open source) -> dxvk (open source) -> driver (nvidia prop, or could be NVK) chain here
23:21gfxstrand[d]: Ah. So the app is linking against the DLSS DLL that it got from somewhere?
23:21misyltoad[d]: looks like it also needs VK_NVX_IMAGE_VIEW_HANDLE
23:22gfxstrand[d]: misyltoad[d]: That's easy enough
23:22misyltoad[d]: gfxstrand[d]: the windows dlss dll is shipped in the prop linux driver, and mounted by Proton into the prefix in the right place
23:22misyltoad[d]: so as long as the prop driver is present, that part should work(tm) in theory(tm)
23:24mohamexiety[d]: we may be able to ship the dll too I think? because the dll is available legally on many websites online (most notably Techpowerup) :thonk:
23:24misyltoad[d]: you should very likely not do that
23:24gfxstrand[d]: If we can ship it separate from the full driver stack that would be grand.
23:25gfxstrand[d]: "please install the prop driver so NVK will work" is a fun story.
23:27gfxstrand[d]: But for apps already shipping with DLSS, it's what we have to do.
23:29gfxstrand[d]: misyltoad[d]: Is it a compute shader? If so, binary import might not be too bad.
23:31redsheep[d]: If you squint at it that wouldn't be terribly different from the fact the cards need firmware we have to ship as well
23:31redsheep[d]: Assuming Nvidia gave their blessing
23:31misyltoad[d]: some cuda compute thing afaik, but i have no insight into what the cubin does
23:31gfxstrand[d]: The other question is how does the DLSS DLL know what version of the binary to hand out? If it's querying that from the Vulkan driver, we'll need to emulate the prop driver's physical device info if we want things to match up. If it's trying to open its own connection to the hardware, that's a problem.
23:32gfxstrand[d]: misyltoad[d]: As long as it runs on the compute pipeline and isn't a fragment shader, we should be okay. Compute is relatively simple.
23:33misyltoad[d]: i would also take everything i said with a slight pinch of salt as i am trying to remember how this was architected like 2/3 years ago from memory :P
23:34gfxstrand[d]: That's okay. That's more than I know from memory. 🤡
23:34gfxstrand[d]: In any case, it sounds like getting it working is within the realm of possibility.
23:34misyltoad[d]: The dlss ngx binary is here: `/usr/lib/nvidia/wine/nvngx.dll` in the prop driver
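As a rough illustration of the lookup misyltoad describes — the Proton side just has to find the DLL that the prop driver ships — here is a hedged sketch. Only the first path is the one quoted above; the second is an assumed distro variant, and `find_nvngx` is a made-up helper name, not any real Proton API.

```python
import os

# Candidate locations for the DLSS NGX DLL shipped by the proprietary
# driver. The first path is from the discussion above; the second is an
# assumed distro variant, purely illustrative.
CANDIDATES = [
    "/usr/lib/nvidia/wine/nvngx.dll",
    "/usr/lib64/nvidia/wine/nvngx.dll",  # assumption
]

def find_nvngx(candidates=CANDIDATES):
    """Return the first existing nvngx.dll path, or None when the prop
    driver (and therefore the DLL) isn't installed."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

This matches the "should work(tm) in theory(tm)" caveat above: if no candidate exists, the prop driver isn't installed and DLSS simply isn't available to mount into the prefix.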
23:35gfxstrand[d]: Do we have an issue open for this?
23:35misyltoad[d]: The relevant MRs for things back in the day were:
23:35misyltoad[d]: https://github.com/doitsujin/dxvk/pull/2260
23:35misyltoad[d]: https://github.com/jp7677/dxvk-nvapi/pull/33
23:36mohamexiety[d]: gfxstrand[d]: for DLSS? no
23:42misyltoad[d]: gfxstrand[d]: it likely just talks to NVAPI which would be dxvk nvapi which emulates stuff there
23:43redsheep[d]: I think one of the trickier parts will be that making dlss work probably means looking like Nvidia prop, and needing to have games be able to believe all of the other proprietary Nvidia magic is working even if it isn't
23:44redsheep[d]: Dxvk-nvapi still uses latencyflex to have reflex appear to work, right?
23:45misyltoad[d]: redsheep[d]: that's really all handled by dxvk and dxvk nvapi already
23:54gfxstrand[d]: gfxstrand[d]: We can file further notes here: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12439
23:55gfxstrand[d]: redsheep[d]: Yeah, if apps are putting it behind their giant "is this Nvidia?" switch, we may be in trouble.
23:58redsheep[d]: gfxstrand[d]: I suppose that can often mean more than nvapi. I feel like if it can be made not to break when games see nvk as Nvidia prop, that could be a huge benefit
23:58redsheep[d]: There may be games running slower right now because they aren't using some special Nvidia path that is more friendly to the hardware, for instance