00:59 fdobridge_: <g​fxstrand> Kicked off an Ada run on my laptop. I'll start Ampere after supper
01:34 fdobridge_: <g​fxstrand> The bad news is that Ada fails. The good news is that it fails consistently and quickly. Looks like exactly the same as the WSI fail.
04:50 fdobridge_: <g​fxstrand> Okay, this bug is utterly insane. If I delete a bunch of tests that skip from the test list it passes...
04:51 fdobridge_: <g​fxstrand> My mind is boggled.
04:53 fdobridge_: <g​fxstrand> Also, reducing the test list is a bear. I've got it down to 218859 tests to reproduce (about 12.5 seconds) but reducing further is proving difficult.
05:22 fdobridge_: <g​fxstrand> Alright, I'm calling it. This is a kernel bug. It's something with context creation, I'm pretty sure. When I run my repro list with `NVK_DEBUG=push`, I get exactly the same thing for the hanging case as the passing case. When I run with `NAK_DEBUG=print`, the only differences are some randomly generated constants in shaders that get written to buffers. They change from run to run so there's no way those are the culprit given that it's 100%
05:22 fdobridge_: <g​fxstrand> @airlied dakr: ^^
05:38 fdobridge_: <g​fxstrand> https://gitlab.freedesktop.org/drm/nouveau/-/issues/315
05:38 fdobridge_: <g​fxstrand> @airlied Back to you. Have fun!
05:39 fdobridge_: <g​fxstrand> If you want to tell me where to hack nouveau to bump various ring sizes, I'm happy to try stuff out.
05:46 fdobridge_: <g​fxstrand> Oh, and if it helps motivate anyone, this is the bug standing between us and 1.3 conformance on Turing+. 😝
07:36 fdobridge_: <a​irlied> Once I figure out the correct fix for the sync irqs I'll take a look, or maybe while I'm getting annoyed at that I will have time
15:04 fdobridge_: <!​DodoNVK (she) 🇱🇹> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27319 :triangle_nvk:
16:56 fdobridge_: <g​fxstrand> Thanks! I keep forgetting those... 🤦🏻‍♀️
17:09 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand random thought: with `mme` we could actually dump the subchannel context and dump all set values on each method. Could be useful to debug some annoying issues and diff between different runs
17:10 fdobridge_: <k​arolherbst🐧🦀> the annoying part would be to write that code, as I bet you can't just read out everything as the GPU will complain about unknown methods :/ and I don't know how that works on methods being data dumps
17:13 fdobridge_: <g​fxstrand> Yeah, I've thought about that. I've also thought about trying to do something at a CPU SW level to prevent us from using state we don't explicitly initialize. IDK that I really like that, though, as there's a lot of state that doesn't matter until we use it.
17:15 fdobridge_: <g​fxstrand> And, like, with this one... What the hell state is it using?!? It's a really dumb draw call. It's not even using a texture. I know we disable all the shaders on init.
17:16 fdobridge_: <g​fxstrand> Maybe there's a stale bound CBuf?
17:26 fdobridge_: <g​fxstrand> Yeah, it's not that. I'm now disabling all CBufs on init and it still crashes
17:28 fdobridge_: <g​fxstrand> If I could just get my fault address, this might be debuggable
17:30 fdobridge_: <k​arolherbst🐧🦀> could also be some post draw thing.. like I dunno..
17:32 karolherbst: TimurTabi: weren't you working on something debugging related to GSP or was that somebody else?
17:32 fdobridge_: <g​fxstrand> We're really starting to run out of ways that draw could possibly get at memory to fault
17:32 fdobridge_: <k​arolherbst🐧🦀> it might not be the draw
17:33 fdobridge_: <g​fxstrand> I'm running with `NVK_DEBUG=push_sync`
17:34 fdobridge_: <g​fxstrand> With everything I've eliminated, it's starting to look like corrupt page tables or weird GSP shit
17:34 fdobridge_: <k​arolherbst🐧🦀> huh.. that push-fail thing doesn't even have a draw?
17:34 fdobridge_: <g​fxstrand> It does. It's the MME at the bottom
17:34 fdobridge_: <g​fxstrand> But it's a really dumb draw. Doesn't use textures or anything
17:35 fdobridge_: <k​arolherbst🐧🦀> at least the one you uploaded doesn't have an MME
17:35 fdobridge_: <k​arolherbst🐧🦀> only a `NVC1B5_LAUNCH_DMA`
17:35 fdobridge_: <g​fxstrand> Oh, that's the wrong one
17:35 fdobridge_: <g​fxstrand> I had to fix pushbuf dumping
17:35 fdobridge_: <g​fxstrand> It's a draw that's crashing
17:36 fdobridge_: <k​arolherbst🐧🦀> but yeah... there is random stuff going on, but that's since forever and I'm sure we also have pre GSP kernel bug somewhere...
17:36 fdobridge_: <k​arolherbst🐧🦀> I've run into similar issues with the GL CTS, just that those faults happened after hours
17:37 fdobridge_: <k​arolherbst🐧🦀> but that's just using a single context, because GL
17:37 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand can you make the CTS stop reusing `VkDevice`s?
17:38 fdobridge_: <k​arolherbst🐧🦀> I wonder if that changes anything
17:38 fdobridge_: <k​arolherbst🐧🦀> (or makes it even worse)
17:38 fdobridge_: <g​fxstrand> No
17:39 fdobridge_: <g​fxstrand> I mean maybe?
17:39 fdobridge_: <g​fxstrand> IDK if there's a flag for that or not
17:39 fdobridge_: <k​arolherbst🐧🦀> yeah.. not sure either, I think there might be something, but I'm also only 5% sure
17:40 fdobridge_: <g​fxstrand> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27320/diffs?commit_id=d440eb96ce01d7a44ef365ea9f5957b32f075fb6
17:40 fdobridge_: <g​fxstrand> There isn't a flag.
17:42 fdobridge_: <k​arolherbst🐧🦀> maybe somebody had a patch somewhere...
17:42 fdobridge_: <g​fxstrand> I mean... I could do a crazy driver hack to create a new context for every command submit
17:42 fdobridge_: <k​arolherbst🐧🦀> mhhhh
17:42 fdobridge_: <k​arolherbst🐧🦀> I mean.. if you invalidate all state it _should_ work, shouldn't it?
17:43 fdobridge_: <k​arolherbst🐧🦀> but it doesn't really tell us anything anyway
17:43 fdobridge_: <k​arolherbst🐧🦀> state dumping sounds more promising tbh
17:44 fdobridge_: <g​fxstrand> State dumping doesn't do much if I don't know what I'm looking for
17:44 fdobridge_: <k​arolherbst🐧🦀> I mean like.. dumping everything
17:44 fdobridge_: <g​fxstrand> That's just a bigger haystack. Without even knowing the fault address, how am I supposed to know if anything is wrong.
17:45 fdobridge_: <k​arolherbst🐧🦀> diff it with the working run
17:45 fdobridge_: <g​fxstrand> Yeah, maybe?
17:45 fdobridge_: <k​arolherbst🐧🦀> yeah, I dunno
17:46 fdobridge_: <g​fxstrand> I wonder how long a "dump everything" MME would take to run...
17:46 fdobridge_: <k​arolherbst🐧🦀> state dumping is just a pain, because I'm sure nvidias internal description of that stuff would allow you to auto generate dumping code :ferrisUpsideDown:
17:46 fdobridge_: <k​arolherbst🐧🦀> mhh actually
17:46 fdobridge_: <g​fxstrand> I think my first attempt would be to literally dump everything and see if that works.
17:47 fdobridge_: <k​arolherbst🐧🦀> probably enough to only dump stuff `border_swizzle` touches
17:47 fdobridge_: <g​fxstrand> Maybe it complains but maybe it doesn't?
17:47 fdobridge_: <k​arolherbst🐧🦀> mhhh
17:47 fdobridge_: <k​arolherbst🐧🦀> yeah.. dunno what happens on reads actually
17:47 fdobridge_: <g​fxstrand> Nah. The border swizzle tests don't actually run
17:47 fdobridge_: <k​arolherbst🐧🦀> ahh...
17:47 fdobridge_: <k​arolherbst🐧🦀> pain
17:47 fdobridge_: <g​fxstrand> They just probably create and destroy enough contexts to get the moons to align
17:48 fdobridge_: <g​fxstrand> (I guess NVKland now canonically has multiple moons...)
17:48 fdobridge_: <k​arolherbst🐧🦀> figures
17:50 fdobridge_: <g​fxstrand> https://tenor.com/view/thats-no-moon-obi-wan-kenobi-starwars-death-star-gif-10170550
17:57 fdobridge_: <g​fxstrand> Hrm... There are only about 1k states on Turing. That should totally be dumpable.
18:18 fdobridge_: <t​om3026> Use the force, gfxstrand.
18:21 fdobridge_: <k​arolherbst🐧🦀> yeah.. it's not too bad, and 3D is already quite big compared to the others
18:21 fdobridge_: <k​arolherbst🐧🦀> @gfxstrand you know what we could do with state dumping in place? Find aliasing methods between the sub channels...
18:40 fdobridge_: <g​fxstrand> Yeah, maybe
19:13 Sid127: oh, heh, looks like a bouncer went offline
19:14 Sid127: oh wait these are all matrix connections
19:35 karolherbst: matrix moments
21:21 fdobridge_: <p​avlo_it_115> Hello everyone! Can anyone tell me what state NVK is in on Kepler? Can it be used?
21:23 fdobridge_: <S​id> Kepler does work if you set `NVK_I_WANT_A_BROKEN_VULKAN_DRIVER=1` but it's a bit sketchy.
21:24 fdobridge_: <p​avlo_it_115> Sketchy?
21:24 fdobridge_: <p​avlo_it_115> why
21:25 fdobridge_: <m​henning> It's only partially implemented
21:26 fdobridge_: <m​henning> It needs more work before it will really be usable, although some stuff might work already
21:36 fdobridge_: <r​edsheep> Here's an example of where Kepler is at https://gitlab.freedesktop.org/mesa/mesa/-/issues/10447
22:10 fdobridge_: <g​fxstrand> I really need to get around to fixing that but I think it'll have to wait until after Vulkanised
22:10 fdobridge_: <g​fxstrand> Or maybe I'll plug my pikvm into my Kepler card and fix it during the MoltenVK talks. 😅
22:12 fdobridge_: <g​fxstrand> So... I dumped everything on the 3D engine... No differences until... the MME faults.
22:12 fdobridge_: <g​fxstrand> Yeah, the state dumping MME does part-way through.
22:14 fdobridge_: <r​edsheep> Will Kepler ever be able to support enough for zink to work well, or will zink need special paths for it?
22:14 fdobridge_: <g​fxstrand> Kepler should be able to support most stuff.
22:14 fdobridge_: <g​fxstrand> Just not memory model
22:20 fdobridge_: <g​fxstrand> @karolherbst https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27321
22:21 fdobridge_: <p​avlo_it_115> Can you be more detailed, please? What does it affect?
22:22 fdobridge_: <k​arolherbst🐧🦀> wait.. all methods are the inputs to that macro? impressive
22:24 fdobridge_: <g​fxstrand> Yup
22:24 fdobridge_: <g​fxstrand> It just dumps everything in clc597.h
22:25 fdobridge_: <g​fxstrand> A lot of it is probably useless but whatever
22:27 fdobridge_: <g​fxstrand> Some of the details around memory ordering on global memory access. Global loads and stores work, they just don't have tight enough ordering guarantees on Kepler.
22:27 fdobridge_: <d​adschoorse> also no vulkanMemoryModel -> no vulkan 1.3
22:27 fdobridge_: <g​fxstrand> Yup
22:27 fdobridge_: <g​fxstrand> Unless I can somehow be more clever than the NVIDIA engineers.
22:28 fdobridge_: <g​fxstrand> But I know the people who did the memory model stuff at NVIDIA. They're *very* clever.
22:32 fdobridge_: <k​arolherbst🐧🦀> what's the limitation here anyway?
22:32 fdobridge_: <g​fxstrand> Something with memory access ordering in the hardware (edited)
22:32 fdobridge_: <k​arolherbst🐧🦀> mhhh
22:32 fdobridge_: <k​arolherbst🐧🦀> I see
22:39 fdobridge_: <g​fxstrand> This has me really confused. The only reason I could imagine that happening is if my page tables are being ripped out from under me somehow.
22:40 fdobridge_: <r​edsheep> What about Fermi? I know Nvidia never tried as it was already pretty much EOL at the time but wasn't Fermi pretty close to Kepler?
22:41 fdobridge_: <r​edsheep> Just naively reading the white papers it's hard to see what Kepler really added that would matter here except being much wider
22:41 fdobridge_: <g​fxstrand> I think so? 🤷🏻‍♀️
22:41 fdobridge_: <g​fxstrand> I've heard that they had a Fermi driver they decided not to release.
22:42 fdobridge_: <g​fxstrand> But the hardware gets more annoying the further back you go
22:44 HdkR: Oldest architectures also had the annoying split data and texture caches. So if you were modifying textures through a buffer and then sampling you would need some cache clear operations in the middle. Nightmare :|
22:45 fdobridge_: <r​edsheep> Yeah for now it certainly makes sense to focus on the hardware with the features and performance to make NVK work really well.
22:45 fdobridge_: <g​fxstrand> IDK. I worked on Intel. 😂
22:47 fdobridge_: <g​fxstrand> We had separate caches for everything
22:48 fdobridge_: <r​edsheep> I know it's probably been asked a thousand times but I don't feel like I've ever got a satisfying answer, but is there any hope at all for Maxwell? I'm sure that before long Nvidia will be looking to axe support and it would be awesome for mesa to pick up where Nvidia left off there.
22:48 fdobridge_: <g​fxstrand> And image layouts. 🤡
22:48 fdobridge_: <r​edsheep> Like I know the basics, it can't reclock, the firmware we have sucks
22:48 fdobridge_: <r​edsheep> Is it just up to Nvidia to publish firmware we can actually use?
22:48 fdobridge_: <g​fxstrand> Maxwell should mostly be working. A few people are trying to bring up the new compiler right now and it's already passing a pile of tests.
22:50 fdobridge_: <r​edsheep> So gm100 chips could theoretically reclock and make use of that, but gm200 just never will with current firmware right?
22:50 fdobridge_: <g​fxstrand> Pretty much
22:52 fdobridge_: <t​om3026> are you talking about same pages as https://github.com/torvalds/linux/commit/23baf831a32c04f9a968812511540b1b3e648bf5 because im just about to try revert that commit it breaks my dma_pools but i dont get why im the sole person on this planet getting a stacktrace on boot because of it heh
22:52 fdobridge_: <a​irlied> And NVIDIA will not release Maxwell fw we can use probably ever
22:53 fdobridge_: <r​edsheep> For all the Nvidia people reading here, a great marketing story would be resolving that blocker at the same time you guys announce dropping Maxwell support. Turn a bad story good.
22:54 fdobridge_: <a​irlied> Nope they have no case for doing it, it's a lot of engineering work
22:57 fdobridge_: <t​om3026> or well just noticed it landed a kern version before it broke bleh i was so close.. oh well time to git bisect a gazillion MM commits then
23:02 fdobridge_: <g​fxstrand> I mean, it's possible I guess but I have no evidence that that would help.
23:02 fdobridge_: <t​om3026> yeah it probably wont noticed the tags its in before it began borking
23:03 fdobridge_: <t​om3026> i have faith if i bisect this find the offending commit suddenly all my software runs null pointer deref free and allocates memory like never before
23:04 fdobridge_: <p​avlo_it_115> Anyone in the community asking for firmware from nvidia?
23:06 fdobridge_: <p​avlo_it_115> everyone seemed... resigned? (edited)
23:06 fdobridge_: <a​irlied> We only asked since Maxwell 2 released 🙂
23:06 HdkR: The funny joke is, even if you ask more, it won't change
23:07 fdobridge_: <a​irlied> Yes they have no interest in giving redistribution rights for the fw they use, so even if you RE it all it's a pain in the ass
23:08 fdobridge_: <r​edsheep> Sorry to bring it up, I know it's a sore spot.
23:10 fdobridge_: <p​avlo_it_115> I remembered something (but vaguely). It seems some director of nvidia linux drivers left the company a long time ago and he was actively accepting applications from the nouveau community etc. Because of his resignation, now there is no "connection" with nvidia or what?
23:10 fdobridge_: <a​irlied> No we at Red Hat have regular meetings with them and engage with community concerns
23:11 fdobridge_: <g​fxstrand> Yeah, the problem isn't a lack of people asking. It's a lack of motivation on their part. They don't really have a reason to do the work required to do a competent firmware release.
23:12 fdobridge_: <p​avlo_it_115> As I understand it, he cannot influence this situation in any way?
23:13 fdobridge_: <a​irlied> They have no business case for the amount of engineering effort
23:15 fdobridge_: <a​irlied> Even getting the fw we have pre gsp was very painful and expensive engineering wise
23:16 fdobridge_: <a​irlied> Since it was bespoke nouveau fw and their engineering was always slow to react to new hw updates
23:16 fdobridge_: <g​fxstrand> And asking them to put in all that work for hardware that they've already sold and won't ever make another penny on... Yeah, that isn't going to happen. Especially with as close to EOL as it is. Their solution is "go buy a new GPU"
23:17 fdobridge_: <a​irlied> Or use the last binary driver forever
23:18 fdobridge_: <p​avlo_it_115> And what did they answer?)
23:18 fdobridge_: <g​fxstrand> If they'd answered "yes", would we be having this conversation right now?
23:18 fdobridge_: <p​avlo_it_115> -
23:19 fdobridge_: <a​irlied> There was a lot of looking into it, but nobody could work out a supportable strategy that wasn't let's get GSP done
23:20 fdobridge_: <r​edsheep> I remember some news in the last couple of years that somebody managed to get those cards to ignore the check for signed firmware, but I assume based on what you're saying that doesn't really provide any new hope
23:22 fdobridge_: <r​edsheep> I'm sure the RE work is immense any way you slice it
23:23 fdobridge_: <p​avlo_it_115> You intrigued me
23:24 fdobridge_: <a​irlied> There was some key leak at some point, but I forget the details
23:24 fdobridge_: <c​onan_kudo> Well at this point, most pre Turing users could upgrade affordably as long as it's not a laptop
23:24 fdobridge_: <c​onan_kudo> GTX 16 series cards are very cheap now
23:25 fdobridge_: <c​onan_kudo> And I'm saying that as a Maxwell owner
23:27 fdobridge_: <p​avlo_it_115> https://forums.developer.nvidia.com/t/can-nvidia-contribute-to-open-source-nouveau-driver-directly/270264
23:27 fdobridge_: <p​avlo_it_115> Silence)
23:27 fdobridge_: <a​irlied> https://gist.githubusercontent.com/plutooo/733318dbb57166d203c10d12f6c24e06/raw/15c5b2612ab62998243ce5e7877496466cabb77f/tsec.txt was some stuff but no idea if any of that is similar off the switch
23:29 fdobridge_: <a​irlied> Public forums aren't going to have any influence
23:35 fdobridge_: <p​avlo_it_115> 🤬
23:36 fdobridge_: <p​avlo_it_115> I have no words (edited)
23:37 fdobridge_: <p​avlo_it_115> Maybe try writing to them on github?
23:37 fdobridge_: <k​arolherbst🐧🦀> they already do, but uhh... not much 😄
23:38 fdobridge_: <g​fxstrand> It's not going to have any effect
23:38 fdobridge_: <k​arolherbst🐧🦀> the problem isn't engineers seeing that stuff
23:38 fdobridge_: <k​arolherbst🐧🦀> they know
23:38 fdobridge_: <k​arolherbst🐧🦀> it's purely dysfunctional on a management level
23:38 fdobridge_: <a​irlied> Yeah not sure why you think we don't talk to the people who make the decisions and have decided already
23:39 fdobridge_: <k​arolherbst🐧🦀> nvidia just doesn't want to
23:39 fdobridge_: <k​arolherbst🐧🦀> it's really as simple as that
23:41 fdobridge_: <k​arolherbst🐧🦀> there are of course exceptions.. but that's mostly engineers working around the situation
23:41 fdobridge_: <m​arysaka> Pretty sure that kind of exploit isn't going to work on non Tegra stuffs, but the keys are common between desktop GPU and Tegra
23:41 fdobridge_: <m​arysaka> I think the best approach would be to get HS execution on Maxwell/Pascal via the Falcon's bootrom, but that requires quite the work to get something working (ROP chain against execute only memory is fun...)
23:41 fdobridge_: <m​arysaka> But even if that happens, you would still need a way to research the actual blobs to know what to write (so back to square one, need the keys...)
23:44 fdobridge_: <p​avlo_it_115> Decided what? Is this private information? (edited)
23:45 fdobridge_: <r​edsheep> He's saying that management decided not to allocate the engineering effort. It's not lack of communication, and it's not the engineer's fault.
23:45 fdobridge_: <k​arolherbst🐧🦀> yes
23:46 fdobridge_: <g​fxstrand> That, too.
23:47 fdobridge_: <k​arolherbst🐧🦀> we certainly talk with nvidia about stuff, and e.g. the headers they publish is one result of those efforts
23:47 fdobridge_: <k​arolherbst🐧🦀> it's just...
23:47 fdobridge_: <k​arolherbst🐧🦀> nvidia wants to see $$$
23:47 fdobridge_: <k​arolherbst🐧🦀> for anything they do
23:47 fdobridge_: <k​arolherbst🐧🦀> so if contributing to nouveau doesn't bring them any money, they won't do it
23:47 fdobridge_: <r​edsheep> As somebody dealing with support for a software company, I can totally understand getting burned out by tons of people asking for something where the engineering effort won't ever make sense on an EOL product.
23:47 fdobridge_: <k​arolherbst🐧🦀> so if you want nvidia to contribute, make a billion dollar contract with them only using nouveau
23:47 fdobridge_: <k​arolherbst🐧🦀> that's how you get nvidia to contribute
23:49 fdobridge_: <k​arolherbst🐧🦀> and the number is probably in the right ballpark
23:50 fdobridge_: <k​arolherbst🐧🦀> and even then I have my doubts
23:51 fdobridge_: <k​arolherbst🐧🦀> google tried, they failed, sooo....
23:51 fdobridge_: <r​edsheep> They have an obligation to the shareholders to do things that make money, and not do things that won't, so it makes sense. Not saying I like it, but it's the reality we all live with.
23:52 fdobridge_: <k​arolherbst🐧🦀> I mean.. on top of that is that they just don't want to open up 😄
23:52 fdobridge_: <k​arolherbst🐧🦀> probably
23:55 fdobridge_: <p​avlo_it_115> And what about nouveau-fw?
23:56 fdobridge_: <p​avlo_it_115> If I'm not mistaken, the binaries were simply extracted from the nvidia installer
23:56 fdobridge_: <p​avlo_it_115> and likewise someone here extracted binaries for reclocking
23:57 fdobridge_: <r​edsheep> As airlied mentioned earlier the nouveau firmware is bespoke, meaning it required its own engineering effort to make it exist. It is special firmware.
23:59 fdobridge_: <k​arolherbst🐧🦀> only the GSP stuff is something nvidia published
23:59 fdobridge_: <k​arolherbst🐧🦀> and that's also something we've talked about with them for years