00:03gfxstrand[d]: I'm looking at the lowering and `SU_INFO` stuff right now.
00:03gfxstrand[d]: It makes some sense but ugh...
00:04gfxstrand[d]: Best comment ever:
00:04gfxstrand[d]: `/* NOTE: this is really important*/`
00:11freddycondios: So why I do not write papers, which i otherwise would have skill for in whatever research department? Why i do not respond to job offers (yeah most or fake, but i do not search for job) . Well i am a war criminal and will be a bigger war criminal when the war reaches to here, in a single week or so we search all of the tyran fuckers up with not only Russians, but with my fans all around
00:11freddycondios: the world, and we will slaughter them. Because i know i go for this stuff, i am not writing papers or working for anyone, i played table tennis for 2years as a broken man, and get little paid, but it is what it is!
00:13pavlo_kozlenko[d]: mby ban?
00:14HdkR: They are.
00:14gfxstrand[d]: Okay, most of this stuff looks fairly sensible if a bit opaque.
00:14orowith2os[d]: I think it's about time this guy sets up a blog, cause holy moly it's ramped up the last few days
00:14pavlo_kozlenko[d]: I will subscribe to it.
00:15snowycoder[d]: gfxstrand[d]: Also: the unknown fields correlate with the size of the format. Suclamp uses them to shift left or right
00:17pavlo_kozlenko[d]: snowycoder[d]: Good day! Please don't count on me for coding, I've been really busy lately at the main job, I just don't have enough energy, I work and sleep, work and sleep. Sorry
00:18pavlo_kozlenko[d]: gfxstrand[d]
00:37gfxstrand[d]: snowycoder[d]: Yeah
00:37gfxstrand[d]: It all looks pretty sane if you know NVIDIA image tiling.
00:37gfxstrand[d]: But also ugh...
00:38gfxstrand[d]: I'm also starting to wonder why they have su ops at all at this point if they're just doing image calculations in the shader. 🤷🏻♀️
01:07HdkR: "Seemed like a good idea at the time."
02:18frautonigarn: For changing the other two numbers to bigger of the searched (i mean 50+54+58), i was too lazy to demonstrate that, as i am in bed and sick, i work on laptop and touchpad , it's not very fast to copy things around, intel driver is so good that i am glad to say. I HACKED IT ALL! but this test you see works for the same, it's crucial to know that it understands how to queue jump and keep the
02:18frautonigarn: order, i stretched the members like indexes and bases intentionally large and duplicated for this to work. But ok then. so one more pasted. 50+44+54+328+328+328+328+328+50+36+328−512−144−144−72−1024−36−72=198 50+44+54+328+328+328+328+328+50+36+328−512−144−144−72−512−144−144−72−144=314 and then 198−120=78 314−78.00−108=128 again 50-128=-78+28=-50 , but in a spectrum of that type of
02:18frautonigarn: compression this is only pseudo format for demonstration for single index, there are cubious amount of indexes generated, and orowith2os[d] i hope you try to test all those lines in the last 4days the theory spans from many years go though, my hands heart for trying to help you, i need a rest now.
02:18orowith2os[d]: Oh, it's my turn now. Okay
02:19orowith2os[d]: That's cool
02:25HdkR: orowith2os[d]: A cookie for you.
04:17gfxstrand[d]: Running Kepler A again. I think at this point I'm mostly missing doubles, mem ops (atom, membar), and images.
04:18gfxstrand[d]: In other words, all the hard stuff. :bim_sweat:
04:18gfxstrand[d]: (Well, doubles aren't hard)
04:23gfxstrand[d]: If the CTS run that's going right now is even remotely decent, I'm probably going to land my branch and start a new one. There's a few pieces from snowycoder[d] in there which I've cleaned up and pulled into separate patches (with his authorship). There's also a few bits which will probably fix tests on his branch.
04:23gfxstrand[d]: I think we're getting to the point where trying to get bits of sm20 and sm32 upstream will make development of both easier.
04:30gfxstrand[d]: Also, once I do so I'm probably going to step back for a bit. I might help a bit with the surface nonsense but I'm done going to town on SM20 for a bit.
04:33gfxstrand[d]: SM30 vs. SM32 is weird. In some ways they're the same and in other ways they're totally different.
04:33gfxstrand[d]: SM30 really is Kepler but the encodings are all Fermi. But the underlying hardware is clearly Kepler.
04:40HdkR: I do like the phrase "doubles aren't hard", but go back to when GL 4.0 was first announced with doubles being mandated, Mesa not having softfloat at all, and everything on fire.
04:40HdkR: Good times, good times
05:08gfxstrand[d]: Doubles aren't hard anymore...
05:08gfxstrand[d]: I was there for those good times.
05:08gfxstrand[d]: Doubles on Haswell were NOT good times. 😂
05:09HdkR: :D
05:09gfxstrand[d]: Doubles on Nvidia aren't bad. And I've already got it working on two other backends so it really is just a matter of wiring up 5 opcodes.
05:10HdkR: Yea, NVIDIA is nice for keeping FP64 pipelines around, even if they're slow. Still faster than a softfloat.
05:10gfxstrand[d]: But yes, there was some real engineering work when I first got them working on Turing.
05:11gfxstrand[d]: Yeah, softfloat is just bad
05:11HdkR:laughs in f128 softfloat
05:11gfxstrand[d]: It'll pass conformance but you'd really better hope no one uses it.
05:12gfxstrand[d]: Eh, 128-bit soft float isn't that much worse. It just requires 128-bit integers.
05:12gfxstrand[d]: But also why would you do that to yourself?
05:13HdkR: Luckily GPUs don't need to worry about that. until we get VK_EXT_fp128
05:13HdkR: I just get to hate it in CPU land.
05:14gfxstrand[d]: Guess I know what extension I'm proposing next April. 😝
05:14HdkR: Suddenly POWER engineers are interested again.
05:15HdkR: Wacky that they natively support quadfloat
05:16HdkR: I'll cut FEX's fp128 usage down to fp80 at some point, once I work up the courage again.
05:40gfxstrand[d]: Oh fp80...
05:40gfxstrand[d]: And fp40
05:40gfxstrand[d]: Why, Intel, did you curse us so? 😫
05:45HdkR: The pain means we're still alive. :P
05:46HdkR: What fun would it be if ARM just gave me 128-bit float transcendental instructions and makes the pain go away?
09:41ermine1716[d]: Why are doubles this hard?
11:48mohamexiety[d]: HdkR: hold on though I thought fp80 was like, a mega terrible horrible idea? why would you do that?
11:48mohamexiety[d]: I don't know much here but I think for fp128 you mean what would be equivalent to mostly SSE stuff, right? or do you actually need to care about x87 stuff as well for FEX?
11:58MrMiniPixel: So my experience yes is indeed way above you on software and i had placed high demands on the thinking side, bust just like zmike[d] is referring i had grown out from this era cause your talks your emus your software is mildly said bullshit, just like our packages of television software platforms and everything i have to deal with, but in terms of code readability and genius behind
11:58MrMiniPixel: compilators and ides that are very heavy with the OS commodity systems probably have to be like this, but as said i know everything about everything what you talk about cause of my long experience, but i have no interest in that crap, as anyone has not stepped in in optimizing those systems, i band dealing with this shit altogether, i have already modern system frameworks , and they are
11:58MrMiniPixel: contunuously squuezing my nuts fictively like assaulting and trying to kill me, several chips they have planted for some reason for every 2years to get substances from me, surveillance me, to make illegal business on freaking gay nutbolts artificial whatever that is ovulation , nothing that i ever signed up for, and calling me an idiot on top by such leftovers on this planet who understand
11:58MrMiniPixel: absolute nothing about anything, definitely it's too much to avoid larger conflict between the lines. FEX type of projects i have delt with many of them , i used to dream about them knowing they are retarded, but i gotta be honest, especially after #llvm abuse done at my direction, i say very clearly that is not how software is supposed to work to get any performance out of the hw at all,
11:58MrMiniPixel: and fuck off , i am so much talented then your abortion leftover lgbt abusers. And i am not going to not show my importance in the world to you , when such trash wants to kill me, so i daily convince you how much better i am , and if you ain't gonna respect my privacy, in a massive conflict i punish you out of my closeness and perhaps even from the country or whole planet.
12:33MrMiniPixel: Also I admit i never know what deals you might have with Microsoft, cause DirectX is microsofts api, and Microsoft is no longer in the position of monopoly to have their api to be used for fair what use doctrine based, otherwise germans, french all have several tens of perfect kits for this, and theyd be all with minimal effort alway functional, sokol, forge , digilent all can be with very
12:33MrMiniPixel: few effort kicked to mainline, but i have no authority for this.
12:34MrMiniPixel: monopoly is now longe since on mobile device gaming sector.
12:34MrMiniPixel: where the apis are also never ideal like microsoft ones, but not worse either.
12:48MrMiniPixel: I see purest logics behind as to what i talk about, i do not care about the filthy courtesan bisexual fucker , or her mental illness or her crankgangster scrubs at all, i do not finance any of that, i care about my district which is a hotel in koh rong sanmloem , used to be another one in sihanoukville, tell me one compelling reason why i should not shoot the persons kicking be out from
12:48MrMiniPixel: there scrubs or tattood courtesans, that i am mentally ill, ok, i find another person who shoots instead of me then, and this is how it resolves if such shit harasses us more.
12:56zmike[d]: is there any way to fix the build to catch when rustc version changes and recompile subprojects?
14:12snowycoder[d]: suclamp, sueau and subfm folded!
14:13x512[m]:uploaded an image: (316KiB) < https://matrix.org/oftc/media/v1/media/download/ASo8onnYcjuhH0Hkfyzf62LKiWnIAkjw9EFcac3pU0Azk3uIq2r3i2uyzcpvHDP55huIMVltnLqi6bxuvlGh8YlCeWkVfwewAGhhaWt1LW1hdHJpeC5jb20vVnpUQ3FyU2J4bm9PSmFvUEpUT3d6anNW >
14:13x512[m]: Can this be NVK bug? Testing various games on Haiku with NVK+Zink.
14:13x512[m]:uploaded an image: (260KiB) < https://matrix.org/oftc/media/v1/media/download/AZotzGomuiSMOADinPo4Xht-1Uj9dc8ifGcaa00I4mB0GrqNLz_Py2KkV7x8wdQYSQzD7UxObYNWD5NDOThfRSRCeWkVid3gAGhhaWt1LW1hdHJpeC5jb20va1dSaE5NSGtJTXVIV015empKeWNoVGZ0 >
14:13x512[m]: Should be this.
14:16x512[m]: OpenMW
15:13gfxstrand[d]: x512[m]: Quite possibly. Hard to know without debugging it.
15:14gfxstrand[d]: snowycoder[d]: Sweet! I'm gonna type up a framework today for you to fill out the details. This one's gonna require all three pieces working together and it's easier for me to just type 20% of it to show the right shape than to try and explain it in English.
15:26gfxstrand[d]: ermine1716[d]: Doubles are hard for a couple of reasons. First is that they take up 64 bits. GPUs are typically designed around 32-bit things, not 64. While 64-bit integers can often be implemented by just manipulating the two halves and having some clever cross-over like carry registers, doubles really have to be treated as a single 64-bit thing. This causes pain both in the compiler and in the ISA
15:26gfxstrand[d]: itself.
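The "two halves plus carry" approach to 64-bit integers mentioned above can be sketched roughly as follows. This is a minimal Python model of the idea, not NAK code; the helper names (`split64`, `join64`, `add64_via_halves`) are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def split64(x):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return x & MASK32, (x >> 32) & MASK32

def join64(lo, hi):
    """Recombine two 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo

def add64_via_halves(a_lo, a_hi, b_lo, b_hi):
    """64-bit add using only 32-bit adds and a carry bit."""
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32                    # 1 if the low add overflowed
    lo = lo_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32     # wraps like a 64-bit register
    return lo, hi

a, b = 0x00000001FFFFFFFF, 0x0000000000000001
result = join64(*add64_via_halves(*split64(a), *split64(b)))
```

Nothing comparable works for fp64, because the exponent and mantissa straddle both halves and cannot be processed independently.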
15:26gfxstrand[d]: Second is that doubles are expensive. Floats aren't cheap in hardware but we've given up and accepted that. Doubles are way worse and less useful (in graphics). This means HW designers are often trying to avoid adding actual FP64 hardware whenever possible. On some GPUs, this means you end up with an entirely software implementation. On others it means you have a few double ops and you have to
15:26gfxstrand[d]: implement the rest in terms of those.
15:26gfxstrand[d]: One example of this phenomenon is RCP (1/x) on Nvidia. There is an fp32 RCP instruction that does everything you want it to and at full precision. But for doubles, there is an opcode which does an RCP on the top 32 bits of an fp64 value. This gets you most of an RCP but there's a bit of work to do before you call it to make sure inf and nan propagate and then the result is nowhere near full
15:26gfxstrand[d]: precision so you have to treat it as an initial approximation and do a couple Newton steps to get the rest of the precision. The result is about 8 instructions or so, which is far better than if you implemented it with integers but still a lot more work than if we had a 64-bit RCP.
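The "approximate from the top 32 bits, then refine with Newton steps" scheme described above can be modelled as below. This is a hedged sketch in Python, not the actual NAK lowering: `approx_rcp_hi` is a hypothetical stand-in that only imitates a low-precision hardware reciprocal by discarding the low 32 bits, and the special-case handling is one plausible way to make inf and NaN propagate. Denormal inputs are deliberately rejected rather than handled.

```python
import math
import struct
import sys

def keep_high32(x):
    """Zero the low 32 bits of an fp64 value (keeps sign, exponent, 20 mantissa bits)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits & 0xFFFFFFFF00000000))[0]

def approx_rcp_hi(x):
    """Hypothetical model of a reciprocal that only sees and produces the top 32 bits."""
    return keep_high32(1.0 / keep_high32(x))

def rcp64(x):
    # Special cases are resolved before the approximation so they propagate correctly.
    if math.isnan(x):
        return x
    if math.isinf(x):
        return math.copysign(0.0, x)
    if x == 0.0:
        return math.copysign(math.inf, x)
    if abs(x) < sys.float_info.min:
        raise ValueError("denormal inputs are not handled in this sketch")
    y = approx_rcp_hi(x)            # roughly 20 good bits
    for _ in range(2):              # each Newton step roughly doubles the good bits
        y = y * (2.0 - x * y)
    return y
```

Two steps take the estimate from about 20 bits to beyond the 53 bits an fp64 mantissa holds, which matches the "couple Newton steps" in the description. A real implementation would likely use fused multiply-adds for the refinement.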
15:31gfxstrand[d]: karolherbst[d]: what part is nv4e?
15:31karolherbst[d]: gk104
15:31karolherbst[d]: ehh
15:31karolherbst[d]: nv4e or nve4?
15:31karolherbst[d]: I assume you meant nve4, because I doubt you care about pre tesla GPUs 😄
15:32gfxstrand[d]: Yes, that
15:32gfxstrand[d]: Okay, I'm just gonna call it Kepler then
15:36mhenning[d]: zmike[d]: That's a known meson bug. https://github.com/mesonbuild/meson/issues/10706
15:36karolherbst[d]: gfxstrand[d]: yeah.. nve4 is special, because there never was a nve0, so it just starts at nve4
15:49snowycoder[d]: gfxstrand[d]: Yep, I've pushed the foldings in my MR (although it's quite outdated and is missing a lot of your patches).
15:56gfxstrand[d]: My kernel locked up over night but hopefully I have lots of good fails to look into
16:29HdkR: mohamexiety[d]: Yes, I need to care about all instructions for FEX. So we need to emulate x87 which is still used for x86-64 applications compiled today. f80 softfloat would recover a small amount of performance, since f128 softfloat provides more precision than necessary.
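For readers unfamiliar with the x87 format being discussed: an fp80 value has 1 sign bit, a 15-bit exponent (bias 16383), and a 64-bit significand with an explicit integer bit. The sketch below decodes one into a Python float. It is only an illustration of the format and is not taken from FEX; the function name is made up, and the conversion rounds the 64-bit significand down to fp64 precision.

```python
import math

def fp80_to_float(raw):
    """Decode a 10-byte little-endian x87 extended-precision value to a Python float."""
    if len(raw) != 10:
        raise ValueError("fp80 values are exactly 10 bytes")
    significand = int.from_bytes(raw[:8], "little")
    sign_exp = int.from_bytes(raw[8:], "little")
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exp = sign_exp & 0x7FFF
    if exp == 0x7FFF:
        # All-ones exponent: infinity if the fraction bits are zero, else NaN.
        frac = significand & 0x7FFFFFFFFFFFFFFF
        return sign * math.inf if frac == 0 else math.nan
    # Exponent field 0 (zero/denormal) uses the same scale as exponent field 1.
    unbiased = (exp if exp != 0 else 1) - 16383
    try:
        # The integer bit is explicit, so the significand is scaled by 2**-63.
        return sign * math.ldexp(significand, unbiased - 63)
    except OverflowError:
        return sign * math.inf  # magnitude does not fit in fp64

one = (0x8000000000000000).to_bytes(8, "little") + (0x3FFF).to_bytes(2, "little")
minus_two = (0x8000000000000000).to_bytes(8, "little") + (0xC000).to_bytes(2, "little")
```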
16:34mohamexiety[d]: HdkR: ah I see. fair enough then, thanks. there's an argument to be made that you're opting into low perf if you go with x87 but eh 😝
16:35HdkR: mohamexiety[d]: If your application is using transcendentals then it's still faster to use x87 on x86-64, since SSE doesn't provide any way to accelerate those.
16:36mohamexiety[d]: HdkR: oh.. TIL. I actually thought x87 was fully deprecated in a sense. afaik on windows for example it actually doesn't get emitted at all
16:38x512[m]: How does Linux usually import a buffer from GPU 1 into GPU 2? What does it do with the tiling format?
16:42mohamexiety[d]: HdkR: one q though, sorry: does this still apply even though x87 transcendentals are emulated in microcode in all modern CPUs? (so the transcendental instructions should be software as well, similar to if you used SSE/AVX only)
16:44mohamexiety[d]: (looking here: https://web.archive.org/web/20250117184328/https://forum.nasm.us/index.php?topic=3904.0)
16:47HdkR: mohamexiety[d]: Yea, a pure scalar SSE version of the transcendentals (atan2, cos, exp2, log2, sin, tan) is still slower because the hardware has the pipelines to accelerate them for fp80, but not fp64. I think Microsoft's math library just ate the cost, maybe with the naive idea that Intel/AMD will solve it for fp64 in the future :D
16:47mohamexiety[d]: hmm I see. interesting. thanks!
17:25gfxstrand[d]: snowycoder[d]: There's a lot of moving parts but I think most of them are roughly the right shape in the nak/kepler-images branch I just pushed
17:26gfxstrand[d]: Most of the actual code is missing, though.
17:26snowycoder[d]: gfxstrand[d]: Can't work on it today, but I'll surely check it tomorrow, thanks!
17:26gfxstrand[d]: It's mostly a "here's how we should plumb this through descriptors"
17:26gfxstrand[d]: Also, I'm not sure how much I like the ordering/arrangement of su_info fields. We may want to tweak that some to try and shrink the size of the struct if we can.
17:26gfxstrand[d]: But now we have a skeleton
17:28ermine1716[d]: gfxstrand[d]: That's quite detailed, thank you!
17:28gfxstrand[d]: The core idea is that NIL fills out the `u32` info bits that `SUCLAMP` and friends consume. NAK has the actual lowering from generic intrinsics to NV ones, and NVK just passes the data through.
17:30gfxstrand[d]: So the interface NIL is programmed to is `SUCLAMP`, not some SW-defined thing.
17:30gfxstrand[d]: NAK depends on NIL but only for the ordering/naming of the fields in `nil_su_info`. Otherwise, it assumes the `SUCLAMP` HW interface as well.
17:30gfxstrand[d]: NVK just cares that the two can work together.
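The division of labour described above (NIL fills the info words, NVK passes them through, NAK only relies on field order and naming) could be pictured like this. Everything in the sketch is hypothetical: the field names, the values computed for them, and the function names are placeholders chosen for illustration and do not reflect the real `nil_su_info` layout or the `SUCLAMP` encoding.

```python
import struct
from dataclasses import dataclass, astuple, fields

@dataclass(frozen=True)
class SuInfo:
    """Stand-in for nil_su_info. Field names and meanings are hypothetical."""
    clamp_x: int
    clamp_y: int
    pitch: int

def nil_fill_su_info(width, height, bytes_per_pixel):
    """NIL's role: compute the u32 words the hardware ops consume (values made up here)."""
    return SuInfo(clamp_x=width - 1, clamp_y=height - 1, pitch=width * bytes_per_pixel)

def nvk_write_descriptor(info):
    """NVK's role: copy the words into descriptor memory without interpreting them."""
    return struct.pack("<%dI" % len(fields(info)), *astuple(info))

def nak_field_offset(name):
    """NAK's role: it only needs the field order to know which word to load."""
    names = [f.name for f in fields(SuInfo)]
    return 4 * names.index(name)

desc = nvk_write_descriptor(nil_fill_su_info(64, 32, 4))
pitch = struct.unpack_from("<I", desc, nak_field_offset("pitch"))[0]
```

The point of the shape is that reordering fields to shrink the struct only touches the struct definition; the pass-through layer never changes.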
17:53mhenning[d]: gfxstrand[d]: Want to take a look at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34334 ? I'm happy with the general approach, although it could obviously have wider test coverage than it does
19:00orowith2os[d]: x512: dmabufs are your answer
19:00orowith2os[d]: You'll attach modifiers to tell the driver which format it uses, and it'll convert it to and from the formats it needs; without modifiers, I think it gets converted to linear in between users and back to the desired format?
19:02orowith2os[d]: So without modifiers, I'm pretty sure, if you went from Nvidia GPU 1 to Nvidia GPU 2, you'd go from a tiled buffer on GPU 1 -> linear buffer + dmabuf -> convert back to a tiled buffer on GPU 2
19:02orowith2os[d]: Correct me if I'm wrong
19:02orowith2os[d]: With modifiers, both ends say what they support, and pick one to send over, no conversion necessary
19:03orowith2os[d]: And it'll only copy it when necessary
19:03orowith2os[d]: So if I have a dmabuf from a program running on GPU 1 and my compositor on GPU 2, going to output on an attached display, it would make one copy: to composite it, or if using direct scanout, just pipe it straight to the display
19:04orowith2os[d]: Or if I have a dmabuf for it, and it's going to another program on the same GPU, no copy necessary, it accesses it directly
19:05orowith2os[d]: So if I'm capturing a game on GPU 2, with OBS on the same GPU, if I'm using pipewire to capture it, it won't have to do anything extra
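The "both ends say what they support, and pick one" step can be sketched as a simple intersection. This is a simplified model of the negotiation, not the real DRM, EGL, or Vulkan API: `pick_modifier` and the `TILED_*` values are hypothetical, and only `DRM_FORMAT_MOD_LINEAR` (which is 0 in `drm_fourcc.h`) is a real constant.

```python
DRM_FORMAT_MOD_LINEAR = 0  # real value from drm_fourcc.h

# Hypothetical placeholder modifiers standing in for vendor tiling layouts.
TILED_A = 0x1001
TILED_B = 0x1002

def pick_modifier(producer_mods, consumer_mods):
    """Return the first modifier (in producer preference order) both sides support.

    Returns None when there is no common layout, in which case the buffer
    cannot be shared directly and some conversion or copy would be needed.
    """
    consumer = set(consumer_mods)
    for mod in producer_mods:
        if mod in consumer:
            return mod
    return None

# GPU 1 prefers its tiled layouts; GPU 2 understands only one of them plus linear.
chosen = pick_modifier([TILED_A, TILED_B, DRM_FORMAT_MOD_LINEAR],
                       [TILED_B, DRM_FORMAT_MOD_LINEAR])
```

Listing linear last on the producer side reflects the usual preference for a tiled layout when both ends can handle it.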
19:59mohamexiety[d]: hm anyone tried nvk gaming on stable mesa recently? someone I know from another server was trying things out and noticed that there are insane system-wide stutters in both DX11 and DX12 games. interestingly the frametimes become super smooth when tabbing out (unfocusing the game) so could also be a DE issue but they're on KDE which is a common config :thonk:
19:59mohamexiety[d]: https://cdn.discordapp.com/attachments/807434218903699478/1362877028478488746/image.png?ex=6803fda4&is=6802ac24&hm=f6be3bdba090c17568d840d9033e5dcff6bdd98ccd20329667a0710ff374fcd6&
19:59mohamexiety[d]: https://cdn.discordapp.com/attachments/807434218903699478/1362879151798747378/image.png?ex=6803ff9e&is=6802ae1e&hm=841f3d211f1e989e4f064620fd0e3565a207609c8e49ca75c2f58dd064cde0f7& pics. first is a dx11 game, second is a dx12 one
19:59mohamexiety[d]: I told them to open an issue but they're still investigating
20:35mohamexiety[d]: yeah nevermind, not an nvk/nouveau issue.
21:14mangodev[d]: mohamexiety[d]: i *did* notice something funky that nvk and kde do, though
21:14mangodev[d]: try dragging vkcube between two monitors so that it's running uncapped
21:14mangodev[d]: once that is done, just simply move your mouse cursor between monitors, and every time the mouse moves between monitors, you should be able to measure a big stutter
21:14mangodev[d]: i think something funky may be going on with the hardware cursor plane
21:15mangodev[d]: maybe it's the presence of two hardware cursor sprites?
21:23gfxstrand[d]: Ugh... Why does `atom.e.cas` not work?!?
21:46gfxstrand[d]: I wonder if there's some secret HW restriction that the data and compare have to be in subsequent registers
21:46gfxstrand[d]: That feels weird, though
22:08karolherbst[d]: I'm sure there have been restrictions like that..
22:08karolherbst[d]: but yeah..
22:09karolherbst[d]: if codegen doesn't do weird things there, then it's probably fine
22:12karolherbst[d]: gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_lowering_nvc0.cpp?ref_type=heads#L1763
22:13karolherbst[d]: have fun 😛
22:14snowycoder[d]: "Also, it sometimes returns the new value instead of the old one under mysterious circumstances. "
22:14snowycoder[d]: Oh wow, just... how?
22:16mohamexiety[d]: that was the red line for you but you let this pass?
22:16mohamexiety[d]: "and the 3rd source
22:16mohamexiety[d]: // should be set to the high part of the double reg or bad things will
22:16mohamexiety[d]: // happen elsewhere in the universe." :KEKW:
22:16karolherbst[d]: of course the emitter fixes the 3rd source reg id 🙃
22:16karolherbst[d]: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_emit_nvc0.cpp?ref_type=heads#L2190
22:17karolherbst[d]: uhm 2nd
22:18karolherbst[d]: snowycoder[d]: not sure that actually happens...
22:20karolherbst[d]: fuck up barrier + cache flushing and weird things happen
22:20mhenning[d]: I think codegen sets some things to be cached where it's not legal to cache them on kepler, which could explain some of the weirdness
22:20karolherbst[d]: yeah
22:21karolherbst[d]: I'm sure that's just fallout from not reverse engineering all the way to the end.
22:48gfxstrand[d]: karolherbst[d]: Oh, well that's just awesome. 🤦🏻♀️
22:49gfxstrand[d]: karolherbst[d]: At least I'm not going crazy? <a:shrug_anim:1096500513106841673>
22:49karolherbst[d]: hooray
22:50karolherbst[d]: wait.. huh
22:50gfxstrand[d]: karolherbst[d]: I wonder why this isn't blowing up on Maxwell
22:51karolherbst[d]: it's kinda weird.. codegen only emits 2 sources
22:52karolherbst[d]: ohh nvm..
22:52karolherbst[d]: the address is special
22:53mohamexiety[d]: gfxstrand[d]: ~~might not stay like that for long given how kepler HW seems to be~~ 😝
22:54karolherbst[d]: mhhh
22:55karolherbst[d]: mhhh
22:55karolherbst[d]: actually..... I don't see the maxwell emitter emitting 2 sources either.. only address + input
22:56karolherbst[d]: gfxstrand[d]: sure you aren't doing something similar for maxwell?
22:59gfxstrand[d]: Heh. Found where I fixed this on SM50
22:59karolherbst[d]: yeah..
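The restriction discussed in this thread, that the compare and data operands of an atomic CAS must sit in consecutive registers, can be illustrated with a small model. This is a sketch under assumptions: the operand order (compare first, data second) and the helper names are hypothetical, and `atomic_cas` is just the textbook semantics, which return the old value.

```python
def atomic_cas(mem, addr, compare, data):
    """Reference semantics: store data only if mem[addr] == compare; return the OLD value."""
    old = mem[addr]
    if old == compare:
        mem[addr] = data
    return old

def pack_cas_operands(cmp_reg, data_reg, free_pair_base):
    """Return (moves, base) so that compare is in r[base] and data is in r[base + 1].

    moves is a list of (dst, src) register copies to emit before the atomic.
    free_pair_base names two unused consecutive registers (assumed to exist).
    """
    if data_reg == cmp_reg + 1:
        return [], cmp_reg          # already consecutive, nothing to do
    moves = [(free_pair_base, cmp_reg), (free_pair_base + 1, data_reg)]
    return moves, free_pair_base

mem = {0x10: 5}
old = atomic_cas(mem, 0x10, 5, 9)
moves, base = pack_cas_operands(cmp_reg=3, data_reg=7, free_pair_base=12)
```

In a real compiler the register allocator would satisfy this by treating the two operands as one vector, rather than by copying after the fact.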
23:06mangodev[d]: i can't wait for nvk to become more stable-ish on git so i can rebuild my drivers and see the latest improvements from the instruction timing stuff
23:15mhenning[d]: mangodev[d]: Do you see issues on current main? A few fixes landed earlier this week.
23:31mangodev[d]: mhenning[d]: that's good to hear, i've been careful about upgrading because of my experience with the latest version a couple weeks ago, as well as some of the reports in this channel about major regressions
23:32mangodev[d]: although i'm not on kepler nor blackwell, so i don't know if many of the recent changes would affect me (aside from a few bugfixes here and there)
23:48gfxstrand[d]: The sched stuff might
23:51gfxstrand[d]: `Pass: 3378, Fail: 14, Crash: 358, Skip: 5750, Duration: 4:00, Remaining: 19:15:54`
23:51gfxstrand[d]: The crashes are all images being missing
23:52HdkR: \o/
23:57asdqueerfromeu[d]: gfxstrand[d]: Do you mean uncompressed ones (or something like S3TC)? 🖼️
23:59mhenning[d]: asdqueerfromeu[d]: uncompressed images on nvk+kepler