03:05gfxstrand[d]: Soooo many NAK bugs to fix...
03:05gfxstrand[d]: Hehe. That makes it sound like NAK is buggy. It's usually fine. We just had a handful of regressions sneak in lately and I fixed like 5 things today.
03:21gfxstrand[d]: And, apparently, fixing the same bug as mhenning[d] 🙂
03:21gfxstrand[d]: Though I swear I was looking at a dEQP test, not a game shader.
09:18karolherbst[d]: https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf there are some interesting blackwell instructions
09:19karolherbst[d]: `FFMA2` or `FMNMX3`
09:19karolherbst[d]: vectorized fp32 🙂
09:20karolherbst[d]: ehh wait
09:20karolherbst[d]: `FMNMX3` is something else
09:20karolherbst[d]: 3 inputs, not 2
13:37gfxstrand[d]: karolherbst[d]: Yeah, fmnmx3 could be nice in a few scenarios. Does it also support a "mix" mode like AMD where it gives you the middle one? Or is it just two fmnmx pasted together? I guess if we can get compute shaders working we can test it easily enough.
13:38gfxstrand[d]: ffma2 could be fun. It'll be annoying for RA but maybe actually not that bad. And if it effectively doubles ffma throughput...
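The PDF lists the FFMA2 mnemonic but not its semantics, so here is a minimal C sketch of what a packed dual-FMA would plausibly compute; whether any operand is shared between the lanes and whether sources/destinations must sit in aligned register pairs (the part that would make RA annoying) are assumptions, not documented facts:
```c
/* Hypothetical FFMA2 semantics: two independent fp32 FMAs per issue.
 * Operand sharing and register-pair alignment rules are unknown; an
 * aligned-pair requirement is what would complicate register allocation. */
static inline void ffma2(float d[2], const float a[2],
                         const float b[2], const float c[2])
{
    d[0] = a[0] * b[0] + c[0];
    d[1] = a[1] * b[1] + c[1];
}
```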
13:40gfxstrand[d]: gfxstrand[d]: Even if it's just that mnmx, that still gives us clamp in a single op.
14:07karolherbst[d]: how would that give you clamp?
14:11gfxstrand[d]: AMD's mix gives you the middle one no matter which one it is. If it's two mnmx pasted together, we can clamp with it. If it's a single 3-arg min or max, we can't. It all depends on how much we get to control.
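To make the distinction concrete, a hedged C sketch of the three candidate readings discussed above; none of these is documented for FMNMX3, they are just the possibilities being weighed:
```c
#include <math.h>

/* Reading 1: two FMNMX ops fused, the second consuming the first's
 * result. This is the reading that gives clamp in a single instruction. */
static inline float fmnmx3_fused(float x, float lo, float hi)
{
    return fminf(fmaxf(x, lo), hi);          /* clamp(x, lo, hi) */
}

/* Reading 2: a single 3-input min (or max). Useful, but it cannot
 * express clamp, which needs one min *and* one max. */
static inline float fmnmx3_min3(float a, float b, float c)
{
    return fminf(a, fminf(b, c));
}

/* Reading 3, the AMD-style "mix"/med3: the median of the three inputs.
 * med3(x, lo, hi) is also a one-op clamp whenever lo <= hi. */
static inline float fmed3(float a, float b, float c)
{
    return fmaxf(fminf(a, b), fminf(fmaxf(a, b), c));
}
```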
15:17gfxstrand[d]: snowycoder[d]: I got textures working on Kepler A last night. I think I might upstream some of that branch (and maybe a few bits off of your branch, too).
15:21gfxstrand[d]: https://www.khronos.org/conformance/adopters/conformant-products/vulkan#submission_907
15:22gfxstrand[d]: The others should post over the weekend. I'll update Mesa on Monday or Tuesday.
15:23marysaka[d]: Amazing 🙂
15:29sravn: gfxstrand[d]: sm20.rs (the sm20 branch) has a few "panic!("SM50 doesn't have CBuf textures");". Should this read SM20 and not SM50?
15:39gfxstrand[d]: Yes it should
15:54gfxstrand[d]: karolherbst[d]: Is it just me or is the sched stuff on Kepler A entirely optional? I added support for it but it doesn't seem to do anything.
15:55marysaka[d]: it's optional yeah
15:55marysaka[d]: not quite sure how you enable it tho
16:00karolherbst[d]: gfxstrand[d]: it's optional, but you pay a big perf penalty for getting it wrong
16:01karolherbst[d]: not sure if there is a switch somewhere for
16:01karolherbst[d]: disabling verification or so
16:02mhenning[d]: yeah, scheduling on kepler is a perf thing, not a correctness thing
16:13gfxstrand[d]: Noted
16:33snowycoder[d]: gfxstrand[d]: Thanks! I folded suclamp and sueau, and I'm working on subfm; I'm quite slow
16:35gfxstrand[d]: It's okay
18:15mhenning[d]: snowycoder[d]: I think you're making good progress!
18:38gfxstrand[d]: Yup
18:40gfxstrand[d]: I'm over here farting around with sm20 because I'm trying to help without stomping and conflicting all over your MR.
18:50snowycoder[d]: gfxstrand[d]: If you want to help on sm32, there's no problem; you have way more experience with how textures and surfaces work in general. I can pass along what I have and work on shared atomics.
18:51snowycoder[d]: Related: SuEau does some kind of bitwise OR on the offset, but why? It could be for min-LOD selection, but it seems strange to bitwise-OR it.
18:53gfxstrand[d]: Ugh...
18:53gfxstrand[d]: [ 825.038023] nouveau 0000:02:00.0: gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 000a [ILLEGAL_SPH_INSTR_COMBO]
18:54gfxstrand[d]: Something funky with OpIpa, I think.
18:57gfxstrand[d]: Or maybe not? Smashing everything to constant yields the same error
18:58karolherbst[d]: isn't that invalid dual issuing?
18:59karolherbst[d]: can only have 3 pairs of instructions with 0
18:59karolherbst[d]: or something like that
19:00gfxstrand[d]: maybe?
19:00karolherbst[d]: and then the units have to match
19:00karolherbst[d]: like you can dual issue 2 fmas
19:00karolherbst[d]: or uhm...
19:00karolherbst[d]: the other way around
19:00karolherbst[d]: two unrelated units...
19:00karolherbst[d]: I'd have to check
19:00karolherbst[d]: `TargetNVC0::canDualIssue` has a bit on that
19:01snowycoder[d]: if Kepler A is similar to Kepler B, you can dual-issue one instruction but not the next one; might that be the problem?
19:01karolherbst[d]: that as well
19:01karolherbst[d]: it's a giant pita
19:02karolherbst[d]: don't bother, always use at least 1 for now 🙂
19:02snowycoder[d]: I'm using 0x00 and it seems to work(?)
19:02snowycoder[d]: Various documentation cites 0 as the slowest fallback
19:03karolherbst[d]: mhh was dual issue a magic value?
19:03karolherbst[d]: mhh seems it was `0x04` whatever that means
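For reference, a minimal sketch of the conservative approach described above, assuming the control-word layout from public Kepler reverse-engineering work (one 64-bit control slot per group of seven instructions, one 8-bit field each; real control words also carry fixed opcode bits omitted here). The 0x00 "slowest fallback" and 0x04 dual-issue values are taken from this discussion, not from documentation:
```c
#include <stdint.h>

#define SCHED_SAFE 0x00  /* legal everywhere per the discussion; just slow */

/* Pack seven per-instruction sched fields into one control word. Emitting
 * SCHED_SAFE for every slot costs performance but cannot break correctness,
 * which matches "scheduling on kepler is a perf thing" above. */
static uint64_t sched_word_conservative(void)
{
    uint64_t word = 0;
    for (unsigned i = 0; i < 7; i++)
        word |= (uint64_t)SCHED_SAFE << (i * 8);
    return word;
}
```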
19:23gfxstrand[d]: I get the same error if I totally shut off the scheduler
19:24snowycoder[d]: Huh, that's strange.
19:24snowycoder[d]: nvdisasm allows the code?
19:26gfxstrand[d]: In any case, it says SPH so I'm skeptical that it has to do with the scheduler
19:42mohamexiety[d]: mohamexiety[d]: ugh why does shared memory have to be this convoluted. rewrote the test to have a constant local size. with uvec4s I am now getting:
19:42mohamexiety[d]: 64:
19:42mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:42mohamexiety[d]: .VALUE = 0x4b41808
19:42mohamexiety[d]: 128:
19:42mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:42mohamexiety[d]: .VALUE = 0x8b41810
19:42mohamexiety[d]: 512:
19:42mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:42mohamexiety[d]: .VALUE = 0xd341840
19:42mohamexiety[d]: 1024:
19:42mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:42mohamexiety[d]: .VALUE = 0xd342880
19:42mohamexiety[d]: the lower bytes are as expected but the upper bytes are weird
19:43mohamexiety[d]: let's see how arrays with constant local size fare
19:43gfxstrand[d]: Okay, it's ipa on gl_FragCoord.w that's causing it heartburn
19:50gfxstrand[d]: Okay, it just doesn't like `ipa.offset` at all
19:50mohamexiety[d]: array of uints, constant local size:
19:50mohamexiety[d]: 64:
19:50mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:50mohamexiety[d]: .VALUE = 0x1b41802
19:50mohamexiety[d]: 128:
19:50mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:50mohamexiety[d]: .VALUE = 0x2b41804
19:50mohamexiety[d]: 512:
19:50mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:50mohamexiety[d]: .VALUE = 0x8b41810
19:50mohamexiety[d]: 1024:
19:50mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
19:50mohamexiety[d]: .VALUE = 0xd341820
19:52mohamexiety[d]: something to note is that back when I tested local size only for a basic shader, it didn't change any other header. so this header seems to represent occupancy or some other property that's influenced as well :thonk:
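One way to chase this without guessing the field layout: XOR each dumped word against a baseline so only the bits that move with the shared-memory size light up. The values below are the NVC7C0_CALL_MME_DATA(120) words from the uvec4 runs above:
```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* shared bytes = N uvec4s * 16; values copied from the dump above */
    const struct { unsigned sh_bytes; uint32_t value; } runs[] = {
        {   64 * 16, 0x4b41808 },
        {  128 * 16, 0x8b41810 },
        {  512 * 16, 0xd341840 },
        { 1024 * 16, 0xd342880 },
    };
    for (unsigned i = 1; i < 4; i++)
        printf("%6u vs %5u bytes: changed bits 0x%08x\n",
               runs[i].sh_bytes, runs[0].sh_bytes,
               runs[i].value ^ runs[0].value);
    return 0;
}
```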
20:08gfxstrand[d]: Ugh... We're emitting the same ipas, we're setting the same header bits, and yet codegen is okay and NAK isn't
20:08karolherbst[d]: gfxstrand[d]: mhhh, right.. might be only legal with certain SPH configs...
20:09karolherbst[d]: gfxstrand[d]: you are aware that nvc0 does live patching of some of that, right?
20:09karolherbst[d]: there is the `nv50_ir_apply_fixups` stuff
20:09karolherbst[d]: not sure if you hit that
20:10karolherbst[d]: but depending on things (tm), the instructions and the header get adjusted
20:10mhenning[d]: oh yeah the fixups are terrifying
20:10karolherbst[d]: there was a time I understood that part
20:34gfxstrand[d]: karolherbst[d]: I am annoyingly aware
20:34gfxstrand[d]: But also I disabled the fixups and I'm still seeing everything match
20:35karolherbst[d]: mhhh
20:35karolherbst[d]: could also be something in the class
20:36gfxstrand[d]: class setup *should* be the same
20:36gfxstrand[d]: Nope. Only difference is register counts in the vertex shaders
20:37karolherbst[d]: mhhh
20:37gfxstrand[d]: There is one sph bit that's different but I don't think it matters. I should check it just in case, though.
20:38gfxstrand[d]: It's the DOES_FP64 bit. That shouldn't matter
20:40karolherbst[d]: as in "why does it matter" or as in "I'm sure it doesn't matter"
20:42gfxstrand[d]: What the hell?!? It is DOES_FP64
20:42karolherbst[d]: 🙂
20:42mohamexiety[d]: surprise FP64 in the wild!
20:42karolherbst[d]: is it an fp64 ipa?
20:43gfxstrand[d]: Is that a thing?
20:43mohamexiety[d]: mohamexiety[d]: hm, if `local_size_x` here is fixed to 64 while the loop count and shared array size change, what does that change (other than the number of threads going through the shader) compared to making them all the same value?
20:43mohamexiety[d]: ```glsl
20:43mohamexiety[d]: layout(local_size_x = 1024) in;
20:43mohamexiety[d]: layout(set = 0, binding = 0) buffer Storage {
20:43mohamexiety[d]: uint outb[];
20:43mohamexiety[d]: };
20:43mohamexiety[d]: shared uvec4 x[1024];
20:43mohamexiety[d]: void main() {
20:43mohamexiety[d]: for (uint i = 0; i < 1024; i++) {
20:43mohamexiety[d]: x[i].x = gl_LocalInvocationID.x;
20:43mohamexiety[d]: x[i].y = gl_LocalInvocationID.x * 2 + i;
20:43mohamexiety[d]: x[i].w = gl_LocalInvocationID.x * 3 + i;
20:43mohamexiety[d]: x[i].z = gl_LocalInvocationID.x * 5 + i;
20:43mohamexiety[d]: }
20:43mohamexiety[d]: barrier();
20:43mohamexiety[d]: for (uint i = 0; i < 1024; i++) {
20:43mohamexiety[d]: outb[i] = x[i].x +
20:43mohamexiety[d]: x[i].y +
20:43mohamexiety[d]: x[i].w +
20:43mohamexiety[d]: x[i].z;
20:43mohamexiety[d]: }
20:43mohamexiety[d]: }
20:43mohamexiety[d]: ```
20:43karolherbst[d]: gfxstrand[d]: I don't think so
20:44karolherbst[d]: sure it's the fp64 flag?
20:45karolherbst[d]: though
20:45karolherbst[d]: I'm kinda surprised.. there aren't many reasons codegen sets it to true
20:46karolherbst[d]: nvc0 does `prog->hdr[0] |= 1 << 27` on the fp64 flag
20:47karolherbst[d]: mohamexiety[d]: if local_size is 64, then you only have a gl_LocalInvocationID.x up to 63
20:48karolherbst[d]: it kinda feels like you do everything backwards there
20:49mohamexiety[d]: karolherbst[d]: yeah but the actual work being done is influenced by the loop counter and the size of the shared array of vectors. local size is only changing the number of threads that do this work. so what's up with the QMD packet changes :thonk:
20:49mohamexiety[d]: (and I am aware that realistically this shader isn't actually utilizing shared memory -- I made it this way to be able to toggle between shared usage and non-shared usage with everything else identical)
20:49karolherbst[d]: you have each thread writing to all 1024 elements of `x`
20:50karolherbst[d]: and the value stored is not defined, because which thread's value ends up in shared memory is undefined
20:50gfxstrand[d]: Found it!
20:50karolherbst[d]: mohamexiety[d]: sure it's not entirely random?
20:51gfxstrand[d]: Apparently we just never set that bit in NAK?!?
20:51karolherbst[d]: gfxstrand[d]: rough
20:51gfxstrand[d]: And I guess Maxwell+ is okay with that
20:52karolherbst[d]: mhhh
20:52karolherbst[d]: the flags meaning might have changed
20:52karolherbst[d]: but who knows.. maybe it really doesn't matter for maxwell
20:52karolherbst[d]: I don't know why it matters at all
20:52karolherbst[d]: tbh
20:52mohamexiety[d]: karolherbst[d]: it isn't, no. this: https://discord.com/channels/1033216351990456371/1034184951790305330/1362182730753769695 is how it looks when local size, loop count, and shared size are all the same. this: https://discord.com/channels/1033216351990456371/1034184951790305330/1362513473078034776 is how it looks when local size is set to 64 but shared size/loop count are changed
20:53karolherbst[d]: mohamexiety[d]: I mean.. the shader is writing random values
20:54mohamexiety[d]: that's fine though, I'd think. the whole purpose of the shader is to just provoke the driver into allocating a certain amount of shared memory
20:54karolherbst[d]: `x[0]` can be any value from 0 to local_size_x - 1
20:54karolherbst[d]: before the barrier
20:54karolherbst[d]: `x[0].x` I mean
20:54karolherbst[d]: mohamexiety[d]: ahh
20:55karolherbst[d]: well in that case might be good to have random values so the compiler can't optimize it
20:55karolherbst[d]: or something
20:55mohamexiety[d]: I don't really do anything with the data beyond mapping the output just to convince the driver that it's used for something
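A race-free variant of the probe, written as an OpenCL C kernel since the suggestion just below is to redo the experiment in CL: each work item touches only its own slot, so the pre-barrier contents are well defined while the local array still forces the same shared-memory allocation. The kernel name and layout are illustrative only:
```c
__kernel void shared_probe(__global uint *outb)
{
    __local uint4 x[1024];
    uint id = get_local_id(0);

    /* one writer per element: no data race before the barrier */
    x[id] = (uint4)(id, id * 2u, id * 5u, id * 3u);
    barrier(CLK_LOCAL_MEM_FENCE);

    outb[id] = x[id].x + x[id].y + x[id].z + x[id].w;
}
```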
20:56karolherbst[d]: you want to RE this with OpenCL tho
20:56karolherbst[d]: OpenCL allows you to set an arbitrary shared memory size
20:56karolherbst[d]: and the runtime doesn't really have a choice there
20:57mohamexiety[d]: well this is for nvk/nak to try and get compute running on blackwell
20:57karolherbst[d]: sure
20:57karolherbst[d]: but
20:57karolherbst[d]: you can set "give me 1000 bytes of shared memory" and see what the runtime is doing
20:57gfxstrand[d]: And of course `i2f.f32.s64` is an fp64 op
20:57mohamexiety[d]: hm
20:57karolherbst[d]: gfxstrand[d]: .....
20:57karolherbst[d]: sure?
20:58karolherbst[d]: ohhh
20:58karolherbst[d]: apparently codegen agrees
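For context, the surprise here is that a shader with no doubles in it at all can still require the fp64 bit: a 64-bit-integer-to-float conversion lowers to i2f.f32.s64, which the hardware apparently treats as an fp64-pipe op. A trivial C analogue of the offending pattern:
```c
#include <stdint.h>

/* No double anywhere, yet the shader-side equivalent of this conversion
 * emits I2F.F32.S64 and (per the discussion above) needs DOES_FP64 set. */
float int64_to_float(int64_t v)
{
    return (float)v;
}
```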
20:59karolherbst[d]: mohamexiety[d]: `kernel void k(local uchar *x) { ... }` and then you can set the first kernel parameter to the shared memory size you want
20:59karolherbst[d]: piglit should have some example for it...
21:00karolherbst[d]: maybe it doesn't 🙂
21:02karolherbst[d]: `./build/test_conformance/basic/test_basic local_arg_def` in the CL CTS
21:04karolherbst[d]: though that might be a bit much to learn at the same time.. mhh
21:11karolherbst[d]: mohamexiety[d]: ./bin/cl-program-tester ./tests/cl/program/execute/builtin/atomic/atomic_add-local.cl
21:11karolherbst[d]: in piglit
21:11karolherbst[d]: it's a bit... uhm.. tricky to understand
21:12karolherbst[d]: but `uint[1]` as an arg definition means it allocates `1 * sizeof(uint)` shared memory
21:12karolherbst[d]: and you can just bump up the size as you see fit
21:12karolherbst[d]: shouldn't matter for the test
21:12karolherbst[d]: not sure if the tracing works with nvidia's CL stack tho...
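Host-side sketch of the trick described above: for a kernel declared with a __local pointer parameter, clSetKernelArg takes the allocation size with a NULL value, so you can sweep arbitrary shared-memory sizes from the host. `kernel` is assumed to be a valid cl_kernel for such a probe kernel:
```c
#include <CL/cl.h>

/* For __local parameters, arg_value must be NULL and arg_size is the
 * number of bytes the runtime should allocate for that argument. */
static cl_int set_local_bytes(cl_kernel kernel, size_t nbytes)
{
    return clSetKernelArg(kernel, 0, nbytes, NULL);
}
```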
21:13mohamexiety[d]: hmm. I see, thanks! one problem with CL tho is I don't think we have a way to dump the driver's commands, right?
21:13karolherbst[d]: well..
21:13karolherbst[d]: it's the same
21:13karolherbst[d]: though the compute runtime might do things differently enough
21:13mohamexiety[d]: I actually thought about using CUDA for this at first but decided not to for this reason. The main reason I was considering CUDA was that for shared memory you can actually ask for a different split
21:13mohamexiety[d]: since with gfx stuff including Vulkan you only get 48KiB iirc
21:13gfxstrand[d]: Test run totals:
21:13gfxstrand[d]: Passed: 122/242 (50.4%)
21:13gfxstrand[d]: Failed: 0/242 (0.0%)
21:13gfxstrand[d]: Not supported: 120/242 (49.6%)
21:13gfxstrand[d]: Warnings: 0/242 (0.0%)
21:13gfxstrand[d]: Waived: 0/242 (0.0%)
21:14karolherbst[d]: right.. cuda is closer to the hw there
21:14gfxstrand[d]: That's `dEQP-VK.pipeline.fast_linked_library.multisample_interpolation.*`
21:14karolherbst[d]: anyway.. not a bad idea to make sure the tracer works also with CL and CUDA
21:19gfxstrand[d]: Okay, I've fixed a bunch of things. Time to kill my computer again by doing a CTS run. :frog_upside_down:
21:19gfxstrand[d]: snowycoder[d]: I'm rapidly getting to the point where a lack of images is going to be my limiting factor. 😬
21:20gfxstrand[d]: I guess I could wire up atomics. <a:shrug_anim:1096500513106841673>
21:20karolherbst[d]: I'm sure that's not any more horrible on kepler than on later ISAs
21:20gfxstrand[d]: But I fear Kepler A images might not be the same as Kepler B.
21:21gfxstrand[d]: Shared atomics are a little more horrible but global is fine.
21:21karolherbst[d]: shared atomics are fun everywhere
21:23karolherbst[d]: though the most fun I had with them was on AMD's wave"64", which isn't really 64 but more like 32x2; if the compiler messes up, you get weird inconsistencies between the two 32-wide halves 🙂
21:26mohamexiety[d]: showcasing the superiority of GCN
21:26gfxstrand[d]: gfxstrand[d]: The good news is that things are going surprisingly okay so far: `Pass: 5617, Fail: 810, Crash: 660, Skip: 10913, Duration: 7:08, Remaining: 18:02:21`
21:33mohamexiety[d]: mohamexiety[d]: gfxstrand[d] do you have any tips/advice? :thonk:
21:35snowycoder[d]: gfxstrand[d]: I think only keplerb images have weird instructions, but I never checked it much
21:35snowycoder[d]: I'm also getting closer to a folding pass for subfm; I'll push it when I get it right.
21:35snowycoder[d]: I still have no idea why those instructions do the things they do, but at least we might have a software description 🙂
23:46gfxstrand[d]: Wow. Took a nap and my computer is still alive
23:46gfxstrand[d]: `Pass: 119547, Fail: 16310, Crash: 14008, Skip: 234635, Duration: 2:27:13, Remaining: 15:03:26`
23:47gfxstrand[d]: All the crashes are things that need images
23:47gfxstrand[d]: snowycoder[d]: Kepler A needs *something*. I don't know what yet
23:48gfxstrand[d]: Kepler A doesn't appear to have `suld/sust`. It has `suldga` and `sustga`
23:49gfxstrand[d]: But also `suclamp`, `subfm`, and `sueau`
23:49gfxstrand[d]: So maybe it's roughly like Kepler B?