01:02gfxstrand[d]: *grumble grumble* No SHF pre-Maxwell...
01:02HdkR: SHF is great. Long live SHF.
01:04gfxstrand[d]: Wait, Kepler B has Shf?
01:04gfxstrand[d]: No fair!
01:07redsheep[d]: I'm kind of surprised kepler b has that many differences
01:09gfxstrand[d]: Kepler B is Fermi
01:10gfxstrand[d]: It's Kepler's 3D hardware and Fermi's ISA
01:15redsheep[d]: Oh, yeah I was going to say they only released 14 months apart. Even knowing it shares ISA with fermi I kind of feel like ISA used to move faster
01:33TimurTabi: karolherbst: let me post a new version tomorrow with better text.
01:36gfxstrand[d]: Do I not have psetp
01:37gfxstrand[d]: Looks like we don't
01:37gfxstrand[d]: I guess we get to do `isetp.set_op rz rz rz`
01:37gfxstrand[d]: Pain
01:44gfxstrand[d]: Current NAK unit test failures on GK106:
01:44gfxstrand[d]: hw_tests::test_op_flo
01:44gfxstrand[d]: hw_tests::test_op_popc
01:44gfxstrand[d]: hw_tests::test_op_psetp
01:44gfxstrand[d]: hw_tests::test_op_shf
01:44gfxstrand[d]: hw_tests::test_shl64
01:44gfxstrand[d]: hw_tests::test_shr64
01:44gfxstrand[d]: I think FLO and POPC exist. The others don't and we might as well just ask NIR to lower for us.
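For readers unfamiliar with what "ask NIR to lower" means for the 64-bit shift tests: without a funnel shift like SHF, a 64-bit shift can be expressed with 32-bit operations on the two halves. The following is only a minimal sketch of that idea in Python, not NIR's actual lowering code; the function name and the choice to wrap the shift count at 64 are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF

def shl64_via_32(lo, hi, shift):
    """Left-shift a 64-bit value held as two 32-bit halves (hypothetical helper).

    Returns the new (lo, hi) pair using only 32-bit-wide operations.
    """
    shift &= 63  # assumption for this sketch: the shift count wraps at 64
    if shift == 0:
        return lo, hi
    if shift >= 32:
        # Everything in the low half moves into the high half.
        return 0, (lo << (shift - 32)) & MASK32
    new_lo = (lo << shift) & MASK32
    # Bits shifted out of the low half are carried into the high half.
    new_hi = ((hi << shift) | (lo >> (32 - shift))) & MASK32
    return new_lo, new_hi
```

The `shift >= 32` and `shift == 0` cases are split out because `lo >> (32 - shift)` is only meaningful for shift counts strictly between 0 and 32.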
02:02gfxstrand[d]: The "nice" thing about Fermi is that it's a lot simpler ISA.
02:15mhenning[d]: on the bright side, we don't need to schedule for correctness before maxwell
02:23pavlo_kozlenko[d]: mhenning[d]: Can you clarify "schedule" in detail? Please.
02:37airlied[d]: mhenning[d]: any ideas on vimnmx?
02:40airlied[d]: uvimnmx also exists
02:47airlied[d]: ah SIMD min/max and uniform SIMD min/max
02:52mhenning[d]: Yeah, you can grep for instruction names here: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html
02:52mhenning[d]: stuff like VIMNMX is probably somewhat niche
02:54airlied[d]: just found nvidia using it for a test we are failing, so made me look
02:58mhenning[d]: pavlo_kozlenko[d]: Starting with maxwell, nvidia gpus have an "exposed pipeline" which means that it's the compiler's job to avoid pipeline hazards instead of the hardware's job. This makes sense, in that it simplifies the hardware, but it's pretty annoying if you're developing a driver using reverse engineering
02:59mhenning[d]: Random link (from 5 sec of googling) that talks a bit about exposed pipelines: https://retrocomputing.stackexchange.com/questions/2616/what-was-the-first-cpu-with-exposed-pipeline
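To make "scheduling for correctness" a bit more concrete, here is a toy Python model of a compiler deciding how many cycles to wait before each instruction so that its sources are ready. It is a sketch under strong assumptions: the single fixed `latency`, the one-instruction-per-cycle issue model, and the function name are all hypothetical, and real hardware has per-instruction latencies and other hazards that this ignores.

```python
def insert_stalls(instrs, latency):
    """Compute stall cycles needed before each instruction (toy model).

    instrs:  list of (dst, srcs) tuples, where srcs is a list of register names.
    latency: assumed number of cycles before a written register can be read.
    """
    ready = {}   # register name -> cycle at which its value becomes readable
    cycle = 0    # cycle at which the next instruction would issue
    stalls = []
    for dst, srcs in instrs:
        # The instruction may not issue before all of its sources are ready.
        earliest = max((ready.get(s, 0) for s in srcs), default=0)
        stall = max(0, earliest - cycle)
        stalls.append(stall)
        cycle += stall
        ready[dst] = cycle + latency
        cycle += 1  # assume one instruction issues per cycle
    return stalls
```

On hardware with interlocks, the stall would happen automatically; with an exposed pipeline, getting these numbers wrong produces incorrect results rather than just a slower shader.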
04:50airlied[d]: mhenning[d]: also I think lea might have had some changes
04:52airlied[d]: turning off the lea opts fixed a bunch of encoding errors
06:08gfxstrand[d]: mhenning[d]: Yeah. We still want best guesses to feed the scheduler but at least we don't have to fight delay bugs.
08:04snowycoder[d]: gfxstrand[d]: And no suclamp/subfm/sueau weirdness (right?)
10:13snowycoder[d]: subfm is a mess to fold, I feel like I'm probing hardware stuff.
10:13snowycoder[d]: How about we use it without knowing internals as we did with codegen?
14:59gfxstrand[d]: snowycoder[d]: I'm nowhere near looking at images yet
16:08mohamexiety[d]: gfxstrand[d]: doing this, we will get _at least_ x amount of registers, but not exactly x, right? :thonk: I did it and I am seeing two entries changing, though the changes are still not entirely clear so I'll try running with more numbers.
16:08mohamexiety[d]: first entry, with x[8]:
16:08mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
16:08mohamexiety[d]: .VALUE = 0x1c01
16:08mohamexiety[d]: first entry, with x[16]:
16:08mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
16:08mohamexiety[d]: .VALUE = 0x1d01
16:08mohamexiety[d]: first entry, with x[32]:
16:08mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
16:08mohamexiety[d]: .VALUE = 0x2b01
16:08mohamexiety[d]: as we increase the number of registers, this entry increases
16:09gfxstrand[d]: Yeah, it's hard to control exact x
16:09mohamexiety[d]: the second entry is weirder:
16:09mohamexiety[d]: x[8]
16:09mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
16:09mohamexiety[d]: .VALUE = 0x7ceaca2c
16:09mohamexiety[d]: x[16]:
16:09mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
16:09mohamexiety[d]: .VALUE = 0x7ceaca18
16:09mohamexiety[d]: x[32]:
16:09mohamexiety[d]: mthd 3bc4 NVC7C0_CALL_MME_DATA(120)
16:09mohamexiety[d]: .VALUE = 0x7ceaca20
16:09mohamexiety[d]: it decreases from 8 to 16, but then increases (but still less than x[8])
16:11mohamexiety[d]: I could see the first entry being size and the second entry being occupancy but..
16:13mohamexiety[d]: and even the first entry being size isn't straightforward because if we take the upper byte, we get:
16:13mohamexiety[d]: x[8] => 0x1c => 28
16:13mohamexiety[d]: x[16] => 0x1d => 29?
16:13mohamexiety[d]: x[32] => 0x2b => 43
16:14mohamexiety[d]: gfxstrand[d]: also what happens if I go for a large x like >256? (since iiuc we only have up to 256 regs, right?)
16:15gfxstrand[d]: Then it'll start spilling
16:15gfxstrand[d]: And we won't get the register count anymore
16:15mohamexiety[d]: I see, that makes sense but was curious if maybe it'd do something else
16:23gfxstrand[d]: You can probably get the exact register count by running the same shader through nvdump and looking at register numbers
16:25mohamexiety[d]: oh didn't know about nvdump. hmm
16:28mohamexiety[d]: does it work on spirv files?
16:49mhenning[d]: yeah, nvdump goes from spirv to assembly
16:50mhenning[d]: depending on how your shader's structured, it's possible that it's putting the array in local mem rather than registers, which could explain why you're seeing two different values change
16:52mhenning[d]: or even it might switch between regs and local mem depending on how big the array is
16:52mhenning[d]: airlied[d]: Have you run nak's hw_tests? Do they all pass?
17:29gfxstrand[d]: Of course codegen uses 32-bit bools. 🙄
17:32gfxstrand[d]: I don't think Fermi/KeplerA has a way to do and/or/xor on predicates.
17:32gfxstrand[d]: If it does, I haven't found it yet
17:32gfxstrand[d]: `isetp` has a 2nd predicate source but as far as I can tell it's ignored
17:33mhenning[d]: maybe you can predicate the op that sets the predicate?
17:33mhenning[d]: might take two instructions though
17:33gfxstrand[d]: Yeah and it requires my predication branch
17:33gfxstrand[d]: But yes that would probably work
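The two-instruction idea above can be modeled in Python as follows. This is only a behavioral sketch of combining predicates through predication, assuming the second set-predicate op can be predicated on the result of the first; the function names are hypothetical and nothing here reflects an actual Fermi or Kepler encoding.

```python
def and_by_predication(cmp_a, cmp_b):
    """p = a AND b using two set-predicate ops, the second one predicated."""
    p = cmp_a()      # op 1: set p from the first comparison
    if p:            # op 2 executes only where p is already true ...
        p = cmp_b()  # ... and overwrites p with the second comparison
    return p

def or_by_predication(cmp_a, cmp_b):
    """p = a OR b: same shape, but the second op is predicated on NOT p."""
    p = cmp_a()
    if not p:
        p = cmp_b()
    return p
```

XOR does not fall out of this pattern with a single predicated op, which is one reason a real predicate-logic instruction is preferable.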
17:34gfxstrand[d]: I really wish I knew what the 2nd source on isetp did
17:35gfxstrand[d]: It also has `.alu` and `.xlu`. I bet those do something
17:36mhenning[d]: yeah those sounds like they would do something
17:37gfxstrand[d]: That second predicate source has to be used for *something*
17:39mhenning[d]: airlied[d]: I wrote an encoding test against lea and didn't find anything different
17:53mhenning[d]: gfxstrand[d]: AsFermi has psetp listed as having .and/.or/.xor https://github.com/hyqneuron/asfermi/wiki/OpcodeMiscellaneous#psetp
17:54gfxstrand[d]: gfxstrand[d]: Actually... I think the second predicate is a destination a la psetp
18:03gfxstrand[d]: Okay, that's making more sense. I should actually believe myself when I leave myself notes in the code.
18:04mohamexiety[d]: mhenning[d]: yeah that's possible. this is what the shader looks like:
18:04mohamexiety[d]: ```glsl
18:04mohamexiety[d]: VkShaderModule cs = qoCreateShaderModuleGLSL(
18:04mohamexiety[d]: t_device, COMPUTE,
18:04mohamexiety[d]: layout(push_constant, std430) uniform Push {
18:05mohamexiety[d]: uint bound;
18:05mohamexiety[d]: } push;
18:05mohamexiety[d]: layout(set = 0, binding = 0, std430) buffer Storage {
18:05mohamexiety[d]: uint ua[];
18:05mohamexiety[d]: } ssbo;
18:05mohamexiety[d]: layout (local_size_x = 32) in;
18:05mohamexiety[d]: void main()
18:05mohamexiety[d]: {
18:05mohamexiety[d]: uint x[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
18:05mohamexiety[d]: uint bound = push.bound + gl_LocalInvocationID.x;
18:05mohamexiety[d]: for (uint i = 0; i < bound; i++) {
18:05mohamexiety[d]: x[0] += i;
18:05mohamexiety[d]: x[1] += i * 2;
18:05mohamexiety[d]: x[2] += i * 3;
18:05mohamexiety[d]: x[3] += i * 4;
18:05mohamexiety[d]: x[4] += i * 5;
18:05mohamexiety[d]: x[5] += i * 6;
18:05mohamexiety[d]: x[6] += i * 7;
18:05mohamexiety[d]: x[7] += i * 8;
18:05mohamexiety[d]: }
18:05mohamexiety[d]: uint y = x[0];
18:05mohamexiety[d]: y += x[1];
18:05mohamexiety[d]: y += x[2];
18:05mohamexiety[d]: y += x[3];
18:05mohamexiety[d]: y += x[4];
18:05mohamexiety[d]: y += x[5];
18:05mohamexiety[d]: y += x[6];
18:05mohamexiety[d]: y += x[7];
18:05mohamexiety[d]: ssbo.ua[gl_LocalInvocationID.x] = y + gl_LocalInvocationID.x;
18:05mohamexiety[d]: }
18:05mohamexiety[d]: );
18:05mohamexiety[d]: ```
18:08gfxstrand[d]: Yeah, probably need to run that through nvdump to see what it's actually compiling to
18:39mohamexiety[d]: how do I use nvdump?
18:42mhenning[d]: mohamexiety[d]: something like `nv-shader-tools/target/debug/nvdump --sm 86 vk_computeparticles_pipeline_1.compute.spv`
18:42mhenning[d]: see also `nvdump --help`
18:44mohamexiety[d]: ah got you. thanks!
18:52mohamexiety[d]: gfxstrand[d]: is there a way to get crucible to output spirv files or should I just copy the relevant bits out of the -spirv.h file that crucible makes?
18:53gfxstrand[d]: Probably just copy+paste the shader source and run glslangValidator yourself
18:55tiredchiku[d]: I promise I want to write more code, it's just that my personal issues have left me mentally crippled :Melting:
18:55gfxstrand[d]: I hear you
18:57mohamexiety[d]: relatable, don't worry about it and I hope things get better soon on your end :RosPat:
18:57tiredchiku[d]: tried to work on some as a distraction yesterday but it was just not distracting enough
19:03airlied[d]: mhenning[d]: haven't run the tests, yeah it's annoying as nvdisasm decodes it fine, I only found it by noping out each instruction in turn
19:07mhenning[d]: airlied[d]: Have you tried disabling 64-bit lea but not 32-bit? I wonder if blackwell requires more strict register alignment for the 64-bit form or something
19:56gfxstrand[d]: More strict than 2 regs?
19:56gfxstrand[d]: That would suck
20:06airlied[d]: nope 32-bit alone kills it
20:18gfxstrand[d]: Talking about LEA?
20:19gfxstrand[d]: It's possible they removed it but that seems unlikely
20:20mhenning[d]: gfxstrand[d]: lea on ampere doesn't have any alignment requirements. You pass the lower and upper 32 bits in two different registers and they can be any two registers
20:22mhenning[d]: airlied[d]: okay, I think the lea hw_test is the next thing to check then
20:24mhenning[d]: or you could just turn it off and not worry about it for now. lea isn't vital
20:27airlied[d]: I just turned off the opts for >= 90 for now locally
20:37mohamexiety[d]: for nvdump register usage, is looking at the highest register number used a good measure of how many registers it used?
20:38mhenning[d]: yeah, that's probably accurate
20:39mhenning[d]: or rather, num_regs is the highest register value + 1
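The "highest register value + 1" rule can be applied mechanically to a disassembly listing. Below is a small Python sketch; it assumes general-purpose registers are printed as `R<number>` and that `RZ` and uniform registers such as `UR4` should not count. The function name is hypothetical, and this is not a parser for nvdump's actual output format.

```python
import re

def num_regs(asm_text):
    """Estimate register usage as (highest R<n> seen) + 1.

    Assumes GPRs appear as R<number>. RZ has no digits and UR<n> has no
    word boundary before the R, so neither is matched by the pattern.
    """
    regs = [int(n) for n in re.findall(r"\bR(\d+)\b", asm_text)]
    return max(regs) + 1 if regs else 0
```

One caveat: an instruction that uses a register as the base of a 64-bit pair implicitly touches the next register too, so this can undercount by one if the highest-numbered register only ever appears that way.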
20:40mohamexiety[d]: yeah
20:41mohamexiety[d]: in that case seems to be:
20:41mohamexiety[d]: x[8] => 26
20:41mohamexiety[d]: x[16] => 27
20:41mohamexiety[d]: x[32] => 41
20:42mohamexiety[d]: mohamexiety[d]: in which case this looks to be accurate-ish for number of registers?
20:42mohamexiety[d]: but not sure where the extra numbers are coming from
20:43mohamexiety[d]: it's a consistent +2 too
20:43mhenning[d]: Oh, yeah there's always another 2 registers used for the instruction pointer
20:44mohamexiety[d]: it being 1 byte would also fit with the 256 limit
20:44mohamexiety[d]: ohhh
20:44mohamexiety[d]: welp, we got it then
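Putting the numbers from this exchange together: the byte at bits 8 to 15 of the first changing value was 28, 29 and 43, while nvdump reported 26, 27 and 41 registers. A minimal Python sketch of that decoding follows; the field position and the constant 2 are inferred only from these three data points, and the function name is hypothetical.

```python
def regs_from_mme_value(value):
    """Recover the shader's register count from the observed MME data word.

    Assumption from the three samples above: bits 8..15 hold the register
    count plus 2 extra registers.
    """
    field = (value >> 8) & 0xFF
    return field - 2
```

For example, `regs_from_mme_value(0x2b01)` gives 41, matching the nvdump count for `x[32]`.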
20:44mohamexiety[d]: thanks so much for the help! :ablob_heart:
20:44mohamexiety[d]: now I wonder if there's a way to find what the second entry represents
20:48mhenning[d]: yeah, that one looks odd
20:49mhenning[d]: does the shader use ldl or stl at all (local memory) ?
20:49mohamexiety[d]: nope
20:50mhenning[d]: Also, there's an occupancy calculation in nak/ir.rs function max_warps_per_sm if you're wondering about occupancy, although I don't think that lines up well either
20:54mohamexiety[d]: yeah I am not entirely certain that's it as I'd expect it to consistently go down. if we take just the lower byte we get:
20:54mohamexiety[d]: x[8] => 0x2c => 44
20:54mohamexiety[d]: x[16] => 0x18 => 24
20:54mohamexiety[d]: x[32] => 0x20 => 32
20:56mohamexiety[d]: mohamexiety[d]: btw one thing which could be interesting here -- the lower byte is local_size_Z
20:56mohamexiety[d]: so upper byte is number of registers used and lower byte is local size Z
21:04mhenning[d]: yeah, it's interesting that they're packing the fields like that
21:08gfxstrand[d]: snowycoder[d]: I'm going to land https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34536 now. That'll decouple the SM20 and SM32 work a bit so the branches can live in separate MRs. Common stuff that's useful to both can be pulled out and landed in main ahead of the two primary MRs.
21:12airlied[d]: gfxstrand[d]: quick fix https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34537/
21:13gfxstrand[d]: airlied[d]: mhenning[d] FYI: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34538
21:21airlied[d]: ah hw_tests needs QMD
21:21mhenning[d]: oh, right
21:22airlied[d]: https://paste.centos.org/view/raw/d822767c is the current dEQP-VK.glsl* fails, mostly compute
21:24airlied[d]: I think depth textures are broken also
21:26gfxstrand[d]: gfxstrand[d]: Okay, I think that decouples the worst of the cross-MR conflicts. We've got three different HW enabling efforts in-flight.
21:26gfxstrand[d]: ("Why did Faith start a third," you might ask? Uh... Curiosity? :shrug_anim: )
21:28mohamexiety[d]: airlied[d]: how far away am I from basic QMD enablement? out of what Faith mentioned I am only missing further looks into shared memory and cbuf0/shader pointers
21:29gfxstrand[d]: Once you've got everything in my list, I think we ought to be able to run compute shaders
21:29gfxstrand[d]: The NAK hw tests are nice and simple.
21:29mhenning[d]: mohamexiety[d]: Take a look at src/nouveau/compiler/nak/qmd.rs for the code that actually fills them in
21:30mhenning[d]: you could probably start wiring some of that up and seeing if you can get simple compute shaders running
21:31mohamexiety[d]: hmm I see
21:31mhenning[d]: Note that I'm a little fuzzy on how the macro actually works so the first few bytes might not be QMD, and you'll probably need to figure out what offset the actual qmd starts
21:34airlied[d]: I've also asked nvidia in a meeting to prioritise blackwell qmd
21:48gfxstrand[d]: Uh... WUT? Why does POPC on Kepler take multiple sources?
21:51gfxstrand[d]: Maybe it counts both of them?
21:52airlied[d]: 64-bit?
21:58gfxstrand[d]: It appears to AND them and count the intersected bits
21:58airlied[d]: fancy
22:05mhenning[d]: finally, what I've always wanted: a fused and + popcount instruction
22:13mhenning[d]: actually maybe it's useful if you want to mask off the bits that you're popcounting
22:14gfxstrand[d]: It does seem marginally useful
22:14gfxstrand[d]: And it's not like it really takes more hardware
22:14gfxstrand[d]: But it clearly wasn't the top-tier optimization they thought it was because they deleted it on Kepler B.
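The two-source POPC behavior described above amounts to a population count of the bitwise AND of the sources. Here is a minimal Python model of that semantics as observed in this conversation, assuming 32-bit operands; the function name is hypothetical.

```python
def popc_and(a, b):
    """Count the bits set in both a and b, treating them as 32-bit values."""
    return bin((a & b) & 0xFFFFFFFF).count("1")
```

A plain popcount is the special case where the second source is all ones, e.g. `popc_and(x, 0xFFFFFFFF)`, and passing a mask as the second source gives the "mask off the bits you're counting" use mentioned above.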
22:15gfxstrand[d]: Ugh... Not having psetp or plop suuuuuucks.
22:15gfxstrand[d]: IDK how I'm going to resolve parallel copies of predicates
22:16HdkR: plop is so good as well!
22:16mhenning[d]: I thought you did have psetp?
22:16gfxstrand[d]: Not that I can find
22:16gfxstrand[d]: AFAICT it's new on Kepler B
22:17mhenning[d]: This encoding doesn't work? https://github.com/hyqneuron/asfermi/wiki/OpcodeMiscellaneous#psetp
22:21gfxstrand[d]: That is it
22:21gfxstrand[d]: damn... I was scared there for a bit
22:22gfxstrand[d]: IDK why I couldn't find that
22:33gfxstrand[d]: Okay, psetp is encoded now. Scared me for a bit there. Things were gonna get really dicey without it
23:06mhenning[d]: gfxstrand[d]: Is rZ a legal src in OpPhiSrcs?
23:07gfxstrand[d]: I think it is now?
23:12mhenning[d]: Yeah, I thought so too, but now I'm having trouble finding the commit that made it legal
23:21gfxstrand[d]: I'm not seeing it either
23:22gfxstrand[d]: Looks like it's in my constant remat branch
23:25gfxstrand[d]: 7180d561182588d035ba4378e68cb5299d46c0f9