01:47gfxstrand[d]: That works, too
05:14gfxstrand[d]: Yeah. They're intended to be lowered to intrinsics on SSA values
11:05MarinDudek: But linking and metadata is never a problem, there is too many ways to do it from LIEF/zig. But i am carefully inspecting cost effective basic block sizing, i need some presentation like pc-relative addressing that has branches in one direction after one and another, i know this exists, but i want to yet to get basic blocks that are equally sized cost effective inlining etc. So must go
11:05MarinDudek: through some compiler passes and their ir and in the end i want to lift those blocks to wasm and then the final code. It's because branching would otherwise kill performance. Then i can land into coding it all to finish it off. Since we know how to manipulate data banks, the project is semi-difficult medium level project.
11:27MarinDudek: technically i was not inaccurate because to measure how much digitbase256 could index data is 140/2 70, so in one direction 35 and another 35 is maximum, but since zero needs presentating as real number , we are better off not using the first digit reserved to zero, and reference point was 112, so 69 wide banks at such indexing. so we are better not using the second digit either. Now as to
11:27MarinDudek: how we manage offsets as tiled databanks, so 32bit accommodates a whole lot fewer than 64bit location. So 32bit is quite simple.
11:47MarinDudek: overall the formula of offsetting always would involve in it to raise the base address by some constant and at runtime at that as offset from the base, first say 1024*1024*storage data banks would have no offset i.e. offset zero, whereas later ones would have offset to be distinguishable.
11:52MarinDudek: The calculation for pure data is pretty simple in all cases, i have no trouble with it, even on 64bit i would assume, so i actually want to start talks as to how execution is carried out.
12:21MarinDudek: what happens during execution is we have such format 5 9 13 17 .... so we put 5 and 9 to the offset of a pc nr1 that forms an index of 14 to be accessed right from the word go, it will access the result with base of future bank as a result that similarly possibly combined with another operand to form a new index right
12:22MarinDudek: then we perform such procedures as I was talking for data, only the banks were slightly longer cause it needs to accommodate all the holes that combine into new even values too this time
12:27MarinDudek: My testing shows so magical results, that it appears to be lot easier than i expected, if you run things correctly that will output X intermediate values
12:28MarinDudek: for every pc executed right!
12:28MarinDudek: So now one needs to do those calculations we did incrementally on pieces of hashes before, on the whole values.
12:29MarinDudek: and very surprising tbh. for me was that those values need no bases other than the dependency logic, cause the same algorithm works on the full hashes too
12:30MarinDudek: according to my labs here , which was something i was not entirely sure about before
12:31MarinDudek: So technically i still needed the help of intel to get gnome-calculator stably running which they offered to me on the macbook i use under linux, so this thing is way simpler, and today we go over the procedures needed to be done to subindex those new intermediate pc banks.
12:33MarinDudek: I had another computer which tended to crash the calculator and immediately i lost all the work after this, but AMD hardware, and fairly old Mint linux
12:34MarinDudek: but i am already shifting away from gnome-calculator to compilation world, so i do little work outside, and come back to talk more about execution
12:35MarinDudek: But technically from may15 to forward, you should be able to see as to how those procedures work on full hash to subindex it in reverse
12:35MarinDudek: non-the less if you yet did not understand it, i go over from there , where you should be able to gather all the bits and pieces of the logic
12:41MarinDudek: I have two basic hypothesis as to why that works surprisingly enough to me, one is hidden below mathematics of circle area, and inside euler formula the constant called PI
12:42MarinDudek: and another is relying on that which is just so powerful representation in very unordered or chaotic but still deterministic high entropy powers of twos encoder decoder in hardware...i suspect those two genius things that where computer understands what i do with full hashes.
12:43MarinDudek: so this is fairly new discovery and the last of the bigs i have identified needed for the magic to happen really
12:45MarinDudek: orthogonal doubt was about some caches that manipulate the outcome, but today i am sure that this isn't real hypothesis in any way
12:47MarinDudek: so the number system patterns are repeating so intelligently with albeit with needed high entropy , that hash has such a value that on longer run it reduces it to the needed access when you just do the needed arithmetic and yeah even on full hashes.
12:50MarinDudek: So why such roundtripish presentation, yes we could do a bit more straightforward, but it would not dissipate heat very well compared to current, and it would be easier to read security wise, I am afraid i did not find big mistakes in the worlds science outputs so far, computers also done the most intelligent possible
12:52MarinDudek: so for an example to thermopicture the current computers circuit, i do not think it's possible to read out what it was doing without accessing the control lines
12:52MarinDudek: it's very secure in that sense
13:01MarinDudek: so currently there are two prerequisites to land the final hack, trace a little of the wasm index assigning and annotation which i little bit already know about, seems entirely fine and well adjustable, and the previous of the two, basic blocks of pc relative branch/flow indexing
13:02MarinDudek: And i take some time, not so much of it, i expect to land the last almighty code too during this year.
13:22ice9_: where to report bugs?
13:44gfxstrand[d]: GitLab
13:45gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/issues/
15:56arrigomanipulado: So today it's last of mine, so i wanted to recap you that answersets are in decreasing or increasing order per set, since there is no arithmetic that does not count for magnitude of the number, hence when you remove the answers gotten from all the possibilities, you get the index after you had added the base. so you just get all the indexes of correct answers, now how to resubindex
15:56arrigomanipulado: them, you do not know if the answer was at 144+72 index or illustratively 128+64, you do now a full hashvector batch operation, first is getting the distances as before then removing all the distances from all values, then adding back all the indexes get's you to indexed set again, so now you can remove 112 per pc1, and you get value+index, where others are at
15:56arrigomanipulado: referencepoint+value+index, all values removed you get index and referencepoint+index, so now we are close 63+114=177 and 58 , so remember that procedure that we needed not for data access this hack to follow is similar? if we remove now 112 from 58 it's at -54 other elements at 63 and whatever say 68 adding , so adding all indexes get's 4 126+134-112-126-134=-108 we can debug the
15:56arrigomanipulado: value as well as if needed decode it to write the IO. It works cause index is smaller than the referencepoint.
15:56arrigomanipulado: thanks for tolerating me today, that is all from me.
16:02snowycoder[d]: mhenning[d]: Can I use the Kepler alignment MR to also land some other minor Kepler encoding fixes?
17:09karolherbst[d]: did anybody work on predicating instruction instead of branching for nak?
17:09karolherbst[d]: I wonder how much that matters...
17:10karolherbst[d]: might even help with some RA stuff
17:10karolherbst[d]: yeah....
17:11mohamexiety[d]: didnt Faith do that?
17:12karolherbst[d]: dunno
17:13karolherbst[d]: `p0 null = plop3 pT pT up0 LUT[0xaa] LUT[0x0]`?
17:13snowycoder[d]: If I understand things correctly, there's one from Faith (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33676) and one from Mel (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33683) approaching the problem differently
17:13karolherbst[d]: is that converting a up0 to p0 or something more?
17:14karolherbst[d]: p0 null = plop3 pT pT up0 LUT[0xaa] LUT[0x0]
17:14karolherbst[d]: @!p0 bra L4
17:14karolherbst[d]: I think the `bra` can just take a up0 input
17:14karolherbst[d]: well...
17:14karolherbst[d]: on ampere 🙃
17:16karolherbst[d]: `BRA.U pT up0 L4`
17:25karolherbst[d]: div 32 %137 = phi b0: %1 (0x0), b12: %472
17:25karolherbst[d]: div 16 %281 = u2u16 %137
17:25karolherbst[d]: div 16x4 %438 = vec4 %281, %280, %279, %278
17:25karolherbst[d]: div 16x4 %439 = @cmat_muladd_nv (%393, %431, %438) (flags=797)
17:25karolherbst[d]: ...
17:25karolherbst[d]: div 16x4 %471 = @cmat_muladd_nv (%447, %467, %439) (flags=797)
17:25karolherbst[d]: div 32 %472 = u2u32 %471.x
17:25karolherbst[d]: yeah... I think we can do better 🙃
17:25karolherbst[d]: that's the reason for all those prmt
17:25karolherbst[d]: so the phi node is 32 bit for no reason at all
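The dump above shows why the 32-bit phi is redundant: its only use is a `u2u16` truncation, and its sources are the constant 0 and a `u2u32` zero-extension of a 16-bit value. A minimal Python sketch of that argument follows; `step` is a hypothetical stand-in for the 16-bit work done in the loop body (the `cmat_muladd_nv` chain) and is not taken from the shader.

```python
def loop_wide(step, iterations):
    """Loop-carried value kept in a 32-bit phi, as in the dump."""
    phi = 0                                # phi b0: 0x0
    for _ in range(iterations):
        narrow = phi & 0xFFFF              # u2u16 %137
        result = step(narrow) & 0xFFFF     # 16-bit work (hypothetical)
        phi = result & 0xFFFFFFFF          # u2u32: zero-extend back to 32 bits
    return phi


def loop_narrow(step, iterations):
    """Same loop with the phi narrowed to 16 bits; both conversions vanish."""
    phi = 0
    for _ in range(iterations):
        phi = step(phi) & 0xFFFF
    return phi


def example_step(x):
    # Hypothetical 16-bit operation, used only to exercise the two loops.
    return x * 3 + 7
```

Both loops produce the same value on every iteration because the upper 16 bits of the wide phi are always zero, so narrowing the phi loses nothing.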
17:27karolherbst[d]: I wonder where to optimize that....
17:32reikonieminen: I go to long vacation now. I was having an unneeded roundtrip sorry cause of that, i did it deliberately to recap how indexing and reindexing is being done at first, this also can be delayed during compilation to a batch/group procedure. But once you dump all values and indexes from the hash it's pointless to redistance them. So all in all the stuff is easier than you are used to doing, i
17:32reikonieminen: just reported it to the world, but code full code this time i do not forward. So by me it's good bye as of now. If you want to give a try it's may15tomay20 where the nouveau channel had the needed info typed. conditional branching and indexing the answersets, linker , and loops in the hashes it's all pretty small code it relies on wasmati annotate logic where the changes are made one day,
17:32reikonieminen: we have the fifo that is the fetcher and couple of loads that prefetch the words from hash, i mean i envision this, i have no code as of yet. But next year i will have it all tested. So it's good bye from me now.
17:40karolherbst[d]: looks like I need `nir_opt_phi_precision`
17:41karolherbst[d]: doesn't work 🥲
17:46karolherbst[d]: mhhh.. okay.. that's better, but I still get pointless prmts...
17:48karolherbst[d]: but at least on the nir side, those u2u32/u2u16 are gone now
17:48karolherbst[d]: so it's just 16 bit value handling that's not really great
17:51karolherbst[d]: r32 = prmt r9 [0x5410] r10 // delay=1 wt=001000
17:51karolherbst[d]: r104..106 = hmma.m16n8k16.f16 r104..108 r30..32 r32..34 // delay=2 wt=000001 rd:0 wr:3
17:51karolherbst[d]: ...
17:51karolherbst[d]: r56..58 = hmma.m16n8k16.f16 r56..60 r10..12 r104..106 // delay=15 wt=000001 wr:0
17:51karolherbst[d]: r9 = prmt r56 [0x10] rZ // delay=1 wt=000001
17:51karolherbst[d]: r10 = prmt r56 [0x32] rZ // delay=1 wt=100100
17:51karolherbst[d]: obviously that also blows up register usage
17:51karolherbst[d]: should just be this:
17:51karolherbst[d]: r32 = mov r9
17:51karolherbst[d]: r104..106 = hmma.m16n8k16.f16 r104..108 r30..32 r32..34 // delay=2 wt=000001 rd:0 wr:3
17:51karolherbst[d]: ...
17:51karolherbst[d]: r56..58 = hmma.m16n8k16.f16 r56..60 r10..12 r104..106 // delay=15 wt=000001 wr:0
17:51karolherbst[d]: r9 = mov r56
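To make the `prmt` selectors in the listing above easier to read, here is a minimal Python sketch of a byte permute. It assumes the instruction behaves like the default mode of PTX `prmt.b32` and that the printed operand order is (first source, selector, second source); both are assumptions about the disassembly, not something stated in the log.

```python
def prmt(a, sel, b):
    """Byte permute modelled on the default mode of PTX prmt.b32 (assumed)."""
    # Eight candidate bytes: 0..3 come from a, 4..7 come from b.
    pool = (a & 0xFFFFFFFF) | ((b & 0xFFFFFFFF) << 32)
    out = 0
    for i in range(4):
        nibble = (sel >> (4 * i)) & 0xF
        byte = (pool >> (8 * (nibble & 0x7))) & 0xFF
        if nibble & 0x8:
            # Bit 3 of a selector nibble replicates the byte's sign bit.
            byte = 0xFF if byte & 0x80 else 0x00
        out |= byte << (8 * i)
    return out


# Selector 0x5410 packs the low halves of two registers into one.
r9, r10 = 0x00001234, 0x0000ABCD
packed = prmt(r9, 0x5410, r10)

# Selectors 0x10 and 0x32 (second source rZ = 0) pull the halves apart again.
low_half = prmt(packed, 0x10, 0) & 0xFFFF
high_half = prmt(packed, 0x32, 0) & 0xFFFF
```

Under these assumptions the loop unpacks a register that already holds a packed pair and then repacks the same two halves, which is why a plain `mov` would do.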
17:53karolherbst[d]: mhenning[d]: gfxstrand[d] any good ideas? This probably means messing with how from_nir.rs handles vecs and movs of 16 bit values, but not sure I'm familiar enough with the code to rework that in a way that's not breaking the world
17:54karolherbst[d]: I wonder if airlied[d] had a patch for that? didn't see one, but...
17:55karolherbst[d]: https://gitlab.freedesktop.org/airlied/mesa/-/commit/ec5be9c4c1cc7f470a0627f66ac7b90b1e05cf49 maybe?
17:56karolherbst[d]: ohh no, that's solved with nir_opt_phi_precision
17:57karolherbst[d]: also my barrier patch regresses memory_model 😢
17:58karolherbst[d]: mhhhhhh
17:58karolherbst[d]: maybe the phis should be on vec2 16 bit values...
17:58karolherbst[d]: and then they get placed in proper values
17:59karolherbst[d]: and no need for the prmt business
18:00karolherbst[d]: yeah.. maybe the solution is to handle this all in nir, and then from_nir isn't bothering with it anymore
18:00karolherbst[d]: and just gets 32 bit values, either be it 16vec2 or 8vec4
18:59f_[x]: Remember the strange timeouts and lag then black screen after unplugging/replugging second monitor issues I had on Fermi?
18:59f_[x]: It happened again.
18:59f_[x]: So if there's more logs and such I should fetch that's the right time ^^
19:00f_[x]: Oh. Aww, I thought I had ssh enabled but apparently not...
19:00f_[x]:will try emabling it blindly
19:00f_[x]: Enabling*
19:16f_[x]: This is on 6.14.2(-artix1-1??)
19:17f_[x]: Basically the second monitor started slowing down, then upon switching tty's it froze completely, and after unplugging the displayport the laptop's internal monitor completely died
19:20karolherbst: f_[x]: sounds like the display controller randomly entering a weird state...
19:21f_[x]: karolherbst: would make sense considering the "training failed" messages in dmesg I already wrote about before
19:21f_[x]: I'll send a dmesg before rebooting to make sure
19:22f_[x]: Moment
19:22f_[x]: (and I've ssh)
19:51f_[x]: Ugh :(
19:51f_[x]: I've lost the logs :(
19:53f_[x]: But I remember it being extremely similar to drm/nouveau#330
19:54f_[x]: If only I was able to reproduce this reliably ...
19:56f_[x]: Sorry for wasting your time
20:46f_: karolherbst: I was playing werewolf to relax when suddenly I tried switching tty's and the internal monitor froze, the external one died. Coming back to Wayland the internal monitor is functional but the external one is just showing gibberish.
20:48f_: and the cursor disappears
20:48f_: and everything lags
20:49karolherbst: f_: mhh I think switching to the tty can involve a mode switch, so could be that the configuration was invalid or something
20:49karolherbst: could try changing resolutions and see if that fixes it
20:49karolherbst: though might not be possible given the state of things
20:50karolherbst[d]: okay.. disabling `nir_lower_phis_to_scalar` helps a lot
20:50karolherbst[d]: now I only have a bunch of nonsense movs
20:50karolherbst[d]: `Num GPRs: 83` down from 150 🙃
20:51karolherbst[d]: `assertion failed: src_reg.comps() == 1` yeah well...
20:51karolherbst[d]: one of the sub-tests jumped from 5 to 50 TFlops..
20:52karolherbst[d]: mhhh
20:58f_[x]: I captured the full dmesg
20:58mhenning[d]: karolherbst[d]: oh, yeah, I've been wondering about doing that. requires handling more cases in RA though
20:58f_: I'll try sending it now if possible
20:59karolherbst[d]: mhenning[d]: the shader looks _real_ good now
20:59karolherbst[d]: https://gist.githubusercontent.com/karolherbst/b4e7188f14f9e6658793876b1755f061/raw/d43bbc9eb36e5169c31ada31173655f40a4d328e/gistfile1.txt
20:59f_[x]: Actually I'll see if I can send it tomorrow
20:59karolherbst[d]: just the movs are annoying
20:59f_[x]: I'm feeling a bit sick now
21:00karolherbst[d]: so I think keeping vector phis is the way to go then
21:00karolherbst[d]: just need to make RA smarter and deal with the fallout
21:00f_[x]: What's a bit weird is that I did not suspend before getting this issue
21:00f_[x]: So perhaps we can rule out suspend?
21:01f_[x]: Eh. I'll just send the logs tomorrow hopefully and maybe it'll be clearer what's going on
21:02mhenning[d]: snowycoder[d]: should be fine
21:02karolherbst[d]: kinda funny though:
21:02karolherbst[d]: r38 = mov r80 // delay=1
21:02karolherbst[d]: r80 = mov r0 // delay=3 wt=000001
21:02karolherbst[d]: r0 = mov r29 // delay=1
21:03f_[x]: Too sick to do it now ... :/
21:03karolherbst[d]: shaders doing tons of vecs are always fun to deal with
21:03karolherbst[d]: it's even worse...
21:04karolherbst[d]: there are absolutely unhinged groups of movs doing random nonsense
21:05mhenning[d]: yeah, still need to get better at coalescing
21:05karolherbst[d]: I don't think it's that
21:05karolherbst[d]: r42 = mov r80 // delay=1
21:05karolherbst[d]: r80 = mov r0 // delay=3
21:05karolherbst[d]: r0 = mov r33 // delay=1 wt=100000
21:05karolherbst[d]: r33 = mov r80 // delay=1
21:05karolherbst[d]: r80 = mov r1 // delay=3
21:05karolherbst[d]: r1 = mov r34 // delay=1
21:05karolherbst[d]: r34 = mov r80 // delay=1
21:05karolherbst[d]: r80 = mov r2 // delay=1
21:05karolherbst[d]: r75 = mov r46 // delay=1 wt=010000
21:05karolherbst[d]: r74 = mov r45 // delay=1
21:05karolherbst[d]: r2 = mov r41 // delay=1
21:05karolherbst[d]: r41 = mov r80 // delay=1
21:05karolherbst[d]: r80 = mov r3 // delay=1
21:05karolherbst[d]: r3 = mov r42 // delay=5
21:05karolherbst[d]: r42 = mov r80 // delay=1
21:05karolherbst[d]: r80 = mov r0 // delay=1
21:05karolherbst[d]: just look at this part
21:06mhenning[d]: I think that's cycle elimination from parallel copy lowering
21:07mhenning[d]: r80 is probably the temporary used for implementing swaps
21:07mhenning[d]: so yes doing RA better would probably fix it
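The `r80` pattern discussed above is what a generic parallel-copy sequentializer produces when it breaks a cycle with a scratch register. A minimal Python sketch of that textbook technique follows; it is not NAK's actual implementation, and the function name and register strings are illustrative only.

```python
def lower_parallel_copy(copies, tmp):
    """Turn {dst: src} parallel copies into an ordered list of (dst, src) movs.

    tmp is a scratch register that must not appear in copies.
    """
    pending = {d: s for d, s in copies.items() if d != s}  # drop no-op copies
    movs = []
    while pending:
        # A copy is safe once no pending copy still reads its destination.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            for d in ready:
                movs.append((d, pending.pop(d)))
            continue
        # Only cycles remain: save one destination in tmp, redirect its readers.
        d = next(iter(pending))
        movs.append((tmp, d))
        for k, s in pending.items():
            if s == d:
                pending[k] = tmp
    return movs


# A two-register swap costs three movs through the scratch register.
swap = lower_parallel_copy({"r0": "r33", "r33": "r0"}, tmp="r80")
```

Each cycle costs one extra mov through the scratch register, so fewer cycles out of register allocation means fewer of these sequences.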
21:07karolherbst[d]: yeah..
21:08karolherbst[d]: it's just funny how not doing scalar phis is triggering all that weirdness, but I guess nobody looked into it yet
21:08karolherbst[d]: could also be broken
21:08karolherbst[d]: other shaders just assert inside RA later
21:08karolherbst[d]: `Illegal instuction in ureg category st.local.strong.cta.b128 rZ [rZ+0x10] ur0..4` also fun
21:08mhenning[d]: yeah, it's possible some of our usual heuristics are breaking
21:09karolherbst[d]: anyway.. going to deal with the regressions from the barrier MR first
21:10mhenning[d]: yeah. That one could be as simple as a pass ordering problem
21:10karolherbst[d]: it's not like I haven't debugged those kind of fails for OpenCL stuff already 🙃
21:12karolherbst[d]: ohhh..
21:13karolherbst[d]: I think it's something more silly than ordering
21:15karolherbst[d]: ohh or it is ordering, because it only works before IO is lowered
21:15karolherbst[d]: oh well
21:15karolherbst[d]: only handles derefs
21:16karolherbst[d]: mhhhh
21:18karolherbst[d]: yeah... that won't do
21:18karolherbst[d]: I think...
21:18karolherbst[d]: most of those barriers are a result of loop unrolling, and nvk lowers IO really early
21:19karolherbst[d]: yeah.. and the barriers are back now..
21:20karolherbst[d]: *sigh*
21:20karolherbst[d]: I think it's better to teach the pass about lowered IO...
21:20karolherbst[d]: mhhhh
21:20karolherbst[d]: uhhh
21:20karolherbst[d]: that looks like a lot of work actually
21:30mhenning[d]: Can it go in nak_preprocess_nir sometime after the optimize_nir call? That should be after the first round of loop unrolling but before lowering io
21:37karolherbst[d]: the lowering is done inside `nvk_lower_nir`
21:37karolherbst[d]: which is called before `nvk_compile_nir`
21:38karolherbst[d]: so I don't see how that would work
21:38karolherbst[d]: with lowering IO I mean the `nir_lower_explicit_io` of global/shared memory
21:42mhenning[d]: I think preprocess is called before nvk_compile_nir
21:54karolherbst[d]: ohh indeed, though thing is, it doesn't help there either
21:55karolherbst[d]: the issue is more that we need to keep some membars, but most of them are just a result of loop unrolling
21:55karolherbst[d]: so if the barrier opt is called before unrolling loops it won't really help much
21:57mhenning[d]: Right, but nak_preprocess_nir calls loop unrolling when it calls optimize_nir
21:57mhenning[d]: so if the opt_barrier_modes goes after optimize_nir but before the end of preprocess, it should work, right?
21:58mhenning[d]: or does the unrolling fail there for some reason?
21:58karolherbst[d]: it unrolls more later
21:58karolherbst[d]: I've put the opt_barrier_modes right before explicit_io and it didn't really help
22:00mhenning[d]: hmm that's frustrating
22:01mhenning[d]: It sounds like it would be legal to hoist the membar outside of the loop in this case but actually doing so is a pain
22:02karolherbst[d]: yeah.. I was considering it, but...