06:12cernico: mwk released a stinky fart , can be felt from distance, your life will be complex after terrorizing me, airlied is nose in mwk's butt again investigating the components eaten. Computer users need to suffer on this endless butt and nose game.
06:22cernico: https://code.google.com/archive/p/maxas/wikis/ControlCodes.wiki it explains the difference between kepler and maxwell; control codes were actually introduced in kepler to lower power.
06:32cernico: This is a pretty rough concept, but based on instruction latency it orders things statically so they get scheduled in order of lower stall counts, which technically should never need to be touched if you use the fast computation paradigm.
06:32cernico: i.e where instructions are all with same latency
06:41cernico: so those blocks do not make much of a difference; it releases the youngest independent instructions with lower latency grouped together, and the more latent async instructions together. So that requires code that is harder to maintain.
06:50cernico: they make a bit of a difference, but it's possible to manually sort the procedures for runtime without those control codes too. So regardless of whether it is control codes or just manual reordering, those procedures would never have to be modified.
06:51cernico: or maintained per output of end user program, it's always kept the same
07:10cernico: I do not know what compilation analyses it requires if there are no control codes either, so they are handy, but the programmer needs to know the hw like the back of their hand to get any benefit. control codes order the independent instructions shortest clocks first, so it can seek forward more aggressively
07:13cernico: and in the paradigm where every alu op is add or sub, i.e. minimal latency, it only needs to do an ordering for the sw scheduler, and that is the only one i know how it goes; this one is pretty simple
07:15cernico: so there is a memory that gets read, memory and alu operation are separate, lds and global memory instructions for an example share the same alu units except for memory fetch only where there is no alu involved
07:16cernico: so in that case, you should only be worried about issuing the independent instructions first.
07:42cernico: so as register pressure never exists, all the lds is pointless by then and you only call global data share instructions to registers , that just internally somehow use the only meaningful part of lds, which is not compulsory , cause sw can do it as well
07:42cernico: for an example it's useful to preload to lds at times
07:44cernico: so all in all, programmers need something like llvm ir or an unoptimized machine layer ir to do the needed trick
07:44cernico: the mesa optimized machine code is pointless
07:45cernico: since you do not program a nuclear reactor or centrifuge, you program a gpu that is beneficial to be kept always on best performance
07:57cernico: yep, that's the memory's only use as of now: to preload some global data; all the global memory alus and lds are useless
07:57cernico: otherwise
07:58cernico: it only wins on very deep loads, where you can skip the memory loads per iteration, e.g. if you have a very long pixel shader that iterates over the size of the lds cache
07:59cernico: you win with preloading all except the first iteration worth of memory throughput
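[editor's note] The claim above can be put as simple arithmetic. This is only a sketch under a simplified model that is not in the log: it assumes the whole working set fits in local memory and that every iteration reads it exactly once. The function names and the 32 KiB / 10-iteration figures are hypothetical.

```python
def global_bytes_loaded(working_set_bytes, iterations, preload):
    """Bytes fetched from global memory under the simplified model.

    Assumption: the working set fits entirely in local memory (LDS).
    """
    if preload:
        # One pass fills local memory; later iterations read from it.
        return working_set_bytes
    # Without preloading, every iteration goes back to global memory.
    return working_set_bytes * iterations


def bytes_saved(working_set_bytes, iterations):
    """Global traffic avoided by preloading: all but the first iteration."""
    without = global_bytes_loaded(working_set_bytes, iterations, preload=False)
    with_preload = global_bytes_loaded(working_set_bytes, iterations, preload=True)
    return without - with_preload


# Hypothetical numbers: 32 KiB working set, 10 iterations.
print(bytes_saved(32 * 1024, 10))
```

Under this model the saving is (iterations - 1) times the working set, so a single-iteration shader gains nothing from preloading.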
08:01cernico: i am just saying that all the scheduling related complexity and LDS mubuf alus are functionality wise worthless
08:02cernico: so is opencl 2.1, it's pointless unless you utilise the hardware as a frequency generator based on chip states somehow
08:03cernico: in nuclear world you would for instance just need some of those frequencies
08:03cernico: but not as a computer user
08:09cernico: so in general my research is over in the era of using computers at home and energy savings and performance issues
08:11cernico: the complexity to get performance is not there, i do not know well how to utilize the default complex modes, except for frequency generating things, such as piezo crystals in the shockwave and ultrasound era etc.
08:12cernico: it's yes those modes are quite complex , i do not know these usages extremely well either
08:20cernico: who would use a gpu to do sound related things, makes no sense overall, they want just some diode based ic to do things alike, and in the world of security hardware states are not so good random number generators either, and for heaters gpus are still too expensive too :)
08:23cernico: there is no point to communicate on the grounds of solving things that have been solved, or on problems that never existed, there is just no reason to allocate time for so silly thing
08:24cernico: the correct would be to discuss problems that can be solved in every day life
08:27cernico: since you are not honest as to why your criminal career started , and why you come to my territory to terror me, then i say it's not allowed and after 2.5 years of getting that on my vacation i do not accept excuses
08:27cernico: and the final result is, that real problems you do not want to confess or discuss with me
08:28cernico: and on the framerate issues i do not want to play the slave or clown to communicate over with
08:32cernico: every proper employing company would clear that performance problem with 1month if they needed to or wanted, and that type of artist am I too, i am capable of moving things into that position too.
08:40cernico: now if you still want to do it, and those short integer permutes, i am a bit busy, someone might want to communicate with the scala authors or read the documentation, i left my research in a position where i could generate the dictionaries per alu operation if that has not been done, but i do not have time to test this month
08:41cernico: They already express those things in around solutions of rdf and even in pure scala core libs.
08:41cernico: that object orientated programming language has support for such things
08:44cernico: i have not found any AI models or solutions of that type that suit, in other words you likely just need an xml dictionary of openmath and are already on the way towards good performance
08:44cernico: if not, those permutes can be generated with loops
08:47cernico: those are meant so like signal collect and other you generate a dictionary for mathematical operation evaluators, and the backend just packs them
08:48cernico: but i know only one such backend and it is currently compliant with 32bit
08:48cernico: there are more , but elias fano does not do that by default
08:49cernico: elias fano does no compression, from what i finally saw
08:50cernico: it just can store many small ints in the same machine word
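[editor's note] What the speaker describes here, several small ints sharing one machine word, is plain fixed-width bit packing. A minimal sketch of that idea follows. It is not an Elias-Fano implementation and not the backend mentioned in the log; the 64-bit word size, the function names and the sample values are assumptions made for illustration.

```python
WORD_BITS = 64  # assumed machine word size


def pack(values, bits):
    """Pack non-negative ints, `bits` bits each, into one 64-bit word."""
    if len(values) * bits > WORD_BITS:
        raise ValueError("values do not fit in one word")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{v} does not fit in {bits} bits")
        # Slot i occupies bits [i*bits, (i+1)*bits).
        word |= v << (i * bits)
    return word


def unpack(word, bits, count):
    """Recover `count` values of `bits` bits each from a packed word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]


values = [3, 0, 7, 5, 1, 6, 2, 4]   # each fits in 3 bits
word = pack(values, 3)              # 24 of the 64 bits used
assert unpack(word, 3, len(values)) == values
```

With 3-bit slots, up to 21 values fit in a single 64-bit word. The count has to be stored separately, since a trailing zero value is indistinguishable from unused bits.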
16:22passimoto: https://kampersanda.github.io/pdf/InnovateData2017.pdf , so i drop another link, they claim to generate the dictionaries faster than others, however the approach is not the same as the neat method i described, training times would not matter if all alus are pretrained, so that would be compatible with mine with a hack, cause i already offer very quick assembly of dicts by the core method too, but all links are listed in their current approach,
16:22passimoto: which is just treating the strings as small ints. They describe some tech behind that.
17:10mareko: karolherbst: are you ok with this? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27953/diffs
17:12karolherbst: yeah, ack from my side
17:14karolherbst: mareko: once it's merged I'll probably have to clean up a few casts, but it shouldn't cause any direct issues (except when drivers start to expose higher limits maybe?)
17:24DavidHeidelberg: <built-in>:1:10: fatal error: 'opencl-c.h', where did I make the mistake: mesa/clover compilation or app compilation?
17:25karolherbst: probably mesa
17:25karolherbst: the packaging is quite broken in debian for all of this and you might have to reinstall some stuff, because reasons
17:25DavidHeidelberg: hmm, besides adding "-Dgallium-opencl=icd" I most likely needed to do something else (not Debian, Alpine)
17:26karolherbst: ahh..
17:26karolherbst: ohh, that's with clover?
17:26karolherbst: mhhh
17:26karolherbst: I have no idea :) that part is pretty broken in clover
17:26karolherbst: do you have such a file installed?
17:26DavidHeidelberg: I wanted to test on freedreno these 2/4/8 types w/ Clover
17:27DavidHeidelberg: I installed it, but probably not a dep for Mesa build
17:27DavidHeidelberg: (after building Mesa)
17:27karolherbst: there is a `CLANG_RESOURCE_DIR` thing for clover and I suspect it points to the wrong directory
17:27karolherbst: I've fixed it probably 5 times for src/compiler/clc already and clover didn't receive any of those
17:28DavidHeidelberg: let me check :)
17:28DavidHeidelberg: thx
17:28karolherbst: it's a mess because every distribution does something different and the way we used to do it was working based on wishful thinking :)
17:28DavidHeidelberg: Alpine originally didn't even build clover
17:43DavidHeidelberg: karolherbst: your previous effort to debug helped me, https://gitlab.freedesktop.org/mesa/mesa/-/issues/8365#note_1792650
17:43DavidHeidelberg: on Alpine it's not llvm 17, but 17.0.3 in the path
17:43DavidHeidelberg: *17.0.6, but doesn't matter :D
17:50Ristovski: mareko: AMD_TEST=testdmaperf on gfx90c (Ryzen 5700G) causes a "no-retry page fault" when it hits "VRAM->VRAM CS x2". Last logged value is always under 4096K (replicated three times), which is extremely low (<100) before it dies. Tested on 6.7.0 up to 6.7.9. testdmaperf log: https://bpa.st/raw/FMBA, page fault: https://bpa.st/raw/74KQ. Do I file this under mesa or drm/amd?
17:51Ristovski: A couple days back I triggered a nearly identical page fault messing around with AMD_pinned_memory
17:51mareko: drm/amd
17:52Ristovski: oh, "under 4096K" as in - it always dies on the 4096K test with CS x2
17:53mareko: also mention that it's a trivial memcpy compute shader, and that it works on other gfx9 chips
17:55Ristovski: Will do. Anything else I can quickly try/debug that might yield useful info?
17:55mareko: that's it
17:56Ristovski: I tried amdgpu.mcbp=0 but apparently it's already disabled on gfx9 in a recent commit (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/gpu/drm/amd/amdgpu?h=v6.7.9&id=d6a57588666301acd9d42d3b00d74240964f07f6)
18:02karolherbst: DavidHeidelberg: yeah.... it was all super fragile and I hope I fixed it inside clc enough that we don't have any bugs anymore
18:04Ristovski: mareko: One more question while I have you here, "fill->VRAM ,CS x64" (nothing under L2p) for example returns 534639 for 131072KB. How are such speeds even possible?
18:04mareko: Ristovski: the L2 cache is faster than memory
18:05Ristovski: Oh so it _is_ using L2 cache? I had assumed those tests are uncached
18:05mareko: yes
18:05Ristovski: That explains it, thanks :)
18:07mareko: actually, it should not be using L2
18:09mareko: but it looks like it's using it
18:11Ristovski: Are file attachments borked on freedesktop gitlab? The button doesn't seem to do anything :P
18:12mareko: the high result is bogus
18:13Ristovski: It sure seems like it - the L2 cache on this APU isn't even that big
18:14Ristovski: Seems like CS x4 is fine, but then x8 and above are bogus above 2048K, idk
18:16mareko: same for Navi31, the test seems buggy
18:17mareko: or rewriting the shader to NIR broke it
18:42karolherbst: mareko: should I fix the rust part of your MR or will you manage?
19:40mareko: karolherbst: no idea how to fix that
19:40karolherbst: mareko: I already posted a patch to the MR
23:05DavidHeidelberg: karolherbst: up to you, I thought that these failing tests are important, if it's nothing serious, then I'm not the person who will use Clover+freedreno :D
23:11karolherbst: looks like those are all image related anyway
23:12karolherbst: might be real freedreno bugs even
23:12karolherbst: like something busted with image arrays?