16:59 RSpliet: mwk: for Kepler/Maxwell, do you know what kind of addressing modes are supported by the ld/st units?
17:06 RSpliet: reg->mem, mem->reg, direct, indirect (with absolute offsets or array indices in reg?)
18:32 interestedDude: RSpliet: a good question again. i don't have any details, but i believe your self-answered version was consistent with the documentation, i.e. correct
18:33 interestedDude: RSpliet: and also, i think you are digging in the right directions most of the time; this load and store is incidentally another way to do instruction pipelining
18:35 interestedDude: RSpliet: but as i was saying, it needs to indirectly point the program counter at values fetched from memory, i.e. the underlying instructions, against 8 different mask registers
18:39 interestedDude: RSpliet: and from reading your thesis, and from talking to you a couple of times and seeing where you are aiming or looking, i am quite convinced that you are one of the people who actually understands how things work
18:44 interestedDude: RSpliet: most of the time the answer to how the alus are connected is buried in hw, but they are wired to the lanes in such a way that latency, or let's say throughput, basically reveals how they are connected
18:48 interestedDude: so if the latency is higher then usual for the same what i call data size or datapath, that how the alus or wired to backing lanes, means that alu can not proccess that data size of input operator width fully in parallel bit by bit
18:49 interestedDude: hence you either give a latency sched code or active or lane mask to unidle those lanes
18:51 mwk: RSpliet: no idea, sorry
18:53 interestedDude: it is possible to do both implementations, but the sched code would use every 8th of the cache line, and i am not sure whether they account for adjusting the latency on a cache miss
18:54 interestedDude: latency/sched code or mask in that case
18:59 interestedDude: RSpliet: and you should not stress too much about not knowing, at the transistor-logic level, exactly how the alus work on nvidia, because there are many open source kits where that logic is generated and can be inspected and replicated etc.
19:02 interestedDude: i have not managed to play my part during this nouveau programming run for many reasons, but once we agree that this performance is indeed needed, we can shake hands and just do it
19:08 interestedDude: same goes to multithreading allthough there are workarounds based of kernel isolation methods, once we agree to work only bit on those bits, it will be fixed for atomic cases too
19:08 interestedDude: those are the two things i have little bit specialized on, i don't have the clear view about the rest of the bugs though
19:11 interestedDude: i saw mwk documentation had all the bits how to fix those things back time, it's i am unsure why hasn't it been done, but surely it's not as complex as i've heard others to state it would be
19:21 interestedDude: but if i am sure, yeah sure, during last couple years , i've broken the bad barrier sacrificed a bit of my time, and generally read maybe even thousound or more pdf's about gpus, in some sense i have been able sum up all the details how this hw works
19:21 interestedDude: sort of conclude based of the info so to speak
19:26 interestedDude: yeah and hence by looking into tp&latency info i can give the correct masks, if they are not correct then it can be dested with perf counters or and also laneread/writes and masks
19:30 interestedDude: motivation is what i'd most want to do is a resolution scaler at good perf level which would be possible after doing those things on decent gpu
19:31 interestedDude: based of msaa probably, then one can have a slight weaker displays obviously or display controllers for native resos
19:36 interestedDude: all my monitors including laptop ones i saved money and did not put an extra bucks on the table because of larger native resolution, i think apu from amd may not be that strong to carry this scaler out, not sure, but a decent gpu would with no bubbles in the instruction pipelines
19:44 interestedDude: such a scaler can be programmed for Xserver that uses something based of glamorEGL, for pure kmscons maybe, for wayland and stuff like that, a more capable inch perfect scaler then randr's cpu one
19:46 interestedDude: i think it can be done on X's core primitives and or Xrender or some extension like this, if one would wish to waste time, they have some support for aa
19:47 interestedDude: myself i would choose the easier path
19:50 interestedDude: sure glamor is bit more resource needing at the moment then the main DDX, but as i said, it can be maintened so with a single precompiler for all the cards, so the user cooks up the blob and links it to X or wayland
19:51 interestedDude: doing so glamor as a hw blob, would allready be probably faster then current main DDX
19:51 interestedDude: and use less amount of resources and whatever the worries were
20:06 interestedDude: RSpliet: ld/st should additionally work on what AMD calls LDS -- local data share -- and what CUDA/NVIDIA call shared memory (CUDA's "local memory" is something else: per-thread spill space)
20:07 interestedDude: it has the ability at least on AMD to work on range of independant data and make operation based of this data by mixing them with two data regs for instance
20:08 interestedDude: then all the items would generate unique memory query and no bank conflicts , and this would be very fast, and also every data element in the range is processed in parallel
20:10 interestedDude: it is sort of convenient way to make a parallel loop without unrolling it and wasting a cache
20:23 interestedDude: i honestly need to leave now, i am occupied with pointless stuff, where those leftovers still think of how to violate and think new regulations against me, i am meeting my advocate tomorrow morning and this is quite important i as i said put all the ones involved into troubles hopefully, vast penalty and war declared
20:23 interestedDude: bye