06:09 sigod: imirkin, I'd like to help write some free firmware for nouveau
06:10 sigod: where do i start lol
06:10 sigod: i don't know how to program
12:16 mardikene: you only need to look at MIAOW: the VALU is 16 rows multiplexed to 64 serially. Verilog has hierarchical design, which means PS_alu_issue_flops.v, as seen, has 64x32 opcode flops of 32-bit 4-vectors that are serialized with continuous assignment over 4 clock cycles to 64 ALU instances whose operands come from the VGPRs; it only cares about which operands have changed, and provided the opcode is the same it can interleave 32 pointer operations
12:21 mardikene: err, 64 of those for float, 32 for double
12:24 mardikene: hence if you feed in several pixel vectors, the performance is multiplied by how many you run in parallel
12:34 mardikene: so when David Tarjan got only an 8-fold improvement with memory operations alone, hiding div and the other stuff will get you a 32x32 perf boost, around a thousand times, when you start new pixels from the freed-up register fragments
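A rough C sketch of the serialization described above, assuming a 64-wide wavefront issued onto a 16-lane VALU in 4 passes; the names WAVE_WIDTH, VALU_LANES, vgpr_a, vgpr_b and valu_issue are made up for illustration and are not taken from MIAOW's RTL:

#include <stdint.h>

#define WAVE_WIDTH 64                         /* work-items per wavefront        */
#define VALU_LANES 16                         /* physical ALU lanes per SIMD row */
#define PASSES     (WAVE_WIDTH / VALU_LANES)  /* 64 / 16 = 4 cycles              */

/* One pass per cycle: the same opcode is applied to a fresh 16-lane
 * slice of the wavefront's operands on each of the 4 cycles. */
void valu_issue(const uint32_t vgpr_a[WAVE_WIDTH],
                const uint32_t vgpr_b[WAVE_WIDTH],
                uint32_t result[WAVE_WIDTH])
{
    for (int pass = 0; pass < PASSES; pass++) {
        for (int lane = 0; lane < VALU_LANES; lane++) {
            int i = pass * VALU_LANES + lane;
            result[i] = vgpr_a[i] + vgpr_b[i];  /* e.g. a V_ADD_U32-style op */
        }
    }
}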
12:58 mardikene: it could do more, but the LSU width is what it is
13:14 mardikene: the LSU width (the LD/ST unit in NVIDIA terms) is designed correctly, because usually you do not have that much ILP to run arrays from freed-up regs
13:24 mardikene: depending also on which way you view it, it must be proportional to the number of regs, which is 256 per SIMD on most; due to the reads and writebacks it is pointless for the LSU width to be larger than it is
13:27 mardikene: in other words, if you go from 256 to 512 (I don't know, maybe Pascal or Volta has such a formation), then the LSU width is lifted up too
13:36 mardikene: as seen from the comments, for GCN it is 64 in MIAOW; I can't comment much further on this, but it is commented in the source code, per CU not per SIMD. The GCN docs say a 16-dword vector load, so 256 regs in four cycles in the case of swizzling: 16x4x4, where the first 4 is the vector count and the second 4 is the row count, or the number of SIMDs, whichever way you want to sniff it again
13:36 mardikene: so in the case of swizzling one can also run 256 VALU instances in four cycles
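A worked version of that arithmetic, under the reading that each of the 4 SIMDs loads 16 dwords per cycle over 4 cycles; the figures come from the messages above, and the variable names are made up:

#include <assert.h>

int main(void)
{
    const int dwords_per_cycle = 16;  /* "16dw load of vectors" per SIMD */
    const int cycles           = 4;   /* one wavefront pass per cycle    */
    const int simds_per_cu     = 4;   /* SIMD rows per CU                */

    /* 16 x 4 x 4 = 256 registers touched in four cycles with swizzling */
    assert(dwords_per_cycle * cycles * simds_per_cu == 256);
    return 0;
}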
13:45 mardikene: now, I also have the algorithm, but before we can talk about it I need to check the databases for the possible presence of intellectual property
13:47 mardikene: this is why I hired a lawyer, but she needs to hire some expert who actually understands what I am talking about, because my lawyer has neither belief in nor understanding of what I talk about
18:43 mardikene: the reason why I think mannerov and others were confused: a single 2047-bit vector execution does not take 4 cycles but far fewer when issued with pointers; it takes 4 cycles with fetch & decode in the ideal runahead situation, and this is where I messed up once back then too
18:45 mardikene: it takes about 1 cycle on 16 rows with 16 such vectors; that's why the bit-parallel operators link was shown back then by me, and some papers
18:46 mardikene: but it can't go much more than 1000-fold with scheduling alone, because with 64 pixels added the occupancy is finally full
18:48 mardikene: it is not even that the occupancy is full then, but that there are not enough regs freed to do that, since you could theoretically also use another CU to run another async pixel array
19:15 mardikene: well yeah, on toy shaders one can go far beyond 1000-fold, like a 10k-fold improvement, but the real complex shading that is the default in today's games won't go much above a 1000-fold improvement on average, fluctuating a little here and there, but this is how I think
19:26 mardikene: so yeah, when Moore's law ends, and then after precompiler+scheduling theory is also added, further topping up the performance is possible only on small form factors by raising the voltage, i.e. doping the die more heavily with electrons or, I dunno, stuff like that
19:26 mardikene: so it will accelerate further before becoming dangerous to life
19:35 mardikene: to ramble and jabber endlessly about what XMAD or whatever does, whether it is twice as fast as MUL or not, is small change, just about an idiotic thing to do, instead of implementing the real stuff I talked about
19:39 mardikene: I've recently spotted that Kayden definitely knows how to do that stuff, whereas some AMD guys already knew, though I mostly explained it to them too; well, Kayden is quite intelligent, I have known that before too, but he follows the pattern of the others in violating me
22:18 gnarface: so I've now observed that several USB camera devices I have seem to adjust their "aperture" (as it's called in the UI, but it may just be software brightness for all I know) sometime after you start recording
22:18 gnarface: is there a way to take a snapshot *after* it's done doing that?
22:19 gnarface: something like this works, but always takes a pre-adjusted image: /usr/bin/ffmpeg -f video4linux2 -pix_fmt yuyv422 -s 1280x960 -i /dev/video0 -c:v rawvideo -f rawvideo -vframes 1 [outputfile]
22:20 imirkin: wrong channel?
22:20 gnarface: heh, yes, sorry
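One possible workaround for the v4l2 question above, assuming the auto-adjustment settles within a few seconds: decode and drop the first frames with ffmpeg's select filter before grabbing the snapshot. The 90-frame threshold (roughly 3 seconds at 30 fps) is a guess and may need tuning:

/usr/bin/ffmpeg -f video4linux2 -pix_fmt yuyv422 -s 1280x960 -i /dev/video0 -vf "select=gte(n\,90)" -c:v rawvideo -f rawvideo -vframes 1 [outputfile]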