10:30 mardination: I am a newbie at GLSL math, but I have suggestions. On SM2.0 we remember the register file being 4*64*32, at least on GMA 950-era chips.
10:36 mardination: it can accommodate eight 1024-entry queues, with each entry being 32 bits wide
10:37 mardination: because 4*64*32 = 8192 and 8192/1024 = 8
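The arithmetic in the three messages above can be checked with a short sketch. It assumes that 4*64*32 counts 32-bit entries rather than bits, which the log does not state; the constant names are made up for illustration.

```python
# Assumption: 4 * 64 * 32 is a count of 32-bit register entries (not bits).
REGFILE_ENTRIES = 4 * 64 * 32   # total entries in the register file
QUEUE_ENTRIES = 1024            # entries per queue
ENTRY_BITS = 32                 # width of one entry

num_queues = REGFILE_ENTRIES // QUEUE_ENTRIES
total_bits = REGFILE_ENTRIES * ENTRY_BITS

print(REGFILE_ENTRIES, num_queues, total_bits)
```

Under that assumption the file splits evenly into eight queues; if the figure were a bit count instead, it would only hold 256 entries of 32 bits.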
10:47 mardination: you organise the registers with opcodes of that magnitude, as if it were an 8-CU APU
10:49 mardination: meanwhile the instruction cache is organised with 20 different ARB instructions in a hardware-dependent way
10:52 mardination: and also 64 texture instructions
10:58 mardination: the remaining alu slots are filled with random legal ALU instructions
11:00 mardination: it's best to put CMP or other conditional instructions there
13:01 mardination: mom was rambling here. So it appears 16 async CMPs can be done in parallel; to find out which reg it was, 4 iterations should be run
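One possible reading of the "4 iterations" figure is that 16 candidates are halved each round, since log2(16) = 4. The sketch below illustrates only that reading; `find_matching_register` and `hit_in_group` are hypothetical names, not anything from real hardware or a driver.

```python
def find_matching_register(hit_in_group, num_regs=16):
    """Narrow num_regs candidates down to one by halving the range.

    hit_in_group(lo, hi) is a hypothetical predicate that reports whether
    the matching register index lies in the half-open range [lo, hi).
    Returns (register index, number of iterations used).
    """
    lo, hi = 0, num_regs
    iterations = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        iterations += 1
        if hit_in_group(lo, mid):
            hi = mid   # match is in the lower half
        else:
            lo = mid   # match is in the upper half
    return lo, iterations


target = 11  # arbitrary example index
index, iterations = find_matching_register(lambda a, b: a <= target < b)
print(index, iterations)
```

With 16 candidates the range shrinks 16, 8, 4, 2, 1, so the loop always runs four times regardless of which register matched.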
15:09 mardination: https://stackoverflow.com/questions/35937598/how-is-3d-texture-memory-cached?rq=1
15:11 mardination: zero cost for out-of-bounds sampling: indeed he is right, it should in fact be very fast
15:47 mardination: out-of-bounds texture accesses schedule only texture instructions; all ALUs will get stuck in the scoreboard and hence be ignored, since valid_dest evaluates to false
15:50 mardination: the moment you want to schedule an ALU for a certain sought wfid, you need to ignore writes via the LSU, e.g. with an out-of-bounds destination register number, texture selector, or mask enabled