==== How Conditional execution works ====

Valid Mask:
  + 1 bit used by KILL instructions
  + Once cleared, it cannot be restored.

Per-Pixel State (Active Mask):
  + 2 bits that reflect a pixel's active status
  + Can be one of the following values:
    - Active
    - Inactive-Branch
    - Inactive-Continue
    - Inactive-Break
  + Branch-Loop instructions can push per-pixel state onto the stack
  + Changes to the Active Mask from the ALU (PRED_SET_* instructions) take
    effect at the beginning of the next CF instruction.

CF Instructions allow for conditional execution by:
  + Performing a condition test for each pixel based on current processor state
  + The CF instruction will do one of these things using the condition test
    result:
    1. - Determine which pixels should execute the current instruction
       - Leave per-pixel state unmodified.
    2. - Modify the per-pixel state
         * Pixels that pass are put in the active state
         * Pixels that fail are put into one of the inactive states
    3. - If at least one pixel passes, push current per-pixel state onto
         the stack.
       - Modify the per-pixel state
         * If all pixels fail, jump to a new location
       - Some instructions can pop the stack multiple times and change the
         per-pixel state to the result of the last pop
  + Besides performing a condition test, CF instructions can pop per-pixel
    state from the stack, then perform a condition test based on the new
    state and update the per-pixel state again based on the results of the
    test.

COND field:
  + CF_COND_ACTIVE
    - Pixel is active
  + CF_COND_BOOL
    - CF_CONST bool = 1
  + CF_COND_NOT_BOOL
    - CF_CONST bool = 0

Condition tests:
  + CF_COND_ACTIVE
    - Default: True if and only if the pixel is Active
    - WHOLE_QUAD_MODE: True if and only if the quad contains an active pixel
    - VALID_PIXEL_MODE: True if and only if the pixel is both active and valid

For CF instructions that do not unconditionally pop the stack, here is how the
per-pixel state can be updated:
  1. Evaluate the condition test for each pixel using current state, COND,
     WHOLE_QUAD_MODE and VALID_PIXEL_MODE
  2. Execute the CF instruction for pixels passing the test
  3. If the CF is a PUSH, push the per-pixel active state onto the stack
     before updating the state
  4. If the CF instruction updates the per-pixel state, update the per-pixel
     state using the results of the condition test.

Instructions that unconditionally pop the stack work this way:
  1. Pop the per-pixel state from the stack (0 or more times)
  2. Evaluate the condition test using new state, COND, WHOLE_QUAD_MODE, and
     VALID_PIXEL_MODE
  3. Update the per-pixel state again using results of the condition test

CF Stack
  + Number of available stack entries is controlled by a register or hardware
    implementation
  + The minimum number of entries is determined by the deepest control-flow
    instruction
  + Each entry contains subentries
    - Subentries are limited based on the physical thread-group width
  + Stack size <= Full Entries + (subentries / (subentries per entry))

Control Flow Clauses
  + ADDR
    - Address of the start of the clause
  + JTS
    - Only used in CF_INST_JUMPTABLE
  + Pop Count
    - Only used by some CF instructions
  + CF_CONST
    - Only used by CF instructions
  + COND
    - Used by CF instructions
    - Set to ACTIVE for GDS
  + COUNT
    - Number of instructions in the clause minus 1
  + GDS
    + Evergreen: 64 KB that is shared across all work items in a kernel
    + This is *not* the same as OpenCL global address space.

Writing to memory
  + CF_ALLOC_EXPORT_DWORD1_BUF
  + MEM_EXPORT
  + Two types of exports
    - EXPORT_WRITE
    - EXPORT_WRITE_IND
      + Write to buffer offset supplied by INDEX_GPR

Writing to UAV (RAT: Random Access Target)
  + CF Instructions
    - MEM_RAT
      + Apply RAT_INST to data
    - MEM_RAT_CACHELESS
      + RAT_INST must be RAT_STORE or RAT_READ (Does RAT_READ exist?)
    - MEM_RAT_COMBINED_CACHELESS

CF_ALLOC_EXPORT_WORD0_RAT
  + RAT_WRITE
    - RAT_ID (UAV)
    - RAT_INDEX_M0 (0 for now)
    - TYPE
      * 0 For writes
      * 1 For indexed writes
      * 2 For writes with an ACK. How do we receive the ACK?
    - RW_GPR
      * Data to write
    - RW_REL
      * 1 If relative to the aL value
    - INDEX_GPR
      * GPR with the coordinates to write within the UAV
    - ELEM_SIZE
      * Number of dwords per element in the array.
      * INDEX_GPR is multiplied by this factor
    - ARRAY_SIZE
      * dword stride (Probably 0)
    - COMP_MASK
      * Set to 1 for store
    - BURST_COUNT
      * 0 I think
    - VPM
      * 0
    - EOP
      * 0 usually
    - MARK
      * I think 0
    - BARRIER
      * 1 until optimizations
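
Sketch: per-pixel state update for CF instructions

Going back to the conditional execution rules at the top of these notes, the four update steps for CF instructions that do not unconditionally pop the stack can be modelled in a few lines of Python. This is only a rough sketch of the notes, not a description of what the hardware does. The names State, cond_active, run_cf and the mode strings are made up for illustration. It assumes a quad is four consecutive pixels in the list, that an active pixel failing the test goes to Inactive-Branch (the notes only say "one of the inactive states"), and that a pixel that is already inactive and fails keeps its state. Only the CF_COND_ACTIVE test is modelled.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "Active"
    INACTIVE_BRANCH = "Inactive-Branch"
    INACTIVE_CONTINUE = "Inactive-Continue"
    INACTIVE_BREAK = "Inactive-Break"


QUAD_SIZE = 4  # assumption: a quad is four consecutive entries in the list


def cond_active(states, valid, mode="default"):
    """CF_COND_ACTIVE test per pixel; mode names are hypothetical."""
    active = [s is State.ACTIVE for s in states]
    if mode == "whole_quad":
        # True iff the pixel's quad contains at least one active pixel.
        result = []
        for i in range(len(active)):
            start = (i // QUAD_SIZE) * QUAD_SIZE
            result.append(any(active[start:start + QUAD_SIZE]))
        return result
    if mode == "valid_pixel":
        # True iff the pixel is both active and valid.
        return [a and v for a, v in zip(active, valid)]
    # Default: true iff the pixel is Active.
    return active


def run_cf(states, valid, stack, mode="default",
           is_push=False, updates_state=False):
    """Return (pixels that execute, new per-pixel state)."""
    # Step 1: evaluate the condition test for every pixel.
    passed = cond_active(states, valid, mode)
    # Step 2: the pixels in `passed` are the ones that execute.
    # Step 3: a PUSH saves the state before it is updated.
    if is_push:
        stack.append(list(states))
    # Step 4: update the per-pixel state from the test result.
    if not updates_state:
        return passed, list(states)
    new_states = []
    for state, ok in zip(states, passed):
        if ok:
            new_states.append(State.ACTIVE)
        elif state is State.ACTIVE:
            # Assumption: failing active pixels become Inactive-Branch.
            new_states.append(State.INACTIVE_BRANCH)
        else:
            # Assumption: already-inactive pixels keep their state.
            new_states.append(state)
    return passed, new_states
```

Note that with the "whole_quad" mode an inactive pixel sharing a quad with an active one passes the test, so this sketch puts it in the active state. That is a literal reading of "pixels that pass are put in the active state" and has not been checked against hardware.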