13:31 hakzsam: mareko: pepp: how do you usually debug gpu hangs with radeonsi these days?
14:50 haasn: is there any reason to prefer any of `x + subgroupQuadSwapHorizontal(x)`, `subgroupClusteredAdd(x, 2)` or `x + subgroupShuffleXor(x, 1)`?
14:55 pendingchaos: subgroupClusteredAdd works with inactive lanes, but is probably slightly slower if you know that's not a problem
14:55 pendingchaos: the first and third are basically the same
14:56 pendingchaos: subgroupQuadSwapHorizontal enables helper invocations, I don't think subgroupShuffleXor does
14:56 pendingchaos: actually, it seems both do
14:58 haasn: "works with inactive lanes" you mean I can call subgroupClusteredAdd from a diverging branch?
14:58 haasn: and it will add values from the non-active branches?
14:59 pendingchaos: I mean if one of the invocations is inactive
14:59 pendingchaos: a divergent branch can be one cause of that
14:59 pendingchaos: subgroupClusteredAdd doesn't add values from inactive invocations, it will just return x
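[editor's note] The difference pendingchaos describes can be sketched with a toy CPU model of a subgroup. This is not GLSL and not how any driver implements these built-ins; the function names only mirror the GLSL ones, and `UNDEFINED`, `add_partner` and the list-of-lanes representation are assumptions made for illustration. It assumes that reading from an inactive invocation yields an undefined value, and that the quad horizontal swap pairs lanes 0<->1 and 2<->3, which is the same pairing as an XOR with 1.

```python
# Toy model: `values[i]` is x in invocation i, `active[i]` says whether that
# invocation is currently executing (e.g. it took the divergent branch).

UNDEFINED = None  # stand-in for "undefined value", an assumption of this sketch


def shuffle_xor(values, active, mask):
    """Model of subgroupShuffleXor: lane i reads x from lane i ^ mask."""
    result = []
    for lane, is_active in enumerate(active):
        partner = lane ^ mask
        if is_active and partner < len(values) and active[partner]:
            result.append(values[partner])
        else:
            # Inactive reader, or the partner lane is inactive.
            result.append(UNDEFINED)
    return result


def quad_swap_horizontal(values, active):
    """Model of subgroupQuadSwapHorizontal: same lane pairing as XOR with 1."""
    return shuffle_xor(values, active, 1)


def add_partner(values, partner_values):
    """Model of `x + <swapped x>`; undefined input gives undefined output."""
    return [
        UNDEFINED if p is UNDEFINED else x + p
        for x, p in zip(values, partner_values)
    ]


def clustered_add(values, active, cluster_size):
    """Model of subgroupClusteredAdd: sum over active lanes in each cluster."""
    result = []
    for lane, is_active in enumerate(active):
        if not is_active:
            result.append(UNDEFINED)
            continue
        start = lane - lane % cluster_size
        members = range(start, min(start + cluster_size, len(values)))
        # Inactive members are skipped, so a lone active lane just gets x.
        result.append(sum(values[m] for m in members if active[m]))
    return result
```

With every lane active, all three forms give the same pairwise sums in this model. Once one lane of a pair is inactive, the clustered add still returns a defined value (just x) for the remaining lane, while the two swap-based forms read an undefined value.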
15:23 mareko: hakzsam: I think I haven't debugged any hangs for a very long time
15:24 mareko: with existing chips that is
15:25 mareko: debugging hangs with new chips is different because we can just use very trivial tests
15:26 mareko: you need the value of GRBM_STATUS to see which hw block is stuck, or use umr to print active waves, and then you take it from there
22:02 mindnotatall: It's technically queue jumping because the modulo (or modulus) of the numbers is less than 512; the sum of all the numbers, taken modulo or divided by the number of entries (for example 20 banks, so 20 values each less than 512), would be sum mod 20 or sum div 20.