i965:
   Instruction scheduling needs improvements (bottom-up scheduler) (cwabbott WIP)

i965/vec4:
   Implement global copy propagation (see serious-sam-3/ultra/714.shader_test)

NIR:
   Vectorization in NIR (vec4) (cwabbott WIP)
   Per-channel dead code elimination in NIR (vec4)
   Normalize CMP/ADD since they perform the same operation

Particular problems:

----
witcher-2/524.shader_test:

   Temporary0.x = (ON_AttrBlendWeight0.y + ON_AttrBlendWeight0.x);
   Temporary0.x = (Temporary0.x + ON_AttrBlendWeight0.z);
   Temporary0.x = (Temporary0.x + ON_AttrBlendWeight0.w);

Would be nice as dp4(ON_AttrBlendWeight0.xyzw, 1.0)
----

----
serious-sam-3/medium/1673.shader_test:

   mov(8) g10<1>UD g1<0,4,1>UD
   mov(8) g11<1>UD g1.4<0,4,1>UD
   mad(8) g116<1>.xyF g10<4,4,1>.zwwwF g10<4,4,1>.xyyyF g2<4,4,1>.zwwwF
   mad(8) g116<1>.zwF g11<4,4,1>.zzzwF g11<4,4,1>.xxxyF g2<4,4,1>.zzzwF

By recognizing that this code is effectively doing this:

   g116.xy = g1.0<0,4,1>.zw + g2<4,4,1>.zw * g1.0<0,4,1>.xy
   g116.zw = g1.4<0,4,1>.zw + g2<4,4,1>.zw * g1.4<0,4,1>.xy

We could get rid of all of the mov(8)s and one of the mad(8)s and be left with

   mad(8) g116<1>F g1<4,4,1>F.zw g2<4,4,1>F.zw g1<4,4,1>F.xy
----

----
After register allocation, dst and src of SEL might be the same. Could have
done a predicated MOV, which could dual-issue

   (+f0) sel(8) g47<1>F g47<8,8,1>F g43<8,8,1>F
----

----
Combine into sel:

   mov g3, ...
   (+f0) mov g3, ...
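
The mov pair above could be folded by a small peephole pass. The following is only a sketch on a toy instruction list, not the driver's IR: the `Inst` class and `combine_movs_into_sel` are hypothetical names invented for illustration. It assumes a predicated sel writes src0 where the flag is set and src1 elsewhere, so the predicated mov's source becomes src0.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Inst:
    """Toy instruction (hypothetical, for illustration only)."""
    op: str
    dst: str
    srcs: Tuple[str, ...]
    pred: Optional[str] = None  # e.g. "+f0", or None when unpredicated


def combine_movs_into_sel(insts: List[Inst]) -> List[Inst]:
    """Fold `mov dst, a` followed by `(+f0) mov dst, b` into `(+f0) sel dst, b, a`."""
    out: List[Inst] = []
    i = 0
    while i < len(insts):
        a = insts[i]
        b = insts[i + 1] if i + 1 < len(insts) else None
        if (b is not None
                and a.op == "mov" and a.pred is None
                and b.op == "mov" and b.pred is not None
                and a.dst == b.dst
                # Bail out if the second mov reads the value the first one wrote.
                and a.dst not in b.srcs):
            out.append(Inst("sel", a.dst, (b.srcs[0], a.srcs[0]), b.pred))
            i += 2
        else:
            out.append(a)
            i += 1
    return out


before = [Inst("mov", "g3", ("g10",)), Inst("mov", "g3", ("g11",), "+f0")]
after = combine_movs_into_sel(before)
print(after)
```

A real pass would also need to check that nothing between the two instructions reads the destination or writes the flag; the sketch only handles adjacent instructions.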
----

----
xcom contains shaders that do lots of SIMD8 instructions on scalars:

   (+f0) sel(8) g11<1>F g6<0,1,0>F g6.4<0,1,0>F
   (+f0) sel(8) g12<1>F g6.1<0,1,0>F g6.5<0,1,0>F
   (+f0) sel(8) g13<1>F g6.2<0,1,0>F g6.6<0,1,0>F
   (+f0) sel(8) g22<1>F g6.3<0,1,0>F g6.7<0,1,0>F
   (+f0) sel(8) g24<1>F g7<0,1,0>F g7.4<0,1,0>F
   (+f0) sel(8) g26<1>F g7.1<0,1,0>F g7.5<0,1,0>F
   (+f0) sel(8) g27<1>F g7.2<0,1,0>F g7.6<0,1,0>F
   (+f0) sel(8) g33<1>F g7.3<0,1,0>F g7.7<0,1,0>F

we could optimize these into

   (+f0) sel(4) g11<1>F g6<4,4,1>F g6.4<4,4,1>F
   (+f0) sel(4) g24<1>F g7<4,4,1>F g7.4<4,4,1>F

or even

   (+f0) sel(8) g11<1>F g6<8,4,1>F g6.4<8,4,1>F

and then rewrite future instructions to read individual components out of
g11/g24.

Basically everything this shader does is on uniforms. Huge potential for
improvements.
----

----
witcher-2/676.shader_test (and many others) contains

   (+f0) if(8)
      mov(8) g32<1>F 3.40282e+38F
   else(8)
      math rsq(8) g32<1>F (abs)g31<8,8,1>F
   endif(8)
   math inv(8) g33<1>F g32<8,8,1>F
----

----
serious-sam-3/medium/1924.shader_test does

   cmp.l.f0.0(8) null:D, vgrf186:F, u136:F
   (+f0.0) sel(8) vgrf209:F, u160:F, u156:F
   (+f0.0) sel(8) vgrf210:F, u161:F, u157:F
   (+f0.0) sel(8) vgrf211:F, u162:F, u158:F
   (+f0.0) sel(8) vgrf212:F, u163:F, u159:F
   cmp.l.f0.0(8) null:D, vgrf186:F, u137:F
   (+f0.0) sel(8) vgrf213:F, u164:F, vgrf209:F
   (+f0.0) sel(8) vgrf214:F, u165:F, vgrf210:F
   (+f0.0) sel(8) vgrf215:F, u166:F, vgrf211:F
   (+f0.0) sel(8) vgrf216:F, u167:F, vgrf212:F
   cmp.l.f0.0(8) null:D, vgrf186:F, u138:F
   (+f0.0) sel(8) vgrf217:F, u168:F, vgrf213:F
   (+f0.0) sel(8) vgrf218:F, u169:F, vgrf214:F
   (+f0.0) sel(8) vgrf219:F, u170:F, vgrf215:F
   (+f0.0) sel(8) vgrf220:F, u171:F, vgrf216:F
   cmp.l.f0.0(8) null:D, vgrf186:F, u136:F
   (+f0.0) sel(8) vgrf308:F, u174:F, u172:F
   (+f0.0) sel(8) vgrf221:F, u175:F, u173:F
   cmp.l.f0.0(8) null:D, vgrf186:F, u137:F
   (+f0.0) sel(8) vgrf309:F, u176:F, vgrf308:F
   (+f0.0) sel(8) vgrf222:F, u177:F, vgrf221:F
   cmp.l.f0.0(8) null:D, vgrf186:F, u138:F
   (+f0.0) sel(8) vgrf310:F, u178:F, vgrf309:F
   (+f0.0) sel(8) vgrf223:F, u179:F, vgrf222:F

The comparisons are repeated A, B, C, A, B, C.
----

----
Optimize this:

   cmp.ne.f0(8) g25<1>D g2.5<0,1,0>UD g24<8,8,1>UD
   cmp.ne.f0(8) g27<1>D g2.4<0,1,0>UD g26<8,8,1>UD
   cmp.ne.f0(8) g29<1>D g2.3<0,1,0>UD g28<8,8,1>UD
   cmp.ne.f0(8) g31<1>D g2.2<0,1,0>UD g30<8,8,1>UD
   cmp.ne.f0(8) g20<1>D g5<8,8,1>UD g3.4<0,1,0>UD
   cmp.ne.f0(8) g21<1>D g6<8,8,1>UD g3.5<0,1,0>UD
   cmp.ne.f0(8) g23<1>D g7<8,8,1>UD g3.6<0,1,0>UD
   cmp.ne.f0(8) g24<1>D g8<8,8,1>UD g3.7<0,1,0>UD
   or(8) g22<1>D g20<8,8,1>D g21<8,8,1>D
   or(8) g25<1>D g23<8,8,1>D g24<8,8,1>D
   or(8) g26<1>D g22<8,8,1>D g25<8,8,1>D
   and.ne.f0(8) null g26<8,8,1>D 1D
   (+f0) if(8) 0 0 null 0x00000000UD

into this:

   cmp.ne.f0
   (+f0) cmp.ne.f0
   (+f0) cmp.ne.f0
   (+f0) cmp.ne.f0
   (+f0) if(8)
----

----
Recognize open-coded copysign() (e.g., floor(abs(x) + 0.5) * sign(y)) in
sanctum-2/6830 and many others. Probably just make a GLSL extension that
exposes bfm/bfi and copysign.

7 -> 5 instructions for this common sequence, with better improvements due to
mask reuse for scalar VS. Also doesn't use the flag register.

Unfortunately, while the shader authors likely wanted copysign(x, y), what
they've written isn't equivalent, since sign(0) is 0.
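
To make the non-equivalence concrete, here is a small host-side sketch (not shader code). The helper names `sign`, `open_coded` and `with_copysign` are made up for this illustration; `sign` is written to mirror the GLSL built-in, which returns -1.0, 0.0 or 1.0.

```python
import math


def sign(v: float) -> float:
    """Mirror of GLSL sign(): -1.0, 0.0 or 1.0."""
    if v > 0.0:
        return 1.0
    if v < 0.0:
        return -1.0
    return 0.0


def open_coded(x: float, y: float) -> float:
    """What the shaders actually write: floor(abs(x) + 0.5) * sign(y)."""
    return math.floor(abs(x) + 0.5) * sign(y)


def with_copysign(x: float, y: float) -> float:
    """What the authors likely meant: copysign(floor(abs(x) + 0.5), y)."""
    return math.copysign(math.floor(abs(x) + 0.5), y)


# The two forms agree for any nonzero y ...
print(open_coded(2.4, -1.0), with_copysign(2.4, -1.0))
# ... but differ when y is zero, because sign(0) is 0.
print(open_coded(2.4, 0.0), with_copysign(2.4, 0.0))
```

Any rewrite of the open-coded form to a real copysign would therefore have to prove that y is nonzero, or that the difference at zero cannot be observed.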

   (assign (xyzw) (var_ref compiler_temp)
     (expression ivec4 f2i
       (expression vec4 *
         (expression vec4 floor
           (expression vec4 +
             (expression vec4 abs (swiz yxzw (var_ref Temporary0)))
             (constant vec4 (0.500000))))
         (expression vec4 sign (swiz yxzw (var_ref Temporary0))))))

   add(8) g50<1>F (abs)g34<4,4,1>.yxzwF 0.5F { align16 1Q };
   rndd(8) g49<1>F g50<4,4,1>F { align16 1Q };
   cmp.nz.f0(8) null g34<4,4,1>.yxzwF 0F { align16 1Q switch };
   and(8) g53<1>UD g34<4,4,1>.yxzwUD 0x80000000UD { align16 1Q };
   (+f0) or(8) g53<1>UD g53<4,4,1>UD 0x3f800000UD { align16 1Q };
   mul(8) g48<1>F g49<4,4,1>F g53<4,4,1>F { align16 1Q compacted };
   mov(8) g46<1>D g48<4,4,1>F { align16 1Q };

   (assign (xyzw) (var_ref compiler_temp)
     (expression ivec4 f2i
       (expression vec4 copysign
         (expression vec4 floor
           (expression vec4 +
             (expression vec4 abs (swiz yxzw (var_ref Temporary0)))
             (constant vec4 (0.500000))))
         (swiz yxzw (var_ref Temporary0)))))

   add(8) g50<1>F (abs)g34<4,4,1>.yxzwF 0.5F { align16 1Q };
   rndd(8) g49<1>F g50<4,4,1>F { align16 1Q };
   mov(8) mask
   bfi2(8) g53<1>D mask g49<4,4,1>D g34<4,4,1>.yxzwD { align16 1Q };
----

----
The pattern we generate for gl_FrontFacing is

   or(8) g39.1<2>W -g0<0,1,0>W 0x3f80UW { align1 1Q };
   and(8) g55<1>D g39<8,8,1>D 0xbf800000UD { align1 1Q };

which we can't CSE, because the OR is a partial write. If we wrapped the pair
of instructions in a virtual opcode and lowered it later, we could recognize
it and CSE it.
----

----
natural-selection-2/8914 (and many others) do:

   mov(8) g98<1>.xywF [0F, 0F, 0F, 0F]VF
   mov(8) g98<1>.zF g79<4,4,1>.xF

we could do this in one instruction:

   mul(8) g98<1>F g79<4,4,1>.xF [0F, 0F, 1F, 0F]VF
----

----
Add IR for transpose() and recognize that it can be removed by reordering
multiplications.
----
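
The transpose() item relies on the identity transpose(M) * v == v * M, where the right-hand side treats v as a row vector. The following sketch checks that identity on plain Python lists. The helper names `transpose`, `mat_vec` and `vec_mat` are illustrative only and do not correspond to compiler functions; matrices are stored as lists of rows here, which is an assumption of the sketch and not a statement about GLSL storage order.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Matrix times column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vec_mat(v: Vector, m: Matrix) -> Vector:
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]


m = [[1.0, 2.0], [3.0, 4.0]]
v = [5.0, 6.0]

# Multiplying by the transposed matrix equals swapping the operand order,
# so the explicit transpose never has to be materialized.
print(mat_vec(transpose(m), v), vec_mat(v, m))
```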