i965:
	Instruction scheduling needs improvements (bottom-up scheduler) (cwabbott WIP)

i965/vec4:
	Implement global copy propagation (see serious-sam-3/ultra/714.shader_test)

NIR:
	Vectorization in NIR (vec4) (cwabbott WIP)
	Per-channel dead code elimination in NIR (vec4)

Normalize CMP/ADD since they perform the same operation

Particular problems:

---- witcher-2/524.shader_test:
Temporary0.x = (ON_AttrBlendWeight0.y + ON_AttrBlendWeight0.x);
Temporary0.x = (Temporary0.x + ON_AttrBlendWeight0.z);
Temporary0.x = (Temporary0.x + ON_AttrBlendWeight0.w);

Would be nice as Temporary0.x = dp4(ON_AttrBlendWeight0.xyzw, 1.0)
----
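
The rewrite above relies on a horizontal add of four components being a dot
product with an all-ones vector. A minimal Python check of that identity;
dp4() and chained_adds() are hypothetical helpers that model the instructions,
not Mesa functions, and the inputs are chosen to be exactly representable so
that summation order does not matter.

```python
def dp4(a, b):
    """Model a 4-component dot product (hypothetical helper, not a Mesa API)."""
    return sum(x * y for x, y in zip(a, b))


def chained_adds(w):
    """The three dependent ADDs from the shader: ((y + x) + z) + w."""
    t = w[1] + w[0]
    t = t + w[2]
    t = t + w[3]
    return t


# Exactly representable weights, so both orders of summation agree bit for bit.
weights = (0.5, 0.25, 0.125, 0.125)
ones = (1.0, 1.0, 1.0, 1.0)
assert chained_adds(weights) == dp4(weights, ones)
```

With arbitrary floats the two forms may round differently, so a real pass
would have to decide whether that reassociation is acceptable.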

---- serious-sam-3/medium/1673.shader_test:
mov(8)          g10<1>UD        g1<0,4,1>UD
mov(8)          g11<1>UD        g1.4<0,4,1>UD
mad(8)          g116<1>.xyF     g10<4,4,1>.zwwwF g10<4,4,1>.xyyyF g2<4,4,1>.zwwwF
mad(8)          g116<1>.zwF     g11<4,4,1>.zzzwF g11<4,4,1>.xxxyF g2<4,4,1>.zzzwF

By recognizing that this code is effectively doing this:

g116.xy = g1.0<0,4,1>.zw + g2<4,4,1>.zw * g1.0<0,4,1>.xy
g116.zw = g1.4<0,4,1>.zw + g2<4,4,1>.zw * g1.4<0,4,1>.xy

We could get rid of all of the mov(8)s and one of the mad(8)s and be left
with

mad(8)          g116<1>F     g1<4,4,1>F.zw  g2<4,4,1>F.zw  g1<4,4,1>F.xy
----

----
After register allocation, dst and src of SEL might be the same.
Could have done a predicated MOV, which could dual-issue
(+f0) sel(8)    g47<1>F         g47<8,8,1>F     g43<8,8,1>F
----

----
Combine into sel:
	mov g3, ...
  (+f0) mov g3, ...
----
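
A small Python model of why the pair above can become one SEL. It assumes the
first MOV is unpredicated and writes every channel; the function names are
hypothetical and only model per-channel behaviour.

```python
def mov_then_predicated_mov(flags, a, b):
    """mov g3, a   followed by   (+f0) mov g3, b"""
    dst = list(a)                    # unpredicated MOV writes all channels
    for ch, f in enumerate(flags):   # predicated MOV only touches enabled channels
        if f:
            dst[ch] = b[ch]
    return dst


def sel(flags, src0, src1):
    """(+f0) sel dst, src0, src1: src0 where the flag is set, else src1."""
    return [s0 if f else s1 for f, s0, s1 in zip(flags, src0, src1)]


flags = [True, False, True, False]
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
# The predicated source becomes src0 of the SEL, the unconditional one src1.
assert mov_then_predicated_mov(flags, a, b) == sel(flags, b, a)
```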

----
xcom contains shaders that do lots of SIMD8 instructions on scalars:
(+f0) sel(8)    g11<1>F         g6<0,1,0>F      g6.4<0,1,0>F
(+f0) sel(8)    g12<1>F         g6.1<0,1,0>F    g6.5<0,1,0>F
(+f0) sel(8)    g13<1>F         g6.2<0,1,0>F    g6.6<0,1,0>F
(+f0) sel(8)    g22<1>F         g6.3<0,1,0>F    g6.7<0,1,0>F
(+f0) sel(8)    g24<1>F         g7<0,1,0>F      g7.4<0,1,0>F
(+f0) sel(8)    g26<1>F         g7.1<0,1,0>F    g7.5<0,1,0>F
(+f0) sel(8)    g27<1>F         g7.2<0,1,0>F    g7.6<0,1,0>F
(+f0) sel(8)    g33<1>F         g7.3<0,1,0>F    g7.7<0,1,0>F

we could optimize these into

(+f0) sel(4)    g11<1>F         g6<4,4,1>F      g6.4<4,4,1>F
(+f0) sel(4)    g24<1>F         g7<4,4,1>F      g7.4<4,4,1>F

or even

(+f0) sel(8)    g11<1>F         g6<8,4,1>F      g6.4<8,4,1>F

and then rewrite future instructions to read individual components out of g11/g24.
Basically everything this shader does is on uniforms. Huge potential for improvements.
----
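
To see that the single sel(8) with <8,4,1> regions reads the same data as the
eight scalar sels, here is a rough Python model of region addressing. It is a
sketch under assumptions: the register file is modelled as a flat list of
floats (eight per register), the flag is taken to be uniform across channels,
and read_region() is a hypothetical helper, not part of any real tool.

```python
GRF_FLOATS = 8  # eight 32-bit floats per register in this model


def read_region(grf, reg, subreg, vstride, width, hstride, exec_size):
    """Gather exec_size elements from reg.subreg<vstride,width,hstride>."""
    base = reg * GRF_FLOATS + subreg
    out = []
    for ch in range(exec_size):
        row, col = divmod(ch, width)
        out.append(grf[base + row * vstride + col * hstride])
    return out


def sel(flag, src0, src1):
    """(+f0) sel with a uniform flag (an assumption of this sketch)."""
    return list(src0) if flag else list(src1)


def scalar_sels(grf, flag):
    """The eight original sel(8)s on <0,1,0> scalars, one result value each."""
    results = []
    for reg in (6, 7):
        for sub in range(4):
            s0 = read_region(grf, reg, sub, 0, 1, 0, 8)
            s1 = read_region(grf, reg, sub + 4, 0, 1, 0, 8)
            results.append(sel(flag, s0, s1)[0])
    return results


def merged_sel(grf, flag):
    """The single sel(8) g11, g6<8,4,1>, g6.4<8,4,1>."""
    return sel(flag,
               read_region(grf, 6, 0, 8, 4, 1, 8),
               read_region(grf, 6, 4, 8, 4, 1, 8))


# Give every element a distinct value so a wrong address would be visible.
grf = [float(i) for i in range(16 * GRF_FLOATS)]
for flag in (True, False):
    assert merged_sel(grf, flag) == scalar_sels(grf, flag)
```

The merged result packs the eight values into one register, which is why later
instructions would need to be rewritten to read individual components.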

---- witcher-2/676.shader_test (and many others) contains
(+f0) if(8)
mov(8)          g32<1>F         3.40282e+38F
else(8)
math rsq(8)     g32<1>F         (abs)g31<8,8,1>F
endif(8)
math inv(8)     g33<1>F         g32<8,8,1>F
----

---- serious-sam-3/medium/1924.shader_test does
cmp.l.f0.0(8) null:D, vgrf186:F, u136:F 
(+f0.0) sel(8) vgrf209:F, u160:F, u156:F
(+f0.0) sel(8) vgrf210:F, u161:F, u157:F
(+f0.0) sel(8) vgrf211:F, u162:F, u158:F
(+f0.0) sel(8) vgrf212:F, u163:F, u159:F
cmp.l.f0.0(8) null:D, vgrf186:F, u137:F
(+f0.0) sel(8) vgrf213:F, u164:F, vgrf209:F
(+f0.0) sel(8) vgrf214:F, u165:F, vgrf210:F
(+f0.0) sel(8) vgrf215:F, u166:F, vgrf211:F
(+f0.0) sel(8) vgrf216:F, u167:F, vgrf212:F
cmp.l.f0.0(8) null:D, vgrf186:F, u138:F
(+f0.0) sel(8) vgrf217:F, u168:F, vgrf213:F
(+f0.0) sel(8) vgrf218:F, u169:F, vgrf214:F
(+f0.0) sel(8) vgrf219:F, u170:F, vgrf215:F
(+f0.0) sel(8) vgrf220:F, u171:F, vgrf216:F
cmp.l.f0.0(8) null:D, vgrf186:F, u136:F
(+f0.0) sel(8) vgrf308:F, u174:F, u172:F
(+f0.0) sel(8) vgrf221:F, u175:F, u173:F
cmp.l.f0.0(8) null:D, vgrf186:F, u137:F
(+f0.0) sel(8) vgrf309:F, u176:F, vgrf308:F
(+f0.0) sel(8) vgrf222:F, u177:F, vgrf221:F
cmp.l.f0.0(8) null:D, vgrf186:F, u138:F
(+f0.0) sel(8) vgrf310:F, u178:F, vgrf309:F
(+f0.0) sel(8) vgrf223:F, u179:F, vgrf222:F

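The comparisons are repeated A, B, C, A, B, C. The second group of sels
only reads uniforms and its own results, so it could be interleaved with
the first group and share three comparisons instead of six.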
----
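
A sketch of the regrouping for the 1924.shader_test sequence, using a toy
scalar interpreter. Everything here is a model under assumptions: values are
scalars rather than SIMD8 registers, run() is a hypothetical helper, and the
uniform values are made up for the test.

```python
def names(prefix, first, count):
    return [f"{prefix}{first + i}" for i in range(count)]


def sels(dsts, src0s, src1s):
    return [("sel", d, a, b) for d, a, b in zip(dsts, src0s, src1s)]


def run(program, env):
    """Execute cmp.l / predicated sel on scalars; return (env, number of cmps)."""
    env, flag, cmps = dict(env), False, 0
    for ins in program:
        if ins[0] == "cmp":
            flag = env[ins[1]] < env[ins[2]]
            cmps += 1
        else:
            _, dst, s0, s1 = ins
            env[dst] = env[s0] if flag else env[s1]
    return env, cmps


A = ("cmp", "vgrf186", "u136")
B = ("cmp", "vgrf186", "u137")
C = ("cmp", "vgrf186", "u138")

chain1 = [
    sels(names("vgrf", 209, 4), names("u", 160, 4), names("u", 156, 4)),
    sels(names("vgrf", 213, 4), names("u", 164, 4), names("vgrf", 209, 4)),
    sels(names("vgrf", 217, 4), names("u", 168, 4), names("vgrf", 213, 4)),
]
chain2 = [
    sels(["vgrf308", "vgrf221"], ["u174", "u175"], ["u172", "u173"]),
    sels(["vgrf309", "vgrf222"], ["u176", "u177"], ["vgrf308", "vgrf221"]),
    sels(["vgrf310", "vgrf223"], ["u178", "u179"], ["vgrf309", "vgrf222"]),
]

# Original order: A, B, C for the first chain, then A, B, C again.
original = ([A] + chain1[0] + [B] + chain1[1] + [C] + chain1[2] +
            [A] + chain2[0] + [B] + chain2[1] + [C] + chain2[2])
# Regrouped: each comparison is issued once and feeds both chains.
regrouped = ([A] + chain1[0] + chain2[0] +
             [B] + chain1[1] + chain2[1] +
             [C] + chain1[2] + chain2[2])

uniforms = {f"u{i}": float(i) for i in range(136, 180)}  # made-up values
for x in (135.5, 136.5, 137.5, 138.5):
    env = dict(uniforms, vgrf186=x)
    out_a, cmps_a = run(original, env)
    out_b, cmps_b = run(regrouped, env)
    assert out_a == out_b
    assert (cmps_a, cmps_b) == (6, 3)
```

The reordering is only valid here because nothing between the repeated
comparisons writes vgrf186 or u136..u138, and the second chain does not read
anything the first chain produces.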

----
Optimize this:
      cmp.ne.f0(8)    g25<1>D         g2.5<0,1,0>UD   g24<8,8,1>UD
      cmp.ne.f0(8)    g27<1>D         g2.4<0,1,0>UD   g26<8,8,1>UD
      cmp.ne.f0(8)    g29<1>D         g2.3<0,1,0>UD   g28<8,8,1>UD
      cmp.ne.f0(8)    g31<1>D         g2.2<0,1,0>UD   g30<8,8,1>UD
      cmp.ne.f0(8)    g20<1>D         g5<8,8,1>UD     g3.4<0,1,0>UD
      cmp.ne.f0(8)    g21<1>D         g6<8,8,1>UD     g3.5<0,1,0>UD
      cmp.ne.f0(8)    g23<1>D         g7<8,8,1>UD     g3.6<0,1,0>UD
      cmp.ne.f0(8)    g24<1>D         g8<8,8,1>UD     g3.7<0,1,0>UD
      or(8)           g22<1>D         g20<8,8,1>D     g21<8,8,1>D
      or(8)           g25<1>D         g23<8,8,1>D     g24<8,8,1>D
      or(8)           g26<1>D         g22<8,8,1>D     g25<8,8,1>D
      and.ne.f0(8)    null            g26<8,8,1>D     1D
(+f0) if(8) 0 0                 null            0x00000000UD

into this:
      cmp.ne.f0
(+f0) cmp.ne.f0
(+f0) cmp.ne.f0
(+f0) cmp.ne.f0
(+f0) if(8)
----

----
Recognize open-coded copysign() (e.g., floor(abs(x) + 0.5) * sign(y)) in
sanctum-2/6830 and many others. Probably just make a GLSL extension
that exposes bfm/bfi and copysign. 7 -> 5 instructions for this common
sequence, with larger gains for scalar VS due to mask reuse.
Also doesn't use the flag register.

Unfortunately, while the shader authors likely wanted copysign(x, y),
what they've written isn't equivalent, since sign(0) is 0.

(assign (xyzw) (var_ref compiler_temp)
 (expression ivec4 f2i
  (expression vec4 *
   (expression vec4 floor
    (expression vec4 +
     (expression vec4 abs (swiz yxzw (var_ref Temporary0)))
     (constant vec4 (0.500000))))
  (expression vec4 sign (swiz yxzw (var_ref Temporary0))))))

add(8)          g50<1>F         (abs)g34<4,4,1>.yxzwF 0.5F      { align16 1Q };
rndd(8)         g49<1>F         g50<4,4,1>F                     { align16 1Q };
cmp.nz.f0(8)    null            g34<4,4,1>.yxzwF 0F             { align16 1Q switch };
and(8)          g53<1>UD        g34<4,4,1>.yxzwUD 0x80000000UD  { align16 1Q };
(+f0) or(8)     g53<1>UD        g53<4,4,1>UD    0x3f800000UD    { align16 1Q };
mul(8)          g48<1>F         g49<4,4,1>F     g53<4,4,1>F     { align16 1Q compacted };
mov(8)          g46<1>D         g48<4,4,1>F                     { align16 1Q };

(assign (xyzw) (var_ref compiler_temp)
 (expression ivec4 f2i
  (expression vec4 copysign
   (expression vec4 floor
    (expression vec4 +
     (expression vec4 abs (swiz yxzw (var_ref Temporary0)))
     (constant vec4 (0.500000))))
   (swiz yxzw (var_ref Temporary0)))))
   

add(8)          g50<1>F         (abs)g34<4,4,1>.yxzwF 0.5F      { align16 1Q };
rndd(8)         g49<1>F         g50<4,4,1>F                     { align16 1Q };
mov(8)          mask
bfi2(8)         g53<1>D         mask           g49<4,4,1>D    g34<4,4,1>.yxzwD  { align16 1Q };
----
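
The sign(0) caveat can be checked numerically. The Python sketch below
compares the open-coded form with a true copysign and models the bitfield
insert as a plain bitwise select. The mask value 0x7fffffff and the select
formula (mask & a) | (~mask & b) are assumptions made for illustration; the
note above leaves the mask unspecified.

```python
import math
import struct


def sign(v):
    """GLSL-style sign(): -1.0, 0.0 or 1.0."""
    return float((v > 0) - (v < 0))


def open_coded(x, y):
    return math.floor(abs(x) + 0.5) * sign(y)


def with_copysign(x, y):
    return math.copysign(math.floor(abs(x) + 0.5), y)


def f2u(f):
    return struct.unpack("<I", struct.pack("<f", f))[0]


def u2f(u):
    return struct.unpack("<f", struct.pack("<I", u))[0]


def copysign_bits(magnitude, sign_source, mask=0x7FFFFFFF):
    """Bitwise select: magnitude bits under the mask, sign bit from sign_source."""
    a, b = f2u(magnitude), f2u(sign_source)
    return u2f((mask & a) | (~mask & b & 0xFFFFFFFF))


# Nonzero sign operand: both forms agree.
assert open_coded(2.3, -1.0) == with_copysign(2.3, -1.0) == -2.0
# Zero sign operand: sign(0) is 0, so the open-coded form collapses to 0.
assert open_coded(2.3, 0.0) == 0.0
assert with_copysign(2.3, 0.0) == 2.0
# The bit-level select matches math.copysign on these inputs.
assert copysign_bits(3.0, -1.5) == math.copysign(3.0, -1.5) == -3.0
# When x and y are the same value, as in the IR above, x == 0 gives 0 either way.
assert open_coded(0.0, 0.0) == with_copysign(0.0, 0.0) == 0.0
```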

----
The pattern we generate for gl_FrontFacing is

	or(8)           g39.1<2>W       -g0<0,1,0>W     0x3f80UW        { align1 1Q };
	and(8)          g55<1>D         g39<8,8,1>D     0xbf800000UD    { align1 1Q };

which we can't CSE, because the OR is a partial write. If we wrapped
the pair of instructions in a virtual opcode and lowered it later, we
could recognize it and CSE it.
----

----
natural-selection-2/8914 (and many others) do:

	mov(8)          g98<1>.xywF     [0F, 0F, 0F, 0F]VF
	mov(8)          g98<1>.zF       g79<4,4,1>.xF

we could do this in one instruction:

	mul(8)          g98<1>F         g79<4,4,1>.xF     [0F, 0F, 1F, 0F]VF
----
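
A quick Python model of the two forms. For finite inputs they compare equal;
the sketch also shows that an infinite (or NaN) g79.x would turn the zeroed
channels into NaN, so the rewrite presumably needs the input to be known
finite or NaN/Inf behaviour to be irrelevant. The function names are
hypothetical.

```python
import math


def two_movs(x):
    """mov .xyw <- 0.0, then mov .z <- x"""
    dst = [0.0, 0.0, 0.0, 0.0]
    dst[2] = x
    return dst


def one_mul(x):
    """mul dst <- x * [0, 0, 1, 0]"""
    return [x * m for m in (0.0, 0.0, 1.0, 0.0)]


for x in (0.0, 1.5, -3.25, 1e30):
    # -0.0 == 0.0 compares equal, although the sign bit can differ for x < 0.
    assert two_movs(x) == one_mul(x)

# inf * 0 is NaN, so the masked-out channels are no longer zero.
assert math.isnan(one_mul(math.inf)[0])
assert two_movs(math.inf)[0] == 0.0
```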

----
Add IR for transpose() and recognize that it can be removed by reordering
multiplications.
----
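
The identities such a pass could exploit are transpose(M) * v == v * M and
transpose(A) * transpose(B) == transpose(B * A). A small Python check using
plain lists and integer entries so comparisons are exact; the helper names are
hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[k] * m[k][j] for k in range(len(v))) for j in range(len(m[0]))]


M = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
N = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]
v = [1, -2, 3]

# transpose(M) * v needs no transpose if the operands are swapped.
assert mat_vec(transpose(M), v) == vec_mat(v, M)
# Two transposes collapse into one by reordering the multiplication.
assert mat_mul(transpose(M), transpose(N)) == transpose(mat_mul(N, M))
```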