00:02karolherbst: imirkin: "MUFU.SQRT R29, R28;"
00:03karolherbst: since SM52
00:04HdkR: Just a little sqrt
00:05karolherbst: why did nobody notice...
00:06HdkR: What are you doing now?
00:06karolherbst: looking a bit at dual issue stuff
00:10karolherbst: sqrt is the only new one though
00:15karolherbst: HdkR: maybe it is quite fast
00:16karolherbst: could speed up all uses of sqrt
00:17HdkR: One would hope anyway
00:19karolherbst: ohhh, wait
00:19karolherbst: now it only gets added for sm52
00:19karolherbst: not sm60
00:49skeggsb_: volta has MUFU.SQRT too, i did briefly attempt to use it, but it randomly fails, likely some sched issue.. disabled it temporarily until i fix/finish the rest
00:55karolherbst: mhh, weird
00:56skeggsb_: especially weird because i'm pretty sure we use barriers on all those
00:56karolherbst: ohh wow
00:56HdkR: Obviously you just need to barrier harder
00:56karolherbst: look at this:
00:57karolherbst: ohh now
00:57karolherbst: it doesn't seem to stall longer
00:57skeggsb_: i have the same problem with some fp64 ops on volta too still actually, not sure if it's just MUFU.RCP64H or not, i'm ignoring sched-related issues until last :P
00:58karolherbst: better for now
00:59karolherbst: now checking with shader-db
00:59karolherbst: skeggsb_: I tried to look into dual issuing on maxwell today, but this is such a big mess, I doubt it is even worth it
00:59karolherbst: apparently you can dual issue something with a depbar
01:02karolherbst: "total instructions in shared programs : 5472103 -> 5456268 (-0.29%)" nice
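(The percentage shader-db prints there is just the relative change in total instruction count; a quick sanity check of the quoted numbers:)

```python
# sanity-check the shader-db delta quoted above
before, after = 5472103, 5456268
change = 100.0 * (after - before) / before
print(f"{change:.2f}%")  # -0.29%
```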
01:02karolherbst: pixmark_piano doesn't render correctly though :D
01:03karolherbst: got faster though
01:03karolherbst: 301 -> 321 points
01:03karolherbst: helped inst ../nouveau_shaderdb/gputest_pixmark_piano/7.shader_test - 1 3694 -> 3523
01:03karolherbst: that is insane
01:04karolherbst: okay, should be easy to figure out what's wrong
01:06karolherbst: skeggsb_: are you sure it is something sched related?
01:10karolherbst: 301 -> 327
01:10karolherbst: and it renders correctly now
01:10karolherbst: why is it broken with SCHED=0 though
01:10HdkR:starts hiding instructions that only work if the scheduling is correct
01:11karolherbst: ohhh no
01:11karolherbst: it was optimize=0
01:11karolherbst: we still didn't fix that...
01:13nyef: Sounds like it's time to fix "that", whatever "that" is. d-:
01:14karolherbst: 64 bit int lowering
01:14HdkR: 64bit integers is overrated anyway
01:14karolherbst: yeah... not if the application is using it :p
01:15karolherbst: but it's insane that simply using mufu.sqrt instead of mufu.rcp(mufu.rsq) gets us nearly a 10% perf increase in pixmark_piano
01:32karolherbst: but maybe it's simply that two SFU instructions back to back are super expensive
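(A minimal sketch of the identity the old lowering relies on, with `rcp`/`rsq` standing in for MUFU.RCP/MUFU.RSQ; the point is that sqrt built this way needs two dependent SFU ops where MUFU.SQRT needs one:)

```python
import math

def rsq(x):
    return 1.0 / math.sqrt(x)   # what MUFU.RSQ approximates

def rcp(x):
    return 1.0 / x              # what MUFU.RCP approximates

def sqrt_lowered(x):
    # old lowering: two back-to-back SFU ops, the second waiting on the first
    return rcp(rsq(x))

assert abs(sqrt_lowered(42.0) - math.sqrt(42.0)) < 1e-9
```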
01:43karolherbst: skeggsb_: maybe you simply forgot to add the latency stuff for sqrt?
01:43karolherbst: I did so as well
02:09karolherbst: uhm, AlgebraicOpt::handleRCP looks odd
02:10karolherbst: ohh, it does op =
02:10karolherbst: no, it is fine
03:23imirkin: karolherbst: should be easy to hook up. iirc we have a OP_SQRT, which just gets lowered
03:23imirkin: oh. and you already sent patches.
08:37pendingchaos: karolherbst: what's wrong with 64-bit int lowering?
10:43karolherbst: pendingchaos: we have to do it even with optimization level 0
10:47pendingchaos: so Split64BitOpPreRA isn't being done?
10:50pendingchaos: pixmark_piano seems fine with it disabled, though it's broken with FlatteningPass disabled?
10:51karolherbst: uhm... for me it breaks
10:51karolherbst: I mean
10:52karolherbst: if you set NV50_PROG_OPTIMIZE=0 it breaks
10:52karolherbst: it might be we just optimize those 64 bit opcodes away
10:52karolherbst: but that's one issue we have
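(The kind of split such a lowering pass has to emit can be sketched like this; `add64_lowered` is a hypothetical stand-in for what a pass like Split64BitOpPreRA does for an add, not the actual codegen logic:)

```python
MASK32 = 0xffffffff

def add64_lowered(a, b):
    # split one 64-bit add into two 32-bit adds plus a carry,
    # as a pre-RA lowering pass has to do on hardware without 64-bit int ALUs
    lo = (a & MASK32) + (b & MASK32)
    carry = lo >> 32                        # carry-out of the low half
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32
    return (hi << 32) | (lo & MASK32)

assert add64_lowered(0xffffffff, 1) == 0x100000000
```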
10:53pendingchaos: "pixmark_piano seems fine with it disabled" = "pixmark_piano seems fine with Split64BitOpPreRA disabled"
10:53pendingchaos: I think it's broken with NV50_PROG_OPTIMIZE=0 because FlatteningPass is removing some broken control flow or something
10:53karolherbst: ohhh, I see
10:54karolherbst: might be as well
10:57karolherbst: would be less painful to debug if that shader weren't so massive
10:59karolherbst: I kind of blame _from_TGSI on how joins are inserted though
11:00pendingchaos: I think I had a tentative fix for the _from_tgsi code somewhere
11:03pendingchaos: found it, though it probably needs some cleanup, testing and a proper understanding of the problem
11:03karolherbst: I think the main issue is that we don't do the CFG stuff _that_ correctly, as we don't flag the edges the way they should be
11:04karolherbst: I mean, it works, because we don't change the meaning later on
11:04karolherbst: but I think this issue is more related to joins
11:05karolherbst: especially because we change the meaning of OP_JOIN through the shader "stages"
11:06karolherbst: like here: https://gist.githubusercontent.com/karolherbst/8402a2d68679d07b525f615e7b1b24e3/raw/901d06825c9fbc5eaeff7eb76b777978ee44e0c9/gistfile1.txt
11:06karolherbst: join isn't a jump, because that would make that shader kind of insane
11:06karolherbst: but on hardware (and later on) it becomes one
11:07karolherbst: so I think we basically jump to whatever joinat points to, skipping all joins in between
11:10karolherbst: I think we change the meaning of the joins in the legalizing passes actually
11:10karolherbst: so they become a real jump instruction with a BB target
11:20karolherbst: pendingchaos: do you have an updated XMAD series somewhere by the way?
11:33mooch2: hey, does anybody know what the purpose of the https://github.com/envytools/scans repo is, or can i delete it? there's absolutely nothing there anyway :/
11:36JayFoxRox: I also wondered about that yesterday
11:42pendingchaos: karolherbst: v3 should be the latest
11:48pendingchaos: mooch2: it's been there for 4 years
11:48pendingchaos: I don't think it's going to be used for anything
11:48pendingchaos: and recreating it would be trivial
11:48pendingchaos: so deleting it is fine imo
11:48karolherbst: mhhh, why can't I get maxas to do that stuff
11:54pendingchaos: karolherbst: what stuff?
11:54karolherbst: this maxwell scheduler thing
11:55karolherbst: I want to insert my own binary inside a cuda one and just let maxas replace all the sched opcodes with the proper thing
11:59pendingchaos: I wonder if we can just not create OP_JOIN/OP_JOINAT instructions until post-ra legalization
11:59pendingchaos: I don't think they add any meaning to the code
12:00karolherbst: pendingchaos: I mean, you still need to sync the threads
12:00karolherbst: but yeah...
12:01karolherbst: pendingchaos: thing is, how do you figure out post-RA where to insert them?
12:31mwk: mooch2: I tried to upload my repo with register scans of various areas to github
12:32mwk: but it rejected the push for too large file size or something
12:33mwk: just delete it, I suppose
12:34karolherbst: mwk: you could push them to gitlab with git lfs though
12:37pendingchaos: imirkin: yeah, the affected stats can be rather meaningless
12:37pendingchaos: I think they're meaningful enough when you know what the affected shaders are or how many of them there are though
12:37pendingchaos: In the wip second version of the series, I have it so that it also shows the number and percentage of shaders affected
12:37pendingchaos: do you have any ideas for the affected stats?
12:39JayFoxRox: karolherbst: also works on github afaik https://git-lfs.github.com/
12:40JayFoxRox: https://help.github.com/articles/about-storage-and-bandwidth-usage/ [1GB storage + 1GB / mon]
12:48karolherbst: JayFoxRox: yeah sure, but this storage/rate limit is annoying
12:51JayFoxRox: does gitlab have bandwidth limits? I can only find that they have a 10GB storage limit
13:04karolherbst: JayFoxRox: we have our own gitlab for freedesktop projects
14:17rhyskidd: is the canonical upstream repo of nouveau_shaderdb now the gitlab.fdo one, or mupuf's still?
14:22pendingchaos: I think the gitlab one
14:26rhyskidd: great, i think then it's only the apitraces and nvidia_bios repos used in nouveau dev infra still with upstream at mupuf
14:52imirkin: pendingchaos: what i do sometimes is have a print which literally lists out the affected shaders
14:53imirkin: and says whether they were helped or hurt and on what metric
15:07pendingchaos: perhaps an --affected option could be added which shows n random affected shaders and how they were affected
15:07pendingchaos: there is already the --top and --smallest options, though they are ordered and are only for a specific metric
15:10imirkin: pendingchaos: perhaps it's just me, but i tend to like to see full lists
15:10pendingchaos: that can fill up your screen a bit though
15:10pendingchaos: you can just give it a huge number though
15:10pendingchaos: which is practically the same as a full list
15:19imirkin: or default to all, but take an optional number?
15:20imirkin: if it's a long list, can always pipe to less
15:20imirkin: but i have 10k lines of scrollback, so it's rarely an issue
15:20imirkin: and it should be printed before the stats, obviously
16:01karolherbst: and there is grep
16:06karolherbst: pendingchaos: what I look for most of the time are shaders which have a big hurt impact, but still only a few instructions
16:16karolherbst: imirkin: mind taking a second look at the imageload fix? I tested that on my kepler GPU as well a few days ago: https://github.com/karolherbst/mesa/commit/6cf12373d3ef188b9d9e2a7c3c9db0480c3a4f56
16:18imirkin: Instruction *mov = bld.mkMov(bld.getSSA(), bld.loadImm(NULL, 0));
16:18imirkin: eh, i guess it's fine actually
16:19karolherbst: yes :)
16:19karolherbst: alternative would be a zero immediate inside the union, or what other idea did you have?
16:19imirkin: no, well loadImm already creates a new instruction
16:19imirkin: but it doesn't return it
16:20imirkin: so ... wtvr
16:20karolherbst: didn't know it
16:20imirkin: anyways, r-b me
16:20imirkin: feel free to cc stable
16:27karolherbst: imirkin: what do we want to do about the fp64 patches? I can try to review dboyan patches, but that only covers kepler1
16:28pendingchaos: imirkin: thoughts? https://hastebin.com/udazicibut.txt
16:28karolherbst: don't know if you have enough time to really take a deep enough look to verify that stuff is sane enough
16:28pendingchaos: karolherbst: so perhaps score by "percentage_hurt / number_instructions"?
16:28karolherbst: pendingchaos: yeah, something like that
16:28karolherbst: but the value still isn't all that meaningful on its own though
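(The metric being discussed could be sketched like so; `hurt_score` is a hypothetical helper for illustration, not actual shader-db code:)

```python
def hurt_score(before, after):
    # rank shaders so that small shaders with a big relative regression
    # float to the top: percentage hurt divided by instruction count
    if after <= before:
        return 0.0
    return (100.0 * (after - before) / before) / after

shaders = {"small.shader_test": (100, 130), "big.shader_test": (4000, 4100)}
worst_first = sorted(shaders, key=lambda s: hurt_score(*shaders[s]), reverse=True)
assert worst_first[0] == "small.shader_test"
```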
16:29imirkin: pendingchaos: great stuff. can you flip affected and shared around?
16:29imirkin: so that it's easier to just grab stuff and paste it
16:30karolherbst: "inst 171 -> 108 (-36.84%)" what kind of optimization is that?
16:30imirkin: a good one :)
16:31imirkin: pendingchaos: oh, also, normalize the way stats are shown
16:31imirkin: i.e. always drop them to the next line
16:31imirkin: even if there's only 1
16:31karolherbst: killing those stupid RA movs might cause something like that :) although then I expected more affected shaders :D
16:31karolherbst: pendingchaos: ahh, yeah, that makes sense
16:31pendingchaos: just affecting compute shaders
16:31pendingchaos: so the count isn't too high
16:32karolherbst: but still
16:32karolherbst: should speed those up significantly
16:32karolherbst: pendingchaos: I guess basically only touches those feral ported games?
16:32pendingchaos: I think so?
16:32karolherbst: I am not aware of any other compute shaders though
16:32karolherbst: ohh, some of those tech demos also have a few
16:33pendingchaos: hitmanpro, f1_2015, shadow_of_mordor
16:33pendingchaos: so yeah
16:33karolherbst: I know that tomb raider uses those for rendering the fancy hair stuff, which was like super slow
16:33karolherbst: might actually cause a speedup there
16:34karolherbst: that reminds me, we have a silly crash with hitmanpro anyway
16:34karolherbst: something about using too many textures and we can't really deal with that
16:35pendingchaos: hitmanpro == HITMAN – Game Of The Year Edition? I don't see anything called hitmanpro
16:35karolherbst: pendingchaos: yeah, it is the Hitman without anything after it
16:35karolherbst: maybe calling that pro is actually wrong
16:36karolherbst: ohh Hitman Pro was the in-dev name
16:36karolherbst: and they changed it later
16:36pendingchaos: there's a Hitman Absolution: Professional Edition actually
16:36karolherbst: the binary is called hitmanpro I think
16:36pendingchaos: don't know if it's for linux
16:37karolherbst: it is the "HITMAN" one we have
16:37karolherbst: but they changed the name
16:38karolherbst: pendingchaos: $ ls Hitman™/ "bin config HitmanPro.sh lib share steam_shader_cache"
16:39karolherbst: I think the better name would be Hitman2016
18:34kernel-3xp: i made some measurements; doing audio work you don't want high latencies. that's what i found, and i thought i should share so you guys know
18:35kernel-3xp: max latency and average is much higher on nouveau than on nvidia proprietary driver, values in nanoseconds
18:41nyef: ... I wonder how much of that is things like kernel round-trips for the pushbuffers, and the spinwaits on the fences?
18:47kernel-3xp: no idea, i mean, can there be done anything?
18:55nyef: That, I couldn't say. I'm not too familiar with those parts of the system, and I don't know what the tradeoffs are.
18:56nyef: I mean, probably *something* could be done. After all, if the proprietary driver has (causes?) less latency then it's doing something different than what nouveau is doing.
18:57kernel-3xp: ok, but other devs see it and maybe can do sth if they care?
18:59nyef: If they see it and care, maybe they can do something about it. Or maybe they see it, and care, and could do something about it, but there's other things that they care about more, like missing functionality, improved rendering speed, fixing crashes, and so on.
19:00kernel-3xp: yeah ofc, i just thought maybe they want to know
19:00nyef: Certainly, thank you for bringing it to my attention, at least.
19:00kernel-3xp: ty, too
19:01nyef: (... how do you even track down causes for increased system latency?)
19:03kernel-3xp: maybe perf, latencytop or sth?
19:05imirkin: kernel-3xp: what is this a test of, precisely?
19:06kernel-3xp: kernel latency, time to answer a simple call
19:06imirkin: call, as in, syscall? which one?
19:07kernel-3xp: not sure
19:10nyef: So, it's measuring the variation in wakeup time on a 1ms interval timer?
19:11nyef: Or... Well, the details seem configurable.
19:11kernel-3xp: sth like that, basically its just sys/kernel latency
19:12nyef: Are you on a fully-preemptive kernel?
19:13kernel-3xp: no rt though
19:18karolherbst: kernel-3xp: what about the CPU clocks?
19:19karolherbst: the load seemed a lot lower with nouveau
19:19karolherbst: maybe the CPU simply slept more?
19:19karolherbst: and that causes those higher latencies, as the CPU has to be woken up first
19:19nyef: If it's specifically a nouveau thing, what about the GPU clocks? Could they be affecting things?
19:19karolherbst: for audio? doubtful
19:20nyef: If there's any display rendering going on, such as for a desktop environment, then still possible.
19:20karolherbst: the Min/Act/Avg values are quite close though
19:20karolherbst: only the Max is kind of bad
19:23karolherbst: anyway, benchmarking such a thing (or anything, basically) isn't as trivial as just starting some tool which measures something
19:23karolherbst: first you need a stable environment
19:23karolherbst: 1. no Turbo boost
19:23karolherbst: 2. static clocks
19:23nyef: I have a small interest in audio work on Linux, mostly dabbling at this point, and I'm running nouveau on that box. Latency isn't a huge problem for me yet, I have a hard time noticing right now, but there are two other issues that I need to hunt down, one of which is a lack of HDMI audio for some sinks.
19:24karolherbst: Nouveau itself doesn't do the audio stuff though
19:24nyef: (Okay, for ONE of the sinks, the problem is something to do with HPD on the audio out channel within the sink.)
19:25nyef: I know: That's the problem for two of my sinks. No ACR packets in the HDMI stream.
19:26karolherbst: imirkin: demmt doesn't pick up the gm200 stuff either (the mufu sqrt thing), so I guess I might want to fix that as well
19:27karolherbst: this shader: https://gist.githubusercontent.com/karolherbst/b263151ef94dc3a7fe06acf632d0b298/raw/84c92b14556fe75e2864c765e122ff16ca829936/gistfile1.txt
19:28nyef: I mostly "just" need to take the time to sit down with it and do the RE work required... Possibly against both of the sinks that I have that exhibit the same failure mode.
19:29kernel-3xp: i had static clocks, both on nvidia and nouveau reclocked to 0f
19:29karolherbst: kernel-3xp: no, the CPU clocks
19:29karolherbst: you are testing CPU latencies, right?
19:30kernel-3xp: all max even 1 step turbo on all cores
19:30kernel-3xp: kernel/sys latency
19:30karolherbst: I guess the CPU still idles though
19:31kernel-3xp: all sleep states disabled
19:31karolherbst: mhh, I see
19:32kernel-3xp: running C0 100%, no halt etc.
19:32karolherbst: wondering where those spikes are coming from then
19:32kernel-3xp: got them only with nouveau
19:33karolherbst: could be something stupid like nouveau keeping a lock for quite a long time
19:33karolherbst: and the audio stuff just waits longer on it
19:33karolherbst: anyway, I don't know enough to really know what could cause this
19:33karolherbst: I just think the loadavg thing is suspicious
19:34kernel-3xp: no idea either, just posting observations
19:34karolherbst: as, why would the load be higher with nvidia?
19:35kernel-3xp: good point but no idea
19:36kernel-3xp: just booted, mate, one terminal, and gedit
19:37nyef: So, a small amount of rendering going on as well.
19:37karolherbst: skeggsb_: do you know what we might do inside nouveau to stall the entire system?
19:37nyef: Spinwaiting on fences?
19:37kernel-3xp: yeah i think nvidia is more resource intensive after all
19:37karolherbst: kernel-3xp: Cyclictest basically just starts a few threads, let them sleep and wakes them up, right?
19:37karolherbst: and measures the time it takes until they respond
19:37nyef: Doesn't need to stall the entire system, just needs to stall one core a bit.
19:38kernel-3xp: yeah i think so
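(Roughly what such a test measures, sketched in Python rather than cyclictest's C: request a fixed-interval sleep and record how late each wakeup actually is.)

```python
import time

def wakeup_latency(iterations=20, interval_ns=1_000_000):
    # cyclictest-style loop: request a 1ms sleep, measure the overshoot
    worst = total = 0
    for _ in range(iterations):
        t0 = time.monotonic_ns()
        time.sleep(interval_ns / 1e9)
        late = max(0, time.monotonic_ns() - t0 - interval_ns)
        worst = max(worst, late)
        total += late
    return total // iterations, worst      # (avg, max) wakeup latency in ns

avg, worst = wakeup_latency()
```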
19:38karolherbst: kernel-3xp: how often/long did you test
19:39karolherbst: maybe it just got super unlucky and those super high values are kind of rare
19:39karolherbst: but happen with both drivers
19:39kernel-3xp: no, i made those tests before like 100s of times
19:40kernel-3xp: the high max is typical for nouveau, wont happen on nvidia
19:40karolherbst: well, I wouldn't be surprised if Nouveau can trigger a core to stall a bit
19:41kernel-3xp: yeah could be
19:43karolherbst: mhh 47ms in the worst case
19:43karolherbst: that's quite a lot
19:44kernel-3xp: those are nanoseconds
19:45kernel-3xp: so "not that bad" but adds up ofc
19:46kernel-3xp: and you will notice
20:11imirkin: kernel-3xp: so assuming you're not actually doing any rendering, this must all be in the page-flipping logic.
20:11imirkin: kernel-3xp: if you're interested in improving this, you could let us know what some of the hotspots turn out to be
20:11imirkin: i think the rt folk have various latency tracing helper tools one can use
20:12kernel-3xp: system was idle when measuring, yeah i could try to track this down
20:14imirkin: or put another way -- it'll require someone interested in fixing this to actually ... fix it
20:16kernel-3xp: yeah ofc
20:44karolherbst: preex2 can be dual issued with ex2 on maxwell
20:45karolherbst: same with presin and sin
20:48imirkin: less-than-useful though, since usually you don't do it on multiple values
21:00karolherbst: imirkin: wondering what kind of instruction MOV is supposed to be on maxwell in terms of dual issuing
21:00karolherbst: you can dual issue add+sync or mov+sync
21:01karolherbst: but add+mov doesn't work (afaik)
21:02karolherbst: maybe it is a full ALU instruction
21:17karolherbst: imirkin: so, this is odd. Although I comply with the rules nvidia has regarding dual issuing (every dual issue I emit, I also find in the nvidia shader), the perf still drops
21:20karolherbst: hakzsam: did you look into that maxwell dual issuing at some point?
21:22karolherbst: uhm, we set a delay of 0x7e0 in case of dual issuing
21:22karolherbst: which is the default one, meaning "be crappy"
21:23karolherbst: okay, that explains it
22:28karolherbst: nice, with the yield flag set I get some perf improvement :)
22:30pendingchaos: aren't things broken when you set the yield flag on everything? or did you just set it for stalls < 13 or something (which seemed to fix things for me)?
22:30karolherbst: I only set it on instructions I dual issue
22:31karolherbst: pendingchaos: mhh, weird
22:31karolherbst: pendingchaos: because you need the yield flag for 12-15 stalls
22:32pendingchaos: codegen doesn't do that though and it seems to work fine?
22:33karolherbst: no idea
22:33karolherbst: maxas suggests it caps at 11 in that case
22:33karolherbst: "For stalls 12-15 the yield hint is required to be set in addition to the stall count. Not setting this additional flag will give you fewer stall clocks. I'm not sure what these stall counts are supposed to mean when this flag is not set. But it seems to not be important."
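(Based on that maxas description, the per-instruction control word could be packed like this; the field layout — stall in bits 0-3, yield in bit 4, write/read barrier indices in bits 5-10, wait mask from bit 11 — is taken from the maxas docs, so treat it as a sketch rather than a verified encoding:)

```python
def encode_sched(stall, yield_=False, wrtdb=7, readb=7, watdb=0):
    # one Maxwell per-instruction scheduling word, maxas-style layout;
    # barrier index 7 means "no barrier"
    assert 0 <= stall <= 15
    if stall >= 12 and not yield_:
        raise ValueError("stall counts 12-15 require the yield flag")
    return (stall | (int(yield_) << 4) | (wrtdb << 5)
            | (readb << 8) | (watdb << 11))

# the "be crappy" default delay of 0x7e0 mentioned earlier falls out of
# this encoding: stall 0, no yield, no barriers
assert encode_sched(0) == 0x7e0
```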
22:37karolherbst: pendingchaos: currently I am at 0.17% dual issuing and got 5013->5017 points in pixmark_piano
22:37karolherbst: not much... but it seems to be going in the right direction
22:47imirkin: karolherbst: afaik the blob *always* sets the yield flag
22:47karolherbst: imirkin: it doesn't
22:47imirkin: despite what maxas docs say
22:47karolherbst: imirkin: https://gist.github.com/karolherbst/b263151ef94dc3a7fe06acf632d0b298
22:47karolherbst: it doesn't
22:47imirkin: heh ok
22:47imirkin: not always. just very very very very often
22:48karolherbst: very very often
22:48karolherbst: if we always set it, we get misrendering
22:48karolherbst: dunno what's going on here
22:48karolherbst: all I know is, if we dual issue, we _have_ to set the yield flag if we don't wait on barriers or set one or something
22:48karolherbst: as stall=0 means this default thing
22:49karolherbst: this shader has quite a lot of dual issues though, so it is nice to get a rough idea when we can dual issue
22:50karolherbst: alu + mufu seems to be the most common thing
22:50karolherbst: and alu + mem operation
22:50karolherbst: alu + (int/float) cvt seems to be a thing as well
22:51karolherbst: there are also some mov+bra/sync/ssy dual issues and I am not quite sure if that is valid for all alu instructions, but I think so
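(Collecting the pairings observed so far into a lookup gives something like the following — an educated guess from eyeballing nvidia output, not a verified rule set:)

```python
# dual-issue pairs seen in nvidia-generated Maxwell code so far
OBSERVED_PAIRS = {
    ("alu", "mufu"), ("alu", "mem"), ("alu", "cvt"),
    ("alu", "sync"),
    ("mov", "bra"), ("mov", "sync"), ("mov", "ssy"),
}

def can_dual_issue(first, second):
    return (first, second) in OBSERVED_PAIRS

assert can_dual_issue("alu", "mufu")
assert not can_dual_issue("alu", "alu")   # e.g. add+mov was not observed
```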
22:53karolherbst: ohh, cvt with pred is a mov actually, weird
23:06karolherbst: mhh imirkin what is the difference between tex and texs btw?
23:10imirkin: tex takes 2 quad args
23:10imirkin: texs takes 2 or 3 single or double args (sorry, i forget which)
23:11imirkin: this makes RA concerns simpler since you don't have to create quad args
23:11imirkin: but sometimes you really need all 8 args
23:11imirkin: like if you're doing ... samplerCubeArrayShadow + bias or something?
23:11imirkin: 4 coord, 1 depth compare, 1 bias... still only 6. hm.
23:12imirkin: anyways, when there's a will, there's a way :)
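(The arg-count arithmetic from that exchange, as a quick sketch; the one-scalar-per-extra assumption is a simplification for illustration, not the real codegen accounting:)

```python
def tex_scalar_args(coords, array=False, shadow=False, bias_or_lod=False):
    # rough scalar argument count for a texture op:
    # coordinates, plus one each for layer / depth compare / bias or lod
    return coords + array + shadow + bias_or_lod

# samplerCubeArrayShadow + bias: 3 cube coords + layer + compare + bias = 6
assert tex_scalar_args(3, array=True, shadow=True, bias_or_lod=True) == 6
```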
23:14karolherbst: I see
23:14karolherbst: nvidia seems to prefer using texs
23:15karolherbst: ohhhh, finally a decent improvement
23:15karolherbst: 1.05% dual issue rate: 5013 -> 5038 points
23:16karolherbst: imirkin: I guess if we would use texs, it would reduce those useless movs RA inserts for the quad op stuff?
23:18imirkin: just needs a bit of RE to pin down precisely how it works, and which situations it can handle
23:18karolherbst: although we know at lowering time if we can use TEXS or not?
23:18karolherbst: maybe we should just be explicit about it
23:19karolherbst: wouldn't have to touch RA
23:19imirkin: it should be very obvious
23:19imirkin: whether it's supported by TEXS or not
23:20imirkin: the texConstraintsGM107 could decide one way or the other
23:20karolherbst: thing is, do we really want to decide that within RA?
23:20karolherbst: doesn't sound like something RA should be concerned about
23:21imirkin: your call.
23:21imirkin: could add yet-another flag
23:21karolherbst: I mean, RA shouldn't change ops and all that stuff
23:21imirkin: it wouldn't be changing them
23:21imirkin: it would just cause it to get encoded differently
23:21karolherbst: okay, so the emitter would have to recheck
23:23karolherbst: maybe it would be okay actually, as this tex stuff is already handled inside RA anyway
23:25imirkin: step 1 here is to figure out wtf all you can do with TEXS
23:28karolherbst: imirkin: well, yeah, but that should be fairly easy to figure out, just takes some time
23:28nyef: Why am I thinking "Don't mess with TEXS" right about now?
23:31karolherbst: + 0.4% perf is quite disappointing
23:34karolherbst: mhh https://github.com/karolherbst/mesa/commit/05d96d1f79604ea2b8e98a6d365fae4339dd93a4
23:36karolherbst: let's see if I can enable it for int stuff actually
23:38imirkin: you don't want that
23:38imirkin: you want to check the src directly to see if it's an imm
23:38imirkin: since you don't care if it's a reg that has an immediate moved into it
23:39karolherbst: ohhh, right
23:41karolherbst: with the int stuff enabled I even get +0.64% perf...
23:41karolherbst: this is really disappointing and all :D
23:42karolherbst: nvidia has a dual issuing rate of 6% though, I only got 1.5%
23:43karolherbst: so, being optimistic +2% should be possible
23:43karolherbst: in perf increase