00:40dboyan: imirkin: just arrived at my lab after breakfast.
00:40imirkin: dboyan: what TZ are you in?
00:41dboyan: I'm in CST (GMT +8)
00:41imirkin: ah ok
00:41imirkin: so along with Ben (and Dave) you live in the future :)
00:43dboyan: imirkin: btw, the c01 thing is the most confusing "magic" in the rcp code, I guess. It has to do with 0x3ff (exponent field of 1)
00:43imirkin: could be.
00:43imirkin: i'd like to understand what it does, why it's there, etc
00:44imirkin: if we can't figure it out, nuke it and see what breaks. that's what i mean when i say that i'd like to fully understand the algo... there can't be any magic bits.
00:44imirkin: alternatively if you can find a paper that describes this algorithm, you can just point at the paper and then not worry about explaining anything
00:45imirkin: (but then your code has to follow what's in the paper)
00:46imirkin: if you can't work it out you can see if mwk has more ideas... he has probably forgotten more about floating point than you and i will ever know :)
00:46dboyan: The goal is to make the input near 1 so that it won't overflow when converting to single precision. And I've been thinking: why can't we set the exponent field of the input to 0x3ff anyway
00:46dboyan: That'd be cleaner
00:48dboyan: The rsq code from the blob is even more confusing, but I just managed to nuke it
00:50imirkin: excellent :)
00:51imirkin: how are you determining precision of your algo?
00:51imirkin: i'd recommend using numpy to generate a range of inputs and compare the output of that vs the cpu operations, which you can assume are "precise enough"
00:51imirkin: [or in C, wtvr]
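[As an aside, the comparison harness being suggested could look roughly like the following numpy sketch. The input ranges and the choice of `1.0 / x` as the operation are illustrative assumptions, not what was actually run.]

```python
import numpy as np

# Inputs spanning many binades, plus extreme values where a
# double-precision emulation is most likely to go wrong.
inputs = np.concatenate([
    np.logspace(-300.0, 300.0, 100_000),   # wide exponent sweep
    np.array([np.finfo(np.float64).max,    # DBL_MAX
              np.finfo(np.float64).tiny,   # smallest normal double
              0.5, 1.0, 2.0, 3.0]),
])

# CPU results, assumed "precise enough" to serve as the reference.
ref_rcp = 1.0 / inputs
ref_rsq = inputs ** -0.5

# A real harness would run the GPU (or emulated) op on the same
# inputs and compare the results bit-for-bit / by ULP distance.
```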
00:55dboyan: imirkin: I tried a few numbers from various ranges, and compared the result with the CPU implementation. For RSQ, I can see difference in the least significant bit sometimes (rather often).
00:57dboyan: I haven't tested what the blob algorithm can achieve yet
01:02imirkin: dboyan: 1 ULP difference? nobody cares about that.
01:03imirkin: dboyan: make sure you test stuff all over the range, e.g. around MAX/MIN_DOUBLE etc
01:04dboyan: 1 ulp I believe. And I've tested the maximum and minimum value of double. Actually they make little difference in rsq, since their value gets much more reasonable after the sqrt
01:05imirkin: yeah, but esp for RCP since you convert to a single-precision float, this can cause issues
01:06dboyan: so that's why the c01 magic was in place...
01:06imirkin: btw, is the RCP algo basically to set the exponent to some reasonable value, do the RCP, and then fix the exponent?
01:07imirkin: and then newton-raphson for the fact that you don't have sufficient precision on the mantissa
01:07imirkin: [the comments should make this sort of thing clear]
01:07imirkin: esp the mechanism behind how you set the exponent, fix it, etc
01:08imirkin: the metric is that by taking only the comments, i should be able to construct code that does the same thing
01:09imirkin: (doesn't have to be the exact same asm of course)
01:11dboyan: I'll try to clean it up, particularly the 0xc01. At least document it better. I know what it was doing, but that's not easy to describe.
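[For reference, the scheme described above — normalize the exponent, take a single-precision reciprocal as a seed, refine with Newton-Raphson, then fix the exponent back up — can be sketched in Python. This is an illustration of the general technique, not the actual Mesa code; the special-case handling and step count are assumptions.]

```python
import math
import struct

def rcp_double(x):
    """Illustrative double-precision reciprocal via a float32 seed."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    exp = (bits >> 52) & 0x7ff
    if exp == 0 or exp == 0x7ff:
        return 1.0 / x  # zero/denormal/inf/nan: punt in this sketch
    # Force the biased exponent to 0x3ff, i.e. scale |x| into [1, 2),
    # so the conversion to single precision cannot overflow/underflow.
    m_bits = (bits & ~(0x7ff << 52)) | (0x3ff << 52)
    m = struct.unpack('<d', struct.pack('<Q', m_bits))[0]
    # Low-precision seed: reciprocal rounded through float32 (~24 bits).
    r = struct.unpack('<f', struct.pack('<f', 1.0 / m))[0]
    # Two Newton-Raphson steps, each roughly doubling the good bits:
    # 24 -> 48 -> past the 53-bit double mantissa.
    for _ in range(2):
        r = r * (2.0 - m * r)
    # Undo the normalization: 1 / (2^e * m) = 2^-e * (1/m).
    return math.ldexp(r, 0x3ff - exp)
```

Note the final fixup computes `0x3ff - exp` on the raw exponent field; if that subtraction is done with integer adds on the bit pattern instead of `ldexp`, a negated-bias constant such as 0xc01 (which is -0x3ff in 12-bit two's complement) is one plausible place a "magic" value like that could come from.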
01:12imirkin: there are many levels of understanding. being able to explain something is one of the higher ones.
01:13dboyan: imirkin: btw, the blob's rsq iterates the result through a long "magic" function twice after 2 newton-raphson steps. I just wonder what that "magic" can achieve.
01:13imirkin: i really appreciate you working on this btw - among all my feedback, that message might fall through the cracks.
01:14dboyan: But I think if that function can't achieve bit-level precision once, it can't when done a second time either
01:14imirkin: i really wouldn't worry about single ULP differences
01:14imirkin: i don't think that the 32-bit RSQ and RCP functions are expected to get it right to a single ULP either
01:15imirkin: for both RCP and RSQ, another area of annoyance is near 0 but before infinity - the values can get really big really fast
01:15imirkin: esp for RSQ
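[The rsq case needs one extra wrinkle on top of the rcp scheme: the exponent gets halved, so it has to be made even before it is split off. A sketch of that standard approach — again an illustration, not the blob's or the Mesa code:]

```python
import math
import struct

def rsq_double(x):
    """Illustrative 1/sqrt(x) via exponent halving + a float32 seed."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    exp = (bits >> 52) & 0x7ff
    if x <= 0.0 or exp == 0 or exp == 0x7ff:
        return 1.0 / math.sqrt(x)  # out of scope for this sketch
    e = exp - 0x3ff                # unbiased exponent
    m_bits = (bits & ~(0x7ff << 52)) | (0x3ff << 52)
    m = struct.unpack('<d', struct.pack('<Q', m_bits))[0]  # in [1, 2)
    if e & 1:                      # make the exponent even so it halves
        m *= 2.0                   # m now in [1, 4)
        e -= 1
    # Single-precision seed, then Newton-Raphson for rsq:
    #   r' = r * (1.5 - 0.5 * m * r^2)
    r = struct.unpack('<f', struct.pack('<f', m ** -0.5))[0]
    for _ in range(2):
        r = r * (1.5 - 0.5 * m * r * r)
    # 1 / sqrt(2^e * m) = 2^(-e/2) / sqrt(m)
    return math.ldexp(r, -e // 2)
```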
01:16dboyan: at least I don't have to read the ~3k thing in blob's rsq
01:16imirkin: not unless you find examples where your algo isn't good enough
01:17imirkin: after carefully picking values, i'd just fuzz it with literally random 64-bit values
01:17imirkin: and let that run for a day
01:17dboyan: yeah, that's a good idea
01:18imirkin: i'd say that anything under 8 ULP's is of no consequence
01:19imirkin: er i take that back. 3 ULP
01:20imirkin: that's the standard for 32-bit sqrt... 2 ULP for rsq, 2.5ULP for rcp (whatever that means)
01:20imirkin: anyways, i guess 2 ULP is nothing to worry about, 3 ULP is not-so-great, anything bigger should be a red flag
01:21imirkin: [there's a wrinkle in that it's supposed to be ULP difference to ideal result, not to cpu result, but ... meh.]
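[For concreteness, the ULP distance being discussed can be measured by mapping double bit patterns onto a monotonic integer line, and the "fuzz with literally random 64-bit values" suggestion fits naturally on top. The `fuzz` driver and its `rcp_under_test` parameter are hypothetical scaffolding, not existing code.]

```python
import random
import struct

def to_ordered(x):
    """Map a double to an integer so that adjacent representable
    doubles map to adjacent integers."""
    i = struct.unpack('<q', struct.pack('<d', x))[0]
    return i if i >= 0 else -(1 << 63) - i

def ulp_diff(a, b):
    return abs(to_ordered(a) - to_ordered(b))

def fuzz(rcp_under_test, rounds=100_000):
    """Feed random 64-bit patterns to a reciprocal implementation and
    track the worst ULP deviation from the CPU result."""
    worst = 0
    for _ in range(rounds):
        pattern = random.getrandbits(64)
        x = struct.unpack('<d', struct.pack('<Q', pattern))[0]
        if x != x or x in (0.0, float('inf'), float('-inf')):
            continue  # skip nan/zero/inf inputs in this sketch
        worst = max(worst, ulp_diff(rcp_under_test(x), 1.0 / x))
    return worst  # per the discussion: above ~3 ULP is a red flag
```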
03:08Echelon9: dboyan: If you're researching algorithms for the elementary functions on nv GPUs, can I suggest you look at the algorithm writings by Stuart Oberman?
03:08Echelon9: e.g. http://www.acsel-lab.com/arithmetic/arith17/papers/ARITH17_Oberman.pdf
03:09Echelon9: and publications referenced at http://oberman.net/resume.html
03:55dboyan: Echelon9: Thanks for the reference, but I guess that's a different thing. My work is primarily sw, whose goals and measures are quite different from hw designers'.
03:58Echelon9: You might find some clever optimizations mentioned, which were modeled within the SFU
03:59Echelon9: I found them helpful when doing range reduction optimizations for the vc4 trig functions
05:51imirkin: skeggsb: any word on merging that fan workaround?
15:04dpdocs21: Hi, I want to work on the "Instruction Scheduler" project as a part of EVoC. I want to know more about the project. Can you please suggest possible next steps? Who would be a possible mentor for the project? Thanks.
15:26imirkin: dpdocs21: hakzsam or i are probably the most obvious mentors for such a project...
15:27imirkin: dpdocs21: the basic idea is that different instructions take different amounts of cycles to "complete"
15:27imirkin: dpdocs21: and so it'd be nice if those cycles were taken up doing useful work rather than just waiting for their results
15:28nyef: Is it an interlocked pipeline, causing stalls, or are there hazards that the scheduler would need to work around?
15:28nyef: (Basically, is correctness an issue, or "merely" performance?)
15:29imirkin: why not both? :)
15:29imirkin: in the case of textures (on kepler), you're supposed to issue a barrier
15:29imirkin: which is a separate instruction
15:30nyef: Okay, that's plausible, but the hazards would be on a definite subset of the available operations, typically the "system" handling registers rather than GPRs...
15:30imirkin: on maxwell, a bunch of different instruction types (including textures) require barriers for proper support
15:30nyef: Barriers are a little different than straight-up timing hazards.
15:30imirkin: however the barriers are handled via scheduling metadata rather than explicit ops
15:30imirkin: yeah, this isn't like, say, adreno, where you have to manually insert the right number of nop's, otherwise you're screwed
15:31nyef: Or MIPS I, II, or III?
15:32nyef: (MIPS IV was a fully-interlocked superscalar system, so most of the hazards went away at that point.)
15:32imirkin: lol. didn't know about that... i thought that all their ops were single-cycle, except that one op (i forget which, like multiply or something), and obviously the post-branch delay slot
15:33nyef: Single-cycle throughput, sure, but typically you couldn't do a back-to-back read-after-write.
15:33imirkin: although even the one op i'm thinking of, i think the deal was that it took 2 cycles, effectively as 2 meta-ops
15:33nyef: And then there's *plenty* of hazards for accessing system registers.
15:33imirkin: hah. was not aware.
15:33nyef: (Here's a fun one: There's a hazard for disabling IRQs, so you could end up in an
15:33imirkin: and i wrote a MIPS compiler... hrm
15:33imirkin: perhaps it was for post-MIPS-III?
15:33nyef: IRQ handler with the IRQ-disable bit set.)
15:34imirkin: iirc our stuff ran on O2's (and spim, naturally)
15:35nyef: I could be misremembering the details of the read-after-write; it might be on a per-instruction basis, or it may have been partially relieved fairly early on, such as between MIPS I and MIPS II.
15:36nyef: O2's would be, what, R4k, maybe? Probably not R4400, though that's possible...
15:36imirkin: aha, O2's had r5k, which is mips iv
15:37imirkin: The O2 comes in two distinct CPU flavours; the low-end MIPS 180 to 350 MHz R5000- or RM7000-based units and the higher-end 150 to 400 MHz R10000- or R12000-based units.
15:37nyef: And Octane2 was r10k or r12k...
15:48dboyan: imirkin, I think I've got rid of the c01 magic
15:48dboyan: but haven't tested precision yet, maybe tomorrow
15:49dboyan: imirkin: https://github.com/dboyan/mesa/tree/fp64-rcprsq2
15:51imirkin: dboyan: ok cool
15:53dboyan: Now i think my rcp is much cleaner than the blob one :)
15:54mupuf: dboyan: but is it compliant :p
15:54imirkin: thanks for taking the time to clean up envydis btw
15:56dboyan: my pleasure. Just noticed some easy ones I guess
15:58imirkin: yep. probably lots more left, but don't waste your time fixing it all. just the stuff you happen to hit, and then fix up the whole instruction while you're at it
15:58imirkin: since the incremental cost is quite low
15:58dpdocs21: imirkin : I want to know whether someone is already active on this project or not? Moreover, as you said above, you want the instruction scheduler to fill the waiting cycles with other work, right? I think there is some ambiguity. Please clarify.
16:00imirkin: dpdocs21: there's loads and loads of ambiguity
16:00imirkin: dpdocs21: if this were an easy project, it would have been done long long ago
16:00imirkin: it is, in fact, a very difficult project
16:01dpdocs21: imirkin: hahah. I am a college student and have completed the compiler design course. If possible, please suggest some resources which will be useful for this project.
16:02imirkin: dpdocs21: did this compiler class talk about SSA?
16:02dpdocs21: imirkin: I am willing to work on a tough project. So I will try to learn whatever is required.
16:02dpdocs21: imirkin: Yes. I am familiar with that concept.
16:03imirkin: ok cool. did you hit on GVN and GCM?
16:04nyef: Is that last one Code Movement, or something else?
16:04imirkin: motion usually, but yeah
16:04nyef: Right, motion.
16:05imirkin: anyways, those aren't strictly required
16:05imirkin: the main thing is... you have these dependency chains
16:06imirkin: for computing of the final result
16:06dpdocs21: imirkin: GVN. I am not aware of GCM.
16:06imirkin: it's an inverse tree, essentially (or a handful of inverse trees)
16:06imirkin: at the end of the day, you have to linearize those trees into a single instruction stream
16:07imirkin: now, if you have like x = a + b + c; y = d + e; then you could sequence it as
16:07imirkin: t = a + b; x = t + c; y = d + e;
16:07imirkin: or you could sequence it as t = a + b; y = d + e; x = t + c;
16:08imirkin: in the first example, the second operation has to wait for the first
16:08imirkin: while in the second example, the second operation does not have to wait
16:08imirkin: in this particular case, these operations could be dual-issued (probably, i forget all the rules)
16:09imirkin: in other cases, it could just avoid a stall, due to internal resource contention
16:09imirkin: the trick is to do this without exploding live value counts
16:10imirkin: since there are only so many registers, and (a) spilling is beyond slow and (b) using more registers decreases parallelism
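[The `x = a + b + c; y = d + e` example above can be made concrete with a toy in-order, single-issue latency model. The uniform 3-cycle latency is invented purely for illustration; real per-instruction scheduling information is far messier.]

```python
LATENCY = 3  # assumed uniform result latency, purely illustrative

def finish_cycle(order):
    """Cycle at which the last result of an in-order, single-issue
    instruction stream lands; an op stalls until its sources are ready."""
    ready = {}
    cycle = 0
    for dest, srcs in order:
        cycle = max([cycle] + [ready[s] for s in srcs if s in ready])
        ready[dest] = cycle + LATENCY   # issue now, result lands later
        cycle += 1                      # one cycle per issue slot
    return max(ready.values())

# x = a + b + c; y = d + e, written as (dest, sources) tuples
naive  = [('t', ['a', 'b']), ('x', ['t', 'c']), ('y', ['d', 'e'])]
better = [('t', ['a', 'b']), ('y', ['d', 'e']), ('x', ['t', 'c'])]
```

With these numbers, `finish_cycle(naive)` is 7 while `finish_cycle(better)` is 6: identical work, but the second ordering hides part of t's latency behind the independent add.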
16:14dpdocs21: imirkin: OK. From the above example, what I understood is: given a certain number of registers and a set of instructions, the task is to rewrite those instructions such that maximum parallelism can be achieved, right?
16:15imirkin: well, usually you schedule pre-RA, and then again post-RA
16:16imirkin: but basically yeah
16:16imirkin: and of course control flow can become an annoying little addition onto this effort
16:20dpdocs21: imirkin: OK. Obviously, the task doesn't seem to be easy, but I would like to work on this project. Any other requirements you think are needed to work on it? Some background knowledge/reading etc?
16:21imirkin: dpdocs21: well, some contributions to the project are required first for a EVoC (and GSoC). this is mostly done to make sure that the student is serious about the project and also capable of completing it
16:22imirkin: dpdocs21: which nvidia gpu's do you have easily available to you?
16:24dpdocs21: OK. So I will do that first. Moreover, I would like to mention that I have completed GSoC 2015 successfully.
16:25dpdocs21: I have K40 Plus in my lab.
16:25imirkin: heh. that's a pretty serious GPU. is that a GK110-based one or GK104-based one?
16:26dpdocs21: I am not sure, but it's a GTX480
16:26imirkin: oh, i guess all K40's are GK110-based. GTX 480 is a GF100 (fermi) - very different arch
16:28karolherbst: k40 should be gk110 though
16:29imirkin: ok. GTX 480 is a pretty old gpu (fermi series). it schedules in a somewhat different manner than then later keplers
16:29imirkin: although i think many of the same core concepts will still apply
16:30imirkin: one thing about GTX 480 though is that it'll probably boot to a fairly low frequency, making it less useful for performance analysis
16:30imirkin: [and nouveau won't be able to reclock it]
16:32imirkin: dpdocs21: would be good to figure out exactly what GPU it is that you have access to. also note that if you have a laptop with a nvidia gpu, that may be more than sufficient.
16:32imirkin: it doesn't have to be one of the monster gpu's :)
16:35dpdocs21: imirkin: OK, I'll first figure out that exact GPU and will also check the availability of any other GPUs. No, I don't have access to any such laptop currently.
16:35dpdocs21: Then move forward.
16:35imirkin: dpdocs21: "lspci -nn -d 10de:" should provide the necessary info.
16:37dpdocs21: imirkin: OK . Thanks. Will do that and get back to you.
16:41dboyan: btw, i happened to notice today that the isa encoding on maxwell and pascal seems quite similar
16:41dboyan: although I know nouveau can't do anything interesting with it now, or in the near future
16:42imirkin: dboyan: yes, pascal uses the same ISA. there are some additions to handle FP16 afaik
16:45dboyan: are there plans to enable 3d support on pascal? iirc gnurou provided some sort of firmware a few days ago
16:46imirkin: yes, once that lands, we should be able to largely just flip it on
16:46imirkin: i believe skeggsb has already tested it out on his boards
16:46dboyan: wow, that's cool
16:47imirkin: there are no substantial api differences that we hit
16:47imirkin: although apparently compute needs more work
16:47imirkin: but i have no clue what it needs... probably a different descriptor. not sure.
16:50dboyan: well i can make mmt traces if needed
16:50imirkin: ok cool
16:50imirkin: actually you could do that anyways - afaik mmt needs some help with newer blob versions
16:50imirkin: and/or newer hardware
16:51dboyan: i have a laptop with 1050
16:51dboyan: the mobile one
18:11pmoreau: "/usr/lib/xorg/modules/extensions/libglx.so exists in both 'mesa-libgl-git' and 'xorg-server'": great, I will need to fix the image builder, again…
18:18imirkin: dboyan: well you could test out gnurou's tree
22:00nyef: Fun and games: Trying to get my "new" system somewhat set up, and it turns out that the only way to physically fit both video cards into the case results in one of them sitting in front of the bottom half of the cooling fan for the other.
22:01nyef: As in, millimeters of clearance.
22:03nyef: And I can't put them the other way around (where there'd be an empty card slot worth of gap) because of the case layout.