02:48 imirkin: ok, theoretically i made a thing that works for maxwell f32 atomic add on shared mem...
02:49 imirkin: if anyone feels like checking it out -- https://github.com/imirkin/mesa/commits/atomicfloat
02:50 imirkin: run with this piglit: https://cgit.freedesktop.org/~idr/piglit/plain/tests/spec/nv_shader_atomic_float/execution/shared-atomicAdd-float.shader_test?h=NV_shader_atomic_float&id=025bb9e5cb7e8e7955aa6ad85575c0cb318f6165
02:50 imirkin: need to force-enable the ext with MESA_EXTENSION_OVERRIDE
03:12 imirkin: pmoreau: any clue what LDS vs LDS.U do? (i.e. what's the .U?)
03:13 imirkin: or ... HdkR?
03:33 HdkR: imirkin: I know what it means, just doubt that I can say
03:33 imirkin: can you say if i should worry about it?
03:34 HdkR: You don't really need to worry about it
03:34 imirkin: i.e. is it perf-related, or functionally different
03:34 imirkin: ok
03:34 imirkin: excellent.
03:45 HdkR: :P
04:08 HdkR: imirkin: What's all the shared atomic work for?
07:53 pmoreau: imirkin: No clue what the .U is for: first time seeing it as well.
07:53 skeggsb: in context, i'd say uncached is likely
07:55 pmoreau: Could very well be
13:08 imirkin: HdkR: f32 add on shared memory. no op to do it, so have to do a load + cas loop.
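[editor's note] The load + CAS loop described above can be sketched in C11 atomics. This is a minimal illustration of the technique, not the actual Mesa code; `atomic_add_f32` is a hypothetical helper name, and the real backend emits the equivalent LDS / ATOMS.CAS instruction sequence in the shader.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: emulate an f32 atomic add on a 32-bit word
 * with a compare-and-swap loop, mirroring what a shader backend must
 * emit when the hardware has no native float ATOM.ADD on shared memory. */
static float atomic_add_f32(_Atomic uint32_t *mem, float val)
{
    uint32_t old_bits = atomic_load(mem);
    for (;;) {
        float old_f, new_f;
        uint32_t new_bits;
        memcpy(&old_f, &old_bits, sizeof(old_f));   /* reinterpret bits as float */
        new_f = old_f + val;
        memcpy(&new_bits, &new_f, sizeof(new_bits));
        /* On failure, old_bits is refreshed with the current value
         * and the loop retries with the new observation. */
        if (atomic_compare_exchange_weak(mem, &old_bits, new_bits))
            return old_f;
    }
}
```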
14:21 methodMan: Hello all !
15:37 pendingchaos: imirkin: is now a good time?
16:07 imirkin_: pendingchaos: sorry ... send me an email, i'll process it at some point
17:11 nikk_: Hello everyone! Can someone please guide me into starting with drivers, GPU and related coding.. I don’t know where to start reading to be able to contribute to the project.. Thanks!
17:12 gnarface: nikk_: someone told me to get started with envytools first
17:24 juri_: envytools could use a lot of documentation.
17:24 juri_: "comments would be awesome".
17:32 imirkin_: nikk_: best way is to just find something you don't like, and fix it.
18:16 HdkR: imirkin_: Ah right, right
18:42 karolherbst: imirkin_: you know anything faster for doing ilog2(f) than "if f >= 1.0 ufind_msb(f2i(f)) else neg(ufind_msb(f2i(f^-1)))" ?
18:43 karolherbst: although even the if-else thing is questionable
18:43 karolherbst: but maybe there is something on nv hw which can be used
20:52 imirkin_: karolherbst: ilog2 is usually done on an int
20:53 pendingchaos: the polygon-offset failure seems to be because line-width setting is broken
20:53 pendingchaos: making piglit fail
20:54 karolherbst: imirkin_: well, in OpenCL there is this ilogb instruction, which is a flog2 but with an int result
20:54 imirkin_: pendingchaos: hah! could be. it only fails on GM20x+
20:55 imirkin_: karolherbst: hm, well for a float it's basically just the exponent right?
20:55 karolherbst: mhhh
20:55 imirkin_: (for a normalized float)
20:55 karolherbst: true
20:55 imirkin_: perhaps +/- 1, i'd have to think about it. but approximately right.
20:55 karolherbst: but what about not normalized ones
20:56 imirkin_: note that f2i isn't an operation you want to do
20:56 imirkin_: range of floats is considerably larger than range of integers
20:56 karolherbst: ahh, right
20:56 karolherbst: forgot about that
20:57 karolherbst: maybe I really just do a flog2 for now and later improve it
20:57 imirkin_: =]
20:57 karolherbst: but I see the point now, getting the exponent is indeed much faster
20:57 karolherbst: than doing an actual log2
20:57 imirkin_: do you have to deal with denorms?
20:57 karolherbst: imirkin_: well, random input
20:58 imirkin_: does it matter?
20:58 karolherbst: dunno
20:58 imirkin_: i.e. are you expected to produce correct results there?
20:58 karolherbst: the spec is super unclear
20:58 imirkin_: if not, just extbf and move on :)
20:58 imirkin_: (well, the exponent is biased, so you have to unbias it)
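[editor's note] The "extbf and unbias" idea reads like this in C; a minimal sketch, assuming IEEE 754 binary32 where bits 23..30 hold the biased exponent (bias 127). Only valid for normalized, finite, nonzero inputs; denorms, zero, inf and NaN would need special-casing, which is exactly the concern raised above.

```c
#include <stdint.h>
#include <string.h>

/* Extract the unbiased exponent of a normalized float by pulling the
 * exponent bit-field directly, the moral equivalent of an EXTBF on the
 * register holding the float, followed by subtracting the bias. */
static int ilog2_normalized(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (int)((bits >> 23) & 0xff) - 127;  /* unbias */
}
```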
21:01 karolherbst: imirkin_: the only thing the spec is saying is "Return the exponent as an integer value."
21:01 karolherbst: and logb is "Compute the exponent of x, which is the integral part of log_r |x|."
21:02 karolherbst: uhh
21:02 karolherbst: maybe it isn't log2 to begin with
21:03 karolherbst: imirkin_: at least C and C++ are a bit more specific: "Extracts the value of the unbiased radix-independent exponent from the floating-point argument arg, and returns it as a floating-point value."
21:04 karolherbst: for logb
21:05 karolherbst: logb(123.45) = 6, because 1.928906 * 2^6
21:19 imirkin_: np.frexp(123.45)
21:19 imirkin_: (0.96445312500000002, 7)
21:20 imirkin_: you sure it should be 6?
21:20 imirkin_: the mantissa is supposed to be in the range [0.5, 1) to be considered a normal float
21:23 karolherbst: imirkin_: for frexp you get 0.964453 * 2^7
21:23 karolherbst: but that's C/C++
21:24 karolherbst: well, maybe I should check what the CTS expects
21:24 karolherbst: could clarify things
21:29 imirkin_: frexp pulls the mantissa and exponent
21:37 karolherbst: yeah, but ilogb still returns a different exponent
21:37 imirkin_: so +1 :)
21:37 karolherbst: most likely, yes
22:06 pendingchaos: imirkin_: setting LINE_WIDTH_SMOOTH instead of LINE_WIDTH_ALIASED seems to fix the issue
22:06 imirkin_: pendingchaos: interesting. perhaps we should just set both.
22:06 imirkin_: iirc blob driver just sets both and moves on with life.
22:07 pendingchaos: my guess is that they were merged on GM20x+ and LINE_WIDTH_ALIASED is a NOP
22:07 pendingchaos: I think I'll create a patch having both set
22:09 imirkin_: could be.
22:10 imirkin_: thinking about that stuff gives me a headache :)
22:10 imirkin_: i did understand it at one point, for like 2 minutes
22:10 imirkin_: smooth is when you have MSAA, while aliased is when you don't?
22:10 imirkin_: and the line width is set different because ... yeah no clue.
22:11 imirkin_: (the line rasterization algo is different, so different inputs for different algos? the size limits are different? who knows.)
22:11 imirkin_: anyways, nice find =]
22:35 pendingchaos: imirkin_: it seems I forgot to add the v2 in the title of the second revision
22:37 imirkin_: no worries
23:01 karolherbst: imirkin_: I am sure I asked that already, but was there an instruction to do x + (y - x) * a?
23:02 karolherbst: or just do sub/mul/add
23:02 imirkin_: this isn't x87
23:02 karolherbst: mhh sub/mad should be possible
23:02 karolherbst: uhm
23:03 karolherbst: yeah, no idea about x87 really
23:03 imirkin_: with fun little ops like "2^x-1"
23:04 imirkin_: or y*log2(x+1)
23:04 karolherbst: I see
23:04 imirkin_: (which only took 1000 cycles to complete)
23:04 karolherbst: "only"
23:04 imirkin_: you could *feel* the speed...
23:05 imirkin_: anyways, i haven't seen too much crazy stuff from nvidia
23:06 karolherbst: well, on x86 those instructions usually don't exist natively anyway
23:06 karolherbst: even if they are part of x86
23:06 imirkin_: on x87.
23:06 imirkin_: FYL2XP1
23:06 karolherbst: or x87
23:06 karolherbst: well, when did they start microcoding?
23:06 imirkin_: (the fpu coprocessor ... 8087 and so on)
23:06 imirkin_: not on the 8087 :p
23:06 karolherbst: :D
23:06 karolherbst: true
23:07 karolherbst: but who knows
23:07 karolherbst: maybe they had something like that already, it was just not replaceable
23:08 karolherbst: mhh but well 1980.. highly unlikely
23:20 imirkin_: it was common in the form of ROMs which would flip various enables for various ops
23:20 imirkin_: but having an actual uarch i think would have been rare
23:20 imirkin_: iirc i implemented something like that in a computer architecture class
23:21 imirkin_: (in a software emulator)
23:24 imirkin_: [and of course amusingly y*log2(x+1) is faster than y*log2(x) coz the taylor series is only nicely convergent for log2(x+1)...]
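[editor's note] The convergence remark refers to the standard Mercator series for the natural log, from which log2 follows by a constant factor; it converges only near x = 0, which is why a series-based unit naturally targets log2(x+1) rather than log2(x):

```latex
\log_2(1+x) = \frac{\ln(1+x)}{\ln 2}, \qquad
\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots \quad (|x| < 1)
```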