02:48imirkin: ok, theoretically i made a thing that works for maxwell f32 atomic add on shared mem...
02:49imirkin: if anyone feels like checking it out -- https://github.com/imirkin/mesa/commits/atomicfloat
02:50imirkin: run with this piglit: https://cgit.freedesktop.org/~idr/piglit/plain/tests/spec/nv_shader_atomic_float/execution/shared-atomicAdd-float.shader_test?h=NV_shader_atomic_float&id=025bb9e5cb7e8e7955aa6ad85575c0cb318f6165
02:50imirkin: need to force-enable the ext with MESA_EXTENSION_OVERRIDE
03:12imirkin: pmoreau: any clue what LDS vs LDS.U do? (i.e. what's the .U?)
03:13imirkin: or ... HdkR?
03:33HdkR: imirkin: I know what it means, just doubt that I can say
03:33imirkin: can you say if i should worry about it?
03:34HdkR: You don't really need to worry about it
03:34imirkin: i.e. is it perf-related, or functionally different
04:08HdkR: imirkin: What's all the shared atomic work for?
07:53pmoreau: imirkin: No clue what the .U is for: first time seeing it as well.
07:53skeggsb: in context, i'd say uncached is likely
07:55pmoreau: Could very well be
13:08imirkin: HdkR: f32 add on shared memory. no op to do it, so have to do a load + cas loop.
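For readers following along, the "load + cas loop" idea can be sketched on the CPU with C11 atomics. This is only an illustration of the pattern, not the code from the atomicfloat branch; the helper name atomic_add_f32 and the choice to keep the float in a 32-bit integer slot are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: atomically add `value` to the float whose bits are
 * stored in *slot. Returns the value that was there before the add. */
static float atomic_add_f32(_Atomic uint32_t *slot, float value)
{
    /* The initial load; later iterations reuse what the failed CAS returns. */
    uint32_t old_bits = atomic_load(slot);

    for (;;) {
        float old_val, new_val;
        uint32_t new_bits;

        /* Reinterpret the bits as a float, add, and convert back to bits. */
        memcpy(&old_val, &old_bits, sizeof old_val);
        new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);

        /* Store only if nobody changed the slot in the meantime. On failure
         * old_bits is updated to the current contents and we retry. */
        if (atomic_compare_exchange_weak(slot, &old_bits, new_bits))
            return old_val;
    }
}
```

The compare is done on the raw bits rather than on float values, so the loop still terminates when the slot holds a NaN.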
14:21methodMan: Hello all !
15:37pendingchaos: imirkin: is now a good time?
16:07imirkin_: pendingchaos: sorry ... send me an email, i'll process it at some point
17:11nikk_: Hello everyone! Can someone please guide me into starting with drivers, GPU and related coding.. I don’t know where to start reading to be able to contribute to the project.. Thanks!
17:12gnarface: nikk_: someone told me to get started with envytools first
17:24juri_: envytools could use a lot of documentation.
17:24juri_: "comments would be awesome".
17:32imirkin_: nikk_: best way is to just find something you don't like, and fix it.
18:16HdkR: imirkin_: Ah right, right
18:42karolherbst: imirkin_: you know anything faster for doing ilog2(f) than "if f >= 1.0 ufind_msb(f2i(f)) else neg(ufind_msb(f2i(f^-1)))" ?
18:43karolherbst: although even the if else thing is questionable
18:43karolherbst: but maybe there is something on nv hw which can be used
20:52imirkin_: karolherbst: ilog2 is usually done on an int
20:53pendingchaos: the polygon-offset failure seems to be because line-width setting is broken
20:53pendingchaos: making piglit fail
20:54karolherbst: imirkin_: well, in OpenCL there is this ilogb instruction, which is a flog2 but with int result
20:54imirkin_: pendingchaos: hah! could be. it only fails on GM20x+
20:55imirkin_: karolherbst: hm, well for a float it's basically just the exponent right?
20:55imirkin_: (for a normalized float)
20:55imirkin_: perhaps +/- 1, i'd have to think about it. but approximately right.
20:55karolherbst: but what about not normalized ones
20:56imirkin_: note that f2i isn't an operation you want to do
20:56imirkin_: range of floats is considerably larger than range of integers
20:56karolherbst: ahh, right
20:56karolherbst: forgot about that
20:57karolherbst: maybe I really just do a flog2 for now and later improve it
20:57karolherbst: but I see the point now, getting the exponent is indeed much faster
20:57karolherbst: than doing an actual log2
20:57imirkin_: do you have to deal with denorms?
20:57karolherbst: imirkin_: well, random input
20:58imirkin_: does it matter?
20:58imirkin_: i.e. are you expected to produce correct results there?
20:58karolherbst: the spec is super unclear
20:58imirkin_: if not, just extbf and move on :)
20:58imirkin_: (well, the exponent is biased, so you have to unbias it)
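The "extract the bitfield and unbias it" suggestion can be sketched in plain C as below. It assumes IEEE 754 binary32 floats and a normal, finite, non-zero input; denorms, zero, inf and NaN are deliberately not handled, and the helper name is made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: floor(log2(|x|)) for a normal, finite, non-zero float.
 * Assumes IEEE 754 binary32: 1 sign bit, 8 exponent bits, 23 mantissa bits. */
static int exponent_of_normal_f32(float x)
{
    uint32_t bits;
    int biased;

    memcpy(&bits, &x, sizeof bits);

    /* Bit-field extract: 8 bits starting at bit 23. The sign bit is dropped. */
    biased = (int)((bits >> 23) & 0xffu);

    /* The stored exponent carries a bias of 127. */
    return biased - 127;
}
```

With this sketch a denorm or zero yields -127 and inf or NaN yields 128, so a caller that cares about those inputs would need extra handling.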
21:01karolherbst: imirkin_: the only thing the spec is saying is "Return the exponent as an integer value."
21:01karolherbst: and logb is "Compute the exponent of x, which is the integral part of log_r|x|."
21:02karolherbst: maybe it isn't log2 to begin with
21:03karolherbst: imirkin_: at least C and C++ are a bit more specific: "Extracts the value of the unbiased radix-independent exponent from the floating-point argument arg, and returns it as a floating-point value."
21:04karolherbst: for logb
21:05karolherbst: logb(123.45) = 6, because 1.928906 * 2^6
21:19imirkin_: (0.96445312500000002, 7)
21:20imirkin_: you sure it should be 6?
21:20imirkin_: the mantissa is supposed to be in the range [0.5, 1) to be considered a normal float
21:23karolherbst: imirkin_: for frexp you get 0.964453 * 2^7
21:23karolherbst: but that's C/C++
21:24karolherbst: well, maybe I should check what the CTS expects
21:24karolherbst: could clarify things
21:29imirkin_: frexp pulls the mantissa and exponent
21:37karolherbst: yeah, but ilogb still returns a different exponent
21:37imirkin_: so +1 :)
21:37karolherbst: most likely, yes
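The off-by-one between the two conventions can be checked with a small C sketch. frexp normalizes the mantissa to [0.5, 1), while logb/ilogb use [1, 2), so the ilogb-style exponent is the frexp exponent minus one. The helper name is hypothetical, and the relation is only claimed here for finite, non-zero input.

```c
#include <math.h>

/* Hypothetical helper: ilogb-style exponent derived from frexp.
 * Valid for finite, non-zero x only. */
static int ilogb_via_frexp(double x)
{
    int e;

    /* frexp splits x into m * 2^e with 0.5 <= |m| < 1; m is not needed. */
    (void)frexp(x, &e);

    /* Shifting the mantissa range to [1, 2) lowers the exponent by one. */
    return e - 1;
}
```

For 123.45 this matches both decompositions quoted above: 0.964453125 * 2^7 from frexp, and 1.92890625 * 2^6 in the logb convention.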
22:06pendingchaos: imirkin_: setting LINE_WIDTH_SMOOTH instead of LINE_WIDTH_ALIASED seems to fix the issue
22:06imirkin_: pendingchaos: interesting. perhaps we should just set both.
22:06imirkin_: iirc blob driver just sets both and moves on with life.
22:07pendingchaos: my guess is that they were merged on GM20x+ and LINE_WIDTH_ALIASED is a NOP
22:07pendingchaos: I think I'll create a patch having both set
22:09imirkin_: could be.
22:10imirkin_: thinking about that stuff gives me a headache :)
22:10imirkin_: i did understand it at one point, for like 2 minutes
22:10imirkin_: smooth is when you have MSAA, while aliased is when you don't?
22:10imirkin_: and the line width is set different because ... yeah no clue.
22:11imirkin_: (the line rasterization algo is different, so different inputs for different algos? the size limits are different? who knows.)
22:11imirkin_: anyways, nice find =]
22:35pendingchaos: imirkin_: it seems I forgot to add the v2 in the title of the second revision
22:37imirkin_: no worries
23:01karolherbst: imirkin_: I am sure I asked that already, but was there an instruction to do x + (y - x) * a?
23:02karolherbst: or just do sub/mul/add
23:02imirkin_: this isn't x87
23:02karolherbst: mhh sub/mad should be possible
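The sub/mad split for x + (y - x) * a can be written out as two steps, sketched here in C. The function name is made up for the example, and whether a given compiler or GPU backend actually fuses the second step into a single multiply-add is not something this sketch guarantees.

```c
/* Hypothetical helper: linear interpolation x + (y - x) * a in two steps. */
static float lerp_sub_mad(float x, float y, float a)
{
    /* Step 1: the subtraction. */
    float d = y - x;

    /* Step 2: multiply-add shape (d * a + x). */
    return d * a + x;
}
```

For example, lerp_sub_mad(2.0f, 10.0f, 0.25f) computes d = 8, then 8 * 0.25 + 2 = 4.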
23:03karolherbst: yeah, no idea about x87 really
23:03imirkin_: with fun little ops like "2^x-1"
23:04imirkin_: or y*log2(x+1)
23:04karolherbst: I see
23:04imirkin_: (which only took 1000 cycles to complete)
23:04imirkin_: you could *feel* the speed...
23:05imirkin_: anyways, i haven't seen too much crazy stuff from nvidia
23:06karolherbst: well, on x86 those instructions usually don't exist natively anyway
23:06karolherbst: even if they are part of x86
23:06imirkin_: on x87.
23:06karolherbst: or x87
23:06karolherbst: well, when did they start microcoding?
23:06imirkin_: (the fpu coprocessor ... 8087 and so on)
23:06imirkin_: not on the 8087 :p
23:07karolherbst: but who knows
23:07karolherbst: maybe they had something like that already, it was just not replaceable
23:08karolherbst: mhh but well 1980.. highly unlikely
23:20imirkin_: it was common in the form of ROMs which would flip various enables for various ops
23:20imirkin_: but having an actual uarch i think would have been rare
23:20imirkin_: iirc i implemented something like that in a computer architecture class
23:21imirkin_: (in a software emulator)
23:24imirkin_: [and of course amusingly y*log2(x+1) is faster than y*log2(x) coz the taylor series is only nicely convergent for log2(x+1)...]