00:11 karolherbst: Lyude: pro tip: set LD_LIBRARY_PATH and LIBGL_DRIVERS_PATH inside the build_dir/libs + gallium (https://gist.github.com/karolherbst/4d1c9fc78756e8d239c2ba50135660a0)
00:11 karolherbst: ohh, it was lib, not libs
00:11 karolherbst: anyway
00:12 imirkin: pro tip: don't listen to people who tell you to set LIBGL_DRIVERS_PATH =]
00:12 karolherbst: :(
00:12 karolherbst: works for me
00:13 imirkin: just do the install
00:14 imirkin: and set the LD_LIBRARY_PATH
00:14 imirkin: everything else is going to end in tears
00:14 karolherbst: yeah, it's the safest way
00:16 karolherbst: we also don't do underclocking yet, because it makes no sense to do it
00:16 karolherbst: mangix: nvidia uses the same voltage on 405MHz and 51MHz, so the power consumption is the same
00:17 karolherbst: also your GPU is off when not used under nouveau
00:17 karolherbst: so meh
00:26 dboyan_: imirkin: is there a way to fold an always-true predicate input to PT?
00:27 dboyan_: imirkin: https://hastebin.com/delafevawa.bash is really awkward
00:27 imirkin: sure, with LoadPropagation you could
00:27 karolherbst: dboyan_: that should get opted away
00:27 imirkin: dboyan_: so there, the first thing to do is the set
00:27 imirkin: should become a mov
00:28 karolherbst: is there no opt for this already?
00:29 dboyan_: is mov able to set predicate?
00:32 imirkin: erm
00:32 imirkin: kinda? :)
00:32 imirkin: you can model it as such in nv50 ir
00:33 imirkin: however when you emit, you have to use a setp variant
00:33 imirkin: i think those are normally modeled as 'cvt' rather than 'mov' btw
00:34 dboyan_: so the goal is to first optimise the first two into a "set b32 $p0 0x1 0x1"
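A minimal sketch of the folding being discussed, assuming a hypothetical, heavily simplified IR (`set_insn` and `fold_set_imm` are illustrative names, not nv50_ir's real classes or passes). When both sources of a set are immediates, the comparison can be evaluated at compile time, and a constant-true result means uses of the predicate could be rewritten to PT. The condition code of the `set b32 $p0 0x1 0x1` above isn't shown in the log, so the sketch handles a few conditions generically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified IR: NOT nv50_ir's real classes. */
enum set_cond { SET_EQ, SET_NE, SET_LT, SET_GE };

struct set_insn {
    enum set_cond cond;
    bool src_is_imm[2]; /* true if source i is an immediate */
    uint32_t imm[2];    /* immediate value of source i (if any) */
};

/* If both sources are immediates, evaluate the comparison at compile time.
 * Returns true and stores the predicate's constant value in *result;
 * returns false (leaving *result untouched) when folding is not possible. */
static bool fold_set_imm(const struct set_insn *insn, bool *result)
{
    assert(insn && result);
    if (!insn->src_is_imm[0] || !insn->src_is_imm[1])
        return false;

    const uint32_t a = insn->imm[0], b = insn->imm[1];
    switch (insn->cond) {
    case SET_EQ: *result = (a == b); return true;
    case SET_NE: *result = (a != b); return true;
    case SET_LT: *result = (a <  b); return true; /* unsigned compare assumed */
    case SET_GE: *result = (a >= b); return true;
    }
    return false;
}
```

In a real pass, a constant `true` would let later uses of `$p0` read PT instead, leaving the set dead; how nv50_ir actually models a predicate-producing mov (as a cvt, per imirkin) is not captured by this sketch.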
00:38 Lyude: karolherbst: huh, never knew about that one
00:39 karolherbst: Lyude: don't use it unless you know what you're doing and are feeling lazy :p
00:39 imirkin: don't do it
00:39 Lyude: ah, lol
00:39 imirkin: that way lies pain and suffering
00:39 imirkin: (and endless debugging)
00:39 karolherbst: never had pain and suffering with that though :p
00:43 dboyan_: well, I have other stuff to work on this morning. Will come back to this in the evening. Hopefully I can roll out the ballot series today.
00:46 imirkin: dboyan_: just pushed the lsb/etc fix
00:49 dboyan_: oh great, I will rebase upon this one and see
00:50 dboyan_: imirkin: btw, what's the difference between getScratch() and new_LValue(FILE_GPR)?
00:51 dboyan_: i used some of the latter in my code, shall I change it to the former?
00:52 imirkin: dboyan_: minimal?
00:52 imirkin: dboyan_: in general we don't use new_LValue type stuff directly
00:53 imirkin: getScratch has the right defaults :)
00:53 imirkin: as does getSSA()
00:54 dboyan_: imirkin: I used it that way because I saw some new_LValue(FILE_PREDICATE)
00:55 imirkin: well, there's no way to get that iirc
00:55 imirkin: (maybe there is?)
00:55 dboyan_: okay, i will update my gpr usage to getScratch() instead
01:22 whompy: imirkin: I've had endless trouble running make install lately. Any chance you could pastebin your configure line somewhere for comparison?
01:23 whompy: I've been trying to get master to build to test your recent fix to no avail. Losing a fight with the build system.
02:21 imirkin: whompy: ./configure --prefix=/path/to/prefix --with-gallium-drivers=nouveau,swrast --with-dri-drivers=nouveau --enable-texture-float --enable-gles2 --enable-gallium-llvm --enable-gbm --with-egl-platforms=x11,drm --enable-nine --enable-debug CFLAGS='-march=corei7 -O0' CXXFLAGS='-march=corei7 -O0'
02:27 imirkin: mupuf: did the key on your box change since aug 2016? getting an error when trying to pull up the vbios repo
03:30 imirkin: found one more issue with potential source reuse in the common code... not looking forward to auditing the per-target-specific lowering passes =/
03:43 dboyan_: well, working with per-target code is hard. I had to modify vote and implement shfl emitting nearly 3 times, and be extra-careful not to make mistakes
03:45 dboyan_: I have yet to check if the emitted code is correct on gk104 and gm107
03:47 imirkin: already found one issue
03:48 imirkin: admittedly it's with textureGrad on array textures. probably doesn't come up *too* often...
03:48 dboyan_: well, no idea :/
03:49 imirkin: i'm just doing it by observation.
04:01 whompy: imirkin: Thanks!
04:03 imirkin: whompy: did it help?
04:04 whompy: Can't test for a while. Working midnights on storm restoration for a utility. Figure I may as well do something with the downtime
04:06 whompy: I should dig up a pastebin of my make install output. It's strange: I know we have similar setups, but you hadn't mentioned any issues.
04:07 whompy: Hoping it's just something I can configure and ignore.
04:08 imirkin: yeah, it all worksforme
06:33 whompy: imirkin: Figured out the difference: the issue is popping up in Wayland.
07:32 karolherbst: can somebody with the game "darkest dungeon" create an apitrace and create a bug on bugzilla?
08:15 mangix: karolherbst: btw since you mentioned it previously, how do i reclock my gtx 980? fan is not an issue :)
08:45 dboyan: "Save 50% on Darkest Dungeon on Steam" ;)
08:47 karolherbst: I have it, but I can't upload an apitrace
08:48 dboyan: oh, that's unfortunate
10:32 mupuf: imirkin: yes, I had a server crash
12:05 hakzsam: dboyan: nice series!
12:09 dboyan: thanks :)
12:12 karolherbst: what is shfl? shift f? left?
12:12 dboyan: shuffle
12:12 karolherbst: ohhhh
12:12 karolherbst: interesting
12:13 karolherbst: kepler1 has no shuffle fyi
12:13 karolherbst: or has it?
12:13 dboyan: it has
12:13 karolherbst: sure?
12:13 dboyan: introduced in sm_30
12:13 karolherbst: I am quite sure it was a gk110+ feature
12:13 karolherbst: at least nvidia says so
12:13 hakzsam: no, it's sm30
12:14 karolherbst: odd
12:14 karolherbst: then nvidia isn't honest :p
12:14 dboyan: karolherbst: http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-shfl
12:15 karolherbst: if I find some time, I can test your patches on my kepler1 as well. In the gk110 paper they just say that the shuffle instruction is new for gk110
12:16 karolherbst: maybe shuffle can do more on gk110?
12:17 dboyan: basically the same
12:17 dboyan: yeah, in gk110 whitepaper "To further improve performance, Kepler implements a new Shuffle instruction"
12:18 karolherbst: maybe I misunderstood the section
12:18 dboyan: well, maybe they just forgot to mention it in gtx 680 whitepaper :p
12:18 karolherbst: could be
12:20 karolherbst: but the shuffle instruction isn't there on fermi then I assume?
12:21 dboyan: yeah
12:21 dboyan: you can tell from gf100.c in envytools
12:21 karolherbst: mhh, an assert within emitSHFL might make sense then
12:21 karolherbst: or do the same as with other kepler only instructions
12:38 dboyan: karolherbst: there isn't a good previous example of kepler-instruction, but an assert there is reasonable.
12:39 dboyan: s/kepler-instruction/gk104-only instruction/
12:54 hakzsam: dboyan: btw, is envydis up-to-date for shfl&vote compared to codegen?
12:54 hakzsam: (in case you figured out new bits)
12:55 dboyan: hakzsam: envydis decodes shfl and vote all right
12:55 hakzsam: ok, cool it's consistent then
12:55 hakzsam: because sometimes it's not :)
12:57 dboyan: hakzsam: About ARB_shader_clock, the blob is doing something crazy.
12:57 hakzsam: ah?
12:57 dboyan: hakzsam: The series I posted was somewhat mimicking the blob (but a lot simpler). I don't think we have reached a final decision yet
12:58 hakzsam: does it work?
12:59 dboyan: the blob puts clocklo in the upper 32 bits and clockhi in the lower bits, with a loop before it which allegedly prevents some sort of clocklo overflow
12:59 dboyan: this is just insane
13:00 dboyan: the v2 patch just sticks clocklo (or clock on nv50) into the upper 32 bits and sets the lower bits to 0. Passed piglit
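To make the v2 layout concrete, here is a small sketch assuming plain 32-bit counter values (the function names are hypothetical, not taken from the patch). The counter goes in bits 63..32 and bits 31..0 are zero. Because unsigned arithmetic is modular, the difference of two such values still yields the elapsed tick count (shifted up by 32), even across a single wrap of the 32-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional layout: hi word in bits 63..32, lo word in bits 31..0. */
static uint64_t pack_hi_lo(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* v2 layout as described: the 32-bit counter goes in the upper 32 bits
 * and the lower 32 bits are always 0. */
static uint64_t pack_v2(uint32_t clocklo)
{
    return pack_hi_lo(clocklo, 0u);
}

/* Elapsed ticks between two v2 values. Unsigned 64-bit subtraction wraps
 * modulo 2^64, so a single wrap of the 32-bit counter is still handled. */
static uint32_t ticks_between(uint64_t start, uint64_t end)
{
    return (uint32_t)((end - start) >> 32);
}
```

For example, going from a counter value of 0xFFFFFFFF to 1 is reported as 2 ticks.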
13:01 hakzsam: ok cool
13:01 hakzsam: did you run the ARB_shader_ballot tests with blob?
13:01 hakzsam: IIRC, nha didn't
13:01 dboyan: haven't yet
13:02 hakzsam: would be nice to confirm if they all pass
13:02 dboyan: but one thing is certain, it will fail fs-builtin-variables.test
13:02 dboyan: because it's wrong
13:02 dboyan: [on nvidia, which has thread group size of 32]
13:03 hakzsam: because the test assumes 64?
13:03 hakzsam: I didn't read it
13:04 dboyan: It's about gl_SubGroup{Ge,Gt}MaskARB. The upper bits should be 0 instead of 1.
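A sketch of the mask semantics described above. It assumes a subgroup size of 32 stored in 64-bit masks, and that bits at or above the subgroup size are 0 (the point confirmed with nha). `compute_masks` is a hypothetical helper, not Mesa code.

```c
#include <assert.h>
#include <stdint.h>

/* Per-invocation masks, assuming bits >= subgroup size must be 0. */
struct subgroup_masks { uint64_t eq, ge, gt, le, lt; };

static struct subgroup_masks compute_masks(unsigned id, unsigned size)
{
    assert(size >= 1 && size <= 64 && id < size);
    /* All bits below `size` set; size == 64 is special-cased because
     * shifting a 64-bit value by 64 is undefined behaviour in C. */
    const uint64_t valid = (size == 64) ? ~0ull : ((1ull << size) - 1);
    struct subgroup_masks m;
    m.eq = 1ull << id;
    m.lt = m.eq - 1;          /* bits strictly below id */
    m.le = m.lt | m.eq;       /* bits up to and including id */
    m.ge = ~m.lt & valid;     /* bits from id up to size - 1 only */
    m.gt = ~m.le & valid;     /* bits above id, up to size - 1 only */
    return m;
}
```

With `size = 32`, invocation 0 gets a Ge mask of 0xFFFFFFFF rather than all 64 bits set, which is the behaviour the test is being corrected toward.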
13:04 dboyan: I have confirmed with nha
13:05 hakzsam: ok
13:06 hakzsam: so this one is expected to fail on blob as well
13:06 dboyan: exactly
13:11 imirkin: dboyan: if SHFL is used by the gm107 lowering, you should adjust the gm107 lowering to conform to your updated/new definition
13:12 hakzsam: yeah, emit src(2)
13:13 dboyan: imirkin: I think I'll make sure all the uses there are actually using 0x1c03
13:14 imirkin: you can also stuff things like that into the subOp
13:14 imirkin: depending on what it is
13:16 hakzsam: dboyan: "Write a testing infrastructure with performance counters" what do you mean?
13:19 hakzsam: dboyan: presumably, you might be interested by https://cgit.freedesktop.org/~hakzsam/piglit/commit/?h=perf_counters&id=38ea6fb2299ad52caf99a6309ee0e3107738e332
13:19 dboyan: hakzsam: I think I mean to write a program to analyze the efficiency of shader program by analyzing values from performance counter during benchmark
13:20 hakzsam: so yeah, piglit+amd_perf_monitor is what you need
13:21 hakzsam: I mean, this can be useful for small programs
13:22 hakzsam: GALLIUM_HUD is better if you want to monitor perf counters during benchmarks
13:25 hakzsam: note that the thing I pointed out is for validating perf counters, not for real measurements, but you get the idea
13:26 dboyan: yeah, got it
13:29 dboyan: well, did a quick run of the ARB_shader_ballot tests with the blob. The blob doesn't seem happy with uint divided by float
13:30 dboyan: error C1020: invalid operands to "/"
13:30 imirkin: yeah, i dunno if that's legal
13:34 dboyan: uint can be implicitly converted to float, according to 4.1.10 of glsl 450 spec
13:34 hakzsam: dboyan: and btw, if you need info about perf counters and/or maxwell sched codes, please ask
13:35 dboyan: hakzsam: I guess my main test system will be kepler
13:36 dboyan: I get hints of gm107 isa from my pascal machine
13:36 dboyan: ;)
13:36 hakzsam: pascal uses the same ISA as maxwell, so :)
13:38 hakzsam: dboyan: so, how many fails with blob?
13:41 dboyan: hakzsam: 1/9 if I don't change the code to the blob's flavor
13:41 hakzsam: the expected one I guess?
13:42 dboyan: oh, 1/9 was pass rate
13:42 dboyan: the others fail while compiling
13:42 hakzsam: ah ok
13:42 imirkin: dboyan: are you using a high enough glsl version?
13:43 imirkin: that rule was introduced in GL 4.0
13:43 imirkin: er, GLSL 4.00
13:43 dboyan: ah, the test is written in glsl 150
13:43 imirkin: (or GL_ARB_gpu_shader5)
13:43 dboyan: interesting, so the blob is actually right
13:44 imirkin: blob often messes up such details though
13:44 imirkin: is there a ARB_gpu_shader5 enable?
13:44 dboyan: nope
13:44 imirkin: otoh, some implicit conversions were possible even before then
13:44 imirkin: i dunno if this one fits into the previously-legal ones or not
13:45 dboyan: uint -> float is in glsl 150
13:45 dboyan: well
13:46 imirkin: so blob could be wrong
13:47 imirkin: it's been known to happen
13:47 dboyan: yeah, I know, blob sometimes even makes *serious* mistakes
13:48 imirkin: but at the limits of what gets enabled in what glsl version, it's almost always wrong :)
13:48 hakzsam: and ballot is written against GLSL 4.50
13:49 imirkin: doesn't mean GLSL 4.50 is required for it tho
13:49 dboyan: yeah, exactly
13:49 hakzsam: sure.
13:50 dboyan: some time ago I found the blob would compile atomic*(s.y, a) incorrectly if s was a shared variable
13:51 dboyan: that was only fixed in 378 series
14:16 dboyan: imirkin: Do you know when handleManualTXD will be used?
14:16 dboyan: or can you point out a piglit test that uses it?
14:17 dboyan: now I'm sure dFdx and dFdy use SHFL.BFLY ..., 0x1c03
14:33 hakzsam: dboyan: try tests/spec/arb_shader_texture_lod/execution/glsl-fs-shadow2DGradARB-06.shader_test maybe
14:37 dboyan: hakzsam: I think that's exactly what I need, thanks
14:38 dboyan: so all of them use 0x1c03
14:39 dboyan: the 0x1c is a mask of the high 3 bits of the thread id in a warp, so the exchange is within each group of 4 threads
14:42 dboyan: and the 0x3 there is a "clamp" value
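The packed operand can be modelled in C. This sketch follows the `shfl` pseudocode in the PTX ISA documentation linked earlier (segment mask in bits 12..8, clamp in bits 4..0) and assumes the SASS SHFL.BFLY encoding behaves the same way. `shfl_bfly` is a hypothetical simulator, not driver code. With `c = 0x1c03` and `b` of 1 or 2, every lane's partner stays inside its group of 4, while `c = 0x1f` lets the exchange span all 32 lanes.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Field extraction for the packed "c" operand, following the PTX shfl
 * pseudocode: segment mask in bits 12..8, clamp value in bits 4..0. */
static unsigned shfl_segmask(uint32_t c) { return (c >> 8) & 0x1fu; }
static unsigned shfl_clamp(uint32_t c)   { return c & 0x1fu; }

/* Simulate a butterfly shuffle over one warp. Each lane reads src[lane ^ b]
 * unless that source lane exceeds the upper bound of the lane's segment,
 * in which case the lane keeps its own value. */
static void shfl_bfly(const uint32_t src[WARP_SIZE], uint32_t dst[WARP_SIZE],
                      unsigned b, uint32_t c)
{
    const unsigned segmask = shfl_segmask(c);
    const unsigned clamp = shfl_clamp(c);

    for (unsigned lane = 0; lane < WARP_SIZE; lane++) {
        /* Highest lane this lane may read from, e.g. (lane & 0x1c) | 3
         * for c = 0x1c03, i.e. the last lane of its group of 4. */
        unsigned max_lane = (lane & segmask) | (clamp & ~segmask & 0x1fu);
        unsigned j = lane ^ (b & 0x1fu);
        if (j > max_lane)
            j = lane;
        dst[lane] = src[j];
    }
}
```

Per that pseudocode only an upper bound is checked, so larger `b` values combined with a segmented `c` don't behave symmetrically; the sketch only claims the quad-local case used by dFdx/dFdy.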
15:32 imirkin: dboyan: when you have a situation that's not handleable by the TXD op directly
15:32 imirkin: dboyan: for example textureGrad with a cube or shadow texture
15:34 dboyan: yeah, hakzsam has pointed out an example and I have verified it with my gp107. All of the OP_SHFL in gm107 lowering use 0x1c03
15:35 dboyan: It's a derivative of some sort
15:35 dboyan: so all of the data is exchanged in the pattern specified by 0x1c03
17:32 imirkin: dboyan_: yeah, so for dfdx/dfdy you want a particular interchange pattern between lanes. although you probably want a diff one for ddx than ddy...
18:16 jam_: imirkin: noob question, what does the shader blitter consist of? also what is exa?
18:18 jam_: jam_: found info on exa https://en.wikipedia.org/wiki/EXA
18:23 imirkin: 'the shader blitter'?
18:24 jam_: ah, i mean the "blitter shader"
18:24 jam_: hakzsam: you mentioned it in your patch here https://cgit.freedesktop.org/~hakzsam/mesa/commit/?h=gm107_scheduler&id=21c1edcb99385d5aa986e64e8849e9dda5a8fa1b
18:34 imirkin: it's just a shader meant to copy a texture to the framebuffer
18:34 jam_: so the "composite" is the same as a blit?
18:36 imirkin: yeah, just with a blend maybe? not sure.
18:36 jam_: ^sorry for the ambiguity; what i mean is, i see bitfields stored in arrays like "NV110FP_CACompositeSrcAlpha" in many of them
18:36 imirkin: ah, that's in reference to various EXA ops
18:36 jam_: oh!
18:36 imirkin: the bits are the assembled shaders
18:37 imirkin: there's a makefile which helps build those with envyas
18:38 jam_: envytools does look neat! like a RE gem
18:51 imirkin: hakzsam: is "raw" image access ever used with GL?
18:51 imirkin: hakzsam: there's an issue for nve4 accessing image texture buffers, in that we cut off the low 8 bits of the address. however that's not good for glTexBufferRange()
18:51 imirkin: hakzsam: so i have to stick the low 8 bits somewhere
19:11 imirkin: hakzsam: eh... blob returns 256 for the tbo alignment, so meh - let's just bump the min and not worry about it
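A small sketch of the constraint being discussed. It assumes the hardware path keeps only `address >> 8` (the low 8 bits cut off, as described above), that buffer base addresses are themselves 256-byte aligned, and uses the 256-byte value the blob reports for `GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT`. The helper names are hypothetical. Requiring 256-byte aligned offsets makes that truncation lossless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed minimum offset alignment, matching the 256 the blob reports. */
#define TBO_OFFSET_ALIGNMENT 256u
static_assert((TBO_OFFSET_ALIGNMENT & (TBO_OFFSET_ALIGNMENT - 1)) == 0,
              "alignment must be a power of two");

/* True if a glTexBufferRange offset satisfies the assumed alignment. */
static bool tbo_offset_ok(uint64_t offset)
{
    return (offset & (TBO_OFFSET_ALIGNMENT - 1)) == 0;
}

/* What survives if only the address bits above bit 7 are kept. */
static uint64_t truncated_address(uint64_t addr)
{
    return (addr >> 8) << 8;
}
```

An offset of 260 would lose its low 4 bytes under this scheme, while 512 round-trips exactly.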
22:21 dboyan_: imirkin: The exact exchange pattern is controlled by the second input and the third one. The second inputs for dfdx and dfdy are different, but the third (0x1c03) is the same.
22:21 imirkin: dboyan_: ah ok
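To illustrate why dfdx and dfdy differ only in the second SHFL input, here is a sketch of fine derivatives within a quad. It assumes (hypothetically; the log doesn't state the layout) that lane bit 0 selects the pixel column and lane bit 1 the row. The horizontal partner is then `lane ^ 1` and the vertical partner `lane ^ 2`, both reachable with the quad-local `0x1c03` pattern.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Quad-local butterfly: what SHFL.BFLY with c = 0x1c03 amounts to for
 * b = 1 or 2, since the partner lane is always in the same group of 4. */
static float quad_bfly(const float v[WARP_SIZE], unsigned lane, unsigned b)
{
    assert(b == 1 || b == 2);
    return v[lane ^ b];
}

/* Hypothetical quad layout: lane bit 0 = column, lane bit 1 = row.
 * Each derivative is (right - left) or (bottom - top) within the quad. */
static void fine_derivatives(const float v[WARP_SIZE],
                             float ddx[WARP_SIZE], float ddy[WARP_SIZE])
{
    for (unsigned lane = 0; lane < WARP_SIZE; lane++) {
        float nx = quad_bfly(v, lane, 1); /* horizontal neighbour */
        float ny = quad_bfly(v, lane, 2); /* vertical neighbour   */
        ddx[lane] = (lane & 1) ? v[lane] - nx : nx - v[lane];
        ddy[lane] = (lane & 2) ? v[lane] - ny : ny - v[lane];
    }
}
```

Only the `b` operand (1 vs. 2) changes between the two directions; the group-of-4 pattern stays fixed, which matches the observation above.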
22:35 dboyan_: actually 0x1c03 just means "4 threads as a group", just as "0x1f" means "all 32 threads as a group"
22:37 imirkin: makes sense
22:37 imirkin: it splits up the mask into 2
22:42 dboyan_: yeah, there is a "mask" and "clamp
22:42 dboyan_: missed a " at the ending