00:01 gregory38: where do you put coherent buffer object ?
00:01 gregory38: maybe it just doesn't play well
00:05 gregory38: one silly question, the code setup a draw call to do the pbo stuff
00:05 gregory38: won't it be possible to prevalidate/precompile the stuff a la Vulkan
00:13 gregory38: + 30.49% 1.15% pcsx2_GSReplayL [.] try_pbo_upload_common ▒
00:13 gregory38: + 16.55% 0.89% pcsx2_GSReplayL [.] st_pbo_draw ▒
00:13 gregory38: + 12.34% 0.39% pcsx2_GSReplayL [.] cso_draw_arrays ▒
00:14 gregory38: It seems we spend more time to setup the pbo draw rather than the actual draw (well at least on my example)
00:16 gregory38: it is really strange, I don't understand where we spend the time in try_pbo_upload_common (I mean the 14% that isn't part of st_pbo_draw)
00:35 gregory38: have a good night :)
10:48 RSpliet: mwk: I found this paper mentioning the following about "GPU memory systems": ( https://www.cs.utah.edu/~rajeev/pubs/sc14.pdf )
10:48 RSpliet: In addition, to prevent pathological channel camping, where unusual access strides lead to excessive contention on one or few of the six simulated channels, the channel address is formed by XOR-ing a subset of higher-order bits with some lower order bits to provide a better spread across the channels as follows:
10:48 RSpliet: channel address = {addr[47:11]:(addr[10:8] XOR addr[13:11])} % 6
10:50 mwk: RSpliet: does that refer to Fermi?
10:51 mwk: hmm, it actually matches what I observed on Fermi
10:51 mwk: fair enough
10:51 RSpliet: it doesn't refer to any GPU in particular, but it's a 2014 paper (interestingly written by some AMD Research people along with Utah uni + UT Austin), one of the authors now works for NVIDIA
10:51 mwk: but then I didn't look close at it
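For reference, the channel hash quoted from the paper can be written out as a small function. This is only a sketch of the formula as pasted above; the function name `channel_index` and the `num_channels` parameter are made up for illustration, and nothing here is confirmed to match any specific GPU.

```python
def channel_index(addr: int, num_channels: int = 6) -> int:
    """Map a 48-bit address to a channel using the formula quoted above."""
    upper = (addr >> 11) & ((1 << 37) - 1)   # addr[47:11], 37 bits
    low3 = (addr >> 8) & 0x7                 # addr[10:8]
    mix3 = (addr >> 11) & 0x7                # addr[13:11]
    hashed = (upper << 3) | (low3 ^ mix3)    # {addr[47:11] : (low3 XOR mix3)}
    return hashed % num_channels


# Example: walk eight consecutive 256-byte blocks and list their channels.
channels = [channel_index(i * 0x100) for i in range(8)]
```

The XOR only touches the three lowest bits of the concatenated value, so the higher-order bits still take part in the final modulo.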
13:57 mandragore: hello, i have some problems which i think might be related to the nouveau driver. I am running an archlinux installation, xorg and i3wm. No composition managers. I updated my arch system and after a reboot my (laptop) screen started flickering! The problem starts after the first login (the first login screen is ok) and then even if I log out even the login screen flickers. I asked in the archlinux channel and after they saw my dmesg logs (ht
13:59 RSpliet: mandragore: your message was truncated just before the dmesg URL
14:08 fassl: hello, did anybody try or succeeded in passing through an optimus card via iommu here?
14:11 karolherbst: fassl: you are aware that you need a compatible CPU and motherboard?
14:14 fassl: i think it should be, i can see the card in the vm, nouveau in the vm cannot read the bios, and in win vm the driver cannot start either
14:15 fassl: how to be sure if it supports this?
14:16 fassl: probably have to check FLR and if it does not share interrupts with other devices?
14:43 fassl: karolherbst, it seems it does not support FLR
14:43 fassl: so i guess a no way here
15:12 karolherbst: imirkin: a thought I have: if we have conditional branches in the shader, but both branches have the same number of instructions to execute, are joinat/joins still needed? I would say no, but I think the real answer is yes?
15:20 karolherbst: uhh "Kepler GK110 allows double precision instructions to be paired with other instructions"
15:28 imirkin: karolherbst: joinat/join has nothing to do with instruction counts
15:28 imirkin: karolherbst: imagine you're actually implementing a GPU
15:28 imirkin: karolherbst: and you have 32 threads that are all executing in a single "SIMD" type of thing
15:28 imirkin: and then *one* thread decides to branch and another doesn't
15:28 imirkin: what do you do?
15:28 imirkin: now think all of that through, and then think about how "join" might come in useful.
15:29 karolherbst: all threads within a warp execute the same instruction, right?
15:39 karolherbst: okay, I think I get it now
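The thought experiment above can be played through with a toy model. This is a rough sketch, not how the hardware implements joinat/join: the function `run_divergent_if` and the idea of saving a lane mask on a Python list are assumptions made purely for illustration. All lanes share one program counter, so each side of the branch is run as a masked pass over the whole warp, and the join puts every lane back in step.

```python
def run_divergent_if(cond, then_op, else_op, values):
    """Toy model of one warp executing `if cond: then_op else: else_op`."""
    full_mask = [True] * len(values)
    join_stack = [full_mask]                  # "joinat": remember what to restore
    taken = [cond(v) for v in values]
    not_taken = [not t for t in taken]
    out = list(values)
    passes = 0
    for mask, op in ((taken, then_op), (not_taken, else_op)):
        if any(mask):                         # a path no lane needs is skipped
            passes += 1
            out = [op(v) if m else v for v, m in zip(out, mask)]
    active = join_stack.pop()                 # "join": all lanes active again
    return out, passes, active


# Lanes disagree on the condition, so both paths are executed.
mixed = run_divergent_if(lambda v: v % 2 == 0, lambda v: v * 10,
                         lambda v: v + 100, [0, 1, 2, 3])
# Every lane takes the same side, so only one path is executed.
uniform = run_divergent_if(lambda v: True, lambda v: v * 10,
                           lambda v: v + 100, [0, 1, 2, 3])
```

Note that the instruction count of each path never enters the picture; what matters is that lanes may disagree and have to be brought back together afterwards.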
16:12 karolherbst: imirkin: any idea what the unknown bits in joinat are about? joinat 0x7958 [unknown: 00001c00 00000000]
16:21 imirkin_: karolherbst: yeah, it's the per-instruction predicate
16:22 imirkin_: envydis doesn't parse it because those predicates don't really apply
16:22 imirkin_: but blob sets those bits, and iirc nouveau might too
16:26 karolherbst: okay
17:00 mwk: karolherbst: what ISA is that?
17:01 mwk: Tesla and Fermi have an op set in envydis that makes them ignore predicate field for joinat and some other insns, I suppose Maxwell needs it too
17:12 imirkin_: karolherbst: i assume he's looking at gk104
17:12 imirkin_: er, mwk: --^
19:10 gregory38: Hello
19:10 gregory38: To correct my benchmark from yesterday,
19:11 gregory38: Due to 32 bits cross-compilation issue --disable-asm is used to compile mesa.
19:11 gregory38: Don't know the speed impact.
19:12 gregory38: Otherwise I did a quick test on uber shader based on uniform.
19:12 gregory38: It isn't free for the GPU, it is actually faster when lots of shaders are toggled. But otherwise it hurts GPU perf
19:13 gregory38: I only replaced a couple of ifdef branches, so it doesn't feel like the best solution
19:42 gregory38: Hum, I manage to enable the asm stuff. Impact remains small
20:11 karolherbst: mwk: yeah nve6
20:12 imirkin_: gregory38: coherent buffers are always in gart
20:12 imirkin_: (aka system memory)
20:12 karolherbst: gregory38: can you paste your uber shader somewhere?
20:13 gregory38: imirkin_: ok. I think Nvidia does the same
20:14 imirkin_: there's really no other way to do it...
20:14 gregory38: karolherbst: do you want the full shader or the part that I have changed
20:14 imirkin_: with persistent-but-not-coherent you could still have it in vram
20:15 gregory38: oh, sorry my bad. I used persistent but not coherent actually
20:15 gregory38: I use explicit flush
20:15 imirkin_: well, nouveau sticks both into gart
20:15 imirkin_: but there's probably some way to make it go via vram
20:15 imirkin_: anyways, explicit flush is much nicer
20:16 imirkin_: otherwise you have to flush a lot of caches on *EVERY DRAW*
20:17 gregory38: yes and it avoids blowing up debug tools ;)
20:17 imirkin_: yep
20:17 gregory38: karolherbst: https://gist.github.com/gregory38/fb2deba8d3772203e2b821353034e7b1
20:17 gregory38: it is the only part that I changed
20:17 karolherbst: gregory38: I meant the complete thing
20:17 gregory38: with header and such ?
20:18 karolherbst: the glsl code is enough
20:18 gregory38: I'm asking because I've 2 files
20:19 karolherbst: well I thought you were writing one fragment shader for everything?
20:20 gregory38: https://gist.github.com/gregory38/98e054c9a4d1c17fed43b56aaca5c737
20:21 gregory38: sorry for the horror
20:21 gregory38: the changes are on the atst function
20:21 karolherbst: ahh okay, so geometry+vertex+fragment, right?
20:21 gregory38: the geometry and vertex "core" is in a separate file
20:22 gregory38: in short it is a mess
20:22 karolherbst: well I just need the glsl source for now
20:23 gregory38: karolherbst: do you want the TGSI before and after ?
20:24 gregory38: note maybe I use it wrongly, I don't know if there is switch case in glsl
20:24 karolherbst: no, just the glsl is fine ;)
20:24 karolherbst: gregory38: do you add special defines to the source before compiling those?
20:25 gregory38: yes the header :p
20:25 karolherbst: ah I see
20:25 gregory38: to define the various define
20:25 karolherbst: okay, then I need that too
20:25 gregory38: karolherbst: what do you want to do ?
20:25 karolherbst: compile it on my gpu and see what binary is emitted
20:26 gregory38: ah yes, I only have a kepler GPU
20:26 gregory38: (I didn't test nvidia by the way)
20:26 karolherbst: me too, but I can tweak stuff rather easily
20:26 karolherbst: maybe we do something super stupid
20:28 gregory38: my shader are quite small
20:28 gregory38: so the overhead of branch is bigger
20:29 karolherbst: well, as I said, maybe we do something stupid :p
20:29 karolherbst: I doubt that branching is the problem here
20:29 gregory38: https://gist.github.com/gregory38/98e054c9a4d1c17fed43b56aaca5c737
20:30 karolherbst: now you gave me the same thing twice
20:30 gregory38: see the comment below
20:30 karolherbst: ahh
20:30 imirkin_: gregory38: fyi, you can now do clipping/culling in GL ES with GL_EXT_clip_cull_distance
20:30 gregory38: I took a random shader, but I don't know if they are all slower
20:30 gregory38: or only some of them.
20:31 imirkin_: gregory38: and also you can't always use gl_PointSize in geometry shaders on ES - only if GL_EXT/OES_geometry_point_size is defined
20:31 karolherbst: uhh
20:31 karolherbst: I hope I messed up
20:31 karolherbst: I really hope
20:32 karolherbst: okay
20:32 gregory38: imirkin_: well I don't support es anymore
20:32 karolherbst: imirkin_: you will love this: https://gist.github.com/karolherbst/eb2e453be25cdb06839af46b73ad21b0
20:33 imirkin_: gregory38: ah, i saw the #if !pGL_ES in there for clipdistance
20:33 karolherbst: 141-148 is the hit
20:33 imirkin_: karolherbst: that's usual
20:33 gregory38: yes, actually it is used to avoid issue with the WIN intel driver
20:33 karolherbst: all those joins?
20:33 gregory38: old stuff
20:33 imirkin_: karolherbst: yep.
20:34 karolherbst: imirkin_: never saw that many
20:34 imirkin_: karolherbst: that just means you have a ton of if() {} nested
20:34 imirkin_: each one pushes a value onto the "join" stack
20:34 karolherbst: yeah I am aware
20:34 imirkin_: now, with a TON of cleverness, we could probably do better in that case
20:34 imirkin_: but ... is it worth it?
20:35 karolherbst: I will check something
20:35 karolherbst: first I need to clean up that glsl shader
20:35 gregory38: karolherbst: which lead me to an interesting question
20:35 gregory38: is there a way to output the preprocessed glsl shader ?
20:35 karolherbst: maybe
20:36 gregory38: anyway the bad code is the atst function. And yes it is nested if, way too much
20:36 gregory38: imirkin_: by the way, what is the issue with subroutines
20:37 imirkin_: gregory38: RA is a huge mess
20:37 gregory38: RA?
20:37 imirkin_: register allocation
20:38 gregory38: oh, I could imagine
20:38 imirkin_: (registers are a whole lot faster than having a call stack in "lmem")
20:39 gregory38: but I could imagine uber shaders aren't nice either
20:39 karolherbst: okay
20:39 karolherbst: atst is the issue
20:40 gregory38: yes I told you, it was the only code difference
20:40 karolherbst: mhhh
20:40 karolherbst: imirkin_: there are nearly no ifs nested though
20:41 karolherbst: ohh wait
20:41 gregory38: and actually how are branch handled ?
20:41 karolherbst: the else path is nested
20:41 gregory38: it is a switch case
20:41 gregory38: and each case does if (dynamic_condition)
20:41 gregory38: then discard
20:41 imirkin_: karolherbst: well, not necessarily if's... any control flow
20:42 imirkin_: e.g. while ()'s etc
20:42 karolherbst: I was wrong anyway. the code goes on like this: if () else if else if ...
20:42 imirkin_: right
20:42 imirkin_: which unfortunately becomes
20:43 imirkin_: if () { } else { if () { } else { } }
20:43 karolherbst: yeah
20:43 imirkin_: doing a switch would be much better
20:43 imirkin_: but unfortunately glsl ir doesn't support switch statements
20:43 gregory38: does glsl support switch
20:43 imirkin_: they all become those stupid if/else's
20:43 karolherbst: uhh
20:43 imirkin_: glsl does have a switch()
20:44 gregory38: well I had the feeling that hw doesn't really have switch
20:44 imirkin_: and ideally we'd detect the unnecessary structurization in the backend
20:44 imirkin_: and drop the extra joinat/join pairs
20:44 imirkin_: but ... we don't live in an ideal world =/
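The "else if chain becomes nested if/else" point can be seen by analogy in Python's own parser, which stores every `elif` as an `If` node inside the parent's `orelse`. This is only an analogy and says nothing definitive about how glsl ir or nouveau represent the shader; the helper `if_nesting_depth` is a made-up name for this sketch.

```python
import ast


def if_nesting_depth(source: str) -> int:
    """Return how deep If nodes nest through their else branches."""
    def depth(node):
        if isinstance(node, ast.If):
            # An elif shows up as an If inside the parent's orelse list.
            return 1 + max((depth(n) for n in node.orelse), default=0)
        return 0

    tree = ast.parse(source)
    return max((depth(stmt) for stmt in tree.body), default=0)


# A flat-looking chain with three conditions and a final else.
chain = """
if mode == 1:
    pass
elif mode == 2:
    pass
elif mode == 3:
    pass
else:
    pass
"""
chain_depth = if_nesting_depth(chain)
```

Each extra case in the chain adds one more level, which matches the remark above that every level pushes another value onto the join stack.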
20:45 gregory38: note that inside the case, I have also dynamic branch
20:45 karolherbst: imirkin_: is discard like exit()?
20:45 imirkin_: karolherbst: yes
20:45 karolherbst: k
20:45 gregory38: actually a better solution would be a rewrite of the code
20:46 karolherbst: mhh
20:46 karolherbst: a switch statement helps actually
20:46 gregory38: maybe it would be possible to merge some cases
20:47 karolherbst: gregory38: could you test this shader instead? https://gist.github.com/karolherbst/6b0ff7a8428efa20889a412172ae9ea4
20:47 karolherbst: ohh I removed some stuff
20:47 karolherbst: just copy the atst function then
20:48 karolherbst: imirkin_: funny that this leads to better binary code
20:48 karolherbst: https://gist.github.com/karolherbst/167a655679517e4aa91ddc0458c32f16
20:49 karolherbst: in the tgsi is a BGNLOOP...
20:49 imirkin_: yeah....
20:49 imirkin_: coz that's what glsl ir lowers it to
20:49 karolherbst: I see
20:49 mwk: karolherbst: here's your fix
20:49 imirkin_: it's to deal with a conditional break, iirc
20:49 karolherbst: but it looks much better already, doesn't it?
20:49 karolherbst: mwk: thanks :D
20:50 karolherbst: not a single join in that shader anymore
20:51 imirkin_: gregory38: btw... i don't suppose that alphatest can be used here?
20:51 karolherbst: I guess the issue is that there are so many if statements
20:51 karolherbst: and we aren't smart about if the conditions are solvable by evaluating one to true
20:51 mwk: karolherbst: does this mean you no longer need the fix? :(
20:52 gregory38: imirkin_: well don't know, is it still available ? and does it allow not updating the depth while keeping the colors when the test fails
20:52 karolherbst: mwk: I was hoping those unknown bits would enable us to get super perf :p
20:52 imirkin_: gregory38: no... it's just a discard
20:52 mwk: nah, just junk
20:53 gregory38: karolherbst: much better
20:53 imirkin_: gregory38: not sure if it's still in core profiles...
20:53 karolherbst: gregory38: see :) we do stupid things
20:53 karolherbst: :p
20:53 mwk: I ignored them on Tesla, as some nv compiler set them on joinats and calls and whatnot
20:53 karolherbst: gregory38: so better use switches for that kind of stuff
20:53 mwk: but I haven't seen a Fermi compiler that sets them on calls before... seems it's a new thing
20:53 imirkin_: yeah, looks like it's gone =/
20:53 gregory38: it seems to be a bit faster (2fps) on my regression testcase but could just be cache/turbo variation
20:53 imirkin_: i guess "just do it in the shader"
20:53 karolherbst: anyway, I have no idea how we could optimize that if stuff in our backend :/
20:54 karolherbst: its a total mess
20:54 mwk: either way, Kepler and Maxwell might need the same fix
20:54 gregory38: karolherbst: put a warning so glsl dev can improve their code :)
20:54 karolherbst: gregory38: first we have to detect those
20:54 karolherbst: it isn't easy to do in glsl I think
20:55 karolherbst: mhh
20:55 karolherbst: actually
20:55 karolherbst: we have stuff like that:
20:55 karolherbst: set u8 $p0 ne u32 $r0 0x00000001
20:55 karolherbst: set u8 $p0 eq u32 $r0 0x00000002
20:55 karolherbst: set u8 $p0 eq u32 $r0 0x00000003
20:55 karolherbst: set u8 $p0 eq u32 $r0 0x00000004
20:55 karolherbst: ....
20:55 karolherbst: we should be able to detect this
20:55 karolherbst: but it ain't easy to rewrite the BBs
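Spotting the pattern itself is the easy half. The sketch below only does text matching on lines shaped like the ones pasted above; it is not nouveau code, the helper `switch_candidates` and its `min_cases` threshold are invented for illustration, and it does nothing about the hard part (rewriting the basic blocks).

```python
import re

# Matches lines such as: set u8 $p0 eq u32 $r0 0x00000002
SET_RE = re.compile(r"set u8 (\$p\d+) (eq|ne) u32 (\$r\d+) (0x[0-9a-fA-F]+)")


def switch_candidates(lines, min_cases=3):
    """Group compare-against-constant instructions by source register."""
    by_reg = {}
    for line in lines:
        m = SET_RE.search(line)
        if m:
            _pred, _op, reg, const = m.groups()
            by_reg.setdefault(reg, set()).add(int(const, 16))
    # A register compared against several distinct constants looks like a switch.
    return {reg: sorted(c) for reg, c in by_reg.items() if len(c) >= min_cases}


listing = [
    "set u8 $p0 ne u32 $r0 0x00000001",
    "set u8 $p0 eq u32 $r0 0x00000002",
    "set u8 $p0 eq u32 $r0 0x00000003",
    "set u8 $p0 eq u32 $r0 0x00000004",
]
found = switch_candidates(listing)
```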
20:56 imirkin_: gregory38: btw, if you want to look at what code nouveau is generating, you can use NV50_PROG_DEBUG=1
20:56 karolherbst: gregory38: are there cases now where shader switching is faster than this uber shader?
20:57 imirkin_: which will print both the TGSI and the resulting shader code (as well as the final binary encoding)
20:58 gregory38: imirkin_: yes but it still requires a brain to analyze the output and understand what is going wrong
20:58 gregory38: karolherbst: I would need to test it further
20:58 gregory38: so far I tested in native resolution (very easy on GPU)
20:59 imirkin_: gregory38: sure. it was just an fyi. since it appears that you use your head not just for eating :)
20:59 gregory38: lol
20:59 gregory38: Actually you already told me about those variables when I was trying to reproduce some SSO bugs
21:00 imirkin_: ah ok
21:00 imirkin_: i don't keep track of such things :)
21:00 gregory38: too much support ;)
21:01 gregory38: The good news is that uber shader is faster on one of my testcase
21:01 gregory38: 38->46 fps
21:01 imirkin_: nouveau or blob?
21:01 gregory38: nouveau
21:01 karolherbst: gregory38: sounds great
21:01 gregory38: blob is already at 65 without the optimization ;)
21:01 karolherbst: mh
21:02 karolherbst: 20 regs allocated
21:02 karolherbst: I am sure we could do better there maybe
21:03 gregory38: I guess if I want speed, I need to push all draw commands into a separate thread
21:03 imirkin_: bad move.
21:03 imirkin_: [if you want it to work on nouveau]
21:03 karolherbst: gregory38: you mean move all gl calls to a separate thread ;)
21:03 imirkin_: ah, that'd be fine
21:03 gregory38: yes I mean all gl calls
21:03 karolherbst: yeah
21:03 karolherbst: that would help a lot
21:03 imirkin_: but doing some stuff from one thread and other from another = insta-crash on nouveau
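The "all gl calls on one thread" idea boils down to a command queue drained by a single worker. The sketch below is a generic outline under that assumption, with no real GL calls in it; the class `CommandThread` and its methods are hypothetical names, and a real version would also need error handling and a way to wait for results.

```python
import queue
import threading


class CommandThread:
    """Runs every submitted call, in order, on one dedicated worker thread."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:              # sentinel: stop the worker
                break
            func, args = item
            func(*args)

    def submit(self, func, *args):
        """Queue a call; the caller does not wait for it to run."""
        self._queue.put((func, args))

    def close(self):
        """Drain the remaining commands and stop the worker."""
        self._queue.put(None)
        self._worker.join()


# Stand-in for draw calls: record which thread ran each command.
seen = []
gl_thread = CommandThread()
for i in range(3):
    gl_thread.submit(lambda n: seen.append((n, threading.get_ident())), i)
gl_thread.close()
```

Because only the worker ever runs the queued functions, the "some stuff from one thread and other from another" situation does not arise as long as nothing bypasses `submit`.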
21:04 karolherbst: especially because you know you are mainly CPU bound
21:04 gregory38: karolherbst: especially because Nvidia gives a 2x boost in various games
21:04 karolherbst: gregory38: with the threading stuff enabled vs disabled?
21:04 gregory38: yes
21:04 karolherbst: insane
21:04 karolherbst: :D
21:04 karolherbst: I am sure if you do it on your end it will get even faster
21:05 karolherbst: because you can specialize it more
21:05 gregory38: yes, surely, but the nvidia solution is free
21:05 gregory38: I just need to set a env variable and it is done
21:07 karolherbst: yeah I know
21:07 karolherbst: I am already wondering why nobody wrote a general implementation for that
21:07 karolherbst: because it can't be that hard though
21:08 imirkin_: you should implement it then
21:08 gregory38: lol
21:08 karolherbst: ;)
21:09 gregory38: ideally, I would need to take into account Vulkan so it would be easier to add in the future
21:11 gregory38: karolherbst: with higher resolution permutation is faster than uber shader (again only tested a single testcase)
21:12 gregory38: it is clearly a tradeoff between driver overhead and gpu overhead
21:12 karolherbst: yeah, we might allocate too many registers
21:12 karolherbst: and kill parallelism that way
21:13 karolherbst: gregory38: well you should test with nvidia then
21:13 karolherbst: gregory38: could be that with their compiler it is much better anyway
21:14 gregory38: yes that was the plan but I don't like to reboot, I need to close all pending activities
21:15 gregory38: anyway, it is something that I need to investigate further
21:15 gregory38: and maybe I will put an option
21:34 urmet: so. I have a GM108M. can I poke it to run fast(er) or is just loading the driver all for now?
21:34 imirkin_: just loading the driver is all for now
21:35 urmet: ok. cool :)
21:35 imirkin_: it *ought* to be auto-suspending though, which should get you some battery life back
21:35 imirkin_: you can check in vgaswitcheroo (in sysfs somewhere), it should say DynOff
21:37 urmet: i don't seem to have anything relevant named *vga* in sys'
21:37 imirkin_: mmm... maybe you don't have it enabled then
21:38 urmet: maybe i should reconfigure my kernel. switcheroo stuff looked like it would only make sense for older systems
21:39 imirkin_: no, you definitely want it here
21:39 imirkin_: otherwise your gpu won't auto-suspend
21:39 urmet: oh
21:39 urmet: good to know
21:58 karolherbst: urmet: well you can try out my reclocking patches for maxwell, but usually the intel of yours should run faster, because those gm108 gpus are mainly crap ;)
22:17 urmet: karolherbst: crapness is no factor. fancy new features are fun to check out :)
22:17 imirkin_: urmet: well, the GM108 won't support tessellation or images (right now) which your intel gpu should support.
22:19 urmet: I'm not expecting all the features :)
22:27 urmet: and if all goes to plan, I'm going to have even worse experience with nouveau drivers next week :P
22:27 imirkin_: excellent!
22:30 urmet: 1070
22:30 imirkin_: oh lol
22:30 karolherbst: !
22:30 karolherbst: nice
22:30 imirkin_: well that'll be nice and easy - nouveau just plain won't work on it
22:31 karolherbst: we would need a. the vbios and b. an mmiotrace
22:32 urmet: if i get the card fast and the vbios/mmiotrace helps, I would be glad to help