07:21gregory38: @airlied hello
07:22gregory38: I was trying to understand shader/pipeline setup/bind
07:22gregory38: and I saw some codes to reset uniform when you bind a new pipeline
07:23gregory38: however you can keep the pipeline bound, and only use glUseProgramStages to switch the stage
07:23inglor: Does the "new" 4.6 kernel contain any fixes for kepler ?
07:24gregory38: In this case, I don't know if you need to reset or not the subroutines
07:34airlied: gregory38: I think I fixed it recently to always rebind subroutines, but there are patches on the list to change how it all works anyways
07:34airlied: (subroutines that is )
07:35gregory38: yes I saw the patch, I gave it a quick look
07:35gregory38: You mean the per-context uniform support, isn't it?
07:40gregory38: Hum, what I see is that _mesa_shader_program_init_subroutine_defaults will be called when
07:40gregory38: 1/ a new pipeline is bound
07:40gregory38: 2/ a new program is set (useProgram)
07:42gregory38: Extract from the EXT spec (.txt) Program subroutine parameters for a given target are reset to arbitrary
07:42gregory38: defaults when the program string is respecified, and when BindProgram is
07:42gregory38: called to bind a program to that target, even if the specified program is
07:42gregory38: already bound.
07:44gregory38: Hum, so if you set a new program on the already bound pipeline (_mesa_UseProgramStages), I guess it ought to reset the uniform container
07:47gregory38: airlied: did I understand it wrongly?
07:49gregory38: Ah no it is fine
07:49gregory38: use_shader_program is called for both paths
07:50gregory38: (useProgram and UseProgramStages)
09:47karolherbst: 0] 174.935244 MMIO8 R 0x30f230 0x00000003 PROM[0xf230] => 0x3
09:48karolherbst:  174.935248 MMIO8 R 0x320218 0x000000b0 PROM[0x20218] => 0xb0
09:48karolherbst: seems like after 0xf230 comes 0x20218 in the vbios
09:48karolherbst: now comes the odd part
17:08newuserrrrrrr: Hello, how to reclocking nouveau?
17:08newuserrrrrrr: /sys/kernel/debug/dri/0/pstate: No such file or directory
17:08hakzsam: imirkin, karolherbst, flickering in The Talos Principle fixed! And +20% fps (as a side-effect) :p
17:09hakzsam: cf. mesa-dev
17:10karolherbst: actually I really expected the fix to be something like that :D
17:10karolherbst: hakzsam: the green walls
17:10karolherbst: hakzsam: this belongs there
17:10hakzsam: something else
17:10karolherbst: it is part of the game
17:11hakzsam: but it's not the same issue
17:11karolherbst: the game itself is some kind of "emulation" kind of
17:11hakzsam: I will have a look later
17:11hakzsam: karolherbst, if you want to try my patch
17:11karolherbst: and at some places you have bigger glitches than green walls
17:11karolherbst: yeah I will check
17:11hakzsam: it should work like a charm for this flickering thing
17:11hakzsam: I know the issue
17:12karolherbst: do you push all your changes into your repository?
17:12karolherbst: doesn't matter for that change
17:12hakzsam: nope, you have to dl the patch from mesa-dev
17:12karolherbst: yeah well
17:13hakzsam: or to edit the file
17:13hakzsam: one line is not that much :p
17:13karolherbst: I know
17:13karolherbst: but the increased perf comes in handy
17:13karolherbst: let me test because the perf really sucked anyway
17:13karolherbst: hakzsam: you know what? the PCIe speed had a 25% perf impact on that game anyway
17:14hakzsam: imirkin, I followed your advice: formulate a theory, look into the code, make a change & hope :)
17:14hakzsam: karolherbst, not surprising
17:14hakzsam: it used the push buf for emitting the indexed draws
17:15karolherbst: I really like those fixes, because nobody will ever spot them by just reading the code
17:15newuserrrrrrr: And which better bumblebee nouveau or PRIME= ?
17:15karolherbst: newuserrrrrrr: if you use nouveau, PRIME
17:16imirkin_: hakzsam: i doubt that's the right change
17:16newuserrrrrrr: karolherbst: i have intel + nvidia gtx 660m
17:16hakzsam: imirkin_, why?
17:16imirkin_: hakzsam: a lot more things would be wrong... need to look at how it works
17:16imirkin_: nv50 index buffers are VERY different from nvc0
17:16imirkin_: anyways, gtg
17:17hakzsam: imirkin_, well, at least piglit is happy with that fix
17:17karolherbst: let me check
17:17karolherbst: hakzsam: you get my approval
17:18hakzsam: what about perf?
17:18karolherbst: I built mesa with O0
17:18hakzsam: doesn't really matter
17:18karolherbst: it does
17:19hakzsam: I get +20% with O0
17:19hakzsam: not tried with O3
17:19karolherbst: I see
17:19hakzsam: usually, I have a debug build :)
17:19karolherbst: well it runs at 60 fps
17:20karolherbst: hakzsam: mhhh I think those green thingies are indeed wrong
17:20karolherbst: hakzsam: nah I will test ultra settings
17:20hakzsam: yeah, still there
17:20karolherbst: hakzsam: do you have the game?
17:21hakzsam: not right now
17:21karolherbst: CPU->Mirror reflections is kind of broken
17:21karolherbst: if you stand near water and move the mouse
17:21karolherbst: the reflection kind of "stucks" for a few frames
17:21newuserrrrrrr: Guys, which path to reclock? echo 03 > ?
17:21hakzsam: karolherbst, but flickering is fixed, right?
17:22karolherbst: hakzsam: yeah
17:22karolherbst: newuserrrrrrr: /sys/kernel/debug/dri/1/pstate
17:22karolherbst: newuserrrrrrr: with 4.5+
17:22hakzsam: karolherbst, nice
17:23karolherbst: newuserrrrrrr: but if that is unstable, I have a branch to fix that
17:23karolherbst: hakzsam: well on ultra everything looks good now :)
17:23karolherbst: except those colorful flickers now
17:23karolherbst: it isn't always green
17:23hakzsam: yeah, depends
17:23hakzsam: it's like random
17:23hakzsam: karolherbst, it would be good to measure perf
17:24newuserrrrrrr: karolherbst: thank you!
17:24karolherbst: hakzsam: yep
17:24karolherbst: let me rebuild mesa with Ofast
17:25karolherbst: my 32bit is with Ofast
17:25karolherbst: I should check before running make clean
17:25hakzsam: sounds like better :)
17:27karolherbst: I had NV50_PROG_OPTIMIZE=0 still there
17:28hakzsam: karolherbst, but now, we need to figure out the second issue :/
17:28karolherbst: it is much better already
17:28karolherbst: I think talos has an integrated benchmark mode anyway
17:29hakzsam: no clue
17:29newuserrrrrrr: karolherbst: it seems nouveau with bumblebee does not have /sys/kernel/debug/dri/1/pstate
17:31karolherbst: newuserrrrrrr: sure it isn't using nvidia?
17:31karolherbst: newuserrrrrrr: or is your kernel older than 4.5?
17:32newuserrrrrrr: karolherbst: kernel 4.6.1, primusrun glxinfo | grep version server glx version string: 1.4 client glx version string: 1.4 GLX version: 1.4 OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.2.2 OpenGL core profile shading language version string: 4.10 OpenGL version string: 3.0 Mesa 11.2.2 OpenGL shading language version string: 1.30
17:32karolherbst: hakzsam: the performance is so high now... it can't be only 20%
17:32karolherbst: newuserrrrrrr: well
17:33karolherbst: newuserrrrrrr: if the command is gone, nouveau gets unloaded
17:33karolherbst: newuserrrrrrr: anyway, don't use bumblebee for nouveau
17:33hakzsam: karolherbst, please measure :)
17:33karolherbst: currently setting everything to ultra :3
17:36karolherbst: hakzsam: same position: 8.5 -> 15 fps
17:36hakzsam: it's more than expected
17:36karolherbst: and it feels much smoother anyway
17:36karolherbst: I am sure there is a benchmark though
17:36karolherbst: EXTRAS -> BENCHMARK
17:36hakzsam: oh cool
17:37karolherbst: you can even set the time
17:37karolherbst: 15/30/60 seconds
17:37hakzsam: still have to wait ~20 minutes before testing
17:37hakzsam: although, I tested with some traces
17:37hakzsam: but not ingame
17:38karolherbst: 11.2 fps without your change on ultra
17:39karolherbst: but the benchmark is awesome
17:39karolherbst: it also shows you where the cpu cycles go
17:39karolherbst: (audio/rendering/... stuff)
17:39hakzsam: very nice
17:39karolherbst: hakzsam: uhhh
17:39karolherbst: the main menu is busted though
17:39hakzsam: with my patch?
17:39hakzsam: and it was not before?
17:40karolherbst: let me check
17:40karolherbst: ohh no
17:40karolherbst: you are right
17:40karolherbst: it is also busted without
17:40hakzsam: nice to hear
17:41hakzsam: maybe you could make a apitrace?
17:41karolherbst: benchmark time :)
17:41karolherbst: after the benchmark
17:41Calinou: new argument for pushing for optic fiber deployment: "I want to give APITraces to people"
17:41hakzsam: yep :)
17:42karolherbst: 17.6 fps
17:43karolherbst: so yeah
17:43karolherbst: hakzsam: one thing though: the shadow load dropped significantly with your patch
17:43hakzsam: what's that?
17:44karolherbst: as i said, the benchmark shows the cpu load or something
17:44karolherbst: and ren increased and shadow dropped
17:47karolherbst: hakzsam: ohh maybe I messed up the rendering :/ let me check
17:47karolherbst: yeah maybe
17:47hakzsam: what did you do?
17:48karolherbst: hakzsam: 25.7 fps with the 64bit version.... the hell
17:49karolherbst: ahh it changed settings
17:51karolherbst: hakzsam: we'll see, but your patch helps a lot
17:52karolherbst: hakzsam: well I'll bisect and see what I messed up now
18:15karolherbst: hakzsam: seems like there is an issue in my dual issue pass
18:25hakzsam: karolherbst, I'm back, I have the game now
18:25hakzsam: will test
18:28hakzsam: karolherbst, btw, I pushed the fix on fdo https://cgit.freedesktop.org/~hakzsam/mesa/log/?h=talos
18:30karolherbst: hakzsam: good
18:30imirkin_: hakzsam: i'm almost sure that's the wrong fix.
18:31karolherbst: well I get 17.5 fps with your patch with all settings to ultra
18:31imirkin_: hakzsam: you need to understand why it helps
18:31imirkin_: and then fix the underlying issue, whatever it is.
18:32karolherbst: imirkin_: any idea why doing this, gives us better performance?
18:32imirkin_: the push hint is an optimization
18:32imirkin_: you're either doing it more often or less often
18:32imirkin_: so either it's an opt that helps or hurts in this case
18:32imirkin_: should figure out which
18:33hakzsam: imirkin_, actually, the push hint is not an optimization in that case
18:33hakzsam: imirkin_, why are you saying that's the wrong fix? :)
18:34imirkin_: the push hint is supposed to only be for indexed draws. it makes little to no sense for non-indexed draws
18:34hakzsam: imirkin_, because I guess the number of vertices is somehow huge
18:34imirkin_: the idea is that if you have a huge buffer
18:34imirkin_: but are only drawing a few elements from it
18:35imirkin_: you're better off picking those elements "by hand" and then pushing them via dedicated interfaces
18:35karolherbst: hakzsam: okay, on 64bit with your patch: 10.9 -> 17.5 fps in the benchmark
18:35imirkin_: rather than putting the whole vbo resident into vram
18:35hakzsam: imirkin_, sure, I know that
18:35imirkin_: so... it only ever makes sense for indexed draws
18:36imirkin_: the vbo hint determination logic is correct
18:36imirkin_: although perhaps the implementation of it is wrong
18:36hakzsam: the limit is arbitrary...
18:37imirkin_: yes. but what's not arbitrary is the fact that it's for indexed draws :)
18:37hakzsam: and that might explain the perf improvements
18:37imirkin_: my guess is that there's a missing condition on when not to do it
18:37imirkin_: like when the index buffer is being written
18:37imirkin_: or who knows what
18:37imirkin_: [or the vertex buffer is]
18:38hakzsam: so, we should always use the push path for indexed draws?
18:39imirkin_: only for *small* indexed draws
18:39imirkin_: and perhaps with some additional restrictions
18:39hakzsam: but why nv50 doesn't do the same thing?
18:39imirkin_: it does
18:39imirkin_: you misanalyzed what it was doing.
18:40imirkin_: i agree it's confusing :)
18:40hakzsam: nv50 always uses a user vbo for indexed draws
18:40imirkin_: right, so among other things, the way you supply indices for an indexed draw is different on nv50
18:40imirkin_: on nvc0 you give it a bo
18:40imirkin_: and all is well
18:41imirkin_: on nv50, you feed the indices through the pushbuf
18:41imirkin_: no matter what
18:41hakzsam: makes more sense
18:41karolherbst: hakzsam: well I will upload a trace with the water reflection issue
18:41hakzsam: karolherbst, yep, thanks
18:41imirkin_: so some of the surrounding logic is different as a result
18:42hakzsam: this part is a bit confusing yeah
18:42hakzsam: that's why I was almost sure that using a user vbo fixed the issue
18:42hakzsam: I think this was not too crazy
18:43imirkin_: it should *work* in all the cases
18:43imirkin_: you're just shifting it from one path to another
18:43imirkin_: but both paths are supposed to work.
18:43hakzsam: and the push path doesn't work in that specific case
18:43imirkin_: so the fix is to fix the path that's not working
18:43imirkin_: rather than to shift it to another path
18:43karolherbst: so one path may lead to higher perfs, but both should be always "fine"
18:43hakzsam: performance comes later anyway :)
18:44karolherbst: huh :p
18:45hakzsam: imirkin_, but, the vbo hint limit might be wrong too, right?
18:45hakzsam: or might be improved
18:45imirkin_: hakzsam: the limit is arbitrary
18:45imirkin_: but again, the FIX is to fix the path it's going down
18:45hakzsam: one thing at a time
18:45imirkin_: or alternatively to determine why it doesn't work, and add an exclusion for that case
18:46imirkin_: (i'm guessing when the idxbuf has an unsignalled fence_wr)
18:46imirkin_: or perhaps when the vertex buffers are still being written
18:46imirkin_: unfortunately i'm not entirely sure when synchronize's are necessary
18:47karolherbst: hakzsam: okay, we might have an issue regarding that reflection issue
18:47imirkin_: although it seems reasonable to stick one in in that case
18:47hakzsam: imirkin_, okay, will look into the code
18:47hakzsam: karolherbst, talos is quite buggy
18:47karolherbst: hakzsam: guess why it falls under the "CPU perf" section?
18:48karolherbst: anyway, I created a trace, same issue on nouveau/intel/nvidia
18:48karolherbst: when replaying it
18:48hakzsam: are you sure it's wrong?
18:48karolherbst: I ran the game with nvidia, and it looked fine
18:49karolherbst: hakzsam: also when I disable that setting, it doesn't happen anymore
18:49karolherbst: just looks a bit more crappy overall
18:49hakzsam: and my patch doesn't help for this issue?
18:51hakzsam: karolherbst, please, fill a bug then
18:57karolherbst: at least it is going forward :)
18:59hakzsam: karolherbst, hey traces are always useful :)
18:59karolherbst: nope, not if you have the game in this case
18:59karolherbst: with the trace you won't know if you fix the issue or not
18:59hakzsam: even if I have a game, I make a trace
18:59hakzsam: because I don't want to spend my time at launching it :)
18:59karolherbst: I think this is something done on the CPU
19:00karolherbst: hence why this is in the CPU section
19:00karolherbst: the engine reads something back from the GPU
19:00karolherbst: does something with the CPU with that
19:00karolherbst: and pushes it back somewhere else
19:00hakzsam: well, I will have a look later, maybe
20:27newuserrrrrr: Hi again, i tried to use gtx 660m nouveau with Planetside 2 (wine) and it works!
20:34Yoshimo: well i used World of Warcraft on a 980 with nouveau and it worked, but 10fps was a rather broad definition of "works"
20:34karolherbst: why are people always surprised? :D
20:34karolherbst: Yoshimo: well you blame nvidia for that one :p
20:35Yoshimo: yes i do, but they don't seem to care about that
20:41karolherbst: imirkin: any idea why setting the liveOnly bit on those tex instructions might hang the gpu?
20:42gregory38: imirkin_: small question. What exactly does the validate function (such as nvc0_compute_validate_constbufs) do? Does it transfer the data to the GPU or is it more complex?
20:43karolherbst: gregory38: guess: I think it checks whether something needs to be uploaded to the gpu or updated or something and marks those things as dirty
20:45karolherbst: I might be wrong, but I think this is more or less right
20:46gregory38: oh, I didn't say it was wrong ;)
20:46gregory38: just trying to understand
20:47newuserrrrrr: but nouveau < fps lower than proprietary driver < windows :(
20:48karolherbst: newuserrrrrr: yeah well
20:48karolherbst: newuserrrrrr: we don't have any hw documentation
20:48imirkin_: gregory38: rtfs? :)
20:48newuserrrrrr: One reason why i use windows - planetside 2
20:49imirkin_: gregory38: that one takes the list of constbuf bo addresses and writes out method calls to the command ring to make it so
20:49newuserrrrrr: karolherbst: nvidia bad :( but amd even worse(?)
20:49gregory38: A nice explanation is always better :)
20:50imirkin_: gregory38: when you just set the constbuf bo's, we don't actually write it out to the command ring
20:50imirkin_: gregory38: we only do it at draw (or dispatch) time
20:50karolherbst: newuserrrrrr: amd provides hw documentation
20:50gregory38: ok make sense
20:51imirkin_: the hardware stores some amount of configuration, and we have to write out commands to updates bits and pieces of it
20:51newuserrrrrr: karolherbst: but amd sucks with hybrid graphics
20:51imirkin_: and hopefully we update all the pieces that have changed, or else ... we draw wrong
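The deferred validation described here can be sketched as a toy model. This is only an illustration under assumptions: the class, method, and command names below are hypothetical and are not the nvc0 driver's actual code. Binding a constant buffer only records the state and marks it dirty; commands are written to the (simulated) command ring at draw time.

```python
class ToyContext:
    """Toy model of lazy state validation (hypothetical names, not Mesa code)."""

    def __init__(self):
        self.constbufs = {}      # slot -> buffer address currently requested
        self.dirty = set()       # slots changed since the last validation
        self.command_ring = []   # commands that would be sent to the hardware

    def set_constbuf(self, slot, address):
        # Binding only records the new state; nothing reaches the ring yet.
        self.constbufs[slot] = address
        self.dirty.add(slot)

    def validate_constbufs(self):
        # Emit one bind command per dirty slot, then clear the dirty set.
        for slot in sorted(self.dirty):
            self.command_ring.append(("BIND_CB", slot, self.constbufs[slot]))
        self.dirty.clear()

    def draw(self):
        # Validation happens at draw (or dispatch) time.
        self.validate_constbufs()
        self.command_ring.append(("DRAW",))
```

In this model, rebinding the same slot several times before a draw costs a single bind command, and a draw with no state change emits only the draw itself.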
20:51karolherbst: newuserrrrrr: well nouveau sucks usually with performance
20:52gregory38: but if you draw with a program 1 and an ubo 0
20:53gregory38: and then switch the program and do a draw with program 2 but same ubo 0
20:53gregory38: data is already on the GPU
20:53newuserrrrrr: i long use windows and over proprietary software and it sucks, now i go with opensource and free software
20:54karolherbst: newuserrrrrr: well you can always help us, but you get usually better quality drivers with intel and amd
20:54imirkin_: newuserrrrrr: amd supports open source mesa development with both documentation and full time engineers
20:54imirkin_: newuserrrrrr: so stick with amd if you want good open-source support.
20:54karolherbst: but I guess you have your rig already, so you can help us out anyway
20:54imirkin_: gregory38: that's right.
20:55karolherbst: imirkin_: by the way, is there a way to tell mesa to not color the instructions?
20:55imirkin_: gregory38: that's more of a GL issue - how do you know it's the same ubo 0
20:55imirkin_: karolherbst: please rephrase
20:55karolherbst: imirkin_: NV50_PROG_DEBUG colors the instructions
20:55imirkin_: karolherbst: yeah, there's an env var
20:56imirkin_: check nv50_ir_print.cpp
20:56imirkin_: [i have no clue what it is]
20:56karolherbst: too easy
20:57gregory38: well I guess it would need a dirty bit on upload (don't ask coherent stuff), and something for the index binding
21:00gregory38: actually maybe you don't need the dirty data bit
21:00gregory38: because you update only the "pointer" to the ubo
21:01gregory38: or maybe it is nvc0_cb_bo_push :p
21:01gregory38: that push the data
21:02imirkin_: that gets into how the hw works
21:03imirkin_: doing one draw at a time is all well and good
21:03imirkin_: however a lot of programs do something like
21:03imirkin_: update uniform; draw; update uniform; draw; update uniform; draw;
21:03imirkin_: wouldn't it be nice if you didn't have to wait for draw1 to finish before starting draw2?
21:04imirkin_: so uniforms are updated via this magical path that doesn't (immediately) affect the backing data
21:04imirkin_: but a draw will see it
21:04imirkin_: while previous draws won't be affected
21:04gregory38: ok gotcha
21:05imirkin_: all uniform updates go through such a path... perhaps they shouldn't, i haven't the faintest clue what the perf tradeoffs are
21:12karolherbst: why can't I swap those instructions?
21:12karolherbst: 278: texlod 2D $r8 $s0 f32 $r4d $r4t (8)
21:12karolherbst: 279: texlod 2D $r8 $s0 f32 $r0d $r0t (8)
21:14karolherbst: well next is this:
21:14karolherbst: 347: texbar (SUBOP:1) - # $r4d (8)
21:14karolherbst: 349: texbar - # $r0d (8)
21:14karolherbst: I guess I would have swap those texbars too?
21:14karolherbst: or even move that subop?
21:15karolherbst: maybe best I just leave those tex instructions alone...
21:15newuserrrrrr: sorry for the off-topic: I'm studying to be a programmer. If I want a chance to take part in development, do I need to learn C well? (sorry, I'm from Russia and my English is bad)
21:15karolherbst: newuserrrrrr: I couldn't do shit in C before I started on nouveau :p
21:16karolherbst: newuserrrrrr: but yeah, you would have to learn it
21:16imirkin_: karolherbst: still can't :p
21:16imirkin_: karolherbst: i'm guessing there's an instruction 348 that either reads or writes r4 or r5?
21:16karolherbst: 348: add ftz f32 $r2 neg $r5 1.000000 (8)
21:16imirkin_: so it needs the texbar before it
21:16imirkin_: it's all correct...
21:17imirkin_: if you swap the two instructions, you'd have to adjust all the depths
21:17imirkin_: it's not worth it
21:17imirkin_: at least not once texbars are inserted
21:17karolherbst: so order matters for tex instructions
21:17newuserrrrrr: Can you advise what some good docs or books?
21:17imirkin_: order matters for the later texbar's
21:17karolherbst: newuserrrrrr: didn't read any
21:17imirkin_: texbar 1 = wait until all but the last tex() are complete
21:17karolherbst: newuserrrrrr: https://www.kernel.org/doc/Documentation/CodingStyle
21:17imirkin_: texbar 10 = wait until all but the last 10 are complete
21:18imirkin_: (tbh i have no idea how deep the queue goes)
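The texbar rule stated above ("texbar N = wait until all but the last N tex() are complete") can be sketched with a tiny model. The function name and the string labels are hypothetical, and the model simply assumes texture operations complete in issue order; it says nothing about how deep the real queue is.

```python
def texbar(in_flight, depth):
    """Model a texbar with the given depth.

    in_flight: texture ops still outstanding, oldest first.
    Returns (completed, still_in_flight): everything except the newest
    `depth` ops is guaranteed complete after the barrier.
    """
    keep = len(in_flight) - depth
    if keep <= 0:
        # Fewer ops outstanding than the depth: nothing has to finish.
        return [], list(in_flight)
    return list(in_flight[:keep]), list(in_flight[keep:])


# Issue order from the shader dump: 278 writes $r4d, 279 writes $r0d.
issued = ["tex278:r4d", "tex279:r0d"]
done, pending = texbar(issued, 1)   # texbar (SUBOP:1) before instruction 348
```

With the original order, `done` contains the op writing `$r4d`, which is what instruction 348 (reading `$r5`) needs. Swapping the two tex instructions without changing the depth would leave `$r4d` pending at that point, which is why the depths would have to be adjusted.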
21:18karolherbst: I think I just ignore those tex instructions for the dual issue thing then
21:18karolherbst: it really doesn't matter that much because it is so rare that it happens anyway
21:18newuserrrrrr: karolherbst: hmm, thanks :)
21:19karolherbst: newuserrrrrr: well it doesn't teach you C, just the rules the kernel devs should follow
21:19karolherbst: newuserrrrrr: anyway, best idea is if you find an issue and just try to fix it
21:19karolherbst: newuserrrrrr: if you code something odd or bad, we will just shout at you then :p
21:21gregory38: imirkin_: const behavior feels like standard glUniform draw call that is uploaded every program change rather than ubo that could keep same value for severals program
21:21newuserrrrrr: But how learn C? :))
21:22karolherbst: newuserrrrrr: can you code in any other language?
21:22gregory38: well it might save a couple of C call but hardware interaction is more or less the same (at least on nouveau)
21:23newuserrrrrr: karolherbst: pascal, little android(java), also i like Golang
21:23gregory38: anyway, thanks for all your info it was very educational
21:23imirkin_: gregory38: well if it's not updated, then nothing is uploaded
21:23imirkin_: gregory38: a "user_buffer" is from uniforms stored in client memory, i.e. glUniform() things
21:23karolherbst: newuserrrrrr: well I think C is closest to java of all of those, but still pretty far away
21:24gregory38: UBO are flushed to GPU when the program is switched
21:24imirkin_: [which then have to be uploaded to a bo before draw]
21:24karolherbst: newuserrrrrr: maybe you should start just by coding anything in C or something, I really don't know because I really didn't follow much while learning that stuff :/
21:25gregory38: imirkin_: by which you mean the program or the ubo ?
21:26imirkin_: gregory38: the ubo
21:26newuserrrrrr: karolherbst: ok, important start
21:26gregory38: it would be easier for me to access the hw :)
21:26imirkin_: gregory38: ubo's are also CSO'd between st/mesa and the driver backends
21:26imirkin_: so if you happen to "change" it to the same thing, that update never hits the driver
21:26imirkin_: (CSO = constant state object)
21:27gregory38: yeah, the thing is that PCSX2 just switch shaders very often
21:27gregory38: unlike others app
21:28gregory38: so I'm pretty in the worst case validation
21:29imirkin_: yeah... it shouldn't be _so_ bad
21:29imirkin_: but i guess it works out that way =/
21:29glennk: on newer hardware you might be better off using a uber shader and switch less often
21:29gregory38: with subroutine you mean
21:29glennk: nah, just regular branching
21:30glennk: subroutine sucks on the hardware
21:30imirkin_: well, that's what using subroutines with mesa will work out to :)
21:31gregory38: honestly I didn't tested uniform
21:31gregory38: Most of the time, app are GPU bound rather than CPU bound
21:31imirkin_: mesa can and definitely should be improved in this regard
21:31imirkin_: but it will take someone with an app they care about to profile and think about how to fix it
21:32imirkin_: could be you :)
21:32gregory38: you don't imagine how I'm busy with PCSX2 ;)
21:32imirkin_: i'm sure it's "very"
21:32imirkin_: and i suspect it doesn't pay too much, so you also do other things
21:33gregory38: but yeah I'm trying to see what I can do to help
21:33gregory38: yes, I'm working on the hardware industry
21:33gregory38: used to be smartphone, now interconnect (network) for HPC
21:33glennk: gpu bound probably by fill rate rather than shader alu though
21:34gregory38: glennk: you don't imagine the mess I did on the fragment shader to emulate the PS2 ;)
21:34gregory38: glennk: what do you mean by fillrate ?
21:34gregory38: memory bandwidth ?
21:34glennk: the rate at which the hardware can fill pixels
21:35gregory38: And is it a good idea to put texture sampler in if/else
21:35glennk: often hardware bottlenecks on either the rasterizer generating fragments (before shading) or in ROP (alpha blend etc)
21:35glennk: branches are cheap as long as they go the same way in each warp/wavefront
21:36gregory38: ok. How much cheap ? 10 branch ? 20 branch ?
21:36glennk: newer hardware is way more heavy on shader alu and texture sampling than on rasterizer/ROP
21:37karolherbst: gregory38: most of the stuff may be optimized away anyway
21:37glennk: sorry, can't parse that question in any way that makes sense to me
21:37imirkin_: gregory38: branches are ~free when they're uniform
21:38gregory38: ok. Question was would branch remains free on performance if shader contains 20-30 branchs
21:38imirkin_: if they're uniform, yes
21:38gregory38: (yes my shader code is a royal mess)
21:38imirkin_: if each invocation goes its separate way, then no
21:38karolherbst: gregory38: anything in the form of type v = cond ? value1 : value2;
21:38gregory38: I guess I ought to benchmark it then
21:38imirkin_: and obviously not *free*
21:38glennk: look at the thread divergence
21:39imirkin_: if you have like one real instruction and 300 branches, then it might cost you
21:39glennk: if its low then branches cost basically nothing
21:39imirkin_: but presumably you do things other than branch in your shaders
21:39glennk: if you have divergence you pay for both sides of the branch in all threads
21:39karolherbst: gregory38: try to keep branches as simple as possible so that they can get optimized away into non branched binaries
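The divergence rule discussed above (uniform branches are close to free, divergent ones pay for both sides) can be written as a rough cost model. This is a deliberately simplified sketch: the function name and the cost units are made up for illustration, and real hardware has extra overheads that are ignored here.

```python
def branch_cost(conditions, cost_then, cost_else):
    """Cost of one if/else for a group of threads running in lockstep.

    conditions: one boolean per thread in the warp/wavefront.
    If every thread agrees, only the taken side is executed; if the
    threads diverge, the whole group steps through both sides.
    """
    takes_then = any(conditions)       # at least one thread wants the 'then' side
    takes_else = not all(conditions)   # at least one thread wants the 'else' side
    cost = 0
    if takes_then:
        cost += cost_then
    if takes_else:
        cost += cost_else
    return cost
```

For example, with 32 threads that all take the same side, only that side's cost is paid; a single thread going the other way makes the group pay `cost_then + cost_else`.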
21:40gregory38: ok. Actually I don't have too much instruction once you remove all the ifdef
21:41imirkin_: then you might want to benchmark and see what happens
21:41imirkin_: could go either way
21:41glennk: well, #ifdef shader variants are by definition uniform branches
21:41gregory38: Potentially there is still an impact, the compiler might still need to create additional register to store content
21:42glennk: yes, that and a lot of other variables
21:42glennk: but given its emulating ps/2 shader hardware, how complex could it be?
21:42glennk: compared to say an UE4 pixel shader
21:43karolherbst: gregory38: see that shader? https://gist.github.com/karolherbst/aad71d0e5715a1ffdd5c9c2fc0963505
21:43karolherbst: with all those ?: clauses
21:44karolherbst: gregory38: ends up having 5 branch instructions in the binary in total
21:45karolherbst: tgsi of that shader: https://gist.github.com/karolherbst/218323671e48f0a9cd147f1492c36beb
21:45gregory38: my shaders are much much smaller
21:46karolherbst: and less branches I figure?
21:46gregory38: there are still a couple of branches (1/2) sometimes
21:46karolherbst: yeah well
21:47karolherbst: on nvidia hardware there are many instructions to eliminate branches completely and make them pretty much free, although there is conditional stuff
21:47karolherbst: so you really don't need to worry about this
21:47karolherbst: would be a good experiment to know what happens if you have like one uber shader and stick with that
21:47gregory38: well believe or not, I'm still supporting others brands ;)
21:48karolherbst: glennk: I figure on modern amd cards there are also slct/set/predicated instructions?
21:48gregory38: glennk: to answer your question, ps2 has special fixed unit so typically I need to implement blending on the shader. Sometimes the filtering too
21:48karolherbst: or selp
21:48gregory38: but I barely do any math
21:49glennk: blending with framebuffer, or blending between "texture stages" ?
21:50gregory38: with framebuffer
21:50gregory38: the rop stuff
21:50glennk: thats an expensive thing
21:51gregory38: yes but I didn't find a free solution
21:51gregory38: alpha coefficient is unclamped
21:51gregory38: but the color could be either clamped or wrapped
21:52gregory38: they use integer math with trunc vs float rounding of the standard unit
21:54gregory38: 2 shader example. A basic one, and a more complex one (likely barely used)
21:55gregory38: anyway, I need to test the uber shader
21:59gregory38: hum might not be that easy
21:59gregory38: I have some discard on some paths
21:59imirkin_: discards are fine
22:01gregory38: I was afraid for early test and thing like that
22:02imirkin_: discards don't count as control flow
22:02imirkin_: (i mean, they kinda do...)
22:02glennk: everything is rasterized as 2x2 quad groups, so it won't save fill rate until you discard an entire quad group
22:04gregory38: purpose isn't really to save fill rate but ps2 can update either the framebuffer color or the depth based on the alpha value
22:04gregory38: so we do the rendering twice
22:05glennk: hmm, maybe stencil export could work for that
22:05gregory38: does it work on nvidia hw :p
22:05imirkin_: once with depth bound, once with color?
22:05gregory38: you can setup the ps2 as
22:05gregory38: if alpha < AREF update both color and depth
22:05gregory38: otherwise update only the color
22:06gregory38: (the otherwise could also be update only the depth)
22:06imirkin_: good times.
22:06gregory38: yeah, you could do everything with the ps2
22:06gregory38: validation was 0
22:06gregory38: everything was allowed ;)
22:07karolherbst: gregory38: I wonder if games hacked around the non-standard floating point things and if you could detect that and revert it and just use IEEE compliant stuff :p
22:08karolherbst: maybe they simply didn't care though
22:08gregory38: game care
22:08gregory38: and none of them use IEEE
22:08gregory38: you can use IEEE stuff, it mostly works most of the time
22:08gregory38: but not always
22:08karolherbst: I see
22:09gregory38: For example you have the nice constant Pi
22:10imirkin_: you mean delicious constant
22:10gregory38: You do Pi * 1/4 which is different from 1/4 * Pi
22:10gregory38: the ps2 kind of loses some bits of accuracy so you end up actually below pi/4
22:11gregory38: so cos/sin doesn't really have the same behavior
22:11gregory38: so there is one place in the code that does if (a == pi && b == 1/4) then result is ....
22:12imirkin_: that seems nice and scalable :)
22:12glennk: i guess you can use shader image loads/store instead of regular color buffer ops to handle that
22:13gregory38: if ((s == 0x3e800000) && (t == 0x40490fdb))
22:13gregory38: return 0x3f490fda; // needed for Tales of Destiny Remake (only in a very specific room late-game)
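To make the constants in that special case concrete, here is a small sketch that decodes the bit patterns. `0x3e800000` is 0.25 and `0x40490fdb` is pi rounded to single precision; an IEEE multiply gives `0x3f490fdb`, while the value returned by the hack is one unit in the last place lower. The function names are hypothetical, and the fallback to an IEEE multiply is an assumption made only for this illustration, not how the emulator handles every other input.

```python
import struct


def f32_to_bits(x):
    """Round a Python float to IEEE single precision and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits):
    """Interpret a 32-bit pattern as an IEEE single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def ieee_mul_bits(s, t):
    """IEEE single-precision product of two bit patterns (overflow not handled)."""
    # The product of two float32 values is exact in double precision,
    # so packing it back to float32 rounds only once.
    return f32_to_bits(bits_to_f32(s) * bits_to_f32(t))


def ps2_mul_bits(s, t):
    """Hypothetical wrapper reproducing the single special case quoted above."""
    if s == 0x3E800000 and t == 0x40490FDB:
        return 0x3F490FDA   # result quoted in the chat, 1 ulp below IEEE
    return ieee_mul_bits(s, t)  # assumption: plain IEEE everywhere else
```

Because the check compares `s` and `t` in a fixed order, this sketch only triggers for 0.25 * pi and not for pi * 0.25, which matches the remark that operand order matters.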
22:13glennk: so updating color but discarding depth would be a shader store followed by a discard, and the reverse is just depth only rendering
22:13karolherbst: gregory38: isn't there maybe some constant error you could apply?
22:13karolherbst: gregory38: or is it not really predictable?
22:13glennk: but the tricky bit is overlapping draw calls, needs explicit synchronization
22:14gregory38: glennk: discard also discards the store
22:14gregory38: karolherbst: well I got this kind of idea. But it isn't easy
22:14gregory38: I'm not sure it always loses bits
22:14karolherbst: gregory38: well, do you have hardware?
22:15gregory38: so maybe a lut based on the lsb
22:15gregory38: yes, the plan was to test lots of values on it ;)
22:15glennk: gregory38, uh no, the stores prior to the discard go through
22:16gregory38: glennk: are you sure?
22:16glennk: https://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt point 20
22:16gregory38: did you do that in Mesa ? It could explain some issue with load/store atomic
22:17gregory38: glennk: oh interesting
22:18glennk: "some issue" is rather vague
22:18gregory38: I need to double check the rendering on nvidia
22:18gregory38: otherwise I could provide an apitrace
22:20gregory38: wait, actually I think it is the game that has poor quality
22:21gregory38: yes it is fine
22:22gregory38: anyway, it is interesting. Maybe it could help in the future
22:22imirkin_: it can be hard to retroactively undo stores :)
22:23gregory38: well I'm working on HW, we can do stuff
22:23imirkin_: well, you can have e.g. a coherent buffer/image
22:23imirkin_: how would you undo that?
22:23imirkin_: esp after some atomic operation
22:23gregory38: dunno :p
22:24gregory38: for sure it would be costly or with limitation
22:24gregory38: but opengl wiki isn't clear
22:24imirkin_: limitation: it doesn't work :)
22:25gregory38: The Fragment Shader has the ability to issue a discard command. This will prevent writing any fragment values to the framebuffer. However, it will also have the effect of preventing image store and atomic operations from taking place.
22:25imirkin_: it's a wiki, fix it :)
22:25gregory38: (yeah I know better read the spec)
22:25imirkin_: what it says is technically not wrong
22:25imirkin_: but can easily be misinterpreted
22:25gregory38: yes it is confusing
22:27gregory38: anyway, the 2 consecutive draws won't be expensive if I don't switch the shader between them but only update uniforms
23:25gregory38: So I managed to reinstall the nvidia driver to do some benchmarks
23:25imirkin_: let me guess - it does better than nouveau?
23:25gregory38: barely :p
23:25imirkin_: that's ... very sad
23:26gregory38: I don't have my GPU at full speed so it could explain some perf issues
23:26gregory38: normally these are CPU-bound testcases so it shouldn't have a big impact
23:28gregory38: nouveau is around 60% of nvidia (if I don't enable thread optimization)
23:28imirkin_: i assume that's with reclocking?
23:28gregory38: DC: core 1084 MHz memory 6007 MHz
23:28gregory38: 0f: core 405-1280 MHz memory 6008 MHz AC DC *
23:28gregory38: Yes but not turbo
23:28imirkin_: i'll take that as a "yes"
23:29gregory38: but that doesn't explain the perf, my testcases are likely heavier on the CPU
23:29gregory38: except the sotc testcase
23:29imirkin_: well, that's potentially another 15-20%
23:30imirkin_: assuming turbo lets you go all the way up to 1280
23:30karolherbst: usually not
23:31karolherbst: turbo usually gives you around 100-150MHz more though
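[editor's note] The percentages in this exchange follow from the clocks in the pstate output (1084 MHz on the DC line, 1280 MHz at the top of the 0f range). A minimal sketch of the arithmetic, assuming performance scales linearly with core clock, which is at best an upper bound for a GPU-limited case; the helper name `clock_gain_percent` is hypothetical.

```c
/* Percentage gain from raising a clock of base_mhz by extra_mhz,
 * assuming performance scales linearly with the clock. */
static double clock_gain_percent(double base_mhz, double extra_mhz)
{
    return 100.0 * extra_mhz / base_mhz;
}
```

Going from 1084 to 1280 MHz is 196 MHz extra, a bit over 18%, which fits the "15-20%" estimate. A turbo of 100 to 150 MHz works out to roughly 9% to 14%.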
23:31glennk: wonder if dual source blending would help for implementing some of the ps2 blend modes
23:31gregory38: I already use it
23:31gregory38: managed to find a bug in the AMD driver which they finally managed to fix! And now I'm waiting for the fix
23:32gregory38: And it seems to be broken on intel too
23:32gregory38: Both if you enable SSO + blending
23:32gregory38: I mean dual-blending
23:32gregory38: imirkin_: karolherbst: let's say GPU speed explains 10% of the perf difference
23:33gregory38: that still leaves some margin :)
23:34gregory38: Hum on sotc (generally GPU limited)
23:34gregory38: nvidia 113 fps (around 160+ with MT)
23:34gregory38: Nouveau 65
23:35gregory38: Nouveau without UseProgram 73
23:37gregory38: hum strange, it used to help another testcase but not anymore
23:41gregory38: I have a testcase which is around 48 fps but 87 on nvidia
23:41gregory38: the testcase does lots of texture uploads with pbo. Basically one every draw call
23:41gregory38: (because a game designer had the good idea to upload a background with 16x16 sprites....)
23:42gregory38: _mesa_TextureSubImage2D is 38.80%
23:43gregory38: (with children)
23:43gregory38: including my program
23:43gregory38: otherwise 68% of the driver
23:46gregory38: could just be due to the slower GPU
23:54glennk: i think nha implemented pbo texture uploads/readpixels acceleration fairly recently
23:54imirkin: yeah, pbo upload uses a texture buffer (when possible)
23:54imirkin: and pbo download uses an image to write to the pbo
23:54imirkin: nouveau has a very static resource placement policy
23:54imirkin: nvidia might do something more dynamic
23:55glennk: or "game specific" policy
23:55imirkin: well, among other things, textures are always in vram
23:56gregory38: well I would be surprised if they have a profile for us
23:56imirkin: never in "gart"