07:21 gregory38: @airlied hello
07:22 gregory38: I was trying to understand shader/pipeline setup/bind
07:22 gregory38: and I saw some code to reset uniforms when you bind a new pipeline
07:23 gregory38: however you can keep the pipeline bound, and only use glUseProgramStages to switch the stage
07:23 inglor: Does the "new" 4.6 kernel contain any fixes for kepler ?
07:24 gregory38: In this case, I don't know if you need to reset or not the subroutines
07:34 airlied: gregory38: I think I fixed it recently to always rebind subroutines, but there are patches on the list to change how it all works anyways
07:34 airlied: (subroutines that is )
07:35 gregory38: yes I saw the patch, I gave it a quick look
07:35 gregory38: You mean the per-context uniform support, isn't it?
07:40 gregory38: Hum, what I see is that _mesa_shader_program_init_subroutine_defaults will be called when
07:40 gregory38: 1/ a new pipeline is bound
07:40 gregory38: 2/ a new program is set (useProgram)
07:42 gregory38: Extract from the EXT spec (.txt) Program subroutine parameters for a given target are reset to arbitrary
07:42 gregory38: defaults when the program string is respecified, and when BindProgram is
07:42 gregory38: called to bind a program to that target, even if the specified program is
07:42 gregory38: already bound.
07:44 gregory38: Hum, so if you set a new program on the already bound pipeline (_mesa_UseProgramStages), I guess it ought to reset the uniform container
07:47 gregory38: airlied: did I understand it wrongly?
07:49 gregory38: Ah no it is fine
07:49 gregory38: use_shader_program is called for both paths
07:50 gregory38: (useProgram and UseProgramStages)
09:47 karolherbst: hehe
09:47 karolherbst: [0] 174.935244 MMIO8 R 0x30f230 0x00000003 PROM[0xf230] => 0x3
09:48 karolherbst: [0] 174.935248 MMIO8 R 0x320218 0x000000b0 PROM[0x20218] => 0xb0
09:48 karolherbst: seems like after 0xf230 comes 0x20218 in the vbios
09:48 karolherbst: huh
09:48 karolherbst: now comes the odd part
17:08 newuserrrrrrr: Hello, how do I reclock nouveau?
17:08 newuserrrrrrr: /sys/kernel/debug/dri/0/pstate: No such file or directory
17:08 hakzsam: imirkin, karolherbst, flickering in The Talos Principle fixed! And +20% fps (as a side-effect) :p
17:08 karolherbst: :O
17:08 karolherbst: how
17:09 hakzsam: cf. mesa-dev
17:09 karolherbst: :O
17:10 karolherbst: ...
17:10 karolherbst: actually I really expected the fix to be something like that :D
17:10 hakzsam: ahah
17:10 karolherbst: hakzsam: the green walls
17:10 karolherbst: hakzsam: this belongs there
17:10 hakzsam: yeah
17:10 hakzsam: something else
17:10 karolherbst: nonon
17:10 karolherbst: it is part of the game
17:11 hakzsam: usure
17:11 karolherbst: well
17:11 hakzsam: but it's not the same issue
17:11 karolherbst: the game itself is some kind of "emulation" kind of
17:11 hakzsam: I will have a look later
17:11 hakzsam: karolherbst, if you want to try my patch
17:11 karolherbst: and at some places you have bigger glitches than green walls
17:11 karolherbst: yeah I will check
17:11 hakzsam: it should work like a charm for this flickering thing
17:11 hakzsam: I know the issue
17:12 karolherbst: do you push all your changes into your repository?
17:12 karolherbst: well
17:12 karolherbst: doesn't matter for that change
17:12 hakzsam: nope, you have to dl the patch from mesa-dev
17:12 karolherbst: yeah well
17:13 hakzsam: or to edit the file
17:13 hakzsam: one line is not that much :p
17:13 karolherbst: :D
17:13 karolherbst: I know
17:13 karolherbst: but the increased perf comes in handy
17:13 karolherbst: let me test because the perf really sucked anyway
17:13 karolherbst: ohhhh
17:13 karolherbst: hakzsam: you know what? the PCIe speed had a 25% perf impact on that game anyway
17:14 hakzsam: imirkin, I followed your advice: formulate a theory, look into the code, make a change & hope :)
17:14 hakzsam: karolherbst, not surprising
17:14 hakzsam: it used the push buf for emitting the indexed draws
17:15 karolherbst: I really like those fixes, because nobody will ever spot them by just reading the code
17:15 hakzsam: yeah
17:15 newuserrrrrrr: And which is better with nouveau, bumblebee or PRIME?
17:15 karolherbst: newuserrrrrrr: if you use nouveau, PRIME
17:16 imirkin_: hakzsam: i doubt that's the right change
17:16 newuserrrrrrr: karolherbst: i have intel + nvidia gtx 660m
17:16 hakzsam: imirkin_, why?
17:16 imirkin_: hakzsam: a lot more things would be wrong... need to look at how it works
17:16 imirkin_: nv50 index buffers are VERY different from nvc0
17:16 imirkin_: anyways, gtg
17:17 hakzsam: imirkin_, well, at least piglit is happy with that fix
17:17 karolherbst: let me check
17:17 karolherbst: hakzsam: you get my approval
17:18 hakzsam: what about perf?
17:18 karolherbst: I built mesa with O0
17:18 hakzsam: doesn't really matter
17:18 karolherbst: it does
17:19 hakzsam: I get +20% with O0
17:19 karolherbst: ohh
17:19 hakzsam: not tried with O3
17:19 karolherbst: I see
17:19 hakzsam: usually, I have a debug build :)
17:19 karolherbst: well it runs at 60 fps
17:19 karolherbst: so...
17:19 hakzsam: vblank_mode=0?
17:20 karolherbst: hakzsam: mhhh I think those green thingies are indeed wrong
17:20 karolherbst: hakzsam: nah I will test ultra settings
17:20 hakzsam: yeah, still there
17:20 karolherbst: hakzsam: do you have the game?
17:21 hakzsam: not right now
17:21 hakzsam: downloading
17:21 karolherbst: CPU->Mirror reflections is kind of broken
17:21 karolherbst: if you stand near water and move the mouse
17:21 karolherbst: the reflection kind of "stucks" for a few frames
17:21 newuserrrrrrr: Guys, which path to reclock? echo 03 > ?
17:21 hakzsam: karolherbst, but flickering is fixed, right?
17:22 karolherbst: hakzsam: yeah
17:22 karolherbst: newuserrrrrrr: /sys/kernel/debug/dri/1/pstate
17:22 karolherbst: newuserrrrrrr: with 4.5+
17:22 hakzsam: karolherbst, nice
17:23 karolherbst: newuserrrrrrr: but if that is unstable, I have a branch to fix that
17:23 karolherbst: hakzsam: well on ultra everything looks good now :)
17:23 karolherbst: except those colorful flickers now
17:23 karolherbst: it isn't always green
17:23 hakzsam: yeah, depends
17:23 hakzsam: it's like random
17:23 hakzsam: karolherbst, it would be good to measure perf
17:24 newuserrrrrrr: karolherbst: thank you!
17:24 karolherbst: hakzsam: yep
17:24 karolherbst: let me rebuild mesa with -Ofast
17:25 karolherbst: ohhh
17:25 karolherbst: my 32bit is with -Ofast
17:25 karolherbst: okay
17:25 karolherbst: I should check before running make clean
17:25 hakzsam: sounds like better :)
17:27 karolherbst: ahh
17:27 karolherbst: I had NV50_PROG_OPTIMIZE=0 still there
17:28 hakzsam: karolherbst, but now, we need to figure out the second issue :/
17:28 karolherbst: well
17:28 karolherbst: it is much better already
17:28 hakzsam: yeah
17:28 karolherbst: I think talos has an integrated benchmark mode anyway
17:29 hakzsam: no clue
17:29 newuserrrrrrr: karolherbst: it seems nouveau with bumblebee doesn't have /sys/kernel/debug/dri/1/pstate
17:31 karolherbst: newuserrrrrrr: sure it isn't using nvidia?
17:31 karolherbst: newuserrrrrrr: or is your kernel older than 4.5?
17:32 newuserrrrrrr: karolherbst: kernel 4.6.1, primusrun glxinfo | grep version server glx version string: 1.4 client glx version string: 1.4 GLX version: 1.4 OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.2.2 OpenGL core profile shading language version string: 4.10 OpenGL version string: 3.0 Mesa 11.2.2 OpenGL shading language version string: 1.30
17:32 karolherbst: hakzsam: the performance is so high now... it can't be only 20%
17:32 karolherbst: newuserrrrrrr: well
17:33 karolherbst: newuserrrrrrr: if the command is gone, nouveau gets unloaded
17:33 karolherbst: newuserrrrrrr: anyway, don't use bumblebee for nouveau
17:33 hakzsam: karolherbst, please measure :)
17:33 karolherbst: yeah
17:33 karolherbst: currently setting everything to ultra :3
17:36 karolherbst: hakzsam: same position: 8.5 -> 15 fps
17:36 karolherbst: :D
17:36 hakzsam: uhu
17:36 hakzsam: it's more than expected
17:36 karolherbst: yeah
17:36 karolherbst: and it feels much smoother anyway
17:36 karolherbst: I am sure there is a benchmark though
17:36 karolherbst: EXTRAS -> BENCHMARK
17:36 hakzsam: oh cool
17:36 karolherbst: yep
17:37 karolherbst: you can even set the time
17:37 karolherbst: 15/30/60 seconds
17:37 hakzsam: still have to wait ~20 minutes before testing
17:37 hakzsam: although, I tested with some traces
17:37 hakzsam: but not ingame
17:38 karolherbst: 11.2 fps without your change on ultra
17:39 karolherbst: but the benchmark is awesome
17:39 karolherbst: it also shows you where the cpu cycles go
17:39 karolherbst: (audio/rendering/... stuff)
17:39 hakzsam: very nice
17:39 karolherbst: hakzsam: uhhh
17:39 karolherbst: the main menu is busted though
17:39 hakzsam: with my patch?
17:39 karolherbst: yeah
17:39 hakzsam: and it was not before?
17:40 karolherbst: let me check
17:40 karolherbst: ohh no
17:40 karolherbst: you are right
17:40 karolherbst: it is also busted without
17:40 hakzsam: nice to hear
17:41 hakzsam: maybe you could make an apitrace?
17:41 karolherbst: benchmark time :)
17:41 karolherbst: yeah
17:41 hakzsam: ok
17:41 karolherbst: after the benchmark
17:41 Calinou: new argument for pushing for optic fiber deployment: "I want to give APITraces to people"
17:41 Calinou: :)
17:41 hakzsam: yep :)
17:42 karolherbst: 17.6 fps
17:43 karolherbst: so yeah
17:43 karolherbst: 50%
17:43 karolherbst: hakzsam: one thing though: the shadow load dropped significantly with your patch
17:43 hakzsam: what's that?
17:44 karolherbst: as i said, the benchmark shows the cpu load or something
17:44 karolherbst: and ren increased and shadow dropped
17:45 hakzsam: okay
17:47 karolherbst: hakzsam: ohh maybe I messed up the rendering :/ let me check
17:47 karolherbst: yeah maybe
17:47 hakzsam: what did you do?
17:48 karolherbst: hakzsam: 25.7 fps with the 64bit version.... the hell
17:49 karolherbst: ahh it changed settings
17:51 karolherbst: hakzsam: we'll see, but your patch helps a lot
17:52 karolherbst: hakzsam: well I'll bisect and see what I messed up now
17:53 hakzsam: bbl
18:15 karolherbst: hakzsam: seems like there is an issue in my dual issue pass
18:25 hakzsam: karolherbst, I'm back, I have the game now
18:25 hakzsam: will test
18:28 hakzsam: karolherbst, btw, I pushed the fix on fdo https://cgit.freedesktop.org/~hakzsam/mesa/log/?h=talos
18:30 karolherbst: hakzsam: good
18:30 imirkin_: hakzsam: i'm almost sure that's the wrong fix.
18:31 karolherbst: well I get 17.5 fps with your patch with all settings to ultra
18:31 imirkin_: hakzsam: you need to understand why it helps
18:31 imirkin_: and then fix the underlying issue, whatever it is.
18:32 karolherbst: imirkin_: any idea why doing this, gives us better performance?
18:32 imirkin_: the push hint is an optimization
18:32 imirkin_: you're either doing it more often or less often
18:32 imirkin_: so either it's an opt that helps or hurts in this case
18:32 imirkin_: should figure out which
18:33 hakzsam: imirkin_, actually, the push hint is not an optimization in that case
18:33 hakzsam: imirkin_, what are you saying that's the wrong fix? :)
18:34 hakzsam: *why
18:34 imirkin_: the push hint is supposed to only be for indexed draws. it makes little to no sense for non-indexed draws
18:34 hakzsam: imirkin_, because I guess the number of vertices is somehow huge
18:34 imirkin_: the idea is that if you have a huge buffer
18:34 imirkin_: but are only drawing a few elements from it
18:35 imirkin_: you're better off picking those elements "by hand" and then pushing them via dedicated interfaces
18:35 karolherbst: hakzsam: okay, on 64bit with your patch: 10.9 -> 17.5 fps in the benchmark
18:35 imirkin_: rather than putting the whole vbo resident into vram
18:35 hakzsam: imirkin_, sure, I know that
18:35 imirkin_: so... it only ever makes sense for indexed draws
18:36 imirkin_: the vbo hint determination logic is correct
18:36 imirkin_: although perhaps the implementation of it is wrong
18:36 hakzsam: the limit is arbitrary...
18:37 imirkin_: yes. but what's not arbitrary is the fact that it's for indexed draws :)
18:37 hakzsam: and that might explain the perf improvements
18:37 imirkin_: my guess is that there's a missing condition on when not to do it
18:37 imirkin_: like when the index buffer is being written
18:37 imirkin_: or who knows what
18:37 imirkin_: [or the vertex buffer is]
18:38 hakzsam: so, we should always use the push path for indexed draws?
18:39 imirkin_: only for *small* indexed draws
18:39 imirkin_: and perhaps with some additional restrictions
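A minimal sketch of the heuristic imirkin_ describes above: take the push path only for indexed draws that touch a small part of a large vertex buffer. The function name, its parameters and the ratio are all hypothetical (the chat itself calls the real limit arbitrary); this is not the nvc0 code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cutoff, chosen only for illustration. */
#define PUSH_HINT_RATIO 32u

/*
 * Return true when it looks cheaper to pick the referenced vertices
 * "by hand" and push them through the command stream than to make the
 * whole vertex buffer resident in VRAM.
 */
bool want_push_path(bool indexed, size_t index_count, size_t vbo_vertices)
{
    if (!indexed)
        return false;   /* the hint makes little sense for non-indexed draws */
    if (index_count == 0)
        return false;   /* nothing to draw */
    /* "small" draw: far fewer indices than vertices in the buffer */
    return index_count < vbo_vertices / PUSH_HINT_RATIO;
}
```

A real driver would presumably add further exclusions on top of this, such as the buffer still being written, as discussed later in the log.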
18:39 hakzsam: right
18:39 hakzsam: but why nv50 doesn't do the same thing?
18:39 imirkin_: it does
18:39 imirkin_: you misanalyzed what it was doing.
18:40 imirkin_: i agree it's confusing :)
18:40 hakzsam: nv50 always uses a user vbo for indexed draws
18:40 imirkin_: right, so among other things, the way you supply indices for an indexed draw is different on nv50
18:40 imirkin_: on nvc0 you give it a bo
18:40 imirkin_: and all is well
18:41 imirkin_: on nv50, you feed the indices through the pushbuf
18:41 imirkin_: always
18:41 imirkin_: no matter what
18:41 hakzsam: makes more sense
18:41 karolherbst: hakzsam: well I will upload a trace with the water reflection issue
18:41 hakzsam: karolherbst, yep, thanks
18:41 imirkin_: so some of the surrounding logic is different as a result
18:42 hakzsam: this part is a bit confusing yeah
18:42 hakzsam: that's why I was almost sure that using a user vbo fixed the issue
18:42 hakzsam: I think this was not too crazy
18:42 imirkin_: right
18:42 imirkin_: but
18:43 imirkin_: it should *work* in all the cases
18:43 imirkin_: you're just shifting it from one path to another
18:43 imirkin_: but both paths are supposed to work.
18:43 hakzsam: right
18:43 karolherbst: okay
18:43 hakzsam: and the push path doesn't work in that specific case
18:43 imirkin_: so the fix is to fix the path that's not working
18:43 imirkin_: rather than to shift it to another path
18:43 karolherbst: so one path may lead to higher perfs, but both should be always "fine"
18:43 hakzsam: yeah
18:43 hakzsam: performance comes later anyway :)
18:44 karolherbst: huh :p
18:45 hakzsam: imirkin_, but, the vbo hint limit might be wrong too, right?
18:45 hakzsam: or might be improved
18:45 imirkin_: hakzsam: the limit is arbitrary
18:45 imirkin_: but again, the FIX is to fix the path it's going down
18:45 hakzsam: yeah
18:45 hakzsam: sure
18:45 hakzsam: one thing at a time
18:45 imirkin_: or alternatively to determine why it doesn't work, and add an exclusion for that case
18:46 imirkin_: (i'm guessing when the idxbuf has an unsignalled fence_wr)
18:46 imirkin_: or perhaps when the vertex buffers are still being written
18:46 imirkin_: dunno
18:46 imirkin_: unfortunately i'm not entirely sure when synchronize's are necessary
18:47 karolherbst: hakzsam: okay, we might have an issue regarding that reflection issue
18:47 imirkin_: although it seems reasonable to stick one in in that case
18:47 hakzsam: imirkin_, okay, will look into the code
18:47 hakzsam: karolherbst, talos is quite buggy
18:47 karolherbst: hakzsam: guess why it falls under the "CPU perf" section?
18:48 karolherbst: anyway, I created a trace, same issue on nouveau/intel/nvidia
18:48 karolherbst: when replaying it
18:48 hakzsam: are you sure it's wrong?
18:48 karolherbst: yes
18:48 karolherbst: I ran the game with nvidia, and it looked fine
18:48 hakzsam: okay
18:49 karolherbst: hakzsam: also when I disable that setting, it doesn't happen anymore
18:49 karolherbst: just looks a bit more crappy overall
18:49 hakzsam: and my patch doesn't help for this issue?
18:49 karolherbst: nope
18:51 hakzsam: karolherbst, please, file a bug then
18:57 karolherbst: well
18:57 karolherbst: at least it is going forward :)
18:59 hakzsam: karolherbst, hey traces are always useful :)
18:59 karolherbst: nope, not if you have the game in this case
18:59 karolherbst: anyway
18:59 karolherbst: with the trace you won't know if you fix the issue or not
18:59 hakzsam: even if I have a game, I make a trace
18:59 karolherbst: well
18:59 hakzsam: because I don't want to spend my time at launching it :)
18:59 karolherbst: I think this is something done on the CPU
19:00 karolherbst: hence why this is in the CPU section
19:00 karolherbst: so
19:00 karolherbst: the engine reads something back from the GPU
19:00 karolherbst: does something with the CPU with that
19:00 karolherbst: and pushes it back somewhere else
19:00 hakzsam: well, I will have a look later, maybe
20:27 newuserrrrrr: Hi again, i tried to use gtx 660m nouveau with Planetside 2 (wine) and it works!
20:34 Yoshimo: well i used World of Warcraft on a 980 with nouveau and it worked, but 10fps was a rather broad definition of "works"
20:34 karolherbst: why are people always surprised? :D
20:34 karolherbst: Yoshimo: well you blame nvidia for that one :p
20:35 Yoshimo: yes i do, but they don't seem to care about that
20:41 karolherbst: imirkin: any idea why setting the liveOnly bit on those tex instructions might hang the gpu?
20:42 gregory38: imirkin_: small question. What exactly does a validate function (such as nvc0_compute_validate_constbufs) do? Does it transfer the data to the GPU or is it more complex?
20:43 karolherbst: gregory38: guess: I think it checks whether something needs to be uploaded to the gpu or updated or something and marks those things as dirty
20:44 gregory38: ok
20:45 karolherbst: I might be wrong, but I think this is more or less right
20:46 gregory38: oh, I didn't say it was wrong ;)
20:46 gregory38: just trying to understand
20:47 newuserrrrrr: but nouveau < fps lower than proprietary driver < windows :(
20:48 karolherbst: newuserrrrrr: yeah well
20:48 karolherbst: newuserrrrrr: we don't have any hw documentation
20:48 imirkin_: gregory38: rtfs? :)
20:48 newuserrrrrr: One reason why i use windows - planetside 2
20:49 imirkin_: gregory38: that one takes the list of constbuf bo addresses and writes out method calls to the command ring to make it so
20:49 newuserrrrrr: karolherbst: nvidia bad :( but amd even worse(?)
20:49 gregory38: A nice explanation is always better :)
20:49 gregory38: Thanks
20:50 imirkin_: gregory38: when you just set the constbuf bo's, we don't actually write it out to the command ring
20:50 imirkin_: gregory38: we only do it at draw (or dispatch) time
20:50 karolherbst: newuserrrrrr: amd provides hw documentation
20:50 gregory38: ok make sense
20:51 imirkin_: the hardware stores some amount of configuration, and we have to write out commands to update bits and pieces of it
20:51 newuserrrrrr: karolherbst: but amd sucks with hybrid graphics
20:51 imirkin_: and hopefully we update all the pieces that have changed, or else ... we draw wrong
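The deferred validation imirkin_ describes can be sketched roughly as below: setting state only records it and marks it dirty, and the commands are written out at draw time. All names here (`sketch_ctx`, `sketch_set_constbuf`, `sketch_draw`, the dirty bits) are made up for illustration, and the command ring is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SKETCH_DIRTY_CONSTBUF = 1 << 0,
    SKETCH_DIRTY_PROGRAM  = 1 << 1
};

struct sketch_ctx {
    uint32_t dirty;          /* state changed since the last draw */
    uint64_t constbuf_addr;  /* last constbuf address requested */
    unsigned ring_writes;    /* stand-in for the command ring */
};

/* Binding a constbuf only records the address; nothing is emitted yet. */
void sketch_set_constbuf(struct sketch_ctx *c, uint64_t addr)
{
    assert(c != NULL);
    c->constbuf_addr = addr;
    c->dirty |= SKETCH_DIRTY_CONSTBUF;
}

/* At draw time, "validate": emit commands only for the dirty pieces. */
void sketch_draw(struct sketch_ctx *c)
{
    assert(c != NULL);
    if (c->dirty & SKETCH_DIRTY_CONSTBUF)
        c->ring_writes++;    /* write out the constbuf binding */
    c->dirty = 0;
    c->ring_writes++;        /* the draw command itself */
}
```

Binding twice before a draw costs one emitted binding, not two, and a forgotten dirty bit is exactly the "we draw wrong" case.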
20:51 karolherbst: newuserrrrrr: well nouveau sucks usually with performance
20:52 gregory38: ok
20:52 gregory38: but if you draw with a program 1 and an ubo 0
20:53 gregory38: and then switch the program and do a draw with program 2 but same ubo 0
20:53 gregory38: data is already on the GPU
20:53 newuserrrrrr: I used windows and other proprietary software for a long time and it sucks, now I go with open source and free software
20:54 karolherbst: newuserrrrrr: well you can always help us, but you get usually better quality drivers with intel and amd
20:54 imirkin_: newuserrrrrr: amd supports open source mesa development with both documentation and full time engineers
20:54 imirkin_: newuserrrrrr: so stick with amd if you want good open-source support.
20:54 karolherbst: but I guess you have your rig already, so you can help us out anyway
20:54 imirkin_: gregory38: that's right.
20:55 karolherbst: imirkin_: by the way, is there a way to tell mesa to not color the instructions?
20:55 imirkin_: gregory38: that's more of a GL issue - how do you know it's the same ubo 0
20:55 imirkin_: karolherbst: please rephrase
20:55 karolherbst: imirkin_: NV50_PROG_DEBUG colors the instructions
20:55 imirkin_: karolherbst: yeah, there's an env var
20:56 imirkin_: check nv50_ir_print.cpp
20:56 imirkin_: [i have no clue what it is]
20:56 karolherbst: NV50_PROG_DEBUG_NO_COLORS
20:56 karolherbst: too easy
20:57 gregory38: well I guess it would need a dirty bit on upload (don't ask coherent stuff), and something for the index binding
21:00 gregory38: actually maybe you don't need the dirty data bit
21:00 gregory38: because you update only the "pointer" to the ubo
21:01 gregory38: or maybe it is nvc0_cb_bo_push :p
21:01 gregory38: that pushes the data
21:02 imirkin_: sooooo
21:02 imirkin_: that gets into how the hw works
21:03 imirkin_: doing one draw at a time is all well and good
21:03 imirkin_: however a lot of programs do something like
21:03 imirkin_: update uniform; draw; update uniform; draw; update uniform; draw;
21:03 imirkin_: etc
21:03 gregory38: yeah
21:03 imirkin_: wouldn't it be nice if you didn't have to wait for draw1 to finish before starting draw2?
21:04 imirkin_: so uniforms are updated via this magical path that doesn't (immediately) affect the backing data
21:04 imirkin_: but a draw will see it
21:04 imirkin_: while previous draws won't be affected
21:04 gregory38: ok gotcha
21:05 imirkin_: all uniform updates go through such a path... perhaps they shouldn't, i haven't the faintest clue what the perf tradeoffs are
21:12 karolherbst: mhh
21:12 karolherbst: why can't I swap those instructions?
21:12 karolherbst: 278: texlod 2D $r8 $s0 f32 $r4d $r4t (8)
21:12 karolherbst: 279: texlod 2D $r8 $s0 f32 $r0d $r0t (8)
21:14 karolherbst: well next is this:
21:14 karolherbst: 347: texbar (SUBOP:1) - # $r4d (8)
21:14 karolherbst: 349: texbar - # $r0d (8)
21:14 karolherbst: I guess I would have to swap those texbars too?
21:14 karolherbst: or even move that subop?
21:15 karolherbst: maybe best I just leave those tex instructions alone...
21:15 newuserrrrrr: sorry for the offtopic, I'm studying to be a programmer; if I want a chance to participate in development, do I need to learn C well? (sorry, I'm from Russia and my English is bad)
21:15 karolherbst: newuserrrrrr: I can't do shit in C before I started on nouveau :p
21:16 karolherbst: *couldn't
21:16 karolherbst: newuserrrrrr: but yeah, you would have to learn it
21:16 imirkin_: karolherbst: still can't :p
21:16 karolherbst: :D
21:16 imirkin_: karolherbst: i'm guessing there's an instruction 348 that either reads or writes r4 or r5?
21:16 karolherbst: 348: add ftz f32 $r2 neg $r5 1.000000 (8)
21:16 karolherbst: yeah
21:16 imirkin_: so it needs the texbar before it
21:16 imirkin_: it's all correct...
21:17 imirkin_: if you swap the two instructions, you'd have to adjust all the depths
21:17 imirkin_: it's not worth it
21:17 karolherbst: uhhh
21:17 imirkin_: at least not once texbars are inserted
21:17 karolherbst: so order matters for tex instructions
21:17 imirkin_: well
21:17 newuserrrrrr: Can you recommend some good docs or books?
21:17 imirkin_: order matters for the later texbar's
21:17 karolherbst: newuserrrrrr: didn't read any
21:17 imirkin_: texbar 1 = wait until all but the last tex() are complete
21:17 karolherbst: newuserrrrrr: https://www.kernel.org/doc/Documentation/CodingStyle
21:17 karolherbst: :D
21:17 imirkin_: texbar 10 = wait until all but the last 10 are complete
21:18 imirkin_: (tbh i have no idea how deep the queue goes)
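The texbar rule above ("wait until all but the last N tex() are complete") can be written down as a small model. It assumes tex results complete in issue order, which is what the rule implies; the function names are hypothetical and this is not nv50_ir code.

```c
#include <assert.h>

/* How many of the oldest tex ops are guaranteed done after `texbar depth`. */
unsigned texbar_completed(unsigned issued, unsigned depth)
{
    return issued > depth ? issued - depth : 0;
}

/*
 * Largest depth that still guarantees tex op `index` is complete,
 * where index 0 is the oldest outstanding tex op.
 */
unsigned texbar_depth_for(unsigned issued, unsigned index)
{
    assert(index < issued);
    return issued - 1 - index;
}
```

With the two texlod instructions from the log, the older one (278, writing $r4d) needs depth 1 and the newer one (279, writing $r0d) needs depth 0, matching the `texbar (SUBOP:1)` and plain `texbar` shown earlier. Swapping the two tex instructions changes every `index`, which is why all the depths would have to be adjusted.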
21:18 karolherbst: mhh
21:18 karolherbst: well
21:18 karolherbst: I think I just ignore those tex instructions for the dual issue thing then
21:18 karolherbst: it really doesn't matter that much because it is so rare that it happens anyway
21:18 newuserrrrrr: karolherbst: hmm, thanks :)
21:19 karolherbst: newuserrrrrr: well it doesn't teach you C, just the rules the kernel devs should follow
21:19 karolherbst: newuserrrrrr: anyway, best idea is if you find an issue and just try to fix it
21:19 karolherbst: newuserrrrrr: if you code something odd or bad, we will just shout at you then :p
21:21 gregory38: imirkin_: const behavior feels like a standard glUniform draw call that is uploaded on every program change, rather than a ubo that could keep the same value for several programs
21:21 newuserrrrrr: But how learn C? :))
21:22 karolherbst: newuserrrrrr: can you code in any other language?
21:22 gregory38: well it might save a couple of C call but hardware interaction is more or less the same (at least on nouveau)
21:23 newuserrrrrr: karolherbst: pascal, little android(java), also i like Golang
21:23 gregory38: anyway, thanks for all your info it was very educational
21:23 imirkin_: gregory38: well if it's not updated, then nothing is uploaded
21:23 imirkin_: gregory38: a "user_buffer" is from uniforms stored in client memory, i.e. glUniform() things
21:23 karolherbst: newuserrrrrr: well I think C is closest to java of all of those, but still pretty far away
21:24 gregory38: UBO are flushed to GPU when the program is switched
21:24 imirkin_: [which then have to be uploaded to a bo before draw]
21:24 karolherbst: newuserrrrrr: maybe you should start just by coding anything in C or something, I really don't know because I really didn't follow much while learning that stuff :/
21:25 gregory38: imirkin_: by which you mean the program or the ubo ?
21:26 imirkin_: gregory38: the ubo
21:26 newuserrrrrr: karolherbst: ok, important start
21:26 gregory38: it would be easier for me to access the hw :)
21:26 gregory38: ok
21:26 imirkin_: gregory38: ubo's are also CSO'd between st/mesa and the driver backends
21:26 imirkin_: so if you happen to "change" it to the same thing, that update never hits the driver
21:26 imirkin_: (CSO = constant state object)
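A rough sketch of the filtering imirkin_ mentions: if the "new" binding equals the cached one, the update is dropped before it reaches the driver. The struct and function names are invented for this example and do not correspond to the real gallium cso code.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cb_binding {
    const void *buffer;
    unsigned offset;
    unsigned size;
};

struct sketch_cso_cache {
    struct sketch_cb_binding bound;  /* what the driver was last told */
    unsigned driver_calls;           /* how often the driver was called */
};

/* Forward the binding to the "driver" only if it actually changed. */
void sketch_bind_constbuf(struct sketch_cso_cache *cache,
                          const struct sketch_cb_binding *cb)
{
    assert(cache != NULL && cb != NULL);
    if (cache->bound.buffer == cb->buffer &&
        cache->bound.offset == cb->offset &&
        cache->bound.size == cb->size)
        return;                      /* redundant: never hits the driver */
    cache->bound = *cb;
    cache->driver_calls++;
}
```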
21:27 gregory38: yeah, the thing is that PCSX2 just switch shaders very often
21:27 gregory38: unlike other apps
21:28 gregory38: so I'm pretty much in the worst case for validation
21:29 imirkin_: yeah... it shouldn't be _so_ bad
21:29 imirkin_: but i guess it works out that way =/
21:29 glennk: on newer hardware you might be better off using an uber shader and switch less often
21:29 gregory38: with subroutine you mean
21:29 glennk: nah, just regular branching
21:30 glennk: subroutine sucks on the hardware
21:30 imirkin_: well, that's what using subroutines with mesa will work out to :)
21:31 gregory38: honestly I didn't test uniform
21:31 gregory38: Most of the time, apps are GPU bound rather than CPU bound
21:31 imirkin_: mesa can and definitely should be improved in this regard
21:31 imirkin_: but it will take someone with an app they care about to profile and think about how to fix it
21:32 imirkin_: could be you :)
21:32 gregory38: you don't imagine how I'm busy with PCSX2 ;)
21:32 imirkin_: i'm sure it's "very"
21:32 imirkin_: and i suspect it doesn't pay too much, so you also do other things
21:33 gregory38: but yeah I'm trying to see what I can do to help
21:33 gregory38: yes, I'm working on the hardware industry
21:33 gregory38: used to be smartphone now interconnect (network) for HPC
21:33 glennk: gpu bound probably by fill rate rather than shader alu though
21:34 gregory38: glennk: you don't imagine the mess I did on the fragment shader to emulate the PS2 ;)
21:34 gregory38: glennk: what do you mean by fillrate ?
21:34 gregory38: memory bandwidth ?
21:34 glennk: the rate at which the hardware can fill pixels
21:35 gregory38: And is it a good idea to put texture sampler in if/else
21:35 gregory38: ?
21:35 glennk: often hardware bottlenecks on either the rasterizer generating fragments (before shading) or in ROP (alpha blend etc)
21:35 glennk: branches are cheap as long as they go the same way in each warp/wavefront
21:36 gregory38: ok. How cheap? 10 branches? 20 branches?
21:36 glennk: newer hardware is way more heavy on shader alu and texture sampling than on rasterizer/ROP
21:37 karolherbst: gregory38: most of the stuff may be optimized away anyway
21:37 glennk: sorry, can't parse that question in any way that makes sense to me
21:37 imirkin_: gregory38: branches are ~free when they're uniform
21:38 gregory38: ok. Question was would branches remain free performance-wise if a shader contains 20-30 branches
21:38 imirkin_: if they're uniform, yes
21:38 gregory38: (yes my shader code is a royal mess)
21:38 imirkin_: if each invocation goes its separate way, then no
21:38 karolherbst: gregory38: anything in the form of type v = cond ? value1 : value2;
21:38 gregory38: I guess I ought to benchmark it then
21:38 imirkin_: and obviously not *free*
21:38 glennk: look at the thread divergence
21:39 imirkin_: if you have like one real instruction and 300 branches, then it might cost you
21:39 glennk: if it's low then branches cost basically nothing
21:39 imirkin_: but presumably you do things other than branch in your shaders
21:39 glennk: if you have divergence you pay for both sides of the branch in all threads
21:39 karolherbst: gregory38: try to keep branches as simple as possible so that they can get optimized away into non branched binaries
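The divergence rule glennk and imirkin_ give can be put into a toy cost model: a warp pays for a side of the branch if at least one of its threads takes it, so a uniform branch pays for one side and a divergent one pays for both. The function and the cost units are made up for illustration; real hardware has more going on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Cost of one if/else for a whole warp, in arbitrary "instruction slots".
 * takes_if[i] says whether thread i takes the if side.
 */
unsigned warp_branch_cost(const bool *takes_if, size_t n,
                          unsigned cost_if, unsigned cost_else)
{
    bool any_if = false, any_else = false;

    assert(takes_if != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (takes_if[i])
            any_if = true;
        else
            any_else = true;
    }
    /* every side that any thread needs is executed by the whole warp */
    return (any_if ? cost_if : 0) + (any_else ? cost_else : 0);
}
```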
21:40 gregory38: ok. Actually I don't have too many instructions once you remove all the ifdef
21:41 imirkin_: then you might want to benchmark and see what happens
21:41 imirkin_: could go either way
21:41 glennk: well, #ifdef shader variants are by definition uniform branches
21:41 gregory38: Potentially there is still an impact, the compiler might still need to create additional register to store content
21:42 glennk: yes, that and a lot of other variables
21:42 glennk: but given it's emulating ps/2 shader hardware, how complex could it be?
21:42 glennk: compared to say an UE4 pixel shader
21:43 karolherbst: gregory38: see that shader? https://gist.github.com/karolherbst/aad71d0e5715a1ffdd5c9c2fc0963505
21:43 karolherbst: with all those ?: clauses
21:44 karolherbst: gregory38: ends up having 5 branch instructions in the binary in total
21:45 karolherbst: tgsi of that shader: https://gist.github.com/karolherbst/218323671e48f0a9cd147f1492c36beb
21:45 gregory38: my shaders are much much smaller
21:46 karolherbst: and less branches I figure?
21:46 gregory38: there are still a couple of branches (1/2) sometimes
21:46 karolherbst: yeah well
21:47 karolherbst: on nvidia hardware there are many instructions to eliminate branches completely and make them pretty much free, although there is conditional stuff
21:47 karolherbst: so you really don't need to worry about this
21:47 karolherbst: would be a good experiment to know what happens if you have like one uber shader and stick with that
21:47 gregory38: well believe it or not, I'm still supporting other brands ;)
21:48 karolherbst: glennk: I figure on modern amd cards there are also slct/set/predicated instructions?
21:48 gregory38: glennk: to answer your question, ps2 has special fixed unit so typically I need to implement blending on the shader. Sometimes the filtering too
21:48 karolherbst: or selp
21:48 gregory38: but I barely do any math
21:49 glennk: blending with framebuffer, or blending between "texture stages" ?
21:50 gregory38: with framebuffer
21:50 gregory38: the rop stuff
21:50 glennk: that's an expensive thing
21:51 gregory38: yes but I didn't find a free solution
21:51 gregory38: alpha coefficient is unclamped
21:51 gregory38: but the color could be either clamped or wrapped
21:52 gregory38: they use integer math with trunc vs float rounding of the standard unit
21:54 gregory38: https://gist.github.com/gregory38/fbcdc481c7477ca11f689a26510d5c21
21:54 gregory38: 2 shader example. A basic one, and a more complex one (likely barely used)
21:55 gregory38: anyway, I need to test the uber shader
21:59 gregory38: hum might not be that easy
21:59 gregory38: I have some discard on some paths
21:59 imirkin_: discards are fine
22:01 gregory38: I was worried about early tests and things like that
22:02 imirkin_: discards don't count as control flow
22:02 imirkin_: (i mean, they kinda do...)
22:02 glennk: everything is rasterized as 2x2 quad groups, so it won't save fill rate until you discard an entire quad group
22:04 gregory38: purpose isn't really to save fill rate but ps2 can update either the framebuffer color or the depth based on the alpha value
22:04 gregory38: so we do the rendering twice
22:04 imirkin_: huh?
22:05 glennk: hmm, maybe stencil export could work for that
22:05 gregory38: does it work on nvidia hw :p
22:05 imirkin_: once with depth bound, once with color?
22:05 gregory38: you can setup the ps2 as
22:05 gregory38: if alpha < AREF update both color and depth
22:05 gregory38: otherwise update only the color
22:06 gregory38: (the otherwise could also be update only the depth)
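The rule gregory38 describes can be written down as a small Python sketch. The function names, the `fail_mode` strings, and the way fragments are split into two passes are assumptions made for illustration; the log only says the rendering is done twice.

```python
def ps2_alpha_test_sketch(alpha, aref, fail_mode="color_only"):
    """Return (write_color, write_depth) for one fragment.

    Mirrors the rule from the chat: alpha < AREF updates both,
    otherwise only color (or only depth) is updated.
    """
    if alpha < aref:
        return (True, True)
    if fail_mode == "color_only":
        return (True, False)
    if fail_mode == "depth_only":
        return (False, True)
    raise ValueError("unknown fail_mode: %r" % (fail_mode,))


def two_pass_plan(alphas, aref, fail_mode="color_only"):
    """Split fragment indices into two draws (assumed mapping).

    First draw: fragments passing the test, color and depth writes on.
    Second draw: the remaining fragments, with one of the writes masked.
    """
    passing = [i for i, a in enumerate(alphas) if a < aref]
    failing = [i for i, a in enumerate(alphas) if not a < aref]
    return {"color+depth": passing, fail_mode: failing}


print(two_pass_plan([10, 200, 50], aref=128))
# {'color+depth': [0, 2], 'color_only': [1]}
```

In a real renderer each pass would presumably discard the fragments that belong to the other pass and set the color or depth write mask accordingly.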
22:06 imirkin_: good times.
22:06 gregory38: yeah, you could do everything with the ps2
22:06 gregory38: validation was 0
22:06 gregory38: everything was allowed ;)
22:07 karolherbst: gregory38: I wonder if games hacked around the non-standard floating point things, and if you could detect that, revert it, and just use IEEE-compliant stuff :p
22:08 karolherbst: maybe they simply didn't care though
22:08 gregory38: games care
22:08 gregory38: and none of them use IEEE
22:08 gregory38: you can use IEEE stuff, it works most of the time
22:08 gregory38: but not always
22:08 karolherbst: I see
22:09 gregory38: For example you have the nice constant Pi
22:10 imirkin_: you mean delicious constant
22:10 gregory38: You do Pi * 1/4 which is different from 1/4 * Pi
22:10 gregory38: lol
22:10 gregory38: the ps2 kind of loses some bits of accuracy so you end up actually below pi/4
22:11 gregory38: so cos/sin don't really have the same behavior
22:11 gregory38: so there is one place in the code that does if (a == pi && b == 1/4) then result is ....
22:12 imirkin_: that seems nice and scalable :)
22:12 glennk: i guess you can use shader image loads/stores instead of regular color buffer ops to handle that
22:13 gregory38: if ((s == 0x3e800000) && (t == 0x40490fdb))
22:13 gregory38: return 0x3f490fda; // needed for Tales of Destiny Remake (only in a very specific room late-game)
22:13 gregory38: else
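The constants in that snippet can be decoded with a short Python sketch. 0x3e800000 is 0.25f and 0x40490fdb is the float32 nearest to pi; the IEEE product is 0x3f490fdb, so the hardcoded 0x3f490fda is exactly one unit in the last place below it. The helper names are hypothetical, and `ps2_mul_sketch` only reproduces the single special case quoted above, not the real PS2 multiplier.

```python
import struct

QUARTER_BITS = 0x3E800000  # 0.25f
PI_BITS = 0x40490FDB       # float32 nearest to pi


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def float_to_bits(value):
    """Round a Python float to float32 and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def ieee_mul_bits(s_bits, t_bits):
    """IEEE float32 multiply. The double product of two float32 values
    is exact, so the only rounding is the final conversion to float32."""
    return float_to_bits(bits_to_float(s_bits) * bits_to_float(t_bits))


def ps2_mul_sketch(s_bits, t_bits):
    """Special-case only the operand pair quoted in the chat (s = 0.25, t = pi)."""
    if s_bits == QUARTER_BITS and t_bits == PI_BITS:
        return 0x3F490FDA
    return ieee_mul_bits(s_bits, t_bits)


print(hex(ieee_mul_bits(QUARTER_BITS, PI_BITS)))   # 0x3f490fdb
print(hex(ps2_mul_sketch(QUARTER_BITS, PI_BITS)))  # 0x3f490fda
```

The check is on the exact operand order, which matches the remark that Pi * 1/4 and 1/4 * Pi do not give the same result on the ps2.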
22:13 glennk: so updating color but discarding depth would be a shader store followed by a discard, and the reverse is just depth only rendering
22:13 karolherbst: gregory38: isn't there maybe some constant error you could apply?
22:13 karolherbst: gregory38: or is it not really predictable?
22:13 glennk: but the tricky bit is overlapping draw calls, needs explicit synchronization
22:14 gregory38: glennk: discard also discards the store
22:14 gregory38: karolherbst: well, I had that kind of idea, but it isn't easy
22:14 gregory38: I'm not sure it always loses bits
22:14 karolherbst: gregory38: well, do you have hardware?
22:15 gregory38: so maybe a lut based on the lsb
22:15 gregory38: yes, the plan was to test lots of values on it ;)
22:15 glennk: gregory38, uh no, the stores prior to the discard go through
22:16 gregory38: glennk: are you sure?
22:16 glennk: https://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt point 20
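A toy Python model of the behavior glennk describes: stores issued before the discard land in the image, and the discard only prevents the framebuffer write and anything after it. The op tuples and the function name are invented for this sketch; the spec text linked above is the authority.

```python
def run_fragment_sketch(ops, image, framebuffer, pixel, color):
    """Execute a fragment's ops in order.

    ops: list of ("store", key, value) or ("discard",) tuples.
    Returns True if the fragment was discarded.
    """
    for op in ops:
        if op[0] == "store":
            image[op[1]] = op[2]   # takes effect immediately
        elif op[0] == "discard":
            return True            # nothing after this point runs
    framebuffer[pixel] = color     # only reached without a discard
    return False


image, framebuffer = {}, {}
ops = [("store", "a", 1), ("discard",), ("store", "b", 2)]
run_fragment_sketch(ops, image, framebuffer, pixel=(0, 0), color="red")

print(image)        # {'a': 1}
print(framebuffer)  # {}
```

This is the combination suggested in the chat: a shader store followed by a discard updates the "color" image while leaving depth and the regular color buffer untouched.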
22:16 gregory38: did you do that in Mesa ? It could explain some issue with load/store atomic
22:17 gregory38: glennk: oh interesting
22:18 glennk: "some issue" is rather vague
22:18 gregory38: I need to double check the rendering on nvidia
22:18 gregory38: otherwise I could provide an apitrace
22:20 gregory38: wait, actually I think it is the game that has poor quality
22:21 gregory38: yes it is fine
22:21 gregory38: :)
22:22 gregory38: anyway, it is interesting. Maybe it could help in the future
22:22 imirkin_: it can be hard to retroactively undo stores :)
22:23 gregory38: well I'm working on HW, we can do stuff
22:23 imirkin_: well, you can have e.g. a coherent buffer/image
22:23 imirkin_: how would you undo that?
22:23 imirkin_: esp after some atomic operation
22:23 gregory38: dunno :p
22:24 gregory38: for sure it would be costly or with limitation
22:24 gregory38: but opengl wiki isn't clear
22:24 imirkin_: limitation: it doesn't work :)
22:24 gregory38: https://www.opengl.org/wiki/Image_Load_Store
22:25 gregory38: The Fragment Shader has the ability to issue a discard command. This will prevent writing any fragment values to the framebuffer. However, it will also have the effect of preventing image store and atomic operations from taking place.
22:25 imirkin_: it's a wiki, fix it :)
22:25 gregory38: (yeah I know, better to read the spec)
22:25 imirkin_: what it says is technically not wrong
22:25 imirkin_: but can easily be misinterpreted
22:25 gregory38: yes it is confusing
22:27 gregory38: anyway, the 2 consecutive draws won't be expensive if I don't switch the shader between them but only update uniforms
23:25 gregory38: So I managed to reinstall the nvidia driver to do some benchmarks
23:25 imirkin_: let me guess - it does better than nouveau?
23:25 gregory38: barely :p
23:25 imirkin_: that's ... very sad
23:26 gregory38: I don't have my GPU at full speed so it could explain some perf issues
23:26 gregory38: normally these are CPU-bound testcases so it shouldn't have a big impact
23:28 gregory38: nouveau is around 60% of nvidia (if I don't enable thread optimization)
23:28 imirkin_: ouch
23:28 imirkin_: i assume that's with reclocking?
23:28 gregory38: DC: core 1084 MHz memory 6007 MHz
23:28 gregory38: 0f: core 405-1280 MHz memory 6008 MHz AC DC *
23:28 gregory38: Yes but not turbo
23:28 imirkin_: i'll take that as a "yes"
23:29 gregory38: but that doesn't explain the perf, my testcases are likely heavier on the CPU
23:29 gregory38: except the sotc testcase
23:29 imirkin_: well, that's potentially another 15-20%
23:30 imirkin_: assuming turbo lets you go all the way up to 1280
23:30 karolherbst: usually not
23:30 imirkin_:&
23:31 karolherbst: turbo usually gives you around 100-150MHz more though
23:31 glennk: wonder if dual source blending would help for implementing some of the ps2 blend modes
23:31 gregory38: I already use it
23:31 gregory38: managed to find a bug in the AMD driver, which they finally managed to fix! And now I'm waiting for the fix
23:32 gregory38: And it seems to be broken on intel too
23:32 gregory38: Both if you enable SSO + blending
23:32 gregory38: I mean dual-blending
23:32 gregory38: imirkin_: karolherbst: let's say there is 10% perf explanation for the GPU speed
23:33 gregory38: there is still some margin :)
23:34 gregory38: Hum on sotc (generally GPU limited)
23:34 gregory38: nvidia 113 fps (around 160+ with MT)
23:34 gregory38: Nouveau 65
23:35 gregory38: Nouveau without UseProgram 73
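The numbers quoted so far can be checked with a few lines of Python. All inputs come from the log (113, 65 and 73 fps on sotc, 1084 MHz current core clock, 1280 MHz top of the 0f range, 100-150 MHz for turbo); the helper names are made up, and the assumption that fps scales linearly with core clock is only a rough upper bound.

```python
NVIDIA_FPS = 113.0
NOUVEAU_FPS = 65.0
NOUVEAU_NO_USEPROGRAM_FPS = 73.0
CURRENT_CORE_MHZ = 1084.0
MAX_CORE_MHZ = 1280.0


def ratio_percent(value, reference):
    """value as a percentage of reference."""
    return 100.0 * value / reference


def headroom_percent(current_mhz, target_mhz):
    """Extra clock available, as a percentage of the current clock."""
    return 100.0 * (target_mhz - current_mhz) / current_mhz


print(round(ratio_percent(NOUVEAU_FPS, NVIDIA_FPS), 1))                # about 57.5
print(round(ratio_percent(NOUVEAU_NO_USEPROGRAM_FPS, NVIDIA_FPS), 1))  # about 64.6
print(round(headroom_percent(CURRENT_CORE_MHZ, MAX_CORE_MHZ), 1))      # about 18.1
print(round(headroom_percent(CURRENT_CORE_MHZ, CURRENT_CORE_MHZ + 100), 1))  # about 9.2
print(round(headroom_percent(CURRENT_CORE_MHZ, CURRENT_CORE_MHZ + 150), 1))  # about 13.8
```

So the full 1280 MHz would be roughly 18% more clock (the "15-20%" figure), while a 100-150 MHz turbo is closer to 9-14%, in line with the "let's say 10%" estimate.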
23:37 gregory38: hum strange, it used to help another testcase but not anymore
23:41 gregory38: I have a testcase which is around 48 fps but 87 on nvidia
23:41 gregory38: the testcase does lots of texture upload with pbo. Basically every draw call
23:41 gregory38: (because a game designer had the good idea to upload a background with 16x16 sprites....)
23:42 gregory38: _mesa_TextureSubImage2D is 38.80%
23:43 gregory38: (with children)
23:43 gregory38: including my program
23:43 gregory38: otherwise 68% of the driver
23:46 gregory38: could just be due to the slower GPU
23:54 glennk: i think nha implemented pbo texture uploads/readpixels acceleration fairly recently
23:54 imirkin: yeah, pbo upload uses a texture buffer (when possible)
23:54 imirkin: and pbo download uses an image to write to the pbo
23:54 imirkin: nouveau has a very static resource placement policy
23:54 imirkin: nvidia might do something more dynamic
23:55 glennk: or "game specific" policy
23:55 imirkin: well, among other things, textures are always in vram
23:56 gregory38: well, I would be surprised if they had a profile for us
23:56 imirkin: never in "gart"