00:00 karolherbst: ahh wait
00:01 karolherbst: http://www.realworldtech.com/gt200/9/
00:01 karolherbst: "Dual ‘Issue’"
00:01 imirkin: karolherbst: it has a ton of issues. does that count? :)
00:01 karolherbst: yes
00:01 karolherbst: I have a tesla here
00:02 karolherbst: that's why it counts :p
00:02 imirkin: [play on words]
00:02 karolherbst: well
00:02 karolherbst: even if we can only dual issue like 10%
00:02 karolherbst: it is still an improvement
00:02 karolherbst: although nobody seriously cares anymore...
00:03 imirkin: which tesla do you have btw?
00:03 karolherbst: mcp79?
00:03 tobijk: karolherbst: just to play devil's advocate, i care :D
00:03 imirkin: ah cool
00:03 karolherbst: one of those mac minis
00:03 karolherbst: I have
00:03 karolherbst: yeah I think it had a 9400m
00:04 karolherbst: does that count as gt200+?
00:04 karolherbst: I read on those the dual issue situation is a bit better
00:04 karolherbst: but I have seriously no idea how we can be sure it improves anything
00:04 imirkin: i guess i dunno for sure, but i GUESS that the dual-issue stuff is for 4-byte ops (vs 8-byte ones)
00:05 imirkin: nouveau does try to emit as many 4-byte ops as possible
00:05 karolherbst: mhhh
00:05 karolherbst: I think there is more to it
00:05 karolherbst: http://www.realworldtech.com/includes/images/articles/g100-6.gif
00:05 imirkin: yeah, could be
00:05 karolherbst: the graph shows it somehow
00:05 imirkin: yeah, no clue what that means
00:06 karolherbst: instruction latency is 4+
00:06 karolherbst: usually
00:06 karolherbst: but
00:06 imirkin: mcp79 is most like a g200, except that it doesn't have fp64 (while g200 does)
00:07 karolherbst: if you issue a mad, it gets executed on the FPU and after 2 cycles a MUL can be executed on the SFU
00:07 imirkin: it has support for pausing/unpausing TF, but none of the DX 10.1 features that the GT21x series supports
00:07 imirkin: "on the SFU"?
00:07 imirkin: vs "on the FPU"?
00:07 karolherbst: special function unit
00:08 karolherbst: there are 2 sfu per SM
00:09 karolherbst: " Each SFU can fulfill up to four operations per clock: four MUL (Multiply) instructions. So one SM as a whole can execute 8 MADs (16 operations) and 8 MULs (8 operations) per clock, or 24 operations per clock, which is (relatively speaking) 3 times the number of SPs."
00:09 karolherbst: no idea if that's true though
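[editor's note] The arithmetic in the quoted passage can be checked with a short sketch. The per-SM figures (8 MADs and 8 MULs per clock, 8 SPs per SM) are taken from the quote and are not verified here; the function name is made up for illustration.

```python
def ops_per_clock(mads: int, muls: int) -> int:
    """Count floating-point operations, treating one MAD as two ops (mul + add)."""
    return 2 * mads + muls

# Figures from the quoted article (assumed, not verified against hardware docs).
sps_per_sm = 8
total = ops_per_clock(mads=8, muls=8)

print(total)               # operations per clock for one SM
print(total / sps_per_sm)  # operations per clock per SP
```

With these inputs the per-SP figure matches the "3 flops/SP" that the discussion refers to; whether the hardware reaches it in practice is exactly what is being questioned here.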
00:10 karolherbst: okay
00:10 karolherbst: according to that guy
00:11 karolherbst: the SM can issue one instruction every 2 clocks
00:11 karolherbst: mh
00:11 karolherbst: would be funny to figure that out somehow
00:12 karolherbst: maybe reading some nvidia docs is better at this point
00:12 tobijk: karolherbst: but only every 4th instruction the same instr? (mul or mad) that would be silly
00:13 karolherbst: ahh
00:13 tobijk: instruction = cycle / clock
00:13 karolherbst: nvidia says something about that too
00:13 tobijk: karolherbst: pure interpretation of the figure
00:13 tobijk: could be complete bs
00:13 karolherbst: "The individual streaming processing cores of GeForce GTX 200 GPUs can now perform near full-speed dual-issue of multiply-add operations (MADs) and MULs (3 flops/SP) by using the SP’s MAD unit to perform a MUL and ADD per clock, and using the SFU to perform another MUL in the same clock. Optimized and directed tests can measure around 93-94% efficiency."
00:13 karolherbst: that is from nvidia^
00:15 karolherbst: regarding the SFU: "Special function units (SFUs) in the SMs compute transcendental math, attribute interpolation (interpreting pixel attributes from a primitive’s vertex attributes), and perform floating-point MUL instructions."
00:16 tobijk: karolherbst: ah makes sense, but can we already address the sfu?
00:16 karolherbst: I would assume that happens automatically in the SM?
00:16 RSpliet: assumption is the mother of...
00:17 karolherbst: well
00:17 karolherbst: I have a dual issue pass for the compiler
00:17 karolherbst: and if I implement canDualIssue for gt200+
00:17 karolherbst: I might be able to get a significant perf increase in pixmark_piano
00:17 karolherbst: it does _a_lot_ of MUL/MADs
00:17 karolherbst: it is nearly crazy
00:17 karolherbst: some BBs are like 20 MULs + 10 MADs
00:18 RSpliet: yeah, most matrix multiplications are just a string of mads
00:19 karolherbst: we don't have any hw counters on tesla I guess?
00:19 RSpliet: wonder how much you could win by carefully turning mul's into mad a*b+0, and splitting up 1/3 of strings of mads into mul + add
00:20 karolherbst: :D
00:20 RSpliet: probably not as much as it sounds, given most workloads are mem-bound :-P
00:21 karolherbst: pixmark_piano is engine only :p
00:21 karolherbst: really good test to test the compiler
00:21 karolherbst: because you directly notice any perf change
00:21 karolherbst: don't forget, my dual issue pass improves perf there by like 4% alone
00:22 karolherbst: mhh
00:22 karolherbst: we have only inst_issued counters for fermi and newer
00:36 karolherbst: would be interesting if we could somehow expose TXAA support on kepler... but for whatever reasons, it doesn't seem to be available on linux
08:26 s0be: I've got a new laptop coming with hybrid/prime/optimus/whatever the preferred name it goes by these days with a (I believe) NV117 discrete gpu. What sorts of pains should I be expecting to use nouveau?
08:36 chithead: it will work but not very fast I think
08:36 chithead: you may see better performance with the integrated graphics
08:39 s0be: cool, thanks
08:39 s0be: the extent of my 'gpu use' is video offload, so I wasn't real worried
08:39 s0be: just read a lot of horror stories from people battling hybrid gpus
08:42 s0be: I'm upgrading from an old NVA5 laptop, which was apparently one of the last generations where discrete only was still common
11:39 karolherbst: gregory38: honestly, I think you are doing something funny in pcsx2. I have alternating frames with 20k draw calls and then with 25 calls
11:39 karolherbst: *20k calls
11:39 karolherbst: and I really don't see why there are frames with only 25 calls
11:40 gregory38: I fixed a bug with vsync (mesa doesn't use the same extensions)
11:40 gregory38: Behavior of draw calls is normal
11:40 gregory38: ps2 internal fps is 60 fps
11:40 karolherbst: I don't see the point of those 25 call frames
11:40 gregory38: but games often limit the rendering to 30 fps
11:40 gregory38: so they render a full image once = 20k calls
11:41 gregory38: and present half of it (interlaced)
11:41 karolherbst: wouldn't it be better just to time the frames better then?
11:41 gregory38: 2nd image is only merged/presented so few draw calls
11:42 karolherbst: yeah right, but why can't you just present 30 frames per second and drop those other frames?
11:42 karolherbst: also, what do you do when the display has like 40 Hz currently?
11:42 karolherbst: or something funny like 59.23Hz
11:42 karolherbst: this trick only works with 60 Hz right?
11:43 gregory38: ps2 fps are hardcoded based on internal pll
11:43 gregory38: ntsc/pal/...
11:43 gregory38: it isn't aligned on the current display
11:43 karolherbst: yeah, I know, but you can also time the glXSwapBuffers
11:44 gregory38: i.e
11:44 gregory38: I don't know
11:45 karolherbst: anyway, I don't think this approach is that usable. Think about those 144Hz displays
11:45 karolherbst: they want to display 144fps
11:45 gregory38: well they would display 4 times the same frame ;)
11:46 karolherbst: well
11:46 karolherbst: then you are still off
11:46 karolherbst: 120fps vs 144 fps
11:46 RSpliet: gregory38: does the GPU work ahead? eg. say frame 0 contains these 20K draw calls, frame 1 only 25, frame 2 contains 20K again. Does it start rendering frame 2 before or after frame 1 is physically displayed?
11:46 karolherbst: uhh
11:47 RSpliet: hmm. physically displayed is probably wrong given the long scan-out time, but before or after frame 1 is activated by pageflip
11:47 karolherbst: right, this would become a big bottleneck
11:47 gregory38: future screen has a variable fps so it might work better in the future (freesync and such)
11:47 karolherbst: if you do like 50% of the time,... nothing
11:47 karolherbst: gregory38: I don't think you got the point
11:48 karolherbst: RSpliet: does glXSwapBuffers block by the way?
11:48 karolherbst: I doubt it
11:48 karolherbst: but I also know not enough about gl
11:48 RSpliet: idk, I know fuckall about GL :-P
11:48 gregory38: I've an appointment but feel free to explain the stuff to me (you could even post a patch to my project)
11:48 karolherbst: :D
11:48 karolherbst: gregory38: well the thing is just
11:48 karolherbst: gregory38: if you end up doing nothing while you render the filling frame
11:48 karolherbst: gregory38: you waste 50% of performance
11:49 karolherbst: because every two frames you don't do anything
11:49 karolherbst: maybe triple buffer would solve the problem
11:49 karolherbst: but again
11:49 karolherbst: with 144Hz displays you waste time again
11:50 karolherbst: you need to buffer round_up(display_frame_rate, target_frame) + 1
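[editor's note] A minimal sketch of the buffer count suggested above, assuming the two arguments are meant as a ratio, i.e. round_up(display_frame_rate / target_frame) + 1. The function names are hypothetical and the formula is only the one proposed in the conversation, not a verified rule.

```python
import math

def buffers_needed(display_hz: float, target_fps: float) -> int:
    """Buffers per the formula above: round up refreshes-per-frame, plus one."""
    return math.ceil(display_hz / target_fps) + 1

def refreshes_per_frame(display_hz: float, target_fps: float) -> float:
    """How many display refreshes one rendered frame has to cover."""
    return display_hz / target_fps

for hz in (60, 144):
    print(hz, refreshes_per_frame(hz, 30), buffers_needed(hz, 30))
```

The 144 Hz case shows the mismatch raised in the discussion: 144 / 30 is not an integer, so repeating each frame a whole number of times cannot line up with the display.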
11:50 RSpliet: karolherbst: don't worry too much about corner cases like that just before you get the original perf up ;-)
11:50 karolherbst: :D
11:50 karolherbst: well
11:50 karolherbst: wasting 50% of perf is a serious issue
11:50 karolherbst: anyway
11:50 karolherbst: pcsx2 is really CPU bottlenecked
11:50 karolherbst: and I would like to know why exactly
11:51 karolherbst: it has something to do with the gl overhead
11:52 karolherbst: ahh okay
11:52 karolherbst: RSpliet: glXSwapBuffers is free, but syncing point might block
11:52 karolherbst: like a glClear
11:56 RSpliet: right... but do you swap buffers for every second frame in the first place?
11:57 karolherbst: for every frame^
11:57 karolherbst: otherwise there wouldn't be a new frame if you don't swap buffers
11:59 RSpliet: but you don't want a new frame right?
11:59 RSpliet: if you want to display the same twice ;-)
11:59 karolherbst: yeah
11:59 karolherbst: well you can smooth things up if you want though
11:59 karolherbst: but then you would need the next frame
12:00 karolherbst: or you interlace, but I don't see the point in that
12:00 karolherbst: interlacing was invented to reduce bandwidth, not to smooth things up
12:02 karolherbst: gregory38: also, you use tons of unsupported stuff
12:02 karolherbst: if you depend on ARB_imaging, check for that
12:02 karolherbst: and if it isn't there, don't use stuff from it
12:10 karolherbst: ohhh
12:10 karolherbst: wait
12:10 karolherbst: that looks like a buffer overflow
12:13 karolherbst: I got this error string: https://gist.github.com/karolherbst/3a975e2a24165e5c0e7bae66c85468f7
12:17 karolherbst: odd
12:21 karolherbst: well
12:24 karolherbst: _mesa_GetPointerv (pname=3568, params=0x7fffffffd650) at ../../../src/mesa/main/getstring.c:295
12:24 karolherbst: yeah
12:24 karolherbst: that looks wrong
12:25 karolherbst: params[0]=0x100000001
12:25 karolherbst: params[1]=0x3f800000deadc0de
12:31 karolherbst: ahh
12:31 karolherbst: apitrace is doing that
12:31 karolherbst: but why...
12:44 karolherbst: okay, that's apitrace being stupid?
13:00 gregory38: I'm back
13:00 gregory38: In reverse order, I don't use tons of unsupported stuff
13:01 gregory38: image is only for some effects in some games and I do check the reported gl extension
13:02 gregory38: what is the perf impact of rendering an image with few draw calls?
13:03 gregory38: is it a GPU or a CPU impact?
13:05 gregory38: karolherbst: which version of apitrace? Normally it is working
13:06 gregory38: but it used to have some issues with image
13:06 gregory38: oh by arb_imaging did you mean GL_ARB_copy_image ?
13:07 karolherbst: gregory38: sorry for that, apitrace was being stupid and I got tons of errors :/
13:08 karolherbst: gregory38: stalls mainly
13:08 gregory38: http://pastebin.com/PCeNbxmx
13:08 karolherbst: gregory38: glXSwapBuffers swaps the front/back buffer as you know, but if you hit a sync point later, you have to wait until vsync happened
13:08 gregory38: I did that a long time ago, I'm not sure we still need it
13:09 gregory38: hum, I think we clear the backbuffer
13:09 karolherbst: yeah and in this case you have to wait for vsync I think
13:10 gregory38: we do rendering, then
13:10 gregory38: clear + copy render image to backbuffer + vsync
13:11 gregory38: I'm not sure the clear is useful.
13:11 karolherbst: yeah, I was just curious if you don't fill the second frame if perf would be better
13:11 gregory38: if I vsync only the first frame (or the second)
13:14 gregory38: by the way, I updated my bug on the perf.
13:14 gregory38: The removal of image/ssbo validation has a huge perf impact
13:14 gregory38: (well big enough)
13:15 gregory38: It saved 10 ms of validation
13:18 gregory38: karolherbst: http://hpics.li/343cb8a
13:19 gregory38: 23.73% 0.00% pcsx2_GSReplayL [unknown] [.] 0000000000000000
13:19 gregory38: What could it be ?
13:20 gregory38: unfolding the callgraph
13:20 gregory38: show nvc0_constbufs_validate,st_bind_ubos.isra.1,nvc0_set_constant_buffer
13:20 gregory38: (and others stuff)
13:21 karolherbst: gregory38: you will need the callgraph
13:21 karolherbst: the reverse one
13:21 gregory38: - 23.73% 0.00% pcsx2_GSReplayL [unknown] [.] 0000000000000000
13:21 gregory38: - 0
13:21 gregory38: - 6.96% 0
13:21 gregory38: - 6.02% 0
13:21 gregory38: - 6.02% 0x9641018
13:21 gregory38: 2.40% nvc0_constbufs_validate
13:21 gregory38: you mean that?
13:21 gregory38: One question, I saw that mutex lock is often called
13:22 gregory38: could it be serialization that causes the perf impact?
13:22 karolherbst: in nouveau_screen_get_name?
13:22 gregory38: no
13:23 gregory38: I don't know who calls the pthread lock/unlock
13:23 gregory38: unfortunately
13:24 gregory38: Anyway, I'm curious if all this UBO validation/bind is normal
13:59 gregory38: Hum I'm curious, why does a texture image upload from a pbo (texsubimage) require a draw (st_pbo_draw)?
14:00 gregory38: my first guess, would have been a kind of memcopy to transfer data from mem to vram
14:32 karolherbst: gregory38: I would guess that most of the stuff is stored in ram before it gets actually uploaded to the gpu
14:33 karolherbst: gregory38: anyway, nouveau handles out-of-memory situations really badly, so I guess only what is actually needed gets uploaded
14:37 gregory38: ok
14:40 karolherbst: but I also have no idea about how gallium works anyway
14:42 gregory38: hum, if I don't call draw, I have nearly the performance of the Nvidia driver
14:43 karolherbst: visual differences?
14:43 gregory38: well no draw == black screen
14:43 karolherbst: ahh
14:43 karolherbst: right
14:43 karolherbst: I thought you excluded just a few draw cals
14:43 karolherbst: *calls
14:44 gregory38: I try to remove texture upload
14:44 gregory38: still slow,
14:44 karolherbst: gregory38: do you know the cpu profiler in apitrace?
14:44 gregory38: so I tried to remove uniform
14:44 gregory38: still slow
14:44 gregory38: so I tried removing the full draw call to see if there is any impact
14:44 gregory38: I tried the profiler of apitrace
14:44 karolherbst: glretrace --pcpu
14:45 karolherbst: this will print the cpu time of each gl call
14:45 karolherbst: you can also use --pgpu
14:45 gregory38: but I never managed to get any useful info from it
14:45 karolherbst: which does the same for the gpu then
14:45 karolherbst: well
14:45 karolherbst: --pcpu should help finding where the driver spends some time
14:45 gregory38: well in the draw call ;)
14:45 karolherbst: you will be surprised
14:45 karolherbst: it won't
14:46 karolherbst: draw calls have a high cost on the gpu
14:46 karolherbst: but not on the cpu
14:46 karolherbst: usually shader link/compile have the highest cpu load
14:46 karolherbst: or glclear
14:46 karolherbst: stuff like that
14:47 gregory38: normally I replay my benchmark several times (and skip the first time) so link/compilation ought to be done
14:48 gregory38: I'm tracing with the gui but my trace is maybe a bit too big
14:48 karolherbst: doesn't matter
14:48 karolherbst: you can write the output of the profiler into a file
14:48 karolherbst: and sort the output
14:48 karolherbst: I've done it with 10GB traces already
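[editor's note] A minimal sketch of sorting profiler output saved to a file, as suggested above. It assumes the output has a header line starting with "#" that names the columns (including one called cpu_dura) and data lines starting with "call" with the GL function name last; that layout, the function name and the sample rows are assumptions for illustration, so check them against the real output before relying on this.

```python
def slowest_calls(lines, column="cpu_dura", top=10):
    """Return the `top` (duration, name) pairs with the largest value in `column`."""
    header = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#":
            # Column names follow the leading '#'.
            header = fields[1:]
        elif fields[0] == "call" and header is not None:
            idx = header.index(column)
            rows.append((int(fields[idx]), fields[-1]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]

# Made-up sample rows in the assumed layout.
sample = [
    "# call no gpu_start gpu_dura cpu_start cpu_dura vsize_start vsize_dura rss_start rss_dura pixels program name",
    "call 1 0 0 100 12000 0 0 0 0 0 3 glDrawElementsBaseVertex",
    "call 2 0 0 200 500 0 0 0 0 0 3 glBindTexture",
    "call 3 0 0 300 48000 0 0 0 0 0 3 glClear",
]

for duration, name in slowest_calls(sample, top=2):
    print(duration, name)
```

For a large trace, passing an open file object instead of a list keeps memory use low, since the function only iterates over lines once.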
14:50 karolherbst: mupuf: I am quite sure that the voltage step size was reduced to 3.125mV on pascal, just need somebody to verify that for me :D
14:50 karolherbst: but it would make sense
14:50 karolherbst: gregory38: 50% more perf sounds very good :D
14:51 karolherbst: gregory38: mind giving me the patch?
14:51 gregory38: which patch
14:52 gregory38: http://pastebin.com/uXierYNr
14:52 gregory38: You mean this one
14:52 karolherbst: ohh
14:52 karolherbst: you didn't change anything else now?
14:52 gregory38: I fixed the vsync stuff in PCSX2 too
14:52 gregory38: It was always enabled
14:53 karolherbst: ahh
14:53 karolherbst: any significant perf change?
14:53 gregory38: yes it is a bit better, I'm not capped at 60 fps
14:54 gregory38: but we're still off of the target
14:55 karolherbst: how much perf do you miss compared to nvidia now?
14:56 gregory38: way too much :p
14:56 gregory38: I would need to resinstall the nvidia driver to compare it
14:57 gregory38: I have the profile info
14:58 karolherbst: gregory38: that happens when an application tries to be smart: https://i.imgur.com/b9Rn5UK.jpg
14:58 karolherbst: check the fps
15:02 gregory38: http://pastebin.com/dny4ihYk
15:02 gregory38: sorted by cpu duration (only the end)
15:03 karolherbst: wow
15:06 gregory38: I don't know the unit but I took all values above 10000 (likely 10 us)
15:06 karolherbst: gregory38: how big is the trace you've got?
15:07 gregory38: 362MB
15:07 gregory38: I mean the apitrace file
15:07 karolherbst: yeah
15:07 karolherbst: could you send it over? (xz compressed)
15:08 gregory38: So above 10us, I have 9646 entries and 5746 are glDrawElementsBaseVertex
15:10 gregory38: xzip ongoing
15:10 gregory38: the game toggles lots of state
15:10 gregory38: so it is quite heavy for the driver
15:10 tobijk: gregory38: i'm not sure if that patch is applicable to a general mesa
15:10 karolherbst: :D
15:10 gregory38: no it isn't
15:11 gregory38: tobijk: it was just to measure the perf impact of useless validation
15:12 tobijk: mh we could try to be smarter about the validations, maybe skipping them in some safe places :)
15:12 gregory38: those validations were added recently for 4.2/4.2 features
15:12 gregory38: if we don't use the new feature, we mustn't pay the extra validation
15:13 tobijk: that could be a safe way to go, check the features and enable or disable those checks on that basis
15:14 gregory38: tobijk: https://bugs.freedesktop.org/show_bug.cgi?id=96355
15:14 gregory38: Ask the expert, I think they know how to fix it
15:14 gregory38: However, I'm afraid it won't go into mesa 12.0
15:15 tobijk: gregory38: i'd assume this more a general glsl topic, not an nouveau specific one
15:15 tobijk: though the driver may have its fair share :D
15:16 gregory38: it is between gallium and the drivef
15:16 gregory38: driver*
15:17 tobijk: gregory38: i guess that is 41fps (46fps)?: => Mean by frame: 21.586538ms (46.325169fps)
15:18 gregory38: what?
15:18 gregory38: first value is the rendering time in milliseconds
15:19 tobijk: ups :D
15:19 gregory38: 2nd value is the inverted value
15:19 tobijk: ignore me
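[editor's note] The relation between the two values in that line (mean frame time in milliseconds and its inverse in frames per second) can be sketched as follows. The function name is hypothetical; the input is the figure quoted above.

```python
def fps_from_frame_time(ms_per_frame: float) -> float:
    """Convert a mean frame time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

mean_ms = 21.586538  # mean frame time from the replay output quoted above
print(round(fps_from_frame_time(mean_ms), 2))
```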
15:20 gregory38: next time, I will remove the texture of my rendering
15:20 gregory38: it would plummet the trace size
15:21 gregory38: I wish I had a fiber line
15:21 tobijk: we all do :)
15:22 gregory38: karolherbst: https://drive.google.com/file/d/0BzftmrM8nnSoMGFjVzVRaEp1a0k/view?usp=sharing
15:23 gregory38: 118MB
15:28 karolherbst: gregory38: thanks
15:29 karolherbst: gregory38: fixes to that will be most likely applied to stable versions
15:29 karolherbst: gregory38: it is a performance regression, isn't it
15:29 karolherbst: ?
15:31 gregory38: yes perf regression
15:31 karolherbst: gregory38: I get like 14 fps :D
15:32 karolherbst: ohh
15:32 karolherbst: mesa built with O0
15:37 gregory38: I'm leaving, happy hacking
15:39 karolherbst: yay 25fps
15:40 karolherbst: 19 fps on lowest clocks
15:48 imirkin: gregory38: i'm probably going to spend some time on nouveau today... if there's a trace or something else you want me to test with, let me know. note that i don't know the first thing about pcsx2 (beyond that it's a ps2 emu), so instructions for idiots will be necessary.
15:49 karolherbst: imirkin: the link he gave me
15:49 karolherbst: https://drive.google.com/file/d/0BzftmrM8nnSoMGFjVzVRaEp1a0k/view?usp=sharing
15:49 imirkin: what do i do with it beyond downloading?
15:49 karolherbst: it's an apitrace
15:49 imirkin: ah ok
15:50 imirkin: i know what to do with that =]
15:50 karolherbst: and on my gpu: 19fps lowest clock -> 25 fps highest clocks ;)
15:50 karolherbst: looks like _something_ is wrong :D
15:51 imirkin: probably just sitting there waiting on fences
15:51 karolherbst: odd
15:51 karolherbst: 26 fps with nvidia
15:52 imirkin: so i guess nouveau doesn't do _too_ badly then
15:52 tobijk: cpu bound?
15:52 karolherbst: odd
15:53 karolherbst: tobijk: well yeah
15:53 karolherbst: but he said he gets like multiple times more perf with nvidia
15:53 karolherbst: maybe he just built mesa with O0
15:53 imirkin: gregory38: the pbo stuff was recently accelerated with draws... for upload it reads from a texture buffer, for download it writes to an image.
15:53 karolherbst: well
15:53 karolherbst: gregory38s patch still helps
15:54 karolherbst: imirkin: and you would need this: http://pastebin.com/raw/uXierYNr
15:54 karolherbst: 1st hunk fixes rendering
15:54 karolherbst: the other things disable some validations speeding up rendering
15:54 karolherbst: but now I get the same perf with nvidia and nouveau
15:54 karolherbst: odd
15:54 karolherbst: gregory38: ^^ something on your end must be fishy then...
15:57 karolherbst: yeah, sotc main menu also same perf with nouveau and nvidia
15:58 karolherbst: yeah the perf difference in game is like 20% or something
16:28 karolherbst: well I get bad perf because apitrace is slow
16:57 imirkin: karolherbst: or not use a debug context :)
16:59 karolherbst: how can I do that in apitrace?
16:59 imirkin: that validate_io stuff hitting in desktop is a new thing idr added, only in debug contexts
16:59 imirkin: dunno
16:59 imirkin: might not be possible =/
16:59 imirkin: it's mandatory in ES
17:00 imirkin: i'm gonna send a change to nuke it, since it breaks stuff
17:03 karolherbst: okay
17:05 karolherbst: gregory38: from what I see, 11% of the time pcsx2 spends malloc/free ints
17:06 karolherbst: but in total: 34% apitrace, 28% mesa_nouveau, 20% libc, 7.5% snappy
17:07 karolherbst: I am more worries about those 20% libc thing
17:07 karolherbst: *worried
17:08 gregory38: back for a couple of minutes
17:08 gregory38: I'm using persistent buffers
17:08 gregory38: I don't know what apitrace does to replay them properly
17:09 karolherbst: mhh
17:09 karolherbst: gregory38: are you sure you built mesa with Ofast?
17:09 gregory38: ah no
17:09 karolherbst: because I don't see any significant change between mesa and nvidia here
17:09 karolherbst: well
17:09 karolherbst: your validation thing still works a lot
17:09 karolherbst: but besides that, the difference is quite small
17:09 gregory38: on apitrace
17:10 karolherbst: well
17:10 karolherbst: sotc ingame: 20% difference
17:10 gregory38: on in PCSX2
17:10 gregory38: but you can't compare with PCSX2, it will be limited by the main thread
17:10 karolherbst: ahh so I would have to replay a pcsx2 dump
17:10 gregory38: I will reinstall the nvidia driver so I can compare same stuff
17:11 gregory38: gs dump, it registers all the input sent to the gs thread
17:11 gregory38: this way you could benchmark/debug only the gs thread
17:11 gregory38: without the huge noise of PCSX2
17:11 karolherbst: could you send me yours? I had troubles creating one
17:12 imirkin: karolherbst: iirc glretrace --benchmark will avoid creating a debug context
17:12 karolherbst: ahh good
17:12 gregory38: ok
17:13 imirkin: gregory38: is there such a thing as trace in pcsx2? kinda like dolphin's fifo logs?
17:13 gregory38: I don't know dolphin stuff
17:13 gregory38: but I guess it is quite close
17:13 gregory38: at least on the "GPU" "buses"
17:13 imirkin: i'm no expert either, but it's a trace that can be replayed by dolphin
17:14 imirkin: instead of by, say, apitrace
17:14 karolherbst: imirkin: yeah, seems like benchmark works
17:14 karolherbst: but no difference in perf
17:14 imirkin: karolherbst: largely expected. but would avoid you from having to bail in validate_io
17:14 karolherbst: right
17:15 karolherbst: 25 fps nouveau, 26 fps nvidia
17:15 karolherbst: so... yeah
17:15 karolherbst: apitrace is useless for this anyway
17:15 gregory38: directory link https://drive.google.com/folderview?id=0BzftmrM8nnSoRFlrdVNNWGV4NGc&usp=sharing
17:15 gregory38: upload on going
17:16 gregory38: I'm uploading the replayer too but I have an old gcc (4.9)
17:16 gregory38: usage $HOME/pcsx2/pcsx2_GSReplayLoader bin/plugins/libGSdx.so dump.gs $HOME/pcsx2/inis
17:17 gregory38: you can tune GSdx.ini to replay the trace with various options
17:17 gregory38: and you need to set linux_replay to 10 (more than 1) to run in benchmark mode
17:17 gregory38: Ah, you don't need to uncompress the xz file
17:18 gregory38: colin_big.gs.xz / sotc_big.gs.xz (eta 10 minutes)
17:19 karolherbst: gregory38: I would expect that with mesa built with Ofast the difference should become quite small
17:19 karolherbst: or at least smaller than before
17:20 gregory38: -m32 -g -O2 -Wall -std=c99 -Werror=implicit-function-declaration -Werror=missing-prototypes -fno-strict-aliasing -fno-math-errno -fno-trapping-math -fno-builtin-memcmp
17:20 gregory38: I didn't know the Ofast option, I will give it a try
17:25 gregory38: oh it is indeed better on sotc
17:25 gregory38: I'm close to 80 fps
17:25 gregory38: (mesa from debian 11.1.3 is around 60 fps)
17:26 karolherbst: anyway, I can't get pcsx2 to build for reasons I don't understand
17:27 gregory38: 32 bits mess
17:27 karolherbst: https://gist.github.com/karolherbst/21c04d65c37d78c0e947d9d05e8456fe
17:27 gregory38: distribution ?
17:27 karolherbst: gentoo
17:27 karolherbst: well through the package manager I get pcsx2 to build
17:27 karolherbst: just not locally
17:28 gregory38: the file /usr/bin/wx-config
17:28 gregory38: is a link to a 64 bits wx-config
17:29 karolherbst: I have only a gtk2-unicode-3.0 version
17:29 imirkin: pcsx2 is 32-bit only?
17:29 karolherbst: yeah
17:29 imirkin: ah, sad
17:29 gregory38: yes, long story short, lack of manpower
17:29 imirkin: no need to defend it - lots of reasons stupid things happen =/
17:30 karolherbst: well
17:30 gregory38: well program is old. older than 64 bits ;)
17:30 karolherbst: at least you still get avx2 and stuff
17:30 gregory38: karolherbst: you need the 32 bits version of wx
17:30 karolherbst: gregory38: yeah, I know
17:30 karolherbst: I have both
17:31 gregory38: -DwxWidgets_CONFIG_EXECUTABLE=
17:31 karolherbst: I have just one
17:31 gregory38: add the path to your 32 bits wx-config file
17:31 gregory38: even in /usr/lib
17:31 gregory38: you need 32 bits include
17:31 karolherbst: anyway
17:31 karolherbst: that's a compile error
17:31 karolherbst: no linking error
17:32 gregory38: by the way, you don't need wx for gsdx.so file
17:32 gregory38: dump uploaded
17:34 gregory38: karolherbst: imirkin: is it possible to switch driver without a reboot ?
17:34 imirkin: gregory38: from nouveau -> nvidia, yes. but ... not easily.
17:34 gregory38: I don't understand how it will work for kernel module with the extra layer (don't remember the name. vgn?)
17:34 imirkin: i assume you're on a desktop and it's your primary card?
17:35 gregory38: yes
17:35 imirkin: yeah... i mean you can get out of X, disconnect the vtcon link, remove nouveau and load nvidia
17:35 imirkin: but you won't have a console
17:35 gregory38: ok. It is as fast as a reboot
17:35 imirkin: going back to nouveau didn't work of late, since they do something to the display engine that we don't know how to undo
17:36 imirkin: and so we send it commands to update the fb, and it says "yes, i have updated the fb". but it hasn't.
17:36 gregory38: not nice
17:36 imirkin: if you use a different gpu as primary and use DRI3-based offloading, then it's easy to flip between nouveau and nvidia
17:38 gregory38: so iGPU could be useful
17:38 gregory38: !
17:43 imirkin: very
17:44 karolherbst: yes, it is
17:45 karolherbst: well the only problem is you can't really optimize for high fps micro benchmarks
17:45 imirkin: btw, i have a patch which should reduce the validation overhead for shader buffers
17:46 karolherbst: nice
17:46 imirkin: http://hastebin.com/obifaroben.coffee
17:46 imirkin: [untested]
17:47 gregory38: will it work if you switch a shader that uses a ssbo buffer but differently
17:47 imirkin: not sure what you mean by that
17:47 gregory38: will the validation be done if nothing is rebound
17:47 imirkin: if it has to be
17:48 gregory38: ah never mind
17:48 imirkin: anyways, the thing's not perfect. in an ideal world i'd add more bins
17:48 imirkin: but i'm lazy
17:49 imirkin: right now all ssbo's are in a single bin
17:49 imirkin: so i have to revalidate the whole bin
17:49 imirkin: as that is (effectively) the unit of validation
17:49 gregory38: all or nothing is still way better than always all :)
17:49 imirkin: ;)
17:51 gregory38: there is also the extra validation of images
17:51 imirkin: one thing per patch :p
17:51 imirkin: but yes, i'm aware
17:51 imirkin: that was next on the list
17:52 imirkin: and there's about to be a ton more, since as it turns out we weren't validating enough
17:52 imirkin: so if you did BindTextureImage()
17:52 imirkin: and then did TexImage2D() in such a way that the underlying resource was reallocated, we wouldn't notice
17:52 karolherbst: stupid chromium dns caching...
17:53 imirkin: i wish they had just gone the ARB_texture_view route and mandated that you can only use immutable textures
17:54 gregory38: yes, GL could have been so much easier
17:58 imirkin: grrrrr
17:58 imirkin: stupid unions.
17:58 imirkin: can't == them
17:58 karolherbst: :D
17:58 karolherbst: well
17:58 karolherbst: I never tried to == structs
17:59 imirkin: yeah, i guess that doesn't work
18:00 imirkin: grrrr
18:04 gregory38: I will install nvidia-driver later, when debian provides a nice package with all the required patches to build against a new 4.6 kernel ^^
18:11 imirkin: you'd think if you can do =, you can do ==. but you'd be wrong.
18:11 mupuf: karolherbst: that does not surprise me that they would decrease the voltage steps
18:11 karolherbst: mupuf: yeah
18:11 karolherbst: one of the tables had a strange "3125" value in it
18:12 mupuf: 3125? Could be the frequency at which we should drive the thing
18:12 karolherbst: ohh right
18:12 karolherbst: could be
18:13 karolherbst: mupuf: new tables: https://gist.github.com/karolherbst/f785850c9249dff0c1eb01828e737301
18:13 karolherbst: voltage table is gone by the way
18:14 karolherbst: ohh although that 3125 could also be a 800000
18:17 mupuf: can't have a look at this for quite a long time :s
18:17 karolherbst: uhh
18:17 karolherbst: quite long, that sounds like more than a day
18:17 karolherbst: :D
18:18 mupuf: that is almost like a month :s
18:18 karolherbst: uhh
18:18 karolherbst: something bad happened?
18:20 karolherbst: imirkin: quite some patches :D
18:21 imirkin: karolherbst: do they help anything?
18:21 karolherbst: I currently try to apply them on my local branch
18:21 karolherbst: are those in your git?
18:21 imirkin: no
18:21 imirkin: just download the mbox from patchwork
18:22 karolherbst: well, those are 3 series right?
18:22 imirkin: sure
18:22 mupuf: karolherbst: nothing happened, it is just that I have a ton of things to do and the little spare time I have should go to fixing bugs I introduced
18:22 mupuf: and then, in 2 weeks, I am going back to France
18:23 karolherbst: aahhh
18:23 karolherbst: okay
18:23 mupuf: for two weeks
18:23 mupuf: and I won't have time to hack there
18:23 karolherbst: yeah no problems
18:23 karolherbst: I was just worries you can't because of some IP bs or something stupid like that
18:23 karolherbst: *worried
18:24 karolherbst: imirkin: well git am $mbox_file doesn't do the trick
18:24 imirkin: uhhhh
18:24 imirkin: it should.
18:24 karolherbst: error: src/mesa/main/mtypes.h: patch does not apply
18:25 imirkin: lame.
18:25 imirkin: i might have modified something in mine =/
18:25 imirkin: are you on latest?
18:25 karolherbst: pretty much
18:25 karolherbst: maybe 1 day off?
18:26 karolherbst: ahh
18:26 imirkin: maybe that makes the diff. anyways, i just pushed this out on my (ha!) nv30 branch
18:26 karolherbst: now it works
18:26 imirkin: some day i'll actually get back to trying to hack on nv30
18:26 karolherbst: I finished my rebase. removed gregory38's hacks and rebased on master
18:26 imirkin: the fact that it doesn't work with prime makes it a bit annoying
18:27 imirkin: i have it hooked up over s-video though
18:27 imirkin: and my monitor supports picture-in-picture
18:27 mupuf: karolherbst: nope, just boring reasons
18:27 karolherbst: mupuf: disappointing
18:27 karolherbst: imirkin: right, and I should do more on my tesla
18:28 mupuf: I mean, I am definitely going to have a ton of work/fun in France, but the reason itself is classic
18:28 karolherbst: I am sure I will look into that dual issue mul/mad thing tomorrow
18:28 imirkin: karolherbst: neither mul nor mad are SFU ops
18:28 karolherbst: well
18:28 imirkin: you can't just will an op to be run by one unit or another
18:28 karolherbst: no idea why nvidia writes something like that then
18:29 karolherbst: "The individual streaming processing cores of GeForce GTX 200 GPUs can now perform near full-speed dual-issue of multiply-add operations (MADs) and MULs (3 flops/SP) by using the SP’s MAD unit to perform a MUL and ADD per clock, and using the SFU to perform another MUL in the same clock. Optimized and directed tests can measure around 93-94% efficiency."
18:29 karolherbst: this quote comes directly from an nvidia pdf
18:29 karolherbst: about gt200+
18:30 karolherbst: that's the entire section: https://gist.github.com/karolherbst/cc075afcdcf83e3b0f21d8685e8faf05
18:30 imirkin: i dunno what they're talking about
18:30 imirkin: i'm sure it's something real
18:30 imirkin: but i don't know what
18:30 karolherbst: yeah
18:31 karolherbst: well
18:31 karolherbst: I have such a tesla to find it out
18:31 karolherbst: hopefully
18:31 karolherbst: but I read that on pre gt200 that dual issue thing is mostly garbage anyway
18:31 karolherbst: and only on gt200 it became somewhat sane
18:34 karolherbst: imirkin: okay, trace renders with your patches
18:37 imirkin: mwk: any idea what a "DIVERGENT" warp error might mean? gr: GPC0/TPC0/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 1d0014 [] (0x14 == DIVERGENT according to gk20a headers)
18:37 imirkin: mwk: does it mean that i have a divergent bra without a joinat/join?
19:05 mwk: imirkin: divergent bra shouldn't be an error
19:06 mwk: might be divergent bar though
19:06 mwk: what does the PC point to?
19:09 imirkin: kepler... no pc
19:09 imirkin: and i've yet to do the trap handler stuff
19:11 imirkin: mwk: are divergent bar's not allowed?
19:11 imirkin: mwk: ARB_compute_shader definitely allows them...
19:12 imirkin: still not sure which shader is triggering it =/
19:14 imirkin: ok, well i can say with some certainty it's not bar - there's not a single bar in the trace :)
19:15 imirkin: or compute shaders for that matter
19:15 imirkin: this is happening in the 3d pipeline
19:16 mwk: divergent bars are quite damn ill-defined
19:16 mwk: if (x) { ...; bar() ; ...} else {...; bar(); ...}
19:17 mwk: suppose the if is divergent, and the then-part is executed first, how do you want to stash its execution context and resume the else-part?
19:18 imirkin: dunno - but either way that's not what's happening here
19:18 mwk: true
19:18 mwk: in that case... I have no idea, please do tell if you figure it out
19:18 imirkin: k
19:18 mwk: any quadops involved here?
19:18 imirkin: no quadops
19:18 mwk: hm
19:18 mwk: weirdo texture instructions?
19:19 mwk: votes?
19:19 imirkin: nope
19:19 imirkin: from what i can tell these are all bog-standard
19:19 imirkin: let me see if it's one of my desperately-try-to-fix-compute patches that's breaking it
19:20 imirkin: this is on a GK208B in case it makes any difference (NV106)
19:21 imirkin: mwk: is the lod argument to texlod allowed to be divergent? i know it's not on tesla...
19:22 mwk: hmm, it might not be
19:22 mwk: but, no idea
19:25 imirkin: hmmmmmmm
19:45 imirkin: mwk: at what stack depth are we supposed to start providing a call stack?
19:56 mwk: IIRC it was around 16
19:56 mwk: but hmm
19:56 mwk: that's for Tesla...
19:57 imirkin: ok, well 16 is well over what's in this shader i'm looking at
19:57 imirkin: it has like ... 6?
19:59 mwk: that should fit
20:00 imirkin: and i think you get a different error if you go "over"
20:16 imirkin: on the bright side, i still get them when disabling codegen optimizations. so it's not some opt-gone-wrong. as i suspected, but still nice to know.
20:19 imirkin: mwk: could it be a texbar imbalance?
20:19 imirkin: mwk: there's a situation where there's a texbar in some conditional flow
20:22 mwk: huh
20:22 mwk: no idea, I don't even know what a texbar is supposed to be
20:22 mwk: seriously, I'm still living in Tesla age... Fermi at best :(
20:23 imirkin: hehe
20:23 imirkin: texbar adds a barrier to wait for tex to finish
20:23 imirkin: it's a kepler-only thing
20:23 mwk: that much I know
20:23 imirkin: :)
20:24 imirkin: well, i asked a question in gnurou's github thing
20:26 mwk: what github thing?
20:26 imirkin: mwk: https://github.com/Gnurou/nouveau/issues
20:26 imirkin: some questions sometimes get answered to some degree of correctness
20:26 imirkin: (that's some^3 in case you're not following along)
20:27 mwk: huh
20:28 imirkin: i don't have high hopes
20:28 imirkin: but might as well ask
20:41 karolherbst: but gnurou is somewhat not here anymore... did he say anything?
20:41 imirkin: i think he said he was going away for a week or three?
20:41 imirkin: not sure
20:42 karolherbst: I know he was away for a week some weeks ago
20:59 karolherbst: okay, I think I got my tesla working again. silly static network
20:59 karolherbst: and my apple keyboard I have here doesn't work without the right kernel module...
21:47 karolherbst: I like websites where you can only download stuff inside a browser...
21:50 karolherbst: uhhh
21:50 karolherbst: segfault in mesa
22:18 karolherbst: does anybody here have a tesla and could check if pixmark_piano crashes?
22:18 karolherbst: well, I am sure _somebody_ here has a tesla, but could one of you check?
22:21 imirkin: where do i get this pixmark piano?
22:21 imirkin: and where does it crash?
22:23 karolherbst: http://www.geeks3d.com/dl/showd/392
22:23 karolherbst: no idea, gdb currently installing
22:23 karolherbst: mhh
22:23 karolherbst: odd
22:24 karolherbst: ahh libsegfault
22:24 karolherbst: of
22:24 karolherbst: course
22:25 karolherbst: ...
22:25 karolherbst: I even built with -g
22:27 karolherbst: well it crashes in nouveau_dri.so
22:37 karolherbst: I like it when error messages don't help
23:08 imirkin: skeggsb_: all the "clever" logic in libdrm_nouveau is pretty disastrous :(
23:08 imirkin: skeggsb_: it means i have to add locks around nouveau_bo_wait, which ... is the opposite of what i want
23:09 imirkin: skeggsb_: for the off-chance that it decides to kick the channel =/
23:27 imirkin: karolherbst: fyi i pushed out a commit that adds some locking to nvc0 - dunno if you've had some multi-context issues, but if so, check it out
23:27 karolherbst: nope
23:27 karolherbst: usually I have only one client on the gpu
23:28 karolherbst: in libdrm?
23:28 imirkin: well, this is related to multi-threaded clients
23:28 imirkin: that submit commands from multiple threads
23:29 karolherbst: mhh
23:29 karolherbst: I doubt I ever ran into this
23:29 karolherbst: maybe I did
23:29 karolherbst: and forgot
23:29 imirkin: kk
23:29 imirkin: a bunch of games do it
23:29 karolherbst: huh, really?
23:29 karolherbst: which ones?
23:29 imirkin: specifically warsow 2.0 (although they have an option to disable it)
23:29 imirkin: i've seen grid autosport hit the issue
23:30 imirkin: and i'm moderately sure that f1 2015 is affected
23:30 karolherbst: mhh okay
23:30 karolherbst: skeggsb_: thanks for the proper fix :D
23:30 karolherbst: skeggsb_: I was there once, but decided my code was fine, because default: was unreachable anyway
23:32 karolherbst: yay! one of my commits for the reclocking rework was merged :)
23:34 imirkin: 1 down, 74 to go
23:40 karolherbst: uhhh
23:40 karolherbst: that crash
23:40 karolherbst: fun
23:40 karolherbst: https://gist.github.com/karolherbst/c2e27c6dd170840a4071ba696869886d
23:40 imirkin: didn't i fix that
23:41 imirkin: like 20 times
23:41 imirkin: heh
23:41 imirkin: and pmoreau a few times after that
23:41 karolherbst: :D
23:41 karolherbst: seems like it needs more fixing
23:41 imirkin: 25th time is a charm? :)
23:42 karolherbst: uhhhhh
23:42 karolherbst: this looks .... bad?
23:42 karolherbst: pre ra: https://gist.github.com/karolherbst/140afe92a301703c881aa2d4efaecf0c
23:42 karolherbst: all those phis at the end ...
23:43 karolherbst: wow
23:43 karolherbst: nearly 10% of all ops are phis
23:44 imirkin: no real problem with that...
23:44 karolherbst: I am just surprised because on my kepler I get a lot less
23:45 imirkin: layout ends up different
23:45 imirkin: you can predicate a lot fewer instructions too
23:45 karolherbst: mhh okay
23:45 imirkin: dunno
23:45 karolherbst: and I guess no fancy slcts ?
23:45 karolherbst: or predicated instructions?
23:45 imirkin: well, some predicated, just not as many
23:46 karolherbst: and this is post RA anyway
23:46 karolherbst: this gpu is so slow :D
23:46 karolherbst: furmark: 1fps
23:48 imirkin: clock it up?
23:48 imirkin: there can't be any phi's post-ra
23:48 imirkin: that would mean that RA failed
23:49 karolherbst: I meant most of the predicated instructions are added post ra
23:49 karolherbst: flattening does that
23:49 karolherbst: now 4.6 is installed :)
23:52 karolherbst: mhh
23:52 karolherbst: nearly 3 fps now!
23:52 karolherbst: progress
23:53 karolherbst: ohh
23:53 karolherbst: there is also liveOnly on tesla :=
23:54 imirkin: ya
23:54 karolherbst: mhh
23:54 karolherbst: let's see if the perf drops
23:54 imirkin: btw, my idea was to do the opposite of what you were doing for liveOnly
23:54 imirkin: i.e. always set it to true
23:54 karolherbst: mhhh
23:55 imirkin: and then explicitly look for "bad" instructions
23:55 imirkin: and follow their arguments up
23:55 karolherbst: 58->60 score at 20secs
23:55 karolherbst: mhhh
23:55 karolherbst: makes more sense I think
23:55 karolherbst: because the bad case is usually not there
23:55 imirkin: right
23:55 imirkin: well, "bad" instructions will be there
23:55 imirkin: since tex() is a bad instruction
23:56 imirkin: and it comes up a lot
23:56 karolherbst: mhh right
23:56 karolherbst: okay
23:56 imirkin: however its args' value chain will usually be short
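The idea described above (default the flag to true, then find the "bad" instructions and follow their arguments up) could be sketched roughly as below. This is a toy model under stated assumptions, not nv50_ir code: the `insn` struct, `clear_chain` and `compute_live_only` are hypothetical names, and clearing the flag on everything that feeds a bad instruction is one guess at what "following the arguments up" is meant to do.

```c
#include <stdbool.h>

#define MAX_SRCS 3

/* Hypothetical toy IR: each instruction lists the indices of the
 * instructions that define its sources (-1 = unused slot). */
struct insn {
    bool bad;             /* e.g. a tex(), per the discussion */
    bool live_only;       /* the flag being computed */
    int  src[MAX_SRCS];   /* indices of defining instructions */
};

/* Assumption: everything feeding a bad instruction loses live_only. */
static void clear_chain(struct insn *prog, int i)
{
    for (int s = 0; s < MAX_SRCS; s++) {
        int d = prog[i].src[s];
        if (d < 0 || !prog[d].live_only)
            continue;             /* unused slot, or already visited */
        prog[d].live_only = false;
        clear_chain(prog, d);     /* keep walking up the value chain */
    }
}

static void compute_live_only(struct insn *prog, int n)
{
    for (int i = 0; i < n; i++)
        prog[i].live_only = true; /* optimistic default: always true */
    for (int i = 0; i < n; i++)
        if (prog[i].bad)
            clear_chain(prog, i); /* follow the bad op's arguments up */
}
```

Because already-cleared instructions are skipped, the walk terminates even if the value graph contains cycles, and short argument chains keep it cheap.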
23:56 karolherbst: so 5% more perf on my tesla in furmark
23:56 imirkin: due to?
23:56 karolherbst: liveOnly
23:57 imirkin: ah nice
23:57 karolherbst: well on my kepler it increased by like 15%
23:57 imirkin: can't win 'em all
23:57 karolherbst: mhh
23:57 imirkin: iirc there's a 4-byte tex variant on tesla that we don't use
23:57 karolherbst: now I got 62 score
23:58 karolherbst: okay, I say the perf increase is around 7% then
23:58 karolherbst: it's significant enough anyway
23:58 karolherbst: but only there
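As a quick check of the numbers quoted above, the relative gain can be computed from the scores. `gain_percent` is a hypothetical helper name; the scores 58, 60 and 62 are the ones from the log.

```c
/* Relative gain in percent going from score `before` to score `after`. */
static double gain_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With these scores, 58 to 60 is a gain of roughly 3.4%, and 58 to 62 roughly 6.9%, which fits the "around 7%" estimate.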
23:59 karolherbst: okay, anything I could do to fix that crash?