00:36imirkin: karolherbst: to confirm, you ran CTS with my latest patches including the levelZero one and all was well, yes?
00:36imirkin: [except for the tests you reported as failing, which are unrelated to the levelZero thing]
00:36karolherbst: imirkin: 1265ce677e74fe5dadfe4c3039e0ff2d92eb54cb is the top commit
00:37karolherbst: should be the same as in your github repository
00:38imirkin: ok great
00:38imirkin: i'm going to push all of those other than the gallium/mesa ones then
00:38imirkin: speak now if you object :)
00:39karolherbst: imirkin: I think we want to have acks from others for the 1265ce677e74fe5dadfe4c3039e0ff2d92eb54cb commit
00:39imirkin: that's under the "gallium/mesa" heading
00:40imirkin: i.e. "other than those"
00:40imirkin: but all the nouveau-internal ones
00:40karolherbst: ohhh, true
00:40karolherbst: okay, then that should be alright :)
00:40karolherbst: I didn't test on kepler obviously
00:40imirkin: yeah, don't worry about that
00:40karolherbst: but I guess you did that
00:40imirkin: i tested on kepler2
00:40imirkin: it could theoretically break things on kepler1, but ... very unlikely
00:43imirkin: i guess i could keep the gk104 rcp/rsq thing out, but again ... meh
00:43karolherbst: that should be fine
00:43karolherbst: I am sure I've tested it plenty of times
00:43imirkin: i made some changes to the source, but it didn't affect the generated asm (which is good)
00:43imirkin: basically stuck "long" in a bunch more places where i thought it could potentially matter
00:44imirkin: should have some #pragma long or something
00:44imirkin: but ... there's better uses of time than that :)
00:53imirkin: ok, so now i need to take that blit z32_s8 thing apart
01:02imirkin: ok. this is VERY surprising. blitting from non-ms to 4x and 8x msaa = fail?
01:02imirkin: that's the easiest blit of all
01:04karolherbst: but looking at the result the scaling is simply wrong
01:06imirkin: i'll have to think about it.
01:06karolherbst: I mean, looking at it, my patch kind of makes sense, but maybe there is more to it? dunno
01:07imirkin: that stuff's not there randomly either
01:07imirkin: i'll have to think about it
01:07imirkin: in any case it needs a lot more justification than "fix weird MS issue" :)
01:08karolherbst: I thought that describes it perfectly :p
01:08imirkin: the problem is that we cheat pretty hard in that blitter, esp when it comes to MS
01:08joepublic: "perform bold, inventive optimizations for speed"
01:09karolherbst: imirkin: I guess we could just port to u_blitter and forget about all of that?
01:09imirkin: thing is that x0/y0/etc are scaled by ms_x/y
01:10imirkin: but like i said ... let me think
01:11imirkin: need to check how the 2 attribs are used again
01:11imirkin: been a very long time
01:13karolherbst: imirkin: thing is, I am sure that my patch neither triggered regressions inside piglit, nor the CTS :/
01:13karolherbst: I think it even fixed a bunch of piglit tests as well
01:13imirkin: that's great
01:14imirkin: i'd still like to understand what's going on =]
01:14karolherbst: sure :)
01:14imirkin: perhaps it's doing the right thing
01:15karolherbst: I'll be honest, I have no idea what's going on there either, it was just a lucky guess
01:15imirkin: but it'd be a bit surprising given that the blit isn't 100% broken
01:15imirkin: e.g. for color
01:15imirkin: and then ... why is depth different than color, etc
01:15imirkin: z32f_s8 blit is handled differently
01:15imirkin: so perhaps it's the only one of "those" cases that hits this pipeline
01:16imirkin: and the rest go via the 2d engine
01:16imirkin: since for single -> multiple samples, it's basically a 2d engine job
01:16imirkin: you just broadcast the same value to all the samples
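The single-to-multisample case described here amounts to copying each source value into every sample of the destination pixel. A minimal sketch of that idea, assuming a plain nested-list image; the function name and data layout are illustrative and have nothing to do with how the 2d engine is actually programmed:

```python
def broadcast_to_samples(pixels, num_samples):
    """Expand a single-sampled image into a multisampled one.

    pixels: list of rows, each row a list of per-pixel values.
    Returns the same rows, where every pixel becomes a list of
    num_samples identical sample values.
    """
    return [[[value] * num_samples for value in row] for row in pixels]


# Hypothetical 2x2 depth image blitted to a 4x MSAA destination.
src = [[0.25, 0.5],
       [0.75, 1.0]]
dst = broadcast_to_samples(src, 4)
```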
01:16karolherbst: ohh, wait, yeah, I think there was this explicit check somewhere
01:16imirkin: but z32f_s8 has to go through the 3d engine
01:17imirkin: but then ... why does 2x work and not 4x? i guess it gets stretched along the y axis?
01:17imirkin: anyways -- Needs Understanding (tm)
01:17karolherbst: imirkin: hehe.. "if (info->dst.resource->format == PIPE_FORMAT_Z32_FLOAT || info->dst.resource->format == PIPE_FORMAT_Z32_FLOAT_S8X24_UINT) eng3d = true;"
01:18imirkin: but it works for 2x
01:18imirkin: and not 4x
01:18imirkin: we over-rast it by A LOT
01:18imirkin: i wonder if that's part of it
01:18imirkin: i'm going to try making it a square instead of a triangle
01:21imirkin: yeah ok
01:22imirkin: it can't handle the giant rast values
01:22imirkin: this also fixes it
01:23imirkin: we draw this GIANT triangle
01:23imirkin: that's guaranteed to cover the square that we actually want to rasterize
01:23imirkin: but i think the values are a bit too big, and something gets screwed up
01:23karolherbst: yeah... kind of makes sense
01:24imirkin: we also do this without window transforms
01:24imirkin: which is why these aren't the normalized coordinates
01:24imirkin: they're the window coords
01:24imirkin: the max fb size is 16kx16k
01:24karolherbst: doing that with a rectangle feels more natural anyway
01:25imirkin: that's why we use a 32kx32k triangle
01:25imirkin: so that the 16kx16k square fits inside
01:25imirkin: however if there's MS, we have to use an even larger triangle
01:25imirkin: which is why it's 32k << ms factor
01:25imirkin: removing the ms factor will mean that a blit might not 100% cover the desired region
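The coverage argument can be checked with a little arithmetic. The sketch below assumes the triangle is a right triangle with its right angle at the window origin and legs along the x and y axes; that vertex layout and all names are assumptions for illustration, not taken from the blitter source:

```python
MAX_FB_SIZE = 16384   # 16k x 16k maximum framebuffer, per the discussion
TRIANGLE_LEG = 32768  # 32k legs of the oversized triangle


def triangle_covers_rect(leg_x, leg_y, width, height):
    """True if the right triangle (0,0), (leg_x,0), (0,leg_y) contains
    the rectangle [0,width] x [0,height].

    The rectangle's far corner (width, height) is the only corner that can
    fall outside, so it is enough to test it against the hypotenuse
    x/leg_x + y/leg_y <= 1 (written here without division).
    """
    return width * leg_y + height * leg_x <= leg_x * leg_y


def scaled_leg(ms_shift):
    """Leg length when it is additionally shifted left by an MS factor."""
    return TRIANGLE_LEG << ms_shift
```

With 32k legs the far corner of a 16k x 16k rectangle sits exactly on the hypotenuse, so the unshifted triangle just covers the largest framebuffer. A shift of 1 or 2 pushes the vertex coordinates to 65536 or 131072, which is the kind of "giant rast value" suspected above.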
01:26karolherbst: for really big blits I assume?
01:26imirkin: for a super-large texture
01:26imirkin: for 256x256, nobody gives a shit
01:26imirkin: the problem with a square is that it's 2 triangles separately rasterized
01:26imirkin: let me do a poll to see what some other people do
02:06imirkin: i think we can just safely drop those shifts.
02:09imirkin: i need to do some tests.
02:13karolherbst: meh.. those random channel kills are super annoying
02:16imirkin: so i think all the dst-coord-related ms shifts are totally bogus in the 3d path
02:16imirkin: on both nv50 and nvc0
02:16imirkin: i'm going to do some tests to verify this
02:16imirkin: but that's my current thinking.
02:16karolherbst:is wishing we had any way to debug those CTXSW_TIMEOUTs
02:16imirkin: karolherbst: well, i consistently get one on GK208 for that dEQP
02:17karolherbst: I now have a list of 4-5 games triggering those quite randomly :/
02:17karolherbst: and I am sure there are more out there
02:17karolherbst: maybe apitrace might help? but from my experience, it never did
02:18karolherbst: mhh, let me try that again actually
02:19karolherbst:is happy he has his channel dead detection patches, so his machine just recovers :p
02:27karolherbst: that trace triggers a channel kill :)
02:29karolherbst: now hoping that qapitrace is happy with that situation
02:44karolherbst: imirkin: it's a glDrawRangeElements call
02:45karolherbst: binds two buffers: ELEMENT_ARRAY_BUFFER + ARRAY_BUFFER
02:46karolherbst: then VertexAttribPointer with pointer = NULL
02:46karolherbst: then drawRangeElements with pointer = NULL
02:49karolherbst: imirkin: uniform vec arr that should be alright, or not?
02:49karolherbst: in a vertex shader
02:50karolherbst: doesn't matter anyway
03:10Delemas: imirkin, nouveau worked fine. Thanks.
12:16imirkin: karolherbst: uniforms in shaders are fine. DrawRangeElements(NULL) is fine too -- it's actually an offset into the bound element array buffer.
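To illustrate the point about NULL: when an element array buffer is bound, the "pointer" argument of an indexed draw is treated as a byte offset into that buffer, so NULL simply means offset 0. A minimal model of that lookup, assuming 16-bit little-endian indices; the buffer contents and helper name are made up for the example:

```python
import struct


def fetch_indices(element_buffer, offset, count):
    """Read `count` 16-bit indices starting `offset` bytes into the buffer.

    This mimics how the indices argument is interpreted when an element
    array buffer is bound: it is a byte offset, not a client pointer.
    """
    return list(struct.unpack_from(f"<{count}H", element_buffer, offset))


# Hypothetical buffer contents: six little-endian uint16 indices.
element_buffer = struct.pack("<6H", 0, 1, 2, 2, 1, 3)
first_triangle = fetch_indices(element_buffer, 0, 3)   # pointer = NULL
second_triangle = fetch_indices(element_buffer, 6, 3)  # pointer = (void *)6
```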
12:57karolherbst: imirkin: yeah, I wasn't wondering about the general thing, more like if you have some ideas what could go wrong there
12:58imirkin: nothing specific to those
12:58imirkin: those are very very very common things to do
12:58karolherbst: mhh :/
12:59imirkin: it's basically "how indexed draws work"
12:59karolherbst: sure, but I am thinking of something like "too many indices at once" or something stupid like that
12:59imirkin: now, if it's drawing only a handful of indices
13:00imirkin: out of a giant vertex buffer
13:00imirkin: which is client-side
13:00imirkin: then we have some optimizations for that case
13:00karolherbst: it's more like thousands
13:00imirkin: also if there are DOUBLE or FIXED attributes, we hit a fallback path
13:00imirkin: (which is why i had to fix indirect draw for those)
13:01imirkin: btw - were you able to get a clean webgl run in the end?
13:01imirkin: (webgl cts)
13:03karolherbst: ohh interesting, with lower clocks I get the timeout with earlier calls as well
13:13karolherbst: mhhh, maybe it is indeed a proper timeout and the firmware doesn't like that something takes that long? :/
13:30imirkin_: your guess is at least as good as mine, probably better
15:50imirkin_: karolherbst: pretty sure i asked, but i forget your answer if any -- have you been able to get a complete webgl cts run?
15:50karolherbst: imirkin_: yes
15:50imirkin_: did you have to do anything special on top of the fixes which are already upstream?
15:50imirkin_: iirc you did it locally to speed things up considerably
15:50karolherbst: I used my mt fixes patches to get it to be more reliable
15:51karolherbst: and cloned the repo, yes
15:51karolherbst: doesn't help much with long running tests, so the overall speed is equalish
15:51karolherbst: but less timeouts
15:51imirkin_: do you have to do something once you clone the repo? or just point the browser to file:///bla ?
15:51karolherbst: some chromium flag for local file access
15:51imirkin_: yeah, that sounds familiar.
15:51imirkin_: ok thanks
15:52imirkin_: skeggsb: ping
16:28karolherbst: this is great... trying to bisect which gl call causes the channel to be killed, but apparently every time I get to something, it's some previous call, which was fine before :/
16:29imirkin_: it's almost as if there was something non-deterministic going on...
16:30karolherbst: thing is, it kind of is
16:30karolherbst: but yeah, not 100%
16:30imirkin_: well, fwiw i have a reproducible CTXSW_TIMEOUT
16:30imirkin_: on gk208
16:30karolherbst: here as well
16:30imirkin_: (or at least NV106)
16:30karolherbst: even an apitrace
16:30imirkin_: with a simple deqp test
16:30karolherbst: which one?
16:30imirkin_: i posted about it earlier
16:31imirkin_: primitives_generated_instanced iirc
16:31imirkin_: i don't think it dies on other gens. i wouldn't be surprised if it were GK208-specific.
16:32karolherbst: mhh, works here
16:32imirkin_: yeah, i think you said that
16:32imirkin_: probably some bit of card config left out
16:32imirkin_: that's why i want skeggsb's help :)
16:33karolherbst: well, I have around 4 reproducible timeouts :) but those are triggered by games, so a bit annoying to debug
16:33karolherbst: I think even one or two in piglit?
16:33imirkin_: what gpu are you on?
16:33karolherbst: those 4 are from gm204
16:33imirkin_: so we have no control over the firmware in any case
16:34karolherbst: doesn't have to be a firmware bug though
16:34imirkin_: doesn't have to be
16:34imirkin_: but more fun to assume it is :)
16:34imirkin_: that way you can move on to the next problem
16:34karolherbst: what's most surprising is that with higher clocks, the first call to fail is a much later one :)
16:36karolherbst: I try to talk with nvidia about that... I mean, it cannot be that we have no means to debug those fails
16:36imirkin_: were you going to try to upstream support for reclocking gm20x?
16:36imirkin_: (or is it there already?)
16:36karolherbst: at some point... I think the patches are unchanged since the last time I've posted those
16:36karolherbst: just needs somebody to look at those
16:36imirkin_: just a matter of getting skeggsb to pick them up...
16:37karolherbst: I tried too many times, guess I've given up
16:37imirkin_: or send to airlied direct?
16:37karolherbst: can't even remember which branch those are on :D
16:37karolherbst: they are essentially just patches for the gentoo-sources packages... and I think they have been there for a year or something
16:38karolherbst: I think that is the branch? https://github.com/karolherbst/nouveau/commits/clk_update_v3
16:38karolherbst: ohh wait, that's missing bits
16:38karolherbst: probably that one: https://github.com/karolherbst/nouveau/commits/clk_to_upstream
16:41karolherbst: yep, should be
16:42karolherbst: imirkin_: that patch is golden btw: https://github.com/karolherbst/nouveau/commit/92be149a28e3e22fb20500f8a26f8b20c260d06d
16:42karolherbst: otherwise we can't use the PMU to do memory reclocking :)
16:43karolherbst: and maybe even wrong
16:43karolherbst: works for me
16:54karolherbst: I think we are hitting a real timeout here
16:56karolherbst: is that annoying
17:00karolherbst: imirkin_: do you know how to turn off instanced drawing inside gallium?
17:02imirkin_: that kills GL 3.0
17:02imirkin_: there's a PIPE_CAP for it... search for "instance"
17:02imirkin_: or check what cap enables ARB_draw_instanced or something
17:02karolherbst: mhh, yeah, I tried _INSTANCE and _INDIRECT but that doesn't really disable the extensions
17:02karolherbst: maybe gallium emulates in such cases?
17:02imirkin_: disable that
17:03imirkin_: that'll kill everything
17:03karolherbst: mhhhhhhhhh the channel kill is gone now with INSTANCE and INDIRECT disabled :/
17:03karolherbst: or uhm.. maybe it's a later call now...
17:03karolherbst: let me verify
17:04imirkin_: if you're just replaying the trace, you'll get errors for those draws
17:04imirkin_: since it'll say "function not found" :)
17:04karolherbst: well, I don't
17:04karolherbst: ahh, still the channel timeout, okay, that's good
17:05karolherbst: yep, instanceid is the right cap
17:06karolherbst: still getting 3.0 though
17:06imirkin_: perhaps it's in 3.1? i forget.
17:06karolherbst: yeah... sounds about right
17:06imirkin_: i thought it was 3.0 tho
17:06karolherbst: draw_instanced is 3.1
17:06imirkin_: ah no. it's in 3.1. and ARB_instanced_arrays is in 3.3
17:07imirkin_: (instanced arrays lets you have divisors)
17:07imirkin_: or ... somethin
17:07karolherbst: okay... so does the game run with 3.0...
17:08karolherbst: it's a dx9 based one, so I guess they have fallback paths
17:08imirkin_: should yeah
17:08imirkin_: i don't think dx9 had any of that
17:08imirkin_: that's all dx10 stuff
17:09karolherbst: fun... still a dead channel
17:10karolherbst: they do some query stuff though
17:10karolherbst: I think I decide to blame that
17:11imirkin_: wise choice.
17:11karolherbst: I think when I gdb into it, it was hanging on fetching a query result actually
17:11karolherbst: how can I disable all that?
17:12imirkin_: queries? those are necessary for DX9
17:12imirkin_: PIPE_CAP_OCCLUSION something
17:12imirkin_: iirc needed for GL 2.0 or 2.1
17:12karolherbst: well... one way to find out
17:13karolherbst: yeah.... getting 1.4 now
17:13HdkR: Hard code it to return zero :P
17:13karolherbst: it makes like heavy use of it
17:13karolherbst: fetches like 50 results each frame
17:14karolherbst: that's not it either
17:14karolherbst: trace still replays fine
17:14imirkin_: do you get a 2.0 or 1.x context now?
17:15karolherbst: the channel is killed a bit later into the trace though
17:15karolherbst: I guess it's gdb time
17:16karolherbst: yeah, nice, X freezes of course. so remote gdb
17:24karolherbst: what an annoying issue
17:25karolherbst: mhhh nouveau_scratch_data
17:26karolherbst: okay... so where is that bo used
18:56karolherbst: imirkin_: mhhhh, user vertex buffer not flushed correctly? this could cause stuff like that I guess. It's essentially the most common thing that game does
18:57karolherbst: well, user vertex buffer
19:06imirkin_: you mean client-side?
19:07karolherbst: uhh, yes
19:07imirkin_: those are a bit different on maxwell+
19:08imirkin_: we have to stick them into a vbo
19:17karolherbst: mhh, interesting
19:17karolherbst: might be worth checking if we have the same issue on kepler
20:07karolherbst: buuh ... :/
20:09karolherbst: imirkin_: apparently making vbos sysmem only helps...
20:09karolherbst: or VERTEX_BUFFER... whatever that corresponds to
20:09karolherbst: removed "BIND_VERTEX_BUFFER | BIND_INDEX_BUFFER" from the vidmem_bindings bitfield and now it doesn't kill the channel... didn't check if it's either of one, but...
20:12karolherbst: yeah... seems like to be the VERTEX_BUFFER part of it :/
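The experiment above comes down to clearing one bit in a bitmask that decides which bind types are placed in VRAM. A minimal sketch of that mechanism; the flag values and the `pick_domain` helper are hypothetical and only illustrate the bit manipulation, not the driver's actual placement logic:

```python
# Hypothetical flag values, for illustration only; not Gallium's.
BIND_VERTEX_BUFFER = 1 << 0
BIND_INDEX_BUFFER = 1 << 1
BIND_SAMPLER_VIEW = 1 << 2


def pick_domain(bind, vidmem_bindings):
    """Return "vram" if any requested bind flag is in the mask, else "gart"."""
    return "vram" if bind & vidmem_bindings else "gart"


default_mask = BIND_VERTEX_BUFFER | BIND_INDEX_BUFFER | BIND_SAMPLER_VIEW
# The experiment: drop only the vertex-buffer bit from the mask.
patched_mask = default_mask & ~BIND_VERTEX_BUFFER
```

With the patched mask, vertex buffers fall back to system memory while index buffers stay in VRAM, which matches the narrowed-down observation that the VERTEX_BUFFER bit is the one that matters.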
20:35karolherbst: noo :/
20:35karolherbst: imirkin_: this seems to fix it for other games as well
20:36imirkin_: solution: never use vram :)
20:37karolherbst: annoying :/
20:37karolherbst: I guess for vertex buffers it doesn't matter as much
20:37karolherbst: but still
20:37imirkin_: my guess is that we're missing a stall or something? dunno
20:37RSpliet: VRAM is prohibitively slow anyway. *waves his fist at NVIDIA* damn you and your signed PMU firmware
20:38imirkin_: or perhaps having to fetch from sysmem slows things down enough that it doesn't hit the badness it otherwise does :)
20:38karolherbst: or maybe
20:41karolherbst: but it felt like things are generally faster
20:41karolherbst: with vertex buffers inside sysmem
20:42imirkin_: we're clearly doing a real-good job
20:42imirkin_: the driver's so fast, that pcie to ram is faster than vram.
20:48karolherbst: mhh, actually, it is a bit slower, but not _that_ much
20:49karolherbst: 70 -> 63 fps in the game
20:49karolherbst: just hard to check as the context just instantly goes down :/
20:52karolherbst: imirkin_: any wild guesses on what the problem actually might be?
20:53imirkin_: well ... with client buffers
20:53imirkin_: we have to copy it SOMEWHERE
20:53imirkin_: before we can hand it to the gpu
20:53imirkin_: pre-maxwell, we essentially copy into pushbufs
20:53imirkin_: since those GPUs have ways of consuming immediate-mode vertices like that
20:53imirkin_: maxwell dropped those methods
20:54imirkin_: which means that we have to stick them into a bo
20:54imirkin_: i believe we use nouveau_scratch to get a bo to stick them into
20:54imirkin_: that will get you a sub-allocated bo, potentially, which can have more stalls and whatnot
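Sub-allocation here means handing out small ranges from one shared buffer instead of creating a buffer per upload. The sketch below is a generic bump allocator, assuming a fixed-size buffer that is recycled when full; it is not nouveau_scratch's implementation, and the `wraps` counter merely stands in for the point where a real driver would have to wait on a fence or grab another bo:

```python
class ScratchAllocator:
    """Bump allocator handing out ranges from one shared buffer."""

    def __init__(self, size, alignment=4):
        self.size = size
        self.alignment = alignment
        self.offset = 0
        self.wraps = 0  # stand-in for "wait on a fence / get a new bo"

    def alloc(self, nbytes):
        """Return the start offset of a freshly reserved range."""
        if nbytes > self.size:
            raise ValueError("request larger than the scratch buffer")
        # Round the current offset up to the next aligned position.
        start = -(-self.offset // self.alignment) * self.alignment
        if start + nbytes > self.size:
            # Out of space: start over at the beginning of the buffer.
            start = 0
            self.wraps += 1
        self.offset = start + nbytes
        return start
```

Many small uploads share one buffer cheaply this way, but every wrap is a potential stall, which is one reading of the "more stalls and whatnot" remark.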
20:57karolherbst: yeah, I saw that the application waits for the fence inside nouveau_scratch
20:59karolherbst: I guess it might be a good idea to allocate a new bo instead and try with that?
20:59imirkin_: it cuts both ways
20:59imirkin_: if you have like 4 vertices
20:59imirkin_: the overhead will become huge
21:05karolherbst: maybe we could try if it helps if we only do it for many vertices
21:06karolherbst: but... why would that be a problem in the first place
21:07imirkin_: perhaps it doesn't like changing VBO's without some kind of flush?
21:16karolherbst: mainly wondering what would be cheap to try out
22:11karolherbst: imirkin_: mhhh, it seems to help with other games as well... oh well, at least we found _one_ reason this is happening
22:17imirkin_: you mean it helps perf? or it helps to not have hangs?
22:17karolherbst: not having hangs
22:18imirkin_: that's unfortunate.
22:18karolherbst: which I kind of prefer over having more perf to be honest :/
22:18imirkin_: can you specify a domain when getting scratch memory?
22:18imirkin_: i forget how that api works
22:18karolherbst: mhh, so you want to use sysmem only for the user stuff?
22:19gnarface: i'm just watching along and trying to understand concepts here, so please forgive me if this is a dumb question, but can't you just save performance by stuffing a bunch of vertexes into one VBO?
22:19imirkin_: yeah, but keep "real" vbo's in vram
22:19imirkin_: gnarface: we don't control how the game is written
22:19gnarface: oh, that's up to the game engine?
22:19gnarface: hmmm, interesting
22:19imirkin_: VBO = Vertex Buffer Object
22:19imirkin_: which is a GL concept
22:19karolherbst: gnarface: well, they stuff thousands of vertices inside such a user provided buffer
22:20karolherbst: and we kind of have to deal with that
22:20imirkin_: GL also enables you to have this data in client-side buffers
22:20imirkin_: and some software (esp older software) avails themselves of that functionality
22:20imirkin_: [esp since there were no VBO's back then, so there wasn't much choice in the matter]
22:20imirkin_: although VBO's are fairly old...
22:20imirkin_: but i don't think they became core until like GL 1.5
22:22gnarface: i think quake2 and quake3 could use them for lighting under non-default options
22:22gnarface: default was lightmap based (basically transparent bitmap lighting)
22:22imirkin_: there's probably also multiple levels of this stuff
22:22imirkin_: a long long time ago, you had immediate vertex data submission
22:23imirkin_: i.e. glBegin(); glVertex4f(); glVertex4f()..... glEnd()
22:23imirkin_: and then you could put that into buffers, which were either client- or gpu-side
22:23imirkin_: and then you eventually got VAO's which could save a lot of configuration data about which attributes were where
22:23imirkin_: (Vertex Array Objects)
22:23imirkin_: at the gallium level, it's a lot simpler though
22:24imirkin_: basically you get vertex data
22:24gnarface: ah, i may have gotten VAO and VBO terminology confused
22:24imirkin_: which may either be in a user buffer or a gpu buffer
22:24imirkin_: if it's a user buffer, it's on you to get that to the gpu
22:24imirkin_: (you = the driver)
23:39karolherbst: imirkin_: ohh btw, I reworked how we collect our shaders in the shader-db. all files are nicely hashed now + I wrote a script to do the magic, so now one just has to collect the shaders inside shader-db/$game, run the script and then git tells you which new files there are (and removes all the duplicates)
23:40karolherbst: nice for collecting many shaders to do compile testing
23:47karolherbst: imirkin_: uff, the scratch stuff is already inside GART unconditionally :/
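The hashing approach mentioned here can be sketched as follows: name every file after a hash of its contents, so identical shaders collapse onto one filename and git only shows genuinely new ones. The actual script is not shown in the log; the SHA-1 choice, the flat directory layout, and both function names below are assumptions:

```python
import hashlib
from pathlib import Path


def hash_name(path):
    """Content-derived filename: <sha1 of contents><original suffix>."""
    digest = hashlib.sha1(path.read_bytes()).hexdigest()
    return digest + path.suffix


def dedup_shaders(directory):
    """Rename each file to its hash name; delete files whose hash name
    already exists. Returns the number of duplicates removed."""
    removed = 0
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(hash_name(path))
        if target == path:
            continue  # already hashed on an earlier run
        if target.exists():
            path.unlink()  # same contents already collected
            removed += 1
        else:
            path.rename(target)
    return removed
```

Running it a second time on the same directory is a no-op, since every remaining file already carries its hash name.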
23:47imirkin_: ok, that's not extremely surprising
23:51karolherbst: using VertexAttribPointer with an array_buffer means that it's not doing the user memory stuff, right?
23:51imirkin_: don't think so
23:51imirkin_: depends on what's bound to ARRAY_BUFFER
23:52imirkin_: hrm. dunno. would have to look/think
23:53karolherbst: okay.. kind of looks like it looking at the trace
23:53karolherbst: which changes our question/problem
23:57gnarface: hmmm. i remember an old issue (in wine i think?) about performance sometimes being affected by a variable that forced hardware or software "vertex buffer arrays"... so this is why that sometimes helped performance, but sometimes did nothing, isn't it?
23:58imirkin_: could be 1000 things
23:58imirkin_: depends on precisely what that setting did
23:58gnarface: hmm. i wish i remember where and when i was messing with it...
23:58gnarface: i thought it was a wine thing
23:58gnarface: i could be wrong
23:58imirkin_: could be
23:58imirkin_: but then it'd depend on what wine did differently :)
23:59gnarface: yea, well that's often a thing with wine, is figuring out if what it's doing is even sensible to begin with