00:34 espes__: you'd think after 5 years I'd have some idea how this stuff works
00:34 espes__: where do reads and writes of the (nv20) channel control area go to for inactive channels?
00:34 espes__: ramfc?
00:35 espes__: as in, dma_{get,put}
00:36 espes__: imirkin_: ^
02:32 mwk: espes__: ramfc indeed
03:19 espes__: thanks
03:30 martm: espes__: though why do you deal with nv20, people saying that early chips can not do vulkan though, should not be correct?
03:31 martm: it should be more important there to get rid of api overhead , totally feasible for instance on r300
03:31 martm: and probably also nv34 and nv20 etc.
03:32 martm: i think they only have one issue though, that some of them don't have hw context's so context switches would be slow, i.e not cache on switch
03:35 martm: MichaelLong: i prepared a minor pseudo code, how instructions should be reordered for best results, i'll do a more thorough explanatory one in the future, to be honest i am tired of arguing especially with clueless people, like the ones here in my country who have annoyed me
03:41 martm: i don't know it's just so simple common sense, that fastest solution is to do fastest instruction first in the order where they go slower step by step, and all the stuff is in recursion and then you won't have to deal with restoring the masks
03:42 martm: some predicates for loops , plus determining the end of iteration, and that is kind of all, it is indeed very simple
04:00 martm: yeah can be done on earlier generations too, the moment they added shader processors instead of command processors, then they also added interrupts for shaders, when you wanna get the command processor for earlier chips into recursion, i believe you'd play with dma packets writing to dmaput dmaget, or if possible use those last ones straight
04:16 karolherbst: imirkin_: I just disabled the ZETA bits and stuff looks quite funny :D
04:16 karolherbst: I didn't think that this should make any visual difference
04:16 karolherbst: or is ZETA not EarlyZ stuff?
04:18 martm: karolherbst: allmighty imirkin is sleeping probably;)
04:20 martm: karolherbst: so you wanna discard the pixels based of depth comparison which is done early right, that is a fixed function pipeline which does the comparison
04:22 martm: there was discussion long time ago about that, and most thought, that unless depth is rewritten in the shader, that discarding would happen before the shader ever starts
04:23 martm: but you're correct in a sense that should be real performance boost, i.e besides the scheduler a rare almost only real one
04:25 martm: yeah it's the same thing with stencil, i remember that stencil values can only be read not written
04:25 martm: in the shader
04:31 martm: karolherbst: actually i looked at the phoronix data, it's as it used to be as highest clocked perf means 80percentage of the NVIDIA binary drivers perf, that pretty much almost looks that open source driver is better
04:34 martm: http://www.phoronix.com/scan.php?page=article&item=nvidia-352-nouveau&num=1 bit ancient one from july 2015 , but shows how it roughly behaves, highest mid and lowest clocks against nvidia binary highest ones
04:38 martm: karolherbst: i have no even reclocked my gpu, i understand i can do that with open source driver with some kernel version to highest level even
04:42 martm: so to sum it up, the scheduler is easy, vulkan is easy, dx12 dx11.3 are easily doable, and generating pure blobs from any one of them is reasonably easy too, once you start behaving and stop pulling crap and stop sabotaging
04:49 martm: and once you have a minor clue, actually making the driver working on fpga's is also task that can be done, it's reasonably simple, but requires work again
04:53 martm: very attractive candidates for using nouveau's driver for instance would be two tiny interconnected arria boards for instance, making an utterly powerful solution in small form factor
04:54 mgottschlag: actually, current FPGAs aren't as good for shader execution as everyone believes they are
04:54 mgottschlag: because of limited amounts of DSP slices
04:55 mgottschlag: you'd rather want a coarse-grained reconfigurable fabric for that (grid consisting of ALUs)
05:00 martm: i am googling soon, i am not sure what you mean by DSP-slices
05:00 martm: the data i have, does not mention there should be a problem for long time allready, starting from 2010 before arria V was released definitely
05:05 mgottschlag: apparently, the Arria V has up to ~1000 DSP slices
05:05 mgottschlag: now, in the best case, those can be clocked at probably 500MHz (didn't read the datasheet, but it is something in that area probably)
05:06 martm: mgottschlag: is that a low or high amount?
05:06 mgottschlag: pretty high for an FPGA
05:06 mgottschlag: but that means 500GOps if those DSP slices can process one operation per cycle
05:06 mgottschlag: that's not exactly much anymore
05:07 mgottschlag: and the memory bandwidth for external DRAM also is limited, it's usually not possible to construct such large memory busses as used on modern GPUs
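The throughput estimate above is plain multiplication, and can be sketched as follows. This is only a back-of-the-envelope model: the slice count and the clock are the rough figures quoted in the chat (not datasheet values), and `peak_gops` and its parameter names are made up for illustration.

```python
def peak_gops(dsp_slices: int, clock_hz: float, ops_per_cycle: int = 1) -> float:
    """Peak throughput in giga-operations per second.

    Assumes every DSP slice retires `ops_per_cycle` operations on every
    clock cycle, which is the best case described in the chat.
    """
    return dsp_slices * clock_hz * ops_per_cycle / 1e9


if __name__ == "__main__":
    # ~1000 slices at an assumed 500 MHz, one operation per cycle each
    print(peak_gops(dsp_slices=1000, clock_hz=500e6), "GOps")
```

Any real design would land below this figure, since it ignores routing, memory bandwidth and pipeline stalls.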
05:07 martm: mgottschlag: well what is the difference between fpga logic blocks and dsp slices, cause the logic can be clocked to 3.2ghz?
05:07 mgottschlag: I doubt that you will get 3.2GHz for any complex designs
05:07 martm: wiki says something that dsp is used for analog signal filtering
05:08 mgottschlag: have you done FPGA development before?
05:08 mgottschlag: FPGA logic blocks are bit-level configurable
05:08 mgottschlag: very flexible, but you'll need a lot of them to build arithmetic operations, and the result will be rather slow, simply because of the size
05:09 martm: mgottschlag: not much yet, but i know most about verilog, i just yet have not received my board
05:09 mgottschlag: you might be able to construct microscopic circuits which are clocked with 3.2GHz, but for anything realistic, I think the usual numbers are a lot lower
05:09 mgottschlag: most CPUs e.g. on a Spartan 6 (low end Xilinx part) are clocked with ~100MHz
05:09 mgottschlag: now, the high end parts might be quite a bit faster, but not *that* much
05:10 mgottschlag: you could work with extremely deep pipelines, but that creates other throughput problems
05:10 martm: yeah i dunno, it's only what the spec says, 450mhz as multiplied plus 8 clock phases
05:11 mgottschlag: what's your target anyways - 3D graphics, or GPGPU?
05:11 martm: both:) plus an x86 firmware too:D
05:12 mgottschlag: ambitious... but certainly a good learning experience :)
05:14 martm: well x86 firmware i already have i modify this to x86_64 and modify this to perform, and gpu rom is largely generated, otherwise yeah it would be very ambitious and i would not be capable of doing this, helping factor is that i have driver hunks too:)
07:49 martm: mgottschlag: btw. thanks for the explanation, i already thought i don't have much to read, now reading about how many LUTs and LE's would be used to do that hw dsp mac
07:52 martm: it seems to be yeah expensive to do that 500LUT's out of 100k of them so maximum 2000of them can be done approximately
07:52 martm: in logic
07:52 mgottschlag: yeah, and I don't think those will be too fast, although I don't know what the 3GHz you quoted specify, I guess that's the delay of a single LUT
07:53 mgottschlag: and your multiplier will have longer chains of LUTs
07:53 mgottschlag: (the critical path in your combinatorial logic defines the maximum clock speed)
07:53 mgottschlag: I mean, the length of the path
07:56 martm: well it's just the clock frequency for the logic...frequency is something like pulses in a second, clock rises and falls so many times
07:57 martm: 1ghz means 1ns resolution
07:57 martm: so this one is bit higher resolution, i should calculate
07:59 mgottschlag: well, I'd rather say that the maximum frequency is the clock frequency with which the most simple circuit is still able to operate
07:59 mgottschlag: http://www.ee.surrey.ac.uk/Projects/CAL/seq-switching/Graphics/gen_synch_circ.gif
08:00 mgottschlag: a 3GHz frequency would mean 333ps period, and I think in modern CPUs, that's the equivalent of 20 CMOS gates worth of combinatorial logic
08:00 mgottschlag: in FPGAs, a lot less
08:00 mgottschlag: now, if I was altera, I'd quote the maximum frequency of a circuit consisting of one flipflop and one LUT
08:00 mgottschlag: because that's the most impressive number, even if it doesn't mean anything
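The relation between clock frequency, period and critical path length discussed above can be sketched numerically. The per-LUT delay and the chain length below are hypothetical round numbers chosen for illustration, not values for any real Altera part, and flip-flop setup time and routing delay are ignored.

```python
def period_ps(freq_hz: float) -> float:
    """Clock period in picoseconds for a given frequency in Hz."""
    return 1e12 / freq_hz


def fmax_mhz(stage_delays_ps: list) -> float:
    """Maximum clock in MHz for a register-to-register path.

    The path is modelled as a chain of stages whose delays simply add up;
    the sum is the critical path delay, and the clock period cannot be
    shorter than it.
    """
    return 1e6 / sum(stage_delays_ps)


if __name__ == "__main__":
    print(period_ps(1e9))            # period of a 1 GHz clock, in ps
    print(period_ps(3e9))            # period of a 3 GHz clock, in ps
    # One LUT between two flip-flops versus a chain of ten (assumed 333 ps each)
    print(fmax_mhz([333.0]))
    print(fmax_mhz([333.0] * 10))
```

Under these assumptions a ten-LUT chain already drops the achievable clock by a factor of ten compared with the single-LUT figure.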
08:01 martm: yeah, but so it is, because it has 8clock inverted clock shifters in hw, and up to 4xx clock multiplier
08:01 mgottschlag: oh, btw, is this amount of OT chat okay?
08:01 martm: yep a little bit yeah
08:01 martm: so lets stop this, either thanks.
08:05 martm: either/either way
08:06 mupuf: mgottschlag: why are you suddenly talking about FPGAs?
08:06 mupuf: I scrolled up and couldn't find anyone talking about it
08:08 mgottschlag: talking to martm, I guess you have him on your ignore list
08:09 mupuf: I don't remember why I would have done that, but I guess I did, now let's see how I can get him out of it
08:09 mgottschlag: (this all started with "hey, I'll just implement a GPU on an FPGA and it's going to be fast")
08:09 mgottschlag: (basically)
08:12 karolherbst: I get the feeling that every new hw thing starts with it's gonna be fast :D
08:13 mupuf: oh wow, I indeed missed a lot of the conversation :D
08:13 mgottschlag: I've recently heard a lecture on modern CPU designs using FPGAs for reconfigurable accelerators, and it left me slightly depressed
08:13 mgottschlag: I would certainly have thought that the opportunities were bigger :D
08:14 karolherbst: mgottschlag: well you need time to reconfigure your fpga
08:15 mupuf: mgottschlag: intel bought Altera, we'll see what comes out of it!
08:15 mupuf: if we get an open source toolchain, that will already be a good win!
08:16 martm: well i dunno as always there is controversial info on the net, rough amount of people are largely offtrack, there is a belief that x86 can not be done fast on fpga, but i am sure i've read it can be
08:16 martm: mupuf: that was wise thing to do by intel, it turns out well definitely
08:16 mgottschlag: mupuf: yeah, indeed, and there is a lot of research going on in that area as well
08:17 mupuf: martm: reading about something being possible is ... useless :D
08:17 mgottschlag: I mean, after all, in a couple of years, the money isn't in smaller and smaller technology nodes, but instead in architecture optimizations, and Intel knows that better than anyone else
08:17 mupuf: people can claim that they invented the perpetual motion
08:18 karolherbst: it kind of makes sense if you don't have to reconfigure too much, but basically everything which is a pc or a phone or something won't benefit from this I guess, because you can't just decide to specialize your chip for every task and keep up with scheduling and stuff
08:18 mgottschlag: karolherbst: yeah, that, and the toolchains suck
08:18 mgottschlag: there have been architectures which can reconfigure in a couple of clock cycles, but then the speedup is rather limited, or the chip area is huge
08:19 mgottschlag: or, usually, both
08:19 karolherbst: yeah okay, but you also need to figure how to reconfigure
08:19 karolherbst: what is real good configuration for the next task
08:19 mgottschlag: yeah
08:19 karolherbst: if you want that to be dynamic
08:20 karolherbst: so in the end the overhead of having this eats all performance gains in the first place
08:20 karolherbst: but
08:20 karolherbst: maybe this could be moved into the binary compiler and attached to the binaries or something
08:20 karolherbst: or parts of that
08:21 martm: mgottschlag: but this reconfiguration does not take much right, synthesis would take some time
08:22 martm: i.e program a decent core, and run stuff off from it, with a minimal os
08:22 mgottschlag: yeah, and if synthesis is done in advance, it's done for a very specific FPGA, and usually for a specific FPGA size
08:22 mgottschlag: now there are ways around that, and they all have their drawbacks
08:23 mgottschlag: and there are still *some* scheduling problems, because e.g. a server still needs to decide how many accelerators of which type are needed in the next second
08:24 mgottschlag: I mean, such systems are probably the future, but there are quite some practical problems to be solved
08:24 mgottschlag: I guess open source toolchains might actually help quite a bit with that
08:27 martm: just a minor reading yet to be done, i think i allready got it , but it seems that this 500mhz dsp clock can be raised too interconnecting logic clocks to it
08:27 martm: but needs a bit more reading
08:34 karolherbst: meh, zcull seems to be a bit more complicated than I thought. It seems like there are up to 0x3f regions (maybe fewer on some chipsets, or fewer for all and 0x3f is just a special one or whatever) and each zcull_region has its own state :/
08:35 karolherbst: and the state is kept on the hardware and the driver only partially updates the parameters
08:38 mgottschlag: martm: that 500MHz was just an estimate of what I've usually seen in such devices
08:38 mgottschlag: it's in no way accurate for your part
08:47 martm: however it was still a good example, i saw a similar pdf already, yeah altera dsp builder allows to choose clock phases and play pll and all such. i think we are quite done with the subject now
09:07 martm: i was monitoring the imirkin github branch, i believe i'll do some testing on the images load store part here, looks like compute will be landed soon
09:08 martm: and i belive not (not quite sure) there is no interest by me in the spec that is above 4.3 at least i have not met anything important there
09:08 martm: sparse buffers i know is interesting but, basically almost not needed still
09:14 martm: i think last time i was sceptical about open source drivers was 3years ago, and i am sorry for that during some angry moments, nowdays if i will make some plans i make them around open ones only
09:37 martm: i suppose i have not really understood what is being talked about most the time, except few cases, and the fact, that during congresses or hacker conferences the audience is so big, it looks like everyone is very smart person and thinking, and in reality the code is posted by few people only, i dunno why is that so?
10:12 martm: you know i've been sober for 1month and 5days, it's the longest break i have had in 5years, i have a vodka in the fridge and i think i celebrate this, i take for estonian abortion leftovers, who regulate my rights!
10:26 imirkin_: skeggsb: just to close the loop on my questions last night, i was indeed just doing something dumb. turns out fences work a lot better if you *first* emit them *then* wait for them rather than vice-versa.
10:28 karolherbst: what is zeta _exactly_ by the way? I know it seems to be some kind of depth test, but I doubt this is the EarlyZ thing I thought it was :/
10:33 karolherbst: imirkin_: I was thinking I could just set ZETA_ENABLE always to 0 and mess with ZCULL until the performance gets okay again, but somehow all the depth testing is gone now and I see hidden stuff in most gl Applications. Is Zeta not EarlyZ? Or is that EarlyZ depth testing somewhat required by OpenGL?
10:34 imirkin_: zeta = depth + stencil
10:34 imirkin_: if ZETA_ENABLE is 0, then there's no depth, depth tests, etc
10:34 karolherbst: yeah I noticed somewhat :)
10:35 karolherbst: mhh okay, but is there an easy way to disable EarlyZ with nouveau and still be able to use Zcull?
10:36 karolherbst: but I guess applications with a lot of depths tests (if the difference with ZETA_ENABLE=0 is huge) means that zcull will make a difference
10:37 imirkin_: zcull is only effective when early depth tests are done i think
10:37 imirkin_: you can force them off by claiming that the shader writes to gmem or to depth (even if it doesn't)
10:38 karolherbst: ahh right, there is this fp->writesDepth check in mesa which disables zcull in the header
10:38 imirkin_: ;)
10:38 imirkin_: i don't think you're quite understanding what zcull is
10:38 karolherbst: anyway, disabling ZETA is nice :D
10:39 imirkin_: yeah, might break a thing or two though
10:39 karolherbst: mhh well I am just trying to find a way first how I can measure if it works or not or to find something where nvidia would make heavy use of zcull
10:39 karolherbst: it breaks like everything I tested so far
10:39 imirkin_: just write an application
10:39 imirkin_: which does depth testing
10:39 imirkin_: in the prescribed manner
10:39 imirkin_: iirc glxgears should do the trick
10:40 imirkin_: but you could create something way more pessimal
10:40 imirkin_: just draw a TON of polygons that are depth-rejected
10:40 karolherbst: mhh I read the zcull stuff from the mmt of glxgears already, but it doesn't tell enough
10:40 karolherbst: imirkin_: antichamber is good for this, the entire world is just depth rejected
10:41 karolherbst: you can see everything with zeta disabled
10:41 imirkin_: :)
10:41 imirkin_: might be stencil'd out actually
10:41 imirkin_: stencil is part of zeta
10:41 karolherbst: ahh okay
10:41 imirkin_: i guess it depends.
10:41 karolherbst: anyway, it seems small enough, the apitrace has only like 140MB
10:41 karolherbst: is there a way if I can check if it's depth tested or stenciled out?
10:42 imirkin_: sure, just look at the apitrace
10:42 imirkin_: instead of disabling zeta you could selectively ignore depth or stencil test settings in nvc0_zsa_state_create
10:42 imirkin_: [might not be the actual function name]
10:43 imirkin_: but something with nvc0, zsa, state, and create :)
10:47 karolherbst: nvc0_zsa_state_create sounds good
10:57 karolherbst: imirkin_: how would I see it in the apitrace?
11:12 karolherbst: now that's funny
11:12 imirkin_: karolherbst: just adjust the code
11:12 imirkin_: karolherbst: in the apitrace, you can look for glEnable(GL_DEPTH_TEST)
11:12 imirkin_: and i forget the thing for stencil
11:13 imirkin_: glStencilMask? i forget.
11:13 karolherbst: well I already checked for that, both not there
11:13 karolherbst: nothing with glStencil...
11:13 karolherbst: no glEnable for Stencil nor depth
11:23 karolherbst: weird
11:25 karolherbst: imirkin_: if I trace the game with ZETA_ENABLE set to 0 and replay the trace I get all the stuff which got removed by ZETA, but when I trace the game with ZETA_ENABLE 1 and replay this trace with ZETA_ENABLE set to 0, I see the stuff at the border of the frame, but it disappears and nothing extra is there anymore :/
11:25 imirkin_: heh
11:26 imirkin_: they probably do shadow mapping
11:26 imirkin_: or... something
11:26 imirkin_: basically where you take the depth texture and use it as a texture
11:26 karolherbst: but replaying the trace with the broken zeta test looks the same
11:26 karolherbst: the performance is just worse
11:27 imirkin_: coz the game reacts differently based on rendered results, while apitrace doesn't?
11:27 karolherbst: yeah makes sense
11:30 karolherbst: but I can use this to create a nice apitrace to test depth testing :)
11:35 karolherbst: by the way: that is the difference with enabled and disabled ZETA: https://imgur.com/a/XHbmh :)
11:56 karolherbst: at least one thing: ZCULL stuff is always set before ZETA except when ZETA_ENABLE==FALSE
11:56 karolherbst: so I guess this zcull thing can just be moved above the ZETA stuff inside zsa_validate
12:03 imirkin_: order doesn't matter
12:03 imirkin_: none of this stuff does anything until you hit a "go" method
12:03 imirkin_: like draw, clear, etc
12:08 karolherbst: ahh okay
12:08 imirkin_: vast majority of methods is "store value in internal state register"
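The point about ordering can be illustrated with a toy model of a command stream. This is a deliberately simplified sketch, not how the hardware or nouveau is implemented: `ToyEngine`, the set of "go" methods and the method names used in the example are illustrative assumptions.

```python
class ToyEngine:
    """Toy model: most methods only latch a value, a few trigger work."""

    # Hypothetical set of "go" methods; real hardware has many more.
    GO_METHODS = frozenset({"DRAW", "CLEAR"})

    def __init__(self):
        self.state = {}      # internal state registers
        self.executed = []   # (go method, state snapshot) pairs

    def method(self, name, value):
        if name in self.GO_METHODS:
            # Work happens here, using whatever state is latched right now.
            self.executed.append((name, dict(self.state)))
        else:
            # Everything else just stores a value for later use.
            self.state[name] = value


def run(commands):
    engine = ToyEngine()
    for name, value in commands:
        engine.method(name, value)
    return engine.executed


# Same state writes in a different order, followed by the same draw.
zcull_first = run([("ZCULL_REGION", 0), ("ZETA_ENABLE", 1), ("DRAW", 0)])
zeta_first = run([("ZETA_ENABLE", 1), ("ZCULL_REGION", 0), ("DRAW", 0)])
```

In this model both streams reach the draw with identical latched state, which is why reordering the ZCULL and ZETA writes between two "go" methods changes nothing.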
12:42 karolherbst: imirkin_: mhh somehow that zcull stuff looks easy though, you never update the buffer except on special events (I think new surface? or something like that is one) the buffer is usually 0x20000 big, sometimes 0x40000 and you only update values you want to change
12:43 karolherbst: but I get the feeling that something is missing
12:43 karolherbst: oh yeah and that zcull_region holds its own state
12:43 karolherbst: I think
13:01 karolherbst: imirkin_: I have also no idea what to do with this: https://gist.github.com/karolherbst/3ae1096b1fafd151b7af usually I would expect size to be 3, but then there is this PM? thing. Also the third thing has a data of 0x80000656 on the left side :/
13:02 imirkin_: PM means that it's written out by a macro
13:02 imirkin_: demmt has a fancy little macro evaluator built into it
13:04 karolherbst: okay, but how do I translate this to the PUSH_DATA macros in mesa? just use the last value? and the both on top are magically done or do I have to do something else?
13:05 imirkin_: take a look at the macro and see what it does
13:05 imirkin_: they have some idiotic macros
13:05 imirkin_: but some make sense
13:05 karolherbst: okay, and the macro is somewhere else in the mmt
13:06 imirkin_: yeah, at the beginning, they upload all their macros
13:09 karolherbst: I should read the rnndb stuff more often: ZCULL_LIMIT= ADDRESS + align((width * height) / 28, 128 KiB)
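The rnndb formula quoted above can be written out as a short calculation. This is a sketch under assumptions: the division is taken to be an integer division that happens before the alignment, and `align_up` and `zcull_limit` are names invented here, not functions from mesa or envytools.

```python
KIB = 1024


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def zcull_limit(address: int, width: int, height: int) -> int:
    """ZCULL_LIMIT = ADDRESS + align((width * height) / 28, 128 KiB)."""
    return address + align_up((width * height) // 28, 128 * KIB)


if __name__ == "__main__":
    # Size of the region for two example surface sizes (address 0)
    print(hex(zcull_limit(0, 1920, 1080)))
    print(hex(zcull_limit(0, 2560, 1600)))
```

With these assumptions a 1920x1080 surface needs one 128 KiB unit (0x20000) and a 2560x1600 surface needs two (0x40000).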
13:16 karolherbst: GK104_3D.GRAPH.MACRO_ENTRY_POS = 0x3c?
13:16 imirkin_: something like that
13:16 imirkin_: it should be pretty obvious
13:16 karolherbst: mhhh
13:16 karolherbst: yeah it is
13:16 karolherbst: but well
13:17 karolherbst: read $r2 0xd5d [GK104_3D.GRAPH.SCRATCH[0x5d]] read $r1 0xd4a [GK104_3D.GRAPH.SCRATCH[0x4a]] exit maddr 0x1f3 [GK104_3D.ZCULL_UNK07CC] send (add $r2 $r1)
13:17 karolherbst: that's all
13:18 imirkin_: so... it adds 2 values from scratch registers and writes them to that UNK07cc thing
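What the decoded macro does can be mimicked in a few lines. This is a toy re-enactment, not a macro interpreter: the word-to-byte conversion is inferred from the names in the decode above (0x1f3 * 4 = 0x7cc, and 0xd5d maps to SCRATCH[0x5d]), the 32-bit wraparound of the add is an assumption, and the scratch values in the example are made up.

```python
SCRATCH_BASE = 0xd00  # method address of SCRATCH[0], in 32-bit word units


def method_offset(maddr: int) -> int:
    """Convert a macro method address (word units) to a byte offset."""
    return maddr * 4


def run_macro(scratch: dict) -> tuple:
    """Replay the decoded macro: read two scratch regs, send their sum."""
    r2 = scratch[0xd5d - SCRATCH_BASE]   # read $r2 0xd5d -> SCRATCH[0x5d]
    r1 = scratch[0xd4a - SCRATCH_BASE]   # read $r1 0xd4a -> SCRATCH[0x4a]
    target = method_offset(0x1f3)        # maddr 0x1f3 -> ZCULL_UNK07CC
    return target, (r2 + r1) & 0xffffffff  # send (add $r2 $r1)


if __name__ == "__main__":
    # Made-up scratch contents, just to exercise the function
    target, value = run_macro({0x5d: 0x100, 0x4a: 0x20})
    print(hex(target), hex(value))
```

To reproduce the write with PUSH_DATA one would need the actual contents of those two scratch registers at the time the macro runs, which the snippet does not know.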
13:19 karolherbst: ZCULL_UNK07CC is set to 0x0 before that
13:28 martm: i think RSpliet calculated something that we were wondering about always, how many threads on kepler, per stage could work maximum in parallel, i think he got 333 which well in the area of my possible values
13:29 martm: so they are grouped into warps of 32threads, but per stage actually more could run on parallel, it could dispatch many warps per one shader
13:29 karolherbst: imirkin_: any idea where that GK104_3D.ZCULL_INVALIDATE comes from? Because its offset is 0x1958
13:29 martm: would be good to know what is the max for kepler, if i am not mistaken then RSpliet thought it was 3xx
13:29 imirkin_: that's probably a "do stuff" call
13:29 imirkin_: which clears the zcull surface
13:30 imirkin_: (or rather, invalidates all its data)
13:30 karolherbst: okay
13:30 karolherbst: makes sense actually
13:30 karolherbst: because it is only called when width/height also changes
13:31 karolherbst: https://gist.github.com/karolherbst/f01128f87deea698fc6a
13:31 karolherbst: something like that
13:32 karolherbst: and usually it only sets the region and that GK104_3D.UNK02E8 thing
13:33 martm: mgottschlag: if you're interested in those fpga projects, i have searched for partners for ages , i could talk a little closer how to do things
13:33 karolherbst: there is also this "0x80000656" which bothers me a bit :/
13:39 martm: mgottschlag: i dunno what is the english term for it, but some having annoyed me for 15 or 20 years, the thing was solved so that basically i was assigned a wardship, after they had kept me in 5 times, altogether more than a year's time, illegally
13:41 martm: and even with as a cripple that is pure war situation, so i am planning to attack abortion leftovers and i need friends around to world to do that
13:46 mgottschlag: martm: I have way too many projects already, both professional and spare-time ones
13:49 martm: mgottschlag: yeah really me too, but i have finally having struggled much with getting the momentum, found my satisfaction in the computing world, about hundreds of projects are yet still undone but...
13:49 martm: i sort of found my way where i wanted to land as a prerequisite
13:53 karolherbst: imirkin_: you know what.. there is a counter for zcull rejected pixels... :/ maybe I should just try to get this one first or something
13:54 mwk: karolherbst: heh, would be nice to figure out what the 4 ZCULL counters are :)
13:54 karolherbst: :D
13:54 karolherbst: yeah
13:54 karolherbst: kind of :D
13:54 karolherbst: I am sure that I do the right thing kind of, but maybe the hardware dislikes the buffer? or something else is fishy?
13:55 imirkin_: make sure the memtype is right
13:56 karolherbst: ohh right, I checked the allocation yesterday
13:56 karolherbst: sounded like something generic
13:56 karolherbst: wait
13:57 karolherbst: LOG: NVRM_IOCTL_VSPACE_MAP post, fd: 9, cid: 0xc1d00071, handle: 0xcaf0001b [class: 0x0002 NV1_DMA_FROM_MEMORY], dev: 0xbeef0003 [class: 0x0080 NVRM_DEVICE_0], vspace: 0xbeef0202, base: 0x0000000000000000, size: 0x0000000000020000, flags: 0x00000000, addr: 0x0000000100a20000, status: SUCCESS
13:58 martm: mgottschlag: this fpga market is sort of hasn't yet really settled, it's one of the bigger movements that is needed to be handled sw wise..but looking at miaow this code is so rough, that i am honestly not capable of reading it, i'd do it lot simpler ways
13:58 imirkin_: i don't remember how the blob sets memtypes
14:00 martm: this world has like 50percent likeliness, that most indian scientists talk about that stuff on the net, i dunno why are they so keen of those programmable device technologies, maybe tata automobile uses such devices:D
14:04 karolherbst: with NVRM_IOCTL_MEMORY2 they get a handle and do NVRM_IOCTL_VSPACE_MAP to get some memory, but I don't see anything else happening on that handle
14:04 imirkin_: i might have been an asshole and not printed all the MEMORY2 fields when i added support for that ioctl
14:05 karolherbst: LOG: NVRM_IOCTL_MEMORY2 post, fd: 10, cid: 0xc1d00071, handle: 0xcaf0001b, parent: 0xbeef0003 [class: 0x0080 NVRM_DEVICE_0], class: 0x0002 [NV1_DMA_FROM_MEMORY], unk0c: 0x00000000, status: SUCCESS, unk14: 0x00000000, vram_total: 0x00000000c0000000, vram_free: 0x00000000beb57000, vspace: 0x474c1000
14:06 karolherbst: maybe
14:06 karolherbst: mwk: I guess you have no idea about those zcull counters?
14:09 martm: i was really wondering what karolherbst is after, i read that stuff so long time ago, take away nothing from that gentleman even if he pulls crap sometimes, the actions are good still
14:10 martm: but there i think were certain cases where this optimization can be used, you know you can't define depth buffer with your own values and hope to get this optimization to work
14:11 martm: what was this rendering method called where it was possible to use that infact..one moment
14:11 martm: airlied was always obsessed talking about this method, i am drunk i really forgot what it was
14:12 martm: it was something where shadow maps were involved, it had a certain methods name
14:14 ben--ben: helo
14:15 ben--ben: So, my GPU locks up with disp: error 7 [invalid handle]
14:15 ben--ben: and then i get a bunch of fifo errors
14:15 imirkin_: pastebin dmesg
14:17 benwaffle: imirkin: http://termbin.com/bfbo
14:17 benwaffle: My computer locked up again :/
14:17 karolherbst: imirkin_: in the header is a unk0c, unk14 and uint32_t unk30[32], so I guess it will be rather in that array
14:19 imirkin_: benwaffle: is there anything one might consider "unusual" about your setup?
14:19 benwaffle: Nope, I disconnected my second monitor
14:19 benwaffle: Just a regular nvidia gpu
14:19 imirkin_: tell me about the monitor that's plugged in? what resolution is it? how is it hooked up?
14:20 benwaffle: 1920x1080, dvi on GPU to HDMI on screen
14:20 benwaffle: Asua
14:20 benwaffle: Asus
14:20 imirkin_: hmmmmmm... that could be the issue
14:20 benwaffle: What?
14:20 imirkin_: er hm.
14:20 imirkin_: is it a DVI cable with an HDMI adapter on the screen end
14:21 imirkin_: or an HDMI cable with a DVI adapter on the computer end?
14:21 benwaffle: How do I check
14:21 imirkin_: err
14:21 imirkin_: use your eyes?
14:21 benwaffle: There is no adapter its a dvi / HDMI cable
14:22 imirkin_: or are there no adapters, and it's just a cable with one on one end, one on another?
14:22 benwaffle: Yup
14:22 imirkin_: weird.
14:22 benwaffle: Its never given me problems
14:22 imirkin_: oh i'm sure it works
14:23 imirkin_: it's just not a frequently-tested scenario
14:23 imirkin_: i'm concerned that we think it's a DVI cable
14:23 imirkin_: and then try to enable dual-link on it
14:23 imirkin_: which won't work since you have HDMI on the other end
14:23 benwaffle: What is dual link
14:23 imirkin_: DVI cables can carry 2 TMDS links
14:24 imirkin_: HDMI cables can only carry one
14:25 benwaffle: I'm trying my other monitor which is just dvi
14:28 benwaffle: Still locks up
14:29 martm: it took some time but it's called deferred shading
14:30 imirkin_: skeggsb: benwaffle requires your expertise :)
14:33 benwaffle: imirkin: are you a dev
14:35 imirkin_: by most definitions, yes
14:38 benwaffle: I just got the fifo errors without the disp error
14:38 benwaffle: BTW this is on the proprietary firmware
14:40 skeggsb: i'm mostly wondering if this is a regression? i suspect it's neither disp or fifo that's at fault here, but something else they both depend on
14:42 benwaffle: Would running mesa-git help? Or compiling my own kernel?
14:43 imirkin_: benwaffle: i doubt it's on proprietary firmware... that takes some doing for 4.3+ since stuff got renamed
14:43 benwaffle: imirkin_: what
14:44 benwaffle: Oh, Linux gt
14:44 benwaffle: Git?
14:44 benwaffle: I've been running 4.3 and 4.4
14:44 imirkin_: benwaffle: 4.3+ looks for different filenames if you force proprietary firmware
14:45 imirkin_: (different than what's documented everywhere, and different than what my script generates)
14:49 karolherbst: isn't there a kind of memory like just give me memory for gpus?
14:49 karolherbst: or why does the type matter at all?
14:50 mwk: karolherbst: not really, no
14:50 imirkin_: karolherbst: because they like to do fancy things and improve perf
14:50 mwk: only how to read them
14:54 benwaffle: I'm going to try lts and git
14:56 martm: karolherbst: bit drunk and all that, but i think it's pretty good plan, as i am not sure yet, still anything that uses software based z-tests in their methods, perhaps can be indeed optimized to use a hw one, imo there are only two of those cases
14:57 martm: shadow maps, and deferred shading, but, can be entirely wrong, will look that all out if it's needed
15:01 karolherbst: okay, and setting a wrong mem type could make zculling not using the memory region at all, because there is something odd with this, understood.
15:05 chimpans: skeggsb: again very abortion leftoverish from you
15:06 imirkin_: skeggsb: see what you started? before ignore worked just fine...
15:07 skeggsb: imirkin_: airlied requested it :P i'd been ignoring
15:07 imirkin_: hehe
15:08 airlied: imirkin_: ignore works for you, doesn't scale :
15:08 airlied: he keeps talking to other people
15:10 karolherbst: on a side note: if you really want to stop somebody then you have to hellban him, otherwise it won't work
15:10 imirkin_: well, #intel-gfx ends up banning all of estonia as a result
15:10 imirkin_: as well as all web irc clients
15:11 karolherbst: yeah, freenode needs hellbans for that kind of stuff, because if you really want you can join after getting banned
15:11 airlied: for a nutter he has access to a lot of either hacked a/c or hosting services
15:39 benwaffle: alright, i give up. back to proprietary drivers
16:00 karolherbst: imirkin: !!!
16:01 karolherbst: something changed :O
16:01 karolherbst: heaven is all blue now
16:01 karolherbst: ohhh wait
16:01 karolherbst: :/
16:01 karolherbst: mesa build system strikes again
16:01 karolherbst: but other stuff worked, mhhh
16:02 imirkin_: huh? just make install and don't try any silly stuff like LIBGL_DRIVERS_PATH
16:02 karolherbst: no, I changed a header again
16:02 karolherbst: and not everything got recompiled
16:02 karolherbst: ...
16:03 karolherbst: I only have to set the width/height and all that crap when there is an actual new buffer otherwise I just write two things to the gpu, I needed a flag to keep track of this
16:03 imirkin_: uhhhhh
16:04 imirkin_: that shouldn't happen. changing headers should cause things to be recompiled.
16:04 karolherbst: yeah, it doesn't :/ maybe it did, I'd rather make clean now and see if everything is okay again
16:05 karolherbst: yep, make clean helped :/
16:05 karolherbst: sorry for that false alarm then :/
16:06 karolherbst: but it wasn't the first time I had this compilation issue
16:06 karolherbst: and I am pretty sure others had this as well
16:07 imirkin_: only if there's something funky with timestamps
16:07 karolherbst: well I have noatime set
16:07 imirkin_: shouldn't matter
16:07 karolherbst: but usually that works in other projects
16:08 imirkin_: atime = access time
16:08 imirkin_: mtime = modified time
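[Editor's aside: a minimal C sketch of the point made here, that a make-style staleness check compares modification times (mtime) and never looks at access times, so a `noatime` mount should not affect it. The helper names `is_stale` and `needs_rebuild` are hypothetical, and this is only an illustration of the comparison, not how make itself is implemented.]

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* A target is stale when a dependency was modified after it was built.
 * Only mtime is consulted; atime plays no role. */
static bool is_stale(time_t target_mtime, time_t dep_mtime)
{
    return dep_mtime > target_mtime;
}

/* Hypothetical wrapper: stat both files and compare their mtimes.
 * If either file cannot be stat'ed, err on the side of rebuilding. */
static bool needs_rebuild(const char *target, const char *dep)
{
    struct stat t, d;

    if (stat(target, &t) != 0 || stat(dep, &d) != 0)
        return true;
    return is_stale(t.st_mtime, d.st_mtime);
}
```

[A check like this can only fire for dependencies the build system knows about, so an object file whose header dependencies were never recorded would not be rebuilt regardless of timestamps.]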
16:08 karolherbst: well I will try modifying a nouveau header again, and see what make does
16:08 karolherbst: yeah no recompile
16:08 imirkin_: which header?
16:08 karolherbst: I added a new line at the top of nvc0_context.h
16:08 imirkin_: uhhhhh
16:09 imirkin_: that def causes recompiles for me
16:09 imirkin_: there's something fubar'd on your system
16:09 karolherbst: maybe because I build out of tree?
16:09 imirkin_: double-checked, and yes, it works.
16:09 imirkin_: could be the out-of-tree thing
16:10 imirkin_: that's not how i build
16:10 karolherbst: k
16:10 imirkin_: although i think others do and would have noticed such issues
16:10 karolherbst: mhh even the modification time of the file changes
16:12 karolherbst: mhhh the .deps/*.Plo files only contain "#dummy"
16:12 karolherbst: no idea if that's what's wrong
16:23 karolherbst: but I think this private buffer is no texture at all, it is just memory and it is usually rather small (0x20000 in size)
16:23 karolherbst: ohhhh wait
16:24 karolherbst: maybe it is just something in front of another buffer
16:24 karolherbst: and the zcull thing reads from this and writes into the small one for later usage?
16:25 karolherbst: this comment is in rnndb: "receives the same address as the next allocated object, so should be a limit. doesn't have any immediate effect though (as long as data is in cache ?)"
16:25 karolherbst: for ZCULL_LIMIT
16:26 karolherbst: mhh no, doesn't make sense
16:44 karolherbst: nice I figured GK104_3D.UNK02E8 out: it is (ZCULL_WIDTH * ZCULL_HEIGHT * 0x10) / (0x50 * 0x20)
16:44 karolherbst: aligned to 0x100
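[Editor's aside: the formula above can be sketched in C as follows. The function names `align_up` and `zcull_unk02e8` are hypothetical. Two things are assumptions, not something the log confirms: that the division truncates before the alignment, and that "aligned to 0x100" means rounding up.]

```c
#include <stdint.h>

/* Round v up to the next multiple of a; a must be a power of two. */
static uint32_t align_up(uint32_t v, uint32_t a)
{
    return (v + a - 1) & ~(a - 1);
}

/* Value observed for GK104_3D.UNK02E8, per the formula in the log:
 * (ZCULL_WIDTH * ZCULL_HEIGHT * 0x10) / (0x50 * 0x20), aligned to 0x100.
 * The intermediate product is 64-bit to avoid overflow on large sizes. */
static uint32_t zcull_unk02e8(uint32_t width, uint32_t height)
{
    uint64_t raw = ((uint64_t)width * height * 0x10) / (0x50 * 0x20);

    return align_up((uint32_t)raw, 0x100);
}
```

[For example, a 0x50 by 0x20 region gives a raw value of 0x10, which rounds up to 0x100 under these assumptions.]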
16:46 karolherbst: sounds important
16:47 imirkin_: so i guess every zcull bucket is 0x20 by 0x50...
16:48 karolherbst: yay
16:48 karolherbst: engine fault
16:48 karolherbst: a change
16:48 imirkin_: i suspect UNK02E8 is the zcull buffer size? dunno.
16:48 karolherbst: fifo: read fault at 001a5e0000 engine 00 [GR] client 0c [GPC0/RAST] reason 02 [PTE] on channel 2 [00bf890000 UDKGame-Linux[4397]]
16:48 imirkin_: that means it read too far out
16:48 karolherbst: so it read
16:49 imirkin_: you should emit the addresses of the various things so that you can match them up to kernel errors
16:49 karolherbst: something was too small anyway
16:49 imirkin_: right
16:51 karolherbst: odd
16:51 karolherbst: ohhh
16:51 karolherbst: it was using an old buffer
16:51 karolherbst: first zcull buffer: 1a5e0000, second one: 1a860000
16:53 karolherbst: no setting unk02e8 back to 0
16:53 karolherbst: maybe I messed something else up
16:54 karolherbst: it is nice though how the address never changes
16:56 karolherbst: mhh and 0x1500 enables/triggers ZCULL?
16:56 karolherbst: because
16:57 karolherbst: if I don't set 0x1500 to 0, there is no read fault
16:57 karolherbst: at the zcull address
16:57 karolherbst: easy, I won't ever reset the buffer and use one for all the time :/
17:02 karolherbst: awesome :O
17:02 karolherbst: I get a trap at ZCULL
17:02 karolherbst: so I think it kind of works now, if we ignore those illegal reads
17:03 karolherbst: okay, so the GPC/RAST reads the memory region out
17:14 karolherbst: imirkin_: any idea why the rasterizer reads at the old address?
17:14 karolherbst: I assume this memory region got freed and can't be accessed anymore
17:15 imirkin_: mmmm... perhaps the code frees it too eagerly?
17:15 karolherbst: mhhh
17:15 karolherbst: zcull gets validated like 10 times already, but the hang can come later, can't it?
17:15 imirkin_: skeggsb: how would you feel about pushbuf_refn taking an *actual* ref on the bo?
17:15 imirkin_: skeggsb: do you think that'd slow things down a lot?
17:19 skeggsb: hrm, i'd have to take a proper look.. the original idea was that the client should be smart enough to deal with it itself, and avoid that overhead
17:19 imirkin_: skeggsb: yeah... it's just tricky
17:19 imirkin_: esp if the client tries to be clever and not push_kick all the time
17:19 imirkin_: you might use a buffer for rendering and then delete it
17:19 imirkin_: all before a kick happens
17:20 imirkin_: i "fixed" this by sticking deletes into a fence work function
17:20 imirkin_: but it's not quite perfect
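[Editor's aside: a minimal sketch of the idea described here, deferring a buffer delete until a fence signals so the GPU cannot still be using the memory. All names (`demo_fence`, `demo_fence_work`, `demo_fence_signal`, `count_release`) are hypothetical and this is not the actual nouveau code; it only illustrates queuing work on a fence.]

```c
#include <stdlib.h>

typedef void (*work_fn)(void *data);

struct work_item {
    work_fn fn;
    void *data;
    struct work_item *next;
};

struct demo_fence {
    int signalled;            /* non-zero once the GPU has passed the fence */
    struct work_item *work;   /* callbacks to run when that happens */
};

/* Queue fn(data) to run once the fence signals. If the fence has already
 * signalled, run it immediately. Returns 0 on success, -1 on allocation
 * failure. */
static int demo_fence_work(struct demo_fence *f, work_fn fn, void *data)
{
    struct work_item *w;

    if (f->signalled) {
        fn(data);
        return 0;
    }
    w = malloc(sizeof(*w));
    if (!w)
        return -1;
    w->fn = fn;
    w->data = data;
    w->next = f->work;
    f->work = w;
    return 0;
}

/* Mark the fence as signalled and run all queued work (most recently
 * queued first, since items are pushed at the head of the list). */
static void demo_fence_signal(struct demo_fence *f)
{
    f->signalled = 1;
    while (f->work) {
        struct work_item *w = f->work;

        f->work = w->next;
        w->fn(w->data);
        free(w);
    }
}

/* Stand-in for "delete the buffer": just counts how often it ran. */
static void count_release(void *data)
{
    ++*(int *)data;
}
```

[In this sketch a client would call `demo_fence_work(&fence, delete_callback, bo)` instead of freeing the buffer directly, so the free happens only after the work that referenced it has completed.]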
17:20 karolherbst: okay
17:20 karolherbst: not deleting the buffer does a lot more
17:20 karolherbst: bad performance
17:20 karolherbst: but I get a lot of TRAPS
17:20 karolherbst: like twenty per second
17:21 imirkin_: karolherbst: you can't just expect that doing stuff at random will Just Work - you have to reason about how these buffers are being used, when they're used, and make sure they're available to the card
17:21 imirkin_: karolherbst: it occurs to me that i forgot to stick the zcull buffer into the bufctx
17:21 karolherbst: yeah I know, I was just seeing how the card reacts to this
17:21 imirkin_: karolherbst: so it will actually end up getting evicted... oops
17:22 karolherbst: ahh okay
17:22 imirkin_: in the clear where i create them
17:22 karolherbst: also I think the zcull buffer needs only be updated when the size changes
17:23 imirkin_: actually forget that
17:23 imirkin_: in nvc0_validate_zcull, add
17:23 imirkin_: BCTX_REFN(nvc0->bufctx_3d, FB, res, WR); (where res is the zcull thing)
17:23 karolherbst: at the top of the PUSH_DATA things? or doesn't matter
17:23 imirkin_: before any of those, yeah
17:24 karolherbst: k, lets see
17:25 karolherbst: yeah, seems to work
17:25 imirkin_: [actually come to think of it, shouldn't really matter]
17:25 karolherbst: works though
17:25 karolherbst: well
17:25 imirkin_: [whether it's before or after, i mean]
17:25 karolherbst: ahh okay
17:26 karolherbst: now I also have no perf impact anymore
17:26 karolherbst: still lots of ZCULL traps
17:26 karolherbst: sooo
17:26 karolherbst: now I have to figure out what I do wrong
17:26 karolherbst: the trap says nothing as you might already have guessed
17:27 imirkin_: check the kernel where it gets printed, and then check gk20a code to see if more info might be gotten
17:28 imirkin_: e.g. i recently enhanced a few prints
17:28 imirkin_: https://github.com/skeggsb/nouveau/commit/708d46df4c83ade6cdd749acb9f537db4d470dfe
17:28 imirkin_: that was all info from gk20a
17:29 imirkin_: (i was motivated by the lack of MACRO trap info, which would have come in useful as i was developing the macros for indirect draws)
17:29 karolherbst: well it says "gr: TRAP ch 2 [00bf890000 UDKGame-Linux[8058]]" and " gr: GPC0/ZCULL: 80000001"
17:30 karolherbst: I think I have your commit, checking
17:30 imirkin_: ok, so the additional info is gr: GPC0/ZCULL: 80000001
17:30 imirkin_: i'm guessing the high bit is "there was a trap, you idiot"
17:30 imirkin_: while the low bit might mean something
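[Editor's aside: a small C sketch of splitting the reported status word 0x80000001 along the lines guessed here. The meaning of both fields is a guess from the conversation, not documented hardware behavior, and the names `ZCULL_TRAP_PENDING`, `zcull_trap_pending` and `zcull_trap_reason` are hypothetical.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Guess from the log: bit 31 flags that a trap occurred. */
#define ZCULL_TRAP_PENDING 0x80000000u

/* True when the (assumed) trap-pending bit is set. */
static bool zcull_trap_pending(uint32_t status)
{
    return (status & ZCULL_TRAP_PENDING) != 0;
}

/* Remaining low bits, which might encode the reason for the trap. */
static uint32_t zcull_trap_reason(uint32_t status)
{
    return status & ~ZCULL_TRAP_PENDING;
}
```

[Applied to the value from the kernel log, 0x80000001, this yields a pending trap with the low bits equal to 1.]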
17:30 karolherbst: yeah
17:31 karolherbst: nope, I already have your commit
17:32 imirkin_: yeah, mine didn't address zcull
17:32 karolherbst: right
17:32 imirkin_: let's see... it prints GPC register 900
17:34 imirkin_: nope, they don't have that one
17:34 imirkin_: sad.
17:34 karolherbst: :/
17:34 karolherbst: so I have to look at data which is obviously different than the stuff nvidia does
17:37 karolherbst: ohh yeah, I aligned one field wrongly
17:41 HpCompaqSo: all what, all the country because of irish lunatic?
17:42 karolherbst: fewer traps now :O
17:48 karolherbst: imirkin_: could the wrong memory type also cause such traps?
17:49 karolherbst: because I think I write the right stuff to the gpu now
17:50 HpCompaqSo: i'll let you work, you would not be able to keep me away if i wanted to enter the channels, samewise all those violations will get an end here too, the more they last the bigger sentence will be there, bye
17:55 karolherbst: okay, at least the trap only happens after pushing out the data
17:55 karolherbst: so, the data is somehow wrong
21:21 imirkin: skeggsb: any good idea on how to debug this? http://hastebin.com/sarobeboyo.hs
21:21 imirkin: basically looks like things got desync'd
21:22 imirkin: i think it's the beginning of a draw... the very first thing is SCREEN_SCISSOR_HORIZ
21:22 imirkin: and then there's a bunch of vertex array stuff
21:24 imirkin: errrrr what? looks like nvc0_vbo has explicit push checking turned on??