00:00skeggsb: pretty sure userspace can get at the board name too
00:00skeggsb: it just doesn't
00:02mangix: got it
00:02skeggsb: mangix: btw, 4.11 is end-of-life, so no patches will be backported... f26 is gaining 4.12 this week though apparently
00:03skeggsb: i did try to send them :P
00:03mangix: any objections to me sending patches to mute WFD warnings?
00:04mangix: wifi display
00:04skeggsb: you're talking about the unknown DCB types?
00:05mangix: eg: unknown connector type 70
00:05mangix: the spec says virtual connector for wifi display
00:05Hoolootwo: so I accidentally caused a forkbomb today, which caused an OOM kill of Xorg, which threw me a kernel panic, which seems problematic http://hooloovoo.blue/files/nouveau_panic
00:06skeggsb: Hoolootwo: i'm aware of that issue, still been looking for a reproducer, since i can't make it happen
00:07skeggsb: mangix: no objection, that code, in its current form, is supposed to go away at some point, but who knows when that'll happen
00:08Hoolootwo: skeggsb, well, I'll see if I can get it to happen again, be with you in a bit
00:09imirkin: i think there was another bug that was filed about that nouveau_bo_ttm_del thing
00:20Hoolootwo: welp, that's going to be an annoying one to debug
00:21Hoolootwo: gonna go grab the other laptop with this gpu because I have no intention of continuing to OOM my main laptop, especially with 20 days and counting of uptime
00:29Hoolootwo: actually, going on vacation tomorrow, if you need help debugging that issue give me a ping sometime after the weekend, I'd be happy to help
00:43mangix: i'm going to assume mesa_glthread works with nouveau
05:36David_Hedlund: Is it possible to utilize nouveau from software to figure out the GPU usage? I see that people are using nvidia-smi in conky but nvidia-smi is not available in my distro. I assume it's non-free: https://packages.debian.org/jessie/nvidia-smi
07:28pmoreau: David_Hedlund: nvidia-smi relies on data provided by the NVIDIA driver AFAIK, so it won't work with Nouveau.
07:30pmoreau: David_Hedlund: There was some discussion about outputting load information in debugfs (IIRC) for Nouveau, but I don't think the patches have been merged yet.
07:30pmoreau: karolherbst: ^
07:30karolherbst: pmoreau: yeah
07:31David_Hedlund: pmoreau: Thank you. Things will be taken care of I hear.
07:31karolherbst: top 7 patches: https://github.com/karolherbst/nouveau/commits/pmu_counters_v2
07:32pmoreau: karolherbst: Ugh, completely forgot to go back to reviewing your series; I needed to go through ~4 patches again but the other ones were fine.
07:32pmoreau: Will try to do that tonight.
07:33karolherbst: anyway, I am off to work, my laptop just accidentally woke up from suspend
07:55David_Hedlund: pmoreau: Can you please give me the link to the discussion (if you can find it)?
07:56pmoreau: See the link from Karol, it has the patches for it, if you want to apply it to a custom kernel.
07:58pmoreau: David_Hedlund: The latest sent version to the ML can be found here: https://lists.freedesktop.org/archives/nouveau/2017-June/028093.html
12:42karolherbst: mwk: uhh, zcull
12:43karolherbst: although zcull on tesla+ hardware would be a nice feature to implement
13:51mwk: karolherbst: I'll be happy if I ever figure out how that thing works on *any* hw
13:51mwk: right now it's more or less a complete mystery
13:51karolherbst: yeah, I figure
13:51karolherbst: but maybe I may find time for this in the future
13:52mwk: best I have is a few notes for Celsius
13:52karolherbst: I just can't really evaluate how much performance this will give in the end
13:52mwk: I'd guess lots of
13:52karolherbst: well, it's still nouveau we are talking about
13:52karolherbst: we may have other bottlenecks which are more important right now
13:54karolherbst: but yeah, I could imagine that it might give us a benefit in either case
13:54karolherbst: mwk: does zcull reject vertices to be rendered? I really don't know where it lives inside the pipeline
13:55mwk: it rejects... tiles
13:55mwk: seems to be an old version
13:56mwk: it's basically sandwiched between two halves of the rasterizer
13:56karolherbst: maybe it's just a faster form of rejecting fragments and it does so based on vertices?
13:56mwk: nv rasterizes in 4 steps
13:56mwk: 1) edge setup, 2) coarse raster, 3) window clip & zcull, 4) fine raster
13:57mwk: edge setup does lotsa math and calculates polygon edge equations
13:57mwk: coarse raster takes that and spits out a list of tiles that contain pixels that could possibly be covered by the polygon
13:57mwk: 32×32 or something like that
13:57mwk: probably depends on hw gen
13:58mwk: zcull takes each tile and outputs a mask of pixels to be trivially passed / trivially rejected / checked later
13:58mwk: maybe it will kill a whole tile, in which case fine raster doesn't see it
13:59karolherbst: I think it only rejects entire tiles, or at least this is what I remember from some docs about zcull
13:59mwk: and then there is fine raster, which figures which exact pixels are covered
14:00mwk: I'm not sure of anything when it comes to zcull
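A rough sketch of the ordering mwk describes above: coarse raster, then a per-tile depth cull, then fine raster. Everything in it is a hypothetical illustration and not how the hardware or nouveau works: the 8x8 tile size, the function names, the per-tile "max depth" dictionary and the less-than depth test are all assumptions. It also only kills whole tiles, whereas the discussion suggests the real unit may emit per-pixel masks.

```python
TILE = 8  # hypothetical tile size; the log only guesses "32x32 or something"

def edge(a, b, p):
    """Edge function: >= 0 when p lies on the inner side of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def coarse_raster(tri, width, height):
    """Yield every tile touched by the triangle's bounding box (conservative)."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x0 = max(int(min(xs)) // TILE, 0)
    x1 = min(int(max(xs)) // TILE, (width - 1) // TILE)
    y0 = max(int(min(ys)) // TILE, 0)
    y1 = min(int(max(ys)) // TILE, (height - 1) // TILE)
    for ty in range(y0, y1 + 1):
        for tx in range(x0, x1 + 1):
            yield (tx, ty)

def zcull(tiles, tile_max_depth, tri_min_depth):
    """Keep a tile only if the triangle could be nearer than what is stored.

    tile_max_depth maps tile -> farthest depth already drawn there;
    a missing entry means the tile is still cleared to the far plane (1.0).
    """
    for t in tiles:
        if tri_min_depth < tile_max_depth.get(t, 1.0):
            yield t

def fine_raster(tri, tiles):
    """Return the exact pixels covered, testing pixel centres per tile."""
    a, b, c = tri
    covered = []
    for tx, ty in tiles:
        for y in range(ty * TILE, (ty + 1) * TILE):
            for x in range(tx * TILE, (tx + 1) * TILE):
                p = (x + 0.5, y + 0.5)
                if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                    covered.append((x, y))
    return covered
```

The point of the ordering is that fine raster, the expensive per-pixel step, never sees tiles that the depth cull already rejected.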
14:00karolherbst: mwk: http://developer.download.nvidia.com/GPU_Programming_Guide/GPU_Programming_Guide_G80.pdf page 43
14:00mwk: but I'm pretty certain the Celsius format involves bitmasks of pixels in a tile
14:01karolherbst: and by page I mean those little numbers on the bottom not real PDF page
14:01karolherbst: "large blocks of pixels"
14:02mwk: "In addition, GeForce 8 series and later GPUs can also perform fine-grained Z
14:02mwk: and Stencil culling, which allow the GPU to skip the shading of occluded pixels. "
14:02mwk: there, masks
14:03karolherbst: yeah okay, but that's earlyZ, isn't it?
14:03karolherbst: "Similarly, fine-grained Z/Stencil culling (also known as EarlyZ)"
14:04mwk: call it whatever you want, it's done by the same hw based on the same information
14:04karolherbst: ohh, okay
14:04karolherbst: but that doesn't matter. It just matters if they use the same information provided by the driver or if it has to be set up separately
14:04mwk: AFAICT the same
14:06mwk: *maybe* earlyz needs less info to work
14:06mwk: but if zcull works, so does earlyz
14:06karolherbst: sounds nice
14:07mwk: at least I hope so :p
14:07karolherbst: we already know it won't though :p
14:08karolherbst: I need to look again on the counter work and first check if we already have all the required counters to even know if it works or not
14:08karolherbst: anyway, I created an issue against Gnurou's repository and sooda responded there, no idea what the status is currently
14:09mwk: for zcull, there are 4 special queries you can send via the 3d object
14:09mwk: which give you zcull stats
14:09mwk: you should be able to see the results there
14:09mwk: of course, we have no idea what these 4 queries return, exactly :p
14:09karolherbst: hopefully they return !0 on zcull doing stuff and 0 on zcull doing nothing
14:09karolherbst: and that a higher number means better results
14:10mupuf: karolherbst: if the rendering is still fine :p
14:10karolherbst: but maybe they got nasty and put more values inside a 32bit block
14:10karolherbst: mupuf: well, we only want zcull to work
14:10karolherbst: misrendering is kind of the expected result of figuring out how it works :p
14:10karolherbst: and if we get misrendering, that means we are doing _something_ right
14:11karolherbst: sooda: https://github.com/Gnurou/nouveau/issues/11
14:11sooda: ah that :|
14:12sooda: haven't worked on it in ages, but what i did was much simpler than the whole thing. and the details are nasty, even our internal docs aren't that great on the subject
14:12karolherbst: sooda: I have an idea for a deal: we do proper documentation here, and you give us something in exchange :p
14:13sooda: we do have proper docs, it's just that there are so many parameters and all
14:13karolherbst: ahh I see
14:14karolherbst: sooda: can you share any numbers on how much performance is won by having zcull enabled?
14:14karolherbst: even an "up to x% in real world games" would be good enough
14:15sooda: depends heavily on the scene, but iirc for some games it would be like 10-30% (for just the relevant ops where it's active, not for the whole game :P)
14:15sooda: the buf works in some sorts of tiles and the hw can then skip a whole bunch of work
14:16karolherbst: well yeah, we know that much already
14:16sooda: yea :P
14:16sooda: how to optimally configure it is unfortunately real secret sauce
14:18karolherbst: just a matter of time how long it stays secret...
14:18sooda: looking at the backlog here, it seems that you seriously know more than i do :P
14:21karolherbst: when I have time I should look into games with serious bottlenecks like SR3/4
14:23karolherbst: but I could imagine that with zcull we are able to pretty much close the gaps for a lot of games between nouveau-nvidia
14:24sooda: yup zcull and memory compression are some of the most significant ones perfwise
14:25RSpliet: sooda: presumably memory compression makes more difference on Tegra than on GTX1080?
14:26sooda: idk, on desktop it still helps in mem bandwidth
14:36mupuf: RSpliet: if you are not memory-bandwidth limited on a graphics workload, you are doing it wrong :p
14:40karolherbst: mupuf: you forget about modern games
14:40RSpliet: mupuf: it depends on how (de)compression logic is balanced with compute power and BW though... if you end up bottlenecking on the compression compute...
14:40karolherbst: they tend to be not that much memory limited anymore
14:40karolherbst: or rather shader limited
14:41mupuf: karolherbst: then why don't they pre-compute stuff and put it in memory?
14:42karolherbst: no idea
16:01karolherbst: mupuf: but if you look at how fast raw processing power and memory bandwidth grow, you see that processing power grows faster
16:02mupuf: karolherbst: please rephrase
16:02karolherbst: mupuf: GPUs processing power grows faster than memory bandwidth
16:02mupuf: yes, and that supports my point ;)
16:03mupuf: memory bandwidth is hard to increase
16:03karolherbst: well maybe, but maybe games also want to compute more things instead of using memory for this reason
16:03mupuf: and hence why memory compression / fast clear / etc... is so important
16:03mupuf: oh sure
16:04mupuf: but compute is easy, it is data that is hard to deal with
16:04karolherbst: anyway, I just noticed that on modern games, memory clocks aren't as important as shader clocks, at least on nouveau, but maybe that's just because nouveau sucks
16:04mupuf: and synchronisation between the different cores
16:04mupuf: otherwise, raw ALU performance is meaningless
16:04mupuf: quite possibly :D
16:06karolherbst: it would be nice to have some form of figuring out what the current bottleneck is for an application under nouveau
16:06mupuf: well, that's power allocation you are talking about ;)
16:06karolherbst: but I am sure it's "not that easy"
16:06mupuf: for this, we would need to be power-limited
16:06mupuf: which is not the case, AFAIK :D
16:07karolherbst: I doubt we are power capped anywhere
16:07karolherbst: except furmark
16:07mupuf: exactly, so no point in doing power allocation until we get there ;)
16:07karolherbst: furmark without offloading throws my GPU above the power budget
16:11RSpliet: GPU performance also grows faster than sequential CPU performance. Until games start really really exploiting multi-core execution the bottleneck will end up not being the GPU at all ;-)
16:12mupuf: RSpliet: true that :)
16:49imirkin_: anyone have opinions on what i should do with my bindless patches? the current situation is that they pass tests but don't work on real games. i obviously don't have the time to figure out what's wrong
16:50imirkin_: i was thinking i could push them out but leave the feature not-advertised
16:56mupuf: imirkin_: that would prevent the pain of rebasing... but not the bitrotting part
16:56mupuf: if they at least pass the tests, that's a good way to prevent bitrotting
16:56mupuf: or at least, to make it bisectable
16:57mupuf: not that my view matters, but I would be fine with merging it with the feature not advertised
17:06RSpliet: imirkin_: stick a TODO comment in nvc0_screen (including games/test-cases that fail) so that even those not familiar with our Trello board can try and pick it up?
17:07RSpliet: Or hmm... maybe trello is the better place for that actually :-D
17:14imirkin_: skeggsb: let me know what you think, as i think it'll affect you the most
19:00karolherbst: imirkin: one reason I got more instructions: abs modifier...
19:01imirkin_: ah right - you have to set up the POW instruction properly :)
19:03karolherbst: I guess I can move in abs on the first source
19:03karolherbst: in ModifierFolding
19:13jvesely: hi nouveau, is it possible to run CUDA on top of nouveau kernel module? I found some old email thread of nvidia trying to do that for tegra, but I couldn't find anything more recent (or on x86)
19:14imirkin_: not really
19:14imirkin_: you can look at gdev
19:14imirkin_: (last updated 3 years ago)
19:15karolherbst: imirkin_: wasn't there something based on gallium recently? I really think there was something, but maybe I just remember something that never happened
19:15imirkin_: i do think that google (?) made a cuda llvm frontend
19:15karolherbst: I am like super sure somebody worked on CUDA based on gallium
19:15karolherbst: I see
19:17karolherbst: looking better now: "total instructions in shared programs : 4680880 -> 4682325 (0.03%)"
19:17imirkin_: is this with ConstantFolding handling imm^x ?
19:18karolherbst: new thing: 98: ld u32 %r239 c0[0xc]
19:18karolherbst: as second src
19:18karolherbst: can be loaded into the mul
19:18imirkin_: oh... bleh.
19:18imirkin_: wtvr, i don't care - that's fine.
19:19karolherbst: why is that BLEH?
19:19imirkin_: well, it works out with other the POW lowering
19:19imirkin_: but mul c0[x], c0[x] isn't a thing
19:19imirkin_: but ... i really don't think it's an issue.
19:19karolherbst: yeah, but the first source will be a register
19:19karolherbst: cause it's the result of the lg2
19:20karolherbst: and if one source is an immediate, we are fine anyway
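For context, the lowering under discussion rewrites pow(a, b) as ex2(b * lg2(a)), which is why one source of the mul is always the lg2 result held in a register. A minimal Python sketch of that identity, plus the kind of constant-exponent special cases a folding pass could catch. The helper names are made up for illustration and this is not the nv50_ir code.

```python
import math

def lowered_pow(a, b):
    """pow(a, b) expressed as ex2(mul(lg2(a), b)); only valid for a > 0."""
    return 2.0 ** (math.log2(a) * b)

def fold_pow(a, b):
    """Hypothetical folding of small constant exponents, else the generic lowering."""
    if b == 0.0:
        return 1.0      # pow(a, 0) -> 1.0
    if b == 1.0:
        return a        # pow(a, 1) -> a
    if b == 2.0:
        return a * a    # pow(a, 2) -> a single mul
    return lowered_pow(a, b)

print(lowered_pow(3.0, 2.5), 3.0 ** 2.5)  # the two values agree up to rounding
```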
19:21karolherbst: neg on source 2 can be moved into the mul as well
19:21imirkin_: and sat on the result, although you have to be a little careful with that.
19:22karolherbst: ohh, wait
19:22karolherbst: I think I was smart enough to move that stuff into source 1 of mul... or does it cause troubles on nv50... let me check
19:23karolherbst: no, looks fine
19:23karolherbst: ohh wait
19:23karolherbst: c is only on source 2 for nv50
19:24karolherbst: okay, moving it to source 2 then
19:24karolherbst: same for nvc0
19:28karolherbst: immediate can also be moved into the mul on source 2
19:28karolherbst: and I also have an example with sat on the ex2
19:30jvesely: thanks. I have a custom way to generate nvptx/nvptx64 so I really just need a way to finalize/dispatch it (+ allocate input buffer).
19:30karolherbst: imirkin_: \o/ breakthrough: "total instructions in shared programs : 4680880 -> 4680875 (-0.00%)"
19:31karolherbst: and there is still more to do
19:34karolherbst: imirkin_: okay, and what is the problem with sat? can't I just move the sat on the pow and then sat the ex2? Or do I have to be like super careful when doing the ConstantFolding things then?
19:38karolherbst: okay nice, got a shader where the generated result is now exactly the same
19:39imirkin_: karolherbst: you just have to be a little careful since the sat has to be applied to the last mul in the sequence (or the ex2 or whatever)
19:40karolherbst: yeah okay, but this should be quite easy
19:40imirkin_: jvesely: yeah ... so the issue is that ptx is yet-another high-level language
19:40imirkin_: which still needs to be compiled to the actual target ISA
19:41imirkin_: jvesely: and obviously all the buffer management hookup needs to be handled by the surrounding api
19:41karolherbst: imirkin_: ConstantFolding keeps the pow the last instruction, just with converted op and in the legalize pow function I just set it on the ex2, so it should be fine
19:41imirkin_: karolherbst: ok cool
19:41karolherbst: "total instructions in shared programs : 4680880 -> 4679234 (-0.04%)"
19:41karolherbst: still some shaders hurt
19:47jvesely: imirkin_, thanks. I guess I'll just bite the bullet and go with the prop driver. I need to do the stuff in python anyway, so pycuda looks like the easiest way to go.
19:52karolherbst: for pow(a, 0)
19:52karolherbst: 51: mov u32 %r479 0x00000001 (0)
19:52karolherbst: 56: mul ftz f32 %r416 %r415 %r479 (0)
19:53karolherbst: that mov is what I generate in constantFolding
19:53karolherbst: but I don't get why it isn't folded into that mul later
19:54karolherbst: or do I have to do some opt for mul(a, pow(b, 0))?
19:55karolherbst: this would be stupid
19:55pmoreau: imirkin, karolherbst: There is guda, a CUDA state-tracker for Mesa, still a WIP
19:55karolherbst: pmoreau: ahh yeah, that was it
19:55karolherbst: I couldn't find it though
19:55karolherbst: so I started to think I simply imagined it
19:56karolherbst: maybe I'll take a look into that some time later
19:57karolherbst: I mean writing that ptx -> nv50ir
19:57karolherbst: or maybe we should have a generic ptx -> tgsi first and let the driver do optimized paths?
19:58karolherbst: which ends up in us writing ptx->nv50ir anyway
19:58pmoreau: jvesely: The day Nouveau gets close to having OpenCL 1.0 support, I'll start looking into adding some CUDA support. Especially, as a starter, to use the code generated by ptxas and stored in the application, as you don't need to parse PTX.
19:58pmoreau: Or PTX -> SPIR-V -> NIR/NVIR
19:58karolherbst: yeah well
19:58pmoreau: Or go back to LLVM IR instead of SPIR-V
19:58karolherbst: we could
19:59karolherbst: please, for fun I wouldn't do the llvm thing :p
20:00karolherbst: but now I am more interested in solving this mov 0 mul mystery
20:02jvesely: I'd preferably use SPIR-V so that I can target all cl_khr_spir capable opencl implementations, but few ocl implementations do that and SPIR backend is not in upstream llvm yet
20:04pmoreau: Yeah, not having an upstream SPIR-V backend is hurting. Hopefully there is some WIP going on another repo than the Khronos one, based on master rather than 3.6, and trying to implement it the way LLVM people see it.
20:10karolherbst: imirkin_: do you have any idea why those aren't optimized?
20:12imirkin_: karolherbst: you're somehow not setting progress? dunno.
20:12imirkin_: karolherbst: oh, lol
20:13imirkin_: karolherbst: you're loading an int 1, but then doing a float mul
20:15karolherbst: imirkin_: can this break something: "mov f32 %r396 %r479 (0)"
20:15imirkin_: i don't think you get it
20:15imirkin_: float 1 != int 1
20:15imirkin_: 51: mov u32 %r479 0x00000001 (0)
20:15imirkin_: that's an int 1
20:16imirkin_: a float 1 would be 0x3f800000
20:17karolherbst: mhhhhhh okay
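The int-1 versus float-1 point can be checked by reinterpreting bit patterns. A small Python sketch using only the standard `struct` module; the two helper names are illustrative.

```python
import struct

def f32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(u):
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", u))[0]

print(hex(f32_bits(1.0)))    # 0x3f800000, the float 1 mentioned above
print(bits_f32(0x00000001))  # a tiny denormal, nowhere near 1.0
```

So an immediate of 0x00000001 fed into an f32 mul multiplies by almost zero rather than by one, and a folding pass that looks for a float 1.0 operand will not recognise it.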
20:22Terkal: Hi! I'm looking for somebody to help fix an HDMI-not-connected issue
20:27nyef`: ... HPD problem or something else?
20:28nyef`:gets DPort HPD issues, not usually HDMI HPD issues.
20:33Terkal: Hum, xrandr is showing me hdmi-1 disconnected, but gnome display manager can't manage hdmi screen
20:34karolherbst: who would have known that 1.0 defaults to a double value
20:34Terkal: i'm currently using nouveau driver
20:35Terkal: xrandr --listproviders : Provider 0: id: 0x44 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 3 outputs: 2 associated providers: 0 name:modesetting
20:42karolherbst: I am jumping from one bad thing into a worse one
20:42karolherbst: 118: mov u32 $r17 0x3f1645a2 (8)
20:42karolherbst: 119: mad ftz f32 $r1 $r17 $r1 $r4 (8)
20:43karolherbst: why doesn't the source get swapped, seriously
20:45karolherbst: oh yeah, because the regs are silly and stupid
20:45karolherbst: stupid RA
21:10karolherbst: imirkin_: okay, I think only CSE opportunities are left now
21:19imirkin_: karolherbst: ok
21:26imirkin_: karolherbst: and my guess is that the improvement is stronger with your improved POW legalization
21:48karolherbst: imirkin_: well, the change is inside the paste
21:48imirkin_: karolherbst: oh, that was with the additional pow -> mul logic?
21:48imirkin_: ah =/
21:48karolherbst: pixmark_piano gets faster though
21:49imirkin_: same amount faster presumably?
21:49karolherbst: more like 1%
21:49imirkin_: ah neat
21:49imirkin_: wonder if it's just dumb luck, or it uncovers another opt
21:49karolherbst: one pow inside a loop gets optimized
22:41Lyude: Has anyone been seeing this recently on nouveau? Jul 26 18:39:43 LyudeCowCube kernel: nouveau 0000:01:00.0: fifo: read fault at 0008200000 engine 1b [CE2] client 18 [GR_CE] reason 0c [UNSUPPORTED_KIND] on channel 2 [007fa1b000 systemd-logind]
22:41Lyude: specifically with kepler
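When collecting reports like this one, it can help to pull the fields out of the fault line mechanically. A minimal Python sketch; the regular expression is inferred from the single log line quoted above, so the field layout and the group names are assumptions and other kernel versions may print something different.

```python
import re

# Pattern inferred from one sample line only; treat it as a starting point.
FAULT_RE = re.compile(
    r"fifo: (?P<access>\w+) fault at (?P<addr>[0-9a-f]+) "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]*)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]*)\] "
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]*)\] "
    r"on channel (?P<channel>-?\d+) \[(?P<inst>[0-9a-f]+) (?P<process>[^\]]*)\]"
)

def parse_fault(line):
    """Return a dict of fault fields, or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["addr"] = int(info["addr"], 16)  # faulting address as an integer
    return info

sample = (
    "nouveau 0000:01:00.0: fifo: read fault at 0008200000 engine 1b [CE2] "
    "client 18 [GR_CE] reason 0c [UNSUPPORTED_KIND] on channel 2 "
    "[007fa1b000 systemd-logind]"
)
fault = parse_fault(sample)
print(fault["engine"], fault["reason"], fault["process"])
```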
22:43imirkin_: not recently, but definitely have seen that before
22:44skeggsb: Lyude: potentially 38bcb208f60924a031b9f809f7cd252ea4a94e5f (drm-fixes currently)
22:44skeggsb: most of the reports had similar things
22:45skeggsb: though, a lot more of them than just the one :P
22:45Lyude: yeah, sometimes it's followed by a very long string of errors where the GPU eventually chokes and dies
22:46skeggsb: f26 + wayland, works ok on X though?
22:46Lyude: skeggsb: tbqh it never even goes past gdm
22:46Lyude: but I know gdm is using wayland
22:47skeggsb: yeah, that's possible too. on my particular board, it didn't hit enough usage on gdm (but, close) to cause it
22:47skeggsb: other GPUs it might, depending on grctx size differences etc
22:48Lyude: damnit it does work. so this IS the same bug that you mentioned yesterday
22:48Lyude: i guess i completely forgot to actually try X when I was getting annoyed trying to figure out why cogl was ignoring ldconfig
22:49Lyude: heh, now I've got time to work on xwayland evilstream support at least
22:59imirkin_: was that the BAR thing?
22:59Lyude: imirkin_: not 100% sure but it probably is, yes
22:59Lyude: about to find out once this kernel finishes building
23:00imirkin_: ah yes
23:00Lyude: ended up wasting a day on it by accident :(, used the wrong words when asking about it before I think
23:01imirkin_: well, you said you were hitting it on kepler
23:01imirkin_: i think skeggsb was seeing it on GM20x
23:01imirkin_: skeggsb: btw - did you see my question about what to do with bindless above?
23:01Lyude: yeah i'm definitely hitting it on kepler
23:01Lyude: that's marked as gf100 though
23:01Lyude: the bar fix I mean
23:01imirkin_: right, it is
23:01imirkin_: but i dunno if in practice it affects earlier stuff
23:02Lyude: heh, either way at least I learned a lot about ldconfig
23:02imirkin_: that it's subtle and quick to anger?
23:02Lyude: imirkin_: -was- subtle, there's a ton of env variables I've used for a while to figure out what it's doing but i found some more niceties
23:03imirkin_: ldconfig -r is nice iirc
23:03Lyude: apparently you can use ldconfig to temporarily add a directory to the top of the ld.so.cache
23:03Lyude: yeah, that
23:03imirkin_: and ldconfig -p? something like that to print wtf it's doing
23:04Lyude: that was how I noticed gnome-shell was just entirely ignoring ldconfig
23:04Lyude: honestly my least favorite part about opengl is how needlessly complex loading it is
23:05Lyude: cool, confirmed it's definitely the same bug
23:06imirkin_: nice, i'd probably have spent the remainder of my natural life tracking that one down