03:34 HdkR: Time to burn nyancat with uncapped framerate in alacritty for a few hours. Wonder if it'll hit the IceLake hang problem
03:36 HdkR: Concerned that the "hang" is aggressive low clock speeds on the iGPU (10Mhz!) while pushing a 4k alacritty display and is right at the timeout threshold
04:43 jekstrand: HdkR: And it's still taking the system down so you can't get a dump?
04:44 HdkR: Yea, never creates that dump file
04:45 jekstrand: :-(
04:45 jekstrand: Can you SSH into it or is it dead dead?
04:45 HdkR: I can ssh in still
04:45 HdkR: Now that I actually have ssh-server installed
04:46 jekstrand: /sys/class/drm/card0/error should exist
04:46 jekstrand: It may not have much in it though
04:47 HdkR: I ssh'd in and checked last time it hung, there wasn't anything in it
04:47 imirkin: don't let file size 0 fool you
04:48 HdkR: I mean, I opened it in vim, didn't ls it
04:48 imirkin: oh
04:49 jekstrand: The other error report is on KBL
04:49 imirkin: dunno if it has enough stuff implemented to be opened in vi
04:49 imirkin: i'd definitely just copy it
04:49 imirkin: to a real filesystem
04:49 HdkR: Sure, I'll do that next time I get a hang
04:50 jekstrand: Yeah, use cat
04:50 jekstrand: I usually do "sudo cat /sys/class/drm/card0/error > alacritty.err"
04:51 jekstrand: cp also works
04:51 HdkR: Sounds good
04:51 imirkin: cp is clever to read + write when it sees the files are on different filesystems
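The advice above (copy the error state to a real filesystem rather than opening it in an editor, and don't trust the reported size) can be sketched in Python. This is a minimal illustration, not a tool that anyone in the channel uses: the function name `dump_error_state` and the default output name are assumptions, and reading the sysfs file normally requires root.

```python
def dump_error_state(src="/sys/class/drm/card0/error", dst="alacritty.err"):
    """Copy src to dst by reading until EOF and return the byte count.

    The size reported by stat() is deliberately ignored, because
    virtual files may report 0 even when they have content.
    """
    total = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(65536)  # read() returns b"" only at EOF
            if not chunk:
                break
            fout.write(chunk)
            total += len(chunk)
    return total
```

A return value of 0 would then mean the file was really empty when it was read, rather than merely reporting a zero size.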
04:52 jekstrand: The hang report in the gitlab.fd.o issue looks like it's dying on a stencil blit.
04:54 HdkR: Looks like Alacritty removed their accidental stencil and depth usage in 0.4.2, and I'm on 0.4.1
04:55 HdkR: Oh, then stencil reenabled due to a bug with radeon
04:56 jekstrand: Why is a terminal emulator using stencil????
04:56 jekstrand: Also, why are they blitting stencil?
04:56 HdkR: Looks like it was accidental and then they relied on it somewhere
04:57 jekstrand: This app is starting to give me reeeeaallll good feelings....
04:57 imirkin: heh, blitting stencil is the worst. esp when trying to maintain depth
04:57 HdkR: Whatever utility library they are using seemed to default allocate it
04:57 jekstrand: Don't get me wrong. A GL driver should never hang unless you do something truly crazy like infinite-loop in a shader.
04:57 jekstrand: So there is a bug
04:57 jekstrand: But, also, why is anyone blitting stencil?
04:59 jekstrand: HdkR: If we get a second dump from you and it's also hanging on a stencil blit, I'll think something's connected. Right now, though, the fact that one hang on KBL happened to be on an otherwise perfectly normal blit doesn't really tell me much.
04:59 jekstrand: Stencil blitting should work
05:00 imirkin: is the weird tiling stuff still there in later gens?
05:00 imirkin: or did that die out after gen7?
05:00 HdkR: I'll try and nab the dump once it happens again. Good thing is once it starts happening I can restart X and it happens every few minutes
05:00 jekstrand: imirkin: Didn't die until Gen11. :'(
05:01 jekstrand: Or maybe 12? Yeah, it didn't die until 12
05:01 HdkR: Theoretically I could compile alacritty git and change the single line to remove stencil as well
05:02 mareko: what's the relationship between wrmask and component in store_output? and is component and wrmask in units of the type or 4 bytes?
05:03 jekstrand: mareko: wrmask is in units of components
05:03 jekstrand: mareko: It does not compact components
05:03 jekstrand: So if you have a wrmask of 0x9, it still takes 4 components and drops the middle two.
05:04 jekstrand: Which is different from GLSL IR
05:04 mareko: jekstrand: does comp=1 wrmask=y write to y or z?
05:04 jekstrand: .z
05:05 jekstrand: The mask in NIR is literally just a mask.
05:05 jekstrand: It doesn't change the semantics of components beyind "these don't actually get written"
05:05 jekstrand: *beyond
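The write-mask semantics described above can be modelled with a small Python sketch. It is a toy model of the behaviour as stated in the conversation, not NIR code: the function `store_output` here is a hypothetical stand-in that treats an output slot as a list of four components.

```python
def store_output(slot, value, component, wrmask):
    """Toy model: write `value` into a 4-component `slot`.

    Bit i of `wrmask` decides whether value[i] is written, and it
    lands at slot[component + i]. Components are never compacted:
    masked-off entries of `value` are skipped, not shifted down.
    """
    out = list(slot)
    for i, v in enumerate(value):
        if wrmask & (1 << i):
            out[component + i] = v
    return out


# comp=1 with wrmask=y (0b0010): value[1] lands in slot[2], i.e. .z
print(store_output([0, 0, 0, 0], [7, 8], component=1, wrmask=0b0010))

# wrmask=0x9 still takes 4 components and drops the middle two
print(store_output([0, 0, 0, 0], [1, 2, 3, 4], component=0, wrmask=0x9))
```

In this model the mask only suppresses writes; it never changes which source component maps to which destination component.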
05:10 jekstrand: labels HdkR's issue iris and assigns Kayden
05:11 HdkR: Is the iris gallium driver default now?
05:11 mareko: HdkR: yes
05:11 HdkR: ah fancy, gallium hud does indeed work
05:12 jekstrand: I think Kayden had to hook up TGSI just for that. :)
05:12 jekstrand: Oh, and nine might still use it
05:12 mareko: tgsi_to_nir is pretty easy to use
05:13 imirkin: and surprisingly complete nowadays
05:13 mareko: (not really)
05:13 jekstrand: yeah, Kayden was trying hard to avoid it and get a 100% NIR driver but there are still those corners...
05:13 imirkin: i think you covered like 90+% of instructions
05:13 imirkin: used to be like ... 10%
05:13 jekstrand: And hooking up a pass someone else wrote is way easier than re-writing the HUD to have two paths: TGSI and NIR
05:15 mareko: I still write internal shaders in TGSI in radeonsi, it's more convenient
05:15 mareko: even though the driver converts it to NIR
05:23 mareko: so TGSI is effectively a NIR frontend
05:45 Kayden: if you're getting hangs on Icelake, check your kernel version
05:45 Kayden: 5.6.8+ seems fine
05:45 HdkR: I'm on 5.6.11 atm
05:50 Kayden: probably ok
05:51 HdkR: I could upgrade to 5.6.12 that was released today if you think it'll do anything
07:37 kode54: also beware building it with GCC 10 if you have that and are building your own kernel
07:43 MrCooper: what happens if you do?
07:44 MrCooper: daniels: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4965 needs to be backported to the 20.1 branch as well, right?
07:46 daniels: MrCooper: ah, yes
07:46 daniels: though I guess it's a race; expecting we can get into the office tomorrow to fix it, so depends on how quickly it can be pulled back to 20.1
07:49 tango_: don't pull anything else in 20.1 so no new jobs start?
07:49 daniels: heh
07:49 tango_: or do jobs start periodically rather than on new commits?
07:49 tango_: (that suggestion was a joke btw)
07:51 pepp: since Marge pipeline runs on mesa project, couldn't it use variables to dynamically disable/enable some CI hardware?
07:55 pepp: nvm, only post-merge pipelines are run in the mesa/mesa project
10:01 danvet: pq, see my reply on the hotunplug thread
10:02 danvet: I think we've already thought about all your concerns, amd has done a bit of wheel reinventing first, ask questions later
10:02 danvet: pq, maybe double-check I got them all
10:34 pq: danvet, maybe I wrote too much, the first paragraph would have been enough from me.
10:43 pq: danvet, I guess we disagree a little on "nothing can fail". SIGBUS totally agreed, GL/Vulkan already have what they have, but KMS ioctls I think the kernel should start failing rather than papering over.
10:43 pq: it's not a regression IMO if KMS ioctls start failing instead of a machine crash
10:45 pq: ...unless it's too late and stuff is already papered over
10:59 danvet: pq, maybe compositors are more resilient now, but I remember good fireworks if page_flip suddenly failed for no reason
10:59 danvet: maybe we need an "I can take -EIO" client cap for that
11:00 danvet: and then specify very clearly where -EIO is allowed, and where not
11:02 MrCooper: surely the fact that the GPU is gone counts as a reason :)
11:04 danvet: MrCooper, from a practical pov it doesn't matter much whether your kernel, mesa or app dies in fire, end result is the same
11:05 danvet: so this is kinda complicated, and if one layer just yolos the entire thing we're not really improving
11:06 danvet: from a security pov ofc kernel should oops no matter what, but for desktop users all is lost no matter which part dies
11:06 ickle: +not
11:07 danvet: ah yes, oopsing not considered a feature
11:10 pq: danvet, what is the regression case you are thinking if unplugging a device leads to ioctls returning failures? What was the old behaviour you need to keep?
11:12 pq: was there already a way for userspace to limp along when a DRM device disappeared? Would it not just hang the machine or crash the kernel?
11:14 pq: if you're going to need a kernel mode, where all DRM ioctls continue faking success anyway, then I'm not sure we need another mode or operation - userspace either handles the removal uevent or doesn't.
11:15 danvet: pq, this is more limping along after permanent gpu death
11:15 danvet: i.e. execbuf goes -EIO
11:15 pq: oh the render side
11:15 danvet: and we had various other kms ioctl that also returned -EIO because waiting for rendering did that
11:15 danvet: until we patched the kernel to not do that
11:16 pq: don't mind me for the render side at all, I'm not familiar with that. I've only been thinking about KMS ioctls.
11:16 danvet: once the gpu was fully reset the -EIO stopped, but at that point your compositor is dead
11:16 danvet: but yeah for rendering we need apps to eventually handle the unplug
11:16 danvet: since the rendering is clearly not going anywhere anymore
11:17 danvet: I'd expect apps to be more or less hopeless
11:17 danvet: and compositors can maybe do arb robustness
11:17 danvet: or not even that, just unplug the gpu and switch to a different one
11:17 danvet: while meanwhile no gl call should go boom
11:22 pq: yeah, after a GPU crash a reset may make it respond again, but an unplug is an actual removal and it won't come back as the same DRM device instance. I didn't imagine to handle both the same way.
11:34 danvet: pq, permanently dead gpu is fairly similar to unplugged gpu
11:37 karolherbst: broken runpm can lead to a permanently dead gpu btw.. like the bug we had in Nouveau for a while
11:37 karolherbst: or any other "fallen off the bus" situation as well
11:40 pq: danvet, so, you need to make KMS ioctl fake-succeed until the driver knows for sure if the GPU is permanently dead or if it comes back?
11:42 karolherbst: danvet: btw, robustness makes all GL call return a special error once the context is lost though, so you don't need to do special calls actually
11:44 karolherbst: anyway, clients not doing robustness should just crash.. there isn't any way around that I guess
11:44 karolherbst: if they'd care, they'd use robustness
11:46 karolherbst: pq: in regards to that: the driver knows
11:47 karolherbst: although I guess there is the case where recovery _could_ fail, but then the device is still there
11:47 karolherbst: and probably a driver bug which needs to be fixed
11:47 pq: karolherbst, does it know the instant when it breaks, or does the driver need to e.g. wait for a reset to time out?
11:47 karolherbst: but the device being gone is something a driver knows very soon
11:47 karolherbst: pq: the kernel removes the device
11:47 karolherbst: aka unbind
11:47 pq: eh?
11:47 karolherbst: yeah
11:47 karolherbst: you go through the unloading path
11:47 danvet: well it's still all concurrent, so "the instant" is ill-defined no matter what
11:48 karolherbst: and for PCI devices the channel is marked dead
11:48 danvet: drm_dev_unplug() uses srcu underneath, so even that instant is very ill-defined
11:48 karolherbst: yeah. I wouldn't use drm helpers for that
11:48 danvet: it's very relativistic, depending upon which cpu core you're looking at
11:49 danvet: karolherbst, that = ?
11:49 karolherbst: pq, danvet: that's what I'd use for nouveau: https://github.com/karolherbst/nouveau/commit/0cf5e53aa8677a70dd8ac94a444f60e69cf41cab
11:49 danvet: oh for pci we actually know whether it's a hotunplug or hotremove
11:49 karolherbst: what if it's neither
11:49 danvet: or is that not set on hotunplug
11:50 karolherbst: GPU just falls off the bus for.. stupid reasons
11:50 karolherbst: is that hotunplug?
11:50 danvet: karolherbst, there's an issue that for unload developers expect a proper shutdown of all hw
11:50 karolherbst: I know
11:50 danvet: vs for hotunplug or "fallen off the bus" you shouldn't try that really
11:51 karolherbst: well
11:51 karolherbst: you have to as the kernel will call it
11:51 danvet: not with drm_dev_unplug and drm_dev_enter/exit
11:51 karolherbst: it's not up to the driver
11:51 karolherbst: the pcihp stuff goes through the tree and removes all devices from their drivers
11:52 danvet: and it doesn't call the ->remove function?
11:52 karolherbst: it should
11:52 karolherbst: maybe I misunderstood what you mean
11:52 danvet: then I'm not sure what you mean
11:52 karolherbst: ohh, I see what you meant now
11:52 karolherbst: yeah.. you need to fix your remove path in the driver I guess :p
11:53 karolherbst: and check if pci_channel_offline or something if your device is gone for good or not
11:53 karolherbst: or other functions
11:53 danvet: just compare the remove functions for some simple kms drivers (less code, easier to spot)
11:54 karolherbst: the ugly thing really is that all your mmio access are returning -1, which.. well.. will cause all kinds of funny bugs
11:54 karolherbst: and workers which are still in flight
11:54 danvet: yeah there's a transition state that goes wild
11:54 pq: karolherbst, I'm a userspace developer. When you say "unbind" or "removes the device", I understand that a) all open DRM fds become dead, b) remove uevent is sent, c) the device node disappears from the fs. Is that what you mean too?
11:55 karolherbst: anyway, when I was looking at those patches I was like "wait.. isn't like most of the work still missing or are most mmio accesses already guarded in radeonsi :p"
11:55 karolherbst: pq: equal as doing echo 1 >> unbind on the device
11:55 karolherbst: except no refcount check
11:55 pq: I don't think I've ever done that, what does it do?
11:55 danvet: karolherbst, do you mean the module refcount?
11:56 danvet: pq, boom
11:56 danvet: on most rendering drivers at least
11:56 karolherbst: danvet: yeah.. not sure if there is a per device refcount actually
11:56 danvet: more seriously, it unbinds the driver from the device
11:56 danvet: karolherbst, there is
11:56 karolherbst: ahh, okay
11:56 danvet: drm_dev_get/put
11:56 danvet: it's just, that doesn't prevent unbind
11:56 karolherbst: so sysfs will probably fail if you do that for a device still in use
11:56 danvet: because you can't prevent unbind
11:57 danvet: the only thing the module refcount is for is "is there still a ops vtable pointing at this code somewhere"
11:57 danvet: which is entirely not related to hotunplug
11:57 karolherbst: yeah.. the module refcount is for unloading the driver, sure
11:57 danvet: except many people think that "I prevented module unload, I'm safe"
11:57 karolherbst: that's why I was wondering about a per device refcount
11:57 karolherbst: but.. also no idea how to do that in core code
11:58 danvet: karolherbst, struct device has one too
11:58 danvet: but again, doesn't prevent unbind
11:58 danvet: because can't
11:58 karolherbst: okay, I hope sysfs checks that though
11:58 danvet: it's just normal refcounting
11:58 danvet: nope
11:58 karolherbst: ehhh
11:58 pq: karolherbst, so, if you are going to "remove the device" from userspace perspective, you can be sure that no existing userspace will survive that. So why do that if you know your GPU reset will work?
11:58 danvet: that refcount is only for normal lifetime management of kernel structures
11:58 karolherbst: okay
11:58 danvet: the device underneath can go whenever it wants to
11:58 karolherbst: pq: well.. those with robustness should
11:58 danvet: also, if you'd check you could never unbind
11:59 danvet: because drm_device holds a full ref on the underlying struct device :-)
11:59 karolherbst: pq: also, eGPU unplug is about sudden removals where users are just unplugging the device :p
11:59 karolherbst: and we need a way of handling that
11:59 karolherbst: either by migrating applications to the still existing gpus
11:59 karolherbst: or crashing them
11:59 danvet: yeah gpu reset shouldn't result in device unplug/unbind/removal
12:00 danvet: that would break the world
12:00 karolherbst: anyway arb_robustness is useful as at some point all GL calls just fail
12:00 danvet: so not exactly sure what nouveau does for gpu reset ...
12:00 karolherbst: it tries its best
12:00 karolherbst: actually
12:00 karolherbst: GPU resets are quite reliable in nouveau
12:01 danvet: karolherbst, I mean somewhere earlier you said you'd yank the device away
12:01 danvet: or did I get confused about that
12:01 pq: karolherbst, I know the eGPU case, but we are talking about GPU reset now, are we not?
12:01 karolherbst: danvet: I meant the kernel will if you hotunplug :p
12:01 danvet: karolherbst, ah so that was _not_ about gpu reset?
12:01 karolherbst: and I'd use pci_channel_offline to detect inside ->remove if the device is there or not
12:01 karolherbst: no
12:02 karolherbst: pq: yeah, maybe I missed that transition
12:02 karolherbst: but honestly, if that's about resets you have to assume the reset will work, .. kind of
12:02 karolherbst: otherwise you do the same as a hot unplug
12:02 karolherbst: and a GPU reset is not just "GPU broke please restart"
12:03 karolherbst: at least with nv gpus we have multiple levels of reset
12:03 karolherbst: like your context can be gone, but the GPU is still fine and other applications too
12:03 karolherbst: so you kill that one context and move on
12:03 karolherbst: in such a case the GPU reports that to the kernel actually what happened
12:04 karolherbst: if you do a full level reset of the entire device... well, then you have to expect the GPU falls off the bus I guess or the driver running into random bugs?
12:04 karolherbst: and then you can argue about everything, as anything could happen
12:04 karolherbst: I think the only way to approach this is to already know before resetting if it works or not
12:04 karolherbst: and normally you should know that already
12:05 karolherbst: and if there is a chance a GPU reset could break, don't do it
12:05 karolherbst: except you are in a state where everything is gone anyway
12:06 pq: ok, that I didn't expect that you know before you try to reset
12:06 karolherbst: well, it kind of depends on the hw I guess?
12:06 karolherbst: but if you just kill off a hw context you can assume that you can just remove this context and move on with life
12:06 danvet: from what I just tested amdgpu seems to just go with device level reset and boom
12:06 karolherbst: bad for that one application, but so what
12:07 danvet: I hung my machine a few times last Fri on minimal stuff with amdgpu hang recovery :-)
12:07 karolherbst: danvet: sounds like a stupid idea
12:07 danvet: karolherbst, dunno whether the hw can't do more
12:07 karolherbst: risking the entire desktop to go down
12:07 karolherbst: yeah...
12:07 karolherbst: could be
12:07 karolherbst: still doesn't make it a good idea :p
12:07 danvet: i915 is pretty good nowadays, with a similar cascade of "ban context" -> "reset engine" -> "reset entire render block"
12:08 karolherbst: I usually hit hundreds of recovery with nouveau when doing gl stuff or testing of stupid things
12:08 karolherbst: just the handling of it which sucks right now
12:08 karolherbst: and X being stupid :p
12:08 danvet: and last time around that last step forced a reset of the display too (where the real problems start) was gen3
12:08 karolherbst: yeah
12:08 danvet: amdgpu seems to still reset display even on latest
12:08 karolherbst: ufff
12:09 karolherbst: yeah well
12:09 karolherbst: at some point this needs to change
12:09 danvet: (I didn't check whether there's some less drastic paths in the tdr code, that's simply the one that the debugfs file hits, and also amdgpu_test from libdrm)
12:09 karolherbst: ahh, I see
12:10 danvet: karolherbst, well just the existence of that path is kinda worrisome
12:10 karolherbst: yeah
12:10 danvet: since the locking for a gpu reset that involves display is iffy
12:10 danvet: and amdgpu gets it wrong
12:10 karolherbst: but I guess in the worst case scenario it's that of a frozen desktop
12:10 karolherbst: *or
12:11 karolherbst: but actually most cases of frozen desktop is X being stupid
12:11 karolherbst: didn't run into this with wayland yet though
12:11 karolherbst: blocking on clients is really annoying
12:11 karolherbst: or whatever is blocking
12:11 karolherbst: some clients might do shit which leads the context to go boom and some fences or whatever to never get signalled
12:12 karolherbst: but then X freezes and nothing seems to help
12:12 karolherbst: killing that one application -> boom, everything back to normal
12:12 danvet: karolherbst, uh if you don't force-complete fences on reset, that's a kernel bug
12:12 karolherbst: and I wish we were further down the road on this
12:12 karolherbst: danvet: yeah.. dunno, could be
12:12 karolherbst: but... well
12:12 karolherbst: our fencing in nouveau is....
12:12 karolherbst: stupid
12:13 karolherbst: we do fencing inside mesa
12:13 karolherbst: so if a context is gone, you are stuck
12:13 danvet: but for cross-process fences, don't you need dma_fence?
12:13 danvet: either in syncobj or dma_resv or wherever
12:13 karolherbst: we also have the kernel bits of course
12:14 danvet: what you do within the umd is kinda "whatever"
12:14 karolherbst: yeah.. I didn't really debug fully what's going wrong
12:14 karolherbst: should probably do that once I got more time
12:14 lynxeye: danvet: nouveau has userspace fences, which the kernel doesn't even know about
12:14 lynxeye: so you can't force complete them
12:14 danvet: lynxeye, even across processes?
12:15 karolherbst: no
12:15 danvet: well then you shouldn't be able to hang X or a compositor on them
12:15 karolherbst: we only fence GL stuff within userspace (more or less)
12:15 danvet: and userspace fences are fine, all you do is shoot yourself if it breaks :-)
12:15 karolherbst: danvet: yeah.. I only describe the symptoms here and guessing what's wrong
12:16 danvet: maybe nouveau ddx is doing something fancy and breaking that assumption?
12:16 karolherbst: but the symptoms are: one application does stupid things and the hw context is gone -> display freezes
12:16 danvet: would explain why wayland compositors are fine
12:16 karolherbst: also happens with modesetting afaik
12:16 danvet: hm
12:16 karolherbst: also with prime offloading
12:16 karolherbst: using DRI3
12:16 karolherbst: really.. I need to investigate more
12:16 karolherbst: this was just a "we shouldn't freeze the desktop because one application was stupid and SIGKILLing it unfreezes the desktop"
12:17 karolherbst: couldn't we even do a "uhh, I didn't flip for a second, maybe something is wrong" inside compositors or X
12:17 karolherbst: or something
12:35 pq: karolherbst, X or compositor can only recover from a missing pageflip completion event, assuming the kernel will be happy to take another modeset or flip. But there is no way for userspace to cancel a flip or modeset, so there is not much userspace can do.
12:36 karolherbst: pq: right, but I talk about broken clients
12:36 pq: how is that different?
12:37 karolherbst: that we should handle it better than compositor ones
12:37 karolherbst: clients are sometimes stupid
12:37 pq: there are plans for compositors to explicitly wait for fences before using a client's buffer, but I don't think anyone actually implements that.
12:37 karolherbst: some clients are broken and only get a new frame per second or so
12:37 karolherbst: stalling the entire desktop
12:37 karolherbst: it's super annoying
12:37 karolherbst: I hate this behaviour
12:37 karolherbst: seriously.. there is nothing worse than a 15 fps application making my desktop all laggy :p
12:38 karolherbst: and I assume we could fix both at once
12:38 pq: yes, that's the problem that waiting for a fence before using a buffer would partially solve
12:38 MrCooper: karolherbst: 1 fps happens because the Xorg Present implementation doesn't support secondary GPUs
12:39 karolherbst: MrCooper: no, I meant applications being that demanding that the GPU isn't fast enough :p
12:39 karolherbst: but maybe it's the same issue
12:39 karolherbst: dunno
12:40 MrCooper: one frame per second would be a serious mismatch between GPU capabilities and application workload :)
12:40 karolherbst: yeah, but.. well, something you can hit with nouveau
12:40 karolherbst: having a GTX 710 with stock clocks or something
12:40 karolherbst: :p
12:41 pq: what the compositors I know of do right now is it gets a dmabuf from a client, submits a job to GPU using the client buffer as a texture, and submits a flip to KMS. So the whole desktop obviously waits for the client rendering to finish. I agree it's a problem, but solving it also trades off some pipelining.
12:41 karolherbst: pq: why can't we attach a timeout to it?
12:41 karolherbst: I don't say we should abort, just if it takes longer than 1/60 we move on and try again next frame
12:41 karolherbst: or something
12:41 pq: karolherbst, because there is no way to cancel a KMS flip?
12:42 karolherbst: mhh, true
12:42 pq: or do you mean *after* a compositor starts actually looking at client buffer fence state before it uses them? :-)
12:42 karolherbst: I guess we could do that
12:42 pq: yeah, people are thinking about doing that
12:43 karolherbst: if that makes my desktop smoother I am all in for that :p
12:43 pq: I'm not sure...
12:43 karolherbst: probably also solves the broken application bug
12:43 MrCooper: cancelling flips would hardly be useful for this anyway, since you'd have to composite another frame to replace the late one, and you might no longer have the client buffers needed for that
12:43 karolherbst: yeah...
12:43 pq: it might make e.g. games that used to run barely fine to drop to half fps
12:43 karolherbst: we would need a way to exclude certain clients from a flip essentially
12:44 karolherbst: pq: doesn't change much though I guess
12:44 pq: oh, people are extremely sensitive about games that just barely fine :-P
12:44 karolherbst: if a game runs at 10 fps and you try to get an updated buffer every 1/60 second, I don't see why that game would run at 5 fps now
12:44 pq: +run
12:45 pq: I mean a game that runs 58 fps drops to 30 fps
12:45 karolherbst: you still composite at your display refresh rate
12:45 karolherbst: why would it?
12:45 karolherbst: also
12:45 karolherbst: that's preferable to 58 fps :p
12:45 karolherbst: well
12:45 karolherbst: depends on what the users prefer
12:45 karolherbst: smooth movements or high fps
12:45 pq: because the compositor waits for the client rendering to finish, and then it misses the vblank with its own rendering
12:45 MrCooper: karolherbst: note that this is only one side of the coin; slow clients may still delay the compositor, unless the GPU & drivers support high priority contexts which can "overtake" lower priority ones (I supposed that might be relatively easy with Nvidia HW)
12:45 pq: ...maybe because the game is already rendering a frame N+2
12:46 karolherbst: MrCooper: hehe... I doubt it
12:46 karolherbst: nvidia gpus are not preemptible
12:46 pq: karolherbst, there are people who vocally prefer tearing too...
12:47 karolherbst: pq: I know, but then it all doesn't matter anyway
12:47 karolherbst: but you don't have to drop to 30 fps
12:47 karolherbst: you just say frame 45 wasn't able to update the window content for client X, we try again in frame 46
12:47 karolherbst: and maintain the display's refresh rate as the frame rate of the compositor
12:49 pq: most wayland compositors never tear, so if the game is consistently just below refresh rate, it will consistently show at half fps, because the compositor happens to wait for the client rendering to finish before it submits it to KMS
12:49 karolherbst: right, but then the game doesn't run at 58 fps :p
12:49 karolherbst: but close to 30
12:49 MrCooper: pq: done right, a game which ran at 58 fps before should still do the same after (but there'll be a higher chance that the user actually gets to see all of those frames :)
12:49 pq: it could run at 58 if the compositor didn't wait for the fence
12:49 karolherbst: no
12:50 karolherbst: way would we skip that client on every other frame?
12:50 karolherbst: *why
12:50 karolherbst: it's not like you idle the game
12:50 karolherbst: they still do double or triple buffering
12:50 karolherbst: so they keep the GPU busy
12:50 pq: because the frame takes slightly more time than one refresh period
12:51 MrCooper: yeah, why would waiting for the fence prevent running at 58 Hz? The game clearly isn't synchronizing to the refresh cycle, it just cranks out its 58 fps either way
12:51 karolherbst: just the window you have left to the flip gets smaller until you have to skip a frame again
12:51 karolherbst: pq: yeah, but that's fine for most frames, until you hit a frame you start after the game and end earlier
12:51 karolherbst: but most frames just intersect with each other in their slots
12:52 karolherbst: you don't start at the same time each frame
12:52 karolherbst: games not doing double/triple buffer would suffer from your idea
12:52 karolherbst: which.. don't exist
12:52 karolherbst: every game at least double buffers
12:52 karolherbst: or has an option to enable it
12:53 karolherbst: a lot even triple buffer for.. reasons
12:54 pq: yeah, this should really be drawn on paper to see it
12:54 karolherbst: also, there are other games halving frame rates themselves with fake frames
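As a stand-in for drawing the 58 fps versus 60 Hz argument on paper, here is a toy Python model. It is a sketch under strong assumptions and does not claim to represent either side's exact argument: the compositor never tears and always presents at a vblank, rendering cost is constant, and the two function names are hypothetical. `free_running` models a client that keeps rendering back to back (double or triple buffered) while the compositor only picks up frames that are already finished. `throttled` models a client that only starts its next frame once the previous one has been shown.

```python
def free_running(refresh_hz=60, render_fps=58, vblanks=60):
    """Distinct client frames shown when the client never waits.

    Frame k finishes at k / render_fps seconds; at vblank n (time
    n / refresh_hz) the compositor shows the newest finished frame.
    """
    shown = set()
    for n in range(1, vblanks + 1):
        newest = (n * render_fps) // refresh_hz  # floor, integer-exact
        if newest > 0:
            shown.add(newest)
    return len(shown)


def throttled(refresh_hz=60, render_fps=58, vblanks=60):
    """Frames shown when each frame starts only after the last was shown.

    Time is counted in ticks of 1 / (refresh_hz * render_fps) seconds,
    so one refresh period is render_fps ticks and one frame of
    rendering takes refresh_hz ticks.
    """
    period, render = render_fps, refresh_hz
    start, shown = 0, 0
    while True:
        done = start + render
        flip = -(-done // period)  # ceil: first vblank at or after done
        if flip > vblanks:
            return shown
        shown += 1
        start = flip * period  # next frame starts at that vblank


print(free_running(), throttled())
```

In this model, over one second at 60 Hz, the free-running client gets 58 distinct frames on screen while the throttled one gets 30. That suggests the halving depends on whether the client's rendering is serialized against presentation, not on the fence check alone.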
13:41 kusma: Hmm, am I the only one who finds it suspicious that u_blitter.c doesn't seem to respect pipe_blit_info::render_condition_enable?
13:42 kusma: I guess in theory a driver could make it respect that by dropping util_blitter_save_render_condition, though...
14:10 alyssa: krh: https://people.collabora.com/~alyssa/0001-nir-Propagate-f2f16-into-vectors.patch
14:10 alyssa: ^ Does this seem sane?
14:18 tlwoerner: can someone who has control over the fdo planet please add our gsoc student to the feed? https://melissawen.github.io/ thanks :-)
14:27 alyssa: daniels: ^
15:41 daniels: tlwoerner: done!
15:42 tlwoerner: daniels: perfect thanks! :-)
15:46 tlwoerner: does her name only show up after she's posted something? i don't see her name in the subscription sidebar
15:59 daniels: tlwoerner: it only regenerates on a schedule
16:01 tlwoerner: daniels: ah, thanks :-D
16:43 ncharlie: Hi -- is this a good place to ask questions about debugging the AMDGPU driver?
16:44 MrCooper: #radeon is better
16:45 ncharlie: MrCooper - ok, thanks
17:13 krh: alyssa: looks reasonable, but how often do you end up with heterogenous vectors?
18:01 alyssa: krh: quite a lot it seems
18:02 alyssa: 128-bit vector alu >:
18:04 alyssa: mediump varying foo ...... vec4(foo.xyz, 1.0)
18:04 alyssa: IIUC that would trigger that sort of pattern
18:04 alyssa: (maybe not *exactly* that, but conceptually it's not so foreign)
18:08 mannerov: is there a release planned after mesa 20.0.6 ? 20.0.7 ?
18:08 kisak: mannerov: you can always get the answer to that at https://www.mesa3d.org/release-calendar.html
18:09 mannerov: ok thanks, looks like I have 2 days left
18:15 swick: danvet: btw, would it be possible to have an equivalent to evdev's EVIOCREVOKE for render nodes where the FD behaves like the device was unplugged?
18:16 danvet: swick, we're kinda assuming render nodes are safe to share among unrelated processes
18:16 danvet: so should never be needed
18:16 danvet: if your driver needs revoke it smells like there's an isolation issue, which kinda breaks the render node contract
18:17 danvet: swick, and for modeset/primary nodes we have revoke
18:17 danvet: or well something that gets that job done at least
18:17 danvet: swick, what do you need your render node revoke for?
18:19 swick: this would be for flatpak where /dev/dri would not be mounted and the user could at runtime change which GPUs to make available
18:21 swick: it's just a concept, lots of problems with it obviously like how most apps don't handle gpu unplug and that drivers just assume /dev/dri to be there but still, would be neat if we could go in that direction
20:37 blackhole: Hello, I have an application that is using mesa, the only thing I did for that was to load opengl32.dll compiled by mesa, now I am trying to use osmesa, I have osmesa.dll but renaming osmesa.dll to opengl32.dll & replacing the earlier one doesn't seem to work
20:38 blackhole: I use 3rd party engine that depends upon opengl32.dll, opengl32.dll have function like wglGetProcAddress, I can't really have those 3rd party libs link against osmesa
21:32 HdkR: jekstrand: The hang is back. Let's see if I can capture anything about it
21:33 HdkR: GPU operating at 0-2Mhz
21:33 jekstrand: Wonderful!
21:33 karolherbst: trying to remove AGP eh :)
21:34 HdkR: https://pastebin.com/KzusunDw The only output in dmesg
21:35 EdB: pmoreau: I made some changes on my MR
21:35 HdkR: jekstrand: Nice, the error file actually has some output this time https://pastebin.com/hsADSppa
21:38 jekstrand: HdkR: Your context looks messed up
21:38 jekstrand: Kayden: ^^
21:38 jekstrand: Kayden: The STATE_BASE_ADDRESS in the context is all zero
21:38 karolherbst: jekstrand: finally I have time to rebase the structurizer on your vtn_cfg changes :p and actually I am not sure if anything got harder for me.. actually, probably harder
21:39 HdkR: The output on my screen is pretty messed up, nothing unexpected :P
21:39 jekstrand: HdkR: The other error state on that bug has a messed up context too
21:39 jekstrand: HdkR: I'm kind-of wondering if we're not looking at a kernel bug
21:39 jekstrand: Either that, or my batch decoder is busted
21:39 jekstrand: But I don't think that's likely
21:41 pmoreau: EdB: Nice! I saw you had pushed an update earlier today, but it seems like you just pushed a new one. Will have a look at it later in the week.
21:41 jekstrand: HdkR: Do you have any idea if alacritty is using threaded OpenGL?
21:42 HdkR: not sure. It's all written in rust so I am unable to parse it
21:42 HdkR: From what I can see it only ever creates a single context
21:43 Kayden: jekstrand: that sure seems fishy to me
21:43 Kayden: like, I suspect an error state capture issue more than I suspect a driver issue causing that
21:43 Kayden: vertex buffers are also all blank
21:43 jekstrand: On this one, it almost looks like you're suddenly getting the golden context
21:43 jekstrand: Which could be error capture being bogus
21:44 jekstrand: HdkR: Can you stick your error state on the bug so we don't lose it?
21:44 HdkR: It's there
21:44 jekstrand: Thanks!
21:45 EdB: pmoreau: the first one was for kernel attributes, the latest for arginfo rewrite and missing error code
21:45 HdkR: Going to build their git version and disable depth and stencil to ensure that isn't a problem
21:46 Kayden: HdkR: this is just, run alacritty, get hangs?
21:46 ickle: your decoder is nuts
21:46 ickle: and is decoding the default context
21:46 HdkR: run alacritty and let it sit for a few hours
21:47 HdkR: Or in this case it took 1 day, 24 hours
21:47 HdkR: Shortest run was about 4 hours
21:47 pmoreau: karolherbst: Neat! I’ll happily rebase on top of your updated branch.
21:47 Kayden: :(
21:48 HdkR: Just switch your main terminal over to Alacritty, it'll feel like it's happening constantly ;)
21:50 HdkR: One good thing is that once it starts happening, restarting X and reproducing it seemingly only takes a few minutes
21:50 jekstrand: ickle: Do you mean that we're getting the default context in our error state (kernel bug) or that we're decoding it wrong (our bug)?
21:51 ickle: you are interpreting the default context, which is included in the error state for reference
21:51 jekstrand: ickle: Oh...
21:51 jekstrand: ickle: I've seen my context in the error state sometimes too.
21:51 jekstrand: ickle: Are they always both in there?
21:52 ickle: they should both be there for the moment
21:52 ickle: if you want the default context in future, speak now for I've dropped it
21:52 jekstrand: ickle: "Active context"?
21:53 ickle: HW context
21:53 jekstrand: ickle: I see "WA context", "HW context", and "NULL context"
21:54 jekstrand: ickle: HW context is definitely the one that's getting decoded
21:54 jekstrand: So unless the kernel's dumping the wrong buffer....
21:54 ickle: no, but they could be anything
21:55 jekstrand: What do you mean by "could be anything"?
21:56 ickle: in that kernel it'll be copied over by the default at the moment of reset, which will be before the capture
21:56 jekstrand: Oh... So the context we're seeing is just garbage
21:56 jekstrand: And we shouldn't trust it
21:56 jekstrand: That's very non-useful.
21:57 Kayden: "that kernel" -> it's being improved to capture the hanging context in new kernels?
21:57 jekstrand: I guess HdkR could run with GPU reset disabled. That'll be fun....
21:57 ickle: even before that point, it'll be at best the state before the context switch prior to the hang
21:57 ickle: it won't be the state at the moment of the hang
21:58 jekstrand: Yeah, but that's fine
21:58 Kayden: would like it to be the context that was loaded when our batch started executing
21:58 jekstrand: The problem is that iris emits STATE_BASE_ADDRESS etc. once at context creation and, without a competent capture, we don't have that information when looking at the hang.
21:58 jekstrand: So the contents of the context right before the hanging batch is *exactly* what we want.
21:59 Kayden: would be useful for surface base address...for the others, they're hardcoded constants, so not too hard to figure out :)
21:59 jekstrand: Yeah, but our tools don't know those constants
21:59 Kayden: but yeah surface is the most interesting, sadly
21:59 ickle: if you want exciting, try drm-tip
22:00 EdB: pmoreau: with that I've done all from my CL 1.2 list. I guess now the next step is printf support and CTS 1.2 validation.
22:00 jekstrand:doesn't like *that* much excitement. :-P
22:00 ickle: that should be doing what you think it should
22:00 karolherbst: EdB: can we please ignore printf? :D
22:01 Kayden: oh, right, I hardcoded them for INTEL_DEBUG=bat with iris, but doesn't work for error states. Could always hardcode them in the tools, as assumptions if you don't see a STATE_BASE_ADDRESS
22:01 Kayden: that is pretty stupid but it would make the tool work better, and would be harmless for those that actually emit SBA
22:01 jekstrand: Kayden: If we standardize across drivers. :-)
22:01 Kayden: well, neither anv/brw inherit SBA
22:02 Kayden: but, doesn't help us with surfaces anyway :/
22:02 jekstrand: Kayden: Not yet. :-)
22:02 jekstrand: Kayden: The MR I just posted makes ANV start
22:02 Kayden: ooo
22:02 EdB: karolherbst: I was hoping someone would have done it before me. I'm happy without printf
22:02 karolherbst: EdB: how much does the CTS actually validate?
22:03 jekstrand: Kayden: And Jenkins was decidedly non-explody (full green)
22:03 karolherbst: I fear they do too much though :/
22:03 EdB: karolherbst: I never ran a full CTS
22:03 karolherbst: EdB: anyway.. there are ways to implement it, but.. ufff
22:03 karolherbst: at least the format is a uniform constant string afaik
22:03 karolherbst: so we can get away by doing lowering
22:03 karolherbst: but..
22:03 karolherbst: aahhhh
22:04 airlied: I thought you just shoved thing in a buffer and have the host do the printf work
22:04 EdB: karolherbst: I'll go for printf (fmt, ...) {} :p
22:04 airlied: it always seemed messy rather than particularly hard
22:04 karolherbst: airlied: you still need to copy the values out
22:04 airlied: I was going to rip off beignet if I had time
22:04 EdB: airlied: yes. That's how AMD put it in llvm
22:04 jekstrand: Kayden: I would be a big fan of it if it weren't for the fact that it requires 256B mode for binding tables and so now all our 1-entry tables are taking 8x the memory.
22:05 jekstrand: Kayden: It really sucks for blorp
22:05 Kayden: yeah, I'm not a fan of 256B mode
22:05 Kayden: I was kind of tempted to use 16:6 mode in iris
22:05 EdB: airlied: once the kernel is done you mostly pass its content to host printf
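The approach airlied and EdB describe here (the kernel appends records to a buffer, and the host does the formatting once the kernel is done) can be sketched roughly as follows. This is only an illustration: the record layout (a format-string index, an argument count, then 32-bit integer arguments) and the names `FORMATS`, `device_printf`, and `host_flush` are assumptions made for the example, not how clover, beignet, or LLVM actually lay things out.

```python
import struct

# Hypothetical table of format strings known at compile time;
# the "device" side only ever stores an index into it.
FORMATS = ["x=%d y=%d", "done after %d iterations"]

def device_printf(buf: bytearray, fmt_id: int, *args: int) -> None:
    """Stand-in for the kernel side: append (fmt_id, argc, args...) as little-endian int32s."""
    buf += struct.pack(f"<{2 + len(args)}i", fmt_id, len(args), *args)

def host_flush(buf: bytes) -> list[str]:
    """Host side: walk the records after the kernel finishes and do the real formatting."""
    out, off = [], 0
    while off < len(buf):
        fmt_id, argc = struct.unpack_from("<2i", buf, off)
        off += 8
        args = struct.unpack_from(f"<{argc}i", buf, off)
        off += 4 * argc
        out.append(FORMATS[fmt_id] % args)
    return out

buf = bytearray()
device_printf(buf, 0, 3, 4)
device_printf(buf, 1, 128)
lines = host_flush(bytes(buf))
```

Because the format string is a constant (as karolherbst notes), only an index and the argument values need to cross from device to host; copying the values out is the part that still has to be lowered on the kernel side.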
22:05 jekstrand: It's sort of "Thanks for giving me a bigger buffer.... And ensuring that things are so bloated I fill it exactly as fast"
22:06 Kayden: Yeah. It's exactly the same, unless your table is > 8 entries.
22:06 Kayden: I ran my patch on Portal for example, and it stalls exactly the same amount.
22:06 Kayden: Mordor cuts 6.3% of stalls
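The 8x figure and the "> 8 entries" threshold follow from alignment arithmetic. A rough sketch, assuming 4-byte binding table entries and a 32-byte table alignment outside 256B mode (both are assumptions inferred from the numbers quoted in the conversation, not taken from the hardware docs):

```python
ENTRY_SIZE = 4  # bytes per binding table entry (assumed)

def table_footprint(entries: int, alignment: int) -> int:
    """Bytes consumed by one binding table once padded up to `alignment`."""
    size = entries * ENTRY_SIZE
    return -(-size // alignment) * alignment  # round up to a multiple of alignment

for n in (1, 8, 9, 64):
    small = table_footprint(n, 32)   # assumed default alignment
    large = table_footprint(n, 256)  # 256B mode
    print(n, small, large, large / small)
```

Under these assumptions every table of up to 8 entries costs 8x as much in 256B mode, so a pool that is 8x larger fills at the same rate; only tables with more than 8 entries see a smaller ratio.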
22:07 HdkR: I've got the latest alacritty disable their depth and stencil usage. will see if it repros over the next day or so
22:07 HdkR: and disabled*
22:07 jekstrand: I would drop the whole 256B thing except that using binding table pools without 256B mode means you only have 16-bit binding table entries. :-(
22:07 EdB: what I need is to figure out how to add an extra buffer that can be addressed by the printf on the kernel side
22:07 Kayden: jekstrand: is using 8x the memory that big of a deal?
22:08 Kayden:did math wrong, yeah, 16-bit entries are a total non-starter
22:08 jekstrand: Well, looking at the docs, it seems they're bits [21:6] so that's more like a 22-bit
22:08 jekstrand: But that's still not great
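For the bit-field arithmetic: a field occupying bits [21:6] stores 16 bits, but because the low 6 bits are implied zero it expresses 64-byte-aligned offsets across a 22-bit range. A small sketch (the `encode` and `decode` helpers are hypothetical and exist only to show the arithmetic):

```python
HI, LO = 21, 6  # field occupies bits [21:6]

field_bits = HI - LO + 1   # number of bits actually stored
granularity = 1 << LO      # smallest step between encodable offsets, in bytes
reach = 1 << (HI + 1)      # size of the offset range covered, in bytes

def encode(offset: int) -> int:
    """Pack a byte offset into the field; it must be aligned and in range."""
    assert offset % granularity == 0 and 0 <= offset < reach
    return offset >> LO

def decode(field: int) -> int:
    """Recover the byte offset from the stored field value."""
    return field << LO
```

So the entry is still 16 bits wide, but it reaches 4 MiB rather than 64 KiB, at the cost of 64-byte granularity.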
22:09 karolherbst: actually.. I am thinking about writing the vtn unstructured part from scratch as things are too incompatible now :/
22:09 EdB: After that, the next step is 3.0 without a 2.X stop :)
22:10 karolherbst: does 3.0 have generic pointers?
22:11 airlied: optional I believe
22:11 jekstrand: karolherbst: 3.0 is basically 1.2 with all the 2.0 stuff as optional features
22:11 karolherbst: ehh
22:11 jekstrand: The actual delta between 1.2 and 3.0 required is basically nothing
22:11 jekstrand: So 3.0 is actually implementable
22:11 airlied: "Support for the Generic Address Space is optional for devices supporting OpenCL 3.0. "
22:11 karolherbst: and generic pointers are actually a nice feature... just the headache
22:11 airlied: I think they moved some obvious fixes into 3.0 from 2.x
22:12 karolherbst: ahh, okay
22:12 airlied: so it's not just 1.2 but it's not far off
22:12 karolherbst: yeah, I guess after 1.2 I'd target 3.0 as well directly :p
22:12 karolherbst: I am just hoping that we still have a new OpenCL runtime in mesa :p
22:12 karolherbst: or get one
22:13 airlied: opencl on lvl0 :-P
22:13 karolherbst: ehh
22:13 karolherbst: yeah... I don't really care all that much :p
22:13 jekstrand: airlied: And level0 on gallium and gallium on vulkan....
22:13 airlied: just need nouveau to grow a vulkan driver already
22:13 karolherbst: I just don't look forward to having to maintain 3 drivers if the trend continues
22:13 airlied: jekstrand: it would be lvl0 on vulkan on gallium :-P
22:14 airlied: at least with the current code I've written
22:14 airlied: though I'm at the how to handle the difference between shader models stage now
22:14 karolherbst: airlied: actually how bad is the gallium overhead for vulkan? just terrible api design for vulkan or just too many func calls?
22:14 karolherbst: actually wondering if it's so bad
22:14 airlied: karolherbst: it's ba
22:14 airlied: it's less work to write a vulkan drive
22:14 airlied: driver
22:15 karolherbst: yeah, still not looking forward to having to maintain two drivers :p but I except you can rip off a lot of things and have a couple of libraries for certain stuff
22:15 karolherbst: *expect
22:15 karolherbst: and maybe we could even make u_blitter compatible with nouveau
22:15 karolherbst: it seems like it's totally not
22:15 airlied: karolherbst: the compiler and image management seem to be the first things people take out
22:15 jekstrand: I've thought about making a sort of gallium-like layer that we could build both anv and iris on top of
22:16 airlied: jekstrand: take a look at AMD's PAL
22:16 airlied: then just run away
22:16 karolherbst: skeggsb looked into using u_blitter for nouveau, this was his result: "for reference, HW not supporting shader stencil export is why we can't use u_blitter"
22:16 airlied: making a shader stack that writes stencil without that isn't impossible though
22:16 karolherbst: well, we have our own 3d blitter in nvc0
22:16 airlied: stencil export is just a nice shortcut
22:16 karolherbst: well
22:16 karolherbst: sure
22:17 karolherbst: but we could make it optional in u_blitter I guess
22:17 karolherbst: I just don't know enough
22:17 karolherbst: and if we are the only one with that limitation
22:17 airlied: jekstrand: the problem is getting something efficient for both cases and it's kinda hard
22:17 karolherbst: but our 3d blitter is also partly broken
22:17 karolherbst: especially in regards to MS
22:17 anholt: lots of hw can't, and would like the 8-pass thing to set a stencil bit per pass.
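The "8-pass thing" anholt mentions is a fallback for hardware that cannot export stencil from a shader: one pass per bit, where the shader discards fragments whose source stencil lacks that bit and the stencil write mask restricts the write to that bit. A CPU-side simulation of the idea, as a sketch only (the destination is assumed to be cleared to zero first, and nothing here uses a real Mesa API):

```python
def stencil_copy_8pass(src: list[int]) -> list[int]:
    """Simulate copying 8-bit stencil values using one masked pass per bit."""
    dst = [0] * len(src)              # assumed: destination stencil cleared to 0
    for bit in range(8):
        write_mask = 1 << bit         # stencil write mask for this pass
        for i, s in enumerate(src):
            if not (s & write_mask):  # "shader" discards: bit not set in source
                continue
            # stencil op REPLACE with reference 0xFF, limited by the write mask
            dst[i] = (dst[i] & ~write_mask) | (0xFF & write_mask)
    return dst

src = [0x00, 0x01, 0x80, 0xA5, 0xFF]
assert stencil_copy_8pass(src) == src
```

Each pass touches only one bit of the destination, so after eight passes every pixel holds the full source value without the shader ever writing a stencil value directly.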
22:18 jekstrand: airlied: Sure. That's why I haven't written it yet. :-P
22:18 karolherbst: if we could make u_blitter the de facto 3d blitter for mesa, that would be nice :p
22:18 airlied: jekstrand: PAL is like vulkan with all the dynamic state, but it's still pipeline explosions for a GL driver I think
22:18 anholt: I mean, it won't ever be something for vulkan.
22:18 karolherbst: but u_blitter is heavily gallium based, right?
22:18 jekstrand: karolherbst: I'd be fine with that on SKL+. Anything earlier than that and we have mad magic to make things like MSAA stencil blittable.
22:18 airlied: yeah you really don't want u_blitter for vulkan
22:19 anholt: you would need to completely rewrite it for vulkan.
22:19 karolherbst: oh well
22:19 anholt: all you have to share is shader code, which is usually like one sample instruction.
22:19 airlied: you just write your own meta blitter or use your blorp equiv
22:19 karolherbst: ours is broken :p
22:19 jekstrand: fix it?
22:19 karolherbst: MSAA is annoying
22:20 karolherbst: just we have no idea why it's broken
22:20 airlied: jekstrand: I've filed a bunch of lvl0 bugs last week wrt differences from vulkan
22:20 karolherbst: needs time spent on
22:20 karolherbst: anyway, those bugs are not hit by the CTS, so meh...
22:20 jekstrand: airlied: There's a bug tracker for it?
22:21 airlied: https://github.com/oneapi-src/level-zero/issues
22:21 airlied: jekstrand: the spec and loader are open there
22:21 jekstrand: and by "filed a bunch of bugs" you mean "filed half the bugs the project has ever seen" :D
22:21 airlied: pretty much :-P
22:22 airlied: I'm trying to work out how best to approach the module/kernel stuff
22:22 airlied: we kinda punted on that for clover
22:22 airlied: into something that works but isn't exactly what the API wants
22:22 airlied: level0 expects to get a binary back from the module stage where we don't know the entrypoint
22:22 airlied: you get the entrypoint at the kernel stage
22:23 airlied: it also does spec consts at the module stage
22:23 karolherbst: they rely too much on llvm :p
22:24 karolherbst: clover had this same misdesign
22:24 karolherbst: or still has for llvm
22:24 karolherbst: just that with llvm it doesn't matter
22:24 airlied: it's the API design really
22:24 karolherbst: yeah.. a bit
22:24 karolherbst: it sucks.. true
22:24 jekstrand: airlied: spec constants at the module sense makes sense
22:24 jekstrand: *module stage
22:24 airlied: jekstrand: yeah they weren't too messy
22:24 jekstrand: airlied: Vulkan's spec constants are kind-of horrible
22:25 airlied: the getting a binary back at that point is a bit messier
22:25 airlied:has to also work out how to add a pipeline cache to llvmpipe (orthogonal problem)
22:25 jekstrand: Which is to say that you can probably compile to binary without knowing the spec constants 90% of the time but you only know whether or not that's possible 0% of the time without parsing the full SPIR-V
22:26 jekstrand: In particular, if a spec constant is used to size an array or similar, you're toast.
22:26 airlied: yeah but how to deal with all the entrypoints in our spirv stack is a bit trickier
22:26 jekstrand: If it's just used for an if, you can theoretically emit binary and then patch in the spec constant somehow.
22:26 jekstrand: Yeah, we don't handle that today
22:27 jekstrand: I've always wished we could
22:27 jekstrand: But, as long as we continue to need the spec constant at the parsing stage, it doesn't do us any good to do that. :-(
22:27 airlied: I expect that will be the thing I need to do to make lvl0 work
22:27 airlied: one other feature they have is event pools, where you can share the backing to a bunch of events and send it over "IPC"
22:28 airlied: but they are overly reliant on events, which imo suck as a sync mechanism
22:28 jekstrand: *sigh*
22:28 jekstrand: Yeah, CL events suck
22:28 airlied: well vulkan ones do as well :-P
22:29 airlied: hey I'm just going to stop your GPU side command processing
22:29 jekstrand: What people really want is a timeline semaphore that you can signal mid-batch
22:29 jekstrand: And maybe wait on or at least query mid-batch.
22:29 airlied: yeah they are looking into alternates thankfully now
22:29 jekstrand: Waiting mid-batch is tricky though
22:29 jekstrand: If you fail to preempt properly, boom!
22:47 airlied: jekstrand: if you have any preconceived ideas of how multi-entrypoint should look let me know :-)