00:24 jekstrand: Ugh... I may have just hit a GCC bug. :-(
00:25 jekstrand:rebuilds from scratch
00:32 Kayden: the amount of times we've legitimately done that is kind of surprising
00:33 imirkin: rebuilt from scratch?
00:33 jekstrand: tried that
00:34 dcbaker: or found bugs in $compiler?
00:34 imirkin: ;)
00:34 jekstrand: GCC isn't overriding weak symbols with the strong symbols from one of my files.
00:34 jekstrand: It's only one file that's not working.
00:35 imirkin: i've never fully understood the weak/strong symbol stuff. tricky stuff =/
00:35 dcbaker: and highly non-portable
00:35 dcbaker: or, at least, very painful to do on some OSes
00:35 jekstrand: And really darn useful if you want to build a table of functions some of which may not exist
00:35 Kayden: well, the amount of times we've rebuilt from scratch is also high, but also very unsurprising. :D
00:36 jekstrand: What flag was it that anholt set to garbage-collect unused symbols?
00:36 airlied: the gc sections stuff?
00:37 jekstrand: maybe?
00:37 anholt: look for gc-sections
00:37 aphysically: oh looks like with that encoding bug I also get this:
00:37 aphysically: MESA-INTEL: error: ../src/intel/vulkan/anv_batch_chain.c:1856: execbuf2 failed: No such file or directory (VK_ERROR_DEVICE_LOST)
00:37 jekstrand:rebuilds without gc-sections
00:38 airlied: ah that explains why my lp_dump_llvmtype helper function disappeared :-P
00:38 aphysically: (still nothing in journal/dmesg)
00:39 jekstrand: The file that's not being resolved properly has only symbols that are referenced because they override weak symbols.
00:39 jekstrand: I wonder if that has anything to do with it
00:39 jekstrand: But I'm pretty sure that's true of multiple files in ANV so I'm confused
00:39 aphysically: this happens when doing vaapi decode -> vulkan filter -> vaapi encode in ffmpeg on debian bullseye, causing the encode to fail for long encodes
00:40 aphysically: although sometimes you can get the encode to succeed if it's a few frames long (otherwise you get this error)
00:40 xexaxo1: imirkin: weak symbols - allows for the symbol to be unresolved at build time. any unresolved symbol will be NULL at runtime.
00:40 jekstrand: That also looks very much like you're messing up synchronization or something
00:42 xexaxo1: imirkin: sort of like providing a binary with unresolved symbols (aka missing -no-undefined) but weak symbols don't show in ldd -r binary
00:43 jekstrand: disabling gc-sections doesn't help. :-/
00:43 imirkin: xexaxo1: yeah, i know what it is. but getting it to work is another matter entirely :)
00:43 jekstrand: xexaxo1: Do we use weak symbols for GL dispatch?
00:44 xexaxo1: fwiw I've used it in a handful of projects. happy to take a look
00:44 imirkin: like i wanted to detect that a LD_PRELOAD'd thing was "loaded" by having it share a symbol, but that was a massive fail
00:44 xexaxo1: jekstrand: GL dispatch should have none. the Vulkan dispatch only
00:44 jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8676
00:44 jekstrand: xexaxo1: It's the stuff from vk_cmd_copy.c that's not showing up
00:45 xexaxo1: woot one copy of the Vulkan dispatch \o/
00:45 jekstrand: xexaxo1: If you run basically any vulkan test and break on vk_device.c:51 and look at vk_common_device_entrypoints, it's missing the stuff from that file.
00:45 jekstrand: Everything else works.
00:45 jekstrand: xexaxo1: If you can see some thing I can't, that'd be awesome. I'm very sure I had this working earlier today. :-/
00:46 xexaxo1: ack. give me a few minutes to dust off my mesa build setup
00:46 jekstrand: xexaxo1: Yeah, I've been wanting to get rid of the Vulkan dispatch duplication for a long time now. Glad I've finally managed.
00:47 jekstrand: bnieuwenhuizen: I'm done churning now, I think. The only remaining issue is figuring out this stupid symbol resolution stuff.
00:48 bnieuwenhuizen: ack, thanks
00:50 jekstrand: which I'm very sure I had working :-(
00:52 bnieuwenhuizen: git reflog is your friend :)
00:52 bnieuwenhuizen: unless you rebased so many times it is hard to track down :(
00:53 jekstrand: bnieuwenhuizen: The 2nd one. :(
00:56 xexaxo1: hmm did ninja stop using all the cores suddenly or ...
00:56 jekstrand: xexaxo1: Mesa has gotten slower to build. :-(
00:56 xexaxo1: some python/other script is explicitly single core.
00:56 jekstrand: odd...
00:58 xexaxo1: had 3-4 instances of 30s+ ... anyway. let's see weak symbol thing first - 460/540
00:59 dcbaker: xexaxo1: there's some C++ in mesa that takes a *long* time to compile these days
00:59 dcbaker: maybe I should look at pch again... hmm
00:59 xexaxo1: dcbaker: building only anv - no dri, gallium, osmesa
01:00 dcbaker: hmmm, yeah, you might be hitting a bottleneck on one of the python generators
01:00 dcbaker: with enough cores and few enough targets that becomes almost inevitable
01:01 bnieuwenhuizen: is meson/ninja capable of creating execution graphs for visualization?
01:01 dcbaker: I think ninja is
01:02 dcbaker: meson doesn't have a good enough view of the DAG I don't think
01:02 bnieuwenhuizen: yeah, I guess I just think of it as an integrated solution :)
01:02 dcbaker: ninja -t graph
01:02 bnieuwenhuizen: thx
01:02 jekstrand:used to be able to build anv in 20s from cold. It's been a while since I could do that. :-/
01:02 dcbaker: oh, and ninja -t browse
01:03 dcbaker: I've never read the help for `ninja -t` before. There's some really interesting stuff in there
01:03 bnieuwenhuizen: O_o 20 s
01:03 bnieuwenhuizen: is that just ninja or did that include meson time?
01:04 dcbaker: I would hope that includes the time that ninja took to reconfigure
01:04 dcbaker: s/ninja/meson
01:04 bnieuwenhuizen: well with ninja clean you wouldn't need that
01:04 dcbaker: oof
01:04 bnieuwenhuizen: or rather the reconfigure is mostly noop?
01:04 dcbaker: ninja clean took 20s?
01:04 bnieuwenhuizen: no I mean if you did ninja clean to get the cold state
01:06 dcbaker: I'm a little terrified that `ninja -t browse` exists
01:09 xexaxo1: jekstrand: doesn't seem like a non-weak symbol is generated by the python scripts.
01:10 xexaxo1: there is literally just the header declaration, weak "definition" and user.
01:12 jekstrand: xexaxo1: Yeah, I'm looking at the .o with nm and it's 100% weak symbols
01:12 jekstrand: And looking at vk_cmd_copy.c.o, I see a bunch of non-weak symbols
01:12 xexaxo1: jekstrand: sorry what I meant was - doesn't seem like a compiler bug...
01:12 xexaxo1: let me look in cmd_copy
01:13 jekstrand: xexaxo1: The weak symbols are in vk_entrypoints.c
01:13 jekstrand: the .h doesn't have them weak because then everything would be weak
01:14 bnieuwenhuizen: dcbaker: the graph/browse give me the general graph but no execution timeline :(
01:14 xexaxo1: guess I'm just tired - is there a non-weak vk_common_MapMemory generated somewhere?
01:14 bnieuwenhuizen: still cool but not as useful to figuring out stragglers/bottlenecks
01:14 jekstrand: xexaxo1: vk_common_GetDeviceProcAddr is in vk_device.c
01:15 xexaxo1: all the symbols from vk_cmd_copy.c are present in the vk_common_device_entrypoints
01:16 jekstrand: Yes, but the ones in vk_cmd_copy.c are strong and the ones in vk_entrypoints.c are weak so the vk_cmd_copy.c ones should win.
01:17 xexaxo1:double-checks vk_common_device_entrypoints
01:17 jekstrand: if this works on xexaxo1's machine, jekstrand is going to be annoyed
01:18 jekstrand:blows his whole build dir, meson and all, and his ccache
01:19 jekstrand:wonders why nir.h includes xxhash
01:19 jekstrand: Or, for that matter, half the stuff in here
01:19 jekstrand: It includes gl.h too
01:20 xexaxo1: my bad - misread the table contents... yes the cmd_copy symbols are missing
01:20 bnieuwenhuizen: ah .ninja_log contains execution times
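Since .ninja_log records a start and end time for each build step, the stragglers can be ranked with a few lines of Python. This is only a sketch: it assumes the tab-separated layout (start ms, end ms, mtime, output path, command hash) that ninja writes after its "# ninja log" header line, and the sample entries below are made up for illustration, not taken from a real Mesa build.

```python
import io


def slowest_steps(log_file, top=5):
    """Return the `top` longest build steps as (seconds, output) pairs."""
    durations = {}
    for line in log_file:
        if line.startswith("#"):
            continue  # header line such as "# ninja log v5"
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip anything that does not look like a log entry
        start_ms, end_ms, output = int(fields[0]), int(fields[1]), fields[3]
        # A later entry for the same output replaces the earlier one.
        durations[output] = (end_ms - start_ms) / 1000.0
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    return [(seconds, output) for output, seconds in ranked[:top]]


# Made-up sample entries (hypothetical file names and timings).
SAMPLE_LOG = (
    "# ninja log v5\n"
    "0\t1500\t0\tfast.c.o\tdeadbeef\n"
    "100\t20100\t0\tslow.c.o\tcafef00d\n"
    "200\t700\t0\ttiny.c.o\t0badf00d\n"
)

if __name__ == "__main__":
    for seconds, output in slowest_steps(io.StringIO(SAMPLE_LOG), top=2):
        print(f"{seconds:8.1f}s  {output}")
```

To use it on a real build, open the .ninja_log in the build directory and pass the file object instead of the sample. Note that this ranks steps by wall-clock duration only; it does not show which steps were blocking the others.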
01:25 xexaxo1: cool found a hack which seems to work...
01:25 jekstrand: ?
01:25 jekstrand: That was fast
01:25 xexaxo1: link_with -> link_whole in the idep_vulkan_util
01:26 bnieuwenhuizen: huh, compiling u_format_table.c is the longest compile time in my build actually, not even c++
01:26 xexaxo1: my guess is that something doesn't figure out that the symbols will be used, so it discards them.
01:26 jekstrand: bnieuwenhuizen: Yeah... That thing's massive
01:26 bnieuwenhuizen: 20 seconds :)
01:26 xexaxo1: link_whole forbids the discard.
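One plausible reading of xexaxo1's guess, not confirmed in the chat, is ordinary static-archive behaviour: a linker only extracts an archive member when that member defines a symbol that is still undefined, and a weak definition already counts as defined. The toy model below sketches that rule in Python. It is not how any real linker is implemented, and the object and symbol names (driver.o, entrypoints.o, cmd_copy.o, dispatch_table, cmd_copy_buffer) are illustrative stand-ins rather than the real Mesa ones.

```python
def link(objects, archive, whole_archive=False):
    """Toy static link: return {symbol: (binding, defining object)}."""
    resolved = {}       # symbol -> (binding, object name)
    referenced = set()  # every symbol some linked object refers to

    def add(obj):
        for sym, binding in obj["defs"].items():
            current = resolved.get(sym)
            # A strong definition replaces a weak one; otherwise first wins.
            if current is None or (current[0] == "weak" and binding == "strong"):
                resolved[sym] = (binding, obj["name"])
        referenced.update(obj["refs"])

    for obj in objects:
        add(obj)

    pending = list(archive)
    if whole_archive:
        # Models link_whole / --whole-archive: every member is linked in.
        for member in pending:
            add(member)
        return resolved

    # Default archive rule: pull a member only if it defines a symbol
    # that is referenced but has no definition at all yet.
    progress = True
    while progress:
        progress = False
        undefined = referenced - set(resolved)
        for member in pending:
            if undefined.intersection(member["defs"]):
                add(member)
                pending.remove(member)
                progress = True
                break
    return resolved


# Illustrative inputs (hypothetical names).
DRIVER = {"name": "driver.o", "defs": {}, "refs": {"dispatch_table"}}
ARCHIVE = [
    {"name": "entrypoints.o",
     "defs": {"dispatch_table": "strong", "cmd_copy_buffer": "weak"},
     "refs": {"cmd_copy_buffer"}},
    {"name": "cmd_copy.o",
     "defs": {"cmd_copy_buffer": "strong"},
     "refs": set()},
]

if __name__ == "__main__":
    print(link([DRIVER], ARCHIVE)["cmd_copy_buffer"])
    print(link([DRIVER], ARCHIVE, whole_archive=True)["cmd_copy_buffer"])
```

In the default mode cmd_copy.o is never pulled in, because the weak cmd_copy_buffer from entrypoints.o already satisfies the reference, so the table keeps the weak entry. With whole_archive=True the strong definition is present and wins, which matches what the link_with -> link_whole change was observed to do.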
01:26 xexaxo1: jekstrand: I might have spotted a thing or two in the MR.
01:27 xexaxo1: mind giving me a few hours so I can check tomorrow morning?
01:28 xexaxo1:puts the above in the MR ... hopefully in a more coherent form
01:34 xexaxo1: jekstrand: MR patch/comment added. Mention in the MR if there are more symbols missing...
01:34 xexaxo1: and I'll check tomorrow. c'ya o/
01:41 xexaxo1: imirkin: just remembered you asked - IIRC in order for LD_PRELOAD to work with weak symbols they must be non-zero and public (visibility default)
01:42 imirkin: i worked out some other way of addressing my problem
01:42 imirkin: but i couldn't even get the symbols to show up properly
01:42 imirkin: much less work
01:42 imirkin: my point is understanding how they work and actually making them work ... two very different things :)
01:43 xexaxo1: indeed. g'night all o/ (for real)
01:44 agd5f: airlied, I thought that was why we had to pin the fbdev buffer and waste the memory for oops and such.
01:45 airlied: agd5f: oh we do kmap it and I suppose that kmapping might be hard to move, though maybe not impossible
01:50 agd5f: airlied, was there shadowfb-like support in the drm fbdev stuff? That might be an option
01:53 airlied: agd5f: but yeah it might be that fbcon needs changes alright to update it to a new kmapping
02:01 agd5f: here we go: https://patchwork.kernel.org/project/dri-devel/cover/20190705092613.7621-1-tzimmermann@suse.de/
02:15 alyssa: Mali wants stores in fragment shaders to be wrapped in if(!is_helper_invocation) { ... }
02:15 alyssa: Anyone have opinions on whether that belongs in compiler/nir/ or just the backend?
02:16 alyssa: Directly implementing spec text, but I don't know if other hardware has this issue
02:17 alyssa: It'd be a completely trivial pass if there was an easy way to see if an instruction writes memory
02:17 alyssa: Currently I pasted a 100 line switch from gather_info =p
02:23 alyssa: Looks like that only applies to older Mali at that.
02:27 alyssa: Yeah, Midgard only issue, Bifrost handles this correctly.
05:14 jekstrand: alyssa: We do it in our back-end
08:40 MrCooper: vsyrjala: seems like a no-brainer that async flips should cause vblank to end with VRR; what would be the rationale for them not doing so?
08:50 pq: mdnavare, VRR on Gnome Wayland or X.org?
08:53 emersion: i'm betting X.org, sadly
08:54 emersion: xf86 drivers sounds like an easy way out of the uapi requirements
08:55 MrCooper: yeah, mutter Wayland doesn't support VRR yet
09:05 pq: exactly why I was asking, since testing VRR on Gnome Wayland would produce... false results :-)
09:06 pq: even if you force-enabled VRR in DRM for Gnome Wayland, I think mutter's frame scheduling might interfere
09:26 jadahl: pq: mutter on Xorg is implicitly "vrr" more or less, because of Xorg's single frame clock and us having to adapt when it changes. on top of drm it still uses the crtc refresh rate, assuming no vrr mr is applied
09:27 pq: jadahl, right, that's what I was expecting.
09:39 MrCooper: jadahl: Mesa disables VRR on Xorg for compositors such as mutter/gnome-shell via a denylist; it's only active while a "normal" application which isn't in that list is fullscreen and using page flipping
09:45 emersion: :S
09:45 jadahl: MrCooper: if it's a fullscreen game, it wouldn't go through mutter though
09:46 pq: that's the point
09:46 pq: ...of the X.org implementation of VRR
09:47 MrCooper: yeah, the point is to only enable VRR while the display isn't driven by a Xorg compositor
11:20 mlankhorst_: is there a fast way to check if a user range is valid without copying it? I don't care if it races
12:30 emersion: pq: ah, so if i import a DMA-BUF as an EGLImage, then the client updates the buffer, i need to re-create the EGLImage if i want to see the changes?
12:35 pq: emersion, yeah, ISTR that it is not *guaranteed* to see the changes correctly without re-importing.
12:35 pq: but it might work without, sometimes, or on some machines
12:35 emersion: well… thanks for the info :)
12:37 pq: daniels might remember the details if you're interested
12:38 pq: I forget if it was just for some proprietary drivers where it has to make a copy somewhere along the dmabuf -> EGLImage -> GL texture path, or if it was one of those memory cache flush things that depends on hw arch
12:39 pq: or something else
12:39 emersion: hm
12:39 emersion: i don't see anything about this in the spec, at least
12:40 pq: that would be far too easy :-)
12:41 pq: IIRC with EGL, one thing is that nothing says the import -> EGLImage -> GL texture path needs to be zero-copy, so it's allowed by omission (and some convoluted language around... image siblings??)
12:41 pq: *copy is allowed
12:46 pq: and the cache flush thing is just plain undocumented I guess
12:47 emersion: fun
12:48 pq: implicit transitions ftl
12:51 daniels: srsly
12:51 daniels: 'by omission' is the strongest argument other than 'this happens everywhere in practice' :P
12:51 daniels: e.g. image_external has explicit language specifying that sampled contents will be up to date wrt external updates at every draw, which other image extensions notably do not
12:53 pq: and that makes image_external implementations inefficient, too?
12:53 daniels: in some cases, yes
12:54 emersion: ah, so external is fine, but regular texture is not?
12:54 daniels: on drivers like etnaviv (pretty much the showcase for 'worst possible performance in any corner case' tbh) and vc4 where you may very well need to use a shadow copy for texturing, that means you'll do a source -> shadow copy before sampling for every draw, whether you need to or not, when using external
12:55 daniels: whereas TEXTURE_2D will only do the copy once at reimport
12:55 emersion: hm
12:55 emersion: copies copies copies
12:56 emersion: well it would still be nice to refresh the EGLImage, without having to re-import
12:56 pq: daniels, FYI, I think this discussion was triggered by https://lists.freedesktop.org/archives/dri-devel/2021-January/295677.html
12:57 pq: where people wanted to account dmabuf memory to processes and maybe even for OOM-killer purposes
12:57 emersion: i wonder what the vulkan spec has to say about this…
13:00 pq: vulkan has the explicit transitions to handle this, right?
13:00 daniels: emersion: AMD proposed an extension to add explicit invalidate+flush ops to images, which would solve this, but there was never any actual extension text published, so the extension implementation was pulled from Mesa
13:01 emersion: ah, yes, i remember this
13:01 emersion: was also useful to "commit" changes to a FBO tied to a DMA-BUF, iirc
13:02 daniels: cf. mesa e64b91e34aa04a137a322ae9444c1c603383c6d4
13:02 emersion: i didn't remember about the invalidate part
13:02 daniels: yeah, exactly - you invalidate as an EGLImage consumer to make sure you have an up-to-date view of external changes, and you flush as a FBO producer to make sure external consumers can have an up-to-date view of your changes
13:03 emersion: iirc they don't have plans to resurrect it ;_;
13:03 daniels: being as it came from AMD, my guess is that the primary motivation was to be able to use DCC on FBO targets, but then trigger a resolve on flush so you can share to external targets
13:03 daniels: no plans ttbomk
13:04 emersion: well, we'll just have to live with copies then
13:05 emersion: and re-imports
13:05 emersion: daniels: so, using external guarantees we don't have to re-import?
13:06 daniels: correct
13:06 emersion: hm, unfortunate we have external_only but not e.g. prefers_external
13:06 emersion: or external_is_not_ultra_slow
13:07 daniels: yeah, nothing ever _prefers_ external, but is_external_basically_free_compared_to_tex2d sure
13:07 daniels: that being said I've never observed any performance stalls or hitches from rebinding
13:08 emersion: ok, that's good. i think i've seen one with multi-gpu and nouveau
13:08 emersion: importing an intel dma-buf into nouveau seems to be slow
13:08 lynxeye: emersion: I don't have any spec text handy, but I don't think you need to reimport the EGLImage. It should be enough to specify a new texture from the image to make sure you get the latest content in the texture.
13:08 emersion: ah, that's interesting
13:09 emersion: so, keep the same EGLImage, but re-create the texture
13:11 emersion: daniels: do etnaviv and vc4 behave well when doing this? ^
13:11 lynxeye: emersion: Yep, that should work. It's the EGL sibling aka texture doing the copy if necessary. So by re-creating the texture you should get the behavior you want.
13:11 daniels: emersion: if you want to know how etnaviv works, listen to lynxeye and not me ;)
13:11 emersion: cool. that's probably something we should note in the thread, pq
13:11 emersion: ahah :)
13:11 emersion: but i thought you knew everything about everything!
13:12 daniels: that's just the marketing
13:12 emersion: clever
13:12 daniels: until someone asks too many questions :P
13:12 daniels: but yeah, I was going to say that I couldn't remember if you needed to drop the EGLImage itself or just drop the GL-side texture bindings and rebind as I've only just woken up - but we have our answer then :)
13:13 emersion: yup
13:15 pq: emersion, that's news to me. lynxeye, do other people agree with that interpretation? Is it Mesa-wide or even more general?
13:16 emersion:hopes for universe-wide agreement
13:16 pq: lynxeye, wait, what does re-create the texture mean exactly? Do I need to glDeleteTextures() in between?
13:17 emersion: i understand it as: call glEGLImageTargetTexture2DOES again
13:18 emersion: but maybe this isn't enough and we need to delete and re-generate the texture as well
13:19 pq: anyway, weston does the re-imports today, not only the glEGLImageTargetTexture2DOES call
13:19 emersion: yes, wlroots does the same
13:19 pq: maybe some proprietary driver did a copy on EGL import?
13:19 emersion: i was in fact looking at changing this a few days ago, good timing
13:20 emersion:doesn't really care :P
13:21 lynxeye: emersion: pq: When you create the new texture from the EGLImage you go through st_bind_egl_image in Mesa, which calls resource_changed, so if the driver implements this properly you'll get the updated content.
13:21 pq: hmm, could there be a difference between importing a dmabuf and importing an opaque wl_buffer (not dmabuf)?
13:22 pq: lynxeye, sorry, none of those names are something I would call in my program, so I don't understand. :-)
13:22 emersion: lynxeye: can we just call glEGLImageTargetTexture2DOES and re-use a previous texture?
13:22 emersion: or do we need to do it with a fresh texture?
13:23 emersion:doesn't know if re-using would even be valid EGL API usage
13:26 lynxeye: pq: those are frontend functions, so it's at least Mesa wide agreed behavior.
13:27 pq: lynxeye, yeah, but how does one trigger a call to it?
13:27 emersion: "create the new texture from the EGLImage" is not precise enough
13:33 lynxeye: pq: call glEGLImageTargetTexture2DOES
13:34 lynxeye: I have no idea if reusing an existing texture for this is valid usage.
13:35 ascent12: Seems like it would be pretty inconsistent with the rest of the OpenGL API if it wasn't valid to do that
13:35 ascent12: Can never underestimate them, though
13:40 daniels: if you look at GL_OES_EGL_image, it states that calling glEGLImageTargetTexture2DOES completely respecifies the texture ('Any existing image arrays associated with any mipmap levels in the texture object are freed (as if TexImage was called for each, with an image of zero size)')
13:41 daniels: it says that unconditionally, so that rebinding should never be a no-op, even if the (indirectly) named texture i.e. the current texture object for the texture target you specify, was previously bound to the same EGLImage
13:43 pq: cool, thanks!
13:44 pq: wonder why weston's code is like it is... it could be a lot simpler
13:44 daniels: np! and remember to blame lynxeye if Mesa doesn't live up to the spec :P
13:44 daniels: I'd say it's from a mix of an abundance of caution, lazy refcounting (if you tie texture obj + EGLImage lifetime together, it's one less thing to worry about), and broken ancient proprietary drivers :P
13:48 emersion: yeah, re-importing is definitely the simpler thing to do
13:54 MrCooper: jadahl: ^ how does mutter handle this?
13:54 jadahl: same as weston
13:54 jadahl: iirc
13:55 jadahl: i.e. creates eglimage from dmabuf, creates texture, paints textures, destroys texture, destroys image
13:56 jadahl: (was that the topic?)
13:57 emersion: does it do that on wl_surface.commit?
13:58 daniels: jadahl: it was yeah
13:58 pq: yeah, is that for every screen refresh even without client update?
13:59 jadahl: i'm missing more context, is this about client dmabufs or hybrid graphics?
13:59 pq: client dmabufs
14:00 jadahl: then it's on "applied" not "commit", which would be the same if there are no subsurfaces
14:00 emersion: ok
14:00 emersion: so yeah, same as everybody else
14:01 jadahl: i.e. commit() -> apply() { create_egl_image(); create_Gl_texture(); }, ... paint()...
14:01 emersion: jadahl: there's nothing wrong with this approach, but the tl;dr is that you could re-use the EGLImage if you wanted to (*not* the texture)
14:01 jadahl: what would one gain?
14:02 pq: less CPU overhead
14:02 emersion: avoid going through ioctls and driver logic
14:02 jadahl: ok, that seems like a good idea
14:02 emersion: but it doesn't seem like it's a real bottleneck fwiw
14:03 jadahl: maybe a "rainy sunday evening" activity then
14:03 emersion: i did see some 1-2 ms delays in my logs
14:04 emersion: but would need to do more precise measurements
14:04 jadahl: thanks for the tip anyhow, I have added it to my todo list
14:05 jadahl: I guess it's harder to apply the same optimization on gbm_surface's dmabufs. after handing the bo back to the surface I guess there is no guarantee whether it'll come back the same or not?
14:06 pq: jadahl, you can store user data with gbm_bo for exactly that purpose. Weston caches DRM FBs that way, IIRC.
14:07 jadahl: or maybe one can set user data on it...
14:07 jadahl: I guess one could cache an eglimage too
14:07 emersion:doesn't do gbm_surface anymore
14:09 jadahl: i have thought about stopping doing that too, not crucial enough to prioritize
14:21 MrCooper: FWIW, YaLTeR[m] on GIMPNet #gnome-shell measured ~0.5 ms for this with an otherwise mostly no-op frame update; don't know how much of that could be eliminated by caching the EGLImage though
14:22 MrCooper: (mostly no-op as this was for measuring when drawing finished to a direct scanout buffer)
14:23 emersion: would be interesting to measure the EGLImage creation time vs. the texture creation time
14:23 emersion: if most of these 0.5ms are spent in EGLImage creation, it would be worth it to ditch it
14:24 MrCooper: yeah, that could add up with multiple surfaces
14:24 MrCooper: to be clear, that's 0.5 ms CPU time
14:25 bnieuwenhuizen: I'd hope that typically most surfaces are fairly static?
14:26 MrCooper: well apparently Wayland compositors are currently doing this for each Wayland surface that needs to be drawn for an output frame
14:26 pq: ..only when the Wayland updated the buffer contents
14:27 pq: *Wayland client
14:27 MrCooper: surfaces may need to be drawn even if their contents haven't changed, may they not? E.g. for the gnome-shell overview
14:27 pq: yes, and in that case, the EGLImage is not re-created
14:27 MrCooper: gotcha
14:27 pq: not in Weston, at least, I believe
16:13 emersion: xexaxo1: why is the bug not caused by the EGL commit i mentioned?
16:14 xexaxo1: emersion: because our current kmsro approach is a hack to begin with ;-)
16:14 emersion: ah, ok. that i agree with
16:15 xexaxo1: the mentioned commit removes a silly thinko on my end - render-node is not a requirement for the extension.
16:15 emersion: xexaxo1: do you think an EGL client could discover the GPU → DC mapping automatically? perhaps by creating a BO on the render device and try to import it in KMS?
16:16 emersion: xexaxo1: hm, but the commit *adds* the requirement
16:16 emersion: it makes it so the render-node is required to expose an EGLDevice
16:17 xexaxo1: hmm you're right, need to have another look thing about it.
16:17 xexaxo1: s/thing/think/
16:18 xexaxo1: emersion: will have another look tomorrow and let you know.
16:19 emersion: np, thx
16:47 zmike: mareko: wow the drawoverhead numbers after rebasing onto your atomics work...
16:47 zmike: it's just a straight 50%+ increase
16:48 zmike: 😲
16:51 zmike: though I've somehow lost my perf edge for textures...
16:53 bnieuwenhuizen: time for a bisect :P
16:54 zmike: there's not really anything to bisect from what I can see, it's more or less just a normal rebase
16:55 zmike: not like the perf is bad/worse, it's just not magically beating the non-texture case anymore
17:37 MrCooper: zmike: bisecting in this case refers to the base commit you rebase on :) an alternative option I've used before is to use "git bisect start --no-checkout <bad> <good>" and then manually applying the diff to the commit it wants to test at each step
17:38 zmike: MrCooper: yeah, I meant that I've diffed between the pre-rebase and post-rebase and there was nothing of substance that seemed like it should've affected the pre-rebase weirdness that I had
17:38 zmike: so the most likely thing in this case is that the rebase just resolved the weirdness
17:39 phh: k
17:39 phh: oops, wrong window
17:40 jekstrand: airlied: Has lavapipe ever supported android?
17:43 daniels: jekstrand: 'why APL' - Chromebooks are basically the only viable Intel CI platform right now. we have a hard requirement on a fully-automated boot process: every run starts with power off, then you push it through the boot process through a known kernel (which can be upgraded at will), etc.
17:44 bnieuwenhuizen: aren't there faster Chromebooks though?
17:44 jekstrand: daniels: Ah. Yeah, if you're stuck in the chromebook space, using big-core gets expensive.
17:44 jekstrand: bnieuwenhuizen: Yeah, for >= $1k each
17:44 bnieuwenhuizen: ugh
17:45 daniels: jekstrand: if we pushed into non-Chromebook form factors (for Arm platforms we use a mix of Chromebooks and SBCs), we'd need to pretty drastically expand our physical space, as well as get much better cooling. but the main issue is that server parts don't have GPUs, plus you pay for the ability to control power etc with a _really_ long boot process.
17:45 daniels: for desktop parts (even SDVs) we're at the mercy of whatever your UEFI vendor does, which has an annoying tendency to force someone to go press F2 every now and again
17:46 daniels: absent that end-to-end automated control of the boot process, we'd need to have a janesma and a craftyguy to go physically babysit things, but even if that weren't essentially illegal these days, it would also hold Mesa hostage to UK working hours ...
17:47 jekstrand: daniels: Yeah, that makes sense. We've got auto-reboot and the ability to physically power-cycle but even that is flaky sometimes.
17:47 daniels: jekstrand: eh, $5-10k is nothing compared to the cost of developer time. a more pressing issue is the ability to source them ... there are some nicer ones I've seen around but we can't buy them in enough volume right now, just one or two at a time off Amazon, which is a problem when they inevitably fail
17:48 anholt: yeah. even internally, asking for 8 of a chromebook was like "well, I can get you 7 of not the platform you asked for, is that good enough? because that's all we've got"
17:49 FLHerne: daniels: You just need a remote-controlled device to press 'F2' :p
17:50 daniels: FLHerne: https://i.makeagif.com/media/7-28-2014/AnSME7.gif
17:51 daniels: mareko: ^ picking up on the above, realistically the up-front requirements are one machine as a DHCP + TFTP/NFS + gitlab-runner host, then 2-3 machines to actually execute the tests to begin with
17:51 daniels: mareko: (minimum 2 so you can have parallel jobs executing, plus 1 for redundancy)
17:52 bnieuwenhuizen: in my experience the hard part is having machines that work ok with lava and stuff
17:52 FLHerne: (but seriously, can't you get some kind of IP-KVM thing?)
17:52 bnieuwenhuizen: tried to set up a machine like that for a while but getting UEFI/grub to reliably do serial and boot over network is a mess on desktop PCs :(
17:52 daniels: mareko: anholt's bare-metal documentation is a great place to start if you have machines you can dedicate to Mesa CI; all the Collabora machines run LAVA as an intermediate scheduler instead, because we have the same DUTs executing Mesa CI, KernelCI, and also internal testing we do for customer projects, so LAVA lets us share them between all of the above
17:53 daniels: bnieuwenhuizen: srsly, and netconsole is just disastrous
17:53 daniels: FLHerne: sure you _can_ get some kind of IP/KVM thing, but that means you need to wait for someone to wake up to press the button, even if it is remote
17:53 bnieuwenhuizen: I basically have a bare-metal setup for radv using the rockpi we got for DXC but just couldn't get the boot process reliable enough
17:54 anholt: bnieuwenhuizen: even for cheza, we've got some "did you get this weird message? try rebooting again"
17:54 daniels: bnieuwenhuizen: huh, really? where are your fails?
17:55 daniels: anholt: on which note, this week we have LAVA doing auto-retries on misc bootloader wtf
17:55 anholt: \o/
17:55 bnieuwenhuizen: mostly if I enable network on linux then the boot from network fails in grub (and uefi too) on the next boot
17:56 bnieuwenhuizen: and couldn't get grub to work with my serial over usb cabling
17:56 bnieuwenhuizen: (the latter mostly relevant for lava, for bare-metal I can just use a fixed path)
17:57 bnieuwenhuizen: and tftp was too slow for me, sometimes randomly timing out
17:57 bnieuwenhuizen: http mostly worked but sometimes still times out
17:58 daniels: anholt: yeah, long overdue tbh, but got there ... the raw data I have says we're at <= 0.3% failures, so I'm not sure whether it's just clustering really hard or whether we are just executing that many jobs, but it has been annoyingly prevalent
17:58 daniels: bnieuwenhuizen: oh, I would extremely recommend not using GRUB and UEFI tbh
17:58 bnieuwenhuizen: eventually got it mostly reliable by just having an ssd and doing 2 boots: 1 to host linux to copy things to the boot partition and then using systemd to do a single boot to the test system
17:59 daniels: jfc
17:59 daniels: that's horrendous
17:59 bnieuwenhuizen: daniels: what is the alternative
17:59 bnieuwenhuizen: yes
17:59 daniels: bnieuwenhuizen: u-boot
17:59 bnieuwenhuizen: is that something you can install on a random pc?
17:59 daniels: the nice thing about GRUB is that it's a totally self-contained OS ... the awful thing about GRUB is that it's a totally self-contained OS
17:59 daniels: it's not something that you can install on a random PC, but it is something you can install on a RockPi64
18:00 bnieuwenhuizen: right but my rockpi64 was my management device, not the dut
18:00 daniels: _oh_
18:00 daniels: I did think it was rather perverse that you'd slaved a Radeon to a RK3399 :)
18:00 bnieuwenhuizen: no PCIe bus AFAIK
18:01 bnieuwenhuizen: and for Vulkan-CTS a beefy CPU is recommended ...
18:01 daniels: hm, I thought it did expose PCIE through one of the headers. but it's a pretty bad idea in any case.
18:02 bnieuwenhuizen: in the end I suspect a better UEFI impl might have helped, but there are zero ways to figure out if it will work pre-buying
18:03 daniels: I've yet to hear of a UEFI implementation shipped in something you can actually purchase which works well for constant unattended boots :\
18:03 anholt: daniels: I think we really are just running that many jobs these days.
18:03 bnieuwenhuizen: I hear the new GPUs have fully working GPU passthrough so might try something in a VM next time
18:03 jenatali: So I saw https://gitlab.freedesktop.org/mesa/mesa/-/issues/4171 is tagged as a release blocker right now. I put out a proposed fix but I'm taking off on a vacation tomorrow. I don't want to hold up the release but if my suggested fix doesn't work for people I'm not sure I'll be able to put together another one before 21.0 is supposed to release
18:04 daniels: anholt: brilliant. grunt definitely sticks out like a sore thumb, but even with rcn spending 3-4 weeks burning down the last remaining failures, we're still a bit over 0.1%, so at this point shrugging and retrying seems a lot better than surfacing DDR-training failures as pipeline failures
18:05 MrCooper: jenatali: looks fine to me at a quick glance, thanks!
18:05 daniels: dcbaker: ^ from jenatali
18:05 anholt: daniels: yeah. ours was "sometimes cheza doesn't like the quality of the power it gets at boot, possibly when the office is some specific temperature."
18:06 daniels: we got as far as strongly suspecting regulator/temperature correlation before deciding that it was (per above) cheaper to just buy more of them and add in more retries than spending another month bottoming it out :P
18:10 anholt: computers are so much cheaper than people
18:10 bl4ckb0ne: a bit late to the discussion but could IPMI help?
18:12 anholt: bl4ckb0ne: I don't see how.
18:12 anholt: does ipmi magically get you reliable (way under .1% failure rate) reboots?
18:12 daniels: bl4ckb0ne: on paper, yes. in reality, you only see it on server systems (huge, giant power/cooling requirements, also no GPU), or on laptops which even with IPMI are surprisingly unreliable through the boot process as they're not designed for unattended boots - plus you have to hope that IPMI gives you UART instead of VNC
18:13 jekstrand: airlied, bnieuwenhuizen: The VK_EXT_private_data implementation in radv is pretty busted. :-(
18:14 jekstrand: airlied, bnieuwenhuizen: So far I've seen that radv_queue and radv_physical_device don't derive from vk_object_base. I don't know what all others are missing.
18:14 jekstrand: In order for VK_EXT_private_data to work, EVERYTHING has to derive from `vk_object_base`.
18:15 bnieuwenhuizen: well, we can fix that
18:16 jekstrand: One way to ensure that it does is to drop RADV_DEFINE_*_HANDLE_CASTS and use the VK_ version instead and watch what blows up.
18:16 bnieuwenhuizen: is that available before the common dispatch stuff?
18:16 jekstrand: The VK_ versions will compile-fail if you don't derive from the base object properly. They also assert that you cast the right object type.
18:16 bnieuwenhuizen: also, hakzsam ^
18:16 jekstrand: bnieuwenhuizen: Yes, it is. If you fix it before that lands, it'll cause rebase problems but I can deal with that.
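
The compile-time and runtime checks jekstrand describes for the handle casts can be sketched roughly as below. This is only an illustration under stated assumptions: every `hypo_`/`HYPO_` name is made up for the example and is not Mesa's actual macro or struct layout. It assumes a C11 compiler for `_Static_assert`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object types, standing in for the real object-type enum. */
enum hypo_object_type {
    HYPO_OBJECT_TYPE_QUEUE,
    HYPO_OBJECT_TYPE_DEVICE,
};

/* Hypothetical common base that every driver object embeds first. */
struct hypo_object_base {
    enum hypo_object_type type;
};

struct hypo_queue {
    struct hypo_object_base base; /* must stay the first member */
    int family_index;
};

/* Defines <drv_type>_from_handle(). The static assertion fails to compile
 * if the struct has no `base` member or if `base` is not at offset 0; the
 * runtime assert catches a handle of the wrong object type. */
#define HYPO_DEFINE_HANDLE_CASTS(drv_type, obj_type)                     \
    _Static_assert(offsetof(struct drv_type, base) == 0,                 \
                   #drv_type " must start with its base object");        \
    static inline struct drv_type *drv_type##_from_handle(void *handle)  \
    {                                                                    \
        struct hypo_object_base *b = handle;                             \
        assert(b == NULL || b->type == (obj_type));                      \
        return (struct drv_type *)b;                                     \
    }

HYPO_DEFINE_HANDLE_CASTS(hypo_queue, HYPO_OBJECT_TYPE_QUEUE)
```

A struct without a leading `base` member would fail at the `_Static_assert`, which is the "watch what blows up" effect mentioned above.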
18:31 mdnavare: pq: Patches to enable VRR in mutter are work in progress
19:05 jekstrand:somehow managed to break all the drivers and has no idea how. :-(
19:05 zmike: this is a familiar scenario for me
19:19 jekstrand:wishes he had the ability to debug that was better than "run GitLab CI"
19:29 daniels: jekstrand: which driver do you need to debug?
19:29 jekstrand: daniels: It repros on LLVMpipe, just not on my machine. :-(
19:30 daniels: jekstrand: are you running in the same Docker container?
19:30 jekstrand: daniels: No, that's probably the next step
19:32 daniels: jekstrand: look at the top couple of lines of the job log to find the container name; go to user menu (top-right on GitLab web UI) -> settings -> access tokens; create yourself an access token with `read_registry` scope; provide that as the password to `docker login registry.freedesktop.org`; docker pull `$container_name_from_job_log`; docker run -ti $... -- /bin/bash
19:32 ajax: same cpu flags?
19:32 alyssa: what's the right syntax to cc 21.0 but not 20.x for a backport?
19:33 bnieuwenhuizen: alyssa: Fixes
19:33 alyssa: bnieuwenhuizen: There's no clear commit that it fixes though
19:34 dcbaker: Or cc: 21.0 $list
19:34 alyssa: Unless I just pick the "bump GL version" commit
19:34 alyssa: dcbaker: ack, thanks
19:34 bnieuwenhuizen: I think the cc syntax for a single version is not working anymore
19:34 bnieuwenhuizen: ah, thought it was dropped from the script last time I looked
19:34 dcbaker: It should work still
19:35 dcbaker: I keep meaning to replace it with "stable: version [version]", but never seem to find time
19:37 jekstrand: daniels: Will that have the build already in it?
19:37 jekstrand: daniels: If I pull the one from the llvmpipe run
19:38 bnieuwenhuizen: no
19:41 daniels: jekstrand: no - the container gives you the base running environment - it doesn't change unless we change the build dependencies (toolchain, LLVM, whatever)
19:41 daniels: jekstrand: the per-build bits come in through artifacts
19:41 daniels: jekstrand: if you look at the job log, you can see a download link to the artifacts, which is what we built in that pipeline
19:42 bnieuwenhuizen: I think you have to go to the job that builds llvmpipe though (meson-testing?)
19:42 daniels: yeah
19:43 bnieuwenhuizen: on the right there is a browse button and then you'll find an install.tar and that should contain the built driver
19:43 daniels: the browse button probably doesn't work, and you're better off just using the download button
19:43 daniels: (transient, should be fixed by next weekend's infra overhaul)
19:44 anholt: daniels: I've never had to make a token to pull registry images
19:44 jekstrand: Ok, I think I've got it maybe
19:44 anholt: https://docs.mesa3d.org/ci/index.html <-- docker instructions for reproducing builds, should be easy to translate for reproducing a test run
19:46 anholt: though the install.tar thing is a pain with current artifacts. need to move the rest of our artifacts to minio, at which point we won't have the stupid zip+tar to pull, plus you'll have a handy wget in the job log to copy and paste.
19:46 airlied: jekstrand: I thought I fixed the queue in my branch
19:46 airlied: and physical device get fixed by moving to common code I assume
19:48 anholt: hmm, why is radv so much faster at cts startup overhead than turnip? --min-tests-per-group 32 shaves 1/3 of the CTS runtime on turnip, but any value of that doesn't change runtime at all on my radv.
19:49 jekstrand: airlied: Yes. I wonder if there aren't more though
19:50 anholt: though, hmm. I'm on nfs for the turnip, wonder if nfs access is getting in the way of startup perf.
19:51 jekstrand: Now to install gdb in the container....
19:52 jekstrand: and... gdb doesn't work inside the docker image
19:53 daniels: how does it fail?
19:53 jekstrand: warning: Error disabling address space randomization: Operation not permitted
19:53 jekstrand: warning: Could not trace the inferior process.
19:53 jekstrand: I need to enable ptrace from the looks of things
19:53 anholt: jekstrand: interesting, just reproduced that here, never had that before
19:54 daniels: yeah, start with --privileged or --cap-add CAP_SYS_PTRACE (IIRC)
19:57 jekstrand: Ugh... No debug symbols.
20:00 jekstrand: Do any of the artifacts produce those?
20:01 bnieuwenhuizen: IIRC the meson-gallium one might be debug
20:02 jekstrand: Yeah, the 700k archive size says no. :)
20:02 bnieuwenhuizen: ... but of course doesn't upload the relevant artifacts ...
20:02 anholt: jekstrand: we strip going into the artifacts to avoid moving a ton of junk to the runners back when egress was a big deal
20:03 jekstrand: anholt: Yeah.
20:03 anholt: we now avoid stripping for the asan artifacts, I think once we land my ci-fd-traces branch and have more runners on minio-based artifacts download, we could just convert the x86 tests over to minio and then stop stripping entirely.
20:03 jekstrand: I've got some aco "ninja install" tests that are failing. Maybe I can repro that.
20:05 jekstrand: I suspect a GCC version difference. Everything works here. Nothing works in CI.
20:08 daniels: jekstrand: .gitlab-ci/meson-build.sh is what runs the build, so you can re-run with different params if you want to rebuild and see
20:08 jekstrand: I'm pulling the building image now
20:08 daniels: jekstrand: if you click through to the pipeline on the MR, the meson-testing job has set -x so you can see exactly what it's running
20:37 jekstrand: Of course, the ACO build test failure doesn't repro locally, even in the docker container.
20:37 jekstrand:isn't having a good afternoon
20:39 jekstrand: I guess now that I have a build, I can pull that into the runner image....
20:51 daniels: jekstrand: hmm, if it doesn't repro with the same build, I can give you temporary SSH credentials to the runner
20:52 daniels: this is the CPU profile of the x86-64 runners, if it helps any https://gitlab.freedesktop.org/-/snippets/1524
20:57 jekstrand: daniels: I'm trying something else quick
20:57 jekstrand: Well, not quick. None of this process is "quick"
21:07 jekstrand: Even with a build from inside the container and running lavapipe inside the container, it fails. :-(
21:08 imirkin: you mean fails to fail?
21:08 jekstrand: correct
21:08 imirkin: you're just doomed to success
21:08 jekstrand:wonders if there's a build flag he's missing
21:16 jekstrand: Ok, I've narrowed it down to one of two changes. :D
21:34 jekstrand: Yup, it's xexaxo1's fault. :-P
21:35 xexaxo1: jekstrand: you called?
21:35 jekstrand: xexaxo1: Looks like visibility("hidden") wasn't such a good idea.
21:35 jekstrand: xexaxo1: I have absolutely no idea why but it causes CI to blow up.
21:36 jekstrand: I can't reproduce any of the issues on my F33 system locally, even with the same docker containers.
21:36 xexaxo1: jekstrand: so the CI is broken ;-)
21:36 jekstrand: Maybe a difference in kernels? I have no idea.
21:37 jekstrand: In any case, none of the drivers using __attribute__((weak)) are using visibility("hidden") today so I'm not concerned about not having it in there.
21:37 xexaxo1: what exactly do you mean with "blow-up" btw?
21:37 jekstrand: It's super-weird
21:38 xexaxo1: story of my life :-0
21:38 jekstrand: Some of the entries in the entrypoints table that should be null aren't
21:38 jekstrand: Which is exactly the opposite direction I would expect. (-:
21:39 xexaxo1: Now that is just beyond weird.
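
For readers unfamiliar with the weak-symbol trick under discussion, here is a minimal sketch of a function table whose missing entries are expected to be NULL. It assumes GCC or Clang on an ELF platform such as Linux; the `hypo_` names are invented for the example and do not come from any driver. It does not reproduce the CI failure, it only shows the behaviour jekstrand expected.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*hypo_entrypoint)(void);

/* Weak declarations: the link succeeds even if nothing defines them. */
void hypo_cmd_draw(void) __attribute__((weak));
void hypo_cmd_trace_rays(void) __attribute__((weak));

/* Only one of the two entrypoints is actually implemented. */
void hypo_cmd_draw(void)
{
}

/* An undefined weak symbol resolves to address 0, so its slot is NULL. */
static const hypo_entrypoint hypo_entrypoints[] = {
    hypo_cmd_draw,
    hypo_cmd_trace_rays,
};

/* Calls entry `i` if it exists; returns 1 if called, 0 if the slot is NULL. */
static int hypo_call_if_present(size_t i)
{
    assert(i < sizeof(hypo_entrypoints) / sizeof(hypo_entrypoints[0]));
    if (hypo_entrypoints[i] == NULL)
        return 0;
    hypo_entrypoints[i]();
    return 1;
}
```

With this layout a caller can test a slot against NULL before dispatching, which is why entries that "should be null" turning up non-null is such a surprising failure.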
21:39 jenatali: We don't have any Windows CI tests outside of the D3D12 ones, right?
21:39 jekstrand: jenatali: I think there's a scons build test
21:40 jenatali: But does it run anything?
21:40 jenatali: Or just build?
21:40 jekstrand: build
21:40 jenatali: I'm wondering because !8783 blows up the D3D12 tests and it's probably just blowing up Windows as a whole
21:40 anholt: jekstrand: did you reproduce the bad build within the container in the end?
21:40 anholt: jekstrand: if you've got linker trickery, note that the docker images use ld.gold by default.
21:40 jekstrand: anholt: Nope. Even in the container, it passes for me.
21:41 jekstrand: And I'm using the build-mesa script
21:41 anholt: hmm
21:41 jekstrand: I guess it's possible I'm missing an environment variable
21:41 robclark: jenatali: I was just about to ask you about the d3d12 test.. is that fail actually *caused* by !8783 or is something else going on?
21:41 xexaxo1: jekstrand: it does sound like a really borked toolchain, or perhaps it's exposing another bug?
21:42 jenatali: robclark: I mean, it's part of CI, so I don't see what else could be going on... not like anything else snuck in
21:42 anholt: jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/ci/lint may help you figure out just what a build was using.
21:42 jenatali: robclark: I just built it and was about to try it locally
21:42 robclark: ok, thx..
21:42 anholt: jenatali: is there anything tricky about the aligned malloc function on windows? or do you have any >16-byte alignment tricks in your code?
21:43 jenatali: anholt: Not that I can think of...
21:44 robclark: anyways, I suppose we could also just revert the patch that broke things in the first place
21:44 jenatali: Ah, looks like heap corruption, aligned malloc needs an aligned free
21:44 airlied: jekstrand: nope for android
21:44 robclark: *ahh*
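
The pairing rule jenatali points out ("aligned malloc needs an aligned free") can be sketched as a small wrapper pair. This is a sketch only, with hypothetical `hypo_` names rather than Mesa's own helpers. On Windows, memory from `_aligned_malloc` must be released with `_aligned_free`, and handing it to plain `free()` corrupts the heap; on POSIX, memory from `posix_memalign` is released with plain `free()`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for posix_memalign under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>
#endif

/* Allocates `size` bytes aligned to `alignment` (a power of two).
 * Returns NULL on failure. */
static void *hypo_align_malloc(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#ifdef _WIN32
    return _aligned_malloc(size, alignment);
#else
    void *ptr = NULL;
    /* posix_memalign requires at least pointer-sized alignment. */
    if (alignment < sizeof(void *))
        alignment = sizeof(void *);
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
#endif
}

/* Must be used for every pointer returned by hypo_align_malloc(). */
static void hypo_align_free(void *ptr)
{
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```

Routing every release through the matching wrapper keeps the Windows and POSIX paths consistent, so a stray `free()` on an aligned allocation cannot slip through on one platform only.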
21:44 robclark: (btw, I suppose we should add some 32b CI)
21:45 anholt: we've got armhf thanks to vc4. not sure why that didn't catch this.
21:46 anholt: jenatali: oh, totally. thanks!
21:46 robclark: hmm, maybe llvm/clang tries harder to vectorize..
21:47 jekstrand: airlied: Well, you claim a VK_ANDROID extension. :-P
21:47 jekstrand: airlied: And you even moved it around in your patches and didn't seem to notice.
21:49 jenatali: anholt: Good timing on trying to get that in, tomorrow I wouldn't have been able to help for another week :P
21:50 bnieuwenhuizen: why is it that every time I look at that series the number of patches grows ...
21:52 jekstrand: bnieuwenhuizen: I think I really am done now apart from figuring out why turnip is busted
21:52 airlied: jekstrand: cut-n-paste, it might work on android though, I've never tested it
21:52 bnieuwenhuizen: what is the problem with android?
21:52 bnieuwenhuizen: I can test the Android-on-meson build if needed
21:53 airlied: whether lavapipe works on it, I assume nobody will care, I just left some extensions in my ext list
21:53 jekstrand: bnieuwenhuizen: I'm just giving airlied a hard time for advertising VK_ANDROID_native_buffer on lavapipe
21:54 airlied: it might even work :-P
21:54 airlied: the win32 port will probably get there first :-P
21:55 jekstrand: Has anyone tried to make lavapipe work on win32? That might be a good first step for the people trying to port radv.
21:55 jenatali: jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7208
21:55 bnieuwenhuizen: jekstrand: I don't even know which part of radv they need :P
21:56 bnieuwenhuizen: if it is just shader compilation stuff it might be very close already otherwise there needs to be some serious ABI stuff
21:57 jenatali: Yeah... getting that to work on a closed-source AMD Windows KMD would be hard, and porting the Linux kernel stuff would also be hard...
21:59 airlied: jekstrand: it's ongoing, the porter seems to be learning and making progress in the right direction
22:00 jenatali: airlied: Can you tell how close it is? Seems like it should be getting pretty close
22:00 bnieuwenhuizen: seems the remaining painful stuff is mainly drm_fourcc.h and amdgpu_drm.h inclusion
22:01 jenatali: bnieuwenhuizen: I was asking about the lavapipe one, not the RADV one :)
22:01 bnieuwenhuizen: ah, sorry :P
22:01 jenatali: I am following both with interest though :P
22:01 airlied: jenatali: yeah I think the missing loader interface should solve the last big issues
22:01 bnieuwenhuizen: there is someone actually porting lavapipe to windows?
22:01 airlied: now I think they just have to get the image on the screen
22:02 airlied: bnieuwenhuizen: yup that MR above
22:02 bnieuwenhuizen: heh, 2021 the year that mesa really breaks through on windows (combined with the hopefully larger d3d12 driver deployment)?
22:03 jenatali: Yeah, we're interested to open it up a bit more and keep pushing feature set
22:08 robclark: anholt: btw, there is a free() in `_mesa_destroy_context()` as well (but should only be error path)
22:08 robclark: (and I think you mentioned i965 uses ralloc.. which would already not work with `_mesa_destroy_context()`)
22:31 jekstrand: jenatali: What do you mean by "open it up a bit more"?
22:31 jenatali: jekstrand: The package we ship through the MS store only works on Photoshop right now unless you're in the Windows insider program
22:32 jekstrand: Ah
22:32 jenatali: Just to prevent making things worse for people - lack of GL might be better than broken GL
22:32 jenatali: But with some more testing we can remove that policy
22:32 jekstrand: Come on! Go for broke! :-P