00:18 DavidHeidelberg[m]: zmike: btw. adding flakes for HL2 engine traces, they all flake often on zink :(
00:19 DavidHeidelberg[m]: not sure if new or it was always like that, but these days it's pretty common: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8436#note_1831487
00:21 zmike: DavidHeidelberg[m]: those weren't flakes though, those were real bugs hitting my MRs until I fixed them
00:24 DavidHeidelberg[m]: oopsie. Thanks for the feedback, there is likely a bug inside the flake reporting logic to fix.
00:27 DavidHeidelberg[m]: zmike: on the other hand, does that mean the traces caught something, or was it multiple jobs failing?
00:27 zmike: traces catch bugs
00:39 DavidHeidelberg[m]: (heartwarming to hear that)
00:42 clever: does anybody here know amdgpu well? I've written a prometheus exporter based on radeontop, and can now clearly see my issue: "vram" fills up instantly, but there is still a gig of "gtt" available
00:42 clever: if i could adjust the balance between those 2 pools, i could get more out of this hw?
01:00 robclark: zmike: yeah, for glthread
01:01 robclark: binhani: was hoping that tlwoerner would respond.. I think he is still involved w/ gsoc/evoc.. I've been out of the loop on that for a few years
01:05 robclark: zmike: I am not sure how much we want to use glthread.. but making the frontend part of shader compile async is useful for some games, it seems.. (OTOH just getting disk_cache async store working for android might be enough)
01:05 zmike: robclark: I'm not sure that's required for tc? this is basically a stream uploader that remains mapped async for copies between glthread + driver thread
01:38 jenatali: Huh cool
01:41 clever: that windows tool is more like dtrace on a mac, it gets syscalls for every process on the entire system, and you then have to filter it down to something useful
02:15 mareko: robclark: the latter needs to handle a new PIPE_MAP flag
02:40 robclark: mareko: hmm, we do use slab allocator.. I didn't notice any reference to PIPE_MAP_x flag in docs about the caps
09:09 MrCooper: clever: VRAM generally performs much better than GTT for GPU operations (and the latter takes away from system memory, it's not a separate pool), so the former filling up before the latter is expected
09:20 clever: MrCooper: the issue is that chrome is using a decent chunk of VRAM, and i think the gpu is exhausting all VRAM when i launch a game
09:20 clever: if i have chrome running when i launch a game, the game only gets 1fps, but if i close chrome and restart the game, it runs fine
09:23 clever: ah, but i went into the chrome task manager, and killed things based on gpu mem usage, and now usage is just low enough that they can co-exist
09:25 clever: until i enter a room with too much detail, then it slows to a crawl once more
09:45 MrCooper: clever: unless Chrome keeps drawing as well in the background, its BOs in VRAM should get evicted out of VRAM in favour of the game's in the long run
09:45 clever: i could try SIGSTOP, that would halt all usage of the BOs
09:51 clever: MrCooper: oh, is it possible to list all BOs and where they currently live?
09:51 clever: along with size
09:52 MrCooper: AFAICT /sys/kernel/debug/dri/0/amdgpu_gem_info is pretty much that
09:55 clever: ah, perfect
09:56 clever: it's even listing the metric exporter i tied into prometheus, which has zero BOs
09:56 clever: and if i parse that dump, i could graph how much vram and gtt each pid is using
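(A minimal sketch of that kind of parse, assuming the debugfs dump has "pid N command foo:" headers followed by per-BO lines of the form "0x00000001: 2097152 byte VRAM ..." as printed by amdgpu_bo_print_info(); the exact format may differ between kernel versions:

    sudo awk '
      /^pid/      { pid = $2; seen[pid] = 1 }   # header line: "pid  1234 command Xorg:"
      /byte VRAM/ { vram[pid] += $2 }           # per-BO line: "0x...:  <size> byte VRAM ..."
      /byte GTT/  { gtt[pid]  += $2 }
      END { for (p in seen)
              printf "pid %s: %.1f MiB VRAM, %.1f MiB GTT\n", p, vram[p]/2^20, gtt[p]/2^20 }
    ' /sys/kernel/debug/dri/0/amdgpu_gem_info

Feeding those per-pid totals into the exporter would give the per-process graphs mentioned above.)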
09:57 clever: X is listed multiple times..., because it has multiple drm handles open
10:03 MrCooper: that is because the DRM file descriptor for DRI3 clients is opened by X; there are pending patches which will name the DRI3 clients instead
10:04 clever: unix socket fd passing?
10:04 MrCooper: yep
10:08 clever: MrCooper: oh, that reminds me, while copying code from radeontop, i noticed that getting the drm magic# from the render node fails with a permission error
10:09 clever: drmGetMagic()'s ioctl
10:09 clever: which kind of makes the whole point of that moot
10:10 MrCooper: the magic stuff isn't needed in the first place with render nodes
10:11 MrCooper: rendering ioctls work by default with them
10:11 clever: ah
10:11 clever: so it's only the display nodes that need magic, to mediate control over output ports
10:11 clever: and render nodes anybody can use, but they just get a BO as a result and can't directly display it
10:14 MrCooper: more or less, right
10:26 clever: definitely looks like ctrl+z helps
10:26 clever: vram dropped when i did that, and went back up upon fg
10:38 MrCooper: clever: is this in a Wayland or X session?
10:43 clever: X session
11:15 pac85: I recently opened an MR with some changes to Gallium which triggered a ton of CI stages and I got a lot of failures. Now because those seemed unrelated I ran the CI on the main branch (plus a commit that adds some comments to trigger the ci stages)
11:15 pac85: https://gitlab.freedesktop.org/antonino/mesa/-/commit/542d25cccc1e719fc4879d00612a113a1655910c
11:17 pac85: There are several failures across various drivers: crocus-hsw has 2 failures, freedreno has 44 on one device, and anv has some as well
11:18 pac85: Now if I understand correctly the CI runs every time something is merged so this shouldn't be possible right?
12:01 zmike: you ran the manual jobs that don't run on merges
12:01 zmike: you aren't supposed to run those
12:13 pac85: Oh I didn't know that. Thank you!
12:23 danvet: javierm, for the nvidia thing I did a series but didn't get around to respinning it yet :-(
13:03 javierm: danvet: yeah, we remembered that with tzimmermann and were discussing it yesterday
13:07 danvet: I need to get around to that :-(
13:07 javierm: danvet: since it only affects nvidia in practice, I guess it's hard to give it a high prio
13:08 javierm: there are some patches from your series that I think could land though, like the ones for ast and mgag200
13:08 javierm: tzimmermann: ^
13:47 zmike: mareko: I still need details on the exact failure you're seeing with KHR-GL46.gpu_shader_fp64.fp64.state_query
15:02 MrCooper: robclark: sysfs is generally writable for root only, so not usable for general purpose display servers
15:19 dj-death: gfxstrand: random question, it seems the global memory loads/stores used to implement ubo/ssbo loads/stores don't deal with null descriptors, does that sound correct?
15:25 gfxstrand: dj-death: It handles them when robustness is enabled because the buffer size is zero and so everything is OOB
15:29 dj-death: I see thanks
15:30 dj-death: looks like it's going to be my problem with descriptor buffers :/
15:30 mareko: robclark: PIPE_MAP_THREAD_SAFE
15:40 zmike: is https://registry.khronos.org/webgl/sdk/tests/webgl-conformance-tests.html the actual way to run webgl cts?
15:43 anholt_: I believe so
15:43 anholt_: (when receiving a bug report from folks that cared about the cts, that's the root of the url I got)
15:44 anholt_: and if you end up debugging a single case, you want something like https://registry.khronos.org/webgl/sdk/tests/deqp/functional/gles3/shaderoperator/unary_operator_01.html?filter=shaderop.unary_operator.pre_decrement_effect.lowp_uint_vertex
15:59 jenatali: daniels: What kind of stress test were you thinking for !22034?
16:00 daniels: jenatali: .gitlab-ci/bin/ci_run_n_monitor.py --target 'jobnameregex' --stress
16:00 daniels: add --sha REV (or --pipeline ID) if it's not HEAD you want to test
16:01 jenatali: Ah I see
16:01 gfxstrand:so wants to rewrite the RADV image code but I'm too afraid to
16:01 jenatali: I've not used that yet because using the UI to click play on the Windows build jobs automatically limits to the drivers I care about :P
16:01 zmike: surely there's nothing more urgent demanding your time
16:10 zmike: anholt_: is it intentional that deqp-runner doesn't work with --deqp-log-images=disable / --deqp-log-shader-sources=disable ?
16:15 anholt_: I haven't considered ever wanting to do that.
16:15 anholt_: "be able to make some sense of rare, flaky results" has always been a priority.
16:15 gfxstrand: dj-death: Why do descriptor buffers make it harder?
16:15 zmike: mm
16:15 zmike: in a run where you know there will be lots of failures, writing all the outputs will end up increasing the test time exponentially
16:16 anholt_: I think you'd need to back up and explain what problem you're really trying to solve here.
16:17 zmike: I'm trying to solve the problem of running cts CLs that are broken and not being able to because writing all the outputs takes literal hours and consumes my entire disk
16:18 anholt_: I'm asking why you need the status of running a large subset of the CTS if it's broken?
16:18 zmike: because I'm involved with CTS development and running tests is part of the process?
16:20 anholt_: I guess you could make the cts allow repeated arguments that override each other. or special-case in deqp-runner to drop the defaults if you have an override present. but I still don't understand why you need to run some large fraction of some massive set of tests that are all failing.
16:20 anholt_: usually people dealing with some big set of failing stuff will carve off a specific subset to test, or use a --fraction, or something.
16:21 anholt_: like, are you planning on tracking thousands of xfails as you develop?
16:21 anholt_: I really don't get the usecase.
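(For context, a hedged example of the kind of invocation under discussion; the binary, caselist, and fraction are placeholders. --fraction runs only every Nth test, and anything after the bare -- is passed straight to the deqp binary, which is where the log-disable options above would have to end up:

    deqp-runner run --deqp ./deqp-vk --caselist vk-default.txt --output results/ \
        --fraction 10 \
        -- --deqp-log-images=disable --deqp-log-shader-sources=disable

Whether those trailing options actually take effect is exactly what the rest of this exchange is trying to pin down.)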
16:23 zmike: well currently I can't even establish a baseline
16:24 zmike: the goal is to be able to determine the patterns of tests that are failing so that they can be fixed, but I can't actually run all the tests (in deqp-runner) because of the previously-mentioned issues
16:26 anholt_: ok, well, I've given a whole bunch of ideas here. go for it.
16:27 zmike: alrighty
16:50 zmike: anholt_: as an alternative, is there a reason why deqp-runner couldn't just add the user's -- options after the deqp-runner internal opts? I think that would solve this case neatly
16:50 zmike: (though obviously it would also allow users to footgun)
16:54 anholt_: actually, looking at the code, deqp-runner doesn't add any image or shader logging args
16:55 Sachiel: they are on by default
16:55 anholt_: yeah
16:55 zmike: it seems to add those args at deqp_command.rs:311
16:55 zmike: err 313
16:56 zmike: and 322
16:56 zmike: will test if moving the user args is enough to resolve it
16:57 anholt_: I'm looking at 313 and that's "deqp-log-filename"
16:57 anholt_: 322 is "deqp-shadercache-filename"
16:57 zmike: yea I'm guessing specifying those overrides the user options
16:58 zmike: or something
16:58 anholt_: the user options you asked about were " --deqp-log-images=disable / --deqp-log-shader-sources=disable"
16:58 anholt_: which are not those.
16:58 zmike: I dunno, I'm just speculating why the user opts wouldn't be working based on the code there
16:59 anholt_: ok, I'm going to stop engaging with this conversation until you do some actual investigation instead of speculating.
16:59 zmike: sounds good
17:00 dj-death: gfxstrand: have to decode the RENDER_SURFACE_STATE from the shader
17:01 dj-death: gfxstrand: not impossible, just added the additional "is this the null surface" check
17:02 dj-death: gfxstrand: maybe we never want to use A64 with descriptor buffers?
17:03 gfxstrand: dj-death: Oh, right...
17:04 gfxstrand: dj-death: I think we have to for things like 64-bit atomics
17:04 gfxstrand: Unless those got surface messages when I wasn't looking
17:06 heftig: does anyone know whether https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21829 might get into 23.0.1?
17:09 zmike: anholt_: okay, deeper investigating reveals this might be a cts runner issue and not deqp-runner at all; sorry for the noise
17:30 binhani: tlwoerner: I was informed that you might be involved with GSoC related projects. If that's true, would you guide me to what projects are currently available for this GSoC
17:31 alyssa: embarrassing how quickly my mesa ci appreciation report is filling up
17:31 alyssa: everyone else is welcome to add to it too, right now it looks like it's just me who keeps trying to merge broken code and getting told off by Marge
17:31 alyssa:sweats
17:32 robclark: anholt_: re: deqp-runner, a --give-up-after-this-many-failures=N type thing might be useful.. since I've seen CI runtimes get really long when an MR is more broken than expected
17:33 anholt_: robclark: hard to use for CI, though. Think about when we uprev the cts, and lots of new fails show up and you need to categorize them.
17:34 anholt_: taking really long should usually be limited by the job timeouts, which should be set appropriately already (but may not be in all cases)
17:34 robclark: hmm, that is kinda a special case, and I guess you could just push a hack to disable that option when shaking things out
17:34 anholt_: and job timeouts are much better at getting at the thing you're concerned about, though they mean that you don't get the results artifacts.
17:35 robclark: I guess timeouts need to be a bit conservative because they can indicate unrelated issues
17:35 robclark: anyways, just an idea..
17:36 daniels: yeah I was thinking about --give-up-after-n-fails and I think it is good - you're right to say that it's painful when you're doing uprevs, but you can just enable it for marge jobs and not user/full jobs
17:37 robclark: yeah, marge vs !marge would work
17:37 daniels: there's a bit of a tension with gitlab-side job timeouts, as we do want to allow those to be longer so in case of machine issues (won't boot, randomly died, network stopped networking, etc) we can retry the test runs without losing our DUT reservation
17:38 daniels: but just giving up early on deqp if things are really really broken and saying 'errr idk try it yourself perhaps' is definitely helpful
17:38 anholt_: daniels: yeah, that's why for bare-metal I've got the TEST_PHASE_TIMEOUT
17:38 daniels: hmm, I'm sure I've seen a630 blow through like 45 minutes on jobs which just crash every single test
17:39 anholt_: though it ends up being real embarrassing when the reason you keep rebooting to retry your job is that there was a minute of testing left and you decided that the board must be hosed.
17:39 daniels: we do have that on LAVA as well, but I've had to keep creeping it up because runtimes keep creeping up and then you start introducing false fails
17:39 daniels: yeah
17:39 anholt_: we've been getting pretty sloppy on keeping actual test phase runtimes down
17:39 alyssa: there's an interesting chicken/egg question here ... should you only assign an MR to Marge if you're really really sure it's going to pass (probably yes), and if so, should you do a manual pipeline before assigning something to Marge (unsure)?
17:40 alyssa: (re: more broken than expected)
17:41 anholt_: my opinion is yes, you shouldn't hand something to marge unless you've got a recent green pipeline.
17:41 alyssa: OK
17:41 daniels: anholt_: tbf part of that is our measurements being totally shot due to rubbish servo UART, but gallo is working on SSH execution; we've got a working PoC
17:41 alyssa: Of the entire pipeline or just the relevant subset?
17:41 daniels: I don't mind passing stuff to Marge that definitely looks like it should be OK and not super risky
17:42 anholt_: daniels: what's the plan for getting kernel messages while also doing ssh execution?
17:43 daniels: anholt_: still snoop UART for kmsg, but use SSH to drive the actual tests
17:43 anholt_: interesting
17:43 anholt_: sounds like something that might have sharp corners, but good luck! would be lovely to have the boards more reliable
17:43 robclark: alyssa: I've definitely done the misplaced ! or similar where I expected a green pipeline but instead broke the world
17:44 anholt_:wonders if this could include a heartbeat in kmsg so we know when uart dies
17:45 robclark: fwiw console-ram-oops is useful for getting dmesg.. but after the DUT does warm reboot so not sure how to usefully slot that into CI as uart replacement (for kernel msgs)
17:45 daniels: anholt_: hmm right, you mean just so we can emit 'btw uart died so you might be missing any oops'?
17:45 anholt_: that's what I was thinking
17:45 daniels: robclark: the problem here is that we're relying on UART to actually drive the testing machinery, so if UART goes off a cliff (which all servo-v4 seems to do), then we no longer know what deqp's doing, so we just time out and kill it
17:46 robclark: yeah
17:46 dj-death: gfxstrand: thanks, I always forget about 64bit atomics
17:46 alyssa: robclark: yeah.. I think there's a social issue around the expectations on CI and on developers using CI, and I'm unsure how we want to resolve it... When you don't have any CI you're (ostensibly) more likely to do your own deqp runs ahead of time before `git push`ing crap... When you do have CI it's really easy to say "well, if CI is happy so am I" and assign plausible-looking code to marge.
17:47 alyssa: this isn't CI's fault, but I think there might be some mismatched expectations
17:48 gfxstrand: dj-death: Sorry
17:48 gfxstrand: dj-death: The good news is that it's not a common case so if the code is a bit horrible it's not the end of the world.
17:51 robclark: alyssa: CI is useful in that it can run CTS across more devices than I could manually in a reasonable amount of time.. I just try to (a) if I expect some trial/error, do it at a time when marge isn't busy (or at least has MRs in the queue that look like they won't be competing for the same runners), and (b) if I realize I broke the world, cancel the job to free up runners
17:51 alyssa: robclark: sure, the "across more devices" is big for me, I don't even *have* all the panfrost hardware we run in CI lol
17:52 alyssa: fwiw - i'm not trying to be argumentative here. it's just that I don't think we've communicated/documented a clear expectation for what Marge workflows should look like for developers.
17:53 alyssa: (case in point: I only learned about the "run manual jobs satisfying regex" script, like, last week)
17:57 daniels: we tried to make it as well documented as the rest of Mesa :P
17:57 MrCooper: alyssa: I think your Marge appreciation issue helps counter-act the human mind's tendency to focus on the negative, thanks for that
17:57 daniels: (more seriously, the docs do need updating, yeah)
18:02 alyssa: MrCooper: that's the idea :)
18:03 alyssa: MrCooper: also humbling as hell, because these days I try not to assign anything to Marge that hasn't been appropriately reviewed and that I'm not reasonably sure is correct
18:03 alyssa: and yet, still manage to fill up that thread pretty quickly o_o
18:05 robclark: alyssa: I think it is pretty normal to miss/overlook things.. think of CI as `Reviewed-by: GPU` ;-)
18:06 MrCooper: that's what we have the CI for :)
18:06 robclark: yup
18:49 alyssa: + * Copyright 208 Alyssa Rosenzweig
18:49 alyssa: damn i must be old
18:51 airlied: or parallel universe you
18:53 FLHerne: or the code travelled back from the future, where they've reset the year numbering
18:54 kisak: Should I ask where the other 207 Alyssas went?
18:55 kisak: nevermind, I don't want to know the answer
18:58 ccr: alyssa, sounds like seriously legacy code :P and you must've invented copyright!
19:00 alyssa: I've messed with time before and there haven't been any noticeable consequences!
19:01 ccr: or so it would seem ... * glances around *
19:03 alyssa: lina: asahi spdx conversion MR up
19:03 alyssa: we'll see what happens I guess
19:04 qyliss: . o O ( does this mean the code is accidentally public domain )
19:04 alyssa: qyliss: seems legit
19:05 alyssa: the fact there's still code in Mesa that I wrote in high school amuses me greatly
19:06 psykose: it just means it's really good
19:08 alyssa: think
19:09 alyssa: if i could go back in time i'd tell my high school self to use genxml
19:11 psykose: i can think of far better things to tell my high school self
19:12 alyssa: oh, i mean. same.
19:20 alyssa: admittedly the initial checkin of asahi was pretty bad and that was genxml
19:22 alyssa: granted that was a driver merged barely 4 months after I first got my hands on the hardware, with no hw docs or reference code, while in school and doing panfrost
19:22 alyssa: so I guess I can forgive myself for hardcoding some things :p
20:13 dj-death: gfxstrand: ah, chasing what I thought was a compiler bug for a while
20:14 dj-death: gfxstrand: but apparently you can set nullDescriptor=true and robustBufferAccess=false
20:16 dj-death: looks like we need the internal NIR robust handling to implement null descriptors support with global loads
20:17 alyssa: womp womp
20:33 gfxstrand: dj-death: Oh, that's entertaining. :-/
23:09 mareko: zmike: the shader of the glcts test is: https://pastebin.com/raw/WtgwAiB7 ; The problem is that gl_Position is written by the VS but not read by the TCS, which causes the linker to eliminate the gl_Position write, which makes all VS uniforms inactive, but the test expects all of them to be active, which is incorrect
23:10 zmike: mareko: ok, should be an easy fix then
23:49 mareko: zmike: In theory, what the test is doing is setting an output value based on a bunch of uniforms. There is an optional optimization in my plans ("uniform expression propagation") to eliminate that output by moving the whole uniform expression into the next shader (if the expression doesn't source any phis, though in theory we could move whole branches into the next shader). That will ruin the test even
23:49 mareko: if the dead gl_Position write is fixed.
23:57 mareko: tarceri: do you think it's feasible to do this at the end of gl_nir_link_varyings? It's about outputs storing a value from load_ubo: moving a UBO load from one shader to another as an optimization, i.e. copying the UBO declaration to the next shader stage so that st/mesa will correctly bind the UBO in both shader stages automatically