00:09 alyssa: android-virgl-llvmpipe is way too slow
00:09 alyssa: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/37907988
00:09 alyssa: Marge failed the pipeline because that was 5 seconds over its 20 minute budget
00:10 alyssa: probably the time was spent waiting for a runner but still
00:47 DavidHeidelberg[m]: alyssa: daniels limited it to 20 minutes recently, feel free to bump it to 25 in the MR
00:47 DavidHeidelberg[m]: I would do it, but I'm on the phone
00:51 alyssa: I feel like we need a better approach to CI
00:52 alyssa: because this isn't scaling and not for lack of herculean efforts trying
00:56 alyssa: Maybe driver teams queueing up all their MRs for the week (that they've certified are good) and assigning as a unit to Marge
00:56 alyssa: so that the Marge queue is freed up for big common code changes that really do need the extra CI checks
00:57 alyssa: doesn't fix reliability but it saves resources, and the fewer pipelines you assign to Marge the fewer fails you'll see statistically
01:01 alyssa: if that queueing is happening at a team level (and not an individual level), then you sort out the rebase conflicts "offline"
01:01 alyssa: and only a single person on the team has to actually interact with the upstream CI (rather than the teensy subset that's needed for the team to develop amongst themselves)
01:02 alyssa: not to throw one member of the team under the bus, that role can (and probably should) rotate
01:03 alyssa: but, like, if I only had to interact with Marge once a fortnight, and Lina interacted with Marge once a fortnight, and asahi/mesa became a canonical integration tree that got synced upstream every week... that would eliminate a lot of the emotional burden, I think
01:03 alyssa: the associated problem with downstream canonical integration is that it can move review downstream too, which we don't want
01:03 alyssa: ideally review continues to happen out on the open in mesa/mesa, just doesn't get merged to upstream immediately
01:04 alyssa: one kludge way to do this is to use the "Needs merge" milestone for stuff that's nominally ready but needs to be queued up with other work from that team before hitting marge
01:04 alyssa: and then having some out of band way to sync that downstream for integration for the week
01:04 alyssa: a better way might be having branches on mesa/mesa for each driver team
02:20 bnieuwenhuizen: HdkR: should be the same on AMD
02:20 alyssa: I know freedreno has a perf dashboard, I guess they're abusing CI to feed it with data
02:20 Lynne: yup, minImportedHostPointerAlignment = 4096
02:20 HdkR: bnieuwenhuizen: ooo fancy
02:21 bnieuwenhuizen: of course major benefits if you use hugepage or similar (GPU really likes 64k pages)
02:21 alyssa: performance-rules has allow_failure so I guess I don't care for the purpose of this rant
02:21 Lynne: by host-map, I mean I map the host memory as a VkDeviceMemory and use it to back a VkBuffer
02:21 bnieuwenhuizen: Lynne: anything in dmesg?
02:21 Lynne: no, empty
02:21 alyssa: so... software-renderer, you're up... why is there an llvmpipe-piglit-clover job when we're not supporting clover and there's ALSO a rusticl job? why trace-based testing (see above issue)?
02:25 alyssa: layered-backends: do I even want to ask about the spirv2dxil job, how is that even in scope for upstream testing? virgl traces have the usual trace problems and have caused problems for me personally with correct NIR changes, what value is that providing to upstream Mesa to justify its inclusion in premerge? similar for zink traces?
02:25 bnieuwenhuizen: Lynne: maybe try some errno dumping? https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/commit/eb12053b213d91b2970d3043a8bb4c6540fb9554
02:26 Lynne: sure, give me a sec
02:27 alyssa: quite frankly, with my upstream Mesa hat on, I am NAK'ing checksum-based trace testing in pre-merge CI for any driver.
02:27 alyssa: I can do that now apparently XP
02:27 bnieuwenhuizen: also IIRC there is, stupidly, a kernel build option that needs to be enabled to make it work
02:27 bnieuwenhuizen: thought it was kinda default-ish, but at least ChromeOS at some point managed to disable it
02:27 Lynne: "host ptr import fail: 1 : Operation not permitted"
02:28 bnieuwenhuizen: thx, let me check
02:30 Lynne: err, apparently it's not ordinary ram
02:31 Lynne: it's actually device memory
02:31 bnieuwenhuizen: oh that wouldn't work
02:31 Lynne: yeah, I thought so, my fault, thanks
02:32 alyssa: The bottom line is "some company paid to have this in Mesa CI" is fundamentally an UNACCEPTABLE reason to put something in upstream CI
02:32 alyssa: because it's a cost that EVERYONE pays
02:32 alyssa: for every item in CI
02:33 alyssa: For any job in CI the cost the community pays to have it needs to be measured against the benefit the community gains from it
02:33 alyssa: and if the cost exceeds the benefit -- as it does in the case of a number of the jobs I mentioned above -- it does not deserve to be in premerge
02:33 alyssa: even if there's a billion dollar corporate sponsor for the CI coverage
02:34 Lynne: I wouldn't be able to detect if an address is device memory, by any chance, right?
02:34 bnieuwenhuizen: failure to import? :P
02:34 bnieuwenhuizen: but no, not really
02:34 alyssa: it is a simple cost-benefit analysis, and if the private bigcorp reaps the benefit while the commons pays the cost.. that's unacceptable
02:34 Lynne: or could vkGetMemoryHostPointerPropertiesEXT be changed to return !VK_SUCCESS?
02:35 bnieuwenhuizen: we could try to do an import there
02:35 bnieuwenhuizen: I think the other weird case is mmapped files, I don't think those are supported either
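For context, the path being debugged above is the VK_EXT_external_memory_host import: query which memory types can back the host pointer, then chain an import struct into the allocation. Below is a minimal sketch of that flow; the helper name and the assumption that the pointer and size are already aligned to minImportedHostPointerAlignment (4096 on the device discussed) are illustrative, not taken from the conversation.

```c
#include <vulkan/vulkan.h>

/* Sketch: import a host allocation as VkDeviceMemory via
 * VK_EXT_external_memory_host. host_ptr and size are assumed to be
 * aligned to minImportedHostPointerAlignment. */
static VkDeviceMemory
import_host_ptr(VkDevice device, void *host_ptr, VkDeviceSize size)
{
    PFN_vkGetMemoryHostPointerPropertiesEXT get_props =
        (PFN_vkGetMemoryHostPointerPropertiesEXT)
        vkGetDeviceProcAddr(device, "vkGetMemoryHostPointerPropertiesEXT");

    /* Ask which memory types can back this pointer; a mapping the driver
     * cannot import (device memory, an mmapped file, ...) fails here or
     * at allocation time. */
    VkMemoryHostPointerPropertiesEXT props = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT,
    };
    if (get_props(device,
                  VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
                  host_ptr, &props) != VK_SUCCESS || !props.memoryTypeBits)
        return VK_NULL_HANDLE;

    /* Allocate device memory that aliases the host allocation. */
    VkImportMemoryHostPointerInfoEXT import = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
        .pHostPointer = host_ptr,
    };
    VkMemoryAllocateInfo alloc = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &import,
        .allocationSize = size,
        /* First permitted memory type, for brevity. */
        .memoryTypeIndex = (uint32_t)__builtin_ctz(props.memoryTypeBits),
    };
    VkDeviceMemory mem = VK_NULL_HANDLE;
    if (vkAllocateMemory(device, &alloc, NULL, &mem) != VK_SUCCESS)
        return VK_NULL_HANDLE;
    return mem;
}
```

The resulting memory can then back a VkBuffer created with VkExternalMemoryBufferCreateInfo, which is the "host-map" Lynne describes; a pointer into VRAM or an mmapped file is exactly the kind of mapping where the import is expected to fail.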
03:20 alyssa: Aahahahaha and the pipeline failed because of a trace job flaking
03:21 alyssa: how many times do i need to say that trace jobs cannot be in premerge testing
03:24 alyssa: Trace-based testing. does not. belong in upstream premerge
03:24 alyssa: Every bit of premerge CI coverage is a cost that EVERYONE pays
03:25 alyssa: and unless there's a matching benefit in return, it is a burden on EVERYONE and needs to go
03:25 alyssa: and given that the value proposition of checksum based trace testing is essentially nil, I see no reason to keep it.
03:26 alyssa: do what you want post-merge but this is an unacceptable burden for the community to bear
03:45 lina: alyssa: Another advantage of having asahi/main as integration point is we could probably add our own custom CI without having to worry about its stability being an issue for other teams ^^
03:45 lina: (Like once we actually have runners)
03:46 lina: Or some other branch specific to us
03:50 lina: Like if I have some runners in my closet it's probably good enough for us, but I don't want to be responsible for breaking CI for everyone if my internet goes down ^^;;
03:53 alyssa: lina: Responsible. I appreciate that :)
03:53 HdkR:sweats in pile of ARM boards
04:05 Lynne: if you've got too many of them, you can turn them into fabulous doorstop bricks very easily if they're called "rockchip" and carry the number 3399; just call dd
05:10 lina: Actually, would it make sense to gate pre-merge CI on the tags?
05:11 lina: Like only run CI specific to the drivers affected
05:11 lina: And then full CI can run periodically on the main branch
05:15 alyssa: lina: Pre-merge CI is gated on the files updated
05:16 alyssa: mesa/.gitlab-ci/test-source-dep.yml controls that
05:16 alyssa: tags aren't used since they get stale easily
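As a rough illustration of that file-based gating, a GitLab CI rule keyed on changed paths looks like the sketch below; the job name and path globs are hypothetical, not the actual contents of mesa/.gitlab-ci/test-source-dep.yml.

```yaml
# Hypothetical illustration of gating a job on changed files; the real
# rules live in mesa/.gitlab-ci/test-source-dep.yml and are more involved.
.llvmpipe-test-rules:
  rules:
    # Run when the driver sources or the CI definition itself change...
    - changes:
        - src/gallium/drivers/llvmpipe/**/*
        - src/gallium/auxiliary/**/*
        - .gitlab-ci.yml
      when: on_success
    # ...and skip the job for everything else.
    - when: never

llvmpipe-piglit:
  extends: .llvmpipe-test-rules
```

Path globs follow the tree as it moves, whereas tags would need the same kind of maintenance by hand, which is presumably why the file-based approach won out.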
05:36 daniels: DavidHeidelberg[m]: I limited it to 20min as it usually completes in 8min; 25min is in no way normal or good
10:17 Newbyte: This page links to the Mesa issue tracker as the place to report bugs: https://docs.mesa3d.org/bugs.html
10:17 Newbyte: But the link is a 404. What gives?
10:20 ccr: due to spammer issues the issue tracking is currently set to project members only (afaik)
10:20 Newbyte: thanks
10:29 psykose: it makes them invisible to non-members too :D
10:50 ccr: unfortunately, but I'm sure someone is working on a better solution.
11:43 DavidHeidelberg[m]: daniels: I agree, my point was that if something went wrong and it takes around 21-23 minutes, it's still a better compromise to have 25 instead of the 1h we had before, if that leads to the job finishing
11:45 DavidHeidelberg[m]: daniels: as I'm looking into the Daily and the failure rate, maybe we should disable it for now + I'm thinking about moving alpine and fedora into nightly runs, since figuring out a build failure isn't that hard and it happens only rarely
14:37 jenatali: alyssa: re mingw, some downstream folks apparently wanted that - not us fwiw
14:38 jenatali: Re spirv2dxil, it's a compiler only job. Its main purpose was to stress the DXIL backend when fed Vulkan SPIR-V, but now that Dozen is more mature we can probably retire it
14:41 daniels: DavidHeidelberg[m]: yeah but honestly it’s just hiding actual root causes and making it harder to solve the actual problem
14:42 DavidHeidelberg[m]: what I was thinking is moving our daily threshold for reporting jobs to 15 and 30 minutes, instead of 30 and 60 minutes (enqueued etc.)
14:43 DavidHeidelberg[m]: so we would see it. I have to agree with Alyssa that it's so annoying from a developer POV; how much we care about it doesn't really matter, they should have passing Marge pipelines no matter how we get there
15:42 daniels: sure, but at some stage it's unusable anyway - we could accept that jobs take 3-4x the runtime, which probably means making the marge timeout 3h, and at that point we can only merge 8 MRs per day
16:43 cheako: Do ppl know that issues were removed from gitlab/mesa?
16:44 hch12907: you need an account to access them, I think
16:44 cheako: good catch, but no I'm logged in.
16:45 ccr: only available to project members, i.e. people with a certain access level, due to recent spam issues.
16:53 cheako: I'm trying to provide more information, I can wait/when should I try again?
16:58 daniels: cheako: opened them up now
17:02 cheako: :)
17:40 alyssa: jenatali: yep, I am aware that the mingw job wasn't you
17:40 alyssa: whether I'm happy about it or not, the windows jobs have earned their place :p
17:41 alyssa: (the VS2019 ones)
17:42 alyssa: which is why I was wondering what benefit it *was* providing
17:43 jenatali:shrugs
17:46 daniels: the vmware team do most of their work on top of mingw
17:47 alyssa: OK, I don't think I recalled that
17:49 alyssa: So then the question is: what benefit is there to the job (i.e. what issues will it catch that the combination of Linux GCC + Windows VS2019 will not catch), what cost is it to premerge CI, and how much of that benefit could be recovered with some form of post-merge coverage (likely almost all of it, because build failures are easy for the relevant stakeholders to deal with, and given that there is Windows CI it should be a rare event to see a mingw-only failure)
17:50 jenatali: I'd be inclined to agree, post-merge seems more appropriate
17:50 alyssa: benefit measured as P(legitimate fail in mingw | windows vs2019 passes AND gcc linux passes)
17:51 daniels: I'm not sure that post-merge has any more value than just not having it ever, because all that happens is that you get used to seeing that stuff has failed and ignoring it
17:51 alyssa: The question is who is "you"
17:51 daniels: either way, I've disabled the job for now as it's broken in some kind of exotic way
17:51 alyssa: If the "you" is "Alyssa", then that seems... fine? I don't do anything that's liable to change mingw in interesting ways, and from an upstream perspective mingw is not something we're committed to supporting, just committed to not kicking out of the tree.
17:51 jenatali: Right, if there's no stakeholders, then nobody will ever fix it, and post-merge is the same as never running it
17:52 alyssa: If the "you" is "an interested mingw stakeholder", say VMware, then if the coverage is getting them benefit, they will monitor the post-merge and act appropriately
17:52 anholt_: post-merge is, effectively, me. I've got plenty of chasing CI already, no thanks.
17:52 anholt_: (in the form of the nightly runs)
17:52 alyssa: and if that mingw stakeholder doesn't care then... if there's no benefit in premerge or postmerge then there's no benefit to having the coverage full stop and it should just be removed
17:52 jenatali: daniels: Want to ping lygstate for the mingw fails? I think he cares
17:53 daniels: jenatali: oh, thanks for the pointer
17:53 jenatali: (I'm still mostly on vacation, just happened to see a relevant topic for me in the one minute of scroll back I read)
17:54 zmike: jenatali: go vacation harder!
17:54 alyssa: I guess that's my point. If there is a stakeholder who cares, then they will monitor the nightly run and act accordingly.
17:54 alyssa: if there's no stakeholder who cares, there's no value in the job running at all, and.. that's fine?
17:54 jenatali: I'm sitting in a hotel lobby waiting to go to the airport lol. I've vacationed hard enough
17:54 anholt_: alyssa: there is no mechanism for nightly alerting.
17:54 zmike: jenatali: oh okay, proceed
17:54 anholt_: it would be great if there was
17:54 alyssa: anholt_: ugh. I see.
17:55 alyssa: to be clear "anholt_ monitors all the nightly mingw jobs" is not the proposal and NAK to that because that's a terrible idea
17:56 APic: Uh huh.
17:56 anholt_: +1 to deleting clover job. It was introduced when rusticl was first landing and "make sure we don't break clover" seemed more reasonable. On the other hand, I don't think I've seen it flake.
17:58 HdkR: How soon until it is +1 to deleting Clover? :)
17:59 anholt_: I'm +1 to deleting clover right now.
17:59 alyssa: same here
17:59 daniels: srs
17:59 DavidHeidelberg[m]: 🎊
17:59 anholt_: but the rusticl dev has been hesitant until feature parity
17:59 anholt_: (which, afaik, is close)
17:59 alyssa: if "clover is deleted" is the only thing that comes out of this burnout fuel hell weekend
17:59 alyssa: still a net positive
17:59 alyssa: :p
18:00 DavidHeidelberg[m]: Can someone update https://www.mesa3d.org/? It still says "Current release: 22.3.7". Anyway, Clover will stay in 23.0, which is not that far off anyway
18:01 HdkR: mesamatrix doesn't track clover versus rusticl features, I'm sad :P
18:02 daniels: alyssa: we also now have shared runners which aren't being DoSed by some impressively resourceful crypto miners
18:03 alyssa: shitcoin really does ruin everything it touches
18:14 DavidHeidelberg[m]: before the fate of Clover is fulfilled, can we agree on decreasing the load by dropping the clover CI jobs? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19385#note_1818805 If yes, I'll prepare an MR so we can make a small amendment to the CI and remove three jobs
18:14 daniels: DavidHeidelberg[m]: sure, sounds good
18:15 alyssa: DavidHeidelberg[m]: ++
18:21 DavidHeidelberg[m]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21865
18:31 DavidHeidelberg[m]: haven't thought yet about HOW to do it, but when we take a farm down, we should skip running its CI jobs. Only run them on bringup.
18:38 jenatali: Yeah, would really be nice if there was a way to avoid running hardware CI jobs for unrelated config changes, like bumping a Windows container image...
19:19 airlied: anholt_: the clover job preexisted rusticl
19:20 airlied: by a long time
19:20 airlied: and it has caught some llvmpipe regressions
19:21 airlied: now rusticl will eventually catch them, just not sure it does yet
19:25 anholt_: airlied: yeah, misread a commit. you're right.
19:28 eric_engestrom: DavidHeidelberg[m]: https://gitlab.freedesktop.org/mesa/mesa3d.org/-/merge_requests/163 merged, website will be updated in a couple of minutes
19:30 DavidHeidelberg[m]: Thank you :)
19:31 APic: ☺