07:03dolphin: airlied, agd5f: I was grepping for some register values from i915 and xe using "drivers/gpu/drm" as a path and came across "drivers/gpu/drm/amd/pm/powerplay/inc/polaris10_pwrvirus.h". Is there some backstory for the file, it seems pretty much a binary blob in the source to me? Maybe it should be in firmware repo instead?
07:28airlied: dolphin: not sure what it is used for, but it's just a register programming sequence in a table, not a binary
07:32dolphin: pwr_virus_section3 too?
07:33dolphin: seems like a blob to me for sure
07:49airlied: good question on what it is programming into the hw
07:49airlied: agd5f: ^
07:49airlied: it used to be just a long sequence of reg writes, but maybe it's writing some microcode
07:52MrCooper: the name seems clear, it's a power virus ;)
08:42tzimmermann: section3 unleashed the power virus!
10:41mupuf: enunes[m]: Howdy, there seem to be some network issues in your farm: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/43664062#L1141 / https://gitlab.freedesktop.org/mesa/mesa/-/jobs/43664298#L2153 / https://gitlab.freedesktop.org/mesa/mesa/-/jobs/43669017
10:41mupuf: I would say this is quite likely a routing issue, especially since similar issues happened earlier today in France
10:41mupuf: so... not much you can do about that
10:42mupuf: but if they don't disappear soon, I'll have to disable the farm
11:04enunes-: mupuf: I'll look now
11:04enunes-: mupuf: it seems that my network dropped at some point for a short time overnight, so it might be something with that
11:05mupuf: I would be surprised if that were the case... but if it is, maybe a reboot would help?
11:05enunes-: that is a strange error, I haven't seen that before
11:07mupuf: right, but FYI, the same issue happened this morning in France
11:08mupuf: to many users, all from the same ISP
11:08mupuf: so... it isn't just you
11:11enunes-: well... what I can quickly do is reboot the router to reconnect to the ISP, probably not much else indeed
11:34enunes: mupuf: still not great it seems, maybe it is better to disable it for today, and I'll also take the downtime to do some pending updates on it
11:35mupuf: enunes: yeah, sounds like a good idea
11:36enunes: I can send a MR for it if you don't have one ready yet
11:42mupuf: enunes: please do :)
11:46alyssa: stop doing amdgpu
11:47alyssa: power viruses were never meant to be programmed
11:47alyssa: wanted to amdgpu anyway? we had a tool for that. it was called r200.
11:52mupuf: alyssa: rofl
13:19alyssa: haswell is... crocus? or iris? or both?
13:19alyssa: seemingly crocus?
13:19alyssa: like mostly sure crocus, cool
13:20alyssa: unfortunately crocus doesn't build on arm64. errrrg
13:20alyssa: that's fine, I didn't want to read Intel assembly anyway
13:32kisak: Haswell is Intel gen 7.5, yes, that's crocus.
13:45q66: <alyssa> stop doing amdgpu
13:45q66: i wish
13:47q66: maybe when intel upstreams xe kmd so i can actually use it on non-x86 hardware
14:01alyssa: gfxstrand: this is odd.. the shader you sent me is actually *helped* for instruction count on midgard
14:03alyssa: and we go deeper into the rabbit hole..
14:08alyssa: i do not understand haswell vec4 asm
14:11jfalempe: tzimmermann, did you have a chance to look at my mgag200 DMA v2 patches ?
14:11tzimmermann: jfalempe, sorry not yet
14:11tzimmermann: it's been busy recently :(
14:12alyssa: well, I can reproduce the shaderdb change now. moo.
14:13jfalempe: tzimmermann, ok, no problem, let me know if it can still be improved ;)
14:16alyssa: how is virgl still failing
14:18agd5f: dolphin, airlied, it's used to tune the voltage frequency curve on individual boards. IIRC, it's not firmware. It's some sort of pattern data sent to power validation hardware, which runs tests with the patterns; the results of those tests are then used to tune the curve on each board so it's stable across varying silicon. I don't remember all of the details offhand.
15:01alyssa: ValueError: could not convert string to float: 'top-down'
15:02alyssa: How do I do Intel shader-db reports?
15:04alyssa: did a grep abomination
15:04alyssa: gfxstrand: reworked lower_vec_to_regs, haswell vertex shaders on my shader-db seem happy https://rosenzweig.io/lol.txt
15:04alyssa: not actually runtime tested though so could be totally broken, but you know
15:07alyssa: Midgard is less happy https://rosenzweig.io/lolmidg.txt
15:07alyssa: close enough that I'm happy with that hit though
15:23alyssa: should the lima lab be marked as offline? seems to be struggling this morning
15:24gfxstrand: alyssa: Did you push?
15:24alyssa: gfxstrand: not yet
15:24alyssa: got distracted
15:24alyssa: and am now wondering if I should mark the lima lab offline
15:24jenatali: I saw discussion about that in the scrollback this morning
15:25alyssa: oh, so you did
15:25jenatali: enunes said he was going to send a MR to take it offline
15:28enunes: yes, there it is https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23623
15:29alyssa: :+1:
15:30alyssa: nothing else in the marge queue triggers lima jobs so yes that should do the trick
15:31alyssa: definitely silly that farm disabling runs full premerge CI on every other farm though..
15:31alyssa: mupuf: could the FARM online/offline boolean live in src/vendor/ci/.yml instead, to avoid that silliness?
15:31alyssa: It loses the niceness of the bools for every farm being together but, meh?
15:32alyssa: DavidHeidelberg[m]: ^^
15:33jenatali: Some of them can, but some (e.g. the "Microsoft" farm aka the Windows builder/runner) can't
15:33alyssa: jenatali: Hm
15:33alyssa: Why not?
15:33alyssa: Why can't that be in src/microsoft/?
15:33jenatali: I mean, I guess they could, but it wouldn't make sense
15:33jenatali: If it was just the test runners then sure, but it's also the build runner
15:34alyssa: src/microsoft/ci/gitlab-ci.yml contains the gitlab-ci yaml source for microsoft's CI
15:34alyssa: makes perfect sense to me?
15:34alyssa: sure it's morally a backronym, but
15:34jenatali: I think we're the only one in that situation though where our "farm" is more than just test runners
15:34alyssa:shrug
15:34alyssa: I'm thinking even just from a psychological perspective
15:35DavidHeidelberg[m]: alyssa: yes, yes, I was thinking about that last week: moving it into ci-farms.yml in the root
15:35alyssa: Days where a farm needs to be disabled are days where people are already stressed (from jobs failing from the farm that needs to be taken down)
15:35DavidHeidelberg[m]: to avoid triggering the whole farm CI when disabling. For enabling, some .gitlab-ci.yml should then be touched to check all the jobs (because they haven't been tested while the farm was off)
15:36DavidHeidelberg[m]: I was thinking about how to make it automagical: off without a pipeline, on with a pipeline.
15:36jenatali: That would be excellent
15:36alyssa: ..adding a 30 min stall in there where nothing gets merged when people are already stressed is, suboptimal
15:36alyssa: DavidHeidelberg[m]: eyes
15:36DavidHeidelberg[m]: alyssa: haha, yeah....
15:37alyssa: ("people" includes both the users of CI and the maintainers of it, I imagine)
15:37DavidHeidelberg[m]: 2x yeah...
15:37alyssa: DavidHeidelberg[m]: If you're working in that area, the other question is how the farm disable commit should actually get merged
15:38alyssa: In particular, if there are a pile of MRs already assigned to marge and some of them would trigger jobs on the broken farm
15:38alyssa: the "unassign everything, assign disable MR, reassign everything" manual dance is silly
15:38alyssa: the "just assign to the end" means a day's marge queue is wasted
15:38DavidHeidelberg[m]: I was thinking also about doing - include: farms.yml@different-repo
15:38DavidHeidelberg[m]: then we could alter it without unassigning marge
15:39DavidHeidelberg[m]: but it's an inconvenience to go to another repo for that
15:39alyssa: and the "push directly, one MR will get shot down but it was going to fail and the next MR will be fine" technique is socially undesired
15:39enunes: I wonder if we considered having something at the runner side that the CI scripts would check to see if the job needs to run at all
15:39alyssa: How would that cross-repo include work to make sure reenabling has a pipeline?
15:40enunes: so an admin with access to the runner could flip a switch there and the job would just skip
15:40jenatali: David Heidelberg: What if we had both? Then you could push disables to a separate repo, while enqueueing a secondary disable to mesa (sequenced behind all other MRs). Re-enabling then touches the mesa repo too
15:40enunes: so we wouldn't need to merge "set the lab to offline" commits at all
15:40alyssa: enunes: That has the usual problem that, when the farm is back and flipped back on, random unrelated pipelines will start failing if anything regressed while offline
15:41DavidHeidelberg[m]: yes yes, a full pipeline (or at least all jobs on the related farm) needs to be run
15:41DavidHeidelberg[m]: but before that, the enable phase could do something like "all farms off, except the one that gets enabled"
15:41alyssa: DavidHeidelberg[m]: I always kinda wondered if we could have a monotonically increasing integer "death_count" on each farm
15:42DavidHeidelberg[m]:wonder if he should put CO2 meter badge on our Mesa3D CI farm :D
15:42alyssa: The mesa/mesa rules would hardcode a check "if death_count <= 27: run pipeline, else skip"
15:42alyssa: When a farm dies, the admin increases the integer on the farm side to 28. So now everything is skipped.
15:43alyssa: When the farm is back, the admin needs to MR against mesa/mesa changing the check to "death_count <= 28", going through the regular pipeline
15:43alyssa: and then presumably the actual check logic is nicely abstracted in the yaml/bash halls of hell, so the actual mesa/mesa side is just the usual 1 line commit "LIMA_FARM: 27" -> "LIMA_FARM: 28" or whatever
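A minimal sketch of what that check could look like in GitLab CI rules. The variable names are invented, not Mesa's actual CI, and since rules:if only supports string comparison, this version uses an exact match rather than the "<=" check described above:

```yaml
# Hypothetical sketch of the "death_count" idea; names are invented.
variables:
  LIMA_FARM_EXPECTED_DEATH_COUNT: "27"   # bumped via a normal MR when the farm comes back

.lima-farm-rules:
  rules:
    # LIMA_FARM_DEATH_COUNT would be set farm-side (e.g. as an instance-level CI variable);
    # if the farm has died since the last MR-side bump, skip its jobs
    - if: '$LIMA_FARM_DEATH_COUNT != $LIMA_FARM_EXPECTED_DEATH_COUNT'
      when: never
    - when: on_success
```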
15:44alyssa: this probably has some weird side effects for running CI on the stable branches
15:44alyssa: but that's wholly uncommon so maybe stable branch would just be exempted as part of the branching off a release process
15:45alyssa: eric_engestrom: ^^ you'd be affected by that if you want to tell me why I'm being silly and this is a terrible idea actually =D
15:45DavidHeidelberg[m]: when you release into stable, you SHOULD always wait for all farms to be ready to test
15:45alyssa: right, yeah
15:45DavidHeidelberg[m]: you don't want to release something which had half of the testing off :P
15:46alyssa: =D
15:47alyssa: DavidHeidelberg[m]: by the way, it's not clear to me if the CI itself has gotten better lately (vs my habits on how to use CI have changed, vs me social engineering myself with the appreciation report is working) ... but my perceived CI signal:noise ratio is a LOT higher than it used to be
15:47alyssa: so thank you CI team ^^
15:47DavidHeidelberg[m]: sergi: gallo koike ^ :)
15:48DavidHeidelberg[m]: it got a bit better I think
15:48jenatali: +1
15:48mupuf: alyssa: it's definitely better
15:49jenatali: I merged a nir change yesterday on the first try, that made me happy
15:49mupuf: I guess fewer big uprevs too?
15:49gfxstrand: \o/
15:49DavidHeidelberg[m]: :D we trained developers to be happy even when the stuff merges :D
15:49DavidHeidelberg[m]:laughing his ass off
15:50mupuf:ran a stress test of 1000+ jobs and got a failure rate of 0.5%
15:51DavidHeidelberg[m]: koike wrote really nice reporting, so at some point, when we added the most offending flakes that showed up from time to time, reliability increased. It was a million flakes, each one only taking a hit once in a while, but... summed up, it was almost every job
15:51mupuf: Ran for 4 days continuously, on three steam decks
15:51gfxstrand: mupuf: Is that across the entire CI or one runner?
15:51mupuf: Just the steam deck runners
15:51mupuf: At my home
15:51alyssa: mupuf: that's kinda exactly the problem though
15:52alyssa: with 100 jobs with a failure rate of 0.5%, we'd expect 40% of pipelines to fail if retry isn't enabled
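The arithmetic behind that estimate, assuming independent job failures:

```latex
P(\text{pipeline fails}) = 1 - (1 - 0.005)^{100} \approx 0.39
```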
15:52mupuf: Oh yeah, of course! I wasn't satisfied with it
15:53mupuf: There is more work needed
15:53mupuf: But distributed test farms over the internet are harder to make super reliable
15:53alyssa: (That effect is presumably what the daily reports show every day... 99% jobs passing but half of pipelines failing)
15:54mupuf: Yep... But retries aren't the solution either: we need to retry on infra failures
15:55alyssa: (I appreciate the difficulty of the problem. I am not going to attempt to find solutions because I am chastised every time I try. But I do know arithmetic.)
15:55mupuf: That's doable, I guess, but it requires some work on Marge to detect that we got to the point where actual code was tested
15:56mupuf:will replace gitlab runner soon-ish
15:56mupuf: Should speed up the startup sequence, reduce the number of moving parts, and allow me to add more network resiliency
15:57daniels: I don't like the retries either, but realistically given that people just hit marge with a hammer until a merge occurs, it's better having those followed by an automated script which just goes around finding what the flakes are and auto-merging those into expectations
15:58mupuf: True
15:58mupuf: The most important asset is developers' trust
15:59mupuf: Without it, the system is fully useless
16:16DavidHeidelberg[m]: I'm thinking about ON/OFF farm logic: 1. definition: a .ci-farms/ directory with a .ci-farms/$farm_name file per farm; 2. execution: if .ci-farms/$farm_name changed, always run that farm's jobs; 3. if .ci-farms/ changed, never run; otherwise, if .ci-farms/$farm_name exists, always run
16:16DavidHeidelberg[m]: so, if we enable a farm, it gets run (the $farm_name file exists now, so that's a change)
16:17DavidHeidelberg[m]: other farms won't run, because of 3.: .ci-farms/ changed
16:17DavidHeidelberg[m]: in the normal state (without changes), the last option applies: .ci-farms/$farm_name exists, so it'll run
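A minimal sketch of how those rules could map onto GitLab CI for a farm named "lima"; this is only an illustration of the priority order described above, not the actual MR:

```yaml
# Hypothetical illustration of the scheme above, for a farm named "lima".
# Rules are evaluated top to bottom; the first match wins.
.lima-farm-rules:
  rules:
    # 2. this farm's marker file changed (e.g. it was just re-enabled): run its jobs
    - changes:
        - .ci-farms/lima
      when: on_success
    # 3a. some other farm was toggled in this MR: don't run this farm's jobs
    - changes:
        - .ci-farms/*
      when: never
    # 3b. normal state: run only if the marker file exists (farm is enabled)
    - exists:
        - .ci-farms/lima
      when: on_success
    - when: never
```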
16:18anholt: anyone wish your deqp-runner output was cleaner when you did piglit? https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/811
16:18DavidHeidelberg[m]: something like: https://paste.sr.ht/~okias/5c6fe8c210814734d7c109f01617dd48f11d20ac
16:24jenatali: David Heidelberg: Sounds reasonable to me
16:39mupuf: DavidHeidelberg[m]: why not just check for the existence of the file?
16:40mupuf: If anholt is not there, then don't run anything. If it's there, run
16:40DavidHeidelberg[m]: mupuf: when we send a commit to re-enable the farm, we just want the pipeline to check that farm; we can leave the others untouched
16:40mupuf: Hmm... OK. Well documented, that would work indeed
16:41DavidHeidelberg[m]: Yeah, it must always be a separate commit, but I think it'll be nice and fast.
16:42mupuf: :) builds will be harder to disable though...
16:43DavidHeidelberg[m]: yeah, I think for start I'll keep the builds
16:43mupuf: Well, I guess we could run them only when there are changes outside of the farm folder
17:11DavidHeidelberg[m]: Have to go, but here is draft: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23629
17:11DavidHeidelberg[m]: I'll need to test it a bit, but "could work"
17:12DavidHeidelberg[m]: for some reason the austriancoder farm jobs pop up even when it's not enabled, but I'll look into it
17:13austriancoder: funny
17:19mupuf: DavidHeidelberg[m]: I would suggest renaming to $farm.disabled
17:20mupuf: This way, we don't typo the name when adding it back :D
17:21mupuf:needs to split valve farm into two: KWS and mupuf
17:21DavidHeidelberg[m]: Sure, maybe moving to .ci-farms-disabled?
17:21mupuf: Oh, better!
17:22DavidHeidelberg[m]: (I think the CI syntax would get more complicated when filtering the .disabled files)
17:23DavidHeidelberg[m]: (damn, I wish we could use Matrix or something. I love retroactively fixing my typos)
17:51alyssa: gfxstrand: I am questioning whether aggressive vec_to_moves/regs is a good idea
17:51alyssa: it eliminates moves, sure
17:52alyssa: but it also spikes register demand (-->spilling) since it means random scalars that eventually get collected into a short-lived vector become channels of a long-lived vector
17:52alyssa: and none of the vec4 backends can split live ranges in their RA
17:54alyssa: at least this is the case with midgard
17:54alyssa: maybe intel/vec4 skirts around that somehow
17:58gfxstrand: alyssa: intel/vec4 skirts around it by basically never having to spill in vec4.
17:58gfxstrand: alyssa: We have 256 vec4 registers so pressure is almost never an issue.
17:59alyssa: ah..
17:59alyssa: that's cheating
17:59alyssa: :p
17:59gfxstrand: It also means that the ALU is massively under-utilized because it could be doing 16 threads, not 2, but c'est la vie.
17:59alyssa: right
17:59gfxstrand: There is a pass which helps with this somewhat
18:00gfxstrand: nir_move_vec_src_uses_to_dest
18:00gfxstrand: It's a poor-man's value numbering kinda thing
18:00alyssa: midgard runs that right before vec_to_movs, same as intel/vec4
18:00gfxstrand: To try and avoid having scalars and vectors live at the same time
18:01gfxstrand: Does the more aggressive coalescing hurt midgard bad?
18:02alyssa: gfxstrand: it's not in the blocking path, but see https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23624
18:06gfxstrand: ouch
18:07gfxstrand: alyssa: I dropped you more shader-db stats for commit messages. I think I'm happy with the impact. It looks like most of the noise from the series as a whole comes from reworking things like PTN and TTN to not depend on writemasks. IMO, that's totally fine.
18:08alyssa: :-D
18:09alyssa: yeah..
18:09alyssa: It's not like PTN/TTN shaders are in any hot paths anyway
18:09gfxstrand: Yeah
18:10gfxstrand: The delta across the Intel patch to flip the switch is just noise now that vec_to_regs is being aggressive again
18:10alyssa: "again"?
18:13gfxstrand: alyssa: vec_to_movs pushed writes up. You weren't doing that before and that led to a bunch of minor regressions. Now that the new pass is also pushing stuff up, the regressions are gone.
18:15gfxstrand: Oh, one other thing RE register pressure. The Intel vec4 back-end doesn't RA per-component. Every NIR def/reg gets a whole vec4 whether it needs it or not (or 2 if it's a dvec3/4).
18:15gfxstrand: Intel vec4 is dumb...
18:16alyssa: doh
18:16alyssa: yeah, that'd do it then
18:17alyssa: Midgard allocates registers at byte-granularity, with full per byte liveness tracking
18:17gfxstrand: Yeah, Intel vec4 is dumb
18:17alyssa: oh the midgard compiler is dumb in a lot of ways
18:17gfxstrand: But, hey, it supports tessellation shaders with FP64 so...
18:17alyssa: but it makes excellent use of the register file
18:17alyssa: (by bruteforce, mostly)
18:17gfxstrand: hehe
18:18alyssa: I fondly remember you marking up that paper when I was in first year and not being able to finish it :~)
18:18alyssa: also a comment to the effect of "too much detail, this isn't homework"
18:19gfxstrand: lol
18:19gfxstrand: That may have been a thing past me said
18:58jenatali: Huh, Intel's Windows Vulkan driver apparently doesn't display anything on monitors that aren't directly connected to it. That's fun
19:17gfxstrand: Color me unsurprised.
19:17zmike: what color is that
21:49karolherbst: jenatali: I'm sure you are the first and only person hitting this issue
21:53DavidHeidelberg[m]: jenatali: have you triggered the Windows build in my MR? :D just not sure if I broke the rules or you just wanted to test that it works
22:04jenatali: David Heidelberg: I didn't trigger anything
22:05DavidHeidelberg[m]: damn, ok I'll check the pipeline
22:18jenatali: David Heidelberg: I think the problem is that the container builds can't use the same rules as the build/test jobs
22:18jenatali: The containers are supposed to be auto for Marge / post-merge, manual everywhere else
22:19DavidHeidelberg[m]: jenatali: I have the fix
22:19jenatali: The build and test jobs are supposed to be auto all the time, and it's just their dependency on the containers that keeps them from running
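A rough sketch of that split, with invented job names and simplified rules; Mesa's real Windows CI rules are more involved:

```yaml
# Hypothetical sketch of the container/build split described above.
.windows-container-rules:
  rules:
    # auto for Marge and post-merge pipelines...
    - if: '$GITLAB_USER_LOGIN == "marge-bot"'
      when: on_success
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
      when: on_success
    # ...manual everywhere else
    - when: manual

windows-msvc-build:
  needs: [windows-container]   # invented job name; this dependency is what actually gates it
  rules:
    # always "auto" on its own; it only runs once the container job has been built
    - when: on_success
```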
22:19jenatali: Ok cool
22:19DavidHeidelberg[m]: jenatali: the trigger container you have depends on the Win farm devices. And you define it, so when I say "if changed, go `always`", it... always goes `always` :D
22:20jenatali: Yeah
22:20DavidHeidelberg[m]: jenatali: https://paste.sr.ht/blob/86eb9f173e5162256b2a2c166fa746edfd48058c
22:20DavidHeidelberg[m]: this could be the fix
22:20DavidHeidelberg[m]: with changing the reference for the trigger job to farm-manual-..
22:21jenatali: That'll break marge
22:21jenatali: It can't be always manual
22:21jenatali: Well, probably anyway. I don't know enough about all of this :)
22:22DavidHeidelberg[m]: thanks, you're right.
22:22jenatali: Hence my original comments about our stuff being special because our "farm" being offline isn't just some tests to skip
22:35DavidHeidelberg[m]: jenatali: I copy-pasted the container, adjusted it + added the rest of the MS rules
22:36DavidHeidelberg[m]: the offending rules never get executed if we go through the container. I'm just thinking whether I should move .container into .gitlab-ci/test-source-dep.yml to have it in the same file
22:36DavidHeidelberg[m]: ... also here is the link to the snippet: https://paste.sr.ht/~okias/062b0f5375a1353942e67990ab6d32de3b3706ac
23:09DavidHeidelberg[m]: it works
23:10DavidHeidelberg[m]: just one unintentional change, which I kind of LIKE. ... When you re-enable a farm, it runs ALL the jobs (even the manual ones). I kinda like it, because in these scenarios we had to wait until the nightly runs to see new flakes/fails/successes, but now that gets fixed at the re-enabling phase