02:06_DOOM_: I am working on the StatusNotifierItem spec as the watcher. When a host or item gets a NameOwnerChanged, what should the watcher do if the item/host has a new name?
02:06_DOOM_: Should the watcher reannounce the item/host?
02:36Yakov: using the base libevdev sample https://www.freedesktop.org/wiki/Software/libevdev/ -> getting the error "Failed to init libevdev" - how do I fix it?
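(For context: the sample on that libevdev page open()s a hard-coded /dev/input/eventN node read-only and prints "Failed to init libevdev" when libevdev_new_from_fd() returns an error, which is what happens when the open() itself failed. A quick shell check, with the device path and binary name as placeholder assumptions:)

```sh
# Event nodes are normally readable only by root (or the "input"
# group), so a plain user's open() fails and the bad fd is what
# libevdev_new_from_fd() then rejects:
ls -l /dev/input/event0      # path hard-coded in the sample
sudo ./evdev-sample          # hypothetical name of the compiled sample
```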
05:17tpalli: daniels regarding the rust issue, I will try to bump RUST_VERSION to 1.70.0-2023-06-01 and see if that works out
05:40tpalli: aww hitting something else now: "Error: authenticating creds for "harbor.freedesktop.org": can't talk to a V1 container registry"
05:44mupuf: tpalli: is that in a fork or in mesa/mesa?
05:50tpalli: mupuf that is an MR which needs to bump the rootfs tag .. so I think that is why it is hitting these things, it is https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24600
05:54bbhtt: In the user verification template I don't have external: true, what do I do?
05:57mupuf: bbhtt: you have "false"?
05:57mupuf: If so, then you don't need to request anything
05:57bbhtt: mupuf: Yea
05:57bbhtt: Ah thanks
05:57mupuf: you should be able to fork :)
05:57mupuf: you must work for a company that fd.o trusts
06:00bbhtt: I think it's because my account was created before all this
06:03mupuf: oh, could be
06:03mupuf: tpalli: checking it out
06:04mupuf: restored the banner about spam since new users need to know they need to request rights
06:05tpalli: mupuf thanks!
06:07mupuf: tpalli: looks to me like an issue with the unreliable network. It should be improved today
06:08tpalli: mupuf okeydokkey
06:31kode54: aaaaaa, apparently the gitlab was just migrated sideways?
07:08daniels: yes, it went up to 16.x
07:08daniels: which does feature some big UI changes
07:09kode54: ah
07:18Yakov: is it possible to detect the windows key with libevdev?
07:38alatiera: registry usage should be transparent post-migration still right?
07:39alatiera: linux runners seem to work fine, but the windows builds can't reach it to push, it seems
07:39bentiss: alatiera: minus the fact that it's hosted on the new cluster which is showing some serious disk issues
07:39alatiera: though I think the windows job did manage to login
07:39alatiera: bentiss ack, thanks
07:39bentiss: alatiera: plan is to solve this this morning, but I can not seem to pg_dump the current db right now
07:40bentiss: alatiera: yeah, I don't think the login requires an access to the db
07:45Yakov: can I get help with libevdev here?
07:52bentiss: I'm glad I made a dump of the registry db yesterday and I kept it around: the registry db on the new cluster is simply not answering any requests
07:52bentiss: I'll reset it to yesterday's state soon
07:57alatiera: what db is currently backing the registry now
07:57alatiera: old machine with the old dump?
07:57bentiss: alatiera: the one on the new cluster which is failing
07:58alatiera: hmm, seems to be working on my end mostly
07:58bentiss: I'm making it point at the old cluster with the new db
07:58alatiera: (as in it's pushing things)
07:58bentiss: alatiera: yeah, you'll have to re-push, the db will be reset to yesterday's state
07:59alatiera: weird that it doesn't ack requests on your side huh
07:59alatiera: bentiss yea I don't mind
07:59bentiss: when I run the db dump on the machine it's running on, it simply hangs, so I guess I must not be the only one having issues
08:12mupuf: bentiss: yeah, the registry has been unreliable
08:14mupuf: daniels: did the update happen? I still see 15.X in the admin
08:16mupuf: Anyway, the priority should be fixing the registry :)
08:29bentiss: hmm... It seems I can now dump the registry that was failing
08:30bentiss: and it seems that when no one is accessing the disks, they are fine. That's weird, isn't it :)
08:31alatiera: if a disk does io and nobody hears it, did it do it at all?
08:31alatiera: knows where the door is
08:31bentiss: good question :)
08:32bentiss: anyway, big question: should I keep running the current registry db with the backup from yesterday, or should I dump the one from 30 min ago?
08:33bentiss: mupuf: ^^?
08:33mupuf: bentiss: the new one, please
08:33bentiss: mupuf: ok.
08:33bentiss: I need to take the registry down then
08:34bentiss: it's down now
08:38mupuf: Crossing fingers it will go well
08:38bentiss: so far so good
08:39bentiss: (replicating)
08:39bentiss: creating indexes....
08:40bentiss: mupuf: and regarding the gitlab migration to 16.x, yes it's not done, but I need a stable cluster for that
08:40mupuf: Exactly
08:41bentiss: and done, respinning up the registry pods
08:41bentiss: (they seem to be happy)
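(A minimal sketch of the dump/restore cycle described above — database name, hosts, and flags are assumptions, not the actual commands used:)

```sh
# Dump in pg_dump's custom format so pg_restore can rebuild
# indexes at restore time (the slow "creating indexes" phase):
pg_dump -Fc -h failing-cluster-db registry > registry.dump

# With the registry pods stopped, restore into the target db:
pg_restore -h target-db -d registry --clean registry.dump
```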
08:42mupuf: Gitlab reports psql to be 14.9. Isn't that too old for gitlab 16.x?
08:42bentiss: mupuf: it's supposed to be 15.9
08:42bentiss: oops, no, 14.9, you are correct
08:42bentiss: IIRC we were on 13.x before
08:43mupuf: I see, hopefully this is good-enough for gitlab 16
08:44bentiss: https://docs.gitlab.com/charts/installation/tools.html#postgresql mentions postgres 13 for gitlab 15.x (gitlab chart was 6.x, it's now 7.x)
08:44mupuf: Yeah, just saw that
08:44mupuf: So I guess we were running something older than psql 13 then
08:44bentiss: https://docs.gitlab.com/charts/releases/7_0.html -> recommended postgres 14.8
08:45bentiss: no, we were on 13, and now we are on 14
08:45bentiss: and gitlab 16 requires 14
08:45mupuf: Yep, it was 12.7
08:45bentiss: we couldn't have run gitlab 15.x on 12.7
08:46hakzsam: is it safe to re-assign MR to Marge now?
08:46mupuf: Ok, whatever, I am probably misreading
08:46mupuf: hakzsam: you can assign, it may or may not go through
08:46bentiss: hakzsam: safe, not sure, but you can try :)
08:46mupuf: I'll babysit it
08:46hakzsam: ack
08:47bentiss: mupuf: indeed: https://gitlab.freedesktop.org/freedesktop/helm-gitlab-deployment/-/commit/ecc1760e8bb8533ca8b18f3259aeb2ea529f5dfd
08:47hakzsam: is the registry also restored now? because there are 0 tags at https://gitlab.freedesktop.org/hakzsam/vk-cts-image/container_registry/5327
08:47bentiss: also, for mesa, we need to remove the CI variables pointing at harbor, it's useless now
08:47mupuf: hakzsam: user tags are gone
08:48hakzsam: like lost?
08:48mupuf: Not lost, you can transfer them using skopeo. It is explained in the banner
08:48mupuf: I'll send you a link when I reach my pc
08:48hakzsam: ok
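(The skopeo transfer mupuf refers to looks roughly like this — the registry hosts and repository path are placeholders, not the actual instructions from the banner:)

```sh
# Log in to both ends, then copy a tag and the blobs it references:
skopeo login old-registry.example.org
skopeo login registry.freedesktop.org
skopeo copy \
    docker://old-registry.example.org/user/project:tag \
    docker://registry.freedesktop.org/user/project:tag
```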
08:49bentiss: mupuf: the link in the banner disappeared
08:49mupuf: oh, right, I'll add it back
08:50bentiss: mupuf: no rush, I haven't updated it
08:50alatiera: I wonder, do we have numbers on the size of the registry with and without user tags?
08:50bentiss: or maybe we should promote the instruction as a wiki page
08:50bentiss: alatiera: no, and we can not, the blobs are shared
08:51alatiera: ah
08:51bentiss: what I can give you is the size of the registry on gcs and the one we hold now that has garbage collection
08:51alatiera: was curious how many of the blobs belonged only to user tags
08:51mupuf: bentiss: I would love to see the size difference between the registries
08:52bentiss: so on GCS, we had 27TB of data, and I pulled only 9.8TB
08:53alatiera: if we remove old mesa/gst images we can probably halve that
08:53bentiss: of those 9.8 TB, the data covers the main projects plus all new registry repos that were created after I started the registry migration (I think last September, one year ago)
08:53mupuf: alatiera: more like 75% down :D
08:53alatiera: (but that only works when there are no user tags)
08:54bentiss: well, harbor has some more numbers, and since I set it up mesa is roughly 1TB of data
08:54alatiera: I have half a script to parse the image tags in yml for the gst repo
08:54bentiss: but in any case, we have gc now, so in theory, if we can clear the tags, the blobs will be cleared eventually
08:54alatiera: but never finished the "query the registry and delete everything not in main|stable branches" part
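(A sketch of the unfinished cleanup alatiera describes, using GitLab's container-registry API; the project/repository ids, token, and the kept-tag pattern are all assumptions:)

```sh
GL=https://gitlab.freedesktop.org/api/v4
PROJECT=1234    # placeholder project id
REPO=5678      # placeholder registry repository id

# List the tags, keep anything matching main/stable, delete the rest:
curl -s -H "PRIVATE-TOKEN: $TOKEN" \
    "$GL/projects/$PROJECT/registry/repositories/$REPO/tags?per_page=100" |
jq -r '.[].name' |
grep -Ev '^(main|stable)' |
while read -r tag; do
    curl -s -X DELETE -H "PRIVATE-TOKEN: $TOKEN" \
        "$GL/projects/$PROJECT/registry/repositories/$REPO/tags/$tag"
done
```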
08:55mupuf: bentiss: still getting some 503: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48138108
08:55bentiss: weird
08:57bentiss: let me update the dns
08:57mupuf: So, the registry's DB is now in the old cluster, but the data is still in the new cluster, right?
08:58mupuf: and the migration of the data back to the old cluster should happen online
08:58mupuf: and when this is done, nothing will be using the new cluster
08:58mupuf: and thus it could be destroyed and re-built
08:59bentiss: mupuf: almost, s3.freedesktop.org is also on the new cluster
08:59bentiss: and it was working fine for the past week
08:59mupuf: but not anymore?
09:00bentiss: nobody seems to be complaining (though it's hard to test without the registry)
09:00mupuf: Yeah, the big data transfer must have broken a disk down... very unlucky.
09:00bentiss: these are different disks
09:00bentiss: the HDDs are working fine
09:00bentiss: the SSDs are completely failing
09:00mupuf: I see
09:02mupuf: so, what's the plan for s3 then? Remain in the new cluster?
09:03bentiss: still TBD
09:03bentiss: ideally I need to do some forensics to understand what is happening
09:09mupuf: yeah, let's not be rash about it
09:25bentiss: I am configuring the runners to directly point at registry.fd.o, not harbor, and rebooting them
09:25bentiss: just in case the job log grows
09:29mupuf: bentiss: nice!
09:29mupuf: so, harbor will be gone then?
09:30bentiss: mupuf: that's the plan yes
09:30mupuf: what registry server are you using then?
09:30bentiss: but I need to ensure that mesa and gfx-ci are synced
09:30bentiss: mupuf: the one bundled with gitlab
09:30mupuf: ok :)
09:30mupuf: well, fewer services == better
09:30bentiss: which is capable of properly handling the authorizations and such
09:30bentiss: yeah
09:31bentiss: no rewrite of the urls on live for the runners too
09:31mupuf: damn right!
09:42bentiss: ml-24 is in a bad place, I'll reinstall it in a bit
09:42mupuf: ack, I think it had some issues previously... so not a big loss
09:43bentiss: it was weird: it was showing some link down on the bond, and now every time I reboot, it's not using the correct boot entry
09:44mupuf: there's been some "network down" errors recently
09:44mupuf: may be related
09:45mupuf: I did not check if it was -24
09:45bentiss: could very well be
09:58dabrain34[m]1: I'm trying to rebuild a windows image for my project and I'm getting for the second time a "HTTP status: 503 service unavailable" when pushing it to the registry. Here is the job https://gitlab.freedesktop.org/dabrain34/GstPipelineStudio/-/jobs/48139879
10:01mupuf: dabrain34[m]1: hmm
10:03mupuf: I retried the job
10:04mupuf: bentiss: could this be an s3 error ^
10:04mupuf: the push fails
10:05mupuf: dabrain34[m]1: it doesn't hurt to do "docker push ... || { sleep 5; docker push ...; }"
10:05mupuf: in case of network errors
10:05mupuf: but still, seems pretty flaky
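(mupuf's one-shot retry, expanded into the usual bounded retry loop — just a sketch, with the image name as a placeholder:)

```sh
push_with_retry() {
    for i in 1 2 3; do
        docker push "$1" && return 0
        echo "push failed (attempt $i), retrying in 5s..."
        sleep 5
    done
    return 1    # give up after three attempts so the job still fails
}
push_with_retry registry.freedesktop.org/user/project:tag
```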
10:07bentiss: mupuf: that runner is still pointing at the registry in the old cluster. And I can see errors from it. We either need to wait for the dns cache refresh or force one
10:07mupuf: bentiss: oh, great, thanks :)
10:08dabrain34[m]1: shall I do something ?
10:08bentiss: dabrain34[m]1: unless you have root access on that server, no
10:30mupuf: bentiss: So, anything else you want to work on today or this week? I would like to write that the downtime is over
10:30mupuf: (and that DNS may take some time to propagate, but otherwise, we are done)
10:31bentiss: mupuf: would be nice if we could upgrade gitlab too
10:31dabrain34[m]1: how long should I wait more or less for this DNS propagation ?
10:31bentiss: dabrain34[m]1: at most 4 hours
10:31dabrain34[m]1: ok
10:31dabrain34[m]1: thanks for the support :)
10:32mupuf: bentiss: right, yeah, probably a good thing to do
10:32mupuf: but the registry work is done for now, right?
10:33bentiss: maybe? :)
10:33mupuf: we have the data in the new cluster, the DB in the old one
10:33bentiss: yeah
10:33mupuf: good good
10:33bentiss: I've disabled the gc while I am copying the data over to the old cluster
10:33bentiss: so yeah, not entirely finished
10:33mupuf: good call
10:34mupuf: is that a hot transfer, or is there still potential for data loss?
10:34bentiss: no, hot transfer
10:35bentiss: well, we could have like a blob not transferred when I switch from the new to the old cluster, but I'll continue to sync the blobs in the background, so like a 10 min delay
10:35mupuf: ack
10:36mupuf: ok, I'll write something down and ask you for a review
10:36bentiss: thanks!
10:45mupuf: bentiss: how long do you think it would take to upgrade to gitlab 16?
10:45mupuf: ~30 minutes?
10:46bentiss: mupuf: no idea. It can take a while, and it can be transparent or not depending on how the migration happens
10:46bentiss: <- lunch, bbl
10:47mupuf: bentiss: enjoy!
12:07hakzsam: looks like pushing new images to the registry is unavailable: "received unexpected HTTP status: 500 Internal Server Error"?
12:09bentiss: hakzsam: which job, and which runner?
12:10hakzsam: it happened to me when I wanted to push a new image to vk-cts-image
12:11bentiss: hakzsam: I can see access to your registry on the failing registry pod, so hopefully when the dns cache gets properly expired, you should be fine
12:11hakzsam: ok, I will wait a bit then, thanks!
12:11bentiss: (should be another 2 hours tops)
12:14hakzsam: sounds good
13:10bentiss: alright, fixed the pages jobs and all artifact uploads... it was trying to access the failing cluster instead of using the current one
13:36bentiss: I've removed harbor from the CI configuration in mesa. In theory, no visible impact
13:39bentiss: maybe not
13:56alatiera_afk[m]: 🤞
14:00zmike: anyone know what's going on with these cargo failures https://gitlab.freedesktop.org/zmike/mesa/-/pipelines/972362
14:01bentiss: zmike: I think karolherbst and daniels talked about that last week
14:24karolherbst: yeah, but it was unclear what's causing this problem or rather what we want to do to fix it... Those containers don't use our rustup script, so the installed rust version comes from _somewhere_
14:25zmike: it's blocking further updates
14:25zmike: so ideally we want to do something
14:25zmike: even if it's just a stopgap
14:25karolherbst: sure, but the infra update happened :) I guess we should get back to that issue
14:26karolherbst: but I have no idea about that part of CI, it's something something the kernel stuff is doing there
14:26hakzsam: yeah, it's blocking every new container
14:28mupuf: eric_engestrom: that may be something you can help with ^
14:28karolherbst: it's probably the clap_lex upgrade to 0.5.1 which happened like 5 days ago and bindgen (or something) selects 0.5.x
14:28karolherbst: yeah.. that bumped the rust req from 1.64.0 to 1.70.0
14:29karolherbst: I think the solution here is to make crosvm use rustup (and our script for that) and install rustc 1.70 instead of relying on the distribution's rustc
14:30eric_engestrom: I don't have much context here, but that last sentence makes sense to me karolherbst :)
14:30karolherbst: hakzsam, zmike: there is a workaround you can try
14:31karolherbst: inside .gitlab-ci/container/build-crosvm.sh
14:31zmike: you know I love workarounds
14:31karolherbst: mhh.. maybe not, not sure how --locked actually works here with binaries
14:32karolherbst: yeah.. it's doing something else
14:33karolherbst: there is a thing called a cargo.lock file, but I've never used it and have no idea how it works
14:35bentiss: that's weird, the mesa images have not been mirrored from harbor to registry, even though harbor says they were
14:35karolherbst: who is maintaining/managing the crosvm stuff?
14:36karolherbst: tintou and DavidHeidelberg[m]?
14:36karolherbst: please read ^^
14:36karolherbst: crosvm generation has to use a fixed rustc, not whatever the distribution uses, else a crate dependency update _might_ not compile because of a too-old rustc
14:37karolherbst: it's currently broken, see the pipeline link
14:37tintou: Yeah I actually just bumped on it
14:37karolherbst: _maybe_ using a Cargo.lock file is the more reliable solution here
14:37karolherbst: probably the one causing less issues
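(The two fixes discussed above, in shell form — the 1.70.0 figure comes from the discussion; the exact invocations in .gitlab-ci/container/build-crosvm.sh are assumptions:)

```sh
# Option 1: pin the toolchain with rustup instead of using the
# distribution rustc (new clap_lex needs rustc >= 1.70):
rustup toolchain install 1.70.0
rustup default 1.70.0

# Option 2: build with the dependency versions recorded in the
# project's Cargo.lock instead of the newest semver-compatible
# ones, so a clap_lex 0.5.1 release can't sneak in:
cargo build --locked
```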
14:55bentiss: FWIW, I'm babysitting the mesa pipeline by manually copying from harbor to registry the images that are not here
14:55zmike: heroic
14:55bentiss: I really don't know why harbor wasn't doing the replication properly
14:57eric_engestrom: bentiss: is it possible that somehow multiarch images got lost in translation?
14:57eric_engestrom: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48166178
14:57eric_engestrom: > .gitlab-ci/meson/build.sh: line 28: /usr/bin/llvm-config-15: cannot execute binary file: Exec format error
14:57bentiss: eric_engestrom: it's not a multiarch image, is it?
14:57eric_engestrom: wait no, that's not a multiarch image, that's an x86_64 image cross-building to s390x
14:58bentiss: let me push it again
14:58bentiss: the blobs did not match, they are being pushed again
14:58eric_engestrom: thanks!
14:58eric_engestrom: is the push finished?
14:59bentiss: not yet
14:59bentiss: I'll restart the job
14:59eric_engestrom: ack; I jumped the gun and already did
15:00bentiss: I need to remove the image on the runners also
15:00eric_engestrom: actually no rush on retrying that job, the MR will fail anyway because other jobs have been taking too long, so it's too late for marge
15:01bentiss: well, it's for the next run
15:02bentiss: but maybe it's because it was running on ml-24, which I just reinstalled
15:02bentiss: and *maybe* s390x is not working there
15:02eric_engestrom: bentiss: your latest retry worked, thanks!
15:03bentiss: ok... no idea what is happening, the image also works on ml-24
15:04bentiss: hah! it's a runner issue
15:04bentiss: /usr/bin/llvm-config-15: cannot execute binary file: Exec format error
15:04bentiss: error: Checking out added file "/ppc64le-linux-gnu": mkdirat: No such file or directory
15:11dabrain34[m]1: mupuf: As a follow-up, I'm now facing "received unexpected HTTP status: 500 Internal Server Error" https://gitlab.freedesktop.org/dabrain34/GstPipelineStudio/-/jobs/48154304 is it expected?
15:11mupuf: No, I would have expected the DNS to be updated by now
15:11mupuf: But I know it can take a while
15:11dabrain34[m]1: the same
15:12dabrain34[m]1: ok I'll give a try tomorrow morning
15:12mupuf: Maybe you can modify the job to print the IP for registry.freedesktop.org?
15:18bentiss: eric_engestrom: a reboot of the runner solved the issue (that's a package we can not put in the current boot apparently)
15:23dabrain34[m]1: it gives 172.29.208.1
15:24dabrain34[m]1: where on my machine it gives 147.75.198.156
15:24bentiss: looks like a proxy?
15:26dabrain34[m]1: what should I expect ?
15:26bentiss: 147.75.198.156 is the correct IP
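(The check mupuf suggested amounts to adding something like this to the job script and comparing against the expected address:)

```sh
# Print what the runner's resolver returns for the registry;
# 147.75.198.156 was the correct address at the time of this log:
getent hosts registry.freedesktop.org
# or, if the image ships the lookup tools:
nslookup registry.freedesktop.org
```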
15:41bentiss: I think the errors on the registry are due to "FATAL: sorry, too many clients already / FATAL: remaining connection slots are reserved for non-replication superuser connections"
15:41bentiss: so we are DoSing the db
15:43mupuf: Oops
15:44bentiss: I'll probably split the db in 2 pods, one for registry, and one for gitlab
15:48mupuf: bentiss: can't just increase the connection count?
15:48bentiss: maybe
15:49bentiss: but we already increased it to 300, so maybe we are loading the db too much
15:49mupuf: I'm all for splitting, but as quick workaround, it would help
15:50mupuf: bentiss: I guess the load average and iostats would tell us that better than connection counts
15:51bentiss: but if I increase the connection count, I'll have to cut gitlab (or at least the db), while if I split, I just have to stop the registry for 1-2 min
15:52mupuf: And how much work is it?
15:52tpalli: btw I did bump the RUST_VERSION in the latest version of the particular pipeline that failed, not sure if that is the correct solution
15:52mupuf: If it isn't much work, then fuck yeah!
15:52bentiss: mupuf: shouldn't be too hard to do
15:54mupuf: Modularity is good then
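(For context, the quick-workaround half of this trade-off looks like the following generic sketch; the 300 figure is from the discussion. max_connections only changes on a postgres restart, which is why raising it means cutting gitlab:)

```sh
# Inspect the current limit and how close we are to it:
psql -c "SHOW max_connections;"                    # 300 here
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Raise it; only takes effect after a postgres restart:
psql -c "ALTER SYSTEM SET max_connections = 600;"
```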
16:11bentiss: alright, cutting down the registry for a short amount of time, while I migrate to a separate db
16:14mupuf: bentiss: crossing fingers
16:15bentiss: db migrated, waiting for the new config to propagate
16:15bentiss: pods are starting
16:15bentiss: and running
16:15mupuf: \o/
16:16bentiss: seems to be working (as in skopeo inspect works)
16:18mupuf: yeah, and my runners seem to be happy too
16:26mupuf: bentiss: so far, so good! https://gitlab.freedesktop.org/mesa/mesa/-/pipelines/972658
16:27bentiss: yep, not a single 500 since 18:00
16:27mupuf: <3
16:27bentiss: and no one should be using harbor by now, either :)
16:30bentiss: alright I'm done for the day. I hope nothing will crash over the night
16:32mupuf: bentiss: thanks! Enjoy your evening!
22:27koike: bentiss thanks for working on this