12:12 austriancoder: https://gitlab.freedesktop.org/ == 502
12:31 vyivel: 504 over here
12:57 austriancoder: now it's a 504 :)
13:04 Rayyan: what happened to https://gitlab.freedesktop.org ?
13:38 kisak: fwiw, gitlab.fd.o appears to be inaccessible here.
13:53 Rayyan: kisak: yep
14:37 daniels: trying to fix
14:57 Sevenhill: will this be a whole-day job, or can we access it within a few hours?
15:08 kisak: troubleshooting servers doesn't work that way? There's never a way to give an ETA up front, only if something profoundly bad happens and extra people have to get involved.
15:12 daniels: Sevenhill: I hope shortly
15:13 daniels: bentiss: so I'm losing gitaly-2 with 'rbd image replicapool-ssd/csi-vol-fb66e2ed-d5f8-11ec-9c25-266a9a9a89cb is still being used' ... I think that may be a consequence of large-7 having gone down uncleanly
15:13 Sevenhill: daniels: thank you for working on it
15:19 daniels: Sevenhill: np
15:58 daniels: bentiss: it looks like osd-14 is unhealthy after all this - can you remind me again how to recover from this?
16:53 bentiss: daniels: sorry I can't help you now. I'll have a look tonight when I come back home
16:53 daniels: bentiss: no prob :) thanks
18:15 daniels: bentiss: btw, where we are now is that osd-14 (one of the replicapool-ssd from large-5) is dead and refusing to come back up; the rest are complaining because they're backfillfull
18:15 daniels: my thinking is to nuke osd-14 and let it rebuild, but I'm not 100% sure how to do that non-destructively atm!
19:22 bentiss: daniels: back home, starting to look into this
19:24 bentiss: daniels: looks like osd 0 is also dead, which explains the backfillfull
19:27 daniels: bentiss: oh right, I thought osd 0 being dead was already known
19:27 bentiss: well, 2 ssds down is too much :(
19:28 daniels: but OSD 0 is on server-3 and has 0 objects in it
19:28 daniels: so I don't think it's any loss to the pool?
19:29 daniels: (btw the toolbox pod wasn't working due to upstream changes, so I spun up rook-ceph-tools-daniels as a new pod)
19:29 bentiss: we should have 2 ssds in each server-*, and we got only one in server-3, so we are missing one
19:29 bentiss: and regarding loss, we should be OK IMO
19:30 bentiss: but I'd rather first work on osd-0, then osd-14 (should be the same procedure)
19:32 bentiss: basically the process is: remove the OSD from the cluster, wipe the data on the disk, then remove the deployment on kubectl, then restart rook operator
19:32 bentiss: I just need to find the link where it's all written down :)
19:32 daniels: oh hmm, osd-0 is on server-3 which is throwing OOM every time it tries to spin up
19:32 bentiss: oh. not good
19:33 daniels: blah, that same assert fail: 'bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc8400000, block size 0x1000, free 0xb036b3000, fragmentation 0.429172, allocated 0x0'
19:33 daniels: I wonder if this would be fixed in a newer version of ceph
19:34 bentiss: maybe, but restoring the OSD works
19:34 bentiss: give me a minute to find the proper link
19:35 bentiss: to zap the disk: https://rathpc.github.io/rook.github.io/docs/rook/v1.4/ceph-teardown.html#zapping-devices
19:35 bentiss: https://rook.github.io/docs/rook/v1.4/ceph-teardown.html#zapping-devices even (same data)
19:36 daniels: yeah, looks like it should be
19:36 bentiss: and https://rook.github.io/docs/rook/v1.4/ceph-osd-mgmt.html#remove-an-osd
19:36 daniels: thanks! I was looking at those and they seemed like the right thing to do, but I also didn't really want to find out in prod tbh :P
19:36 bentiss: so it's a purge on the osd
19:38 bentiss: so, process is: make sure osd is out and cluster is backfilled properly, then purge the osd, then zap the disk, then remove the osd deployment and operator restart
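The replacement procedure described above can be sketched as a dry-run shell script. All of the OSD id, device path, and deployment names below are illustrative assumptions (following the Rook v1.4 docs linked earlier), and the function only prints the commands it would run rather than executing anything against a cluster:

```shell
# Dry-run sketch of the OSD replacement flow described above.
# Nothing here touches a real cluster: each step is echoed, not run.
replace_osd() {
    osd_id=$1   # hypothetical: id of the dead OSD, e.g. 0
    dev=$2      # hypothetical: its backing disk, e.g. /dev/sdb

    # 1. Take the OSD out and let the cluster backfill away from it.
    echo "ceph osd out osd.$osd_id"

    # 2. Once backfill is done, purge it from the cluster map.
    echo "ceph osd purge $osd_id --yes-i-really-mean-it"

    # 3. Zap the backing disk (per the rook ceph-teardown docs).
    echo "sgdisk --zap-all $dev"

    # 4. Remove the OSD deployment and restart the rook operator so it
    #    re-creates the OSD on the now-clean disk.
    echo "kubectl -n rook-ceph delete deployment rook-ceph-osd-$osd_id"
    echo "kubectl -n rook-ceph rollout restart deploy/rook-ceph-operator"
}

# Print the sequence for the hypothetical osd.0 on /dev/sdb:
replace_osd 0 /dev/sdb
```

Note the ordering: purging before zapping matters, since (as the log shows later) zapping a disk while the OSD is still known to the cluster leaves a mess to clean up.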
19:38 bentiss: why can't the UI purge osd-0???
19:41 bentiss: purge is not working... maybe we can just zap the disk and destroy the deployment
19:49 bentiss: server-3 is not responding to the various zap commands, rebooting it
20:06 daniels: urgh …
20:14 bentiss: managed to remove the OSD from ceph
20:21 bentiss: oh boy, zapping the disks while the OSD was still known in the cluster was a bad idea... not bad in terms of we are doomed, but bad in terms of now I need to clean up the mess :/
20:34 bentiss: daniels: so maybe we need to spin up a temporary new server-* so ceph goes back to a sane state
20:34 bentiss: and we can then also nuke server-3 while we are at it and keep only the new one
20:35 daniels: bentiss: oh right, so it can populate that with the new content, then we can kill the old one behind the scenes?
20:35 daniels: that sounds good to me - I just need to go do some stuff but will be back in an hour if you need me for anything
20:35 bentiss: k, I'll try to make the magic happen in the interim :)
20:35 daniels: great, ty :)
20:36 daniels: sorry, just back from holiday and still haven't unpacked etc!
20:36 bentiss: no worries
20:39 bentiss: damn... "422 ewr1 is not a valid facility"
21:02 daniels: bentiss: yeah, have a look at gitlab-runner-packet.sh - it's all been changed around a fair bit
21:02 bentiss: daniels: so far I was able to have a new server in ny with a few changes
21:02 bentiss: hopefully it'll bind to the elastic ip
21:03 daniels: nice :)
21:03 bentiss: and worst case, we'll migrate our current servers to this new NY
21:05 bentiss: huh, it cannot contact the other control planes
21:09 bentiss: seems to be working now
21:10 bentiss: damn, same error: RuntimeError: Unable to create a new OSD id
21:24 bentiss: daniels: sigh, the new server doesn't even survive a reboot, it fails at finding the root
21:27 bentiss: I guess my cloud-init script killed the root
21:35 bentiss: FWIW, reinstalling it
21:48 daniels: mmm yeah, you might want to look at the runner generate-cloud-init.py changes too
21:48 daniels: particularly 89661c37ea2f0cef663e762a18f4a3a600f8356f
21:50 bentiss: I think the issue was that jq was missing
21:55 bentiss: daniels: currently upgrading k3s to the latest 1.20, or else it downgrades the new server, which seems to make things not OK
22:02 daniels: yeah, the changes in the runner script should fix jq too
22:04 bentiss: there is something else too, because I just added a new mount, made sure I formatted the correct disk, and reboot failed
22:04 bentiss: so I am re-installing it
22:06 bentiss: daniels: also, FYI there are 3 osds down without info, this is expected. I re-added them because we can not clean them up properly while we are backfillfull
22:14 bentiss: giving up on debian_10; I double checked that the server was using the proper uuid, the disk was correct, did nothing in cloud-init related to disks, and it doesn't survive reboot
22:25 bentiss: alright, debian_11 works way better
22:35 daniels: yeah, debian_10 isn't readily available for the machine types which are …
22:43 bentiss: sigh, the route "10.0.0.0/8 via 10.66.151.2 dev bond0" is messing with our wireguard config
22:47 bentiss: daniels: so I have cordoned server-5 because it clearly can not reliably talk to the control plane
22:48 bentiss: daniels: `curl -v https://10.41.0.1` fails way too often, so I wonder what is the issue
23:06 daniels: bentiss: hmmm really? it seems to work ok at least atm ...
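"Fails way too often" can be put in numbers with a small probe loop like the following sketch (entirely hypothetical: the function name, default address, and retry count are illustrative, not something that was actually run):

```shell
# Hypothetical probe: hit an in-cluster API server address N times and
# count failures. -k skips cert verification, -f treats HTTP errors as
# failures, -m bounds each attempt to 5 seconds.
probe_apiserver() {
    addr=${1:-https://10.41.0.1}
    tries=${2:-20}
    fails=0
    i=0
    while [ "$i" -lt "$tries" ]; do
        curl -ksf -m 5 -o /dev/null "$addr" || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails/$tries attempts failed"
}
```

A node where this reports a substantial failure rate against the control plane is a reasonable candidate for cordoning, as done here with server-5.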
23:06 bentiss: it's way too late here, and I have to wake up in 6h to take the kids to school; giving up for now, I'll continue tomorrow
23:06 daniels: but I wonder
23:06 daniels: 10.0.0.0/8 via 10.99.237.154 dev bond0
23:06 daniels: 10.40.0.0/16 dev flannel.1 scope link
23:06 daniels: oh no, nm
23:06 bentiss: I changed the default route FWIW
23:06 bentiss: and scoped it to 10.66.0.0/16
23:07 bentiss: while it was 10.0.0.0/8
23:07 daniels: yeah, it still has the /8 on -5
23:08 bentiss: or large-5?
23:08 bentiss: because I do not see it on server-5
23:08 daniels: sorry, wrong -5 :(
23:08 daniels: I think it might be too late for me too tbh
23:08 bentiss: OK, let's call it a day, and work on it tomorrow
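The route change described here, narrowing the bond0 route from 10.0.0.0/8 down to 10.66.0.0/16 so it stops shadowing the cluster's other 10.x ranges (e.g. flannel's 10.40.0.0/16 and the wireguard overlay), can be sketched as below. The gateway address is the one quoted in the log; this is a dry-run illustration that only prints the ip(8) commands instead of executing them:

```shell
# Dry-run sketch of scoping the bond0 route. echo keeps this a no-op so
# it can be shown (and tested) without touching a real routing table.
GW=10.66.151.2

narrow_route() {
    # Remove the catch-all /8 that sent every 10.x destination out bond0.
    echo "ip route del 10.0.0.0/8 via $GW dev bond0"
    # Re-add it scoped to the provider's own /16 only.
    echo "ip route add 10.66.0.0/16 via $GW dev bond0"
}

narrow_route
```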
23:09 bentiss: cause we might do more harm than good
23:10 daniels: yeah ...
23:10 daniels: I definitely need to understand more about how kilo is supposed to work as well