06:19 Adrinael: Is gitlab down?
06:20 dt9: guys, is freedesktop down right now?
06:21 vyivel: yep, still down, no eta, be patient
06:21 dt9: ack, 10x for confirmation
07:08 daniels: what a day to not have coffee at home
07:09 bentiss: daniels: ouch :/
07:10 bentiss: so... I managed to mess up one disk on server-3, currently restoring it
07:11 daniels: ah yeah, I was just about to wonder why I wasn't able to SSH to it :P
07:12 bentiss: basically nuked /var/lib/rancher :(
07:13 daniels: bentiss: oh ... ouch :(
07:14 bentiss: luckily given that it was at partitioning time I still had the device mounted, so I backed it up before the reboot
07:18 daniels: bentiss: is there anything I can do to help atm?
07:18 bentiss: daniels: if you can try to understand why server-5 is not happy with 10.41.x.x that would be good
07:19 daniels: bentiss: ok! :)
07:19 bentiss: and FWIW the culprit is not kilo but flannel with the wireguard backend
07:19 emersion: i'll also be available in a bit, if you have a noob-friendly task :P
07:20 bentiss: emersion: the more the merrier :)
07:20 bentiss: emersion: same thing as daniels, it would be nice to understand why server-5 can't talk to the other services
07:21 emersion: ok!
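(A first pass at that kind of debugging might look like the sketch below; flannel-wg is the wireguard backend's default interface name, and the addresses come from this conversation, so treat both as assumptions:)

    # on server-5: is the wireguard tunnel up and exchanging traffic?
    wg show flannel-wg
    # is there a route for the service/pod subnets pointing at the tunnel?
    ip route show | grep -E '10\.(41|99)\.|flannel'
    # watch whether packets to the cluster range actually leave via the tunnel
    tcpdump -ni flannel-wg net 10.41.0.0/16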
07:27 mceier: someone could write a mail describing the situation (and maybe an ETA); there's at least one mail on the xorg-devel ML asking about a 504 error
07:28 airlied: fubar, no eta
07:28 mceier: ;)
07:28 bentiss: mceier: 2 disks down in the cluster, which means everything on fire
07:40 daniels: mceier: good point, sent
07:40 mceier: cool :)
07:43 hakzsam: good luck with fixing this guys!
07:49 bentiss: finally, server-3 is back
08:06 bentiss: daniels, emersion: I managed to get the ssd on server-3 back in the pool, it's currently recovering, so hopefully the cluster will restart in a few minutes
08:10 emersion: \o/
08:15 bentiss: attempts at fixing large-5 too
08:17 daniels: bentiss: I didn't manage to figure out large-5 yet; was looking just before the reboot but it's a mystery to me how the wg traffic gets captured in the first place ...
08:18 bentiss: daniels: I was asking about server-5 :) not large
08:18 daniels: ah
08:18 daniels: rubs eyes
08:18 bentiss: daniels: TBH, wg is most of the time way simpler than regular traffic, but I just don't understand the flannel config that makes the service plane work :/
08:19 daniels: yeah, so I'd gone to large-5 to look at a working example, and couldn't figure out how it was supposed to work in the first place? at least judging by the routing table that's there
08:20 bentiss: I fear the issue is because it's not on the same subnet
08:20 bentiss: because different facility
08:21 daniels: ah yeah of course, NAT
08:21 bentiss: and the others are working because the default route makes it use 10.99.x.x and then the interfaces are magically picking up the traffic :(
08:22 bentiss: but maybe we can leverage kilo to route the 10.41.x.x addresses toward server-2
08:23 bentiss: and then we will have to transfer the cluster to a facility that is not deprecated
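(If kilo were used for that, the sketch might be a single node annotation; kilo does document an allowed-location-ips annotation, but the exact key and the subnet here are assumptions:)

    # hypothetical: ask kilo to route the 10.41.x.x range via server-2
    kubectl annotate node server-2 kilo.squat.ai/allowed-location-ips=10.41.0.0/16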
08:28 bentiss: OK, disk on large-5 is back up, we now need to wait for ceph to settle
08:28 bentiss: and then we can clean up the various OSD leftovers
08:29 bentiss: daniels: so, for extra safety, what I did now was just remove the failing deployment, drain the node, reboot it, uncordon it, then zap the *correct* disk, then kill the operator
08:30 bentiss: it doesn't clean up the old OSD, but at least the disk is back up
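(Spelled out as commands, that sequence might look roughly like the following; the node name, OSD id, disk device, and labels are assumptions based on the conversation:)

    kubectl -n rook-ceph delete deployment rook-ceph-osd-3       # the failing OSD deployment (id hypothetical)
    kubectl drain server-3 --ignore-daemonsets --delete-emptydir-data
    # ... reboot the node, then ...
    kubectl uncordon server-3
    sgdisk --zap-all /dev/sdb                                    # zap the *correct* disk
    kubectl -n rook-ceph delete pod -l app=rook-ceph-operator    # kill the operator so it redeploys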
08:30 daniels: hmm, so it _is_ at least hitting the right iptables rules to masquerade which should push it through wg ...
08:31 bentiss: maybe a wrong ufw config on the other side
08:31 daniels: but then why would it only be intermittent? :\
08:31 daniels: bentiss: \o/ thanks!
09:00 bentiss: recovery stopped... running fstrim on all the nodes, that might be the issue
09:24 bentiss: reboots large-7
09:38 bentiss: upgrades rook from 1.6.8 to 1.6.11
09:48 daniels: bentiss: on the network side, tracing through the iptables rules, in the NAT table we go from POSTROUTING -> KUBE-SERVICES -> KUBE-SVC-NPX46M4PTMTKRN6Y for the HTTP plane
09:49 daniels: that balances connections between targets of 10.99.237.141 (server-2), 10.99.237.145 (server-3), and 10.66.151.3 (server-5)
09:49 daniels: perhaps unsurprisingly, server-5 is the one which fails to answer itself
09:49 bentiss: or the other way around :)
09:50 daniels: nope
09:50 daniels: every time we land in the server-5 chain it times out; every time we land in server-2 or server-3 it works fine
09:51 bentiss: hmm, interesting
09:51 daniels: it's on a probability distribution directing 50% to server-5, which explains why it works exactly half the time
09:51 daniels: (I've currently hacked it to always forward to server-2, which is now working 100% of the time)
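(Concretely, kube-proxy balances a service chain with statistic-mode rules, so the tracing and the hack might look roughly like this; the chain hash is the one quoted above, the rule number is hypothetical:)

    # list the chain and its per-endpoint probability rules
    iptables -t nat -nvL KUBE-SVC-NPX46M4PTMTKRN6Y
    # kube-proxy emits rules of the form:
    #   -m statistic --mode random --probability 0.5 -j KUBE-SEP-<hash>
    # deleting the jump to server-5's endpoint forces everything to the
    # healthy endpoints (until kube-proxy re-syncs and re-adds it)
    iptables -t nat -D KUBE-SVC-NPX46M4PTMTKRN6Y 3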
09:52 bentiss: heh, that's what I was about to suggest
09:52 bentiss: can I try to uncordon server-5 then?
09:53 bentiss: right now the pg are stuck backfilling, so adding a couple of disks might unblock them
09:55 daniels: yep, go for it
09:55 daniels: let's see what happens
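(Watching the stuck backfill from the rook toolbox might look like this; rook-ceph-tools is rook's default toolbox deployment name and an assumption here:)

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph pg dump_stuck unclean
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree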
09:56 bentiss: thanks
09:56 daniels: I need to go afk for a bit anyway, back at 1pm your time
09:56 bentiss: k
09:56 bentiss: probably needs to find some food too
10:19 bentiss: daniels: still failing a lot
10:19 bentiss: I think I'll grab some lunch and then remove server-5, and use a new c2-medium as an agent, not a server. This should solve the issues with the control plane
10:57 daniels: ack
10:59 daniels: and yeah, I can keep looking into it, but reflexively I think it would be better to get everything in the same facility + use a VPC for all traffic + pare ufw down to the absolute bare minimum ruleset (allow WG into boundary host + allow HTTPS/SSH ingress to elastic + drop all other incoming external) so the k3s-internal traffic can be managed solely by k3s rules?
11:33 bentiss: daniels: sounds appealing :)
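(As a sketch, that bare-minimum ufw ruleset on the boundary host might look like the following; 51820 is merely wireguard's default port, so an assumption:)

    ufw default deny incoming
    ufw default allow outgoing
    ufw allow 51820/udp   # wireguard in
    ufw allow 443/tcp     # HTTPS ingress
    ufw allow 22/tcp      # SSH ingress
    ufw enable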
11:36 bentiss: daniels: can I nuke server-5?
11:54 daniels: yep, fine by me :)
12:00 shadeslayer: Hm, I seem to be hitting gateway timeout issues on gitlab
12:01 shadeslayer: though hopefully it's temporary
12:01 JoniSt: shadeslayer: https://www.phoronix.com/scan.php?page=news_item&px=FreeDesktop-GitLab-2022-Crash
12:01 JoniSt: Sadly not temporary, the Gitlab had massive data loss
12:02 shadeslayer: ah shit :(
12:08 pq: JoniSt, no, no data loss AFAIU.
12:08 shadeslayer: I'd be surprised if there was ^^
12:08 JoniSt: Pheeeew. I hadn't heard much news about it yet other than the Phoronix article
12:09 JoniSt: But yeah, I'd assume that the Gitlab gets backed up very regularly
12:09 mceier: https://lists.x.org/archives/xorg-devel/2022-June/058833.html
12:10 bentiss: \o/ managed to kick in the recovery once again
12:12 JoniSt: Man... Good luck!
12:13 bentiss: yes, no more stale pgs
12:15 daniels: JoniSt: there is no data loss, just annoyance
12:18 JoniSt: That's nice to hear. Reminds me of the fact that a single raid1 btrfs might also not be enough to keep my own Gitlab instance alive if something happens...
12:46 bentiss: daniels: I think I'll reboot all machines one after the other, some process are stuck
12:46 daniels: bentiss: yeah ... RBD I'm guessing
12:47 bentiss: fstrim is also hanging, and what actually made the recovery start again was to kill all the rook-ceph pods besides the osds
12:47 bentiss: so a full reboot might help
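(The "kill everything except the OSDs" step can be a single label selector, assuming rook's usual app labels:)

    # delete every rook-ceph pod whose app label is not rook-ceph-osd
    kubectl -n rook-ceph delete pod -l 'app!=rook-ceph-osd'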
12:47 daniels: heh ...
13:19 karolherbst: JoniSt: I was thinking about having my own gitlab instance, but making it rock solid _is_ a huge investment, and until that's settled, data is better off being replicated externally anyway... :D
13:19 karolherbst: I am thinking about doing raid6
13:24 JoniSt: Hmm... Well, raid5/6 always gives me a bit of a weird feeling :P
13:28 JoniSt: I mostly store university stuff on that Gitlab (when I do group work) so it wouldn't be thaaat bad if the filesystem crashed
13:48 karolherbst: JoniSt: yeah... but I also plan to do a setup completely without any fans or noise, so I have to make sure not to go overboard with the budget
13:49 karolherbst: and complete mirroring can get extremely expensive
13:49 karolherbst: well. with SSDs that is
13:51 karolherbst: there are actually passively cooled cases which would allow a power budget of ~200W, so that's not even a huge issue
13:51 karolherbst: just.. expensive :D
13:53 DragoonAethis: karolherbst: or you could go for an actively-cooled case and replace the fans
13:53 karolherbst: nah
13:53 karolherbst: it's still audible
13:54 DragoonAethis: Yeah, but with Noctua/be quiet fans it can be really quiet
13:54 DragoonAethis: Pretty much inaudible unless you've got the box in a silent room at night or something like that
13:54 karolherbst: the issue is that motherboards are generally quite crappy in this regard
13:54 karolherbst: so if the CPU/whatever isn't hot/warm, the fans can be turned off
13:54 karolherbst: but firmware....
13:55 Mattia_98: noctua fans running on low rpm are inaudible, I can vouch for that
13:55 karolherbst: DragoonAethis: well.. my work laptop usually has its fans completely turned off, so it would be noticeable :P
13:55 karolherbst: and I do have noctua fans for other stuff
13:55 karolherbst: Mattia_98: thing is.. they are inaudible when it doesn't matter
13:56 karolherbst: under load is where it matters
13:56 DragoonAethis: Or alternatively, get a fan controller and write custom cooling scripts
13:56 karolherbst: so I can go for complete fanless with a power budget of 200W or....
13:56 karolherbst: the noctua fans aren't silent if they have to cool away ~100W of CPU heat
13:58 karolherbst: DragoonAethis, Mattia_98: Streacom DB4 is what I was thinking about
13:59 karolherbst: can manage up to 110W CPU heat and 65W GPU heat
13:59 karolherbst: where I wouldn't need the GPU cooling
13:59 karolherbst: thing is.. there isn't much space in it :D
13:59 DragoonAethis: Unfortunately I'm more of a midi tower+ guy myself ;P
14:00 DragoonAethis: But it looks really nice
14:00 karolherbst: yeah.. I have one for my desktop for work
14:00 karolherbst: got myself the be quiet! Dark Rock 4 PRO BK022 CPU cooler which isn't all that bad actually
14:01 DragoonAethis: I have the non-Pro version, it's pretty good too (but getting the Pro for the next upgrade)
14:01 karolherbst: definitely worth the money
14:01 karolherbst: it manages 150W without getting loud
14:01 karolherbst: and my CPU stays at max clock pretty much all the time
14:02 DragoonAethis: And for the case it's Fractal Meshify C with the stock Fractal fans (which are almost inaudible, but they don't ramp up with the rest of the system)
14:02 karolherbst: but for devices which are like on 24/7 I want something without fans :P
14:04 bentiss: daniels: \o/ ceph is back in the game
14:04 bentiss: no more degraded objects
14:04 daniels: bentiss: woo! I saw the tools pod is working now that we're upgraded too, awesome
14:04 bentiss: though daemons are crashing like hell
14:05 bentiss: daniels: I fixed the deployment of the tools pod to use the same rook minor version :)
14:05 bentiss: it's not part of the helm chart :(
14:05 daniels: ahhhhh, right
14:05 daniels: I did think it was weird that they'd use :master
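(Pinning the toolbox to the matching rook release, since the helm chart doesn't manage it, might look like this; deployment and container names are rook's defaults and assumptions here:)

    kubectl -n rook-ceph set image deployment/rook-ceph-tools \
        rook-ceph-tools=rook/ceph:v1.6.11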
14:06 daniels: ooh, gitaly-2 now running too
14:06 bentiss: and postgres!
14:06 daniels: \o/
14:06 daniels: should I try redis next?
14:06 bentiss: still having connectivity issues
14:06 daniels: ah :(
14:06 daniels: control plane or inter-pod?
14:07 bentiss: I mean 504
14:07 bentiss: but yeah, feel free to re-enable redis
14:07 daniels: oh right, yeah it'll 504 since I killed the redis + webservice pods :P
14:07 bentiss: that's what I just realized :)
14:07 daniels: the log noise was getting annoying
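(Assuming the redis and webservice workloads were scaled down rather than just restarted, re-enabling them would be something like the following; the namespace and resource names follow the gitlab helm chart's conventions and are assumptions:)

    kubectl -n gitlab scale statefulset gitlab-redis-master --replicas=1
    kubectl -n gitlab scale deployment gitlab-webservice-default --replicas=2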
14:09 bentiss: \o/ back online!!!!!
14:10 daniels: :D :D :D
14:10 bentiss: that's what... 24h of downtime?
14:10 bentiss: (back online, right before I got to pick up kids at school)
14:10 pixelcluster: many thanks to both of you for fixing it!!
14:11 Mattia_98: nice job guys!
14:12 jekstrand: \o/ y'all are heroes!
14:12 karolherbst: \o/ daniels, bentiss: thanks for all the work!
14:12 bentiss: daniels: I'm off for today I think. I have removed the PVC for Elasticsearch, but we need to clean up the actual data on disk to reclaim the space
14:12 bentiss: and I am starting to have a strong headache now, so that will be something for tomorrow
14:12 daniels: bentiss: thanks so much Monsieur Storage Wizard <3
14:13 daniels: hope you have a nice & quiet night, drink lots of water
14:13 bentiss: daniels: I shall not say how many reboots it took me :)
14:14 DragoonAethis: Congrats :D
14:15 JoniSt: Yay, nice! :D
14:15 daniels: bentiss: _cough_
14:15 daniels: bentiss: I'll look at how we can move to VPC-only networking
14:16 hakzsam: thanks, great job!
14:16 pq: Thank you! I got a full day of working on my own code instead of reviewing others' stuff. ;-D
14:16 jkhsjdhjs: thanks, appreciate it! finally I can look through the pipewire issues :D
14:17 karolherbst: yay, I can finally work again!
14:19 daniels: bentiss: oh yeah, when you're back tomorrow could you please push the helm changes?
14:31 kisak: Thanks for burning half your weekend on that snafu.
14:32 Mattia_98: Michael already wrote an article on Phoronix. He works fast XD
14:50 danvet: daniels, bentiss thx a lot!
14:53 bentiss: daniels: changes in helm-gitlab-config pushed
14:56 daniels: bentiss: merci!
18:38 eric_engestrom: bentiss, daniels: awesome work these last couple of days! 💪
18:38 eric_engestrom: (adding to the pile of well deserved praise)
20:41 alanc: +100
23:18 dcbaker: daniels: I have a Dockerfile for mr-label-maker: https://gitlab.freedesktop.org/dbaker/mr-label-maker-docker you can pass it GITLAB_TOKEN as an environment variable. I've gotten far enough with it to see that it wants a token, but that's it
23:18 dcbaker: I took Marcin's work and did a little cleanup to make it a little more pythonic, but it's otherwise the same (I added setup.py to make installation easier, for example)
23:19 dcbaker: let me know if that looks reasonable to you whenever you've calmed down from the ceph stuff :)
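(Building and running that image presumably goes something like this; the local tag is arbitrary, and GITLAB_TOKEN is the variable dcbaker mentions:)

    git clone https://gitlab.freedesktop.org/dbaker/mr-label-maker-docker
    cd mr-label-maker-docker
    docker build -t mr-label-maker .
    docker run -e GITLAB_TOKEN=<your-token> mr-label-maker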