06:19Adrinael: Is gitlab down?
06:20dt9: guys, is freedesktop down right now?
06:21vyivel: yep, still down, no eta, be patient
06:21dt9: ack, 10x for confirmation
07:08daniels: what a day to not have coffee at home
07:09bentiss: daniels: ouch :/
07:10bentiss: so... I managed to mess up one disk on server-3, currently restoring it
07:11daniels: ah yeah, I was just about to wonder why I wasn't able to SSH to it :P
07:12bentiss: basically nuked /var/lib/rancher :(
07:13daniels: bentiss: oh ... ouch :(
07:14bentiss: luckily given that it was at partitioning time I still had the device mounted, so I backed it up before the reboot
07:18daniels: bentiss: is there anything I can do to help atm?
07:18bentiss: daniels: if you can try to understand why server-5 is not happy with 10.41.x.x that would be good
07:19daniels: bentiss: ok! :)
07:19bentiss: and it's not kilo the culprit but flannel with wireguard backend FWIW
07:19emersion: i'll also be available in a bit, if you have a noob-friendly task :P
07:20bentiss: emersion: the more the merrier :)
07:20bentiss: emersion: same thing as daniels: it would be nice to understand why server-5 can't talk to the other services
07:21emersion: ok!
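(A first-pass triage for the server-5 question above might look like the sketch below; it is generic flannel/wireguard debugging, not taken from the actual cluster, and the addresses are placeholders.)

```sh
# Hedged first-pass checks on server-5 for the flannel/wireguard service plane
# (angle-bracketed values are placeholders, not real cluster values):
wg show                              # any flannel wireguard peers, and are handshakes recent?
ip -4 addr show | grep -i flannel    # which interface carries the flannel pod subnet?
ip route get <10.41.x.x address>     # which interface/next hop would traffic to that subnet use?
ping -c1 <private IP of another server node>
```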
07:27mceier: someone could write a mail describing the situation (and maybe an ETA); there's at least one mail on the xorg-devel ML asking about the 504 error
07:28airlied: fubar, no eta
07:28mceier: ;)
07:28bentiss: mceier: 2 disks down in the cluster, which means everything on fire
07:40daniels: mceier: good point, sent
07:40mceier: cool :)
07:43hakzsam: good luck with fixing this guys!
07:49bentiss: finally, server-3 is back
07:58Asmadeus: m
07:58Asmadeus: (sorry)
08:06bentiss: daniels, emersion: I managed to get the ssd on server-3 back in the pool, it's currently recovering, so hopefully the cluster will restart in a few minutes
08:10emersion: \o/
08:15bentiss: attempts at fixing large-5 too
08:17daniels: bentiss: I didn't manage to figure out large-5 yet; was looking just before the reboot but it's a mystery to me how the wg traffic gets captured in the first place ...
08:18bentiss: daniels: I was asking about server-5 :) not large
08:18daniels: ah
08:18daniels: rubs eyes
08:18bentiss: daniels: TBH, wg is most of the time way simpler than regular traffic, but I just don't understand the flannel config that makes the service plane work :/
08:19daniels: yeah, so I'd gone to large-5 to look at a working example, and couldn't figure out how it was supposed to work in the first place? at least judging by the routing table that's there
08:20bentiss: I fear the issue is because it's not on the same subnet
08:20bentiss: because different facility
08:21daniels: ah yeah of course, NAT
08:21bentiss: and the others are working because the default route makes it use 10.99.x.x and then the interfaces are magically picking up the traffic :(
08:22bentiss: but maybe we can leverage kilo to route the 10.41.x.x addresses toward server-2
08:23bentiss: and then we will have to transfer the cluster to a facility that is not deprecated
08:28bentiss: OK, disk on large-5 is back up, we now need to wait for ceph to settle
08:28bentiss: and then we can clean up the various OSD leftovers
08:29bentiss: daniels: so, for extra safety, what I did now was just remove the failing deployment, drain the node, reboot it, uncordon it, then zap the *correct* disk, then kill the operator
08:30bentiss: it doesn't clean up the old OSD, but at least the disk is back up
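(As a rough sketch, that sequence maps onto something like the following; node, deployment and device names are placeholders, and the exact kubectl flags may vary with the cluster version.)

```sh
kubectl -n rook-ceph delete deployment rook-ceph-osd-<id>        # remove the failing OSD deployment
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# ... reboot <node> ...
kubectl uncordon <node>
sgdisk --zap-all /dev/<the-correct-disk>                         # zap the *correct* disk this time
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator        # restart the operator so it re-creates the OSD
# note: as said above, this leaves the old OSD entry behind in the ceph cluster
```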
08:30daniels: hmm, so it _is_ at least hitting the right iptables rules to masquerade which should push it through wg ...
08:31bentiss: maybe a wrong ufw config on the other side
08:31daniels: but then why would it only be intermittent? :\
08:31daniels: bentiss: \o/ thanks!
09:00bentiss: recovery stopped... running fstrim on all the nodes, that might be the issue
09:24bentiss: reboots large-7
09:38bentiss: upgrades rook from 1.6.8 to 1.6.11
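(Assuming rook was installed from its upstream helm chart, with the usual release and namespace names as guesses, that upgrade is typically a one-liner.)

```sh
helm repo update
helm upgrade --namespace rook-ceph rook-ceph rook-release/rook-ceph --version v1.6.11
```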
09:48daniels: bentiss: on the network side, tracing through the iptables rules, in the NAT table we go from POSTROUTING -> KUBE-SERVICES -> KUBE-SVC-NPX46M4PTMTKRN6Y for the HTTP plane
09:49daniels: that balances connections between targets of 10.99.237.141 (server-2), 10.99.237.145 (server-3), and 10.66.151.3 (server-5)
09:49daniels: perhaps unsurprisingly, server-5 is the one which fails to answer itself
09:49bentiss: or the other way around :)
09:50daniels: nope
09:50daniels: every time we land in the server-5 chain it times out; every time we land in server-2 or server-3 it works fine
09:51bentiss: hmm, interesting
09:51daniels: it's on a probability distribution directing 50% to server-5, which explains why it works exactly half the time
09:51daniels: (I've currently hacked it to always forward to server-2, which is now working 100% of the time)
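(The chain being traced here can be inspected directly; the commands below are standard kube-proxy iptables inspection, with only the chain name and endpoint addresses taken from the log.)

```sh
iptables -t nat -L KUBE-SERVICES -n | grep NPX46M4PTMTKRN6Y    # which service maps to this chain
iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n -v
# kube-proxy normally emits one KUBE-SEP-* rule per endpoint, selected with
# "-m statistic --mode random --probability ...", and each KUBE-SEP-* chain
# DNATs to one endpoint (here 10.99.237.141, 10.99.237.145 and 10.66.151.3).
iptables -t nat -L KUBE-SEP-<hash> -n                          # hypothetical endpoint chain name
```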
09:52bentiss: heh, that's what I was about to suggest
09:52bentiss: can I try to uncordon server-5 then?
09:53bentiss: right now the pg are stuck backfilling, so adding a couple of disks might unblock them
09:55daniels: yep, go for it
09:55daniels: let's see what happens
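(A minimal sketch of that step, assuming the ceph commands are run from the rook toolbox deployment, whose name below is the stock one and may differ here.)

```sh
kubectl uncordon server-5
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status          # watch the backfilling PGs
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump_stuck   # list PGs stuck in backfill
```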
09:56bentiss: thanks
09:56daniels: I need to go afk for a bit anyway, back at 1pm your time
09:56bentiss: k
09:56bentiss: probably needs to find some food too
10:19bentiss: daniels: still failing a lot
10:19bentiss: I think I'll grab some lunch and then remove server-5, and use a new c2-medium as an agent, not a server. This should solve the issues with the control plane
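(Joining the replacement c2-medium as a plain agent rather than a server is the stock k3s procedure; the sketch below uses the standard install script and assumes nothing cluster-specific.)

```sh
# on the new machine; the URL points at any existing server node and the token
# comes from /var/lib/rancher/k3s/server/node-token on that server
curl -sfL https://get.k3s.io | K3S_URL=https://<existing-server>:6443 K3S_TOKEN=<node-token> sh -
```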
10:57daniels: ack
10:59daniels: and yeah, I can keep looking into it, but reflexively I think it would be better to get everything in the same facility + use a VPC for all traffic + pare ufw down to the absolute bare minimum ruleset (allow WG into boundary host + allow HTTPS/SSH ingress to elastic + drop all other incoming external) so the k3s-internal traffic can be managed solely by k3s rules?
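(One possible reading of that minimal ruleset as ufw commands; the ports are assumptions, 51820/udp being the usual WireGuard default.)

```sh
ufw default deny incoming
ufw default allow outgoing
ufw allow 51820/udp    # WireGuard into the boundary host
ufw allow 443/tcp      # HTTPS ingress to the elastic IP
ufw allow 22/tcp       # SSH ingress
ufw enable
```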
11:33bentiss: daniels: sounds appealing :)
11:36bentiss: daniels: can I nuke server-5?
11:54daniels: yep, fine by me :)
12:00shadeslayer: Hm, I seem to be hitting gateway timeout issues on gitlab
12:01shadeslayer: though hopefully it's temporary
12:01JoniSt: shadeslayer: https://www.phoronix.com/scan.php?page=news_item&px=FreeDesktop-GitLab-2022-Crash
12:01JoniSt: Sadly not temporary, the Gitlab had massive data loss
12:02shadeslayer: ah shit :(
12:08pq: JoniSt, no, no data loss AFAIU.
12:08shadeslayer: I'd be surprised if there was ^^
12:08JoniSt: Pheeeew. I hadn't heard much news about it yet other than the Phoronix article
12:09JoniSt: But yeah, I'd assume that the Gitlab gets backed up very regularly
12:09mceier: https://lists.x.org/archives/xorg-devel/2022-June/058833.html
12:10bentiss: \o/ managed to kick in the recovery once again
12:12JoniSt: Man... Good luck!
12:13bentiss: yes, no more stale pgs
12:15daniels: JoniSt: there is no data loss, just annoyance
12:18JoniSt: That's nice to hear. Reminds me of the fact that a single raid1 btrfs might also not be enough to keep my own Gitlab instance alive if something happens...
12:46bentiss: daniels: I think I'll reboot all machines one after the other, some processes are stuck
12:46daniels: bentiss: yeah ... RBD I'm guessing
12:47bentiss: fstrim is also hanging, and what actually made the recovery start again was to kill all the rook-ceph pods besides the osds
12:47bentiss: so a full reboot might help
12:47daniels: heh ...
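(For reference, "kill all the rook-ceph pods besides the osds" roughly translates to deleting by label and letting the deployments recreate them; the selectors below are the usual rook ones and are an approximation.)

```sh
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
kubectl -n rook-ceph delete pod -l app=rook-ceph-mon
kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr
# OSD pods (app=rook-ceph-osd) are deliberately left alone so recovery I/O keeps flowing
```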
13:19karolherbst: JoniSt: I was thinking about having my own gitlab instance, but making it rock solid _is_ a huge investment and until that's settled, data is better off being replicated externally anyway... :D
13:19karolherbst: I am thinking about doing raid6
13:24JoniSt: Hmm... Well, raid5/6 always gives me a bit of a weird feeling :P
13:28JoniSt: I mostly store university stuff on that Gitlab (when I do group work) so it wouldn't be thaaat bad if the filesystem crashed
13:48karolherbst: JoniSt: yeah... but I also plan to do a setup completely without any fans or noise, so I have to make sure to not go overboard with the budget
13:49karolherbst: and complete mirroring can get extremely expensive
13:49karolherbst: well. with SSDs that is
13:51karolherbst: there are actually passively cooled cases which would allow a power budget of ~200W, so that's not even a huge issue
13:51karolherbst: just.. expensive :D
13:53DragoonAethis: karolherbst: or you could go for an actively-cooled case and replace the fans
13:53karolherbst: nah
13:53karolherbst: it's still audible
13:54DragoonAethis: Yeah, but with Noctua/be quiet fans it can be really quiet
13:54DragoonAethis: Pretty much inaudible unless you've got the box in a silent room at night or something like that
13:54karolherbst: the issue is that motherboards are generally quite crappy in this regard
13:54karolherbst: so if the CPU/whatever isn't hot/warm, the fans can be turned off
13:54karolherbst: but firmware....
13:55Mattia_98: noctua fans running on low rpm are inaudible, I can vouch for that
13:55karolherbst: DragoonAethis: well.. my work laptop has its fans usually completely turned off, so it would be noticeable :P
13:55karolherbst: and I do have noctua fans for other stuff
13:55karolherbst: Mattia_98: thing is.. they are inaudible when it doesn't matter
13:56karolherbst: under load is where it matters
13:56DragoonAethis: Or alternatively, get a fan controller and write custom cooling scripts
13:56karolherbst: so I can go for complete fanless with a power budget of 200W or....
13:56karolherbst: the noctua fans aren't silent if they have to cool away ~100W of CPU heat
13:58karolherbst: DragoonAethis, Mattia_98: Streacom DB4 is what I was thinking about
13:59karolherbst: can manage up to 110W CPU heat and 65W GPU heat
13:59karolherbst: where I wouldn't need the GPU cooling
13:59karolherbst: thing is.. there isn't much space in it :D
13:59DragoonAethis: Unfortunately I'm more of a midi tower+ guy myself ;P
14:00DragoonAethis: But it looks really nice
14:00karolherbst: yeah.. I have one for my desktop for work
14:00karolherbst: got myself the be quiet! Dark Rock 4 PRO BK022 CPU cooler which isn't all that bad actually
14:01DragoonAethis: I have the non-Pro version, it's pretty good too (but getting the Pro for the next upgrade)
14:01karolherbst: definitely worth the money
14:01karolherbst: it manages 150W without getting loud
14:01karolherbst: and my CPU stays at max clock pretty much all the time
14:02DragoonAethis: And for the case it's Fractal Meshify C with the stock Fractal fans (which are almost inaudible, but they don't ramp up with the rest of the system)
14:02karolherbst: but for devices which are like on 24/7 I want something without fans :P
14:04bentiss: daniels: \o/ ceph is back in the game
14:04bentiss: no more degraded objects
14:04daniels: bentiss: woo! I saw the tools pod is working now that we're upgraded too, awesome
14:04bentiss: though daemons are crashing like hell
14:05bentiss: daniels: I fixed the deployment of the tools pod to use the same rook minor version :)
14:05bentiss: it's not part of the helm chart :(
14:05daniels: ahhhhh, right
14:05daniels: I did think it was weird that they'd use :master
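(Since the toolbox Deployment isn't managed by the helm chart, pinning it by hand looks roughly like the following; the deployment/container names and image tag are assumptions based on the stock rook toolbox manifest.)

```sh
kubectl -n rook-ceph set image deployment/rook-ceph-tools rook-ceph-tools=rook/ceph:v1.6.11
```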
14:06daniels: ooh, gitaly-2 now running too
14:06bentiss: and postgres!
14:06daniels: \o/
14:06daniels: should I try redis next?
14:06bentiss: still having connectivity issues
14:06daniels: ah :(
14:06daniels: control plane or inter-pod?
14:07bentiss: I mean 504
14:07bentiss: but yeah, feel free to re-enable redis
14:07daniels: oh right, yeah it'll 504 since I killed the redis + webservice pods :P
14:07bentiss: that's what I just realized :)
14:07daniels: the log noise was getting annoying
14:09bentiss: \o/ back online!!!!!
14:10daniels: :D :D :D
14:10bentiss: that's what... 24h of downtime?
14:10bentiss: (back online, right before I got to pick up kids at school)
14:10pixelcluster: many thanks to both of you for fixing it!!
14:11Mattia_98: nice job guys!
14:12jekstrand: \o/ y'all are heroes!
14:12karolherbst: \o/ daniels, bentiss: thanks for all the work!
14:12bentiss: daniels: I'm off for today I think. I have removed the PVC for Elasticsearch, but we need to clean up the actual data on disk to reclaim the space
14:12bentiss: and I am starting to have a strong headache now, so that will be something for tomorrow
14:12daniels: bentiss: thanks so much Monsieur Storage Wizard <3
14:13daniels: hope you have a nice & quiet night, drink lots of water
14:13bentiss: daniels: I shall not say how many reboots it took me :)
14:14DragoonAethis: Congrats :D
14:15JoniSt: Yay, nice! :D
14:15daniels: bentiss: _cough_
14:15daniels: bentiss: I'll look at how we can move to VPC-only networking
14:16hakzsam: thanks, great job!
14:16pq: Thank you! I got a full day of working on my own code instead of reviewing others' stuff. ;-D
14:16jkhsjdhjs: thanks, appreciate it! finally I can look through the pipewire issues :D
14:17karolherbst: yay, I can finally work again!
14:19daniels: bentiss: oh yeah, when you're back tomorrow could you please push the helm changes?
14:31kisak: Thanks for burning half your weekend on that snafu.
14:32Mattia_98: Michael already wrote an article on Phoronix. He works fast XD
14:50danvet: daniels, bentiss thx a lot!
14:53bentiss: daniels: changes in helm-gitlab-config pushed
14:56daniels: bentiss: merci!
18:38eric_engestrom: bentiss, daniels: awesome work these last couple of days! 💪
18:38eric_engestrom: (adding to the pile of well deserved praise)
20:41alanc: +100
23:18dcbaker: daniels: I have a Dockerfile for mr-label-maker: https://gitlab.freedesktop.org/dbaker/mr-label-maker-docker. You can pass it GITLAB_TOKEN as an environment variable. I've gotten far enough with it to see that it wants a token, but that's it
23:18dcbaker: I took Marcin's work and did a little cleanup to make it a little more Pythonic, but it's otherwise the same (I added a setup.py to make installation easier, for example)
23:19dcbaker: let me know if that looks reasonable to you whenever you've calmed down from the ceph stuff :)
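(A hedged usage sketch for the image described above; the image name and invocation are assumptions, and the only fact taken from the log is that GITLAB_TOKEN is read from the environment.)

```sh
git clone https://gitlab.freedesktop.org/dbaker/mr-label-maker-docker
cd mr-label-maker-docker
docker build -t mr-label-maker .
docker run -e GITLAB_TOKEN=<personal access token> mr-label-maker
```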