00:17gfxstrand[d]: I think my Volta is just overheating and shutting down
00:17gfxstrand[d]: Turns out those boot clocks aren't low enough!
00:18airlied[d]: moar fans!
00:18gfxstrand[d]: Yeah. I need to figure out how to rig a fan
00:21karolherbst[d]: gfxstrand[d]: that should like never happen
00:21karolherbst[d]: though if it falls off the bus, then yeah... but....
00:22karolherbst[d]: doesn't volta have therm sensors set up in nouveau?
00:22karolherbst[d]: I don't see a reason why not
00:29mohamexiety[d]: airlied[d]: when you got things working on my branch was it just removing the GART flag or were there other nvk modifications?
00:30mohamexiety[d]: Relaxing the GART limitation in kernel only and I am seeing the same behavior
00:30airlied[d]: this is where I wonder what computer I was doing that work on
00:31mohamexiety[d]: :blobcatnotlikethis:: sorry
00:31airlied[d]: https://paste.centos.org/view/raw/45652042 is the git diff of the mesa, I've no idea what is relevant
00:32airlied[d]: I currently have 4 nvidia development machines 😛
00:32gfxstrand[d]: karolherbst[d]: It does:
00:32gfxstrand[d]: [ 2341.348925] nouveau 0000:01:00.0: therm: temperature (90 C) hit the 'fanboost' threshold
00:32gfxstrand[d]: [ 2579.392844] nouveau 0000:01:00.0: therm: temperature (95 C) hit the 'downclock' threshold
00:32gfxstrand[d]: [ 2983.026840] nouveau 0000:01:00.0: timer: stalled at ffffffffffffffff
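A minimal sketch of pulling the thermal events out of a dmesg capture like the one above. The regex is written against only the two `therm:` lines quoted here, so the exact message format across kernel versions is an assumption, and `parse_therm_events` is a hypothetical helper name.

```python
import re

# Written against lines such as:
# [ 2341.348925] nouveau 0000:01:00.0: therm: temperature (90 C) hit the 'fanboost' threshold
THERM_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nouveau\s+(?P<dev>\S+):\s+therm:\s+"
    r"temperature \((?P<temp>\d+) C\) hit the '(?P<name>\w+)' threshold"
)

def parse_therm_events(log_text):
    """Return (timestamp, pci_device, temp_c, threshold_name) per therm line."""
    events = []
    for line in log_text.splitlines():
        m = THERM_RE.search(line)
        if m:
            events.append(
                (float(m["ts"]), m["dev"], int(m["temp"]), m["name"])
            )
    return events
```

Lines that do not match, such as the `timer: stalled` one, are skipped rather than treated as errors.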
00:34mohamexiety[d]: airlied[d]: Doh the alignment
00:34mohamexiety[d]: I forgot that
00:34mohamexiety[d]: Thanks! ❤️
00:35mohamexiety[d]: airlied[d]: I need more physical space. Soon will be having two and even that I could barely make work :KEKW:
00:35orowith2os[d]: If your hardware isn't being pushed hard enough to start doing math wrong like an overworked college student, you're doing something wrong
01:07gfxstrand[d]: gfxstrand[d]: Fans showing up tomorrow. Gonna get loud in my office. 😅
01:08mohamexiety[d]: Hm this is a bit awkward. airlied[d] sorry but can I add in the alignment change (aligning VAs to 64KiB/2MiB), boot in terminal mode and test via meson devenv or do I need to install mesa and test that?
01:09mohamexiety[d]: I am missing a few components for my second system (should arrive tomorrow or Saturday) so not sure I want to install a modified nvk just yet
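For reference, the arithmetic behind "aligning VAs to 64KiB/2MiB" can be sketched as below. This only illustrates rounding an address up to a power-of-two boundary; `align_up` and `is_aligned` are hypothetical helpers, and the chat does not say which allocations get which alignment in nvk.

```python
KIB = 1024
MIB = 1024 * KIB

def align_up(value, alignment):
    """Round value up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)

def is_aligned(value, alignment):
    """True if value is already a multiple of the power-of-two alignment."""
    return value & (alignment - 1) == 0

# Example values chosen for illustration only.
va_64k = align_up(0x12345, 64 * KIB)  # 0x20000
va_2m = align_up(0x12345, 2 * MIB)    # 0x200000
```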
01:16airlied[d]: I think you can meson devenv fine, but just with CTS
01:31gfxstrand[d]: The only time you can't really devenv is when you're trying to run a whole desktop environment.
01:32gfxstrand[d]: But with random apps or the CTS, it's usually fine.
01:34orowith2os[d]: gfxstrand[d]: `meson devenv -- launch.sh`?
01:34orowith2os[d]: What breaks?
01:36gfxstrand[d]: Sometimes various components will trash the environment in various ways.
01:37gfxstrand[d]: Especially if setuid is involved but even when it's not sometimes environment variables don't make their way through.
01:40karolherbst[d]: gfxstrand[d]: RIP
01:41karolherbst[d]: it's kind of impressive tho.. is the fan not turned on at all?
01:42gfxstrand[d]: It has no fan. 🤡
01:42karolherbst[d]: ahh....
01:42karolherbst[d]: volta with no fans?
01:42gfxstrand[d]: It has a massive heat sink. BYOF
01:43karolherbst[d]: I see...
01:43gfxstrand[d]: I ordered some fans that I should be able to mount on it and an adapter I can rig up to power them.
01:44mohamexiety[d]: yeah it was probably a server card
01:44karolherbst[d]: ahh.. right
01:44mohamexiety[d]: those rely on the fact you have those funny 9k RPM fans blowing at them :KEKW:
01:44karolherbst[d]: yeah.. wanted to ask who thought passively cooling a 250W TDP card is a great idea
01:45mohamexiety[d]: I am moreso impressed it got to thermal shutdown with no reclocking tbh
01:45mohamexiety[d]: like
01:45mohamexiety[d]: it should be running at basically idle power
01:47gfxstrand[d]: mohamexiety[d]: Yeah, that's what gets to me. Like, how?!?
01:48airlied[d]: I also wouldn't be shocked if volta just fell off the bus for no reason :-P, but I assume we've gotten a CTS run at least on GL one
01:48gfxstrand[d]: I guess there's enough silicon there to burn decent power even when it's clocked down.
01:48karolherbst[d]: the hardware also cuts clocks by 1/8 when it gets too hot
01:48gfxstrand[d]: I'll get fans on it and we'll see how it goes
01:49gfxstrand[d]: I should get about 16CFM going across the heatsink.
01:50karolherbst[d]: maybe you also just need to fix airflow in your case 😛
01:52gfxstrand[d]: Nah. I bought a stupid card. It needs fans. Oh, well. I'll get them.
02:06gfxstrand[d]: But it takes close to an hour to get hot enough to fall off the bus so I don't think I need that much air to keep it cool enough.
03:14HdkR: gfxstrand[d]: Kind of surprised that you didn't get a GV100 which has a fan
03:15HdkR: Or a Titan V I guess
03:16HdkR: Server class cheaper because everyone upgraded immediately I guess? :D
03:19gfxstrand[d]: Yeah, they're like $200 cheaper and easier to get your hands on.
03:19gfxstrand[d]: The day I was shopping, I could only find a couple desktop ones that shipped in reasonable time/price and they wanted $500 for them.
03:20HdkR: ah
03:20pavlo_kozlenko[d]: This is the first time I've heard of Volta, what series of video cards is this?
03:20pavlo_kozlenko[d]: 800?
03:20gfxstrand[d]: This one was $300 and I got it in a couple days.
03:20HdkR: Series between Pascal and Turing
03:20gfxstrand[d]: pavlo_kozlenko[d]: TITAN V and Quadro GV100. Those are the only two.
03:21gfxstrand[d]: That's why it's so hard to get your hands on. They only made it as a mega card.
03:21gfxstrand[d]: Well, and the Xavier
03:21pavlo_kozlenko[d]: by the way, there was some kind of videocard based on Kepler, called Titan ...
03:22gfxstrand[d]: Yes
03:22gfxstrand[d]: There's a few TITANs. The Volta one is TITAN V.
03:24pavlo_kozlenko[d]: TITAN V||olta||
03:25HdkR: Titan, Titan Black, Titan Z, Titan X (Maxwell), Titan X (Pascal), Titan Xp, Titan V (,CEO edition), Titan RTX
03:25HdkR: Quite a few
03:28dwfreed: not confusing at all /s
03:31HdkR: I particularly enjoy the three Titan X devices
03:32HdkR: Mostly because people were already calling the Pascal one the Titan XP, and then NVIDIA released the XP :D
04:06gfxstrand[d]: Yeah, the TITAN Xp (note that the only lower case letter is "p") is kinda hilarious.
05:26tiredchiku[d]: oh looks like the 5090 is finally available in India
05:27tiredchiku[d]: 4200 USD
05:28tiredchiku[d]: cheapest 5080 is 1600$
05:28tiredchiku[d]: cheapest 5070Ti is 1050$
05:28tiredchiku[d]: cheapest 5070 is 790$
05:32magic_rb[d]: Holy shit, 4k for a gpu is utterly ridiculous
05:41HdkR: Don't look at the professional level cards then :)
05:42HdkR: The RTX 6000 Ada is currently $8k
05:43HdkR: I assume the RTX Pro 6000 is going to be roughly equivalent but also try to start itself on fire
05:45tiredchiku[d]: the workstation gpu naming scheme has become so awful
05:45tiredchiku[d]: dunno why they dropped the quadro branding with reasonable model numbers
07:18dj-death: gfxstrand[d]: I was curious why this was set in Anv, even before Gfx libs when everything would have been link optimized
13:13gfxstrand[d]: dj-death: Oh, that's a weird bit of history. It was entirely because of how the URB layout code in the Intel backend compiler lays out clip/cull distances. For some reason (the details are 10 years old at this point), it would sometimes screw stuff up if we didn't set that bit.
14:07djdeath3483[d]: okay
14:07djdeath3483[d]: I turned it off and it seems to pass all the pipeline CTS
14:08djdeath3483[d]: off only for non-pipeline libraries
14:08djdeath3483[d]: but yeah... been knee deep into the VUE/MUE code and it's soooo fucked
14:08djdeath3483[d]: there are like 4 layers of indirection between NIR location & where stuff lands in the thread payload
15:24gfxstrand[d]: Yeah, it's deeply cursed. Like, old as the foundations of the earth kind of curse. Be careful in there!
15:25gfxstrand[d]: https://tenor.com/view/maleficent-curse-cursed-evil-revenge-gif-16300466
16:45gfxstrand[d]: Switching my CTS desktop over to GSP 570. Here's hoping it becomes a bit more stable.
16:45tiredchiku[d]: spoiler alert: it doesn't
16:46gfxstrand[d]: (I switched my dev desktop a while ago but never got around to switching the CTS one.)
16:46gfxstrand[d]: tiredchiku[d]: I'm hoping it'll fix a very particular form of GSP crash so I'm not totally giving up hope.
16:46gfxstrand[d]: But I don't expect it to fix suspend issues or anything like that
16:49tiredchiku[d]: ah, fair
17:11gfxstrand[d]: Ugh... It's still loading 535 but I have 570 clearly sitting there in my initramfs...
17:13gfxstrand[d]: Oh, because skeggsb9778[d] changed which version it tries to load.
17:13gfxstrand[d]: 570.124.04 instead of 570.86.16
17:13tiredchiku[d]: :BlobhajShock:
17:13tiredchiku[d]: the betrayal!
17:18gfxstrand[d]: I really wish it would print a "Failed to find GSP XXXXX" message if it tries to load and can't find the file.
17:19gfxstrand[d]: Maybe we could get upstream folks to be okay with that since it only prints on a failure path?
17:19tiredchiku[d]: maybe
17:19tiredchiku[d]: I feel like failing to load firmware is important enough that it should be logged
17:20gfxstrand[d]: Yeah
17:20gfxstrand[d]: I get not wanting to spam on success but spamming on failure seems reasonable
17:21gfxstrand[d]: And the firmware-free linux folks can just deal with it
17:21gfxstrand[d]: They probably aren't running on recent NVIDIA cards anyway
17:21orowith2os[d]: or they're running a 40xx nvidia card with linux-libre, expecting it to still work :ferrisClueless:
17:22orowith2os[d]: I wouldn't put it past some people to spend 4000 dollars on a setup only to try and put dwm, linux-libre, and coreboot or something onto it
17:22orowith2os[d]: guix, anyone?
17:22gfxstrand[d]: `[ 4.134960] nouveau 0000:17:00.0: gsp: RM version 570.124.04`
17:23gfxstrand[d]: Okay, throwing the CTS at it. It usually takes 3 runs on average to get the GSP to fall over so I'll keep running until I hit one.
17:26skeggsb9778[d]: i'll be changing it again today, sorry 😛
17:26gfxstrand[d]: 😂
17:26mohamexiety[d]: :KEKW:
17:26gfxstrand[d]: Go ahead. I've got a pair of branches that works again now
17:26mohamexiety[d]: can we push to gitlab now?
17:26skeggsb9778[d]: oh, good point
17:26gfxstrand[d]: mohamexiety[d]: No, it still has the warning
17:27mohamexiety[d]: aww
17:27jannau: why doesn't the firmware loader print anything? see drivers/base/firmware_loader/main.c: "Direct firmware load for %s failed with error %d\n"
17:27jannau: and I see that warning with brcmfmac
17:27gfxstrand[d]: eric_engestrom: tagged a Mesa release and then the tag went away because of GitLab migration. :frog_upside_down: I'd recommend against any pushing at the moment.
17:27skeggsb9778[d]: nouveau silences it, because it tries multiple versions
17:28gfxstrand[d]: I guess if a distro knows they'll use a new enough kernel, they can strip out some of the old versions.
17:30skeggsb9778[d]: i can probably make it so a warning is emitted (with fallback to direct hw programming on <=ampere), and explicit error on >=ada (currently it'd be "gsp: ctor failed", or something)
17:30jannau: I should do the same for asahi/brcmfmac. in that combo it tries to load multiple variants as well and I see the message 5 times
17:37gfxstrand[d]: Yeah, I don't want to spam anyone. I just want to be able to debug more effectively than just checking piles of stuff to see if it all matches.
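One way to read the compromise discussed above: stay quiet on each individual miss, but emit a single message once a fallback is picked or everything fails. The sketch below is Python for illustration only, not kernel code; `pick_firmware`, `try_load`, and `warn` are hypothetical names, and the version strings are the ones quoted in the chat ("535" is abbreviated there, so it is abbreviated here).

```python
def pick_firmware(candidates, try_load, warn):
    """Try firmware versions in preference order.

    try_load(version) returns the blob or None if the file is missing.
    Individual misses are not logged; one summary line is emitted when a
    fallback is used or when nothing could be loaded.
    """
    missing = []
    for version in candidates:
        blob = try_load(version)
        if blob is None:
            missing.append(version)
            continue
        if missing:
            warn(f"GSP firmware not found for {', '.join(missing)}; "
                 f"using {version}")
        return version, blob
    warn(f"Failed to find GSP firmware; tried {', '.join(missing)}")
    return None, None

# Illustrative use: only the older firmware is present.
available = {"570.86.16": b"old-blob"}
messages = []
chosen, _ = pick_firmware(
    ["570.124.04", "570.86.16", "535"], available.get, messages.append
)
```

With this shape a successful load of the preferred version prints nothing, which matches the wish not to spam on success.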
19:32djdeath3483[d]: gfxstrand[d]: how does all the data that goes into the URB on Intel get passed around on NV?
20:16gfxstrand[d]: All system value things have fixed slots in the hardware. Varyings go in another address space. We tell the hardware what slots are actually used in the shader header and it compacts stuff by magic.
20:22djdeath3483[d]: nice
20:22djdeath3483[d]: PrimitiveID is a system value too?
20:23HdkR: NVIDIA sticks a bunch of values in to the system values. It's great! :D
20:23gfxstrand[d]: djdeath3483[d]: Yup
20:24gfxstrand[d]: The only things that aren't real system values are `gl_DrawID` and `gl_ViewIndex`
20:24djdeath3483[d]: HdkR: Intel does not, it sucks 🙂
20:24gfxstrand[d]: Oh and `gl_BaseVertex/Instance`
20:24gfxstrand[d]: Those we push as uniforms
20:25gfxstrand[d]: Everything else is real system values
20:43snowycoder[d]: This might be a problem: in sm50, the FAdd encoding does not take the rounding mode into account for long-immediate instructions. The legalizing code should check if a rounding mode is present and copy the long immediate to a register
20:44snowycoder[d]: Same hw quirk happens in Kepler unfortunately, god Volta+ is so much cleaner
20:45HdkR: When you have 128-bits to encode instruction data, you can do almost anything :P
20:45HdkR: Ignore the 20-ish bits stolen by scheduling
20:46snowycoder[d]: Doesn't it have problems with instruction cache? They made instructions 2x larger
20:47HdkR: They also grew caches by like 4x in two generations, Surely it's fine :P
20:48HdkR: but to be fair, Volta was heavily icache bounded in some shader programs
20:48HdkR: Turing and above improved that
21:04gfxstrand[d]: snowycoder[d]: That's fine as long as the default rounding mode that works is round-even.
21:04gfxstrand[d]: Just fix it up in validate
21:05gfxstrand[d]: No one really cares about math that isn't round-even.
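A toy sketch of the fix-up described above, assuming the long-immediate FAdd form can only express round-to-nearest-even: when another rounding mode is requested, the immediate is first copied into a register. The classes and `legalize_fadd` are hypothetical illustrations in Python, not the compiler's actual IR or API.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    index: int

@dataclass
class LongImm:
    bits: int  # raw 32-bit immediate

@dataclass
class Mov:
    dst: Reg
    src: LongImm

@dataclass
class FAdd:
    dst: Reg
    srcs: list
    rnd_mode: str = "rne"  # "rne" = round to nearest even

def legalize_fadd(instr, alloc_reg):
    """Return the instruction list that replaces instr.

    Assumption: the long-immediate encoding has no rounding-mode field,
    so any mode other than "rne" needs the immediate moved to a register.
    """
    if instr.rnd_mode == "rne":
        return [instr]
    out, new_srcs = [], []
    for src in instr.srcs:
        if isinstance(src, LongImm):
            tmp = alloc_reg()
            out.append(Mov(tmp, src))
            new_srcs.append(tmp)
        else:
            new_srcs.append(src)
    out.append(FAdd(instr.dst, new_srcs, instr.rnd_mode))
    return out
```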
21:08snowycoder[d]: Yep, will send a MR when gitlab is up
21:09gfxstrand[d]: gfxstrand[d]: CTS has run 3 times successfully so far. Not definitively better but probably not worse? :shrug_anim: I'll keep kicking off runs for a while
21:14airlied[d]: is the failure you've seen where you see VMM allocations fail and all the channels die?
21:16gfxstrand[d]: Yeah
21:16gfxstrand[d]: It's not something I've ever seen anyone else see in the wild or I've ever seen when trying to run games or whatever. But whacking nouveau.ko with 36 instances of the CTS across two GPUs seems to hit corner cases somewhere.
21:29_lyude[d]: i don't know if anyone remembers me mentioning the silly issue that stopped my test machine from booting when I was originally looking at the mini cursor issue, but I just figured it out and I am a fool. this is where dev-dm8 has been coming from:
21:29_lyude[d]: Boot args: **resume=/dev/dm-8** nmi_watchdog=1 crashkernel=512M log_buf_len=50M iomem=relaxed console=tty0 consoleblank=1200 pci=noaer systemd.log_level=debug systemd.log_ratelimit_kmsg=0 rd.break=pre-mount
21:29_lyude[d]: can't believe how much time I wasted not noticing this lmao
21:30tiredchiku[d]: https://tenor.com/view/slochiverse-the-slocher-past-self-past-self-gif-7518002
22:46gfxstrand[d]: Okay, my Volta now has fans. They're not great fans but there's air moving across the heatsink now. Hopefully enough to keep it below 95C at boot clocks.
22:55dwfreed: damn
22:58gfxstrand[d]: If not, I can get some proper server fans and ear plugs. 😂
22:59Jasper[m]: Did you just buy a Tesla from somewhere?
22:59Jasper[m]: Or whatever the weird compute server models are called
23:01gfxstrand[d]: Yeah
23:01gfxstrand[d]: It's a server card that has a massive heat sink and fan mounts on the end
23:02gfxstrand[d]: But it didn't come with fans so I had to buy some on Amazon
23:02dwfreed: gfxstrand[d]: I have a 1U server sitting next to me; it doesn't have anything super hot in it, just 4x 6 TB helium drives and 2x 10 core ivy bridge; I barely notice the fan noise
23:02gfxstrand[d]: And the screws I got with them didn't work (because the heat sink uses US screws for some reason?!?) so I had to go to Lowes (big box hardware store) and buy screws.
23:03gfxstrand[d]: It's currently sitting at 51°C according to `sensors`.
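For logging the temperature over a long run without calling `sensors` by hand, a small sketch that reads the hwmon sysfs files directly. The hwmon interface reports `temp*_input` in millidegrees Celsius; which `/sys/class/hwmon/hwmonN` directory belongs to the nouveau device is an assumption left to the caller, and `read_temps_c` is a hypothetical helper.

```python
from pathlib import Path

def read_temps_c(hwmon_dir):
    """Return {file_name: degrees_celsius} for every temp*_input file.

    hwmon_dir is a directory such as /sys/class/hwmon/hwmon0 (the index
    for a given GPU varies per machine).
    """
    temps = {}
    for path in sorted(Path(hwmon_dir).glob("temp*_input")):
        # sysfs values are integers in millidegrees Celsius
        temps[path.name] = int(path.read_text().strip()) / 1000.0
    return temps
```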
23:06Jasper[m]: would be a kind of interesting testing setup
23:07Jasper[m]: Full height serverrack with a bunch of pcie slots. Throw in a bunch of cards per gen
23:08Jasper[m]: (or on smaller scale, can imagine that you'd be using 6000W of power before you're out of Kepler territory :^))
23:10gfxstrand[d]: Now up to 52C
23:11gfxstrand[d]: 21 minutes and only up to 52C. I think it might be working.
23:18redsheep[d]: gfxstrand[d]: Considering you were looking at close to an hour before failure I'm sure any airflow at all will do it. Probably could have blown on it and done the trick
23:19gfxstrand[d]: It's gotta last like 8 hours, though. Even a slow climb up is problematic.
23:20redsheep[d]: Yeah steady state for passive cooling is very slow. Still with it taking that long you were very close to fine
23:24redsheep[d]: Do the fans spin on your maxwell b or has that been surviving passive?
23:39gfxstrand[d]: 48 min, 53C
23:39gfxstrand[d]: redsheep[d]: I don't think they spin. I don't remember, though. And it's a 980 so there's a decent bit of GPU there.
23:43gfxstrand[d]: Now down to 52C. I think things may have stabilized
23:43gfxstrand[d]: 52C is a little higher than necessary but certainly a sustainable temperature.