00:18gfxstrand[d]: Storage images, in particular
00:18gfxstrand[d]: Textures work fine
00:18gfxstrand[d]: That and shared atomics is about all I've got left on sm20, I think.
00:21gfxstrand[d]: I think NAK might be strictly better than codegen now.
00:22gfxstrand[d]: Perf isn't great because I'm not inserting scheduling instructions but meh. <a:shrug_anim:1096500513106841673>
00:22gfxstrand[d]: Oh, and I guess shared atomics
00:22gfxstrand[d]: But given how many more features are available once we flip on NAK...
00:23gfxstrand[d]: Grr... Still getting a few ILLEGAL_SPH_INSTR_COMBO
00:25gfxstrand[d]: And derivatives are busted
01:40mangodev[d]: gfxstrand[d]: does the vulkan runtime or the driver itself handle texture compression (as in BCx/ETC/ASTC/PVRTC, not PNG/JPEG)?
01:42mangodev[d]: also speaking of formats
01:42mangodev[d]: for future reference, can an open source driver have acceleration for formats *other* than AV1 (since h264/h265 have royalties, and are the only other two formats in vulkan video [afaik])? or is there some exemption because the decoding/encoding utilizes a dedicated hardware component that the company already paid royalties for?
01:43orowith2os[d]: any open source driver can support formats other than AV1, but someone needs to pay for them somewhere
01:43orowith2os[d]: Intel already did iirc, so they're okay. AMD pushed it off I think, and NVIDIA might've? Not sure.
01:43gfxstrand[d]: mangodev[d]: For those sorts of formats, they're compressed offline and the app provides compressed textures directly.
01:44mangodev[d]: gfxstrand[d]: whoops, typo
01:44gfxstrand[d]: Typically the game developers have the compressor as part of their build pipeline when they build the game.
01:45mangodev[d]: i meant *de*compression, not the compression of the textures, sorry :P
01:45gfxstrand[d]: Oh, decompression happens in hardware as you sample.
01:45mangodev[d]: does nvk already integrate with that, or is it just uncompressed textures for now?
01:45gfxstrand[d]: There's no separate decompression step. Half the point is to save on VRAM, so it's decompressed on the fly.
01:46mangodev[d]: interesting
01:46mangodev[d]: so no extra instructions needed? or is a different sampling instruction used depending on the format?
01:46gfxstrand[d]: We advertise everything the hardware supports.
01:47gfxstrand[d]: Nope. It's all handled by the sampler. We just set the format to BC6 or whatever in the descriptor and the hardware does the rest.
01:46mangodev[d]: that's actually more automatic than i expected
01:48mangodev[d]: genuinely thought it was gonna be a PitA like fp64 operations where they only do a small part of the work for you
01:48gfxstrand[d]: There's a little annoyance with some of the image layout calculations but that's a pretty well understood problem.
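As an illustration of the exchange above, here is a minimal Vulkan sketch of what "the hardware does the rest" looks like from the app side: the image is created directly with a BC format and the pre-compressed blocks are copied in as-is. This assumes an existing device, command buffer, and staging buffer of raw BC7 blocks; memory binding and the layout transition are omitted, and it is not NVK-specific.

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Create an image directly in a BC format and copy the pre-compressed
 * blocks into it.  There is no decode pass; the sampler decompresses
 * texels on the fly when the shader samples the image. */
static VkImage
create_bc7_texture(VkDevice dev, VkCommandBuffer cmd, VkBuffer staging,
                   uint32_t width, uint32_t height)
{
    const VkImageCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .imageType = VK_IMAGE_TYPE_2D,
        .format = VK_FORMAT_BC7_UNORM_BLOCK,   /* hardware-decoded at sample time */
        .extent = { width, height, 1 },
        .mipLevels = 1,
        .arrayLayers = 1,
        .samples = VK_SAMPLE_COUNT_1_BIT,
        .tiling = VK_IMAGE_TILING_OPTIMAL,
        .usage = VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT,
        .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    };
    VkImage image;
    vkCreateImage(dev, &info, NULL, &image);

    /* Memory binding and the UNDEFINED -> TRANSFER_DST_OPTIMAL barrier are
     * omitted; the staging buffer already holds raw BC7 blocks. */
    const VkBufferImageCopy region = {
        .imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
        .imageExtent = { width, height, 1 },
    };
    vkCmdCopyBufferToImage(cmd, staging, image,
                           VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
    return image;
}
```

Sampling such an image in a shader is then identical to sampling an uncompressed one; the decode happens in the sampler.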
01:48orowith2os[d]: orowith2os[d]: khronos also apparently has plans to wrap up vp9 video decode this year apparently?
01:48mangodev[d]: orowith2os[d]: was gonna ask
01:49orowith2os[d]: ope, two "apparently"s in there. oh well.
01:49mangodev[d]: vp9 would be a pretty substantial codec to land support for, given webm and youtube use it as a primary format, combined with the lack of royalties compared to h264
01:49gfxstrand[d]: mangodev[d]: It's only a PITA if you're trying to support a format the hardware doesn't support. Like most desktop GPUs don't support ASTC and some Mesa drivers have a transcoder that's good enough for some Android apps.
01:50mangodev[d]: gfxstrand[d]: i'd assume ASTC is mainly used for mobile GL-ES or Mesa through Waydroid?
01:51gfxstrand[d]: It's mostly just for Android.
01:51gfxstrand[d]: Or GLES
01:51HdkR: The ASTC decoder should definitely only be used for Android/GLES :)
01:51mangodev[d]: ig that's what the A stands for
01:51mangodev[d]: funny that
01:51gfxstrand[d]: No, it stands for Arm
01:52mangodev[d]: that makes more sense
01:52HdkR: `Adaptive scalable texture compression (ASTC)`
01:52gfxstrand[d]: Never mind me.
01:52gfxstrand[d]: But it was developed by Arm
01:53gfxstrand[d]: That part's true
01:53HdkR: Yea, mostly ARM with some collab in there
01:53mangodev[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1362969260690898994/image.png?ex=6804538a&is=6803020a&hm=78f5d2d29e6f7ab425dd8027dcb16d62d613619a8729530ec9c8d28093073016&
01:53mangodev[d]: i noticed that it has LDR and HDR modes too (couldn't fit it all in the screenshot, it's a tall table)
01:54mangodev[d]: …why are so many texture compression standards developed by phone companies?
01:55mangodev[d]: ETC being by Ericsson, PVRTC being by PowerVR, ASTC being by ARM
01:55orowith2os[d]: efficiency for their specific hardware
01:56HdkR: BW is at a premium on mobiles, would be good to save BW and memory. It's win-win.
01:56mangodev[d]: the only one that *isn't* for mobile phones is S3/BCn, and it's old as hell
01:56HdkR: And we needed something newer than BC/S3.
01:56HdkR: ASTC is the best format that no one wants :P
01:57orowith2os[d]: mangodev[d]: vp8 is apparently a codec too, which makes the list Firefox's about:support page shows: h264, h265, av1, vp9, and vp8
01:57mangodev[d]: so which ones are and aren't used on desktop machines?
01:57orowith2os[d]: should wrap it up nicely
01:58HdkR: Desktop GPUs universally support BC* because D3D mandates it.
01:59HdkR: But pretty much nothing in desktop GPU space supports ASTC.
01:59mangodev[d]: HdkR: probably because microsoft practically mandates DXTn/BCn
01:59HdkR: Yes
01:59HdkR: It's a good thing
02:00HdkR: Google tried pulling a Microsoft and mandating ASTC but everyone but the mobile vendors fought back.
02:00mangodev[d]: orowith2os[d]: wait what
02:00mangodev[d]: apparently WEBP is to VP8 as AVIF is to AV1??
02:00mangodev[d]: how did it take me this long to learn
02:01mangodev[d]: HdkR: tbf google loves mandating formats they themselves won't even use, ruining the whole point of mandating a format
02:01orowith2os[d]: https://cdn.discordapp.com/attachments/1034184951790305330/1362971394022965358/image.png?ex=68045587&is=68030407&hm=8f599f7202f7d16cdd7746ec0dcdf124fe2fddeccb1821394a20f05fd68bafe1&
02:01orowith2os[d]: mangodev[d]: I don't know what these funny words mean, but this is what firefox says:
02:01orowith2os[d]: and I know AAC down is audio
02:02mangodev[d]: orowith2os[d]: hardware-decoded audio sounds funky
02:02mangodev[d]: but the fact that it's an option implies that there's hardware FLAC decoders/codecs out there
02:02orowith2os[d]: there are
02:02gfxstrand[d]: Part of the problem is that compressed texture formats cost hardware and ASTC is pretty expensive. Desktop vendors have already invested a lot in all the BC formats and aren't thrilled about burning all that die space just to add ASTC when it's not *that* much of an improvement.
02:03gfxstrand[d]: It's better than BC but it's not enough better to be worth the hardware. And ASTC isn't a replacement for BC so they can never get rid of the BC hardware.
02:03HdkR: Also completely unused in the desktop space, chicken and egg problem, hard to care :D
02:03HdkR: ASTC was too complex for its own good.
02:03gfxstrand[d]: Mobile, on the other hand, had ETC and ETC2 and they suck.
02:03gfxstrand[d]: ASTC has one advantage: It's not patent encumbered.
02:04gfxstrand[d]: And it does generally let you get some pretty nice compression ratios
02:04mangodev[d]: gfxstrand[d]: if BC is Microsoft's format, what do OGL and Vulkan use? BC as well? or a mobile format? (referring to desktop GL and VK, not GL-ES or mobile VK)
02:04gfxstrand[d]: BC
02:05gfxstrand[d]: Developers don't want to ship separate assets for Vulkan/GL/D3D.
02:05gfxstrand[d]: They want to ship one set of BC assets and then have multiple back-ends they can flip between with a settings drop-down.
02:06mangodev[d]: gfxstrand[d]: fair
02:06mangodev[d]: reminds me of some UE5 games
02:06mangodev[d]: being able to swap between DX12 and VK backends (with the VK one being less stuttery lmao)
02:06gfxstrand[d]: If you have a game that ships on both desktop and mobile, they build one asset pack with BC for desktop and a separate asset pack with ASTC for mobile. But that's fine because all mobile gets ASTC and all desktop gets BC. You never have to ship both on the same platform.
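A short sketch of how an engine might pick between those two asset packs at startup, using the standard format-property query; the pack names here are made up.

```c
#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Query whether the device can sample a given block format.  Desktop GPUs
 * report the BC formats; most mobile GPUs report ASTC instead. */
static bool
can_sample(VkPhysicalDevice pdev, VkFormat fmt)
{
    VkFormatProperties props;
    vkGetPhysicalDeviceFormatProperties(pdev, fmt, &props);
    return (props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT) != 0;
}

/* Pick one asset pack per platform; the pack names are hypothetical. */
static const char *
pick_asset_pack(VkPhysicalDevice pdev)
{
    if (can_sample(pdev, VK_FORMAT_BC7_UNORM_BLOCK))
        return "assets_bc";
    if (can_sample(pdev, VK_FORMAT_ASTC_4x4_UNORM_BLOCK))
        return "assets_astc";
    return "assets_uncompressed";
}
```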
02:06orowith2os[d]: gfxstrand[d]: some games will ship several copies of their assets in different formats sometimes, won't they?
02:07orowith2os[d]: I think uhh, cod did it?
02:07gfxstrand[d]: Maybe?
02:07orowith2os[d]: usually only when it *really* matters for performance, and they don't feel like converting on the fly
02:07HdkR: This also has the inverse problem for me. Most mobile hardware doesn't support BC. So running PC games on them can result in a lot of pain :|
02:08gfxstrand[d]: Pain
02:08HdkR: Snapdragon luckily has been shipping BC for a while, and /some/ Mali hardware cares enough to enable it in hardware.
02:08HdkR: And the weird one, Samsung Exynos with Radeon doesn't ship BC.
02:09HdkR: But that's fine. No weirdos would every try running PC games on Android devices.
02:09HdkR: ever*
02:09HdkR: never ever.
02:09orowith2os[d]: what are you doing to get into that situation? termux + emulators?
02:10orowith2os[d]: rooted devices?
02:10HdkR: Well, that's what users are doing anyway
02:10HdkR: Pixel phones also support VMs with GPU passthrough.
02:11mangodev[d]: weird question btw
02:11mangodev[d]: does vulkan generally not perform as well on pascal & maxwell cards compared to other nvidia architectures? is this a driver quirk, or a hardware quirk? i have yet to see a vulkan game run *better* than a dx12 game on either architecture
02:11mangodev[d]: i often see them experience some really alarming stutter that doesn't happen on dx12, which is weird because on any other card it'd be closer to being the other way around :/
02:12orowith2os[d]: it's a vulkan feature that pascal doesn't support quite all the way, which gives vkd3d pain when translating from d3d12
02:12orowith2os[d]: you won't see issues with Turing and later
02:13mangodev[d]: orowith2os[d]: which is strange, because a good amount of these cases i see on windows, but also linux
02:13mangodev[d]: i just generally interact with more windows nvidia users than linux ones (predictably)
02:14mangodev[d]: vk mode portal 2 and satisfactory run better for me (turing), but run *horribly* on any pascal rig that i've seen (compared to dx12)
02:14mangodev[d]: usually in the form of massive stutters and frame spikes
02:14mangodev[d]: smells like a nvidia driver quirk over a hardware thing, though i could be wrong
02:14orowith2os[d]: you can ask some vkd3d devs about it and they should be able to give a decent answer
02:14orowith2os[d]: not sure if it's about what would be normally d3d11 though
02:15orowith2os[d]: dxvk should have no issues?
02:15orowith2os[d]: I've only ever seen complaints with d3d12 stuff
02:15orowith2os[d]: bindless compute something something
02:15gfxstrand[d]: DXVK is fine because they're not trying to emulate D3D12's descriptor heaps with VK_EXT_descriptor_indexing.
02:16mangodev[d]: orowith2os[d]: even on windows i had better experiences on vk, both on my old 660 and my current 1660 super
02:16gfxstrand[d]: VK native apps should perform just fine
02:16gfxstrand[d]: It's really D3D12 translation that's the problem.
02:17mangodev[d]: strange
02:18orowith2os[d]: if you're having issues specifically with pascal, where it's *that* bad *not on vkd3d*, you should probably get that checked...
02:18mangodev[d]: most of these issues i've had with people on native vk on windows, only sometimes linux (though tbf a lack of vk support on linux would be deeply concerning)
02:19mangodev[d]: orowith2os[d]: and i honestly wonder what to tell people then
02:19mangodev[d]: because i've only spoken to a couple people on pascal that *don't* have issues with vk, to the point where it concerns me
02:19mangodev[d]: makes me think there's a vk driver issue on proprietary nvidia, hopefully nothing hardware related
02:20gfxstrand[d]: Pascal Vulkan is definitely not going to be as good as Turing+
02:20gfxstrand[d]: Pascal doesn't have bindless UBOs and that really sucks.
02:20gfxstrand[d]: It also doesn't have `ld.constant`
02:21gfxstrand[d]: So you're getting full memory coherency on every UBO load and that really drags you down.
02:21gfxstrand[d]: Though I would expect D3D12 to also suffer from that
02:21mangodev[d]: gfxstrand[d]: i'm curious
02:21mangodev[d]: what is the `ld.*` stuff? seems related to storage of some sort?
02:21gfxstrand[d]: It's just a memory load instruction
02:21gfxstrand[d]: It takes an address and returns data
02:21mangodev[d]: ah, makes sense
02:22mangodev[d]: i remember you all talking about an ld instruction for kepler a little while back
02:22mangodev[d]: `ld.a` iirc? something along those lines
02:22gfxstrand[d]: `ld.constant` is new on Turing and it tells the hardware to skip all the coherency checks and just take the first thing it finds in the cache. It's way faster than actually coherent loads.
02:24mangodev[d]: gfxstrand[d]: iirc isn't a coherency check a check between the cache and (v)ram to see if the data is still in sync? or am i misremembering (new to this stuff)
02:25mangodev[d]: don't want to bug you or this channel too much with beginner questions
02:26orowith2os[d]: yeah
02:26orowith2os[d]: cpus have it too
02:27mangodev[d]: yeah, that's where i learned about it from
02:27mangodev[d]: but i somehow only just realized that gpus have cache too
02:27orowith2os[d]: gfxstrand[d]: I am curious though. Do you only use it when you *know* it's coherent, or is there a use for it outside of just not caring if it's coherent?
02:29HdkR: The API ensures it is coherent at the draw call
02:30HdkR: So the driver knows a bindless UBO is coherent and to use it.
02:31HdkR: `ld.a` is an entirely different thing, and is only related because it's also a load, but it loads something else.
02:32HdkR: There's a lot of heavy lifting behind the "The API ensures it is coherent", but weeds.
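As a toy illustration of the decision being described above (all names are hypothetical and this is not NAK's actual lowering): because the API guarantees a bound UBO's contents are stable for the draw that consumes it, a backend with a non-coherent constant-cache load can prefer it for UBO reads.

```c
#include <stdbool.h>

/* Toy decision sketch: Vulkan guarantees a bound UBO isn't modified out
 * from under the draw that consumes it, so hardware with a non-coherent
 * constant-cache load (ld.constant on Turing+) can use it for UBO reads
 * and skip the coherency traffic. */
enum load_op {
    LOAD_GLOBAL_COHERENT,   /* full memory coherency on every access */
    LOAD_CONSTANT_CACHED,   /* take the first thing found in the cache */
};

static enum load_op
pick_ubo_load(bool is_ubo, bool hw_has_ld_constant)
{
    if (is_ubo && hw_has_ld_constant)
        return LOAD_CONSTANT_CACHED;  /* Turing+ fast path */
    return LOAD_GLOBAL_COHERENT;      /* e.g. Pascal, which lacks ld.constant */
}
```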
02:33redsheep[d]: Not to jump too far back here, but I wanted to point out a distinction: the BC compression your gpu can sample from directly and the PNGs or JPEGs your textures are often stored as on disk are pretty different. If you choose to store them with better compression you'll usually just use a library to convert in your engine as you load. Or at least that's what the engines I have worked with do.
02:35redsheep[d]: I haven't actually looked at what the major engines are doing, but I would be surprised if they're usually settling for BC on disk given textures are usually the bulk of a game's size.
02:36redsheep[d]: Unless you're titanfall, in which case you fill the user's disks with uncompressed audio lol
02:37HdkR: On disk will be compressed in some way
02:37HdkR: Infinite ways to compress the assets on disk.
02:38orowith2os[d]: redsheep[d]: *flashbacks to when I ran halo 3 on an hdd with zstd:16 compression*
02:39orowith2os[d]: the audio was crunchy.
02:40HdkR: zstd decompression speed is pretty good even at high levels
02:41HdkR: You just want it to be a read-only drive at that level :P
02:41orowith2os[d]: I think the main thing was, it was loading it so much, and from a hard drive at that
02:41HdkR: Platters are pretty mean
02:41orowith2os[d]: I ended up lowering it to either level 9 or level 7 and it played just fine
02:41orowith2os[d]: cool experiment
02:42orowith2os[d]: now I have everything on a 1tb sd card, in the hopes I'll get a Deck
02:44redsheep[d]: Yeah I am glad I finally managed to rid my main machine of all hard drives a couple years back. Now only my media pc has to live with the pain
02:45redsheep[d]: Genuinely one of the very worst parts about still mainly using windows is that the presence of even one slow internal drive makes windows explorer nearly unusable
02:46orowith2os[d]: if you want to risk it, I have like, 15 old 480gb SSDs sitting around nobody here uses lol
02:46orowith2os[d]: just back up your stuff to something else every so often
02:46HdkR: orowith2os[d]: ZSTD RAID2 them! :P
02:48redsheep[d]: I swear windows used to handle storage so much better, it wasn't an issue for me on 7 but on windows 10 onwards they seem to have made a ton of things hitch on literally any storage not having responded yet regardless of it actually being needed. That and ruining the search indexer. I really do want to go back to linux
02:51redsheep[d]: Once Nova has materialized I will give it another go
03:02orowith2os[d]: redsheep[d]: maybe you could help test drive some of the nvrm stuff? :p
03:08redsheep[d]: Yeah once sid has the energy to return to that project I will be there to test
03:15gfxstrand[d]: Kepler A just survived a full CTS run:
03:15gfxstrand[d]: `Pass: 964846, Fail: 102487, Crash: 146, Skip: 1676519, Flake: 3, Duration: 2:22:21, Remaining: 0`
03:17gfxstrand[d]: And I think basically all the fails are either shared atomics or images
03:17gfxstrand[d]: There's a handful of SPH issues but I'm not sure if those are even real.
03:19gfxstrand[d]: And those should be easy enough to debug once we get rid of the noise from not having images.
03:28gfxstrand[d]: Found a vote bug (I had any and all switched)
05:25x512[m]: orowith2os[d]: Will it work if the GPU 1 buffer is outside of the PCI BAR mappable region? Will it do a GPU -> CPU -> GPU copy?
06:25gfxstrand[d]: The moment a dma-buf gets shared with a different GPU, it gets evicted and lives in shared memory until all other GPUs' references are dropped. If it's only ever shared with other contexts on the same GPU, it stays in VRAM.
06:27gfxstrand[d]: And by "shared memory" I mean system RAM.
06:28gfxstrand[d]: Linux doesn't currently support one device mapping another device's memory.
06:46x512[m]: So it is a GPU -> CPU -> GPU copy, and the same can be done with the regular Vulkan API?
06:47x512[m]: NVRM seems to support mapping another device's physical memory into the GPU.
06:48x512[m]: But it will not work if the source device's memory is outside of the PCI BAR.
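For reference, a minimal sketch of how a dma-buf gets shared between two Vulkan devices from the application side, using VK_KHR_external_memory_fd and VK_EXT_external_memory_dma_buf. Real code also needs VkExternalMemoryBufferCreateInfo/VkExternalMemoryImageCreateInfo on the resource and possibly a dedicated allocation; error handling is omitted.

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Export memory allocated on dev_a as a dma-buf fd and import it on dev_b.
 * Assumes both devices enable VK_KHR_external_memory_fd and
 * VK_EXT_external_memory_dma_buf and that suitable memoryTypeIndex values
 * were already chosen. */
static void
share_memory(VkDevice dev_a, uint32_t mem_type_a,
             VkDevice dev_b, uint32_t mem_type_b,
             VkDeviceSize size,
             VkDeviceMemory *mem_a, VkDeviceMemory *mem_b)
{
    /* GPU A: allocate exportable memory and get a dma-buf fd for it. */
    const VkExportMemoryAllocateInfo export_info = {
        .sType = VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO,
        .handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
    };
    const VkMemoryAllocateInfo alloc_info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &export_info,
        .allocationSize = size,
        .memoryTypeIndex = mem_type_a,
    };
    vkAllocateMemory(dev_a, &alloc_info, NULL, mem_a);

    const VkMemoryGetFdInfoKHR get_fd = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR,
        .memory = *mem_a,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
    };
    PFN_vkGetMemoryFdKHR get_memory_fd =
        (PFN_vkGetMemoryFdKHR)vkGetDeviceProcAddr(dev_a, "vkGetMemoryFdKHR");
    int fd = -1;
    get_memory_fd(dev_a, &get_fd, &fd);

    /* GPU B: import the fd as memory it can bind its own resources to.
     * Per the discussion above, the kernel keeps the backing store in
     * system RAM while the second GPU holds a reference. */
    const VkImportMemoryFdInfoKHR import_info = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
        .fd = fd,  /* ownership transfers to the driver on success */
    };
    const VkMemoryAllocateInfo import_alloc = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &import_info,
        .allocationSize = size,
        .memoryTypeIndex = mem_type_b,
    };
    vkAllocateMemory(dev_b, &import_alloc, NULL, mem_b);
}
```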
13:17gfxstrand[d]: x512[m]: It may not have to copy. The GPU can map system RAM and access it directly.
13:17x512[m]: But rendering to system RAM is slow?
13:18gfxstrand[d]: But the client process (in the case of WSI) very well may, because we typically render to a shadow copy and then copy to linear.
13:19gfxstrand[d]: But if you have two GPUs that support the same modifiers, I think we currently avoid the blit. That's a heuristic we can change. But also, having two discrete GPUs and rendering on one and displaying on the other is a very uncommon case.
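A small sketch of the modifier query that feeds that heuristic, using VK_EXT_image_drm_format_modifier: intersecting the lists from the render and display devices and picking a common modifier is what lets the swapchain path skip the extra blit. Assumes the extension is enabled; error handling omitted.

```c
#include <stdint.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

/* List the DRM format modifiers a physical device supports for a format.
 * The caller frees *out_props. */
static uint32_t
query_modifiers(VkPhysicalDevice pdev, VkFormat fmt,
                VkDrmFormatModifierPropertiesEXT **out_props)
{
    VkDrmFormatModifierPropertiesListEXT list = {
        .sType = VK_STRUCTURE_TYPE_DRM_FORMAT_MODIFIER_PROPERTIES_LIST_EXT,
    };
    VkFormatProperties2 props = {
        .sType = VK_STRUCTURE_TYPE_FORMAT_PROPERTIES_2,
        .pNext = &list,
    };

    /* First call fills in the count, second call fills in the array. */
    vkGetPhysicalDeviceFormatProperties2(pdev, fmt, &props);
    list.pDrmFormatModifierProperties =
        calloc(list.drmFormatModifierCount,
               sizeof(*list.pDrmFormatModifierProperties));
    vkGetPhysicalDeviceFormatProperties2(pdev, fmt, &props);

    *out_props = list.pDrmFormatModifierProperties;
    return list.drmFormatModifierCount;
}
```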
13:21magic_rb[d]: Why does it matter if its two dedicated gpus or one dedicated and one integrated?
13:21magic_rb[d]: As in a future system i was planning on having a dedicated amd and intel gpu
13:22x512[m]: It looks like nothing currently done with dmabuf can't also be done with the regular Vulkan API and proper coordination. No extra "magic".
13:23x512[m]: Vulkan also allows allocating and importing system RAM.
13:25x512[m]: magic_rb[d]: Maybe because integrated GPU use system RAM?
13:27gfxstrand[d]: Yes and because you get a copy in that case
13:27magic_rb[d]: Oh yeah, integrated uses system memory, forgot
13:28gfxstrand[d]: So on the dGPU side it renders to VRAM and then copies to system RAM and on the iGPU side it's all system RAM anyway.
13:30orowith2os[d]: and if it were two dGPUs, it would be vram -> system ram -> vram?
13:30orowith2os[d]: which means two copies, instead of the one, with vram -> vram or vram -> system ram?
13:32x512[m]: I suppose the best method with 2 dGPUs would be for GPU 1 to write directly to GPU 2's PCI BAR, bypassing the CPU.
13:33x512[m]: gfxstrand[d]: Can the GPU quickly convert between tiled and linear formats in place?
13:34orowith2os[d]: I'd say "figure out what the hell Linux is doing where dmabufs can't just go straight over to the gpus" but I'm sure someone's already looked at that
14:04gfxstrand[d]: Yes it's theoretically possible to optimize the two dGPU case but unless you're using SLI it's going to be slow anyway and also it's not a case that really comes up much in practice.
14:05gfxstrand[d]: x512[m]: No, tiled and linear have different alignment requirements and are generally shuffled around enough that you would need a pretty big scratch area to detile. A few KB of compute shared memory, for instance, isn't enough.
15:30gfxstrand[d]: gfxstrand[d]: `Pass: 965232, Fail: 102227, Crash: 20, Skip: 1676519, Flake: 3, Duration: 2:19:22, Remaining: 0`
15:30gfxstrand[d]: That's with vote fixed. It fixed about 400 fails. What we really need are images.
16:10gfxstrand[d]: Okay, nak/sm20 branch pushed with surface ops. snowycoder[d] I pulled just the surface ops and folding out of your branch (with you as author) and that patch is now in my sm20 branch. All the unit tests pass on Kepler A.
16:12gfxstrand[d]: snowycoder[d]: At what point do you want to land the non-image parts of SM32? If you want to get images working first, that's fine. If you want to land a bunch of it so it's out of your tree, that's fine, too.
16:12gfxstrand[d]: I'm just cherry-picking the core NAK bits I need for SM20 into separate patches. That's been working for now. But at some point we should land SM32 even if it's not 100% complete.
16:20gfxstrand[d]: And please don't think I'm pushing you. Go at whatever pace you can. You're doing amazing work.
16:27gfxstrand[d]: I'm just trying to help get your stuff landed and I'm happy to do some of the git wrangling if needed.
16:34snowycoder[d]: Thank you! I don't mind when it gets merged, if you think it could be better to have sm32 in-tree I'm happy to help.
16:34snowycoder[d]: How should we split the encodings in git commits? I don't mind doing git work too.
16:36gfxstrand[d]: First rebase. You should be able to drop most of the stuff that isn't in sm32.rs. Then separate out the surface stuff. If the patch in my sm20 branch helps with that, great.
16:37gfxstrand[d]: Mainline already has TexDepBar and the interpolation stuff.
18:04gfxstrand[d]: I'm hesitant to land any surface stuff until we have the lowering actually working. But I'm happy to land the rest of what you have working.
18:53snowycoder[d]: gfxstrand[d]: I see that sm20 has no `cs2r` nor any code in `from_nir.rs` to route clock reads to `s2r`, so clock reads should panic(?).
18:53snowycoder[d]: In sm32 I routed clock reads to normal `s2r` and it's working (it doesn't seem to have `cs2r`)
18:58karolherbst[d]: kepler also should have s2r, no?
18:58karolherbst[d]: mhh but you should have a 64 bit version there...
18:58mhenning[d]: We should probably lower 64-bit timers before that and not implement cs2r where it's unavailable
18:58karolherbst[d]: `0x00000004` `0x2c000000` opcode
18:59karolherbst[d]: (or the other way around?)
18:59karolherbst[d]: check `CodeEmitterNVC0::emitMOV`, it does encode a 64 bit sysval load
19:00mhenning[d]: oh is there actually a 64-bit version on kepler? I don't think codegen actually uses it at all
19:00karolherbst[d]: it does for shader_clock, no?
19:00mhenning[d]: My memory was that codegen just puts the 32-bit shader clock in the upper 32-bits of the 64-bit version and sets the rest to zeros
19:00mhenning[d]: which is a legal implementation
19:01karolherbst[d]: mhhhhh
19:01karolherbst[d]: where is the code handling that?
19:01karolherbst[d]: the nir -> codegen translation certainly doesn't lower it
19:02karolherbst[d]: ohh wait...
19:02karolherbst[d]: ehhh...
19:02karolherbst[d]: it does so if it's a vec2 load...
19:03mhenning[d]: here it is: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/nouveau/codegen/nv50_ir_from_nir.cpp#L2486-2487
19:03karolherbst[d]: well.. that loads two elements, so I guess in nir it's always a 32x2 thing?
19:03karolherbst[d]: indeed...
19:04gfxstrand[d]: If that's what we need to do, we should do that in NIR
19:05mhenning[d]: I don't know if it's necessary, but it's what codegen does right now
19:05karolherbst[d]: it's such a pain to lower
19:05karolherbst[d]: I think Faith meant to do it properly
19:05gfxstrand[d]: Not if we just lower it to the top 32 bits
19:05karolherbst[d]: like do two loads
19:05karolherbst[d]: and check for overflows
19:05snowycoder[d]: Hold on, isn't `cs2r` also 32-bit?
19:05snowycoder[d]: Then why are we doing cs2r loops in `nak_nir.c`?
19:05gfxstrand[d]: But if we do two ABA and loop, yeah, that's a pain
19:05karolherbst[d]: cs2r can do both
19:06karolherbst[d]: anyway.. I'd check out the opcode codegen has there...
19:06karolherbst[d]: might not work tho
19:06snowycoder[d]: for keplerB codegen has no opcode for 64-bit sysval loads
19:06karolherbst[d]: I think it matters if vulkan allows you to zero the low bits, if your precision is terrible enough?
19:20gfxstrand[d]: I mean, we probably can. But it's also not too much work to loop
19:22snowycoder[d]: `cs2r` seems to only handle 64-bit loads on sm70+ but we don't seem to use it since `nak_nir_lower_system_value_intrin` always translates it into 32-bit HI-LO-HI load loops?
19:22snowycoder[d]: If for kepler `s2r` loads the clock only as a 32-bit value it should work with the current loop implementation, what am I missing?
19:23gfxstrand[d]: You mean I've already written that code?
19:24snowycoder[d]: ahahahah, look at `nak_nir.c` line 576
19:25snowycoder[d]: The only strange thing is that we always translate it into load loops even for sm70, if `cs2r` can handle 64-bit loads it should be unnecessary
19:30gfxstrand[d]: I don't remember why I did that. I'd have to look at the git logs. I'm sure there's a reason. Probably because the 64-bit load isn't actually atomic. I think I remember something about that.
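For context, the loop in question assembles a 64-bit clock out of 32-bit reads with the classic hi-lo-hi retry, so a carry between the two halves can't be observed. Illustrative C only, with hypothetical read_clock_hi()/read_clock_lo() helpers standing in for the 32-bit sysval reads; this is not the actual nak_nir.c lowering.

```c
#include <stdint.h>

/* Hypothetical 32-bit system-value reads standing in for s2r/cs2r.32. */
extern uint32_t read_clock_hi(void);
extern uint32_t read_clock_lo(void);

/* Classic hi-lo-hi retry: if the high half changed while we read the low
 * half, the counter carried between the two reads, so try again.  This is
 * what makes a 64-bit result safe when only the 32-bit reads are atomic. */
static uint64_t
read_clock64(void)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = read_clock_hi();
        lo  = read_clock_lo();
        hi2 = read_clock_hi();
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
```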
19:35mhenning[d]: gfxstrand[d]: You did that because the nvidia vulkan driver does that.
19:35mhenning[d]: CUDA doesn't do that though and I don't think it's actually necessary on recent gpus
19:36mhenning[d]: but I also don't care enough to check carefully that the non-looping version works
19:44gfxstrand[d]: I know I started with the non-looping version so there must have been a reason I changed it.
19:44gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27303
19:45gfxstrand[d]: I didn't leave any more commentary, unfortunately. But I'm also pretty sure I wouldn't have done that if I didn't need to do it to pass a test.
19:46gfxstrand[d]: Past me should have left current me better notes.
19:48snowycoder[d]: gfxstrand[d]: when you have time check if `dEQP-VK.glsl.shader_clock.compute.clock2x32ARB` passes on sm20, if my assumptions are correct it might panic on encoding
19:55gfxstrand[d]: Yeah, it fails. I wasn't using enough bits. It passes with the two patches I just pushed to my sm20 branch
19:57gfxstrand[d]: So yeah, the hunk from your branch is correct. I just needed to fix my encoding.
20:06gfxstrand[d]: IDK why I thought it was 6 bits
20:06gfxstrand[d]: Maybe I copied that from sm50? Maybe I copied from codegen? In any case, should have run and trusted the fuzzer.
20:07gfxstrand[d]: snowycoder[d]: That `--bit-flip` flag you added to the fuzzer is awesome, BTW.
20:07gfxstrand[d]: It's my new favorite thing.
20:10snowycoder[d]: gfxstrand[d]: Thanks! I was getting tired of checking for bit flags
20:12gfxstrand[d]: Yeah, my standard flow for SM20 has been to run `nvfuzz --bit-flip 5..58` on everything
20:12gfxstrand[d]: There are some opcodes with weirdly non-orthogonal bits where it doesn't actually catch everything (`bar` is particularly annoying) but it catches like 95% of stuff.
20:14snowycoder[d]: Even when it doesn't, it can be useful to check for boundaries.
20:15snowycoder[d]: Also: another minor thing I added is that bytes printed by the `--print-raw` flag can be passed directly into the tool (it accepts even byte sequences that are not 32-bits)
20:16snowycoder[d]: When i find a sub-operation that might have different flags I re-throw it into nvfuzz
20:16mhenning[d]: gfxstrand[d]: This is the old thread about the looping
20:26karolherbst[d]: I still don't understand why they use CS2R.32 there
21:17gfxstrand[d]: And I still don't understand why CS2R.64 doesn't actually work. I'm happy that it should work. But it doesn't and I don't know why.
21:21gfxstrand[d]: It's possible that throwing in a depbar or even a dependency tracker rule that ensures we never have two CS2R in-flight at the same time is sufficient.
21:52mhenning[d]: gfxstrand[d]: Yeah, that's my hunch. cuda uses the 64-bit version and uses barriers to prevent their execution from overlapping. I suspect we could probably do the same
21:56gfxstrand[d]: We have depbar now. It shouldn't be too hard to test.
21:57gfxstrand[d]: We might need a bi-directional scheduling barrier, though.
21:59gfxstrand[d]: Looks like we don't. We should fix that.
21:59gfxstrand[d]: Wait... It's not even marked !can_eliminate?!?
22:01gfxstrand[d]: Yeah, dce just deletes it. It does nothing right now.
22:02gfxstrand[d]: It's just having control-flow in the way that's keeping the CS2Rs apart
22:07mohamexiety[d]: could be worth rechecking vulkan too to see if they still do that loop
22:24snowycoder[d]: What deqp test can I use to test `al2p`?
22:28gfxstrand[d]: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34620
22:29gfxstrand[d]: Passes the Vulkan tests. I'm not sure if the GL tests are more strict or not.
22:29gfxstrand[d]: But also doing the dumb thing passed the Vulkan tests so IDK what was going on before. <a:shrug_anim:1096500513106841673>
22:29gfxstrand[d]: snowycoder[d]: Tessellation
22:30gfxstrand[d]: It's used for TCS output load/store
22:36gfxstrand[d]: Come to think of it, I'm not sure if I wired up al2p, either...
22:38snowycoder[d]: I removed it because tests with it were crashing my system, but now I'm not sure it was at fault
23:11gfxstrand[d]: Heh. We'll have to look into that
23:12gfxstrand[d]: But it's fine to leave it out for now