13:50blockofalumite[d]: pavlo_kozlenko[d]: Is that smth like SR-IOV?
14:23samantas5855[d]: it's enterprise bullshit that is locked on consumer cards
14:23samantas5855[d]: it was possible to bypass the restrictions up until turing
14:23samantas5855[d]: but ampere and onwards are fw locked iirc
14:35blockofalumite[d]: Ah, so there's nothing useful in there ?
15:13magic_rb[d]: Afaik no, not useful to anyone but enterprise
15:14magic_rb[d]: They could just block cuda somehow and call it good. If you want cuda/nvenc pay up, but for games it should be fine. Don't think they'd lose much revenue
15:51samantas5855[d]: then someone could make a geforce now alternative
15:51samantas5855[d]: they could just limit to 1 vm
16:05blockofalumite[d]: Meanwhile me, doing a software version of this as a platform-agnostic version of virtio-gpu 3D:
16:07karolherbst[d]: there are a few details to this: there needs to be some driver to configure the device, and it helps being upstream because then it's just available. This driver isn't just using the same GPU in multiple guests, it's about partitioning actual hardware resources and creating actual sub-devices
16:39blockofalumite[d]: karolherbst[d]: What's the difference ?
16:39blockofalumite[d]: Less overhead?
16:40karolherbst[d]: it's all on the hw level
16:40karolherbst[d]: and higher isolation
16:42blockofalumite[d]: karolherbst[d]: Yes, and what are the implications of that ?
16:43blockofalumite[d]: karolherbst[d]: How so ?
16:54karolherbst[d]: you can e.g. slice the memory controller, vram, GPU cores and have a fixed mapping to virtual GPUs
16:55karolherbst[d]: I think you can also oversubscribe if you want to, but you can also have a strict allocation of resources with no sharing
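A purely illustrative sketch of what such a fixed partition could be described by; every field name here is made up, and this is not NVIDIA's actual vGPU interface:

```c
/* Hypothetical vGPU partition descriptor -- illustration only, not a real UAPI. */
#include <stdint.h>
#include <stdio.h>

struct vgpu_partition {
    uint64_t vram_bytes;    /* dedicated VRAM slice */
    uint32_t gpc_mask;      /* which GPU core clusters this vGPU may use */
    uint32_t copy_engines;  /* number of DMA/copy engines assigned */
    uint32_t sched_share;   /* time-slice weight if oversubscribed */
};

int main(void)
{
    /* Two guests pinned to disjoint hardware resources, no sharing. */
    struct vgpu_partition guests[2] = {
        { .vram_bytes = 8ull << 30, .gpc_mask = 0x0f, .copy_engines = 1, .sched_share = 50 },
        { .vram_bytes = 8ull << 30, .gpc_mask = 0xf0, .copy_engines = 1, .sched_share = 50 },
    };

    for (int i = 0; i < 2; i++)
        printf("vGPU %d: %llu MiB VRAM, GPC mask 0x%02x\n",
               i, (unsigned long long)(guests[i].vram_bytes >> 20),
               guests[i].gpc_mask);
    return 0;
}
```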
16:59notthatclippy[d]: What is the alternative? Forward the API (vk, gl, etc) calls from guest to host and let the host driver handle it as any other userspace client?
17:01karolherbst[d]: yeah
17:01karolherbst[d]: another way is to let the host create a GPU context and forward that one
17:02karolherbst[d]: but none of this gives you strong isolation guarantees or partitions the hardware in any enforceable way
17:05notthatclippy[d]: Sounds useful in the case where you want to run a windows game in a VM, but that's a drastically different usecase to vgpu
17:12notthatclippy[d]: How do you hijack the DX/GL/VK API calls in windows then? You'd probably need to replace the system DLLs in the VM and pray that the game is actually using those as expected. And then you're gonna need a hell of a lot of "DPI" for stuff like D3DKMTEscape()
17:12notthatclippy[d]: I guess it's much simpler if your VM also uses Mesa and you can just tell it to use the virtualized driver.
17:13karolherbst[d]: it often works via being a driver in the guest
17:14karolherbst[d]: and yeah.. those solutions ain't the greatest
17:18notthatclippy[d]: karolherbst[d]: Oh. Duh. No reason why it couldn't require a special driver. Makes total sense, thanks!
17:19notthatclippy[d]: I was stuck thinking that you want the VM to see the make&model of the host GPU, but there's no reason for that. It can just see a "Generic virtualized GPU"
17:19karolherbst[d]: though the direction is moving more towards native contexts, which cut out all the translation layers
17:21karolherbst[d]: it's still some paravirtualized device in the guest, but you program that one like a real GPU as you get a real hardware context assigned to the guest, just no need to manage the GPU, because that's done on the host side
17:22karolherbst[d]: so a lot closer to what SR-IOV or other things are doing, just still being paravirtualization
17:23pac85[d]: karolherbst[d]: With sr-iov you'd get the kernel driver in the VM wouldn't you?
17:23karolherbst[d]: well with the other modes as well, they just do different things. with SR-IOV you get a real PCIe device
17:24karolherbst[d]: it's still a bit different than a native GPU, but that's just details
17:24pac85[d]: karolherbst[d]: Yeah I mean, you get a kernel driver for the device rather than one that forwards stuff to the real kmd in the host
17:24karolherbst[d]: same with native-contexts
17:25pac85[d]: Mmm?
17:25karolherbst[d]: you get the driver's UAPI to userspace
17:25karolherbst[d]: not some generic thing
17:25karolherbst[d]: so the driver needs explicit support for it and the mode of operation simply changes in the guest
17:25pac85[d]: With sr-iov the VM kernel driver talks to hw, not to the host correct?
17:25karolherbst[d]: yeah
17:26pac85[d]: pac85[d]: That's what I meant here
17:26pac85[d]: Anyway
17:26pac85[d]: I really wonder how sr-iov works at a lower level
17:27karolherbst[d]: it's a PCI feature
17:27karolherbst[d]: similar to how you also have audio/USB subfeatures on PCI devices
17:28karolherbst[d]: just more dynamic
17:28karolherbst[d]: and then you use device passthrough to assign that sub feature to the guest
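On Linux, that sub-device creation is usually driven through the PF's `sriov_numvfs` attribute in sysfs. A minimal sketch, assuming an SR-IOV-capable device at a placeholder PCI address and root privileges:

```c
/* Create SR-IOV virtual functions by writing to the standard sysfs attribute.
 * The PCI address below is a placeholder; adjust it to your device. Needs root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror("open sriov_numvfs");
        return 1;
    }
    /* Ask the PF driver to spawn 4 virtual functions. */
    if (fprintf(f, "4\n") < 0) {
        perror("write sriov_numvfs");
        fclose(f);
        return 1;
    }
    if (fclose(f) != 0) {
        perror("close sriov_numvfs");
        return 1;
    }
    puts("VFs created; check with: lspci | grep -i virtual");
    return 0;
}
```

The resulting virtual functions show up as their own PCI devices, which can then be bound to vfio-pci and passed through to a guest like any other device.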
17:29pac85[d]: karolherbst[d]: Right that's how the gpu can expose a device to the vm
17:29pac85[d]: But like, how does "context switching" happen, how is memory managed etc.
17:30karolherbst[d]: the fun part here is, that I had to fix nouveau to not crash on such a device, because there are subtle differences the guest driver needs to be aware of
17:30pac85[d]: Uhm lol
17:30pac85[d]: Like?
17:30karolherbst[d]: pac85[d]: up to the host driver
17:30karolherbst[d]: pac85[d]: the mmio register range is way smaller
17:30karolherbst[d]: it's still not a real GPU passed through
17:30karolherbst[d]: but more like a GPU with its own programming interface
17:30pac85[d]: karolherbst[d]: Ah
17:31karolherbst[d]: so the host driver partitions the device and lets the GPU create a sub feature
17:31pac85[d]: Wait does this work on GPUs which don't do scheduling on the fw?
17:31pac85[d]: Like
17:31pac85[d]: Because that's what confuses me
17:31karolherbst[d]: and then you run a vGPU aware guest driver doing the programming of the vGPU
17:31karolherbst[d]: it's not a real GPU you get
17:32karolherbst[d]: it's just very very advanced paravirtualization looking like a real device
17:32pac85[d]: Uhm
17:32pac85[d]: But like your mmio writes still go to the hw
17:32karolherbst[d]: sure, but the mmio interface is a different one
17:32pac85[d]: Ah
17:32pac85[d]: OK I kinda get the hang of it
17:33pac85[d]: Thx
17:33pac85[d]: So like, once the guest has pushed some work how does it get to run, is the host kmd involved?
17:33karolherbst[d]: so it might expose functionality to create contexts and do GPU submission, but might not expose any hw management stuff (like power management, etc...)
17:33pac85[d]: Right I see
17:33karolherbst[d]: pac85[d]: nah, that happens on the hardware
17:34pac85[d]: karolherbst[d]: Ah OK, I can see how that could work on GPUs that do scheduling on the fw
17:34karolherbst[d]: though it might make the firmware running on the GPU do something
17:34karolherbst[d]: which might send interrupts to the host for $things
17:34pac85[d]: pac85[d]: But I guess that's a requirement for sr-iov?
17:34karolherbst[d]: I think that's an implementation detail
17:34pac85[d]: Say you only have one gfx ring in hw how would it work?
17:35karolherbst[d]: something would need to schedule it
17:35pac85[d]: Yeah you see
17:35pac85[d]: So you need fw scheduling
17:35karolherbst[d]: nope
17:35karolherbst[d]: the GPU could send an interrupt to the host
17:35karolherbst[d]: and the host doing the scheduling
17:35pac85[d]: I guess yeah
17:36karolherbst[d]: but that's just... either the CPU or the firmware controller handles that interrupt
17:36pac85[d]: Though like
17:36pac85[d]: Yeah makes sense, does any gpu work like that?
17:37pac85[d]: Also like, the hw would offer some way of "switching" between the two interfaces I guess
17:37pac85[d]: Which means switching all the state programmed through the regs
17:40karolherbst[d]: do you need to switch interfaces when you use the audio part of the GPU?
17:40karolherbst[d]: SR-IOV is really just that, an additional sub-device of the PCI device with its own BARs
17:40pac85[d]: But those are different pieces of hw
17:41karolherbst[d]: BARs are just memory (tm)
17:41pac85[d]: So unless you physically duplicate parts of the gpu, which I guess is an option, you somehow need to switch which interface is in control
17:41karolherbst[d]: nah, those BARs ain't physical on the GPU anyway
17:41pac85[d]: Uhm
17:41karolherbst[d]: it's all virtualized
17:42pac85[d]: So you say it's like some sram on the device?
17:42karolherbst[d]: there is like 256MiB of BAR on nvidia GPUs
17:42karolherbst[d]: since forever
17:42pac85[d]: And it gets copied to the real mmio regs at some point?
17:42pac85[d]: karolherbst[d]: Wdym
17:42karolherbst[d]: uhh.. not sure what it was called, but there is a piece of hardware handling the mmio requests
17:42pac85[d]: Bar is just a register containing the address at which hw is mapped
17:42karolherbst[d]: sure
17:43karolherbst[d]: but it doesn't mean there is real memory behind it
17:43pac85[d]: Yeah, it's a bunch of hw regs
17:43karolherbst[d]: it's not
17:43pac85[d]: Mmm
17:43karolherbst[d]: there is also a special MMIO region which you can redirect to any piece of VRAM
17:43karolherbst[d]: it's all way more dynamic on nvidia hardware
17:44pac85[d]: Sure
17:44pac85[d]: You just have circuitry that takes address and data and does whatever with it then gives a result
17:44karolherbst[d]: yeah, more or less
17:44pac85[d]: Mmio is really just sram where each cell is also read or written by some piece of hw
17:45pac85[d]: Or perhaps doesn't even have the ability to store stuff
17:45karolherbst[d]: I'm sure there is no sram for the whole thing, because the whole thing is massive
17:45pac85[d]: I see what you mean
17:45pac85[d]: Ofc. The bar is just an address space
17:45karolherbst[d]: also.. even the PCI config space is directly mapped into the mmio region at offset 0x88000 😄
17:46karolherbst[d]: which is quite funny
17:46pac85[d]: Why
17:46pac85[d]: Do you prefer how it was in the old days
17:46pac85[d]: With that io address space
17:46pac85[d]: Lol
17:46karolherbst[d]: I mean, it's the same thing
17:46karolherbst[d]: you can use the PCI config thing, or that offset
17:46karolherbst[d]: it's the same
17:46pac85[d]: Yeah but it makes sense to map it to memory
17:47karolherbst[d]: yeah, sure
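Since BAR0 is just MMIO backed by that decode logic, the config-space mirror mentioned above can be seen by mmapping `resource0` from sysfs. A rough sketch, assuming a placeholder PCI address, root privileges, and that the 0x88000 offset applies to your GPU generation:

```c
/* Map BAR0 of an NVIDIA GPU through sysfs and read the PCI config space mirror
 * at offset 0x88000. The low 16 bits of the first dword should be the vendor
 * ID (0x10de). PCI address is a placeholder; needs root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *bar0 = "/sys/bus/pci/devices/0000:01:00.0/resource0";
    int fd = open(bar0, O_RDONLY);
    if (fd < 0) { perror("open resource0"); return 1; }

    size_t len = 0x89000;  /* enough to cover the mirror */
    volatile uint32_t *mmio = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (mmio == MAP_FAILED) { perror("mmap BAR0"); close(fd); return 1; }

    uint32_t id = mmio[0x88000 / 4];  /* vendor ID | device ID << 16 */
    printf("config mirror: vendor 0x%04x device 0x%04x\n",
           id & 0xffff, id >> 16);

    munmap((void *)mmio, len);
    close(fd);
    return 0;
}
```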
17:47pac85[d]: Anyway
17:47pac85[d]: say you have a bunch of regs that control work submitted to the gpu
17:47pac85[d]: Not sure how it works on nvidia
17:48pac85[d]: On the GPUs I know you have something like a base address and the two offsets into the ring buffer
17:48karolherbst[d]: I mean.. there is userspace command submission where some of the space is mapped into userspace memory
17:48karolherbst[d]: ohh soo uhm..
17:48karolherbst[d]: on nvidia it's funky
17:48karolherbst[d]: there are context switched regions in the mmio space
17:48pac85[d]: Ah
17:48pac85[d]: Interesting
17:48pac85[d]: How does that work?
17:49karolherbst[d]: you pin a context and then you can read/write context specific state
17:49karolherbst[d]: from a host driver perspective
17:49karolherbst[d]: but I don't think it's the thing used when doing command submission
17:50pac85[d]: So like, do you have multiple copies of the same regs?
17:50pac85[d]: And you map a set to userspace?
17:50pac85[d]: That's how I imagine it
17:51karolherbst[d]: not sure how the door bell is set up for userspace command submission, never looked into it
17:52pac85[d]: What door bell
17:53karolherbst[d]: soo.. the tldr is that the kernel maps a door bell thing into userspace's memory, which userspace can use to trigger command submission without needing to involve the kernel
17:54pac85[d]: karolherbst[d]: So like a register that when poked requests something to the fw
17:54pac85[d]: And what else is mapped?
17:54pac85[d]: I suppose some kind of ring where submissions are placed?
17:54karolherbst[d]: a ring buffer probably
17:54pac85[d]: I see
17:55pac85[d]: I should probably look at how nouveau submits instead of bothering you with questions lol
17:55karolherbst[d]: ~~I haven't read the code for that myself yet lol~~
17:58pac85[d]: Rip np
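A generic illustration of that doorbell pattern; this is not nouveau's or NVIDIA's actual submission path, and the structure layout and names are assumptions:

```c
/* Userspace command submission via a doorbell: push a command buffer address
 * into a shared ring and poke a mapped MMIO register, no kernel call needed. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct ring {
    uint64_t *entries;            /* ring buffer shared with the GPU/firmware */
    uint32_t  size;               /* number of entries, power of two */
    uint32_t  wptr;               /* CPU-side write pointer */
    volatile uint32_t *doorbell;  /* mapped doorbell register */
};

/* Queue one command buffer and ring the doorbell so the consumer notices new work. */
static void submit(struct ring *r, uint64_t cmdbuf_gpu_addr)
{
    r->entries[r->wptr & (r->size - 1)] = cmdbuf_gpu_addr;
    r->wptr++;
    atomic_thread_fence(memory_order_release);  /* entry visible before the poke */
    *r->doorbell = r->wptr;                     /* MMIO write wakes the consumer */
}

int main(void)
{
    /* Fake backing storage standing in for the real kernel-provided mappings. */
    uint64_t slots[16] = {0};
    uint32_t fake_doorbell = 0;
    struct ring r = { .entries = slots, .size = 16, .wptr = 0,
                      .doorbell = &fake_doorbell };

    submit(&r, 0xdeadbeef000ull);
    printf("doorbell now %u, slot0 = 0x%llx\n",
           fake_doorbell, (unsigned long long)slots[0]);
    return 0;
}
```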
18:19blockofalumite[d]: karolherbst[d]: I see no reason why a software solution couldn't do this
18:21karolherbst[d]: blockofalumite[d]: it's isolated on a hardware level, of course it's software configuring it, but software alone can't control on what pieces of the hardware something runs
18:22blockofalumite[d]: karolherbst[d]: I am unsure I follow there. What is the extra isolation?
18:22karolherbst[d]: you control on which pieces of the GPU your shader runs
18:23blockofalumite[d]: Like, which "cores" ?
18:23blockofalumite[d]: Or, well, groups of them
18:26notthatclippy[d]: Yes. Also which video encoder/decoder engines, DMA engines, VRAM regions, etc. Also detailed scheduling policy for the oversubscribed case. Bunch of side-channel info leak prevention, DOS prevention.
18:27notthatclippy[d]: Think if you leased a vGPU'd VPS from a cloud provider, and I leased a second one on the same physical machine. You don't want me to be able to mess with your workload, and you _definitely_ don't want me to be able to glean any info about you.
18:27notthatclippy[d]: There's also another set of features where even your Cloud Service Provider can't read any info from your VM, even if physically sitting next to the machine with an oscilloscope.
18:28notthatclippy[d]: But yeah, these are all very enterprise level features, not something anyone here actually cares about I guess.
18:28karolherbst[d]: cubeOS users might :ferrisUpsideDown:
18:28karolherbst[d]: ehh wait
18:28karolherbst[d]: it's QubesOS
18:43blockofalumite[d]: Does that mean the software isolation stuff will usually have leaks?
18:49notthatclippy[d]: Well, consider that you cannot trust any code in the VM, even in the kernel. So it's all about the API that the VM itself uses to talk to the host driver (UMD and/or KMD)
18:50notthatclippy[d]: If that is for example a full on Vulkan API, well, there's so many ways for one vk app to get info about another vk app using the API.
18:51blockofalumite[d]: What for example ?
18:51notthatclippy[d]: Well, we recently talked about how memory isn't zeroed, for example.
18:51karolherbst[d]: the thing about side-channels is that in theory you can extract almost everything if you just get lucky enough
18:51karolherbst[d]: remember rowhammer?
18:52blockofalumite[d]: Isn't this like, quite bad for the current virtio-gpu 3D stuff ?
18:52notthatclippy[d]: Yeah. And another thing about side channels, it's really hard to think it all through. You just don't know what you didn't think of yet.
18:52karolherbst[d]: ~~extracting private RSA keys via some small javascript code~~
18:53notthatclippy[d]: You can probably get a lot of useful info about what other workloads are happening on the GPU just by measuring the latency of your SFU or some other specialized operations
18:53karolherbst[d]: yeah... if you want to be really sure you want hw level isolation
18:54karolherbst[d]: even hyperthreading has huge attack surfaces and the kernel got a lot of mitigations added just to prevent some of them. And to prevent them all you still need to fully disable hyperthreading
18:56notthatclippy[d]: blockofalumite[d]: I don't think that's necessarily true. Just serves different needs.
19:21pac85[d]: notthatclippy[d]: I suppose for vram there's an extra layer of address translation?