16:06graphitemaster: Does anyone know where I can find out the binary format of shaders on NV - like when you dump them you seem to get the NVfp/NVvp stuff as plain text, but there's a program header on there too and I was just wondering if that has been reverse engineered / documented that somewhere?
16:07imirkin: "format of shaders" ... where?
16:07imirkin: the shaders uploaded to the GPU are a shader header (on nvc0+) + byte code (varies by gpu gen)
16:08graphitemaster: Well when you use program binary in GL specifically, the content you get back is a compiled to assembly format (NVfp/NVvp as plain text) and then some weird binary data on the head of it.
16:08imirkin: you mean the format of the blob you get from ARB_get_program_binary?
16:08graphitemaster: I want to know what that header is specifically.
16:08graphitemaster: Yeah
16:08imirkin: no clue.
16:09imirkin: never looked at it
16:09imirkin: the shader program header is documented here: https://nvidia.github.io/open-gpu-doc/Shader-Program-Header/Shader-Program-Header.html
16:09imirkin: this is the structure which prefixes a shader program in the code page that the GPU accesses.
16:10graphitemaster: Right. I doubt that is what is in the program binary
16:10graphitemaster: (for GL)
16:10imirkin: yea, no clue
16:11imirkin: the binary probably needs to have enough in it to be able to be recompiled due to weird lowering that needs to happen
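
[Editor's sketch] The layout described above (an opaque binary header followed by the NVfp/NVvp assembly as plain text) could be pulled apart roughly like this. It is a minimal sketch and not a description of the real format: it assumes the text portion starts at the first `!!NV` marker (the prefix used by NV assembly program strings such as `!!NVfp5.0`), and the function name `split_program_binary` and the sample blob are made up for illustration.

```python
def split_program_binary(blob: bytes, marker: bytes = b"!!NV"):
    """Split a program-binary blob into (header, text).

    Assumption: everything before the first occurrence of `marker` is the
    undocumented header, and everything from the marker on is the plain-text
    assembly. If the marker is absent, the whole blob is returned as header.
    """
    idx = blob.find(marker)
    if idx < 0:
        return blob, b""
    return blob[:idx], blob[idx:]


# Synthetic example: 8 placeholder header bytes followed by assembly text.
sample = bytes(range(8)) + b"!!NVfp5.0\nEND\n"
header, text = split_program_binary(sample)
print(len(header), text.split(b"\n")[0])
```

Dumping `header` from real blobs across driver versions and shader stages would be one way to start guessing at its fields.
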
16:30graphitemaster: NV tools has a great CUDA occupancy calculator, which was ported to the web here https://xmartlabs.github.io/cuda-calculator/ that I use quite frequently, but I have to guess the registers per thread and shared memory size
16:30pendingchaos: the information returned from VK_KHR_pipeline_executable_properties is specific to the compiled program
16:30pendingchaos: vkGetPipelineExecutablePropertiesKHR takes a pipeline in the VkPipelineInfoKHR struct
16:32graphitemaster: Ah that's nice to know. So maybe just getting that extension in GL would be good enough.
16:39graphitemaster: https://docs.nvidia.com/drive/drive_os_5.1.6.1L/nvvib_docs/index.html#page/DRIVE_OS_Linux_SDK_Development_Guide/Graphics/graphics_comp_shader.html
16:39graphitemaster: So first 8 bytes figured out XD
16:51graphitemaster: imirkin, Alternatively, I can do something like this https://cdn.discordapp.com/attachments/600646864907403264/784203963754872862/unknown.png
16:51graphitemaster: also: https://pastebin.com/raw/vg85bVvc
16:51graphitemaster: basically memcpy from the GPU the actual binary and fetch it in GL
16:51graphitemaster: Screw the actual format
16:51graphitemaster: Though this feels so much more hacky to me XD
16:52graphitemaster: This would get me SPH though right?
16:57graphitemaster: The memcpy shader https://pastebin.com/raw/K3dzeXAU
16:58graphitemaster: What a fun little hack
16:58graphitemaster: No way this works with nouveau since NV_command_list
16:59graphitemaster: (oh and pointers, lel mesa)
17:00imirkin: nouveau does not support the NV_program/NV_command_list stuff
17:00imirkin: it could, but it's a lot of effort for zero benefit
17:00imirkin: plus there's no way that those specs are complete
17:01imirkin: we'd have to be bug-for-bug compatible
17:01graphitemaster: Oh for sure. I've used a lot of those bugs which are shipped on Nintendo Switch games :D
17:02graphitemaster: I can list two games that emulators will *never* be able to emulate because we straight up just write and work the macro method expander directly from NVN.
17:02imirkin: well, the mme isn't _super_ capable
17:02imirkin: i was just complaining the other day that it didn't have a multiply op :)
17:03graphitemaster: Well changing state in the middle of a draw is pretty useful :D
17:03graphitemaster: And works
17:03imirkin: the *middle* of a draw?
17:03graphitemaster: (at least on Maxwell)
17:03imirkin: definitely not at the gpu-level of a draw
17:03imirkin: could be in the middle of a multidraw
17:03imirkin: (which is def done to feed in gl_DrawId/etc)
17:04graphitemaster: One game out there fudges the depth test in the middle of the draw based on VertexID
17:04graphitemaster: And it works "fine"
17:04imirkin: can't be done if it's a single draw. must be getting broken up into multiple draws.
17:05graphitemaster: Also the MME has a MUL/MULH/MULU op so not sure what you mean by it doesn't
17:06imirkin: must be new
17:06imirkin: since when?
17:06graphitemaster: MME64 does yeah
17:06imirkin: fermi def doesn't have it
17:07graphitemaster: I dunno what fermi has but the MME I know has 32 ops, (5 bit encoding).
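
[Editor's sketch] The "32 ops, 5 bit encoding" arithmetic is just 2^5 = 32. The snippet below shows that and how a 5-bit opcode field would be masked out of an instruction word. Where the field actually sits in an MME instruction is not stated in this conversation, so the `shift=0` default and the helper names are purely hypothetical.

```python
OPCODE_BITS = 5  # from the conversation: 5-bit opcode encoding


def opcode_count(bits: int = OPCODE_BITS) -> int:
    """Number of distinct opcodes a field of `bits` bits can encode."""
    return 1 << bits


def extract_opcode(word: int, shift: int = 0, bits: int = OPCODE_BITS) -> int:
    """Mask a `bits`-wide field out of `word`, starting at bit `shift`.

    The field position is a placeholder assumption, not the real MME layout.
    """
    return (word >> shift) & ((1 << bits) - 1)


print(opcode_count())            # distinct opcodes for a 5-bit field
print(extract_opcode(0b100101))  # low 5 bits of an arbitrary example word
```
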
17:08imirkin: graphitemaster: https://github.com/envytools/envytools/blob/master/envydis/macro.c
17:08imirkin: what's missing?
17:11graphitemaster: Looking at that table, mul, mulh, mulu, extended, clz, sll, srl, sra, (no clue what andn is, never seen that), slt, sltu, sle, sleu, seq, state, loop, jal, blt, bltu, ble, bleu, beq, dread, and dwrite (maybe your names are not the same as the nv ones, dunno)
17:12imirkin: graphitemaster: do you know how they're encoded, and what GPUs they're available on?
17:13graphitemaster: turing introduced mme64
17:13graphitemaster: iirc
17:13imirkin: oh lol, ok
17:13imirkin: i don't care about turing :)
17:14graphitemaster: nv backported some of it to the switch tho with some weird emulation crap
17:14imirkin: my level of caring drops off _very_ quickly after kepler
17:14graphitemaster: not sure how that works on maxwell
17:14graphitemaster: I think maxwell is where my care drops off personally XD
17:14graphitemaster: their best microarch leap ever and we'll never see something like that again
17:15graphitemaster: yeah fermi does not have much in the mme does it
17:15imirkin: it does not.
17:16graphitemaster: they changed their embedded cpu cores in there after fermi iirc, their falcor or what ever the hell they called them
17:16graphitemaster: they had some in-house risc stuff
17:16imirkin: if you want to read some fun macros: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/mme/com9097.mme#n426 :)
17:17imirkin: falcon
17:17graphitemaster: *falcon* yeah, fast logic controller
17:17imirkin: pretty sure it's all falcon until turing-ish?
17:18graphitemaster: I'm pretty sure the mme is just a vm running on falcon cpus
17:19graphitemaster: which is why they could offer firmware updates to backport new instructions on maxwell, but I dunno
17:19imirkin: it's part of the graph unit
17:19imirkin: but i'm not sure precisely how it's implemented
17:19imirkin: graph unit has falcon cores for ctxsw, at least
17:20graphitemaster: right, falcon next is riscv, but prior to that they had their own legit isa
17:20graphitemaster: so all new falcons are riscv64 https://riscv.org/wp-content/uploads/2016/07/Tue1100_Nvidia_RISCV_Story_V2.pdf
17:20imirkin: prior to falcon was xtensa ;)
17:21graphitemaster: that slide deck says falcon has been in 15 hardware engines at nv for a decade
17:21imirkin: xtensa was there G84 .. G96 + G200
17:21graphitemaster: ah
17:21imirkin: for video decoding assistance only
17:22imirkin: (to drive the dedicated video decoding engine hardware)
17:22imirkin: G98 was the first GPU with a falcon, i think
17:22graphitemaster: jeeze it would be nice if they documented this stuff, or at least 10 years down the road they just released all their internal crap XD
17:23imirkin: all documented in nouveau :)
17:23graphitemaster: nouveau is very far behind and missing quite a lot though
17:24graphitemaster: not really your fault though, this shit is kept under lock and key and you have to reverse engineer everything
17:28imirkin: the fact that they've locked reclocking on maxwell+ really kills any interest in nouveau
17:28imirkin: which is why i don't care about things past kepler
17:30graphitemaster: I still think there's a path for reclocking that involves having everyone run the real binary blob and build up a nice database of that init process and just shipping that startup process in nouveau
17:30graphitemaster: even if you have no idea what it is
17:30imirkin: yea ...
17:30imirkin: could work. ton of effort. and i'm tired.
17:31graphitemaster: like a call to action sort of thing
17:31imirkin: as is everyone else, i think
17:31graphitemaster: pretty sure if you just got on the protondb forums and asked nv users to run this tool and submit their bins they'd do it
17:31graphitemaster: collecting data seems easy to me
17:31imirkin: but i'd have to create the tool
17:32imirkin: and figure out what the unit of card is
17:32imirkin: same SKU could have a different process.
17:32imirkin: depending on memory chips/etc
17:32graphitemaster: yeah but that's a solved problem isn't it, if it's not I'd start there
17:33imirkin: go for it.
17:33graphitemaster: like you can get all that information from just the serial number alone iirc
17:34imirkin: i guess just needs someone to actually do stuff.
17:35graphitemaster: yeah I mean I'm SUPER BUSY but the next time I have free time I'm not completely against messing around - probably won't get anywhere but I would like to see some new motivation in solving this problem and this is the only tractable solution in my opinion.
17:35graphitemaster: short of just waiting for nv to give you the key which is never going to happen
17:35imirkin: yeah, new motivation would be super.
17:36imirkin: none here though.
17:37graphitemaster: I think squandering the interest gamers have in seeing the clocking problem solved is a bit of a management failure here - building the tool and infrastructure for users to submit that you'd probably see ~299k reports and even Valve would take an interest in that.
17:37imirkin: awesome! let me know when you've built the tool
17:37graphitemaster: er, s/299/200 :D
17:38imirkin: (you see what i'm getting at?)
17:38graphitemaster: Yeah, talk is cheap.
17:38imirkin: ;)
17:39graphitemaster: But don't you ALREADY HAVE THE TOOL, it just needs to be repurposed XD
17:39graphitemaster: Like does envytools not have a "record entire boot process"
17:39imirkin: sure
17:39imirkin: at least some blob versions DMA from system memory, which defeats that
17:40imirkin: (tool = mmiotrace)
17:40imirkin: and we have tons of traces from fermi, for example
17:40imirkin: still no reclocking support in nouveau though
17:41imirkin: there was a reclocking prototype which was developed on 1 gpu. i have the same gpu, didn't work. it's all tricky stuff.
17:43graphitemaster: very sad stuff
17:43imirkin: (Quadro 400, so not like some weird reseller)
17:44graphitemaster: wonder if anyone has tried getting nvidia tools to work with nouveau, like nvidia-smi as far as I know tweaks this stuff without actually touching the driver, just operates on the device node in /dev directly iirc
17:44imirkin: that just talks to the driver
17:44graphitemaster: Although I guess the way that device node is exported is part of the binary driver after all.
17:44imirkin: we'd have to implement the blob's uapi
17:45graphitemaster: yeah that's actually way more work
17:45graphitemaster: alternative idea
17:45imirkin: the switch homebrew guys ported nouveau to run on blob
17:48karolherbst: implementing nvidias uapi is a losing battle anyway
17:48karolherbst: as they don't care about breaking it
17:51graphitemaster: It's pretty curious as to why NV doesn't just tell you how to reclock.
17:52imirkin: nowadays that's not even the issue
17:52graphitemaster: If I had my guess it would be that they artificially gimp their hardware in the clock department for SKU reasons
17:52imirkin: nowadays the issue is that they don't distribute blob
17:52imirkin: and they require signed firmware
17:52karolherbst: yeah...
17:52imirkin: e.g. we can actually reclock maxwell
17:52imirkin: except we can't :)
17:53imirkin: (pascal has a different memory controller)
17:53graphitemaster: Well signed firmware does make a lot of sense to me from a security point of view. Some of the worst and scariest exploits are attacking hardware firmware these days.
17:53karolherbst: graphitemaster: sure.. but not like this
17:53karolherbst: sooo
17:53karolherbst: the deal with maxwell2 is you can change the clock and voltage and everything
17:53karolherbst: you just can't control fan speed
17:54graphitemaster: Ask NV to sign your stuff then :D
17:54graphitemaster: (that's a joke)
17:54karolherbst: I wish it would be that easy
17:55ajax: use a bunch of nvidia gpus to run the discrete logarithm problem to work out nvidia's private key...
17:55imirkin: ;)
17:55karolherbst: ajax: just do it on users machines and try out 5 keys each boot or something :D
17:55karolherbst: the only problem is: the GPU goes into lockdown and needs a hard reset
17:56ajax: karolherbst: so you do it on reboot
17:56karolherbst: ajax: also.. it's a symmetrical key
17:56karolherbst: it's aes128
17:56graphitemaster: ugh, apply for a job at nv and then leak the key
17:57karolherbst: I know better things to do than landing in prison tbh
17:57graphitemaster: don't get caught
17:57karolherbst: problem is: we are probably very high on the "most likely offenders" list
17:57karolherbst: especially if we join nvidia and the key gets leaked like a week after
17:58graphitemaster: yeah, even if you don't do it and it got randomly leaked and no one here was involved, everyone here would have their own fbi agent within 48 hours
17:59karolherbst: ajax: mhhh.. or right before suspend
17:59karolherbst: although... doing it on runtime_suspend might work as well
18:00graphitemaster: joking aside, crypto tends to be broken as hardware advances, so breaking something like aes128 is doable in the future, I mean quantum computing will make that easy right
18:00karolherbst: no
18:00karolherbst: they won't
18:00graphitemaster: So maybe we just wait it out
18:00karolherbst: they can figure out prime factors, but aes is just math
18:00karolherbst: well.. trivial math
18:01karolherbst: an aes key is really just a random number in the end or did I miss something here?
18:02graphitemaster: sure but crypto can halve any block-cipher because of grover's algorithm
18:02graphitemaster: s/crypto/quantum/
18:02graphitemaster: so aes128 becomes aes64 attack wise if you built the QC for it
18:04karolherbst: ohh, interesting
18:05graphitemaster: mind you 2^64 is still a lot of iterations
18:05karolherbst: yeah
18:05karolherbst: but not impossible
18:05graphitemaster: but if amazon starts having elastic quantum compute engines you could just rent like 64 boxes and come back in a few days
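
[Editor's sketch] The Grover arithmetic above can be written out as follows. Grover search over a k-bit key needs on the order of (pi/4) * sqrt(2^k) iterations, which is where "aes128 becomes aes64" comes from. One caveat for the "rent 64 boxes" idea: Grover does not parallelize linearly, so splitting the key space over m machines only cuts the per-machine work by about sqrt(m). The iteration rate used for the time estimate is a made-up placeholder, not a figure for any real or planned hardware.

```python
import math


def effective_security_bits(key_bits: int) -> int:
    """Grover halves the effective key length of a brute-force search."""
    return key_bits // 2


def grover_iterations(key_bits: int) -> int:
    """Approximate iteration count: (pi/4) * sqrt(2**key_bits)."""
    return math.floor((math.pi / 4) * 2 ** (key_bits / 2))


def per_machine_iterations(key_bits: int, machines: int) -> float:
    """Sequential iterations per machine when the key space is split
    across `machines` independent devices (speedup is only sqrt(machines))."""
    return grover_iterations(key_bits) / math.sqrt(machines)


# Hypothetical rate, chosen only to make the scale visible.
ASSUMED_ITERATIONS_PER_SECOND = 1e9

seconds = per_machine_iterations(128, 64) / ASSUMED_ITERATIONS_PER_SECOND
print(effective_security_bits(128))
print(f"{seconds / (365.25 * 24 * 3600):.1f} years under the assumed rate")
```

Under these assumptions the "few days" estimate depends entirely on how fast a single Grover iteration can run, since adding machines helps much less than it would for a classical search.
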
18:08graphitemaster: and if this happens nv would just move to aes256 for their next gpu so we're back at square one
18:09karolherbst: another issue is, that reverse engineering the clocking stuff gets more and more complicated
19:47Lyude: dunno who needs to hear this but heads up: just had to send a revert for a patch I pushed to nouveau yesterday because it turns out I accepted a patch from umn.edu without even realizing it
19:48Lyude: so, probably want to make sure any other patches we get aren't coming from there :)
19:58karolherbst: Lyude: maybe we should teach dim about it
20:12graphitemaster: What is wrong with umn.edu ?
20:13graphitemaster: Oh is that the university that got banned from the Linux kernel for their buggy patches
20:24Lyude: karolherbst: actually after talking with ag5df I think I agree we should just leave the patch in
20:25Lyude: *agd5f
20:27karolherbst: Lyude: yeah, I think the patch itself is fine as well
20:38anholt: karolherbst: I was using the dtb from the linux kernel. tried using one from the l4t firmware, and something with irqs was busted at boot.
20:38karolherbst: anholt: I used to use the one from the kernel, but it's not unlikely you'll hit firmware bugs anyway
20:39karolherbst: there is a bug somewhere where node ids are used more than once
20:40karolherbst: uhm.. I think it's the "serial" field seen in sysfs or something. Anyway... I just use UEFI and ignore dtbs