02:33 Pharyngeal: The feature matrix lists NV130 hardware acceleration as TODO - I'm not sure I have the expertise to work on this, but I can help with testing if anyone knows of anyone working on this.
02:34 imirkin: Pharyngeal: which bit is still listed as TODO?
02:35 Pharyngeal: vdpau video decoding acceleration
02:35 imirkin: ah yeah
02:35 imirkin: no one's working on it
02:36 Pharyngeal: I also see no vaapi support in `vainfo`
02:36 imirkin: right, it's all part of the same thing
02:36 imirkin: video decoding accel support isn't there
02:37 Pharyngeal: That's unfortunate
02:37 imirkin: thank you for choosing nvidia. we appreciate that you have a choice of gpu vendors, and you appear to have made the wrong one.
02:38 Pharyngeal: Is there a technical limitation to supporting it on this generation, or has there just been limited manpower available?
02:38 imirkin: it's _likely_ that technical limitations will exist, but to be honest, no one's even looked at it
02:39 imirkin: it should be entirely doable on the GM107
02:39 imirkin: (aka NV117 ... not sure about GM108 - i have the recollection that video decoding accel is just missing from that chip entirely as it's meant as a 3d accelerator only)
02:40 imirkin: if you're interested, i can give tips, but i won't sugar-coat it -- it's a lot of work.
02:40 Pharyngeal: It'd be interesting, but I am not sure I have the relevant expertise to work on something like this.
02:41 imirkin: your call
02:42 imirkin: there are a few macro steps ... first you have to figure out how to get the engine going, then you have to figure out how video data is submitted and decoded frames returned, and then finally you have to apply that knowledge to create an implementation of this in the kernel and mesa.
02:43 imirkin: there's a big change in VP6 relative to VP5 -- VP5 was 3 separate engines, while VP6+ are a single engine
02:43 imirkin: however i've been led to believe that the interfaces are still largely similar, just compacted into a single engine
02:43 imirkin: so a lot of stuff could be reused in that case
02:43 imirkin: but someone's gotta do the leg work
02:43 imirkin: and it ain't gonna be me -- had enough of my fill doing VP2 and fixing up bits of VP3/4/5
02:45 imirkin: with GM20x+ you could easily run into issues of needing signed firmware, although you'd be extracting those from the blob anyways. but loading them may be a circuitous annoyance.
02:45 imirkin: but that's just icing on the proverbial cake
03:23 Pharyngeal: Indeed, I'm not familiar with some of the concepts you mention, so I'm probably not the right person to lead this.
03:24 Pharyngeal: Just tried the official proprietary driver, and it seems to break wayland
03:24 Pharyngeal: Which I need for fractional scaling
03:29 HdkR: The proprietary driver supports wayland now?
03:30 Pharyngeal: Apparently, as long as modesetting is enabled with a kernel cmdline param
03:31 Pharyngeal: Turns out that particular issue was due to a conflict with the intel integrated graphics module. Just blacklisted it, and now I'm running wayland on the proprietary one.
03:31 Pharyngeal: Although I feel quite uncomfortable using the proprietary driver.
03:33 HdkR: That's just the natural feeling that Nvidia gives people
03:34 HdkR: Eventually your mind will block it out
03:41 Pharyngeal: I'm just worried about the security of my box
03:48 HdkR: Nvidia is worried about the security of their competitive advantage :P
04:31 orly_owl: nvidia is worried theyll make a useful product one day
04:33 HdkR: :hot_take:
04:36 Pharyngeal: So I ended up reverting back to nouveau - vaapi and several apps were broken with the nvidia/wayland combo, and the only app I'd really want it in only supports it under wayland.
04:36 Pharyngeal: ¯\_(ツ)_/¯
04:39 Pharyngeal: Thanks anyway. My next box includes AMD Vega - hopefully that'll work better with Wayland
04:39 Pharyngeal: I wish I was more familiar with the linux graphics stack/had the time to contribute.
04:45 imirkin: HdkR: eventually you get an amd chip :)
04:46 imirkin: not like amd doesn't have its problems, but there's an actual team behind it
04:46 HdkR: yea, I don't use Nvidia under Linux anymore
08:43 karolherbst: HdkR: no GL in H
08:43 karolherbst: ...
08:43 karolherbst: *X clients though
08:43 karolherbst: soo.. useless ;)
09:55 karolherbst: ohh interesting, gallium supports up to 32 const buffers
09:55 karolherbst: why is that?
10:02 karolherbst: ehh, seems like that was arbitrarily chosen
11:23 cosurgi: imirkin: kernel version 5.5.1
11:23 cosurgi: good morning everyone, btw :)
12:01 karolherbst: ohh.. I could use cupti to re the perf counters: https://docs.nvidia.com/cupti/Cupti/r_main.html#metrics-reference-6x
12:01 karolherbst: mhhh
12:01 karolherbst: oh, we also have cupti_trace
12:01 karolherbst: let's see
14:48 karolherbst: https://gist.githubusercontent.com/karolherbst/c65a5fa4c67e97274e679c306c9fefeb/raw/e78ca7b701288c894102c48a2137ad03b19ef630/gistfile1.txt :)
16:54 lovesegfault: Can someone ELI5 the GBM vs EGLStreams thing?
16:55 imirkin: GBM = mesa thing
16:55 imirkin: EGLStreams = nvidia thing
16:55 imirkin: fight! fight!
16:55 lovesegfault: But what does GBM _do_?
16:55 lovesegfault: Or EGLStreams for that matter
16:58 imirkin: who knows
16:58 imirkin: winsys stuff.
16:58 imirkin: making things show up on the screen
16:58 imirkin: you know, little things
16:58 lovesegfault: lol
17:06 karolherbst: well, there is a bit more to it like allocating buffers fitting the needs of all users, etc.. but yeah ;)
17:07 imirkin: it's basically the repository of all the annoying stuff which makes things actually work
17:07 lovesegfault: This is where I get 🤔; why is allocating buffers a contentious topic?
17:07 lovesegfault: it sounds simple?
17:07 imirkin: who allocates?
17:07 imirkin: who specifies the parameters?
17:07 imirkin: what if two sides disagree?
17:08 lovesegfault: I imagine the program asks, the kernel allocates it in the GPU?
17:08 imirkin: what about the mechanism for communicating that a surface is ready
17:08 imirkin: it's all similar, but different
17:08 lovesegfault: What's a surface?
17:10 karolherbst: lovesegfault: what if you have to use that buffer on three devices and two applications?
17:12 karolherbst: like a webcam streaming into a buffer, one GPU renders and the other scanouts from that buffer spanning your "application" + the compositor
17:12 karolherbst: you want to eliminate as many format conversions and copies as possible
17:13 karolherbst: just one example making the overall topic overly complicated
17:16 imirkin: followed by encoding the screen by yet-another hardware component
17:47 karolherbst: ufff, when I look at shaders and see stuff like this: https://gist.githubusercontent.com/karolherbst/86d62c1fbe354632dbc759da54b823ab/raw/29db8ac066278d40c9b2620979e21c0f4fdacbb6/gistfile1.txt
17:48 imirkin: welcome to ssa-ville
17:49 fincs: Btw, what's the state of nouveau's nir frontend, is it at feature parity with the tgsi frontend?
17:49 karolherbst: nope
17:49 karolherbst: but only unimportant bits are missing
17:49 fincs: I guess it's missing support for more obscure features
17:49 karolherbst: nah
17:49 fincs: Does the whole nir shebang result in better optimization?
17:50 karolherbst: more like stuff imirkin added support later on :p
17:50 fincs: lol
17:50 karolherbst: fincs: usually yes
17:50 fincs: I'd like to mess around with it sometime in the future
17:51 karolherbst: anyway, right now there is no real reason to focus much on the nir backend, but long term I don't see any other way than to get rid of the TGSI one anyway
17:51 fincs: I saw mesa received some good refactoring to separate tgsi-nir codepaths
17:51 karolherbst: alternative would be to write a spirv to codegen pass
17:51 karolherbst: but.. ufff
17:51 karolherbst: I am already happy that vtn handles all of that
17:51 fincs: Ideally non-NIR should be killed tbh
17:51 fincs: There's like... four IRs coexisting atm iirc :p
17:51 karolherbst: more :p
17:51 karolherbst: except you only count middle layer IRs
17:52 fincs: Which one did I miss? Mesa IR, TGSI, NIR, SPIR-V
17:52 karolherbst: but then we still have more
17:52 karolherbst: glsl ir
17:52 karolherbst: and LLVM
17:52 fincs: Mesa IR != GLSL IR?
17:52 fincs: Oh right
17:52 karolherbst: mesa ir is this ARB shader stuff
17:52 fincs: Yuck
17:52 karolherbst: well, needs to stay :p
17:53 karolherbst: and then this dx9 IR
17:53 karolherbst: in gallium nine
17:53 karolherbst: anyway...
17:53 karolherbst: it's complicated :p
17:53 fincs: IIRC it used to be that NIR depended on TGSI support code, which in turn depended on Mesa IR code; not sure if that's accurate anymore
17:55 HdkR: Don't forget the upcoming DXIL->NIR stuff :P
17:55 imirkin: tgsi works well, and is easy to work with
17:56 imirkin: without an instruction reordering pass, i don't think NIR can be workable
17:56 fincs: I've seen some codegen oopsies when switching from default-uniform stuff to UBO stuff
17:56 karolherbst: imirkin: yeah.. that's true as well
17:56 karolherbst: nir places a few things really annoyingly
17:56 karolherbst: but there are passes to clean this up
17:56 karolherbst: and I already workaround that for immediates
17:56 karolherbst: anyway, right now the nir shaders are still faster on average
17:56 imirkin: a lot of pain for seemingly no gain
17:56 karolherbst: it's faster
17:57 imirkin: faster compile or runtime?
17:57 karolherbst: runtime
17:57 imirkin: hm ok. wonder why.
17:57 karolherbst: 10% in pixmark piano last time I checked
17:57 karolherbst: CFG based opts
17:57 karolherbst: nir does that
17:57 karolherbst: tgsi and codegen don't
17:57 karolherbst: well
17:57 karolherbst: at least that's what I would bet this on
17:58 karolherbst: and these days there are lot of nir opts
17:58 karolherbst: so I wouldn't be surprised if we could get the nir stuff to be even faster
17:58 karolherbst: compile times are worse though
17:58 karolherbst: but that shouldn't be a surprise
17:59 karolherbst: imirkin: what's also nice is that with nir we can get rid of some of the indirection loads, so I see way less spilling as well
17:59 karolherbst: and other goodies
17:59 karolherbst: but yeah.. long time last I checked
17:59 karolherbst: and it doesn't matter prior vulkan or 4.6 anyway
18:00 karolherbst: and/or proper recocking
18:00 karolherbst: *reclocking
18:02 karolherbst: pixmark_piano 89ms ->85ms frame time tgsi vs nir
18:03 karolherbst: fun fact.. gpr usage is generally higher with nir as well
18:03 karolherbst: ohh, I had my imm opt enabled :)
18:04 imirkin: well, tgsi isn't a thing that would do cfg opts in the first place, so that's not a real comparison
18:04 imirkin: codegen really should though
18:04 karolherbst: yeah.. but making it better to deal with cfg stuff?
18:04 karolherbst: you really want to spend time on that?
18:04 imirkin: =]
18:04 imirkin: not really, no
18:04 karolherbst: nir is nice, as everybody else does it for us .p
18:05 karolherbst: and even if we would be heavily nir based, we still need codegen for other things
18:05 karolherbst: like modifier folding
18:05 karolherbst: and other stuff
18:05 imirkin: and we emit tons of code in the lowering passes
18:05 imirkin: some of which expects later optimization
18:05 karolherbst: or post ra opts
18:05 karolherbst: yeah...
18:05 karolherbst: but we could move most of the lowering into nir actually
18:05 karolherbst: a lot of the tex stuff probably as well
18:06 karolherbst: dunno
18:06 karolherbst: never looked that detailed into it yet
18:06 imirkin: getting stuff like the manual txd lowering will require ... work
18:07 karolherbst: yeah.. but we don't have to move everything
18:07 karolherbst: if moving some stuff makes codegen easier to understand and easier to fix, then that's already a win
18:08 imirkin: i think everyone agrees by now that a proper backend compiler is necessary
18:09 karolherbst: yeah
18:09 karolherbst: was that different at some point?
18:09 imirkin: i think the original thrust was "move everything into frontend"
18:09 imirkin: which i never agreed with
18:10 karolherbst: maybe.. I always thought that moving as much as possible is probably a good idea, but due to our ISA it was obvious we also have to deal with odd things we can't do in nir except making it harder for everybody else
18:11 karolherbst: anyway, nirs passes are way more sane than ours and the increase in compile time won't matter with a shader cache anyway :)
18:11 karolherbst: or not that much
18:11 karolherbst: some games take 5-6 minutes to load :(
18:11 karolherbst: without a cache
18:11 karolherbst: even the TGSI cache was able to reduce that to 1-2
18:14 karolherbst: imirkin: one thing which is super interesting
18:14 karolherbst: pixmark_piano TGSI; type: 1, local: 0, shared: 0, gpr: 49, inst: 3467, bytes: 36984
18:14 karolherbst: pixmark_piano NIR: type: 1, local: 0, shared: 0, gpr: 78, inst: 2879, bytes: 3071
18:14 karolherbst: the nir one is faster
18:15 karolherbst: but that might just be due to how the opts are working and nir makes by chance better choices...
18:15 karolherbst: dunno
18:15 karolherbst: never looked into why that is
18:15 imirkin: fewer instructions = more better
18:16 karolherbst: yeah, but gpr?
18:16 karolherbst: less threads = more worse
18:16 imirkin: doesn't matter as much as instructions apparently :)
18:16 karolherbst: yeah.. seems like it
18:16 karolherbst: or the loops are simply smaller
18:16 karolherbst: and nothing else matters
18:16 imirkin: right
18:17 imirkin: not all instructions are equal
18:17 imirkin: some run once, others run 10000x
18:18 karolherbst: we really should report an estimated cycle count as well
18:18 karolherbst: that might help with some decisions
18:18 karolherbst: but ufff
18:19 karolherbst: that's also a bit annoying with codegen, because we have no idea where loops are
18:19 karolherbst: or how deep a BB is within a loop
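The point that some instructions run once and others 10000x can be illustrated with a toy estimate. This is only a sketch: the `BasicBlock` type, the assumed trip count of 10 per loop level, and the two example shaders are all made up for illustration and are not something codegen provides.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    name: str
    inst_count: int  # static instructions in the block
    loop_depth: int  # 0 means the block is not inside any loop


def static_count(blocks):
    """Plain instruction count, as reported in the shader stats."""
    return sum(bb.inst_count for bb in blocks)


def estimated_dynamic_count(blocks, assumed_trip_count=10):
    """Weight each block by an assumed trip count per enclosing loop."""
    return sum(bb.inst_count * assumed_trip_count ** bb.loop_depth
               for bb in blocks)


# Hypothetical shaders: A is larger overall, B has the larger loop body.
shader_a = [BasicBlock("entry", 100, 0), BasicBlock("loop", 20, 1)]
shader_b = [BasicBlock("entry", 40, 0), BasicBlock("loop", 40, 1)]

for name, shader in (("A", shader_a), ("B", shader_b)):
    print(name, static_count(shader), estimated_dynamic_count(shader))
```

With these made-up numbers, shader B has fewer static instructions but a higher estimated dynamic count, which is the situation where the "inst" statistic alone points the wrong way.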
18:21 RSpliet: what does "bytes" mean in those statistics?
18:21 karolherbst: binary size
18:22 karolherbst: I failed at copy pasting :p
18:22 RSpliet: Ah ok
18:22 karolherbst: a 2 is missing
18:22 RSpliet: Yeah that sounds about right
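The "a 2 is missing" remark can be sanity-checked from the instruction counts. A minimal sketch, assuming the Maxwell/Pascal-style encoding where each group of up to three 8-byte instructions is preceded by one 8-byte scheduling word; that encoding detail is an assumption here, it is not stated in the log.

```python
def binary_size_bytes(inst_count, word_size=8, group=3):
    """Size of a shader binary under the assumed encoding:
    one scheduling word per started group of `group` instructions."""
    sched_words = (inst_count + group - 1) // group  # ceiling division
    return (inst_count + sched_words) * word_size


print(binary_size_bytes(3467))  # 36984, matches the TGSI line
print(binary_size_bytes(2879))  # 30712
```

Under that assumption the TGSI count reproduces the reported 36984 bytes, and the NIR count gives 30712, which is consistent with "3071" having lost a trailing 2.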
18:22 imirkin: it only matters for nv50, not for nvc0
18:22 imirkin: on nvc0, all instructions are 8 bytes
18:22 imirkin: but on nv50, some are 4 some are 8
18:22 imirkin: the theory being that 4 is better than 8
18:22 karolherbst: and even then.. does it really matter all that much for nv50?
18:22 imirkin: well, it helped me assess how well the post-ra stuff worked
18:23 karolherbst: :)
18:23 imirkin: to make it possible to use some of those encodings
18:23 karolherbst: yeah.. I mean, using less VRAM is always nice
18:23 karolherbst: I just don't think it matters that much..
18:23 RSpliet: It matters when your bottleneck is instruction cache hits
18:23 karolherbst: still
18:23 imirkin: i think they also get dual-issued, maybe
18:23 karolherbst: it's an optimization
18:23 imirkin: absolutely
18:23 karolherbst: imirkin: mhhhh... I don't think so
18:23 karolherbst: there are dual issuing rules
18:23 karolherbst: but mhhh
18:23 karolherbst: those are a little weird for nv50
18:23 karolherbst: apparently you can dual issue two fmas
18:23 karolherbst: one on the ALU one on the SFU
18:23 imirkin: not on nv50
18:24 imirkin: or you mean mad's?
18:24 imirkin: nv50 doesn't have fma :)
18:24 karolherbst: yes.. mads
18:24 karolherbst: or muls
18:24 karolherbst: dunno
18:24 karolherbst: there was something though
18:24 karolherbst: and gt200+ only
18:25 RSpliet: Either way, 78 GPRs is an unholy amount of GPRs. Didn't even know you could have so many without spilling... last ISA I checked (which... granted, was like Kepler or Maxwell) you could only address 63 registers in your program?
18:25 karolherbst: anyway.. it was more about placement of the instructions as well.. but I guess hard to figure out without perf coutners
18:25 karolherbst: *counters
18:25 imirkin: kepler2+ has 256 ISA regs
18:25 karolherbst: RSpliet: kepler2
18:25 karolherbst: ...
18:25 RSpliet: Ah
18:25 karolherbst: yes
18:26 imirkin: fermi/kepler1 have 63
18:26 karolherbst: RSpliet: gm107+ even has 18 cbs :p
18:26 imirkin: nv50 has 128
18:26 karolherbst: there are some of those nice goodies
18:26 RSpliet: Is the threshold for maximum warps still 31?
18:26 karolherbst: RSpliet: it was never per gen in the first place
18:26 karolherbst: RSpliet: or do you mean the size of a warp
18:27 karolherbst: RSpliet: anyway, those tables are quite useful: https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
18:27 RSpliet: No. The maximum number of GPRs you can use before you have to reduce the number of warps running in parallel
18:27 karolherbst: depends on the chipset ;)
18:28 RSpliet: It does, but for a long time 31 was the rule of thumb that works everywhere except some FrankenKeplers
18:28 karolherbst: SM37 has more regs overall per SM as it seems
18:28 karolherbst: 128k instead of 64k
18:29 RSpliet: Yeah, so for SM37 and SM75 you can address 63 or 64 regs (there's this weird 0-overlay thing that bugs me) without dropping occupancy
18:29 karolherbst: yeah
18:29 karolherbst: it's all a bit weird
18:29 karolherbst: and I guess something we should look into at some point
18:30 karolherbst: also shared mem per MP is a bit odd
18:30 karolherbst: I think our understanding is way more static than that
18:31 karolherbst: maybe we even do it wrongly
18:31 karolherbst: dunno
18:32 karolherbst: also.. what does "Maximum number of instructions per kernel" mean? :D
18:32 karolherbst: shader size?
18:32 karolherbst: but that seems too big
18:32 karolherbst: executed instructions?
18:32 karolherbst: that would be odd
18:33 RSpliet: Right, so the number of warps in flight with that NIR shader drops to like 52 instead of 64. I suspect it's slightly less, 48's a nicer number but who knows. That might still be enough to keep all compute resources occupied, esp. if the shader isn't DRAM bound
18:33 RSpliet: In which case, lower dynamic insn count is more important than #warps. That's a tough decision to make.
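The rough warp count above can be reproduced with a simplified register-limited model. This is a sketch under assumptions not stated in the log: 32 threads per warp, a cap of 64 warps per SM, and no allocation granularity. The 64k and 128k register-file sizes are the two figures mentioned earlier in the discussion.

```python
def max_warps_by_registers(gprs_per_thread, regs_per_sm,
                           max_warps_per_sm=64, threads_per_warp=32):
    """Warps that fit in the register file, capped by the SM limit.
    Ignores allocation granularity, so real hardware may give less."""
    by_regs = regs_per_sm // (gprs_per_thread * threads_per_warp)
    return min(max_warps_per_sm, by_regs)


for label, gprs in (("TGSI", 49), ("NIR", 78)):
    for regs in (64 * 1024, 128 * 1024):
        warps = max_warps_by_registers(gprs, regs)
        print(f"{label}: {gprs} GPRs, {regs} regs/SM -> {warps} warps")
```

In this model 78 GPRs with a 128k-register file gives 52 warps. Real hardware rounds register and warp allocations, which is presumably why a somewhat lower number such as 48 is suggested as more likely.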
18:34 karolherbst: RSpliet: yeah.. but I expect that generally with nir we get smaller loop bodies
18:34 karolherbst: so we execute way less instructions overall
18:34 karolherbst: but maybe the opts are just smarter as well
18:34 RSpliet: That's what "dynamic instruction count" means :-)
18:34 karolherbst: dunno
18:34 karolherbst: ahh, yeah
18:34 karolherbst: anyway, the gpr usage is significantly higher
18:35 karolherbst: and that's mostly due to how some of the nir opts work
18:35 karolherbst: but.. there are opts for that :)
18:35 karolherbst: like moving input loads closer to the user and stuff
18:35 RSpliet: Instruction scheduling can make a difference yes
18:35 karolherbst: anyway.. if at some point somebody has a lot of spare time, that might be time well spent
18:35 RSpliet: Question is: does it need to? Do you know the statistics of NVIDIA's shader?
18:36 karolherbst: nvidia is between our TGSI and NIR stats in terms of gpr usage afaik
18:36 karolherbst: they generally use slightly more gprs
18:36 RSpliet: And in binary size - as a proxy for dynamic instruction count?
18:36 karolherbst: dunno
18:36 karolherbst: but I think they are usually smaller as well
18:36 karolherbst: but.. yeah
18:36 karolherbst: dunno
18:37 RSpliet: Sometimes they are, sometimes they unroll loops pretending that DRAM is free :-D
18:37 karolherbst: RSpliet: indirect cb loads suck
18:37 karolherbst: well...
18:37 karolherbst: if the address isn't uniform
18:37 karolherbst: and that could happen in loops...
18:38 karolherbst: the hell?
18:38 karolherbst: how can this happen
18:38 karolherbst: 16 bit immediate?
18:39 imirkin: in mul on nv50
18:39 RSpliet: int or FP?
18:39 karolherbst: gp107 here
18:39 karolherbst: mov u16 %r9284 0x0000 (0)
18:39 imirkin: oh
18:39 imirkin: we do that in some places
18:39 karolherbst: yeah
18:39 imirkin: i forget
18:39 imirkin: check the xmul lowering
18:39 imirkin: also i think i do it for like tex layers
18:39 karolherbst: ohh
18:39 karolherbst: tex
18:39 imirkin: when converting the float layer to int
18:40 karolherbst: 999: mov u16 %r9284 0x0000 (0)
18:40 karolherbst: 1000: tex 2D_ARRAY $r255 $s31 rgba f32 { %r9285 %r9286 %r9287 %r9288 } %r9284 %r9275 %r9276 %r9283 (0)
18:40 karolherbst: mhhhh
18:40 imirkin: yeah, so the layer must have been the floating point 0
18:40 imirkin: and the cvt got optimized
18:40 karolherbst: that messes with my imm to cb pass :D
18:40 imirkin: sorry!
18:40 karolherbst: I kind of assume 32 bit imms
18:40 karolherbst: yeah well..
18:40 imirkin: you could upgrade it to 32-bit
18:40 imirkin: doesn't matter much
18:40 karolherbst: I guess so
18:41 karolherbst: now that I am looking into the pass... it's not good :D
18:41 karolherbst: I really need to improve that one
18:41 imirkin: the tex lowering thing?
18:41 imirkin: or the imm -> const thing?
18:41 karolherbst: imm -> const
18:42 karolherbst: like I add entries to the table, then check if it can be loaded
18:42 karolherbst: and leave the entry
18:42 karolherbst: even if the load can't happen :)
18:42 imirkin: hehe
18:42 imirkin: whatever, the table entries are free if you're burning a cb
18:42 karolherbst: yeah..
18:42 karolherbst: but that doesn't help me in figuring out which constants are worthy for a static table :)
18:43 karolherbst: and then why does it try to move a 0 into the table anyway
18:43 karolherbst: ...
18:43 karolherbst: at least the 0 thing can be fixed easily
18:44 imirkin: yeah, skip the 0's :)
18:44 imirkin: those become $rZero anyways
18:45 karolherbst: yeah
18:45 imirkin: also, the layer has to be in a register anyways
18:45 imirkin: so no point in making trouble imo
18:46 karolherbst: sure.. but I need to prepare the test if I can load from a cb from an instruction source
18:46 karolherbst: that's not that trivial
18:46 karolherbst: codegen assumes that load exists
18:46 karolherbst: in my situation... it does not
18:46 karolherbst: so I needed to add a new API for that
18:47 karolherbst: imirkin: I guess it would be safe to align all immediates to 0x4 or 0x8 in the table and waste a few bytes for smaller imms
18:47 karolherbst: because not all instructions can byte address
18:48 imirkin: in fact, none can
18:48 imirkin: except mov :)
18:48 karolherbst: yeah.. probably
18:48 imirkin: er, LDC
18:49 karolherbst: OP_BRA can apparently
18:49 karolherbst: ohh all the flow instructions can
18:49 karolherbst: not that we use it...
18:50 karolherbst: but still
18:50 karolherbst: well... we could :D
18:50 karolherbst: but that only makes sense for indirection
18:52 imirkin: right.
18:52 imirkin: you REALLY want the immediate there
18:52 karolherbst: yeah
19:21 karolherbst: imirkin: scanning through our shader db: https://gist.githubusercontent.com/karolherbst/f7b90629461392f3e5558204c0eae841/raw/2728dbf4ca813c11bb98cdca7b5a4c32e38cb506/gistfile1.txt
19:21 karolherbst: kind of looks like a small static table is already good enough
19:21 imirkin: heh. 1 and -1 are popular :)
19:21 karolherbst: :)
19:21 imirkin: those are short imms
19:21 imirkin: so should be fine
19:21 karolherbst: nope
19:21 imirkin: ?
19:22 karolherbst: I only print the one which actually got converted to loading cb because loading them failed
19:22 karolherbst: as short imms
19:22 karolherbst: wait. I can show you examples probably
19:22 imirkin: i'd be curious where we fail to load -1 or 1, but succeed with a const
19:22 imirkin: probably lots of places where neither works
19:22 imirkin: since const can only go into src2
19:22 imirkin: (and imm)
19:23 imirkin: (except like fma)
19:23 imirkin: btw - remember how FADD has the "PO" mode? ("plus one")? i forget if we use it
19:23 karolherbst: ohhh uhm...
19:23 karolherbst: crap
19:23 karolherbst: I forgot about fma
19:24 karolherbst: mov u32 %r299 0x3f800000 (0)
19:24 karolherbst: mad f32 %r308 %r194 %r194 %r299 (0)
19:24 imirkin: hm
19:24 karolherbst: yeah..
19:24 imirkin: should double check if there's a mode that lets you have an imm in the last source
19:24 imirkin: there might be.
19:24 karolherbst: yes
19:24 karolherbst: but
19:24 karolherbst: we only know post RA
19:24 imirkin: you can definitely dump a const in there
19:25 imirkin: oh, coz the regs have to match up?
19:25 karolherbst: ohh wait
19:25 imirkin: no, that's different
19:25 karolherbst: yeah
19:25 karolherbst: .. let me check
19:25 karolherbst: nope
19:25 karolherbst: src2 is either a reg or cb
19:25 karolherbst: at least our emitter only handles that
19:25 imirkin: right, but double-check envydis
19:27 karolherbst: nope
19:27 karolherbst: doesn't exist
19:27 karolherbst: ohhh
19:27 karolherbst: wait
19:27 karolherbst: it does
19:27 karolherbst: { 0x3280000000000000ull, 0xfe80000000000000ull, OP8B, T(pred), N( "ffma"), T(5980_0), T(5980_1), ON(50, sat), ON(47, cc), REG_00, REG_08, ON(48, neg), F20_20, ON(49, neg), REG_39 }
19:28 karolherbst: interesting
19:28 karolherbst: maybe I wire that up first and see how that goes
19:29 imirkin: no, that's src2 that gets the imm
19:29 karolherbst: ehh. yeah..
19:29 karolherbst: so .. no
19:29 karolherbst: doesn't exist
19:29 karolherbst: anyway.. fma with dst == src2 is something I also ignore
19:29 karolherbst: :/
19:30 imirkin: that's for a limm
19:30 karolherbst: yeah
19:30 karolherbst: but only post ra
19:30 imirkin: right
19:30 karolherbst: and with the imm -> cb that will never happen anymore
19:30 imirkin: wtvr
19:30 karolherbst: yeah..
19:30 karolherbst: matters more with a per shader table though
19:31 karolherbst: mhhh...
19:31 karolherbst: seems like 1.0 is really just this fma case
19:32 karolherbst: heh
19:32 karolherbst: 0x3a83126f is 0.001
19:32 karolherbst: makes sense to be one of the most common used limms though
19:32 imirkin: is there a FMUL.PO variant somewhere?
19:33 karolherbst: a non FMA case
19:33 karolherbst: mov u32 %r255 0x3f800000 (0)
19:33 karolherbst: slct u32 %r256 ne %r254 %r255 %r253 (0)
19:33 imirkin: oh, PO is only IADD32I apparently
19:33 imirkin: right yeah ok
19:33 karolherbst: :)
19:34 karolherbst: but yeah.. looks like a table of 100 entries would already help a lot
19:35 karolherbst: but.. well, that's kind of biased towards the shaders we have
19:35 karolherbst: but most of the numbers there don't look surprising
19:35 karolherbst: 0x38d1b717 is 0.0001
19:36 karolherbst: 0x33d6bf95 is an odd 1
19:36 karolherbst: 1.00000001169e-07
19:36 karolherbst: mhh
19:36 karolherbst: odd
19:36 imirkin: i think that's just 10^-7
19:36 karolherbst: ohhh
19:36 karolherbst: right..
19:36 karolherbst: I am stupid :d
19:36 karolherbst: 0x358637bd us 10^-6
19:37 karolherbst: *is
19:37 imirkin: >>> hex(np.float32(1e-7).view(np.uint32))
19:37 imirkin: '0x33d6bf95'
19:37 karolherbst: 0x3c23d70a is 0.01
19:37 karolherbst: 0x3dcccccd is 0.1 :)
19:37 karolherbst: and with that we already have a lot
19:37 karolherbst: and those are not even unusual numbers
19:37 imirkin: it's almost like people use powers of 10 a lot
19:37 karolherbst: :)
19:38 karolherbst: and first number being a bit off is 0x3de147ae == 0.11
19:38 karolherbst: but not too crazy either
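The hex values quoted above are the raw IEEE-754 single-precision bit patterns of the immediates. A minimal sketch using only the standard library instead of numpy; the helper names are made up for illustration.

```python
import struct


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def float_to_bits(value):
    """Round a Python float to single precision and return its bits."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


# Bit patterns quoted in the discussion.
immediates = [0x3f800000, 0x3dcccccd, 0x3de147ae, 0x3c23d70a,
              0x3a83126f, 0x38d1b717, 0x358637bd, 0x33d6bf95]

for bits in immediates:
    print(f"0x{bits:08x} -> {bits_to_float(bits):.9g}")
```

Printing with nine significant digits shows why a value like 0x33d6bf95 looks "odd" at first: it is simply the nearest single-precision value to a round decimal.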
19:39 karolherbst: mhh
19:39 karolherbst: 0x3f371759 = 0.7152
19:39 karolherbst: does this number have any special meaning?
19:39 imirkin: that's half the golden ratio, no?
19:39 karolherbst: ahhh
19:39 karolherbst: luminance stuff
19:39 imirkin: or half of sqrt(2)
19:39 karolherbst: https://en.wikipedia.org/wiki/Relative_luminance
19:39 imirkin: nope, off on both counts
19:40 karolherbst: 0x3d93dd98 = 0.0722
19:40 karolherbst: right next to it
19:40 imirkin: sqrt(2)/2 = .707, half of golden ratio = .809, sqrt(3) / 2 = .866
19:40 karolherbst: which are the G and B thingies for that relative luminance stuff
19:41 karolherbst: don't see 0.2126
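For reference, the constants in question are the Rec. 709 relative luminance weights from the linked Wikipedia article, applied to linear RGB. A small sketch that also decodes the two bit patterns quoted above; the function names are illustrative only.

```python
import struct


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_luminance(r, g, b):
    """Relative luminance of linear RGB using the Rec. 709 weights."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


green_weight = bits_to_float(0x3f371759)
blue_weight = bits_to_float(0x3d93dd98)
print(f"G weight from shader db: {green_weight:.7f}")
print(f"B weight from shader db: {blue_weight:.7f}")
print(f"luminance of white: {relative_luminance(1.0, 1.0, 1.0):.4f}")
```

The three weights sum to 1, so the red weight 0.2126 would be expected somewhere in the same shaders even though it did not show up in this particular list.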
19:41 imirkin: i used to know this stuff back when i did math competitions... o well
19:41 karolherbst: oh well
19:42 karolherbst: anyway.. I think we can say that having a per shader table might actually not help all that much :)
19:42 karolherbst: but mhh
19:48 imirkin: yeah, i mean pick those top 50 or 100, dump them into the driver cb, done
19:48 imirkin: have a hard-coded list shared
19:48 imirkin: so that it can be generated properly
19:48 karolherbst: :)
19:48 karolherbst: yeah
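The hard-coded shared table idea could look roughly like the following. This is a hypothetical sketch, not the Mesa implementation: the function names, the 4-byte slot size, and the `base_offset` parameter (standing in for wherever the table lives inside the driver constbuf) are all assumptions. It skips 0, since that is served by the zero register, and keeps every entry 4-byte aligned.

```python
import struct


def float_bits(value):
    """Single-precision bit pattern of a float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def build_imm_table(values, base_offset=0, slot_size=4):
    """Map each distinct non-zero immediate to an aligned cb offset."""
    table = {}
    for value in values:
        bits = float_bits(value)
        if bits == 0 or bits in table:
            continue  # zero needs no slot; duplicates share one slot
        table[bits] = base_offset + len(table) * slot_size
    return table


def pack_table(table):
    """Bytes to upload once at startup; dict order is offset order."""
    return b"".join(struct.pack("<I", bits) for bits in table)


# A few of the popular immediates from the shader-db scan.
table = build_imm_table([1.0, 0.0, 0.1, 1.0, 0.001], base_offset=0x100)
for bits, offset in table.items():
    print(f"0x{bits:08x} -> cb offset 0x{offset:x}")
```

Because the list is fixed, the compiler side and the upload side can be generated from the same source, and a lookup miss simply leaves the immediate as it was.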
19:59 drathir: hi guys wonder of any ideas about full freeze and theoretically "nouveau 0000:00:0d.0: bus: MMIO write of 015d0001 FAULT at" connected with that ?
20:00 karolherbst: drathir: what GPU?
20:00 drathir: theoretically, bc not sure if exactly that dmesg throw on crash that last what i see in dmesg about gpu logs...
20:00 drathir: karolherbst: its 00:0d.0 VGA compatible controller: NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2)
20:02 karolherbst: huh weird
20:03 karolherbst: yeah dunno
20:03 karolherbst: drathir: mind posting a full dmesg?
20:03 RSpliet: drathir: 16 year old GPUs are hard to come by. Sadly, nouveau hasn't been tested well recently with such a piece of ancient history. Debugging your problem might prove more tricky than one might hope, just be prepared :-)
20:05 RSpliet: This one's integrated with the southbridge as well I think, so possibly just shared RAM, no dedicated VRAM
20:06 drathir: karolherbst: the funniest part is it mostly works w/o problems under the system and even firefox, but when you for example try to fire up steam there is a 90% chance it freezes, and in the 10% where it starts normally it's just a matter of time until it freezes during usage as well...
20:07 karolherbst: mhhh
20:07 karolherbst: might be chromium related and multithreading...
20:07 karolherbst: mhhh
20:09 drathir: karolherbst: in dmesg there isnt anything other than https://ncry.pt/p/VtPn#mhcFlhTcwbwi6sgAhsdqh_K-by1KbZD2IGpwlskYeHQ
20:09 karolherbst: drathir: it might be that just X is dead
20:10 karolherbst: drathir: mind confirming that if you ssh into your system and kill steam it unfreezes?
20:11 drathir: RSpliet: yea its old-ish i know thats why wondered if that isnt anything from "unable to fix" ones to be honest, bc at such longer time most of the issues probably get noticed catched and fixed already...
20:12 drathir: karolherbst: it kills everything including kb and eth card, not even ping goes through...
20:13 drathir: karolherbst: i need play further and confirm if sysrq freeze as well...
20:14 karolherbst: ohh
20:14 karolherbst: yeah.. then it's probably something more serious
20:14 karolherbst: it might make sense to setup netconsole
20:14 karolherbst: usually on a hard reset you don't get the newest log entries
20:15 karolherbst: so dmesg might be entirely useless at this point
20:20 drathir: karolherbst: yea if just x crashed it would be possible to dig logs remotely, but it's an instant freeze and when that happens it instantly kills the eth card as well, and on freeze the screen shows a kind of regular horizontal distorted waves...
20:21 karolherbst: drathir: netconsole still prints stuff
20:21 karolherbst: usually
20:21 karolherbst: just needs to be setup correctly.. which is always a bit painful
20:21 karolherbst: but then the kernel pushes out udp packages before crashing for real
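On the receiving machine, anything that listens for UDP datagrams can collect the netconsole output. A minimal sketch of such a listener; port 6666 is the documented default target port for netconsole, while the helper names and the bind address are assumptions for illustration. The sending side still has to be configured through the kernel's `netconsole=` parameter as described in the kernel documentation.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket that netconsole packets can be sent to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_messages(sock, count, bufsize=4096):
    """Block until `count` datagrams arrive and return them as text."""
    messages = []
    for _ in range(count):
        data, _addr = sock.recvfrom(bufsize)
        messages.append(data.decode("utf-8", errors="replace"))
    return messages


if __name__ == "__main__":
    listener = open_listener()
    while True:
        # Print each kernel log line as it arrives.
        print(receive_messages(listener, 1)[0], end="")
```

Writing the received text to a file on the second machine keeps the last messages before the freeze, which a local dmesg would lose on a hard reset.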
20:22 drathir: karolherbst: to be honest have no sure if that isnt just hw dying bc of age and not needed migrate in near future...
20:22 karolherbst: it's still probably a bug though.. we are aware of plenty, just a matter of which one you are hitting here
20:44 drathir: karolherbst: will try that as well if able to harvest some more logs for sure will share here...
21:00 imirkin: i did fix the steam client starting up on the nv30 driver at some point s.t. it should work on the nv4x gen
21:00 imirkin: but that says nothing about the games
21:00 imirkin: those mmio faults at 00b0?0 aren't anything to worry about
21:01 imirkin: it does suggest something may be trying to use vdpau
21:01 imirkin: which won't work
21:01 HdkR: if (nvidia) useVDPAU = true
21:17 imirkin: drathir: you could try removing libvdpau_nouveau.so and see if that improves things
21:17 imirkin: but it's just a random guess
21:17 imirkin: could be instead that the game is trying to do something which doesn't work properly on nv30 driver
22:57 karolherbst: imm -> cb with only 1.0 :)
22:57 karolherbst: total instructions in shared programs : 10240501 -> 10230234 (-0.10%)
22:57 karolherbst: total gprs used in shared programs : 1125555 -> 1124019 (-0.14%)
22:57 karolherbst: still need to wire up uploading the data though
22:58 karolherbst: imirkin: do we have some place where we initially upload the driver constbuf with static data or would that be the first time actually?
22:58 karolherbst: or should we do that always when invalidating the driver cb?
23:00 karolherbst: ohhh.. driver constbuf is part of uniform_bo...
23:00 karolherbst: totally forgot about this
23:11 imirkin: yeah, i mean we already upload stuff in there
23:11 imirkin: just do it on startup
23:11 karolherbst: yeah.. already saw it
23:11 karolherbst: but mhh
23:11 karolherbst: something I do wrong
23:12 karolherbst: imirkin: https://github.com/karolherbst/mesa/commit/8c766fd124d201a43b77f156c7840c6ddf6f7816
23:12 karolherbst: change in screen.c
23:13 imirkin: oh, annoying, it's per-stage
23:13 karolherbst: yeah..
23:13 karolherbst: but..
23:13 karolherbst: we bind the full range anyway
23:13 karolherbst: and it's just uploaded once
23:13 karolherbst: soo.. meh
23:14 karolherbst: heh...
23:14 karolherbst: my pass is wrong :D
23:14 karolherbst: mad ftz f32 $r27 $r28 $r27 c15[0x0]
23:14 karolherbst: nvm then
23:14 imirkin: yeah, that won't quite work
23:14 karolherbst: ohhh :/
23:14 karolherbst: yeah.. I removed too much code
23:15 karolherbst: "offset += prog->driver->io.immCBOffset;" :)
23:16 karolherbst: now it works :)
23:16 karolherbst: fun.. now if I mess it up, rendering in pixmark_piano is all black
23:39 karolherbst: https://github.com/karolherbst/mesa/commit/ee35bee0c4ed55c5cce5615297847e3fa29d136e :)
23:39 karolherbst: I think I will put a few more imms into it
23:39 karolherbst: fincs: ^^ I think you will like this approach more
23:40 karolherbst: and I should remove NVC0_CB_AUX_IMMEDIATE_SIZE ...
23:42 karolherbst: ehh.. I just move the imms to 0x9000
23:42 karolherbst: why even bother filling the gap
23:42 fincs: :), however - I see you don't dynamically grow the table?
23:42 karolherbst: yeah
23:42 karolherbst: fincs: I did some stats: https://gist.githubusercontent.com/karolherbst/f7b90629461392f3e5558204c0eae841/raw/2728dbf4ca813c11bb98cdca7b5a4c32e38cb506/gistfile1.txt
23:42 karolherbst: so I thought... that's good enough
23:42 karolherbst: we can always add dynamic tables later... but...
23:42 fincs: Heh, I only intended to have a dynamic table
23:42 karolherbst: well.. I don't think it's really required unless you want the last 1% perf in certain applications
23:43 karolherbst: yeah...
23:43 karolherbst: but...
23:43 karolherbst: having a static table is nice
23:43 karolherbst: the driver constbuf is soo empty still
23:43 karolherbst: and it's already fully allocated anyway
23:43 karolherbst: so, no drawbacks :)
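The static-table idea discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the linked commits: the table contents, the names `imm_table`, `IMM_TABLE_BASE` and `imm_to_cb_offset` are hypothetical, and 0x9000 is taken from the log as an example byte offset into the driver constbuf. The lookup returns a constbuf offset for a known 32-bit immediate so the compiler could emit a `c15[...]` operand instead of an inline immediate, and the final addition of the base mirrors the `offset += ...immCBOffset` fix mentioned earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed byte offset of the immediate table inside the driver constbuf.
 * The value comes from the log ("move the imms to 0x9000"); its exact
 * meaning in the real driver is an assumption here. */
#define IMM_TABLE_BASE 0x9000u

/* Hypothetical table of common immediates, uploaded once at startup. */
static const float imm_table[] = { 0.0f, 1.0f, -1.0f, 0.5f, 2.0f };
#define IMM_TABLE_COUNT (sizeof(imm_table) / sizeof(imm_table[0]))

/* Map the raw bits of a 32-bit immediate to a byte offset in the driver
 * constbuf. Returns -1 if the value is not in the static table, in which
 * case the caller keeps the inline immediate. Bit patterns are compared,
 * so 0.0 and -0.0 are treated as different values. */
static int32_t imm_to_cb_offset(uint32_t bits)
{
    for (size_t i = 0; i < IMM_TABLE_COUNT; i++) {
        uint32_t entry;
        memcpy(&entry, &imm_table[i], sizeof(entry));
        if (entry == bits) {
            /* Table-relative offset plus the table's base in the constbuf. */
            return (int32_t)(IMM_TABLE_BASE + i * sizeof(uint32_t));
        }
    }
    return -1;
}
```

Forgetting to add the base would make every operand read from the start of the constbuf instead of the table, which is consistent with the wrong `c15[0x0]` operand seen earlier in the log.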
23:43 fincs: As in, my system only really lets the compiler generate a single helper constbuf that is part of the shader
23:43 fincs: And I'd need to make every single shader get a copy of this static table of common immediates
23:44 karolherbst: well, you still have a driver cb, no?
23:44 fincs: I'd rather just collect all the immediates used by the shader and stash them in its generated constbuf
23:44 fincs: Yeah I have a driver cb but my lib is completely compiler agnostic
23:45 karolherbst: well.. I could also let the driver hand over a table to the compiler...
23:45 karolherbst: but that would be more work
23:45 karolherbst: the table is more like an ABI
23:46 fincs: Either way I think you implemented it in a flexible enough way that will let me tweak it later on :)
23:47 karolherbst: I add more imms :D
23:47 karolherbst: I still should decide on a max size
23:47 fincs: 0x10000 - current size of driver constbuf :D
23:47 karolherbst: ehhh
23:48 karolherbst: uhm.. why can't I use STATIC_ASSERT :/
23:48 imirkin: has to be inside a function iirc
23:48 karolherbst: :/
23:49 karolherbst: at least this table isn't even 256 bytes big
23:49 karolherbst: and I thought 4k would be a good limit
23:49 karolherbst: sooo much space
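For the size-limit check discussed above, one option is the C11 `_Static_assert` keyword, which (unlike a statement-style assertion macro) is allowed at file scope. The sketch below assumes a C11 compiler; whether the Mesa build permitted that at the time is not stated in the log. `IMM_TABLE_MAX_SIZE`, `imm_table` and `imm_table_size` are hypothetical names, and the 4k limit is the figure floated in the conversation.

```c
#include <stddef.h>

/* Assumed upper bound for the immediate table (the "4k" from the log). */
#define IMM_TABLE_MAX_SIZE 0x1000u

/* Hypothetical table contents, for illustration only. */
static const float imm_table[] = { 0.0f, 1.0f, -1.0f, 0.5f, 2.0f };

/* C11 file-scope assertion: compilation fails if the table outgrows its
 * reserved slot, with no need to wrap the check in a function. */
_Static_assert(sizeof(imm_table) <= IMM_TABLE_MAX_SIZE,
               "immediate table exceeds its reserved 4k slot");

/* Small helper so the size can also be inspected at run time. */
static size_t imm_table_size(void)
{
    return sizeof(imm_table);
}
```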
23:49 karolherbst: fincs: the main idea was more to get rid of shaders having like 1 dynamic imm
23:50 karolherbst: or as many as possible
23:50 karolherbst: would be annoying to always rebind the table if ... the benefit is super small
23:50 fincs: Hmm, maybe it needs some tuning
23:50 karolherbst: yeah
23:50 karolherbst: that's totally biased towards our shader-db as well
23:50 karolherbst: and maybe some constants are way more common
23:50 fincs: I'm open to just bundling the table of handy constants tbh
23:50 karolherbst: doesn't matter though
23:51 fincs: Also (tangentially related), some time ago I attempted to optimize the loading of ssbo pointers from the driver constbuf, currently nouveau generates LDC.64 which is undesirable (ideally it should use c[] directly in MOV/ADD instructions)
23:51 karolherbst: my idea was: we allocate this space anyway, and right now it's not used for anything
23:52 karolherbst: fincs: yeah.. would be nice to work on that
23:52 fincs: However it looks like the compiler isn't able to see past that and emits MOV/MOV and keeps wasting gprs :\
23:52 karolherbst: stuff like that would still be in codegen even when I would move as much as I think is sane into nir :p
23:52 karolherbst: so always a good idea to work on that :D
23:52 fincs: Basically I tried 2x bld.mkLoadv + 1x mkOp2v OP_MERGE
23:57 fincs: I guess this is more of a general problem with 64-bit math
23:58 karolherbst: probably, yes
23:58 karolherbst: with TGSI 64 bit stuff sucks a little
23:58 karolherbst: it's better with nir, which has native support for 64 bit types
23:58 karolherbst: well.. it's more of a translation issue in the end
23:58 fincs: I think it goes deeper though, because these 64-bit loads/adds are generated within the lowering passes
23:59 karolherbst: mhhh
23:59 karolherbst: right.. memory opt might merge those
23:59 karolherbst: ahh right
23:59 karolherbst: loadpropagation is before memoryopt
23:59 karolherbst: so.. you are out of luck
23:59 fincs: As in, the code which checks if c[] can be used directly as an operand just doesn't know how to deal with 64-bit stuff