02:44rhyskidd: on pascal+ does nouveau interpret directly the devinit script within a vbios, doing rw on the mmio registers? --- or does it pass the script to PMU to run on our behalf?
03:23tertl3: do you guys work with Vulkan?
03:25HdkR: Nouveau doesn't support Vulkan
03:27gnarface: could it, in theory?
03:27gnarface: like, could it be made compatible with dxvk, in theory?
03:30HdkR: Writing a Vulkan driver for Nvidia in Mesa could reuse a large amount of code from the Nouveau driver. Anything that is Vulkan compliant + implements the supported extensions could theoretically work with dxvk
03:31HdkR: If you are also requesting a "Why?" as to there not being a Nouveau based Vulkan driver yet. It's a time consuming task and someone hasn't taken that burden upon themselves yet.
03:39rhyskidd: gnarface: there's also some necessary features that need to be added to the kernel-space nouveau driver
04:15gnarface: noted, thanks
04:16gnarface: i was just curious if it could be done
04:16gnarface: it seems like it could greatly enhance the wine experience
04:17gnarface: i'm curious if it would close the performance gap with the proprietary driver
04:18HdkR: Doesn't it mainly lower CPU overheads?
04:19gnarface: i don't think that's all it does
04:19gnarface: for WoW it seemed to release a threading bottleneck in the GPU
04:20gnarface: it only slightly increased CPU overhead
04:20HdkR: ah. I see
04:21gnarface: (wine is still massively under-utilizing the CPU, but finally using the GPU 100%)
04:23HdkR: How much does performance change between the GL and dxvk things on the proprietary driver?
05:37gnarface: HdkR: GPU usage goes from about 40% to 100%, framerate roughly doubles to triples, depending on the size of the room
05:37gnarface: HdkR: (CPU usage goes from about 150% to 160% on a 4-core machine)
05:38gnarface: i think that suggests there's still a CPU bottleneck unrelated to the graphics pipeline
05:38gnarface: but it's probably a wine problem
05:39gnarface: but it's the difference between it being playable or not in heavy combat
05:39gnarface: dxvk that is
05:41gnarface: i think in various places there's conflicts between the linux threading implementation and the way windows binaries expect it to behave, but i don't understand it more than that
05:45HdkR: gnarface: With the proprietary blob?
05:45HdkR: just making sure
07:01gnarface: HdkR: yea
07:15HdkR: gnarface: Interesting. __GL_THREADED_OPTIMIZATIONS=1 as an environment variable doesn't help the GL side at all?
07:21gnarface: HdkR: no
07:21gnarface: it might just be a problem with blizzard games
07:21gnarface: in wine
07:22gnarface: but whatever it is, vulkan seems to help a lot
07:22HdkR: Sounds like some bad CPU side bottleneck in their GL backend that was worked around in the vulkan side
07:29gnarface: to be clear, when i say 160% i mean 1.6 of 4 cores
07:29gnarface: literally one of them at ~100%, another at ~60% and the rest unused
07:29HdkR: Sure, I bet a CPU core is spinning at 100%
07:29HdkR: Probably the rendering side worker thread
07:30HdkR: This can either be due to just a busy wait, waiting for the GPU to complete or actually spamming a bunch of work. Needs profiling to know for sure
07:32HdkR: Since CPU time didn't change significantly but GPU load did then it is probably a busy wait which makes it hard to see CPU bottlenecks
10:52karolherbst: wine + vulkan is brutally fast compared to the OpenGL path
10:53diogenes_: karolherbst, with nouveau?
10:53diogenes_: does nouveau play with dxvk?
10:53karolherbst: there is no vulkan for nouveau yet
10:54diogenes_: is it planned?
10:54diogenes_: yay cool
10:54karolherbst: there is already some code, but nothing is public yet
10:54karolherbst: and it isn't really useful in the first place
10:54karolherbst: but... some work was done
10:55diogenes_: why isn't it useful?
10:55taniwha: is writing vulkan driver code as tedious as writing vulkan app code?
10:55karolherbst: because it isn't stable, doesn't implement a lot and has bugs ;)
10:55taniwha: I'm going through the vulkan tut. *rolls eyes*
10:55karolherbst: taniwha: doubtful, but we need some kernel API changes and skeggsb needs time to do that. Currently busy with other things
10:56diogenes_: karolherbst, that is a secondary issue, the main thing is to at least have something in that respect
10:56karolherbst: sure, but everybody could come at least up with something
10:56karolherbst: but that won't be useful
10:57karolherbst: we really need those kernel API changes first and build the vulkan thing around it
10:58taniwha: what about a kernel API emulator so dev can get started?
11:01karolherbst: how does that help if the API itself is the thing important here?
11:03karolherbst: also you can't emulate the new things, because nouveau isn't able to do it
11:04taniwha: it's for unit tests, not hw tests
11:10Sarayan: What changes are needed in the kernel api?
11:10Sarayan: if it's not too long to explain them :-)
11:11Sarayan: btw, adding the switch it off and on again patch changes nothing
14:54imirkin_: pendingchaos: you have push access yet?
14:58imirkin: pendingchaos: btw, those pushbuf-vs-copy engine tests need to be done on a reclocked gpu
15:42pendingchaos: imirkin: I have push access
15:42pendingchaos: I don't think I can reclock my Pascal gpu with nouveau?
15:47pendingchaos: the result of the OP_AND should be used with a OP_SHL, which doesn't seem to support NV50_IR_MOD_NEG
15:47pendingchaos: (about the bindless multisampled images patch)
15:47pendingchaos: I don't see how OP_CVT + OP_SET is better than OP_AND + OP_SET?
16:43imirkin: pendingchaos: you can't.
16:44imirkin: pendingchaos: it's better if the predicate could get used directly, which it can't, which was why i said it wasn't better :)
16:44imirkin: (whereas with the OP_AND, the predicate would be hidden away)
16:44imirkin: i'd still use OP_NEG though over the AND
16:45imirkin: re the reclocking, i'm just concerned that we're using e.g. a fifo engine operating at full clocks and comparing it to a copy engine that's underclocked. for example.
16:45imirkin: [if e.g. the fifo engine only has one clock speed it can run at]
16:46imirkin: that said, iirc you wanted to bump it to 1024 bytes, which is 256 dwords, which imo is totally fine.
16:46imirkin: i'd just be more careful with the 64kb bits :)
16:51pendingchaos: OP_NEG currently creates an OP_CVT, which requires a barrier and IIRC took 14 or something cycles
16:52pendingchaos: I have a patch to make a iadd/fadd be emitted instead for those things though
16:59pendingchaos: *creates an f2f/i2i/etc
17:08imirkin: huh. ok.
17:08imirkin: ok, r-b without that change then.
21:51karolherbst: yeah... those i2i and f2f instructions are stupid
21:51karolherbst: basically nvidia never uses them
21:53karolherbst: pendingchaos: if you want, you could also take care of adding iadd3 support. And with that we are able to remove more of those OP_NEG and OP_ABS ones
21:53karolherbst: top two commits: https://github.com/karolherbst/mesa/commits/gm107_iadd3_v3
21:53karolherbst: for the basic ops
21:54karolherbst: what iadd3 can do compared to iadd is having two modifiers
21:54karolherbst: so you could convert add(neg(a), neg(b)) into iadd3(0, neg(a), neg(b)) effectively reducing binary size by one + eliminating i2i
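[editor's note] The rewrite described above can be checked arithmetically with a small model. This is only a sketch: `iadd3`, `add` and `neg` here are hypothetical helper functions modelling wrapping 32-bit integer math, not the nouveau IR or the actual hardware encoding, and the assumption that the second and third sources are the ones that take a negate modifier is made purely for illustration.

```cpp
#include <cstdint>

// Hypothetical model of a three-input integer add with per-source negate
// modifiers on the 2nd and 3rd sources (an assumption for illustration).
// All arithmetic wraps modulo 2^32, as unsigned math does in C++.
constexpr uint32_t iadd3(uint32_t a, uint32_t b, bool negB,
                         uint32_t c, bool negC) {
    return a + (negB ? 0u - b : b) + (negC ? 0u - c : c);
}

// Stand-alone negate: models the separate instruction (an i2i with neg)
// that the rewrite is meant to eliminate.
constexpr uint32_t neg(uint32_t x) { return 0u - x; }

// Plain two-input add with no modifiers.
constexpr uint32_t add(uint32_t a, uint32_t b) { return a + b; }

// add(neg(a), neg(b)) == iadd3(0, neg(a), neg(b))
static_assert(add(neg(5), neg(7)) == iadd3(0, 5, true, 7, true), "");
// add(neg(a), b) == iadd3(0, b, neg(a))
static_assert(add(neg(5), 7) == iadd3(0, 7, false, 5, true), "");
```

In this model the two negates plus the add collapse into a single call, which matches the point that folding both modifiers removes the separate negate instructions.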
21:55pendingchaos: I added a opt for iadd3 built on those two commits with some fixes, but it didn't seem to do much with the shader-db numbers
21:55pendingchaos: the i2i thing might make it more significant
21:55karolherbst: the add(add(a, b), c) ->iadd3(a, b, c) ain't that common
21:56karolherbst: but the folding in two modifiers is
21:56karolherbst: well, more common at least
21:56karolherbst: and should have a perf impact as well
21:56karolherbst: just never got around to do it I think? let me check
21:57karolherbst: pendingchaos: did you base it on top of my versions though? I had to fix a few things I think
21:57pendingchaos: where is your version? I think I also fixed a few things in the two commits for OP_ADD3
21:57pendingchaos: I don't think I did
21:59karolherbst: mhh https://github.com/karolherbst/mesa/commits/nvir_opt_shladd_const
21:59karolherbst: there are my opts based on those
21:59karolherbst: // ADD(NEG(a), b) -> ADD3(0, b, -a)
21:59karolherbst: just a bit painful to actually do that without hurting stuff
22:01karolherbst: pendingchaos: I think for OP_ABS, OP_NEG we should just change the emitter
22:01karolherbst: actually two ts
22:01karolherbst: pendingchaos: what do we allow on those? We could just emit IADD code for those
22:02karolherbst: no reason to bother the IR with that stuff
22:02pendingchaos: I think you're allowed to do type conversions with OP_ABS, OP_NEG and OP_SAT
22:02karolherbst: sure, but then we emit i2f and f2i
22:02karolherbst: the code is all there in the emitter
22:02karolherbst: but just instead of doing i2i and f2f we use iadd, fadd
22:03karolherbst: CodeEmitterGM107::emitI2I and CodeEmitterGM107::emitF2F
22:03karolherbst: sadly those also allow 64 -> 32 and 32 -> 64 conversions and everything
22:04karolherbst: so, mhh maybe only if the dtype == stype?
22:04pendingchaos: I think OP_ABS would need an i2i
22:04karolherbst: and no subop?
22:04karolherbst: pendingchaos: why?
22:04pendingchaos: I'm just looking at an old patch I made
22:04pendingchaos: and I'm not sure if iadd has an abs modifier or anything
22:05karolherbst: it has
22:05karolherbst: neg and abs
22:05karolherbst: I think...
22:06karolherbst: ohh crap
22:06pendingchaos: not in CodeEmitterGM107 or gm107.c
22:06karolherbst: yeah... I saw
22:06karolherbst: but the target allows it
22:06pendingchaos: that's probably wrong
22:06karolherbst: except... there is a special exception
22:06karolherbst: like the table also states add can take two modifiers
22:07karolherbst: yeah, hardcoded
22:07pendingchaos: ah, because FADD has them but IADD doesn't
22:07karolherbst: but.. OP_SUB is allowed to have an abs?
22:09karolherbst: I guess something makes it in a way that there is never an isub
22:10karolherbst: oh well
22:10pendingchaos: seems it's done in ModifierFolding
22:10karolherbst: we can check what nvidia does for abs then
22:10karolherbst: I see
22:12karolherbst: anyway, would be interesting to see how feral games are affected by using fadd/iadd instead of i2i and f2f
22:13karolherbst: I think it would be fine to just adjust the emitter here
22:13karolherbst: because that's kind of target specific
22:14karolherbst: and the IR shouldn't care if we actually emit an i2i with neg or an iadd with neg
22:14karolherbst: pendingchaos: cvt needs a barrier, right?
22:14pendingchaos: I think so
22:15karolherbst: because it doesn't have a fixed runtime
22:15imirkin: note that i2i has no 32 <-> 64 functionality
22:15imirkin: f2f obviously does
22:15karolherbst: i2i supports 16 <-> 32 though
22:15karolherbst: (which f2f also supports)
22:16karolherbst: and saturated conversion
22:16karolherbst: all in one instruction :)
22:16karolherbst: i2i.u16.s32.sat neg a should be a valid instruction
22:20karolherbst: pendingchaos: I don't see it inside isBarrierRequired though
22:20karolherbst: ohh, no, there it is
22:20pendingchaos: it should be under OPCLASS_CONVERT
22:21karolherbst: if ((insn->op == OP_MUL || insn->op == OP_MAD) && !isFloatType(insn->dType)) return true; return false;
22:21karolherbst: ohh, and that
22:21karolherbst: yeah, i2p and p2i don't need barriers
22:21karolherbst: pendingchaos: but, OP_NEG and OP_ABS are OPCLASS_ARITH, or not?
22:22pendingchaos: I think they were OPCLASS_CONVERT?
22:22karolherbst: yeah... I just checked
22:22karolherbst: indeed they are
22:22pendingchaos: probably because they emit conversion instructions
22:22karolherbst: no idea how to deal with it correctly
22:23karolherbst: we want one piece of code which decides if we have an OP_NEG or OP_ABS which is actually an arithmetic instruction and not a conversion
22:23karolherbst: like if no modifiers, no subops, same type size, same base type, no saturation -> arithmetic
22:23karolherbst: and then we can use iadd/fadd
22:23karolherbst: in the emitter
22:24karolherbst: and have isBarrierRequired return false on those
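[editor's note] The criteria listed above (no modifiers, no subops, same type size, same base type, no saturation) could be expressed as a single predicate. The following is a minimal sketch only: `ConvInsn`, its fields, and the enums are hypothetical stand-ins invented for this example, not the real nouveau `Instruction` class, and the name `isTrivialVariant` is just borrowed from the discussion.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins for the IR types.
enum class Op { Neg, Abs, Cvt };
enum class BaseType { Int, Float };

struct ConvInsn {
    Op op;
    BaseType srcBase, dstBase;  // integer vs float on each side
    unsigned srcBits, dstBits;  // type sizes in bits
    bool srcHasModifier;        // an extra neg/abs folded onto the source
    bool hasSubOp;              // e.g. a rounding mode
    bool saturate;
};

// True when a NEG/ABS does no real conversion work, so (per the idea in the
// discussion) it could be emitted as an add and skip the barrier.
constexpr bool isTrivialVariant(const ConvInsn &i) {
    return (i.op == Op::Neg || i.op == Op::Abs) &&
           !i.srcHasModifier && !i.hasSubOp && !i.saturate &&
           i.srcBase == i.dstBase && i.srcBits == i.dstBits;
}

// A plain 32-bit integer negate qualifies.
static_assert(isTrivialVariant({Op::Neg, BaseType::Int, BaseType::Int,
                                32, 32, false, false, false}), "");
// A saturating, narrowing negate is a real conversion and does not.
static_assert(!isTrivialVariant({Op::Neg, BaseType::Int, BaseType::Int,
                                 32, 16, false, false, true}), "");
```

Whether a qualifying instruction can actually be emitted as an add still depends on which modifiers the target's add accepts, a question the log continues to discuss.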
22:24karolherbst: target->needsConversion(Instruction *i)?
22:25karolherbst: dunno, maybe we could have a more generic name instead
22:25karolherbst: like target->isTrivialVariant()
22:27karolherbst: mhh, maybe actually that isn't such a bad idea
22:28karolherbst: we could have target->isTrivialVariant, target->hasTrivialVariant and target->convertToTrivialVariant and have an optimization which could trivialize such patterns (like if you can't directly emit something else instead and have to split it, but the split instructions are still cheaper than the complex original)
22:28karolherbst: or something like that
22:29karolherbst: pendingchaos: uhm... OP_ABS is also trivial in the end...
22:29karolherbst: or well. mhh
22:29karolherbst: let me think, how can this be done good enough
22:36karolherbst: best I can think of are three instructions
22:36karolherbst: shift, add, xor
22:40pendingchaos: what for (shift, add and xor)?
22:54karolherbst: integer abs
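[editor's note] One well-known way to get an integer abs out of exactly a shift, an add and an xor is the sign-mask trick sketched below. The log does not spell out the sequence, so this is an assumption about what was meant; `abs_shift_add_xor` is a name made up for the example. It relies on `>>` being an arithmetic shift for negative values, which C++20 guarantees and mainstream compilers have long provided.

```cpp
#include <cstdint>

// Branch-free 32-bit integer abs using shift, add, xor.
// The result is returned as unsigned so that INT32_MIN maps to 0x80000000
// without signed overflow.
constexpr uint32_t abs_shift_add_xor(int32_t x) {
    // shift: all ones if x is negative, all zeros otherwise
    // (assumes arithmetic right shift of negative values)
    const uint32_t mask = static_cast<uint32_t>(x >> 31);
    // add: subtracts 1 when x is negative (mask == 0xFFFFFFFF), wrapping
    const uint32_t sum = static_cast<uint32_t>(x) + mask;
    // xor: flips every bit when x is negative, completing -x == ~(x - 1)
    return sum ^ mask;
}

static_assert(abs_shift_add_xor(-5) == 5u, "");
static_assert(abs_shift_add_xor(7) == 7u, "");
static_assert(abs_shift_add_xor(INT32_MIN) == 0x80000000u, "");
```

The three operations depend on each other in sequence, so they cannot overlap.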
22:57pendingchaos: not sure if that would be faster than an i2i, though apparently it has pretty bad throughput: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions__throughput-native-arithmetic-instructions
23:00HdkR: Huh. I didn't realize that table existed publicly
23:04karolherbst: pendingchaos: yeah
23:04karolherbst: pendingchaos: but looks like shift, add, xor is equally bad