01:53tomreyn: hi there. https://cgit.freedesktop.org/wiki/nouveau/log/index.mdwn was last updated >1y ago and this channel's topic was last updated around the same time. one might get the impression that nouveau development has come to a standstill. is this a drastically wrong impression?
01:54imirkin_: trying to come up with an accurate way to answer that...
01:55imirkin_: which perhaps is an answer in and of itself?
01:56imirkin_: you could look at https://cgit.freedesktop.org/mesa/mesa/log/src/gallium/drivers/nouveau and https://github.com/skeggsb/nouveau/commits/master for indications of development that was recently done
01:56imirkin_: but that doesn't correlate to changing channel topics or index pages
01:58tomreyn: thanks for your answer and these links. so things aren't in as bad a state as i had assumed, that's nice to know.
01:58imirkin_: well, just because there's development doesn't mean things are in a good state :)
01:59imirkin_: in fact, the better the state, the less development :)
01:59tomreyn: you know of any project where that is so? :)
01:59tomreyn: but i get your point
02:00imirkin_: if you're looking to buy a well-supported gpu, stick to intel or amd
02:00tomreyn: luckily i did, but i like to support folks in #ubuntu, and there are so many lost souls who bought into the wrong company
02:02tomreyn: there's actually not any day where you don't run into frustrated nv users there.
02:02imirkin_: ah yeah, it's unlikely to play well with ubuntu
02:02imirkin_: or ... modern desktops in general
02:02imirkin_: at some point people decided it was a great idea to rely on GL for everything
02:03tomreyn: i don't like desktops with GL compositors either.
02:03imirkin_: it goes further than that
02:03imirkin_: now GTK / Qt will use it to draw all widgets
02:04tomreyn: meh. i guess it's about time i finally 'downgrade' to a tiling window manager ;)
02:04tomreyn: doesn't fix it, but reduces the impact.
02:05imirkin_: fwiw, i use WindowMaker
02:05imirkin_: been happy with it since the late 90's.
02:05imirkin_: switched to it when AfterStep changed everything about how it worked at some point
02:05tomreyn: i always thought it was a spelling error, should have been Widow Maker
02:06tomreyn: well, i guess i'll take a look around. thanks for the chat :)
02:07pabs3:is a happy user of nouveau, modulo the vblank issue that is already fixed
02:07pabs3: is nVidia getting more or less friendly towards nouveau?
02:07imirkin_: [wow, have i been using WM for 20 years? time flies...]
02:08imirkin_: pabs3: in the open, more friendly, but practically less friendly
02:08imirkin_: they recognize that nouveau exists
02:09imirkin_: their hardware, however, now requires cryptographically signed firmware to be operated properly
02:09imirkin_: and they only release a small fraction of that
02:09imirkin_: and even that, usually 1+ year after the hw is released
02:09pabs3: does that apply to the ARM stuff too?
02:09imirkin_: i wrote a scanner to find various fw, but haven't really gone further
02:09imirkin_: tegra? less so.
02:14tomreyn: i suspect there is newer firmware available than https://packages.ubuntu.com/cosmic/nouveau-firmware (version nouveau-firmware-20091212), though?
02:15tomreyn: this is sourced from https://people.freedesktop.org/~pq/nouveau-drm/
02:15imirkin_: dunno what that is, it's very old, from the "before time"
02:16imirkin_: the firmware is actually in linux-firmware
02:16imirkin_: i've also written a couple extractors to get it out of blob
02:16tomreyn: oh right, that's a lot newer
02:20tomreyn: nice. i guess redistributing these extracted blobs could be harmful to self, though.
02:21imirkin_: generally they're shipped as a script to download stuff and extract locally
02:21HdkR: tfw zero Volta support
02:21imirkin_: HdkR: actually my extractor can get the volta fw out :)
02:21imirkin_: the "scanner" tool
02:22imirkin_: or rather, "some" fw out
02:22HdkR: Well that's good at least. Now for all the rest
02:22pabs3: is the location of these blobs in the proprietary drivers standard? i.e. how often does the script break for new versions?
02:22imirkin_: pabs3: well the scanner is just a prospective tool
02:23imirkin_: so it's not locked to a particular version
02:23imirkin_: extract_firmware is a lot more fragile
02:23imirkin_: but as long as it works with one blob version, can just keep using that
02:23imirkin_: the actual fw rarely changes
02:23imirkin_: and the whole idea of it is to get around having to ship the actual images
02:23imirkin_: but not making people have to mmiotrace blob drivers
02:24imirkin_: these were the old instructions: https://nouveau.freedesktop.org/wiki/NVC0_Firmware/
02:25imirkin_:was pretty proud of that perl snippet, it replaced a long and error-prone python program
02:25pabs3: ack, seems like a good idea. ISTR other drivers have something similar, even one where you have to have a whole proprietary operating system to extract the firmware from
03:25Subv: does the nvc0 codegen at any point emit an SSY instruction?
03:29Subv: i don't understand why that instruction is required, does every sequence of non-uniform control flow need a JOINAT?
03:31karolherbst: so uhh, we get vulkan only games now apparently
03:31imirkin_: Subv: yes.
03:32imirkin_: Subv: well, technically no.
03:32imirkin_: it's just a really really good idea
03:32imirkin_: don't forget that this is all secretly a giant SIMD execution engine
03:32karolherbst: "Rise of the Tomb Raider" is vulkan only for example
03:33imirkin_: so ... when one lane wants to do one thing and another lane wants to do another thing, that's not SIMD-friendly
03:33HdkR: karolherbst: What's this about a nouveau based Vulkan backend? :)
03:33imirkin_: SSY / SYNC help the hardware know when all lanes should start executing together again
03:33HdkR: (Really it is needed to let the hardware know when to execute together again)
03:34imirkin_: isn't that what i said?
03:34Subv: for some reason i was under the impression that the hardware did this automatically
03:34HdkR: help implies it isn't necessary :P
03:34imirkin_: HdkR: it's not
03:34imirkin_: you can just diverge, and it will all execute
03:34imirkin_: one lane at a time :)
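The cost of skipping the reconvergence point can be sketched with a toy model. This is only an illustration of the idea discussed above, not a description of real hardware scheduling: `BranchShape` and both functions are hypothetical names, and the "one lane at a time" behaviour is modelled as literally as possible.

```cpp
#include <cstddef>
#include <vector>

// Toy cost model (not real hardware behaviour): counts instruction issues
// for a warp running "if (cond) A else B; tail".
struct BranchShape {
    std::size_t then_len;  // instructions on the taken side
    std::size_t else_len;  // instructions on the not-taken side
    std::size_t tail_len;  // instructions after the branch
};

// With a reconvergence point (the SSY/SYNC idea): each side is issued once
// under a lane mask, then all lanes execute the tail together.
std::size_t issues_with_reconvergence(const std::vector<bool>& cond, BranchShape s) {
    bool any_then = false, any_else = false;
    for (bool c : cond) {
        if (c) any_then = true; else any_else = true;
    }
    return (any_then ? s.then_len : 0) + (any_else ? s.else_len : 0) + s.tail_len;
}

// Without one, this model assumes a diverged warp runs every lane on its
// own, through its side of the branch and through the tail.
std::size_t issues_without_reconvergence(const std::vector<bool>& cond, BranchShape s) {
    bool any_then = false, any_else = false;
    for (bool c : cond) {
        if (c) any_then = true; else any_else = true;
    }
    if (!(any_then && any_else))  // uniform branch: nothing diverged
        return issues_with_reconvergence(cond, s);
    std::size_t total = 0;
    for (bool c : cond) total += (c ? s.then_len : s.else_len) + s.tail_len;
    return total;
}
```

In this model, a 32-lane warp with half the lanes taking each side pays for every lane separately once it has diverged, which is why marking the reconvergence point is "a really really good idea" even if it is not strictly required.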
03:34karolherbst: HdkR: nothing :p
03:34HdkR: Oh, the "if you hate yourself path"
03:35HdkR: Got it
03:36HdkR:is curious how hard it is to actually write a vulkan backend
03:38imirkin_: mostly i've been waiting on the kernel interface changes which are necessary to do _anything_ in vulkan
03:38imirkin_: i don't think it's been a priority for ben - he's busy adding support for volta
03:38HdkR: Which is a decent priority
03:39imirkin_: and for pascal fault support prior to that
03:39imirkin_: and a dozen other things
03:40imirkin_: so i haven't touched the thought of writing a vk driver until that's complete
03:46karolherbst: imirkin_: well I will still try to push for this year
03:47karolherbst: not that we will run games with that
03:47karolherbst: but like simple stuff
03:50imirkin_: i also clearly don't have time for anything anymore
03:50imirkin_: other than making snide remarks on irc
03:50HdkR: Oh hey, that's what I do
04:06karolherbst: imirkin_: well, this is already a big help though :)
04:07karolherbst: imirkin_: well, the first task would be to extract codegen out of nouveau anyway
04:07karolherbst: or rather the gallium driver nouveau
04:08imirkin_: first task is to make the kernel have the proper api
04:09karolherbst: well, maybe "first" is the wrong word
04:09karolherbst: but we can start with both
04:10karolherbst: we don't have to wait for the new API for just extracting codegen ;)
05:22karolherbst: imirkin_: anyway, if applications start to use vulkan explicitly, it simply means it goes up in priority here, so we might see a bigger push for it now, maybe. Depends on how much time skeggsb_ and I will have in the end :)
05:35Subv: that'd be nice, being able to emit maxwell code without having to actually run the entire driver stack would be useful
05:41karolherbst: the TGSI bits are a bit problematic
05:42karolherbst: although I guess a header file dependency would be okay for now
05:42karolherbst: well, maybe even keeping the _from_tgsi file in nouveau, dunno
05:42karolherbst: or another lib
09:45pendingchaos: imirkin_: shall I send an updated patch using the "Value *foo = bld.mkOp2v(bar, bld.getSSA(), ...)" thing? even though I can't find a good source for the method I can point to.
09:57pendingchaos: also making it a bit more robust with divisors like -2147483648
10:03pendingchaos: it seems just -2147483648 I guess, OP_ADD seems to accept rather large immediates
11:08karolherbst: pendingchaos: there are sometimes restrictions on instructions to use large immediates
11:08karolherbst: usually those also have a short immediate form with ~20 bit immediates
11:09karolherbst: pendingchaos: there is always target->insnCanLoad(Instruction, int?, Instruction *load?) to check if the given instruction can load the source of the load instruction (not quite sure about the signature)
11:10karolherbst: so you could do targ->insnCanLoad(add, 0 or 1, load)
11:10karolherbst: I don't like insnCanLoad really, because the interface kind of restricts on what you can do
11:12karolherbst: pendingchaos: also, OP_SELP takes a predicate
11:12karolherbst: you want OP_SLCT
11:13karolherbst: or something else
11:13pendingchaos: I wonder why it was working before then
11:13karolherbst: but with SLCT you can compare against 0
11:13karolherbst: or you use SET + something else
11:14karolherbst: or maybe something did the right thing later
11:14karolherbst: but SELP is like SLCT, except that instead of comparing against 0 with a cc, it just checks the bool value of the given predicate source
11:16karolherbst: but yeah, I guess a SLCT with eq or ne 0 should be fine here
11:17karolherbst: pendingchaos: also when creating OP_SET/OP_SLCT/OP_SELP, use bld.mkCmp
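The difference between the two selects can be sketched as a toy model. The names `Cond`, `slct` and `selp` are hypothetical and only mirror the description above; this is not the nv50_ir API, and the set of condition codes is an assumed subset.

```cpp
#include <cstdint>

// Hypothetical subset of condition codes, for illustration only.
enum class Cond { EQ, NE, LT, GE };

// Toy model of SLCT: choose a or b by comparing a third value against 0.
int32_t slct(int32_t a, int32_t b, int32_t c, Cond cc) {
    bool take_a = false;
    switch (cc) {
    case Cond::EQ: take_a = (c == 0); break;
    case Cond::NE: take_a = (c != 0); break;
    case Cond::LT: take_a = (c < 0);  break;
    case Cond::GE: take_a = (c >= 0); break;
    }
    return take_a ? a : b;
}

// Toy model of SELP: choose a or b from an already computed predicate.
int32_t selp(int32_t a, int32_t b, bool pred) {
    return pred ? a : b;
}
```

Under this reading, feeding an ordinary integer value where SELP expects a predicate is the kind of mistake that SLCT with `ne 0` avoids.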
14:48pendingchaos: karolherbst: thanks
14:51pendingchaos: switching to SLCT fixes a weird problem I was having
14:58imirkin: pendingchaos: generically you should never stick immediates into instructions unless you really know what you're doing. the usual thing to do is to use bld.loadImm() and then let the later load propagation take care of it
14:58pendingchaos: yeah, that's what I think I'll do
14:58pendingchaos: I'm not sure if I can get this to reliably work with negative divisors, because of things like -2147483648 / -8
14:58imirkin: since there are a variety of rules around what can go into ops and what can't, given what other flags, etc
14:58pendingchaos: since negative divisors are implemented like a / -b -> -a / b but negating -2147483648 probably wouldn't give anything useful since it overflows
14:58imirkin: yeah, MIN_INT is annoying =/
14:59imirkin: since -MIN_INT == MIN_INT :)
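The wraparound mentioned here can be checked with a small sketch. `wrapping_negate` is a hypothetical helper: plain `-v` on the minimum value is undefined behaviour in C++, so the sketch does the arithmetic in an unsigned type to mimic what two's-complement hardware does.

```cpp
#include <cstdint>

// Negate with two's-complement wraparound. The subtraction is done on
// uint32_t, where overflow is well defined. Converting the result back to
// int32_t is modular since C++20 and on mainstream compilers before that.
int32_t wrapping_negate(int32_t v) {
    return static_cast<int32_t>(0u - static_cast<uint32_t>(v));
}
```

Every other value negates as expected; only the minimum maps back to itself, which is what makes the `a / -b -> -a / b` rewrite unsafe for that one dividend.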
14:59pendingchaos: I doubt anything is doing things like that anyways
14:59imirkin: how does gcc do it then?
15:00imirkin: either way, optimizing negative divisors doesn't seem like an extremely pressing problem
15:00pendingchaos: I think clang has some other method of handling signed divisions by powers of two I saw in one of its READMEs
15:00imirkin: what does gcc do when you do x / -8?
15:00pendingchaos: it seems to negate the result
15:01pendingchaos: instead of the operand
15:01pendingchaos: *I think llvm
15:01imirkin: so it does -(x / 8)?
15:01pendingchaos: I think so
15:02pendingchaos: I don't think there are any problems with that
15:02imirkin: call me crazy, but it *seems* like this should be a solved problem :)
15:02pendingchaos: since if you're dividing by something greater than one, it should never be MIN_INT
15:03imirkin: just have to make sure that -x/MIN_INT works when x == MIN_INT
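The "negate the result instead of the operand" approach can be sketched as follows. This is a minimal illustration of the arithmetic, not the code from the patch under discussion; `div_pow2` and `div_neg_pow2` are hypothetical names, and the sketch assumes `>>` on a negative `int32_t` is an arithmetic shift.

```cpp
#include <cstdint>

// x / 2^k with C-style truncation toward zero, for 0 <= k <= 31.
// Assumes >> on a negative int32_t is an arithmetic shift (guaranteed since
// C++20, and what mainstream compilers do before that).
int32_t div_pow2(int32_t x, unsigned k) {
    // Negative dividends get a bias of 2^k - 1 so the shift rounds toward
    // zero instead of toward negative infinity. x + bias cannot overflow
    // because the bias is only added when x is negative.
    int32_t bias = (x < 0) ? static_cast<int32_t>((UINT32_C(1) << k) - 1u) : 0;
    return (x + bias) >> k;
}

// x / -(2^k) for 1 <= k <= 31, negating the quotient rather than the
// dividend. For k >= 1 the quotient's magnitude is at most 2^30, so this
// negation cannot overflow, even when x is INT32_MIN.
int32_t div_neg_pow2(int32_t x, unsigned k) {
    return -div_pow2(x, k);
}
```

With k = 31 the divisor is the minimum value itself, and the sketch gives 1 for `INT32_MIN / INT32_MIN` and 0 for `INT32_MAX / INT32_MIN`. The case k = 0 (a divisor of -1) is deliberately excluded, since there the quotient can be the minimum value and the negation would overflow.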
15:24pendingchaos: I think I'm confident in its current state, so I'll send out a new patch
15:26imirkin: ok cool
15:26imirkin: and if it's not too much trouble, maybe send out a couple of piglits to ensure that the odd cases work as expected?
15:27imirkin: stick them into glsl-1.30 i think, since that's the first one to have "real" integers
15:28imirkin: stuff like -1/2 == 0 and with negatives, etc
15:45pendingchaos: I think the current div -> (mad + other stuff) transform is broken with 2147483647 / -2147483648
15:45pendingchaos: seems to give -1 instead of 0
15:56imirkin: oh, i think the x / y thing doesn't work for "big" integers
15:56imirkin: i.e. > 23 bit
15:57imirkin: since it uses floats to approximate
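Why a float-based approximation runs out of precision can be shown with a short sketch. An IEEE-754 single has a 24-bit significand (23 stored bits plus an implicit one), so integers just above 2^24 are no longer exactly representable. `exact_in_float` is a hypothetical helper; how the real lowering uses floats is not shown here.

```cpp
#include <cstdint>

// True if v survives an int32 -> float -> integer round trip unchanged.
// The cast back goes through int64_t because a float near 2^31 does not
// fit in int32_t.
bool exact_in_float(int32_t v) {
    float f = static_cast<float>(v);
    return static_cast<int64_t>(f) == static_cast<int64_t>(v);
}
```

Once an operand is rounded on its way into a float, a quotient computed from it can be off by one, which fits the kind of wrong result reported for very large operands.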
16:10karolherbst: pendingchaos: did you read my comments after you left? especially that mkCmp one?
16:11karolherbst: also, it is always worth to do a full shader-db run
16:11karolherbst: just to see how beneficial some opts are
16:11pendingchaos: forgot about that
16:11pendingchaos: I did it with an older version though
16:11karolherbst: although with div I guess it won't show as "better"
16:12pendingchaos: practically no change
16:12pendingchaos: though the shader I was focusing on lost some imads
16:12karolherbst: yeah, well, the default shader-db shaders aren't really covering a lot
16:12pendingchaos: which was the intention of the optimization
16:14karolherbst: but I thought we basically use a builtin thing for s32/u32 divs?
16:14pendingchaos: I think we do, but this is for division by constants
16:14imirkin: karolherbst: and the builtin doesn't handle large ints
16:14karolherbst: I see
16:14karolherbst: pendingchaos: ahhh, right, makes sense
16:16pendingchaos: imirkin: the broken thing I was talking about is this code btw: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/codegen/nv50_ir_peephole.cpp#n1136
16:16pendingchaos: removing that transform seems to fix 2147483647 / -2147483648
16:17imirkin: that's unfortunate.
16:17pendingchaos: in case you thought I meant the stuff in gm107.asm
16:18imirkin: well you could just not do that transform for MIN_INT
16:18imirkin: int32_t l = util_logbase2(static_cast<unsigned>(abs(d)));
16:18imirkin: that probably won't work so hot...
16:18imirkin: although i dunno - it might be ok
16:18imirkin: since MIN_INT will just become 1<<31 there... dunno
16:19imirkin: if ((1 << l) < abs(d))
16:19imirkin: that's clearly bogus though =/.
16:20pendingchaos: changing abs() to llabs() seems to fix 2147483647 / -2147483648
16:20pendingchaos: though -2147483648 / -2147483648 seems to also be broken
16:20pendingchaos: giving something other than one
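The `abs()` versus `llabs()` difference comes down to whether the magnitude of the minimum value fits in the type. A minimal sketch, assuming the quoted comparison is meant to round the log up when the divisor is not a power of two: `magnitude`, `floor_log2` and `ceil_log2_abs` are hypothetical names, and `floor_log2` is only a stand-in for Mesa's `util_logbase2`.

```cpp
#include <cstdint>
#include <cstdlib>

// |d| computed in a wider type, so INT32_MIN maps to 2147483648 instead of
// overflowing the way abs() on a 32-bit int would.
uint32_t magnitude(int32_t d) {
    return static_cast<uint32_t>(std::llabs(static_cast<long long>(d)));
}

// Stand-in for util_logbase2: floor(log2(v)) for v > 0, and 0 for v == 0.
unsigned floor_log2(uint32_t v) {
    unsigned l = 0;
    while (v >>= 1) ++l;
    return l;
}

// ceil(log2(|d|)) for d != 0. The shift is done on an unsigned value so
// that l == 31 stays well defined, unlike (1 << 31) on a signed int.
unsigned ceil_log2_abs(int32_t d) {
    uint32_t m = magnitude(d);
    unsigned l = floor_log2(m);
    if ((UINT32_C(1) << l) < m) ++l;
    return l;
}
```

This only addresses the magnitude computation. It does not explain the remaining wrong result for `-2147483648 / -2147483648` reported above.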
16:23rhyskidd: any known headers / values within Falcon firmware blobs to identify the GPU family it is associated with?
16:23rhyskidd: so say if I have fecs, gpccs etc --> which family does it relate to
16:24imirkin: rhyskidd: not really. have a look at my extract_firmware.py for known-good examples
16:25imirkin: rhyskidd: and have a look at my scanner to extract stuff from recent blobs
16:25rhyskidd: yup, I'm looking at the output from scanner.go
16:25rhyskidd: nice tool btw
16:25imirkin: a little hacky :) but it's a hacky concept in the first place
16:26imirkin: i did look for ways i could identify the gpu from that data, but i was unsuccessful
16:26imirkin: you can also look at md5sum's - e.g. some series of gpu's will have identical file X but different file Y
16:27rhyskidd: mmm, i guess that might identify archives as being within the same family, but not what the family itself is
16:27rhyskidd: e.g. maximal number of similar files within a family is a decent assumption
16:27imirkin: also you can fairly easily see whether it's falcon v3 or v5
16:27imirkin: GK208+ is v5, earlier is v3 (for gpccs/fecs)
16:29imirkin: an earlier version of that script just tried to decode at EVERY possible position in the object, but that was super-slow and generated tons of garbage. i think looking at relocations is fairly accurate and relatively future-proof
16:32rhyskidd: there's other, more tedious ways i guess
16:32rhyskidd: i have an mmio trace, or could just brute force the version of 390.x that added support etc
16:33imirkin: well, this stuff used to be easily extractable from mmiotraces
16:34imirkin: problem is that this stuff is no longer in the trace at all
16:34imirkin: the gpu dma's it directly
16:34rhyskidd: hrmm, i've seen what appears to be something loaded via mmiotrace
16:35rhyskidd: looked like a 32 byte hash, plus a blob
16:35rhyskidd: wonder what that was ...
16:37imirkin: could be some stuff is uploaded one way, some another
16:37imirkin: you can take the bytes and stick them into envydis and see if it's falcon code or not
16:50pendingchaos: anyone with a kepler card willing to test a mesa patch with a few piglit tests: https://hastebin.com/raw/inugegecuz?
16:55pendingchaos: (the piglit tests are arb_shader_image_load_store-semantics and multiple-resident-images-reading.shader_test with shader_runner)
19:11pendingchaos: imirkin: forgot to update the comment in the shader_test... feel free to change it to "Test signed division by immediates" when pushing if there isn't a third version
21:14Subv: huh, found a bit in the iset maxwell instruction that isn't documented on envydis
21:14imirkin: quite possible.
21:14Subv: it also has a "bf" bit, similarly to fset, it controls whether to output 1.0f or -1 when the condition is true
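The effect of that bit on the result can be sketched as a toy model. `set_result` is a hypothetical name, and the value produced for a false condition (0 in both modes) is an assumption, since the discussion above only covers the true case.

```cpp
#include <cstdint>

// Toy model of the result encoding described above.
// Assumption: a false condition yields 0 in both modes.
uint32_t set_result(bool cond, bool bf) {
    if (!cond) return 0u;
    return bf ? UINT32_C(0x3f800000)   // bit pattern of 1.0f
              : UINT32_C(0xffffffff);  // -1 as a 32-bit integer
}
```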
21:15imirkin: oh yeah, it does
21:15imirkin: nouveau emits it properly
21:15imirkin: but yeah, iirc envydis misses it
21:15Subv: should i submit a PR?
21:15imirkin: bit 44 right?
21:18pendingchaos: does it sound fine if I move LateAlgebraicOpt before LoadPropagation and after ConstantFolding?
21:18pendingchaos: I think it might be useful for doing the IMAD/IMUL -> XMADs thing
21:19imirkin: i think the point of LateAlgebraicOpt is to be after LoadPropagation
21:20pendingchaos: do you know why?
21:20imirkin: well, it can be wherever obviously
21:20imirkin: but the passes within it work a lot better if things have been loaded in
21:21imirkin: iirc for shl+add -> shladd?
21:21imirkin: check the commit log
21:21imirkin: iirc i moved it there
21:21imirkin: i'm sure the commit log would have had a rationale
21:22pendingchaos: seems it was always there
21:25imirkin: i definitely moved it.
21:25imirkin: maybe i moved an opt from one place to another?
21:25pendingchaos: ah wait
21:25pendingchaos: it was always in LateAlgebraicOpt
21:25pendingchaos: but LateAlgebraicOpt was moved
22:29pendingchaos: imirkin: could such situations be handled by handling SHLADD like ADD in IndirectPropagation?
22:30pendingchaos: in a shader which seemed to demonstrate the problem, handling SHLADD in IndirectPropagation seems to yield the same effect as moving the pass
22:31pendingchaos: it also gives a -0.11% decrease in instructions in shader-db (just in Dolphin's ubershaders), seemingly due to moving LateAlgebraicOpt to right after ConstantFolding
22:33pendingchaos: (an overall -0.11% decrease, not a -0.11% decrease when testing just dolphin's ubershaders)
22:38imirkin: it's possible
22:39pendingchaos: since afaik you can't include shifts/multiplication in indirects like x86, it just creates a shift for each source and lets CSE and DCE clean it up
22:47imirkin: correct - it's not like x86 in that regard
22:48pendingchaos: I believe this was the intention of the move: https://hastebin.com/fezaqaxaba.md?