08:23 karolherbst: imirkin: for the liveOnly thing: do you think I can just go backwards through all BBs in RA and mark instructions as liveOnly according to the rules? Or are there other cases where liveOnly will be used?
08:53 karolherbst: ohh, that's a property on the tex stuff...
08:57 karolherbst: mupuf: before you upgrade to systemd-230: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=825394
08:58 karolherbst: like by default systemd-230 kills all background tasks after logout, which means: quit ssh session -> screen killed
09:00 karolherbst: well there is this fancy systemd-run tool to ignore this too ;)
09:12 Lekensteyn: apparently you can build systemd with --without-kill-user-processes (seen it in the Arch packaging)
09:14 karolherbst: yeah well
09:14 karolherbst: you can also just change the logind configuration
09:33 janemba: hi
09:33 janemba: I have an issue compiling libvdpau -> http://pastebin.com/KkRvMkPs
09:35 janemba: found the problem I need at least glibc-2.17
09:35 janemba: thx for your help :)
09:35 Lekensteyn: we are rubber ducks? :)
09:41 pmoreau: imirkin: Since we are getting close to a release, would you like some piglit results (as soon as rc1 is released)?
09:45 pmoreau: imirkin: And glxinfo as well of course :-)
11:03 karolherbst: imirkin: is there somewhere a list on which instructions are considered quadops?
12:00 mupuf: karolherbst: lovely!
12:00 mupuf: well, I am not against cleaning stuff up behind you, but they probably should have added a background session and a way for users to migrate a task to it
12:00 yann|work: i'm surprised to see optirun failing with "Cannot access secondary GPU - error: [XORG] (EE) Unknown chipset: NV117" for quite some time now, as NV117 is not listed as unsupported on the nouveau website - does bumblebee support require more than basic nouveau support, or something ?
12:01 mupuf: something like nohup
12:01 mupuf: yann|work: you need to use the modesetting driver, not the nouveau driver
12:01 mupuf: do you have a xorg.conf?
12:01 mupuf: if so, change the driver from nouveau to modesetting
12:02 mupuf: or delete the file, since it is not needed 99% of the time
12:03 karolherbst: yann|work: don't use bumblebee with nouveau
12:04 karolherbst: mupuf: also /etc/bumblebee/xorg.conf.nouveau selects nouveau
12:04 karolherbst: mupuf: but if you see primusrun/optirun/bumblebee + nouveau, always answer, don't use bumblebee ;)
12:05 mupuf: karolherbst: right, prime is the way to go
12:05 yann|work: karolherbst: what's the problem ?
12:05 karolherbst: yann|work: bumblebee has worse integration
12:05 karolherbst: and less performance
12:05 karolherbst: mupuf: fyi: https://github.com/Bumblebee-Project/Bumblebee/issues/773
12:06 karolherbst: mupuf: especially where this conversation is going in the end :/
12:06 karolherbst: I would like them to drop nouveau support entirely, but maybe using prime as the bridge might be good enough
12:07 karolherbst: mupuf: and yeah, there is a tool for that: systemd-run
12:07 karolherbst: :D
12:09 karolherbst: yann|work: https://nouveau.freedesktop.org/wiki/Optimus/
12:10 yann|work: hm, xrandr --listproviders does not show nouveau here
12:11 yann|work: missing drm.rnodes=1 could explain that ?
12:12 karolherbst: yann|work: kernel version?
12:12 yann|work: 4.5
12:12 karolherbst: yann|work: do you know if you have DRI3 enabled for intel?
12:13 karolherbst: anyway, maybe just running this will work: DRI_PRIME=1 glxinfo | grep "OpenGL vendor"
12:13 karolherbst: but nouveau needs to be loaded in any case
12:14 yann|work: did not request anything myself at least :)
12:14 yann|work: karolherbst: DRI_PRIME=1 alone only shows Intel
12:15 mwk: alright, AsmParser working reasonably well
12:15 mwk: time for relocation support
12:17 yann|work: the driver doc says "Default: All levels of DRI are enabled for configurations where it is supported." - there should not be anything to do to activate DRI3 then, right ?
12:17 karolherbst: yann|work: well, I think you also want to switch to nvidia from time to time?
12:17 karolherbst: yann|work: or do you just want to use nouveau?
12:17 yann|work: well, I'd rather just use nouveau if possible
12:18 karolherbst: yann|work: I see. well you could use this xorg.conf: https://gist.github.com/karolherbst/1f1bdd1a3822df74097f
12:18 karolherbst: yann|work: and with that, you can also use nvidia-bumblebee, when you unload nouveau and turn off the GPU yourself via bbswitch
12:18 karolherbst: but if you don't care, then you only would need the dri3 part from intel
12:19 karolherbst: yann|work: or, just use the DRI2 offloading bits
12:20 karolherbst: yann|work: what you need (basically): loaded nouveau (most likely blacklisted through bumblebee now), and the modesetting ddx (for DRI2)
12:20 yann|work: what's the purpose of the nvidia/dummy section ?
12:20 karolherbst: yann|work: Xorg holds an open fd on the nouveau driver otherwise
12:20 karolherbst: yann|work: which makes it hard to unload nouveau without restarting X
12:21 karolherbst: yann|work: I use that config so I can run something with nvidia within seconds when required
12:22 yann|work: before installing bumblebee, it looked like the nvidia card was always powered on, which made the battery empty quite rapidly - if nouveau is not blacklisted on startup, how does the problem get solved ?
12:22 karolherbst: yann|work: nouveau turns off the card
12:23 karolherbst: yann|work: when nouveau is loaded you should have this file: /sys/kernel/debug/vgaswitcheroo/switch
12:23 karolherbst: yann|work: but you don't need to touch it
12:23 karolherbst: it is just there to configure the power management (somehow?, no idea, never touched it)
12:24 mupuf: mwk: what does relocation mean when you are talking about compilers?
12:24 mupuf: trampolines?
12:25 yann|work: mupuf: partial linking ?
12:25 mupuf: oh, right, for linking, that would make sense
12:26 yann|work: karolherbst: that was a lot of info, I'm not sure I got the hang of it all
12:26 mwk: mupuf: yep, relocations to external symbols
12:26 mwk: as in, extern int x; void f() { x++; }
12:26 karolherbst: yann|work: well, basically you don't need to do much
12:26 mwk: you need to emit a relocation to x in the code
12:27 karolherbst: yann|work: remove the nouveau blacklist entries, disable bumblebee and you should be good to go
12:27 mupuf: mwk: and I guess LTO would get rid of it, right?
12:27 mwk: mupuf: not really
12:27 yann|work: let's try by understanding the current state of my machine: I have nouveau apparently loaded, I have /sys/kernel/debug/vgaswitcheroo/switch, but xrandr --listproviders does not show it, what could be wrong ?
12:27 mwk: we still need a linker phase
12:28 mwk: LTO can squash a lot of things, mostly by inlining stuff
12:28 karolherbst: yann|work: you loaded nouveau after starting X maybe, not sure
12:28 mwk: but in the end, you still need something to allocate space for variables in the data segment
12:28 karolherbst: yann|work: I really don't know how well the DRI2 mechanics are working if you add nouveau while X is running :/
12:28 mwk: and that's linker's job
12:28 yann|work: I tried to use bumblebee, it surely loaded it for the job
12:29 karolherbst: yann|work: yeah, but bumblebee starts a second X server and copies data to your main one
12:29 karolherbst: which is rather expensive
12:29 karolherbst: and very limited through the PCIe bus
12:31 yann|work: ok, deinstalling bumblebee, rebooting, and we'll see what turns out
12:31 karolherbst: yann|work: just make sure the blacklist entries are gone (and that also initramfs is rebuilt)
12:32 yann|work: hm, removing bumblebee would remove primus too, is that a good idea ?
12:33 karolherbst: primus is something else
12:33 karolherbst: it has nothing to do with the prime thing nouveau uses
12:35 yann|work: ok
12:41 yann|work: no bumblebee any more, nouveau de-blacklisted, but xrandr --listproviders still only shows Intel :(
12:42 karolherbst: yann|work: did you modify your xorg.conf by adding that intel dri3 entry?
12:42 yann|work: no, I thought dri2 could work :) doing that now
12:43 karolherbst: ahh okay
12:43 karolherbst: well
12:43 karolherbst: you could also paste your xorg log somewhere
12:43 karolherbst: yann|work: lsmod | grep nouveau
12:43 yann|work: it's here
12:43 karolherbst: okay, so something on Xs end went wrong
12:43 yann|work: but does not appear in the xorg log
12:44 karolherbst: yann|work: well you can also pastebin: dmesg | grep nouveau
12:45 yann|work: the only xorg.conf.d entry I currently have is one that sets one option on the Intel driver, that would not prevent it from using nouveau ?
12:45 pmoreau: karolherbst: IIRC, Intel DRI3 is not enabled by default in Mesa. So except if the distro enables it, you need to enable it yourself and recompile.
12:45 karolherbst: pmoreau: don't you mean the ddx?
12:46 pmoreau: Err
12:46 karolherbst: yann|work: don't think so
12:46 karolherbst: yann|work: well pastebin of your dmesg and xorg.log will help
12:46 pmoreau: That would make more sense
12:47 karolherbst: pmoreau: but usually dri3 is built in, just dri2 selected by default
12:47 karolherbst: pmoreau: there is even a configure option to change the default
12:48 pmoreau: That was not the case some time ago: you needed to recompile it. Maybe it changed
12:48 yann|work: karolherbst: http://paste.debian.net/711130/
12:51 karolherbst: huh
12:52 karolherbst: now I wonder why the card doesn't turn off again
12:52 karolherbst: yann|work: does "/sys/kernel/debug/dri/1/clients" list anything?
12:52 karolherbst: yann|work: but it should be fine though, I guess X messes it up somewhat
12:54 yann|work: nothing there, just what looks like a header line
12:54 karolherbst: yeah, okay
12:55 karolherbst: yann|work: what does /sys/kernel/debug/vgaswitcheroo/switch contain?
12:55 karolherbst: "1:DIS: :DynOff:0000:01:00.0" or something else?
12:56 yann|work: yes, but with "0:IGD:+:Pwr:0000:00:02.0" first
12:56 yann|work: ah no
12:56 yann|work: 1:DIS: :DynPwr:0000:01:00.0
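Note: a minimal sketch of how lines from /sys/kernel/debug/vgaswitcheroo/switch, like the two quoted above, could be split into fields. The field names and the helper `parse_switch_line` are made up for illustration, and the layout is inferred only from the sample lines in this log, so treat it as an assumption rather than a description of the kernel interface.

```python
from typing import NamedTuple


class SwitchEntry(NamedTuple):
    index: int    # leading number, e.g. 0 or 1
    role: str     # "IGD" (integrated) or "DIS" (discrete) in the samples
    active: bool  # True when the third field is "+"
    power: str    # e.g. "Pwr", "DynPwr", "DynOff"
    pci: str      # PCI address, e.g. "0000:01:00.0"


def parse_switch_line(line: str) -> SwitchEntry:
    # Split at most 4 times because the PCI address itself contains colons.
    index, role, active, power, pci = line.strip().split(":", 4)
    return SwitchEntry(int(index), role, active == "+", power, pci)


def is_runtime_suspended(entry: SwitchEntry) -> bool:
    # Assumption taken from the discussion: "DynOff" is the state expected
    # when the discrete GPU has been powered down automatically.
    return entry.power == "DynOff"


if __name__ == "__main__":
    for text in ("0:IGD:+:Pwr:0000:00:02.0", "1:DIS: :DynPwr:0000:01:00.0"):
        entry = parse_switch_line(text)
        print(entry, "suspended:", is_runtime_suspended(entry))
```

With the second sample line the helper reports that the discrete GPU is not suspended, which matches the observation that something keeps it powered.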
12:57 karolherbst: so something keeps the GPU being on :/
12:58 karolherbst: so something keeps the GPU being on :/
12:58 karolherbst: ...
12:58 karolherbst: sorry
12:59 yann|work: AutoAddGPU ?
12:59 karolherbst: yann|work: no, this is fine. Actually there aren't many things which may use the nouveau driver in a way that the GPU has to stay on
13:00 karolherbst: yann|work: OpenGL wakes it up, lspci wakes the gpu up because it reads from the pcie config space, but other than that?
13:01 yann|work: can't we get some debugging traces to find out ?
13:02 karolherbst: well, maybe the Xorg log helps, because when offloading doesn't work, something is odd anyway
13:04 yann|work: karolherbst: http://paste.debian.net/711145/
13:06 mwk: mupuf: I've added the section on assembly operands in http://0x04.net/~mwk/Falcon.html
13:06 mupuf: mwk: let's see :)
13:07 mwk: you can load 32-bit immediates like that: mov %r0, %lo16(abcd) ; sethi %r0, %hi16(abcd)
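Note: a small sketch of the arithmetic behind the mov/sethi pair above. It assumes that mov sign-extends its 16-bit immediate and that sethi overwrites bits 16..31 while keeping bits 0..15; both are assumptions made for this illustration and were not checked against the assembler. The function names are hypothetical.

```python
def lo16(value: int) -> int:
    """Low half of a 32-bit constant, as selected by %lo16()."""
    return value & 0xFFFF


def hi16(value: int) -> int:
    """High half of a 32-bit constant, as selected by %hi16()."""
    return (value >> 16) & 0xFFFF


def mov_then_sethi(value: int) -> int:
    """Model the register contents after the two-instruction sequence."""
    low = lo16(value)
    # mov: assumed to sign-extend the 16-bit immediate to 32 bits.
    reg = (low - 0x10000 if low & 0x8000 else low) & 0xFFFFFFFF
    # sethi: assumed to replace the upper 16 bits and keep the lower 16.
    reg = (hi16(value) << 16) | (reg & 0xFFFF)
    return reg


if __name__ == "__main__":
    print(hex(mov_then_sethi(0x1234ABCD)))
```

Under these assumptions the sign extension done by mov does not matter, since sethi replaces the upper half afterwards.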
13:07 mupuf: mwk: does the debugger already work? You get this for free with llvm?
13:07 mwk: nope, I haven't thought about the debugger yet
13:07 mwk: but a lot of it is going to be free, I hope
13:08 mwk: but but... Falcon is Harvard, this is going to hurt for debugger
13:08 mupuf: so, fuc comes with a single stepper too?
13:08 mwk: I'm not sure if lldb knows about Harvard arches
13:08 mwk: mupuf: I see three ways to debug Falcon
13:08 mwk: v4+ has a full debugging interface, so it's awesome
13:09 mwk: we'd need to write a gdbserver-like program that connects to it and exposes it to lldb, but it's quite doable
13:09 karolherbst: yann|work: okay, the modesetting driver isn't loaded
13:10 karolherbst: yann|work: well for now you can just remove your entire xorg.conf I think
13:10 mwk: for v3... there's a nice way to set up instruction breakpoints, but we'll need a special trap handler installed in the code segment
13:10 mwk: which, of course, takes up valuable code space
13:11 mwk: and the third way is full simulation
13:11 mwk: with a stub on the Falcon executing io accesses, DMA, ...
13:17 mwk: mupuf: anyhow... here's the status
13:19 mwk: only v3 is supported; disassembler is fully functional, assembler will be fully functional in a few days (I'm still working on immediates and relocations, rest is good), codegen has some functionality but is still basically broken, lld and lldb are not started yet
13:19 yann|work: karolherbst: ah, it's not even installed
13:19 mwk: and I'm wondering whether I should do codegen or lld next
13:19 karolherbst: yann|work: well with 1.18 it should be inside the X server already
13:19 yann|work: hm, right, just noticed that
13:19 mwk: a working ELF assembler + linker is more useful than working ELF compiler with no linker :)
13:22 yann|work: karolherbst: ok, now nouveau gets initialized by xorg
13:22 yann|work: modesetting loaded as well
13:23 karolherbst: yann|work: well nouveau should get unloaded anyway, but that's normal
13:23 yann|work: ... and we're back to the original issue: Unknown chipset: NV117
13:23 karolherbst: yann|work: yeah, but modesetting takes care of that
13:23 karolherbst: yann|work: check if xrandr lists the provider
13:24 yann|work: it does not
13:24 karolherbst: now, that is odd
13:24 karolherbst: yann|work: can you paste your current xorg.log?
13:25 Lekensteyn: is runpm disabled? there is possibly a race condition that breaks detection when the device has to be resumed
13:25 yann|work: http://paste.debian.net/711169/
13:26 mupuf: mwk: thx!
13:26 karolherbst: yann|work: could you uninstall the nouveau xorg driver and retry?
13:27 imirkin: yann|work: separately i've seen phantom VGA ports keep gpu's up...
13:27 imirkin: (i.e. a situation where nouveau thinks there's a VGA port, but in reality it's not pinned out)
13:28 Calinou: is lldb better than gdb by the way? can gdb debug Clang-compiled binaries, and can lldb debug GCC-compiled binaries?
13:29 mupuf: mwk: nice, you got a boolean that actually uses 1 bit in regs
13:31 yann|work: does not change much
13:32 yann|work: http://paste.debian.net/711181/
13:32 karolherbst: yann|work: now I wonder...
13:32 imirkin: yann|work: is the current status that DRI_PRIME is working but it won't autosuspend?
13:32 imirkin: yann|work: or are you still trying to get the first one to work?
13:32 mwk: mupuf: I'm still not convinced it's a good idea
13:32 karolherbst: yann|work: output of this please: find /etc/modprobe.d/ -type f -exec grep nouveau {} +
13:32 mwk: support for %pX on Falcon basically sucks
13:33 karolherbst: imirkin: still occupied with getting offloading working :/
13:33 mwk: you think "hey, nice, a 1-bit register", but then you notice you can't actually do much with them
13:33 yann|work: imirkin: not sure I get your question - yes I'm trying to get DRI_PRIME to work
13:33 mwk: the supported opcodes are: mov to/from a GPR, set to 0, set to 1, toggle, branch if true, wait for interrupt while true
13:34 yann|work: karolherbst: finds nothing in modprobe.d
13:34 imirkin: yann|work: ok, your problem is that you don't have DRI3 enabled, and for some reason, modesetting doesn't auto-load for you
13:34 imirkin: yann|work: i highly recommend just enabling DRI3. if you prefer not to for whatever reason, make sure you don't have any "Driver" sections in your xorg.conf.
13:34 mwk: but what you can't easily do: move one %p to another, set %p to result of comparison, xor/and/or
13:35 mwk: esp. the first one is painful, it'll be really damn annoying for RA
13:35 karolherbst: yann|work: okay, then use the xorg.conf I linked above, this should take care of everything then
13:35 mwk: so I'm not sure whether the decreased register pressure and ability to branch in a single instruction outweighs the missing operations
13:36 mwk: nvidia only uses %p's for bools in the crypto code... but it does come useful there
13:36 mwk: so, we'll see how it works out
13:37 yann|work: imirkin: right now I don't have any xorg.conf anyway
13:37 yann|work: trying that...
13:37 imirkin: hm, very odd then =/
13:40 mupuf: mwk: yeah, we'll see
13:40 yann|work: DRI3 now enabled, but provider still not listed
13:40 karolherbst: yann|work: providers are a DRI2 only thing
13:40 karolherbst: with dri3 it should just work
13:41 karolherbst: DRI_PRIME=1 glxinfo | grep "OpenGL vendor"
13:41 karolherbst: DRI_PRIME=0 glxinfo | grep "OpenGL vendor"
13:41 karolherbst: should give you different vendors
13:41 mupuf: mwk: could you have a virtual address space? This way, you could hide the Harvard architecture by using the highest bits to select which address space you are in
13:42 yann|work: ah yes, this does work :)
13:43 yann|work: thanks much!
13:43 yann|work: why DRI2 did not work is still mysterious, though...
13:43 karolherbst: yann|work: well, if you want to use nouveau with your nvidia GPU, you might be interested in performance. I have some experimental stuff to enable reclocking on your GPU
13:44 yann|work: I'm also puzzled my DRI3 is not active by default, since the intel manpage was apparently saying it should
13:44 karolherbst: yann|work: well, DRI3 causes problems on _some_ systems
13:44 mupuf: mwk: fun that you allow changing the number of regs for fast global variables. That is a little extreme though
13:45 karolherbst: yann|work: although I didn't encounter anything and DRI3 always worked better than DRI2 for me :/
13:45 karolherbst: so no clue
13:45 mupuf: how do you select which variables will be promoted to one reg?
13:46 mupuf: sorry, should have read on
13:46 yann|work: karolherbst: why not testing something :)
13:46 mwk: mupuf: I've been wondering about the virtual address space
13:47 yann|work: while I'm at it :)
13:47 karolherbst: yann|work: well you would have to compile the nouveau kernel module yourself and install it and everything ;)
13:47 mwk: the thing is, I'd somehow have to hook into the debugger to auto-set the high bit on all void*'s
13:47 mwk: or, on all void(*)()'s
13:47 mupuf: hmm, right
13:48 mwk: and, on v4+, all bits of address are actually used by the MCU if you happen to use unified address space
13:48 yann|work: karolherbst: ah, I may not have enough time to do that, then :)
13:49 mupuf: ...
13:49 mupuf: why can't we call the interrupt handler ourselves? Is the hw getting mad?
13:49 mwk: also annoyingly, I can't just set the high bit in all data pointers
13:50 mwk: first, I'd suddenly need sethi for all address loads
13:50 mwk: second... %sp masks off higher bits anyhow
13:50 mwk: so if I take an address of an on-stack variable, the high bit would be gone
13:51 mwk: hmm, maybe setting the high bit for code space could be a workable plan
13:51 mupuf: well, impressive work mwk!
13:52 mwk: mupuf: it's going to be impressive when it's all implemented :p
13:52 mupuf: that will definitely speed up the re-write of pdaemon to be compatible with nvidia
13:52 mupuf: ah ah!
13:52 mwk: as for calling the interrupt handler ourselves, there's a problem with the iret instruction
13:52 mupuf: well, that seems like a great plan
13:53 mupuf: oh, right
13:53 mupuf: i should have thought about this
13:53 karolherbst: yann|work: yeah well, it doesn't take much time, it is just a bit messy to do
13:53 mwk: since it restores the interrupt enables
13:53 mwk: and we'd need to, um, un-restore them
13:53 karolherbst: yann|work: anyway, which GPU do you have exactly?
13:53 mupuf: be back later, sauna time
13:53 karolherbst: mhh although gm107 are semi fast...
13:55 mwk: mupuf: see you
13:56 mwk: oh, and as for fast global variables, that one's basically free, and may come quite useful, so I figured why not
13:57 mwk: though I'm a little disappointed that clang/llvm can't optimize them
13:59 mwk: as in: register int x asm("r8"); void g(); void f() { x = 1; if (x == 1) g(); }
13:59 mwk: it cannot fold the comparison
13:59 mwk: for a normal global variable it can
14:00 mwk: otoh, "register volatile" doesn't work quite right either
14:00 mwk: register int x asm("whatever"); void f() { x = x; }
14:00 mwk: er
14:00 mwk: register volatile int x asm("whatever"); void f() { x = x; }
14:01 mwk: this is optimized to nothing, even though it really shouldn't touch accesses to volatiles
14:01 mwk: won't matter for Falcon, but if you had a funny CPU with side-effects on reg read/write (like vµc IIRC), it's a problem
14:01 karolherbst: well, it still makes sense to optimize garbage away ;)
14:02 mwk: karolherbst: that's not necessarily garbage if the register has side effects
14:02 mwk: and this is what volatile means...
14:02 karolherbst: mhh wouldn't llvm check that inside the register declaration?
14:02 mwk: no
14:02 mwk: it's optimized away by clang before it gets to llvm
14:02 karolherbst: odd then
14:03 mwk: basically, on one hand, volatile is ignored on registers
14:03 karolherbst: but I also have no idea why you should depend your code on something like that anyway
14:04 mwk: and on the other, there isn't much optimization
14:04 mwk: karolherbst: you haven't looked at enough crazy MCUs then :)
14:04 karolherbst: well if you do crazy stuff like that, do it in assembly :D
14:04 mwk: consider this: register uint8_t serial_receive_port asm("whatever"); void ignore_byte() { serial_receive_port; }
14:04 karolherbst: but please keep your C code clean :)
14:04 mwk: err, register volatile
14:05 mwk: this should emit a read from the register and ignore the result
14:05 mwk: and if it was a memory variable, clang+llvm would indeed emit the read, and ignore the result
14:06 mwk: but for a register variable, it'll optimize away the read
14:06 karolherbst: well in my opinion architecture specific stuff doesn't belong inside C really.
14:06 karolherbst: yeah, well for memory variables it makes totally sense, because this is like a hardware interface
14:06 mwk: well, I think it's nicer to write things in C than assembly, even if they're arch-specific
14:06 karolherbst: well maybe some llvm dev could tell us why they did this
14:07 karolherbst: mwk: yeah well. But porting the code becomes odd then
14:07 mwk: you know, all code ever compiled by this compiler is going to be very arch-specific :p
14:08 karolherbst: :D
14:08 karolherbst: right
14:08 karolherbst: but if they change the arch, we still would use the same code
14:08 karolherbst: and if we depend on stuff which is only there for pre v4 and v5 would be totally new, we would have to unmess those specific things anyway
14:09 karolherbst: well it may be handy to write it in C yes, but you also have to make sure, that this still works on a new arch/revision/whatever
14:09 mwk: v5 is a funny thing really
14:09 karolherbst: yeah, I already had fun with that
14:10 mwk: it changes the encoding a lot, many instructions stay the same but get new opcodes
14:10 mwk: eh
14:10 mwk: I have to write a Falcon testsuite for hwtest
14:10 mwk: I really don't trust anything about v5 just yet
14:11 karolherbst: mwk: well another option would be to have an inline assembly somewhere with #if guards and error on untested archs immediately
14:11 karolherbst: this would be somewhat okay to use in C
14:12 karolherbst: mwk: I guess clang provides us with defines like __IS_FALCON_V5 or something like that?
14:13 imirkin: mwk: any clue what CP_NO_REG_SPACE_STRIPED might be?
14:13 mwk: karolherbst: you're overworrying again
14:13 mwk: karolherbst: __falcon_version__ == 5
14:13 mwk: imirkin: that's a graph dispatch error, right?
14:13 imirkin: mwk: yeah
14:13 imirkin: mwk: from a compute launch
14:13 imirkin: on fermi
14:14 mwk: oh, so they kept the code
14:14 mwk: anyhow
14:14 mwk: ignore the _STRIPED
14:14 mwk: that's Tesla-specific, I doubt it has the same meaning for Fermi
14:15 mwk: but the core issue is that regs per thread * threads per block > regs on MP
14:15 imirkin: ahhhhh we have to be careful about that?
14:15 mwk: so the MP cannot fit even a single block
14:15 mwk: of course
14:15 imirkin: :(
14:15 imirkin: so ... how many regs on MP?
14:15 mwk: if you need to launch a big block, you have to tell the RA to watch it
14:15 imirkin: right
14:15 mwk: ah, that is of course GPU-dependent
14:15 imirkin: of course.
14:16 mwk: https://en.wikipedia.org/wiki/CUDA has a table
14:16 imirkin: cool
14:16 mwk: "Number of 32-bit registers per multiprocessor"
14:17 imirkin: yeah, so 32K on fermi
14:17 imirkin: hakzsam: --^
14:17 imirkin: and 64K on kepler
14:17 mwk: ... except GK210
14:17 imirkin: and 128K on the hypothetical GK210
14:17 imirkin: which i'm convinced is just a marketing stunt
14:17 imirkin: i'll believe it when i see it :)
14:17 mwk: I'm afraid it's not
14:18 mwk: 128K threads per MP is not something you can just fake
14:18 imirkin: well, you know what - nouveau will be limited to 64K on that one for now :p
14:18 mwk: nouveau won't start on it due to unknown chip id :)
14:18 imirkin: mwk: sure i can. i'll just release a GK220 and say it has 1GB of regs.
14:19 hakzsam: imirkin, interesting
14:19 imirkin: hakzsam: so we have to adjust the max regs based on # threads :(
14:19 mwk: yep
14:19 mwk: welcome to CUDA
14:19 hakzsam: right :)
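Note: a minimal sketch of the constraint discussed above, regs per thread * threads per block <= regs on MP, using the per-MP register counts quoted in the conversation (32K on Fermi, 64K on Kepler). The table and function names are hypothetical, and the sketch ignores register allocation granularity and any per-thread cap imposed by the ISA.

```python
# 32-bit registers per multiprocessor, as quoted in the discussion.
REGS_PER_MP = {
    "fermi": 32 * 1024,
    "kepler": 64 * 1024,
}


def block_fits(chip: str, regs_per_thread: int, threads_per_block: int) -> bool:
    """True if a single block's registers fit on one MP."""
    return regs_per_thread * threads_per_block <= REGS_PER_MP[chip]


def max_regs_per_thread(chip: str, threads_per_block: int) -> int:
    """Largest per-thread register count for which one block still fits."""
    return REGS_PER_MP[chip] // threads_per_block


if __name__ == "__main__":
    for chip in ("fermi", "kepler"):
        print(chip, "1024 threads ->", max_regs_per_thread(chip, 1024), "regs/thread")
```

For instance, a 1024-thread block leaves at most 32 registers per thread with the Fermi figure and 64 with the Kepler one, which is the kind of limit the register allocator would need to be told about.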
14:20 imirkin: i assume that's the same issue i get on GK208, but it just hangs instead of whining
14:20 imirkin: hmmmm. maybe not
14:20 imirkin: 46 regs in the opt version, 37 in non-opt
14:20 hakzsam: oh, the compute shader is 1.6k lines, pretty huge
14:20 mwk: on Kepler, that stuff is specified in the params block, not by method
14:20 mwk: so I suppose dispatch can't detect the error
14:21 imirkin: mwk: 16K_SHARED_48K_L1 - does that affect matters?
14:21 mwk: nope
14:21 hakzsam: mwk, how about fermi?
14:21 mwk: you use that one if you can't fit the shared memory
14:21 imirkin: mwk: right
14:21 mwk: but that's not dependent on block size
14:21 mwk: hakzsam: on Fermi, the error should be detected by dispatch
14:21 mwk: and apparently it is, since imirkin claims so
14:22 hakzsam: okay
14:22 imirkin: mwk: actually that's from hakzsam's logs :)
14:22 mwk: same thing on Tesla
14:22 mwk: Tesla also has the striped/packed distinction
14:23 mwk: but that's easy to take care of, just set the switch to striped and forget it
14:25 mwk: eh
14:25 mwk: speaking of __falcon_version__, I'll need to change it to cover Falcon v4.1
14:25 mwk: so I'd have __falcon_version__ == 0x41 or something like that
14:26 mwk: assuming, of course, that 4.1 actually introduces some MCU feature, I haven't seen evidence of that yet
14:27 hakzsam: imirkin, yep, but the thing is how to fix the issue?
14:27 imirkin: hakzsam: need to teach getFileUnits() about it
14:28 imirkin: in nv50_ir_target
14:28 hakzsam: what does this method do?
14:28 hakzsam: it returns 0 or 2
14:28 imirkin: er
14:28 imirkin: the other one then :p
14:28 imirkin: the one that returns the number of regs
14:28 imirkin: getFileSize
14:29 hakzsam: oh and getFileUnit() seems to be unused
14:29 imirkin: nah, it's used.
14:29 hakzsam: err, not
14:29 hakzsam: yeah okay, getFileSize()
14:30 karolherbst: imirkin: any idea what is the direct effect of the liveOnly bit on the tex regs? I really can't find anything which benefits from this optimization :/
14:30 mwk: karolherbst: pixels that won't be rendered won't actually execute the tex
14:31 karolherbst: mwk: ahh
14:31 karolherbst: mwk: so less texs executed
14:31 mwk: yep
14:31 karolherbst: and applications with a tex_utilization close to 100 should benefit most
14:32 mwk: you want to set this flag iff the result of the tex instruction won't ever be used, directly or indirectly, as inputs to screen-space derivative functions
14:32 mwk: including the implicit derivatives involved in the usual tex instruction
14:33 karolherbst: mwk: mhh, I currently enable this for every tex in furmark and get like 15% more performance
14:33 mwk: nice
14:33 karolherbst: especially metric-issued_ipc goes from 226% up to 253%
14:34 mwk: but... you do it only when it's safe, right? :)
14:34 karolherbst: I guess effectively it reduces stalls
14:34 karolherbst: mwk: nope, always currently. But the code is in place, just need to fill the conditions
14:34 karolherbst: I would have expected some visual change, but didn't notice anything yet
14:35 mwk: I was the one who REd that bit, the results are very visible if you hit a program that uses it
14:35 mwk: I found it on some 3d demo
14:35 karolherbst: mwk: any idea what hits this?
14:35 mwk: but I don't remember it now
14:35 karolherbst: because I tested a bunch of games and none hit this
14:35 mwk: well, anything that feeds output of tex to input of another tex with auto-mip
14:36 karolherbst: something broke in mesa...
14:36 karolherbst: glxgears and glxspheres64 aren't working anymore
14:36 mwk: it was a parallax mapping demo of... something
14:37 karolherbst: mwk: okay.. maybe I find that
14:37 karolherbst: mwk: well in the trello it says input of quadops and other tex
14:37 karolherbst: mwk: any nice function to check if an instruction is a quadop?
14:38 mwk: karolherbst: if it's called "quadop", then it's quadop
14:38 karolherbst: so just OP_QUADOP ;)
14:39 karolherbst: well I would say I have to check for quadop, quadon, quadpop, dfdx and dfdy, but I have no idea if that's all
14:39 mwk: quadon/quadpop don't have inputs/outputs
14:39 karolherbst: ahh okay
14:39 mwk: dfdx/dfdy are special cases of quadop
14:40 karolherbst: okay, so just quadop/dfdx/dfdy and tex?
14:40 karolherbst: well, texcsaa too?
14:40 mwk: only the normal tex
14:40 mwk: and tex bias
14:40 mwk: tex lod, tex csaa, tex fetch don't count
14:40 karolherbst: mhh, at least emitTEXCSAA can set that bit too
14:41 karolherbst: so that's why I was wondering
14:41 mwk: the liveonly reg controls *output*
14:41 karolherbst: mwk: ahh okay, so OP_TEX and OP_TXB
14:41 mwk: but tex lod/csaa/fetch don't use other pixels' coordinates on *input*
14:41 karolherbst: okay
14:42 mwk: and I don't remember how it works for tex gather
14:42 mwk: I'd guess it also doesn't need other pixels
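Note: a rough sketch of the rule described above, walking backwards and allowing liveOnly on a texture instruction only if its result never feeds, directly or indirectly, a quadop/dfdx/dfdy or a tex/txb with implicit derivatives. The toy `Instr` type, the op-name strings and `mark_live_only` are hypothetical and not the nv50_ir API. It assumes straight-line code in SSA form (each value defined once) and leaves tex gather out because its behaviour was not settled in the discussion.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Ops whose inputs must also be valid in neighbouring (helper) pixels.
DERIVATIVE_CONSUMERS = {"quadop", "dfdx", "dfdy", "tex", "txb"}
# Ops that can carry the liveOnly flag in this toy model.
TEXTURE_OPS = {"tex", "txb", "txl", "txf", "texcsaa"}


@dataclass
class Instr:
    op: str
    dsts: List[str]
    srcs: List[str]


def mark_live_only(prog: List[Instr]) -> Dict[int, bool]:
    """Map the index of each texture op to whether liveOnly looks safe."""
    needed: Set[str] = set()  # values that must be computed in helper pixels
    live_only: Dict[int, bool] = {}
    for idx in range(len(prog) - 1, -1, -1):
        ins = prog[idx]
        feeds_derivative = any(d in needed for d in ins.dsts)
        if ins.op in TEXTURE_OPS:
            live_only[idx] = not feeds_derivative
        # Propagate the requirement to the inputs, transitively.
        if ins.op in DERIVATIVE_CONSUMERS or feeds_derivative:
            needed.update(ins.srcs)
    return live_only


if __name__ == "__main__":
    prog = [
        Instr("tex", ["a"], ["uv"]),
        Instr("mul", ["b"], ["a", "k"]),
        Instr("tex", ["c"], ["b"]),         # dependent texture read
        Instr("txl", ["d"], ["c", "lod"]),  # explicit LOD, no derivatives
        Instr("add", ["out"], ["c", "d"]),
    ]
    print(mark_live_only(prog))
```

In the example the first tex must not be liveOnly because its result reaches the coordinates of the second tex through the mul, while the second tex and the txl can carry the flag. Real shaders have control flow, so an actual pass would need a fixed-point iteration over basic blocks instead of a single backward sweep.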
14:44 mwk: karolherbst: look for irrlicht3d parallax mapping demo
14:44 mwk: it triggers the problem if you always use liveonly
14:45 karolherbst: the ut3 demo?
14:45 karolherbst: ohh wait
14:45 karolherbst: that's something else
14:46 mwk: karolherbst: http://www.irrlicht3d.org/pivot/entry.php?id=6
14:46 mwk: download irrlicht3d, look for the demo that looks like the screenshots here
14:48 imirkin: so i don't think that the register count thing is a problem for me ... i'm on a GK208 and it uses 46 registers ( * 1024 threads) - under the 64K limit.
14:54 karolherbst: mwk: this one? http://irrlicht.sourceforge.net/docu/example011.html
14:55 imirkin: aHA!
14:55 imirkin: 796: cvt u8 $p0 $r38 (8)
14:55 mwk: karolherbst: yes
14:55 imirkin: /*1c70*/ MOV R0, R38; /* 0xe4c03c00131c0002 */
14:55 imirkin: not *quite* the same.
14:56 imirkin: similar... just ... not the same :)
14:56 imirkin: i thought i fixed that :(
14:57 imirkin: just ... not hard enough
14:57 imirkin: gr
15:01 imirkin: and of course envydis doesn't know about these =/
15:02 imirkin: i guess this just needs to be a ISETP comparing it to 0?
15:02 imirkin: (or PSETP in case it's a predicate source)
15:08 karolherbst: "Mesa: User error: GL_INVALID_ENUM in glEnable(GL_LIGHTING)" ... any ideas what goes wrong?
15:10 imirkin: karolherbst: something tries to use a compat feature from a core context
15:10 karolherbst: glxgears would do that?
15:10 imirkin: no
15:10 imirkin: unless you were forcing things somehow
15:10 imirkin: perhaps someone messed up the logic with the enables
15:11 imirkin: i think brian was moving things around
15:11 karolherbst: but yeah, you are right
15:11 imirkin: although his stuff looked good
15:11 karolherbst: MESA_GL_VERSION_OVERRIDE=3.0 fixes it
15:11 karolherbst: but my system installed mesa works without issues though :/
15:13 karolherbst: mwk: do you remember what changes?
15:15 karolherbst: ohh I think I see it now
15:15 karolherbst: the difference is rather... minimal
15:17 mupuf: I remember playing a lot with irrlicht back in highschool
15:18 mupuf: and this parallax demo was mesmerizing
15:18 karolherbst: :D
15:18 Tom^: is there a spec or limit on how much power a 8pin or rather 6+2 pin pci-e power port can draw?
15:18 karolherbst: yes
15:18 karolherbst: the pcie spec
15:19 karolherbst: Tom^: 16x width port 75W, 6pin 75W, 8 pin 150W
15:19 karolherbst: Tom^: 2x6, 8+6 combination allowed
15:19 Tom^: indeed
15:20 karolherbst: pcie 4.0 will get 2x8 most likely though
15:22 Tom^: was just having a discussion about the 1080 card which only has 1 8pin
15:23 karolherbst: 150+75W :)
15:23 Tom^: which means you won't be overclocking it much more than what it comes with because you are already quite close to the limit
15:23 karolherbst: huh?
15:23 karolherbst: this isn't close
15:24 karolherbst: 180W TDP vs 225W?
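The 225W figure is the slot plus the single 8-pin connector. A small sketch of that sum, using only the wattages quoted in this conversation (the names are illustrative, and the numbers are not checked against the spec text here):

```python
# Wattages as quoted in the chat: x16 slot 75W, 6-pin 75W, 8-pin 150W.
SLOT_X16_W = 75
CONNECTOR_W = {"6pin": 75, "8pin": 150}


def board_power_budget(connectors):
    """Slot power plus the power of every auxiliary connector on the board."""
    return SLOT_X16_W + sum(CONNECTOR_W[c] for c in connectors)


budget = board_power_budget(["8pin"])   # reference 1080: one 8-pin
headroom = budget - 180                 # against the 180W TDP mentioned above
print(budget, headroom)
```

A 6+8 pin board would get `board_power_budget(["6pin", "8pin"])`, which is where the extra overclocking room on non-reference designs would come from.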
15:24 Tom^: benchmarks seem to suggest it uses even up to 230W at full load
15:24 Tom^: but idk
15:24 karolherbst: which is a bit dangerous though
15:25 karolherbst: well I am sure there will be 6+8 1080
15:25 karolherbst: with a big big cooling system on top :D
15:25 Tom^: yea there are rumours about that too, that the non reference designs will have beefier power ports
15:25 karolherbst: it is just insane how power efficient that thing is
15:26 karolherbst: your 780 Ti needs like 50% more power, and only has like 50% of the perf
15:26 Tom^: =D =D
15:26 karolherbst: :D
15:26 Tom^: im just heating my room in a more fun way
15:26 karolherbst: using f16 as an optimisation will be a lot of fun...
15:27 Tom^: but yea I'll probably buy a 1080ti in winter.
15:30 karolherbst: bye bye nouveau then I guess
15:30 Tom^: meh I'll just annoy you guys enough with traces and bugreports until it runs.
15:30 hakzsam: can't use cuda 8 with gcc > 5.3, funny
15:32 karolherbst: Tom^: yeah well, still we need the firmware
15:32 Tom^: perhaps it gets out by then, its still like 8 - 9 months away
15:32 Tom^: :p
15:33 karolherbst: :D
15:33 karolherbst: yeah sure
15:34 karolherbst: \o/
15:34 karolherbst: mwk: done :) I think
15:34 karolherbst: mwk: 13% more perf in furmark and no breakage in the demo :)
15:35 karolherbst: imirkin: currently I check if one of the defs is a TEX or TXB instruction, is this enough or do I have to check explicitly at which source position the def is?
15:39 yann|work: karolherbst: it's a gtx960m here
15:39 karolherbst: yann|work: I see...
15:39 karolherbst: yann|work: well without reclocking those cards are really really slow
15:40 yann|work: right, just tested a glmark2: the intel gpu has 1723, and the nvidia 2554 - not very impressive
15:41 yann|work: so yes, i'd be interested in your experimental stuff :)
15:41 mupuf: karolherbst: very nice
15:41 karolherbst: well
15:41 karolherbst: it would be better if something else would benefit from this :D
15:41 yann|work: just not now, leaving for a show in a couple of minutes, and will have to sleep afterwards :)
15:41 mwk: karolherbst: broken 2x2 pixel quads on edges of polygons, right?
15:41 karolherbst: mupuf: anyway, with that change, nouveau is faster running furmark on my system :D
15:42 karolherbst: mwk: yeah
15:42 mupuf: karolherbst: but you overclocked it a little :D
15:42 mwk: good
15:42 karolherbst: mupuf: not above nvidia
15:42 karolherbst: mupuf: and no, I didn't
15:42 karolherbst: :D
15:43 karolherbst: well, stock nouveau was like 95% of the way there anyway
15:43 karolherbst: no idea what furmark does, but it is indeed shader core limited
15:43 karolherbst: but does a lot of tex operations
15:43 karolherbst: really odd thing
15:43 karolherbst: achieved occupancy goes up a little, and ipc rate from 2.2 to 2.5
15:45 pmoreau: hakzsam: Yeah, that sucks… I was hoping that nvcc in CUDA 8 would support GCC 6.x, but… Maybe for the final release?
15:45 hakzsam: probably
15:46 pmoreau: I haven’t checked whether they do support VS2015 now
15:46 hakzsam: oh, Pascal's ISA seems pretty similar to Maxwell :)
15:47 karolherbst: yay
15:47 hakzsam: did you already look?
15:47 pmoreau: Tom^: Looking forward to a 1080Ti as well :-)
15:47 karolherbst: imirkin, mwk: do you know any opts which are similar to that liveOnly thing? Because then I could integrate it into my pass
15:47 pmoreau: I only checked the fp16 stuff, cause that’s something we plan to use really soon
15:48 karolherbst: pmoreau: so you have access to a card like that I guess?
15:48 pmoreau: Right now, no. But we are most likely going to buy a 1080 at work.
15:49 karolherbst: okay
15:49 karolherbst: I am mainly interested in the vbios
15:49 karolherbst: just to check if I have to change anything reclocking related for those
15:50 mwk: karolherbst: nope
15:53 mwk:still wonders what they were smoking when they designed Falcon ISA encoding
15:54 mwk: the whole two-address vs three-address thing is completely pointless
15:55 mwk: the only instructions where the two-address form is actually beneficial are not/neg/hswap, all the others have both forms of identical size
16:00 hakzsam: imirkin, the vectorAdd compute kernel compiled for SM60 and decoded with envydis http://hastebin.com/otapovehov.sm :)
16:01 hakzsam: sounds familiar
16:01 hakzsam: maybe, there are some variants, but it's pretty similar
16:52 mupuf: hakzsam: yeepee
16:52 imirkin: hakzsam: yeah, looks the same. you can compare it to nvdisasm output to double-check
16:56 hakzsam: imirkin, sure
16:56 hakzsam: well, have to go, see you
16:57 karolherbst: ahhh so close in borderlands :/
16:58 karolherbst: issued_ipc up from 90% to 104%
16:58 karolherbst: this doesn't really give us a perf increase, but it still looks much better
17:08 imirkin: hakzsam: ok, well fyi, with my patch to fix gk110 emitter's unspilling, that trace replays much better
17:17 karolherbst: the liveOnly Pass: https://github.com/karolherbst/mesa/commit/6502303cb31795147125f1aad87075f2208e7b00
17:21 imirkin: what does data.liveOnly represent?
17:25 karolherbst: if all defs of a tex instruction have this flag set to true, then the tex can have it set to true, too
17:25 karolherbst: I was thinking if I flag every instruction I care about, I don't have to check anything twice
17:29 imirkin: i think it's fine to cache it
17:29 imirkin: but it's per-instruction
17:29 imirkin: so i guess it's "does this instruction read values from other lanes"?
17:29 imirkin: or rather
17:29 imirkin: does this instruction, or instructions that use its return values, read values from other lanes?
17:30 imirkin: either way, a comment would be super.
17:30 karolherbst: well it is just to move the check up, and to cache the results as much as possible
17:31 imirkin: sure, but you're not randomly programming until you get a desirable result. the thing means something. it'd be nice to have a comment as to what exactly it means
17:31 karolherbst: in the end, if you have a tex, you only check its defs and you are done, because the defs already declare through the liveOnly flag whether they end up in quadops or texs
17:32 karolherbst: but I wrote the pass in this way to be able to enhance it for other flags which might work similarly
17:32 imirkin: karolherbst: can you try replaying the trace in https://bugs.freedesktop.org/show_bug.cgi?id=94858 on your kepler?
17:33 karolherbst: I think so
17:33 karolherbst: cyan cloth and yellow ball?
17:34 imirkin: yep
17:34 imirkin: with the cloth falling over the ball
17:34 karolherbst: but either the shader has issues or nouveau has
17:34 imirkin: a bit of tearing right?
17:34 karolherbst: where the cloth is heavily folded it looks wrong
17:34 imirkin: as it falls
17:35 imirkin: hmmm
17:35 imirkin: make a video?
17:35 karolherbst: I will try
17:36 imirkin: (and then tell me how you made that video, and i'll make one of how it looks for me)
17:36 imirkin: [yes, i'm that lazy]
17:38 karolherbst: huh, why is vlc broken now...
17:38 karolherbst: stupid vlc
17:40 karolherbst: odd
17:40 karolherbst: it also records faster
17:40 karolherbst: ..
17:42 karolherbst: imirkin: well I used this script: https://gist.github.com/karolherbst/1881c6cb72365f226726ecb9a87cacd4
17:43 karolherbst: and here the video: https://drive.google.com/open?id=0B78S7GSrzebIc3lnUVNsRW1SMzA
17:43 karolherbst: uhh
17:43 karolherbst: quality is like really bad :D
17:46 imirkin: ew
17:46 imirkin: that's wrong.
17:46 imirkin: https://github.com/apitrace/apitrace/blob/master/docs/USAGE.markdown
17:46 imirkin: https://github.com/apitrace/apitrace/blob/master/docs/USAGE.markdown#recording-a-video-with-ffmpeglibav
17:48 karolherbst: ohhh
17:48 karolherbst: :D
17:48 karolherbst: well
17:48 karolherbst: that makes more sense
17:50 karolherbst: just
17:50 karolherbst: the quality is really bad
17:50 Calinou: SimpleScreenRecorder?
17:50 Calinou: <pmoreau> Right now, no. But we are most likely going to buy a 1080 at work.
17:50 Calinou: I'm going to get a 1080 probably too
17:51 imirkin: i've added -b 4000k to that cmdline
17:52 Calinou: -b:v?
17:52 Calinou: for video bitrate
17:52 imirkin: karolherbst: https://people.freedesktop.org/~imirkin/cloth.mp4
17:52 karolherbst: imirkin: | ffmpeg -r 60 -f image2pipe -vcodec ppm -i pipe: -c:v libx264 -pix_fmt yuv420p -b:v 3000k -minrate 3000k -maxrate 3000k -preset veryslow -threads 8 -y output.mp4
17:52 pmoreau: Calinou: I won’t be doing any RE’ing or testing on that 1080 though, since I’m not supposed to. That will have to wait until I buy the 1080 Ti myself.
17:52 karolherbst: mhh
17:53 karolherbst: imirkin: ahh, looks same though
17:53 Calinou: karolherbst told me I could share the VBIOS :p
17:53 imirkin: karolherbst: do you get the same tearing at the end?
17:53 karolherbst: what tearing?
17:54 imirkin: look towards the end of the video, as the cloth is falling off
17:54 imirkin: it tears a bit
17:54 karolherbst: https://drive.google.com/file/d/0B78S7GSrzebIc3lnUVNsRW1SMzA/view
17:54 imirkin: yeah, you get the same tearing
17:56 karolherbst: imirkin: I've added a "// ends up in quadop or tex" comment on the liveOnly flag, anything else I should leave a comment on?
17:57 imirkin: so... it's the opposite of liveOnly then?
17:58 karolherbst: ohhh
17:58 karolherbst: right, I have to switch the meaning of the comment :D
17:59 karolherbst: "// true if doesn't end up in quadop or tex"
17:59 imirkin: ok
18:00 imirkin: and i don't remember, but you only ever compute it once per instruction, right?
18:00 imirkin: what happens when an instruction depends on itself?
18:00 imirkin: (as it might in a loop)
18:00 imirkin: do you just end up in an infinite loop?
18:02 mwk: yay, I'm emitting relocations
18:07 imirkin: if someone has a maxwell, try replaying the trace at https://bugs.freedesktop.org/show_bug.cgi?id=94858 - curious if it works or not. i don't think it uses images or anything like that, but you might have to force-enable GL 4.3 to get the trace to run.
18:17 karolherbst: imirkin: nope, checked is set to true on entering the loop, so the recursive calls end there. Although I really would like to have a non recursive variant of that pass :/
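One way a non-recursive variant could look is a backward worklist. This is only a sketch under assumptions, not the code from the linked commit: the instruction representation, the function name, and the exact set of ops treated as quadops are made up for illustration. The flag follows the meaning settled on above, true if the instruction's result does not end up in a quadop or tex.

```python
from collections import deque

# Assumed set of ops that read values from other lanes; illustrative only.
QUAD_OPS = {"tex", "txb", "quadop"}


def compute_live_only(insns):
    """insns maps an instruction name to (op, [names of the instructions
    producing its sources]). Returns name -> True if none of the
    instruction's results reach a quadop or tex, directly or indirectly."""
    reaches = set()   # instructions whose results end up in a quadop/tex
    work = deque()

    def mark(name):
        if name not in reaches:      # each instruction is queued at most once,
            reaches.add(name)        # so self-dependencies in loops terminate
            work.append(name)

    # Seed with everything that directly feeds a quadop or tex.
    for op, srcs in insns.values():
        if op in QUAD_OPS:
            for src in srcs:
                mark(src)

    # Propagate backwards through the sources.
    while work:
        for src in insns[work.popleft()][1]:
            mark(src)

    return {name: name not in reaches for name in insns}


shader = {
    "a":    ("mov", []),
    "loop": ("add", ["a", "loop"]),  # depends on itself, as in a loop
    "t0":   ("tex", ["loop"]),
    "c":    ("mul", ["t0"]),
    "t1":   ("tex", ["c"]),
    "out":  ("mov", ["t1"]),
}
print(compute_live_only(shader))
```

Here `t0` feeds `t1` through `c`, so it does not get the flag, while `t1` only feeds a plain `mov` and does. The `reaches` set plays the role of the `checked` flag: an instruction that depends on itself is simply found already marked.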
18:29 imirkin: hakzsam: have a look at https://github.com/imirkin/mesa/commits/nv30 - i believe the top 2 commits should fix your issues on fermi
18:32 pmoreau: imirkin: I’m having trouble getting my 64-bit MUL/MAD splitting to work without allocating temporary variables. This is what I currently have: https://phabricator.pmoreau.org/P98
18:33 pmoreau: imirkin: I have been saving temporary results into the high bits of res, but that doesn’t work if you get `res = res * foo + bar` :-/
18:34 imirkin: i gtg, sorry. good luck.
18:34 pmoreau: np. I should probably move it before RA
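For reference, the arithmetic behind a 64-bit MAD built from 32-bit pieces looks roughly like this. It is a sketch of the decomposition only, not pmoreau's lowering: it reads all inputs into locals before producing the result, which sidesteps the `res = res * foo + bar` aliasing by using exactly the kind of temporaries the patch tries to avoid.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mad_u64(a, b, c):
    """(a * b + c) mod 2**64, using only 32-bit halves and 32x32->64 products."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    c_lo, c_hi = c & MASK32, (c >> 32) & MASK32

    lo_full = a_lo * b_lo                   # full 64-bit product of the low halves
    sum_lo = (lo_full & MASK32) + c_lo      # low word of the result, plus carry
    res_lo = sum_lo & MASK32
    carry = sum_lo >> 32

    # Only the low 32 bits of the cross products survive in the high word;
    # a_hi * b_hi would land entirely above bit 63 and is dropped.
    res_hi = ((lo_full >> 32) + a_lo * b_hi + a_hi * b_lo + c_hi + carry) & MASK32
    return (res_hi << 32) | res_lo


res = 0x123456789ABCDEF0
res = mad_u64(res, 0xFFFFFFFF00000003, 0x0FEDCBA987654321)  # res = res * foo + bar
print(hex(res))
```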
18:53 pmoreau: Hum, maybe solved…
19:13 huelter: does nouveau have a way of separating the digital audio on its outputs? Right now I have sound on two monitors or none, using pulseaudio
19:19 karolherbst: I see there is a shl-add instruction?
19:26 pmoreau: Yeah! It seems to be working! Need to check with S64 now, but with U64, I get the same results as the CPU. :-)
19:38 karolherbst: :D awesome
19:50 mwk: alright, Falcon relocations are supported
19:50 mwk: now I need to test the hell out of them
20:00 karolherbst: uhh, add $r1 $r2 $r3 // neg $r4 $r1 => add $r4 neg $r2 neg $r3
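The rewrite in that line relies on -(a + b) = (-a) + (-b), with the negations folded into source modifiers. A toy sketch of such a fold on a made-up SSA-style instruction list follows; the tuple format and function name are assumptions, not nv50_ir. It only fires when the add has no other use, since otherwise the original add has to stay. For floats the two forms can differ in the sign of a zero result, so treat this as an illustration of the shape of the transformation.

```python
def fold_neg_of_add(insns):
    """insns: list of (op, dst, srcs) with srcs a list of (reg, negated) pairs.
    Assumes every register is defined once (SSA-like). Turns
    add d, a, b ; neg e, d   into   add e, -a, -b   when d has no other use."""
    uses = {}
    for _, _, srcs in insns:
        for reg, _ in srcs:
            uses[reg] = uses.get(reg, 0) + 1
    defs = {dst: i for i, (_, dst, _) in enumerate(insns)}

    out = list(insns)
    dead = set()
    for i, (op, dst, srcs) in enumerate(insns):
        if op != "neg":
            continue
        src, negated = srcs[0]
        j = defs.get(src)
        if j is None or negated or uses[src] != 1 or insns[j][0] != "add":
            continue
        # Flip the neg modifier on both add sources and write to the neg's dst.
        out[i] = ("add", dst, [(r, not n) for r, n in insns[j][2]])
        dead.add(j)  # the original add is now unused
    return [ins for k, ins in enumerate(out) if k not in dead]


prog = [
    ("add", "r1", [("r2", False), ("r3", False)]),
    ("neg", "r4", [("r1", False)]),
]
print(fold_neg_of_add(prog))
```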
20:10 karolherbst: ohhhhh, I think I found something
20:14 karolherbst: uhh
20:14 karolherbst: that really hurts
20:18 karolherbst: I think I found one potential performance issue in those eon games
20:18 karolherbst: if we end up with stuff like that: https://gist.github.com/karolherbst/284b6a27873936a6d31539a320c9a9fc#file-gistfile1-txt-L713-L716
20:19 karolherbst: it means we compute both "values", even if we already know really early which one we actually need to compute
20:20 karolherbst: or do I overlook/forget something?
20:30 karolherbst: yeah, that makes sense
20:30 karolherbst: I think we need to move an instruction inside the if clause if that is its only use
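The idea of moving a computation into the branch that holds its only use can be sketched on a toy CFG. Everything here is hypothetical (block layout, tuple format, function name), and it assumes the instructions are pure and in SSA form, so moving them cannot change behaviour beyond skipping unneeded work.

```python
def sink_single_use(blocks, entry, branches):
    """blocks: block name -> list of (dst, op, srcs). Moves an instruction from
    `entry` to the top of a branch block when all of its uses are in that block."""
    # Walk backwards so a chain of instructions can sink one after another
    # and still end up in its original order.
    for ins in reversed(list(blocks[entry])):
        dst = ins[0]
        users = {name for name, body in blocks.items()
                 for other in body if dst in other[2]}
        if len(users) != 1:
            continue
        target = next(iter(users))
        if target in branches:
            blocks[entry].remove(ins)
            blocks[target].insert(0, ins)
    return blocks


cfg = {
    "entry": [("c", "cmp", ["x"]),
              ("a", "mul", ["x", "y"]),   # only needed on the "then" path
              ("b", "add", ["x", "z"]),   # only needed on the "else" path
              (None, "bra", ["c"])],
    "then": [("r1", "mov", ["a"])],
    "else": [("r2", "mov", ["b"])],
}
sink_single_use(cfg, "entry", {"then", "else"})
print(cfg)
```

After the call only the compare and the branch remain in `entry`, and each path computes just the value it needs.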
21:12 s0be: Is the feature matrix up-to-date with respect to the nv110 support?
21:31 mwk: well
21:31 mwk:managed to link the first Falcon ELF executable
21:33 mwk: I'll need to tell lld that I really don't want the ELF header loaded into memory
21:34 mwk: and it'd be nice to have data segment start at 0, not at the last code address, but linker script seems to work good enough for that
21:37 mwk: also, porting lld was ridiculously easy
21:37 mwk: 84-line patch
21:39 RSpliet:bows
21:43 karolherbst:wonders why I can't move instructions with subops around as I like :/
21:43 karolherbst: ohh that wasn't it
21:43 karolherbst: strange
21:44 mwk: hmm, maybe I shouldn't even support linking without a linker script
21:45 karolherbst: huh, can't I move stuff between movs with subop:1?
21:54 mwk: hmm, how do I convince the stupid clang to use my shiny new linker...
21:56 karolherbst: mwk: does your linker get llvm bytecode as the input?
21:56 mwk: karolherbst: nope, Falcon ELF .o files
21:56 karolherbst: mhh
21:56 mwk: although it should be able to eat llvm bytecode as well
21:56 mwk: for LTO mode
21:57 mwk: but that's near the bottom of the TODO list
21:57 karolherbst: does the -B argument work?
21:58 mwk: what -B?
21:59 karolherbst: mhh, seems like it still does nothing
22:00 karolherbst: odd
22:00 karolherbst: https://llvm.org/bugs/show_bug.cgi?id=10744
22:30 mwk: so, I guess nobody here ever attempted dynamically loading code for Falcon?
22:30 mwk: aka overlays
22:31 mwk: I'm not sure how to handle this lovely case
22:36 mwk: eh, lld doesn't even support paddr != vaddr
22:36 mwk: not really surprising... :)
23:44 mwk: alright, I think I have an idea on how to do overlays for Falcon
23:44 mwk: it's kind of horrible :)
23:45 mwk: lld upstream will probably kill me if I attempt to submit it, but oh well