08:20henti: Morning all.
08:24henti: I have a Lenovo P50 with Optimus and setup dual screen. I'm getting random live image corruption
08:25henti: Especially on console text and mouse movement. Any help with fixing this?
08:25henti: Troubleshooting guide mentioned that raising the card performance mode might help
08:35skeggsb: henti: sounds like glamor issues that used to exist, but have apparently been fixed by "something" (ie. it doesn't happen for me anymore) already...
08:35skeggsb: updated Xorg and/or mesa would be my predictions
08:36skeggsb: though, perhaps it's something prime(optimus)-related
08:36skeggsb: i have a p50 on my desk currently, but using it in discrete mode
08:40henti: I'm seriously thinking of that as well. Will test later ... right now .. work
08:40karolherbst: reverse prime is broken by design anyway :p
11:39RSpliet: pmoreau, hakzsam, sooda: I've spent some time searching for OpenCL profiling tools for NVIDIA on Linux... do these not exist?
11:41hakzsam: RSpliet, I guess you tried nvprof?
11:42hakzsam: but nvprof won't work I would say because this tool is based on cupti
11:45hakzsam: RSpliet, https://hakzsam.wordpress.com/2013/05/28/the-cuda-compute-profiler/
11:45hakzsam: this is old but it worked with CL IIRC
11:46hakzsam: and this
11:46hakzsam: look for OPENCL_PROFILE :)
11:46RSpliet: they scrapped that in Cuda 8
11:46hakzsam: scrapped what?
11:46RSpliet: COMPUTE_PROFILE=1 command line profiling
11:46hakzsam: ah okay
11:46hakzsam: you can always downgrade :)
11:47RSpliet: yeah... that's kind of my last resort
11:47hakzsam: but you should try nvprof
11:48RSpliet: yeah I gave that a whirl, but didn't detect OpenCL kernel invocations when I tried it without extra params
11:48RSpliet: I'll retry after lunch
11:49hakzsam: not surprising that NVIDIA doesn't want to make any efforts for CL :)
11:49RSpliet: I'm sure it's a matter of limited resources and priorities rather than "desire"
11:49hakzsam: they just don't care I think
11:50RSpliet: surely individual engineers care
11:50hakzsam: for engineers, sure
12:39RSpliet: hakzsam: nvprof doesn't/no longer works with OpenCL programs
12:56RSpliet: hakzsam: and for that matter, the previous Cuda version cannot be installed in Ubuntu 16.04 using apt (for 15.04... because a 16.04 repo doesn't exist for cuda 7.5) because the repo uses sha1 as a gpg signature - which is no longer supported in 16.04.
12:56RSpliet: ^ this is why I hate proprietary software
12:56karolherbst: why do you use ubuntu anyway?
12:57RSpliet: karolherbst: if it was my choice I wouldn't, but it's a work desktop
12:57karolherbst: I can still install 7.5 and 6.5
12:57karolherbst: you can still download the package and install it manually
12:57RSpliet: sure, I can go and get the .run file and have fun with that
12:58karolherbst: well the gentoo ebuild does exactly this: https://gitweb.gentoo.org/repo/gentoo.git/tree/dev-util/nvidia-cuda-toolkit/nvidia-cuda-toolkit-7.5.18-r2.ebuild
12:58karolherbst: fun with run files :)
12:58RSpliet: yeah, but why do you use gentoo anyway?
12:58karolherbst: especially that you need a gcc below 5 is awesome :D
12:58karolherbst: RSpliet: so that I don't have the issue you have with ubuntu? :p
12:58karolherbst: and it is easy to write my own packages
12:59RSpliet: "gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2)"
12:59RSpliet: karolherbst: I think the distribution isn't the culprit here. The *real* problem is that NVIDIA removes useful features from their toolkits and drivers at random
12:59karolherbst: that is nvidias fault
13:02RSpliet: the only reason why I hooked up that fourth monitor to my Kepler is so that I can monitor profiling information of my OpenCL programs, but alas, 4 monitor support was removed from the 310.xx :-P
13:02karolherbst: how silly
13:03karolherbst: guess that means we have to get that working with nouveau now
13:03RSpliet: of course, I could invest significant time in helping pmoreau to get OpenCL up and running, so I can use hakzsams performance monitoring stuff... if only I had this infinite time source
13:05karolherbst: well, you could just create a gentoo chroot prefix and install cuda there and just use that before you crap your OS :D
13:06RSpliet: I'd rather stick to what I'm comfortable with (eg. Fedora ancient++)
13:07karolherbst: well my point was, that you don't have to mess up your system, I doubt you can just install a fedora on top of your ubuntu
13:08RSpliet: the point I tried to make is that figuring out how to do this stuff properly is likely to take more time than fixing OpenCL support in nouveau
13:08RSpliet: and that's not because the latter is trivial... it isn't
13:10karolherbst: well, then the right course of action is to fix OpenCL in nouveau as it seems :)
13:17hakzsam: RSpliet, lot of fun :)
13:35pmoreau: karolherbst: Fix? I would start with "getting it to work" if I were you :-p
13:36pmoreau: Is nvprof the old or the new profiling interface?
13:36RSpliet: pmoreau: I don't know which is older or newer or better or w/e, it's dysfunctional for OpenCL last time I tried
14:24pmoreau: RSpliet: Which OpenCL features would you need/like? I could definitely prioritise those, plus that would give some real testing to the code. :-)
14:28RSpliet: ... did the #intel-gfx guys just blindly ban all web irc users?
14:30pmoreau: O.O no idea…
14:31ajax: RSpliet: which +b gives you that impression?
14:32RSpliet: ajax: I erm... don't know. webchat doesn't really tell me I think
14:32ajax: barjavel.freenode.net, 927021 secs ago]
14:32ajax: well at least it's not "just"
14:32ajax: in the sense of just now
14:34RSpliet: ajax: well that clarifies one or two things. How intentional is this... I was hoping to find out more about some of the beignet profiling tools
14:35ajax: i imagine it was to silence one user in particular
14:35RSpliet: bold... hope to have my ZNC machine up and running again tomorrow if people object to lifting a webchat ban there
14:37ajax: i just cleared it
14:37ajax: if someone really wants to argue with me about it, bring it
14:39karolherbst: RSpliet: webchat is banned from #dri-devel afaik
14:39karolherbst: and maybe intel-gfx as well
14:39ajax: which, again, pretty sure there's _one_ person that's directed at
14:39karolherbst: it is intentional afaik
14:40ajax: i'm sure it was
14:41RSpliet: karolherbst: #intel-gfx already has this "register before you're allowed to talk" policy judging by the topic title. Given this, banning web clients seems a pointless measure to reduce unaddressed spam
14:41karolherbst: if you really want to, you can also use tor...
14:42RSpliet: you can't get on freenode with tor without registration
14:42karolherbst: like how hard it is to register on freenode
14:42karolherbst: or create a new registration
14:42RSpliet: enough of a hassle to not have spam-bots doing this. In other words, it must have been directed at a single user
14:43karolherbst: but still, any user with enough motivation can get around any of those bans
14:43karolherbst: except you ban tor
14:43RSpliet: which is what freenode did up until two weeks ago
14:44karolherbst: why did they ban tor?
14:44RSpliet: unaddressed spam, presumably: http://freenode.net/news/tor-online
14:45karolherbst: I thought there was this onion address especially for tor access
14:46karolherbst: that is a really crappy way of freenode to enable tor though
14:46karolherbst: so basically, it is useless
14:47karolherbst: k, so indeed no tor support for freenode until now
16:17waltercool: guys, building mesa with LLVM and musl, I got an error on nouveau of tr1/unordered_set, it seems to work if I remove the tr1, something expected for C++11, but why wouldn't it be detecting the ifdef __cplusplus >= 201103L ? Someone knows?
16:18imirkin_: waltercool: using clang?
16:18waltercool: imirkin_: yep
16:18imirkin_: i think clang never shipped the tr1 headers
16:19waltercool: I know, but should be compatible with the __cplusplus >= 201103L, or that's only for GCC?
16:19imirkin_: nouveau isn't built in c++11 mode.
16:19imirkin_: nor am i about to require it when the last gcc supported on bsd's is gcc 4.2
16:19imirkin_: and i consider clang to be a broken compiler
16:20imirkin_: [hm, broken is perhaps too strong. perhaps "inappropriate for general usage"]
16:20waltercool: but nouveau seems to fetch correctly with the ifdef the difference between the tr1/unordered_set and just unordered_set, but seems to fail for clang, no idea why
16:21waltercool: I think clang is fine, is just not fully compatible with gnu std
16:21imirkin_: if you send unintrusive patches that help clang, i won't object.
16:21imirkin_: but you can't in the process break working setups
16:21imirkin_: e.g. it has to work with gcc 4.2
16:21imirkin_: which had no c++11 concept
16:22imirkin_: nor shipped non-tr1 versions of those things
16:22imirkin_: [unless someone landed those patches that bump min gcc version... pretty sure they were nak'd though]
16:23waltercool: yeah I know, let me check how clang manages those headers, where should I send/search for that kind of patches? mesa mailing list?
16:23waltercool: great! I will investigate and provide proper patches if I find the correct solution. I will try with both GCC and clang to avoid any breakage
16:25imirkin_: (or even intrusive patches are fine, as long as they don't make everything ugly)
16:28waltercool: imirkin_: Ideally it should be just some ifdef logic, because it is the preprocessor. I don't like nasty stuff either :)
16:32siro__: imirkin: I found the problem that was causing my system to hang: It was due to a transfer_map call with invalid bounding box, causing the nouveau drm module to stop working
16:34imirkin_: siro__: ahhh ok, we should probably do some *basic* bounds checking/asserting
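The basic bounds check imirkin_ suggests could look roughly like the sketch below. This is not nouveau or mesa code: `Box`, `minify` and `box_in_bounds` are hypothetical names, and it assumes a box is valid when its origin and extents are non-negative and it fits inside the dimensions of the selected mip level (array layers are ignored here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a transfer bounding box (origin plus extent)."""
    x: int
    y: int
    z: int
    width: int
    height: int
    depth: int


def minify(size, level):
    """Size of one dimension at a given mip level, never smaller than 1."""
    return max(1, size >> level)


def box_in_bounds(box, width0, height0, depth0, level):
    """Return True if the box lies fully inside the given mip level.

    Assumption for this sketch: negative origins or extents are rejected.
    """
    limits = (
        (box.x, box.width, minify(width0, level)),
        (box.y, box.height, minify(height0, level)),
        (box.z, box.depth, minify(depth0, level)),
    )
    for origin, extent, limit in limits:
        if origin < 0 or extent < 0 or origin + extent > limit:
            return False
    return True
```

A caller would run such a check before mapping, for example `box_in_bounds(Box(0, 0, 0, 64, 64, 1), 64, 64, 1, 0)`, and reject the request or assert instead of passing the box on to the kernel.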
16:55waltercool: imirkin_: Just a question, there is a reason to keep C99?
16:56imirkin_: not sure what you're suggesting
16:56waltercool: I meant, why hasn't nouveau moved to c++11
16:57ajax: what possible benefit would it provide
16:57imirkin_: i like C? it was already written in C?
16:58waltercool: oh my bad, I thought it was c++
16:59imirkin_: codegen is written in C++
16:59imirkin_: but it's a restricted subset of C++
17:01imirkin_: which i think was because calim was afraid that people would reject STL usage
17:01hermier: imirkin_: the bug on was triggered
17:01imirkin_: hermier: that's fine... no serious harm done.
17:02imirkin_: you mean WARN_ON, right?
17:02imirkin_: yea, should be fine
17:02imirkin_: hopefully won't hang anymore
17:02ajax: all i'm saying is c++ usage in mesa means i have to play stupid linker tricks to make steam work
17:02hermier: but I saw a video corruption
17:02ajax: it is completely unsuitable for systems programming
17:02ajax: and anyone who disagrees with me is wrong
17:02imirkin_: ajax: i think that ship has sailed =/
17:03imirkin_: glsl compiler uses c++
17:03ajax: oh i'm aware
17:03imirkin_: as does almost every backend, i believe
17:03hermier: I mean I have dual head and one of the heads doesn't go to sleep, and screen updates make the displayed image corrupted till I wake up the screens
17:05imirkin_: never heard that one
17:05hermier: will try in graphical mode to see if it helps somehow
17:09hakzsam: imirkin_, any objections if I push karol's RA patch ?
17:09imirkin_: which one?
17:10hakzsam: "nv50/ra: let simplify return an error and handle that"
17:10imirkin_: go for it
17:14hermier: imirkin_ do you want the BUG_ON trace ?
17:14imirkin_: i have plenty in my own dmesg :)
17:15imirkin_: and to be clear, it's a WARN_ON, not BUG_ON
17:15hermier: hmmm true XD
17:18hermier: my ultimate test would be to start minecraft FTB ...
17:39urmet: hermier: what pack?
17:51hermier: imirkin_: I doubt it helps but even with all the patches, I still have
17:51hermier: [ 1658.662328] nouveau 0000:01:00.0: fifo: write fault at 0000261000 engine 00 [GR] client 0f [GPC0/PROP_0] reason 02 [PTE] on channel 2 [007fb31000 X]
17:51hermier: [ 1658.662331] nouveau 0000:01:00.0: fifo: gr engine fault on channel 2, recovering...
17:52hermier: and it still never recovers
18:13imirkin_: hermier: when doing what?
18:40vita_cell: what GPUs support now reclock (core & memory)
18:41vita_cell: without running blobs
18:41karolherbst: most of the 2nd gen tesla, all kepler and maxwell, although we can't enable it for 2nd gen maxwell
18:42vita_cell: Maxwell 1g, Kepler, Tesla
18:42vita_cell: so Maxwell 1g supports reclock for core and memory
18:42karolherbst: yeah, but only on nouveau master currently
18:43vita_cell: yes, but is it possible right?
18:43karolherbst: it is also possible for maxwell 2
18:43zeq: Spent the last couple of days getting chromium/vaapi to work with non-i965 VA drivers, including libva-vdpau-driver. It should also mean it will now work with nouveau VDPAU/VAAPI.
18:43vita_cell: gtx750=Kepler gtx750ti=Maxwell 1g?
18:44karolherbst: vita_cell: gm10x is maxwell 1
18:44karolherbst: you can't trust model numbers
18:44karolherbst: zeq: isn't there also a gallium vaapi state tracker?
18:45zeq: yes, but only the i965 driver worked with Chromium
18:45vita_cell: are all Tesla reclockable?
18:45karolherbst: okay, so you also tested it with nouveau gallium vaapi and it works now?
18:45karolherbst: vita_cell: not all
18:46zeq: karolherbst: no, not yet, I've been working with the proprietary driver... :-$
18:46karolherbst: vita_cell: it is disabled before G90
18:46imirkin_: it's enabled for G94+
18:46imirkin_: but that doesn't mean it works
18:46vita_cell: reclockable models are after G90 right?
18:47karolherbst: well I would say the best support is with kepler now regarding reclocking
18:47karolherbst: no idea how stable those teslas are
18:47karolherbst: but last time my GPU crashed due to reclocking is like over 3 months ago
18:47imirkin_: for the ones where it works, fairly stable i think
18:47zeq: karolherbst: I managed to switch the MUX from the BIOS and convince the NVIDIA driver to start. I've been trying to figure out why vgaswitcheroo doesn't work with nouveau/i965
18:47karolherbst: zeq: what machine?
18:48zeq: it's a Dell E6530
18:48imirkin_: zeq: didn't you have a G80? or was that someone else?
18:48zeq: I have several NVIDIA cards...
18:48karolherbst: zeq: no optimus system?
18:48vita_cell: I ran gtx770 4gb (now I stored it at the moment), which never crashed, same for gt730 2gb gddr5 (now I run), and gtx650
18:49zeq: it is optimus, but has a MUX, which only switches from the BIOS/UEFI setup screen, after reboot
18:49vita_cell: now I run gt730 as render on corebooted gigabyte board
18:49zeq: funny thing is the firmware then proceeds to get the screen mode wrong
18:49karolherbst: sure it does
18:50karolherbst: zeq: so basically you can tell the firmware to use the nvidia gpu on boot and the intel one just disappears?
18:50zeq: completely gone
18:50karolherbst: and I guess the nvidia one is gone when selecting the intel one?
18:50hermier: imirkin_: running desktop, with my usual bunch of apps
18:50karolherbst: are you booting in uefi mode?
18:50hermier: system was not stressed
18:50imirkin_: hermier: kde?
18:51hermier: yep as usual
18:51imirkin_: do you have my patched mesa?
18:51imirkin_: or rather, mesa with my locking branch
18:51zeq: yes UEFI. It's either Optimus mode: i965 primary, NVIDIA only available as secondary, although, I believe it may be attached to some outputs, or NVIDIA only, then it sees the LVDS directly.
18:52hermier: well it is mesa 12.0.3 with your patches
18:52karolherbst: zeq: then you should stay with intel main and just use bumblebee for using nvidia or prime for nouveau
18:52zeq: switcheroo switches GPU but not the mux when running in Optimus mode
18:52karolherbst: because it shouldn't
18:52karolherbst: and can't
18:52karolherbst: because it doesn't know about your mux
18:52karolherbst: the point of prime is to offload rendering onto another device
18:52zeq: its a secret mux
18:52karolherbst: not to switch the display
18:53hakzsam: karolherbst, I have pushed your RA patch btw
18:53karolherbst: hakzsam: thanks
18:53karolherbst: unhappy about shader-db failing? :D
18:53imirkin_: hermier: sorry, dunno
18:54zeq: Sure, and I'd use nouveau+ i965 PRIME if memory reclocking worked ;-) Otherwise I just end up using i965 only.. bumblebee wouldn't help. The NVIDIA isn't that much faster than the i965, but it does keep the CPUs cooler...
18:54karolherbst: zeq: what gpu do you have by the way?
18:54hakzsam: karolherbst, nope, but I'm going to try shaderdb on maxwell right now
18:54karolherbst: hakzsam: well, it only crashed for kepler1
18:54hakzsam: because of the limited number of GPRs?
18:54hermier: imirkin_ no problem ;)
18:55karolherbst: hakzsam: nope, it crashed while doing sched stuff
18:55hakzsam: karolherbst, makes sense then
18:55zeq: A (fast i7-3840) IVB GT2 HD4000, NVIDIA is NVS5200M
18:55karolherbst: hakzsam: yep, array access[0x3fffffff] ;)
18:55zeq: aka GF108GLM
18:55hakzsam: karolherbst, :)
18:55karolherbst: zeq: ohh right, I remember that one
18:56karolherbst: gddr3 fermi
18:56karolherbst: that will be fun
18:56karolherbst: zeq: bad news for you then, even if we get fermi memory reclocking working, we will target ddr3 and gddr5 mainly
18:56zeq: Still, having the proprietary driver running for the last few days on the mux has been interesting
18:56karolherbst: because there aren't many gddr3 cards at all
18:56karolherbst: no idea if any of us has such
18:57karolherbst: zeq: what is so bad about bumblebee in your case?
18:57karolherbst: the point of bumblebee with nvidia is to turn off the gpu completely
18:58karolherbst: so I don't see how that will benefit the heat situation using nvidia
18:58zeq: To be honest, I don't *really* want to be running the NVIDIA driver
18:58zeq: I got it going as much to try to fix Chromium/VAAPI
18:59karolherbst: ahh I see
18:59hermier: zeq: you are not the only one ;)
18:59zeq: worked really great already on the i965
18:59karolherbst: makes sense
18:59zeq: I'm stuck with it for the G80 and especially the NV35
19:00zeq: G80 just doesn't work too well with nouveau
19:00zeq: NV35 is *really* bad
19:00zeq: no offense :-)
19:01karolherbst: feel free to help out :D
19:01imirkin_: zeq: a fix went into v4.8 which should help some amount of plain kernel hangs you'd get
19:01imirkin_: (for NV35)
19:02zeq: I put the NV35 in a machine for a friend and built Gentoo around the old legacy nvidia driver (no EGL, taking advantage of accel where possible)
19:03zeq: everything optimized as I could. It's working well, especially with my custom Chromium build. Wish I could have used nouveau for it, but it didn't even support OGL2 or compositors.
19:03zeq: (and the hangs ^)
19:04imirkin_: yeah, nouveau support for "modern" environments is pretty much shit across the board
19:05zeq: I wish I was better at driver hacking :-(
19:06zeq: something daunting about hacking GPU drivers
19:06imirkin_: unfortunately my time and motivation to address those issues have both largely dried up. hopefully someone will try to fix them up.
19:07zeq: If I'm feeling brave maybe I'll give it a go.
19:08zeq: I'm feeling fairly confident after getting the Chromium/VAAPI/VDPAU stuff going :-)
19:11zeq: Going back to that secret MUX I apparently have, I had just assumed that there would be an ACPI method to deal with the switching. Do you think it's a hardware limitation that it needs to switch *cold*?
19:12karolherbst: what do you think about the Nvidia VBIOS section here? https://gist.github.com/karolherbst/4341e3c33b85640eaaa56ff69a094713#nvidia-vbios
19:13karolherbst: ohh have to adjust the last section still
19:20zeq: karolherbst: looks quite good to me
20:06NanoSector: heyo, it's me again, with the weird GT750M
20:06NanoSector: I was wondering if anything regarding that card was fixed in Linux 4.8?
20:10zeq: Does the GALLIUM VDPAU state tracker work with decoders on nouveau? Chip dependent?
20:11zeq: I've found a victim to test my Chromium ebuild, he has a gtx770, but VDPAU shows all decoders as -- not supported ---
20:15imirkin_: zeq: you need firmware
20:16imirkin_: zeq: fwiw there should also be direct va-api support
20:16zeq: CMEPTb: imirkin_: So should it work at all without firmware? with vaapi?
20:17imirkin_: firmware is necessary to drive the video decoding engines
20:17imirkin_: you should still have access to the presentation aspects of vdpau without it
20:17imirkin_: but if you want video decoding, you need firmware
20:18zeq: shame there's no VP1 firmware, or is there?
20:19imirkin_: it exists, but i don't extract it
20:19imirkin_: also the linux driver never made use of it
20:19imirkin_: and from various online info, it was never beneficial on windows
20:19zeq: imirkin_: the blob driver for linux doesn't use it because it doesn't meet VDPAU feature set 1
20:19zeq: so no hw video decode for the G80 based GTX8800 I have
20:20imirkin_: you get VPE2 support actually
20:20imirkin_: which gets you MPEG1/MPEG2
20:20zeq: is VP1 support possible?
20:20imirkin_: it's really best to use with XvMC than VDPAU though
20:20imirkin_: theoretically? sure.
20:20imirkin_: and mwk has RE'd most/all of it
20:21zeq: I read that NVIDIA couldn't be bothered due to it only being available on a small number of cards, and VP2 being much better
20:21imirkin_: VP1 was available on NV41+
20:21zeq: but only supported HD on the last couple of cards AFAIK
20:22zeq: including the 8800
20:22zeq: there wasn't much point accelerating SD on the GPU
20:22zeq: maybe I've made all that up, that's what I remember
20:23imirkin_: the sole reason i got a nv34 back in the day was to decode ATSC streams
20:23zeq: yes, back in the day, I meant the CPUs contemporary with VDPAU
20:24zeq: NVIDIA when they made VDPAU didn't see the point of supporting chips that could only accel SD when CPUs at the time could do that
20:24zeq: As far as I know it was always supported on Windows
20:25zeq: HD MPEG2 is still useful IMHO
20:25mwk: VP1 is just a vector processor, you get to write any codec you like for it...
20:25mwk: it's a lot of work though
20:26zeq: mwk: is that different to the later VP chips?
20:26mwk: very much so, yes
20:26zeq: mwk: interesting
20:26mwk: VP2 is sort of similar to VP1, I think, I never finished REing it
20:27mwk: for one, it has a much more advanced DMA engine
20:27mwk: sure, 3 different kinds
20:27mwk: it's a dual-processor thingy
20:27mwk: there's an xtensa core managing stuff and a vector processor computing stuff
20:27zeq: I mean embedded, rather than being programmable like VP1?
20:27mwk: plus a macro engine in between
20:28mwk: oh, it's fully programmable... probably
20:28mwk: I haven't decoded the vector processor code yet
20:28zeq: Could have been used for acceleration of other things then?
20:29mwk: anything you like
20:29mwk: except the H.264 bitstream decoder, which is a fixed-function engine
20:29zeq: probably not that fast compared to modern CPUs though?
20:29mwk: yeah, pretty much useless today
20:29zeq: does that apply to VP1 or just VP2 later
20:30mwk: I mean
20:30zeq: unless you have an old slow CPU :)
20:30mwk: if you happen to have a laptop with VP2 card
20:30mwk: you probably want to use it, because its CPU can't be that good, and it's less power-hungry than the CPU even if it is
20:30zeq: I think my laptop has a VP4.2
20:31mwk: any kind of acceleration is a win on laptops because of power savings, basically
20:31zeq: I have an old PC hooked up to my TV with AthlonX2 CPU and a GTX8800. Perhaps relevant to that.
20:32mwk: that CPU might kind of suck
20:32mwk: so VP1 could help
20:32zeq: it's a 3Ghz chip, it just manages to do HD x264 decoding with both cores
20:33mwk: I used to have a G86 laptop that had problems with playing HD H.264
20:33mwk: on the CPU
20:33mwk: so it was either nvidia + VDPAU or framedrop
20:33zeq: My x264 decoder is quite optimized :)
20:34zeq: It's a shame G80 doesn't work well with Nouveau as I mentioned earlier. VP1 support would only help if it didn't crash :)
20:35mwk: zeq: well then, you get to port the x264 decoder to VP1 processor...
20:35mwk: or rather, parts of it
20:35mwk: a vector processor won't help you at all with bitstream decoding
20:35zeq: I guess none of you guys had a 8800, it was ridiculously expensive and superseded quite quickly
20:36mwk: I got 2 actually
20:36mwk: yeah, G80 is... weird
20:36zeq: VP1 instruction set isn't known though, right?
20:36mwk: it's almost fully known
20:36zeq: there's an assembler?
20:36mwk: there is
20:37zeq: no excuses then eh? LOL
20:37mwk: it sucks a lot, but it's usable in a pinch
20:37mwk: well, I still need to figure out the DMA engine
20:37mwk: though I roughly know what to expect
20:38mwk: there's also a fair amount of unknowns in the control registers, but that doesn't much matter in practice
20:38zeq: I'd better feed my dog or I'm going to start losing keys! :-O
20:39mwk: also, the VP1 ISA is kind of... weird
20:39mwk: lots of instructions are plain old crazy
20:43mwk: the branch unit is not documented yet, and quite weird too, but mostly known
20:44zeq: Fun to code for then
20:46zeq: I used to do a lot of assembler back in the 80s/90s. 6502/ARM.
20:47zeq: I used to write most of my desktop software in ARM code :)
20:47mwk: ever coded in assembler for a batshit crazy ISA? :)
20:47mwk: nv has plenty of those
20:47zeq: ARM is so easy :)
20:48zeq: ARMv2 as it was back then, anyway.
20:48zeq: I feel so old :-)
20:52hermier: it is not crazy, it is advanced XD
20:53mwk:wasn't even alive in the 80s
20:54hermier:was born in 79 so ...
20:56zeq:was born in 77 so not as old as he feels
20:56mwk: oh, and wrt VP1
20:56mwk: I said the ISA is *known*
20:56mwk: I didn't say *understood* :)
20:56mwk: I still have no idea what some of the weirdo instructions are for
20:57zeq: maybe NV didn't either, hence no support from VDPAU from the blob
20:58mwk: I just gave up on writing a description for this one, and settled on pseudocode
20:58zeq: maybe they designed the instruction set around a software implementation of their decoder?
20:59zeq: that was a function :)
20:59mwk: I'd probably have to sit with the ISA on one screen and MPEG2/H.264/VC1 standard text on the other
20:59mwk: then I might figure this out
21:25karolherbst: wow, how difficult it is to get the compiler to generate an actual idiv instruction...
21:25karolherbst: (well cpu compiler that is)
21:30imirkin_: karolherbst: don't divide by an immediate.
21:30karolherbst: hah, if that would be that easy
21:32karolherbst: "(2 * i) / 13" produces different code than "i/6.5" odd
21:33imirkin_: is i an integer?
21:33imirkin_: why would you anticipate it'd generate identical code?
21:33imirkin_: they're completely different operations
21:33karolherbst: well sure, but usually I don't mind precision if I do floating point anyway
21:34karolherbst: and the former generated code is much faster
21:34karolherbst: this is with -O3 by the way
21:34karolherbst: checking Ofast
21:34karolherbst: with Ofast there is no div operation for the second case as well
21:35imirkin_: should be a fdiv
21:35imirkin_: er right
21:35imirkin_: just fmul
21:35imirkin_: since 1/6.5 is a constant
21:35karolherbst: only with Ofast
21:35karolherbst: O3 does a div
21:35imirkin_: that's dumb
21:36imirkin_: 2i/13 should also not generate a division
21:36karolherbst: it doesn't
21:36imirkin_: should be shl i, 1 + a bunch of weird muls
21:36karolherbst: sar is a shift, right?
21:36karolherbst: or what was sar again?
21:36imirkin_: iirc with roll around?
21:37imirkin_: arithmetic right-shift
21:37imirkin_: aka signed shift
21:38imirkin_: "rol" & co are the rotate instructions
21:38karolherbst: I see
21:38karolherbst: I selected ICC :D
21:38imirkin_: right, makes sense
21:39karolherbst: gcc is equally dumb about the Ofast / O3 thing though
21:39imirkin_: note that imul returns in eax:edx
21:40karolherbst: yeah, I figured
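The reason no idiv shows up for "(2 * i) / 13" is that compilers usually replace signed division by a constant with a multiply by a precomputed "magic" number, an arithmetic right shift, and a sign fix-up. The sketch below models that trick for 32-bit values in Python. The helper names (`magic_for`, `div_by_const`, `c_div`, `two_i_over_13`) are made up for illustration, and this is not claimed to be the exact instruction sequence ICC or gcc emits.

```python
def magic_for(d, bits=32):
    """Return (m, s) so that (n * m) >> (bits + s) == n // d for 0 <= n < 2**bits.

    m is ceil(2**(bits + s) / d); s is the smallest shift for which the
    rounding error of m is small enough to be harmless over the whole range.
    """
    for s in range(bits + 1):
        scale = 1 << (bits + s)
        m = -(-scale // d)              # ceil(scale / d) using integer math
        if m * d - scale <= (1 << s):
            return m, s
    raise ValueError("no multiplier found")


def div_by_const(n, m, s, bits=32):
    """Divide n by the constant behind (m, s), truncating toward zero like C."""
    q = (n * m) >> (bits + s)           # Python's >> floors, like an arithmetic shift (sar)
    return q + 1 if n < 0 else q        # fix-up turns floor into truncation for negatives


def c_div(n, d):
    """Reference: C-style integer division (truncates toward zero)."""
    q = abs(n) // abs(d)
    return q if (n < 0) == (d < 0) else -q


M13, S13 = magic_for(13)


def two_i_over_13(i):
    """(2 * i) / 13 as a shift-left, a multiply and a shift-right, no division."""
    return div_by_const(i << 1, M13, S13)
```

For 13 this yields the multiplier 0x4EC4EC4F with an extra shift of 2, which matches the "shl i, 1 + a bunch of weird muls" plus sar pattern described above.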
21:43hermier: karolherbst: not sure it is dumb, I think there might be precision problems in some corner cases
21:44imirkin_: hermier: explain
21:44hermier: since multiplying and dividing are not the same operation
21:44imirkin_: infinity/nan/-0 should all be the same i think
21:44imirkin_: how do you mean?
21:45karolherbst: hermier: huh dividing by 6.5 has precision problems?
21:45hermier: the fact that you transform x/y to x * z where z is 1/y is not the same operation
21:45imirkin_: that can definitely get you into trouble, but not when y is a normal thing
21:46karolherbst: I only care about the result, and both things look equal in the result to me, maybe not type, but the int value is the same
21:46karolherbst: I agree with you about arbitrary values, but not in this case
21:46hermier: yes but you can't know before hand if there will not be precision problem for every x
21:46imirkin_: when y = 0, it gets you into trouble
21:47hermier: so the compiler implementor has to choose to use one or the other possibility
21:47imirkin_: er, when y is infinity maybe?
21:47imirkin_: anyways, for a normal "y" it's equivalent
21:47imirkin_: assuming that your rcp() is accurate.
21:48hermier: no this can happen at every value because of the way floating values are encoded
21:48imirkin_: hermier: please demonstrate a concrete example of where this doesn't work out identically
21:49hermier: I mean mathematically it works
21:49imirkin_: although iirc division is allowed more ULP's of imprecision.
21:49karolherbst: imirkin_: well, for y = 0 it really doesn't matter much to begin with
21:49imirkin_: hermier: i mean specific float values where it won't work out.
21:49hermier: I don't have that in mind
21:49imirkin_: like 32-bit IEEE 754-encoded values.
21:49imirkin_: my claim is that for all non-special values, it'll work as expected.
21:50imirkin_: people think of floating point precision as a boogey man
21:50imirkin_: but it's actually fairly well specified
21:50hermier: the problem is not for daily usage though
21:51hermier: I mean it is mostly for scientific usage, where error must be quantified
21:51imirkin_: i don't think you understand what i'm saying
21:52imirkin_: what i'm saying is that you're flat-out wrong.
21:52imirkin_: for any non-special float immediate
21:52hermier: yes, I understand
21:52hermier: that 16.5 is a regular number and should not produce any computation error
21:52imirkin_: x / immediate (+/- the ULP of the division operation) will contain x * (statically computed as-exact-as-possible 1/immediate value)
21:53imirkin_: er, that was "+/- the allowable ULP error of the div operation"
21:54hermier: imirkin_ I think a good example could be x/3
21:55imirkin_: for what specific value of x will those not be the same?
21:57hermier: let me fire up libreoffice so I can try to find one
21:59hermier: it will be safe for most values
21:59karolherbst: imirkin_: I am sure it won't matter for 0 as well
21:59hermier: but the compiler has to do an extra effort to check that the result is still accurate
21:59karolherbst: well the multiplication will
22:00karolherbst: hermier: that is what Ofast is for though ;)
22:00karolherbst: to tell the compiler you don't care the slightest
22:00karolherbst: which only breaks dumb code to begin with
22:04hermier: well one should be aware that -O3 and -Ofast might add imprecision to your results
22:04hermier: but that it would only matters for scientific usage
22:04karolherbst: I highly doubt this
22:05karolherbst: cause if precision really matter to you, you don't use IEEE floats anyway
22:07hermier: true and false ... one could have many version of the same program with different implementation of floating points, to quick validate results and or validate precision of your cpu, and a slower with higher precision systems
22:09hermier: if you see what I mean ^^
22:09karolherbst: right, but the slower one will use a really precise floating point library though
22:09karolherbst: and for the fast result you don't care
22:09hermier: well you care a little to see/quantify the bias you can expect in your results
22:14hermier: but all that is not for daily usage, I agree ;)
22:21imirkin_: hermier: it's not about daily vs non-daily usage. it's identical 100% of the time for 'regular' immediates. and it's very easy to tell which immediates are regular and which aren't.
22:22imirkin_: [subject to the division precision caveat]
22:24imirkin_: with 0.12345 there's a 1 ULP difference.
22:25imirkin_: which iirc is within acceptable range for division.
22:25imirkin_: i.e. http://hastebin.com/ufateyojah.rb
22:26hermier: I didn't mean it was not acceptable, though it has to be known so one does not amplify the error
22:27imirkin_: but which one is wrong ;)
22:27imirkin_: it'll depend on the specific values of x
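The x / immediate versus x * (1/immediate) claim can be checked with a small script. This is only a sketch: it emulates float32 by rounding Python doubles through `struct`, and the helper names (`f32`, `bits`, `max_ulp_gap`) and the sample range are assumptions made for illustration, not the contents of the hastebin paste. It assumes positive inputs well away from overflow and subnormals.

```python
import struct

def f32(v):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def bits(v):
    """Raw bit pattern of a float32; for positive values, adjacent floats differ by 1."""
    return struct.unpack("<I", struct.pack("<f", v))[0]

def max_ulp_gap(c, xs):
    """Largest ULP distance between x / c and x * (1 / c), all in emulated float32."""
    c = f32(c)
    rcp = f32(1.0 / c)          # the statically computed reciprocal
    worst = 0
    for x in xs:
        x = f32(x)
        q = f32(x / c)          # division, rounded once to float32
        p = f32(x * rcp)        # multiply by the rounded reciprocal
        worst = max(worst, abs(bits(q) - bits(p)))
    return worst

xs = [0.001 * i for i in range(1, 5001)]
for c in (16.0, 0.12345, 3.0):
    print(c, max_ulp_gap(c, xs))
```

For a power-of-two immediate such as 16.0 the reciprocal is exact, so the two forms agree bit for bit; for other immediates the gap stays within 1 ULP on inputs like these.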
22:28hermier: this is why smalltalk was so great :p
22:28karolherbst: assigns to self make things soooo much easier
22:28karolherbst: no idea why other languages don't really support that
22:28karolherbst: except ObjC
22:29hermier: 1/3 was 1/3 forever, until you asked for an approximation
22:29karolherbst: but 1/3 * 3 was 3/3
22:29karolherbst: not 1 afaik
22:30karolherbst: or was there a check for equal values within / ?
22:30imirkin_: sure, but three * (1/3) still == 1.0
22:30karolherbst: we talk about smalltalk though
22:31karolherbst: still had primitives though, sort of
22:32hermier: karolherbst: I don't remember exactly but I think that yes, because the / was always performing reductions of common divisors
22:32karolherbst: I see
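The exact-rational behaviour described for Smalltalk can be illustrated with Python's `fractions.Fraction`, which also reduces by common divisors on every operation. This is an analogy in a different language, not Smalltalk code.

```python
from fractions import Fraction

third = Fraction(1, 3)      # stays exactly 1/3, no rounding
product = third * 3         # 3/3 is reduced to 1 automatically
print(product, product == 1)

# The IEEE double version from the discussion: 3 * (1/3) still compares equal to 1.0.
print(3 * (1 / 3) == 1.0)
```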
22:33karolherbst: hakzsam: yay! one opt of mine improves ue4 stuff
22:33hermier: it is costly with big numbers, but as soon as reductions happen ...
22:33hakzsam: karolherbst, which?
22:34karolherbst: hakzsam: sub(a, 0) to a
22:35karolherbst: will send it out, otherwise I won't do that ever :D
22:44karolherbst: hakzsam: I will work on the pow 2i -> mul opts next, cause the benefit is quite huge
22:45karolherbst: "total instructions in shared programs : 2818606 -> 2817952 (-0.02%)"
22:45karolherbst: and it matters even more for speed
22:45karolherbst: does hurt gpr count for RA silliness though
22:46imirkin_: i wouldn't worry about that
22:46karolherbst: I know
22:46karolherbst: it isn't much
22:46karolherbst: helped 3, hurt 13 for gpr
22:46karolherbst: but helped 206 for inst
22:46imirkin_: i'd say if you replace pow by 4 fma's, it's worth it.
22:46imirkin_: maybe even as many as 8
22:46karolherbst: 2 is the most I do though
22:47karolherbst: 8 ?
22:47imirkin_: i dunno
22:47imirkin_: maybe that's too much
22:47karolherbst: that is up to pow 256
22:47imirkin_: right, it's unlikely that will happen
22:47karolherbst: I could check what is the highest immediate we get
22:47karolherbst: 16 is also quite high
22:47imirkin_: yeah, 16 is probably a very safe max
22:47imirkin_: (really, 31)
22:48imirkin_: (since the effort it takes to get to 31 is the same as to get to 16)
22:48karolherbst: the pain is just things like pow 11
22:48karolherbst: cause pow 16 is nice
22:49karolherbst: only 3 muls needed
22:49karolherbst: but with 11 you need 4
22:49karolherbst: even more I think
22:49imirkin_: it's not so bad.
22:49imirkin_: it's actually same number as for 16
22:49karolherbst: how do you get to 3?
22:49imirkin_: you just need an extra add
22:49imirkin_: well, how do you do it for 16?
22:50imirkin_: x2 = x * x; x4 = x2 * x2; x8 = x4 * x4; x16 = x8 * x8;
22:50imirkin_: so that's 4 multiplies
22:50imirkin_: for 11 it's just x2 = x * x; x4 = x2 * x2; x8 = x4 * x4; x11 = x8 + x2 + x
22:51karolherbst: ohh right
22:51karolherbst: using add makes it so much simpler
22:51karolherbst: tried to do everything with muls
22:51imirkin_: yeah, figuring out some weird x5 = x4 * x; x5 * x5 * x thing is ... tricky.
22:51karolherbst: exactly my problem!
22:52karolherbst: when does the pow get eliminated, by the way?
22:52imirkin_: and will generally end up with more multiplies.
22:52karolherbst: using add is the better solution :)
22:52imirkin_: not sure. could be before the const folding pass. but you could fix that.
22:52karolherbst: what about pow -4 and stuff?
22:52imirkin_: add rcp() at the end
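A sketch of the pow-with-integer-immediate expansion discussed above, written as plain exponentiation by squaring with a multiply counter. One caveat on the arithmetic: it is the exponents that add (8 + 2 + 1 = 11), so the partial powers themselves are combined with multiplies, i.e. x11 = x8 * x2 * x. That makes x^11 three squarings plus two combining multiplies, one more than the four needed for x^16. The function name `pow_int` is hypothetical and this is not the nouveau implementation.

```python
def pow_int(x, n):
    """Return (x**n, multiplies_used) for a nonzero integer n, by repeated squaring."""
    if n == 0:
        raise ValueError("n must be nonzero in this sketch")
    e = abs(n)
    muls = 0
    result = None           # product of the squares selected by the set bits of e
    square = x              # holds x, x^2, x^4, x^8, ...
    while True:
        if e & 1:
            if result is None:
                result = square
            else:
                result = result * square
                muls += 1
        e >>= 1
        if e == 0:
            break
        square = square * square
        muls += 1
    if n < 0:
        result = 1.0 / result   # the "rcp at the end" for negative exponents
    return result, muls

for n in (2, 4, 11, 16, -4):
    print(n, pow_int(2.0, n))
```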
22:52karolherbst: imirkin_: it is done by NVC0LoweringPass::handlePOW
22:52karolherbst: makes sense
22:53karolherbst: no idea when the lowering happens though
22:53imirkin_: although rcp is also a SFU op, so might not be any faster
22:53karolherbst: imirkin_: it is better
22:53imirkin_: that happens way too early
22:53hermier: is pow(x, 1) optimised ?
22:53imirkin_: hermier: kinda
22:53karolherbst: imirkin_: currently we do lg2, mul, preex2, ex2
22:53karolherbst: some muls + rcp should be faster than that
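For comparison, the lg2, mul, ex2 sequence mentioned above computes pow(x, y) as 2^(y * log2(x)). A rough sketch of that identity next to the two-multiply form for y = 4; the preex2 step is not modeled here, and both function names are made up for illustration.

```python
import math

def pow_via_log2(x, y):
    """pow(x, y) as 2**(y * log2(x)), for x > 0: one lg2, one mul, one ex2."""
    return 2.0 ** (y * math.log2(x))

def pow4_via_muls(x):
    """x**4 with two multiplies and no transcendental ops."""
    x2 = x * x
    return x2 * x2

print(pow_via_log2(3.0, 4.0), pow4_via_muls(3.0))
```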
22:54imirkin_: karolherbst: right, it's a bunch of sfu ops. i forgot :)
22:54imirkin_: karolherbst: move it into NVC0LegalizeSSA
22:54karolherbst: also more potential for dual issuing anyway :D
22:54karolherbst: so we get the POWs within our passes
22:54karolherbst: we can't opt parts of those away anyway
22:55karolherbst: maybe with a really super smart CSE, but I highly doubt that
22:56karolherbst: this is just crazy: "helped inst ../nvidia_shaderdb/gputest_pixmark_piano/7.shader_test - 1 3753 -> 3725"
22:56karolherbst: and it affects performance even more
22:57imirkin_: for what?
22:57karolherbst: just the pow thing
22:57karolherbst: and only up to pow 4