04:13imirkin: HdkR: finally! GL_NV_fragment_shader_barycentric
04:13imirkin: nvidia joins the fun.
04:15imirkin: gl_BaryCoordNV + gl_BaryCoordNoPerspNV, as well as being able to specify varyings as "pervertex"
04:24imirkin: wow. someone's been optimizing for VR... GL_NV_shading_rate_image
04:35HdkR: imirkin: I know right. Been a long time coming
04:36HdkR: Now Dolphin can finally implement manual texture coordinate interpolation to fix the nvidia interpolation bug :P
04:36HdkR: Mesh shaders also look interesting
04:42imirkin: yeah. tess on crack :)
04:43imirkin: basically for the, as i understand it, fairly common usage of "compute shader generates geometry"
04:43imirkin: i wonder if there's much pickup over doing that + draw indirect
04:44imirkin: but then even a 5-10% improvement would be worth their while
04:44HdkR: I feel like Mesh shaders are going to be something that people will use
04:44imirkin: yeah, it's more straightforward than compute + indirect
04:45HdkR: The mesh shaders themselves act almost identical to compute as well
04:45imirkin: should be easy to port
04:47HdkR: shading rate it'll be interesting if people take advantage of that extensively. Hip new feature to improve perf, kind of like checkerboard rendering
04:47HdkR: Seems like it would need to be in the consoles for anyone to care
04:48imirkin: could have a huge impact though
04:48imirkin: like 100% perf improvement
04:49HdkR: Yea, just think if you're rendering at 8k and you can cut the sampling rate of most objects to pretty much nothing
04:49imirkin: you need to render like ... 1 fragment per 16x16 block on the side of a 4k image that's being displayed in VR glasses.
04:49HdkR: yep :D
04:49imirkin: if that
04:53HdkR: I need to look at how flexible it is. Looks like you bind some sort of rate image when you're rendering something
04:53airlied: need non square scissors :-p
04:55HdkR: It's all about rectangles :)
04:55imirkin: you just provide a different rate for each 16x16 pixel block.
04:55imirkin: in a R8UI image. presumably 0 = 1, 255 = 256.
04:56imirkin: i didn't read the specs carefully.
04:56HdkR: Yea, I haven't opened the spec sheet for that one at all yet. Just listening to people talk about it
04:57imirkin: (or 16x16 fragment block. dunno.)
04:59HdkR: That's one heck of a reduction on the fragment side
05:00imirkin: so it could be huge for VR perf
05:01imirkin: i bet you could get away with an average of 50% coverage across the image
05:01imirkin: i.e. 2x faster rast
05:01imirkin: although ... obviously it's also a function of polygon counts and whatnot
05:01imirkin: [and where those polygons are]
05:03HdkR: Something like SuperHot in VR where lowering the fragment sampling rate wouldn't make a visual impact at all would be neat
09:55RSpliet: kernel-3xp: yeah... I know exactly what to do for Fermi reclocking, it's mostly a solved problem but one that takes a lot of time to fiddle about with details. Sadly, I'm nearing the final phase of a PhD, so I literally have negative time and energy to spend on nouveau.
09:57RSpliet: skeggsb also really knows what it would take, but he's tied up in other responsibilities (things like fixing multi-threaded mesa, Turing bring-up, maybe doing some stuff for Vulkan, general code quality and error handling, taking care of Red Hat Enterprise customer bug reports... you know, a long long todo list :-D)
10:06kernel-3xp: ah ok, :/ good luck with the phd though
10:09RSpliet: Yeah. It'd be nice if NVIDIA could stop producing new hardware for the next four or five years so we can finally catch up :-P
10:10karolherbst: that would be boring :p
10:10karolherbst: we just have to get better :p
10:11RSpliet: karolherbst: we just need to open a big tin of talented developers and reverse engineers
10:11karolherbst: RSpliet: well skeggsb worked mainly on volta bringup and I am sure he will look into turing as well. And currently I am working on the MT issues
10:11karolherbst: RSpliet: working on it :p
10:11RSpliet: karolherbst: that's good news though, MT would be a big win (and a prerequisite for Vulkan?)
10:12karolherbst: but I doubt I'll get anybody interested in doing fermi reclocking work though
10:12karolherbst: RSpliet: we wouldn't have the problem with vulkan
10:12karolherbst: because, it would be a new driver and we wouldn't make the same mistakes there
10:12karolherbst: I was already looking into the issue and I kind of understood what those are
10:13RSpliet: karolherbst: whenever you wonder whether anyone is interested in X, remember mooch is interested in exact hardware behavioural specs for like NV5 or something ;-)
10:13karolherbst: in mesa we have a fence list per screen
10:13karolherbst: but, we need that per context/thread
10:13karolherbst: like in case you use shared context
10:13karolherbst: dolphin does this to compile shaders in parallel
10:13karolherbst: and things break on glFinish() as multiple threads hammer on the fence list
10:14RSpliet: Shouldn't the fence list somehow become atomic or serialised regardless...? I'm not sure if every fork will create a new context, and you might want to share fences between your threads. Or am I thinking too simplistically? :-D
10:15karolherbst: no, why?
10:15karolherbst: just give each context its own
10:15karolherbst: you can't use the same context from different threads anyhow
10:15RSpliet: Ah ok, GPU context. Yeah that'd make sense then
10:15karolherbst: uhm, I meant GL context
10:15karolherbst: basically a nouveau_context struct
10:15RSpliet: Yeah, I assume there's a correlation with GPU HW contexts :-)
10:16karolherbst: still need to figure out the details
10:16RSpliet: No shared fences used to signal the compositor that your windowed OGL app is finished rendering a frame?
10:16karolherbst: like can we have multiple threads waiting on the same context or something through async gl calls
10:16karolherbst: RSpliet: well, the main problem are shared contexts, so imagine your compositor uses one shared context per client
10:16karolherbst: and now it does gl calls on each of those
10:17RSpliet: Ok, that's over my head now :-D
10:17karolherbst: so you share the nouveau_screen on all contexts where all the fencing is
10:17karolherbst: helgrind is already going crazy over our code :p
10:18karolherbst: most common offenders are the pushbuffers and the fence list we've got
10:19karolherbst: one example: https://gist.githubusercontent.com/karolherbst/5f6c5eb70f7facb6e676e0841db1596d/raw/1153f66300dfd9d1ed7affd597d94c73a7fd6da7/gistfile1.txt
10:21karolherbst: thread #4 seems to be the rendering/main thread and dolphin spawns some compiler threads
10:21RSpliet: Eh, valgrind is just a noisy little nuisance right, we can ignore those warnings!
10:22karolherbst: well, dolphin crashes with async shader compiles :D
10:22karolherbst: "it's just an optimization, we can disable it :p"
10:23RSpliet: That's not a bug, that's a feature helping people be more productive by deterring them from using dolphin
10:23RSpliet: Ok, that's terrible :-D
10:23karolherbst: maybe by simply moving the fence lists into context, we would fix tons of issues already... not quite sure. Still need to get a better understanding about the code itself
10:24karolherbst: I am sure HdkR did saw your comment :p
10:24karolherbst: *didn't I meant...
10:26RSpliet: Well, you have ideas to chase!
10:26karolherbst: yeah... I kind of hope I don't have to rework everything in one step
10:26karolherbst: if I find some "smaller" changes which fix a few bugs, but not everything, I am happy already
10:28karolherbst: I think imirkin_ had some branches around, but I think he removed those
10:28karolherbst: distributions started to ship his patches, so he got annoyed by that
10:30Sarayan: ok, looks like I have a compiled kernel corresponding to the installed package version, it's going to be possible to debug things then :-)
10:31karolherbst: Sarayan: ohh, you are the one I asked to test the NvForcePost options, right?
10:33Sarayan: doing too many things in parallel, I have a little latency in my testing :-)
10:33karolherbst: mhh, I have a bad feeling about this issue
10:33karolherbst: I think your vbios might be broken and we require signed PMU firmwares from nvidia to fix it
10:34karolherbst: _not_ quite sure, but
10:34karolherbst: we got updated firmwares for gp108
10:34karolherbst: but we don't have the PMU stuff yet
10:34karolherbst: although we never got broken gp108 firmware
10:35karolherbst: just gp108 was special
10:35Sarayan: also, I have a colleague with optimus and a more recent card which ends up livelocking all the time, rings a bell?
10:35karolherbst: yeah.... nouveau.runpm=0 helps
10:35Sarayan: I can't annoy her as much as I can on my own stuff
10:35karolherbst: basically the GPU dies when trying to do runtime power management
10:36karolherbst: because of... reasons nobody knows
10:36Sarayan: the firmwares, can we disassemble/read them actually?
10:36karolherbst: it's the kind of issue we have for 2 years, everybody looks into it and figures out nothing
10:36Sarayan: it's mips, arm, sh or anything else sane?
10:36karolherbst: Sarayan: yes, but that usually doesn't help
10:36karolherbst: it is a custom ISA
10:36Sarayan: but of course :-)
10:36karolherbst: nvidia calls those chips falcons
10:37Sarayan: reminds me of the gma500
10:37karolherbst: well, the shaders also have their custom ISA, but that's a different one ;)
10:37Sarayan: yeah, but custom isa for the shaders makes perfect sense
10:37Sarayan: for the control, a little less
10:38karolherbst: anyway, on laptops booting with "nouveau.runpm=0" fixes quite a lot of issues, but kills battery lifetime
10:38karolherbst: another workaround is to blacklist nouveau
10:38Sarayan: that makes rmmod nouveau a better solution, which is kind of annoying :-)
10:38RSpliet: Sarayan: that's why NVIDIA is shifting to RISC-V... but it's custom because they use the same type of core for video decoding and DMA and stuff
10:39karolherbst: and just enable the runtime power management stuff manually
10:39karolherbst: either way
10:39karolherbst: you need some customizing in order to be able to use the GPU
10:39karolherbst: I have the same issue on my laptop
10:39karolherbst: and I have custom scripts to remove the PCIe device and invoke some ACPI calls manually
10:39Sarayan: my issue or sabrina's?
10:39karolherbst: the locking up one
10:39Sarayan: (the colleague with the newer card)
10:40Sarayan: ok yeah
10:40karolherbst: _hopefully_ we get some answers from nvidia about it
10:40karolherbst: just that usually takes quite some time
10:40RSpliet: When is XDC again? :-D
10:40karolherbst: RSpliet: next week :p
10:41RSpliet: Sarayan: there might be some answers next week :-P
10:41karolherbst: :D or the week after
10:41karolherbst: I have none
10:41Sarayan: gonna do... stuff on the nvidia driver to see what it tries to do on my hardware at some point
10:42RSpliet: (semi-joking of course, although scepticism about NVIDIA being forthcoming is more a result of time-budgeting for a non-profitable project like nouveau than of actually guarding big scary secrets. Most of them are quite willing to think along if you catch them :-) )
10:42 * karolherbst gives no comment on that one
10:43Sarayan: Yeah, I'm not entirely sure of the why of their position on that, unless it's just inertia at that point
10:43RSpliet: My bet is return-on-investment considerations in management.
10:44Sarayan: yeah, releasing code is an investment, and not a small one
10:44Sarayan: between IP sanitizing and just making it usable for outsiders, it's work
10:45RSpliet: My personal opinion is that their return is underestimated. Nouveau got me interested in GPUs to an extent that I can do research with the open source stuff (bringing the field of computer science forward), and I'm probably employable to work on these things in the future with a smaller investment.
10:45karolherbst: anyway, Lyude also found somebody with the same issue on a gp108? I think
10:45karolherbst: I am sure this is something more common and we just need to figure out what is wrong this time
10:45Sarayan: the livelock?
10:46karolherbst: the can't boot one
10:46Sarayan: I can do whatever on my laptop, I'm somewhat busy but I control it
10:46karolherbst: RSpliet: you are taking more than what you give, shame on you :p
10:46Sarayan: plus RE is my kink, but that's another story :-)
10:47RSpliet: karolherbst: I will not rule out employment by NVIDIA in the future should they have ears for it.
10:47karolherbst: Sarayan: you need to find more time doing RE stuff :p one of the only actual real world puzzles you can't cheat around :D
10:48Sarayan: I'm one of the main MAME developers, how's that for RE? :-)
10:48karolherbst: RSpliet: :p
10:49karolherbst: Sarayan: nice. I kind of get the feeling that outside of the graphics world, console stuff is the most I have contact with regarding my profession :p
10:49Sarayan: heh, especially the switch emus I guess?
10:49Sarayan: tegra ftw
10:50karolherbst: maybe :p
10:50karolherbst: although actually not so much with the switch emulator people
10:51karolherbst: those things take like 5 years until you've sorted the "I am here for attention" people from the "I am here to do awesome shit" ones
10:51karolherbst: so I kind of keep my distance whenever I feel like there is this situation
10:51karolherbst: and that's usually the case whenever you get new emulators for new consoles
10:51karolherbst: then like 5 different emulators arise
10:52karolherbst: and after some time there are 1 or 2 actually working ones
10:52Sarayan: very true
10:52Sarayan: especially piracy engines, I mean emulators for recent, still sold hardware
10:53karolherbst: you don't have to pirate those though
10:53karolherbst: emulators give you still more freedom
10:53karolherbst: like cranking up graphical settings or replacing textures with funny stuff :p
10:53karolherbst: but yeah, I think piracy is the main driver for some
10:54Sarayan: yeah, in mame we avoid recent stuff because of that
10:54Sarayan: ok, I can recompile the module, excellent
10:55Sarayan: hmmm, insmod is not enough to trigger the timeout
10:56Sarayan: guess I need to restart X, which is kind of annoying
10:56Sarayan: ah, starting a second X works
10:57karolherbst: Sarayan: https://gist.githubusercontent.com/karolherbst/4341e3c33b85640eaaa56ff69a094713/raw/c976daa9e406d37e01351357ea9d8c20d5097d66/xorg.conf.d.nouveau.conf
10:57karolherbst: you actually want X to grab the GPU
10:57karolherbst: nvm my link then
10:57karolherbst: it helps with X _not_ grabbing the GPU
10:58Sarayan: well, I want _something_ to grab the gpu to see whether it works
10:58Sarayan: ideally I want to see both gpus under randr and vulkan, but the second is yet another story
10:58Sarayan: ideally++, I want to run NN training on the nvidia while I do normal stuff on the intel
10:59Sarayan: but, well, starting the gpu is a good start :-)
11:00karolherbst: I guess if you want to do NN training you kind of want to use the nvidia driver anyhow, because under nouveau that should be terribly slow
11:00karolherbst: for many reasons
11:00Sarayan: reckling fun?
11:01karolherbst: more than that
11:01Sarayan: reclocking damnit
11:01karolherbst: the tensor cores are real hardware
11:01karolherbst: and nobody ever looked into it
11:01karolherbst: might require ISA changes
11:01karolherbst: might require supporting OpenCL/CUDA
11:01karolherbst: the latter for sure
11:01Sarayan: Can Spir-V handle it?
11:01karolherbst: there is no upstream compute support besides GL compute shaders
11:01karolherbst: Sarayan: maybe? OpenCL requires spir-v
11:02karolherbst: but again, there is no OpenCL support yet
11:02karolherbst: and we are very far away from doing perf optimizations
11:02Prf_Jakob: They released vulkan extensions for the raytracing extensions at least.
11:02karolherbst: sure, but that doesn't really help with NN training
11:03karolherbst: and those extra extensions all require a lot of RE work as well
11:03karolherbst: (and new GPUs most likely)
11:03Sarayan: well, I don't know if I'll manage to do anything, but that happens to be part of my definition of "fun"
11:03karolherbst: same here :p
11:03karolherbst: well if you manage to stick around and do some things, would be already great
11:04karolherbst: I am just quite sure that you won't be able to fix that issue of yours as I kind of fear it is only fixable by nvidia. But hopefully I am wrong here and we just do something stupid
11:04karolherbst: _maybe_ tracing nvidia and nouveau sheds some light
11:05karolherbst: in the pmu reset thing we basically just invoke some PMU code from the vbios
11:05karolherbst: and it is signed
11:05Sarayan: yeah, I have a lua-interfaced x86 emulator I used to reverse-engineer the windows gma500 drivers
11:05karolherbst: so we can't change it
11:05RSpliet: karolherbst: Aren't tensor cores like warp-wide 8*8 8-bit integer MAD units (like, matrix multiplication operations). I don't think they're quite the same as the NVIDIA TPU
11:05Sarayan: need to make it handle amd64, but that shouldn't be too bad
11:06karolherbst: RSpliet: more or less I think?
11:06karolherbst: RSpliet: but you have special instructions for those
11:06karolherbst: RSpliet: I already looked into how that works inside CUDA
11:06karolherbst: PTX I meant
11:06RSpliet: I think it's an ISA thing that... will take some interesting compiler hacking
11:06karolherbst: not really
11:07karolherbst: you pack values into one register
11:07RSpliet: Because I bet there's some hard requirements on register allocation to have all that data close together among multiple threads ;-)
11:07karolherbst: and execute that stuff
11:07karolherbst: RSpliet: yes, all data inside one register ;)
11:07karolherbst: uhm, per "lane" so to speak
11:07RSpliet: Yeah exactly. Don't know how data must be shuffled around among threads to kick that off
11:07karolherbst: so if you have 4 8 bit ints, you pack them into one 32 bit reg
11:07karolherbst: there is no threading going on here afaik
11:08karolherbst: or well, you don't have your threads to synchronize
11:08Sarayan: intel does that with 16-bytes wide registers
11:08Sarayan: it's just vectorization
11:08karolherbst: we do the reverse for 64/96/128 bit ones :p
11:08RSpliet: Well, there is. You have 4*8bit ints per thread, but a warp will represent like an 8*8 matrix. *That* will be multiplied presumably
11:09karolherbst: RSpliet: afaik all operations are done single threaded and you get your speed by synchronizing the work through the API
11:09RSpliet: Or well, two of those matrices. One accessed row-wise, the other column wise
11:09RSpliet: You sure?
11:09karolherbst: pretty much
11:09karolherbst: let me find my ptx file
11:11karolherbst: RSpliet: what you have to do though is to load the values into the tensor cores
11:11RSpliet: Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision product that is then accumulated using FP32 addition with the other intermediate products for a 4x4x4 matrix multiply (see Figure 9).
11:12RSpliet: The Volta tensor cores are accessible and exposed as Warp-Level Matrix Operations in the CUDA 9 C++ API. The API exposes specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use Tensor Cores from a CUDA-C++ program. At the CUDA level, the warp-level interface assumes 16x16 size matrices spanning all 32 threads of the warp.
11:12karolherbst: RSpliet: "9.7.13. Warp Level Matrix Multiply-Accumulate Instructions" in the cuda docs
11:13RSpliet: Ah ok, so the tensor cores have a dedicated register/memory space?
11:13karolherbst: sounds like it
11:13RSpliet: See, tricky compiler stuff :-P
11:13karolherbst: check out the examples
11:13karolherbst: it doesn't look that complicated though
11:14karolherbst: you need to understand it
11:14RSpliet: I'm looking at the Volta whitepaper right now
11:15RSpliet: "you need to understand it" - that's sort of the definition of tricky ;-)
11:15karolherbst: RSpliet: you basically have wmma.store and wmma.load
11:15karolherbst: and then you use normal ALU instructions on the returned data
11:15karolherbst: well at least in PTX
11:15karolherbst: the resulting binary is kind of a mess
11:16karolherbst: there is wmma.mma now to do a few things instead
11:16karolherbst: "Perform a single matrix multiply-and-accumulate operation across a warp"
11:17karolherbst: but in the end it is on the same level as texture instructions regarding complexity
11:44rhyskidd: any reviewers for https://github.com/envytools/envytools/pull/170
11:48karolherbst: hah! cuda 10 is out
11:49Sarayan: llvm 7 is out :-P
11:52mwk: rhyskidd: have a few comments
11:59karolherbst: what the hell are those "Uniform Datapath Instructions"
12:03karolherbst: imirkin_: ^^ any ideas?
12:03karolherbst: RSpliet, maybe you have some ideas about those as well
12:03karolherbst: I have my theories though
12:03karolherbst: and those are to write to workgroup common values
12:04karolherbst: like you have your uniform input which is the same for all threads and you calculate values, which are identical to all threads
12:04karolherbst: so you mark some operations to do computations which is the same for all threads to save time or something?
12:11rhyskidd: mwk: thanks, those were helpful
12:11karolherbst: looks like that: https://gist.githubusercontent.com/karolherbst/fd5c0f2e34d0fa5e48dae5f697d6b7d4/raw/c87b9b6ea1ead3845b27d00d3541261c665ae153/gistfile1.txt
12:11karolherbst: "UIMAD UR6, UR6, UR4, UR5 ;"
12:12karolherbst: "IADD3 R0, R0, UR6, RZ ;"
12:12karolherbst: and those UR registers are indeed separate from the normal ones we got
12:12karolherbst: there are also UPs
12:12pendingchaos: so something like AMD's vector and scalar thing?
12:13karolherbst: no idea
12:13karolherbst: the cuda docs aren't specific about those
12:15karolherbst: but I kind of get the feeling those are critical for performance
12:34karolherbst: "Get the number of uniform predicate registers per warp on the device. Since CUDA 10.0."
12:34karolherbst: soo those are indeed per warp
12:34karolherbst: (from the debugger API)
12:35karolherbst: soo with turing we get per warp registers/predicates, nice
12:35karolherbst: and we have instructions to write to them and normal instructions which can read from those
12:35karolherbst: I think there are 64/8 of those with hard wired always 0/TRUE
13:13imirkin_: karolherbst: i'd guess instructions that use the uniform datapath
13:13imirkin_: i.e. the same data path used for loading uniforms
13:14imirkin_: karolherbst: or it could be instructions that require uniform data in all lanes when they execute
13:19Sarayan: interesting, reset looks called twice, and it's the second one that fails, when waiting for idle
13:20Sarayan: two calls to nvkm_pmu_reset with roughly a tenth of a second between them
13:23karolherbst: imirkin_: yeah, I think it is the latter
13:23karolherbst: imirkin_: as instructions can actually write to those
13:23karolherbst: and non uniform datapath instructions can consume those
13:23karolherbst: but not the other way around
13:23karolherbst: Sarayan: the second one should be the one we run after devinit
13:24karolherbst: uhm wait, no
13:24karolherbst: both are inside the pmu subdev, mhh interesting
13:25karolherbst: Sarayan: mind skipping the one inside nvkm_pmu_init?
13:25karolherbst: interesting to know what happens then
13:25Sarayan: doing that
13:25karolherbst: imirkin_: anyway, could become quite some fun to add that to codegen
13:25karolherbst: as we actually have to push those through RA as well
13:28Sarayan: kh: no more oops, no card in xrandr --listprovider though
13:28Sarayan: is there a way to know if anything is working?
13:29Sarayan: since it's a everything-connected-to-intel optimus
13:29karolherbst: Sarayan: check dmesg
13:30Sarayan: nothing added after the [ 9369.286810] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
13:30Sarayan: of the insmod
13:30Sarayan: I did get my printk tucked into pmu_reset though
13:30karolherbst: then it should work
13:30karolherbst: more or less
13:30Sarayan: Xorg.2.log only talks about intel
13:30karolherbst: try a DRI_PRIME=1 glxinfo
13:30karolherbst: ohh, mhh
13:30karolherbst: dunno how your X setup is
13:31Sarayan: oh, that does (bad :-) things
13:31karolherbst: might be the runtime suspend/resume issue
13:32karolherbst: try to boot with nouveau.runpm=1 in addition
13:33Sarayan: I have a [ 9580.790811] nouveau 0000:01:00.0: timeout
13:33Sarayan: [ 9580.790851] WARNING: CPU: 5 PID: 2942 at drivers/gpu/drm/nouveau/nvkm/subdev/secboot/ls_ucode_msgqueue.c:192 acr_ls_sec2_post_run+0x1fd/0x250 [nouveau]
13:33Sarayan: [ 9580.804536] nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 6013d4 [ TIMEOUT ]
13:33Sarayan: [ 9582.801317] nouveau 0000:01:00.0: gr: init failed, -16
13:34Sarayan: bunch of oops on vmm flush
13:34Sarayan: X config only tells intel to use dri3, nothing more
13:39karolherbst: Sarayan: what happens if you apply this patch on top? https://github.com/karolherbst/nouveau/commit/67c57185d716792f806b73f24bf3826040b217a1
13:40Sarayan: I *love* that commit message
13:43cliff-hm: I foresee someone 3 years from now trying to delete those lines as 'useless', it breaks and they struggle to understand why, and end up reluctantly putting them back :)
13:44cliff-hm: (the comments help to hopefully prevent that though)
13:45Sarayan: hmmmm, how do I tell github to give me the diff as a diff I can patch in?
13:47Sarayan: whatever, did it by hand, it's small
13:56imirkin_: Sarayan: add .patch to the url
13:57imirkin_: you can then git am it.
13:57imirkin_: i.e. https://github.com/karolherbst/nouveau/commit/67c57185d716792f806b73f24bf3826040b217a1.patch
14:04Sarayan: ok, I'm getting some weird behaviour, which probably means I need to do that at home
14:04Sarayan: (recompiling the kernel package locally, installing it, then mucking with nouveau)
14:05Sarayan: I have a slight mismatch between the kernel and the module right now, and that makes it cranky
15:33RSpliet: Ooo! The parboil/SPEC FFT benchmark is broken in the most subtle way :')
17:04RSpliet: Deets (you might enjoy this one, pmoreau :p): pretty much the first thing the FFT benchmark (not actually published as part of parboil, but somehow is part of it and made it into spec) does is check whether M_PI is defined, and defines it as 3.141592653589793238462643f if not
17:04RSpliet: This constant is then used to calculate the angle for each data point. So multiply with one or two integers, divide by a bit more and presto, an angle
17:05RSpliet: ... OpenCL 1.2 specs define M_PI to be a double precision number.
17:05RSpliet: the #ifndef M_PI makes sure that it's not redefined as a float of course
17:06RSpliet: The result is that that angle calculation is entirely performed in f64 arithmetic, including a full f64 division, before being converted to f32.
17:07RSpliet: The solution: use the M_PI_F constant, which is defined in the OpenCL spec as the f32 equivalent
17:07RSpliet: If you do so, runtime of the benchmark for a serious data set drops to just under half on my GT(X?)650
17:09RSpliet: Judging by the #ifndef, #define[...] the programmers' intentions were definitely to use f32 logic. Hence, bug.
17:12karolherbst: question is: is OpenCL stupid or the devs :p
17:13Sarayan: well, I rarely heard good things about opencl :/
17:14karolherbst: yeah, OpenCL ain't great
17:15karolherbst: the best about OpenCL is, that... well, you are free to implement it
17:15karolherbst: I am sure openmp would be a better alternative in all cases
17:15karolherbst: but... there is basically no toolchain support for openmp on GPUs except nvidia
17:16karolherbst: kind of wondering if we might be able to write some openmp backend in mesa and have a stable mesa-openmp API other projects could develop against
17:16karolherbst: would be fun
17:21Sarayan: is there anything nvidia-specific in cuda actually?
17:23karolherbst: hard to say
17:24karolherbst: in the end it was designed for nvidia GPUs, so the API should be closer to the nvidia hw than any other
17:24HdkR: Do you consider HMMA instructions to be Nvidia specific? What about the nonsense that PTX instructions as inline asm give you? :)
17:24karolherbst: well, the ptx stuff is quite high level
17:24karolherbst: ptx isn't even assembly
17:25HdkR: Yea, it is just treated as inline assembly that gets lifted to IR later
17:25karolherbst: it is just wrong to call it assembly for many reasons :p
17:26karolherbst: anyway, I doubt that PTX is hard to translate to any other ISA in the end
17:26karolherbst: some instructions might be a bit more challenging though
17:27HdkR: HIP shows off that it can be done right?
17:27HdkR: Probably very fragile though
17:28karolherbst: never looked into it
17:29karolherbst: fixing that multithreading stuff is super painful :/
17:29karolherbst: a lot of moving code around and rename stuff
17:29karolherbst: but it might actually end up in a cleaner driver afterall
17:30RSpliet: People complain about OpenCL a lot, it's really quite good. The toolchains are lacking though
17:30karolherbst: problem is, the nv50 and the nv30 have to be reworked at the same time more or less
17:31karolherbst: RSpliet: compared to Cuda OpenCL is quite ... well, it doesn't give you much
17:31karolherbst: the biggest problem with OpenCL is, that it is just a language+API specification
17:31karolherbst: and things like debugging are no concern
17:32RSpliet: That's exactly what I said. Toolchains are lacking
17:32karolherbst: should have been part of OpenGL
17:32karolherbst: debugger API
17:32karolherbst: it isn't about the toolchain really
17:33karolherbst: if OpenCL would have gotten a debugger API, you could actually write a debugger against it. Just designing that API would be tons of work
17:34RSpliet: Oh you specifically speak of a GDB-like interface. Well... yes, that would need to be standardised somehow
17:36RSpliet: But a lot could already be gained from vendor-specific tools. CodeXL I thought was helpful for profiling, perf tuning and warning for silly mistakes
17:36RSpliet: NVIDIA has those tools for CUDA, but killed their OpenCL counterparts (what used to be a nice Eclipse plugin) quite early on
18:17karolherbst: inside nv30: /*XXX: *cough* per-context pushbufs */
18:17karolherbst: /*XXX: *cough* per-context client */
19:11mupuf: karolherbst: hehe
20:47RSpliet: karolherbst: Almost as if you, a good pirate, just found the map to the booty!
20:48RSpliet: XXX marks the spot!
22:22karolherbst: nice, first full compilation
22:27bubblethink: Hi. Is there any way to limit the amount of memory that is visible to an nvidia gpu ? Something like the mem=limit parameter
22:28karolherbst: bubblethink: not really, why would you wanna do that?
22:28HdkR: Get around the GTX 970 last 512MB issue? :P
22:29karolherbst: HdkR: on that GPU you have different issues :p
22:32bubblethink: karolherbst, this is for research. I am doing some research around CUDA's unified memory behaviour. The GPU with the smallest memory is 2 GB right now (1030). I want working sets that don't fit in the GPU's memory. 2 GB is fine, but still quite big for fast prototyping
22:32bubblethink: Essentially, I want a lot of evictions and CPU<->GPU transfers
22:33karolherbst: you could also read the nvidia-uvm code, everything should be in there
22:33karolherbst: or modify it
22:33bubblethink: yes, was planning to do that next
22:33karolherbst: I see
22:33karolherbst: maybe there is a way to limit memory on nvidia GPUs, dunno
22:34bubblethink: my other hackish idea is to allocate a chunk with cudamalloc, and that should hopefully remain untouched. Then whatever remains, if allocated with cudamallocmanaged, would work as effective memory
22:34karolherbst: although, there are some register read/writes which one could intercept and provide own values
22:34karolherbst: but that might be a bit too challenging
22:34karolherbst: and maybe even slower
22:40karolherbst: ohh fun
22:40karolherbst: nvidia is going crazy with the chipsets number with turing
22:40HdkR: Crazy you say?
22:41karolherbst: we have like 7 GPU models and already 4 different chipsets
22:42HdkR: TU102, TU104, TU106, + non-FE vendor IDs? :P
22:42karolherbst: TU116 as well
23:26karolherbst: wow, even glxgears renders
23:26karolherbst: and dolphin crashes, nice
23:28karolherbst: that seems... trivial to fix?
23:29HdkR: Fixing threaded VAO binding and glFinish? :)
23:30karolherbst: it is more of a core issue
23:31karolherbst: threading again, but different
23:33karolherbst: mhh, helgrind doesn't report any race conditions though
23:34karolherbst: realloc fails because OOM
23:34HdkR: Time to download more ram
23:34karolherbst: no, that was just valgrind
23:34karolherbst: buh valgrind, buh :p
23:50karolherbst: okay, out of bound reads
23:50karolherbst: uhm, writes actually