04:13 imirkin: HdkR: finally! GL_NV_fragment_shader_barycentric
04:13 imirkin: nvidia joins the fun.
04:15 imirkin: gl_BaryCoordNV + gl_BaryCoordNoPerspNV, as well as being able to specify varyings as "pervertex"
04:24 imirkin: wow. someone's been optimizing for VR... GL_NV_shading_rate_image
04:35 HdkR: imirkin: I know right. Been a long time coming
04:36 HdkR: Now Dolphin can finally implement manual texture coordinate interpolation to fix the nvidia interpolation bug :P
04:36 HdkR: Mesh shaders also look interesting
04:42 imirkin: yeah. tess on crack :)
04:43 imirkin: basically for the, as i understand it, fairly common usage of "compute shader generates geometry"
04:43 imirkin: i wonder if there's much pickup over doing that + draw indirect
04:44 imirkin: but then even a 5-10% improvement would be worth their while
04:44 HdkR: I feel like Mesh shaders are going to be something that people will use
04:44 imirkin: yeah, it's more straightforward than compute + indirect
04:45 HdkR: The mesh shaders themselves act almost identical to compute as well
04:45 imirkin: right
04:45 imirkin: should be easy to port
04:47 HdkR: shading rate — it'll be interesting to see if people take advantage of that extensively. Hip new feature to improve perf, kind of like checkerboard rendering
04:47 HdkR: Seems like it would need to be in the consoles for anyone to care
04:48 imirkin: could have a huge impact though
04:48 imirkin: like 100% perf improvement
04:49 HdkR: Yea, just think if you're rendering at 8k and you can cut the sampling rate of most objects to pretty much nothing
04:49 imirkin: you need to render like ... 1 fragment per 16x16 block on the side of a 4k image that's being displayed in VR glasses.
04:49 HdkR: yep :D
04:49 imirkin: if that
04:53 HdkR: I need to look at how flexible it is. Looks like you bind some sort of rate image when you're rendering something
04:53 airlied: need non square scissors :-p
04:55 HdkR: It's all about rectangles :)
04:55 imirkin: you just provide a different rate for each 16x16 pixel block.
04:55 imirkin: in a R8UI image. presumably 0 = 1, 255 = 256.
04:56 imirkin: i didn't read the specs carefully.
04:56 HdkR: Yea, I haven't opened the spec sheet for that one at all yet. Just listening to people talk about it
04:57 imirkin: (or 16x16 fragment block. dunno.)
04:59 HdkR: That's one heck of a reduction on the fragment side
05:00 imirkin: exactly.
05:00 imirkin: so it could be huge for VR perf
05:01 imirkin: i bet you could get away with an average of 50% coverage across the image
05:01 imirkin: i.e. 2x faster rast
05:01 imirkin: although ... obviously it's also a function of polygon counts and whatnot
05:01 imirkin: [and where those polygons are]
05:03 HdkR: Something like SuperHot in VR where lowering the fragment sampling rate wouldn't make a visual impact at all would be neat
09:55 RSpliet: kernel-3xp: yeah... I know exactly what to do for Fermi reclocking, it's mostly a solved problem but one that takes a lot of time to fiddle about with details. Sadly, I'm nearing the final phase of a PhD, so I literally have negative time and energy to spend on nouveau.
09:57 RSpliet: skeggsb also really knows what it would take, but he's tied up in other responsibilities (things like fixing multi-threaded mesa, Turing bring-up, maybe doing some stuff for Vulkan, general code quality and error handling, taking care of Red Hat Enterprise customer bug reports... you know, a long long todo list :-D)
10:06 kernel-3xp: ah ok, :/ good luck with the phd though
10:09 RSpliet: Yeah. It'd be nice if NVIDIA could stop producing new hardware for the next four or five years so we can finally catch up :-P
10:10 karolherbst: :D
10:10 karolherbst: that would be boring :p
10:10 karolherbst: we just have to get better :p
10:11 RSpliet: karolherbst: we just need to open a big tin of talented developers and reverse engineers
10:11 karolherbst: RSpliet: well skeggsb worked mainly on volta bringup and I am sure he will look into turing as well. And currently I am working on the MT issues
10:11 karolherbst: RSpliet: working on it :p
10:11 RSpliet: karolherbst: that's good news though, MT would be a big win (and a prerequisite for Vulkan?)
10:12 karolherbst: but I doubt I got anybody interested doing fermi reclocking work though
10:12 karolherbst: RSpliet: we wouldn't have the problem with vulkan
10:12 karolherbst: because, it would be a new driver and we wouldn't make the same mistakes there
10:12 karolherbst: I was already looking into the issue and I kind of understood what those are
10:13 RSpliet: karolherbst: whenever you wonder whether anyone is interested in X, remember mooch is interested in exact hardware behavioural specs for like NV5 or something ;-)
10:13 karolherbst: in mesa we have a fence list per screen
10:13 karolherbst: but, we need that per context/thread
10:13 karolherbst: like in case you use shared context
10:13 karolherbst: dolphin does this to compile shaders in parallel
10:13 karolherbst: and things break on glFinish() as multiple threads hammer on the fence list
10:14 RSpliet: Shouldn't the fence list somehow become atomic or serialised regardless...? I'm not sure if every fork will create a new context, and you might want to share fences between your threads. Or am I thinking too simplistically? :-D
10:15 karolherbst: no, why?
10:15 karolherbst: just give each context its own
10:15 karolherbst: you can't use the same context from different threads anyhow
10:15 RSpliet: Ah ok, GPU context. Yeah that'd make sense then
10:15 karolherbst: uhm, I meant GL context
10:15 karolherbst: basically a nouveau_context struct
10:15 RSpliet: Yeah, I assume there's a correlation with GPU HW contexts :-)
10:16 karolherbst: still need to figure out the details
10:16 RSpliet: No shared fences used to signal the compositor that your windowed OGL app is finished rendering a frame?
10:16 karolherbst: like can we have multiple threads waiting on the same context or something through async gl calls
10:16 karolherbst: RSpliet: well, the main problem are shared contexts, so imagine your compositor uses one shared context per client
10:16 karolherbst: and now it does gl calls on each of those
10:17 RSpliet: Ok, that's over my head now :-D
10:17 karolherbst: so you share the nouveau_screen on all contexts where all the fencing is
10:17 karolherbst: helgrind is already going crazy over our code :p
10:18 karolherbst: most common offenders are the pushbuffers and the fence list we've got
10:19 karolherbst: one example: https://gist.githubusercontent.com/karolherbst/5f6c5eb70f7facb6e676e0841db1596d/raw/1153f66300dfd9d1ed7affd597d94c73a7fd6da7/gistfile1.txt
10:21 RSpliet: Ah
10:21 karolherbst: thread #4 seems to be the rendering/main thread and dolphin spawns some compiler threads
10:21 RSpliet: Eh, valgrind is just a noisy little nuisance right, we can ignore those warnings!
10:21 karolherbst: :p
10:22 karolherbst: well, dolphin crashes with async shader compiles :D
10:22 karolherbst: "it's just an optimization, we can disable it :p"
10:23 RSpliet: That's not a bug, that's a feature helping people be more productive by deterring them from using dolphin
10:23 RSpliet: ...
10:23 RSpliet: Ok, that's terrible :-D
10:23 karolherbst: maybe by simply moving the fence lists into context, we would fix tons of issues already... not quite sure. Still need to get a better understanding about the code itself
10:24 karolherbst: I am sure HdkR did saw your comment :p
10:24 karolherbst: *didn't I meant...
10:26 RSpliet: Well, you have ideas to chase!
10:26 karolherbst: yeah... I kind of hope I don't have to rework everything in one step
10:26 karolherbst: if I find some "smaller" changes which fix a few bugs, but not everything, I am happy already
10:28 karolherbst: I think imirkin_ had some branches around, but I think he removed those
10:28 karolherbst: distributions started to ship his patches, so he got annoyed by that
10:30 Sarayan: ok, looks like I have a compiled kernel corresponding to the installed package version, it's going to be possible to debug things then :-)
10:31 karolherbst: Sarayan: ohh, you are the one I asked to test the NvForcePost options, right?
10:32 Sarayan: yup
10:33 Sarayan: doing too many things in parallel, I have a little latency in my testing :-)
10:33 karolherbst: mhh, I have a bad feeling about this issue
10:33 karolherbst: I think your vbios might be broken and we require signed PMU firmwares from nvidia to fix it
10:33 Sarayan: oh?
10:34 karolherbst: _not_ quite sure, but
10:34 karolherbst: we got updated firmwares for gp108
10:34 karolherbst: but we don't have the PMU stuff yet
10:34 karolherbst: although we never got broken gp108 firmware
10:35 karolherbst: just gp108 was special
10:35 Sarayan: also, I have a colleague with optimus and a more recent card which ends up livelocking all the time, rings a bell?
10:35 karolherbst: yeah.... nouveau.runpm=0 helps
10:35 Sarayan: I can't annoy her as much as I can on my own stuff
10:35 karolherbst: basically the GPU dies when trying to do runtime power management
10:36 karolherbst: because of... reasons nobody knows
10:36 Sarayan: ouch
10:36 Sarayan: the firmwares, can we disassemble/read them actually?
10:36 karolherbst: it's the kind of issue we've had for 2 years; everybody looks into it and figures out nothing
10:36 Sarayan: it's mips, arm, sh or anything else sane?
10:36 karolherbst: Sarayan: yes, but that usually doesn't help
10:36 karolherbst: uhm
10:36 karolherbst: it is a custom ISA
10:36 Sarayan: but of course :-)
10:36 karolherbst: nvidia calls those chips falcons
10:37 Sarayan: reminds me of the gma500
10:37 karolherbst: well, the shaders also have their custom ISA, but that's a different one ;)
10:37 Sarayan: yeah, but custom isa for the shaders makes perfect sense
10:37 Sarayan: for the control, a little less
10:38 karolherbst: anyway, on laptops booting with "nouveau.runpm=0" fixes quite a lot of issues, but kills battery lifetime
10:38 Sarayan: urgh
10:38 karolherbst: well
10:38 karolherbst: another workaround is to blacklist nouveau
10:38 Sarayan: that makes rmmod nouveau a better solution, which is kind of annoying :-)
10:38 RSpliet: Sarayan: that's why NVIDIA is shifting to RISC-V... but it's custom because they use the same type of core for video decoding and DMA and stuff
10:39 karolherbst: and just enable the runtime power management stuff manually
10:39 karolherbst: either way
10:39 karolherbst: you need some customizing in order to be able to use the GPU
10:39 karolherbst: I have the same issue on my laptop
10:39 karolherbst: and I have custom scripts to remove the PCIe device and invoke some ACPI calls manually
10:39 Sarayan: my issue or sabrina's?
10:39 karolherbst: the locking up one
10:39 Sarayan: (the colleague with the newer card)
10:40 Sarayan: ok yeah
10:40 karolherbst: _hopefully_ we get some answers from nvidia about it
10:40 Sarayan: heh
10:40 karolherbst: just that usually takes quite some time
10:40 RSpliet: When is XDC again? :-D
10:40 karolherbst: RSpliet: next week :p
10:41 RSpliet: Sarayan: there might be some answers next week :-P
10:41 karolherbst: :D or the week after
10:41 karolherbst: I have none
10:41 Sarayan: gonna do... stuff on the nvidia driver to see what it tries to do on my hardware at some point
10:42 RSpliet: (semi-joking of course; scepticism about NVIDIA being forthcoming is more a result of time-budgeting for a non-profitable project like nouveau than of actually guarding big scary secrets. Most of them are quite willing to think along if you catch them :-) )
10:42 * karolherbst gives no comment on that one
10:42 karolherbst: :p
10:43 Sarayan: Yeah, I'm not entirely sure of the why of their position on that, unless it's just inertia at that point
10:43 RSpliet: My bet is return-on-investment considerations in management.
10:44 Sarayan: yeah, releasing code is an investment, and not a small one
10:44 Sarayan: between IP sanitizing and just making it usable for outsiders, it's work
10:45 RSpliet: My personal opinion is that their return is underestimated. Nouveau got me interested in GPUs to an extent that I can do research with the open source stuff (bringing the field of computer science forward), and I'm probably employable to work on these things in the future with a smaller investment.
10:45 karolherbst: anyway, Lyude also found somebody with the same issue on a gp108? I think
10:45 karolherbst: I am sure this is something more common and we just need to figure out what is wrong this time
10:45 Sarayan: the livelock?
10:45 karolherbst: no
10:46 karolherbst: the can't boot one
10:46 Sarayan: ok
10:46 Sarayan: I can do whatever on my laptop, I'm somewhat busy but I control it
10:46 karolherbst: RSpliet: you are taking more than what you give, shame on you :p
10:46 Sarayan: plus RE is my kink, but that's another story :-)
10:47 RSpliet: karolherbst: I will not rule out employment by NVIDIA in the future should they have ears for it.
10:47 karolherbst: Sarayan: you need to find more time doing RE stuff :p one of the only actual real world puzzles you can't cheat around :D
10:48 Sarayan: I'm one of the main MAME developers, how's that for RE? :-)
10:48 karolherbst: RSpliet: :p
10:49 karolherbst: Sarayan: nice. I kind of get the feeling that outside of the graphics world, console stuff is what I have the most contact with regarding my profession :p
10:49 Sarayan: heh, especially the switch emus I guess?
10:49 Sarayan: tegra ftw
10:50 karolherbst: maybe :p
10:50 karolherbst: although actually not so much with the switch emulator people
10:51 karolherbst: those things take like 5 years until you've ruled out the "I am here for attention" people from the "I am here to do awesome shit" ones
10:51 karolherbst: so I kind of keep my distance whenever I feel like there is this situation
10:51 karolherbst: and that's usually the case whenever you get new emulators for new consoles
10:51 karolherbst: then like 5 different emulators arise
10:52 karolherbst: and after some time there are 1 or 2 actually working ones
10:52 Sarayan: very true
10:52 Sarayan: especially piracy engines, I mean emulators for recent, still sold hardware
10:53 karolherbst: well
10:53 karolherbst: you don't have to pirate those though
10:53 karolherbst: emulators still give you more freedom
10:53 karolherbst: like cranking up graphical settings or replacing textures with funny stuff :p
10:53 karolherbst: but yeah, I think piracy is the main driver for some
10:54 Sarayan: yeah, in mame we avoid recent stuff because of that
10:54 Sarayan: ok, I can recompile the module, excellent
10:55 Sarayan: hmmm, insmod is not enough to trigger the timeout
10:56 Sarayan: guess I need to restart X, which is kind of annoying
10:56 Sarayan: ah, starting a second X works
10:57 karolherbst: Sarayan: https://gist.githubusercontent.com/karolherbst/4341e3c33b85640eaaa56ff69a094713/raw/c976daa9e406d37e01351357ea9d8c20d5097d66/xorg.conf.d.nouveau.conf
10:57 karolherbst: ohh
10:57 karolherbst: you actually want X to grab the GPU
10:57 karolherbst: nvm my link then
10:57 karolherbst: it helps with X _not_ grabbing the GPU
10:58 Sarayan: well, I want _something_ to grab the gpu to see whether it works
10:58 Sarayan: ideally I want to see both gpus under randr and vulkan, but the second is yet another story
10:58 Sarayan: ideally++, I want to run NN training on the nvidia while I do normal stuff on the intel
10:59 Sarayan: but, well, starting the gpu is a good start :-)
10:59 karolherbst: mhh
11:00 karolherbst: I guess if you want to do NN training you kind of want to use the nvidia driver anyhow, because under nouveau that should be terribly slow
11:00 karolherbst: for many reasons
11:00 Sarayan: reckling fun?
11:01 karolherbst: more than that
11:01 Sarayan: reclocking damnit
11:01 karolherbst: the tensor cores are real hardware
11:01 karolherbst: and nobody ever looked into it
11:01 karolherbst: might require ISA changes
11:01 karolherbst: might require supporting OpenCL/CUDA
11:01 karolherbst: well
11:01 karolherbst: the latter for sure
11:01 Sarayan: Can Spir-V handle it?
11:01 karolherbst: there is no upstream compute support besides GL compute shaders
11:01 karolherbst: Sarayan: maybe? OpenCL requires spir-v
11:02 karolherbst: but again, there is no OpenCL support yet
11:02 karolherbst: and we are very far away from doing perf optimizations
11:02 Prf_Jakob: They released the vulkan raytracing extensions at least.
11:02 karolherbst: sure, but that doesn't really help with NN training
11:03 karolherbst: and those extra extensions all require a lot of RE work as well
11:03 karolherbst: (and new GPUs most likely)
11:03 Sarayan: well, I don't know if I'll manage to do anything, but that happens to be part of my definition of "fun"
11:03 karolherbst: same here :p
11:03 karolherbst: well if you manage to stick around and do some things, would be already great
11:04 karolherbst: I am just quite sure that you won't be able to fix that issue of yours, as I kind of fear it is only fixable by nvidia. But hopefully I am wrong here and we just do something stupid
11:04 karolherbst: but
11:04 karolherbst: _maybe_ tracing nvidia and nouveau sheds some light
11:05 karolherbst: dunno
11:05 karolherbst: in the pmu reset thing we basically just invoke some PMU code from the vbios
11:05 karolherbst: and it is signed
11:05 Sarayan: yeah, I have a lua-interfaced x86 emulator I used to reverse-engineer the windows gma500 drivers
11:05 karolherbst: so we can't change it
11:05 RSpliet: karolherbst: Aren't tensor cores like warp-wide 8*8 8-bit integer MAD units (like, matrix multiplication operations). I don't think they're quite the same as the NVIDIA TPU
11:05 Sarayan: need to make it handle amd64, but that shouldn't be too bad
11:06 karolherbst: RSpliet: more or less I think?
11:06 karolherbst: RSpliet: but you have special instructions for those
11:06 karolherbst: RSpliet: I already looked into how that works inside CUDA
11:06 karolherbst: uhm
11:06 karolherbst: PTX I meant
11:06 RSpliet: I think it's an ISA thing that... will take some interesting compiler hacking
11:06 karolherbst: not really
11:07 karolherbst: you pack values into one register
11:07 RSpliet: Because I bet there's some hard requirements on register allocation to have all that data close together among multiple threads ;-)
11:07 karolherbst: and execute that stuff
11:07 karolherbst: RSpliet: yes, all data inside one register ;)
11:07 karolherbst: uhm, per "lane" so to speak
11:07 RSpliet: Yeah exactly. Don't know how data must be shuffled around among threads to kick that off
11:07 karolherbst: so if you have 4 8 bit ints, you pack them into one 32 bit reg
11:07 karolherbst: there is no threading going on here afaik
11:08 karolherbst: or well, you don't have your threads to synchronize
11:08 Sarayan: intel does that with 16-bytes wide registers
11:08 Sarayan: it's just vectorization
11:08 karolherbst: we do the reverse for 64/96/128 bit ones :p
11:08 RSpliet: Well, there is. You have 4*8bit ints per thread, but a warp will represent like an 8*8 matrix. *That* will be multiplied presumably
11:09 karolherbst: RSpliet: afaik all operations are done single-threaded and you get your speed by synchronizing the work through the API
11:09 RSpliet: Or well, two of those matrices. One accessed row-wise, the other column wise
11:09 RSpliet: You sure?
11:09 karolherbst: pretty much
11:09 karolherbst: let me find my ptx file
11:11 karolherbst: RSpliet: what you have to do though is to load the values into the tensor cores
11:11 RSpliet: Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision product that is then accumulated using FP32 addition with the other intermediate products for a 4x4x4 matrix multiply (see Figure 9).
11:12 RSpliet: The Volta tensor cores are accessible and exposed as Warp-Level Matrix Operations in the CUDA 9 C++ API. The API exposes specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use Tensor Cores from a CUDA-C++ program. At the CUDA level, the warp-level interface assumes 16x16 size matrices spanning all 32 threads of the warp.
11:12 karolherbst: RSpliet: "9.7.13. Warp Level Matrix Multiply-Accumulate Instructions" in the cuda docs
11:13 RSpliet: Ah ok, so the tensor cores have a dedicated register/memory space?
11:13 karolherbst: sounds like it
11:13 RSpliet: See, tricky compiler stuff :-P
11:13 karolherbst: check out the examples
11:13 karolherbst: it doesn't look that complicated though
11:14 karolherbst: just
11:14 karolherbst: you need to understand it
11:14 RSpliet: I'm looking at the Volta whitepaper right now
11:15 RSpliet: "you need to understand it" - that's sort of the definition of tricky ;-)
11:15 karolherbst: RSpliet: you basically have wmma.store and wmma.load
11:15 karolherbst: and then you use normal ALU instructions on the returned data
11:15 karolherbst: well at least in PTX
11:15 karolherbst: the resulting binary is kind of a mess
11:16 karolherbst: there is wmma.mma now to do a few things instead
11:16 karolherbst: "Perform a single matrix multiply-and-accumulate operation across a warp"
11:17 karolherbst: but in the end it is on the same level as texture instructions regarding complexity
11:44 rhyskidd: any reviewers for https://github.com/envytools/envytools/pull/170
11:48 karolherbst: hah! cuda 10 is out
11:49 Sarayan: llvm 7 is out :-P
11:52 mwk: rhyskidd: have a few comments
11:59 karolherbst: sooo
11:59 karolherbst: what the hell are those "Uniform Datapath Instructions"
12:03 karolherbst: imirkin_: ^^ any ideas?
12:03 karolherbst: RSpliet, maybe you have some ideas about those as well
12:03 karolherbst: I have my theories though
12:03 karolherbst: and those are to write to workgroup common values
12:04 karolherbst: like you have your uniform input which is the same for all threads and you calculate values, which are identical to all threads
12:04 karolherbst: so you mark some operations to do computations which is the same for all threads to save time or something?
12:11 rhyskidd: mwk: thanks, those were helpful
12:11 karolherbst: looks like that: https://gist.githubusercontent.com/karolherbst/fd5c0f2e34d0fa5e48dae5f697d6b7d4/raw/c87b9b6ea1ead3845b27d00d3541261c665ae153/gistfile1.txt
12:11 karolherbst: "UIMAD UR6, UR6, UR4, UR5 ;"
12:12 karolherbst: "IADD3 R0, R0, UR6, RZ ;"
12:12 karolherbst: and those UR registers are indeed separated from the normal ones we got
12:12 karolherbst: there are also UPs
12:12 pendingchaos: so something like AMD's vector and scalar thing?
12:13 karolherbst: no idea
12:13 karolherbst: the cuda docs aren't specific about those
12:15 karolherbst: but I kind of get the feeling those are critical for performance
12:34 karolherbst: ahhhh
12:34 karolherbst: "Get the number of uniform predicate registers per warp on the device. Since CUDA 10.0."
12:34 karolherbst: soo those are indeed per warp
12:34 karolherbst: (from the debugger API)
12:35 karolherbst: soo with turing we get per warp registers/predicates, nice
12:35 karolherbst: and we have instructions to write to them and normal instructions which can read from those
12:35 karolherbst: I think there are 64 registers / 8 predicates of those, with one hard-wired to always 0/TRUE
13:13 imirkin_: karolherbst: i'd guess instructions that use the uniform datapath
13:13 imirkin_: i.e. the same data path used for loading uniforms
13:14 imirkin_: karolherbst: or it could be instructions that require uniform data in all lanes when they execute
13:19 Sarayan: interesting, reset looks called twice, and it's the second one that fails, when waiting for idle
13:20 Sarayan: two calls to nvkm_pmu_reset with roughly a tenth of a second between them
13:23 karolherbst: imirkin_: yeah, I think it is the latter
13:23 karolherbst: imirkin_: as instructions can actually write to those
13:23 karolherbst: and non uniform datapath instructions can consume those
13:23 karolherbst: but not the other way around
13:23 karolherbst: Sarayan: the second one should be the one we run after devinit
13:24 karolherbst: uhm wait, no
13:24 karolherbst: both are inside the pmu subdev, mhh interesting
13:25 karolherbst: Sarayan: mind skipping the one inside nvkm_pmu_init?
13:25 karolherbst: interesting to know what happens then
13:25 Sarayan: doing that
13:25 karolherbst: imirkin_: anyway, could become quite some fun to add that to codegen
13:25 karolherbst: as we actually have to push those through RA as well
13:28 Sarayan: kh: no more oops, no card in xrandr --listprovider though
13:28 Sarayan: is there a way to know if anything is working?
13:29 Sarayan: since it's a everything-connected-to-intel optimus
13:29 karolherbst: Sarayan: check dmesg
13:30 Sarayan: nothing added after the [ 9369.286810] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
13:30 Sarayan: of the insmod
13:30 karolherbst: mhh
13:30 Sarayan: I did get my printk tucked into pmu_reset though
13:30 karolherbst: then it should work
13:30 karolherbst: more or less
13:30 Sarayan: Xorg.2.log only talks about intel
13:30 karolherbst: try a DRI_PRIME=1 glxinfo
13:30 karolherbst: ohh, mhh
13:30 karolherbst: dunno how your X setup is
13:31 Sarayan: oh, that does (bad :-) things
13:31 Sarayan: minimal
13:31 karolherbst: yeah...
13:31 karolherbst: might be the runtime suspend/resume issue
13:32 karolherbst: try to boot with nouveau.runpm=1 in addition
13:32 karolherbst: uhm
13:32 karolherbst: =0
13:33 Sarayan: I have a [ 9580.790811] nouveau 0000:01:00.0: timeout
13:33 Sarayan: [ 9580.790851] WARNING: CPU: 5 PID: 2942 at drivers/gpu/drm/nouveau/nvkm/subdev/secboot/ls_ucode_msgqueue.c:192 acr_ls_sec2_post_run+0x1fd/0x250 [nouveau]
13:33 Sarayan: [ 9580.804536] nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 6013d4 [ TIMEOUT ]
13:33 Sarayan: [ 9582.801317] nouveau 0000:01:00.0: gr: init failed, -16
13:34 Sarayan: bunch of oops on vmm flush
13:34 Sarayan: etc
13:34 Sarayan: unhappy
13:34 Sarayan: X config only tells intel to use dri3, nothing more
13:39 karolherbst: mhh
13:39 karolherbst: Sarayan: what happens if you apply this patch on top? https://github.com/karolherbst/nouveau/commit/67c57185d716792f806b73f24bf3826040b217a1
13:40 Sarayan: I *love* that commit message
13:43 cliff-hm: I foresee someone 3 years from now trying to delete those lines as 'useless', it breaks and they struggle to understand why, and end up reluctantly putting them back :)
13:44 cliff-hm: (the comments help to hopefully prevent that though)
13:45 Sarayan: hmmmm, how do I tell github to give me the diff as a diff I can patch in?
13:47 Sarayan: whatever, did it by hand, it's small
13:56 imirkin_: Sarayan: add .patch to the url
13:57 imirkin_: you can then git am it.
13:57 imirkin_: i.e. https://github.com/karolherbst/nouveau/commit/67c57185d716792f806b73f24bf3826040b217a1.patch
14:04 Sarayan: ok, I'm getting some weird behaviour, which probably means I need to do that at home
14:04 Sarayan: (recompiling the kernel package locally, installing it, then mucking with nouveau)
14:05 Sarayan: I have a slight mismatch between the kernel and the module right now, and that makes it cranky
15:33 RSpliet: Ooo! The parboil/SPEC FFT benchmark is broken in the most subtle way :')
16:15 HdkR: :)
17:04 RSpliet: Deets (you might enjoy this one, pmoreau :p): pretty much the first thing the FFT benchmark (not actually published as part of parboil, but somehow is part of it and made it into spec) does is check whether M_PI is defined, and defines it as 3.141592653589793238462643f if not
17:04 RSpliet: This constant is then used to calculate the angle for each data point. So multiply with one or two integers, divide by a bit more and presto, an angle
17:05 RSpliet: ... OpenCL 1.2 specs define M_PI to be a double precision number.
17:05 RSpliet: the #ifndef M_PI makes sure that it's not redefined as a float of course
17:06 RSpliet: The result is that that angle calculation is entirely performed in f64 arithmetic, including a full f64 division, before being converted to f32.
17:07 RSpliet: The solution: use the M_PI_F constant, which is defined in the OpenCL spec as the f32 equivalent
17:07 RSpliet: If you do so, runtime of the benchmark for a serious data set drops to just under half on my GT(X?)650
17:09 RSpliet: Judging by the #ifndef, #define[...] the programmers' intentions were definitely to use f32 logic. Hence, bug.
17:12 karolherbst: ...
17:12 karolherbst: question is: is OpenCL stupid or the devs :p
17:13 Sarayan: well, I rarely heard good things about opencl :/
17:14 karolherbst: yeah, OpenCL ain't grate
17:15 karolherbst: *great
17:15 karolherbst: the best about OpenCL is, that... well, you are free to implement it
17:15 karolherbst: I am sure openmp would be a better alternative in all cases
17:15 karolherbst: but... there is basically no toolchain support for openmp on GPUs except nvidia
17:16 karolherbst: kind of wondering if we might be able to write some openmp backend in mesa and have a stable mesa-openmp API other projects could develop against
17:16 karolherbst: would be fun
17:21 Sarayan: is there anything nvidia-specific in cuda actually?
17:23 karolherbst: hard to say
17:24 karolherbst: in the end it was designed for nvidia GPUs, so the API should be closer to the nvidia hw than any other
17:24 HdkR: Do you consider HMMA instructions to be Nvidia specific? What about the nonsense that PTX instructions as inline asm give you? :)
17:24 karolherbst: well, the ptx stuff is quite high level
17:24 karolherbst: ptx isn't even assembly
17:25 HdkR: Yea, it is just treated as inline assembly that gets lifted to IR later
17:25 karolherbst: it is just wrong to call it assembly due to many reasons :p
17:26 HdkR: :)
17:26 karolherbst: anyway, I doubt that PTX is hard to translate to any other ISA in the end
17:26 karolherbst: some instructions might be a bit more challenging though
17:27 HdkR: HIP shows off that it can be done right?
17:27 HdkR: Probably very fragile though
17:28 karolherbst: dunno
17:28 karolherbst: never looked into it
17:29 karolherbst: fixing that multithreading stuff is super painful :/
17:29 karolherbst: a lot of moving code around and renaming stuff
17:29 karolherbst: but it might actually end up in a cleaner driver afterall
17:30 RSpliet: People complain about OpenCL a lot, it's really quite good. The toolchains are lacking though
17:30 karolherbst: problem is, the nv50 and the nv30 drivers have to be reworked at the same time more or less
17:31 karolherbst: RSpliet: compared to Cuda OpenCL is quite ... well, it doesn't give you much
17:31 karolherbst: the biggest problem with OpenCL is, that it is just a language+API specification
17:31 karolherbst: and things like debugging are no concern
17:32 RSpliet: That's exactly what I said. Toolchains are lacking
17:32 karolherbst: should have been part of OpenGL
17:32 karolherbst: debugger API
17:32 karolherbst: it isn't about the toolchain really
17:33 karolherbst: if OpenCL had a debugger API, you could actually write a debugger against it. Just designing that API would be tons of work
17:34 RSpliet: Oh you specifically speak of a GDB-like interface. Well... yes, that would need to be standardised somehow
17:34 karolherbst: yeah
17:36 RSpliet: But a lot could already be gained from vendor-specific tools. CodeXL I thought was helpful for profiling, perf tuning and warning for silly mistakes
17:36 RSpliet: NVIDIA has those tools for CUDA, but killed their OpenCL counterparts (what used to be a nice Eclipse plugin) quite early on
18:17 karolherbst: :D
18:17 karolherbst: inside nv30: /*XXX: *cough* per-context pushbufs */
18:17 karolherbst: /*XXX: *cough* per-context client */
18:35 HdkR: haha
19:11 mupuf: karolherbst: hehe
20:47 RSpliet: karolherbst: Almost as if you, a good pirate, just found the map to the booty!
20:48 RSpliet: XXX marks the spot!
22:22 karolherbst: nice, first full compilation
22:23 HdkR: oh?
22:27 bubblethink: Hi. Is there any way to limit the amount of memory that is visible to an nvidia gpu ? Something like the mem=limit parameter
22:28 karolherbst: bubblethink: not really, why would you wanna do that?
22:28 HdkR: Get around the GTX 970 last 512MB issue? :P
22:29 karolherbst: HdkR: on that GPU you have different issues :p
22:30 HdkR: Definitely
22:32 bubblethink: karolherbst, this is for research. I am doing some research around CUDA's unified memory behaviour. The GPU with the smallest memory is 2 GB right now (1030). I want working sets that don't fit in the GPU's memory. 2 GB is fine, but still quite big for fast prototyping
22:32 karolherbst: mhhh
22:32 bubblethink: Essentially, I want a lot of evictions and CPU<->GPU transfers
22:33 karolherbst: you could also read the nvidia-uvm code, everything should be in there
22:33 karolherbst: or modify it
22:33 bubblethink: yes, was planning to do that next
22:33 karolherbst: I see
22:33 karolherbst: maybe there is a way to limit memory on nvidia GPUs, dunno
22:34 karolherbst: mhh
22:34 bubblethink: my other hackish idea is to allocate a chunk with cudamalloc, and that should hopefully remain untouched. Then whatever remains, if allocated with cudamallocmanaged, would work as effective memory
22:34 karolherbst: although, there are some register read/writes which one could intercept and provide own values
22:34 karolherbst: but that might be a bit too challenging
22:34 karolherbst: and maybe even slower
22:40 karolherbst: ohh fun
22:40 karolherbst: nvidia is going crazy with the chipsets number with turing
22:40 HdkR: Crazy you say?
22:41 karolherbst: we have like 7 GPU models and already 4 different chipsets
22:42 HdkR: TU102, TU104, TU106, + non-FE vendor IDs? :P
22:42 karolherbst: TU116 as well
22:42 HdkR: ah
23:26 karolherbst: wow, even glxgears renders
23:26 karolherbst: and dolphin crashes, nice
23:27 HdkR: 10/10
23:28 karolherbst: interesting
23:28 karolherbst: that seems... trivial to fix?
23:29 HdkR: Fixing threaded VAO binding and glFinish? :)
23:30 karolherbst: it is more of a core issue
23:31 karolherbst: threading again, but different
23:33 karolherbst: mhh, helgrind doesn't report any race conditions though
23:34 karolherbst: ohhhh
23:34 karolherbst: realloc fails because OOM
23:34 HdkR: Time to download more ram
23:34 karolherbst: no, that was just valgrind
23:34 karolherbst: buh valgrind, buh :p
23:50 karolherbst: okay, out of bound reads
23:50 karolherbst: uhm, writes actually