08:45xerpi: Hi! Does anybody know if it's possible for the Nintendo Switch GPU (NVIDIA B197 Maxwell B) to render to a pitch linear color target and a block linear depth/stencil surface? Looking at Nouveau's code, I don't see anything telling me otherwise, but I get a GPU crash (I'm using deko3d, a low-level open-source graphics library for the Nintendo Switch)
08:49xerpi: Here's more info about my problem: https://pastebin.com/raw/TwC48qQw
15:07karolherbst: pmoreau: how does it look in regards to reviewing the two MRs? We'd like to get them into the next mesa release, so if you don't have time I could just merge them, but as this touches code you were involved in I wanted to ask you to look at it as well
18:01fltrz: karolherbst: I may have been unnecessarily selective in wanting to get OpenCL running. I am not committed to a specific gpu-compute language/platform (be it OpenCL, Vulkan, ...), i.e. I am willing to adapt to whatever would work best on this laptop's GT 740M, as long as it's open source. I simply assumed OpenCL would have the best support in my case, but I now realize that was just an assumption. Should I go for
18:01fltrz: OpenCL or some other technology to put my GPU to work?
18:02fltrz: the workload is machine learning, but will be handcrafted...
18:03karolherbst: fltrz: hard to say because all options besides CUDA are kind of crappy
18:04karolherbst: depending on how high level your stuff will be, it might make sense to base it upon something which abstracts the compute runtime away
18:09fltrz: I realize I have an older card, so I don't expect to be able to run the heavy workloads of modern frameworks etc. I understand most people wouldn't want to write their workload at a low level, but I guess I am prepared to try that if it means actually being able to run small models (even tiny by today's norms)
18:11karolherbst: I don't see why you would have to do that
18:12karolherbst: though kepler doesn't have any of that fp16 stuff
18:12fltrz: well it would be good news if I can have the comfort of a higher level
18:12karolherbst: or acceleration for int8/16
18:12karolherbst: well.. OpenCL is high level, that's why it sucks
18:12fltrz: GT 740M is kepler then?
18:12karolherbst: yeah, should be
18:13fltrz: do you know this off the top of your head, or is there some documentation (by volunteers or by NVIDIA) listing things like this?
18:14karolherbst: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units has a good enough overview
18:20fltrz: for clarity, I'm not trying to run models I download from the web, just trying to implement some of the many ideas I had while reading lots and lots of ML journal articles. I would implement reverse-mode AD (RM-AD) or analytically compute gradients myself... It's just not clear to me which options are available besides OpenCL (apart from abusing OpenGL shaders as compute)
18:22fltrz: if OpenCL is my best option, I will continue investigating this path of course, just asking if you might have a suggestion since I realized I simply assumed OpenCL was my best fit
18:23fltrz: karolherbst: re: "all options besides CUDA are kind of crappy" < which others should be possible on this GT 740M no matter how crappy?
18:24fltrz: karolherbst: I haven't actually tried compiling mesa with OpenCL yet, I will soon though
18:29karolherbst: if you don't want to have to debug mesa's OpenCL implementation you should probably not use it because it does have bugs
18:29karolherbst: but yeah.. besides OpenCL there isn't really anything useful here... at least not GPU accelerated
18:31karolherbst: if you don't want to have to deal with random driver bugs you are probably better off fetching NVIDIA's stack and using either OpenCL or CUDA. OpenCL is really in a bad state within mesa atm
18:31karolherbst: if you are willing to spend the time improving the situation, fine, but if not, then you shouldn't use mesa
18:41fltrz: I am definitely *not* installing closed source drivers. So I guess OpenCL is the closest thing to available gpu-compute on this laptop's GT 740M. Until I actually try it out I don't know, perhaps I won't run into too many bugs? Do the bugs typically occur if I try to use more advanced OpenCL features, or is it more like a random process where the GPU crashes after some rough half-life of running fine?
18:42karolherbst: mix of everything. But the most critical one is program correctness.
18:43karolherbst: I know that we do compile kernels correctly for quite a lot of things, but not everything
18:43karolherbst: and I wouldn't say we are confident enough so that you can actually rely on getting the correct results
18:43karolherbst: you can of course play around with it, but don't say I didn't warn you later :P
18:44fltrz: oh, I wasn't seeking guarantees of that kind ;)
18:44karolherbst: of course there are the occasional crashes as well, but as long as you don't do anything too crazy it probably won't be that bad
18:45karolherbst: fltrz: you do have an intel GPU, right? I suspect it's haswell?
18:46karolherbst: anyway.. might be good to double check results with a different driver. Pocl could be used to verify with the CPU
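(A quick sanity check that both implementations are visible before cross-verifying; this is a sketch assuming clinfo and the pocl ICD are installed, package names vary by distro:)

    # list installed OpenCL platforms; pocl (CPU) should show up
    # alongside mesa's platform once both ICDs are present
    clinfo -l
    # then run the same kernel on each platform and diff the output buffers;
    # platform selection happens in the application via clGetPlatformIDs()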
18:46karolherbst: also.. you might want to ramp up GPU clocks with the pstate debugfs file on nouveau, so it's not terribly slow
18:47fltrz: yes its an intel GPU, product: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
18:47karolherbst: ivy bridge even
18:48karolherbst: I wrote a new OpenCL stack for mesa in Rust, but I only made sure it passes conformance on 12th gen Intel (and there are some issues on earlier gens), and I don't think it runs on those old Intel GPUs yet. But that's something where I would be more confident about actual correctness
18:48fltrz: I have no idea what the pstate debugfs file is. Are you saying it's downclocked by default and one needs to manually upclock it?
18:48karolherbst: yep
18:49karolherbst: in /sys/kernel/debug/dri/1/ there should be a file called "pstate" and that can be used to select clocks
18:49karolherbst: it's not automatic yet, because of instabilities and that stuff
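(For reference, a minimal sketch of using the pstate file; run as root, and note the dri/1 index is this machine's nouveau device and may differ elsewhere:)

    # list available performance levels and the currently active one
    cat /sys/kernel/debug/dri/1/pstate
    # select a level by its id from that listing, e.g. "0f" for the highest
    echo 0f > /sys/kernel/debug/dri/1/pstate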
18:49fltrz: this is something I have to do in the terminal, and after using the GPU I should downclock it again to not unnecessarily melt the keyboard? :)
18:49karolherbst: shouldn't matter, as your GPU will be turned off after ~5 seconds
18:49karolherbst: normally
18:49karolherbst: if runpm works correctly that is
18:50karolherbst: if the Nvidia GPU is at PCI address 01:00.0 you can cat "/sys/bus/pci/devices/0000:01:00.0/power/runtime_status"
18:50fltrz: so only when starting a run should I remember to upclock it, and I guess I could make my code do the upclocking so I don't have to do it manually
18:50karolherbst: it should say "suspended" if the GPU isn't in use
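(Checking both PCI functions at once; the 01:00.0/01:00.1 addresses are the ones assumed above:)

    # the GPU and its HDA audio function; both should read "suspended" when idle
    cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
    cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status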
18:51karolherbst: fltrz: I think the clocks will get restored, so you might only have to do it once, but it might be good to verify this
18:51fltrz: it says active
18:51karolherbst: are you using external displays on the GPU?
18:52karolherbst: might also be that the audio device isn't suspending... so might have to check "/sys/bus/pci/devices/0000:01:00.1/power/runtime_status" as well
18:52fltrz: nope, but I am using sway; I don't know if it's using the integrated GPU in the CPU or the nvidia GPU
18:52karolherbst: and make sure that "control" is set to auto for both
18:52karolherbst: fltrz: if you only use the internal display, it should be intel
18:52karolherbst: but depending on multiple factors, your system might be misconfigured and it stays on
18:52fltrz: both active
18:52karolherbst: I'd check the control file on 01:00.0 and 01:00.1
18:53karolherbst: e.g. "/sys/bus/pci/devices/0000:01:00.0/power/control"
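(And if either control file reads "on" instead of "auto", runtime PM can be enabled like this, as root:)

    # allow the kernel to runtime-suspend the GPU and its audio function
    echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control
    echo auto > /sys/bus/pci/devices/0000:01:00.1/power/control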
18:53fltrz: the control file says auto
18:53karolherbst: also for 01:00.1?
18:54fltrz: yes
18:54fltrz: both again
18:54karolherbst: :(
18:55karolherbst: in case you wondered why your battery life is bad, this might be the reason
18:55fltrz: :) I probably misconfigured it, it's Arch Linux after all, and I vaguely seem to remember not being sure about some choice between the intel GPU and the discrete one
18:55karolherbst: yeah.. no idea why that is then.. maybe something keeps something in use, or.... maybe the root port isn't configured correctly
18:55karolherbst: or something else
18:55fltrz: I didn't complain about battery life, as the battery was already broken when I got the laptop from a friend
18:56karolherbst: I see
18:56fltrz: perhaps I accidentally configured sway to use the discrete all the time?
18:56karolherbst: though the GPU staying on makes a huge difference, like getting it to be off can easily double battery runtime
18:57fltrz: I'm on the power cord all the time because the battery never worked in my hands
18:57karolherbst: could check with "lsof /dev/dri/{card,render}*"
18:58karolherbst: card0 + renderD128 should be intel
18:58fltrz: hmm lsof is not a recognized command
18:58karolherbst: where 1 and 129 should be nouveau
18:58fltrz: /dev/dri/card0 /dev/dri/card1 /dev/dri/renderD128 /dev/dri/renderD129
18:58karolherbst: yeah.. might want to install lsof then
18:58karolherbst: it lists what processes are using those files
18:59karolherbst: on my end it's all intel, so the nvidia GPU is getting powered off
18:59karolherbst: which is nice, because the fans are turned off as well most of the time :)
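(The check as one command; fuser from psmisc is a fallback for systems where lsof isn't installed:)

    # show which processes hold the DRM device nodes open
    lsof /dev/dri/card* /dev/dri/renderD*
    # fallback without lsof:
    fuser -v /dev/dri/*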
19:00fltrz: it's all using card1
19:00karolherbst: :(
19:00fltrz: and sometimes renderD128 and renderD129
19:00karolherbst: guess sway uses the nvidia GPU then
19:01fltrz: yes literally all processes that use /dev/dri/card* use card1
19:01fltrz: including sway
19:02fltrz: I think I actually had an option to choose between integrated and discrete, and chose discrete somewhere while installing arch+sway
19:02karolherbst: I see
19:02karolherbst: well.. in case you start doing more CL stuff, it might make sense to try to get it to use the intel GPU instead, so your UI stays responsive or something
19:02fltrz: I wasn't aware of the ramifications
19:02karolherbst: also helps with GPU crashes
19:03fltrz: I see, and probably frees up VRAM for the compute load?
19:04karolherbst: yeah
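(For sway specifically, wlroots honors the WLR_DRM_DEVICES environment variable when picking the GPU; card0 being the intel device is an assumption to verify first:)

    # start sway on the integrated GPU instead of the discrete one
    WLR_DRM_DEVICES=/dev/dri/card0 sway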
19:04fltrz: btw not actually sure if it was during installation of sway, might have been wayland etc
19:07fltrz: what are the renderDxxx's? OpenGL render contexts?
19:07karolherbst: something like that
19:08karolherbst: if you only care about rendering, like doing GL, you use the render nodes
19:08fltrz: the lsof listing actually shows everything on card1 or renderD129, except for one renderD128 in use by electron
19:08karolherbst: the card ones are actually for scanout
19:08karolherbst: and more control, like the things compositors do
19:08karolherbst: yeah.. electron seems to have its own "buggy" ways of deciding which GPU to use
19:08fltrz: I have always wondered where gamma tables enter in wayland / nouveau
19:09karolherbst: instead of trusting the compositor/system
19:09fltrz: thats just weird
19:10karolherbst: we had a user where intel loaded a bit later and ended up as card1
19:10karolherbst: and electron just uses the nvidia GPU where everything else uses intel :(
19:16fltrz: ok, I will stop pestering you then. So if I understand correctly, I can use Pocl to compare outputs to identify incorrectly compiled kernels when unsure, and otherwise it's probably relatively stable except for the occasional crash
19:16karolherbst: yeah
19:17karolherbst: though using chars and shorts could be more broken in general
19:17fltrz: as opposed to which other primitives?
19:18fltrz: I really don't know this GPU card at all
19:18fltrz: I vaguely remember that in the past NVIDIA and AMD had some distinction in being more into integers vs floats but forgot which
19:18karolherbst: well.. GL is mostly about 32 bit stuff, so we are confident that this works.
19:19karolherbst: but 8 bit and 16 bit datatypes are a bit of a hit and miss
19:19karolherbst: also.. from a perf perspective it doesn't make a difference on your GPU anyway
19:20karolherbst: some "complicated" int operations can be slower though, like divisions by arbitrary numbers
19:20karolherbst: anything you can't convert to shifts really
19:20fltrz: you're saying the inference-optimized lower-precision primitives, parallelized instead of a single higher-precision FLOP, postdate this GPU? I sort of expected that to be the case
19:21karolherbst: yeah
19:21karolherbst: maxwell has some faster int16 stuff, but not really substantial
19:22fltrz: since I'll want to try training I predict I'd just use 32 bit floats anyway
19:22karolherbst: the juicy bits started to come in with turing
19:22karolherbst: though I think pascal also had some stuff already
19:22karolherbst: and fp16 is an annoying situation anyway
19:22fltrz: you are a lot of help, thanks man
19:22karolherbst: not sure why, but on pascal fp16 was even slower than fp64
19:23fltrz: lol
19:23karolherbst: something something quadro something... probably
19:24karolherbst: guess they tried to make people buy more expensive GPUs to get proper fp16 speed
19:26fltrz: so if sway is currently using the nvidia card, can I poll what the clock state is?
19:27fltrz: power_state is D0
19:28karolherbst: cating the pstate file should show it
19:32fltrz: hm there is no debug in /sys/kernel
19:32fltrz: boot_params cgroup config dmabuf fscaps iommu_groups irq mm notes profiling rcu_expedited rcu_normal reboot security slab software_nodes tracing uevent_seqnum
19:33karolherbst: might have to mount /sys/kernel/debug then
19:33karolherbst: normally systemd does that, but pre-systemd systems usually had to fstab it or something.. dunno
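(Mounting it manually as root, or persistently via fstab:)

    # one-off mount
    mount -t debugfs none /sys/kernel/debug
    # or add to /etc/fstab:
    # debugfs  /sys/kernel/debug  debugfs  defaults  0  0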
19:34fltrz: I don't think I ever mounted debug fs
19:34karolherbst: on "normal systems" it's just mounted :P
19:35fltrz: I probably majestically strayed during arch install again here
22:11TimurTabi: How did you generate drivers/gpu/drm/nouveau/nvkm/subdev/gsp/fw/booter/load/ga100.h? I need to do the same thing for AD10x, but I'm hoping to save time by using whatever tool you have.