08:30az: hi, I'm having what seems to be a nouveau xorg error and a system freeze. the log from syslog is here https://dpaste.org/gtq0 no issue when using the Nvidia drivers
10:29pmoreau: az: Which GPU and kernel are you using, and are there any errors before the snippet you pasted?
11:04az: pmoreau, https://dpaste.org/Pt86
11:05az: no errors before that
11:05az: it happens every time I run an Android emulator
14:05karolherbst: imirkin: ehhh.. do we even support direct buffers?
14:06imirkin: what are direct buffers?
14:06karolherbst: buffers with a direct index
14:06imirkin: i still have no idea what you're talking about
14:06karolherbst: uhm.. ssbos
14:07imirkin: we support ssbo's
14:07imirkin: next question? :)
14:07karolherbst: yeah.. I am just wondering why we don't print the buffer index
14:07imirkin: which buffer index, and where?
14:07karolherbst: in the shader, to select the correct buffer you can have an indirect or direct index
14:07imirkin: you mean the index into the BUFFER array?
14:08karolherbst: essentially yes
14:08imirkin: az: how can i reproduce?
14:08karolherbst: for const memory we always display which const buffer to use, but we don't do that for ssbo
14:08karolherbst: hence I am wondering
14:08imirkin: karolherbst: and where were you hoping it'd be printed?
14:08karolherbst: well, same as for const buffers
14:08imirkin: const buffers are a gpu construct
14:08imirkin: ssbo's are a GL construct
14:08karolherbst: I know
14:09imirkin: the GPU only knows about globally-addressable memory
14:09karolherbst: but we still print b[0x0]
14:09karolherbst: but what buffer does it belong to?
14:09karolherbst: we don't know
14:09imirkin: where do we print this?
14:09imirkin: can you give me an example?
14:09karolherbst: before lowering
14:09imirkin: the buffer file is lowered away
14:09imirkin: ah ok
14:09imirkin: so the printer is probably messed up
14:09karolherbst: yeah.. I guess
14:09imirkin: it's probably looking at getIndirect(0) instead of getIndirect(1)
14:10karolherbst: it's more annoying
14:10karolherbst: we have memory_constant hardcoded for direct indices
14:10karolherbst: but if you have a direct index on a buffer, we print b[0x0] instead of b2[0x0]
14:11imirkin: i wasn't too worried about the printer when i added those
14:11imirkin: feel free to improve the printer :)
14:12karolherbst: yeah.. I also found a bug with buffers without compile time lengths.. just wondering what's wrong
14:12karolherbst: it's just messy to debug
14:12imirkin: the length is supposed to be in the descriptor
14:12imirkin: but it's the overall length
14:12imirkin: not the length of just the "unbound" part
14:13imirkin: so you have to do some math (which is done in the glsl ir iirc) to get the "unsized" part
14:13imirkin: i.e. (total - known) / 4
14:13imirkin: not exactly rocket science ;)
14:14imirkin: or rather / typesizeOf(the unbound array type)
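The arithmetic imirkin describes for the unsized tail of a buffer can be sketched as below. This is only an illustration of the formula `(total - known) / sizeof(element)`; the function name and parameters are made up for the example and are not taken from Mesa or the GLSL IR.

```python
def unsized_array_length(total_size: int, known_size: int, elem_size: int) -> int:
    """Element count of the trailing unsized array in a buffer block.

    total_size: overall buffer length in bytes (what the descriptor holds)
    known_size: bytes used by the fixed-size members before the array
    elem_size:  size in bytes of one array element
    All three names are hypothetical, chosen for this sketch.
    """
    if elem_size <= 0:
        raise ValueError("elem_size must be positive")
    remaining = total_size - known_size
    # A buffer shorter than its fixed part simply has no array elements.
    return max(remaining, 0) // elem_size


# 64-byte buffer, 16 bytes of fixed members, array of 4-byte elements
print(unsized_array_length(64, 16, 4))
# same buffer, array of 16-byte elements
print(unsized_array_length(64, 16, 16))
```

Dividing by the element size rather than a hardcoded 4 is the correction imirkin makes in his follow-up line.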
14:23karolherbst: mhh, feels like codegen gets rid of a store :)
14:23karolherbst: maybe constant index and offset is broken?.. I'll figure it out
14:23karolherbst: or something else broken
14:24imirkin: i mean ... maybe. works with tgsi :p
14:25imirkin: i'm not going to sit here and pretend like codegen is perfect
14:25imirkin: as much as i'd like to
14:30RSpliet: codegen is perfect every time you fix a new bug
14:34az: imirkin, https://docs.beeware.org/en/latest/tutorial/tutorial-5/android.html
14:35az: the issue happens when running the emulator in this step, but you have to download a huge number of packages. I might be able to run some debugging for you if you help me do it
14:35imirkin: wow, ok. so nothing simple =/
14:35imirkin: can i just run the emulator more directly?
14:36imirkin: or is there no standalone variant?
14:36az: I'll try to check with the developers on how to run that when I reach their support channel
14:46ccr:wonders if apitrace would help
14:48HdkR: You can technically run the Android emulator freeish standing, it's just a nightmare
14:48HdkR: Easiest is to just install android-studio and let the GUI set everything up correctly
14:56imirkin: i'm guessing that it just does multi-threading of GL calls
14:57HdkR: It also does some fun IPC marshalling of data from one process to another
14:58imirkin: az: i wouldn't bother with support. "nouveau is unsupported, use nvidia blob driver"
15:05az: imirkin, I mean I'll try to ask the people who created this emulator to tell me how to get it running in less steps so you can test it
15:05imirkin: not so much about quantity of steps
15:05az: I can share the code with you if you want to test
15:05imirkin: as it is about running random software downloaded off the interwebs
15:06az: it's not random it's from beeware project
15:06imirkin: aka random
15:06imirkin: i've never heard of it before
15:06imirkin: i've heard of qemu.
15:06az: it's for developing application on mobile using python
15:07az: got funded by the Python foundation
15:07imirkin: note that i also class nvidia blob as 'random software downloaded off the interwebs'
15:07az: I agree :)
15:07HdkR: Isn't it just the official Google Android emulator rebranded?
15:08HdkR: Which is qemu+kvm+GL shim for acceleration
15:08imirkin: df -h .
15:08imirkin: Filesystem Size Used Avail Use% Mounted on
15:08imirkin: /dev/root 74G 68G 2.1G 98% /
15:08imirkin: so i'm not going to be getting some like 20GB android stupid thing
15:08HdkR: Oh, it's gotta be more like 50GB now :P
15:43karolherbst: az, imirkin: android is making heavy use of multithreading
15:44karolherbst: az: you can try out this MR and see if that works reliably https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8440
15:44karolherbst: I think it does, but...
15:44karolherbst: it's not that well tested
15:44karolherbst: (and I know of more issues)
15:47KungFuJesus: so did something change with glibc's POWER ABI with regard to alignment? I'm seeing this strange, seemingly new issue with this code (though, my memory is a bit fuzzy), where I load columns of data from aligned allocations at unaligned offsets and the very last load seems to set off address sanitizer
15:47KungFuJesus: and by last, I mean second to last
15:48KungFuJesus: I'm doing the standard http://mirror.informatimago.com/next/developer.apple.com/hardware/ve/alignment.html unaligned load song and dance
15:49KungFuJesus: and my loop stops early in the loads, as in the last n % 4 (16 byte) loads are peeled off at the end
15:52KungFuJesus: it'd be one thing if just asan crapped out with a "you read past your boundary, but technically this is the same page and since we force heap allocations to be at an alignment of vec_align and you permute in the partial read later, this is actually legal", but what's weird is that an assert fires in the optimized binaries and there's seemingly memory corruption as a result of this
16:42KungFuJesus: I think something screwy is going on. I asked in #gcc, where else can I go that someone might actually know what's going on when using this code with glibc's allocators?
16:48imirkin: KungFuJesus: it could be that the kernel dropped its fixups
16:48imirkin: of userspace unaligned access
16:48imirkin: not sure
16:48imirkin: but then you'd get a SIGILL
16:56KungFuJesus: according to that ancient documentation from Apple, the second load for an unaligned load can load bytes off the boundary of the heap but is still considered to be safe
16:57KungFuJesus: as the alignment itself is supposed to be sized of multiples of 16 bytes
16:57KungFuJesus: if I'm interpreting that correctly, anyway
16:58KungFuJesus: now I can see asan throwing a false positive for that and that'd be one thing, but the weird thing is that oversizing my allocations by padding 16 bytes on the end shuts it up but makes FFTW's FFTs have some visual artifacts
16:59KungFuJesus: and when I use Intel's arena allocator by LD_PRELOAD'ing tbb's malloc proxy...the issue magically disappears
17:01KungFuJesus: (probably because the pooled allocator is oversizing all allocations so that they have a unique page)
17:01KungFuJesus: but still, this once worked with glibc's allocators without issue
17:02KungFuJesus: am I making a dumb assumption and this second to last load was always unsafe, or did something in glibc change when they added VSX's much more convenient unaligned load mechanics?
17:16KungFuJesus: imirkin: my loop of unaligned loads is structured in a way that loops over 3 columns of data of a dim that is not a multiple of 16 bytes. So, I do the first floor(n/16) byte loads, with the 2nd and third columns being handled by vec_lvsl(0, addr), vec_ld(0, addr), vec_ld(16, addr), + vec_perm, where addr is numEl + offset, and 2*numEl + offset, respectively
17:18KungFuJesus: now granted, the modulus of numEl + offset and 2*numEl + offset are different from each other, but they are handled with vec_lvsl generating the permutation vector per unaligned load, and the last load should happen to have at least a couple of bytes which are part of the allocation. I don't _think_ this should be a data hazard, should it? Or was this always hazardous and I got lucky?
17:19KungFuJesus: (modulus of 16 bytes)
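To make the load pattern above concrete, here is a small address-arithmetic sketch. It relies on one known property of AltiVec: `vec_ld(off, addr)` ignores the low 4 bits of `addr + off`, so it always reads a whole aligned 16-byte block. The addresses and allocation bounds are invented for illustration, and the sketch does not attempt to diagnose the issue being discussed. The offset-15 variant is the commonly cited alternative for the second load, shown here only for comparison.

```python
VEC = 16  # AltiVec vector width in bytes


def aligned_blocks(addr: int, second_offset: int = 16):
    """Start addresses of the two aligned blocks read by
    vec_ld(0, addr) and vec_ld(second_offset, addr)."""
    first = addr & ~(VEC - 1)
    second = (addr + second_offset) & ~(VEC - 1)
    return first, second


def bytes_past_end(addr: int, alloc_end: int, second_offset: int = 16) -> int:
    """How many bytes of the second aligned block lie at or beyond alloc_end
    (alloc_end is one past the last valid byte of the allocation)."""
    _, second = aligned_blocks(addr, second_offset)
    return min(VEC, max(0, second + VEC - alloc_end))


# Hypothetical 100-byte allocation starting at 0x1000 (ends at 0x1064).
# Unaligned load of the last 16 bytes: part of the second block is valid.
print(bytes_past_end(0x1054, 0x1064))

# Hypothetical 96-byte allocation (ends at 0x1060), load at an address
# that happens to be aligned: with offset 16 the second block is entirely
# outside the allocation, with offset 15 it is the same block as the first.
print(bytes_past_end(0x1050, 0x1060, second_offset=16))
print(bytes_past_end(0x1050, 0x1060, second_offset=15))
```

In the unaligned case the second block straddles the end of the allocation but stays inside one 16-byte line, which matches KungFuJesus's expectation that a few of its bytes belong to the allocation. The aligned case is the one where an offset of 16 reads a block with no valid bytes at all.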
17:47KungFuJesus: https://sourceware.org/bugzilla/show_bug.cgi?id=27227 I really don't know how much more specific I can be about this :-/
17:55_171_: Okay, so I want to get into some proper nouveau development, but I have 2 questions first. They both sound really stupid to me, but I don't know anywhere else where I could get a better answer.
17:57_171_: First is, I think I read somewhere that nouveau developers have never used the proprietary driver and that doing so would be illegal because of the license you have to agree to when you use it. That sounds crazy, but is there any truth to it/any other potential legal problems I should be aware of?
17:59_171_: ...and second is, what are the possible ways I can brick my GPU through software (overheating via bad power management, etc.) so that I can be aware of them (I only have the one GPU and I'd rather keep it).
18:00imirkin: we regularly use the proprietary driver to try to figure out what it's doing. it's a process known as reverse engineering
18:00imirkin: however we do not have access to nvidia's specs (except the ones they make public), nor any sort of source code
18:00imirkin: bricking your GPU in software is hard. i'm not aware of any nouveau developer having achieved this feat.
18:01imirkin: mupuf killed a few by overheating them on purpose, but that's somewhat different.
18:01imirkin: i believe an oven was involved.
18:01_171_: I thought maybe you only used data sent from other people using NVIDIA's driver or something... Thank you for the answers!
18:02imirkin: _171_: what's your goal btw?
18:03_171_: Well, my goal was to make my GPU work properly with nouveau. I guess now that includes making it work out of the box without having to mess with kernel parameters...
18:03imirkin: ah yeah
18:05_171_: Maybe also try to do some proper power management with it. What's the big deal with that, by the way? I heard about signed firmware being a problem, but doesn't this only apply to newer GPUs?
18:05imirkin: yes, signed firmware are only a problem with GM200+, so not a problem for you
18:08_171_: That's great! Thank you.
18:09imirkin: we do support changing clocks (you can echo stuff to /sys/kernel/debug/dri/1/pstate) but in the past we haven't felt like it was reliable enough to do automatically
18:09imirkin: it can also cause display flicker
18:09imirkin: and finally there'd have to be a source of data to indicate when to switch to what level. would ideally hook it up to some sort of governor
18:13_171_: So the governor would receive data from userspace and decide how the clock speeds should change?
18:13imirkin: form the kernel
18:13imirkin: from the kernel
18:14_171_: Okay, I see.
18:21_171_: So wait, does that mean that there's no actual dynamic power management being done right now other than booting or shutting down the GPU depending on whether it's being used or not?
18:22imirkin: i believe karolherbst may have had some sort of indicators of load on pcie bus/etc
18:22imirkin: but i don't remember if that was upstreamed
18:26_171_: Is all the input data we get from the PCIe bus, or is there any other form of communication?
18:27imirkin: i mean, ultimately all data travels over pcie, yes.
18:27imirkin: but the GPU provides several counters
18:27imirkin: which can be read out
18:27imirkin: and be used to infer the level of activity
18:27imirkin: pcie, graphics, etc
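The governor idea mentioned above (read activity counters, decide when to switch performance level) could look roughly like the following. This is a purely hypothetical policy sketch: the level names, thresholds, and counter names are invented, and nothing here reflects what nouveau actually implements or how its counters are read.

```python
def pick_level(current: str, loads: dict, levels: list,
               up: float = 0.80, down: float = 0.30) -> str:
    """Choose the next performance level from per-engine load figures.

    loads:  mapping of counter name -> utilisation in [0.0, 1.0]
            (hypothetical names such as "pcie" or "graphics")
    levels: performance levels ordered from lowest to highest
    up/down: thresholds; the gap between them gives hysteresis so the
             level does not flip back and forth on small load changes
    """
    busiest = max(loads.values()) if loads else 0.0
    i = levels.index(current)
    if busiest > up and i < len(levels) - 1:
        return levels[i + 1]   # step up one level under heavy load
    if busiest < down and i > 0:
        return levels[i - 1]   # step down one level when mostly idle
    return current             # inside the hysteresis band: stay put


levels = ["low", "mid", "high"]  # placeholder names, not real pstate ids
print(pick_level("low", {"pcie": 0.10, "graphics": 0.95}, levels))
print(pick_level("mid", {"pcie": 0.50, "graphics": 0.55}, levels))
print(pick_level("high", {"pcie": 0.05, "graphics": 0.10}, levels))
```

Stepping one level at a time with separate up and down thresholds is one simple way to limit how often clocks change, which matters given the display flicker imirkin mentions.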
18:28_171_: Yes, I think I saw a presentation that talked about this on the website.
18:29_171_: Are those documented somewhere?
18:29imirkin: probably. look for PCOUNTERS
18:30imirkin: in rnndb
18:30_171_: What's rnndb?
18:33_171_: Alright, I'll check that out.
18:56mupuf: Yeah, killing an nvidia GPU is hard! Even when keeping a hair drier full blast on it
18:56mupuf: I never killed one, even when reverse engineering
18:57mupuf: The only dead one I have was in a bad reflow job
18:58mupuf: (I had bought it dead, as part of a lot, and I fixed 9/10 of them)
18:59RSpliet: reflow is quite a fancy word for "oven-baked"
19:00RSpliet: didn't even bake those chips until crispy
21:39emersion: are there some low hanging tasks to do in the kernel driver? i'd be interested in doing some nouveau hacking this w-e
21:41imirkin: emersion: what available hw do you have?
21:43emersion: i have an old N560GTX, and a newer GT710
21:44imirkin: are you interested in doing blob RE?
21:44emersion: oh, why not
21:44imirkin: do you have a display which supports YUV 4:2:0?
21:44emersion: pretty sure i have
21:44imirkin: the CLAIM is
21:45imirkin: that kepler can support YUV 4:2:0 somehow
21:45imirkin: the question is simple: how :)
21:45imirkin: they've published display docs
21:45imirkin: but those make no mention of this
21:45imirkin: i'm guessing they do something clever
21:45imirkin: like apply a CSC and mess with clock rates
21:46imirkin: but who knows
21:46imirkin: we're quite sure later gens support it too, and are equally ignorant of how they do that
21:46imirkin: but it's less important there
21:46emersion: hm, are there other supported YUV formats? i haven't checked
21:47imirkin: wrong thing
21:47imirkin: this isn't about scanout of a YUV format
21:47imirkin: the scanout happens of an RGB format
21:47emersion: this is about YUV on the wire right?
21:47imirkin: but then the bits get encoded onto the HDMI wire
21:49imirkin: emersion: https://nvidia.github.io/open-gpu-doc/classes/display/cl917d.h
21:49imirkin: so e.g. that has like
21:49imirkin: but no 420
21:50imirkin: and ORIGINALLY for kepler they said "sorry, no 4:2:0" (like for their own drivers)
21:50imirkin: and then it magically appeared
21:50imirkin: so did they remember they had accidentally included the functionality? or did they figure out some weird way of doing it?
21:50imirkin: the thing is that this would enable kepler to do 4k@60 (@yuv420)
21:51imirkin: since otherwise you need HDMI 2.0 for 4k@60, which is maxwell2+
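The reason 4:2:0 matters for 4k@60 comes down to link clock arithmetic, sketched below. The figures used are the standard 4k@60 timing (4400x2250 total pixels per frame) and the 340 MHz TMDS clock ceiling of HDMI 1.4; whether Kepler's actual hardware limit equals that ceiling is an assumption made for the example.

```python
H_TOTAL, V_TOTAL, REFRESH_HZ = 4400, 2250, 60   # 3840x2160@60 incl. blanking
HDMI_1_4_MAX_TMDS_HZ = 340_000_000              # pre-HDMI-2.0 link limit

# Full-rate RGB / YUV 4:4:4 at 8 bits per component: one TMDS character
# per pixel, so the TMDS clock equals the pixel clock.
pixel_clock_hz = H_TOTAL * V_TOTAL * REFRESH_HZ

# YUV 4:2:0 carries half as much data per pixel, so the link runs at
# half the pixel clock.
tmds_420_hz = pixel_clock_hz // 2

print(pixel_clock_hz / 1e6, "MHz needed for RGB")
print(tmds_420_hz / 1e6, "MHz needed for YUV 4:2:0")
print("RGB fits in HDMI 1.4:  ", pixel_clock_hz <= HDMI_1_4_MAX_TMDS_HZ)
print("4:2:0 fits in HDMI 1.4:", tmds_420_hz <= HDMI_1_4_MAX_TMDS_HZ)
```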
21:52imirkin: anyways, this isn't like some trivial project. there's a steep learning curve, and RE'ing display stuff can be extra-tricky. but i suspect you're up to it.
21:52imirkin: otoh i don't want to totally discourage you with a too-difficult project
21:53imirkin: so perhaps can think of something else
21:53imirkin: if you think this is a little much for a first outing
21:55emersion: what are the basic principles to do blob RE? has somebody written notes about it?
21:55imirkin: too many
21:55imirkin: to the point of not being useful
21:55imirkin: there are two types of things you can capture
21:56imirkin: 1. mmio accesses by the blob. you need to use the 'mmiotrace' tracer, which is in the kernel
21:56emersion:looks at envytools
21:56imirkin: 2. command submissions. this is done using a valgrind plugin called valgrind-mmt
21:56imirkin: the reason why display is extra-tricky is that it generally needs a combination of both
21:57imirkin: like the thing i linked to is commands being submitted in a buffer, executed by some command processor
21:58imirkin: but there are also random registers that need setting, e.g. all the stuff in nvkm/engine/disp
21:58imirkin: which would be captured via mmio
21:58imirkin: this is a very good guide to operating mmiotrace: https://wiki.ubuntu.com/X/MMIOTracing
21:58emersion: ok, makes sense to me. that atomic display configuration interface also goes through command submission?
21:58imirkin: this is the valgrind mmt docs: https://nouveau.freedesktop.org/Valgrind-mmt.html
21:59imirkin: much of it, yes
21:59imirkin: that's the stuff in nouveau/dispnv50/*
21:59imirkin: (the command submission stuff)
22:00imirkin: (same general system since the original G80, obviously with some changes)
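For the mmiotrace half of the workflow described above, a first pass over a captured log might look like this sketch. It assumes the text format described in the kernel's mmiotrace documentation, where read and write events are lines of the form `R|W width timestamp map_id physical value PC PID`. The sample lines, addresses, and BAR base are invented, and in practice the envytools tooling would be used to decode register names rather than a script like this.

```python
from typing import NamedTuple, Optional


class MmioAccess(NamedTuple):
    kind: str         # "R" or "W"
    width: int        # access width in bytes
    timestamp: float
    map_id: int
    phys: int         # physical address accessed
    value: int


def parse_line(line: str) -> Optional[MmioAccess]:
    """Parse one R/W event line; return None for any other line type."""
    parts = line.split()
    if len(parts) < 6 or parts[0] not in ("R", "W"):
        return None
    return MmioAccess(parts[0], int(parts[1]), float(parts[2]),
                      int(parts[3]), int(parts[4], 16), int(parts[5], 16))


# Made-up sample in the assumed format; 0xfd000000 stands in for the BAR base.
SAMPLE = """\
VERSION 20070824
MAP 0.000001 1 0xfd000000 0xffffc90000000000 0x1000000 0x0 0
W 4 0.000010 1 0xfd610010 0x00000001 0x0 0
R 4 0.000012 1 0xfd610010 0x00000001 0x0 0
W 4 0.000020 1 0xfd00a004 0xdeadbeef 0x0 0
"""
BAR_BASE = 0xFD000000

events = [e for e in map(parse_line, SAMPLE.splitlines()) if e is not None]
# Register offsets of all writes, relative to the (assumed) BAR base.
write_offsets = [hex(e.phys - BAR_BASE) for e in events if e.kind == "W"]
print(write_offsets)
```

Filtering writes by register range is a typical starting point when looking for the display-related registers imirkin refers to.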
22:05emersion: hm, but valgrind-mmt is for user-space stuff right?
22:06emersion: how come display configuration is set from user-space?
22:06emersion: not kms?
22:06imirkin: maybe recent nvidia kernels changed it? dunno
22:07emersion: yeah: https://drmdb.emersion.fr/snapshots/d33ca872d3d4
22:07imirkin: i haven't done this in quite a while myself
22:08imirkin: ok, well maybe there's a way not to use the kms thing anyways
22:08emersion: actually this snapshot is strange, because it has DRM_CAP_PRIME = 3
22:08imirkin: or find an older blob
22:08imirkin: yeah, they added prime for turing
22:08emersion: is this released already?
22:08imirkin: afaik yes
22:08emersion: ok, didn't know
22:09emersion: but yeah, there's a kernel param to disable KMS
22:09imirkin: but only on turing
22:09imirkin: hopefully disabling KMS doesn't just make userspace call into the kernel to do kms things a different way
22:09imirkin: otherwise command submissions are slightly tough to capture
22:09imirkin: although you know what
22:09imirkin: we capture memory writes
22:10imirkin: so it'd just be a matter of finding where the pushbuf gets built up
22:10imirkin: unless the buffer is in system memory, then we're f'd
22:13emersion: how could the buffer be in system memory?
22:13imirkin: gpu can dma things from system memory
22:14emersion: ok, would be a little annoying indeed
22:14imirkin: and obviously mmiotrace does not capture writes to system memory
22:14imirkin: that'd be a bit much, sadly
22:15emersion: "plan a 50GiB buffer to capture traces"
22:16imirkin: usually a few hundred MB. couple GB at the very most if you're tracing something for a longer time
22:16imirkin: and they compress REALLY well
22:17emersion: 50GiB seems reasonable for a system memory trace ;)