02:07 imirkin: a pretty in-depth analysis of volta perf characteristics: https://arxiv.org/pdf/1804.06826.pdf
02:07 imirkin: among other things, has pretty good explanations of all the control flags
02:09 imirkin: oh wow, IMAD is fast on volta. nice.
02:10 HdkR: Everyone loves fast imad
02:11 imirkin: well, it's super-slow on earlier gpu's
02:12 HdkR: Sure, but imad is becoming more and more utilized
02:13 HdkR: That's a pretty thorough breakdown
02:18 HdkR: Lol, I love the number showing imul/imad latency on pascal
04:15 rhyskidd: pmoreau: congrats on becoming an NV research intern
07:35 pmoreau: rhyskidd: Thanks! :-) I’ll have to restrain myself not to look at hardware details, if I want to have a chance continuing to contribute to Nouveau to some degree. It will be hard :-/
07:37 pmoreau: imirkin: Nice article you found!
07:40 HdkR: That'll be a very difficult task :P
07:41 pmoreau: :-D
07:46 nikk__: Hey! I want to get started with this documentation task of writing a proper mmio scanning tool. Can someone please guide me on how to get started with this?
07:49 HdkR: pmoreau: What physical location are you interning at?
07:49 pmoreau: HdkR: Lund, Sweden, a couple of 100 meters from where I do my PhD. :-)
07:50 HdkR: Ah cool, someone I know is starting there next week as well :D
07:50 pmoreau: Ah nice! I heard that someone else was starting there next week; I start the week after.
07:52 pmoreau: nikk__: Hi! What would you want that tool to do exactly? There is already nvascan in envytools (https://github.com/envytools/envytools/blob/master/nva/README#L61)
07:54 nikk__: pmoreau: Hi! I actually found this task on trello and thought I could start by helping with this
07:56 pmoreau: Ah, this one https://trello.com/c/4xhrtohN/201-write-a-proper-mmio-scanning-tool-for-searching-tracesavailable-source ?
07:57 nikk__: pmoreau: yes!
07:59 pmoreau: Okay. Lyude might be able to help you get started. You’ll need access to a few mmio traces (and maybe VBIOSes).
08:00 nikk__: Oh okay thanks pmoreau
08:02 pmoreau: nikk__: The format of those MMIO traces can be found here: https://nouveau.freedesktop.org/wiki/MmioTraceLogFormat/. I guess that later on, when printing the different values, you could use lookup from envytools to break the hexadecimal values into the different bitfields or fields of that register.
08:03 pmoreau: You could generate your own MMIO traces, or use those attached to bug reports (I think there might be some on the bugzilla, though you’ll need to find those bugs first :-/). Additionally we can share some traces with you as well.
08:04 nikk__: Great! I’ll get going to read a little about this.
08:06 pmoreau: Let me see if I can find an MMIO trace around, so you can start having a look at it.
08:09 pmoreau: nikk__: https://phabricator.pmoreau.org/file/download/x47w3o6qqkmim5jbzxz2/PHID-FILE-cgxc6ekcblmpge372qte/gf114_bringup_reclock.mmio.xz and https://phabricator.pmoreau.org/file/download/qtouasifkhdtfsvptg3w/PHID-FILE-62m4cv6yo35j75mlj5hr/gp102_bringup_reclock.mmio.xz
08:11 nikk__: pmoreau Yep ok!
08:11 pmoreau: You’ll need to decompress them with xz, but that will give you two MMIO traces (they are in ASCII format).
08:12 nikk__: Oh okay I’ll do that!
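Since the traces are plain ASCII, a first scanning pass can be quite small. The following is a minimal sketch, assuming the read/write line layout described in the kernel's mmiotrace documentation (`R`/`W`, width, timestamp, map id, physical address, value, PC, PID). The names `Access`, `parse_line`, and `find_accesses`, as well as the sample lines and addresses, are made up for illustration and do not come from envytools or from the traces linked above.

```python
from collections import namedtuple

# One decoded read or write event from an mmiotrace log.
Access = namedtuple("Access", "kind width timestamp map_id addr value")


def parse_line(line):
    """Return an Access for R/W events, or None for any other line."""
    fields = line.split()
    if len(fields) < 6 or fields[0] not in ("R", "W"):
        return None
    kind, width, timestamp, map_id, addr, value = fields[:6]
    return Access(kind, int(width), float(timestamp), int(map_id),
                  int(addr, 16), int(value, 16))


def find_accesses(lines, bar_base, reg_offset):
    """Collect every read/write that hits bar_base + reg_offset."""
    target = bar_base + reg_offset
    return [a for a in map(parse_line, lines) if a and a.addr == target]


# Invented sample lines, only meant to exercise the parser.
SAMPLE = [
    "VERSION 20070824",
    "MAP 1.000000 1 0xf6000000 0xffffc90000000000 0x1000000 0x0 0",
    "R 4 1.000100 1 0xf6000000 0x0c4000a1 0x0 0",
    "W 4 1.000200 1 0xf6000200 0xffffffff 0x0 0",
    "R 4 1.000300 1 0xf6000200 0xffffffff 0x0 0",
]

hits = find_accesses(SAMPLE, 0xF6000000, 0x200)
for access in hits:
    print(access.kind, hex(access.addr), hex(access.value))
```

For the compressed traces, the standard library's `lzma.open(path, "rt")` can feed lines to `find_accesses` directly, so a separate decompression step is optional.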
08:41 pmoreau: hakzsam: Apparently the yield flag tells the scheduler whether to switch to another warp or execute the next instruction in the current warp. :o
09:17 pmoreau: Ahem, I need to fix a few things, cause that custom LLVM build is just taking way too much space! The build folder is about 36 GiB, and just the clang binary is already at 1.1 GiB /o\
09:20 HdkR: Just get more hard drive space right? :)
09:21 pmoreau: Sure, that would work :-D
09:30 HdkR: pmoreau: What llvm build are you playing with?
09:31 pmoreau: HEAD, and building llvm + clang + compiler-rt
09:31 pmoreau: I need a LLVM 7.0 build to compile SPIRV-LLVM-Translator against it
09:31 HdkR: ah, right
09:32 HdkR: I work in LLVM too frequently, so every time LLVM comes up I'm interested :)
09:33 pmoreau: :-) I understand, I tend to do the same regarding clover and OpenCL
09:56 karolherbst: imirkin: uhh fun, with OpenCL we have to allow the API to disable optimizations
09:59 pmoreau: Eh? You have to allow optimisations, (math) optimisations aren’t enabled by default
10:17 karolherbst: pmoreau: huh, really?
10:17 karolherbst: but there is -cl-opt-disable?
10:18 karolherbst: pmoreau: I could imagine that optimizations with different results are disallowed
10:18 karolherbst: pmoreau: "The default is optimizations are enabled."
10:18 karolherbst: pmoreau: and "cl-mad-enable" to enable mad optimizations
10:19 karolherbst: but the spec doesn't state if it should be disabled by default, because for some hardware it gives the same result
10:19 karolherbst: and then "cl-unsafe-math-optimizations"
10:19 karolherbst: not that I care much about that now, but this may get fun in the future
10:20 pmoreau: Ah right, there is cl-opt-disable.
10:21 pmoreau: But all the floating point ones are disabled by default.
10:21 karolherbst: no
10:22 karolherbst: only if they are unsafe
10:22 pmoreau: (Well, all the ones that result in not following IEEE-754)
10:22 karolherbst: or violate IEEE rues
10:22 karolherbst: *rules
10:22 karolherbst: ;)
10:22 karolherbst: or OpenCL rules
10:22 karolherbst: pmoreau: on nv50 we can do mul+add = mad
10:22 karolherbst: not on nvc0+
10:22 pmoreau: cl-mad-enable is disabled by default
10:22 karolherbst: are you sure?
10:22 karolherbst: I am sure it's not defined
10:23 pmoreau: “These options are not turned on by default [...]”
10:23 pmoreau: In the paragraph after cl-opt-disable
10:23 pmoreau: Section 45.8.4.3 (for OpenCL 2.1)
10:23 pmoreau: *5.8.4.3
10:24 karolherbst: mhh
10:24 karolherbst: weird
10:24 karolherbst: so we can do it on nvc0+ without that option
10:24 karolherbst: because mad is the truncated one
10:24 karolherbst: and we only have fma
10:24 karolherbst: makes no sense to do it this way, but....
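The reason fusing mul+add can change results is rounding: a separate multiply rounds the product before the add, while a fused multiply-add rounds only once at the end. The sketch below illustrates that difference with Python doubles (not float32), emulating the fused operation through exact rational arithmetic. It does not model the truncating mad behaviour mentioned above, and the function names are illustrative only.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Two roundings: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """One rounding: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# a*b is exactly 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused result loses the small term entirely.
unfused = mul_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

Whether a given driver may substitute one form for the other is exactly the kind of thing the build options discussed above are meant to control.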
10:27 karolherbst: or implement extensions other drivers already have, so that all mesa drivers are closer to each other
10:27 pendingchaos: karolherbst: wrong channel?
10:27 karolherbst: ...
10:27 karolherbst: yes
10:34 pmoreau: HdkR: Found my first mistake: compiling in RelWithDebInfo rather than Release: 36 GiB -> 2 GiB. And I guess dynamic linking should bring that further down.
10:42 karolherbst: pmoreau: what are you building? :D
10:43 pmoreau: karolherbst: LLVM HEAD, to compile SPIRV-LLVM-Translator against it, so that I can update my series for you (and update my reformat branch as well)
10:43 karolherbst: ohh
10:43 karolherbst: check the travis file
10:43 karolherbst: you don't really need to build llvm completely
10:43 karolherbst: though
10:44 karolherbst: but if you want to, you can go ahead :p
10:44 karolherbst: pmoreau: you also need clang by the way
10:44 pmoreau: Well, I have it built now, and I could compile SPIRV-LLVM-Translator, so, too late :-)
10:44 pmoreau: I built LLVM + clang + compiler-rt.
10:44 karolherbst: pmoreau: I would clone/symlink clang and SPIRV-LLVM-Translator (as llvm-spirv) into llvm/tools and build this way
10:44 karolherbst: ahh
10:45 pmoreau: That’s what I did on my laptop.
10:45 pmoreau: I tried the out-of-tree build on my desktop.
10:45 pmoreau: And it worked fine. I was just having that slight memory usage issue.
10:47 karolherbst: linking?
10:47 karolherbst: yeah...
10:47 karolherbst: you need 32GB
10:49 karolherbst: pmoreau: what's x.s3 in opencl? fourth component?
10:49 pmoreau: Maybe? I need to check
10:51 pmoreau: Yes
10:51 pmoreau: karolherbst: And for 16 component vectors, you even get x.sa (or x.sA) & similar
10:52 karolherbst: okay
10:52 karolherbst: so we have like three notations for this now .x .a and .s0?
10:53 pmoreau: It doesn’t look like .a is valid.
10:54 karolherbst: mhh, yeah maybe it is not a thing
10:54 karolherbst: I am currently implementing those fancy geometry functions
10:54 karolherbst: kind of fun
10:54 pmoreau: Also, there is .lo, so if you have a vec4, you could do vec4.x, vec4.s0, vec4.lo.x, vec4.lo.s0, vec4.lo.lo, vec4.even.lo, etc.
10:55 karolherbst: :(
10:55 pmoreau: But they will all end up with the same SPIR-V code, I would assume.
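To make the aliasing between those selectors concrete, here is a small sketch that resolves an OpenCL-style component selector to the list of element indices it names. It assumes power-of-two vector sizes and handles only `xyzw`, `sN`/`SN` (hex digits), `lo`, `hi`, `even`, and `odd`. The function `resolve` is hypothetical, not part of any compiler, and it does not validate selectors the way a real front end would.

```python
def resolve(size, selector):
    """Map a selector such as 'lo.s0' to indices into a vector of `size` elements."""
    indices = list(range(size))
    for part in selector.split("."):
        if part == "lo":
            indices = indices[:len(indices) // 2]
        elif part == "hi":
            indices = indices[len(indices) // 2:]
        elif part == "even":
            indices = indices[0::2]
        elif part == "odd":
            indices = indices[1::2]
        elif part[0] in "sS" and len(part) > 1:
            # Numeric form: each hex digit after the 's' picks one element.
            indices = [indices[int(digit, 16)] for digit in part[1:]]
        else:
            # Letter form: x, y, z, w.
            indices = [indices["xyzw".index(letter)] for letter in part]
    return indices


for name in ("x", "s0", "lo.x", "lo.s0", "lo.lo", "even.lo"):
    print(name, resolve(4, name))
```

All six spellings from the message above land on element 0 of a four-component vector, which is consistent with the expectation that they lower to the same SPIR-V.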
10:56 karolherbst: yeah
10:57 karolherbst: pocl has an odd length implementation
11:02 karolherbst: pmoreau: can you imagine why we would need any CF for https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/length.html ?
11:02 karolherbst: except maybe to get higher precision
11:05 karolherbst: pmoreau: :( "Data sample 4 at size 2 does not validate! Expected (0x1p+64), got (inf), source (0x1p+64), ulp 309485009821345068724781056.000000"
11:07 pmoreau: karolherbst: CF?
11:07 karolherbst: control flow
11:08 karolherbst: pmoreau: https://github.com/pocl/pocl/blob/master/lib/kernel/vecmathlib-pocl/length.cl
11:10 pmoreau: karolherbst: For optimisations: if all components are 0, there is no need to square them, add them and then take the sqrt of the result.
11:11 karolherbst: if all components are 0, the result is 0
11:12 pmoreau: Sure, but do you need to compute sqrt(0*0 + 0*0 + 0*0 + 0*0) to decide that? Or knowing that all components are 0, return straight 0.
11:12 pmoreau: s/straight/directly
11:18 karolherbst: well, I don't really want to do any control flow though :(
11:18 karolherbst: but now I am a bit better: "Data sample 14 at size 2 does not validate! Expected (0x1.fffffep+126), got (0x0p+0), source (0x1.fffffep+126), ulp -16777215.000000"
11:19 karolherbst: "vector: { 0x1.fffffep+126, 0x1.e7fe44p+8 } length vector size 2 FAILED"
11:21 pmoreau: The control flow can always be added later as an optimisation.
11:23 karolherbst: uhm
11:23 karolherbst: the cf is there for precision
11:24 karolherbst: and to prevent the / 0
11:24 karolherbst: or / inf
11:25 pmoreau: Why do you need the division though?
11:25 karolherbst: precision
11:26 karolherbst: the CTS is unhappy otherwise
11:27 pmoreau: Ah, so you get the different components to be between -1.0 and 1.0.
11:28 karolherbst: yeah
11:28 karolherbst: well 0 and 0.0
11:28 karolherbst: ...
11:28 karolherbst: 0.0 and 1.0
11:28 karolherbst: you can just abs the entire vector
11:28 sigod: if you have archlinux is it a bad idea to install nouveau-fw if you want a totally free system?
11:28 karolherbst: but for "{ 0x1.fffffep+126, 0x1.e7fe44p+8 }" -> "0x1.fffffep+126"
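The overflow behind those CTS failures, and the divide-by-largest trick, can be sketched as follows. This is an emulation under assumptions: float32 rounding is approximated by round-tripping Python doubles through `struct`, and `naive_length` / `scaled_length` are illustrative names, not the pocl or Mesa implementation. The early return is the control flow that guards against dividing by 0 or by inf.

```python
import math
import struct


def f32(x):
    """Round a Python float to float32, overflowing to infinity."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def naive_length(v):
    """sqrt(sum of squares) with every step rounded to float32."""
    total = 0.0
    for c in v:
        total = f32(total + f32(c * c))
    return f32(math.sqrt(total))


def scaled_length(v):
    """Divide by the largest magnitude first so the squares stay in [0, 1]."""
    largest = max(abs(c) for c in v)
    if largest == 0.0 or math.isinf(largest):
        return largest  # avoids x / 0 and x / inf
    total = 0.0
    for c in v:
        s = f32(c / largest)
        total = f32(total + f32(s * s))
    return f32(largest * f32(math.sqrt(total)))


big = float.fromhex("0x1.fffffep+126")
small = float.fromhex("0x1.e7fe44p+8")
print(naive_length([big, small]), scaled_length([big, small]).hex())
```

With the vector quoted above, the naive version overflows while the scaled one returns the large component unchanged, which matches the expected value in the failing test.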
11:29 karolherbst: sigod: you only really need it for maxwell2+ and video decoding acceleration
11:29 karolherbst: sigod: but those firmwares don't really do much
11:29 karolherbst: sigod: your call
11:29 sigod: ok thanks
11:30 pmoreau: sigod: Video decoding, or if you are having issues with Nouveau’s own firmwares (for example on GK106 and GK107 chipsets)
11:31 karolherbst: pmoreau: I will care about it later... the current impl is good enough
11:32 pmoreau: karolherbst: That package does not include the signed firmwares, only those extracted by imirkin’s script. (AFAICS)
11:32 karolherbst: pmoreau: ahh
11:33 sigod: im using a gk106
11:33 pmoreau: mupuf created that package to make it easier for users to get the video firmwares. The official NVIDIA firmwares are provided in linux-firmware, similarly to the other firmwares from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
11:35 pmoreau: sigod: If you are having issues with Nouveau, similar to https://bugs.freedesktop.org/show_bug.cgi?id=93629 or https://bugs.freedesktop.org/show_bug.cgi?id=83897 and some others, then the firmware provided by nouveau-fw (which contains firmware from NVIDIA) might help.
11:36 pmoreau: Otherwise, if everything is fine, then you might only need those firmwares if you need accelerated video decoding.
12:03 karolherbst: does BFE sign extend?
12:04 karolherbst: uhh, seems like it does
12:07 karolherbst: imirkin: do you have any ideas how to handle 8 and 16 bit integer math? like min/max of s8 and s16?
12:07 karolherbst: my current idea is to lower that pre SSA
12:07 karolherbst: for ops which don't support it
12:10 karolherbst: basically just stick bfes and ands where needed
12:35 hakzsam: pmoreau: that would make sense
13:48 imirkin: karolherbst: there's a BFE.S32 and BFE.U32
13:48 imirkin: the S32 one sign-extends
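As a rough model of the difference between the two variants, the sketch below implements a generic bit-field extract with and without sign extension. It only captures the general idea: how the hardware packs the offset and width operands, and what it does for out-of-range values, is not modelled, and the helper names are made up. A width of at least 1 is assumed.

```python
MASK32 = 0xFFFFFFFF


def bfe_u32(value, offset, width):
    """Extract `width` bits starting at `offset`, zero-extended."""
    return (value >> offset) & ((1 << width) - 1)


def bfe_s32(value, offset, width):
    """Extract `width` bits starting at `offset`, sign-extended (width >= 1)."""
    field = bfe_u32(value, offset, width)
    sign_bit = 1 << (width - 1)
    return (field ^ sign_bit) - sign_bit


def to_u32(value):
    """View a (possibly negative) Python int as a 32-bit register value."""
    return value & MASK32


word = 0x0000FF00
print(hex(bfe_u32(word, 8, 8)), hex(to_u32(bfe_s32(word, 8, 8))))
```

The signed form turns the byte 0xff into -1, i.e. 0xffffffff as a register value, which is what an s8 to s32 widening before a signed min/max would need.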
13:49 imirkin: the video ops do support doing math on 8- and 16-bit parts
13:49 imirkin: but i don't know precisely how they work
13:53 karolherbst: imirkin: yeah, sure. But I was more like thinking in general
13:53 karolherbst: when would you want to do the lowering for example
13:54 imirkin: later
13:54 imirkin: that way you still get opts on the "math" operations
13:55 karolherbst: mhh
13:57 karolherbst: imirkin: well those video ops are really just vector ops on vectors inside single registers
13:57 imirkin: correct.
13:57 imirkin: but it's more than that.
13:57 imirkin: i think they do a single subop at a time
13:57 imirkin: or a couple
13:57 imirkin: and you say like
13:57 imirkin: byte 0 of src0 * byte 1 of src1
13:58 imirkin: that sort of thing
13:58 imirkin: but then wtf is in the result? who knows.
13:59 karolherbst: well v*2 does 16 bit stuff and v*4 8 bit
13:59 karolherbst: but yeah
14:02 imirkin: right. but i think it might only do one op at a time
14:02 imirkin: which is precisely what we want...
14:03 karolherbst: yeah
14:03 karolherbst: but mhh, nvidia doesn't use those
14:03 karolherbst: they just do bfe and imnmx
14:05 karolherbst: even for char4, just bfe, imnmx, and, bfi
14:05 karolherbst: unoptimized they use XMAD.PSL.CLO though
14:06 karolherbst: imirkin: I think I will decide depending on the op
14:07 karolherbst: we can just lower it for max/min pre SSA, because it really just inserts bfe instructions
14:07 karolherbst: although
14:07 karolherbst: most of the time it is just inserting bfe or bfi instructions
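The extract / min / mask / insert sequence described for char4 can be sketched per lane as below. This is an illustration of the lowering idea only, with made-up helper names. It does not reproduce the blob's actual instruction sequence or codegen's IR.

```python
def sext8(byte):
    """Sign-extend an 8-bit value (the effect of a signed bit-field extract)."""
    return (byte ^ 0x80) - 0x80


def bfi(base, insert, offset, width):
    """Insert the low `width` bits of `insert` into `base` at `offset`."""
    mask = ((1 << width) - 1) << offset
    return (base & ~mask) | ((insert << offset) & mask)


def min_char4(a, b):
    """Signed per-byte minimum of two packed char4 values."""
    result = 0
    for lane in range(4):
        offset = lane * 8
        lane_a = sext8((a >> offset) & 0xFF)   # bfe, signed
        lane_b = sext8((b >> offset) & 0xFF)   # bfe, signed
        smallest = min(lane_a, lane_b)         # integer min
        result = bfi(result, smallest & 0xFF, offset, 8)  # and + bfi
    return result


print(hex(min_char4(0x01FF7F80, 0x02007E81)))
```

Note that the third lane compares 0xff against 0x00: treated as signed bytes the minimum is 0xff (-1), whereas an unsigned comparison would pick 0x00, which is why the sign-extending extract matters.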
14:18 imirkin: pendingchaos: this is the list of your outstanding patches, correct? https://patchwork.freedesktop.org/project/mesa/list/?submitter=17518
14:18 imirkin: i just cleaned up that list slightly, hopefully i didn't nuke anything i shouldn't have
14:19 pendingchaos: looks right
14:21 karolherbst: imirkin: ohh if you have some time this weekend, you could take a look at https://patchwork.freedesktop.org/series/40199/ and https://patchwork.freedesktop.org/series/40754/
14:23 imirkin: right yeah. thanks for reminding me.
14:23 imirkin: you kept mentioning those at poorly timed moments, and i kept forgetting. sorry!
14:24 karolherbst: yeah, I should remember to ping you only on saturday morning :p
14:24 karolherbst: (or noon)
14:25 imirkin: ;)
14:25 imirkin: karolherbst: i'm going to do a minor rewrite of your RA change. same basic thing just a minor restructure.
14:25 karolherbst: okay
14:39 imirkin: karolherbst: https://hastebin.com/hucuqihaca.php
14:39 imirkin: objections?
14:42 karolherbst: looks good
14:47 pendingchaos: I'm getting some poor msaa quality. it only seems to affect some of the edges of a triangle. is this a known issue?
14:48 imirkin: define 'quality'
14:48 pendingchaos: It's like instead of a smooth gradient, it's the triangle color or 50% mix between the background and the triangle
14:48 pendingchaos: I think I'll try to get a screenshot
14:48 imirkin: our resolves suck
14:49 imirkin: and use some questionable techniques.
14:50 pendingchaos: https://lh5.googleusercontent.com/Ahot0mMPrA7YM1gosRpzj7DWZT8Ay_NWX6Znk939mfEsmICcNAmG0dTgj1jV267fQYfqLXvxucSkjiMSCkAR
14:50 pendingchaos: it's worse than no msaa
14:50 imirkin: try forcing it to use the 3d path in nvc0_blit()
14:50 imirkin: i.e. just set eng3d = true.
14:51 pendingchaos: that doesn't fix it
14:52 imirkin: well, i'd be happy to check what your code does on GF108
14:52 imirkin: maybe something changed somewhere
14:53 pendingchaos: https://hastebin.com/cajawelefa.cpp
14:53 pendingchaos: it's just a big triangle
14:55 imirkin: karolherbst: we should only do that when imm is not a short immediate, right?
14:55 karolherbst: imirkin: yeah, but that usually already happened pre RA
14:55 karolherbst: so only important in case we don't do that opt
14:56 karolherbst: I mean with the short imm
14:56 imirkin: karolherbst: oh right duh
14:56 imirkin: pendingchaos: what size should i make the window to screenshot?
14:57 imirkin: or did you just grab the tip of it?
14:57 pendingchaos: I just grabbed the tip
14:58 imirkin: pendingchaos: yeah i get the exact same thing... let's check nv50
14:58 imirkin: oh, i have to reboot for that. nevermind.
14:59 imirkin: are you pretty sure that this shouldn't happen?
14:59 imirkin: we do fail some msaa tests btw
14:59 imirkin: but not related to regular plain ol' rasterization
15:00 pendingchaos: I don't see how it could happen with a correct implementation
15:02 pendingchaos: actually maybe it could if you get the vertices right
15:02 imirkin: i've never *really* investigated this stuff so carefully
15:02 imirkin: i just kinda twiddle some bits, tests start passing, move on to the next thing ;)
15:03 pendingchaos: the triangle should be completely symmetrical though
15:03 imirkin: afaik msaa is fairly poorly defined as to specifics of resolves, etc.
15:03 imirkin: pendingchaos: should it?
15:03 imirkin: the sample positions aren't laid out on a symmetric grid necessarily
15:04 pendingchaos: I guess so
15:05 pendingchaos: "poor sample locations for the task" seems more likely now
15:06 imirkin: https://msdn.microsoft.com/en-us/library/windows/desktop/ff476218(v=vs.85).aspx
15:06 imirkin: have a look at the standard msaa 4 pattern
15:07 imirkin: it's at some funny angle, which != the angle of the triangle, i think it's plausible.
15:07 pendingchaos: that's what nouveau uses?
15:07 imirkin: well, it's what the hw does :)
15:07 imirkin: pre-GM200, sample locations aren't configurable
15:07 imirkin: for GM200+, we configure the standard locations
15:07 imirkin: since anything else would lead to a pile of trouble
15:09 imirkin: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_context.c#n517
15:10 imirkin: and this: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_state_validate.c#n214
15:30 karolherbst: imirkin: can we use cvt to convert to u16/u8/s16/s8?
15:30 karolherbst: like from float
15:30 imirkin: one way to find out
15:36 annadane: so i was watching RMS's libreplanet video in which he mentions that for the first time now, there's a free decompiler... i wonder if this could help at all with nouveau
15:36 annadane: i'm not actually sure what it is
15:39 imirkin: karolherbst: both of these seem to work as expected on GF108: https://hastebin.com/wehetamede.go
15:39 imirkin: karolherbst: here's the shader_test i'm using: https://hastebin.com/ukatopisiz.cs
15:40 imirkin: are you sure you saw issues on fermi/kepler1?
15:41 karolherbst: yeah
15:42 karolherbst: but you ended with a LOP32I
15:43 karolherbst: mhh, I can check on monday again
15:44 annadane: in other words, the decompiler helps with reverse engineering
15:46 karolherbst: imirkin: there is a I2I.U16.U8
15:49 rhyskidd: is there a list of known problems with nouveau's atomic ioctl?
15:49 rhyskidd: background is: why is nouveau.atomic defaulted to off?
15:49 rhyskidd: see here: https://lists.freedesktop.org/archives/dri-devel/2017-September/153581.html
15:51 karolherbst: imirkin: yeah, so we can do s8 -> s32 and u8 -> u32 and to 16 bit as well
16:05 mupuf: funny how the blob is OK with having the fan not rotating at all in some circumstances
16:05 mupuf: and it is not just based on the temperature
16:05 mupuf: I guess they actually check the power usage
16:05 mupuf: and under a certain usage, they consider that passive dissipation is sufficient
16:06 mupuf: more things to reverse!
16:12 pmoreau: mupuf: I have seen that on my GP104. Dunno if it’s the blob deciding that, or just the fan threshold that are set higher in the VBIOS.
16:30 OhGodAGirl: In comparison to what card?
16:30 OhGodAGirl: *perks up at mention of GP104 VBIOS*
16:31 OhGodAGirl: Oh, by the way, you guys might be interested in this:
16:32 OhGodAGirl: https://arxiv.org/pdf/1804.06826.pdf
16:33 OhGodAGirl: https://usercontent.irccloud-cdn.com/file/3nrCYFRq/s8122_dissecting_the_volta_gpu_architecture.pdf
16:34 OhGodAGirl: OhGodACompany will be releasing a tool in about three hours that modifies the VBIOS for the GeForce line while bypassing the need for flashing and skirting past Falcon for poking some malicious areas.
16:34 OhGodAGirl: *poofs*
16:38 karolherbst: OhGodAGirl: uhm, getting the blob to load a different vbios might indeed come in handy
16:39 OhGodAGirl: *nod* That can be done.
16:39 OhGodAGirl: Windows or Linux, pick your poison.
16:44 juri_: good job. this come with a 'stop buying nvidia products' warning?
16:44 annadane:loves having bought a computer back when i didn't know what i was doing; nvidia graphics (open source unfriendly), intel processor (IME, spectre/meltdown)
16:47 karolherbst: imirkin: do you know what some of the issues are when compiling multiple shaders in parallel in codegen (besides the debug output being screwed)?
16:48 karolherbst: annadane: you are basically screwed on all hw these days
16:48 karolherbst: annadane: nice if your CPU and GPU is blob free, then you have your network cards having blobs again,
16:48 karolherbst: well, in the end everything is pointless until you have open hardware network interfaces
16:49 juri_: it's a constant fight. hence, why we each have to do our part... and still lose. ;)
16:49 karolherbst: ;)
16:52 imirkin: karolherbst: lol, yeah. i messed that up. thanks ;)
16:52 imirkin: i already had the patch in place.
16:53 imirkin: karolherbst: and yes, without the patch, it messes up. ok good.
16:54 imirkin: rhyskidd: skeggsb is scaredy-cat.
16:54 imirkin: afaik the only reason atomic is off is fear of the unknown
16:54 rhyskidd: hrmm
16:54 pmoreau: OhGodAGirl: imirkin shared the article earlier today; it was really interesting! That upcoming tool sounds very interesting :-)
16:54 rhyskidd: ok
16:55 rhyskidd: getting igt running routinely through nouveau would help, i guess?
16:55 imirkin: or once ;)
16:55 imirkin: i dunno if it's been done
16:55 imirkin: there's been some talk of it
16:56 imirkin: without CRC's, can't be automated, unfortunately
16:56 imirkin: (well, some things can)
16:57 rhyskidd: yes, mupuf spoke to me about the lack of CRCs and how that would expose WAY more of nouveau to igt's testing surface
16:59 imirkin: the hardware does support them
17:00 imirkin: i'm not sure if ben has quite worked out how to operate them
17:03 OhGodAGirl: juri_ Never stop buying NVIDIA products.
17:03 OhGodAGirl: Buy more of them. =3
17:03 OhGodAGirl: I come from the cryptocurrency side so part of the reason we're doing this is for that dank hash(rate).
17:05 imirkin: OhGodAGirl: your analysis is probably more sophisticated already, but i've started writing a tool to locate firmware in nvidia's blob -- https://github.com/envytools/firmware/blob/master/scanner.go
17:35 annadane: i believe the AMD free drivers are quite good these days, good alternative for people who feel boxed in by NVIDIA's hostility
17:36 imirkin: i look at NVIDIA as the "alternative" to AMD's hardware with open-source support.
17:55 imirkin: karolherbst: sent an updated version of that fix for short immediates
18:26 juri_: OhGodAGirl: I'm trying to make my ImplicitCAD code compile and run on my nvidia card. so, i don't want X, and i do want double precision floating point (but can make do with single).
18:27 OhGodAGirl: Which card?
18:37 Subv: hey, i was wondering if anyone knew what the fields in the device spec table of the nvidia driver blob mean, the one used to match the current gpu to the configuration values in the driver
18:39 rhyskidd: PDISP = display engine
18:39 rhyskidd: but what might PDISP_FE mean? more particularly the FE part
18:42 karolherbst: imirkin: nice
18:43 karolherbst: imirkin: I doubt it makes much of a difference which encoding we actually use, but yeah
18:54 imirkin: karolherbst: yeah, but there were other isLIMM checks
18:54 imirkin: we have to be a bit careful
18:54 imirkin: and might as well roll all those other "fixes" back
18:54 imirkin: which will make the isLIMM code exercised more
18:55 imirkin: as well as fixing up asserts to ensure that it's all things that are meant to work
19:04 imirkin: rhyskidd: frontend? :)
19:05 OhGodAGirl: juri_ What card do you need help with? Can't assist without the info.
19:07 juri_: OhGodAGirl: I'm working with a set of quadro 600 cards ATM. just verifying the code and documentation for them.
19:08 juri_: once some parts get here, i'm going to try my hand at getting seavgabios to initialize them, so i can do away with the vendor's bios.
19:08 juri_: (yes, i'm one of those free software zealots)
19:09 imirkin: juri_: using a pentium 3 cpu, i hope?
19:10 imirkin: and a fully open-source bios on your motherboard?
19:10 juri_: sadly no, on an AMD 5350 (agesa, ugh). but, corebooted.
19:10 juri_: i have some libreboot hardware, and am trying to get it running for the task.
19:10 imirkin: but of course with solid plans to replace all that dirty, dirty cpu microcode
19:11 juri_: solid plan == leave behind X86. ;)
19:11 imirkin: hehe
19:11 imirkin: RISC-V ftw!
19:12 juri_: first, i want to get my nvidia card working using free software only, and attached via a usb3 -> pcie bridge..
19:20 HdkR: pmoreau: Dynamic linking helps with building LLVM so much
19:21 orbea: juri_: i hear amd is better for that :P
19:23 gnarface: woah
19:23 gnarface: is there such a thing as a usb3->pcie bridge?!
19:24 juri_: there is.
19:24 gnarface: wow, and here i thought my gamecube controller adapter was breaking all the rules
19:37 OhGodAGirl: We call them risers. ;)
19:39 OhGodAGirl: But the USB3380-AB EVK-RC is very decent
19:39 gnarface: i would have thought latency would be too great for it to even work
19:39 OhGodAGirl: Well, in crypto, we don't need low latency.
19:39 gnarface: oh i see
19:39 OhGodAGirl: But in other tasks? Yeah, it is a bottleneck.
19:40 gnarface: oh i meant like i thought pcie devices would fail to handshake
19:40 gnarface: like the device wouldn't POST fast enough or something
19:40 OhGodAGirl: I use a thunderbolt dock for all my MacOS to GPU shit.
19:40 gnarface: but i guess i don't really know how that stuff works at a low level
19:40 juri_: OhGodAGirl: ah, you're a step ahead of me, then. i'm doing 3d design (non-real-time), so latency is not a problem here, either.
19:41 juri_: what 3380 boards did you end up with?
19:42 OhGodAGirl: I just have the evaluation kit from PLX. Like I said, risers are what I usually use, or custom motherboard, because fuck the system. =D
19:42 OhGodAGirl: You can look up the USB3380-AB EVK-RC
19:43 juri_: ah. yeah, i've seen it.
19:43 OhGodAGirl: But I'm using a Thunderbolt dock at the moment for anything that requires testing.
19:43 HdkR: I bought an M.2 to PCIe riser a couple of days ago :D
19:43 OhGodAGirl: Yay risers. =3
19:43 juri_: i figured i'd have to spin a new board for the chip.
19:43 gnarface: i guess mining doesn't really use much of the PCIe bandwidth, either?
19:44 gnarface: it's more just tying up some specific subset of the GPU resources with abnormally huge calculations then?
19:44 gnarface: does that mean like with... 8 usb ports you could reasonably expect to connect 8 video cards???
19:45 gnarface: where does the power come from?
19:45 imirkin: those risers tend to have their own separate power plugs
19:46 imirkin: so you have a PSU on the side and a 12V plug
19:46 OhGodAGirl: Yes you can. That's where risers come in! They have power slots from molex to sata, though the custom ones have 6-pin adapters
19:46 HdkR: https://twitter.com/Sonicadvance1/status/987511645390880768 You can see the power connectors on the riser board :)
19:46 OhGodAGirl: Most motherboard vendors built boards specifically for us =3
19:46 gnarface: this opens up all kinds of new possibilities
19:46 OhGodAGirl: https://hothardware.com/ContentImages/NewsItem/41903/content/ASUS_B250_Mining_Expert.jpg https://usercontent.irccloud-cdn.com/file/XMEzEur8/image.png
19:46 gnarface: i had no idea...
19:47 OhGodAGirl: That's a 19-GPU motherboard
19:47 OhGodAGirl: Now, cryptocurrency miners are looking into alternate compute workloads
19:49 karolherbst: what was the lea instruction again in sass?
19:49 HdkR: iscadd?
19:50 OhGodAGirl: Load Effective Address
19:51 OhGodAGirl: It lets you perform memory addressing calcs without actually addressing mem
19:51 OhGodAGirl: doesn't alter flags
19:51 OhGodAGirl: it's like a funky mov
19:51 karolherbst: I don't mean sulea
19:51 OhGodAGirl: I know
19:51 OhGodAGirl: LEA
19:51 imirkin: HdkR: ISCADD = shift left + add
19:51 imirkin: (shift left by immediate)
19:51 karolherbst: huh
19:51 OhGodAGirl: better to do an example
19:51 karolherbst: nvidia is using lea to implement uhadd
19:51 imirkin: which is often done for a LEA-style thing, yes
19:52 imirkin: there's also a SULEA on ... some arch
19:52 karolherbst: lea(xor(a, b), and(a, b), 0, 0x1f)
19:52 HdkR: imirkin: Yea, that's why it was my guess, since it could be used for scaled address calculation :D
19:52 OhGodAGirl: That's correct HdkR
19:53 HdkR: woop woop
19:56 karolherbst: imirkin: we don't support lea in codegen or does it have a different name there?
19:59 imirkin: karolherbst: it doesn't really come up... when would you ever get the effective address?
19:59 imirkin: karolherbst: for SSBO's there's a lowering pass that retrieves it based on the b[] reference
19:59 imirkin: for images, there's SULEA
20:23 karolherbst: imirkin: sure.. but as I said, nvidia uses that for implementing uhadd
20:24 imirkin: what's uhadd again?
20:24 karolherbst: (a + b) / 2 without overflow
20:24 imirkin: can you pastebin what nvidia does?
20:26 karolherbst: PTX for hadd and uhadd + sass output: https://gist.github.com/karolherbst/209e7a19adbd97d5479eea5970d0e1f9
20:27 imirkin: what's the difference between hadd and uhadd? oh, for the div-by-2, sign?
20:27 karolherbst: kind of looks like a shr+add to lea opt
20:27 karolherbst: imirkin: well yeah, uhadd is just unsigned
20:28 karolherbst: shr.s32 vs shr.u32
20:29 imirkin: ahhhh clever
20:30 karolherbst: seems like something which might be useful in a very small number of situations, but still
20:30 imirkin: yeah ... i'd have to work through it
20:30 imirkin: but i think i kinda see what's going on
20:30 imirkin: LEA.HI R0, R2, R0, RZ, 0x1f;
20:30 imirkin: i suspect what this means is
20:31 karolherbst: ohhh
20:31 imirkin: let a = R2:RZ (64-bit)
20:31 imirkin: add R0 (1<<0x1f) sized elements to it
20:31 imirkin: and then return the high dword
20:31 karolherbst: smart
20:31 imirkin: which is exactly your hadd.
20:31 karolherbst: yeah
20:32 karolherbst: the ordering is a bit odd though
20:32 karolherbst: mhh wait
20:32 imirkin: yeah, i think the args are in a funny order. same thing with SHF.L iirc
20:33 karolherbst: (R2:RZ << 0x1f).HI + R1
20:33 karolherbst: weird
20:33 imirkin: no
20:33 karolherbst: uhm
20:33 karolherbst: R0
20:33 imirkin: (R2:RZ + (R1 << 0x1f)).HI
20:34 karolherbst: mhh
20:34 imirkin: that's my guess based on how LEA normally works (e.g. x86)
20:35 karolherbst: hadd: xor(a, b) >> 1 + and(a, b)
20:36 imirkin: well, they do the xor / and stuff above
20:36 karolherbst: right
20:37 karolherbst: and then you get R2 >> 1 + R0
20:37 karolherbst: or (R2 << 31).HI + R0
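The identity behind this is a + b = 2*(a & b) + (a ^ b), so halving the xor term and adding the and term gives the average without needing a 33rd bit. The sketch below checks that, and models the last reading of the instruction from the discussion, (R2 << 31).HI + R0, as `lea_hi_guess`. That function is only a guess at the semantics as discussed here, not confirmed hardware behaviour, and all names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def uhadd(a, b):
    """Unsigned (a + b) / 2 on 32-bit values without overflowing 32 bits."""
    return ((a & b) + ((a ^ b) >> 1)) & MASK32


def hadd(a, b):
    """Signed variant on Python ints; >> is an arithmetic shift here."""
    return (a & b) + ((a ^ b) >> 1)


def lea_hi_guess(r2, r0, shift=0x1F):
    """Guessed LEA.HI model: high dword of (r2 << shift), plus r0."""
    return ((((r2 & MASK32) << shift) >> 32) + r0) & MASK32


a, b = 0xFFFFFFFF, 0x00000001
print(hex(uhadd(a, b)), hex(lea_hi_guess(a ^ b, a & b)))
```

Under this reading, feeding the xor result as the shifted operand and the and result as the addend reproduces `uhadd` exactly for unsigned inputs.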
20:40 karolherbst: but I think it might be still good to figure out what we can all do with lea
20:41 imirkin: yeah, i've been meaning to write a gallium-api-using thing which feeds pre-compiled compute shaders in
20:41 imirkin: so help RE instructions
20:41 imirkin: to*
20:42 imirkin: but ... $otherthings
20:43 karolherbst: yeah
20:44 karolherbst: but aren't there opengl apis for that already?
20:44 imirkin: for feeding in pre-compiled shaders?
20:44 imirkin: not easily.
20:44 karolherbst: ohh, we can fetch those
20:51 karolherbst: imirkin: uhh, for an add we have to cast s8 to s16 :( that 8bit and 16bit stuff really starts to get a bit annoying
22:08 pendingchaos: imirkin: am I correct in thinking calling each of the sample location methods (0x11e0 and stuff) with 0x88888888 should set the sample locations to all be the center of the pixel?
22:09 imirkin: yes
22:09 imirkin: however the precise meaning of those registers is a little odd
22:09 imirkin: i.e. for 2x MSAA, it's pairs of 2
22:09 imirkin: for 4x MSAA it's pairs of 4
22:09 imirkin: etc
22:10 imirkin: so you can define squares of a varying number of pixels
22:10 imirkin: depending on MSAA level
22:10 imirkin: pendingchaos: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_state_validate.c#n214
22:10 pendingchaos:nods
22:10 imirkin: this is how we set it now
22:11 pendingchaos: calling them all with 0x8...8 should effectively disable msaa?
22:12 imirkin: 4 8-nibble dwords, 2 nibbles per sample location (x/y), so that's 16 sample locations that can be specified total
22:12 imirkin: mmmm ... good question. i doubt the hardware would take kindly to identical sample locations
22:13 pendingchaos: I seem to be getting an anti-aliased image doing that
22:13 imirkin: perhaps there's an enable somewhere?
22:14 imirkin: i do remember that not specifying sample locations broke some things.
22:14 imirkin: but perhaps it has to be enabled explicitly for other things
22:15 pendingchaos: I think I'll see what the blob does in the identical sample location situation sometime
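The register layout described above (4 dwords of 8 nibbles, 2 nibbles per sample location, 16 locations in total) can be sketched in C as follows. The nibble order within a byte (x in the low nibble, y in the high nibble) and the ordering of locations across dwords are assumptions made for illustration; the chat does not specify them. The struct and function names are hypothetical. The value 8 is the pixel centre, as confirmed for 0x88888888.

```c
#include <stdint.h>

#define NUM_SAMPLE_LOCS 16

/* One sample location: 4-bit x and y, where 8 means the pixel centre. */
struct sample_loc {
    uint8_t x;
    uint8_t y;
};

/* Pack 16 sample locations into 4 dwords, one byte per location.
 * Assumed layout (not confirmed): x in the low nibble, y in the high
 * nibble, locations stored from the least significant byte upwards. */
static void pack_sample_locs(const struct sample_loc locs[NUM_SAMPLE_LOCS],
                             uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = 0;

    for (int i = 0; i < NUM_SAMPLE_LOCS; i++) {
        uint32_t byte = (uint32_t)(locs[i].x & 0xf) |
                        ((uint32_t)(locs[i].y & 0xf) << 4);
        out[i / 4] |= byte << (8 * (i % 4));
    }
}
```

Setting every x and y to 8 yields 0x88888888 in all four dwords regardless of the assumed nibble order, which is the all-centre case discussed above.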
22:44 imirkin: karolherbst: planning on looking at the limm patch version i sent, or not? (happy to let it sit out there, but i think you're the only person who even occasionally reviews, so don't want to wait on it for no reason)
22:45 karolherbst: imirkin: I would like to test the patch on Monday when I have access to my GPUs
22:46 imirkin: ok sure
22:47 imirkin: karolherbst: what about https://patchwork.freedesktop.org/patch/216042/
22:49 karolherbst: looks fine. I am just worried that changes like that may break something somewhere, but this is mainly due to the fact that I can't really predict which change in RA changes what
22:51 imirkin: :)
22:52 imirkin: it shouldn't change anything except the one thing it addresses, textures on nv50
22:52 imirkin: and actually only single-arg single-def textures
22:52 imirkin: so ... rare.
22:52 karolherbst: I see
22:52 imirkin: 1d textures that only use one component result. or the texture size that only uses one dimension
22:53 imirkin: not exactly common
22:57 pendingchaos: imirkin: I might also see if I can figure out how 8x msaa works with varying sample locations per pixel
22:58 pendingchaos: looking at the spec, the pixel grid should be the same size for all sample counts
22:59 karolherbst: why do I end up with xor when doing "~srcA"....
23:00 karolherbst: xor 0xff is kind of okay, but not if something fails to do it correctly for chars..
23:00 imirkin: pendingchaos: ah, dunno - at least the AMD one lets you specify like a 2x2 grid
23:01 imirkin: this is desirable so that you can set diff pixels' sample locations differently
23:01 imirkin: to get sort of a dithering effect
23:01 imirkin: (dithering's the wrong term, but hopefully you get the idea)
23:01 imirkin: with gm200, for 8x you can do a 2x1 grid (or 1x2, no idea)
23:01 imirkin: for 4x, you can do the 2x2
23:08 imirkin: karolherbst: xor with 0xff seems like a pretty simple way to do a bitwise not on a u8 -- no real better way i can think of
23:09 karolherbst: why not do a simple not?
23:09 imirkin: that will affect the other bits too
23:09 karolherbst: which usually doesn't matter because you will most likely insert explicit conversion points anyway
23:10 imirkin: right, depends what you want to do with the result
23:10 imirkin: but if you have 4 u8's packed in a single 32-bit reg
23:10 imirkin: and you want to negate just one of them
23:10 imirkin: and pass the rest through
23:10 karolherbst: right
23:10 karolherbst: in that case it matters
23:10 imirkin: then this xor-based method works nicely
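A minimal C sketch of the point made above: XOR with 0xff inverts one u8 and leaves the other bits alone, while a full 32-bit NOT flips every packed byte. The function names are made up for illustration, and `lane` is assumed to be in the range 0..3.

```c
#include <stdint.h>

/* Bitwise NOT of a single u8 value, expressed as an XOR with 0xff. */
static uint8_t not_u8(uint8_t v)
{
    return (uint8_t)(v ^ 0xff);
}

/* Invert only one of four u8 lanes packed in a 32-bit register and
 * pass the other three through unchanged. lane must be 0..3. */
static uint32_t not_u8_lane(uint32_t reg, unsigned lane)
{
    return reg ^ (0xffu << (8 * lane));
}
```

For example, `not_u8_lane(0x11223344, 0)` gives 0x112233bb, whereas `~0x11223344` would also change the upper three bytes.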
23:11 karolherbst: right... but why would the spir-v care?
23:11 karolherbst: although, I don't think there is an OpBitwiseNot
23:12 karolherbst: uhh, indeed, it doesn't exist
23:12 karolherbst: the bug is something else, because a vec of 0xff values ended up as a vector of 0xffffffff 0x0 0x0 0x0
23:12 karolherbst: not that great
23:13 imirkin: yeah, dunno about spir-v stuff.
23:20 karolherbst: "OpConstantComposite %v4uchar %uchar_255 %uchar_255 %uchar_255 %uchar_255" -> vec4 8 ssa_28 = load_const (0xffffffff /* -nan */, 0x00000000 /* 0.000000 */, 0x00000000 /* 0.000000 */, 0x00000000 /* 0.000000 */)
23:20 karolherbst: this looks pretty wrong :)
23:21 imirkin: depends how things are defined
23:21 imirkin: and where it's supposed to get its bits from
23:21 imirkin: but yeah, not what i would have assumed
23:30 karolherbst: imirkin: seems like it was just a displaying bug
23:31 karolherbst: uhh of course, it is my fault
23:34 karolherbst: imirkin: possible opt, although I am sure we will never use it: xor a ~0 -> not a
23:34 karolherbst: or maybe we do that
23:34 karolherbst: just it doesn't work for u16 and u8 if you have 0xffff and 0xff
23:36 imirkin: karolherbst: that's a minor opt... there is no NOT
23:36 imirkin: there's only LOP.PASS_B
23:36 imirkin: so LOP.PASS_B or LOP.XOR ... who cares.
23:37 karolherbst: ohh, right
23:37 karolherbst: any idea what might be wrong with this one? https://gist.github.com/karolherbst/8448de01e0a11aea24048edcfd94637a
23:38 karolherbst: c[0x0][0x8] is the char input
23:38 karolherbst: and the result should be ~in
23:38 imirkin: seems correct.
23:38 karolherbst: yeah...
23:38 karolherbst: it returns 0xff
23:38 imirkin: but i take it that it isn't?
23:38 karolherbst: if I put 0x04 in
23:38 imirkin: what if you get rid of the NOT
23:39 imirkin: perhaps the LDC doesn't work quite right
23:39 karolherbst: interesting
23:39 karolherbst: it kind of works though
23:39 karolherbst: but sometimes not really
23:40 karolherbst: it worked with char4
23:41 imirkin: ;)
23:42 karolherbst: https://gist.github.com/karolherbst/8448de01e0a11aea24048edcfd94637a
23:43 imirkin: and that works?
23:43 karolherbst: yeah
23:43 karolherbst: I suspect something else messing up
23:43 imirkin: try not disabling NV50_PROG_SCHED
23:43 imirkin: i think LDC produces a barrier
23:43 imirkin: and that may not be fully accounted for by the default 0x7e0 thing
23:44 karolherbst: still wrong result
23:44 imirkin: ok. new idea.
23:44 imirkin: it should be c0[0xb] and not c0[0x8]
23:45 karolherbst: that would mean that char2 is also kind of broken
23:45 imirkin: yes.
23:45 karolherbst: mhh, it actually is
23:45 imirkin: although that makes little sense
23:45 imirkin: but ... anyways, something to do with the constbuf uploads
23:45 karolherbst: char3 same behaviour
23:46 karolherbst: yeah...
23:46 karolherbst: because the other stuff works
23:46 karolherbst: or a lot of u8 stuff already works
23:46 imirkin: there's probably logic that does it in u32's
23:46 imirkin: and rounds down instead of up
23:46 imirkin: so the last bit doesn't get uploaded
23:46 karolherbst: lds are u8
23:46 imirkin: pretty sure i assume that constbufs are u32's
23:46 karolherbst: ohhh
23:46 karolherbst: mhh
23:46 imirkin: i mean in like the transfer functions
23:47 imirkin: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_transfer.c#n572
23:47 karolherbst: hah
23:47 imirkin: see, that's in units of words
23:47 karolherbst: you are a genius
23:47 karolherbst: if I put the pointer after the char, it works
23:48 karolherbst: imirkin: char3 gave me a weird result, but yeah
23:48 imirkin: how does this data come in?
23:48 imirkin: is it managed?
23:48 imirkin: or passed in as arguments?
23:49 imirkin: nve4_compute_validate_constbufs
23:49 karolherbst: we have a buffer in clover which gets filled before it's getting uploaded
23:49 imirkin: it uploads user constbufs up to size / 4
23:49 imirkin: right, but is it a user pointer?
23:49 karolherbst: no
23:49 karolherbst: constant value
23:50 imirkin: ...
23:50 karolherbst: uhm
23:50 imirkin: not sure what that means
23:50 imirkin: is it a user pointer, or a managed pipe_resource
23:50 imirkin: those are the only two options
23:50 imirkin: or is it the ->input pointer
23:50 imirkin: in pipe_grid_info
23:51 imirkin: in which case we do the same thing -
23:51 imirkin: PUSH_DATAp(push, info->input, cp->parm_size / 4);
23:51 karolherbst: pipe_grid_info.input is where all the stuff is inside
23:51 imirkin: basically you need to do a bit of work to compute the last thing to push in there
23:51 karolherbst: so yeah
23:52 karolherbst: ahh
23:53 imirkin: like u32 last = 0; memcpy(&last, &info->input[cp->parm_size & ~3], cp->parm_size & 3);
23:53 imirkin: PUSH_DATA(push, last);
23:54 karolherbst: mhh
23:54 karolherbst: couldn't we also just align the input buffer?
23:54 imirkin: you could do lots of things ;)
23:54 karolherbst: I really don't care about those 3 bytes :D
23:54 karolherbst: but mhh
23:54 karolherbst: let me try your thing
23:54 imirkin: making sure the input is a multiple of 4 would also solve these problems
23:58 karolherbst: well, your idea doesn't seem to work
23:58 imirkin: did you bump the BEGIN thing by 1 too in that case?
23:58 karolherbst: no
23:58 imirkin: well then.
23:58 karolherbst: okay, now it works
23:59 imirkin: :p
23:59 karolherbst: mhh
23:59 karolherbst: yeah, why not actually
23:59 imirkin: there's DIV_ROUND_UP btw
23:59 imirkin: so like DIV_ROUND_UP(size, 4)
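The rounding problem discussed above (size / 4 drops up to 3 trailing bytes, so a lone char argument at the end of the input never gets uploaded) can be sketched in plain C. This is a standalone illustration, not the nouveau code: `params_to_dwords` is a hypothetical helper, and `DIV_ROUND_UP` is defined locally here rather than taken from Mesa. The caller is assumed to provide room for DIV_ROUND_UP(size, 4) dwords in `out`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local definition for this sketch: integer division rounding up. */
#define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

/* Copy `size` bytes of kernel input into dwords for upload.
 * Full dwords are copied as-is; a trailing partial dword is
 * zero-padded so that the last 1..3 bytes are not lost.
 * Returns the number of dwords written, i.e. the count the
 * method header would need instead of size / 4. */
static size_t params_to_dwords(const void *input, size_t size, uint32_t *out)
{
    size_t full = size / 4; /* complete dwords (the old, truncated count) */
    size_t tail = size & 3; /* leftover bytes, 0..3 */

    memcpy(out, input, full * 4);
    if (tail) {
        uint32_t last = 0;
        memcpy(&last, (const uint8_t *)input + (size & ~(size_t)3), tail);
        out[full] = last;
    }
    return DIV_ROUND_UP(size, 4);
}
```

A 5-byte input yields 2 dwords rather than 1, which is why the count given to the BEGIN packet has to grow along with the extra PUSH_DATA. Padding the input buffer to a multiple of 4 on the producer side, as suggested at 23:54, would avoid the tail handling entirely.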