00:17 joepublic: I sometimes think displayport was designed because someone thought all other connectors worked "too well"
00:57 Lyude: mhhh, yeah this is bizarre. I'm going to need to trace the nvidia driver tomorrow and make sure we're not setting something up incorrectly
00:59 Lyude: I bet we're missing something from the vbios that's causing us to end up with noise on the gpio pins used for hpd/aux detection
00:59 Lyude: rhyskidd: btw ^, if you have any ideas
01:09 Lyude: either that, or there is something on the port trying to use HPD pulses instead of IRQ pulses
01:59 gnarface: shuddering at the thought that someone could look at fucking HDMI and think "gee, this works too well. how can we make it weirder and less reliable?"
02:28 gnarface: oh please, yes let's have stereo equipment that needs a specific boot order
03:10 airlied: Lyude: could be a misconfigured output pin driver, like something is floating that should be driven etc
04:26 mupuf: Lyude: sorry, don't have the right to silence people, AFAIK
04:58 airlied: mupuf: pretty sure chanserv thinks you do
04:59 mupuf: airlied: hmm, thanks for checking it out!
04:59 airlied: you at least have ops
05:00 mupuf: well... no idea how I could forget that
07:28 tomeu: karolherbst, pmoreau: btw, this shader passes compilation fine, but after execution the output array is empty: https://people.collabora.com/~tomeu/plaidml-test.txt
07:28 karolherbst: tomeu: what hardware?
07:29 tomeu: karolherbst: jetson tk1
07:30 karolherbst: mhh
07:32 karolherbst: uff
07:32 tomeu: karolherbst: and this is the clc shader that hangs in nir_lower_goto_ifs/handle_if: https://people.collabora.com/~tomeu/plaidml-mobilenet.txt
07:33 karolherbst: mhhh
07:33 karolherbst: with the first one there are a few broken things
07:33 karolherbst: Workgroup memory is essentially 32 bit only, but we use 64 bit addresses
07:34 karolherbst: although I doubt that's the issue here
07:36 karolherbst: also the spirv kind of doesn't seem to match
07:36 karolherbst: mhh
07:36 karolherbst: I guess llvm does a lot of const folding
07:36 karolherbst: and loop unrolling
07:38 karolherbst: ohh, it's cut off
07:39 karolherbst: mhh
07:39 karolherbst: and what's "imul.nsw"
07:39 karolherbst: ahh
07:39 karolherbst: no signed wrap
07:43 tomeu: yeah, btw, I had to disable extension checking because clover doesn't know that we support SPV_KHR_no_integer_wrap_decoration
07:51 mrsinisalu: karolherbst: http://on-demand.gputechconf.com/gtc/2012/presentations/S0221-1024-Bit-Parallel-Rational-Arithmetic-Operators-for-the-GPU.pdf vs https://www.diva-portal.org/smash/get/diva2:831071/FULLTEXT01.pdf page39
07:52 mrsinisalu: the first one does use branches or loops, which are pretty fast when you have repeated loops for iterations
07:52 mrsinisalu: the second one hasn't got anything to do with opencl scheduling abstractions anymore, and this isn't the real way to do it yet though
07:53 mrsinisalu: blum blum shub shader remembers the flops with two separate kernels as it was in some smaller stream, but this way obviously you can hit the limit pretty fast
07:54 mrsinisalu: so the third and real way is to redirect FUs to arbitrary queues so that everything is in a queue location, or a sw branch done with wfids that wrap around
07:55 mrsinisalu: loops are still sufficiently fast on opencl though; although they still fetch the opcode, the branch opcode quits the decoder early
07:57 mrsinisalu: so opencl and cuda are both far less useful than people tend to think.
07:59 mrsinisalu: because one can do much better with a proper codegen and fragment shader.
08:01 mrsinisalu: ParallelMultiplier/ParallelMultiplierKernel.cu is the one from the first package to inspect for why they get such good results on the kogge-stone 1024bit adder
08:02 mrsinisalu: https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/ they use this method basically
08:03 mrsinisalu: so in essence opencl is doable and it should not be a problem for you to sum it up, but it is not a very useful thing either.
08:05 mrsinisalu: you have been talking about how loops work on nvidia's loop-stack based hw; i have forgotten the details of how they work myself, since amd does not use a branch stack and uses an easier method to implement them
08:07 mrsinisalu: and yes, generally the NVIDIA way is just a waste of hw resources, but on a modern chip that does not matter much either, their chips are still powerful enough
08:14 mrsinisalu: I am not interested in this goto to if/else pass, since i do not see the point in it, though i could be capable; instead i am interested in talking about fixing cpu multithreading issues
08:16 mrsinisalu: yeah me off the channel, be a clueless dick all your life, bye
09:19 lendurlea: As i said, it really needs multiple iterations to work fast! That can be demonstrated with a little verilog code: there is a continuous assignment in the code, the decoder is not driven anymore, and issue gets the last opcode and flags on the instance; it happens when you fetch the very same opcode in sequence. So the decoder just skips and forwards the last opcode
09:19 lendurlea: but it only happens when the same instruction is executed multiple times in sequence.
09:22 lendurlea: so the first iteration involves a fetch and the following iterations are hence a lot faster.
10:20 lendurlea: at least this is how i think it works: workgroups in opencl terms get distributed probably either in parallel if there are enough CUs/SIMDs, or serially if the dispatcher finds there are not; serial instances are always scheduled to the same queue entry, as this would make sense too, and it should be done so in case of a loop
10:21 lendurlea: workgroups are blocks in cuda terms
10:37 tagr: skeggsb: do you think we could take this one through drm-misc since it fixes a regression that was introduced there a couple of days ago? https://patchwork.freedesktop.org/patch/323927/
11:01 lendurlea: I think there are 40 decoder instances for workgroups like this in miaow GCN, so if the command stream commits a bigger workgroup than there are alus available on a single compute unit, the decode wfid is repeated by the dispatcher, and everything is fetched from queues, unless it has dynamic nested true conditional branches, for instance based on local memory content
11:01 lendurlea: those should be replaced for predication to work out fine, those mess up to be honest
11:02 lendurlea: they do not mess up, but those will be slower by quite some margin
11:07 karolherbst: tagr: btw.. I figured out the mesa regressions :/ it's something super annoying
11:08 karolherbst: apparently tegra gives us a linear buffer to render to, and nouveau can't render to one, because it has to be tiled (something like that)
11:08 karolherbst: and a fix broke it
11:08 karolherbst: https://gitlab.freedesktop.org/mesa/mesa/commit/c56fe4118a2dd991ff1b2a532c0f234eddd435e9
12:31 lendurlea: http://dpaste.com/2RH2XCH
12:33 lendurlea: so you can see the CU has 40wf, and there are 40 opcode regs to bypass the decoder to do in-queue computation; all was correctly said.
12:36 lendurlea: this is taken from miaow testbench decode_tb.v and simulated and grepped from the output file
12:45 lendurlea: such an assignment does not go through, cause as you see it is a continuous assignment; it goes through only when the instance has a change on the right hand side, i.e. when a new opcode lands, and that happens on a dynamic condition, aka. with branches nested in the static one
12:48 lendurlea: this is what i said, the first iteration is fetched into decode buffers, the next ones in case of repeated iterations come from queues ;) this is the pointless strength of opencl and cuda to demonstrate like this
12:52 lendurlea: this is of course correctly done, but i have a better method than this. It has the weakness that you can run out of queue entries, and it does not work when there are no repeated iterations of the same calculation on different operands.
12:53 lendurlea: i meant values of the same operands, so different values of the same operands, to be accurately precise and pedantic
13:01 lendurlea: the new paradigm is just some playing with words to carry some meaning, like linus said of the first LINUX operating system that everything is a file in the kernel, abstract enough.. but when you compile to queue entries, everything is a branch, every instruction
13:02 lendurlea: you remap the functional units into queues in the centre of the chip, and you start to target them with the LSU in fast mode
13:03 lendurlea: that way, since instructions repeat themselves, there are fewer opcodes than queue entries, you put correct indices
13:03 lendurlea: and target those functional units, but it happens a whole lot faster than in long pipeline mode hence
13:13 lendurlea: i present this kind of algorithm, i have many versions of this; it jumps to the FU, executes it async, and takes from the stack the next opcode to run from a compressed address-space variable
13:13 lendurlea: the thing is that you need to see the picture once and for all of how this functions, to do the correct things in the driver
13:16 lendurlea: the jumps are taken so that you redirect a variable to a clamped unit that stalls in the LSU, or you redirect it to one that does not
13:16 lendurlea: the stall is achieved when you repeat the element, it is a trick in hw
13:18 lendurlea: there are hundreds and thousands of versions, but they work based on some general rules of thumb
13:18 lendurlea: that you need to start comprehending
13:20 tagr: karolherbst: hmm... yeah, I see how the absence of modifiers support could cause this
13:20 lendurlea: it is because i do not have the time for the rest of my life to clean up your codebase continuously while being treated much like an unemployed, bullied freak by you
13:21 tagr: karolherbst: the problem is that we're kind of depending on the modifiers support to tell us which type of buffer we have
13:21 tagr: karolherbst: technically the Tegra driver doesn't give you a linear buffer to render to, but given that it doesn't get any modifiers, it'll just fall back to asking Nouveau to get a linear one
13:22 tagr: I suppose there may be ways to do this a little more cleverly
13:23 tagr: I vaguely recall not doing that because the guidance (from DRM and Wayland developers) at the time was that new code should be relying on modifiers as the canonical way of transporting this information
13:23 karolherbst: tagr: yeah... I think the issue is just, that we can't really render to a certain type of buffers
13:23 karolherbst: and we fell back to doing a copy _somewhere_ and that's why it worked out
13:23 karolherbst: or something
13:23 karolherbst: I don't know much about all of that
13:23 karolherbst: and imirkin also never really looked into it
13:23 tagr: karolherbst: I wasn't aware that we couldn't render to linear buffers
13:24 tagr: I think I saw it work at least in a couple of cases (kmscube, perhaps) with reduced performance
13:25 karolherbst: tagr: a while back ilia was saying this: "the issue is that nvidia gpu's can't render to an untiled color buffer and a tiled depth buffer"
13:25 tagr: ah... I see
13:26 tagr: so possibly we could store a flag somewhere to signal that depth buffers need to be untiled if the color buffer is untiled?
13:26 karolherbst: yeah, I think so
13:26 karolherbst: I just have no clue about all the code involved and hoped somebody with proper knowledge could hack something up
13:27 karolherbst: but other than that, it works great. kmscube on a tty works perfectly... but GLX and EGL are broken right now
13:27 tagr: I'm not sure if the Tegra driver gets involved at that level, but ideally it'd be something that we'd handle there
13:27 tagr: karolherbst: couldn't you build GLX with modifiers support to restore the desired behavior?
13:28 tagr: or does it end up never passing modifiers anyway?
13:28 karolherbst: probably.. but again, I have no clue about that code...
13:28 karolherbst: mhh
13:28 karolherbst: let me think
13:29 karolherbst: ohh, so it starts using the "use" flags instead of modifiers if no modifiers are passed in
13:29 karolherbst: mhhh
13:29 tagr: I don't think the nouveau kernel driver currently supports modifiers, so perhaps there's something that we could add to make this magically work
13:29 karolherbst: hopefully yes
13:29 karolherbst: but I think X is the issue here
13:29 tagr: then again, GLX should only get the modifiers from Tegra DRM because that's where the display support is coming from
13:30 tagr: karolherbst: yeah, I vaguely recall seeing something similar at some point and I /think/ the solution was to enable modifiers support in GLX somehow
13:30 karolherbst: mhhh
13:30 karolherbst: but EGL is broken equally
13:30 tagr: can't remember exactly how I did that...
13:31 tagr: isn't EGL using the same code paths at that level?
13:32 karolherbst: yeah.. I think so
13:33 karolherbst: I think reverting that commit fixed both
13:33 imirkin: tagr: karolherbst: just a quick note on tiling ... linear color buffers can't have depth at all. there is no untiled depth.
13:34 imirkin: if you want depth, you must have tiling
13:35 imirkin: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_state_validate.c#n227
13:35 imirkin: nouveau_bo_memtype != 0 means tiled, pretty much
13:36 imirkin: and note how in the else case, we "assert(!fb->zsbuf)" -- we're not just being picky, the hw doesn't support it
13:36 imirkin: you'll get ZETA_SOMETHING errors if you try
13:36 tagr: ouch
13:37 tagr: so, I'm currently trying to unbreak nouveau on linux-next for Tegra, once I'm done with that I will hopefully have some time to look into the upper level of the stack
13:37 tagr: my copy of Mesa is a bit out of date, so I'll probably run into that issue when I upgrade
13:37 imirkin: yay :)
13:38 karolherbst: :)
13:38 karolherbst: cool
13:38 tagr: I also really need to look into setting up some basic testing in our test farm so we catch these things earlier
13:38 karolherbst: tagr: I was also playing around with reclocking on the jetson nano (which is the main reason mine got so hot)
13:38 karolherbst: but that didn't work out that great
13:39 karolherbst: the upper 2 or 3 perf levels are more or less broken
13:39 karolherbst: tagr: I want to connect the jetson nano with the gitlab CI we are working on :)
13:39 karolherbst: that should be good enough
13:39 imirkin: anyways, good luck.
13:39 karolherbst: but yeah.. would be cool if you do some testing on your end as well
13:42 tagr: I'm currently still looking at enabling Nouveau on Jetson TX2 and maybe Jetson AGX Xavier so that I don't always have to go back to some older board just to keep an eye on Nouveau support
13:42 karolherbst: ahh, cool :)
13:42 karolherbst: I hope to get the CI working on the jetson over the next two weeks
13:42 tagr: I've got Jetson TX2 mostly working, but for some reason the CE channel doesn't get initialized properly and it falls back to the CPU for buffer copies
13:43 karolherbst: mhh, skeggsb might know something about it
13:44 tagr: I'm sure he does
13:44 tagr: slowly making my way through the code with printk... but it's not yet making any sense to me
13:56 lendurlea: I can make several releases of that kind of code, were someone to suggest what the code generation is for your favorite occupancy; i think 32-64 vgpr can be chosen for the codegen and parallelism trade-off
13:58 lendurlea: maximum occupancy on GCN is 25 regs, that will work fine, not sure how your codegen adjusts to similar stuff
13:59 lendurlea: LLVM just makes an abstract machine for this amount of regs.
14:05 lendurlea: i also add that fetch&decode and the simd arbiter always work at maximum occupancy, but the writeback stage starts to block at high occupancy
14:05 lendurlea: you have regs assigned to warps, and the RF arbiter will just block the same warp from writing
14:08 lendurlea: I also add, that indirected registers are not scoreboarded :(
14:08 tomeu: karolherbst: what CI is that?
14:08 karolherbst: tomeu: the gitlab build stuff. Anholt got already something working as well
14:08 karolherbst: and I wanted to reuse his stuff
14:09 lendurlea: when you add indirection with an ARL type of instruction, this will only scoreboard the register which was redirected from the decode stage
14:09 tomeu: karolherbst: ah, only build?
14:09 karolherbst: nope
14:09 karolherbst: hw testing
14:09 karolherbst: we can register runners which do different kind of tests than build testing apparently
14:09 karolherbst: sounded quite good
14:10 tomeu: ah, we're using lava for panfrost
14:10 tomeu: would be very easy to extend that to run tests on tegra boards
14:11 karolherbst: well
14:11 karolherbst: how community driven is that lava thing?
14:11 karolherbst: or the setup
14:12 karolherbst: and I think hooking something up in gitlab is more or less the best way
14:17 lendurlea: so hence one needs to do his own scoreboarding in the cache, so the easiest is again a two's complement variable
14:27 lendurlea: but it is pointless to scoreboard all 25 regs, there is almost never so much parallelism, so we do 6 max, this already gives insane perf.
14:44 lendurlea: and the final delay is something like, on a VLIW like r300, 6 gate delays for the full fetch&decode pipeline emulation
14:44 lendurlea: which on that chip corresponds to something like 40ps
14:52 lendurlea: a scoreboard in hw hence isn't something useful to have at all, since it does not work with indirections
14:57 lendurlea: so taking r300 as the reference arch, 64 operations can be done completely in parallel, that is very many alus available, and 256 is the queue length for the single pixel shader 4-wide alu version
14:59 lendurlea: oh yeah, it is 128 for one and 256 for two
14:59 lendurlea: and 512 for the 4 ps versions, r300 came with 2 and 4 as i remember
15:00 lendurlea: 16*16/2 that is for single PS
15:00 lendurlea: cause it uses dual scheduler
15:04 tomeu: karolherbst: not sure how to answer that
15:04 tomeu: lava is a normal open source project
15:05 tomeu: lava labs are maintained by companies, typically, and higher parts of the CI infra submit jobs to them
15:05 tomeu: kernelci submits jobs that test the kernel, for example
15:05 tomeu: for panfrost, we submit jobs that run deqp
15:07 karolherbst: tomeu: yeah.. I don't want anything owned by a company
15:07 karolherbst: if the tool doesn't allow a distributed network, it's not useful for a community project
15:07 karolherbst: that's why gitlab is a good idea
15:08 tomeu: the tools of course allow for a distributed network, and some individuals maintain lava labs that get jobs from kernelci
15:08 karolherbst: and you just install a ci runner on your end and let it do the work
15:08 karolherbst: ahh
15:08 tomeu: you could maintain your own lava lab if you wanted
15:08 karolherbst: but I think for mesa we want something integrated into gitlab
15:08 tomeu: I prefer for others to do that work :)
15:08 karolherbst: because.. we want the mesa builds to fail
15:08 karolherbst: anything else is pointless
15:09 tomeu: just look around here: https://gitlab.freedesktop.org/tomeu/mesa/pipelines
15:09 karolherbst: ahh, I see
15:09 tomeu: gitlab jobs fail if regressions are detected after running in lava
15:09 karolherbst: cool
15:09 karolherbst: yeah, that's what we want to have :)
15:10 tomeu: you could use a gitlab runner, but what will happen when the machine locks up?
15:10 tomeu: lava needs a serial connection though, so might not be a good choice for desktop-class runners
15:11 tomeu: as the whole point of lava is to be part of a scalable solution, which desktop-class hw doesn't help with as manual intervention is often needed
15:11 karolherbst: yeah...
15:13 lendurlea: so when fetch&decode&scb&operandread is some amount of ps (around 40), the final thing is the sw scheduler delay, which probably adds another 40ps per instruction; this is where the clamping and indirection logic works and delays the instructions
15:14 tagr: you can test desktop-class NVIDIA GPUs (or AMD, or whichever for that matter) on some embedded boards
15:14 tagr: a few do have PCI slots these days
15:20 karolherbst: tagr: I think the issue is that you need some kind of watchdog one way or the other
15:21 tagr: karolherbst: my understanding is that you typically have some sort of timeout on the test infrastructure, so if the device hangs the test times out and that counts as failure
15:22 tagr: typically you can then use some sort of hardware controlled reset to get the device back into a usable state
15:23 lendurlea: I think the preliminary expectation is around a 100ps delay for lock-stepping, so 6 instructions' delay is around 600ps, 40 instructions instead of one in 4ns; this is entirely crazy but should be real
15:24 lendurlea: this did not include execution, as all understood, which adds variable latency
15:24 lendurlea: based on what operation it is
15:27 tagr: karolherbst: that said, many of these embedded boards do have a hardware watchdog built-in as well, though that doesn't necessarily help with the test infrastructure, all that gets you is that it reboots the system if it hangs
16:08 Lyude: oh wow, that was easy
16:09 Lyude: just added the dmi info for the thinkpad p71 into gpio_reset_ids[] in drivers/gpu/drm/nouveau/nvkm/subdev/gpio/base.c and it looks like that fixed hotplugging
16:10 RSpliet: Lyude: sounds like you made a hammer!
16:11 Lyude: RSpliet: for fixing hotplug noise you mean?
16:11 RSpliet: That's what all hammers do... they reduce noise! Right?
16:20 Lyude: gah, nvm, seems I spoke too soon
16:21 lendurlea: yeah, it could be that indirect loads are still scoreboarded, and on some even stores too; i had some notes in bookmarks.
16:21 lendurlea: https://www.hillsoftware.com/files/atari/jaguar/jag_v8.pdf
16:41 lendurlea: but both ways are possible; it would simplify things if indirections are scoreboarded, then sw scoreboarding does not need to be done.
16:42 lendurlea: and if indirection-failed loads leave traces on the scoreboard, the programmer just allocates one special reg
16:42 lendurlea: i.e. when they get stuck
17:11 lendurlea: well, neither of them seems to be possible on r300: read operands are not scoreboarded, and write operations can not leave traces on failure, not with indirection nor by default; however r300 does not allow using indirection on dest operands
17:13 lendurlea: on SIMD, read is also just a stage that does not get scoreboarded
17:15 lendurlea: is this destination indirection limitation a mesa/opengl limitation or a hardware one?
17:24 lendurlea: https://gitlab.freedesktop.org/gerddie/mesa/commit/ffcdd49c69811b9f768c0b32acef6527d5626a6e it seems to be r300 hw bug
17:24 lendurlea: intel and nvidia ones should allow this
17:59 lendurlea: if (dst_reg->reladdr != NULL) {
17:59 lendurlea: assert(dst_reg->file != PROGRAM_TEMPORARY);
17:59 lendurlea: dst = ureg_dst_indirect(dst, translate_addr(t, dst_reg->reladdr, 0));
17:59 lendurlea: }
17:59 lendurlea: from st_glsl_to_tgsi.cpp
17:59 lendurlea: states that for non-temporaries it should be possible
18:07 lendurlea: the way i read it, indirections are not implemented in miaow, which is why i am not entirely sure how it should work for scoreboarding; i now think more that dests are scoreboarded and it is probably a latched delay line in the regfile
18:10 lendurlea: if no one wants to do machine code passes, then the mesa source tree editing can be pretty complex, i.e. the most complex part of this type of implementation
18:10 lendurlea: i am still reading the files myself and will probably need to print debug infos and compare, and work out how to do that type of codegen as needed
18:12 lendurlea: it may turn out to be a few short lines, but hard to do; offhand no real suggestions for the mods yet
18:51 lendurlea: in miaow this is very cleverly solved for the LSU, so that a failing load does not leave stuck scoreboard bits: it delays the blocking in the scbd until the LSU really performs the write, cause with a full pipeline the next instruction is not yet fetched at that time
18:53 lendurlea: it calculates the address, and if this does not succeed the write toggle is not going to be guided to the scoreboard at all
18:54 lendurlea: but if the address calculator succeeds, exactly before load acknowledgement the write starts to block in the scoreboard
19:20 endrift: Does vaapi and/or vdpau work with nouveau at all?
19:21 endrift: oh there's a feature matrix for it on the website
19:22 endrift: bleh, Maxwell is 100% TODO
19:24 endrift: I'd ask what I could do to help but the only Maxwell cards I have are a Tegra X1 (Jetson Nano and a Nintendo Switch) and a GTX 980 that I got secondhand
19:25 endrift: actually wait the GTX 980 is VP6, so that would work
19:34 lendurlea: which brings me to the issue: is the limit of 4 dependent loads on low-end cards a hw or sw limitation? it can not be a hw one imo, cause i think in the vertex shader there is unlimited use of the ARL address register
19:34 lendurlea: and very likely even on specialized hw the regfile is shared between the specialized shader stages
19:35 lendurlea: this is where i think the indirection is done, in the regfile
19:37 lendurlea: cause the trick that miaow does with lsu loads also works as a disadvantage: on multi-issue and a faulting fetch it would execute the alu regardless that the LSU did not succeed
19:37 lendurlea: preceding the LSU, cause the dependency is spotted slightly later on a short pipeline
20:12 pmart: Hi. Can you help me make my optimus card switch off? I already followed the advice on https://nouveau.freedesktop.org/wiki/Optimus/ and passed video=VGA-2:d to kernel but the card stays DynPwr according to vgaswitcheroo
20:13 pmart: cat /sys/kernel/debug/dri/1/clients show Xorg
20:15 pmart: xrandr shows only LVDS-1 connected
20:24 Lyude: we really need to actually document all the registers we're using in nouveau in envytools as well :\
20:26 lendurlea: Yeah, i finally figured out why it blocks the alu mux on that reg in the scoreboard :) that is perfect.
20:52 lendurlea: the way i see it, one texunit will get lost in the process, since looking at it again, i do not think it can be unblocked once it stalls