07:02 mithro: karolherbst: Random question, have you seen an issue where multiple Nvidia cards in a machine prevent it from posting? After the success previously I tried to see if I could get some older Quadro cards also working but the machine just refuses to post with multiple cards in it :/
07:13 karolherbst: no clue
07:17 mithro: karolherbst: It was my longshot. My K1200 turned up today, so I'll give that a go with your drivers sometime this week
07:23 marcheu: mithro: yeah when you have too many, there's not always enough pci space for all of them
07:24 marcheu: mithro: pick cards which need a small pci space, also pick a motherboard w/ large pci space
07:24 mithro: marcheu: pci space?
07:24 mithro: marcheu: do you mean PCI Lane bandwidth?
07:25 marcheu: no, basically the part of physical address space that the bios gives each card
07:25 marcheu: if you do lspci you see a bunch of PCI resources for each card
07:25 marcheu: like this:
07:25 marcheu: Region 1: Memory at f0000000 (64-bit, prefetchable) [size=128M]
07:26 marcheu: a lot of bioses can only go up to 1GB and aren't good at avoiding fragmentation
07:27 mithro: marcheu: I was hoping that it would just ignore all the external graphics cards (just boot with the internal one) and let Linux figure it out
07:27 marcheu: linux can talk to the cards through this address space only :)
07:28 marcheu: that said different cards require different amount of space, you can try to find these. I'm not sure offhand how
07:28 marcheu: also it's not just graphics cards, each pci device will eat up some space, so sometimes unplugging other stuff helps
07:29 marcheu: of course there's no way to know exactly why the bios is failing at coming up with a layout :(
07:30 mithro: marcheu: It's a "server" motherboard, so I just think it doesn't really understand multiple graphics cards - it dies in the "VGA initialization" according to the beep codes
07:33 karolherbst: mupuf: any reasons why some subdev doesn't want to wait until reclocking is finished? This makes the locking code more complicated sadly
07:33 mupuf: karolherbst: temperature polling should not be blocked
07:33 marcheu: mithro: they are just pci devices at that stage, there's nothing to understand...
07:33 mupuf: you can start a worker if you want
07:33 marcheu: but yeah, crappy bioses are common
07:34 marcheu: I got a PC to boot w/ 5 graphics cards, so I know the issues quite well :)
07:34 karolherbst: mupuf: mhh right
07:35 karolherbst: mupuf: I think it isn't possible in the end though. There needs to be a subdev lock in nvkm_clk_ustate, nvkm_clk_astate and nvkm_clk_update_work in the end
07:36 mupuf: yeah, and a timeout for reclocking
07:36 karolherbst: you don't want the expected states to change while you reclock, but you also don't want to change the state while reclocking :D
07:36 mupuf: if it is short enough and if we can revert to a working state, then all is well
07:36 mithro: marcheu: Well, I was hoping to reuse the cards I already have - my interest is reduced if I have to purchase new cards
07:36 mupuf: agreed
07:37 karolherbst: mupuf: but maybe the issue isn't as big: we just read out the clk->(p/c)states in the worker thread and just unlock at this point and reclock
07:37 karolherbst: the next reclock will be queued anyway sooner or later anyway
07:37 mupuf: hmm
07:37 karolherbst: (I use the word anyway too much)
07:37 mupuf: that sounds bad :D
07:38 karolherbst: my idea is this generally
07:38 mupuf: but, I would need to see the entire picture and not just a tiny portion of it
07:38 karolherbst: we have functions to change the expected states
07:38 karolherbst: + we have nvkm_clk_update to actually change the clocks
07:38 karolherbst: though nvkm_clk_update will be called on state changes anyway
07:38 karolherbst: and usually we have three ways a reclocking will be triggered
07:39 karolherbst: temperature, dynamic reclocking, user
07:39 karolherbst: though we should only allow either dynamic or user
07:40 karolherbst: so we have an interval-based daemon which may trigger a reclock every second + a reclock triggered at random points in time
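A minimal sketch of the locking scheme being discussed (made-up names, written as a userspace pthreads analogue rather than the actual nvkm clk code): the requested state is only touched under a lock, and the worker snapshots it, drops the lock, and then performs the slow reclock, so temperature polling and new requests are never blocked for the duration of a reclock.

    /* Sketch only: userspace analogue of the worker/lock split above. */
    #include <pthread.h>
    #include <stdbool.h>

    struct clk_state { int pstate, cstate; };

    static pthread_mutex_t clk_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct clk_state requested;          /* astate/ustate equivalent */
    static bool reclock_pending;

    static void program_clocks(const struct clk_state *s)
    {
            (void)s;                            /* the slow hardware sequence */
    }

    /* Called from temperature polling, sysfs writes, the DVFS daemon, ... */
    void clk_request(int pstate, int cstate)
    {
            pthread_mutex_lock(&clk_lock);
            requested.pstate = pstate;
            requested.cstate = cstate;
            reclock_pending = true;             /* next worker run picks it up */
            pthread_mutex_unlock(&clk_lock);
    }

    /* Worker body: snapshot the target under the lock, reclock without it. */
    void clk_update_work(void)
    {
            struct clk_state target;

            pthread_mutex_lock(&clk_lock);
            if (!reclock_pending) {
                    pthread_mutex_unlock(&clk_lock);
                    return;
            }
            target = requested;
            reclock_pending = false;
            pthread_mutex_unlock(&clk_lock);

            program_clocks(&target);            /* done with the lock dropped */
    }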
07:41 mjg59: Can the performance counters generate interrupts?
07:41 karolherbst: I doubt that
07:42 marcheu: mjg59: yeah
07:42 mjg59: HP implemented firmware-level reclocking on x86 by hijacking one of the performance counters and triggering it to fire a system management interrupt
07:42 marcheu: karolherbst: did you look at what we did for pixel c?
07:42 karolherbst: well dynamic reclocking won't use them anyway. There are special counters on the PMU
07:42 karolherbst: marcheu: don't think so
07:42 marcheu: karolherbst: look at the chromeos-3.18 kernel, it is the kernel used on the pixel C.
07:43 mjg59: Having the GPU notify you when it's not running fast enough seems like a win
07:43 marcheu: karolherbst: it has both a worker-based implementation, and a (IIRC disabled) interrupt based one
07:43 karolherbst: well
07:43 mjg59: Polling when you're at high clock rates isn't really a problem when it comes to system power consumption
07:43 karolherbst: do you mean an interrupt through the PMU?
07:43 karolherbst: or something else
07:44 mjg59: But ideally you'd avoid it when the system is otherwise idle
07:44 marcheu: mjg59: thankfully you can stop it when pm_runtime kicks in
07:44 mjg59: marcheu: Does that include when the screen is on but static?
07:44 karolherbst: I already wrote an implementation on the PMU for sending interrupts to the host when the gpu isn't fast enough
07:44 marcheu: well yeah for gpu it does
07:45 marcheu: mjg59: pm_runtime is what you do with it, it's really just some hooks and a refcounting mechanism :)
07:45 mjg59: marcheu: Right, I just wasn't sure what you were refcounting
07:46 marcheu: for pixel c, the GPU is nouveau drm while the display is tegra drm so it's trivial
07:46 mjg59: I think my initial experimentation around that was based on DPMS
07:46 mjg59: Ah yeah
07:47 mjg59: Innovative display controller
07:47 marcheu: you can probably use pm_runtime domains for desktop
07:48 mupuf: marcheu: I thought about using them years ago
07:49 mupuf: but guess what, they only make sense for the cpu
07:49 mupuf: they do not understand that there is more than one clock domain
07:49 mupuf: so, say you monitor pgraph's usage to perform reclocking
07:50 mupuf: you may miss the elephant in the room which is, you are vdec, pcopy or memory-limited
07:51 marcheu: yeah IIRC the nvgpu code doesn't do any better there
07:51 mupuf: we would need to integrate with pm_runtime, but this is definitely not going to happen at a low level.
07:51 mupuf: marcheu: really? Pretty sure I gave a more complex demo code to the engineer who wrote the actual code for nvgpu
07:52 mupuf: albeit it was in the userspace
07:52 mupuf: in any case, since I have you here, what tests did you use for testing the DVFS algorithm?
07:53 marcheu: mupuf: nah, there is a bit of system-wide negotiation for the emc clock (because watermarks) but I think that's it
07:53 mupuf: I have been hammering karolherbst for not going too fast on this and actually create proper tools.
07:53 karolherbst: :D
07:53 marcheu: and certainly for the GPU driver the "contribution" to the EMC calculation is constant for a given GPU load
07:54 mupuf: tegra has it easy, there are less clock domains :D
07:54 marcheu: I've been using a bunch of apps which use more or less GPU %
07:54 mupuf: manually?
07:54 marcheu: well the EMC stuff on tegra is "fun"
07:54 marcheu: yeah
07:54 marcheu: also I wrote a little piece of GL code which generates load
07:55 mupuf: ok, well, chrome has a lot of gpu tests that create spikes in the gpu usage
07:55 marcheu: and I put it all in a spreadsheet
07:55 marcheu: yeah certainly
07:55 marcheu: it makes things interesting
07:55 mupuf: and I have a project called ezbench that can run benchmarks (chrome ones included along with the classic benchmarks)
07:55 marcheu: if you've seen the dvfs code, you know that downclocking too quickly is evil
07:56 mupuf: not downclocking too quickly is also evil
07:56 marcheu: actually, not really
07:56 karolherbst: well in the end it depends on how fast you react
07:56 mupuf: depends how good your power and clock gating is
07:57 mupuf: but static power consumption is non trivial, let me pull up some numbers from my thesis
07:57 marcheu: well, even on mobile, it wasn't an issue. leakage numbers are very hw-specific
07:57 marcheu: and with new process, leakage has almost become irrelevant
07:58 mupuf: http://fs.mupuf.org/mupuf/nvidia/graphs/nve6_pwr_voltage.svg --> this tells another story though
07:58 mupuf: I would need to test on a maxwell also
07:58 marcheu: what's the process for this?
07:58 mupuf: but mobile chips are less impacted with leakage
07:58 karolherbst: mupuf: was that on nvidia?
07:58 mupuf: let me see, it is a GTX 660
07:58 mupuf: yes
07:58 marcheu: mobile chips have better process
07:58 karolherbst: okay
07:58 marcheu: but overall things are going in the direction where leakage is ok
07:58 karolherbst: mhhh
07:59 karolherbst: well mine isn't better
07:59 karolherbst: and I have a mobile chip
07:59 mupuf: it is not better, it is a process that does not go up in frequency and can thus have a lower leakage
07:59 mupuf: marcheu: 28 nm
08:00 mupuf: and that is an entirely idle machine
08:00 mupuf: but admittedly, power and clock gating was not turned on beyond what the bios did by default
08:00 mupuf: so, we will need to redo the experiment
08:01 marcheu: yeah gm20b is using 20nm
08:02 mupuf: http://fs.mupuf.org/mupuf/nvidia/graphs/temperature_power_voltage.svg -> the impact of temperature on power usage
08:02 karolherbst: well at least downclocking fast on my system is pretty important. At high gpu clocks around 20% of the entire system power consumption is wasted just by staying on higher clocks (and nvidia does it for like 40 seconds sometimes)
08:03 mupuf: karolherbst: right
08:03 mupuf: but then, you also need to avoid janks
08:03 mupuf: jitter is really bad
08:03 marcheu: well, I think performance matters more than power for desktop, and for mobile the process makes up for it
08:04 mupuf: so, we definitely need to take into account when was the last time we downclocked or how often we reclocked in the past 10 seconds
08:04 karolherbst: I have a mobile system
08:04 karolherbst: and it doesn't at least on kepler
08:04 karolherbst: mupuf: yeah, something like that makes sense and defining some reclocking target like, we don't want to reclock more than 5 times every minute
08:05 karolherbst: we will anyway on spiky loads sometimes, because of the variable temperature
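A sketch of the reclock rate limiting mupuf suggests above (the 5-per-minute budget is the number karolherbst mentions; all names and the exact policy are hypothetical): keep the timestamps of the last few reclocks and refuse another one while the budget for the window is spent.

    #include <stdbool.h>
    #include <time.h>

    #define RECLOCK_BUDGET 5         /* at most 5 reclocks ...          */
    #define RECLOCK_WINDOW 60.0      /* ... per 60 seconds (assumed)    */

    /* Start far in the past so the budget is initially free. */
    static double reclock_times[RECLOCK_BUDGET] = { -1e9, -1e9, -1e9, -1e9, -1e9 };
    static int reclock_idx;

    static double now_seconds(void)
    {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Returns true if another reclock is allowed right now and records it. */
    bool reclock_allowed(void)
    {
            double now = now_seconds();
            double oldest = reclock_times[reclock_idx];

            if (now - oldest < RECLOCK_WINDOW)
                    return false;    /* already reclocked RECLOCK_BUDGET times */

            reclock_times[reclock_idx] = now;
            reclock_idx = (reclock_idx + 1) % RECLOCK_BUDGET;
            return true;
    }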
08:05 marcheu: that's the way to jank
08:05 mupuf: marcheu: there is a test for that in chrome
08:05 marcheu: "jank" is a generic term
08:06 marcheu: the test in chrome btw, outputs a synthetic number which isn't very useful
08:06 mupuf: right, but it lets the GPU idle for a bit before scrolling again
08:06 marcheu: yeah, but don't look at the chrome "jank" metric, it's really not good
08:06 mupuf: it outputs how often it missed the 60 Hz refresh rate, that's pretty nice
08:06 karolherbst: ohh yeah, that sounds important
08:06 mupuf: hmm, can you tell me why?
08:06 marcheu: yes that part is ok, but the "overall" metric misses the point
08:07 mupuf: oh, yes
08:07 mupuf: this overall is bullshit indeed
08:07 mupuf: frametime + jank metrics are the ones I am interested in
08:07 marcheu: the problem with chrome in general for benchmarking, is that it adapts to the speed of the device
08:07 mupuf: :o
08:07 marcheu: so when things are too slow, it will postpone things like texture uploads, etc.
08:08 marcheu: which means that you can't really profile a driver with it
08:08 mupuf: I see
08:08 marcheu: it is good for the end user, but not for profiling
08:08 mupuf: shit, well, we can always apitrace it... won't work for intel but it should for nvidia because the gpu does not share the budget
08:08 marcheu: in a way, all these metrics are meant to optimise for chrome given a constant environment/platform, not to optimise the platform
08:09 marcheu: yes one way to do it is to replay an API trace, but then you miss all the CPU load... :)
08:09 marcheu: guess what, a lot of jank is bad scheduling
08:09 mupuf: yep, but this is no problem for intel
08:09 marcheu: actually it is
08:10 mupuf: sorry, I meant, it is not a problem for nvidia
08:10 mupuf: or nouveau, when we tweak the DVFS scheduler
08:10 marcheu: well, it depends
08:10 marcheu: let's say you have tegra
08:10 mupuf: yes, for tegra, apitrace won't fly
08:10 marcheu: and CPU load changes the EMC (memory) clock which makes the GPU a little faster
08:10 marcheu: these effects are complex and annoying to debug
08:11 mupuf: yep, this is why I ruled it out for Intel performance testing
08:11 mupuf: and relied on the chrome test instead
08:12 mupuf: I have this problem a lot at intel, fighting the PUNIT which reclocks quite happily when TDP-limited, and I do not know about it :s
08:12 marcheu: yeah if you look at the chrome os gfx tests they monitor the chipset temp before starting the tests...
08:12 mupuf: it is coming though, so I should soon know how many times it reclocked when running my benchmark. I can lower the clock and start again until I get good results
08:13 mupuf: yeah, but this is not enough, there is a power cap
08:13 mupuf: and I have hooks to do the same in ezbench
08:13 marcheu: at the very least it makes results reproducible
08:14 marcheu: I will argue that's very valuable for a test :p
08:15 mupuf: :p
08:15 mupuf: http://fs.mupuf.org/mupuf/nvidia/graphs/thresholds_memory.svg (and the other thresholds graphs) are done on nvidia
08:15 mupuf: I wrote a tool to fake performance counter readings
08:15 mupuf: and nvidia happily reclocks as I want it to
08:15 mupuf: very useful to find ... thresholds
08:16 mupuf: and to see the impact of static power consumption
08:16 mupuf: because the thing was truly inactive
08:17 mupuf: http://fs.mupuf.org/mupuf/nvidia/graphs/thresholds_graph.svg <-- the funnier one, with boost
08:17 mupuf: oh, I really have to go now
08:17 karolherbst: that spike though
08:18 mupuf: noise in the data, ignore it
08:18 karolherbst: ;)
08:18 mupuf: due to how I poll the counters
08:18 karolherbst: I meant the power consumption, but I know that sometimes garbage is read out
08:18 karolherbst: never happened for me on nouveau though
08:19 mupuf: well, you do not spy on the i2c bus using the cpu to find changes in the two lines and decode the transmission
08:19 mupuf: this is what I did
08:19 karolherbst: :D right
08:20 mupuf: marcheu: in any case, we plan on testing on the jetson TK1 before changing the reclocking policy
08:21 mupuf: but, if needed, we will see to keep your code active for your chipset
08:21 mupuf: that will complicate the design though
08:21 marcheu: oh I don't really care, the thing has sailed
08:21 mupuf: ok, great!
08:21 marcheu: I'm just trying to help
08:21 karolherbst: does the tk1 also have a PMU?
08:21 mupuf: thx for it
08:22 marcheu: having made these mistakes and all
08:22 mupuf: karolherbst: of course! It was introduced at the end of tesla era
08:22 karolherbst: ahh okay
08:22 karolherbst: yeah
08:22 mupuf: marcheu: right :)
08:22 karolherbst: then we will remove the cpu based dyn reclocking anyway later
08:22 karolherbst: at least I think we will
08:23 mupuf: karolherbst: no need to plan the future, we will do what we can prove is better and easier to maintain
08:23 karolherbst: yeah well
08:23 karolherbst: we shouldn't read out the pmu counters on the cpu every 0.1 seconds
08:24 mupuf: marcheu: in any case, fun to see how nouveau turned from your pet project to a driver shipping in an actual product!
08:24 marcheu: yeah I pushed for both :p
08:24 mupuf: so I heard :D
08:24 karolherbst: yeah well, I think I will go back to breaking the clk subdev today...
10:51 karolherbst: mupuf: I have an idea: we could use the current pstate interface to limit the highest pstate settable, and then everything becomes quite simple: pstate/cstate to hold the current state, ustate_ac/dc to hold the max pstates, astate to hold what the driver wants to set
10:51 karolherbst: and then we could add the same for cstates, but maybe without the ac/dc caps
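A tiny sketch of the selection logic that idea implies (hypothetical names, not the real nvkm_clk code): the clock code would always program the minimum of what the driver asked for (astate) and the user cap for the current power source (ustate_ac or ustate_dc).

    /* Sketch only; negative cap values mean "no limit set". */
    static int min_capped(int want, int cap)
    {
            if (cap < 0)
                    return want;
            return want < cap ? want : cap;
    }

    int clk_target_pstate(int astate, int ustate_ac, int ustate_dc, int on_battery)
    {
            int cap = on_battery ? ustate_dc : ustate_ac;

            return min_capped(astate, cap);
    }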
11:34 karolherbst: marcheu: currently in the gk20a code the various clocks are put inside a list of nvkm_pstate, is there any reason for this? Because I would rather have one nvkm_pstate with a list of nvkm_cstate, so that nvkm_pstate/nvkm_cstate have the same meaning on tegras and desktops/mobiles
11:35 karolherbst: gm20b has the same
11:37 karolherbst: or do you have no idea about this?
11:38 karolherbst: maybe I should just check who wrote this :D
11:39 karolherbst: gnurou: ahh you did this
11:53 gnurou: karolherbst: yeah, have I done it wrong?
11:53 karolherbst: well
11:53 karolherbst: it is different
11:54 karolherbst: on desktop/mobiles we have something like this: nvkm_pstate for pcie/memory/video/... for clocks which are set through the "performance state" found on the nvidia driver
11:54 karolherbst: there are also engines which don't get as many clocks as the gpcs do
11:55 karolherbst: and then there is a list of nvkm_cstates with the various CSTEP clocks (from the vbios)
11:55 karolherbst: so in the end on a normal desktop gpu there are like 3 nvkm_pstates
11:55 karolherbst: and a list of around 40 nvkm_cstates
11:55 karolherbst: since kepler
11:55 karolherbst: on fermi there are usually also pstates, but fewer cstates
11:56 karolherbst: everything before that has only nvkm_pstates
11:56 karolherbst: best I show you examples from my vbios
11:57 karolherbst: gnurou: https://gist.github.com/karolherbst/22cfba539d15b16e8f38b521bfbebbe4
11:57 karolherbst: first thing shows the pstates
11:57 karolherbst: and the clocks are put inside pstate.base
11:57 karolherbst: second are the cstates
11:58 karolherbst: and domain clocks are usually based upon them (and scaled through the boost table)
11:58 karolherbst: the table shows the actual GPC clock
11:58 karolherbst: and domains always have something like 106% of GPC or just 92% of GPC, and this information is filled in within nvkm_cstate.domain
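Roughly the layout being described, as a sketch (the field names are guesses based on this discussion, not the actual nouveau headers): a handful of nvkm_pstate entries, each owning a long list of cstates whose non-GPC domains are expressed as a percentage of the GPC clock.

    #include <stdint.h>

    /* Sketch of the pstate/cstate relationship described above. */
    enum clk_domain { DOM_GPC, DOM_MEM, DOM_VDEC, DOM_COUNT };

    struct cstate_sketch {
            uint32_t gpc_khz;                  /* CSTEP clock from the vbios      */
            uint8_t  domain_pct[DOM_COUNT];    /* e.g. 106% or 92% of the GPC clk */
    };

    struct pstate_sketch {
            uint8_t  id;                       /* vbios performance level         */
            uint32_t mem_khz;                  /* per-pstate clocks (memory, ...) */
            int      nr_cstates;               /* ~40 entries on kepler and later */
            struct cstate_sketch *cstates;
    };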
12:13 karolherbst: gnurou: well the thing is, currently I want to implement the boosting stuff in nouveau and there are quite big changes needed in the clk subdev for that
12:13 karolherbst: gnurou: and I really don't want to break your code
12:14 gnurou: karolherbst: don't worry, I will fix as needed, just give me a heads-up
12:14 karolherbst: ahh perfect
12:25 karolherbst: gnurou: how deep do you test the dynamic reclocking stuff for nouveau/tegra on your side?
12:26 gnurou: karolherbst: well we have products shipping with it :)
12:26 gnurou: ... *one* product I should say
12:27 karolherbst: yeah well, and how do you test if the code is solid enough? D:
12:28 karolherbst: I am asking because I worked on a PMU based implementation for this on my mobile chips, but there are some PMU communication issues left
12:36 martm: been playing with warthunder, and every so often there is indeed a crash in the driver, but there are also quite many crashes in glib and stuff
12:38 martm: i have not reclocked my gpu, but unfortunately the crashes happen so randomly
12:39 martm: sometimes i get past the point where it used to crash and sometimes not etc.
12:39 martm: Program received signal SIGSEGV, Segmentation fault.
12:39 martm: 0x00007fffe9ab85d7 in ?? () from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
12:40 martm: last one, but i also got crash happening in drm once, really very vague and random stuff
12:46 martm_: took a bt too, it most often shows that something is wrong with dri2
12:47 martm_: from /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
12:47 martm_: #4 0x00007ffff73440ce in dri2SwapBuffers (pdraw=0x39a0960, target_msc=0,
12:47 martm_: divisor=0, remainder=0, flush=1) at dri2_glx.c:851
12:50 martm_: i'd continue debugging it in xephyr if things didn't look so hopeless; i need to recompile mesa with debugging symbols
12:51 martm_: #4 0x00007ffff1de5670 in nouveau_pushbuf_kick ()
12:51 martm_: from /usr/lib/x86_64-linux-gnu/libdrm_nouveau.so.2
12:55 martm: i have received a gpu lockup almost deterministically during one scene too
12:55 martm: it really does not seem to be a game fault, but rather the driver's indeed, but i am not on the latest mesa stack
12:56 martm: OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.2.0-devel (git-30fcf24 2016-01-29 trusty-oibaf-ppa)
13:00 RSpliet: karolherbst: I don't think Tegra uses our PMU firmware, nor does it require any mem reclocking (making it less PMU heavy)
13:01 martm: now the gpu lockup i don't even know how to debug, sysconsole via terminal cable would be needed, i will look if i have such a cable; there seem to be several bugs, dunno if those are multithreading related
13:01 RSpliet: martm: it's assumed that Mesa has quite a few problems with multithreaded rendering applications
13:02 RSpliet: (and with mesa I mostly mean the nouveau 3D gallium driver)
13:03 RSpliet: you could alternatively try to get your kernel to use netconsole and output logs over your network, but that may require more fiddling than a serial terminal cable (whichever you may be most comfortable with)
13:03 martm: RSpliet: yep, i am updating mesa, but i think this thing is over my head
13:04 martm: netconsole yeah, pardon this was what i meant, RSpliet, this goes over serial cable right?
13:04 RSpliet: no, netconsole goes over the network
13:05 martm: ah yeah, ok...such cable i have
13:05 RSpliet: more convenient with laptops that lack a serial port, but depending on your router it might be slightly tricky to configure (sorry, not the best person to help you with that)
13:06 martm: no problem, i now try with git, if that has same faults, and queue up the debugging work to some day
13:07 martm: but i'd want someone to comment here if it's important to you guys too, maybe tomorrow i'll try to find where it locks up, quite deterministic at the moment
13:09 karolherbst: RSpliet: okay
13:09 RSpliet: martm: it's a problem that needs solving (as it prevents VirtualBox from running GPU accelerated as well for instance), but I'm not familiar enough with mesa or libdrm to know where the culprits are
13:11 martm: RSpliet: new mesa stack does get stuck in that same spot in the scene, but does not capture a lockup nor segfault
13:18 martm: seems as if some genius is trying to fix those mistakes but is not yet fully succeeding, it's too complicated for me too
13:19 martm: #3 0x00007ffff641bc32 in __GI___assert_fail (assertion=0x7ffff1de6239 "kref",
13:19 martm: function=function@entry=0x7ffff1de6260 "nouveau_pushbuf_data")
13:19 martm: at assert.c:92
13:19 martm: where i used to get the freeze now it fails with assertion
13:21 martm: but yeah the game is heavily multithreaded according to gdb dumps, mentions how threads quit, and nptl stuff is displayed etc.
13:22 RSpliet: martm: if this is due to a threading issue, it's expected that writes to command buffers fail or result in invalid contents when you share them amongst processes. I don't think anybody has tried to devise a clever solution to arbitrate these writes yet
13:25 martm: RSpliet: yes it could be..i don't have the exact line yet, cause i do not have debugging symbols enabled, can't remember how to step with breakpoints, it almost looks like what you say is happening
13:29 martm: it still shows the line though
13:29 martm: file=0x7ffff1de621a "../../nouveau/pushbuf.c", line=727,
13:29 martm: function=0x7ffff1de6260 "nouveau_pushbuf_data") at assert.c:101
13:35 martm: RSpliet: really looks like your comment was spot on, but i am still a bit worried about why that assert did not hit me the last time with mesa git, it still locked up, i'll try once again
13:36 martm: it froze for one sec, and then went into the game and locked up, how it was able to bypass the assert, i dunno
13:38 martm: if (bo) {
13:38 martm: kref = cli_kref_get(push->client, bo);
13:38 martm: assert(kref);
13:39 RSpliet: martm: threading problems are hard to reproduce exactly. They are caused by the interleaving of two threads on different CPUs, and the exact way they interleave depends on a lot of random factors (context switching, cache state, memory bus contention due to programs/threads on other CPUs), causing the problems to manifest themselves in different ways every time
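To make the failure mode concrete, a minimal sketch (illustrative only, not the actual mesa/libdrm code): two threads appending to one shared command buffer without synchronisation can interleave their writes and leave half-written packets, while serialising the append with a mutex at least keeps each submission intact (the shared GL state would still need separate care).

    #include <pthread.h>
    #include <stdint.h>

    struct cmdbuf {
            uint32_t        data[4096];
            int             cur;
            pthread_mutex_t lock;              /* pthread_mutex_init()ed elsewhere */
    };

    /* BROKEN when called from two threads: both may read the same 'cur'
     * and overwrite each other's commands, or leave a half-written packet. */
    void push_unlocked(struct cmdbuf *buf, const uint32_t *cmd, int len)
    {
            for (int i = 0; i < len; i++)
                    buf->data[buf->cur++] = cmd[i];
    }

    /* Coarse mitigation: each packet goes in atomically with respect to
     * other threads. */
    void push_locked(struct cmdbuf *buf, const uint32_t *cmd, int len)
    {
            pthread_mutex_lock(&buf->lock);
            for (int i = 0; i < len; i++)
                    buf->data[buf->cur++] = cmd[i];
            pthread_mutex_unlock(&buf->lock);
    }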
13:45 karolherbst: gnurou: by the way, do you have any ways to read out the current load on the tegras?
13:52 martm_: RSpliet: yes exactly, but as i said this is among the weakest of my gpu understandings https://people.freedesktop.org/~anholt/hang.txt this was for intel; i think nouveau has some playlist, it's complex, i got that assertion bypassed again. i think we agree that this is a bug and there should be some way to determine the order, i am definitely the wrong person here trying to hypothesize publicly
13:52 martm_: however i still try to read and think about it, it's just i won't talk about my braindumps
13:55 martm_: RSpliet: but imo your theories are reasonable, if you at some point are given a mission again to handle that, you read the code and reorder the stuff, you could be capable of fixing it
14:03 martm_: RSpliet: i try to read, as i do not know, but i understand what you say, the pushbuf is some sort of hw channel; how it is composed into the final playlist i have never really known
14:05 martm_: it is a fixable phenomenon, needs a bit of driver infrastructure knowledge, but the last comment was a very good one again imo, yes it depends on those things
14:11 admicable: I'm having some issues with a multi-head setup on Fedora, could anyone help shed some light?
14:19 martm_: so RSpliet: my diagnosis to you, imo you understand things very well, and i am quite sure, that actually if you manage to read the code you'd be able to fix that, imo me too, i am in bit of mess in life though
14:23 gnurou: karolherbst: IIUC I was able to get load info using nv_perfmon on Tegra
14:23 martm_: i am just wondering, if it's yeah done in hardware, sure pushbuf takes things from cache, i can see it's tried to be tested in kernel, like how it looks like i have no knowledge about, probably fifo structure in hw
14:24 martm_: composing of different tables , but for instance how many threads for single pushbuf, that i do not know
14:24 karolherbst: gnurou: mhh okay, I was just wondering if it pokes into the PMU counter regs directly or if there is some nouveau interface for that you use?
14:26 gnurou: karolherbst: I think it pokes the PMU regs directly, we don't have PMU firmware loaded on upstream Nouveau for Tegra
14:26 gnurou: karolherbst: as for Pixel C, I'm not sure how load is measured but can investigate if you want
14:26 karolherbst: gnurou: k, I was working on the pmu load counters stuff for nouveau
14:26 karolherbst: gnurou: it would be awesome if we could manage to have a common interface for nouveau also for the tegras
14:27 martm_: yes the final playlist is definitely in hw too, i.e it's some sort of cache again
14:36 martm_: so it's fifo or ringbuffer in hw, which is finally composed first in first out, circular buffer even named, i.e program counter is incremented by one dword however big words the hw works with, it can take time to find it in workable state, it's that ben from red hat, who mostly deals in that land, as it's seen
14:42 martm_: bwidawsk once wrote about this in detail in his blog, getting tired for today, another day; need to see if pushbuf == hw context or if there is some other abstraction for hardware contexts
14:45 martm_: it could work so that for instance i-cache is the playlist backbone, and every contexts instruction count is counted, it just writes to correct position in cache
14:50 martm_: but surely we don't have problems with this hw theory, but in gallium land how to place those synchronization primitives in cpu land right
14:51 martm_: gotta read up, fortunate or unfortunate i do not know what those semaphores/locks/mutexes are all meant for
14:56 martm_: be that anything like occupancy of the pushbuf, how is that handled etc. could be queried with some method in hw, or could be bookkept on the cpu; in the latter case another thread should be guarded somehow, at least that function should be thread-safe and reentrant, something like this
15:03 martm_: RSpliet: this CACHE_ERROR nohash, when pushbuf hits this it brings in the tag from memory right, otherwise at least artifacts would be shown on the screen, has to be that pushbuf in fact indeed has its cache, and the playlist also indeed, probably both on fault will bring in the tag from memory
15:08 martm_: yeah i believe it should be so, it's the only way to do it, now it's just at some point i dig into the code
15:10 martm_: but a playing cache is an instance in hw logic, basically you can reorder the data any way one would need to, so actually maybe there aren't separate partitions of cache
15:16 martm_: for example, if the cpu due to cacheline filling puts the contexts in the wrong order, it could be redone in circuit too, using one additional scratch reg for instance
15:22 martm_: hmm, so the pushbuf is really a separate fifo entity on its own, and it's named as a channel, as intel has patches, the order can be arbitrary, yeah that last mentioned version seems a lot better anyways
15:29 RSpliet: martm_: I'm not the person to solve threading issues; it's unfortunately not as simple as sprinkling some mutexes over the code (OpenGL is quite stateful, this state must be shared between threads too) and I'm hardly familiar with userspace side of nouveau
15:30 RSpliet: sorry
15:34 martm_: RSpliet: yeah but i was thinking about quite an easy method, which probably again something that isn't being used, but i am not sure
15:35 martm_: on cpu you make an order well vram that is fed to pushbufs is always linear right, so it needs to know what is the start and end of the context
15:36 martm_: now that cache is filled arbitrary, but all can be fetched via its tag, and in the end based of those start and end buffers and index what would be the order, you reorder the cache in hw
15:36 martm_: and just execute the playlist
15:39 martm_: the reordering should be done entirely in hw though, but it seems fairly easy with a shader, looks like it has the needed methods for this
15:47 martm_: i am thinking that context 2 could be some 200-500 and context 1 can be 800-1100 addresses, and it would leave a hole, so actually they should be reordered, doing that via a shader is a lot faster than relocating them
16:08 martm_: yeah i am seeing it can work probably, need to look what is that txc->offset + tic->id it something how the line can be accessed, it's like an instance number, this can be executed probably
16:13 martm_: nah i could be wrong, it's data cache it can be accessed so to execute it needs an instruction cache, but this one executes based of the tag or pc
16:16 martm_: heck, too complex, could be done as branches instead, why the reordering than
17:28 martm_: does anyone know what this TIC stands for..tag identification/index something or what?
17:41 martm_: http://envytools.readthedocs.org/en/latest/hw/memory/g80-vm.html#dma-objects dunno it's too complex for me, it's some sort of dma objects, base address that of pgraph 0x0020 like whatever i really do not understand
17:59 martm_: it says there PGRAPH TIC & PGRAPH TSC , it looks like if that is connected to cache, i dunno how to get there with shaders, but it is said to be a VM user
17:59 martm_: 0x400-0x1400: PFIFO CACHE [fifo channels only]
17:59 martm_: 0x0020 0x00200 PGRAPH
18:00 martm_: absolutely have no clue what is that about
18:00 martm_: but it is a fixed address vm user, it should be accessible from the shader too
18:05 martm_: what i am just after, is whether it is possible to reorder the cache into a fixed address and just, you know, execute it
18:22 martm_: i think it's like getting the offset from a texture that of context start and end and it's index, just reorder that into this contiguous cache, set a pc there and playlist is ready
18:30 martm_: RSpliet: i did too much spam today, i have one easy idea, will at some point talk about it, it needs a bit polishing, almost seems like this one is quite easy
18:41 pascalp100: https://devblogs.nvidia.com/parallelforall/inside-pascal/ New Pascal architecture
18:43 Yoshimo: not that there wasn't already more than enough to work on
18:49 martm_: FP64 GFLOPs 1680 213 5304[1] ---- maxwell 213, hmm
18:51 RSpliet: HBM2 after all...
18:51 karolherbst: this memory bandwidth
18:51 karolherbst: 4096 bit
18:51 karolherbst: *interface
18:52 martm_: maxwell's double precision kinda sucks then, worse than kepler's, but pascal seems to be the best of everything. i'll log off, i'll digest the multithreading thing, RSpliet; currently i don't see that there are very many locking issues, only a couple of functions need to be thread-safe
18:52 martm_: but i decided to think about the second mentioned reordering on gpu method
18:54 martm_: i don't wanna use branching, but it seems that cache is sort of contiguous, like it is based of index plus it has address tag queries, it seems to be done quite thoughtfully by designers, it's definitely on way of doing it right
18:54 Yoshimo: at which point or rather for which use cases does that precision matter?
18:55 Yoshimo: just professional use i guess and not for games
18:58 karolherbst: Yoshimo: yeah something like that
19:00 karolherbst: ohhh
19:00 martm_: Yoshimo: i don't really know, but it's possible to do a very deep zoom-in with double precision
19:00 karolherbst: f32 dual issuing
19:00 karolherbst: *f16
19:01 karolherbst: uhh and two f16 values inside one reg
19:01 martm_: that is how they use it in graphics world, but i agree, this is something that is not normally needed in games yeah
19:01 martm_: because the levels of zoom in/out are so big that no one needs that
19:02 RSpliet: karolherbst: that's going to be very interesting for OpenCL kernels, I'm quite sure that DNN libraries will want to use this
19:02 karolherbst: yeah
19:06 martm_: i see two possible driver modifications, one of course the channel occupancy, and the other: when the cache starts to get full, it should ensure that the commands previously queued up get executed somehow
19:06 martm_: those that were in the cache
19:10 martm_: the last one is just for performance, it would work without it too, but it would just probably bring lot more from memory instead of cache if not controlled right
19:13 martm_: but that would leave the coherence issues right, i'd actually really fancy getting rid of l1, which needs some magic that i do not yet really comprehend
19:13 martm_: and resize lds by that much + stack
19:16 martm_: imirkin: maybe understands the deadlocking theory, i yet gotta study it, threads need to be again synced with primitives, for data serialisation and stuff, i would need to waste some time to understand that
19:17 martm_: it's described on the web, when you place those sorts of primitives the wrong way for the cache, it can hit a deadlock
19:23 martm_: but yeah as said, to prevent that, the experts on the web say that the only non-coherent one is l1, and on nvidia that can be disabled..then with using a scratch again we could flesh out local mem, in radeon terms lds, which they use on graphics for varying variables
19:23 martm_: then it should be impossible to get it wrong
19:26 martm_: but yeah i have absolutely no clue how to yet do that sort of trick on opengl, there was a cuda flag to do that
19:27 martm_: as i understand that cuda stack is used as binaries, probably there is no such option seeable inside the driver, and probably also in gdev etc. it could be tricky
19:28 martm_: i.e needs some blob tracing likely
19:29 martm_: and an unestimated amount of driver rework, but it should be very beneficial
19:31 RSpliet: gnurou: you probably see it coming already, but I hope pascal firmware is available very very very soon :-P
19:32 pmoreau: :-D
19:33 karolherbst: gnurou: I've made some tegra changes here, do you want to take a look if that's okay? Not that I want to remove something you had bigger plans for https://github.com/karolherbst/nouveau/commit/6101f93facdd0217d89d8f67b20b6e2c3270d01d
19:33 pmoreau: RSpliet: Do you mean all Pascal firmwares, or like what we currently have for Maxwell?
19:34 karolherbst: oh well
19:34 karolherbst: It would be actually nice to get the PMU stuff pretty soon, so that we get at least those gm200 gpus running fast on nouveau
19:35 RSpliet: pmoreau: I think pascal gr beats maxwell pmu imho :-P
19:35 karolherbst: nobody expects nouveau to run pascal above 20% nvidia speed before mid 2017 anyway...
19:36 Yoshimo: how well do cards before maxwell work these days?
19:36 karolherbst: depends on what you use
19:36 pmoreau: RSpliet: :-D Well… Maxwell gen2 isn't too old yet, so having PMU would be quite nice
19:37 karolherbst: Yoshimo: if you use my reclocking patches, pretty solid then, but that's kepler and maxwell gen1 only
19:37 karolherbst: I think on tesla gen2 there are also only minor issues left
19:37 pmoreau: My Tesla cards (G96 and MCP79) still work smoothly, and have working reclocking
19:37 RSpliet: Tesla 2nd gen DDR2 is dodgy
19:37 RSpliet: Tesla 2nd gen GDDR5 doesn't work
19:38 karolherbst: Yoshimo: but getting a 780 ti to work on full clocks shouldn't be a problem anymore
19:38 RSpliet: (G)DDR3 should work
19:38 RSpliet: oh, and Fermi is wank currently, it'd be nice to make a change there...
19:38 karolherbst: yeah
19:38 karolherbst: well I do that too, but only on the engine stuff
19:38 karolherbst: my patches will be some kind of important there too
19:39 karolherbst: although the voltage map table has to be REed there as well
19:39 karolherbst: there are only three coefficients per entry
19:39 karolherbst: and they could be the same or totally different compared to the kepler/maxwell ones
19:39 Yoshimo: if there is lots of work to be done on pre-maxwell cards, why bother with PMU for maxwell and pascal, that was the idea
19:40 karolherbst: there will be some benchmarks on phoronix soon testing my kernel tree with all the reclocking stuff on kepler and maybe maxwell gen1 too
19:40 vedranm: the news is all over the net, and /r/AyyMD is full of Pascal memes
19:40 karolherbst: Yoshimo: because with the PMU we could have memory reclocking on maxwell gen2
19:40 karolherbst: and maxwell is pretty close to kepler
19:40 karolherbst: so if kepler works, maxwell is 90% done or more
19:40 karolherbst: regarding reclocking the changes are pretty minimal
19:41 vedranm: karolherbst: how come that Kepler is better supported than Fermi? just developers owning cards?
19:41 Yoshimo: so 2 birds one stone, basically
19:41 karolherbst: Yoshimo: yeah
19:41 karolherbst: vedranm: no clue
19:41 karolherbst: vedranm: I expect that kepler was newer and it made more sense to do kepler first
19:42 karolherbst: no idea what the real reasons was though
19:42 karolherbst: *were
19:42 Yoshimo: the selfish person that i am, i can't think of anything more important than maxwell pmu
19:42 karolherbst: right
19:42 karolherbst: it would be really awesome to have the signed pmu firmware
19:42 karolherbst: but this would require some work on nouveau, because we have to move to the same PMU interface nvidia uses for its driver
19:42 karolherbst: no big deal, but still work
19:43 karolherbst: RSpliet: did you work on the context switch stuff by the way? I am not sure who dug into the falcon code for this
19:44 RSpliet: karolherbst: what do you want to know about that code?
19:44 Yoshimo: i think the bigger workload is on nvidia to decouple the firmware and the driver
19:44 karolherbst: nothing, just a comment on a bug kept me thinking
19:44 karolherbst: regarding nvidia pr firmware works but nouveaus not
19:45 RSpliet: oh hmm, well, nouveau firmware should work, but might occasionally hang?
19:45 karolherbst: RSpliet: https://bugs.freedesktop.org/show_bug.cgi?id=93629#c14
19:45 karolherbst: but that person didn't state anything at all
19:45 karolherbst: just that nvidias firmware seems to work
19:46 RSpliet: I personally believe there might be some race conditions when an interrupt arrives while a context switch is in progress
19:46 karolherbst: yeah
19:46 karolherbst: we had the same issue on the PMU
19:47 karolherbst: basically the pmu reset the entire interrupt flags for internal interrupts although external interrupts had also fired and were lost this way
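The usual fix for that class of bug is to acknowledge only the interrupt bits that were actually handled, instead of clearing the whole status register; a generic sketch (register offsets and helpers are placeholders, not the real PMU ones):

    #include <stdint.h>

    #define INTR_STATUS   0x04           /* pending bits (hypothetical)     */
    #define INTR_ACK      0x08           /* write 1 to clear (hypothetical) */
    #define INTR_INTERNAL 0x0000ffffu
    #define INTR_EXTERNAL 0xffff0000u

    extern uint32_t mmio_read(uint32_t reg);
    extern void mmio_write(uint32_t reg, uint32_t val);
    extern void process_internal(uint32_t bits);

    void handle_irq(void)
    {
            uint32_t stat = mmio_read(INTR_STATUS);
            uint32_t handled = stat & INTR_INTERNAL;

            process_internal(handled);

            /* Ack only what was handled; clearing the full register here
             * would silently drop external interrupts that raced in. */
            mmio_write(INTR_ACK, handled);
    }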
19:47 RSpliet: skeggsb seemed convinced there isn't, but couldn't recall why (might have something to do with the conditions under which a context switch interrupt could reach the hub fuc engine in the first place...)
19:47 karolherbst: RSpliet: I could try to stresstest the gr a bit like I've done with the PMU
19:48 RSpliet: it doesn't use the same communication fifo's as the PMU
19:49 karolherbst: I thought as much
19:49 karolherbst: but is there a way to call a gr function from the host and get a reply?
19:49 RSpliet: no I think the "set predicate $p1" and "sleep $x1" instructions should be apart, or maybe even better, make the sleep function on the predicate "queue empty" directly
19:49 martm_: i think it is still easier when disabling, then just preload the contiguous cache regs, but how to test..where the tag address is
19:50 RSpliet: i might just hack that up later, but right now I'm a bit swamped with all the things I'd like to do ;-)
19:50 karolherbst: :D
19:50 karolherbst: okay
19:50 RSpliet: (which, trivially as it sounds, includes groceries right now... bbl)
19:50 karolherbst: but I think the context switch thing is higher clocked with nouveau anyway
19:52 karolherbst: RSpliet: is it CTXCTL in nvatiming?
19:53 karolherbst: nvidia: PGRAPH.CTXCTL's periodic timer: frequency = 115.191444 MHz
19:53 karolherbst: nouveau: PGRAPH.CTXCTL's periodic timer: frequency = 539.956787 MHz
20:24 martm_: it's in a google book, i can not paste it, but only fermi and higher cards allow that, dynamic cache is said to be coherent, it says ldu lsu instructions are needed to be used in the shader, and the gpu broadcasts it to the threads needed, hence taking care of coherency by itself
20:25 martm_: and those instructions can be used if cuda option is used, -dlcm=cg
20:31 martm_: no lsu, only ldu that is, fyi
20:57 RSpliet: karolherbst: I'm not sure if that means what we think it means
20:57 karolherbst: no idea either, I am more curious
20:58 martm_: https://code.google.com/archive/p/asfermi/wikis/OpcodeMiscellaneous.wiki
20:59 martm_: CCTL
20:59 martm_: Instruction usage: CCTL(.E)(.Op1).Op2 reg0, [reg1+0xabcd]; 0xabcd should be a multiple of 4. Template opcode: 1010 000000 1110 000000 000000 00 000000000000000000000000000000 0 11001 mod reg0 reg1 mod2 0xabcd mod3. Op2 values (mod 2:4): 0=QRY1, 1=PF1, 2=PF1_5, 3=PR2, 4=WB, 5=IV, 6=IVALL, 7=RS
20:59 martm_: this is something interesting also...QRY1 seems to query the cache
21:00 martm_: dunno if that would work with ldu
21:05 martm_: 0 is query, i really wonder what would it do?
21:06 martm_: like returning 1 or 0 depending wether the address is in cache or something?
21:06 martm_: that would be very neat...then basically for scheduling there is no problem on fermi and kepler
21:07 martm_: and if it determines stuff for ldu too, then with coherency it would be helpful too
21:24 martm_: there is a patent it's exactly how it works
21:24 martm_: http://www.google.com/patents/US20110072213
21:32 martm_: well, sounds like that simplifies a lot, probably during this year we'd get all things working. i am quite in a mess/jam here, have to pursue other things for some time, get back my human rights here, stuff like that, too many con artists in business here in estonia
21:32 martm_: cheers.
22:13 karolherbst: argh... found a stupid error in my code
22:18 karolherbst: sooo. mupuf when I manually trigger a reclock, the voltage gets updated quite nicely :) although currently a full reclock is simply done. I guess we should have shortcuts for when only the voltage should be updated, but currently it works already :)
23:05 gnurou: RSpliet: I certainly hope so too, note that this is not dependent on me though
23:05 gnurou: (if only it was...)
23:51 karolherbst: mhhh
23:51 karolherbst: around 5000 engine reclocks per second
23:51 karolherbst: I thought we could reclock a little faster actually
23:56 karolherbst: 700 pstate changes per second or 2700 cstate changes per second
23:56 karolherbst: oh well
23:57 karolherbst: I guess a lot is scheduling overhead because it is actually done within a worker