08:37 mrsinisalu: I am pretty sure now that on VLIW-based CPUs, the queues are sized as CPU word/bundle_width * word/bundle_width / 2 * num_alus, which totals 32*32/2*4 for the Elbrus CPU, so it has a 2048-entry post-decode instruction buffer. I am pretty sure that the Russians made the best CPU on x86, hence
08:48 mrsinisalu: because if i was a hw designing technocrat employee, I would design it exactly the same round about.
08:52 mrsinisalu: but ARM cores in-order ones look pretty good too, last place goes to OoO pipelined superscalar.
09:03 mrsinisalu: when there is a 10-stage pipelined ARM with async branches, then the debug way of running the chip is more complex than on VLIW -- the latter requires only a minor scheduler in the kernel
09:04 mrsinisalu: in-order without performant debug/trace buffers can not be used at all, however scoreboarded OoO cores are easy to run too, but consume more power
09:06 mrsinisalu: they fabbed all the processors and graphics processors in a way that all are covered, however VLIW should work the best and easiest way
10:06 mrsinisalu: so in-order cores require either async branches or so called async memory addressing modes , lsq/hw transactional memory/buffered async loads or such aka software pipelining
10:07 mrsinisalu: and then the sw pipelined debug upload to the queues can be programmed, i am looking at the xnu kernel to do that on ios at the moment
14:24 mrsinisalu: On the processors it very much seems that DCC/ITM/ETM & ETB are just buffered extensions on top of JTAG, which was first presented on the i486
14:25 mrsinisalu: it looks as if ISR ir register is pre-decode, and bsr registers and ITM SWIT are post decode
14:26 mrsinisalu: JTAG seems to be a very powerful debugging and also trace running facility
14:30 mrsinisalu: there is something described as adaptive RTCK i.e adaptive return clock, to use async clocking on sw pipelined loop toggling maybe
14:49 mrsinisalu: more precisely, it appears that access to the TAPs is done directly with the SAMPLE/PRELOAD commands of JTAG
14:50 mrsinisalu: the debug commands are decoded but they probably quit the decoder very early, as on ARM it is a coprocessor register based redirection
14:53 mrsinisalu: i have no jtag verilog code or modules anywhere available for ASICs, but i think the redirection is done like on miaow in microcode loop, if things were done sanely
14:54 mrsinisalu: there are instruction groups based on microcode, and it redirects and does not parse or identify stuff like flags and such, so the ALUs are the first microcode, debug and coprocessors second or third etc.
14:56 mrsinisalu: alus are redirected to alu decoder, debug commands to debug decoder, other coprocessors to corresponding co-processor like FPU or such
15:05 mrsinisalu: any modern kernel should be modified to have two scheduler hunks: 1. a memory-using, offline compiler specific or checksum specific application scheduling part, which uses the data cache and fetches stuff from memory
15:05 mrsinisalu: and 2. a runtime program part of the scheduling that does not
15:06 mrsinisalu: compilers do not have to be modified
15:07 mrsinisalu: kde programs that compile into machine code would run and start very fast after they get checksummed
15:11 mrsinisalu: yeah i am pretty sure you are wondering how to access disks like the IO.
15:12 mrsinisalu: those are apic and timer interrupt based stuff, some disks have IO queues as well though
15:13 mrsinisalu: if such an opcode is identified then memory read/write is also allowed
15:19 mrsinisalu: all the idea is to bypass ifetch and decode stages , syscall based data cache and memory is allowed at runtime apps too
15:26 mrsinisalu: i head off now, to sum it up, certain types of rearrangements need to be done in the OS for CPUs, in the driver for GPUs, and also it would be worthwhile to write a precompiler for the GPU and command processor checksum to remove the CP lagging behind due to command fetching from memory. i.e lags due to excessive and needless instruction memory access
15:28 mrsinisalu: it is a lot of work, since you have not looked at it for ages from a sane perspective, but it can be done
15:31 mrsinisalu: and i am totally unsure why are such freaks getting paid to do nonsense all the time, and not looking at the correct things.
15:33 mrsinisalu: the problem is on the sw side, I would rather pay sane money for the prizes of athletes who have trained most of their lives to high standards, rather than paying money for someone who can not understand primary school material.
15:36 mrsinisalu: because when you have sane first prizes for the title, people start to more eagerly train, and this is how stronger and capable generation develops, when you get paid for doing shit like you do i.e fuck up and fail on all things and ontop bully too
15:37 mrsinisalu: this does a degeneration a degrade of human skills imo
15:39 mrsinisalu: there should be social programs for freaks and zombies like you, not 5000dollars payment per month
15:39 mrsinisalu: minimal money paid that is
17:29 ZombieChicken: Hello. I have 2 questions. 1) Where would the external firmware come from if the nouveau driver apparently needs it, and how would it be loaded? 2) Is the NVC0 family usable for 3D accelerated stuff, like games? I'm mostly wondering if my GT 620 is usable with this driver or not
19:35 mrsinisalu: I think i talked pretty much everything now. I have not compiled code for ages, need to work on different kernels and drivers.
19:37 mrsinisalu: i see that ios12+xnu has been entirely released, the only thing I do not have is the windows NT kernel source code.
22:16 Lyude: So, on a nvidia GPU an SOR or a PIOR is akin to a drm_encoder, a head is akin to a drm_crtc and a connector is just a connector, right?
22:17 Lyude: Asking because I'm trying to solve a bug with nouveau on the P71, where it looks like the culprit is that we make so many encoders we trigger the `if (WARN_ON(dev->mode_config.num_encoder >= 32))` in drm_encoder_init()
22:18 airlied: skeggsb: ^
22:18 Lyude: but looking at the dcb entries, the only SORs I ever see being referenced are SOR 1 and SOR 2
22:36 aaronp: Lyude, what are the types on the extra encoders? As a wild guess, interpreting some garbage as a bunch of legacy VGA or TV ports on a GPU that doesn't have DACs, maybe?
22:38 Lyude: aaronp: no, it's all displayport
22:39 Lyude: afaict the vbios doesn't look like it's got any garbage dcb entries. definitely has some that don't refer to any real existing DP connector though, since the vbios lists 6 DP ports (1eDP + 5 DP) ... actually
22:39 Lyude: I think every single one of those DP ports might also be correct
22:40 Lyude: 2 TB ports, 1 mini DP, docking port on the bottom (so +1 DP), + 1 HDMI that's almost certainly just a DP->HDMI adapter that's built into the laptop
22:40 Lyude: + the eDP for the display
22:41 Lyude: so yeah, that adds up to 6 DP ports
22:43 Lyude: aaronp: I can give you the vbios if you like, note that envytools complains about the DCB header being too long but it looks like that might just be bogus
22:43 aaronp: Oh, is it hitting this DP MST case? It looks like nv50_mstm_new() will call drm_encoder_init() max_payloads times for each SOR.
22:43 aaronp: and if I'm understanding correctly, max_payloads is 4 because there are 4 heads.
22:43 Lyude: aaronp: yeah, but 5*4 = 20. So that still leaves room for 12 encoders
22:43 Lyude: aaronp: correct
22:44 Lyude: wonder if we could start off by just sharing MST encoders
22:45 Lyude: but I also think we might be able to improve things by just not creating duplicate encoders for the same SORs
22:45 aaronp: Plus one encoder created directly from nv50_sor_create(), but that still only adds up to 30 unless the eDP port is getting treated as MST as well.
22:45 aaronp: Oh wait, I'm still counting wrong. It adds up to 30 even if you have 5 encoders for all 6 DP ports.
22:46 Lyude: aaronp: I -think- we might be getting 6 encoders per DP port. There's an HDMI DCBE for each DP DCBE as well, i'm assuming just because they all support DP++
22:47 Lyude: actually yeah, modetest seems to confirm that guess
22:48 aaronp: Oh yeah, I forgot about the TMDS side. eDP shouldn't have MST or TMDS though, so I wonder if the DCB for that one is just wrong.
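The tally being worked out in this exchange can be sketched as a small model. This is not nouveau code: `encoders_per_port` and its parameters are hypothetical, and the model assumes each DP port gets one DP encoder, one TMDS encoder when it supports DP++, and one MST encoder per head. Only the limit of 32 comes from the `WARN_ON` quoted earlier in the log.

```python
DRM_ENCODER_LIMIT = 32  # from the quoted WARN_ON: num_encoder >= 32


def encoders_per_port(heads, has_tmds=True, has_mst=True):
    """Hypothetical model of how many encoders one DP port produces."""
    count = 1              # the DP encoder itself
    if has_tmds:
        count += 1         # extra TMDS (HDMI) encoder for DP++
    if has_mst:
        count += heads     # one MST encoder per head
    return count


ports = 6   # 1 eDP + 5 DP, as listed by the vbios
heads = 4

# aaronp's first count: MST + DP only, no TMDS side.
without_tmds = ports * encoders_per_port(heads, has_tmds=False)

# Lyude's guess: 6 encoders per port once the HDMI DCB entries are included.
with_tmds = ports * encoders_per_port(heads)

print(without_tmds, without_tmds >= DRM_ENCODER_LIMIT)
print(with_tmds, with_tmds >= DRM_ENCODER_LIMIT)
```

Under these assumptions the count stays under the limit without the TMDS side and goes over it once every port also gets a TMDS encoder.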
22:48 Lyude: might be. either way, I think starting off with deduplicating SORs here might be the way to go. I don't see any actual reason for us having duplicates anyway
22:49 aaronp: In any case, we don't have a concept of 'encoders' in our driver so I'm not entirely sure what they even do. :) It does sound plausible that you could just make up a few and reuse them as necessary.
22:49 Lyude: aaronp: from the looks of nv50_display_create(), it appears we just make encoders for SORs/PIORs
22:52 aaronp: Now I'm curious what Nikhil did in the nvidia-drm.ko open files.
22:53 aaronp: Looks like we call drm_encoder_init() whenever a DP MST device gets attached. Now I'm worried that calling it dynamically instead of at init time is going to confuse somebody.
22:54 Lyude: aaronp: no, I don't mean like that. So like, currently we create an encoder each time we process an SOR from the dcbe. However, multiple dcb entries can refer to the same SOR
22:54 Lyude: we don't check for existing encoder objects that match a specific sor, and instead just create a new one each time
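The deduplication idea described here, looking up an existing encoder for an SOR before creating a new one, might look roughly like the sketch below. It is a simplified illustration, not the driver's logic: the dict-based DCB entries, the field names `sor` and `type`, and the `create_encoders` helper are all hypothetical, and it keys on the SOR alone exactly as the idea is stated at this point in the log.

```python
def create_encoders(dcb_entries, dedupe=False):
    """Build an encoder list from DCB entries (hypothetical data model).

    With dedupe=False, every entry yields a new encoder (the behaviour
    described in the log). With dedupe=True, an entry whose SOR already
    has an encoder reuses it instead of creating another.
    """
    encoders = []
    by_sor = {}
    for entry in dcb_entries:
        sor = entry["sor"]
        if dedupe and sor in by_sor:
            continue  # an encoder for this SOR already exists; reuse it
        encoder = {"sor": sor, "type": entry["type"]}
        encoders.append(encoder)
        by_sor[sor] = encoder
    return encoders


# Two SORs, each referenced by a DP entry and a TMDS entry.
entries = [
    {"sor": 1, "type": "DP"},
    {"sor": 1, "type": "TMDS"},
    {"sor": 2, "type": "DP"},
    {"sor": 2, "type": "TMDS"},
]

print(len(create_encoders(entries)))               # one encoder per entry
print(len(create_encoders(entries, dedupe=True)))  # one encoder per SOR
```

Note that keying on the SOR alone collapses the DP and TMDS entries into one encoder, which is a design question in its own right.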
22:55 aaronp: But you still need four separate encoder objects if you're going to drive four heads on one DP via MST, right?
22:56 Lyude: aaronp: yes, but assuming this GPU has 4 SORs and 5 DP ports, 5 * 4 == 20, 20 + 4 == 24
22:56 Lyude: vs 5 * 2 * 4
22:57 Lyude: erm, 5 * 2 * 4 SOR encoders
23:09 Lyude: ok this is definitely not going to be fun to fix
23:13 airlied: up the num encoder limit
23:13 Lyude: airlied: I was thinking of that
23:13 Lyude: since it doesn't seem like we can deduplicate sors like I thought
23:14 airlied: oh actually it's painful
23:14 airlied: there is likely a bitfield somewhere
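Why a bitfield would make raising the limit painful can be shown with a short sketch. It assumes encoders are tracked as bit positions in a 32-bit mask; the width is an assumption here, since the log does not say which field is meant, and `encoder_bit` is a hypothetical helper.

```python
MASK_BITS = 32  # assumed width of the encoder bitmask


def encoder_bit(index):
    """Return the mask bit for an encoder index, or fail if it cannot fit."""
    if not 0 <= index < MASK_BITS:
        raise ValueError(f"encoder index {index} does not fit in {MASK_BITS} bits")
    return 1 << index


print(hex(encoder_bit(0)))   # first encoder
print(hex(encoder_bit(31)))  # last encoder that still fits

try:
    encoder_bit(32)          # a 33rd encoder has no bit left
except ValueError as err:
    print(err)
```

Under that assumption, raising the limit means widening every such mask and every user of it, not just changing one constant.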
23:49 skeggsb: Lyude: we don't exactly create an encoder per-OR, rather, per supported OR+protocol combination. which makes sense, because drm_encoder doesn't exactly expect a single encoder to do both DP+TMDS, for example
23:50 skeggsb: if you're hitting limits, i think the best thing to do would probably be to rework the MST stuff to work with a single encoder, rather than one per possible head like it is now
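The effect of this suggestion on the encoder count can be estimated with a rough model. It assumes one encoder per OR+protocol combination (DP and TMDS) per port, plus either one MST encoder per head or a single MST encoder per port; `total_encoders` is a hypothetical helper, and the port and head counts are the ones mentioned earlier in the log.

```python
def total_encoders(ports, heads, mst_per_head):
    """Hypothetical count: DP + TMDS per port, plus MST encoders."""
    mst = heads if mst_per_head else 1
    return ports * (2 + mst)


ports, heads = 6, 4
current = total_encoders(ports, heads, mst_per_head=True)
reworked = total_encoders(ports, heads, mst_per_head=False)
print(current, reworked)
```

In this model the reworked layout stays well under 32, though whether a single MST encoder is workable is left open in the discussion.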
23:50 skeggsb: i can't exactly remember why i did it that way though, there might be a reason that's difficult too