00:09 airlied: pepp: hey just opened an issue for the app sha1 stuff
01:23 lrusak: bnieuwenhuizen, any idea why on amd the va state tracker returns a P016 format instead of a P010 format for 10bit HEVC decoding?
01:24 lrusak: vainfo: Driver version: Mesa Gallium driver 19.2.8 for AMD VEGAM (DRM 3.36.0, 5.6.0-0.rc6.git1.1.fc33.x86_64, LLVM 9.0.0)
01:25 lrusak: I'm able to use DRM_FORMAT_P010 and import that into EGL, but the format I get from va is VA_FOURCC_P016
01:26 lrusak: here is a screenshot of it working ok as DRM_FORMAT_P010 https://i.imgur.com/55Oi5my.png
01:28 lrusak: (gdb) print va_desc.fourcc
01:28 lrusak: $3 = 909193296
01:29 lrusak: 909193296 == 0x36313050 == #define VA_FOURCC_P016 0x36313050
01:53 imirkin: someone around who i can ask about confidential issues in gitlab and the correct way to address them?
02:09 robclark: I'd say maybe daniels but maybe a bit late in the day to catch him?
02:09 robclark: (or at least hopefully he sleeps sometimes?)
02:14 imirkin: basically, the issue isn't a big deal imo (i mean, it's a bug, i have a fix), but it was filed as a confidential one, so ... what to do.
02:17 robclark: I guess it depends on whether the bug has any private info? I *assume* you could still push an MR w/ closes #1234 without opening up the bug (not 100% sure about that, but seems like it would be a glaring gitlab bug if that were not the case)
02:18 imirkin: there's also a "confidential" MR thing one can do
02:19 imirkin: but it seems like a lot more of a pain than it's worth for something that'll be reviewed in a day
02:19 imirkin: i can see it for something that'll require months of back and forth
02:21 robclark: I guess confidential MR might only make sense if it was a serious CVE thing and time was needed to get fix propagated into stable branches and distros..
02:22 lrusak: on wayland is there a way to get a render node for the dri card that is currently in use? Or do I have to try and open each one and somehow match it?
02:27 lrusak: usecase is that on an intel cpu and amd gpu system the output is on rendernode /dev/dri/renderD129 but typically the default is /dev/dri/renderD128 and when decoding via vaapi the decoded frames of the intel cpu are not compatible with the amd gpu (due to modifiers being used)
02:36 airlied: imirkin: if it's not actually got a reason to be confidential don't bother
06:13 daniels: lrusak: yes, wl_drm sends the device that the server is using, and for the more enlightened there's the dmabuf-hints protocol as well
06:13 daniels: robclark: 3:09am is usually a bit late for me, yeah
06:42 pepp: airlied: thanks, I'll take a look today
07:03 MrCooper: lrusak: render nodes have no outputs :)
07:32 hakzsam: nroberts: can you please use Marge instead of pushing directly next time? You just need to assign your MR to marge and she will rebase and push for you
07:50 nroberts: hakzsam, ah right, ok, sorry
08:02 austriancoder: anholt: how beefy is your gitlab runner hw used for bare-metal mesa testing?
08:39 pq: lrusak, https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/8
11:05 danvet: mripard, see my quick reply to paulk-leonov patch, some new stuff about best practices and all
11:05 danvet: although drm_dev_alloc has been deprecated since a long time by now ...
11:10 paulk-leonov: danvet: is there a rationale behind these comments, besides unifying the way drivers are written?
11:11 danvet: paulk-leonov, pointer chasing is tougher on the compiler, and it makes the relationship for non-authors easier to understand
11:11 danvet: if it goes through common concepts
11:11 paulk-leonov: m'kay
11:11 danvet: i.e. if you sprinkle random driver-specific pointers all over I need to check whether it's the same pointer as the drm one
11:12 danvet: or whether that driver is doing some extremely questionable and like having mismatching objects linked up somehow
11:12 danvet: also devm_kzalloc is just wrong lifetime in almost all cases, that's just the current crusade
11:13 danvet: so drm_dev_alloc + devm_kzalloc of the driver private is something I'm trying to actively eradicate, if possible
11:14 danvet: and I'm getting the magic infrastructure in place that even converting the 40 odd existing drivers is becoming doable
11:21 danvet: paulk-leonov, btw found another one
11:21 danvet: all part of the recent 50 patch series I landed
11:38 eric_engestrom: lordheavy, tjaalton, and anyone who packages Mesa: I have just published 20.0.4 which reverts a commit that was causing a severe regression in SPIRV that was introduced in 20.0.3, so you should skip the latter and make sure you update to 20.0.4
11:38 tjaalton: eric_engestrom: yep, I noticed the thread and didn't update yet
11:39 eric_engestrom: lordheavy: I just checked and see that you had already updated to 20.0.3; please update to 20.0.4 asap :)
11:40 eric_engestrom: and I'm really sorry for the mess everyone, this was missed in testing and was also caused by a bit too much eagerness from me
11:41 eric_engestrom: I'm updating our documentation and tools to prevent myself and others from backporting commits too quickly in the future
11:43 tjaalton: btw, there's c81aa15d which should be a candidate for 20.0.x :)
11:48 lordheavy: eric_engestrom: yep
11:49 danvet: pinchartl, paulk-leonov I'm prepping the patch series which adds devm_drm_dev_alloc
11:49 danvet: might be worth to wait a bit and rebase your in-flight drivers on top of that
11:49 danvet: it's just a small patch which only needs the stuff already merged into drm-misc-next
11:50 danvet: and you avoid the entire interim lolz with drmm_add_final_kfree
11:50 pinchartl: danvet: what will devm_drm_dev_alloc do ?
11:50 danvet: kzalloc + devm_drm_dev_init + drmm_add_final_kfree
11:51 danvet: I converted all drivers which are using either devm_drm_dev_init or drm_dev_init, result looks imo pretty
11:51 danvet: also final patch unexports drmm_add_final_kfree :-)
11:51 pinchartl: can you share the prototype of the function ?
11:51 danvet: it's a macro ...
11:52 pinchartl: :-)
11:52 pinchartl: what does it allocate ? drm_device ? or the containing structure ?
11:52 danvet: https://paste.debian.net/1138225/
11:52 danvet: the container
11:53 pinchartl: please don't make this a requirement
11:53 pinchartl: as I said before, not all drivers may expose only a drm_device
11:53 pinchartl: they could expose other resources
11:53 pinchartl: so the top-level structure would embed a drm_device, but also other userspace-facing interfaces
11:53 pinchartl: we thus shouldn't enforce that drm_device drives all the allocation
11:54 pinchartl: same for drmm_add_final_kfree really, it's the same issue
11:54 pinchartl: that's why I want it optional
11:56 danvet: pinchartl, hm that wasn't clear to me in your earlier mail
11:56 danvet: I haven't found a driver in-tree that needs this
11:57 danvet: and the drivers which had some kind of split or other fun stuff all looked very questionable
11:58 danvet: pinchartl, is the other subsystem also this opinionated?
11:58 udovdh: hello! What to do with this message?\
11:58 udovdh: amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description
11:58 udovdh: found in journalctl...
11:58 danvet: or could that other interface hold a drm_device reference, so that the drm_device is the last thing that goes down?
11:58 pinchartl: danvet: for instance uvcideo exposes both a V4L2 device for video capture, and an input device for the buttons
11:59 pinchartl: two different subsystems, I implement .release() for both, refcounting the top structure
11:59 pinchartl: let's not assume drm_device will always be the only one
11:59 pinchartl: I'm fine with simple helpers to handle this case, which is the main case
11:59 udovdh: at https://www.linuxquestions.org/questions/slackware-14/requests-for-current-14-2-15-0-a-4175620463/page303.html there is some explanation
12:00 udovdh: but
12:00 udovdh: # grep CONFIG_CHECKPOINT_RESTORE .config
12:00 danvet: pinchartl, so I'm not deleting any of the building blocks
12:00 udovdh: # CONFIG_CHECKPOINT_RESTORE is not set
12:00 pinchartl: but let's make it possible to implement .release() manually
12:00 danvet: so re-adding them is trivial, just EXPORT_SYMBOL
12:00 udovdh: so what is the matter here?
12:00 danvet: hm drmm_add_final_kfree is there now already
12:00 pinchartl: danvet: drmm_add_final_kfree is mandatory today
12:00 pinchartl: and it shouldn't be :-)
12:01 danvet: well works perfectly fine in upstream right now :-)
12:01 danvet: I guess the super polished version would be to pull the managed stuff out of drm
12:01 danvet: and allocate that first
12:01 danvet: and then pass that managed container to drm_dev_init
12:01 danvet: so a drmm_dev_init or so
12:02 danvet: because in a way for your use-case I'd say even the drm_dev_put() is probably the wrong thing
12:02 pinchartl: until we get there, I'd like drmm_add_final_kfree to remain optional
12:02 danvet: to avoid hilarity I'd assume you only want to start garbage collecting once all interfaces are gone
12:02 danvet: can't
12:02 danvet: because drm core uses it
12:02 eric_engestrom: tjaalton: c81aa15d isn't tagged for stable in any way, so it would've gone unnoticed; I just added to the staging area for 20.0.x, it should be in 20.0.5 but dcbaker will handle that one so he'll have the final say
12:02 pinchartl: I'm in a meeting, need to postpone this discussion, sorry
12:03 danvet: I guess as a very gross hack you could do a callback that runs instead of kfree(final_kfree)
12:03 tjaalton: eric_engestrom: should there be a label or something to tag MR's for stable?
12:03 danvet: but I'd expect in reality this will result in all kinds of wrong lifetimes since by that point the entire drm_device stuff is garbage collected already
12:03 tjaalton: for commits like this that were already committed
12:04 danvet: so I feel like the proper solution is to lift drm_managed.c up, and the kref
12:04 danvet: so that garbage collection only starts once all the interfaces are gone
12:05 eric_engestrom: tjaalton: see https://mesa3d.org/submittingpatches.html#nominations
12:05 danvet: and the quickest way to do that is if all other interfaces hold a drm_device reference
12:06 tjaalton: eric_engestrom: ahh, good to know.. thanks
12:06 eric_engestrom: tjaalton: in this case the proper way would be to create a backport merge request against staging/20.0 containing the commits that you want to backport
12:06 tjaalton: yep
12:06 eric_engestrom: but that's not necessary for this time as I've backported it already
12:08 tjaalton: hmm, I've done a libdrm 2.4.101 release, but tarball copy failed
12:09 tjaalton: ssh key missing from somewhere?
12:10 eric_engestrom: did you run the xorg util release.sh ?
12:10 tjaalton: yes
12:10 tjaalton: gitlab has my key
12:10 eric_engestrom: any error message?
12:10 tjaalton: just permission denied
12:10 tjaalton: scp: /srv/dri.freedesktop.org/www/libdrm/libdrm-2.4.101.tar.xz: Permission denied
12:10 eric_engestrom: you're not uploading to gitlab with that script I think
12:10 eric_engestrom: yeah
12:10 eric_engestrom: you need access to annarchy
12:11 eric_engestrom: daniels might help you with that
12:11 tjaalton: daniels: ^ got a minute :)
12:11 eric_engestrom: (or maybe kemper actually?)
12:11 danvet: pinchartl, another problem that you already have I think is drm_dev_enter/exit
12:11 danvet: or well, going to be complicated in your scenario too
12:11 tjaalton: I can ssh to annarchy
12:12 tjaalton: not to kemper
12:12 eric_engestrom: I think that mihgt be the issue then :]
12:12 eric_engestrom: udovdh: that means your kernel is too old to have SYS_kcmp; what kernel version are you running on?
12:15 daniels: no-one can ssh to kemper
12:15 daniels: kemper isn't relevant
12:15 daniels: you copy files to annarchy
12:15 daniels: the reason you're getting permission denied is because you aren't in the mesa group
12:15 tjaalton: oh
12:15 daniels: i've added you to the mesa group now, which should take effect in 5-10min
12:15 tjaalton: ok, thanks
12:16 daniels: np
12:20 MrCooper: udovdh: you need to enable CONFIG_CHECKPOINT_RESTORE
12:25 udovdh: will try MrCooper
12:25 udovdh: where in the kernel menu's is that? will have to search
12:29 MrCooper: udovdh: I'd remove the line from .config and run "make oldconfig"
12:29 udovdh: taht is the trick? OK... thanks!
12:31 udovdh: vague description and advised for 'N' in the kernel helptext...
12:32 udovdh: eric_engestrom, this is a 5.6.2
12:33 udovdh: trying out if https://bugzilla.kernel.org/show_bug.cgi?id=206191 happens on 5.6 as well
12:34 udovdh: not mesa related as far as I know
12:42 udovdh: BTW: is CONFIG_CHECKPOINT_RESTORE 'old' or older than 5.4?
12:44 udovdh: can I do somethign about this one message?
12:44 udovdh: amdgpu 0000:0a:00.0: RAS: ras ta ucode is not available
12:44 udovdh: Is there any ucode to fetch someplace? or do I simply have to wait for it to show up in linux-firmware?
12:45 udovdh: FWIW: this is AMD Ryzen 5 3400G with Radeon Vega Graphics
15:10 sravn: danvet: ok to commit drm_pci patches while you focus on drmm stuff?
15:15 danvet: sravn, you mean mine?
15:18 sravn: danvet: yep had it already applied and only needs to type "dim push"
15:18 sravn: danvet: only to help
15:18 danvet: sravn, sure
15:32 sravn: danvet: done, will read patches later today
15:33 danvet: thx
15:44 MrCooper: narmstrong: would it be possible to fix the panfrost-t820-test:arm64 job failures on the 20.0 branch? E.g. https://gitlab.freedesktop.org/mesa/mesa/-/jobs/2169720
15:45 MrCooper: I'm worried the stable maintainers might miss other regressions if that pipeline stage is always red
15:45 narmstrong: MrCooper: i pushed a fix on master for this
15:45 MrCooper: can you create an MR backporting it to staging/20.0 ?
15:46 MrCooper: or alternatively, disabling the job there
15:46 narmstrong: MrCooper: ok
15:46 MrCooper: thanks
16:12 lrusak: daniels, pq thanks for the answers, I'll see if I can use the wayland-drm.xml protocols
16:14 emersion: it's a private protocol
16:18 daniels: please don't
16:19 daniels: (but va-api already does internally, sadly)
16:19 lrusak: yea I see that :/
16:19 lrusak: and linux_dmabuf_unstable_v1 doesn't look like it exposes what I need?
16:20 daniels: not yet, but dmabuf_hints is fixing this
16:20 emersion: daniels: would it be okay to submit a trimmed-down version of dmabuf-hints to fix this?
16:20 emersion: would it help at all?
16:20 daniels: emersion: tbh we should just be able to finish dmabuf-hints quickly if we can focus on it
16:20 emersion: i'd really like to get it done
16:21 emersion: all right
16:21 emersion: what will be missing is a compositor sending different hints depending on the surface
16:21 lrusak: can I test dmabuf-hints now? I'm really not sure what it takes to use a new protocol
16:22 emersion: lrusak: a client would be very nice and would help getting the protocol merged sooner
16:23 emersion: lrusak: there are weston and wlroots patches available atm
16:23 emersion: but we still don't know what's the best way to refer to a DRM device
16:24 emersion: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/issues/10
16:25 emersion: lrusak: to test the protocol, you should be able to just build the weston MR or the wlroots PR, copy-paste the updated protocol in your client, and use the protocol with the compositor
16:25 lrusak: that's above my pay grade :P
16:30 lrusak: sec, let me try quickly :)
16:31 emersion: note that the implementatiosn are out-of-sync with the wayland-protocols proposal atm
16:31 daniels: emersion: ascent12 has been working on it as well
16:31 emersion: so copy the protocol .xml file from the compositor MR/PR rather than from the wayland-protocols MR
16:32 emersion: daniels: yeah, on the weston side :)
16:32 emersion: but the per-surface bits too?
16:32 daniels: i've been busy with CMake/Windows/LLVM this week (my god), so haven't been able to check what his latest was, but I think he might have Mesa working now as well?
16:32 daniels: oh, do the Weston patches not do per-surface? and are they against an old protocol? :\
16:32 emersion: i had mesa patches as well
16:33 emersion: per-surface is… complicated
16:34 emersion: it's only useful if you know whether changing a format would allow the surface to be put into a particular plane
16:34 emersion: or if you support multiple devices
16:34 emersion: otherwise global hints are enough
16:35 emersion: also, it's out-of-date just because we haven't agreed on a way to identify DRM devices
16:37 lrusak: ah so I have to make changes to the compositor also to use this?
16:38 emersion: you can just checkout the merge requests, lrusak
16:38 emersion: weston: https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/397
16:38 danvet: MrCooper, do you have the splat?
16:39 MrCooper: it's attached
16:39 danvet: haha
16:39 danvet: yeah you need to create the property at driver load time
16:39 emersion: very old wlroots PR: https://github.com/swaywm/wlroots/pull/1376
16:39 gitbot: swaywm issue (Pull request) 1376 in wlroots "[WIP] Implement linux-dmabuf-unstable-v1 version 4" [Open]
16:39 emersion: i'll update it soon
16:40 emersion: damn, it's already two years old
16:44 lrusak: emersion, I guess what I'm asking is that it requires both server and client changes yea?
16:44 MrCooper: danvet: k, I might give it a go next week, if nobody beats me to it
16:45 emersion: lrusak: yes
16:45 MrCooper: danvet: the free_cb code in drm_mode_object.c is dead, right? Nothing can pass in a non-NULL callback
16:46 emersion: ascent12: the hints don't change depending on the surface in your weston patch right?
16:49 danvet: MrCooper, all refcounted thing pass in a non-NULL free_fb
16:49 danvet: iirc that's drm_framebuffer, drm_crtc and drm_blob
16:49 danvet: drm_property_blob
16:50 MrCooper: how? The only caller of __drm_mode_object_add is drm_mode_object_add, which hardcodes NULL
16:51 danvet: your git grep is broken
16:51 MrCooper: ah! I inferred from the leading underscores that __drm_mode_object_add must be static...
16:52 danvet: nah I think airlied just felt a bit uninspired
16:53 danvet: ah no tagr renamed it
16:54 danvet: airlied gave it a _get_reg suffix, not really better
16:58 vsyrjala: naming is one of those two hard things for sure
16:58 MrCooper: couldn't the same thing still happen in intel_dp_add_mst_connector => intel_attach_force_audio_property if no other connector had the force audio property before?
17:01 vsyrjala: there will always be a corresponding dp sst connector
17:01 danvet: we have our ci hotplugging mst connectors iirc
17:01 danvet: at least that was some plan somewhen
17:02 MrCooper: cool, anyway gotta go, bbl
17:04 vsyrjala: unfortunately we've broken mst quite a few times recently. not sure we still have any decent testing in ci
17:05 danvet: vsyrjala, maybe it was only ever a plan :-/
17:06 emersion: iirc it was just a plan
17:06 emersion: at least for hotplugging
17:10 pinchartl: danvet: back
17:13 danvet: pinchartl, so I have both a hacky solution and a very pretty solution for you
17:13 danvet: assuming I can convince gregkh that I'm not an incompetent imposter
17:13 pinchartl: :-)
17:14 danvet: which atm isn't going that well with the current patch series, so there's that :-/
17:14 pinchartl: first, do we agree on the problem statement ?
17:14 pinchartl: to summarize my point:
17:14 danvet: yup
17:14 pinchartl: I think that the infrastructure is nice and cleans up lots of drivers
17:14 danvet: I think your problem statement is why gregkh doesn't understand me, since within drm we already have your problem
17:14 danvet: every drm_device has an entire family of userspace devices registered
17:14 pinchartl: but we shouldn't assume that the drm_device is the only and central piece of every driver
17:14 danvet: which I think is why we're hurting so much more than others
17:15 danvet: yup I'm on board with this now
17:15 pinchartl: I'm totally fine with helpers that are based on the assumption that everything can be driven by drm_device
17:15 pinchartl: but they should remain helpers, not a midlayer
17:15 pinchartl: (using midlayer here is a way to hopefully convince you, as I know how much you love them :-))
17:16 danvet: oh you convinced me
17:16 danvet: like I said, I have a short term hack to unblock you
17:16 danvet: and a nice long term solution
17:17 danvet: which boils down to pulling the drm_managed.c stuff out of drm
17:17 Lyude: Hey seanpaul, I think I'm starting to see some timeouts on one of my MST hubs that appear to be related to simultaneous down replies
17:17 danvet: and the short term hack is fully compatible with the long term thing
17:17 danvet: +/- functions will need to be renamed ofc
17:17 seanpaul: Lyude: hmm interesting, any logs/traces?
17:17 danvet: the only hiccup is that I'd like to put that into drivers/base or maybe lib/
17:18 danvet: but atm it looks like gregkh doesn't think there's even a problem, or if there is, drm is doing it wrong
17:18 Lyude: seanpaul: yeah I can grab some - also, can add some tracing stuff for sideband replies + correcting the bogus rad address decoding we have in our sideband message dumps, if that'd help at all
17:19 Lyude: oh hold on-I just noticed withour your patches I'm seeing: [ 301.069653] [drm:drm_dp_mst_hpd_irq [drm_kms_helper]] Got NAK reply: req 0x21 (REMOTE_DPCD_WRITE), reason 0x04 (BAD_PARAM), nak data 0x00
17:19 lrusak: emersion, the weston PR doesn't build
17:19 seanpaul: Lyude: hmm, so naks getting dropped on the floor?
17:19 Lyude: I wonder if we're getting NAKs but we're just not handling them properly with simultaneous down replies
17:20 Lyude: seanpaul: yeah I think so - I'll take a look in a moment, about to grab some logs for you with/without the patches as well
17:20 seanpaul: sounds like it, do they have proper seqno's
17:20 lrusak: emersion, from the protocol it looks like it should be using major/minor here https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/397/diffs#449fb231ff9f908a3066b3afb821e4e4c146c2db_397_480
17:20 emersion: yeah, it's out-of-date
17:22 danvet: agd5f_, did you ask mlankhorst for the backmerge?
17:22 danvet: generally I get same day backmerges, but tz is kinda better for me :-)
17:23 pinchartl: danvet: you're not blocking me :-) I managed to fix my issue in the Xilinx driver. I'll try to embed drm_device in R-Car DU next, and that should be fine too. omapdrm is also one of my targets, but there's no need to convert that yet if it's too difficult with the current implementation
17:23 agd5f_: danvet, I asked in general, not specifically :)
17:24 pinchartl: I would however like to understand why we need to make the final_kfree mandatory, is there any technical reason other than trying to catch drivers that may implement .release() incorrectly by forbidding manual implementation of memory management altogether ?
17:25 danvet: pinchartl, to make conversions easier the managed_release happens after the ->release callback
17:25 danvet: so that I can start pulling away stuff at the end, like drm_dev_fini()
17:25 danvet: which means if the last thing you do is kfree, or whatever_put()
17:25 danvet: the drmm_managed_release is using freed memory
17:26 danvet: *sad trombones*
17:26 danvet: hence why that final kfree needs to be done after the drmm_managed_release call
17:26 Lyude: seanpaul: without simultaneous down replies: https://paste.centos.org/view/51951120 with simultaneous down replies: https://paste.centos.org/view/9b1c2792
17:26 Lyude: looks like some of the NAKs get picked up correctly
17:27 danvet: pinchartl, I tried handling that final kfree with a drmm action
17:27 danvet: but it get tricky, since at that point a lot of stuff kinda doesn't exist anymore
17:27 pinchartl: danvet: ->release() should really be the very very last step. do I understand correctly that we can't do so today given the history of misimplementation of tear-down in most drivers, that led to the code you merged as a reasonable first step ?
17:27 danvet: specifically the debug printing started to rain down
17:28 danvet: I guess we could try again
17:28 danvet: pinchartl, most drivers don't have a ->release
17:28 danvet: the ones that do have, generally got it right
17:28 pinchartl: is there anything that would block moving the ->release() call as the very last step ?
17:28 danvet: you'd break everything?
17:29 danvet: everything that has a correctly implemented ->release callback still
17:29 danvet: I've spent a few weeks reading the unload/release code of a lot of drm drivers
17:29 danvet: it took me a long time to even figure out how to get out of this mess without breaking everything
17:30 danvet: the only way you can do is 2 release callbacks
17:30 danvet: one before drm_manged_releaes, with the current code
17:30 danvet: and a new one after, instead of the kfree(final_kfree)
17:30 danvet: but that's butt ugly imo
17:30 pinchartl: it's not clean
17:31 Lyude: seanpaul: hm, maybe we're not not waking up tx_waitq in drm_dp_mst_handle_down_rep()?
17:31 Lyude: looks like we don't do it if we receive a NAK
17:31 pinchartl: it *could* be a temporary solution if we have to fix drivers gradually in order to get back to a singgle .release(), but it's certainly not a good long-term option
17:31 pinchartl: in other subsystems I've seen, .release() is called as the very last operation
17:32 danvet: for multi-interface drivers it's imo the wrong direction
17:32 pinchartl: and I think that's what we should aim for
17:32 danvet: other subsystems dont have managed release
17:32 pinchartl: from an architecture point of view, calling .release() earlier with more cleanup in the code afterwards isn't a good option
17:32 pinchartl: yes they do
17:32 pinchartl: ok, not managed
17:32 danvet: :-)
17:32 pinchartl: but they have a .release()
17:32 danvet: oh sure everyone has that
17:32 pinchartl: and the driver model is that .release() is called last
17:33 danvet: so imo the clean solution for a multi-interface driver is the following
17:33 pinchartl: not doing so in DRM breaks lots of assumptions
17:33 danvet: a) we have one overall driver structure
17:33 danvet: b) which contains all the interface pieces embedded into it
17:33 danvet: so far so good?
17:33 pinchartl: yes
17:33 seanpaul: Lyude: yeah, looks that way, trying to page all of this back in...
17:34 danvet: c) I think we still agree: the overall interface structure should only be freed once all interfaces have dropped their refcounts to 0
17:34 danvet: correct?
17:34 pinchartl: correct
17:34 Lyude: seanpaul: alright-I'm trying to come up with a patch now fwiw
17:34 danvet: now here's the one that's probably new
17:34 danvet: d) managed release should only happen once _all_ these refcounts go to zero
17:34 seanpaul: Lyude: works for me, thank you
17:35 danvet: reason: you might share driver internal objects and stuff across these interfaces, e.g. when a v4l thing feeds into drm or the other way round, or a connector that has both a drm_connector and an audio device
17:35 pinchartl: it's not a very precise statement, as "managed release" can mean many things
17:35 pinchartl: there are managed resources linked to different objects
17:35 danvet: managed release like drm just gained
17:35 danvet: _not_ devm
17:35 Lyude: seanpaul: also-might try to get out some more patch series for MST today regarding actually adding a printk specifier for MST RADs, getting rid of the multiple (one of which is actually wrong) functions for converting rads to strings, and -might- also try to get out sideband message reply dumping
17:35 pinchartl: devm_clk_get() associated with the physical struct device
17:36 pinchartl: drmm_ associated with drm_device
17:36 danvet: well that's the point
17:36 pinchartl: and possibly others (now or in the future)
17:36 danvet: drmm_ associated with the drm_device is a bug imo for this
17:36 danvet: I want to pull that up
17:36 danvet: so that drmm_ gets released only when the overall driver thing goes to 0
17:36 pinchartl: moving it to a more central place is a good idea, but you're looking at the problem from the wrong end I think
17:36 danvet: otherwise someone is going to have a driver_node containing a drm_encoder and a v4l_node
17:36 danvet: and bugs still happen
17:37 pinchartl: we *first* need to have a sound architecture for *non-managed* release
17:37 pinchartl: wit refcounting
17:37 danvet: we had
17:37 danvet: I accidentally deleted it
17:37 pinchartl: and manual .release() implementations
17:37 pinchartl: and *then*
17:37 danvet: s/accidentally/intentionally/
17:37 pinchartl: no top of that base
17:37 pinchartl: s/no/on/
17:37 pinchartl: we add managed resources
17:37 danvet: because driver writers, on average, are not qualified to write that code themselves
17:37 pinchartl: sure, and that's not a problem
17:38 pinchartl: we implement managed helpers to solve that
17:38 pinchartl: make it easier for drivers
17:38 seanpaul: Lyude: ack, sgtm. looking at this patch and it seems skipping the waitq wake was quite intentional, but it doesn't make sense to present me, not sure what past me was thinking there
17:38 danvet: so my idea is to just move the refcount and managed release one level up
17:38 cmarcelo: jekstrand: a NIR flag to identify intrinsics that write to memory would be /almost/ equivalent to the set of things that don't have CAN_ELIMINATE flag (from a galnce the exception would be barrier like things). wondering if instead of adding a new flag, have a helper function that filter out the barriers and the CAN_ELIMINATEs... (or even have a flag for the barriers too, which could be useful in its
17:38 danvet: from the drm_device to your driver structure
17:38 pinchartl: but it has to be on top of a clean, non-managed base
17:38 cmarcelo: own way)
17:38 danvet: including the refcount
17:38 pinchartl: no, that's not a good idea
17:38 danvet: why?
17:38 pinchartl: you'll run into issues earlier or later
17:38 Lyude: seanpaul: ok - that fixed a bunch of timedout errors, but I think (and am not sure) we might still be having some issues as I'm seeing ACT waits timeout (granted, it might just be the hub being weird)
17:38 danvet: pinchartl, like?
17:38 pinchartl: there will always be corner cases where your nice managed helpers will not be a good fit
17:39 pinchartl: so we need to have a good non-managed base, and add the managed helpers on top
17:39 Lyude: or, huh, actually maybe we're accidentally sending the ALLOCATE_PAYLOAD only after we wait for the ACT
17:39 danvet: I'm willing to look at that case, when it shows up, in drm
17:39 danvet: not earlier
17:39 pinchartl: not merge all the problems into a big mixed bag of spaghetti and try to pull out a top-level solution, hoping it will work under the carper
17:39 pinchartl: carpet
17:39 Lyude:looks deeper
17:39 pinchartl: you have the last word on the DRM side
17:40 pinchartl: but for anything centralized in drivers/base/ or lib/, no, I don't agree
17:40 danvet: if we pass that managed thing to drm, then drm still just works
17:40 pinchartl: first a good non-managed base
17:40 danvet: if someone else breaks it, *shrug*
17:40 pinchartl: with .release() called as the very last operation
17:40 pinchartl: and refcounts
17:40 danvet: uh
17:40 pinchartl: and then build on top of it
17:40 pinchartl: anything else will bite us back sooner or later
17:40 danvet: that's what I'm proposing as my managed solution
17:40 pinchartl: and the more you push the managed helpers to a central location, the more painful it will be when it bites back
17:40 danvet: I think you're not entirely clear on what I mean
17:40 cmarcelo: (oh, there's discard and friends too)
17:41 danvet: probably no surprise, since it'll take a few weeks to type this up I think
17:41 danvet: pinchartl, so maybe step back right now
17:41 danvet: in your low-level, no helpers world
17:41 pinchartl: I'm sorry, but I won't review something that isn't build on top of a clean base :-)
17:42 pinchartl: we have one issue now
17:42 danvet: do we need a kref in your driver structure?
17:42 pinchartl: .release() isn't called as the last operation
17:42 pinchartl: I would like to fix that one first
17:42 pinchartl: and then work on even nicer managed helpers
17:42 pinchartl: if we do it the other way around
17:42 danvet: nah, I want this somewhat correctly designed
17:42 pinchartl: push more powerful managed helpers first
17:42 danvet: I just spend over a month fixing all the false starts we've had in drm
17:42 pinchartl: to work around the fact that .release() isn't called last
17:42 danvet: so again
17:42 danvet: in your low-level world
17:43 pinchartl: then you're hiding it under the carpet, and the more layers you put on top, the more painful it will be
17:43 jekstrand: cmarcelo: It would be similar but very not the same
17:43 danvet: do you think we need a kref, with the ->release callback of that kref settable by the driver?
17:43 pinchartl: I've seen this happening in V4L2
17:43 pinchartl: with the maintainer pushing a rework of more than 80 patches
17:43 danvet: 100 by now
17:43 pinchartl: despite very explicit nacks from core developers
17:43 jekstrand: cmarcelo: Typically (there may be exceptions), it's the distance between internal and external writes
17:43 pinchartl: and now, the subsystem is completely stuck
17:43 danvet: well those I didn't get really ...
17:43 pinchartl: there are lifetime issues of objects all over the place
17:43 pinchartl: and they're unfixable
17:44 danvet: pinchartl, so aside from old wounds that I didn't create
17:44 danvet: can we get to the technical discussion?
17:44 pinchartl: sure :-)
17:44 danvet: my understanding is: you want a ->release settable by the driver
17:44 pinchartl: so here's my question
17:44 cmarcelo: jekstrand: another strange corner, do we want store_deref to have such a flag?
17:44 pinchartl: what's blocking from calling ->release() as the very last operation ?
17:44 danvet: which pretty much means you need a kref somewhere in that driver structure?
17:44 danvet: pinchartl, atm you cant
17:44 jekstrand: cmarcelo: Uh...
17:44 jekstrand: cmarcelo: Probably no
17:44 danvet: no in drm, not without breaking a pile of drivers
17:45 jekstrand: cmarcelo: The assumption being that you should really lower stuff
17:45 jekstrand: cmarcelo: But that makes it tricky for drivers that might want to pass something to LLVM
17:45 pinchartl: I understand you can't atm, because drivers were broken to start with, and the less painful way to fix that was to introduced managed helpers that had to put some cleanup after ->release()
17:45 pinchartl: but that means in my opinion that we're not done with fixing the problem *in* drm and *in* the drivers
17:45 pinchartl: and to answer your question
17:45 jekstrand: cmarcelo: Another option would be to pull the table out of load_store_vectorize and make that sort of I/O intrinsic info a more proper thing in NIR
17:46 pinchartl: the outter structure would need a kref in the general case, yes
17:46 danvet: ok
17:46 danvet: my plan is to give you exactly that
17:46 pinchartl: if drm_device is the only interface embedded in it then we can skip the kref
17:46 danvet: the only thing I'd like to add to that kref is the managed stuff
17:46 pinchartl: I want a .release() that can tear down the outer layer
17:46 danvet: so that drm can not just offload the kref_put/get to that kref
17:46 danvet: but also the managed release
17:47 danvet: and if you provide a kref yourself, I'm happy to let you set the ->release too
17:47 pinchartl: that kref is for the outer layer
17:47 danvet: yup
17:47 danvet: and drm_device is going to use it
17:47 pinchartl: no :-)
17:47 pinchartl: that's not a good design, I'm sorry
17:47 pinchartl: this goes against the design of the driver core
17:47 danvet: putting 2 kref into the same malloc'ed area always ends in tears
17:48 danvet: been there, done that
17:48 pinchartl: no it doesn't
17:48 pinchartl: the kref for drm_device reference-counts drm_device
17:48 danvet: you want drm_device embedded in something else, with its own kref
17:48 pinchartl: once that refcount goes to 0, ->release() is called
17:48 danvet: I need that kref
17:48 pinchartl: and that's it
17:48 pinchartl: the ->release() then performs the cleanup
17:49 danvet: if you want multiple kref, do multiple kmalloc
17:49 pinchartl: in the non-managed world, it frees drm_device if it was allocated manually, or frees the outer layer, or decrements the refcount on the outer layer if drm_device isn't its only user
17:49 pinchartl: then we add the managed helpers on top
17:49 pinchartl: to avoid duplicated incorrect manual implementations in drivers
17:50 pinchartl: and that will likely, down the line, involve helpers in drivers/base/
17:50 danvet: if you do this multi-kref thing
17:50 pinchartl: but I don't want to build on top of something that doesn't comply with the kref-based driver model
17:50 danvet: with staggered release
17:50 danvet: I'm pretty sure drivers will massively botch it up
17:50 danvet: and it will make life complicated
17:51 danvet: since if you a have driver_obj which contains both a drm_obj and a v4l_obj
17:51 danvet: you need to refcount that
17:51 danvet: manually
17:51 danvet: because there's no place where you can put the automatic managed release
17:51 pinchartl: if we let drivers implement it manually, yes, it will be botched up in most cases. that's why we need managed helpers that will, in most cases (hopefully), make .release() empty
17:51 danvet: yeah
17:51 danvet: but if you want to be able to release the above situation automatically
17:51 danvet: you need the managed release at the top level
17:52 danvet: all of it
17:52 pinchartl: not all of it, that's the thing
17:52 pinchartl: ok, let's go conceptual for a minute
17:52 danvet: ok, let's go through it
17:52 pinchartl: here's how I view things
17:52 danvet: you have this driver_foo think, with drm_foo an v4l_foo
17:52 pinchartl: (brainstorming)
17:52 danvet: you put the automatic release in drm
17:52 danvet: wont work, since v4l might still use it
17:52 pinchartl: may I go first ?
17:52 danvet: same if you put it in v4l
17:53 danvet: so you put it into the top level
17:53 danvet: but then you blow up, because by that point the drm_device has disappeared
17:53 danvet: so refcounting and sadness
17:53 danvet: sure, your turn
17:53 pinchartl: we have objects
17:53 pinchartl: that are created and destroyed at runtime
17:53 pinchartl: they have a life time, they have users
17:54 pinchartl: and those objects live in memory
17:54 cmarcelo: jekstrand: table there seems a bit ortogonal to the issue...
17:54 pinchartl: for some of them, the memory is allocated when the object is created
17:54 pinchartl: drm_dev_alloc() is an example
17:54 pinchartl: allocates memory and constructs the object
17:54 cmarcelo: jekstrand: one way is to have call site (or a helper) filter out store_deref (and any other unlowered).
17:55 pinchartl: for some of them, memory is allocated separately, when the structure is embedded in another object
17:55 pinchartl: those objects are connected
17:55 pinchartl: drm_device belongs to a priv object from a driver
17:55 pinchartl: and in turn drm_connector belongs to drm_device
17:56 pinchartl: and something could belong to drm_connector
17:56 pinchartl: (maybe not the best example with drm_connector, let's assume it's a generic example)
17:56 pinchartl: because objects have users, they need to be refcounted
17:56 pinchartl: and when a refcount drops to 0, two things need to happen
17:56 pinchartl: 1. the object must be destroyed
17:57 pinchartl: as nothing uses it anymore at that point, a destroy function is safe, that will destroy the internals of the object
17:57 pinchartl: this doesn't free the memory containing the object
17:57 pinchartl: 2. the memory must be freed
17:57 pinchartl: if the memory was allocated when the object was constructed, we can free it at this point
17:57 pinchartl: otherwise, we can't
17:58 pinchartl: we instead need to tell the object that embeds drm_device that drm_device is destroyed
17:59 pinchartl: that effectively amounts to dropping a reference to the outer object's memory
17:59 jekstrand: cmarcelo: What I was thinking of with the helper was we have something for "is this I/O?" and "give me the mode"
17:59 pinchartl: all of this, in the non-managed world, is done manually
17:59 pinchartl: kref for recounts
17:59 jekstrand: cmarcelo: And then a list of modes which are external
18:00 pinchartl: when a kref drops to 0, the destructor of the object is called
18:00 pinchartl: which destroys the internals
18:01 pinchartl: and calls the .release() provided by the owner of the object, to signal the destruction
18:01 pinchartl: and that's all chained manually
18:01 pinchartl: I think we could automate all this nicely
18:01 pinchartl: not by using a top-level managed controller that would own everything in a flat list
18:02 pinchartl: but by having these relationships chained
18:02 pinchartl: this would mimick the manual architecture based on .release()
18:02 pinchartl: but would be fully automatic
18:03 pinchartl: -- end of brainstorming rembling --
18:03 pinchartl: rambling
18:03 danvet: ok I think I see the disagreement
18:03 danvet: essentially I've come to the conclusion that for locking and refcounting
18:03 danvet: helpers don't work
18:03 danvet: almost everyone will get it wrong if you let them
18:04 danvet: and it's very hard to catch the bugs in review, much less testing
18:04 danvet: so over the past few years I've moved all the kms locking and refcounting into core code
18:04 danvet: so in theory I agree with you
18:04 danvet: build the chaing of refcounted things
18:04 pinchartl: what if we can get those helpers to not require anything on the driver side at release time ?
18:04 danvet: let drivers control everything
18:04 pinchartl: for instance
18:04 danvet: in practice, you're guaranteed to get only broken drivers with this approach
18:04 pinchartl: drm_dev_init() for the non-managed, fully manual case
18:05 pinchartl: drmm_dev_init() for the managed case, and that would establish the right chaining
18:05 pinchartl: the responsibility of doing the chaining right would be put on the drmm_dev_init() function
18:05 danvet: drm_dev_init is about 100x not scary enough
18:05 pinchartl: it's a central, one-time implementation
18:05 pinchartl: and from a driver point of view, nothing else would be needed
18:06 pinchartl: ok, probably not nothing else
18:06 danvet: drm_dev_init_and_yes_ive_spented_2_weeks_discussing_my_driver_with_danvet_line_by_line maybe
18:06 danvet: has the trouble that I'm also fairly incompetent in all things locking and refcounting
18:06 pinchartl: maybe one additional call at probe time to tell that the managed API is used. not more than that
18:06 danvet: as a quick look at git history can tell you
18:06 pinchartl: but I don't dispute that :-)
18:07 pinchartl: my point is that I don't want drivers to have to implement this
18:07 pinchartl: we agree on that
18:07 danvet: also I'll add taht function if you've convinced me you actually do qualify
18:07 danvet: or r-b it at least
18:07 danvet: like seriously, everyone gets refcounting wrong
18:07 pinchartl: but I'd like a model that, behind the scene, is based on chaining, instead of a model that, behind the scene, is based on something else
18:07 danvet: I've seen a few drivers in my review that try to do the split refcount thing you explained, with chaing
18:07 pinchartl: we have a manual driver model based on chaining
18:07 danvet: it's broken
18:08 pinchartl: if you implement a managed model based on something else, it won't match
18:08 pinchartl: sure, if it's a flat list, and your chain tree has a single branch, it will work
18:08 danvet: I want chaining, but if you chain, I want to delay the drm_device release until the topmost thing goes away
18:08 pinchartl: but when the chained tree from the device model is more complex, the two won't match, and it will be very difficult (if not impossible)
18:08 danvet: and if you don't like that, give me a fake topmost thing and lie in the ->release callback of that kref
18:08 danvet: but all the drivers I've seen who tried to do this nesting, got it wrong
18:09 danvet: some cases it's unavoidable (like hotpluggable drm_connector)
18:09 pinchartl: why do you want to delay drm_device release until the topmost thing goes away ?
18:09 danvet: and after 5 years we're still working on fixing it
18:09 danvet: less bugs
18:09 pinchartl: what kind of bugs ?
18:09 danvet: maybe more leaks, but whatever
18:09 danvet: the one I explained above with an attached thing not being released correctly
18:09 pinchartl: once drm_device has 0 users, what's the point in delaying its destruction ?
18:10 danvet: the other thing
18:10 danvet: like for some reason you've decided to embedded 2 interfaces into your overall driver struct
18:10 danvet: there's probably a reason behind that
18:10 danvet: not just random chance
18:10 danvet: good reason is "they're actually the same thing, or so closely related that really splitting would be nasty"
18:10 pinchartl: yes, and that means you can't free the memory, but you can destroy the object
18:11 danvet: so extremely good chances that the v4l side will happily chase pointers that drm owns
18:11 danvet: or the other way round
18:11 danvet: ofc, in theory we could audit the code
18:11 danvet: exhaustively test it with KASAN
18:11 danvet: and refcount everything correctly
18:11 danvet: in practice, none of this will happen
18:12 danvet: and the most reliable "fix" is to just delay the entire cleanup until really everything has hit a refcount of 0
18:12 pinchartl: but that's not what all this is about
18:12 danvet: imo it is
18:13 pinchartl: if you're concerned that something happening on the V4L2 side (and let's remember we're talking about device unbind, either because of physical unplug, sysfs unbind or module removal) will access the DRM side
18:13 danvet: the entire reason I made final_kfree mandatory is to take the refcounting away from driver writers
18:13 pinchartl: then it means that drm_dev_unregister() can't free or destroy *anything*
18:14 danvet: it doesnt'
18:14 danvet: that was step one of fixing this mess
18:15 danvet: started in 2013
18:15 danvet: if I found the right commit
18:16 danvet: ok so small clarification: drm_dev_unregister frees a few internal things, which the driver should never ever have access to
18:16 danvet: s/free/releases the reference it holds/
18:16 pinchartl: there you go :-)
18:16 danvet: but wrt anything that the drm driver sees
18:16 danvet: nothing at all gets freed
18:16 danvet: like nothing
18:16 danvet: we don't even drop a refcount on the drm_device
18:16 danvet: so as far as a driver author is concerned
18:17 danvet: drm_dev_unregister frees nothing, and releases nothing
18:17 pinchartl: then later on the driver calls drm_dev_put()
18:17 danvet: yup
18:18 danvet: generally at the end of its ->remove/disconnect bus driver implementation
18:18 pinchartl: and there you start freeing
18:18 danvet: video_unregister_device does both things together
18:18 pinchartl: what if the V4L2 side then touched the drm_dvice ?
18:18 pinchartl: s/touched/touches/
18:18 danvet: bad
18:18 pinchartl: there you go
18:18 pinchartl: so this can't be helped
18:18 danvet: no I mean in your example this would be bad
18:19 danvet: and I'm 100% sure all drivers will do that in some way
18:19 pinchartl: if drivers go touch internals of an object after they put() it, we're screwed
18:19 danvet: so I want to make sure they cant screw it up
18:19 pinchartl: regardless of how much magic we use
18:19 danvet: we're human writing C
18:19 pinchartl: exactly
18:19 pinchartl: so it has to be at least somehow correct code
18:19 danvet: no
18:19 pinchartl: if it was rust the story may be different
18:19 danvet: it's pretty much guaranteed to be wrong code
18:19 danvet: because it's C code written by humans
18:20 pinchartl: but in this case we can always go and mess up with the internals of objects
18:20 pinchartl: you can mark a field in a structure as private, I can still access it
18:20 pinchartl: no amount of managed helpers will prevent that
18:20 danvet: so the question of "what happens if the v4l device access the drm_device after that final drm_dev_put()" isn't what, but how many
18:20 pinchartl: it needs to be 0. it should simply not touch it
18:21 danvet: but if we delay the cleaup of the drm_device until also the v4l side is fixed
18:21 pinchartl: and there's no technical way to guarantee that
18:21 danvet: then the "how many" won't matter anymore
18:21 danvet: not if humans wrote the code
18:21 pinchartl: so you want to delay *everything* until the very end ?
18:21 danvet: well, at least if it's so closely tied that it's in the same struct
18:22 pinchartl: it doesn't matter if it's the same struct or not
18:22 danvet: I mean I go even further, I plan to delay the entire drm_device release until everything hanging off it is gone
18:22 danvet: drm_minor (both primary and render)
18:22 danvet: drm_leases
18:22 danvet: dma_bufs
18:22 danvet: dma_fences
18:22 danvet: drm_syncobj
18:22 pinchartl: for all it matters, you can drm_dev_alloc() it and access the pointer after it gets freed
18:22 danvet: ...
18:22 danvet: the list is very long
18:22 danvet: oh sure
18:22 danvet: but if you make the code that results in the least amount of typing be the code that results in the least amount of bugs
18:22 danvet: you win
18:22 danvet: on average
18:23 anholt:is with danvet on this
18:23 pinchartl: of course you need to delay the destruction of drm_device until dma_bufs (for instance) are released
18:23 danvet: essentially for this device removal refcounting is sooooooo horribly broken on average
18:23 danvet: and I've just read all the drivers in-tree, I know
18:23 danvet: that I'm totally willing to incur all the leaks necessary to stop it
18:23 danvet: pinchartl, atm we dont
18:23 danvet: in theory this works
18:24 pinchartl: I think you're living in a bubble when you care about a single device, a single subsystem. and you want to apply that reasoning to all cases, even when they're different and don't match
18:24 danvet: in practice
18:24 danvet: not even close
18:24 pinchartl: and that bulldozer approach is what annoys me the most
18:24 danvet: did you see the list of interface?
18:24 pinchartl: it may be ok moving foward fast and disregarding the opinion of others when you're right :-)
18:24 pinchartl: but what happens when you're wrong ?
18:24 danvet: drm is already a world where even a single "device" has an entire list of interfaces hanging off it
18:24 danvet: more than you can count on one hand
18:24 pinchartl: will you give me a guarantee that you'll fix the mess you create, if it happens to be a mess ?
18:25 danvet: we already have the problem you describe
18:25 danvet: I'm willing to pay the price
18:25 danvet: because I know the mess the current solution creates
18:25 danvet: and it's worse
18:25 pinchartl: you have a top-level drm_device and I believe it an rule the world. it can rule a drm driver within drm, sure
18:25 danvet: optimus is a thing
18:25 danvet: so are socs
18:26 danvet: I think by numbers, most systems have 2 drm drivers loaded by now
18:26 danvet: or more
18:26 pinchartl: what if I point you in the future to a problem I face, and fixing it correctly will require redesigning this all completely, and modifying *every* single driver that has been converting to it ? will you do the work yourself to help me ?
18:26 danvet: I'm doing it right now
18:26 pinchartl: no you're not
18:26 pinchartl: you're creating a mess :-)
18:26 pinchartl: or what I think will be a mess in the future
18:26 danvet: I'm actually touching every single drm driver's unload code
18:27 pinchartl: our opinions clearly differ on this topic
18:27 danvet: that's how much I'm fed up with status quo
18:27 pinchartl: I'm talking about touching every other driver in every other subsystem that will have been converted to the core helpers, if they end up being wrong
18:27 danvet: pinchartl, anyway, if you show up with your driver that needs this
18:27 danvet: I'm happy to add a drm_dev_init_for_pinchartl_only_who_writes_perfect_code
18:27 danvet: or slightly shorter
18:28 pinchartl: no, that's not what I'll want
18:28 pinchartl: I'll want a clean global solution, not a special case
18:28 danvet: everyone else, me included
18:28 pinchartl: *especially* if that clean solution is what I'm proposing right now
18:28 danvet: gets the kindergarden version with safety built mandatory built-in
18:29 danvet: I can also give you a knob to overwrite the ->release function of the drm_device kref
18:29 pinchartl: well, I've already asked for a non-mandatory final_kfree, and have been told we can't :-(
18:29 danvet: with about 6 udnerscores in front or so ...
18:29 danvet: none of the current drivers need it
18:29 danvet: last time I left code in that was unused because might be useful for driver
18:29 pinchartl: again, I'm annoyed by this bulldozer approach that disregards any alternative proposal without evaluating it :-(
18:29 danvet: it broke faster than it was used
18:30 danvet: I have
18:30 danvet: I see it
18:30 pinchartl: it's quite demotivating when you care about a subsystem and get told your opinions and concerns are not worth it
18:30 cmarcelo: jekstrand: is "store_deref" I/O in that sense?
18:30 danvet: pinchartl, I'm pretty good at shooting holes into locking and refcount schemes
18:30 jekstrand: cmarcelo: Yes
18:30 danvet: but I can't review all the drivers for those
18:31 danvet: so there's 2 options: we stop merging drivers until they're perfect
18:31 danvet: or I fix this mess
18:31 pinchartl: I still don't understand why you don't want to consider a solution based on chaining that wouldn't require manual implementation in drivers
18:32 pinchartl: and if that's not tried, I won't side with you trying to convince gregkh to merge something in drivers/base/
18:32 pinchartl: you need to first convince me your solution is the right one
18:32 danvet: because the chainign just means more bugs
18:32 danvet: without actually fixing stuff
18:32 pinchartl: you still haven't showed me where those bugs would be
18:32 danvet: I mean I can give you that option if you want ti
18:32 pinchartl: it's chaining without intervention of drivers
18:32 danvet: but I dont want to review tons of drivers for this
18:33 danvet: pick any drm driver not written by you
18:33 danvet: (and maybe not i915)
18:33 danvet: and they're full of them
18:33 danvet: and in theory it should be possible with the current infrastructure to not have these bugs
18:34 pinchartl: they're full of them because they don't use any managed infrastructure
18:34 danvet: nah
18:34 danvet: some are broken because the use the wrong managed infrastructure
18:34 pinchartl: same thing
18:34 danvet: some are broken because they copypasted their cleanup from another driver that misplaced a drm_dev_unregister
18:34 danvet: or a drm_mode_config_cleanup
18:34 danvet: or a drm_dev_put
18:34 pinchartl: I'm talking about creating a managed infrastructure with the right life time and whose *internal* implementation would be based on chaning
18:34 pinchartl: chaining
18:35 danvet: some have a hard time understanding what refcounts, and kfree before the put the thing that still needs it
18:35 danvet: or kfree and _put
18:35 danvet: on the same thing
18:35 pinchartl: those examples are about drivers that don't use a managed API
18:35 danvet: so every driver does something either too early, or too late, or in the wrong order
18:35 danvet: or twice or not enough
18:35 danvet: it's all total horros
18:36 pinchartl: calling kfree() at the wrong time means you call kfree(), and thus don't use managed resource handling
18:36 danvet: nah that's not my point
18:36 danvet: they're all drivers with multiple userspace objects
18:36 danvet: the _only_ drm drivers which are correct are
18:36 danvet: i915
18:36 danvet: drm_simple_display_pipe (only one thing)
18:36 danvet: a few written by you, but I'm not so sure about those
18:36 danvet: everyone else with multiple things gets it wrong
18:36 pinchartl: lol :-)
18:37 danvet: as in I've found the bug already
18:37 pinchartl: no, there are lifetime issues in rcar-du, I'm aware of them (or at least some of them)
18:37 danvet: hence my position that if you have multiple userspace facing things in one driver
18:37 danvet: the only way out is to release it all at the end
18:37 pinchartl: if you embed the interfaves in a top-level structure
18:37 danvet: hence why I don't plan to add managed resources to anything else
18:38 danvet: but just put it all onto the drm_device
18:38 pinchartl: you *have* to *free* that memory at the end, absolutely
18:38 pinchartl: but it doesn't mean you have to destroy every single object at the end
18:38 danvet: nah, even all the ones that have separate kmalloc objects
18:38 danvet: oh i915 is also wrong, just forgot, we dont refcount dma_buf/fence
18:39 danvet: pinchartl, yup in theory you could write perfect code
18:39 danvet: and destroy as needed
18:39 danvet: and clean up as needed
18:39 danvet: and not have a single bug
18:39 danvet: not with humans, not in C
18:39 pinchartl: you've telling me why manual chained destruction is almost always broken
18:40 pinchartl: which I've never disputed
18:40 pinchartl: but you still haven't answered my qeustion
18:40 danvet: I'm saying taht even automatic chaining is broken
18:40 pinchartl: I'm talking about creating a managed infrastructure with the right life time and whose *internal* implementation would be based on chaining
18:40 danvet: unless you make the container really big
18:40 danvet: the bigger, the better the chances that you dont have a bug
18:40 danvet: ofc can't make the entire kernel work like this
18:40 danvet: but an entire driver .ko, sure why not
18:41 danvet: yeah I'm disputing the usefulness of that in practice
18:41 pinchartl: you gave me examples of why it's broken, and they're all about using devm_kzalloc(physical device) incorrectly, or calling kfree() at the wrong time (and thus calling kfree), or other code structures that won't exist in a driver using the managed API
18:41 danvet: since in practice I want to throw your nice chain away
18:41 danvet: and put everything onto the managed release of the top level
18:42 danvet: komeda tries to do the split thing
18:42 danvet: gets it totally wrong
18:42 pinchartl: you're again using an example of a manual implementation to explain why the concept is wrong
18:42 danvet: dma_buf/fence in i915 is another split thing example that's going to blow up
18:43 pinchartl: given a proper managed API
18:43 pinchartl: used correctly by drivers
18:43 danvet: given correct code nothing is broken
18:43 pinchartl: I want to understand why basing the *internal* release on chaining is wrong
18:43 danvet: real world code sucks
18:44 pinchartl: given broken code everything is broken
18:44 danvet: and I'm not clever enough to find all the bugs
18:44 danvet: yup
18:44 danvet: that's my statement
18:44 karolherbst: in the end it only matters how easy an API is to use, not how the situation in a perfect world would look like with perfect coders ;)
18:44 pinchartl: the only thing you can do is create APIs that make it more difficult to use them incorrectly
18:44 danvet: so given that we have broken code
18:44 pinchartl: less functions to call = less bugs
18:44 danvet: and there's no way around the bugs
18:44 anholt: danvet: I think you're saying that even the clever internal release code we only have to review "once" would end up full of bugs too, because every other instance of release code has been full of bugs.
18:44 pinchartl: less parameters to functions = less bugs
18:44 pinchartl: and so on
18:44 danvet: what can we do to make the bugs blow up less often
18:44 danvet: ?
18:45 danvet: my answer is for this refcounting: delay so long that hopefully really everything that might contain buggy code is guaranteed to no longer be accessible"
18:45 danvet: anholt, nah the clever internal release code will be right
18:45 danvet: I think with unit tests we can make sure of that
18:45 danvet: but drivers use it
18:45 danvet: anholt, I think maybe a mesa example is the cleanup context for the compiler
18:46 danvet: except you think only cleaning up once everything is linked is too wasteful (or whenever mesa does that, I have no idea)
18:46 danvet: so you want to clean up at every stage
18:46 danvet: and try to make sure that every stage sets the cleanup stage correctly
18:46 danvet: with a mix of objects that leak from one stage to the next
18:46 danvet: in theory you could chain stuff like that
18:46 danvet: in practice it'll never work
18:47 pinchartl: I've exhausted the amount of time I can spend on this today. if you want my support, you'll have to show me why an *internal* implementation that follows the architecture of the device model won't work
18:47 danvet: and the savings in temporarily leaked memory is neglectable
18:47 danvet: pinchartl, it will
18:47 pinchartl: *won't work equally well
18:48 pinchartl: (or better ;-))
18:48 danvet: that's what I meant
18:48 danvet: you can design something really nice that does a nice cascading chain of auto-cleanup
18:48 danvet: real pretty
18:48 danvet: I want something that for a given driver (and I don't care how many interface that thing exposed)
18:49 danvet: just delays the entire cleanup until everything is gone
18:49 danvet: for that driver/device instance
18:49 pinchartl: you'll have to prove to me that this is necessary to achieve the goal
18:49 danvet: some judgement on what exactly "one instance" means
18:49 danvet: it's not
18:49 danvet: assuming infinte danvet's to review everyone's refcounting
18:49 danvet: we can do without this
18:49 pinchartl: please, let's be serious, that's bad faith arguments
18:50 pinchartl: you know what I'm asking
18:50 danvet: you need about 5 danvets per review, that's what it takes on average to spot most of the bugs
18:50 pinchartl: and you know we share the same goal
18:50 danvet: I don't think we exactly agree
18:50 pinchartl: make it easy to use for drivers, as bug-free as possible, limit the impact of bugs, make it easy to reveiw, all that
18:50 danvet: I'm willing to go with a slightly shoddy architecture that gives me better real world results
18:51 pinchartl: I want you to show me that this shoddy architecture is necessary to give better real world rsults
18:51 danvet: pinchartl, I mean if you want this on v4l I think you can do your chain of cleanups there
18:51 danvet: for drm I simply want to attach all of drm to the top level
18:51 pinchartl: v4l is dead
18:51 pinchartl: or dying
18:51 danvet: whatever else that's cool now
18:52 danvet: so to be serious, the devm_drm_dev_init would take the managed kref of the top level thing
18:52 danvet: and would put all the managed stuff onto that
18:52 danvet: and nothing onto drm_device itself
18:52 danvet: you can chain other stuff on top if you feel like
18:52 danvet: outside of drm_device
18:52 pinchartl: that still doesn't explain me why this is necessary to give better real world results
18:53 pinchartl: and if you want to push such an object to drivers/base/
18:53 pinchartl: you'll have to explain why it's necessary
18:53 danvet: but looking at state of drm lifetime management, that aspiration of cleaner code will just result in lots more bugs
18:53 pinchartl: not just to me
18:53 pinchartl: to Greg, and others
18:53 pinchartl: that's all I'm asking, I want to understand why it's a necessity
18:53 danvet: it's not necessary
18:53 danvet: if we have infinite reviewers to catch refcount issues
18:54 pinchartl: seriously... do I express myself so badly ?
18:54 danvet: I think I understand your question
18:54 pinchartl: "20:51 pinchartl: I want you to show me that this shoddy architecture is necessary to give better *real world* rsults"
18:54 pinchartl: (emphasis added)
18:54 danvet: I just don't see how I can convince you
18:54 danvet: I've fixed an enormous amount of lifetim/refcount bugs in drm
18:54 pinchartl: so please don't use the infinite reviewers card
18:54 danvet: over the past 10 years or so
18:55 danvet: and mostly I fixed those bugs by moving refcounting to core code
18:55 pinchartl: I'm asking for an explanation regarding the *real world* necessity to achieve our *common goals* of simpler drivers and better reliability (less bugs and, for bugs that can't be avoided, a less severe impact)
18:56 Lyude: seanpaul: figured out the ACT timeouts :), it looks like the hub is literally just very slow with updating it's payload table sometimes, I have a feeling simultaneous down replies (that contain the aux dpcd NAKs from earlier) just ends up slowing things down a bit on the hub as it seems like if we use a really large timeout value (>1sec) the ACT eventually goes through
18:56 danvet: pinchartl, my experience of fixing these kinds of bugs in drm for the past few years
18:56 danvet: that's all
18:56 pinchartl: and of course we'll move refcounting away from drivers, to core code, to higher level core code too
18:56 karolherbst: Lyude: ohh, you just reminded me, I wanted to look into this OLED backlight thing with nouveau again.. but your script didn't help at all
18:56 karolherbst: I can turn off the backlight, but I can't turn it on again
18:56 danvet: I can't prove that if I would have tried to fix this all in drivers it would have been worse
18:56 pinchartl: this isn't about keeping refcounting manually in drivers, I don't want that either
18:56 Lyude: karolherbst: "turn" off as in set to 0?
18:57 danvet: just that I'd probably still be busy fixing drm_framebuffer refcount bugs
18:57 karolherbst: Lyude: well.. if I disable the backlight bit, the display turns black
18:57 karolherbst: if I set it again, nothing changes
18:57 seanpaul: Lyude: well that's interesting, do you think the NAKs are caused by the hub getting overwhelmed as well?
18:57 pinchartl: what I'm asking is why you think that the core code that handles the refcounting for drivers can't work if it's internally based on chaining
18:57 danvet: which is the first thing I've worked on I think
18:57 danvet: pinchartl, it can work
18:57 Lyude: seanpaul: I don't think so, the NAKs happen either way so I don't think they're related
18:57 danvet: it's just, for drm, I want to attach this stuff to the top level in the chain
18:58 danvet: atm the top level is drm_device for all drm drivers merged into upstream
18:58 danvet: but if you give me a new one, I'm happy to just move a level up
18:58 pinchartl: I know you *want* that, but why is that the best option (real world, common goals, etc...)
18:58 Lyude: seanpaul: actually-you might be right about that because I realized I think I still saw occasional ACT timeouts without the simultaneous down reploies
18:58 danvet: well ok komeda plays games and gets it wrong
18:58 danvet: pinchartl, my time
18:58 danvet: I'm not willing to fix these bugs any other way
18:59 danvet: so I guess I should stop doing that
18:59 Lyude: either way though - I've seen other situations without NAKs where ACTs seem to timeout, I'd also imagine that the more branches a payload allocation has to travel down the longer it's going to take to get processed
18:59 danvet: it's been broken for 10 years
18:59 seanpaul: Lyude: yeah, seems like a longer timeout is warranted
18:59 danvet: I dont' think many will care if we keep it broken for another 10 years
18:59 Lyude: plus - the MST spec (as of 2.0) surprisingly does not seem to specify any sort of ACT timeout
18:59 pinchartl: I think you've done a great job in that area, and certainly not a very pleasant one
19:00 pinchartl: it has high value
19:00 danvet: there's no way I'm going to play whack-a-mole with this, it's too little value and too many bugs
19:00 Lyude: I'm taking a guess and assuming that the current values we're using for udelay() are just the SST link training delay values copied over, not sure where the 30 retries came from though
19:00 danvet: in reality all we need is to be able to handle hotunplug of usb devices correctly
19:00 danvet: for the 2 usb drivers we have, we do that
19:00 danvet: everything else is nice to have
19:00 pinchartl: and I understand you can be frustrated by my questioning it, it can feel ungrateful
19:01 seanpaul: Lyude: from "30 ought to be more retries than anyone will ever need" probably
19:01 karolherbst: danvet: GPUs as well
19:01 danvet: maybe (and that's the entire justification for this) we need hotunplug for pci drivers for pass-through virtualization
19:01 danvet: karolherbst, totally broken
19:01 karolherbst: I know
19:01 danvet: don't hotunplug your gpu behind a thunderbolt
19:01 karolherbst: but we have to be able to handle that is all I meant ;)
19:01 danvet: that's also the other vaguely justified use-case
19:01 Lyude: seanpaul: yeah, probably
19:01 pinchartl: but please also keep in mind that, from my side, when I question the technical implementation, being told that the question won't be answered because there's no time and we have to move forward, is quite demotivating too
19:01 karolherbst: danvet: it works on macos and windows
19:01 karolherbst: and users expect it to work
19:02 Lyude: anyway, I think we'll go with a surprisingly high timeout value of 4 seconds
19:02 danvet: pinchartl, I mean I could spend half this effort and fix up dma_fence/buf in i915 and call it a day
19:02 seanpaul: Lyude: sounds good to me
19:02 Lyude: ...and also, use usleep instead :)
19:02 karolherbst: so ignoring GPU hotunplug is just calling for trouble long term
19:02 pinchartl: danvet: I know you're doing what you think is the best
19:02 pinchartl: and I know that providing answers can be time consuming
19:02 seanpaul: Lyude: a 4 second wait ought to be more time than any hub will ever need
19:02 danvet: pinchartl, I dont mean to say this to play drama, it's just I really don't know how to explain this any other way
19:02 pinchartl: and if the questions sound stupid (and they may be), it's frustrating too
19:02 Lyude: seanpaul: it should be, but this hub takes over 1 second
19:03 Lyude: I imagine the timeouts increase exponentially the higher the lct is
19:03 danvet: aside from past 10 years of fixing drm lifetime bugs tells me fixing this without this somewhat drastic approach
19:03 danvet: it's not going to be worth the effort
19:03 seanpaul: yeah, i mostly wanted to get on the record with that so we can recall it when it proves to be false :-)
19:03 Lyude: :P
19:03 pinchartl: but from a community point of view, if we disregard question because we want to move at high speed, we then alienate community members too
19:03 danvet: pinchartl, I can't shovel those memories from my brain into yours, as much as you can't shovel yours into mine
19:04 pinchartl: that I fully agree with you
19:04 pinchartl: and it's probably all for the best
19:04 danvet: and I think nothing short really works as convincing evidence one way or another
19:04 pinchartl: in both directions :-)
19:04 pinchartl: can we sleep over this ?
19:04 danvet: I think I get way you think my "move all release to the top level and take it away from drivers" is horrible
19:04 pinchartl: taking it away from drivers isn't horrible
19:04 danvet: I just think that that bit of theoretical soundness in design isn't worth the price in practical bugs
19:04 pinchartl: it's actually good :-)
19:05 pinchartl: even though in some cases manual override is needed, but that's a minority
19:05 pinchartl: and I want to convert the drm drivers I work on to the new model
19:05 danvet: I think with some swearing we could convert the drm_add_final_kfree to a normal drmm action
19:05 danvet: and then you could create a function which removes/replaces it with something else
19:06 danvet: that would essentially give you your ->release
19:06 pinchartl: this discussion is really about how it's implemented behind the scene, taking it away from drivers is a given
19:06 danvet: but I really don't think that's worth it
19:06 danvet: nah, I think not
19:06 danvet: the "chained or everything on the top level" has a clear impact on drivers
19:06 pinchartl: my concern is that a managed model whose *behind the scene* implementation is not aligned with the device model will bite us back very painfully
19:07 danvet: the latter leaks resources for longer than strictly necessary
19:07 pinchartl: what if we can keep chaining behind the scene without impacting drivers ?
19:07 danvet: the former trusts driver authors more to write correct code
19:07 danvet: if it doesn't actually have an impact
19:07 danvet: why do you chain?
19:08 pinchartl: to align with the device model that use chaining too. to have the two options based on the same architecture
19:08 pinchartl: so evolutions in one of the two could be replicated in the other one if needed
19:08 danvet: what do you mean with chaining in the device model?
19:08 danvet: nesting of devres groups?
19:09 danvet: something else?
19:09 pinchartl: I'm not sure how it could be technically implemented. I think devres group is likely not the right level, this should become a krefres really
19:09 danvet: well that's what I tried to suggest
19:09 danvet: krefres
19:10 danvet: and then you pass that krefres to drm_dev_init
19:10 pinchartl: (the name looks horrible, but the concept is good)
19:10 pinchartl: your approach is that there will be one kref somewhere at the top level that will rule them all
19:10 danvet: so drm_dev_init(struct drm_device *dev, const struct drm_driver *driver, struct device *hwdev, struct krefres *kref)
19:10 danvet: yup
19:11 danvet: and drm_dev_alloc simply picks a krefres that's embedded into a drm_device as a fallback
19:11 pinchartl: while I'd like to see if we can have parent-child kref chaining so that destruction of a kref would release a reference to the parent kref
19:11 danvet: plus a pointer to make sure we use the right one
19:11 danvet: well happy for you to have that
19:11 danvet: just not in driver code I might end up having to review :-)
19:11 pinchartl: of course !
19:12 danvet: I mean if you want you could even allocate a krefres for drm_device only
19:12 danvet: heck you could pass the one that's embedded
19:12 pinchartl: that shouldn't be in driver code (at least in normal circumstances, there can always be rare exceptions)
19:12 danvet: it's just I'll totally ignore your driver in that case for lifetime bug reviews
19:13 danvet: or you might get a patch to change it to the top level like once per year
19:13 pinchartl: can we sleep over this ? as I said I've exceeded the time budget for today
19:13 danvet: but essentially the drm_dev_init() with krefres is what I came up this morning to make you happy
19:13 danvet: and if you totally insist you can chain them however you want
19:13 danvet: I just don't want to hear anything about that :-)
19:14 danvet: and if you hand in a krefres, you also get to set it's ->release callback ofc
19:15 danvet: and even more glorious, you could build devres on top of krefres
19:15 pinchartl: of course, we shouldn't have bot krefres and devres
19:15 danvet: probably just the lower-level functions, since you want to have a driver interface that uses the right higher-level types
19:15 pinchartl: repeating my request to sleep over this
19:15 danvet: I think we should have both
19:15 danvet: as interfaces
19:15 danvet: maybe not as implementations
19:16 pinchartl: (apparently Daniel's IRC client filters out the word sleep :-))
19:17 danvet: it's not yet sleeepy time over here
19:17 danvet: karolherbst, btw hardest part of gpu hotunplug is going to be mmap
19:17 danvet: since absolutely no userspace will survive sigbus
19:18 pinchartl: neither is it here, but my brain is shutting down, I need to do some coding to fuel it for further discussions tomorrow :-)
19:18 danvet: so we get to invent fake memory, without racing
19:18 danvet: that looks sufficiently like the real thing until userspace got around to processing the udev event
19:18 danvet: well I delayed my home work out
19:19 danvet: weight-lifting with the funiture, I guess it's mostly entertaining to watch
19:20 pinchartl: are you broadcasting it online ? :-D
19:23 danvet: I think already delivered plenty enough entertainment for the entire channel today
19:25 emersion:'s popcorn bag is now empty
19:32 sravn:recall a story I heard about leaks in critical control code. It was used to control a rocket. They added twice the amount of memory and accepted the leak. When the rocket exploded the leak did not matter
19:37 vsyrjala:goes to add a big tank of some rocket fuel to i915
19:41 danvet: sravn, this holds in general, most gc algorithms waste a bit of memory
19:42 danvet: because perfectly tracking the working set at all times is just too expensive
19:42 danvet: both for cpus, and also for humans :-)
19:55 karolherbst: danvet: it's fine if userspace crashes
19:55 karolherbst: the kernel shouldn't though
19:55 karolherbst: applications using the GPU crash unless they support reobustness, that's something we just have to keep living with
19:57 karolherbst: mhhh.. wondering how well we will be able to deal with displays connected on the eGPU though
19:57 karolherbst: but that doesn't sound like a big issue to fix, might just need a new API and userspace fixed
19:57 karolherbst: but yeah, usecase would be users with laptops using eGPUs as docking stations
19:58 karolherbst: and I agree the hotunplugging is a convience feature, but... well, other OS implement this just fine so why shouldn't we
19:59 danvet: karolherbst, robustness isn't good enough
19:59 danvet: not with current mesa drivers
19:59 danvet: and there's not way we're going to be able to fix them, it would be too expensive in performance
19:59 karolherbst: well, mapped VRAM will always cause issues I guess
20:00 danvet: you want mmaped stuff
20:00 danvet: so the trick is that the kernel punches that out
20:00 danvet: and gives you something that's close enough
20:00 danvet: it'll be garbage data
20:00 danvet: but that's something we can do under robustness
20:01 danvet: "your gpu hung and made an impressive mess" is in-spec for robustness
20:01 karolherbst: yeah
20:01 danvet: suddenly everything SIGBUSes is not in-spec for robustness
20:01 danvet: it's pretty clear that everything should continue as-if the gpu is still alive
20:01 danvet: until you check
20:01 karolherbst: mhh yeah...
20:02 karolherbst: I really don't know what to do with compositors here
20:02 karolherbst: tell them to not map VRAM?... that's harsh but... well
20:02 emersion: compositors don't map VRAM?
20:02 danvet: karolherbst, they need robustness
20:02 danvet: essentially just throw everything away
20:03 karolherbst: sure, but how does maped VRAM and robustness play together
20:03 danvet: and wait until the client gives you a new buffer
20:03 danvet: it's supposed to not SIGBUS
20:03 danvet: hence why the kernel needs to punch the existing vram ptes out
20:03 danvet: and put something else there
20:03 karolherbst: right
20:03 danvet: we can do that, it's just not typed up
20:03 karolherbst: shouldn't hurt to bad I guess
20:03 danvet: then you'll have a few moments where every mmap is pure garbage
20:03 danvet: and every ioctl goes into the void
20:04 danvet: but the GL stack is still alive (in userspace)
20:04 karolherbst: yep
20:04 karolherbst: but I guess that's fine
20:04 emersion: OpenBSD has a map flag that returns zeroes when out-of-bounds
20:04 karolherbst: ahh
20:04 danvet: until either the compositor has processed the uevent and cleaned up the mess
20:04 danvet: or processed the robustness status, and cleaned up the mess
20:04 danvet: emersion, yeah something like that
20:04 danvet: except we also need to allow writing
20:04 karolherbst: anyway, right now that's like step 5 anyway :p
20:04 danvet: and need to make sure nothing leaks between processes
20:05 danvet: karolherbst, probably more
20:05 karolherbst: well the first big step is to let the kernel module not crash the kernel ;)
20:05 karolherbst: but I agree, it's a huge step
20:05 danvet: ambitious
20:05 karolherbst: yep
20:05 karolherbst: but it's fine
20:05 karolherbst: mmio reads just return -1
20:05 danvet: atm I'm only aiming for "not crash dereferencing something something drm_device"
20:05 danvet: everything else still goes boom
20:05 karolherbst: we just need to catch the device status early enough
20:06 karolherbst: well
20:06 karolherbst: the kernel tells us
20:06 karolherbst: the device gets removed actually
20:06 karolherbst: with a broken link status
20:06 danvet: karolherbst, the mmap reservations get cleared on hotunplug
20:06 danvet: so if you keep the ptes
20:06 danvet: and replug something
20:06 danvet: you now have a security issue
20:06 karolherbst: ohh sure
20:06 danvet: because suddenly all your mmap reads don't return -1 anymore
20:06 karolherbst: what I meant is just, that the kernel activly notifies the module that the device is gone which we can use here
20:07 karolherbst: we just need to handle that correctly
20:07 danvet: yeah, core kernel gives us all the tools already
20:07 karolherbst: and well...
20:07 karolherbst: guard against crappy mmio reads
20:07 danvet: well, the pte punching is going to be fun I think
20:07 danvet: yeah
20:07 karolherbst: some_arr[mmio_read(..)] etc...
20:07 karolherbst: I am sure there are tons of those things
20:08 danvet: ah that part we have
20:08 danvet: drm_dev_enter/exit
20:08 karolherbst: anyway, handling all of this better also makes the modules more stable in non eGPU systems, so I kind of like the idea of handling it rather sooner than later
20:08 danvet: it's supposed to be dirt cheap enough that you can sprinkle it around everywhere in-kernel
20:09 karolherbst: the mmap stuff?
20:09 danvet: drm_dev_enter/exit
20:09 danvet: that's for in-kernel
20:09 danvet: for userspace we simply need to replace the ptes
20:09 karolherbst: yeah
20:09 danvet: in-kernel we can't do that with linux
20:09 karolherbst: the biggest issue is just that modules assume their mmio reads return sane values, even in cleanup code... this is gonna be fun
20:10 danvet: karolherbst, that's actually pretty big trouble
20:10 karolherbst: jep
20:10 karolherbst: but
20:10 karolherbst: actually
20:10 danvet: because the difference between drm_dev_unplug and drm_dev_unregister is only whether drm_dev_enter/exit still allows you to access stuff
20:10 karolherbst: on some devices nouveau survives a device hotunplug
20:10 karolherbst: that's totally by chance though
20:11 danvet: atm usb uses drm_dev_unplug and pci drm_dev_unregister
20:11 danvet: also platform drivers for soc
20:11 danvet: because driver authors like it if their device is shut down upon driver remove
20:11 danvet: but if you'd use drm_dev_unplug that access to shut everything down would be blocked
20:11 danvet: so we have a bit a user intention thing going on there
20:12 danvet: maybe there's some bit somewhere in the driver core that tells whether it's just a driver remove
20:12 danvet: or a real hotunplug
20:12 danvet: and we could dtrt, always
20:12 karolherbst: yeah... you can eg check the pci link status
20:12 jekstrand: How long is the meson-windows-vs2019 CI job supposed to take? It's taking 30 minutes on this one MTR
20:12 danvet: karolherbst, there's a difference in struct device status
20:12 danvet: hotunplug = device_del()
20:13 danvet: driver remove != device_del()
20:13 danvet: but ofc that's totally racy :-)
20:13 danvet: so gregkh probably doesn't allow drivers to check this
20:13 karolherbst: danvet: pci_channel_offline
20:14 jekstrand: daniels: What do you have that VS 2019 builder running on? An atom chip? Some ARM surface device? It takes as many minutes to build mesa as my laptop does seconds.
20:15 karolherbst: dunno if this function is reliable enough, but at least from a pci_driver.remove perspective it worked in my experiments
20:15 danvet: karolherbst, well in drm_dev_unregister
20:15 anholt: jekstrand: looks like it's regularly taking 30 minutes right now
20:15 danvet: at the device level
20:15 anholt: (looking at previous passed jobs)
20:17 anholt: this sounds like something going wrong.
20:18 danvet: karolherbst, there's struct device->p->dead
20:18 danvet: but it's hidden very well
20:18 karolherbst: mhhh
20:19 karolherbst: I rally don't know the drm API though but I always got the feeling that in nouveau we still use a lot of legacy stuff.. dunno though
20:19 karolherbst: never cared much
20:19 danvet: we killed that
20:19 danvet: too many root holes
20:19 danvet: or do you mean in the kernel driver itself?
20:20 karolherbst: yeah, in the kernel itselg
20:20 anholt: jekstrand: the initial merge and the re-enabling of the msvc build 4 days ago passed in 6 minutes, so, yeah, something seems wrong.
20:21 HdkR: Maybe it was doing an update in the background :P
20:22 karolherbst: danvet: can't find that device->p->dead thing, what's p supposed to be here? the pci_dev or something else?
20:22 danvet: karolherbst, I think it's pretty nice, for a this old driver
20:22 danvet: karolherbst, look for kill_device
20:22 danvet: dev is the struct device
20:22 danvet: pdev->dev
20:22 karolherbst: ahh, yeah
20:23 karolherbst: mhh
20:23 karolherbst: but mhhh
20:23 danvet: you could type a patch to add device_is_dead and get murdered by gregkh or so
20:23 karolherbst: yeah.. no
20:23 karolherbst: this is always called from device_del
20:23 karolherbst: I don't think it helps
20:24 karolherbst: I could check that at some point though, but mhhh
20:24 karolherbst: yeah dunno...
20:25 karolherbst: it's probably equal to echo 1 > remove on the sysfs file
20:26 karolherbst: yep
20:26 karolherbst: sysfs goes through pci_stop_and_remove_bus_device which calls into device_del which calls into kill_device
20:27 karolherbst: (ignoring some steps in between)
20:27 karolherbst: I think pci_channel_offline is indeed our best bet here as this might also come in handy in other cases
20:27 karolherbst: maybe the channel just dies for... reasons
20:28 karolherbst: like devices not coming back in runtime_resume or so
20:33 jekstrand: HdkR: Or virus scanning all the .obj files :)
20:35 HdkR: ick
20:41 jekstrand: HdkR: I heard a story recently of a tech company (that makes GPUs!) whose IT department actually did that to everyone's dev machines.
20:44 HdkR: jekstrand: I've had experience with that yes. Wasn't great :)
20:45 Lyude: agd5f_ / hwentlan (cc mripard ): mind if I merge all but patch #3 of https://patchwork.freedesktop.org/series/75335/ (since trying to merge them on separate branches will probably cause conflicts) onto drm-misc-next? (and then have agd5f_ / hwentlan push patch #3 to amd's repo, since that shouldn't cause any conflicts)
20:48 agd5f_: Lyude, feel free to take them all through drm-misc
20:48 Lyude: agd5f_: ah cool, will go ahead and do that then thanks!
21:36 cmarcelo: jekstrand: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4442
21:37 cmarcelo: jekstrand: this is how the flag approach would look like
22:11 daniels: jekstrand: we're being smashed to bits by GStreamer builds which occupy far more residency than we have space for, because the GSt Windows runner is currently down and waiting to be reactivated
22:11 daniels: jekstrand, anholt: in normal times it takes 6min end-to-end which is in line with normal builds
22:12 daniels: (ok, 6m20s ...)
22:16 daniels: (the GSt Windows runner should be back up next week, and long-term we're expecting to have a better answer than one Collabora-sponsored EC2 instance)
22:17 daniels: jekstrand: also, funny story - my LLVM-in-Docker builds suddenly sped up by an order of magnitude when I went to Windows Defender settings ...
23:13 jekstrand: daniels: Scanning too many things were we?
23:24 daniels: \_o_/