
[PoC] Live Text image analysis on macOS #16063


Draft
wants to merge 4 commits into
base: master
from


rcombs
Contributor

@rcombs rcombs commented Mar 16, 2025

This is a draft for discussion; this feature is not mergeable in its current state.

This adds support for the Live Text API on macOS, which allows the user to select text within a video.

In this demo, I select some Japanese text from a video file:

resized.mov

The copied text is:

トゲナシトゲアリ「雑踏、僕らの街」
Produced by 玉井健二
作詞・作曲:大濱健悟
編曲:玉井 健二,大濱 健悟
Product of agehasprings

This is largely correct, modulo spacing.

This will need a number of changes before it's mergeable, some of which I'd like to get some discussion started on:

  • The API calls required to capture the image and convert it to a form usable by the system APIs are largely hacked in haphazardly right now; I'm not sure what the best solution for some of this is
  • Currently, I'm capturing a window screenshot (so the OSD is included), but I'm not notified when the OSD updates, so the capture easily becomes outdated; the simplest solution might be not to support the OSD at all (which would mean taking a subtitles screenshot and configuring the overlay view to be aware of the video's margins within the window)
  • This will presumably want to be gated behind a setting
  • Long-term, libass/libass#856 ("[WIP/POC] Add API to obtain metrics and shape data") should provide the text metrics we'd need to implement our own selection functionality for text drawn using libass, at which point we'd want to switch this to use a video screenshot
  • This reuses some image conversion utility routines out of #15568 ("screenshot: add screenshot-to-clipboard command"), pulled out into their own new file; that'll need to be reconciled once either feature lands
  • The system only analyzes text in the user's configured languages by default; we should grab the list of languages that could plausibly be in the video or displayed subtitles (video stream language, all audio stream languages, and selected subtitle stream language seem like a reasonable set?) and signal those to the analyzer
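
To make the last point concrete, here is a minimal sketch of how the candidate language set could be assembled. It is not taken from this PR: `TrackLanguages` and `candidateLanguages(for:)` are hypothetical names, the ordering (video, then audio, then selected subtitle) simply follows the list above, and skipping `und` (the ISO 639-2 "undetermined" code) is an assumption about how untagged streams are reported.

```swift
import Foundation

/// Hypothetical container for the language tags of the relevant tracks.
/// These names are illustrative and do not mirror mpv's track structures.
struct TrackLanguages {
    var video: String?
    var audio: [String]
    var selectedSubtitle: String?
}

/// Returns the language tags in the order video, audio, selected subtitle,
/// with empty and "und" (undetermined) tags dropped and duplicates removed
/// case-insensitively. The first spelling of each tag is the one kept.
func candidateLanguages(for tracks: TrackLanguages) -> [String] {
    var raw: [String] = []
    if let video = tracks.video { raw.append(video) }
    raw += tracks.audio
    if let subtitle = tracks.selectedSubtitle { raw.append(subtitle) }

    var seen = Set<String>()
    var result: [String] = []
    for tag in raw {
        let trimmed = tag.trimmingCharacters(in: .whitespaces)
        let key = trimmed.lowercased()
        if key.isEmpty || key == "und" { continue }
        // insert(_:) reports whether the key was newly added.
        if seen.insert(key).inserted {
            result.append(trimmed)
        }
    }
    return result
}

// Assumed hand-off to VisionKit on macOS 13+ (not compiled here):
//   var configuration = ImageAnalyzer.Configuration([.text])
//   configuration.locales = candidateLanguages(for: tracks)
```

Container files often carry three-letter ISO 639-2 codes (for example `jpn`), so a real implementation would probably also need to map those to the identifiers the analyzer expects; that mapping is deliberately left out of this sketch.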

@rcombs rcombs requested a review from Akemi March 16, 2025 14:36
Member

Akemi commented Mar 16, 2025

> This will presumably want to be gated behind a setting

Yeah, it should be an option that can be toggled at runtime, so it doesn't interfere with window dragging when it isn't wanted.

[edit]
That way the user can configure the behaviour and we don't need to hardcode anything (like enabling it on pause), e.g. an auto-profile that sets this option on pause.

overlayView.isSupplementaryInterfaceHidden = true
overlayView.delegate = self
analysisOverlayView = overlayView
addSubview(overlayView)
Member


Since ImageAnalysisOverlayView is added as a subview, as much of the ImageAnalysisOverlayView functionality as possible (including the delegate) should be moved into its own view class.
