[PoC] Live Text image analysis on macOS #16063

rcombs · 2025-03-16T14:36:39Z

This is a draft for discussion; this feature is not mergeable in its current state.

This adds support for the Live Text API on macOS, which allows the user to select text within a video.

In this demo, I select some Japanese text from a video file:

resized.mov

The copied text is:

トゲナシトゲアリ「雑踏、僕らの街」
Produced by 玉井健二
作詞・作曲：大濱健悟
編曲：玉井健二，大濱健悟
Product of agehasprings

Which is largely correct modulo spacing.

This will need a number of changes before it's mergeable, some of which I'd like to get some discussion started on:

- The API calls required to capture the image and convert it to a form usable by the system APIs are largely hacked in haphazardly right now; I'm not sure what the best solution for some of this is.
- Currently, I'm capturing a window screenshot (so OSD is included), but I'm not informed when the OSD updates, so it becomes outdated easily; the simplest solution might be to simply not support the OSD (which would mean taking a subtitles screenshot and configuring the overlay view to be aware of the video's margins within the window).
- This will presumably want to be gated behind a setting.
- Long-term, "[WIP/POC] Add API to obtain metrics and shape data" (libass/libass#856) should provide the text metrics we'd need to implement our own selection functionality for text drawn using libass, at which point we'd want to switch this to use a video screenshot.
- This reuses some image conversion utility routines out of "screenshot: add screenshot-to-clipboard command" (#15568), pulled out into their own new file; that'll need to be reconciled once either feature lands.
- The system only analyzes text in the user's configured languages by default; we should grab the list of languages that could plausibly be in the video or displayed subtitles (video stream language, all audio stream languages, and selected subtitle stream language seems like a reasonable set?) and signal those to the analyzer.
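
As a rough illustration of the last point, here is a minimal sketch of how the candidate language set could be assembled. It is not code from this PR: `TrackLanguages` and `candidateLanguages(for:)` are hypothetical names, and the sketch assumes the player can report each track's language tag as an optional string. The result would presumably be handed to the analyzer through something like VisionKit's `ImageAnalyzer.Configuration.locales`, possibly after mapping the tags to the identifiers the system expects; that step is left out here.

```swift
// Hypothetical container for the language tags of the relevant tracks.
// None of these names exist in mpv or in this PR.
struct TrackLanguages {
    var video: String?
    var audio: [String?]
    var selectedSubtitle: String?
}

/// Collects the video, audio, and selected subtitle languages in that order,
/// dropping missing or undetermined tags and removing duplicates.
func candidateLanguages(for tracks: TrackLanguages) -> [String] {
    let ordered: [String?] = [tracks.video] + tracks.audio + [tracks.selectedSubtitle]
    var seen = Set<String>()
    var result: [String] = []
    for case let tag? in ordered {
        let normalized = tag.lowercased()
        // "und" is the ISO 639 code for an undetermined language.
        if normalized.isEmpty || normalized == "und" { continue }
        // Keep only the first occurrence so the original priority order survives.
        if seen.insert(normalized).inserted {
            result.append(normalized)
        }
    }
    return result
}

// Example input: a Japanese video with Japanese and English audio tracks.
let exampleTracks = TrackLanguages(video: "jpn",
                                   audio: ["jpn", "eng", nil],
                                   selectedSubtitle: "und")
let exampleLanguages = candidateLanguages(for: exampleTracks)
print(exampleLanguages)
```

Keeping the order stable (video first, then audio, then subtitles) is a design choice of this sketch rather than a known requirement of the analyzer.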

This reverts commit d9eb9ed

Akemi · 2025-03-16T15:40:23Z

> This will presumably want to be gated behind a setting

Yeah, an option that can be toggled at runtime, so it doesn't interfere with window dragging when it isn't wanted.

[edit]
That way the user can configure the behaviour and we don't need to hardcode anything (like enabling it on pause), e.g. an auto-profile that sets this option on pause.

Akemi · 2025-03-16T15:45:13Z

video/out/mac/view.swift

+                overlayView.isSupplementaryInterfaceHidden = true
+                overlayView.delegate = self
+                analysisOverlayView = overlayView
+                addSubview(overlayView)


Since ImageAnalysisOverlayView is added as a subview, as much of its functionality as possible (including the delegate) should be moved into its own view class.

rcombs added 4 commits March 16, 2025 19:09

sws_utils: re-add mp_image_swscale, with opts/log handling

3442241

This reverts commit d9eb9ed

[osdep] add image conversion utility for macOS

2b633ee

[WIP] screenshot PoC utils

874af07

[WIP] support Live Text on macOS

43bb4ec

rcombs requested a review from Akemi March 16, 2025 14:36

Akemi reviewed Mar 16, 2025
