@dataverse-hub
No description provided.

Mark added 9 commits September 15, 2025 23:11
…aper by implementing _scrape_channel method, which calls Apify actor with channel_url, max_videos, start_date, and end_date to fetch multiple video transcripts and metadata.

- Enhanced input schema documentation to clarify field usage:
  - youtube_url: For single video scraping, paired with language for transcript.
  - channel_url: For fetching multiple videos from a channel, paired with max_videos, start_date, and end_date (both dates optional, in YYYY-MM-DD format).
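
A minimal sketch of how the channel fields described above could be validated and assembled before being handed to the actor. The helper names `build_channel_input` and `_check_date` are hypothetical and not part of this PR. The key names mirror the schema fields listed in the commit message, and the actor call itself is left out because its client code is not shown here.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def _check_date(value: Optional[str], field: str) -> Optional[str]:
    """Return the date normalized to YYYY-MM-DD, or None if it was omitted."""
    if value is None:
        return None
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError as exc:
        raise ValueError(f"{field} must be YYYY-MM-DD, got {value!r}") from exc
    return parsed.strftime("%Y-%m-%d")


def build_channel_input(
    channel_url: str,
    max_videos: int,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the channel-scraping input fields."""
    if max_videos < 1:
        raise ValueError("max_videos must be at least 1")
    start = _check_date(start_date, "start_date")
    end = _check_date(end_date, "end_date")
    # Normalized ISO dates sort lexicographically, so string comparison is safe.
    if start and end and start > end:
        raise ValueError("start_date must not be later than end_date")
    payload: Dict[str, Any] = {"channel_url": channel_url, "max_videos": max_videos}
    # Both dates are optional, so only include the ones that were provided.
    if start:
        payload["start_date"] = start
    if end:
        payload["end_date"] = end
    return payload
```

Rejecting a malformed date before the actor runs gives the caller an immediate error instead of a failed or empty scrape.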
Implement character-level similarity using difflib for non-spaced languages
Include unicodedata for text normalization to handle diacritics
import unicodedata

# new compare_text_similarity
def compare_text_similarity(text1: str, text2: str, threshold: float = 0.8) -> bool:

Good for fast similarity checks, but it needs some care because it can overestimate similarity when chunks are moved around.

+1 for Unicode normalization; we should probably adopt it.
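
For reference while reviewing, here is a minimal sketch of what a character-level check with `difflib` and `unicodedata` could look like. The function body in this PR is not shown above, so this is an assumption, not the author's implementation. The `normalize_text` helper is hypothetical, and choosing NFKC plus case folding (rather than stripping diacritics) is also an assumption.

```python
import difflib
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: unify Unicode forms, fold case, collapse whitespace."""
    # NFKC makes composed and decomposed characters compare equal
    # (for example "e" + U+0301 versus U+00E9) without dropping the accents.
    normalized = unicodedata.normalize("NFKC", text)
    return " ".join(normalized.casefold().split())


def compare_text_similarity(text1: str, text2: str, threshold: float = 0.8) -> bool:
    """Return True when the character-level ratio reaches the threshold."""
    a, b = normalize_text(text1), normalize_text(text2)
    if not a and not b:
        # Two empty strings are treated as identical.
        return True
    # Comparing characters rather than whitespace-separated words means
    # languages written without spaces are still compared meaningfully.
    # autojunk=False disables the popularity heuristic that can skew long inputs.
    ratio = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
    return ratio >= threshold
```

`SequenceMatcher.ratio()` is computed from in-order matching blocks, so it is worth measuring how reordered transcripts score before settling on the `0.8` default.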
