new compare text similarity #632

dataverse-hub · 2025-09-27T11:00:06Z

No description provided.

…aper by implementing _scrape_channel method, which calls Apify actor with channel_url, max_videos, start_date, and end_date to fetch multiple video transcripts and metadata.
- Enhanced input schema documentation to clarify field usage:
  - youtube_url: For single video scraping, paired with language for transcript.
  - channel_url: For fetching multiple videos from a channel, paired with max_videos, start_date, and end_date (both YYYY-MM-DD, optional)

Implement character-level similarity using difflib for non-spaced languages
Include unicodedata for text normalization to handle diacritics

mcos-ntakouris · 2025-10-29T08:41:24Z

scraping/youtube/utils.py

+import unicodedata
+
+# new compare_text_similarity
+def compare_text_similarity(text1: str, text2: str, threshold: float = 0.8) -> bool:


Good for fast similarity checks, but it needs some care because it can overestimate similarity when chunks are moved around.

+1 for Unicode normalization; we should probably adopt it.

Mark added 9 commits September 15, 2025 23:11

Add new youtube actor

aeb6180

rename actor

4b79ce9

Add Detect language

5e5a05a

Add Detect language

02db0f7

Add compare_text_similarity.py with improved text similarity function

9a6e0ad

Implement character-level similarity using difflib for non-spaced languages
Include unicodedata for text normalization to handle diacritics

Merge remote-tracking branch 'origin/main'

e74c5b6

Add compare_text_similarity.py with improved text similarity function

0719f92

Implement character-level similarity using difflib for non-spaced languages
Include unicodedata for text normalization to handle diacritics

mcos-ntakouris reviewed Oct 29, 2025


2 participants
