WIP: Transcripting code donation #2777

Open
wants to merge 3 commits into
base: main

Conversation

@moxious moxious commented May 23, 2025

Associated Issue: #2623 (comment)

tl;dr what is this? It's a small Python script, with instructions, that fetches YouTube transcripts and summarizes them into nice Markdown files. The intent here is to store public text associated with those videos. This is nice by itself, but it gets better when combined with open-telemetry/opentelemetry.io#6769: Kapa can be trained on these files, and OTel gets a sustainable way to do Q&A on the website based on video content.

The core of this PR is the Python code, which isn't that big. Most of the line changes are Markdown files that are the output of the Python code.
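
For readers who want a feel for the "transcript in, Markdown out" step, here is a minimal sketch. It is not the code from this PR. It assumes the raw transcript has already been fetched and arrives as a list of segments, each a dict with `start` (seconds) and `text` keys; that shape, the function names, and the output layout are all illustrative assumptions. The summarization step is left out entirely.

```python
from datetime import timedelta


def format_timestamp(seconds: float) -> str:
    """Render a start offset in seconds as H:MM:SS."""
    return str(timedelta(seconds=int(seconds)))


def transcript_to_markdown(title: str, video_url: str, segments: list[dict]) -> str:
    """Build a Markdown document from raw transcript segments.

    Assumption: each segment is a dict with "start" (seconds) and "text" keys.
    """
    lines = [f"# {title}", "", f"Source: <{video_url}>", "", "## Raw transcript", ""]
    for seg in segments:
        # Collapse newlines and repeated spaces inside a segment.
        text = " ".join(seg["text"].split())
        if text:  # skip segments that contain only whitespace
            lines.append(f"- [{format_timestamp(seg['start'])}] {text}")
    return "\n".join(lines) + "\n"
```

Keeping the rendering separate from fetching and summarizing means the Markdown layout can be checked without network access or an API key.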

@moxious
Author

moxious commented May 23, 2025

Current known limitations: this works by pulling a raw YouTube transcript and then summarizing and cleaning it up. So when the raw YouTube transcript is imperfect (which it often is with names), errors do happen: "Reese Lee" sometimes becomes "Ree Lee" and "Adriana Villela" becomes "Adriana Villa". Both the "nice cleaned up version" and the "very messy YouTube original" are included for comparison (and also so it's harder for OpenAI to fool me).
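
One possible mitigation for the name errors, sketched here as an illustration only (the PR does not do this): keep a small glossary of known mis-transcriptions and apply it to the cleaned-up text. The glossary contents, `KNOWN_NAME_FIXES`, and `fix_known_names` are hypothetical names made up for this example; only the two name pairs come from the comment above.

```python
import re

# Hypothetical glossary: mis-transcribed form -> correct name.
KNOWN_NAME_FIXES = {
    "Ree Lee": "Reese Lee",
    "Adriana Villa": "Adriana Villela",
}


def fix_known_names(text: str, fixes: dict[str, str] = KNOWN_NAME_FIXES) -> str:
    """Replace known mis-transcriptions, matching whole words only."""
    for wrong, right in fixes.items():
        # \b keeps the pattern from matching inside a longer word.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A callable replacement avoids backslash handling in the new text.
        text = re.sub(pattern, lambda _match, right=right: right, text)
    return text
```

This only catches misspellings someone has already noticed, so it complements keeping the raw original next to the cleaned version rather than replacing that comparison.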

@moxious
Author

moxious commented May 27, 2025

The spell checker action will ultimately be impossible to pass with raw YouTube transcripts; in many cases it also flags names (some correct, some incorrect) as unknown words. I will probably need some advice on what to do in this case, since there's some tension between "capture what people said" and "make sure it's correct".

@dmathieu
Member

cspell could be made to ignore the transcripts folder.
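
As a rough illustration of that suggestion: cspell configuration files accept an `ignorePaths` list of glob patterns. The fragment below is a sketch in `cspell.json` form; the `transcripts/**` glob is an assumption about where the files live, and the repository may keep its cspell settings in a differently named or formatted config file.

```json
{
  "ignorePaths": ["transcripts/**"]
}
```

Files matching the glob are skipped by the spell checker, which sidesteps the name false positives at the cost of not checking the cleaned-up summaries either.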

@danielgblanco
Contributor

danielgblanco commented May 27, 2025

As this is aimed at YouTube transcripts, and I see how it can be really useful for the content the End-User SIG publishes, would it make more sense if this PR is opened against https://github.com/open-telemetry/sig-end-user ?

cc @avillela @reese-lee

@svrnm
Member

svrnm commented Jun 2, 2025

> As this is aimed at YouTube transcripts, and I see how it can be really useful for the content the End-User SIG publishes, would it make more sense if this PR is opened against open-telemetry/sig-end-user ?
>
> cc @avillela @reese-lee

Not all recordings are from the End User SIG, right? I think we can start with community and later see if there are better places to have them.

@danielgblanco
Contributor

You're right. We do have YouTube videos that come from the Comms SIG. However, as those tend to refer to documentation, do we think this tool is equally useful there? My thinking in putting this in a repo other than community is that it would make the permissions needed to maintain those scripts easier to manage.

@trask
Member

trask commented Jun 3, 2025

hi @moxious, can you send this PR to https://github.com/open-telemetry/sig-end-user instead? thanks

5 participants