WIP: Transcripting code donation #2777
Current known limitations: this works by pulling a raw YouTube transcript and then summarizing and cleaning it up. When the raw YouTube transcript is imperfect (which it often is with names), errors do happen: "Reese Lee" sometimes becomes "Ree Lee" and "Adriana Villela" becomes "Adriana Villa". Both the "nice cleaned up version" and the "very messy YouTube original" are included for comparison (and also so it's harder for OpenAI to fool me).
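One possible mitigation for the name errors, sketched here as an assumption and not something this PR implements: run the cleaned text through a small glossary of known mis-transcriptions before writing the Markdown. `NAME_FIXES` and `fix_names` are hypothetical names, and the glossary would have to be maintained by hand.

```python
import re

# Hypothetical glossary: known mis-transcriptions mapped to the correct spelling.
# Entries here are the two examples from this thread; a real list is an assumption.
NAME_FIXES = {
    "Ree Lee": "Reese Lee",
    "Adriana Villa": "Adriana Villela",
}


def fix_names(text, fixes=NAME_FIXES):
    """Replace known mis-transcribed names with their correct spelling."""
    for wrong, right in fixes.items():
        # Word boundaries make the pattern match whole words only,
        # so an already-correct name is left untouched.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text)
    return text


print(fix_names("Ree Lee and Adriana Villa joined the call."))
```

This only catches errors someone has already spotted, so it would complement keeping the raw transcript alongside the cleaned version rather than replace it.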
The spell checker action will ultimately be impossible to pass with raw YouTube transcripts; in many cases it also flags names (some correct, some incorrect) as unknown words. Will probably need some advice on what to do in this case since there's some tension between "capture what people said" and "make sure it's correct".
cspell could be made to ignore the transcripts folder.
This is aimed at YouTube transcripts, and I can see how it would be really useful for the content the End-User SIG publishes. Would it make more sense to open this PR against https://github.com/open-telemetry/sig-end-user ?
Not all recordings are from End User SIG, right? I think we can start with community and later see if there are better places to have them.
You're right. We do have YouTube videos that come from Comms SIG. However, as those tend to refer to documentation, do we think this tool is equally useful there? My thinking of putting this in a repo that's not
hi @moxious, can you send this PR to https://github.com/open-telemetry/sig-end-user instead? thanks
Associated Issue: #2623 (comment)
tl;dr what is this? It's a small Python script with instructions that fetches YouTube transcripts and summarizes them into nice Markdown files. The intent here is to store public text associated with those videos. This is nice by itself, but it gets better when combined with open-telemetry/opentelemetry.io#6769: Kapa can be trained on these files, and OTel gets a sustainable way to do Q&A on the website based on video.
The core of this PR is the Python code, which isn't that big. Most of the line changes are Markdown files that are the output of the Python code.
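For readers who want a feel for the output step, here is a minimal sketch of how a summary and a raw transcript could be written into one Markdown file. It is not the code in this PR. It assumes the raw transcript arrives as a list of dicts with `text` and `start` (seconds) keys, and that the summary has already been produced elsewhere; `format_timestamp` and `render_markdown` are hypothetical helper names.

```python
def format_timestamp(seconds):
    """Turn a float offset in seconds into HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_markdown(title, summary, segments):
    """Build one Markdown document with the cleaned summary and the raw transcript.

    `segments` is assumed to be a list of {"text": str, "start": float} dicts.
    """
    lines = [
        f"# {title}",
        "",
        "## Summary",
        "",
        summary.strip(),
        "",
        "## Raw YouTube transcript",
        "",
    ]
    for seg in segments:
        # Collapse stray newlines and repeated spaces inside a caption.
        text = " ".join(seg["text"].split())
        if not text:
            continue  # skip empty captions
        lines.append(f"- [{format_timestamp(seg['start'])}] {text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo_segments = [
        {"text": "hello  everyone\n", "start": 0.0},
        {"text": "welcome to OTel", "start": 65.9},
    ]
    print(render_markdown("Demo video", "A short summary.", demo_segments))
```

Keeping the raw transcript in the same file as the summary mirrors the comparison described in the conversation: a reader can check a suspicious name or claim against what YouTube actually captured.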