Draft: Add workaround for neuspell initialization bug (#8)
* ✨ add symspell fallback option to pipeline

* 🚧 update fn argument

* 🐛 add try-except for neuspell bug

* 🚧 add args specifying which checker to use

* ✏️ adjust a TODO location

* 🐛 fix fn placement, passthrough method

* 📝 add more logging

* 🐛 lowercase text pre-spell correction

* 📝 update script docs

* 🔊 update logs

* 🎨 format to black

* ♻️ organized imports

* 🐛 add workaround for Neuspell's bug

* 📝 update readme w latest

* 🐛 add workaround for bug in Neuspell
pszemraj authored Feb 24, 2022
1 parent 309c420 commit cae85b7
Showing 6 changed files with 2,478 additions and 1,711 deletions.
README.md (152 changes: 74 additions & 78 deletions)

TL;DR check out [this Colab script](https://colab.research.google.com/gist/pszemraj/4183c4b39bf718b54de9dbf2df499cd9/vid2cleantext-single-demo.ipynb) to see a transcription and keyword extraction of a speech by John F. Kennedy by simply running all cells.

* * *

**Table of Contents**

<!-- TOC -->

- [vid2cleantxt](#vid2cleantxt)
- [Motivation](#motivation)
- [Overview](#overview)
- [Example Output](#example-output)
- [Pipeline Intro](#pipeline-intro)
- [Installation](#installation)
- [Quickstart (aka: how to get the script running)](#quickstart-aka-how-to-get-the-script-running)
- [Notebooks on Colab](#notebooks-on-colab)
- [How long does this take to run?](#how-long-does-this-take-to-run)
- [Application](#application)
- [Now I have a bunch of long text files. How are these useful?](#now-i-have-a-bunch-of-long-text-files-how-are-these-useful)
- [Visualization and Analysis](#visualization-and-analysis)
- [Text Extraction / Manipulation](#text-extraction--manipulation)
- [Text Summarization](#text-summarization)
- [TextHero example use case](#texthero-example-use-case)
- [ScatterText example use case](#scattertext-example-use-case)
- [Design Choices & Troubleshooting](#design-choices--troubleshooting)
- [What python package dependencies does this repo have?](#what-python-package-dependencies-does-this-repo-have)
- [My computer crashes once it starts running the wav2vec2 model](#my-computer-crashes-once-it-starts-running-the-wav2vec2-model)
- [The transcription is not perfect, and therefore I am mad](#the-transcription-is-not-perfect-and-therefore-i-am-mad)
- [How can I improve the performance of the model from a word-error-rate perspective?](#how-can-i-improve-the-performance-of-the-model-from-a-word-error-rate-perspective)
- [Why use wav2vec2 instead of SpeechRecognition or other transcription methods?](#why-use-wav2vec2-instead-of-speechrecognition-or-other-transcription-methods)
- [Examples](#examples)
- [Future Work, Collaboration, & Citations](#future-work-collaboration--citations)
- [Project Updates](#project-updates)
- [Future Work](#future-work)
- [I've found x repo / script / concept that I think you should incorporate or collaborate with the author](#ive-found-x-repo--script--concept-that-i-think-you-should-incorporate-or-collaborate-with-the-author)
- [Citations](#citations)
- [Video Citations](#video-citations)

<!-- /TOC -->

* * *

# Motivation

Video, specifically audio, is an inefficient way to convey dense or technical information…

Example output text of a video transcription of [JFK's speech on going to the moon](https://www.c-span.org/classroom/document/?7986):


<https://user-images.githubusercontent.com/74869040/151491511-7486c34b-d1ed-4619-9902-914996e85125.mp4>

**vid2cleantxt output:**


> Now look into space to the moon and to the planets beyond and we have vowed that we shall not see it governed by a hostile flag of conquest but. By a banner of freedom and peace we have vowed that we shall not see space filled with weapons of man's destruction but with instruments of knowledge and understanding yet the vow. S of this nation can only be fulfilled if we in this nation are first and therefore we intend to be first. In short our leadership in science and industry our hopes for peace and security our obligations to ourselves as well as others all require. Us to make this effort to solve these mysteries to solve them for the good of all men and to become the world's leading space fearing nationwide set sail on this new sea. Because there is new knowledge to be gained and new rights to be won and they must be won and used for the progress of all before for space science like nuclear science and all techniques. Logo has no conscience of its own whether it will become a force for good or ill depends on man and only if the united states occupies a position of pre eminence. Can we help decide whether this new ocean will be a sea of peace or a new terrifying theatre of war I do not say that we should or will go on. ... (truncated for brevity)

See the [demo notebook](https://colab.research.google.com/gist/pszemraj/4183c4b39bf718b54de9dbf2df499cd9/vid2cleantext-single-demo.ipynb) for the full text output.
Notebook versions are available on Google Colab, because they offer free GPUs…
Links to Colab Scripts:

1. Single-File Version (Implements GPU)
- Link [here](https://colab.research.google.com/gist/pszemraj/4183c4b39bf718b54de9dbf2df499cd9/vid2cleantext-single-demo.ipynb), updated on 2022-02-24.
- This notebook downloads a video of JFK's "Moon Speech" (originally downloaded from C-SPAN) and transcribes it, printing and/or optionally downloading the output. No authentication is required.
- This **is the recommended link for seeing how this pipeline works**. The only work involved is running all cells.
2. Multi-File Version (Implements GPU)
- Link [here](https://colab.research.google.com/gist/pszemraj/a88ff352258f596d11027689653124ed/vid2cleantext-multi.ipynb), updated on 2022-02-24. The example here is MIT OpenCourseWare Lecture Videos (see `examples/` for citations).
- This notebook connects to the user's google drive to convert a whole folder of videos. The input can be either Colab or URL to a `.zip` file of media. Outputs are stored in the user's Google Drive and optionally downloaded.
- _NOTE:_ this notebook does require Drive authorization. Google's instructions for this have improved recently, and a window will pop up asking for confirmation.

On my machine (CPU only due to Windows + AMD GPU) it takes approximately 30-70%…

**Specs:**

Processor Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Speed 4.8 GHz
Number of Cores 8
Memory RAM 32 GB
Video Card #1 Intel(R) UHD Graphics 620
Dedicated Memory 128 MB
Total Memory 16 GB
Video Card #2 AMD Radeon Pro WX3200 Graphics
Dedicated Memory 4.0 GB
Total Memory 20 GB
Operating System Windows 10 64-bit
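
As a rough planning aid, the "approximately 30-70% of the video length" figure above can be turned into a quick estimate. This is only a back-of-the-envelope sketch; the helper name and bounds are illustrative, not part of the repo:

```python
def estimate_runtime_minutes(video_minutes, low=0.3, high=0.7):
    """Rough CPU-only runtime estimate: 30-70% of the video length."""
    return video_minutes * low, video_minutes * high

# e.g. a 60-minute lecture transcribed on a comparable CPU-only machine:
lo, hi = estimate_runtime_minutes(60)  # (18.0, 42.0) minutes
```

Actual runtime depends heavily on the model chosen and hardware, so treat the bounds as loose.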

> _NOTE:_ the default model is `facebook/wav2vec2-base-960h`, a pretrained model trained on the LibriSpeech corpus. If you want to use a different model, pass the `--model` argument (for example `--model "facebook/wav2vec2-large-960h-lv60-self"`). The model is downloaded from huggingface.co's servers if it does not exist locally. The large model is more accurate but slower to run. I do not have stats on differences in WER, but [facebook](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec) may have some posted.
* * *

# Application

Comparing frequency of terms in one body of text vs. another

![ST P 1 term frequency I ML 2021 Docs I ML Prior Exams_072122_](https://user-images.githubusercontent.com/74869040/110546149-69e49980-812e-11eb-9c94-81fcb395b907.png)
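
A minimal standard-library approximation of the term-frequency comparison shown above (ScatterText computes a richer scaled score; this sketch just uses raw count differences):

```python
from collections import Counter

def term_freq_diff(corpus_a, corpus_b, top_n=3):
    """Return the top_n terms most characteristic of corpus_a vs. corpus_b."""
    counts_a = Counter(corpus_a.lower().split())
    counts_b = Counter(corpus_b.lower().split())
    vocab = set(counts_a) | set(counts_b)
    # positive score -> term appears more often in corpus_a
    scores = {word: counts_a[word] - counts_b[word] for word in vocab}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(term_freq_diff("exam exam problem set", "lecture lecture problem"))
# ['exam', 'set', 'problem']
```

In practice you would feed in the transcribed `.txt` files and add tokenization/stop-word handling, which ScatterText and TextHero do for you.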

* * *

# Design Choices & Troubleshooting

Upon cloning the repo, run the command `pip install -r requirements.txt` in a terminal…
- _Note: the github link in the reqs above downloads the spaCy model `en_core_web_sm` as part of the setup/installation process so you don't have to manually type `python -m spacy download en_core_web_sm` into the terminal to be able to run the code. More on this is described on spaCy's website [here](https://spacy.io/usage/models#production)_

If you encounter warnings/errors that mention ffmpeg, please download the latest version of FFMPEG from their website [here](https://www.ffmpeg.org/download.html) and ensure it is added to PATH.

## My computer crashes once it starts running the wav2vec2 model

Try passing a lower `--chunk-len <INT>` when calling `vid2cleantxt/transcribe.py`. Until you get to really small intervals (say &lt; 8 seconds) each audio chunk can be treated as approximately independent as they are different sentences.
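
The chunking idea can be pictured with a short sketch. This is a hypothetical helper for illustration, not the repo's actual implementation:

```python
def chunk_audio(samples, sample_rate, chunk_len_s):
    """Split raw audio samples into fixed-length chunks for transcription.

    Smaller chunks lower peak memory use; each chunk is treated as
    approximately independent during transcription.
    """
    step = int(sample_rate * chunk_len_s)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 2.5 s of dummy 16 kHz audio split into 1-second chunks:
chunks = chunk_audio(list(range(40000)), 16000, 1.0)
print([len(c) for c in chunks])  # [16000, 16000, 8000]
```

Lowering `--chunk-len` simply makes `step` smaller, so each forward pass through the model sees less audio at once.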

## The transcription is not perfect, and therefore I am mad

Perfect transcripts are not always possible, especially when the audio is not clean. For example, audio recorded with a microphone that is not well tuned to the speaker can cause the model to have issues. Additionally, the default models are not trained on specific speakers, so they may struggle to recognize a given speaker or their accent.
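
"Not perfect" can be quantified with word error rate (WER), the metric referenced elsewhere in this README. Below is a minimal dynamic-programming sketch, not the evaluation code used by the wav2vec2 authors:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("we choose to go to the moon", "we chose to go to moon"))
# 2 errors over 7 reference words, i.e. about 0.286
```

A WER around 0.2-0.3 on noisy audio is not unusual, which is why the pipeline runs spell correction afterwards.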

_`*` these statements reflect the assessment completed around project inception…_

# Examples

- Two examples are available in the `examples/` directory. One is a single video (another speech) and the other is multiple videos (MIT OpenCourseWare). Citations are in the respective folders.
- Note that the videos first need to be downloaded via the respective scripts in each folder, i.e. run: `python examples/TEST_singlefile/dl_src_video.py`

# Future Work, Collaboration, & Citations

## Project Updates

A _rough_ timeline of what has been going on in the repo:

- Feb 2022 - Add backup functions for spell correction in case of NeuSpell failure (which at the time of writing is a known issue).
- Jan 2022 - add huBERT support, abstract the boilerplate out of Colab Notebooks. Starting work on the PDF generation w/ results.
- Dec 2021 - greatly improved runtime of the script, and added more features (command line, docstring, etc.)
- Sept-Oct 2021: Fixing bugs, formatting code.
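
The NeuSpell workaround this commit describes boils down to a try/except fallback around spell correction. The sketch below uses stand-in checker functions, since the real NeuSpell/SymSpell objects and their APIs are not shown here:

```python
import logging

def safe_spell_correct(text, primary_checker, fallback_checker):
    """Lowercase the text (as done pre-correction in this commit), then run
    the primary checker, falling back to the secondary one if it raises,
    e.g. on NeuSpell's known initialization bug."""
    text = text.lower()
    try:
        return primary_checker(text)
    except Exception as exc:
        logging.warning("primary spell-checker failed (%s); using fallback", exc)
        return fallback_checker(text)

# stand-ins for the real NeuSpell / SymSpell checkers:
def broken_neuspell(text):
    raise RuntimeError("simulated NeuSpell initialization bug")

def symspell_passthrough(text):
    return text  # a real fallback would return spell-corrected text

print(safe_spell_correct("TRANSCRIBED Text", broken_neuspell, symspell_passthrough))
# transcribed text
```

The passthrough method mentioned in the commit bullets fits the same slot: any callable taking and returning text can serve as the fallback.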
- ~~this will include support for CUDA automatically when running the code (currently just on Colab)~~

2. ~~clean up the code, add more features, and make it more robust.~~
3. publish as a python package to streamline process / reduce overhead, making it easier to use + adopt.
4. add script to convert `.txt` files to a clean PDF report, [example here](https://www.dropbox.com/s/fpqq2qw7txbkujq/ACE%20NLP%20Workshop%20-%20Session%20II%20-%20Dec%202%202021%20-%20full%20transcription%20-%20txt2pdf%2012.05.2021%20%20Standard.pdf?dl=1)
5. add summarization script / module
6. convert groups of functions to a class object. re-organize code to make it easier to read and understand.

## I've found x repo / script / concept that I think you should incorporate or collaborate with the author

Send me a message / start a discussion! Always looking to improve. Or create an issue, that works too.


**HuBERT (fairseq)**

@article{Hsu2021,
author = {Wei Ning Hsu and Benjamin Bolte and Yao Hung Hubert Tsai and Kushal Lakhotia and Ruslan Salakhutdinov and Abdelrahman Mohamed},
doi = {10.1109/TASLP.2021.3122291},
issn = {23299304},
journal = {IEEE/ACM Transactions on Audio Speech and Language Processing},
keywords = {BERT,Self-supervised learning},
month = {6},
pages = {3451-3460},
publisher = {Institute of Electrical and Electronics Engineers Inc.},
title = {HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
volume = {29},
    url = {https://arxiv.org/abs/2106.07447v1},
year = {2021},
}

**MoviePy**

> Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol
> 10772, pp. 806 - 810. pdf
```
### Video Citations

- <div class="csl-entry"><i>President Kennedy’s 1962 Speech on the US Space Program | C-SPAN Classroom</i>. (n.d.). Retrieved January 28, 2022, from https://www.c-span.org/classroom/document/?7986</div>

- _Note: example videos are cited in respective `Examples/` directories_