utility for using transformers summarization models on text docs
This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo app.
For details, explanations, and docs, see the wiki
Install using pip:
# create a virtual environment (optional)
pip install textsum
The textsum package is now installed in your virtual environment. The CLI commands and Python API can now be used to summarize text docs from anywhere. See the Usage section for more details.
To install all the dependencies (includes PDF OCR, gradio UI demo, optimum, etc), run:
git clone https://github.com/pszemraj/textsum.git
cd textsum
# create a virtual environment (optional)
pip install -e .[all]
This package uses the clean-text Python package and, like the "base" version of that package, does not include the GPL-licensed unidecode dependency. If you want to use unidecode, install it as an extra with pip:
pip install textsum[unidecode]
In practice, text cleaning pre-summarization with/without unidecode should not make a significant difference.
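To see which of the two setups an environment has, one option is to check whether the optional dependency can be imported. This is a minimal sketch using only the standard library; has_unidecode is a hypothetical helper name, not part of textsum:

```python
from importlib.util import find_spec


def has_unidecode() -> bool:
    """Return True if the optional unidecode package is importable."""
    # find_spec returns None when no top-level module of that name is found.
    return find_spec("unidecode") is not None


print("unidecode installed:", has_unidecode())
```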
There are three ways to use this package: the Python API, the CLI, and the UI demo.
To use the Python API, import the Summarizer class and instantiate it. This will load the default model and parameters. You can then use the summarize_string method to summarize a long string of text.
from textsum.summarize import Summarizer
summarizer = Summarizer() # loads default model and parameters
# summarize a long string
out_str = summarizer.summarize_string('This is a long string of text that will be summarized.')
print(f'summary: {out_str}')
You can also directly summarize a file:
out_path = summarizer.summarize_file('/path/to/file.txt')
print(f'summary saved to {out_path}')
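The same method can be looped over several files. The sketch below assumes only what is shown above, namely that summarize_file accepts a path string and returns the output path. summarize_text_files is a hypothetical helper, not part of textsum:

```python
from pathlib import Path


def summarize_text_files(summarizer, input_dir, pattern="*.txt"):
    """Summarize every file in `input_dir` whose name matches `pattern`.

    `summarizer` is any object exposing `summarize_file(path)`, such as the
    Summarizer instance created above. Returns {input path: output path}.
    """
    outputs = {}
    # sorted() collects the file list up front, so files written while the
    # loop runs are not picked up in the same pass.
    for path in sorted(Path(input_dir).glob(pattern)):
        if path.is_file():
            outputs[str(path)] = summarizer.summarize_file(str(path))
    return outputs


# Usage (requires textsum to be installed; loads the default model):
# from textsum.summarize import Summarizer
# outputs = summarize_text_files(Summarizer(), "/path/to/dir")
# for src, out_path in outputs.items():
#     print(f"{src} -> {out_path}")
```

Passing the summarizer in as an argument means the model is loaded once and reused for every file.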
To summarize a directory of text files, run the following command:
textsum-dir /path/to/dir
The following options are available:
usage: textsum-dir [-h] [-o OUTPUT_DIR] [-m MODEL_NAME] [-batch BATCH_LENGTH] [-stride BATCH_STRIDE] [-nb NUM_BEAMS]
[-l2 LENGTH_PENALTY] [-r2 REPETITION_PENALTY] [--no_cuda] [-length_ratio MAX_LENGTH_RATIO] [-ml MIN_LENGTH]
[-enc_ngram ENCODER_NO_REPEAT_NGRAM_SIZE] [-dec_ngram NO_REPEAT_NGRAM_SIZE] [--no_early_stopping] [--shuffle]
[--lowercase] [-v] [-vv] [-lf LOGFILE]
input_dir
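When calling the CLI from a script, it can help to assemble the argument list programmatically. The sketch below uses only flags that appear in the usage text above (-o, -m, -nb, --no_cuda). build_textsum_dir_command is a hypothetical helper, and the paths are placeholders:

```python
import shlex


def build_textsum_dir_command(input_dir, output_dir=None, model_name=None,
                              num_beams=None, no_cuda=False):
    """Build an argument list for textsum-dir from optional settings."""
    cmd = ["textsum-dir"]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if model_name is not None:
        cmd += ["-m", str(model_name)]
    if num_beams is not None:
        cmd += ["-nb", str(num_beams)]
    if no_cuda:
        cmd.append("--no_cuda")
    cmd.append(str(input_dir))  # positional argument goes last
    return cmd


cmd = build_textsum_dir_command("/path/to/dir", output_dir="/path/to/out", num_beams=4)
print(shlex.join(cmd))

# To actually run it (requires textsum to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.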
For more information, run:
textsum-dir --help
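The -batch and -stride options suggest that long inputs are processed in overlapping token windows, which is one common way to fit documents of arbitrary length into a fixed-size model. The sketch below illustrates that idea under two assumptions: BATCH_LENGTH is the window size in tokens, and BATCH_STRIDE is the number of tokens shared between consecutive windows (the meaning of the stride argument in Hugging Face tokenizers). split_into_batches is a hypothetical helper and is not taken from the textsum source:

```python
def split_into_batches(tokens, batch_length, batch_stride):
    """Split a token sequence into windows of `batch_length` tokens,
    with `batch_stride` tokens of overlap between neighbouring windows."""
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    if not 0 <= batch_stride < batch_length:
        raise ValueError("batch_stride must be in [0, batch_length)")

    step = batch_length - batch_stride  # how far each window advances
    batches = []
    for start in range(0, len(tokens), step):
        batches.append(tokens[start:start + batch_length])
        if start + batch_length >= len(tokens):
            break  # this window already reaches the end of the input
    return batches


example = split_into_batches(list(range(10)), batch_length=4, batch_stride=1)
print(example)
```

Under these assumptions, each window would be summarized separately and the partial summaries combined, so a larger overlap trades extra compute for more shared context between windows.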
For convenience, a UI demo [1] is provided using gradio. To ensure you have the dependencies installed, clone the repo and run the following command:
pip install textsum[app]
To run the demo, run the following command:
textsum-ui
This will start a local server that you can access in your browser, and a shareable link will be printed to the console.
Contributions are welcome! Please open an issue or PR if you have any ideas or suggestions.
See the CONTRIBUTING.md file for details on how to contribute.
- add CLI for summarization of all text files in a directory
- python API for summarization of text docs
- add argparse CLI for UI demo
- put on pypi
- optimum inference integration, LLM.int8 inference
- better documentation in the wiki, details on improving performance (speed, quality, memory usage, etc.)
- improvements to OCR helper module
Other ideas? Open an issue or PR!
Footnotes
1. The demo is currently minimal, but will be expanded in the future to accept other arguments and options.