topic-cluster

Cluster papers into topics according to their titles and abstracts.

This program reads the titles and/or abstracts from a BibTeX file and applies Latent Dirichlet Allocation (LDA) to them. The result is shown as a bar graph. The idea comes from Teh et al.¹
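To illustrate the first step, here is a minimal sketch of how titles and abstracts could be collected from BibTeX text into one document per entry. It is not the program's actual reader: the function names `extract_field` and `documents_from_bibtex` are hypothetical, and the sketch assumes that field values are delimited by braces (not quotes).

```python
import re


def extract_field(entry: str, field: str) -> str:
    """Return the brace-delimited value of `field` in one BibTeX entry, or ""."""
    # \b keeps "title" from matching inside "booktitle".
    match = re.search(rf"\b{field}\s*=\s*\{{", entry, flags=re.IGNORECASE)
    if match is None:
        return ""
    depth = 1
    start = match.end()
    for pos in range(start, len(entry)):
        if entry[pos] == "{":
            depth += 1
        elif entry[pos] == "}":
            depth -= 1
            if depth == 0:
                # Drop inner braces and collapse line breaks and extra spaces.
                value = entry[start:pos].replace("{", "").replace("}", "")
                return " ".join(value.split())
    return ""


def documents_from_bibtex(
    text: str, use_title: bool = True, use_abstract: bool = True
) -> list[str]:
    """Build one text document per BibTeX entry from its title and/or abstract."""
    # Split in front of every "@type{" marker; the first chunk is any preamble.
    entries = re.split(r"(?=@\w+\s*\{)", text)
    documents = []
    for entry in entries:
        if not entry.lstrip().startswith("@"):
            continue
        parts = []
        if use_title:
            parts.append(extract_field(entry, "title"))
        if use_abstract:
            parts.append(extract_field(entry, "abstract"))
        document = " ".join(part for part in parts if part)
        if document:
            documents.append(document)
    return documents


SAMPLE = """
@article{a,
  title = {Sensor data {Quality}},
  abstract = {A systematic
              review.}
}
@inproceedings{b,
  booktitle = {Some Conference},
  title = {Topic models}
}
"""

print(documents_from_bibtex(SAMPLE))
```

The resulting list of strings is the kind of input a topic model such as LDA would then be fitted on. The `use_title` and `use_abstract` switches mirror the idea behind the --no-title and --no-abstract options.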

Installation

To run this program directly from the source code, Python and Poetry (pip install poetry) are required.

To install all dependencies, open a command line, navigate to the directory that contains this README file, and run

poetry install

Execution

To execute the program use

poetry run python -m topic_cluster

The following arguments are supported:

topic_cluster [-h] [--version] [-v] [-vv] [--ignore-last-bibtex-path] [-t TOPIC_COUNT] [-f FEATURE_COUNT] [--no-title] [--no-abstract] [--no-plot] [--no-feature-list] [bibtex_path]

Positional arguments (optional)

bibtex_path: The path of the BibTeX file to read. If it is not given, the path from the last call is used. On the first call, the program asks for it via a file-open dialog.

Optional arguments

-h, --help: Show this help message and exit
--version, -V: Show the program's version number and exit
-v, --verbose: Set the log level to INFO
-vv, --very-verbose: Set the log level to DEBUG
--ignore-last-bibtex-path, -i: Always ask for the BibTeX path (and do not use the one from the previous run) if the bibtex_path is not given
-t TOPIC_COUNT, --topics TOPIC_COUNT: The number of topics, default is 3
-f FEATURE_COUNT, --features FEATURE_COUNT: The number of features per topic, default is 7
--no-title: Exclude the title from the feature detection
--no-abstract: Exclude the abstract from the feature detection
--no-plot: Do not show the plot
--no-feature-list: Do not show the feature-frequency list
--min-ngrams {1,2,3,4,5}: The minimum number of words per feature (n-gram length) to use for feature extraction, default is 1
--max-ngrams {1,2,3,4,5}: The maximum number of words per feature (n-gram length) to use for feature extraction, default is 3
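To make the effect of --min-ngrams and --max-ngrams concrete, the following sketch enumerates word n-grams between a minimum and a maximum length. It only illustrates the concept: the helper `word_ngrams` is hypothetical, it splits on whitespace, and the program's real tokenization may differ.

```python
def word_ngrams(text: str, min_n: int = 1, max_n: int = 3) -> list[str]:
    """Return all word n-grams of `text` with min_n <= n <= max_n."""
    words = text.lower().split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words over the text.
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


print(word_ngrams("Sensor data quality", min_n=1, max_n=2))
# ['sensor', 'data', 'quality', 'sensor data', 'data quality']
```

With the defaults of 1 and 3, a feature can therefore be a single word, a pair of adjacent words, or a run of three adjacent words.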

If the program is started without any arguments, it uses the BibTeX path from the last call or, if there is none, asks for the path via a dialog. The topic and feature counts have their default values, and both title and abstract are used.
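The defaults described above can be sketched with argparse. This is an illustration assembled from the option list in this README, not the project's actual parser: the function name `build_parser` and the attribute names `topic_count` and `feature_count` are assumptions, and only a subset of the options is shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the documented options (illustrative)."""
    parser = argparse.ArgumentParser(prog="topic_cluster")
    # nargs="?" makes the positional argument optional.
    parser.add_argument("bibtex_path", nargs="?", default=None)
    parser.add_argument("-t", "--topics", dest="topic_count", type=int, default=3)
    parser.add_argument("-f", "--features", dest="feature_count", type=int, default=7)
    parser.add_argument("--no-title", action="store_true")
    parser.add_argument("--no-abstract", action="store_true")
    parser.add_argument("--min-ngrams", type=int, choices=range(1, 6), default=1)
    parser.add_argument("--max-ngrams", type=int, choices=range(1, 6), default=3)
    return parser


# Parsing an empty argument list yields the documented defaults.
defaults = build_parser().parse_args([])
print(defaults.bibtex_path, defaults.topic_count, defaults.feature_count)

# Explicit values override them.
custom = build_parser().parse_args(["refs.bib", "-t", "5", "--no-abstract"])
print(custom.bibtex_path, custom.topic_count, custom.no_abstract)
```

Here `refs.bib` is a placeholder file name. A `bibtex_path` of `None` is the case in which the program falls back to the last used path or the file dialog.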

The actual appearance of the graph depends on the backend used by matplotlib.

The idea of using natural language processing to refine the search terms for a systematic review is taken from Teh et al.¹

¹ Teh, Hui Yie, Andreas W. Kempa-Liehr, and Kevin I-Kai Wang. "Sensor data quality: a systematic review". Journal of Big Data 7, no. 1 (11 February 2020): 11. https://doi.org/10.1186/s40537-020-0285-1.
