Commit
fix: Pandoc doesn't support DOC
jpmckinney committed Apr 5, 2024
1 parent 185117f commit 9b2d8fa
Showing 2 changed files with 5 additions and 4 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ Install [Popper](https://poppler.freedesktop.org) for its `pdftotext` command. F
 brew install poppler
 ```

-Install [Pandoc](https://pandoc.org) to convert DOC and DOCX to text. For example, on macOS:
+Install [Pandoc](https://pandoc.org) to convert DOCX to text. For example, on macOS:

 ```shell
 brew install pandoc
@@ -73,7 +73,7 @@ Download data, for example:

 ### General

-Transform PDF, DOC, DOCX, BMP, PNG and JPEG to text files:
+Transform DOCX, BMP, PNG, JPEG and PDF to text files:

 ```shell
 ./manage.py any2txt data/do
5 changes: 3 additions & 2 deletions manage.py
@@ -687,14 +687,15 @@ def any2txt(indir, skip_existing):
             continue
         click.echo(infile)

-        if suffix in (".doc", ".docx"):
+        if suffix == ".docx":
             text = pypandoc.convert_file(infile, "plain")
         elif suffix == (".bmp", ".jpeg", ".png"):
             text = pytesseract.image_to_string(infile)
         elif suffix == ".pdf":
             text = "\n".join(pytesseract.image_to_string(i) for i in pdf2image.convert_from_path(infile, dpi=500))
         else:
-            raise NotImplementedError(suffix)
+            click.secho(f"Unsupported format ({suffix})", fg="yellow")
+            continue

         if text:
             with outfile.open("w") as f:
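
Note on the `manage.py` hunk: the image branch, `suffix == (".bmp", ".jpeg", ".png")`, compares a string with a tuple. In Python that comparison is always `False`, so image files would appear to fall through to the new "Unsupported format" branch. This commit leaves that line unchanged. The sketch below shows one possible way to express the dispatch with a lookup table instead. `CONVERTERS` and `pick_converter` are hypothetical names used only for illustration, and the values are plain labels rather than the project's actual calls.

```python
from pathlib import Path

# Hypothetical lookup table (not from the repository): maps a file suffix
# to a label naming the tool that the diff uses for that format.
CONVERTERS = {
    ".docx": "pandoc",
    ".bmp": "tesseract",
    ".jpeg": "tesseract",
    ".png": "tesseract",
    ".pdf": "pdf2image+tesseract",
}


def pick_converter(path):
    """Return the converter label for a file, or None if the format is unsupported."""
    return CONVERTERS.get(Path(path).suffix)


# A str never equals a tuple, so the equality check cannot match:
assert (".png" == (".bmp", ".jpeg", ".png")) is False
# A membership test is what matches the apparent intent:
assert ".png" in (".bmp", ".jpeg", ".png")

for name in ("report.docx", "scan.png", "legacy.doc"):
    label = pick_converter(name)
    if label is None:
        # Mirrors the new behaviour in the diff: warn and skip rather than raise.
        print(f"Unsupported format ({Path(name).suffix})")
        continue
    print(f"{name}: {label}")
```

A smaller change with the same effect would be to replace `==` with `in` on that one line; the table form is only a design alternative that keeps the list of supported suffixes in one place.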
