* add language-support.md * fix links * delete punctuation.md
1 parent
2859a84
commit d3db268
Showing 14 changed files with 197 additions and 34 deletions.
@@ -0,0 +1,36 @@
from wordsiv import Vocab, WordSiv

# Define the punctuation dictionary
de_punc = {
    "insert": {
        " ": 0.365,
        ", ": 0.403,
        ": ": 0.088,
        "; ": 0.058,
        "–": 0.057,
        "—": 0.022,
        " … ": 0.006,
    },
    "wrap_sent": {
        ("", "."): 0.923,
        ("", "!"): 0.034,
        ("", "?"): 0.04,
        ("", "…"): 0.003,
    },
    "wrap_inner": {
        ("", ""): 0.825,
        ("(", ")"): 0.133,
        ("‘", "’"): 0.013,
        ("“", "”"): 0.028,
    },
}

# Create a Vocab from a file, this time passing punctuation
de_vocab = Vocab(lang="de", data_file="de.tsv", bicameral=True, punctuation=de_punc)

# Add Vocab to WordSiv Object
ws = WordSiv()
ws.add_vocab("de-subtitles", de_vocab)

# Try it out, turning up punctuation randomness so we see more variation
print(ws.para(vocab="de-subtitles", rnd_punc=0.5))
@@ -0,0 +1,11 @@
from wordsiv import Vocab, WordSiv

# Create a Vocab from a file
de_vocab = Vocab(lang="de", data_file="de.tsv", bicameral=True)

# Add Vocab to WordSiv object
ws = WordSiv()
ws.add_vocab("de-subtitles", de_vocab)

# Try it out
print(ws.sent(vocab="de-subtitles"))
@@ -0,0 +1,20 @@
ich	3699605
sie	2409949
das	1952794
ist	1920535
du	1890181
nicht	1734016
die	1585020
es	1460530
und	1441012
der	1109693
wir	1075801
was	1072372
zu	918548
er	851812
ein	841835
in	793011
mir	645137
mit	641744
ja	635186
den	588653
@@ -0,0 +1,91 @@
# Language Support

## Vocab
In WordSiv, a [Vocab](../api-reference.md#wordsiv.Vocab) is an object that contains
a word list and other language-specific data that allow a WordSiv object to
appropriately filter words and generate text.

!!! Note
    I considered naming this object **WordList**, but it can also contain
    word counts and punctuation data. I considered calling it **Lang**, but it's
    possible to have more than one set of words (and punctuation, etc.) per
    language. I can imagine having Vocabs derived from different genres of text:
    `en-news`, `en-wiki`, etc!

### Using a Built-in Vocab

See [Basic Usage](basic-usage.md) for how to list and select a built-in Vocab.
If you're curious about the origin/license[^1] of these lists you can examine
the built-in Vocabs in [wordsiv/_vocab_data][vocab-data].

### Creating a custom Vocab

It's easy to add your own Vocab to WordSiv. The harder part is actually deriving
wordlists from a [text corpus](https://en.wikipedia.org/wiki/Text_corpus) and
refining the capitalization (if applicable), which we won't detail here.

Let's say we grab the top 20 German words from this [frequency wordlist derived
from OpenSubtitles][hermit-de], and save them as `de.tsv` (replacing spaces
with tabs):
```
--8<-- "de.tsv"
```
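
The conversion described above (keep the top entries, swap spaces for tabs) can
be sketched in a few lines of Python. This is only a minimal sketch, assuming
the source list has one `word count` pair per line, sorted by frequency;
`top_words_to_tsv` is a hypothetical helper, not part of WordSiv. The returned
string could then be written to `de.tsv` with any file API.

```python
def top_words_to_tsv(src_text: str, limit: int = 20) -> str:
    """Turn 'word count' lines into 'word<TAB>count' lines, keeping the first `limit`.

    Hypothetical helper for illustration only; not part of WordSiv.
    """
    rows = []
    for line in src_text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word, count = parts
        rows.append(f"{word}\t{count}")
        if len(rows) == limit:
            break
    return "\n".join(rows) + "\n"


# First lines of the word list shown above, in space-separated form
sample = "ich 3699605\nsie 2409949\ndas 1952794\n"
print(top_words_to_tsv(sample, limit=2), end="")
```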

We can now create a Vocab and add it to WordSiv:
```python
--8<-- "add-vocab.py"
```

We get the output:
> Die du die der ich nicht sie das und e

#### Adding Custom Punctuation to a Vocab

But what if we want punctuation? We have some default punctuation for the
built-in languages in [wordsiv/_punctuation.py][punctuation-py], but not yet for
German (at the time of writing). Let's copy/paste the English one (for now[^2])
and try it out:
```python
--8<-- "add-vocab-punc.py"
```

Now we see punctuation:
> Ich ist mit das ich (du und) mit es sie… Nicht das was zu sie—du die ja nicht
> und zu ist du? Das er das “wir” ich was sie der du mit das die und zu ich. In
> und in, ich ja ich die der das (nicht er sie ich) mir.
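
In the punctuation dictionary used here, each group (`insert`, `wrap_sent`,
`wrap_inner`) maps punctuation to probabilities that add up to roughly 1. As a
rough sketch of how such a dictionary might be built for another language, the
snippet below normalizes raw counts per group. The counts are made up for
illustration, and `normalize_counts` is a hypothetical helper, not part of
WordSiv or the author's own script.

```python
def normalize_counts(counts: dict) -> dict:
    """Scale raw counts so the resulting probabilities sum to (about) 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    # Three decimals matches the precision of the dictionary shown above
    return {key: round(value / total, 3) for key, value in counts.items()}


# Made-up counts, NOT measured from real German text
raw_counts = {
    "insert": {" ": 60, ", ": 30, ": ": 10},
    "wrap_sent": {("", "."): 90, ("", "!"): 4, ("", "?"): 6},
    "wrap_inner": {("", ""): 80, ("(", ")"): 15, ("“", "”"): 5},
}

# Same three-group shape as the punctuation dictionary passed to Vocab
punc = {group: normalize_counts(counts) for group, counts in raw_counts.items()}

# Sanity check: each group should sum to roughly 1 after rounding
for group, probs in punc.items():
    assert abs(sum(probs.values()) - 1.0) < 0.01, group

print(punc["wrap_sent"])
```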

### Contributing Vocabs to WordSiv

WordSiv is only as good as the Vocabs (and punctuation dictionaries!) that are
available to it, and we'd love help improving language support. If you're
interested, feel free to [create an issue on the GitHub
repo](https://github.com/tallpauley/wordsiv/issues). You don't even have to be
a programmer: we just need native speakers to help us construct useful Vocabs.
However, if you are looking to learn some programming, building wordlists and
punctuation dictionaries can be a fun first project (and I'd be glad to help!).

My long-term vision is to build a community-maintained project (outside of
WordSiv) that has a huge selection of multilingual proofing text, wordlists,
punctuation, etc., along with resources and code that help the global type
community more easily leverage the language data that is commonplace in
NLP/linguistics/engineering circles. A lot of the source data
[already](https://github.com/simoncozens/gobbet)
[exists](https://cldr.unicode.org/); it just needs to be adapted for the
needs/tooling of type designers.

[^1]: Licensing for wordlists is a bit odd, because they're often built by
crawling a bunch of data with all kinds of licenses. I'm just doing my best here
to respect licenses where I can!
[^2]: I'd recommend deriving punctuation frequencies for the target language
from [real text][leipzig], and normalizing the counts to probabilities between
0 and 1. I have a script that builds these dictionaries, which I hope to publish
soon!

[leipzig]: https://wortschatz.uni-leipzig.de/en/
[vocab-data]:
https://github.com/tallpauley/wordsiv/tree/main/wordsiv/_vocab_data
[punctuation-py]:
https://github.com/tallpauley/wordsiv/tree/main/wordsiv/_punctuation.py
[hermit-de]:
https://github.com/hermitdave/FrequencyWords/blob/master/content/2016/de/de_50k.txt
This file was deleted.
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "wordsiv"
-version = "0.2.4"
+version = "0.2.5"
 description = "Generate text with a limited character set for font proofing"
 authors = ["Chris Pauley <[email protected]>"]
 license = "MIT"