Showing 344 changed files with 39,438 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
Apache Software License 2.0 | ||
|
||
Copyright (c) 2023, Peiqin Lin, Chengzhi Hu | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models | ||
|
||
[arXiv](https://arxiv.org/abs/2305.13684) | ||
|
||
mplm-sim is a language similarity tool providing: | ||
|
||
- `Loader`: access precomputed language similarity results directly. | ||
- `Executor`: compute similarity results from scratch for a given model and corpus. | ||
|
||
## Quickstart | ||
|
||
Clone the repository to use it directly, or install the package from PyPI: | ||
|
||
`pip install mplm_sim` | ||
|
||
or install directly from GitHub with pip: | ||
|
||
`pip install --upgrade git+https://github.com/cisnlp/mPLM-Sim.git#egg=mplm_sim` | ||
|
||
## Loader | ||
|
||
```python | ||
from mplm_sim import Loader | ||
|
||
# Load existing results for a given model_name and corpus_name | ||
loader = Loader.from_pretrained(model_name='cis-lmu/glot500-base', corpus_name='flores200') | ||
# Or load results from a similarity file | ||
# loader = Loader.from_tsv('your_similarity_file.tsv') | ||
|
||
# Get the similarity for a language pair | ||
# by ISO 639-3 code plus script (iso3_script) | ||
sim = loader.get_sim('eng_Latn', 'cmn_Hani') | ||
# or by language name | ||
sim = loader.get_sim('English', 'Chinese') | ||
``` | ||
|
||
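To illustrate the kind of lookup `Loader.get_sim` performs, here is a minimal standalone sketch that does not use `mplm_sim` at all. It assumes a hypothetical three-column TSV layout (`lang_a`, `lang_b`, `similarity`); the file format the package actually reads may differ, and the scores are placeholders, not real results.

```python
import csv
import io

# Hypothetical layout: one tab-separated row per language pair.
# The scores below are placeholders, not measured similarities.
SAMPLE_TSV = "eng_Latn\tdeu_Latn\t0.91\neng_Latn\tcmn_Hani\t0.62\n"


def read_similarities(handle):
    """Return a dict mapping an unordered language pair to its score."""
    table = {}
    for lang_a, lang_b, score in csv.reader(handle, delimiter="\t"):
        # frozenset makes the key independent of argument order.
        table[frozenset((lang_a, lang_b))] = float(score)
    return table


def get_sim(table, lang_a, lang_b):
    """Look up a pair regardless of the order the languages are given in."""
    return table[frozenset((lang_a, lang_b))]


table = read_similarities(io.StringIO(SAMPLE_TSV))
print(get_sim(table, "cmn_Hani", "eng_Latn"))
```

Keying on a `frozenset` treats similarity as symmetric, which is an assumption of this sketch rather than a documented property of the package.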
## Executor | ||
|
||
```python | ||
from mplm_sim import Executor | ||
|
||
# model_name: any text/speech language model supported by Hugging Face | ||
# corpus_name: corpus name used when saving results | ||
# corpus_path: path to the multi-parallel corpora; see corpora_demo for the file format | ||
# corpus_type: 'text' or 'speech' | ||
executor = Executor(model_name='cis-lmu/glot500-base', corpus_name='own', | ||
corpus_path='corpora/own', corpus_type='text') | ||
|
||
# Run | ||
executor.run() | ||
``` | ||
|
||
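The README does not say how `Executor` turns a multi-parallel corpus into a similarity score. One plausible reading, offered here only as an assumption, is to embed each aligned sentence with the model and average the cosine similarity across sentence pairs. The sketch below shows that computation on toy vectors; the function names are hypothetical and the vectors stand in for real model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_similarity(emb_a, emb_b):
    """Average cosine similarity over sentence-aligned embeddings."""
    if len(emb_a) != len(emb_b):
        raise ValueError("corpora must contain the same number of sentences")
    scores = [cosine(u, v) for u, v in zip(emb_a, emb_b)]
    return sum(scores) / len(scores)


# Toy 2-d vectors, one per aligned sentence, for two languages.
lang_a = [[1.0, 0.0], [0.0, 1.0]]
lang_b = [[1.0, 0.0], [1.0, 0.0]]
print(language_similarity(lang_a, lang_b))
```

In practice the embeddings would come from the chosen model; which layer and pooling strategy to use is left open here.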
## Citation | ||
|
||
``` | ||
@article{DBLP:journals/corr/abs-2305-13684, | ||
author = {Peiqin Lin and | ||
Chengzhi Hu and | ||
Zheyu Zhang and | ||
Andr{\'{e}} F. T. Martins and | ||
Hinrich Sch{\"{u}}tze}, | ||
title = {mPLM-Sim: Unveiling Better Cross-Lingual Similarity and Transfer in | ||
Multilingual Pretrained Language Models}, | ||
journal = {CoRR}, | ||
volume = {abs/2305.13684}, | ||
year = {2023}, | ||
url = {https://doi.org/10.48550/arXiv.2305.13684}, | ||
doi = {10.48550/ARXIV.2305.13684}, | ||
eprinttype = {arXiv}, | ||
eprint = {2305.13684}, | ||
timestamp = {Mon, 05 Jun 2023 15:42:15 +0200}, | ||
biburl = {https://dblp.org/rec/journals/corr/abs-2305-13684.bib}, | ||
bibsource = {dblp computer science bibliography, https://dblp.org} | ||
} | ||
``` | ||
|
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
马萨 (Massa) 至少在 2009 赛季的其余比赛中都不会出场。 | ||
皮特曼认为天气情况要到下周才能改善。 | ||
亚马逊河是世界上第二长,也是最大的河流。 它的水量是第二大河流的 8 倍以上。 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
Massa wird für den Rest der Saison 2009 ausfallen. | ||
Pittman deutete an, dass sich die Bedingungen erst irgendwann in der nächsten Woche verbessern würden. | ||
Der Amazonas ist der zweitlängste und der wasserreichste Fluss der Erde. Er führt mehr als achtmal so wie Wasser wie der zweitgrößte Strom. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
Massa is due to be out for at least the rest of the 2009 season. | ||
Pittman suggested that conditions wouldn't improve until sometime next week. | ||
The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river. |
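The demo files in this commit hold one sentence per line, and line i in each file expresses the same content. Below is a minimal sketch of a sanity check for that alignment before running a similarity computation. It works on in-memory strings so it runs anywhere; the language keys and the `load_parallel` helper are hypothetical, not part of `mplm_sim`.

```python
def load_parallel(texts_by_lang):
    """Split each language's text into lines and verify the line counts match."""
    corpus = {lang: text.splitlines() for lang, text in texts_by_lang.items()}
    lengths = {lang: len(lines) for lang, lines in corpus.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"corpora are not line-aligned: {lengths}")
    return corpus


# The first two sentences of the English and German demo files.
corpus = load_parallel({
    "eng_Latn": (
        "Massa is due to be out for at least the rest of the 2009 season.\n"
        "Pittman suggested that conditions wouldn't improve until sometime next week.\n"
    ),
    "deu_Latn": (
        "Massa wird für den Rest der Saison 2009 ausfallen.\n"
        "Pittman deutete an, dass sich die Bedingungen erst irgendwann "
        "in der nächsten Woche verbessern würden.\n"
    ),
})

# Iterate over aligned sentence pairs.
for eng, deu in zip(corpus["eng_Latn"], corpus["deu_Latn"]):
    print(eng, "|||", deu)
```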
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
from mplm_sim.loader import Loader | ||
from mplm_sim.executor import Executor |