Commit

init
lpq29743 committed Jan 19, 2024
1 parent fa30723 commit 6bec65a
Showing 344 changed files with 39,438 additions and 0 deletions.
16 changes: 16 additions & 0 deletions LICENSE
@@ -0,0 +1,16 @@
Apache Software License 2.0

Copyright (c) 2023, Peiqin Lin, Chengzhi Hu

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

76 changes: 76 additions & 0 deletions README.md
@@ -0,0 +1,76 @@
# mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models

[![arXiv](https://img.shields.io/badge/arXiv-2305.13684-b31b1b.svg)](https://arxiv.org/abs/2305.13684)

mplm-sim is a language similarity tool providing:

- `Loader`: Accessing high-quality language similarity results directly.
- `Executor`: Obtaining similarity results from scratch.

## Quickstart

Download the repository to use it directly, or install the package from PyPI:

`pip install mplm_sim`

or install it with pip directly from GitHub:

`pip install --upgrade git+https://github.com/cisnlp/mPLM-Sim.git#egg=mplm_sim`

## Loader

```python
from mplm_sim import Loader

# Load existing results for a given model_name and corpus_name
loader = Loader.from_pretrained(model_name='cis-lmu/glot500-base', corpus_name='flores200')
# Or load results from your own similarity file
# loader = Loader.from_tsv('your_similarity_file.tsv')

# Get the similarity for a language pair,
# identified by ISO 639-3 code plus script (iso3_script)
sim = loader.get_sim('eng_Latn', 'cmn_Hani')
# or by language name
sim = loader.get_sim('English', 'Chinese')
```
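
A common use of pairwise scores is picking a source language for cross-lingual transfer. The sketch below ranks candidate languages by similarity to a target. It assumes that `get_sim` returns a number where higher means more similar; the README does not state this. `rank_sources`, `toy_get_sim`, and the scores in `_TOY_SCORES` are hypothetical stand-ins so the example runs without the package or a model download.

```python
from typing import Callable, Iterable, List, Tuple


def rank_sources(
    target: str,
    candidates: Iterable[str],
    get_sim: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return candidates sorted from most to least similar to the target.

    Assumes a higher score means "more similar" (an assumption, see above).
    """
    scored = [(lang, get_sim(lang, target)) for lang in candidates if lang != target]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only; these are NOT mPLM-Sim results.
_TOY_SCORES = {
    frozenset(("eng_Latn", "deu_Latn")): 0.9,
    frozenset(("eng_Latn", "cmn_Hani")): 0.4,
    frozenset(("deu_Latn", "cmn_Hani")): 0.3,
}


def toy_get_sim(lang_a: str, lang_b: str) -> float:
    """Hypothetical stand-in for loader.get_sim, backed by the toy table."""
    return _TOY_SCORES[frozenset((lang_a, lang_b))]


ranking = rank_sources("eng_Latn", ["deu_Latn", "cmn_Hani", "eng_Latn"], toy_get_sim)
print(ranking)
```

With the package installed, passing `loader.get_sim` in place of `toy_get_sim` should work the same way, provided the assumption about score direction holds.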

## Executor

```python
from mplm_sim import Executor

# model_name: any text/speech language model supported by Hugging Face
# corpus_name: name under which results for this corpus are saved
# corpus_path: path to the multi-parallel corpora, see corpora_demo for the file format
# corpus_type: 'text' or 'speech'
executor = Executor(model_name='cis-lmu/glot500-base', corpus_name='own',
                    corpus_path='corpora/own', corpus_type='text')

# Run
executor.run()
```
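
The text files under `corpora_demo/text` in this commit suggest the expected layout for a text corpus: one file per language named `<iso3>_<Script>.txt`, one sentence per line, with line *i* holding the same sentence in every file. The sketch below writes a small corpus in that layout and checks the alignment. The layout is inferred from the demo files, not from a documented specification, and the helpers `write_parallel_corpus` and `count_aligned_lines` are hypothetical names, not part of `mplm_sim`. The speech demo files are pickles whose structure is not shown here, so they are not covered.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_parallel_corpus(corpus_dir: Path, sentences_by_lang: Dict[str, List[str]]) -> Path:
    """Write one '<lang>.txt' file per language, one sentence per line."""
    lengths = {len(sentences) for sentences in sentences_by_lang.values()}
    if len(lengths) != 1:
        raise ValueError("every language needs the same number of sentences")
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for lang, sentences in sentences_by_lang.items():
        if any("\n" in sentence for sentence in sentences):
            raise ValueError(f"{lang}: a sentence must not contain a line break")
        text = "\n".join(sentences) + "\n"
        (corpus_dir / f"{lang}.txt").write_text(text, encoding="utf-8")
    return corpus_dir


def count_aligned_lines(corpus_dir: Path) -> int:
    """Return the shared line count, or raise if the files differ in length."""
    counts = {
        path.name: len(path.read_text(encoding="utf-8").splitlines())
        for path in sorted(corpus_dir.glob("*.txt"))
    }
    if len(set(counts.values())) > 1:
        raise ValueError(f"files are not line-aligned: {counts}")
    return next(iter(counts.values()), 0)


# Sentences taken from the corpora_demo/text files in this commit.
demo = {
    "eng_Latn": [
        "Massa is due to be out for at least the rest of the 2009 season.",
        "Pittman suggested that conditions wouldn't improve until sometime next week.",
    ],
    "deu_Latn": [
        "Massa wird für den Rest der Saison 2009 ausfallen.",
        "Pittman deutete an, dass sich die Bedingungen erst irgendwann in der nächsten Woche verbessern würden.",
    ],
    "cmn_Hani": [
        "马萨 (Massa) 至少在 2009 赛季的其余比赛中都不会出场。",
        "皮特曼认为天气情况要到下周才能改善。",
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = write_parallel_corpus(Path(tmp) / "own", demo)
    n_lines = count_aligned_lines(corpus_dir)
    file_names = sorted(path.name for path in corpus_dir.glob("*.txt"))

print(n_lines, file_names)
```

Running a check like `count_aligned_lines` before `executor.run()` catches misaligned files early, which matters because a multi-parallel corpus is only meaningful when each line index refers to the same sentence in every language.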

## Citation

```
@article{DBLP:journals/corr/abs-2305-13684,
  author       = {Peiqin Lin and
                  Chengzhi Hu and
                  Zheyu Zhang and
                  Andr{\'{e}} F. T. Martins and
                  Hinrich Sch{\"{u}}tze},
  title        = {mPLM-Sim: Unveiling Better Cross-Lingual Similarity and Transfer in
                  Multilingual Pretrained Language Models},
  journal      = {CoRR},
  volume       = {abs/2305.13684},
  year         = {2023},
  url          = {https://doi.org/10.48550/arXiv.2305.13684},
  doi          = {10.48550/ARXIV.2305.13684},
  eprinttype   = {arXiv},
  eprint       = {2305.13684},
  timestamp    = {Mon, 05 Jun 2023 15:42:15 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2305-13684.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```

Binary file added corpora_demo/speech/cmn_Hani.pickle
Binary file not shown.
Binary file added corpora_demo/speech/deu_Latn.pickle
Binary file not shown.
Binary file added corpora_demo/speech/eng_Latn.pickle
Binary file not shown.
3 changes: 3 additions & 0 deletions corpora_demo/text/cmn_Hani.txt
@@ -0,0 +1,3 @@
马萨 (Massa) 至少在 2009 赛季的其余比赛中都不会出场。
皮特曼认为天气情况要到下周才能改善。
亚马逊河是世界上第二长,也是最大的河流。 它的水量是第二大河流的 8 倍以上。
3 changes: 3 additions & 0 deletions corpora_demo/text/deu_Latn.txt
@@ -0,0 +1,3 @@
Massa wird für den Rest der Saison 2009 ausfallen.
Pittman deutete an, dass sich die Bedingungen erst irgendwann in der nächsten Woche verbessern würden.
Der Amazonas ist der zweitlängste und der wasserreichste Fluss der Erde. Er führt mehr als achtmal so viel Wasser wie der zweitgrößte Strom.
3 changes: 3 additions & 0 deletions corpora_demo/text/eng_Latn.txt
@@ -0,0 +1,3 @@
Massa is due to be out for at least the rest of the 2009 season.
Pittman suggested that conditions wouldn't improve until sometime next week.
The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river.
2 changes: 2 additions & 0 deletions mplm_sim/__init__.py
@@ -0,0 +1,2 @@
from mplm_sim.loader import Loader
from mplm_sim.executor import Executor