Commit

re-packaging + prepare release
adbar committed Nov 23, 2021
1 parent 23060db commit e786c26
Showing 7 changed files with 65 additions and 94 deletions.
13 changes: 0 additions & 13 deletions .gitignore
@@ -103,16 +103,3 @@ ENV/

# IDE settings
.vscode/

# lists
simplemma/lists/

# Not ready yet
docs/
AUTHORS.rst
CONTRIBUTING.rst
Makefile
.github/

# eval
UD/
10 changes: 10 additions & 0 deletions HISTORY.rst
@@ -0,0 +1,10 @@
=======
History
=======


0.1.0
-----

* Fork re-packaged
* Efficiency improvements in ``langid.py``
29 changes: 16 additions & 13 deletions LICENSE
@@ -1,19 +1,26 @@
langid.py -
Language Identifier by Marco Lui April 2011
py3langid - Language Identifier
BSD 3-Clause License

Modifications (fork): Copyright (c) 2021, Adrien Barbaresi.

Original code: Copyright (c) 2011 Marco Lui <[email protected]>.
Based on research by Marco Lui and Tim Baldwin.

Copyright 2011 Marco Lui <[email protected]>. All rights reserved.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

2. Redistributions in binary form must reproduce the above copyright notice, this list
of conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER ``AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
@@ -23,8 +30,4 @@ CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the
authors and should not be interpreted as representing official policies, either expressed
or implied, of the copyright holder.
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
14 changes: 14 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,14 @@
#include AUTHORS.rst
#include CONTRIBUTING.rst
#include CITATION.cff
#include FEATURES
include HISTORY.rst
include LICENSE
include README.rst

recursive-exclude * __pycache__
recursive-exclude * *.py[co]
include tests/

# recursive-include conf.py Makefile make.bat *.jpg *.png *.gif
# recursive-include docs *.rst
90 changes: 25 additions & 65 deletions README.rst
@@ -1,6 +1,23 @@
================
``langid.py`` readme
================
=============
``py3langid``
=============


Changes in this fork
--------------------

``py3langid`` is a fork of the standalone language identification tool ``langid.py`` by Marco Lui.

Drop in replacement: ``import py3langid as langid``.

The classification functions have been modernized, thanks to implementation changes language detection with Python (``langid.classify``) is currently 2.5x faster.

The readme below is provided for reference, for now only the classification functions are tested and maintained.

Original license: BSD-2-Clause.
Fork license: BSD-3-Clause.



Introduction
------------
@@ -15,9 +32,9 @@ The design principles are as follows:
4. Single .py file with minimal dependencies
5. Deployable as a web service

All that is required to run ``langid.py`` is >= Python 2.7 and numpy.
The main script ``langid/langid.py`` is cross-compatible with both Python2 and
Python3, but the accompanying training tools are still Python2-only.
All that is required to run ``langid.py`` is Python >= 3.6 and numpy.

The accompanying training tools are still Python2-only.

``langid.py`` is WSGI-compliant. ``langid.py`` will use ``fapws3`` as a web server if
available, and default to ``wsgiref.simple_server`` otherwise.
@@ -311,10 +328,8 @@ It is also possible to edit ``langid.py`` directly to embed the new model string

Read more
---------
``langid.py`` is based on our published research. [1] describes the LD feature selection technique in detail,
and [2] provides more detail about the module ``langid.py`` itself. [3] compares the speed of ``langid.py``
to Google's Chrome CLD2, as well as my own pure-C implementation and the authors' implementation on specialized
hardware.
``langid.py`` is based on published research. [1] describes the LD feature selection technique in detail,
and [2] provides more detail about the module ``langid.py`` itself.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification,
In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011),
@@ -323,58 +338,3 @@ Chiang Mai, Thailand, pp. 553—561. Available from http://www.aclweb.org/anthol
[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool,
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012),
Demo Session, Jeju, Republic of Korea. Available from www.aclweb.org/anthology/P12-3005

[3] Kenneth Heafield and Rohan Kshirsagar and Santiago Barona (2015) Language Identification and Modeling in Specialized Hardware,
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Volume 2: Short Papers).
Available from http://aclweb.org/anthology/P15-2063

Contact
-------
Marco Lui <[email protected]>

I appreciate any feedback, and I'm particularly interested in hearing about
places where ``langid.py`` is being used. I would love to know more about
situations where you have found that ``langid.py`` works well, and about
any shortcomings you may have found.

Acknowledgements
----------------
Thanks to aitzol for help with packaging ``langid.py`` for PyPI.
Thanks to pquentin for suggestions and improvements to packaging.

Related Implementations
-----------------------
Dawid Weiss has ported ``langid.py`` to Java, with a particular focus on
speed and memory use. Available from https://github.com/carrotsearch/langid-java

I have written a Pure-C version of ``langid.py``, which an external evaluation (see `Read more`)
has found to be up to 20x as fast as the pure Python implementation here.
Available from https://github.com/saffsd/langid.c

I have also written a JavaScript version of ``langid.py`` which runs entirely in the browser.
Available from https://github.com/saffsd/langid.js

Changelog
---------
v1.0:
* Initial release

v1.1:
* Reorganized internals to implement a LanguageIdentifier class

v1.1.2:
* Added a 'langid' entry point

v1.1.3:
* Made `classify` and `rank` return Python data types rather than numpy ones

v1.1.4:
* Added set_languages to __init__.py, fixing #10 (and properly fixing #8)

v1.1.5:
* remove dev tag
* add PyPi classifiers, fixing #34 (thanks to pquentin)

v1.1.6:
* make nb_numfeats an int, fixes #46, thanks to @remibolcom
1 change: 0 additions & 1 deletion py3langid/.gitignore

This file was deleted.

2 changes: 0 additions & 2 deletions setup.py
@@ -2,8 +2,6 @@
from pathlib import Path
from setuptools import setup

from setuptools import setup



def get_version(package):
