# MLonCode Core Library

[Build Status](https://travis-ci.org/src-d/ml-core)
[Code Coverage](https://codecov.io/gh/src-d/ml-core)
[Code Style: black](https://github.com/ambv/black)

This project is the foundation for [MLonCode](https://github.com/src-d/awesome-machine-learning-on-source-code) research and development. It provides commonly used algorithms and tools to process code-related data, such as Babelfish UASTs and plain code text, and it abstracts feature extraction and model training, allowing you to focus on higher-level tasks.

Currently, the following models are implemented:

* BOW - weighted bag of x, where x is one of several extracted feature types.
* id2vec - source code identifier embeddings.
* docfreq - feature document frequencies (part of TF-IDF).
* topic modeling over source code identifiers.

It is written in Python 3 and has been tested on Linux and macOS. source{d} ml-core is tightly coupled with [source{d} engine](https://engine.sourced.tech) and delegates all the feature extraction parallelization to it.

Here is the list of proof-of-concept projects which are built using ml-core:

* [vecino](https://github.com/src-d/vecino) - finding similar repositories.
* [tmsc](https://github.com/src-d/tmsc) - listing the topics of a repository.
* [snippet-ranger](https://github.com/src-d/snippet-ranger) - topic modeling of source code snippets.
* [apollo](https://github.com/src-d/apollo) - source code deduplication at scale.

## Installation

Whether you wish to include Spark in your installation or would rather use an existing one, to use `sourced-ml` you will need some native libraries installed; e.g. on Ubuntu you must first run `apt install libxml2-dev libsnappy-dev`. [Tensorflow](https://tensorflow.org) is also a requirement - both the CPU and the GPU versions are supported. To select the version you want, modify the package name in the next section to either `sourced-ml[tf]` or `sourced-ml[tf-gpu]`. **If you don't, neither version will be installed.**

## Docker image

```text
docker run -it --rm srcd/ml --help
```

If this first command fails with

```text
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
```

and you are sure that the daemon is running, then you need to add your user to the `docker` group: refer to the [documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

## Contributions

...are welcome! See [CONTRIBUTING.md](contributing.md) and [CODE_OF_CONDUCT.md](code_of_conduct.md).

## License

[Apache 2.0](license.md)

## Algorithms

#### Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository:

1. Read Git repositories.
2. Classify files using [enry](https://github.com/src-d/enry).
3. Extract a [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
4. [Split and stem](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/token_parser.py) all the identifiers in each tree.
5. [Traverse the UAST](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/transformers/coocc.py), collapse all non-identifier paths, and record all identifiers on the same level as co-occurring; in addition, connect them with their immediate parents.
6. Write the global co-occurrence matrix.
7. Train the embeddings using [Swivel](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/swivel.py) (requires Tensorflow). Interactively view the intermediate results in Tensorboard using `--logs`.
8. Write the identifier embeddings model.

Steps 1-5 are performed by the `repos2coocc` command, step 6 by `id2vec_preproc`, step 7 by `id2vec_train`, and step 8 by `id2vec_postproc`.

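The identifier splitting in step 4 can be sketched as follows. `split_identifier` is a hypothetical stand-in for the library's token parser, shown only to illustrate breaking identifiers into sub-tokens; stemming is omitted:

```python
import re

def split_identifier(token):
    """Split a source code identifier into lowercase sub-tokens.

    Handles snake_case, camelCase/PascalCase, acronym runs, and digits.
    A simplified illustration, not the library's actual TokenParser.
    """
    parts = []
    for chunk in token.split("_"):
        # Words, acronym runs, and digit runs inside each chunk.
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk))
    return [p.lower() for p in parts]

print(split_identifier("readFileToUAST"))   # ['read', 'file', 'to', 'uast']
print(split_identifier("snake_case_name"))  # ['snake', 'case', 'name']
```

Sub-tokens produced this way are what the co-occurrence matrix is built over, so `readFile` and `file_reader` share the vocabulary entries `read` and `file`.
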
#### Weighted Bag of X

We represent every repository as a weighted bag-of-vectors, provided that we have document frequencies ("docfreq") and identifier embeddings ("id2vec"):

1. Clone or read the repository from disk.
2. Classify files using [enry](https://github.com/src-d/enry).
3. Extract a [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.
5. Group by repository, file or function.
6. Set the weight of each such feature according to TF-IDF.
7. Write the BOW model.

Steps 1-7 are performed by the `repos2bow` command.

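The TF-IDF weighting in step 6 can be sketched as below. `tfidf_weights` and its arguments are hypothetical names used purely for illustration; the library's exact weighting scheme may differ (e.g. log-normalized term frequency or smoothed IDF):

```python
import math
from collections import Counter

def tfidf_weights(doc_features, docfreq, n_docs):
    """Weight one document's features by TF-IDF.

    doc_features: list of feature tokens extracted from one document.
    docfreq: mapping from feature to the number of documents containing it.
    n_docs: total number of documents in the corpus.
    Hypothetical helper for illustration; not the library's API.
    """
    counts = Counter(doc_features)
    total = sum(counts.values())
    # TF = relative frequency in this document; IDF = log(N / docfreq).
    return {
        feat: (n / total) * math.log(n_docs / docfreq[feat])
        for feat, n in counts.items()
    }

weights = tfidf_weights(["read", "uast", "uast"], {"read": 2, "uast": 1}, n_docs=2)
# "read" occurs in every document of this toy corpus, so its weight is zero;
# "uast" is rarer and receives a positive weight.
```

Features whose weight is zero (or below a threshold) carry no discriminative information and can be dropped from the bag.
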
#### Topic modeling

See [here](doc/topic_modeling.md).

## Glossary

See [here](GLOSSARY.md).