Commit e26706e

Guillem Duran committed
Change project name and remove mentions to spark
Signed-off-by: Guillem Duran <[email protected]>
1 parent 53ec3a0 commit e26706e

3 files changed: 9 additions and 96 deletions

README.md

Lines changed: 6 additions & 94 deletions
@@ -1,95 +1,7 @@
-# MLonCode research playground [![PyPI](https://img.shields.io/pypi/v/sourced-ml.svg)](https://pypi.python.org/pypi/sourced-ml) [![Build Status](https://travis-ci.org/src-d/ml.svg)](https://travis-ci.org/src-d/ml) [![Docker Build Status](https://img.shields.io/docker/build/srcd/ml.svg)](https://hub.docker.com/r/srcd/ml) [![codecov](https://codecov.io/github/src-d/ml/coverage.svg)](https://codecov.io/gh/src-d/ml)
+# MLonCode Core Library
+[![Build Status](https://travis-ci.org/src-d/ml-core.svg)](https://travis-ci.org/src-d/ml-core)
+[![codecov](https://codecov.io/github/src-d/ml-core/coverage.svg)](https://codecov.io/gh/src-d/ml-core)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
 
-This project is the foundation for [MLonCode](https://github.com/src-d/awesome-machine-learning-on-source-code) research and development. It abstracts feature extraction and training models, thus allowing to focus on the higher level tasks.
-
-Currently, the following models are implemented:
-
-* BOW - weighted bag of x, where x is many different extracted feature types.
-* id2vec, source code identifier embeddings.
-* docfreq, feature document frequencies \(part of TF-IDF\).
-* topic modeling over source code identifiers.
-
-It is written in Python3 and has been tested on Linux and macOS. source{d} core-ml is tightly
-coupled with [source{d} engine](https://engine.sourced.tech) and delegates all the feature extraction parallelization to it.
-
-Here is the list of proof-of-concept projects which are built using ml-core:
-
-* [vecino](https://github.com/src-d/vecino) - finding similar repositories.
-* [tmsc](https://github.com/src-d/tmsc) - listing topics of a repository.
-* [snippet-ranger](https://github.com/src-d/snippet-ranger) - topic modeling of source code snippets.
-* [apollo](https://github.com/src-d/apollo) - source code deduplication at scale.
-
-## Installation
-
-Whether you wish to include Spark in your installation or would rather use an existing
-installation, to use `sourced-ml` you will need to have some native libraries installed,
-e.g. on Ubuntu you must first run: `apt install libxml2-dev libsnappy-dev`. [Tensorflow](https://tensorflow.org)
-is also a requirement - we support both the CPU and GPU version.
-In order to select which version you want, modify the package name in the next section
-to either `sourced-ml[tf]` or `sourced-ml[tf-gpu]` depending on your choice.
-**If you don't, neither version will be installed.**
-
-## Docker image
-
-```text
-docker run -it --rm srcd/ml --help
-```
-
-If this first command fails with
-
-```text
-Cannot connect to the Docker daemon. Is the docker daemon running on this host?
-```
-
-And you are sure that the daemon is running, then you need to add your user to `docker` group: refer to the [documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
-
-## Contributions
-
-...are welcome! See [CONTRIBUTING](contributing.md) and [CODE\_OF\_CONDUCT.md](code_of_conduct.md).
-
-## License
-
-[Apache 2.0](license.md)
-
-## Algorithms
-
-#### Identifier embeddings
-
-We build the source code identifier co-occurrence matrix for every repository.
-
-1. Read Git repositories.
-2. Classify files using [enry](https://github.com/src-d/enry).
-3. Extract [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
-4. [Split and stem](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/token_parser.py) all the identifiers in each tree.
-5. [Traverse UAST](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/transformers/coocc.py), collapse all non-identifier paths and record all
-identifiers on the same level as co-occurring. Besides, connect them with their immediate parents.
-
-6. Write the global co-occurrence matrix.
-7. Train the embeddings using [Swivel](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/swivel.py) \(requires Tensorflow\). Interactively view
-the intermediate results in Tensorboard using `--logs`.
-
-8. Write the identifier embeddings model.
-
-1-5 is performed with `repos2coocc` command, 6 with `id2vec_preproc`, 7 with `id2vec_train`, 8 with `id2vec_postproc`.
-
-#### Weighted Bag of X
-
-We represent every repository as a weighted bag-of-vectors, provided by we've got document frequencies \("docfreq"\) and identifier embeddings \("id2vec"\).
-
-1. Clone or read the repository from disk.
-2. Classify files using [enry](https://github.com/src-d/enry).
-3. Extract [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
-4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.
-5. Group by repository, file or function.
-6. Set the weight of each such feature according to TF-IDF.
-7. Write the BOW model.
-
-1-7 are performed with `repos2bow` command.
-
-#### Topic modeling
-
-See [here](doc/topic_modeling.md).
-
-## Glossary
-
-See [here](GLOSSARY.md).
+Library for machine learning on source code. Provides commonly used algorithms and tools
+to process the code-related data, such as: Babelfish's UASTs, plain code text, etc.
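
Note on the removed README text: step 6 of its "Weighted Bag of X" section sets each feature's weight according to TF-IDF, using document frequencies ("docfreq"). As a rough illustration of that single step, here is a minimal sketch of the plain textbook variant, `tf * log(N / df)`. The function name `tfidf_bags`, the input layout, and the sample tokens are hypothetical; this is not taken from the `sourced-ml` code, whose exact weighting formula may differ.

```python
import math
from collections import Counter


def tfidf_bags(docs):
    """Weight features per document with textbook TF-IDF.

    docs: mapping of document name -> list of feature tokens
    (a hypothetical layout, chosen only for this sketch).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each feature occurs.
    docfreq = Counter()
    for tokens in docs.values():
        docfreq.update(set(tokens))

    bags = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        # Weight = raw term frequency * log(N / document frequency).
        bags[name] = {
            feature: count * math.log(n_docs / docfreq[feature])
            for feature, count in term_freq.items()
        }
    return bags


# Two toy "repositories" with already-extracted identifier tokens.
docs = {
    "repo_a": ["read", "file", "read", "parse"],
    "repo_b": ["write", "file"],
}
bags = tfidf_bags(docs)
```

A feature that appears in every document, such as `file` here, gets weight zero under this variant, so only the features that distinguish a repository contribute to its bag.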

contributing.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 # CONTRIBUTING
 
-sourced.ml project is [Apache licensed](license.md) and accepts contributions via GitHub pull requests. This document outlines some of the conventions on development workflow, commit message formatting, contact points, and other resources to make it easier to get your contribution accepted.
+ml-core project is [Apache licensed](license.md) and accepts contributions via GitHub pull
+requests. This document outlines some of the conventions on development workflow, commit message formatting, contact points, and other resources to make it easier to get your contribution accepted.
 
 ## Certificate of Origin
 

maintainers.md

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 
 Vadim Markovtsev [[email protected]](mailto:[email protected]) \(@vmarkovtsev\)
 
-Guillem Duran [[email protected]](mailto:[email protected]) \(@guillemdb)
+Guillem Duran [[email protected]](mailto:[email protected]) \(@guillemdb\)
