Commit 53ec3a0

Author: guillemdb
Initial commit. Add markdown documents
Signed-off-by: guillemdb <[email protected]>

0 parents, commit 53ec3a0
8 files changed: +505 -0 lines

DCO

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

GLOSSARY.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
## Abstract Syntax Tree
An abstract syntax tree is a tree representing the abstract syntactic structure of a program written in a programming language.
Because not all details of the real syntax are present in an AST, it is "abstract" rather than "concrete".
In the tree, the branches and nodes represent structural relationships between the syntactic elements of the program it is based on.
ASTs from different languages will have different features, so they are not language agnostic.

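For a quick illustration outside the sourced.ml pipeline, Python's standard `ast` module can print such a tree for a one-line program:

```python
import ast

# Parse a small Python snippet and dump its AST. The nesting shows the
# structural relationships; concrete details such as whitespace are gone.
tree = ast.parse("total = price * quantity + 1")
print(ast.dump(tree))
```
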
## AST
See [Abstract Syntax Tree](#abstract-syntax-tree).

## Bag-of-words model
A bag-of-words model represents text as a "bag" of the words it contains. It discards information about text structure, grammar and word order, but preserves [multiplicity](https://en.wikipedia.org/wiki/Multiplicity_(mathematics)), that is, the number of occurrences of each word in the text.

A `bow` model refers to a special type of bag-of-words model, described below.

### Weighted bag-of-words model
A `bow` model is an instance of a weighted bag-of-words model. In a weighted bag-of-words model, each word in the bag is weighted using some algorithm.
For the `bow` model, every bag is a feature extracted from source code and the associated weight is calculated using [TF-IDF](#term-frequency-inverse-document-frequency).

For more information on the `bow` model, see the documentation [here](https://docs.sourced.tech/models#bow).

### Weighted bag-of-X model
A bag-of-words model can be generalized to a bag-of-X model.
These models, sometimes called bag-of-features models, can hold any uniform feature type.
For example, it is possible to store information about some feature of a document in a vector and then dump the vectors into a bag-of-vectors.
Given document frequencies and identifier embeddings, it is possible to represent a repository as a weighted bag-of-vectors.

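A minimal sketch of the idea on toy data (the feature names and weights below are made up; in the real `bow` model the weights come from TF-IDF):

```python
from collections import Counter

# Toy "documents": each is the list of features extracted from one file.
doc_a = ["read", "file", "read", "buffer"]
doc_b = ["read", "socket", "buffer", "buffer"]

bag_a = Counter(doc_a)            # plain bag-of-words: feature -> count
bag_b = Counter(doc_b)

# Hypothetical per-feature weights (in the real bow model these come from TF-IDF).
weights = {"read": 0.2, "file": 1.1, "buffer": 0.7, "socket": 1.3}

weighted_bag_a = {feat: count * weights[feat] for feat, count in bag_a.items()}
print(bag_a)           # Counter({'read': 2, 'file': 1, 'buffer': 1})
print(weighted_bag_a)  # {'read': 0.4, 'file': 1.1, 'buffer': 0.7}
```
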
## Collection frequency
The total number of times a term appears across all documents in a collection.
See also [document frequency](#document-frequency).

## COOC
See [Co-occurrence matrix](#co-occurrence-matrix).

## Co-occurrence matrix
A square matrix whose entry `(i, j)` counts how many times elements `i` and `j` occur together in some context. In sourced.ml, the elements are source code identifiers, and the matrix records how often pairs of identifiers co-occur within the [UAST](#uast)s of a repository.

## Document
The unit of data over which term and feature frequencies are computed. In sourced.ml, a document can be a repository, a file or a function.

## Document frequency
The document frequency is defined as the number of documents in some collection of documents that contain a term or a feature.
See also [collection frequency](#collection-frequency).

The `docfreq` model represents the document frequencies of features extracted from source code; that is, how many documents (repositories, files or functions) contain each tokenized feature.

For more information on the `docfreq` model, see the documentation [here](https://docs.sourced.tech/models#docfreq).

### Inverse document frequency
The inverse document frequency is defined as `log(N/df(t))`, where `df(t)` is the document frequency of a term `t` and `N` is the number of documents in the collection.
It is used to weight a term by its document frequency.

## Features
Generally, a feature is any measurable property of data in the domain of a model.
In the context of sourced.ml, a feature is a property of the source code sample used as input to a model.
Selecting the correct features to use as inputs to a model is essential to the model's performance.

There are a number of relevant feature types used by sourced.ml:

### Identifier
A name given by a developer to a program entity such as a variable, function, class or module.

### Token
The string "atoms" generated by the parsing process, which involves splitting text into words and stemming the resulting words.

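A toy sketch of splitting and crude stemming (illustrative only, not the project's actual token parser):

```python
import re

def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # break camelCase / PascalCase boundaries
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [w.lower() for w in words if w]

def crude_stem(word):
    """A deliberately naive suffix-stripping stem, just for illustration."""
    for suffix in ("ing", "ed", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in split_identifier("readBufferedItems_total")])
# ['read', 'buffer', 'item', 'total']
```
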
### Literal
A constant value that appears verbatim in the source code, such as a string or a number.

### Graphlet
The graphlet of a UAST node is composed of the node itself, its parent and its children.

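A minimal sketch of the idea on a toy tree, with plain dicts standing in for UAST nodes:

```python
def node(token, *children):
    return {"token": token, "children": list(children)}

tree = node("FunctionDef", node("Name"), node("Body", node("Return")))

def graphlets(root, parent=None):
    """Yield (parent, node, children) triples -- one graphlet per node."""
    yield (parent["token"] if parent else None,
           root["token"],
           [c["token"] for c in root["children"]])
    for child in root["children"]:
        yield from graphlets(child, root)

for g in graphlets(tree):
    print(g)
# (None, 'FunctionDef', ['Name', 'Body'])
# ('FunctionDef', 'Name', [])
# ('FunctionDef', 'Body', ['Return'])
# ('Body', 'Return', [])
```
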
## Feature extraction
Feature extraction is the process of gathering information about [features](#features) from a set of data.

## Identifier embeddings
The `id2vec` model contains information on source code identifier embeddings; that is, every identifier is represented as a dense vector.

For more information on the `id2vec` model, see the documentation [here](https://docs.sourced.tech/models#id-2-vec).

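Schematically, an embedding model is just a vocabulary plus a dense matrix; the numbers and helper names below are made up for illustration:

```python
import numpy as np

# Hypothetical vocabulary of identifiers and a 4-dimensional embedding matrix.
tokens = ["read", "write", "buffer", "socket"]
index = {tok: i for i, tok in enumerate(tokens)}
embeddings = np.random.default_rng(0).normal(size=(len(tokens), 4))

vec = embeddings[index["buffer"]]          # the dense vector for "buffer"

# Cosine similarity between two identifiers:
a, b = embeddings[index["read"]], embeddings[index["write"]]
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```
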
## Model
A model is the artifact produced by running an analysis pipeline.
It is plain data with some methods to access it.
A model can be serialized to bytes and deserialized from bytes.
The underlying storage format is specific to [src-d/modelforge](https://github.com/src-d/modelforge)
and is currently [ASDF](https://github.com/spacetelescope/asdf)
with [lz4](https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)) compression.

## Pipeline
A tree of linked `sourced.ml.transformers.Transformer` objects that can be executed on PySpark / the source{d} engine.
The result is often written to disk as [Parquet](https://parquet.apache.org/) or model files,
or to a database.

## Quantization
Most generally, quantization is a process that maps a large set of possible inputs onto a smaller set of possible outputs.
The values of the large set may be continuous/uncountable.
For example, a vector quantizer takes as its input a vector, which encodes some set of features of a document.
The vector quantizer maps the input vector onto the nearest vector in a set of vectors.
The vectors in the output set may be thought of as the vocabulary of words that can be used; every input is mapped onto one of the words in the output set.

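A minimal vector-quantizer sketch with a hand-made codebook:

```python
import numpy as np

# A small "vocabulary" of codebook vectors (the allowed outputs).
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

def quantize(x):
    """Map an input vector to the index of the nearest codebook vector."""
    distances = np.linalg.norm(codebook - x, axis=1)
    return int(np.argmin(distances))

print(quantize(np.array([0.9, 0.2])))   # 1 -> closest to [1.0, 0.0]
print(quantize(np.array([0.6, 0.7])))   # 3 -> closest to [1.0, 1.0]
```
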
## Term frequency
The number of times a term appears in a given document.

## Term frequency inverse document frequency
A weighting scheme that combines [term frequency](#term-frequency) with [inverse document frequency](#inverse-document-frequency). It produces a composite weight for each term in each document.
The weight assigned by TF-IDF is higher when the term `t` is highly discriminating.
This occurs when `t` appears in relatively few documents, and thus has a high IDF, or when it occurs many times in the given document, and thus has a high TF.

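A small sketch of the scheme on a toy corpus (log base and smoothing conventions vary between implementations):

```python
import math
from collections import Counter

docs = {
    "doc1": ["read", "file", "read", "buffer"],
    "doc2": ["read", "socket", "send"],
    "doc3": ["parse", "file", "token"],
}

N = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for terms in docs.values() for term in set(terms))

def tfidf(doc_terms):
    tf = Counter(doc_terms)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs["doc1"]))
# "read" occurs in 2 of 3 documents -> low IDF; "buffer" only in doc1 -> high IDF.
```
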
## TF-IDF
See [Term frequency inverse document frequency](#term-frequency-inverse-document-frequency).

## Topic modeling
In machine learning, topic modeling is a type of modeling used to find abstract "topics" that occur in a collection of documents. The process is often used to automatically identify the semantic content of documents or collections of documents.

In the context of sourced.ml, topic modeling is used to identify topics of source code repositories. The `topic` model can be used to model the topics of a Git repository; all tokens are identifiers extracted from the repository or repositories. They serve as indicators of the abstract "topics" mentioned above and are used to infer the topic(s) of each repository.

For more information on the `topic` model, see the documentation [here](https://docs.sourced.tech/models#topics).

## Transformer
A `sourced.ml.transformers.Transformer` object, which serves as one of a series of potential steps in transforming source code features from one form into another.

## UAST
See [Universal Abstract Syntax Tree](#universal-abstract-syntax-tree).

## Universal Abstract Syntax Tree
A generalized version of an [abstract syntax tree](#abstract-syntax-tree).
It is further abstracted away from any concrete details of the parent program, allowing programs written in different programming languages to be converted into UASTs, which are language agnostic.

This is achieved using [Babelfish](https://docs.sourced.tech/babelfish), a universal code parser.

## Weighted MinHash
An algorithm that approximates the [Weighted Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance)
between all pairs of source code samples in linear time and space. Described by
[Sergey Ioffe](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf).

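A condensed, illustrative sketch of a single hash from Ioffe's consistent weighted sampling scheme (the helper name is made up; a real deployment draws many such hashes and compares how often two samples collide):

```python
import numpy as np

def make_hash(dim, seed=0):
    """One hash of Ioffe's (2010) consistent weighted sampling scheme.

    The random draws (r, c, beta) must be shared by every vector that is
    hashed, so they are generated once per dimension, up front.
    """
    rng = np.random.default_rng(seed)
    r = rng.gamma(2.0, 1.0, size=dim)
    c = rng.gamma(2.0, 1.0, size=dim)
    beta = rng.uniform(0.0, 1.0, size=dim)

    def sample(weights):
        w = np.asarray(weights, dtype=float)
        k = np.flatnonzero(w > 0)                 # only positive weights matter
        t = np.floor(np.log(w[k]) / r[k] + beta[k])
        ln_y = r[k] * (t - beta[k])
        ln_a = np.log(c[k]) - ln_y - r[k]
        j = int(np.argmin(ln_a))
        return int(k[j]), int(t[j])               # the weighted MinHash sample

    return sample

h = make_hash(dim=4)
print(h([0.5, 0.0, 2.0, 1.0]), h([0.5, 0.1, 2.0, 1.0]))
# Equal samples across many independent hashes => high weighted Jaccard similarity.
```
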

README.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# MLonCode research playground [![PyPI](https://img.shields.io/pypi/v/sourced-ml.svg)](https://pypi.python.org/pypi/sourced-ml) [![Build Status](https://travis-ci.org/src-d/ml.svg)](https://travis-ci.org/src-d/ml) [![Docker Build Status](https://img.shields.io/docker/build/srcd/ml.svg)](https://hub.docker.com/r/srcd/ml) [![codecov](https://codecov.io/github/src-d/ml/coverage.svg)](https://codecov.io/gh/src-d/ml)

This project is the foundation for [MLonCode](https://github.com/src-d/awesome-machine-learning-on-source-code) research and development. It abstracts feature extraction and model training, allowing you to focus on higher-level tasks.

Currently, the following models are implemented:

* BOW - weighted bag-of-x, where x can be any of several extracted feature types.
* id2vec - source code identifier embeddings.
* docfreq - feature document frequencies \(part of TF-IDF\).
* topic modeling over source code identifiers.

It is written in Python3 and has been tested on Linux and macOS. source{d} core-ml is tightly
coupled with [source{d} engine](https://engine.sourced.tech) and delegates all the feature extraction parallelization to it.

Here is the list of proof-of-concept projects built using ml-core:

* [vecino](https://github.com/src-d/vecino) - finding similar repositories.
* [tmsc](https://github.com/src-d/tmsc) - listing the topics of a repository.
* [snippet-ranger](https://github.com/src-d/snippet-ranger) - topic modeling of source code snippets.
* [apollo](https://github.com/src-d/apollo) - source code deduplication at scale.

## Installation

Whether you wish to include Spark in your installation or would rather use an existing
installation, to use `sourced-ml` you will need to have some native libraries installed;
e.g. on Ubuntu you must first run `apt install libxml2-dev libsnappy-dev`. [Tensorflow](https://tensorflow.org)
is also a requirement - we support both the CPU and GPU versions.
To select one, install the package with the corresponding extra: either `sourced-ml[tf]` or `sourced-ml[tf-gpu]`.
**If you don't, neither version will be installed.**

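A typical pip-based install would then look something like this (an illustrative sketch assuming the standard PyPI package; quote the extras so the shell does not expand the brackets):

```text
pip3 install "sourced-ml[tf]"      # CPU Tensorflow
pip3 install "sourced-ml[tf-gpu]"  # GPU Tensorflow
```
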
## Docker image

```text
docker run -it --rm srcd/ml --help
```

If this first command fails with

```text
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
```

and you are sure that the daemon is running, then you need to add your user to the `docker` group: refer to the [documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

## Contributions

...are welcome! See [CONTRIBUTING](contributing.md) and [CODE\_OF\_CONDUCT.md](code_of_conduct.md).

## License

[Apache 2.0](license.md)

## Algorithms

#### Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

1. Read Git repositories.
2. Classify files using [enry](https://github.com/src-d/enry).
3. Extract a [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
4. [Split and stem](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/token_parser.py) all the identifiers in each tree.
5. [Traverse the UAST](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/transformers/coocc.py), collapse all non-identifier paths and record all identifiers on the same level as co-occurring. In addition, connect them with their immediate parents \(see the sketch below\).
6. Write the global co-occurrence matrix.
7. Train the embeddings using [Swivel](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/swivel.py) \(requires Tensorflow\). Interactively view the intermediate results in Tensorboard using `--logs`.
8. Write the identifier embeddings model.

Steps 1-5 are performed with the `repos2coocc` command, step 6 with `id2vec_preproc`, step 7 with `id2vec_train`, and step 8 with `id2vec_postproc`.

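A toy reading of step 5 with plain dicts standing in for UAST nodes (illustrative only; the real logic lives in the linked `coocc.py` transformer):

```python
from itertools import combinations

def node(identifier, *children):
    # identifier is None for non-identifier UAST nodes (operators, blocks, ...)
    return {"id": identifier, "children": list(children)}

# total = price * quantity, inside a function named "compute"
tree = node("compute",
            node(None,                       # assignment node (no identifier)
                 node("total"),
                 node(None,                  # binary expression node
                      node("price"), node("quantity"))))

def collapse(n):
    """Identifiers reachable from n without crossing another identifier node."""
    out = []
    for child in n["children"]:
        if child["id"] is not None:
            out.append(child["id"])
        else:
            out.extend(collapse(child))      # non-identifier path is collapsed
    return out

def cooccurrences(n):
    if n["id"] is not None:
        level = collapse(n)
        yield from combinations(level, 2)            # same-level identifiers
        yield from ((n["id"], i) for i in level)     # immediate identifier parent
    for child in n["children"]:
        yield from cooccurrences(child)

print(list(cooccurrences(tree)))
# [('total', 'price'), ('total', 'quantity'), ('price', 'quantity'),
#  ('compute', 'total'), ('compute', 'price'), ('compute', 'quantity')]
```
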
#### Weighted Bag of X

We represent every repository as a weighted bag-of-vectors, provided we have document frequencies \("docfreq"\) and identifier embeddings \("id2vec"\).

1. Clone or read the repository from disk.
2. Classify files using [enry](https://github.com/src-d/enry).
3. Extract a [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.
5. Group by repository, file or function.
6. Set the weight of each such feature according to TF-IDF.
7. Write the BOW model.

Steps 1-7 are performed with the `repos2bow` command.

#### Topic modeling

See [here](doc/topic_modeling.md).

## Glossary

See [here](GLOSSARY.md).

SUMMARY.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Table of contents

* [README](README.md)
* [doc](doc/README.md)
  * [neural\_splitter\_arch](doc/neural_splitter_arch.md)
  * [topic\_modeling](doc/topic_modeling.md)
  * [cmd](doc/cmd/README.md)
    * [Preprocrepos command](doc/cmd/preprocrepos.md)
  * [README](doc/proposals/README.md)
    * [MLIP-000](doc/proposals/mlip-000.md)
  * [spark](doc/spark.md)
* [LICENSE](license.md)
* [MAINTAINERS](maintainers.md)
* [CODE\_OF\_CONDUCT](code_of_conduct.md)
* [CONTRIBUTING](contributing.md)

code_of_conduct.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# CODE OF CONDUCT

## Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at [email protected]. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4, available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html)
