Commit 53ec3a0

Author: guillemdb
Initial commit. Add markdown documents
Signed-off-by: guillemdb <[email protected]>

0 parents, commit 53ec3a0
8 files changed: +505 -0 lines

DCO

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

GLOSSARY.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
## Abstract Syntax Tree
An abstract syntax tree is a tree representing the abstract syntactic structure of a program written in a programming language.
Because not all details of the real syntax are present in an AST, it is "abstract" rather than "concrete".
In the tree, the branches and nodes represent structural relationships between the syntactic elements of the program it is based on.
ASTs from different languages will have different features, so they are not language agnostic.

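For a quick illustration outside the sourced.ml pipeline, Python's standard `ast` module can print such a tree for a one-line program:

```python
import ast

# Parse a small Python snippet and dump its AST. The nesting shows the
# structural relationships; concrete details such as whitespace are gone.
tree = ast.parse("total = price * quantity + 1")
print(ast.dump(tree))
```
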
## AST
See [Abstract Syntax Tree](#abstract-syntax-tree).

## Bag-of-words model
A bag-of-words model represents text as a "bag" of the words it contains. It discards information about text structure, grammar and word order, but preserves [multiplicity](https://en.wikipedia.org/wiki/Multiplicity_(mathematics)), that is, the number of occurrences of each word in the text.

A `bow` model refers to a special type of bag-of-words model, described below.

### Weighted bag-of-words model
A `bow` model is an instance of a weighted bag-of-words model. In a weighted bag-of-words model, each word in the bag is weighted using some algorithm.
For the `bow` model, every bag is a feature extracted from source code and the associated weight is calculated using [TF-IDF](#term-frequency-inverse-document-frequency).

For more information on the `bow` model, see the documentation [here](https://docs.sourced.tech/models#bow).

### Weighted bag-of-X model
A bag-of-words model can be generalized to a bag-of-X model.
These models, sometimes called bag-of-features models, can hold any uniform feature type.
For example, it is possible to store information about some feature of a document in a vector and then dump the vectors into a bag-of-vectors.
Given document frequencies and identifier embeddings, it is possible to represent a repository as a weighted bag-of-vectors.

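A minimal sketch of the idea on toy data (the feature names and weights below are made up; in the real `bow` model the weights come from TF-IDF):

```python
from collections import Counter

# Toy "documents": each is the list of features extracted from one file.
doc_a = ["read", "file", "read", "buffer"]
doc_b = ["read", "socket", "buffer", "buffer"]

bag_a = Counter(doc_a)            # plain bag-of-words: feature -> count
bag_b = Counter(doc_b)

# Hypothetical per-feature weights (in the real bow model these come from TF-IDF).
weights = {"read": 0.2, "file": 1.1, "buffer": 0.7, "socket": 1.3}

weighted_bag_a = {feat: count * weights[feat] for feat, count in bag_a.items()}
print(bag_a)           # Counter({'read': 2, 'file': 1, 'buffer': 1})
print(weighted_bag_a)  # {'read': 0.4, 'file': 1.1, 'buffer': 0.7}
```
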
## Collection frequency
The total number of times a term appears across all documents in a collection.
See also [document frequency](#document-frequency).

## COOC
See [Co-occurrence matrix](#co-occurrence-matrix).

## Co-occurrence matrix
A square matrix whose entry `(i, j)` counts how many times elements `i` and `j` occur together in some context. In sourced.ml, the elements are source code identifiers, and the matrix records how often pairs of identifiers co-occur within the [UAST](#uast)s of a repository.

## Document
The unit of data over which term and feature frequencies are computed. In sourced.ml, a document can be a repository, a file or a function.

## Document frequency
The document frequency is defined as the number of documents in some collection of documents that contain a term or a feature.
See also [collection frequency](#collection-frequency).

The `docfreq` model represents the document frequencies of features extracted from source code; that is, how many documents (repositories, files or functions) contain each tokenized feature.

For more information on the `docfreq` model, see the documentation [here](https://docs.sourced.tech/models#docfreq).

### Inverse document frequency
The inverse document frequency is defined as `log(N/df(t))`, where `df(t)` is the document frequency of a term `t` and `N` is the number of documents in the collection.
It is used to weight a term by its document frequency.

## Features
Generally, a feature is any measurable property of data in the domain of a model.
In the context of sourced.ml, a feature is a property of the source code sample used as input to a model.
Selecting the correct features to use as inputs to a model is essential to the model's performance.

There are a number of relevant feature types used by sourced.ml:

### Identifier
A name given by a developer to a program entity such as a variable, function, class or module.

### Token
The string "atoms" generated by the parsing process, which involves splitting text into words and stemming the resulting words.

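A toy sketch of splitting and crude stemming (illustrative only, not the project's actual token parser):

```python
import re

def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # break camelCase / PascalCase boundaries
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [w.lower() for w in words if w]

def crude_stem(word):
    """A deliberately naive suffix-stripping stem, just for illustration."""
    for suffix in ("ing", "ed", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in split_identifier("readBufferedItems_total")])
# ['read', 'buffer', 'item', 'total']
```
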
### Literal
A constant value that appears verbatim in the source code, such as a string or a number.

### Graphlet
The graphlet of a UAST node is composed of the node itself, its parent and its children.

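A minimal sketch of the idea on a toy tree, with plain dicts standing in for UAST nodes:

```python
def node(token, *children):
    return {"token": token, "children": list(children)}

tree = node("FunctionDef", node("Name"), node("Body", node("Return")))

def graphlets(root, parent=None):
    """Yield (parent, node, children) triples -- one graphlet per node."""
    yield (parent["token"] if parent else None,
           root["token"],
           [c["token"] for c in root["children"]])
    for child in root["children"]:
        yield from graphlets(child, root)

for g in graphlets(tree):
    print(g)
# (None, 'FunctionDef', ['Name', 'Body'])
# ('FunctionDef', 'Name', [])
# ('FunctionDef', 'Body', ['Return'])
# ('Body', 'Return', [])
```
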
## Feature extraction
Feature extraction is the process of gathering information about [features](#features) from a set of data.

## Identifier embeddings
The `id2vec` model contains information on source code identifier embeddings; that is, every identifier is represented as a dense vector.

For more information on the `id2vec` model, see the documentation [here](https://docs.sourced.tech/models#id-2-vec).

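Schematically, an embedding model is just a vocabulary plus a dense matrix; the numbers and helper names below are made up for illustration:

```python
import numpy as np

# Hypothetical vocabulary of identifiers and a 4-dimensional embedding matrix.
tokens = ["read", "write", "buffer", "socket"]
index = {tok: i for i, tok in enumerate(tokens)}
embeddings = np.random.default_rng(0).normal(size=(len(tokens), 4))

vec = embeddings[index["buffer"]]          # the dense vector for "buffer"

# Cosine similarity between two identifiers:
a, b = embeddings[index["read"]], embeddings[index["write"]]
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```
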
## Model
A model is the artifact produced by running an analysis pipeline.
It is plain data with some methods to access it.
A model can be serialized to bytes and deserialized from bytes.
The underlying storage format is specific to [src-d/modelforge](https://github.com/src-d/modelforge)
and is currently [ASDF](https://github.com/spacetelescope/asdf)
with [lz4](https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)) compression.

## Pipeline
A tree of linked `sourced.ml.transformers.Transformer` objects that can be executed on PySpark / the source{d} engine.
The result is often written to disk as [Parquet](https://parquet.apache.org/) or model files,
or to a database.

## Quantization
Most generally, quantization is a process that maps a large set of possible inputs onto a smaller set of possible outputs.
The values of the large set may be continuous/uncountable.
For example, a vector quantizer takes as its input a vector, which encodes some set of features of a document.
The vector quantizer maps the input vector onto the nearest vector in a set of vectors.
The vectors in the output set may be thought of as the vocabulary of words that can be used; every input is mapped onto one of the words in the output set.

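A minimal vector-quantizer sketch with a hand-made codebook:

```python
import numpy as np

# A small "vocabulary" of codebook vectors (the allowed outputs).
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

def quantize(x):
    """Map an input vector to the index of the nearest codebook vector."""
    distances = np.linalg.norm(codebook - x, axis=1)
    return int(np.argmin(distances))

print(quantize(np.array([0.9, 0.2])))   # 1 -> closest to [1.0, 0.0]
print(quantize(np.array([0.6, 0.7])))   # 3 -> closest to [1.0, 1.0]
```
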
## Term frequency
The number of times a term appears in a given document.

## Term frequency inverse document frequency
A weighting scheme that combines [term frequency](#term-frequency) with [inverse document frequency](#inverse-document-frequency). It produces a composite weight for each term in each document.
The weight assigned by TF-IDF is higher when the term `t` is highly discriminating.
This occurs when `t` appears in relatively few documents, and thus has a high IDF, or when it occurs many times in the given document, and thus has a high TF.

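A small sketch of the scheme on a toy corpus (log base and smoothing conventions vary between implementations):

```python
import math
from collections import Counter

docs = {
    "doc1": ["read", "file", "read", "buffer"],
    "doc2": ["read", "socket", "send"],
    "doc3": ["parse", "file", "token"],
}

N = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for terms in docs.values() for term in set(terms))

def tfidf(doc_terms):
    tf = Counter(doc_terms)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs["doc1"]))
# "read" occurs in 2 of 3 documents -> low IDF; "buffer" only in doc1 -> high IDF.
```
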
## TF-IDF
See [Term frequency inverse document frequency](#term-frequency-inverse-document-frequency).

## Topic modeling
In machine learning, topic modeling is a type of modeling used to find abstract "topics" that occur in a collection of documents. The process is often used to automatically identify the semantic content of documents or collections of documents.

In the context of sourced.ml, topic modeling is used to identify topics of source code repositories. The `topic` model can be used to model the topics of a Git repository; all tokens are identifiers extracted from the repository or repositories. They serve as indicators of the abstract "topics" mentioned above and are used to infer the topic(s) of each repository.

For more information on the `topic` model, see the documentation [here](https://docs.sourced.tech/models#topics).

## Transformer
A `sourced.ml.transformers.Transformer` object, which serves as one of a series of potential steps in transforming source code features from one form into another.

## UAST
See [Universal Abstract Syntax Tree](#universal-abstract-syntax-tree).

## Universal Abstract Syntax Tree
A generalized version of an [abstract syntax tree](#abstract-syntax-tree).
It is further abstracted away from any concrete details of the parent program, allowing programs written in different programming languages to be converted into UASTs, which are language agnostic.

This is achieved using [Babelfish](https://docs.sourced.tech/babelfish), a universal code parser.

## Weighted MinHash
An algorithm that approximates the [Weighted Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance)
between all pairs of source code samples in linear time and space. Described by
[Sergey Ioffe](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf).

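A condensed, illustrative sketch of a single hash from Ioffe's consistent weighted sampling scheme (the helper name is made up; a real deployment draws many such hashes and compares how often two samples collide):

```python
import numpy as np

def make_hash(dim, seed=0):
    """One hash of Ioffe's (2010) consistent weighted sampling scheme.

    The random draws (r, c, beta) must be shared by every vector that is
    hashed, so they are generated once per dimension, up front.
    """
    rng = np.random.default_rng(seed)
    r = rng.gamma(2.0, 1.0, size=dim)
    c = rng.gamma(2.0, 1.0, size=dim)
    beta = rng.uniform(0.0, 1.0, size=dim)

    def sample(weights):
        w = np.asarray(weights, dtype=float)
        k = np.flatnonzero(w > 0)                 # only positive weights matter
        t = np.floor(np.log(w[k]) / r[k] + beta[k])
        ln_y = r[k] * (t - beta[k])
        ln_a = np.log(c[k]) - ln_y - r[k]
        j = int(np.argmin(ln_a))
        return int(k[j]), int(t[j])               # the weighted MinHash sample

    return sample

h = make_hash(dim=4)
print(h([0.5, 0.0, 2.0, 1.0]), h([0.5, 0.1, 2.0, 1.0]))
# Equal samples across many independent hashes => high weighted Jaccard similarity.
```
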

README.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# MLonCode research playground [![PyPI](https://img.shields.io/pypi/v/sourced-ml.svg)](https://pypi.python.org/pypi/sourced-ml) [![Build Status](https://travis-ci.org/src-d/ml.svg)](https://travis-ci.org/src-d/ml) [![Docker Build Status](https://img.shields.io/docker/build/srcd/ml.svg)](https://hub.docker.com/r/srcd/ml) [![codecov](https://codecov.io/github/src-d/ml/coverage.svg)](https://codecov.io/gh/src-d/ml)

This project is the foundation for [MLonCode](https://github.com/src-d/awesome-machine-learning-on-source-code) research and development. It abstracts feature extraction and model training, allowing you to focus on higher-level tasks.

Currently, the following models are implemented:

* BOW - weighted bag-of-x, where x can be any of several extracted feature types.
* id2vec - source code identifier embeddings.
* docfreq - feature document frequencies \(part of TF-IDF\).
* topic modeling over source code identifiers.

It is written in Python3 and has been tested on Linux and macOS. source{d} core-ml is tightly
coupled with [source{d} engine](https://engine.sourced.tech) and delegates all the feature extraction parallelization to it.

Here is the list of proof-of-concept projects built using ml-core:

* [vecino](https://github.com/src-d/vecino) - finding similar repositories.
* [tmsc](https://github.com/src-d/tmsc) - listing the topics of a repository.
* [snippet-ranger](https://github.com/src-d/snippet-ranger) - topic modeling of source code snippets.
* [apollo](https://github.com/src-d/apollo) - source code deduplication at scale.

## Installation

Whether you wish to include Spark in your installation or would rather use an existing
installation, to use `sourced-ml` you will need to have some native libraries installed;
e.g. on Ubuntu you must first run `apt install libxml2-dev libsnappy-dev`. [Tensorflow](https://tensorflow.org)
is also a requirement - we support both the CPU and GPU versions.
To select one, install the package with the corresponding extra: either `sourced-ml[tf]` or `sourced-ml[tf-gpu]`.
**If you don't, neither version will be installed.**

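A typical pip-based install would then look something like this (an illustrative sketch assuming the standard PyPI package; quote the extras so the shell does not expand the brackets):

```text
pip3 install "sourced-ml[tf]"      # CPU Tensorflow
pip3 install "sourced-ml[tf-gpu]"  # GPU Tensorflow
```
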
## Docker image

```text
docker run -it --rm srcd/ml --help
```

If this first command fails with

```text
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
```

and you are sure that the daemon is running, then you need to add your user to the `docker` group: refer to the [documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

## Contributions

...are welcome! See [CONTRIBUTING](contributing.md) and [CODE\_OF\_CONDUCT.md](code_of_conduct.md).

## License

[Apache 2.0](license.md)

## Algorithms

#### Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

1. Read Git repositories.
2. Classify files using [enry](https://github.com/src-d/enry).
3. Extract a [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
4. [Split and stem](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/token_parser.py) all the identifiers in each tree.
5. [Traverse the UAST](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/transformers/coocc.py), collapse all non-identifier paths and record all identifiers on the same level as co-occurring. In addition, connect them with their immediate parents \(see the sketch below\).
6. Write the global co-occurrence matrix.
7. Train the embeddings using [Swivel](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/swivel.py) \(requires Tensorflow\). Interactively view the intermediate results in Tensorboard using `--logs`.
8. Write the identifier embeddings model.

Steps 1-5 are performed with the `repos2coocc` command, step 6 with `id2vec_preproc`, step 7 with `id2vec_train`, and step 8 with `id2vec_postproc`.

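A toy reading of step 5 with plain dicts standing in for UAST nodes (illustrative only; the real logic lives in the linked `coocc.py` transformer):

```python
from itertools import combinations

def node(identifier, *children):
    # identifier is None for non-identifier UAST nodes (operators, blocks, ...)
    return {"id": identifier, "children": list(children)}

# total = price * quantity, inside a function named "compute"
tree = node("compute",
            node(None,                       # assignment node (no identifier)
                 node("total"),
                 node(None,                  # binary expression node
                      node("price"), node("quantity"))))

def collapse(n):
    """Identifiers reachable from n without crossing another identifier node."""
    out = []
    for child in n["children"]:
        if child["id"] is not None:
            out.append(child["id"])
        else:
            out.extend(collapse(child))      # non-identifier path is collapsed
    return out

def cooccurrences(n):
    if n["id"] is not None:
        level = collapse(n)
        yield from combinations(level, 2)            # same-level identifiers
        yield from ((n["id"], i) for i in level)     # immediate identifier parent
    for child in n["children"]:
        yield from cooccurrences(child)

print(list(cooccurrences(tree)))
# [('total', 'price'), ('total', 'quantity'), ('price', 'quantity'),
#  ('compute', 'total'), ('compute', 'price'), ('compute', 'quantity')]
```
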
#### Weighted Bag of X

We represent every repository as a weighted bag-of-vectors, provided we have document frequencies \("docfreq"\) and identifier embeddings \("id2vec"\).

1. Clone or read the repository from disk.
2. Classify files using [enry](https://github.com/src-d/enry).
3. Extract a [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.
4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.
5. Group by repository, file or function.
6. Set the weight of each such feature according to TF-IDF.
7. Write the BOW model.

Steps 1-7 are performed with the `repos2bow` command.

#### Topic modeling

See [here](doc/topic_modeling.md).

## Glossary

See [here](GLOSSARY.md).

SUMMARY.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Table of contents

* [README](README.md)
* [doc](doc/README.md)
  * [neural\_splitter\_arch](doc/neural_splitter_arch.md)
  * [topic\_modeling](doc/topic_modeling.md)
  * [cmd](doc/cmd/README.md)
    * [Preprocrepos command](doc/cmd/preprocrepos.md)
  * [README](doc/proposals/README.md)
    * [MLIP-000](doc/proposals/mlip-000.md)
  * [spark](doc/spark.md)
* [LICENSE](license.md)
* [MAINTAINERS](maintainers.md)
* [CODE\_OF\_CONDUCT](code_of_conduct.md)
* [CONTRIBUTING](contributing.md)

code_of_conduct.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# CODE OF CONDUCT

## Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at [email protected]. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4, available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html)
