This repository was archived by the owner on Apr 11, 2023. It is now read-only.

Commit bb121a5

Author: Miltos Allamanis
Wrap up challenge and publish the human relevance judgements.
1 parent c1ada63 commit bb121a5

3 files changed, +4030 -2 lines changed

BENCHMARK.md

+6
@@ -1,3 +1,9 @@
+> ## The Challenge has been concluded
+> No new submissions to the benchmark will be accepted. However, we would like
+> to encourage practitioners and researchers to continue using
+> the dataset and the human relevance annotations. Please see the
+> [main README](/README.md) for more information.
+
 ## Submitting runs to the benchmark

 The [Weights & Biases (W&B)](https://www.wandb.com) [benchmark](https://app.wandb.ai/github/CodeSearchNet/benchmark) tracks and compares models trained on the CodeSearchNet dataset by the global machine learning research community. Anyone is welcome to submit their results for review.

README.md

+15 -2
@@ -4,6 +4,12 @@

 [paper]: https://arxiv.org/abs/1909.09436

+> # The CodeSearchNet challenge has been concluded
+> We would like to thank all participants for their submissions
+> and we hope that this challenge provided insights to practitioners and researchers about the challenges in semantic code search and motivated new research. We would like to encourage everyone to continue using the dataset and the human evaluations, which we now provide publicly. Please, see below for details.
+>
+> No new submissions to the challenge will be accepted.
+
 **Table of Contents**

 <!-- TOC depthFrom:1 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->
@@ -83,11 +89,11 @@ More context regarding the motivation for this problem is in this [technical rep

 ## Evaluation

-The metric we use for evaluation is [Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG). Please reference [this paper][paper] for further details regarding model evaluation.
+The metric we use for evaluation is [Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG). Please reference [this paper][paper] for further details regarding model evaluation. The evaluation script can be found [here](/src/relevanceeval.py).

 ### Annotations

-We manually annotated retrieval results for the six languages from 99 general [queries](resources/queries.csv). This dataset is used as groundtruth data for evaluation _only_. Please refer to [this paper][paper] for further details on the annotation process.
+We manually annotated retrieval results for the six languages from 99 general [queries](resources/queries.csv). This dataset is used as groundtruth data for evaluation _only_. Please refer to [this paper][paper] for further details on the annotation process. These annotations were used to compute the scores in the leaderboard. Now that the competition has been concluded, you can find the annotations, along with the annotator comments [here](/resources/annotationStore.csv).


 ## Setup
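
Note on the Evaluation change in the hunk above: the README names Normalized Discounted Cumulative Gain as the metric and points to `/src/relevanceeval.py` for the actual script. The following is only a minimal sketch of the textbook definition, not the repository's implementation. It assumes the exponential gain variant `2^rel - 1`, and the function names `dcg` and `ndcg` are hypothetical.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain over a ranked list of relevance scores.

    Assumption: exponential gain (2^rel - 1) with a log2(rank + 1) discount,
    where rank starts at 1. The linear-gain variant is equally standard.
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(ranked_relevances: Sequence[float]) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    if ideal == 0:
        # No relevant item at all: define the score as 0 to avoid dividing by zero.
        return 0.0
    return dcg(ranked_relevances) / ideal
```

With relevance scores on the 0-3 scale used by the annotations, a ranking that is already sorted from most to least relevant scores 1.0, and pushing a relevant result further down the list lowers the score.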
@@ -242,6 +248,13 @@ For example, the link for the `java` is:

 The size of the dataset is approximately 20 GB. The various files and the directory structure are explained [here](resources/README.md).

+## Human Relevance Judgements
+To train neural models with a large dataset we use the documentation comments (e.g. docstrings) as a proxy. For evaluation (and the leaderboard), we collected human relevance judgements of pairs of realistic-looking natural language queries and code snippets. Now that the challenge has been concluded, we provide the data [here](/resources/annotationStore.csv) as a `.csv`, with the following fields:
+* Language: The programming language of the snippet.
+* Query: The natural language query
+* GitHubUrl: The URL of the target snippet. This matches the `URL` key in the data (see [here](#schema--format)).
+* Relevance: the 0-3 human relevance judgement, where "3" is the highest score (very relevant) and "0" is the lowest (irrelevant).
+* Notes: a free-text field with notes that annotators optionally provided.

 # Running Our Baseline Model

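
Note on the Human Relevance Judgements section added in the hunk above: the five column names it lists are enough to sketch how the file might be read. The example below is an illustration only. It assumes the CSV has a header row with exactly those names and that every `Relevance` value parses as an integer; the function name `load_judgements` and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def load_judgements(lines: Iterable[str]) -> Dict[Tuple[str, str], Dict[str, List[int]]]:
    """Group relevance scores by (Language, Query), then by GitHubUrl.

    Scores are kept in a list per URL because this sketch does not assume
    that each (query, snippet) pair was judged only once.
    """
    judgements = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(lines):
        key = (row["Language"], row["Query"])
        # Assumption: Relevance is always an integer from 0 to 3.
        judgements[key][row["GitHubUrl"]].append(int(row["Relevance"]))
    return judgements


# Tiny in-memory sample with invented rows (not real annotations).
sample = io.StringIO(
    "Language,Query,GitHubUrl,Relevance,Notes\n"
    "Python,read a file,https://example.com/a#L1-L5,3,\n"
    "Python,read a file,https://example.com/b#L2-L9,0,unrelated\n"
)
result = load_judgements(sample)
```

To read the published file instead of the sample, the same function can be given a handle opened with `open("resources/annotationStore.csv", newline="", encoding="utf-8")`, assuming the path from the README link is relative to the repository root.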