
Chalab Wiki

We are running a public instance of Chalab. This is a beta version; use it at your own risk. We decline any responsibility. The privacy policy and terms and conditions of Codalab apply to this site as well.

Chalab is a tool that helps you design data science or machine learning challenges. A step-by-step Wizard guides you through the process. When you are done, you can compile your challenge as a self-contained zip file (a competition bundle) and upload it to a challenge platform. Currently, only Codalab accepts competition bundles created by Chalab. You can view a sample competition of the style you can create with Chalab.

Although Codalab allows you to design very elaborate challenges (with many datasets and phases, and elaborate means of scoring results), this version of Chalab follows a rather rigid pattern to generate "classic" data science challenges. However, such challenges support both result and CODE submission. This permits comparing the solutions proposed by participants in a fair way, using the same computational resources. Instructors using challenges in their classes can easily evaluate and check the solutions submitted.

Codalab allows you to submit any kind of Linux executable. You can run your code in a Docker image of your choice. For simplicity, however, all the examples we provide are in Python. We use a Jupyter notebook for the starting kit and the scikit-learn machine learning library, which includes an excellent tutorial.

Mini challenge organization tutorial

Your point of entry into Chalab is the wizard home page, which allows you to select a challenge to edit or create a new challenge. You are then led to the Wizard page, which lets you design a challenge one step at a time! Conveniently, you may use challenges previously created by others as templates (i.e. much of the information is already filled in, which provides you with further guidance). To understand how to select or create a template, see the Profile and the Group pages.

We also provide sample data and code for the Iris challenge, which can be uploaded to Chalab for test purposes.

The Chalab challenge design includes 7 steps:

1. Data:

Data science challenges designed with Chalab propose supervised learning tasks of CLASSIFICATION or REGRESSION. You must prepare your dataset in the AutoML challenge format, which supports data represented as feature vectors. Full and sparse (LIBSVM-style) formats are supported. See the data page for details. We supply several example datasets, which you can choose from in a menu, if you are not ready yet to upload your own data.
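
To give a concrete idea of the dense format, here is a minimal sketch assuming one sample per line with space-separated feature values in the .data file and one-hot encoded labels in the .solution file; the file names (iris_train.data, iris_train.solution) are only illustrative, and the data page linked above remains the authoritative reference.

```python
# Illustrative sketch only: writing a tiny dataset in an AutoML-style dense format.
# Assumed conventions (check the data page): one sample per line, space-separated
# feature values in the .data file, one-hot encoded labels in the .solution file.
import numpy as np

X_train = np.array([[5.1, 3.5, 1.4, 0.2],
                    [6.2, 2.9, 4.3, 1.3]])
y_train = np.array([0, 1])                         # class indices
n_classes = 3

Y_onehot = np.eye(n_classes, dtype=int)[y_train]   # one column per class

np.savetxt("iris_train.data", X_train, fmt="%g")
np.savetxt("iris_train.solution", Y_onehot, fmt="%d")
```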

If you are new to challenge organization, you should know that if there is a flaw in your dataset that participants can turn to their advantage, some participant will find it! Read this paper on data leakage.

If you wonder how large your dataset should be: no dataset is ever too large, data is EVERYTHING. Remember what Peter Norvig (director of research at Google) said: “We don't have better algorithms than anyone else; we just have more data.” (see article). However, your zipped dataset should not exceed 200 MB. Not a problem: machine learning algorithms do not need high resolution on the feature values. Do not use floats; use integers quantized on e.g. 100 values (1000 at the maximum). If you do not believe me: run a baseline method and decrease the resolution of your features until performance degrades significantly. If you really need large datasets, [upload them directly to Codalab](https://github.com/codalab/codalab-competitions/wiki/My-Datasets).
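
As a sketch of this quantization advice (not part of Chalab itself), the snippet below rescales each feature column onto roughly 100 integer levels before saving; the function name and file name are ours, for illustration only.

```python
# Illustrative: quantize float features onto ~100 integer levels to shrink the dataset.
import numpy as np

def quantize(X, levels=100):
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)            # avoid division by zero
    return np.rint((X - lo) / span * (levels - 1)).astype(int)

X = np.random.rand(1000, 20)
Xq = quantize(X, levels=100)                          # integers in [0, 99]
np.savetxt("mydata_train.data", Xq, fmt="%d")         # much smaller than floats
```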

2. Split:

Chalab wants to split your data 3-way into:

  1. training set (with labels supplied to the participants to train their learning machine)
  2. validation set (with labels concealed from the participants, who must predict them)
  3. (final) test set (also with labels concealed from the participants)
The last two sets are both test sets. We need two test sets because we let participants practice solving the problem by making many submissions during a first "development phase", which may last several weeks. They can make up to 5 submissions per day, so in the end they can easily "overfit" the validation set (basically learn it by heart). We use the (final) test set during the final phase to perform blind testing: only one try!

If you wonder how to optimally split the data into training, validation, and test sets, keep in mind that your role as a challenge organizer is to get small error bars on the TEST set. This paper gives you an idea of how to proceed for classification problems. Basically, if you think that the best-performing method will yield an error rate of E, then use Ntest = 100/E test examples. So for 1% error, use Ntest = 10000 examples. Generally Ntest = 10000 examples is a good rule of thumb, but if you can afford more, use more. You can use a validation set 10 times smaller (it is the participants' problem if they overfit the validation set). Use the rest as training data.
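
A quick back-of-the-envelope application of this rule of thumb (the numbers are purely illustrative):

```python
# Split sizes following the rule of thumb above: Ntest = 100 / E for an
# anticipated best error rate E, a validation set about 10x smaller,
# and the rest for training. Illustrative numbers only.
E = 0.01                        # anticipated best error rate (1%)
n_total = 120000                # total number of labeled examples available

n_test = int(round(100 / E))    # 10000
n_valid = n_test // 10          # 1000
n_train = n_total - n_test - n_valid

print(n_train, n_valid, n_test)  # 109000 1000 10000
```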

3. Problem:

We let you supply a program, called the ingestion program, that receives the submissions of the participants (see details). It runs first when a user makes a submission. It has access to the user's submission and can do whatever it wants with it to produce the predictions. This is convenient if participants must submit code in the form of functions or libraries: the ingestion program can read the data and execute the function(s) supplied.
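
To give an idea of what such a program can look like, here is a minimal sketch. The calling convention (data, output, and submission directories passed as command-line arguments), the file names, and the Model class with fit/predict methods are all assumptions for illustration; the details page linked above describes the real interface.

```python
# ingestion.py -- minimal, illustrative sketch of an ingestion program.
# Assumptions: the platform calls this script with the input data dir, an output
# dir, and the directory containing the participant's submission; the participant
# provides a model.py exposing a Model class with fit/predict. Names are ours.
import sys, os
import numpy as np

input_dir, output_dir, submission_dir = sys.argv[1:4]
sys.path.insert(0, submission_dir)      # make the submitted code importable
from model import Model                 # participant-supplied class (assumed)

X_train = np.loadtxt(os.path.join(input_dir, "mydata_train.data"))
Y_train = np.loadtxt(os.path.join(input_dir, "mydata_train.solution"))
X_valid = np.loadtxt(os.path.join(input_dir, "mydata_valid.data"))
X_test  = np.loadtxt(os.path.join(input_dir, "mydata_test.data"))

m = Model()
m.fit(X_train, Y_train)
np.savetxt(os.path.join(output_dir, "mydata_valid.predict"), m.predict(X_valid))
np.savetxt(os.path.join(output_dir, "mydata_test.predict"),  m.predict(X_test))
```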

If you do not supply an ingestion program, the participants will still be able to make:

  • result submissions (zip files containing predictions)
  • executable code submissions (zip files containing code + a metadata file specifying the command to be run; see the example below).
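
For reference, the metadata file of a code submission is typically a short text file giving the command to run. A sketch is shown below; the script name run.py and the description line are illustrative, and $input/$output/$program are placeholders substituted by the platform.

```
command: python $program/run.py $input $output
description: Example code submission (illustrative)
```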

4. Metric:

The performances of the participants are evaluated by comparing the "predictions" they provide in their submissions to the "ground truth" (or "solution"), known only to the organizers. This is achieved by using a metric:

metric(solution, prediction)

See how to supply your own metric or select one of the metrics provided. The performance of various methods can be strongly affected by the choice of metric, which should also depend on the data (balanced or not) and the task chosen (binary/multi-class/multi-label classification or regression). For example, in a binary classification problem, if one class has 99% of the data, it is preferable NOT to use the error rate as a metric (because always predicting the most abundant class is a trivial way to get good performance). One would rather use the "balanced error rate", which averages the error rates of the two classes.
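
As an illustration of the metric(solution, prediction) signature, here is a sketch of a balanced error rate for binary classification; Chalab also provides ready-made metrics, and the 0/1 label encoding here is an assumption.

```python
# Illustrative metric(solution, prediction): balanced error rate (BER) for
# binary classification, i.e. the average of the per-class error rates.
import numpy as np

def balanced_error_rate(solution, prediction):
    solution = np.asarray(solution).ravel()
    prediction = np.asarray(prediction).ravel()
    pos, neg = solution == 1, solution == 0
    err_pos = np.mean(prediction[pos] != 1)   # error rate on the positive class
    err_neg = np.mean(prediction[neg] != 0)   # error rate on the negative class
    return (err_pos + err_neg) / 2.0

# Always predicting the majority class scores 0.5 BER,
# not the misleading 1% plain error rate:
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)
print(balanced_error_rate(y_true, y_pred))    # 0.5
```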

5. Protocol:

We do not give you much choice in protocol design. All challenges generated by this version of Chalab have 2 phases (a development phase and a final phase), and most phase parameters are fixed (maximum 5 submissions per day, 500 seconds of execution time per submission). You only get to select the start and end dates of the development phase. See instructions. The data you supplied will automatically appear on the challenge platform in the "right" place:

  • Public data: the training set and its labels, and the two unlabeled test sets will be downloadable by the participants.
  • Input data: the public data will also be made available to the code submitted by the participants on the challenge platform, so training and prediction can happen on the platform.
  • Solutions: the validation and test labels will be made available to the scoring program, which will compute performances using the metric that you supplied.

PHASE 1 -- During the development phase, the participants can either submit "results" (prediction files, in a format similar to the "solution" files) or "code" (to be executed on the platform to re-create those prediction files). Result submission allows them to work on their own platform without computational constraints. Code submission allows them to test that their solution runs well on the challenge platform. Obviously, it is always possible to run a challenge with result submission only (just instruct the participants NOT to submit code).

PHASE 2 -- During the final test phase there is no submission. The last submission of the development phase is forwarded automatically to the final test phase. For competitions with result submission only, the participants must therefore include predictions on both the validation set AND the test set in their last submission.

6. Baseline:

Importantly, the organizer must supply an example of code that solves the task they propose. On this page, we give guidelines on how to prepare a good starting kit, including a sample task (a mini version of the dataset) and a sample submission. A good starting kit is key to lowering the barrier to entry for your participants and getting them hooked!
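
As a hedged illustration of what the sample code in a starting kit might do (the file names and the one-hot solution convention follow the earlier sketches and should be adapted to your own dataset):

```python
# Illustrative baseline: train a scikit-learn classifier on the training set and
# write prediction files for the validation and test sets. File names are
# assumptions; adapt them to your dataset and to the starting kit you distribute.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.loadtxt("mydata_train.data")
Y_train = np.loadtxt("mydata_train.solution")
if Y_train.ndim > 1:                       # one-hot solution file -> class indices
    Y_train = Y_train.argmax(axis=1)
X_valid = np.loadtxt("mydata_valid.data")
X_test  = np.loadtxt("mydata_test.data")

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, Y_train)

# For a result submission, zip these two prediction files and upload the zip.
np.savetxt("mydata_valid.predict", clf.predict(X_valid), fmt="%d")
np.savetxt("mydata_test.predict",  clf.predict(X_test),  fmt="%d")
```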

For instructors, the starting kit is also a way to gently introduce students to key concepts (a concrete illustration of a couple of these follows the list):

  • Data descriptive statistics.
  • Data visualisation, scatter plots and histograms.
  • Preprocessing, data cleaning, normalizations, handling missing values and categorical variables.
  • Space dimensionality reduction (e.g. PCA) and feature selection.
  • Cross-validation.
  • Grid-search for hyper-parameter (HP) selection.
  • Ensemble methods.
  • Result visualisation, including learning curves (as a function of number of training examples or training epochs), ROC curves, curves as a function of some key HP.
  • Calculation of error bars, use of bootstrap methods; visualisation of error bars.
  • Comparison of metrics. Post-processing to optimize a metric not used for training.
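
To make this concrete, here is a small, self-contained example of two of these concepts (cross-validation and grid search for HP selection) in the scikit-learn style used by our starting kits; the data is synthetic and purely illustrative.

```python
# Cross-validated grid search for hyper-parameter selection on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="balanced_accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```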

7. Documentation:

Last but not least, you must provide information to your participants on what your challenge is about (overview), how you are going to rank them (evaluation), instructions on how to prepare a submission (get the data, etc.), and the rules (terms and conditions). This is supplied in the form of HTML pages, which will appear automatically on your challenge website. We recommend that you include a picture, a slide show, or even a video describing the objectives of your challenge to motivate participants. Check the videos that students created to present challenges they designed as a project.

For developers