BigCode Analysis

This repository contains the analyses done in the BigCode Project, covering datasets, models, architecture choices, and more.

Contents

  • Data analysis: In the folder data_analysis, we provide code for data analysis:

    • Near deduplication
    • Python data analysis:
      • Natural language distribution in comments/docstrings
      • Data decontamination for HumanEval and MBPP benchmarks
      • Percentage of files that can be successfully compiled
      • Percentage of configuration and test files
      • Exploration of unimax sampling on The Stack
    • Notebooks with early data and model loss analysis
  • Multi-Query Attention experiments: for details, please refer to multi_query_experiments/README.md.
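
To give a feel for one of the items listed above, "Percentage of files that can be successfully compiled", here is a minimal sketch. It is not the code in data_analysis; the function names `compiles` and `percent_compilable` are hypothetical, and it assumes "compiled" means the source is accepted by Python's built-in `compile()` under the interpreter running the check.

```python
def compiles(source: str) -> bool:
    """Return True if `source` compiles as Python under the running interpreter.

    Assumption: "compilable" here means the built-in compile() accepts the text.
    Python 2 style code (e.g. `print "x"`) therefore counts as a failure.
    """
    try:
        compile(source, "<string>", "exec")
        return True
    except (SyntaxError, ValueError):
        # SyntaxError: invalid code; ValueError: e.g. null bytes on older Pythons
        return False


def percent_compilable(sources: list[str]) -> float:
    """Percentage of sources that compile; 0.0 for an empty input."""
    if not sources:
        return 0.0
    ok = sum(compiles(s) for s in sources)
    return 100.0 * ok / len(sources)
```

Because the result depends on the interpreter version, any number reported by a check like this only makes sense alongside the Python version used to run it.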