Commit ba83715

Init project
0 parents  commit ba83715

3,064 files changed (+409171, -0 lines)

Paper.ipynb

+4,933
Large diffs are not rendered by default.

README.md

+85
@@ -0,0 +1,85 @@
# HaPy-Bug - Human Annotated Python Bug Resolution Dataset

The data and code package is available at https://figshare.com/s/9cbc129a95a8fc3a9640
Structure (data directories missing from this repository are on figshare):
* 'annotated_data' and 'collective' contain manual annotations extracted from the Label Studio instance
* 'code' contains the scripts used to gather data and prepare the dataset
* 'label-studio' contains a copy of the Label Studio sources used for manual annotation
* 'label-studio-frontend' contains a modified copy of the Label Studio frontend used for manual annotation
* 'raw_data' contains the gathered data before manual annotation
* 'Paper.ipynb' is a notebook that opens the manual annotations and replicates the paper experiments

The dataset schema for automated tools is in "code/Documentation/Dataset_structure.md"
The annotation protocol is in "code/Documentation/miniatura_protocol.md"
The annotation schema is as follows:
* 'annotated_data' contains JSON files extracted from Label Studio
* the prefix letter of a file name denotes the annotator; for instance, "A_1_24.json" was annotated by annotator "A"
* each file holds a list of annotations; each entry in the list has
  * the bug description and metadata in the "data" field
  * the results of the annotator's actions in the "result" field
* the "result" field stores the following entries as dictionaries
  * entries of type 'choices', such as "annotations-were-problematic", "reviewer-is-sure" and "bug-type", contain the selected answers
  * the entry of type 'cve' contains the line and file annotations
    * its subentry "diffsFiles" contains the file name "fileName", the assigned category "category" and the annotated lines in "lines", divided into "afterChange" and "beforeChange" for a specific commit

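The file layout described above could be read along the following lines. This is only a minimal sketch: the helper names `annotator_of`, `load_annotations` and `split_entry` are hypothetical, and it assumes that "data" and "result" are top-level keys of each list entry, as the schema description suggests.

```python
import json
from pathlib import Path


def annotator_of(path):
    """Return the annotator letter, i.e. the file-name prefix before the first '_'."""
    return Path(path).name.split("_", 1)[0]


def load_annotations(path):
    """Load one annotation file; per the schema it should hold a list of entries."""
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    if not isinstance(entries, list):
        raise ValueError(f"expected a list of annotations in {path}")
    return entries


def split_entry(entry):
    """Return the ("data", "result") pair of one entry (assumed top-level keys)."""
    return entry.get("data", {}), entry.get("result", [])
```

For a file such as "annotated_data/A_1_24.json", `annotator_of` would return "A", and each entry from `load_annotations` can be passed to `split_entry` to separate the bug metadata from the annotator's answers.
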
Example of the first "data/cve" entry:
```
{'id': 'CVE-2019-11340',
 'publicationDate': '19-04-2019',
 'severityScore': '4.3',
 'summary': 'util/emailutils.py in Matrix Sydent before 1.0.2 mishandles registration restrictions that are based on e-mail domain, if the allowed_local_3pids option is enabled. This occurs because of potentially unwanted behavior in Python, in which an email.utils.parseaddr call on user@bad.example.net@good.example.com returns the user@bad.example.net substring.'}
```

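All values in the example entry are strings, so a consumer will probably want typed fields. The sketch below is an assumption-laden illustration, not part of the dataset tooling: the helper name `normalise_cve` is hypothetical, and the day-month-year date format is inferred from the single example value '19-04-2019'.

```python
from datetime import datetime


def normalise_cve(cve):
    """Return a copy of a "data/cve" entry with typed date and score fields."""
    typed = dict(cve)
    # '19-04-2019' can only be read as day-month-year, hence "%d-%m-%Y".
    # Assumption: every entry uses the same format as this one example.
    typed["publicationDate"] = datetime.strptime(
        cve["publicationDate"], "%d-%m-%Y"
    ).date()
    typed["severityScore"] = float(cve["severityScore"])
    return typed


example = {
    "id": "CVE-2019-11340",
    "publicationDate": "19-04-2019",
    "severityScore": "4.3",
}
typed = normalise_cve(example)
print(typed["publicationDate"].isoformat(), typed["severityScore"])  # 2019-04-19 4.3
```
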
The corresponding first "result" entry, taken from the "A_1_24.json" file, with manual annotations of 'sydent/util/emailutils.py':
```
[{'value': {'choices': ['application/library']},
  'id': 'g4UOah3mt_',
  'from_name': 'bug-type',
  'to_name': 'text-bug-type',
  'type': 'choices',
  'origin': 'manual'},
 {'value': {'choices': ['yes']},
  'id': 'OhaMoaD2UT',
  'from_name': 'reviewer-is-sure',
  'to_name': 'text-reviewer-is-sure',
  'type': 'choices',
  'origin': 'manual'},
 {'value': {'choices': ['no']},
  'id': '3VdKlaBUh9',
  'from_name': 'annotations-were-problematic',
  'to_name': 'text-annotations-were-problematic',
  'type': 'choices',
  'origin': 'manual'},
 {'type': 'cve',
  'value': {'hyperlinks': [{'url': 'https://matrix.org/blog/2019/04/18/security-update-sydent-1-0-2/',
                            'dates': {'min': '2019-04-18', 'max': '2019-04-18'},
                            'labels': ['Release Notes', 'Vendor Advisory']},
                           {'url': 'https://twitter.com/matrixdotorg/status/1118934335963500545',
                            'dates': {'min': None, 'max': None},
                            'labels': ['Third Party Advisory']},
                           {'url': 'https://github.com/matrix-org/sydent/commit/4e1cfff53429c49c87d5c457a18ed435520044fc',
                            'dates': {'min': '2019-04-18', 'max': '2019-04-26'},
                            'labels': ['Patch', 'Third Party Advisory']},
                           {'url': 'https://github.com/matrix-org/sydent/compare/7c002cd...09278fb',
                            'dates': {'min': '2019-04-18', 'max': '2019-04-18'},
                            'labels': ['Patch', 'Third Party Advisory']}],
            'diffsFiles': [[{'fileName': 'sydent/util/emailutils.py',
                             'category': 'programming',
                             'lines': {'afterChange': [{'lineNumber': 58, 'category': 'other'},
                                                       {'lineNumber': 59, 'category': 'bug(fix)'},
                                                       {'lineNumber': 60, 'category': 'bug(fix)'},
                                                       {'lineNumber': 61, 'category': 'bug(fix)'},
                                                       {'lineNumber': 64, 'category': 'refactoring'},
                                                       {'lineNumber': 81, 'category': 'refactoring'},
                                                       {'lineNumber': 82, 'category': 'documentation'},
                                                       {'lineNumber': 83, 'category': 'documentation'},
                                                       {'lineNumber': 84, 'category': 'documentation'},
                                                       {'lineNumber': 85, 'category': 'documentation'},
                                                       {'lineNumber': 86, 'category': 'bug(fix)'}],
                                       'beforeChange': [{'lineNumber': 58, 'category': 'other'},
                                                        {'lineNumber': 59, 'category': 'bug(fix)'},
                                                        {'lineNumber': 60, 'category': 'bug(fix)'},
                                                        {'lineNumber': 61, 'category': 'bug(fix)'},
                                                        {'lineNumber': 80, 'category': 'bug(fix)'}]}}]]}}]
```

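A "result" list shaped like the example above could be summarised roughly as follows. This is a sketch under assumptions: the function names `selected_choices` and `line_categories` are hypothetical, and because the meaning of the outer list in "diffsFiles" is not stated here, the code simply flattens it.

```python
from collections import Counter


def selected_choices(result):
    """Map each 'choices' question (its "from_name") to the selected answers."""
    return {
        item["from_name"]: item["value"]["choices"]
        for item in result
        if item.get("type") == "choices"
    }


def line_categories(result):
    """Count line categories per (file name, side) across all 'cve' entries."""
    counts = {}
    for item in result:
        if item.get("type") != "cve":
            continue
        # "diffsFiles" is a list of lists of file records in the example.
        for group in item["value"].get("diffsFiles", []):
            for record in group:
                for side, lines in record["lines"].items():
                    key = (record["fileName"], side)
                    counts.setdefault(key, Counter()).update(
                        line["category"] for line in lines
                    )
    return counts


# Abbreviated copy of the example entry (hyperlinks and most lines omitted).
sample = [
    {"value": {"choices": ["application/library"]},
     "from_name": "bug-type", "type": "choices"},
    {"type": "cve", "value": {"diffsFiles": [[{
        "fileName": "sydent/util/emailutils.py",
        "category": "programming",
        "lines": {
            "afterChange": [{"lineNumber": 58, "category": "other"},
                            {"lineNumber": 59, "category": "bug(fix)"}],
            "beforeChange": [{"lineNumber": 80, "category": "bug(fix)"}],
        },
    }]]}},
]

print(selected_choices(sample))
print(line_categories(sample))
```

Keying the counts by side keeps "afterChange" and "beforeChange" separate, which matters because the same line number can appear on both sides with different meanings.
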

code/.gitignore

+139
@@ -0,0 +1,139 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# PyCharm
.idea/

# Script output files
data/

# node.js
node_modules
package-lock.json
