A filtered-down version of the Cap3D dataset, containing only the simplest, highest-quality objects.
This project provides a script that filters captions from the Cap3D
dataset to remove 3D objects that contain many sub-objects.
GLiNER is used for named entity recognition (NER) to count the objects mentioned in each caption; only captions with at most 2 detected objects are kept. The filter script can be found in the filter folder.
The install.sh script installs GLiNER. Make sure the script has executable permissions before running it. You can set them and run the script with:
chmod +x install.sh
./install.sh
Or install GLiNER directly with pip:
pip3 install gliner
To filter the data, run the script in the filter folder:
python3 filter.py
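The filtering idea described above can be sketched roughly as follows. This is a minimal illustration, not the contents of filter.py: the label list, the model name `urchade/gliner_base`, the 0.5 score threshold, and the helper names (`load_gliner_predictor`, `count_objects`, `filter_captions`) are all assumptions made for the example. The predictor is passed in as a function so the counting logic can be tried without downloading a model.

```python
from typing import Callable, Dict, Iterable, List

# Assumed values for illustration; the real filter.py may differ.
LABELS = ["object"]
MAX_OBJECTS = 2

# A predictor maps a caption to a list of entity dicts (GLiNER returns
# dicts with keys such as "text", "label", and "score").
Predictor = Callable[[str], List[Dict]]


def load_gliner_predictor(
    model_name: str = "urchade/gliner_base",  # assumed model choice
    threshold: float = 0.5,                   # assumed score threshold
) -> Predictor:
    """Build a predictor backed by GLiNER (imported lazily)."""
    from gliner import GLiNER

    model = GLiNER.from_pretrained(model_name)
    return lambda text: model.predict_entities(text, LABELS, threshold=threshold)


def count_objects(caption: str, predict: Predictor) -> int:
    """Count the entity mentions the predictor finds in a caption."""
    return len(predict(caption))


def filter_captions(
    captions: Iterable[str],
    predict: Predictor,
    max_objects: int = MAX_OBJECTS,
) -> List[str]:
    """Keep only captions with at most `max_objects` detected objects."""
    return [c for c in captions if count_objects(c, predict) <= max_objects]
```

With GLiNER installed, `filter_captions(captions, load_gliner_predictor())` would apply the same rule using a real model. Counting every detected mention (rather than distinct object names) is a design choice made here for simplicity.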
To split the cap3d_captions file into multiple smaller files, go to the filter folder and run:
python3 split.py
The resulting files will be saved into a local prepare folder.
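As a rough illustration of the splitting step, the sketch below breaks a CSV file into fixed-size chunks and writes them to a prepare folder. It is not the contents of split.py: the CSV format, the `rows_per_file` parameter, the `part_0000.csv` naming scheme, and the function names are assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import List


def _write_chunk(out_dir: Path, index: int, rows: List[List[str]]) -> Path:
    """Write one chunk of rows to out_dir/part_XXXX.csv and return its path."""
    path = out_dir / f"part_{index:04d}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return path


def split_csv(src: str, out_dir: str = "prepare", rows_per_file: int = 1000) -> List[Path]:
    """Split a CSV file into files of at most `rows_per_file` rows each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    chunk: List[List[str]] = []
    with open(src, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            chunk.append(row)
            if len(chunk) == rows_per_file:
                written.append(_write_chunk(out, len(written), chunk))
                chunk = []
    # Flush any remaining rows that did not fill a whole chunk.
    if chunk:
        written.append(_write_chunk(out, len(written), chunk))
    return written
```

Reading row by row keeps memory use small even for a large captions file, since only one chunk is held at a time.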
- Clone the repository and navigate to the project directory:
git clone https://github.com/RaccoonResearch/simdata
cd simdata