BIOM Format 2.0.0 Code Sprint Goals
This page lists potential goals, priorities, and to-do items for the BIOM Format 2.0.0 code sprint, which is taking place in Flagstaff, AZ from 12/2/2013 - 12/6/2013.
Here are some things that each developer needs to do before the code sprint begins.
Each developer should have the following dependencies installed. I have listed the commands that I (@jrrideout) ran on my Ubuntu 12.10 laptop to install each dependency:
- HDF5 libraries: sudo apt-get install libhdf5-7 libhdf5-dev
- numpy (preferably a modern version >= 1.6; tested with 1.7.1 and 1.8.0): pip install numpy
- h5py (tested with 2.2.0): pip install h5py
- scipy (preferably the newest available version; tested with 0.13.0): pip install scipy
- numexpr (tested with 2.2.1): pip install numexpr
- Cython (tested with 0.18): pip install Cython
- PyTables (tested with 3.0.0): pip install tables
) - biom-format repository (fork and clone the repo; ensure that all unit tests are passing and that you're set up to submit pull requests)
- protobiom repository (fork and clone the repo)
Note: this is not a finalized list of dependencies. The actual dependency list will probably be much shorter. The goal here is to get the dependencies we are likely to need installed beforehand. This will cut down on the time needed to set up development environments for everyone and give us more time to troubleshoot any installation issues that might arise.
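To confirm that the Python dependencies above are importable before the sprint, something like the following sketch may help. It is not part of biom-format. The helper names (installed_version, check_dependencies) and the TESTED table are made up for illustration, and the versions in the table are simply the "tested with" versions listed above. It assumes each package exposes a __version__ attribute and reports "unknown" if one does not. The HDF5 system libraries are not checked directly, only the Python packages built on them.

```python
import importlib

# Import names paired with the versions listed above as "tested with".
# Note that PyTables is imported as "tables".
TESTED = {
    "numpy": "1.7.1 / 1.8.0",
    "h5py": "2.2.0",
    "scipy": "0.13.0",
    "numexpr": "2.2.1",
    "Cython": "0.18",
    "tables": "3.0.0",
}


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Treat any import-time failure (missing package, broken build
        # against HDF5, etc.) as "not usable".
        return None
    # Assumption: the package exposes __version__; fall back if it does not.
    return getattr(module, "__version__", "unknown")


def check_dependencies(module_names=TESTED):
    """Map each module name to its installed version (None if unavailable)."""
    return {name: installed_version(name) for name in module_names}


if __name__ == "__main__":
    for name, version in sorted(check_dependencies().items()):
        status = version if version is not None else "NOT INSTALLED"
        print("%-8s installed: %-14s tested with: %s"
              % (name, status, TESTED[name]))
```

Running the script prints one line per package. Any line reading NOT INSTALLED points at a dependency worth sorting out before the sprint starts.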
Read up on h5py, which is one of the HDF5 APIs available in Python. The quick start guide is really helpful and pretty concise. Also read up on HDF5 itself: besides poking through the website, have a look at their intro guide, which is pretty good.
The following are potential goals to accomplish during the code sprint. This list will likely need to be voted on and prioritized, as we probably won't have time to get to everything. Here are a few things that come to mind:
- define 2.0.0 file format
- what stays the same?
- what existing pieces need to change?
- what new data do we want to store, if any?
- what about table types (e.g., OTU table, taxon table, metabolite table, etc.)? There are currently no distinctions between table types (just a controlled vocabulary of recognized table types)
- do we want richer descriptions of metadata? Metadata is currently extremely generic and has no requirements/validation
- should we have an attribute that indicates whether the table is in absolute or relative abundances?
- update BIOM Python API
- the current API is very generic (i.e., a table method usually takes a Python function as an argument and applies that function to perform operations such as sorting, filtering, and transforming). This is elegant and makes the BIOM API very general-purpose, but