Contributors | GitHub Handle |
---|---|
Carrie Cheung | carrieklc |
Evan Yathon | EvanYathon |
Mike Yuan | mikeymice |
Shayne Andrews | shayne-andrews |
pydistrrr is a Python package that calculates distances between numeric data points or observations. The currently supported distance metrics are:
In addition to computing distances, pydistrrr can identify the data points closest to a given point, based on either a distance threshold or a user-specified number of points. These functions are designed to resemble scikit-learn's Nearest Neighbors functionality.
We used the Python framework pytest and the plug-in pytest-cov to test the pydistrrr package and track its test coverage. The results of the coverage report can be seen below.
Function Name | Input | Output | Description |
---|---|---|---|
get_distance | 3 parameters: 2 lists of numeric values, a string specifying type of distance metric | Single float | Given 2 observations each represented by a list of numeric values, compute and return the distance between the 2 points based on the specified distance metric (e.g. metric="euclidean") |
get_all_distances | 3 parameters: a list of numeric values, a dataframe, a string specifying type of distance metric | List of floats of length n | Given a dataframe and an observation represented by a list of numeric values, compute and return the distances between the single observation and each observation in the dataframe based on the specified distance metric. Outputs a list of distances (as numeric values) with length equal to the number of rows in the dataframe, n. |
filter_distances | 4 parameters: a list of numeric values, a dataframe, a numeric (float or int) value representing a threshold distance, a string specifying type of distance metric | List of int (row indices) | Similar to get_all_distances, except that the indices of rows/observations with distances less than the threshold distance are returned. |
get_closest | 4 parameters: a list specifying values for a target point, a dataframe of data points, an int for number of neighbours k, a string specifying type of distance metric to calculate | List of int (row indices) of length k | Similar to get_all_distances, except that the indices of the k rows/observations with the smallest distances are returned. When two or more points are tied in distance, the point with the larger index in the dataframe is selected. |
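
The threshold and k-nearest behaviour described in the table can be illustrated with a minimal standard-library sketch. This is not the pydistrrr implementation: the names `sketch_filter_distances` and `sketch_get_closest` are hypothetical, plain lists of rows stand in for a dataframe, and only Euclidean distance (via `math.dist`, Python 3.8+) is used.

```python
import math

def sketch_filter_distances(point, rows, threshold):
    """Return indices of rows strictly closer to `point` than `threshold`."""
    return [i for i, row in enumerate(rows) if math.dist(point, row) < threshold]

def sketch_get_closest(point, rows, k):
    """Return indices of the k nearest rows; ties go to the larger index."""
    distances = [(math.dist(point, row), i) for i, row in enumerate(rows)]
    # Sort by ascending distance, then by descending index to break ties.
    distances.sort(key=lambda pair: (pair[0], -pair[1]))
    return [i for _, i in distances[:k]]

# Hypothetical data: each inner list is one observation (one "row").
rows = [[1, 0], [0, 1], [3, 4], [0, 0]]
print(sketch_filter_distances([0, 0], rows, 2))  # [0, 1, 3]
print(sketch_get_closest([0, 0], rows, 2))       # [3, 1]
```

In the second call, rows 0 and 1 are both at distance 1 from the target, so the larger index (1) is chosen after the exact match at index 3.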
There are existing packages that implement the same proposed functionality in both Python and R (listed below). Most of these packages provide functions to calculate different distance metrics between observations and/or also extend the functionality to compute the k closest neighbours (KNN) of a given point based on a selected distance metric.
In our package, we implement the distance metric calculations manually rather than simply creating wrappers around existing functions.
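
As a rough illustration of what computing a metric manually (rather than wrapping a library call) can look like, here is a minimal sketch of two metrics. The helper names `euclidean` and `manhattan` are hypothetical and are not taken from the pydistrrr source.

```python
import math

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

print(euclidean([1, 2], [2, 1]))  # sqrt(2)
print(manhattan([1, 2], [2, 1]))  # 2
```

Note that `zip` silently truncates to the shorter input, so a real implementation would probably want to check that both observations have the same length first.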
Existing Packages/Functions |
---|
Sklearn's NearestNeighbors |
Scipy's Spatial Distance Functions |
R Distance Computations |
R K Nearest Neighbours |
To install the package, run the following command in your terminal:
pip install git+https://github.com/UBC-MDS/pydistrrr.git
Then import pydistrrr in your own code. For example:
>>> from pydistrrr import *
>>> get_distance([1,2],[2,1])
1.4142135623730951
Function Name | Example Usage(s) |
---|---|
get_distance | get_distance([1,2], [2,1], "manhattan") |
get_all_distances | x = [-2,4] |
filter_distances | x = [1, 1] |
get_closest | x = [1, 1] |