Commit 047b989

Changed the new implementation of the "old" eval to be a flag.
As discussed in the pull request, the Market-1501-style evaluation (previously the default behaviour) is now enabled by an optional flag. This also adds a comment about why we use mergesort and updates the README accordingly.
1 parent 315ba4c · commit 047b989

2 files changed: +20 -4 lines changed

README.md

+2 -2
@@ -273,8 +273,8 @@ The evaluation code in this repository simply uses the scikit-learn code, and th
 Unfortunately, almost no paper mentions which code-base they used and how they computed `mAP` scores, so comparison is difficult.
 Other frameworks have [the same problem](https://github.com/Cysu/open-reid/issues/50), but we expect many not to be aware of this.
 
-To make the evaluating code independent of the sklearn version we have implemented our own version of the average precision computation.
-This now follows the official Market1501 code and results in values directly comparable.
+We provide evaluation code that computes the mAP as done by the Market-1501 MATLAB evaluation script, independent of the scikit-learn version.
+This can be used by providing the `--use_market_ap` flag when running `evaluate.py`.
 
 # Independent re-implementations
 
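As a side note on the new README text above: `--use_market_ap` is a plain argparse `store_true` flag, so providing it switches `evaluate.py` to the Market-1501-style AP and omitting it keeps the scikit-learn default. A minimal, self-contained Python sketch of that behaviour follows; it only includes the two flags visible in this commit (the real script defines more arguments) and uses a plain `int` instead of the repository's `common.positive_int`:

import argparse

# Trimmed-down parser: only the arguments visible in this commit's diff.
parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', default=256, type=int)
parser.add_argument('--use_market_ap', action='store_true', default=False)

# Passing the flag selects the Market-1501-style average precision.
args = parser.parse_args(['--use_market_ap'])
print(args.use_market_ap)  # True

# Omitting it keeps the default scikit-learn implementation.
args = parser.parse_args([])
print(args.use_market_ap)  # False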
evaluate.py

+18 -2
@@ -7,6 +7,7 @@
 import h5py
 import json
 import numpy as np
+from sklearn.metrics import average_precision_score
 import tensorflow as tf
 
 import common
@@ -50,8 +51,15 @@
     '--batch_size', default=256, type=common.positive_int,
     help='Batch size used during evaluation, adapt based on your memory usage.')
 
+parser.add_argument(
+    '--use_market_ap', action='store_true', default=False,
+    help='When this flag is provided, the average precision is computed exactly'
+         ' as done by the Market-1501 evaluation script, rather than the '
+         'default scikit-learn implementation that gives slightly different '
+         'scores.')
+
 
-def average_precision_score(y_true, y_score):
+def average_precision_score_market(y_true, y_score):
     """ Compute average precision (AP) from prediction scores.
 
     This is a replacement for the scikit-learn version which, while likely more
@@ -75,6 +83,8 @@ def average_precision_score(y_true, y_score):
             'got lengths y_true:{} and y_score:{}'.format(
                 len(y_true), len(y_score)))
 
+    # Mergesort is used since it is a stable sorting algorithm. This is
+    # important to compute consistent and correct scores.
    y_true_sorted = y_true[np.argsort(-y_score, kind='mergesort')]
 
     tp = np.cumsum(y_true_sorted)
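The comment added above is the change the commit message refers to: `np.argsort` with `kind='mergesort'` is a stable sort, so entries with tied scores keep a fixed order and the per-rank true-positive counts are reproducible. A small toy illustration (made-up data, not code from this repository):

import numpy as np

# Toy scores with a tied block: a stable sort keeps equal scores in their
# original order, so the ranking that feeds the AP is the same on every run.
y_score = np.array([0.9, 0.5, 0.5, 0.5, 0.1])
y_true = np.array([0, 1, 0, 1, 0])

order = np.argsort(-y_score, kind='mergesort')
print(order)                     # [0 1 2 3 4] -- ties keep their input order
print(np.cumsum(y_true[order]))  # deterministic cumulative true positives

# With a non-stable sort (numpy's default introsort) the order inside the
# tied block is not guaranteed, which can shuffle y_true[order] and change
# the per-rank precision values whenever scores contain ties.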
@@ -119,6 +129,12 @@ def main():
 
     batch_distances = loss.cdist(batch_embs, gallery_embs, metric=args.metric)
 
+    # Check if we should use Market-1501 specific average precision computation.
+    if args.use_market_ap:
+        average_precision = average_precision_score_market
+    else:
+        average_precision = average_precision_score
+
     # Loop over the query embeddings and compute their APs and the CMC curve.
     aps = []
     cmc = np.zeros(len(gallery_pids), dtype=np.int32)
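For context on why this dispatch exists: both implementations rank by score, but they compute average precision with different conventions, so the resulting numbers differ. The helper below is only a sketch in the spirit of the Market-1501 MATLAB evaluation (trapezoidal interpolation of the precision-recall curve, starting from precision 1 at recall 0); it is not copied from this repository's `average_precision_score_market`:

import numpy as np
from sklearn.metrics import average_precision_score

def market_style_ap(y_true, y_score):
    # Sketch only: trapezoidal interpolation of the precision-recall curve,
    # in the spirit of the Market-1501 MATLAB script; not this repository's code.
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind='mergesort')  # stable sort
    y_true_sorted = y_true[order]
    tp = np.cumsum(y_true_sorted)
    recall = tp / y_true.sum()
    precision = tp / np.arange(1, len(y_true) + 1)
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    prev_precision = np.concatenate(([1.0], precision[:-1]))
    return np.sum((recall - prev_recall) * (precision + prev_precision) / 2)

y_true = np.array([0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
print(average_precision_score(y_true, y_score))  # scikit-learn's step-wise AP
print(market_style_ap(y_true, y_score))          # trapezoidal AP, a different value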
@@ -153,7 +169,7 @@ def main():
     # it won't change anything.
     scores = 1 / (1 + distances)
     for i in range(len(distances)):
-        ap = average_precision_score(pid_matches[i], scores[i])
+        ap = average_precision(pid_matches[i], scores[i])
 
         if np.isnan(ap):
             print()
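A brief note on the `scores = 1 / (1 + distances)` context line kept above: the mapping is strictly decreasing in the distance, so ranking by descending score is identical to ranking by ascending distance, and the AP depends only on that ordering, which is presumably what the "it won't change anything" comment refers to. A quick toy check (not code from the repository):

import numpy as np

# 1 / (1 + d) is strictly decreasing in d, so sorting by descending score
# reproduces the ascending-distance ranking that the AP is computed over.
distances = np.array([0.2, 1.5, 0.7, 3.0])
scores = 1 / (1 + distances)
assert np.array_equal(np.argsort(distances), np.argsort(-scores))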
