To compare the performance of several spike sorting algorithms, we constructed "hybrid ground truth" datasets, in which spikes were added to real recordings at known times. The added spikes were taken from the same recording at a different part of the electrode array, so the challenge was realistic: the test spikes exhibited genuine waveform variability and drift.
Each algorithm was run on each of several test datasets. To evaluate performance, we determined, for each added "ground truth" cluster of spikes, which cluster produced by the algorithm matched it most closely. We then asked two questions: how many of the true spikes were missing from that cluster (misses), and how many spikes in that cluster did not belong to the set of true spikes (false positives)? We converted these counts to a miss rate (MR) and a false positive rate (FPR), each expressed as a proportion of the number of spikes in the cluster, and computed an "initial score" of 1 - FPR - MR. Finally, we asked: if a human operator were to decide perfectly when to merge clusters, how well would the resulting final cluster score by the same metrics? This gave us a "post-merge score" for that cluster. We also tabulated the number of merges required to reach that optimum, as well as the total number of spikes and clusters returned by each algorithm. All seven of these statistics were computed for each algorithm on each of six test datasets, and are plotted on the main page.
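The sketch below illustrates this scoring procedure for a single ground-truth cluster, written in the same language as the dataset-creation code linked below (MATLAB). It is a minimal illustration under stated assumptions, not the evaluation code actually used: the function name, the matching tolerance `tol`, the convention of normalizing MR by the ground-truth count and FPR by the cluster size, and the greedy search standing in for the "perfect merge" oracle are all assumptions.

```matlab
function [initScore, postMergeScore, nMerges] = scoreGroundTruthCluster(gtTimes, sortedTimes, sortedLabels, tol)
% Illustrative scoring sketch (not the repository's code).
% gtTimes      - spike times of one added ground-truth cluster (seconds)
% sortedTimes  - spike times returned by the algorithm
% sortedLabels - cluster label for each sorted spike
% tol          - temporal tolerance for counting a match, e.g. 1e-3 s (assumed)

clusterIDs = unique(sortedLabels);
nClu  = numel(clusterIDs);
hits  = zeros(nClu, 1);   % ground-truth spikes recovered by each cluster
sizes = zeros(nClu, 1);   % total spikes in each cluster

for c = 1:nClu
    t = sortedTimes(sortedLabels == clusterIDs(c));
    sizes(c) = numel(t);
    % a ground-truth spike counts as recovered if any cluster spike
    % falls within tol of it
    for k = 1:numel(gtTimes)
        if any(abs(t - gtTimes(k)) <= tol)
            hits(c) = hits(c) + 1;
        end
    end
end

% best-matching cluster: the one recovering the most ground-truth spikes
[~, best] = max(hits);
mr  = 1 - hits(best) / numel(gtTimes);           % miss rate
fpr = (sizes(best) - hits(best)) / sizes(best);  % false positive rate
initScore = 1 - fpr - mr;

% greedy stand-in for the "perfect merge" oracle: keep merging in whichever
% cluster improves the score, until no merge helps
merged = best;
postMergeScore = initScore;
improved = true;
while improved
    improved = false;
    for c = setdiff(1:nClu, merged)
        h = sum(hits([merged c]));   % approximation: assumes disjoint hits
        s = sum(sizes([merged c]));
        score = 1 - (s - h) / s - (1 - h / numel(gtTimes));
        if score > postMergeScore
            postMergeScore = score;
            merged = [merged c]; %#ok<AGROW>
            improved = true;
        end
    end
end
nMerges = numel(merged) - 1;
end
```

The greedy search is used here only as a tractable stand-in for an operator who merges optimally; an exhaustive search over cluster subsets would find the true optimum but grows exponentially with the number of clusters.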
The datasets have the following properties:
In all cases, spikes were added with a temporal jitter around the original spike times. See https://github.com/cortex-lab/groundTruth/blob/master/creation/CreateGroundTruth.m for the code that generated these datasets.
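For a concrete picture of the jittering step, here is a minimal sketch; the variable names and the jitter magnitude `jitterSD` are illustrative assumptions, and the linked CreateGroundTruth.m is the authoritative code.

```matlab
% Illustrative only; see CreateGroundTruth.m (linked above) for the real code.
origTimes = sort(rand(100, 1) * 60);               % example donor spike times (s)
jitterSD  = 5e-4;                                  % assumed jitter SD, e.g. 0.5 ms
newTimes  = origTimes + jitterSD * randn(size(origTimes));  % Gaussian jitter
newTimes  = sort(newTimes(newTimes > 0));          % keep valid, ordered times
```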
Feel free to contact Nick Steinmetz, nick[dot]steinmetz[at]gmail[dot]com, with any questions or feedback.