To compare the performance of several spike sorting algorithms, we constructed "hybrid ground truth" datasets, in which spikes were added to real recordings at known times. The added spikes were taken from the same recording at a different part of the electrode array, so the challenge was realistic: the test spikes had natural waveform variability and drift.

Each algorithm was run on each of several test datasets. To evaluate performance, we determined, for each added "ground truth" cluster of spikes, which cluster produced by the algorithm most closely matched it. We then asked: how many of the true spikes were missing from that cluster (misses)? And how many extra spikes were included in that cluster that did not belong to the set of true spikes (false positives)? We converted these to a miss rate (misses as a proportion of the true spikes) and a false positive rate (false positives as a proportion of the spikes in the matched cluster), and computed an "initial score": 1 − FPR − MR. Finally, we asked: if a human operator were to decide perfectly when to merge clusters, how well would the resulting merged cluster score by the same metrics? This gave us a "post-merge score" for that cluster. We also tabulated the number of merges required to reach that optimum, as well as the total number of spikes and clusters returned by each algorithm. All seven of these statistics were computed for each algorithm on each of six test datasets, and are plotted on the main page.
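A minimal sketch of the per-cluster scoring described above, in Python. The data representation (spike times in milliseconds, sorted clusters as a dict of arrays) and the ±0.5 ms matching tolerance are illustrative assumptions, not the values used in the actual evaluation code:

```python
import numpy as np

def match_counts(gt_times, clu_times, tol=0.5):
    """Count ground-truth spikes that have a sorted spike within tol (ms)."""
    clu_times = np.sort(clu_times)
    idx = np.searchsorted(clu_times, gt_times)
    hits = 0
    for t, i in zip(gt_times, idx):
        # Check the nearest sorted spike on either side of the insertion point.
        diffs = []
        if i < len(clu_times):
            diffs.append(abs(clu_times[i] - t))
        if i > 0:
            diffs.append(abs(clu_times[i - 1] - t))
        if diffs and min(diffs) <= tol:
            hits += 1
    return hits

def cluster_scores(gt_times, clusters, tol=0.5):
    """clusters: dict mapping cluster id -> array of spike times (ms).
    Returns (best_id, miss_rate, fp_rate, score) for one ground-truth cluster,
    where score = 1 - FPR - MR and "best" maximizes that score."""
    best = None
    for cid, times in clusters.items():
        hits = match_counts(gt_times, times, tol)
        mr = 1 - hits / len(gt_times)           # misses, as fraction of true spikes
        fpr = (len(times) - hits) / len(times)  # extras, as fraction of cluster spikes
        score = 1 - fpr - mr
        if best is None or score > best[3]:
            best = (cid, mr, fpr, score)
    return best
```

The post-merge score can be estimated with the same machinery by pooling the spike times of a candidate set of clusters and rescoring the pooled set.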

The datasets have the following properties:

- 1. Random Gaussian background noise was generated on all channels at all times. Spikes from a small number of neurons were added, and each added spike was an exact copy of the mean waveform across all spikes for that neuron.
- 2. Random Gaussian background noise as above, but each spike was added according to the SVD method: a singular value decomposition was performed across all originally recorded spikes, and each spike was reconstructed from the top 6 components, yielding a de-noised version of each individual spike that retains spike-to-spike variation. These de-noised but variable spikes were added to the dataset.
- 3. A real dataset with all ongoing activity was used as the starting point, but each added spike was a copy of the mean waveform rather than an SVD-denoised spike.
- 4-6. Real datasets were used, and SVD-denoised spikes with spike-to-spike variation were added.
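The SVD denoising step used in datasets 2 and 4-6 can be sketched as follows. This is one reasonable reading of "reconstructed from the top 6 components" (here, SVD of the mean-subtracted waveform matrix, adding the mean back after truncation); the actual implementation in the linked MATLAB code may differ in detail:

```python
import numpy as np

def svd_denoise(waveforms, n_components=6):
    """waveforms: (n_spikes, n_samples) array of one neuron's spikes.
    Reconstruct each spike from the top SVD components of the
    mean-subtracted waveforms, keeping spike-to-spike variability
    while discarding low-variance (noise-dominated) components."""
    mean_wf = waveforms.mean(axis=0)
    centered = waveforms - mean_wf
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    # Zero out all but the leading singular values before reconstructing.
    S_trunc = np.zeros_like(S)
    S_trunc[:n_components] = S[:n_components]
    return mean_wf + (U * S_trunc) @ Vt
```

With `n_components=0` this reduces to the mean-waveform case (datasets 1 and 3); keeping all components reproduces the original spikes exactly.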

In all cases, spikes were added with a temporal jitter around the original spike times. See https://github.com/cortex-lab/groundTruth/blob/master/creation/CreateGroundTruth.m for the code that generated these datasets.
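The insertion-with-jitter step might look like the following sketch, which adds a fixed waveform into a multichannel recording at jittered sample times. The function name, the uniform integer jitter, and the array shapes are assumptions for illustration; see the linked MATLAB code for the actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_spikes(data, waveform, spike_samples, jitter=5):
    """Add a (n_channels, n_waveform_samples) waveform into a
    (n_channels, n_total_samples) recording at each spike time,
    jittered uniformly by up to +/- jitter samples.
    Returns the new recording and the actual (jittered) onset times."""
    n_ch, n_samp = waveform.shape
    out = data.copy()
    times = []
    for t in spike_samples:
        tj = t + rng.integers(-jitter, jitter + 1)
        if 0 <= tj and tj + n_samp <= out.shape[1]:
            out[:, tj:tj + n_samp] += waveform
            times.append(tj)
    return out, np.array(times)
```

The jittered onset times are what get saved as the ground-truth spike times for scoring.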

Feel free to contact Nick Steinmetz, nick[dot]steinmetz[at]gmail[dot]com, with any questions or feedback.