DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

The 2015 SIGMOD Best Paper Award is for DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation.

The contribution of this paper is that it finds a 17 year old bug in academic world, which is much older than the 5-Year-Old Linux Kernel Bug.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. The 2014 SIGKDD Test of Time Award{.broken_link} recognizes DBSCAN have had an important impact on the data mining research community in the last decade.

The main advantage of density-based clustering (over methods such as k-means) is its capability of discovering clusters with arbitrary shapes (while k-means typically returns ball-like clusters).

shaped-cluster

Time Complexity

The mis-claim in DBSCAN is the computation of time complexity.

The original KDD’96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim; and that algorithm actually requires O(n^2) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found.

In this paper, we prove that for d ≥ 3, the DBSCAN problem requires (n4/3) time to solve, unless very significant breakthroughs—ones widely believed to be impossible—could be made in theoretical computer science. This (i) explains why the community’s search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice.