Calculus of Variations and Geometric Measure Theory

A. De Rosa - A. Khajavirad - Y. Wang

On the power of linear programming for K-means clustering


preprint

Inserted: 5 Feb 2024
Last Updated: 16 Aug 2024

Year: 2024

arXiv: 2402.01061

Abstract:

In (SIAM J. Optim., 2022), the authors introduced a new linear programming (LP) relaxation for K-means clustering. In this paper, we further investigate both theoretical and computational properties of this relaxation. As evident from our numerical experiments with both synthetic and real-world data sets, the proposed LP relaxation is almost always tight; i.e., its optimal solution is feasible for the original nonconvex problem. To better understand this unexpected behaviour, on the theoretical side we focus on K-means clustering with two clusters and obtain sufficient conditions under which the LP relaxation is tight. We further analyze these sufficient conditions when the input is generated according to a popular stochastic model and obtain recovery guarantees for the LP relaxation. We conclude our theoretical study by constructing a family of inputs for which the LP relaxation is never tight. Denoting by $n$ the number of data points to be clustered, the LP relaxation contains $\Omega(n^3)$ inequalities, making it impractical for large data sets. To address this scalability issue, we build upon a cutting-plane algorithm together with the GPU implementation of PDLP, a first-order LP solver, and develop an efficient algorithm that solves the proposed LP, and hence the K-means clustering problem, for inputs with up to $n = 4000$ data points.
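
To make the cutting-plane mechanism concrete, the Python sketch below implements a toy version of the loop. The pairwise formulation it uses (a symmetric co-membership matrix $X$ with row-sum and trace constraints, cut lazily by the $\Omega(n^3)$ triangle-type inequalities $X_{ij} + X_{ik} - X_{jk} \le X_{ii}$) is a common simplification assumed here purely for illustration; it is not claimed to be the exact relaxation studied in the paper, the function name kmeans_lp_cutting_plane is made up, and the LP solves call SciPy's HiGHS backend rather than the GPU implementation of PDLP.

    # Toy cutting-plane loop for a pairwise LP relaxation of K-means.
    # NOT the relaxation of the paper; a simplified illustration of the
    # mechanism: solve a core LP, separate violated triangle-type cuts
    # from an Omega(n^3) family, add them, and resolve until none remain.
    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def kmeans_lp_cutting_plane(points, K, max_rounds=30, tol=1e-7):
        n = len(points)
        # Symmetric variable X[i, j]: co-membership weight of points i, j.
        # Only the upper triangle (i <= j) is stored in the LP vector.
        idx = {(i, j): t for t, (i, j) in
               enumerate((i, j) for i in range(n) for j in range(i, n))}
        nv = len(idx)
        var = lambda i, j: idx[(min(i, j), max(i, j))]

        # Objective: sum_{i,j} ||p_i - p_j||^2 X_{ij}; each off-diagonal
        # pair contributes twice because X is symmetric.
        D = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        c = np.zeros(nv)
        for i in range(n):
            for j in range(i + 1, n):
                c[var(i, j)] = 2.0 * D[i, j]

        # Core equalities: every row of X sums to 1, and trace(X) = K.
        A_eq = np.zeros((n + 1, nv))
        for i in range(n):
            for j in range(n):
                A_eq[i, var(i, j)] += 1.0
            A_eq[n, var(i, i)] = 1.0
        b_eq = np.concatenate([np.ones(n), [float(K)]])

        # Base inequalities: X_{ij} <= X_{ii} for all i != j.
        A_ub, b_ub = [], []
        for i in range(n):
            for j in range(n):
                if i != j:
                    row = np.zeros(nv)
                    row[var(i, j)] += 1.0
                    row[var(i, i)] -= 1.0
                    A_ub.append(row)
                    b_ub.append(0.0)

        for _ in range(max_rounds):
            res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                          A_eq=A_eq, b_eq=b_eq, bounds=(0, None),
                          method="highs")
            X = np.zeros((n, n))
            for (i, j), t in idx.items():
                X[i, j] = X[j, i] = res.x[t]
            # Separation over the Omega(n^3) triangle-type family:
            #   X_{ij} + X_{ik} - X_{jk} <= X_{ii}  for distinct i, j, k.
            cuts = []
            for i in range(n):
                for j, k in itertools.combinations(range(n), 2):
                    if i in (j, k):
                        continue
                    if X[i, j] + X[i, k] - X[j, k] > X[i, i] + tol:
                        row = np.zeros(nv)
                        row[var(i, j)] += 1.0
                        row[var(i, k)] += 1.0
                        row[var(j, k)] -= 1.0
                        row[var(i, i)] -= 1.0
                        cuts.append(row)
            if not cuts:
                return res.fun, X  # no violated cuts left: LP optimal
            A_ub.extend(cuts)
            b_ub.extend([0.0] * len(cuts))
        # Round limit reached; last LP value may still violate some cuts.
        return res.fun, X

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        pts = np.vstack([rng.normal(0.0, 0.1, (6, 2)),
                         rng.normal(3.0, 0.1, (6, 2))])
        val, X = kmeans_lp_cutting_plane(pts, K=2)
        # On well-separated data a relaxation of this type is often tight:
        # X is then a partition matrix (1/|C| inside each cluster C, 0
        # elsewhere), so the clustering can be read off directly.
        print(np.round(X, 2))

When the relaxation is tight in this sense, the LP optimum is itself a feasible clustering, which is what "solving the proposed LP and hence the K-means clustering problem" means in the abstract. Per the abstract, the paper's pipeline replaces the dense simplex/barrier solves of this toy with the GPU implementation of PDLP, since at $n$ in the thousands each LP in the loop is already large before any cuts are added.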