After reading an interesting article in the February 2013 issue of MSDN Magazine by James McCaffrey, "Detecting Abnormal Data Using k-Means Clustering", I was eager to have a go at implementing this rather simple clustering algorithm myself.
After finally getting my k-Means implementation to match James's results, I expanded it a little and experimented with different configurations of the algorithm.
Download the code
You can download my C# implementation of the k-Means algorithm from my SkyDrive (or whatever they end up calling their cloud drive).
The differences
My implementation differs from James's only slightly, and the core algorithm is pretty much the same (k-means is rather straightforward anyway). The biggest differences lie in the API, as I wanted to experiment with a few concepts I had been re-reading about in my recently re-purchased data mining book.
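Stripped of the API extras, the core loop of k-means is short enough to sketch in full. The following is a minimal standalone version written purely for illustration (the names and structure are my own, not the downloadable code): assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the assignments stop changing.

```csharp
using System;
using System.Linq;

// Minimal k-means sketch: a rough standalone illustration of the core loop,
// not the downloadable implementation.
public static class KMeansSketch
{
    public static int[] Cluster(double[][] data, int k, int maxIterations)
    {
        var rnd = new Random(0);
        // Seed centroids with k distinct, randomly chosen data points.
        double[][] centroids = data.OrderBy(_ => rnd.Next())
                                   .Take(k)
                                   .Select(p => (double[])p.Clone())
                                   .ToArray();
        var assignments = new int[data.Length];

        for (int iter = 0; iter < maxIterations; iter++)
        {
            bool changed = false;

            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < data.Length; i++)
            {
                int best = 0;
                double bestDist = double.MaxValue;
                for (int c = 0; c < k; c++)
                {
                    double d = data[i].Zip(centroids[c], (x, y) => (x - y) * (x - y)).Sum();
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignments[i] != best) { assignments[i] = best; changed = true; }
            }
            if (!changed) break; // converged: no point switched cluster

            // Update step: move each centroid to the mean of its members.
            for (int c = 0; c < k; c++)
            {
                var members = data.Where((_, i) => assignments[i] == c).ToArray();
                if (members.Length == 0) continue; // leave an empty cluster's centroid in place
                for (int dim = 0; dim < centroids[c].Length; dim++)
                    centroids[c][dim] = members.Average(p => p[dim]);
            }
        }
        return assignments;
    }
}
```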
The total distance value
Although finding the optimal clustering is an NP-hard problem, I was interested in attempting to stumble on the global minimum through repeated runs of the algorithm. I wanted the total squared distance of each solution so that I could compare multiple runs and keep the one with the smallest value.
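For illustration, this value can be computed from the final assignments like so (a standalone sketch of the idea, not the library's own code):

```csharp
using System;
using System.Linq;

public static class TotalDistanceSketch
{
    // Total squared distance: the sum, over all points, of the squared
    // Euclidean distance from the point to its assigned centroid.
    // Lower is better; comparing this value across restarts is one way
    // to hunt for the global minimum.
    public static double TotalDistance(double[][] data, double[][] centroids, int[] assignments)
        => data.Select((p, i) => p.Zip(centroids[assignments[i]], (x, y) => (x - y) * (x - y)).Sum())
               .Sum();
}
```

A restart loop then simply re-runs the clustering several times and keeps the solution whose total is smallest.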
Initial centroid configuration (cluster centers)
I read that the initial configuration of the cluster centers can make a dramatic difference to both the quality of the final solution and the convergence speed. I therefore wanted to be able to seed a run with an experimental configuration, to see if this could be optimized outside of the k-Means algorithm itself.
Custom distance calculation
As the current Euclidean distance calculation is heavily biased towards large values, I wanted to be able to experiment with different formulas for calculating the distance, one example being summing the absolute values of the differences between the attributes.
Support for complex objects and attribute-based configuration
Real-life implementations need to run on complex objects that might be part of a larger software project. I wanted the algorithm to be able to process higher-level objects than just arrays of primitive value types, but with minimal loss in speed.
Implementing this was a fun Saturday coding session for me, and I hope my solution might be of use to someone else. If you find any errors, I would be grateful if you could leave a comment detailing the bug so that we might all learn a little more :)
Examples
The final algorithm, while still supporting James's original data array format, also supports a more complex object configuration with property access via the KMeansValueAttribute. For example:
public class MyObject
{
    [KMeansValue] public double Weight { get; set; }
    [KMeansValue] public int Height { get; set; }
    [KMeansValue] public double Fun { get; set; }
    public int NotUsed { get; set; }
    ...snip...
}
The algorithm can then be called with an array of MyObjects like so:
var result = KMeans.Cluster(new MyObject[] { obj1, obj2, ...etc... }, 3, 30);
The algorithm will then extract and use only the three properties marked with the [KMeansValue] attribute (Weight, Height and Fun). Any other properties the object exposes are not included in the calculations.
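Under the hood, this kind of attribute-driven extraction is typically done with reflection. Here is a rough standalone sketch of the idea (the attribute name mirrors the one above, but the ToVector helper and the Sample class are hypothetical, not part of the library):

```csharp
using System;
using System.Linq;
using System.Reflection;

// A marker attribute in the spirit of the article's KMeansValueAttribute.
[AttributeUsage(AttributeTargets.Property)]
public class KMeansValueAttribute : Attribute { }

public static class AttributeExtraction
{
    // Flattens an object into a plain double[] using only the properties
    // marked with [KMeansValue]. Convert.ToDouble lets int, float and
    // double properties all participate.
    public static double[] ToVector<T>(T item)
    {
        PropertyInfo[] props = typeof(T)
            .GetProperties()
            .Where(p => p.GetCustomAttribute<KMeansValueAttribute>() != null)
            .ToArray();
        return props.Select(p => Convert.ToDouble(p.GetValue(item))).ToArray();
    }
}

// Hypothetical example object: NotUsed carries no attribute, so it is skipped.
public class Sample
{
    [KMeansValue] public double Weight { get; set; }
    [KMeansValue] public int Height { get; set; }
    public int NotUsed { get; set; }
}
```

Since reflection is slow, caching the PropertyInfo array once per type (rather than reflecting per object) is the kind of thing that keeps the speed loss minimal.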
The KMeansResults<T> class that the main KMeans.Cluster() function returns contains all the original data point references arranged into the final cluster configuration, the Centroids that the run converged on, and the TotalDistance for all the data points from their respective cluster centroids.
Specifying initial cluster centers (centroids)
The KMeans.Cluster() function has an optional parameter called initialCentroidIndices that supports sending in a custom initial centroid configuration (based on the indices of items in the source data). Example:
KMeans.Cluster(data, 5, 30, initialCentroidIndices: new int[5] { 4,7,14,9,18 });
Using a custom distance function
To specify a custom distance function, the KMeans.Cluster() function exposes the calculateDistanceFunction parameter.
This is a predefined delegate of type:
double KMeansCalculateDistanceDelegate(double[] point, double[] centroid);
Here is an example that uses a distance function that only sums the absolute differences between the attributes:
KMeans.Cluster(data, 5, 30, calculateDistanceFunction: (p, c) => p.Select((x, i) => Math.Abs(c[i] - x)).Sum());
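To see why the choice of distance function matters: squaring each per-attribute difference lets one large attribute dominate the total, while the absolute-value variant weights differences linearly. A quick standalone comparison (these two helpers are written here for illustration, not taken from the library):

```csharp
using System;
using System.Linq;

public static class DistanceSketch
{
    // Squared Euclidean distance: squares each per-attribute difference,
    // so a single large difference can dominate the total.
    public static double SquaredEuclidean(double[] p, double[] c)
        => p.Zip(c, (x, y) => (x - y) * (x - y)).Sum();

    // Absolute-value (Manhattan) distance: weights differences linearly.
    public static double Manhattan(double[] p, double[] c)
        => p.Zip(c, (x, y) => Math.Abs(x - y)).Sum();
}
```

For the points (0, 0) and (10, 1), the squared Euclidean version gives 100 + 1 = 101, while the absolute-value version gives 10 + 1 = 11, so the large attribute accounts for a much smaller share of the total in the latter.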