After reading a rather interesting article by James McCaffrey in the February 2013 issue of MSDN Magazine, "Detecting Abnormal Data Using k-Means Clustering", I was eager to have a go at implementing this rather simple clustering algorithm myself.

After finally getting my k-means implementation to match James's results, I proceeded to expand it a little and experiment with different configurations of the algorithm.

## Download the code

You can download my C# implementation of the k-means algorithm from my SkyDrive (or whatever they end up calling their cloud drive).

## The differences

My implementation differs from James's only slightly; the core algorithm is pretty much the same (k-means is rather straightforward anyway). The biggest differences lie in the API, as I wanted to experiment with a few concepts I was re-reading about in my recently re-purchased data mining book.

**The total distance value**
Although finding the optimal clustering is an NP-hard problem, I was interested in attempting to stumble on the global minimum through repeated runs of the algorithm. I wanted the total squared distance of each solution so that I could compare multiple runs and pick the smallest value.

**Initial centroid configuration (cluster centers)**
I read that the initial configuration of the cluster centers can make a dramatic difference in both the quality of the final solution and the convergence speed. I therefore wanted to be able to seed a run with an experimental configuration to see if this could be optimized outside of the k-means algorithm itself.

**Custom distance calculation**
As the default Euclidean distance calculation is heavily biased towards large values, I wanted to be able to experiment with different formulas for calculating the distance, for example taking the sum of the absolute differences between the attributes.

**Support for complex objects and attribute-based configuration**
Real-life implementations need to run on complex objects that might be part of a larger software project. I wanted the algorithm to be able to process higher-level objects than just primitive value-type arrays, but with minimal loss in speed.
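Exposing the total distance value makes a simple restart loop possible: run the clustering several times and keep the cheapest solution. Here is a rough sketch of the idea, assuming the `KMeans.Cluster(data, clusters, maxIterations)` call and the `TotalDistance` property described in this post (the exact member names may differ in the downloaded source):

```csharp
// Restart loop sketch: since k-means only converges to a local minimum,
// repeat the run several times and keep the solution with the smallest
// total squared distance.
KMeansResults<MyObject> best = null;
for (int run = 0; run < 10; run++)
{
    var candidate = KMeans.Cluster(data, 3, 30);
    if (best == null || candidate.TotalDistance < best.TotalDistance)
        best = candidate;
}
// 'best' now holds the cheapest of the ten runs.
```

Ten restarts is an arbitrary choice here; more restarts raise the odds of hitting a good minimum at the cost of proportionally more compute.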

Implementing this was a fun Saturday coding session for me, and I hope my solution might be of use to someone else. If you find any errors, I would be grateful if you could leave a comment detailing the bug so that we might all learn a little more :)

## Examples

The final algorithm, while still supporting James's original data-array format, also supports a more complex object configuration with property access via the `KMeansValueAttribute`. For example:

```
public class MyObject
{
    [KMeansValue] public double Weight { get; set; }
    [KMeansValue] public int Height { get; set; }
    [KMeansValue] public double Fun { get; set; }
    public int NotUsed { get; set; }
    // ...snip...
}
```

The algorithm can then be called with an array of `MyObject`'s like so:

```
var result = KMeans.Cluster(new MyObject[] { obj1, obj2, ...etc... }, 3, 30);
```

The algorithm will then extract and use only the three properties marked with the `[KMeansValue]` attribute (Weight, Height and Fun). Any other properties the objects expose are excluded from the calculations.

The `KMeansResults<T>` class that the main `KMeans.Cluster()` function returns contains all the original data point references arranged into the *final cluster configuration*, the *Centroids* that the run converged on, and the *TotalDistance* of all the data points from their respective cluster centroids.
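As a quick sketch of consuming that result (the cluster-collection member name is illustrative here; check the downloaded source for the exact API):

```csharp
var result = KMeans.Cluster(myObjects, 3, 30);

// Total squared distance of this run; useful when comparing repeated runs.
Console.WriteLine("Total distance: " + result.TotalDistance);

// Each cluster holds references back to the original MyObject instances.
// ('Clusters' is a hypothetical member name for the final configuration.)
for (int i = 0; i < result.Clusters.Length; i++)
{
    Console.WriteLine("Cluster " + i + " contains " + result.Clusters[i].Length + " items");
}
```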

### Specifying initial cluster centers (centroids)

The `KMeans.Cluster()` function has an optional parameter called `initialCentroidIndices` that supports sending in a custom initial centroid configuration (based on the indices of items in the source data). Example:

```
KMeans.Cluster(data, 5, 30, initialCentroidIndices: new int[5] { 4, 7, 14, 9, 18 });
```

### Using a custom distance function

To specify a custom distance function, the `KMeans.Cluster()` function exposes the `calculateDistanceFunction` parameter. This is a predefined delegate of type:

```
public delegate double KMeansCalculateDistanceDelegate(double[] point, double[] centroid);
```

Here is an example that uses a distance function that simply sums the absolute differences between the attributes:

```
KMeans.Cluster(data, 5, 30, calculateDistanceFunction: (p, c) => p.Select((x, i) => Math.Abs(c[i] - x)).Sum());
```
