12 Jul


A distance measure for clustering mixed data

Most likely you have heard of Manhattan distance or Euclidean distance. Both are metrics that quantify how distant (or different) two given data points are.

Manhattan and Euclidean distance graphed. Image by author

In a nutshell, Euclidean distance is the straight-line distance from point A to point B, which is the shortest possible path between them. Manhattan distance is the sum of the absolute differences between the points' x and y coordinates. It measures the distance as if the points were placed on a grid where you could only move up, down, left, or right (not diagonally).
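As a minimal sketch of the two definitions above, the snippet below computes both distances for a pair of 2D points. The function names and the sample points are illustrative assumptions, not taken from the original article.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Grid distance: sum of the absolute differences along each coordinate.
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical sample points, chosen so the arithmetic is easy to follow.
point_a = (1, 2)
point_b = (4, 6)

print(euclidean_distance(point_a, point_b))  # 5.0 (sqrt(3**2 + 4**2))
print(manhattan_distance(point_a, point_b))  # 7   (3 + 4)
```

For these points the Manhattan distance is larger than the Euclidean one. In general it is never smaller, because the grid path cannot be shorter than the straight line.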

Distance metrics often underlie clustering algorithms, such as k-means clustering, which uses Euclidean distance. This makes sense: in order to define clusters, you first have to know how similar or different two data points are (that is, how distant they are from each other).
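To illustrate how a distance metric feeds into clustering, here is a rough sketch of just the assignment step of k-means, where each point is assigned to its closest cluster center by Euclidean distance. The centroids, the points, and the helper name `nearest_centroid` are made up for illustration. A full k-means implementation would also recompute the centroids and repeat until the assignments stop changing.

```python
import math

def nearest_centroid(point, centroids):
    # Euclidean distance from the point to every centroid (math.dist needs Python 3.8+).
    distances = [math.dist(point, c) for c in centroids]
    # Index of the closest centroid; ties go to the first one.
    return distances.index(min(distances))

# Hypothetical cluster centers and data points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]

labels = [nearest_centroid(p, centroids) for p in points]
print(labels)  # [0, 1, 0]
```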

Calculating the distance between 2 points

To show this process in action, I will start with an example using Euclidean distance.


