#data-mining
Thread

ct2

clustering methods overview

grid-based vs. density-based clustering

Problems with clustering high-dimensional data:
Parameters are often hard to determine, especially for high-dimensional data sets and when users do not yet understand their data well. A data set can contain many dimensions (attributes). Finding clusters of data objects in a high-dimensional space is challenging because such data can be very sparse and highly skewed.

CLARA and CLARANS algorithms

Overview of CLARANS:
CLARANS (Clustering Large Applications based upon RANdomized Search) trades off the cost and the effectiveness of using samples to obtain a clustering.
First, it randomly selects k objects in the data set as the current medoids. It then randomly selects a current medoid x and an object y that is not one of the current medoids.
Then it checks for the following condition:
> Can replacing x by y improve the absolute-error criterion?
If yes, the replacement is made. CLARANS conducts such a randomized search l times. The set of current medoids after the l steps is considered a local optimum.
CLARANS repeats this randomized process m times and returns the best local optimum as the final result.
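
Rough sketch of the steps above in Python, in case it helps. It follows the description in this thread literally (exactly l random swap attempts per run, m runs), so it may differ in details from the published algorithm. The names `clarans`, `total_cost`, `dist` and `seed` are made up for illustration, and the absolute-error criterion is taken to be the sum of distances from every object to its nearest medoid. Assumes k is smaller than the number of objects.

```python
import random


def total_cost(points, medoids, dist):
    """Absolute-error criterion (assumed form): sum over all points of the
    distance to the nearest medoid. `medoids` holds indices into `points`."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def clarans(points, k, l, m, dist, seed=None):
    """Randomized medoid search as described above.

    k: number of medoids, l: swap attempts per run, m: number of runs.
    Returns (sorted medoid indices, cost of that medoid set).
    """
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, float("inf")

    for _ in range(m):
        # Randomly select k objects as the current medoids.
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current, dist)

        for _ in range(l):
            # Pick a current medoid x (by position) and a non-medoid y.
            i = rng.randrange(k)
            y = rng.choice([j for j in range(n) if j not in current])
            candidate = current[:i] + [y] + current[i + 1:]
            candidate_cost = total_cost(points, candidate, dist)

            # Replace x by y only if the absolute error improves.
            if candidate_cost < current_cost:
                current, current_cost = candidate, candidate_cost

        # After l steps, `current` is treated as a local optimum.
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost

    return sorted(best_medoids), best_cost


# Toy 1-D example with two well-separated groups.
points = [0, 1, 2, 100, 101, 102]
medoids, cost = clarans(points, k=2, l=50, m=5,
                        dist=lambda a, b: abs(a - b), seed=0)
print(medoids, cost)
```

Recomputing the full cost for every candidate swap is the simplest thing to write but costs O(n * k) per attempt; a real implementation would likely update the cost incrementally.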