Cluster Analysis

Inception Innovation, Susan K.C.

Full Stack Developer

Introduction

A cluster is a collection of data objects that are similar to one another within the same group and dissimilar to the objects in other groups.

Clustering is a form of unsupervised learning (i.e., there are no predefined classes) in which a set of data is partitioned into groups (i.e., clusters) whose members are as similar as possible. Its goal is to find hidden patterns or groupings in a dataset. Cluster analysis is also known as clustering or data segmentation.

Typical ways to use/apply cluster analysis

1. As a stand-alone tool to get insight into data distribution, or

2. As a preprocessing (or intermediate) step for other algorithms

Clustering algorithms form groupings or clusters in such a way that data within a cluster are more similar to one another than to data in any other cluster. The measure of similarity on which the clusters are modeled can be defined by Euclidean distance, probabilistic distance, or another metric. Cluster analysis is an unsupervised learning method and an important task in exploratory data analysis. Popular clustering algorithms include:

1. Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree

2. k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster

3. Gaussian mixture models: models clusters as a mixture of multivariate normal density components

4. Self-organizing maps: uses neural networks that learn the topology and distribution of the data

A distinguishing feature of each of these algorithms is the metric it uses to measure similarity.
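To make the k-means description above concrete, here is a minimal sketch in plain Python. It assumes Euclidean distance as the similarity measure and seeds the centroids with the first k points so that the run is reproducible; both choices, the function names, and the sample points are illustrative assumptions rather than part of any particular library.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100):
    """Basic k-means (Lloyd's iteration).

    Assumption: the first k points seed the centroids. Random or
    k-means++ seeding is more common in practice.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Hypothetical sample data: two visibly separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Swapping `euclidean` for another distance function changes which points count as "similar", which is one way the user's choices shape the result.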

Cluster analysis is similar in concept to discriminant analysis. In discriminant analysis the group membership of a sample of observations is known upfront, while in cluster analysis it is not known for any observation. Cluster analysis is an exploratory data analysis tool for organizing observed data or cases into two or more groups. It maximizes the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknown.

Clustering results are somewhat subjective, as they depend greatly on the user's choices, such as the algorithm and the similarity measure. Traditional cluster analysis is usually performed to group either observations or variables separately, but simultaneous co-clustering (or biclustering) of the rows and the columns of the data matrix is also a suitable alternative, for example when searching for biomarkers.
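One common way to quantify "similar within clusters, dissimilar between clusters" is to compare the within-cluster and between-cluster sums of squared distances. The sketch below assumes squared Euclidean distance and uses hypothetical helper names and sample points; it scores two candidate groupings of the same data.

```python
def mean(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter(points, labels):
    """Return (within, between) sums of squares for a given labelling."""
    overall = mean(points)
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    within = 0.0
    between = 0.0
    for members in groups.values():
        centre = mean(members)
        # Spread of the members around their own cluster centre.
        within += sum(sq_dist(p, centre) for p in members)
        # Distance of the cluster centre from the overall centre,
        # weighted by cluster size.
        between += len(members) * sq_dist(centre, overall)
    return within, between


# Hypothetical data: two tight pairs that lie far apart.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
good = scatter(points, [0, 0, 1, 1])  # groups the nearby points together
bad = scatter(points, [0, 1, 0, 1])   # mixes the two pairs
print("good grouping (within, between):", good)
print("bad grouping  (within, between):", bad)
```

For a fixed dataset the two quantities add up to the same total, so lowering the within-cluster scatter and raising the between-cluster scatter are two views of the same objective.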

Purpose

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations. There are countless examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation, the less similar the members of the respective class are. Man has more in common with all other primates (e.g., apes) than he does with the more "distant" members of the mammals (e.g., dogs), etc.
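The observation that higher levels of aggregation contain less similar members is what a hierarchical (agglomerative) clustering records as merge distances. The sketch below is a minimal illustration under assumptions of my own: one-dimensional hypothetical values, single linkage (the distance between two clusters is the distance between their closest members), and an illustrative function name.

```python
def single_linkage(values):
    """Agglomerative clustering of 1-D values with single linkage.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical measurements: two close pairs and one distant value.
values = [1.0, 1.2, 5.0, 5.5, 20.0]
merges = single_linkage(values)
heights = [d for _, _, d in merges]
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 2))
```

The merge distances never decrease from one level to the next here, which mirrors the taxonomy example: the broader the group, the more its members differ.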

Area of Application

Clustering techniques have been applied to a wide variety of research problems. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever we need to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.

1. Marketing: Helping marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
2. Land use: Identifying areas of similar land use in an earth observation database.
3. Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
4. City planning: Identifying groups of houses according to their house type, value, and geographical location.
5. Earthquake studies: Observed earthquake epicenters should be clustered along continental faults.