Week 11, Lecture 2
Slide Summary: MAST 90138: MULTIVARIATE STATISTICAL TECHNIQUES
1. Cluster Analysis Introduction:
Classification vs. Clustering:
Classification: Classify into known groups with labeled training data.
Clustering: Identify potential groups without labeled data (unsupervised learning).
Data Format: Observations \(X_1, \ldots, X_n\), with \(X_i = (X_{i1}, \ldots, X_{ip})^T\).
2. Real-life Example:
A new company wishes to identify clusters within their customers based on purchasing behavior. No training data is available.
3. Clustering Objectives:
Individuals within clusters should be more closely related to each other than to those in other clusters.
Hierarchical clustering: Arrange clusters in a hierarchy, breaking down larger clusters into smaller ones.
4. Principles of Cluster Analysis:
Used to determine if observations come from multiple groups.
Individuals within a cluster are similar. This similarity depends on the defined measure.
Selecting the right similarity measure is crucial and should align with the data type and problem at hand.
5. Dissimilarity Matrices:
Many clustering algorithms take a dissimilarity matrix as input: an \(n \times n\) matrix \(D\) such that \(D_{ij}\) measures the dissimilarity between the \(i^{th}\) and \(j^{th}\) individuals.
Requirements:
Nonnegative elements.
Zero diagonal elements: \(D_{ii} = 0\).
Symmetric: \(D_{ij} = D_{ji}\). If \(D\) is not symmetric, replace it with \((D + D^T)/2\).
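The three requirements above can be checked mechanically. The following is a minimal sketch in plain Python, assuming \(D\) is stored as a list of lists; the function names `is_valid_dissimilarity` and `symmetrize` and the example matrix `D_raw` are illustrative and not part of the lecture material.

```python
def is_valid_dissimilarity(D, tol=1e-12):
    """Check the three requirements: nonnegative, zero diagonal, symmetric."""
    n = len(D)
    if any(len(row) != n for row in D):
        return False  # not a square matrix
    for i in range(n):
        if abs(D[i][i]) > tol:
            return False  # diagonal entry is not zero
        for j in range(n):
            if D[i][j] < 0:
                return False  # negative dissimilarity
            if abs(D[i][j] - D[j][i]) > tol:
                return False  # not symmetric
    return True


def symmetrize(D):
    """Return (D + D^T) / 2, the symmetric replacement for D."""
    n = len(D)
    return [[(D[i][j] + D[j][i]) / 2 for j in range(n)] for i in range(n)]


# Hypothetical asymmetric dissimilarities between three individuals.
D_raw = [
    [0, 1, 4],
    [3, 0, 2],
    [4, 6, 0],
]
D_sym = symmetrize(D_raw)
```

Here `D_raw` fails the check only because it is asymmetric; averaging it with its transpose keeps the zero diagonal and nonnegative entries, so `D_sym` passes.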
6. Understanding ‘Distance’ in Clustering:
Dissimilarity can be seen as a function \(D : \mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}^+\).
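A true distance (metric) must satisfy additional properties, such as symmetry \(D(a, b) = D(b, a)\) and the triangle inequality \(D(a, c) \le D(a, b) + D(b, c)\). A dissimilarity used for clustering need not satisfy all of them.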
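To illustrate the gap between a dissimilarity and a true distance, the sketch below checks the triangle inequality on a dissimilarity matrix. It assumes the matrix is a list of lists; the function name `satisfies_triangle_inequality` and the three example points are illustrative choices. Absolute difference is a true distance, while squared difference is a perfectly usable dissimilarity that violates the triangle inequality.

```python
def satisfies_triangle_inequality(D, tol=1e-12):
    """Return True if D[i][k] <= D[i][j] + D[j][k] for every triple (i, j, k)."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if D[i][k] > D[i][j] + D[j][k] + tol:
                    return False
    return True


# Three points on the real line (illustrative data, p = 1).
points = [0.0, 1.0, 2.0]

# Absolute difference: a true distance.
abs_dist = [[abs(a - b) for b in points] for a in points]

# Squared difference: symmetric and nonnegative, but not a true distance,
# since (0 - 2)^2 = 4 exceeds (0 - 1)^2 + (1 - 2)^2 = 2.
sq_dist = [[(a - b) ** 2 for b in points] for a in points]

abs_ok = satisfies_triangle_inequality(abs_dist)
sq_ok = satisfies_triangle_inequality(sq_dist)
```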
7. Measuring Similarity using “Correlation”:
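One approach measures the similarity between two individuals, \(i\) and \(k\), by their correlation across the \(p\) variables: \( \rho(X_i, X_k) = \frac{\sum_{j=1}^{p} (X_{ij} - \bar{X}_i)(X_{kj} - \bar{X}_k)}{\sqrt{\sum_{j=1}^{p} (X_{ij} - \bar{X}_i)^2 \sum_{j=1}^{p} (X_{kj} - \bar{X}_k)^2}} \), where \(\bar{X}_i = \frac{1}{p} \sum_{j=1}^{p} X_{ij}\) is the average of individual \(i\)'s values over the variables.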
Transform this similarity into dissimilarity using: \( D_{ik} = 1 - \rho(X_i, X_k) \).
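The correlation-based dissimilarity can be sketched in a few lines of plain Python. This assumes that \(\rho(X_i, X_k)\) is the ordinary sample correlation between the two individuals' rows, computed across the \(p\) variables; the function names and the small data matrix `X` are illustrative. Since \(\rho\) lies in \([-1, 1]\), the resulting \(D_{ik}\) lies in \([0, 2]\).

```python
from math import sqrt


def correlation(x, y):
    """Sample correlation between two individuals, taken across their p variables.

    Assumes neither row is constant (otherwise the denominator is zero).
    """
    p = len(x)
    mean_x = sum(x) / p
    mean_y = sum(y) / p
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den


def correlation_dissimilarity(X):
    """Build D with D[i][k] = 1 - rho(X_i, X_k) for every pair of rows of X."""
    n = len(X)
    return [[1 - correlation(X[i], X[k]) for k in range(n)] for i in range(n)]


# Illustrative data: 3 individuals (rows) measured on p = 3 variables.
X = [
    [1.0, 2.0, 3.0],  # increasing profile
    [2.0, 4.0, 6.0],  # same shape, different scale
    [3.0, 2.0, 1.0],  # reversed profile
]
D = correlation_dissimilarity(X)
```

In this example the first two individuals have dissimilarity 0 even though their values differ in scale, and the first and third have the maximum dissimilarity of 2. This measure compares the shape of each individual's profile rather than its magnitude.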
8. Types of Data and Their Treatment:
Categorical/Nominal Variables: No inherent order. Users need to define a custom measure for differences.
Ordinal Variables: Have an inherent order or rank. They can be transformed to mimic quantitative variables for clustering. For \(M\) distinct ordered values: