Feature Selection for Unsupervised Learning with R and Python

Starting from dimensionality reduction

Feature selection is one technique of data dimensionality reduction. According to the book Data Mining: Concepts and Techniques, the most common methods are:

  • wavelet transforms
  • principal components analysis (PCA)
  • attribute subset selection (or feature selection)

It is worth mentioning that PCA, Exploratory Factor Analysis (EFA), SVD, etc. are all methods that reconstruct our original attributes. PCA essentially creates new variables that are linear combinations of the original variables.
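To see this concretely, here is a minimal sketch (toy data and sklearn's PCA; the sizes and names are just for illustration): the component weights are exactly the linear-combination coefficients.

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples, 4 original attributes
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2).fit(X)

# Each new variable (principal component) is a linear combination of the
# 4 original attributes; the combination weights are the rows of components_.
print(pca.components_.shape)   # (2, 4)

# transform() is exactly that linear combination applied to the centered data
print(np.allclose(pca.transform(X), (X - pca.mean_) @ pca.components_.T))   # True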

However, if we want to preserve the original attributes, then take a look at feature selection.

Overview of Feature Selection

There exist several ways to categorize the techniques of feature selection.1 2

Yet from the problem-solving perspective, I divide the techniques into these groups:

  • Supervised (regression): LASSO, RFE, Autoencoder, etc. The regression area has been investigated extensively (more information); a minimal sketch follows this list.
  • Unsupervised: principal feature analysis (PFA)
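As a quick illustration of the supervised side, here is a small RFE sketch on synthetic regression data (the dataset and parameters are made up for demonstration). Note that, unlike PCA, it keeps a subset of the original columns:

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with 10 candidate features
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Recursive Feature Elimination keeps original columns instead of combining them
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X, y)
print(rfe.support_)   # boolean mask over the original features
print(rfe.ranking_)   # 1 marks a selected feature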

Concepts of the unsupervised method (PFA)

Details can be found in the referenced paper.

Steps:

  1. Compute the sample covariance matrix or correlation matrix.
  2. Compute the principal components and eigenvalues of the covariance or correlation matrix A.
  3. Choose the subspace dimension q; we get the new matrix A_q, whose rows are the vectors V_i.
  4. Cluster the vectors V_i using K-Means.
  5. For each cluster, find the vector V_i closest to the cluster mean; the corresponding original feature is selected.

Since many readers have asked about the covariance calculation: the paper states that either the covariance or the correlation matrix can be used, and the covariance/correlation computation is embedded in PCA. That is why the first step does not appear explicitly in the code below.
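If you want to convince yourself, here is a small check (my own sketch, not from the paper): PCA on standardized data recovers the same axes as an explicit eigen-decomposition of the correlation matrix.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))

# "By hand": eigen-decomposition of the correlation matrix
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)        # ascending eigenvalue order

# PCA on standardized data works on the same correlation structure
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# The leading principal axis equals the top correlation eigenvector (up to sign)
print(np.allclose(np.abs(eigvecs[:, -1]), np.abs(pca.components_[0])))   # True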

The steps as stated in the paper.

Code

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        # Default: keep all q = p dimensions of the component space
        if not self.q:
            self.q = X.shape[1]

        # Standardize so that PCA works on the correlation structure
        sc = StandardScaler()
        X = sc.fit_transform(X)

        # The covariance/correlation computation is embedded in PCA
        pca = PCA(n_components=self.q).fit(X)
        # Rows of A_q are the vectors V_i (one per original feature)
        A_q = pca.components_.T

        # Cluster the vectors V_i with K-Means
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        # For each cluster, keep the feature whose vector is closest to the center
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

The usage:

pfa = PFA(n_features=10)
pfa.fit(dataset)
# To get the selected features (columns of the standardized data)
x = pfa.features_
# To get the column indices of the kept features
column_indices = pfa.indices_
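
The paper chooses the subspace dimension q so that enough of the variability is retained. The class above lets you pass q explicitly; one possible way to pick it (my own sketch, not part of the original snippet, and the 90% threshold is just an example) is from PCA's explained variance ratio:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Pick the smallest q that keeps at least 90% of the variance, then hand it to PFA
# ("dataset" is the same data matrix used in the usage example above)
X_std = StandardScaler().fit_transform(dataset)
ratios = PCA().fit(X_std).explained_variance_ratio_
q = int(np.searchsorted(np.cumsum(ratios), 0.90) + 1)

pfa = PFA(n_features=10, q=q)
pfa.fit(dataset)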

Conclusion

Next time we’ll take a closer look at supervised methods.

References:

  1. Wikipedia: Feature selection
  2. Exploratory Factor Analysis
  3. An Introduction to Feature Selection
  4. 11/2000, Feature selection for unsupervised learning
  5. Feature Selection Using Principal Feature Analysis
  6. Various dimensionality reduction algorithms (降维的多种算法)