TAGS :Viewed: 6 - Published at: a few seconds ago

[ Deriving labels using K-means and then training using a classifier (python) ]

I have with me a feature dataset 'X' and a label dataset 'Y'. Now in this problem I'm only allowed to use the X dataset and use Y only for reference.

I'm using sklearn's Kmeans algorithm to predict the labels of the feature dataset. But upon comparing the derived labels using the already assigned labels, Kmeans is wrongly classifying about 40% of the labels.

So instead I've decided to use Kmeans to derive the labels and a classification algorithm to fit and predict using X and the derived labels, with the intention of getting a better accuracy.

Would this strategy work and could someone suggest me a good classification algorithm that I can use for this purpose? Thanks.

Answer 1


K-means is a semi-supervised learning algorithm, which means that it needs some examples to learn. So it needs to have data and class labels. However, k-means is often used for unsupervised learning problems, like yours.

To achieve this, class labels are initialized at random and a k number of means are calculated based on this labeling. Then the data is relabeled and new centroids are calculated. And so on until nothing changes anymore. The algorithm will converge to a local optimum, so not necessarily the global optimum and therefore the classification results are highly dependent on the initial means.

Results can often be improved by using smarter initialization, like the k-means++ algorithm. In the sklearn module for kmeans this initialization is also available, by passing init=kmeans++ as parameter. Like this:

KMeans(init='k-means++', n_clusters=k)

I suggest you try that and see if it yields better results. Also, make sure to pick an appropriate number for k, equal to the number of classes in your data.

Using poor k-means classification results as input for a fully supervised learning algorithm will not work. As you would then train a classifier to learn the poor labeling given by the k-means classification. In that case you better look at other (more complicated) unsupervised learning algorithms like neural gas.