
[ K means Clustering on n dimensional vectors. ]

I'm applying TFIDF to text documents, and I get sparse vectors of varying length, each corresponding to a document.

    texts = [[token for token in text if frequency[token] > 1] for text in texts]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lda = models.ldamodel.LdaModel(corpus, num_topics=100, id2word=dictionary)
    tfidf = models.TfidfModel(corpus)   
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)
    corpus_lsi = lsi[corpus_tfidf]
    corpus_lda=lda[corpus]
    print("TFIDF:")
    print(corpus_tfidf[1])
    print("__________________________________________")
    print(corpus_tfidf[2])

The output to this is:

    TFIDF:
    Vec1: [(19, 0.06602704727889631), (32, 0.360417819987515), (33, 0.3078487494326974), (34, 0.360417819987515), (35, 0.2458968255872351), (36, 0.23680107692707422), (37, 0.29225639811281434), (38, 0.31741275088103), (39, 0.28571949457481044), (40, 0.32872456368129543), (41, 0.3855741727557306)]
    __________________________________________
    Vec2: [(5, 0.05617283528623041), (6, 0.10499864499395724), (8, 0.11265354901199849), (16, 0.028248249837939252), (19, 0.03948130674177094), (29, 0.07013501129200184), (33, 0.18408018239985235), (42, 0.14904146984986072), (43, 0.20484144632880313), (44, 0.215514203535732), (45, 0.15836501876891904), (46, 0.08505477582234795), (47, 0.07138425858136686), (48, 0.127695955436003), (49, 0.18408018239985235), (50, 0.2305566099597365), (51, 0.20484144632880313), (52, 0.2305566099597365), (53, 0.2305566099597365), (54, 0.053099690797234665), (55, 0.2305566099597365), (56, 0.2305566099597365), (57, 0.2305566099597365), (58, 0.0881162347543671), (59, 0.20484144632880313), (60, 0.16408387627386525), (61, 0.08256873616398946), (62, 0.215514203535732), (63, 0.2305566099597365), (64, 0.16731192344738707), (65, 0.2305566099597365), (66, 0.2305566099597365), (67, 0.07320703902661252), (68, 0.17912628269786976), (69, 0.12332630621892736)]

Dimensions not listed are implicitly 0. That is, if, say, (18, ....) does not appear in the vector, that component is 0.
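To make the implicit zeros concrete, here is a minimal sketch (with made-up indices and weights) of how such a list of `(index, weight)` tuples maps onto a dense vector:

```python
import numpy as np

# A gensim-style sparse vector: a list of (index, weight) pairs.
# Indices that never appear are implicitly zero.
vec = [(1, 0.5), (3, 0.25)]

num_terms = 5  # assumed vocabulary size
dense = np.zeros(num_terms)
for idx, weight in vec:
    dense[idx] = weight
# dense == [0.0, 0.5, 0.0, 0.25, 0.0]
```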

I want to apply k-means clustering to these vectors (Vec1 and Vec2).

Scikit-learn's KMeans needs vectors of equal dimension, in matrix format. What should be done about this?

Answer 1


After looking at the source code, it appears that gensim manually creates a sparse vector for each document (which is just a list of tuples). This explains the error: scikit-learn's KMeans accepts sparse scipy matrices, but it doesn't know how to interpret gensim's sparse-vector format. You can turn each of these individual lists into a scipy `csr_matrix` as follows (it would be better to convert all documents at once, but this is a quick fix):

    from scipy.sparse import csr_matrix

    rows = [0] * len(corpus_tfidf[1])
    cols = [tup[0] for tup in corpus_tfidf[1]]
    data = [tup[1] for tup in corpus_tfidf[1]]
    sparse_vec = csr_matrix((data, (rows, cols)))

You should be able to use this `sparse_vec` directly, but if it throws errors you can turn it into a dense numpy array with `.toarray()` or a numpy matrix with `.todense()`.
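Converting all documents at once just means stacking the per-document rows into one sparse matrix. A sketch of that approach, using a hypothetical two-document corpus and an assumed vocabulary size:

```python
from scipy.sparse import csr_matrix, vstack

# Hypothetical mini-corpus in gensim's list-of-tuples format.
corpus_tfidf = [
    [(0, 0.3), (2, 0.7)],
    [(1, 0.5), (2, 0.2)],
]
num_terms = 4  # assumed vocabulary size; normally len(dictionary)

def to_csr(doc, num_terms):
    # One row per document; columns are term ids, values are weights.
    rows = [0] * len(doc)
    cols = [idx for idx, _ in doc]
    data = [val for _, val in doc]
    return csr_matrix((data, (rows, cols)), shape=(1, num_terms))

# Stack the single-row matrices into a (num_docs x num_terms) matrix.
X = vstack([to_csr(doc, num_terms) for doc in corpus_tfidf])
# X.shape == (2, 4), ready to pass to KMeans.fit
```

Fixing the shape to `(1, num_terms)` ensures every row has the same width even when a document's highest term id is small.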

EDIT: It turns out that gensim provides some nifty utility functions, including `matutils.corpus2csc`, which takes the streamed corpus object format and returns a scipy CSC matrix. Here's a full example of how your code might work, connected to sklearn's KMeans clustering algorithm:

    from gensim import corpora, models, matutils
    from sklearn.cluster import KMeans

    texts = [[token for token in text] for text in texts]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    print("TFIDF:")
    corpus_tfidf = matutils.corpus2csc(corpus_tfidf).transpose()
    print(corpus_tfidf)
    print("__________________________________________")

    kmeans = KMeans(n_clusters=2)
    print(kmeans.fit_predict(corpus_tfidf))

You should calculate and pass the optional parameters to `corpus2csc` (`num_terms`, `num_docs`, `num_nnz`), as precomputing them can save cycles depending on the size of your corpus. We transpose the matrix because gensim puts documents in the columns and terms in the rows, while sklearn expects one document per row. You can convert the scipy sparse matrix into a variety of other formats, depending on your use case (beyond just k-means clustering).
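The transpose step can be illustrated with scipy alone (the weights below are made up): `corpus2csc` produces a terms-by-documents matrix, and transposing flips it to documents-by-terms.

```python
from scipy.sparse import csc_matrix

# corpus2csc-style layout: terms in rows, documents in columns.
# Here: 3 terms x 2 documents, with hypothetical TFIDF weights.
term_doc = csc_matrix([[0.3, 0.0],
                       [0.0, 0.5],
                       [0.7, 0.2]])

# Flip to one document per row, which is what sklearn estimators expect.
doc_term = term_doc.transpose()
# doc_term.shape == (2, 3)
```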