Text Clustering (TC) is a general term whose meaning is often reduced to document clustering which is not always the case since the text type covers documents, paragraphs, sentences and even words. Therefore, text clustering can be document level (e.g.news articles) or sentence level (e.g. tweets or SMS) etc. TC aims at regrouping similar text units within a collection of documents and it is useful in mining any text-based resource. This task is unsupervised since, unlike in text classification, we have no prior idea about the categories.
It is common to find in many sources (blogs etc) that the first step to cluster text data is to transform text units to vectors. This is not 100% true. I would say that this step depends mostly on the similarity measure and the clustering algorithm. Some of the best performing text similarity measures don't use vectors at all. This is the case of the winner system in SemEval2014 sentence similarity task which uses lexical word alignment. However, vectors are more efficient to process and allow to benefit from existing clustering algorithms such as k-means.
Sentence level or document level?
Depending on the use case, one can cluster either documents or sentences. While document clustering serves to regroup documents about the same topic, sentence clustering is more fine-grained and is yet more critical to perform. Document clustering has applications in information retrieval , topic extraction, mails clustering, search engines, etc. In the latter case, the results to a given query are clustered into different categories, so the user chooses what he is looking for.
Usually sentence clustering is used to cluster sentences derived from different documents and can be considered as a transverse segmentation of the documents content. Thus, the number of clusters can exceed the number of documents. Representing each cluster with a relevant descriptor and concatenating those descriptors is one way to summarize large documents.
Thematic clustering or Semantic clustering?
Sentence clusters can be qualified as thematic or semantic. Thematic clustering regroups sentences that deal with the same topic/event while semantic clusters include sentences having almost the same meaning and conveying the same information. For example, let's consider these two sentences from different news articles about Tsunami:
Sentence 1: A decade ago, one of the largest earthquakes ever recorded struck off the coast of Indonesia, triggering a tsunami that swept away entire communities around the Indian Ocean. source: http://www.bbc.com/
Sentence 2: Ten years ago this Boxing Day, an earthquake in Indonesia triggered multiple tsunamis in what would become one of the biggest natural disasters in history. source: http://www.telegraph.co.uk/
These sentences can make a perfect semantic cluster. Only, one of them is sufficient to cover the information needed to know.
If we introduce this new sentence:
The disaster was the world's deadliest tsunami, with over 230,000 people killed and half a million injured by the waves that battered the low-lying coast.
The semantic clustering should create a new cluster to include the third sentence as it hasn't the same meaning as the first two sentences. However, a thematic clustering would probably regroup all the sentences in the same thematic cluster as they are all about the same event. This means that the sentences considered similar within a thematic cluster, are likely to be fractured into smaller clusters when using semantic clusters.
How do we produce thematic/semantic clusters?
To perform this task we mainly need two things: a text similarity measure and a suitable clustering algorithm. Since we are dealing with text, preprocessing is a must and it can go from shallow techniques such as splitting text into sentences and/or pruning stopwords to deeper analysis such as part-of-speech tagging, syntactic parsing, semantic role labeling, etc.
We should not fall in the trap of thinking that moving from thematic to semantic clustering consists of increasing the threshold beyond which two elements are considered close enough to be in the same cluster. This is not true as it depends only on the similarity measure used to quantify the distance between two sentences. A non semantic similarity measure can not detect the semantic equivalence even if we apply a high threshold as it is unable to capture semantic relations such as synonyms and antonyms for example.
Examples of text similarity measures?
Cosine similarity of tf-idf (term frequency-inverse document frequency) vectors
Knowledge-based Measures (wordNet)
Knowledge-based measures quantify semantic relatedness of words using a semantic network. Many measures have shown to work well on the WordNet large lexical database for English. The figure below shows a subgraph of WordNet.
Source : (Wang & al.,2015)
For example, Wu and Palmer metric measures the semantic similarity of two concepts as their depth of the least common subsumer in the wordnet graph, while Resnik metric estimates the similarity as the probability of encountering the least common subsumer in a large corpus. This probability is known as the Information Content (IC). Note that a concept, here, is a synset which is a word sense (synonym) and each word has several synsets. These examples are implemented in the Python NLTK module.
These word similarities could be used to compute sentence similarities through this formula from (Mihalcea & al,2006)
where maxSim(w,T) refers to the maximum value in the similarities of the word w to each word in the text segment T. Given two text units T1 and T2, this scoring formula combines the semantic similarities of each text unit words in turn with respect to the other text segment words. The idf factor indicates the specificity of a word and is used to give more importance to specific words similarities than to generic ones like "get" and "bring". This scoring function can be used with any word similarity metric.
Word embeddings :
Word embeddings are low dimensional vectors obtained by training a neural network on a large corpus to predict a word given a context (Continuous Bag Of Words model) or to predict the context given a word (skip gram model). The context is a window of surrounding words.The most known software to produce word embeddings is Tomas Mikolov's Word2vec. Pre-trained word embeddings are also available in the word2vec code.google page.
The sentences s1, s2, s3 and s4 are tweets from the SemEval 2013 evaluation set. The human judgment of these sentences semantic similarities is:
sim(s1,s2)= 4.2/5 ~ 0.82 (very similar)
sim(s2,s3)= 2.0/5 ~ 0.40
Run test for w2v:
As you can see the results returned by w2v are consistent and very close to the human judgment.
It is recommended to load the word model (embeddings) only once rather than loading it before each similarity computation. The word embedding used here are trained using word2vec on 100 million words from Google News downloadable here.
Which clustering algorithm to use?
In the context of text data, it is suitable and fast when we have an approximation of the clusters number and when the similarity measure is not expensive in terms of computation time. Here is a simple example of k-means clustering using tf-idf vectors with the scikit-learn implementation:
Incremental clustering algorithm:
As I said above the problem is that we should specify the number of clusters to perform k-means clustering. This is not a problem for Hierarchical clustering algorithms, however, they are time consuming since we should calculate a similarity matrix for the elements (sentences) and this is not very efficient when the data is voluminous. In this context, it is more convenient to use an incremental clustering algorithm which can be used also in online applications (Mnasri & al. 2016). In such algorithms, sentences are processed one at a time where each new sentence is compared to each of the already formed clusters. If its similarity with the nearest cluster is low, then a new cluster is created. Otherwise, the sentence is assigned to the nearest cluster. The similarity between a sentence and a cluster is conservatively estimated as the similarity of this sentence with the further cluster element (Complete-linkage clustering). I have tested this approach and it has shown to be fast and it preserves the clusters quality at the same time.
I'd like to point out that the list of the methods I presented, especially the similarity measures, is not exhaustive and there are many other approaches proposed by the NLP community. An absolute best method is hard to precise. Indeed, intrinsic assessment of similarities, for example, is likely to show the quality gap between different metrics. However, using these methods on simple real cases may show similar performances. In such cases, the most efficient option should be picked.
We just sent you an email. Please click the link in the email to confirm your subscription!