Selasa, 22 Januari 2013

Contoh Metode Single Pass Clustering

Misalkan kita memiliki Kumpulan dokumen dan istilah sebagai berikut, dan bahwa
kita tertarik pengelompokan istilah dengan menggunakan metode single pass
(perhatikan bahwa metode yang sama dapat digunakan untuk cluster dokumen, tetapi dalam
kasus ini, kita akan menggunakan vektor dokumen (baris) daripada
Istilah vektor (kolom).

 T1 T2 T3 T4 T5
Doc1 1 2 0 0 1
Doc2 3 1 2 3 0
Doc3 3 0 0 0 1
Doc4 2 1 0 3 0
Doc5 2 2 1 5 1

Dimulai dengan T1 dalam sebuah cluster dengan sendirinya, katakanlah C1. Pada titik ini, C1 berisi
 hanya satu item, T1, sehingga titik pusat dari C1 hanyalah vektor untuk T1:

C1 = <1, 3, 3, 2, 2>.

Now compare (i.e., measure similarities) of the next item (T2) to centroids
of all existing clusters. At this point we have only one cluster, C1 (we
will use dot product for simplicity):

SIM(T2, C1) = 1*2 + 1*3 + 0*3 + 1*2 + 2*2 = 11

Now we need a pre-specified similarity threshold. Let's say that our
threshold is 10. This means that if the similarity of T2 to the cluster
centroid is >= 10, then we add T2 to the cluster, otherwise we use T2 to
start a new cluster.

In this case. SIM(T2, C1) = 11 > 10. Therefore we add T2 to cluster C1.
We now need to compute the new centroid for C1 (which now contains T1 and
T2). The centroid (which is the average vector for T1 and T2 is:

C1 = <3/2, 4/2, 3/2, 3/2, 4/2>

Now, we move to the next item, T3. Again, there is only one cluster, C1,
so we only need to compare T3 with C1 centroid. The dot product of T3 and
the above centroid is:

SIM(T3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6

This time, T3 does not pass the threshold test (the similarity is less
than 10). Therefore, we use T3 to start a new cluster, C2. Now we have two
clusters

C1 = {T1, T2}
C2 = {T3}

We move to the next unclustered item, T4. Since we now have two clusters,
we need to compute the MAX similarity of T4 to the 2 cluster centroids
(note that the centroid of cluster C2 right now is just the vector for T3):

SIM(T4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2>
            = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5

SIM(T4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1>
            = 0 + 6 + 0 + 0 + 5 = 11

Note that both similarity scores pass the threshold (10), however, we pick
the MAX, and therefore, T4 will be added to cluster C1. Now we have the
following:

C1 = {T1, T2, T4}
C2 = {T3}

The centroid for C2 is still just the vector for T3:

C2 = <0, 2, 0, 0, 1>

and the new centroid for C1 is now:

C1 = <3/3, 7/3, 3/3, 6/3, 9/3>

The only item left unclustered is T5. We compute its similarity to the
centroids of existing clusters:

SIM(T5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3>
            = 3/3 + 0 + 3/3 + 0 + 9/3 = 5

SIM(T5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1>
            = 0 + 0 + 0 + 0 +1 = 1

Neither of these similarity values pass the threshold. Therefore, T5 will
have to go into a new cluster C3. There are no more unclustered items, so
we are done (after making a single pass through the items). The final
clusters are:

C1 = {T1, T2, T4}
C2 = {T3}
C3 = {T5}

Note: Obviously, the results for this method are highly dependent on the
similarity threshold that is used. You should use your judgment in setting
this threshold so that you are left with a reasonable number of clusters.

Tidak ada komentar:

Poskan Komentar