Selection of K in K-means clustering
D T Pham*, S S Dimov, and C D Nguyen
Manufacturing Engineering Centre, Cardiff University, Cardiff, UK
The manuscript was received on 26 May 2004 and was accepted after revision for publication on 27 September 2004.
DOI: 10.1243/095440605X8298
Abstract: The K-means algorithm is a popular data-clustering algorithm. However, one of its
drawbacks is the requirement for the number of clusters, K, to be specified before the algorithm
is applied. This paper first reviews existing methods for selecting the number of clusters for the algorithm. Factors that affect this selection are then discussed and a new measure to assist the selection is proposed. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm for different data sets.
Keywords: clustering, K-means algorithm, cluster number selection
1 INTRODUCTION
Data clustering is a data exploration technique that
allows objects with similar characteristics to be grouped together in order to facilitate their further processing. Data clustering has many engineering applications including the identification of part families for cellular manufacture.
The K-means algorithm is a popular data-
clustering algorithm. To use it requires the number of clusters in the data to be pre-specified. Finding the appropriate number of clusters for a given data set is generally a trial-and-error process made more difficult by the subjective nature of deciding what constitutes ‘correct’ clustering [1].
This paper proposes a method based on infor-
mation obtained during the K-means clustering
operation itself to select the number of clusters, K.
The method employs an objective evaluation measure to suggest suitable values for K, thus
avoiding the need for trial and error.
The remainder of the paper consists of five sections.
Section 2 reviews the main known methods for selecting K. Section 3 analyses the factors influencing the selection of K. Section 4 describes the proposed evaluation measure. Section 5 presents the results of applying the proposed measure to select K for different data sets. Section 6 concludes the paper.

2 SELECTION OF THE NUMBER OF CLUSTERS AND CLUSTERING VALIDITY ASSESSMENT
This section reviews existing methods for selecting
K for the K-means algorithm and the corresponding
clustering validation techniques.
2.1 Values of K specified within a range or set
The performance of a clustering algorithm may be
affected by the chosen value of K. Therefore, instead
of using a single predefined K, a set of values might
be adopted. It is important for the number of values considered to be reasonably large, to reflect the specific characteristics of the data sets. At the same time, the selected values have to be significantly smaller than the number of objects in the data sets, which is the main motivation for performing data clustering.
Reported studies [2–18] on K-means clustering and its applications usually do not contain any explanation or justification for selecting particular values for K. Table 1 lists the numbers of clusters and objects and the corresponding data sets used in those studies. Two observations could be made when analysing the data in the table. First, a number of researchers [5–7, 9] used only one or two values for K. Second, several other researchers [1, 3, 11, 13, 16] utilized relatively large K values compared with the number of objects. These two actions contravene the above-mentioned guidelines for selecting K. Therefore, the clustering results do not always correctly represent the performance of the tested algorithms.
*Corresponding author: Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 0YF, UK.
In general, the performance of any new version
of the K-means algorithm could be verified by com-
paring it with its predecessors on the same criteria. In particular, the sum of cluster distortions is usually employed as such a performance indicator [3, 6, 13, 16, 18]. Thus, the comparison is considered fair because the same model and criterion are used for the performance analysis.
2.2 Values of K specified by the user
The K-means algorithm implementation in many
data-mining or data analysis software packages
[19–22] requires the number of clusters to be speci-
fied by the user. To find a satisfactory clustering result, usually, a number of iterations are needed where the user executes the algorithm with different values of K. The validity of the clustering result is assessed only visually without applying any formal performance measures. With this approach, it is difficult for users to evaluate the clustering result
for multi-dimensional data sets.
2.3 Values of K determined in a later
processing step
When K-means clustering is used as a pre-processing
tool, the number of clusters is determined by the
specific requirements of the main processing algorithm [13]. No attention is paid to the effect of the clustering results on the performance of this algorithm. In such applications, the K-means algorithm is employed just as a ‘black box’ without validation of the clustering result.
Table 1 The number of clusters used in different studies of the K-means algorithm

Reference   Numbers of clusters K                         Number of objects N   Maximum K/N ratio (%)
[2]         32, 64, 128, 256, 512, 1024                   8 192                 12.50
            32, 64, 128, 256, 512, 1024                   29 000
            256                                           2 048
[3]         600, 700, 800, 900, 1000                      10 000                10.00
            600, 700, 800, 900, 1000                      50 000
[4]         4, 16, 64, 100, 128                           100 000               0.13
            4, 16, 64, 100, 128                           120 000
            4, 16, 64, 100, 128                           256 000
[5]         4                                             564                   0.70
            4                                             720
            4                                             1 000
            4                                             1 008
            4                                             1 010
            4                                             1 202
            4                                             2 000
            4                                             2 324
            4                                             3 005
            4                                             4 000
            4                                             6 272
            4                                             7 561
[6]         6                                             150                   4.00
[7]         10                                            2 310                 0.43
            25                                            12 902
[8]         2, 4, 8                                       Not reported          Not reported
[9]         2, 4                                          500                   3.33
            2, 4                                          50 000
            2, 4                                          100 000
            10                                            300
[10]        1, 2, 3, 4                                    10 000                0.04
[11]        10, 20, 30, 40, 50, 60, 70, 80, 90, 100       500                   20.00
[12]        100                                           10 000                2.00
            50                                            2 500
[13]        7                                             42                    16.66
            1, 2, 3, 4, 5, 6, 7                           120
[14]        2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14    250                   5.60
[15]        8, 20, 50, 64, 256                            10 000                2.56
[16]        5000                                          50 000                50.00
            5000                                          100 000
            5000                                          200 000
            5000                                          300 000
            5000                                          433 208
            100                                           100 000
            250                                           200 000
            1000                                          100 000
            1000                                          200 000
            1000                                          300 000
            1000                                          433 208
            40                                            20 000
            10, 20, 30, 40, 50, 60, 70, 80                30 000
            50, 500, 5000                                 10 000
            50, 500, 5000                                 50 000
            50, 500, 5000                                 100 000
            50, 500, 5000                                 200 000
            50, 500, 5000                                 300 000
            50, 500, 5000                                 433 208
[17]        250                                           80 000                10.00
            250                                           90 000
            250                                           100 000
            250                                           110 000
            250                                           120 000
            50, 100, 400                                  4 000
            50, 100, 400                                  36 000
            250                                           80 000
            250                                           90 000
            250                                           100 000
            250                                           110 000
            250                                           120 000
            50, 100, 150                                  4 000
            50, 100, 150                                  36 000
            50                                            800 000
            500                                           800 000
[18]        3, 4                                          150                   6.67
            4, 5                                          75
            2, 7, 10                                      214
2.4 Values of K equated to the
number of generators
Synthetic data sets, which are used for testing
algorithms, are often created by a set of normal or uniform distribution generators. Then, clustering algorithms are applied to those data sets with the number of clusters equated to the number of generators. It is assumed that any resultant cluster will cover all objects created by a particular generator. Thus, the clustering performance is judged on the basis of the difference between objects covered by a cluster and those created by the corresponding generator. Such a difference can be measured by simply counting objects or calculating the information gain [7].
There are drawbacks with this method. The first
drawback concerns the stability of the clustering
results when there are areas in the object space
that contain objects created by different generators. Figure 1a illustrates such a case. The data set shown in this figure has two clusters, A and B, which cover objects generated by generators G_A and G_B respectively. Object X is in an overlapping area between clusters A and B. X has probabilities P_GA and P_GB of being created by G_A and G_B, respectively, and probabilities P_CA and P_CB of being included into clusters A and B, respectively. All four probabilities are larger than 0. Thus, there is a chance for X to be created by generator G_A but covered by cluster B, and vice versa. In such cases, the clustering results will not be perfect. The stability of the clustering results depends on these four probabilities. With an increase in the overlapping areas in the object space, the stability of the clustering results decreases.
The difference between the characteristics of the
generators also has an effect on the clustering results. In Fig. 1b, where the number of objects of cluster A is five times larger than that of cluster B, the smaller cluster B might be regarded as noise and all objects might be grouped into one cluster. Such a clustering outcome would differ from that obtained by visual inspection.
Unfortunately, this method of selecting K cannot be applied to practical problems. The data distribution in practical problems is unknown and also the number of generators cannot be specified.
2.5 Values of K determined by
statistical measures
There are several statistical measures available for
selecting K. These measures are often applied in combination with probabilistic clustering approaches. They are calculated with certain assumptions about the underlying distribution of the data. The Bayesian information criterion or Akaike's information criterion [14, 17] is calculated on data sets which are constructed by a set of Gaussian distributions. The measures applied by Hardy [23] are based on the assumption that the data set fits the Poisson distribution. Monte Carlo techniques, which are associated with the null hypothesis, are used for assessing the clustering results and also for determining the number of clusters [24, 25].
There have been comparisons between probabilis-
tic and partitioning clustering [7]. Expectation–maximization (EM) is often recognized as a typical method for probabilistic clustering. Similarly, K-means clustering is considered a typical method for partitioning clustering. Although EM and K-means clustering share some common ideas, they are based on different hypotheses, models, and criteria. Probabilistic clustering methods do not take into account the distortion inside a cluster, so that a cluster created by applying such methods may not correspond to a cluster in partitioning clustering, and vice versa. Therefore, statistical measures used in probabilistic methods are not applicable in the K-means algorithm. In addition, the assumptions about the underlying distribution cannot be verified on real data sets and therefore cannot be used to obtain statistical measures.

Fig. 1 Effect of the relationship between clusters on the clustering for two object spaces in which (a) an area exists that contains objects created by two different generators and (b) there are no overlapping areas: A, objects generated by G_A; D, objects generated by G_B
2.6 Values of K equated to the
number of classes
With this method, the number of clusters is equated
to the number of classes in the data sets. A data-clustering algorithm can be used as a classifier by applying it to data sets from which the class attribute is omitted and then assessing the clustering results using the omitted class information [26, 27]. The outcome of the assessment is fed back to the clustering algorithm to improve its performance. In this way, the clustering can be considered to be supervised.
With this method of determining the number
of clusters, the assumption is made that the data-
clustering method could form clusters, each of
which would consist of only objects belonging to one class. Unfortunately, most real problems do
not satisfy this assumption.
2.7 Values of K determined through
visualization
Visual verification is applied widely because of its
simplicity and explanation possibilities. Visual examples are often used to illustrate the drawbacks of an algorithm or to present the expected clustering results [5, 27].
The assessment of a clustering result using visualization techniques depends heavily on their implicit nature. The clustering models utilized by some clustering methods may not be appropriate for particular data sets. The data sets in Fig. 2 are illustrations of such cases. The application of visualization techniques implies a data distribution continuity in the expected clusters. If the K-means approach is applied to such data sets, there is not any cluster that satisfies the K-means clustering model and at the same time corresponds to a particular object grouping in the illustrated data sets. Therefore, the K-means algorithm cannot produce the expected clustering results. This suggests that the K-means approach is unsuitable for such data sets.

Fig. 2 Data sets inappropriate for the K-means approach: (a) data sets with four clusters [5]; (b) data sets with three clusters [23]; (c) data sets with eight clusters [27]. Note that the number of clusters in each data set was specified by the respective researchers
The characteristics of the data sets in Fig. 2
(position, shape, size, and object distribution) are implicitly defined. This makes the validation of the clustering results difficult. Any slight changes in the data characteristics may lead to different outcomes. The data set in Fig. 2b is an illustration of such a case. Another example is the series of data sets in Fig. 3. Although two clusters are easily identifiable in the data set in Fig. 3a, the numbers of clusters in the data sets in Figs 3b and c depend on the distance between the rings and the object density of each ring. Usually such parameters are not explicitly defined when a visual check is carried out.
In spite of the above-mentioned deficiencies, visu-
alization of the results is still a useful method of
selecting K and validating the clustering results when the data sets do not violate the assumptions of the clustering model. In addition, this method is recommended in cases where the expected results could be identified explicitly.
2.8 Values of K determined using
a neighbourhood measure
A neighbourhood measure could be added to the
cost function of the K-means algorithm to determine
K [26]. Although this technique has shown promising results for a few data sets, it needs to prove its potential in practical applications. Because the cost function has to be modified, this technique cannot be applied to the original K-means algorithm.
3 FACTORS AFFECTING THE SELECTION OF K
A function f(K) for evaluating the clustering result
could be used to select the number of clusters. Factors that such a function should take into account are discussed in this section.

3.1 Approach bias
The evaluation function should be related closely to
the clustering criteria. As mentioned previously, such a relation could prevent adverse effects on the validation process. In particular, in the K-means algorithm, the criterion is the minimization of the distortion of clusters, so that the evaluation function should take this parameter into account.
3.2 Level of detail
In general, observers that could see relatively low
levels of detail would obtain only an overview of an object. By increasing the level of detail, they could gain more information about the observed object but, at the same time, the amount of data that they have to process increases. Because of resource limitations, a high level of detail is normally used only to examine parts of the object [28].
Such an approach could be applied in clustering. A
data set with n objects could be grouped into any number of clusters between 1 and n, which would correspond to the lowest and the highest levels of detail respectively. By specifying different K values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one K value could be recommended to users, but the final selection is made by them.
3.3 Internal distribution versus global impact
Clustering is used to find irregularities in the data
distribution and to identify regions in which objects
are concentrated. However, not every region with a
high concentration of objects is considered a cluster. For a region to be identified as a cluster, it is important to analyse not only its internal distribution but also its interdependence with other object groupings in the data set.
In K-means clustering, the distortion of a cluster is a function of the data population and the distance between objects and the cluster centre according to

I_j = \sum_{t=1}^{N_j} \left[ d(x_{jt}, w_j) \right]^2 \qquad (1a)

where I_j is the distortion of cluster j, w_j is the centre of cluster j, N_j is the number of objects belonging to cluster j, x_{jt} is the tth object belonging to cluster j, and d(x_{jt}, w_j) is the distance between object x_{jt} and the centre w_j of cluster j.
Each cluster is represented by its distortion
and its impact on the entire data set is assessed by
its contribution to the sum of all distortions, S_K, given by

S_K = \sum_{j=1}^{K} I_j \qquad (1b)

where K is the specified number of clusters.
Thus, such information is important in assessing whether a particular region in the object space could be considered a cluster.

Fig. 3 Variations in the two-ring data set
3.4 Constraints on f(K)
The robustness of f(K) is very important. Because
this function is based on the result of the clustering algorithm, it is important for this result to vary as little as possible when K remains unchanged. However, one of the main deficiencies of the K-means approach is its dependence on randomness. Thus, the algorithm should yield consistent results so that its performance can be used as a variable in the evaluation function. A new version of the K-means algorithm, namely the incremental K-means algorithm [29], satisfies this requirement and can
be adopted for this purpose.
The role of f(K) is to reveal trends in the data
distribution and therefore it is important to keep it independent of the number of objects. The number of clusters, K, is assumed to be much smaller than the number of objects, N. When K increases, f(K) should converge to some constant value. Then, if, for any intermediate K, f(K) exhibits a special behaviour, such as a minimum or maximum point, that value of K could be taken as the desired number of
clusters.
4 NUMBER OF CLUSTERS FOR
K-MEANS CLUSTERING
As mentioned in section 3.3, cluster analysis is
used to find irregularities in the data distribution.
When the data distribution is uniform, there is not any irregularity. Therefore, data sets with uniform distribution could be used to calibrate and verify the clustering result. This approach was applied by Tibshirani et al. [30]. A data set of the same dimension as the actual data set and with a uniform distribution was generated. The clustering performance on this artificial data set was then compared with the result obtained for the actual data set. A measure known as the ‘gap’ statistic [30] was employed to assess performance. In this work, instead of generating an artificial data set, the clustering performance for the artificial data set was estimated. Also, instead of the gap statistic, a new and more discriminatory measure was employed for evaluating the clustering
result.
When the K-means algorithm is applied to data
with a uniform distribution and K is increased by 1, the clusters are likely to change and, in the new positions, the partitions will again be approximately equal in size and their distortions similar to one another. The evaluations carried out in reference [29] showed that, when a new cluster is inserted into a cluster (K = 1) with a hypercuboid shape and a uniform distribution, the decrease in the sum of distortions is proportional to the original sum of distortions. This conclusion was found to be correct for clustering results obtained with relatively small values of K. In such cases, the sum of distortions after the increase in the number of clusters could be estimated from the current value.
The evaluation function f(K) is defined using the
equations
f(K) = \begin{cases} 1 & \text{if } K = 1 \\ \dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0, \ \forall K > 1 \\ 1 & \text{if } S_{K-1} = 0, \ \forall K > 1 \end{cases} \qquad (2)

\alpha_K = \begin{cases} 1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 & (3\text{a}) \\ \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 & (3\text{b}) \end{cases}
where S_K is the sum of the cluster distortions when the number of clusters is K, N_d is the number of data set attributes (i.e. the number of dimensions), and α_K is a weight factor. The term α_K S_{K-1} in equation (2) is an estimate of S_K based on S_{K-1}, made with the assumption that the data have a uniform distribution. The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform. When there are areas of concentration in the data distribution, S_K will be less than the estimated value, so that f(K) decreases. The smaller that f(K) is, the more concentrated is the data distribution. Thus, values of K that yield small f(K) can be regarded as giving well-defined clusters.
The weight factor α_K, defined in equation (3), is a positive number less than or equal to 1 and is applied to reduce the effect of dimensions. With K = 2, α_K is computed using equation (3a). This equation is derived from equation (7) in reference [29], which shows that the decrease in distortion is inversely proportional to the number of dimensions, N_d. As K increases above 2, the decrease in the sum of distortions reduces (the ratio S_K/S_{K-1} approaches 1),
as can be seen in Fig. 4. This figure shows the values of S_K/S_{K-1} computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1 and α_K should be chosen to equate f(K) to 1. From equation (2), α_K should therefore be S_K/S_{K-1} and thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate α_K. Figure 5 shows that the values of α_K obtained from equation (3b) fit the plots in Fig. 4 closely.
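The weight factor recursion and the evaluation function can be transcribed directly from equations (2), (3a), and (3b). The following Python sketch is an added illustration, not code from the paper; S is assumed to be a mapping from each number of clusters to the corresponding sum of distortions S_K obtained from the clustering runs.

```python
def alpha(K, N_d):
    """Weight factor alpha_K from equations (3a) and (3b); defined for K >= 2 and N_d > 1."""
    if K == 2:
        return 1.0 - 3.0 / (4.0 * N_d)          # equation (3a)
    a_prev = alpha(K - 1, N_d)                   # recursion of equation (3b)
    return a_prev + (1.0 - a_prev) / 6.0

def f(K, S, N_d):
    """Evaluation function f(K) of equation (2).
    S maps a number of clusters to the corresponding sum of distortions S_K."""
    if K == 1 or S[K - 1] == 0:
        return 1.0
    return S[K] / (alpha(K, N_d) * S[K - 1])
```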
The proposed function f(K) satisfies the con-
straints mentioned in the previous section. The
robustness of f(K) will be verified experimentally in
the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. S_K and S_{K-1} are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.
To reduce the effect of differences in the ranges
of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution, because it is a scaling technique that applies to the whole data set.
5 PERFORMANCE
The evaluation function f(K) is tested in a series of
experiments on the artificially generated data sets
shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied with K
ranging from 1 to 19. f(K) is calculated on the basis
of the total distortion of the clusters.
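A sketch of this experimental procedure is given below. It is an added illustration under two assumptions not stated explicitly in the paper: the normalization is taken to be min–max scaling of each attribute, and the standard scikit-learn KMeans (whose inertia_ attribute equals the sum of distortions S_K) is used in place of the incremental K-means algorithm of reference [29]. The function f is the one sketched in section 4.

```python
import numpy as np
from sklearn.cluster import KMeans

def f_curve(X, K_max=19, random_state=0):
    """Compute f(K) for K = 1, ..., K_max on a data set X (objects x attributes)."""
    X = (X - X.min(axis=0)) / np.ptp(X, axis=0)   # assumed min-max normalization
    N_d = X.shape[1]                              # number of attributes (dimensions)
    S = {K: KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(X).inertia_
         for K in range(1, K_max + 1)}            # sum of distortions S_K for each K
    return {K: f(K, S, N_d) for K in range(1, K_max + 1)}
```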
In Figs 6a–c, all objects belong to a single region
with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b, respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ significantly from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that α_K reduces the effect of the data set dimensions on the evaluation function.

Fig. 4 The ratio S_K/S_{K-1} for data sets having uniform distributions: (a) two-dimensional ‘square’ and ‘circle’; (b) four-dimensional ‘cube’ and ‘sphere’

Fig. 5 Comparison of the values of α_K calculated using equation (3b) and the ratio S_K/S_{K-1}

Fig. 6 Data sets and their corresponding f(K)
For the data set in Fig. 6d, again, all objects are
concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.
The data sets in Figs 6g and h have three recog-
nizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail
could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.
The data sets in Figs 6m–o have well-defined
regions in the object space, each of which has a different distribution, location, and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies with different cases. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters and the final decision as to which value to adopt has to be left at the discretion of the user.
From the graphs in Fig. 6, a conclusion could be
made that any K with corresponding f(K) < 0.85 could be recommended for clustering. If there is not a value with corresponding f(K) < 0.85, K = 1 is selected.
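This selection rule is straightforward to apply to the f(K) values computed above. The short sketch below is an added illustration (with the 0.85 threshold taken from this study); it returns every K whose f(K) falls below the threshold, or K = 1 if none does.

```python
def recommend_K(f_values, threshold=0.85):
    """Return candidate numbers of clusters: all K > 1 with f(K) < threshold, else [1]."""
    candidates = sorted(K for K, v in f_values.items() if K > 1 and v < threshold)
    return candidates if candidates else [1]

# Example (X is a hypothetical data matrix):
#   print(recommend_K(f_curve(X)))
```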
The proposed function f(K) is also applied to 12
benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.

Fig. 7 f(K) for the 12 benchmark data sets

Table 2 The recommended number of clusters based on f(K)

Data sets        Proposed number of clusters
Australian       1
Balance-scale    1
Car evaluation   2, 3, 4
Cmc              1
Ionosphere       2
Iris             2, 3
Page blocks      2
Pima             1
Wdbc             2
Wine             3
Yeast            1
Zoo              2

The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.

6 CONCLUSION
Existing methods of selecting the number of clusters
for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff
Innovative Manufacturing Research Centre Project
supported by the Engineering and Physical Sciences
Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the I*PROMS Network of Excellence funded by the European Commission.
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and
Techniques , 2000 (Morgan Kaufmann, San Francisco,
California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and
Roberts, S. A. Fast K-means clustering algorithms.
Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and
Roberts, S. A. New methods for the initialisation of
clusters. Pattern Recognition Lett. , 1996, 17, 451–455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient
K-means clustering algorithm. In Proceedings of the
First Workshop on High-Performance Data Mining ,
Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/
faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical
observations of probabilistic heuristics for the
clustering problem. Technical Report TR-97-018,
International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the
K-means algorithm. Adv. Neural Infn Processing Systems ,
1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial
points for K-means clustering. In Proceedings of
the Fifteenth International Conference on Machine
Learning (ICML ‘98 ) (Ed. J. Shavlik), Madison,
Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San
Francisco, California).
8 Du, Q. and Wong, T-W. Numerical studies of
MacQueen’s K-means algorithm for computing the cen-
troidal Voronoi tessellations. Int. J. Computers Math.
Applics , 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general
purpose clustering algorithm. In Proceedings of the
Fourth European Workshop on Principles of Knowledge
Discovery in Databases and Data Mining (PKDD 00) ,
Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms?
SIGKDD Explorations, Newsletter of the ACM Special
Interest Group on Knowledge Discovery and Data
Mining , 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantiza-
tion – an improvement over LBG inspired from
neural networks. Neural Processing Lett. , 1997, 5(1),
35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means
algorithm that find better clusterings. In Proceedings of
the 11th International Conference on Information and
Knowledge Management (CIKM 02 ), McLean, Virginia,
2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning
and generalisation. In Proceedings of the IEEE
International Conference on Neural Networks ,
Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient
estimation of the number of clusters. In Proceedings
of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL
2000) , Hong Kong, PR China, December 2000,
pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C.,
Silverman, R., and Wu, A. The efficient K-means clus-
tering algorithm: analysis and implementation. IEEE
Trans. Pattern Analysis Mach. Intell. 2002, 24(7),
881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means
algorithms with geometric reasoning. In Proceedings
of the Conference on Knowledge Discovery in
Databases (KDD 99) , San Diego, California, 1999,
pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means
with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on
Machine Learning (ICML 2000) , Stanford, California,
2000, 727–734.
18 Pena, J. M., Lazano, J. A., and Larranaga, P. An empiri-
cal comparison of four initialisation methods for the
K-means algorithm. Pattern Recognition Lett. , 1999,
20, 1027–1040.
19 SPSS Clementine Data Mining System. User Guide Ver-
sion 5 , 1998 (Integral Solutions Limited, Basingstoke,
Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy
Job, Management Intelligenter Technologien GmbH,
Germany, 1998; http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with
SPSS , 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2,
Insightful Corporation, Seattle, Washington, 2001;
http://www.insightful.com/DocumentsLive/23/44/
statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist.
Data Analysis , 1996, 23, 83–96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recog-
nition , 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M.
Cluster validity methods. Part I. SIGMOD Record ,
2002, 31(2); available online http://www.acm.org/
sigmod/record/.
26 Kothari, R. and Pitts, D. On finding the number of
clusters. Pattern Recognition Lett. , 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis,
Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision ,
1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D.
Incremental K-means algorithm. Proc. Instn Mech.
Engrs, Part C: J. Mechanical Engineering Science , 2003,
218, 783–795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating
the number of clusters in a dataset via the gap statistic.
Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Re-
pository of Machine Learning Databases, Irvine,
California . Department of Information and Com-
puter Science, University of California, Irvine,
California, 1998.
APPENDIX
Notation
A, B clusters
d(x_{jt}, w_j)   distance between object x_{jt} and the centre w_j of cluster j
f(K)             evaluation function
G_A, G_B         generators
I_j              distortion of cluster j
K                number of clusters
N                number of objects in the data set
N_d              number of data set attributes (the dimension of the data set)
N_j              number of objects belonging to cluster j
P_GA, P_GB       probabilities that X is created by G_A or G_B respectively
P_CA, P_CB       probabilities that X is clustered into A or B respectively
S_K              sum of all distortions with K being the specified number of clusters
X                object
x_{jt}           object belonging to cluster j
w_j              centre of cluster j
α_K              weight factor