10th Jubilee IEEE International Symposium on Applied Computational Intelligenc e and Informatics May 21 -23, 2015 Timi șoara, Romania [602826]

10th Jubilee IEEE International Symposium on Applied Computational Intelligenc e and Informatics • May 21 -23, 2015 • Timi șoara, Romania
1
Hierarchical clustering techniques and
classification applied in Co ntent Based Image
Retrieval (CBIR)

Radu Andrei Stefan *, Ildikó -Angelica Sz öke*, Stefan Holban *
*Department of Computer Science, Politehnica University of Timisoara,
Bd. Vasile Pârvan 2, P .O. Box RO -300223, Timisoara, Romania
[anonimizat] , [anonimizat] , stefan.holba [anonimizat]

Abstract – This paper presents a study on the effectiveness of
hierarchical clustering techniques application and
classification for imaging contex t in the Content -Based Image
Retrieval (CBIR). The study has the purpose t o compare t he
obtained results from using different hierarchical clustering
algorithms with various input parameters and con figura tions
using two types of comparison techniques. The aims is also to
highlight the performance improvements and the costs
brought up by the i ntegration of such techniques in the
content -based image retrieval.
Keywords – classification ; clustering ; image; retrieval ;
content -based .
I. INTRODUCTION
In the la st years, due to continuous g rowth of the
Internet, the diminuation price of the storage devic es and
the increasing of computational power has become
necessary, possible and effective the database
manipulation of very large digital information. Generally
speaking, content -based i mage retrieval (CBIR) has as
goal the development of techniques that can provide
support for more effective image searching in large digital
libraries, using automatically features derived from
images. These features are taken from computer vision
techniques and can reproduce to some point, the images
visual information [1].
Images c lassification in a databas e, according to
another interrogated image , can take a lot of time and
resorces . These performanc es can have significant
improvements and advantages if clustering algorithms are
applied to the database image s. Thus, when query an
image , the comparation is not done with every single
image from the database, but only with the representative
image of all the c lusters. Also, the query result will not
consist of individual images, but from groups of images
with similar characte ristics.
Images r elevance which constitute the answer depends
on what characteristics of the content are used, by the
input parameters of the clustering algorithms and what
metrics are used to measure the distance between two
images or the dissimilarity of the two clusters. II. CONTENT -BASED IMAGE RETRIEVA L
New methods for the management for images collection
in an easy and efficient manner are needed and also new
cataloging forms and image indexing that rely on images
automatic extraction are neccessary .
The problems encountered in indexing the conventional
images led to an increased interest in image retrieval
techniques based on automatically derived features
technology that today are called Content -Based Image
Retrieval (CBIR) [2].
Content -based i mage retri eval (CBIR) is also known as
Query by Image Content or Content -Based Visual
Information Retrieval. This technique uses visual content
to search into a large database images and is an active
research area of the last two decades .
The most used features in C BIR are mathematical
measurements for color , texture and form images. Todays
typical systems allows the user to select the desired image,
then the system identifies certai n characteristics and
determin s based on those characteristics, the most similar
stored images from the database.
The color is one of the most commonly used features.
Most often , the colors are represented based on
histograms sto red, in the database , in vectors form . The
process of comparing two histograms can be done by
using metrics s uch as Euclidean distance or Manhattan
distance [2].
In this conte nt, using texture feature can be effective in
distinguishing between sur faces with the same color (i.e.
water and sky). The techniques used are ba sed on the
calculation of pixel pair brightn ess; then with the same
techniques is possible to measure the texture image by
contrast, regularity, frequency and severity [2].
The shape is probably the most obvious way to
distinguish objects, since natural objects are n ormally
recognized by the form. I nside the form can be calculated
the global feature s, like ration or circularit y, and the local
features, like marginal segments sets [2].
III. LOCAL BINARY PATTERN (LBP )
Local Binary Pattern (LBP) is one of the techniques
used in image classification based on texture. This
operator is simple and very effective , which labels the
image pixels based on their neighboring and consider the
result being a binary number. The most important property

R. A. Stefan , I. Szoke and S. Holban • Hierarchical clustering techniques and classification applied in Content B ased Image R etrieval (CBIR)

2
of LBP operator in rea l world is the invariation of the
differences of the image brightness. Also important is the
computational simplicity which enable s real-time image
analysis [3].
In this paper , we used the original LBP operator . We
work ed on a 3×3 region using the central value as the
reference value. This operator is ap plied to grayscale
images and c alculates a code for each pixel. For a 3×3
pixel cell, the center pixel is compared with its
surounding s. If the value is greater than the neighboring
pixel value , "1" will be taken into consideration o therwise
the value "0". The resulting value for e ach cell is a binary
number which is associated with a template that can be
converted into a LBP code . Now , we have a number of 8
pixels in the neighborhood, the total number of labels that
can be assigned to a central pixel is 28 = 256 values.

IV. COLORED HISTOGRAM
In image processing, a color histogram is the
representation of dye colors distribution in an image. For
digital images, a colored histogram represents the number
of pixels that have colors in each color range list that
creates the color space. In general, color histograms are
used in red -green -blue space. For grayscale images these
histograms can be called intensity histograms [4].
The c olor histograms , used in this study , are
represented in three vectors o f 256 position s, one for each
of the three colors that creates the space (red, green and
blue).
Thus, for every individual pixel , values are calculated
for those 3 colours and the results are gather ed into a
vector histogram.
The main drawback of the color histograms is the fact
that the representation is dependent on objects ’ color.
Thus two different objects of the same color can be
considered different by the system, but two identical
objects with different colors will be evaluated as totally
different. These aspects are due to the fact that in analysis
aspects like texture and form are not considered [4].

V. CLUSTERING HIERARCHICAL TECHNIQUES
Clustering analysis involves gro uping a set of objects ;
so the objects from the same group called cluster are
more simi lar to each other than those from other groups
(clusters). This analysis is an important branch in data
mining field and is a common technique for statistical
analysis of the data used in machine learning, pattern
recognition, information retrieval and image analysis [5].
Hierarchical clustering algorithms are also known as
clustering algorithms through connectivity and have as
basic idea that objects are similar to adjacent objects than
other more distant objects . These algorithms connect
objects to create clusters b ased on the distance between
them . A cluster can be described by the maximum
distance needed to connect its parts [6]. The clusters
structure can be represented by a dendogram, hence the
name of hierarchical clustering [13].
The clustering algorithms famil ly differ by approach
and by using different metrics for measuring the distance between objects (Euclidean distance, Manhatt an distance,
etc.). T he user has to choose the criteria to join the two
clusters. Some common choices are "single link"
(minimum obj ect distance), "complete link" (maximum
distances of objects) or "average link" (average distances
of objects) [7].
Clustering algorithms are classified into two categories:
agglomeration and division [5]. Agglomeration
algorithms are based on simple eleme nts that create
indivi dually one cluster and continue by joining them. On
the other hand, division algorithms use , as starting point ,
a cluster consisting of all elements of the dataset and
continue by dividing it into smaller clusters [8].

Figure 1: Splitting and clustering hierarchical
agglomerative sch ema [5]

A. Agglomerative Nesting

AGNES algorithms have the following steps:

– initialize n clusters , where each object , in the initial se t,
defines a cluster;
– calculate a triangular matrix of dissim ilarity made of
distances between all pairs of objects , 𝑑𝑖,𝑗 = dist(𝑥𝑖, 𝑥𝑗);
– repeat e it by n times (n – the number of dataset
elements );
– identify in the dissimilarity matrix the smallest distance
between the elements i and j;
– connects i and j clusters to a new one (i ,j);
– replace objects i and with (i,j);
– replace the dissimilarity matrix according to the new
created way . This steps means to erase the i and j lines
and collums and add a new line (i,j) for the distance
which will be c alculated conforming to the desired link
(simple, complet or medium) [6].

B. Divisive Analysis
Divisive Analysis algorithms have the following
steps :
– initialize one single cluster which contains all the
dataset objects;
– repeat this for n times;
– find t he cluster with the largest diameter (the l argest
distance between two contained objects) ;
– find in this cluster the object with the highest
dissimilarity reported to other objects ; this will initialize a
new cluster ;

10th Jubilee IEEE International Symposium on Applied Computational Intelligenc e and Informatics • May 21 -23, 2015 • Timi șoara, Romania
3
– for every new object outside the cl uster the difference is
computed :

𝐷𝑖 = [ media(d(i,j)), j ∈ new cluster ] – [media(d(i,j)),
j ∈ new cluster]
– find the i object with the largest 𝐷𝑖 difference; if this is
positive , the object will be added to the newest cluster;
– repeat the last two steps until all the 𝐷𝑖 differe nces are
negative. Now , the cluster was been divided into two new
clusters [6].

The algorithm will end when all clusters will contain
only one object. In the image classification context of the
datasets, when we have a clusters structure generated
previously, the problem is no longer when comparing
each image from the database, but when comparison with
each cluster of images.
Thus for each cluster a representative object can be
chosen as its center image or can be calculated the
centroid’s cluster. This is an average of elements
characteristics from the cluster. Clusters classification is
performed by comparing the queried image with the
representative centroid element from the cluster. Further
classification can be done by comparing the cluster lev el
with each component image.
VI. APPLICATION SPECIFICATIONS
The a pplication made for this project can be
considered a structure composed of three main modules:
a module for generating a database image , image
clustering module based on visual characteristics and a
module that performs the image classification from the
dataset by comparing data with the image queried by the
user.

Figure 2: The flowchart application

The first module , the one who generat es database, has
the role to analyze a set of received input images and
generate therefor a mathematical representation of the
characteristics. There are two types of features that can be
generated: histogram analysis on grayscale images using
Local Binary Pattern technique and colored histograms in
the color s pace RGB (red -gree-blue) for the colored
images analysis. LBP histograms are represented for each
image by a vector of 256 positions stored in a .csv file.
On the other hand, RGB histograms are represented by
three v ectors, each one by 256 positions, one for each of
the three colors. These vectors are also stored as records
in a .csv file. D atabase generator keeps also the mapping between the image location on the disk and the
corresponding characteristics from the . csv files.
Image clustering module recei ves as input data the
image histograms and will provide the structure cluster as
the result . The final number of clusters is specified by the
user, but also by the metrics or the related functions of
two clusters.
The last module performs the image classi fication
based on the queried image made by the user. The
specif ied module performs a classification by comparing
cluster centroid (pictures) with the user’s image.
Subsequently, in each cluster is also performed a
classification of all the images. The fin al number of the
displayed/ classified images is specified by the user. This
module calculates the similarity percentage between the
resulting images and the one queried by the user .
VII. EXPERIMENTAL RESULTS
In this section we will present some experimental
results obtained when the application was ran with
vario us input parameters. For this we used as benchmark
the James Z. Wang database [12] [9] composed of 1000
images having almost the same size . Images can be
considered as part of 10 themati c categories, bu t the
similarity degree varies substantially from one category
to another. The experimental results can provide
comparative evaluation betw een the usage of LBP
histograms or the usage of the color (RGB) and between
different co nfigura tions of hierarchical agglomerative
clustering algorithms and Nesting Analysis divisiv e. Two
clustering techniques have been used: AGNES and
DIANA. Three ways of building clusters ( Simple,
Average and Complet Link) were used fot the two
clustering tehniques, AGNES and DIANA. For each
versio, two types of distances were used : Euclidean and
Manhattan) . These results are written in tables; the
images performance evaluation images and the relevance
of the images resulted as res ponse by the application will
be done based on the sim ilarity coefficient between the
queried image and the images from the database . We will
consider for this study only the first 10 images provided
as response, for which we will calculate an average
similarity percentage.
The first image used in the experi ment is shown in
Figure 3 and an example of the images provide d by the
system is showed in Figure 4.

.
Figure 3: First image interrogation

R. A. Stefan , I. Szoke and S. Holban • Hierarchical clustering techniques and classification applied in Content B ased Image R etrieval (CBIR)

4

Figure 4: The results from the first image interrogation

We will create Table 1 with the results obtaine d when
the image was analysed .
Tabel 1
Clustering
algorithm
Distance
Elements
Link
Clusters
Nr.
Clusters
Similarit y
Procentage
[%]

AGNES
Euclidiană
Simple
25 95,15
AGNES Euclidiană Simple 50 94,66
AGNES Euclidiană Complet
25 94,27
AGNES Euclidiană Complet
50 93,93
AGNES Euclidiană Average 25 94,86
AGNES Euclidiană Average 50 94,95
AGNES Manhattan
Simple 25 83,24
AGNES Manhattan Simple 50 83,88
AGNES Manhattan Complet 25 94,49
AGNES Manhattan Complet 50 94,49
AGNES Manhattan Average 25 91,91
AGNES Manhattan Average 50 94,48
AGNES Cosine Simple
25 86,10
AGNES Cosine Simple 50 82,04
AGNES Cosine Complet
25 95,67
AGNES Cosine Complet
50 95,67
Clustering
algorithm Distance
Elements Link Clusters Nr.
Clusters Similarit y
Procentage
[%]
AGNES Cosine Average 25 95,67
AGNES Cosine Average 50 95,67
AGNES Correlation Simple
25 91,76
AGNES Correlation Simple 50 75,51
AGNES Correlation Complet
25 95,67
AGNES Correlation Complet
50 95,67
AGNES Correlation Mediu 25 95,67
AGNES Correlation Mediu 50 95,67 AGNES Bray Curtis Simple
25 83,25
AGNES Bray Curtis Simple 50 82,88
AGNES Bray C urtis Complet
25 94,50
AGNES Bray Curtis Complet
50 94,50
AGNES Bray Curtis Mediu 25 92,92
AGNES Bray Curtis Mediu 50 92,92
DIANA
Euclidiană – 25 93,84
DIANA Euclidiană – 50 93,98
DIANA Manhattan – 25 94,49
DIANA Manhattan – 50 92,17
DIANA Cosine – 25 36,37
DIANA Cosine – 50 36,67
DIANA Correlation – 25 82,92
DIANA Correlation – 50 82,92
DIANA Bray Curtis – 25 65,56
DIANA Bray Curtis – 50 65,56

RGB color histogram was used t o obtain these results.
The similarity percentage of the first 10 returned images
by the system is generally constant when were analize d
by RGB histogram analysis, but best performance was
obtai ned with the AGNES algorithm having as comparing
metric the Euclidean distance between the elements.
When using the Manhattan metric, the results tend to
decrease slightly in accuracy, especially in its use as a
link to the union simple clusters. The same can be s aid
for the use of C osine metrics, Correlation and Bray Curtis
together with the simple link. Under these con figura tions ,
the system delivers results with lower relevance for the
interrogated image .
Similar results are obtained when Diana algorithm was
used. A sudden drop rates can be seen when C osine
metrics, Correlation and Bray Curtis were used , but the
worst results were with Cosine metric.
Another accuracy estimator result offered by the image
belongs to the same semantic field as well as interrogated
image. An average of 8.75 procente, for the first 10
images , was correctly classified by the application. Also ,
all 10 images were part of the semantics area of the
queried image , except using DIANA algorithm with
Cosine distance – only one image of 10 was correctly
identified.
Further w e will present the results using LBP
histogram as comparison images feature . Since Local
Binary Pattern technique is a textural analysis method
that can be applied to grayscale images, we have chosen
for these experiments a database composed of various
ultrasound images. We opted for this because the real
world objects often have significant differences in terms
of texture. On the other hand, the ultrasound images often
have similar characteristics even for different patients .

10th Jubilee IEEE International Symposium on Applied Computational Intelligenc e and Informatics • May 21 -23, 2015 • Timi șoara, Romania
5
Also, such a content -based retrieval system is useful in
the medical field, since doctors can compare a new
ultrasound image with other database images to offer a
diagnosi s. Database [10] [11] contain s 220 ultrasound
images for various human organs, but to assess the
results , we did refer mainly to the first 4 relevance
pictures offered by the system , but also to the similarity
percentage. If the similarity percentage was calculated
before using the RGB histograms cosine metric, this time
we will calculat e using the same method, but on gray
histograms.
For this experiment we used the Figure 5 image and the
results can be seen in Figure 6.

Figure 5: Interrogated image using LBP technique

Figure 6: Interrogated results using LBP technique
We performed tests on Figure 5 image using the 3
techniqu es types of Local Binary Pattern operator from
the application ( simple operator, invariant rotational and
uniform patterns) and on different configuration
clustering algorithms . This time , the fin al number of
clusters was between 15 and 50 due to t he smaller set of
available images . Invariant rotational operators and
uniform pattern provides histogram which can generat e
the image classification accurately ; the first 4 images are
consistently the most relevant from the dataset with a
similarity of 99%, regardless of the clustering
configuration algorithms. Simple operator also provides
good resu lts, but often only 3 of the 4 s imilar i mages can
be found in the first classified images .
Like in the case of col ored his togram analysis , using
the s imple link to unify two clusters do provides
significantly worse performance . In addition of relevance analysis, on the provided
images were made performance evaluation tests using
clusteri ng content -based image retrieval algori thms; tests
were also performed on the required execution times .
Thus it was observed that depending on the final number
of clusters used , classification image times can be
obtained up to 9 times bet ter than in a normal
classification without a clustered g roup of pictures in
advance.
We will present the experimental results and the
execution time when an image was queried . When
querying an image using CBIR system t hat does not use a
clustering images from the database , the execution time
was 136 ms. In the next table , you can see the execution
times needed when an image is queried when clustering is
done before quering .

Number of
cluster s Execution Time
[ms] Execution
Acceleration
10 72 1.88
15 70 1.94
20 69 1.97
25 66 2.06
30 65 2.09
35 54 2.51
40 49 2.77
45 36 3.77
50 27 5.03
55 19 7.15
60 15 9.06
65 15 9.06
70 15 9.06
75 16 7.15
80 17 8.00
90 27 5.03
100 33 4.12

Execution acceleration is calculated as the ratio
between the execution time requir ed for a simple query
and the execution ti me required when queried by
clustering.
VIII. CONCLUSIONS

From a user point of view, the perfo rmance of a
content -based image retrieval system can be measured by
the quality and the speed when an image is queried .
Several factors contribute to the overall perf ormanc e of
such applications: the required time when running an
individual query , the quality , the precision results of
individual queries and the relevance. All these factors
should be taken into account when designing such
content -based image retrieval system .
The i ntegration of clustering algorithms in the context
of CBIR application , can bring higher performance to this
level, as long as optimal configuration of input
parameters is used . The quality of the result s, which are
provided by comparing the used characteristics, are a lso
important .
In terms of the used features , the curent study showed
that the use of RGB color histogram is effective in the
context for real life image analysis . The main
disadvantage of this approach is the inability to

R. A. Stefan , I. Szoke and S. Holban • Hierarchical clustering techniques and classification applied in Content B ased Image R etrieval (CBIR)

6
distin guish between different semantic elemen ts but with
the same color (i. e. the sky and the water) as well as the
difficulty to catalog different objects but with different
color (i .e. red car and green car). However different color
objects can sometimes be f ound similar if they have been
photographed in a similar context or with a similar
background.
On the other hand, using histograms generated with
Local Binary Pattern technique does not present good
results when the technique is applied on real wo rld
imag e, because of the big textural differences. This
method can be applied on ultrasound images. In general,
ultrasound images have major similarities from one
individual to another , in terms of texture. The CBIR
system used on the medical images may as sist in making
a diagnosis on a patient by comparing the ultrasound
image to other similar ultrasound images similar from the
database but belonging to other patie nt. We used three
types of operators to analyze the LBP histograms : simple,
rotational invariant and uniform pat terns. All three
technique s provided good and very good results , sligh tly
higher for the last two, rotational invariant and uniform
patterns .
By images clustering in a database , the CBIR system
response will not consist in individual images anymore ,
but in groups of images with similar characteristics.
Grouping images into clusters is justified by the fact that
images with a h igh degree of similarity tend to be
gathered in the same cluster. The two clustering
algorithms used in the application (agglomerative and
divisive Nesting Analysis) showed similar performance,
they are distinguished only by agglomerative approach or
by division.
Another important element in clustering techniques
integration for CBIR application is the type of metric
used t o measure distances between images. Euclidean
distance, Manhattan distance, cosine distance, correlation
distance and Bray -Curtis Correlation distance were used .
The relevance results are comparable to most of the
metrics, except cosine distance and distan ce divisor
algorithm Bray -Curtis Analysis.
The input parameter of clustering algorithms which
influenced the final results , was represented by the
conjuction function (link) used in the AGNES algorithm,
to unify two clusters. We used all 3 types of links, i.e.
simple link (minimum) , complete link (max im) and
average link. The experimenta l results showed that the
simple link create d large diameter clusters , often grouped
less similar items into the same cluster and provided
difficulties in treating the extreme elements. The query
results that uses clusters formed by the simple link
function are often misleaded and with a little relevance
(especially in combination with cosine distance). On the
other hand, using a complete link or medium t ends to
minimize the clusters diameter , forcing a high similarity
degree in the same group. This increases considerably the
system performa nce in terms of relevance results. A final important conclusion of this analysis relies on
the query acceleration time. The technique c lustering
integration into the application was resulted execution
time up to 9 times better even on a relatively small set of
1000 images , compared with a classical query system that
classifies all images in the database to provide the most
relevant respons e. This is due to the much smaller
number of comparisons made, requiring a clusters
classification, not of the images. Accelerating the
execution time is dependent on the final number of
clusters generated by clustering algorithm.
Findin g a developed optim al content -based image
retrieval system configuration can provide higher
performance in terms of required time and the image
relevance offered as a result.

REFERENCES

[1] Yixin Chen, James Z. Wang, Robert Krovetz. “Content -Based
Image Retrieval By Clustering ”, MIR '03 “Proceedings of the 5th
ACM SIGMM international workshop on Multimedia information
retrieva”, 2003 .
[2] John Eakins, Margaret Graham “Content -based Image Retrieval”,
The JISC Technology Applications Programme”, 1999 .
[3] M. Pietikainen, G. Zhao, A. Hadi d, T. Ahonen, "Computer Vision
Using Local Binary Patterns", Springer -Verlag London Limited,
2011.
[4] Xiang -Yang Wang, Jun -Feng Wu, Hong -Ying Yang "Robust
image retrieval based on color histogram of local feature regions",
Springer Netherlands, 2009.
[5] Jiawei H an, Micheline Kamber, “Data Mining – Concepts and
Techniques”, 2nd edition, Morgan -Kaufmann, 2006 .
[6] Kaufman, L.; Rousseeuw, P.J. (1990).” Finding Groups in Data:
An Introduction to Cluster Analysis” (1 ed.). New York: John
Wiley .
[7] P.-N. Tan, M. Steinbach & V . Kumar, "Introduction to Data
Mining", , Addison -Wesley (2005) .
[8] Bailey, Ken, "Numerical Taxonomy and Cluster Analysis".
Typologies and Taxonomies, 1994 .
[9] James Z. Wang, Jia Li, Gio Wiederhold, „SIMPLIcity:
Semantics -sensitive Integrated Matching for Pictu re
LIbraries,'' IEEE Trans. on Pattern Analysis and Machine
Intelligence , vol 23, no.9, pp. 947 -963, 2001.
[10] Ultrasound image database : Brno University of Technology –
Signal processing laboratory [Online]. Available at:
http://splab.cz/en/download/databaze/ultrasound . Accessed in
January 2015.
[11] Ultrasound image gallery [Online] . Available at:
http://www.ultrasound -images.co m. Accessed in January 2015.
[12] James Z. Wand database [Online] . Available at:
http://wang.ist.psu.edu/docs/related.shtml. Accessed in January
2015 .
[13] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh,
Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng,
Bing Liu, Philip S. Yu, Zhi -Hua Zhou, Michael Steinbach, David
J. Hand, and Dan Steinberg. 2007. Top 10 algorithms in data
mining. Knowl. Inf. Syst.14, 1 (Decembe r 2007), 1 -37.

Similar Posts