Fuzzy methods applied in data mining
1 Introduction
The digital revolution of recent years has led to huge amounts of data from
all kinds of fields being collected and stored in different forms (such as
databases or data warehouses). As a result, traditional combinations of
statistical techniques and data management tools are no longer adequate for
analyzing these vast collections of data.
Domains where large volumes of data are stored in centralized or distributed
databases include various scientific domains, financial areas, health care,
production and manufacturing, telecommunications, and the World Wide Web.
The term data mining refers to the process of extracting knowledge from
considerable amounts of data. Mining is a term characterizing the process
that finds a small set of significant information in a large data set. Other
terms carry a similar meaning to data mining, such as knowledge mining
from data, knowledge extraction, data/pattern analysis, data archaeology,
and data dredging.
In some cases data mining is used as a synonym for Knowledge Discovery
from Data, or KDD. Strictly speaking, however, data mining is only one step
of KDD, which comprises several steps (data cleaning, data integration, data
selection, data transformation, data mining, pattern evaluation, and
knowledge presentation).
Typically, a data mining algorithm consists of some combination of the
following three components:
– The model: the function of the model (e.g., classification, clustering) and
its representational form (e.g., linear discriminants, neural networks)
– The preference criterion
– The search algorithm.
The most common model functions in current data mining practice include:
1) Classification: classifies a data item into one of several predefined
categorical classes.
2) Regression: maps a data item to a real-valued prediction variable.
3) Clustering: maps a data item into one of several clusters, where clusters
are natural groupings of data items based on similarity metrics or probability
density models.
4) Rule generation: extracts classification rules from the data.
5) Discovering association rules: describes association relationships among
different attributes.
6) Summarization: provides a compact description for a subset of data.
7) Dependency modeling: describes significant dependencies among variables.
Since the publication of Zadeh's seminal paper on fuzzy sets, fuzzy set
theory and its descendant, fuzzy logic, have evolved into powerful tools for
managing uncertainties inherent in complex systems.
Fuzzy sets make it possible to model imprecise and qualitative knowledge
and to transmit and handle uncertainty at various stages. Fuzzy logic is
capable of supporting, to a reasonable extent, human-like reasoning in
natural form. Fuzzy logic is a superset of
conventional logic that has been extended to handle the concept of partial
truth values between "completely true" and "completely false". As its name
suggests, it is the logic underlying modes of reasoning which are approximate
rather than exact. Fuzzy sets are generalized sets which allow for a graded
membership of their elements.
The role of fuzzy sets is categorized below based on the different functions
of data mining that are modeled.
1) Classification and Regression
2) Clustering: Data mining aims at sifting through large volumes of data in
order to reveal useful information in the form of new relationships,
patterns, or clusters, for decision-making by a user.
3) Association Rules: An association rule describes an interesting association
relationship among different attributes.
4) Functional Dependencies: Fuzzy logic has been used for analyzing infer-
ence based on functional dependencies (FDs), between variables, in database
relations. Fuzzy inference generalizes both imprecise (set-valued) and precise
inference. Similarly, fuzzy relational databases generalize their classical and
imprecise counterparts by supporting fuzzy information storage and retrieval.
5) Data Summarization: Summary discovery is one of the major components
of knowledge discovery in databases. This provides the user with comprehen-
sive information for grasping the essence from a large amount of information
in a database.
6) Web Application: Mining typical user profiles and URL associations from
the vast amount of access logs is an important component of Web
personalization, which deals with tailoring a user's interaction with the Web
information space based on information about him or her.
7) Image Retrieval: The recent increase in the size of multimedia information
repositories, consisting of mixed media data, has made content-based image
retrieval (CBIR) an active research area.
2 Practical applications of fuzzy methods in data mining
In data mining, the user looks for new knowledge, such as relations between
variables or general rules. The search is performed, for example, in
databases or data warehouses. The purpose is to find homogeneous categories,
prototypical behaviors, general associations, or important features for
the recognition of a class of data. In this setting, fuzzy sets bring
flexibility in knowledge representation and interpretability of the obtained
results, whether these are rules or characterizations of prototypes.
The advantage of fuzzy systems is that they can provide simple intuitive
models for interpretation and prediction. Prior knowledge in the form of
fuzzy rules can be easily integrated.
Fuzzy systems have numeric interpolation capabilities and are therefore
suited for function approximation and prediction. On the other hand they
partition variables by fuzzy sets that can be labeled with linguistic terms.
Thus they also have a symbolic nature and can be intuitively interpreted.
Fuzzy systems conveniently allow us to model a partially known depen-
dency between independent and dependent variables by using linguistic rules.
By using linguistic terms represented by fuzzy sets to describe values, we can
select a certain granularity under which the data is observed. We can use a
fuzzy system both for predicting values for the dependent variables and for
knowledge representation.
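As a rough illustration of how linguistic rules can give both a readable model and numeric predictions, the following Python sketch partitions one input variable with three triangular fuzzy sets and interpolates between rule outputs. Everything specific here is an assumption made for the example: the variable (a temperature), the term names, the breakpoints, the rule outputs, and the weighted-average combination (a zero-order Sugeno-style scheme, which is only one of several possible inference schemes).

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic terms for a temperature in degrees Celsius.
TERMS = {
    "cold": (-20.0, 0.0, 20.0),
    "warm": (0.0, 20.0, 40.0),
    "hot": (20.0, 40.0, 60.0),
}

# Hypothetical rules: IF temperature is <term> THEN heater power is <value> (percent).
RULES = {"cold": 90.0, "warm": 40.0, "hot": 0.0}


def fuzzify(x):
    """Map a crisp input to its membership degree in every linguistic term."""
    return {name: triangular(x, *abc) for name, abc in TERMS.items()}


def predict(x):
    """Weighted average of the rule outputs, weighted by rule activation."""
    degrees = fuzzify(x)
    total = sum(degrees.values())
    if total == 0.0:
        raise ValueError("input lies outside every fuzzy set")
    return sum(degrees[name] * RULES[name] for name in RULES) / total


print(fuzzify(10.0))  # partial membership in "cold" and "warm"
print(predict(10.0))  # interpolated between the two active rules
```

The rule table stays readable as plain statements, while `predict` interpolates smoothly between them; choosing fewer or more terms changes the granularity at which the data is described.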
2.1 Classification and Regression
In the terminology of machine learning, classification is considered an
instance of supervised learning. Classification is the problem of identifying
to which of a set of categories (sub-populations) a new observation belongs,
on the basis of a training set of data containing observations (or instances)
whose category membership is known. An algorithm that implements
classification, especially in a concrete implementation, is known as a
classifier.
2.1.1 Fuzzy decision trees
A decision tree is a structure in which each internal node represents a "test"
on an attribute, each branch represents the outcome of the test and each
leaf node represents a class label. The paths from root to leaf represent
classification rules.
Fuzzy decision trees (FDT) are interesting for data mining and informa-
tion retrieval because they enable the user to take into account imprecise
descriptions of the cases, or heterogeneous values (symbolic, numerical, or
fuzzy). A key quality of FDTs is their robustness: a small variation in the
descriptions does not drastically change the decision or the class associated
with a case, which provides resistance to measurement errors and avoids
sharp differences for close values of the descriptions.
When constructing fuzzy decision trees, an important choice is the measure
of discrimination H. Two main families can be distinguished. The first
comprises methods based on a generalized Shannon entropy: the entropy of
fuzzy events is used as the measure of discrimination. It corresponds to the
Shannon entropy extended to fuzzy events by substituting probabilities of
fuzzy events for classical probabilities. The second comprises methods based
on another family of fuzzy measures, namely a measure of classification
ambiguity, defined from both a measure of fuzzy subsethood and a measure of
non-specificity.
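The first family can be sketched in a few lines of Python. The sketch assumes crisp class labels and estimates the probability of a fuzzy event by the relative sum of membership degrees; the function names and the toy data are illustrative and are not taken from any particular FDT implementation.

```python
import math


def fuzzy_entropy(memberships, labels):
    """Shannon entropy of the class distribution inside one fuzzy attribute value.

    memberships[i] is the degree to which example i has the fuzzy value;
    labels[i] is its (crisp) class. Class probabilities are estimated as
    membership mass per class divided by total membership mass.
    """
    mass_per_class = {}
    for mu, label in zip(memberships, labels):
        mass_per_class[label] = mass_per_class.get(label, 0.0) + mu
    total = sum(mass_per_class.values())
    if total == 0.0:
        return 0.0
    entropy = 0.0
    for mass in mass_per_class.values():
        if mass > 0.0:
            p = mass / total
            entropy -= p * math.log2(p)
    return entropy


def split_entropy(partition, labels):
    """Average entropy of a fuzzy partition, weighted by the mass of each term.

    partition maps each linguistic term of the attribute to the list of
    membership degrees of all examples in that term.
    """
    masses = {term: sum(mus) for term, mus in partition.items()}
    total = sum(masses.values())
    if total == 0.0:
        return 0.0
    return sum(
        masses[term] / total * fuzzy_entropy(partition[term], labels)
        for term in partition
        if masses[term] > 0.0
    )


labels = ["yes", "yes", "no", "no"]
partition = {"low": [1.0, 0.8, 0.2, 0.0], "high": [0.0, 0.2, 0.8, 1.0]}
print(split_entropy(partition, labels))
```

When growing a tree, the attribute whose fuzzy partition gives the lowest value of such a measure would be chosen for the next test, exactly as with crisp entropy in a classical decision tree.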
2.1.2 Support Vector Learning for Fuzzy Rule-Based Classification Systems
In machine learning, support vector machines (SVMs) are supervised learning
models with associated learning algorithms that analyze data and recognize
patterns, used for classification and regression analysis. An SVM model is
a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as
wide as possible. In addition to performing linear classification, SVMs can
efficiently perform non-linear classification using what is called the kernel
trick, implicitly mapping their inputs into high-dimensional feature spaces.
In [x] the authors analyze the relationship between fuzzy rule-based
classification systems and kernel machines. They prove that, under a general
assumption on membership functions, an additive fuzzy rule-based
classification system can be constructed directly from the given training
samples using the support vector learning approach. Such additive fuzzy
rule-based classification systems are named positive definite fuzzy
classifiers (PDFCs).
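To make the link between fuzzy rules and kernels more concrete, here is a minimal Python sketch of the decision function of an additive fuzzy classifier. It assumes Gaussian membership functions and product conjunction, so that the firing strength of a rule has the form of a Gaussian kernel between the input and the rule's center. The rule centers, weights, and bias below are invented for illustration; in the support vector approach they would come from SVM training (support vectors and their coefficients), which is not shown here.

```python
import math


def gaussian_membership(u, width=1.0):
    """Membership degree as a function of the offset u from the rule center."""
    return math.exp(-((u / width) ** 2))


def firing_strength(x, center):
    """Product conjunction of the per-dimension memberships of one rule.

    With Gaussian memberships this product equals a Gaussian kernel
    evaluated at (x, center).
    """
    return math.prod(gaussian_membership(xk - zk) for xk, zk in zip(x, center))


def classify(x, rules, bias=0.0):
    """Additive fuzzy classifier: sign of the weighted sum of rule activations.

    rules is a list of (center, weight) pairs; a positive weight votes for
    class +1 and a negative weight for class -1.
    """
    score = bias + sum(w * firing_strength(x, center) for center, w in rules)
    return 1 if score >= 0 else -1


# Hypothetical rule base; in practice derived from the trained SVM.
rules = [((0.0, 0.0), 1.0), ((3.0, 3.0), -1.0)]
print(classify((0.2, 0.1), rules))
print(classify((2.9, 3.1), rules))
```

Each rule can still be read linguistically ("IF x is close to this center THEN vote for this class"), which is the interpretability benefit of the fuzzy formulation.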
2.1.3 Instance based learning
The nearest neighbor rule (NN) was originally proposed by Cover and Hart.
K nearest neighbors (KNN) is a simple algorithm that stores all available
cases and classifies new cases based on a similarity measure (e.g., a
distance function). A case is classified by a majority vote of its
neighbors: it is assigned to the class most common among its K nearest
neighbors, as measured by the distance function. KNN has been used as a
non-parametric technique in statistical estimation and pattern recognition
since the early 1970s.
The fuzzy k-nearest neighbor (F-KNN) algorithm was originally developed by
Keller in 1985. It generalizes the k-nearest neighbor (KNN) algorithm and
overcomes the drawback of KNN that all instances are considered equally
important. However, the F-KNN algorithm still suffers from the same large
memory requirement as KNN.
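The core of F-KNN can be sketched as follows in Python. Instead of a majority vote, the class memberships of a query are a weighted average of the memberships of its k nearest neighbors, with weights proportional to the inverse distance raised to 2/(m-1), where m > 1 is a fuzzifier. The data layout, the function name, and the small constant guarding against zero distances are assumptions made for this example.

```python
import math


def fuzzy_knn(train, query, k=3, m=2.0):
    """Return the fuzzy class memberships of `query`.

    train is a list of (point, memberships) pairs, where memberships maps a
    class label to the degree to which the training point belongs to it
    (use 1.0 for crisp labels).
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    exponent = 2.0 / (m - 1.0)
    # Inverse-distance weights; tiny floor avoids division by zero on exact matches.
    weights = [
        1.0 / max(math.dist(point, query), 1e-12) ** exponent
        for point, _ in neighbors
    ]
    total = sum(weights)
    result = {}
    for weight, (_, memberships) in zip(weights, neighbors):
        for label, mu in memberships.items():
            result[label] = result.get(label, 0.0) + weight * mu / total
    return result


train = [
    ((0.0, 0.0), {"a": 1.0}),
    ((1.0, 0.0), {"a": 1.0}),
    ((4.0, 0.0), {"b": 1.0}),
    ((5.0, 0.0), {"b": 1.0}),
]
print(fuzzy_knn(train, (2.0, 0.0), k=3))
```

Note that the whole training set is still stored and scanned for every query, which is the memory and time cost mentioned above.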
In [x] the authors proposed a new approach to deal with the problems raised
by the F-KNN algorithm. They proposed the so-called condensed fuzzy k-nearest
neighbor rule (CFKNN), which selects the important instances based on sample
fuzzy entropy.
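Given a decision table $DT = (U, A \cup C)$, for all $x_i \in U$ and all
classes $C_j$ ($1 \le i \le n$; $1 \le j \le p$), let $\mu(x_i, C_j)$ be the
fuzzy membership degree of instance $x_i$ in class $C_j$. The fuzzy entropy
of instance $x_i$ is defined as follows:
$$\mathrm{Entr}(x_i) = -\sum_{j=1}^{p} \mu(x_i, C_j) \log_2 \mu(x_i, C_j)$$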
The CFKNN algorithm selects the set of important samples based on the fuzzy
entropy of the samples in the training set. The larger the fuzzy entropy,
the more important the sample: samples with larger fuzzy entropy provide
more information for classification because they lie closer to the class
boundaries.
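A small Python sketch shows how such an entropy separates boundary samples from interior ones. It computes, for each sample, the negative sum over classes of the membership degree times its base-2 logarithm, and then ranks the samples. The sample names and membership values are made up, and the ranking step only illustrates the selection idea; it is not the full CFKNN procedure.

```python
import math


def sample_fuzzy_entropy(memberships):
    """Fuzzy entropy of one sample given its class membership degrees.

    Terms with zero membership contribute nothing (0 * log 0 is taken as 0).
    """
    return -sum(mu * math.log2(mu) for mu in memberships if mu > 0.0)


# Hypothetical membership degrees of four samples in two classes.
samples = {
    "x1": [1.0, 0.0],  # deep inside class 1
    "x2": [0.9, 0.1],
    "x3": [0.6, 0.4],
    "x4": [0.5, 0.5],  # on the class boundary
}

ranked = sorted(samples, key=lambda name: sample_fuzzy_entropy(samples[name]), reverse=True)
for name in ranked:
    print(name, round(sample_fuzzy_entropy(samples[name]), 3))
```

Samples with memberships spread evenly over the classes come first in the ranking, which matches the intuition that they lie near the decision boundary.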
2.2 Clustering
In contrast to fuzzy decision trees, for example, clustering algorithms
belong to the unsupervised learning framework, i.e. they do not assume that
a decomposition of the data set into categories is available. They perform
data mining as the identification of relevant subgroups of the data,
determining subsets of similar data and thus highlighting the underlying
structure of the data set.
Fuzzy set theory proves its advantage in this framework through the
notion of membership degrees: in crisp clustering algorithms, such as
k-means or hierarchical methods, a point is assigned to a single cluster.
This is ill-suited to the frequent case where clusters overlap and points
have partial memberships in several subgroups. Ruspini [x] first proposed
to exploit fuzzy set theory to represent clusters, so as to model unclear
assignments and clusters with unsharp boundaries. Dunn [x] proposed the first
fuzzy clustering algorithm, called fuzzy c-means (FCM), which was generalized
by Bezdek [x].
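A compact Python sketch of the standard FCM iteration is given below. It alternates between updating membership degrees from the distances to the cluster centers and recomputing each center as a membership-weighted mean. The fuzzifier m = 2, the fixed number of iterations, the caller-supplied initial centers, and the toy data are choices made for this example only.

```python
import math


def memberships_for(point, centers, m):
    """Membership degrees of one point in every cluster (they sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with a center: share full membership among such centers.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Run a fixed number of FCM iterations from the given initial centers."""
    dim = len(points[0])
    for _ in range(iterations):
        u = [memberships_for(p, centers, m) for p in points]
        new_centers = []
        for i, old in enumerate(centers):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            if total == 0.0:
                new_centers.append(old)  # keep a center that attracts no mass
                continue
            new_centers.append(
                tuple(sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim))
            )
        centers = new_centers
    return centers, [memberships_for(p, centers, m) for p in points]


points = [(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,)]
centers, u = fuzzy_c_means(points, centers=[(0.0,), (6.0,)])
print(centers)
print([round(row[0], 3) for row in u])
```

Unlike k-means, every point keeps a degree of membership in every cluster, so points lying between two groups are not forced into one of them.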
2.3 Association Rules
The idea of empowering classical association rules by combining them with
fuzzy set theory has been around for several years. The original
idea derives from attempts to deal with quantitative attributes in a database,
where subdividing the quantitative values into crisp sets would lead to
over- or underestimating values near the borders.
Fuzzy association rule mining extends classical association rule mining by
using fuzzy sets. The approach emerged out of the need to efficiently mine
the quantitative data frequently present in databases. Algorithms for mining
quantitative association rules have already been proposed.
When we are dealing with quantitative attributes mapped to fuzzy sets,
we might, depending on the membership functions, find that the membership
values of a single entity across the sets do not add up to one. This depends
on how we defined our fuzzy sets and the corresponding membership functions
in advance. If we are dealing with a mix of quantitative and categorical
attributes, we might find it unreasonable that the quantitative attribute has
the potential to contribute more to a rule than a categorical one.
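The following Python sketch illustrates one common way to generalize support and confidence to fuzzy items. It assumes that each record already stores membership degrees for (attribute, fuzzy set) pairs and that the minimum is used to combine the items of an itemset; the product is another frequent choice. The attribute names, fuzzy sets, and values are invented for the example.

```python
def fuzzy_support(records, itemset):
    """Average, over all records, of the combined membership in every item.

    Each record maps (attribute, fuzzy_set) pairs to membership degrees;
    missing pairs count as 0. Items are combined with the minimum.
    """
    return sum(min(r.get(item, 0.0) for item in itemset) for r in records) / len(records)


def fuzzy_confidence(records, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    base = fuzzy_support(records, antecedent)
    if base == 0.0:
        return 0.0
    return fuzzy_support(records, antecedent + consequent) / base


# Hypothetical fuzzified records.
records = [
    {("age", "young"): 0.8, ("income", "low"): 0.6},
    {("age", "young"): 0.2, ("income", "low"): 0.1},
    {("age", "young"): 1.0, ("income", "low"): 0.9},
    {("age", "young"): 0.0, ("income", "low"): 0.4},
]

young = [("age", "young")]
low = [("income", "low")]
print(fuzzy_support(records, young + low))
print(fuzzy_confidence(records, young, low))
```

If the membership values of one attribute do not sum to one per record, a normalization step before counting is one way to keep quantitative attributes from contributing more than categorical ones.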
2.4 Web Applications
Web mining is the application of data mining techniques to discover patterns
from the World Wide Web. Web mining can be divided into three different
types: Web usage mining, Web content mining, and Web structure mining.
Personalization tailors a user's interaction with the Web information space
based on information gathered about them. Declarative user information,
such as manually entered profiles, continues to raise privacy concerns and is
neither scalable nor flexible in the face of very active, dynamic Web sites
and changing user trends and interests. One way to deal with this problem is
through a fully automated Web personalization system. Such a system
can be based on Web usage mining to discover Web usage profiles, followed by
a recommendation system that can respond to the user's individual interests.
Significant amounts of error and uncertainty can permeate all the stages of
Web personalization.
In [x] the authors present a fast and intuitive approach to providing Web
recommendations using a fuzzy inference engine with rules that are
automatically derived from prediscovered user profiles. The proposed fuzzy
recommendation method achieves high coverage compared to K-NN and
nearest-profile recommendations, despite slightly lower precision.
Fuzzy recommendations are very intuitive, deal with natural overlap in
user interests, and are very low in cost compared to collaborative filtering:
they are much faster and require much less main memory at recommendation
time (no need to store or compare to a large number of instances).
This makes fuzzy recommendations suitable for real-time recommendations
in a live setting on today's largest and most active websites.
3 Conclusions
REFERENCES
– Komal Sahedani, A Survey: Fuzzy Set Theory in Data Mining,
International Journal of Advanced Research in IT and Engineering,
ISSN: 2278-6244, Vol. 2, No. 7, 2013
– Bernadette Bouchon-Meunier, Marcin Detyniecki, Marie-Jeanne Lesot,
Christophe Marsala, and Maria Rifqi, Real-World Fuzzy Logic Applications
in Data Mining and Information Retrieval, Studies in Fuzziness and Soft
Computing, Vol. 215, 2007, pp. 219-247
– Yixin Chen, James Z. Wang, Support Vector Learning for Fuzzy Rule-Based
Classification Systems, IEEE Transactions on Fuzzy Systems, Vol. 11,
No. 6, 2003
– Jun-Hai Zhai, Na Li, Meng-Yao Zhai, The Condensed Fuzzy K-Nearest
Neighbor Rule Based on Sample Fuzzy Entropy, Proceedings of the 2011
International Conference on Machine Learning and Cybernetics, 2011
– Lukas Helm, Fuzzy Association Rules – An Implementation in R
– Olfa Nasraoui and Christopher Petenes, Combining Web Usage Mining
and Fuzzy Inference for Website Personalization
