DMDW-2: Data Preprocessing
Florin Radulescu, Mihai Dascălu
Road Map
• Data types
• Measuring data
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization
• Summary
Data types
• Categorical vs. Numerical
• Scale types:
  • Nominal
  • Ordinal
  • Interval
  • Ratio
Categorical vs. Numerical
• Categorical data consists of names representing categories, meaning each value belongs to a definable category. Examples: color (with categories red, green, blue and white) or gender (male, female).
• Values of this type are not ordered; the usual operations that can be performed on them are equality testing and set inclusion.
• Numerical data consists of numbers from a continuous or discrete set of values.
• Numerical values are ordered, so testing this order is possible (<, >, etc.).
• Sometimes we must (or may) convert categorical data to numerical data by assigning a numeric value (code) to each label.
Scale types
Stanley Smith Stevens, director of the Psycho-Acoustic Laboratory at Harvard University, proposed in a 1946 Science article [Stevens 46] that all measurement in science uses four different types of scales:
• Nominal
• Ordinal
• Interval
• Ratio
Nominal
• Values belonging to a nominal scale are characterized by labels.
• The values are unordered and equally weighted.
• We cannot compute the mean or the median of a set of such values.
• Instead, we can determine the mode: the value that occurs most frequently.
• Nominal data are categorical, but may sometimes be treated as numerical by assigning numbers to labels.
Ordinal
• Values of this type are ordered, but the difference (distance) between two values cannot be determined.
• The values only determine the rank order / position in the set.
• Examples: the military rank set, or the finishing order of marathoners at the Olympic Games (without their times).
• For these values we can compute the mode or the median (the value placed in the middle of the ordered set), but not the mean.
• These values are categorical in essence, but can be treated as numerical because numbers (positions in the set) are assigned to the values.
Interval
• These are numerical values.
• For interval-scaled attributes the difference between two values is meaningful.
• Example: temperature on the Celsius scale is an interval-scaled attribute because the difference between 10 and 20 degrees is the same as the difference between 40 and 50 degrees.
• Zero does not mean 'nothing' but is fixed somewhat arbitrarily; for that reason negative values are also allowed.
• We can compute the mean and the standard deviation, or use regression to predict new values.
Ratio
• Ratio-scaled attributes are like interval-scaled attributes, but zero means 'nothing'.
• Negative values are not allowed.
• The ratio between two values is meaningful.
• Example: age: a 10-year-old child is twice as old as a 5-year-old child.
• Other examples: temperature in Kelvin, mass in kilograms, length in meters, etc.
• All mathematical operations can be performed, for example logarithms, geometric and harmonic means, coefficient of variation.
Binary data
• Sometimes an attribute may have only two values, like gender in the previous example. Such an attribute is called binary.
• Symmetric binary: the two values carry the same weight and equal importance (as in the gender case).
• Asymmetric binary: one of the values is more important than the other. Example: a medical report containing blood tests for the presence of certain substances, each evaluated as 'Present' or 'Absent'; here 'Present' is more important than 'Absent'.
• Binary attributes can be treated as interval- or ratio-scaled, but in most cases they must be treated as nominal (symmetric binary) or ordinal (asymmetric binary).
• There is a set of similarity and dissimilarity (distance) functions specific to binary attributes.
Measuring data
Measuring data
• Measuring central tendency:
  • Mean
  • Median
  • Mode
  • Midrange
• Measuring dispersion:
  • Range
  • kth percentile
  • IQR
  • Five-number summary
  • Standard deviation and variance
Central tendency – Mean
• Consider a set of n values of an attribute: x1, x2, …, xn.
• Mean: the arithmetic mean or average value is:
  μ = (x1 + x2 + … + xn) / n
• If the values have different weights w1, …, wn, the weighted arithmetic mean or weighted average is:
  μ = (w1x1 + w2x2 + … + wnxn) / (w1 + w2 + … + wn)
• If the extreme values are eliminated from the set (e.g. the smallest 1% and the largest 1%), a trimmed mean is obtained.
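As an illustration, here is a minimal plain-Python sketch of the three means (the function names are ours; the 1% trimming fraction matches the example above):

```python
def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean with weights w1..wn."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, fraction=0.01):
    """Mean after dropping the smallest and largest `fraction` of the values."""
    xs = sorted(xs)
    k = int(len(xs) * fraction)
    return mean(xs[k:len(xs) - k]) if k else mean(xs)

print(mean([1, 2, 3]))                       # 2.0
print(weighted_mean([1, 2], [3, 1]))         # (3*1 + 1*2) / 4 = 1.25
print(trimmed_mean(list(range(100)), 0.01))  # drops 0 and 99 -> 49.5
```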
Central tendency – Median
• Median: the median of an ordered set is the middle value in the set.
• Example: the median of {1, 3, 5, 7, 1001, 2002, 9999} is 7.
• If n is even, the median is the mean of the two middle values: the median of {1, 3, 5, 7, 1001, 2002} is 6 (the arithmetic mean of 5 and 7).
Central tendency – Mode
• Mode: the mode of a dataset is the most frequent value.
• A dataset may have more than one mode; for 1, 2 and 3 modes the dataset is called unimodal, bimodal and trimodal, respectively.
• When each value is present only once there is no mode in the dataset.
• For a unimodal dataset the mode is a measure of the central tendency of the data, and we have the empirical relation:
  mean – mode = 3 × (mean – median)
Central tendency – Midrange
• Midrange: the midrange of a set of values is the arithmetic mean of the largest and the smallest value.
• For example, the midrange of {1, 3, 5, 7, 1001, 2002, 9999} is 5000 (the mean of 1 and 9999).
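A similar sketch for the remaining central-tendency measures; Python's standard statistics module offers equivalents (statistics.median, statistics.multimode):

```python
from collections import Counter

def median(xs):
    """Middle value of the ordered set; mean of the two middle values when n is even."""
    xs = sorted(xs)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def modes(xs):
    """All most-frequent values (a dataset may be multimodal)."""
    counts = Counter(xs)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

def midrange(xs):
    """Arithmetic mean of the smallest and the largest value."""
    return (min(xs) + max(xs)) / 2

data = [1, 3, 5, 7, 1001, 2002, 9999]
print(median(data), midrange(data))  # 7 5000.0
```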
Dispersion (1)
• Range: the range is the difference between the largest and smallest values.
• Example: for {1, 3, 5, 7, 1001, 2002, 9999} the range is 9999 – 1 = 9998.
• kth percentile: the kth percentile is a value xj from the dataset with the property that k percent of the values are less than or equal to xj.
• Example: the median is the 50th percentile.
• The most used percentiles are the median and the 25th and 75th percentiles, also called quartiles (notation: Q1 for the 25th and Q3 for the 75th).
Dispersion (2)
• Interquartile range (IQR): the difference between Q3 and Q1: IQR = Q3 – Q1.
• Potential outliers are values more than 1.5 × IQR below Q1 or above Q3.
• Five-number summary: sometimes the median and the quartiles are not enough to represent the spread of the values; the smallest and largest values must also be considered.
• (Min, Q1, Median, Q3, Max) is called the five-number summary.
Dispersion (3)
Examples:
• For {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}:
  Range = 10; Midrange = 6; Q1 = 3; Q2 = 6; Q3 = 9; IQR = 9 – 3 = 6
• For {1, 3, 3, 4, 5, 6, 6, 7, 8, 8}:
  Range = 7; Midrange = 4.5; Q1 = 3; Q2 = 5.5 [= (5+6)/2]; Q3 = 7; IQR = 7 – 3 = 4
• For {1, 3, 5, 7, 8, 10, 11, 13}:
  Range = 12; Midrange = 7; Q1 = 4; Q2 = 7.5; Q3 = 10.5; IQR = 10.5 – 4 = 6.5
Dispersion (4)
• Standard deviation: the standard deviation of n values (observations) x1, …, xn with mean μ is:
  σ = √[ ((x1 – μ)² + (x2 – μ)² + … + (xn – μ)²) / n ]
• The square of the standard deviation is called the variance.
• The standard deviation measures the spread of the values around the mean value.
• A value of 0 is obtained only when all values are identical.
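A small sketch of these dispersion measures. Note that quartile conventions differ slightly between textbooks and libraries, so library results may not match every hand-computed example exactly; statistics.quantiles with its default 'exclusive' method reproduces the first example above:

```python
import math
import statistics

def std_dev(xs):
    """Population standard deviation: square root of the mean squared deviation."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def five_number_summary(xs):
    """(Min, Q1, Median, Q3, Max); quartiles use the 'exclusive' convention."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)  # method='exclusive' by default
    return min(xs), q1, q2, q3, max(xs)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
mn, q1, med, q3, mx = five_number_summary(data)
print(q1, med, q3, q3 - q1)     # 3.0 6.0 9.0 6.0  (IQR = 6, as in the first example)
print(round(std_dev(data), 3))  # 3.162
```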
Data cleaning
Objectives
• The main objectives of data cleaning are to:
  • replace (or remove) missing values,
  • smooth noisy data,
  • remove, or at least identify, outliers.
• Some attributes are allowed to contain a NULL value. In such cases the value stored in the database (or the attribute value in the dataset) should be an explicit marker such as 'Not applicable' rather than NULL.
Missing values (1)
• Missing values may appear for various reasons: human/hardware/software problems, data not collected (considered unimportant at collection time), data deleted due to inconsistencies, etc.
• There are two solutions for handling missing data:
1. Ignore the data point / example with missing attribute values. If the number of such errors is limited and they do not affect sensitive data, removing them may be a solution.
Missing values (2)
2. Fill in the missing value. This may be done in several ways (a short sketch follows below):
  • Fill in manually. This is not feasible in most cases due to the huge volume of the datasets that must be cleaned.
  • Fill in with a distinct value such as 'not available' or 'unknown'.
  • Fill in with a value measuring the central tendency, for example the attribute mean, median or mode.
  • Fill in with a value measuring the central tendency computed only on a subset (for labeled datasets, for example, only on the examples belonging to the same class).
  • Fill in with the most probable value, if that value can be determined, for example by decision trees, expectation maximization (EM), Bayesian methods, etc.
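A sketch of some of these fill-in strategies, assuming pandas and a small hypothetical dataset with a 'class' label and a numeric 'salary' attribute (both names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["a", "a", "b", "b", "b"],
    "salary": [1000.0, None, 3000.0, None, 5000.0],
})

# Fill with a distinct marker value (the column becomes object-typed).
marked = df["salary"].astype(object).fillna("unknown")

# Fill with the attribute mean (median or mode work the same way).
by_mean = df["salary"].fillna(df["salary"].mean())

# Fill with the class-conditional mean: only examples of the same class are used.
by_class = df.groupby("class")["salary"].transform(lambda s: s.fillna(s.mean()))

print(by_class.tolist())  # [1000.0, 1000.0, 3000.0, 4000.0, 5000.0]
```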
Smooth noisy data
• Noise can be defined as a random error or variance in a measured variable [Han, Kamber 06].
• Wikipedia defines noise as a colloquialism for recognized amounts of unexplained variation in a sample.
• For removing the noise, smoothing techniques may be used:
1. Regression (presented in the first course)
2. Binning
Binning
• Binning can be used for smoothing an ordered set of values. Smoothing is based on neighboring values, in two steps:
  • Partition the ordered data into several bins, each containing the same number of examples (data points).
  • Smooth each bin: the values in a bin are replaced based on some characteristic of the bin: its mean, median or boundaries.
Example
Consider the following ordered data for some attribute: 1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81

Initial bins         | Mean smoothing     | Median smoothing   | Boundary smoothing
1, 2, 4, 6, 9        | 4, 4, 4, 4, 4      | 4, 4, 4, 4, 4      | 1, 1, 1, 9, 9
12, 16, 17, 18, 23   | 17, 17, 17, 17, 17 | 17, 17, 17, 17, 17 | 12, 12, 12, 23, 23
34, 56, 78, 79, 81   | 66, 66, 66, 66, 66 | 78, 78, 78, 78, 78 | 34, 34, 81, 81, 81
Result
The smoothing result is:
• Initial: 1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81
• Using the mean: 4, 4, 4, 4, 4, 17, 17, 17, 17, 17, 66, 66, 66, 66, 66
• Using the median: 4, 4, 4, 4, 4, 17, 17, 17, 17, 17, 78, 78, 78, 78, 78
• Using the bin boundaries: 1, 1, 1, 9, 9, 12, 12, 12, 23, 23, 34, 34, 81, 81, 81
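A minimal sketch reproducing the three smoothing variants (bin size 5 as in the example; for boundary smoothing each value is snapped to the nearer bin edge):

```python
def smooth(values, bin_size=5, method="mean"):
    """Smooth an ordered list by replacing the values in each bin."""
    out = []
    for i in range(0, len(values), bin_size):
        b = values[i:i + bin_size]
        if method == "mean":
            out += [round(sum(b) / len(b))] * len(b)
        elif method == "median":
            out += [b[len(b) // 2]] * len(b)  # exact middle for odd-sized bins
        else:  # "boundaries": snap each value to the nearer bin edge
            lo, hi = b[0], b[-1]
            out += [lo if v - lo <= hi - v else hi for v in b]
    return out

data = [1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81]
print(smooth(data, method="mean"))        # 4 ... 17 ... 66 (bin means, rounded)
print(smooth(data, method="median"))      # 4 ... 17 ... 78
print(smooth(data, method="boundaries"))  # 1, 1, 1, 9, 9, 12, 12, 12, 23, 23, ...
```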
Outliers
• An outlier is an attribute value numerically distant from the rest of the data.
• Outliers may sometimes be correct values: for example, the salary of the CEO of a company may be much bigger than all the other salaries. In most cases, however, outliers are noise and must be handled as such.
• Outliers must be identified and then removed (or replaced, like any other noisy value) because many data mining algorithms are sensitive to them.
• For example, any algorithm using the arithmetic mean (k-means is one of them) may produce erroneous results, because the mean is very sensitive to outliers.
Identifying outliers
• Using the IQR: values more than 1.5 × IQR below Q1 or above Q3 are potential outliers. Boxplots, a method for graphically representing data dispersion, may be used to spot them.
• Using the standard deviation: values that are more than two standard deviations away from the mean of a given attribute are also potential outliers.
• Clustering: after clustering a dataset, some points fall outside every cluster (or far away from any cluster center); these are potential outliers.
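The first two tests combined in a short sketch (same quartile-convention caveat as before):

```python
import statistics

def potential_outliers(xs):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and values more than
    two standard deviations away from the mean."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    by_iqr = [x for x in xs if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    by_std = [x for x in xs if abs(x - mu) > 2 * sigma]
    return by_iqr, by_std

print(potential_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]))  # ([100], [100])
```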
Data integration
Objectives
• Data integration means merging data from different data sources into a coherent dataset. The main activities are:
  • Schema integration
  • Removal of duplicates and redundancy
  • Handling inconsistencies
Schema integration
• The translation of every source schema into the final schema must be identified (the entity identification problem).
• Sub-problems:
  • The same thing is named differently in each data source. Example: the customer id may be called Cust-ID, Cust#, CustID or CID in different sources.
  • Different things are named identically in different sources. Example: for employee data, the attribute 'City' may mean the city of residence in one source and the city of birth in another.
Duplicates
• The same information may be stored in several data sources, and merging them can sometimes produce duplicates:
  • duplicate attributes (the same attribute, under different names, appears twice in the final result), or
  • duplicate instances (the same object appears twice in the final database).
• These duplicates must be identified and removed.
Redundancy
• Some information may be deduced or computed: for example, age may be deduced from birthdate, and annual salary may be computed from the monthly salary and the other bonuses recorded for each employee.
• Redundancy must be removed from the dataset before running the data mining algorithm.
• Note that in existing data warehouses some redundancy is allowed.
Inconsistencies
• Inconsistencies are conflicting values for a set of attributes.
• Example: Birthdate = January 1, 1980 together with Age = 12 is an obvious inconsistency, but we may find other inconsistencies that are not so obvious.
• Detecting inconsistencies requires extra knowledge about the data: for example, the functional dependencies attached to a table schema can be used.
• Available metadata describing the content of the dataset may also help in removing inconsistencies.
Data transformation
Objectives
• Data is transformed and summarized into a form better suited for the data mining process:
  • Normalization
  • New attribute construction
  • Summarization using aggregate functions
Normalization
• All attribute values are scaled to fit a specified range: 0 to 1, -1 to 1 or, generally, |v| <= r where r is a given positive value.
• Normalization is needed when some attributes appear more important only because the range of their values is bigger.
• Example: the Euclidean distance between A(0.5, 101) and B(0.01, 2111) is ≈ 2010, determined almost exclusively by the second dimension.
Normalization
Normalization can be achieved using (a short sketch follows below):
• Min-max normalization:
  vnew = (v – vmin) / (vmax – vmin)
  For positive values the formula becomes vnew = v / vmax.
• z-score normalization (σ is the standard deviation):
  vnew = (v – vmean) / σ
• Decimal scaling:
  vnew = v / 10^n
  where n is the smallest integer such that all scaled values become, in absolute value, smaller than the range r (for r = 1, all new values satisfy |vnew| <= 1).
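A minimal sketch of the three normalization formulas in plain Python (with r = 1 in the decimal-scaling case):

```python
import math

def min_max(xs):
    """Scale values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center on the mean, divide by the standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

def decimal_scaling(xs):
    """Divide by the smallest power of 10 bringing every |x| to 1 or below."""
    n = math.ceil(math.log10(max(abs(x) for x in xs)))
    return [x / 10 ** n for x in xs]

print(decimal_scaling([0.5, 101, -2111]))  # [5e-05, 0.0101, -0.2111]
```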
Feature construction
• New attribute construction, also called feature construction, means building new attributes based on the values of existing ones.
• Example: if the dataset contains an attribute 'Color' with only three distinct values {Red, Green, Blue}, then three attributes 'Red', 'Green' and 'Blue' may be constructed, where exactly one of them equals 1 (based on the value of 'Color') and the other two equal 0.
• Another example: use a set of rules, decision trees or other tools to build new attribute values from existing ones. The new attributes will contain the class labels attached by the rules / decision tree / labeling tool used.
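A sketch of the 'Color' example; pandas users would typically call pd.get_dummies for the same effect:

```python
def one_hot(values, categories=("Red", "Green", "Blue")):
    """Replace each categorical value by one 0/1 indicator attribute per category."""
    return [{c: int(v == c) for c in categories} for v in values]

print(one_hot(["Red", "Blue"]))
# [{'Red': 1, 'Green': 0, 'Blue': 0}, {'Red': 0, 'Green': 0, 'Blue': 1}]
```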
Summarization
• In this step, aggregate functions may be used to add summaries to the data.
• Examples: sums for daily, monthly and annual sales; counts and averages for the number of customers or transactions; and so on.
• All these summaries are used in the 'slice and dice' process when the data is stored in a data warehouse.
• The result is a data cube, and each piece of summary information is attached to a level of granularity.
Data reduction
Objectives
• Not all information produced by the previous steps is needed for a given data mining process.
• Reducing the data volume by keeping only the necessary attributes leads to a better representation of the data and reduces the time needed for data analysis.
Reduction methods (1)
Methods that may be used for data reduction (see [Han, Kamber 06]):
• Data cube aggregation, already discussed.
• Attribute selection: keep only the relevant attributes. This can be done by:
  • stepwise forward selection (start with an empty set and add attributes),
  • stepwise backward elimination (start with all attributes and remove them one by one),
  • a combination of forward selection and backward elimination,
  • decision tree induction: after building the decision tree, only the attributes used in its decision nodes are kept.
Reduction methods (2)
• Dimensionality reduction: encoding mechanisms are used to reduce the dataset size or to compress the data.
• A popular method is Principal Component Analysis (PCA): given N data vectors with n dimensions, find k <= n orthogonal vectors (called principal components) that can be used to represent the data.
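A minimal PCA sketch via the singular value decomposition of the centered data matrix (NumPy only; scikit-learn's sklearn.decomposition.PCA packages the same idea with more options):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (N examples, n attributes) onto k principal components."""
    Xc = X - X.mean(axis=0)                    # center each attribute on its mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                       # (N, k) reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 vectors with n = 5 dimensions
print(pca(X, 2).shape)                         # (100, 2)
```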
PCA example
[Figure: PCA for a multivariate Gaussian distribution centered at (1, 3); source: http://2011.igem.org/Team:USTC-Software/parameter]
Reduction methods (3)
• Numerosity reduction: the data is replaced by smaller representations such as parametric models (only the model parameters are stored in this case) or nonparametric methods: clustering, sampling, histograms.
• Discretization and concept hierarchy generation, discussed in the following section.
Data discretization
Objectives
• Many data mining algorithms cannot use continuous attributes. Replacing continuous values with discrete ones is called discretization.
• Even for discrete attributes, it is better to have a reduced number of values, leading to a more compact representation of the data. This may be achieved with concept hierarchies.
Discretization (1)
• Discretization means reducing the number of values of a given continuous attribute by dividing its range into intervals.
• Each interval is labeled, and each attribute value is replaced with the label of its interval.
• Some of the most popular discretization methods are:
1. Binning: equi-width or equi-frequency bins may be used (see the sketch after this list). Values in the same bin receive the same label.
Discretization (2)
Popular discretization methods (cont.):
2. Histograms: like binning, histograms partition the values of an attribute into buckets. Each bucket has a different label, and the labels replace the values.
3. Entropy-based intervals: each attribute value is considered a potential split point (between two intervals), and the information gain (the reduction of entropy obtained by splitting at that point) is computed for it. The value with the greatest information gain is picked, so intervals are constructed in a top-down manner.
4. Cluster analysis: after clustering, all values in the same cluster are replaced with the same label (the cluster id, for example).
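A sketch of the two binning flavors mentioned in item 1, with bin indices used as labels:

```python
def equi_width_labels(xs, k):
    """Label each value with the index of its equal-width interval."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k or 1.0               # guard against a constant attribute
    return [min(int((x - lo) / width), k - 1) for x in xs]

def equi_frequency_labels(xs, k):
    """Label the values so that each bin holds roughly the same number of them."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [0] * len(xs)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(xs)
    return labels

data = [1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81]
print(equi_width_labels(data, 3))      # wide value range -> uneven bin counts
print(equi_frequency_labels(data, 3))  # five values per bin, as in the binning example
```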
Concept hierarchies
• Using a concept hierarchy to perform discretization means replacing low-level concepts (or values) with higher-level concepts.
• Example: replace the numerical value for age with young, middle-aged or old.
• For numerical values, discretization and concept hierarchies amount to the same thing.
Concept hierarchies
For categorical data the goal is to replace a bigger set of values with a smaller one (categorical data are discrete by definition):
• Manually define a partial order for a set of attributes. For example, the set {Street, City, Department, Country} is partially ordered: Street ⊆ City ⊆ Department ⊆ Country. In that case we can construct a 'Localization' attribute at any level of this hierarchy by using the n rightmost attributes (n = 1 .. 4).
• Manually specify high-level concepts for the sets of low-level attribute values associated with them, for example {Muntenia, Oltenia, Dobrogea} ⊆ Tara_Romaneasca (a small sketch follows below).
• Automatically identify a partial order between attributes, based on the fact that high-level concepts are represented by attributes with a smaller number of distinct values than low-level ones.
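A tiny sketch of the second (manual) approach, using the geographic example above as a lookup table:

```python
# Mapping built manually from the example above.
region_of = {
    "Muntenia": "Tara_Romaneasca",
    "Oltenia": "Tara_Romaneasca",
    "Dobrogea": "Tara_Romaneasca",
}

def generalize(values, mapping):
    """Replace each low-level value with its high-level concept, when one is known."""
    return [mapping.get(v, v) for v in values]

print(generalize(["Oltenia", "Dobrogea"], region_of))
# ['Tara_Romaneasca', 'Tara_Romaneasca']
```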
Summary
This second course presented:
• Data types: categorical vs. numerical, the four scales (nominal, ordinal, interval and ratio) and binary data.
• A short presentation of the data preprocessing steps and of some ways to extract important characteristics of the data: central tendency (mean, mode, median, etc.) and dispersion (range, IQR, five-number summary, standard deviation and variance).
• A description of every preprocessing step: cleaning, integration, transformation, reduction and discretization.
• Next week: Association rules and sequential patterns.
References
[Han, Kamber 06] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006, pp. 47-101.
[Stevens 46] Stevens, S. S., On the Theory of Scales of Measurement. Science, June 1946, 103 (2684): 677-680.
[Liu 11] Bing Liu, 2011. CS 583 Data Mining and Text Mining course notes, http://www.cs.uic.edu/~liub/teach/cs583-fall-11/cs583.html