Data Mining and Data Warehousing
Two Marks Questions with Answers
Unit II
- Why preprocess the data?
Data that are to be analyzed by data mining techniques are often incomplete, noisy, and inconsistent. These are commonplace properties of large real-world databases and data warehouses. To remove these errors, the data must be preprocessed.
- What are the data preprocessing techniques?
Data preprocessing techniques are:
Data cleaning - removes noise and corrects inconsistencies in the data.
Data integration - merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube.
Data transformation - such as normalization, improves the accuracy and efficiency of mining algorithms involving distance measurements.
Data reduction - reduces the data size by aggregating, eliminating redundant features, or clustering.
- Give the strategies involved in data reduction.
Data reduction obtains a reduced representation of the data set that is much smaller in volume but produces the same analytical results. The strategies involved are:
Data aggregation - e.g., building a data cube.
Dimension reduction - e.g., removing irrelevant attributes through correlation analysis.
Data compression - e.g., using encoding schemes such as minimum length encoding or wavelets.
- What is noise?
Noise is a random error or variance in a measured variable.
- Give the various data smoothing techniques.
Binning, clustering, combined computer and human inspection, and regression.
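For instance, smoothing by bin means sorts the values, partitions them into equal-depth bins, and replaces each value with the mean of its bin. A minimal sketch in Python (the price values are only illustrative):

# Smoothing by bin means: sort the values, split them into equal-depth bins,
# and replace every value with the mean of its bin.
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)
    depth = max(1, len(data) // n_bins)       # number of values per bin
    smoothed = []
    for i in range(0, len(data), depth):
        bin_vals = data[i:i + depth]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative price values
print(smooth_by_bin_means(prices, 3))         # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]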
- Define data integration.
Data integration combines data from multiple sources into a coherent data store. These sources may include multiple databases, data cubes, or flat files.
- Give the issues to be considered during data integration.
Schema integration, redundancy, and detection and resolution of data value conflicts.
- Define redundancy.
An attribute is said to be redundant if it can be derived from another table. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
- Explain data transformation.
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following: smoothing, aggregation, generalization, normalization, and attribute construction.
- Define normalization and give the methods used for normalization.
In normalization, attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0. The methods used for normalization are min-max normalization, z-score normalization, and normalization by decimal scaling.
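As a rough sketch, min-max and z-score normalization can be computed as follows (the income values are made up for illustration):

# Min-max normalization rescales values into a target range (default 0.0 to 1.0);
# z-score normalization centres the values on the mean in units of standard deviation.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [12000, 73600, 98000]
print(min_max(incomes))   # values rescaled into [0.0, 1.0]
print(z_score(incomes))   # values centred on 0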
- Define data reduction.
It is used to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data, i.e., mining on the reduced set should be more efficient yet produce the same analytical results.
- Give the strategies used for data reduction.
Data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization and concept hierarchy generation.
- Explain data cube aggregation.
Data cubes store multidimensional aggregated information. Each cell holds an aggregate data value corresponding to a data point in multidimensional space.
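A minimal sketch of this kind of aggregation, rolling hypothetical quarterly sales tuples up to per-year cells:

# Rolling quarterly sales tuples up to per-year cells; each resulting
# (year) cell holds a single aggregate value, as in a data cube.
from collections import defaultdict

sales = [("2023", "Q1", 224), ("2023", "Q2", 408),
         ("2024", "Q1", 350), ("2024", "Q2", 512)]

by_year = defaultdict(int)
for year, quarter, amount in sales:
    by_year[year] += amount

print(dict(by_year))    # {'2023': 632, '2024': 862}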
- Explain dimensionality reduction.
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Leaving out relevant attributes or keeping irrelevant attributes may result in poor quality of the discovered patterns.
- How is dimensionality reduction achieved?
Dimensionality reduction is achieved using attribute subset selection. The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
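One simple heuristic in this spirit (not the class-distribution test itself) is a correlation-based filter that drops an attribute when it is strongly correlated with an attribute already kept. The attribute names, values, and the 0.9 threshold below are illustrative assumptions:

# Keep an attribute only if it is not highly correlated with any attribute kept so far.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_attributes(columns, threshold=0.9):
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, kv)) < threshold for kv in kept.values()):
            kept[name] = values
    return list(kept)

data = {
    "age":        [23, 35, 47, 52, 61],
    "age_months": [276, 420, 564, 624, 732],   # perfectly correlated with age
    "income":     [18, 42, 21, 40, 36],
}
print(select_attributes(data))    # ['age', 'income']; age_months dropped as redundant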
- Define data compression.
In data compression, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data compression technique used is called lossless. If only an approximation of the original data can be reconstructed, it is called lossy.
- Give examples for lossy data compression.
Wavelet transforms and principal component analysis.
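A compact sketch of principal component analysis used for lossy compression, assuming NumPy is available; the data matrix is illustrative:

# Project the data onto the top-k principal components, then reconstruct
# an approximation of the original matrix (the lossy part is keeping only k components).
import numpy as np

def pca_compress(X, k):
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal axes
    scores = Xc @ Vt[:k].T            # reduced representation (n x k)
    approx = scores @ Vt[:k] + mean   # lossy reconstruction (n x d)
    return scores, approx

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
scores, approx = pca_compress(X, k=1)
print(scores.shape)          # (5, 1): the compressed data
print(np.round(approx, 2))   # approximate reconstruction of X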
- Define Discrete Wavelet transforms.
DWT is a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector D' of wavelet coefficients. The two vectors are of the same length. It generally achieves better lossy compression than the discrete Fourier transform for the same number of coefficients.
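A minimal sketch of one level of an (unnormalized) Haar wavelet transform, which produces pairwise averages and differences; dropping or zeroing the small detail coefficients is what gives the lossy compression. The input vector is illustrative:

# One level of the Haar transform: pairwise averages followed by pairwise differences.
def haar_step(d):
    averages = [(d[i] + d[i + 1]) / 2 for i in range(0, len(d), 2)]
    details  = [(d[i] - d[i + 1]) / 2 for i in range(0, len(d), 2)]
    return averages + details      # same length as the input vector D

D = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_step(D))                # [2.0, 1.0, 4.0, 4.0, 0.0, -1.0, -1.0, 0.0]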
- Define numerosity reduction.
Numerosity reduction is a technique used to reduce the data volume by choosing alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.
- Give examples for parametric and nonparametric methods.
Parametric - regression and log-linear models.
Nonparametric - histograms, clustering, and sampling.
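For example, simple random sampling without replacement is a nonparametric numerosity-reduction technique: a small random subset of tuples stands in for the full data set. A minimal sketch (the data are hypothetical):

# Keep s randomly chosen tuples, without replacement, in place of the full data set.
import random

def srswor(tuples, s):
    return random.sample(tuples, s)

data = list(range(1, 101))      # 100 hypothetical tuples
print(srswor(data, 10))         # a reduced representation of size 10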