Sunday, December 16, 2012

Data Mining and Data Warehousing, Unit II, Two Marks Questions with Answers



Data Mining and Data Warehousing
Two Marks Questions with Answers

Unit II
  1. Why preprocess the data?
Real-world data to be analyzed by data mining techniques is typically incomplete, noisy, and inconsistent; these are commonplace properties of large real-world databases and data warehouses. Preprocessing corrects these errors before mining.
  2. What are the data preprocessing techniques?
The data preprocessing techniques are:
Data cleaning - removes noise and corrects inconsistencies in the data.
Data integration - merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube.
Data transformation - operations such as normalization improve the accuracy and efficiency of mining algorithms involving distance measurements.
Data reduction - reduces the data size by aggregating, eliminating redundant features, or clustering.
  3. Give the strategies involved in data reduction.
Data reduction obtains a reduced representation of the data set that is much smaller in volume but produces the same analytical results. The strategies involved are:
Data aggregation - e.g., building a data cube.
Dimension reduction - e.g., removing irrelevant attributes through correlation analysis.
Data compression - e.g., using encoding schemes such as minimum length encoding or wavelets.
  4. What is noise?
Noise is a random error or variance in a measured variable.
  5. Give the various data smoothing techniques.
Binning, clustering, combined computer and human inspection, and regression.
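For example, smoothing by bin means sorts the values, partitions them into equal-frequency bins, and replaces each value by its bin's mean. A minimal sketch in Python, using a common textbook price series for illustration:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning: sort, split into bins, replace each value by its bin mean."""
    data = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(data, n_bins)   # roughly equal-frequency partitions
    return [np.full(len(b), b.mean()) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
for b in smooth_by_bin_means(prices, 3):
    print(b)   # bins become [9, 9, 9], [22, 22, 22], [29, 29, 29]
```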
  6. Define data integration.
Data integration combines data from multiple sources into a coherent data store. These sources may include multiple databases, data cubes, or flat files.
  7. Give the issues to be considered during data integration.
Schema integration, redundancy, and detection and resolution of data value conflicts.
  8. Define redundancy.
An attribute is said to be redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
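Redundancy between numeric attributes is commonly detected with correlation analysis. A quick sketch, with attribute values invented purely for illustration:

```python
import numpy as np

# Two hypothetical numeric attributes; the values are invented for illustration.
annual_salary = np.array([30_000, 45_000, 52_000, 61_000, 78_000])
monthly_tax   = np.array([   500,    760,    880,  1_020,  1_310])

# A correlation coefficient near +1 or -1 suggests one attribute is
# (nearly) derivable from the other, i.e. redundant.
r = np.corrcoef(annual_salary, monthly_tax)[0, 1]
print(f"correlation = {r:.3f}")
```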
  9. Explain data transformation.
In data transformation, data are transformed or consolidated into forms appropriate for mining. Data transformation can involve smoothing, aggregation, generalization, normalization, and attribute construction.
  10. Define normalization and give the methods used for normalization.
In normalization, attribute data are scaled to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0. The methods used for normalization are min-max normalization, z-score normalization, and normalization by decimal scaling.
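The three methods follow directly from their definitions; a minimal sketch (the sample values are made up):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    # v' = (v - mean) / standard deviation
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
print(min_max(values))          # scaled into [0.0, 1.0]
print(z_score(values))          # mean 0, standard deviation 1
print(decimal_scaling(values))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```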
  11. Define data reduction.
Data reduction obtains a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data, i.e., mining on the reduced set should be more efficient yet produce the same (or almost the same) analytical results.
  12. Give the strategies used for data reduction.
Data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization and concept hierarchy generation.
  13. Explain data cube aggregation.
Data cubes store multidimensional, aggregated information. Each cell holds an aggregate data value corresponding to a data point in multidimensional space.
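As a sketch, quarterly sales can be aggregated up to yearly totals, reducing the cube's size; the pandas library and the figures here are assumptions for illustration:

```python
import pandas as pd

# Hypothetical quarterly sales; the figures are invented for illustration.
sales = pd.DataFrame({
    "year":    [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 420, 380, 600],
})

# Aggregating away the quarter dimension yields a smaller cube of yearly totals.
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)
```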
  14. Explain dimensionality reduction.
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Dimensionality reduction reduces the data set size by removing such attributes or dimensions. Leaving out relevant attributes or keeping irrelevant attributes may result in poor-quality discovered patterns.
  15. How is dimensionality reduction achieved?
Dimensionality reduction is achieved using attribute subset selection. The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
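A typical heuristic is stepwise forward selection: start with an empty set and repeatedly add the attribute that most improves a relevance score. A minimal sketch, assuming a caller-supplied score(subset) function; everything here is a hypothetical illustration, not a fixed API:

```python
def forward_selection(attributes, score, min_gain=1e-6):
    """Greedy stepwise forward selection over a list of attribute names.

    `score(subset)` is assumed to return a higher value for subsets that
    better preserve the class distribution (e.g., cross-validated accuracy).
    """
    selected, best = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        top_score, top_attr = max((score(selected + [a]), a) for a in remaining)
        if top_score - best < min_gain:   # stop when no attribute helps
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected
```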
  16. Define data compression.
In data compression, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the technique is called lossless; if only an approximation of the original data can be reconstructed, it is called lossy.
  17. Give examples for lossy data compression.
Wavelet transforms and principal component analysis (PCA).
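A sketch of PCA as lossy compression using scikit-learn (assumed installed): keeping fewer components than attributes gives a compact representation from which only an approximation can be reconstructed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 tuples, 10 attributes (invented data)

pca = PCA(n_components=3)             # keep only 3 principal components
compressed = pca.fit_transform(X)     # reduced representation: 100 x 3
approx = pca.inverse_transform(compressed)   # lossy reconstruction: 100 x 10

print("reconstruction error:", np.mean((X - approx) ** 2))
```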
  18. Define the discrete wavelet transform.
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it into a numerically different vector D' of wavelet coefficients. The two vectors are of the same length. It achieves better lossy compression because a compressed approximation can retain only a small fraction of the strongest coefficients.
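A minimal sketch using the PyWavelets library (an assumption; any DWT implementation would do), zeroing the weakest detail coefficients to obtain a compressed approximation:

```python
import numpy as np
import pywt

D = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# One level of the Haar DWT: approximation and detail coefficients.
cA, cD = pywt.dwt(D, "haar")

# Lossy compression: zero out the weakest detail coefficients.
cD[np.abs(cD) < 1.5] = 0.0
D_approx = pywt.idwt(cA, cD, "haar")
print(D_approx)   # close to D, reconstructed from fewer nonzero coefficients
```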
  19. Define numerosity reduction.
Numerosity reduction is a technique used to reduce the data volume by choosing alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.
  20. Give examples for parametric and nonparametric methods.
Parametric - regression and log-linear models.
Nonparametric - histograms, clustering, sampling.
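A sketch of two nonparametric reductions with NumPy (the data are invented): an equal-width histogram summarizes the values with a few bucket counts, and simple random sampling without replacement keeps a small representative subset:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(1, 31, size=1_000)   # invented price data

# Histogram: replace 1,000 values with 6 equal-width bucket counts.
counts, edges = np.histogram(prices, bins=6)
print(counts, edges)

# Simple random sample without replacement (SRSWOR), 5% of the data.
sample = rng.choice(prices, size=50, replace=False)
print(sample.mean(), prices.mean())        # sample roughly preserves statistics
```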
