Data Mining and Data Warehousing
Two Marks Questions with Answers
Unit IV
1) Define classification.
Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the second step, the model is used for classification.
2) Define training data set.
The data tuples analyzed to build the model collectively form the training data set. Individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population.
3) Define accuracy of a model.
The accuracy of a model on a given test set is the percentage of test-set samples that are correctly classified by the model.
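This measure can be sketched in a few lines; the labels below are hypothetical:

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical test-set labels and model predictions.
y_true = ["yes", "no", "yes", "yes", "no"]
y_pred = ["yes", "no", "no", "yes", "no"]
print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```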
4) Define prediction.
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have. Classification and regression are the two major types of prediction.
5) Differentiate classification and prediction.
Classification is used to predict discrete or nominal values, whereas prediction is used to predict continuous values. Both are forms of supervised learning, since the class labels or target values of the training samples are known; this is in contrast to unsupervised learning (clustering), where they are not.
6) List the applications of classification and prediction.
Applications include credit approval, medical diagnosis, performance prediction, and selective marketing.
7) List the preprocessing steps involved in preparing the data for classification and prediction.
Data cleaning, relevance analysis, and data transformation.
8) Define normalization.
Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
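A minimal sketch of min-max normalization (the income values are illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall in [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

# Illustrative attribute values (e.g. annual incomes).
incomes = [12000, 35000, 58000, 98000]
print(min_max_normalize(incomes))  # smallest maps to 0.0, largest to 1.0
```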
9) What is a decision tree?
A decision tree is a flowchart-like structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in the tree is called the root node.
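Such a tree can be represented as nested dictionaries and traversed from the root; the attribute names and classes below are illustrative, not from the source:

```python
# Internal nodes test an attribute; branches are test outcomes; leaves are classes.
tree = {
    "attribute": "age",
    "branches": {
        "youth":       {"attribute": "student",
                        "branches": {"yes": "buys", "no": "does_not_buy"}},
        "middle_aged": "buys",
        "senior":      {"attribute": "credit",
                        "branches": {"fair": "buys", "excellent": "does_not_buy"}},
    },
}

def classify(node, sample):
    """Follow branches from the root until a leaf (a class label) is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

print(classify(tree, {"age": "youth", "student": "yes"}))  # buys
```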
10) Define tree pruning.
When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
11) Define information gain.
The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness, or "impurity", in the partitions.
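The measure can be sketched as the entropy of the class labels before the split minus the weighted entropy of each partition after it (the toy "windy" dataset is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (in bits) needed to classify a sample."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, attribute, labels):
    """Entropy before the split minus the weighted entropy of each partition."""
    n = len(labels)
    partitions = {}
    for value, label in zip(samples[attribute], labels):
        partitions.setdefault(value, []).append(label)
    after = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - after

# A perfect split: the attribute separates the classes completely.
samples = {"windy": ["no", "no", "yes", "yes"]}
labels = ["play", "play", "stay", "stay"]
print(information_gain(samples, "windy", labels))  # 1.0
```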
12) List the two common approaches for tree pruning.
Prepruning approach - a tree is "pruned" by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset's samples, or the probability distribution of those samples.
Postpruning approach - removes branches from a "fully grown" tree. A tree node is pruned by removing its branches; the lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches.
13) List the problems in decision tree induction and how they can be prevented.
Fragmentation, repetition, and replication. Attribute construction is an approach for preventing these problems, where the limited representation of the given attributes is improved by creating new attributes based on the existing ones.
14) What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases. The simple Bayesian classifier, known as the naïve Bayesian classifier, is comparable in performance with decision tree and neural network classifiers.
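A minimal categorical naïve Bayes sketch, scoring each class C by P(C) x Π P(x_i | C) (the tiny weather dataset is illustrative, and Laplace smoothing is omitted for brevity):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal categorical naive Bayes: score(C) = P(C) * product of P(x_i | C)."""

    def fit(self, rows, labels):
        self.n = len(labels)
        self.priors = Counter(labels)                 # class -> count
        self.counts = defaultdict(Counter)            # (attr index, class) -> value counts
        for row, c in zip(rows, labels):
            for i, v in enumerate(row):
                self.counts[(i, c)][v] += 1
        return self

    def predict(self, row):
        def score(c):
            p = self.priors[c] / self.n               # prior P(C)
            for i, v in enumerate(row):
                p *= self.counts[(i, c)][v] / self.priors[c]  # P(x_i | C)
            return p
        return max(self.priors, key=score)

# Illustrative training data: (outlook, temperature) -> play?
nb = NaiveBayes().fit(
    [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")],
    ["no", "no", "yes", "yes"],
)
print(nb.predict(("sunny", "hot")))  # no
```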
15) Define Bayesian belief networks.
Bayesian belief networks are graphical models which allow the representation of dependencies among subsets of attributes. They can also be used for classification.
16) List the two components of a belief network.
A directed acyclic graph and a conditional probability table.
17) Define directed acyclic graph.
In a directed acyclic graph, each node represents a random variable and each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable is conditionally independent of its non-descendants in the graph, given its parents. The variables may be discrete or continuous-valued.
18) Define Conditional Probability Table (CPT).
The CPT for a variable Z specifies the conditional distribution P(Z | parents(Z)), where parents(Z) are the parents of Z.
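A minimal sketch for a two-node network, Rain -> WetGrass (the node names and probabilities are illustrative): the CPT for WetGrass lists P(WetGrass | Rain) for every combination of parent values, and the chain rule gives the joint distribution.

```python
# Prior for the parentless node Rain, and the CPT for WetGrass.
p_rain = {True: 0.2, False: 0.8}
cpt_wet = {  # parents(WetGrass) = (Rain,): one row per parent-value combination
    (True,):  {True: 0.9, False: 0.1},
    (False,): {True: 0.1, False: 0.9},
}

def joint(rain, wet):
    """Chain rule: P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)."""
    return p_rain[rain] * cpt_wet[(rain,)][wet]

print(joint(True, True))  # 0.2 * 0.9 = 0.18
```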
19) List the methods used for classification based on concepts from association rule mining.
ARCS (Association Rule Clustering System), associative classification, and CAEP (Classification by Aggregating Emerging Patterns).
20) Explain ARCS.
ARCS mines association rules of the form A_quan1 ^ A_quan2 => A_cat, where A_quan1 and A_quan2 are tests on quantitative attribute ranges and A_cat assigns a class label from a categorical attribute of the given training data. Association rules are plotted on a 2D grid. The algorithm scans the grid, searching for rectangular clusters of rules. Adjacent ranges of the quantitative attributes occurring within a rule cluster may be combined. The clustered association rules generated by ARCS are applied to classification.
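The range-combining step can be sketched in one dimension (a simplification of ARCS's 2D grid clustering; the bins and labels are illustrative): adjacent quantitative bins that predict the same class are merged into a single rule covering the combined range.

```python
def merge_adjacent(rule_cells):
    """Merge adjacent quantitative bins that predict the same class
    into one rule covering the combined range.

    rule_cells: list of ((low, high), class_label), sorted by range.
    """
    merged = []
    for (low, high), label in rule_cells:
        # Same class and contiguous range: extend the previous rule.
        if merged and merged[-1][1] == label and merged[-1][0][1] == low:
            merged[-1] = ((merged[-1][0][0], high), label)
        else:
            merged.append(((low, high), label))
    return merged

cells = [((0, 10), "A"), ((10, 20), "A"), ((20, 30), "B")]
print(merge_adjacent(cells))  # [((0, 20), 'A'), ((20, 30), 'B')]
```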