Sunday, July 14, 2013

DATA WAREHOUSING AND DATA MINING, 2 Mark, Unit III



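1. Define Clustering?
Clustering is the process of grouping physical or abstract data objects into
clusters, so that objects within a cluster are similar to one another and dissimilar to
objects in other clusters.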

2. What do you mean by Cluster Analysis?
Cluster analysis is the process of analyzing the various clusters in order to organize
the different objects into meaningful and descriptive groups.

3. What are the fields in which clustering techniques are used?
• Clustering is used in biology to derive plant and animal taxonomies.
• Clustering is used in business to help marketers discover distinct groups of
their customers and characterize each customer group on the basis of purchasing
patterns.
• Clustering is used in the identification of groups of automobile insurance
policyholders.
• Clustering is used in the identification of groups of houses in a city on
the basis of house type, cost and geographical location.
• Clustering is used to classify documents on the web for information
discovery.

4. What are the requirements of cluster analysis?
The basic requirements of cluster analysis are:
• Dealing with different types of attributes
• Dealing with noisy data
• Supporting constraints on clustering
• Discovering clusters of arbitrary shape
• Handling high dimensionality
• Insensitivity to the order of input data
• Interpretability and usability
• Minimal domain knowledge needed to determine input parameters
• Scalability

5. What are the different types of data used for cluster analysis?
The different types of data used for cluster analysis are interval scaled, binary,
nominal, ordinal and ratio scaled data.

6. What are interval scaled variables?
Interval scaled variables are continuous measurements on a roughly linear scale.
Examples include height and weight, weather temperature, and coordinates.
The dissimilarity between objects described by interval scaled variables is commonly
computed using the Euclidean distance or the more general Minkowski distance.
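
The distance computation mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the sample objects and their values are assumptions made for the example. The Minkowski distance of order p reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Two hypothetical objects described by (height in cm, weight in kg).
person_a = (170.0, 65.0)
person_b = (173.0, 69.0)

euclidean = minkowski_distance(person_a, person_b, p=2)  # sqrt(3^2 + 4^2)
manhattan = minkowski_distance(person_a, person_b, p=1)  # 3 + 4
```

In practice the variables are usually standardized first, because a change of measurement unit (for example cm to m) would otherwise change the distances.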
  
7. Define Binary variables? What are the two types of binary variables?
A binary variable has only two states, 0 and 1. When the state is 0 the variable is
absent, and when the state is 1 the variable is present. There are two types of binary
variables, symmetric and asymmetric. A binary variable is symmetric if both of its states
are equally valuable and carry the same weight. It is asymmetric if the two states are
not equally important, for example the positive and negative outcomes of a disease test.

8. Define nominal, ordinal and ratio scaled variables?
A nominal variable is a generalization of the binary variable: it can take more
than two states. For example, a nominal variable color may have four states: red,
green, yellow, or black. The states of a nominal variable can be denoted by letters,
symbols or integers, and they have no specific order.
An ordinal variable also has more than two states, but these states are ordered
in a meaningful sequence.
A ratio scaled variable makes positive measurements on a non-linear scale, such
as an exponential scale, approximately following the formula
$Ae^{Bt}$ or $Ae^{-Bt}$
where A and B are positive constants.

9. What do you mean by partitioning method?
In a partitioning method, a partitioning algorithm arranges all the objects into
a number of partitions, where the number of partitions is no greater than the number of
objects. Each partition represents a cluster. Two commonly used partitioning methods are
k-means and k-medoids.
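
As an illustration of a partitioning method, the following is a minimal k-means sketch in plain Python. It assumes numeric tuples as objects, squared Euclidean distance, and a deliberately simple seeding (the first k objects become the initial centroids). The function name and the sample data are made up for the example; real implementations use better initialization.

```python
def k_means(points, k, max_iter=100):
    """Plain k-means on a list of numeric tuples; returns (centroids, labels)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of objects")
    # Assumption: seed the centroids with the first k objects (simple, not optimal).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each object joins the partition with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed, so the partition is stable
        labels, centroids = new_labels, new_centroids
    return centroids, labels

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
```

Here `labels[i]` is the index of the partition that holds `data[i]`. k-medoids differs in that each cluster is represented by one of its actual objects rather than by the mean.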

10. Define CLARA and CLARANS?
Clustering LARge Applications is called CLARA. Instead of using the whole data
set, CLARA draws representative samples and applies k-medoids to each sample. The
efficiency of CLARA depends upon the size of the representative data set. CLARA cannot
find the best clustering if none of the selected representative data sets contains the
best k medoids.
To overcome this drawback a new algorithm, Clustering Large Applications based
upon RANdomized Search (CLARANS), was introduced. CLARANS works like CLARA, but
instead of confining the search to a fixed sample it draws a random sample of neighbors
at each step of the search.

11. What is Hierarchical method?
A hierarchical method groups all the objects into a tree of clusters that are arranged
in a hierarchical order. It works with either a bottom-up or a top-down approach.

12. Differentiate Agglomerative and Divisive Hierarchical Clustering?
Agglomerative hierarchical clustering works on the bottom-up approach.
In the agglomerative method, each object starts as its own cluster. These single-object
clusters are merged to make larger clusters, and the process of merging continues until
all the clusters are merged into one big cluster that consists of all the objects.
Divisive hierarchical clustering works on the top-down approach. In this
method all the objects start within one big cluster, and the large cluster is
repeatedly divided into smaller clusters until each cluster has a single object.
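
The bottom-up process can be sketched as follows. This is a minimal illustration that assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance. The text above does not prescribe a linkage, so that choice, the function name and the sample points are assumptions.

```python
def agglomerative(points, target_clusters=1):
    """Bottom-up clustering with single linkage; returns (clusters, merge history)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Each object starts as its own cluster (stored as a list of point indices).
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[a], points[b]) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters, history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
three_clusters, merges = agglomerative(points, target_clusters=3)
one_cluster, full_history = agglomerative(points)
```

Stopping at `target_clusters=1` reproduces the full merge sequence described above. A divisive method would run the same tree in the opposite direction, starting from one cluster and splitting.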

13. What is CURE?
Clustering Using REpresentatives is called CURE. Most clustering algorithms
favor clusters that are spherical and similar in size. CURE overcomes this
limitation and is more robust with respect to outliers.

14. Define Chameleon method?
Chameleon is another hierarchical clustering method that uses dynamic modeling.
Chameleon was introduced to overcome the drawbacks of the CURE method. In this method
two clusters are merged if the interconnectivity and closeness between them are high
relative to the internal interconnectivity and closeness of the objects within each cluster.

15. Define Density based method?
Density based methods can discover clusters of arbitrary shape. In a density-based
method, clusters are formed from regions where the density of the objects is
high, separated by regions of low density.

16. What is DBSCAN?
Density-Based Spatial Clustering of Applications with Noise is called DBSCAN.
DBSCAN is a density based clustering method that grows regions of sufficiently high
density into clusters of arbitrary shape and size. DBSCAN defines a cluster as a
maximal set of density-connected points.
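
The idea of growing dense regions into clusters can be sketched as below. This is a simplified, quadratic-time illustration and not the original implementation. It assumes the two usual DBSCAN parameters, a neighborhood radius (`eps`) and a minimum neighbor count (`min_pts`), and it counts a point as its own neighbor. The function name and the sample data are made up for the example.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # All points (including i itself) within distance eps of point i.
        return [j for j in range(len(points))
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]

    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two dense squares become clusters 0 and 1, while the isolated point at (50, 50) is labeled as noise.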

17. What do you mean by Grid Based Method?
In this method objects are represented by a multi-resolution grid data structure.
The object space is quantized into a finite number of cells, and the collection of cells
forms the grid structure. The clustering operations are performed on that grid
structure. This method is widely used because its processing time is fast: it is
typically independent of the number of objects and depends only on the number of cells.
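
The quantization step can be illustrated with a short Python sketch. It assumes two-dimensional points and square cells of a fixed size; the function names, the cell size and the density threshold are made up for the example. After one pass over the objects, all later work touches only the cells.

```python
from collections import Counter

def grid_cell_counts(points, cell_size):
    """Quantize 2-D points into square cells and count the objects per cell."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

def dense_cells(counts, min_count):
    """Cells holding at least min_count objects, in sorted order."""
    return sorted(cell for cell, n in counts.items() if n >= min_count)

points = [(0.2, 0.3), (0.7, 0.9), (0.5, 0.1), (3.1, 3.4), (3.8, 3.9), (7.5, 0.2)]
counts = grid_cell_counts(points, cell_size=1.0)
dense = dense_cells(counts, min_count=2)
```

A grid based clustering method would then connect adjacent dense cells into clusters, which costs time proportional to the number of cells rather than the number of objects.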

18. What is STING?
STatistical INformation Grid is called STING. It is a grid based multi-resolution
clustering method. In STING, the spatial area is divided into rectangular cells.
These cells exist at several levels of resolution, and the levels are arranged in a
hierarchical structure.

19. Define WaveCluster?
It is a grid based multi-resolution clustering method. In this method all the objects
are represented by a multidimensional grid structure, and a wavelet transformation is
applied to find the dense regions. Each grid cell summarizes the information of the group
of objects that map into that cell. A wavelet transformation is a signal processing
technique that decomposes a signal into different frequency sub-bands.

20. What is Model based method?
Model based methods attempt to optimize the fit between a given data set and a
mathematical model. These methods assume that the data are generated by a mixture of
underlying probability distributions. There are two basic approaches in this method:
1. Statistical Approach
2. Neural Network Approach

21. What is the use of Regression?
Regression can be used to solve classification problems, but it can also be used
for applications such as forecasting. Regression can be performed using many different
types of techniques; in essence, regression takes a set of data and fits the data to a
formula.
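
As a small illustration of fitting data to a formula, the sketch below fits a straight line y = c0 + c1 x by ordinary least squares and uses it to forecast a new value. The choice of a linear formula, the function name and the data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = c0 + c1 * x; returns (c0, c1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx             # slope
    c0 = mean_y - c1 * mean_x  # intercept
    return c0, c1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]      # made-up data lying exactly on y = 1 + 2x
c0, c1 = fit_line(xs, ys)
forecast = c0 + c1 * 5.0       # forecast for an unseen x
```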

22. What are the reasons for not using the linear regression model to estimate the
output data?
There are many reasons. One is that the data may not fit a linear model. It is
also possible that the data generally do represent a linear model, but the
linear model generated is poor because noise or outliers exist in the data.
Noise is erroneous data, and outliers are data values that are exceptions to the usual and
expected data.

23. What are the two approaches used by regression to perform classification?
Regression can be used to perform classification using the following approaches:
1. Division: The data are divided into regions based on class.
2. Prediction: Formulas are generated to predict the output class value.

24. What do you mean by logistic regression?
Instead of fitting the data to a straight line, logistic regression uses a logistic curve.
The formula for the univariate logistic curve is
$$P = \frac{e^{(C_0 + C_1 X_1)}}{1 + e^{(C_0 + C_1 X_1)}}$$
The logistic curve gives a value between 0 and 1, so it can be interpreted as the
probability of class membership.
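
The logistic curve above can be evaluated directly. In this minimal sketch the coefficient values are hypothetical, not fitted from data, and the 0.5 cutoff used to turn the probability into a class label is an assumption.

```python
import math

def logistic(x, c0, c1):
    """Univariate logistic curve: e^(c0 + c1*x) / (1 + e^(c0 + c1*x))."""
    z = c0 + c1 * x
    return math.exp(z) / (1.0 + math.exp(z))

c0, c1 = -4.0, 2.0                 # hypothetical coefficients
p_low = logistic(0.0, c0, c1)      # well below 0.5
p_mid = logistic(2.0, c0, c1)      # c0 + c1*x = 0, so the curve is at its midpoint
p_high = logistic(5.0, c0, c1)     # close to 1
predicted_class = 1 if p_high >= 0.5 else 0
```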

25. What is Time Series Analysis?
A time series is a set of attribute values over a period of time. Time Series
Analysis may be viewed as finding patterns in the data and predicting future values.

26. What are the various detected patterns?
Detected patterns may include:
• Trends: systematic non-repetitive changes to the values over time.
• Cycles: the observed behavior is cyclic.
• Seasonal: the detected patterns may be based on time of year or month or day.
• Outliers: to assist in pattern detection, techniques may be needed to remove or
reduce the impact of outliers.

27. What is Smoothing?
Smoothing is an approach that is used to remove the nonsystematic behaviors
found in time series. It usually takes the form of finding moving averages of attribute
values. It is used to filter out noise and outliers.
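
A simple moving average can be sketched as follows. The window size and the sample series (with one artificial spike) are assumptions for the example, and this version returns values only for full windows.

```python
def moving_average(series, window):
    """Simple moving average; returns one value per full window."""
    if not 1 <= window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

series = [10, 12, 11, 50, 12, 13, 11]   # 50 is an artificial outlier
smoothed = moving_average(series, window=3)
```

The spike is spread over the windows that contain it, so its impact on any single smoothed value is reduced. A larger window smooths more but also blurs genuine changes.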

28. Give the formula for Pearson’s r
One standard formula to measure correlation is the correlation coefficient r,
sometimes called Pearson’s r. Given two time series, X and Y with means X’ and Y’,
each with n elements, the formula for r is
$$r = \frac{\sum_{i=1}^{n} (x_i - X')(y_i - Y')}{\left(\sum_{i=1}^{n} (x_i - X')^2 \, \sum_{i=1}^{n} (y_i - Y')^2\right)^{1/2}}$$
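
The correlation coefficient defined above can be computed with a few lines of Python. This is a minimal sketch; the function name and the two sample series are made up for the example, and the function assumes that neither series is constant (otherwise the denominator is zero).

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (sum((x - mean_x) ** 2 for x in xs)
                   * sum((y - mean_y) ** 2 for y in ys)) ** 0.5
    return numerator / denominator

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # moves with x
falling = [10, 8, 6, 4, 2]    # moves against x
r_pos = pearson_r(x, rising)
r_neg = pearson_r(x, falling)
```

Values of r near 1 indicate that the two series rise and fall together, values near -1 that they move in opposite directions, and values near 0 that there is little linear relationship.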


29. What is Autoregression?
Autoregression is a method of predicting a future time series value by looking at
previous values. Given a time series $X = (x_1, x_2, \ldots, x_n)$, a future value $x_{n+1}$
can be found using
$$x_{n+1} = \xi + \varphi_n x_n + \varphi_{n-1} x_{n-1} + \cdots + \varepsilon_{n+1}$$
Here $\xi$ is a constant, the $\varphi_i$ are coefficients, and $\varepsilon_{n+1}$ represents
a random error at time $n+1$. In addition, each element in the time
series can be viewed as a combination of a random error and a linear combination of
previous values.
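
The prediction step can be sketched in Python. The sketch ignores the random error term, and the constant and coefficient values are hypothetical. In practice they would be estimated from the series, which this example does not attempt.

```python
def ar_predict(series, constant, coeffs):
    """Predict the next value as constant + coeffs[0]*x_n + coeffs[1]*x_(n-1) + ...

    The random error term of the autoregressive formula is left out.
    """
    if len(coeffs) > len(series):
        raise ValueError("need at least as many past values as coefficients")
    recent = series[::-1][:len(coeffs)]   # x_n, x_(n-1), ... (most recent first)
    return constant + sum(phi * x for phi, x in zip(coeffs, recent))

series = [2.0, 4.0, 6.0, 8.0]
# Hypothetical model: next = 0 + 2 * x_n - 1 * x_(n-1)
next_value = ar_predict(series, constant=0.0, coeffs=[2.0, -1.0])
```

With these coefficients the model extrapolates the last step of the series, so the steadily rising sample series is continued by one more step.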
