Data Mining Steps

Extract the dataset from heterogeneous data sources
Data preprocessing and normalisation
Identification of data classes
Identification of input parameters to the classification technique
Splitting of the dataset into training and testing
Fitting the model in the training subset
Implementation of the classification technique
Evaluation of the output parameters
Interpretation of the output with respect to the problem statement
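
The steps above can be sketched end to end on a toy problem. This is a minimal illustration, not a prescribed implementation: the data, the helper names (`min_max_normalise`, `train_test_split`, `fit_centroids`) and the choice of a nearest-centroid classifier are all assumptions made for the example.

```python
import random

def min_max_normalise(values):
    """Preprocessing: rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_centroids(train):
    """Fitting: compute the mean feature value of each class."""
    sums, counts = {}, {}
    for x, label in train:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    """Classification: assign x to the class with the closest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def accuracy(centroids, test):
    """Evaluation: fraction of test rows that are classified correctly."""
    hits = sum(predict(centroids, x) == label for x, label in test)
    return hits / len(test)

# Hypothetical toy data: one numeric feature and two classes.
raw = [1.0, 1.5, 2.0, 2.5, 3.0, 8.0, 8.5, 9.0, 9.5, 10.0]
labels = ["low"] * 5 + ["high"] * 5
rows = list(zip(min_max_normalise(raw), labels))

train, test = train_test_split(rows)
model = fit_centroids(train)
print("test accuracy:", accuracy(model, test))
```

Normalising before the split follows the order of the steps listed above; in practice the scaling parameters are often computed from the training subset only.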

Clustering and Classification

Clustering and Classification are both common techniques used in machine learning, but they differ in terms of their approach, use cases, and types of data they handle. Here’s a breakdown of both:

1. Clustering:

Clustering is an unsupervised learning technique that groups similar data points together based on certain features or characteristics. The goal is to identify natural groupings in the data without prior knowledge of labels.

Type of Learning: Unsupervised
Purpose: To discover hidden patterns or inherent groupings in data without predefined labels.
Input: Data without any labels (no output category is known in advance).
Output: Groups or clusters of similar data points.
How it Works: The algorithm identifies similarity between data points and assigns them to clusters. The number of clusters may or may not be predefined, depending on the algorithm used.

Example Algorithms:

K-Means: Partitions data into K clusters based on centroids.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points that are closely packed together and marks outliers as noise.
Hierarchical Clustering: Builds a tree-like structure (dendrogram) of clusters.

Use Cases:

Market segmentation (grouping customers by behavior).
Image compression.
Document categorization.
Anomaly detection (identifying unusual patterns in data).
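
As an illustration of the K-Means idea described above, here is a minimal sketch in plain Python. The sample points, the fixed iteration count and the naive "first k points" initialisation are assumptions chosen to keep the example short; real implementations usually use a smarter initialisation and a convergence check.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means: alternate between assigning points and moving centroids."""
    centroids = [points[i] for i in range(k)]  # naive initialisation: first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical unlabelled data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.1), (8.8, 9.2), (9.1, 8.9)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```

No labels are passed in: the cluster indices in `assignment` come only from the distances between points.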

2. Classification:

Classification is a supervised learning technique where the goal is to assign predefined labels to data based on input features. It uses labeled training data to learn the mapping between the input features and the output class.

Type of Learning: Supervised
Purpose: To predict a category or class label for new, unseen data based on learned relationships from labeled data.
Input: Data with predefined labels (output categories are known during training).
Output: A class label (e.g., “spam” or “not spam,” “dog” or “cat”).
How it Works: The algorithm learns the relationship between input features and output labels from the training data and uses this knowledge to classify new data.

Example Algorithms:

Logistic Regression: Predicts binary outcomes.
Decision Trees: Classifies data by splitting it at various decision points.
Random Forest: An ensemble method that combines multiple decision trees.
Support Vector Machines (SVM): Finds a hyperplane that best separates data into different classes.
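
To make the idea of learning from labelled data concrete, here is a minimal sketch of logistic regression on a single feature, trained by batch gradient descent. The data, the learning rate and the number of epochs are assumptions chosen for the example, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def classify(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical labelled training data: small values are class 0, large values class 1.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print([classify(w, b, x) for x in xs])
```

Unlike clustering, the labels `ys` are required during training, and the fitted `w` and `b` are then reused to label new inputs.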

Use Cases:

Email spam detection.
Sentiment analysis (positive, negative, neutral).
Medical diagnosis (e.g., classifying diseases).
Image recognition (e.g., identifying objects or faces).

Key Differences Between Clustering and Classification:

Aspect	Clustering	Classification
Type of Learning	Unsupervised (no labels during training)	Supervised (requires labeled data for training)
Goal	To group similar data points together	To assign data points to predefined categories
Input Data	Unlabeled data	Labeled data (features with known class labels)
Output	Clusters (groups) of similar data points	Class labels (predictions of categories)
Nature of Task	Pattern discovery or grouping	Prediction of labels based on known patterns
Examples of Algorithms	K-Means, DBSCAN, Hierarchical Clustering	Decision Trees, SVM, Random Forest, Logistic Regression
Use Cases	Customer segmentation, image compression, anomaly detection	Fraud detection, medical diagnosis, sentiment analysis

How to decide between classification and clustering when identifying objects

If we want to find behaviour or trends in the data, we use clustering. It is an unsupervised technique that needs no labels.
If we want to categorize items into predefined classes or groups, we use classification. It is a supervised learning approach that relies on labelled data.

Summary:

Clustering is used for discovering patterns or structures in data when labels are unknown. It works well when you want to understand the natural groupings in your data without prior knowledge of what those groups should be.
Classification is used for predicting predefined labels based on historical data with known labels. It is typically used in situations where you want to categorize new data points into classes based on what the model has learned.

In short, clustering finds structure in data, while classification assigns labels to data.

Statistical Bayes’ method

Classification is done on the basis of the following techniques:

Regression technique
1. Based on the statistical principle of dependent and independent variables, following the equation $y = m x + c$.
2. $y$ is the dependent variable and $x$ is the independent variable.
3. $m$ is the slope of the line and $c$ is its intercept.
4. The value of $y$ is calculated from the value of $x$.
5. The values of $m$ and $c$ are estimated from the training data.
6. The fitted line is then used to predict the value of $y$ for the testing data.
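
The regression steps above can be sketched as follows. This is a minimal example that assumes ordinary least squares is used to estimate $m$ and $c$ (the notes do not name the estimation method); the data and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c of y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical training data generated from y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(train_x, train_y)

# The fitted line is then applied to unseen (testing) inputs.
test_x = [5.0, 6.0]
predictions = [m * x + c for x in test_x]
print(m, c, predictions)
```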

Regression technique: Based on the statistical principle of dependent and independent variables, following the equation $y = m x + c$.
Bayesian classification: Based on the statistical principle of Bayes’ theorem, i.e. conditional probability.
Distance-based classification
K-Nearest Neighbour (KNN): Divides the objects into classes on the basis of distance, for example the Euclidean distance, assigning each object to the class most common among its nearest labelled neighbours.
Decision-tree-based classification
Divides the input dataset into a set of classes using a hierarchical, tree-like structure of rules.
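
As a sketch of the distance-based approach listed above, the following shows K-Nearest Neighbour with Euclidean distance and a majority vote. The training points, the class names `"A"` and `"B"`, and the default `k=3` are assumptions for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort labelled points by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled data: (features, class label).
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

print(knn_predict(train, (1.8, 1.6)))
print(knn_predict(train, (6.4, 6.6)))
```

KNN has no separate fitting step: the training data itself is the model, and all the work happens at prediction time.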

Bayesian classification: Based on the statistical principle of Bayes’ theorem, that is, conditional probability.

Distance-based classification: Divides the data instances into classes based on the distance between them, for example using the Euclidean distance.

Decision-tree-based classification: Divides the input dataset using a tree-like structure made up of a set of rules.
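
To illustrate rule-based splitting, here is a minimal sketch of a one-level decision tree (a "decision stump") on a single numeric feature. It picks the threshold with the fewest misclassifications, which is a simplification: full decision-tree algorithms typically use measures such as entropy or the Gini index and split recursively. The data and function names are hypothetical.

```python
def fit_stump(xs, ys):
    """Find the rule 'x <= threshold' that misclassifies the fewest rows."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds lie midway between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        # Try both ways of labelling the two branches.
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    return best[1:]

def stump_predict(stump, x):
    """Apply the learned rule: left label if x <= threshold, else right label."""
    threshold, left, right = stump
    return left if x <= threshold else right

# Hypothetical labelled data with a clear gap between the classes.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stump = fit_stump(xs, ys)
print(stump)
print(stump_predict(stump, 5), stump_predict(stump, 9))
```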

Bayesian Classifier

The Bayesian theorem
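
For reference, the standard statement of Bayes’ theorem in a classification setting is given below. The symbols are assumptions chosen for this note: $c$ stands for a class and $t$ for a tuple (data instance).

$$
P(c \mid t) = \frac{P(t \mid c)\, P(c)}{P(t)}
$$

Here $P(c)$ is the prior probability of the class, $P(t \mid c)$ is the likelihood of the tuple given the class, and $P(c \mid t)$ is the posterior probability. A Bayesian classifier assigns $t$ to the class with the highest posterior.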

Problem Statements

Q1

Suppose $D = \{t_{1}, t_{2}, \dots, t_{m}\}$ is a set of tuples and $C = \{c_{1}, c_{2}, \dots\}$ is a set of classes

Information

date: 2025.02.28
time: 10:27

Data Warehousing and Mining Lecture 10