Data Mining Steps
- Extract the dataset from heterogeneous data sources
- Data preprocessing and normalisation
- Identification of data classes
- Identification of input parameters to the classification technique
- Splitting of the dataset into training and testing
- Fitting the model on the training subset
- Implementation of the classification technique
- Evaluation of the output parameters
- Interpretation of the output with respect to the problem statement
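The steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the dataset is a tiny hypothetical one, normalisation is min-max scaling, and the classification technique is a 1-nearest-neighbour rule. None of these choices come from the notes, and all names (`DATASET`, `run_pipeline`, and so on) are made up for the example.

```python
import random

# Hypothetical toy dataset: (feature vector, class label). Values are illustrative only.
DATASET = [
    ((1.0, 20.0), "A"), ((1.2, 22.0), "A"), ((0.8, 18.0), "A"), ((1.1, 21.0), "A"),
    ((3.0, 60.0), "B"), ((3.2, 62.0), "B"), ((2.9, 58.0), "B"), ((3.1, 61.0), "B"),
]


def min_max_normalise(rows):
    """Preprocessing: rescale every feature column to the range [0, 1].

    For brevity the whole dataset is scaled at once; in practice the scaling
    is usually computed from the training subset only.
    """
    columns = list(zip(*(features for features, _ in rows)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        (tuple((v - lo) / (hi - lo) if hi > lo else 0.0
               for v, lo, hi in zip(features, lows, highs)), label)
        for features, label in rows
    ]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and split into (training, testing) subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_nearest(train, features):
    """Classification technique (assumed): label of the closest training point."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    return nearest[1]


def accuracy(train, test):
    """Evaluation: fraction of testing points classified correctly."""
    correct = sum(predict_nearest(train, f) == label for f, label in test)
    return correct / len(test)


def run_pipeline():
    rows = min_max_normalise(DATASET)                 # preprocessing
    classes = sorted({label for _, label in rows})    # identify data classes
    train, test = train_test_split(rows)              # split the dataset
    return classes, len(train), len(test), accuracy(train, test)


classes, n_train, n_test, acc = run_pipeline()
print(classes, n_train, n_test, acc)
```

For a 1-nearest-neighbour rule, "fitting" amounts to storing the training subset, which is why there is no separate training function here.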
Clustering and Classification
Clustering and Classification are both common techniques used in machine learning, but they differ in terms of their approach, use cases, and types of data they handle. Here’s a breakdown of both:
1. Clustering:
Clustering is an unsupervised learning technique that groups similar data points together based on certain features or characteristics. The goal is to identify natural groupings in the data without prior knowledge of labels.
- Type of Learning: Unsupervised
- Purpose: To discover hidden patterns or inherent groupings in data without predefined labels.
- Input: Data without any labels (no output category is known in advance).
- Output: Groups or clusters of similar data points.
- How it Works: The algorithm identifies similarity between data points and assigns them to clusters. The number of clusters may or may not be predefined, depending on the algorithm used.
Example Algorithms:
- K-Means: Partitions data into K clusters based on centroids.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points that are closely packed together and marks outliers as noise.
- Hierarchical Clustering: Builds a tree-like structure (dendrogram) of clusters.
Use Cases:
- Market segmentation (grouping customers by behavior).
- Image compression.
- Document categorization.
- Anomaly detection (identifying unusual patterns in data).
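As a rough illustration of how clustering groups unlabeled points, here is a minimal K-Means sketch (Lloyd's algorithm). It makes simplifying assumptions that are not from the notes: the first K points serve as the initial centroids, a fixed number of iterations is run, and the sample points are hypothetical.

```python
import math


def k_means(points, k, iterations=10):
    """Partition points into k clusters by repeatedly reassigning and re-centring.

    Assumption for simplicity: the first k points are the initial centroids.
    """
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: two visually separate groups.
points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between points.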
2. Classification:
Classification is a supervised learning technique where the goal is to assign predefined labels to data based on input features. It uses labeled training data to learn the mapping between the input features and the output class.
- Type of Learning: Supervised
- Purpose: To predict a category or class label for new, unseen data based on learned relationships from labeled data.
- Input: Data with predefined labels (output categories are known during training).
- Output: A class label (e.g., “spam” or “not spam,” “dog” or “cat”).
- How it Works: The algorithm learns the relationship between input features and output labels from the training data and uses this knowledge to classify new data.
Example Algorithms:
- Logistic Regression: Models the probability of a class; most often used for binary outcomes.
- Decision Trees: Classifies data by splitting it at various decision points.
- Random Forest: An ensemble method that combines multiple decision trees.
- Support Vector Machines (SVM): Finds a hyperplane that best separates data into different classes.
Use Cases:
- Email spam detection.
- Sentiment analysis (positive, negative, neutral).
- Medical diagnosis (e.g., classifying diseases).
- Image recognition (e.g., identifying objects or faces).
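To make the supervised setting concrete, the sketch below learns a single decision point from labeled examples and then classifies new data. It is a one-level decision tree (a "decision stump"), chosen here only because it is short. The feature (a count of suspicious words per email), the training data, and the function names are all hypothetical.

```python
# Hypothetical labeled training data: (suspicious word count, class label).
TRAINING = [
    (0, "not spam"), (1, "not spam"), (2, "not spam"),
    (5, "spam"), (6, "spam"), (8, "spam"),
]


def fit_threshold(samples, positive="spam"):
    """Learn the threshold t that minimises training errors for the rule
    'predict the positive class when the feature value is greater than t'."""
    values = sorted({x for x, _ in samples})
    # Candidate decision points: midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((x > t) != (label == positive) for x, label in samples)

    return min(candidates, key=errors)


def classify(x, threshold, positive="spam", negative="not spam"):
    """Assign a predefined label to a new, unseen feature value."""
    return positive if x > threshold else negative


threshold = fit_threshold(TRAINING)
print(threshold)               # 3.5
print(classify(7, threshold))  # spam
print(classify(1, threshold))  # not spam
```

Unlike the clustering case, the labels are known during training, and the output for new data is one of those predefined labels.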
Key Differences Between Clustering and Classification:
Aspect | Clustering | Classification |
---|---|---|
Type of Learning | Unsupervised (no labels during training) | Supervised (requires labeled data for training) |
Goal | To group similar data points together | To assign data points to predefined categories |
Input Data | Unlabeled data | Labeled data (features with known class labels) |
Output | Clusters (groups) of similar data points | Class labels (predictions of categories) |
Nature of Task | Pattern discovery or grouping | Prediction of labels based on known patterns |
Examples of Algorithms | K-Means, DBSCAN, Hierarchical Clustering | Decision Trees, SVM, Random Forest, Logistic Regression |
Use Cases | Customer segmentation, image compression, anomaly detection | Fraud detection, medical diagnosis, sentiment analysis |
How to identify objects using classification or clustering
- To find behaviour or trends in the data, use clustering. It is an unsupervised technique for identifying groups.
- Classification is a supervised learning approach used to categorize items into predefined classes or groups based on labelled data.
Summary:
- Clustering is used for discovering patterns or structures in data when labels are unknown. It works well when you want to understand the natural groupings in your data without prior knowledge of what those groups should be.
- Classification is used for predicting predefined labels based on historical data with known labels. It is typically used in situations where you want to categorize new data points into classes based on what the model has learned.
In short, clustering finds structure in data, while classification assigns labels to data.
Statistical Bayes’ Method
Classification is done on the basis of
- Regression technique: based on the statistical principle of dependent and independent variables, following the equation $y = mx + c$
- Bayesian classification: based on the statistical principle of Bayes’ theorem, i.e. conditional probability
- Distance-based classification, e.g. K-Nearest Neighbour (KNN): dividing the data instances into classes on the basis of the distance between them, such as the Euclidean distance
- Decision-tree-based classification: dividing the input dataset into a set of classes using a hierarchical, tree-like structure of rules
In the regression equation
- y is the dependent variable and x is the independent variable
- m is the slope of the line and c is the intercept of the line
- The values of m and c are calculated from the training data
- The value of y is then calculated from the value of x for the testing data
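For the regression technique, the line y = mx + c can be fitted to training data and then applied to testing data. The sketch below assumes ordinary least squares as the way to compute m and c; the notes do not say which method is intended, and the data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c by ordinary least squares (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def predict(x, m, c):
    """Compute the dependent variable y from the independent variable x."""
    return m * x + c


# Hypothetical training data lying exactly on y = 2x + 1.
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]
m, c = fit_line(train_x, train_y)
print(m, c)  # 2.0 1.0

# Hypothetical testing data: y is calculated from x using the fitted m and c.
test_x = [5, 6]
test_y = [predict(x, m, c) for x in test_x]
print(test_y)  # [11.0, 13.0]
```

The output y is a continuous value; to use it for classification, one would still have to map it to a class, for example with a cut-off value.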
Bayesian Classifier
The Bayesian theorem
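Bayes’ theorem relates the probability of a class C given evidence X to the probability of the evidence given the class: $P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$. A Bayesian classifier assigns the class with the highest posterior probability $P(C \mid X)$. The sketch below applies the formula to made-up numbers; the priors, likelihoods, and class names are hypothetical and not taken from the notes.

```python
def posteriors(priors, likelihoods):
    """Return P(class | evidence) for every class using Bayes' theorem.

    priors:      P(class)
    likelihoods: P(evidence | class)
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(evidence), by the law of total probability
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers: 40% of emails are spam, and the word "offer"
# appears in 50% of spam emails but only 10% of other emails.
priors = {"spam": 0.4, "not spam": 0.6}
likelihoods = {"spam": 0.5, "not spam": 0.1}

result = posteriors(priors, likelihoods)
predicted = max(result, key=result.get)  # class with the highest posterior
print(result, predicted)
```

With these numbers the posterior for "spam" is 0.2 / 0.26, so an email containing "offer" is assigned to the "spam" class even though spam is the less common class overall.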
Problem Statements
Q1
Suppose a set of tuples and a set of classes
Information
- date: 2025.02.28
- time: 10:27