Classification and Clustering

| Feature | Classification | Clustering |
|---|---|---|
| Type of Learning | Supervised (data is assigned to predefined categories) | Unsupervised |
| Labeling Requirement | Requires labeled data | No predefined labels |
| Goal | Classifies new data into known categories | Discovers hidden structures in the data |
| Examples | Spam detection, disease diagnosis, sentiment analysis | Image clustering, anomaly detection |
| Algorithms | Decision Trees, SVM, Random Forest, Neural Networks | K-Means, DBSCAN, Hierarchical Clustering |
| Output | Discrete categories (e.g., spam or not spam) | Grouped clusters based on similarity |
| Use Case Focus | Predicting labels for new data | Finding patterns and natural groupings in data |
| Real-World Example | Email classification into spam/ham | Grouping customers by purchasing behavior |
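
To make the labeled/unlabeled distinction concrete, here is a minimal sketch in plain Python. The data values, the function names (`fit_centroids`, `classify`, `k_means_1d`), and the choice of a nearest-centroid classifier are all assumptions made for illustration; they are not from the notes. The classifier is given the labels, while the clustering routine only sees the raw points.

```python
# Toy 1-D data: two obvious groups near 1.0 and 9.0 (made-up values).
points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]  # used by the classifier only


def fit_centroids(xs, ys):
    """Supervised step: average the points that share each known label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def classify(x, centroids):
    """Predict the known label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def k_means_1d(xs, centers, iterations=10):
    """Unsupervised step: group points around centers without using labels."""
    centers = list(centers)
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for each point.
        assignment = [
            min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            for x in xs
        ]
        # Update step: move each center to the mean of its assigned points.
        for i in range(len(centers)):
            members = [x for x, a in zip(xs, assignment) if a == i]
            if members:
                centers[i] = sum(members) / len(members)
    return assignment, centers


# Classification: a new point gets one of the predefined labels.
centroids = fit_centroids(points, labels)
print(classify(2.0, centroids))  # ham

# Clustering: points get anonymous cluster ids; no label names are involved.
clusters, centers = k_means_1d(points, centers=[min(points), max(points)])
print(clusters)  # [0, 0, 0, 1, 1, 1]
```

The classifier returns a named category, whereas the clustering output is only a set of group ids whose meaning has to be interpreted afterwards.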

KDD Issues

  • Human Interaction
  • Overfitting
  • Outliers
  • Interpretation: complex models are hard to interpret
  • Visualization: high-dimensional data is hard to visualize
  • Large data sets: computational complexity grows with the amount of data
  • Multimedia data: the data can come in many different formats
    • Examples: photos and videos
  • Changing data streams: the model may not be able to adapt to new data

Assignment

Solve two numerical examples on decision trees, regression, Naive Bayes, and ID3.

ID3

Entropy

If S is a collection of 14 examples with 9 yes and 5 no examples, then Entropy(S) = -(9/14) log_2(9/14) - (5/14) log_2(5/14) ≈ 0.940
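
The calculation above can be checked with a short Python sketch. The function name `entropy` and the choice to pass class counts as a list are assumptions made for this example.

```python
import math


def entropy(counts):
    """Base-2 entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # a class with zero examples contributes nothing
            p = c / total
            result -= p * math.log2(p)
    return result


# 9 "yes" and 5 "no" examples, as in the notes.
print(f"{entropy([9, 5]):.3f}")  # 0.940
```

An evenly split collection such as `entropy([7, 7])` gives 1.0, and a pure collection such as `entropy([14, 0])` gives 0.0.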

Information Gain

Gain is defined as the difference between the information needed to make a correct classification before the split and the information needed after the split. ID3 uses this gain to choose which attribute to split on.
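
As a sketch of this definition, the gain can be computed as the entropy of the parent collection minus the size-weighted entropy of the subsets produced by the split. The function names and the particular split used here (one subset with 6 yes / 2 no, the other with 3 yes / 3 no) are assumptions chosen for illustration; the notes only give the 9 yes / 5 no parent collection.

```python
import math


def entropy(counts):
    """Base-2 entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # a class with zero examples contributes nothing
            p = c / total
            result -= p * math.log2(p)
    return result


def information_gain(parent_counts, child_counts_list):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(
        (sum(child) / total) * entropy(child) for child in child_counts_list
    )
    return entropy(parent_counts) - after


# Parent: 9 yes / 5 no. Hypothetical split into [6 yes, 2 no] and [3 yes, 3 no].
gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(f"{gain:.3f}")  # 0.048
```

A split that separates the classes perfectly, such as `[[9, 0], [0, 5]]`, leaves no entropy after the split, so its gain equals the full parent entropy of about 0.940.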

Information
  • date: 2025.03.12
  • time: 13:14