Unsupervised Learning Techniques for Anomaly Detection in High-Dimensional Data Streams Using Clustering and Autoencoders
Keywords:
Unsupervised Learning, Anomaly Detection, Autoencoder, Clustering, High-Dimensional Data Streams, Deep Learning, Cybersecurity Analytics, Machine Learning
Abstract
The rapid growth of high-dimensional streaming data produced by industrial automation systems, Internet of Things (IoT) devices, cybersecurity systems, healthcare monitoring systems, and public cloud computing environments has created a need for intelligent, scalable anomaly detection systems. Conventional supervised learning methods are data-intensive and respond poorly to dynamic streams in which new anomalous behaviours continually emerge. To overcome these limitations, this paper proposes an unsupervised anomaly detection framework that combines clustering with a deep autoencoder architecture to identify abnormal patterns in high-dimensional data streams. The proposed methodology applies data preprocessing, feature normalization, K-Means clustering to organize the data, and a deep autoencoder for latent feature learning, so that anomalies are detected without any labeled training data. Anomalous instances are identified through reconstruction error analysis, which measures the error between the original and reconstructed data representations. The framework was evaluated experimentally on the benchmark intrusion detection datasets NSL-KDD and UNSW-NB15. Performance was measured using accuracy, precision, recall, F1-score, ROC-AUC, and false positive rate. The hybrid framework achieved an average detection accuracy of 97.1%, precision of 96.5%, recall of 95.8%, and a ROC-AUC of 0.981, outperforming the traditional unsupervised methods Isolation Forest and One-Class SVM. The framework was statistically validated using 10-fold cross-validation, which supports its robustness, scalability, and reliability in high-dimensional streaming environments.
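The pipeline summarized in the abstract (normalization, K-Means clustering, autoencoder reconstruction, error-based flagging) can be illustrated with a minimal NumPy sketch. It is not the implementation evaluated in this paper, and several choices are assumptions made purely for illustration: a single-hidden-layer tanh autoencoder stands in for the deep autoencoder, the clusters are used only to set a per-cluster error threshold (the abstract does not state how the two components are coupled), and the function names, the percentile `q`, and the synthetic data are all hypothetical.

```python
import numpy as np

def zscore(X):
    """Feature normalization: zero mean and unit variance per column."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sigma == 0, 1.0, sigma)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-Means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave an empty cluster's centroid unchanged
                centroids[j] = X[labels == j].mean(axis=0)
    # Final assignment against the last centroid update.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

def train_autoencoder(X, hidden=3, epochs=2000, lr=0.005, seed=0):
    """Shallow tanh autoencoder (assumed stand-in for a deep one),
    trained by full-batch gradient descent on squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1, b1 = rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (hidden, d)), np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)             # latent representation
        X_hat = H @ W2 + b2                  # reconstruction
        G = 2.0 * (X_hat - X) / n            # d(loss)/d(X_hat)
        dH = (G @ W2.T) * (1.0 - H ** 2)     # backprop through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def reconstruction_error(X, params):
    """Mean squared error per sample between input and reconstruction."""
    W1, b1, W2, b2 = params
    X_hat = np.tanh(X @ W1 + b1) @ W2 + b2
    return ((X - X_hat) ** 2).mean(axis=1)

def flag_anomalies(errors, labels, q=98.0):
    """Assumed rule: flag a point if its error exceeds the q-th percentile
    of the errors within its own cluster."""
    flags = np.zeros(len(errors), dtype=bool)
    for j in np.unique(labels):
        members = labels == j
        flags[members] = errors[members] > np.percentile(errors[members], q)
    return flags

# Synthetic demo data (not NSL-KDD or UNSW-NB15): 500 points near a 2-D
# subspace of an 8-D space, plus 10 points scattered across all 8 dimensions.
rng = np.random.default_rng(42)
normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 8))
normal += 0.05 * rng.normal(size=(500, 8))
outliers = 4.0 * rng.normal(size=(10, 8))
X = zscore(np.vstack([normal, outliers]))
is_outlier = np.arange(len(X)) >= 500

centroids, labels = kmeans(X, k=3)
params = train_autoencoder(X)
errors = reconstruction_error(X, params)
flags = flag_anomalies(errors, labels)
print("flagged:", int(flags.sum()), "of", len(X))
```

No labels are used at any step: `is_outlier` exists only to inspect the synthetic result afterwards. A per-cluster percentile flags a fixed share of every cluster, so in practice the threshold rule would need to be tuned against the false positive rate.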




