Abstract:
Encrypted channels such as HTTPS, TLS, and VPNs have become popular tools for ensuring privacy;
yet they can also hide malicious payloads, which makes intrusion detection more challenging. This study proposes a
machine learning framework for the classification of malicious encrypted communications using flow-based and temporal
characteristics. Public datasets containing network traffic captures were used to test and validate the framework.
Benign and malicious flows were converted to flow-based features using Scapy and CICFlowMeter, and the most
informative features were selected by feature importance. Three machine learning models were trained and tested on
these datasets: Random Forest, XGBoost, and a linear Support Vector Machine (SVM). The models were evaluated with a
stratified train/test split, cross-validation, and a family-disjoint split. The Random Forest model
achieved nearly perfect accuracy (approximately 100%) on both the training and test sets and approximately 92%
accuracy under cross-validation. Overfitting was minimal for the Random Forest model, whereas XGBoost overfit and
the SVM reached only moderate accuracy (approximately 72%). This study suggests
that the proposed framework can reliably detect malicious encrypted communications, including those that were not
used in training. SHAP was used to explain the framework's predictions and to identify the flow characteristics
that contributed most to its decisions. The proposed framework is computationally efficient and was tested on
real-world datasets, making it suitable for practical applications in network security.
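
As an illustration of the family-disjoint evaluation mentioned above, the following is a minimal sketch and not the
study's actual code. It assumes each flow is a dict with hypothetical "label" and "family" keys, holds out whole
malware families for testing, and splits benign flows at random by the same fraction.

```python
from collections import defaultdict
import random


def family_disjoint_split(flows, test_fraction=0.3, seed=0):
    """Split flows so that no malware family appears in both train and test.

    `flows` is a list of dicts with the hypothetical keys "label" and "family"
    (assumed names, not taken from the study). Benign flows have no family and
    are split at random using the same fraction.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    benign = []
    for flow in flows:
        if flow["label"] == "benign":
            benign.append(flow)
        else:
            by_family[flow["family"]].append(flow)

    # Choose which families are held out entirely for testing.
    families = sorted(by_family)
    rng.shuffle(families)
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train, test = [], []
    for family, members in by_family.items():
        (test if family in test_families else train).extend(members)

    # Benign traffic carries no family, so a plain random split is used.
    rng.shuffle(benign)
    cut = round(len(benign) * test_fraction)
    test.extend(benign[:cut])
    train.extend(benign[cut:])
    return train, test, test_families


# Toy data with made-up family names and a single placeholder feature.
flows = [
    {"label": "malicious", "family": fam, "duration": d}
    for fam in ("family_a", "family_b", "family_c", "family_d")
    for d in (0.1, 0.5, 2.0)
] + [{"label": "benign", "family": None, "duration": d} for d in (0.2, 0.3, 1.0, 4.0)]

train, test, held_out = family_disjoint_split(flows)
train_families = {f["family"] for f in train if f["label"] == "malicious"}
test_families = {f["family"] for f in test if f["label"] == "malicious"}
```

Because the held-out families never appear in the training set, accuracy on the test set reflects how well a model
generalizes to families it has not seen, which a stratified random split alone does not measure.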