Malware detection using random forest method trained on a balanced synthetic dataset

dc.contributor.advisor: Mokwena, S. N.
dc.contributor.author: Matsobane, Neo Onica
dc.date.accessioned: 2025-01-30T11:05:22Z
dc.date.available: 2025-01-30T11:05:22Z
dc.date.issued: 2024
dc.description: Thesis (M.Sc. (eScience Data Science)) -- University of Limpopo, 2024 (en_US)
dc.description.abstract: Malicious software (malware) poses a significant threat to the security and integrity of computer systems. Traditional malware detection approaches often rely on small-scale, imbalanced datasets, which reduces detection accuracy and reliability. This research proposed addressing these issues with a Random Forest method trained on a balanced synthetic dataset. The primary objective was to investigate the impact of the Random Forest technique on malware detection. To achieve this, we first created a balanced synthetic dataset from the recent CICMalDroid2020 dataset using Generative Adversarial Networks (GANs), then trained the Random Forest model on that dataset. Model performance was evaluated using accuracy, precision, recall, balanced accuracy, geometric metrics, and F1-score, and the approach was assessed against traditional detection methods for accurate and robust detection of malware samples. The results shed light on the performance improvements achieved by Random Forest when trained on a balanced synthetic dataset, contributing to the advancement of malware detection techniques. The test results showed that Random Forest can detect malware attacks with an accuracy of 91%, recall of 100%, precision of 85%, F1-score of 92%, balanced accuracy of 95%, and geometric metrics of 84%. From these results, we inferred that Random Forest has the capacity to detect malware attacks. (en_US)
dc.format.extent: vi, 67 leaves (en_US)
dc.identifier.uri: http://hdl.handle.net/10386/4846
dc.language.iso: en (en_US)
dc.relation.requires: PDF (en_US)
dc.subject: Random Forest (en_US)
dc.subject: Malware detection (en_US)
dc.subject: Synthetic dataset (en_US)
dc.subject: Balanced dataset (en_US)
dc.subject: Generative Adversarial Networks (GANs) (en_US)
dc.subject.lcsh: Malware (Computer software) (en_US)
dc.subject.lcsh: Computer viruses (en_US)
dc.subject.lcsh: Data sets (en_US)
dc.title: Malware detection using random forest method trained on a balanced synthetic dataset (en_US)
dc.type: Thesis (en_US)
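
The evaluation metrics named in the abstract can be illustrated with a minimal sketch that derives them from binary confusion-matrix counts. This is not the thesis's code: the function name `binary_metrics` and the example counts are hypothetical, and "geometric metrics" is assumed here to mean the geometric mean of recall and specificity, which the record does not confirm.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common detection metrics from binary confusion-matrix counts.

    tp/fn: malware samples classified correctly/incorrectly.
    tn/fp: benign samples classified correctly/incorrectly.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # true positive rate (sensitivity)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "balanced_accuracy": (recall + specificity) / 2,
        # Assumption: geometric mean of recall and specificity.
        "g_mean": math.sqrt(recall * specificity),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not results from the thesis.
    for name, value in binary_metrics(tp=40, fp=20, tn=30, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Balanced accuracy and the geometric mean both weight the malware and benign classes equally, which is why they are commonly reported alongside plain accuracy when class imbalance is a concern.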

Files

Original bundle

Name: matsobane_no_2024.pdf
Size: 1.05 MB
Format: Adobe Portable Document Format
Description: Thesis
