Abstract
A fundamental challenge in computer virology is the effective detection of malicious code. N-gram analysis has become a cornerstone technique, but selecting the most informative features remains crucial for efficient detection, especially for longer n-grams, whose feature space grows exponentially with n. This paper addresses this challenge by introducing a novel feature extraction method that uses both adjacent and non-adjacent bi-grams, providing a richer set of information for identifying malicious code. We also propose a computationally efficient feature selection approach that combines a genetic algorithm with Boosting principles. In our experiments, the resulting detection system improves detection accuracy by 15% and reduces false positives by 20% compared to traditional n-gram techniques. It also cuts computational overhead by about 30%, making it suitable for real-time applications. These results indicate that the approach is both effective and practical. Future research will apply the method to polymorphic viruses and other malware to further improve its robustness and applicability.
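To make the notion of adjacent and non-adjacent bi-grams concrete, the following is a minimal sketch of one possible extraction scheme. It assumes that a non-adjacent bi-gram is a pair of bytes separated by a fixed number of skipped bytes. The function name `extract_bigrams`, the `max_gap` parameter, and the `(gap, first_byte, second_byte)` key layout are hypothetical choices made for illustration and are not taken from the method described in this paper.

```python
from collections import Counter


def extract_bigrams(data: bytes, max_gap: int = 2) -> Counter:
    """Count byte bi-grams whose two bytes are separated by 0..max_gap bytes.

    Keys are (gap, first_byte, second_byte). gap == 0 gives the ordinary
    adjacent bi-gram; gap > 0 gives a non-adjacent bi-gram. This encoding
    and the max_gap parameter are illustrative assumptions.
    """
    counts = Counter()
    for gap in range(max_gap + 1):
        step = gap + 1  # distance between the two byte positions
        for i in range(len(data) - step):
            counts[(gap, data[i], data[i + step])] += 1
    return counts


# Usage: a short byte sequence in which the pair 0x4D 0x5A occurs twice.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A])
features = extract_bigrams(sample, max_gap=1)
```

Under this assumed scheme, each gap value adds at most 256 x 256 candidate features, so the feature space grows linearly with `max_gap` rather than exponentially as it does with n for longer n-grams. The resulting counts would then be passed to a feature selection stage.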