The number of malicious Android apps is increasing rapidly. Simply detecting and removing malware apps is insufficient, since they can damage or alter other files or settings; install additional applications; etc. To aid in determining such behaviors, a security analyst can significantly benefit from identifying the family to which an Android malware belongs. Techniques for detecting Android malware, and determining their families, lack the ability to handle certain obfuscations that aim to thwart detection. Moreover, some prior techniques face scalability issues, preventing them from detecting malware in a timely manner.
To address these challenges, we present a novel machine learning-based Android-malware detection and family-identification approach, RevealDroid, that extracts and utilizes features without the need to perform complex program analyses (e.g., precise data-flow analysis) or large sets of features (e.g., hundreds of thousands of features), which lead to scalability problems and lack resiliency to obfuscations. Specifically, our selected machine-learning features leverage categorized Android-API usage, which represent semantics of Android apps, reflection-based features, and features from native binaries of apps. We assess RevealDroid for accuracy, efficiency, and obfuscation resilience on a dataset of 51,496 malicious and benign apps. On this dataset, RevealDroid achieves an accuracy of 91%. For 18,065 malicious apps from 68 families, RevealDroid can identify the malware family of the app with an 87% accuracy. We further compare RevealDroid against state-of-the-art approaches for malware detection and family identification, demonstrating RevealDroid’s superior obfuscation resiliency and accuracy, while still maintaining efficiency.
To access RevealDroid source code, you'll need two projects RevealDroid legacy code—which contains the package API extractor, native extraction code, and legacy code for handling Weka-based functionality—and the android-reflection-analysis code—which mostly handles reflection analyses and sklearn-based machine learning functionality.
To access the RevealDroid dataset (approximately 10GB in size), please follow this link.
To evaluate RevealDroid, we also compared it against state-of-the-practice commercial anti-virus (AV) products available on VirusTotal. We met or exceeded the accuracy values of 60 commercial AVs for our evaluation. Given that our technique utilizes machine learning, our technique learns to detect malware automatically, unlike many existing state-of-the-practice tools. Detailed results are available here.