Converts malware binary files into grayscale images, then trains a CNN to classify them into malware families.
This project demonstrates a method for malware classification by converting malware binary files (.bytes) into grayscale images and then training a Convolutional Neural Network (CNN) to classify them into their respective families.
This approach is based on the principle that different malware families exhibit distinct textural patterns in their binary representation, which a CNN can learn to identify.
The notebook is structured into two main phases:
- Preprocessing: This stage involves two steps:
  - Converting raw `.bytes` files into PNG images.
  - Loading these images into a dataset, analyzing class distribution, and preparing them for training.
- Modeling & Evaluation:
  - Defining a CNN architecture using Keras.
  - Training the model on the image dataset, accounting for class imbalance.
  - Evaluating the model's performance and analyzing its results with a confusion matrix.
The first part of the project focuses on malware visualization.
- Input: The script reads raw malware files from a directory (`D:\ASSEMBLY\data_raw` in the notebook).
- File Format: It's designed to parse `.bytes` files. These files are expected to be hexadecimal dumps, where each line contains an address offset followed by 16 hexadecimal values (bytes), e.g., `00401000 FF E0 15 ...`.
- Conversion Process:
  - Each line is read, and the 16 hex values are extracted (the address is ignored).
  - Each hex value (e.g., `FF`, `E0`) is converted into its integer equivalent (0-255). Any missing bytes, represented as `??`, are converted to `0`.
  - This creates a 1D stream of byte values (pixels).
  - To create an image, this 1D stream is reshaped into a 2D matrix. The width of the image is calculated as the next power of 2 from the square root of the file's total byte count, ensuring a roughly square-like shape. The height is calculated based on this width.
  - The resulting 2D NumPy array is then saved as a grayscale PNG image using the Pillow (PIL) library.
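The conversion steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names (`parse_bytes_lines`, `image_width`, `to_rows`) are hypothetical, "next power of 2 from the square root" is read here as the smallest power of two that is at least the square root of the byte count, and zero-padding the last row is an assumption.

```python
import math


def parse_bytes_lines(lines):
    """Turn hex-dump lines into a flat list of byte values (0-255)."""
    pixels = []
    for line in lines:
        tokens = line.split()
        # The first token is the address offset; the rest are byte values.
        for token in tokens[1:]:
            # '??' marks a missing byte and is mapped to 0.
            pixels.append(0 if token == "??" else int(token, 16))
    return pixels


def image_width(n_bytes):
    """Smallest power of two >= sqrt(n_bytes) (assumed reading of the rule)."""
    side = math.isqrt(max(n_bytes - 1, 0)) + 1  # ceil(sqrt(n_bytes)) for n_bytes >= 1
    width = 1
    while width < side:
        width *= 2
    return width


def to_rows(pixels, width):
    """Reshape the 1D stream into rows of `width`, zero-padding the last row (assumption)."""
    height = math.ceil(len(pixels) / width)
    padded = pixels + [0] * (height * width - len(pixels))
    return [padded[r * width:(r + 1) * width] for r in range(height)]


# Two made-up dump lines: 16 bytes, then 4 bytes.
sample = [
    "00401000 FF E0 15 ?? 00 01 02 03 04 05 06 07 08 09 0A 0B",
    "00401010 10 11 12 13",
]
pixels = parse_bytes_lines(sample)
rows = to_rows(pixels, image_width(len(pixels)))
```

The resulting `rows` can be turned into an image with `numpy.array(rows, dtype=numpy.uint8)` followed by `PIL.Image.fromarray(...).save("sample.png")`, which writes a single-channel grayscale PNG.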
After the images are generated and manually sorted into subfolders by class, they are loaded for training.
- Loading: `ImageDataGenerator.flow_from_directory` is used to load the images from the parent directory (`D:\ASSEMBLY\multi virusi`).
- Dataset Stats: The script found 4725 images belonging to 22 classes (malware families).
- Image Specs: All images are resized to a uniform `(64, 64)` and converted to 3-channel (RGB) format.
- Class Imbalance: A bar chart of the class distribution reveals that the dataset is highly unbalanced, with two classes (`Allaple.A` and `Allaple.L`) containing a significant majority of the samples.
- Train/Test Split: The dataset is split into a 70% training set (3307 images) and a 30% test set (1418 images) using `sklearn.model_selection.train_test_split`.
- Normalization: Pixel values are normalized from 0-255 to 0.0-1.0 by dividing by 255.
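To make the split sizes and the normalization concrete, here is a small standard-library sketch. It is a stand-in for `train_test_split`, not a reproduction of it: the helper names, the seed, and rounding the test set up are assumptions chosen because they match the 3307/1418 figures reported above.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the test set size is rounded up.
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def normalize(pixels):
    """Scale 8-bit pixel values from 0-255 to 0.0-1.0."""
    return [p / 255 for p in pixels]


train_idx, test_idx = split_indices(4725)
print(len(train_idx), len(test_idx))
print(normalize([0, 51, 255]))
```

In the notebook the same division by 255 would be applied to the whole image array at once rather than to a Python list.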
A Sequential CNN model is built using tensorflow.keras. The architecture is as follows (note the redundant Flatten layer, which is present in the code):
| Layer | Type | Output Shape | Parameters |
|---|---|---|---|
| 1 | Conv2D | `(None, 62, 62, 30)` | 840 |
| 2 | MaxPooling2D | `(None, 31, 31, 30)` | 0 |
| 3 | Conv2D | `(None, 29, 29, 15)` | 4,065 |
| 4 | MaxPooling2D | `(None, 14, 14, 15)` | 0 |
| 5 | Dropout | (Rate: 0.25) | 0 |
| 6 | Flatten | `(None, 2940)` | 0 |
| 7 | Dense | `(None, 128)` | 376,448 |
| 8 | Dropout | (Rate: 0.25) | 0 |
| 9 | Flatten | `(None, 128)` | 0 |
| 10 | Dense | `(None, 128)` | 16,512 |
| 11 | Dropout | (Rate: 0.5) | 0 |
| 12 | Dense | `(None, 50)` | 6,450 |
| 13 | Dense | `(None, 22)` | 1,122 |
Total params: 405,437
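The shapes and parameter counts in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes 3x3 convolution kernels, 2x2 pooling, stride 1 and no padding; these values are not stated in the text and are inferred from the output shapes, and the helper names are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units


def conv_out(size, kernel):
    """Output side length for stride 1 without padding (assumed)."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output side length for non-overlapping pooling."""
    return size // pool


side = 64                      # input images are 64x64x3
side = conv_out(side, 3)       # after first Conv2D
side = pool_out(side, 2)       # after first MaxPooling2D
side = conv_out(side, 3)       # after second Conv2D
side = pool_out(side, 2)       # after second MaxPooling2D
flat = side * side * 15        # size of the first Flatten output

params = [
    conv2d_params(3, 3, 30),   # layer 1
    conv2d_params(3, 30, 15),  # layer 3
    dense_params(flat, 128),   # layer 7
    dense_params(128, 128),    # layer 10
    dense_params(128, 50),     # layer 12
    dense_params(50, 22),      # layer 13
]
total = sum(params)
print(flat, params, total)
```

Under these assumptions the per-layer counts and their sum agree with the table, which suggests the inferred kernel and pooling sizes are the ones used.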
- Compilation: The model is compiled using the `adam` optimizer and `binary_crossentropy` loss. Custom metrics for F1-score, precision, recall, and AUC are included.
- Class Weighting: To combat the class imbalance identified during preprocessing, `sklearn.utils.class_weight.compute_class_weight` is used with `class_weight='balanced'`. This calculates weights to give more importance to under-represented classes during training.
- Fitting: The model is trained for 100 epochs.
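For intuition about the class weighting described above, the `balanced` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reimplements that formula with the standard library; the function name and the toy label counts are made up for illustration and are not the dataset's real distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy distribution: one dominant class, one rare class.
labels = ["Allaple.A"] * 6 + ["Yuner.A"] * 3 + ["Autorun.K"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive larger weights, so their errors count for more in the loss. Note that Keras expects the `class_weight` argument of `fit` as a dictionary keyed by integer class index, so class names would need to be mapped to indices first.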
- Accuracy: The trained model achieves a final accuracy of ~92.2% on the test set.
- Confusion Matrix: A confusion matrix is generated to visualize the model's performance on a per-class basis.
- Analysis:
- The model performs well on most classes.
- It shows confusion between certain related families, such as Swizzor.gen!E and Swizzor.gen!l.
- It also struggles with classes that have very few samples, like Autorun.K, which is frequently misclassified as Yuner.A. This is a typical side effect of a highly imbalanced dataset.
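As an illustration of how such an analysis reads off a confusion matrix, the sketch below builds one by hand and derives per-class recall. The labels and predictions are invented toy data, not the notebook's results, and the function names are hypothetical; rows are true classes and columns are predicted classes.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count (true, predicted) pairs; rows are true classes, columns predictions."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Invented example: a rare class that is mostly predicted as a larger one.
classes = ["Autorun.K", "Yuner.A"]
y_true = ["Autorun.K"] * 3 + ["Yuner.A"] * 4
y_pred = ["Yuner.A", "Yuner.A", "Autorun.K"] + ["Yuner.A"] * 4
matrix = confusion_matrix(y_true, y_pred, classes)
recall = per_class_recall(matrix)
print(matrix, recall)
```

A low recall in a row with few samples, with most of the mass in another class's column, is the pattern described for Autorun.K above.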
- Clone the repository.
- Install dependencies: `pip install tensorflow numpy scipy pillow matplotlib scikit-learn seaborn pandas`
- Prepare Data:
  - Create a directory (e.g., `data_raw`) and place your `.bytes` malware files inside it.
  - Update the `root` variable in Cell 3 to point to this directory.
  - Run the Preprocessing - Converting to images section (Cells 1-5).
  - After images are generated, create a new directory (e.g., `multi virusi`).
  - Inside this directory, create sub-directories for each malware class (e.g., `Allaple.A`, `Yuner.A`).
  - Move the corresponding PNG images into their respective class folders.
  - Update the `path_root` variable in Cell 8 to point to your `multi virusi` directory.
- Run the Notebook: Execute all cells sequentially to load the data, train the model, and view the results.