MalwareClassification

Converts malware binary files into grayscale images, then trains a CNN to classify them into malware families.

Malware Classification using Byte-to-Image Conversion and CNN

This project demonstrates a method for malware classification by converting malware binary files (.bytes) into grayscale images and then training a Convolutional Neural Network (CNN) to classify them into their respective families.

This approach is based on the principle that different malware families exhibit distinct textural patterns in their binary representation, which a CNN can learn to identify.

Project Pipeline

The notebook is structured into two main phases:

Preprocessing: This stage involves two steps:
- Converting raw .bytes files into PNG images.
- Loading these images into a dataset, analyzing class distribution, and preparing them for training.
Modeling & Evaluation:
- Defining a CNN architecture using Keras.
- Training the model on the image dataset, accounting for class imbalance.
- Evaluating the model's performance and analyzing its results with a confusion matrix.

1. Preprocessing

.bytes to PNG Conversion

The first part of the project focuses on malware visualization.

Input: The script reads raw malware files from a directory (D:\ASSEMBLY\data_raw in the notebook).
File Format: It's designed to parse .bytes files. These files are expected to be hexadecimal dumps, where each line contains an address offset followed by 16 hexadecimal values (bytes), e.g., 00401000 FF E0 15 ....
Conversion Process:
1. Each line is read, and the 16 hex values are extracted (the address is ignored).
2. Each hex value (e.g., FF, E0) is converted into its integer equivalent (0-255). Any missing bytes, represented as ??, are converted to 0.
3. This creates a 1D stream of byte values (pixels).
4. To create an image, this 1D stream is reshaped into a 2D matrix. The width is the square root of the file's total byte count rounded up to the next power of 2, which keeps the image roughly square. The height then follows from dividing the byte count by this width.
5. The resulting 2D NumPy array is then saved as a grayscale PNG image using the Pillow (PIL) library.
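
The conversion steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function names are hypothetical, "next power of 2" is read as the smallest power of two at or above the rounded-up square root, and zero-padding the last row is an assumption. Pillow's `Image.frombytes` is used here in place of a NumPy array so the helpers need only the standard library.

```python
import math


def parse_bytes_lines(lines):
    """Flatten hex-dump lines into a list of ints in 0-255; '??' becomes 0."""
    pixels = []
    for line in lines:
        tokens = line.split()
        for token in tokens[1:]:  # tokens[0] is the address offset, ignored
            pixels.append(0 if token == "??" else int(token, 16))
    return pixels


def image_width(n_bytes):
    """Smallest power of two >= ceil(sqrt(n_bytes)) (assumed reading of the rule)."""
    if n_bytes < 1:
        raise ValueError("need at least one byte")
    root = math.isqrt(n_bytes - 1) + 1  # integer ceil(sqrt(n_bytes))
    return 1 << (root - 1).bit_length()


def to_image_bytes(pixels):
    """Return (width, height, data), zero-padding the last row (an assumption)."""
    width = image_width(len(pixels))
    height = math.ceil(len(pixels) / width)
    padded = pixels + [0] * (width * height - len(pixels))
    return width, height, bytes(padded)


def save_png(bytes_path, png_path):
    """Convert one .bytes file to a grayscale ('L' mode) PNG using Pillow."""
    from PIL import Image  # imported lazily; only this function needs Pillow

    with open(bytes_path, encoding="ascii", errors="ignore") as handle:
        width, height, data = to_image_bytes(parse_bytes_lines(handle))
    Image.frombytes("L", (width, height), data).save(png_path)
```

For example, a 6-byte input gives a rounded-up square root of 3, so the width becomes 4 and the image is 4x2 with two padding pixels.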

Dataset Loading and Preparation

After the images are generated and manually sorted into subfolders by class, they are loaded for training.

Loading: ImageDataGenerator.flow_from_directory is used to load the images from the parent directory (D:\ASSEMBLY\multi virusi).
Dataset Stats: The script found 4725 images belonging to 22 classes (malware families).
Image Specs: All images are resized to a uniform (64, 64) and converted to 3-channel (RGB) format.
Class Imbalance: A bar chart of the class distribution shows that the dataset is highly imbalanced, with two classes (Allaple.A and Allaple.L) containing a significant majority of the samples.
Train/Test Split: The dataset is split into a 70% training set (3307 images) and a 30% test set (1418 images) using sklearn.model_selection.train_test_split.
Normalization: Pixel values are normalized from 0-255 to 0.0-1.0 by dividing by 255.
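
The split and normalization can be sketched as below. The arrays are random placeholders standing in for the loaded dataset (tiny 8x8 images rather than 64x64, to keep the example light), and `random_state=42` is an arbitrary choice, not a value taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 4725 small RGB images and a class id in 0..21 for each.
images = rng.integers(0, 256, size=(4725, 8, 8, 3), dtype=np.uint8)
labels = rng.integers(0, 22, size=4725)

# 70/30 split; sklearn rounds the test share up, giving 3307 / 1418.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.3, random_state=42
)

# Scale pixel values from 0-255 to 0.0-1.0.
X_train = X_train.astype("float32") / 255.0
X_test = X_test.astype("float32") / 255.0

print(X_train.shape[0], X_test.shape[0])  # 3307 1418
```

With classes this imbalanced, passing `stratify=labels` to `train_test_split` would keep the class proportions equal in both sets; whether the notebook does so is not stated.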

2. CNN Model and Training

Model Architecture

A Sequential CNN model is built using tensorflow.keras. The architecture is as follows (layer 9 is a redundant second Flatten, listed here because it is present in the code):

Layer	Type	Output Shape	Parameters
1	`Conv2D`	(None, 62, 62, 30)	840
2	`MaxPooling2D`	(None, 31, 31, 30)	0
3	`Conv2D`	(None, 29, 29, 15)	4,065
4	`MaxPooling2D`	(None, 14, 14, 15)	0
5	`Dropout`	(Rate: 0.25)	0
6	`Flatten`	(None, 2940)	0
7	`Dense`	(None, 128)	376,448
8	`Dropout`	(Rate: 0.25)	0
9	`Flatten`	(None, 128)	0
10	`Dense`	(None, 128)	16,512
11	`Dropout`	(Rate: 0.5)	0
12	`Dense`	(None, 50)	6,450
13	`Dense`	(None, 22)	1,122
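
A Keras definition consistent with the table above might look like the sketch below. The 3x3 kernels and 2x2 pooling are inferred from the output shapes and parameter counts; the ReLU and softmax activations and the `build_model` name are assumptions, not taken from the notebook.

```python
from tensorflow.keras import Sequential, layers


def build_model(num_classes=22, input_shape=(64, 64, 3)):
    """Build a CNN whose layer shapes match the architecture table."""
    return Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(30, (3, 3), activation="relu"),   # -> (62, 62, 30)
        layers.MaxPooling2D((2, 2)),                    # -> (31, 31, 30)
        layers.Conv2D(15, (3, 3), activation="relu"),   # -> (29, 29, 15)
        layers.MaxPooling2D((2, 2)),                    # -> (14, 14, 15)
        layers.Dropout(0.25),
        layers.Flatten(),                               # -> (2940,)
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.25),
        layers.Flatten(),                               # redundant: input is already flat
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(50, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```

Calling `build_model().summary()` prints the per-layer shapes and parameter counts for comparison with the table.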

Total params: 405,437
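
The parameter counts in the table can be cross-checked by hand. The sketch below assumes 3x3 convolution kernels, which is the size implied by the 64 to 62 reduction in output shape; the helper names are illustrative.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


counts = [
    conv2d_params(3, 3, 3, 30),        # layer 1:  840
    conv2d_params(3, 3, 30, 15),       # layer 3:  4,065
    dense_params(14 * 14 * 15, 128),   # layer 7:  376,448
    dense_params(128, 128),            # layer 10: 16,512
    dense_params(128, 50),             # layer 12: 6,450
    dense_params(50, 22),              # layer 13: 1,122
]
total = sum(counts)
print(total)  # 405437
```

Pooling, Dropout, and Flatten layers have no trainable parameters, so they do not appear in the sum.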

Training

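Compilation: The model is compiled using the adam optimizer and binary_crossentropy loss, with custom metrics for F1-score, precision, recall, and AUC. Note that binary_crossentropy scores each of the 22 outputs as an independent binary decision; categorical_crossentropy is the more conventional loss for single-label, multi-class classification.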
Class Weighting: To combat the class imbalance identified during preprocessing, sklearn.utils.class_weight.compute_class_weight is used with class_weight='balanced'. This calculates weights to give more importance to under-represented classes during training.
Fitting: The model is trained for 100 epochs.
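
For reference, the 'balanced' heuristic in scikit-learn assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula with the standard library only; the function name and the toy labels are illustrative, not taken from the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy example: class 0 has 8 samples, class 1 has only 2.
weights = balanced_class_weights([0] * 8 + [1] * 2)
print(weights)  # {0: 0.625, 1: 2.5}
```

A dictionary of this form, keyed by class index, is what Keras accepts through the `class_weight` argument of `model.fit`.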

3. Results and Analysis

Accuracy: The trained model achieves a final accuracy of ~92.2% on the test set.
Confusion Matrix: A confusion matrix is generated to visualize the model's performance on a per-class basis.
Analysis:
- The model performs well on most classes.
- It shows confusion between certain related families, such as Swizzor.gen!E and Swizzor.gen!I.
- It also struggles with classes that have very few samples, like Autorun.K, which is frequently misclassified as Yuner.A. This is a typical side effect of a highly imbalanced dataset.
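
One way to pull the most frequent mix-ups out of the predictions is to count the (true, predicted) pairs that disagree. The sketch below uses only the standard library; the function name and the placeholder labels are made up for illustration and are not results from the notebook.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs where the two differ."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Placeholder labels: fam_a is mistaken for fam_b twice, fam_c for fam_d once.
y_true = ["fam_a", "fam_a", "fam_b", "fam_c", "fam_d"]
y_pred = ["fam_b", "fam_b", "fam_b", "fam_d", "fam_d"]

pairs = top_confusions(y_true, y_pred, k=2)
print(pairs)  # [(('fam_a', 'fam_b'), 2), (('fam_c', 'fam_d'), 1)]
```

Each pair corresponds to one off-diagonal cell of the confusion matrix, so the top entries point directly at the families worth inspecting.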

How to Run

Clone the repository.

Install dependencies:

pip install tensorflow numpy scipy pillow matplotlib scikit-learn seaborn pandas

Prepare Data:
- Create a directory (e.g., data_raw) and place your .bytes malware files inside it.
- Update the root variable in Cell 3 to point to this directory.
- Run the Preprocessing - Converting to images section (Cells 1-5).
- After the images are generated, create a new directory (e.g., multi virusi).
- Inside this directory, create sub-directories for each malware class (e.g., Allaple.A, Yuner.A).
- Move the corresponding PNG images into their respective class folders.
- Update the path_root variable in Cell 8 to point to your multi virusi directory.
Run the Notebook: Execute all cells sequentially to load the data, train the model, and view the results.

dbogdanm/MalwareClassification
