Malware comes in many forms. Sometimes it is a standalone binary; sometimes it is hidden inside otherwise legitimate software. Sometimes it is a binary with malformed metadata and packed bytes meant to throw off the analyst; sometimes it acts as a downloader or launcher for a more malicious binary. Information extracted by static analysis can therefore be less reliable and less accurate than information extracted by dynamic analysis.

Dynamic analysis involves observing the executable’s behavior in a controlled environment. All actions performed by the executable, and any modifications it causes to system settings, are recorded and analyzed afterwards. This process is usually automated with a sandbox such as Cuckoo Sandbox, which logs the behavior.

 

Feature Extraction

The following features can be extracted from a dynamic analysis log:

  1. Registry Keys
    The Windows registry contains configuration information about the system, installed applications and mounted devices. Modifications made to the registry indicate what sort of environment the executable wants to set up for its execution. Therefore, the registry keys read, written, opened and deleted can be useful as a feature.
  2. Files
    The logs record any changes made to the file system. Any files copied, created, written or read, as well as any of these operations that failed, can be used as a feature to judge the executable’s characteristics. Ransomware and lockers, for example, typically show a high number of file system accesses.
  3. DLLs
    Native exported functions used by an executable can also help form a rough idea of the task it is trying to achieve. Therefore, the DLLs loaded by the executable can be used as a feature.
  4. API Calls
    APIs are sets of subroutines that software uses to request services from the operating system and its libraries. The API calls made by the executable give an idea of the functions it is trying to perform. Therefore, all the unique API calls made, whether successful or failed, can be summarized and used as a feature.
  5. IPs and DNS
    Logged network traffic can be used to observe the connections and DNS queries that the executable attempted. These can then be used to fingerprint traffic flows in the pcaps.
  6. API Call Sequence
    The order of API calls made by an executable is the closest one can get to its behavior, since an ordered call sequence can define a particular intent.
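
As a rough illustration, the six feature groups above could be pulled out of a parsed sandbox report along these lines. This is only a sketch: the function name `extract_features` is hypothetical, and the key names (`behavior.summary`, `apistats`, `processes`, `network.hosts`, `network.dns`) are assumptions modelled on the JSON report layout of Cuckoo 2.x, so they may need adjusting for other versions or sandboxes.

```python
from collections import Counter

# Assumed summary keys, modelled on a Cuckoo 2.x report.json; adjust as needed.
REGISTRY_KEYS = ("regkey_read", "regkey_written", "regkey_opened", "regkey_deleted")
FILE_KEYS = ("file_copied", "file_created", "file_written", "file_read", "file_failed")


def extract_features(report):
    """Collect the six feature groups from a parsed report (a plain dict)."""
    behavior = report.get("behavior", {})
    summary = behavior.get("summary", {})
    network = report.get("network", {})

    # Merge the per-process API counters into a single counter.
    api_counts = Counter()
    for per_process in behavior.get("apistats", {}).values():
        api_counts.update(per_process)

    # Keep call order within each process; processes are concatenated in log order.
    api_sequence = [
        call["api"]
        for process in behavior.get("processes", [])
        for call in process.get("calls", [])
    ]

    return {
        "registry": {key: summary.get(key, []) for key in REGISTRY_KEYS},
        "files": {key: summary.get(key, []) for key in FILE_KEYS},
        "dlls": summary.get("dll_loaded", []),
        "api_calls": sorted(api_counts),  # unique API names
        "ips": network.get("hosts", []),
        "dns": [query["request"] for query in network.get("dns", [])],
        "api_sequence": api_sequence,
    }
```

A report saved as JSON can be read with `json.load` and passed straight to the function. Missing sections simply yield empty lists, which keeps the output shape stable across samples.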

Different subsets of these features can be combined to compare the performance of the models.
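
One possible way to experiment with feature subsets is to turn the chosen groups into binary presence vectors, so the same samples can be fed to a model with different group selections. The sketch below assumes each sample is a dict mapping a group name (for example `dlls` or `api_calls`) to a list of strings. The helper names `tokens_for` and `vectorize` are hypothetical and are not taken from the POC.

```python
def tokens_for(sample, groups):
    """Flatten the chosen feature groups into prefixed tokens, e.g. 'dlls:kernel32.dll'."""
    return {f"{group}:{value}" for group in groups for value in sample.get(group, [])}


def vectorize(samples, groups):
    """Build binary presence vectors over the vocabulary of the chosen groups."""
    token_sets = [tokens_for(sample, groups) for sample in samples]
    vocabulary = sorted(set().union(*token_sets))
    vectors = [[int(token in tokens) for token in vocabulary] for tokens in token_sets]
    return vocabulary, vectors


samples = [
    {"dlls": ["a.dll"], "api_calls": ["connect"]},
    {"dlls": ["b.dll"], "api_calls": ["connect", "send"]},
]

# Same samples, two different feature subsets.
dll_vocab, dll_vectors = vectorize(samples, ["dlls"])
all_vocab, all_vectors = vectorize(samples, ["dlls", "api_calls"])
```

Prefixing each token with its group name keeps identical strings from different groups apart, so adding or dropping a group changes only that group's columns.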

The API Call Sequence feature can also be used on its own for behavior fingerprinting. The APIs can be divided into different sets to create a signature, and a language model can then be trained on a dataset of signatures.
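
The idea can be sketched as follows, under stated assumptions. The `API_SETS` grouping is a hypothetical, tiny example (a real one would cover far more calls), and a bigram model with add-one smoothing stands in for whichever language model is actually used. Names such as `to_signature` and `train_bigram_model` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical grouping of API names into sets, each represented by one symbol.
API_SETS = {
    "NtCreateFile": "F", "NtWriteFile": "F", "NtReadFile": "F",
    "RegOpenKeyExW": "R", "RegSetValueExW": "R",
    "connect": "N", "send": "N",
}
UNKNOWN = "U"  # symbol for any API not covered by the grouping


def to_signature(api_sequence):
    """Map an ordered API call sequence to a sequence of set symbols."""
    return [API_SETS.get(api, UNKNOWN) for api in api_sequence]


def train_bigram_model(signatures):
    """Count symbol bigrams over a dataset of signatures."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for signature in signatures:
        padded = ["<s>"] + signature + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def log_likelihood(signature, model):
    """Score a signature with add-one smoothed bigram probabilities."""
    bigrams, contexts, vocab_size = model
    padded = ["<s>"] + signature + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


# Train on signatures from one (made-up) behavior class, then score new ones.
model = train_bigram_model([["F", "F", "R"]] * 3)
seen = log_likelihood(to_signature(["NtCreateFile", "NtWriteFile", "RegOpenKeyExW"]), model)
unseen = log_likelihood(to_signature(["connect", "send", "connect"]), model)
```

A signature resembling the training data scores higher than one built from unfamiliar symbols, which is the property a fingerprinting scheme would rely on. Because the vocabulary size comes from the training data only, scores for symbols never seen in training are approximate.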

A proof of concept (POC) for malware classification using dynamic analysis is available in this repository.