Static analysis is the analysis of an executable file on a structural basis, without executing it in a controlled environment. It examines the executable's static attributes, such as its sections and memory characteristics.

Static analysis of a PE (Portable Executable) allows extraction of metadata that is useful in further analysis, such as section names, imported DLLs and embedded strings, which gives an early idea of what the binary in question does. Malicious binaries with unstructured or improperly formatted metadata (done to achieve sophisticated levels of obfuscation and anti-debugging/anti-reversing defenses) can raise suspicion, since a benign PE generally has well-formed, valid metadata. For this reason, static analysis has been a popular approach to malware detection in PEs. Since the binary does not need to be executed, the approach is also much more lightweight and resource-conserving, which allows security teams and researchers to quickly perform a preliminary analysis.
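As a rough illustration of how little is needed to read this kind of metadata, the sketch below pulls the COFF header fields and section names out of raw bytes using only Python's standard struct module. The function name parse_pe_headers and the layout of the returned dictionary are assumptions made for this example, not taken from any particular tool; the field offsets follow the published PE/COFF layout. Import parsing is deliberately left out.

```python
import struct

def parse_pe_headers(data: bytes) -> dict:
    """Read the COFF header and section table from raw PE bytes (hypothetical helper)."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew (offset 0x3C in the DOS header) points at the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _n_symbols,
     opt_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    # The section table starts right after the optional header;
    # each entry is 40 bytes, of which the first 24 are read here.
    table = pe_offset + 4 + 20 + opt_header_size
    sections = []
    for i in range(n_sections):
        name, vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<8sIIII", data, table + 40 * i)
        sections.append({
            "name": name.rstrip(b"\0").decode("ascii", "replace"),
            "virtual_size": vsize,
            "virtual_address": vaddr,
            "raw_size": raw_size,
            "raw_pointer": raw_ptr,
        })
    return {
        "machine": machine,
        "timestamp": timestamp,
        "characteristics": characteristics,
        "sections": sections,
    }
```

A truncated or malformed file makes struct.unpack_from raise struct.error, which is itself a useful signal given how often malicious samples carry invalid metadata. For real samples, a dedicated parser such as the pefile library covers the many edge cases this sketch ignores.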

The metadata extracted from PEs for use as features, however, is often unreliable because it varies widely and may be invalid.

 

Feature Extraction

The following features can be extracted from a PE:

  1. General File Information
    General information about the binary, such as its size and other basic information parsed from the PE header: the virtual size of the binary, resources, relocations and symbols.
  2. Header Information
    We parse the different headers and extract the data points relevant to each. For example, we can extract timestamps and image characteristics from the COFF header, and DLL characteristics and linker versions from the Optional header.
  3. Imported Functions
    Parsing the IAT (Import Address Table) tells us which functions the binary imports from each library it uses. These strings can be hashed and used as features. The total number of imported functions is recorded under General File Information.
  4. Exported Functions
    The functions exported by the binary can give us a rough idea of whether it acts as a support library for some other binary, and of what it can do. The total number of exported functions is recorded under General File Information.
  5. Section Information
    Properties of each section, such as its size, entropy and list of strings, can help capture the characteristics of that section.
  6. Byte Histogram
    A histogram can be generated from all the bytes of the binary and normalized to be used as a feature.
  7. Byte-Entropy Histogram
    A histogram can be generated from byte values paired with entropy, approximating the joint distribution of byte value and entropy. It can then be normalized and used as a feature.
  8. Strings
    Collecting runs of consecutive printable bytes can be extremely unreliable, since any random byte combination (of length at least 5) qualifies as a string. However, strings provide information distinct from the byte histogram, since they can capture occurrences of URLs, registry keys and file names.
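The hashing mentioned under Imported Functions can be sketched as follows: each library and function pair is hashed into one of a fixed number of buckets, so that any import list becomes a vector of the same length. This is a minimal sketch of the general hashing trick; the helper name hash_imports, the bucket count of 256 and the choice of SHA-256 are assumptions made for illustration, not details from the text above.

```python
import hashlib

def hash_imports(imports, n_buckets=256):
    """Map (dll, function) pairs to a fixed-length count vector (hypothetical helper)."""
    vector = [0.0] * n_buckets
    for dll, func in imports:
        # DLL names are case-insensitive on Windows, so normalize them;
        # function names are kept as-is.
        token = f"{dll.lower()}:{func}".encode("utf-8")
        digest = hashlib.sha256(token).digest()
        # Use the first four bytes of the digest to pick a bucket.
        bucket = int.from_bytes(digest[:4], "little") % n_buckets
        vector[bucket] += 1.0
    return vector

# Example input as it might come out of an import parser.
example = [
    ("KERNEL32.dll", "CreateFileA"),
    ("KERNEL32.dll", "WriteFile"),
    ("WS2_32.dll", "connect"),
]
features = hash_imports(example)
```

Different pairs can collide in the same bucket; a larger bucket count reduces collisions at the cost of a longer feature vector.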

These features can be used to form a general idea about the PE before executing it.
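The byte-level features (byte histogram, byte-entropy histogram and strings) need no PE parsing at all, since they work on the raw bytes. The sketch below shows one possible way to compute them with the standard library. All function names are hypothetical, and the 2048-byte window, 1024-byte step and 16x16 binning in byte_entropy_histogram are assumptions chosen for illustration; only the minimum string length of 5 comes from the list above.

```python
import math
import re
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Normalized frequency of each of the 256 byte values."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(data).values())

def byte_entropy_histogram(data: bytes, window=2048, step=1024) -> list:
    """16x16 normalized histogram of (window entropy bin, byte high nibble)."""
    hist = [[0] * 16 for _ in range(16)]
    for start in range(0, max(len(data) - window, 0) + 1, step):
        block = data[start:start + window]
        if not block:
            break
        # Entropy lies in [0, 8]; scale to 16 bins and clamp the top edge.
        row = min(int(shannon_entropy(block) * 2), 15)
        for b in block:
            hist[row][b >> 4] += 1
    total = sum(map(sum, hist)) or 1
    return [[c / total for c in row] for row in hist]

def printable_strings(data: bytes, min_len=5) -> list:
    """Runs of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]
```

The same shannon_entropy function can be applied to the raw bytes of a single section to obtain the per-section entropy listed under Section Information.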

Sample structure info after running a PE through a parser is shown in this file.

A proof of concept (POC) for malware classification using static analysis is in this repository.