Classifying malware traffic has been an active research topic ever since machine learning was introduced to network security. A number of published studies explore techniques for extracting features from traffic and using them for classification.

Before machine learning, traffic was classified by structured rule matching, with port inspection and DPI (Deep Packet Inspection) being the most popular methods. These methods were typically rendered ineffective once traffic was encrypted.

With machine learning came classical statistical and behavioral approaches to classification. These were fairly effective, since they work well on patterns extracted from data using carefully selected features.

Representation learning is a methodology that lets a system form “representations” out of raw data and learn features automatically, sparing us the painstaking task of hand-designing features.

Given a set of raw data, the raw bytes can be laid out as the pixels of an image. Since structured bytes give well-formed patterns, these patterns become visible in the image, which allows meaningful feature representations to be learned from it. Convolutional models are therefore a natural choice for this task.
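As a rough illustration of how bytes can become an image, here is a minimal sketch in Python. The function name bytes_to_image, the 28 x 28 image size, and the choice to zero-pad short inputs and truncate long ones are assumptions made for this example; the text above does not fix any of them.

```python
def bytes_to_image(raw: bytes, side: int = 28) -> list[list[int]]:
    """Lay raw bytes out as a side x side grayscale image, one byte per pixel.

    Assumption: inputs longer than side*side bytes are truncated and
    shorter ones are padded with zero bytes.
    """
    n = side * side
    fixed = raw[:n].ljust(n, b"\x00")  # exactly n bytes
    # Each row of the image is a consecutive run of `side` bytes (values 0-255).
    return [list(fixed[r * side:(r + 1) * side]) for r in range(side)]


image = bytes_to_image(b"GET / HTTP/1.1\r\n")
print(len(image), len(image[0]))  # rows and columns of the image
```

Each pixel value is simply the value of one byte, so repeated byte structures (headers, padding, protocol keywords) show up as repeated visual patterns. In practice the nested list would be converted to an array type before being fed to a convolutional model.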

 

Traffic Granularity

The traffic capture can be split into different levels of granularity.

Let P = {p1, …, pn} be a set of packets.

The ith packet in P is pi = (xi, bi, ti)

where,

xi is the 5-tuple of source IP, source port, destination IP, destination port and transport-level protocol.

bi is the size of the packet.

ti is the time of transmission of the packet.
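The packet definition above maps directly onto a small data structure. The following is only a sketch: the class names FiveTuple and Packet, the field names, and the sample addresses are hypothetical and chosen for illustration.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """x: source IP, source port, destination IP, destination port, protocol."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


class Packet(NamedTuple):
    """One packet described by its tuple x, its size b and its time t."""
    x: FiveTuple
    b: int    # size of the packet in bytes
    t: float  # time of transmission in seconds


# A made-up packet from a client to an HTTPS server.
p1 = Packet(FiveTuple("10.0.0.5", 49152, "10.0.0.9", 443, "TCP"), b=60, t=0.0)
print(p1.x.dst_port, p1.b, p1.t)
```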

Raw Traffic

Every packet is considered individually. This granularity has no specific structure, since any packet can belong to any device’s network flow. It can be useful for analyzing traffic in a generic sense, but not for fingerprinting malicious traffic.

Flow

A “flow” is the set of packets travelling from one device to another, that is, all packets that share the same tuple x. A flow can be defined as f = (x, b, d, t),

where, 

x is the tuple common to the whole flow: the tuple xi of every packet in this subset is exactly the same, since all the packets have the same origin and destination.

b = ∑i bi, summed over all packets i in the flow, is the total size of the flow.

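d = tn - t1, the duration of the flow, where t1 and tn are the times of its first and last packets.

t is the starting time of the transmission, that is, the time t1 of the first packet in the flow.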
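The flow definition can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: packets are assumed to be plain (x, b, t) tuples, and the function name build_flows and the sample data are hypothetical.

```python
from collections import defaultdict


def build_flows(packets):
    """Group (x, b, t) packets by their tuple x and summarise each group
    as a flow f = (x, b, d, t)."""
    groups = defaultdict(list)
    for x, b, t in packets:
        groups[x].append((b, t))

    flows = []
    for x, items in groups.items():
        times = [t for _, t in items]
        total_size = sum(b for b, _ in items)  # b: sum of packet sizes
        start = min(times)                     # t: time of the first packet
        duration = max(times) - start          # d: last time minus first time
        flows.append((x, total_size, duration, start))
    return flows


client = ("10.0.0.5", 49152, "10.0.0.9", 443, "TCP")
server = ("10.0.0.9", 443, "10.0.0.5", 49152, "TCP")
packets = [(client, 60, 0.0), (server, 1500, 0.1), (client, 52, 0.25)]
flows = build_flows(packets)
```

Here the two client packets form one flow and the server's reply forms another, because the direction is part of the tuple x.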

Session

A session is bidirectional: it is the back-and-forth flow of traffic between two devices. It is therefore defined like a flow, except that the first element x is not the same for every packet. It takes one of two tuples, since the source IP/port and destination IP/port pairs are swapped depending on the direction of the packet.
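One simple way to treat both directions as the same session is to build a direction-independent key by sorting the two endpoints. The sketch below assumes packets are plain (x, b, t) tuples; the names session_key and build_sessions are hypothetical, and sorting endpoints is just one possible normalization.

```python
from collections import defaultdict


def session_key(x):
    """Return a key that is identical for both directions of a conversation."""
    src_ip, src_port, dst_ip, dst_port, protocol = x
    # Sorting the two (IP, port) endpoints removes the notion of direction.
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (first, second, protocol)


def build_sessions(packets):
    """Group (x, b, t) packets into sessions keyed by session_key(x)."""
    sessions = defaultdict(list)
    for x, b, t in packets:
        sessions[session_key(x)].append((x, b, t))
    return dict(sessions)


client = ("10.0.0.5", 49152, "10.0.0.9", 443, "TCP")
server = ("10.0.0.9", 443, "10.0.0.5", 49152, "TCP")
sessions = build_sessions([(client, 60, 0.0), (server, 1500, 0.1), (client, 52, 0.25)])
```

The three packets, which would form two separate flows, end up in a single session.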

 

Layers

The data in the application layer, that is, the 7th layer of the OSI model, harbors characteristics intrinsic to a particular application. For example, SMTP-structured data represents email traffic, whereas HTTP-structured data represents browser traffic.

However, data from other layers, such as the port number, can also help deduce the application to which a traffic flow belongs. This can be misleading, though, since applications are not guaranteed to use their default ports.

Thus, another variation of the dataset can be created where only the 7th layer is considered.
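To make the layer-7-only variation concrete, here is a hedged sketch that extracts the application payload from a single captured frame. It assumes an Ethernet II frame carrying IPv4 with TCP or UDP and ignores everything else (VLAN tags, IPv6, fragmentation); the function name application_payload and the hand-built sample frame are hypothetical.

```python
import struct


def application_payload(frame: bytes) -> bytes:
    """Return the bytes that follow the transport header of an
    Ethernet II / IPv4 / TCP-or-UDP frame, or b"" if the frame is not handled."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":  # EtherType must be IPv4
        return b""
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4                     # IPv4 header length in bytes
    total_len = struct.unpack("!H", ip[2:4])[0]  # excludes Ethernet padding
    protocol = ip[9]
    segment = ip[ihl:total_len]
    if protocol == 6 and len(segment) >= 20:     # TCP: variable header length
        return segment[(segment[12] >> 4) * 4:]
    if protocol == 17 and len(segment) >= 8:     # UDP: fixed 8-byte header
        return segment[8:]
    return b""


# A hand-built TCP frame with dummy addresses, for illustration only.
payload = b"GET / HTTP/1.1\r\n\r\n"
eth = bytes(12) + b"\x08\x00"
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0, 64, 6, 0,
                        bytes([10, 0, 0, 5]), bytes([10, 0, 0, 9]))
tcp_header = struct.pack("!HHIIBBHHH", 49152, 80, 0, 0, 0x50, 0x18, 1024, 0, 0)
frame = eth + ip_header + tcp_header + payload
extracted = application_payload(frame)
```

In a real pipeline a packet-parsing library would normally do this work; the manual parsing here only shows which bytes the layer-7 variation keeps.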

NOTE: IP addresses and MAC addresses are removed, since they might otherwise end up being counted as features. They are only required to identify the origin and the destination of the traffic and have no bearing on the kind of traffic that is present. Moreover, within the same network, they are not even a distinguishing piece of information.
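One possible way to remove the addresses is to overwrite them in the raw frame so that the remaining bytes keep their positions. The sketch below assumes an Ethernet II frame carrying IPv4 and zeroes the address fields; the function name strip_addresses is hypothetical, and zeroing is only one option (replacing the addresses with random values would be another).

```python
def strip_addresses(frame: bytes) -> bytes:
    """Zero the MAC addresses and, for IPv4, the IP addresses of an
    Ethernet II frame, leaving every other byte untouched."""
    if len(frame) < 14:
        return frame                  # too short to hold an Ethernet header
    buf = bytearray(frame)
    buf[0:12] = bytes(12)             # destination MAC + source MAC
    if buf[12:14] == b"\x08\x00" and len(buf) >= 34:
        buf[26:34] = bytes(8)         # IPv4 source + destination address
    return bytes(buf)


# Dummy 40-byte frame with an IPv4 EtherType, for illustration only.
sample = bytearray(range(1, 41))
sample[12:14] = b"\x08\x00"
cleaned = strip_addresses(bytes(sample))
```

Because the frame length and byte offsets are preserved, the cleaned bytes can still be laid out as an image in the same way as the original ones.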

Citation:

Wang, W., Zhu, M., Zeng, X., Ye, X., & Sheng, Y. (2017). Malware traffic classification using convolutional neural network for representation learning. 2017 International Conference on Information Networking (ICOIN), 712-717. doi:10.1109/ICOIN.2017.7899588