Skip to content

Stream Specialization

Zhengrong Wang edited this page Dec 21, 2021 · 4 revisions

Stream Specialization Support

Stream specialization is perhaps the most important result of Gem-Forge framework. Here we provide a high-level documentation of the ISA instructions, compiler analysis and gem5 simulation support.

Stream ISA

Stream Types

In the ISA, a stream is defined as a decoupled portion of the program which together generates memory addresses. It is a high-level representation and defines the pattern with a sequence of addresses. There are many types of streams, and can be roughly classified into three categories:

  • Affine streams A[i]
    Affine streams have regular access pattern and is the basic build block for more complex streams. This example is adopted from 2D stencil workload.
for (int i = 0; i < N; +=i) {
  for (int j = 0; j < M; ++j) {
    out[i][j] = min(p[i][j] + in[i][j], in[i - 1][j]);
  }
}
  • Indirect streams A[B[i]]
    Indirect streams are more complicated, as they generate data-dependent addresses, and may formulate multiple levels for indirection. It is common in sparse workloads, e.g. graph processing. This example is adopted from page-rank, and contains the indirect access weights[edges[i]].
for (int i = 0; i < N; ++i) {
  float C = contrib[i];
  for (int j = 0; j < edges[i]; ++j) {
    int v = edges[i];
    weights[v] += C;
  }
}
  • Pointer-chasing streams A = A->next
    Pointer-chasing streams can be viewed as a special kind of self-dependent indirect streams. They are common in certain data structures, e.g. linked list, binary search tree. Due to the serial access nature, they are usually on the critical path. This example searches the linked list for a certain value.
bool found = false;
Value val = 0;
do {
  val = head->val;
  head = head->next;
  found = found || (val == key);
} while (val != key && head != NULL);

Key Concepts

Now we have good intuition on what are streams and how they appear in programs, we can introduce some key concepts of stream ISA. A stream is configured via s_config instruction before entering the loop and terminated by s_end instructions. Once configured, the microarchitecture can start to generate stream addresses and prefetch stream data. Stream data is exposed to normal user instruction via pseudo-reg, which can be implemented as a normal architectural register by the compiler, but is renamed to the stream data instead of a normal physical register in the microarchitecture. Since a stream defines a sequence of addresses and values, we use a s_step instruction to explicitly advance the stream pseudo-reg to the next value. This example shows how the indirect loop can be rewritten into stream ISA, and we discuss each part in more detail.

s_config(s_i, s_contrib);
while (s_i < N) {
  float C = s_contrib;
  s_config(s_j, s_edges, s_weights);
  while (s_j < N) {
    s_weights += C;
    s_step(s_j);
  }
  s_end(s_j, s_edges, s_weights);
  s_step(s_i);
}
s_end(s_i, s_contrib);
  • Stream Configure
    The first thing to notice is that in addition to memory streams, we also configure induction variable streams, e.g. s_i and s_j. This helps further reduce the instruction overhead. The induction variable streams serve as the root stream for streams in the loop dependent on it.
    Since there are two levels of loops, we have two s_config and s_end pairs to configure and terminate streams. Although here we only show a single s_config instruction for simplicity, in a real system this is implemented as a sequence of stream instructions:
    The first s_config instruction takes a pointer to the real configuration and triggers the hardware to read the stream configuration (usually close in the binary layout to increase the hit rate in the instruction cache). This is followed by zero or multiple s_input instructions to take in any runtime input that can not be determined at compilation time. For example, if the array is dynamically allocated, the starting address is recorded here. Finally, a s_ready instruction marks the end of stream configuration and serves as a barrier that streams are ready to be issued. See below for instruction encoding details.
    Notice that we explicitly terminate a stream via a s_end instruction. This enables streams with a data-dependent trip count. The programmer can even manually configure a stream to span multiple loops, which unlocks more stream opportunities.
s_config(config_ptr);
s_input(id1, val1);
s_input(id2, val2);
// ...
s_ready();
  • Stream Usage
    In the example, we can see that each stream is represented as a pseudo-reg, e.g. s_contrib, and is can be used by normal core user instructions.

  • Stream Step
    Streams are advanced by the s_step instruction. Notice that stepping on the root induction variable stream will also step all the dependent streams. This explicit stepping also enables data-dependent stream usage.

Instruction Encoding

So far we provide x86 implementation of stream instructions. They are defined in llvm/llvm/lib/Target/X86/X86InstrSSP.td. We implement multiple versions of the same instruction for different data types. Note that we use AVX-512 encoding for vectorized stream_load instructions.

Compiler Analysis

Stream specialization is implemented in transform/src/stream. These are the important files:

  • StaticStream.cpp
    This defines the root class for streams, and is derived by StaticMemStream and StaticIndVarStream, which stands for memory access streams (load, store, atomics) and induction variable streams respectively.
  • StaticStreamRegionAnalyzer.cpp
    This is the key analysis to recognize streams. It first instantiates a stream for every phi nodes in the loop header block and memory accesses in the loop. Then it analyzes the stream pattern and chooses qualified streams.
  • execution/StreamExecutionTransformer.cpp
    After streams are chosen, StreamExecutionTransformer is in charge of transforming the program to remove redundant address generation and insert new stream instructions. Stream instructions are implemented as intrinsics in LLVM IR (llvm/llvm/include/llvm/IR/Intrinsics.td) and the real x86 instructions are defined in llvm/llvm/lib/Target/X86/X86InstrSSP.td.
  • StreamPassOptions.cpp
    This file defines the key options to enable/disable features for stream analysis, e.g. configure streams at outer loops.
  • StreamMessage.proto
    This Protobuf file defines the stream configuration, which is used to configure the stream engine during simulation.

Gem5 Simulation Support

We also implement stream specialization in Gem5. They are mostly located here (omitting the gem5/src prefix):

  • cpu/gem_forge/accelerator/stream
    This folder contains most of the stream implementation. The most important file is stream_engine.cc, which is the core stream engine.
  • cpu/gem_forge/accelerator/stream/cache
    This folder contains the implementation of stream engines in the cache, e.g. L2 and L3 stream engine. This is used for stream floating.
  • mem/protocol/stream
    This folder contains the MESI coherence protocol with stream support.

Clone this wiki locally