Matrix Multiplication 02 Memory Layout

To create a two-dimensional array an “array of arrays” approach is often suggested (e.g. here, or here and here). At first glance this appears intuitive and straightforward. We’re simply nesting arrays to mimic rows and columns. However, this approach has several pitfalls and is rarely used in practice. In this post I’ll look over both the theoretical and practical implications of using an array of arrays, then discuss how we can do it better.

What is an array of arrays?#

One way to implement a matrix is to use an array to store arrays of row values. In Rust, the vector data structure provides a contiguous, growable array type. It is possible to represent a 3 by 3 matrix as an array of arrays as follows:

let matrix = vec![
    vec![1, 2, 3],
    vec![4, 5, 6],
    vec![7, 8, 9],
];

It is possible to access the desired element simply by indexing, which looks similar to the notation used in mathematics, e.g. we can find $A_{0,0}$ by accessing the $0^{th}$ element of the $0^{th}$ array (remember, Rust uses 0 based indexing).

let a00 = matrix[0][0];
assert_eq!(a00, 1);  // True

How is an array of arrays represented in memory?#

Vectors#

To understand how our data structure works, it is crucial to understand the underlying vector implementation. Vectors are resizable arrays that store their associated data on the heap. A vector consists of three pieces of information:

A pointer to the underlying data
Length
Capacity

The capacity is the amount of space allocated by the vector. This is different from the length, which is the number of elements that the vector holds. What’s important here is that if a vector’s length exceeds its capacity the vector will automatically reallocate the underlying data. In memory a vector with capacity 4 containing the numbers 1 and 2 (and therefore having a length of 2). Would look like this:

TODO: Insert image of memory layout

Vectors of vectors#

Now that we understand how a vector works, let’s consider vectors of vectors. A vector of vector consists of two levels of vectors, an outer vector that contains the inner vectors and the inner vectors themselves, each containing the elements of a row of that matrix. In memory our matrix from the above example looks as follows:

TODO: Insert image of memory layout

An alternative approach#

Before continuing, I’d like to propose an alternative implementation of a matrix. This will allow us to compare and contrast the alternative with our array of arrays approach. The alternative approach is to store the data as one, contiguous, array and using mathematics to correctly index the array.

Let’s consider the 3 x 3 matrix from before. An alternative way to represent this in memory is with a single vector, such as:

let matrix = vec![
    1, 2, 3,
    4, 5, 6,
    7, 8, 9
];

We can now access $A_{i,j}$ by accessing the index at $i * \text{cols} + j$, where $cols$ is the number of columns. For example, to access $A_{1, 1}$, we would need to access the $4^{th}$ element.

let index = 1 * 3 + 1;  // 4
let a11 = matrix[index];
assert_eq!(a11, 5);  // True

Comparing our approaches#

Accessing elements#

But this looks way more complicated. With the vector of vectors approach we didn’t need to do any calculations at all! That’s almost true. We didn’t make any calculations explicitly. But the CPU still did them. Let’s consider the following two functions:

pub fn get_element_vec_of_vec(matrix: &Vec<Vec<f64>>, i: usize, j: usize) -> f64 {
    matrix[i][j]
}

pub fn get_element_linear(matrix: &Vec<f64>, i: usize, j: usize, cols: usize) -> f64 {
    let index = i * cols + j;
    matrix[index]
}

Analysing Assembly#

Vector of vectors#

The get_element_vec_of_vec compiles to the following assembly:

push    rax
mov     rax, qword ptr [rdi + 16]
cmp     rax, rsi
jbe     .LBB0_3
mov     rax, qword ptr [rdi + 8]
lea     rcx, [rsi + 2*rsi]
mov     rsi, qword ptr [rax + 8*rcx + 16]
cmp     rsi, rdx
jbe     .LBB0_4
lea     rax, [rax + 8*rcx]
mov     rax, qword ptr [rax + 8]
movsd   xmm0, qword ptr [rax + 8*rdx]
pop     rax
ret

That’s a lot of work!

At a high level, the CPU has to:

Get the address to matrix[i] in memory
Bounds check matrix[i]
Load matrix[i]
Get the address to matrix[i][j] in memory
Bounds check matrix[i][j]
Load matrix[i][j]

Linear Memory#

What if we use a linear memory layout? This compiles to the following assembly:

imul    rsi, rcx
add     rsi, rdx
mov     rax, qword ptr [rdi + 16]
cmp     rsi, rax
jae     .LBB0_2
mov     rax, qword ptr [rdi + 8]
movsd   xmm0, qword ptr [rax + 8*rsi]
ret

That’s significantly less work. Although we need to do some arithmetic, this can be done very quickly by the CPU. And only need to calculate the address, perform a bounds check and load the memory once.

Profiling#

We’ve seen that there is a difference in the compiled assembly, but does this make a measurable difference in performance? I’ve set up a benchmark using criterion.

Method	Time (ns)
Array of Arrays	1.2270
Linear	0.8156

That’s about 2/3 time savings. Although 400ps may seem trivial, these savings add up when performed millions or billions of times.

Memory Management#

With arrays of arrays, each sub-array needs to be allocated and managed separately. This increases the complexity of memory management, making operations like copying or resizing more cumbersome. For instance, copying a 2D array of arrays requires deep copying each sub-array, which is more complex and error-prone compared to copying a single, contiguous array.

Let’s consider trying to clone each of our matrix representations. Luckily for us, Rust abstracts away the manual process of copying memory. All we have to do is call the .clone() method on our matrix. After running the benchmark for 4096 by 4096 matrices here are the results:

Method	Time (ms)
Linear	43.296
Array of Arrays	45.199

The linear layout is slightly (but statistically significantly) faster. What’s interesting to me is the variability of times. Let’s take a look at those:

Method	high mild outliers	high severe outliers
Linear	0	0
Array of Arrays	4	6

Out of our 100 measurements, the array of array approach had a total of 10 outliers. This makes sense as the OS had to assign memory once for each row (4096 times) instead of once. And each of these allocations introduces some variance.

Conclusion#

In this blog post, I have introduced the array of arrays approach to represent a matrix. This is often suggested to beginners, however this approach has several draw backs. I have presented some of these drawbacks by contrasting the array of arrays approach with a linear memory, highlighting some of the differences in theory, by analysing the compiled assembly and by timing.