## performance – 4×4 double precision matrix multiply using AVX intrinsics (inc. benchmarks)

An optimised 4×4 double precision matrix multiply using intel AVX intrinsics. Two different variations.

Gist

For quick benchmark (with a compatible system) copy paste the command below. Runs tests on clang and gcc on optimisation levels 0 -> 3. Runs a naive matrix multiplication NORMAL as a reference.

``````curl -OL https://gist.githubusercontent.com/juliuskoskela/2381831c86041eb2ebf271011db7b2bf/raw/67818fad255f6a682cba6c28394ac71df0274784/run.sh
bash run.sh >> M4X4D_BENCHMARK_RESULTS.txt
cat M4X4D_TEST_RESULTS.txt

``````

Quick result MacBook Pro 2,3 GHz 8-Core Intel Core i9:

``````(best)NORMAL
gcc -O3 -march=native -mavx
time: 0.170310

gcc -O3 -march=native -mavx
time: 0.002334
``````

Quick result Linux AMD Ryzen 3 3100 4-Core:

``````(best)NORMAL
gcc -O1 -march=native -mavx
time: 0.002570

(best)AVX MUL + MUL + ADD
clang -O2 -march=native -mavx
time: 0.002572
``````

Results indicate that there is some discrepancy in optimisation between gcc and clang and on Linux and Mac OS. On Mac OS neither compiler is able to properly optimise the normal version. On Linux clang still didn’t know how to optimise further, but gcc Managed to find the same speeds as we hit with the intrinsics version BUT only on -O1 and -O2 and NOT on -O3.

Version of the matrix multiplication which uses one multiply and
one multiply + add instruction in the inner loop. Inner loop is unrolled

``````typedef union u_m4d
{
__m256d m256d(4);
double  d(4)(4);
}   t_m4d;

// Example matrices.
//
// left(0) (1 2 3 4)  | right(0) (4 1 1 1)
// left(1) (2 2 3 4)  | right(1) (3 2 2 2)
// left(2) (3 2 3 4)  | right(2) (2 3 3 3)
// left(3) (4 2 3 4)  | right(3) (1 4 4 4)

static inline void m4x4d_avx_mul(
t_m4d *restrict dst,
const t_m4d *restrict left,
const t_m4d *restrict right)
{
__m256d ymm0;
__m256d ymm1;
__m256d ymm2;
__m256d ymm3;

// Fill registers ymm0 -> ymm3 with a single value
// from the i:th column of the left
// hand matrix.
//
// left(0) (1 2 3 4) -> ymm0 (1 1 1 1)
// left(1) (2 2 3 4) -> ymm1 (2 2 2 2)
// left(2) (3 2 3 4) -> ymm2 (3 3 3 3)
// left(3) (4 2 3 4) -> ymm3 (4 4 4 4)

// Multiply vector at register ymm0 with right row(0)
//
// 1  1  1  1   <- ymm0
// *
// 4  2  3  4   <- right(0)
// ----------
// 4  2  3  4   <- ymm0

ymm0 = _mm256_mul_pd(ymm0, right->m256d(0));

// Multiply vector at register ymm1 with right hand
// value at ymm0 tp the result.
//
// 2  2  2  2   <- ymm1
// *
// 3  2  3  4   <- right(1)
// +
// 4  2  3  4   <- ymm0
// ----------
// 10 6  9  12  <- ymm0

// We repeat for ymm2 -> ymm3.
//
// 3  3  3  3   <- ymm2
// *
// 2  2  3  4   <- right(2)
// ----------
// 6  6  9  12  <- ymm2
//
// 2  2  2  2   <- ymm3
// *
// 3  2  3  4   <- right(3)
// +
// 6  6  9  12  <- ymm2
// ----------
// 10 14 21 28  <- ymm2

ymm2 = _mm256_mul_pd(ymm2, right->m256d(2));

// Sum accumulated vectors at ymm0 and ymm2.
//
// 10  6   9   12   <- ymm0
// +
// 10  14  21  28   <- ymm2
// ----------
// 20  20  30  40   <- dst(0) First row!

// Calculate dst(1)
ymm0 = _mm256_mul_pd(ymm0, right->m256d(0));
ymm2 = _mm256_mul_pd(ymm2, right->m256d(2));

// Calculate dst(2)
ymm0 = _mm256_mul_pd(ymm0, right->m256d(0));
ymm2 = _mm256_mul_pd(ymm2, right->m256d(2));

// Calculate dst(3)
ymm0 = _mm256_mul_pd(ymm0, right->m256d(0));
ymm2 = _mm256_mul_pd(ymm2, right->m256d(2));
}

``````

Version of the matrix multiplication which uses two multiplies and and add in the inner loop.

``````static inline void m4x4d_avx_mul2(
t_m4d *restrict dst,
const t_m4d *restrict left,
const t_m4d *restrict right)
{
__m256d ymm(4);

for (int i = 0; i < 4; i++)
{
ymm(0) = _mm256_mul_pd(ymm(0), right->m256d(0));
ymm(1) = _mm256_mul_pd(ymm(1), right->m256d(1));
ymm(2) = _mm256_mul_pd(ymm(2), right->m256d(2));
ymm(3) = _mm256_mul_pd(ymm(3), right->m256d(3));
}
}

``````

Comparison matrix multiply that doesn’t use intrinsics.

``````static inline void m4x4d_mul(double d(4)(4), double l(4)(4), double r(4)(4))
{
d(0)(0) = l(0)(0) * r(0)(0) + l(0)(1) * r(1)(0) + l(0)(2) * r(2)(0) + l(0)(3) * r(3)(0);
d(0)(1) = l(0)(0) * r(0)(1) + l(0)(1) * r(1)(1) + l(0)(2) * r(2)(1) + l(0)(3) * r(3)(1);
d(0)(2) = l(0)(0) * r(0)(2) + l(0)(1) * r(1)(2) + l(0)(2) * r(2)(2) + l(0)(3) * r(3)(2);
d(0)(3) = l(0)(0) * r(0)(3) + l(0)(1) * r(1)(3) + l(0)(2) * r(2)(3) + l(0)(3) * r(3)(3);
d(1)(0) = l(1)(0) * r(0)(0) + l(1)(1) * r(1)(0) + l(1)(2) * r(2)(0) + l(1)(3) * r(3)(0);
d(1)(1) = l(1)(0) * r(0)(1) + l(1)(1) * r(1)(1) + l(1)(2) * r(2)(1) + l(1)(3) * r(3)(1);
d(1)(2) = l(1)(0) * r(0)(2) + l(1)(1) * r(1)(2) + l(1)(2) * r(2)(2) + l(1)(3) * r(3)(2);
d(1)(3) = l(1)(0) * r(0)(3) + l(1)(1) * r(1)(3) + l(1)(2) * r(2)(3) + l(1)(3) * r(3)(3);
d(2)(0) = l(2)(0) * r(0)(0) + l(2)(1) * r(1)(0) + l(2)(2) * r(2)(0) + l(2)(3) * r(3)(0);
d(2)(1) = l(2)(0) * r(0)(1) + l(2)(1) * r(1)(1) + l(2)(2) * r(2)(1) + l(2)(3) * r(3)(1);
d(2)(2) = l(2)(0) * r(0)(2) + l(2)(1) * r(1)(2) + l(2)(2) * r(2)(2) + l(2)(3) * r(3)(2);
d(2)(3) = l(2)(0) * r(0)(3) + l(2)(1) * r(1)(3) + l(2)(2) * r(2)(3) + l(2)(3) * r(3)(3);
d(3)(0) = l(3)(0) * r(0)(0) + l(3)(1) * r(1)(0) + l(3)(2) * r(2)(0) + l(3)(3) * r(3)(0);
d(3)(1) = l(3)(0) * r(0)(1) + l(3)(1) * r(1)(1) + l(3)(2) * r(2)(1) + l(3)(3) * r(3)(1);
d(3)(2) = l(3)(0) * r(0)(2) + l(3)(1) * r(1)(2) + l(3)(2) * r(2)(2) + l(3)(3) * r(3)(2);
d(3)(3) = l(3)(0) * r(0)(3) + l(3)(1) * r(1)(3) + l(3)(2) * r(2)(3) + l(3)(3) * r(3)(3);
};
``````

Main method and utils

``````///////////////////////////////////////////////////////////////////////////////
//
// Main and utils for testing.

t_v4d   v4d_set(double n0, double n1, double n2, double n3)
{
t_v4d   v;

v.d(0) = n0;
v.d(1) = n1;
v.d(2) = n2;
v.d(3) = n3;
return (v);
}

t_m4d   m4d_set(t_v4d v0, t_v4d v1, t_v4d v2, t_v4d v3)
{
t_m4d   m;

m.m256d(0) = v0.m256d;
m.m256d(1) = v1.m256d;
m.m256d(2) = v2.m256d;
m.m256d(3) = v3.m256d;
return (m);
}

int main(int argc, char **argv)
{
t_m4d   left;
t_m4d   right;
t_m4d   res;
t_m4d   ctr;

if (argc != 2)
return (printf("usage: avx4x4 (iters)"));

left = m4d_set(
v4d_set(1, 2, 3, 4),
v4d_set(2, 2, 3, 4),
v4d_set(3, 2, 3, 4),
v4d_set(4, 2, 3, 4));

right = m4d_set(
v4d_set(4, 2, 3, 4),
v4d_set(3, 2, 3, 4),
v4d_set(2, 2, 3, 4),
v4d_set(1, 2, 3, 4));

size_t  iters;
clock_t begin;
clock_t end;
double  time_spent;

// Test 1
m4x4d_mul(ctr.d, left.d, right.d);
iters = atoi(argv(1));

begin = clock();
for (size_t i = 0; i < iters; i++)
{
m4x4d_mul(res.d, left.d, right.d);

// To prevent loop unrolling with optimisation flags.
__asm__ volatile ("" : "+g" (i));
}
end = clock();

time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("nNORMALnntime: %lfn", time_spent);

// Test 2
m4x4d_avx_mul(&ctr, &left, &right);
iters = atoi(argv(1));

begin = clock();
for (size_t i = 0; i < iters; i++)
{
m4x4d_avx_mul(&res, &left, &right);
__asm__ volatile ("" : "+g" (i));
}
end = clock();

time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("nAVX MUL + FMADDnntime: %lfn", time_spent);

// Test 3
m4x4d_avx_mul2(&ctr, &left, &right);
iters = atoi(argv(1));

begin = clock();
for (size_t i = 0; i < iters; i++)
{
m4x4d_avx_mul2(&res, &left, &right);
__asm__ volatile ("" : "+g" (i));
}
end = clock();

time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("nAVX MUL + MUL + ADDnntime: %lfn", time_spent);
}
$$```$$
``````

## c++11 – template Matrix class with Static or dynamic size

I started implementing the folowing Matrix class in order to get a better understanding of templates classes in general. For now, it lacks a lot of features, it does very basic things:

• multiplication (only matrix X matrix multiplication is available here)
• transpose
• conversion to a string

Here are some points i think might be interesting

• The dimensions of the matrices can either be “static” or “dynamic”. By static, i mean the number of rows / columns is not allowed to change even if the data is stored in an std::vector

• When adding two matrices together, the resulting matrix’s dimension is static if at least one of the two operands is static.

• When multiplying two matrices together, the resulting matrix’s dimension depends on the rows of the left hand side and the columns of the right hand side of the “*”.

• The type of data contained in the returned matrix corresponds to the implicit cast of the types of the left and right sides of the expression. If the user decided to multiply a matrix of Foos, with a matrix of Bars, it’s on him.

• The “changeSize” methode, when called on a static matrix will only throw if the requested change is different from the matrix size.

• I did not bother for now with the type of exception i used as this is more of an exercise than something that i will use in a real application.

• The reason i made it possible to instanciate Static or dynamic matrices is because i want it to be able to manage Vector X Matrix operations. As a vector is just a special case of matrices, i choose to made them into a single class. (in the style of Eigen, which my code is inspired from a used inteface point of view).

Is my use of the rvalues references and move semantics correct? (that is a notion i learned about last week, any advice can help)

What do you think could be improved about my code?

Is there any notion i should learn before continuing?

Were any design choices in terme of template parameters good or bad?

``````#ifndef UTILS_MATRIX_H_
#define UTILS_MATRIX_H_

#include <vector>
#include <stdexcept>
#include <sstream>
#include <iomanip>

namespace mat
{

constexpr size_t DYNAMIC = static_cast<size_t>(-1); // (-) No matrix can be of this size... right?

template<typename TYPE, size_t NBROWS, size_t NBCOLS>
class Matrix
{
private:

std::vector<std::vector<TYPE>> matrixData;             // (-) Contents of the Matrix as a 2D array
static constexpr bool isRowStatic = NBROWS != DYNAMIC; // (-) Is the number of rows of this matrix editable at run time
static constexpr bool isColStatic = NBCOLS != DYNAMIC; // (-) Is the number of columns of this matrix editable at run time

public:

/** @brief Default constructor
*
*  @return Void.
*/
Matrix() :
matrixData()
{
if ((NBROWS <= 0 and NBROWS != DYNAMIC) or (NBCOLS <= 0 and NBCOLS != DYNAMIC))
throw std::runtime_error("The number of rows and columns of a static matrix must be positive integers");

// In case of a dynamic shape, the matrix should not be instanciated
if(isRowStatic and isColStatic)
internalChangeSize(NBROWS, NBCOLS);
}

/** @brief Consutuctor for the static size matrix
*  @param i_DefaultValue IN : Default value used to fill
*         the matrix
*
*  @return Void.
*/
Matrix(const TYPE &i_DefaultValue) :
matrixData()
{
// error handling
if (not isRowStatic or not isColStatic)
throw std::runtime_error("The default value constructor can not be called on a Matrix with at least one dynamic component");

if ((NBROWS <= 0 and NBROWS != DYNAMIC) or (NBCOLS <= 0 and NBCOLS != DYNAMIC))
throw std::runtime_error("The number of rows and columns of a static matrix must be positive integers");

internalChangeSize(NBROWS, NBCOLS, i_DefaultValue);
}

/** @brief Consutuctor for the dynamic size matrix without default value
*
*  @param i_Nbrows       IN : Number of rows of the matrix
*  @param i_Nbcols       IN : Number of columns of the matrix
*
*  @return Void.
*/
Matrix(const size_t i_Nbrows,
const size_t i_Nbcols) :
matrixData()
{
// error handling
if (i_Nbrows <= 0 or i_Nbcols <= 0)
throw std::runtime_error("The number or rows and columns has to be a positive integer");

// If one dimension is static, ignore the related constructor parameters
size_t NbRows = i_Nbrows;
size_t NbCols = i_Nbcols;
if(isRowStatic)
NbRows = NBROWS;
if(isColStatic)
NbCols = NBCOLS;

internalChangeSize(NbRows, NbCols, TYPE());
}

/** @brief Consutuctor for the dynamic size matrix with default value
*
*  @param i_Nbrows       IN : Number of rows of the matrix
*  @param i_Nbcols       IN : Number of columns of the matrix
*  @param i_DefaultValue IN : Default value used to fill the
*                             matrix
*
*  @return Void.
*/
Matrix(const size_t i_Nbrows,
const size_t i_Nbcols,
const TYPE &i_DefaultValue) :
matrixData()
{
// error handling
if (i_Nbrows <= 0 or i_Nbcols <= 0)
throw std::runtime_error("The number or rows and columns has to be a positive integer");

// If one dimension is static, ignore the related constructor parameters
size_t NbRows = i_Nbrows;
size_t NbCols = i_Nbcols;
if(isRowStatic)
NbRows = NBROWS;
if(isColStatic)
NbCols = NBCOLS;

internalChangeSize(NbRows, NbCols, i_DefaultValue);
}

/** @brief Copy constructor
*
*  @param i_otherMatrix IN : Matrix to be copied
*
*  @return Void.
*/
Matrix(const Matrix<TYPE, NBROWS, NBCOLS>& i_otherMatrix):
matrixData()
{
if(not i_otherMatrix.isEmpty())
{
changeSize(i_otherMatrix.getNbRows(), i_otherMatrix.getNbCols());
matrixData = i_otherMatrix.matrixData;
}
}

/** @brief Move constructor
*
*  @param i_otherMatrix IN : Matrix to be moved
*
*  @return Void.
*/
Matrix(Matrix<TYPE, NBROWS, NBCOLS>&& i_otherMatrix):
matrixData()
{
matrixData = std::move(i_otherMatrix.matrixData);
}

/** @brief getter for the matrix data vector
*
*  @return std::vector<std::vector<TYPE>>&.
*/
const std::vector<std::vector<TYPE>>& getMatrixData() const
{
return matrixData;
}

/** @brief getter for the number or rows
*
*  @return size_t
*/
size_t getNbRows() const
{
return matrixData.size();
}

/** @brief getter for the number or columns
*
*  @return size_t
*/
size_t getNbCols() const
{
if(matrixData.size() > 0)
return matrixData(0).size();
else
return 0;
}

/** @brief is the Matrix is empty
*
*  @return bool
*/
bool isEmpty() const
{
return getNbRows() == 0 or getNbCols() == 0;
}

/** @brief function used to retrieve the shape of the matrix
*  @param o_NbRows OUT : Number of rows of the matrix
*  @param o_NbCols OUT : Number of columns of the matrix
*
*  @return Void.
*/
std::pair<size_t, size_t> shape() const
{
return std::pair<size_t, size_t>(getNbRows(), getNbCols());
}

/** @brief function used to print the contents of the
*         matrix formatted into a string
*
*  @return std::string
*/
std::string toString(int i_Precision = 10) const
{
std::ostringstream SS;

// if 0 lines representation is an empty bracket
if(matrixData.size() == 0)
SS << "| |" << std::endl;

// This will align the floating point numbers
SS << std::showpoint;
SS << std::setprecision(i_Precision);
for(auto& Row : matrixData)
{
SS << "|";
for(auto& Value : Row)
{
SS << " " << Value << " ";
}
SS << "|" << std::endl;
}
return SS.str();
}

/** @brief function used to change the size of the matrix
*         This method does not require a default value,
*         hence it is default initialized
*
*  @param i_NbRows IN : New number of rows of the matrix
*  @param o_NbCols IN : New number of columns of the matrix
*
*  @return Void.
*/
void changeSize(const size_t i_NbRows, const size_t i_NbCols)
{
// Error handling
if (i_NbRows <= 0 or i_NbCols <= 0)
throw std::runtime_error("The number or rows and columns has to be a positive integer");

if (isRowStatic and NBROWS != i_NbRows)
throw std::runtime_error("You cannot change the number of rows, the matrix row size is static.");

if(isColStatic and NBCOLS != i_NbCols)
throw std::runtime_error("You cannot change the number of columns, the matrix columns size is static.");

internalChangeSize(i_NbRows, i_NbCols);
}

/** @brief function used to change the size of the matrix
*         This method does not require a default value,
*         hence it is default initialized
*
*  @param i_NewShape IN : New shape of the matrix
*
*  @return Void.
*/
void changeSize(const std::pair<size_t, size_t>& i_NewShape)
{
changeSize(i_NewShape.first, i_NewShape.second);
}

/** @brief function used to change the size of the matrix
*         This method requires a default value for filling
*         any new row / column.
*
*  @param i_NbRows     IN : New number of rows of the matrix
*  @param o_NbCols     IN : New number of columns of the matrix
*  @param DefaultValue IN : Default value used to fill the matrix
*
*  @return Void.
*/
void changeSize(const size_t i_NbRows, const size_t i_NbCols, const TYPE &DefaultValue)
{
// error handling
if (i_NbRows <= 0 or i_NbCols <= 0)
throw std::runtime_error("The number or rows and columns has to be a positive integer");

if (isRowStatic and NBROWS != i_NbRows)
throw std::runtime_error("You cannot change the number of rows, the matrix columns size is static.");

if(isColStatic and NBCOLS != i_NbCols)
throw std::runtime_error("You cannot change the number of columns, the matrix columns size is static.");

internalChangeSize(i_NbRows, i_NbCols, DefaultValue);
}

/** @brief function used to change the size of the matrix
*         This method requires a default value for filling
*         any new row / column.
*
*  @param i_NewShape   IN : New shape of the matrix
*  @param DefaultValue IN : Default value used to fill the matrix
*
*  @return Void.
*/
void changeSize(const std::pair<size_t, size_t>& i_NewShape, const TYPE &DefaultValue)
{

changeSize(i_NewShape.first, i_NewShape.second, DefaultValue);
}

/** @brief function used to transpose the current matrix.
*         If the matrix is dynamically allocated, it's shape
*         might be changed by the function
*
*  @return Void.
*/
void transpose()
{
// Error handlingmatrixData
if (isEmpty())
throw std::runtime_error("The transpose function can not be called on an empty matrix");

if ((isRowStatic or isColStatic) and getNbRows() != getNbCols())
throw std::runtime_error("The transpose function can not be called on a non square matrix with at least one static component");

// The transposed Matrix is built and replaces the old data
std::vector<std::vector<TYPE>> newData(getNbCols(), std::vector<TYPE>(getNbRows(), (*this)(0,0)));

for (size_t i = 0 ; i < getNbCols() ; i += 1)
for (size_t j = 0 ; j < getNbRows() ; j += 1)
newData(i)(j) = (*this)(i,j);

matrixData = std::move(newData);
}

/** @brief () operator. returns a reference to
*         the value at row i, column j
*
*  @return const TYPE&
*/
const TYPE& operator()(size_t i_Row, size_t i_Col) const
{
try
{
return matrixData.at(i_Row).at(i_Col);
}
catch(std::out_of_range& e)
{
const auto MatrixShape = this->shape();
std::ostringstream SS;
SS << "Indexes : (" << i_Row << ", " << i_Col << ") are out of bounds of the Matrix : " << "(" << MatrixShape.first << ", " << MatrixShape.second << ")";
throw std::runtime_error(SS.str());
}
}

/** @brief () operator. returns a reference to
*         the value at row i, column j
*
*  @return const TYPE&
*/
TYPE& operator()(size_t i_Row, size_t i_Col)
{
try
{
return matrixData.at(i_Row).at(i_Col);
}
catch(std::out_of_range& e)
{
const auto MatrixShape = this->shape();
std::ostringstream SS;
SS << "Indexes : (" << i_Row << ", " << i_Col << ") are out of bounds of the Matrix : " << "(" << MatrixShape.first << ", " << MatrixShape.second << ")";
throw std::runtime_error(SS.str());
}
}

/** @brief = operator. It copies the right hand side
*         into the left hand size Matrix. If the sizes are
*         different and the left hand side is static, the copy
*         fails
*  @param rhs  IN : Right hand side matrix
*
*  @return Void.
*/
template<typename RHSTYPE, size_t RHSNBROWS, size_t RHSNBCOLS>
Matrix<TYPE, NBROWS, NBCOLS>& operator=(const Matrix<RHSTYPE, RHSNBROWS, RHSNBCOLS> &rhs)
{
const auto LhsShape = this->shape();
const auto RhsShape = rhs.shape();

// Error handling
if ((isRowStatic and (this->getNbRows() != rhs.getNbRows())) or
(isColStatic and (this->getNbCols() != rhs.getNbCols())))
{
std::ostringstream SS;
SS << "Impossible to fit data from a matrix of size (" << rhs.getNbRows() << ", " << rhs.getNbCols() << ") into"
" a static matrix of size (" << this->getNbRows() << ", " << this->getNbCols() << ")";
throw std::runtime_error(SS.str());
}

// If both matrices are empty, we dont need to do anything
if(not(isEmpty() and rhs.isEmpty()))
{
// else, change the size only if necessary, taking a default value in one of the non empty matrices in order not to call the default constructor
if (LhsShape != RhsShape )
{
if(not isEmpty())
{
changeSize(RhsShape, (*this)(0,0));
}
else if(not rhs.isEmpty())
{
changeSize(RhsShape, rhs(0,0));
}
}
}

matrixData = rhs.getMatrixData();
return *this;
}

/** @brief move = operator. It moves the right hand side
*         into the left hand size Matrix. If the sizes are
*         different and the left hand side is static, the copy
*         fails
*  @param rhs  IN : Right hand side matrix
*
*  @return Void.
*/
template<typename RHSTYPE, size_t RHSNBROWS, size_t RHSNBCOLS>
Matrix<TYPE, NBROWS, NBCOLS>& operator=(Matrix<RHSTYPE, RHSNBROWS, RHSNBCOLS>&& rhs)
{
const auto LhsShape = this->shape();
const auto RhsShape = rhs.shape();

// Error handling
if ((isRowStatic and (this->getNbRows() != rhs.getNbRows())) or
(isColStatic and (this->getNbCols() != rhs.getNbCols())))
{
std::ostringstream SS;
SS << "Impossible to fit data from a matrix of size (" << rhs.getNbRows() << ", " << rhs.getNbCols() << ") into"
" a static matrix of size (" << this->getNbRows() << ", " << this->getNbCols() << ")";
throw std::runtime_error(SS.str());
}

// If both matrices are empty, we dont need to resize anything
if(not(isEmpty() and rhs.isEmpty()))
{
// else, change the size only if necessary, taking a default value in one of the non empty matrices in order not to call the default constructor
if (LhsShape != RhsShape )
{
if(not isEmpty())
{
changeSize(RhsShape, (*this)(0,0));
}
else if(not rhs.isEmpty())
{
changeSize(RhsShape, rhs(0,0));
}
}
}

matrixData = std::move(rhs.matrixData);
return *this;
}

/** @brief += operator. It adds the right hand side
*         into the left hand size Matrix. If the sizes are
*         different the operator fails
*
*  @param i_NbRows  IN : New number of rows of the matrix
*  @param o_NbCols  IN : New number of columns of the matrix
*  @param io_Matrix IN : Matrix which size will be reduced
*
*  @return Void.
*/
template<typename RHSTYPE, size_t RHSNBROWS, size_t RHSNBCOLS>
Matrix<TYPE, NBROWS, NBCOLS>& operator+=(const Matrix<RHSTYPE, RHSNBROWS, RHSNBCOLS> &rhs)
{
// Error handling
if(this->isEmpty() or rhs.isEmpty())
throw std::runtime_error("Adding empty matrices is forbidden");

if (rhs.shape() != this->shape())
throw std::runtime_error("Adding matrices of different shapes is forbidden");

for(size_t i = 0; i < getNbRows()  ; i += 1)
for(size_t j = 0 ; j < getNbCols() ; j += 1)
matrixData(i)(j) += rhs(i,j);

return *this;
}

/** @brief + operator. It adds both sides of the "+" sign
*         and returns a static size matrix. Both sides
*         have to be of the same shape, otherwise the
*         operator fails.
*
*  @param rhs IN : Matrix, right side of the equation
*
*  @return Matrix<decltype(rhs(0,0) + (*this)(0,0)), NBROWS, NBCOLS, true>.
*/
template<typename LHSTYPE, size_t LHSNBROWS, size_t LHSNBCOLS, typename RHSTYPE, size_t RHSNBROWS, size_t RHSNBCOLS>
friend auto operator+(const Matrix<LHSTYPE, LHSNBROWS, LHSNBCOLS> &lhs, const Matrix<RHSTYPE, RHSNBROWS, RHSNBCOLS> &rhs) ->
Matrix<decltype(lhs(0,0) + rhs(0,0)), LHSNBROWS == DYNAMIC ? RHSNBROWS : DYNAMIC,
LHSNBCOLS == DYNAMIC ? RHSNBCOLS : DYNAMIC>;

/** @brief operator* It multiplies both sides of the "*" sign
*         and returns a static size matrix. Both sides
*         have to have compatible sizes, (lhs.columns == rhs.cols)
*         otherwise, the operator fails.
*
*  @param rhs IN : Matrix, right side of the equation
*
*  @return Matrix<decltype(rhs(0,0) * (*this)(0,0)), this->getNbRows(), rhs.getNbCols(), true>
*/
template<typename LHSTYPE, size_t LHSNBROWS, size_t LHSNBCOLS, typename RHSTYPE, size_t RHSNBROWS, size_t RHSNBCOLS>
friend auto operator*(const Matrix<LHSTYPE, LHSNBROWS, LHSNBCOLS> &lhs, const Matrix<RHSTYPE, RHSNBROWS, RHSNBCOLS> &rhs) -> Matrix<decltype(rhs(0,0) * lhs(0,0)), LHSNBROWS, RHSNBCOLS>;

private:
/** @brief function used to change the size of the matrix
*         This method does not require a default value,
*         hence it is default initialized. It does not check if the matrix is static or not
*
*  @param i_NbRows IN : New number of rows of the matrix
*  @param o_NbCols IN : New number of columns of the matrix
*
*  @return Void.
*/
void internalChangeSize(const size_t i_NbRows, const size_t i_NbCols)
{
// Error handling
if (i_NbRows <= 0 or i_NbCols <= 0)
throw std::runtime_error("The number or rows and columns should be a positive integer");

matrixData.resize(i_NbRows, std::vector<TYPE>(i_NbCols, TYPE()));
for (auto& row : matrixData)
row.resize(i_NbCols, TYPE());
}

/** @brief function used to change the size of the matrix
*         This method requires a default value for filling
*         any new row / column. It does not check if the matrix is static or not
*
*  @param i_NbRows     IN : New number of rows of the matrix
*  @param o_NbCols     IN : New number of columns of the matrix
*  @param DefaultValue IN : Default value used to fill the matrix
*
*  @return Void.
*/
void internalChangeSize(const size_t i_NbRows, const size_t i_NbCols, const TYPE &DefaultValue)
{
// error handling
if (i_NbRows <= 0 or i_NbCols <= 0)
throw std::runtime_error("The number or rows and columns has to be a positive integer");

matrixData.resize(i_NbRows, std::vector<TYPE>(i_NbCols, DefaultValue));
for (auto& row : matrixData)
row.resize(i_NbCols, DefaultValue);
}
};

/** @brief + operator. It adds both sides of the "+" sign
*         and returns a static size matrix. Both sides
*         have to be of the same shape, otherwise the
*         operator fails.
*
*  @param rhs IN : Matrix, right side of the equation
*
*  @return Matrix<decltype(rhs(0,0) + (*this)(0,0)), NBROWS, NBCOLS, true>.
*/
template<typename LHSTYPE, size_t LHSNBROWS, size_t LHSNBCOLS, typename RHSTYPE, size_t RHSNBROWS, size_t RHSNBCOLS>
auto operator+(const Matrix<LHSTYPE, LHSNBROWS, LHSNBCOLS> &lhs, const Matrix<RHSTYPE, RHSNBROWS, RHSNBCOLS> &rhs) ->
Matrix<decltype(lhs(0,0) + rhs(0,0)), LHSNBROWS == DYNAMIC ? RHSNBROWS : DYNAMIC,
LHSNBCOLS == DYNAMIC ? RHSNBCOLS : DYNAMIC>
{
// Error handling
if(lhs.isEmpty() or rhs.isEmpty())
throw std::runtime_error("Adding empty matrices is forbidden");

if (rhs.shape() != lhs.shape())
throw std::runtime_error("Adding matrices of different shapes is forbidden");

Matrix<decltype(lhs(0,0) + rhs(0,0)), LHSNBROWS == DYNAMIC ? RHSNBROWS : DYNAMIC,
LHSNBCOLS == DYNAMIC ? RHSNBCOLS : DYNAMIC> newMatrix(lhs.getNbRows(), lhs.getNbCols(), lhs(0,0));

for(size_t i = 0 ; i < rhs.getNbRows() ; i += 1)
for(size_t j = 0 ; j < rhs.getNbCols() ; j += 1)
newMatrix(i, j) = lhs(i,j) + rhs(i,j);

return newMatrix;
}

/** @brief operator* It multiplies both sides of the "*" sign
*         and returns a static size matrix. Both sides
*         have to have compatible sizes, (lhs.columns == rhs.cols)
*         otherwise, the operator fails.
*
*  @param rhs IN : Matrix, right side of the equation
*
*  @return Matrix<decltype(rhs(0,0) * (*this)(0,0)), this->getNbRows(), rhs.getNbCols(), true>
*/
template<typename LHSTYPE, size_t LHSNBROWS, size_t LHSNBCOLS, typename RHSTYPE, size_t RHSNBROWS, size_t RHSNBCOLS>
auto operator*(const Matrix<LHSTYPE, LHSNBROWS, LHSNBCOLS> &lhs, const Matrix<RHSTYPE, RHSNBROWS, RHSNBCOLS> &rhs) -> Matrix<decltype(rhs(0,0) * lhs(0,0)), LHSNBROWS, RHSNBCOLS>
{
// error handling
if(lhs.isEmpty() or rhs.isEmpty())
throw std::runtime_error("Multiplying empty matrices is forbidden");

if(lhs.getNbCols() != rhs.getNbRows())
throw std::runtime_error("The size of the matrices is incompatible with matrix multiplication");

Matrix<decltype(rhs(0,0) * lhs(0,0)), LHSNBROWS, RHSNBCOLS> newMatrix(lhs.getNbRows(), rhs.getNbCols(), lhs(0,0));

for(size_t i = 0 ; i < lhs.getNbRows() ; i += 1)
{
for(size_t j = 0 ; j < rhs.getNbCols(); j += 1)
{
decltype(rhs(0,0) * lhs(0,0)) sum = lhs(i,0) * rhs(0,j); // Compute the first element of the sum in order not to call the default constructor
for(size_t k = 1 ; k < lhs.getNbRows() ; k += 1)
{
sum += lhs(i,k) * rhs(k,j);
}
newMatrix(i, j) = std::move(sum);
}
}
return newMatrix;
}

/** @brief operator<< outputs the Matrix into a stream
*
*  @param i_Matrix IN  : Matrix to output
*  @param o_OS     OUT : Stream in which the matrix must be fitted
*
*  @return std::ostream&
*/
template <typename TYPE, size_t NBROWS, size_t NBCOLS>
std::ostream& operator<<(std::ostream& o_OS, const Matrix<TYPE, NBROWS, NBCOLS>& i_Matrix)
{
return o_OS << i_Matrix.toString();
}
} /* namespace Mat */
``````

## search – Find the path in 2D Matrix

If there is a proceeding 0’s from the top left right bottom in the matrix then it return true, else returns false.

How could I improve this solution?

Thanks.

``````package TwoDArray;

public class TwoDArray {
//given 2d array, check if there is path from one point to another

static int row = 5;
static int col = 5;

public static void main(String() args) {

int()() matrixNo = {{0, 0, 0, -1, 0},
{-1, 0, 0, -1, -1},
{0, 0, 0, -1, 0},
{-1, 0, -1, 0, 0},
{0, 0, -1, 0, 0}};

int()() matrixYes = {{0, 0, 0, -1, 0},
{-1, 0, 0, -1, -1},
{ 0, 0, 0, -1, 0},
{-1, 0, 0, 0, 0},
{ 0, 0, -1, 0, 0}};

System.out.println(searchMatrix(matrixNo));

System.out.println(searchMatrix(matrixYes));
}

// To find the path from
// top left to bottom right
static boolean searchMatrix(int()() arr) {

return searchMatrixHelper(arr,0, 0);

}

private static boolean searchMatrixHelper(int()() arr,int x, int y) {

if(arr(x)(y) == 1) {
return false;
}

if(x == row-1 && y == col-1) {
return true;
}

arr(x)(y) = -1;

if((y+1 < col && arr(x)(y+1)==0)) {
return searchMatrixHelper(arr, x, y+1);
} else if(x+1 < row && arr(x+1)(y)==0) {
return searchMatrixHelper(arr, x+1, y);
} else if(y-1 >=  0 && arr(x)(y-1)==0) {
return searchMatrixHelper(arr, x, y-1);
} else if(x-1 >= 0 && arr(x-1)(y)==0){
return searchMatrixHelper(arr, x-1, y);
} else
return false;

}
}
``````

## c++ – An efficient way for removing rows with zero elements iteratively while constructing the matrix

Based on a previous post, Accelerating creation of matrices and finding ways for optimal scaling, we managed to accelerate the way that I construct a matrix in Rcpp (inputs for example are at the end of the post). The code used is the following:

``````#include <RcppArmadillo.h>

// ((Rcpp::export))
int NonDiagSum(arma::mat n, int K){
int sum = 0;
for(int i=0; i<(K+1); ++i){
sum += n(i,K-i);
}
return accu(n) - sum;
}

std::random_device rd;
std::mt19937 gen(rd());

// ((Rcpp::export))

arma::mat Generate(arma::mat n, int K, double p){

std::bernoulli_distribution distrib(p);

int N = NonDiagSum(n,K);
NumericMatrix H(N,K);

int next = 0;

for(int j=0; j<K; ++j){
for(int i=0; i<(K-j); ++i){
for(int iter=0; iter<n(i,j); ++iter){

for(int k=j; k<(K-i); ++k){
H(iter+next,k) = distrib(gen);
}

}
next += n(i,j);
}
}
return H;
}
``````

The main part of the code is the function `Generate()`, which produces a matrix with `N` rows and `K` columns and each cell can take the value `0` or `1`.

Because `0` and `1` are generated in a probabilistic way it is expected to have rows which are entirely of zero elements. My goal, is to remove those rows, in an efficient way.

I think the naive way to approach this problem, is to create an additional function that will calculate the row sum of the matrix produced by the `Generate()` function, and then create a second function that will remove the rows for which the sum is equal to zero, i.e. `row_sum(i)==0`. Those function are the following (I assume that those function have the least amount of complexity and cannot be further improved??):

``````// ((Rcpp::export))

arma::vec RowSum(arma::mat H, int K){

int N=size(H)(0);

arma::vec s(N);
s.zeros();

for(int i=0; i<N; ++i){
for(int k=0; k<K; ++k){
s(i) += H(i,k);
}
}
return s;
}

// ((Rcpp::export))

arma::mat Extract(arma::mat H, int K){

arma::vec s = RowSum( H,  K);

int idx = 0;

for(int i=0; i<size(s)(0); ++i){
if(s(i)==0){
H.shed_row(i-idx);
idx += 1;
}
}
return H;
}
``````

And I could use the `Extract()` function inside the `Generate()` function as

``````// ((Rcpp::export))

arma::mat Generate(arma::mat n, int K, double p){

... //Same as before

return Extract(H,K); //Use the Extract() function on the output matrix H
}
``````

which will solve my problem, but it will burden the calculations enormously.

However, I believe that we do not trully need the `RowSum()` function as those operations are implemented already inside the `Generate()` function. Hence, the `Generate()` function which calculates also the row sums is the following:

``````// ((Rcpp::export))

arma::mat Generate(arma::mat n, int K, double p){

std::bernoulli_distribution distrib(p);

int N = NonDiagSum(n,K);
NumericMatrix H(N,K);

NumericVector row_sum(N);

int next = 0;

for(int j=0; j<K; ++j){
for(int i=0; i<(K-j); ++i){
for(int iter=0; iter<n(i,j); ++iter){

for(int k=j; k<(K-i); ++k){
H(iter+next,k) = distrib(gen);
row_sum(iter+next) += H(iter+next,k);
}
}
next += n(i,j);
}
}
return H;
}

``````

Then ideally, there are I think two approaches to remove the rows which have zero sum. The first one is to find a way for the removing zero rows after the creation of each row, i.e. after the implamentation of the code part (in `Generate()` function)

``````for(int k=j; k<(K-i); ++k){
H(iter+next,k) = distrib(gen);
row_sum(iter+next) += H(iter+next,k);
}
``````

which personally I cannot find a way at the momment to do that.

The second approach is the following, inside the `Generate()` function instead of defining `H` as a `NumericMatrix` we could define it as `arma::mat` in order to use the row removing function `H.shed_row()` function. Based on that the new `Generate()` code could be

``````// ((Rcpp::export))

arma::mat Generate(arma::mat n, int K, double p){

...//Same as before, we construct the matrix H

//Here we remove the rows with zero row_sum
int idx=0;
for(int i=0; i<N; ++i){
if(s(i)==0){
H.shed_row(i-idx);
idx += 1;
}
}

return H;
}
``````

I assume that the optimal way incorporates keeping the `H` matrix as `NumericMatrix` because it avoids the comands `arma::mat H` and `H.zeros()` and that the part that we remove the rows should be right after the implementation of the code part

``````for(int k=j; k<(K-i); ++k){
H(iter+next,k) = distrib(gen);
row_sum(iter+next) += H(iter+next,k);
}
``````

Sorry for the length of the post, I just wanted to give all the details. If any clarification or something more needed I’m willing to help. Any suggestion would be really helpful!

Potential inputs that can be used for examples are the following:

``````p = 0.8
K=4
n = matrix(c(0,242,0,272,9222,0,10,0,123,0,0,0,0,0,0,0,131,0,0,0,0,0,0,0,0),K+1,K+1,byrow=TRUE)
n
(,1) (,2) (,3) (,4) (,5)
(1,)    0  242    0  272 9222
(2,)    0   10    0  123    0
(3,)    0    0    0    0    0
(4,)    0  131    0    0    0
(5,)    0    0    0    0    0

$$```$$
``````

## Norm of Schur matrix

If a matrix $$A$$ is Schur stable, i.e., all eigenvalues of $$A$$ are inside the unit circle, then does it hold that $$|A| < 1$$?
Is there any counterexample?

## interview questions – Square Spiral Matrix in Javascript

I’ve been grinding some Leetcode for an interview coming up and just completed Square Spiral in Javascript.

Looking for feedback on performance. This ranked faster than 58% of submissions, would there be anyway to increase the speed using a similar implementation to what I have here?

``````let matrix = (
(1, 2, 3, 4,  5),
(6, 7, 8, 9, 10),
(11,12,13,14,15),
(16,17,18,19,20),
(21,22,23,24,25)
)

var spiralOrder = (matrix) => {
//make copy
var dup=JSON.parse(JSON.stringify(matrix));
//setup variables
let maxX, maxY, baseX, baseY, x, y, vecX, vecY, nums, totalSquares;

//Max indexes for array and sub arrays
maxY=matrix(0).length-1;
maxX=matrix.length-1;

//baseX tells x when it has reached the first sub array.
baseX=0;

//baseY tells y when it has reached the first index of a sub array.
baseY=0;

//our indexes
x=0;
y=0;

//our movement vectors
vecX=0;
vecY=0;

//the count for keeping track of positions iterated
count=0;

//the total amount of squares to be iterated
totalSquares=matrix.length*matrix(0).length;

//return value array
nums=();

//I don't get how subArrays with only a single element could
//still be considered spiralable, but this if-statement handles
//those edge cases. Thanks Leetcode.
if (matrix(0).length===1) {
for (var i=0;i<matrix.length;i++) {
nums.push(matrix(i)(0))
}
return nums
}

//While our iterated count is not equal to the total amount of
//squares to be iterated. When this is true we know we
//have completed the spiral.
while(count!==totalSquares) {

nums.push(dup(x)(y));
count++;

//Our loop iterates around in a spiral. On every turn
//the loop increases the opposite side baseX, baseY,
//maxX or maxY to increase the tightness of the spiral
//as it loops.

//This handles starting the very first iteration, moving right.
if (x===0&&y===0) {
vecX=0;
vecY=1;
//We've reached the top rightmost square, move down.
} else if (x===baseX&&y===maxY) {
vecX=1;
vecY=0;

//This is a little confusing. We can't increment baseY
//until after the first iteration through the
//first sub array is complete. This if statement
//confirms that we're clear to increment baseY.
if (baseX>0) {
baseY+=1;
}
//We've reached bottom rightmost square, move left, increment baseX.
} else if (x===maxX&&y===maxY) {
vecX=0;
vecY=-1;
baseX+=1;
//We've reached bottom leftmost square, move up, decrement maxY.
} else if (x===maxX&&y===baseY) {
vecX=-1;
vecY=0
maxY-=1;
//We've reached top leftmost square, move right, decrement maxX.
} else if (x===baseX&&y===baseY) {
vecX=0;
vecY=1;
maxX-=1;
}

//Increment x or y by the vector.
x+=vecX;
y+=vecY;
}

return nums
}

$$```$$
``````

## matrix – Small problem with TensorReduce

Given:

$$eq=-frac{1}{2} text{varpi 1}^T.text{J1}.text{varpi 1}-frac{1}{2} Omega^T.text{jp}.Omega+frac{1}{2}left(text{Jp}+text{Jp}^Tright)^T.Omega$$

where $$J1$$, $$jp$$ and $$Jp$$$$3 times 3$$ matrices.

$$text{varpi 1}$$ and $$Omega$$$$3 times 1$$ vectors.

I want to get such result, i.e. force parenthesis $$Omega$$:

$$eq=(frac{1}{2}left(text{Jp}+text{Jp}^Tright)^T-frac{1}{2} Omega^T.text{jp}).Omega-frac{1}{2} text{varpi 1}^T.text{J1}.text{varpi 1}$$

But the `TensorReduce` doesn’t work. I understand why this is not happening – Mathematica is not clear if the expression in parentheses has the same dimension.

Is there any way to get around these restrictions?

``````eq = 1/2 Transpose(Transpose(Jp) + Jp).(CapitalOmega)(t) -
1/2 Transpose((CurlyPi)1(t)).J1.(CurlyPi)1(t) -
1/2 Transpose((CapitalOmega)(t)).jp.(CapitalOmega)(t) //
TensorReduce
``````

## transformation – What are the pros and cons to using Transform{position, scale, rotation} over Matrix (3×4)?

I’m using Transform in a certain case where I want extracting/changing/preserving the components {position, scale, rotation} to be straightforward – I guess that is a PRO. But when it comes to multiplication, it looks like 3×4 Matrix at least will be more efficient, as it appears to require less float operations, and doesn’t have lots of bizarre cases to worry about (see the following code from UE’s FTransform::Multiply)

``````A->DiagnosticCheckNaN_All();
B->DiagnosticCheckNaN_All();

checkSlow(A->IsRotationNormalized());
checkSlow(B->IsRotationNormalized());

...
if (AnyHasNegativeScale(A->Scale3D, B->Scale3D))
{
// @note, if you have 0 scale with negative, you're going to lose rotation as it can't convert back to quat
MultiplyUsingMatrixWithScale(OutTransform, A, B);
}
else
{
OutTransform->Rotation = B->Rotation*A->Rotation;
OutTransform->Scale3D = A->Scale3D*B->Scale3D;
OutTransform->Translation = B->Rotation*(B->Scale3D*A->Translation) + B->Translation;
}
``````

Am I right about the efficiency of multiplication between the two? What are other PROs and CONs I’m missing?
I’m guessing there must be full articles on this topic already, but I failed to find any.

## algorithms – When would it be optimal to use an Edge List as opposed to an Adjacency List / Matrix when representing a graph?

Any primitive operation that can be done with edge list, also can be done in adjacency list, but the main difference between them is time complexity of the primitive operations, for example, when you want find all adjacent vertices to some vertex $$i$$ the time complexity is $$theta(|E|)$$ but the same operation on adjacency list is $$theta(Deg(i))$$ that $$Deg(i)$$ is number of vertices that adjacent to vertex $$i$$ that in the worst case, if we have $$n$$ vertices it’s equal to $$theta(n)$$, but doing that operation in edge list data structure in the worst case can be $$theta(n^2)$$ time.

On the other hand the space complexity of edge list is $$theta(|E|)$$, and adjacency list need the space $$theta (|E|+|V|)$$ that edge list is more efficient in space complexity some case.

## how to solve this matrix optimization problem with symmetric matrix X?

$${min_{X} left|beta AX^{-1}B+Cright|_{F}^{2}, tr(X)=1, X ge 0}$$, where $$X$$ is a symmetric matrix.