performance – 4×4 double precision matrix multiply using AVX intrinsics (inc. benchmarks)

An optimised 4×4 double-precision matrix multiply using Intel AVX intrinsics, in two variations.

Gist

For a quick benchmark (on a compatible system), copy and paste the commands below. They run the tests with clang and gcc at optimisation levels -O0 through -O3, and include a naive matrix multiplication (NORMAL) as a reference.

curl -OL https://gist.githubusercontent.com/juliuskoskela/2381831c86041eb2ebf271011db7b2bf/raw/67818fad255f6a682cba6c28394ac71df0274784/run.sh
bash run.sh >> M4X4D_BENCHMARK_RESULTS.txt
cat M4X4D_BENCHMARK_RESULTS.txt

Quick result, MacBook Pro, 2.3 GHz 8-core Intel Core i9:

(best)NORMAL
gcc -O3 -march=native -mavx
time: 0.170310

(best)AVX MUL + FMADD
gcc -O3 -march=native -mavx
time: 0.002334

Quick result Linux AMD Ryzen 3 3100 4-Core:

(best)NORMAL
gcc -O1 -march=native -mavx
time: 0.002570

(best)AVX MUL + MUL + ADD
clang -O2 -march=native -mavx
time: 0.002572

The results indicate some discrepancy in optimisation between gcc and clang, and between Linux and macOS. On macOS neither compiler was able to properly optimise the normal version. On Linux, clang still could not optimise it further, but gcc matched the speed of the intrinsics version, although only at -O1 and -O2 and not at -O3.

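This version of the matrix multiplication uses one multiply and one
fused multiply-add (FMADD) instruction in the inner loop. The inner loop is unrolled
and commented. Note that _mm256_fmadd_pd belongs to the FMA extension, so the compiler needs -mfma or a -march setting that enables it.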

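#include <immintrin.h>  // AVX and FMA intrinsics
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// A 4x4 matrix of doubles, viewable as four 256-bit rows.
typedef union u_m4d
{
    __m256d m256d[4];
    double  d[4][4];
}   t_m4d;

// Example matrices (the same values that main() sets up).
//
// left[0] [1 2 3 4]  | right[0] [4 2 3 4]
// left[1] [2 2 3 4]  | right[1] [3 2 3 4]
// left[2] [3 2 3 4]  | right[2] [2 2 3 4]
// left[3] [4 2 3 4]  | right[3] [1 2 3 4]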

static inline void m4x4d_avx_mul(
        t_m4d *restrict dst,
        const t_m4d *restrict left,
        const t_m4d *restrict right)
{
    __m256d ymm0;
    __m256d ymm1;
    __m256d ymm2;
    __m256d ymm3;

    // Broadcast each element of row 0 of the left-hand
    // matrix into all four lanes of its own register.
    //
    // left[0][0] = 1 -> ymm0 [1 1 1 1]
    // left[0][1] = 2 -> ymm1 [2 2 2 2]
    // left[0][2] = 3 -> ymm2 [3 3 3 3]
    // left[0][3] = 4 -> ymm3 [4 4 4 4]

    ymm0 = _mm256_broadcast_sd(&left->d[0][0]);
    ymm1 = _mm256_broadcast_sd(&left->d[0][1]);
    ymm2 = _mm256_broadcast_sd(&left->d[0][2]);
    ymm3 = _mm256_broadcast_sd(&left->d[0][3]);

    // Multiply the vector in register ymm0 by right row[0].
    //
    // 1  1  1  1   <- ymm0
    // *
    // 4  2  3  4   <- right[0]
    // ----------
    // 4  2  3  4   <- ymm0

    ymm0 = _mm256_mul_pd(ymm0, right->m256d[0]);

    // Multiply the vector in register ymm1 by right
    // row[1] and add the corresponding value in ymm0
    // to each product (fused multiply-add).
    //
    // 2  2  2  2   <- ymm1
    // *
    // 3  2  3  4   <- right[1]
    // +
    // 4  2  3  4   <- ymm0
    // ----------
    // 10 6  9  12  <- ymm0

    ymm0 = _mm256_fmadd_pd(ymm1, right->m256d[1], ymm0);

    // Repeat for ymm2 and ymm3.
    //
    // 3  3  3  3   <- ymm2
    // *
    // 2  2  3  4   <- right[2]
    // ----------
    // 6  6  9  12  <- ymm2
    //
    // 4  4  4  4   <- ymm3
    // *
    // 1  2  3  4   <- right[3]
    // +
    // 6  6  9  12  <- ymm2
    // ----------
    // 10 14 21 28  <- ymm2

    ymm2 = _mm256_mul_pd(ymm2, right->m256d[2]);
    ymm2 = _mm256_fmadd_pd(ymm3, right->m256d[3], ymm2);

    // Sum the accumulated vectors in ymm0 and ymm2.
    //
    // 10  6   9   12   <- ymm0
    // +
    // 10  14  21  28   <- ymm2
    // ----------
    // 20  20  30  40   <- dst[0] First row!

    dst->m256d[0] = _mm256_add_pd(ymm0, ymm2);

    // Calculate dst[1]
    ymm0 = _mm256_broadcast_sd(&left->d[1][0]);
    ymm1 = _mm256_broadcast_sd(&left->d[1][1]);
    ymm2 = _mm256_broadcast_sd(&left->d[1][2]);
    ymm3 = _mm256_broadcast_sd(&left->d[1][3]);
    ymm0 = _mm256_mul_pd(ymm0, right->m256d[0]);
    ymm0 = _mm256_fmadd_pd(ymm1, right->m256d[1], ymm0);
    ymm2 = _mm256_mul_pd(ymm2, right->m256d[2]);
    ymm2 = _mm256_fmadd_pd(ymm3, right->m256d[3], ymm2);
    dst->m256d[1] = _mm256_add_pd(ymm0, ymm2);

    // Calculate dst[2]
    ymm0 = _mm256_broadcast_sd(&left->d[2][0]);
    ymm1 = _mm256_broadcast_sd(&left->d[2][1]);
    ymm2 = _mm256_broadcast_sd(&left->d[2][2]);
    ymm3 = _mm256_broadcast_sd(&left->d[2][3]);
    ymm0 = _mm256_mul_pd(ymm0, right->m256d[0]);
    ymm0 = _mm256_fmadd_pd(ymm1, right->m256d[1], ymm0);
    ymm2 = _mm256_mul_pd(ymm2, right->m256d[2]);
    ymm2 = _mm256_fmadd_pd(ymm3, right->m256d[3], ymm2);
    dst->m256d[2] = _mm256_add_pd(ymm0, ymm2);

    // Calculate dst[3]
    ymm0 = _mm256_broadcast_sd(&left->d[3][0]);
    ymm1 = _mm256_broadcast_sd(&left->d[3][1]);
    ymm2 = _mm256_broadcast_sd(&left->d[3][2]);
    ymm3 = _mm256_broadcast_sd(&left->d[3][3]);
    ymm0 = _mm256_mul_pd(ymm0, right->m256d[0]);
    ymm0 = _mm256_fmadd_pd(ymm1, right->m256d[1], ymm0);
    ymm2 = _mm256_mul_pd(ymm2, right->m256d[2]);
    ymm2 = _mm256_fmadd_pd(ymm3, right->m256d[3], ymm2);
    dst->m256d[3] = _mm256_add_pd(ymm0, ymm2);
}
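
The data flow of the routine above can be checked without AVX hardware. The following is a minimal Python sketch, not the author's code: it mirrors the broadcast, multiply, multiply-add and final add steps with plain lists, and compares the result with a textbook triple loop. The function names `row_broadcast_matmul` and `naive_matmul` are made up for this illustration, and the matrices are the ones that main() sets up.

```python
def row_broadcast_matmul(left, right):
    """4x4 product built row by row, mirroring the AVX routine's data flow."""
    dst = []
    for i in range(4):
        # Each scalar left[i][k] plays the role of a broadcast register:
        # it scales the whole k-th row of the right-hand matrix.
        acc0 = [left[i][0] * r for r in right[0]]                    # mul
        acc0 = [left[i][1] * r + a for r, a in zip(right[1], acc0)]  # fmadd
        acc2 = [left[i][2] * r for r in right[2]]                    # mul
        acc2 = [left[i][3] * r + a for r, a in zip(right[3], acc2)]  # fmadd
        dst.append([a + b for a, b in zip(acc0, acc2)])              # add
    return dst


def naive_matmul(left, right):
    """Textbook definition: dst[i][j] = sum over k of left[i][k] * right[k][j]."""
    return [[sum(left[i][k] * right[k][j] for k in range(4))
             for j in range(4)]
            for i in range(4)]


left = [[1, 2, 3, 4], [2, 2, 3, 4], [3, 2, 3, 4], [4, 2, 3, 4]]
right = [[4, 2, 3, 4], [3, 2, 3, 4], [2, 2, 3, 4], [1, 2, 3, 4]]

result = row_broadcast_matmul(left, right)
print(result[0])                            # [20, 20, 30, 40]
print(result == naive_matmul(left, right))  # True
```

The first row matches the value worked out in the comments of the C routine.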

Version of the matrix multiplication which uses two multiplies and an add in the inner loop.

static inline void m4x4d_avx_mul2(
        t_m4d *restrict dst,
        const t_m4d *restrict left,
        const t_m4d *restrict right)
{
    __m256d ymm[4];

    for (int i = 0; i < 4; i++)
    {
        ymm[0] = _mm256_broadcast_sd(&left->d[i][0]);
        ymm[1] = _mm256_broadcast_sd(&left->d[i][1]);
        ymm[2] = _mm256_broadcast_sd(&left->d[i][2]);
        ymm[3] = _mm256_broadcast_sd(&left->d[i][3]);
        ymm[0] = _mm256_mul_pd(ymm[0], right->m256d[0]);
        ymm[1] = _mm256_mul_pd(ymm[1], right->m256d[1]);
        ymm[0] = _mm256_add_pd(ymm[0], ymm[1]);
        ymm[2] = _mm256_mul_pd(ymm[2], right->m256d[2]);
        ymm[3] = _mm256_mul_pd(ymm[3], right->m256d[3]);
        ymm[2] = _mm256_add_pd(ymm[2], ymm[3]);
        dst->m256d[i] = _mm256_add_pd(ymm[0], ymm[2]);
    }
}

Comparison matrix multiply that doesn’t use intrinsics.

static inline void m4x4d_mul(double d[4][4], double l[4][4], double r[4][4])
{
    d[0][0] = l[0][0] * r[0][0] + l[0][1] * r[1][0] + l[0][2] * r[2][0] + l[0][3] * r[3][0];
    d[0][1] = l[0][0] * r[0][1] + l[0][1] * r[1][1] + l[0][2] * r[2][1] + l[0][3] * r[3][1];
    d[0][2] = l[0][0] * r[0][2] + l[0][1] * r[1][2] + l[0][2] * r[2][2] + l[0][3] * r[3][2];
    d[0][3] = l[0][0] * r[0][3] + l[0][1] * r[1][3] + l[0][2] * r[2][3] + l[0][3] * r[3][3];
    d[1][0] = l[1][0] * r[0][0] + l[1][1] * r[1][0] + l[1][2] * r[2][0] + l[1][3] * r[3][0];
    d[1][1] = l[1][0] * r[0][1] + l[1][1] * r[1][1] + l[1][2] * r[2][1] + l[1][3] * r[3][1];
    d[1][2] = l[1][0] * r[0][2] + l[1][1] * r[1][2] + l[1][2] * r[2][2] + l[1][3] * r[3][2];
    d[1][3] = l[1][0] * r[0][3] + l[1][1] * r[1][3] + l[1][2] * r[2][3] + l[1][3] * r[3][3];
    d[2][0] = l[2][0] * r[0][0] + l[2][1] * r[1][0] + l[2][2] * r[2][0] + l[2][3] * r[3][0];
    d[2][1] = l[2][0] * r[0][1] + l[2][1] * r[1][1] + l[2][2] * r[2][1] + l[2][3] * r[3][1];
    d[2][2] = l[2][0] * r[0][2] + l[2][1] * r[1][2] + l[2][2] * r[2][2] + l[2][3] * r[3][2];
    d[2][3] = l[2][0] * r[0][3] + l[2][1] * r[1][3] + l[2][2] * r[2][3] + l[2][3] * r[3][3];
    d[3][0] = l[3][0] * r[0][0] + l[3][1] * r[1][0] + l[3][2] * r[2][0] + l[3][3] * r[3][0];
    d[3][1] = l[3][0] * r[0][1] + l[3][1] * r[1][1] + l[3][2] * r[2][1] + l[3][3] * r[3][1];
    d[3][2] = l[3][0] * r[0][2] + l[3][1] * r[1][2] + l[3][2] * r[2][2] + l[3][3] * r[3][2];
    d[3][3] = l[3][0] * r[0][3] + l[3][1] * r[1][3] + l[3][2] * r[2][3] + l[3][3] * r[3][3];
}

Main method and utils

///////////////////////////////////////////////////////////////////////////////
//
// Main and utils for testing.

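// NOTE: t_v4d is not defined in the listing as posted; this union is an
// assumed definition that mirrors t_m4d.
typedef union u_v4d
{
    __m256d m256d;
    double  d[4];
}   t_v4d;

t_v4d   v4d_set(double n0, double n1, double n2, double n3)
{
    t_v4d   v;

    v.d[0] = n0;
    v.d[1] = n1;
    v.d[2] = n2;
    v.d[3] = n3;
    return (v);
}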

t_m4d   m4d_set(t_v4d v0, t_v4d v1, t_v4d v2, t_v4d v3)
{
    t_m4d   m;

    m.m256d[0] = v0.m256d;
    m.m256d[1] = v1.m256d;
    m.m256d[2] = v2.m256d;
    m.m256d[3] = v3.m256d;
    return (m);
}

int main(int argc, char **argv)
{
    t_m4d   left;
    t_m4d   right;
    t_m4d   res;
    t_m4d   ctr;

    if (argc != 2)
        return (printf("usage: avx4x4 (iters)"));

    left = m4d_set(
        v4d_set(1, 2, 3, 4),
        v4d_set(2, 2, 3, 4),
        v4d_set(3, 2, 3, 4),
        v4d_set(4, 2, 3, 4));

    right = m4d_set(
        v4d_set(4, 2, 3, 4),
        v4d_set(3, 2, 3, 4),
        v4d_set(2, 2, 3, 4),
        v4d_set(1, 2, 3, 4));

    size_t  iters;
    clock_t begin;
    clock_t end;
    double  time_spent;

    // Test 1
    m4x4d_mul(ctr.d, left.d, right.d);
    iters = atoi(argv[1]);

    begin = clock();
    for (size_t i = 0; i < iters; i++)
    {
        m4x4d_mul(res.d, left.d, right.d);
        
        // To prevent loop unrolling with optimisation flags.
        __asm__ volatile ("" : "+g" (i));
    }
    end = clock();

    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("\nNORMAL\n\ntime: %lf\n", time_spent);

    // Test 2
    m4x4d_avx_mul(&ctr, &left, &right);
    iters = atoi(argv[1]);

    begin = clock();
    for (size_t i = 0; i < iters; i++)
    {
        m4x4d_avx_mul(&res, &left, &right);
        __asm__ volatile ("" : "+g" (i));
    }
    end = clock();

    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("\nAVX MUL + FMADD\n\ntime: %lf\n", time_spent);

    // Test 3
    m4x4d_avx_mul2(&ctr, &left, &right);
    iters = atoi(argv[1]);

    begin = clock();
    for (size_t i = 0; i < iters; i++)
    {
        m4x4d_avx_mul2(&res, &left, &right);
        __asm__ volatile ("" : "+g" (i));
    }
    end = clock();

    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("\nAVX MUL + MUL + ADD\n\ntime: %lf\n", time_spent);
}

python 3.x – Multiply by 9 without multiplying by 9, using vedic math

# The number made of 9s must have at least as many digits as the other number.

mul9 = { str(i): str(9-i) for i in range(10)}

print(f"{mul9}\n")

number1 = int(input("Enter 9s': "))
len1 = len(str(number1))
number2 = int(input(f"Enter (0 - {str(9)*len1}): "))

if len(str(number1)) < len(str(number2)):
    print("This trick won't work")
    
else:
    res = str(number2 - 1)
    print(f"{number2} - 1 = {res} ")
    
    end = ''
    
    for i in res:
        end += mul9[i]
        print(f"{i} needs {mul9[i]} to become 9")
    
    # This accounts for the invisible 0s at the start of 'res' in the video:
    # each one maps to a 9, so the right number of 9s is added directly
    # instead of looping over them. The padding is measured against 'res'
    # (number2 - 1), which can be one digit shorter than number2 (e.g. 10 -> 9).
    res += (len(str(number1)) - len(res)) * "9" + end
    
    print(res)

gives

{'0': '9', '1': '8', '2': '7', '3': '6', '4': '5', '5': '4', '6': '3', '7': '2', '8': '1', '9': '0'}

Enter 9s': 99
Enter (0 - 99): 99
99 - 1 = 98
9 needs 0 to become 9
8 needs 1 to become 9
9801

(Program finished) 

The objective of the code is to show students how large numbers can be multiplied quickly.
I am new to Python, so I am not very familiar with its syntax.
The code works as expected.
Could you please point out any lines that can be shortened?
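
One possible shorter form is sketched below. It is not the original poster's code: the function name `times_nines` and the parameter `k` (the number of 9s) are chosen here for illustration. It relies on the identity n × (10^k − 1) = (n − 1) × 10^k + (10^k − n), where the right-hand part is the digit-wise complement to 9 of n − 1 after padding it to k digits.

```python
def times_nines(n: int, k: int) -> str:
    """Return n * (the k-digit number 99...9) as a string of digits.

    Valid for 1 <= n <= 10**k - 1, i.e. n has at most as many digits
    as there are 9s.
    """
    if not 1 <= n < 10 ** k:
        raise ValueError("n must be between 1 and the k-digit number 99...9")
    head = str(n - 1)
    # Pad (n - 1) to k digits, then replace every digit d with 9 - d.
    tail = "".join(str(9 - int(ch)) for ch in head.zfill(k))
    return head + tail


print(times_nines(99, 2))   # 9801
print(times_nines(10, 2))   # 990
```

For n = 1 the left part is "0", so the string carries a leading zero (for example "09" for 1 × 9); converting with `int()` removes it if needed.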

floating point – Unit conversion – Better to divide by an integer or multiply by a double?

I currently have a long timestamp measured in units of 100ns elapsed since January 1st, 1900. I need to convert it to milliseconds.

I have the choice of either multiplying by 0.0001 or dividing by 10_000. Although at first glance they sound the same, the former would actually cause an implicit cast to a double – the latter would of course result in another long with the remainder truncated.

Which would yield a better result? Obviously double is an imprecise type that introduces errors due to the use of a mantissa, floating radix, and exponent, but would that error be less or more than the error from performing the simple integer division? Or would the error be demonstrably negligible?

To give an example of one of my values, here is one of the timestamps: 38348440316924872.

I’m specifically referring to C#, but this question should be general to computer science.
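
The two options can be compared numerically on the sample timestamp. The sketch below uses Python rather than C#, on the assumption that the arithmetic carries over: Python's `float` is an IEEE 754 double, and its `int` behaves like a `long` for values in this range. `Fraction` gives the exact quotient, so the error of each approach can be measured directly.

```python
from fractions import Fraction

ticks = 38348440316924872  # sample timestamp from the question, in 100 ns units

exact_ms = Fraction(ticks, 10_000)  # exact value in milliseconds
int_ms = ticks // 10_000            # integer division drops the remainder
float_ms = ticks * 0.0001           # multiplication in double precision

int_error = exact_ms - int_ms                     # truncation error, in [0, 1) ms
float_error = abs(exact_ms - Fraction(float_ms))  # rounding error of the double

print("integer division:", int_ms, "error (ms):", float(int_error))
print("double multiply: ", float_ms, "error (ms):", float(float_error))
```

For this value the truncation error of the integer division (0.4872 ms) is much larger than the rounding error of the double, but the integer result is exact to the millisecond boundary and stays a `long`. Which one is "better" depends on whether fractional milliseconds are needed.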

opengl – What happens if first we scale a rotation matrix and then multiply it with an object?






python – Error: TypeError: can’t multiply sequence by non-int of type ‘dict’

**I am writing a program where I need to calculate the IPVA as 4% of the car's value, but I get this error when calculating both the IPVA and the insurance!**

# Vehicle registry keyed by code. Its definition is missing from the post
# as extracted; an empty dict is assumed here.
veiculos={}
valorvec=veiculos
IPVA = valorvec
#FUNCTIONS
def menu():
    print("\n\n0-Exit\n1-Register\n2-Print\n3-Edit\n4-Delete")
    print("5-Delete All\n6-Search\n7-Calculate IPVA\n8-Calculate Insurance")
    opcao=int(input("Enter the option : "))
    return(opcao)
def nao_vazio():
    if(len(veiculos)==0):
        print("Dictionary is empty!!!")
        return(False)
    else:
        return(True)
def inserir(num_cod):
    marca=str(input("Enter the vehicle make: "))
    modelo=str(input("Enter the vehicle model: "))
    cor=str(input("Enter the vehicle color : "))
    ano=float(input("Enter the vehicle's year of manufacture : "))
    idade=float(input("Enter the age of the vehicle's driver: "))
    valorvec=float(input("Enter the vehicle's value : "))
    veiculos[num_cod]=(marca, modelo, cor, ano, idade, valorvec)
def imprimir():
    print(veiculos)
def alterar(num_cod):
    marca=str(input("Enter the vehicle make: "))
    modelo=str(input("Enter the vehicle model: "))
    cor=str(input("Enter the vehicle color : "))
    ano=float(input("Enter the vehicle's year of manufacture : "))
    idade=float(input("Enter the age of the vehicle's driver: "))
    valorvec=float(input("Enter the vehicle's value : "))
    veiculos[num_cod]=(marca, modelo, cor, ano, idade, valorvec)
def excluir(num_cod):
    del veiculos[num_cod]
def excluir_todos():
    veiculos.clear()
def pesquisar(num_cod):
    print(veiculos[num_cod])
def calcularIPVA(num_cod):
    # This is the line that raises the TypeError: the global valorvec is the
    # veiculos dict, and '0,04' is a string rather than the number 0.04.
    print("The IPVA amount is : R$",valorvec*'0,04')
def calcularseguro(num_cod):
    # veiculos[num_cod] is the whole tuple, not the driver's age, so these
    # comparisons fail as well.
    if(veiculos[num_cod]<25):
        print('The insurance amount is : R$', valorvec*'0,02')
    elif(veiculos[num_cod]>60):
        print('The insurance amount is : R$', valorvec*'0,03')
    else:
        print('The insurance amount is : R$', valorvec*'0,01')
#PROGRAM
op=menu()
while(op!=0):
    if(op==1):
        cod=int(input("Enter the code : "))
        if(cod not in veiculos):
            inserir(cod)
        else:
            print("Code already registered !!!")
    elif(op==2):
        if(nao_vazio()):
            imprimir()
    elif(op==3):
        if(nao_vazio()):
            cod=int(input("Enter the code you want to edit : "))
            if(cod in veiculos):
                alterar(cod)
            else:
                print("Code not registered !!!")
    elif(op==4):
        if(nao_vazio()):
            cod=int(input("Enter the code you want to delete : "))
            if(cod in veiculos):
                excluir(cod)
            else:
                print("Code not registered !!!")
    elif(op==5):
        if(nao_vazio()):
            excluir_todos()
    elif(op==6):
        if(nao_vazio()):
            cod=int(input("Enter the code you want to search for : "))
            if(cod in veiculos):
                pesquisar(cod)
            else:
                print("Code not registered !!!")
    elif(op==7):
        if(nao_vazio()):
            cod=int(input("Enter the code to calculate the IPVA for : "))
            if(cod in veiculos):
                calcularIPVA(cod)
            else:
                print("Code not registered !!!")
    elif(op==8):
        if(nao_vazio()):
            cod=int(input("Enter the code to calculate the insurance for : "))
            if(cod in veiculos):
                calcularseguro(cod)
            else:
                print("Code not registered !!!")
    else:
        print("Invalid option!!!")
    op=menu()
print("Closing the program!!!")
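
A likely fix, sketched under the assumption that each record keeps the tuple layout used by `inserir` (marca, modelo, cor, ano, idade, valorvec): look up the vehicle's value and the driver's age inside the stored tuple, and multiply by numeric rates instead of strings such as '0,04'. The function names, the index constants and the sample record below are illustrative, not part of the original program.

```python
# Positions inside each record: (marca, modelo, cor, ano, idade, valorvec)
IDADE = 4   # driver's age
VALOR = 5   # vehicle value


def calcular_ipva(veiculos, num_cod):
    """IPVA as 4% of the value of the vehicle stored under num_cod."""
    return veiculos[num_cod][VALOR] * 0.04


def calcular_seguro(veiculos, num_cod):
    """Insurance: 2% for drivers under 25, 3% for drivers over 60, else 1%."""
    idade = veiculos[num_cod][IDADE]
    valor = veiculos[num_cod][VALOR]
    if idade < 25:
        taxa = 0.02
    elif idade > 60:
        taxa = 0.03
    else:
        taxa = 0.01
    return valor * taxa


# Hypothetical sample record, only for demonstration.
veiculos = {1: ("Fiat", "Uno", "white", 2015.0, 30.0, 20000.0)}
print("IPVA: R$", round(calcular_ipva(veiculos, 1), 2))
print("Insurance: R$", round(calcular_seguro(veiculos, 1), 2))
```

Passing the dictionary in as a parameter avoids relying on the global `valorvec`, which in the posted program is just another name for the whole dictionary.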

Multiply not working

I posted this on the site below, but did not get an answer:
https://forum.freecodecamp.org/t/help-me-with-javascript/61934

In the first part of my code I have:
input 1 (enter the amount here)
input 2 (leave blank; the value comes from somewhere else)
multiply button
output (input 1 * input 2), e.g. 2 * 4 = 8
This is working 100%.

In the second part of my code I have:
input 1 (leave blank; the value comes from the part 1 output)
input 2 (any number)
multiply button
output (part 1 output * part 2 input 2), e.g. 2 * 5 = 10
I cannot get this to work.

Part 1 code:

<font color="red"><b><label for="firstNum">CURRENCY:</label></b></font>
<input type="number" id="firstNum" name="firstNum">

<font color="PURPLE"><b><label for="secondNum">Total </label></b></font>
<input type="number" id="secondNum" name="secondNum">

<button onclick="multiply()">Multiply</button>

<font color="BLUE"><b><label for="result">Total</label></b></font>
<input type="text" id="result" name="result"/>

<script>
function multiply(){
    num1 = document.getElementById("firstNum").value;
    num2 = document.getElementById("secondNum").value = 1154.69514250;
    result = num1 * num2;
    document.getElementById("result").value = result.toLocaleString('en-US');
}
</script>

Part 2 code:

<font color="red"><b><label for="usd">leave blank( info from 1 total</label></b></font>
<input type="usdzar" id="usd" name="usd">

<font color="PURPLE"><b><label for="zar">Put ZAR in</label></b></font>
<input type="usdzar" id="zar" name="zar">

<button onclick="multiply1()">Multiply</button>

<font color="BLUE"><b><label for="result1">Total</label></b></font>
<input type="text" id="result1" name="result1"/>


<script>
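function multiply1(){
    // Reads the text that part 1 wrote into "result". toLocaleString('en-US')
    // inserts thousands separators (e.g. "2,309.39"), and such a string
    // converts to NaN when it is multiplied.
    num3 = document.getElementById("result").value;
    num4 = document.getElementById("zar").value;
    result1 = num3 * num4;
    // Note: this displays `result` from part 1, not `result1`.
    document.getElementById("result1").value = result.toLocaleString('en-US');
}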
</script>


 

boot – Multiply claimed blocks repeating themselves with no end I can see they are game files what do I do 20.04 I have tried multiple times to fix it reins






python – I want to multiply and add each line in the text file to output one final answer

Hi everyone. I want to add an option to my software that multiplies the integer and the float on each line of my text file, adds all the products together, and outputs the total when the user selects this option. Can someone help? My text file looks like this:

Endgame, science-fiction, 58, 4.5
Up, animated, 42, 3.0
IT, horror, 50, 6.2
Johnwick, action, 10, 5.9
Palmsprings, romance, 33, 8.0
Parasite, thriller, 12, 0.75
Harrypotter, fantasy, 60, 6.0
Babydriver, music, 22, 4.3
1917, war, 66, 1.75

Can someone provide code showing how to achieve this?
I think I need a dictionary, but I don't know how to do it.
Thank you!
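
A dictionary is probably not needed for this. A minimal sketch, assuming every line has exactly four comma-separated fields (title, genre, integer, float) as in the sample above: split each line, multiply the last two fields, and keep a running total. The function name `total_from_lines` is made up for this example, and the data is inlined with `io.StringIO` so the snippet runs without a file; the line containing the stray "End" is assumed to read "33, 8.0".

```python
import io

SAMPLE = """\
Endgame, science-fiction, 58, 4.5
Up, animated, 42, 3.0
IT, horror, 50, 6.2
Johnwick, action, 10, 5.9
Palmsprings, romance, 33, 8.0
Parasite, thriller, 12, 0.75
Harrypotter, fantasy, 60, 6.0
Babydriver, music, 22, 4.3
1917, war, 66, 1.75
"""


def total_from_lines(lines):
    """Sum integer * float over all non-empty lines."""
    total = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Fields: title, genre, integer, float
        _title, _genre, count, value = [field.strip() for field in line.split(",")]
        total += int(count) * float(value)
    return total


print(round(total_from_lines(io.StringIO(SAMPLE)), 2))  # 1599.1
```

With a real file, the same function can be called as `total_from_lines(open("movies.txt"))`, where the file name is a placeholder; an open file is iterated line by line just like the `StringIO` object.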

[GET]-Proxy Multiply 1.0.0.88

Send me a private message for the link.

use multiply cache service ????

Hello,
In a shared environment, some users need Redis and others need Memcached.
Is it stable to run both of them on the same server?