performance – User implementation of memcpy, where to optimize further?

Edit:

By adding the restrict keyword I was able to get my memcpy up to speed with the library implementation (and, in this particular test, even to exceed it). New results:

Test case                  mem_cpy      mem_cpy_naive   memcpy
big string (1000 bytes)    2.584988s    3.936075s       3.952187s
small string (8 bytes)     0.025931s    0.051899s       0.025807s

Note: I also tested it as part of a bigger implementation I had been working on. Previously I gained about 20% performance by swapping in the libc memcpy in place of my own; now there is no difference.
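
Concretely, the only change from the original version further down is the restrict qualification of the pointer parameters (and of the internal word and byte pointers), which tells the compiler that source and destination never alias, so it is free to schedule the loads and stores more aggressively:

/* before */
void    *mem_cpy(void *dst, const void *src, const size_t size);

/* after */
void    *mem_cpy(void *restrict dst, const void *restrict src, const size_t size);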

Updated code:

static void
copy_words(void *restrict dst, const void *restrict src, size_t words)
{
    const uint64_t  *restrict src64;
    uint64_t        *restrict dst64;
    uint64_t        pages;
    uint64_t        offset;

    pages = words / 8;
    offset = words - pages * 8;
    src64 = (const uint64_t *restrict)src;
    dst64 = (uint64_t *restrict)dst;
    while (pages--)
    {
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
    }
    while (offset--)
        *dst64++ = *src64++;
}
static void
copy_small(void *restrict dst, const void *restrict src, size_t size)
{
    const uint64_t  *restrict src64;
    uint64_t        *restrict dst64;

    src64 = (const uint64_t *restrict)src;
    dst64 = (uint64_t *restrict)dst;
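    /* size is at most 8 here (mem_cpy checks before calling), and a full
       8-byte word is copied regardless of the exact size */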
    *dst64 = *src64;
}
void
*mem_cpy(void *restrict dst, const void *restrict src, const size_t size)
{
    const uint8_t   *restrict src8;
    uint8_t         *restrict dst8;
    size_t          offset;
    size_t          words;
    size_t          aligned_size;

    if (!src || !dst)
        return (NULL);
    if (size <= 8)
    {
        copy_small(dst, src, size);
        return (dst);
    }
    words = size / 8;
    aligned_size = words * 8;
    offset = size - aligned_size;
    copy_words(dst, src, words);
    if (offset)
    {
        src8 = (const uint8_t *restrict)src;
        src8 = &src8[aligned_size];
        dst8 = (uint8_t *restrict)dst;
        dst8 = &dst8[aligned_size];
        while (offset--)
            *dst8++ = *src8++;
    }
    return (dst);
}


As an exercise in optimization, I’m trying to get my memcpy re-creation as close in speed to the libc one as I can. I have used the following techniques to optimize my memcpy (the overall copy path is sketched right after this list):

  • Casting the data to as big a datatype as possible for copying.
  • Unrolling the main loop 8 times.
  • For data <= 8 bytes I bypass the main loop.
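
Putting those together, the copy path looks roughly like this (a simplified sketch of the full mem_cpy in the listing below; mem_cpy_outline is only a name for the sketch, and it reuses copy_words and the <stdint.h> types from that listing):

/* Sketch only: small sizes copy one 8-byte word, the bulk goes through the
   8x-unrolled copy_words, and leftover bytes are copied one at a time. */
static void     copy_words(void *dst, const void *src, size_t words);

void            *mem_cpy_outline(void *dst, const void *src, size_t size)
{
    const uint8_t   *src8;
    uint8_t         *dst8;
    size_t          words;
    size_t          tail;

    if (size <= 8)
    {
        *(uint64_t *)dst = *(const uint64_t *)src;
        return (dst);
    }
    words = size / 8;
    tail = size - words * 8;
    copy_words(dst, src, words);
    src8 = (const uint8_t *)src + words * 8;
    dst8 = (uint8_t *)dst + words * 8;
    while (tail--)
        *dst8++ = *src8++;
    return (dst);
}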

My results (I have added a naive one-byte-at-a-time memcpy for reference):

Test case                  mem_cpy       mem_cpy_naive   memcpy
big string (1000 bytes)    12.452919s    212.728906s     0.935605s
small string (8 bytes)     0.367271s     1.413559s       0.149886s

I feel I have exhausted the “low-hanging fruit” in terms of optimization. I understand that the libc function can be optimized at a level not accessible to me while writing only C, but I wonder if there’s still something to be done here, or whether the next step is to write it in assembly.

To give a bit more context on my motive: I study programming at a school that puts performance constraints on our projects, but as of now we are only allowed to use standard C, so I can’t go optimizing at the assembly level yet. We are also not allowed to use libc and have to create our own versions of the standard functions we want to use, so making my memcpy as fast as possible helps me hit the performance goals in my projects. And it’s great for learning, obviously. I welcome any ideas!

Here is the code, including the tests; it can be compiled as is:

#include <time.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

const size_t        iters = 100000000;

//-----------------------------------------------------------------------------
// Optimized memcpy
//
static void         copy_words(void *dst, const void *src, size_t words)
{
    const uint64_t  *src64;
    uint64_t        *dst64;
    uint64_t        pages;
    uint64_t        offset;

    pages = words / 8;
    offset = words - pages * 8;
    src64 = (const uint64_t *)src;
    dst64 = (uint64_t *)dst;
    while (pages--)
    {
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
        *dst64++ = *src64++;
    }
    while (offset--)
        *dst64++ = *src64++;
}

static void         copy_small(void *dst, const void *src, size_t size)
{
    const uint64_t  *src64;
    uint64_t        *dst64;

    src64 = (const uint64_t *)src;
    dst64 = (uint64_t *)dst;
    *dst64 = *src64;
}

void                *mem_cpy(void *dst, const void *src, const size_t size)
{
    const uint8_t   *src8;
    uint8_t         *dst8;
    size_t          offset;
    size_t          words;
    size_t          aligned_size;

    if (!src || !dst)
        return (NULL);
    if (size <= 8)
    {
        copy_small(dst, src, size);
        return (dst);
    }
    words = size / 8;
    aligned_size = words * 8;
    offset = size - aligned_size;
    copy_words(dst, src, words);
    if (offset)
    {
        src8 = (const uint8_t *)src;
        src8 = &src8[aligned_size];
        dst8 = (uint8_t *)dst;
        dst8 = &dst8[aligned_size];
        while (offset--)
            *dst8++ = *src8++;
    }
    return (dst);
}

//-----------------------------------------------------------------------------
// Naive memcpy
//
void                *mem_cpy_naive(void *dst, const void *src, size_t n)
{
    const uint8_t   *src8;
    uint8_t         *dst8;

    if (src == NULL)
        return (NULL);
    src8 = (const uint8_t *)src;
    dst8 = (uint8_t *)dst;
    while (n--)
        *dst8++ = *src8++;
    return (dst);
}

//-----------------------------------------------------------------------------
// Tests
//
int         test(int (*f)(), char *test_name)
{   
    clock_t begin = clock();
    f();
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%s: %fn", test_name, time_spent);
    return (1);
}

char        *big_data()
{
    char    *out;
    size_t  i;

    out = (char *)malloc(sizeof(char) * 1000);
    i = 0;
    while (i < 1000)
    {
        out[i] = 'a';
        i++;
    }
    return (out);
}

int         test1()
{
    char    *src;
    char    *dst;
    size_t  i;

    src = big_data();
    dst = (char *)malloc(sizeof(char) * 1000);
    i = 0;
    while (i < iters)
    {
        mem_cpy(dst, src, 1000);
        i++;
    }
    return (1);
}


int         test2()
{
    char    *src;
    char    *dst;
    size_t  i;

    src = big_data();
    dst = (char *)malloc(sizeof(char) * 1000);
    i = 0;
    while (i < iters)
    {
        mem_cpy_naive(dst, src, 1000);
        i++;
    }
    return (1);
}

int         test3()
{
    char    *src;
    char    *dst;
    size_t  i;

    src = big_data();
    dst = (char *)malloc(sizeof(char) * 1000);
    i = 0;
    while (i < iters)
    {
        memcpy(dst, src, 1000);
        i++;
    }
    return (1);
}

int         test4()
{
    char    *src;
    char    *dst;
    size_t  i;

    src = "12345678";
    dst = (char *)malloc(sizeof(char) * 8);
    i = 0;
    while (i < iters)
    {
        mem_cpy(dst, src, 8);
        i++;
    }
    return (1);
}

int         test5()
{
    char    *src;
    char    *dst;
    size_t  i;

    src = "12345678";
    dst = (char *)malloc(sizeof(char) * 8);
    i = 0;
    while (i < iters)
    {
        mem_cpy_naive(dst, src, 8);
        i++;
    }
    return (1);
}

int         test6()
{
    char    *src;
    char    *dst;
    size_t  i;

    src = "12345678";
    dst = (char *)malloc(sizeof(char) * 8);
    i = 0;
    while (i < iters)
    {
        memcpy(dst, src, 8);
        i++;
    }
    return (1);
}

int         main(void)
{
    test(test1, "User memcpy (big string)");
    test(test2, "User memcpy naive (big string)");
    test(test3, "Libc memcpy (big string)");
    test(test4, "User memcpy");
    test(test5, "USer memcpy naive");
    test(test6, "Libc memcpy");
}

I won’t paste the assembly, since I think it’s more convenient to just put a link to compiler explorer:

https://godbolt.org/z/Yva9EaPrP

google sheets – Dynamic Line Charts for Stock Performance Over Time

I’ve been riding the struggle bus for a few days now trying to figure out dynamic charts.

I am trying to create a dynamic chart that will show stock data for specific increments of time (30, 60, and 90 days; 1, 2, 3, 5, and 10 years).

I’ve watched a few tutorials, but they are all for basic, non-complex data. I am trying to do it for line charts with two sets of data. I’ve managed to create individual standalone charts for each set of data, but I can’t figure out how to build it all into one dynamic chart. Please help; the link is below.

https://docs.google.com/spreadsheets/d/1IHot5YvtyBNDWDIp0hhnqWpohg9CkubTYJEJHez8K0c/edit?usp=sharing

javascript – How can I reduce my RPG’s lag for better performance?

Okay, so I’ve started making a game using Khan Academy’s Processing JavaScript, and I’m trying to de-lag it a bit since the framerate is very shaky. I’ve taken someone’s advice: they told me that instead of new object_name(), I could use something like object_name.create(), by adding this code:

Object.constructor.prototype.create = function() {
    var obj = Object.create(this.prototype);
    this.apply(obj, arguments);
    return obj;
};

So, I’m assuming it’s helping, since the framerate improved a bit, but not all that much. I’ve been using a lot of OOP, and people have told me that Khan Academy’s memory handling is acting weird, so it’s not as good to use OOP there, but I’m used to using it, so I’d like to see what kinds of things could help improve my program’s framerate. I’m planning to check whether the NPCs and enemies are in range and only draw them if they are, but I’m not sure if this would help, because I’m not very familiar with computers and how fast they run certain things. Thank you.

For reference, this is my game Quests of a Warrior

This is the comment that someone posted that I happened to see, and that’s why I used the object_name.create() for now. Discussion post (expand_key will lead you there)

If you are unfamiliar with Processing JavaScript, you can go to the documentation, although the Khan Academy version doesn’t include much. You’ll find it here, and you can also glance at the real Processing JavaScript documentation (note that some of it can’t be used for Khan Academy purposes) here.

performance – Lo Shu Magic Square (Python)

I wrote a Python program to determine whether or not a matrix is a magic square. It works, but I can’t help feeling that I may have overcomplicated the solution. I have seen other implementations that were a lot shorter, but I was wondering how efficient or inefficient my code is, and how I could improve and/or shorten it while still achieving the result I am looking for.

def main():

    matrix = ((4,9,2),
              (3,5,7),
              (8,1,6))
    
    result = loShu(matrix)

    print(result)


def loShu(matrix):

    i = 0
    j = 0

    for i in range(0, len(matrix)):
        for j in range(0, len(matrix[j])):
            if ((matrix[i][j] < 1) or (matrix[i][j] > 9)):
                return ("This is not a Lo Shu Magic Square - one of the numbers is invalid")

    row1 = matrix[0][0] + matrix[0][1] + matrix[0][2]
    row2 = matrix[1][0] + matrix[1][1] + matrix[1][2]
    row3 = matrix[2][0] + matrix[2][1] + matrix[2][2]

    ver1 = matrix[0][0] + matrix[1][0] + matrix[2][0]
    ver2 = matrix[0][1] + matrix[1][1] + matrix[2][1]
    ver3 = matrix[0][2] + matrix[1][2] + matrix[2][2]

    diag1 = matrix[0][0] + matrix[1][1] + matrix[2][2]
    diag2 = matrix[0][2] + matrix[1][1] + matrix[2][0]

    checkList = (row1,row2,row3,ver1,ver2,ver3,diag1,diag2)

    temp = checkList[0]

    for x in range (0, len(checkList)):
        if checkList[x] != temp:
            return ("This is not a Lo Shu Magic Square")

    return ("This is a Lo Shu Magic Square")


main()

bitcoind – Raspiblitz slow sync performance

I’m syncing my Raspberry Pi 4 Model B (4 GB) and after 4 days I’m less than 50% synced. It seems as though there is something wrong with bitcoind, as there are some weird symptoms going on. The Pi is running RaspiBlitz 1.7RC2 (64-bit). I have 400 Mbps internet, and the Pi is connected to a new 1 TB SanDisk SSD.

First I looked at the debug.log file (sudo tail -f /mnt/hdd/bitcoin/debug.log). When bitcoind is first started up, I see blocks being added at a rate of several per second. After letting it run for a while, it slows down to one block every few seconds. I also get ping timeouts, and my peer connections are constantly cutting off and don’t exceed ~10.

Trying to communicate with bitcoind also becomes slow: bitcoin-cli getnetworkinfo | grep connections is instant at the start, but after a while this command can take 30 seconds to execute.

Finally, according to nmon, the system usage is very low, except that bitcoind is reading from the SSD at a rate of ~250 mbps but not doing much writing. The CPU is mostly blocked waiting on I/O. The only thing I have tried was lowering the dbcache by 500 MB, thinking there wasn’t enough free memory, but it did not help.

bitcoin.conf:

# bitcoind configuration
# mainnet/testnet
testnet=0
# Bitcoind options
server=1
daemon=1
txindex=0
disablewallet=1
peerbloomfilters=1
# Connection settings
rpcuser=raspibolt
rpcpassword=c4w0Mh2q
rpcport=8332
rpcallowip=127.0.0.1
rpcbind=127.0.0.1:8332
zmqpubrawblock=tcp://127.0.0.1:28332
zmqpubrawtx=tcp://127.0.0.1:28333
# Raspberry Pi optimizations
dbcache=2560
maxorphantx=10
maxmempool=300
maxconnections=40
maxuploadtarget=5000
datadir=/mnt/hdd/bitcoin

(nmon screenshot)

Edit
After several restarts, the sync seems to be going faster than before, even after waiting. The disk reads are much lower, with some writes as well. Not sure why lower read speeds result in faster syncing…

Edit 2

I noticed that after “leaving block file” it’s been operating at about half speed for a while now. Look at the timestamps in the attached screenshot; things slow down quite a bit after that.

macos – How to tell which Safari tab requires the high-performance GPU?

I’m experiencing lower-than-expected battery life on my 2015 MacBook Pro. I believe part of the problem is that the High Performance GPU is being activated by some tab in Safari when it isn’t needed:
Activity Monitor Energy Pane

This is NOT a CPU-related problem, as the CPU tab in Activity Monitor shows little activity:
Activity Monitor CPU pane

How can I tell which tab is requiring the high-performance GPU?

performance – Are there any advantages to having SQL hosting separate from a website? Hosting my own servers or paying for a service?

Background

I have been tinkering with an application I am writing. It is essentially a POS system that stores information about customers and inventory. Currently I keep all the data stored in CSVs while I develop the GUI, classes, etc.; however, I have always envisioned needing a SQL server at some point. Additionally, farther down the line I might end up buying a domain and having my own website, but I may use something that has free hosting since I only expect there to be one page.

Quick aside: I am aware plain-text CSVs are not secure, and I am not storing real people’s info in them.

I have been curious to try hosting my own database (and possibly a web server). Currently, I cannot see my database exceeding a hundred real entries, but I do occasionally test with sample CSVs of around 20k entries. Obviously, I should keep scalability in mind.

I’m currently on Windows, but I used to feel at home in a Bash terminal, if that makes any difference.

Questions

Anyways, I suppose my questions are as follows:

What are the benefits of hosting your own database?

Are there any benefits to hosting your database on a separate server from your web server, e.g. MySQL hosting on Scale Grid and then eventually a web server on BlueHost?

Are there any advantages to hosting an SQL server on your own? I’m literally imagining a spare laptop, or building a small custom PC with a RAID setup. That said, a laptop with all the necessary files on a cloud backup such as OneDrive seems feasible.

Anyways, I am not necessarily looking for a direct answer. I am just hoping you all will share your experiences and thoughts.

Apologies if this is on the wrong exchange; it’s a bit of an open-ended question and this one seemed like a good fit!

performance – Powershell Refolder by Filename Before Delim

I’m pretty new to PowerShell but have been using it more and more to automate tasks at work, and now I’m reaching out for some help. I typically use the scripts I write by copying the .ps1 into the directory I want to perform the task on and then double-clicking it. That is why you’ll see the .ps1 filename being used to create variables; it lets me create a sort of preset per .ps1, depending on what I use it for.

In the example, files are grouped into a single folder based on the part of the filename before an underscore. They are then moved using multithreading. I’m using PowerShell 7.1.3. If anyone knows a way to dynamically set the thread count for the -ThrottleLimit parameter based on the PC, that would be great to know.

GitHub version of this script.

Is there a more optimized way of performing this re-folder task, whether a different approach or just a more optimized version of the code below? If my formatting is off, please let me know; that includes if you think a comment isn’t worded correctly or could be clearer. I appreciate any help.

# Set the delim - all characters in the filenames before the delim will become the new folders.
$refolderDelim = '_'

# Set Threads for multithreading
$threads = 6

# Get the filename of the script to then use substrings of the filename to set variables. 
$scriptPath = $MyInvocation.MyCommand.Path
$scriptName = Split-Path $scriptPath -leaf

# The file type before .ps1 is used as the filter, The two .'s are used to replace, Text before the two .'s in the filename can be changed.
$fileType = $scriptName -replace ('^.+?.','.') -replace ('.ps1')

Measure-Command {

# Filters files in the script's directory by filetype in script filename.
Get-ChildItem -File -Filter *$fileType |
    Group-Object { $_.Name -replace ($refolderDelim + '.*') } |
        ForEach-Object -Parallel {
        
# Checks if folder exists and creates it if not
            if ( -Not ( Test-Path -Path $_.Name ) ) {
                $dir = New-Item -Type Directory -Name $_.Name
            }
            
# If folder does exist it only sets the $dir variable
            else {
                $dir = $_.Name
            }
# Filenames are moved to the respective folder
            $_.Group | Move-Item -Destination $dir
        } -ThrottleLimit $threads
}

Write-Output 'REFOLDER FINISHED: Ready for Next Step'

# Remove lines below if you want Powershell to close automatically after running.
Read-Host -Prompt "Press Enter to exit"
exit

performance – Execution plan different on same db in two instances

I am struggling with a mysterious problem:
I have a query in production whose performance dropped dramatically, going from a few seconds to two minutes.
By analyzing the execution plan, I found that it inexplicably no longer uses a non-clustered index on a field in the table.
I performed all the following steps:

  • checked the index fragmentation;
  • recompiled the index;
  • dropped statistics;
  • recreated the index;
  • restarted the server.

But nothing: whenever I launch the query, the execution plan still doesn’t use the index.

If I force the query to use the index (with a query hint), it finishes instantly.
Why does SQL Server not consider the index in the execution plan?

To try to recreate the problem, I restored the same database in another instance of SQL Server: there, the execution plan uses the index by default.

Why is production still not using the index?
(The hardware configuration is very good; the server license is Standard Edition.)