optimization – Clustering sets by set difference

Suppose you have $n$ distinct sets $S_1, \ldots, S_n$ and some constant $0 \le k < n$. The goal of set clustering is to find a partition $\mathcal{P}$ of $\mathbf{S} = \{S_1, \ldots, S_n\}$ such that the sum of the total distances of the parts of $\mathcal{P}$ is minimized and such that $\mathrm{cardinality}(\mathcal{P}) = k$ (in reality, this latter constraint is not quite so tight, but the number of parts must be at most $k$, and hopefully near it). The total distance $d$ of a part $X \in \mathcal{P}$ is $d(X) = \sum_{A, B \in X} \mathrm{cardinality}(A \ominus B)$, where $\ominus$ is the symmetric difference. Assume for the purposes of the problem that the sets consume $O(1)$ space (so a symmetric difference can be computed in constant time).

Is there a good greedy linear(ithmic) heuristic for this problem? Is there any literature on this problem, or similar ones?


All I’ve come up with so far is an $O(n^2 \log n)$ heuristic that looks like:

  1. Set $I = \{1, \ldots, n\}$
  2. Choose some $i \in I$
  3. Emit a cluster $C \subseteq I$ containing all $m \in I$ such that $\mathrm{cardinality}(S_i \ominus S_m) \le d_\mathrm{max}$.
  4. Set $I = I \setminus C$
  5. If $I$ is nonempty, go to step 2.

where this process occurs in each step of a binary search that finds the best value of $d_\mathrm{max}$ for a given $k$.
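
For concreteness, here is a minimal Python sketch of one greedy pass for a fixed $d_\mathrm{max}$ (the sets are assumed to be frozensets, and the outer binary search over $d_\mathrm{max}$ is omitted):

def greedy_clusters(sets, d_max):
    remaining = set(range(len(sets)))                 # the index set I
    clusters = []
    while remaining:
        i = next(iter(remaining))                     # step 2: choose some i in I
        # step 3: all m in I whose symmetric difference with S_i has size <= d_max
        cluster = {m for m in remaining if len(sets[i] ^ sets[m]) <= d_max}
        clusters.append(cluster)
        remaining -= cluster                          # step 4: I = I \ C
    return clusters

sets = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({7, 8, 9})]
print(greedy_clusters(sets, d_max=2))                 # [{0, 1}, {2}]

Each pass is $O(n^2)$ under the $O(1)$ symmetric-difference assumption, and wrapping it in the binary search over $d_\mathrm{max}$ gives the $O(n^2 \log n)$ bound above.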

I was thinking that if there were some way to sort the list of sets such that nearby sets in the list have small symmetric difference, then a linearithmic solution might be easy to write.

c++ – Implementing a Queue with Small size optimization

I was trying to implement a queue with the small size optimization. I was wondering if there are any ways I can make this code better (more performant, better code style, etc.). It seems to work fine in terms of correctness.

For reference, Global::specifyError is my error-handling mechanism for the project, and Global::getAllocator()->allocate is currently just implemented with new.

Thank you!

Code:

#ifndef QUEUE_HPP
#define QUEUE_HPP

#include "Error.h"
#include "Allocator.h"
#include <cstddef>
#include <algorithm>

namespace Utils {
    template<typename T>
    class Queue {
        private:
            static const int FLAG_POS = (sizeof(size_t)*8)-1;
            static const size_t FLAG_SFT = 1ULL << FLAG_POS;
            static const size_t MAX_CAPACITY = (1ULL << FLAG_POS-1)-1;
        protected:
            T* start;
            size_t _front;
            size_t _back;
            size_t length;
            size_t capacity; // bit manipulation to encode stack/heap.
            
            Queue(T* st, size_t f, size_t b, size_t l, size_t c){
                if(__builtin_expect(c > MAX_CAPACITY,false)){
                    Global::specifyError("Queue requested more than 1 << sizeof(size_t)-2 capacity.");
                    throw Global::MemoryRequestError;
                }
                _front = f;
                _back = b;
                start = st;
                length = l;
                capacity = c;
            }
            ~Queue(){
            }
        public:
            // no code reuse with Vector, too many different things.
            void push_back(T val){
                const bool heap = capacity>>FLAG_POS;
                const bool leqc = (length&FLAG_SFT-1) >0 && _front == _back;
                enum:char {STACK_PUSH, STACK_MOVE_HEAP, HEAP_PUSH, HEAP_GROW};
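                // switch key below = (heap<<1) + leqc: bit 1 set when storage is on the heap, bit 0 set when the ring buffer is full.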
                
                switch((heap<<1)+leqc){
                    case STACK_MOVE_HEAP:
                    case HEAP_GROW:
                    {
                        if(__builtin_expect(length*2 > MAX_CAPACITY,false)){
                            Global::specifyError("Vector requested more than (1 << sizeof(size_t)-2)-1 capacity.");
                            throw Global::MemoryRequestError;
                        }
                        T* aux = Global::getAllocator()->allocate<T>(length*2);
                        std::copy(start+_front, start+capacity, aux);
                        std::copy(start, start+_back, aux+capacity-_front);
                        capacity = FLAG_SFT+length*2;
                        if(__builtin_expect((heap<<1)+leqc==HEAP_GROW,true)){
                            Global::getAllocator()->deallocate<T>(start);
                        }
                        start = aux;
                        _front = 0;
                        _back = length;
                    }
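                    // intentional fallthrough: after moving to the heap / growing, push the value as usual.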
                    case STACK_PUSH:
                    case HEAP_PUSH:
                        ++length;
                        start[_back++] = val;
                        _back &= capacity-1;
                        break;
                    // no default case.
                }
            }
            void pop_front(){
                --length;
                ++_front;
                _front &= capacity-1;
            }
            int size() const {
                return length;
            }
            T front() const {
                return start[_front];
            }
    };
      
    /*based on Chandler Carruth's talk at CppCon 2016 */
    /** Type T, threshold for stack allocation N.*/
    
    template<typename T, size_t N>
    class SmallQueue : public Queue<T> {
        private:
            static constexpr size_t nextPow2(const size_t curr){
                size_t s = curr;
                --s;
                s |= s >> 1;
                s |= s >> 2;
                s |= s >> 4;
                s |= s >> 8;
                return s+1;
            }
            // should get replaced with literal value.
            T buffer[nextPow2(N)];
        public:
            SmallQueue() : Queue<T> (buffer, 0, 0, 0, nextPow2(N)) {
            }
            ~SmallQueue(){
                if(this->capacity > nextPow2(N)){
                    Global::getAllocator()->deallocate<T>(this->start);
                }
            }
    };
}

#endif

optimization – How to leverage the fact that I’m solving 1000’s of very similar SMT instances?

I have a core SMT solver problem consisting of 100,000 bit-vector array clauses, and one 10000-dimensional bit-vector array. Then, my program takes as input k << 100,000 new clauses, and adds them to the core SMT problem. My goal is, for any input of k clauses, to solve the entire problem.
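
In code, the per-instance workflow looks roughly like the sketch below (this assumes Z3's Python bindings purely for illustration, since the question does not name a solver, and the tiny core and instance clauses are stand-ins for the real ones):

from z3 import BitVec, Solver

x = BitVec("x", 32)
core_clauses = [x > 0, x < 1000]       # stand-in for the 100,000 shared core clauses
instances = [[x == 7], [x == 11]]      # stand-ins for the batches of k new clauses

solver = Solver()
solver.add(*core_clauses)              # assert the shared core once

for extra in instances:
    solver.push()                      # snapshot the core state
    solver.add(*extra)                 # add this instance's k clauses
    print(solver.check())              # solve the combined problem
    solver.pop()                       # drop the extras, keep the core asserted

Incremental push/pop already avoids re-asserting the core; the question is whether anything smarter can be precomputed on the core itself.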

Is there any static optimization or learning I could do on the core problem in order to find a better way to solve each of its siblings? For instance, is there some property of the graph of bit-vector variables being constrained in each clause that I could use as a heuristic for solving the specific instances?

Thanks!

python – Optimization for data storage

I’d like your advice on the design of my application.

I use websockets to receive new data and the requests module to retrieve older data.
Then I use pyqtgraph to display data, tables, etc. with PyQt5.

Some data I don't keep in memory; I just display it on screen without the possibility of interacting with it. Other data I keep in memory and do some processing with.

I would like to know whether I should use dictionaries to store and process the data, create an SQL database, or use pandas.
There will be a lot of inserting, extracting, and deleting, and a lot of calculations.

Potentially, when there are big moves, I could have thousands of messages per second to process, which I would have to add to my database, process, and then display on screen (or do whatever I want with them).

If you have any questions, don’t hesitate.

Example of connection:

import websockets
import asyncio
import json

async def capture_data():
    subscriptions = ("quote", "trade", "instrument")
    uri = "wss://www.bitmex.com/realtime?subscribe=" + ",".join(subscriptions)

    async with websockets.connect(uri) as websocket:
        while True:
            data = await websocket.recv()
            print(json.loads(data))


asyncio.get_event_loop().run_until_complete(capture_data())
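
One pattern to compare the options against (a sketch only; the field names and sizes are placeholders, not a recommendation): buffer the raw messages in a deque for cheap appends and drops, and batch-convert to a pandas DataFrame only when a calculation is actually needed.

from collections import deque
import pandas as pd

trades = deque(maxlen=100_000)    # bounded in-memory buffer of raw message dicts

def on_message(msg: dict) -> None:
    # called for every decoded websocket message
    trades.append(msg)

def compute_stats() -> pd.Series:
    # heavier processing done in a batch, e.g. mean price per symbol
    df = pd.DataFrame(list(trades))
    return df.groupby("symbol")["price"].mean()

on_message({"symbol": "XBTUSD", "price": 9500.0})
on_message({"symbol": "XBTUSD", "price": 9510.0})
print(compute_stats())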

optimization – Are many database writes an anti-pattern?

I am designing a database API. I have two options:

  1. Have a complex aggregation that queries the data as desired, or
  2. Create a materialized view that logs all data changes in an easy-to-query manner and potentially makes querying this data faster than option 1.

I was wondering: is having many writes an anti-pattern in database design and administration? Because if not, I would love to use option 2.

I am using a NoSQL database, MongoDB.

optimization – Minimize the maximum inner product with vectors in a given set

Given a set $S$ of non-negative unit vectors in $\mathbb{R}_+^n$, find a non-negative unit vector $x$ such that the largest inner product of $x$ and a vector $v \in S$ is minimized. That is,
$$
\min_{x \in \mathbb{R}_+^n,\, \|x\|_2 = 1} \ \max_{v \in S}\ x^T v.
$$

It seems like quite a fundamental problem in computational geometry. Has this problem been considered in the literature?

It can be formulated as an infinity-norm minimization problem, which can in turn be expressed as a quadratically constrained LP. If the rows of matrix $A$ are the vectors in $S$, we seek
$$
\begin{align}
&& \min_x \|Ax\|_\infty \\
\mathrm{s.t.} && x^T x = 1 \\
&& x \geq 0.
\end{align}
$$

But the quadratic constraint is non-convex, so this is not very encouraging.
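
The non-convex program can at least be attacked locally. Below is a sketch (illustrative only, with random data standing in for $S$) that rewrites the problem in epigraph form, minimizing $t$ subject to $Ax \le t\mathbf{1}$, $x^T x = 1$ and $x \ge 0$, and hands it to SciPy's SLSQP; because of the non-convex equality constraint this only returns a local optimum from the chosen starting point.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.random((5, 3))                           # rows stand in for the vectors in S
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize them to unit vectors
n = A.shape[1]

def objective(z):
    return z[-1]                                  # z = (x_1, ..., x_n, t); minimize t

constraints = [
    {"type": "ineq", "fun": lambda z: z[-1] - A @ z[:-1]},     # t >= (Ax)_i for every row i
    {"type": "eq",   "fun": lambda z: z[:-1] @ z[:-1] - 1.0},  # x^T x = 1 (the non-convex part)
]
bounds = [(0.0, None)] * n + [(None, None)]                    # x >= 0, t free

z0 = np.append(np.ones(n) / np.sqrt(n), 1.0)     # feasible starting point (nonnegative, unit norm)
res = minimize(objective, z0, method="SLSQP", bounds=bounds, constraints=constraints)
x_opt, t_opt = res.x[:-1], res.x[-1]
print(x_opt, t_opt)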

performance – Particle Swarm Optimization template

I am trying to create a PSO algorithm template for my project. I've decided to make the Swarm a separate type and a PSO class that runs the algorithm. For this program, the intention is to minimize the sphere function, so the result is expected to be close to 0. I am seeking advice on how I should improve my code so that it is more efficient and scalable for my future use. Furthermore, I tried to add unit testing to one of the functions, but I really doubt that it is done the proper way. A snippet of the Swarm type is as follows:

template<typename T>
struct Swarm{
  const int number_of_particles = 1;
  struct Particles{
    std::vector<T> X;
    std::vector<T> V;
    std::vector<T> P_BEST_X;
    T P_BEST_O = (T)inf;
  };
  std::vector<T> G_BEST_X;
  T G_BEST_O = (T)inf;
  std::vector<Particles> particles;

  Swarm(){
    particles.reserve(number_of_particles);
  }

  Swarm(const int& nVar, const int& number_of_particles)
      : number_of_particles(number_of_particles){
    particles.resize(number_of_particles);
    for(auto it = particles.begin(); it != particles.end(); ++it){
      it->X.resize(nVar, 0.0);
      it->V.resize(nVar, 0.0);
      it->P_BEST_X.resize(nVar, 0.0);
      it->P_BEST_O = inf;
    }
    G_BEST_O = inf;
    G_BEST_X.resize(nVar, 0.0);
  }
};

The following Github repository contains the remaining code. The snippet is excerpted from pso.h. I deeply apologize in case my code is really horrible.

Complete on-page SEO optimization of your WordPress website for $10

Complete on-page SEO optimization of your WordPress website

Are you struggling to rank your site on the first page of Google? Don't worry any longer; you are at the right gig. I will handle all your on-page SEO to rank your site at the top of Google for your targeted keyword.

Why Me?

Manual – White Hat Work

100% Customer Satisfaction Guaranteed

After-Sale Service and Maintenance

On Page Optimization:

  • Formulating and Executing Content Marketing Strategy*
  • Best KWs Research for your Niche
  • OnPage Optimization for Targeted Keywords
  • Title and Meta Description
  • H1, H2 and H3 Tags Setup
  • Internal Linking
  • Yoast Settings
  • Clean Permalink
  • Image Optimization* and Alt Tags
  • Google Search Console and Analytics Setup*
  • XML Sitemap
  • Robots.txt
  • SEO Audit and Reporting*

Check gig extras

Order now or contact me for additional details.

Thank You


optimization – Why does a query take so long in the statistics thread state in AWS Aurora MySQL?

The following query spends too long in the statistics state and I couldn't figure out why.

DB engine – 5.7.mysql_aurora.2.07.2

DB Size – db.r5.4xlarge

Sample Query Profile output

+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000023 |
| checking query cache for query | 0.000155 |
| checking permissions           | 0.000009 |
| checking permissions           | 0.000002 |
| checking permissions           | 0.000003 |
| checking permissions           | 0.000002 |
| checking permissions           | 0.000009 |
| Opening tables                 | 0.000035 |
| init                           | 0.000102 |
| System lock                    | 0.000035 |
| optimizing                     | 0.000004 |
| optimizing                     | 0.000003 |
| optimizing                     | 0.000011 |
| statistics                     | 0.224528 |
| preparing                      | 0.000030 |
| Sorting result                 | 0.000017 |
| statistics                     | 0.000041 |
| preparing                      | 0.000013 |
| Creating tmp table             | 0.000023 |
| optimizing                     | 0.000013 |
| statistics                     | 0.064207 |
| preparing                      | 0.000035 |
| Sorting result                 | 0.000025 |
| statistics                     | 0.000098 |
| preparing                      | 0.000018 |
| executing                      | 0.000011 |
| Sending data                   | 0.000007 |
| executing                      | 0.000003 |
| Sending data                   | 0.000251 |
| executing                      | 0.000007 |
| Sending data                   | 0.000003 |
| executing                      | 0.000002 |
| Sending data                   | 0.000526 |
| end                            | 0.000007 |
| query end                      | 0.000013 |
| removing tmp table             | 0.000007 |
| query end                      | 0.000004 |
| closing tables                 | 0.000003 |
| removing tmp table             | 0.000004 |
| closing tables                 | 0.000002 |
| removing tmp table             | 0.000005 |
| closing tables                 | 0.000002 |
| removing tmp table             | 0.000004 |
| closing tables                 | 0.000010 |
| freeing items                  | 0.000050 |
| storing result in query cache  | 0.000007 |
| cleaned up                     | 0.000004 |
| cleaning up                    | 0.000017 |
+--------------------------------+----------+

Query

select xo.ITEM, xo.VALUE
from (
         select pi.ITEM, pi.ITEM_GROUP, pi.VALUE
         from TABLE_2 pi
                  inner join (select max(ps.EXPORTED_DATE) as max_expo, ps.ITEM
                              from TABLE_2 ps
                                       inner join (
                                  select max(pp.EFFECTIVE_DATE) max_eff_TABLE_2, pp.ITEM
                                  from TABLE_2 pp
                                  where pp.EFFECTIVE_DATE <= '2020/07/17'
                                    and ITEM in
                                        ('20', '30', '40', '50', '110', '120', '320', '520', '720', '820', '920', '321',
                                         '275', '221')
                                  group by ITEM
                              ) a on ps.EFFECTIVE_DATE = a.max_eff_TABLE_2 and ps.ITEM = a.ITEM
                              group by a.ITEM) rr on rr.ITEM = pi.ITEM and rr.max_expo = pi.EXPORTED_DATE) xo

         inner join (
    select ea.ITEM, ea.CUSTOMER_ID, ea.ITEM_GROUP
    from TABLE_1 ea
             inner join (
        select MAX(e.EFFECTIVE_DATE) eat_max_eff, e.ITEM, e.CUSTOMER_ID
        from TABLE_1 e
        where e.CUSTOMER_ID = '20'
          and ITEM in ('20', '30', '40', '50', '110', '120', '320', '520', '720', '820', '920', '321', '275', '221')
          and EFFECTIVE_DATE <= '2020/07/17'
        group by e.ITEM
    ) aa
    where ea.ITEM = (aa.ITEM)
      and ea.CUSTOMER_ID = aa.CUSTOMER_ID
      and ea.EFFECTIVE_DATE = aa.eat_max_eff) lo
                    on lo.ITEM_GROUP = xo.ITEM_GROUP and lo.ITEM = xo.ITEM;

Indexes

Table 1

mysql> SHOW INDEX FROM T1;
+-------+------------+--------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name     | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+--------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| T1    |          0 | PRIMARY      |            1 | CUSTOMER_ID    | A         |     3297549 |     NULL | NULL   |      | BTREE      |         |               |
| T1    |          0 | PRIMARY      |            2 | ITEM           | A         |   687374784 |     NULL | NULL   |      | BTREE      |         |               |
| T1    |          0 | PRIMARY      |            3 | EFFECTIVE_DATE | A         |  1314196480 |     NULL | NULL   |      | BTREE      |         |               |
| T1    |          1 | t1_ix_item   |            1 | ITEM           | A         |     2151649 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+--------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Table 2


mysql> SHOW INDEX FROM TABLE_2;
+-------+------------+-----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name              | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| T2    |          0 | PRIMARY               |            1 | ITEM           | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          0 | PRIMARY               |            2 | ITEM_GROUP     | A         |       14265 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          0 | PRIMARY               |            3 | EFFECTIVE_DATE | A         |    63663076 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          0 | PRIMARY               |            4 | EXPORTED_DATE  | A         |    62464764 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_ix_item_expo       |            1 | ITEM           | A         |      115823 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_ix_item_expo       |            2 | EXPORTED_DATE  | A         |    13766454 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_ix_item_eff_date   |            1 | ITEM           | A         |      115823 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_ix_item_eff_date   |            2 | EFFECTIVE_DATE | A         |    13766454 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_ix_item_eff_ig     |            1 | ITEM           | A         |      115823 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_ix_item_eff_ig     |            2 | EFFECTIVE_DATE | A         |    13766454 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_ix_item_eff_ig     |            3 | ITEM_GROUP     | A         |    68216912 |     NULL | NULL   |      | BTREE      |         |               |
| T2    |          1 | t2_idx_effective_date |            1 | EFFECTIVE_DATE | A         |       79406 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+-----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

According to this: statistics State in MySQL Processlist

I checked the innodb_buffer_pool_size.

mysql> SHOW VARIABLES LIKE "innodb_buffer_pool_size";
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 96223625216 |
+-------------------------+-------------+

In the EXPLAIN output the row counts are minimal (they depend on the item count in the query; if the item count is 10, the number of rows is 20). Even though the row counts are minimal, why does the query spend so long in the statistics state?