I have come across a challenging algorithmic puzzle while trying to generate a large amount of test data. The problem is the following:

We have $N$ buckets $B_1$ through $B_N$. Each bucket $B_i$ maps to a single item $a_i$ and a count $k_i$. In total the collection holds $T = \sum_{i=1}^{N} k_i$ items. This is a compact representation of a vector of $T$ items in which each $a_i$ is repeated $k_i$ times.

We want to output a random list of $T$ items, with all permutations equally probable, using only $O(N)$ space and the best time complexity we can get. (Assume a perfect RNG.)

$N$ is fairly large and $T$ is much larger; 5,000 and 5,000,000 in the problem that led me to this investigation.
Now clearly the time complexity is at least $O(T)$, since we have to emit that many items. But how close can we get to that lower bound? Some algorithms:

Algorithm 1: Expand the buckets into a vector of $T$ items and run Fisher–Yates on it. This takes $O(T)$ time, but also $O(T)$ space, which we want to avoid.
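A minimal sketch of Algorithm 1, assuming the buckets are given as a list of `(item, count)` pairs (the function name and input format are my own):

```python
import random

def shuffle_expand(buckets):
    """Algorithm 1: expand the buckets into a flat list of T items,
    then shuffle it in place. O(T) time, but also O(T) space.
    `buckets` is a list of (item, count) pairs."""
    expanded = [a for a, k in buckets for _ in range(k)]
    random.shuffle(expanded)  # random.shuffle is a Fisher-Yates shuffle
    return expanded
```

With $T = 5{,}000{,}000$ the expanded list is exactly the memory cost the question is trying to avoid.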

Algorithm 2: At each step, pick a random number $R$ from $[0, T-1]$. Walk through the buckets, subtracting each $k_i$ from $R$ in turn, until $R < 0$; then emit $a_i$ and decrement $k_i$ and $T$. This seems correct and uses no extra space. However, it takes $O(NT)$ time, which is quite slow when $N$ is large.
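A sketch of Algorithm 2 under the same assumed `(item, count)` input format; the only extra storage is a mutable copy of the counts:

```python
import random

def shuffle_linear_scan(buckets):
    """Algorithm 2: repeatedly draw R in [0, T-1] and scan the buckets,
    subtracting counts, to locate the R-th remaining item.
    O(N) extra space, but O(N*T) time overall."""
    items = [a for a, _ in buckets]
    counts = [k for _, k in buckets]
    total = sum(counts)  # T
    out = []
    while total > 0:
        r = random.randrange(total)  # R in [0, total - 1]
        for i, k in enumerate(counts):
            r -= k
            if r < 0:  # bucket i holds the selected item
                out.append(items[i])
                counts[i] -= 1
                total -= 1
                break
    return out
```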

Algorithm 3: Convert the bucket vector into a balanced binary tree with the buckets at the leaf nodes; the depth should be close to $\log_2 N$. Each node stores the total count of all the buckets below it. To shuffle, pick a random number $R$ from $[0, T-1]$, then descend the tree accordingly, decrementing each node's count as we go; when descending to the right, reduce $R$ by the count of the left child. When we reach a leaf node, emit its value. Uses $O(N)$ space and $O(T \log N)$ time.
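One way to sketch Algorithm 3 is with an implicit segment tree stored in a flat array (leaves padded to a power of two); this is my own choice of representation, not something specified in the problem:

```python
import random

def shuffle_tree(buckets):
    """Algorithm 3: a balanced binary tree over the bucket counts,
    stored in a flat array of size 2*m. Node i has children 2i and
    2i+1; each internal node holds the sum of counts below it.
    O(N) space and O(T log N) time."""
    n = len(buckets)
    m = 1
    while m < n:          # round leaf count up to a power of two
        m *= 2
    tree = [0] * (2 * m)
    for i, (_, k) in enumerate(buckets):
        tree[m + i] = k   # leaves hold the bucket counts
    for i in range(m - 1, 0, -1):
        tree[i] = tree[2 * i] + tree[2 * i + 1]
    out = []
    total = tree[1]       # T, stored at the root
    while total > 0:
        r = random.randrange(total)  # R in [0, total - 1]
        node = 1
        while node < m:              # descend to a leaf
            tree[node] -= 1          # decrement counts on the way down
            left = 2 * node
            if r < tree[left]:
                node = left
            else:
                r -= tree[left]      # going right: skip the left subtree
                node = left + 1
        tree[node] -= 1
        out.append(buckets[node - m][0])
        total -= 1
    return out
```

Zero-count padding leaves are never selected, because the descent only goes left when $R$ is strictly below the left child's count.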

Algorithm 3a: Same as Algorithm 3, but with a Huffman tree; this should be faster if the $k_i$ values vary widely, since the most frequently visited nodes will be closer to the root. The performance is harder to evaluate, but it seems it would range from $O(T)$ to $O(T \log N)$ depending on the distribution of the $k_i$.
Algorithm 3 is the best I have found. Here are some illustrations to clarify it:
Does anyone know of a more efficient algorithm? I have tried searching with several terms, but I could not find any discussion of this particular task.