For every $i in {0,ldots,|P|}$, $j in {0,ldots,|S|}$, and $T in {0,ldots,mathcal{T}}$ (I discuss $mathcal{T}$ below), compute whether there is a subset of size $j$ of $p_1,ldots,p_i$ which sums to $T$. You should take $mathcal{T}$ to be your maximum allowable answer – $max(P) cdot |S|$ would definitely do, but you can probably pick something which is closer to $mu_S |S|$, say $mu_P |S|$. The running time is $O(|P| |S| mathcal{T})$. In your case, assuming you choose $mathcal{T} approx mu_p |S|$, then $|P| |S| mathcal{T} approx 2^{40}$, so this is barely feasible.

This is a naive dynamic programming algorithm, which probably can be improved, especially since you’re only looking for an approximation.

For example, you could only look for solutions such that $S cap {0,ldots,i} approx frac{|S|}{|P|} i$ and $sum(S cap {0,ldots,i}) approx frac{|S|}{|P|} mu_S$, which will reduce the running time significantly. While this is only a heuristic, it might be possible to show that you get a decent approximation on average, if you randomize the order of $P$.

Another possible optimization is to quantize the partial sums, quantizing more aggressively as $i$ gets larger. If done carefully, you won’t lose much in accuracy but will have a significant gain in running time.