I want to compute means and standard deviations per hour in different subsets that partition the data set.
In a small data set this is simple: just run the code I have below as an example.
In a large data set, my method is not efficient: creating a billion A-Z variables, depleting the alphabet, … storing all partitions of my data, and repeatedly writing the criteria for the subset() function is slow.
I am trying to find a way to automate what I am doing using purrr or other packages.
I have looked at the purrr package, but I don't know how to use it.
I can use tapply to do the same thing and calculate means in subsets of the data.
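To make the tapply idea concrete, here is a minimal sketch in base R. The toy data frame `df` and its columns (`week`, `group01`, `age`, `y`) are made up for illustration and are not the real data; the point is only that passing a list of grouping vectors yields every subset in one call.

```r
# Hypothetical toy data; all column names are assumptions for illustration
set.seed(1)
df <- data.frame(
  week    = rep(c(0, 1, 2), times = 8),
  group01 = rep(c(0, 1), each = 12),
  age     = rep(c(25, 35), each = 6, times = 2),
  y       = rnorm(24)
)

# Turn the age criterion into a grouping column instead of a subset() condition
df$age_band <- ifelse(df$age < 30, "under30", "30plus")

# One call per statistic: the result is a week x group01 x age_band array
groups <- list(week = df$week, group01 = df$group01, age_band = df$age_band)
means  <- tapply(df$y, groups, mean)
sds    <- tapply(df$y, groups, sd)

# Mean for week 0, treatment group 0, age under 30
means["0", "0", "under30"]
```

The result is an array rather than a stacked data frame, so it would still need reshaping (for example with `as.data.frame(as.table(means))`) before exporting.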
Here is another reproducible example without external links.
I cannot combine a and b because the sets of months for treatment group 0 and treatment group 1 are not the same. But I could copy and paste the data frames a and b into Excel for my purposes.
Example 2: an example where I can combine the partitioned data frames
Reproducible example using the CD4 data set:
#use the cd4 dataset
a<-subset(r,group01==0 & age<30)%>%
a1<-subset(r,group01==0 & age>=30)%>%
#c is the finished data frame I wanted to make that I'll import into Excel
In both examples I need to produce something like this:
# A tibble: 679 x 3
    week     m     sd
 1  0     2.71  1.05
 2  3.57  2.71 NA
 3  4.14  2.71 NA
 4  4.71  1.79 NA
 5  6.57  3.22 NA
 6  6.86  2.30 NA
 7  7     3.37 NA
 8  7.29  3.76  0.560
 9  7.43  3.71  1.42
10  7.57  1.47  1.05
# … with 669 more rows
where the first partition is stacked on top of the other
I would like to do the same thing, maybe using purrr, without creating many variables (a, a1, b, b1) and without repeatedly writing conditions such as group01 == 1, age < 30 or age >= 30 in the subset() function.
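One way this could look with purrr is sketched below. It assumes a toy data frame `r` with hypothetical columns `week`, `group01`, `age` and `cd4` (stand-ins, not the real CD4 data), splits it once on every combination of the partitioning columns, applies the same per-week summary to each piece, and stacks the results.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data; column names and values are assumptions
r <- data.frame(
  week    = rep(c(0, 1), times = 8),
  group01 = rep(c(0, 1), each = 8),
  age     = rep(c(25, 25, 25, 25, 40, 40, 40, 40), times = 2),
  cd4     = seq_len(16)
)
r$age_band <- ifelse(r$age < 30, "under30", "30plus")

# Split once on all partitioning columns; no a, a1, b, b1 variables needed
parts <- split(r, list(r$group01, r$age_band), drop = TRUE)

# The same per-week summary is applied to every partition
summarise_by_week <- function(part) {
  part %>%
    group_by(week) %>%
    summarise(m = mean(cd4), sd = sd(cd4), .groups = "drop")
}

# Stack the partitions on top of each other, labelled by partition name
res <- parts %>%
  map(summarise_by_week) %>%
  bind_rows(.id = "partition")
```

The `partition` column records which subset each row came from (for example `0.under30`), so the stacked result can be exported as a single table.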
If I used a larger data set with more variables besides age, with four or more treatment groups instead of two, and I had to subdivide by sex, height, marital status, province, and political party affiliation, I would want this to work too. Doing it with dplyr the way I do now is slow, tedious, and inefficient, especially as the number of subset criteria or the dimensionality of the data set increases.
As you can see, even with only an age variable, the process is already much more difficult in Example 2.
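For many partitioning variables, one possible alternative to writing each subset by hand is to turn every criterion into a column and group by all of them at once. The sketch below assumes a toy data frame `r` with hypothetical columns `week`, `group01`, `age` and `cd4`; more variables (sex, province, and so on) would simply be extra names in `group_by()`.

```r
library(dplyr)

# Hypothetical toy data; column names and values are assumptions
r <- data.frame(
  week    = rep(c(0, 1), times = 8),
  group01 = rep(c(0, 1), each = 8),
  age     = rep(c(25, 25, 25, 25, 40, 40, 40, 40), times = 2),
  cd4     = seq_len(16)
)

res <- r %>%
  # Encode the age criterion once: [-Inf, 30) and [30, Inf)
  mutate(age_band = cut(age, breaks = c(-Inf, 30, Inf), right = FALSE,
                        labels = c("<30", ">=30"))) %>%
  # Every combination of the grouping columns is one partition
  group_by(group01, age_band, week) %>%
  summarise(m = mean(cd4), sd = sd(cd4), .groups = "drop")
```

Each row of `res` is one week within one partition, so the partitions come out already stacked and do not need to share the same set of weeks.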
I am trying to find a more efficient way to do this, especially if the cd4 data set had more information. Not sure how to use Python.
Similar question but without reproducible example:
I think the difficulty of this task has to do with the curse of dimensionality.
I cannot change the group_by condition.