When a study population is composed of heterogeneous subpopulations, stratified random sampling techniques are often employed to obtain more precise estimates of population characteristics. Efficiently allocating samples to strata under this method is a crucial step in the study design, especially when sampling is expensive.
optimall
offers a collection of functions that are
designed to streamline the process of optimum sample allocation, for a
single wave or an adaptive, multi-wave approach. Its main functions
allow users to:
Define, split, and merge strata based on values or percentiles of other variables.
Calculate the optimum number of samples to allocate to each stratum in a given study in order to minimize the variance of the target sample mean.
Select specific IDs to sample based on a stratified sampling design.
Optimally allocate a fixed number of samples to a sampling wave based on results from a prior wave.
When used together, these functions can automate most of the sampling
workflow. This vignette will outline these package features one by one
(moving from simple to more complex), introduce the theoretical
framework behind the functions, and finally demonstrate how they can be
used collectively to perform multi-wave sampling in R. The final section
then details how to use optimall_shiny()
to efficiently
make decisions about splitting strata.
In stratified sampling, strata are typically defined on values or
quantiles of inexpensive variables available for the entire population.
Given a dataset with one row per sampling unit and at least one column
containing a variable that can be used to construct strata,
optimall
allows users to easily define, split, and merge
strata.
To demonstrate the simple process of refining strata in
optimall
, we use the iris
dataset from the R
package datasets
. Suppose that we have defined three strata
based on "Species"
, but we want to split two out of the
three, setosa and virginica, in half based on the within-stratum median
of "Sepal.Width"
. In optimall
, we can do this
by calling the split_strata()
function:
library(optimall)
library(datasets)
iris <- datasets::iris
table(iris$Species)
#>
#> setosa versicolor virginica
#> 50 50 50
iris2 <- split_strata(data = iris,
strata = "Species",
split = c("setosa", "virginica"),
split_var = "Sepal.Width",
split_at = c(0.5), type = "local quantile")
table(iris2$new_strata)
#>
#> setosa.Sepal.Width_(3.4,4.4] setosa.Sepal.Width_[2.3,3.4]
#> 22 28
#> versicolor virginica.Sepal.Width_(3,3.8]
#> 50 17
#> virginica.Sepal.Width_[2.2,3]
#> 33
Similarly, we can merge strata using the function
merge_strata()
:
iris3 <- merge_strata(data = iris,
strata = "Species",
merge = c("setosa", "versicolor"),
name = "set_or_vers")
table(iris3$new_strata)
#>
#> set_or_vers virginica
#> 100 50
The split_strata
and merge_strata()
functions are quite versatile. See function documentation for more
information.
Assuming that the per-unit sampling cost is the same in each stratum and that Sh, the standard deviation of the variable of interest within each stratum, can be estimated, Neyman (1934) presented the following solution to optimally allocate n samples among H strata in order to minimize the variance of the sample mean:
$$n_h = n \frac{N_hS_h}{\sum_{i=1}^H
N_iS_i}.$$ This formula is known as Neyman allocation and is one
of the functions available in optimall
. Neyman allocation
offers the advantage of outputting sampling fractions that can later be
multiplied by n or taken on
their own if n is not
known.
While Neyman allocation has a strong theoretical backing, Wright (2014) points out some limitations that make it sub-optimal in practice.
Wright offers alternative algorithms that solve the optimal problem
for a discrete allocation that sums exactly to n, which can also be implemented in
optimall
. Essentially, his approaches use linear
constraints to optimize the allocation of samples over a space of
integer values. He makes use of within-stratum variance and population
stratum size to generate priority values, which in turn dictate how many
samples should be taken from each stratum. To learn more about the
specifics of these algorithms, see Wright (2014).
The optimall
package allows users to select between
Neyman allocation, Wright Algorithm I, and Wright Algorithm II in the
method
argument of the optimum_allocation()
function. The optimum_allocation()
function defaults to
using Wright Algorithm II. This algorithm requires that at least 2
samples are taken from each stratum. In optimall
, stratum
sampling sizes for both Wright algorithms are also constrained from
above at Nh, the
population stratum size, using the methods for constraints that Wright
details in Algorithm III.
Below is an example of how optimum_allocation()
could be
called to optimally allocate 40 samples among "Species"
in
the iris
dataset, minimizing the variance of the
"Sepal.Width"
sample mean.
sampling_design <- optimum_allocation(data = iris, strata = "Species",
y = "Sepal.Width",
nsample = 40, method = "WrightII")
sampling_design
#> strata npop sd n_sd stratum_fraction stratum_size
#> 1 setosa 50 0.38 18.95 0.38 15
#> 2 versicolor 50 0.31 15.69 0.30 12
#> 3 virginica 50 0.32 16.12 0.32 13
If we have a dataframe that holds the Nh and sdh for
each stratum instead of data for each individual unit, we can still use
optimum_allocation()
:
iris_summary <- data.frame(strata = unique(iris$Species),
size = c(50, 50, 50),
sd = c(0.3791, 0.3138, 0.3225))
optimum_allocation(data = iris_summary, strata = "strata",
sd_h = "sd", N_h = "size",
nsample = 40, method = "WrightII")
#> strata npop sd n_sd stratum_fraction stratum_size
#> 1 setosa 50 0.38 18.95 0.38 15
#> 2 versicolor 50 0.31 15.69 0.30 12
#> 3 virginica 50 0.32 16.12 0.32 13
With the number of units to sample per stratum specified in a
dataframe, optimall
can select the IDs of the units to be
sampled using simple random sampling within strata with the function
sample_strata()
:
iris$id <- 1:150
set.seed(743)
iris <- sample_strata(data = iris, strata = "Species", id = "id",
design_data = sampling_design, design_strata = "strata",
n_allocated = "stratum_size")
The output of sample_strata()
is the same input
dataframe with a new column called "sample_indicator"
that
holds a binary indicator for whether each unit should be sampled in the
specified wave:
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species id sample_indicator
#> 1 5.1 3.5 1.4 0.2 setosa 1 0
#> 2 4.9 3.0 1.4 0.2 setosa 2 0
#> 3 4.7 3.2 1.3 0.2 setosa 3 0
#> 4 4.6 3.1 1.5 0.2 setosa 4 1
#> 5 5.0 3.6 1.4 0.2 setosa 5 0
#> 6 5.4 3.9 1.7 0.4 setosa 6 1
From this dataframe, we can easily extract a vector of the ids to sample:
ids_to_sample <- iris$id[iris$sample_indicator == 1]
head(ids_to_sample)
#> [1] 4 6 8 11 14 15
length(ids_to_sample)
#> [1] 40
Note that design_data
is the output of
optimum_allocation()
in this example, but in practice it
can be any dataframe that has one row per stratum and one column
specifying each stratum’s desired sample size. As such, any method of
allocating samples to strata can be implemented by
sample_strata()
as long as the design dataframe is
specified and simple random sampling within strata is used.
When measuring variables of interest is expensive or difficult, it can be advantageous to employ a multi-phase design, where cheaper variables are collected from the entire population and expensive variables are collected from subsamples selected through adaptive, multi-wave sampling. This approach, which is documented in McIsaac and Cook (2015), involves multiple waves of sampling where information from prior waves is used to inform the optimum sampling design of subsequent ones. With the understanding that both Neyman and Wright’s optimum allocation methods depend heavily on standard deviation estimates for the variable of interest, the benefit of multi-wave sampling is clear to see. Sample allocations based on prior waves will use updated estimates of nuisance parameters that incorporate data accumulated in the prior sampling waves. The optimal sampling proportion is updated at the end of each wave, and this update guides the sampling of the next wave. As the phase II data accumulate, the necessary within-strata SD estimates, and thus the estimated optimal sampling proportions, are expected to be closer to their true value.
In the design described by McIsaac and Cook, a large phase-I sample
is first taken to measure the inexpensive covariates and/or outcome. The
results of this phase will define strata which are then sampled
non-optimally (through proportional or balanced sampling) for
measurement of the expensive variable in phase-IIa. The phase-IIa
results are used to estimate the standard deviation required to
optimally allocate the next wave of samples. This process is iterated
until the desired sample size for the variable of interest is achieved.
Below is an outline of the workflow, which is facilitated by
optimall
:
optimall
allows users to input phase-I data and
iteratively allocate samples for subsequent waves with the function
allocate_wave()
. This function runs integer-valued
optimum_allocation()
on a dataset according to Wright’s
Algorithm II, but it takes into account previous sampling waves to
determine how the current wave of samples should be allocated.
For a simple example, suppose that we again want to allocate 40
samples to minimize the variance of "Sepal.Width"
in the
iris
dataset, but 30 out of the 40 have already been
sampled. We assume that the strata are still defined only by
"Species"
, and that 16 of the first 30 samples were taken
from the virginica species, 7 from setosa, and 7 from versicolor.
Assuming we only have those 30 samples to base our within-stratum
standard deviation estimates on, we can allocate the next 10 samples
using allocate_wave()
:
# Set up Wave 1
wave1_design <- data.frame(strata = c("setosa",
"virginica",
"versicolor"),
stratum_size = c(7, 16, 7))
# Collect Sepal.Width from only the 30 samples
phase1_data <- subset(datasets::iris, select = -Sepal.Width)
phase1_data$id <- 1:nrow(phase1_data) #Add id column
set.seed(234)
phase1_data <- sample_strata(data = phase1_data,
strata = "Species", id = "id",
design_data = wave1_design,
design_strata = "strata",
n_allocated = "stratum_size")
wave1_ids <- iris$id[phase1_data$sample_indicator == 1]
wave1_sampled_data <- iris[iris$id %in% wave1_ids, c("id","Sepal.Width")]
wave1_data <- merge(phase1_data, wave1_sampled_data, by = "id",
no.dups = TRUE, all.x = TRUE)
# We have our 30 samples
table(is.na(wave1_data$Sepal.Width), wave1_data$Species)
#>
#> setosa versicolor virginica
#> FALSE 7 7 16
#> TRUE 43 43 34
# Now, allocate the next 10:
wave2_design <- allocate_wave(data = wave1_data,
strata = "Species",
y = "Sepal.Width",
already_sampled = "sample_indicator",
nsample = 10,
detailed = TRUE)
wave2_design
#> strata npop nsample_optimal nsample_actual nsample_prior n_to_sample sd
#> 1 setosa 50 19 15 7 8 0.47
#> 2 versicolor 50 12 9 7 2 0.29
#> 3 virginica 50 9 16 16 0 0.24
Notice that in this case, "nsample_optimal"
does not
match "nsample_actual"
for every stratum in the outputted
design dataframe. This occurred because we oversampled from the
virginica stratum in wave 1, meaning that the optimum stratum sample
size among 40 total samples was smaller than the amount of samples
already taken in that group. When oversampling occurs,
allocate_wave()
recalculates the optimum allocation among
the non-oversampled strata and uses that result to allocate the
specified number of samples as optimally as possible. When no
oversampling occurs, "nsample_optimal"
will match
"nsample_actual"
for every stratum.
Now we can easily get the 10 new ids to sample using the
wave2_design
as the design_data
in
sample_strata()
. Notice that we do not have to specify
design_strata
or n_allocated
because the
default arguments are the column names of the
allocate_wave()
output:
optimall
to Determine Study Design and
Sample DataWe use the simulated dataset MatWgt_Sim
to demonstrate
how the functions in optimall
can be used to carry out
study design and sampling tasks in R. This section includes three
examples, which build upon eachother during the design and execution of
a multi-wave survey:
Example 1: Uses the function split_strata()
to
define and refine strata.
Example 2: Uses the function sample_strata()
to
randomly select units to be sampled based on a specified sampling
design.
Example 3: Demonstrates how to conduct an adaptive, multi-wave sampling design that uses an optimal Neyman allocation scheme at each step to allocate the next wave across strata.
The dataset contains simulated data based on a real study of the
association between maternal weight gain during pregnancy and the risk
of childhood obesity after controlling for a number of clinical and
demographic covariates. In this hypothetical example, the study data are
obtained from electronic health records, which are known to be
error-prone and do not have all the variables of interested available
without manual chart review. In a perfect world, chart review would be
used to validate and obtain the necessary variables on every observation
in our sample, but chart review is an expensive and difficult task.
Researchers determined that they could only reasonably afford to
validate 750 out of the 10,335 child-mother pairs. We refer to the
10,335 as the phase 1 sample, for which we have error-prone observations
on all, and we refer to the 750 as the phase II sample, which will be
designed using tools in the optimall
package. For
simplicity, suppose that we want to use these 750 samples to estimate
the true population mean of maternal weight change during pregnancy with
minimal variance. This goal will be accomplished in our last example
through an adaptive, multi-wave sampling design.
The simulated data used in this example is included with
optimall
. The dataset MatWgt_Sim
contains
10335 rows, one for each mother-child pair, and 6 columns containing ID
numbers and covariates which we see below:
data(MatWgt_Sim, package = "optimall")
head(MatWgt_Sim)
#> id mat_weight_est race diabetes obesity mat_weight_true
#> 1 7667 12.954711 Asian 0 0 10.768935
#> 2 8554 9.607124 Asian 0 1 8.892049
#> 3 6324 10.998284 Asian 0 0 12.075861
#> 4 8320 10.702210 Asian 0 0 10.887584
#> 5 8543 6.485988 Asian 1 0 7.331542
#> 6 127 13.338445 Asian 0 1 12.936826
Three of the covariates, child race, diabetes, and estimated maternal weight, are inexpensive to collect for all subjects, but determining the true maternal weight requires a tedious validation process. This full version of the simulated dataset contains the true weight change for all 10335 mothers, which may or may not have a relationship with the other covariates.
For the purpose of our examples, we suppose that we do not have access to this full dataset. Instead, we must sample from it to optimally estimate the mean of the true maternal weight changes.
During phase-I, we assume that all 10335 mother-child pairs have been
sampled for the inexpensive covariates. Accordingly, we define our phase
1 dataset to be 10335 x 5 with every column of the full
MatWgt_Sim
data excluding the expensive true weight change.
In the examples that follow, we will assume the validated maternal
weight variable mat_weight_true
will only be available
through phase 2 sampling.
We suspect that the true mean maternal weight change may vary with
some of the inexpensive covariates in the phase 1 dataset, so we decide
to split our population into 9 non-overlapping strata. Each stratum will
be defined by a unique combination of race and global percentile of
maternal weight gain (≤25th, 25th - 75th, >75th). We can accomplish
this quickly in optimall
using the
split_strata()
function.
phase1$strata <- phase1$race #initialize a strata column first
# Merge the smallest race categories
phase1 <- merge_strata(data = phase1,
strata = "race",
merge = c("Other","Asian"),
name = "Other")
phase1 <- split_strata(data = phase1, strata = "new_strata", split = NULL,
split_var = "mat_weight_est",
type = "global quantile",
split_at = c(0.25,0.75),
trunc = "MWC_est")
# Trunc argument specifies how to refer to mat_weight_est in new strata names
We now have the same phase1
data with a new column
specifying the strata we have defined.
head(phase1)
#> new_strata old_strata id mat_weight_est diabetes obesity
#> 1 Other.MWC_est_(9.75,15.06] Other 7667 12.954711 0 0
#> 2 Other.MWC_est_[-5.39,9.75] Other 8554 9.607124 0 1
#> 3 Other.MWC_est_(9.75,15.06] Other 6324 10.998284 0 0
#> 4 Other.MWC_est_(9.75,15.06] Other 8320 10.702210 0 0
#> 5 Other.MWC_est_[-5.39,9.75] Other 8543 6.485988 1 0
#> 6 Other.MWC_est_(9.75,15.06] Other 127 13.338445 0 1
#> strata
#> 1 Asian
#> 2 Asian
#> 3 Asian
#> 4 Asian
#> 5 Asian
#> 6 Asian
table(phase1$new_strata) # 9 strata
#>
#> Black.MWC_est_(15.06,38.46] Black.MWC_est_(9.75,15.06]
#> 628 1154
#> Black.MWC_est_[-30.21,9.75] Other.MWC_est_(15.06,30.94]
#> 745 325
#> Other.MWC_est_(9.75,15.06] Other.MWC_est_[-5.39,9.75]
#> 929 456
#> White.MWC_est_(15.06,51.69] White.MWC_est_(9.75,15.06]
#> 1631 3084
#> White.MWC_est_[-25.68,9.75]
#> 1383
With the strata now defined by the inexpensive variables sampled in
phase-I, we are ready to begin auditing patient records for validation
of the true maternal weight changes. Without any validated data to
define the optimum phase-IIa sample allocation, we decide to use
proportional stratified sampling for the first 250 out of our 750
audits. Conveniently, optimall
can select this random
sample for us with sample_strata()
.
The sample_strata()
function requires as input two
dataframes, one containing the stratum membership of each individual
unit (data
) and a second specifying the sampling design
(design_data
). The design_data
dataframe must
contain at least two columns:
design_strata
: A column holding strata namesn_allocated
: The total n allocated to each stratum.In our example, we want the proportion of samples randomly taken from each stratum to reflect its population proportion and a total of 250 samples.
phase2a_design <- data.frame(
strata_name = names(table(phase1$new_strata)),
strata_prop = as.vector(table(phase1$new_strata))/10335,
strata_n = round(250.3*as.vector(table(phase1$new_strata))/10335)
) # 250.3 to make sure 250 samples after rounding
sum(phase2a_design$strata_n)
#> [1] 250
phase2a_design
#> strata_name strata_prop strata_n
#> 1 Black.MWC_est_(15.06,38.46] 0.06076439 15
#> 2 Black.MWC_est_(9.75,15.06] 0.11165941 28
#> 3 Black.MWC_est_[-30.21,9.75] 0.07208515 18
#> 4 Other.MWC_est_(15.06,30.94] 0.03144654 8
#> 5 Other.MWC_est_(9.75,15.06] 0.08988873 22
#> 6 Other.MWC_est_[-5.39,9.75] 0.04412192 11
#> 7 White.MWC_est_(15.06,51.69] 0.15781326 40
#> 8 White.MWC_est_(9.75,15.06] 0.29840348 75
#> 9 White.MWC_est_[-25.68,9.75] 0.13381713 33
We can now call sample_strata()
to randomly draw the
specified number samples from each stratum. It will output the same
phase1
dataframe with an extra column indicating which
units should be sampled. We can then extract the ids to sample based on
this indicator.
phase1 <- sample_strata(data = phase1, strata = "new_strata",
id = "id", design_data = phase2a_design,
design_strata = "strata_name",
n_allocated = "strata_n")
ids_to_sample2a <- phase1[phase1$sample_indicator == 1,"id"]
length(ids_to_sample2a) # 250 ids to sample
#> [1] 250
We submit these 250 ids for validation. Hypothetically, this involves the hard work of trained nurses, but for our example it involves only a few lines of code.
phase2a_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2a,
c("id","mat_weight_true")]
phase2a <- merge(phase1, phase2a_samples, by = "id",
no.dups = TRUE, all.x = TRUE)
names(phase2a)
#> [1] "id" "new_strata" "old_strata" "mat_weight_est"
#> [5] "diabetes" "obesity" "strata" "sample_indicator"
#> [9] "mat_weight_true"
table(is.na(phase2a$mat_weight_true))
#>
#> FALSE TRUE
#> 250 10085
We notice that all of the units that were not validated have
NA
values in the new columns. We will use the non-missing
validated data to inform optimum allocation in our future sampling
waves.
In phase-IIa, we allocated samples to strata using proportional sampling rather than optimum allocation because we had no prior samples of the true maternal weight change from which to estimate within-stratum variances. Now that we have completed the first wave of validation, we can estimate these variances and will thus use optimum allocation for each of the subsequent sampling waves.
In phase-IIb (wave 2), we will optimally allocate 250 more samples,
raising our total number of validated samples to 500. The
allocate_wave()
function makes this step simple by
calculating the optimum allocation for 500 samples, determining how many
units have already been sampled in previous waves (only Phase-IIa in
this case), and allocating the 250 samples of the current wave to make
up the difference. The output is a design dataframe summarizing the
results for each stratum.
phase2b <- phase2a
# Add indicator for units that were already sampled
phase2b$already_sampled <- ifelse(phase2b$id %in% ids_to_sample2a, 1, 0)
# Make design
phase2b_design <- allocate_wave(data = phase2b, strata = "new_strata",
y = "mat_weight_true",
already_sampled = "already_sampled",
nsample = 250,
detailed = TRUE)
phase2b_design
#> strata npop nsample_optimal nsample_actual nsample_prior
#> 1 Black.MWC_est_(15.06,38.46] 628 46 46 15
#> 2 Black.MWC_est_(9.75,15.06] 1154 38 38 28
#> 3 Black.MWC_est_[-30.21,9.75] 745 52 52 18
#> 4 Other.MWC_est_(15.06,30.94] 325 12 12 8
#> 5 Other.MWC_est_(9.75,15.06] 929 34 34 22
#> 6 Other.MWC_est_[-5.39,9.75] 456 30 30 11
#> 7 White.MWC_est_(15.06,51.69] 1631 83 83 40
#> 8 White.MWC_est_(9.75,15.06] 3084 117 117 75
#> 9 White.MWC_est_[-25.68,9.75] 1383 88 88 33
#> n_to_sample sd
#> 1 31 3.51
#> 2 10 1.58
#> 3 34 3.32
#> 4 4 1.75
#> 5 12 1.73
#> 6 19 3.18
#> 7 43 2.44
#> 8 42 1.80
#> 9 55 3.04
Looking at the output of allocate_wave()
, we notice that
the optimum sample sizes for the next wave vary greatly between strata,
but all are greater than one. Because we set
detailed = TRUE
, we can also see what the optimum
allocation for 500 samples would have been ignoring prior wave sizes in
the column "nsample_optimal"
. This column matches
"nsample_actual"
because we did not oversample any strata
in phase-IIa.
In other cases when "n_to_sample"
= 0 for some strata,
we may not be so lucky. If the optimum sample size in a stratum is
smaller than the amount it was allocated in previous waves
("nsample_prior"
), we say that that stratum has been
oversampled. When oversampling occurs, allocate_wave()
“closes” the oversampled strata and re-allocates the remaining samples
optimally among the open strata. Under these circumstances, the total
sampling allocation is no longer optimal, but optimall
will
output the most optimal allocation from the non-oversampled strata for
the next wave.
Although we don’t see evidence of oversampling, these results suggest
that our strata are relatively imbalanced. We decide to split the
largest stratum into three similarly-sized strata along mat_weight_est
using the local quantile
option of
split_strata()
.
phase2b <- split_strata(data = phase2b,
split = "White.MWC_est_(9.75,15.06]",
strata = "new_strata",
split_var = "mat_weight_est",
type = "local quantile", split_at = c(1/3,2/3))
With our strata now more balanced in size, we can re-run
allocate_wave()
to make a new design dataframe for
sample_strata()
to use to determine which ids to sample in
phase-IIb.
phase2b_design <- allocate_wave(data = phase2b,
strata = "new_strata",
y = "mat_weight_true",
already_sampled = "already_sampled",
nsample = 250)
phase2b_design
#> strata npop nsample_actual
#> 1 Black.MWC_est_(15.06,38.46] 628 49
#> 2 Black.MWC_est_(9.75,15.06] 1154 40
#> 3 Black.MWC_est_[-30.21,9.75] 745 55
#> 4 Other.MWC_est_(15.06,30.94] 325 13
#> 5 Other.MWC_est_(9.75,15.06] 929 35
#> 6 Other.MWC_est_[-5.39,9.75] 456 32
#> 7 White.MWC_est_(15.06,51.69] 1631 88
#> 8 White.MWC_est_(9.75,15.06].mat_weight_est_(11.44,12.99] 1028 29
#> 9 White.MWC_est_(9.75,15.06].mat_weight_est_(12.99,15.05] 1028 39
#> 10 White.MWC_est_(9.75,15.06].mat_weight_est_[9.75,11.44] 1028 28
#> 11 White.MWC_est_[-25.68,9.75] 1383 92
#> nsample_prior n_to_sample
#> 1 15 34
#> 2 28 12
#> 3 18 37
#> 4 8 5
#> 5 22 13
#> 6 11 21
#> 7 40 48
#> 8 29 0
#> 9 21 18
#> 10 25 3
#> 11 33 59
# Extract IDs to sample
ids_to_sample2b <- sample_strata(data = phase2b,
strata = "new_strata",
id = "id",
already_sampled = "already_sampled",
design_data = phase2b_design,
design_strata = "strata",
n_allocated = "n_to_sample")
ids_to_sample2b <-
ids_to_sample2b[ids_to_sample2b$sample_indicator == 1,"id"]
length(ids_to_sample2b)
#> [1] 250
We can now get the validated maternal weight changes for these 250 subjects.
# Take samples from true data
phase2b_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2b,
c("id","mat_weight_true")]
# Merge extracted samples with our data.
# Note we use dplyr's coalesce here to merge sampled mat_weight_true with
# the mat_weight_true solumn already present in phase2b, but manner of
# merging of samples may vary from project to project.
# For simpler merges, see 'Multiwave Object' vignette.
phase2b <- merge(phase2b, phase2b_samples, by = "id",
no.dups = TRUE, all.x = TRUE)
library(dplyr)
phase2b$mat_weight_true <-
dplyr::coalesce(phase2b$mat_weight_true.x,
phase2b$mat_weight_true.y)
phase2b <- subset(phase2b, select = -c(mat_weight_true.x,
mat_weight_true.y))
table(is.na(phase2b$mat_weight_true)) # All are NA besides already sampled.
#>
#> FALSE TRUE
#> 500 9835
With these results in, we can move onto our final wave.
After auditing the 250 additional samples in Phase-IIb, we only have 250 left to validate until we reach our goal of 750. We will combine what we have learned in the previous waves of phase-II to optimally assign these final 250 samples in phase-IIc.
Following the same steps as in phase-IIb, we use
allocate_wave()
to find the optimum sample sizes in each
stratum based on estimates of the variance using our previous samples.
We then determine which ids to sample using
sample_strata()
.
phase2c <- phase2b
# Add indicator for units that were already sampled
phase2c$already_sampled <- ifelse(phase2b$id %in% c(ids_to_sample2a,
ids_to_sample2b),
1, 0)
# Make Design
phase2c_design <- allocate_wave(data = phase2c, strata = "new_strata",
y = "mat_weight_true",
already_sampled = "already_sampled",
nsample = 250,
detailed = TRUE)
phase2c_design
#> strata npop nsample_optimal
#> 1 Black.MWC_est_(15.06,38.46] 628 60
#> 2 Black.MWC_est_(9.75,15.06] 1154 56
#> 3 Black.MWC_est_[-30.21,9.75] 745 99
#> 4 Other.MWC_est_(15.06,30.94] 325 18
#> 5 Other.MWC_est_(9.75,15.06] 929 50
#> 6 Other.MWC_est_[-5.39,9.75] 456 35
#> 7 White.MWC_est_(15.06,51.69] 1631 176
#> 8 White.MWC_est_(9.75,15.06].mat_weight_est_(11.44,12.99] 1028 38
#> 9 White.MWC_est_(9.75,15.06].mat_weight_est_(12.99,15.05] 1028 48
#> 10 White.MWC_est_(9.75,15.06].mat_weight_est_[9.75,11.44] 1028 37
#> 11 White.MWC_est_[-25.68,9.75] 1383 133
#> nsample_actual nsample_prior n_to_sample sd
#> 1 60 49 11 3.07
#> 2 56 40 16 1.56
#> 3 99 55 44 4.29
#> 4 18 13 5 1.83
#> 5 50 35 15 1.74
#> 6 35 32 3 2.47
#> 7 176 88 88 3.50
#> 8 38 29 9 1.20
#> 9 48 39 9 1.50
#> 10 37 28 9 1.16
#> 11 133 92 41 3.12
# Find IDs to sample
ids_to_sample2c <- sample_strata(data = phase2c,
strata = "new_strata",
id = "id",
already_sampled = "already_sampled",
design_data = phase2c_design,
design_strata = "strata",
n_allocated = "n_to_sample")
ids_to_sample2c <-
ids_to_sample2c[ids_to_sample2c$sample_indicator == 1,"id"]
length(ids_to_sample2c)
#> [1] 250
# Sample
phase2c_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2c,
c("id","mat_weight_true")]
# Add samples to phase2c dataset
phase2c <- merge(phase2c, phase2c_samples, by = "id",
no.dups = TRUE, all.x = TRUE)
phase2c$mat_weight_true <-
dplyr::coalesce(phase2c$mat_weight_true.x,
phase2c$mat_weight_true.y)
phase2c <- subset(phase2c, select = -c(mat_weight_true.x,
mat_weight_true.y))
We have now validated 750 samples!
As this example shows, there are quite a few moving parts to keep
track of in the multi-wave sampling workflow. When performing a
multi-wave sampling design, you may also want to consider using
optimall's
multiwave object to organize and visualize the
process. This feature is described in detail in the package vignette
titled “The Multiwave Object”.
optimall_shiny()
In many cases, deciding which strata to split and where is a
difficult task. split_strata()
makes this job easier, but
it is designed for situations where the strata and split values have
already been decided by the user. Running it iteratively to experiment
with different splits is possible yet tedious.
optimall
includes an R Shiny application that reacts in real time to user
selections of splitting parameters. It can be called using
optimall_shiny()
. See the screenshot below:
Each time the user updates an input, the parameters of the
split_strata()
function are updated accordingly, and the
resulting dataframe showing the optimum allocation of samples among the
new strata is displayed. Once the user is satsfied with a set of inputs,
they can confirm
the split and the underlying data will be
updated. The code used to perform the split will also be displayed.
Note that the data in the Shiny App has to be loaded in from a file,
and it is thus separate from data being used in the
optimall
workflow. This means that the user has to return
to their R session to make the changes determined in the Shiny app. The
app, however, makes this easy by printing the code to perform all of the
confirmed changes, so making them in R is as easy as copying and
pasting. You may have to update the "data"
argument.
For more information on optimall_shiny
see the package
vignette titled “Splitting Strata with Optimall Shiny”.
McIsaac MA, Cook RJ. Adaptive sampling in two‐phase designs: a biomarker study for progression in arthritis. Statistics in Medicine. 2015 Sep 20;34(21):2899-912.
Wright, T. A simple method of exact optimal sample allocation under stratification with any mixed constraint patterns.2014; Statistics, 07.