Summary
This mini-tutorial is heavily derived from an excellent post by Rasmus Bååth which details using Bayesian Analysis to calculate whether his wife is pregnant.
This post serves to summarize the fundamentals needed to construct a typical Bayesian Analysis and provide a generic template from which similar problems could be approached.
Three key elements are needed, namely the data, the model, and the priors.
Part A: The Data
Here we simple describe the raw data, and then compute a metric which best captures the behaviour we are interested in (interval between periods).
period_onset <- as.Date(c("2014-07-02", "2014-08-02", "2014-08-29", "2014-09-25",
"2014-10-24", "2014-11-20", "2014-12-22", "2015-01-19"))
days_between_periods <- as.numeric(diff(period_onset))
Part B: The Model
With every model comes assumptions!
General Assumptions
The woman and man have no prior reason for being infertile
The women has regular periods
The couple are actively trying to concieve
If there is a pregnancy, there are no more periods
Specific Assumptions
The number of days between periods (
days_between_periods
) is assumed to be normally distributed with unknown mean (mean_period
) and standard deviation (sd_period
)The probability of getting pregnant during a cycle is assumed to be 0.19 if the couple is fertile (
is_fertile
). Not all couples are fertile, and if they are not fertile the probability of getting pregnant is 0. If fertility is coded as 0-1 then this can be compactly written as0.19 * is_fertile
The probability of failing to conceive for a certain number of periods (
n_non_pregant_periods
) is then(1 - 0.19 * is_fertile)^n_non_pregant_periods
Finally, if you are not going to get pregnant this cycle, then the number of days from your last period to your next period (
next_period
) is going to to be more than the current number of days since the last period (days_since_last_period
). That is, the probability of thenext_period < days_since_last_period
is zero
The Likelyhood Function
This is the key conceptual idea within Bayesian Analysis.
The likelyhood function given some parameters, and some data calculates the probability of the data, given those parameters, or rather something proportional to a probability, and that is called a likelyhood.
Note: This likelyhood can be very small, and thus it may be calculated on a log scale to avoid numerical issues.
In essence we need to:
Create a function, that takes the data, and the parameters as arguments.
Initialize the likelyhood to 1, corresponding to 0.0 on the log scale.
Using probability density functions (
dnromr
,dbinom
, anddbpois
) to calculate the likelyhoods of the different parts of the model. You then multiple these likelyhoods together. On the log scale this corresponds to adding the log-likelyhoods.Make sure the d* functions return log- likelyhoods, by using the argument log = TRUE.
Note: A likelyhood of 0.0 corresponds to a log-likelyhood of -Inf.
calc_log_like <- function(days_since_last_period, days_between_periods,
mean_period, sd_period, next_period,
is_fertile, is_pregnant) {
n_non_pregnant_periods <- length(days_between_periods)
log_like <- 0
if(n_non_pregnant_periods > 0) {
log_like <- log_like + sum( dnorm(days_between_periods, mean_period, sd_period, log = TRUE) )
}
log_like <- log_like + log( (1 - 0.19 * is_fertile)^n_non_pregnant_periods )
if(!is_pregnant && next_period < days_since_last_period) {
log_like <- -Inf
}
log_like
}
Here the data are the scalar days_since_last_period
, and the vector days_between_periods
.
The rest of the arguments are the parameters to be estimated.
Using this function I can calculate the log-likelyhood of any data + parameter combination.
Part C: The Priors
Bayesian Analysis needs priors - the information the model has before seeing any data!
Namely mean_period
, sd_period
, is_fertile
, and is_pregnant
(next_period
is also a parameter but can be derived from mean_period
and sd_period
)
I also need to set the probability of becoming pregnant in a cycle (0.19 as given above). No vague priors here - all priors were taken from the literature.
is_fertile
/is_pregnant
are based on frequenciesis_fertile
= 0.95is_pregant
is a binary parameter representing whether the couple are going to be (or already are) pregnant the current cycle
# proportion NOT pregnant after 12 cycles
prop_not_preg_12_cycles <- c( "19-26 years" = 0.08,
"27-34 years" = 0.13,
"35-39 years" = 0.18)
# therefore proportion pregant ('is_pregant') in 1 cycle
1 - (prop_not_preg_12_cycles - 0.05)^(1/12)
## 19-26 years 27-34 years 35-39 years
## 0.2533906 0.1898026 0.1563507
I have all the priors and can now construct the function that returns samples from the prior:
sample_from_prior <- function(n) {
prior <- data.frame(mean_period = rnorm(n, 27.7, 2.4),
sd_period = abs(rnorm(n, 0, 2.05)),
is_fertile = rbinom(n, 1, 0.95))
prior$is_pregnant <- rbinom(n, 1, 0.19 * prior$is_fertile)
prior$next_period <- rnorm(n, prior$mean_period, prior$sd_period)
prior$next_period[prior$is_pregnant == 1] <- NA
prior
}
Fitting the Model using Importance Sampling
I now have the three key elements for Bayesian Statistics; the data (1), the likelyhood (2), and the prior (3).
“Importance sampling is a Monte Carlo approach, which works well when the parameter space is small, and the priors are not too dissimilar from the posterior” - Rasmus Bååth
There are three steps in Importance Sampling:
Generate a large sample from the prior (done using
sample_from_prior
)Assign a weight to each draw from the prior that is proportional to the likelyhood of the data given those parameters (done using
calc_log_like_
)Normalize the weights to sum to 1, so that they form a probability distribution over the prior sample
Finally we should re-sample the prior sample according to the probability distribution (using the function sample
)
The result of importance sampling is a new sample, which if the importance sampling worked can be treated as a sample from the posterior.That is it represents what the “model knows after seeing the data”.
# The importance sampling function is constructed as follows:
sample_from_posterior <- function(days_since_last_period, days_between_periods, n_samples) {
prior <- sample_from_prior(n_samples)
log_like <- sapply(1:n_samples, function(i) {
calc_log_like(days_since_last_period, days_between_periods,
prior$mean_period[i], prior$sd_period[i], prior$next_period[i],
prior$is_fertile[i], prior$is_pregnant[i])
})
posterior <- prior[ sample(n_samples, replace = TRUE, prob = exp(log_like)), ]
posterior
}
The Results
post <- sample_from_posterior(34, days_between_periods, n_samples = 100000)
head(post)
## mean_period sd_period is_fertile is_pregnant next_period
## 80848 28.56912 1.558370 1 1 NA
## 24754 27.38200 1.728088 1 1 NA
## 44318 28.94121 1.765881 1 1 NA
## 23812 27.05709 3.274956 1 1 NA
## 50163 28.78958 3.214948 1 1 NA
## 8820 27.56157 2.450886 1 1 NA
mean(post$is_fertile)
## [1] 0.98105
mean(post$is_pregnant)
## [1] 0.91432