Bayesian Approaches on Tiny Data

Dan Gray

2017/11/12

Summary

This mini-tutorial is heavily derived from an excellent post by Rasmus Bååth which details using Bayesian Analysis to calculate whether his wife is pregnant.

This post serves to summarize the fundamentals needed to construct a typical Bayesian Analysis and provide a generic template from which similar problems could be approached.

Three key elements are needed, namely the data, the model, and the priors.

Part A: The Data

Here we simple describe the raw data, and then compute a metric which best captures the behaviour we are interested in (interval between periods).

period_onset <- as.Date(c("2014-07-02", "2014-08-02", "2014-08-29", "2014-09-25",
                          "2014-10-24", "2014-11-20", "2014-12-22", "2015-01-19"))
days_between_periods <- as.numeric(diff(period_onset))

Part B: The Model

With every model comes assumptions!

General Assumptions

  1. The woman and man have no prior reason for being infertile

  2. The women has regular periods

  3. The couple are actively trying to concieve

  4. If there is a pregnancy, there are no more periods

Specific Assumptions

  1. The number of days between periods (days_between_periods) is assumed to be normally distributed with unknown mean (mean_period) and standard deviation (sd_period)

  2. The probability of getting pregnant during a cycle is assumed to be 0.19 if the couple is fertile (is_fertile). Not all couples are fertile, and if they are not fertile the probability of getting pregnant is 0. If fertility is coded as 0-1 then this can be compactly written as 0.19 * is_fertile

  3. The probability of failing to conceive for a certain number of periods (n_non_pregant_periods) is then (1 - 0.19 * is_fertile)^n_non_pregant_periods

  4. Finally, if you are not going to get pregnant this cycle, then the number of days from your last period to your next period (next_period) is going to to be more than the current number of days since the last period (days_since_last_period). That is, the probability of the next_period < days_since_last_period is zero

The Likelyhood Function

This is the key conceptual idea within Bayesian Analysis.

The likelyhood function given some parameters, and some data calculates the probability of the data, given those parameters, or rather something proportional to a probability, and that is called a likelyhood.

Note: This likelyhood can be very small, and thus it may be calculated on a log scale to avoid numerical issues.

In essence we need to:

  1. Create a function, that takes the data, and the parameters as arguments.

  2. Initialize the likelyhood to 1, corresponding to 0.0 on the log scale.

  3. Using probability density functions (dnromr, dbinom, and dbpois) to calculate the likelyhoods of the different parts of the model. You then multiple these likelyhoods together. On the log scale this corresponds to adding the log-likelyhoods.

  4. Make sure the d* functions return log- likelyhoods, by using the argument log = TRUE.

Note: A likelyhood of 0.0 corresponds to a log-likelyhood of -Inf.

calc_log_like <- function(days_since_last_period, days_between_periods,
                          mean_period, sd_period, next_period,
                          is_fertile, is_pregnant) {
  n_non_pregnant_periods <- length(days_between_periods)
  log_like <- 0
  if(n_non_pregnant_periods > 0) {
    log_like <- log_like + sum( dnorm(days_between_periods, mean_period, sd_period, log = TRUE) )
  }
  log_like <- log_like + log( (1 - 0.19 * is_fertile)^n_non_pregnant_periods )
  if(!is_pregnant && next_period < days_since_last_period) {
    log_like <- -Inf
  }
  log_like
}

Here the data are the scalar days_since_last_period, and the vector days_between_periods.

The rest of the arguments are the parameters to be estimated.

Using this function I can calculate the log-likelyhood of any data + parameter combination.

Part C: The Priors

Bayesian Analysis needs priors - the information the model has before seeing any data!

Namely mean_period, sd_period, is_fertile, and is_pregnant (next_period is also a parameter but can be derived from mean_period and sd_period)

I also need to set the probability of becoming pregnant in a cycle (0.19 as given above). No vague priors here - all priors were taken from the literature.

# proportion NOT pregnant after 12 cycles
prop_not_preg_12_cycles <- c( "19-26 years" = 0.08,
                             "27-34 years" = 0.13,
                             "35-39 years" = 0.18)

# therefore proportion pregant ('is_pregant') in 1 cycle

1 - (prop_not_preg_12_cycles - 0.05)^(1/12)
## 19-26 years 27-34 years 35-39 years 
##   0.2533906   0.1898026   0.1563507

I have all the priors and can now construct the function that returns samples from the prior:

sample_from_prior <- function(n) {
  prior <- data.frame(mean_period = rnorm(n, 27.7, 2.4),
                      sd_period   = abs(rnorm(n, 0, 2.05)),
                      is_fertile  = rbinom(n, 1, 0.95))
  prior$is_pregnant <- rbinom(n, 1, 0.19 * prior$is_fertile)
  prior$next_period <- rnorm(n, prior$mean_period, prior$sd_period)
  prior$next_period[prior$is_pregnant == 1] <- NA
  prior
}

Fitting the Model using Importance Sampling

I now have the three key elements for Bayesian Statistics; the data (1), the likelyhood (2), and the prior (3).

“Importance sampling is a Monte Carlo approach, which works well when the parameter space is small, and the priors are not too dissimilar from the posterior” - Rasmus Bååth

There are three steps in Importance Sampling:

  1. Generate a large sample from the prior (done using sample_from_prior)

  2. Assign a weight to each draw from the prior that is proportional to the likelyhood of the data given those parameters (done using calc_log_like_)

  3. Normalize the weights to sum to 1, so that they form a probability distribution over the prior sample

Finally we should re-sample the prior sample according to the probability distribution (using the function sample)

The result of importance sampling is a new sample, which if the importance sampling worked can be treated as a sample from the posterior.That is it represents what the “model knows after seeing the data”.

# The importance sampling function is constructed as follows:
sample_from_posterior <- function(days_since_last_period, days_between_periods, n_samples) {
  prior <- sample_from_prior(n_samples)
  log_like <- sapply(1:n_samples, function(i) {
    calc_log_like(days_since_last_period, days_between_periods,
                  prior$mean_period[i], prior$sd_period[i], prior$next_period[i],
                  prior$is_fertile[i], prior$is_pregnant[i])
  })
  posterior <- prior[ sample(n_samples, replace = TRUE, prob = exp(log_like)), ]
  posterior
}

The Results

post <- sample_from_posterior(34, days_between_periods, n_samples = 100000)

head(post)
##       mean_period sd_period is_fertile is_pregnant next_period
## 80848    28.56912  1.558370          1           1          NA
## 24754    27.38200  1.728088          1           1          NA
## 44318    28.94121  1.765881          1           1          NA
## 23812    27.05709  3.274956          1           1          NA
## 50163    28.78958  3.214948          1           1          NA
## 8820     27.56157  2.450886          1           1          NA
mean(post$is_fertile)
## [1] 0.98105
mean(post$is_pregnant)
## [1] 0.91432