### Summary

This mini-tutorial is heavily derived from an excellent post by Rasmus Bååth which details using Bayesian Analysis to calculate whether his wife is pregnant.

This post serves to summarize the fundamentals needed to construct a typical Bayesian Analysis and provide a generic template from which similar problems could be approached.

Three key elements are needed, namely the **data**, the **model**, and the **priors**.

### Part A: The Data

Here we simple describe the raw data, and then compute a metric which best captures the behaviour we are interested in (*interval between periods*).

```
period_onset <- as.Date(c("2014-07-02", "2014-08-02", "2014-08-29", "2014-09-25",
"2014-10-24", "2014-11-20", "2014-12-22", "2015-01-19"))
days_between_periods <- as.numeric(diff(period_onset))
```

### Part B: The Model

With every model comes assumptions!

### General Assumptions

The woman and man have no prior reason for being infertile

The women has regular periods

The couple are actively trying to concieve

If there is a pregnancy, there are no more periods

### Specific Assumptions

The number of days between periods (

`days_between_periods`

) is assumed to be normally distributed with unknown mean (`mean_period`

) and standard deviation (`sd_period`

)The probability of getting pregnant during a cycle is assumed to be 0.19 if the couple is fertile (

`is_fertile`

). Not all couples are fertile, and if they are not fertile the probability of getting pregnant is 0. If fertility is coded as 0-1 then this can be compactly written as`0.19 * is_fertile`

The probability of failing to conceive for a certain number of periods (

`n_non_pregant_periods`

) is then`(1 - 0.19 * is_fertile)^n_non_pregant_periods`

Finally, if you are not going to get pregnant this cycle, then the number of days from your last period to your next period (

`next_period`

) is going to to be more than the current number of days since the last period (`days_since_last_period`

). That is, the probability of the`next_period < days_since_last_period`

is zero

### The Likelyhood Function

This is the key conceptual idea within Bayesian Analysis.

The likelyhood function given some parameters, and some data calculates the probability of the data, given those parameters, or rather something proportional to a probability, and that is called a likelyhood.

**Note: This likelyhood can be very small, and thus it may be calculated on a log scale to avoid numerical issues.**

In essence we need to:

Create a function, that takes the

**data**, and the**parameters**as arguments.Initialize the likelyhood to 1, corresponding to 0.0 on the log scale.

Using probability density functions (

`dnromr`

,`dbinom`

, and`dbpois`

) to calculate the likelyhoods of the different parts of the model. You then**multiple these likelyhoods together**. On the**log scale this corresponds to adding**the log-likelyhoods.Make sure the d* functions return log- likelyhoods, by using the argument log = TRUE.

**Note: A likelyhood of 0.0 corresponds to a log-likelyhood of -Inf.**

```
calc_log_like <- function(days_since_last_period, days_between_periods,
mean_period, sd_period, next_period,
is_fertile, is_pregnant) {
n_non_pregnant_periods <- length(days_between_periods)
log_like <- 0
if(n_non_pregnant_periods > 0) {
log_like <- log_like + sum( dnorm(days_between_periods, mean_period, sd_period, log = TRUE) )
}
log_like <- log_like + log( (1 - 0.19 * is_fertile)^n_non_pregnant_periods )
if(!is_pregnant && next_period < days_since_last_period) {
log_like <- -Inf
}
log_like
}
```

Here the **data** are the scalar `days_since_last_period`

, and the vector `days_between_periods`

.

The rest of the arguments are the **parameters** to be estimated.

Using this function I can calculate the log-likelyhood ofanydata + parameter combination.

### Part C: The Priors

Bayesian Analysis needs **priors** - the information the model has *before* seeing any data!

Namely `mean_period`

, `sd_period`

, `is_fertile`

, and `is_pregnant`

(`next_period`

is also a parameter but can be derived from `mean_period`

and `sd_period`

)

I also need to set the probability of becoming pregnant in a cycle (0.19 as given above). No vague priors here - all priors were taken from the literature.

`is_fertile`

/`is_pregnant`

are based on frequencies`is_fertile`

= 0.95`is_pregant`

is a binary parameter representing whether the couple are going to be (or already are) pregnant the current cycle

```
# proportion NOT pregnant after 12 cycles
prop_not_preg_12_cycles <- c( "19-26 years" = 0.08,
"27-34 years" = 0.13,
"35-39 years" = 0.18)
# therefore proportion pregant ('is_pregant') in 1 cycle
1 - (prop_not_preg_12_cycles - 0.05)^(1/12)
```

```
## 19-26 years 27-34 years 35-39 years
## 0.2533906 0.1898026 0.1563507
```

I have all the **priors** and can now construct the function that returns samples from the prior:

```
sample_from_prior <- function(n) {
prior <- data.frame(mean_period = rnorm(n, 27.7, 2.4),
sd_period = abs(rnorm(n, 0, 2.05)),
is_fertile = rbinom(n, 1, 0.95))
prior$is_pregnant <- rbinom(n, 1, 0.19 * prior$is_fertile)
prior$next_period <- rnorm(n, prior$mean_period, prior$sd_period)
prior$next_period[prior$is_pregnant == 1] <- NA
prior
}
```

## Fitting the Model using Importance Sampling

I now have the three key elements for Bayesian Statistics; the **data (1)**, the **likelyhood (2)**, and the **prior (3)**.

“Importance sampling is a Monte Carlo approach, which works well when the parameter space is small, and the priors are not too dissimilar from the posterior” - Rasmus Bååth

There are three steps in Importance Sampling:

Generate a large sample from the

**prior**(done using`sample_from_prior`

)Assign a weight to each draw from the prior that is proportional to the likelyhood of the

**data**given those parameters (done using`calc_log_like_`

)Normalize the weights to sum to 1, so that they form a probability distribution over the

**prior**sample

Finally we should re-sample the prior sample according to the probability distribution (using the function `sample`

)

The result of importance sampling is a new sample, which if the importance sampling worked can be treated as a sample from the posterior.That is it represents what the “model knows after seeing the data”.

```
# The importance sampling function is constructed as follows:
sample_from_posterior <- function(days_since_last_period, days_between_periods, n_samples) {
prior <- sample_from_prior(n_samples)
log_like <- sapply(1:n_samples, function(i) {
calc_log_like(days_since_last_period, days_between_periods,
prior$mean_period[i], prior$sd_period[i], prior$next_period[i],
prior$is_fertile[i], prior$is_pregnant[i])
})
posterior <- prior[ sample(n_samples, replace = TRUE, prob = exp(log_like)), ]
posterior
}
```

## The Results

```
post <- sample_from_posterior(34, days_between_periods, n_samples = 100000)
head(post)
```

```
## mean_period sd_period is_fertile is_pregnant next_period
## 80848 28.56912 1.558370 1 1 NA
## 24754 27.38200 1.728088 1 1 NA
## 44318 28.94121 1.765881 1 1 NA
## 23812 27.05709 3.274956 1 1 NA
## 50163 28.78958 3.214948 1 1 NA
## 8820 27.56157 2.450886 1 1 NA
```

`mean(post$is_fertile)`

`## [1] 0.98105`

`mean(post$is_pregnant)`

`## [1] 0.91432`