Link to Home

My thesis project is focused on survyeing for cryptic cervid pathogens in the northeastern United States. The pathogens I am looking at include Babesia odocolei, Plasmodium odocoilei, and Theileria cervi. For this homework, I am going to focus on my predictions for the infection rate of Plasmodium odocoilei in white tailed deer.

Based on previous research, I hypothesize that white-tailed deer across Vermont, New Hampshire, and Maine are infected with Plasmodium odocoilei at a rate of about 25%.

Start with simulating a normal data set.

library(boot)
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(broom)

# Norm dist
VT <- inv.logit(rnorm(n=500,mean=0.25,sd=.50))
NH <- inv.logit(rnorm(n=500,mean=0.25,sd=.50))
ME <- inv.logit(rnorm(n=500,mean=0.25,sd=.50))

# Making a data set
PrePivotPlasmodiumData <- data.frame(VT,NH,ME)
head(PrePivotPlasmodiumData)
##          VT        NH        ME
## 1 0.5562696 0.5484395 0.6175164
## 2 0.4871093 0.3845032 0.4325832
## 3 0.4945591 0.5203044 0.7363188
## 4 0.3574442 0.6855021 0.5755426
## 5 0.7067476 0.5381469 0.6236759
## 6 0.3813299 0.3183416 0.4065027
PlasmodiumData <- PrePivotPlasmodiumData %>%
  pivot_longer(cols=VT:ME,
               names_to = "State",
               values_to = "InfectionRate")
head(PlasmodiumData)
## # A tibble: 6 × 2
##   State InfectionRate
##   <chr>         <dbl>
## 1 VT            0.556
## 2 NH            0.548
## 3 ME            0.618
## 4 VT            0.487
## 5 NH            0.385
## 6 ME            0.433
# If my independent variable is state (discrete) and my dependent variable is rate of infection (continuous) I should run an ANOVA.

# PlasmodiumANOVA 

PlasmodiumANOVA <- aov(InfectionRate ~ State, data=PlasmodiumData)
summary(PlasmodiumANOVA)
##               Df Sum Sq Mean Sq F value Pr(>F)  
## State          2   0.08 0.03985   2.971 0.0515 .
## Residuals   1497  20.08 0.01341                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Ran multiple times to test different random numbers.

Changing the means of different groups.

I decided to alter the mean infection rate of the states based on the latitude. I expect that more southern states have higher levels of infection rate due to higher populations of Anopheles mosquitoes, the vectors of Plasmodium odocoilei. So, I gave VT and NH infection rates slightly higher than 25%, and Maine a slightly lower infection rate than 25%. My results showed a statistically significant difference, which I did not get from my initial analysis. This makes sense because initially my inputs were all the same.

VT <- inv.logit(rnorm(n=500,mean=0.30,sd=.50))
NH <- inv.logit(rnorm(n=500,mean=0.32,sd=.50))
ME <- inv.logit(rnorm(n=500,mean=0.15,sd=.50))

PrePivotPlasmodiumData <- data.frame(VT,NH,ME)
head(PrePivotPlasmodiumData)
##          VT        NH        ME
## 1 0.5863141 0.5656822 0.3852405
## 2 0.5217481 0.4177075 0.4765554
## 3 0.5426395 0.5475988 0.5489203
## 4 0.4797453 0.6206729 0.2269413
## 5 0.4109047 0.3306679 0.4134341
## 6 0.6718687 0.4936155 0.5333459
PlasmodiumData <- PrePivotPlasmodiumData %>%
  pivot_longer(cols=VT:ME,
               names_to = "State",
               values_to = "InfectionRate")
head(PlasmodiumData)
## # A tibble: 6 × 2
##   State InfectionRate
##   <chr>         <dbl>
## 1 VT            0.586
## 2 NH            0.566
## 3 ME            0.385
## 4 VT            0.522
## 5 NH            0.418
## 6 ME            0.477
PlasmodiumANOVA <- aov(InfectionRate ~ State, data=PlasmodiumData)
summary(PlasmodiumANOVA)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## State          2  0.862  0.4308   30.43 1.11e-13 ***
## Residuals   1497 21.195  0.0142                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s say that latitudinal difference is actually really drastic, and ME has a much lower infection rate than VT and NH. My results here yielded an even more statistically significant result than the previous analysis.

VT <- inv.logit(rnorm(n=500,mean=0.40,sd=.50))
NH <- inv.logit(rnorm(n=500,mean=0.35,sd=.50))
ME <- inv.logit(rnorm(n=500,mean=0.10,sd=.50))

PrePivotPlasmodiumData <- data.frame(VT,NH,ME)
head(PrePivotPlasmodiumData)
##          VT        NH        ME
## 1 0.4004620 0.6567997 0.5202911
## 2 0.5766783 0.5972286 0.4399790
## 3 0.6947450 0.5854435 0.3697660
## 4 0.5036991 0.4429139 0.4231951
## 5 0.5243217 0.6396001 0.6694866
## 6 0.4690411 0.6701933 0.4966159
PlasmodiumData <- PrePivotPlasmodiumData %>%
  pivot_longer(cols=VT:ME,
               names_to = "State",
               values_to = "InfectionRate")
head(PlasmodiumData)
## # A tibble: 6 × 2
##   State InfectionRate
##   <chr>         <dbl>
## 1 VT            0.400
## 2 NH            0.657
## 3 ME            0.520
## 4 VT            0.577
## 5 NH            0.597
## 6 ME            0.440
PlasmodiumANOVA <- aov(InfectionRate ~ State, data=PlasmodiumData)
summary(PlasmodiumANOVA)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## State          2  1.243  0.6216   47.24 <2e-16 ***
## Residuals   1497 19.697  0.0132                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Changing the sample sizes of different groups.

Let’s say I get more funding that I expect (yay!) and I am able to double my sample size. When I double the sample size, I still don’t get a statistically significant result (which makes sense because the means are all the same again like in my intitial analysis), but it is slightly less significant than my initial analysis ran with half the sample size.

VT <- inv.logit(rnorm(n=1000,mean=0.25,sd=.50))
NH <- inv.logit(rnorm(n=1000,mean=0.25,sd=.50))
ME <- inv.logit(rnorm(n=1000,mean=0.25,sd=.50))

PrePivotPlasmodiumData <- data.frame(VT,NH,ME)
head(PrePivotPlasmodiumData)
##          VT        NH        ME
## 1 0.5461828 0.5721348 0.8018015
## 2 0.5112125 0.4337599 0.5778865
## 3 0.5303988 0.6516248 0.5308035
## 4 0.6417451 0.6404634 0.5670521
## 5 0.4885017 0.4864511 0.6077662
## 6 0.5612289 0.4618007 0.6676816
PlasmodiumData <- PrePivotPlasmodiumData %>%
  pivot_longer(cols=VT:ME,
               names_to = "State",
               values_to = "InfectionRate")
head(PlasmodiumData)
## # A tibble: 6 × 2
##   State InfectionRate
##   <chr>         <dbl>
## 1 VT            0.546
## 2 NH            0.572
## 3 ME            0.802
## 4 VT            0.511
## 5 NH            0.434
## 6 ME            0.578
PlasmodiumANOVA <- aov(InfectionRate ~ State, data=PlasmodiumData)
summary(PlasmodiumANOVA)
##               Df Sum Sq  Mean Sq F value Pr(>F)
## State          2   0.01 0.005501   0.411  0.663
## Residuals   2997  40.10 0.013380

Let’s say I don’t get as much funding as I want and I have to make my sample size smaller (boo!). When I run the analysis with a smaller sample size, I still don’t get a statistically signficant result (makes sense), but it is more significant than my initial analysis.

VT <- inv.logit(rnorm(n=250,mean=0.25,sd=.50))
NH <- inv.logit(rnorm(n=250,mean=0.25,sd=.50))
ME <- inv.logit(rnorm(n=250,mean=0.25,sd=.50))

PrePivotPlasmodiumData <- data.frame(VT,NH,ME)
head(PrePivotPlasmodiumData)
##          VT        NH        ME
## 1 0.5469700 0.4553230 0.4463939
## 2 0.6403199 0.7719503 0.5657565
## 3 0.5275957 0.5681032 0.5894320
## 4 0.7358128 0.4847399 0.4365344
## 5 0.7005248 0.7353973 0.5623165
## 6 0.5876092 0.6593968 0.5408778
PlasmodiumData <- PrePivotPlasmodiumData %>%
  pivot_longer(cols=VT:ME,
               names_to = "State",
               values_to = "InfectionRate")
head(PlasmodiumData)
## # A tibble: 6 × 2
##   State InfectionRate
##   <chr>         <dbl>
## 1 VT            0.547
## 2 NH            0.455
## 3 ME            0.446
## 4 VT            0.640
## 5 NH            0.772
## 6 ME            0.566
PlasmodiumANOVA <- aov(InfectionRate ~ State, data=PlasmodiumData)
summary(PlasmodiumANOVA)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## State         2  0.087 0.04327    3.19 0.0417 *
## Residuals   747 10.131 0.01356                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1