Future Exercise

#load required packages
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.3
Warning: package 'ggplot2' was built under R version 4.4.3
Warning: package 'tibble' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
Warning: package 'dslabs' was built under R version 4.4.3
library(dplyr)
#get an overview of data structure
str(gapminder)
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
#get a summary of data
summary(gapminder)
                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586  
#determine what type of object gapminder is
class(gapminder)
[1] "data.frame"
#assign African countries to new object 
africadata <- gapminder %>% filter(continent == "Africa")

#overview of data structure
str(africadata)
'data.frame':   2907 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
 $ population      : num  11124892 5270844 2431620 524029 4829291 ...
 $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
#summary of data 
summary(africadata)
         country          year      infant_mortality life_expectancy
 Algeria     :  57   Min.   :1960   Min.   : 11.40   Min.   :13.20  
 Angola      :  57   1st Qu.:1974   1st Qu.: 62.20   1st Qu.:48.23  
 Benin       :  57   Median :1988   Median : 93.40   Median :53.98  
 Botswana    :  57   Mean   :1988   Mean   : 95.12   Mean   :54.38  
 Burkina Faso:  57   3rd Qu.:2002   3rd Qu.:124.70   3rd Qu.:60.10  
 Burundi     :  57   Max.   :2016   Max.   :237.40   Max.   :77.60  
 (Other)     :2565                  NA's   :226                     
   fertility       population             gdp               continent   
 Min.   :1.500   Min.   :    41538   Min.   :4.659e+07   Africa  :2907  
 1st Qu.:5.160   1st Qu.:  1605232   1st Qu.:8.373e+08   Americas:   0  
 Median :6.160   Median :  5570982   Median :2.448e+09   Asia    :   0  
 Mean   :5.851   Mean   : 12235961   Mean   :9.346e+09   Europe  :   0  
 3rd Qu.:6.860   3rd Qu.: 13888152   3rd Qu.:6.552e+09   Oceania :   0  
 Max.   :8.450   Max.   :182201962   Max.   :1.935e+11                  
 NA's   :51      NA's   :51          NA's   :637                        
                       region   
 Eastern Africa           :912  
 Western Africa           :912  
 Middle Africa            :456  
 Northern Africa          :342  
 Southern Africa          :285  
 Australia and New Zealand:  0  
 (Other)                  :  0  
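
The summary above still lists the other continents and regions with counts of zero, because filter() keeps unused factor levels. A minimal sketch of dropping them with base R's droplevels(); the name africadata_clean is hypothetical and is not used elsewhere in this analysis.

```r
library(dplyr)
library(dslabs)

# Same filter as above, followed by droplevels() to discard factor levels
# that no longer occur (other continents, non-African regions and countries).
# `africadata_clean` is an illustrative name, not part of the original analysis.
africadata_clean <- gapminder %>%
  filter(continent == "Africa") %>%
  droplevels()

# Number of remaining levels in each factor column
sapply(africadata_clean[c("country", "continent", "region")], nlevels)
```

With the unused levels removed, summary() and any plot legends would list only the African regions.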
#create new object from africadata that has only infant_mortality and life_expectancy 
africa_inf_exp <- africadata %>% select("infant_mortality", "life_expectancy")

#create new object from africadata that has only population and life_expectancy
africa_pop_exp <- africadata %>% select("population", "life_expectancy")

#structure overview of both africa_inf_exp and africa_pop_exp
str(africa_inf_exp)
'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
str(africa_pop_exp)
'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...
#summary of both 
summary(africa_inf_exp)
 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226                     
summary(africa_pop_exp)
   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51                         
#plot infant_mortality (x) vs. life_expectancy (y)
ggplot(africa_inf_exp, (aes(x = infant_mortality, y = life_expectancy))) +
  geom_point()
Warning: Removed 226 rows containing missing values or values outside the scale range
(`geom_point()`).

#plot population (x) vs. life expectancy (y)
##population goes on a log scale, so add a log10(population) column
###then check the structure
africa_pop_exp <- africa_pop_exp %>% mutate(log_population = log10(population))
str(africa_pop_exp)
'data.frame':   2907 obs. of  3 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...
 $ log_population : num  7.05 6.72 6.39 5.72 6.68 ...
ggplot(africa_pop_exp, (aes(x = log_population, y = life_expectancy))) +
    geom_point()
Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_point()`).

The points form streaks, which likely represent individual countries over time as their population and life expectancy both slowly increase.
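
One way to check the streak interpretation is to connect each country's points in year order. The sketch below is an assumption-laden alternative to the plot above: it uses scale_x_log10() instead of a separate log column, and the names plot_data and streak_plot are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(dslabs)

africadata <- gapminder %>% filter(continent == "Africa")

# Drop rows without a population value up front, so ggplot2 has nothing to remove,
# and sort so that geom_path() connects each country's points in year order.
# `plot_data` and `streak_plot` are illustrative names.
plot_data <- africadata %>%
  filter(!is.na(population)) %>%
  arrange(country, year)

# scale_x_log10() log-transforms the axis while keeping labels in original units;
# group = country draws one path per country.
streak_plot <- ggplot(plot_data, aes(x = population, y = life_expectancy, group = country)) +
  geom_path(alpha = 0.5) +
  scale_x_log10() +
  labs(x = "Population (log10 scale)", y = "Life expectancy (years)")

streak_plot
```

If each streak is a single country, the paths should trace the streaks rather than jump between them.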

#identify where the NA values are, to avoid them in the next step 
missing_inf_years <- africadata %>% filter(is.na(infant_mortality)) %>% pull(year) %>% unique()
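
The missing_inf_years vector above is stored but never printed. A small sketch of how the missing values could be inspected per year, and how to confirm that the year used in the next step is unaffected; missing_by_year is a hypothetical name.

```r
library(dplyr)
library(dslabs)

africadata <- gapminder %>% filter(continent == "Africa")

# Count missing infant_mortality values per year.
# `missing_by_year` is an illustrative name, not part of the original analysis.
missing_by_year <- africadata %>%
  filter(is.na(infant_mortality)) %>%
  count(year, name = "n_missing")

missing_by_year

# TRUE would mean year 2000 has missing infant_mortality values
2000 %in% missing_by_year$year
```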

#new object with only year 2000 data 
africadata_2000 <- africadata %>% filter(year == 2000)

#structure of new africadata_2000
str(africadata_2000)
'data.frame':   51 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
 $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
 $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
 $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
 $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
#plot infant_mortality (x) vs. life_expectancy (y)
ggplot(africadata_2000, (aes(x = infant_mortality, y = life_expectancy))) +
  geom_point()

#plot population (x) vs. life expectancy (y)
##population goes on a log scale, so add a log10(population) column
###then check the structure
africadata_2000 <- africadata_2000 %>% mutate(log_population = log10(population))
str(africadata_2000)
'data.frame':   51 obs. of  10 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
 $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
 $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
 $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
 $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
 $ log_population  : num  7.49 7.18 6.84 6.24 7.06 ...
ggplot(africadata_2000, (aes(x = log_population, y = life_expectancy))) +
    geom_point()

#Fit life expectancy as the outcome, and infant mortality as the predictor
fit1 <- lm(life_expectancy ~ infant_mortality, data = africadata_2000)

#Fit life expectancy as the outcome, and population as the predictor
fit2 <- lm(life_expectancy ~ population, data = africadata_2000)

#I wasn't sure if you wanted population or the log_population, so I did fit3 to be the log_population
fit3 <- lm(life_expectancy ~ log_population, data = africadata_2000)

#summary of the fits 
summary(fit1)

Call:
lm(formula = life_expectancy ~ infant_mortality, data = africadata_2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08
summary(fit2)

Call:
lm(formula = life_expectancy ~ population, data = africadata_2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.429  -4.602  -2.568   3.800  18.802 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
population  2.756e-08  5.459e-08   0.505    0.616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159
summary(fit3)

Call:
lm(formula = life_expectancy ~ log_population, data = africadata_2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.113  -4.809  -1.554   3.907  18.863 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      65.324     12.520   5.217 3.65e-06 ***
log_population   -1.315      1.829  -0.719    0.476    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.502 on 49 degrees of freedom
Multiple R-squared:  0.01044,   Adjusted R-squared:  -0.009755 
F-statistic: 0.517 on 1 and 49 DF,  p-value: 0.4755

What I found here:

  • fit1: p-value = 2.826e-08, multiple R-squared = 0.4701 (adjusted 0.4593). The p-value is well below 0.05, so the association between life expectancy and infant mortality is statistically significant. The slope is negative (-0.189), and infant mortality explains a little under half of the variance in life expectancy (a correlation of about -0.69), so this is a moderate inverse (negative) relationship, confirmed visually by the plot.
  • fit2: p-value = 0.6159, adjusted R-squared = -0.01513. The p-value is above 0.05, so there is no evidence of a linear relationship between life_expectancy and population.
  • fit3: p-value = 0.4755, adjusted R-squared = -0.009755. The p-value is above 0.05, so there is no evidence of a linear relationship between life expectancy and log_population either.
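
To compare the three fits side by side rather than reading three separate summaries, the model-level statistics can be collected into one table. This is a sketch that assumes the broom package is installed; the names fits and fit_comparison are hypothetical.

```r
library(dplyr)
library(dslabs)

# Rebuild the year-2000 data so the block runs on its own
africadata_2000 <- gapminder %>%
  filter(continent == "Africa", year == 2000) %>%
  mutate(log_population = log10(population))

# `fits` and `fit_comparison` are illustrative names
fits <- list(
  fit1 = lm(life_expectancy ~ infant_mortality, data = africadata_2000),
  fit2 = lm(life_expectancy ~ population, data = africadata_2000),
  fit3 = lm(life_expectancy ~ log_population, data = africadata_2000)
)

# broom::glance() returns one row of model-level statistics per model;
# bind_rows() stacks them and stores the list names in the `model` column
fit_comparison <- bind_rows(lapply(fits, broom::glance), .id = "model") %>%
  select(model, r.squared, adj.r.squared, p.value)

fit_comparison
```

The r.squared and adj.r.squared columns correspond to the "Multiple R-squared" and "Adjusted R-squared" lines in the summaries above.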

Additional exploration: dslabs::gapminder (contributed by Alexandra Tejada-Strop)

In this section I explore the gapminder dataset from the dslabs package. I do basic exploration, some light cleaning, a few figures, and then fit a simple regression model to describe how life expectancy relates to economic output and time.

Load packages

library(tidyverse)
library(dslabs)

Load data

data(gapminder) # loads a data frame called gapminder

Quick look

glimpse(gapminder)
summary(gapminder)

Check for missing values

gapminder %>% summarise(across(everything(), ~ sum(is.na(.))))
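
A base R alternative to the missing-value check above, sketched here because a named vector is easy to sort and subset; na_counts is a hypothetical name.

```r
library(dslabs)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
# `na_counts` is an illustrative name.
na_counts <- colSums(is.na(gapminder))

# Keep only the columns that actually have missing values, largest first
sort(na_counts[na_counts > 0], decreasing = TRUE)
```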

Basic processing / cleaning

I create a few helpful variables:

  • gdp_per_cap: GDP per capita
  • log_gdp_per_cap: log10 GDP per capita (often more linear for modeling)

I also remove rows where GDP or population are missing or zero (to avoid dividing by zero and taking the log of non-positive values).

gap2 <- gapminder %>%
  mutate(
    gdp_per_cap = gdp / population,
    log_gdp_per_cap = log10(gdp_per_cap)
  ) %>%
  filter(
    !is.na(gdp_per_cap),
    is.finite(log_gdp_per_cap),
    gdp_per_cap > 0,
    population > 0
  )

glimpse(gap2)
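
Because the filter removes rows, it can help to see how much data is left and which years survive. A short sketch using the same cleaning steps as above; the summary vector row_counts is a hypothetical name.

```r
library(dplyr)
library(dslabs)

# Same cleaning steps as above, repeated so the block runs on its own
gap2 <- gapminder %>%
  mutate(
    gdp_per_cap = gdp / population,
    log_gdp_per_cap = log10(gdp_per_cap)
  ) %>%
  filter(
    !is.na(gdp_per_cap),
    is.finite(log_gdp_per_cap),
    gdp_per_cap > 0,
    population > 0
  )

# Rows kept and dropped by the filter (`row_counts` is an illustrative name)
row_counts <- c(kept = nrow(gap2), dropped = nrow(gapminder) - nrow(gap2))
row_counts

# First and last year that still have usable GDP per capita values
range(gap2$year)
```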

Exploratory figures

  1. Life expectancy over time by region
  2. Relationship between life expectancy and GDP per capita (log scale), colored by region

1) Life expectancy over time by region

gap2 %>%
  group_by(year, region) %>%
  summarise(mean_life_exp = mean(life_expectancy), .groups = "drop") %>%
  ggplot(aes(x = year, y = mean_life_exp, color = region)) +
  geom_line() +
  labs(
    title = "Average life expectancy over time by region",
    x = "Year",
    y = "Average life expectancy"
  )

2) Life expectancy vs GDP per cap (log scale)

gap2 %>%
  ggplot(aes(x = log_gdp_per_cap, y = life_expectancy, color = region)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Life expectancy vs log10(GDP per capita)",
    x = "log10(GDP per capita)",
    y = "Life expectancy"
  )

Simple statistical model

I fit a linear regression model predicting life expectancy from:

  • log10(GDP per capita)
  • year (to capture general time trends)
  • region (to capture broad geographic differences)

This is not meant to be causal, just a simple descriptive model.

m1 <- lm(life_expectancy ~ log_gdp_per_cap + year + region, data = gap2)
summary(m1)

A cleaner coefficient table

broom::tidy(m1) %>% arrange(p.value)

Results (plain-language summary)

  • The coefficient for log_gdp_per_cap is typically positive: higher GDP per capita is associated with higher life expectancy.
  • The coefficient for year is typically positive: life expectancy tends to increase over time.
  • Regional coefficients reflect systematic differences across regions after accounting for GDP per capita and time.

(Exact estimates may vary slightly depending on filtering and package versions.)
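
To check the direction of the two main coefficients directly, the estimates and their confidence intervals can be pulled from the fitted model. This is a sketch assuming the broom package is installed; key_terms is a hypothetical name, and no specific estimates are asserted here.

```r
library(dplyr)
library(dslabs)

# Same cleaning and model as above, repeated so the block runs on its own
gap2 <- gapminder %>%
  mutate(
    gdp_per_cap = gdp / population,
    log_gdp_per_cap = log10(gdp_per_cap)
  ) %>%
  filter(is.finite(log_gdp_per_cap), gdp_per_cap > 0, population > 0)

m1 <- lm(life_expectancy ~ log_gdp_per_cap + year + region, data = gap2)

# conf.int = TRUE adds conf.low and conf.high (95% by default).
# `key_terms` is an illustrative name.
key_terms <- broom::tidy(m1, conf.int = TRUE) %>%
  filter(term %in% c("log_gdp_per_cap", "year")) %>%
  select(term, estimate, conf.low, conf.high)

key_terms
```

An interval that lies entirely above zero would support describing that coefficient as positive.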