Introduction

   The first synthetic plastic was created at the start of the twentieth century. Since the 1950s, worldwide plastic production has increased dramatically; in 2015, 381 million metric tons of plastic were produced globally. Most plastic waste is managed well and contained, but some is inadequately disposed of and ends up in the ocean, transported by wind, inland waterways, and wastewater systems. Ocean wildlife is especially vulnerable to harmful impacts from plastic pollution, mainly through entanglement and ingestion of plastics (Ritchie & Roser, 2018).

   This dataset includes coastal population sizes and the weight of mismanaged plastic waste in 2010. I will fit a simple linear regression and test whether coastal population is a significant predictor of the weight of mismanaged plastic waste. The variables are:

Outcome/Dependent Variable

  • Mismanaged plastic waste (in metric tons): the total weight of plastic waste that is littered or inadequately disposed of in dumps or uncontrolled landfills in 2010.

Predictor/Independent Variable

  • Coastal population: includes the total population within 50 kilometers of a coastline in 2010.


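# Packages that provide the functions called in this report
# (the original package-loading chunk is not shown, so this list is inferred from the calls below)
library(readr)    # read_csv()
library(tibble)   # as_tibble()
library(DT)       # datatable()
library(fBasics)  # basicStats()
library(pander)   # pander()
library(ggplot2)  # ggplot() and friends
library(grid)     # pushViewport(), viewport(), grid.layout()

data_full <- read_csv("coastal-population-vs-mismanaged-plastic.csv")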

# Making simple column names:
names(data_full)[names(data_full) == "Coastal population"] <- "coastal_pop"
names(data_full)[names(data_full) == "Total mismanaged plastic waste in 2010"] <- "mismanaged_waste"
names(data_full)[names(data_full) == "Total population (Gapminder, HYDE & UN)"] <- "total_pop"
names(data_full)[names(data_full) == "Entity"] <- "country"

# The dataset includes rows for inland populations, which have missing values in the columns "Coastal population" and "Total mismanaged plastic waste in 2010". Filtering the data to only include coastal populations:
data_coastal <- data_full[!is.na(data_full$coastal_pop), ]

# Removing irrelevant columns (country code is redundant information, year is 2010 for all rows, continent is null for most rows)
data_coastal <- subset(data_coastal, select = -c(Code, Year, Continent) )


Data Exploration

datatable(data_coastal)


Summary Statistics

df <- data.frame(cPop = data_coastal$coastal_pop, waste = data_coastal$mismanaged_waste) 

summary_stats <- data.frame(t(basicStats(df)[c("Mean", "Stdev", "Minimum", "Median", "Maximum"),]))

pander(summary_stats, big.mark = ",", scientific=FALSE)
               Mean        Stdev   Minimum      Median      Maximum
cPop     10,869,870   31,126,835       596   1,794,753  262,892,387
waste       171,204      732,024         1      15,981    8,819,717

   The mean (standard deviation) coastal population is 10,869,870 (31,126,835). The median coastal population is 1,794,753.

   The mean (standard deviation) weight of mismanaged plastic waste (in metric tons) is 171,204.3 (732,023.8). The median weight of mismanaged plastic waste (in metric tons) is 15,981.

   Both the predictor and the outcome variable are heavily skewed right. Half of the coastal populations in this dataset are smaller than about 1.8 million, while the largest is 262.9 million. Likewise, half of the recorded weights of mismanaged plastic waste are below about 16,000 metric tons, while the largest is 8.8 million metric tons.
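
   As a rough check on the skew, the mean of each variable can be compared with its median; a ratio well above 1 points to a long right tail. The sketch below re-enters the values from the summary table by hand (the data frame name `reported` is only for illustration and is not part of the original analysis):

```r
# Summary statistics copied by hand from the table above
reported <- data.frame(
  variable = c("cPop", "waste"),
  mean     = c(10869870, 171204),
  median   = c(1794753, 15981)
)

# A mean far above the median indicates right skew
reported$mean_to_median <- reported$mean / reported$median
print(reported)
```

   Both ratios come out well above 1 (roughly 6 for coastal population and roughly 11 for waste), which agrees with the right skew described above.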


Data Visualization

coastal_pop_hist <- ggplot(data_coastal, aes(coastal_pop)) + 
  geom_histogram() +
  scale_x_continuous(labels = addUnits) +
  scale_y_continuous(labels = addUnits) +
  xlab("Coastal Population") +
  ylab("Count") +
  theme_minimal() 

mismanaged_waste_hist <- ggplot(data_coastal, aes(mismanaged_waste)) + 
  geom_histogram() +
  scale_x_continuous(labels = addUnits) +
  scale_y_continuous(labels = addUnits) +
  xlab("Mismanaged Plastic Waste (in metric tons)") +
  ylab(NULL) +
  theme_minimal() 

pushViewport(viewport(layout = grid.layout(1, 2)))
print(coastal_pop_hist, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(mismanaged_waste_hist, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))


   The histograms show a clear right skew for both variables.
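
   The plotting code in this report passes a helper called `addUnits` to the axis scales, but its definition is not shown. A minimal sketch of what such a helper might look like, assuming it abbreviates large numbers with K, M, and B suffixes (the behavior here is a guess, not the author's actual function):

```r
# Hypothetical stand-in for the addUnits() helper used in the plots.
# Assumption: it turns axis breaks into short labels such as "1.5K" or "2.5M".
addUnits <- function(n) {
  out <- ifelse(abs(n) < 1e3, as.character(round(n)),
         ifelse(abs(n) < 1e6, paste0(round(n / 1e3, 1), "K"),
         ifelse(abs(n) < 1e9, paste0(round(n / 1e6, 1), "M"),
                paste0(round(n / 1e9, 1), "B"))))
  # ggplot2 can pass NA breaks to label functions; keep them as NA labels
  out[is.na(n)] <- NA_character_
  out
}

addUnits(c(0, 1500, 2.5e6, 3e9))
```

   Any function that takes a numeric vector of breaks and returns a character vector of the same length can be used as the `labels` argument in this way.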


coastal_pop_boxplot <- ggplot(data = data_coastal, aes(x = "", y = coastal_pop)) + 
  geom_boxplot() +
  xlab("Coastal Population") +
  ylab(NULL) +
  theme_minimal()

mismanaged_waste_boxplot <- ggplot(data = data_coastal, aes(x = "", y = mismanaged_waste)) + 
  geom_boxplot() +
  xlab("Mismanaged Plastic Waste (in metric tons)") +
  ylab(NULL) +
  theme_minimal()

pushViewport(viewport(layout = grid.layout(1, 2)))
print(coastal_pop_boxplot, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(mismanaged_waste_boxplot, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))


   The bulk of the data is on the lower end of the box plots, illustrating the heavy right skew of the data.


Regression Model

data_coastal_model <- lm(mismanaged_waste ~ coastal_pop, data=data_coastal)
data_coastal_coef <- coefficients(data_coastal_model)
data_coastal_anova <- anova(data_coastal_model)
data_coastal_summary <- summary(data_coastal_model)
data_coastal_t <- as_tibble(data_coastal_summary[[4]])
data_coastal_ci <- as_tibble(confint(data_coastal_model, level=0.95))

data_coastal_summary
## 
## Call:
## lm(formula = mismanaged_waste ~ coastal_pop, data = data_coastal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2824988   -10964    25736    30071  4005977 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.903e+04  3.545e+04  -0.819    0.414    
## coastal_pop  1.842e-02  1.078e-03  17.092   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 456300 on 184 degrees of freedom
## Multiple R-squared:  0.6136, Adjusted R-squared:  0.6115 
## F-statistic: 292.1 on 1 and 184 DF,  p-value: < 2.2e-16

The resulting regression model is

\[ \hat{y} = -2.9030885\times 10^{4} + 0.0184211x \]

   For each additional person in a coastal population, the model predicts an increase of 0.0184 metric tons of mismanaged plastic waste (about 40.6 pounds).
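
   The conversion to pounds can be reproduced with a few lines of arithmetic. This is a small sketch that uses the slope from the fitted equation and the standard factor of about 2.2046 pounds per kilogram (the variable names are only for illustration):

```r
slope_tonnes <- 0.0184211    # slope from the fitted model, metric tons per person
kg_per_tonne <- 1000         # 1 metric ton = 1000 kg
lb_per_kg    <- 2.20462262   # pounds per kilogram

slope_kg <- slope_tonnes * kg_per_tonne   # kilograms per additional person
slope_lb <- slope_kg * lb_per_kg          # pounds per additional person

round(c(kg = slope_kg, lb = slope_lb), 2)
```

   In other words, the slope corresponds to roughly 18.4 kilograms of mismanaged plastic waste per additional coastal resident.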

   The estimated intercept, \(\hat{\beta}_0\), has no practical interpretation because a coastal population of zero is outside the scope of this model. The estimate is also not significantly different from zero (\(p = 0.414\)).


Coastal Population vs. Mismanaged Plastic Waste:

ggplot(data = data_coastal, aes(x = coastal_pop, y = mismanaged_waste)) + 
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color='forest green') +
  scale_x_continuous(labels = addUnits) +
  scale_y_continuous(labels = addUnits) +
  xlab("Coastal Population") +
  ylab("Mismanaged Plastic Waste (in metric tons)")


Log Transformed Coastal Population vs. Mismanaged Plastic Waste:

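   Since the data is heavily skewed right, plotting both axes on a logarithmic scale makes the relationship much easier to see. Note that ggplot2 applies scale transformations before computing statistics, so the line drawn here is a linear fit to the log-transformed values rather than the regression model fitted above. The data points appear to be distributed evenly around this line: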


ggplot(data = data_coastal, aes(x = coastal_pop, y = mismanaged_waste)) + 
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color='forest green') +
  scale_x_log10(labels = addUnits) +
  scale_y_log10(labels = addUnits) +
  xlab("Coastal Population") +
  ylab("Mismanaged Plastic Waste (in metric tons)")
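
   The line drawn on log-scaled axes can also be obtained outside ggplot2 by regressing log10 waste on log10 population. The following is a minimal sketch that uses simulated values as a stand-in for `data_coastal`; the simulated intercept, slope, and noise level are assumptions chosen for illustration, not estimates from the dataset:

```r
set.seed(1)

# Simulated stand-in for data_coastal (NOT the real data)
n <- 200
sim <- data.frame(coastal_pop = 10^runif(n, min = 3, max = 8))
# Assumed relationship on the log10 scale: intercept -2, slope 1, noise sd 0.5
sim$mismanaged_waste <- 10^(-2 + 1 * log10(sim$coastal_pop) + rnorm(n, sd = 0.5))

# Linear regression on the log10-transformed variables
log_model <- lm(log10(mismanaged_waste) ~ log10(coastal_pop), data = sim)
coef(log_model)
```

   The slope of such a model describes the change in log10 waste per unit change in log10 population, so it is not directly comparable to the slope of the untransformed model.
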

Hypothesis Test for Significance of \(\beta_1\)

   This test will determine if the weight of mismanaged plastic waste is significantly predicted by coastal population size.

Hypotheses

   \(H_0: \ \beta_1 = 0\) / coastal population size is not a linear predictor of the weight of mismanaged plastic waste
   \(H_1: \ \beta_1 \ne 0\) / coastal population size is a linear predictor of the weight of mismanaged plastic waste

Test Statistic

   \(t_0 = 17.09\).

p-value

   \(p < 0.0001\).

Rejection Region

   Reject if \(p < \alpha\), where \(\alpha=0.05\).

Conclusion and Interpretation

   Reject \(H_0\). There is sufficient evidence to suggest that the weight of mismanaged plastic waste is significantly predicted by coastal population size.
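
   The test statistic and p-value can be reproduced from the coefficient table in the model summary. This sketch re-enters the rounded estimate and standard error by hand, so the result may differ slightly in the last digit from the value printed by `summary()`:

```r
estimate  <- 1.842e-02   # slope estimate for coastal_pop (from the summary output)
std_error <- 1.078e-03   # its standard error
df_resid  <- 184         # residual degrees of freedom

# t statistic for H0: beta_1 = 0
t0 <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t0), df = df_resid, lower.tail = FALSE)

round(t0, 2)
p_value < 0.05
```

   With the fitted model object available, the same quantities are in the `coastal_pop` row of `summary(data_coastal_model)$coefficients`.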


95% Confidence Interval on \(\beta_1\)

   The 95% confidence interval for \(\beta_1\) is (0.0163, 0.0205).
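
   The interval follows from the usual formula: estimate plus or minus the critical t value times the standard error. Here is a small sketch using the rounded values from the model summary (entered by hand, so this approximates what `confint()` computes from the unrounded fit):

```r
estimate  <- 1.842e-02   # slope estimate for coastal_pop
std_error <- 1.078e-03   # its standard error
df_resid  <- 184         # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 4)
```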


\(R^2\) for the Regression Line

   \(R^2=0.61\); that is, approximately 61% of the variance in the weight of mismanaged plastic waste is explained by the current model. This is a moderately strong indication that the model is a good fit for the data.
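
   As a consistency check, \(R^2\) can be recovered from the F-statistic and its degrees of freedom, and in simple linear regression the F-statistic equals the square of the slope's t value. A short sketch using the rounded values printed in the model summary:

```r
f_stat   <- 292.1   # F-statistic from the summary output
df_model <- 1       # numerator degrees of freedom
df_resid <- 184     # denominator degrees of freedom

# R^2 = F * df1 / (F * df1 + df2)
r_squared <- (f_stat * df_model) / (f_stat * df_model + df_resid)
round(r_squared, 4)

# With one predictor, t^2 should match F up to rounding
t_value <- 17.092
t_value^2
```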


Conclusion

   The data suggests that larger coastal populations are associated with greater amounts of mismanaged plastic waste. Most of the population sizes in the dataset are relatively small, resulting in a heavy right skew for both variables. When both scatterplot axes are placed on a logarithmic scale, the data points fall evenly around the fitted line. There is sufficient evidence to suggest that the weight of mismanaged plastic waste is significantly predicted by coastal population size; the hypothesis test for the significance of \(\beta_1\) yields \(p < 0.0001\). Approximately 61% of the variance in the weight of mismanaged plastic waste is explained by the current model, which indicates a moderately good fit to the data.


Reference

Ritchie, H., & Roser, M. (2018). "Plastic Pollution". Published online at OurWorldInData.org. Retrieved from https://ourworldindata.org/plastic-pollution