ANOVA Test - Month and Violations

The data visualizations show patterns in the number of parking violations by month and by weekday. To examine how month is associated with the frequency of parking violations, we run a one-way ANOVA of daily violation counts across months.

\(H_0\): The mean daily number of parking violations is the same across months.

\(H_1\): The mean daily number of parking violations differs for at least one month.

# Count violations per calendar day (one row per month/weekday/day combination)
fit_df = violation %>%
  mutate(month = as.factor(month)) %>%
  group_by(month, weekday, day) %>%
  summarize(n_obs = n(), .groups = "drop")


fit_model_month = lm(n_obs ~ month, data = fit_df)
anova(fit_model_month) %>% knitr::kable(caption = "One way anova of violation frequency and month")
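One way anova of violation frequency and month

|           |  Df |     Sum Sq |  Mean Sq |  F value | Pr(>F) |
|:----------|----:|-----------:|---------:|---------:|-------:|
| month     |   9 |  494640293 | 54960033 | 9.662275 |      0 |
| Residuals | 294 | 1672302865 |  5688105 |       NA |     NA |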

The p-value in the above test is reported as 0 after rounding (F = 9.66 on 9 and 294 degrees of freedom). Thus, we reject the null hypothesis that the mean daily number of parking violations is constant across months. There is evidence that the average number of parking violations varies by month.
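Because the table rounds the p-value to 0, it can help to recompute it from the reported sums of squares. The sketch below uses only the numbers printed in the ANOVA table and base R; it does not assume access to the original `violation` data, and the variable names are illustrative.

```r
# Values copied from the ANOVA table above
ss_month <- 494640293
df_month <- 9
ss_resid <- 1672302865
df_resid <- 294

# Mean squares and the F statistic
ms_month <- ss_month / df_month
ms_resid <- ss_resid / df_resid
f_stat <- ms_month / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_month, df_resid, lower.tail = FALSE)

f_stat
p_value
```

Printing `p_value` directly, rather than through a rounded table, shows its actual order of magnitude.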

Chi-square Test - Top 5 Violation Types and Boroughs

We hypothesize that tickets are distributed among the five most common violation types in NYC in the same proportions in all five boroughs. We use a chi-square test to check this.

\(H_0:\) the proportion of tickets in each violation type category is the same across all boroughs.

\(H_1:\) the proportion of tickets in at least one violation type category is not the same across all boroughs.

## Top five violation types: 1) No parking-street cleaning, 2) Failure to display a muni-meter receipt, 3) Inspection sticker-expired/missing, 4) Registration sticker-expired/missing, and 5) Fire hydrant.

five_common_violation = 
  violation1 %>%
  select(borough, violation) %>% 
  filter(violation %in%
           c("NO PARKING-STREET CLEANING",
             "FAIL TO DSPLY MUNI METER RECPT",
             "INSP. STICKER-EXPIRED/MISSING",
             "REG. STICKER-EXPIRED/MISSING",
             "FIRE HYDRANT")) %>%
  count(violation, borough) %>% 
  pivot_wider(
    names_from = "violation",
    values_from = "n"
  ) %>% 
  data.matrix() %>% 
  subset(select = -c(borough))

rownames(five_common_violation) <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")

five_common_violation %>% 
  knitr::kable(caption = "Results Table")
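Results Table

|               | FAIL TO DSPLY MUNI METER RECPT | FIRE HYDRANT | INSP. STICKER-EXPIRED/MISSING | NO PARKING-STREET CLEANING | REG. STICKER-EXPIRED/MISSING |
|:--------------|-------------------------------:|-------------:|------------------------------:|---------------------------:|-----------------------------:|
| Bronx         |                          24005 |        32407 |                         44436 |                      46874 |                        35410 |
| Brooklyn      |                          49347 |        43903 |                         67654 |                     101494 |                        51038 |
| Manhattan     |                          49902 |        31794 |                         40856 |                      43963 |                        35280 |
| Queens        |                          72767 |        31610 |                         61256 |                      60778 |                        56589 |
| Staten Island |                           6025 |         1946 |                         15857 |                         16 |                        13492 |
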
chisq.test(five_common_violation)
## 
##  Pearson's Chi-squared test
## 
## data:  five_common_violation
## X-squared = 53839, df = 16, p-value < 2.2e-16
x_crit = qchisq(0.95, 16)
x_crit
## [1] 26.29623

Interpretation: The chi-square statistic exceeds the critical value, \(\chi^2 = 53839 > \chi^2_{crit} = 26.30\), at significance level \(\alpha = 0.05\), so we reject the null hypothesis and conclude that in at least one borough the proportions of the five violation types differ from those in the other boroughs.
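To make the test statistic less of a black box, the following sketch recomputes it by hand from the counts in the results table: expected counts come from the row and column totals, and the Pearson statistic sums the squared deviations. It uses base R only; the shortened column labels and variable names are illustrative choices rather than names from the original analysis.

```r
# Observed counts copied from the results table above (rows: boroughs)
observed <- matrix(
  c(24005, 32407, 44436,  46874, 35410,
    49347, 43903, 67654, 101494, 51038,
    49902, 31794, 40856,  43963, 35280,
    72767, 31610, 61256,  60778, 56589,
     6025,  1946, 15857,     16, 13492),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    c("MUNI METER", "FIRE HYDRANT", "INSP. STICKER",
      "STREET CLEANING", "REG. STICKER")
  )
)

# Expected counts under H0: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Standardized residuals: large absolute values mark the cells
# that deviate most from the equal-proportions hypothesis
std_res <- chisq.test(observed)$stdres
round(std_res, 1)
```

Inspecting `std_res` is one way to see which borough and violation type combinations contribute most to the rejection, which the overall test alone does not reveal.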

Proportion Test

Next, we want to see whether receiving a fire hydrant violation is equally common among the residents of each borough. To do this, we conduct a test of equal proportions.

We derived the population of each borough from the most recent census.

First, we will assume:

  1. Each car has only one driver;

  2. Each car is counted for at most one fire hydrant violation per borough in 2021.
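The analysis code counts each plate once per borough with `distinct(plate, borough)`. A minimal base R sketch of the same idea on made-up data is shown here; the plates and the `tickets` data frame are hypothetical and serve only to illustrate the counting rule.

```r
# Hypothetical toy tickets; plates and boroughs are made up for illustration
tickets <- data.frame(
  plate   = c("AAA111", "AAA111", "BBB222", "CCC333", "CCC333"),
  borough = c("Bronx",  "Bronx",  "Bronx",  "Queens", "Bronx")
)

# Keep one row per plate/borough pair, then count ticketed cars per borough
unique_cars <- unique(tickets[, c("plate", "borough")])
cars_per_borough <- table(unique_cars$borough)
cars_per_borough
```

Note that a car ticketed in two boroughs (plate `CCC333` here) is still counted once in each of them.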

\(H_0:\) The proportion of individuals who received a fire hydrant violation is the same across all boroughs.

\(H_1:\) The proportion of individuals who received a fire hydrant violation is not the same across all boroughs.

url = "https://www.citypopulation.de/en/usa/newyorkcity/"
nyc_population_html = read_html(url)

population = nyc_population_html %>% 
  html_elements(".rname .prio1") %>% 
  html_text()

boro = nyc_population_html %>% 
  html_elements(".rname a span") %>% 
  html_text()

nyc_population = tibble(
  borough = boro,
  population = population %>% str_remove_all(",") %>% as.numeric()
) 
  
fire_hydrant = violation1 %>%
  select(borough, violation, plate) %>% 
  filter(violation == "FIRE HYDRANT") %>%
  distinct(plate, borough) %>% 
  count(borough) 

boro_population = left_join(fire_hydrant, nyc_population, by = "borough")

boro_population %>% 
  knitr::kable(caption = "Results Table")
Results Table

| borough       |     n | population |
|:--------------|------:|-----------:|
| Bronx         | 18432 |    1472654 |
| Brooklyn      | 28124 |    2736074 |
| Manhattan     | 20809 |    1694251 |
| Queens        | 21832 |    2405464 |
| Staten Island |  1611 |     495747 |

prop.test(boro_population$n, boro_population$population)
## 
##  5-sample test for equality of proportions without continuity
##  correction
## 
## data:  boro_population$n out of boro_population$population
## X-squared = 4127.7, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
##      prop 1      prop 2      prop 3      prop 4      prop 5 
## 0.012516178 0.010278962 0.012282123 0.009076004 0.003249641

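From the above results, the p-value is very small (< 2.2e-16), so we reject the null hypothesis and conclude that the proportion of people receiving a fire hydrant violation differs across boroughs. The estimated proportions range from about 0.3% in Staten Island to about 1.25% in the Bronx.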
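The sample estimates are easier to read as rates per 1,000 residents, and the overall test does not say which boroughs differ from each other. The sketch below rebuilds the inputs from the results table and, as one possible follow-up that is not part of the original analysis, runs pairwise comparisons with a Holm adjustment. It uses base R only, and the variable names are illustrative.

```r
# Counts and populations copied from the results table above
boroughs   <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
n_cars     <- c(18432, 28124, 20809, 21832, 1611)
population <- c(1472654, 2736074, 1694251, 2405464, 495747)
names(n_cars) <- boroughs

# Ticketed cars per 1,000 residents
rate_per_1000 <- 1000 * n_cars / population
round(rate_per_1000, 2)

# Overall test of equal proportions (same call as in the analysis)
overall <- prop.test(n_cars, population)

# Pairwise comparisons, with Holm adjustment for multiple testing
pairwise <- pairwise.prop.test(n_cars, population, p.adjust.method = "holm")
pairwise$p.value
```

Each entry of `pairwise$p.value` is the adjusted p-value for one pair of boroughs, so small entries point to the pairs whose proportions differ.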