Statistical Analyses

ANOVA Test - Month and Violations

The data visualizations show patterns in the number of parking violations by month and by weekday. To examine how month is associated with the frequency of parking violations, we run a one-way ANOVA across months, using daily violation counts as the observations.

\(H_0\): The mean number of parking violations does not differ across months.

\(H_1\): The mean number of parking violations differs for at least one month.

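# Count violations per calendar day; each row of fit_df is one day
fit_df = violation %>%
  mutate(month = as.factor(month)) %>%
  group_by(month, weekday, day) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the regrouping message
  summarize(n_obs = n(), .groups = "drop")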


fit_model_month = lm(n_obs ~ month, data = fit_df)
anova(fit_model_month) %>% knitr::kable(caption = "One way anova of violation frequency and month")

One way anova of violation frequency and month
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
month	9	494640293	54960033	9.662275	0
Residuals	294	1672302865	5688105	NA	NA

The F statistic is 9.66 on 9 and 294 degrees of freedom, and the p-value is small enough that the table rounds it to 0. We therefore reject the null hypothesis that the mean number of parking violations is constant across months. The data provide evidence that the average number of parking violations varies by month.
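Because the table displays Pr(>F) as 0, it can help to recompute the test from the reported sums of squares. The following is a minimal sketch that uses only the values printed in the ANOVA table; the variable names are illustrative and are not part of the original analysis.

```r
# Values copied from the ANOVA table above.
ss_month <- 494640293   # Sum Sq for month
ss_resid <- 1672302865  # Sum Sq for residuals
df_month <- 9           # Df for month
df_resid <- 294         # Df for residuals

# A mean square is a sum of squares divided by its degrees of freedom.
ms_month <- ss_month / df_month
ms_resid <- ss_resid / df_resid

# The F statistic is the ratio of the two mean squares; the p-value is
# the upper-tail probability of the F(9, 294) distribution at that value.
f_stat  <- ms_month / ms_resid
p_value <- pf(f_stat, df_month, df_resid, lower.tail = FALSE)

f_stat
p_value
```

The recomputed F statistic should agree with the F value column of the table, and the printed p-value shows the magnitude that the rounded table hides.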

Chi-square Test - Top 5 Violation Types and Boroughs

We hypothesize that the distribution of tickets among the five most common violation types in NYC is the same across the five boroughs. We use a chi-square test to check this hypothesis.

\(H_0:\) the proportions of tickets in each violation type category are the same across all boroughs.

\(H_1:\) the proportions of tickets in each violation type category are not the same across all boroughs.

## Top five violation types: 1) No parking-street cleaning, 2) Failure to display a muni-meter receipt, 3) Inspection sticker-expired/missing, 4) Registration sticker-expired/missing, and 5) Fire hydrant.

five_common_violation = 
  violation1 %>%
  select(borough, violation) %>% 
  filter(violation %in%
           c("NO PARKING-STREET CLEANING",
             "FAIL TO DSPLY MUNI METER RECPT",
             "INSP. STICKER-EXPIRED/MISSING",
             "REG. STICKER-EXPIRED/MISSING",
             "FIRE HYDRANT")) %>%
  count(violation, borough) %>% 
  pivot_wider(
    names_from = "violation",
    values_from = "n"
  ) %>% 
  data.matrix() %>% 
  subset(select = -c(borough))

rownames(five_common_violation) <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")

five_common_violation %>% 
  knitr::kable(caption = "Results Table")

Results Table
	FAIL TO DSPLY MUNI METER RECPT	FIRE HYDRANT	INSP. STICKER-EXPIRED/MISSING	NO PARKING-STREET CLEANING	REG. STICKER-EXPIRED/MISSING
Bronx	24005	32407	44436	46874	35410
Brooklyn	49347	43903	67654	101494	51038
Manhattan	49902	31794	40856	43963	35280
Queens	72767	31610	61256	60778	56589
Staten Island	6025	1946	15857	16	13492

chisq.test(five_common_violation)

## 
##  Pearson's Chi-squared test
## 
## data:  five_common_violation
## X-squared = 53839, df = 16, p-value < 2.2e-16

x_crit = qchisq(0.95, 16)
x_crit

## [1] 26.29623

Interpretation: The test statistic \(\chi^2 = 53839\) exceeds the critical value \(\chi^2_{crit} = 26.30\) at significance level \(\alpha = 0.05\) with 16 degrees of freedom, so we reject the null hypothesis and conclude that at least one borough’s distribution of violation types differs from the others.
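To see which cells drive the result, the test can be rerun directly on the counts in the results table. This is a sketch rather than part of the original analysis: the matrix is typed in by hand from the table, and the object names `counts` and `res` are illustrative.

```r
# Counts copied from the results table above
# (rows: boroughs, columns: violation types).
counts <- matrix(
  c(24005, 32407, 44436,  46874, 35410,
    49347, 43903, 67654, 101494, 51038,
    49902, 31794, 40856,  43963, 35280,
    72767, 31610, 61256,  60778, 56589,
     6025,  1946, 15857,     16, 13492),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    c("FAIL TO DSPLY MUNI METER RECPT", "FIRE HYDRANT",
      "INSP. STICKER-EXPIRED/MISSING", "NO PARKING-STREET CLEANING",
      "REG. STICKER-EXPIRED/MISSING")
  )
)

res <- chisq.test(counts)

# Degrees of freedom: (rows - 1) * (columns - 1) = 4 * 4 = 16.
res$parameter

# Share of each violation type within each borough (rows sum to 1).
round(prop.table(counts, margin = 1), 3)

# Standardized residuals: large absolute values mark the cells that
# depart most from what equal proportions across boroughs would predict.
round(res$stdres, 1)
```

Comparing the row-wise proportions makes the rejection easier to interpret than the test statistic alone, for example the near absence of street cleaning tickets in Staten Island.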

Proportion Test

Next, we want to see whether receiving a fire hydrant violation is an equally common occurrence among the residents of each borough. To do this, we conduct a test of equal proportions.

We take the population of each borough from the most recent census figures, scraped from citypopulation.de.

First, we will assume:

Each car has only one driver;
Each car gets at most one fire hydrant violation in 2021 (repeat violations by the same plate in a borough are counted once).

\(H_0:\) The proportion of individuals who received a fire hydrant violation is the same across all boroughs.

\(H_1:\) The proportion of individuals who received a fire hydrant violation is not the same across all boroughs.

url = "https://www.citypopulation.de/en/usa/newyorkcity/"
nyc_population_html = read_html(url)

population = nyc_population_html %>% 
  html_elements(".rname .prio1") %>% 
  html_text()

boro = nyc_population_html %>% 
  html_elements(".rname a span") %>% 
  html_text()

nyc_population = tibble(
  borough = boro,
  population = population %>% str_remove_all(",") %>% as.numeric()
) 
  
fire_hydrant = violation1 %>%
  select(borough, violation, plate) %>% 
  filter(violation == "FIRE HYDRANT") %>%
  distinct(plate, borough) %>% 
  count(borough) 

boro_population = left_join(fire_hydrant, nyc_population, by = "borough")

boro_population %>% 
  knitr::kable(caption = "Results Table")

Results Table
borough	n	population
Bronx	18432	1472654
Brooklyn	28124	2736074
Manhattan	20809	1694251
Queens	21832	2405464
Staten Island	1611	495747

prop.test(boro_population$n, boro_population$population)

## 
##  5-sample test for equality of proportions without continuity
##  correction
## 
## data:  boro_population$n out of boro_population$population
## X-squared = 4127.7, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
##      prop 1      prop 2      prop 3      prop 4      prop 5 
## 0.012516178 0.010278962 0.012282123 0.009076004 0.003249641

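From the above results, the p-value is very small (< 2.2e-16), so we reject the null hypothesis and conclude that the proportion of people receiving fire hydrant violations differs across boroughs. The estimated proportions range from about 0.3% in Staten Island to about 1.3% in the Bronx.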
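Since the population figures come from a live web page, the test can also be reproduced from the numbers in the results table alone. The sketch below hard-codes those values; the vector names `violators` and `population` are illustrative, and the per-1,000 rate is only a convenience for reading the proportions.

```r
# Distinct plates with a fire hydrant violation, copied from the table above.
violators <- c(Bronx = 18432, Brooklyn = 28124, Manhattan = 20809,
               Queens = 21832, `Staten Island` = 1611)

# Borough populations, copied from the same table (same order).
population <- c(Bronx = 1472654, Brooklyn = 2736074, Manhattan = 1694251,
                Queens = 2405464, `Staten Island` = 495747)

# Violators per 1,000 residents, which is easier to compare than raw proportions.
round(1000 * violators / population, 2)

# Test of equal proportions across the five boroughs.
res <- prop.test(violators, population)
res$statistic
res$parameter
res$p.value
```

This gives the same statistic and degrees of freedom as the output above, without depending on the scraped page being available or unchanged.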