The data visualizations show patterns in the number of parking violations by month and by weekday. As a first step in examining how month and weekday are associated with the frequency of parking violations, we run a one-way ANOVA of daily violation counts across months.
\(H_0\): The mean daily number of parking violations is the same across months
\(H_1\): The mean daily number of parking violations differs for at least one month
# Count violations per day, keeping month and weekday as labels
fit_df = violation %>%
  mutate(month = as.factor(month)) %>%
  group_by(month, weekday, day) %>%
  summarize(n_obs = n(), .groups = "drop")
# One-way ANOVA of daily counts on month
fit_model_month = lm(n_obs ~ month, data = fit_df)
anova(fit_model_month) %>%
  knitr::kable(caption = "One way anova of violation frequency and month")
Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
month | 9 | 494640293 | 54960033 | 9.662275 | 0 |
Residuals | 294 | 1672302865 | 5688105 | NA | NA |
The p-value in the above test is very small (it is displayed as 0 after rounding), so we reject the null hypothesis that the mean number of parking violations is constant across months. There is evidence that the average daily number of parking violations varies by month.
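As a quick check on the table, the F statistic can be reproduced by hand from the reported sums of squares and degrees of freedom. The symbols \(MS_{month}\) and \(MS_{res}\) are shorthand introduced here for the two mean squares:

\[
MS_{month} = \frac{494640293}{9} \approx 54960033, \qquad
MS_{res} = \frac{1672302865}{294} \approx 5688105, \qquad
F = \frac{MS_{month}}{MS_{res}} \approx 9.66
\]

This value is compared against an F distribution with 9 and 294 degrees of freedom.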
We hypothesized that the proportions of tickets among the top five violation types in NYC do not differ across the five boroughs. We use a chi-square test to check this.
\(H_0:\) the proportions of tickets in each violation type category are the same across all boroughs.
\(H_1:\) the proportions of tickets in each violation type category are not the same across all boroughs.
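For reference, the Pearson statistic behind this test can be written as follows. The notation is introduced here for illustration: \(O_{ij}\) is the observed count for borough \(i\) and violation type \(j\), \(R_i\) and \(C_j\) are the row and column totals, and \(N\) is the grand total.

\[
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{5} \sum_{j=1}^{5} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
df = (5 - 1)(5 - 1) = 16
\]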
## 1) No parking-street cleaning, 2) Failure to display a muni-meter receipt, 3) Inspection sticker-expired/missing, 4) Registration sticker-expired/missing, and 5) Fire hydrant.
five_common_violation =
  violation1 %>%
  select(borough, violation) %>%
  filter(violation %in%
           c("NO PARKING-STREET CLEANING",
             "FAIL TO DSPLY MUNI METER RECPT",
             "INSP. STICKER-EXPIRED/MISSING",
             "REG. STICKER-EXPIRED/MISSING",
             "FIRE HYDRANT")) %>%
  count(violation, borough) %>%
  pivot_wider(
    names_from = "violation",
    values_from = "n"
  ) %>%
  # Take row names from the data instead of hard-coding the borough order
  column_to_rownames(var = "borough") %>%
  as.matrix()
five_common_violation %>%
  knitr::kable(caption = "Results Table")
Borough | FAIL TO DSPLY MUNI METER RECPT | FIRE HYDRANT | INSP. STICKER-EXPIRED/MISSING | NO PARKING-STREET CLEANING | REG. STICKER-EXPIRED/MISSING |
---|---|---|---|---|---|
Bronx | 24005 | 32407 | 44436 | 46874 | 35410 |
Brooklyn | 49347 | 43903 | 67654 | 101494 | 51038 |
Manhattan | 49902 | 31794 | 40856 | 43963 | 35280 |
Queens | 72767 | 31610 | 61256 | 60778 | 56589 |
Staten Island | 6025 | 1946 | 15857 | 16 | 13492 |
chisq.test(five_common_violation)
##
## Pearson's Chi-squared test
##
## data: five_common_violation
## X-squared = 53839, df = 16, p-value < 2.2e-16
x_crit = qchisq(0.95, 16)
x_crit
## [1] 26.29623
Interpretation: The test statistic \(\chi^2 = 53839\) exceeds the critical value \(\chi^2_{crit} = 26.30\) at significance level \(\alpha = 0.05\), so we reject the null hypothesis and conclude that at least one borough's distribution of tickets across the five violation types differs from the others.
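Rejecting the null hypothesis does not say which cells drive the difference. One way to look into this, sketched below, is to inspect the expected counts and standardized residuals that `chisq.test()` returns. The matrix is retyped from the results table above, and the object names `obs`, `res`, and `largest` are illustrative rather than part of the original analysis.

```r
# Observed counts copied from the results table (rows: boroughs, columns: violation types)
obs <- matrix(
  c(24005, 32407, 44436,  46874, 35410,
    49347, 43903, 67654, 101494, 51038,
    49902, 31794, 40856,  43963, 35280,
    72767, 31610, 61256,  60778, 56589,
     6025,  1946, 15857,     16, 13492),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    violation = c("FAIL TO DSPLY MUNI METER RECPT", "FIRE HYDRANT",
                  "INSP. STICKER-EXPIRED/MISSING", "NO PARKING-STREET CLEANING",
                  "REG. STICKER-EXPIRED/MISSING")
  )
)

res <- chisq.test(obs)

# Expected counts under H0 (row total * column total / grand total)
round(res$expected)

# Standardized residuals: large absolute values flag cells far from H0
round(res$stdres, 1)

# Row/column position of the cell with the largest absolute standardized residual
largest <- which(abs(res$stdres) == max(abs(res$stdres)), arr.ind = TRUE)
largest
```

Cells with large positive residuals have more tickets than expected under equal proportions, and cells with large negative residuals have fewer. The Staten Island street-cleaning cell (16 tickets) is a likely candidate for the latter.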
Now we want to see whether receiving a fire hydrant violation is equally common among the residents of each borough. To do this, we conduct a test of equal proportions.
We derived the population of each borough from the most recent census.
First, we will assume:
Each car has only one driver;
Each car receives at most one fire hydrant violation in 2021.
\(H_0:\) The proportion of individuals who received a fire hydrant violation is the same across all boroughs.
\(H_1:\) The proportion of individuals who received a fire hydrant violation is not the same across all boroughs.
library(rvest)
# Scrape borough names and populations from citypopulation.de
url = "https://www.citypopulation.de/en/usa/newyorkcity/"
nyc_population_html = read_html(url)
population = nyc_population_html %>%
  html_elements(".rname .prio1") %>%
  html_text()
boro = nyc_population_html %>%
  html_elements(".rname a span") %>%
  html_text()
nyc_population = tibble(
  borough = boro,
  population = population %>% str_remove_all(",") %>% as.numeric()
)
# Count distinct plates with a fire hydrant violation in each borough
fire_hydrant = violation1 %>%
  select(borough, violation, plate) %>%
  filter(violation == "FIRE HYDRANT") %>%
  distinct(plate, borough) %>%
  count(borough)
# Join on borough explicitly rather than relying on the default key guess
boro_population = left_join(fire_hydrant, nyc_population, by = "borough")
boro_population %>%
  knitr::kable(caption = "Results Table")
borough | n | population |
---|---|---|
Bronx | 18432 | 1472654 |
Brooklyn | 28124 | 2736074 |
Manhattan | 20809 | 1694251 |
Queens | 21832 | 2405464 |
Staten Island | 1611 | 495747 |
prop.test(boro_population$n, boro_population$population)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: boro_population$n out of boro_population$population
## X-squared = 4127.7, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.012516178 0.010278962 0.012282123 0.009076004 0.003249641
From the above results, the p-value is very small (\(< 2.2 \times 10^{-16}\)), so we reject the null hypothesis and conclude that the proportion of people receiving fire hydrant violations differs across boroughs.
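The overall test says only that the five proportions are not all equal. A possible follow-up, sketched here under the same assumptions as above, expresses each borough's proportion as a rate per 1,000 residents and runs pairwise comparisons with a Holm adjustment using `pairwise.prop.test()` from base R. The counts are retyped from the results table, and the names `n_plates`, `population`, `rate_per_1000`, and `pw` are illustrative.

```r
# Distinct ticketed plates and populations copied from the results table
n_plates <- c(Bronx = 18432, Brooklyn = 28124, Manhattan = 20809,
              Queens = 21832, `Staten Island` = 1611)
population <- c(Bronx = 1472654, Brooklyn = 2736074, Manhattan = 1694251,
                Queens = 2405464, `Staten Island` = 495747)

# Fire hydrant violations per 1,000 residents in each borough
rate_per_1000 <- 1000 * n_plates / population
round(rate_per_1000, 2)

# Pairwise tests of equal proportions, with Holm correction for multiple comparisons
pw <- pairwise.prop.test(n_plates, population, p.adjust.method = "holm")
pw$p.value
```

The rates make the size of the differences easier to read than the raw proportions. Staten Island is far below the other boroughs, while the Bronx and Manhattan rates are close to each other, so that pair may not differ significantly even though the overall test rejects.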