In the period of 1991 to 2017, housing quality in New York has improved dramatically; however, some sectors of the housing stock continue to face poor conditions and some specific maintenance deficiencies continue to show higher prevalence. In this project, we develop an index that presents poor qualtity of housing in New York by measuring the physical deficiencies to show how the prevalence of these issues has shifted over time.
The index measures weighted sums of interactions between 22 variables that the authors chose. The selected variables were chosen if the authors agreed they described poor housing conditions. The index is not exhaustive, and potentially more data could be collected to better suit our purpose.
Item | Description | NYCHVS Variable | Score |
---|---|---|---|
1 | Exterior Walls: Missing brick, sliding or other | d1 | 2 |
2 | Exterior Walls: Sloping or bulgin walls | d2 | 2 |
3 | Exterior walls: Major Cracks | d3 | 2 |
4 | Exterior Walls: Loose or hanging corvice, roof, etc. | d4 | 2 |
5 | Interior Walls: Cracks or holes | 36a | 2 |
6 | Interior Walls: Broken plaster or peeling paint | 37a | 2 |
7 | Broken or missing windows | e1 | 5 |
8 | Rotten or loose windows | e2 | 2 |
9 | Boarded up windows | e3 | 3 |
10 | Sagging or sloping floors | g1 | 2 |
11 | Slanted/shifted doorsills or frames | g2 | 2 |
12 | Deep wear in floor causing depressions | g3 | 2 |
13 | Holes or missing flooring | g4 | 2 |
14 | Stairs: Loose, broken, or missing stair | f1 | 2 |
15 | Stairs: Loose, broken, or missing setps | f2 | 2 |
16 | No interior steps or stairways | f4 | 2 |
17 | No exterior steps or stairways | f5 | 2 |
18 | Number of heating equipment breakdowns | 32b | 2 per break down |
19 | Kitchen facilities fucntioning | 26c | 3 if no, 5 if no kitchen facilities |
20 | Toilet Breakdowns | 25c | 3 if any, 5 if no toliet or plumbing |
21 | Presence of mice or rats | 35a | 3 |
22 | Water Leakage | 38a | 3 |
Figure 1 shows the poor quality index scores for the 156,230 occupied units in the New York Housing Dataset from 1991 to 2017. The frequency distribution is skewed to the right. Overall, fourty five percent of the units were scored 0. The highest score was in 1993 with 54 points. 2008 had the highest percent(64%) of units that has 0 poor quality scores.
Figure 2 shows percent the percent of ccupied units with poor quality scores. Over the period of 1991 to 2017, most of the units has poor quality scores between 1 and 10 points; very little units that has the poor quality scroes over 20 points.
Figure 3 tracks trends in poor quality index scores during the period of 1991 to 2017. We decided to report the means, medians, 75th percentiles, 95th percentiles, and 99th percentiles. In most of the years, the median had the poor quality scores of 0. The mean ranged from 4.0 in 1991 to 2.5 in 2017. The 99th percentiles clearly show the improvement of housing in New York( from 25 poor quality points in 1991 to 18 porr quality points in 2017)
Figure 4 shows the poor condition of housing in five different boroughs in New York city in the period of 1991 to 2017. Overall, all five boroughs had an improvement of the house quality. Bronx had the worse housing condition and Stalen Island had the best housing condition.
## OGR data source with driver: GeoJSON
## Source: "/Users/thienngole/Desktop/MSU/10-MSU-Spring-2019/MTH390Q-DataScience/project/NY-Housing-Data/Community Districts.geojson", layer: "OGRGeoJSON"
## with 71 features
## It has 3 fields
For multiple regression model we built a model using our poor quality index(pqi) as a response variable and household income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) as the explanatory variables.
Our multiple regression model:
Y = 6.876 + X1(-3.963e-06) + X2(-1.855e-04) + X3(-2.279e-03) + X4(8.665e-04) + X5(-2.082e-03)
#Preprocess the data
data_part2 <- read.csv("all_data_v2.csv", stringsAsFactors = F)
data91_99 <- filter(data_part2, year == 91 | year == 93 | year == 96 | year == 99 )
data02_17 <- filter(data_part2, year == 2002 | year == 2005 | year == 2008 | year == 2011 | year == 2014| year == 2017)
# Remove 999998 = Not Reported *** exist only in 1991
data91_99 <- filter(data91_99, hhinc != 999998)
# replace 999999 and 9999999 = No Income to 0
data91_99$hhinc <- ifelse(data91_99$hhinc == 999999, 0 , data91_99$hhinc)
data02_17$hhinc <- ifelse(data02_17$hhinc == 9999999, 0 , data02_17$hhinc)
# concatenate the 2 dataframe back
data_part2 <- rbind(data91_99, data02_17)
data_part2$X_28c[data_part2$X_28c == 9999] <- NA # Not applicable (owner occupied or gas and electric not combined)
data_part2$X_28d2[data_part2$X_28d2 == 9999] <- NA # Not applicable (owner occupied, included in rent, condo, or other fee, or no charge)
data_part2$X_30a[data_part2$X_30a == 99999] <- NA # Not applicable (occupied rent free or owner occupied)
data_part2$X_30a[data_part2$X_30a == 99998] <- NA # Not reported
data_part2$mgrent[data_part2$mgrent == 9998] <- NA # Not reported
data_part2$mgrent[data_part2$mgrent == 9999] <- NA # Not applicable (occupied rent free or owner occupied)
data_part2$mgrent[data_part2$mgrent == 99999] <- NA # Not applicable (occupied rent free or owner occupied)
my.reg <- lm(pqi ~ hhinc + year + X_28c + X_28d2 + X_30a, data = data_part2)
summary(my.reg)
##
## Call:
## lm(formula = pqi ~ hhinc + year + X_28c + X_28d2 + X_30a, data = data_part2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.605 -3.958 -1.760 2.228 49.013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.883e+00 3.219e-01 21.379 < 2e-16 ***
## hhinc -3.963e-06 6.425e-07 -6.169 6.92e-10 ***
## year -1.856e-04 3.288e-05 -5.644 1.67e-08 ***
## X_28c -2.279e-03 6.696e-05 -34.027 < 2e-16 ***
## X_28d2 8.592e-04 3.156e-04 2.723 0.00647 **
## X_30a -2.081e-03 6.338e-05 -32.841 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.657 on 50716 degrees of freedom
## (101591 observations deleted due to missingness)
## Multiple R-squared: 0.0559, Adjusted R-squared: 0.0558
## F-statistic: 600.5 on 5 and 50716 DF, p-value: < 2.2e-16
As income was not included in the index we might consider modeling the relationship between income and housing quality. However, as already discussed the relationship between these variables is difficult. We will instead investigate the relationship between quantiles of household income and poor quality index by subborough. We will model the 5th percentile of household income using year and the 95th percentile of PQI. This is a naive model and not a proper qunatile regression which we plan to employ in the future.
data_part2[which(data_part2$hhinc==9999999),"hhinc"] <- NA
data_part2[data_part2$year<100,"year"]<-data_part2[data_part2$year<100,"year"] + 1900
data_part2 %>% group_by(sba, year) %>% summarise(pqi_95 = quantile(pqi,.95),
hhinc_05 = quantile(hhinc, .05,na.rm=T)) -> bmod
head(bmod)
## # A tibble: 6 x 4
## # Groups: sba [1]
## sba year pqi_95 hhinc_05
## <int> <dbl> <dbl> <dbl>
## 1 101 1991 20.7 1547
## 2 101 1993 17 2060
## 3 101 1996 21 1964.
## 4 101 1999 19 432.
## 5 101 2002 16 2953.
## 6 101 2005 13.0 1776
ggplot(bmod, aes(pqi_95, hhinc_05, col=year)) + geom_point(position="jitter") + geom_smooth(method="lm") -> lm_plot
lm_plot %>% ggplotly()
It seems clear that time should factor in to the model. However, it is not clear if we should include and interaction term.
ggplot(bmod, aes(pqi_95, hhinc_05, col=year)) + geom_point(position="jitter") + geom_smooth(method="lm") -> lm_plot
ggplot(bmod, aes(year, hhinc_05)) + geom_point(position="jitter") -> year_plot
ggplot(bmod, aes(pqi_95*year, hhinc_05)) + geom_point(position="jitter") -> int_plot
ggplot(bmod,(aes(year,pqi_95))) + geom_point(position = "jitter") -> pqi_year
grid.arrange(year_plot, int_plot, pqi_year)
The plots seem to show that the interaction is overwhelmed by the PQI, which wold result in multicollinearity issues.
Let’s look at the models
lm(data=bmod, hhinc_05 ~ pqi_95 * year) -> bfit
coef(bfit)
## (Intercept) pqi_95 year pqi_95:year
## -294415.48671 20851.11045 151.12694 -10.57636
The coefficient for PQI doesnt make much sense, being positive, looking at the graph above it should be negative. The interaction term is negative, but this likely a multicolinearity.
lm(data=bmod, hhinc_05 ~ pqi_95 + year) -> bfit2
coef(bfit2)
## (Intercept) pqi_95 year
## -44955.83224 -304.57814 26.47481
This model has more sensible coefficients, but the \(R^2\) dropped a little bit.
Checking Multicollinearity
vif(bfit)
## pqi_95 year pqi_95:year
## 59630.799920 7.207624 59294.638215
vif(bfit2)
## pqi_95 year
## 1.087208 1.087208
It is clear that the first model has multicolinearity with the addtion of an interaction term. Making the coefficients uninterpretable. We choose the second additive model instead.
summary(bfit2)
##
## Call:
## lm(formula = hhinc_05 ~ pqi_95 + year, data = bmod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6682.7 -1721.1 61.8 1600.2 10983.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -44955.83 26548.38 -1.693 0.0910 .
## pqi_95 -304.58 22.72 -13.408 <2e-16 ***
## year 26.47 13.21 2.004 0.0456 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2515 on 547 degrees of freedom
## Multiple R-squared: 0.2834, Adjusted R-squared: 0.2808
## F-statistic: 108.2 on 2 and 547 DF, p-value: < 2.2e-16
We estimate that for each unit increase in the 95th percentile of a given subborough will correlate with an average decrease of 333 dollars in the 5th percentile of houshold inocme for the same suborough.
We estimate that for each increase in year the 5th percentile of househould within a given subborough increases by 1 dollar
Notes: It is clear that after cheking basic assumptions of normality are violated. This is likely due to the fact that the model is using quantiles as a response requiring further work to implement a proper quantile regression.
For logistic regression model we built a model using the Toilet Brokedowns(X_25c) as a response variable and household income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) as the explanatory variables.
Our logistic regression model:
Y = 1.672e-01 + X1(-1.673e-07) + X2(-2.845e-05) + X3(6.521e-06) + X4(-3.623e-05)
data_part2$X_25c[data_part2$X_25c == 2] <- 0 # NO toilet breakdown
data_part2$X_25c[data_part2$X_25c == 3] <- NA # NO toilet in the apartment
data_part2$X_25c[data_part2$X_25c == 8] <- NA # Not reported
data_part2$X_25c[data_part2$X_25c == 9] <- NA # Not applicable (No plumbing facilities)
my.logreg <- glm(X_25c ~ hhinc + X_28c + X_28d2 + X_30a, data = data_part2)
summary(my.logreg)
##
## Call:
## glm(formula = X_25c ~ hhinc + X_28c + X_28d2 + X_30a, data = data_part2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.1719 -0.1325 -0.1201 -0.1054 1.1028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.671e-01 1.928e-02 8.668 < 2e-16 ***
## hhinc -1.673e-07 4.023e-08 -4.159 3.20e-05 ***
## X_28c -2.844e-05 3.960e-06 -7.181 7.03e-13 ***
## X_28d2 6.579e-06 1.891e-05 0.348 0.728
## X_30a -3.625e-05 3.982e-06 -9.103 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1066699)
##
## Null deviance: 4887.2 on 45586 degrees of freedom
## Residual deviance: 4862.2 on 45582 degrees of freedom
## (106726 observations deleted due to missingness)
## AIC: 27353
##
## Number of Fisher Scoring iterations: 2
For machine learning procedure, we decided to build a k nearest neighbors model that has borough as a response variable and poor quality index(pqi), household income(hhinc), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) as the explanatory variables.
1
=“Bronx”, 2
=“Brooklyn”, 3
=“Manhattan”, 4
=“Queens”, 5
=“Staten Island”
library(dplyr)
library(class)
# Change this value to try different k:
ktry <- 2
train_data <- select(na.omit(data_part2), c(borough, pqi, hhinc, X_28c, X_28d2, X_30a))
#train_data <- na.omit(train_data)
my.knn <- knn(train = train_data,
test = train_data,
cl = train_data$borough,
k = ktry)
borough <- data.frame(Actual = train_data$borough, Predicted = my.knn)
confusion <- table(borough)
confusion
## Predicted
## Actual 1 2 3 4 5
## 1 5523 1078 1116 551 52
## 2 862 10313 1264 1303 157
## 3 1107 1383 8943 1105 96
## 4 560 1383 1035 6317 167
## 5 55 185 109 243 574
Correct classification rate
sum(diag(confusion)) / nrow(data_part2)
## [1] 0.2079271
For our hierarchy clustering, we decided to use our Poor Quality Index(pqi), Household Income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) and Monthly Gross Rent (mgrent). The hierarchy clustering includes only the households of tenants.
hc_data <- select(data_part2, c(pqi, hhinc, X_28c, X_28d2, X_30a, mgrent))
hc_data$X_28c[hc_data$X_28c == 999] <- NA
hc_data$X_28c[hc_data$X_28c == 998] <- NA
hc_data$X_28d2[hc_data$X_28d2 == 999] <- NA
hc_data$X_28d2[hc_data$X_28d2 == 998] <- NA
hc_data <- na.omit(scale(hc_data)) %>% data.frame()
We chose to scale the data to get better comparison among the diferrent varaibles we selected.
library(ape)
arr_dist <- dist(hc_data, method = "euclidean")
arr_clust <- hclust(arr_dist)
arr_tree <- as.phylo(arr_clust)
plot(arr_tree, cex = 0.5)
summary(arr_tree)
##
## Phylogenetic tree: arr_tree
##
## Number of tips: 251
## Number of nodes: 250
## Branch lengths:
## mean: 0.3768499
## variance: 0.2445826
## distribution summary:
## Min. 1st Qu. Median 3rd Qu. Max.
## 0.008640528 0.167880872 0.255955929 0.431429115 6.724070149
## No root edge.
## First ten tip labels: 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## No node labels.
The results are not very interpretable. Why? In short, the data is too finely cut, we have no node labels to interpret from, but intheory the nodes represent different housing scenarios based off of our selected input. However, it’s hard to make any remakrs off the analysis considering the complexity of housing in a place like NYC. Further, its likely, but not confirmed by us, that many of the numerical variables we chose(income, and monthly costs/rent) are highly confounded and result in overly similar clusters that are hard to make any distinguishments. Instead we should aggregate the data to a coarser level with labels, e.g., sub-borough area, and consider variables that have unknown relationships(i.e., not confounded). As final note the data for both analyses was cut down to only 251 houses, with borough disregarded, which likely would not give the best description of housing for the whole city. As final note the data was cut down to only 251 houses, with borough disregarded, which likely would not give the best description of housing for the whole city.
For our k-means clustering, we used the same variables as our hierarchy clustering, which is our Poor Quality Index(pqi), Household Income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) and Monthly Gross Rent (mgrent). The k-means clustering also includes only the households of tenants.
k_means_data <- select(data_part2, c(pqi, hhinc, X_28c, X_28d2, X_30a, mgrent))
k_means_data$X_28c[k_means_data$X_28c == 999] <- NA
k_means_data$X_28c[k_means_data$X_28c == 998] <- NA
k_means_data$X_28d2[k_means_data$X_28d2 == 999] <- NA
k_means_data$X_28d2[k_means_data$X_28d2 == 998] <- NA
k_means_data <- na.omit(scale(k_means_data)) %>% data.frame()
set.seed(25)
arr_clust <- kmeans(k_means_data, 3)
arr_clust
## K-means clustering with 3 clusters of sizes 7, 165, 79
##
## Cluster means:
## pqi hhinc X_28c X_28d2 X_30a mgrent
## 1 0.1802038 2.5240827 3.2346576 -0.20458301 5.3531725 5.8022960
## 2 0.2988552 -0.1571351 -0.1715836 -0.31238999 -0.2371015 -0.1373023
## 3 -0.1606853 0.1974197 1.7039213 0.09122733 1.0345868 1.4274657
##
## Clustering vector:
## [1] 2 3 2 3 2 3 2 3 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
## [36] 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
## [71] 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 1
## [106] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 3 2 3 2 3 3 3 2 2 2 2 2 2 3 2
## [141] 3 3 2 3 2 2 3 2 3 2 3 3 2 2 3 3 3 3 2 2 1 2 2 2 2 2 2 3 2 2 1 3 2 2 2
## [176] 3 3 3 2 2 2 2 3 2 2 3 3 3 2 3 2 2 2 2 3 2 3 3 3 2 3 2 3 3 2 2 2 2 3 3
## [211] 2 3 3 3 3 3 3 3 3 3 3 3 1 1 2 2 3 1 3 3 1 2 3 2 3 3 3 2 3 3 2 2 3 3 3
## [246] 3 3 2 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 214.0125 469.7071 439.8295
## (between_SS / total_SS = 44.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
k_means_data %>% mutate(cluster = arr_clust$cluster)%>%
group_by(cluster) %>%
summarise(num_cluster = n())
## # A tibble: 3 x 2
## cluster num_cluster
## <int> <int>
## 1 1 7
## 2 2 165
## 3 3 79
This confirms how poorly kmeans, and clustering performs on the data at a unit level. Trying differnt seeds shows a huge variation in cluster sizes. However, one of the clusters is consitiintely small which is interesting, but this could be outliers, e.g., very high income earners out of a small sample(n=251). This just reafirms some of the comments made above. Further analysis is needed and a better data aggregation needs to be implemented. It would be interesting to aggregate by subborough and perfor kmeans with 5 clusters to explore how “similar” subboroghs are within the same borough.
Ultimately, we did not arrive at a method to test our index. However the authors believe the index should be validated against a variable indicative of quality, but not measured in the index. It is in future plans to find data to perfom such a validation test. Potential variables were omitted due to the fact they only had data for recent years. Whether a unit has functioning air conditioning was only measured during the years 2014 and 2017. In measuring housing quality this variable would have been useful. The authors’ have chosen not to include such datas it may inflate the index scores for later years. However, there are plans to create strong indexes for the recent years.
In this paper we have created a housing quality index that measures poor housing conditions. We remark that housing conditions have been slowly improving over time, particularly among units with high index values. Our goal was to measure hosuing quality and our proposed index specificaly measures poor housing conditions rather than just quality. We believe it would be benefecial to creat several indexes concering qualilty of hosuing e.g., High Quality Index, Neighborhood Quality Index and to consider all such indexes when considering hosuing quality. We also reccomend further exploration of the spatial component of the data to see if things such as crime, location, and the index value related.