PART 1: Preliminary analysis

Introduction

In the period of 1991 to 2017, housing quality in New York has improved dramatically; however, some sectors of the housing stock continue to face poor conditions and some specific maintenance deficiencies continue to show higher prevalence. In this project, we develop an index that presents poor qualtity of housing in New York by measuring the physical deficiencies to show how the prevalence of these issues has shifted over time.

Methodology

The index measures weighted sums of interactions between 22 variables that the authors chose. The selected variables were chosen if the authors agreed they described poor housing conditions. The index is not exhaustive, and potentially more data could be collected to better suit our purpose.

Item Description NYCHVS Variable Score
1 Exterior Walls: Missing brick, sliding or other d1 2
2 Exterior Walls: Sloping or bulgin walls d2 2
3 Exterior walls: Major Cracks d3 2
4 Exterior Walls: Loose or hanging corvice, roof, etc. d4 2
5 Interior Walls: Cracks or holes 36a 2
6 Interior Walls: Broken plaster or peeling paint 37a 2
7 Broken or missing windows e1 5
8 Rotten or loose windows e2 2
9 Boarded up windows e3 3
10 Sagging or sloping floors g1 2
11 Slanted/shifted doorsills or frames g2 2
12 Deep wear in floor causing depressions g3 2
13 Holes or missing flooring g4 2
14 Stairs: Loose, broken, or missing stair f1 2
15 Stairs: Loose, broken, or missing setps f2 2
16 No interior steps or stairways f4 2
17 No exterior steps or stairways f5 2
18 Number of heating equipment breakdowns 32b 2 per break down
19 Kitchen facilities fucntioning 26c 3 if no, 5 if no kitchen facilities
20 Toilet Breakdowns 25c 3 if any, 5 if no toliet or plumbing
21 Presence of mice or rats 35a 3
22 Water Leakage 38a 3

Visualization

Figure 1 shows the poor quality index scores for the 156,230 occupied units in the New York Housing Dataset from 1991 to 2017. The frequency distribution is skewed to the right. Overall, fourty five percent of the units were scored 0. The highest score was in 1993 with 54 points. 2008 had the highest percent(64%) of units that has 0 poor quality scores.

Figure 2 shows percent the percent of ccupied units with poor quality scores. Over the period of 1991 to 2017, most of the units has poor quality scores between 1 and 10 points; very little units that has the poor quality scroes over 20 points.

Figure 3 tracks trends in poor quality index scores during the period of 1991 to 2017. We decided to report the means, medians, 75th percentiles, 95th percentiles, and 99th percentiles. In most of the years, the median had the poor quality scores of 0. The mean ranged from 4.0 in 1991 to 2.5 in 2017. The 99th percentiles clearly show the improvement of housing in New York( from 25 poor quality points in 1991 to 18 porr quality points in 2017)

Figure 4 shows the poor condition of housing in five different boroughs in New York city in the period of 1991 to 2017. Overall, all five boroughs had an improvement of the house quality. Bronx had the worse housing condition and Stalen Island had the best housing condition.

Figure 5: Average Household Income and Index by Sub-borough

## OGR data source with driver: GeoJSON 
## Source: "/Users/thienngole/Desktop/MSU/10-MSU-Spring-2019/MTH390Q-DataScience/project/NY-Housing-Data/Community Districts.geojson", layer: "OGRGeoJSON"
## with 71 features
## It has 3 fields

PART 2: Statistical analyses

Multiple regression analyses

For multiple regression model we built a model using our poor quality index(pqi) as a response variable and household income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) as the explanatory variables.

Our multiple regression model:

Y = 6.876 + X1(-3.963e-06) + X2(-1.855e-04) + X3(-2.279e-03) + X4(8.665e-04) + X5(-2.082e-03)

#Preprocess the data
data_part2 <- read.csv("all_data_v2.csv", stringsAsFactors = F)

data91_99 <- filter(data_part2, year == 91 | year == 93 | year == 96 | year == 99 )
data02_17 <- filter(data_part2, year == 2002 | year == 2005 | year == 2008 | year == 2011 | year == 2014| year == 2017)

# Remove 999998 = Not Reported *** exist only in 1991
data91_99 <- filter(data91_99, hhinc != 999998)

# replace 999999 and 9999999  = No Income to 0
data91_99$hhinc <- ifelse(data91_99$hhinc == 999999, 0 , data91_99$hhinc)
data02_17$hhinc <- ifelse(data02_17$hhinc == 9999999, 0 , data02_17$hhinc)

# concatenate the 2 dataframe back
data_part2 <- rbind(data91_99, data02_17)
data_part2$X_28c[data_part2$X_28c == 9999] <- NA # Not applicable (owner occupied or gas and electric not combined)
data_part2$X_28d2[data_part2$X_28d2 == 9999] <- NA # Not applicable (owner occupied, included in rent, condo, or other fee, or no charge)
data_part2$X_30a[data_part2$X_30a == 99999] <- NA # Not applicable (occupied rent free or owner occupied)
data_part2$X_30a[data_part2$X_30a == 99998] <- NA # Not reported

data_part2$mgrent[data_part2$mgrent == 9998] <- NA # Not reported
data_part2$mgrent[data_part2$mgrent == 9999] <- NA # Not applicable (occupied rent free or owner occupied)
data_part2$mgrent[data_part2$mgrent == 99999] <- NA # Not applicable  (occupied rent free or owner occupied)
my.reg <- lm(pqi ~ hhinc + year + X_28c + X_28d2 + X_30a, data = data_part2)
summary(my.reg)
## 
## Call:
## lm(formula = pqi ~ hhinc + year + X_28c + X_28d2 + X_30a, data = data_part2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.605 -3.958 -1.760  2.228 49.013 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.883e+00  3.219e-01  21.379  < 2e-16 ***
## hhinc       -3.963e-06  6.425e-07  -6.169 6.92e-10 ***
## year        -1.856e-04  3.288e-05  -5.644 1.67e-08 ***
## X_28c       -2.279e-03  6.696e-05 -34.027  < 2e-16 ***
## X_28d2       8.592e-04  3.156e-04   2.723  0.00647 ** 
## X_30a       -2.081e-03  6.338e-05 -32.841  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.657 on 50716 degrees of freedom
##   (101591 observations deleted due to missingness)
## Multiple R-squared:  0.0559, Adjusted R-squared:  0.0558 
## F-statistic: 600.5 on 5 and 50716 DF,  p-value: < 2.2e-16

Second linear model

As income was not included in the index we might consider modeling the relationship between income and housing quality. However, as already discussed the relationship between these variables is difficult. We will instead investigate the relationship between quantiles of household income and poor quality index by subborough. We will model the 5th percentile of household income using year and the 95th percentile of PQI. This is a naive model and not a proper qunatile regression which we plan to employ in the future.

data_part2[which(data_part2$hhinc==9999999),"hhinc"] <- NA
data_part2[data_part2$year<100,"year"]<-data_part2[data_part2$year<100,"year"] + 1900
data_part2 %>% group_by(sba, year) %>% summarise(pqi_95 = quantile(pqi,.95), 
                                               hhinc_05 = quantile(hhinc, .05,na.rm=T)) -> bmod
head(bmod)
## # A tibble: 6 x 4
## # Groups:   sba [1]
##     sba  year pqi_95 hhinc_05
##   <int> <dbl>  <dbl>    <dbl>
## 1   101  1991   20.7    1547 
## 2   101  1993   17      2060 
## 3   101  1996   21      1964.
## 4   101  1999   19       432.
## 5   101  2002   16      2953.
## 6   101  2005   13.0    1776
ggplot(bmod, aes(pqi_95, hhinc_05, col=year)) + geom_point(position="jitter") + geom_smooth(method="lm") -> lm_plot
lm_plot %>% ggplotly()

It seems clear that time should factor in to the model. However, it is not clear if we should include and interaction term.

ggplot(bmod, aes(pqi_95, hhinc_05, col=year)) + geom_point(position="jitter") + geom_smooth(method="lm") -> lm_plot
ggplot(bmod, aes(year, hhinc_05)) + geom_point(position="jitter") -> year_plot
ggplot(bmod, aes(pqi_95*year, hhinc_05)) + geom_point(position="jitter") -> int_plot 
ggplot(bmod,(aes(year,pqi_95))) + geom_point(position = "jitter") -> pqi_year
grid.arrange(year_plot, int_plot, pqi_year)

The plots seem to show that the interaction is overwhelmed by the PQI, which wold result in multicollinearity issues.

Let’s look at the models

lm(data=bmod, hhinc_05 ~ pqi_95 * year) -> bfit
coef(bfit)
##   (Intercept)        pqi_95          year   pqi_95:year 
## -294415.48671   20851.11045     151.12694     -10.57636

The coefficient for PQI doesnt make much sense, being positive, looking at the graph above it should be negative. The interaction term is negative, but this likely a multicolinearity.

lm(data=bmod, hhinc_05 ~ pqi_95 + year) -> bfit2
coef(bfit2)
##  (Intercept)       pqi_95         year 
## -44955.83224   -304.57814     26.47481

This model has more sensible coefficients, but the \(R^2\) dropped a little bit.

Checking Multicollinearity

vif(bfit)
##       pqi_95         year  pqi_95:year 
## 59630.799920     7.207624 59294.638215
vif(bfit2)
##   pqi_95     year 
## 1.087208 1.087208

It is clear that the first model has multicolinearity with the addtion of an interaction term. Making the coefficients uninterpretable. We choose the second additive model instead.

summary(bfit2)
## 
## Call:
## lm(formula = hhinc_05 ~ pqi_95 + year, data = bmod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6682.7 -1721.1    61.8  1600.2 10983.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -44955.83   26548.38  -1.693   0.0910 .  
## pqi_95        -304.58      22.72 -13.408   <2e-16 ***
## year            26.47      13.21   2.004   0.0456 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2515 on 547 degrees of freedom
## Multiple R-squared:  0.2834, Adjusted R-squared:  0.2808 
## F-statistic: 108.2 on 2 and 547 DF,  p-value: < 2.2e-16

We estimate that for each unit increase in the 95th percentile of a given subborough will correlate with an average decrease of 333 dollars in the 5th percentile of houshold inocme for the same suborough.

We estimate that for each increase in year the 5th percentile of househould within a given subborough increases by 1 dollar

Notes: It is clear that after cheking basic assumptions of normality are violated. This is likely due to the fact that the model is using quantiles as a response requiring further work to implement a proper quantile regression.

Logistic regression analysis

For logistic regression model we built a model using the Toilet Brokedowns(X_25c) as a response variable and household income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) as the explanatory variables.

Our logistic regression model:

Y = 1.672e-01 + X1(-1.673e-07) + X2(-2.845e-05) + X3(6.521e-06) + X4(-3.623e-05)

data_part2$X_25c[data_part2$X_25c == 2] <- 0  # NO toilet breakdown
data_part2$X_25c[data_part2$X_25c == 3] <- NA # NO toilet in the apartment
data_part2$X_25c[data_part2$X_25c == 8] <- NA # Not reported
data_part2$X_25c[data_part2$X_25c == 9] <- NA # Not applicable (No plumbing facilities)
my.logreg <- glm(X_25c ~ hhinc + X_28c + X_28d2 + X_30a, data = data_part2)
summary(my.logreg)
## 
## Call:
## glm(formula = X_25c ~ hhinc + X_28c + X_28d2 + X_30a, data = data_part2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.1719  -0.1325  -0.1201  -0.1054   1.1028  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.671e-01  1.928e-02   8.668  < 2e-16 ***
## hhinc       -1.673e-07  4.023e-08  -4.159 3.20e-05 ***
## X_28c       -2.844e-05  3.960e-06  -7.181 7.03e-13 ***
## X_28d2       6.579e-06  1.891e-05   0.348    0.728    
## X_30a       -3.625e-05  3.982e-06  -9.103  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1066699)
## 
##     Null deviance: 4887.2  on 45586  degrees of freedom
## Residual deviance: 4862.2  on 45582  degrees of freedom
##   (106726 observations deleted due to missingness)
## AIC: 27353
## 
## Number of Fisher Scoring iterations: 2

Machine learning procedure

For machine learning procedure, we decided to build a k nearest neighbors model that has borough as a response variable and poor quality index(pqi), household income(hhinc), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) as the explanatory variables.

1=“Bronx”, 2=“Brooklyn”, 3=“Manhattan”, 4=“Queens”, 5=“Staten Island”

library(dplyr)
library(class)

# Change this value to try different k:
ktry <- 2
train_data <- select(na.omit(data_part2), c(borough, pqi, hhinc, X_28c, X_28d2, X_30a))
#train_data <- na.omit(train_data)
my.knn <- knn(train = train_data, 
              test = train_data, 
              cl = train_data$borough, 
              k = ktry)
borough <- data.frame(Actual = train_data$borough, Predicted = my.knn)
confusion <- table(borough)
confusion
##       Predicted
## Actual     1     2     3     4     5
##      1  5523  1078  1116   551    52
##      2   862 10313  1264  1303   157
##      3  1107  1383  8943  1105    96
##      4   560  1383  1035  6317   167
##      5    55   185   109   243   574

Correct classification rate

sum(diag(confusion)) / nrow(data_part2)
## [1] 0.2079271

PART 3: Final analysis

Hierarchical cluster analysis

For our hierarchy clustering, we decided to use our Poor Quality Index(pqi), Household Income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) and Monthly Gross Rent (mgrent). The hierarchy clustering includes only the households of tenants.

hc_data <- select(data_part2, c(pqi, hhinc, X_28c, X_28d2, X_30a, mgrent))

hc_data$X_28c[hc_data$X_28c == 999] <- NA
hc_data$X_28c[hc_data$X_28c == 998] <- NA
hc_data$X_28d2[hc_data$X_28d2 == 999] <- NA
hc_data$X_28d2[hc_data$X_28d2 == 998] <- NA

hc_data <- na.omit(scale(hc_data)) %>% data.frame()

We chose to scale the data to get better comparison among the diferrent varaibles we selected.

library(ape)
arr_dist <- dist(hc_data, method = "euclidean")
arr_clust <- hclust(arr_dist)
arr_tree <- as.phylo(arr_clust)
plot(arr_tree, cex = 0.5)

summary(arr_tree)
## 
## Phylogenetic tree: arr_tree 
## 
##   Number of tips: 251 
##   Number of nodes: 250 
##   Branch lengths:
##     mean: 0.3768499 
##     variance: 0.2445826 
##     distribution summary:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## 0.008640528 0.167880872 0.255955929 0.431429115 6.724070149 
##   No root edge.
##   First ten tip labels: 1 
##                         2
##                         3
##                         4
##                         5
##                         6
##                         7
##                         8
##                         9
##                         10
##   No node labels.

The results are not very interpretable. Why? In short, the data is too finely cut, we have no node labels to interpret from, but intheory the nodes represent different housing scenarios based off of our selected input. However, it’s hard to make any remakrs off the analysis considering the complexity of housing in a place like NYC. Further, its likely, but not confirmed by us, that many of the numerical variables we chose(income, and monthly costs/rent) are highly confounded and result in overly similar clusters that are hard to make any distinguishments. Instead we should aggregate the data to a coarser level with labels, e.g., sub-borough area, and consider variables that have unknown relationships(i.e., not confounded). As final note the data for both analyses was cut down to only 251 houses, with borough disregarded, which likely would not give the best description of housing for the whole city. As final note the data was cut down to only 251 houses, with borough disregarded, which likely would not give the best description of housing for the whole city.

k-means cluster analysis

For our k-means clustering, we used the same variables as our hierarchy clustering, which is our Poor Quality Index(pqi), Household Income(hhinc), year(year), Combined Gas/Electricity Monthly Cost(X_28c), Water and Sewer Annual Cost(X_28d2), Monthly contract rent(X_30a) and Monthly Gross Rent (mgrent). The k-means clustering also includes only the households of tenants.

k_means_data <- select(data_part2, c(pqi, hhinc, X_28c, X_28d2, X_30a, mgrent))

k_means_data$X_28c[k_means_data$X_28c == 999] <- NA
k_means_data$X_28c[k_means_data$X_28c == 998] <- NA
k_means_data$X_28d2[k_means_data$X_28d2 == 999] <- NA
k_means_data$X_28d2[k_means_data$X_28d2 == 998] <- NA

k_means_data <- na.omit(scale(k_means_data)) %>% data.frame()
set.seed(25)
arr_clust <- kmeans(k_means_data, 3)
arr_clust
## K-means clustering with 3 clusters of sizes 7, 165, 79
## 
## Cluster means:
##          pqi      hhinc      X_28c      X_28d2      X_30a     mgrent
## 1  0.1802038  2.5240827  3.2346576 -0.20458301  5.3531725  5.8022960
## 2  0.2988552 -0.1571351 -0.1715836 -0.31238999 -0.2371015 -0.1373023
## 3 -0.1606853  0.1974197  1.7039213  0.09122733  1.0345868  1.4274657
## 
## Clustering vector:
##   [1] 2 3 2 3 2 3 2 3 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
##  [36] 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
##  [71] 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 1
## [106] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 3 2 3 2 3 3 3 2 2 2 2 2 2 3 2
## [141] 3 3 2 3 2 2 3 2 3 2 3 3 2 2 3 3 3 3 2 2 1 2 2 2 2 2 2 3 2 2 1 3 2 2 2
## [176] 3 3 3 2 2 2 2 3 2 2 3 3 3 2 3 2 2 2 2 3 2 3 3 3 2 3 2 3 3 2 2 2 2 3 3
## [211] 2 3 3 3 3 3 3 3 3 3 3 3 1 1 2 2 3 1 3 3 1 2 3 2 3 3 3 2 3 3 2 2 3 3 3
## [246] 3 3 2 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 214.0125 469.7071 439.8295
##  (between_SS / total_SS =  44.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
k_means_data %>% mutate(cluster = arr_clust$cluster)%>%
  group_by(cluster) %>%
summarise(num_cluster = n())
## # A tibble: 3 x 2
##   cluster num_cluster
##     <int>       <int>
## 1       1           7
## 2       2         165
## 3       3          79

This confirms how poorly kmeans, and clustering performs on the data at a unit level. Trying differnt seeds shows a huge variation in cluster sizes. However, one of the clusters is consitiintely small which is interesting, but this could be outliers, e.g., very high income earners out of a small sample(n=251). This just reafirms some of the comments made above. Further analysis is needed and a better data aggregation needs to be implemented. It would be interesting to aggregate by subborough and perfor kmeans with 5 clusters to explore how “similar” subboroghs are within the same borough.

Limitations and Future Plans

Ultimately, we did not arrive at a method to test our index. However the authors believe the index should be validated against a variable indicative of quality, but not measured in the index. It is in future plans to find data to perfom such a validation test. Potential variables were omitted due to the fact they only had data for recent years. Whether a unit has functioning air conditioning was only measured during the years 2014 and 2017. In measuring housing quality this variable would have been useful. The authors’ have chosen not to include such datas it may inflate the index scores for later years. However, there are plans to create strong indexes for the recent years.

In this paper we have created a housing quality index that measures poor housing conditions. We remark that housing conditions have been slowly improving over time, particularly among units with high index values. Our goal was to measure hosuing quality and our proposed index specificaly measures poor housing conditions rather than just quality. We believe it would be benefecial to creat several indexes concering qualilty of hosuing e.g., High Quality Index, Neighborhood Quality Index and to consider all such indexes when considering hosuing quality. We also reccomend further exploration of the spatial component of the data to see if things such as crime, location, and the index value related.

References

https://www.huduser.gov/publications/pdf/AHS_hsg.pdf

https://www1.nyc.gov/site/hpd/about/nychvs-asa-data-challenge-expo.page