Trends in Residex: Housing Price Index of India

To justify (narrow down to an appropriate city, as well) an upcoming property investment in India, I was looking for some government-published dataset on residential housing prices. After, accidentally discovering National Housing Bank (http://www.nhb.org.in/), I was searching for some CSV. Unfortunately there was none. Closest thing was - http://www.nhb.org.in/Residex/Data&Graphs.php
Even if I have to scrape it, let's be it !!

Let's use rvest library for the scraping. It honours the magrittr semantics and allow CSS selectors.

require(rvest) || install.packages("rvest")  
library(rvest)  

Let's extract the ugly HTML table to a data.frame. html_table is the key function provided by rvest (converts HTML table to a data-frame, almost !)

india_residex <- html("http://www.nhb.org.in/Residex/Data&Graphs.php")  
residex_list <- india_residex %>%  
                    html_nodes("table[border='1']") %>% 
                    html_table()
residex_frame <- residex_list[[1]]  

Column names are not inferred properly. Let's fix it

> names(residex_frame)
 [1] "X1"  "X2"  "X3"  "X4"  "X5"  "X6"  "X7"  "X8"  "X9"  "X10" "X11" "X12"

> names(residex_frame) <- c("CITIES", "2007", "OCT_DEC_2012", "JAN_MAR_2013", 
"APR_JUN_2013", "JUL_SEPT_2013", "OCT_DEC_2013", "JAN_MAR_2014", "APR_JUN_2014", "JUL_SEP_2014", "OCT_DEC_2014", "JAN_MAR_2015")

Also skip the first row (html_table() should have parameters like header=TRUE)

# Looks we need to slice the first row
residex_frame <- residex_frame[2:nrow(residex_frame),]  

Let's take a look at our data.frame now -

> str(residex_frame)
'data.frame':    26 obs. of  12 variables:  
 $ CITIES       : chr  "Hyderabad" "Faridabad" "Patna" "Ahmedabad" ...
 $ 2007         : chr  "100" "100" "100" "100" ...
 $ OCT_DEC_2012 : chr  "90" "205" "151" "191" ...
 $ JAN_MAR_2013 : chr  "88" "207" "152" "192" ...
 $ APR_JUN_2013 : chr  "84" "202" "147" "186" ...
 $ JUL_SEPT_2013: chr  "88" "204" "150" "191" ...
 $ OCT_DEC_2013 : chr  " \n\n                                    93" " \n\n                                    209" " \n\n                                    159" " \n\n                                    197" ...
 $ JAN_MAR_2014 : chr  "95" "209" "150" "209" ...
 $ APR_JUN_2014 : chr  "95" "211" "154" "213" ...
 $ JUL_SEP_2014 : chr  "93" "216" "153" "217" ...
 $ OCT_DEC_2014 : chr  "95" "222" "157" "215" ...
 $ JAN_MAR_2015 : chr  "97" "224" "160" "215" ...

It looks like the OCT_DEC_2013 column have some bad HTML content. Let's clean it out.

strip <- function(elem) { gsub('[^0-9]', '', elem) }  
residex_frame$OCT_DEC_2013 <- as.numeric(sapply(residex_frame$OCT_DEC_2013, FUN = strip))  

Looks better now -

> str(residex_frame)
'data.frame':    26 obs. of  12 variables:  
 $ CITIES       : chr  "Hyderabad" "Faridabad" "Patna" "Ahmedabad" ...
 $ 2007         : chr  "100" "100" "100" "100" ...
 $ OCT_DEC_2012 : chr  "90" "205" "151" "191" ...
 $ JAN_MAR_2013 : chr  "88" "207" "152" "192" ...
 $ APR_JUN_2013 : chr  "84" "202" "147" "186" ...
 $ JUL_SEPT_2013: chr  "88" "204" "150" "191" ...
 $ OCT_DEC_2013 : num  93 209 159 197 330 105 185 235 154 85 ...
 $ JAN_MAR_2014 : chr  "95" "209" "150" "209" ...
 $ APR_JUN_2014 : chr  "95" "211" "154" "213" ...
 $ JUL_SEP_2014 : chr  "93" "216" "153" "217" ...
 $ OCT_DEC_2014 : chr  "95" "222" "157" "215" ...
 $ JAN_MAR_2015 : chr  "97" "224" "160" "215" ...

Other than CITIES and OCT_DEC_2013 column, we need convert rest of the columns into numeric ones.

residex_frame_1 <- residex_frame %>%  
                      select(-c(OCT_DEC_2013, CITIES)) %>%
                      apply(2, as.numeric)
residex_frame <- data.frame(  
                  residex_frame[, c("OCT_DEC_2013", "CITIES")],
                  residex_frame_1
                )

Data munging looks complete now -

> str(residex_frame)
'data.frame':    26 obs. of  12 variables:  
 $ OCT_DEC_2013 : num  93 209 159 197 330 105 185 235 154 85 ...
 $ CITIES       : chr  "Hyderabad" "Faridabad" "Patna" "Ahmedabad" ...
 $ X2007        : num  100 100 100 100 100 100 100 100 100 100 ...
 $ OCT_DEC_2012 : num  90 205 151 191 314 87 189 205 150 87 ...
 $ JAN_MAR_2013 : num  88 207 152 192 310 112 183 221 140 89 ...
 $ APR_JUN_2013 : num  84 202 147 186 303 110 187 219 142 86 ...
 $ JUL_SEPT_2013: num  88 204 150 191 318 108 191 219 145 86 ...
 $ JAN_MAR_2014 : num  95 209 150 209 349 101 194 232 165 85 ...
 $ APR_JUN_2014 : num  95 211 154 213 355 102 193 241 161 86 ...
 $ JUL_SEP_2014 : num  93 216 153 217 362 101 196 242 160 88 ...
 $ OCT_DEC_2014 : num  95 222 157 215 366 103 198 247 164 88 ...
 $ JAN_MAR_2015 : num  97 224 160 215 364 102 200 251 168 88 ...

Let's melt these "untidy" columns (like - OCT_DEC_2012, JAN_MAR_2013)

require(tidyr) || install.packages("tidyr")  
library(tidyr)

residex_tidy <- residex_frame %>%  
                  gather(TIME_OF_YEAR, INDEX, -CITIES)

I love this collapse -

> head(residex_tidy)
     CITIES TIME_OF_YEAR INDEX
1 Hyderabad OCT_DEC_2013    93  
2 Faridabad OCT_DEC_2013   209  
3     Patna OCT_DEC_2013   159  
4 Ahmedabad OCT_DEC_2013   197  
5   Chennai OCT_DEC_2013   330  
6    Jaipur OCT_DEC_2013   105  

Some more factorizations -

residex_tidy$TIME_OF_YEAR <- factor(residex_tidy$TIME_OF_YEAR,  
                                    levels = c("2007", "OCT_DEC_2012", "JAN_MAR_2013", "APR_JUN_2013", "JUL_SEPT_2013", "OCT_DEC_2013", "JAN_MAR_2014", "APR_JUN_2014", "JUL_SEP_2014", "OCT_DEC_2014", "JAN_MAR_2015"))
residex_tidy$CITIES <- factor(residex_tidy$CITIES)  
residex_tidy$INDEX <- as.integer(residex_tidy$INDEX)  

It's plotting time ! First try (line graph - geom_line) -

require(ggthemes) || install.packages("ggthemes")  
library(ggthemes)

residex_tidy %>%  
  ggplot(aes(x = TIME_OF_YEAR, y = INDEX, color = CITIES, group = CITIES)) +
  geom_line() +
  theme_economist()

It's hard to decipher what's going on. Legends are hard to correlate with the graph. Not good at all.

Second try with a gradual color palette (line graph - geom_line) -

unique_cities <- length(levels(residex_tidy$CITIES))  
color_palette_colors <- colorRampPalette(rev(brewer.pal(5, "OrRd")))(unique_cities)  
residex_tidy %>%  
  ggplot(aes(x = TIME_OF_YEAR, y = INDEX)) +
  geom_line(aes(color = CITIES, group = CITIES)) +
  scale_color_manual(values = color_palette_colors) +
  theme_economist()

Still not convinced. Let's apply our old trick of heatmap with time along X-axis and cities along Y-axis (similar technique used in the visualization case-study of Temperature trends - http://www.dbose.in/seasonal-maximum-temperature-of-india-a-data-visualization-study/)

unique_index_count <- length(unique(residex_tidy$INDEX))  
color_palette_colors <- colorRampPalette(rev(brewer.pal(5, "Oranges")))(unique_index_count)

residex_tidy %>%  
  ggplot(aes(x = TIME_OF_YEAR, y = CITIES, fill = factor(INDEX))) +
  geom_tile() +
  scale_fill_manual(values = rev(color_palette_colors), name = "") +
  theme_change

Drawing

Heaps better! As an improvement, I should try to overlay this data over a map of India. Cities have a spatial aspect and the visualization should uphold this.

Conclusion

Chennai market has been RED 'hot' since Oct-Dec, 2012 (we don't have data points from 2007-2012). Unsurprisingly, housing prices are rising fast for Pune and Mumbai market. Among smaller cities, Faridabad and Bhopal are catching up pretty fast. Kolkata (my home city) is trying to reach the 'Big League'. On the other side of the spectrum, not much is happening for Kochi and Jaipur. Other related economics datasets need to be studied to gain a deeper picture. Metrics like Household Debt/Income Per City, Property Sales Rate, Property Vacancy Ratio, Secondary-market Sales Rate/City need to be investigated.