Work with Canadian Census data

This is a course assignment demo (GG606 Scientific Data Wrangling)

cancensus is an R package that provides access to Statistics Canada Census data for the Census years 1996, 2001, 2006, 2011, 2016 and 2021. The datasets present information from the Census of Population for various levels of geography, including provinces and territories, census metropolitan areas, communities and census tracts.

1. Installing the package and retrieving the data vectors list (API key required)

    install.packages("cancensus")
    library(cancensus)

    # view available Census data
    list_census_datasets()
    # view regions and vectors in a given dataset
    list_census_regions("CA21")
    list_census_vectors("CA21")

    # retrieve the data needed
    # Warning: Cached regions list may be out of date. Set `use_cache = FALSE` to update it.
    library(geojsonsf)
    dataset <- "CA21"
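    # set the Census API key and a local cache directory
    set_cancensus_api_key('ENTER API KEY')
    # here::here() is called with its namespace because the here package is never attached
    set_cancensus_cache_path(here::here("data"))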
    library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

    regions.list <- list_census_regions(dataset, use_cache = FALSE) %>% 
      filter(level == "PR", name == "Alberta") %>% as_census_region_list

    vec <- find_census_vectors("housing", dataset = dataset, query_type = "semantic")

2. Housing situation in Alberta

2.1 Dwelling Value

    # getting 2021 data
    # selecting vectors in need

    # Median Dwelling value
    dwelling.21.cost <- get_census(dataset = "CA21", regions = regions.list,
                              vectors = c("median.dwelling"="v_CA21_4311"), 
                              level = "CSD", labels = 'short', geo_format = 'sf')

    dwelling.21.value <- get_census(dataset = "CA21", regions = regions.list,
                              vectors = c("median.dwelling"="v_CA21_4311"), 
                              level = "CSD", labels = 'short') %>% 
      slice_max(median.dwelling, prop = .05) 

    ggplot(dwelling.21.cost) + geom_sf(aes(fill=median.dwelling)) + 
      theme_minimal() +
      theme(panel.grid = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            strip.text.x = element_text(size=12)) + 
      coord_sf(datum=NA) +
      scale_fill_viridis_c("Median Dwellings Value", labels = scales::dollar)

To look at the general housing situation, I started with the price of dwellings. The first figure shows the distribution of median dwelling value in Alberta. In a large number of districts, the median dwelling value is around $500,000 or above. Roughly speaking, dwellings are more expensive in southern Alberta than in the north, which might be related to the kind of climate people want to live in. There is a distinctive pattern around Calgary: in the centre of Calgary, dwelling values are far lower than in the surrounding area. A similar but less obvious pattern can be found around Edmonton. The reason may be that people prefer living in suburban areas to living downtown, but the gap in dwelling value around Calgary is astonishing. Moreover, the ring surrounding Calgary is the most expensive area in the province in terms of dwelling value. The next figure looks into it in more detail.

    options(scipen = 999)
    ggplot(dwelling.21.value, aes(x=median.dwelling, y=fct_reorder(`Region Name`, median.dwelling))) +
      geom_point() + theme_minimal() + theme(axis.title.y = element_blank()) + 
      xlab("Median Dwellings Value") + ggtitle("Districts with Top 5% Median Dwellings Value in Alberta")

The second figure looks at dwelling value in more detail by showing the districts in the top 5% of median dwelling value. 25 out of 423 districts in Alberta were selected, and notably, every district with a median dwelling value above $500,000 is included. The districts around Calgary are all included, as are the districts to the west of Calgary. A web search of these district names showed that they are located in good places for relaxation and vacation. They are often beside a lake (e.g. Birchcliff, Jarvis Bay) or near a national park (e.g. Banff, Jasper). The unusually high dwelling values may be related to their good location and natural environment.
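The top-5% selection relies on `slice_max()` with the `prop` argument. Below is a minimal sketch of how that call behaves, assuming a made-up data frame (`toy` and its values are hypothetical, not census data):

```r
library(dplyr)

# Hypothetical toy data: 40 districts with made-up, distinct median values
toy <- tibble(
  district = paste0("D", 1:40),
  median.dwelling = seq(100000, by = 10000, length.out = 40)
)

# prop = .05 keeps the top 5% of rows: floor(0.05 * 40) = 2 rows here,
# returned in descending order of median.dwelling
top <- toy %>% slice_max(median.dwelling, prop = .05)
print(top)
```

Ties are kept by default (`with_ties = TRUE`), so on real data the result can contain more rows than `floor(prop * n)`.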

2.2 Monthly cost

    # Median Monthly shelter cost
    mon.cost.21 <- get_census(dataset = "CA21", regions = regions.list,
                              vectors = c("median.cost.rent"="v_CA21_4317", 
                                          "median.cost.own"="v_CA21_4309"), 
                              level = "CSD", 
                              geo_format = 'sf', labels = 'short')

    mon.cost.21 <- mon.cost.21 %>% pivot_longer(cols = starts_with("median.cost"), 
                                 names_to = "status",
                                 names_prefix = "median.cost.",
                                 values_to = "month.cost")
    mon.cost.21$status <- factor(mon.cost.21$status, levels = c("own","rent"),
                          labels = c("Owner Households", "Renter Households"))

    ggplot(mon.cost.21) + geom_sf(aes(fill=month.cost)) + 
      theme_minimal() +
      theme(panel.grid = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            strip.text.x = element_text(size=12)) + 
      coord_sf(datum=NA) +
      scale_fill_viridis_c("Median Monthly cost", labels = scales::dollar) +
      facet_wrap(~status)

The next question I had about housing is whether the monthly cost of housing follows the same distribution as the dwelling price.

The above figure shows the median monthly shelter cost per household. In general, the monthly cost shares the same characteristics as the dwelling price: where the dwelling value is low, the monthly cost people spend on housing also tends to be low.

There are 2 interesting differences. First, the ring of higher prices around Calgary has disappeared, so people who live in high-value dwellings do not actually spend significantly more each month. This might be because those high-value dwellings are owned rather than rented, and the owners’ maintenance costs are not significantly higher. This is also visible in the right panel of Figure 3, where areas with high dwelling values have average or even lower monthly costs for renter households.

The second interesting difference is between owner households and renter households. The monthly cost distributions tend to be the same; however, in the northeastern area, owner households have an abnormally high monthly cost. According to a web search, the northeastern area is agricultural, so owners might spend more to maintain farmhouses that serve as more than just a place to live.
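The owner and renter panels compared above depend on reshaping the two cost columns into long form before faceting. Here is a minimal sketch of that reshaping step, assuming a small made-up table (`wide` and its numbers are hypothetical):

```r
library(tidyr)

# Hypothetical toy data: one row per region, one column per tenure type
wide <- data.frame(
  region = c("A", "B"),
  median.cost.rent = c(1200, 900),
  median.cost.own = c(1500, 1100)
)

# Stack the two cost columns; names_prefix strips "median.cost." so the
# new status column holds just "rent" or "own"
long <- pivot_longer(
  wide,
  cols = starts_with("median.cost"),
  names_to = "status",
  names_prefix = "median.cost.",
  values_to = "month.cost"
)

# Readable, ordered labels so the owner panel comes first in facet_wrap()
long$status <- factor(long$status, levels = c("own", "rent"),
                      labels = c("Owner Households", "Renter Households"))
print(long)
```

Each region now contributes one row per tenure type, which is the shape `facet_wrap(~status)` needs.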

2.3 Difference between large cities and elsewhere

    # Number of Household of Renter and Owner
    household <- get_census(dataset = "CA21", regions = regions.list,
                              vectors = c("status.Renter"="v_CA21_4239", 
                                          "status.Owner"="v_CA21_4238",
                                          "total.house"="v_CA21_4237"), 
                              level = "CSD", 
                              labels = 'short')

    household <- household %>% slice_max(total.house, n=5) 

    ggplot(household, aes(x=total.house, y=fct_reorder(`Region Name`, total.house))) +
      geom_col(fill="cornflowerblue") + theme_minimal() + theme(axis.title.y = element_blank()) + 
      xlab("Number of Households") + ggtitle("Districts with Top 5 Number of Household in Alberta")

While looking at the household distribution, I realized that households are distributed very unevenly within Alberta. The above figure shows the 5 districts with the largest number of households in Alberta. As the figure shows, Calgary and Edmonton have far more households than the rest of the province. Calgary has around 500,000 households and Edmonton about 400,000, while the third-ranked district has fewer than 50,000, which is less than 10% of the number of households in Alberta. This means the majority of households are concentrated in just 2 districts in the province.

This led to the last question: does the housing situation differ greatly between the 2 districts (Calgary and Edmonton) and elsewhere in Alberta?

    #suitable housing data
    vec_suit<- list_census_vectors("CA21") %>% filter(vector == "v_CA21_4293") %>% 
      child_census_vectors(TRUE, keep_parent = TRUE)
    accept <- get_census(dataset = "CA21", regions = regions.list,
                         vectors = vec_suit$vector, level = "CSD", 
                         labels = "short")

    regions.main <- list_census_regions(dataset, use_cache = FALSE) %>% 
      filter(level == "CSD", name %in% c("Calgary", "Edmonton")) %>% 
      as_census_region_list

    ##calgary & edmonton vs elsewhere
    rate <- accept %>% mutate(urban = 
                                case_when(
                                  GeoUID %in% regions.main$CSD ~ "Calgary and Edmonton",
                                  TRUE ~ "Elsewhere")) 
    ## no identified problems
    rest.select <- rate %>% select(contains("v_CA21"), -v_CA21_4293)
    rate$rest <- rowSums(rest.select)
    rate <- rate %>% mutate(v_CA21_4293=Households-rest)

    ##factor order
    categories <- vec_suit %>% pull("vector")
    cat_list <- factor(categories, ordered = TRUE)
    cat_name <- vec_suit %>% pull("label")
    cat_name[1]="No identified problems"

    ##change column names to human readable
    names(rate)[grepl("v_", names(rate))] <- cat_name

    ##pivot_longer
    plot_data <- rate %>% pivot_longer(cols = cat_name[1]:cat_name[8],
                                       names_to = "Categories",
                                       values_to = "Count")
    plot_data$Categories <- factor(plot_data$Categories, levels = cat_name, 
                                   ordered = TRUE) 
    plot_data <- plot_data %>% group_by(Categories, urban) %>% 
      summarise(Count=mean(Count, na.rm = TRUE))

    # Make plots wider 
    #knitr::opts_chunk$set(fig.width=8, fig.height=6)
    ggplot(plot_data, aes(x="", Count, group=Categories, fill=Categories)) +
      geom_bar(stat = "identity", position="fill", width = 1) + coord_polar("y", start = 0) +
      facet_wrap(~urban)+ theme_void() +
      theme(legend.position = "bottom", axis.title = element_blank(),
            strip.text.x = element_text(size=15), 
            legend.text = element_text(size=10),
            plot.title = element_text(size=20)) + 
      guides(fill=guide_legend(nrow = 4)) + 
      ggtitle("Housing problems: ") +
      scale_fill_brewer(palette = "Set2")

When assessing the housing situation, 3 criteria are used:

Whether people spend 30% or more of their income on shelter costs. If they do, the cost of housing is considered too high.
Whether there is more than one person per room in the dwelling. If so, people have to live in overcrowded dwellings, and the housing is considered “not suitable” for living.
Whether the dwelling needs major repairs.
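The three criteria can be expressed as simple logical flags. The sketch below follows the criteria as worded in this text, not Statistics Canada's official definitions, and every column name and number in `hh` is hypothetical:

```r
# Hypothetical household records (made-up numbers; annual income and cost)
hh <- data.frame(
  income = c(60000, 30000, 45000),
  shelter.cost = c(15000, 12000, 9000),
  persons = c(2, 5, 3),
  rooms = c(4, 3, 5),
  major.repairs = c(FALSE, FALSE, TRUE)
)

# Criterion 1: 30% or more of income spent on shelter
hh$unaffordable <- hh$shelter.cost / hh$income >= 0.30
# Criterion 2: more than one person per room
hh$not.suitable <- hh$persons / hh$rooms > 1
# Criterion 3 is the major.repairs column itself

# A household has no identified problems only if no flag is set
hh$no.problems <- !(hh$unaffordable | hh$not.suitable | hh$major.repairs)
print(hh)
```

In this toy data only the first household has no identified problems; the second fails both the affordability and the crowding test, and the third needs major repairs.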

The above pie chart shows the housing problems in the 2 large cities (Calgary and Edmonton) versus elsewhere in Alberta. Overall, a higher percentage of households have identified problems in Calgary and Edmonton. Specifically, more people spend 30% or more of their income on shelter, and more people live in “not suitable” housing. This means that in these 2 cities people suffer from the high cost of housing: they either have to spend 30% or more of their income on housing or share a room with others to reduce the cost. Elsewhere in Alberta, the housing situation is slightly better, but more households need major repairs. This indicates that elsewhere (outside Calgary and Edmonton), a higher percentage of dwellings are not in good condition and need repairs that the people living in them cannot afford.
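The percentages compared here come from treating “No identified problems” as the remainder of all households and then normalising the counts within each group. Here is a minimal sketch of that arithmetic, assuming made-up counts (`counts`, `problem.a` and `problem.b` are hypothetical names and values):

```r
library(dplyr)

# Hypothetical district-level counts, two districts per group
counts <- tibble(
  urban = c("Calgary and Edmonton", "Calgary and Edmonton",
            "Elsewhere", "Elsewhere"),
  Households = c(1000, 500, 200, 100),
  problem.a = c(200, 100, 30, 10),
  problem.b = c(50, 50, 20, 10)
)

shares <- counts %>%
  # households with no identified problem = total minus all problem counts
  mutate(no.problems = Households - (problem.a + problem.b)) %>%
  group_by(urban) %>%
  # average count per district within each group
  summarise(across(c(no.problems, problem.a, problem.b), mean),
            .groups = "drop") %>%
  # normalise so the categories sum to 1 (what position = "fill" does)
  mutate(total = no.problems + problem.a + problem.b,
         share.no.problems = no.problems / total)
print(shares)
```

As long as no values are missing, normalising the per-district means gives the same shares as normalising the group totals, because every category is divided by the same number of districts.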

Comments on figures

All the data downloading and pre-processing code can be found in the first 2 chunks of the .rmd file, while the figures are produced in separate chunks aligned with the analysis paragraphs.

There are 3 types of figures used in the analysis:

The spatial distribution of a given value: I used this to show the distribution of dwelling value and of the monthly cost people spend on housing. This type of figure shows how the value or cost changes in space and can lead to interesting findings, such as the ring of high dwelling values around Calgary. It is useful when assessing how the price of housing differs between areas. In my figures, I selected a color palette that makes higher prices stand out in bright yellow, with lower prices in deep blue. This highlights the areas with higher prices. I also changed the legend scale to dollar amounts, which makes more sense for the questions analyzed.
The rank of districts by value: I used geom_point and geom_col to generate the 2 figures about the highest dwelling values and the largest numbers of households. When there are many observations to show, e.g. the dwelling value by district, geom_point makes the figure more readable; when there are just 5 observations to show, as with the number of households, geom_col makes the bars with higher values stand out and attracts the reader’s attention. In both figures, I reordered the data so that it ranks from high to low, which increases readability. Besides, this type of figure requires the reader to refer to the x-axis to find the exact value, so I used theme_minimal to keep the light grid lines that help indicate the exact value on the x-axis.
Pie chart: I used this to show the percentage of housing problems in different areas. ggplot doesn’t actually have a pie chart function, so I built this from geom_bar by adjusting the function and reshaping the data. I selected the color palette to clearly distinguish the different types of problems. (For some reason the figure displays perfectly in the Rmd but doesn’t display fully in the knitted HTML file…)
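The pie-chart construction described in the last item can be reduced to a few lines. This is a minimal sketch on made-up data (`toy` and its values are hypothetical); it uses `geom_col()`, which is equivalent to `geom_bar(stat = "identity")`:

```r
library(ggplot2)

# Hypothetical category counts
toy <- data.frame(
  category = c("A", "B", "C"),
  count = c(50, 30, 20)
)

# One stacked bar at a single x position, normalised to proportions,
# then bent into a circle by mapping y to the angle
p <- ggplot(toy, aes(x = "", y = count, fill = category)) +
  geom_col(position = "fill", width = 1) +
  coord_polar(theta = "y", start = 0) +
  theme_void()

# print(p) draws the chart on the active graphics device
```

The single stacked bar carries all the data; `coord_polar()` only changes how it is drawn, so the same code without that line gives a stacked proportion bar.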
