Earthquakes data analysis demo

This is a course assignment to analyze and visualize earthquake data.

1. Read the data in and clean it for analysis, using the `readr` package functions for reading and parsing data. [5 marks]

My answer is written here and explains what I did and why.

    #code 
    library(tidyverse)

    ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
    ## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
    ## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
    ## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
    ## ✔ readr   2.1.3      ✔ forcats 0.5.2 
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()

    #read data
    data <- read_csv("https://jjvenky.github.io/GG606AW23/database.csv", 
                     col_types = cols(Date = col_date(format = "%m/%d/%Y"),
                                      Time = col_time(format = "%H:%M:%S")))
    #problems(data)
    #data[problems(data)$row,]
    #as.POSIXct(problems(data)$actual, "%Y-%m-%dT%H:%M:%S", tz="UTC")

    #check data and drop rows whose Date could not be parsed (NA)
    data <- filter(data, !is.na(Date))
    data

    ## # A tibble: 23,409 × 21
    ##    Date       Time     Latitude Longitude Type     Depth Depth…¹ Depth…² Magni…³
    ##    <date>     <time>      <dbl>     <dbl> <chr>    <dbl>   <dbl>   <dbl>   <dbl>
    ##  1 1965-01-02 13:44:18    19.2      146.  Earthqu…  132.      NA      NA     6  
    ##  2 1965-01-04 11:29:49     1.86     127.  Earthqu…   80       NA      NA     5.8
    ##  3 1965-01-05 18:05:58   -20.6     -174.  Earthqu…   20       NA      NA     6.2
    ##  4 1965-01-08 18:49:43   -59.1      -23.6 Earthqu…   15       NA      NA     5.8
    ##  5 1965-01-09 13:32:50    11.9      126.  Earthqu…   15       NA      NA     5.8
    ##  6 1965-01-10 13:36:32   -13.4      167.  Earthqu…   35       NA      NA     6.7
    ##  7 1965-01-12 13:32:25    27.4       87.9 Earthqu…   20       NA      NA     5.9
    ##  8 1965-01-15 23:17:42   -13.3      166.  Earthqu…   35       NA      NA     6  
    ##  9 1965-01-16 11:32:37   -56.5      -27.0 Earthqu…   95       NA      NA     6  
    ## 10 1965-01-17 10:43:17   -24.6      178.  Earthqu…  565       NA      NA     5.8
    ## # … with 23,399 more rows, 12 more variables: `Magnitude Type` <chr>,
    ## #   `Magnitude Error` <dbl>, `Magnitude Seismic Stations` <dbl>,
    ## #   `Azimuthal Gap` <dbl>, `Horizontal Distance` <dbl>,
    ## #   `Horizontal Error` <dbl>, `Root Mean Square` <dbl>, ID <chr>, Source <chr>,
    ## #   `Location Source` <chr>, `Magnitude Source` <chr>, Status <chr>, and
    ## #   abbreviated variable names ¹`Depth Error`, ²`Depth Seismic Stations`,
    ## #   ³Magnitude

Here, col_types in the read_csv function is defined with formats that match the Date and Time formats used in the .csv file.

Note:

There are 3 rows in which Date and Time are written in a different format from the rest of the dataset. It is possible to convert them to a usable format, for example with as.POSIXct(problems(data)$actual, format = "%Y-%m-%dT%H:%M:%S", tz="UTC"). However, the current method still seemed to return the correct Date, the following questions do not need the exact time of the earthquakes, and I am not sure which time zone those 3 rows use. So I simply ignored those 3 rows, since the parsing warning does not affect the following analysis.
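
As a rough sketch of how rows in a second format could be parsed instead of dropped, the base R example below first tries the main format and then fills the failures with the alternative one. The input strings are made-up examples, not values copied from the dataset, and the date is taken as written with no time zone conversion, which is itself an assumption.

```r
# Made-up example strings: two in the main format, one in the ISO-like format
raw_dates <- c("01/02/1965", "1975-02-23T02:58:41.000Z", "12/31/2016")

# First pass: the format used by most rows; non-matching rows become NA
dates <- as.Date(raw_dates, format = "%m/%d/%Y")

# Second pass: retry only the failures with the alternative format
# (assumption: the date part is used as written, no time zone shift)
needs_retry <- is.na(dates)
dates[needs_retry] <- as.Date(raw_dates[needs_retry],
                              format = "%Y-%m-%dT%H:%M:%S")

print(dates)
```

Retrying only the failed rows keeps the main format authoritative and makes it easy to check how many rows needed the fallback.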

2. Did more earthquakes happen on weekends or weekdays? [5 marks]

    #code 
    #select Earthquake
    Earthquake <- subset(data, Type == "Earthquake")
    #convert Date to day of the week
    #(weekdays() returns names in the current locale; English is assumed here)
    Earthquake$wday <- weekdays(Earthquake$Date)
    #reorder days of the week (so they appear in order in the figure's legend)
    Earthquake$wday <- factor(Earthquake$wday, 
                              levels = c("Monday","Tuesday","Wednesday","Thursday",
                                         "Friday","Saturday","Sunday"))
    #separate weekdays and weekend
    Earthquake <- mutate(Earthquake, 
                         wday_ID = case_when(wday == "Sunday" ~ "Weekends", 
                                             wday == "Saturday" ~ "Weekends", 
                                             TRUE ~ "Weekdays"))
    #check dataset before plotting 
    #Earthquake

    #plot
    ggplot(data = Earthquake) +
      geom_bar(aes(wday_ID,fill=wday), alpha=1) +
      theme_classic() + scale_fill_brewer(palette = "Set3") +
      labs(title = "Number of earthquakes on weekdays and weekends",
           x="",fill="Day of week")

The figure above shows the number of earthquakes that happened on weekdays and on weekends in the dataset. More earthquakes happened on weekdays in total, but the number of earthquakes per day does not show any preference between weekdays and weekends. A week has 5 weekdays and only 2 weekend days, so if earthquakes are equally likely on every day, the weekday total will be larger simply because there are more weekdays.
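
The per-day comparison can be made explicit by dividing each total by the number of days in its group. The sketch below uses base R and toy dates (one event per day over two full weeks), not the assignment data; `per_day_counts` is a hypothetical helper name.

```r
# Hypothetical helper: weekday and weekend totals, plus totals divided by
# the number of days in each group (5 weekdays, 2 weekend days)
per_day_counts <- function(dates) {
  # as.POSIXlt()$wday is 0 for Sunday ... 6 for Saturday, regardless of locale
  day_num <- as.POSIXlt(dates)$wday
  is_weekend <- day_num %in% c(0, 6)
  totals <- c(Weekdays = sum(!is_weekend), Weekends = sum(is_weekend))
  per_day <- totals / c(Weekdays = 5, Weekends = 2)
  data.frame(group = names(totals),
             total = as.vector(totals),
             per_day = as.vector(per_day))
}

# Toy input: one event per day for 14 days, starting on Monday 2023-01-02
toy_dates <- seq(as.Date("2023-01-02"), by = "day", length.out = 14)
result <- per_day_counts(toy_dates)
print(result)
```

With evenly spread events the totals differ (10 versus 4) while the per-day values are equal, which is the pattern described above.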

3. Has there been any change in the frequency of earthquakes? [5 marks]

    #code 
    #create new variable Year (a character string such as "1965")
    Earthquake <- mutate(Earthquake, Year = format(Date, "%Y"))

    #(for setting figure axis)
    #min/max of Year and the largest yearly number of earthquakes
    minyear <- as.integer(min(Earthquake$Year))
    maxyear <- as.integer(max(Earthquake$Year))
    counteq <- Earthquake %>% count(Year)
    maxeq <- max(counteq$n)
    #round the y-axis limit up to the next multiple of 100
    maxy <- ceiling(maxeq/100)*100

    #plot (Year is discrete, so the axis breaks are given as character too)
    ggplot(data = Earthquake) + geom_bar(aes(Year),fill="cornflowerblue") + 
      theme_minimal() +
      labs(title = paste0("Number of earthquakes each year (", minyear,
                            "-", maxyear, ")")) +
      scale_x_discrete(breaks = as.character(seq(minyear,maxyear,by=5))) + 
      scale_y_continuous(breaks = seq(0,maxy,by=100), expand = c(0,0)) +
      coord_cartesian(ylim = c(0,maxy))

The figure above shows the yearly number of earthquakes from 1965 to 2016. The number of earthquakes per year increased steadily over this period, so I conclude that the yearly frequency of earthquakes in this dataset rose between 1965 and 2016.
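
One possible way to put a number on such a trend is a simple linear fit of yearly counts against the year. The sketch below uses made-up counts, not the assignment data, and a straight line is only an assumption about the shape of the trend.

```r
# Made-up yearly counts for illustration only
yearly <- data.frame(
  Year = 2000:2009,
  n    = c(10, 13, 13, 17, 18, 19, 23, 24, 25, 29)
)

# Least-squares line: n = intercept + slope * Year
fit <- lm(n ~ Year, data = yearly)

# The slope is the estimated change in the number of events per year
slope <- unname(coef(fit)["Year"])
print(slope)
```

A positive slope supports an increasing trend. On the real data the same call could be applied to a table of yearly counts with Year converted to a number.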

4. Where were there more earthquakes in the 1980s, South America or North America? [5 marks]

    #code 
    library(maps)

    ## 
    ## Attaching package: 'maps'

    ## The following object is masked from 'package:purrr':
    ## 
    ##     map

    #select earthquakes in the 1980s
    #(Year is stored as text, so convert it to a number before comparing)
    eq.1980s <- filter(Earthquake, as.integer(Year) >= 1980, as.integer(Year) <= 1989)
    #eq.1980s

    #First, plot a scatterplot map with all the earthquakes in 1980s
    world_map <- map_data("world")
    ggplot(NULL) + 
      geom_polygon(data = world_map, aes(x=long,y=lat,group=group), 
                   fill="azure3") + 
      geom_point(data = eq.1980s, aes(x=Longitude, y=Latitude), 
                 color = "cornflowerblue", size=.5) +
      theme_void() +
      ggtitle("Earthquakes in 1980s")

    ## Note that many earthquakes are located in the ocean and some on land
    ## Two options to assign earthquakes to South/North America:

    ##1. only count earthquakes that happened on land
    ##convert (lat,lon) to continent (a point-in-polygon problem)
    ##https://stackoverflow.com/questions/21708488/get-country-and-continent-from-longitude-and-latitude-point-in-r
    library(sp)
    library(rworldmap)

    ## ### Welcome to rworldmap ###

    ## For a short introduction type :   vignette('rworldmap')

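    #return the continent for each (lat, lon) pair, or NA if the point
    #does not fall inside any country polygon (e.g. in the ocean)
    latlon2Con <- function(lat,lon){
      sPDF <- getMap()   #world country polygons from rworldmap
      points <- data.frame(lon=lon,lat=lat)
      #use the same coordinate reference system as the map
      pointsSP <- SpatialPoints(points, proj4string = CRS(proj4string(sPDF)))
      #point-in-polygon lookup: one row of map attributes per point
      indices <- over(pointsSP, sPDF)
      return(indices$REGION)
      #return(indices$ADMIN)
    }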
    eq.1980s.land <- mutate(eq.1980s, 
                       Continent = latlon2Con(eq.1980s$Latitude,eq.1980s$Longitude))
    eq.1980s.land <- filter(eq.1980s.land, !is.na(Continent))
    ggplot(data = eq.1980s.land) +geom_bar(aes(Continent), fill="cornflowerblue")+
      theme_classic() + ggtitle("Earthquakes in 1980s (on land)")

    ##2. consider earthquakes both on land and in the ocean
    ##if an earthquake is located in the ocean, find the nearest continent
    ##calculate the centroid of each country
    #library(sf)
    #world_map_sf <- st_as_sf(world_map, coords = c("long","lat"), crs=4326)
    #crs = 4326 means lat lon are interpreted as WGS 84 coordinates
    #world_cens <- st_centroid(world_map_sf)
    #create index to speed up the nearest function
    #grid_index <- st_make_grid(world_cens)
    #find nearest centroids 
    #point <- data.frame(lon=100,lat=23)
    #point_sf <- st_as_sf(point, coords = c("lon","lat"), crs=4326,agr="constant")
    #nearest_points <- st_nearest_points(world_cens,point_sf,index=grid_index)
    #still takes too much time and memory, so this method was abandoned
    #nearest_points

The analysis above shows the geographic distribution of earthquakes in the 1980s and the number of earthquakes on each continent.

First, I plotted the raw locations of the earthquakes in the 1980s. Most of them took place in the major earthquake zones. Some happened on land, but a large number happened in the ocean. When identifying the continent of origin, I excluded the earthquakes in the ocean.

Ideally, two algorithms would be used to identify the origin of each earthquake: point-in-polygon for earthquakes on land and nearest-point search for earthquakes in the ocean. After some experiments, I found that computing nearest points for such a large dataset was not feasible on my laptop. In this solution I therefore adapted a function based on the rworldmap package. The latlon2Con function in the code decides whether a location (given by latitude and longitude) falls inside a country’s polygon and returns the continent that country belongs to. For earthquakes in the ocean, it returns NA.

With this function, I assigned a Continent value to each earthquake on land in the 1980s and drew a bar plot of the number of earthquakes on each continent. The bar plot shows that there were more earthquakes in South America than in North America in the 1980s.
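
To illustrate the point-in-polygon idea on its own, here is a minimal base R sketch of the ray-casting test. It is not what `sp::over()` runs internally, just one common way to answer the same question. The function name `point_in_polygon` and the square "continent" are hypothetical.

```r
# Ray casting: count how many polygon edges a horizontal ray starting at
# the point crosses; an odd number of crossings means the point is inside
point_in_polygon <- function(px, py, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- FALSE
  j <- n  # index of the previous vertex (the polygon is treated as closed)
  for (i in seq_len(n)) {
    xi <- poly_x[i]; yi <- poly_y[i]
    xj <- poly_x[j]; yj <- poly_y[j]
    # the edge (j -> i) straddles the ray's height and lies to the right
    crosses <- ((yi > py) != (yj > py)) &&
      (px < (xj - xi) * (py - yi) / (yj - yi) + xi)
    if (crosses) inside <- !inside
    j <- i
  }
  inside
}

# Hypothetical square "continent" spanning 0..10 in both coordinates
square_x <- c(0, 10, 10, 0)
square_y <- c(0, 0, 10, 10)

on_land  <- point_in_polygon(5, 5, square_x, square_y)   # inside the square
in_ocean <- point_in_polygon(15, 5, square_x, square_y)  # outside the square
print(c(on_land = on_land, in_ocean = in_ocean))
```

A lookup like latlon2Con repeats this kind of test against every country polygon and reports the attributes of the polygon that contains the point, or NA when none does.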

5. Have there been any geographic shifts in the distribution of earthquakes? [10 marks]

    #code 
    #assign a continent to every earthquake; ocean locations become NA
    eq.land <- mutate(Earthquake, Continent = 
                        latlon2Con(Latitude, Longitude))
    eq.land <- filter(eq.land, !is.na(Continent))

    #number of earthquakes per year and continent
    eq.land.num <- eq.land %>% count(Year, Continent)

    #heatmap: one tile per (Year, Continent), filled by the count
    ggplot(eq.land.num, aes(x=Year, y=Continent, fill=n)) +
      geom_tile() + theme_classic() +
      labs(title = paste0("Earthquakes on land (", minyear,
                            "-", maxyear, ")"),fill="Count") +
      scale_x_discrete(breaks = as.character(seq(as.integer(minyear),
                                                 as.integer(maxyear),by=5))) + 
      scale_y_discrete(expand = c(0,0)) +
      scale_fill_distiller(palette = "YlOrRd", direction = 1)

The figure above shows the number of earthquakes on each continent from 1965 to 2016. As before, I only considered earthquakes that happened on land, using the same function as in the previous question. According to the figure, there was no major shift in the location (continent of origin) of earthquakes over this period. Asia and South America consistently had more earthquakes than the other continents.
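
A complementary check is to compare each continent's share of earthquakes per decade, since a geographic shift would show up as changing shares. The sketch below uses base R with a few made-up rows, not the assignment data.

```r
# Made-up events for illustration only
toy <- data.frame(
  Year = c(1965, 1968, 1972, 1975, 1979, 1983, 1986, 1988),
  Continent = c("Asia", "Asia", "South America", "Asia",
                "South America", "Asia", "Asia", "South America")
)

# Group years into decades (1965 -> 1960, 1972 -> 1970, ...)
toy$Decade <- floor(toy$Year / 10) * 10

# Counts per decade and continent, converted to row-wise proportions
counts <- table(toy$Decade, toy$Continent)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row sums to 1, so rows can be compared directly even when the total number of recorded earthquakes differs between decades.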

6. Comment on how lessons from Wilke’s Fundamentals of Data Visualization were applied to each figure with specific reference to book sections [5 marks]

Overall, I tried to produce clean and readable figures, with clear coordinates and readable titles, legends, and axes.

For the figure showing the number of earthquakes on weekdays and weekends: the goal is to visualize amounts, so I chose a stacked bar plot. This way the figure shows the total number of earthquakes on weekdays and on weekends and also lets readers compare the individual days, giving more information at once. I also ordered the days so that they are displayed in a more readable way (i.e. following the order of the days in a week). Reference: Section 6.2

For the third question, I displayed the number of earthquakes each year to show how the yearly frequency changed over the period. As in the second question, I visualized amounts, but with more data points. Another option is to show the trend with a smoothed line plot. However, I wanted readers to have a clear sense of the specific numbers without omitting too much information, so I kept the bar plot. To give a clearer reference for the numbers, I switched to theme_minimal(), which displays the grid lines. This lets readers read the value of each bar more easily while keeping a clean visual style. Reference: Section 25, Section 28.3

For the fourth question, I used two figures.

The first is the global distribution of earthquakes in the 1980s:

- Although it is not directly asked for, I prefer to show the raw data first when a question involves further analysis. This gives the reader (and myself) a general idea of the data and helps avoid misunderstandings or mistakes in the later analysis. Here, I displayed the raw global distribution of earthquakes, which shows where earthquakes usually take place and leads to the point that many earthquakes happened in the ocean, while I only consider earthquakes on land in the following analysis. It also answers a likely question from readers, e.g. why the number of earthquakes in the next figure is much smaller than the numbers shown for the second and third questions.
- I used theme_void() to display only the global landmass and the earthquake locations. The colors of the landmass and the earthquakes were carefully chosen so that the earthquakes can be clearly distinguished from the landmass. It might be better to use a Robinson projection so that high-latitude areas are less distorted, but I had problems finding this projection in coord_map(). Since this question does not focus on high latitudes, I used the default settings. Reference: Section 4.1

The second is the number of earthquakes on each continent. Here the task is again to display amounts, so I used a simple bar plot to keep the figure clear.

For the fifth question, I used a heatmap to show how the number of earthquakes on each continent changes from year to year. If there were a geographic shift, say from one continent to another, it would be easy to spot on the heatmap. I also chose a sequential color palette (YlOrRd) so that darker fill colors represent larger counts. Reference: Section 6.3, Section 4.2
