The idea that the properties of the moon, whether position or luminosity and so on, have any impact on weather has long since been debunked. However, for the purposes of a data analytics exercise the availability of easily accessible data on lunar properties and the attributes of storms by date, does make for interesting analysis. We will be using the incredible R programming language to analyze and visualize our data using packages from the CRAN repository. Specifically, dplyr and lubridate from tidyverse (highly recommended), suncalc, and finally, GGally. Aside from providing functionality for wrangling our data, these packages also provide the data we’ll be using in our analysis. dyplyr provides the storms dataset, and suncalc provides the moon’s positions and the illuminated fraction by date.
Now that a baseline has been set, we need to define an interesting question to answer.
Developing the Question
All three (3) of the datasets have useful properties. For example, the storms dataset features the columns below.
1 name 2 year 3 month 4 day 5 hour 6 lat 7 long 8 status 9 category 10 wind 11 pressure 12 ts_diameter 13 hu_diameter
While moon illumination and moon position respectively feature (column names have been prefixed for later readability when merged):
1 date 2 mi_fraction 3 mi_phase 4 mi_angle
1 date 2 lat 3 long 4 mp_altitude 5 mp_azimuth 6 mp_distance 7 mp_parallacticangle
From the above we can observe that the data has the common column of date, and in the case of storms and moon position, also long (longitude) and lat (latitude). This opens up a myriad of questions we could ask the data. We could seek to find out if any correlations exist between any property of storms and any property of the moon by date or position. With this in mind, perhaps the question most people would be interested in would be the destructive power of a storm, which most often manifests in its wind speed.
So, we have arrived at our question – is there any correlation between the wind speed of storms and properties of the moon over time?
Preparing Storms Data
To merge the datasets in order to derive our correlations based on wind speed we must first bring them to common key. We have already identified date, latitude, and longitude as these keys. Both the moon position and illumination datasets have dates in a standard format, however, the storms dataset breaks out year, month, and day as separate columns. So to use the dates of storms to match the dates of lunar properties, we must reformat the columns in the storms dataset.
First, let’s look at the only the dates field in their current form. These are ten (10) dates from tropical storm Amy.
> subset(storms, select = c(1:4)) # A tibble: 10,010 x 4 name year month day <chr> <dbl> <dbl> <int> 1 Amy 1975 6 27 2 Amy 1975 6 27 3 Amy 1975 6 27 4 Amy 1975 6 27 5 Amy 1975 6 28 6 Amy 1975 6 28 7 Amy 1975 6 28 8 Amy 1975 6 28 9 Amy 1975 6 29 10 Amy 1975 6 29 # … with 10,000 more rows
Now let’s create a new data frame (stormsr) with a new column, rename it “date”, then remove the previous three (3) columns comprising the date shown above. A data frame is an in-memory structure for holding data in R. Consider it a table with indexed rows and columns.
# Derive date column stormsr <- cbind(as_date(paste(storms$year, storms$month, storms$day, sep = "-")), storms) # Add date column name names(stormsr)  <- "date" # Remove now redundant date fields stormsr <- subset(stormsr, select = -c(year, month, day)) # Add a meaningful prefix to all columns in this dataframe colnames(stormsr) <- tolower(c(colnames(stormsr), paste("st", colnames(stormsr[, c(2:ncol(stormsr))]), sep = "_")))
Now we have a date column ready to be merged with moon position and illumination.
head(subset(stormsr, select = c(1:4))) date name hour lat 1 1975-06-27 Amy 0 27.5 2 1975-06-27 Amy 6 28.5 3 1975-06-27 Amy 12 29.5 4 1975-06-27 Amy 18 30.5 5 1975-06-28 Amy 0 31.5 6 1975-06-28 Amy 6 32.4
Preparing Moon Illumination Data
The second of three (3) datasets is moon illumination data from the suncalc package. The package features a function in which you inject dates and it returns the illuminated fraction of the moon by date. Since our interest is dates of storms, we will provide the function with unique dates on which storms occurred.
# Get moon illumination on the dates of storms mooni <- getMoonIllumination(unique(stormsr$date)) # Add a meaningful prefix to all columns in this dataframe colnames(mooni) <- tolower(c(colnames(mooni), paste("mi", colnames(mooni[, c(2:ncol(mooni))]), sep = "_")))
Now we have another ready dataset.
> head(mooni) date mi_fraction mi_phase mi_angle 1 1975-06-27 0.8890171 0.6081085 1.179845 2 1975-06-28 0.8226903 0.6383494 1.156614 3 1975-06-29 0.7456759 0.6682515 1.143138 4 1975-06-30 0.6603653 0.6980358 1.140903 5 1975-07-01 0.5691094 0.7279311 1.150350 6 1975-07-02 0.4743917 0.7581549 1.171613
Preparing Moon Position Data
The third and final dataset is moon position data again provided through the suncalc package. The getMoonPosition function can accept a dataframe that contains dates, latitudes, and longitudes as parameters. As before, our interest is in how this relates to the wind speed of storms so we will inject appropriate dates, latitudes, and longitudes. We do this by breaking out those specific three (3) columns into a new data frame (moonp), rename the columns as expected by getMoonPosition, then inject the data frame as a parameter.
# Get a subset of columns from storm data for moon positioning moonp <- subset(stormsr, select = c(date, st_lat, st_long)) # Rename columns as needed to inject data into moon positioning # function names(moonp) <- "lat" names(moonp) <- "lon" # Get moon position on the dates of storms moonp <- getMoonPosition(data = moonp) # Add a meaningful prefix to all columns in this dataframe colnames(moonp) <- tolower(c(colnames(moonp), paste("mp", colnames(moonp[, c(2:ncol(moonp))]), sep = "_")))
The resulting values are now:
> head(moonp) date mp_lat mp_lon mp_altitude mp_azimuth mp_distance mp_parallacticangle 1 1975-06-27 27.5 -79.0 -0.5433772 -1.616085 402016.5 -1.1334748 2 1975-06-27 28.5 -79.0 -0.5424935 -1.626805 402016.5 -1.1130111 3 1975-06-27 29.5 -79.0 -0.5414233 -1.637496 402016.5 -1.0925719 4 1975-06-27 30.5 -79.0 -0.5401672 -1.648151 402016.5 -1.0721619 5 1975-06-28 31.5 -78.8 -0.6524281 -1.841398 404321.4 -0.9788887 6 1975-06-28 32.4 -78.7 -0.6467250 -1.851845 404321.4 -0.9602955
All Together Now
Notice that all three (3) datasets have at least one common column, date. Two (2) of the datasets share date, latitude, and longitude. It’s now time for us to merge these datasets into a single data frame.
Let’s start by merging storms data with moon illumination data. These two (2) datasets only have the date column in common. Notice as well that we will be doing a left outer join from storms to moon illumination. This means that we want every date from storms reflected in the result even if no date matches in moon illumination. That is, we don’t want any date on which there was a storm to be omitted.
# Merge storm data with moon illumination data stormsr <- merge(x = stormsr, y = mooni, by = "date", all = TRUE)
Now let’s take the merged dataframe and merge it again with the moon position dataset. Notice here that we are merging on all three (3) possible fields. This will give us the moon position properties on the dates of the storms, and directly above the storms.
# Merge storm and moon illumination data with # moon position stormsr <- mergedData <- merge(x = stormsr, y = moonp, by = c("date", "lat", "long"))
The first few rows of our complete dataset are below. We’re almost at the point where we can begin our analysis barring one final step.
> head(stormsr) date lat long st_name st_hour st_status st_category st_wind st_pressure st_ts_diameter st_hu_diameter mi_fraction mi_phase mi_angle mp_altitude mp_azimuth mp_distance mp_parallacticangle 1 1975-06-27 27.5 -79.0 Amy 0 tropical depression -1 25 1013 NA NA 0.8890171 0.6081085 1.179845 -0.5433772 -1.616085 402016.5 -1.1334748 2 1975-06-27 28.5 -79.0 Amy 6 tropical depression -1 25 1013 NA NA 0.8890171 0.6081085 1.179845 -0.5424935 -1.626805 402016.5 -1.1130111 3 1975-06-27 29.5 -79.0 Amy 12 tropical depression -1 25 1013 NA NA 0.8890171 0.6081085 1.179845 -0.5414233 -1.637496 402016.5 -1.0925719 4 1975-06-27 30.5 -79.0 Amy 18 tropical depression -1 25 1013 NA NA 0.8890171 0.6081085 1.179845 -0.5401672 -1.648151 402016.5 -1.0721619 5 1975-06-28 31.5 -78.8 Amy 0 tropical depression -1 25 1012 NA NA 0.8226903 0.6383494 1.156614 -0.6524281 -1.841398 404321.4 -0.9788887 6 1975-06-28 32.4 -78.7 Amy 6 tropical depression -1 25 1012 NA NA 0.8226903 0.6383494 1.156614 -0.6467250 -1.851845 404321.4 -0.9602955
Getting and Setting Object Classes
You might have noticed that some columns in the above data frame are numeric and some are not. If we hope to do compute operations on the data to derive correlations we need to ensure that all are numeric. Aside from inspecting them manually, we can apply the class function to each column to get its type.
Browse> lapply(stormsr, class) $date  "Date" $lat  "numeric" $long  "numeric" $st_name  "character" $st_hour  "numeric" $st_status  "character" $st_category  "ordered" "factor" $st_wind  "integer" $st_pressure  "integer" $st_ts_diameter  "numeric" $st_hu_diameter  "numeric" $mi_fraction  "numeric" $mi_phase  "numeric" $mi_angle  "numeric" $mp_altitude  "numeric" $mp_azimuth  "numeric" $mp_distance  "numeric" $mp_parallacticangle  "numeric"
We can observe that date, st_name, st_status, st_category, st_wind, and st_pressure are all non-numeric fields.
- date and st_name can be removed immediately since they would bring no value to our analysis
- st_status is a character variable that shows whether the storm is a tropical depression, tropical storm, or a hurricane. This is valuable and we will convert it to a factor to be included when plotting visualizations later with ggpairs
- st_category is a factor that we can also transform to a numeric
- st_wind and st_pressure are integers (not numeric) but its still good practice to bring them to a common type before performing computations
- Before we remove date from the dataset (having completed merges), let’s also extract just the most recent year to make our plots clearer and less crowded
- Let’s also remove any rows that have nulls (NA) to avoid them causing issues when computing correlations
# EXTRACT MOST RECENT YEAR #### stormsr <- subset(stormsr, stormsr$date > (max(stormsr$date) - 365)) # GENERATE AND PLOT GENERAL CORRELATIONS #### # Remove columns extraneous to correlations stormsr <- subset(stormsr, select = -c(date, st_name)) # Convert columns to numeric stormsr$st_status <- factor(stormsr$st_status) stormsr$st_category <- as.numeric(stormsr$st_category) stormsr$st_wind <- as.numeric(stormsr$st_wind) stormsr$st_pressure <- as.numeric(stormsr$st_pressure) # Remove any observations that may be incomplete stormsr <- as.data.frame(stormsr[complete.cases(stormsr), ])
Having done all that, we can again observe our columns by class showing one factor and the rest numeric:
> lapply(stormsr, class) $lat  "numeric" $long  "numeric" $st_hour  "numeric" $st_status  "factor" $st_category  "numeric" $st_wind  "numeric" $st_pressure  "numeric" $st_ts_diameter  "numeric" $st_hu_diameter  "numeric" $mi_fraction  "numeric" $mi_phase  "numeric" $mi_angle  "numeric" $mp_altitude  "numeric" $mp_azimuth  "numeric" $mp_distance  "numeric" $mp_parallacticangle  "numeric"
We can further observe that we have 222 rows of data ready for analysis.
> nrow(stormsr)  222
We begin our analysis by plotting the general correlations between all pairs of columns using ggcor from the GGally packages. This shows how each column relates to another through either a negative or positive correlation between -1 through 0, to +1. The closer values are to -1 or +1 then the stronger the correlation. Negative correlations reflect inverted relationships where generally speaking one value decreases as another increases. Consider altitude and temperature, the higher you go the lower the temperature. Positive correlations show joined relationships where as one value increases the other does the same. Consider height and shoe size.
ggcorr does an outstanding job of generating a step-wise matrix visualizing correlations from a given dataset.
# Generate plot showing correlations all columns stormsrc <- ggcorr(subset(stormsr, select = -c(st_status)), name = "Correlation", method = c("everything", "pearson"), low = "red", mid = "gray", high = "green", hjust = 1, size = 5, color = "grey50", label = TRUE, label_size = 3, label_round = 2, label_alpha = TRUE) + theme(plot.title = element_text(size = 24, hjust = 0.5, face = "bold", color = "black", margin = margin(0, 0, 30, 0)), legend.title = element_text(size = 12), legend.text = element_text(size = 12), legend.position = "right", text = element_text(size = 12, hjust = 0.5)) # Echo the plot to the console print(stormsrc)
Note the following:
- The plot is a heatmap showing correlations in a stepwise fashion. Greens show stronger positive correlations while reds show stronger negative correlations. For example, we can immediately see strong positive correlations between st_wind and st_category, and strong negative correlations between st_wind and st_pressure.
- Similarly, strong correlations between mi_phase and mi_angle, and mi_fraction and mp_distance.
- The method is “pearson“, also available are “kendall” and “spearman“.
This plot shows all columns against all columns, and it is possible to observe that st_wind has some weaker correlations to mi_angle and mp_azimuth, it would be helpful to get a richer representation.
For that, we turn to ggpairs.
# Generate chart showing density plots and correlations between the wind # speed of storms vs properties of the moon stormsrdc <- ggpairs(stormsr, title = "Storm Wind Speed vs Moon Position and Illumination", columns = c(6, 10:13), mapping = ggplot2::aes(color = st_status), upper = list(continuous = wrap("cor", size = 4, alignPercent = 0.9)), lower = list(continuous = wrap("points", alpha = 0.8))) + theme_minimal() + theme(plot.title = element_text(size = 24, hjust = 0.5, face = "bold", color = "black", margin = margin(0, 0, 30, 0)), axis.text.x = element_text(angle = 90, hjust = 1, size = 9), axis.text.y = element_text(angle = 0, hjust = 1, size = 9), legend.title = element_text(size = 12), legend.text = element_text(size = 12, angle = 90), legend.position = "top", text = element_text(size = 12, hjust = 0.5), panel.background = element_rect(fill = "white", color = "white"), panel.border = element_rect(linetype = "solid", color = "black", fill = NA)) # Echo the plot to the console print(stormsrdc)
We will base the bulk of our observations on the plot above. A few highlights:
- Fig 2 is split into two quadrants along a diagonal, an upper, and a lower.
- The upper quadrant shows the correlation of the two intersecting values by storm status. For example, the correlation of st_wind and mi_fraction when the st_status is hurricane (red), tropical depression (green), or tropical storm (blue). This follows for all pairs.
- The lower quadrant plots the correlations above and is colored by storm status. We notice some expected behaviors. For example, we see more red dots on the right of the plots in the leftmost column (st_wind) reflecting that hurricanes will, of course, have higher wind speeds than tropical storms and tropical depressions.
- We can also observe some mild clustering of hurricanes in the upper right of the plot showing correlations between mi_phase and mi_angle. This would suggest that there is a mild positive correlation when storms have reached the status of being a hurricane.
- We see a similarly mild correlation in the lower right quadrant of the mi_phase and mp_altitude correlation plot when storms have become hurricanes. This would suggest a weak but present negative correlation between mi_phase and mp_altitude under hurricane conditions.
- There are some interesting correlations between the properties of the moon. For example, mi_fraction and mi_phase show the expected bell curve, with no regard if a storm is a hurricane, tropical depression, or tropical storm.
- The diagonal shows density plots for st_wind, mi_fraction, mi_phase, mi_angle, and mi_altitude. Density plots show a smoothed distribution of values over a variable of interest. We see conspicuous spikes in mi_phase, mi_angle, and mp_altitude when a storm has achieved hurricane status.
Unsurprisingly no earth-shattering discoveries were made. There is no potent correlation between the wind speed of storms and lunar properties. However, this exercise demonstrated that we can derive robust analytics using R and a few phenomenal packages. We also were able to generate plots that were information rich yet compact.
This exercise focused on a weather phenomenon but could just as easily be applied to a business problem. For example, what if a business needed to understand which advertising channels had the highest correlations to sales by region or some other geographic. This same approach could also be applied to create visualizations to surface meaningful new knowledge.
Thank you for reading.