Our previous blog post focused on a data analytics exercise to examine the unlikely relationship between characteristics of storms and properties of the moon. Unsurprisingly, the long-established belief that no such meaningful relationship exists held true. Some correlations surfaced, but they were mild. However, an interesting side effect of that study was a merged dataset: data related to storms and the position and illumination of the moon were collapsed into a single object. We can now see the dates on which storms occurred and which properties of the moon were observed on those dates. We have the data, so why not perform some machine learning? Before we do, let’s set out a few concepts. More importantly, let’s also frame our questions so we know what we are trying to accomplish.
Machine learning is a discipline of artificial intelligence that focuses on computers inferring new knowledge, and possibly new behaviors based on that knowledge, from historical data. Historical data is sometimes referred to as experience. There are two (2) major subgroups of machine learning – supervised learning and unsupervised learning.
Supervised learning is the example we’ll be doing today, where data is introduced to some algorithm (a process known as “training”) to produce a model. A model is a construct used to perform some operation on new (but structurally similar) data that was not a part of training. Let’s consider a very simplified example. Say your business wants to predict the growth of sales in a particular region. You have historical sales data by week going back ten (10) years. There is also matching data on dollars spent on marketing by channel, going back the same ten (10) years. It’s possible to combine these two (2) sources to form a training dataset (the computer’s “experience”), then feed it into a machine learning algorithm. This would create a model that establishes the relationship between your marketing spends and your resulting sales.
However, that’s not where your business realizes value from machine learning. Now you need to use the trained model to predict the result you need. You need the model to answer the question concerning growth in sales. This is done by feeding the trained model planned marketing spends for it to predict future sales. That’s the value proposition of supervised learning.
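The sales example can be sketched in a few lines of R. This is a minimal illustration only: the data is synthetic, the column names (`spend_search`, `spend_social`, `sales`) are hypothetical, and a plain linear model stands in for whatever algorithm a real project would choose.

```r
# Hypothetical weekly history: marketing spend by channel and resulting sales.
set.seed(42)
n_weeks <- 520  # roughly ten years of weekly records
history <- data.frame(
  spend_search = runif(n_weeks, 1000, 5000),
  spend_social = runif(n_weeks, 500, 3000)
)
# Made-up relationship plus noise, purely for illustration
history$sales <- 20000 + 3.0 * history$spend_search +
  1.5 * history$spend_social + rnorm(n_weeks, sd = 1500)

# "Training": fit a model that relates marketing spend to sales
sales_model <- lm(sales ~ spend_search + spend_social, data = history)

# "Prediction": feed planned spend into the trained model
planned <- data.frame(spend_search = c(3000, 4500),
                      spend_social = c(1000, 2500))
predicted_sales <- predict(sales_model, newdata = planned)
print(predicted_sales)
```

The `planned` data frame plays the role of new but structurally similar data: it has the same predictor columns as the training set but no `sales` column, which is what the model fills in.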
Unsupervised learning is where an algorithm tries to detect groups in a dataset without being given labeled outcomes to learn from. This is commonly accomplished by finding centroids, or hubs, in the data. The algorithm then finds records that are reasonably similar to each hub and groups them, using some distance measure to estimate how far each record is from each hub.
For example, let’s say we have a dataset of cars that has each vehicle’s horsepower, engine displacement, mpg, and range. Our goal would be to estimate whether each record is a sports car, luxury car, or economy car. Notice that we have three (3) buckets, one for each type of car. This implicitly provides us with the number of centroids we need to create our model. It’s actually the “k” in probably the most popular algorithm for unsupervised learning, k-means clustering. The algorithm would create our three (3) centroids, then try to assign each record in the dataset to one of the three (3) matching groups by measuring their commonalities using the available columns. That is, it would try to associate each record with a group based on similarities in horsepower, displacement, mpg, and range.
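As a rough sketch of the car example, the snippet below runs k-means with three centroids on the `mtcars` dataset that ships with R. That dataset is a stand-in chosen for convenience: it has horsepower, displacement, and mpg, but no range column, so only three features are used here.

```r
# mtcars ships with base R; it has no "range" column, so three features are used
features <- scale(mtcars[, c("hp", "disp", "mpg")])  # put columns on a common scale

set.seed(1)
# k = 3 centroids, one per expected type of car
clusters <- kmeans(features, centers = 3, nstart = 25)

# Attach the assigned group to each car
grouped <- data.frame(car = rownames(mtcars), cluster = clusters$cluster)
print(table(grouped$cluster))   # how many cars landed in each group
print(clusters$centers)         # the three centroids, in scaled units
```

The cluster numbers are arbitrary labels. Deciding which group corresponds to "sports", "luxury", or "economy" is an interpretation made afterwards by inspecting the centroids.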
Now that we understand what machine learning is and what it can do, let’s frame questions we want to answer in this article, and then get into the details of the example.
Developing the Question
We previously mentioned having a merged dataset of storm and moon data by date. A possible interesting direction to take this exercise would be to try to predict the position of hurricanes in the dataset through latitude and longitude coordinates. It would also be interesting to extend that exercise to find the best performing machine learning algorithm for our hurricane prediction model. That is, among the dozens of algorithms available to perform machine learning, which works best for this particular example. Therefore, our two questions could be:
- Is it possible to accurately predict the position of a hurricane using the data we have?
- Which machine learning algorithm(s) is best suited for predicting the position of a hurricane using the data we have?
Let’s get started.
Preparing the Data
Fortunately the data is ready-made from our previous post on data analytics and there is not much more work needed to prepare it. The most interesting step here would be to split the data into two parts – one part for training the algorithm into a model, and another for testing the model. The approach we’ll be taking is as follows:
- Split the merged data into two datasets, one for training and another for testing. The latest record in the data is in the year 2015, so a natural split would be to use everything before 2015-01-01 for training, and everything from that date onward for testing. This will help us answer the first question of whether or not it is possible to predict the position of a hurricane using our available data.
- To answer the second question we will need to apply several different machine learning algorithms to create several different models and assess each model in turn to find the better performing (or best “fit”). We will be using the most excellent caret package that comes bundled with support for a great many algorithms. We’ll loop over a few, log the results, then pick our winner based on some measure of accuracy.
On to the code.
Creating Machine Learning Models
Having established our process, let’s get started by walking through some R code. We start with the files we need from disk: the first is the merged dataset, and the second is a list of algorithms from the caret package.
# Read data from disk
storms_filename <- "in/stormsr.csv"
algo_filename <- "in/regressors.csv"
Let’s create our training and testing set by splitting the data at the date 2015-01-01.
# SPLIT TRAINING AND TESTING DATASETS ####
# year() comes from the lubridate package
library(lubridate)
# Get the cut-off date for training vs testing data
testStartDate <- paste(as.character(year(max(data$date))), "01-01", sep = "-")
# Now get the testing dataset
testing <- subset(data, data$date >= testStartDate)
# Conversely get the training dataset
training <- subset(data, data$date < testStartDate)
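To make the split concrete, here is a small self-contained sketch on a toy data frame. The dates and latitudes are made up for illustration, and base R's `format()` is used in place of `year()` so that no extra package is needed.

```r
# Toy stand-in for the merged dataset (values are made up)
toy <- data.frame(
  date = as.Date(c("2013-08-01", "2014-09-15", "2014-12-31",
                   "2015-01-01", "2015-10-01", "2015-10-07")),
  lat  = c(20.1, 22.4, 25.0, 23.1, 25.4, 40.3)
)

# Cut-off is 1 January of the latest year found in the data
cutoff <- as.Date(paste0(format(max(toy$date), "%Y"), "-01-01"))

# Everything before the cut-off trains the model; the rest tests it
toy_training <- toy[toy$date < cutoff, ]
toy_testing  <- toy[toy$date >= cutoff, ]

print(cutoff)
print(nrow(toy_training))
print(nrow(toy_testing))
```

Splitting by date rather than at random keeps the test honest for this kind of problem: the model is only ever evaluated on storms that occurred after everything it was trained on.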
Now we need formulas to feed into the algorithms – one for predicting latitude and another for longitude. Notice that the formula includes characteristics of storms (prefix of “st_”) and characteristics of the moon (prefixes of “mi_” and “mp_”). This is indicating to the algorithms to create models to predict latitude and longitude of storms using known values from the data about storms and known values about the moon.
# Create a formula for latitude
formula_lat <- lat ~ st_day + st_month + long + st_status + st_category +
  st_wind + st_pressure + st_ts_diameter + st_hu_diameter +
  mi_fraction + mi_phase + mi_angle +
  mp_altitude + mp_azimuth + mp_distance + mp_parallacticangle
# Create a formula for longitude
formula_long <- long ~ st_day + st_month + lat + st_status + st_category +
  st_wind + st_pressure + st_ts_diameter + st_hu_diameter +
  mi_fraction + mi_phase + mi_angle +
  mp_altitude + mp_azimuth + mp_distance + mp_parallacticangle
The next step is to create models. Note as well that we’re logging the duration of each training command using Sys.time(). This will help us to determine which algorithm (shown as regressor here) is the fastest, bearing in mind that speed may not necessarily have any relationship with accuracy.
# Train model for predicting latitude and log duration
model_lat_start_time <- Sys.time()
model_lat <- train(formula_lat,
                   data = training,
                   method = regressor,
                   trControl = ctrl)
model_lat_endtime <- Sys.time()
model_lat_timetaken <- model_lat_endtime - model_lat_start_time
# Train model for predicting longitude and log duration
model_long_start_time <- Sys.time()
model_long <- train(formula_long,
                    data = training,
                    method = regressor,
                    trControl = ctrl)
model_long_endtime <- Sys.time()
model_long_timetaken <- model_long_endtime - model_long_start_time
Now that we have applied our training data to algorithms to create models, we’ll next use our models to perform predictions against our testing data (dates on or after 2015-01-01) based on the formulas above. We’ll perform predictions for latitude and longitude separately.
# Perform prediction for latitudes
predictions_lat <- predict(model_lat, newdata = testing_lat)
# Perform prediction for longitudes
predictions_long <- predict(model_long, newdata = testing_long)
We will create a model and perform predictions for each algorithm found in the regressors.csv file we read from disk earlier. Each resulting prediction will be logged to disk for the next step.
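The loop over algorithms might look something like the sketch below. It is not the code used for this post: the data is synthetic, the formula is shortened, the two caret methods (`"lm"` and `"knn"`) are a hypothetical stand-in for the contents of regressors.csv, the 3-fold cross-validation in `ctrl` is an assumed setting, and results are collected in memory rather than logged to disk.

```r
library(caret)

set.seed(123)
# Synthetic stand-in for the merged storm and moon data
n <- 200
toy <- data.frame(st_wind     = runif(n, 30, 150),
                  st_pressure = runif(n, 900, 1010),
                  mi_fraction = runif(n))
toy$lat <- 10 + 0.1 * toy$st_wind + 0.01 * toy$st_pressure + rnorm(n)
toy_training <- toy[1:150, ]
toy_testing  <- toy[151:200, ]

# Assumed resampling setup: 3-fold cross-validation
ctrl <- trainControl(method = "cv", number = 3)

# Hypothetical stand-in for the algorithm list read from regressors.csv
regressors <- c("lm", "knn")

results <- list()
for (regressor in regressors) {
  # Train one model per algorithm and time it
  start_time <- Sys.time()
  model <- train(lat ~ st_wind + st_pressure + mi_fraction,
                 data = toy_training, method = regressor, trControl = ctrl)
  time_taken <- difftime(Sys.time(), start_time, units = "secs")

  # Predict on the held-out rows and keep the results with the algorithm name
  results[[regressor]] <- data.frame(
    regressor       = regressor,
    row             = seq_len(nrow(toy_testing)),
    lat             = toy_testing$lat,
    predictions_lat = predict(model, newdata = toy_testing),
    seconds         = as.numeric(time_taken)
  )
}

# One long data frame: one row per algorithm per test record
all_predictions <- do.call(rbind, results)
print(head(all_predictions))
```

Keeping the algorithm name alongside every prediction is what makes the later comparison possible, since each test record ends up with one candidate prediction per algorithm.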
Finding the Best Results
Choosing the right algorithm to power your machine learning model can be a complex affair. Different algorithms have different use cases: some are applicable to regression problems, some to classification, and some are capable of both. Algorithms are also tuned in different ways and come with different tuning parameters. They also offer different trade-offs between speed and accuracy.
Further, how the performance of an algorithm is measured is also important. Regressors and classifiers are measured very differently. There are theoretical measurements such as MAE or RMSE for regressors, and AUC (the area under the ROC curve) for classifiers. Then there are real-world measurements of regressor prediction performance in the wild, such as residuals. A residual is the difference between a prediction and the actual observed value in the real world. For example, if we predicted that the latitude of Hurricane Joaquin would be 24.37 degrees on 2015-10-01, but the actual observed latitude was 25.37 degrees, then the residual would be 1 degree. That would be the error of our prediction, that is, the real-world bottom-line performance that end users care about.
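The relationship between residuals, MAE, and RMSE can be shown with a few lines of base R. The first pair of values is the Joaquin example from the text; the other two pairs are made up to give the summary measures something to average over.

```r
# First pair is the example from the text; the rest are made-up values
observed  <- c(25.37, 23.10, 23.60)
predicted <- c(24.37, 23.06, 23.60)

# Residual: observed value minus predicted value, per record
residuals <- observed - predicted
abs_residuals <- abs(residuals)

# Summary measures over all records
mae  <- mean(abs_residuals)        # mean absolute error
rmse <- sqrt(mean(residuals^2))    # root mean squared error

print(abs_residuals)
print(mae)
print(rmse)
```

RMSE squares each residual before averaging, so a single large miss weighs more heavily in RMSE than in MAE.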
So, to choose the best performing algorithm we will be choosing those predictions with the lowest residuals for predictions of latitude and longitude. The following steps detail how this is done.
First, we obtain residuals. Notice that the values are made absolute with the abs function. An absolute value is simply the magnitude of a number with its sign removed, regardless of whether the number was negative or positive.
# Get residuals for latitude and longitude
predictions$residuals_lat <- abs(predictions$lat - predictions$predictions_lat)
predictions$residuals_long <- abs(predictions$long - predictions$predictions_long)
Now we compound the residuals into a single value by multiplying them.
# Gather residuals into a single column
predictions$residuals_stacked <- predictions$residuals_lat * predictions$residuals_long
Now get the minimum residuals for the compounded value of latitude and longitude by date. That is, get the best predictions for both latitude and longitude that day, for that storm.
# Get the minimum residuals by date
min_residuals <- aggregate(predictions$residuals_stacked, list(predictions$date), min)
# Rename the aggregated value column (the second column) as appropriate
names(min_residuals)[2] <- "residuals_stacked"
Finally, merge the best residuals with our predictions data frame to get a full set of columns, including the best performing algorithms.
# Find rows that match the minimum residuals
predictions <- merge(predictions, min_residuals, by = "residuals_stacked")
# Remove extraneous columns
predictions <- subset(predictions, select = -c(Group.1))
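The selection steps above can be tried end to end on a toy data frame. The prediction values below are invented, the three algorithm names are only labels, and this sketch merges on both the date and the stacked residual (rather than the residual alone) so that a value cannot accidentally match a row from another day.

```r
# Toy predictions: three algorithms, two dates (prediction values are made up)
toy_predictions <- data.frame(
  date      = rep(c("2015-10-01", "2015-10-02"), each = 3),
  regressor = rep(c("gam", "ranger", "ridge"), times = 2),
  lat       = rep(c(23.1, 23.6), each = 3),
  long      = rep(c(-73.7, -74.8), each = 3),
  predictions_lat  = c(23.5, 23.0, 24.0, 23.7, 24.5, 23.6),
  predictions_long = c(-74.0, -74.1, -73.0, -75.5, -74.9, -74.0)
)

# Absolute residuals for latitude and longitude
toy_predictions$residuals_lat <-
  abs(toy_predictions$lat - toy_predictions$predictions_lat)
toy_predictions$residuals_long <-
  abs(toy_predictions$long - toy_predictions$predictions_long)

# Compound the two residuals into one value by multiplying them
toy_predictions$residuals_stacked <-
  toy_predictions$residuals_lat * toy_predictions$residuals_long

# Smallest stacked residual per date
toy_min <- aggregate(residuals_stacked ~ date, data = toy_predictions, FUN = min)

# Keep only the rows that achieved the minimum on each date
best <- merge(toy_predictions, toy_min, by = c("date", "residuals_stacked"))
print(best[, c("date", "regressor", "residuals_lat", "residuals_long",
               "residuals_stacked")])
```

One property of the product is worth keeping in mind when reading the results: it is small whenever either residual is small. In the toy data, the winning row for the second date has an exact latitude but a longitude that is off by 0.8 degrees, and its stacked residual is still 0.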
Our final data frame looks like this.
> predictions %>% select(date, st_name, lat, long, predictions_lat, predictions_long, regressor, residuals_stacked)
        date st_name  lat  long predictions_lat predictions_long   regressor residuals_stacked
1 2015-09-30 Joaquin 25.4 -71.8        25.40540        -73.97837         gam       0.011753685
2 2015-10-01 Joaquin 23.1 -73.7        23.06251        -74.10402      ranger       0.015146710
3 2015-10-02 Joaquin 23.6 -74.8        23.60023        -71.00883       ridge       0.000865186
4 2015-10-03 Joaquin 25.4 -72.6        25.39637        -68.43249         dnn       0.015148261
5 2015-10-04 Joaquin 28.9 -68.3        27.70000        -68.30000         qrf       0.000000000
6 2015-10-05 Joaquin 32.6 -66.0        32.59912        -70.88011    bagEarth       0.004306631
7 2015-10-06 Joaquin 37.9 -60.4        37.02775        -60.43105 bagEarthGCV       0.027079906
8 2015-10-07 Joaquin 40.3 -51.5        40.28145        -53.14296   xgbLinear       0.030473250
For clarity, let’s focus our observations on Hurricane Joaquin as presented above. We note the following:
- It is possible to now answer our first question – we can predict, to some measurable accuracy, the path of a hurricane. We can observe the differences between lat and predictions_lat, and long and predictions_long respectively. We can also observe the compounded, or stacked, residuals.
- Most importantly, we were only able to get these results by applying multiple algorithms. Of the eight (8) rows available, there are eight (8) different algorithms that provide the best prediction for Hurricane Joaquin on that day. This further implies that no single algorithm was optimal for this prediction problem. This makes sense since we found no strong correlations in our data analysis. Machine learning algorithms tend to perform better when meaningful correlations exist in the data.
- Finally, we did not perform any feature engineering in this exercise. It’s possible that if we did that pre-work we might have seen a prevalent algorithm surfacing to give optimal results.
Let’s attach the table above to a visualization using leaflet. The circles show predictions, while the markers show the actual positions.
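A map along those lines could be built roughly as follows. This is a sketch rather than the code behind the original visualization: it uses the first four Joaquin rows from the table above, and the styling choices (circle radius, color, popup contents) are assumptions.

```r
library(leaflet)

# First four Joaquin rows, copied from the table above
joaquin <- data.frame(
  date = c("2015-09-30", "2015-10-01", "2015-10-02", "2015-10-03"),
  lat  = c(25.4, 23.1, 23.6, 25.4),
  long = c(-71.8, -73.7, -74.8, -72.6),
  predictions_lat  = c(25.40540, 23.06251, 23.60023, 25.39637),
  predictions_long = c(-73.97837, -74.10402, -71.00883, -68.43249)
)

storm_map <- leaflet(joaquin) %>%
  addTiles() %>%
  # Markers: actual observed positions
  addMarkers(lng = ~long, lat = ~lat, popup = ~date) %>%
  # Circles: predicted positions
  addCircleMarkers(lng = ~predictions_long, lat = ~predictions_lat,
                   radius = 6, color = "red", popup = ~date)

# Display the map when running interactively
if (interactive()) print(storm_map)
```

Plotting the actual and predicted positions as separate layers makes it easy to compare, day by day, how far each circle sits from its marker.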
We were ultimately able to answer both questions of interest. We can predict the path of a hurricane using the combined historical storm and moon data. However, it was done using somewhat unconventional means. The predictions were only possible by harvesting the best performing algorithm for each day for each storm. Ensemble approaches such as this are not terribly uncommon but, needless to say, are more difficult to create and to operate. This is especially true in public cloud ecosystems where customers oftentimes pay by the hour of computing time. It may not be feasible to operate dozens of algorithms just to pick the single best performing result.
Having said that, we already knew the data was not highly correlated, so achieving this degree of accuracy at all is still good performance. If this were a more typical dataset (not moon versus storm), then we could have expected to see good, even stellar, performance from a single algorithm, especially if we had performed feature engineering and tuning.
Thank you for reading.