# My Data Science Blogs

## January 22, 2019

### Top Stories, Jan 14-20: How to go from Zero to Employment in Data Science; How to build an API for a machine learning model in 5 minutes using Flask

Also: End To End Guide For Machine Learning Projects; How to solve 90% of NLP problems: a step-by-step guide; Data Scientists Dilemma: The Cold Start Problem - Ten Machine Learning Examples; Ontology and Data Science; 9 Must-have skills you need to become a Data Scientist, updated

### Image Dithering in R

This January I played the most intriguing computer game I’ve played in ages: The Return of the Obra Dinn. Besides being a masterpiece of murder-mystery storytelling, it also has a unique art style: it only uses black and white pixels. To pull this off, Obra Dinn makes use of image dithering: arranging pixels of low color resolution to emulate the color shades in between. Since the game was over all too quickly, I thought I would instead explore how basic image dithering can be implemented in R. If old-school graphics pique your interest, read on! There will be some grainy-looking ggplot charts at the end.

(The image above is copyright Lucas Pope and is the title screen of The Return of the Obra Dinn)

## Horatio Nelson in black and white

Image dithering tries to solve the problem that you want to show an image with many shades of color, but your device can only display a much smaller number of colors. This might sound like a silly problem now, but was a very real problem in the early days of computers. For example, the original Mac could only display black and white pixels, not even any shades of grey!

So let’s do some image dithering in R! The Return of the Obra Dinn takes place on an early 19th-century East Indiaman ship, so let’s use something related as an example image. Why not a low-resolution painting of Vice Admiral Horatio Nelson (1758–1805), the British officer who defeated the French and Spanish navies at the Battle of Trafalgar? To read in the image I will use the imager package.

library(imager)
# The painting is read in with load.image() (the file path here is assumed)
nelson <- load.image("nelson.jpg")
nelson

## Image. Width: 199 pix Height: 240 pix Depth: 1 Colour channels: 3


The imager package is really useful when you want to manipulate (and mess with) images directly in R. The nelson object is now a cimg object, which is basically an array with dimensions Width, Height, Depth (a time dimension, if you have a series of images), and Color channels. More importantly, cimg objects can be plotted:

plot(nelson)


As an example, I’m going to do black-and-white dithering so let’s remove the color and any transparency (called “alpha” here) from the image.

nelson_gray <- grayscale( rm.alpha(nelson) )
plot(nelson_gray)


Now let’s imagine that we want to display this image using only black and white pixels. Before getting to the dithering, what would the simplest method be to achieve this? Well, we could just threshold the image: pixels with values from 0.0 (fully black) up to 0.5 are made black, and pixels with values above 0.5 are set to white (1.0). This is easy to do, as nelson_gray can be treated as a matrix:

nelson_threshold <- nelson_gray > 0.5
plot(nelson_threshold)


So, while the image looks kind of cool, it has lost a lot of nuance, as large parts of it are now completely black. So how do we fake shades of gray using only black and white pixels? Well, you can dither the image, that is, add some noise to it as you reduce the number of colors. Let’s start by trying out the most basic kind of noise, white noise, here created using the runif function:

rand_matrix <- matrix(
  data = runif(length(nelson_gray)),
  ncol = ncol(nelson_gray), nrow = nrow(nelson_gray))
rand_cimg <- as.cimg(rand_matrix)
plot(rand_cimg)


Each pixel in rand_cimg is a value from 0.0 to 1.0, and we can now threshold against rand_cimg instead of the static 0.5. Averaged over many different noise images, every black-and-white pixel will then have the same expected value as the original grayscale pixel. This sounds like a good property, but let’s see how it looks with the current rand_cimg:

nelson_rand <- nelson_gray > rand_cimg
plot(nelson_rand)
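The “same value on average” claim above is easy to verify in plain R, independent of any image (the 0.7 gray value below is just an arbitrary example):

```r
# Thresholding a gray value g against uniform noise turns a pixel white
# with probability g, so the mean of many dithered pixels approaches g.
set.seed(42)
gray_value <- 0.7
dithered <- as.numeric(gray_value > runif(100000))
mean(dithered)  # close to 0.7
```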


Being correct on average doesn’t help much in practice: we get a very noisy Nelson. But at least, if you squint, you can now see shades of gray in the picture. Random noise is just too random, but maybe we can get better dithering by adding less random noise. What about a checker pattern?

checker_pattern <- rbind(c(1/3, 2/3),
                         c(2/3, 1/3))
plot(as.cimg(checker_pattern))


The pattern above uses cutoffs of 1/3 and 2/3, so Nelson-pixels that get compared to a darker 1/3-pixel are more likely to go white, and Nelson-pixels that are compared to a lighter 2/3-pixel will tend to go black. Let’s scale this pattern up to Nelson size.

# rep_mat takes a matrix (mat) and tiles it so that the resulting
# matrix has size nrow_out x ncol_out.
# It's basically a 2d version of rep()
rep_mat <- function(mat, nrow_out, ncol_out) {
  mat[rep(seq_len(nrow(mat)), length.out = nrow_out),
      rep(seq_len(ncol(mat)), length.out = ncol_out)]
}

checker_cimg <- as.cimg(rep_mat(checker_pattern, nrow(nelson_gray), ncol(nelson_gray)))
plot(checker_cimg)


And let’s do the thresholding with this new checker pattern:

nelson_checker <- nelson_gray > checker_cimg
plot(nelson_checker)


Well, it’s not good, but it looks like we got at least one shade of gray now, compared to using the static 0.5. Actually, that’s exactly what we got! We can see that by taking a smooth gradient…

gradient <- as.cimg(rep(seq(0, 1, 0.01), 101), x = 101, y = 101)


… and thresholding with the checker pattern:

checker_cimg <- as.cimg(rep_mat(checker_pattern,
                                nrow(gradient), ncol(gradient)))
plot(gradient > checker_cimg)

This gives us three columns: black, “checker-gray”, and white. So, using the checker pattern we can achieve some more nuance than with simple thresholding. Is there perhaps an even better pattern that allows for even more nuance?
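Before moving on, the three-level behavior is easy to confirm with a few numbers: a block of constant gray thresholded against the checker pattern goes all black, half white, or all white, depending on where it falls relative to the two cutoffs.

```r
# A constant gray block thresholded against the 1/3-2/3 checker ends up
# all black, half white, or all white: three output levels in total.
checker_pattern <- rbind(c(1/3, 2/3),
                         c(2/3, 1/3))
for (g in c(0.2, 0.5, 0.8)) {
  block <- matrix(g, nrow = 2, ncol = 2)
  cat(g, "->", mean(block > checker_pattern), "white\n")
}
# 0.2 -> 0 white, 0.5 -> 0.5 white, 0.8 -> 1 white
```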

## Better patterns, Bayer patterns

Yes, there is! The classical pattern used in many image dithering implementations is the Bayer pattern (or Bayer matrix), named after its inventor Bryce Bayer. It’s an evolution of the checker pattern, defined for matrices of size 2×2, 4×4, 8×8, etc. The exact construction and properties of the Bayer matrix are well described in the Wikipedia article, but here is how to create it in R and how it looks:

# Calculates a non-normalized Bayer pattern matrix of size 2^n
recursive_bayer_pattern <- function(n) {
  if(n <= 0) {
    return(matrix(0))
  }
  m <- recursive_bayer_pattern(n - 1)
  rbind(
    cbind(4 * m + 0, 4 * m + 2),
    cbind(4 * m + 3, 4 * m + 1))
}

# Returns a Bayer pattern of size 2^n normalized so all values
# are between 0.0 and 1.0.
normalized_bayer_pattern <- function(n) {
  pattern <- recursive_bayer_pattern(n)
  (1 + pattern) / (1 + length(pattern))
}

par(mfcol = c(1, 3), mar = c(0, 0, 2, 1), ps = 18)
plot(as.cimg(normalized_bayer_pattern(1)), main = "Bayer 2×2")
plot(as.cimg(normalized_bayer_pattern(2)), main = "Bayer 4×4")
plot(as.cimg(normalized_bayer_pattern(3)), main = "Bayer 8×8")
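A quick numeric check shows why these matrices work so well: a 2^n × 2^n Bayer matrix contains every integer threshold exactly once (recursive_bayer_pattern is repeated here so the snippet runs on its own):

```r
# The 4x4 Bayer matrix holds every integer from 0 to 15 exactly once,
# so after normalization it provides 16 evenly spaced gray thresholds.
recursive_bayer_pattern <- function(n) {
  if (n <= 0) return(matrix(0))
  m <- recursive_bayer_pattern(n - 1)
  rbind(cbind(4 * m + 0, 4 * m + 2),
        cbind(4 * m + 3, 4 * m + 1))
}

bayer_4x4 <- recursive_bayer_pattern(2)
sort(as.vector(bayer_4x4))  # 0, 1, 2, ..., 15
```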


Basically, a Bayer matrix contains as many shades of gray as it’s possible to fit in there, and the shades are as spread out as possible. Let’s see how a 4×4 Bayer matrix transforms the smooth gradient:

bayer_cimg <- as.cimg(rep_mat(normalized_bayer_pattern(2),
                              nrow(gradient), ncol(gradient)))
plot(gradient > bayer_cimg)

Pretty smooth! We get the classical “crosshatch” patterns reminiscent of last-century computer graphics. Let’s give Admiral Nelson the same treatment:

bayer_matrix <- rep_mat(normalized_bayer_pattern(2),
                        nrow(nelson_gray), ncol(nelson_gray))
bayer_cimg <- as.cimg(bayer_matrix)
nelson_bayer <- nelson_gray > bayer_cimg
plot(nelson_bayer)


Now he looks doubly old-school. So far I’ve only worked with grayscale images and black-and-white dithering, but we can quickly hack together some color dithering by just performing the dither thresholding on one color channel at a time.

nelson_bayer_color <- nelson
for(rgb_i in 1:3) {
  color_channel <- nelson_bayer_color[ , , 1, rgb_i, drop = FALSE]
  nelson_bayer_color[ , , 1, rgb_i] <- color_channel > bayer_cimg
}
plot(nelson_bayer_color)


This method does not generalize to arbitrary color scales, but I still think it looks pretty cool!
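Why eight colors? With each channel thresholded to 0 or 1 independently, the only possible outputs are the corners of the RGB cube, which is easy to enumerate:

```r
# Every dithered pixel has each of R, G, B set to either 0 or 1,
# so the palette is the 2^3 = 8 corners of the RGB color cube.
palette <- expand.grid(r = 0:1, g = 0:1, b = 0:1)
nrow(palette)  # 8
```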

## Image dithering ggplots in R

Finally, I’ll show you how to dither some ggplots. Below is most of the code above wrapped up into functions:

# rep_mat takes a matrix (mat) and tiles it so that the resulting
# matrix has size nrow_out × ncol_out.
# It's basically a 2d version of rep()
rep_mat <- function(mat, nrow_out, ncol_out) {
  mat[rep(seq_len(nrow(mat)), length.out = nrow_out),
      rep(seq_len(ncol(mat)), length.out = ncol_out)]
}

# Calculates a Bayer pattern matrix of size 2^n
# Source: https://gist.github.com/MehdiNS/bd41bbc6db780c9409157d35d331ac80
recursive_bayer_pattern <- function(n) {
  if(n <= 0) {
    return(matrix(0))
  }
  m <- recursive_bayer_pattern(n - 1)
  rbind(
    cbind(4 * m + 0, 4 * m + 2),
    cbind(4 * m + 3, 4 * m + 1))
}

# Returns a Bayer pattern of size 2^n normalized so all values
# are between 1 / (m + 1) and m / (m + 1) where m is the number
# of elements in the 2^n × 2^n matrix.
normalized_bayer_pattern <- function(n) {
  pattern <- recursive_bayer_pattern(n)
  (1 + pattern) / (1 + length(pattern))
}

# Returns an nrow_out × ncol_out cimg image repeating a 4×4 Bayer pattern
rep_bayer_cimg <- function(nrow_out, ncol_out) {
  bayer_matrix <- rep_mat(normalized_bayer_pattern(2), nrow_out, ncol_out)
  as.cimg(bayer_matrix)
}

# Transforms a cimg image into a dithered black and white image
img_to_bayer_bw <- function(img) {
  img <- grayscale(rm.alpha(img))
  bayer_cimg <- rep_bayer_cimg(nrow(img), ncol(img))
  img >= bayer_cimg
}

# Transforms a cimg image into a dithered color image with 8 colors.
img_to_bayer_color <- function(img) {
  img <- rm.alpha(img)
  bayer_cimg <- rep_bayer_cimg(nrow(img), ncol(img))
  for(rgb_i in 1:3) {
    color_channel <- img[ , , 1, rgb_i, drop = FALSE]
    img[ , , 1, rgb_i] <- color_channel >= bayer_cimg
  }
  img
}


Let’s then create the ggplot we will transform.

library(ggplot2)
ggplot(mtcars, aes(factor(cyl), mpg, fill = factor(cyl))) +
  geom_violin(color = "black") +
  theme_classic() +
  theme(axis.text = element_text(colour = "black"))


Then we’ll turn it into a low res cimg image.

# This function is a hack to read in a ggplot2 plot as a cimg image
# by saving it as a png to disk and reading it back in.
ggplot_to_cimg <- function(width, height, dpi) {
  tmp_fname <- tempfile(fileext = ".png")
  ggsave(tmp_fname, width = width, height = height, dpi = dpi, antialias = "none")
  load.image(tmp_fname)
}

plot_img <- ggplot_to_cimg(width = 3, height = 2, dpi = 140)
plot(plot_img)


Finally, we can turn it into a retro-dithered black and white plot…

plot( img_to_bayer_bw(plot_img) )


…or a dithered eight-color plot.

plot( img_to_bayer_color(plot_img) )


Something for your next retro-inspired presentation, maybe? If you want full control over your image dithering, it would probably be more convenient to post-process your plots in an image editor, such as the free and open-source GIMP, rather than to do it directly in R.

This post has covered basic image dithering in R and, to be more specific, it has covered ordered dithering. There is also error-diffusion dithering which works in a rather different way. But that’s for another time. Now I’m going to go back to mourning that I’ve already finished The Return of the Obra Dinn.
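For the curious, the error-diffusion idea can be sketched in a few lines of base R, to contrast with the ordered dithering above. This is a simplified illustration on a plain matrix using the classic Floyd-Steinberg weights, not imager-specific code: each pixel is thresholded, and its rounding error is pushed onto not-yet-visited neighbors.

```r
# Minimal Floyd-Steinberg error diffusion on a plain numeric matrix:
# threshold each pixel, then distribute the rounding error to the
# right, lower-left, lower, and lower-right neighbors.
floyd_steinberg <- function(img) {
  h <- nrow(img); w <- ncol(img)
  for (y in 1:h) {
    for (x in 1:w) {
      old <- img[y, x]
      new <- as.numeric(old >= 0.5)
      img[y, x] <- new
      err <- old - new
      if (x < w)          img[y, x + 1]     <- img[y, x + 1]     + err * 7 / 16
      if (y < h && x > 1) img[y + 1, x - 1] <- img[y + 1, x - 1] + err * 3 / 16
      if (y < h)          img[y + 1, x]     <- img[y + 1, x]     + err * 5 / 16
      if (y < h && x < w) img[y + 1, x + 1] <- img[y + 1, x + 1] + err * 1 / 16
    }
  }
  img
}

# A uniform 40% gray block dithers to a mix of 0s and 1s whose mean
# stays close to 0.4
dithered <- floyd_steinberg(matrix(0.4, nrow = 8, ncol = 8))
mean(dithered)
```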

### If you did not already know

Beetle Antennae Search (BAS)
Meta-heuristic algorithms have become very popular because of their powerful performance on optimization problems. A new algorithm called the beetle antennae search algorithm (BAS) is proposed in the paper, inspired by the searching behavior of longhorn beetles. The BAS algorithm imitates the function of the antennae and the random-walking mechanism of beetles in nature, implementing two main steps of detecting and searching. Finally, the algorithm is benchmarked on 2 well-known test functions, and the numerical results validate the efficacy of the proposed BAS algorithm.
BSAS: Beetle Swarm Antennae Search Algorithm for Optimization Problems

Spectral topic modeling algorithms operate on matrices/tensors of word co-occurrence statistics to learn topic-specific word distributions. This approach removes the dependence on the original documents and produces substantial gains in efficiency and provable topic inference, but at a cost: the model can no longer provide information about the topic composition of individual documents. Recently, Thresholded Linear Inverse (TLI) was proposed to map the observed words of each document back to its topic composition. However, its linear characteristics limit the inference quality, as it does not consider the important prior information over topics. In this paper, we evaluate the Simple Probabilistic Inverse (SPI) method and the novel Prior-aware Dual Decomposition (PADD), which is capable of learning document-specific topic compositions in parallel. Experiments show that PADD successfully leverages topic correlations as a prior, notably outperforming TLI and learning quality topic compositions comparable to Gibbs sampling on various data. …

In this paper, we focus on approaches to parallelizing stochastic gradient descent (SGD) wherein data is farmed out to a set of workers, the results of which, after a number of updates, are then combined at a central master node. Although such synchronized SGD approaches parallelize well in idealized computing environments, they often fail to realize their promised computational acceleration in practical settings. One cause is slow workers, termed stragglers, who can cause the fusion step at the master node to stall, greatly slowing convergence. In many straggler mitigation approaches, work completed by these nodes, while only partial, is discarded completely. In this paper, we propose an approach to parallelizing synchronous SGD that exploits the work completed by all workers. The central idea is to fix the computation time of each worker and then to combine the distinct contributions of all workers. We provide a convergence analysis and optimize the combination function. Our numerical results demonstrate an improvement of several factors of magnitude in comparison to existing methods. …

### Five Major AI Predictions to Watch in 2019

Will 2019 be the year your organization embraces AI? If not, it might be too late. In this live webinar, we will unveil five major predictions companies should pay close attention to in 2019. Register now!

### Cigna: Data Scientist [Remote, US]

Cigna is seeking a Data Scientist who is passionate about machine learning (both supervised and unsupervised), building models, data mining and NLP in their Service Operations group. This is a work from home position for candidates based in the US.

### R Packages worth a look

‘purrr’-Like Apply Functions Over Input Elements (dapr)
An easy-to-use, dependency-free set of functions for iterating over elements of various input objects. Functions are wrappers around base apply()/lappl …

Evaluation of Failure Time Surrogate Endpoints in Individual Patient Data Meta-Analyses (surrosurv)
Provides functions for the evaluation of surrogate endpoints when both the surrogate and the true endpoint are failure time variables. The approaches i …

Density Surface Modelling of Distance Sampling Data (dsm)
Density surface modelling of line transect data. A Generalized Additive Model-based approach is used to calculate spatially-explicit estimates of anima …

### Darrell Huff (4) vs. Monty Python; Frank Sinatra advances

In yesterday’s battle of the Jerseys, Jonathan offered this comment:

Sinatra is an anagram of both artisan and tsarina. Apgar has no English anagram. Virginia is from New Jersey. Sounds confusing.

And then we got this from Dzhaughn:

I got as far as “Nancy’s ancestor,” and then a Youtube clip of Joey Bishop told me, pal, stop he’s a legend, he don’t need no backronym from you or anybody. He don’t need no google doodle, although it would have been a respectful gesture on his 100th birthday, but nevermind. He’s a legend, and he’s against someone who puts people to sleep. Professionally.

Good point. As much as I’d love to see Apgar, we can’t have a seminar speaker who specializes in putting people to sleep. So it will be Frank facing Julia in the second round.

Today, we have the #4 seed in the “People whose name ends in f” category, vs. an unseeded entry in the Wits category. (Yes, Monty Python is an amazing group, but the Wits category is a tough one; seedings are hard to come by when you’re competing with the likes of Oscar Wilde and Dorothy Parker.)

Darrell Huff is a bit of a legend in statistics, or used to be, based on his incredibly successful book from 1954, How to Lie with Statistics. But the guy didn’t really understand statistics; he was a journalist who wrote that one book and then went on to other things, most notoriously working on a book, How to Lie with Smoking Statistics, which was paid for by the cigarette industry but was never completed, or at least never published. Huff could talk about how to lie with statistics firsthand—but I suspect his knowledge of statistics was simplistic enough that he might not have even known what he was doing.

As for Monty Python: You know who they are. I have nothing to add on that account.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

### What were the most significant machine learning/AI advances in 2018?

2018 was an exciting year for Machine Learning and AI. We saw “smarter” AI, real-world applications, improvements in underlying algorithms and a greater discussion on the impact of AI on human civilization. In this post, we discuss some of the highlights.

### Simple Boruta Feature Selection Scripting

In the previous post of this series about feature selection WhizzML scripts, we introduced the problem of having too many features in our dataset, and we saw how Recursive Feature Elimination helps us to detect and remove useless fields. In this second post, we will learn another useful script, Boruta. Introduction to Boruta We talked previously […]

### Jupyter notebooks for post-election audits

The following is a guest blog post authored by Kellie Ottoboni, describing her recent work using Jupyter to support statistical audits of election results.

In December 2018, I facilitated pilot post-election risk-limiting audits in three cities in Michigan. This was the first time that SUITE, a new method for “hybrid” risk-limiting audits that I helped develop, has been used in practice. I wrote a Jupyter notebook tool to do the SUITE risk calculations, to determine the necessary sample size, and to sample the ballots using a cryptographically secure pseudo-random number generator.

### What are risk-limiting audits?

A risk-limiting audit (RLA) is a statistical check that the reported outcome (the reported winner(s), as opposed to exact vote totals) of an election is correct. The procedure limits the chance that an incorrect outcome will go uncorrected, if the reported outcome is in fact wrong. RLAs involve sampling and examining paper ballots. 34 RLA pilots have been done in California, Colorado, Indiana, New Jersey, Ohio, Virginia, and Denmark. Colorado began requiring RLAs by law in 2017, and Rhode Island will begin requiring RLAs in 2019.

SUITE is a general method for conducting RLAs of stratified samples, where a population of ballots is divided into non-overlapping strata and samples are drawn independently from each stratum. We began developing SUITE for Colorado, treating absentee votes and in-precinct votes as two strata. Aside from this special case, SUITE may be useful for auditing states where counties work independently.

Michigan used the specific two-stratum version of SUITE, which combines two types of RLAs: ballot polling, which involves “polling” the ballots and recording the proportion of votes for each candidate, and ballot-level comparison, which involves comparing paper ballots to their electronic record and counting the number of discrepancies. Ballot-level comparison audits require looking at fewer ballots than ballot polling, but can only be done when ballots can be matched to their electronic records. In two of the Michigan cities, absentee ballots could be matched to their electronic record. SUITE allows you to use these two strategies side-by-side for a single RLA.

### The Jupyter tool

Previous RLA pilots have used HTML pages with Javascript code, written by Philip Stark, to conduct the audits. With a short timeframe to create a similar SUITE tool, Philip and I decided that the best solution was to build out the Python library we had begun and use a Jupyter notebook for the interface.

The SUITE notebook made the audits more transparent. In each city, I projected the notebook, running locally on my laptop, on a screen for about 30 local election officials. They could observe me entering the reported vote totals for each candidate and the 20-digit random seed to initialize the pseudo-random number generator, then see the sampled ballots appear on the screen in a nicely formatted table. These folks are used to working in Excel spreadsheets, so the interactivity of a Jupyter notebook is a more familiar interface than simply running an executable file in the terminal. Each step of the audit was interactive and annotated with Markdown.
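The reproducible-sampling step can be illustrated with a tiny sketch (sticking with R, the language used elsewhere in this digest; the function name and numbers are hypothetical, R’s default Mersenne Twister stands in for the cryptographically secure generator, and a short seed stands in for the 20-digit one used in the real audits):

```r
# Anyone who knows the public seed can re-derive the exact same ballot
# sample, which is what makes the selection auditable.
sample_ballots <- function(n_ballots, sample_size, seed) {
  set.seed(seed)
  sort(sample.int(n_ballots, sample_size))
}

sample_ballots(n_ballots = 5000, sample_size = 10, seed = 12345678)
```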

Our hope is that the SUITE notebook can serve as a proof of concept for further development. Free & Fair turned the HTML RLA tools into industrial strength software and Democracy Works built upon it to create the program that Colorado uses for statewide RLAs.

### Pitfalls

One main difference between the HTML RLA tools and the SUITE notebook is that the webpages hide all the code. While we moved most of the code to modules and limited most code cells to one function, the notebook still contained a lot of code to scroll through. Voilà, a tool for hiding code in notebooks with interactive widgets, is one possible solution to display only the crucial input and output pieces of the tool.

Another issue we faced was the sequential nature of notebooks. RLAs are iterative: if the risk of the ballots from a first round of sampling is too high, then the audit proceeds to more rounds, until either the risk is sufficiently small or all ballots have been counted. A more proficient software developer than I might have come up with an elegant way to make this possible in a Jupyter notebook while tracking the data from each round of sampling. My hack was to assume that in Michigan we would need at most two rounds of sampling and to copy the code for the first round, with some modifications. (In fact, none of the three cities used more than one round.)

The RLA tools written by Free & Fair and Democracy Works are written in Java and Clojure. Academics continue to improve the statistics, so it would be helpful to have an API accessible with Python. It would enable the open source community to contribute to the codebase rather than reinvent the wheel to build a tool every time new statistics are developed.

### Michigan pilots

Rochester Hills, Lansing, and Kalamazoo participated in the pilots of their November 2018 election. I traveled to each city with a team of election auditing experts from MIT, the NYU Brennan Center for Justice, the Election Assistance Commission, and Democracy Works. We hope that this Jupyter notebook helped demystify the code and math, and illustrated that RLAs are a feasible and efficient way to ensure election integrity.

Jupyter notebooks for post-election audits was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

### How AI and Data Science is Changing the Utilities Industry

Together, artificial intelligence (AI) and data science are driving positive developments for the utility providers that choose to explore them. Here are some examples of the technology at work.

### The Best Strategies for Overcoming a Challenge

Learning data science is a long-distance race. And while you may have days where you’re flying, everything comes naturally, and you feel like a coding god, you’re also going to have days where nothing seems to work and you just can’t grasp the concept you’re supposed to be learning.

Your long-term success depends on your ability to push through those tough moments, the moments where people are most likely to give up.

What can you do to keep your head in the game and ensure you stick around even when the going gets tough and studying doesn’t feel fun? Science, thankfully, has some answers.

## Finding Success Doing Something You Aren’t Enjoying

A research team led by Marie Hennecke at the University of Zurich recently published a paper in the European Journal of Personality about a series of experiments they conducted to investigate the strategies people use to successfully push themselves through doing something they don’t like.

They started with a survey of 335 people that asked about the strategies respondents used to get through sometimes-unpleasant tasks like running on a treadmill. Then, they boiled the list of nearly 2,000 specific strategies respondents gave them down into a few categories:

• Changing the conditions: things like reducing distractions, changing the environment, drinking coffee, and putting on some music - strategies that don’t change the task itself but that aim to make it more pleasant, in the hope that this will make people more likely to complete the activity.
• Attentional strategies: focusing attention, either on the activity itself or by intentionally distracting oneself from it, to facilitate persistence.
• Cognitive change: psychological strategies for helping people think differently about the activity, such as focusing on the positive outcomes of completing the run, thinking about the negative outcomes of skipping it, monitoring progress, or thinking about being near the finish.
• Response modulation: of which the researchers cite just one strategy, suppressing the urge to quit.

Then in a subsequent study, researchers asked subjects about unpleasant-but-necessary activities they’d completed recently (like self-study, attending lectures, commuting, etc.). Respondents noted which of the above strategies they’d used to get through these tasks, and whether or not they felt they had been successful.

Ultimately, the researchers found that some of these strategies were more correlated with success than others. Specifically, strategies that tended to correlate with self-reported success were:

• Focusing on the positive consequences of completing the task
• Controlling emotions
• Monitoring progress towards goals
• Thinking about being close to the finish

Interestingly, goal-setting had no perceptible effect on respondents’ success rate. Just one strategy correlated negatively with success: distracting oneself from the task.

## What we can learn from this

We’ll start with a few caveats: this study makes some interesting suggestions, but they shouldn’t be mistaken for cast-in-stone facts (nor should the results of any single study). The fact that this research relies on self-reported success may be a particular weakness, since respondents may not always have assessed their own results accurately. And of course, correlation isn’t proof of causation, so it’s possible the correlation of these strategies with success is purely coincidental.

It’s also worth pointing out that if you’re learning data science, you should be having fun. These tips may be helpful for getting through the occasional down moments and those really challenging days, but if you’re feeling like you have to break these out every day, you might want to take a step back and look at how you’re studying data science.

With that said, this study suggests that when you’re having a tough day, attempting to apply these strategies might make it easier for you to push through and actually complete whatever you’re working on:

• Thinking about the reasons why it’s valuable for you to complete this task or solve this problem. If you’re struggling to get through a two-hour study session, for example, think about how much you’ll have learned or how much additional practice you’ll have gotten by the end.

• Doing your best to control the negative thoughts and emotions. This might be particularly helpful for frustrating moments where your code just isn’t working and you can’t figure out why. Pause for a moment, take a deep breath, and try thinking positively about what you’re doing or how you’ll feel when the task is completed before diving back in.

• Tracking your progress through sections of study you find less interesting could be helpful. (Of course, the Dataquest platform tracks your progress in each mission, course, and path automatically for you, but you could also track other metrics like time spent studying, or even more granular metrics like lines of code written, to see if that’s helpful.)

• Particularly when you’re trying to finish out a longer session, thinking about how you’re close to the finish might help. Struggling to make it to the end of two hours of statistics study? Thinking about how you’ve only got 20 minutes left may help you push through to the end more consistently rather than being tempted to clock out early.

None of these strategies are likely to be a magic bullet, but they’re all worth trying if you want to maximize your chances of success.

### 2018’s Top 7 R Packages for Data Science and AI

This is a list of the best packages that changed our lives this year, compiled from my weekly digests.

### Science as an intellectual “safe space”? How to do it right.

I don’t recall hearing the term “safe space” until recently, but now it seems to be used all the time, by both the left and the right, to describe an environment where people can feel free to express opinions that might be unpopular in a larger community, without fear of criticism or contradiction.

Sometimes a safe space is taken to be a good thing—a sort of hothouse garden in which ideas can be explored and allowed to grow rapidly in an environment free of natural enemies or competition—and other times it’s taken to be a bad thing, a place where ideas will not develop in useful ways.

The short version is that people sometimes (but not always) want safe spaces for themselves, but they typically are derisive of safe spaces for people they disagree with. Safe spaces are a little like protective tariffs in economics: if you feel strong, you might not see the need for anyone to have a safe space; if you feel weak, or if you have weak allies, you might want those protective zones.

Psychology journals as safe spaces

This all came up when I was thinking about what seemed to me to be exaggeratedly defensive reactions of scientists (in particular, some psychology researchers) to criticism of their published work. To me, if you publish your work, it’s public, and you should welcome criticism and active engagement with your ideas. But, to many researchers, it seems that praise is OK but skepticism is unwelcome. And in some cases researchers go all out and attack their critics. What these researchers seem to be looking for is a safe space. But to me it seems ridiculous to publish a paper in a public journal, promote it all over the public news media, and then object to criticism. This sort of behavior seems roughly equivalent to fencing off some area in a public park and then declaring it private property and only admitting your friends. Actually, even worse than that, because prominent psychologists use their safe space to spread lies about people. So it’s more like fencing off some area in a public park, declaring it private property, and then using it to launch missiles against people who you perceive as threatening your livelihood.

“Freakonomics” as a safe space

Another example came up a few years ago when Kaiser Fung and I wrote an article expressing a mix of positive and negative attitudes toward the Freakonomics franchise. One of the Freakonomics authors responded aggressively to us, and in retrospect I think he wanted Freakonomics to be a sort of safe space for economic thinking. The idea, perhaps, was that “thinking like an economist” is counterintuitive and sometimes unpopular, and so it’s a bad idea for outsiders such as Kaiser and me to go in and criticize. If the Freakonomics team are correct in their general themes (the importance of incentives, the importance of thinking like an economist, etc.), then we’re being counterproductive to zoom in on details they may have gotten wrong.

Safe spaces in science

I have no problem with my work being criticized; indeed, I see it as a key benefit of publication that more people can see what I’ve done and find problems with it.

That said, I understand the need for safe spaces. Just for example, I don’t share the first draft of everything I write. Or, to step back even further, suppose I’m working on a math problem and I want to either prove statement X or find a counterexample. Then it can be helpful to break the task in two, and separately try to find the proof or find the counterexample. When working on the proof, you act as if you know that X is true, and when searching for the counterexample, you act as if you know X is false. Another example is group problem solving, where it’s said to be helpful to have a “brainstorming session” in which people throw ideas on the table without expectation or fear of criticism. At some point you want to hear about longshot ideas, and it can be good to have some sort of safe space where these speculations can be shared without being immediately crushed.

My suggestion

So here’s my proposal. If you want a safe space for your speculations, fine: Just label what you’re doing as speculation, not finished work, and if NPR or anyone else interviews you about it, please be clear that you’re uncertain about these ideas and, as far as you’re concerned, these ideas remain in a no-criticism zone, a safe space where they can be explored without concern that people will take them too seriously.

### The trinity of errors in financial models: An introductory analysis using TensorFlow Probability

An exploration of three types of errors inherent in all financial models.

At Hedged Capital, an AI-first financial trading and advisory firm, we use probabilistic models to trade the financial markets. In this blog post, we explore three types of errors inherent in all financial models, with a simple example of a model in TensorFlow Probability (TFP).

## Finance is not physics

Adam Smith, generally recognized as the founder of modern economics, was in awe of Newton’s laws of mechanics and gravitation. Ever since then, economists have endeavored to make their discipline into a science like physics. They aspire to formulate theories that accurately explain and predict the economic activities of human beings at the micro and macro levels. This desire gathered momentum in the early 20th century with economists like Irving Fisher and culminated in the Econophysics movement of the late 20th century.

Despite all the complicated mathematics of modern finance, its theories are woefully inadequate, especially when compared to those of physics. For instance, physics can predict the motion of the moon and the electrons in your computer with jaw-dropping precision. These predictions can be calculated by any physicist, at any time, anywhere on the planet. By contrast, market participants have trouble explaining the causes of daily market movements or predicting the price of a stock at any time, anywhere in the world.

Perhaps finance is harder than physics. Unlike atoms and pendulums, people are complex, emotional beings with free will and latent cognitive biases. They tend to behave inconsistently and continually react to the actions of others. Furthermore, market participants profit by beating or gaming the systems in which they operate.

After losing a fortune on his investment in the South Sea Company, Newton remarked, “I can calculate the movement of the stars, but not the madness of men.” Note that Newton was not “retail dumb money.” He served at the Mint in England for almost 31 years, first as Warden and then as Master, helping put the British pound on the gold standard, where it would stay for over two centuries.

## All financial models are wrong

Models are used to simplify the complexity of the real world, enabling us to focus on the features of a phenomenon that interest us. Clearly, a map will not be able to capture the richness of the terrain it models. George Box, a statistician, famously quipped, “All models are wrong, but some are useful.”

This observation is particularly applicable to finance. Some academics have even argued that financial models are not only wrong, but also dangerous; the veneer of a physical science lulls adherents of economic models into a false sense of certainty about the accuracy of their predictive powers. This blind faith has led to many disastrous consequences for their adherents and for society at large. The most successful hedge fund in history, Renaissance Technologies, has put its critical views of financial theories into practice. Instead of hiring people with a finance or Wall Street background, they prefer to hire physicists, mathematicians, statisticians, and computer scientists. They trade the markets using quantitative models based on non-financial theories such as information theory, data science, and machine learning.

Whether financial models are based on academic theories or empirical data mining strategies, they are all subject to the trinity of modeling errors explained below. All models, therefore, need to quantify the uncertainty inherent in their predictions. Errors in analysis and forecasting may arise from any of the following modeling issues: using an inappropriate functional form, inputting inaccurate parameters, or failing to adapt to structural changes in the market.

## The trinity of modeling errors

1. Errors in model specification: Almost all financial theories use the normal distribution in their models. For instance, the normal distribution is the foundation upon which Markowitz’s Modern Portfolio Theory and Black-Scholes-Merton Option Pricing Theory are built. However, it is a well-documented fact that stocks, bonds, currencies, and commodities have fat-tailed distributions. In other words, extreme events occur far more frequently than predicted by the normal distribution.

If asset price returns were normally distributed, none of the following financial disasters would be expected to occur within the age of the universe: Black Monday, the Mexican Peso Crisis, the Asian Currency Crisis, the bankruptcy of Long-Term Capital Management (which counted two Nobel-prize-winning economists among its partners), or the Flash Crash. “Mini flash crashes” of individual stocks occur with even higher frequency than these macro events.
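The “age of the universe” claim can be sanity-checked in a few lines of Python. The figures used here are illustrative assumptions, not numbers from this article: a daily volatility of roughly 1% makes Black Monday’s ~20% one-day drop an event of roughly 20 standard deviations.

```python
from scipy import stats

# Illustrative assumption: with ~1% daily volatility, Black Monday's
# ~20% one-day drop was roughly a 20-sigma event under a normal model.
sigmas = 20.0
p_normal = stats.norm.sf(sigmas)  # P(one-day move beyond 20 sigma)

# Expected wait in years (~252 trading days/year) vs. the age of the universe.
expected_wait_years = (1.0 / p_normal) / 252.0
age_of_universe_years = 13.8e9
print(f"P(20-sigma day) under the normal: {p_normal:.2e}")
print(f"Expected wait: {expected_wait_years:.2e} years")

# A fat-tailed Student-t with 3 degrees of freedom tells a very different story.
p_t3 = stats.t.sf(sigmas, df=3)
print(f"P(20-sigma day) under Student-t(3): {p_t3:.2e}")
```

Under the normal model the expected wait dwarfs the age of the universe by dozens of orders of magnitude, while the fat-tailed model puts such a day within an ordinary market lifetime.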

Yet, finance textbooks, programs, and professionals continue to use the normal distribution in their asset valuation and risk models because of its simplicity and analytical tractability. These reasons are no longer justifiable given today’s advanced algorithms and computational resources. This reluctance to abandon the normal distribution is a clear example of “the drunkard’s search”: a principle derived from a joke about a drunkard who loses his key in the darkness of a park but frantically searches for it under a lamppost because that’s where the light is.

2. Errors in model parameter estimates: Errors of this type may arise because market participants have access to different levels of information with varying speeds of delivery. They also have different levels of sophistication in processing abilities and different cognitive biases. These factors lead to profound epistemic uncertainty about model parameters.

Let’s consider a specific example of interest rates. Fundamental to the valuation of any financial asset, interest rates are used to discount uncertain future cash flows of the asset and estimate its value in the present. At the consumer level, for example, credit cards have variable interest rates pegged to a benchmark called the prime rate. This rate generally changes in lock-step with the federal funds rate, an interest rate of seminal importance to the U.S. and the world economies.

Let’s imagine that you would like to estimate the interest rate on your credit card one year from now. Suppose the current prime rate is 2% and your credit card company charges you 10% plus prime. Given the strength of the current economy, you believe the Federal Reserve is more likely to raise interest rates than not. The Fed will meet eight times in the next 12 months and will either raise the federal funds rate by 0.25% or leave it at the previous level.

In the following TFP code example (see entire Colab), we use the binomial distribution to model your credit card’s interest rate at the end of the 12-month period. Specifically, we’ll use the TensorFlow Probability Binomial distribution class with the following parameters: total_count = 8 (the number of trials, or meetings), and probs = {0.6, 0.7, 0.8, 0.9} for our range of estimates about the probability of the Fed raising the federal funds rate by 0.25% at each meeting.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# First we encode our assumptions.
num_times_fed_meets_per_year = 8.
possible_fed_increases = tf.range(
    start=0.,
    limit=num_times_fed_meets_per_year + 1)
possible_cc_interest_rates = 2. + 10. + 0.25 * possible_fed_increases
prob_fed_raises_rates = tf.constant([0.6, 0.7, 0.8, 0.9])

# Now we use TFP to compute probabilities in a vectorized manner.
# Pad a dim so we broadcast fed probs against CC interest rates.
prob_fed_raises_rates = prob_fed_raises_rates[..., tf.newaxis]
prob_cc_interest_rate = tfd.Binomial(
    total_count=num_times_fed_meets_per_year,
    probs=prob_fed_raises_rates).prob(possible_fed_increases)

In the graphs below, notice how the probability distribution for your credit card rate in 12 months depends critically on your estimate about the probability of the Fed raising rates at each of the eight meetings. You can see that for every increase of 0.1 in your estimate of the Fed raising rates at each meeting, the expected interest rate for your credit card in 12 months increases by about 0.2%.
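The 0.2% step follows directly from the binomial mean: the expected number of raises is total_count × probs, and each raise is worth 0.25%. A quick check in plain Python:

```python
# Expected card rate = base rate + 0.25% per expected raise,
# where expected raises = total_count * probs (the binomial mean).
base_rate = 2.0 + 10.0   # current prime rate plus the card's 10% spread
n_meetings = 8
expected_rates = {}
for p in (0.6, 0.7, 0.8, 0.9):
    expected_rates[p] = base_rate + 0.25 * n_meetings * p
    print(f"probs={p}: expected card rate in 12 months = {expected_rates[p]:.1f}%")
# Each 0.1 increase in probs adds 0.25 * 8 * 0.1 = 0.2 percentage points.
```

So raising your probs estimate from 0.6 to 0.9 moves the expected rate from 13.2% to 13.8%.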

Even if all market participants used the binomial distribution in their models, it’s easy to see how they could disagree about the future prime rate because of the differences in their estimate for probs. Indeed, this parameter is hard to estimate. Many institutions have dedicated analysts, including previous employees of the Fed, analyzing the Fed’s every document, speech, and event to try to estimate this parameter.

Recall that we assumed this parameter probs was constant in our model for each of the next eight Fed meetings. How realistic is that? Members of the Federal Open Market Committee (FOMC), the rate setting body, are not just a set of biased coins. They can and do change their individual biases based on how the economy changes over time. The assumption that the parameter probs will be constant over the next 12 months is not only unrealistic, but also risky.

3. Errors from the failure of a model to adapt to structural changes: The underlying data-generating stochastic process may vary over time—i.e., the process is not stationary ergodic. We live in a dynamic capitalist economy characterized by technological innovations and changing monetary and fiscal policies. Time-variant distributions for asset values and risks are the rule, not the exception. For such distributions, parameter values based on historical data are bound to introduce errors into forecasts.

In our example above, if the economy were to show signs of slowing down, the Fed might decide to adopt a more neutral stance in its fourth meeting, making you change your probs parameter from 70% to 50% going forward. This change in your probs parameter will in turn change the forecast of your credit card interest rate.

Sometimes the time-variant distributions and their parameters change continuously or abruptly, as in the Mexican Peso Crisis. For either continuous or abrupt changes, the models used will need to adapt to evolving market conditions. A new functional form with different parameters might be required to explain and predict asset values and risks in the new regime.

Suppose after the fifth meeting in our example, the U.S. economy is hit by an external shock—say a new populist government in Greece decides to default on its debt obligations. Now the Fed may be more likely to cut interest rates than to raise them. Given this structural change in the Fed’s outlook, we will have to change the binomial probability distribution in our model to a trinomial distribution with appropriate parameters.
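As a sketch of that regime change, each remaining meeting can be modeled as a three-way outcome (cut, hold, raise). The probabilities below are hypothetical numbers chosen for illustration, and scipy’s multinomial is used here for brevity in place of the article’s TFP classes:

```python
from scipy import stats

# Hypothetical post-shock outlook (illustrative, not from the article):
# at each of the 3 remaining meetings the Fed cuts 0.25%, holds, or raises 0.25%.
probs_cut_hold_raise = [0.5, 0.4, 0.1]
remaining_meetings = 3

trinomial = stats.multinomial(n=remaining_meetings, p=probs_cut_hold_raise)

# Probability of 2 cuts, 1 hold, 0 raises over the remaining meetings:
# 3!/(2!1!0!) * 0.5^2 * 0.4^1 * 0.1^0 = 0.3
p_two_cuts = trinomial.pmf([2, 1, 0])
print(f"P(2 cuts, 1 hold, 0 raises) = {p_two_cuts:.3f}")

# Net change in the prime rate for that outcome: (0 raises - 2 cuts) * 0.25 = -0.5%
```

Summing these outcome probabilities over each possible net change gives the new distribution for your card’s rate under the changed regime.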

## Conclusions

Finance is not a precise predictive science like physics. Not even close. So, let’s not treat academic theories and models of finance as if they were models of quantum physics.

All financial models, whether based on academic theories or data mining strategies, are at the mercy of the trinity of modeling errors. While this trifecta of errors can be mitigated with appropriate modeling tools, it cannot be eliminated. There will always be asymmetry of information and cognitive biases. Models of asset values and risks will change over time due to the dynamic nature of capitalism, human behavior, and technological innovation.

Financial models need a framework that quantifies the uncertainty inherent in predictions of time-variant stochastic processes. Equally importantly, the framework needs to continually update the model or its parameters—or both—based on materially new data sets. Such models will have to be trained using small data sets, since the underlying environment may have changed too quickly to collect a sizable amount of relevant data.

### Acknowledgements

We thank the TensorFlow Probability team, especially Mike Shwe and Josh Dillon, for their help in earlier drafts of this blog post.


### Four short links: 22 January 2019

Data Science with Puzzles, Formal Methods, Sketching from Photos, and Teaching Machines

1. Teaching Data Science with Puzzles (Irene Steves) -- genius! The repo has the puzzles in an R project.
2. Why Don't People Use Formal Methods? -- a really good introduction to the field and current challenges. And entertainingly written: Proofs are hard. Obnoxiously hard. “Quit programming and join the circus” hard. Surprisingly, formal code proofs are often more rigorous than the proofs most mathematicians write! Mathematics is a very creative activity with a definite answer that’s only valid if you show your work. Creativity, formalism, and computers are a bad combination.
3. Photo Sketching: Inferring Contour Drawings from Images -- the examples in the paper are impressive.
4. History of Teaching Machines (Audrey Watters) -- a bit of context for the ZOMG APPS WILL SAVE EDUCATION mania.

### Looking for common misspellings

Some words are harder to spell than others, and on the internet, sometimes people indicate the difficulty by following their uncertainty with “(sp?)”. Colin Morris collected all the words in reddit threads with this uncertainty. Download the data on GitHub.


### 9 trends to watch in systems engineering and operations

From artificial intelligence to serverless to Kubernetes, here’s what's on our radar.

If your job or business relies on systems engineering and operations, be sure to keep an eye on the following trends in the months ahead.

## AIOps

Artificial intelligence for IT operations (AIOps) will allow for improved software delivery pipelines in 2019. This practice incorporates machine learning in order to make sense of data and keep engineers informed about both patterns and problems so they can address them swiftly. Rather than replace current approaches, however, the goal of AIOps is to enhance these processes by consolidating, automating, and updating them. A related innovation, Robotic Process Automation (RPA), presents options for task automation and is expected to see rapid and substantial growth as well.

## Knative vs. AWS Lambda vs. Microsoft Azure Functions vs. Google Cloud Functions

The serverless craze is in full swing, and shows no signs of stopping—since December 2017 alone, the technology has grown 22%, and Gartner reports that by 2020, more than 20% of global enterprises will be deploying serverless. This is a huge projected increase from the mere 5% that are currently utilizing it. The advantages of serverless are numerous: it saves money and allows organizations to scale and pivot quickly, and better manage development and testing.

While Knative was introduced in July at Google Next—as a joint initiative of Google, Red Hat, Pivotal, SAP, and IBM—the jury’s still out as to whether it will become the industry standard for building serverless applications on top of Kubernetes. It does have a lot going for it, though—namely, that it’s open source and portable between cloud providers. Because it’s not linked to any specific cloud platform (unlike its competitors Amazon Web Services Lambda, Microsoft Azure Functions, and Google Cloud Functions), it may also be more appealing to organizations looking to avoid obligations to a particular vendor, and in turn has the potential to unify the serverless world.

## Cloud-native infrastructure

Another fast-growing trend, cloud-native applications in production have seen a 200% spike in the past year. This development marks a clear shift from merely doing business in the cloud. Instead, it moves the focus to creating cloud-native applications, and puts the spotlight on the advantages of cloud-based infrastructure.

## Security

As both security threats and compliance pressures grow, automating security and baking security controls into the software development process is now critical. With recently established GDPR regulations that necessitate the notification of a security breach within 72 hours, DevOps and security practices are becoming intertwined not only in process, but in culture as well. Gone are the days of simply responding to issues as they pop up. Instead, a proactive approach that seamlessly weaves security into the development lifecycle will optimize productivity and efficiency for development teams into next year and beyond.

## Service mesh

The movement from monolith to microservices has already started, and service meshes will be a key component in fast-tracking the transition. A service mesh can best be described as a dedicated layer of infrastructure that enables rapid, secure, and dependable communication between and among service instances. Among the offerings to watch are Istio, HashiCorp Consul, and Linkerd.

## DevOps breaks down more silos; rise of the SRE

Teams and departments will continue to merge across organizations, as both data management and security requirements demand cross-functional processes and the lines between traditional role definitions blur. Reliability will be key to ensuring always-on, always-available performance so we’ll see more engineers and administrators adding reliability to their titles. Database reliability engineers (DBREs), for starters, will replace database administrators (DBAs), incorporating site reliability engineering processes into their day-to-day routine, and adopting the same code review and practices as DevOps teams use to create and deploy applications.

## Kubernetes

The current industry standard for container orchestration, Kubernetes will continue to own the spotlight in 2019 as more organizations start to walk the talk—either by implementing their own Kube-based infrastructure or letting their cloud vendor manage the complexity through a hosted solution like Microsoft’s Azure Kubernetes Service (AKS), Amazon’s EKS, or Google’s Kubernetes Engine. A recent O’Reilly survey found that less than 40% of respondents have actually implemented Kubernetes, suggesting that the hype currently outweighs the reality. There’s still plenty of room for adoption and growth within organizations—despite how oversaturated the landscape may currently seem. If you haven’t worked with Kubernetes yet, you likely will soon.

## Distributed tracing

Distributed tracing, a tool for monitoring and debugging microservices applications, is poised to become a critical trend going forward. The prevalence of heterogeneous distributed systems has made distributed tracing slightly more difficult to put into practice. However, service meshes—another hot topic—have made communication between services more hassle-free, so the time is right for this method to shine. While there are a variety of open source distributed tracing tools, including the OpenTracing API and systems inspired by Google’s Dapper, Twitter’s Zipkin is currently the most buzzed about, and it will likely rise to the top and set a precedent for the industry.

## Containers

According to a 2018 survey of 576 IT leaders, 44% were planning to replace at least a portion of their virtual machines with containers. There are a number of reasons for this switch—namely, C-suite executives’ ever-increasing dissatisfaction with VM licensing fees. In addition to a major infrastructure change, the move to containers also necessitates the adoption of both DevOps processes and culture, affecting the makeup of IT departments.

Continue reading 9 trends to watch in systems engineering and operations.

### What’s the deal with wind chill?

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

WIND SPEEDS UP HEAT LOSS

Twenty years ago, a German colleague asked me what the deal was with wind chill. I guess they didn’t have it in Germany. I explained it was an attempt to communicate how it feels when there is a low temperature combined with wind.  But my colleague wanted to know how they get the values in charts like these:

If you look closely at the first chart, it gives the formula for wind chill. It looks pretty hairy but it’s just a function of two things: wind speed and temperature. The windchill in Fahrenheit is just:

35.74 + 0.6215 * temp - 35.75 * windspeed^0.16 + 0.4275 * temp * windspeed^0.16

(Note that windchill is only defined for temperatures below 50°F and wind speeds above 3 mph.)
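As a rough sketch, the formula and its validity limits translate directly into code (shown here in Python, with the coefficients as published by the NWS):

```python
def wind_chill_f(temp_f, wind_mph):
    """NWS wind chill in degrees F; defined for temp <= 50F and wind >= 3 mph."""
    if temp_f > 50 or wind_mph < 3:
        raise ValueError("wind chill is undefined for temp > 50F or wind < 3 mph")
    v16 = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v16 + 0.4275 * temp_f * v16

# 25F with a 10 mph wind feels like about 15F (a wind chill effect of ~10 degrees).
print(round(wind_chill_f(25, 10), 1))
```

Subtracting the result from the outside temperature gives the “degrees to subtract” that the chart further down plots.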

But where does this equation come from? This brochure has the answer. Researchers put sensors on people’s faces and had them walk in wind tunnels:

During the human trials, 6 male and 6 female volunteers were placed in a chilled wind tunnel. Thermal transducers were stuck to their faces to measure heat flow from the cheeks, forehead, nose and chin while walking 3 mph on a treadmill. Each volunteer took part in four trials of 90 minutes each and was exposed to varying wind speeds and temperatures. The NWS Wind Chill Temperature (WCT) index uses advances in science, technology, and computer modeling to provide an accurate, understandable, and useful formula for calculating the dangers from winter winds and freezing temperatures. The index does the following:

• Calculates wind speed at an average height of 5 feet, the typical height of an adult human face, based on readings from the national standard height of 33 feet, typical height of an anemometer
• Is based on a human face model
• Incorporates heat transfer theory based on heat loss from the body to its surroundings, during cold and breezy/windy days
• Lowers the calm wind threshold to 3 mph
• Uses a consistent standard for skin tissue resistance
• Assumes no impact from the sun, i.e., clear night sky.

A common misunderstanding is to assume that the wind can make something colder than the outside temperature. That can’t happen. Wind just speeds up the chilling process.

Ok, this is all fine, but we wanted a chart that would 1) make it easier to see the effect of the wind speed and temperature and 2) show the wind chill effect for typical February weather in New York and Chicago. We coded up the following:


Note that in this chart, the Y axis isn’t the temperature, it’s the difference between the outside temperature and the windchill: the number of degrees you need to subtract.

So, when the wind is 10 miles per hour (red line), and the temperature is 25 degrees (x axis value of 25), the effect of wind chill is to lower the perceived temperature by 10 degrees (y axis value of -10), but at the same temperature when the wind is 25 miles per hour (blue line), wind lowers perceived temperature by about 16 degrees.

A useful takeaway is that with average winds and average temperatures, the effect of windchill is to lower the perceived temperature by about 7-12 degrees.

(BTW, if you like this stuff, you might enjoy our post on the heat index.)

Want to mess around with the code? Here you go:

The post What’s the deal with wind chill? appeared first on Decision Science News.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### rstudio::conf 2019

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Last week RStudio hosted their conference, rstudio::conf, in Austin and a whole lot of members of the R community came to see what’s new, where the community and the field might be heading and to enjoy tacos.

## R in production

A major theme from this conference was R in production. Joe Cheng kicked this one off in his keynote about Shiny in production, presenting how the answer to “Can Shiny be used for production?” changed from “Yes, it’s quite possible” (in 2015) to “Yes, it’s quite easy” (in 2019). Of course, it isn’t all easy yet. There are still cultural challenges that arise from Shiny apps being developed by R users who aren’t necessarily software engineers, i.e., they may not have all the background knowledge about what is necessary to put their app into production. Organisational challenges may arise when entities like IT and management meet the idea of Shiny apps in production with scepticism. While Shiny makes it easy to create a web app, it doesn’t exactly make automated testing, load testing, profiling and deployment easy. However, the Shiny team at RStudio is addressing these aspects, and Joe presented their work in these areas.

Our own Mark Sellors then continued the conversation around the cultural and organisational challenges of getting Shiny, and R in general, into production. One way is “the path of magic” where you get people on board with a fantastic Shiny app like Jacqueline Nolis and Heather Nolis have done at T-Mobile. Another path is via building the confidence the business has in the work of the data scientists. This involves a lot of building bridges between different teams so here is a list of things to tackle head on, or coincidentally, a list of “CS/software engineering concepts data scientists might need to learn more about to get things into production” as Caitlin Hudon put it. Mark also shared an R production readiness checklist. For an example that had people buzzing at the conference check out what Jacqueline Nolis and Heather Nolis did after their magic Shiny app: presentations, blog posts and code can all be found here while we wait for the conference videos.

Other talks on this topic included more recent RStudio work:

## Thinking beyond the code

Felienne reminded everyone in her keynote that spreadsheets are code – functional, reactive programming at that – and advocated for a pedagogical debate on how we teach programming. A Dutch programming book aimed at children includes sentences like

• “The best part of programming is finding mistakes.”
• “Programmers only learn from making mistakes.”
• “You will fail often, and it will be frustrating.”

illustrating the “let people discover things on their own” approach. An approach of explaining ideas and then letting people practice works a lot better – in other fields and in teaching programming. Or as Felienne put it: You don’t become an expert by doing expert things.

Angela Bassa gave an excellent talk on Data Science as a team sport highlighting how to grow a data science team by adding specialisation, process and resilience.

Hilary Parker spoke about Using Data Effectively: Beyond Art and Science and casually tidied the tidy workflow.

Caitlin Hudon shared her learnings on which data science mistakes to avoid, covering analysis mistakes, how to work with developers and how to communicate with business stakeholders.

## But also think about the code

It wouldn’t quite be an R conference without talks that highlight fantastic things to do with R:

David Robinson spoke about The Unreasonable Effectiveness of Public Work in his keynote, illustrating how helpful various forms of public work can be for a data scientist to advance their career through different stages. He recommends tweeting, blogging and contributing to open source for all stages from junior to senior. With increasing experience he also recommends giving talks, recording screencasts and writing books. If you are inspired to get out there talking, look for a meetup and/or R-Ladies group near you. If you want to speak at a bigger R conference before the next edition of rstudio::conf, check out useR! and EARL.

All throughout his keynote, David did a wonderful job of highlighting other people’s work (all while plugging his own books two slides in a row, master of the game that he is), in particular talks and tweets from the conference. His keynote basically was the first conference review, while the conference was still underway. Other great conference reviews have since been put out there by

So we leave you with this enjoyable “further reading” list and hope to see you soon at LondonR, EARL or on Twitter!


### Distilled News

This is a post about feature selection using genetic algorithms in R, in which we will do a quick review about:
• What are genetic algorithms?
• GA in ML?
• What does a solution look like?
• GA process and its operators
• The fitness function
• Genetics Algorithms in R!
• Try it yourself
• Relating concepts
‘The pilots fought continuously until the end of the flight’, said Capt. Nurcahyo Utomo, the head of the investigation of Lion Air Flight 610 that crashed on October 29, 2018, killing the 189 people aboard. The analysis of the black boxes had revealed that the Boeing 737’s nose was repeatedly forced down, apparently by an automatic system receiving incorrect sensor readings. During the 10 minutes preceding the tragedy, the pilots tried 24 times to manually pull up the nose of the plane. They struggled against a malfunctioning anti-stall system that they did not know how to disengage for that specific version of the plane. That type of dramatic scene of humans struggling with a stubborn automated system belongs to pop culture. In the famous scene of the 1968 science-fiction film ‘2001: A Space Odyssey’, the astronaut Dave asks HAL (Heuristically programmed ALgorithmic computer) to open a pod bay door on the spacecraft, to which HAL responds repeatedly, ‘I’m sorry, Dave, I’m afraid I can’t do that’.
The first step for anyone who wants to promote or sell something is to understand the psychology of potential customers. Getting into the minds of consumers is often problematic because measuring psychological traits is a complex task. Researchers have developed many parameters that describe our feelings, attitudes, personality and so on. One of these measures is consumer involvement, which is a measure of the attitude people have towards a product or service. The most common method to measure psychological traits is to ask people a battery of questions. Analysing these answers is complicated because it is difficult to relate the responses to a survey to the software of the mind. While the answers given by survey respondents are the directly measured variables, what we would like to know are the hidden (latent) states in the mind of the consumer. Factor analysis is a technique that helps to discover latent variables within a set of response data, such as a customer survey. The basic principle of measuring consumer attitudes is that the consumer’s state of mind causes them to respond to questions in a certain way. Factor analysis seeks to reverse this causality by looking for patterns in the responses that are indicative of the consumer’s state of mind. Using a computing analogy, factor analysis is a technique to reverse-engineer the source code by analysing the input and output. This article introduces the concept of consumer involvement and how it can be predictive of other important marketing metrics such as service quality. An example using data from tap water consumers illustrates the theory. The data collected from these consumers is analysed using factor analysis in R, using the psych package.
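As a minimal illustration of the idea (the article uses the psych package on tap-water survey data; the simulated "involvement" and "quality" items and the base-R `stats::factanal` call below are stand-ins), factor analysis recovers latent traits from correlated survey items:

```r
# Base-R sketch of factor analysis: recover latent traits from survey items.
set.seed(1)
n <- 300
involvement <- rnorm(n)  # latent trait 1 (hidden state of mind)
quality     <- rnorm(n)  # latent trait 2

# Six observed items, each driven by one latent variable plus noise.
survey <- data.frame(
  q1 = 0.8 * involvement + rnorm(n, sd = 0.4),
  q2 = 0.7 * involvement + rnorm(n, sd = 0.4),
  q3 = 0.9 * involvement + rnorm(n, sd = 0.4),
  q4 = 0.8 * quality     + rnorm(n, sd = 0.4),
  q5 = 0.7 * quality     + rnorm(n, sd = 0.4),
  q6 = 0.9 * quality     + rnorm(n, sd = 0.4)
)

# Two factors reproduce the item correlations; the loadings reveal
# which items measure which latent trait.
fa <- factanal(survey, factors = 2, rotation = "varimax")
print(fa$loadings)  # q1-q3 load on one factor, q4-q6 on the other
```

The rotated loadings play the role of the "reverse-engineered source code" in the article's analogy: each observed answer pattern is traced back to a latent state.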
The development of AI is creating new opportunities to improve the lives of people around the world, from business to healthcare to education. It is also raising new questions about the best way to build fairness, interpretability, privacy, and security into these systems. These questions are far from solved, and in fact are active areas of research and development. Google is committed to making progress in the responsible development of AI and to sharing knowledge, research, tools, datasets, and other resources with the larger community. Below we share some of our current work and recommended practices. As with all of our research, we will take our latest findings into account, work to incorporate them as appropriate, and adapt as we learn more over time.
The ancient philosopher Confucius has been credited with saying ‘study your past to know your future.’ This wisdom applies not only to life but also to machine learning. Specifically, the availability and application of labeled data (things past) for the labeling of previously unseen data (things future) is fundamental to supervised machine learning. Without labels (diagnoses, classes, known outcomes) in past data, how do we make progress in labeling (explaining) future data? This would be a problem.
The ability to predict what comes next in a sequence is fascinating. It’s one of the reasons I became interested in data science! Interestingly, the human mind is really good at it, but that is not the case with machines. Given a mysterious plot in a book, the human brain will start creating outcomes. But how do we teach machines to do something similar? Thanks to deep learning, we can do a lot more today than was possible a few years back. The ability to work with sequence data, like music lyrics, sentence translation, understanding reviews or building chatbots – all of this is now possible thanks to sequence modeling.
TensorFlow is the most popular library used in production for deep learning models. It has a very large and awesome community and gives lots of flexibility in operations. However, TensorFlow is not that user-friendly and has a steep learning curve. To address this, the high-level Keras API of TensorFlow provides building blocks to create and train deep learning models more easily. Keras models are built by connecting configurable building blocks together, with few restrictions, which makes them more modular and composable. You can explore more on the official site.
TensorFlow is an open source software library for numerical computation using data flow graphs that enables machine learning practitioners to do more data-intensive computing. It provides a robust implementation of some widely used deep learning algorithms and has a flexible architecture. The main features of TensorFlow are fast computation, flexibility, portability, easy debugging, a unified API, and extensibility.
TensorFlow is a popular open-source software library from Google. Originally it was developed by the Google Brain team for internal Google use. As the AI research community got more and more collaborative, TensorFlow was released under the Apache 2.0 open source license. A detailed study of TensorFlow can take months, but a glimpse of its power provides good motivation to dive in. With this in mind, this blog looks at the implementation of a classification model. Classification is one of the most frequent problems we work on in AI. Typically we have a set of inputs that have to be classified into different categories. We can use TensorFlow to train a model for this task. Below, we will walk through each step of one such implementation.
Stocks, Significance Testing & p-Hacking. Follow me on Twitter (twitter.com/pdquant) for more. Over the past 32 years, October has been the most volatile month on average for the S&P 500 and December the least. In this article we will use simulation to assess the statistical significance of this observation and the extent to which it could have occurred by chance. All code included! Originally posted here. Our goal:
• Demonstrate how to use Pandas to analyze Time Series
• Understand how to construct a hypothesis test
• Use simulation to perform hypothesis testing
• Show the importance of accounting for multiple comparison bias
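The goals above can be sketched in base R. The synthetic monthly volatilities below stand in for the article's S&P 500 data (the article itself uses Pandas); the point is to compare a naive p-value for one pre-chosen month against one that accounts for having scanned all 12 months:

```r
# Is the most volatile of 12 months "significant" once we account
# for having picked the maximum? A simulation sketch.
set.seed(7)
years <- 32

# Null world: every month has the same true volatility.
vol <- matrix(abs(rnorm(years * 12)), years, 12,
              dimnames = list(NULL, month.abb))
observed_max <- max(colMeans(vol))  # the "October effect" found by scanning

# Null distribution of the *maximum* monthly mean volatility.
sim_max <- replicate(5000,
  max(colMeans(matrix(abs(rnorm(years * 12)), years, 12))))
p_adjusted <- mean(sim_max >= observed_max)

# Naive p-value for a single pre-chosen month ignores the selection step.
sim_one <- replicate(5000, mean(abs(rnorm(years))))
p_naive <- mean(sim_one >= observed_max)

cat("naive p:", p_naive, " max-adjusted p:", p_adjusted, "\n")
```

The naive test looks far more significant than the selection-aware one, even though the data contain no real monthly effect: exactly the multiple-comparison bias the post warns about.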
SOD is an embedded, modern cross-platform computer vision and machine learning software library that exposes a set of APIs for deep learning, advanced media analysis & processing, including real-time, multi-class object detection and model training on embedded systems with limited computational resources and IoT devices. SOD was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in open source as well as commercial products.
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods [1-7] and representing the only possible consistent and locally accurate additive feature attribution method based on expectations (see the SHAP NIPS paper for details).
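For intuition, the Shapley values that SHAP builds on can be computed exactly for a tiny model by enumerating all coalitions. This brute-force base-R sketch is not the SHAP library's algorithm (which approximates these values efficiently), and the model and inputs are illustrative:

```r
# Exact Shapley feature attributions by coalition enumeration (base R).
# Only feasible for a handful of features.
shapley <- function(f, x, baseline) {
  p <- length(x)
  phi <- numeric(p)
  subsets <- expand.grid(rep(list(c(FALSE, TRUE)), p))  # all coalitions
  # Model evaluated with features outside the coalition set to baseline.
  v <- function(mask) f(ifelse(mask, x, baseline))
  for (i in seq_len(p)) {
    for (r in seq_len(nrow(subsets))) {
      S <- unlist(subsets[r, ])
      if (S[i]) next                      # only coalitions without i
      s <- sum(S)
      w <- factorial(s) * factorial(p - s - 1) / factorial(p)
      Si <- S; Si[i] <- TRUE
      phi[i] <- phi[i] + w * (v(Si) - v(S))
    }
  }
  phi
}

# For a linear model, Shapley values reduce to coef * (x - baseline).
f <- function(z) 2 * z[1] - 3 * z[2] + 0.5 * z[3]
x <- c(1, 2, 3); base <- c(0, 0, 0)
phi <- shapley(f, x, base)
print(phi)  # 2, -6, 1.5
# "Local accuracy": attributions sum to the gap between f(x) and f(baseline).
stopifnot(isTRUE(all.equal(sum(phi), f(x) - f(base))))
```

The final assertion is the additivity property the SHAP paper calls local accuracy.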
RStudio Connect is the publishing platform for everything you create in R. In conversations with our customers, R users were excited to have a central place to share all their data products, but were facing a tough problem. Their colleagues working in Python didn’t have the same option, leaving their work stranded on their desktops. Today, we are excited to introduce the ability for data science teams to publish Jupyter Notebooks and mixed Python and R content to RStudio Connect. Connect 1.7.0 is a major release that also includes many other significant improvements, such as programmatic deployment and historical event data.

### What’s new on arXiv

By achieving their purposes through interactions with the physical world, Cyber Physical Systems (CPS) pose new challenges in terms of dependability. Indeed, the evolution of the physical systems they control with transducers can be affected by surrounding physical processes over which they have no control and which may potentially hamper the achievement of their purposes. While it is illusory to hope for a comprehensive model of the physical environment at design time to anticipate and remove faults that may occur once these systems are deployed, it becomes necessary to evaluate their degree of effectiveness in vivo. In this paper, the degree of effectiveness is formally defined and generalized in the context of measure theory, and the mathematical properties it has to comply with are detailed. The measure is developed in the context of the Transferable Belief Model (TBM), an elaboration on the Dempster-Shafer Theory (DST) of evidence, so as to handle epistemic and aleatory uncertainties pertaining respectively to the user’s expectations and the natural variability of the physical environment. This theoretical framework has several advantages over the probability and possibility theories: (1) it is built on the Open World Assumption (OWA), and (2) it allows one to cope with dependent and possibly unreliable sources of information. The TBM is used in conjunction with the Input Output Hidden Markov Modeling framework (IOHMM) to specify the expected evolution of the physical system controlled by the CPS and the tolerances towards uncertainties. The measure of effectiveness is obtained from the forward algorithm, leveraging the conflict entailed by the successive combinations of the beliefs obtained from observations of the physical system and the beliefs corresponding to its expected evolution. The conflict, inherent to the OWA, is meant to quantify the inability of the model to explain observations.
Application of neural networks to a vast variety of practical applications is transforming the way AI is applied in practice. Pre-trained neural network models available through APIs, or the capability to custom-train pre-built neural network architectures with customer data, have made the consumption of AI by developers much simpler and resulted in broad adoption of these complex AI models. While prebuilt network models exist for certain scenarios, to meet the constraints that are unique to each application, AI teams need to develop custom neural network architectures that meet the tradeoff between accuracy and memory footprint imposed by their unique use-cases. However, only a small proportion of data science teams have the skills and experience needed to create a neural network from scratch, and the demand far exceeds the supply. In this paper, we present NeuNetS: an automated Neural Network Synthesis engine for custom neural network design that is available as part of IBM’s AI OpenScale product. NeuNetS is available for both text and image domains and can build neural networks for specific tasks in a fraction of the time it takes today with human effort, and with accuracy similar to that of human-designed AI models.
This paper presents an evolutionary metaheuristic called Multiple Search Neuroevolution (MSN) to optimize deep neural networks. The algorithm attempts to search multiple promising regions in the search space simultaneously, maintaining sufficient distance between them. It is tested by training neural networks for two tasks, and compared with other optimization algorithms. The first task is to solve global optimization functions with challenging topographies. We found MSN to outperform classic optimization algorithms such as Evolution Strategies, reducing the number of optimization steps performed by at least 2X. The second task is to train a convolutional neural network (CNN) on the popular MNIST dataset. Using 3.33% of the training set, MSN reaches a validation accuracy of 90%. Stochastic Gradient Descent (SGD) was able to match the same accuracy figure, while taking 7X fewer optimization steps. Despite lagging behind SGD, the fact that the MSN metaheuristic trains a 4.7M-parameter CNN suggests promise for future development. This is by far the largest network ever evolved using a pool of only 50 samples.
An appropriate choice of the activation function (like ReLU, sigmoid or swish) plays an important role in the performance of (deep) multilayer perceptrons (MLP) for classification and regression learning. Prototype-based classification learning methods like (generalized) learning vector quantization (GLVQ) are powerful alternatives. These models also deal with activation functions but here they are applied to the so-called classifier function instead. In this paper we investigate successful candidates of activation functions known for MLPs for application in GLVQ and their influence on the performance.
Univariate time series often take the form of a collection of curves observed sequentially over time. Examples of these include hourly ground-level ozone concentration curves. These curves can be viewed as a time series of functions observed at equally spaced intervals over a dense grid. Since functional time series may contain various types of outliers, we introduce a robust functional time series forecasting method to down-weigh the influence of outliers in forecasting. Through a robust principal component analysis based on projection pursuit, a time series of functions can be decomposed into a set of robust dynamic functional principal components and their associated scores. Conditioning on the estimated functional principal components, the crux of the curve-forecasting problem lies in modeling and forecasting principal component scores, through a robust vector autoregressive forecasting method. Via a simulation study and an empirical study on forecasting ground-level ozone concentration, the robust method demonstrates the superior forecast accuracy that dynamic functional principal component regression entails. The robust method also shows the superior estimation accuracy of the parameters in the vector autoregressive models for modeling and forecasting principal component scores, and thus improves curve forecast accuracy.
Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art results on various competitive benchmarks. The powerful learning ability of deep CNN is largely achieved with the use of multiple non-linear feature extraction stages that can automatically learn hierarchical representation from the data. Availability of a large amount of data and improvements in the hardware processing units have accelerated the research in CNNs and recently very interesting deep CNN architectures are reported. The recent race in deep CNN architectures for achieving high performance on the challenging benchmarks has shown that the innovative architectural ideas, as well as parameter optimization, can improve the CNN performance on various vision-related tasks. In this regard, different ideas in the CNN design have been explored such as use of different activation and loss functions, parameter optimization, regularization, and restructuring of processing units. However, the major improvement in representational capacity is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is gaining substantial appreciation. This survey thus focuses on the intrinsic taxonomy present in the recently reported CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting and attention. Additionally, it covers the elementary understanding of the CNN components and sheds light on the current challenges and applications of CNNs.
In an effort to improve the performance of deep neural networks in data-scarce, non-i.i.d., or unsupervised settings, much recent research has been devoted to encoding invariance under symmetry transformations into neural network architectures. We treat the neural network input and output as random variables, and consider group invariance from the perspective of probabilistic symmetry. Drawing on tools from probability and statistics, we establish a link between functional and probabilistic symmetry, and obtain generative functional representations of joint and conditional probability distributions that are invariant or equivariant under the action of a compact group. Those representations completely characterize the structure of neural networks that can be used to model such distributions and yield a general program for constructing invariant stochastic or deterministic neural networks. We develop the details of the general program for exchangeable sequences and arrays, recovering a number of recent examples as special cases.
A churn prediction system guides telecom service providers to reduce revenue loss. Development of a churn prediction system for a telecom industry is a challenging task, mainly due to the size of the data, high-dimensional features, and imbalanced distribution of the data. In this paper, we focus on a novel solution to the inherent problems of churn prediction, using the concept of Transfer Learning (TL) and Ensemble-based Meta-Classification. The proposed method, TL-DeepE, is applied in two stages. The first stage employs TL by fine-tuning multiple pre-trained Deep Convolution Neural Networks (CNNs). Telecom datasets are in vector form, which is converted into 2D images because Deep CNNs have high learning capacity on images. In the second stage, predictions from these Deep CNNs are appended to the original feature vector and thus are used to build a final feature vector for the high-level Genetic Programming and AdaBoost based ensemble classifier. Thus, the experiments are conducted using various CNNs as base classifiers with the contribution of the high-level GP-AdaBoost ensemble classifier, and the results are reported as an average of the outcomes. Using 10-fold cross-validation, the performance of the proposed TL-DeepE system is compared with existing techniques for two standard telecommunication datasets, Orange and Cell2cell. In the experimental results, the prediction accuracies for the Orange and Cell2cell datasets were 75.4% and 68.2%, with areas under the curve of 0.83 and 0.74, respectively.
Playlist recommendation involves producing a set of songs that a user might enjoy. We investigate this problem in three cold-start scenarios: (i) cold playlists, where we recommend songs to form new personalised playlists for an existing user; (ii) cold users, where we recommend songs to form new playlists for a new user; and (iii) cold songs, where we recommend newly released songs to extend users’ existing playlists. We propose a flexible multitask learning method to deal with all three settings. The method learns from user-curated playlists, and encourages songs in a playlist to be ranked higher than those that are not by minimising a bipartite ranking loss. Inspired by an equivalence between bipartite ranking and binary classification, we show how one can efficiently approximate an optimal solution of the multitask learning objective by minimising a classification loss. Empirical results on two real playlist datasets show the proposed approach has good performance for cold-start playlist recommendation.
Pavement crack detection is a critical task for ensuring road safety. Manual crack detection is extremely time-consuming. Therefore, an automatic road crack detection method is required to boost this progress. However, it remains a challenging task due to the intensity inhomogeneity of cracks and complexity of the background, e.g., the low contrast with surrounding pavements and possible shadows with similar intensity. Inspired by recent advances of deep learning in computer vision, we propose a novel network architecture, named Feature Pyramid and Hierarchical Boosting Network (FPHBN), for pavement crack detection. The proposed network integrates semantic information into low-level features for crack detection in a feature pyramid way. It also balances the contribution of both easy and hard samples to the loss by nested sample reweighting in a hierarchical way. To demonstrate the superiority and generality of the proposed method, we evaluate it on five crack datasets and compare it with state-of-the-art crack detection, edge detection, and semantic segmentation methods. Extensive experiments show that the proposed method outperforms these state-of-the-art methods in terms of accuracy and generality.
We introduce and study a new notion of non-commutative independence, called V-monotone independence, which can be viewed as an extension of the monotone independence of Muraki. We investigate the combinatorics of mixed moments of V-monotone random variables and prove the central limit theorem. We obtain a combinatorial formula for the limit moments and we find the solution of the differential equation for the moment generating function in the implicit form.
We present the winning solution for the Inclusive Images Competition organized as part of the Conference on Neural Information Processing Systems (NeurIPS 2018) Competition Track. The competition was organized to study ways to cope with domain shift in image processing, specifically geographical shift: the training and two test sets in the competition had different geographical distributions. Our solution has proven to be relatively straightforward and simple: it is an ensemble of several CNNs where only the last layer is fine-tuned with the help of a small labeled set of tuning labels made available by the organizers. We believe that while domain shift remains a formidable problem, our approach opens up new possibilities for alleviating this problem in practice, where small labeled datasets from the target domain are usually either available or can be obtained and labeled cheaply.
Reliably detecting anomalies in a given set of images is a task of high practical relevance for visual quality inspection, surveillance, or medical image analysis. Autoencoder neural networks learn to reconstruct normal images, and hence can classify those images as anomalies where the reconstruction error exceeds some threshold. Here we analyze a fundamental problem of this approach when the training set is contaminated with a small fraction of outliers. We find that continued training of autoencoders inevitably reduces the reconstruction error of outliers, and hence degrades the anomaly detection performance. In order to counteract this effect, an adversarial autoencoder architecture is adapted, which imposes a prior distribution on the latent representation, typically placing anomalies into low-likelihood regions. Utilizing the likelihood model, potential anomalies can be identified and rejected already during training, which results in an anomaly detector that is significantly more robust to the presence of outliers during training.

### Document worth reading: “Deep learning for time series classification: a review”

Time Series Classification (TSC) is an important and challenging problem in data mining. With the increase of time series data availability, hundreds of TSC algorithms have been proposed. Among these methods, only a few have considered Deep Neural Networks (DNNs) to perform this task. This is surprising, as deep learning has seen very successful applications in recent years. DNNs have indeed revolutionized the field of computer vision, especially with the advent of novel deeper architectures such as Residual and Convolutional Neural Networks. Apart from images, sequential data such as text and audio can also be processed with DNNs to reach state-of-the-art performance for document classification and speech recognition. In this article, we study the current state-of-the-art performance of deep learning algorithms for TSC by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community, in which we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR archive) and 12 multivariate time series datasets. By training 8,730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date. Deep learning for time series classification: a review

### WHAT TO LOOK FOR IN DATA – PART 1

Some activities are instinctive. A baby doesn’t need to be taught how to suckle. Most people can use an escalator, operate an elevator, and open a door instinctively. The same isn’t true of playing a guitar, driving a car, or … Continue reading

### If you did not already know

Co-Arg
This paper presents Co-Arg, a new type of cognitive assistant to an intelligence analyst that enables the synergistic integration of analyst imagination and expertise, computer knowledge and critical reasoning, and crowd wisdom, to draw defensible and persuasive conclusions from masses of evidence of all types, in a world that is changing all the time. Co-Arg’s goal is to improve the quality of the analytic results and enhance their understandability for both experts and novices. The performed analysis is based on a sound and transparent argumentation that links evidence to conclusions in a way that shows very clearly how the conclusions have been reached, what evidence was used and how, what is not known, and what assumptions have been made. The analytic results are presented in a report that describes the analytic conclusion and its probability, the main favoring and disfavoring arguments, the justification of the key judgments and assumptions, and the missing information that might increase the accuracy of the solution. …

Locally Smoothed Neural Network (LSNN)
Convolutional Neural Networks (CNN) and the locally connected layer are limited in capturing the importance and relations of different local receptive fields, which are often crucial for tasks such as face verification, visual question answering, and word sequence prediction. To tackle the issue, we propose a novel locally smoothed neural network (LSNN) in this paper. The main idea is to represent the weight matrix of the locally connected layer as the product of the kernel and the smoother, where the kernel is shared over different local receptive fields, and the smoother is for determining the importance and relations of different local receptive fields. Specifically, a multi-variate Gaussian function is utilized to generate the smoother, for modeling the location relations among different local receptive fields. Furthermore, the content information can also be leveraged by setting the mean and precision of the Gaussian function according to the content. Experiments on some variant of MNIST clearly show our advantages over CNN and locally connected layer. …

Exponential Random Graph Models (ERGM)
Exponential random graph models (ERGMs) are a family of statistical models for analyzing data about social and other networks. Many metrics exist to describe the structural features of an observed network, such as the density, centrality, or assortativity. However, these metrics describe the observed network, which is only one instance of a large number of possible alternative networks. This set of alternative networks may have similar or dissimilar structural features. To support statistical inference on the processes influencing the formation of network structure, a statistical model should consider the set of all possible alternative networks weighted on their similarity to an observed network. However, because network data is inherently relational, it violates the assumptions of independence and identical distribution of standard statistical models like linear regression. Alternative statistical models should reflect the uncertainty associated with a given observation, permit inference about the relative frequency of network substructures of theoretical interest, disambiguate the influence of confounding processes, efficiently represent complex structures, and link local-level processes to global-level properties. Degree Preserving Randomization, for example, is a specific way in which an observed network could be considered in terms of multiple alternative networks. …
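The Degree Preserving Randomization mentioned above can be sketched in a few lines of base R via double-edge swaps, which rewire a network while every node keeps its degree. The toy edge list below is an illustrative assumption:

```r
# Degree-preserving randomization sketch: rewire via double-edge swaps.
set.seed(3)
edges <- matrix(c(1,2, 1,3, 2,4, 3,5, 4,5, 2,5), ncol = 2, byrow = TRUE)

degrees <- function(e) table(factor(e, levels = 1:5))

swap_edges <- function(e, n_swaps = 100) {
  for (k in seq_len(n_swaps)) {
    ij <- sample(nrow(e), 2)               # pick two distinct edges
    a <- e[ij[1], ]; b <- e[ij[2], ]
    # Proposed rewiring: (a1,b2) and (b1,a2) -- endpoints are conserved.
    new1 <- c(a[1], b[2]); new2 <- c(b[1], a[2])
    # Reject swaps creating self-loops or duplicate edges.
    if (new1[1] == new1[2] || new2[1] == new2[2]) next
    key  <- function(x) paste(sort(x), collapse = "-")
    keys <- apply(e[-ij, , drop = FALSE], 1, key)
    if (key(new1) %in% keys || key(new2) %in% keys ||
        key(new1) == key(new2)) next
    e[ij[1], ] <- new1; e[ij[2], ] <- new2
  }
  e
}

rewired <- swap_edges(edges)
# Every node keeps its degree after rewiring.
stopifnot(identical(degrees(edges), degrees(rewired)))
```

Each accepted swap replaces edges (a1,a2) and (b1,b2) with (a1,b2) and (b1,a2), so the multiset of endpoints, and hence the degree sequence, never changes.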

### Visual Exploration of Unemployment Data

The charts on unemployment data I put up last week are best viewed as a collection.

I have put them up on the (still in beta) JMP Public website. You can find the project here.

I believe that if you make an account, you can grab the underlying dataset.

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Do you know what that sound is, Highness? Those are the Shrieking Eels — if you don’t believe me, just wait. They always grow louder when they’re about to feed on human flesh. If you swim back now, I promise, no harm will come to you. I doubt you will get such an offer from the Eels.

Vizzini, The Princess Bride

European eels (Anguilla anguilla) have it tough. Not only are they depicted as monsters in movies, they are critically endangered in real life. One of the many aspects that is contributing to their decline is the reduced connectivity between their freshwater and marine habitats. Eels are catadromous: they live in freshwater, but migrate to the Sargasso Sea to spawn, a route that is blocked by numerous human structures (shipping locks, sluices, pumping stations, etc.). Pieterjan Verhelst studies the impact of these structures on the behaviour of eels, making use of the fish acoustic receiver network that was established as part of the Belgian LifeWatch observatory. This animated video gives a quick introduction to his research and the receiver network:

In this blog post, we’ll explore if the migration of one eel is influenced by the tide. It’s a research use case for our R package wateRinfo, which was recently peer reviewed (thanks to reviewer Laura DeCicco and editor Karthik Ram for their constructive feedback!) and accepted as a community-contributed package to rOpenSci.

### Meet Princess Buttercup

Pieterjan provided us with the tracking data for eel with transmitter A69-1601-52622. Let’s call her Princess Buttercup, after the princess that almost got eaten by the Shrieking Eels in the classic and immensely quotable movie The Princess Bride.

library(readr)
library(here)

eel <- read_csv(here("data", "eel_track.csv"))


Her tracking data consists of the residence time interval (arrival until departure) at each receiver station that detected her along the Scheldt river. It also contains the calculated residencetime (in seconds), as well as the station name, latitude and longitude.

| date | receiver | latitude | longitude | station | arrival | departure | residencetime |
|------------|-------------|----------|-----------|------------|---------------------|---------------------|---------------|
| 2016-10-19 | VR2W-112297 | 51.00164 | 3.85695 | s-Wetteren | 2016-10-19 23:44:00 | 2016-10-19 23:48:00 | 240 |
| 2016-10-19 | VR2W-112287 | 51.00588 | 3.77876 | s-2 | 2016-10-19 16:07:00 | 2016-10-19 19:12:00 | 11100 |
| 2016-10-20 | VR2W-122322 | 51.02032 | 3.96965 | s-2a | 2016-10-20 13:18:00 | 2016-10-20 13:23:00 | 300 |
| 2016-10-20 | VR2W-122322 | 51.02032 | 3.96965 | s-2a | 2016-10-20 02:21:00 | 2016-10-20 02:29:00 | 480 |
| 2016-10-20 | VR2W-115438 | 51.01680 | 3.92527 | s-Wichelen | 2016-10-20 01:01:00 | 2016-10-20 01:09:00 | 480 |
| 2016-10-20 | VR2W-122322 | 51.02032 | 3.96965 | s-2a | 2016-10-20 05:52:00 | 2016-10-20 06:00:00 | 480 |
Using the latitude, longitude and total residencetime for each station, we can map where Princess Buttercup likes to hang out:
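A base-R sketch of the per-station summary behind such a map, using the sample rows shown above as a stand-in for the full tracking data (the actual post maps the result; the plotting code is omitted here):

```r
# Stand-in for the full eel tracking data: the sample rows shown above.
eel_sample <- data.frame(
  station       = c("s-Wetteren", "s-2", "s-2a", "s-2a", "s-Wichelen", "s-2a"),
  latitude      = c(51.00164, 51.00588, 51.02032, 51.02032, 51.01680, 51.02032),
  longitude     = c(3.85695, 3.77876, 3.96965, 3.96965, 3.92527, 3.96965),
  residencetime = c(240, 11100, 300, 480, 480, 480)
)

# Total residence time per station: where does she spend most of her time?
station_summary <- aggregate(
  residencetime ~ station + latitude + longitude,
  data = eel_sample, FUN = sum
)
print(station_summary[order(-station_summary$residencetime), ])
```

Each station keeps its coordinates in the result, so the summed residence times can be mapped directly as point sizes or colours.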

### Moving up and down the Scheldt river

To get a better sense of her journey along the river, we add a distance_to_sea (in meters) for the stations, by joining the tracking data with a distance reference file 1. We can now plot her movement over time and distance:

Princess Buttercup’s signal was picked up by receivers in Merelbeke (near Ghent) shortly after she was captured and released there on October 11. She resided in a 40 km stretch of the river (between Wetteren and Sint-Amands) for about a month before migrating towards the sea and starting the long journey towards the Sargasso Sea. The periodic movement pattern up and down the river during the second half of November is of particular interest: it looks like tidal frequency 2. It would be interesting to compare the movement pattern with real water level data from the Scheldt river… which is where our wateRinfo package comes in.

### Getting tidal data with the wateRinfo package

Waterinfo.be, managed by the Flanders Environment Agency (VMM) and Flanders Hydraulics Research, is a website where one can find real-time water- and weather-related environmental variables for Flanders (Belgium), such as rainfall, air pressure, discharge, and water level. The website also provides an API to download time series of these measurements as open data, but composing the download URL with the proper system codes can be challenging. To help users search for stations and variables, download the data of interest, and incorporate waterinfo.be data access into repeatable workflows, we developed the R package wateRinfo. See the package documentation for more information on how to install and get started with the package.

Timeseries in waterinfo.be (identified by a ts_id) are a combination of a variable, location (station_id) and measurement frequency (15min by default). For example:

```r
library(wateRinfo)

get_stations("water_level") %>%
  select(ts_id, station_id, station_name, parametertype_name)
```

| ts_id | station_id | station_name | parametertype_name |
|---|---|---|---|
| 92956042 | 39483 | Kieldrecht/Noordzuidverbinding | H |
| 31683042 | 21995 | Leuven/Dijle1e arm/stuw K3 | H |
| 20682042 | 20301 | StMichiels/Kerkebeek/Rooster | H |
| 96495042 | 10413 | Oudenburg/Magdalenakreek | H |
| 25074042 | 20788 | Schulen/Inlaat A/WB_Schulensbroek | H |
| 2006042 | 10432 | Merkem/Steenbeek | H |

At the time of writing (see this issue), the stations measuring water levels in the Scheldt tidal zone are not yet included by the API under the core variable water_level and are thus not yet available via get_stations("water_level"). We therefore rely on a precompiled list of tidal time series identifiers (tidal_zone_ts_ids):

tidal_zone_ts_ids <- read_csv(here("data", "tidal_zone_ts_ids.csv"))


From this list, we select the 10-min frequency tidal timeseries in the Scheldt river:

| ts_id | station_id | station_name | portal_bekken |
|---|---|---|---|
| 55419010 | 0430087 | Sint-Amands tij/Zeeschelde | Beneden-Scheldebekken |
| 55565010 | 0430029 | Tielrode tij/Durme | Beneden-Scheldebekken |
| 55493010 | 0430081 | Temse tij/Zeeschelde | Beneden-Scheldebekken |
| 54186010 | 0419418 | Dendermonde tij/Zeeschelde | Beneden-Scheldebekken |
| 54493010 | 0430078 | Hemiksem tij/Zeeschelde | Beneden-Scheldebekken |
| 55355010 | 0430091 | Schoonaarde tij/Zeeschelde | Beneden-Scheldebekken |
| 53989010 | 0419484 | Antwerpen tij/Zeeschelde | Beneden-Scheldebekken |
| 54606010 | 0430071 | Kallosluis tij/Zeeschelde | Beneden-Scheldebekken |
| 56088010 | 0430062 | Prosperpolder tij/Zeeschelde | Beneden-Scheldebekken |
| 54936010 | 0430068 | Liefkenshoek tij/Zeeschelde | Beneden-Scheldebekken |
| 55059010 | 0430242 | Melle tij/Zeeschelde | Beneden-Scheldebekken |

This wateRinfo vignette shows how to download data for multiple stations at once using wateRinfo and dplyr. We use a similar approach, but instead of manually providing the start and end date, we get these from Princess Buttercup’s tracking data:

```r
tidal_data <-
  tidal_zone_ts_ids %>%
  group_by(ts_id) %>%
  do(get_timeseries_tsid(
    .$ts_id,
    from = min(eel$date), # Start of eel tracking data
    to = max(eel$date),   # End of eel tracking data
    datasource = 4
  )) %>%
  # Join data back with tidal_zone_ts_ids metadata
  ungroup() %>%
  left_join(tidal_zone_ts_ids, by = "ts_id")
```

In just a few lines of code, we downloaded the tidal data for each measurement station for the required time period. The water level is expressed in mTAW (meter above mean sea level) [3]. Let’s plot the data for a station (here Dendermonde in November 2016) to have a look:

We now have all the pieces to verify if Princess Buttercup was surfing the tide back and forth.

### Is Princess Buttercup surfing the tide?

Let’s add the tidal data to Princess Buttercup’s journey plot we created before. The first step is to join the tidal data with the same distance reference file to know the distance from each tidal station to the sea:

```r
tidal_data <-
  tidal_data %>%
  left_join(
    distance_from_sea %>%
      select(station, distance_from_sea, municipality),
    by = c("station_name" = "station")
  ) %>%
  # Exclude (probably erroneous) data from Hemiksem
  filter(station_name != "Hemiksem tij/Zeeschelde")
```

To avoid visual clutter, we’ll use ridges (from ggridges) to display the tidal data for each station over time:

Looking at the plot, Princess Buttercup seems to be “lazy” and drift with the tide. Rising water levels push her upstream, while decreasing water levels bring her closer to the sea again. On November 22 (see also the previous plot), she embarks on her migration for real.

### Conclusion

In this blogpost we used the wateRinfo package to gain some insight into the movement/migration behaviour of an individual eel. We hope the package can support many more research questions and that you have fun storming the castle.

### Acknowledgements

Waterinfo.be data is made available by the Flanders Environment Agency, Flanders Hydraulics Research, Agentschap Maritieme Dienstverlening & Kust and De Vlaamse Waterweg.
The fish acoustic receiver network, as well as the work by Stijn and Peter, is funded by FWO as part of the Flemish contribution to LifeWatch. The animated video on eel research was funded by the Flanders Marine Institute (VLIZ), coordinated by Karen Rappé and Pieterjan Verhelst, animated by Steve Bridger, and narrated by Bryan Kopta. If you want to get in touch with our team, contact us via Twitter or email.

1. To represent the data along a straight line (y-axis), we calculated the distance along the river from each station to a reference station close to the sea (ws-TRAWL), using a costDistance function. See this script for more details on the calculation.
2. The Scheldt is under tidal influence from its river mouth all the way to Ghent (160 km upstream), where it is stopped by sluices. The tide reaches much further upstream than the freshwater-saltwater boundary of the river.
3. TAW stands for Tweede Algemene Waterpassing, a reference height for Belgium.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

### Correlated longitudinal data with varying time intervals

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I was recently contacted to see if simstudy can create a data set of correlated outcomes that are measured over time, but at different intervals for each individual. The quick answer is that there is no specific function to do this.
However, if you are willing to assume an “exchangeable” correlation structure, where measurements far apart in time are just as correlated as measurements taken close together, then you could just generate individual-level random effects (intercepts and/or slopes) and pretty much call it a day. Unfortunately, the researcher had something more challenging in mind: he wanted to generate auto-regressive correlation, so that proximal measurements are more strongly correlated than distal measurements.

As is always the case with R, there are certainly multiple ways to tackle this problem. I came up with this particular solution, which I thought I’d share. The idea is pretty simple: first, generate the time data with varying intervals, which can be done using simstudy; second, create an alternate data set of “latent” observations that include all time points, also doable with simstudy; last, merge the two in a way that gives you what you want.

### Step 1: varying time intervals

The function addPeriods can create intervals of varying lengths. The function determines if the input data set includes the special fields mInterval and vInterval. If so, a time value is generated from a gamma distribution with mean mInterval and dispersion vInterval.

```r
maxTime <- 180 # limit follow-up time to 180 days

def1 <- defData(varname = "nCount", dist = "noZeroPoisson", formula = 20)
def1 <- defData(def1, varname = "mInterval", dist = "nonrandom", formula = 20)
def1 <- defData(def1, varname = "vInterval", dist = "nonrandom", formula = 0.4)

set.seed(20190101)
dt <- genData(1000, def1)
dtPeriod <- addPeriods(dt)
dtPeriod <- dtPeriod[time <= maxTime]
```

Here is a plot of time intervals for a small sample of the data set:

### Step 2: generate correlated data

In this step, I am creating 181 records for each individual (from period = 0 to period = 180).
In order to create correlated data, I need to specify the mean and variance for each observation; in this example, the mean is a quadratic function of time and the variance is fixed at 9. I generate the correlated data using the addCorGen function, and specify an AR-1 correlation structure with $$\rho = 0.4$$:

```r
def2 <- defDataAdd(varname = "mu", dist = "nonrandom",
                   formula = "2 + (1/500) * (time) * (180 - time)")
def2 <- defDataAdd(def2, varname = "var", dist = "nonrandom", formula = 9)

dtY <- genData(1000)
dtY <- addPeriods(dtY, nPeriods = (maxTime + 1))
setnames(dtY, "period", "time")

dtY <- addColumns(def2, dtY)
dtY <- addCorGen(dtOld = dtY, idvar = "id", nvars = (maxTime + 1),
                 rho = .4, corstr = "ar1", dist = "normal",
                 param1 = "mu", param2 = "var", cnames = "Y")

dtY[, `:=`(timeID = NULL, var = NULL, mu = NULL)]
```

Here is a plot of a sample of individuals that shows the values of $$Y$$ at every single time point (not just the time points generated in step 1). The $$Y$$’s are correlated within individual.

### Step 3

Now we just do an inner join (or perhaps it is a left join; hard to tell, because one data set is a subset of the other). In any case, the new data set includes all the rows from step 1 and the ones that match from step 2.

```r
setkey(dtY, id, time)
setkey(dtPeriod, id, time)

finalDT <- mergeData(dtY, dtPeriod, idvars = c("id", "time"))
```

Here is a plot of the observed data for a sample of individuals:

### Addendum

To verify that the data are indeed correlated with an AR-1 structure, I first convert the complete (latent) data from step 2 from its long format to a wide format. The correlation is calculated from this $$1000 \times 181$$ matrix, where each row is an individual and each column is a value of $$Y$$ at a different time point.
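Before looking at the estimates, note what the AR-1 structure implies: the correlation between two latent measurements decays geometrically with the gap between their time points,

$$\text{Corr}(Y_{is}, Y_{it}) = \rho^{|s-t|} = 0.4^{|s-t|}$$

so adjacent time points should correlate at about 0.4, points two apart at $$0.4^2 = 0.16$$ (which rounds to 0.2), and anything much further apart at essentially zero. That is the pattern to look for in the estimated correlation matrix.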
And since the correlation matrix, which has dimensions $$181 \times 181$$, is too big to show, what you see is only the upper left-hand corner of the matrix:

```r
round(cor(as.matrix(dcast(dtY, id ~ time, value.var = "Y")[, -1]))[1:13, 1:13], 1)

##      0   1   2   3   4   5   6   7   8   9  10  11  12
## 0  1.0 0.4 0.2 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.1 0.0 0.0
## 1  0.4 1.0 0.4 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 2  0.2 0.4 1.0 0.4 0.2 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.0
## 3  0.1 0.1 0.4 1.0 0.4 0.2 0.1 0.0 0.0 0.0 0.0 0.0 0.1
## 4  0.0 0.0 0.2 0.4 1.0 0.4 0.2 0.1 0.0 0.1 0.0 0.0 0.0
## 5  0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.2 0.1 0.0 0.0 0.0 0.0
## 6  0.1 0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.2 0.0 0.0 0.0 0.0
## 7  0.0 0.0 0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.2 0.1 0.1 0.0
## 8  0.0 0.0 0.1 0.0 0.0 0.1 0.2 0.4 1.0 0.4 0.1 0.0 0.0
## 9  0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.2 0.4 1.0 0.4 0.2 0.0
## 10 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.1 0.4 1.0 0.4 0.2
## 11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.2 0.4 1.0 0.4
## 12 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.4 1.0
```

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

Continue Reading…

### Brexit day is closer than it looks

A majority of the government bills needed for a managed exit have still not been passed

Continue Reading…

## January 21, 2019

### Magister Dixit

“The mechanical process by which data scientists and citizen data scientists make better use of data and analytics is underpinned by a deeper question about the organization as a whole: Does it have processes for sharing anything?
This is not always a given in companies that have grown quickly, have grown through mergers and acquisitions, or have begun to shrink. If the culture has never embraced or fostered the notion of transparency and sharing, then whatever process the company may put in place to use software to publish analytical models and the data they harvest is unlikely to succeed.” TIBCO (2017)

Continue Reading…

### The Shiny Module Design Pattern

(This article was first published on R – William Doane, and kindly contributed to R-bloggers)

Foremost in your mind should be the quintessential reality of R: everything that happens in R is the result of a function call. Shiny is no exception. To write a minimal shiny app, you create an object that describes your app’s user interface, write a function describing runtime behaviors, and pass them into a function that spins up the app you’ve described. By convention, these two objects are associated with the variable names ui and server.

```r
library(shiny)

ui <- fluidPage()
server <- function(input, output, session) {}
```

This is just R code. You can type it into the Console to execute it line by line and inspect what it does. If you’re working in RStudio, you can type it into a Source file, then press Control-Enter (Windows) or Command-Return (MacOS) to send each line to the Console for execution.

Checking the Environment (or the structure of these two objects with str()), we can see that ui is a list of three objects. If we print ui to the Console, we see only an empty HTML element. The object associated with server is simply a function with no body.

To execute this minimal shiny app, we pass the ui and server objects to the shinyApp() function.

```r
shinyApp(ui, server)
```

The app will be spun up either in RStudio’s Viewer pane, in a Viewer window, or in your default Web browser, depending on your settings in RStudio. Don’t be surprised: it will be just a blank window, since all that has been defined thus far is an empty element.
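You can see that empty element for yourself by printing ui at the Console; shiny tag objects print as the HTML they will render. (The exact markup is an implementation detail of fluidPage() and may vary between shiny versions.)

```r
library(shiny)

ui <- fluidPage()
ui # prints the generated HTML, e.g. <div class="container-fluid"></div>
```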
The document that opened is an HTML document with some boilerplate CSS and JavaScript. You can inspect it using your browser’s Developer Tools.

That’s it. That’s shiny. Everything else flows from these core ideas:

• ui is a list object representing the HTML UI to be constructed.
• server is a function describing the runtime behavior of your app.
• shinyApp() takes these two objects and uses them to construct an HTML document that then gets spun up in a browser.

## A Shiny App that does Something

Let’s take the empty shell and start adding some content and behaviors. We’ll use whitespace to make our code more readable.

```r
library(shiny)

ui <- fluidPage(
  textOutput("output_area"),
  actionButton(
    "next_num",
    "Click to generate a random number"
  )
)

server <- function(input, output, session) {
  observeEvent(input$next_num, {
    output$output_area <- renderText({
      rnorm(1)
    })
  })
}

shinyApp(ui, server)
```

Since the elements of ui are passed in as arguments to a layout function (fluidPage, here), they are separated by commas. Since the content of server is just a function definition, its parts are not separated by commas.

It may not be obvious in a short example like this, but the organization of Shiny code can get out of control quickly: you start with 5, 10, or 20 UI blocks, then a set of server behaviors that somehow align with those UI blocks. The UI code and the code that dictates runtime behavior for those UI elements can be hundreds of lines separated from each other. It’s easy to make spaghetti, but we’d rather have ravioli.

## Refactoring Toward Shiny Modules

Let’s refactor this by extracting the UI elements and server code to functions. To signal my intention that these two parts belong together, I’ll adopt a naming convention of calling them the same thing, with the UI function having _UI suffixed to its name. Since there are multiple items being returned from the _UI function, I’ll need to bundle them together in a list.
```r
library(shiny)

button_UI <- function() {
  list(
    textOutput("output_area"),
    actionButton(
      "next_num",
      "Click to generate a random number"
    )
  )
}

button <- function() {
  observeEvent(input$next_num, {
    output$output_area <- renderText({
      rnorm(1)
    })
  })
}

ui <- fluidPage(
  button_UI()
)

server <- function(input, output, session) {
  button()
}

shinyApp(ui, server)
```

That doesn’t quite work, but it’s close. The immediate problem is with the environment scope of the input variable: button() wants to know about it, but I haven’t passed it in to button(). Let’s fix that and, while we’re at it, pass in output and session, too.

```r
library(shiny)

button_UI <- function() {
  list(
    textOutput("output_area"),
    actionButton(
      "next_num",
      "Click to generate a random number"
    )
  )
}

button <- function(input, output, session) {
  observeEvent(input$next_num, {
    output$output_area <- renderText({
      rnorm(1)
    })
  })
}

ui <- fluidPage(
  button_UI()
)

server <- function(input, output, session) {
  button(input, output, session)
}

shinyApp(ui, server)
```

It works! This is two-thirds of the way to applying the Shiny Modules design pattern. Why isn’t this sufficient? What more could we possibly need?

At the moment, all of these objects and the element IDs they create are being created at the top level of the running Shiny application. In other words, all the functions I write this way are sharing a single namespace. That’s not ideal if I want to create a second button and output field. I would need to write two more extracted functions and come up with more globally unique IDs (output_area2, output_area3, … next_num2, next_num3…); it all would get messy quickly and become totally unmanageable at scale.

The solution is to have those two extracted functions exist in their own namespace, so that I could instantiate another pair of them, then another, then another, without needing to concern myself with name collisions.
The Shiny Modules pattern achieves this by having you provide a unique ID each time you instantiate this pair of functions. shiny::NS() takes the id you provide and returns a function that will prepend the id onto the individual UI elements’ IDs.

```r
button_UI <- function(id) {
  ns <- NS(id)

  list(
    textOutput(ns("output_area")),
    actionButton(
      ns("next_num"),
      "Click to generate a random number"
    )
  )
}
```

To make the module’s server fragment aware of that namespace, you use shiny::callModule() in your main server function located in app.R. You pass in the ID of the instance of the UI you want that module to work with.

```r
ui <- fluidPage(
  button_UI("first")
)

server <- function(input, output, session) {
  callModule(button, "first")
}
```

Let’s look at the complete code:

```r
library(shiny)

button_UI <- function(id) {
  ns <- NS(id)

  list(
    textOutput(ns("output_area")),
    actionButton(
      ns("next_num"),
      "Click to generate a random number"
    )
  )
}

button <- function(input, output, session) {
  observeEvent(input$next_num, {
    output$output_area <- renderText({
      rnorm(1)
    })
  })
}

ui <- fluidPage(
  button_UI("first")
)

server <- function(input, output, session) {
  callModule(button, "first")
}

shinyApp(ui, server)
```

That’s the Shiny Module design pattern! It’s worth pointing out that the input, output, and session parameters to a module are not the same as the values of input, output, and session that exist in your shiny app as a whole. Within the server fragment of your module, input, output, and session are scoped to your module’s namespace.
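The prefixing that NS() performs is easy to inspect at the Console: with the instance ID "first", shiny joins the id and the element name with a dash, so the module's element IDs become "first-next_num" and "first-output_area".

```r
library(shiny)

# NS("first") returns a function that prefixes element IDs with the instance ID
ns <- NS("first")
ns("next_num")    # "first-next_num"
ns("output_area") # "first-output_area"
```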
## Separate Modules into their own Directories

For the sake of code organization, you can move the two module functions to their own script file and source() them into your app.R. Since you might end up having many modules, you might want to create a separate directory to contain them (I call mine modules/):

### modules/button_mod.R

```r
button_UI <- function(id) {
  ns <- NS(id)

  list(
    textOutput(ns("output_area")),
    actionButton(
      ns("next_num"),
      "Click to generate a random number"
    )
  )
}

button <- function(input, output, session) {
  observeEvent(input$next_num, {
    output$output_area <- renderText({
      rnorm(1)
    })
  })
}
```

### app.R

```r
library(shiny)

source("modules/button_mod.R")

ui <- fluidPage(
  button_UI("first")
)

server <- function(input, output, session) {
  callModule(button, "first")
}

shinyApp(ui, server)
```

Now, in app.R, you can reuse your module freely… simply give each instance its own ID:

### app.R with Module Reuse

```r
library(shiny)

source("modules/button_mod.R")

ui <- fluidPage(
  button_UI("first"),
  button_UI("second")
)

server <- function(input, output, session) {
  callModule(button, "first")
  callModule(button, "second")
}

shinyApp(ui, server)
```

You can click freely on each of the two buttons without it having an effect on the other. With this pattern you have:

• modularity
• distinct namespaces for each instance (encapsulation)
• reusability

all of which is derived from the fact that these are just R functions. The usual rules of well-defined functions apply:

• within a module’s server fragment, code must rely only on objects it creates or that are explicitly passed into the module as additional arguments to callModule().
• any objects you create within a module that you need access to in other modules must be returned from the producing module as a reactive expression, captured by a variable assignment, and passed into the consuming module(s), where it will be invoked as a function call to that reactive expression.
```r
produceMod <- function(input, output, session) {
  reactive(rnorm(1))
}

consumeMod <- function(input, output, session, data) {
  output$someOutput <- renderText({
    data() + 100
  })
}

# ...

server <- function(input, output, session) {
  result <- callModule(produceMod, "someID")
  callModule(consumeMod, "", result)
}
```


### Data Science and Ethics – Why Companies Need a new CEO (Chief Ethics Officer)

We explain why data science companies need a Chief Ethics Officer and discuss the role’s importance in tackling algorithmic bias.

### PAW celebrates its 10-year anniversary – here’s what’s happened so far

Happy 10th Anniversary to Predictive Analytics World! Join Mega-PAW, the premier predictive analytics conference, Jun 16-20 in Las Vegas. Register now!

### R Studio Conf 2019 – Easing your FOMO with R Resources

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

Image credit: Results from the 2018 RStudio Learning R Survey, shared by Carl Howe at RStudio Conf 2019

Y’all, if I had just one takeaway from R Studio Conference 2019, it’s that we are a friendly and tight-knit community! Virtually and in person, we look out for each other.

This past week, the FOMO was real, both from those unable to attend the conference and those who came. There were so many great sessions and such excellent information floating around that it was hard to keep up.

In this blog, my focus is to share as much information as possible to help everyone feel included, ready to level up their R skills and engage with the community!

# Learn

## General R Content

Some of the books featured in the free online R books Twitter moment by Martin Skarzynski, myself, Matthew Hendrickson and others

• RStudio Cheat Sheets – RStudio offers oodles and oodles of amazing cheatsheets at this link with new ones coming all the time. Visit the link, download cheatsheets for your fave packages and subscribe to get all of the new ones sent to you as they come out!

Sample RStudio Cheat Sheets

• Tidy Tuesday Screencasts by David Robinson – Since late 2018, David Robinson has been posting screencasts of his analysis process on the #TidyTuesday datasets. These are an excellent way to learn how others approach a new dataset. They also tend to be a great introduction to new tips, tricks and packages. See more about the Tidy Tuesday program in the “Engage” section below.

• R Packages to Check Out –  As part of my “We are R Ladies” curation week, I decided to share some of the R packages that I find the most interesting. Check out the twitter moment to see a preview of the packages and links to tutorials or documentation. Message me to add more R packages to this moment.

Sample packages included in the #FOMORPkgs twitter moment created by myself

• Curated Gifs by Nic Crane – This lovely twitter moment is a great way to browse compelling R features or packages through a series of curated R gifs. Message Nic to add a new gif to this moment.

• Find R Content – If you are looking for a great way of searching for #rstats resources, look no further! The Rseek tool will search R issues, blogs, packages, twitter and more!

• DataCamp Courses – DataCamp offers a free, interactive “Introduction to R” course.

# Engage

RLadies at RStudio Conference 2019. Photo credit to JD Long!

• Datanauts – The Datanauts program is another excellent way to engage with other data enthusiasts in the R and Python communities. It enables the greater data and space enthusiast groups to work together on interesting space problems. I had the good fortune to participate in this program in the fall of 2018 and I would highly recommend it!

# More RStudio::Conf

## Other RStudio::Conf blogs

The R blogosphere is already abuzz with the amazingness of the 2019 conference. I’ve already seen five blogs up and counting. I’ve linked to the ones I’ve found below, but please feel free to send over any new links!

## Next Year

Next year the conference is in San Francisco on Jan 27-30, 2020. The first 100 ticket purchasers will get the special price of $450, so sign up now!

A hilarious side note is that when the location was announced at the end of R Studio Conf 2019, they accidentally put up an image of New York! That’s when the R twitter fans started creating their own versions of the twitter fail. Martin Skarzynski captured the whole thing in a twitter moment. Great work Martin!

A sample of some of the rstudio::conf 2020 location trolling screenshots captured from Martin Skarzynski’s twitter moment

## Thanks

Thanks for taking the time to read this blog post. I could go on and on about some of the amazing sessions and packages introduced at the conference, but I already did that throughout the week on the “We Are R Ladies” twitter feed. If you want to learn more about the event, I’d suggest you look at the shared conference content listed at the top, read the other great blogs I’ve linked to, and buy your tickets for next year. Until then, keep learning and engage with the community!

To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data.

Continue Reading…

### R Packages worth a look

Bi-Level Selection of Conditional Main Effects (cmenet)
Provides functions for implementing cmenet – a bi-level variable selection method for conditional main effects (see Mak and Wu (2018) <doi:10.1080/0 …

Instrumental Variable Analysis in Case-Control Association Studies (iva)
Mendelian randomization (MR) analysis is a special case of instrumental variable analysis with genetic instruments.
It is used to estimate the unconfou …

Segment Data With Maximum Likelihood (segmentr)
Given a likelihood provided by the user, this package applies it to a given matrix dataset in order to find change points in the data that maximize the …

Continue Reading…

### Distilled News

You are probably familiar with Matplotlib and Seaborn, two excellent (and highly related) Python plotting libraries. The purpose of this article is to get you started with Bokeh if you are not yet familiar with it. You will learn how to write a custom Python class to simplify plotting interactive histograms with Bokeh.

When you are starting with some serious deep learning projects, you usually have the problem that you need a proper GPU. Buying reasonable workstations which are suitable for deep learning workloads can easily become very expensive. Luckily, there are some options in the cloud. One that I tried out was the wonderful Google Compute Engine. GPUs are available in the GCE as external accelerators of an instance. Currently, these GPUs are available (prices for us-central1).

When implementing the ResNet architecture in a deep learning project I was working on, it was a huge leap from the basic, simple convolutional neural networks I was used to. One prominent feature of ResNet is that it utilizes a micro-architecture within its larger macro-architecture: residual blocks! I decided to look into the model myself to gain a better understanding of it, as well as look into why it was so successful at ILSVRC. I implemented the exact same ResNet model class as in Deep Learning for Computer Vision with Python by Dr. Adrian Rosebrock, which followed the ResNet model from the 2015 academic publication, Deep Residual Learning for Image Recognition by He et al.

When I ask myself the question ‘What is data science?’ I tend to think of the following five components.
Data science is:

• the application of design thinking to data problems;
• the creation and management of workflows for transforming and processing data;
• the negotiation of human relationships to identify context, allocate resources, and characterize audiences for data analysis products;
• the application of statistical methods to quantify evidence; and
• the transformation of data analytic information into coherent narratives and stories.

What in the world is AI policy? First, a definition: AI policy is defined as public policies that maximize the benefits of AI while minimizing its potential costs and risks. From this perspective, the purpose of AI policy is two-fold. On the one hand, governments should invest in the development and adoption of AI to secure its many benefits for the economy and society. Governments can do this by investing in fundamental and applied research, the development of specialized AI and ‘AI + X’ talent, digital infrastructure and related technologies, and programs to help the private and public sectors adopt and apply new AI technologies. On the other hand, governments also need to respond to the economic and societal challenges brought on by advances in AI. Automation, algorithmic bias, data exploitation, and income inequality are just a few of the many challenges that governments around the world need to develop policy solutions for. These policies include investments in skills development, the creation of new regulations and standards, and targeted efforts to remove bias from AI algorithms and data sets.

‘We trained a machine learning model with high performance. However, it did not work and was not useful in practice.’ I have heard this sentence several times, and each time I was eager to find out the reason. There could be different reasons that a model failed to work in practice.
As these issues are not usually addressed in data science courses, in this article I address one of the common mistakes in designing and deploying a machine learning model. In the rest of this article, I will first discuss the confusion between correlation and causation that leads to the misuse of machine learning models, illustrated with an example. After that, different possibilities between inputs and outputs of the model are shown. Finally, I provide some suggestions to avoid this mistake.

We propose an Aitken estimator for Gini regression. The suggested A-Gini estimator is proven to be a U-statistic. Monte Carlo simulations are provided to deal with heteroskedasticity and to make some comparisons between the generalized least squares and the Gini regression. A Gini-White test is proposed and shows that a better power is obtained compared with the usual White test when outlying observations contaminate the data.

We present practical methods for near real-time monitoring of machine learning systems which detect system-level or model-level faults and can see when the world changes.

Tree-based methods can be used for regression or classification. They involve segmenting the prediction space into a number of simple regions. The set of splitting rules can be summarized in a tree, hence the name decision tree methods. A single decision tree is often not as performant as linear regression, logistic regression, LDA, etc. However, by introducing bagging, random forests, and boosting, it can result in dramatic improvements in prediction accuracy at the expense of some loss in interpretation. In this post, we introduce everything you need to know about decision trees, bagging, random forests, and boosting. It will be a long read, but it will be worth it!

To achieve state-of-the-art results in NLP tasks, researchers try tremendous ways to let machines understand language and solve downstream tasks such as textual entailment and semantic classification.
OpenAI released a new model named Generative Pre-Training (GPT). After reading this article, you will understand:

• Finetuned Transformer LM Design
• Architecture
• Experiments
• Implementation
• Take Away

Time series data can be confusing, but very interesting to explore. The reason this sort of data grabbed my attention is that it can be found in almost every business (sales, deliveries, weather conditions etc.). For instance: using Google BigQuery ho

Throughout the world, mobility is becoming increasingly shaped by the digital revolution. The ‘automation’ of private transport operating in the public road environment is taken to mean technological driving aids that relieve the pressure on drivers, assist or even replace them in part or in whole. The partial automation of driving is already standard equipment in new vehicles. Conditionally and highly automated systems which, without human intervention, can autonomously change lanes, brake and steer are available or about to go into mass production. In both Germany and the US, there are test tracks on which conditionally automated vehicles can operate. For local public transport, driverless robot taxis or buses are being developed and trialled.

Today, processors are already available or are being developed that are able, by means of appropriate sensors, to detect in real time the traffic situation in the immediate surroundings of a car, determine the car’s own position on appropriate mapping material and dynamically plan and modify the car’s route and adapt it to the traffic conditions. As the ‘perception’ of the vehicle’s surroundings becomes increasingly perfected, there is likely to be an ever better differentiation of road users, obstacles and hazardous situations. This makes it likely that it will be possible to significantly enhance road safety.
Indeed, it cannot be ruled out that, at the end of this development, there will be motor vehicles that are inherently safe, in other words will never be involved in an accident under any circumstances. Nevertheless, at the level of what is technologically possible today, and given the realities of heterogeneous and nonconnected road traffic, it will not be possible to prevent accidents completely. This makes it essential that decisions be taken when programming the software of conditionally and highly automated driving systems.

The technological developments are forcing government and society to reflect on the emerging changes. The decision that has to be taken is whether the licensing of automated driving systems is ethically justifiable or possibly even imperative. If these systems are licensed – and it is already apparent that this is happening at international level – everything hinges on the conditions in which they are used and the way in which they are designed.

At the fundamental level, it all comes down to the following questions. How much dependence on technologically complex systems – which in the future will be based on artificial intelligence, possibly with machine learning capabilities – are we willing to accept in order to achieve, in return, more safety, mobility and convenience? What precautions need to be taken to ensure controllability, transparency and data autonomy? What technological development guidelines are required to ensure that we do not blur the contours of a human society that places individuals, their freedom of development, their physical and intellectual integrity and their entitlement to social respect at the heart of its legal regime?

This paper compares the out-of-sample predictive performance of different early warning models for systemic banking crises using a sample of advanced economies covering the past 45 years. We compare a benchmark logit approach to several machine learning approaches recently proposed in the literature.
We find that while machine learning methods often attain a very high in-sample fit, they are outperformed by the logit approach in recursive out-of-sample evaluations. This result is robust to the choice of performance measure, crisis definition, preference parameter, and sample length, as well as to using different sets of variables and data transformations. Thus, our paper suggests that further enhancements to machine learning early warning models are needed before they are able to offer a substantial value-added for predicting systemic banking crises. Conventional logit models appear to use the available information already fairly efficiently, and would for instance have been able to predict the 2007/2008 financial crisis out-of-sample for many countries. In line with economic intuition, these models identify credit expansions, asset price booms and external imbalances as key predictors of systemic banking crises.

Continue Reading…

### Frank Sinatra (3) vs. Virginia Apgar; Julia Child advances

My favorite comment from yesterday came from Ethan, who picked up on the public TV/radio connection and rated our two candidate speakers on their fundraising abilities. Very appropriate for the university—I find myself spending more and more time raising money for Stan, myself.

A few commenters picked up on Child’s military experience. I like the whole shark repellent thing, as it connects to the whole “shark attacks determine elections” story. Also, Jeff points out that “a Julia win would open at least the possibility of a Wilde-Child semifinal,” and Diana brings up the tantalizing possibility that Julia Grownup would show up. That would be cool. I looked up Julia Grownup and it turns out she was on Second City too!

As for today’s noontime matchup . . . What can I say? New Jersey’s an amazing place. Hoboken’s own Frank Sinatra is only the #3 seed of our entries from that state, and he’s pitted against Virginia Apgar, an unseeded Jerseyite.
Who do you want to invite for our seminar: the Chairman of the Board, or a pioneering doctor who’s a familiar name to all parents of newborns? Here’s an intriguing twist: I looked up Apgar on Wikipedia and learned that she came from a musical family! Meanwhile, Frank Sinatra had friends who put a lot of people in the hospital. So lots of overlap here. You can evaluate the two candidates on their own merits, or based on who has a better chance of besting Julia Child in round 2.

Again, the full bracket is here, and here are the rules: We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above. I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

Continue Reading…

### KDD 2019 Call for Research, Applied Data Science Papers

KDD-2019 invites submission of papers describing innovative research on all aspects of data science, and of applied papers describing designs and implementations for practical tasks in data science. Submissions are due Feb 3.

Continue Reading…

### hrbrthemes 0.6.0 on CRAN + Other In-Development Package News

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Version 0.6.0 of the hrbrthemes package should be hitting a CRAN mirror near you soon. Apart from some general documentation and code cleanup this release includes the dark theme folks have been seeing in blog posts and tweets over the past few months. It’s called theme_ft_rc() since it is an homage to the wonderful new chart theme developed by the @ft_data crew over at the Financial Times (you can see examples from their work here). While there was nothing stopping folks from using the GitHub version, the CRAN release makes it more widely available.
There are still intermittent issues with fonts for some folks which I’ll be working on for the next release.

Since you’ve already seen lots of examples of these charts I won’t just make a gratuitous example using the theme. I will, however, make some charts based on a new data package dubbed iceout. The iceout package was originally conceived by Ben Tupper from the Bigelow Laboratory for Ocean Sciences. I keep an eye on fellow Mainer repositories and I did not realize (but should have known) that researchers keep track of when inland bodies of water freeze and thaw. The package name is derived from the term used for the thaw measurements (“ice-out” or “ice-off”).

Before becoming obsessed with this data and getting the package to its current state, the original codebase worked off of a USGS Lake Ice-Out Data for New England dataset that focused solely on New England and only went up to 2005. Some digging discovered that:

• Maine’s Department of Agriculture and Forestry maintains online records since 2003; and,
• Minnesota’s Department of Natural Resources maintains a comprehensive database of records going back to the 1800s.

But I hit the jackpot after discovering the U.S. National Snow & Ice Data Center’s Global Lake and River Ice Phenology dataset, which:

… contains freeze and breakup dates and other ice cover descriptive data for 865 lakes and rivers. Of the 542 water bodies that have records longer than 19 years, 370 are in North America and 172 are in Eurasia; 249 have records longer than 50 years; and 66 longer than 100 years. A few have data prior to 1845. These data, from water bodies distributed around the Northern Hemisphere, allow analysis of broad spatial patterns as well as long-term temporal patterns.

So, I converted the original package to a data package containing all four of those datasets plus some interactive functions for pulling “live” data and a set of “builders” to regenerate the databases.
Let’s take a quick look at what’s in the NSIDC data and the global coverage area:

library(iceout) # github/hrbrmstr/iceout
library(hrbrthemes)
library(ggplot2)
library(dplyr)

data("nsidc_iceout")

glimpse(nsidc_iceout)
## Observations: 35,918
## Variables: 37
## $ lakecode                 "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1…
## $ lakename                 "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Su…
## $ lakeorriver              "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", …
## $ season                   "1443-44", "1444-45", "1445-46", "1446-47", "1447-48", "1448…
## $ iceon_year               1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, …
## $ iceon_month              12, 11, 12, 12, 11, 12, 12, 12, 12, 11, 12, 12, 12, 12, 12, …
## $ iceon_day                8, 23, 1, 2, 30, 8, 13, 8, 23, 28, 3, 5, 1, 5, 6, 20, 10, 15…
## $ iceoff_year              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_month             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_day               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ duration                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ latitude                 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.1…
## $ longitude                138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.…
## $ country                  "Japan", "Japan", "Japan", "Japan", "Japan", "Japan", "Japan…
## $ froze                    TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ obs_comments             "calendar correction for ice_on: -30 days of original data; …
## $ area_drained             531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, …
## $ bow_comments             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ conductivity_us          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ elevation                759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, …
## $ filename                 "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ initials                 "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ inlet_streams            "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
## $ landuse_code             "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAF…
## $ largest_city_population  52000, 52000, 52000, 52000, 52000, 52000, 52000, 52000, 5200…
## $ max_depth                7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, …
## $ mean_depth               4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, …
## $ median_depth             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ power_plant_discharge    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ secchi_depth             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ shoreline                18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ surface_area             12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, …
## $ state                    "Nagano Prefecture", "Nagano Prefecture", "Nagano Prefecture…
## $ iceon_date               1443-12-08, 1444-11-23, 1445-12-01, 1446-12-02, 1447-11-30,…
## $ iceon_doy                342, 328, 335, 336, 334, 343, 347, 342, 357, 333, 337, 339, …
## $ iceout_date              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ iceout_doy               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

maps::map("world", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>%
  fortify() -> wrld

ggplot() +
  ggalt::geom_cartogram(
    data = wrld, map = wrld, aes(long, lat, map_id = region),
    fill = "#3B454A", color = "white", size = 0.125
  ) +
  geom_point(
    data = distinct(nsidc_iceout, lakeorriver, longitude, latitude),
    aes(longitude, latitude, fill = lakeorriver),
    size = 1.5, color = "#2b2b2b", stroke = 0.125, shape = 21
  ) +
  scale_fill_manual(
    name = NULL, values = c("L" = "#fdbf6f", "R" = "#1f78b4"),
    labels = c("L" = "Lake", "R" = "River")
  ) +
  ggalt::coord_proj("+proj=wintri", ylim = range(nsidc_iceout$latitude, na.rm = TRUE)) +
  labs(title = "NSIDC Dataset Coverage") +
  theme_ft_rc(grid = "") +
  theme(legend.position = c(0.375, 0.1)) +
  theme(axis.text = element_blank(), axis.title = element_blank())

W00t! Lots of data (though not all of the extra features are populated for all readings/areas)! I think the reason the ice-out data garnered my obsession was how it can be used as another indicator that we are indeed in the midst of a climate transformation. Let’s look at the historical ice-out information for Maine inland bodies of water:

filter(nsidc_iceout, country == "United States", state == "ME") %>%
  # we want the Y axis formatted as month-day so we choose a leap year
  # to ensure we get leap dates (if any)
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>%
  ggplot(aes(iceoff_year, iceout_date)) +
  geom_point(aes(color = lakename), size = 0.5, alpha = 1/4) +
  geom_smooth(aes(color = lakename), se = FALSE, method = "loess", size = 0.25) +
  scale_y_date(date_labels = "%b-%d") +
  labs(
    x = NULL, y = "Ice-out Month/Day", color = NULL,
    title = "Historical Ice-out Data/Trends for Maine Inland Bodies of Water"
  ) +
  theme_ft_rc(grid = "XY")

You can follow that code-pattern to look at other states.
It’s also fun to look at the ice-out date distributions by latitude grouping:

filter(nsidc_iceout, !is.na(latitude) & !is.na(longitude) & !is.na(iceout_date)) %>%
  filter(country == "United States") %>%
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>%
  mutate(lat_grp = cut(latitude, scales::pretty_breaks(5)(latitude), ordered_result = TRUE)) %>%
  arrange(desc(iceoff_year)) %>%
  ggplot() +
  ggbeeswarm::geom_quasirandom(
    aes(lat_grp, iceout_date, fill = iceoff_year), groupOnX = TRUE,
    shape = 21, size = 1, color = "white", stroke = 0.125, alpha = 1/2
  ) +
  scale_y_date(date_labels = "%b-%d") +
  viridis::scale_fill_viridis(name = "Year", option = "magma") +
  labs(
    x = "Latitude Grouping", y = "Ice-out Month/Day",
    title = "U.S. Ice-out Historical Day/Month Distributions by Latitude Grouping"
  ) +
  theme_ft_rc(grid = "Y")

If you want to focus on individual lakes there’s a Shiny app for that (well, one for the U.S. anyway). After loading the package, just enter explore_us() at an R console and you’ll see something like this:

The leaflet view will zoom to each new lake selected and the graph will be updated as well.

### Other Package News

The sergeant package is reaching a stable point in the 0.8.0 branch (mostly due to David Severski’s tireless help finding bugs) and should be headed to CRAN soon. Get your issues or PRs in if you want them CRANdied.

I’ve finally updated the Java library dependencies in pdfboxjars so pdfbox will no longer cause GitHub to tell you or I that it is insecure.

There’s a new package dubbed reapr that is aimed somewhere at the intersection of curl + httr + rvest. Fundamentally, it provides some coder-uplift when scraping data. The README has examples but here’s what you get on an initial scrape of this blog’s index page:

reapr::reap_url("http://rud.is/b")
## Title: rud.is | "In God we trust.
All others must bring data"
##       Original URL: http://rud.is/b
##          Final URL: https://rud.is/b/
##         Crawl-Date: 2019-01-17 19:51:09
##             Status: 200
##       Content-Type: text/html; charset=UTF-8
##               Size: 50 kB
##         IP Address: 104.236.112.222
##               Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
##                     label[1], noscript[1], section[1], title[1],
##                     aside[2], nav[2], ul[2], style[5], img[6],
##                     input[6], article[8], time[8], footer[9], h1[9],
##                     header[9], p[10], li[19], meta[20], div[31],
##                     script[40], span[49], link[53], a[94]
##         # Comments: 17
## Total Request Time: 2.093s

The reap_url() function:

• Uses httr::GET() to make web connections and retrieve content, which enables it to behave more like an actual (non-javascript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible.
• Returns a richer set of data. After the httr::response object is obtained, many tasks are performed, including:
  • timestamping of the URL crawl
  • extraction of the asked-for URL and the final URL (in the case of redirects)
  • extraction of the IP address of the target server
  • extraction of both plaintext and parsed (xml_document) HTML
  • extraction of the plaintext webpage (if any)
  • generation of a dynamic list of tags in the document which can be fed directly to HTML/XML search/retrieval functions (which may speed up node discovery)
  • extraction of the text of all comments in the HTML document
  • inclusion of the full httr::response object with the returned object
  • extraction of the time it took to make the complete request

I’m still wrestling with the API so definitely file issues with suggestions (wherever you’re most comfortable socially coding).

Speaking of IP addresses (bullet 3 above), I finally got some time to study the gdns C library (a modern DNS API library) and created the clandnstine package.
The package-name jeu de mots is due to the fact that the intent is to have it solely support DNS over TLS requests, since regular DNS is plaintext, enables ISP spying/injection, and is generally fraught with peril. All forms of DNS lookups are supported. The catch is that you have to point it at a DNS over TLS-capable resolver. The package defaults to Quad9 (9.9.9.9) because I trust them more than Google or Cloudflare (btw: that’s not saying much as I trust used car salesfolks more than all three of them). Keep an eye (or RSS reader) peeled on the $WORK blog over the coming weeks as I’ll have some analysis and data on a few hundred DNS over TLS endpoints you can use thanks to a new study developed by cow-orkers Jon Hart and Shan Skidar.

There is also a toy package, forecequotes, that is more “have fun with the cli & crayon packages” than anything else. But if you like Star Wars, random quote APIs and want to integrate richer command line interface output into your work, then definitely give it a peek.

Finally, I haven’t used R’s direct C interface in a while (since Rcpp is addictive and handy) and wanted to keep those skills fresh, so I made a wrapper to an old (in internet years) IP address trie C library. The underlying library is much slower than what we use in iptools but it works, does a bit more than its iptools counterpart and covers data marshaling, external pointer handling, and attribute/class setting so it may be a half-decent reference package for using the R<->C bridge.

### FIN

If you know of more/better ice-out data please drop an issue in the Bigelow Labs’ iceout repo and I’ll get it integrated. And, if you do your own ice-out exploration definitely blog about it, tell R Weekly and drop a note in the comments.

Here are links to all the mentioned packages grouped by social coding platform (so you can interact/collaborate wherever you feel most comfortable working):

sr.ht

GitLab

GitHub


### Tracking America in the age of Trump

Eleven measures that take the temperature of Donald Trump’s America

### Regression with Keras

In this tutorial, you will learn how to perform regression using Keras and Deep Learning. You will learn how to train a Keras neural network for regression and continuous value prediction, specifically in the context of house price prediction.

Today’s post kicks off a 3-part series on deep learning, regression, and continuous value prediction.

We’ll be studying Keras regression prediction in the context of house price prediction:

• Part 1: Today we’ll be training a Keras neural network to predict house prices based on categorical and numerical attributes such as the number of bedrooms/bathrooms, square footage, zip code, etc.
• Part 2: Next week we’ll train a Keras Convolutional Neural Network to predict house prices based on input images of the houses themselves (i.e., frontal view of the house, bedroom, bathroom, and kitchen).
• Part 3: In two weeks we’ll define and train a neural network that combines our categorical/numerical attributes with our images, leading to better, more accurate house price prediction than the attributes or images alone.

Unlike classification (which predicts labels), regression enables us to predict continuous values.

For example, classification may be able to predict one of the following values: {cheap, affordable, expensive}.

Regression, on the other hand, will be able to predict an exact dollar amount, such as “The estimated price of this house is $489,121”.

In many real-world situations, such as house price prediction or stock market forecasting, applying regression rather than classification is critical to obtaining good predictions.

To learn how to perform regression with Keras, just keep reading!

Looking for the source code to this post? Jump right to the downloads section.

## Regression with Keras

In the first part of this tutorial, we’ll briefly discuss the difference between classification and regression. We’ll then explore the house prices dataset we’re using for this series of Keras regression tutorials.

From there, we’ll configure our development environment and review our project structure. Along the way, we will learn how to use Pandas to load our house price dataset and define a neural network for Keras regression prediction.

Finally, we’ll train our Keras network and then evaluate the regression results.

### Classification vs. Regression

Figure 1: Classification networks predict labels (top). In contrast, regression networks can predict numerical values (bottom). We’ll be performing regression with Keras on a housing dataset in this blog post.

Typically on the PyImageSearch blog, we discuss Keras and deep learning in the context of classification — predicting a label to characterize the contents of an image or an input set of data. Regression, on the other hand, enables us to predict continuous values.

Let’s again consider the task of house price prediction. As we know, classification is used to predict a class label. For house price prediction we may define our categorical labels as:

labels = {very cheap, cheap, affordable, expensive, very expensive}

If we performed classification, our model could then learn to predict one of those five values based on a set of input features.
However, those labels are just that — categories that represent a potential range of prices for the house but do nothing to represent the actual cost of the home. In order to predict the actual cost of a home, we need to perform regression. Using regression we can train a model to predict a continuous value. For example, while classification may only be able to predict a label, regression could say: “Based on my input data, I estimate the cost of this house to be$781,993.”
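To make the contrast concrete, here is a minimal scikit-learn sketch (not part of the original tutorial; the tiny feature matrix and price labels are invented for illustration) in which the same inputs feed a classifier that predicts a price label and a regressor that predicts a dollar amount:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# invented features: [bedrooms, bathrooms, area in thousands of sqft]
X = [[2, 1.0, 0.9], [3, 2.0, 1.6], [4, 3.0, 2.4], [5, 4.0, 3.1]]

prices = [150_000, 280_000, 420_000, 560_000]               # continuous targets
labels = ["cheap", "affordable", "expensive", "expensive"]  # categorical targets

# classification: the model can only answer with one of the fixed labels
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([[3, 2.0, 1.5]]))  # one of the three labels

# regression: the model answers with an exact dollar estimate
reg = LinearRegression().fit(X, prices)
print(reg.predict([[3, 2.0, 1.5]]))  # a continuous dollar amount
```

The same idea carries over to the Keras models later in the series: only the output layer and loss change between the two problem types.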

Figure 1 above provides a visualization of performing both classification and regression.

In the rest of this tutorial, you’ll learn how to train a neural network for regression using Keras.

### The House Prices Dataset

Figure 2: Performing regression with Keras on the house pricing dataset (Ahmed and Moustafa) will ultimately allow us to predict the price of a house given its image.

The dataset we’ll be using today is from the 2016 paper, House price estimation from visual and textual features, by Ahmed and Moustafa.

The dataset includes both numerical/categorical attributes along with images for 535 data points, making it an excellent dataset to study for regression and mixed data prediction.

The house dataset includes four numerical and categorical attributes:

1. Number of bedrooms
2. Number of bathrooms
3. Area (i.e., square footage)
4. Zip code

These attributes are stored on disk in CSV format.

We’ll be loading these attributes from disk later in this tutorial using pandas, a popular Python package used for data analysis.

A total of four images are also provided for each house:

1. Bedroom
2. Bathroom
3. Kitchen
4. Frontal view of the house

The end goal of the houses dataset is to predict the price of the home itself.

In today’s tutorial, we’ll be working with just the numerical and categorical data.

Next week’s blog post will discuss working with the image data.

And finally, two weeks from now we’ll combine the numerical/categorical data with the images to obtain our best performing model.

But before we can train our Keras model for regression, we first need to configure our development environment and grab the data.

Figure 3: To perform regression with Keras, we’ll be taking advantage of several popular Python libraries including Keras + TensorFlow, scikit-learn, and pandas.

For this 3-part series of blog posts, you’ll need to have the following packages installed:

• NumPy
• scikit-learn
• pandas
• Keras with the TensorFlow backend (CPU or GPU)
• OpenCV (for the next two blog posts in the series)

Luckily most of these are easily installed with pip, a Python package manager.

Let’s install the packages now, ideally into a virtual environment as shown (you’ll need to create the environment):

$ workon house_prices
$ pip install numpy
$ pip install scikit-learn
$ pip install pandas
$ pip install tensorflow # or tensorflow-gpu

Notice that I haven’t instructed you to install OpenCV yet. The OpenCV install can be slightly involved — especially if you are compiling from source. Let’s look at our options:

1. Compiling from source gives us the full install of OpenCV and provides access to optimizations, patented algorithms, custom software integrations, and more. The good news is that all of my OpenCV install tutorials are meticulously put together and updated regularly. With patience and attention to detail, you can compile from source just like I and many of my readers do.
2. Using pip to install OpenCV is hands-down the fastest and easiest way to get started with OpenCV and essentially just checks prerequisites and places a precompiled binary that will work on most systems into your virtual environment site-packages. Optimizations may or may not be active. The big caveat is that the maintainer has elected not to include patented algorithms for fear of lawsuits. There’s nothing wrong with using patented algorithms for educational and research purposes, but you should use alternative algorithms commercially. Nevertheless, the pip method is a great option for beginners; just remember that you don’t have the full install.

Pip is sufficient for this 3-part series of blog posts. You can install OpenCV in your environment via:

$ workon house_prices
$ pip install opencv-contrib-python

Please reach out to me if you have any difficulties getting your environment established.

### Downloading the House Prices Dataset

Before you download the dataset, go ahead and grab the source code to this post by using the “Downloads” section. From there, unzip the file and navigate into the directory:

$ cd path/to/downloaded/zip
$ unzip keras-regression.zip
$ cd keras-regression

From there, you can download the House Prices Dataset using the following command:

$ git clone https://github.com/emanhamed/Houses-dataset

When we are ready to train our Keras regression network you’ll then need to supply the path to the Houses-dataset directory via command line argument.

### Project structure

Now that you have the dataset, go ahead and use the tree command with the same arguments shown below to print a directory + file listing for the project:

$ tree --dirsfirst --filelimit 10
.
├── Houses-dataset
│   ├── Houses Dataset [2141 entries]
├── pyimagesearch
│   ├── __init__.py
│   ├── datasets.py
│   └── models.py
└── mlp_regression.py

3 directories, 5 files

The dataset we cloned in the previous section lives in the Houses-dataset/ folder.

The pyimagesearch/ directory is a module included with the code “Downloads” where, inside, you’ll find:

• datasets.py: Our script for loading and preprocessing the house attributes dataset
• models.py: Our Multi-Layer Perceptron architecture implementation

These two scripts will be reviewed today. Additionally, we’ll be reusing both datasets.py and models.py (with modifications) in the next two tutorials to keep our code organized and reusable.

The regression + Keras script is contained in mlp_regression.py, which we’ll be reviewing as well.

Figure 4: We’ll use Python and pandas to read a CSV file in this blog post.

Before we can train our Keras regression model we first need to load the numerical and categorical data for the houses dataset.

Open up the datasets.py file and insert the following code:

# import the necessary packages
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
import glob
import cv2
import os

def load_house_attributes(inputPath):
    # initialize the list of column names in the CSV file and then
    # load it using pandas
    cols = ["bedrooms", "bathrooms", "area", "zipcode", "price"]
    df = pd.read_csv(inputPath, sep=" ", header=None, names=cols)

We begin by importing libraries and modules from scikit-learn, pandas, NumPy and OpenCV. OpenCV will be used next week as we’ll be adding the ability to load images to this script.

On Line 10, we define the

load_house_attributes
function which accepts the path to the input dataset.

Inside the function we start off by defining the names of the columns in the CSV file (Line 13). From there, we use pandas’ read_csv function to load the CSV file into memory as a data frame (df) on Line 14.

Below you can see an example of our input data, including the number of bedrooms, number of bathrooms, area (i.e., square footage), zip code, and finally the target price our model should be trained to predict:

bedrooms  bathrooms  area  zipcode     price
0         4        4.0  4053    85255  869500.0
1         4        3.0  3343    36372  865200.0
2         3        4.0  3923    85266  889000.0
3         5        5.0  4022    85262  910000.0
4         3        4.0  4116    85266  971226.0

Let’s finish up the rest of the

load_house_attributes
function:

    # determine (1) the unique zip codes and (2) the number of data
    # points with each zip code
    zipcodes = df["zipcode"].value_counts().keys().tolist()
    counts = df["zipcode"].value_counts().tolist()

    # loop over each of the unique zip codes and their corresponding
    # count
    for (zipcode, count) in zip(zipcodes, counts):
        # the zip code counts for our housing dataset is *extremely*
        # unbalanced (some only having 1 or 2 houses per zip code)
        # so let's sanitize our data by removing any houses with less
        # than 25 houses per zip code
        if count < 25:
            idxs = df[df["zipcode"] == zipcode].index
            df.drop(idxs, inplace=True)

    # return the data frame
    return df

In the remaining lines, we:

• Determine the unique set of zip codes and then count the number of data points with each unique zip code (Lines 18 and 19).
• Filter out zip codes with low counts (Line 28). For some zip codes we only have one or two data points, making it extremely challenging, if not impossible, to obtain accurate house price estimates.
• Return the data frame to the calling function (Line 33).
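The filtering step above can be sketched on a tiny, made-up data frame (the zip codes and counts below are invented purely for illustration):

```python
import pandas as pd

# toy data frame standing in for the housing data (values are invented)
df = pd.DataFrame({
    "zipcode": [85255] * 30 + [90210] * 2 + [85266] * 40,
    "price": [500000] * 72,
})

# count how many houses fall in each zip code
zipcodes = df["zipcode"].value_counts().keys().tolist()
counts = df["zipcode"].value_counts().tolist()

# drop any zip code represented by fewer than 25 houses (here, 90210)
for (zipcode, count) in zip(zipcodes, counts):
    if count < 25:
        idxs = df[df["zipcode"] == zipcode].index
        df.drop(idxs, inplace=True)

print(sorted(df["zipcode"].unique()))
```

After the loop, only the well-represented zip codes remain, which is exactly what keeps the one-hot encoding (and the price estimates per zip code) from being dominated by single-house categories.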

Now let’s create the `process_house_attributes` function used to preprocess our data:

def process_house_attributes(df, train, test):
	# initialize the column names of the continuous data
	continuous = ["bedrooms", "bathrooms", "area"]

	# perform min-max scaling on each continuous feature column to
	# the range [0, 1]
	cs = MinMaxScaler()
	trainContinuous = cs.fit_transform(train[continuous])
	testContinuous = cs.transform(test[continuous])

We define the function on Line 35. The `process_house_attributes` function accepts three parameters:

• `df`: Our data frame generated by pandas (the previous function helps us to drop some records from the data frame)
• `train`: Our training data for the House Prices dataset
• `test`: Our testing data

Then on Line 37, we define the columns of our continuous data, including the number of bedrooms, bathrooms, and size of the home.

We’ll take these values and use scikit-learn’s `MinMaxScaler` to scale the continuous features to the range [0, 1] (Lines 41-43).

Now we need to pre-process our categorical features, namely the zip code:

	# one-hot encode the zip code categorical data (by definition of
	# one-hot encoding, all output features are now in the range [0, 1])
	zipBinarizer = LabelBinarizer().fit(df["zipcode"])
	trainCategorical = zipBinarizer.transform(train["zipcode"])
	testCategorical = zipBinarizer.transform(test["zipcode"])

	# construct our training and testing data points by concatenating
	# the categorical features with the continuous features
	trainX = np.hstack([trainCategorical, trainContinuous])
	testX = np.hstack([testCategorical, testContinuous])

	# return the concatenated training and testing data
	return (trainX, testX)

First, we’ll one-hot encode the zip codes (Lines 47-49).

Then we’ll concatenate the categorical features with the continuous features using NumPy’s `hstack` function (Lines 53 and 54), returning the resulting training and testing sets as a tuple (Line 57).

Keep in mind that now both our categorical features and continuous features are all in the range [0, 1].
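Here’s a minimal, self-contained sketch of this preprocessing on invented data (three training houses, one testing house), using the same scikit-learn and NumPy calls as above:

```python
from sklearn.preprocessing import LabelBinarizer, MinMaxScaler
import numpy as np
import pandas as pd

# tiny invented train/test frames with the same columns as the dataset
train = pd.DataFrame({"bedrooms": [2, 4, 3], "bathrooms": [1.0, 3.0, 2.0],
                      "area": [1000, 3000, 2000],
                      "zipcode": [85255, 85266, 90210]})
test = pd.DataFrame({"bedrooms": [3], "bathrooms": [2.0],
                     "area": [2500], "zipcode": [85266]})

# min-max scale the continuous columns: fit on train, transform both
continuous = ["bedrooms", "bathrooms", "area"]
cs = MinMaxScaler()
trainContinuous = cs.fit_transform(train[continuous])
testContinuous = cs.transform(test[continuous])

# one-hot encode the zip codes (fit on every zip code we know about)
zipBinarizer = LabelBinarizer().fit(pd.concat([train, test])["zipcode"])
trainCategorical = zipBinarizer.transform(train["zipcode"])
testCategorical = zipBinarizer.transform(test["zipcode"])

# stitch categorical and continuous features into one matrix per split
trainX = np.hstack([trainCategorical, trainContinuous])
testX = np.hstack([testCategorical, testContinuous])
print(trainX.shape, testX.shape)
```

With three zip-code classes and three continuous columns, each feature vector ends up with six entries, every one of them in [0, 1].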

### Implementing a Neural Network for Regression

Figure 5: Our Keras regression architecture. The input to the network is a datapoint including a home’s # Bedrooms, # Bathrooms, Area/square footage, and zip code. The output of the network is a single neuron with a linear activation function. Linear activation allows the neuron to output the predicted price of the home.

Before we can train a Keras network for regression, we first need to define the architecture itself.

Today we’ll be using a simple Multilayer Perceptron (MLP) as shown in Figure 5.

Open up the

models.py
file and insert the following code:

# import the necessary packages
from keras.models import Sequential
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Dropout
from keras.layers.core import Dense
from keras.layers import Flatten
from keras.layers import Input
from keras.models import Model

def create_mlp(dim, regress=False):
	# define our MLP network
	model = Sequential()
	model.add(Dense(8, input_dim=dim, activation="relu"))
	model.add(Dense(4, activation="relu"))

	# check to see if the regression node should be added
	if regress:
		model.add(Dense(1, activation="linear"))

	# return our model
	return model

First, we’ll import all of the necessary modules from Keras (Lines 2-11). We’ll be adding a Convolutional Neural Network to this file in next week’s tutorial, hence the additional imports that aren’t utilized here today.

Let’s define the MLP architecture by writing a function to generate it called `create_mlp`.

The function accepts two parameters:

• `dim`: Defines our input dimensions
• `regress`: A boolean defining whether or not our regression neuron should be added

We’ll go ahead and start constructing our MLP with a `dim-8-4` architecture (Lines 15-17).

If we are performing regression, we add a `Dense` layer containing a single neuron with a linear activation function (Lines 20 and 21). Typically we use ReLU-based activations, but since we are performing regression we need a linear activation.

Finally, our `model` is returned on Line 24.
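To make the `dim-8-4` architecture concrete, here is a purely illustrative NumPy sketch of the forward pass it implies (random weights and a made-up input dimension, not the Keras implementation itself):

```python
import numpy as np

def relu(x):
    # rectified linear unit: max(x, 0), applied elementwise
    return np.maximum(x, 0.0)

# random weights for a dim-8-4-1 network (dim = 10 features, invented)
rng = np.random.default_rng(0)
dim = 10
W1, b1 = rng.normal(size=(dim, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

# forward pass: two ReLU hidden layers, then one linear output neuron
x = rng.normal(size=(5, dim))   # a batch of 5 feature vectors
h1 = relu(x @ W1 + b1)          # (5, 8)
h2 = relu(h1 @ W2 + b2)         # (5, 4)
y = h2 @ W3 + b3                # (5, 1), unbounded linear output
print(y.shape)
```

The linear output neuron is what lets the network emit any real-valued (scaled) price rather than a value squashed into a fixed range.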

### Implementing our Keras Regression Script

It’s now time to put all the pieces together!

Open up the

mlp_regression.py
file and insert the following code:

# import the necessary packages
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from pyimagesearch import datasets
from pyimagesearch import models
import numpy as np
import argparse
import locale
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", type=str, required=True,
	help="path to input dataset of house images")
args = vars(ap.parse_args())

We begin by importing necessary packages, modules, and libraries.

Namely, we’ll need the `Adam` optimizer from Keras, `train_test_split` from scikit-learn, and our `datasets` + `models` functions from the `pyimagesearch` module.

Additionally, we’ll use math features from NumPy for collecting statistics when we evaluate our model.

The `argparse` module is used for parsing command line arguments.

Our script requires just one command line argument, `--dataset` (Lines 12-15). You’ll need to provide the `--dataset` switch and the actual path to the dataset when you go to run the training script in your terminal.

Let’s load the house dataset attributes and construct our training and testing splits:

# construct the path to the input .txt file that contains information
# on each house in the dataset and then load the dataset
print("[INFO] loading house attributes...")
inputPath = os.path.sep.join([args["dataset"], "HousesInfo.txt"])
df = datasets.load_house_attributes(inputPath)

# construct a training and testing split with 75% of the data used
# for training and the remaining 25% for evaluation
print("[INFO] constructing training/testing split...")
(train, test) = train_test_split(df, test_size=0.25, random_state=42)

Using our handy `load_house_attributes` function, and by passing the `inputPath` to the dataset itself, our data is loaded into memory (Lines 20 and 21).

Our training (75%) and testing (25%) data is constructed via Line 26 and scikit-learn’s `train_test_split` method.
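The split itself is a single call; on an invented stand-in data frame (the fixed `random_state=42` seed makes the split reproducible):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# invented stand-in for the loaded housing data frame
df = pd.DataFrame({"price": range(100), "area": range(100)})

# hold out 25% of the rows for evaluation, reproducibly
(train, test) = train_test_split(df, test_size=0.25, random_state=42)
print(len(train), len(test))
```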

Let’s scale our house pricing data:

# find the largest house price in the training set and use it to
# scale our house prices to the range [0, 1] (this will lead to
# better training and convergence)
maxPrice = train["price"].max()
trainY = train["price"] / maxPrice
testY = test["price"] / maxPrice

As stated in the comment, scaling our house prices to the range [0, 1] will allow our model to more easily train and converge. Scaling the output targets to [0, 1] will reduce the range of our output predictions (versus [0, `maxPrice`]), making it not only easier and faster to train our network but enabling our model to obtain better results as well.

Thus, we grab the maximum price in the training set (Line 31), and proceed to scale our training and testing data accordingly (Lines 32 and 33).
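On toy numbers, the target scaling is just a division by the training-set maximum (and multiplying predictions by the same `maxPrice` would map them back to dollars):

```python
import pandas as pd

# toy price columns (values invented for illustration)
train = pd.DataFrame({"price": [100000.0, 400000.0, 250000.0]})
test = pd.DataFrame({"price": [200000.0]})

# scale targets into [0, 1] using only the *training* maximum
maxPrice = train["price"].max()
trainY = train["price"] / maxPrice
testY = test["price"] / maxPrice
print(trainY.max(), testY.iloc[0])
```

Note that the test targets are divided by the training maximum too, so a test house priced above any training house would simply scale to a value greater than 1.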

Let’s process the house attributes now:

# process the house attributes data by performing min-max scaling
# on continuous features, one-hot encoding on categorical features,
# and then finally concatenating them together
print("[INFO] processing data...")
(trainX, testX) = datasets.process_house_attributes(df, train, test)

Recall from the `datasets.py` script that the `process_house_attributes` function:

• Pre-processes our categorical and continuous features.
• Scales our continuous features to the range [0, 1] via min-max scaling.
• One-hot encodes our categorical features.
• Concatenates the categorical and continuous features to form the final feature vector.

Now let’s go ahead and fit our MLP model to the data:

# create our MLP and then compile the model using mean absolute
# percentage error as our loss, implying that we seek to minimize
# the absolute percentage difference between our price *predictions*
# and the *actual prices*
model = models.create_mlp(trainX.shape[1], regress=True)
opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_absolute_percentage_error", optimizer=opt)

# train the model
print("[INFO] training model...")
model.fit(trainX, trainY, validation_data=(testX, testY),
epochs=200, batch_size=8)

Our `model` is initialized with the `Adam` optimizer (Lines 45 and 46) and then compiled (Line 47). Notice that we’re using mean absolute percentage error as our loss function, indicating that we seek to minimize the mean percentage difference between the predicted price and the actual price.

The actual training process is kicked off on Lines 51 and 52.

After training is complete we can evaluate our model and summarize our results:

# make predictions on the testing data
print("[INFO] predicting house prices...")
preds = model.predict(testX)

# compute the difference between the *predicted* house prices and the
# *actual* house prices, then compute the percentage difference and
# the absolute percentage difference
diff = preds.flatten() - testY
percentDiff = (diff / testY) * 100
absPercentDiff = np.abs(percentDiff)

# compute the mean and standard deviation of the absolute percentage
# difference
mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)

# finally, show some statistics on our model
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
print("[INFO] avg. house price: {}, std house price: {}".format(
locale.currency(df["price"].mean(), grouping=True),
locale.currency(df["price"].std(), grouping=True)))
print("[INFO] mean: {:.2f}%, std: {:.2f}%".format(mean, std))

Line 56 instructs Keras to make predictions on our testing set.

Using the predictions, we compute the:

1. Difference between predicted house prices and the actual house prices (Line 61).
2. Percentage difference (Line 62).
3. Absolute percentage difference (Line 63).

From there, on Lines 67 and 68, we calculate the mean and standard deviation of the absolute percentage difference.

The results are printed via Lines 72-75.
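With made-up predictions and targets, the evaluation statistics above reduce to a few NumPy operations:

```python
import numpy as np

# invented scaled targets and model predictions (predict() returns 2-D)
testY = np.array([0.50, 0.80, 0.40])
preds = np.array([[0.45], [0.88], [0.50]])

# signed percentage error, then its absolute value
diff = preds.flatten() - testY
percentDiff = (diff / testY) * 100
absPercentDiff = np.abs(percentDiff)

# mean and standard deviation of the absolute percentage difference
mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)
print(round(mean, 2), round(std, 2))
```

The `flatten()` call matters: Keras predictions come back as a column vector, and subtracting it from a 1-D target array without flattening would broadcast into a matrix instead of an elementwise difference.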

Regression with Keras wasn’t so tough, now was it?

Let’s train the model and analyze the results!

### Keras Regression Results

Figure 6: For today’s blog post, our Keras regression model takes four numerical inputs, producing one numerical output: the predicted value of a home.

To train our own Keras network for regression and house price prediction make sure you have:

1. Configured your development environment according to the guidance above.
2. Downloaded the house prices dataset based on the instructions in the “The House Prices Dataset” section above.

From there, open up a terminal and supply the following command (making sure the

--dataset
command line argument points to where you downloaded the house prices dataset):

$ python mlp_regression.py --dataset Houses-dataset/Houses\ Dataset/
[INFO] loading house attributes...
[INFO] constructing training/testing split...
[INFO] processing data...
[INFO] training model...
Train on 271 samples, validate on 91 samples
Epoch 1/200
271/271 [==============================] - 0s 680us/step - loss: 84.0388 - val_loss: 61.7484
Epoch 2/200
271/271 [==============================] - 0s 110us/step - loss: 49.6822 - val_loss: 50.4747
Epoch 3/200
271/271 [==============================] - 0s 112us/step - loss: 42.8826 - val_loss: 43.5433
Epoch 4/200
271/271 [==============================] - 0s 112us/step - loss: 38.8050 - val_loss: 40.4323
Epoch 5/200
271/271 [==============================] - 0s 112us/step - loss: 36.4507 - val_loss: 37.1915
Epoch 6/200
271/271 [==============================] - 0s 112us/step - loss: 34.3506 - val_loss: 35.5639
Epoch 7/200
271/271 [==============================] - 0s 111us/step - loss: 33.2662 - val_loss: 37.5819
Epoch 8/200
271/271 [==============================] - 0s 108us/step - loss: 32.8633 - val_loss: 30.9948
Epoch 9/200
271/271 [==============================] - 0s 110us/step - loss: 30.4942 - val_loss: 30.6644
Epoch 10/200
271/271 [==============================] - 0s 107us/step - loss: 28.9909 - val_loss: 28.8961
...
Epoch 195/200
271/271 [==============================] - 0s 111us/step - loss: 20.8431 - val_loss: 21.4466
Epoch 196/200
271/271 [==============================] - 0s 109us/step - loss: 22.2301 - val_loss: 21.8503
Epoch 197/200
271/271 [==============================] - 0s 112us/step - loss: 20.5079 - val_loss: 21.5884
Epoch 198/200
271/271 [==============================] - 0s 108us/step - loss: 21.0525 - val_loss: 21.5993
Epoch 199/200
271/271 [==============================] - 0s 112us/step - loss: 20.4717 - val_loss: 23.7256
Epoch 200/200
271/271 [==============================] - 0s 107us/step - loss: 21.7630 - val_loss: 26.0129
[INFO] predicting house prices...
[INFO] avg. house price: $533,388.27, std house price: $493,403.08
[INFO] mean: 26.01%, std: 18.11%

As you can see from our output, our initial mean absolute percentage error starts off as high as 84% and then quickly drops to under 30%. By the time we finish training we can see our network starting to overfit a bit: our training loss is as low as ~21%, while our validation loss is at ~26%.

Computing our final mean absolute percentage error, we obtain a final value of 26.01%.

What does this value mean? Our final mean absolute percentage error implies that, on average, our network will be ~26% off in its house price predictions, with a standard deviation of ~18%.

### Limitations of the House Price Dataset

Being 26% off in a house price prediction is a good start but is certainly not the type of accuracy we are looking for. That said, this prediction accuracy can also be seen as a limitation of the house price dataset itself. Keep in mind that the dataset only includes four attributes:

1. Number of bedrooms
2. Number of bathrooms
3. Area (i.e., square footage)
4. Zip code

Most other house price datasets include many more attributes. For example, the Boston House Prices Dataset includes a total of fourteen attributes which can be leveraged for house price prediction (although that dataset does have some racial discrimination). The Ames House Dataset includes over 79 different attributes which can be used to train regression models.

When you think about it, the fact that we are able to even obtain 26% mean absolute percentage error without the knowledge of an expert real estate agent is fairly reasonable given:

1. There are only 535 total houses in the dataset (we only used 362 total houses for the purpose of this guide).
2. We only have four attributes to train our regression model on.
3. The attributes themselves, while important in describing the home itself, do little to characterize the area surrounding the house.
4. The house prices are incredibly varied, with a mean of $533K and a standard deviation of $493K (based on our filtered dataset of 362 homes).

With all that said, learning how to perform regression with Keras is an important skill!

In the next two posts in this series I’ll be showing you how to:

1. Leverage the images provided with the house price dataset to train a CNN on them.
2. Combine our numerical/categorical data with the house images, leading to a model that outperforms all of our previous Keras regression experiments.

## Summary

In this tutorial, you learned how to use the Keras deep learning library for regression.

Specifically, we used Keras and regression to predict the price of houses based on four numerical and categorical attributes:

• Number of bedrooms
• Number of bathrooms
• Area (i.e., square footage)
• Zip code

Overall our neural network obtained a mean absolute percentage error of 26.01%, implying that, on average, our house price predictions will be off by 26.01%.

That raises the questions:

• How can we better our house price prediction accuracy?
• What if we leveraged images for each house? Would that improve accuracy?
• Is there some way to combine both our categorical/numerical attributes with our image data?

To answer these questions you’ll need to stay tuned for the remaining two tutorials in this Keras regression series.

To download the source code to this post (and be notified when the next tutorial is published here on PyImageSearch), just enter your email address in the form below.

## Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!
The post Regression with Keras appeared first on PyImageSearch.

Continue Reading…

### The butterfly effect: It’s not what you think it is.

John Cook writes:

The butterfly effect is the semi-serious claim that a butterfly flapping its wings can cause a tornado half way around the world. It’s a poetic way of saying that some systems show sensitive dependence on initial conditions, that the slightest change now can make an enormous difference later . . .

Once you think about these things for a while, you start to see nonlinearity and potential butterfly effects everywhere. There are tipping points everywhere waiting to be tipped!

But it’s not so simple. Cook continues:

A butterfly flapping its wings usually has no effect, even in sensitive or chaotic systems. You might even say especially in sensitive or chaotic systems. Sensitive systems are not always and everywhere sensitive to everything. They are sensitive in particular ways under particular circumstances, and can otherwise be quite resistant to influence.

And:

The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome. Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all. We should evaluate what aspects can be steered and how well they can be steered. The most effective intervention may not come from tweaking the inputs but from changing the structure of the system.

Yes! That’s an excellent, Deming-esque point.
Bradley Groff pointed me to the above-linked post and noted the connection to my recent note on the piranha principle, where I wrote:

A fundamental tenet of social psychology and behavioral economics, at least how it is presented in the news media, and taught and practiced in many business schools, is that small “nudges,” often the sorts of things that we might not think would affect us at all, can have big effects on behavior. . . .

The model of the world underlying these claims is not just the “butterfly effect” that small changes can have big effects; rather, it’s that small changes can have big and predictable effects. It’s what I sometimes call the “button-pushing” model of social science, the idea that if you do X, you can expect to see Y. . . .

In response to this attitude, I sometimes present the “piranha argument,” which goes as follows: There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.

I’m thinking of social science and I’m being mathematically vague (I do think there’s a theorem there somewhere, something related to random matrix theory, perhaps), whereas Cook is thinking more of physical systems with a clearer mathematical connection to nonlinear dynamics. But I think our overall points are the same, and with similar implications for thinking about interventions, causal effects, and variation in outcomes.

P.S.
This is related to my skepticism of structural equation modeling, path analysis, and similar approaches used in some quarters of sociology and psychology for many years and promoted in slightly different form by Judea Pearl and other computer scientists. These methods often seem to me to promise a sort of causal discovery that cannot be realistically delivered and which in many cases I don’t even think makes sense (see this article, especially the last full paragraph on page 960 and the example on page 962), and I see this as connected with the naive view of the butterfly effect described above: an attitude that if you just push certain buttons in a complex social system you can get predictable results.

In brief: I doubt that the claims deriving from such data analyses will replicate in new experiments, but I have no doubt that anything that doesn’t replicate will be explained as the result of additional butterflies in the system.

What I’d really like is for researchers to just jump to the post-hoc explanation stage before even gathering those new validation data. The threat of replication should be enough to motivate people to back off of some of their extreme claims. To speak generically:

1. Research team A publishes a paper claiming that X causes Y.
2. Research team B tries to replicate the finding, but it fails to replicate.
3. Research team A explains that the original finding is not so general; it only holds under conditions Z, which contain specifics on the experimental intervention, the people in the study, and the context of the study. The finding only holds if the treatment is done for 1 minute, not 3 minutes; it holds only in warm weather, not cold weather; it holds only in Israel, not in the United States; it works for some sorts of stimulus but not others.
4. Ideally, in the original published paper, team A could list all the conditions under which they are claiming their result will appear.
That is, they could anticipate step 2 and jump right to step 3, saving us all a lot of time and effort.

P.P.S. This post was originally called “Of butterflies and piranhas”; after seeing some comments, I changed the title to focus the message.

Continue Reading…

### A shiny Web App from LEGO — truck + trailer

(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)

#### How to Build a Shiny “Truck” part 2 — Let the LEGO “truck” app pull a trailer. An example of a modularized shiny app.

In September 2018 I used an automotive metaphor to explain a large scale R shiny app; RViews published the article. I would summarize it in one phrase: when building large applications (trucks) in R shiny, there are a lot of things to keep in mind. To cover all of them in a single app, I’m providing this tutorial. You can find all files of the app under https://github.com/zappingseb/biowarptruck — folder: example_packaged

#### Summary (skip if you read the article in RViews)

The article I wrote in RViews told the reader to pay regard to the fact that any shiny app might become big someday. From the start it must be well planned, and it should be possible to remove or add any part of your app. Thus it has to be modular. Each module must work like a LEGO brick. LEGO bricks come with different functionalities, but they follow certain rules that make them stick to each other. These rules we call a standard. Modules designed like LEGO bricks increase your flexibility, and hence the re-usability of your modules grows. When you set up your app like that, you have the possibility to add an unlimited number of LEGO bricks, so it can grow. Imagine small scale applications as cars and large scale applications as trucks. The article explained how to build a LEGO truck. If you build your car from LEGO, more and different parts can make it a truck.
If you built your app from standardized modules, you have the flexibility to insert a lot more functionalities.

### A modularized shiny app — Where to start?

The image below explains the idea of the modularized shiny app. You start with a core shiny application; see it as the chassis of your car. It’s made of LEGO, so any other part made of LEGO can stick to your chassis. Such parts can change its functionality, and different modules will help you build different cars. Additionally, you want to have a brick instruction (plan) that tells which parts to take. To increase flexibility, the back pages of your brick instruction can contain a different model built from the same bricks: if you can build one app from your modules, you can also build a different app containing the same modules.

If this is clear to you, we can start building our app in R-shiny. Implementation rules:

• Each module is an R package
• The core R package defines the standardization of bricks
• The core app is a basic shiny app
• The brick instruction (plan) file is not in R

Why these rules exist will become clear reading the article.

### The app we want to build

The app we want to build will create different kinds of outputs from a panel of user inputs. These different outputs will show up inside the app, and all outputs will additionally go into a PDF file. The example will include two plots in the plot module and one table in the table module. As each module is an R-package, you can imagine adding many more R-packages step by step; a lot of outputs are possible within shiny. The main feature of this app is the possibility to add more and more modules. More modules will not screw up the PDF reporting function or the view function, because modules do not interact at all inside this app.

### The core R-package

The core package contains the structure that modules have to follow to fit into the core app. There are two kinds of structures that we will define as R-S4 classes.
One represents modules and one represents output elements in those modules. For task one, we call the object (class) a Report. The Report is the main brick we define in the core app. It contains:

• plots — A list of all elements shown in the report
• filename — The name of the output file (where to report to)
• obs — The handling of the input value input$obs
• rendered — Whether it shows up in the app right now

Additionally, the Report class carries some functionalities to generate a shiny output. Moreover, it allows creating PDF reports. The functionalities come within the methods shinyElement() and pdfElement() . In R-S4 this looks like this:

https://medium.com/media/356fa4ae59374cdaa0f59ab1f925b183/href

Now we would also like to define how to structure each element of the report. Thus we define a class AnyPlot that carries an expression as its only slot. The evalElement method will evaluate this expression. The pdfElement method creates an output that can go to PDF. The shinyElement method creates a PlotOutput by calling shiny::renderPlot(). The logElement method writes the expression into a logFile. The R-S4 code shows up below:

https://medium.com/media/6f03ee9acd3187f6c68716a94ea7ce2d/href

### The core app

To keep this example simple, the core app will include all inputs. The outputs of this app will be modular. The core app has to fulfill the following tasks:

1. have a container to show modules
2. include a button to print modules to PDF
3. imagine also a button printing modules to “.png”, “.jpg”, “.xlsx”
4. include the inputs

#### Showing modules

For task one we use the shinyElement method of a given object and insert this in any output. I decided on a Tab output for each module. So each module gets rendered inside a different tab.

Now here comes the hard part of the app. As I said I wanted to add two modules. One with plots and one with a table. The plan (config.xml) file has to contain this information. So I use this as a plan file:

<modules>
  <module>
    <package>module1</package>
    <name>Plot Module</name>
    <id>module1</id>
    <class>PlotReport</class>
  </module>
  <module>
    <package>module2</package>
    <name>Text Output</name>
    <id>module2</id>
    <class>TableReport</class>
  </module>
</modules>

You can see I have two modules. There is a package for each module. Inside this package, a class defines (see section module packages) the output. This class is a child of our Report class.

The module shows up as a tab inside our app. We will go through this step by step. First, we need to have a function to load the packages for each module:

library(XML)

load_module <- function(xmlItem){
  devtools::load_all(paste0("./", xmlValue(xmlItem[["package"]])))
}

Second, we need a function to generate a tab out of the information of the module:

library(shiny)

module_tab <- function(xmlItem){
  tabPanel(XML::xmlValue(xmlItem[["name"]]),
           uiOutput(xmlValue(xmlItem[["id"]])))
}

As we now have these two functions, we can iterate over the XML file and build up our app. First we need a tab container inside the UI, such as tabsetPanel(id = "modules"). Afterwards, we can read the configuration of the app into that tabset using the appendTab function. The function XML::xmlApply lets us iterate over each node of the XML (config.xml) and perform these tasks.

configuration <- xmlApply(xmlRoot(xmlParse("config.xml")),
  function(xmlItem){
    load_module(xmlItem)
    appendTab("modules", module_tab(xmlItem), select = TRUE)
    list(
      name = xmlValue(xmlItem[["name"]]),
      class = xmlValue(xmlItem[["class"]]),
      id = xmlValue(xmlItem[["id"]])
    )
  })

Each module is now loaded into the app in a static manner. The next part will deal with making it reactive.

For dynamic rendering of the panels, it is necessary to know some inputs. First, the tab the user chose: the input$modules variable defines the chosen tab. Additionally, the outputs of our shiny app must update on one other input, input$obs. So upon changing the tab or changing input$obs we need to call an event. This event will call the constructor function of our S4 object; following this, the shinyElement method renders the output.

https://medium.com/media/4b7747178ddeed404fa31ed7feda0eed/href

The reactive report_obj is a function that can call the constructor of our Report object. Using the observeEvent function for input$obs and input$modules we call this reactive. This allows reacting to user inputs.

#### Deriving PDF files from reports

The pdfElement function renders the S4 object as a PDF file. If this worked fine, the PDF elements add up to the download button. An extra label checks the success of the PDF rendering.

https://medium.com/media/c9f21d78cb1f303d477d39cb86f3ef68/href

We finished the core app. You can find the app here: app.R and the core package here: core. The last step is to put the whole truck together.

### Module packages

The two module packages will now contain two classes. Both must be children of the class Report, and each element inside these classes must be a child class of the class AnyPlot. Red bricks in the next picture represent Reports and yellow bricks represent AnyPlots.

#### Plot package

The first module package will produce a scatter plot and a histogram plot. Both are children of AnyPlot by contains='AnyPlot' inside their class definition. PlotReport is the class for the Report of this package; it contains both of these plots inside the plots slot. See the code below for the constructors of those classes.

https://medium.com/media/79e9370314c6d5ef52a64232b0174868/href

#### Table package

The table package follows the same rules as the plot package. The main difference is that there is only one element inside the plots slot.
This one element is not a plot, which is why it contains a data.frame call as its expression.

https://medium.com/media/baf4284fbf546af244b1d44cd8b98771/href

To render a data.frame call inside shiny, we have to overwrite the shinyElement method. Instead of returning a renderPlot output we will return a renderDataTable output. Additionally, the pdfElement method has to return a gridExtra::grid.table output.

https://medium.com/media/9e5c2ca5e38c0547b45aec2adea6c719/href

#### Packaging advantage

A major advantage of packaging each module is the definition of dependencies. The DESCRIPTION file specifies all dependencies of the module package. For example, the table module needs the gridExtra package, and the core app package needs shiny, methods, XML, and devtools. The app itself does not need extra library calls, and any co-worker can install all dependencies.

### Final words

You now have the tools to start building up your own large scale shiny application. Modularize the app using packages. Standardize it using S4 or any other object-oriented R style. Set the app up using an XML or JSON document. Set up the core package and the module packages inside one directory; you can load them with devtools and start building your shiny app file app.R. You can now build your own app by exchanging the module packages. Like every kid, you can enjoy playing with your truck afterward. I cannot tell you if it’s more fun building or more fun rolling.

Dear Reader: It’s always a pleasure to write about my work on building modular shiny apps. I thank you for reading until the end of this article. If you liked the article, you can clap for it on Medium or star the repository on GitHub. In case of any comment, leave it here or on my LinkedIn profile http://linkedin.com/in/zappingseb.
### Further reading

A shiny Web App from LEGO — truck + trailer was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Stories by Sebastian Wolf on Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

### 2018’s Top 7 Python Libraries for Data Science and AI

This is a list of the best libraries that changed our lives this year, compiled from my weekly digests.

Continue Reading…

### Four short links: 21 January 2019

Programming Spreadsheets, Star Emulator, AI for Social Good, Participatory Democracy

1. Applying Programming Language Research Ideas to Transform Spreadsheets (Microsoft) -- an Excel cell can now contain a first-class record, linked to external data sources. And ordinary Excel formulas can now compute array values that “spill” into adjacent cells (dynamic arrays). There is more to come: we have a working version of full, higher-order, lexically scoped lambdas (and let-expressions) in Excel’s formula language, and we are prototyping sheet-defined functions and full nesting of array values.
2. Darkstar: A Xerox Star Emulator -- this blog post describes the journey of building the emulator for this historic system. From the good folks at the Living Computer Museum in Seattle.
3. AI for Social Good Impact Challenge -- $25M pool, $500K-$2M for 1-3 years. If you are selected to receive a grant, the standard grant agreement will require any intellectual property created with grant funding from Google to be made available for free to the public under a permissive open source license.
4. Decidim -- free open source participatory democracy for cities and organizations.

### Real-time speed of light from Earth to Mars

Hurry up, light. We’re gonna be late:

By James O’Donoghue, the animation shows the speed of light in real time. The distances between Earth, the moon, and Mars are to scale, but the bodies are scaled up so that you can see them.

See also the animations for Earth to the moon and of light orbiting Earth.
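The travel times in such animations come from simple division. As a back-of-the-envelope check (the distances below are rough assumed averages, not necessarily the figures used in the animation):

```python
# Back-of-the-envelope light travel times. The distances are rough
# assumed averages, not the exact values used in the animation.
C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s

def light_time_seconds(distance_km: float) -> float:
    """Seconds for light to cover distance_km."""
    return distance_km / C_KM_PER_S

EARTH_MOON_KM = 384_400       # average Earth-moon distance
EARTH_MARS_KM = 225_000_000   # rough average Earth-Mars distance

print(f"Earth -> moon: {light_time_seconds(EARTH_MOON_KM):.2f} s")
print(f"Earth -> Mars: {light_time_seconds(EARTH_MARS_KM) / 60:.1f} min")
```

At these assumed distances, light needs roughly 1.3 seconds to reach the moon and about 12.5 minutes to reach Mars.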


### Magister Dixit

“Wise ones do not rely on theories. Wise ones do not tie themselves up. They only watch carefully and listen carefully.” Buddha

### Docker Images for R: r-base versus r-apt

(This article was first published on R on datawookie, and kindly contributed to R-bloggers)

I need to deploy a Plumber API in a Docker container. The API has some R package dependencies which need to be baked into the Docker image. There are a few options for the base image:

• r-base,
• rocker/tidyverse and
• rocker/r-apt.

The first option, r-base, would require building the dependencies from source, a somewhat time-consuming operation. The last option, r-apt, makes it possible to install most packages using apt, which is likely to be much quicker. I’ll immediately eliminate the remaining option, tidyverse, because although it already contains a load of packages, many of those are not required and, in addition, it incorporates RStudio Server, which is definitely not necessary for this project.

I’m trying to optimise the image for two criteria:

• small image and
• quick build time.

Of these the latter is the most important. Under normal circumstances I would not be too concerned about build time because Docker caches complete layers. However, this image is going to be built using a continuous integration system which will build the entire image from scratch each time.

## r-base

Let’s start with the r-base image. Installing the required packages is as simple as executing R using RUN and then running install.packages(). The packages are installed from source, which takes some time (compiling source code and installing further dependencies).

FROM r-base

RUN R -e 'install.packages(c("plumber", "jsonlite", "dplyr", "stringr", "here"))'

## r-apt

Using the r-apt image allows you to install binary versions of those packages.

FROM rocker/r-apt:bionic

RUN apt-get update && \
apt-get install -y -qq \
r-cran-plumber \
r-cran-jsonlite \
r-cran-dplyr \
r-cran-stringr

RUN R -e 'install.packages("here")'

CMD ["R"]

The here package is not (currently) available as a binary, so you still need to compile it from source. This package alone does not have much impact on the build time.

The base r-apt image will launch bash by default, so we need to explicitly start R using CMD.

## Build Times

What’s the difference in build times? Well it turns out that this really does make a big difference.

$ time docker build --no-cache -t r_apt -f Dockerfile-r-apt .
real    3m39.590s
user    0m0.115s
sys     0m0.082s

$ time docker build --no-cache -t r_base -f Dockerfile-r-base .
real    14m2.068s
user    0m0.467s
sys     0m0.456s

Using binary packages takes just under 4 minutes (there is still some download time, and I’m not on a very fast connection). Building those packages from source is far more time-consuming, taking more than three times longer.

## Image Sizes

What about the sizes of the images?

$ docker images
REPOSITORY   TAG      IMAGE ID       CREATED          SIZE
r_base       latest   9e278980b348   2 minutes ago    944MB
r_apt        latest   493e243d84fe   12 minutes ago   805MB

The image built using binary packages is smaller too. Based on these results I’ll be using the r-apt base image for this project, but there’s no saying that r-base won’t come in handy for the next project!

To leave a comment for the author, please follow the link and comment on their blog: R on datawookie.

Continue Reading…

### Book Memo: “Advanced Time Series Data Analysis”

Forecasting Using EViews
Introduces the latest developments in forecasting in advanced quantitative data analysis. This book presents advanced univariate multiple regressions, which can directly be used to forecast their dependent variables, evaluate their in-sample forecast values, and compute forecast values beyond the sample period. Various alternative multiple regression models are presented, based on a single time series, bivariate, and triple time series, which are developed by taking into account specific growth patterns of each dependent variable, starting with the simplest model up to the most advanced model. Graphs of the observed scores and the forecast evaluation of each of the models are offered to show the worst and the best forecast models among each set of models for a specific independent variable. Advanced Time Series Data Analysis: Forecasting Using EViews provides readers with a number of modern, advanced forecast models not featured in any other book.
They include various interaction models, models with alternative trends (including models with heterogeneous trends), and complete heterogeneous models for monthly, quarterly, and annual time series. Each of the models can be applied by all quantitative researchers.

Continue Reading…

### R Packages worth a look

An R-Shiny Application for Creating Visual Abstracts (abstractr)
An R-Shiny application to create visual abstracts for original research. A variety of user-defined options and formatting are included.

Bayesian Reliability Estimation (Bayesrel)
So far, it provides the most common single test reliability estimates, being: Coefficient Alpha, Guttman’s lambda-2/-4/-6, greatest lower bound and Mcd …

Inferring Causal Effects on Collective Outcomes under Interference (netchain)
In networks, treatments may spill over from the treated individual to his or her social contacts and outcomes may be contagious over time. Under this s …

Continue Reading…

## January 20, 2019

### Spatial lag model trees

(This article was first published on Achim Zeileis, and kindly contributed to R-bloggers)

Economic growth models are recursively partitioned to assess heterogeneity in growth and convergence across EU regions while adjusting for spatial dependencies. Accompanied by R package lagsarlmtree, combining partykit::mob and spdep::lagsarlm.

## Citation

Martin Wagner, Achim Zeileis (2019). “Heterogeneity and Spatial Dependence of Regional Growth in the EU: A Recursive Partitioning Approach.” German Economic Review, 20(1), 67-82. doi:10.1111/geer.12146 [pdf]

## Abstract

We use model-based recursive partitioning to assess heterogeneity of growth and convergence processes based on economic growth regressions for 255 European Union NUTS2 regions from 1995 to 2005. Spatial dependencies are taken into account by augmenting the model-based regression tree with a spatial lag.
The starting point of the analysis is a human-capital-augmented Solow-type growth equation similar in spirit to Mankiw et al. (1992, The Quarterly Journal of Economics, 107, 407-437). Initial GDP and the share of highly educated in the working age population are found to be important for explaining economic growth, whereas the investment share in physical capital is only significant for coastal regions in the PIIGS countries. For all considered spatial weight matrices, recursive partitioning leads to a regression tree with four terminal nodes, with partitioning according to (i) capital regions, (ii) non-capital regions in or outside the so-called PIIGS countries and (iii) inside the respective PIIGS regions furthermore between coastal and non-coastal regions. The choice of the spatial weight matrix clearly influences the spatial lag parameter, while the estimated slope parameters are very robust to it. This indicates that accounting for heterogeneity is an important aspect of modeling regional economic growth and convergence.

## Software

https://CRAN.R-project.org/package=lagsarlmtree

## Heterogeneity of regional growth in the EU

The growth model to be assessed for heterogeneity is a linear regression model for the average growth rate of real GDP per capita (ggdpcap) as the dependent variable with the following regressors:

• Initial real GDP per capita in logs (gdpcap0), sometimes simply referred to as initial income, to capture potential β-convergence.
• Investment share in GDP (shgfcf) to capture physical capital accumulation.
• Shares of high and of medium educated in the labor force (shsh and shsm) as measures of human capital.

Thus, a human-capital-augmented version of the Solow model is employed, inspired by the by now classical work of Mankiw et al. (1992). The well-known data sets from Sala-i-Martin et al. (2004) and Fernandez et al. (2001) are employed below for estimation.
To assess whether a single growth regression model with stable parameters across all EU regions is sufficient, splitting the data by the following partitioning variables is considered:

• Log of initial real GDP per capita itself as a partitioning variable, as a simple device to check for the presence of initial-income-driven convergence clubs.
• Two measures for traffic accessibility of the region, one for accessibility via rail (accessrail) and one via the road network (accessroad).
• A dummy variable for capital regions (capital).
• Dummy variables for border regions (regborder) and coastal regions (regcoast).
• Dummy for EU Objective 1 regions (regobj1).
• Two dummy variables for Central and Eastern European countries (cee) and Portugal/Ireland/Italy/Greece/Spain (piigs).

To adjust for spatial dependencies, a spatial lag term with inverse distance weights is considered here. Other weight specifications lead to very similar estimated tree structures and regression coefficients, though.

library("lagsarlmtree")
data("GrowthNUTS2", package = "lagsarlmtree")
data("WeightsNUTS2", package = "lagsarlmtree")
tr <- lagsarlmtree(ggdpcap ~ gdpcap0 + shgfcf + shsh + shsm | gdpcap0 +
    accessrail + accessroad + capital + regboarder + regcoast + regobj1 +
    cee + piigs,
  data = GrowthNUTS2, listw = WeightsNUTS2$invw,
  minsize = 12, alpha = 0.05)
print(tr)
## Spatial lag model tree
##
## Model formula:
## ggdpcap ~ gdpcap0 + shgfcf + shsh + shsm | gdpcap0 + accessrail +
##     accessroad + capital + regboarder + regcoast + regobj1 +
##     cee + piigs
##
## Fitted party:
## [1] root
## |   [2] capital in no
## |   |   [3] piigs in no: n = 176
## |   |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |   |           0.11055    -0.01171     0.00208     0.02195     0.00179
## |   |   [4] piigs in yes
## |   |   |   [5] regcoast in no: n = 13
## |   |   |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |   |   |            0.1606     -0.0159     -0.0469      0.0789     -0.0234
## |   |   |   [6] regcoast in yes: n = 39
## |   |   |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |   |   |           0.07348    -0.01106     0.09156     0.11668     0.00942
## |   [7] capital in yes: n = 27
## |       (Intercept)     gdpcap0      shgfcf        shsh        shsm
## |            0.2056     -0.0223     -0.0075      0.0411      0.0528
##
## Number of inner nodes:    3
## Number of terminal nodes: 4
## Number of parameters per node: 5
## Objective function (residual sum of squares): 0.0155
##
## Rho (from lagsarlm model):
##   rho
## 0.837


The resulting linear regression tree can be visualized with p-values from the parameter stability tests displayed in the inner nodes and a scatter plot of GDP per capita growth (ggdpcap) vs. (log) initial real GDP per capita (gdpcap0) in the terminal nodes:

plot(tr, tp_args = list(which = 1))


It is most striking that the speed of β-convergence is much higher for the 27 capital regions. More details about differences in the other regressors are shown in the table below. Finally, it is of interest which variables were not selected for splitting in the tree, i.e., are not associated with significant parameter instabilities: initial income, the border dummy, and Objective 1 regions, among others.

| Node | n   | capital | piigs | regcoast | (Const.)      | gdpcap0          | shgfcf           | shsh          | shsm             |
|------|-----|---------|-------|----------|---------------|------------------|------------------|---------------|------------------|
| 3    | 176 | no      | no    |          | 0.111 (0.016) | –0.0117 (0.0016) | 0.0021 (0.0077)  | 0.022 (0.011) | 0.0018 (0.0068)  |
| 5    | 13  | no      | yes   | no       | 0.161 (0.128) | –0.0159 (0.0135) | –0.0469 (0.0815) | 0.079 (0.059) | –0.0234 (0.0660) |
| 6    | 39  | no      | yes   | yes      | 0.073 (0.056) | –0.0111 (0.0059) | 0.0916 (0.0420)  | 0.117 (0.029) | 0.0094 (0.0218)  |
| 7    | 27  | yes     |       |          | 0.206 (0.031) | –0.0223 (0.0029) | –0.0075 (0.0259) | 0.041 (0.020) | 0.0528 (0.0117)  |

For more details see the full manuscript. Replication materials for the entire analysis from the manuscript are available as a demo in the package:

demo("GrowthNUTS2", package = "lagsarlmtree")



### What’s new on arXiv

We provide a multilevel approach for analysing performances of parallel algorithms. The main outcome of this approach is that the algorithm is described by using a set of operators which are related to each other according to the problem decomposition. The decomposition level determines the granularity of the algorithm. A set of block matrices (decomposition and execution) highlights fundamental characteristics of the algorithm, such as inherent parallelism and sources of overheads.
Given a graph $G$ and a vertex $q\in G$, the community search (CS) problem aims to efficiently find a subgraph of $G$ whose vertices are closely related to $q$. Communities are prevalent in social and biological networks, and can be used in product advertisement and social event recommendation. In this paper, we study profiled community search (PCS), where CS is performed on a profiled graph. This is a graph in which each vertex has labels arranged in a hierarchical manner. Extensive experiments show that PCS can identify communities with themes that are common to their vertices, and is more effective than existing CS approaches. As a naive solution for PCS is highly expensive, we have also developed a tree index, which facilitates efficient and online solutions for PCS.
This work addresses the problem of segmentation in time series data with respect to a statistical parameter of interest in Bayesian models. It is common to assume that the parameters are distinct within each segment. As such, many Bayesian change point detection models do not exploit the segment parameter patterns, which can improve performance. This work proposes a Bayesian change point detection algorithm that makes use of repetition in segment parameters, by introducing segment class labels that utilise a Dirichlet process prior.
MAPF is the problem of finding paths for multiple agents such that every agent reaches its goal and the agents do not collide. Most prior work on MAPF was on grids, assumed all actions cost the same, assumed agents have no volume, and considered discrete time steps. In this work we propose a MAPF algorithm that does not make any of these assumptions, is complete, and provides provably optimal solutions. This algorithm is based on a novel combination of SIPP, a continuous-time single-agent planning algorithm, and CBS, a state-of-the-art multi-agent pathfinding algorithm. We analyze this algorithm, discuss its pros and cons, and evaluate it experimentally on several standard benchmarks.
We quantify the separation between the numbers of labeled examples required to learn in two settings: Settings with and without the knowledge of the distribution of the unlabeled data. More specifically, we prove a separation by $\Theta(\log n)$ multiplicative factor for the class of projections over the Boolean hypercube of dimension $n$. We prove that there is no separation for the class of all functions on domain of any size. Learning with the knowledge of the distribution (a.k.a. fixed-distribution learning) can be viewed as an idealized scenario of semi-supervised learning where the number of unlabeled data points is so great that the unlabeled distribution is known exactly. For this reason, we call the separation the value of unlabeled data.
We analyze the changes in the training and educational efforts of the SciNet HPC Consortium, a Canadian academic High Performance Computing center, in the areas of Scientific Computing and High-Performance Computing, over the last six years. Initially, SciNet offered isolated training events on how to use HPC systems and write parallel code, but the training program now consists of a broad range of workshops and courses that users can take toward certificates in scientific computing, data science, or high-performance computing. Using data on enrollment, attendance, and certificate numbers from SciNet’s education website, used by almost 1800 users so far, we extract trends on the growth, demand, and breadth of SciNet’s training program. Among the results are a steady overall growth, a sharp and steady increase in the demand for data science training, and a wider participation of ‘non-traditional’ computing disciplines, which has motivated an increasingly broad spectrum of training offerings. It is also of interest that many of the training initiatives have evolved into courses that can be taken as part of the graduate curriculum at the University of Toronto.
These are lecture notes for my course on Artificial Neural Networks that I have given at Chalmers (FFR135) and Gothenburg University (FIM720). This course describes the use of neural networks in machine learning: deep learning, recurrent networks, and other supervised and unsupervised machine-learning algorithms.
Ensemble Kalman methods constitute an increasingly important tool in both state and parameter estimation problems. Their popularity stems from the derivative-free nature of the methodology which may be readily applied when computer code is available for the underlying state-space dynamics (for state estimation) or for the parameter-to-observable map (for parameter estimation). There are many applications in which it is desirable to enforce prior information in the form of equality or inequality constraints on the state or parameter. This paper establishes a general framework for doing so, describing a widely applicable methodology, a theory which justifies the methodology, and a set of numerical experiments exemplifying it.
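One widely used way to impose such constraints is to project the ensemble back onto the constraint set after each Kalman update. The sketch below illustrates that idea on a toy linear inverse problem with box constraints; the specific update form, the clipping-based projection, and all names are my own assumptions, not the paper's framework.

```python
import numpy as np

def eki_step(U, G, y, Gamma):
    """One ensemble Kalman inversion update. U is a (J, d) parameter
    ensemble; G maps it to (J, m) predicted observations."""
    Gu = G(U)
    du, dg = U - U.mean(0), Gu - Gu.mean(0)
    Cup = du.T @ dg / (len(U) - 1)   # parameter/prediction cross-covariance
    Cgg = dg.T @ dg / (len(U) - 1)   # prediction covariance
    K = Cup @ np.linalg.inv(Cgg + Gamma)
    return U + (y - Gu) @ K.T

def constrained_eki(U, G, y, Gamma, lo, hi, steps=20):
    """EKI with box constraints enforced by projecting (clipping) the
    ensemble after every update -- one simple way to encode prior
    inequality information."""
    for _ in range(steps):
        U = np.clip(eki_step(U, G, y, Gamma), lo, hi)
    return U

# Toy linear inverse problem: recover u from y = A u, knowing u lies in [0, 1].
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
u_true = np.array([0.3, 0.7])
y = A @ u_true
G = lambda U: U @ A.T
U0 = rng.uniform(0.0, 1.0, size=(50, 2))   # constraint-respecting prior ensemble
U = constrained_eki(U0, G, y, 0.01 * np.eye(5), lo=0.0, hi=1.0)
```

The derivative-free character is visible here: only evaluations of G are used, never its gradient, and the projection keeps every member feasible at every iteration.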
Natural lifeforms specialise to their environmental niches across many levels; from low-level features such as DNA and proteins, through to higher-level artefacts including eyes, limbs, and overarching body plans. We propose Multi-Level Evolution (MLE), a bottom-up automatic process that designs robots across multiple levels and niches them to tasks and environmental conditions. MLE concurrently explores constituent molecular and material ‘building blocks’, as well as their possible assemblies into specialised morphological and sensorimotor configurations. MLE provides a route to fully harness a recent explosion in available candidate materials and ongoing advances in rapid manufacturing processes. We outline a feasible MLE architecture that realises this vision, highlight the main roadblocks and how they may be overcome, and show robotic applications to which MLE is particularly suited. By forming a research agenda to stimulate discussion between researchers in related fields, we hope to inspire the pursuit of multi-level robotic design all the way from material to machine.
Matrix profile has recently been proposed as a promising technique for the problem of all-pairs similarity search on time series. Efficient algorithms have been proposed for computing it, e.g., STAMP, STOMP and SCRIMP++. All these algorithms use the z-normalized Euclidean distance to measure the distance between subsequences. However, as we observed, for some datasets other Euclidean measurements are more useful for knowledge discovery from time series. In this paper, we propose efficient algorithms for computing matrix profile for a general class of Euclidean distances. We first propose a simple but efficient algorithm called AAMP for computing matrix profile with the ‘pure’ (non-normalized) Euclidean distance. Then, we extend our algorithm for the p-norm distance. We also propose an algorithm, called ACAMP, that uses the same principle as AAMP, but for the case of z-normalized Euclidean distance. We implemented our algorithms, and evaluated their performance through experimentation. The experiments show excellent performance results. For example, they show that AAMP is very efficient for computing matrix profile for non-normalized Euclidean distances. The results also show that the ACAMP algorithm is significantly faster than SCRIMP++ (the state of the art matrix profile algorithm) for the case of z-normalized Euclidean distance.
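AAMP and ACAMP themselves are not reproduced here, but the object they compute under the ‘pure’ Euclidean distance can be illustrated with a naive O(n²) baseline (all names below are mine, not the paper's):

```python
import numpy as np

def matrix_profile_euclidean(ts: np.ndarray, m: int) -> np.ndarray:
    """Naive matrix profile: for each length-m subsequence, the plain
    (non-normalized) Euclidean distance to its nearest non-overlapping
    neighbour. Quadratic in the number of subsequences."""
    n = len(ts) - m + 1
    subs = np.stack([ts[i:i + m] for i in range(n)])
    profile = np.full(n, np.inf)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        # Exclude trivial matches: subsequences overlapping subsequence i.
        lo, hi = max(0, i - m + 1), min(n, i + m)
        d[lo:hi] = np.inf
        profile[i] = d.min()
    return profile

# A periodic signal: every subsequence has a near-exact match one period
# away, so all profile values are small.
ts = np.sin(np.linspace(0, 8 * np.pi, 200))
mp = matrix_profile_euclidean(ts, m=25)
```

Algorithms like AAMP reach the same result far faster by updating distances incrementally between consecutive subsequences instead of recomputing them.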
In this paper, we investigate an artificial-intelligence (AI) driven approach to design error correction codes (ECC). Classic error correction code was designed upon coding theory that typically defines code properties (e.g., hamming distance, subchannel reliability, etc.) to reflect code performance. Its code design is to optimize code properties. However, an AI-driven approach doesn’t necessarily rely on coding theory any longer. Specifically, we propose a constructor-evaluator framework, in which the code constructor is realized by AI algorithms and the code evaluator provides code performance metric measurements. The code constructor keeps improving the code construction to maximize code performance that is evaluated by the code evaluator. As examples, we construct linear block codes and polar codes with reinforcement learning (RL) and evolutionary algorithms. The results show that comparable code performance can be achieved with respect to the existing codes. It is noteworthy that our method can provide superior performances where existing classic constructions fail to achieve optimum for a specific decoder (e.g., list decoding for polar codes).
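The constructor-evaluator loop can be made concrete with a deliberately tiny sketch: here the “constructor” is plain random search (standing in for the paper's RL and evolutionary constructors) and the “evaluator” scores a linear block code by its minimum Hamming distance; all function names are my own.

```python
import itertools
import random
import numpy as np

def min_hamming_distance(G: np.ndarray) -> int:
    """Evaluator: minimum weight over all non-zero codewords of the linear
    code generated by G (equals the code's minimum distance)."""
    k = G.shape[0]
    best = G.shape[1] + 1
    for msg in itertools.product([0, 1], repeat=k):
        if not any(msg):
            continue
        cw = np.mod(np.array(msg) @ G, 2)
        best = min(best, int(cw.sum()))
    return best

def construct_code(k: int, n: int, iters: int = 2000, seed: int = 0):
    """Constructor: random search over systematic generator matrices
    G = [I | P], keeping the candidate the evaluator scores highest."""
    rng = random.Random(seed)
    best_G, best_d = None, -1
    for _ in range(iters):
        P = np.array([[rng.randint(0, 1) for _ in range(n - k)]
                      for _ in range(k)])
        G = np.hstack([np.eye(k, dtype=int), P])
        d = min_hamming_distance(G)
        if d > best_d:
            best_G, best_d = G, d
    return best_G, best_d

G, d = construct_code(k=4, n=7)
# For a (7,4) code, the best achievable minimum distance is 3 (the Hamming code).
```

The paper's point is that the constructor slot can be filled by a learning algorithm, and the evaluator slot by an actual decoder simulation, so the search optimizes end-to-end performance rather than a hand-picked code property.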
In this paper, we address the problem of having characters with different scales in scene text recognition. We propose a novel scale aware feature encoder (SAFE) that is designed specifically for encoding characters with different scales. SAFE is composed of a multi-scale convolutional encoder and a scale attention network. The multi-scale convolutional encoder targets at extracting character features under multiple scales, and the scale attention network is responsible for selecting features from the most relevant scale(s). SAFE has two main advantages over the traditional single-CNN encoder used in current state-of-the-art text recognizers. First, it explicitly tackles the scale problem by extracting scale-invariant features from the characters. This allows the recognizer to put more effort in handling other challenges in scene text recognition, like those caused by view distortion and poor image quality. Second, it can transfer the learning of feature encoding across different character scales. This is particularly important when the training set has a very unbalanced distribution of character scales, as training with such a dataset will make the encoder biased towards extracting features from the predominant scale. To evaluate the effectiveness of SAFE, we design a simple text recognizer named scale-spatial attention network (S-SAN) that employs SAFE as its feature encoder, and carry out experiments on six public benchmarks. Experimental results demonstrate that S-SAN can achieve state-of-the-art (or, in some cases, extremely competitive) performance without any post-processing.
Inspired by recent work of P.-L. Lions on conditional optimal control, we introduce a problem of optimal stopping under bounded rationality: the objective is the expected payoff at the time of stopping, conditioned on another event. For instance, an agent may care only about states where she is still alive at the time of stopping, or a company may condition on not being bankrupt. We observe that conditional optimization is time-inconsistent due to the dynamic change of the conditioning probability and develop an equilibrium approach in the spirit of R. H. Strotz’ work for sophisticated agents in discrete time. Equilibria are found to be essentially unique in the case of a finite time horizon whereas an infinite horizon gives rise to non-uniqueness and other interesting phenomena. We also introduce a theory which generalizes the classical Snell envelope approach for optimal stopping by considering a pair of processes with Snell-type properties.
Neural architecture search (NAS) methods have been proposed to release human experts from tedious architecture engineering. However, most current methods are constrained in small-scale search due to the issue of computational resources. Meanwhile, directly applying architectures searched on small datasets to large-scale tasks often bears no performance guarantee. This limitation impedes the wide use of NAS on large-scale tasks. To overcome this obstacle, we propose an elastic architecture transfer mechanism for accelerating large-scale neural architecture search (EAT-NAS). In our implementations, architectures are first searched on a small dataset (the width and depth of architectures are taken into consideration as well), e.g., CIFAR-10, and the best is chosen as the basic architecture. Then the whole architecture is transferred with elasticity. We accelerate the search process on a large-scale dataset, e.g., the whole ImageNet dataset, with the help of the basic architecture. What we propose is not only a NAS method but a mechanism for architecture-level transfer. In our experiments, we obtain two final models EATNet-A and EATNet-B that achieve competitive accuracies, 73.8% and 73.7% on ImageNet, respectively, which also surpass the models searched from scratch on ImageNet under the same settings. For computational cost, EAT-NAS takes only less than 5 days on 8 TITAN X GPUs, which is significantly less than the computational consumption of the state-of-the-art large-scale NAS methods.
The present work is aimed to examine the potential of advanced machine learning strategies to predict the monthly rainfall (precipitation) for the Indus Basin, using climatological variables such as air temperature, geo-potential height, relative humidity and elevation. In this work, the focus is on thirteen geographical locations, called index points, within the basin. Arguably, not all of the hydrological components are relevant to the precipitation rate, and therefore, need to be filtered out, leading to a lower-dimensional feature space. Towards this goal, we adopted the gradient boosting method to extract the most contributive features for precipitation rate prediction. Five state-of-the-art machine learning methods have then been trained where pearson correlation coefficient and mean absolute error have been reported as the prediction performance criteria. The Random Forest regression model outperformed the other regression models achieving the maximum pearson correlation coefficient and minimum mean absolute error for most of the index points. Our results suggest the relative humidity (for pressure levels of 300 mb and 150 mb, respectively), the u-direction wind (for pressure level of 700 mb), air temperature (for pressure levels of 150 mb and 10 mb, respectively) as the top five influencing features for accurate forecasting the precipitation rate.
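The two reported criteria, Pearson correlation coefficient and mean absolute error, are standard and easy to state precisely; a small sketch with made-up example numbers (not the paper's data):

```python
import numpy as np

def pearson_r(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Pearson correlation coefficient between observed and predicted values."""
    yt, yp = y_true - y_true.mean(), y_pred - y_pred.mean()
    return float((yt @ yp) / np.sqrt((yt @ yt) * (yp @ yp)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.abs(y_true - y_pred).mean())

# Hypothetical observed vs predicted monthly rainfall values.
obs = np.array([2.0, 4.0, 6.0, 8.0])
pred = np.array([2.5, 3.5, 6.5, 7.5])
# mae(obs, pred) == 0.5; pearson_r(obs, pred) is close to 1 because the
# predictions track the observations.
```

Maximizing the first and minimizing the second, as the paper does when ranking the five regression models, rewards predictions that both track the shape of the series and stay close in absolute terms.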
A regression-based BNN model is proposed to predict spatiotemporal quantities like hourly rider demand with calibrated uncertainties. The main contributions of this paper are (i) A feed-forward deterministic neural network (DetNN) architecture that predicts cyclical time series data with sensitivity to anomalous forecasting events; (ii) A Bayesian framework applying SVGD to train large neural networks for such tasks, capable of producing time series predictions as well as measures of uncertainty surrounding the predictions. Experiments show that the proposed BNN reduces average estimation error by 10% across 8 U.S. cities compared to a fine-tuned multilayer perceptron (MLP), and 4% better than the same network architecture trained without SVGD.
We study the problem of reducing the amount of labeled training data required to train supervised classification models. We approach it by leveraging Active Learning, through sequential selection of examples which benefit the model most. Selecting examples one by one is not practical for the amount of training examples required by the modern Deep Learning models. We consider the mini-batch Active Learning setting, where several examples are selected at once. We present an approach which takes into account both informativeness of the examples for the model, as well as the diversity of the examples in a mini-batch. By using the well studied K-means clustering algorithm, this approach scales better than the previously proposed approaches, and achieves comparable or better performance.
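A minimal sketch of the informativeness-plus-diversity idea (my own simplification, not the authors' code): rank unlabeled points by an uncertainty score, keep an enlarged candidate pool of the most informative ones, cluster the pool with plain k-means, and take the candidate closest to each centroid.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns centroids and assignments."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def select_batch(X, uncertainty, batch_size, pool_factor=5, seed=0):
    """Mini-batch selection: keep the most informative candidates, then
    use k-means to pick one representative per cluster for diversity."""
    pool_size = min(len(X), batch_size * pool_factor)
    pool_idx = np.argsort(-uncertainty)[:pool_size]   # most uncertain first
    centroids, _ = kmeans(X[pool_idx], batch_size, seed=seed)
    chosen = []
    for c in centroids:
        d = np.linalg.norm(X[pool_idx] - c, axis=1)
        d[np.isin(np.arange(pool_size), chosen)] = np.inf  # no repeats
        chosen.append(int(d.argmin()))
    return pool_idx[chosen]

# Synthetic unlabeled data with random stand-in uncertainty scores.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
uncertainty = rng.random(200)
batch = select_batch(X, uncertainty, batch_size=10)
```

In a real setting the uncertainty scores would come from the model (e.g., predictive entropy or margin), and the clustering would run in a learned feature space rather than raw inputs.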

### If you did not already know

TechKG
Knowledge graph is a kind of valuable knowledge base which would benefit lots of AI-related applications. Up to now, lots of large-scale knowledge graphs have been built. However, most of them are non-Chinese and designed for general purpose. In this work, we introduce TechKG, a large scale Chinese knowledge graph that is technology-oriented. It is built automatically from massive technical papers that are published in Chinese academic journals of different research domains. Some carefully designed heuristic rules are used to extract high quality entities and relations. Totally, it comprises over 260 million triplets that are built upon more than 52 million entities which come from 38 research domains. Our preliminary experiments indicate that TechKG has high adaptability and can be used as a dataset for many diverse AI-related applications. We released TechKG at: http://www.techkg.cn.

Vertex-Diminished Random Walk (VDRW)
Imbalanced data widely exists in many high-impact applications. An example is in air traffic control, where we aim to identify the leading indicators for each type of accident cause from historical records. Among all three types of accident causes, historical records with ‘personnel issues’ are much more than the other two types (‘aircraft issues’ and ‘environmental issues’) combined. Thus, the resulting dataset is highly imbalanced, and can be naturally modeled as a network. Up until now, most existing work on imbalanced data analysis has focused on the classification setting, and very little is devoted to learning the node representation from imbalanced networks. To address this problem, in this paper, we propose Vertex-Diminished Random Walk (VDRW) for imbalanced network analysis. The key idea is to encourage the random particle to walk within the same class by adjusting the transition probabilities at each step. It resembles the existing Vertex Reinforced Random Walk in terms of the dynamic nature of the transition probabilities, as well as some convergence properties. However, it is more suitable for analyzing imbalanced networks as it leads to more separable node representations in the embedding space. Then, based on VDRW, we propose a semi-supervised network representation learning framework named ImVerde for imbalanced networks, in which context sampling uses VDRW and the label information to create node-context pairs, and balanced-batch sampling adopts a simple under-sampling method to balance these pairs in different classes. Experimental results demonstrate that ImVerde based on VDRW outperforms state-of-the-art algorithms for learning network representation from imbalanced data. …
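The transition-probability adjustment at the heart of VDRW can be sketched as a single biased walk step. This toy version (numpy only) simply adds a bonus weight to same-class neighbors; the paper's exact reweighting scheme and its convergence machinery are more involved, so treat `alpha` and the encoding below as illustrative:

```python
import numpy as np

def vdrw_step(adj, labels, current, alpha=2.0, rng=None):
    """One step of a class-biased random walk.

    adj: (n, n) 0/1 adjacency matrix.
    labels: (n,) integer class labels, with -1 marking unlabeled nodes.
    alpha: extra weight given to neighbors sharing the current node's label,
           which encourages the walk to stay within one class.
    """
    rng = rng or np.random.default_rng()
    neighbors = np.flatnonzero(adj[current])
    weights = np.ones(len(neighbors), dtype=float)
    if labels[current] != -1:
        # Boost transitions toward same-class neighbors.
        weights[labels[neighbors] == labels[current]] += alpha
    return int(rng.choice(neighbors, p=weights / weights.sum()))
```

Walks generated this way stay mostly inside one class, so the node-context pairs fed to the embedding model are less dominated by the majority class.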

Caffe
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.
http://…/neural-networks-with-caffe-on-the-gpu
Github

### Model Interpretability with TCAV (Testing with Concept Activation Vectors)

This Domino Data Science Field Note provides very distilled insights and excerpts from Been Kim’s recent MLConf 2018 talk and research about Testing with Concept Activation Vectors (TCAV), an interpretability method that allows researchers to understand and quantitatively measure the high-level concepts their neural network models are using for prediction, “even if the concept was not part of the training”. If interested in additional insights not provided in this blog post, please refer to the MLConf 2018 video, the ICML 2018 video, and the paper.

# Introduction

What if there was a way to quantitatively measure whether your machine learning (ML) model reflects specific domain expertise or potential bias? With post-training explanations? On a global level instead of a local level? Would industry people be interested? These are the kinds of questions Been Kim, Senior Research Scientist at Google Brain, posed in the MLConf 2018 talk, “Interpretability Beyond Feature Attribution: Testing with Concept Activation Vectors (TCAV)”. The MLConf talk is based on a paper Kim co-authored and the code is available. This Domino Data Science Field Note provides some distilled insights about TCAV, an interpretability method that allows researchers to understand and quantitatively measure the high-level concepts their neural network models are using for prediction, “even if the concept was not part of the training” (Kim Slide 33). TCAV “uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result” (Kim et al 2018).

# Introducing Concept Activation Vectors (CAVs)

So what are Concept Activation Vectors, or CAVs? CAV is the name the researchers created for the representation of a high-level concept as a vector, where “activation” refers to the “direction of the values”.

Within the research, Kim et al. indicate that

“CAV for a concept is simply a vector in the direction of the values (e.g., activations) of that concept’s set of examples… we derive CAVs by training a linear classifier between a concept’s examples and random counter examples and then taking the vector orthogonal to the decision boundary.“

In the MLConf talk, Kim relays that “Well, this vector simply means that a direction is pointing from away from random, to more like the concept. And this idea of having this such a vector is not new” as gender direction was discussed within a word2vec paper. Kim also indicates that CAVs are “just another way to retrieve that vector”.
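The CAV derivation quoted above (train a linear classifier between a concept's examples and random counterexamples, then take the vector orthogonal to the decision boundary) can be sketched with a plain logistic regression, whose weight vector is exactly that orthogonal vector. The function name and the gradient-descent details are illustrative, not the paper's code:

```python
import numpy as np

def compute_cav(concept_acts, random_acts, lr=0.1, steps=500):
    """Derive a CAV from layer activations.

    concept_acts: (m, d) activations of the concept's example images.
    random_acts: (k, d) activations of random counterexamples.
    Returns a unit vector pointing from "random" toward the concept.
    """
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)   # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))          # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)      # gradient step on the weights
        b -= lr * float(np.mean(p - y))
    return w / np.linalg.norm(w)              # the CAV: normal to the boundary
```

Because the concept class is labeled 1, the normalized weight vector points in the direction of the concept's activations, matching the "direction of the values" phrasing above.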

# Testing with Concept Activation Vectors (TCAV): The Zebra

A prominent image classification example used to showcase TCAV within the research is “how sensitive a prediction of zebra is to the presence of stripes”. In the MLConf talk, Kim unpacked obtaining the quantitative explanation, or TCAV score:

“The core idea of TCAV is taking the directional derivative of the logit layer, the probability of zebra with respect to that vector we just got ….and intuitively what this means is if I make this picture more like the concept or a little less like the concept, how much would the probability of zebra change? If it changes a lot, it’s an important concept, if it doesn’t change a lot, it’s not an important concept. We do that for many, many zebra pictures to get more robust and reliable explanation. The final score, TCAV score is simply the ratio of zebra pictures where having stripeness positively increased the probability of zebra. In other words, if you have a 100 pictures of zebras, how many of them returned positive directional derivative? That’s it.”
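The score described in the quote reduces to a few lines: given, for each zebra image, the gradient of the class logit with respect to the layer activations, the directional derivative along the CAV is a dot product, and the TCAV score is the fraction of images where that derivative is positive. The names below are illustrative, not the paper's API:

```python
import numpy as np

def tcav_score(logit_grads, cav):
    """Fraction of images whose class logit increases along the CAV.

    logit_grads: (n_images, d) gradients of the class logit with respect
                 to the layer activations, one row per image.
    cav: (d,) unit concept activation vector.
    """
    directional = logit_grads @ cav        # one directional derivative per image
    return float(np.mean(directional > 0)) # share with positive sensitivity
```

So with 100 zebra pictures, a score of 0.8 would mean that nudging the activations toward "stripeness" raised the zebra probability for 80 of them.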

It is also important to note as indicated on Slide 38 of the MLConf talk that

“TCAV provides quantitative importance of a concept if and only if your network has learned about it”.

# Conclusion: Why Consider TCAV?

In the MLConf talk, Kim argues for using interpretability as a tool to work towards using ML “more responsibly and more safely”. Within the paper, Kim et al indicate that

“TCAV is a step toward creating a human-friendly linear interpretation of the internal state of a deep learning model so that questions about model decisions may be answered in terms of natural high-level concepts”.

Using high-level concepts is potentially useful when translating seemingly black box models to stakeholders who are not experts in ML. For example, when building a model to help medical doctors make a diagnosis, it may be helpful to test your model using doctor-friendly high-level concepts to surface bias or level of domain expertise. This blog post provided distilled highlights and excerpts from recent research on TCAV. If interested in additional insights, please refer to the following resources that were referenced and reviewed during the writing of this blog post.

## Other Related Papers and Slide Decks

Domino Data Science Field Notes provide highlights of data science research, trends, techniques, and more, that help data scientists and data science leaders accelerate their work or careers. If you are interested in your data science work being covered in this blog series, please send us an email at writeforus(at)dominodatalab(dot)com.

The post Model Interpretability with TCAV (Testing with Concept Activation Vectors) appeared first on Data Science Blog by Domino.

### Men and women faced different experiences in the labor market

Last week, I showed how an aggregate statistic, the unemployment rate, masked some unusual trends in the U.S. labor market. Despite the unemployment rate in 2018 being equal to, and even a little below, that in 2000, the peak of the last tech boom, there are now significantly more people "not in the labor force," and these people are not counted in the unemployment rate statistic.

The analysis focuses on two factors that are not visible in the unemployment rate aggregate: the proportion of people considered not in labor force, and the proportion of employees who have part-time positions. The analysis itself masks a difference across genders.

It turns out that men and women had very different experiences in the labor market.

For men, things have looked progressively worse with each recession and recovery since 1990. After each recovery, more men exit the labor force, and more men become part-timers. The Great Recession, however, hit men even worse than previous recessions, as seen below:

For women, it's a story of impressive gains in the 1990s, and a sad reversal since 2008.

P.S. See here for Part 1 of this series. In particular, the color scheme is explained there. Also, the entire collection can be viewed here.

### What to make of the historically low unemployment rate

One of the amazing economic stories of the moment is the unemployment rate, which at around 4% has returned to the level last reached during the peak of the tech boom in 2000. The story is much more complex than it seems.

I devoted a chapter of Numbersense (link) to explain how the government computes unemployment rates. The most important thing to realize is that an unemployment rate of 4 percent does NOT mean that four out of 100 people in the U.S. are unemployed, and 96 out of 100 are employed.

It doesn't even mean that four out of 100 people of working age are unemployed, and 96 out of 100 of working age are employed.

What it means is that, of the people the government decides are "employable," 96 out of 100 are employed. Officially, this employability is known as being "in the labor force." There are many ways to be disqualified from the labor force; one example is if the government decides that the person is not looking for a job.

On the flip side, who the government counts as "employed" also matters! Part-timers are considered employed. They are counted just like a full-time employee in the unemployment metric. Part-time, according to the government, is one to 34 hours worked during the week the survey is administered.

***

So two factors can affect the unemployment rate a lot - the proportion of the population considered "not in labor force" (thus not counted at all); and the proportion of those considered employed who are part-timers. (Those are two disjoint groups.)
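A toy calculation makes the two factors concrete. The round numbers below are hypothetical, chosen only to show how a roughly 4% headline rate can coexist with a large not-in-labor-force share and many part-timers:

```python
# Hypothetical round numbers, only to illustrate the definitions above.
population  = 1000   # working-age population
not_in_lf   = 370    # "not in labor force": excluded from the rate entirely
labor_force = population - not_in_lf
unemployed  = 25
employed    = labor_force - unemployed
part_time   = 110    # part-timers (1-34 hours/week) count as employed

unemployment_rate = unemployed / labor_force
print(f"unemployment rate: {unemployment_rate:.1%}")                    # 4.0%
print(f"not in labor force: {not_in_lf / population:.1%} of population") # 37.0%
print(f"part-time share of the employed: {part_time / employed:.1%}")    # 18.2%
```

The headline number divides by the labor force, not the population, so the 370 people not in the labor force and the 110 part-timers are invisible in it.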

The following chart then shows that despite the unemployment rate looking great, the U.S. labor market in 2018 looks nothing like what it looked like from 1990 to 2008.

Technical notes: all the data are seasonally adjusted by the Bureau of Labor Statistics. I used a spline to smooth the data first - the top chart shows the smoothed version of the unemployment rates. Smoothing removes month-to-month sharp edges from the second chart. The color scale is based on standardized values of the smoothed data.

P.S. Part 2 of this series explores the different experiences of male and female workers. Also, the entire collection can be viewed here.

### Document worth reading: “A Mathematical Theory of Interpersonal Interactions and Group Behavior”

Emergent collective group processes and capabilities have been studied through analysis of transactive memory, measures of group task performance, and group intelligence, among others. In addressing collective behaviors, these approaches transcend traditional studies of group decision making that focus on how individual preferences combine through power relationships, social choice by voting, negotiation and game theory. Understanding more generally how individuals contribute to group effectiveness is important to a broad set of social challenges. Here we formalize a dynamic theory of interpersonal communications that classifies individual acts, sequences of actions, group behavioral patterns, and individuals engaged in group decision making. Group decision making occurs through a sequence of communications that convey personal attitudes and preferences among members of the group. The resulting formalism is relevant to psychosocial behavior analysis, rules of order, organizational structures and personality types, as well as formalized systems such as social choice theory. More centrally, it provides a framework for quantifying and even anticipating the structure of informal dialog, allowing specific conversations to be coded and analyzed in relation to a quantitative model of the participating individuals and the parameters that govern their interactions. A Mathematical Theory of Interpersonal Interactions and Group Behavior

### Julia Child (2) vs. Ira Glass; Dorothy Parker advances

Yesterday we got this argument from Manuel in favor of Biles:

After suffering so many bad gymnastics (mathematical, logical, statistical, you name it) at seminars, to have some performed by a true champion would be a welcome change.

But Parker takes it away, based on this formidable contribution of Dzhaughn:

Things I Have Learned From the Contest So Far:
(Cf. “Resume” by Dorothy Parker)

Thorpe’s 1/8th hashtag
Babe’s just a champ
Oscar is all Gray
Hotdogs cause cramp
Serena’s a whiner
Erdos sylvan
Jeff’s gone ballistic
I might as well win.

Today’s contest features the second seed in the Creative Eaters category against an unseeded magician. (Regular listeners to This American Life will recall that Glass did magic shows when he was about 12 years old, I think it was.) Both have lots of experience performing in front of an audience. So what’ll it be? Public TV or public radio? In either case, the winner will be facing someone from New Jersey in the second round.

Again, the full bracket is here, and here are the rules:

We’re trying to pick the ultimate seminar speaker. I’m not asking for the most popular speaker, or the most relevant, or the best speaker, or the deepest, or even the coolest, but rather some combination of the above.

I’ll decide each day’s winner not based on a popular vote but based on the strength and amusingness of the arguments given by advocates on both sides. So give it your best!

### Moneyball for evaluating community colleges

Far more community college students transfer prior to completing the Associate’s degree than actually complete first. According to a new report from the National Student Clearinghouse Research Center, about 350,000 transfer before completion, compared to about 60,000 who complete first.

That matters in several ways.

Most basically, it suggests that measuring community colleges by their graduation rates misses the point. A student who does a year at Brookdale before transferring to Rutgers, and subsequently graduating, got what she wanted, but she shows up in our numbers as a dropout. In states with “performance funding,” the community college could be punished for her decision, even if it was what she intended to do all along. . .

People who only look at “headline” numbers, and don’t bother with the asterisks, look at graduation rates and assume that something is going horribly wrong. But a ratio of 35 to 6 is such a honker of an asterisk that failing to account for it amounts to misrepresentation. . . .

My preferred measures of community college performance would be based on actual student behavior. For example, does the percentage of bachelor’s grads in a given area with community college credits roughly match the percentage of undergrads who are enrolled at community colleges? (Nationally, it does.) If so, then the idea of community colleges as dropout factories is hard to sustain. For programs not built around transfer, how are the employment outcomes? I wouldn’t look at loan repayment rates, just because the percentage of students with loans is so low; it’s a skewed sample. I would look at achievement gaps by race, sex, age, and income. I would look at ROI for public investment, as well as at local reputation. . . .

And a bunch more. I don’t know much about the world of education policy: Maybe some of these things are already being measured? Seems important, in any case.

### Intrinsic time for cryptocurrency data

(This article was first published on factbased, and kindly contributed to R-bloggers)

This week, I attended a three-day hacking event of the Crypto Traders Berlin Meetup group.

The aim was to find relationships between sentiment data from bitcointalk and Twitter and the price of cryptocurrencies. In practice, it turns out not to be that easy. For starters, it is hard to determine whether the social media discussion spikes after a significant price change or the other way round.

During this meetup series, I explored the concept of “intrinsic time”, as proposed by a team at the University of Essex. The idea is to summarize price data not by fixed intervals, but by directional changes in the price. This, they argue, has the advantage of covering more extreme points of the price data (a longer coast line). Also, some empirical scaling laws emerge when looking at the data this way.
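The directional-change idea can be sketched in a few lines: record an event whenever the price reverses by more than a threshold (relative move) from the last local extreme, instead of sampling at fixed intervals. This is a language-neutral Python illustration of the concept as described above; the threshold value and event encoding are illustrative, not the Essex group's exact formulation:

```python
def directional_changes(prices, theta=0.02):
    """Summarize a price series by directional-change events.

    A 'down' event is recorded when the price falls at least `theta`
    (relative) below the running maximum, and an 'up' event when it
    rises at least `theta` above the running minimum. Returns a list
    of (index, direction) pairs: the series in intrinsic time.
    """
    events = []
    extreme = prices[0]
    mode = 'up'  # assume an initial uptrend; the first event corrects it
    for i, p in enumerate(prices):
        if mode == 'up':
            if p >= extreme:
                extreme = p                  # new running maximum
            elif (extreme - p) / extreme >= theta:
                events.append((i, 'down'))   # downturn confirmed
                mode, extreme = 'down', p
        else:
            if p <= extreme:
                extreme = p                  # new running minimum
            elif (p - extreme) / extreme >= theta:
                events.append((i, 'up'))     # upturn confirmed
                mode, extreme = 'up', p
    return events
```

Because events fire at local extremes rather than on a clock, the summary keeps the turning points ("a longer coast line") that fixed-interval sampling tends to skip.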

The R code I came up with can be found here at pastebin, and here is a chart showing the output:

There is still a lot to explore, for instance whether cryptocurrency prices follow some empirical scaling laws, and identifying market regimes based on this kind of summary, but it was a good start.

Being a trading rookie, it was also interesting to learn some ideas, like the maker-taker model and the metaphor of “surfing the waves” for trend following.

It was a friendly meetup with interesting people to get acquainted with; I recommend it if you are in the area and interested in this topic. Thanks to the organizers and sponsor as well!