# My Data Science Blogs

## July 20, 2019

### Youngsters are avoiding the Facebook app—but not the firm’s other platforms

Facebook owes its resilience to savvy acquisitions and tolerant regulators

## July 18, 2019

### Adapters: A Compact and Extensible Transfer Learning Method for NLP

Adapters obtain comparable results to BERT on several NLP tasks while achieving parameter efficiency.

### Animation in visualization, revisited a decade later

Rewind to 2006 when Hans Rosling’s talk using moving bubbles was at peak attention. Researchers studied whether animation in visualization was a good thing. Danyel Fisher revisits their research a decade later.

While they found that readers didn’t get much more accuracy from the movement versus other methods, there was a big but:

But we also found that users really liked the animation view: Study participants described it as “fun”, “exciting”, and even “emotionally touching.” At the same time, though, some participants found it confusing: “the dots flew everywhere.”

This is a dilemma. Do we make users happy, or do we help them be effective? After the novelty effect wears off, will we all wake up with an animation hangover and just want our graphs to stay still so we can read them?


### If you did not already know

Fractional Langevin Monte Carlo (FLMC)
Along with the recent advances in scalable Markov Chain Monte Carlo methods, sampling techniques that are based on Langevin diffusions have started receiving increasing attention. These so called Langevin Monte Carlo (LMC) methods are based on diffusions driven by a Brownian motion, which gives rise to Gaussian proposal distributions in the resulting algorithms. Even though these approaches have proven successful in many applications, their performance can be limited by the light-tailed nature of the Gaussian proposals. In this study, we extend classical LMC and develop a novel Fractional LMC (FLMC) framework that is based on a family of heavy-tailed distributions, called $\alpha$-stable Lévy distributions. As opposed to classical approaches, the proposed approach can possess large jumps while targeting the correct distribution, which would be beneficial for efficient exploration of the state space. We develop novel computational methods that can scale up to large-scale problems and we provide formal convergence analysis of the proposed scheme. Our experiments support our theory: FLMC can provide superior performance in multi-modal settings, improved convergence rates, and robustness to algorithm parameters. …

PRESISTANT
Data pre-processing is one of the most time consuming and relevant steps in a data analysis process (e.g., classification task). A given data pre-processing operator (e.g., transformation) can have positive, negative or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. However, when it comes to non-experts, they are overwhelmed by the amount of pre-processing operators and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only ‘syntactically’ applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim at providing assistance to non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. We developed a tool PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of 5 different classification algorithms, such as J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations on the recommendations provided by our tool, show that PRESISTANT can effectively help non-experts in order to achieve improved results in their analytical tasks. …

Binary Network Embedding (BinaryNE)
Traditional network embedding primarily focuses on learning a dense vector representation for each node, which encodes network structure and/or node content information, such that off-the-shelf machine learning algorithms can be easily applied to the vector-format node representations for network analysis. However, the learned dense vector representations are inefficient for large-scale similarity search, which requires to find the nearest neighbor measured by Euclidean distance in a continuous vector space. In this paper, we propose a search efficient binary network embedding algorithm called BinaryNE to learn a sparse binary code for each node, by simultaneously modeling node context relations and node attribute relations through a three-layer neural network. BinaryNE learns binary node representations efficiently through a stochastic gradient descent based online learning algorithm. The learned binary encoding not only reduces memory usage to represent each node, but also allows fast bit-wise comparisons to support much quicker network node search compared to Euclidean distance or other distance measures. Our experiments and comparisons show that BinaryNE not only delivers more than 23 times faster search speed, but also provides comparable or better search quality than traditional continuous vector based network embedding methods. …

NeuralDater
Document date is essential for many important tasks, such as document retrieval, summarization, event detection, etc. While existing approaches for these tasks assume accurate knowledge of the document date, this is not always available, especially for arbitrary documents from the Web. Document Dating is a challenging problem which requires inference over the temporal structure of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document internal structures. In this paper, we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits syntactic and temporal graph structures of document in a principled way. To the best of our knowledge, this is the first application of deep learning for the problem of document dating. Through extensive experiments on real-world datasets, we find that NeuralDater significantly outperforms state-of-the-art baseline by 19% absolute (45% relative) accuracy points. …

### Four short links: 18 July 2019

Weird Algorithms, Open Syllabi, Conversational AI, and Quantum Computing

1. 30 Weird Chess Algorithms (YouTube) — An intricate and lengthy account of several different computer chess topics from my SIGBOVIK 2019 papers. We conduct a tournament of fools with a pile of different weird chess algorithms, ostensibly to quantify how well my other weird program to play color- and piece-blind chess performs. On the way we "learn" about mirrors, arithmetic encoding, perversions of game tree search, spicy oils, and hats.
2. Open Syllabus Project — as FastCompany explains, the 6M+ syllabi from courses around the world tell us about changing trends in subjects. Not sure how I feel that four of the textbooks I learned on are still in the top 20 (Cormen, Tanenbaum, Silberschatz, Stallings).
3. Plato — Uber open-sourced its flexible platform for developing conversational AI agents. See also their blog post.
4. Speediest Quantum Operation Yet (ScienceDaily) — In Professor Michelle Simmons' approach, quantum bits (or qubits) are made from electrons hosted on phosphorus atoms in silicon.[...] "Atom qubits hold the world record for the longest coherence times of a qubit in silicon with the highest fidelities," she says. "Using our unique fabrication technologies, we have already demonstrated the ability to read and initialise single electron spins on atom qubits in silicon with very high accuracy. We've also demonstrated that our atomic-scale circuitry has the lowest electrical noise of any system yet devised to connect to a semiconductor qubit." [...] A two-qubit gate is the central building block of any quantum computer -- and the UNSW team's version of it is the fastest that's ever been demonstrated in silicon, completing an operation in 0.8 nanoseconds, which is ~200 times faster than other existing spin-based two-qubit gates.

### Magister Dixit

“Type A Data Scientist: The A is for Analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren’t taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.
Type B Data Scientist: The B is for Building. Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data “in production.” They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).”
Robert Chang

### rOpenSci Hiring for New Position in Statistical Software Testing and Peer Review

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Are you passionate about statistical methods and software? If so we would love for you to join our team to dig deep into the world of statistical software packages. You’ll develop standards for evaluating and reviewing statistical tools, publish, and work closely with an international team of experts to set up a new software review system.

We are seeking a creative, dedicated, and collaborative software research scientist to support a two-year project in launching a new software peer-review initiative. The software research scientist will work on the Sloan Foundation supported rOpenSci project, with rOpenSci staff and a statistical methods editorial board. They will research and develop standards and review guidelines for statistical software, publish findings, and develop R software to test packages against those standards. The software research scientist will work with staff and the board to collaborate broadly with the statistical and software communities to gather input, refine and promote the standards, and recruit editors and peer reviewers. The candidate must be self-motivated, proactive, collaborative and comfortable working openly and reproducibly with a broad online community.

For more details and how to apply see https://ropensci.org/careers/.


## July 17, 2019

### Top KDnuggets tweets, Jul 10-16: Intuitive Visualization of Outlier Detection Methods; What’s wrong with the approach to Data Science?

What's wrong with the approach to Data Science?; Intuitive Visualization of Outlier Detection Methods; The Death of Big Data and the Emergence of the Multi-Cloud Era

### Distilled News

In this post we’ll share how we used TensorFlow’s object detection API to build a custom image annotation service for eyeson. Below you can see an example where Philipp is making the ‘thinking’ pose during a meeting, which automatically triggers a GIF reaction.
Artificial intelligence (AI) has undergone a renaissance recently, making major progress in key domains such as vision, language, control, and decision-making. This has been due, in part, to cheap data and cheap compute resources, which have fit the natural strengths of deep learning. However, many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current approaches. In particular, generalizing beyond one’s experiences–a hallmark of human intelligence from infancy–remains a formidable challenge for modern AI. The following is part position paper, part review, and part unification. We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. Just as biology uses nature and nurture cooperatively, we reject the false choice between ‘hand-engineering’ and ‘end-to-end’ learning, and instead advocate for an approach which benefits from their complementary strengths. We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them. We present a new building block for the AI toolkit with a strong relational inductive bias–the graph network–which generalizes and extends various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. We discuss how graph networks can support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning. As a companion to this paper, we have released an open-source software library for building graph networks, with demonstrations of how to use them in practice.
Today, Python is one of the most popular programming languages, and it has replaced many other languages in industry. There are various reasons for its popularity, one of which is Python’s large collection of libraries.
Machine Learning, a prominent topic in the Artificial Intelligence domain, has been in the spotlight for quite some time now. This area may offer an attractive opportunity, and starting a career in it is not as difficult as it may seem at first glance. Even if you have zero experience in math or programming, it is not a problem. The most important element of your success is purely your own interest and motivation to learn all those things. If you are a newcomer who does not know where to start studying, why you need Machine Learning, or why it has been gaining more and more popularity lately, you have come to the right place! I’ve gathered all the needed information and useful resources to help you gain new knowledge and accomplish your first projects.
Building an effective Machine Learning model is all about striking the right balance between Bias (Underfitting) and Variance (Overfitting). But what are Bias and Variance? What do they mean intuitively? Let’s take a step back and understand the terms Bias and Variance on a conceptual level and then try to relate these concepts to Machine Learning.
Reinforcement Learning (RL) is the problem of studying an agent in an environment, where the agent has to interact with the environment in order to maximize some cumulative reward. An example of RL is an agent in a labyrinth trying to find its way out: the faster it finds the exit, the greater the reward it gets.
On July 5, the R Core Group released the source code for the latest update to R, R 3.6.1, and binaries are now available to download for Windows, Linux and Mac from your local CRAN mirror.
This report examines the emerging regulatory and policy landscape surrounding artificial intelligence (AI) in jurisdictions around the world and in the European Union (EU). In addition, a survey of international organizations describes the approach that United Nations (UN) agencies and regional organizations have taken towards AI. As the regulation of AI is still in its infancy, guidelines, ethics codes, and actions by and statements from governments and their agencies on AI are also addressed. While the country surveys look at various legal issues, including data protection and privacy, transparency, human oversight, surveillance, public administration and services, autonomous vehicles, and lethal autonomous weapons systems, the most advanced regulations were found in the area of autonomous vehicles, in particular for the testing of such vehicles.
After several AI PoCs, I realized that it is quite easy to launch AI PoCs with initially positive results, but at the same time, it is difficult to scale up AI to enterprise-wide applications and reach the production stage. In this article, I’ll share some of the reasons why I failed in a couple of projects.
1. Data
2. Compliance
3. Realistic Expectations
4. Scalability
5. Size and nature of your PoC
6. Implementation process
7. AI Accuracy / Available Data
8. PoC Evaluation
9. Time Window
Automatic synthesis of realistic images from text has become popular, with deep convolutional and recurrent neural network architectures aiding in learning discriminative text feature representations. The discriminative power and strong generalization properties of attribute representations are attractive, but obtaining them is a complex process that requires domain-specific knowledge. In comparison, natural language offers a general and flexible interface for describing objects in any space of visual categories. The best approach is to combine the generality of text descriptions with the discriminative power of attributes. This blog addresses different text-to-image synthesis algorithms using GANs (Generative Adversarial Networks) that aim to directly map words and characters to image pixels with natural language representation and image synthesis techniques.

### Document worth reading: “Abandon Statistical Significance”

In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration–often scant–given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible. Abandon Statistical Significance

### The role of open source in mitigating natural disasters

Pedro Cruz and Brad Topol discuss Call for Code, a global developer competition that uses open source technologies to address natural disasters.


### The next age of open innovation

Alison McCauley looks at how blockchain technology offers new tools that can help extend the ethos of open innovation into new areas.


### Highlights from the O'Reilly Open Source Software Conference in Portland 2019

Experts explore the role open source software plays in fields as varied as machine learning, blockchain, disaster response, and more.

People from across the open source world are coming together in Portland, Ore. for the O'Reilly Open Source Software Conference (OSCON). Below you'll find links to highlights from the event.

#### Better living through software

Tiffani Bell shares three lessons she's learned exploring how technology can help the less fortunate.

#### The next age of open innovation

Alison McCauley looks at how blockchain technology offers new tools that can help extend the ethos of open innovation into new areas.

#### Built to last: Building and growing open source communities

Kay Williams explores key lessons for building strong open source communities based on Microsoft’s real-world experience with Kubernetes and VSCode.

#### The role of open source in mitigating natural disasters

Pedro Cruz and Brad Topol discuss Call for Code, a global developer competition that uses open source technologies to address natural disasters.

### Built to last: Building and growing open source communities

Kay Williams explores key lessons for building strong open source communities based on Microsoft’s real-world experience with Kubernetes and VSCode.


### Better living through software

Tiffani Bell shares three lessons she's learned exploring how technology can help the less fortunate.


### R Packages worth a look

Fit Latent Dirichlet Allocation Models using Stochastic Variational Inference (lda.svi)
Fits Latent Dirichlet Allocation topic models to text data using the stochastic variational inference algorithm described in Hoffman et al. (2013) …

R Markdown Output Formats for Storytelling (rolldown)
R Markdown output formats based on JavaScript libraries such as ‘Scrollama’ …

General Tools for Building GLM Expectation-Maximization Models (emax.glm)
Implementation of Expectation Maximization (EM) regression of general linear models. The package currently supports Poisson and Logistic regression wit …

Talking to ‘Docker’ and ‘Singularity’ Containers (babelwhale)
Provides a unified interface to interact with ‘docker’ and ‘singularity’ containers. You can execute a command inside a container, mount a volume or co …

### Online Workshop: How to set up Kubernetes for all your machine learning workflows

Join this free live online workshop, Jul 31 @12 PM ET, to learn how to set up your Kubernetes cluster, so you can run Spark, TensorFlow, and any ML framework instantly, touching on the entire machine learning pipeline from model training to model deployment.

### Bottomline Technologies: Data Scientist [Portsmouth, NH]

Bottomline Technologies is seeking a Data Scientist in Portsmouth, NH. Join a newly established predictive and advanced analytics team to build innovative analytics solutions that solve problems in the financial industry.

### Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part II

This is a continuation of Julien Mairal’s guest post on CNNs; see part I here.

Stability to deformations of convolutional neural networks

In their ICML paper Zhang et al. introduce a functional space for CNNs with one layer, by noticing that for some dot-product kernels, smoothed variants of rectified linear unit activation functions (ReLU) live in the corresponding RKHS, see also this paper and that one. By following a similar reasoning with multiple layers, it is then possible to show that the functional space described in part I contains CNNs with such smoothed ReLU, and that the norm of such networks can be controlled by the spectral norms of filter matrices. This is consistent with previous measures of complexity for CNNs, see this paper by Bartlett et al.

A perhaps more interesting finding is that the abstract representation $\Phi_n(x)$, which only depends on the network architecture, may provide near-translation invariance and stability to small image deformations while preserving information; that is, $x$ can be recovered from $\Phi_n(x)$. The original characterization we use was introduced by Mallat in his paper on the scattering transform (a multilayer architecture akin to CNNs based on wavelets), and was extended to $\Phi_n$ by Alberto Bietti, who should be credited for all the hard work here.

Our goal is to understand under which conditions it is possible to obtain a representation that (i) is near-translation invariant, (ii) is stable to deformations, (iii) preserves signal information. Given a $C^1$-diffeomorphism $\tau: \mathbb{R}^2 \to \mathbb{R}^2$ and denoting by $L_\tau x(u) = x(u - \tau(u))$ its action operator (for an image $x$ defined on the continuous domain $\mathbb{R}^2$), the main stability bound we obtain is the following one (see Theorem 7 in Mallat’s paper): if $\|\nabla \tau\|_\infty \leq 1/2$, then for all $x$,

$$\|\Phi_n(L_\tau x) - \Phi_n(x)\| \leq \left( C_1 (1+n) \|\nabla \tau\|_\infty + \frac{C_2}{\sigma_n} \|\tau\|_\infty \right) \|x\|,$$

where $C_1, C_2$ are universal constants, $\sigma_n$ is the scale parameter of the pooling operator corresponding to the “amount of pooling” performed up to the last layer, $\|\tau\|_\infty$ is the maximum pixel displacement and $\|\nabla \tau\|_\infty$ represents the maximum amount of deformation; see the paper for the precise definitions of all these quantities. Note that when $\sigma_n \to \infty$, the representation becomes translation invariant: indeed, consider the particular case of $\tau$ being a translation, then $\nabla \tau = 0$ and $\|\Phi_n(L_\tau x) - \Phi_n(x)\| \to 0$.

The stability bound and a few additional results tell us a few things about the network architecture: (a) small patches lead to more stable representations (the dependency is hidden in $C_1$); (b) signal preservation for discrete signals requires small subsampling factors (and thus small pooling) between layers. In such a setting, the scale parameter $\sigma_n$ still grows exponentially with $n$ and near translation invariance may be achieved with several layers.

Interestingly, we may now come back to the Cauchy-Schwarz inequality from part 1, and note that if $\Phi_n$ is stable, the RKHS norm $\|f\|_{\mathcal{H}}$ is then a natural quantity that provides stability to deformations to the prediction function $f$, in addition to measuring model complexity in a traditional sense.

Feature learning in RKHSs and convolutional kernel networks

The previous paragraph is devoted to the characterization of convolutional architectures such as CNNs but the previous kernel construction can in fact be used to derive more traditional kernel methods. After all, why should one spend efforts defining a kernel between images if not to use it?

This can be achieved by considering finite-dimensional approximations of the previous feature maps. In order to shorten the presentation, we simply describe the main idea based on the Nystrom approximation and refer to the paper for more details. Approximating the infinite-dimensional feature maps (see the figure at the top of part I) can be done by projecting each point in $\mathcal{H}_k$ onto a $p_k$-dimensional subspace $\mathcal{F}_k$ leading to a finite-dimensional feature map $\tilde{x}_k$ akin to CNNs, see the figure at the top of the post.

By parametrizing $\mathcal{F}_k$ with $p_k$ anchor points $Z = [z_1, \ldots, z_{p_k}]$, and using a dot-product kernel, a patch $z$ from $\tilde{x}_{k-1}$ is encoded through the mapping function

$$\psi_k(z) = \|z\| \, \kappa_k(Z^\top Z)^{-1/2} \, \kappa_k\!\left( Z^\top \frac{z}{\|z\|} \right),$$

where $\kappa_k$ is applied pointwise. Then, computing $\tilde{x}_k$ from $\tilde{x}_{k-1}$ admits a CNN interpretation, where only the normalization and the matrix multiplication by $\kappa_k(Z^\top Z)^{-1/2}$ are not standard operations. It remains now to choose the anchor points:

• kernel approximation: a first approach consists of using a variant of the Nystrom method, see this paper and that one. When plugging the corresponding image representation in a linear classifier, the resulting approach behaves as a classical kernel machine. Empirically, we observe that the higher the number of anchor points, the better the kernel approximation, and the higher the accuracy. For instance, a two-layer network with a -dimensional representations achieves about accuracy on CIFAR-10 without data augmentation (see here).
• back-propagation, feature selection: learning the anchor points can also be done as in a traditional CNN, by optimizing them end-to-end. This allows using deeper lower-dimensional architectures and empirically seems to perform better when enough data is available, e.g., accuracy on CIFAR-10 with simple data augmentation. There, the subspaces are not learned anymore to provide the best kernel approximation, but the model seems to perform a sort of feature selection in each layer’s RKHS , which is not well understood yet (This feature selection interpretation is due to my collaborator Laurent Jacob).

Note that the first CKN model published here was based on a different approximation principle, which was not compatible with end-to-end training. We found this to be less scalable and effective.

Other links between neural networks and kernel methods

Finally, other links between kernels and infinitely-wide neural networks with random weights are classical, but they were not the topic of this blog post (they should be the topic of another one!). In a nutshell, for a large collection of weight distributions and nonlinear functions $s$, the following quantity admits an analytical form

$$K(x, x') = \mathbb{E}_{w}\left[ s(w^\top x) \, s(w^\top x') \right],$$

where the terms $s(w^\top x)$ may be seen as an infinitely-wide single-layer neural network. The first time such a relation appears is likely to be in the PhD thesis of Radford Neal with a Gaussian process interpretation, and it was revisited later by Le Roux and Bengio and by Cho and Saul with multilayer models.

In particular, when $s$ is the rectified linear unit and $w$ follows a Gaussian distribution, it is known that we recover the arc-cosine kernel. We may also note that random Fourier features also yield a similar interpretation.
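As a quick numerical illustration of the ReLU case, here is a minimal sketch (not from the original post) that compares a Monte Carlo estimate of the expectation of the product of two ReLU units sharing the same standard Gaussian weight vector with a closed form. The closed form used below is the degree-1 arc-cosine kernel of Cho and Saul divided by two, which is an assumption about their normalization; the function names and the sample size are arbitrary choices.

```python
import math
import random


def relu(t):
    """Rectified linear unit."""
    return t if t > 0.0 else 0.0


def arccos_expectation(x, y):
    """Closed form of E_w[relu(w.x) * relu(w.y)] for w ~ N(0, I).

    Assumption: this equals the degree-1 arc-cosine kernel divided by 2.
    """
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    cos_theta = sum(a * b for a, b in zip(x, y)) / (nx * ny)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    theta = math.acos(cos_theta)
    return nx * ny * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)


def monte_carlo_expectation(x, y, n_samples=100_000, seed=0):
    """Average relu(w.x) * relu(w.y) over random Gaussian weight vectors w."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        wx = sum(wi * xi for wi, xi in zip(w, x))
        wy = sum(wi * yi for wi, yi in zip(w, y))
        total += relu(wx) * relu(wy)
    return total / n_samples
```

Each sampled `w` plays the role of one hidden unit, so increasing `n_samples` corresponds to widening the single-layer network; the two functions should agree more closely as the width grows.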

Other important links have also been drawn recently between kernel regression and strongly over-parametrized neural networks, see this paper and that one, which is another exciting story.

### A Summary of DeepMind’s Protein Folding Upset at CASP13

Learn how DeepMind dominated the last CASP competition for advancing protein folding models. Their approach, which uses gradient descent, is today's state of the art for predicting the 3D structure of a protein given only its constituent amino acids.

### How we automated mybinder.org dependency upgrades in 10 steps

BinderHub and repo2docker are key components of the service at mybinder.org. In order to give Binder users the best experience, the Binder SRE team must continuously upgrade the version of these tools that mybinder.org uses. Rather than merging massive updates at irregular intervals, it is desirable to merge smaller changes at frequent intervals, which makes it easier to identify any breaking changes from the dependency upgrades.

While this process only takes a few minutes following processes outlined in the “Site Reliability Guide,” it is prone to human error (e.g., remembering to use the right SHA in upgrading the packages), and the team must remember to regularly do it in the first place. In the interest of automation, the Binder team decided to use a bot to relieve this burden, and we’ve decided to highlight its functionality in this blog post!

### What does the mybinder.org upgrade bot do?

The upgrade bot should automatically update the versions of BinderHub and repo2docker that are deployed on mybinder.org. These are defined in the mybinder.org helm chart. To check whether an upgrade is needed, we want the bot to first “diff” the latest commit hash for both repo2docker and BinderHub repos against the deployed versions in the mybinder.org repo. If either or both are different, the upgrade bot does the following:

• Fork the mybinder.org-deploy repo
• Clone the fork locally
• Checkout a new branch for the bump
• Make the appropriate edit to update the commit hash in the mybinder.org fork repo
• Add and commit the change
• Push to the branch in the forked repo
• Create a PR to the main mybinder.org repo
• Remove the locally cloned repo
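The local git portion of that checklist can be sketched as an ordered list of commands. This is only an illustration: the function names, branch name, and commit message are hypothetical, and the fork and PR steps are left out because they go through the GitHub API rather than git.

```python
import subprocess


def plan_git_commands(fork_url, repo_dir, branch, files, message):
    """Return the git commands for clone -> branch -> add -> commit -> push, in order."""
    in_repo = ["git", "-C", repo_dir]  # run the command inside the cloned repo
    return [
        ["git", "clone", fork_url, repo_dir],
        in_repo + ["checkout", "-b", branch],
        # The edit that bumps the commit hash happens here, before "add".
        in_repo + ["add", *files],
        in_repo + ["commit", "-m", message],
        in_repo + ["push", "origin", branch],
    ]


def run_commands(commands):
    """Execute each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

In use, one would run the first two commands, edit the file, and then run the remaining three, so that the commit actually contains the bump.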

Additionally, it would be ideal if the bot could update an existing PR instead of creating new ones for the version bumps. We’d also like to provide some information in the comments of the PR as to what high level changes were made so we have some idea about what we’re merging in.

Here’s what we’re aiming for.

[Figure: the PR body]

[Figure: the PR diff]

Now that we’ve broken it down a bit, let’s write up some Python code. Once we have a functioning script, we can worry about how we will run this in the cloud (cron job vs. web app).

### Writing the bot

If you don’t care about the step-by-step, you can skip to the final version of the code.

In the interest of linear understanding and simplicity for a bot-writing tutorial, the step-by-step below will not write functions or classes but just list the raw code necessary to carry out the tasks. The final version of the code linked above is one way to refactor it.

### Step 1: Retrieve current deployed mybinder.org dependency versions

The first step is to see if any changes are necessary in the first place. Fortunately, @choldgraf had already made a script to do this.

To find the current live commit SHA for BinderHub in mybinder.org, we simply check the requirements.yaml file. We’ll need Python’s yaml and requests modules to make the GET request and parse the yaml in the response. Note that this is also conveniently the file we’d want to change to upgrade the version.

https://medium.com/media/ee7662b9c01dbc45c420efd8b8a8002b/href

Similarly, for repo2docker, we check the mybinder.org values.yaml file:

https://medium.com/media/c20431acba838e3fc127827f378d8401/href

Let’s store these SHAs in a dictionary we can use for later reference:

https://medium.com/media/e2fcefa8739c19a6c064b0638988cfaf/href

### Step 2: Retrieve latest commits from the dependency repos

When we get the latest commit SHAs for repo2docker and BinderHub, we need to be careful not to simply grab the latest commit from GitHub. The Travis build for mybinder.org looks for the repo2docker Docker image on DockerHub and the latest BinderHub in the JupyterHub helm chart, so those are the places to check for the newest published versions.

Let’s get the repo2docker version first:

https://medium.com/media/44b0e1253d7c5a8656ddd54523449ef6/href

Now we can do BinderHub:

https://medium.com/media/ca78a0f61a72d202cb686d5c9fe1691b/href

Let’s add these to our dictionary too:

https://medium.com/media/0c1798e1dc39838429064ff856bd3661/href

Great, now we should have all the information we need to determine whether an update needs to be made or not, and what the new commit SHA should be!
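As a minimal sketch of that check, the comparison could look like the following. The dictionary layout, the `repos_to_upgrade` helper, and the SHA values are illustrative assumptions, not the code from the original embeds.

```python
# Hypothetical structure: repo name -> {"live": deployed SHA, "latest": newest published SHA}.
# The SHA values below are made-up placeholders, not real commits.
commit_info = {
    "binderhub": {"live": "aaa1111", "latest": "bbb2222"},
    "repo2docker": {"live": "ccc3333", "latest": "ccc3333"},
}


def repos_to_upgrade(info):
    """Return the names of repos whose deployed SHA differs from the latest one."""
    return [name for name, shas in info.items() if shas["live"] != shas["latest"]]


# Only repos listed here need a version bump PR.
print(repos_to_upgrade(commit_info))
```

If the returned list is empty, the bot has nothing to do on this run and can exit early.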

### Step 3: Fork mybinder.org repo

If we determine an upgrade for the repo is necessary, we need to fork the mybinder.org repository, make the change, commit, push, and make a PR. Fortunately, the GitHub API has all the functionality we need! Let’s just make a fork first.

If you have permissions to a bunch of repos and organizations on GitHub, you may want to create a new account or organization so that you don’t accidentally start automating git commands through an account that has write access to so much, especially while developing and testing the bot. I created the henchbot account for this.

Once you know which account you want to be making the PRs with, you’ll need to create a personal access token from within that account. I’ve set this as an environment variable so it isn’t hard-coded in the script.

https://medium.com/media/ec4da4b2db0835aa0e55ccc250b4580d/href

Using the API for a post request to the forks endpoint will fork the repo to your account. That’s it!
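A rough sketch of that request, using only the standard library instead of the requests module mentioned earlier, might look like this. The `build_fork_request` helper and the `jupyterhub` owner name are assumptions for illustration; the request is built but deliberately not sent.

```python
import os
import urllib.request

GITHUB_API = "https://api.github.com"


def build_fork_request(token, owner, repo):
    """Build (but do not send) the POST request that asks GitHub to fork owner/repo."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/forks"
    request = urllib.request.Request(url, data=b"", method="POST")
    # The personal access token authenticates the bot account.
    request.add_header("Authorization", f"token {token}")
    return request


# HENCHBOT_TOKEN is the environment variable name used in the cron entry in Step 10.
token = os.environ.get("HENCHBOT_TOKEN", "dummy-token")
fork_request = build_fork_request(token, "jupyterhub", "mybinder.org-deploy")
print(fork_request.get_method(), fork_request.full_url)
# To actually create the fork: urllib.request.urlopen(fork_request)
```

Forking is asynchronous on GitHub's side, so a real bot may need to wait briefly before cloning the new fork.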

### Step 4: Clone your fork

You should be quite used to this! We’ll use Python’s subprocess module to run all of our bash commands. We’ll need to run these within the for-loop above.

https://medium.com/media/a9e4ef7f3b58b56f2113fa320b0d043d/href

Let’s also cd into it and check out a new branch.

https://medium.com/media/35170cc7e3372ce40c1d2a87ecb592c7/href
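One way to sketch the clone and checkout with subprocess is shown below. The helper names and the `binderhub_bump` branch name are hypothetical. Note that running `cd` in a subprocess does not change the directory of later commands, so this sketch passes `cwd=` instead.

```python
import subprocess


def clone_and_branch_commands(user, repo, branch):
    """Return (argv, cwd) pairs for cloning the fork and creating a branch."""
    clone_url = f"https://github.com/{user}/{repo}.git"
    return [
        (["git", "clone", clone_url], None),        # run in the current directory
        (["git", "checkout", "-b", branch], repo),  # run inside the cloned repo
    ]


def run_commands(commands):
    """Run each command in order, raising CalledProcessError if one fails."""
    for argv, cwd in commands:
        subprocess.check_call(argv, cwd=cwd)


commands = clone_and_branch_commands("henchbot", "mybinder.org-deploy", "binderhub_bump")
for argv, cwd in commands:
    print(cwd or ".", " ".join(argv))
# run_commands(commands)  # uncomment to actually clone (needs git and network access)
```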

### Step 5: Make the file changes

Now we need to edit the file like we would for an upgrade.

For repo2docker, we edit the same values.yaml file we checked above and replace the old SHA (“live”) with the “latest”.

https://medium.com/media/cefbe0d19d0a21949427f766e97d7383/href

For BinderHub, we edit the same requirements.yaml file we checked above and replace the old SHA (“live”) with the “latest”.

https://medium.com/media/0ce1e9b7330a297b77a0bf785c5bff33/href
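The edit itself boils down to a string replacement. Here is a hedged sketch; the `bump_sha` helper, the file fragment, and the SHAs are made up for illustration and do not reflect the real layout of values.yaml or requirements.yaml.

```python
def bump_sha(text, live_sha, latest_sha):
    """Replace the deployed SHA with the latest one, failing loudly if it is absent."""
    if live_sha not in text:
        raise ValueError(f"{live_sha} not found; the file layout may have changed")
    return text.replace(live_sha, latest_sha)


# Made-up file fragment and SHAs, for illustration only.
example = "name: binderhub\nversion: 0.2.0-aaa1111\n"
print(bump_sha(example, "aaa1111", "bbb2222"))
```

Raising an error when the old SHA is missing keeps the bot from committing an unchanged file if the upstream format ever shifts.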

### Step 6: Stage, commit, push

Now that we’ve edited the correct files, we can stage and commit the changes. We’ll make the commit message the name of the repo and the compare URL for the commit changes so people can see what has changed between versions for the dependency.

https://medium.com/media/544b9d0bb7da3c827911de4df23aae95/href
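As an illustration, the compare URL and commit message could be assembled like this. The helper names and the exact message format are assumptions; the three-dot compare URL pattern is GitHub's standard one.

```python
def compare_url(owner, repo, live_sha, latest_sha):
    """GitHub's compare view between two commits separates them with three dots."""
    return f"https://github.com/{owner}/{repo}/compare/{live_sha}...{latest_sha}"


def commit_message(owner, repo, live_sha, latest_sha):
    """Repo name followed by the compare URL, as described above."""
    return f"{repo}: {compare_url(owner, repo, live_sha, latest_sha)}"


# Placeholder SHAs, not real commits.
message = commit_message("jupyterhub", "binderhub", "aaa1111", "bbb2222")
print(message)
```

The resulting string can be passed to `git commit -m` through subprocess in the same way as the earlier git commands.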

Awesome, we now have a fully updated fork ready to make a PR to the main repo!

### Step 7: Make the body for the PR

We want the PR to have a nice comment explaining what’s happening and linking any helpful information so that the merger knows what they’re doing. We’ll note that this is a version bump and link the URL diff so it can be clicked to see what has changed.

https://medium.com/media/72c2705cce9f6c8142176412f99f88fb/href

### Step 8: Make the PR

We can use the GitHub API to make a pull request by calling the pulls endpoint with the title, body, base, and head. We’ll use the nice body we formatted above, call the title the same as the commit message we made with the repo name and the two SHAs, and put the base as master and the head the name of our fork. Then we just make a POST request to the pulls endpoint of the main repo.

https://medium.com/media/1b461c8b62f4f375af3943ed88e2cdae/href
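A sketch of the JSON payload for that request follows. The `build_pr_payload` helper, the branch name, and the title text are hypothetical. One detail worth noting: for a PR coming from a fork, GitHub expects `head` in the form `username:branch`.

```python
import json


def build_pr_payload(title, body, fork_owner, branch, base="master"):
    """Serialize the fields the pulls endpoint expects into a JSON request body."""
    payload = {
        "title": title,
        "body": body,
        "base": base,                       # branch in the main repo to merge into
        "head": f"{fork_owner}:{branch}",   # branch in the fork that holds the change
    }
    return json.dumps(payload).encode("utf-8")


# Placeholder title and body; the real bot reuses the commit message and PR body text.
data = build_pr_payload(
    title="binderhub: aaa1111...bbb2222",
    body="This is a binderhub version bump.",
    fork_owner="henchbot",
    branch="binderhub_bump",
)
print(data.decode("utf-8"))
```

These bytes would be sent as the body of an authenticated POST to the pulls endpoint of the main repo.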

### Step 9: Confirm and merge!

If we check the mybinder.org PRs, we would now see the automated PR from our account!

### Step 10: Automating the script (cron)

Now that we have a script we can simply execute to create a PR ($ python henchbot.py), we want to make this as hands-off as possible. Generally we have two options: (1) set this script to be run as a cron job; (2) have a web app listener that gets pinged whenever a change is made and executes your script as a reaction to the ping. Given that these aren’t super urgent updates that need to be made seconds or minutes after a repository update, we will go for the easier and less computationally expensive option of cron.

If you aren’t familiar with cron, it’s simply a system program that will run whatever command you want at whatever time or time interval you want. For now, we’ve decided that we want to execute this script every hour. Cron can be run on your local computer (though it would need to be continuously running) or a remote server. I’ve elected to throw it on my Raspberry Pi, which is always running.

Since I have a few projects going on, I like to keep the cron jobs in a file.

$ vim crontab-jobs

You can define your cron jobs here with the correct syntax (space-separated). Check out this site for help with the crontab syntax. Since we want to run this every hour, we will set it to run at minute 0 of every hour, on every day of the month, every month, and every day of the week. We also need to make sure it has the correct environment variable with the GitHub personal access token we created, so we’ll add that to the command.

0 * * * * cd /home/pi/projects/mybinder.org-upgrades && HENCHBOT_TOKEN='XXXXX' /home/pi/miniconda3/bin/python henchbot.py

Now we point our cron to the file we’ve created to load the jobs.

$ crontab crontab-jobs

To see our active crontab, we can list it:

$ crontab -l

That’s it! At the top of every hour, our bot will check to see if an update needs to be made, and if so, create a PR. To clean up files and handle existing PRs, in addition to some other details, I’ve written a few other functions. It is also implemented as a class with appropriate methods. You can check out the final code here.

Automating mybinder.org dependency upgrades in 10 steps was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

### How to Make Stunning 3D Plots for Better Storytelling

3D Plots built in the right way for the right purpose are always stunning. In this article, we’ll see how to make stunning 3D plots with R using ggplot2 and rayshader.

### How Databricks IAM Credential Passthrough Solves Common Data Authorization Problems

In our first blog post, we introduced Databricks IAM Credential Passthrough as a secure, convenient way for customers to manage access to their data. In this post, we’ll take a closer look at how passthrough compares to other Identity and Access Management (IAM) systems. If you’re not familiar with passthrough, we suggest reading the first post before you continue with this one.

## Properties of a good cloud IAM system

When our customers choose a system to manage how their users access cloud data, they look for the following common set of properties:

1. Security: Users should be able to access only the data they’re authorized to access.
2. Attribution: An admin should be able to trace data accesses to the users who initiated them and audit historical accesses. This auditing should be trustworthy: users should not be able to remove or modify the audit logs.
3. Ease of administration: Larger organizations typically have few personnel administering IAM systems with hundreds of non-admin users, such as data scientists, analysts, and data engineers. The non-admin users may have a few dozen entitlements (e.g., “access PII user data”, “write to log pipeline tables”, “read sales aggregate tables”).
4. Efficiency: The system should be as cost-efficient as possible. Resources should be shared among users and not sit idle.

In this blog post, we’ll explore a few common cloud IAM systems and how well they achieve these properties. We’ll finish with an exploration of Databricks IAM Credential Passthrough and how it achieves security and attribution without sacrificing ease of administration or efficiency.

## Sub-optimal system 1: One EC2 instance per entitlement

Our AWS customers typically represent entitlements using IAM Instance Profiles: they’ll have one instance profile to access all data from a particular customer, another to read a certain class of data, and so on. A user can get entitlement to some data by running code on an AWS EC2 instance that has the associated instance profile attached. Because each AWS EC2 instance can only be given a single instance profile, one of the most common solutions we see is for customers to launch a separate EC2 instance for each entitlement in their systems. The admin team is then responsible for making sure that users have access only to clusters with the appropriate entitlements.

Sub-Optimal System 1: One EC2 instance per entitlement

Security – The main benefit of this system is that it is straightforwardly secure. As long as the admin team maps users to EC2 instances correctly, each user has access to the correct set of entitlements.

Attribution – This system does not allow attribution to users. Because all users on a given EC2 instance share the same Instance Profile, cloud-native audit logs such as AWS CloudTrail can attribute accesses only to the instance, not to the user running code on the instance.

Ease of administration – This system is easy to administer provided the number of users and entitlements remain small. However, administration becomes increasingly difficult as the organization scales: admins need to ensure that each user accesses only the EC2 instances with the correct Instance Profiles, which may require manual management if the Instance Profiles don’t map cleanly to policies in the admin’s identity management system (such as LDAP or Active Directory).

Efficiency – This system requires a separate EC2 instance for each entitlement, which quickly becomes expensive as the organization’s permission model becomes more complex. If there are only a few users with a particular entitlement, that EC2 instance will either sit idle most of the day (increasing cost) or have to be stopped and started according to the work schedule of its users (increasing administrative overhead and slowing down users). Because Apache Spark™ distributes work across a cluster of instances that all require the same Instance Profile, the cost of an idle cluster can become quite large.

## Sub-optimal system 2: Users with different permission levels sharing an EC2 instance

Because of the drawbacks of the “one instance per entitlement” system, many of our customers decide to share EC2 instances among multiple users with different entitlements. These organizations let each user encode their particular credentials (in the form of usernames and passwords or AWS Access Keys and Secret Keys rather than Instance Profiles) in local configurations on the instance. The admin team assigns each user their credentials and trusts the user to manage them securely.

Sub-Optimal System 2: Hard-coded credentials

Security – The drawback of this system is that it is usually not secure. Besides the obvious problems of users sharing their credentials with each other or accidentally exposing their credentials via some global configuration, there are several more subtle security vulnerabilities hiding in this system.

• Users can use the instance’s shared filesystem to access data that other users have collected to the instance’s local disk. AWS IAM can’t protect data once it’s on a local disk.
• Users can inspect the in-memory state of their data processing tools to extract another user’s credentials. For example, a user could dump the memory of another user’s process or use Java reflection (for JVM-based tools) to crawl the JVM’s object graph and extract the object containing another user’s credentials.
• Users can trick their data-processing tools into performing data access using another user’s credentials. Most data-processing tools (including Spark) are not hardened against mutually-untrusted users sharing the same engine. Even if a user cannot read another user’s credentials directly, they can often use their tools to impersonate that other user.

Attribution – In theory, this system offers per-user attribution: since each user uses their own credentials, cloud access logs should be able to attribute each access to a user. In practice, however, that attribution can’t be trusted because of the security holes described above. You can be sure that a given user’s credentials were used to access data, but you can’t confirm which user was using those credentials.

Ease of administration – At first glance, this system is easy to administer: just give users access to the credentials they need and they’ll do the job of getting those credentials onto the instances they’re using. Eventually, though, we find that most admins get tired of having to rotate credentials that some user accidentally exposed to everyone on their EC2 instance.

Efficiency – The main benefit of this system is that it is cost-efficient, albeit at the cost of security. Users can share exactly the number of instances they need and the instances see good utilization.

## Our system: Databricks IAM Credential Passthrough

Databricks IAM Credential Passthrough allows users with different entitlements to share a single EC2 instance without the risk of exposing their credentials to other users. This combines the security of System 1 with the efficiency of System 2, and achieves better attribution and easier administration than either system. For an overview of how Passthrough works, see our first blog post.

IAM Credential Passthrough

Security – Any system that shares a single execution engine between users with different entitlements must defend against a long tail of subtle security holes. By integrating closely with the internals of Apache Spark™, Passthrough avoids these common pitfalls:

• It locks down the instance’s local filesystem to prevent users from accessing data downloaded by other users and to secure data as it is transferred between user-defined code and Spark internal code.
• It runs code from different users in separate, low-privilege processes that can only run JVM commands from a predetermined whitelist. This protects against reflection-based attacks and other unsafe APIs.
• It guarantees that a user’s credentials are only present on the cluster while that user is executing a Spark task. Furthermore, it purges user credentials from Spark threads after their tasks complete or are preempted, so a malicious user cannot use Spark to acquire indirect access to another user’s credentials.

Attribution – Because Passthrough guarantees that each user only runs commands with their own credentials, cloud-native audit logs such as AWS CloudTrail will work out of the box.

Ease of administration – Passthrough integrates with our existing SAML-based Single Sign-On feature, so admins can assign roles to users within their existing SAML Identity Provider. Permissions can be granted or revoked based on groups, so the extra overhead of using Passthrough is minimal.

Efficiency – Because multiple users can share a single cluster, Passthrough is cost-efficient, especially when combined with our autoscaling high-concurrency clusters.

Summary Table

| Solution | Security | Attribution | Ease of Administration | Efficiency |
|---|---|---|---|---|
| One instance per entitlement | Yes | No – can only attribute to an instance | No – have to maintain user → instance map manually | No – instances (or clusters) will often sit idle |
| Shared instances | No – users can access each other’s credentials | No – users can impersonate each other | No – have to rotate user credentials as they leak | Yes |
| Passthrough | Yes | Yes | Yes | Yes |

## Conclusion

Databricks IAM Credential Passthrough allows admin users to manage access to their cloud data with a system that is secure, attributable, easy to administer, and cost-effective. Because it is deeply integrated with Apache Spark, Passthrough allows users with different credentials to share the same EC2 instances (reducing cost) without sharing their credentials (guaranteeing security and attribution). IAM Credential Passthrough is in private preview right now; please contact your Databricks representative to find out more.

--

The post How Databricks IAM Credential Passthrough Solves Common Data Authorization Problems appeared first on Databricks.

### Blindfold play and sleepless nights

In Edward Winter’s Chess Explorations there is the following delightful quote from the memoirs of chess player William Winter:

Blindfold play I have never attempted seriously. I once played six, but spent so many sleepless nights trying to drive the positions out of my head that I gave it up.

I love that. We think of the difficulty as being in the remembering, but maybe it is the forgetting that is the challenge. I’m reminded of a lecture I saw by Richard Feynman at Bell Labs: He was talking about the theoretical challenges of quantum computing, and he identified the crucial entropy-producing step as that of zeroing the machine, i.e. forgetting.

### Computer Vision for Beginners: Part 1

Image processing is performing some operations on images to get an intended manipulation. Think about what we do when we start a new data analysis. We do some data preprocessing and feature engineering. It’s the same with image processing.

### Machine learning to erase penis drawings

Working from the Quick, Draw! dataset, Moniker dares people to not draw a penis:

In 2018 Google open-sourced the Quickdraw data set. “The world’s largest doodling data set”. The set consists of 345 categories and over 15 million drawings. For obvious reasons the data set was missing a few specific categories that people enjoy drawing. This made us at Moniker think about the moral reality big tech companies are imposing on our global community and that most people willingly accept this. Therefore we decided to publish an appendix to the Google Quickdraw data set.

Draw what you want, and the application compares your sketch against a model, erasing any offenders.


### KDnuggets™ News 19:n26, Jul 17: The Death of Big Data and the Emergence of Multi-Cloud; Top 10 Data Science Leaders You Should Follow

The end of Big Data era and what replaces it; An excellent list of Data Science leaders to follow; A Hackathon guide for aspiring Data Scientist; How to showcase your work; What is wrong with the approach to Data Science; and more.

### Combining momentum and value into a simple strategy to achieve higher returns

(This article was first published on Data based investing, and kindly contributed to R-bloggers)

In this post I’ll introduce a simple investing strategy that is well diversified and has been shown to work across different markets. In short, buying cheap and uptrending stocks has historically led to notably higher returns. The strategy is a combination of these two different investment styles, value and momentum. In a previous post I explained how the range of possible outcomes when investing in a single market is excessively high. Therefore, global diversification is the key to ensuring that you achieve your investment objective. This strategy is diversified across strategies, markets and different stocks. The benefits of this strategy are the low implementation costs, a high diversification level, higher expected returns and lower drawdowns.
We’ll use data from Barclays for the CAPEs which represent valuations, and Yahoo Finance using quantmod for the returns that do not include dividends, which we’ll use as absolute momentum. Let’s take a look at the paths of valuation and momentum for the U.S. stock market for the last seven years:
The two corrections are easy to spot, because momentum was low, and valuations decreased. The U.S. stock market currently has a strong momentum as measured by six-month absolute return, but the valuation level is really high. Therefore the U.S. is not the optimal country to invest in. So, which market is the optimal place to be? Let’s look at just the current values of different markets:
There is only one market that is just in the right spot: Russia. It has the highest momentum and second lowest valuation of all the countries in this sample. In emerging markets things happen faster and more intensively, which leads to more opportunities and makes investing in them more interesting. Different markets also tend to be in different cycles, which makes this combination strategy even more attractive. Let’s discuss more about these strategies and why they work well together.
Research on the topic

Value and momentum factors are negatively correlated, which means that when one has low returns, the other’s returns tend to be higher. Both have been found to lead to excess returns and are two of the most researched so-called anomalies. Researchers have tried to explain both strategies with risk-based and behavioral factors, but no single explanation has been agreed on for either of them. The fact that there are multiple explanations for the superior performance can be viewed as a good thing for the strategies.
In their book “Your Complete Guide to Factor-Based Investing”, Berkin and Swedroe found that the yearly returns of the two anomalies using a long-short strategy were 4.8 percent for value and 9.6 percent for momentum. This corresponds to the return of the factor itself and can be compared directly to the market beta factor, which has had a historical annual return of 8.3 percent during the same period. This means that investing just in the momentum factor, and therefore hedging against the market, would have led to a higher return than just investing in the market. It is important to note that a normal momentum strategy without shorting gives exposure to both the market beta and momentum factors, which leads to a higher return than investing in either of these factors alone.
Andreu et al. examined momentum on the country level and found that the return of the momentum factor has been about 6 percent per annum for a holding period of six months. For a holding period of twelve months, the return was cut in half (source). A short holding period seems to work best for this momentum strategy. They researched investing in a single country and in three countries at a time, and shorting the same number of countries at a time. The smaller number of countries led to higher returns, but no risk measures were presented in the study. As a short-term strategy I’d suggest equal weighting some of the countries with high momentum and low valuation. I’ve also tested the combination of value and momentum in the U.S. stock market, and it seems that momentum does not affect the returns at all over longer periods of time.
Value on the other hand tends to correlate strongly with future returns only on much longer periods, and on shorter periods the correlation is close to zero as I demonstrated in a previous post. However, the short-term CAGR of the value strategy on the country level in the U.S. has still been rather impressive at 14.5 percent for a CAPE ratio of 5 to 10, as shown by Faber (source, figure 3A). I chose to show this specific valuation level, since currently countries such as Turkey and Russia are trading at these valuation levels (source).
The 10-year cyclically adjusted price to earnings ratio discussed above, also known as CAPE, has been shown to be among the best variables for explaining the future returns of the stock market. It has a logarithmic relationship with future 10-15 year returns, and an r-squared as high as 0.49 across 17 country-level indices (source, page 11). A lower CAPE has also led to smaller maximum and average drawdowns (source).

Faber has shown that investing in countries with a low CAPE has returned 14 percent annually since 1993, and the risk-adjusted return has also been really good (source). The strategy, and value investing as a whole, has however underperformed for the last ten years or so (source). This is good news if you believe in mean reversion in the stock market.

The two strategies work well together on the stock level, as shown by Keimling (source). According to the study, the quintile with the highest momentum has led to a yearly excess return of 2.7 percent, and the one with the lowest valuation has led to a yearly excess return of 3 percent globally. Choosing stocks with the highest momentum and lowest valuations has more than doubled the excess return to 7.6 percent. O’Shaughnessy has shown that the absolute return for the quintile with the highest momentum was 11.6 percent, and 11.8 percent for value. Combining the two led to a return of 18.5 percent (source).
Lastly, let’s take a closer look at some selected countries and their paths:
As expected, the returns of the emerging markets vary a lot compared to the U.S. market. The U.S. has performed extremely well, but the historical earnings haven’t kept up with the prices. Israel on the other hand has gotten cheaper while the momentum has been good. Even though the momentum of the U.S. is higher than at any other point in time in this sample, Russia’s momentum currently is, and Turkey’s momentum has been, much higher. Both Russia’s and Turkey’s valuations are less than a third of U.S. valuations, which makes these markets very interesting.
In conclusion, combining value and momentum investing into a medium-term strategy is likely to lead to excess returns as shown by previous research. The strategy can be easily implemented using country-specific exchange traded funds, and the data is easily available. Currently only Russia is in the sweet spot for this strategy, and Turkey might be once it gains some momentum. Investing in just one country is however risky, and I suggest diversifying between the markets with high momentum and low valuations.
The R code used in the analysis can be found here.
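For readers who want to see the screening idea in miniature, here is a Python sketch (not the author's R code). The market names, prices, CAPE values, and thresholds are all made up for illustration; only the logic of combining six-month absolute momentum with a valuation cutoff follows the post.

```python
def six_month_return(prices):
    """Absolute momentum: simple return from the first to the last of 7 monthly prices."""
    return prices[-1] / prices[0] - 1


def screen(markets, max_cape, min_momentum):
    """Return names of markets that are both cheap (low CAPE) and uptrending."""
    selected = []
    for name, data in markets.items():
        momentum = six_month_return(data["prices"])
        if data["cape"] <= max_cape and momentum >= min_momentum:
            selected.append(name)
    return sorted(selected)


# Entirely hypothetical inputs: 7 month-end prices span six months.
markets = {
    "Market A": {"cape": 6.0, "prices": [100, 102, 105, 107, 110, 112, 115]},   # cheap, rising
    "Market B": {"cape": 30.0, "prices": [100, 103, 104, 108, 110, 113, 116]},  # rising, expensive
    "Market C": {"cape": 8.0, "prices": [100, 99, 97, 96, 95, 94, 92]},         # cheap, falling
}

print(screen(markets, max_cape=15, min_momentum=0.0))
```

Equal weighting whatever passes the screen, as suggested above, keeps the approach diversified when more than one market qualifies.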

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Four short links: 17 July 2019

Margaret Hamilton, WeChat Censorship, Refactoring, and Ancient Games

1. Margaret Hamilton Interview (The Guardian) — I found a job to support our family at the nearby Massachusetts Institute of Technology (MIT). It was in the laboratory of Prof Edward Lorenz, the father of chaos theory, working on a system to predict weather. He was asking for math majors. To take care of our daughter, we hired a babysitter. Here I learned what a computer was and how to write software. Computer science and software engineering were not yet disciplines; instead, programmers learned on the job. Lorenz’s love for software experimentation was contagious, and I caught the bug.
2. How WeChat Censors Images in Private Chats (BoingBoing) — WeChat maintains a massive index of the MD5 hashes of every image that Chinese censors have prohibited. When a user sends another user an image that matches one of these hashes, it's recognized and blocked at the server before it is transmitted to the recipient, with neither the recipient or the sender being informed that the censorship has taken place. Separately, all images not recognized in the hash database are processed out-of-band.
3. The Best Refactoring You've Never Heard Of (James Koppel) — lambdas vs data structures. Very interesting talk.
4. Machine Learning is About to Revolutionize the Study of Ancient Games (MIT TR) — The team model games as mathematical entities that lend themselves to computational study. This is based on the idea that games are composed of units of information called ludemes, such as a throw of the dice or the distinctively skewed shape of a knight’s move in chess. Ludemes are equivalent to genes in living things or memes as elements of cultural inheritance. They can be transmitted from one game to another, or they may die, never to be seen again. But a key is that they can be combined into bigger edifices that form games themselves.

### Document worth reading: “Performance Metrics (Error Measures) in Machine Learning Regression, Forecasting and Prognostics: Properties and Typology”

Performance metrics (error measures) are vital components of the evaluation frameworks in various fields. The intention of this study was to overview of a variety of performance metrics and approaches to their classification. The main goal of the study was to develop a typology that will help to improve our knowledge and understanding of metrics and facilitate their selection in machine learning regression, forecasting and prognostics. Based on the analysis of the structure of numerous performance metrics, we propose a framework of metrics which includes four (4) categories: primary metrics, extended metrics, composite metrics, and hybrid sets of metrics. The paper identified three (3) key components (dimensions) that determine the structure and properties of primary metrics: method of determining point distance, method of normalization, method of aggregation of point distances over a data set. Performance Metrics (Error Measures) in Machine Learning Regression, Forecasting and Prognostics: Properties and Typology
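To make the three components concrete, here is a small sketch that assembles familiar metrics from a point distance, a normalization, and an aggregation. The function names and this particular decomposition are an illustrative reading of the abstract, not the paper's own notation.

```python
def make_metric(distance, normalize, aggregate):
    """Compose a primary metric from its three components."""
    def metric(actual, predicted):
        points = [normalize(distance(a, p), a) for a, p in zip(actual, predicted)]
        return aggregate(points)
    return metric


def abs_distance(a, p):
    return abs(a - p)


def squared_distance(a, p):
    return (a - p) ** 2


def no_normalization(d, a):
    return d


def by_actual(d, a):
    return d / abs(a)  # assumes no actual value is zero


def mean(values):
    return sum(values) / len(values)


mae = make_metric(abs_distance, no_normalization, mean)      # mean absolute error
mse = make_metric(squared_distance, no_normalization, mean)  # mean squared error
mape = make_metric(abs_distance, by_actual, mean)            # mean absolute percentage error (as a fraction)

actual = [2.0, 4.0, 8.0]
predicted = [1.0, 5.0, 6.0]
print(mae(actual, predicted), mse(actual, predicted), mape(actual, predicted))
```

Swapping a single component, for example the mean for a median, yields a different metric without touching the other two.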

### Distilled News

This article presents in details how to predict tags for posts from StackOverflow using Linear Model after carefully preprocessing our text features.
It seems fair to say that simple computer vision models easily weigh ~100 MB. A hundred MB just to be able to make an inference isn’t a viable solution for an end product. A remote API can do the trick, but now your product needs to add encryption, you need to store and upload data, and the user needs a reliable internet connection to get a decent speed. We can train a narrower network, which will probably fit in a small memory. But chances are it won’t be good enough at extracting complex features. And we’re not talking about ensembles. Ensembles are a great way to extract a lot of knowledge from the training data. But at test time it can be too expensive to run a hundred different models in parallel. The knowledge per parameter ratio is quite low.
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions [3]. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators [1] have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Predicting the Generalization Gap in Deep Neural Networks (Tuesday, July 9, 2019; posted by Yiding Jiang, Google AI Resident). Deep neural networks (DNN) are the cornerstone of recent progress in machine learning, and are responsible for recent breakthroughs in a variety of tasks such as image recognition, image segmentation, machine translation and more. However, despite their ubiquity, researchers are still attempting to fully understand the underlying principles that govern them. In particular, classical theories (e.g., VC-dimension and Rademacher complexity) suggest that over-parameterized functions should generalize poorly to unseen data, yet recent work has found that massively over-parameterized functions (orders of magnitude more parameters than the number of data points) generalize well. In order to improve models, a better understanding of generalization, which can lead to more theoretically grounded and therefore more principled approaches to DNN design, is required.
The job ‘Data Scientist’ has been around for decades, it was just not called ‘Data Scientist’. Statisticians have used their knowledge and skills using machine learning techniques such as Logistic Regression and Random Forest for prediction and insights for longer than people actually realize.
Anyone who deals with big datasets in order to use Machine Learning techniques knows that it is vital to keep data clean and to avoid weird data. Outliers are a pain in the neck here because they can cause results to be misinterpreted. Several methods can be used to remove outliers from the data, but this post will focus on an unsupervised Machine Learning technique: the autoencoder, a kind of neural network. In this blog we have already seen several ways to detect outliers based on Machine Learning techniques, but now we describe a method which uses neural networks. This blog also has some explanations of neural networks and several examples of using them. I encourage you to go deeper into those posts to see everything that has been published here.
It’s been a long time since my last update and I’ve decided to start with Tableau, of all topics! Although open source advocates do not look kindly upon Tableau, I find myself using it frequently and relearning all the stuff I can do in R. For my series of ‘how-to’s’ regarding Tableau, I’d like to start with posting about how to make a waffle chart in Tableau.
We hear this sentence over and over again. But what does that actually mean? This small analysis uncovers this topic with the help of R, and simple regressions, focusing on how alcohol impacts health.
For a beginner, setting up a Python environment and installing packages can be a little intimidating. Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment.
The industries of healthcare and finance have one thing in common: both are being heavily disrupted by advances in technology, namely data science. And this phenomenon is being strongly encouraged, because data science helps people. In 2017 alone, 3.5 million USD was invested in over 180 health companies. The core of significant transformation in the health industry, therefore, lies in data science. More than a billion clinical records are created every year in the US alone, for instance. Doctors and life scientists have an immense amount of data to base their studies on. Moreover, immense volumes of health-related information are made available through the wide range of wearable gadgets. This opens the door to new innovations for more informed, better healthcare. The main objective for health data scientists working with the healthcare industry is to make sense of this huge data set and derive helpful insights from it, so that healthcare providers can better understand the human body and its issues. Therefore, data science can strongly transform healthcare.
I recently went on a weekend camping trip in The Enchantments, which is just over a two hour drive from where I live in Seattle, WA. To plan for the trip, we relied on Washington Trails Association (WTA) and a few other resources to make sure we had the optimal trail routes and camping spots for each day. Many of these outdoor adventure resources can help folks plan for multi-day camping trips, figure out where to go for a hike with parents or make sure to correctly traverse Aasgard Pass, a sketchy 2300 feet elevation gain in less than a mile. But there is still something lacking.
These concepts are important to both the theory and the practice of data science. They also come up in job interviews and academic exams. A biased predictor is eccentric, i.e. its predictions are consistently off. No matter how well it’s trained, it just doesn’t get it. Generally, such a predictor is too simple for the problem at hand. It under-fits the data, no matter how rich. A high-variance predictor is, in a sense, the opposite. It often arises when one tries to fix bias and over-compensates. One’s gone from a model that is too simple – i.e., biased – to one that is too complex – i.e. has high variance. It over-fits the data. Somewhere between these two is the ‘sweet spot’ – the best predictor. Often this is not easy to find. A data scientist can help. That’s another story …
It has already been more than a year since I started working as a Project Manager for Artificial Intelligence (AI). I suppose you don’t notice the time passing when you love your job. I started in this role with a background in wireless communication, which is unusual, though mostly helpful when working at a telecom operator. Since March 2018, learning has become an integral part of my life, as I had a lot of catching up to do with data science (and still do). Since there is no college degree in AI project management, how could I adapt to this responsibility? Well, I learnt on the job.

### An Ad-hoc Method for Calibrating Uncalibrated Models

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This "polishing step" potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.

## Example: Predicting income

We’ll continue the example from the previous posts in the series: predicting income from demographic variables (sex, age, employment, education). The data is from the 2016 US Census American Community Survey (ACS) Public Use Microdata Sample (PUMS). More information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies along different groupings (by sex, by employment, and by education):

library(zeallot)
library(wrapr)

c(test, train) %<-% split(incomedata, incomedata$gp)

# get the rollups (mean) by grouping variable
show_conditional_means <- function(d, outcome = "income") {
  cols <- qc(sex, employment, education)
  lapply(
    cols := cols,
    function(colname) {
      aggregate(d[, outcome, drop = FALSE],
                d[, colname, drop = FALSE],
                FUN = mean)
    })
}

display_tables <- function(tlist) {
  for(vi in tlist) {
    print(knitr::kable(vi))
  }
}

display_tables(
  show_conditional_means(train))

| sex | income |
|---|---|
| Male | 55755.51 |
| Female | 47718.52 |

| employment | income |
|---|---|
| Employee of a private for profit | 51620.39 |
| Federal government employee | 64250.09 |
| Local government employee | 54740.93 |
| Private not-for-profit employee | 53106.41 |
| Self employed incorporated | 66100.07 |
| Self employed not incorporated | 41346.47 |
| State government employee | 53977.20 |

| education | income |
|---|---|
| no high school diploma | 31883.18 |
| Regular high school diploma | 38052.13 |
| GED or alternative credential | 37273.30 |
| some college credit, no degree | 42991.09 |
| Associate’s degree | 47759.61 |
| Bachelor’s degree | 65668.51 |
| Master’s degree | 79225.87 |
| Professional degree | 97772.60 |
| Doctorate degree | 91214.55 |

## A random forest model

For this post, we’ll train a random forest model to predict income.

library(randomForest)

model_rf_1stage <- randomForest(income ~ age+sex+employment+education,
                                data=train)

train$pred_rf_raw <- predict(model_rf_1stage, newdata=train, type="response")

# doesn't roll up
display_tables(
  show_conditional_means(train,
                         qc(income, pred_rf_raw)))
| sex | income | pred_rf_raw |
|---|---|---|
| Male | 55755.51 | 55292.47 |
| Female | 47718.52 | 48373.40 |

| employment | income | pred_rf_raw |
|---|---|---|
| Employee of a private for profit | 51620.39 | 51291.36 |
| Federal government employee | 64250.09 | 61167.31 |
| Local government employee | 54740.93 | 55425.30 |
| Private not-for-profit employee | 53106.41 | 54174.31 |
| Self employed incorporated | 66100.07 | 63714.20 |
| Self employed not incorporated | 41346.47 | 46415.34 |
| State government employee | 53977.20 | 55599.89 |

| education | income | pred_rf_raw |
|---|---|---|
| no high school diploma | 31883.18 | 41673.91 |
| Regular high school diploma | 38052.13 | 42491.11 |
| GED or alternative credential | 37273.30 | 43037.49 |
| some college credit, no degree | 42991.09 | 44547.89 |
| Associate’s degree | 47759.61 | 46815.79 |
| Bachelor’s degree | 65668.51 | 63474.48 |
| Master’s degree | 79225.87 | 69953.53 |
| Professional degree | 97772.60 | 76861.44 |
| Doctorate degree | 91214.55 | 75940.24 |

As we observed before, the random forest model predictions do not match the true rollups, even on the training data.

## Polishing the model

Suppose that we wish to make individual predictions of subjects’ incomes, and estimate mean income as a function of employment type. An ad-hoc way to do this is to adjust the predictions from the random forest, depending on subjects’ employment type, so that the resulting polished model is calibrated with respect to employment. Since linear models are calibrated, we might try fitting a linear model to the random forest model’s predictions, along with employment.

(Of course, we could use a Poisson model as well, but for this example we’ll just use a simple linear model for the polishing step).
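The claim that linear models are calibrated can be checked on a toy example. The sketch below uses synthetic data (the names `grp`, `x`, `y` and all numbers are made up for illustration and are not taken from the income data): an ordinary least squares fit that includes a grouping factor reproduces the per-group means of the outcome on its training data, even though the model is misspecified in `x`.

```r
# Synthetic data: names and values are illustrative assumptions
set.seed(1)
n <- 300
d <- data.frame(
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x = runif(n)
)
d$y <- 10 + 5 * (d$grp == "b") + 3 * d$x^2 + rnorm(n)

# Linear model that includes the grouping variable
fit <- lm(y ~ grp + x, data = d)
d$pred <- predict(fit, newdata = d)

# Per-group means of the outcome and of the predictions
rollup <- aggregate(d[, c("y", "pred")], d[, "grp", drop = FALSE], FUN = mean)
print(rollup)

# The two columns agree up to floating point error
max_gap <- max(abs(rollup$y - rollup$pred))
```

This is the property the polishing step borrows: whatever grouping variable enters the linear model as a factor gets unbiased rollups on the data used to fit it.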

One caution: we shouldn’t use the same data to fit both the random forest model and the polishing model. This leads to nested-model bias, a potential source of overfit. Either we must split the training data into two sets: one to train the random forest model and another to train the polishing model; or we have to use cross-validation to simulate having two sets. This second procedure is the same procedure used when stacking multiple models; you can think of this polishing procedure as being a form of stacking, where some of the sub-learners are simply single variables.

Let’s use 5-fold cross-validation to "stack" the random forest model and the employment variable. We’ll use vtreat to create the cross-validation plan.

set.seed(2426355)

# build a schedule for 5-way crossval
crossplan <- vtreat::kWayCrossValidation(nrow(train), 5)

The crossplan is a list of five splits of the data (described by row indices); each split is itself a list of two disjoint index vectors: split$train and split$app. For each fold, we want to train a model using train[split$train, , drop=FALSE] and then apply the model to train[split$app, , drop=FALSE].
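To make that structure concrete, here is a base-R sketch that builds a plan of the same shape by hand. The helper name `make_crossplan` is hypothetical, and this is not how vtreat implements `kWayCrossValidation`; it only mimics the `train`/`app` layout described above.

```r
# Hypothetical helper: build a k-way plan with the train/app layout
make_crossplan <- function(n_rows, k) {
  # randomly assign each row to one of k folds of near-equal size
  fold <- sample(rep(seq_len(k), length.out = n_rows))
  lapply(seq_len(k), function(i) {
    list(train = which(fold != i),  # rows used to fit the model
         app = which(fold == i))    # rows the model is applied to
  })
}

set.seed(42)
plan <- make_crossplan(23, 5)

# every row appears in exactly one application set
app_rows <- sort(unlist(lapply(plan, function(s) s$app)))
all_covered <- identical(app_rows, seq_len(23))

# train and app never overlap within a split
disjoint <- all(vapply(plan,
                       function(s) length(intersect(s$train, s$app)) == 0,
                       logical(1)))
```

Because the application sets partition the rows, looping over the splits yields exactly one out-of-fold prediction per training row.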

train$pred_uncal <- 0

# use cross validation to get uncalibrated predictions
for(split in crossplan) {
  model_rf_2stage <- randomForest(income ~ age+sex+employment+education,
                                  data=train[split$train, , drop=FALSE])
  predi <- predict(model_rf_2stage,
                   newdata=train[split$app, , drop=FALSE],
                   type="response")
  train$pred_uncal[split$app] <- predi
}

The vector train$pred_uncal is now a vector of random forest predictions on the training data; every prediction is made using a model that was not trained on the datum in question.

Now we can use these random forest predictions to train the linear polishing model.

# learn a polish/calibration for employment
rf_polish <- lm(income - pred_uncal ~ employment,
data=train)
# get rid of pred_uncal, as it's no longer needed
train$pred_uncal <- NULL

Now, take the predictions from the original random forest model (the one trained on all the data, earlier), and polish them with the polishing model.

# get the predictions from the original random forest model
train$pred_rf_raw <- predict(model_rf_1stage, newdata=train, type="response")

# polish the predictions so that employment rolls up correctly
train$pred_cal <- train$pred_rf_raw +
predict(rf_polish, newdata=train, type="response")
# see how close the rollups get to ground truth

rollups <-  show_conditional_means(train,
qc(income, pred_cal, pred_rf_raw))

display_tables(rollups)
| sex | income | pred_cal | pred_rf_raw |
|---|---|---|---|
| Male | 55755.51 | 55343.35 | 55292.47 |
| Female | 47718.52 | 48296.93 | 48373.40 |

| employment | income | pred_cal | pred_rf_raw |
|---|---|---|---|
| Employee of a private for profit | 51620.39 | 51640.44 | 51291.36 |
| Federal government employee | 64250.09 | 64036.19 | 61167.31 |
| Local government employee | 54740.93 | 54739.80 | 55425.30 |
| Private not-for-profit employee | 53106.41 | 53075.76 | 54174.31 |
| Self employed incorporated | 66100.07 | 66078.76 | 63714.20 |
| Self employed not incorporated | 41346.47 | 41341.37 | 46415.34 |
| State government employee | 53977.20 | 53946.07 | 55599.89 |

| education | income | pred_cal | pred_rf_raw |
|---|---|---|---|
| no high school diploma | 31883.18 | 41526.88 | 41673.91 |
| Regular high school diploma | 38052.13 | 42572.57 | 42491.11 |
| GED or alternative credential | 37273.30 | 43104.09 | 43037.49 |
| some college credit, no degree | 42991.09 | 44624.38 | 44547.89 |
| Associate’s degree | 47759.61 | 46848.84 | 46815.79 |
| Bachelor’s degree | 65668.51 | 63468.93 | 63474.48 |
| Master’s degree | 79225.87 | 69757.13 | 69953.53 |
| Professional degree | 97772.60 | 76636.17 | 76861.44 |
| Doctorate degree | 91214.55 | 75697.59 | 75940.24 |

Note that the rolled up predictions from the polished model almost match the true rollups for employment, but not for the other grouping variables (sex and education). To see this better, let’s look at the total absolute error of the estimated rollups.

err_mag <- function(x, y) {
sum(abs(y-x))
}

preds = qc(pred_rf_raw, pred_cal)

# total absolute rollup error per prediction column,
# one single-row data frame per grouping variable
errframes <- lapply(rollups,
                    function(df) {
                      gp = names(df)[[1]]
                      errs <- lapply(df[, preds],
                                     function(p)
                                       err_mag(p, df$income))
                      as.data.frame(c(grouping=gp, errs))
                    })

display_tables(errframes)
| grouping | pred_rf_raw | pred_cal |
|---|---|---|
| sex | 1117.927 | 990.5685 |

| grouping | pred_rf_raw | pred_cal |
|---|---|---|
| employment | 14241.51 | 323.2577 |

| grouping | pred_rf_raw | pred_cal |
|---|---|---|
| education | 70146.37 | 70860.7 |

We can reduce the rollup errors substantially for the variables that the polishing model was exposed to. For variables that the polishing model is not exposed to, there is no improvement; it’s likely that those estimated rollups will in many cases be worse.

## Model performance on holdout data

Let’s see the performance of the polished model on test data.

# get the predictions from the original random forest model
test$pred_rf_raw <- predict(model_rf_1stage, newdata=test, type="response")

# polish the predictions so that employment rolls up correctly
test$pred_cal <- test$pred_rf_raw +
  predict(rf_polish, newdata=test, type="response")

# compare the rollups on employment
preds <- qc(pred_rf_raw, pred_cal)
employment_rollup <-
  show_conditional_means(test, c("income", preds))$employment
knitr::kable(employment_rollup)
| employment | income | pred_rf_raw | pred_cal |
|---|---|---|---|
| Employee of a private for profit | 50717.96 | 51064.25 | 51413.32 |
| Federal government employee | 66268.05 | 61401.94 | 64270.82 |
| Local government employee | 52565.89 | 54878.96 | 54193.47 |
| Private not-for-profit employee | 52887.52 | 54011.64 | 52913.09 |
| Self employed incorporated | 67744.61 | 63664.51 | 66029.07 |
| Self employed not incorporated | 41417.25 | 46215.42 | 41141.44 |
| State government employee | 51314.92 | 55395.96 | 53742.14 |
# see how close the rollups get to ground truth for employment
lapply(employment_rollup[, preds],
       function(p) err_mag(p, employment_rollup$income)) %.>%
  as.data.frame(.) %.>%
  knitr::kable(.)

| pred_rf_raw | pred_cal |
|---|---|
| 21608.9 | 8764.302 |

The polished model estimates rollups with respect to employment better than the uncalibrated random forest model. Its performance on individual predictions (as measured by root mean squared error) is about the same.

# predictions on individuals
rmse <- function(x, y) {
  sqrt(mean((y-x)^2))
}

lapply(test[, preds],
       function(p) rmse(p, test$income)) %.>%
  as.data.frame(.) %.>%
  knitr::kable(.)

| pred_rf_raw | pred_cal |
|---|---|
| 31780.39 | 31745.12 |

## Conclusion

We’ve demonstrated a procedure that mitigates bias issues with ensemble models, or any other uncalibrated model. This potentially allows the data scientist to balance the requirement for highly accurate predictions on individuals with the need to correctly estimate specific aggregate quantities.

This method is ad-hoc, and may be somewhat brittle. In addition, it requires that the data scientist knows ahead of time which rollups will be desired in the future. However, if you find yourself in a situation where you must balance accurate individual prediction with accurate aggregate estimates, this may be a useful trick to have in your data science toolbox.

### Loglinear models

Jelmer Ypma has pointed out to us that for the special case of loglinear models (that is, a linear model for log(y)), there are other techniques for mitigating bias in predictions on y. More information on these methods can be found in chapter 6.4 of Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge (2014).

These methods explicitly assume that y is lognormally distributed (an assumption that is often valid for monetary amounts), and try to estimate the true standard deviation of log(y) in order to adjust the estimates of y. They do not completely eliminate the bias, because this true standard deviation is unknown, but they do reduce it, while making predictions on individuals with RMSE performance competitive with the performance of linear or (quasi)Poisson models fit directly to y. However, they do not give the improvements on relative error that the naive adjustment we showed in the first article of this series will give.
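As a rough illustration of the idea (not the exact procedure from the textbook), the sketch below uses synthetic lognormal data. If log(y) is normal with mean mu and standard deviation sigma, then E[y] = exp(mu + sigma^2/2), so exponentiating a prediction of log(y) underestimates the mean of y unless it is multiplied by exp(sigma^2/2), with sigma estimated from the residuals. All names and parameter values here are assumptions chosen for the demonstration.

```r
set.seed(2019)
n <- 20000
x <- runif(n)
# log(y) is normal with sd 0.8 around a linear function of x
y <- exp(1 + 2 * x + rnorm(n, sd = 0.8))

fit <- lm(log(y) ~ x)
sigma_hat <- summary(fit)$sigma   # estimated sd of the log(y) residuals

pred_naive <- exp(fitted(fit))                      # biased low for E[y]
pred_adjusted <- pred_naive * exp(sigma_hat^2 / 2)  # lognormal mean correction

c(truth = mean(y),
  naive = mean(pred_naive),
  adjusted = mean(pred_adjusted))
```

The adjusted predictions roll up much closer to the true mean, but only to the extent that the lognormal assumption and the estimate of sigma are good.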

### Three Strategies for Working with Big Data in R

(This article was first published on R Views, and kindly contributed to R-bloggers)

For many R users, it’s obvious why you’d want to use R with big data, but not so obvious how. In fact, many people (wrongly) believe that R just doesn’t work very well for big data.

In this article, I’ll share three strategies for thinking about how to use big data in R, as well as some examples of how to execute each of them.

By default R runs only on data that can fit into your computer’s memory. Hardware advances have made this less of a problem for many users, since these days most laptops come with at least 4-8 GB of memory, and you can get instances on any major cloud provider with terabytes of RAM. But this is still a real problem for almost any data set that could really be called big data.

The fact that R runs on in-memory data is the biggest issue that you face when trying to use Big Data in R. The data has to fit into the RAM on your machine, and it’s not even 1:1. Because you’re actually doing something with the data, a good rule of thumb is that your machine needs 2-3x the RAM of the size of your data.
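A back-of-the-envelope check of that rule of thumb can be done before loading anything. The sketch below assumes an all-numeric table (8 bytes per double) and a working-memory multiplier of 3; the function name `estimate_ram_gb` and the default multiplier are illustrative choices, not part of any package.

```r
# Hypothetical helper: rough RAM needed to work with a numeric table
estimate_ram_gb <- function(n_rows, n_cols, multiplier = 3) {
  bytes_raw <- n_rows * n_cols * 8   # doubles take 8 bytes each
  bytes_raw * multiplier / 1024^3    # convert bytes to GiB
}

# 100 million rows by 20 numeric columns
needed <- estimate_ram_gb(1e8, 20)
needed

# sanity check on a small object that actually fits in memory
m <- matrix(0, nrow = 1e5, ncol = 20)
actual_bytes <- as.numeric(object.size(m))  # close to 1e5 * 20 * 8 bytes
actual_bytes
```

Character and factor columns behave differently, so for mixed tables it is safer to load a small sample and scale up its `object.size`.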

Another big issue for doing Big Data work in R is that data transfer speeds are extremely slow relative to the time it takes to actually process the data once it has been transferred. For example, a call over the internet from San Francisco to New York City takes over 4 times longer than reading from a standard hard drive and over 200 times longer than reading from a solid state hard drive.1 This is an especially big problem early in developing a model or analytical project, when data might have to be pulled repeatedly.

Nevertheless, there are effective methods for working with big data in R. In this post, I’ll share three strategies. It’s important to note that these strategies aren’t mutually exclusive; they can be combined as you see fit!

## Strategy 1: Sample and Model

To sample and model, you downsample your data to a size that can be easily downloaded in its entirety and create a model on the sample. Downsampling to thousands – or even hundreds of thousands – of data points can make model runtimes feasible while also maintaining statistical validity.2

If maintaining class balance is necessary (or one class needs to be over/under-sampled), it’s reasonably simple to stratify the data set during sampling.
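For data that is already in memory, a stratified downsample takes only a few lines of base R. This is a minimal sketch on synthetic data; `stratified_sample` and the column names are hypothetical, and sampling inside a database works differently.

```r
# Hypothetical helper: draw n_per_class rows from each level of class_col
stratified_sample <- function(d, class_col, n_per_class) {
  idx_by_class <- split(seq_len(nrow(d)), d[[class_col]])
  keep <- unlist(lapply(idx_by_class, function(idx) {
    # sample row positions, never asking for more rows than the class has
    idx[sample.int(length(idx), min(n_per_class, length(idx)))]
  }))
  d[keep, , drop = FALSE]
}

set.seed(7)
d <- data.frame(
  is_delayed = sample(c(TRUE, FALSE), 10000, replace = TRUE,
                      prob = c(0.4, 0.6)),
  x = rnorm(10000)
)

s <- stratified_sample(d, "is_delayed", 500)
table(s$is_delayed)
```

Drawing the same number of rows per class gives a balanced sample; passing different sizes per class would over- or under-sample instead.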

Advantages:

• Speed: Relative to working on your entire data set, working on just a sample can drastically decrease run times and increase iteration speed.
• Prototyping: Even if you’ll eventually have to run your model on the entire data set, this can be a good way to refine hyperparameters and do feature engineering for your model.
• Packages: Since you’re working on a normal in-memory data set, you can use all your favorite R packages.

Disadvantages:

• Sampling: Downsampling isn’t terribly difficult, but does need to be done with care to ensure that the sample is valid and that you’ve pulled enough points from the original data set.
• Scaling: If you’re using sample and model to prototype something that will later be run on the full data set, you’ll need to have a strategy (such as pushing compute to the data) for scaling your prototype version back to the full data set.
• Totals: Business Intelligence (BI) tasks frequently answer questions about totals, like the count of all sales in a month. One of the other strategies is usually a better fit in this case.

## Strategy 2: Chunk and Pull

In this strategy, the data is chunked into separable units and each chunk is pulled separately and operated on serially, in parallel, or after recombining. This strategy is conceptually similar to the MapReduce algorithm. Depending on the task at hand, the chunks might be time periods, geographic units, or logical units like separate businesses, departments, products, or customer segments.

Advantages:

• Full data set: The entire data set gets used.
• Parallelization: If the chunks are run separately, the problem is easy to treat as embarrassingly parallel and make use of parallelization to speed runtimes.

Disadvantages:

• Need Chunks: Your data needs to have separable chunks for chunk and pull to be appropriate.
• Pull All Data: You eventually have to pull in all the data, which may still be very time and memory intensive.
• Stale Data: The data may require periodic refreshes from the database to stay up-to-date since you’re saving a version on your local machine.
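The pattern itself is independent of the database. As a minimal in-memory sketch (synthetic data and hypothetical names; in practice each chunk would be pulled from the database rather than subset from a local data frame), each chunk is processed on its own and only the small per-chunk results are recombined:

```r
set.seed(11)
sales <- data.frame(
  region = sample(c("east", "north", "west"), 3000, replace = TRUE),
  amount = rexp(3000, rate = 1 / 50)
)

# Stand-in for "pull one chunk": here just a subset of a local data frame
pull_chunk <- function(region_name) {
  sales[sales$region == region_name, , drop = FALSE]
}

# Work done on a single chunk; only a one-row summary is kept
process_chunk <- function(region_name) {
  chunk <- pull_chunk(region_name)
  data.frame(region = region_name,
             n = nrow(chunk),
             mean_amount = mean(chunk$amount))
}

regions <- sort(unique(sales$region))
results <- do.call(rbind, lapply(regions, process_chunk))
results
```

Only one chunk needs to be in memory at a time, which is what makes the approach workable when the full table is not.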

## Strategy 3: Push Compute to Data

In this strategy, the data is compressed on the database, and only the compressed data set is moved out of the database into R. It is often possible to obtain significant speedups simply by doing summarization or filtering in the database before pulling the data into R.

Sometimes, more complex operations are also possible, including computing histogram and raster maps with dbplot, building a model with modeldb, and generating predictions from machine learning models with tidypredict.

Advantages:

• Use the Database: Takes advantage of what databases are often best at: quickly summarizing and filtering data based on a query.
• More Info, Less Transfer: By compressing before pulling data back to R, the entire data set gets used, but transfer times are far less than moving the entire data set.

Disadvantages:

• Database Operations: Depending on what database you’re using, some operations might not be supported.
• Database Speed: In some contexts, the limiting factor for data analysis is the speed of the database itself, and so pushing more work onto the database is the last thing analysts want to do.
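The payoff from summarizing first comes from how much less data has to move. Here is a rough in-memory illustration, using synthetic data and base R `aggregate` as a stand-in for a database-side `GROUP BY` (no database is involved in this sketch, and the column names are made up):

```r
set.seed(3)
n <- 200000
flights_like <- data.frame(
  carrier = sample(LETTERS[1:10], n, replace = TRUE),
  hour = sample(5:23, n, replace = TRUE),
  is_delayed = rbinom(n, 1, 0.4)
)

# What a database would return if asked for the summary only
summary_tbl <- aggregate(is_delayed ~ carrier + hour,
                         data = flights_like, FUN = mean)

# Rows and bytes that would need to be transferred in each case
c(rows_full = nrow(flights_like), rows_summary = nrow(summary_tbl))
size_ratio <- as.numeric(object.size(flights_like)) /
  as.numeric(object.size(summary_tbl))
size_ratio
```

The summary has one row per carrier-hour combination regardless of how many flights there are, so the saving grows with the size of the underlying table.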

## An Example

I’ve preloaded the flights data set from the nycflights13 package into a PostgreSQL database, which I’ll use for these examples.

Let’s start by connecting to the database. I’m using a config file here to connect to the database, one of RStudio’s recommended database connection methods:

library(DBI)
library(dplyr)
library(ggplot2)

db <- DBI::dbConnect(
odbc::odbc(),
Driver = config$driver, Server = config$server,
Port = config$port, Database = config$database,
UID = config$uid, PWD = config$pwd,
BoolsAsChar = ""
)

The dplyr package is a great tool for interacting with databases, since I can write normal R code that is translated into SQL on the backend. I could also use the DBI package to send queries directly, or a SQL chunk in the R Markdown document.

df <- dplyr::tbl(db, "flights")
tally(df)
## # A tibble: 1 x 1
##        n
##
## 1 336776

With only a few hundred thousand rows, this example isn’t close to the kind of big data that really requires a Big Data strategy, but it’s rich enough to demonstrate on.

## Sample and Model

Let’s say I want to model whether flights will be delayed or not. This is a great problem to sample and model.

# Create is_delayed column in database
df <- df %>%
mutate(
# Create is_delayed column
is_delayed = arr_delay > 0,
# Get just hour (currently formatted so 6 pm = 1800)
hour = sched_dep_time / 100
) %>%
# Remove small carriers that make modeling difficult
filter(!is.na(is_delayed) & !carrier %in% c("OO", "HA"))

df %>% count(is_delayed)
## # A tibble: 2 x 2
##   is_delayed      n
##
## 1 FALSE      194078
## 2 TRUE       132897

These classes are reasonably well balanced, but since I’m going to be using logistic regression, I’m going to load a perfectly balanced sample of 40,000 data points.

For most databases, random sampling methods don’t work super smoothly with R, so I can’t use dplyr::sample_n or dplyr::sample_frac. I’ll have to be a little more manual.

set.seed(1028)

# Create a modeling dataset
df_mod <- df %>%
# Within each class
group_by(is_delayed) %>%
# Assign random rank (using random and row_number from postgres)
mutate(x = random() %>% row_number()) %>%
ungroup()

# Take first 20K for each class for training set
df_train <- df_mod %>%
filter(x <= 20000) %>%
collect()

# Take next 5K for test set
df_test <- df_mod %>%
filter(x > 20000 & x <= 25000) %>%
collect()

# Double check I sampled right
count(df_train, is_delayed)
count(df_test, is_delayed)
## # A tibble: 2 x 2
##   is_delayed     n
##
## 1 FALSE      20000
## 2 TRUE       20000
## # A tibble: 2 x 2
##   is_delayed     n
##
## 1 FALSE       5000
## 2 TRUE        5000

Now let’s build a model – let’s see if we can predict whether there will be a delay or not by the combination of the carrier, the month of the flight, and the time of day of the flight.

mod <- glm(is_delayed ~ carrier +
as.character(month) +
poly(sched_dep_time, 3),
family = "binomial",
data = df_train)

# Out-of-Sample AUROC
df_test$pred <- predict(mod, newdata = df_test)
auc <- suppressMessages(pROC::auc(df_test$is_delayed, df_test$pred))
auc
## Area under the curve: 0.6425

As you can see, this is not a great model and any modelers reading this will have many ideas of how to improve what I’ve done. But that wasn’t the point!

I built a model on a small subset of a big data set. Including sampling time, this took my laptop less than 10 seconds to run, making it easy to iterate quickly as I want to improve the model. After I’m happy with this model, I could pull down a larger sample or even the entire data set if it’s feasible, or do something with the model from the sample.

## Chunk and Pull

In this case, I want to build another model of on-time arrival, but I want to do it per-carrier. This is exactly the kind of use case that’s ideal for chunk and pull. I’m going to separately pull the data in by carrier and run the model on each carrier’s data.

I’m going to start by just getting the complete list of the carriers.

# Get all unique carriers
carriers <- df %>%
  select(carrier) %>%
  distinct() %>%
  pull(carrier)

Now, I’ll write a function that

• takes the name of a carrier as input
• pulls the data for that carrier into R
• splits the data into training and test
• trains the model
• outputs the out-of-sample AUROC (a common measure of model quality)

carrier_model <- function(carrier_name) {
  # Pull a chunk of data
  df_mod <- df %>%
    dplyr::filter(carrier == carrier_name) %>%
    collect()

  # Split into training and test
  split <- df_mod %>%
    rsample::initial_split(prop = 0.9, strata = "is_delayed") %>%
    suppressMessages()

  # Get training data
  df_train <- split %>% rsample::training()

  # Train model
  mod <- glm(is_delayed ~ as.character(month) + poly(sched_dep_time, 3),
             family = "binomial",
             data = df_train)

  # Get out-of-sample AUROC
  df_test <- split %>% rsample::testing()
  df_test$pred <- predict(mod, newdata = df_test)
  suppressMessages(auc <- pROC::auc(df_test$is_delayed ~ df_test$pred))

  auc
}

Now, I’m going to actually run the carrier model function across each of the carriers. This code runs pretty quickly, and so I don’t think the overhead of parallelization would be worth it. But if I wanted to, I would replace the lapply call below with a parallel backend.3

set.seed(98765)
mods <- lapply(carriers, carrier_model) %>%
suppressMessages()

names(mods) <- carriers
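
For reference, swapping an `lapply` call like the one above for a parallel backend could look roughly like the sketch below, which uses the `parallel` package that ships with R. The worker function `slow_square` is a hypothetical stand-in for `carrier_model`, and two workers is an arbitrary choice. `clusterSetRNGStream` gives each worker its own reproducible random number stream, which matters when the per-chunk work includes random train/test splits.

```r
library(parallel)

# Hypothetical stand-in for a per-chunk modeling function
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

inputs <- 1:8

# Serial version
res_serial <- lapply(inputs, slow_square)

# Parallel version on a small local cluster
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 98765)  # reproducible RNG streams per worker
res_parallel <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

identical(res_serial, res_parallel)
```

A real worker such as `carrier_model` would also need its packages and database connection available on each worker, which is part of the overhead mentioned above.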

Let’s look at the results.

mods
## $UA
## Area under the curve: 0.6408
##
## $AA
## Area under the curve: 0.6041
##
## $B6
## Area under the curve: 0.6475
##
## $DL
## Area under the curve: 0.6162
##
## $EV
## Area under the curve: 0.6419
##
## $MQ
## Area under the curve: 0.5973
##
## $US
## Area under the curve: 0.6096
##
## $WN
## Area under the curve: 0.6968
##
## $VX
## Area under the curve: 0.6969
##
## $FL
## Area under the curve: 0.6347
##
## $AS
## Area under the curve: 0.6906
##
## $9E
## Area under the curve: 0.6071
##
## $F9
## Area under the curve: 0.625
##
## $YV
## Area under the curve: 0.7029

So these models (again) are a little better than random chance. The point was that we used the chunk and pull strategy to pull the data separately by logical units and build a model on each chunk.

## Push Compute to the Data

In this case, I’m doing a pretty simple BI task – plotting the proportion of flights that are late by the hour of departure and the airline.

Just by way of comparison, let’s run this first the naive way – pulling all the data to my system and then doing my data manipulation to plot.

system.time(
df_plot <- df %>%
collect() %>%
# Change is_delayed to numeric
mutate(is_delayed = ifelse(is_delayed, 1, 0)) %>%
group_by(carrier, sched_dep_time) %>%
# Get proportion per carrier-time
summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %>%
ungroup() %>%
# Change string times into actual times
mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, "left", "0") %>%
strptime("%H%M") %>%
as.POSIXct())) -> timing1

Now that wasn’t too bad, just 2.366 seconds on my laptop.

But let’s see how much of a speedup we can get from pushing the compute to the data. The conceptual change here is significant: I’m doing as much work as possible on the Postgres server now instead of locally. But using dplyr means that the code change is minimal. The only difference in the code is that the collect call got moved down by a few lines (to below ungroup()).

system.time(
df_plot <- df %>%
# Change is_delayed to numeric
mutate(is_delayed = ifelse(is_delayed, 1, 0)) %>%
group_by(carrier, sched_dep_time) %>%
# Get proportion per carrier-time
summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %>%
ungroup() %>%
collect() %>%
# Change string times into actual times
mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, "left", "0") %>%
strptime("%H%M") %>%
as.POSIXct())) -> timing2

It might have taken you the same time to read this code as the last chunk, but this took only 0.269 seconds to run, almost an order of magnitude faster!4 That’s pretty good for just moving one line of code.

Now that we’ve done a speed comparison, we can create the nice plot we all came for.

    df_plot %>%
      mutate(carrier = paste0("Carrier: ", carrier)) %>%
      ggplot(aes(x = sched_dep_time, y = delay_pct)) +
      geom_line() +
      facet_wrap("carrier") +
      ylab("Proportion of Flights Delayed") +
      xlab("Time of Day") +
      scale_y_continuous(labels = scales::percent) +
      scale_x_datetime(date_breaks = "4 hours",
                       date_labels = "%H")

It looks to me like flights later in the day might be a little more likely to experience delays, but that’s a question for another blog post.

1. This isn’t just a general heuristic. You’ll probably remember that the error in many statistical estimates shrinks in proportion to $$\frac{1}{\sqrt{n}}$$ for sample size $$n$$, so a lot of the statistical power in your model is driven by adding the first few thousand observations compared to the final millions.

2. One of the biggest problems when parallelizing is dealing with random number generation, which you use here to make sure that your test/training splits are reproducible. It’s not an insurmountable problem, but requires some careful thought.

3. And lest you think the real difference here is offloading computation to a more powerful database, this Postgres instance is running on a container on my laptop, so it’s got exactly the same horsepower behind it.
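Footnote 1’s point about diminishing returns can be checked with a little arithmetic. The sketch below assumes the textbook case of a sample mean, whose standard error is sigma / sqrt(n); the value of `sigma` and the sample sizes are illustrative only.

```r
# Assumption: standard error of a sample mean, sigma / sqrt(n).
sigma <- 1                      # illustrative population standard deviation
n     <- c(1e3, 1e4, 1e6)       # illustrative sample sizes
se    <- sigma / sqrt(n)

# How much each step shrinks the standard error relative to the previous one
shrink_factor <- se[-length(se)] / se[-1]

data.frame(n = n, standard_error = se)
```

Going from a thousand to a million observations is a 1,000-fold increase in data, yet under this assumption it shrinks the standard error only by a factor of sqrt(1000), roughly 32.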

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Trump’s tweet divides Americans

Polls show Republicans mostly approve of the president’s racially charged remarks

## July 16, 2019

### R Packages worth a look

Detect Multiple Change Points from Time Series (offlineChange)
Detect the number and locations of change points. The locations can be either exact or in terms of ranges, depending on the available computational res …

Trend Estimation of Univariate and Bivariate Time Series with Controlled Smoothness (TSsmoothing)
It performs the smoothing approach provided by penalized least squares for univariate and bivariate time series, as proposed by Guerrero (2007) and Ger …

Hypothesis Testing Using the Overlapping Interval Estimates (intervcomp)
Performs hypothesis testing using the interval estimates (e.g., confidence intervals). The non-overlapping interval estimates indicates the statistical …

Create Beautiful, Customizable, Publication-Ready Summary Tables for Statistical Models (modelsummary)
Create beautiful, customizable, publication-ready summary tables for statistical models. ‘modelsummary’ leverages the power of the ‘gt’ and ‘broom’ pac …

### Magister Dixit

“Data literacy includes the ability to read, work with, analyze and argue with data.” R. Bhargava, C. D’Ignazio (2015)

### 101 Machine Learning Algorithms for Data Science with Cheat Sheets

(This article was first published on R Programming - Data Science Blog | AI, ML, big data analytics, and kindly contributed to R-bloggers)

Think of this as the one-stop-shop/dictionary/directory for your machine learning algorithms. The algorithms have been sorted into 9 groups: Anomaly Detection, Association Rule Learning, Classification, Clustering, Dimensional Reduction, Ensemble, Neural Networks, Regression, Regularization. In this post, you’ll find 101 machine learning algorithms, including useful infographics to help you know when to use each one (if available).

## 101 Machine Learning Algorithms

Each of the accordion drop-downs is embeddable if you want to take them with you. All you have to do is click the little ‘embed’ button in the lower left-hand corner and copy/paste the iframe. All we ask is that you link back to this post.

By the way, if you have trouble with Medium/TDS, just throw your browser into incognito mode.

### Classification Algorithms

Any of these classification algorithms can be used to build a model that predicts the outcome class for a given dataset. The datasets can come from a variety of domains. Depending upon the dimensionality of the dataset, the attribute types, sparsity, and missing values, etc., one algorithm might give better predictive accuracy than most others. Let’s briefly discuss these algorithms. (18)

### Regression Analysis

Regression Analysis is a statistical method for examining the relationship between two or more variables. There are many different types of Regression analysis, of which a few algorithms can be found below. (20)

### Neural Networks

A neural network is an artificial model based on the human brain. These systems learn tasks by example without being told any specific rules. (11)

### Anomaly Detection

Also known as outlier detection, anomaly detection is used to find rare occurrences or suspicious events in your data. The outliers typically point to a problem or rare event. (5)

### Dimensionality Reduction

With some problems, especially classification, there can be so many variables, or features, that it is difficult to visualize your data. Correlation amongst your features creates redundancies, and that’s where dimensionality reduction comes in. Dimensionality Reduction reduces the number of random variables you’re working with. (17)

### Ensemble

Ensemble learning methods are meta-algorithms that combine several machine learning methods into a single predictive model to increase the overall performance. (11)

### Clustering

In supervised learning, we know the labels of the data points and their distribution. However, the labels may not always be known. Clustering is the practice of assigning labels to unlabeled data using the patterns that exist in it. Clustering can either be semi parametric or probabilistic. (14)

### Association Rule Analysis

Association rule analysis is a technique to uncover how items are associated with each other. (2)

### Regularization

Regularization is used to prevent overfitting. Overfitting means that a machine learning algorithm has fit the training data so closely that it achieves high accuracy on that data but does not perform well on unseen data. (3)

### Scikit-Learn Algorithm Cheat Sheet

First and foremost is the Scikit-Learn cheat sheet. If you click the image, you’ll be taken to the same graphic except it will be interactive. We suggest saving this site as it makes remembering the algorithms, and when best to use them, incredibly simple and easy.

### SAS: The Machine Learning Algorithm Cheat Sheet

Many of the same algorithms from the cheat sheet above also appear on SAS’s machine learning cheat sheet. The SAS website (click the pic) also gives great descriptions of how, when, and why to use each algorithm.

### Microsoft Azure Machine Learning: Algorithm Cheat Sheet

Microsoft Azure’s cheat sheet is the simplest cheat sheet by far. Even though it is simple, Microsoft was still able to pack a ton of information into it. Microsoft also made their algorithm sheet available to download.

There you have it, 101 machine learning algorithms with cheat sheets, descriptions, and tutorials! We hope you are able to make good use of this list. If there are any algorithms that you think should be added, go ahead and leave a comment with the algorithm and a link to a tutorial. Thanks!

### Registration now open for the 2019 Probability and Programming Research Workshop, Big Code Summit, and PLEMM

To be held at the W Bellevue Hotel (Washington), the annual three-day event brings together top experts in the areas of programming languages, software engineering, and machine learning. The event specifically aims to encourage collaboration and exchange of ideas between academia and industry, and across companies.

Starting on Tuesday, September 17, and new to this year’s function, is a full-day workshop including the winners of the Probability and Programming research awards. The awardees will present the 10 winning proposals, field questions, and provide insight into their work on the fundamental problems at the intersection of machine learning, programming languages, and software engineering.

Day two (Wednesday, September 18) will be the second annual occurrence of the Big Code Summit. Just as machine learning has pervaded almost every major industry, it is poised to make a significant impact on how we develop software. Like the inaugural summit in London, the talks and discussion in this event will focus on applying techniques from machine learning to building innovative developer tools.

In its third annual occurrence, PLEMM, Programming Languages Enthusiasts Mind Melt, will be held on Thursday, September 19. The conference will cover a number of trending topics in programming languages such as the design of high-performance virtual machines for dynamic programming languages, parallel compilers, safety of C++ programs, program analysis, and tooling. PLEMM will bring together a broad range of language, systems, hardware, and compiler experts to exchange ideas, aiming to bridge the usual gaps between academia and industry, and between people in different disciplines.

### If you did not already know

Summarized
Domains such as scientific workflows and business processes exhibit data models with complex relationships between objects. This relationship is typically represented as sequences, where each data item is annotated with multi-dimensional attributes. There is a need to analyze this data for operational insights. For example, in business processes, users are interested in clustering process traces into smaller subsets to discover less complex process models. This requires expensive computation of similarity metrics between sequence-based data. Related work on dimension reduction and embedding methods does not take into account the multi-dimensional attributes of data, and does not address the interpretability of data in the embedding space (i.e., by favoring vector-based representation). In this work, we introduce Summarized, a framework for efficient analysis on sequence-based multi-dimensional data using intuitive and user-controlled summarizations. We introduce summarization schemes that provide tunable trade-offs between the quality and efficiency of analysis tasks and derive an error model for summary-based similarity under an edit-distance constraint. Evaluations using real-world datasets show the effectiveness of our framework. …

Teaching Risk
Learning near-optimal behaviour from an expert’s demonstrations typically relies on the assumption that the learner knows the features that the true reward function depends on. In this paper, we study the problem of learning from demonstrations in the setting where this is not the case, i.e., where there is a mismatch between the worldviews of the learner and the expert. We introduce a natural quantity, the teaching risk, which measures the potential suboptimality of policies that look optimal to the learner in this setting. We show that bounds on the teaching risk guarantee that the learner is able to find a near-optimal policy using standard algorithms based on inverse reinforcement learning. Based on these findings, we suggest a teaching scheme in which the expert can decrease the teaching risk by updating the learner’s worldview, and thus ultimately enable her to find a near-optimal policy. …

ADNet
Online video advertising gives content providers the ability to deliver compelling content, reach a growing audience, and generate additional revenue from online media. Recently, advertising strategies are designed to look for original advert(s) in a video frame and replace them with new adverts. These strategies, popularly known as product placement or embedded marketing, greatly help the marketing agencies to reach out to a wider audience. However, in the existing literature, such detection of candidate frames in a video sequence for the purpose of advert integration is done manually. In this paper, we propose a deep-learning architecture called ADNet that automatically detects the presence of advertisements in video frames. Our approach is the first of its kind that automatically detects the presence of adverts in a video frame, and achieves state-of-the-art results on a public dataset. …

### Demystifying Data Science: Free Online Conference July 30-31

On Jul 30-31, join 22 speakers giving 16 talks and 6 workshops during Demystifying Data Science, a FREE two-day live online conference hosted by Metis, a leader in data science education.

### Infectious disease fatalities rise in Australia as overall death rate falls

Australia’s death rate is now the lowest it has ever been, but some specific causes of death are on the rise

The rate of death in Australia is now the lowest it has ever been, but some specific causes of death are on the rise, according to new data – and experts say we could be doing more to fix this.

Death rates in Australia have been falling for a while, and the latest figures, released on Wednesday by the Australian Institute of Health and Welfare (AIHW), are no different.

### Announcing Databricks Runtime 5.5 and Runtime 5.5 for Machine Learning

Databricks is pleased to announce the release of Databricks Runtime 5.5. This release includes Apache Spark 2.4.3 along with several important improvements and bug fixes as noted in the latest release notes [Azure|AWS]. We recommend all users upgrade to take advantage of this new runtime release. This blog post gives a brief overview of some of the new high-value features that improve performance, compatibility, and manageability and simplify machine learning on Databricks.

## Faster Cluster Launches with Instance Pools – public preview

In Databricks Runtime 5.5 we are previewing a feature called Instance Pools, which significantly reduces the time it takes to launch a Databricks cluster. Today, launching a new cluster requires acquiring virtual machines from your cloud provider, which can take up to several minutes. With Instance Pools, you can hold back a set of virtual machines so they can be used to rapidly launch new clusters. You pay only cloud provider infrastructure costs while virtual machines are not being used in a Databricks cluster, and pools can scale down to zero instances, avoiding costs entirely when there are no workloads.

## Presto and Amazon Athena Compatibility with Delta Lake – public preview on AWS

As of Databricks Runtime 5.5, you can make Delta Lake tables available for querying from Presto and Amazon Athena. These tables can be queried just like tables with data stored in formats like Parquet. This feature is implemented using manifest files. When an external table is defined in the Hive metastore using manifest files, Presto and Amazon Athena use the list of files in the manifest rather than finding the files by directory listing.

## AWS Glue as the Databricks Metastore – generally available

We’ve partnered with Amazon Web Services to bring AWS Glue to Databricks. Databricks Runtime can now use AWS Glue as a drop-in replacement for the Hive metastore. For further information, see Using AWS Glue Data Catalog as the Metastore for Databricks Runtime.

## DBFS FUSE v2 – private preview

The Databricks Filesystem (DBFS) is a layer on top of cloud storage that abstracts away peculiarities of underlying cloud storage providers. The existing DBFS FUSE client lets processes access DBFS using local filesystem APIs. However, it was designed mainly for convenience rather than performance. We introduced high-performance FUSE storage at location file:/dbfs/ml for Azure in Databricks Runtime 5.3 and for AWS in Databricks Runtime 5.4. DBFS FUSE v2 expands the improved performance from dbfs:/ml to all DBFS locations including mounts. The feature is in private preview; to try it, contact Databricks support.

## Secrets API in R notebooks

The Databricks Secrets API [Azure|AWS] lets you inject secrets into notebooks without hardcoding them. As of Databricks Runtime 5.5, this API is available in R notebooks in addition to existing support for Python and Scala notebooks. You can use the dbutils.secrets.get function to obtain secrets. Secrets are redacted before printing to a notebook cell.

## Plan to drop Python 2 support in Databricks Runtime 6.0

Python 2 is coming to the end of life in 2020. Many popular projects have announced they will cease supporting Python 2 on or before 2020, including a recent announcement for Spark 3.0. We have considered our customer base and plan to drop Python 2 support starting with Databricks Runtime 6.0, which is due to release later in 2019.

Databricks Runtime 6.0 and newer versions will support only Python 3. Databricks Runtime 4.x and 5.x will continue to support both Python 2 and 3. In addition, we plan to offer long-term support (LTS) for the last release of Databricks Runtime 5.x. You can continue to run Python 2 code in the LTS Databricks Runtime 5.x. We will soon announce which Databricks Runtime 5.x will be LTS.

## Enhancements to Databricks Runtime for Machine Learning

With Databricks Runtime 5.5 for Machine Learning, we have made major package upgrades including:

• Added MLflow 1.0 Python package
• TensorFlow upgraded from 1.12.0 to 1.13.1
• PyTorch upgraded from 0.4.1 to 1.1.0
• scikit-learn upgraded from 0.19.1 to 0.20.3

### Single-node multi-GPU operation for HorovodRunner

We enabled HorovodRunner to utilize multi-GPU driver-only clusters. Previously, to use multiple GPUs, HorovodRunner users would have to spin up a driver and at least one worker. With this change, customers can now distribute training within a single node (i.e. a multi-GPU node) and thus use compute resources more efficiently. HorovodRunner is available only in Databricks Runtime for ML.

### Faster model inference pipelines with improved binary file data source and scalar iterator Pandas UDF – public preview

Machine learning tasks, especially in the image and video domain, often have to operate on a large number of files. In Databricks Runtime 5.4, we made available the binary file data source to help ETL arbitrary files such as images into Spark tables. In Databricks Runtime 5.5, we have added an option, recursiveFileLookup, to load files recursively from nested input directories. See binary file data source [Azure|AWS].

The binary file data source enables you to run model inference tasks in parallel from Spark tables using a scalar Pandas UDF. However, you might have to initialize the model for every record batch, which introduces overhead. In Databricks Runtime 5.5, we are backporting a new Pandas UDF type called “scalar iterator” from Apache Spark master. With it you can initialize the model only once and apply the model to many input batches, which can result in a 2-3x speedup for models like ResNet50. See Scalar Iterator UDFs [Azure|AWS].

--

The post Announcing Databricks Runtime 5.5 and Runtime 5.5 for Machine Learning appeared first on Databricks.

### How to Build Disruptive Data Science Teams: 10 Best Practices

Building a data science team from the ground up isn't easy. This strategic roadmap offers hiring managers tactical advice on building a team and on properly supporting it once established.

### Four short links: 16 July 2019

Quantum TiqTaqToe, Social Media and Depression, Incidents, and Unity ML

1. Introducing a new game: Quantum TiqTaqToe -- This experience was essential to the birth of Quantum TiqTaqToe. In my quest to understand Unity and Quantum Games, I set out to implement a “simple” game to get a handle on how all the different game components worked together. Having a game based on quantum mechanics is one thing; making sure it is fun to play requires an entirely different skill set.
2. Association of Screen Time and Depression in Adolescence (JAMA) -- Time-varying associations between social media, television, and depression were found, which appeared to be more explained by upward social comparison and reinforcing spirals hypotheses than by the displacement hypothesis. (via Slashdot)
3. CAST Handbook -- How to learn more from incidents and accidents.
4. ML-Agents -- Unity Machine Learning Agents Toolkit, open source.

### Triple GM Abhishek Thakur Answers Qs from the Kaggle Community

Last week we crowned the world’s first-ever Triple Grandmaster, Abhishek Thakur. In a video interview with Kaggle data scientist Walter Reade, Abhishek answered our burning questions about who he is, what inspires him to compete, and what advice he would give to others. If you missed the video interview, take a listen.

See below for Abhishek's off the cuff responses to select Twitter questions. Have something more you want to know? Leave a comment on this post, or tweet him @abhi1thakur

Here's what YOU wanted to know...

I used to read and implement quite a lot of papers during my master's degree and then during my unfinished PhD. After that, I decided to join the industry and thus I read papers relevant to the industry I am working in. Sometimes I also read papers I come across on Reddit and Twitter and also Kaggle. Recently, I have read papers on XLNet and BERT.

As for my favorite tools, Python is my bread and butter. I love scikit-learn, XGBoost, Keras, TensorFlow and PyTorch.

It’s very difficult to find the time when you are working. Here's what I do: I wake up early and work 1-2 hours on a Kaggle problem before work each day. I try my best to start a model and have written scripts that will do K-Fold training automatically. I also have some scripts that automate submissions. When I’m back home from work, these models finish and I can work on post-processing or new models.

A few hours every day if you are a student. If you are working, maybe an hour or two a day. You can invest a few more hours over the weekends. Rather than investing time, it’s more about understanding the problem statement. I suggest writing down a few different approaches to try.

It’s also a very good idea to read the discussion forums as a lot of ideas are shared there. If you're just starting with Kaggle, you also might want to take a look at past competitions and learn how the winners approached the problem. From there you can try to implement them on your own without looking at the code.

Every competition brings its own challenges and there is something new to learn from each one. For example, an image segmentation competition can be started by approaches like U-Net or Mask R-CNN. In a given image segmentation problem, one approach might outperform the other. So, you have to know which approach will work best in different scenarios and that can only be done when you have worked on several image segmentation problems.

Same with tabular data competitions. You can get numerical variables or a mix of numerical and categorical variables. If you have experience with these, you will know right away which approach works well and which models you can start without a lot of processing on the dataset.

So, yes, the process becomes smoother with every competition you try. The more competitions you participate in, the more you learn. Once you have a lot of scripts and functions that you can re-use, you can just automate everything (well, most of the things).

One of the most difficult challenges I worked on was the Stumbleupon Evergreen Classification Challenge. Now, if you look at that competition, you might not even find it challenging. At that time though, I had no clue about NLP and the tools and libraries we have available today to process text data and clean HTML.

Another tough one for me was the Amazon Employee Access Challenge. Here, we were given categorical data, which was again very new to me. Any time there's something in the data that you have less knowledge about or don’t know about at all, it can be challenging. The only way to avoid this is to learn the different approaches, and practice, practice, practice.

Check out Andrew Ng’s courses on Coursera. He explains everything in the simplest manner possible. I think you would need some basic mathematics background which you might have already and if not, I suggest working a little bit with algebra, some basic calculus, and probabilities. The only way to learn is to solve some problems. When you have an idea about how the problems are being solved, dig more into the algorithms and see what happens in the background.

One of the best things I've learned is to never give up. When starting in any field, you will fail several times before you succeed. And if you give up after failing you might not succeed at all. Another important thing I've learned is how to work on a team: how to manage time and divide tasks when working on the same problem. I also learned a lot about preprocessing and post-processing of data, different types of machine learning models, cross-validation techniques and how to improve on a given metric without compromising on the training or inference time.

### shinymeta — a revolution for reproducibility

(This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)

Joe Cheng presented shinymeta enabling reproducibility in shiny at useR in July 2019. I am really thankful for this. This article shows a…

### Things I Have Learned About Data Science

Read this collection of 38 things the author has learned along his travels, and has opted to share for the benefit of the reader.

### Update on keeping Mechanical Turk responses trustworthy

This topic has come up before . . . Now there’s a new paper by Douglas Ahler, Carolyn Roush, and Gaurav Sood, who write:

Amazon’s Mechanical Turk has rejuvenated the social sciences, dramatically reducing the cost and inconvenience of collecting original data. Recently, however, researchers have raised concerns about the presence of “non-respondents” (bots) or non-serious respondents on the platform. Spurred by these concerns, we fielded an original survey on MTurk to measure response quality. While we find no evidence of a “bot epidemic,” we do find that a significant portion of survey respondents engaged in suspicious behavior. About 20% of respondents either circumvented location requirements or took the survey multiple times. In addition, at least 5-7% of participants likely engaged in “trolling” or satisficing. Altogether, we find about a quarter of data collected on MTurk is potentially untrustworthy. Expectedly, we find response quality impacts experimental treatments. On average, low quality responses attenuate treatment effects by approximately 9%. We conclude by providing recommendations for collecting data on MTurk.

And here are the promised recommendations:

• Use geolocation filters on survey platforms like Qualtrics to enforce any geographic restrictions.

• Make use of tools on survey platforms to retrieve IP addresses. Run each IP through Know Your IP to identify blacklisted IPs and multiple responses originating from the same IP.

• Include questions to detect trolling and satisficing but do not copy and paste from a standard canon as that makes “gaming the survey” easier.

• Increase the time between Human intelligence task (HIT) completion and auto-approval so that you can assess your data for untrustworthy responses before approving or rejecting the HIT.

• Rather than withhold payments, a better policy may be to incentivize workers by giving them a bonus when their responses pass quality filters.

• Be mindful of compensation rates. While unusually stingy wages will lead to slow data collection times and potentially less effort by Workers, unusually high wages may give rise to adverse selection—especially because HITs are shared on Turkopticon, etc. soon after posting. . . Social scientists who conduct research on MTurk should stay apprised of the current “fair wage” on MTurk and adhere accordingly.

• Use Worker qualifications on MTurk and filter to include only Workers who have a high percentage of approved HITs into your sample.
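The duplicate-response check in the second recommendation is easy to sketch in base R. The example below is an illustration only: the `responses` data frame, its column names, and the IP addresses (taken from documentation-reserved ranges) are made up, and it does not call Know Your IP or any survey platform API.

```r
# Hypothetical survey export: one row per completed response.
responses <- data.frame(
  worker_id = c("W1", "W2", "W3", "W4", "W5"),
  ip        = c("192.0.2.10", "198.51.100.7", "192.0.2.10",
                "203.0.113.5", "198.51.100.7"),
  stringsAsFactors = FALSE
)

# Flag every response whose IP address appears more than once.
# duplicated() alone misses the first occurrence, so scan from both ends.
responses$shared_ip <- duplicated(responses$ip) |
  duplicated(responses$ip, fromLast = TRUE)

# Responses to review before the HIT is auto-approved.
flagged <- responses[responses$shared_ip, ]
flagged
```

A shared IP is a reason to look closer, not proof of a repeat respondent, since several people can sit behind one address.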

They also say they do not think that the problem is limited to MTurk.

I haven’t tried to evaluate all these claims myself, but I thought I’d share it all with those of you who are using this tool in your research. (Or maybe some of you are MTurk bots; who knows what will be the effect of posting this material here.)

From my end, “random” error is mostly a non-issue in this context. People don’t use M-Turk to produce generalizable estimates; hardly anyone post-stratifies, for instance. Most people use it to say they did something. I suppose it is a good way to ‘fail fast.’ (The downside is that most failures probably don’t see the light of day.) And if people wanted to buy stat. sig., bulking up on n is easily and cheaply done; it is the raison d’etre of MTurk in some ways.

So what is the point of the article? Twofold, perhaps. First is that it is good to parcel out measurement error where we can. And the second point is about how we build a system where the long-term prognosis is not simply noise. And what stood out for me from the data was just the sheer scale of plausibly cheeky behavior. I did not anticipate that.

### the accidental misdirect

A friend of mine, Mark Bradbourne, recently posted a picture to Twitter showing a bar chart that his local utility company included in his most recent bill. He entitled the picture “Let’s spot the issue!”

So as to protect the utility company in question, I’ve recreated the chart below, as faithfully as possible. (There are, of course, many changes I would make in order to render this a storytelling with data-esque visualization, but for the purposes of this discussion it’s important that you see the chart as close to its original, “true” form as possible.)

The chart from Mark’s utility bill, recreated from the original photograph as posted on Twitter.

The internet immediately latched onto the seemingly absurd collection of months portrayed in this chart. The bill, dating from June of 2019, appeared to include 13 prior months of usage, from as early as August of 2016 to as recently as March of 2019, in a random order.

Soon, our non-U.S.-based friends pointed out that the dates made even less sense to them, as (of course) their convention is not to show dates in MM/YY format, but in YY/MM format.

And with this, the truth of the matter became obvious: the dates were in neither MM/YY format nor YY/MM format; they were in MM/DD format, and excluded labeling the year entirely.

Whenever we run across these kinds of so-called “chart fails,” it helps to keep in mind that whoever created the chart wasn’t setting out to be confusing or deceptive. The utility company clearly wanted its customers to be aware of their recent usage, and went so far as to show that usage in a visual format so that it would be more accessible.

The danger, though, is in the assumptions we make when we are the ones creating the chart. Specifically, in this case, there were likely assumptions made about how much information needed to be made explicit versus how much could be assumed.

The energy company likely thought:

The chart says that it’s showing monthly usage; and, since it shows 13 bars, the homeowner will know, or at least assume, that the bars represent the last 13 months in chronological order.

And in general, yes: that is what our first assumptions would be, if there had been no labels whatsoever.

In this case, the company chose to label the bars with a MM/DD convention, excluding the year, probably to denote what specific day the meter was last read, or on what specific day the last water bill was issued. But we very rarely see dates in MM/DD format when they cut across two different years. We’re trained to read labels in the style of XX/YY as months and years, not months and days. To interpret the chart correctly, we would have had to ignore and resist our personal experience with this convention.

So on the one hand, logic told us that the chart showed the last 13 months; on the other hand, our experience and the direct labels told us that it was mistakenly showing us 13 random months. What other elements of the chart, or other design choices, could have nudged us towards one of these interpretations over the other?

Perhaps if the chart had been a line chart rather than a bar chart, we would have been nudged into thinking that the data was being shown over a continuous period of time; this could have been enough to make the chart more easily interpreted.

The original chart recreated as a line, rather than a bar.

Or, if the labels had used abbreviations for the months, rather than numbers, we almost certainly would have seen the orderly progression of months more clearly.

The original bar chart, but with the months on the horizontal axis labels shown with three-letter abbreviations instead of numbers.

Another solution, one which would have almost certainly eliminated all confusion, would have been to include the actual year in the labels, or as super-categories below the existing labels.
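The difference between the two labeling schemes can be sketched in a few lines of Python. The meter-read dates are made up for illustration (13 monthly readings on the 19th, ending in June 2019) and are not taken from the actual bill; `month_sequence` and `MONTH_NAMES` are hypothetical helpers.

```python
from datetime import date

MONTH_NAMES = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def month_sequence(start_year, start_month, count):
    """Return `count` consecutive (year, month) pairs from the given start."""
    pairs = []
    year, month = start_year, start_month
    for _ in range(count):
        pairs.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return pairs


# Hypothetical readings: the 19th of each month, June 2018 to June 2019.
read_dates = [date(y, m, 19) for y, m in month_sequence(2018, 6, 13)]

# MM/DD labels, year omitted (the scheme that caused the confusion).
without_year = [d.strftime("%m/%d") for d in read_dates]

# Month abbreviation plus explicit year.
with_year = [f"{MONTH_NAMES[d.month - 1]} {d.year}" for d in read_dates]

print(without_year[0], "->", without_year[-1])
print(with_year[0], "->", with_year[-1])
```

In this made-up series the first and last MM/DD labels are identical and both read like "June 2019", while the labels that carry the year cannot be mistaken for one another.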

With super-categories for the years along the horizontal axis, confusion is likely minimized.

We could also ask the question: Do we need to be so precise with our X axis labels that the specific day of the month is shown at all?

It doesn’t seem like it; especially considering that the data on the Y axis has most likely been rounded off, and is presented to the audience at a very general level.

Look at the level of granularity on the Y axis; although it ranges from 0.1 to 0.7 (in 1000s of units), every bar is shown at an exact increment of 0.1. It’s unlikely that a homeowner’s actual monthly utility usage is always an exact multiple of 100.

In this case, the labeling of the specific date on the X axis implies a specificity of data that the Y axis does not support.

Bar chart with more consistency of specificity between the horizontal and vertical axes.

The bottom line, though, is that the creator of the chart made assumptions about what they needed to show versus what they could exclude; and in making those assumptions, they inadvertently misled their audience in a manner that was very confusing.

It is important to focus your audience’s attention on your data in your visualizations, and to remove extraneous clutter and distracting elements—including redundant information in labels. This case, however, highlights the danger of taking your assumptions too far, and inadvertently adding confusion rather than clarity.

Sometimes we get so familiar with our own work, and our own data, that we lose track of what is, or isn’t, obvious to other people. During your design process, it can be valuable to get input from people who aren’t as close to your work. This helps to identify, and avoid, situations like this one, where familiarity with the data led to design choices that were confusing, rather than clarifying.

Putting yourself in the mind of your audience, and soliciting feedback from other people who aren’t as close to your subject, will help you to avoid these kinds of misunderstandings in your own work.

Mike Cisneros is a Data Storyteller on the SWD team. He believes that everybody has a story to tell, and he is driven to find ways to help people get their data stories heard. Connect with Mike on LinkedIn or Twitter.

### Dealing with categorical features in machine learning

Many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.

### How the American Work Day Changed in 15 Years

The American Time Use Survey recently released results for 2018. That makes 15 years of data. What's different? What's the same?

### Building a Calculator Jupyter Kernel

A step-by-step guide for authoring language kernels with Xeus

In order to provide a language-agnostic scientific development environment, the Jupyter project is built upon a well-specified protocol to communicate with the Kernel, the part of the infrastructure responsible for executing the code.

For a programming language to leverage the potential of the Jupyter ecosystem, such as JupyterHub, JupyterLab, and interactive widgets, all that is needed is a kernel for that language, that is, an executable implementing the specified inter-process communication. Dozens of kernels have already been implemented, bringing Jupyter to many programming languages.

We are completing our engineering degree and interning at QuantStack. We recently attended the Jupyter Community Workshop on the kernel protocol that took place in Paris in late May. On this occasion, we set out to write a new Jupyter kernel.

Today, we are proud to announce the first release of xeus-calc, a calculator kernel for Jupyter! xeus-calc is meant to serve as a minimal, self-contained example of a Jupyter kernel. It is built upon the xeus project, a modern C++ implementation of the protocol. This article is a step-by-step description of how the kernel was implemented.

You may find this post especially useful if you are creating a new programming language and you want it to work in Jupyter from the start.

### Xeus

Implementing the Jupyter kernel protocol from scratch may be a tedious and difficult task. One needs to deal with ZMQ sockets and complex concurrency issues, and rely on third-party libraries for cryptographically signing messages or parsing JSON efficiently. This is where the xeus project comes into play: it takes on all of that burden so that developers can focus on the parts that are specific to their use case.

In the end, the kernel author only needs to implement a small number of virtual functions inherited from the xinterpreter class.

    #include "xeus/xinterpreter.hpp"
    #include "nlohmann/json.hpp"

    using xeus::xinterpreter;
    namespace nl = nlohmann;

    namespace custom
    {
        class custom_interpreter : public xinterpreter
        {
        public:

            custom_interpreter() = default;
            virtual ~custom_interpreter() = default;

        private:

            void configure() override;

            nl::json execute_request_impl(int execution_counter,
                                          const std::string& code,
                                          bool silent,
                                          bool store_history,
                                          const nl::json::node_type* user_expressions,
                                          bool allow_stdin) override;

            nl::json complete_request_impl(const std::string& code,
                                           int cursor_pos) override;

            nl::json inspect_request_impl(const std::string& code,
                                          int cursor_pos,
                                          int detail_level) override;

            nl::json is_complete_request_impl(const std::string& code) override;

            nl::json kernel_info_request_impl() override;
        };
    }

Typically, a kernel author will make use of the C or C++ API of the target programming language and embed the interpreter into the application.

This differs from the wrapper kernel approach documented in the ipykernel package where kernel authors make use of the kernel protocol implementation of ipykernel, typically spawning a separate process for the interpreter and capturing its standard output.

Jupyter kernels based on xeus include:

• xeus-cling: a C++ kernel built upon the cling C++ interpreter from CERN
• xeus-python: a new Python kernel for Jupyter.
• JuniperKernel: a new R kernel for Jupyter based on xeus.

In this post, instead of calling into the API of an external interpreter, we implement the internal logic of the calculator in the kernel itself.

#### A calculator project

First, to implement your own Jupyter kernel, you should install Xeus. You can either download it with conda, or install it from sources as detailed in the readme.

Now that the installation is out of the way, let’s focus on the implementation itself.

Recall that the main class for the calculator kernel must inherit from the xinterpreter class so that Xeus can correctly route the messages received from the front-end.

This class defines the behavior of the kernel for each message type that is received from the front-end.

• kernel_info_request_impl: returns the information about the kernel, such as the name, the version or even a “banner”, that is a message that is prompted to console clients upon launch. This is a good place to be creative with ASCII art.
• complete_request_impl: checks whether the code can be completed (meaning semantic completion) and makes a suggestion accordingly. This way the user receives a proposed completion for the code they are currently writing. We did not use it in our implementation; as you will see later, it is safe to return a JSON object containing only a status value if you do not want to handle completion.
• is_complete_request_impl: whether the submitted code is complete and ready for evaluation. For example, if brackets are not all closed, there is probably more to be typed. This message is not used by the notebook front-end but is required for the console, which shows a continuation prompt for further input if it is deemed incomplete. It also checks whether the code is valid or not. Since the calculator expects single-line inputs, it is safe to return an empty JSON object. This may be refined in the future.
• inspect_request_impl: concerns documentation. It inspects the code to show useful information to the user. We did not use it in our case and went with the default implementation (that is to return an empty JSON object).
• execute_request_impl: the main function. An execute_request message is sent by the front-end to ask the kernel to execute the code on behalf of the user. In the case of the calculator, this means parsing the mathematical expression, evaluating it and returning the result, as described in the next section.
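To make the is_complete_request idea from the list above more concrete, here is a minimal Python sketch of a bracket check. It is not what xeus-calc does (as noted, the kernel simply returns an empty JSON object); `is_complete_status` is a hypothetical helper, and the returned dictionaries only mimic the `status` and `indent` fields of the Jupyter `is_complete_reply` message.

```python
def is_complete_status(code):
    """Classify a calculator input by checking that parentheses are balanced."""
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis without a matching opening one.
                return {"status": "invalid"}
    if depth > 0:
        # Some parentheses are still open: the console should keep prompting.
        return {"status": "incomplete", "indent": ""}
    return {"status": "complete"}


print(is_complete_status("(1 + 2) * 3"))
print(is_complete_status("(1 + 2"))
print(is_complete_status("1 + 2)"))
```

A console front-end would show a continuation prompt for the second input and could reject the third one right away.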

#### Implementation of the calculator

First things first, we need to find a way to parse mathematical expressions. To do so, we turn the user input into Reverse Polish Notation (or RPN), a name full of meaning for the wisest among our readers (or at least the oldest) who used RPN calculators in high school.

RPN, also called postfix notation, presents the mathematical expression in a specific way: the operands go first, followed by the operator. The main advantage of this notation is that it encodes the precedence of operators implicitly, without any need for parentheses.

The main logic of the calculator is provided by two main functions dealing respectively with parsing and evaluating the user expression and a third one for handling spaces in the expression.

First we have the parsing function (parse_rpn) transforming the expression into this representation. For this purpose we implement the Shunting-yard algorithm.

It is based on the use of a stack data structure to change the order of the elements in the expression, depending on their type: operator, operand or parenthesis.
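For readers who want to see the algorithm in action, here is a minimal Python sketch of the Shunting-yard idea. It is not the xeus-calc source (which is written in C++); the names `tokenize` and `to_rpn` are hypothetical, and the sketch assumes only the four binary operators, non-negative numbers and parentheses (no unary minus).

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def tokenize(expr):
    """Split an infix expression into numbers, operators and parentheses."""
    tokens, number = [], ""
    for ch in expr:
        if ch.isdigit() or ch == ".":
            number += ch
            continue
        if number:
            tokens.append(number)
            number = ""
        if ch in PRECEDENCE or ch in ("(", ")"):
            tokens.append(ch)
        elif not ch.isspace():
            raise ValueError(f"unexpected character: {ch!r}")
    if number:
        tokens.append(number)
    return tokens


def to_rpn(expr):
    """Convert an infix expression to a space-delimited RPN string."""
    output, stack = [], []
    for tok in tokenize(expr):
        if tok in PRECEDENCE:
            # Left-associative operators: pop everything that binds at least as tightly.
            while (stack and stack[-1] in PRECEDENCE
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            # Pop back to the matching opening parenthesis.
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("mismatched parentheses")
            stack.pop()
        else:
            output.append(tok)  # operands go straight to the output
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("mismatched parentheses")
        output.append(op)
    return " ".join(output)


print(to_rpn("3 + 4 * 2"))    # 3 4 2 * +
print(to_rpn("(1 + 2) * 3"))  # 1 2 + 3 *
```

Note how the parentheses disappear from the output: the order of the tokens alone carries the precedence.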

Now that we have the expression turned into RPN (with spaces delimiting operands and operators) we need to do the computation. For this purpose we have the function compute_rpn. Its implementation is based on a loop through a stringstream (hence the need for space delimiters) which performs operations in the right order.
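The evaluation step can be sketched in Python as follows. Again this is an illustration under assumptions rather than the actual compute_rpn implementation: `evaluate_rpn` is a hypothetical name, and the input is assumed to be a space-delimited RPN string containing only numbers and the four binary operators.

```python
import operator

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate_rpn(rpn):
    """Evaluate a space-delimited RPN string with a stack of operands."""
    stack = []
    for token in rpn.split():
        if token in OPERATORS:
            if len(stack) < 2:
                raise ValueError("not enough operands")
            right = stack.pop()  # the operand pushed last is the right-hand side
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


print(evaluate_rpn("3 4 2 * +"))  # 11.0
print(evaluate_rpn("1 2 + 3 *"))  # 9.0
```

Because every token is separated by a space, a plain split is enough to walk through the expression, which is the same reason the C++ version relies on space delimiters for its stringstream.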

Note that the result is not returned as an execute_reply message but is sent on a broadcasting channel instead, so that other clients of the kernel can also see it. The function execute_request_impl actually returns the status of the execution only, as you may see in the code below.

    nl::json interpreter::execute_request_impl(int execution_counter,
                                               const std::string& code,
                                               bool /*silent*/,
                                               bool /*store_history*/,
                                               nl::json /*user_exprs*/,
                                               bool /*allow_stdin*/)
    {
        nl::json pub_data;
        std::string result = "Result = ";
        auto publish = [this](const std::string& name, const std::string& text) {
            this->publish_stream(name, text);
        };
        try
        {
            std::string spaced_code = formating_expr(code);
            result += std::to_string(compute_rpn(parse_rpn(spaced_code, publish), publish));
            pub_data["text/plain"] = result;
            publish_execution_result(execution_counter, std::move(pub_data), nl::json::object());
            nl::json jresult;
            jresult["status"] = "ok";
            jresult["payload"] = nl::json::array();
            jresult["user_expressions"] = nl::json::object();
            return jresult;
        }
        catch (const std::runtime_error& err)
        {
            nl::json jresult;
            publish_stream("stderr", err.what());
            jresult["status"] = "error";
            return jresult;
        }
    }

And that’s it for our calculator! It is as simple as that.

Yet remember that Xeus is a library, not a kernel by itself. We still have to create an executable that gathers the interpreter and the library. This is done in a main function whose implementation looks like:

    int main(int argc, char* argv[])
    {
        // Load configuration file
        std::string file_name = (argc == 1) ? "connection.json" : argv[2];
        xeus::xconfiguration config = xeus::load_configuration(file_name);

        // Create interpreter instance
        using interpreter_ptr = std::unique_ptr<xeus_calc::interpreter>;
        interpreter_ptr interpreter = std::make_unique<xeus_calc::interpreter>();

        // Create kernel instance and start it
        xeus::xkernel kernel(config, xeus::get_user_name(), std::move(interpreter));
        kernel.start();

        return 0;
    }

First, we need to load the configuration file. To do so, we check if one was passed as an argument, otherwise, we look for the connection.json file.

Then, we instantiate the interpreter that we previously set up. Finally, we can create the kernel with all that we defined beforehand. The kernel constructor accepts more parameters that allow customizing some predefined behaviors. You can find more details in the Xeus documentation. Start the kernel and we are good to go!

Now that everything is set, we can test out our homemade calculator kernel.

As you can see in the demonstration below, the code displays step by step how the computation is done with RPN. This is done with publish_stream statements, which are the Jupyter notebook equivalent of std::cout and very useful for debugging purposes.

You should now have all the information you need to implement your own Jupyter kernel. As you noticed, the Xeus library makes this task quite simple. All that you have to do is to inherit from the xinterpreter virtual class and implement the functions related to the messaging protocol. Nothing more is required.

This project can be found on GitHub. Feel free to contribute to the project if you wish to improve it, keeping in mind that xeus-calc should remain lean and simple!

Note that the current implementation only supports arithmetic operators. However, it can easily be extended, and we may add support for functions in the near future.

#### Acknowledgments

We would like to thank the whole QuantStack team for their help throughout the process of making this blog post.

We are also grateful to the organizers of the Jupyter community workshop on kernels, as we actually started this endeavor during the event.

Vasavan Thiru is completing a master’s degree at Sorbonne Université Pierre & Marie Curie in applied mathematics for mechanics. He is currently interning as a scientific software developer at QuantStack.

Thibault Lacharme is finishing a master’s degree in Quantitative Finance at Université Paris Dauphine. Thibault is currently on his internship as a scientific software developer at QuantStack.

Building a Calculator Jupyter Kernel was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

### Shiny Modules

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

“Tidiness is half of life” is a German saying that you do not necessarily have to live by. In programming, however, it becomes essential, at least in my opinion: if you do not invest a little time in the order and structure of your projects, the time you spend debugging will multiply.

If you have developed a shiny app before, you had to think about the user interface (UI) and the server side. The code to create the UI probably landed in the file ui.R, the server function in a server.R file (or you combined both of them in app.R). As soon as the app gets a bit more complex, the two files grow and quickly become confusing. Typically, you then begin to move individual blocks or functions to additional files in order to keep an overview of ui.R and server.R. These files also continue to grow, become confusing and often cause errors that cost an enormous amount of time and nerves, although this would be easy to avoid.

As explained in our blog article on using modules in R, modules represent a level of abstraction between functions and packages. They are virtual storage boxes in our packages, which make it much easier for us to establish a structure in projects and to keep order.

In this article we look at how to build a shiny app with clear code, reusable and automatically tested modules. For that, we first go into the package structure and testing a shiny app before we focus on the actual modules.

## Package Structure

Even if you program only small applications, it makes sense to wrap them up in a package. This will give you a versioned edition of the app and lets you easily jump back to previous versions, if in a new version something is not working as it should. It also makes it possible to keep logic out of the app and put it in the package along with automated tests instead.

You can create the package structure either manually or use the function package.skeleton().

We pack the app in a folder of any name (for example app) in the inst folder. To start the app, we put another function called startApplication() in the R folder:

    #' Start Application
    #'
    #' @param port port of web application
    #' @param appFolder name of the app folder inside the inst folder
    #'
    #' @export
    startApplication <- function(port = 4242, appFolder = "app") {
      runApp(
        system.file(appFolder, package = "myPackageName"),
        port = port,
        host = "0.0.0.0"
      )
    }

The package structure then looks something like this.

And the typical workflow might be something like this:

    devtools::install()    # install the new version of the package
    library(myPackageName) # load the package
    startApplication()     # start the app

## Server Function Without Logic

Implementing automated tests in shiny apps is not trivial, because you have to simulate the interaction of many UI building blocks and server components. However, the package shinytest provides a convenient framework for testing. Most of the time, you can also make your app robust by taking some precautions in your app code, without having to use shinytest.

To do so, the server.R file should contain as little logic as possible. Put the logic into R functions in the new package structure, where they can be tested as part of R CMD check. Ideally, every reactive function in shiny consists of only one line of code. These reactive functions contain only input elements, reactive values, other reactive functions or simple function calls:

    data <- eventReactive(input$button, getData())
    model <- reactive(runModel(data(), param1, param2))
    results <- reactive(extractResults(model()))
    output$text <- renderText(prepareResults(results()))

## Modules

It is possible to combine larger components of an app in one module. This has several advantages:

• A module has one specific task
• Own namespace
• Defined interface with module
• Reusability (in- and outside of the project)

For example, at http://inwtlab.shinyapps.io/exportPlotModule you will find a module that allows the export of plots from a shiny application.

As for a shiny app itself, we need to implement both the UI and the server function for a module. Let’s start with the UI. We need a function that creates UI elements for the module. The name of the function is arbitrary, but it makes sense to start with the module name.

    plotExportUI <- function(id) {
      ns <- NS(id)
      tagList(
        selectInput(ns("type"), label = "Type", choices = c("png", "pdf", "tiff", "svg")),
        plotOutput(ns("preview")),
        downloadButton(ns("download"), "Download")
      )
    }

The call ns <- NS(id) creates the namespace function ns, which converts any id of a UI element into an id in the namespace of the module. Apart from that, the function only defines various UI elements and returns them via tagList.
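The effect of such a namespace function can be imitated in a few lines of Python. This is only an illustration of the idea, not shiny's implementation: `make_namespace` is a hypothetical name, and the sketch assumes the hyphen that shiny uses by default to join the module id and the element id.

```python
def make_namespace(module_id):
    """Return a function that prefixes element ids with the module id."""
    def ns(element_id):
        return f"{module_id}-{element_id}"
    return ns


# Two instances of the same module get different prefixes,
# so their element ids cannot collide.
export_ns = make_namespace("export")
second_ns = make_namespace("export2")

print(export_ns("type"))   # export-type
print(second_ns("type"))   # export2-type
```

This is why the same module can be placed in an app several times, as long as each instance receives its own id.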

The server function looks like this. Again, the name is arbitrary, but again it makes sense to start with the module name.

    plotExport <- function(input, output, session, plotObj) {
      output$preview <- renderPlot({
        plotObj()
      })
      output$download <- downloadHandler(
        filename = function() {
          paste0("plot.", input$type)
        },
        content = function(file) {
          switch(
            input$type,
            png = png(file),
            pdf = pdf(file),
            tiff = tiff(file),
            svg = svg(file)
          )
          print(plotObj())
          dev.off()
        }
      )
    }

The function looks like a normal server function with an additional parameter plotObj. This contains the plot object as a reactive element.

Now let’s look at how the module is called. Unsurprisingly, this happens in two places. In the ui.R the module is called like a normal UI element:

    [...]
    plotExportUI(id = "export")
    [...]

The id parameter defines the namespace of the module. It is also possible to use the module several times in one app; just be sure to use a different id each time.

In the server.R the module has to be started as well:

    [...]
      plot <- reactive({
        [...]
        ggplot(d, aes_string(x = input$xcol, y = input$ycol, col = "clusters")) +
          geom_point(size = 4)
      })
      callModule(plotExport, "export", plot)
    [...]

plotExport is the name of the server function of the module, “export” is the same id as for the UI element of the module. In addition, one passes the reactive element plot. One could also use fixed parameters at this point. However, if you want to pass a reactive element, you have to pay attention to input variables. For example, to pass the variable input$abc, you would need to call callModule like this: callModule([...], param = reactive(input$abc))

So you have to include the variable in reactive to make sure it’s passed as a reactive variable. The same applies to reactive values that have been generated with reactiveValues.

The UI and the server function of the module can easily be stored in an R package to make them available to others.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### eRum2020 in Milan

(This article was first published on Mirai Solutions, and kindly contributed to R-bloggers)

The European R conference will visit Milan in 2020! Mirai Solutions is delighted to actively support and participate in the organization of the event.

The European R Users Meeting (eRum) is a biennial conference, taking place in Europe in the years when the useR! conference is held overseas. It is a unique chance for R practitioners to get together, exchange experiences, broaden their knowledge around R and initiate collaborations with like-minded people. After Poznan in 2016 and Budapest in 2018, the 2020 edition will take place in Milan, Italy, from May 27 to 30, as announced officially by Miraier Francesca and co-organizers at useR! 2019 in Toulouse.

The event will include one day of workshops and two days of conference, with:

• Multiple parallel sessions on hot topics in machine learning, statistics and data science
• Plenary sessions with international keynote speakers
• More than 70 contributions in the form of presentations, lightning talks and workshops
• More than 500 international R practitioners
• A dedicated Shiny side-event

### Tutorial: Advanced For Loops in Python

If you've already mastered the basics of iterating through Python lists, take it to the next level and learn to use for loops in pandas, numpy, and more!

The post Tutorial: Advanced For Loops in Python appeared first on Dataquest.

### Reinforcement Learning: Life is a Maze

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

It can be argued that most important decisions in life are some variant of an exploitation-exploration problem. Shall I stick with my current job or look for a new one? Shall I stay with my partner or seek a new love? Shall I continue reading the book or watch the movie instead? In all of those cases the question is always whether I should “exploit” the thing I have or whether I should “explore” new things. If you want to learn how to tackle this most basic trade-off, read on…

At the core this can be stated as the problem a gambler has who wants to play a one-armed bandit: if there are several machines with different winning probabilities (a so-called multi-armed bandit problem) the question the gambler faces is: which machine to play? He could “exploit” one machine or “explore” different machines. So what is the best strategy given a limited amount of time… and money?

There are two extreme cases: no exploration, i.e. playing only one randomly chosen bandit, or no exploitation, i.e. playing all bandits randomly – so obviously we need some middle ground between those two extremes. We have to start with one randomly chosen bandit, try different ones after that and compare the results. So in the simplest case the first variable is the probability rate with which to switch to a random bandit – or to stick with the best bandit found so far.

Let us create an example with 5 bandits, which return 4 units on average, except the second one, which returns 5 units. So the best strategy would obviously be to choose the second bandit right away and stick with it, but of course we don’t know the average returns of each bandit, so this won’t work. Instead we need another vector which tallies the results of each bandit so far. This vector has to be updated after each game, so we need an update function which gets as arguments the current bandit and the return of the game.

The intelligence of the strategy lies in this update function, so how should we go about it? The big idea behind this strategy is called the Bellman equation, and in its simplest form it works as follows: the adjustment of the former result vector is the difference between the current result and the former result, weighted by a learning rate, in this case the inverse of the number of games played on the respective machine. This learning strategy is called Q-learning and is a so-called reinforcement learning technique.
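With this particular weight, the update rule simply maintains the running average of the returns seen so far on a machine. The following Python sketch checks this on a handful of made-up returns; the function name `update` mirrors the idea described above, but the code is only an illustration under that assumption, not a translation of the author's implementation.

```python
def update(q, k, reward):
    """Move the estimate q towards the new reward with weight 1 / (k + 1).

    q is the current estimate and k the number of games played so far;
    the new estimate and the new count are returned.
    """
    q = q + (reward - q) / (k + 1)
    return q, k + 1


rewards = [4, 6, 5, 3, 7]  # made-up returns of one machine

q, k = 0.0, 0
for r in rewards:
    q, k = update(q, k, r)

print(q)                            # estimate after five games
print(sum(rewards) / len(rewards))  # plain average of the same returns
```

Both printed values agree, so no list of past returns has to be stored: the current estimate and the number of games played are enough.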

Have a look at the following example implementation:

    set.seed(3141) # for reproducibility

    # Q-learning update function
    update <- function(i, r) {
      Q[i] <<- Q[i] + 1/(k[i]+1) * (r-Q[i]) # move estimate towards the new return
      k[i] <<- k[i] + 1 # one more game played on i'th bandit
    }

    # simulate game on one-armed bandit i
    ret <- function(i) {
      round(rnorm(1, mean = rets[i]))
    }

    # choose which bandit to play
    which.bandit <- function() {
      p <- runif(1)
      ifelse(p >= epsilon, which.max(Q), sample(1:n, 1))
    }

    epsilon <- 0.1 # probability of switching to a random bandit
    rets <- c(4, 5, 4, 4, 4) # average returns of bandits
    n <- length(rets)
    Q <- rep(0, n) # initialize return vector
    k <- rep(0, n) # initialize vector for games played on each bandit
    N <- 1000 # no. of runs
    R <- 0 # sum of returns

    for (j in 1:N) {
      i <- which.bandit() # choose bandit
      r <- ret(i) # simulate bandit
      R <- R + r # add return of bandit to overall sum of returns
      update(i, r) # calling Q-learning update function
    }

    which.max(Q) # which bandit has the highest return?
    ## [1] 2

    Q
    ## [1] 4.000000 5.040481 4.090909 4.214286 3.611111

    k
    ## [1]  32 914  22  14  18

    N * max(rets) # theoretical max. return
    ## [1] 5000

    R
    ## [1] 4949

    R / (N * max(rets)) # percent reached of theoretical max
    ## [1] 0.9898


So, obviously the algorithm found a nearly perfect strategy all on its own!

Now, this is the simplest possible application of reinforcement learning. Let us now implement a more sophisticated example: a robot navigating a maze. Whereas the difficulty in the first example was that the feedback was blurred (because the return of each one-armed bandit is only an average return) here we only get definitive feedback after several steps (when the robot reaches its goal). Because this situation is more complicated we need more memory to store the intermediate results. In our multi-armed bandit example the memory was a vector, here we will need a matrix.

The robot will try to reach the goal in the following maze and find the best strategy for each room it is placed in:

Have a look at the code (it is based on the Matlab code from the same tutorial the picture is from, which is why the variables and functions keep the same names):

# find all possible actions
AllActions <- function(state, R) {
which(R[state, ] >= 0)
}

# chose one action out of all possible actions by chance
PossibleAction <- function(state, R) {
sample(AllActions(state, R), 1)
}

# Q-learning function
UpdateQ <- function(state, Q, R, gamma, goalstate) {
action <- PossibleAction(state, R)
Q[state, action] <- R[state, action] + gamma * max(Q[action, AllActions(action, R)]) # Bellman equation (learning rate implicitly = 1)
if(action != goalstate) Q <- UpdateQ(action, Q, R, gamma, goalstate)
Q
}

# recursive function to get the action with the the maximum Q value
MaxQ <- function(state, Q, goalstate) {
action <- which.max(Q[state[length(state)], ])
if (action != goalstate) action <- c(action, MaxQ(action, Q, goalstate))
action
}

# representation of the maze
R <- matrix(c(-Inf, -Inf, -Inf, -Inf,    0, -Inf,
-Inf, -Inf, -Inf,    0, -Inf,  100,
-Inf, -Inf, -Inf,    0, -Inf, -Inf,
-Inf,    0,    0, -Inf,    0, -Inf,
0, -Inf, -Inf,    0, -Inf,  100,
-Inf,    0, -Inf, -Inf,    0,  100), ncol = 6, byrow = TRUE)
colnames(R) <- rownames(R) <- LETTERS[1:6]
R
##      A    B    C    D    E    F
## A -Inf -Inf -Inf -Inf    0 -Inf
## B -Inf -Inf -Inf    0 -Inf  100
## C -Inf -Inf -Inf    0 -Inf -Inf
## D -Inf    0    0 -Inf    0 -Inf
## E    0 -Inf -Inf    0 -Inf  100
## F -Inf    0 -Inf -Inf    0  100

Q <- matrix(0, nrow = nrow(R), ncol = ncol(R))
colnames(Q) <- rownames(Q) <- LETTERS[1:6]
Q
##   A B C D E F
## A 0 0 0 0 0 0
## B 0 0 0 0 0 0
## C 0 0 0 0 0 0
## D 0 0 0 0 0 0
## E 0 0 0 0 0 0
## F 0 0 0 0 0 0

gamma <- 0.8 # learning rate
goalstate <- 6
N <- 50000 # no. of episodes

for (episode in 1:N) {
  state <- sample(1:goalstate, 1)
  Q <- UpdateQ(state, Q, R, gamma, goalstate)
}

Q
##      A    B    C    D   E   F
## A -Inf -Inf -Inf -Inf 400   0
## B    0    0    0  320   0 500
## C -Inf -Inf -Inf  320   0   0
## D    0  400  256    0 400   0
## E  320    0    0  320   0 500
## F    0  400    0    0 400 500

Q / max(Q) * 100
##      A    B    C    D  E   F
## A -Inf -Inf -Inf -Inf 80   0
## B    0    0  0.0   64  0 100
## C -Inf -Inf -Inf   64  0   0
## D    0   80 51.2    0 80   0
## E   64    0  0.0   64  0 100
## F    0   80  0.0    0 80 100
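
The values in the learned matrix can be checked by hand. The following is a minimal worked example, assuming the update has reached its fixed point (so that $Q$ no longer changes) and using $\gamma = 0.8$ together with the rewards in $R$. At the fixed point the largest entry in row F is $Q(F,F)$ itself, which gives:

$$
\begin{aligned}
Q(F,F) &= 100 + 0.8\,Q(F,F) \;\Rightarrow\; Q(F,F) = \frac{100}{1 - 0.8} = 500 \\
Q(B,F) &= 100 + 0.8 \cdot 500 = 500 \\
Q(D,B) &= 0 + 0.8 \cdot 500 = 400 \\
Q(C,D) &= 0 + 0.8 \cdot 400 = 320 \\
Q(D,C) &= 0 + 0.8 \cdot 320 = 256
\end{aligned}
$$

Each additional step away from the goal multiplies the value by 0.8, which matches the 500, 400, 320 and 256 entries in the matrix above.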

# print all learned routes for all rooms
for (i in 1:goalstate) {
  cat(LETTERS[i], LETTERS[MaxQ(i, Q, goalstate)], sep = " -> ")
  cat("\n")
}
## A -> E -> F
## B -> F
## C -> D -> B -> F
## D -> B -> F
## E -> F
## F -> F


So again, the algorithm has found the best route for each room!
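
As a variation, the same idea can be written without recursion and with an explicit learning rate. This is only a sketch, not the tutorial's code: the function name `QLearnIter` and the parameter `alpha` are made up for illustration, and setting `alpha = 1` corresponds to the update used above. Actions are drawn by index with `sample.int()`, so rooms with a single exit never pick an invalid move and the matrix stays free of `-Inf` entries.

```r
# Reward matrix of the maze, repeated here so the sketch is self-contained
R <- matrix(c(-Inf, -Inf, -Inf, -Inf,    0, -Inf,
              -Inf, -Inf, -Inf,    0, -Inf,  100,
              -Inf, -Inf, -Inf,    0, -Inf, -Inf,
              -Inf,    0,    0, -Inf,    0, -Inf,
                 0, -Inf, -Inf,    0, -Inf,  100,
              -Inf,    0, -Inf, -Inf,    0,  100), ncol = 6, byrow = TRUE)
colnames(R) <- rownames(R) <- LETTERS[1:6]

# Hypothetical helper (not part of the original tutorial):
# iterative Q-learning with an explicit learning rate alpha
QLearnIter <- function(R, gamma = 0.8, alpha = 1, goalstate = 6, N = 5000) {
  Q <- matrix(0, nrow = nrow(R), ncol = ncol(R), dimnames = dimnames(R))
  for (episode in seq_len(N)) {
    state <- sample.int(nrow(R), 1)               # random start room
    repeat {
      actions <- which(R[state, ] >= 0)           # valid moves from this room
      action  <- actions[sample.int(length(actions), 1)]  # pick by index
      nxt     <- which(R[action, ] >= 0)          # valid moves from the next room
      target  <- R[state, action] + gamma * max(Q[action, nxt])
      # blend the old estimate with the new target; alpha = 1 overwrites it
      Q[state, action] <- (1 - alpha) * Q[state, action] + alpha * target
      if (action == goalstate) break              # episode ends at the goal
      state <- action
    }
  }
  Q
}

set.seed(1)
Q_iter <- QLearnIter(R)
round(Q_iter)
```

With `alpha = 1` the entries should approach the same 500, 400, 320 and 256 values as before, with zeros in place of the `-Inf` entries. A smaller `alpha` takes more episodes to settle in this deterministic maze, but it is the usual choice when rewards or transitions are noisy.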

Recently, the combination of neural networks (see also Understanding the Magic of Neural Networks) and reinforcement learning has become quite popular. For example, AlphaGo, the machine from Google that defeated a Go world champion for the first time in history, is based on this powerful combination!


### If you did not already know

TrialChain
The governance of data used for biomedical research and clinical trials is an important requirement for generating accurate results. To improve the visibility of data quality and analysis, we developed TrialChain, a blockchain-based platform that can be used to validate data integrity from large, biomedical research studies. We implemented a private blockchain using the MultiChain platform and integrated it with a data science platform deployed within a large research center. An administrative web application was built with Python to manage the platform, which was built with a microservice architecture using Docker. The TrialChain platform was integrated during data acquisition into our existing data science platform. Using NiFi, data were hashed and logged within the local blockchain infrastructure. To provide public validation, the local blockchain state was periodically synchronized to the public Ethereum network. The use of a combined private/public blockchain platform allows for both public validation of results while maintaining additional security and lower cost for blockchain transactions. Original data and modifications due to downstream analysis can be logged within TrialChain and data assets or results can be rapidly validated when needed using API calls to the platform. The TrialChain platform provides a data governance solution to audit the acquisition and analysis of biomedical research data. The platform provides cryptographic assurance of data authenticity and can also be used to document data analysis. …

Unsupervised Temperature Scaling (UTS)
Great performances of deep learning are undeniable, with impressive results on a wide range of tasks. However, the output confidence of these models is usually not well calibrated, which can be an issue for applications where confidence on the decisions is central to bring trust and reliability (e.g., autonomous driving or medical diagnosis). For models using softmax at the last layer, Temperature Scaling (TS) is a state-of-the-art calibration method, with low time and memory complexity as well as demonstrated effectiveness. TS relies on a T parameter to rescale and calibrate values of the softmax layer, using a labelled dataset to determine the value of that parameter. We are proposing an Unsupervised Temperature Scaling (UTS) approach, which does not depend on labelled samples to calibrate the model, allowing, for example, using a part of test samples for calibrating the pre-trained model before going into inference mode. We provide theoretical justifications for UTS and assess its effectiveness on a wide range of deep models and datasets. We also demonstrate calibration results of UTS on skin lesion detection, a problem where a well-calibrated output can play an important role for accurate decision-making. …

You Only Look Once (YOLO)
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset. …

MEBoost
The class imbalance problem has been a challenging research problem in the fields of machine learning and data mining as most real life datasets are imbalanced. Several existing machine learning algorithms try to maximize classification accuracy by correctly identifying majority class samples while ignoring the minority class. However, the minority class instances are usually of higher interest than the majority class. Recently, several cost sensitive methods, ensemble models and sampling techniques have been used in the literature in order to classify imbalanced datasets. In this paper, we propose MEBoost, a new boosting algorithm for imbalanced datasets. MEBoost mixes two different weak learners with boosting to improve the performance on imbalanced datasets. MEBoost is an alternative to the existing techniques such as SMOTEBoost, RUSBoost, Adaboost, etc. The performance of MEBoost has been evaluated on 12 benchmark imbalanced datasets with state of the art ensemble methods like SMOTEBoost, RUSBoost, Easy Ensemble, EUSBoost, DataBoost. Experimental results show significant improvement over the other methods and it can be concluded that MEBoost is an effective and promising algorithm to deal with imbalanced datasets. …

### SCMP's fantastic infographic on Hong Kong protests

In the past month, there have been several large-scale protests in Hong Kong. The largest one featured up to two million residents taking to the streets on June 16 to oppose an extradition act that was working its way through the legislature. If the count was accurate, about 25 percent of the city’s population joined in the protest. Another large demonstration occurred on July 1, the anniversary of Hong Kong’s return to Chinese rule.

South China Morning Post, which can be considered the New York Times of Hong Kong, is well known for its award-winning infographics, and they rose to the occasion with this effort.

This is one of the rare infographics that you won't regret spending time reading. After reading it, you will have learned a few new things about protesting in Hong Kong.

In particular, you’ll learn that the recent demonstrations are part of a larger pattern in which Hong Kong residents express their dissatisfaction with the city’s governing class, frequently accused of acting as puppets of the Chinese state. Under the “one country, two systems” arrangement, the city’s officials occupy an unenviable position of mediating the various contradictions of the two systems.

This bar chart shows the growth in the protest movement. The recent massive protests didn't come out of nowhere.

This line chart offers a possible explanation for the burgeoning protests. Residents perceived their freedoms eroding in the last decade.

If you have seen videos of the protests, you’ll have noticed the peculiar protest costumes. Umbrellas are used to block pepper sprays, for example. The following lovely graphic shows how the costumes have evolved:

The scale of these protests captures the imagination. The last part of the infographic places the number of protesters in context by expressing it in terms of football pitches (as soccer fields are known outside the U.S.). This is a sort of universal measure due to the popularity of football almost everywhere. (Nevertheless, according to Wikipedia, the fields do not have one fixed dimension, even though fields used for international matches are standardized to 105 m by 68 m.)

This chart could be presented as a bar chart. It’s just that the data have been re-scaled – from counting individuals to counting football pitches-ful of individuals.

***
Here is the entire infographic.

### Document worth reading: “Shannon’s entropy and its Generalizations towards Statistics, Reliability and Information Science during 1948-2018”

Starting from the pioneering works of Shannon and Wiener in 1948, a plethora of works have been reported on entropy in different directions. Entropy-related review work in the direction of statistics, reliability and information science, to the best of our knowledge, has not been reported so far. Here we have tried to collect all possible works in this direction during the period 1948-2018 so that people interested in entropy, especially new researchers, can benefit. Shannon’s entropy and its Generalizations towards Statistics, Reliability and Information Science during 1948-2018