# My Data Science Blogs

## October 21, 2018

### If you did not already know

Dynamical Atoms Network (DYAN)
The ability to anticipate the future is essential when making real-time critical decisions, provides valuable information for understanding dynamic natural scenes, and can help unsupervised video representation learning. State-of-the-art video prediction is based on LSTM recursive networks and/or generative adversarial network learning. These are complex architectures that need to learn large numbers of parameters, are potentially hard to train, slow to run, and may produce blurry predictions. In this paper, we introduce DYAN, a novel network with very few parameters that is easy to train and produces accurate, high-quality frame predictions significantly faster than previous approaches. DYAN owes its good qualities to its encoder and decoder, which are designed following concepts from systems identification theory and exploit the dynamics-based invariants of the data. Extensive experiments using several standard video datasets show that DYAN is superior at generating frames and that it generalizes well across domains. …

Safe Reinforcement Learning
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning. …

### Faceted Graphs with cdata and ggplot2

In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so:

I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two plots took up too much space. So, I thought, why not make a faceted graph that shows both:

Except — which columns do I plot and what do I facet on?

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Here’s one way to create the plot I want, using the cdata package along with ggplot2.

First, load the packages and data:

library("ggplot2")
library("cdata")

iris <- data.frame(iris)

Now define the data-shaping transform, or control table. The control table is basically a picture that sketches out the final data shape that I want. I want to specify the x and y columns of the plot (call these the value columns of the data frame) and the column that I am faceting by (call this the key column of the data frame). And I also need to specify how the key and value columns relate to the existing columns of the original data frame.

Here’s what the control table looks like:

The control table specifies that the new data frame will have the columns flower_part, Length and Width. Every row of iris will produce two rows in the new data frame: one with a flower_part value of Petal, and another with a flower_part value of Sepal. The Petal row will take the Petal.Length and Petal.Width values in the Length and Width columns respectively. Similarly for the Sepal row.

Here I create the control table in R, using the convenience function wrapr::build_frame() to write out the controlTable data frame in a legible way.

(controlTable <- wrapr::build_frame(
   "flower_part", "Length"      , "Width"       |
   "Petal"      , "Petal.Length", "Petal.Width" |
   "Sepal"      , "Sepal.Length", "Sepal.Width" ))
##   flower_part       Length       Width
## 1       Petal Petal.Length Petal.Width
## 2       Sepal Sepal.Length Sepal.Width

Now I apply the transform to iris using the function rowrecs_to_blocks(). I also want to carry along the Species column so I can color the scatterplot points by species.

iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = c("Species"))

head(iris_aug)
##   Species flower_part Length Width
## 1  setosa       Petal    1.4   0.2
## 2  setosa       Sepal    5.1   3.5
## 3  setosa       Petal    1.4   0.2
## 4  setosa       Sepal    4.9   3.0
## 5  setosa       Petal    1.3   0.2
## 6  setosa       Sepal    4.7   3.2

And now I can create the plot!

ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) +
  facet_wrap(~flower_part, labeller = label_both, scales = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2")

In the next post, I will show how to use cdata and ggplot2 to create a scatterplot matrix.

### Magister Dixit

“Understanding the distinction between Data Science and Big Data is critical to investing in a sound data strategy.” Sean McClure (July 2015)

### The space race is dominated by new contenders

Private businesses and rising powers are replacing the cold-war duopoly

### Supreme Court justices are increasingly political

Donald Trump’s nominee is likely to accelerate the pace

### Python could become the world’s most popular coding language

But its rivals are unlikely to disappear

### Statistics Sunday: What Fast Food Can Tell Us About a Community and the World

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Two statistical indices crossed my inbox in the last week, both of which use fast food restaurants to measure a concept indirectly.

First up, in the wake of recent hurricanes, is the Waffle House Index. As The Economist explains:

Waffle House, a breakfast chain from the American South, is better known for reliability than quality. All its restaurants stay open every hour of every day. After extreme weather, like floods, tornados and hurricanes, Waffle Houses are quick to reopen, even if they can only serve a limited menu. That makes them a remarkably reliable if informal barometer for weather damage.

The index was invented by Craig Fugate, a former director of the Federal Emergency Management Agency (FEMA) in 2004 after a spate of hurricanes battered America’s east coast. “If a Waffle House is closed because there’s a disaster, it’s bad. We call it red. If they’re open but have a limited menu, that’s yellow,” he explained to NPR, America’s public radio network. Fully functioning restaurants mean that the Waffle House Index is shining green.

Next is the Big Mac Index, created by The Economist:

The Big Mac index was invented by The Economist in 1986 as a lighthearted guide to whether currencies are at their “correct” level. It is based on the theory of purchasing-power parity (PPP), the notion that in the long run exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries.

You might remember a discussion of the “basket of goods” in my post on the Consumer Price Index. In fact, the Big Mac Index, which started as a way “to make exchange-rate theory more digestible,” has since become a global standard and is used in multiple studies. Now you can use it too, because the data and methodology have been made available on GitHub. R users will be thrilled to know that the code is written in R, but you’ll need to use a bit of Python to get at the Jupyter notebook they’ve put together. Fortunately, they’ve provided detailed information on installing and setting everything up.
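The PPP arithmetic behind the index is simple enough to sketch. In the snippet below, the burger prices and market exchange rate are hypothetical stand-ins of my own, not figures from The Economist's data:

```python
# Illustrative PPP calculation in the spirit of the Big Mac index.
# All numbers here are made-up examples, not actual index data.

def implied_ppp(price_local, price_us):
    """Exchange rate (local currency per USD) that would equalize burger prices."""
    return price_local / price_us

def valuation(price_local, price_us, actual_rate):
    """Fraction by which the local currency is over(+) or under(-)valued."""
    return implied_ppp(price_local, price_us) / actual_rate - 1

# Hypothetical: a burger costs 390 yen locally and $4.00 in the US,
# while the market exchange rate is 130 yen per dollar.
ppp = implied_ppp(390.0, 4.00)       # implied PPP rate: 97.5 yen per dollar
val = valuation(390.0, 4.00, 130.0)  # negative => yen looks undervalued
```

The sign of `val` is the whole story: the implied rate (97.5) is below the market rate (130), so by this measure the local currency buys less than PPP says it should.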


### Multilevel models with group-level predictors

Kari Lock Morgan writes:

I’m writing now though with a multilevel modeling question that has been nagging me for quite some time now. In your book with Jennifer Hill, you include a group-level predictor (for example, 12.15 on page 266), but then end up fitting this as an individual-level predictor with lmer. How can this be okay? It seems as if lmer can’t really still be fitting the model specified in 12.15? In particular, I’m worried about analyzing a cluster randomized experiment where the treatment is applied at the cluster level and outcomes are at the individual level – intuitively, of course it should matter that the treatment was applied at the cluster level, not the individual level, and therefore somehow this should enter into how the model is fit? However, I can’t seem to grasp how lmer would know this, unless it is implicitly looking at the covariates to see if they vary within groups or not, which I’m guessing it’s not? In your book you act as if fitting the model with county-level uranium as an individual predictor is the same as fitting it as a group-level predictor, which makes me think perhaps I am missing something obvious?

My reply: It indeed is annoying that lmer (and, for that matter, stan_lmer in its current incarnation) only allows individual-level predictors, so that any group-level predictors need to be expanded to the individual level (for example, u_full <- u[group]). But from the standpoint of fitting the statistical model, it doesn't matter. Regarding the question, how does the model "know" that, in this case, u_full is actually an expanded group-level predictor: The answer is that it "figures it out" based on the dependence between u_full and the error terms. It all works out.
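The expansion Andrew describes (u_full <- u[group]) is just indexing the group-level vector by each observation's group label. A minimal sketch in Python, with hypothetical data values of my own:

```python
# Hypothetical setup: 3 counties, 8 individual observations.
u = [0.5, -1.2, 0.3]                  # group-level predictor (e.g. log uranium)
group = [0, 0, 1, 1, 1, 2, 2, 0]      # county index for each observation

# Expand to the individual level: one value per observation.
# This mirrors R's  u_full <- u[group]  (R uses 1-based indexing).
u_full = [u[g] for g in group]
```

Every observation in the same county gets the identical value, which is exactly why the fitted model cannot tell an "expanded" group-level predictor apart from an individual-level one.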

The post Multilevel models with group-level predictors appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Document worth reading: “Machine Learning for Spatiotemporal Sequence Forecasting: A Survey”

Spatiotemporal systems are common in the real-world. Forecasting the multi-step future of these spatiotemporal systems based on the past observations, or, Spatiotemporal Sequence Forecasting (STSF), is a significant and challenging problem. Although lots of real-world problems can be viewed as STSF and many research works have proposed machine learning based methods for them, no existing work has summarized and compared these methods from a unified perspective. This survey aims to provide a systematic review of machine learning for STSF. In this survey, we define the STSF problem and classify it into three subcategories: Trajectory Forecasting of Moving Point Cloud (TF-MPC), STSF on Regular Grid (STSF-RG) and STSF on Irregular Grid (STSF-IG). We then introduce the two major challenges of STSF: 1) how to learn a model for multi-step forecasting and 2) how to adequately model the spatial and temporal structures. After that, we review the existing works for solving these challenges, including the general learning strategies for multi-step forecasting, the classical machine learning based methods for STSF, and the deep learning based methods for STSF. We also compare these methods and point out some potential research directions. Machine Learning for Spatiotemporal Sequence Forecasting: A Survey

### Distilled News

This study investigates a discrete causal method for nominal data (DCMND) which is one of the important issues of causal inference. It is utilized to learn the causal Bayesian network to reflect the interconnections between variables in our paper. This article also proposes a Bayesian network construction algorithm based on discrete causal inference (BDCI) and an extended BDCI Bayesian network construction algorithm based on DCMND. Furthermore, the paper studies the alarm data of mobile communication system in practice. The results suggest that decision criterion based our method is effective in causal inference and the Bayesian network constructed by our method has better classification accuracy compared to other methods.
Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner proposed a neural network architecture for handwritten and machine-printed character recognition in the 1990s, which they called LeNet-5. The architecture is straightforward and simple to understand, which is why it is often used as a first step in teaching Convolutional Neural Networks.
Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open-source software and tries to fix them automatically. If it succeeds in synthesizing a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce 5 patches that were accepted by the human developers and permanently merged into the code base. This is a milestone for human-competitiveness in software engineering research on automatic program repair. In this post, we tell the story of this research done at KTH Royal Institute of Technology, Inria, the University of Lille and the University of Valenciennes.
Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!
In this tutorial, you’ll learn about commonly used probability distributions in machine learning literature.
Practice autocorrelation in R by using course material from DataCamp’s Introduction to Time Series Analysis course.
Practice basic programming skills in R by using course material from DataCamp’s free Model a Quantitative Trading Strategy in R course.
RPA (Robotic Process Automation) is hot! This term – which refers to small ‘bots’ that automate manual tasks – has ranked between #5 and #7 in searches on Gartner.com over the summer. To satisfy the demand for information on RPA, Gartner has published many works on the subject, including a market guide and several deployment best practice notes. Our published work, however, cautions clients not to use RPA as an alternative when more robust, configured enterprise applications are available. The notion behind this advice is that RPA is inferior to configured, enterprise-class solutions purpose-built for a particular process. Let me give you an example. One could create an RPA ‘bot’ to transcribe invoice data into an accounts payable system. Or the accounts payable department could deploy a modern e-invoicing system or B2B network solution that does the same thing. Neither approach is ‘correct.’ But I think all of us at Gartner for technology leaders could agree that the enterprise application is the more architecturally clean and flexible way of doing things.
Visualization is a great way to get an overview of credit modeling. Typically you will start with data management and data cleaning, and after this your credit modeling analysis will start with visualizations. This article is therefore the first part of a credit machine learning analysis with visualizations. The second part of the analysis will typically use logistic regression and ROC curves.
Score one for the human brain. In a new study, computer scientists found that artificial intelligence systems fail a vision test a child could accomplish with ease. ‘It’s a clever and important study that reminds us that ‘deep learning’ isn’t really that deep,’ said Gary Marcus, a neuroscientist at New York University who was not affiliated with the work. The result takes place in the field of computer vision, where artificial intelligence systems attempt to detect and categorize objects. They might try to find all the pedestrians in a street scene, or just distinguish a bird from a bicycle (which is a notoriously difficult task). The stakes are high: As computers take over critical tasks like automated surveillance and autonomous driving, we’ll want their visual processing to be at least as good as the human eyes they’re replacing.
You might be familiar with structured data; it is everywhere. Here I would like to focus on how we transform unstructured data into something a machine can process, so that we can then draw inferences from it.
Gradient Boosted Decision Trees and Random Forest are my favorite ML models for tabular heterogeneous datasets. These models are the top performers on Kaggle competitions and in widespread use in the industry. Catboost, the new kid on the block, has been around for a little more than a year now, and it is already threatening XGBoost, LightGBM and H2O.
This is the second part of a series of two blogposts on deep learning model exploration, translation, and deployment. Both involve many technologies like PyTorch, TensorFlow, TensorFlow Serving, Docker, ONNX, NNEF, GraphPipe, and Flask. We will orchestrate these technologies to solve the task of image classification using the more challenging and less popular EMNIST dataset. In the first part, we introduced EMNIST, developed and trained models with PyTorch, translated them using the Open Neural Network eXchange format (ONNX) and served them through GraphPipe. This part concludes the series by adding two additional approaches for model deployment: TensorFlow Serving with Docker, as well as a rather hobbyist approach in which we build a simple web application that serves our model. Both deployments will offer a REST API to call for predictions. You will find all the related source code on GitHub. If you would like to start from the very beginning, find the first part here on Towards Data Science.
This is the first part of a series of two blogposts on deep learning model exploration, translation, and deployment. Both involve many technologies like PyTorch, TensorFlow, TensorFlow Serving, Docker, ONNX, NNEF, GraphPipe, and Flask. We will orchestrate these technologies to solve the task of image classification using the more challenging and less popular EMNIST dataset. The first part introduces EMNIST; we develop and train models with PyTorch, translate them with the Open Neural Network eXchange format (ONNX) and serve them through GraphPipe. Part two covers TensorFlow Serving and Docker, as well as a rather hobbyist approach in which we build a simple web application that serves our model. You can find all the related source code on GitHub.
The Bayesian approach to statistical analysis has been gaining popularity in recent years, in the wake of the Replication Crisis and with the help of greater computational power. While many of us have heard that it is an alternative to the Frequentist approach that most people are familiar with, not a lot truly understand what it does and how to use it. This post hopes to simplify the core concepts of Bayesian analysis, and briefly explains why it was proposed as a solution to the Replication Crisis.
Different types of machine learning and their algorithms, and when and where each is used.
Even though AI can be achieved in many ways, why does machine learning have an edge over the others? (A must-read for beginners.) And why is it called machine learning?
Although NLP and text mining are not the same thing, they are closely related, deal with the same raw data type, and have some crossover in their uses. Let’s discuss the steps in approaching these types of tasks.
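To make the Bayesian-analysis item above concrete, here is the simplest possible update: a beta-binomial model for a coin's heads probability. The prior and data below are my own toy numbers, not taken from the linked post:

```python
from fractions import Fraction

# Conjugate Bayesian update: a Beta(alpha0, beta0) prior on a coin's
# heads probability, updated after observing 7 heads in 10 flips.
alpha0, beta0 = 1, 1        # Beta(1, 1) = uniform prior
heads, tails = 7, 3

alpha_post = alpha0 + heads  # posterior is Beta(alpha_post, beta_post)
beta_post = beta0 + tails

# Posterior mean shrinks the raw frequency 7/10 toward the prior mean 1/2.
posterior_mean = Fraction(alpha_post, alpha_post + beta_post)  # 8/12 = 2/3
```

The point the blurb makes is visible even here: the answer is a full posterior distribution over the parameter, not a single point estimate and p-value.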

### If you did not already know

Bidirectional Conditional GAN (BCGAN)
Conditional variants of Generative Adversarial Networks (GANs), known as cGANs, are generative models that can produce data samples ($x$) conditioned on both latent variables ($z$) and known auxiliary information ($c$). Another GAN variant, Bidirectional GAN (BiGAN), is a recently developed framework for learning the inverse mapping from $x$ to $z$ through an encoder trained simultaneously with the generator and the discriminator of an unconditional GAN. We propose the Bidirectional Conditional GAN (BCGAN), which combines cGANs and BiGANs into a single framework with an encoder that learns inverse mappings from $x$ to both $z$ and $c$, trained simultaneously with the conditional generator and discriminator in an end-to-end setting. We present crucial techniques for training BCGANs, which incorporate an extrinsic factor loss along with an associated dynamically-tuned importance weight. As compared to other encoder-based GANs, BCGANs not only encode $c$ more accurately but also utilize $z$ and $c$ more effectively and in a more disentangled way to generate data samples. …

Scientific Information Extractor (SciIE)
We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks, and develop a unified framework called Scientific Information Extractor (SciIE) with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature. …

Site Reliability Engineering (SRE)

### What’s new on arXiv

Neural message passing algorithms for semi-supervised classification on graphs have recently achieved great success. However, these methods only consider nodes that are a few propagation steps away and the size of this utilized neighborhood cannot be easily extended. In this paper, we use the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct personalized embedding propagation (PEP) and its approximation, PEP$_\text{A}$. Our model’s training time is on par or faster and its number of parameters on par or lower than previous models. It leverages a large, adjustable neighborhood for classification and can be combined with any neural network. We show that this model outperforms several recently proposed methods for semi-supervised classification on multiple graphs in the most thorough study done so far for GCN-like models.
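The propagation scheme this abstract describes can be approximated iteratively: repeatedly average predictions over neighbors, then mix a fraction alpha of the initial predictions back in (the personalized-PageRank teleport). A toy sketch under my own assumptions (a hypothetical 3-node graph with a made-up row-normalized adjacency matrix and 2-class scores; this is not the authors' code):

```python
# Iterative personalized-PageRank-style propagation on a tiny dense graph.
# A_hat: row-normalized adjacency (with self-loops); H: initial class scores.

def matmul(A, B):
    """Plain dense matrix product for small illustrative matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def propagate(A_hat, H, alpha=0.1, steps=10):
    """Iterate Z <- (1 - alpha) * A_hat @ Z + alpha * H, starting from Z = H."""
    Z = [row[:] for row in H]
    for _ in range(steps):
        AZ = matmul(A_hat, Z)
        Z = [[(1 - alpha) * AZ[i][j] + alpha * H[i][j]
              for j in range(len(H[0]))] for i in range(len(H))]
    return Z

# Hypothetical 3-node path graph (self-loops included, rows normalized),
# with node 0 labeled class 0 and node 2 labeled class 1.
A_hat = [[0.5, 0.5, 0.0],
         [1/3, 1/3, 1/3],
         [0.0, 0.5, 0.5]]
H = [[1.0, 0.0],
     [0.0, 0.0],
     [0.0, 1.0]]
Z = propagate(A_hat, H)
```

The teleport term is what keeps the labeled nodes' own predictions from being smoothed away, which is the abstract's point about using a large neighborhood without losing local information.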
Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most extended data preprocessing techniques. Although we can find many proposals for static Big Data preprocessing, there is little research devoted to the continuous Big Data problem. Apache Flink is a recent and novel Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. We have implemented six of the most popular data preprocessing algorithms, three for discretization and the rest for feature selection. The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the original accuracy in a short time. DPASF contains useful algorithms for dealing with Big Data data streams. The preprocessing algorithms included in the library are able to tackle Big Datasets efficiently and to correct imperfections in the data.
This paper develops a low-nonnegative-rank approximation method to identify the state aggregation structure of a finite-state Markov chain under an assumption that the state space can be mapped into a handful of meta-states. The number of meta-states is characterized by the nonnegative rank of the Markov transition matrix. Motivated by the success of the nuclear norm relaxation in low rank minimization problems, we propose an atomic regularizer as a convex surrogate for the nonnegative rank and formulate a convex optimization problem. Because the atomic regularizer itself is not computationally tractable, we instead solve a sequence of problems involving a nonnegative factorization of the Markov transition matrices by using the proximal alternating linearized minimization method. Two methods for adjusting the rank of factorization are developed so that local minima are escaped. One is to append an additional column to the factorized matrices, which can be interpreted as an approximation of a negative subgradient step. The other is to reduce redundant dimensions by means of linear combinations. Overall, the proposed algorithm very likely converges to the global solution. The efficiency and statistical properties of our approach are illustrated on synthetic data. We also apply our state aggregation algorithm on a Manhattan transportation data set and make extensive comparisons with an existing method.
Hardware accelerators are available on the Cloud for enhanced analytics. Next generation Clouds aim to bring enhanced analytics using accelerators closer to user devices at the edge of the network for improving Quality-of-Service by minimizing end-to-end latencies and response times. The collective computing model that utilizes resources at the Cloud-Edge continuum in a multi-tier hierarchy comprising the Cloud, the Edge and user devices is referred to as Fog computing. This article identifies challenges and opportunities in making accelerators accessible at the Edge. A holistic view of the Fog architecture is key to pursuing meaningful research in this area.
Dropout is a popular regularization technique in neural networks. Yet, the reason for its success is still not fully understood. This paper provides a new interpretation of Dropout from a frame theory perspective. This leads to a novel regularization technique for neural networks that minimizes the cross-correlation between filters in the network. We demonstrate its applicability in convolutional and fully connected layers in both feed-forward and recurrent networks.
In domains such as health care and finance, shortage of labeled data and computational resources is a critical issue while developing machine learning algorithms. To address the issue of labeled data scarcity in training and deployment of neural network-based systems, we propose a new technique to train deep neural networks over several data sources. Our method allows for deep neural networks to be trained using data from multiple entities in a distributed fashion. We evaluate our algorithm on existing datasets and show that it obtains performance which is similar to a regular neural network trained on a single machine. We further extend it to incorporate semi-supervised learning when training with few labeled samples, and analyze any security concerns that may arise. Our algorithm paves the way for distributed training of deep neural networks in data sensitive applications when raw data may not be shared directly.
Machine Learning models are often composed of pipelines of transformations. While this design allows to efficiently execute single model components at training time, prediction serving has different requirements such as low latency, high throughput and graceful performance degradation under heavy load. Current prediction serving systems consider models as black boxes, whereby prediction-time-specific optimizations are ignored in favor of ease of deployment. In this paper, we present PRETZEL, a prediction serving system introducing a novel white box architecture enabling both end-to-end and multi-model optimizations. Using production-like model pipelines, our experiments show that PRETZEL is able to introduce performance improvements over different dimensions; compared to state-of-the-art approaches PRETZEL is on average able to reduce 99th percentile latency by 5.5x while reducing memory footprint by 25x, and increasing throughput by 4.7x.
Understanding dynamic fracture propagation is essential to predicting how brittle materials fail. Various mathematical models and computational applications have been developed to predict fracture evolution and coalescence, including finite-discrete element methods such as the Hybrid Optimization Software Suite (HOSS). While such methods achieve high fidelity results, they can be computationally prohibitive: a single simulation takes hours to run, and thousands of simulations are required for a statistically meaningful ensemble. We propose a machine learning approach that, once trained on data from HOSS simulations, can predict fracture growth statistics within seconds. Our method uses deep learning, exploiting the capabilities of a graph convolutional network to recognize features of the fracturing material, along with a recurrent neural network to model the evolution of these features. In this way, we simultaneously generate predictions for qualitatively distinct material properties. Our prediction for total damage in a coalesced fracture, at the final simulation time step, is within 3% of its actual value, and our prediction for total length of a coalesced fracture is within 2%. We also develop a novel form of data augmentation that compensates for the modest size of our training data, and an ensemble learning approach that enables us to predict when the material fails, with a mean absolute error of approximately 15%.
The choice of activation function can significantly influence the performance of neural networks. The lack of guiding principles for the selection of activation function is lamentable. We try to address this issue by introducing our variational neural networks, where the activation function is represented as a linear combination of possible candidate functions, and an optimal activation is obtained via minimization of a loss function using gradient descent method. The gradient formulae for the loss function with respect to these expansion coefficients are central for the implementation of gradient descent algorithm, and here we derive these gradient formulae.
Change detection involves segmenting sequential data such that observations in the same segment share some desired properties. Multivariate change detection continues to be a challenging problem due to the variety of ways change points can be correlated across channels and the potentially poor signal-to-noise ratio on individual channels. In this paper, we are interested in locating additive outliers (AO) and level shifts (LS) in the unsupervised setting. We propose ABACUS, Automatic BAyesian Changepoints Under Sparsity, a Bayesian source separation technique to recover latent signals while also detecting changes in model parameters. Multi-level sparsity achieves both dimension reduction and modeling of signal changes. We show ABACUS has competitive or superior performance in simulation studies against state-of-the-art change detection methods and established latent variable models. We also illustrate ABACUS on two real applications: modeling genomic profiles and analyzing household electricity consumption.
Given a sequential learning algorithm and a target model, sequential machine teaching aims to find the shortest training sequence to drive the learning algorithm to the target model. We present the first principled way to find such shortest training sequences. Our key insight is to formulate sequential machine teaching as a time-optimal control problem. This allows us to solve sequential teaching by leveraging key theoretical and computational tools developed over the past 60 years in the optimal control community. Specifically, we study the Pontryagin Maximum Principle, which yields a necessary condition for optimality of a training sequence. We present analytic, structural, and numerical implications of this approach on a case study with a least-squares loss function and gradient descent learner. We compute optimal training sequences for this problem, and although the sequences seem circuitous, we find that they can vastly outperform the best available heuristics for generating training sequences.
Batch Normalization (BN) has been used extensively in deep learning to achieve a faster training process and better resulting models. However, whether BN works strongly depends on how the batches are constructed during training, and it may not converge to a desired solution if the statistics on a batch are not close to the statistics over the whole dataset. In this paper, we try to understand BN from an optimization perspective by formulating the optimization problem which motivates BN. We show when BN works and when BN does not work by analyzing the optimization problem. We then propose a refinement of BN based on compositional optimization techniques called Full Normalization (FN) to alleviate the issues of BN when the batches are not constructed ideally. We provide convergence analysis for FN and empirically study its effectiveness to refine BN.
In this paper we compute the stopping times in the game Rock-Paper-Scissors. By exploiting a recurrence relation we compute the mean values of the stopping times. On the other hand, by constructing a transition matrix for a Markov chain associated with the game, we also obtain the distribution of the stopping times, from which we recover the mean stopping times. We then show that the mean stopping times increase exponentially fast as the number of participants increases.
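A minimal sketch of the transition-matrix idea, for the two-player game only (the paper treats a general number of participants). Everything follows from the fact that a round of two-player Rock-Paper-Scissors is a tie with probability 1/3, so the stopping time is geometric; the standard absorption-time formula recovers its mean:

```r
# Two-player Rock-Paper-Scissors: each round is a tie with probability 1/3,
# otherwise one player wins and the game stops.
# States: 1 = still playing, 2 = done (absorbing).
P <- matrix(c(1/3, 2/3,
              0,   1),
            nrow = 2, byrow = TRUE)

# The mean absorption time t solves (I - Q) t = 1, where Q is the block of
# the transition matrix restricted to transient states.
Q <- P[1, 1, drop = FALSE]
t <- solve(diag(1) - Q, rep(1, 1))
t  # 1.5 rounds on average, i.e. 1 / (2/3)
```

With more participants the transient block grows (one state per count of remaining players), which is where the exponential growth in mean stopping time comes from.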
To improve the off-sample generalization of classical procedures minimizing the empirical risk under potentially heavy-tailed data, new robust learning algorithms have been proposed in recent years, with generalized median-of-means strategies being particularly salient. These procedures enjoy performance guarantees in the form of sharp risk bounds under weak moment assumptions on the underlying loss, but typically suffer from a large computational overhead and substantial bias when the data happens to be sub-Gaussian, limiting their utility. In this work, we propose a novel robust gradient descent procedure which makes use of a smoothed multiplicative noise applied directly to observations before constructing a sum of soft-truncated gradient coordinates. We show that the procedure has competitive theoretical guarantees, with the major advantage of a simple implementation that does not require an iterative sub-routine for robustification. Empirical tests reinforce the theory, showing more efficient generalization over a much wider class of data distributions.
The connections within many real-world networks change over time. Thus, there has been a recent boom in studying temporal graphs. Recognizing patterns in temporal graphs requires a similarity measure to compare different temporal graphs. To this end, we initiate the study of dynamic time warping (an established concept for mining time series data) on temporal graphs. We propose the dynamic temporal graph warping distance (dtgw) to determine the (dis-)similarity of two temporal graphs. Our novel measure is flexible and can be applied in various application domains. We show that computing the dtgw-distance is a challenging (NP-hard) optimization problem and identify some polynomial-time solvable special cases. Moreover, we develop a quadratic programming formulation and an efficient heuristic. Preliminary experiments indicate that the heuristic performs very well and that our concept yields meaningful results on real-world instances.
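As a minimal illustration of classical dynamic time warping, the time-series concept that the dtgw-distance builds on (this is the textbook scalar version, not the paper's graph variant):

```r
# Classical DTW between two numeric series via dynamic programming.
# D[i+1, j+1] = local cost of matching a[i] with b[j], plus the cheapest of
# the three predecessor alignments (insertion, deletion, match).
dtw_distance <- function(a, b) {
    n <- length(a); m <- length(b)
    D <- matrix(Inf, n + 1, m + 1)
    D[1, 1] <- 0
    for (i in seq_len(n)) {
        for (j in seq_len(m)) {
            cost <- abs(a[i] - b[j])
            D[i + 1, j + 1] <- cost + min(D[i, j + 1],   # skip a[i]
                                          D[i + 1, j],   # skip b[j]
                                          D[i, j])       # match both
        }
    }
    D[n + 1, m + 1]
}

dtw_distance(c(0, 1, 2), c(0, 0, 1, 2))  # 0: the warping absorbs the repeat
```

The dtgw-distance adds a second layer on top of this: at each pair of time steps the local cost is itself a vertex matching between two graph snapshots, which is what makes the problem NP-hard.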
Whereas most dimensionality reduction techniques (e.g. PCA, ICA, NMF) for multivariate data essentially rely on linear algebra to a certain extent, summarizing ranking data, viewed as realizations of a random permutation $\Sigma$ on a set of items indexed by $i\in \{1,\ldots,\; n\}$, is a great statistical challenge, due to the absence of vector space structure for the set of permutations $\mathfrak{S}_n$. It is the goal of this article to develop an original framework for possibly reducing the number of parameters required to describe the distribution of a statistical population composed of rankings/permutations, on the premise that the collection of items under study can be partitioned into subsets/buckets, such that, with high probability, items in a certain bucket are either all ranked higher or else all ranked lower than items in another bucket. In this context, $\Sigma$‘s distribution can be hopefully represented in a sparse manner by a bucket distribution, i.e. a bucket ordering plus the ranking distributions within each bucket. More precisely, we introduce a dedicated distortion measure, based on a mass transportation metric, in order to quantify the accuracy of such representations. The performance of buckets minimizing an empirical version of the distortion is investigated through a rate bound analysis. Complexity penalization techniques are also considered to select the shape of a bucket order with minimum expected distortion. Beyond theoretical concepts and results, numerical experiments on real ranking data are displayed in order to provide empirical evidence of the relevance of the approach promoted.
Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
We discuss deep reinforcement learning in an overview style. We draw a big picture, filled with details. We discuss six core elements, six important mechanisms, and twelve applications, focusing on contemporary work and on historical context. We start with the background of artificial intelligence, machine learning, deep learning, and reinforcement learning (RL), with resources. Next we discuss RL core elements, including value function, policy, reward, model, exploration vs. exploitation, and representation. Then we discuss important mechanisms for RL, including attention and memory, unsupervised learning, hierarchical RL, multi-agent RL, relational RL, and learning to learn. After that, we discuss RL applications, including games, robotics, natural language processing (NLP), computer vision, finance, business management, healthcare, education, energy, transportation, computer systems, and science, engineering, and art. Finally we summarize briefly, discuss challenges and opportunities, and close with an epilogue.
Unsupervised ensemble learning has long been an interesting yet challenging problem that has come to prominence in recent years with the increasing demand for crowdsourcing in various applications. In this paper, we propose a novel method, unsupervised ensemble learning via Ising model approximation (unElisa), that combines a pruning step with a predicting step. We focus on the binary case and use an Ising model to characterize interactions between the ensemble and the underlying true classifier. The presence of an edge between an observed classifier and the true classifier indicates a direct dependence, whereas the absence indicates that the corresponding classifier provides no additional information and shall be eliminated. This observation leads to the pruning step, where the key is to recover the neighborhood of the true classifier. We show that it can be recovered successfully with exponentially decaying error in the high-dimensional setting by performing nodewise $\ell_1$-regularized logistic regression. The pruned ensemble allows us to get a consistent estimate of the Bayes classifier for prediction. We also propose an augmented version of majority voting that reverses all labels given by a subgroup of the pruned ensemble. We demonstrate the efficacy of our method through extensive numerical experiments and through an application to EHR-based phenotyping prediction for Rheumatoid Arthritis (RA) using data from Partners Healthcare System.
Null hypothesis significance testing remains popular despite decades of concern about misuse and misinterpretation. We believe that much of the problem is due to language: significance testing has little to do with other meanings of the word ‘significance’. Despite the limitations of null-hypothesis tests, we argue here that they remain useful in many contexts as a guide to whether a certain effect can be seen clearly in that context (e.g. whether we can clearly see that a correlation or between-group difference is positive or negative). We therefore suggest that researchers describe the conclusions of null-hypothesis tests in terms of statistical ‘clarity’ rather than statistical ‘significance’. This simple semantic change could substantially enhance clarity in statistical communication.
Opinion polls remain among the most efficient and widespread methods to capture psycho-social data at large scales. However, there are limitations on the logistics and structure of opinion polls that restrict the amount and type of information that can be collected. As a consequence, data from opinion polls are often reported in simple percentages and analyzed non-parametrically. In this paper, response data on just four questions from a national opinion poll were used to demonstrate that a parametric scale can be constructed using item response modeling approaches. Developing a parametric scale yields interval-level measures which are more useful than the strictly ordinal-level measures obtained from Likert-type scales common in opinion polls. The metric that was developed in this paper, a measure of religious morality, can be processed and used in a wider range of statistical analyses compared to conventional approaches of simply reporting percentages at item-level. Finally, this paper reports the item parameters so that researchers can adopt these items to future instruments and place their own results on the same scale, thereby allowing responses from future samples to be compared to the results from the representative data in this paper.
We present a predictor-corrector framework, called PicCoLO, that can transform a first-order model-free reinforcement or imitation learning algorithm into a new hybrid method that leverages predictive models to accelerate policy learning. The new ‘PicCoLOed’ algorithm optimizes a policy by recursively repeating two steps: In the Prediction Step, the learner uses a model to predict the unseen future gradient and then applies the predicted estimate to update the policy; in the Correction Step, the learner runs the updated policy in the environment, receives the true gradient, and then corrects the policy using the gradient error. Unlike previous algorithms, PicCoLO corrects for the mistakes of using imperfect predicted gradients and hence does not suffer from model bias. The development of PicCoLO is made possible by a novel reduction from predictable online learning to adversarial online learning, which provides a systematic way to modify existing first-order algorithms to achieve the optimal regret with respect to predictable information. We show, in both theory and simulation, that the convergence rate of several first-order model-free algorithms can be improved by PicCoLO.
We consider the problem of balancing exploration and exploitation in sequential decision making problems. To explore efficiently, it is vital to consider the uncertainty over all consequences of a decision, and not just those that follow immediately; the uncertainties involved need to be propagated according to the dynamics of the problem. To this end, we develop Successor Uncertainties, a probabilistic model for the state-action value function of a Markov Decision Process that propagates uncertainties in a coherent and scalable way. We relate our approach to other classical and contemporary methods for exploration and present an empirical analysis.
Imitation learning provides an appealing framework for autonomous control: in many tasks, demonstrations of preferred behavior can be readily obtained from human experts, removing the need for costly and potentially dangerous online data collection in the real world. However, policies learned with imitation learning have limited flexibility to accommodate varied goals at test time. Model-based reinforcement learning (MBRL) offers considerably more flexibility, since a predictive model learned from data can be used to achieve various goals at test time. However, MBRL suffers from two shortcomings. First, the predictive model does not help to choose desired or safe outcomes — it reasons only about what is possible, not what is preferred. Second, MBRL typically requires additional online data collection to ensure that the model is accurate in those situations that are actually encountered when attempting to achieve test time goals. Collecting this data with a partially trained model can be dangerous and time-consuming. In this paper, we aim to combine the benefits of imitation learning and MBRL, and propose imitative models: probabilistic predictive models able to plan expert-like trajectories to achieve arbitrary goals. We find this method substantially outperforms both direct imitation and MBRL in a simulated autonomous driving task, and can be learned efficiently from a fixed set of expert demonstrations without additional online data collection. We also show that our model can flexibly incorporate user-supplied costs at test time, can plan to sequences of goals, and can even perform well with imprecise goals, including goals on the wrong side of the road.

### Getting the data from the Luxembourguish elections out of Excel

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

In this blog post, similar to a previous blog post,
I am going to show you how we can go from an Excel workbook that contains data to a flat file. I will
take advantage of the structure of the tables inside the Excel sheets by writing a function
that extracts the tables, and then map it over each sheet!

Last week, October 14th, Luxembourguish nationals went to the polls to elect the Grand Duke! No,
actually, the Grand Duke does not get elected. But Luxembourguish citizens did go to the polls
to elect the new members of the Chamber of Deputies (a sort of parliament, if you will).
The way elections work in Luxembourg is quite interesting: you can vote for a party, or vote
for individual candidates from different parties. The candidates that get the most votes will
then sit in parliament. If you vote for a whole party,
each of its candidates gets a vote. You get as many votes as there are seats to fill. So,
for example, if you live in the capital city, also called Luxembourg, you get 21 votes to distribute.
You could decide to give 10 votes to 10 candidates of party A and 11 votes to 11 candidates of party B.
Why 21 votes? The Chamber of Deputies is made up of 60 deputies, and the country is divided into four
legislative circonscriptions. Each voter in a circonscription gets a number of votes that is
proportional to the population size of that circonscription.

Now you may wonder why I put the flag of Gambia at the top of this post. This is because the
government that was formed after the 2013 elections was made up of a coalition of three parties:
the Luxembourg Socialist Workers' Party, the Democratic Party and The Greens.
The LSAP managed to get 13 seats in the Chamber, while the DP got 13 and The Greens 6,
meaning 32 seats out of 60. Because they formed this coalition, they could form the government,
and the coalition was named the Gambia coalition after the colors of these three parties:
red, blue and green. If you are curious, you can also take a look at the ballot from 2013 for the southern circonscription.

Now that you have the context, we can go back to some data science. The results of the elections
of last week can be found on Luxembourg's Open Data portal, right here.
The data is trapped inside Excel sheets; just as I explained in a previous blog post,
the data is easily read by humans, but not easily digested by any type of data analysis software.
So I am going to show you how to go from this big Excel workbook to a flat file.

First of all, if you open the Excel workbook, you will notice that there are a lot of sheets: there
is one for the whole country, named "Le Grand-Duché de Luxembourg", one for each of the four circonscriptions,
"Centre", "Nord", "Sud" and "Est", and 102 more, one for each commune of the country (a commune is an
administrative division). However, the tables are all very similarly shaped, and roughly at the
same position.

This is good, because we can write a function to extract the data and then map it over
all the sheets. First, let's load some packages and the data for the whole country:

library("tidyverse")
library("tidyxl")
library("brotools")

# National Level 2018
elections_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
                                 sheets = "Le Grand-Duché de Luxembourg")
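If you have never used {tidyxl}: xlsx_cells() returns one row per spreadsheet cell, with each cell's content split across type-specific columns such as character and numeric. Here is a hand-built toy of that shape (the values are taken from the console output shown further down, not read from the workbook itself):

```r
library(tibble)

# One row per cell; text lands in `character`, numbers in `numeric`,
# and the unused column holds NA.
toy_cells <- tribble(
    ~sheet,                         ~row, ~col, ~character,              ~numeric,
    "Le Grand-Duché de Luxembourg", 11,   1,    "1 - PIRATEN - PIRATEN", NA,
    "Le Grand-Duché de Luxembourg", 12,   2,    NA,                      0.0645,
    "Le Grand-Duché de Luxembourg", 13,   1,    "Suffrage total",        NA,
    "Le Grand-Duché de Luxembourg", 14,   2,    NA,                      227549
)
```

This cell-per-row layout is exactly what makes it possible to filter by row and col below.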

{brotools} is my own package. You can install it with:

devtools::install_github("b-rodrigues/brotools")

It contains a function that I will use below. The function I wrote to extract the tables
is not very complex, but it requires that you are familiar with how {tidyxl} imports Excel workbooks.
If you are not, study the imported data frame for a few minutes; it will make
understanding the next function easier:

extract_party <- function(dataset, starting_col, target_rows){

    almost_clean <- dataset %>%
        filter(row %in% target_rows) %>%
        filter(col %in% c(starting_col, starting_col + 1)) %>%
        select(character, numeric) %>%
        fill(numeric, .direction = "up") %>%
        filter(!is.na(character))

    party_name <- almost_clean$character[1] %>%
        str_split("-", simplify = TRUE) %>%
        .[2] %>%
        str_trim()

    almost_clean$character[1] <- "Pourcentage"
    almost_clean$party <- party_name

    colnames(almost_clean) <- c("Variables", "Values", "Party")

    almost_clean %>%
        mutate(Year = 2018) %>%
        select(Party, Year, Variables, Values)
}

This function has three arguments: dataset, starting_col and target_rows. dataset is the data I loaded with xlsx_cells() from the {tidyxl} package. The function first filters only the rows we are interested in, then the columns. I then select the columns I need, which are called character and numeric (if an Excel cell contains text, you will find it in the character column; if it contains numbers, you will find them in the numeric column). Then I fill the empty cells in the numeric column, and finally I remove the NA's. These last two steps might not be so clear; this is what the data looks like up to the select() step:

> elections_raw_2018 %>%
+     filter(row %in% seq(11, 19)) %>%
+     filter(col %in% c(1, 2)) %>%
+     select(character, numeric)

# A tibble: 18 x 2
   character                   numeric
 1 1 - PIRATEN - PIRATEN            NA
 2 NA                           0.0645
 3 Suffrage total                   NA
 4 NA                           227549
 5 Suffrages de liste               NA
 6 NA                           181560
 7 Suffrage nominatifs              NA
 8 NA                            45989
 9 Pourcentage pondéré              NA
10 NA                           0.0661
11 Suffrage total pondéré           NA
12 NA                           13394.
13 Suffrages de liste pondéré       NA
14 NA                            10308
15 Suffrage nominatifs pondéré      NA
16 NA                            3086.
17 Mandats attribués                NA
18 NA                                2

By filling the NA's in the numeric column upwards, the data now looks like this:

> elections_raw_2018 %>%
+     filter(row %in% seq(11, 19)) %>%
+     filter(col %in% c(1, 2)) %>%
+     select(character, numeric) %>%
+     fill(numeric, .direction = "up")

# A tibble: 18 x 2
   character                   numeric
 1 1 - PIRATEN - PIRATEN        0.0645
 2 NA                           0.0645
 3 Suffrage total               227549
 4 NA                           227549
 5 Suffrages de liste           181560
 6 NA                           181560
 7 Suffrage nominatifs           45989
 8 NA                            45989
 9 Pourcentage pondéré          0.0661
10 NA                           0.0661
11 Suffrage total pondéré       13394.
12 NA                           13394.
13 Suffrages de liste pondéré    10308
14 NA                            10308
15 Suffrage nominatifs pondéré   3086.
16 NA                            3086.
17 Mandats attribués                 2
18 NA                                2

Then I filter out the NA's from the character column, and that's almost it! I simply need to add a new column with the party's name and rename the other columns. I also add a "Year" column.

Now, each party has a different starting column. The table with the data for the first party starts in column 1, for the second party in column 4, in column 7 for the third party, and so on. The following vector contains all the starting columns:

position_parties_national <- seq(1, 24, by = 3)

(If you study the Excel workbook closely, you will notice that I do not extract the last two parties. This is because these parties were not present in all 4 circonscriptions and are very, very, very small.)

The target rows are always the same, from 11 to 19. Now I simply need to map this function over the list of positions and I get the data for all the parties:

elections_national_2018 <- map_df(position_parties_national, extract_party,
                                  dataset = elections_raw_2018,
                                  target_rows = seq(11, 19)) %>%
    mutate(locality = "Grand-Duchy of Luxembourg", division = "National")

I also added the locality and division columns to the data. Let's take a look:

glimpse(elections_national_2018)
## Observations: 72
## Variables: 6
## $ Party     <chr> "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN...
## $ Year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ Variables <chr> "Pourcentage", "Suffrage total", "Suffrages de liste...
## $ Values    <dbl> 6.446204e-02, 2.275490e+05, 1.815600e+05, 4.598900e+...
## $ locality  <chr> "Grand-Duchy of Luxembourg", "Grand-Duchy of Luxembo...
## $ division  <chr> "National", "National", "National", "National", "Nat...

Very nice. Now we need to do the same for the 4 electoral circonscriptions. First, let's load the data:

# Electoral districts 2018
districts <- c("Centre", "Nord", "Sud", "Est")

elections_district_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
                                          sheets = districts)

Now things get trickier. Remember I said that the number of seats is proportional to the population of each circonscription? We simply can't use the same target rows as before. For example, for the "Centre" circonscription the target rows go from 12 to 37, but for the "Est" circonscription only from 12 to 23. Ideally, we would need a function that returns the target rows. This is that function:

# The target rows I need to extract are different from district to district
get_target_rows <- function(dataset, sheet_to_extract, reference_address){

    last_row <- dataset %>%
        filter(sheet == sheet_to_extract) %>%
        filter(address == reference_address) %>%
        pull(numeric)

    seq(12, (11 + 5 + last_row))
}

This function needs a dataset, a sheet_to_extract and a reference_address. The reference address is a cell that contains the number of seats in that circonscription, in our case "B5".
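As a quick sanity check on that row arithmetic (the 21-seat figure for "Centre" comes from the voting explanation earlier in the post):

```r
# "Centre" has 21 seats, so the function should return rows
# 12 .. (11 + 5 + 21), i.e. 12 through 37.
seq(12, 11 + 5 + 21)
```

This matches the first element of list_targets computed below, and for "Est" the same formula with 7 seats gives rows 12 through 23.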
We can easily get the list of target rows now:

# Get the target rows
list_targets <- map(districts, get_target_rows,
                    dataset = elections_district_raw_2018,
                    reference_address = "B5")

list_targets
## [[1]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [24] 35 36 37
##
## [[2]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25
##
## [[3]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [24] 35 36 37 38 39
##
## [[4]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23

Now, let's split the data we imported into a list, where each element is a data frame with the data from one circonscription:

list_data_districts <- map(districts,
                           ~filter(.data = elections_district_raw_2018, sheet == .))

Now I can easily map the function I defined above, extract_party, over this list of datasets. Well, I say easily, but it's a bit more complicated than before, because I now have a list of datasets and a list of target rows:

elections_district_2018 <- map2(.x = list_data_districts, .y = list_targets,
                                ~map_df(position_parties_national, extract_party,
                                        dataset = .x, target_rows = .y))

The way to understand this is that for each element of list_data_districts and list_targets, I map extract_party over each element of position_parties_national. This gives the intended result:

elections_district_2018
## [[1]]
## # A tibble: 208 x 4
##    Party   Year  Variables            Values
##  1 PIRATEN 2018  Pourcentage          0.0514
##  2 PIRATEN 2018  CLEMENT Sven (1)     8007
##  3 PIRATEN 2018  WEYER Jerry (2)      3446
##  4 PIRATEN 2018  CLEMENT Pascal (3)   3418
##  5 PIRATEN 2018  KUNAKOVA Lucie (4)   2860
##  6 PIRATEN 2018  WAMPACH Jo (14)      2693
##  7 PIRATEN 2018  LAUX Cynthia (6)     2622
##  8 PIRATEN 2018  ISEKIN Christian (5) 2610
##  9 PIRATEN 2018  SCHWEICH Georges (9) 2602
## 10 PIRATEN 2018  LIESCH Mireille (8)  2551
## # ... with 198 more rows
##
## [[2]]
## # A tibble: 112 x 4
##    Party   Year  Variables                          Values
##  1 PIRATEN 2018  Pourcentage                        0.0767
##  2 PIRATEN 2018  COLOMBERA Jean (2)                 5074
##  3 PIRATEN 2018  ALLARD Ben (1)                     4225
##  4 PIRATEN 2018  MAAR Andy (3)                      2764
##  5 PIRATEN 2018  GINTER Joshua (8)                  2536
##  6 PIRATEN 2018  DASBACH Angelika (4)               2473
##  7 PIRATEN 2018  GRÜNEISEN Sam (6)                  2408
##  8 PIRATEN 2018  BAUMANN Roy (5)                    2387
##  9 PIRATEN 2018  CONRAD Pierre (7)                  2280
## 10 PIRATEN 2018  TRAUT ép. MOLITOR Angela Maria (9) 2274
## # ... with 102 more rows
##
## [[3]]
## # A tibble: 224 x 4
##    Party   Year  Variables                 Values
##  1 PIRATEN 2018  Pourcentage               0.0699
##  2 PIRATEN 2018  GOERGEN Marc (1)          9818
##  3 PIRATEN 2018  FLOR Starsky (2)          6737
##  4 PIRATEN 2018  KOHL Martine (3)          6071
##  5 PIRATEN 2018  LIESCH Camille (4)        6025
##  6 PIRATEN 2018  KOHL Sylvie (6)           5628
##  7 PIRATEN 2018  WELTER Christian (5)      5619
##  8 PIRATEN 2018  DA GRAÇA DIAS Yanick (10) 5307
##  9 PIRATEN 2018  WEBER Jules (7)           5301
## 10 PIRATEN 2018  CHMELIK Libor (8)         5247
## # ... with 214 more rows
##
## [[4]]
## # A tibble: 96 x 4
##    Party   Year  Variables                       Values
##  1 PIRATEN 2018  Pourcentage                     0.0698
##  2 PIRATEN 2018  FRÈRES Daniel (1)               4152
##  3 PIRATEN 2018  CLEMENT Jill (7)                1943
##  4 PIRATEN 2018  HOUDREMONT Claire (2)           1844
##  5 PIRATEN 2018  BÖRGER Nancy (3)                1739
##  6 PIRATEN 2018  MARTINS DOS SANTOS Catarina (6) 1710
##  7 PIRATEN 2018  BELLEVILLE Tatjana (4)          1687
##  8 PIRATEN 2018  CONTRERAS Gerald (5)            1687
##  9 PIRATEN 2018  Suffrages total                 14762
## 10 PIRATEN 2018  Suffrages de liste              10248
## # ... with 86 more rows

I now need to add the locality and division columns:

elections_district_2018 <- map2(.y = elections_district_2018, .x = districts,
                                ~mutate(.y, locality = .x, division = "Electoral district")) %>%
    bind_rows()

We're almost done! Now we need to do the same for the 102 remaining sheets, one for each commune of Luxembourg.
This will now go very fast, because we have all the building blocks from before:

communes <- xlsx_sheet_names("leg-2018-10-14-22-58-09-737.xlsx")

communes <- communes %-l%
    c("Le Grand-Duché de Luxembourg", "Centre", "Est", "Nord", "Sud", "Sommaire")

Let me introduce the following function: %-l%. This function removes elements from lists:

c("a", "b", "c", "d") %-l% c("a", "d")
## [1] "b" "c"

You can think of it as "minus for lists". This is called an infix operator. This function is very useful for getting the list of communes, and it is part of my package, {brotools}.

As before, I load the data:

elections_communes_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
                                          sheets = communes)

Then I get my list of targets, but I need to change the reference address; it's "B8" now, instead of "B5":

# Get the target rows
list_targets <- map(communes, get_target_rows,
                    dataset = elections_communes_raw_2018,
                    reference_address = "B8")

I now create a list of communes by mapping a filter function to the data:

list_data_communes <- map(communes,
                          ~filter(.data = elections_communes_raw_2018, sheet == .))

And just as before, I get the data I need by using extract_party, and adding the "locality" and "division" columns:

elections_communes_2018 <- map2(.x = list_data_communes, .y = list_targets,
                                ~map_df(position_parties_national, extract_party,
                                        dataset = .x, target_rows = .y))

elections_communes_2018 <- map2(.y = elections_communes_2018, .x = communes,
                                ~mutate(.y, locality = .x, division = "Commune")) %>%
    bind_rows()

The steps are so similar for the four circonscriptions and for the 102 communes that I could have written a big wrapper function and used it for both at once. But I was lazy.
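For readers curious about how a "minus for lists" operator might work, here is a minimal sketch; the real %-l% lives in the author's {brotools} package and may be implemented differently (for instance, it may handle proper list objects rather than just vectors):

```r
# A hypothetical re-implementation of a "minus for lists" infix operator:
# keep the elements of x that do not appear in y.
`%-l%` <- function(x, y) x[!(x %in% y)]

c("a", "b", "c", "d") %-l% c("a", "d")
## [1] "b" "c"
```

Defining a function whose name is wrapped in `%...%` backticks is all R needs to make it usable in infix position.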
Finally, I bind everything together and have a nice, tidy, flat file:

# Final results
elections_2018 <- bind_rows(list(elections_national_2018,
                                 elections_district_2018,
                                 elections_communes_2018))

glimpse(elections_2018)
## Observations: 15,544
## Variables: 6
## $ Party     <chr> "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN...
## $ Year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ Variables <chr> "Pourcentage", "Suffrage total", "Suffrages de liste...
## $ Values    <dbl> 6.446204e-02, 2.275490e+05, 1.815600e+05, 4.598900e+...
## $ locality  <chr> "Grand-Duchy of Luxembourg", "Grand-Duchy of Luxembo...
## $ division  <chr> "National", "National", "National", "National", "Nat...

This blog post is already quite long, so I will analyze the data, now that R can easily ingest it, in a future blog post. If you found this blog post useful, you might want to follow me on twitter for blog post updates.

## October 20, 2018

### R Packages worth a look

WHOIS Server Querying (Rwhois)
Queries data from WHOIS servers.

A Shiny Application for Automatic Measurements of Tree-Ring Widths on Digital Images (MtreeRing)
Use morphological image processing and edge detection algorithms to automatically identify tree-ring boundaries on digital images. Tree-ring boundaries …

Functions for the Lognormal Distribution (lognorm)
The lognormal distribution (Limpert et al. (2001) <doi:10.1641/0006-3568(2001)051[0341:lndats]2.0.co;2>) can characterize uncertainty that is bounde …

### Book Memo: “Asymmetric Kernel Smoothing”

Theory and Applications in Economics and Finance
This is the first book to provide an accessible and comprehensive introduction to a newly developed smoothing technique using asymmetric kernel functions. Further, it discusses the statistical properties of estimators and test statistics using asymmetric kernels. The topics addressed include the bias-variance tradeoff, smoothing parameter choices, achieving rate improvements with bias reduction techniques, and estimation with weakly dependent data.
Further, the large- and finite-sample properties of estimators and test statistics smoothed by asymmetric kernels are compared with those smoothed by symmetric kernels. Lastly, the book addresses the applications of asymmetric kernel estimation and testing to various forms of nonnegative economic and financial data. Until recently, the most popularly chosen nonparametric methods used symmetric kernel functions to estimate probability density functions of symmetric distributions with unbounded support. Yet many types of economic and financial data are nonnegative and violate the presumed conditions of conventional methods. Examples include incomes, wages, short-term interest rates, and insurance claims. Such observations are often concentrated near the boundary and have long tails with sparse data. Smoothing with asymmetric kernel functions has increasingly gained attention, because the approach successfully addresses the issues arising from distributions that have natural boundaries at the origin and heavy positive skewness. Offering an overview of recently developed kernel methods, complemented by intuitive explanations and mathematical proofs, this book is highly recommended to all readers seeking an in-depth and up-to-date guide to nonparametric estimation methods employing asymmetric kernel smoothing.

### Magister Dixit

“I basically can’t hire people who don’t know Git.” Eric Jonas

### Science and Technology links (October 20th, 2018)

1. Should we stop eating meat to combat climate change? Maybe not. White and Hall worked out what would happen if the US stopped using farm animals:

The modeled system without animals (…) only reduced total US greenhouse gas emissions by 2.6 percentage units. Compared with systems with animals, diets formulated for the US population in the plants-only systems (…) resulted in a greater number of deficiencies in essential nutrients.
(source: PNAS)

Of concern when considering farm animals are methane emissions. Methane is a potent greenhouse gas, with the caveat that it is short-lived in the atmosphere, unlike CO2. Should we be worried about methane despite its short life? According to the American EPA (Environmental Protection Agency), total methane emissions have been falling consistently for the last 20 years. That should not surprise us: greenhouse gas emissions in most developed countries (including the US) peaked some time ago. Not emissions per capita, but total emissions.

So beef, at least in the US, is not a major contributor to climate change. But we could do even better. Several studies, like Stanley et al., report that well-managed grazing can lead to carbon sequestration in the grassland. There are certainly countries where animal grazing is an environmental disaster. Many industries throughout the world are a disaster, and we should definitely put pressure on the guilty parties. But, in case you were wondering, if you live in a country like Canada, McDonald’s not only serves locally produced beef, but also requires that it be produced in a sustainable manner. In any case, there are good reasons to stop eating meat, but in developed countries like the US and Canada, climate change seems like a bogus one. (Special thanks to professor Leroy for providing many useful pointers.)

2. News agencies reported this week that climate change could bring back the plague and the black death that wiped out Europe. The widely reported prediction was made by Professor Peter Frankopan at the Cheltenham Literary Festival. Frankopan is a history professor at Oxford.

3. There is a reverse correlation between funding and scientific output, meaning that beyond a certain point, you start getting less science for your dollars.
(…) prestigious institutions had on average 65% higher grant application success rates and 50% larger award sizes, whereas less-prestigious institutions produced 65% more publications and had a 35% higher citation impact per dollar of funding. These findings suggest that implicit biases and social prestige mechanisms (…) have a powerful impact on where (…) grant dollars go and the net return on taxpayers investments.

It is well documented that there are diminishing returns in research funding. Concentrating your research dollars into too few individuals is wasteful. My own explanation for this phenomenon is that, Elon Musk aside, we all have cognitive bottlenecks. One researcher might fruitfully carry two or three major projects at the same time, but once they supervise too many students and assistants, they become a “negative manager”, meaning that they make other researchers no more productive and often less productive. They spend less and less time optimizing the tools and instruments. If you talk with graduate students who work in lavishly funded laboratories, you will often hear (when the door is closed) about how poorly managed the projects are. People are forced into stupid directions; they do boring and useless work to satisfy project objectives that no longer make sense. Currently, “success” is often defined by how quickly you can acquire and spend money.

But how do you optimally distribute research dollars? It is tricky because, almost by definition, almost all research is worthless. You are mining for rare events. So it is akin to venture capital investing. You want to invest in many start-ups that have high potential.

4. A Nature column tries to define what makes a good PhD student:

the key attributes needed to produce a worthy PhD thesis are a readiness to accept failure; resilience; persistence; the ability to troubleshoot; dedication; independence; and a willingness to commit to very hard work — together with curiosity and a passion for research.
The two most common causes of hardship in PhD students are an inability to accept failure and choosing this career path for the prestige, rather than out of any real interest in research.

Continue Reading…

### Table of Contents for PIM

I am down to the home stretch for publishing my upcoming book, “A Programmer’s Introduction to Mathematics.” I don’t have an exact publication date—I’m self publishing—but after months of editing, I’ve only got two chapters left in which to apply edits that I’ve already marked up in my physical copy. That and some notes from external reviewers, and adding jokes and anecdotes and fun exercises as time allows. I’m committing to publishing by the end of the year. When that happens I’ll post here and also on the book’s mailing list. Here’s a sneak preview of the table of contents. And a shot of the cover design (still a work in progress).

Continue Reading…

### A Lazy Function

(This article was first published on CillianMacAodh, and kindly contributed to R-bloggers)

It has been quite a while since I posted, but I haven’t been idle: I completed my PhD since the last post, and I’m due to graduate next Thursday. I am also delighted to have recently been added to R-bloggers.com, so I’m keen to get back into it.

## A Lazy Function

I have already written two posts about writing functions, and I will try to diversify my content. That said, I won’t refrain from sharing something that has been helpful to me. The function(s) I describe in this post is an artefact left over from before I started using R Markdown. It is a product of its time, but may still be of use to people who haven’t switched to R Markdown yet. It is a lazy (and quite imperfect) solution to a tedious task.

### The Problem

At the time I wrote this function I was using R for my statistics and LibreOffice for writing. I would run a test in R and then write it up in LibreOffice.
Each value that needed reporting had to be transferred from my R output to LibreOffice, and for each test there are a number of values that need reporting. Writing up these tests is pretty formulaic. There’s a set structure to the sentence; for example, the write-up of a t-test with a significant result nearly always looks something like this:

An independent samples t-test revealed a significant difference in X between the Y sample, (M = [ ], SD = [ ]), and the Z sample, (M = [ ], SD = [ ]), t([df]) = [ ], p = [ ].

And the write-up of a non-significant result looks something like this:

An independent samples t-test revealed no significant difference in X between the Y sample, (M = [ ], SD = [ ]), and the Z sample, (M = [ ], SD = [ ]), t([df]) = [ ], p = [ ].

Seven values (the square [ ] brackets) need to be reported for this single test. Whether you copy and paste or type each value, the reporting of such tests can be very tedious and leave you prone to errors in reporting.

### The Solution

In order to make reporting values easier (and more accurate) I wrote the t_paragraph() function (and the related t_paired_paragraph() function). This provided an output that I could copy and paste into a Word (LibreOffice) document. This function is part of the desnum1 package (McHugh, 2017).

#### The t_paragraph() Function

The t_paragraph() function runs a t-test and generates an output that can be copied and pasted into a word document.
The code for the function is as follows:

# Create the function t_paragraph with arguments x, y, and measure
# x is the dependent variable
# y is the independent (grouping) variable
# measure is the name of the dependent variable, inputted as a string
t_paragraph <- function (x, y, measure){

# Run a t-test and store it as an object t
t <- t.test(x ~ y)

# If your grouping variable has labelled levels, the next line will store them for reporting at a later stage
labels <- levels(y)

# Create an object for each value to be reported
tsl <- as.vector(t$statistic)
ts <- round(tsl, digits = 3)
tpl <- as.vector(t$p.value)
tp <- round(tpl, digits = 3)
d_fl <- as.vector(t$parameter)
d_f <- round(d_fl, digits = 2)
ml <- as.vector(tapply(x, y, mean))
m <- round(ml, digits = 2)
sdl <- as.vector(tapply(x, y, sd))
sd <- round(sdl, digits = 2)

# Use print(paste0()) to combine the objects above and create two potential outputs
# The output that is generated will depend on the result of the test

# wording if significant difference is observed

if (tp < 0.05)
print(paste0("An independent samples t-test revealed a significant difference in ",
measure, " between the ", labels[1], " sample, (M = ",
m[1], ", SD = ", sd[1], "), and the ", labels[2],
" sample, (M =", m[2], ", SD =", sd[2], "), t(",
d_f, ") = ", ts, ", p = ", tp, "."), quote = FALSE,
digits = 2)

# wording if no significant difference is observed

if (tp > 0.05)
print(paste0("An independent samples t-test revealed no difference in ",
measure, " between the ", labels[1], " sample, (M = ",
m[1], ", SD = ", sd[1], "), and the ", labels[2],
" sample, (M = ", m[2], ", SD =", sd[2], "), t(",
d_f, ") = ", ts, ", p = ", tp, "."), quote = FALSE,
digits = 2)
}

When using t_paragraph(), x is your DV, y is your grouping variable, and measure is a string value giving the name of the dependent variable. To illustrate the function I’ll use the mtcars dataset.

#### Applications of the t_paragraph() Function

The mtcars dataset comes with R. For information on it, simply type help(mtcars). The variables of interest here are am (transmission; 0 = automatic, 1 = manual), mpg (miles per gallon), and qsec (1/4 mile time). The two questions I’m going to look at are:

1. Is there a difference in miles per gallon depending on transmission?
2. Is there a difference in 1/4 mile time depending on transmission?

Before running the test it is a good idea to look at the data2. Because we’re going to look at differences between groups, we want to run descriptives for each group separately. To do this I’m going to combine the descriptives() function, which I previously covered here (also part of the desnum package), and the tapply() function.

The tapply() function allows you to run a function on subsets of a dataset using a grouping variable (or index). The arguments are as follows: tapply(vector, index, function). vector is the variable you want to pass through function, and index is the grouping variable. The examples below will make this clearer.

We want to run descriptives on mtcars$mpg and on mtcars$qsec, and for each we want to group by transmission (mtcars$am). This can be done using tapply() and descriptives() together as follows:

tapply(mtcars$mpg, mtcars$am, descriptives)

## $`0`
##       mean       sd  min  max len
## 1 17.14737 3.833966 10.4 24.4  19
##
## $`1`
##       mean       sd min  max len
## 1 24.39231 6.166504  15 33.9  13

Recall that 0 = automatic, and 1 = manual. Replace mpg with qsec and run again:

tapply(mtcars$qsec, mtcars$am, descriptives)

## $`0`
##       mean       sd   min  max len
## 1 18.18316 1.751308 15.41 22.9  19
##
## $`1`
##    mean       sd  min  max len
## 1 17.36 1.792359 14.5 19.9  13

### Running t_paragraph()

Now that we know the values for automatic vs manual cars, we can run our t-tests using t_paragraph(). Our first question: Is there a difference in miles per gallon depending on transmission?

t_paragraph(mtcars$mpg, mtcars$am, "miles per gallon")

## [1] An independent samples t-test revealed a significant difference in miles per gallon between the sample, (M = 17.15, SD = 3.83), and the sample, (M =24.39, SD =6.17), t(18.33) = -3.767, p = 0.001.

There is a difference, and the output above can be copied and pasted into a word document with minimal changes required. Our second question was: Is there a difference in 1/4 mile time depending on transmission?

t_paragraph(mtcars$qsec, mtcars$am, "quarter-mile time")

## [1] An independent samples t-test revealed no difference in quarter-mile time between the sample, (M = 18.18, SD = 1.75), and the sample, (M = 17.36, SD =1.79), t(25.53) = 1.288, p = 0.209.

This time there was no significant difference, and again the output can be copied and pasted into word with minimal changes.

### Limitations

The function described was written a long time ago and could be updated. However, I no longer copy and paste into word (having switched to R Markdown instead). The reporting of the p value is not always to APA standards: if p is < .001, that is what should be reported. The code for t_paragraph() could be updated to include the p_report function (described here), which would address this. Another limitation is that the formatting of the text isn’t perfect; the letters (N, M, SD, t, p) should all be italicised, but having to manually fix this formatting is still easier than manually transferring individual values.

### Conclusion

Despite the limitations, the functions t_paragraph() and t_paired_paragraph()3 have made my life easier. I still use them occasionally. I hope they can be of use to anyone who is using R but has not switched to R Markdown yet.
### References

McHugh, C. (2017). Desnum: Creates some useful functions.

1. To install desnum just run devtools::install_github("cillianmiltown/R_desnum")
2. In this case this is particularly useful because there are no value labels for mtcars$am, so it won’t be clear from the output which values refer to the automatic group and which refer to the manual group. Running descriptives will help with this.
3. If you want to see the code for t_paired_paragraph() just load desnum and run t_paired_paragraph (without parentheses)


### Dr. Data Show Video: How Can You Trust AI?

This new web series breaks the mold for data science infotainment, captivating the planet with short webisodes that cover the very best of machine learning and predictive analytics.

### He’s a history teacher and he has a statistics question

Someone named Ian writes:

I am a History teacher who has become interested in statistics! The main reason for this is that I’m reading research papers about teaching practices to find out what actually “works.”

I’ve taught myself the basics of null hypothesis significance testing, though I confess I am no expert (Maths was never my strong point at school!). But I also came across your blog after I heard about this “replication crisis” thing.

I wanted to ask you a question, if I may.

Suppose a randomised controlled experiment is conducted with two groups and the mean difference turns out to be statistically significant at the .05 level. I’ve learnt from my self-study that this means:

“If there were genuinely no difference in the population, the probability of getting a result this big or bigger is less than 5%.”

So far, so good (or so I thought).

But from my recent reading, I’ve gathered that many people criticise studies for using “small samples.” What was interesting to me is that they criticise this even after a significant result has been found.

So they’re not saying “Your sample size was small so that may be why you didn’t find a significant result.” They’re saying: “Even though you did find a significant result, your sample size was small so your result can’t be trusted.”

I was just wondering whether you could explain why one should distrust significant results with small samples? Some people seem to be saying it’s because it may have been a chance finding. But isn’t that what the p-value is supposed to tell you? If p is less than 0.05, doesn’t that mean I can assume it (probably) wasn’t a “chance finding”?

My reply: See my paper, “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it,” recently published in the Personality and Social Psychology Bulletin. The short answer is that (a) it’s not hard to get p less than 0.05 just from chance, via forking paths, and (b) when effect sizes are small and a study is noisy, any estimate that reaches “statistical significance” is likely to be an overestimate, perhaps a huge overestimate.
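Point (b), that "significant" estimates from small, noisy studies tend to be overestimates, can be seen in a quick simulation. This is a sketch of the general phenomenon, not code from the paper; the true effect size, noise level, sample size, and the crude z-test are all illustrative assumptions:

```python
import random
import statistics

def mean_significant_estimate(true_effect=0.1, sd=1.0, n=20,
                              sims=5000, z_crit=1.96, seed=42):
    """Simulate many small two-group studies with a tiny true effect and
    average the effect estimates that happen to reach 'significance'."""
    rng = random.Random(seed)
    se = sd * (2.0 / n) ** 0.5        # standard error of a difference in means
    significant = []
    for _ in range(sims):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(true_effect, sd) for _ in range(n)]
        diff = statistics.mean(b) - statistics.mean(a)
        if abs(diff / se) > z_crit:   # crude z-test in place of a t-test
            significant.append(abs(diff))
    return statistics.mean(significant)
```

With these settings the true effect is 0.1, but any estimate that clears the significance bar must exceed 1.96 × se ≈ 0.62, so the average "significant" estimate is several times larger than the truth.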

### Document worth reading: “Deep learning: Technical introduction”

This note presents in a technical though hopefully pedagogical way the three most common forms of neural network architectures: Feedforward, Convolutional and Recurrent. For each network, their fundamental building blocks are detailed. The forward pass and the update rules for the backpropagation algorithm are then derived in full. Deep learning: Technical introduction

### Magister Dixit

“Regardless, it’s clear that Spark is a technology you can’t afford to ignore if you’re looking into modern processing of big datasets.” Donnie Berkholz (March 13, 2015)

### Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

1. Deduplication: eliminating duplicate (exact) copies of repeated data.
2. Record linkage: identifying records that reference the same entity across different sources.
3. Canonicalization: converting data with more than one possible representation into a standard form.
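To make the three tasks concrete, here is a minimal pure-Python sketch on a handful of invented records; real entity resolution (as with Dedupe) replaces the exact-key matching used here with learned, fuzzy comparisons:

```python
import re

# Three invented records: the first two are the same business with
# formatting variations; the third is a different entity.
records = [
    {"name": "ACME Corp.", "phone": "202-555-0101"},
    {"name": "Acme Corp",  "phone": "(202) 555-0101"},
    {"name": "Widget LLC", "phone": "334-555-0199"},
]

def canonicalize(rec):
    """Canonicalization: reduce a record to one standard form."""
    return {
        "name": re.sub(r"[^a-z0-9 ]", "", rec["name"].lower()).strip(),
        "phone": re.sub(r"\D", "", rec["phone"]),
    }

def key(rec):
    """Hashable canonical key for a record."""
    return tuple(sorted(canonicalize(rec).items()))

def deduplicate(recs):
    """Deduplication: keep one record per canonical form."""
    seen, unique = set(), []
    for rec in recs:
        if key(rec) not in seen:
            seen.add(key(rec))
            unique.append(rec)
    return unique

def link(recs_a, recs_b):
    """Record linkage: match records across two sources by canonical form."""
    index = {key(r): r for r in recs_a}
    return [(index[key(r)], r) for r in recs_b if key(r) in index]
```

Exact canonical keys only catch formatting differences; the point of the machine-learning approach described below is to also match records that differ in substance (nicknames, misspellings, missing fields).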

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

### How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

### Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example.py from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe


Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py


### Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
{'field' : 'Name', 'type': 'String'},
{'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
{'field' : 'Address', 'type': 'String', 'has missing' : True},
{'field' : 'Purchases', 'type': 'String'},
]


Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking, affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

• the first three digits of the phone number
• the full name
• the first five characters of the name
• a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, a variation on edit distance that makes runs of consecutive deletions or insertions cheaper than isolated ones.
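To make the affine gap idea concrete, here is a small pure-Python sketch of a Gotoh-style affine gap distance. Dedupe's own implementation is separate and optimized; the gap-open, gap-extend, and mismatch costs below are illustrative:

```python
from math import inf

def affine_gap_distance(a, b, mismatch=1.0, gap_open=2.0, gap_extend=1.0):
    """Affine gap edit distance: opening a gap costs more than extending
    one, so a single long insertion/deletion is cheaper than several
    scattered one-character gaps."""
    n, m = len(a), len(b)
    # Three DP layers: M aligns two characters; X is a gap in a; Y is a gap in b.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for j in range(1, m + 1):
        X[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        Y[i][0] = gap_open + (i - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + sub
            X[i][j] = min(M[i][j-1] + gap_open, X[i][j-1] + gap_extend)
            Y[i][j] = min(M[i-1][j] + gap_open, Y[i-1][j] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])
```

With these costs, deleting "def" from "abcdef" in one run costs one open plus two extends (4.0), while the same three deletions scattered through the string would cost three opens (6.0).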

Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).
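A minimal sketch of predicate blocking on hypothetical records (the IDs echo the example above; the names and phone numbers are invented):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical customer records; IDs echo the area-code example above.
customers = {
    452: {"name": "Pat Smith",  "phone": "2025550101"},
    821: {"name": "P. Smith",   "phone": "2025550199"},
    233: {"name": "Ann Jones",  "phone": "3345550123"},
    699: {"name": "Anne Jones", "phone": "3345550188"},
    720: {"name": "Sam Lee",    "phone": ""},
}

def area_code_predicate(rec):
    """Predicate: the first three digits of the phone number, or None."""
    return rec["phone"][:3] or None

def block_records(recs, predicate):
    """Group record IDs by predicate value; only records that share a
    block are ever compared pairwise."""
    blocks = defaultdict(list)
    for rid, rec in recs.items():
        blocks[predicate(rec)].append(rid)
    return dict(blocks)

blocks = block_records(customers, area_code_predicate)
# Candidate pairs come only from within blocks: 2 pairs instead of the
# 10 a full cross-comparison of 5 records would require.
candidate_pairs = [pair
                   for ids in blocks.values()
                   for pair in combinations(sorted(ids), 2)]
```

Real Dedupe combines many such predicates and learns which ones produce useful blocks; this sketch shows only the grouping step.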

The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

## Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

• (y)es: confirms that the two references are to the same entity
• (n)o: labels the two references as not the same entity
• (u)nsure: does not label the two references as the same entity or as different entities
• (f)inished: ends the active learning session and triggers the supervised learning phase
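The labeling session has roughly the following shape; this is a simplified sketch of the console interaction, not Dedupe's actual API:

```python
# A simplified sketch of the console labeling loop: show a pair, read one
# of y/n/u/f, and bucket the labels for later supervised learning.
def label_pairs(pairs, ask=input):
    labels = {"match": [], "distinct": [], "unsure": []}
    for a, b in pairs:
        print(a)
        print(b)
        print("Do these records refer to the same thing?")
        resp = ask("(y)es / (n)o / (u)nsure / (f)inished: ").strip().lower()
        if resp == "f":           # stop labeling, move on to learning
            break
        bucket = {"y": "match", "n": "distinct"}.get(resp, "unsure")
        labels[bucket].append((a, b))
    return labels
```

Passing a function as `ask` also makes the loop testable without a live terminal.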

As you can see in the example above, some comparison decisions are very easy. The first contains zero out of four hits on the attributes being examined, so the verdict is most certainly a non-match. On the second, we have a 3/4 exact match, with the fourth being fuzzy in that one entity contains a piece of the matched entity: Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label them as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
{'field' : 'Site name', 'type': 'String'},
{'field' : 'Address', 'type': 'String'},
{'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
{'field' : 'Phone', 'type': 'String', 'has missing' : True},
]



A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

## Something a Bit More Challenging

In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the WhiteHouse.gov website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

Database Field Field Description
NAMELAST Last name of entity
NAMEFIRST First name of entity
NAMEMID Middle name of entity
UIN Unique Identification Number
Type of Access Access type to White House
TOA Time of arrival
POA Post on arrival
TOD Time of departure
POD Post on departure
APPT_START_DATE When the appointment date is scheduled to start
APPT_END_DATE When the appointment date is scheduled to end
APPT_CANCEL_DATE When the appointment date was canceled
Total_People Total number of people scheduled to attend
LAST_UPDATEDBY Who was the last person to update this event
POST Classified as 'WIN'
LastEntryDate When the last update to this instance was made
TERMINAL_SUFFIX ID for terminal used to process visitor
visitee_namelast The visitee's last name
visitee_namefirst The visitee's first name
MEETING_LOC The location of the meeting
MEETING_ROOM The room number of the meeting
CALLER_NAME_LAST The authorizing person for the visitor's last name
CALLER_NAME_FIRST The authorizing person for the visitor's first name
CALLER_ROOM The authorizing person's room for the visitor
Description Description of the event or visit
RELEASE_DATE The date this set of logs was released to the public

Using the API, the White House Visitor Log Requests can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url, fname):
    """Download the file at url and save it to fname."""
    response = requests.get(url)
    # response.content is bytes, so write the file in binary mode
    with open(fname, 'wb') as f:
        f.write(response.content)

ORIGFILE = "fixtures/whitehouse-visitors.csv"

getData(DATAURL,ORIGFILE)


Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

## Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None

DATABASE = "your_db_name"
USER = "your_user_name"
HOST = "your_hostname"

try:
    conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST)
    print("I've connected")
except psycopg2.Error:
    print("I am unable to connect to the database")

cur = conn.cursor()


The other challenge with our dataset is its numerous missing values and datetime formatting irregularities. We wanted to use the datetime strings to help with entity resolution, so we needed the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module with PostgreSQL's fairly forgiving 'varchar' type.
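To illustrate the kind of normalization this buys us: dateutil will parse several inconsistent renderings of the same date, and toordinal() collapses them all to a single day number (the strings below are illustrative, not actual rows from the logs):

```python
from dateutil import parser

# Three inconsistent renderings of the same date (illustrative values)
raw = ["4/27/2010 13:00", "2010-04-27", "27 Apr 2010"]

ordinals = [parser.parse(s).toordinal() for s in raw]
print(len(set(ordinals)))  # 1: all three collapse to the same day number
```
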

This function takes the csv data as input, parses the datetime fields we're interested in ('apptmade', 'apptstart', 'apptend'), and outputs a database table that retains the desired columns ('lastname', 'firstname', 'uin', 'apptmade', 'apptstart', 'apptend', 'meeting_loc'). Keep in mind this will take a while to run.

DATEFIELDS = [10, 11, 12]  # column indices of apptmade, apptstart, apptend in the csv

def dateParseSQL(nfile):
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                    (visitor_id SERIAL PRIMARY KEY,
                     lastname    varchar,
                     firstname   varchar,
                     uin         varchar,
                     apptmade    varchar,
                     apptstart   varchar,
                     apptend     varchar,
                     meeting_loc varchar);''')
    conn.commit()
    with open(nfile, 'rU') as infile:
        reader = csv.reader(infile)
        next(reader)  # skip the header row
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except ValueError:
                        continue
            sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                   VALUES (%s,%s,%s,%s,%s,%s,%s)"
            cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
    conn.commit()
    print("All done!")

dateParseSQL(ORIGFILE)


About 60 of our rows contained non-ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';
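Before running the delete, the same filter can be previewed from Python (a sketch; the explicit byte range below is the equivalent of PostgreSQL's [:ascii:] character class):

```python
import re

# Flag any name containing a character outside the 7-bit ASCII range
non_ascii = re.compile(r'[^\x00-\x7f]')

print(bool(non_ascii.search("Muñoz")))  # True: this row would be deleted
print(bool(non_ascii.search("Munoz")))  # False: this row is kept
```
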


For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor


Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS = [
    {'field': 'firstname', 'variable name': 'firstname',
     'type': 'String', 'has missing': True},
    {'field': 'lastname', 'variable name': 'lastname',
     'type': 'String', 'has missing': True},
    {'field': 'uin', 'variable name': 'uin',
     'type': 'String', 'has missing': True},
    {'field': 'meeting_loc', 'variable name': 'meeting_loc',
     'type': 'String', 'has missing': True},
]


We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
    lset = set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print('{} blocks'.format(i))

        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])

        records.append((row[KEY_FIELD], row, smaller_ids))

    if records:
        yield records


And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)

    with psycopg2.connect(database=args.dbname,
                          host='localhost',
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)
                del temp_d

            if os.path.exists(args.training):
                print('Loading training file from {}'.format(args.training))
                with open(args.training) as tf:
                    deduper.readTraining(tf)

            print('Starting active learning')
            dedupe.convenience.consoleLabel(deduper)

            print('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)

            deduper.cleanupTraining()

            print('Creating blocking_map table')
            c.execute("""
                DROP TABLE IF EXISTS blocking_map
                """)
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("""
                    SELECT DISTINCT %s FROM %s
                    """ % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)
                c_index.close()

            print('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("""
                SELECT * FROM %s
                """ % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

            print('Inserting blocks into blocking_map')
            csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', delete=False)
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()

            f = open(csv_file.name, 'r')
            c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
            f.close()

            os.remove(csv_file.name)

            con.commit()

            print('Indexing blocks')
            c.execute("""
                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                """)
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                 block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print('Indexing block_key')
            c.execute("""
                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                """)

            print('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                    FROM blocking_map INNER JOIN plural_key
                    USING (block_key))
                """ % KEY_FIELD)

            c.execute("""
                CREATE INDEX plural_block_%s_idx
                ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                (SELECT %s,
                        string_agg(CAST(block_id AS TEXT), ','
                                   ORDER BY block_id) AS sorted_ids
                 FROM plural_block
                 GROUP BY %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print('Committing')

            print('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                                                 CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                 FROM plural_block
                 INNER JOIN covered_blocks
                 USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))
            con.commit()

            print('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                candidates_gen(c_cluster), threshold=0.5)

            print('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
            c.execute("""
                CREATE TABLE entity_map (
                    %s INTEGER,
                    canon_id INTEGER,
                    cluster_score FLOAT,
                    PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                    c.execute("""
                        INSERT INTO entity_map
                        (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            c_cluster.close()
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
            con.commit()


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-d', '--dbname', required=True, help='name of the database to connect to')
    parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json', help='name of training file')
    args = parser.parse_args()
    find_dupes(args)


## Active Learning Observations

We ran multiple experiments:

• Test 1: lastname, firstname, meeting_loc => 447 potential duplicates (15 minutes of training)
• Test 2: lastname, firstname, uin, meeting_loc => 3385 potential duplicates (5 minutes of training), including one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with common-sounding, domestic names, since those occur far more frequently in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed many apparent nicknames, which could equally have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).
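A rough string-similarity score shows why misspellings read as confident matches, while also showing why string distance alone cannot settle the gendered-name cases (a sketch using Python's stdlib difflib; dedupe learns its own distance measures, so these ratios are only illustrative):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio of matching characters between the two lowercased names (0 to 1)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(round(similarity("Davifd Culp", "David Culp"), 2))          # 0.95: misspelling, likely match
print(round(similarity("Kathy Edwards", "Katherine Edwards"), 2)) # 0.8: nickname, ambiguous
print(round(similarity("Brian Murphy", "Briana Murphy"), 2))      # 0.96: high, yet separate entities
```

Note that Brian/Briana scores higher than the nickname pair even though they are clearly different people, which is exactly why the human labeler's judgment matters during active learning.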

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe for use has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


## Results

What we realized from this is that there are two different kinds of duplicates in our dataset. The first kind is generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, and to have the same meeting location and the same uin (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind is what we think of as the frequent flier: people who seem to spend a lot of time at the White House, like staffers and other political appointees.

During the dedupe process, we computed that there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect figures of this kind, knowing that people visit for repeat business or social functions.
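For scale, the flagged share works out to just under a third of the table:

```python
# Figures reported above
potential_dupes = 332606
total_records = 1048576

rate = potential_dupes / total_records
print(f"{rate:.1%}")  # 31.7%
```
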

### Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671


### Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085

lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331


## Conclusion

In this beginner's guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion of a number of other topics important to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### R Packages worth a look

R Interface to the Yacas Computer Algebra System (Ryacas)
An interface to the yacas computer algebra system.

Deep Forest Model (gcForest)
R application programming interface (API) for Deep Forest, which is based on Zhou and Feng (2017). Deep Forest: Towards an Alternative to Deep Neural Netwo …

Strategy Estimation (stratEst)
Implements variants of the strategy frequency estimation method by Dal Bo & Frechette (2011) <doi:10.1257/aer.101.1.411>, including its adapt …

## October 19, 2018

### If you did not already know

YCML
A Machine Learning framework for Objective-C and Swift (OS X / iOS) …

Gaussian Process Autoregressive Regression Model (GPAR)
Multi-output regression models must exploit dependencies between outputs to maximise predictive performance. The application of Gaussian processes (GPs) to this setting typically yields models that are computationally demanding and have limited representational power. We present the Gaussian Process Autoregressive Regression (GPAR) model, a scalable multi-output GP model that is able to capture nonlinear, possibly input-varying, dependencies between outputs in a simple and tractable way: the product rule is used to decompose the joint distribution over the outputs into a set of conditionals, each of which is modelled by a standard GP. GPAR’s efficacy is demonstrated on a variety of synthetic and real-world problems, outperforming existing GP models and achieving state-of-the-art performance on the tasks with existing benchmarks. …

Randomized Weighted Majority Algorithm (RWMA)
The randomized weighted majority algorithm is an algorithm in machine learning theory. It improves the mistake bound of the weighted majority algorithm. Imagine that every morning before the stock market opens, we get a prediction from each of our ‘experts’ about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single best expert in hindsight. “Weighted Majority Algorithm”
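The update rule described above can be sketched in a few lines (hypothetical expert predictions; beta is the penalty factor for wrong experts, assumed 0.5 here):

```python
import random

def rwma_predict(weights, predictions):
    # Randomized weighted majority: sample an expert with probability
    # proportional to its weight and follow that expert's prediction
    r = random.uniform(0, sum(weights))
    for w, p in zip(weights, predictions):
        r -= w
        if r <= 0:
            return p
    return predictions[-1]

def rwma_update(weights, predictions, outcome, beta=0.5):
    # Multiply the weight of each expert that was wrong by beta < 1
    return [w * (beta if p != outcome else 1.0) for w, p in zip(weights, predictions)]

# Three experts predict the market's direction; the market goes up
w = [1.0, 1.0, 1.0]
w = rwma_update(w, ["up", "down", "up"], "up")
print(w)  # [1.0, 0.5, 1.0]: the wrong expert loses half its weight
```
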

### Book Memo: “Machine Learning Using R”

With Time Series and Industry-Based Use Cases in R. Examine the latest technological advancements in building a scalable machine-learning model with big data using R. This second edition shows you how to work with a machine-learning algorithm and use it to build an ML model from raw data. You will see how to use R programming with TensorFlow, thus avoiding the effort of learning Python if you are only comfortable with R. As in the first edition, the authors have kept the fine balance of theory and application of machine learning through various real-world use cases, giving you a comprehensive collection of topics in machine learning. New chapters in this edition cover time series models and deep learning.

### An Intuitive Guide to Financial Analysis with Data Transformations

With regard to the analysis of financial markets, there exist two major schools of thought: fundamental analysis and technical analysis. Fundamental analysis focuses on understanding the intrinsic value of a company based on information such as quarterly financial statements, cash flow, and other information about an industry in general. The goal is to discover and […]

### Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

When receiving bytes from the network, we often assume that they are unicode strings, encoded using something called UTF-8. Sadly, not all streams of bytes are valid UTF-8. So we need to check the strings. It is probably a good idea to optimize this problem as much as possible.

In earlier work, we showed that you could validate a string using as little as 0.7 cycles per byte, using commonly available 128-bit SIMD registers (in C). SIMD stands for Single Instruction, Multiple Data; it is a way to parallelize processing on a single core.

What if we use 256-bit registers instead?

| Implementation | Cycles per byte |
| --- | --- |
| naive function (reference) | 10 |
| fast SIMD version (128-bit) | 0.7 |
| new SIMD version (256-bit) | 0.45 |

That’s good, almost twice as fast.

A common scenario is that your inputs are pure ASCII. It is much faster to check that a string is made of ASCII characters than to check that it is valid UTF-8. Indeed, to check that it is made of ASCII characters, you only have to check that one bit per byte is zero (since ASCII uses only 7 bits per byte).
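The bit trick is easy to state in scalar code (a plain-Python sketch of the idea; the real implementation does this 32 bytes at a time with SIMD registers):

```python
def is_ascii(data: bytes) -> bool:
    # ASCII bytes have their top bit clear, so OR-ing every byte together
    # and testing bit 7 checks the whole buffer in one comparison at the end
    acc = 0
    for b in data:
        acc |= b
    return acc < 0x80

print(is_ascii(b"hello"))          # True
print(is_ascii("café".encode()))   # False: é encodes as 0xC3 0xA9, high bit set
```
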

It turns out that only about 0.05 cycles are needed to check that a string is made of ASCII characters. Maybe up to 0.08 cycles. That makes us look bad.

You could start checking the file for ASCII characters and then switch to our function when non-ASCII characters are found, but this has a problem: what if the string starts with a non-ASCII character followed by a long stream of ASCII characters?

A quick solution is to add an ASCII path. Each time we read a block of 32 bytes, we check whether it is made of 32 ASCII characters, and if so, we take a different (fast) path. Thus if it happens frequently that we have long streams of ASCII characters, we will be quite fast.

The new numbers are quite appealing when running benchmarks on ASCII characters:

| Implementation | Cycles per byte |
| --- | --- |
| new SIMD version (256-bit) | 0.45 |
| new SIMD version (256-bit), w. ASCII path | 0.088 |
| ASCII check (SIMD + 256-bit) | 0.051 |

### Statistics Challenge Invites Students to Tackle Opioid Crisis Using Real-World Data

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

In 2016, 2.1 million Americans were found to have an opioid use disorder (according to SAMHSA), with drug overdose now the leading cause of injury death in the United States. But some of the country's top minds are working to fight this epidemic, and statisticians are helping to lead the charge.

In This is Statistics' second annual fall data challenge, high school and undergraduate students will use statistics to analyze data and develop recommendations to help address this important public health crisis. The contest invites teams of two to five students to put their statistical and data visualization skills to work using the Centers for Disease Control and Prevention (CDC)'s Multiple Cause of Death (Detailed Mortality) data set, and contribute to creating healthier communities.

Given the size and complexity of the CDC dataset, programming languages such as R can be used to manipulate and conduct analysis effectively.

Each submission will consist of a short essay and presentation of recommendations. Winners will be awarded for best overall analysis, best visualization, and best use of external data. Submissions are due November 12, 2018.

If you or a student you know is interested in participating, get full contest details here. Teachers, get resources about how to engage your students in the contest here.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### Solving the chinese postman problem

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Some pre-Halloween post today. It started actually while I was in Barcelona: the kids wanted to go back to some store we had seen on the first day, in the gothic quarter, and I could not remember where it was. I said to myself that it would take quite a long time to walk all the streets of the neighborhood. And I discovered that this is actually an old problem. In 1962, Meigu Guan was interested in a postman delivering mail to a number of streets such that the total distance walked by the postman was as short as possible. How could the postman ensure that the distance walked was a minimum?

A very close notion is the concept of a traversable graph, which is one that can be drawn without taking the pen from the paper and without retracing the same edge. In such a case the graph is said to have an Eulerian trail (yes, from Euler's bridges problem). An Eulerian trail uses all the edges of a graph. For a graph to be Eulerian, all the vertices must be of even order.

An algorithm for finding an optimal Chinese postman route is:

1. List all odd vertices.
2. List all possible pairings of odd vertices.
3. For each pairing find the edges that connect the vertices with the minimum weight.
4. Find the pairings such that the sum of the weights is minimised.
5. On the original graph add the edges that have been found in Step 4.
6. The length of an optimal Chinese postman route is the sum of all the edges added to the total found in Step 4.
7. A route corresponding to this minimum weight can then be easily found.
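Step 1 of the algorithm is simple enough to sketch on its own; here is a plain-Python version on a toy graph (hypothetical edges, not the network used in the post):

```python
# Toy graph as a list of edges (hypothetical example)
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

# Count the degree of each vertex
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Step 1: list all odd vertices; these are the ones that must be paired up
odd = sorted(v for v, d in degree.items() if d % 2 == 1)
print(odd)  # ['B', 'C', 'D', 'E']
```
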
For the first steps, we can use the codes from Hurley & Oldford's Eulerian tour algorithms for data visualization and the PairViz package. First, we have to load some R packages:

require(igraph)
require(graph)
require(eulerian)
require(GA)

Then use the following function from stackoverflow,

make_eulerian = function(graph){
  info = c("broken" = FALSE, "Added" = 0, "Successfull" = TRUE)
  is.even = function(x){ x %% 2 == 0 }
  search.for.even.neighbor = !is.even(sum(!is.even(degree(graph))))
  for(i in V(graph)){
    set.j = NULL
    uneven.neighbors = !is.even(degree(graph, neighbors(graph,i)))
    if(!is.even(degree(graph,i))){
      if(sum(uneven.neighbors) == 0){
        if(sum(!is.even(degree(graph))) > 0){
          info["Broken"] = TRUE
          uneven.candidates <- !is.even(degree(graph, V(graph)))
          if(sum(uneven.candidates) != 0){
            set.j <- V(graph)[uneven.candidates][[1]]
          }else{
            info["Successfull"] <- FALSE
          }
        }
      }else{
        set.j <- neighbors(graph, i)[uneven.neighbors][[1]]
      }
    }else if(search.for.even.neighbor == TRUE & is.null(set.j)){
      info["Added"] <- info["Added"] + 1
      set.j <- neighbors(graph, i)[ !uneven.neighbors ][[1]]
      if(!is.null(set.j)){search.for.even.neighbor <- FALSE}
    }
    if(!is.null(set.j)){
      if(i != set.j){
        graph <- add_edges(graph, edges=c(i, set.j))
        info["Added"] <- info["Added"] + 1
      }
    }
  }
  (list("graph" = graph, "info" = info))
}

Then, consider some network, with 12 nodes

g1 = graph(c(1,2, 1,3, 2,4, 2,5, 1,5, 3,5, 4,7, 5,7, 5,8, 3,6, 6,8, 6,9, 9,11, 8,11, 8,10, 8,12, 7,10, 10,12, 11,12), directed = FALSE)

To plot that network, use

V(g1)$name=LETTERS[1:12]
V(g1)$color=rgb(0,0,1,.4)
ly=layout.kamada.kawai(g1)
plot(g1,vertex.color=V(newg)$color,layout=ly)

Then we convert it to a traversable graph by adding 5 edges

eulerian = make_eulerian(g1)
eulerian$info
     broken       Added Successfull
          0           5           1
g = eulerian$graph

as shown below

ly=layout.kamada.kawai(g)
plot(g,vertex.color=V(newg)$color,layout=ly)

We cut those 5 added edges in two, and therefore add 5 artificial nodes

A=as.matrix(as_adj(g))
A1=as.matrix(as_adj(g1))
newA=lower.tri(A, diag = FALSE)*A1+upper.tri(A, diag = FALSE)*A
for(i in 1:sum(newA==2)) newA = cbind(newA,0)
for(i in 1:sum(newA==2)) newA = rbind(newA,0)
s=nrow(A)
for(i in 1:nrow(A)){
  Aj=which(newA[i,]==2)
  if(!is.null(Aj)){
    for(j in Aj){
      newA[i,s+1]=newA[s+1,i]=1
      newA[j,s+1]=newA[s+1,j]=1
      newA[i,j]=1
      s=s+1
    }}}

We get the following graph, where all nodes have an even degree!

newg=graph_from_adjacency_matrix(newA)
newg=as.undirected(newg)
V(newg)$name=LETTERS[1:17]
V(newg)$color=c(rep(rgb(0,0,1,.4),12),rep(rgb(1,0,0,.4),5))
ly2=ly
transl=cbind(c(0,0,0,.2,0),c(.2,-.2,-.2,0,-.2))
for(i in 13:17){
  j=which(newA[i,]>0)
  lc=ly[j,]
  ly2=rbind(ly2,apply(lc,2,mean)+transl[i-12,])
}
plot(newg,layout=ly2)

Our network is now the following (new nodes are small because actually, they don't really matter, it's just for computational reasons)

plot(newg,vertex.color=V(newg)$color,layout=ly2,
     vertex.size=c(rep(20,12),rep(0,5)),
     vertex.label.cex=c(rep(1,12),rep(.1,5)))

Now we can get the optimal path

n <- LETTERS[1:nrow(newA)]
g_2 <- new("graphNEL",nodes=n)
for(i in 1:nrow(newA)){
  for(j in which(newA[i,]>0)){
    g_2 <- addEdge(n[i],n[j],g_2,1)
  }}
etour(g_2,weighted=FALSE)
 [1] "A" "B" "D" "G" "E" "A" "C" "E" "H" "F" "I" "K" "H" "J" "G" "P" "J" "L" "K" "Q" "L" "H" "O" "F" "C"
[26] "N" "E" "B" "M" "A"

or

edg=attr(E(newg), "vnames")
ET=etour(g_2,weighted=FALSE)
parcours=trajet=rep(NA,length(ET)-1)
for(i in 1:length(parcours)){
  u=c(ET[i],ET[i+1])
  ou=order(u)
  parcours[i]=paste(u[ou[1]],u[ou[2]],sep="|")
  trajet[i]=which(edg==parcours[i])
}
parcours
 [1] "A|B" "B|D" "D|G" "E|G" "A|E" "A|C" "C|E" "E|H" "F|H" "F|I" "I|K" "H|K" "H|J" "G|J" "G|P" "J|P"
[17] "J|L" "K|L" "K|Q" "L|Q" "H|L" "H|O" "F|O" "C|F" "C|N" "E|N" "B|E" "B|M" "A|M"
trajet
 [1]  1  3  8  9  4  2  6 10 11 12 16 15 14 13 26 27 18 19 28 29 17 25 24  7 22 23  5 21 20

Let us try now on a real network of streets. Like Missoula, Montana.

I will not try to get the shapefile of the city; I will just try to replicate the photograph above.

If you look carefully, you will see a problem: vertices 10 and 93 have odd degree (3 here), so one strategy is to connect them (which explains the grey line).

But actually, to be more realistic, we start at 93 and end at 10. Here is the optimal (shortest) path that goes through all vertices.

Now, we are ready for Halloween, to go through all streets in the neighborhood !


### Will Models Rule the World? Data Science Salon Miami, Nov 6-7

This post is excerpted from the thoughts of Data Science Salon Miami speakers on the future of model-based decision-making.

### Is your Precision Medicine AI ready for the Bio-Psycho-Socio-Cultural Patient?

The data necessary to account for every aspect of our human complexity poses significant challenges to health AI systems. There's certainly no way around data science to get a hold of it – but don't you count physicians out too soon! Precision Medicine bears the promise to bring highly individualized

The post Is your Precision Medicine AI ready for the Bio-Psycho-Socio-Cultural Patient? appeared first on Dataconomy.

### Gold-Mining Week 7 (2018)

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 7 Gold Mining and Fantasy Football Projection Roundup now available. Go check out our cheat sheet for this week.

The post Gold-Mining Week 7 (2018) appeared first on Fantasy Football Analytics.


### Holy Grail of AI for Enterprise — Explainable AI

Explainable AI (XAI) is an emerging branch of AI where AI systems are made to explain the reasoning behind every decision made by them. We investigate some of its key benefits and design principles.

### New Course: Interactive Data Visualization with rbokeh

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

### Course Description

Data visualization is an integral part of the data analysis process. This course will introduce you to rbokeh: a visualization library for interactive web-based plots. You will learn how to use rbokeh layers and options to create effective visualizations that carry your message and emphasize your ideas. We will focus on the two main pieces of data visualization: wrangling data into the appropriate format and employing the appropriate visualization tools, charts, and options from rbokeh.

### Chapter 1: rbokeh Introduction (Free)

In this chapter, you will be introduced to rbokeh layers. You will learn how to specify data and arguments to create the desired plot and how to combine multiple layers in one figure.

### Chapter 2: rbokeh Aesthetic Attributes and Figure Options

In this chapter you will learn how to customize your rbokeh figures using aesthetic attributes and figure options. You will see how aesthetic attributes such as color, transparency, and shape can serve a purpose and add more info to your visualizations. In addition, you will learn how to activate the tooltip and specify the hover info in your figures.

### Chapter 3: Data Manipulation for Visualization and More rbokeh Layers

In this chapter, you will learn how to put your data in the right format to fit the desired figure. And how to transform between the wide and long formats. You will also see how to combine normal layers with regression lines. In addition you will learn how to customize the interaction tools that appear with each figure.

### Chapter 4: Grid Plots and Maps

In this chapter you will learn how to combine multiple plots in one layout using grid plots. In addition, you will learn how to create interactive maps.

### Prerequisites


### New Course: Visualization Best Practices in R

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

### Course Description

This course will help you take your data visualization skills beyond the basics and hone them into a powerful part of your data science toolkit. Over the lessons we will use two interesting open datasets to cover different types of data (proportions, point data, single distributions, and multiple distributions) and discuss the pros and cons of the most common visualizations. In addition, we will cover some less common alternative visualizations for these data types, and how to tweak default ggplot settings to get your message across most efficiently and effectively.

### Chapter 1: Proportions of a whole (Free)

In this chapter, we focus on visualizing proportions of a whole; we see that pie charts really aren’t so bad, along with discussing the waffle chart and stacked bars for comparing multiple proportions.

### Chapter 2: Point data

We shift our focus now to single-observation or point data and go over when bar charts are appropriate and when they are not, what to use when they are not, and general perception-based enhancements for your charts.

### Chapter 3: Single distributions

We now move on to visualizing distributional data. We expose the fragility of histograms, discuss when it is better to shift to a kernel density plot, and show how to make both plots work best for your data.

### Chapter 4: Comparing distributions

Finishing off, we take a look at comparing multiple distributions to each other. We see why traditional box plots can be very dangerous and how to easily improve them, along with investigating when you should use more advanced alternatives like the beeswarm plot and violin plots.

### Prerequisites


### The Intuitions Behind Bayesian Optimization with Gaussian Processes

Bayesian Optimization adds a Bayesian methodology to the iterative optimizer paradigm by incorporating a prior model on the space of possible target functions. This article introduces the basic concepts and intuitions behind Bayesian Optimization with Gaussian Processes.
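To make the idea concrete, here is a minimal, self-contained sketch in R (my own toy example, not from the article): a one-dimensional Gaussian Process surrogate with a squared-exponential kernel, and an upper-confidence-bound rule for choosing the next evaluation point. All names (`f`, `rbf`, `gp_posterior`) and numbers, including the kernel length-scale, are invented for the illustration.

```r
# Toy objective we pretend is expensive to evaluate; true maximum at x = 2
f <- function(x) -(x - 2)^2 + 3

# Squared-exponential (RBF) kernel matrix between two point sets
rbf <- function(a, b, l = 0.5) {
  outer(a, b, function(u, v) exp(-(u - v)^2 / (2 * l^2)))
}

# GP posterior mean and sd on a grid, given (nearly) noise-free observations
gp_posterior <- function(x.obs, y.obs, x.grid, jitter = 1e-6) {
  K     <- rbf(x.obs, x.obs) + diag(jitter, length(x.obs))
  K.s   <- rbf(x.grid, x.obs)
  K.inv <- solve(K)
  mu  <- as.vector(K.s %*% K.inv %*% y.obs)
  var <- diag(rbf(x.grid, x.grid) - K.s %*% K.inv %*% t(K.s))
  list(mean = mu, sd = sqrt(pmax(var, 0)))
}

x.grid <- seq(0, 4, length.out = 200)
x.obs  <- c(0.5, 3.5)        # two initial evaluations
y.obs  <- f(x.obs)

for (iter in 1:5) {
  post   <- gp_posterior(x.obs, y.obs, x.grid)
  ucb    <- post$mean + 2 * post$sd   # acquisition: favor high mean or
  x.next <- x.grid[which.max(ucb)]    # high uncertainty; pick the best point
  x.obs  <- c(x.obs, x.next)
  y.obs  <- c(y.obs, f(x.next))
}

print(x.obs[which.max(y.obs)])  # best point found; should lie near x = 2
```

The prior model is the GP; the acquisition function (here a simple upper confidence bound) is what turns posterior uncertainty into the choice of the next query, which is the core of the iterative optimizer paradigm the article describes.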

### An actual quote from a paper published in a medical journal: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.”

Someone writes:

So the NYT yesterday has a story about this study I am directed to it and am immediately concerned about all the things that make this study somewhat dubious. Forking paths in the definition of the independent variable, sample selection in who wore the accelerometers, ignorance of the undoubtedly huge importance of interactions in the controls, etc, etc. blah blah blah. But I am astonished at the bald statement at the start of the study: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.” Why shouldn’t everyone, including the NYT, stop reading right there? How does a journal accept the article? The dataset itself is public and they didn’t create it! They’re just saying Fuck You.

I was, like, Really? So I followed the link. And, indeed, here it is:

The Journal of the American Heart Association published this? And the New York Times promoted it?

As a heart patient myself, I’m annoyed. I’d give it a subliminal frowny face, but I don’t want to go affecting your views on immigration.

By the way, I started Who is Rich? this week and it’s great.

P.P.S. The above all happened half a year ago. Today my post appeared, and then I received a note from Joseph Hilgard informing me that this statement, “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure,” apparently is a technical requirement of TOP when the data are already publicly available—not a defiant statement from the authors. Hilgard also informed me that TOP refers to the “Transparency and Openness Promotion guidelines. Journal-level standards for how firm a journal wants to be about requesting data sharing, code, etc.”

I remain baffled as to why, if the data are already publicly available, you couldn’t just say, “The data are already publicly available,” and also why you have to say that the analytic methods and study materials will not be made available. I can believe that this is a requirement of the journal. Various organizations have various screwy requirements, there are millions of forms to be filled out and hoops to be jumped through, etc. And the end result—no details on how the data were processed, no code, etc.—that’s not good in any case. It should be easy to have reproducible research when the data are public.

It’s amazing how fast standards have changed. Back when we published our Red State Blue State book, ten years ago, we didn’t even think of posting all our data and code. Partly because it was such a mess, with five coauthors doing different things at different times, but also because this was not something that was usually done. I felt we were ahead of the game by including careful descriptions of our methods in the notes section at the end of the book. But there’s a big gap between my written descriptions and all the details of what we did. When it comes to scientific communication, things have changed for the better.

Let’s just hope that the Center for Open Science and the Journal of the American Heart Association can fix this particular bug, which seems at least in this case to have encouraged researchers to not make their methods and study materials available.

### McKinsey Datathon: The City Cup. 17 November, Amsterdam, Stockholm and Zurich. Apply Now

While solving the challenge, you will gain insights into the types of problems that McKinsey Data Scientists solve daily to help their clients. Top prize is 5K Euro + conference attendance of your choice.

### Education disparities

There are many racial disparities in education. ProPublica shows estimates for the gaps:

Based on civil rights data released by the U.S. Department of Education, ProPublica has built an interactive database to examine racial disparities in educational opportunities and school discipline. Look up more than 96,000 individual public and charter schools and 17,000 districts to see how they compare with their counterparts.

Using white students as the baseline, compare opportunity, discipline, segregation, and achievement for black and Hispanic students.

Be sure to click through to a school district or state of interest to see more detailed breakdowns of the measures.


### Four short links: 19 October 2018

PDF to Data Frame, Clever Story, Conceptual Art, and Automatic Patch Synthesis

1. Camelot -- Python library that extracts tables of data from PDF documents, returning them as Pandas frames.
2. STET -- short story told via footnotes, editorial markup, and more. Magnificent! (via Cory Doctorow)
3. Solving Sol -- interpreting a conceptual artist's art as instructions, reframed as an AI problem. Clever!
4. Human-Competitive Patches with Repairnator -- Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open source software and tries to fix them automatically. If it succeeds in synthesizing a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce five patches that were accepted by the human developers and permanently merged into the code base.

### Loops and Pizzas

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

An Introduction to Loops in R

# Loops in R

First, if you are new to programming, you should know that loops are a
way to tell the computer that you want to repeat some operation a
number of times. This is a very common task that can be found in many
programming languages. For example, let’s say you invited five friends
for dinner at your home and the whole cost of four pizzas will be split
evenly. Assume now that you must give instructions to a computer on
calculating how much each one will pay at the end of dinner. For that,
you need to sum up the individual tabs and divide by the number of
friends: starting with a value of x = zero, take each individual pizza
cost and sum it to x until all costs are processed, dividing the result
by the number of friends at the end.

The great thing about loops is that their length is set dynamically.
set. Using the previous example, if we had 500 friends (and a large
dinner table!), we could use the same instructions for calculating the
individual tabs. That means we can encapsulate a generic procedure for
processing any given number of friends at dinner. With it, you have at
your reach a tool for the execution of any sequential process. In other
words, you are the boss of your computer and, as long as you can write
it down clearly, you can set it to do any kind of repeated task for you.

Now, about the code, we could write the solution to the pizza problem
in R as:

pizza.costs <- c(50, 80, 30, 60) # cost of each pizza
n.friends <- 5 # number of friends

x <- 0 # initialize the running sum at zero
for (i.cost in pizza.costs) {
  x <- x + i.cost # sum it up
}

x <- x/n.friends # divide for average per friend
print(x)

## [1] 44


Don’t worry if you didn’t understand the code. We’ll get to the
structure of a loop soon.

Back to our case, each friend would pay 44 for the meal. We can check
the result against function sum:

x == sum(pizza.costs)/n.friends

## [1] TRUE


The output TRUE shows that the results are equal.
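One caveat worth adding (my note, not the original author's): the `==` comparison works here because 220/5 is exact. For results of floating-point arithmetic, `==` can fail even when two values are mathematically equal; `isTRUE(all.equal(...))` compares within a tolerance instead:

```r
# Classic floating-point surprise: 0.1 + 0.2 is not stored exactly as 0.3
x <- 0.1 + 0.2
print(x == 0.3)                   # FALSE: binary floating point is inexact
print(isTRUE(all.equal(x, 0.3)))  # TRUE: equality within a small tolerance
```

When loop results come from divisions or decimal fractions, prefer the `all.equal` form for checks like the one above.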

## The Structure of a Loop

Knowing how to use loops can be a powerful ally when facing complex,
data-related problems. Let’s talk more about how loops are defined in R. The
structure of a loop in R follows:

for (i in i.vec){
  ...
}


In the previous code, command for indicates the beginning of a loop.
Object i in (i in i.vec) is the iterator of the loop. This
iterator will change its value in each iteration, taking each individual
value contained in i.vec. Note the loop is encapsulated by curly
braces ({}). These are important, as they define where the loop
starts and where it ends. The indentation (use of bigger margins) is
also important for visual cues, but not necessary. Consider the
following practical example:

# set seq
my.seq <- seq(-5,5)

# do loop
for (i in my.seq){
  cat(paste('\nThe value of i is',i))
}

##
## The value of i is -5
## The value of i is -4
## The value of i is -3
## The value of i is -2
## The value of i is -1
## The value of i is 0
## The value of i is 1
## The value of i is 2
## The value of i is 3
## The value of i is 4
## The value of i is 5


In the code, we created a sequence from -5 to 5 and presented a text for
each element with the cat function. Notice how we also broke the
prompt line with '\n'. The loop starts with i=-5, executes command
cat(paste('\nThe value of i is', -5)), proceeds to the next iteration
by setting i=-4, reruns the cat command, and so on. At its final
iteration, the value of i is 5.
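As an aside, the same iteration can be written with a while loop, where we manage the position in the vector ourselves. This sketch (my addition, for comparison) reproduces the output of the for loop above:

```r
# set seq, as before
my.seq <- seq(-5, 5)

# while version: the index idx is incremented manually inside the loop
idx <- 1
while (idx <= length(my.seq)) {
  i <- my.seq[idx]
  cat(paste('\nThe value of i is', i))
  idx <- idx + 1
}
```

A for loop is usually preferable when the number of iterations is known in advance; while shines when the stopping condition depends on what happens inside the loop.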

The iterated sequence in the loop is not exclusive to numerical
vectors. Any type of vector or list may be used. See next:

# set char vec
my.char.vec <- letters[1:5]

# loop it!
for (i.char in my.char.vec){
  cat(paste('\nThe value of i.char is', i.char))
}

##
## The value of i.char is a
## The value of i.char is b
## The value of i.char is c
## The value of i.char is d
## The value of i.char is e


The same goes for lists:

# set list
my.l <- list(x = 1:5,
             y = c('abc','dfg'),
             z = factor('A','B','C','D'))

# loop list
for (i.l in my.l){
  cat(paste0('\nThe class of i.l is ', class(i.l), '. '))
  cat(paste0('The number of elements is ', length(i.l), '.'))
}

##
## The class of i.l is integer. The number of elements is 5.
## The class of i.l is character. The number of elements is 2.
## The class of i.l is factor. The number of elements is 1.


In the definition of loops, the iterator does not have to be the only
object incremented in each iteration. We can create other objects and
increment them using a simple sum operation. See next:

# set vec and iterators
my.vec <- seq(1:5)
my.x <- 5
my.z <- 10

for (i in my.vec){
  # iterate "manually"
  my.x <- my.x + 1
  my.z <- my.z + 2

  cat('\nValue of i = ', i,
      ' | Value of my.x = ', my.x,
      ' | Value of my.z = ', my.z)
}

##
## Value of i =  1  | Value of my.x =  6  | Value of my.z =  12
## Value of i =  2  | Value of my.x =  7  | Value of my.z =  14
## Value of i =  3  | Value of my.x =  8  | Value of my.z =  16
## Value of i =  4  | Value of my.x =  9  | Value of my.z =  18
## Value of i =  5  | Value of my.x =  10  | Value of my.z =  20


Using nested loops, that is, a loop inside another loop, is also
possible. See the following example, where we present all the elements
of a matrix:

# set matrix
my.mat <- matrix(1:9, nrow = 3)

# loop all values of matrix
for (i in seq(1,nrow(my.mat))){
  for (j in seq(1,ncol(my.mat))){
    cat(paste0('\nElement [', i, ', ', j, '] = ', my.mat[i,j]))
  }
}

##
## Element [1, 1] = 1
## Element [1, 2] = 4
## Element [1, 3] = 7
## Element [2, 1] = 2
## Element [2, 2] = 5
## Element [2, 3] = 8
## Element [3, 1] = 3
## Element [3, 2] = 6
## Element [3, 3] = 9


## A Real World Example

Now, the computational needs of the real world are far more complex than
dividing a dinner expense. A practical use of loops is
processing data according to groups. Using an example from finance: if
we have a dataset of returns for several stocks and we want to calculate the
average return of each stock, we can use a loop for that. In this
example, we will use Yahoo Finance data for three stocks: FB, GE and
AA. The first step is downloading it with package BatchGetSymbols.

library(BatchGetSymbols)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

##

my.tickers <-  c('FB', 'GE', 'AA')

df.stocks <- BatchGetSymbols(tickers = my.tickers,
                             first.date = '2012-01-01',
                             freq.data = 'yearly')[[2]]

##
## Running BatchGetSymbols for:
##    tickers = FB, GE, AA
## FB | yahoo (1|3) | Found cache file - Good job!
## GE | yahoo (2|3) | Found cache file - Nice!
## AA | yahoo (3|3) | Found cache file - You got it!


It worked fine. Let’s check the contents of the dataframe:

dplyr::glimpse(df.stocks)

## Observations: 21
## Variables: 10
## $ ticker                "AA", "AA", "AA", "AA", "AA", "AA", "AA", ...
## $ ref.date              2012-01-03, 2013-01-02, 2014-01-02, 2015-...
## $ volume                2217410500, 2149575500, 2146821400, 268355...
## $ price.open            21.48282, 21.33864, 25.30359, 38.13561, 22...
## $ price.high            25.85628, 25.68807, 42.29280, 41.01921, 32...
## $ price.low             19.27206, 18.50310, 24.27030, 18.79146, 16...
## $ price.close           22.17969, 21.60297, 25.30359, 38.15964, 23...
## $ price.adjusted        20.89342, 20.62187, 24.48568, 37.24207, 23...
## $ ret.adjusted.prices   NA, -0.01299715, 0.18736494, 0.52097326, -...
## $ ret.closing.prices    NA, -0.02600212, 0.17130149, 0.50807215, -...


All the financial data is there. Notice that the return series is available
in column ret.adjusted.prices.

Now we will use a loop to build a table with the mean return of each
stock:

# find unique tickers in column ticker
unique.tickers <- unique(df.stocks$ticker)

# create empty df
tab.out <- data.frame()

# loop tickers
for (i.ticker in unique.tickers){
  # create temp df with ticker i.ticker
  temp <- df.stocks[df.stocks$ticker==i.ticker, ]

  # row bind i.ticker and mean.ret
  tab.out <- rbind(tab.out,
                   data.frame(ticker = i.ticker,
                              mean.ret = mean(temp$ret.adjusted.prices, na.rm = TRUE)))
}

# print result
print(tab.out)

##   ticker   mean.ret
## 1     AA 0.24663684
## 2     FB 0.35315566
## 3     GE 0.06784693


In the code, we used function unique to find the names of all the
tickers in the dataset. Soon after, we created an empty dataframe to
save the results and a loop to filter the data of each stock
sequentially and average its returns. At the end of each iteration, we use
function rbind to paste the results of each stock onto the
main table. As you can see, we can use loops to perform group
calculations on the data.

By now, I must be forthright in saying that the previous loop is by no
means the best way of performing this data operation. What we just did with
loops is called a split-apply-combine procedure. There are base
functions in R such as tapply, split and lapply/sapply that can
do the same job but with a more intuitive and functional approach. Going
further, functions from package tidyverse can do the same procedure
with an even more intuitive approach. In a future post I shall discuss
these possibilities further.
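To illustrate the point, here is a sketch of the same split-apply-combine step with base function tapply, using a small made-up data frame in place of df.stocks (the tickers and return values below are invented for the example):

```r
# Toy data frame standing in for df.stocks; the returns are invented
df.toy <- data.frame(ticker = c('AA', 'AA', 'FB', 'FB', 'GE'),
                     ret    = c(0.10, 0.30, 0.40, 0.30, 0.07))

# One call does the whole split-apply-combine: mean return per ticker
mean.ret <- tapply(df.toy$ret, df.toy$ticker, mean, na.rm = TRUE)
print(mean.ret)
```

Compared with the explicit loop, tapply handles the splitting by ticker, the averaging, and the combining into a named result in a single expression.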

I hope you guys liked the post. Got a question? Just drop it in the
comment section.


### Document worth reading: “Review of Deep Learning”

In recent years, countries such as China and the United States, and high-tech companies such as Google, have increased investment in artificial intelligence. Deep learning is one of the key areas of current artificial intelligence research. This paper analyzes and summarizes the latest progress and future research directions of deep learning. Firstly, three basic models of deep learning are outlined, including multilayer perceptrons, convolutional neural networks, and recurrent neural networks. On this basis, we further analyze the emerging new models of convolutional neural networks and recurrent neural networks. This paper then summarizes deep learning’s applications in many areas of artificial intelligence, including voice, computer vision, natural language processing and so on. Finally, this paper discusses the existing problems of deep learning and gives the corresponding possible solutions. Review of Deep Learning

### If you did not already know

Factorized Adversarial Networks (FAN)
In this paper, we propose Factorized Adversarial Networks (FAN) to solve unsupervised domain adaptation problems for image classification tasks. Our networks map the data distribution into a latent feature space, which is factorized into a domain-specific subspace that contains domain-specific characteristics and a task-specific subspace that retains category information, for both source and target domains, respectively. Unsupervised domain adaptation is achieved by adversarial training to minimize the discrepancy between the distributions of two task-specific subspaces from source and target domains. We demonstrate that the proposed approach outperforms state-of-the-art methods on multiple benchmark datasets used in the literature for unsupervised domain adaptation. Furthermore, we collect two real-world tagging datasets that are much larger than existing benchmark datasets, and get significant improvement upon baselines, proving the practical value of our approach. …

Reinforced Continual Learning
Most artificial intelligence models have limiting ability to solve new tasks faster, without forgetting previously acquired knowledge. The recently emerging paradigm of continual learning aims to solve this issue, in which the model learns various tasks in a sequential fashion. In this work, a novel approach for continual learning is proposed, which searches for the best neural architecture for each coming task via sophisticatedly designed reinforcement learning strategies. We name it as Reinforced Continual Learning. Our method not only has good performance on preventing catastrophic forgetting but also fits new tasks well. The experiments on sequential classification tasks for variants of MNIST and CIFAR-100 datasets demonstrate that the proposed approach outperforms existing continual learning alternatives for deep networks. …

Reliability Modelling
Reliability modeling is the process of predicting or understanding the reliability of a component or system prior to its implementation. Two types of analysis that are often used to model a complete system’s availability behavior (including effects from logistics issues like spare part provisioning, transport and manpower) are Fault Tree Analysis and reliability block diagrams. At a component level, the same types of analyses can be used together with others. The input for the models can come from many sources including testing; prior operational experience; field data; as well as data handbooks from similar or related industries. Regardless of source, all model input data must be used with great caution, as predictions are only valid in cases where the same product was used in the same context. As such, predictions are often only used to help compare alternatives. …

### Revised and Extended Remarks at "The Rise of Intelligent Economies and the Work of the IMF"

Attention conservation notice: 2700+ words elaborating a presentation from a non-technical conference about AI, where the conversation devolved to "blockchain" within an hour; includes unexplained econometric jargon. Life is short, and you should have more self-respect.

I got asked to be a panelist at a November 2017 symposium at the IMF on machine learning, AI and what they can do to/for the work of the Fund and its sister organizations, specifically the work of its economists. What follows is an amplification and rationalization of my actual remarks. It is also a reconstruction, since my notes were on an only-partially-backed-up laptop stolen in the next month. (Roman thieves are perhaps the most dedicated artisans in Italy, plying their trade with gusto on Christmas Eve.) Posted now because reasons.

On the one hand, I don't have any products to sell, or even much of a consulting business to promote, so I feel a little bit out of place. But against that, there aren't many other people who work on machine learning who read macro and development economics for fun, or have actually estimated a DSGE model from data, so I don't feel totally fraudulent up here.

We've been asked to talk about AI and machine learning, and how they might impact the work of the Fund and related multi-lateral organizations. I've never worked for the Fund or the World Bank, but I do understand a bit about how you economists work, and it seems to me that there are three important points to make: a point about data, a point about models, and a point about intelligence. The first of these is mostly an opportunity, the second is an opportunity and a clarification, and the third is a clarification and a criticism --- so you can tell I'm an academic by taking the privilege of ending on a note of skepticism and critique, rather than being inspirational.

I said my first point is about data --- in fact, it's about what, a few turns of the hype cycle ago, we'd have called "big data". Economists at the Fund typically rely for data on the output of official statistical agencies from various countries. This is traditional, this sort of reliance on the part of economists actually pre-dates the Bretton Woods organizations, and there are good reasons for it. With a few notable exceptions, those official statistics are prepared very carefully, with a lot of effort going in to making them both precise and accurate, as well as comparable over time and, increasingly, across countries.

But even these official statistics have their issues, for the purposes of the Fund: they are slow, they are noisy, and they don't quite measure what you want them to.

The issue of speed is familiar: they come out annually, maybe quarterly or monthly. This rate is pretty deeply tied to the way the statistics are compiled, which in turn is tied to their accuracy --- at least for the foreseeable future. It would be nice to be faster.

The issue of noise is also very real. Back in 1950, the great economist Oskar Morgenstern, the one who developed game theory with John von Neumann, wrote a classic book called On the Accuracy of Economic Observations, where he found a lot of ingenious ways of checking the accuracy of official statistics, e.g., looking at how badly they violated accounting identities. To summarize very crudely, he concluded that lots of those statistics couldn't possibly be accurate to better than 10%, maybe 5% --- and this was for developed countries with experienced statistical agencies. I'm sure that things are better now --- I'm not aware of anyone exactly repeating his efforts, but it'd be a worthwhile exercise --- maybe the error is down to 1%, but that's still a lot, especially to base policy decisions on.

The issue of measurement is the subtlest one. I'm not just talking about measurement noise now. Instead, it's that the official statistics are often tracking variables which aren't quite what you want1. Your macroeconomic model might, for example, need to know about the quantity of labor available for a certain industry in a certain country. But the theory in that model defines "quantity of labor" in a very particular way. The official statistical agencies, on the other hand, will have their own measurements of "quantity of labor", and none of those need to have exactly the same definitions. So even if we could magically eliminate measurement errors, just plugging the official value for "labor" into your model isn't right; it's just an approximate, correlated quantity.

So: official statistics, which is what you're used to using, are the highest-quality statistics, but they're also slow, noisy, and imperfectly aligned with your models. There hasn't been much to be done about that for most of the life of the Fund, though, because what was your alternative?

What "big data" can offer is the possibility of a huge number of noisy, imperfect measures. Computer engineers --- the people in hardware and systems and databases, not in machine learning or artificial intelligence --- have been making it very, very cheap and easy to record, store, search and summarize all the little discrete facts about our economic lives, to track individual transactions and aggregate them into new statistics. (Moving so much of our economic lives, along with all the rest of our lives, on to the Internet only makes it easier.) This could, potentially, give you a great many aggregate statistics which tell you, in a lot of detail and at high frequency, about consumption, investment, employment, interest rates, finance, and so on and so forth. There would be lots of noise, but having a great many noisy measurements could give you a lot more information. It's true that basically none of them would be well-aligned with the theoretical variables in macro models, but there are well-established statistical techniques for using lots of imperfect proxies to track a latent, theoretical variable, coming out of factor-analysis and state-space modeling. There have been some efforts already to incorporate multiple imperfect proxies into things like DSGE models.
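To give a flavor of the state-space idea, here is a toy sketch in R (my own construction, not something presented at the symposium): a scalar Kalman filter that tracks one latent random-walk variable from three noisy proxies with different noise levels. All names and numbers are made up for the illustration.

```r
set.seed(42)

# Latent "true" variable follows a random walk
n.obs <- 100
latent <- cumsum(rnorm(n.obs, sd = 0.1))

# Three noisy proxies of the latent variable, with different noise levels
noise.sd <- c(0.5, 1.0, 2.0)
proxies <- sapply(noise.sd, function(s) latent + rnorm(n.obs, sd = s))

# Scalar Kalman filter: state x_t = x_{t-1} + w_t, proxies y_jt = x_t + v_jt.
# Each proxy is folded into the estimate sequentially at every time step.
kalman_track <- function(y, q, r) {
  m <- 0; p <- 1                 # initial state mean and variance
  out <- numeric(nrow(y))
  for (t in 1:nrow(y)) {
    p <- p + q                   # predict: variance grows by process noise
    for (j in 1:ncol(y)) {
      k <- p / (p + r[j])        # Kalman gain for proxy j
      m <- m + k * (y[t, j] - m) # pull the mean toward the observation
      p <- (1 - k) * p           # observing shrinks the variance
    }
    out[t] <- m
  }
  out
}

est <- kalman_track(proxies, q = 0.01, r = noise.sd^2)

# The combined estimate tracks the latent series better than the best proxy
print(mean((est - latent)^2))
print(mean((proxies[, 1] - latent)^2))
```

The point of the sketch is the one in the text: none of the proxies is the latent variable, but pooling many imperfect measurements through an explicit state-space model recovers it far better than any single series.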

I don't want to get carried away here. The sort of ubiquitous recording I'm talking about is obviously more advanced in richer countries than in poorer ones --- it will work better in, say, South Korea, or even Indonesia, than in Afghanistan. It's also unevenly distributed within national economies. Getting hold of the data, even in summary forms, would require a lot of social engineering on the part of the Fund. The official statistics, slow and imperfect as they are, will always be more reliable and better aligned to your models. But, wearing my statistician hat, my advice to economists here is to get more information, and this is one of the biggest ways you can expand your information set.

The second point is about models --- it's a machine learning point. The dirty secret of the field, and of the current hype, is that 90% of machine learning is a rebranding of nonparametric regression. (I've got appointments in both ML and statistics so I can say these things without hurting my students.) I realize that there are reasons why the overwhelming majority of the time you work with linear regression, but those reasons aren't really about your best economic models and theories. Those reasons are about what has, in the past, been statistically and computationally feasible to estimate and work with. (So they're "economic" reasons in a sense, but about your own economies as researchers, not about economics-as-a-science.) The data will never completely speak for itself; you will always need to bring some assumptions to draw inferences. But it's now possible to make those assumptions vastly weaker, and to let the data say a lot more. Maybe everything will turn out to be nice and linear, but even if that's so, wouldn't it be nice to know that, rather than to just hope?
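A minimal illustration of the contrast, with simulated data (my example, not the speaker's): fit a linear model and a nonparametric smoother (loess) to the same clearly nonlinear data and compare in-sample residuals.

```r
set.seed(1)

# Simulated data with a clear nonlinearity
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit.lm    <- lm(y ~ x)                # linear model: a strong, wrong assumption here
fit.loess <- loess(y ~ x, span = 0.3) # nonparametric smoother: lets the data speak

# In-sample mean squared residuals
print(mean(resid(fit.lm)^2))
print(mean(resid(fit.loess)^2))
```

The smoother's span is an assumption too, just a much weaker one than global linearity; if the truth really were linear, loess would approximately recover the line rather than being systematically wrong.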

There is of course a limitation to using more flexible models, which impose fewer assumptions, which is that it makes it easier to "over-fit" the data, to create a really complicated model which basically memorizes every little accident and even error in what it was trained on. It may not, when you examine it, look like it's just memorizing, it may seem to give an "explanation" for every little wiggle. It will, in effect, say things like "oh, sure, normally the central bank raising interest rates would do X, but in this episode it was also liberalizing the capital account, so Y". But the way to guard against this, and to make sure your model, or the person selling you their model, isn't just BS-ing is to check that it can actually predict out-of-sample, on data it didn't get to see during fitting. This sort of cross-validation has become second nature for (honest and competent) machine learning practitioners.

This is also where lots of ML projects die. I think I can mention an effort at a Very Big Data Indeed Company to predict employee satisfaction and turn-over based on e-mail activity, which seemed to work great on the training data, but turned out to be totally useless on the next year's data, so its creators never deployed it. Cross-validation should become second nature for economists, and you should be very suspicious of anyone offering you models who can't tell you about their out-of-sample performance. (If a model can't even predict well under a constant policy, why on Earth would you trust it to predict responses to policy changes?)
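The out-of-sample discipline can be sketched in a few lines of R (a made-up example: a quadratic truth, a sensible model, and a wildly overparameterized one). In-sample, the overfit model always looks at least as good; the honest comparison is on held-out data.

```r
set.seed(7)

# Simulated data: the true relationship is quadratic plus noise
x <- runif(60, -2, 2)
y <- x^2 + rnorm(60, sd = 0.5)

# Hold out the last 20 points; fit only on the first 40
d.train <- data.frame(x = x[1:40], y = y[1:40])
d.test  <- data.frame(x = x[41:60], y = y[41:60])

fit.ok   <- lm(y ~ poly(x, 2),  data = d.train)  # matches the truth
fit.over <- lm(y ~ poly(x, 15), data = d.train)  # memorizes training wiggles

mse <- function(fit, d) mean((predict(fit, newdata = d) - d$y)^2)

# In sample the overfit model always looks at least as good...
print(c(ok = mse(fit.ok, d.train), over = mse(fit.over, d.train)))

# ...but on held-out data it typically does worse
print(c(ok = mse(fit.ok, d.test), over = mse(fit.over, d.test)))
```

Replacing one fixed holdout with repeated train/test splits gives cross-validation proper, but the principle is the same: judge a model by predictions on data it never saw during fitting.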

Concretely, going forward, organizations like the Fund can begin to use much more flexible modeling forms, rather than just linear models. The technology to estimate them and predict from them quickly now exists. It's true that if you fit a linear regression and a non-parametric regression to the same data set, the linear regression will always have tighter confidence sets, but (as Jeffrey Racine says) that's rapid convergence to a systematically wrong answer. Expanding the range and volume of data used in your economic modeling, what I just called the "big data" point, will help deal with this, and there's a tremendous amount of on-going progress in quickly estimating flexible models on truly enormous data sets. You might need to hire some people with Ph.D.s in statistics or machine learning who also know some economics --- and by coincidence I just so happen to help train such people! --- but it's the right direction to go, to help your policy decisions be dictated by the data and by good economics, and not by what kinds of models were computationally feasible twenty or even sixty years ago.

The third point, the most purely cautionary one, is the artificial intelligence point. This is that almost everything people are calling "AI" these days is just machine learning, which is to say, nonparametric regression. Where we have seen breakthroughs is in the results of applying huge quantities of data to flexible models to do very particular tasks in very particular environments. The systems we get from this are really good at that, but really fragile, in ways that don't mesh well with our intuition about human beings or even other animals. One of the great illustrations of this is what are called "adversarial examples", where you can take an image that a state-of-the-art classifier thinks is, say, a dog, and by tweaking it in tiny ways which are imperceptible to humans, you can make the classifier convinced it's, say, a car. On the other hand, you can distort that picture of a dog into something unrecognizable to any person while the classifier is still sure it's a dog.

If we have to talk about our learning machines psychologically, try not to describe them as automating thought or (conscious) intelligence, but rather as automating unconscious perception or reflex action. What's now called "deep learning" used to be called "perceptrons", and it was very much about trying to do the same sort of thing that low-level perception in animals does, extracting features from the environment which work in that environment to make a behaviorally-relevant classification [2] or prediction or immediate action. This is the sort of thing we're almost never conscious of in ourselves, but is in fact what a huge amount of our brains are doing. (We know this because we can study how it breaks down in cases of brain damage.) This work is basically inaccessible to consciousness --- though we can get hints of it from visual illusions, and from the occasions where it fails, like the shock of surprise you feel when you put your foot on a step that isn't there. This sort of perception is fast, automatic, and tuned to very, very particular features of the environment.

Our current systems are like this, but even more finely tuned to narrow goals and contexts. This is why they have such alien failure-modes, and why they really don't have the sort of flexibility we're used to from humans or other animals. They generalize to more data from their training environment, but not to new environments. If you take a person who's learned to play chess and give them a 9-by-9 board with an extra rook on each side, they'll struggle but they won't go back to square one; AlphaZero will need to relearn the game from scratch. Similarly for the video-game learners, and just about everything else you'll see written up in the news, or pointed out as a milestone in a conference like this. Rodney Brooks, one of the Revered Elders of artificial intelligence, put it nicely recently, saying that the performances of these systems give us a very misleading idea of their competences [3].

One reason these genuinely-impressive and often-useful performances don't indicate human competences is that these systems work in very alien ways. So far as we can tell [4], there's little or nothing in them that corresponds to the kind of explicit, articulate understanding human intelligence achieves through language and conscious thought. There's even very little in them of the un-conscious, in-articulate but abstract, compositional, combinatorial understanding we (and other animals) show in manipulating our environment, in planning, in social interaction, and in the structure of language.

Now, there are traditions of AI research which do take inspiration from human (and animal) psychology (as opposed to a very old caricature of neurology), and try to actually model things like the structure of language, or planning, or having a body which can be moved in particular ways to interact with physical objects. And while these do make progress, it's a hell of a lot slower than the progress in systems which are just doing reflex action. That might change! There could be a great wave of incredible breakthroughs in AI (not ML) just around the corner, to the point where it will make sense to think about robots actually driving shipping trucks coast to coast, and so forth. Right now, not only is really autonomous AI beyond our grasp, we don't even have a good idea of what we're missing.

In the meanwhile, though, lots of people will sell their learning machines as though they were real AI, with human-style competences, and this will lead to a lot of mischief and (perhaps unintentional) fraud, as the machines get deployed in circumstances where their performance just won't be anything like what's intended. I half suspect that the biggest economic consequence of "AI" for the foreseeable future is that companies will be busy re-engineering human systems --- warehouses and factories, but also hospitals, schools and streets --- so as to better accommodate their machines.

So, to sum up:

• The "big data" point is that there's a huge opportunity for the Fund, the Bank, and their kin to really expand the data on which they base their analyses and decisions, even if you keep using the same sorts of models.
• The "machine learning" point is that there's a tremendous opportunity to use more flexible models, which do a better job of capturing economic, or political-economic, reality.
• The "AI" point is that artificial intelligence is the technology of the future, and always will be.

1. Had there been infinite time, I like to think I'd have remembered that Haavelmo saw this gap very clearly, back in the day. Fortunately, J. W. Mason has a great post on this.^

2. The classic paper on this, by, inter alia, one of the inventors of neural networks, was called "What the frog's eye tells the frog's brain". This showed how, already in the retina, the frog's nervous system picked out small-dark-dots-moving-erratically. In the natural environment, these would usually be flies or other frog-edible insects.^

3. Distinguishing between "competence" and "performance" in this way goes back, in cognitive science, at least to Noam Chomsky; I don't know whether Uncle Noam originated the distinction.^

4. The fact that I need a caveat-phrase like this is an indication of just how little we understand why some of our systems work as well as they do, which in turn should be an indication that nobody has any business making predictions about how quickly they'll advance.^

### Data over Space and Time, Lectures 9--13: Filtering, Fourier Analysis, African Population and Slavery, Linear Generative Models

I have fallen behind on posting announcements for the lectures, and I don't feel like writing five of these at once (*). So I'll just list them:

1. Separating Signal and Noise with Linear Methods (a.k.a. the Wiener filter and seasonal adjustment; .Rmd)
2. Fourier Methods I (a.k.a. a child's primer of spectral analysis; .Rmd)
3. Midterm review
4. Guest lecture by Prof. Patrick Manning: "African Population and Migration: Statistical Estimates, 1650--1900" [PDF handout]
5. Linear Generative Models for Time Series (a.k.a. the eigendecomposition of the evolution operator is the source of all knowledge; .Rmd)
6. Linear Generative Models for Spatial and Spatio-Temporal Data (a.k.a. conditional and simultaneous autoregressions; .Rmd)

*: Yes, this is a sign that I need to change my workflow. Several readers have recommended Blogdown, which looks good, but which I haven't had a chance to try out yet.
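As a small taste of the Fourier-methods material, here is a minimal sketch (in Python, though the course itself uses R) of the basic spectral trick: detrend a series, compute the periodogram, and read the seasonal period off the dominant frequency. The simulated monthly series is invented for illustration.

```python
import numpy as np

# A monthly-ish series: linear trend + 12-step seasonal cycle + noise.
rng = np.random.default_rng(1)
n = 240
t = np.arange(n)
y = 0.01 * t + 2.0 * np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=n)

# Remove the linear trend first, so it doesn't leak power into low frequencies.
y_detrended = y - np.polyval(np.polyfit(t, y, 1), t)

# Periodogram: squared magnitude of the discrete Fourier transform.
power = np.abs(np.fft.rfft(y_detrended)) ** 2
freqs = np.fft.rfftfreq(n, d=1.0)

peak = freqs[np.argmax(power[1:]) + 1]   # skip the zero-frequency term
print(1.0 / peak)                        # recovered period, close to 12
```

With 240 observations the seasonal frequency 1/12 sits exactly on the Fourier grid, so the peak lands on the right bin despite the noise.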

### Young Investigator Special Competition for Time-Sharing Experiment for the Social Sciences

Sociologists Jamie Druckman and Jeremy Freese write:

Time-Sharing Experiments for the Social Sciences is Having A Special Competition for Young Investigators

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak Panel (see http://tessexperiments.org for more information). While anyone can submit a proposal to TESS at any time through our regular mechanism, we are having a Special Competition for Young Investigators. Graduate students and individuals who received their PhD in 2016 or after are eligible.

To give some examples of experiments we’ve done: one TESS experiment showed that individuals are more likely to support a business refusing service to a gay couple versus an interracial couple, but were no more supportive of religious reasons for doing so versus nonreligious reasons. Another experiment found that participants were more likely to attribute the illnesses of obese patients to poor lifestyle choices and those of non-obese patients to biological factors, which, in turn, resulted in participants being less sympathetic to overweight patients—especially when patients are female. TESS has also fielded an experiment about whether the opinions of economists influence public opinion on different issues, and the study found that they do on relatively technical issues but not so much otherwise.

The proposals that win our Special Competition will be able to be fielded at up to twice the size of a regular TESS study. We will begin accepting proposals for the Special Competition on January 1, 2019, and the deadline is March 1, 2019. Full details about the competition are available at http://www.tessexperiments.org/yic.html.

### Magister Dixit

“The value is not in software, the value is in data, and this is really important for every single company, that they understand what data they’ve got.” John Straw

### What's new on arXiv

We introduce an extension to the Protege ontology editor, which allows for discovering concept definitions that are not explicitly present in axioms, but are logically implied by an ontology. The plugin supports ontologies formulated in the Description Logic EL, which underpins the OWL 2 EL profile of the Web Ontology Language and, despite its limited expressiveness, captures most of the biomedical ontologies published on the Web. The developed tool allows users to verify whether a concept can be defined using a vocabulary of interest specified by the user. In particular, it allows them to decide whether some vocabulary items can be omitted in a formulation of a complex concept. The corresponding definitions are presented to the user together with explanations generated by an ontology reasoner.
Most existing deep reinforcement learning (DRL) frameworks consider either discrete action space or continuous action space solely. Motivated by applications in computer games, we consider the scenario with discrete-continuous hybrid action space. To handle hybrid action space, previous works either approximate the hybrid space by discretization, or relax it into a continuous set. In this paper, we propose a parametrized deep Q-network (P-DQN) framework for the hybrid action space without approximation or relaxation. Our algorithm combines the spirits of both DQN (dealing with discrete action space) and DDPG (dealing with continuous action space) by seamlessly integrating them. Empirical results on a simulation example, scoring a goal in simulated RoboCup soccer and the solo mode in the game King of Glory (KOG) validate the efficiency and effectiveness of our method.
Accurately estimating the remaining useful life (RUL) of industrial machinery is beneficial in many real-world applications. Estimation techniques have mainly utilized linear models or neural network based approaches with a focus on short term time dependencies. This paper introduces a system model that incorporates temporal convolutions with both long term and short term time dependencies. The proposed network learns salient features and complex temporal variations in sensor values, and predicts the RUL. A data augmentation method is used for increased accuracy. The proposed method is compared with several state-of-the-art algorithms on publicly available datasets. It demonstrates promising results, with superior results for datasets obtained from complex environments.
We propose a neural machine-reading model that constructs dynamic knowledge graphs from procedural text. It builds these graphs recurrently for each step of the described procedure, and uses them to track the evolving states of participant entities. We harness and extend a recently proposed machine reading comprehension (MRC) model to query for entity states, since these states are generally communicated in spans of text and MRC models perform well in extracting entity-centric spans. The explicit, structured, and evolving knowledge graph representations that our model constructs can be used in downstream question answering tasks to improve machine comprehension of text, as we demonstrate empirically. On two comprehension tasks from the recently proposed PROPARA dataset (Dalvi et al., 2018), our model achieves state-of-the-art results. We further show that our model is competitive on the RECIPES dataset (Kiddon et al., 2015), suggesting it may be generally applicable. We present some evidence that the model’s knowledge graphs help it to impose commonsense constraints on its predictions.
Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm PAM, partitioning around medoids, also known as k-medoids. In Euclidean geometry the mean–as used in k-means–is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or even more complex distances. A key issue with PAM is, however, its high run time cost. In this paper, we propose modifications to the PAM algorithm where at the cost of storing O(k) additional values, we can achieve an O(k)-fold speedup in the second (‘SWAP’) phase of the algorithm, but will still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. We also show how the CLARA and CLARANS algorithms benefit from this modification. In experiments on real data with k=100, we observed a 200 fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.
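To see why the SWAP phase dominates PAM's run time, a deliberately naive sketch helps: each pass tries every (medoid, non-medoid) exchange and recomputes the total cost from the dissimilarity matrix. This is a toy illustration of the classic algorithm, not the paper's accelerated variant.

```python
import numpy as np

def pam(dist, k, max_iter=100):
    """Minimal PAM (k-medoids): naive BUILD, then the classic SWAP loop
    that tries every (medoid, non-medoid) exchange and keeps improvements."""
    n = dist.shape[0]
    medoids = list(range(k))                       # naive initialization
    cost = lambda meds: dist[:, meds].min(axis=1).sum()
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                         # each medoid position...
            for h in range(n):                     # ...against each candidate
                if h in medoids:
                    continue
                cand = medoids[:i] + [h] + medoids[i + 1:]
                c = cost(cand)
                if c < best:
                    medoids, best, improved = cand, c, True
        if not improved:
            break
    return sorted(medoids), best

# Two well-separated 1-D clusters; PAM should put one medoid in each.
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
d = np.abs(x[:, None] - x[None, :])    # any dissimilarity matrix works here
meds, total = pam(d, k=2)
print(meds, round(total, 3))           # [1, 4] 0.4  (middle point of each cluster)
```

Each SWAP pass costs O(k · n) cost evaluations, and each naive cost evaluation is O(nk), which is exactly the overhead the paper's O(k)-fold speedup targets.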
We propose a purely probabilistic model to explain the evolution path of a population maximum fitness. We show that after $n$ births in the population there are about $\ln n$ upwards jumps. This is true for any mutation probability and any fitness distribution and therefore suggests a general law for the number of upwards jumps. Simulations of our model show that a typical evolution path has first a steep rise followed by long plateaux. Moreover, independent runs show parallel paths. This is consistent with what was observed by Lenski and Travisano (1994) in their bacteria experiments.
We study a class of private information retrieval (PIR) methods that we call one-shot schemes. The intuition behind one-shot schemes is the following. The user’s query is regarded as a dot product of a query vector and the message vector (database) stored at multiple servers. Privacy, in an information theoretic sense, is then achieved by encrypting the query vector using a secure linear code, such as secret sharing. Several PIR schemes in the literature, in addition to novel ones constructed here, fall into this class. One-shot schemes provide an insightful link between PIR and data security against eavesdropping. However, their download rate is not optimal, i.e., they do not achieve the PIR capacity. Our main contribution is two transformations of one-shot schemes, which we call refining and lifting. We show that refining and lifting one-shot schemes gives capacity-achieving schemes for the cases when the PIR capacity is known. In the other cases, when the PIR capacity is still unknown, refining and lifting one-shot schemes gives the best download rate so far.
We study the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information $I(X;T)$ between the input $X$ and internal representations $T$ decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true $I(X;T)$ over these networks is provably either constant (discrete $X$) or infinite (continuous $X$). This work explains the discrepancy between theory and experiments, and clarifies what was actually measured by these past works. To this end, we introduce an auxiliary (noisy) DNN framework for which $I(X;T)$ is a meaningful quantity that depends on the network’s parameters. This noisy framework is shown to be a good proxy for the original (deterministic) DNN both in terms of performance and the learned representations. We then develop a rigorous estimator for $I(X;T)$ in noisy DNNs and observe compression in various models. By relating $I(X;T)$ in the noisy DNN to an information-theoretic communication problem, we show that compression is driven by the progressive clustering of hidden representations of inputs from the same class. Several methods to directly monitor clustering of hidden representations, both in noisy and deterministic DNNs, are used to show that meaningful clusters form in the $T$ space. Finally, we return to the estimator of $I(X;T)$ employed in past works, and demonstrate that while it fails to capture the true (vacuous) mutual information, it does serve as a measure for clustering. This clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.
Abstractive summarization has been studied using neural sequence transduction methods with datasets of large, paired document-summary examples. However, such datasets are rare and the models trained from them do not generalize to other domains. Recently, some progress has been made in learning sequence-to-sequence mappings with only unpaired examples. In our work, we consider the setting where there are only documents and no summaries provided and propose an end-to-end, neural model architecture to perform unsupervised abstractive summarization. Our proposed model consists of an auto-encoder trained so that the mean of the representations of the input documents decodes to a reasonable summary. We consider variants of the proposed architecture and perform an ablation study to show the importance of specific components. We apply our model to the summarization of business and product reviews and show that the generated summaries are fluent, show relevance in terms of word overlap, are representative of the average sentiment of the input documents, and are highly abstractive compared to baselines.
Understanding how a learned black box works is of crucial interest for the future of Machine Learning. In this paper, we pioneer the question of the global interpretability of learned black box models that assign numerical values to symbolic sequential data. To tackle that task, we propose a spectral algorithm for the extraction of weighted automata (WA) from such black boxes. This algorithm does not require access to a dataset or to the inner representation of the black box: the inferred model can be obtained solely by querying the black box, feeding it with inputs and analyzing its outputs. Experiments using Recurrent Neural Networks (RNN) trained on a wide collection of 48 synthetic datasets and 2 real datasets show that the obtained approximation is of great quality.
Neural architecture search (NAS) automatically finds the best task-specific neural network topology, outperforming many manual architecture designs. However, it can be prohibitively expensive as the search requires training thousands of different networks, while each can last for hours. In this work, we propose the Graph HyperNetwork (GHN) to amortize the search cost: given an architecture, it directly generates the weights by running inference on a graph neural network. GHNs model the topology of an architecture and therefore can predict network performance more accurately than regular hypernetworks and premature early stopping. To perform NAS, we randomly sample architectures and use the validation accuracy of networks with GHN generated weights as the surrogate search signal. GHNs are fast — they can search nearly 10 times faster than other random search methods on CIFAR-10 and ImageNet. GHNs can be further extended to the anytime prediction setting, where they have found networks with better speed-accuracy tradeoff than the state-of-the-art manual designs.
As scientific data repositories and filesystems grow in size and complexity, they become increasingly disorganized. The coupling of massive quantities of data with poor organization makes it challenging for scientists to locate and utilize relevant data, thus slowing the process of analyzing data of interest. To address these issues, we explore an automated clustering approach for quantifying the organization of data repositories. Our parallel pipeline processes heterogeneous filetypes (e.g., text and tabular data), automatically clusters files based on content and metadata similarities, and computes a novel ‘cleanliness’ score from the resulting clustering. We demonstrate the generation and accuracy of our cleanliness measure using both synthetic and real datasets, and conclude that it is more consistent than other potential cleanliness measures.
The current success of deep neural networks (DNNs) in an increasingly broad range of tasks for artificial intelligence strongly depends on the quality and quantity of labeled training data. In general, the scarcity of labeled data, which is often observed in many natural language processing tasks, is one of the most important issues to be addressed. Semi-supervised learning (SSL) is a promising approach to overcome this issue by incorporating a large amount of unlabeled data. In this paper, we propose a novel scalable method of SSL for text classification tasks. The unique property of our method, Mixture of Expert/Imitator Networks, is that imitator networks learn to ‘imitate’ the estimated label distribution of the expert network over the unlabeled data, which potentially contributes as a set of features for the classification. Our experiments demonstrate that the proposed method consistently improves the performance of several types of baseline DNNs. We also demonstrate that our method has the ‘more data, better performance’ property, with promising scalability to the unlabeled data.
Parameter learning is the technique for obtaining the probabilistic parameters in conditional probability tables in Bayesian networks from tables with (observed) data — where it is assumed that the underlying graphical structure is known. There are basically two ways of doing so, referred to as maximum likelihood estimation (MLE) and as Bayesian learning. This paper provides a categorical analysis of these two techniques and describes them in terms of basic properties of the multiset monad M, the distribution monad D and the Giry monad G. In essence, learning is about the relationships between multisets (used for counting) on the one hand and probability distributions on the other. These relationships will be described as suitable natural transformations.
A time series is uniquely represented by its geometric shape, which also carries information. A time series can be modelled as the trajectory of a particle moving in a force field with one degree of freedom. The force acting on the particle shapes the trajectory of its motion, which is made up of elementary shapes of infinitesimal neighborhoods of points in the trajectory. It has been proved that an infinitesimal neighborhood of a point in a continuous time series can have at least 29 different shapes or configurations. So information can be encoded in it in at least 29 different ways. A 3-point neighborhood (the smallest) in a discrete time series can have precisely 13 different shapes or configurations. In other words, a discrete time series can be expressed as a string of 13 symbols. Across diverse real as well as simulated data sets it has been observed that 6 of them occur more frequently and the remaining 7 occur less frequently. Based on the frequency distribution of the 13 configurations, or 13 different ways of information encoding, a novel entropy measure, called semantic entropy (E), has been defined. Following the notion of power in the Newtonian mechanics of the moving particle whose trajectory is the time series, a notion of information power (P) has been introduced for time series. E/P turned out to be an important indicator of synchronous behaviour of time series, as observed in epileptic EEG signals.
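One plausible way to make the "precisely 13" claim concrete is to note that 13 is the number of weak orderings of three values, i.e. the distinct sign patterns of the pairwise comparisons in a 3-point window. The sketch below adopts that reading (an assumption on my part, not taken from the paper) and computes a Shannon entropy over the shape frequencies, which may differ in detail from the paper's semantic entropy.

```python
import math
from collections import Counter

def shape(a, b, c):
    """Configuration of a 3-point neighborhood: signs of the pairwise
    comparisons.  With ties allowed there are exactly 13 distinct patterns
    (the 13 weak orderings of three values)."""
    sgn = lambda u: (u > 0) - (u < 0)
    return (sgn(b - a), sgn(c - b), sgn(c - a))

def shape_entropy(series):
    """Shannon entropy (bits) of the distribution of 3-point shapes."""
    shapes = [shape(*series[i:i + 3]) for i in range(len(series) - 2)]
    counts = Counter(shapes)
    n = len(shapes)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Sanity check: enumerating all value triples over {0,1,2} yields 13 shapes.
print(len({shape(a, b, c) for a in range(3) for b in range(3) for c in range(3)}))

# A monotone ramp uses one configuration (zero entropy); a zig-zag
# alternates between two (one bit).
print(shape_entropy([1, 2, 3, 4, 5]))      # 0.0
print(shape_entropy([0, 1, 0, 1, 0, 1]))   # 1.0
```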

### survHE new release

(This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

I have just submitted a revised version of survHE on CRAN — it should be up very shortly. This will be version 1.0.64 and its main feature is a major restructuring in the way the rstan/HMC stuff works.

Basically, this is due to a change in the default C++ compiler. I don’t think much will change in terms of how survHE works when running full Bayesian models using HMC, but R now compiles it without problems. Following the advice of Ben Goodrich, I have also modified the package so that it compiles the Stan programs serially as opposed to gluing them all together, which should optimise memory use.


## October 18, 2018

### Maryland's Bridge Safety, reported using R

A front-page story in the Baltimore Sun reported last week on the state of the bridges in Maryland. Among the report's findings:

• 5.4% of bridges are classified in "poor" or "structurally deficient" condition
• 13% of bridges in the city of Baltimore are in "poor" condition

Those findings were the result of analysis of Federal infrastructure data by reporter Christine Zhang. The analysis was performed using R and documented in a Jupyter Notebook published on Github. The raw data included almost 50 variables including type, location, ownership, and inspection dates and ratings, and required a fair bit of processing with tidyverse functions to extract the Maryland-specific statistics above. The analysis also turned up an unusual owner for one of the bridges: this one — an access road to the Goddard Space Flight Center — is owned by NASA.

You can read the story in the Baltimore Sun, or check out the R analysis in the Github repo linked below.

Github (Baltimore Sun): Maryland bridges analysis (via Sharon Machlis)

### Book Memo: “Data Mining for Systems Biology”

Methods and Protocols
This fully updated book collects numerous data mining techniques, reflecting the acceleration and diversity of the development of data-driven approaches to the life sciences. The first half of the volume examines genomics, particularly metagenomics and epigenomics, which promise to deepen our knowledge of genes and genomes, while the second half of the book emphasizes metabolism and the metabolome as well as relevant medicine-oriented subjects. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that is useful for getting optimal results. Authoritative and practical, Data Mining for Systems Biology: Methods and Protocols, Second Edition serves as an ideal resource for researchers of biology and relevant fields, such as medical, pharmaceutical, and agricultural sciences, as well as for the scientists and engineers who are working on developing data-driven techniques, such as databases, data sciences, data mining, visualization systems, and machine learning or artificial intelligence that now are central to the paradigm-altering discoveries being made with a higher frequency.

### R Packages worth a look

Creates an Interactive Tree Structure of a Directory (directotree)
Represents the content of a directory as an interactive collapsible tree. Offers the possibility to assign a text (e.g., a ‘Readme.txt’) to each folder …

Approximate the Variance of the Horvitz-Thompson Total Estimator (UPSvarApprox)
Variance approximations for the Horvitz-Thompson total estimator in Unequal Probability Sampling using only first-order inclusion probabilities. See Ma …

Presentations in the REPL (REPLesentR)
Create presentations and display them inside the R ‘REPL’ (Read-Eval-Print loop), aka the R console. Presentations can be written in ‘RMarkdown’ or any …

### Ethics in statistical practice and communication: Five recommendations.

I recently published an article summarizing some of my ideas on ethics in statistics, going over these recommendations:

1. Open data and open methods,

2. Be clear about the information that goes into statistical procedures,

3. Create a culture of respect for data,

4. Publication of criticisms,

5. Respect the limitations of statistics.

### How to Solve the ModelOps Challenge

A recent study shows that while 85% believe data science will allow their companies to obtain or sustain a competitive advantage, only 5% are using data science extensively. Join this webinar, Nov 14, to find out why.

### Distilled News

This blog consists of 3 parts:
1. What is transfer learning ?
2. Why does transfer learning work so well ?
3. Coding your first image recognizer using transfer learning.
Word embeddings have revolutionized the world of natural language processing (NLP). Conceptually, word embeddings are language modeling methods that map phrases or words in a sentence to vectors of numbers. One of the first steps in any NLP application is to determine what type of word embedding algorithm is going to be used. Typically, NLP models resort to a pretrained word embedding algorithm such as Word2Vec, GloVe or FastText. While that approach is relatively simple, it is also highly inefficient, as it is nearly impossible to determine in advance which word embedding will perform better as the NLP model evolves. What if the NLP model itself could select the best word embedding for a given context? In a recent paper, researchers from Facebook’s Artificial Intelligence Research lab (FAIR) proposed a method that allows NLP models to dynamically select a word-embedding algorithm that performs best in a given environment. Dynamic Meta-Embeddings is a technique that combines different word-embedding models in an ensemble and allows an NLP algorithm to choose which embedding to use based on performance. Facebook’s technique essentially delays the selection of an embedding algorithm from design time to runtime, based on the specific behavior of the ensemble.
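The core mechanism can be sketched in a few lines. This is a hedged illustration, with random stand-ins for the pretrained embedding tables and an untrained gate (the real method learns the projections and gate end-to-end): project each embedding to a shared size, score both, and take a softmax-weighted combination.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d1, d2, d = 5, 8, 6, 4
emb_a = rng.normal(size=(vocab, d1))   # stand-in for a "word2vec-like" table
emb_b = rng.normal(size=(vocab, d2))   # stand-in for a "fastText-like" table
proj_a = rng.normal(size=(d1, d))      # learned projections to a shared space
proj_b = rng.normal(size=(d2, d))
gate_w = rng.normal(size=2 * d)        # learned scoring vector for the gate

def meta_embed(word_id):
    """Dynamically weighted combination of two candidate embeddings."""
    a = emb_a[word_id] @ proj_a
    b = emb_b[word_id] @ proj_b
    # Softmax over the two candidates: the "dynamic" per-word choice.
    scores = np.array([gate_w[:d] @ a, gate_w[d:] @ b])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights[0] * a + weights[1] * b, weights

vec, w = meta_embed(3)
print(vec.shape, w)   # a shared-size vector plus the two attention weights
```

Because the weights come out of a softmax, they are positive and sum to one, so the model can smoothly shift trust between embeddings per word rather than committing to one at design time.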
Deep Learning is becoming a very popular subset of machine learning due to its high level of performance across many types of data. A great way to use deep learning to classify images is to build a convolutional neural network (CNN). The Keras library in Python makes it pretty simple to build a CNN. Computers see images using pixels. Pixels in images are usually related. For example, a certain group of pixels may signify an edge in an image or some other pattern. Convolutions use this to help identify images.
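The "related pixels" intuition is easy to demonstrate without Keras: a hand-rolled 2-D convolution with a vertical-edge kernel responds only where neighboring pixel values differ.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most CNNs)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny image: dark on the left, bright on the right.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

# A Sobel-style vertical-edge kernel: responds where left and right differ.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

edges = conv2d(img, sobel_x)
print(edges)   # non-zero only in the columns straddling the edge
```

A CNN layer learns many such kernels instead of hard-coding them, but the sliding-window arithmetic is exactly this.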
Suppose you are given the scores of two exams for various applicants and the objective is to classify the applicants into two categories based on their scores, i.e., into Class-1 if the applicant can be admitted to the university or into Class-0 if the candidate can’t be given admission. Can this problem be solved using Linear Regression? Let’s check.
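A minimal sketch of the standard answer, using made-up exam scores (not data from the post): fit logistic regression by gradient descent, so the output is a probability that can be thresholded at 0.5. A linear fit, by contrast, produces unbounded values that are not probabilities.

```python
import math

# Made-up scores for six applicants: (exam1, exam2, admitted?).
data = [(35, 40, 0), (45, 50, 0), (50, 45, 0),
        (70, 80, 1), (85, 75, 1), (90, 95, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression via stochastic gradient descent on the log-loss.
# Scores are rescaled to [0, 1] so a plain constant learning rate behaves.
w1 = w2 = b = 0.0
lr = 0.5
for _ in range(5000):
    for x1, x2, y in data:
        f1, f2 = x1 / 100.0, x2 / 100.0
        p = sigmoid(w1 * f1 + w2 * f2 + b)   # predicted P(admit)
        err = p - y                          # gradient of log-loss w.r.t. logit
        w1 -= lr * err * f1
        w2 -= lr * err * f2
        b -= lr * err

def predict_admit(x1, x2):
    return sigmoid(w1 * x1 / 100.0 + w2 * x2 / 100.0 + b)

# Probabilities in (0, 1): threshold at 0.5 for the class decision.
print(predict_admit(35, 40), predict_admit(90, 95))
```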
So you’ve built your model and are getting sensible results, and are now ready to squeeze out as much performance as possible. One possibility is doing Grid Search, where you try every possible combination of hyper-parameters and choose the best one. That works well if your number of choices is relatively small, but what if you have a large number of hyper-parameters, and some are continuous values that might span several orders of magnitude? Random Search works pretty well to explore the parameter space without committing to exploring all of it, but is randomly groping in the dark the best we can do? Of course not. Bayesian Optimization is a technique for optimizing a function when making sequential decisions. In this case, we’re trying to maximize performance by choosing hyper-parameter values. This sequential decision framework means that the hyper-parameters you choose for the next step will be influenced by the performance of all the previous attempts. Bayesian Optimization makes principled decisions about how to balance exploring new regions of the parameter space vs exploiting regions that are known to perform well. This is all to say that it’s generally much more efficient to use Bayesian Optimization than alternatives like Grid Search and Random Search.
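The loop can be sketched on an invented 1-D "hyper-parameter" objective: a Gaussian-process surrogate supplies a mean and an uncertainty at each candidate, and an upper-confidence-bound rule picks the next evaluation, trading exploration (high uncertainty) against exploitation (high mean). This is a toy illustration, not a production optimizer.

```python
import numpy as np

# Black-box objective to maximize (unknown to the optimizer); argmax at x = 2.
def objective(x):
    return -(x - 2.0) ** 2

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-6):
    """Gaussian-process posterior mean and std on a grid of candidates."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_grid, x_obs)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y_obs
    var = 1.0 - np.sum((Ks @ Kinv) * Ks, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

x_grid = np.linspace(-3.0, 6.0, 181)
x_obs = np.array([-2.0, 5.0])            # two initial evaluations
y_obs = objective(x_obs)

for _ in range(10):
    mu, sd = gp_posterior(x_obs, y_obs, x_grid)
    ucb = mu + 2.0 * sd                  # upper confidence bound acquisition
    x_next = x_grid[np.argmax(ucb)]      # explore/exploit trade-off
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

print(x_obs[np.argmax(y_obs)])           # best point found, near 2
```

Each new evaluation reshapes the surrogate, so the next choice depends on all previous attempts, which is exactly the sequential behavior the paragraph describes.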
Machine learning and deep learning have found their place in financial institutions for their power in predicting time series data with high degrees of accuracy, and there is a lot of research going on to improve models so that they can predict data with an even higher degree of accuracy. This post is a write-up about my project AlphaAI, which is a stacked neural network architecture that predicts the stock prices of various companies. This project was also one of the finalists at iNTUtion 2018, a hackathon for undergraduates here in Singapore.
A problem that arises a lot when you play with data is figuring out how things are connected. It could be, for example, determining, from all your friends, your friends' connections, your friends' friends' connections, and so on, to whom you are directly or indirectly connected, or how many degrees of separation you have with such and such a connection. Luckily there are tools at your disposal to perform such analysis. Those tools come under the umbrella of Network Theory, and I will cover some basic tricks in this post.
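The degrees-of-separation question is exactly a shortest-path problem, solvable with a breadth-first search. A stdlib-only sketch over a hypothetical friendship graph:

```python
from collections import deque

# A hypothetical friendship graph as an adjacency list.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": ["dave"],
    "frank": [],            # not connected to the others
}

def degrees_of_separation(graph, start, goal):
    """Breadth-first search: smallest number of hops, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        if person == goal:
            return dist
        for nxt in graph[person]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Running the search from every node (or over the whole `seen` set) also answers the first question: the set of people you are directly or indirectly connected to.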
In recent years, we’ve seen a lot of innovations in Deep Reinforcement Learning. From DeepMind and the Deep Q-learning architecture in 2014 to OpenAI playing Dota 2 with OpenAI Five in 2018, we live in an exciting and promising moment. And today we’ll learn about Curiosity-Driven Learning, one of the most exciting and promising strategies in Deep Reinforcement Learning.
AI is a field of study that seeks to understand, develop and implement intelligent behavior in hardware and software systems to mimic and expand human-like abilities. To deliver on its promise, AI implements various techniques from the field of Machine Learning (ML), a subset of studies that focus on developing software systems with the ability to learn new skills from experience, by trial and error, or by applying known rules. Deep Learning (DL) is so far the technique in Machine Learning that, by a wide margin, has delivered the most exciting results and practical use cases in domains such as speech and image recognition and language translation, and it plays a role in a wide range of current AI applications.
We have made tremendous progress in the field of Information Technology in recent times. Some of the revolutionary feats achieved in the tech ecosystem are truly commendable. Data and Analytics have been among the most commonly used words of the last decade or two. As such, it’s important to know why they are interrelated, what roles in the market are currently evolving, and how they are reshaping businesses.
Continuing this journey, I have discussed the loss function and optimization process of linear regression in Part I and of logistic regression in Part II, and this time we are heading to the Support Vector Machine.
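For reference, the SVM's hinge loss can be sketched in a few lines of NumPy (a toy illustration with hypothetical points; labels are expected in {-1, +1}):

```python
import numpy as np

def hinge_loss(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: L2 margin penalty on w plus
    C * sum(max(0, 1 - y * f(x))) hinge terms."""
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0, 1 - margins))

# Two linearly separable toy points.
X = np.array([[2., 2.], [-2., -2.]])
y = np.array([1., -1.])

# A hyperplane that separates both points with margin >= 1
# incurs zero hinge loss; only the regularization term remains.
loss = hinge_loss(np.array([0.5, 0.5]), 0.0, X, y)
```

Points classified correctly and outside the margin contribute nothing to the loss, which is what makes the SVM focus on the points near the decision boundary.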
This series aims to explain the loss functions of a few widely-used supervised learning models, and some options for optimization algorithms. In Part I, I walked through the optimization process of Linear Regression in detail, using Gradient Descent with Least Squared Error as the loss function. In this part, I will move on to Logistic Regression.
When building a machine learning model, questions like these usually come to my mind: How does a model get optimized? Why does Model A outperform Model B? To answer them, I think one entry point can be understanding the loss functions of different models, and furthermore being able to choose an appropriate loss function, or define your own, based on the goal of the project and the tolerance for each type of error. I will post a series of blogs discussing the loss functions and optimization algorithms of a few common supervised learning models. I will try to explain them in a way that is friendly to readers who don’t have a strong mathematical background. Let’s start with Part I, Linear Regression.
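As a preview of the Part I approach, gradient descent on the mean squared error of a linear model can be sketched in plain NumPy (synthetic data with a known slope and intercept, purely illustrative):

```python
import numpy as np

# Synthetic data: y = 3x + 1 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 50)
y = 3.0 * X + 1.0 + rng.normal(0, 0.5, 50)

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = (w * X + b) - y            # residuals of the current fit
    w -= lr * 2 * np.mean(err * X)   # dL/dw of the mean squared error
    b -= lr * 2 * np.mean(err)       # dL/db
```

Each step moves the parameters against the gradient of the loss, so `w` and `b` converge toward the true slope and intercept; this is the optimization that the series then generalizes to other loss functions.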
One of the most common questions I get when talking with customers is how they can set up a good big data architecture that will allow them to process all their existing data, with the ultimate goal of performing advanced analytics and AI on top of it, to extract insights that will allow them to stay relevant in today’s ever faster evolving world. To tackle this issue, I always start by asking them what their understanding of ‘Big Data’ is, because one customer is not the other. One might think that Big Data is just the way they process all their different Excel files, while another might see it as the holy grail for all their projects and intelligence. In this article, I want to explain what Big Data means to me and provide a thought process that can help you define the Big Data strategy for your organization.
Counter-intuitively, by connecting this way DenseNets require fewer parameters than an equivalent traditional CNN, as there is no need to learn redundant feature maps. Furthermore, some variations of ResNets have proven that many layers barely contribute and can be dropped. In fact, the number of parameters in ResNets is large because every layer has its own weights to learn. Instead, DenseNet layers are very narrow (e.g. 12 filters), and they just add a small set of new feature maps. Another problem with very deep networks was that they were hard to train, because of the mentioned flow of information and gradients. DenseNets solve this issue, since each layer has direct access to the gradients from the loss function and the original input image.
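The "narrow layers that each add a small set of new feature maps" idea is easy to see with shapes alone. Here is a NumPy stand-in for one dense block (random projections in place of real convolutions, growth rate 12 as in the text):

```python
import numpy as np

def fake_conv_layer(x, out_channels, rng):
    # Stand-in for a convolution: random projection to `out_channels`
    # feature maps, followed by ReLU. Shapes are what matter here.
    w = rng.standard_normal((x.shape[0], out_channels))
    return np.maximum(x.T @ w, 0).T

rng = np.random.default_rng(0)
growth_rate = 12
features = rng.standard_normal((16, 32 * 32))  # 16 input maps, flattened 32x32

for _ in range(4):                              # four layers in a dense block
    new_maps = fake_conv_layer(features, growth_rate, rng)
    # Dense connectivity: concatenate, so every later layer sees
    # the input and all earlier layers' outputs.
    features = np.concatenate([features, new_maps], axis=0)

# Channel count grows linearly: 16 + 4 * 12 = 64 feature maps.
```

Each layer only learns `growth_rate` new maps yet receives everything before it, which is why the layers can stay so narrow without losing information.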
The ImageNet dataset consists of a set of images (the authors used 1.28 million training images, 50k validation images and 100k test images) of size 224×224 belonging to 1000 different classes. CIFAR10, however, consists of a different set of images (45k training images, 5k validation images and 10k testing images) distributed across just 10 different classes. Because the sizes of the input volumes (images) are completely different, it is easy to see that the same structure will not be suitable for training on this dataset: we cannot perform the same reductions without running into dimensionality mismatches. We are going to follow the solution the authors give for training ResNets on CIFAR10, which, like the ImageNet configuration, is also somewhat tricky to follow.
Researchers observed that, when it comes to convolutional neural networks, it makes sense to affirm that ‘the deeper the better’. This makes sense, since the models should become more capable (their flexibility to adapt to any space increases because they have a bigger parameter space to explore). However, it has been noticed that after a certain depth, performance degrades. This was one of the bottlenecks of VGG: its authors couldn’t go as deep as they wanted, because the network started to lose generalization capability.
It seems like there’s yet another cloud-based text analytics Application Programming Interface (API) on the market every few weeks. If you’re interested in building an application using these kinds of services, how do you decide which API to go for? In the previous post in this series, we looked at the text analytics APIs from the behemoths in the cloud software world: Amazon, Google, IBM and Microsoft. In this post, we survey sixteen APIs offered by smaller players in the market.
If you’re in the market for an off-the-shelf text analytics API, you have a lot of options. You can choose to go with a major player in the software world, for whom each AI-related service is just another entry in their vast catalogues of tools, or you can go for a smaller provider that focusses on text analytics as their core business. In this first of two related posts, we look at what the most prominent software giants have to offer today.

### Distilled News

This paper develops a novel tree-based algorithm, called Bonsai, for efficient prediction on IoT devices – such as those based on the Arduino Uno board having an 8 bit ATmega328P microcontroller operating at 16 MHz with no native floating point support, 2 KB RAM and 32 KB read-only flash. Bonsai maintains prediction accuracy while minimizing model size and prediction costs by: (a) developing a tree model which learns a single, shallow, sparse tree with powerful nodes; (b) sparsely projecting all data into a low-dimensional space in which the tree is learnt; and (c) jointly learning all tree and projection parameters. Experimental results on multiple benchmark datasets demonstrate that Bonsai can make predictions in milliseconds even on slow microcontrollers, can fit in KB of memory, has lower battery consumption than all other algorithms while achieving prediction accuracies that can be as much as 30% higher than state-of-the-art methods for resource-efficient machine learning. Bonsai is also shown to generalize to other resource constrained settings beyond IoT by generating significantly better search results as compared to Bing’s L3 ranker when the model size is restricted to 300 bytes. Bonsai’s code can be downloaded from (BonsaiCode).
What are some open datasets for machine learning? After scraping the web for hours on end, we have created a great cheat sheet of high-quality and diverse machine learning datasets.
This article is the first in a series dedicated to explaining how Uber leverages forecasting to build better products and services. In recent years, machine learning, deep learning, and probabilistic programming have shown great promise in generating accurate forecasts. In addition to standard statistical algorithms, Uber builds forecasting solutions using these three techniques. Below, we discuss the critical components of forecasting we use, popular methodologies, backtesting, and prediction intervals.
This article focuses on the paper ‘Going deeper with convolutions’ from which the hallmark idea of inception network came out. Inception network was once considered a state-of-the-art deep learning architecture (or model) for solving image recognition and detection problems. It put forward a breakthrough performance on the ImageNet Visual Recognition Challenge (in 2014), which is a reputed platform for benchmarking image recognition and detection algorithms. Along with this, it set off a ton of research in the creation of new deep learning architectures with innovative and impactful ideas. We will go through the main ideas and suggestions propounded in the aforementioned paper and try to grasp the techniques within. In the words of the author: ‘In this paper, we will focus on an efficient deep neural network architecture for computer vision, code named Inception, which derives its name from (…) the famous ‘we need to go deeper’ internet meme.’
• Machine Learning Yearning
• Programming Collective Intelligence
• Machine Learning for Hackers
• Machine Learning by Tom M Mitchell
• The Elements of Statistical Learning
• Learning from Data
• Pattern Recognition and Machine Learning
• Natural Language Processing with Python
• Artificial Intelligence: A Modern Approach
• Artificial Intelligence for Humans
• Paradigms of Artificial Intelligence Programming
• Artificial Intelligence: A New Synthesis
• The Singularity is Near
• Life 3.0 – Being Human in the Age of Artificial Intelligence
• The Master Algorithm
Learn Time Series Analysis with R, along with how to use an R forecasting package to fit a real time series to the optimal model.
In this tutorial, you are going to learn how to create GUI apps in Python, along with all the elements needed to build them.
Let’s have a look at the main approaches to NLP tasks that we have at our disposal. We will then have a look at the concrete NLP tasks we can tackle with said approaches.
Numerical algorithms are computationally demanding, which makes performance an important consideration when using Python for machine learning, especially as you move from desktop to production.
In this webinar, we look at:
• Role of productivity and performance for numerical computing and machine learning
• Python algorithm choice and efficient package usage
• Requirements for efficient use of hardware
• NumPy and SciPy performance with the Intel MKL (math kernel library)
• How Intel and ActivePython help you accelerate and scale Python performance
Deep neural networks – the kind of machine learning models that have recently led to dramatic performance improvements in a wide range of applications – are vulnerable to tiny perturbations of their inputs. We investigate how to deal with these vulnerabilities.
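One classic way such a tiny perturbation is constructed is the fast gradient sign method (FGSM). Here is a minimal sketch applied to a hypothetical linear classifier (illustrative only, not the models studied in the post):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))     # a toy "model": 3 classes, 8 input features
x = rng.standard_normal(8)          # a clean input
true_class = 0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(inp):
    return -np.log(softmax(W @ inp)[true_class])

# For a linear model, the loss gradient w.r.t. the input is W^T (p - onehot).
p = softmax(W @ x)
onehot = np.eye(3)[true_class]
grad_x = W.T @ (p - onehot)

# FGSM: step in the sign of the gradient, bounded by eps in the L-infinity norm.
eps = 0.25
x_adv = x + eps * np.sign(grad_x)
```

The perturbation changes each input coordinate by at most `eps`, yet it is chosen to increase the loss as fast as possible, which is exactly the kind of vulnerability the post investigates.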
This is the hard-earned body of knowledge recorded in manuals that list, step by step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures: what to do after you shoot yourself in the foot in interesting ways with Git.
We previously discussed improved support in RStudio v1.2 for SQL, D3, Python, and C/C++. Today, we’re excited to announce improved support for the Stan programming language. The Stan programming language makes it possible for researchers and analysts to write high-performance and scalable statistical models.
This is it. Your build guide for constructing your personal home computer for AI and Data Science. There are already a lot of build guides out there for deep learning, many for Data Science too, and some here and there for reinforcement learning. But very few cover all of them. That’s why I wrote this guide, for people who are, or may at some point be, interested in all of these fields!
In this article, we will learn how to collect Twitter data and create interesting visualizations in Python. We will briefly explore how to collect tweets using Tweepy, and mostly explore the various data visualization techniques for Twitter data using Matplotlib. Before that, data visualization and the overall statistical process that enables it will be explained.
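Before any plotting, the usual first step is aggregating the tweets into counts. A stdlib-only sketch with made-up tweet texts (in the article these would come from Tweepy, and the counts would feed a Matplotlib bar chart):

```python
from collections import Counter
import re

# Hypothetical tweet texts, standing in for tweets fetched via Tweepy.
tweets = [
    "Loving #python for data work",
    "New #Python release today",
    "Visualizing data with #matplotlib",
]

# Extract and count hashtags, case-insensitively.
tags = Counter(
    tag.lower() for t in tweets for tag in re.findall(r"#(\w+)", t)
)
# tags.most_common() is exactly the (label, height) data a bar chart needs.
```

Most of the article's visualizations (hashtag frequencies, tweets per hour, top mentions) boil down to a grouping step like this followed by a plot call.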
Artificial Intelligence is not a buzzword anymore. As of 2018, it is a well-developed branch of Big Data analytics with multiple applications and active projects. Here is a brief review of the topic.
AI is the umbrella term for various approaches to big data analysis, such as machine learning models and deep learning networks. We have recently demystified the terms AI, ML and DL and the differences between them, so feel free to check that out. In short, AI algorithms are various data science mathematical models that help improve the outcome of a certain process or automate some routine task.
However, the technology has now matured enough to move these data science advancements from the pilot-project phase to the stage of production-ready deployment at scale. Below is an overview of the various aspects of AI technology adoption across the IT industry in 2018.
We take a look at the parameters like:
• the most widely used types of AI algorithms,
• the ways companies apply AI,
• the industries where AI implementation will have the most impact,
• the most popular languages, libraries, and APIs used for AI development.
That said, the numbers used in this review come from a variety of open sources such as Statista, Forbes, BigDataScience, DZone and others.
We reached out to the speakers to ask them about the importance of model-based decision making, how models combine with creativity, and the future of models for the industry.
A chatbot (also known as a talkbot, chatterbot, Bot, IM bot, interactive agent, or Artificial Conversational Entity) is a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods. Such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the Turing test. Chatbots are typically used in dialog systems for various practical purposes including customer service or information acquisition. Some chatterbots use sophisticated natural language processing systems, but many simpler systems scan for keywords within the input, then pull a reply with the most matching keywords, or the most similar wording pattern, from a database.
Data Science is an interesting field to work in, a combination of statistics and real-world programming. There are a number of programming languages used by Data Science engineers, each of which has unique features. The most famous among them are Scala, Python and R. Since I work with Scala at work, I would like to share some of the most important concepts that I have come across, which should be worthwhile for beginners in Data Science engineering.
Hi everyone, let’s continue the discussion on Scala for Data Science engineering. Find the first part here. In this part I will discuss Partial Functions, Pattern Matching & Case Classes, Collections, Currying, and Implicits.