# My Data Science Blogs

## October 14, 2019

### What's new on arXiv

Transfer learning aims to learn robust classifiers for the target domain by leveraging knowledge from a source domain. Since the source and the target domains are usually from different distributions, existing methods mainly focus on adapting the cross-domain marginal or conditional distributions. However, in real applications, the marginal and conditional distributions usually have different contributions to the domain discrepancy. Existing methods fail to quantitatively evaluate the different importance of these two distributions, which will result in unsatisfactory transfer performance. In this paper, we propose a novel concept called Dynamic Distribution Adaptation (DDA), which is capable of quantitatively evaluating the relative importance of each distribution. DDA can be easily incorporated into the framework of structural risk minimization to solve transfer learning problems. On the basis of DDA, we propose two novel learning algorithms: (1) Manifold Dynamic Distribution Adaptation (MDDA) for traditional transfer learning, and (2) Dynamic Distribution Adaptation Network (DDAN) for deep transfer learning. Extensive experiments demonstrate that MDDA and DDAN significantly improve the transfer learning performance and set up a strong baseline over the latest deep and adversarial methods on digit recognition, sentiment analysis, and image classification. More importantly, it is shown that marginal and conditional distributions have different contributions to the domain divergence, and our DDA is able to provide good quantitative evaluation of their relative importance, which leads to better performance. We believe this observation can be helpful for future research in transfer learning.
Spectral Clustering (SC) is a prominent data clustering technique of recent times which has attracted much attention from researchers. It is a highly data-driven method and makes no strict assumptions on the structure of the data to be clustered. One of the central pieces of spectral clustering is the construction of an affinity matrix based on a similarity measure between data points. The way the similarity measure is defined between data points has a direct impact on the performance of the SC technique. Several attempts have been made in the direction of strengthening the pairwise similarity measure to enhance spectral clustering. In this work, we have defined a novel affinity measure by employing the concept of non-conformity used in the Conformal Prediction (CP) framework. The non-conformity based affinity captures the relationship between neighborhoods of data points and has the power to generalize the notion of contextual similarity. We have shown that this formulation of the affinity measure gives good results and compares well with state-of-the-art methods.
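As a purely illustrative sketch of a neighborhood-based non-conformity affinity, one hypothetical instantiation scores how unusual a point looks among another point's nearest neighbors; the paper's exact measure may differ:

```python
import numpy as np

def neighborhood_nonconformity(X, i, j, k=2):
    """Non-conformity of x_j w.r.t. x_i's neighborhood: mean distance
    from x_j to x_i's k nearest neighbors (endpoints excluded).
    A hypothetical instantiation, not the paper's exact score."""
    d_i = np.linalg.norm(X - X[i], axis=1)
    d_i[[i, j]] = np.inf                 # exclude both endpoints
    nbrs = np.argsort(d_i)[:k]
    return np.linalg.norm(X[nbrs] - X[j], axis=1).mean()

def conformal_affinity(X, k=2):
    """Symmetric affinity matrix: low non-conformity -> high affinity."""
    n = len(X)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i, j] = np.exp(-neighborhood_nonconformity(X, i, j, k))
    return (A + A.T) / 2
```

On two tight clusters, within-cluster affinities come out far higher than between-cluster ones, which is exactly what the spectral step needs.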
Automatic question generation is an important problem in natural language processing. In this paper we propose a novel adaptive copying recurrent neural network model to tackle the problem of question generation from sentences and paragraphs. The proposed model adds a copying mechanism component onto a bidirectional LSTM architecture to generate more suitable questions adaptively from the input data. Our experimental results show the proposed model can outperform the state-of-the-art question generation methods in terms of BLEU and ROUGE evaluation scores.
Federated Machine Learning (FML) creates an ecosystem for multiple parties to collaborate on building models while protecting data privacy for the participants. A measure of the contribution of each party in FML enables fair credit allocation. In this paper we develop simple but powerful techniques to fairly calculate the contributions of multiple parties in FML, in the context of both horizontal FML and vertical FML. For horizontal FML we use a deletion method to calculate the grouped instance influence. For vertical FML we use Shapley values to calculate the grouped feature importance. Our methods open the door for research in model contribution and credit allocation in the context of federated machine learning.
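For intuition, exact Shapley contributions can be computed directly from a coalition-utility function when the number of parties is small. The sketch below treats utility as a black box (say, the accuracy of a model trained on a coalition's pooled data); the paper's grouped-feature version is more elaborate:

```python
from itertools import combinations
from math import factorial

def shapley_values(parties, utility):
    """Exact Shapley value of each party given utility(coalition) -> score.
    Exponential in the number of parties, so only practical for a few
    participants; an illustrative sketch, not the paper's exact method."""
    n = len(parties)
    values = {}
    for p in parties:
        rest = [q for q in parties if q != p]
        total = 0.0
        for r in range(n):
            for coalition in combinations(rest, r):
                s = len(coalition)
                weight = factorial(s) * factorial(n - s - 1) / factorial(n)
                total += weight * (utility(frozenset(coalition) | {p})
                                   - utility(frozenset(coalition)))
        values[p] = total
    return values
```

For example, with utilities 0.6, 0.7 and 0.9 for coalitions {A}, {B} and {A, B}, the values come out to 0.4 and 0.5, summing to the full-coalition utility as Shapley efficiency requires.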
Pre-trained language representation models, such as BERT, capture a general language representation from large-scale corpora, but lack domain-specific knowledge. When reading a domain text, experts make inferences with relevant knowledge. For machines to achieve this capability, we propose a knowledge-enabled language representation model (K-BERT) with knowledge graphs (KGs), in which triples are injected into the sentences as domain knowledge. However, too much knowledge incorporation may divert the sentence from its correct meaning, which is called the knowledge noise (KN) issue. To overcome KN, K-BERT introduces soft-position embeddings and a visible matrix to limit the impact of knowledge. K-BERT can easily inject domain knowledge into a model by equipping it with a KG, without pre-training from scratch, because it is capable of loading model parameters from the pre-trained BERT. Our investigation reveals promising results in twelve NLP tasks. Especially in domain-specific tasks (including finance, law, and medicine), K-BERT significantly outperforms BERT, which demonstrates that K-BERT is an excellent choice for solving knowledge-driven problems that require experts.
The widespread use of ML-based decision making in domains with high societal impact such as recidivism, job hiring and loan credit has raised a lot of concerns regarding potential discrimination. In particular, in certain cases it has been observed that ML algorithms can provide different decisions based on sensitive attributes such as gender or race and therefore can lead to discrimination. Although several fairness-aware ML approaches have been proposed, their focus has been largely on preserving the overall classification accuracy while improving fairness in predictions for both protected and non-protected groups (defined based on the sensitive attribute(s)). The overall accuracy, however, is not a good indicator of performance in the case of class imbalance, as it is biased towards the majority class. As we will see in our experiments, many of the fairness-related datasets suffer from class imbalance, and therefore tackling fairness also requires tackling the imbalance problem. To this end, we propose AdaFair, a fairness-aware classifier based on AdaBoost that further updates the weights of the instances in each boosting round taking into account a cumulative notion of fairness based upon all current ensemble members, while explicitly tackling class imbalance by optimizing the number of ensemble members for balanced classification error. Our experiments show that our approach can achieve parity in true positive and true negative rates for both protected and non-protected groups, while it significantly outperforms existing fairness-aware methods by up to 25% in terms of balanced error.
We propose K-TanH, a novel, highly accurate, hardware-efficient approximation of the popular activation function Tanh for Deep Learning. K-TanH consists of a sequence of parameterized bit/integer operations, such as masking, shift and add/subtract (no floating point operation needed), where parameters are stored in a very small look-up table. The design of K-TanH is flexible enough to deal with multiple numerical formats, such as FP32 and BFloat16. High quality approximations to other activation functions, e.g., Swish and GELU, can be derived from K-TanH. We provide an RTL design for K-TanH to demonstrate its area/power/performance efficacy. It is more accurate than existing piecewise approximations for Tanh. For example, K-TanH achieves $\sim 5\times$ speed up and $> 6\times$ reduction in maximum approximation error over a software implementation of Hard TanH. Experimental results for low-precision BFloat16 training of the language translation model GNMT on WMT16 data sets with approximate Tanh and Sigmoid obtained via K-TanH achieve similar accuracy and convergence as training with exact Tanh and Sigmoid.
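The paper's bit-level implementation isn't reproduced here, but the table-driven idea can be sketched in floating point: quantize |x| into a small number of intervals and apply a per-interval affine function from a look-up table. The breakpoints and coefficients below are my own illustrative choices, not the paper's parameters:

```python
import numpy as np

# Illustrative table: 16 equal intervals on [0, 4), each carrying an
# affine approximation (slope, intercept) of tanh on that interval.
BREAKS = np.linspace(0.0, 4.0, 17)
SLOPES = np.diff(np.tanh(BREAKS)) / np.diff(BREAKS)
INTERCEPTS = np.tanh(BREAKS[:-1]) - SLOPES * BREAKS[:-1]

def tanh_lut(x):
    """Table-driven tanh: pick an interval, apply its affine function.
    Float sketch of the idea; K-TanH itself uses integer shift/add."""
    s = np.sign(x)
    a = np.minimum(np.abs(x), BREAKS[-1] - 1e-9)   # saturate beyond 4
    idx = np.minimum((a / 0.25).astype(int), len(SLOPES) - 1)
    return s * (SLOPES[idx] * a + INTERCEPTS[idx])
```

Even this naive 16-entry table keeps the maximum error below 0.01 on [-6, 6]; the paper replaces the float arithmetic with masking, shifts and adds.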
Distributed word embeddings have yielded state-of-the-art performance in many NLP tasks, mainly due to their success in capturing useful semantic information. These representations assign only a single vector to each word whereas a large number of words are polysemous (i.e., have multiple meanings). In this work, we approach this critical problem in lexical semantics, namely that of representing various senses of polysemous words in vector spaces. We propose a topic modeling based skip-gram approach for learning multi-prototype word embeddings. We also introduce a method to prune the embeddings determined by the probabilistic representation of the word in each topic. We use our embeddings to show that they can capture the context and word similarity strongly and outperform various state-of-the-art implementations.
Reinforcement Learning (RL) algorithms usually assume their environment to be a Markov Decision Process (MDP). Additionally, they do not try to identify specific features of environments which could help them perform better. Here, we present a few key meta-features of environments: delayed rewards, specific reward sequences, sparsity of rewards, and stochasticity of environments, which may violate the MDP assumptions and adapting to which should help RL agents perform better. While it is very time consuming to run RL algorithms on standard benchmarks, we define a parameterised collection of fast-to-run toy benchmarks in OpenAI Gym by varying these meta-features. Despite their toy nature and low compute requirements, we show that these benchmarks present substantial difficulties to current RL algorithms. Furthermore, since we can generate environments with a desired value for each of the meta-features, we have fine-grained control over the environments’ difficulty and also have the ground truth available for evaluating algorithms. We believe that devising algorithms that can detect such meta-features of environments and adapt to them will be key to creating robust RL algorithms that work in a variety of different real-world problems.
We introduce SpERT, an attention model for span-based joint entity and relation extraction. Our approach employs the pre-trained Transformer network BERT as its core. We use BERT embeddings as shared inputs for light-weight reasoning, which features entity recognition and filtering, as well as relation classification with a localized, marker-free context representation. The model is trained on strong within-sentence negative samples, which are efficiently extracted in a single BERT pass. These aspects facilitate a search over all spans in the sentence. In ablation studies, we demonstrate the benefits of pre-training, strong negative sampling and localized context. Our model outperforms prior work by up to 5% F1 score on several datasets for joint entity and relation extraction.
Combinatorial optimization problems for clustering are known to be NP-hard. Most optimization methods are not able to find the global optimum solution for all datasets. To solve this problem, we propose a global optimal path-based clustering (GOPC) algorithm in this paper. The GOPC algorithm is based on two facts: (1) medoids have the minimum degree in their clusters; (2) the minimax distance between two objects in one cluster is smaller than the minimax distance between objects in different clusters. Extensive experiments are conducted on synthetic and real-world datasets to evaluate the performance of the GOPC algorithm. The results on synthetic datasets show that the GOPC algorithm can recognize all kinds of clusters regardless of their shapes, sizes, or densities. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of the GOPC algorithm. In addition, the GOPC algorithm needs only one parameter, i.e., the number of clusters, which can be estimated by the decision graph. The advantages mentioned above make GOPC a good candidate as a general clustering algorithm. Code is available at https://…/Clustering.
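The minimax distance in fact (2) has a compact definition: over all paths connecting two points, take the path whose largest hop is smallest. A Floyd-Warshall-style sketch (O(n^3), for illustration only, not the GOPC algorithm itself):

```python
import numpy as np

def minimax_distances(X):
    """Minimax (path-based) distance: the smallest achievable 'largest
    hop' over all paths between two points, via a Floyd-Warshall-style
    update (min of max) on the Euclidean distance matrix."""
    M = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    for k in range(len(X)):
        M = np.minimum(M, np.maximum(M[:, k:k + 1], M[k:k + 1, :]))
    return M
```

For a chain at 0, 1, 2 plus an outlier at 10, the minimax distance between 0 and 2 is 1 (hopping through the middle point), while reaching the outlier costs 8, so the within/between gap that fact (2) relies on shows up clearly.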
Sequential vision-to-language or visual storytelling has recently been one of the areas of focus in computer vision and language modeling domains. Though existing models generate narratives that read subjectively well, there could be cases when these models miss out on generating stories that account for and address all prospective human and animal characters in the image sequences. Considering this scenario, we propose a model that implicitly learns relationships between provided characters and thereby generates stories with respective characters in scope. We use the VIST dataset for this purpose and report numerous statistics on the dataset. Eventually, we describe the model, explain the experiment and discuss our current status and future work.
We present sktime — a new scikit-learn compatible Python library with a unified interface for machine learning with time series. Time series data gives rise to various distinct but closely related learning tasks, such as forecasting and time series classification, many of which can be solved by reducing them to related simpler tasks. We discuss the main rationale for creating a unified interface, including reduction, as well as the design of sktime’s core API, supported by a clear overview of common time series tasks and reduction approaches.
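The reduction idea is easy to illustrate outside of sktime itself: forecasting reduces to tabular regression by sliding a window over the series. This numpy sketch is not sktime's API (sktime wraps the same idea behind its unified interface):

```python
import numpy as np

def sliding_window(y, window=3):
    """Reduce univariate forecasting to tabular regression: each row of X
    holds `window` past values and the target is the value that follows."""
    y = np.asarray(y, dtype=float)
    X = np.stack([y[i:i + window] for i in range(len(y) - window)])
    t = y[window:]
    return X, t

y = np.arange(10, dtype=float)              # toy series 0, 1, ..., 9
X, t = sliding_window(y)                    # X: (7, 3), t: (7,)
w, *_ = np.linalg.lstsq(X, t, rcond=None)   # stand-in for any regressor
forecast = y[-3:] @ w                       # one-step-ahead prediction
```

On this linear toy series the fitted regressor forecasts the next value, 10, exactly; any tabular learner can be slotted in place of the least squares fit.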
Recently, generating adversarial examples has become an important means of measuring robustness of a deep learning model. Adversarial examples help us identify the susceptibilities of the model and further counter those vulnerabilities by applying adversarial training techniques. In natural language domain, small perturbations in the form of misspellings or paraphrases can drastically change the semantics of the text. We propose a reinforcement learning based approach towards generating adversarial examples in black-box settings. We demonstrate that our method is able to fool well-trained models for (a) IMDB sentiment classification task and (b) AG’s news corpus news categorization task with significantly high success rates. We find that the adversarial examples generated are semantics-preserving perturbations to the original text.
In this work we present Ludwig, a flexible, extensible and easy-to-use toolbox which allows users to train deep learning models and use them for obtaining predictions without writing code. Ludwig implements a novel approach to deep learning model building based on two main abstractions: data types and declarative configuration files. The data type abstraction allows for easier code and sub-model reuse, and the standardized interfaces imposed by this abstraction allow for encapsulation and make the code easy to extend. Declarative model definition configuration files enable inexperienced users to obtain effective models and increase the productivity of expert users. Alongside these two innovations, Ludwig introduces a general modularized deep learning architecture called Encoder-Combiner-Decoder that can be instantiated to perform a wide variety of machine learning tasks. These innovations make it possible for engineers, scientists from other fields and, in general, a much broader audience to adopt deep learning models for their tasks, concretely helping democratize deep learning.
The ability to understand and work with numbers (numeracy) is critical for many complex reasoning tasks. Currently, most NLP models treat numbers in text in the same way as other tokens—they embed them as distributed vectors. Is this enough to capture numeracy? We begin by investigating the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset. We find this model excels on questions that require numerical reasoning, i.e., it already captures numeracy. To understand how this capability emerges, we probe token embedding methods (e.g., BERT, GloVe) on synthetic list maximum, number decoding, and addition tasks. A surprising degree of numeracy is naturally present in standard embeddings. For example, GloVe and word2vec accurately encode magnitude for numbers up to 1,000. Furthermore, character-level embeddings are even more precise—ELMo captures numeracy best of all pre-trained methods—but BERT, which uses sub-word units, is less exact.
We introduce Block Sparse Canonical Correlation Analysis which estimates multiple pairs of canonical directions (together a ‘block’) at once, resulting in significantly improved orthogonality of the sparse directions which, we demonstrate, translates to more interpretable solutions. Our approach builds on the sparse CCA method of (Solari, Brown, and Bickel 2019) in that we also express the bi-convex objective of our block formulation as a concave minimization problem over an orthogonal k-frame in a unit Euclidean ball, which in turn, due to concavity of the objective, is shrunk to a Stiefel manifold, which is optimized via a gradient descent algorithm. Our simulations show that our method outperforms existing sCCA algorithms and implementations in terms of computational cost and stability, mainly due to the drastic shrinkage of our search space, and the correlation within and orthogonality between pairs of estimated canonical covariates. Finally, we apply our method, available as an R-package called BLOCCS, to multi-omic data on Lung Squamous Cell Carcinoma (LUSC) obtained via The Cancer Genome Atlas, and demonstrate its capability in capturing meaningful biological associations relevant to the hypothesis under study rather than spurious dominant variations.
Few-shot learning (FSL) for action recognition is a challenging task of recognizing novel action categories which are represented by few instances in the training data. In a more generalized FSL setting (G-FSL), both seen as well as novel action categories need to be recognized. Conventional classifiers suffer due to inadequate data in the FSL setting and inherent bias towards seen action categories in the G-FSL setting. In this paper, we address this problem by proposing a novel ProtoGAN framework which synthesizes additional examples for novel categories by conditioning a conditional generative adversarial network with class prototype vectors. These class prototype vectors are learnt using a Class Prototype Transfer Network (CPTN) from examples of seen categories. Our synthesized examples for a novel class are semantically similar to real examples belonging to that class and are used to train a model exhibiting better generalization towards novel classes. We support our claim by performing extensive experiments on three datasets: UCF101, HMDB51 and Olympic-Sports. To the best of our knowledge, we are the first to report the results for G-FSL and provide a strong benchmark for future research. We also outperform the state-of-the-art method in FSL for all the aforementioned datasets.
A new approach to sparse Canonical Correlation Analysis (sCCA) is proposed with the aim of discovering interpretable associations in very high-dimensional multi-view problems, i.e., observations of multiple sets of variables on the same subjects. Inspired by the sparse PCA approach of Journee et al. (2010), we also show that the sparse CCA formulation, while non-convex, is equivalent to a maximization program of a convex objective over a compact set, for which we propose a first-order gradient method. This result helps us reduce the search space drastically to the boundaries of the set. Consequently, we propose a two-step algorithm, where we first infer the sparsity pattern of the canonical directions using our fast algorithm, then we shrink each view, i.e. observations of a set of covariates, to contain observations on the sets of covariates selected in the previous step, and compute their canonical directions via any CCA algorithm. We also introduce Directed Sparse CCA, which is able to find associations which are aligned with a specified experiment design, and Multi-View sCCA, which is used to discover associations between multiple sets of covariates. Our simulations establish the superior convergence properties and computational efficiency of our algorithm as well as accuracy in terms of the canonical correlation and its ability to recover the supports of the canonical directions. We study the associations between metabolomics, transcriptomics and microbiomics in a multi-omic study using MuLe, which is an R-package that implements our approach, in order to form hypotheses on mechanisms of adaptation of Drosophila melanogaster to high doses of environmental toxicants, specifically Atrazine, a commonly used herbicide.
Many studies collect functional data from multiple subjects that have both multilevel and multivariate structures. An example of such data comes from popular neuroscience experiments where participants’ brain activity is recorded using modalities such as EEG and summarized as power within multiple time-varying frequency bands within multiple electrodes, or brain regions. Summarizing the joint variation across multiple frequency bands for both whole-brain variability between subjects, as well as location-variation within subjects, can help to explain neural reactions to stimuli. This article introduces a novel approach to conducting interpretable principal components analysis on multilevel multivariate functional data that decomposes total variation into subject-level and replicate-within-subject-level (i.e. electrode-level) variation, and provides interpretable components that can be both sparse among variates (e.g. frequency bands) and have localized support over time within each frequency band. The sparsity and localization of components is achieved by solving an innovative rank-one based convex optimization problem with block Frobenius and matrix $L_1$-norm based penalties. The method is used to analyze data from a study to better understand reactions to emotional information in individuals with histories of trauma and the symptom of dissociation, revealing new neurophysiological insights into how subject- and electrode-level brain activity are associated with these phenomena.
Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers, and is significantly slower in heterogeneous situations. AD-PSGD, a newly proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds and design a distributed training method that has both the high performance of All-Reduce in homogeneous environments and the heterogeneity tolerance of AD-PSGD? In this paper, we propose Ripples, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve the above goal with intensive synchronization optimization, emphasizing the interplay between algorithm and system implementation. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that allows a large group of workers to synchronize quickly. To reduce synchronization conflict, we propose static group scheduling in homogeneous environments and simple techniques (Group Buffer and Group Division) to avoid conflicts with slightly reduced randomness. Our experiments show that in a homogeneous environment, Ripples is 1.1 times faster than the state-of-the-art implementation of All-Reduce, 5.1 times faster than Parameter Server and 4.3 times faster than AD-PSGD. In a heterogeneous setting, Ripples shows 2 times speedup over All-Reduce, and still obtains 3 times speedup over the Parameter Server baseline.
Recent work in unsupervised language modeling demonstrates that training large neural language models advances the state of the art in Natural Language Processing applications. However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors. In this work, we implement a simple, efficient intra-layer model parallel approach that enables training state of the art transformer language models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging an 8.3 billion parameter transformer language model using 512 GPUs, making it the largest transformer model ever trained, at 24 times the size of BERT and 5.6 times the size of GPT-2. We sustain up to 15.1 PetaFLOPs per second across the entire application with 76% scaling efficiency, compared to a strong single processor baseline that sustains 39 TeraFLOPs per second, which is 30% of peak FLOPs. The model is trained on 174GB of text, requiring 12 ZettaFLOPs over 9.2 days to converge. Transferring this language model achieves state of the art (SOTA) results on the WikiText103 (10.8 compared to SOTA perplexity of 16.4) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. We release training and evaluation code, as well as the weights of our smaller portable model, for reproducibility.
Deep neural networks (DNN) have achieved unprecedented success in numerous machine learning tasks in various domains. However, the existence of adversarial examples raises our concerns in adopting deep learning to safety-critical applications. As a result, we have witnessed increasing interest in studying attack and defense mechanisms for DNN models on different data types, such as images, graphs and text. Thus, it is necessary to provide a systematic and comprehensive overview of the main threats of attacks and the success of corresponding countermeasures. In this survey, we review the state-of-the-art algorithms for generating adversarial examples and the countermeasures against adversarial examples, for the three most popular data types: images, graphs and text.
In recent years, the softmax model and its fast approximations have become the de facto loss functions for deep neural networks when dealing with multi-class prediction. This loss has been extended to language modeling and recommendation, two fields that fall into the framework of learning from Positive and Unlabeled data. In this paper, we stress the different drawbacks of the current family of softmax losses and sampling schemes when applied in a Positive and Unlabeled learning setup. We propose both a Relaxed Softmax loss (RS) and a new negative sampling scheme based on a Boltzmann formulation. We show that the new training objective is better suited for the tasks of density estimation, item similarity and next-event prediction by driving uplifts in performance on textual and recommendation datasets against the classical softmax.
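Boltzmann sampling of negatives is simple to sketch: draw candidates with probability proportional to exp(score/T), so a temperature knob interpolates between uniform sampling and always picking the hardest negatives. A generic version of the idea, not the paper's exact scheme:

```python
import numpy as np

def boltzmann_negatives(scores, n_samples, temperature=1.0, rng=None):
    """Draw negative items with probability proportional to
    exp(score / temperature): hard (high-scoring) negatives are
    sampled more often. Illustrative, not the paper's exact scheme."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(scores, dtype=float) / temperature
    p = np.exp(logits - logits.max())   # subtract max for stability
    p /= p.sum()
    return rng.choice(len(scores), size=n_samples, p=p)
```

Lower temperatures concentrate the distribution on high-scoring items, while large temperatures recover near-uniform sampling.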
Fair machine learning has become a significant research topic with broad societal impact. However, most fair learning methods require direct access to personal demographic data, whose use is increasingly restricted to protect user privacy (e.g. by the EU General Data Protection Regulation). In this paper, we propose a distributed fair learning framework for protecting the privacy of demographic data. We assume this data is privately held by a third party, which can communicate with the data center (responsible for model development) without revealing the demographic information. We propose a principled approach to designing fair learning methods under this framework, exemplify four methods and show they consistently outperform their existing counterparts in both fairness and accuracy across three real-world data sets. We theoretically analyze the framework, and prove it can learn models with high fairness or high accuracy, with their trade-offs balanced by a threshold variable.
Ensemble models comprising deep Convolutional Neural Networks (CNN) have shown significant improvements in model generalization, but at the cost of large computation and memory requirements. In this paper, we present a framework for learning compact CNN models with improved classification performance and model generalization. For this, we propose a CNN architecture of a compact student model with parallel branches which are trained using ground truth labels and information from high capacity teacher networks in an ensemble learning fashion. Our framework provides two main benefits: i) Distilling knowledge from different teachers into the student network promotes heterogeneity in feature learning at different branches of the student network and enables the network to learn diverse solutions to the target problem. ii) Coupling the branches of the student network through ensembling encourages collaboration and improves the quality of the final predictions by reducing variance in the network outputs. Experiments on the well established CIFAR-10 and CIFAR-100 datasets show that our Ensemble Knowledge Distillation (EKD) improves classification accuracy and model generalization, especially in situations with limited training data. Experiments also show that our EKD-based compact networks outperform state-of-the-art knowledge distillation based methods in terms of mean accuracy on the test datasets.
We develop a projected least squares estimator for the change point parameter in a high dimensional time series model with a potential change point. Importantly we work under the setup where the jump size may be near the boundary of the region of detectability. The proposed methodology yields an optimal rate of convergence despite high dimensionality of the assumed model and a potentially diminishing jump size. The limiting distribution of this estimate is derived, thereby allowing construction of a confidence interval for the location of the change point. A secondary near optimal estimate is proposed which is required for the implementation of the optimal projected least squares estimate. The pre-step estimation procedure is designed to also agnostically detect the case where no change point exists, thereby removing the need to pretest for the existence of a change point for the implementation of the inference methodology. Our results are presented under a general positive definite spatial dependence setup, assuming no special structure on this dependence. The proposed methodology is designed to be highly scalable, and applicable to very large data. Theoretical results regarding detection and estimation consistency and the limiting distribution are numerically supported via Monte Carlo simulations.
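In the simplest one-dimensional mean-shift setting, the least squares change point estimate just scans candidate splits; the paper's contribution is making a projected version of this idea work in high dimensions with near-boundary jump sizes. A toy sketch of the scan:

```python
import numpy as np

def change_point_lse(y):
    """Least squares change point estimate for a univariate mean-shift
    model: pick the split minimizing within-segment squared error.
    A one-dimensional illustration, not the paper's projected estimator."""
    y = np.asarray(y, dtype=float)
    best, best_sse = None, np.inf
    for tau in range(1, len(y)):
        left, right = y[:tau], y[tau:]
        sse = ((left - left.mean()) ** 2).sum() \
            + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best, best_sse = tau, sse
    return best
```

On a clean mean shift the scan recovers the true split exactly; the hard statistical questions (rates, limiting distribution, dependence) arise once noise, dimension and diminishing jump sizes enter.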

### R Packages worth a look

Phase Plane Analysis of One- And Two-Dimensional Autonomous ODE Systems (phaseR)
Performs a qualitative analysis of one- and two-dimensional autonomous ordinary differential equation systems, using phase plane methods. Programs are available to identify and classify equilibrium points, plot the direction field, and plot trajectories for multiple initial conditions. In the one-dimensional case, a program is also available to plot the phase portrait. In the two-dimensional case, programs are additionally available to plot nullclines and stable/unstable manifolds of saddle points. Many example systems are provided for the user. Further details can be found in Grayling (2014) <doi:10.32614/RJ-2014-023>.

Distributions for Ecological Models in ‘nimble’ (nimbleEcology)
Common ecological distributions for ‘nimble’ models in the form of nimbleFunction objects. Includes Cormack-Jolly-Seber, occupancy, dynamic occupancy, hidden Markov, and dynamic hidden Markov models. (Jolly (1965) <doi:10.2307/2333826>, Seber (1965) <doi:10.2307/2333827>, Turek et al. (2016) <doi:10.1007/s10651-016-0353-z>).

Bayesian Estimation of Bivariate Volatility Model (BayesBEKK)
Multivariate Generalized Autoregressive Conditional Heteroskedasticity (MGARCH) models are used for modelling volatile multivariate data sets. In this package, a variant of MGARCH called BEKK (Baba, Engle, Kraft, Kroner), proposed by Engle and Kroner (1995) <http://…/3532933>, is used to estimate bivariate time series data with Bayesian techniques.

Data Transformation or Simulation with Empirical Covariance Matrix (simTargetCov)
Transforms or simulates data with a target empirical covariance matrix supplied by the user. The method to obtain the data with the target empirical covariance matrix is described in Section 5.1 of Christidis, Van Aelst and Zamar (2019) <arXiv:1812.05678>.
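The underlying trick can be sketched in a few lines of base R. This is a generic whiten-then-recolor construction, not the package’s actual implementation: center the data, whiten it with the inverse Cholesky factor of its sample covariance, then recolor with the Cholesky factor of the target.

```r
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 2), n, 2)            # raw data
target <- matrix(c(1, 0.8, 0.8, 2), 2, 2)  # desired empirical covariance

Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns
W  <- solve(chol(cov(Xc)))                    # whitening matrix: cov(Xc %*% W) = I
Y  <- Xc %*% W %*% chol(target)               # recolor so cov(Y) equals the target

round(cov(Y), 10)  # matches `target` up to numerical error
```

Because `cov()` is used consistently for both the whitening and the check, the match is exact up to floating-point error, not just approximate.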

Graphical and Numerical Checks for Mode-Finding Routines (optimCheck)
Tools for checking that the output of an optimization algorithm is indeed at a local mode of the objective function. This is accomplished graphically by calculating all one-dimensional ‘projection plots’ of the objective function, i.e., varying each input variable one at a time with all other elements of the potential solution being fixed. The numerical values in these plots can be readily extracted for the purpose of automated and systematic unit-testing of optimization routines.
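The check is easy to sketch by hand; the following is a minimal base R illustration of the idea, not the optimCheck API (the objective `f` is a made-up example with a known mode):

```r
# Objective with a known local (and global) mode at (1, -2)
f <- function(x) (x[1] - 1)^2 + (x[2] + 2)^2

# Candidate solution from a generic optimizer
fit <- optim(c(0, 0), f)

# 'Projection plot' for the first input: vary x1, hold x2 at the solution
grid <- seq(fit$par[1] - 1, fit$par[1] + 1, length.out = 101)
vals <- sapply(grid, function(g) f(c(g, fit$par[2])))

# At a true local mode, the projection is minimized at the candidate itself
plot(grid, vals, type = "l", xlab = "x1", ylab = "objective")
abline(v = fit$par[1], lty = 2)
```

Repeating this one-at-a-time check for each input variable, and extracting the numerical minima for assertions, is what the package automates.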

### Making a background color gradient in ggplot2

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I was recently making some arrangements for the 2020 eclipse in South America, which of course got me thinking of the day we were lucky enough to have a path of totality come to us.

We have a weather station that records local temperature every 5 minutes, so after the eclipse I was able to plot the temperature change over the eclipse as we experienced it at our house. Here is an example of a basic plot I made at the time.

Looking at this now with new eyes, I see it might be nice to replace the gray rectangle with one that goes from light to dark to light as the eclipse progresses to totality and then back. I’ll show how I tackled making a gradient color background in this post.

I’ll load ggplot2 for plotting and dplyr for data manipulation.

library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

# The dataset

My weather station records the temperature in °Fahrenheit every 5 minutes. I downloaded the data from 6 AM to 12 PM local time and cleaned it up a bit. The date-times and temperature are in a dataset I named temp. You can download this below if you’d like to play around with these data.

Here are the first six lines of this temperature dataset.

head(temp)
# # A tibble: 6 x 2
#   datetime            tempf
#   <dttm>              <dbl>
# 1 2017-08-21 06:00:00  54.9
# 2 2017-08-21 06:05:00  54.9
# 3 2017-08-21 06:10:00  54.9
# 4 2017-08-21 06:15:00  54.9
# 5 2017-08-21 06:20:00  54.9
# 6 2017-08-21 06:25:00  54.8

I also stored the start and end times of the eclipse and totality in data.frames, which I pulled for my location from this website.

If following along at home, make sure your time zones match for all the date-time variables or, from personal experience, you’ll run into problems.

eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10"),
end = as.POSIXct("2017-08-21 11:37:19") )

totality = data.frame(start = as.POSIXct("2017-08-21 10:16:55"),
end = as.POSIXct("2017-08-21 10:18:52") )
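One defensive option is to state the zone explicitly in every as.POSIXct() call. My times are US Pacific, so "America/Los_Angeles" below is an assumption you would adjust for your own location (the `eclipse_tz` name is just for this illustration):

```r
# Same start/end times, with the time zone stated explicitly
eclipse_tz = data.frame(
  start = as.POSIXct("2017-08-21 09:05:10", tz = "America/Los_Angeles"),
  end   = as.POSIXct("2017-08-21 11:37:19", tz = "America/Los_Angeles")
)

# Sanity check: the eclipse lasted 2 hours 32 minutes 9 seconds = 9129 seconds
difftime(eclipse_tz$end, eclipse_tz$start, units = "secs")
```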

# Initial plot

I decided to make a plot of the temperature change during the eclipse only.

To keep the temperature line looking continuous, even though it’s taken every 5 minutes, I subset the data to times close but outside the start and end of the eclipse.

plottemp = filter(temp, between(datetime,
as.POSIXct("2017-08-21 09:00:00"),
as.POSIXct("2017-08-21 12:00:00") ) )

I then zoomed the plot to only include times encompassed by the eclipse with coord_cartesian(). I removed the x axis expansion in scale_x_datetime().

Since the plot covers only about 2 and a half hours, I make breaks on the x axis every 15 minutes.

ggplot(plottemp) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "15 min",
date_labels = "%H:%M",
expand = c(0, 0) ) +
coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(16, 24, by = 1)) ) +
theme_bw(base_size = 14) +
theme(panel.grid = element_blank() ) 

I wanted the background of the plot to go from light to dark back to light through time. This means a color gradient should go from left to right across the plot.

Since the gradient will be based on time, I figured I could add a vertical line with geom_segment() for every second of the eclipse and color each segment based on how far that time was from totality.

## Make a variable for the color mapping

The first step I took was to make a variable with a row for every second of the eclipse, since I wanted a segment drawn for each second. I used seq.POSIXt() for this.

color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = "1 sec") )

Then came some hard thinking. How would I make a continuous variable to map to color?

While I don’t have an actual measurement of light throughout the eclipse, I can show the general idea of a light change with color by using a linear change in color from the start of the eclipse to totality and then another linear change in color from totality to the end of the eclipse.

My first idea for creating a variable was to use information on the current time vs totality start/end. After subtracting the times before totality from totality start and subtracting totality end from times after totality, I realized that the amount of time before totality wasn’t actually the same as the amount of time after totality. Back to the drawing board.

Since I was making a linear change in color, I realized I could make a sequence of values before totality and after totality that covered the same range but had a different total number of values. This would account for the difference in the length of time before and after totality. I ended up making a sequence going from 100 to 0 for times before totality and a sequence from 0 to 100 after totality. Times during totality were assigned a 0.

Here’s one way to get these sequences, using base::replace(). My dataset is in order by time, which is key to this working correctly.

color_dat = mutate(color_dat,
color = 0,
color = replace(color,
time < totality$start, seq(100, 0, length.out = sum(time < totality$start) ) ),
color = replace(color,
time > totality$end, seq(0, 100, length.out = sum(time > totality$end) ) ) )

## Adding one geom_segment() per second

Once I had my color variable I was ready to plot the segments along the x axis. Since the segments needed to go across the full height of the plot, I set y and yend to -Inf and Inf, respectively.

I put this layer first to use it as a background that the temperature line was plotted on top of.

g1 = ggplot(plottemp) +
geom_segment(data = color_dat,
aes(x = time, xend = time,
y = -Inf, yend = Inf, color = color),
show.legend = FALSE) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "15 min",
date_labels = "%H:%M",
expand = c(0, 0) ) +
coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(16, 24, by = 1)) ) +
theme_bw(base_size = 14) +
theme(panel.grid = element_blank() )

g1

## Switching to a gray scale

The default blue color scheme for the segments actually works OK, but I was picturing going from white to dark. I picked gray colors with grDevices::gray.colors() in scale_color_gradient(). In gray.colors(), 0 is black and 1 is white. I didn’t want the colors to go all the way to black, since that would make the temperature line impossible to see during totality. And, of course, it’s not actually pitch black during totality.

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
high = gray.colors(1, 1) )
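To see what those calls return: gray.colors(n, start, end) produces n gray hex codes between the start and end levels (0 = black, 1 = white), so with n = 1 it simply converts one gray level into a color string:

```r
gray.colors(1, 0.25)  # a dark gray, used as the 'low' end of the gradient
gray.colors(1, 1)     # pure white, used as the 'high' end
```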

# Using segments to make a gradient rectangle

I can use this same approach on only a portion of the x axis to give the appearance of a rectangle with gradient fill. Here’s an example using times outside the eclipse.

g2 = ggplot(temp) +
geom_segment(data = color_dat,
aes(x = time, xend = time,
y = -Inf, yend = Inf, color = color),
show.legend = FALSE) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "1 hour",
date_labels = "%H:%M",
expand = c(0, 0) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, Dallas, OR, USA") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(12, 24, by = 2)) ) +
scale_color_gradient(low = gray.colors(1, 0.25),
high = gray.colors(1, 1) ) +
theme_bw(base_size = 14) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank() )

g2

## Bonus: annotations with curved arrows

This second plot gives me a chance to try out Cédric Scherer’s very cool curved annotation arrow idea for the first time.

g2 = g2 +
annotate("text", x = as.POSIXct("2017-08-21 08:00"),
y = 74,
label = "Partial eclipse begins\n09:05:10 PDT",
color = "grey24") +
annotate("text", x = as.POSIXct("2017-08-21 09:00"),
y = 57,
label = "Totality begins\n10:16:55 PDT",
color = "grey24")
g2

I’ll make a data.frame for the arrow start/end positions. I’m skipping all the work it took to get the positions where I wanted them, which is always iterative for me.

arrows = data.frame(x1 = as.POSIXct( c("2017-08-21 08:35",
"2017-08-21 09:34") ),
x2 = c(eclipse$start, totality$start),
y1 = c(74, 57.5),
y2 = c(72.5, 60) )

I add arrows with geom_curve(). I changed the size of the arrowhead and made it closed in arrow(). I also thought the arrows looked better with a little less curvature.

g2 +
geom_curve(data = arrows,
aes(x = x1, xend = x2,
y = y1, yend = y2),
arrow = arrow(length = unit(0.075, "inches"),
type = "closed"),
curvature = 0.25)

# Other ways to make a gradient color background

Based on a bunch of internet searches, it looks like a gradient background in ggplot2 generally takes some work. There are some nice examples out there on how to use rasterGrob() and annotation_custom() to add background gradients, such as in this Stack Overflow question. I haven’t researched how to make this go from light to dark and back to light for the uneven time scale like in my example.

I’ve also seen approaches involving dataset expansion and drawing many filled rectangles or using rasters, which is like what I did with geom_segment().

# Eclipses!

Before actually experiencing totality, it seemed to me like the difference between a 99% and a 100% eclipse wasn’t a big deal. I mean, those numbers are pretty darn close.

I was very wrong.

If you are ever lucky enough to be near a path of totality, definitely try to get there even if it’s a little more trouble than the 99.9% partial eclipse. It’s an amazing experience.

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10"),
end = as.POSIXct("2017-08-21 11:37:19") )

totality = data.frame(start = as.POSIXct("2017-08-21 10:16:55"),
end = as.POSIXct("2017-08-21 10:18:52") )

plottemp = filter(temp, between(datetime,
as.POSIXct("2017-08-21 09:00:00"),
as.POSIXct("2017-08-21 12:00:00") ) )
ggplot(plottemp) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "15 min",
date_labels = "%H:%M",
expand = c(0, 0) ) +
coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(16, 24, by = 1)) ) +
theme_bw(base_size = 14) +
theme(panel.grid = element_blank() )
color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = "1 sec") )
color_dat = mutate(color_dat,
color = 0,
color = replace(color,
time < totality$start, seq(100, 0, length.out = sum(time < totality$start) ) ),
color = replace(color,
time > totality$end, seq(0, 100, length.out = sum(time > totality$end) ) ) )
g1 = ggplot(plottemp) +
geom_segment(data = color_dat,
aes(x = time, xend = time,
y = -Inf, yend = Inf, color = color),
show.legend = FALSE) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "15 min",
date_labels = "%H:%M",
expand = c(0, 0) ) +
coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(16, 24, by = 1)) ) +
theme_bw(base_size = 14) +
theme(panel.grid = element_blank() )

g1

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
high = gray.colors(1, 1) )
g2 = ggplot(temp) +
geom_segment(data = color_dat,
aes(x = time, xend = time,
y = -Inf, yend = Inf, color = color),
show.legend = FALSE) +
geom_line( aes(datetime, tempf), size = 1 ) +
scale_x_datetime( date_breaks = "1 hour",
date_labels = "%H:%M",
expand = c(0, 0) ) +
labs(y = expression( Temperature~(degree*F) ),
x = NULL,
title = "Temperature during 2017-08-21 solar eclipse",
subtitle = expression(italic("Sapsucker Farm, Dallas, OR, USA") ),
caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
) +
scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
name =  expression( Temperature~(degree*C)),
breaks = seq(12, 24, by = 2)) ) +
scale_color_gradient(low = gray.colors(1, 0.25),
high = gray.colors(1, 1) ) +
theme_bw(base_size = 14) +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank() )

g2
g2 = g2 +
annotate("text", x = as.POSIXct("2017-08-21 08:00"),
y = 74,
label = "Partial eclipse begins\n09:05:10 PDT",
color = "grey24") +
annotate("text", x = as.POSIXct("2017-08-21 09:00"),
y = 57,
label = "Totality begins\n10:16:55 PDT",
color = "grey24")
g2

arrows = data.frame(x1 = as.POSIXct( c("2017-08-21 08:35",
"2017-08-21 09:34") ),
x2 = c(eclipse$start, totality$start),
y1 = c(74, 57.5),
y2 = c(72.5, 60) )
g2 +
geom_curve(data = arrows,
aes(x = x1, xend = x2,
y = y1, yend = y2),
arrow = arrow(length = unit(0.075, "inches"),
type = "closed"),
curvature = 0.25)

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

## October 13, 2019

### What’s going on on PyPI

Scanning all newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, worth following, or might inspire you in some way.

GN
A hierarchical community detection algorithm by Girvan and Newman. A Girvan-Newman step is defined as a couple of successive edge removals such that a new community occurs.

meditorch
A PyTorch package for biomedical image processing

metriculous
Very unstable library containing utilities to measure and visualize statistical properties of machine learning models.

podworld
2D partially observable dynamic world for RL experiments. PodWorld is an OpenAI Gym (https://gym.openai.com) environment for reinforcement learning (http://…/the-book-2nd.html) experimentation. PodWorld is specifically designed to be partially observable and dynamic (hence the abbreviation P.O.D.). We emphasize these two attributes to force agents to learn spatial as well as temporal representations that must go beyond simple memorization. In addition, all entities in PodWorld must obey the laws of physics, allowing for a long tail of emergent observations that may not appear in games designed with arbitrary hand-crafted rules for human entertainment. PodWorld is designed to be highly customizable as well as hackable, with a fairly simple and minimal code base. PodWorld is designed to be fast (>500 FPS on usual laptops) without needing a GPU, and runs cross-platform in headless mode or with rendering.

sensplit
Splits a dataset (in Pandas dataframe format) to train/test sets.

simple-gpu-scheduler
A simple scheduler for running commands on multiple GPUs. A simple scheduler to run your commands on individual GPUs. Following the (https://…/KISS_principle ), this script simply accepts commands via stdin and executes them on a specific GPU by setting the CUDA_VISIBLE_DEVICES variable.

simple-s3

sparkle-hypothesis
Use the power of Hypothesis property-based testing in PySpark tests. Data-heavy tests benefit from Hypothesis to generate your data and design your tests. Sparkle-hypothesis makes it easy to use Hypothesis strategies to generate dataframes.

sstk
A systematic strategy toolkit.

TeaML
Automated Modeling in the Financial Domain. TeaML is a simple and design-friendly automatic modeling framework. It can automatically model from beginning to end and, at the end, will also help you output a report about the model.

### If you did not already know

Policy Learning based on Completely Behavior Cloning (PLCBC)
Direct policy search is one of the most important algorithms of reinforcement learning. However, learning from scratch needs a large amount of experience data and can easily be prone to poor local optima. In addition, a partially trained policy tends to perform actions that are dangerous to the agent and environment. In order to overcome these challenges, this paper proposes a policy initialization algorithm called Policy Learning based on Completely Behavior Cloning (PLCBC). PLCBC first transforms the Model Predictive Control (MPC) controller into a piecewise affine (PWA) function using multi-parametric programming, and uses a neural network to express this function. In this way, PLCBC can completely clone the MPC controller without any performance loss, and is totally training-free. The experiments show that this initialization strategy can help the agent learn in the high-reward state region, and converge faster and better. …

PredRNN++
We present PredRNN++, an improved recurrent network for video predictive learning. In pursuit of a greater spatiotemporal modeling capability, our approach increases the transition depth between adjacent states by leveraging a novel recurrent unit, which is named Causal LSTM for re-organizing the spatial and temporal memories in a cascaded mechanism. However, there is still a dilemma in video predictive learning: increasingly deep-in-time models have been designed for capturing complex variations, while introducing more difficulties in the gradient back-propagation. To alleviate this undesirable effect, we propose a Gradient Highway architecture, which provides alternative shorter routes for gradient flows from outputs back to long-range inputs. This architecture works seamlessly with causal LSTMs, enabling PredRNN++ to capture short-term and long-term dependencies adaptively. We assess our model on both synthetic and real video datasets, showing its ability to ease the vanishing gradient problem and yield state-of-the-art prediction results even in a difficult objects occlusion scenario. …

ReSIFT
This paper presents a full-reference image quality estimator based on SIFT descriptor matching over reliability-weighted feature maps. Reliability assignment includes a smoothing operation, a transformation to perceptual color domain, a local normalization stage, and a spectral residual computation with global normalization. The proposed method ReSIFT is tested on the LIVE and the LIVE Multiply Distorted databases and compared with 11 state-of-the-art full-reference quality estimators. In terms of the Pearson and the Spearman correlation, ReSIFT is the best performing quality estimator in the overall databases. Moreover, ReSIFT is the best performing quality estimator in at least one distortion group in compression, noise, and blur category. …

Feature Generation by Convolutional Neural Network (FGCNN)
Click-Through Rate prediction is an important task in recommender systems, which aims to estimate the probability of a user to click on a given item. Recently, many deep models have been proposed to learn low-order and high-order feature interactions from original features. However, since useful interactions are always sparse, it is difficult for DNN to learn them effectively under a large number of parameters. In real scenarios, artificial features are able to improve the performance of deep models (such as Wide & Deep Learning), but feature engineering is expensive and requires domain knowledge, making it impractical in different scenarios. Therefore, it is necessary to augment feature space automatically. In this paper, We propose a novel Feature Generation by Convolutional Neural Network (FGCNN) model with two components: Feature Generation and Deep Classifier. Feature Generation leverages the strength of CNN to generate local patterns and recombine them to generate new features. Deep Classifier adopts the structure of IPNN to learn interactions from the augmented feature space. Experimental results on three large-scale datasets show that FGCNN significantly outperforms nine state-of-the-art models. Moreover, when applying some state-of-the-art models as Deep Classifier, better performance is always achieved, showing the great compatibility of our FGCNN model. This work explores a novel direction for CTR predictions: it is quite useful to reduce the learning difficulties of DNN by automatically identifying important features. …

### Rename Columns | R

[This article was first published on Data Science Using R – FinderDing, and kindly contributed to R-bloggers].

Often data you’re working with has abstract column names, such as (x1, x2, x3…). Typically, the first step I take when renaming columns with R is opening my web browser.

The dataset cars contains data from the 1920s on “Speed and Stopping Distances of Cars”. There are only 2 columns, shown below.

colnames(datasets::cars)
[1] "speed" "dist" 

If we wanted to rename the column “dist” to make it easier to know what the data is/means we can do so in a few different ways.

## Using dplyr:

cars %>%
rename("Stopping Distance (ft)" = dist) %>%
colnames()

[1] "speed"                  "Stopping Distance (ft)"
cars %>%
rename("Stopping Distance (ft)" = dist, "Speed (mph)" = speed) %>%
colnames()

[1] "Speed (mph)"            "Stopping Distance (ft)"

## Using base R:

colnames(cars)[2] <- "Stopping Distance (ft)"
colnames(cars)

[1] "speed"                  "Stopping Distance (ft)"

colnames(cars)[1:2] <- c("Speed (mph)", "Stopping Distance (ft)")
colnames(cars)

[1] "Speed (mph)"            "Stopping Distance (ft)"

## Using grep():

colnames(cars)[grep("dist", colnames(cars))] <- "Stopping Distance (ft)"
colnames(cars)

[1] "speed"                  "Stopping Distance (ft)"
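Another base R option is to match the column by name rather than by position, which keeps working even if the column order changes (done here on a fresh copy of cars):

```r
cars <- datasets::cars  # fresh copy so earlier renames don't interfere

# Rename by matching the current name instead of hard-coding an index
names(cars)[names(cars) == "dist"] <- "Stopping Distance (ft)"
colnames(cars)
```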


### Distilled News

A time series is a series of data points ordered in time. Time series adds an explicit order dependence between observations: a time dimension. In a normal machine learning dataset, the dataset is a collection of observations that are treated equally when the future is being predicted. In time series, the order of observations provides a source of additional information that should be analyzed and used in the prediction process. Time series are typically assumed to be generated at regularly spaced intervals of time (e.g. daily temperature), and so are called regular time series. But the data in a time series doesn’t have to come in regular time intervals. In that case it is called an irregular time series. In an irregular time series the data follows a temporal sequence, but the measurements might not occur at regular time intervals. For example, the data might be generated as a burst or with varying time intervals [1]. Account deposits or withdrawals from an ATM are examples of an irregular time series. Time series can have one or more variables that change over time. If there is only one variable varying over time, we call it a univariate time series. If there is more than one variable, it is called a multivariate time series. For example, consider a tri-axial accelerometer: there are three acceleration variables, one for each axis (x, y, z), and they vary simultaneously over time.
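In R, for example, a regular series can be represented with ts(), and binding several columns gives the multivariate case; an irregular series is easiest to store as timestamps plus values. The numbers below are simulated purely for illustration:

```r
set.seed(1)

# Regular univariate series: one variable, monthly over two years
u <- ts(rnorm(24), start = c(2017, 1), frequency = 12)

# Regular multivariate series: three accelerometer axes varying together
m <- ts(cbind(x = rnorm(24), y = rnorm(24), z = rnorm(24)),
        start = c(2017, 1), frequency = 12)

# Irregular series: no fixed frequency, so keep the timestamps themselves
irregular <- data.frame(
  time  = as.POSIXct("2019-10-14 00:00:00", tz = "UTC") +
            cumsum(rexp(10, rate = 1 / 600)),  # random gaps between events
  value = rnorm(10)
)
```
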
An end to end guide on how to reduce the number of Features in a Dataset with practical examples in Python.
One of the primary challenges for the realization of near-term quantum computers has to do with their most basic constituent: the qubit. Qubits can interact with anything in close proximity that carries energy close to their own – stray photons (i.e., unwanted electromagnetic fields), phonons (mechanical oscillations of the quantum device), or quantum defects (irregularities in the substrate of the chip formed during manufacturing) – which can unpredictably change the state of the qubits themselves. Further complicating matters, there are numerous challenges posed by the tools used to control qubits. Manipulating and reading out qubits is performed via classical controls: analog signals in the form of electromagnetic fields coupled to a physical substrate in which the qubit is embedded, e.g., superconducting circuits. Imperfections in these control electronics (giving rise to white noise), interference from external sources of radiation, and fluctuations in digital-to-analog converters, introduce even more stochastic errors that degrade the performance of quantum circuits. These practical issues impact the fidelity of the computation and thus limit the applications of near-term quantum devices. To improve the computational capacity of quantum computers, and to pave the road towards large-scale quantum computation, it is necessary to first build physical models that accurately describe these experimental problems.
CS 189 is the Machine Learning course at UC Berkeley. In this guide we have created a comprehensive course guide in order to share our knowledge with students and the general public, and hopefully draw the interest of students from other universities to Berkeley’s Machine Learning curriculum. This guide was started by CS 189 TAs Soroush Nasiriany and Garrett Thomas in Fall 2017, with the assistance of William Wang and Alex Yang. We owe gratitude to Professors Anant Sahai, Stella Yu, and Jennifer Listgarten, as this book is heavily inspired by their lectures. In addition, we are indebted to Professor Jonathan Shewchuk for his machine learning notes, from which we drew inspiration. The latest version of this document can be found either at http://www.eecs189.org or http://snasiriany.me/cs189/. Please report any mistakes to the staff, and contact the authors if you wish to redistribute this document.
The modern business leader’s new responsibility in a brave new world ruled by data. As Data Science moves along the hype cycle and matures as a business function, so do the challenges that face the discipline. The problem statement for data science went from ‘we waste 80% of our time preparing data’ via ‘production deployment is the most difficult part of data science’ to ‘lack of measurable business impact’ in the last few years.
If everyone had the time and desire to go to college and get an AI degree, you very likely wouldn’t be reading this blog. AI works in mysterious ways, but these five AI principles ought to help you avoid errors when dealing with this tech. A quick run down of this post for the AI acolyte on the go:
1. Evaluate AI systems on unseen data
2. More data leads to better models
3. An ounce of clean data is worth a pound of dirty data
5. AI isn’t magic
Learn how to transform and load (ETL) a data pipeline from scratch using R and SQLite to gather tweets in real-time and store them for future analyses.
Last year, in a post, I discussed how to merge levels of factor variables, using combinatorial techniques (it was for my STT5100 course, and trees are not in the syllabus), with an extension on trees at the end of the post.
Benefits of data science, you might guess that data science has played a role in your daily life. After all, it not only affects what you do online, but what you do offline. Companies are using massive amounts of data to create better ads, produce tailored recommendations, and stock shelves, in the case of retail stores. It’s also shaping how, and who, we love. Here’s how data impacts us daily.
Python can be lots of fun. It’s not a difficult task to re-invent some built-in function that you don’t know exists in the first place, but why would you want to do that? Today we’ll take a look at three of those functions which I use more or less on a daily basis, but was unaware of for a good portion of my data science career.
In this series of 6 posts we will leave the basics of prediction aside and look at a more handcrafted aspect of data science and machine learning, such as interpreting, gaining insights, and understanding what goes on within algorithms. It is not a trivial matter and is far from exhausted. In this series of posts we’ll start with the simplest statistical models and how we model even the most sophisticated ensemble and deep learning models trying to figure out the whys of a prediction.
PyTorch has become one of the most popular deep learning frameworks in the market and certainly a favorite of the research community when it comes to experimentation. As a reference, PyTorch citations in papers on arXiv grew 194 percent in the first half of 2019 alone, as noted by O’Reilly. For years, Facebook has based its deep learning work on a combination of PyTorch and Caffe2 and has put a lot of resources into supporting the PyTorch stack and developer community. Yesterday, Facebook released the latest version of PyTorch, which showcases some state-of-the-art deep learning capabilities. There have been plenty of articles covering the launch of PyTorch 1.3. Instead of doing that, I would like to focus on some of the new projects accompanying the new release of the deep learning framework. Arguably, the most impressive capability of PyTorch is how quickly it has been able to incorporate implementations of new research techniques. Not surprisingly, the artificial intelligence (AI) research community has started adopting PyTorch as one of the preferred stacks to experiment with new deep learning methods. The new release of PyTorch continues this trend by adding some impressive open source projects surrounding the core stack.
An end to end guide on how to reduce a dataset dimensionality using Feature Extraction Techniques such as: PCA, ICA, LDA, LLE, t-SNE and AE.
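Of the techniques listed, PCA is the simplest to sketch in code. Below is a minimal, illustrative PCA-via-SVD implementation in plain NumPy; it is a sketch of the general idea, not code from the guide itself, and the `pca` function name and toy data are my own:

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)          # PCA requires centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, 5 features
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

Because SVD returns singular values in descending order, the first projected column always carries at least as much variance as the second.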
How to cope with AI and start becoming a part of it. If you open a news site today, you are almost sure to be met with an article about AI, robotics, quantum computing, genetic engineering, autonomous vehicles, natural language processing, and other technologies from the box called ‘The fourth industrial revolution’. Rating these technologies makes no sense, as they all have a staggering potential to change our world forever. Artificial intelligence, however, is already surging into all of the other technologies. Facilitating the mastering of big data, pattern recognition, and prediction is an inherent quality of AI, and it is frequently applied to support ground-breaking discoveries in the other technologies. I once heard a driving instructor compare holding one’s hands on the steering wheel to holding a gun, because of how dangerous it is to drive a car. AI is also dangerous, and we need to face the dark side of AI too, not only revel in the glorious benefits it brings us. Anything else would be reckless driving.
Natural language processing (NLP) is becoming the most ubiquitous application in the modern deep learning ecosystem. From support in popular deep learning frameworks to APIs in cloud runtimes such as Google Cloud, Azure, AWS or Bluemix, NLP is an omnipresent component of deep learning platforms. Despite the incredible progress, building NLP applications at scale remains incredibly challenging, often surfacing strong frictions between the possibilities of research/experimentation and the realities of model serving/deployment. As one of the biggest conversational environments in the market, Facebook has been facing the challenges of building NLP applications at scale for years. Recently, the Facebook engineering team open sourced the first version of PyText, a PyTorch-based framework for building faster and more efficient NLP solutions. The ultimate goal of PyText is to provide a simpler experience for the end-to-end implementation of NLP workflows. To achieve that, PyText needs to address some of the existing friction points in NLP workflows. Of those friction points, the most troublesome is the existing mismatch between the experimentation and model serving stages of the lifecycle of an NLP application.
Data science and software development are two very different fields. Trying to use the Agile methodology in the same way as you would on a software project for a data science project doesn’t really work. When it comes to data science, there tends to be a lot of investigation, exploration, testing, and tuning. In data science, you deal with unknown data which can lead to an unknown result. Software development, on the other hand, has structured data with known results; the programmers already know what they want to build (although their clients may not).
In the past 7 projects, we implemented the same project using different classification algorithms namely – ‘Logistic Regression’, ‘KNN’, ‘SVM’, ‘Kernel SVM’, ‘Naive Bayes’, ‘Decision Tree’ and ‘Random Forest’. The reason I wrote a separate article for each is to understand the intuition behind each algorithm.

### When presenting a new method, talk about its failure modes.

A coauthor writes:

I really like the paper [we are writing] as it is. My only criticism of it perhaps would be that we present this great new method and discuss all of its merits, but we do not really discuss when it fails / what its downsides are. Are there any cases where the traditional analyses or some other analysis are more appropriate? Should we say that in the main body of the paper?

Good point! I’m gonna add a section to the paper called Failure Modes or something like that, to explore where our method makes things worse.

And, no, this is not the same as the traditional “Threats to Validity” section. The Threats to Validity section, like the Robustness Checks section, is typically a joke, in that the purpose is usually not to explore potential problems but rather to rule out potential objections.

Now, don’t get me wrong, I love our recommended method and so does my coauthor. It’s actually hard for us to think of examples where our approach would be worse than what people were doing before. But I’ll think about it. I agree we should write something about failure modes.

I love my collaborators. They’re just great.

### What’s going on on PyPI

Scanning all new published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones that might be worth a look, worth following, or that might inspire you in some way.

spacy-transformers
spaCy pipelines for pre-trained BERT and other transformers

stackerpy
Model Stacking for scikit-learn models for Machine Learning (including blending)

tabnet
Tensorflow 2.0 implementation of TabNet of any configuration. A Tensorflow 2.0 port for the paper (https://…/1908.07442 ), whose original codebase is available at https://…/tabnet.

TeaML
Automated Modeling in Financial Domain. TeaML is a simple and design friendly automatic modeling learning framework.

traintorch
Package for live visualization of metrics during training of a machine learning model

trax
Trax

VoicePy
A Python Library to interface with Alexa, Dialogflow, and Google Actions.

aiPool
Framework for Advanced Statistics and Data Sciences

cprior
Fast Bayesian A/B and multivariate testing

elasticdl
A Kubernetes-native Deep Learning Framework

### easyMTS: My First R Package (Story, and Results)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This weekend I decided to create my first R package… it’s here!

Although I’ve been using R for 15 years, developing a package has been the one thing slightly out of reach for me. Now that I’ve been through the process once, with a package that’s not completely done (but at least has a firm foundation, and is usable to some degree), I can give you some advice:

• Make sure you know R Markdown before you begin.
• Some experience with Git and Github will be useful. Lots of experience will be very, very useful.
• Write the functions that will go into your package into a file that you can source into another R program and use. If your programs work when you run the code this way, you will have averted many problems early.

The process I used to make this happen was:

I hope you enjoy following along with my process, and that it helps you write packages too. If I can do it, so can you!

The post easyMTS: My First R Package (Story, and Results) appeared first on Quality and Innovation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

### easyMTS R Package: Quick Solver for Mahalanobis-Taguchi System (MTS)


A new R package in development. Please cite if you use it.

The post easyMTS R Package: Quick Solver for Mahalanobis-Taguchi System (MTS) appeared first on Quality and Innovation.


### The best is the enemy of the good. It is also the enemy of the not so good.

This post is by Phil Price, not Andrew.

The Ocean Cleanup Project’s device to clean up plastic from the Great Pacific Garbage Patch is back in the news because it is back at work and is successfully collecting plastic. A bunch of my friends are pretty happy about it and have said so on social media…and it drives me nuts. The machine might be OK but it makes no sense to put it way out in the Pacific.  Someone asked why not, and here’s what I wrote:

Suppose I have a machine that removes plastic from all of the water it encounters. I offer you a choice: you can put it in a location where it will remove 1 ton per month — the Pacific Garbage Patch — or in a location where it will remove 10 tons per month (let’s say that’s the Gulf of Thailand but in fact I do not know where the best place would be). Obviously you will put it where it can remove 10 tons per month. Now you raise money to build and operate a second machine. You put your first machine in the best place you could find, so do you now put your second machine in the Pacific Garbage Patch? You shouldn’t, if your goal is to remove as much plastic from the ocean as possible: you should put it in the best place where you don’t already have a machine…the Bay of Bengal, maybe. Or maybe it, too, should go in the Gulf of Thailand. Or maybe in the Caribbean. I have no idea where the plastic concentrations are highest, but I know it is not the Great Pacific Garbage Patch. At any rate you should put the first machine where it will remove the most plastic per month; the second machine in the best remaining place after you have installed the first one; the third machine in the best remaining place after you have installed the first two; and so on. The Pacific Garbage Patch isn’t literally the last place you should install a machine, but it is way way down the list. (If you know in advance that you are going to build a lot of machines, you can optimize the joint placement of all of them and you might come up with a slightly different answer, but let’s not worry about that detail.)
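The argument above is just the greedy step of a resource-allocation problem: repeatedly put the next machine at the best remaining site. A toy sketch, where the site names and removal rates are entirely made up for illustration:

```python
def greedy_placement(removal_rates, n_machines):
    """Assign machines to sites in descending order of plastic removed per month.

    removal_rates: dict mapping site name -> tons of plastic removed per month.
    Returns the chosen sites, best first.
    """
    ranked = sorted(removal_rates, key=removal_rates.get, reverse=True)
    return ranked[:n_machines]

rates = {  # hypothetical tons/month per site
    "Gulf of Thailand": 10,
    "Bay of Bengal": 8,
    "Caribbean": 5,
    "Pacific Garbage Patch": 1,
}
print(greedy_placement(rates, 2))  # ['Gulf of Thailand', 'Bay of Bengal']
```

With per-site installation and operating costs added in, this becomes the optimization problem described below: you would rank by tons removed per dollar rather than tons removed per machine.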

The paragraph above assumes that you are just trying to remove as much plastic from the ocean as possible. If you have some other goal then of course the answer could be different. For instance, if you are trying to reduce the amount of plastic at some specific spot in the middle of the Pacific, you should put your machine at that spot even if it won’t get you very much in terms of plastic removed per month.

That paragraph also implicitly assumes the cost of installing and operating the machine is the same everywhere. If it is very expensive to install and operate the machine in the Gulf of Thailand, then maybe you’d be better off somewhere else: for the same money as one machine in the place where it would maximize the plastic removal per month, maybe I could build two machines and install them in cheaper places where they would combine to remove more plastic. It becomes an optimization problem. But: I have never seen anyone, not even the project proponents, who thinks the middle of the Pacific is a relatively _cheap_ place to install and operate a machine: in fact it is very expensive because it is so remote.

And of course the situation gets even more complicated when you consider other factors like whether you will interfere with fishing or with ship traffic, what effect will the machine have on the marine ecosystem, are you inside or outside a nation’s territorial waters, and so on.

Choosing the best place for your first, second, third, fourth, fifth, sixth,… machine might be complicated, but I have not seen any reasonable argument for why the Pacific Garbage Patch is even in the running. It just doesn’t make sense.

I am in agreement with…uh, I think it was Darrell Huff (author of “How to Lie with Statistics”) who made this point, but I could be wrong… when he said that the more important something is, the more important it is to be rational about it. If you’re trying to save human lives, for example, anything other than the most efficient allocation of resources is literally killing people. So to the extent that it is important to people to remove plastic from the oceans, it’s important to allocate resources efficiently. But, much as we would like to think it is important to people and therefore should be done as efficiently as possible, in fact people are often not rational. It may be the case that people are willing to contribute much much more money, time, and energy to a program to remove plastic from the ocean inefficiently than to one that would do so efficiently. If people are willing to contribute to remove plastic from the Pacific Garbage Patch but not from anywhere else, well, OK, put your machine in the Pacific Garbage Patch. So I’m not saying people shouldn’t do this project. I’m just saying it doesn’t make sense. That is, sadly, not the same thing.

This post is by Phil, not Andrew

### Hyper-Parameter Optimization of General Regression Neural Networks


A major advantage of General Regression Neural Networks (GRNN) over other types of neural networks is that there is only a single hyper-parameter, namely the sigma. In the previous post (https://statcompute.wordpress.com/2019/07/06/latin-hypercube-sampling-in-hyper-parameter-optimization), I’ve shown how to use the random search strategy to find a close-to-optimal value of the sigma by using various random number generators, including uniform random, Sobol sequence, and Latin hypercube sampling.

In addition to the random search, we can also directly optimize the sigma based on a pre-defined objective function by using the grnn.optmiz_auc() function (https://github.com/statcompute/yager/blob/master/code/grnn.optmiz_auc.R), in which either Golden section search by default or Brent’s method is employed in the one-dimension optimization. In the example below, the optimized sigma is able to yield a slightly higher AUC in both training and hold-out samples. As shown in the plot, the optimized sigma in red is right next to the best sigma in the random search.
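Golden section search itself is easy to sketch. The snippet below is a generic one-dimensional minimizer in Python, not the actual `grnn.optmiz_auc()` R code; in the GRNN setting the objective would be something like negative AUC as a function of sigma, and the quadratic toy objective here is purely illustrative:

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden section search."""
    inv_phi = (math.sqrt(5) - 1) / 2        # 1/phi ~ 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):                     # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                               # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Toy objective with a known minimum at sigma = 2.0
sigma = golden_section_search(lambda s: (s - 2.0) ** 2, 0.0, 10.0)
print(round(sigma, 4))  # 2.0
```

This simple version re-evaluates f at both interior points each iteration; the classic refinement reuses one evaluation per step, shrinking the bracket by the same golden ratio.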


### Distilled News

Here’s the List of Power-Packed NLP Hack Sessions at DHS 2019
• Comparison of Transfer Learning Models in NLP
• Synthetic Text Data Generation using RNN based Deep Learning Models
• Identifying security vulnerabilities in software using Deep Transfer Learning for NLP
• Deep Learning for Search in E-Commerce
• Intent Identification for Indic Languages
• Interpreting State-of-the-Art NLP Models
• Automatic Subtitle Generation using NLP and Deep Learning
In order to be a highly efficient, flexible, and production-ready library, TensorFlow uses dataflow graphs to represent computation in terms of the relationships between individual operations. Dataflow is a programming model widely used in parallel computing and, in a dataflow graph, the nodes represent units of computation while the edges represent the data consumed or produced by a computation unit. This post is taken from the book Hands-On Neural Networks with TensorFlow 2.0 by Ekta Saraogi and Akshat Gupta. This book by Packt Publishing explains how TensorFlow works, from basics to advanced level, with a case-study based approach.
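The dataflow idea can be illustrated without TensorFlow at all. Below is a deliberately tiny, hypothetical graph evaluator in plain Python, where nodes are operations and edges carry values; it is a conceptual sketch of the model, not TensorFlow’s actual API:

```python
class Node:
    """A unit of computation; its inputs are the edges it consumes."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def evaluate(self, cache=None):
        cache = {} if cache is None else cache
        if self not in cache:  # each node runs once, like a graph executor
            args = [n.evaluate(cache) for n in self.inputs]
            cache[self] = self.op(*args)
        return cache[self]

# Build the graph for (a + b) * b, then execute it
a = Node(lambda: 2.0)
b = Node(lambda: 3.0)
s = Node(lambda x, y: x + y, a, b)
out = Node(lambda x, y: x * y, s, b)
print(out.evaluate())  # 15.0
```

Because the graph is an explicit data structure, an executor is free to cache shared subresults (as the `cache` dict does here) or run independent nodes in parallel, which is exactly the property TensorFlow exploits.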
Leverage AI to Create Autonomous Policies that Adapt without Human Intervention. Policies are the foundation for any successful organization. Policies are the rules, or laws, of an organization. Policies document the principles, best practices and compliance guidelines that aid decision-making in supporting the consistent and repeatable operations of the business. Heck, one could argue that an organization’s culture is better defined by its policies than it is by the character of its leadership team. Unfortunately, the management, creation and execution of policies haven’t changed much since the days of ‘time-and-motion studies’. In many cases, policies are nothing more than a static list of what-if rules that govern what workers are to do in well-defined situations. For example, [If your car has been driven over 3,000 miles since the last oil change, then change the oil] or [If you haven’t visited the dentist in greater than 6 months, then visit the dentist]. But what if…what if these policies weren’t just static if-then rules but were instead AI-based models that changed to optimize the actions based upon the constantly evolving state of the environment in which the business operates…without human intervention? In much the same way that we are seeing AI being used to create autonomous vehicles, robots and devices that learn and adapt without human intervention, can we leverage AI to create autonomous policies that learn and adapt without human intervention?
Goodhart’s Law states that ‘When a measure becomes a target, it ceases to be a good measure.’ At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is not new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so. This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok (such as Google’s algorithm contributing to radicalizing people into white supremacy, teachers being fired by an algorithm, or essay grading software that rewards sophisticated garbage) all result from over-emphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI.
In statistics, when trying to compare samples, our first thought is to perform a Student’s t-test. It compares the means of two samples (or a sample and population) relative to the standard error of the mean or pooled standard deviation. While the t-test is a robust and useful experiment, it limits itself to comparing only two groups at a time. In order to compare multiple groups at once, we can look at the ANOVA, or Analysis of Variance. Unlike the t-test, it compares the variance between the samples relative to the variance within each sample. Ronald Fisher introduced the term Variance and its formal analysis in 1918, with Analysis of Variance becoming widely known in 1925 after Fisher’s Statistical Methods for Research Workers. The Student’s t-test follows a t-distribution, which resembles the normal distribution’s shape but has fatter tails to account for more values farther from the mean in samples.
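As a concrete illustration, the one-way ANOVA F-statistic is just the ratio of between-group mean square to within-group mean square. A minimal sketch using only the standard library, with made-up groups:

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)                           # number of groups
    n = sum(len(g) for g in groups)           # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Sum of squares between groups: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Sum of squares within groups: spread of observations around their own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(one_way_anova_f(groups))  # 27.0
```

A large F means the group means differ by much more than the noise within groups would explain; the p-value then comes from the F-distribution with (k-1, n-k) degrees of freedom.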
On February 10th, 1996, IBM’s Deep Blue AI beat world champion Garry Kasparov at a game of chess. Google’s Alpha Go AI is the best GO player in the world and has crushed world champions over and over again. But how is this even possible? How is it possible that a computer can outsmart humans? The answer… Reinforcement Learning.
A fun mini experiment to test the predictions of tree ensembles without missing value replacement against the prediction from logistic regression using median and mode imputations.
In this story, DUNet, by Tianjin University and Linköping University, is briefly reviewed. DUNet, the Deformable U-Net:
• exploits the retinal vessels’ local features with a U-shape architecture, with upsampling operators to extract context information.
• enables precise localization by combining low-level feature maps with high-level ones.
• captures the retinal vessels at various shapes and scales by adaptively adjusting the receptive fields according to vessels’ scales and shapes using the deformable convolutional network (DCN).
With DUNet, there is the potential for early diagnosis of diseases. It was published in the 2019 JKNOSYS (Current Impact Factor: 5.101). (Sik-Ho Tsang @ Medium)
The M competitions[1] are a prestigious series of forecasting challenges organised to compare and advance forecasting research. In the past, statistical algorithms have always won them. Tried and tested models like ARIMA and Exponential Smoothing produced predictions that were hard to beat by more complex but less accurate algorithms. But this all changed last year. In this post I touch on the winner of M4, ES-RNN, a fusion of Long Short Term Memory networks (‘LSTMs’) and Exponential Smoothing. Then, I elaborate on what I learnt about N-BEATS, a pure neural network that beats even ES-RNN.
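For reference, the exponential smoothing half of ES-RNN builds on the classic recursion in which the level is an exponentially weighted average of past observations (ES-RNN itself uses Holt-Winters-style level and seasonality terms, with the RNN modeling what the smoothing leaves behind). A minimal sketch of the basic recursion, with an illustrative series and alpha:

```python
def simple_exponential_smoothing(series, alpha):
    """l_t = alpha * y_t + (1 - alpha) * l_{t-1}; the last level is the forecast."""
    level = series[0]                    # initialize the level at the first value
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level                         # one-step-ahead flat forecast

series = [10.0, 12.0, 11.0, 13.0, 12.5]
print(simple_exponential_smoothing(series, alpha=0.5))  # 12.25
```

Alpha controls the memory: alpha near 1 tracks the latest observation, alpha near 0 averages over the whole history.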
In today’s era of Big data and IoT, we are easily loaded with rich datasets having extremely high dimensions. In order to perform any machine learning task or to get insights from such high dimensional data, feature selection becomes very important. Since some features may be irrelevant or less significant to the dependent variable so their unnecessary inclusion to the model leads to
• Increase in complexity of a model and makes it harder to interpret.
• Increase in time complexity for a model to get trained.
• A dumb model with inaccurate or less reliable predictions.
Hence, there is an indispensable need to perform feature selection. Feature selection is a crucial and essential component in machine learning and data science workflows, especially while dealing with high dimensional datasets.
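The simplest filter-style selection keeps only features whose absolute correlation with the target clears a threshold. A hedged sketch in plain Python (the data, the threshold, and the function names are illustrative; real workflows would also consider mutual information, wrapper, or embedded methods):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(columns, target, threshold=0.5):
    """Keep feature names whose |correlation| with the target clears the threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold]

columns = {
    "relevant":   [1, 2, 3, 4, 5],   # tracks the target perfectly
    "irrelevant": [5, 1, 4, 2, 3],   # shuffled noise
}
target = [2, 4, 6, 8, 10]
print(select_features(columns, target))  # ['relevant']
```

Note that correlation filters only see linear, one-feature-at-a-time relationships; features that matter jointly or nonlinearly need the other selection families mentioned above.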
The topic for today is Tensorflow’s latest reinforcement learning library, TF-Agents. This library is fairly new, open-sourced to the world only about a year ago. As a result, it seriously lacks proper documentation and tutorials compared to the rest of the popular reinforcement learning libraries. In this tutorial, we are going to learn the proper way to set up and run the tutorials provided by the official documentation. The content is categorized into the following:
• Installation
• Examples
• Conclusions
Without further ado, let’s get started!
From training to experimenting with different parameters, the process of designing neural networks is labor-intensive, challenging, and often cumbersome. But imagine if it was possible to automate this process. That imaginative leap-turned-reality forms the basis of this guide. We’ll explore a range of research papers that have sought to solve the challenging task of automating neural network design. In this guide, we assume that the reader has been involved in the process of designing neural networks from scratch using one of the frameworks such as Keras or TensorFlow.
As a data scientist, I don’t have a lot of software engineering experience, but I have certainly heard a lot of great comments about containers. I have heard about how lightweight they are compared to traditional VMs and how good they are at ensuring a safe, consistent environment for your code. However, when I tried to Dockerize my own model, I soon realized it is not that intuitive. It is not at all as simple as putting RUN in front of your EC2 bootstrap script. I found that inconsistencies and unpredictable behaviors happen quite a lot, and it can be frustrating to learn to debug a new tool. All of this motivated me to create this post with all the code snippets you need to package your ML model in Python into a Docker container. I will guide you through installing all the pip packages you need and building your first container image. And in the second part of this post, we will be setting up the necessary AWS environment and kicking off the container as a Batch job.
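For reference, a minimal Dockerfile for a Python model often looks something like the sketch below. This is an illustrative example, not the post’s actual code; file names such as `requirements.txt` and `train.py` are placeholders:

```dockerfile
# Small official Python base image keeps the container lightweight
FROM python:3.7-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model code last; code changes won't invalidate the pip layer
COPY . .

# Run the model entry point when the container starts
ENTRYPOINT ["python", "train.py"]
```

Copying `requirements.txt` before the rest of the code is a common trick: Docker caches the pip-install layer, so rebuilding after a code change skips the dependency install entirely.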
Lately, I have been running a lot of Machine Learning experiments both at work and on Kaggle. One thing has become clear: it’s really hard to confidently say Model A is better than Model B. Typically there’s a clear trade-off between speed and confidence in results. In a setup like Kaggle, your success is largely driven by how many ideas you can evaluate, and that in turn is driven by the design of your test bed. For instance, you could train models A and B, measure the performance, and choose the best one. Fast and easy. But if you change the random seed and repeat, maybe you get the opposite result. You then might decide to do 5-fold cross validation to get error bars, but this takes ~5x the time and raises more questions about how to sample from the data to form the folds (e.g. group k-fold, stratified k-fold…etc). Maybe you then decide to shorten the duration of each run by increasing the learning rate or downsampling the training set, but this adds additional variance to your results, making it even harder to find the signal in the noise.
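The seed-flipping problem is easy to demonstrate with a toy simulation: two “models” whose true scores differ by less than the evaluation noise. Single measurements frequently rank them wrong, while averaging repeated evaluations recovers the true ordering. All numbers here are synthetic, chosen only to illustrate the effect:

```python
import random

def measure(true_score, noise, rng):
    """One evaluation run: true skill plus split/seed noise."""
    return true_score + rng.gauss(0, noise)

rng = random.Random(42)
true_a, true_b, noise = 0.80, 0.79, 0.02   # model A is truly better, barely

# Single-run comparisons: a lone train/test split often ranks them wrong
flips = sum(measure(true_a, noise, rng) < measure(true_b, noise, rng)
            for _ in range(1000))
print(f"wrong ranking in {flips / 10:.1f}% of single runs")

# Averaging many repeated evaluations shrinks the noise by sqrt(n)
n = 1000
mean_a = sum(measure(true_a, noise, rng) for _ in range(n)) / n
mean_b = sum(measure(true_b, noise, rng) for _ in range(n)) / n
print(mean_a > mean_b)  # True with overwhelming probability at this n
```

The trade-off in the paragraph above is exactly this sqrt(n) factor: halving the error bars on a comparison costs roughly four times the compute.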

### If you did not already know

HANA Data Scientist Tool
The Application Function Modeler 2.0 (AFM 2) is a graphical editor for complex data analysis pipelines in the HANA Studio. This tool is based on the HANA Data Scientist prototype developed at the HANA Platform Innovation Center in Potsdam, Germany. It is planned to be the next generation of the existing HANA Studio Application Function Modeler which was developed at the TIP CE&SP Algorithm Labs in Shanghai, China. The AFM 2 team consists of original and new developers from both locations. …

Hybrid Petri net (PN)
This paper presents an approach to model an unknown Ladder Logic based Programmable Logic Controller (PLC) program consisting of Boolean logic and counters using Process Mining techniques. First, we tap the inputs and outputs of a PLC to create a data flow log. Second, we propose a method to translate the obtained data flow log to an event log suitable for Process Mining. In a third step, we propose a hybrid Petri net (PN) and neural network approach to approximate the logic of the actual underlying PLC program. We demonstrate the applicability of our proposed approach on a case study with three simulated scenarios. …

Inverse Reinforcement Learning Method for Architecture Search (IRLAS)
In this paper, we propose an inverse reinforcement learning method for architecture search (IRLAS), which trains an agent to learn to search network structures that are topologically inspired by human-designed network. Most existing architecture search approaches totally neglect the topological characteristics of architectures, which results in complicated architecture with a high inference latency. Motivated by the fact that human-designed networks are elegant in topology with a fast inference speed, we propose a mirror stimuli function inspired by biological cognition theory to extract the abstract topological knowledge of an expert human-design network (ResNeXt). To avoid raising a too strong prior over the search space, we introduce inverse reinforcement learning to train the mirror stimuli function and exploit it as a heuristic guidance for architecture search, easily generalized to different architecture search algorithms. On CIFAR-10, the best architecture searched by our proposed IRLAS achieves 2.60% error rate. For ImageNet mobile setting, our model achieves a state-of-the-art top-1 accuracy 75.28%, while being 2~4x faster than most auto-generated architectures. A fast version of this model achieves 10% faster than MobileNetV2, while maintaining a higher accuracy. …

### Let’s get it right

We focus on the problem of designing an artificial agent capable of assisting a human user to complete a task. Our goal is to guide human users towards optimal task performance while keeping their cognitive load as low as possible. Our insight is that in order to do so, we should develop an understanding of human decision for the task domain. In this work, we consider the domain of collaborative packing, and as a first step, we explore the mechanisms underlying human packing strategies. We conduct a user study in which human participants complete a series of packing tasks in a virtual environment. We analyze their packing strategies and discover that they exhibit specific spatial and temporal patterns (e.g., humans tend to place larger items into corners first). Our insight is that imbuing an artificial agent with an understanding of this spatiotemporal structure will enable improved assistance, which will be reflected in the task performance and human perception of the AI agent. Ongoing work involves the development of a framework that incorporates the extracted insights to predict and manipulate human decision making towards an efficient route of low cognitive load. A follow-up study will evaluate our framework against a set of baselines featuring distinct strategies of assistance. Our eventual goal is the deployment and evaluation of our framework on an autonomous robotic manipulator, actively assisting users on a packing task.
Gary Marcus is not impressed by the hype around deep learning. While the NYU professor believes that the technique has played an important role in advancing AI, he also thinks the field’s current overemphasis on it may well lead to its demise. Marcus, a neuroscientist by training who has spent his career at the forefront of AI research, cites both technical and ethical concerns. From a technical perspective, deep learning may be good at mimicking the perceptual tasks of the human brain, like image or speech recognition. But it falls short on other tasks, like understanding conversations or causal relationships. To create more capable and broadly intelligent machines, often referred to colloquially as artificial general intelligence, deep learning must be combined with other methods.
As so many more organizations now rely on AI to deliver services and consumer experiences, establishing a public trust in the AI is crucial as these systems begin to make harder decisions that impact customers.
Bias in AI programming, both conscious and unconscious, is an issue of concern raised by scholars, the public, and the media alike. Given the implications of usage in hiring, credit, social benefits, policing, and legal decisions, they have good reason to be. AI bias occurs when a computer algorithm makes prejudiced decisions based on data and/or programming rules. The problem of bias is not only with coding (or programming), but also with the datasets that are used to train AI algorithms, in what some call the ‘discrimination feedback loop.’
Today was another rainy Friday afternoon in Berlin. At 4 pm sharp, a dear colleague of mine came to my working place and took me for a quick coffee break. While walking, avoiding the small water puddles on the floor, she said: ‘Yesterday I saw a movie about this couple. After 9 years being together and still loving each other, they broke up. The girl had an amazing working opportunity in another place and the guy did not want a long distance relationship. Seriously mate, heartbreaking.’ One thing led to another, and suddenly I said: ‘look, no matter what everybody says, I am convinced that when two people love, truly love, each other, everything can be solved. They can overcome everything. I understand and respect other people’s opinions, but love is at the core of everything that I am and do’. ‘But, dude, really, why can’t you just nurture yourself from the other human experiences around you? Can’t you see that life is harsh and realise that love is not enough?’, she replied.
Given the benefits of data science, you might guess that it has played a role in your daily life. After all, it not only affects what you do online, but what you do offline. Companies are using massive amounts of data to create better ads, produce tailored recommendations, and stock shelves, in the case of retail stores. It’s also shaping how, and who, we love. Here’s how data impacts us daily.

Article: Bias and Algorithmic Fairness

The modern business leader’s new responsibility in a brave new world ruled by data. As Data Science moves along the hype cycle and matures as a business function, so do the challenges that face the discipline. The problem statement for data science went from ‘we waste 80% of our time preparing data’ via ‘production deployment is the most difficult part of data science’ to ‘lack of measurable business impact’ in the last few years.
Services mediated by ICT platforms have shaped the landscape of the digital markets and produced immense economic opportunities. Unfortunately, the users of platforms not only surrender the value of their digital traces but also subject themselves to the power and control that data brokers exert for prediction and manipulation. As the platform revolution takes hold in public services, it is critically important to protect the public interest against the risks of mass surveillance and human rights abuses. We propose a set of design constraints that should underlie data systems in public services and which can serve as a guideline or benchmark in the assessment and deployment of platform-mediated services. The principles include, among others, minimizing control points and non-consensual trust relationships, empowering individuals to manage the linkages between their activities and empowering local communities to create their own trust relations. We further propose a set of generic and generative design primitives that fulfil the proposed constraints and exemplify best practices in the deployment of platforms that deliver services in the public interest. For example, blind tokens and attribute-based authorization may prevent the undue linking of data records on individuals. We suggest that policymakers could adopt these design primitives and best practices as standards by which the appropriateness of candidate technology platforms can be measured in the context of their suitability for delivering public services.

### R Packages worth a look

Blyth-Still-Casella Exact Binomial Confidence Intervals (BlythStillCasellaCI)
Computes Blyth-Still-Casella exact binomial confidence intervals based on a refining procedure proposed by George Casella (1986) <doi:10.2307/3314658>.
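The Blyth-Still-Casella procedure itself is involved, but the flavour of an “exact” binomial interval can be shown with the simpler Clopper-Pearson construction (an illustrative, dependency-free sketch; it is not the package’s refined algorithm):

```python
from math import comb

def binom_ge(x, n, p):
    # P(X >= x) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def binom_le(x, n, p):
    # P(X <= x) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def clopper_pearson(x, n, alpha=0.05, tol=1e-10):
    """Exact (conservative) two-sided confidence interval for a binomial proportion."""
    def bisect(f, target):
        # f is increasing in p; find p with f(p) = target by bisection on [0, 1].
        a, b = 0.0, 1.0
        while b - a > tol:
            m = (a + b) / 2
            if f(m) < target:
                a = m
            else:
                b = m
        return (a + b) / 2
    # Lower limit solves P(X >= x | p) = alpha/2; binom_ge is increasing in p.
    lower = 0.0 if x == 0 else bisect(lambda p: binom_ge(x, n, p), alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha/2; binom_le is decreasing in p,
    # so negate it to get an increasing function.
    upper = 1.0 if x == n else bisect(lambda p: -binom_le(x, n, p), -alpha / 2)
    return lower, upper
```

Blyth-Still-Casella intervals refine such exact intervals to be shorter while keeping the coverage guarantee; for that, use the package itself.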

Genome-Wide Structural Equation Modeling (gwsem)
Melds genome-wide association tests with structural equation modeling (SEM) using ‘OpenMx’. This package contains low-level C/C++ code to rapidly read genetic data encoded in U.K. Biobank or ‘plink’ formats. Prebuilt modeling options include one and two factor models. Alternatively, analyses may utilize arbitrary, user-provided SEMs. See Verhulst, Maes, & Neale (2017) <doi:10.1007/s10519-017-9842-6> for details. An updated manuscript is in preparation.

Food Network Inference and Visualization (foodingraph)
Displays a weighted undirected food graph from an adjacency matrix. Can perform confidence-interval bootstrap inference with mutual information or maximal information coefficient. Based on my Master 1 internship at the Bordeaux Population Health center. References: Reshef et al. (2011) <doi:10.1126/science.1205438>, Meyer et al. (2008) <doi:10.1186/1471-2105-9-461>, Liu et al. (2016) <doi:10.1371/journal.pone.0158247>.
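As a flavour of the mutual-information scores such network inference builds on, here is a dependency-free computation of empirical MI between two discrete variables (an illustrative sketch only, not the package’s bootstrap procedure):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete samples."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # Each cell contributes p(x,y) * log2(p(x,y) / (p(x) p(y))).
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly dependent variables share one bit; independent ones share none.
identical = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A food graph would then keep an edge between two variables when such a score exceeds a confidence threshold, which the package estimates by bootstrap.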

Varying Coefficients (varycoef)
Provides a maximum likelihood estimation (MLE) method to estimate and predict spatially varying coefficient (SVC) models. It supports covariance tapering by Furrer et al. (2006) <doi:10.1198/106186006X132178> to allow MLE on large data.

An Extension of the Taylor Diagram to Two-Dimensional Vector Data (SailoR)
This package presents a new diagram for the verification of vector variables (wind, current, etc.) generated by multiple models against a set of observations. It has been designed as a generalization of the Taylor diagram to two-dimensional quantities. It is based on the analysis of the two-dimensional structure of the mean squared error matrix between model and observations. The matrix is divided into the part corresponding to the relative rotation and the bias of the empirical orthogonal functions of the data. The full set of diagnostics produced by the analysis of the errors between model and observational vector datasets comprises the errors in the means, the analysis of the total variance of both datasets, the rotation matrix corresponding to the principal components in observation and model, the angle of rotation of model-derived empirical orthogonal functions with respect to those from observations, the standard deviations of model and observations, the root mean squared error between both datasets, and the squared two-dimensional correlation coefficient. See the output of the function UVError() in this package.

### Discover Offensive Programming

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Package wyz.code.offensiveProgramming version 1.1.12 is available on CRAN.

If you are interested in reducing the time and effort needed to implement and debug R code, to generate R documentation, and to generate test code, then you may consider using this package. It provides tools to manage:

1. semantic naming of function parameter arguments
2. function return type
3. functional test cases

Using this package you will be able to verify the types of arguments passed to functions without embedding verification code in the functions themselves, thus reducing their size and your implementation time. The type and length of each parameter are verified on your explicit demand, allowing use at any stage of the software delivery life cycle.

Similarly, expected function return types can also be verified on demand, either interactively or programmatically.
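The idea of on-demand argument and return-type checks living outside the function body can be sketched with a Python decorator (an illustration of the concept only; it does not mirror wyz.code.offensiveProgramming’s actual API):

```python
import functools

def typed(arg_types, return_type):
    """Attach type metadata; verify it only when explicitly asked to."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, check=False, **kwargs):
            if check:
                # Verification happens on explicit demand, not on every call.
                for value, expected in zip(args, arg_types):
                    if not isinstance(value, expected):
                        raise TypeError(f"{fn.__name__}: expected {expected.__name__}, "
                                        f"got {type(value).__name__}")
            result = fn(*args, **kwargs)
            if check and not isinstance(result, return_type):
                raise TypeError(f"{fn.__name__}: bad return type {type(result).__name__}")
            return result
        return wrapper
    return decorate

@typed(arg_types=(int, int), return_type=int)
def add(a, b):
    # The function body itself stays free of verification code.
    return a + b

add(1, 2)              # fast path: no verification
add(1, 2, check=True)  # verification on explicit demand
```

The function stays small, and the checks can be switched on during testing and off in production, which is the trade-off the package advertises.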

Browse the online documentation to learn more.

More to come on how to put this into action in the next post.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

### A Shiny Intro Survey to an Open Science Course

[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers].

Last week, we started a new course titled “Statistical Programming and Open
Science Methods”. It is being offered under the research program of
TRR 266 “Accounting for Transparency”
and enables students to conduct data-based research so that others can contribute
and collaborate. This involves making research data and methods FAIR
(findable, accessible, interoperable and reusable) and results reproducible.
All the materials of the course are
available on GitHub together with
some notes in the README on how to use them for self-guided learning.

The course is over-booked so running a normal introduction round was not
feasible. Yet, I was very interested to learn about the students’ backgrounds
with respect to statistical programming and their learning objectives. Thus,
I decided to construct a quick online survey using the ‘shiny’ package. In
addition to collecting data, this also gave me the opportunity to showcase
one of the less obvious applications of statistical programming.

The design of the survey is relatively straightforward. It asks the students
about their familiarity with a set of statistical programming languages and then
changes the survey dynamically to collect their assessments of each language’s
usability and how easy it is to learn. After that, it presents a list of
programming-related terms and asks students to state whether they are reasonably
familiar with these terms. It closes with asking students about their learning
objectives for this course and gives them the opportunity to state their name.

The data is being stored in a SQLite file-based database in the directory of
the shiny app. Another app accesses the data and presents a quick evaluation
as well as the opportunity to download the anonymized data. You can access the
survey here (submit
button disabled) and the evaluation app here.
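The storage pattern here, one app writing survey responses to a file-based SQLite database that a second app then reads, is language-agnostic; a minimal Python sketch of the same idea (illustrative, not the post's R/DBI code):

```python
import os
import sqlite3
import tempfile

# A file-based database sitting in the (here: temporary) app directory.
db_path = os.path.join(tempfile.mkdtemp(), "survey.db")

# "Survey app": create the table and store a response.
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS answers (language TEXT, familiarity INTEGER)")
    conn.execute("INSERT INTO answers VALUES (?, ?)", ("R", 4))

# "Results app": a separate connection reads the same file.
with sqlite3.connect(db_path) as conn:
    rows = conn.execute("SELECT language, familiarity FROM answers").fetchall()
```

Because both apps only need the path to one file, no database server has to be run alongside the shiny server.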

To visualize the learning objectives I used the ‘ggwordcloud’ package. Fancy
looking but of limited relevance.

The code for the survey and its evaluation is part of the course’s GitHub
repository. Feel free to reuse it. Some things that might be relevant here:

• Watch out for SQL Injection issues. In my code, I use DBI::sqlInterpolate()
for this purpose.
• The repository contains both shiny apps (app_survey.R and app_results.R)
in one directory. Make sure to export them as two separate shiny apps in
separate folders.
• The results app needs access to the database file that the survey app
is writing to. When you are hosting this on your own shiny server this
can be realized by the results app linking to the database file in the folder
of the survey app. If you plan to host your apps on a service like
‘shinyapps.io’ then this will most likely not be feasible. In this case,
you might consider switching to an external database.
• When using shiny in very large courses, your students might experience
“Too many users” errors from shiny as it has a limit of 100 concurrent
users for a given app. When running your own shiny server you can configure
shiny to allow more users but my guess is that you will run into performance
issues at some point.
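On the first bullet: the defense DBI::sqlInterpolate() provides in R, binding user input as parameters instead of pasting it into the SQL string, looks like this in Python's sqlite3 (an illustrative sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objectives (student TEXT, objective TEXT)")

# Hostile free-text input, harmless when bound as a parameter.
student = "Robert'); DROP TABLE objectives;--"

# Unsafe: building SQL by string concatenation invites injection:
#   conn.execute("INSERT INTO objectives VALUES ('" + student + "', 'learn R')")

# Safe: placeholders let the driver escape the value.
conn.execute("INSERT INTO objectives VALUES (?, ?)", (student, "learn R"))
stored = conn.execute("SELECT student FROM objectives").fetchone()[0]
```

The quote characters are stored verbatim instead of terminating the statement, which is exactly what the interpolation helper guarantees in the R code.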

This is it. Let me know your thoughts; I would be very happy to get in touch
if you are reusing the code for your own class survey. Feel free to comment below. Alternatively, you can reach me via email or Twitter.

Enjoy!


### Cluster multiple time series using K-means


I have recently been confronted with the issue of finding similarities among time series and thought
about using k-means to cluster them. To illustrate the method, I’ll be using data from the
Penn World Tables, readily available in R (inside the {pwt9} package):

library(tidyverse)
library(lubridate)
library(pwt9)
library(brotools)

First of all, let’s select only the needed columns:

pwt <- pwt9.0 %>%
select(country, year, avh)

avh contains the average annual hours worked for a given country and year. The data looks like this:

head(pwt)
##          country year avh
## ABW-1950   Aruba 1950  NA
## ABW-1951   Aruba 1951  NA
## ABW-1952   Aruba 1952  NA
## ABW-1953   Aruba 1953  NA
## ABW-1954   Aruba 1954  NA
## ABW-1955   Aruba 1955  NA

For each country, there’s yearly data on the avh variable. The goal here is to cluster the different
countries by looking at how similar they are on the avh variable. Let’s do some further cleaning.
The k-means implementation in R expects a wide data frame (currently my data frame is in the long
format) and no missing values. These could potentially be imputed, but I can’t be bothered:

pwt_wide <- pwt %>%
pivot_wider(names_from = year, values_from = avh)  %>%
filter(!is.na(`1950`)) %>%
mutate_at(vars(-country), as.numeric)

To convert my data frame from long to wide, I use the fresh pivot_wider() function, instead of the
less intuitive spread() function.
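The long-to-wide reshape plus dropping countries with missing values can be mimicked without any libraries; a pure-Python sketch on hypothetical toy records (not the real PWT data):

```python
# Long format: one (country, year, value) record per row.
long_rows = [
    ("Aruba", 1950, None), ("Aruba", 1951, None),
    ("France", 1950, 2045.8), ("France", 1951, 2023.1),
]

# Pivot wider: one mapping year -> value per country.
wide = {}
for country, year, avh in long_rows:
    wide.setdefault(country, {})[year] = avh

# Keep only countries without missing values (the post drops them too).
complete = {c: v for c, v in wide.items() if None not in v.values()}
```

Each surviving dict of year-indexed values corresponds to one row of the wide data frame that k-means will consume.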

We’re ready to use the k-means algorithm. To know how many clusters I should aim for, I’ll be using
the elbow method (if you’re not familiar with this method, click on the image at the very top of
this post):

wss <- map_dbl(1:5, ~{kmeans(select(pwt_wide, -country), ., nstart = 50, iter.max = 15)$tot.withinss})

n_clust <- 1:5

elbow_df <- as.data.frame(cbind("n_clust" = n_clust, "wss" = wss))

ggplot(elbow_df) +
geom_line(aes(y = wss, x = n_clust), colour = "#82518c") +
theme_blog()

Looks like 3 clusters is a good choice. Let’s now run the k-means algorithm:

clusters <- kmeans(select(pwt_wide, -country), centers = 3)

clusters is a list with several interesting items. The item centers contains the “average” time series:

(centers <- rownames_to_column(as.data.frame(clusters$centers), "cluster"))
##   cluster     1950     1951     1952     1953     1954     1955     1956
## 1       1 2110.440 2101.273 2088.947 2074.273 2066.617 2053.391 2034.926
## 2       2 2086.509 2088.571 2084.433 2081.939 2078.756 2078.710 2074.175
## 3       3 2363.600 2350.774 2338.032 2325.375 2319.011 2312.083 2308.483
##       1957     1958     1959     1960     1961     1962     1963     1964
## 1 2021.855 2007.221 1995.038 1985.904 1978.024 1971.618 1963.780 1962.983
## 2 2068.807 2062.021 2063.687 2060.176 2052.070 2044.812 2038.939 2037.488
## 3 2301.355 2294.556 2287.556 2279.773 2272.899 2262.781 2255.690 2253.431
##       1965     1966     1967     1968     1969     1970     1971     1972
## 1 1952.945 1946.961 1928.445 1908.354 1887.624 1872.864 1855.165 1825.759
## 2 2027.958 2021.615 2015.523 2007.176 2001.289 1981.906 1967.323 1961.269
## 3 2242.775 2237.216 2228.943 2217.717 2207.037 2190.452 2178.955 2167.124
##       1973     1974     1975     1976     1977     1978     1979     1980
## 1 1801.370 1770.484 1737.071 1738.214 1713.395 1693.575 1684.215 1676.703
## 2 1956.755 1951.066 1933.527 1926.508 1920.668 1911.488 1904.316 1897.103
## 3 2156.304 2137.286 2125.298 2118.138 2104.382 2089.717 2083.036 2069.678
##       1981     1982     1983     1984     1985     1986     1987     1988
## 1 1658.894 1644.019 1636.909 1632.371 1623.901 1615.320 1603.383 1604.331
## 2 1883.376 1874.730 1867.266 1861.386 1856.947 1849.568 1848.748 1847.690
## 3 2055.658 2045.501 2041.428 2030.095 2040.210 2033.289 2028.345 2029.290
##       1989     1990     1991     1992     1993     1994     1995     1996
## 1 1593.225 1586.975 1573.084 1576.331 1569.725 1567.599 1567.113 1558.274
## 2 1842.079 1831.907 1823.552 1815.864 1823.824 1830.623 1831.815 1831.648
## 3 2031.741 2029.786 1991.807 1974.954 1973.737 1975.667 1980.278 1988.728
##       1997     1998     1999     2000     2001     2002     2003     2004
## 1 1555.079 1555.071 1557.103 1545.349 1530.207 1514.251 1509.647 1522.389
## 2 1835.372 1836.030 1839.857 1827.264 1813.477 1781.696 1786.047 1781.858
## 3 1985.076 1961.219 1966.310 1959.219 1946.954 1940.110 1924.799 1917.130
##       2005     2006     2007     2008     2009     2010     2011     2012
## 1 1514.492 1512.872 1515.299 1514.055 1493.875 1499.563 1503.049 1493.862
## 2 1775.167 1776.759 1773.587 1771.648 1734.559 1736.098 1742.143 1735.396
## 3 1923.496 1912.956 1902.156 1897.550 1858.657 1861.875 1861.608 1850.802
##       2013     2014
## 1 1485.589 1486.991
## 2 1729.973 1729.543
## 3 1848.158 1851.829

clusters also contains the cluster item, which tells me to which cluster the different countries
belong. I can easily add this to the original data frame:

pwt_wide <- pwt_wide %>%
mutate(cluster = clusters$cluster)

Now, let’s prepare the data for visualisation. I have to go back to a long data frame for this:

pwt_long <- pwt_wide %>%
pivot_longer(cols = c(-country, -cluster), names_to = "year", values_to = "avh") %>%
mutate(year = ymd(paste0(year, "-01-01")))

centers_long <- centers %>%
pivot_longer(cols = -cluster, names_to = "year", values_to = "avh") %>%
mutate(year = ymd(paste0(year, "-01-01")))

And I can now plot the different time series, by cluster, and highlight the “average” time series for each cluster as well (yellow line):

ggplot() +
geom_line(data = pwt_long, aes(y = avh, x = year, group = country), colour = "#82518c") +
facet_wrap(~cluster, nrow = 1) +
geom_line(data = centers_long, aes(y = avh, x = year, group = cluster), col = "#b58900", size = 2) +
theme_blog() +
labs(title = "Average hours worked in several countries",
caption = "The different time series have been clustered using k-means. Cluster 1: Belgium, Switzerland, Germany, Denmark, France, Luxembourg, Netherlands, Norway, Sweden.\nCluster 2: Australia, Colombia, Ireland, Iceland, Japan, Mexico, Portugal, Turkey.\nCluster 3: Argentina, Austria, Brazil, Canada, Cyprus, Spain, Finland, UK, Italy, New Zealand, Peru, USA, Venezuela") +
theme(plot.caption = element_text(colour = "white"))

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.
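For readers outside R, the core of the whole exercise, Lloyd's k-means plus the total within-cluster sum of squares that drives the elbow plot, can be sketched dependency-free (illustrative Python on toy series, not the post's code):

```python
def dist2(a, b):
    # Squared Euclidean distance between two equal-length series.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=25):
    """Minimal Lloyd's algorithm with a deterministic spread-out init."""
    centers = [rows[i * len(rows) // k] for i in range(k)]
    for _ in range(iters):
        # Assign every row to its nearest center.
        groups = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda c: dist2(row, centers[c]))
            groups[nearest].append(row)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    wss = sum(min(dist2(row, c) for c in centers) for row in rows)
    return centers, wss

# Toy "countries": two groups of slowly declining hours-worked series.
series = [[2100 - i, 2080 - i, 2060 - i] for i in range(5)] + \
         [[1600 - i, 1580 - i, 1560 - i] for i in range(5)]

# Elbow method: wss drops sharply up to the true number of clusters.
wss_by_k = [kmeans(series, k)[1] for k in range(1, 5)]
```

R's kmeans() does the same assignment/update loop, with random restarts (nstart) instead of the deterministic initialization used here.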
Continue Reading…

## October 12, 2019

### Document worth reading: “On the Diversity of Memory and Storage Technologies”

The last decade has seen tremendous developments in memory and storage technologies, starting with Flash Memory and continuing with the upcoming Storage-Class Memories. Combined with an explosion of data processing, data analytics, and machine learning, this led to a segmentation of the memory and storage market. Consequently, the traditional storage hierarchy, as we know it today, might be replaced by a multitude of storage hierarchies, with potentially different depths, each tailored for specific workloads. In this context, we explore in this ‘Kurz Erklärt’ the state of memory technologies and reflect on their future use with a focus on data management systems.

On the Diversity of Memory and Storage Technologies

Continue Reading…

### Magister Dixit

“Data science is, at its foundation, centered on the analysis of data. The technical areas of data science are those that need direct study to make data analysis as effective as possible. The areas are: 1. Statistical theory, 2. Statistical models, 3. Statistical and machine-learning methods, 4. Visualization methods, 5. Algorithms for statistical, machine-learning, and visualization methods, 6. Computational environments for data analysis: hardware, software, and database management, as well as 7. Live analyses of data where results are judged by the subject-matter findings, not the methodology and systems that are used. Of course, these areas can be divided into sub-areas, which in turn can have sub-sub-areas, and so forth. Also, research in an area can depend heavily on research in others.” Wen-wen Tung, Ashrith Barthur, Matthew C. Bowers, Yuying Song, John Gerth, William S.
Cleveland (2018)

Continue Reading…

### What’s new on arXiv

Many problems in real life can be converted to combinatorial optimization problems (COPs) on graphs, that is to find a best node state configuration or a network structure such that the designed objective function is optimized under some constraints. However, these problems are notorious for their hardness to solve because most of them are NP-hard or NP-complete. Although traditional general methods such as simulated annealing (SA), genetic algorithms (GA) and so forth have been devised to these hard problems, their accuracy and time consumption are not satisfying in practice. In this work, we proposed a simple, fast, and general algorithm framework called Gumbel-softmax Optimization (GSO) for COPs. By introducing Gumbel-softmax technique which is developed in machine learning community, we can optimize the objective function directly by gradient descent algorithm regardless of the discrete nature of variables. We test our algorithm on four different problems including Sherrington-Kirkpatrick (SK) model, maximum independent set (MIS) problem, modularity optimization, and structural optimization problem. High-quality solutions can be obtained with much less time consuming compared to traditional approaches.

Standard autoregressive seq2seq models are easily trained by max-likelihood, but tend to show poor results under small-data conditions. We introduce a class of seq2seq models, GAMs (Global Autoregressive Models), which combine an autoregressive component with a log-linear component, allowing the use of global *a priori* features to compensate for lack of data. We train these models in two steps. In the first step, we obtain an *unnormalized* GAM that maximizes the likelihood of the data, but is improper for fast inference or evaluation.
In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the *normalized* distribution associated with the GAM, and can be used for fast inference and evaluation. Our experiments focus on language modelling under synthetic conditions and show a strong perplexity reduction of using the second autoregressive model over the standard one.

In this paper, we present a new fault diagnosis (FD)-based approach for detection of imagery changes that can detect significant changes as inconsistencies between different sub-modules (e.g., self-localization) of visual SLAM. Unlike classical change detection approaches such as pairwise image comparison (PC) and anomaly detection (AD), neither the memorization of each map image nor the maintenance of up-to-date place-specific anomaly detectors are required in this FD approach. A significant challenge that is encountered when incorporating different SLAM sub-modules into FD involves dealing with the varying scales of objects that have changed (e.g., the appearance of small dangerous obstacles on the floor). To address this issue, we reconsider the bag-of-words (BoW) image representation, by exploiting its recent advances in terms of self-localization and change detection. As a key advantage, BoW image representation can be reorganized into any different scaling by simply cropping the original BoW image. Furthermore, we propose to combine different self-localization modules with strong and weak BoW features with different discriminativity, and to treat inconsistency between strong and weak self-localization as an indicator of change. The efficacy of the proposed approach for FD with/without AD and/or PC was experimentally validated.

Neural Architecture Search (NAS) technologies have been successfully performed for efficient neural architectures for tasks such as image classification and semantic segmentation.
However, existing works implement NAS for target tasks independently of domain knowledge and focus only on searching for an architecture to replace the human-designed network in a common pipeline. Can we exploit human prior knowledge to guide NAS? To address it, we propose a framework, named Pose Neural Fabrics Search (PNFS), introducing prior knowledge of body structure into NAS for human pose estimation. We lead a new neural architecture search space, by parameterizing cell-based neural fabric, to learn micro as well as macro neural architecture using a differentiable search strategy. To take advantage of part-based structural knowledge of the human body and learning capability of NAS, global pose constraint relationships are modeled as multiple part representations, each of which is predicted by a personalized neural fabric. In part representation, we view human skeleton keypoints as entities by representing them as vectors at image locations, expecting it to capture keypoint’s feature in a relaxed vector space. The experiments on MPII and MS-COCO datasets demonstrate that PNFS can achieve comparable performance to state-of-the-art methods, with fewer parameters and lower computational complexity.

Logical rules are a popular knowledge representation language in many domains, representing background knowledge and encoding information that can be derived from given facts in a compact form. However, rule formulation is a complex process that requires deep domain expertise, and is further challenged by today’s often large, heterogeneous, and incomplete knowledge graphs. Several approaches for learning rules automatically, given a set of input example facts, have been proposed over time, including, more recently, neural systems. Yet, the area is missing adequate datasets and evaluation approaches: existing datasets often resemble toy examples that neither cover the various kinds of dependencies between rules nor allow for testing scalability.
We present a tool for generating different kinds of datasets and for evaluating rule learning systems.

Traditional Reinforcement Learning (RL) problems depend on an exhaustive simulation environment that models real-world physics of the problem and trains the RL agent by observing this environment. In this paper, we present a novel approach to creating an environment by modeling the reward function based on empirical rules extracted from human domain knowledge of the system under study. Using this empirical rewards function, we will build an environment and train the agent. We will first create an environment that emulates the effect of setting cabin temperature through thermostat. This is typically done in RL problems by creating an exhaustive model of the system with detailed thermodynamic study. Instead, we propose an empirical approach to model the reward function based on human domain knowledge. We will document some rules of thumb that we usually exercise as humans while setting thermostat temperature and try and model these into our reward function. This modeling of empirical human domain rules into a reward function for RL is the unique aspect of this paper. This is a continuous action space problem and using deep deterministic policy gradient (DDPG) method, we will solve for maximizing the reward function. We will create a policy network that predicts optimal temperature setpoint given external temperature and humidity.

In this paper we consider a regression model that allows for time series covariates as well as heteroscedasticity with a regression function that is modelled nonparametrically. We assume that the regression function changes at some unknown time $\lfloor ns_0\rfloor$, $s_0\in(0,1)$, and our aim is to estimate the (rescaled) change point $s_0$. The considered estimator is based on a Kolmogorov-Smirnov functional of the marked empirical process of residuals. We show consistency of the estimator and prove a rate of convergence of $O_P(n^{-1})$ which in this case is clearly optimal as there are only $n$ points in the sequence. Additionally we investigate the case of lagged dependent covariates, that is, autoregression models with a change in the nonparametric (auto-) regression function and give a consistency result. The method of proof also allows for different kinds of functionals such that Cramér-von Mises type estimators can be considered similarly.
The approach extends existing literature by allowing nonparametric models, time series data as well as heteroscedasticity. Finite sample simulations indicate the good performance of our estimator in regression as well as autoregression models and a real data example shows its applicability in practice.

We address the problem of learning to benchmark the best achievable classifier performance. In this problem the objective is to establish statistically consistent estimates of the Bayes misclassification error rate without having to learn a Bayes-optimal classifier. Our learning to benchmark framework improves on previous work on learning bounds on Bayes misclassification rate since it learns the *exact* Bayes error rate instead of a bound on error rate. We propose a benchmark learner based on an ensemble of $\epsilon$-ball estimators and Chebyshev approximation. Under a smoothness assumption on the class densities we show that our estimator achieves an optimal (parametric) mean squared error (MSE) rate of $O(N^{-1})$, where $N$ is the number of samples. Experiments on both simulated and real datasets establish that our proposed benchmark learning algorithm produces estimates of the Bayes error that are more accurate than previous approaches for learning bounds on Bayes error probability.

With the rapid development of deep neural networks (DNN), there emerges an urgent need to protect the trained DNN models from being illegally copied, redistributed, or abused without respecting the intellectual properties of legitimate owners. Following recent progresses along this line, we investigate a number of watermark-based DNN ownership verification methods in the face of ambiguity attacks, which aim to cast doubts on ownership verification by forging counterfeit watermarks. It is shown that ambiguity attacks pose serious challenges to existing DNN watermarking methods.
As remedies to the above-mentioned loophole, this paper proposes novel passport-based DNN ownership verification schemes which are both robust to network modifications and resilient to ambiguity attacks. The gist of embedding digital passports is to design and train DNN models in a way such that the DNN model performance of an original task will be significantly deteriorated due to forged passports. In other words, genuine passports are not only verified by looking for predefined signatures, but also reasserted by the unyielding DNN model performances. Extensive experimental results justify the effectiveness of the proposed passport-based DNN ownership verification schemes. Code and models are available at https://…/DeepIPR

In this short paper we investigate whether meta-learning techniques can be used to more effectively tune the hyperparameters of machine learning models using successive halving (SH). We propose a novel variant of the SH algorithm (MeSH), that uses meta-regressors to determine which candidate configurations should be eliminated at each round. We apply MeSH to the problem of tuning the hyperparameters of a gradient-boosted decision tree model. By training and tuning our meta-regressors using existing tuning jobs from 95 datasets, we demonstrate that MeSH can often find a superior solution to both SH and random search.

The recently introduced Tsetlin Machine (TM) has provided competitive pattern recognition accuracy in several benchmarks, however, requires a 3-dimensional hyperparameter search. In this paper, we introduce the Multigranular Tsetlin Machine (MTM). The MTM eliminates the specificity hyperparameter, used by the TM to control the granularity of the conjunctive clauses that it produces for recognizing patterns. Instead of using a fixed global specificity, we encode varying specificity as part of the clauses, rendering the clauses multigranular.
This makes it easier to configure the TM because the dimensionality of the hyperparameter search space is reduced to only two dimensions. Indeed, it turns out that there is significantly less hyperparameter tuning involved in applying the MTM to new problems. Further, we demonstrate empirically that the MTM provides similar performance to what is achieved with a finely specificity-optimized TM, by comparing their performance on both synthetic and real-world datasets.

Interpreting semantic knowledge describing entities, relations and attributes explicitly with visuals and implicitly with in behind-scene common senses gain more attention in autonomous robotics. By incorporating vision and language modeling with common-sense knowledge, we can provide rich features indicating strong semantic meanings for human and robot action relationships, which can be utilized further in autonomous robotic controls. In this paper, we propose a systematic scheme to generate high-conceptual dynamic knowledge graphs representing Entity-Relation-Entity (E-R-E) and Entity-Attribute-Value (E-A-V) knowledges by ‘watching’ a video clip. A combination of Vision-Language model and static ontology tree is used to illustrate workspace, configurations, functions and usages for both human and robot. The proposed method is flexible and well-versed. It will serve as our first positioning investigation for further research in various applications for autonomous robots.

Information on different fields which are collected by users requires appropriate management and organization to be structured in a standard way and retrieved fast and more easily. Document classification is a conventional method to separate text based on their subjects among scientific text, web pages and digital library. Different methods and techniques are proposed for document classifications that have advantages and deficiencies.
In this paper, several unsupervised and supervised document classification methods are studied and compared.

In cognitive psychology, automatic and self-reinforcing irrational thought patterns are known as cognitive distortions. Left unchecked, patients exhibiting these types of thoughts can become stuck in negative feedback loops of unhealthy thinking, leading to inaccurate perceptions of reality commonly associated with anxiety and depression. In this paper, we present a machine learning framework for the automatic detection and classification of 15 common cognitive distortions in two novel mental health free text datasets collected from both crowdsourcing and a real-world online therapy program. When differentiating between distorted and non-distorted passages, our model achieved a weighted F1 score of 0.88. For classifying distorted passages into one of 15 distortion categories, our model yielded weighted F1 scores of 0.68 in the larger crowdsourced dataset and 0.45 in the smaller online counseling dataset, both of which outperformed random baseline metrics by a large margin. For both tasks, we also identified the most discriminative words and phrases between classes to highlight common thematic elements for improving targeted and therapist-guided mental health treatment. Furthermore, we performed an exploratory analysis using unsupervised content-based clustering and topic modeling algorithms as first efforts towards a data-driven perspective on the thematic relationship between similar cognitive distortions traditionally deemed unique. Finally, we highlight the difficulties in applying mental health-based machine learning in a real-world setting and comment on the implications and benefits of our framework for improving automated delivery of therapeutic treatment in conjunction with traditional cognitive-behavioral therapy.

Short-text classification, like all data science, struggles to achieve high performance using limited data.
As a solution, a short sentence may be expanded with new and relevant feature words to form an artificially enlarged dataset and to add new features to the testing data. This paper applies a novel approach to text expansion by generating new words directly for each input sentence, thus requiring no additional datasets or previous training. In this unsupervised approach, new keywords are formed within the hidden states of a pre-trained language model and then used to create extended pseudo documents. The word generation process was assessed by examining how well the predicted words matched the topics of the input sentence. It was found that this method could produce 3-10 relevant new words for each target topic, while generating just one word related to each non-target topic. Generated words were then added to short news headlines to create extended pseudo headlines. Experimental results show that models trained using the pseudo headlines can improve classification accuracy when the number of training examples is limited.

Do state-of-the-art models for language understanding already have, or can they easily learn, abilities such as boolean coordination, quantification, conditionals, comparatives, and monotonicity reasoning (i.e., reasoning about word substitutions in sentential contexts)? While such phenomena are involved in natural language inference (NLI) and go beyond basic linguistic understanding, it is unclear to what extent they are captured in existing NLI benchmarks and effectively learned by models. To investigate this, we propose the use of semantic fragments—systematically generated datasets that each target a different semantic phenomenon—for probing, and efficiently improving, such capabilities of linguistic models. This approach to creating challenge datasets allows direct control over the semantic diversity and complexity of the targeted linguistic phenomena, and results in a more precise characterization of a model’s linguistic behavior.
Our experiments, using a library of 8 such semantic fragments, reveal two remarkable findings: (a) State-of-the-art models, including BERT, that are pre-trained on existing NLI benchmark datasets perform poorly on these new fragments, even though the phenomena probed here are central to the NLI task. (b) On the other hand, with only a few minutes of additional fine-tuning—with a carefully selected learning rate and a novel variation of ‘inoculation’—a BERT-based model can master all of these logic and monotonicity fragments while retaining its performance on established NLI benchmarks.

A robust model for time series forecasting is highly important in many domains, including but not limited to financial forecasting, air temperature and electricity consumption. To improve forecasting performance, traditional approaches usually require additional feature sets. However, adding more feature sets from different sources of data is not always feasible due to accessibility limitations. In this paper, we propose a novel self-boosted mechanism in which the original time series is decomposed into multiple time series. These time series play the role of additional features: the closely related group is fed into a multi-task learning model, and the loosely related group is fed into a multi-view learning part to utilize its complementary information. We use three real-world datasets to validate our model and show the superiority of our proposed method over existing state-of-the-art baseline methods.

This article provides a concise overview of the main mathematical theory of Benford’s law in a form accessible to scientists and students who have had first courses in calculus and probability. In particular, one of the main objectives here is to aid researchers who are interested in applying Benford’s law, and need to understand general principles clarifying when to expect the appearance of Benford’s law in real-life data and when not to expect it.
A second main target audience is students of statistics or mathematics, at all levels, who are curious about the mathematics underlying this surprising and robust phenomenon, and may wish to delve more deeply into the subject. This survey of the fundamental principles behind Benford’s law includes many basic examples and theorems, but does not include the proofs or the most general statements of the theorems; rather, it provides precise references where both may be found. Continue Reading…

### GitHub Streak: Round Six

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Five years ago I referenced the Seinfeld Streak in an earlier post about regular updates to the Rcpp Gallery:

This is sometimes called Jerry Seinfeld’s secret to productivity: Just keep at it. Don’t break the streak.

and then showed the first chart of GitHub streaking. And four years ago a first follow-up appeared in this post. And three years ago we had a follow-up, two years ago another one, and last year yet another one. As today is October 12, here is the newest one, from 2018 to 2019.

Again, special thanks go to Alessandro Pezzè for the Chrome add-on GithubOriginalStreak.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. Continue Reading…

### Science and Technology (October 12th 2019)

1.
In many countries, like Canada, there is relatively little private (business) research. Meanwhile, other research indicates that private research is precisely the kind of research that leads directly to productivity growth. Private research seems concentrated in hubs like Silicon Valley. To entice businesses into doing research, the Canadian government has pursued an aggressive tax credit strategy: roughly speaking, research becomes a tax avoidance strategy. It does not work well.

2. Artificial ovaries work in mice, and we think that they may work in women too.

3. Alzheimer’s research has failed us despite massive investments. It might be time for a reset.

4. Drinking alcohol triggers a hormone (FGF21) that makes you more likely to drink water.

5. Dog owners are less likely to die.

6. The greatest inequality might be in dating and sex. The bottom 80% of men (in terms of attractiveness) are competing for the bottom 22% of women, and the top 78% of women are competing for the top 20% of men. In other words, a small number of men have their pick of all of the women, and most guys are unattractive to most women.

7. The U.S. Navy looked at skin cancer rates among its staff. They found the highest rate of skin cancer in people working essentially indoors.

8. Fruit flies exposed to a combo of three different drugs lived 48% longer, even though the individual effect of each drug is relatively small.

9. Moderate alcohol consumption is associated with reduced inflammation and improved responses to vaccination.

10. Choline might protect against Alzheimer’s. Choline is found in meat, among other places.

11. You may have read that we find symmetric faces more attractive. A new study challenges this claim.

12. Iron might be an aging factor.

13. Climate models routinely predict higher temperatures than what is actually observed. Where does the heat go? A paper in Nature claimed that the heat in question went into the ocean.
Nature withdrew the paper in question as it contained too many data processing mistakes.

14. Alpha-ketoglutarate can significantly extend lifespan and healthspan in mice. You can get it in the form of cheap supplements. Continue Reading…

### On the term “self-appointed” . . .

I was reflecting on what bugs me so much about people using the term “self-appointed” (for example, when disparaging “self-appointed data police” or “self-appointed chess historians“). The obvious question when someone talks about “self-appointed” whatever is: Who self-appointed you to decide who is illegitimately self-appointed?

But my larger concern is with the idea that being a self-appointed whatever is a bad thing. Consider the alternative, which is to be appointed by some king or queen or governmental body or whatever. That wouldn’t do much to foster a culture of openness, would it? First, the kind of people who are appointed would be those who don’t offend the king/queen/government/etc., or else they’d need to hide their true colors until getting that appointment. Second, by restricting yourself to criticism coming from people with official appointments, you’re shutting out the vast majority of potential sources of valuable criticism.

Let’s consider the two examples above.

1. “Self-appointed data police.” To paraphrase Thomas Basboll, there are no data police. In any case, data should be available to all (except in cases of trade secrets, national security, confidentiality, etc.), and anyone should be able to “appoint themselves” the right to criticize data analyses.

2. “Self-appointed chess historians.” This one’s even funnier in that I don’t think there are any official chess historians. Here’s a list, but it includes one of the people criticized in the above quote as being “self-appointed”, so that won’t really work.

So, next time you hear someone complain about “self-appointed” bla bla, consider the alternative . . .
Should criticism only be allowed from those who have been officially appointed? That’s a recipe for disaster. And regarding questions about the personal motivations of critics (calling them “terrorists” etc.), recall the Javert paradox. Continue Reading…

### Using Spark from R for performance with arbitrary code – Part 3 – Using R to construct SQL queries and let Spark execute them

[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers].

# Introduction

In the previous part of this series, we looked at writing R functions that can be executed directly by Spark without serialization overhead, with a focus on writing functions as combinations of dplyr verbs, and investigated how the SQL is generated and the Spark plans created. In this third part, we will look at how to write R functions that generate SQL queries that can be executed by Spark, how to execute them with DBI, and how to achieve lazy SQL statements that only get executed when needed. We also briefly present wrapping these approaches into functions that can be combined with other Spark operations.

# Preparation

The full setup of Spark and sparklyr is not in the scope of this post; please check the previous one for setup instructions and a ready-made Docker image.
```r
# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>%
  select(id, everything())

# Connect
sc <- sparklyr::spark_connect(master = "local")

# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc,
  df = weather,
  name = "weather",
  overwrite = TRUE
)

# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc,
  df = nycflights13::flights,
  name = "flights",
  overwrite = TRUE
)
```

# R functions as Spark SQL generators

There are use cases where it is desirable to express the operations directly with SQL instead of combining dplyr verbs, for example when working within multi-language environments where re-usability is important. We can then send the SQL query directly to Spark to be executed. To create such queries, one option is to write R functions that work as query constructors. Again using a very simple example, a naive implementation of column normalization could look as follows. Note that the use of SELECT * is discouraged and only here for illustration purposes:

```r
normalize_sql <- function(df, colName, newColName) {
  paste0(
    "SELECT",
    "\n ", df, ".*", ",",
    "\n (", colName, " - (SELECT avg(", colName, ") FROM ", df, "))",
    " / ",
    "(SELECT stddev_samp(", colName, ") FROM ", df, ") as ", newColName,
    "\n",
    "FROM ", df
  )
}
```

Using the weather dataset would then yield the following SQL query when normalizing the temp column:

```r
normalize_temp_query <- normalize_sql("weather", "temp", "normTemp")
cat(normalize_temp_query)
## SELECT
##  weather.*,
##  (temp - (SELECT avg(temp) FROM weather)) / (SELECT stddev_samp(temp) FROM weather) as normTemp
## FROM weather
```

Now that we have the query created, we can look at how to send it to Spark for execution.
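One caveat worth noting, as an aside that goes beyond the original post: because a generator like this pastes raw strings, nothing protects against malformed identifiers. DBI ships quoting helpers such as `dbQuoteIdentifier()`, and `DBI::ANSI()` provides a dummy connection implementing ANSI quoting rules without a live database. A hedged sketch of a more defensive generator (the function name `normalize_sql_quoted` is illustrative, not from the post):

```r
library(DBI)

# Sketch only: same query shape as normalize_sql(), but with all
# identifiers quoted through DBI before being pasted into the SQL string.
normalize_sql_quoted <- function(con, df, colName, newColName) {
  tbl <- dbQuoteIdentifier(con, df)
  col <- dbQuoteIdentifier(con, colName)
  new <- dbQuoteIdentifier(con, newColName)
  paste0(
    "SELECT ", tbl, ".*,\n",
    " (", col, " - (SELECT avg(", col, ") FROM ", tbl, "))",
    " / (SELECT stddev_samp(", col, ") FROM ", tbl, ") as ", new, "\n",
    "FROM ", tbl
  )
}

# ANSI() lets us preview the quoting without connecting to anything
cat(normalize_sql_quoted(ANSI(), "weather", "temp", "normTemp"))
```

Whether quoted identifiers are appropriate depends on the backend; Spark SQL uses backticks rather than ANSI double quotes, so the connection object passed in matters.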
# Executing the generated queries via Spark

## Using DBI as the interface

The R package DBI provides an interface for communication between R and relational database management systems. We can simply use the dbGetQuery() function to execute our query, for instance:

```r
res <- DBI::dbGetQuery(sc, statement = normalize_temp_query)
head(res)
##   id origin year month day hour  temp  dewp humid wind_dir wind_speed
## 1  1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702
## 2  2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546
## 3  3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780
## 4  4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858
## 5  5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858
## 6  6    EWR 2013     1   1    6 37.94 28.04 67.21      240   11.50780
##   wind_gust precip pressure visib           time_hour   normTemp
## 1       NaN      0   1012.0    10 2013-01-01 06:00:00 -0.9130047
## 2       NaN      0   1012.3    10 2013-01-01 07:00:00 -0.9130047
## 3       NaN      0   1012.5    10 2013-01-01 08:00:00 -0.9130047
## 4       NaN      0   1012.2    10 2013-01-01 09:00:00 -0.8624083
## 5       NaN      0   1011.9    10 2013-01-01 10:00:00 -0.9130047
## 6       NaN      0   1012.4    10 2013-01-01 11:00:00 -0.9737203
```

As we might have noticed from the way the result is printed, a standard data frame is returned, as opposed to the tibbles returned by most sparklyr operations. It is important to note that using dbGetQuery() automatically computes and collects the results to the R session. This is in contrast with the dplyr approach, which constructs the query and only collects the results to the R session when collect() is called, or computes them when compute() is called. We will now examine two options to use the prepared query lazily and without collecting the results into the R session.
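As a side note not covered in the original post: the generic DBI interface also allows splitting execution from retrieval via dbSendQuery() and dbFetch(), which limits how many rows are brought into the R session at a time. Whether a particular sparklyr backend fully supports incremental fetching may vary, so treat this as a sketch of the generic DBI pattern rather than a guaranteed Spark feature:

```r
# Generic DBI pattern (sketch): execute once, fetch in pieces.
res <- DBI::dbSendQuery(sc, normalize_temp_query)
first_rows <- DBI::dbFetch(res, n = 10)  # retrieve only the first 10 rows
DBI::dbClearResult(res)                  # free the result set
```

This is still eager on the Spark side (the query runs when sent); the two approaches below avoid even that.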
## Invoking sql on a Spark session object

Without going into further details on the invoke() functionality of sparklyr, which we will focus on in the fourth installment of the series, if the desire is to have a “lazy” SQL that does not get automatically computed and collected when called from R, we can invoke a sql method on a SparkSession class object. The method takes a string SQL query as input and processes it using Spark, returning the result as a Spark DataFrame. This gives us the ability to only compute and collect the results when desired:

```r
# Use the query "lazily" without execution:
normalized_lazy_ds <- sc %>%
  spark_session() %>%
  invoke("sql", normalize_temp_query)

normalized_lazy_ds
##
## org.apache.spark.sql.Dataset
## [id: int, origin: string ... 15 more fields]

# Collect when needed:
normalized_lazy_ds %>% collect()
## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>
```

## Using tbl with dbplyr’s sql

The above method gives us a reference to a Java object as a result, which might be less intuitive to work with for R users. We can also opt to use dbplyr’s sql() function in combination with tbl() to get a more familiar result. Note that when printing the below normalized_lazy_tbl, the query gets partially executed to provide the first few rows.
Only when collect() is called is the entire set retrieved to the R session:

```r
# Nothing is executed yet
normalized_lazy_tbl <- normalize_temp_query %>%
  dbplyr::sql() %>%
  tbl(sc, .)

# Print the first few rows
normalized_lazy_tbl
## # Source: spark
## # [?? x 17]
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>

# Collect the entire result to the R session and print
normalized_lazy_tbl %>% collect()
## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>
```

## Wrapping the tbl approach into functions

In the approach above we provided sc in the call to tbl(). When wrapping such processes into a function, it might however be useful to take the specific DataFrame reference as an input instead of the generic Spark connection reference.
In that case, we can use the fact that the connection reference is also stored in the DataFrame reference, in the con sub-element of the src element. For instance, looking at our tbl_weather:

```r
class(tbl_weather[["src"]][["con"]])
## [1] "spark_connection"       "spark_shell_connection"
## [3] "DBIConnection"
```

Putting this together, we can create a simple wrapper function that lazily sends a SQL query to be processed on a particular Spark DataFrame reference:

```r
lazy_spark_query <- function(tbl, qry) {
  qry %>%
    dbplyr::sql() %>%
    dplyr::tbl(tbl[["src"]][["con"]], .)
}
```

And use it to do the same as we did above with a single function call:

```r
lazy_spark_query(tbl_weather, normalize_temp_query) %>% collect()
## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##    <int> <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>, normTemp <dbl>
```

# Combining multiple approaches and functions into lazy datasets

The power of Spark partly comes from lazy execution, and we can take advantage of this in ways that are not immediately obvious. Consider the following function we have shown previously:

```r
lazy_spark_query
## function(tbl, qry) {
##   qry %>%
##     dbplyr::sql() %>%
##     dplyr::tbl(tbl[["src"]][["con"]], .)
## }
```

Since the output of this function without collection is actually only a translated SQL statement, we can take that output and keep combining it with other operations, for instance:

```r
qry <- normalize_sql("flights", "dep_delay", "dep_delay_norm")
lazy_spark_query(tbl_flights, qry) %>%
  group_by(origin) %>%
  summarise(mean(dep_delay_norm)) %>%
  collect()
## Warning: Missing values are always removed in SQL.
## Use mean(x, na.rm = TRUE) to silence this warning
## This warning is displayed only once per session.
## # A tibble: 3 x 2
##   origin mean(dep_delay_norm)
##   <chr>                 <dbl>
## 1 EWR                  0.0614
## 2 JFK                 -0.0131
## 3 LGA                 -0.0570
```

The crucial advantage is that even though lazy_spark_query would return the entire updated weather dataset when collected stand-alone, in combination with other operations Spark first figures out how to execute all the operations together efficiently, and only then physically executes them and returns only the grouped and aggregated data to the R session. We can therefore effectively combine multiple approaches to interfacing with Spark while still keeping the benefit of retrieving only very small, aggregated amounts of data to the R session. The effect is quite significant even with a dataset as small as flights (336,776 rows of 19 columns) and with a local Spark instance. The chart below compares executing a query lazily, aggregating within Spark and only retrieving the aggregated data, versus retrieving first and aggregating locally.
The third boxplot shows the cost of pure collection on the query itself:

```r
bench <- microbenchmark::microbenchmark(
  times = 20,
  collect_late = lazy_spark_query(tbl_flights, qry) %>%
    group_by(origin) %>%
    summarise(mean(dep_delay_norm)) %>%
    collect(),
  collect_first = lazy_spark_query(tbl_flights, qry) %>%
    collect() %>%
    group_by(origin) %>%
    summarise(mean(dep_delay_norm)),
  collect_only = lazy_spark_query(tbl_flights, qry) %>% collect()
)
```

# Where SQL can be better than dbplyr translation

## When a translation is not there

We have discussed in the first part that the set of operations translated to Spark SQL via dbplyr may not cover all possible use cases. In such a case, the option to write SQL directly is very useful.

## When translation does not provide expected results

In some instances, using dbplyr to translate R operations to Spark SQL can lead to unexpected results. As one example, consider the following integer division on a column of a local data frame.

```r
# id_div_5 is as expected
weather %>%
  mutate(id_div_5 = id %/% 5L) %>%
  select(id, id_div_5)
## # A tibble: 26,115 x 2
##       id id_div_5
##    <int>    <int>
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with 26,105 more rows
```

As expected, we get the result of integer division in the id_div_5 column. However, applying the very same operation on a Spark DataFrame yields unexpected results:

```r
# id_div_5 is normal division, not integer division
tbl_weather %>%
  mutate(id_div_5 = id %/% 5L) %>%
  select(id, id_div_5)
## # Source: spark [?? x 2]
##       id id_div_5
##    <int>    <dbl>
##  1     1      0.2
##  2     2      0.4
##  3     3      0.6
##  4     4      0.8
##  5     5      1
##  6     6      1.2
##  7     7      1.4
##  8     8      1.6
##  9     9      1.8
## 10    10      2
## # … with more rows
```

This is due to the fact that the translation of integer division is quite difficult to implement: https://github.com/tidyverse/dbplyr/issues/108.
We could certainly figure out a way to fix this particular issue, but the workarounds may prove inefficient:

```r
tbl_weather %>%
  mutate(id_div_5 = as.integer(id %/% 5L)) %>%
  select(id, id_div_5)
## # Source: spark [?? x 2]
##       id id_div_5
##    <int>    <int>
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows

# Not too efficient:
tbl_weather %>%
  mutate(id_div_5 = as.integer(id %/% 5L)) %>%
  select(id, id_div_5) %>%
  explain()
##
## SELECT id, CAST(id / 5 AS INT) AS id_div_5
## FROM weather
##
##
## == Physical Plan ==
## *(1) Project [id#24, cast((cast(id#24 as double) / 5.0) as int) AS id_div_5#4273]
## +- InMemoryTableScan [id#24]
##    +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##       +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]
```

Using SQL and the knowledge that Hive does provide a built-in DIV arithmetic operator, we can get the desired results very simply and efficiently by writing SQL:

```r
"SELECT id, id DIV 5 id_div_5 FROM weather" %>%
  dbplyr::sql() %>%
  tbl(sc, .)
## # Source: spark [?? x 2]
##       id id_div_5
##    <int>    <dbl>
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows
```

Even though the numeric value of the results is correct here, we may still notice that the class of the returned id_div_5 column is actually numeric instead of integer. Such is the life of developers using data processing interfaces.

## When portability is important

Since the languages that provide interfaces to Spark are not limited to R, and multi-language setups are quite common, another reason to use SQL statements directly is the portability of such solutions.
A SQL statement can be executed by the interfaces provided for all languages – Scala, Java, and Python – without the need to rely on R-specific packages such as dbplyr.

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog. Continue Reading…

### What’s going on on PyPI

Scanning all new published packages on PyPI I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, worth following, or which might inspire you in some way.

- adafdr: A fast and covariate-adaptive method for multiple hypothesis testing.
- linear: A straightforward package for linear regression with Gaussian priors.
- lm-decoder: KenLM decoder for non-autoregressive models.
- meepo2: Event sourcing for databases.
- neural-wrappers: Generic high-level neural network wrapper for PyTorch.
- ngboost: Library for probabilistic predictions via gradient boosting.
- numba-scipy: numba-scipy extends Numba to make it aware of SciPy.
- numpy-financial: The numpy-financial package contains a collection of elementary financial functions.
- object-detection-core: TensorFlow Object Detection Library.
- olr: Optimal Linear Regression. The olr function runs all the possible combinations of linear regressions with all of the dependent variables against the independent variable and returns the statistical summary of either the greatest adjusted R-squared or R-squared term.
- optima: PyTorch optimization framework for researchers.
- ox_mon: Monitor status, security, and robustness of your machines. Basically, there are lots of little monitoring and checking tasks you may find yourself needing to do. You could write a separate script for each such task, but it would be nice to have some basic scaffolding for things like notifications, logging, testing, and so on to simplify machine monitoring. Better yet, by having a common framework many developers can contribute small snippets of such tools that work in a similar way to simplify life for everyone.
- pennpaper: Set of utilities for plotting results of non-deterministic experiments, e.g. machine learning, optimization, genetic algorithms. Pen’n’paper is a package to easily collect the data about (noisy) processes and plot them for comparison. This package is not aiming at feature completeness. Instead it should give you an easy start during the phase of the project when you want to just concentrate on an experimental idea.
- rlcard: A toolkit for reinforcement learning in card games.
- soms: A package to check system-related metrics and values for simpler blackbox monitoring.

Continue Reading…

### If you did not already know

Regression Calibration (RC)
Medical studies that depend on electronic health records (EHR) data are often subject to measurement error, as the data are not collected to support the research questions under study. Methodology to address covariate measurement error has been well developed; however, time-to-event error has also been shown to cause significant bias, and methods to address it are relatively underdeveloped. More generally, it is possible to observe errors in both the covariate and the time-to-event outcome that are correlated. We propose regression calibration (RC) estimators to simultaneously address correlated error in the covariates and the censored event time. Although RC can perform well in many settings with covariate measurement error, it is biased for nonlinear regression models, such as the Cox model.
Thus, we additionally propose raking estimators, which are consistent estimators of the parameter defined by the population estimating equations, can improve upon RC in certain settings with failure-time data, require no explicit modeling of the error structure, and can be utilized under outcome-dependent sampling designs. We discuss features of the underlying estimation problem that affect the degree of improvement the raking estimator has over the RC approach. Detailed simulation studies are presented to examine the performance of the proposed estimators under varying levels of signal, error, and censoring. The methodology is illustrated on observational EHR data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic. …

Gated Path Planning Network
Value Iteration Networks (VINs) are effective differentiable path planning modules that can be used by agents to perform navigation while still maintaining end-to-end differentiability of the entire architecture. Despite their effectiveness, they suffer from several disadvantages including training instability, random seed sensitivity, and other optimization problems. In this work, we reframe VINs as recurrent-convolutional networks, which demonstrates that VINs couple recurrent convolutions with an unconventional max-pooling activation. From this perspective, we argue that standard gated recurrent update equations could potentially alleviate the optimization issues plaguing VIN. The resulting architecture, which we call the Gated Path Planning Network, is shown to empirically outperform VIN on a variety of metrics such as learning speed, hyperparameter sensitivity, iteration count, and even generalization. Furthermore, we show that this performance gap is consistent across different maze transition types and maze sizes, and even show success on a challenging 3D environment, where the planner is only provided with first-person RGB images.
…

Pool Adjacent Violators Algorithm (PAVA)
Pool Adjacent Violators Algorithm (PAVA) is a linear time (and linear memory) algorithm for linear ordering isotonic regression. “Isotonic Regression” http://…/deleeuw_hornik_mair_R_09.pdf

DeepSSM
Statistical shape modeling is an important tool to characterize variation in anatomical morphology. Typical shapes of interest are measured using 3D imaging and a subsequent pipeline of registration, segmentation, and some extraction of shape features or projections onto some lower-dimensional shape space, which facilitates subsequent statistical analysis. Many methods for constructing compact shape representations have been proposed, but are often impractical due to the sequence of image preprocessing operations, which involve significant parameter tuning, manual delineation, and/or quality control by the users. We propose DeepSSM: a deep learning approach to extract a low-dimensional shape representation directly from 3D images, requiring virtually no parameter tuning or user assistance. DeepSSM uses a convolutional neural network (CNN) that simultaneously localizes the biological structure of interest, establishes correspondences, and projects these points onto a low-dimensional shape representation in the form of PCA loadings within a point distribution model. To overcome the challenge of the limited availability of training images, we present a novel data augmentation procedure that uses existing correspondences on a relatively small set of processed images with shape statistics to create plausible training samples with known shape parameters. Hence, we leverage the limited CT/MRI scans (40-50) into thousands of images needed to train a CNN. After the training, the CNN automatically produces accurate low-dimensional shape representations for unseen images.
We validate DeepSSM for three different applications: modeling pediatric cranial CT for characterization of metopic craniosynostosis, femur CT scans identifying morphologic deformities of the hip due to femoroacetabular impingement, and left atrium MRI scans for atrial fibrillation recurrence prediction. …

Continue Reading…

### Magister Dixit

“The problems that can be solved with machine learning are often not solvable by other methods.” Zachary Chase Lipton (January 2015)

Continue Reading…

### Loading packages efficiently

[This article was first published on woodpeckR, and kindly contributed to R-bloggers]. (You can report an issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Problem

Especially in a project with many different scripts, it can be challenging to keep track of all the packages you need to load. It’s also easy to lose track of whether you’ve incorporated package loading into the script itself, until you switch to a new computer or restart R and all of a sudden your packages need to be re-loaded.

### Context

When I was first starting out in R, I quickly learned to load packages all together at the top of a script, not along the way as I needed them. But it took a while, until I started using R Projects, before I decided to centralize package loading above the script level. I was sick of having to deal with loading the right packages at the right times, so I decided to just streamline the whole thing.

### Solution

Make a separate R script, called “libraries.R” or “packages.R” or something. Keep it consistent. Mine is always called “libraries,” and I keep it in my project folder. It looks something like this (individual libraries may vary, of course). Then, at the top of each analysis script, I can simply source the libraries script, and all the libraries I need load automatically.
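In the original post the libraries script appears as a screenshot; a minimal sketch of what such a file can look like (the package names here are illustrative, not necessarily the author's actual list):

```r
# libraries.R -- the one place where every package for this project is loaded
library(tidyverse)   # data manipulation and plotting
library(lubridate)   # date handling
library(here)        # project-relative file paths
```

Each analysis script then starts with a single `source("libraries.R")` line instead of its own stack of `library()` calls.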
### Outcome

I can easily load libraries in the context of a single R Project, keep track of which ones are loaded, and not have to worry about making my scripts look messy with a whole chunk of library() commands at the top of each one. It’s also straightforward to pop open the “libraries” script whenever I want to add a new library or delete one.

To leave a comment for the author, please follow the link and comment on their blog: woodpeckR. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Continue Reading…

### Explaining Predictions: Boosted Trees Post-hoc Analysis (Xgboost)

[This article was first published on R on notast, and kindly contributed to R-bloggers].

# Recap

We’ve covered various approaches to explaining model predictions globally. Today we will learn about another model-specific post hoc analysis: understanding the workings of gradient boosting predictions. As in past posts, the Cleveland heart dataset and tidymodels principles will be used. Refer to the first post of this series for more details.

# Gradient Boosting

Besides random forest, introduced in a past post, another tree-based ensemble model is gradient boosting. In gradient boosting, a shallow, weak tree is first trained, and the next tree is trained on the errors of the first tree. The process continues with each new tree sequentially added to the ensemble, improving on the errors of the ensemble of preceding trees. Random forest, on the other hand, is an ensemble of deep, independent trees.
#library
library(tidyverse)
library(tidymodels)
theme_set(theme_minimal())

#import
heart<-read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col_names = F)

# Renaming var
colnames(heart)<- c("age", "sex", "rest_cp", "rest_bp", "chol", "fast_bloodsugar", "rest_ecg", "ex_maxHR", "ex_cp", "ex_STdepression_dur", "ex_STpeak", "coloured_vessels", "thalassemia", "heart_disease")

#elaborating cat var
##simple ifelse conversion
heart<-heart %>% mutate(sex= ifelse(sex=="1", "male", "female"),
fast_bloodsugar= ifelse(fast_bloodsugar=="1", ">120", "<120"),
ex_cp=ifelse(ex_cp=="1", "yes", "no"),
heart_disease=ifelse(heart_disease=="0", "no", "yes")) #remember to leave it as numeric for DALEX

## complex ifelse conversion using case_when
heart<-heart %>% mutate(
rest_cp=case_when(rest_cp== "1" ~ "typical", rest_cp=="2" ~ "atypical", rest_cp== "3" ~ "non-CP pain", rest_cp== "4" ~ "asymptomatic"),
rest_ecg=case_when(rest_ecg=="0" ~ "normal", rest_ecg=="1" ~ "ST-T abnorm", rest_ecg=="2" ~ "LV hyperthrophy"),
ex_STpeak=case_when(ex_STpeak=="1" ~ "up/norm", ex_STpeak== "2" ~ "flat", ex_STpeak== "3" ~ "down"),
thalassemia=case_when(thalassemia=="3.0" ~ "norm", thalassemia== "6.0" ~ "fixed", thalassemia== "7.0" ~ "reversable"))

# convert missing value "?" into NA
heart<-heart%>% mutate_if(is.character, funs(replace(., .=="?", NA)))

# convert char into factors
heart<-heart %>% mutate_if(is.character, as.factor)

#train/test set
set.seed(4595)
data_split <- initial_split(heart, prop=0.75, strata = "heart_disease")
heart_train <- training(data_split)
heart_test <- testing(data_split)

The gradient boosting package we’ll use is xgboost. xgboost only accepts numeric values, so one-hot encoding is normally required for categorical variables. However, I was still able to train an xgboost model without one-hot encoding when I used the parsnip interface.
# create recipe object
heart_recipe<-recipe(heart_disease ~., data= heart_train) %>% step_knnimpute(all_predictors())

# process the training set / prepare recipe (non-cv)
heart_prep <-heart_recipe %>% prep(training = heart_train, retain = TRUE)

No tuning was done; the hyperparameters are the default settings, made explicit here.

# boosted tree model
bt_model<-boost_tree(learn_rate=0.3, trees = 100, tree_depth= 6, min_n=1, sample_size=1, mode="classification") %>% set_engine("xgboost", verbose=2) %>% fit(heart_disease ~ ., data = juice(heart_prep))

# Feature Importance (global level)

The resulting gradient boosting model bt_model$fit, represented as a parsnip object, does not inherently contain feature importance, unlike a random forest model represented as a parsnip object.

summary(bt_model$fit)
##               Length Class              Mode
## handle             1 xgb.Booster.handle externalptr
## raw            66756 -none-             raw
## niter              1 -none-             numeric
## call               7 -none-             call
## params             9 -none-             list
## callbacks          1 -none-             list
## feature_names     20 -none-             character
## nfeatures          1 -none-             numeric

We can extract the important features from the boosted tree model with xgboost::xgb.importance. Although we did the pre-processing and modelling using tidymodels, we end up using the original xgboost package to explain the model. Perhaps tidymodels could consider integrating prediction explanation for more of the models they support in the future.

library(xgboost)
xgb.importance(model=bt_model$fit) %>% head()
##                Feature       Gain      Cover  Frequency
## 1:     thalassemianorm 0.24124439 0.05772889 0.01966717
## 2: ex_STdepression_dur 0.17320374 0.15985018 0.15279879
## 3:            ex_maxHR 0.10147873 0.12927719 0.13615734
## 4:                 age 0.07165646 0.09136876 0.12859304
## 5:                chol 0.06522754 0.10151576 0.15431165
## 6:             rest_bp 0.06149660 0.09178222 0.11497731

## Variable importance score

Feature importance is computed using three different importance scores.

1. Gain: Gain is the relative contribution of the corresponding feature to the model calculated by taking each feature’s contribution for each tree in the model. A higher score suggests the feature is more important in the boosted tree’s prediction.
2. Cover: Cover is the relative number of observations associated with a predictor. For example, if feature X is used to determine the terminal node for 10 observations in tree A and 20 observations in tree B, the absolute count for feature X is 30, and its Cover is 30 divided by the sum of the absolute counts for all features.
3. Frequency: Frequency is the relative frequency with which a variable occurs in the ensemble of trees. For instance, if feature X occurs in 1 split in tree A and 2 splits in tree B, its absolute occurrence is 3, and its (relative) Frequency is 3 divided by the sum of the absolute occurrences for all features.

Categorical variables, especially those with low cardinality, will have low Frequency scores, as these variables are seldom used in each tree. Continuous variables (and, to some extent, high-cardinality categorical variables) have a larger range of values, which increases the odds of occurring in the model. Thus, the developers of xgboost discourage using the Frequency score unless you are clear about your rationale for selecting it as the feature importance score; rather, the Gain score is the most valuable for determining variable importance. xgb.importance selects the Gain score as the default measurement and arranges features in descending order of Gain, so the most important feature is displayed at the top.
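All three scores are normalized so that each column of the importance table sums to 1 across features; a quick sanity check on a toy model (using the agaricus mushroom data bundled with xgboost, not the heart model above):

```r
library(xgboost)

# Tiny throwaway model, just to inspect the structure of the importance table
data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               nrounds = 5, objective = "binary:logistic", verbose = 0)

imp <- xgb.importance(model = bst)
colSums(imp[, c("Gain", "Cover", "Frequency")])  # each column sums to 1
```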

## Plotting variable importance

xgboost provides two options to plot variable importance.

1. Using basic R graphics via xgb.plot.importance
2. Using ggplot interface via xgb.ggplot.importance. I’ll be using the latter.

The xgb.ggplot.importance function uses the gain measurement by default to calculate variable importance. The default argument measure=NULL can be changed to use other variable importance measurements; however, based on the previous section, it is wiser to leave the argument at its default.
The xgb.ggplot.importance graph also displays clusters of variables with similar variable importance scores. By default, the graph displays each variable's gain score as a relative contribution to the overall model importance, so the gain scores sum to 1.

xgb.importance(model=bt_model$fit) %>%
xgb.ggplot.importance(top_n=6, measure=NULL, rel_to_first = F)

Alternatively, the gain scores can be presented relative to the most important feature. In this case, the most important feature has a score of 1, and the gain scores of the other variables are scaled to it. This alternate presentation is achieved by changing the default argument rel_to_first=F to rel_to_first=T.

xgb.importance(model=bt_model$fit) %>%
xgb.ggplot.importance(top_n=6, measure=NULL, rel_to_first = T)

# Sum up

This is the last post of this series on explaining model predictions at a global level. We started the series by explaining predictions using white box models, such as logistic regression and decision trees. Next, we did model-specific post hoc evaluation of black box models, specifically random forest and Xgboost.


### Help support GetDFPData


The shiny version of GetDFPData is currently hosted on a private server at DigitalOcean. A problem with the basic (5 USD) server I was using was the low amount of available memory (RAM and HD). Because of that, I had to limit all xlsx queries for the data, otherwise the shiny app would run out of memory. After upgrading R on the server, the xlsx option was no longer working.

Today I tried every trick in the book to keep the 5 USD server and get the code to work. Nothing worked effectively. Microsoft Excel is a very restrictive format, and you should only use it for small projects. If the volume of data is high, as in GetDFPData, you're going to run into a lot of issues with cell sizes and memory allocation. Despite my explicit recommendation to avoid the Excel format as much as possible, people still use it a lot. Not surprisingly, once I took the "xlsx" option off the shiny interface, people complained to my email, a lot.
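As a rough illustration of why csv scales better here (package choices are mine, not necessarily what GetDFPData uses): csv is plain text and can be streamed row by row, while xlsx is a zipped XML format with hard limits of 1,048,576 rows per sheet and 32,767 characters per cell, and building it requires far more memory.

```r
library(readr)    # write_csv: plain text, no row or cell-size ceiling
library(writexl)  # write_xlsx: bound by Excel's sheet limits

big_df <- data.frame(id = seq_len(2e6), value = rnorm(2e6))

write_csv(big_df, "big.csv")      # fine: rows are streamed to disk
# write_xlsx(big_df, "big.xlsx")  # 2e6 rows exceed the xlsx sheet limit
```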

I just upgraded the RAM and HD of the server at DigitalOcean. The xlsx option is back and working. The new bill is 10 USD per month. So far I've been paying the bill from my own pocket, using revenue from my books. GetDFPData has no official financial support and, yes, I'll continue to finance it as much as I can. But support from those using the shiny interface or the CRAN package is very much welcomed and will motivate further development to keep things running smoothly.

If you can, please help by donating a small amount to keep the server financed. Once I reach 12 months of paid bills (around 120 USD), I'll remove the PayPal donation button and only add it back after the cash runs out.


### Opioid prescribing habits in Texas


A paper I worked on was just published in a medical journal. This is quite an odd thing for me to be able to say, given my academic background and the career path I have had, but there you go! The first author of this paper is a long-time friend of mine working in anesthesiology and pain management, and he obtained data from the Texas Prescription Drug Monitoring Program (PDMP) about controlled substance prescriptions from April 2015 to 2018. The DEA also provides data about controlled substance transactions between manufacturers and distributors (available in R), but PDMP data is somewhat different, as it monitors prescriptions directly, down to the individual prescriber level. Each state maintains a separate PDMP, and access is often limited to licensed providers in that state. My coauthor/friend is, among other things, a licensed provider in Texas and was able to obtain this data for research purposes!

## Clean and tidy the data

The first step in this analysis was to read in, clean, and tidy the PDMP data. This is a dataset of prescriptions for controlled substances, aggregated at the county and month level for us by the state agency; we requested data at two separate times and received data in two different formats. First, we have an Excel file.

library(tidyverse)
library(lubridate)
library(readxl)

path <- "CountyDrugPillQty_2017_07.xlsx"

opioids_raw <- path %>%
excel_sheets() %>%
set_names() %>%
map_df(~ read_excel(path = path, sheet = .x), .id = "sheet") %>%
mutate(Date = dmy(str_c("01-", sheet))) %>%
select(-sheet) %>%
rename(Name = `Generic Name`)

Then we have a second batch of data in Google Sheets.

library(googlesheets)

new_opioids_sheet <- gs_title("TX CS Qty By Drug Name-County")

new_opioids_raw <- new_opioids_sheet %>%
gs_read("TX CS RX By Generic Name-County",
col_types = "cnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn",
skip = 4,
verbose = FALSE) %>%
rename(Name = `Date/Month Filter`)  %>%
mutate(Date = case_when(str_detect(Name,
"^[a-zA-Z]{3}-[0-9]{2}\$") ~ Name,
TRUE ~ NA_character_)) %>%
fill(Date, .direction = "down") %>%
select(-`Grand Total`) %>%
filter(Name != Date) %>%
mutate(Date = dmy(str_c("01-", Date))) %>%
select(Name, Date, everything())

We have overlapping measurements for the same drugs and counties from February to June of 2017. Most measurements were close, but the new data is modestly higher in prescription quantity, telling us something about data quality and how this data is collected. When we have it available, we use the newer values. My coauthor/friend placed the individual drugs into larger categories so that we can look at groupings between the individual drug level and the schedule level. Using all that, finally, we have a tidy dataset of prescriptions per county per month.

categories_sheet <- gs_title("Drug categories")

drug_categories <- categories_sheet %>%
rename(Name = `Generic Name`) %>%
bind_rows(categories_sheet %>%
rename(Name = `Generic Name`)) %>%
filter(Schedule %in% c("II", "III", "IV", "V"))

opioids_tidy <- opioids_raw %>%
gather(County, PillsOld, ANDERSON:ZAVALA) %>%
full_join(new_opioids_raw %>%
gather(County, PillsNew, ANDERSON:ZAVALA),
by = c("Name", "Date", "County")) %>%
mutate(Pills = coalesce(PillsNew, PillsOld),
Pills = ifelse(Pills > 1e10, NA, Pills)) %>%
replace_na(replace = list(Pills = 0)) %>%
mutate(County = str_to_title(County)) %>%
select(-PillsNew, -PillsOld) %>%
left_join(drug_categories, by = "Name") %>%
select(County, Date, Name, Category, Schedule, Pills) %>%
filter(Name != "Unspecified",
!is.na(Schedule)) %>%
filter(Date < "2018-05-01")

opioids_tidy
## # A tibble: 1,234,622 x 6
##    County  Date       Name                        Category   Schedule Pills
##    <chr>   <date>     <chr>                       <chr>      <chr>    <dbl>
##  1 Anders… 2015-04-01 ACETAMINOPHEN WITH CODEINE… Opioid     III      37950
##  2 Anders… 2015-04-01 ACETAMINOPHEN/CAFFEINE/DIH… Opioid     III        380
##  3 Anders… 2015-04-01 ALPRAZOLAM                  Benzodiaz… IV       52914
##  4 Anders… 2015-04-01 AMITRIPTYLINE HCL/CHLORDIA… Benzodiaz… IV         180
##  5 Anders… 2015-04-01 AMPHETAMINE SULFATE         Amphetami… IV          60
##  6 Anders… 2015-04-01 ARMODAFINIL                 Stimulant  IV         824
##  7 Anders… 2015-04-01 ASPIRIN/CAFFEINE/DIHYDROCO… Opioid     III          0
##  8 Anders… 2015-04-01 BENZPHETAMINE HCL           Amphetami… III         30
##  9 Anders… 2015-04-01 BROMPHENIRAMINE MALEATE/PH… Opioid     V            0
## 10 Anders… 2015-04-01 BROMPHENIRAMINE MALEATE/PS… Opioid     III          0
## # … with 1,234,612 more rows

In this step, we removed the very small number of prescriptions that were missing drug and schedule information (“unspecified”). Now it’s ready to go!

## Changing prescribing habits

The number of pills prescribed per month is changing at about -0.00751% each month, or about -0.0901% each year. This is lower (negative, even) than the rate of Texas’ population growth, estimated by the US Census Bureau at about 1.4% annually. Given what we find out further below about the racial/ethnic implications of population level opioid use in Texas and what groups are driving population growth in Texas, this likely makes sense.
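The monthly and yearly figures are consistent with simple compounding of the monthly rate:

```r
monthly <- -0.00751 / 100         # -0.00751% per month, as a proportion
annual  <- (1 + monthly)^12 - 1   # compounded over twelve months

round(100 * annual, 4)            # about -0.0901% per year
```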

opioids_tidy %>%
count(Schedule, Date, wt = Pills) %>%
mutate(Schedule = factor(Schedule, levels = c("II", "III", "IV", "V",
"Unspecified"))) %>%
ggplot(aes(Date, n, color = Schedule)) +
geom_line(alpha = 0.8, size = 1.5) +
expand_limits(y = 0) +
labs(x = NULL, y = "Pills prescribed per month",
title = "Controlled substance prescriptions by schedule",
subtitle = "Schedule IV drugs account for the most doses, with Schedule II close behind")

We can also fit models to find which individual drugs are increasing or decreasing. The most commonly prescribed drugs that exhibited significant change in prescribing volume are amphetamines (increasing) and barbiturates (decreasing).
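The post does not show that modeling step; a common tidyverse pattern for it (a sketch, not necessarily the authors' exact approach) is to nest the tidy data by drug and fit one linear trend per drug:

```r
library(tidyverse)
library(broom)

# One linear model of monthly volume against time per drug,
# keeping only drugs with a statistically significant slope
drug_trends <- opioids_tidy %>%
  group_by(Name, Date) %>%
  summarise(Pills = sum(Pills), .groups = "drop") %>%
  nest(data = -Name) %>%
  mutate(model  = map(data, ~ lm(Pills ~ Date, data = .x)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  filter(term == "Date", p.value < 0.05) %>%
  arrange(desc(estimate))  # positive slopes increasing, negative decreasing
```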

## Connecting to Census data

When I started to explore how this prescription data varied spatially, I knew I wanted to connect this PDMP dataset to Census data. My favorite way to use Census data from R is the tidycensus package. Texas is an interesting place. It's not only where I grew up (and where my coauthor and friend lives), but the second largest state in the United States by both land area and population. It contains 3 of the top 10 largest cities in the United States, yet also 3 of the 4 least densely populated counties. It is also the seventh most ethnically diverse state, with a substantially higher Hispanic population than the United States as a whole but similar proportions of white and black residents. We can download Census data to explore these issues.

library(tidycensus)

population <- get_acs(geography = "county",
variables = "B01003_001",
state = "TX",
geometry = TRUE) 
household_income <- get_acs(geography = "county",
variables = "B19013_001",
state = "TX",
geometry = TRUE)

To look at geographical patterns, we will take the median number of pills prescribed per month for each county during the time we have data for.

opioids_joined <- opioids_tidy %>%
group_by(County, Date) %>%
summarise(Pills = sum(Pills)) %>%
ungroup %>%
mutate(Date = case_when(Date > "2017-01-01" ~ "2017 and later",
TRUE ~ "Before 2017")) %>%
group_by(County, Date) %>%
summarise(Pills = median(Pills)) %>%
ungroup %>%
mutate(County = str_to_lower(str_c(County, " County, Texas")),
County = ifelse(County == "de witt county, texas",
"dewitt county, texas", County)) %>%
inner_join(population %>% mutate(County = str_to_lower(NAME)), by = "County") %>%
mutate(OpioidRate = Pills / estimate)

What are the controlled substance prescription rates in the top 10 most populous Texas counties?

library(knitr)

opioids_joined %>%
filter(Date == "2017 and later") %>%
top_n(10, estimate) %>%
arrange(desc(estimate)) %>%
select(NAME, OpioidRate) %>%
kable(col.names = c("County", "Median monthly pills per capita"), digits = 2)
| County | Median monthly pills per capita |
|:--|--:|
| Harris County, Texas | 5.68 |
| Dallas County, Texas | 6.20 |
| Tarrant County, Texas | 7.74 |
| Bexar County, Texas | 7.41 |
| Travis County, Texas | 6.40 |
| Collin County, Texas | 7.02 |
| Hidalgo County, Texas | 3.31 |
| El Paso County, Texas | 4.43 |
| Denton County, Texas | 7.58 |
| Fort Bend County, Texas | 5.17 |

These rates vary a lot; the controlled substance prescription rate in Tarrant County is almost 40% higher than the rate in Harris County.

library(sf)
library(viridis)

opioids_map <- opioids_joined %>%
mutate(OpioidRate = ifelse(OpioidRate > 16, 16, OpioidRate))

opioids_map %>%
mutate(Date = factor(Date, levels = c("Before 2017", "2017 and later"))) %>%
st_as_sf() %>%
ggplot(aes(fill = OpioidRate, color = OpioidRate)) +
geom_sf() +
coord_sf() +
facet_wrap(~Date) +
scale_fill_viridis(labels = scales::comma_format()) +
scale_color_viridis(guide = FALSE) +
labs(fill = "Monthly pills\nper capita",
title = "Controlled substance prescriptions across Texas",
subtitle = "The prescription rate was higher overall before 2017")

This strong geographic trend is one of the most interesting results from this analysis. There are low rates in the Rio Grande Valley and high rates in north and east Texas. When I saw that pattern, I knew we needed to look into how race/ethnicity was related to these controlled prescription rates. Also, notice the change over time as these rates have decreased.

We don’t see a direct or obvious relationship with household income, but, as the maps hint at, race is another matter.

race_vars <- c("P005003", "P005004", "P005006", "P004003")

texas_race <- get_decennial(geography = "county",
variables = race_vars,
year = 2010,
summary_var = "P001001",
state = "TX")

race_joined <- texas_race %>%
mutate(PercentPopulation = value / summary_value,
variable = fct_recode(variable,
White = "P005003",
Black = "P005004",
Asian = "P005006",
Hispanic = "P004003")) %>%
inner_join(opioids_joined %>%
filter(OpioidRate < 20) %>%
group_by(GEOID, Date) %>%
summarise(OpioidRate = median(OpioidRate)))

race_joined %>%
group_by(NAME, variable, GEOID) %>%
summarise(Population = median(summary_value),
OpioidRate = median(OpioidRate),
PercentPopulation = median(PercentPopulation)) %>%
ggplot(aes(PercentPopulation, OpioidRate,
size = Population, color = variable)) +
geom_point(alpha = 0.4) +
facet_wrap(~variable) +
scale_x_continuous(labels = scales::percent_format()) +
scale_y_continuous(labels = scales::comma_format()) +
scale_color_discrete(guide = FALSE) +
labs(x = "% of county population in that racial/ethnic group",
y = "Median monthly pills prescribed per capita",
title = "Race and controlled substance prescriptions",
subtitle = "The more white a county is, the higher the median monthly pills prescribed there",
size = "County\npopulation")

The more white a county is, the higher the rate of controlled substance prescription there. The more Hispanic a county is, the lower the rate of controlled substance prescription there. Effects with Black and Asian race are not clear in Texas.

## Building a model

We used straightforward multiple linear regression to understand how prescription rates are associated with various factors. We fit a single model to all the counties to understand how their characteristics affect the opioid prescription rate. We explored including and excluding the various relevant predictors to build the best explanatory model that can account for the relationships that exist in this integrated PDMP and US Census Bureau dataset.

This was the first time I had used the huxtable package for a publication, and it was so convenient!

library(huxtable)

opioids <- race_joined %>%
select(GEOID, OpioidRate, TotalPop = summary_value,
variable, PercentPopulation, Date) %>%
left_join(household_income %>%
select(GEOID, Income = estimate)) %>%
select(-geometry, -GEOID) %>%
mutate(Income = Income / 1e5,
OpioidRate = OpioidRate,
Date = factor(Date, levels = c("Before 2017", "2017 and later")),
Date = fct_recode(Date, `2017 and later` = "2017 and later"))

lm1 <- lm(OpioidRate ~ Income + White, data = opioids)
lm2 <- lm(OpioidRate ~ Income + White + Date, data = opioids)
lm3 <- lm(OpioidRate ~ Income + Date + log(TotalPop), data = opioids)
lm4 <- lm(OpioidRate ~ Income + White + Date + log(TotalPop), data = opioids)

huxreg(lm1, lm2, lm3, lm4)
|                    | (1)       | (2)        | (3)        | (4)        |
|:-------------------|:----------|:-----------|:-----------|:-----------|
| (Intercept)        | 5.640 *** | 6.468 ***  | 8.394 ***  | 3.668 ***  |
|                    | (0.524)   | (0.508)    | (0.847)    | (0.785)    |
| Income             | -3.209 ** | -3.229 *** | -0.239     | -4.432 *** |
|                    | (0.973)   | (0.922)    | (1.063)    | (0.941)    |
| White              | 7.120 *** | 7.134 ***  |            | 7.718 ***  |
|                    | (0.560)   | (0.531)    |            | (0.536)    |
| Date 2017 and later |          | -1.650 *** | -1.640 *** | -1.649 *** |
|                    |           | (0.216)    | (0.251)    | (0.211)    |
| log(TotalPop)      |           |            | 0.081      | 0.309 ***  |
|                    |           |            | (0.077)    | (0.067)    |
| N                  | 507       | 507        | 507        | 507        |
| R2                 | 0.243     | 0.322      | 0.080      | 0.349      |
| logLik             | -1194.782 | -1166.829  | -1244.045  | -1156.282  |
| AIC                | 2397.563  | 2343.658   | 2498.091   | 2324.564   |

*** p < 0.001; ** p < 0.01; * p < 0.05.

Model metrics such as AIC and log likelihood indicate that the model including income, percent white population, date, and total population on a log scale provides the most explanatory power for the opioid rate. Using the proportion of the population that is Hispanic gives a model that is about as good; these predictors are basically interchangeable but opposite in effect. Overall, the $$R^2$$ of these models is not extremely high (the best model has an adjusted $$R^2$$ of 0.359) because these models estimate population level characteristics, and there is significant county-to-county variation that is not explained by these four predictors alone. The population level trends are statistically significant, with effect sizes at the levels shown here.
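Those fit statistics can be pulled programmatically with broom; the same pattern applies to lm1 through lm4 above (shown here on a self-contained toy model):

```r
library(broom)

# glance() returns one row of model-level statistics per fitted model
lm_toy <- lm(mpg ~ wt + hp, data = mtcars)
glance(lm_toy)[, c("adj.r.squared", "logLik", "AIC")]
```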

We can more directly explore the factors involved in this explanatory model (income, ethnicity, time) visually.

race_joined %>%
filter(variable == "White") %>%
left_join(household_income %>%
as.data.frame() %>%
select(GEOID, Income = estimate)) %>%
filter(!is.na(Income)) %>%
mutate(Income = ifelse(Income <= median(Income, na.rm = TRUE),
"Low income", "High income"),
PercentPopulation = cut_width(PercentPopulation, 0.1)) %>%
group_by(PercentPopulation, Income, Date) %>%
summarise(OpioidRate = median(OpioidRate)) %>%
mutate(Date = factor(Date, levels = c("Before 2017", "2017 and later"))) %>%
ggplot(aes(PercentPopulation, OpioidRate, color = Income, group = Income)) +
geom_line(size = 1.5, alpha = 0.8) +
geom_smooth(method = "lm", lty = 2, se = FALSE) +
scale_y_continuous(labels = scales::comma_format(),
limits = c(0, NA)) +
scale_x_discrete(labels = paste0(seq(0, 0.9, by = 0.1) * 100, "%")) +
theme(legend.position = "top") +
facet_wrap(~Date) +
labs(x = "% of county population that is white",
y = "Median monthly pills prescribed per 1k population",
color = NULL,
title = "White population, income, and controlled substance usage",
subtitle = "Before 2017, the more white a county was, the more low income was associated with more controlled substance usage")

This plot illustrates the relationship between white population percentage and income, and how that has changed with time. The difference in controlled substance usage between lower and higher income counties (above and below the median in Texas) changes along the spectrum of counties’ population that is white.

The first effect to notice here is that the more white a county is, the higher the rate of controlled substance prescriptions. This was true both before 2017 and for 2017 and later, and for both low-income and high-income groups of counties. The second effect, though, is to compare the slopes of the two lines. Before 2017, the slope was shallower for higher income counties (above the median in Texas), but in lower income counties (below the median in Texas), the slope was steeper, i.e., the increase in prescription rate with white percentage was more dramatic. For 2017 and later, there is no longer a difference between low-income and high-income counties, although the trend with white population percentage remains.

What have we learned here? In the discussion of our paper, we focus on the difference or disparity in opioid prescription rates with race/ethnicity, and how that may be related to the subjective nature of the evaluation of pain by medical practitioners. A racial/ethnic difference in opioid prescribing rate has been found in other studies using alternative data sources. We can understand the differences in how media, the healthcare system, and the culture at large have portrayed the opioid epidemic compared to previous drug epidemics (such as those of the 1980s) due to what populations are impacted.

If you want to read more about this new analysis and related work, check out the paper. You can also look at the GitHub repo where I have various bits of code for this analysis, which is now public.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Drawing the line between anti-Semitism and criticism of Israel

Polls show the two are correlated, but the strength of the link depends on ideology

## October 11, 2019

### R Packages worth a look

Matching Multiply Imputed Datasets (MatchThem)
Provides the necessary tools for the pre-processing technique of matching in multiply imputed datasets, to improve the robustness and transparency of deriving causal inferences from studying these datasets. This package includes functions to perform propensity score matching within or across the imputed datasets as well as to estimate weights (including inverse propensity score weights) of observations, to analyze each matched or weighted dataset using parametric or non-parametric statistical models, and to combine the obtained results from these models according to Rubin’s rules. Please see the package repository <https://…/MatchThem> for details.

Recursive Partitioning Based Multivariate Adaptive Regression Models, Classification Trees, Survival Trees (macs)
Implements recursive partitioning based, nonparametric methods for high dimensional regression and classification. Depending on the aims of data analysis as well as the structures of the data, macs provides three major functions: multivariate adaptive regression models, classification trees and survival trees. References for this package include Zhang, H. (1997) <doi:10.1080/10618600.1997.10474728>, Zhang, H. et al. (1999) <ISBN:978-1-4757-3027-2>, and Zhang, H. et al. (2014) <doi:10.1002/gepi.21843>.

Logistic Regression Trees (glmtree)
A logistic regression tree is a decision tree with logistic regressions at its leaves. A particular stochastic expectation maximization algorithm is used to draw a few good trees, that are then assessed via the user’s criterion of choice among BIC / AIC / test set Gini. The formal development is given in a PhD chapter, see Ehrhardt (2019) <https://…/>.

Forecast Verification for Extreme Events (extremeIndex)
An index measuring the amount of information brought by forecasts for extreme events, subject to calibration, is computed. This index is originally designed for weather or climate forecasts, but it may be used in other forecasting contexts. This is the implementation of the index in Taillardat et al. (2019) <arXiv:1905.04022>.

A Collection of Useful Functions by John (usefun)
A set of general functions that I have used in various projects and in other R packages. They support some miscellaneous operations on data frames, matrices and vectors: adding a row on a ternary (3-value) data.frame based on positive and negative vector-indicators, rearranging a list of data.frames by rownames, pruning rows or columns of a data.frame that contain only one specific value given by the user, checking for matrix equality, pruning and reordering a vector according to the common elements between its names and elements of another given vector, finding the non-common elements between two vectors (outer-section), normalization of a vector, matrix or data.frame’s numeric values in a specified range, pretty printing of vector names and values in an R notebook (common names and values between two vectors also supported), retrieving the parent directory of any string path, checking whether a numeric value is inside a given interval, trim the decimal points of a given numeric value, quick saving of data to a file, making a multiple densities plot and a color bar plot and executing a plot string expression while generating the result to the specified file format.

### Document worth reading: “Human Action Recognition and Prediction: A Survey”

Derived from rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks: action recognition infers human actions (present state) from complete action executions, while action prediction predicts human actions (future state) from incomplete action executions. These two tasks have become particularly prevalent topics recently because of their explosively emerging real-world applications, such as visual surveillance, autonomous driving vehicles, entertainment, and video retrieval. Many attempts have been made over the last few decades to build a robust and effective framework for action recognition and prediction. In this paper, we survey the state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also provided with systematic discussions. Human Action Recognition and Prediction: A Survey

### if ifelse() had more if’s


### Problem

The ifelse() function only allows for one “if” statement, two cases. You could add nested “if” statements, but that’s just a pain, especially if the 3+ conditions you want to use are all on the same level, conceptually. Is there a way to specify multiple conditions at the same time?

### Context

I was recently given some survey data to clean up. It looked something like this (but obviously much larger):

I needed to classify people in this data set based on whether they had passed or failed certain tests.

I wanted to separate the people into three groups:

• People who passed both tests: Group A
• People who passed one test: Group B
• People who passed neither test: Group C

I thought about using a nested ifelse statement, and I certainly could have done that. But that approach didn’t make sense to me. The tests are equivalent and not given in any order; I simply want to sort the people into three equal groups. Any nesting of “if” statements would seem to imply a hierarchy that doesn’t really exist in the data. Not to mention that I hate nesting functions. It’s confusing and hard to read.
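For contrast, here is a minimal sketch of what that nested approach could look like, using a made-up data frame with hypothetical logical columns test1 and test2 (TRUE = passed):

```r
# Hypothetical survey data: one logical column per test (TRUE = passed)
df <- data.frame(test1 = c(TRUE, TRUE, FALSE, FALSE),
                 test2 = c(TRUE, FALSE, TRUE, FALSE))

# The nested-ifelse version: it works, but the nesting suggests an
# ordering of the tests that doesn't exist in the data
df$group <- ifelse(df$test1 & df$test2, "A",
                   ifelse(xor(df$test1, df$test2), "B", "C"))

df$group
# "A" "B" "B" "C"
```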

### Solution

Once again, dplyr to the rescue! I’m becoming more and more of a tidyverse fan with each passing day.

Turns out, dplyr has a function for exactly this purpose: case_when(). It’s also known as “a general vectorised if,” but I like to think of it as “if ifelse() had more if’s.”

Here’s the syntax:

library(dplyr)

df <- df %>%
  mutate(group = case_when(
    test1 & test2 ~ "A",      # both tests: group A
    xor(test1, test2) ~ "B",  # one test: group B
    !test1 & !test2 ~ "C"     # neither test: group C
  ))

Output:

Let me translate the above into English. After loading the package, I reassign df, the name of my data frame, to a modified version of the old df. Then (%>%), I use the mutate function to add a new column called group. The contents of the column will be defined by the case_when() function.

case_when(), in this example, took three conditions, which I’ve lined up so you can read them more easily. The condition is on the left side of the ~, and the resulting category (A, B, or C) is on the right. I used logical operators for my conditions. The newest one to me was the xor() function, an exclusive or: it returns TRUE when exactly one of its two arguments is TRUE, not both.
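A quick console check makes xor()’s behaviour concrete:

```r
# xor() returns TRUE exactly when one argument is TRUE and the other FALSE
xor(TRUE,  TRUE)    # FALSE -- passed both tests
xor(TRUE,  FALSE)   # TRUE  -- passed exactly one
xor(FALSE, TRUE)    # TRUE  -- passed exactly one
xor(FALSE, FALSE)   # FALSE -- passed neither
```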

### Outcome

Easily make conditional assignments within a data frame. This function is a little less succinct than ifelse(), so I’m probably not going to use it for applications with only two cases, where ifelse() would work fine. But for three or more cases, it can’t be beat. Notice that I could have added any number of conditions to my case_when() statement, with no other caveats.

I love this function, and I think we should all be using it.


### Computer Vision for Global Challenges research award winners

The post Computer Vision for Global Challenges research award winners appeared first on Facebook Research.

### Distilled News

Take the crash course in the ‘whys’ and ‘whens’ of using Deep Learning in Time Series Analysis.
With this article, I will start a series of short tutorials on PySpark, from data pre-processing to modeling. The first will deal with the import and export of any type of data: CSV, text file, Avro, JSON, etc. I work on a virtual machine on Google Cloud Platform, and the data comes from a bucket on Cloud Storage. Let’s import it.
In Part I, I talked about basics and challenges in managing ML products. Developing ML products involves more experiments and iterations. Therefore, as a PM, you need to give engineers and scientists enough space and flexibility to explore before deciding on the path going forward. But how do you help your team navigate these uncertainties? How do you go about defining the business problems and metrics while allowing your team to explore?
Google’s TensorFlow team announced its newest and quite easy-to-use version earlier this year. For people who have used any of TensorFlow 1.XX, this version is less ugly, less creepy and more user-friendly (technical updates and changes are discussed below). I will be writing a 3-part series, and this serves as the beginning and the shortest one. The themes of these parts are as follows:
• Introducing TF 2.0
• Image classification in TensorFlow 2.0
• Transfer learning in TensorFlow 2.0
You keep hearing about Artificial Intelligence (AI). Nvidia’s CEO says, ‘AI is going to eat software.’ Mark Cuban exclaims, ‘Artificial Intelligence, deep learning, machine learning – whatever you’re doing if you don’t understand it – learn it. Because otherwise, you’re going to be a dinosaur within 3 years.’ And yet, at the same time, you read articles about companies failing to execute with AI. Some even claim that 85% of efforts fail. Being part of a startup, you have neither the capital nor the time for a potentially expensive failed AI experiment, but you also can’t afford to be left behind. So where do you start?
Open environments and competition among species are two driving forces of Darwinian evolution that are largely absent in recent works on evolutionary approaches toward AI models. Within a given generation, faster impalas and faster cheetahs are more likely to survive (and reproduce) than their slower counterparts – leading to the evolution of faster and faster impalas and cheetahs over time. Can these and other principles of genetics and natural selection guide us towards major advances in AI?
Continual Learning is the ability to learn continually from a stream of data, building on what was learned previously and being able to remember those learnt tasks. This is what humans are capable of, and it is also the end goal for an artificially intelligent machine. In our brains, the neo-cortex relies on synaptic consolidation and knowledge of the previous task encoded in synapses; these are immutable and therefore stable over long periods of time. So I figured let’s talk about how to guide intelligent machines to not forget!
Q-learning is one of the most popular Reinforcement learning algorithms and lends itself much more readily for learning through implementation of toy problems as opposed to scouting through loads of papers and articles. This is a simple introduction to the concept using a Q-learning table implementation. I will set up the context of what we are doing, establish a toy game to experiment with, define the Q-learning algorithm, provide a 101 implementation and explore the concepts – all in a hopefully short post that anyone can follow.
This article will go through what Anaconda is, what Miniconda is and what Conda is, why you should know about them if you’re a data scientist or machine learning engineer, and how you can use them. Your computer is capable of running many different programs and applications. However, when you want to create or write your own, such as building a machine learning project, it’s important to set your computer up in the right way.
This topic was chosen randomly while I was exploring the dengue trend in Singapore. There has been a recent spike in dengue cases, especially in the Dengue Red Zone where I am living. However, I was unable to scrape raw data from the NEA website. I was wondering, has dengue affected the life expectancy of people in any country?
The case for cross validation and how to implement one version (k-fold). Using a single train-test split gives you a single snapshot of the performance of a machine learning model or algorithm. It’s like evaluating a football team (American or otherwise) based on a single game. If you really want to know how well a team performs in general, you’re going to want more than just this snapshot. Likewise, we shouldn’t evaluate our algorithms on a single random split. Luckily, there’s a better way. I bring you k-fold cross validation.
A large amount of data is stored in the form of time series: stock indices, climate measurements, medical tests, etc. Time series classification has a wide range of applications: from identification of stock market anomalies to automated detection of heart and brain diseases. There are many methods for time series classification. Most of them consist of two major stages: in the first stage you either use some algorithm for measuring the difference between time series that you want to classify (dynamic time warping is a well-known one) or you use whatever tools are at your disposal (simple statistics, advanced mathematical methods, etc.) to represent your time series as feature vectors. In the second stage you use some algorithm to classify your data. It can be anything from k-nearest neighbors and SVMs to deep neural network models. But one thing unites these methods: they all require some kind of feature engineering as a separate stage before classification is performed.
How to avoid incorrect predictions by softmax. Softmax is by far the most common activation function used. It is usually used in classification tasks, in the output layer of the neural network. What softmax actually does is turn the output (logits) into probabilities. Now, before we jump into the main problem with softmax, let’s first discuss how it actually works and also understand what logits are! Logits: these are the raw scores predicted by the last layer of the neural network, the values that we get before any activation function is applied to them.
Forecast Policy Conversions in an Imbalanced Scenario. In anomaly detection, one of the most tedious problems is dealing with imbalance. Our role as data scientists is, at the first stage, to detect the patterns responsible for abnormal behaviors, and secondly, to develop ad hoc ML models that overcome class imbalance and try to return the best results. What we do in this post is take advantage of imbalance. Through the building of a clever pipeline and some tricks, we’ll be able to boost the performance of our ML model. In detail, we analyze a problem of conversion forecasting, but our approach is not specialized to this task alone. Similar problems of imbalanced classification include fraud detection, churn prediction, remaining-life estimation, toxicity detection, and so on.
Never stop learning new things… Artificial Intelligence is a journey, not a destination. This means only one thing: you need to be prepared for constant learning. Is it a tough path? With the abundance of abstract terms and an almost infinite number of details, the AI and ML learning curve can indeed be steep for many. But getting started with anything new is hard, isn’t it? Moreover, I believe everyone can learn it if only there is a strong desire. Besides, there is an effective approach that will facilitate your learning: you don’t need to rush, just start with small moves. Imagine a picture of everything you have learned. Every day you should add new elements to this picture, making it bigger and more detailed. Today you can make your picture even bigger thanks to the many tools out there that allow anyone to get started learning Machine Learning. No excuses! And you don’t have to be an AI wizard or mathematician. You just need to learn how to teach machines that work in ones and zeros to reach their own conclusions about the world. You’re teaching them how to think! Wanna learn how to do so? Here are the best books, courses, and more that will help you do it more effectively without being confused.
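To make the k-fold cross-validation idea mentioned above concrete, here is a minimal base-R sketch of the splitting step (n and k are arbitrary illustrative values, and the model-fitting line is left as a placeholder):

```r
set.seed(42)
n <- 100   # number of observations (illustrative)
k <- 5     # number of folds

# Randomly assign each observation to one of k roughly equal folds
folds <- sample(rep(1:k, length.out = n))

# Each pass holds out one fold for testing and trains on the other k - 1;
# the k evaluation scores are then averaged
for (i in 1:k) {
  test_idx  <- which(folds == i)
  train_idx <- which(folds != i)
  # fit the model on train_idx, evaluate on test_idx ...
}
```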

### Upcoming Webinar, Machine Learning Vital Signs: Metrics and Monitoring Models in Production

In this upcoming webinar on Oct 23 @ 10 AM PT, learn why you should invest time in monitoring your machine learning models, the dangers of not paying attention to how a model’s performance can change over time, metrics you should be gathering for each model and what they tell you, and much more.

### 6 Questions to Ask Before Pursuing a Master’s Degree in Data Science

Face it, you’ve been on the fence about going back to school for a while now. After Googling master’s degree programs in data science here and there, you still feel overwhelmed about deciding if one is better than another. When there are countless sources of information tugging you in every direction, it’s best to gather your bearings and figure out the most important details you need to guide you in the right direction.

Luckily, we’ve done the legwork. Below are the six most important questions to ask before you begin your application to a data science master’s degree program. Think of this as a mental checklist to help you cross off crucial considerations before taking on your educational future:

## 1. Do I have an interest in or want to strengthen my problem-solving, data management, and research skills?

When you have an interest in data science practices and want to grow your hard and soft skills, enrolling in a data science master’s degree program is a promising endeavor. In fact, 2019 Emsi research shows that data management, clinical research, and data collection skills are frequently listed in data science job postings, while management, problem-solving, and communications round out the preferred soft skills:

Employers want data science master’s degree graduates to be well-rounded professionals, capable of tapping into the technical side of a role while effectively understanding and communicating data to drive actionable insights. With this in mind, it’s to your advantage to find a master’s degree program that prioritizes not only the hard skills of data science, but also the soft skills. When researching program curricula, look for courses such as “Communicating about Data,” “Ethics of Data Science,” and “Data Science and Strategic Decision-Making” — all of which expand on skills that are not only needed, but also demanded in today’s data science roles.

## 2. What are my career goals?

If you’re looking to be part of a dynamic team in charge of solving real-world problems, or if you want to efficiently draw conclusions from data to better inform actionable business strategies, a master’s degree in data science can set those career goals in motion.

Whether you strive to be a data analyst, database administrator, data architect, strategic business and technology intelligence consultant, or another data-focused professional, the impact of your work can be substantial. For example, Venmathi Shanmugam, a recent graduate of the University of Wisconsin Master of Science in Data Science online program, is the Modeling and Simulations Engineer at the Veterans Affairs office in Austin, Texas. In her role, she manages large amounts of government data pertaining to veterans healthcare and finance. She also holds responsibilities in the supply chain division of her department—using data to positively impact patient needs.

## 3. Who teaches the data science curriculum?

When enrolling in a graduate school program, you want to feel confident that you are learning from qualified experts in your desired field. A strong data science master’s program is rooted in an interdisciplinary approach, not siloed off to one academic school or department. Diversity in faculty expertise and perspectives is a critical component to consider before applying to a program. Don’t hesitate to research potential faculty members’ background, education, and recently published research studies.

The UW Data Science online program is a prime example of an interdisciplinary approach in action, where students learn from faculty members across six UW System campuses. With advanced degrees in mathematics, marketing, computer science, philosophy, management, statistics, and rhetoric and computer composition, graduate students receive the direct benefits of working and learning alongside UW faculty who are driven to grow and advance from every corner of the data science profession.

## 4. Does the degree program provide a networking community fueled by collaboration?

It’s important to consider how a data science master’s degree program sets you up to connect with industry leaders during your time as a student and after you graduate. Look for programs that feature an advisory board of data science professionals across business sectors. This collaboration of individuals shows that the program is supported by outside experts who can help you connect with future employers and/or mentors.

The UW Data Science online program boasts an impressive advisory board, where recognized data science leaders have the opportunity to shape the program. UW Data Science Advisory Board members can also sponsor capstone projects, plugging students into real-world data science settings where they draw upon their skills and grow from hands-on experiences. Currently, the UW Data Science Advisory Board includes data science professionals in the retail, banking, software programming, manufacturing, state government, insurance, transportation, staffing, and medical fields.

## 5. Do I have the time to earn a graduate degree?

For some working adults, it is too time consuming to earn a master’s degree through an on-campus program. Travelling to and from a campus throughout the week to sit in on hour-long lectures can complicate the balancing act of work and family responsibilities.

If this is your reality, then pursuing an online graduate degree might be your best option. But, not all online degrees are the same. Make sure to thoroughly research an online degree’s requirements and consider how much flexibility you will need when it comes to lectures, readings, and coursework.

UW Data Science is a smart choice for busy adults who want to advance their careers, or make a career change. Offered 100 percent online, you can study and complete coursework whenever and wherever you have an internet connection. Courses have no set meeting times, and you never need to come to campus.

## 6. Is the program accredited and respected?

When deciding to pursue a master’s in data science, you want to be sure that future employers will value the degree-granting institution you’ll graduate from. There’s no question that you can earn a graduate-level education from almost anywhere in the country and world, so how do you know which programs are worth your time and money? Look for degree programs that are accredited by regional and national accreditors, such as the Higher Learning Commission (HLC).

According to the HLC website, in order to be accredited an educational institution must 1.) have a clear and publicly articulated mission, 2.) conduct actions responsibly, ethically, and with integrity, 3.) provide high-quality education, no matter where and how offerings are delivered, 4.) evaluate its student learning effectiveness and promote continuous improvement, and 5.) have resources, structures, and procedures that support the institution’s mission.

The UW Data Science program is HLC accredited, signifying that its students receive a high-quality education led by faculty who are committed to staying up to date with the expanding data science field. Plus, UW is known and respected worldwide, helping graduates of the UW Data Science program stand out to employers. As a trusted and valued institution, a UW degree can help graduates accomplish their career and personal goals.

## So, you’ve gone through all six questions, what have you found?

With an expert-led curriculum, a 100 percent online format, and a recognized and respected UW degree, now is the time to see where the UW Master of Science in Data Science online program can take you.

Have questions about courses, tuition, or how to apply? Talk with an enrollment adviser by emailing learn@uwex.edu or calling 1-877-895-3276.

### Beyond Word Embedding: Key Ideas in Document Embedding

This literature review on document embedding techniques thoroughly covers the many ways practitioners develop rich vector representations of text -- from single sentences to entire books.

### Dan’s Paper Corner: Yes! It does work!

Only share my research
With sick lab rats like me
Trapped behind the beakers
Cut off from the world, I may not ever get free
But I may
One day
Trying to find
An antidote for strychnine — The Mountain Goats

Hi everyone! Hope you’re enjoying Peak Libra Season! I’m bringing my Air Sign goodness to another edition of Dan’s Paper Corner, which is a corner that I have covered in papers I really like.

And honestly, this one is mostly cheating. Two reasons really. First, it says nice things about the work Yuling, Aki, Andrew, and I did and then proceeds to do something much better. And second because one of the authors is Tamara Broderick, who I really admire and who’s been on an absolute tear recently.

Tamara—often working with the fabulous Trevor Campbell (who has the good grace to be Canadian), the stunning Jonathan Huggins (who also might be Canadian? What am I? The national register of people who are Canadian?), and the unimpeachable Ryan Giordano (again. Canadian? Who could know?)—has written a pile of my absolute favourite recent papers on Bayesian modelling and Bayesian computation.

Here are some of my favourite topics:

As I say, Tamara and her team of grad students, postdocs, and co-authors have been on one hell of a run!

Which brings me to today’s paper: Practical Posterior Error Bounds from Variational Objectives by Jonathan Huggins, Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick.

In the grand tradition of Dan’s Paper Corner, I’m not going to say much about this paper except that it’s really nice and well worth reading if you care about asking “Yes, but did it work?” for variational inference.

I will say that this paper is amazing and covers a tonne of ground. It’s fully possible that someone reading this paper for the first time won’t recognize how unbelievably practical it is. It is not trying to convince you that its new melon baller will ball melons faster and smoother than your old melon baller. Instead it stakes out much bolder ground: this paper provides a rigorous and justified and practical workflow for using variational inference to solve a real statistical problem.

I have some approximately sequential comments below, but I cannot stress this enough: this is the best type of paper. I really like it. And while it may be of less general interest than last time’s general theory of scientific discovery, it is of enormous practical value. Hold this paper close to your hearts!

• On a personal note, they demonstrate that the idea in the paper Yuling, Aki, Andrew, and I wrote is good for telling when variational posteriors are bad, but the k-hat diagnostic being small does not necessarily mean that the variational posterior will be good. (And, tbh, that’s why we recommended polishing it with importance sampling)
• But that puts us in good company, because they show that neither the KL divergence that’s used in deriving the ELBO nor the Rényi divergence is a particularly good measure of the quality of the solution.
• The first of these is not all that surprising. I think it’s been long acknowledged that the KL divergence used to derive variational posteriors is the wrong way around!
• I do love the Wasserstein distance (or as an extremely pissy footnote in my copy of Bogachev’s glorious two volume treatise on measure theory insists: the Kantorovich–Rubinstein metric). It’s so strong. I think it does CrossFit. (Side note: I saw a fabulous version of A Streetcar Named Desire in Toronto [Runs til Oct 27] last week and really it must be so much easier to find decent Stanleys since CrossFit became a thing.)
• The Hellinger distance is strong too and will also control the moments (under some conditions. See Lemma 6.3.7 of Andrew Stuart’s encyclopedia)
• Reading the paper sequentially, I get to Lemma 4.2 and think “ooh. that could be very loose”. And then I get excited about minimizing over $\eta$ in Theorem 4.3 because I contain multitudes.
• Maybe my one point of slight disagreement with this paper is where they agree with our paper. Because, as I said, I contain multitudes. They point out that it’s useful to polish VI estimates with importance sampling, but argue that they can compute their estimate of VI error instead of k-hat. I’d argue that you need to compute both because just like we didn’t show that small k-hat guarantees a good variational posterior, they don’t show that a good approximate upper bound on the Wasserstein distance guarantees that importance sampling will work. So ha! (In particular, Chatterjee and Diaconis argue very strongly, as does Mackay in his book, that the variance of an importance sampler being finite is somewhere near meaningless as a practical guarantee that an importance sampler actually works in moderate to high dimensions.)
• But that is nought but a minor quibble, because I completely and absolutely agree with the workflow for Variational Inference that they propose in Section 4.3.
• Let’s not kid ourselves here. The technical tools in this paper are really nice.
• There is not a single example I hate more than the 8 schools problem. It is the MNIST of hierarchical modelling. Here’s hoping it doesn’t have any special features that make it a bad generic example of how things work!
• That said, it definitely shows that k-hat isn’t enough to guarantee good posterior behaviour.

Anyway. Here’s to more papers like this and to fewer examples of what the late, great David Berman referred to as “ceaseless feasts of schadenfreude.”

### #FunDataFriday – gTrendsR


## What is it?

gtrendsR is an R package that can be used to programmatically gather and display Google Trends data. Lately, I seem to be finding a lot of fun use cases for it, so I figured I would share the joy in my #FunDataFriday series!

## Why is it awesome?

It’s an awesome package because it’s so simple! In three lines of code, you can pull Google trend data and visualize the results. Because you get the raw trend data, you can very easily extend your analysis to do almost anything.

## How to get started?

Getting started is easy. With just three lines, you can plot your own gTrendsR graph in R.

library(gtrendsR)
trends <- gtrends(c("Nerds", "Smarties"), geo = "CA")
plot(trends)

With three more lines, you can make the graph interactive:

library(plotly)
p <- plot(trends)
ggplotly(p)

A gtrendsR graph inspired by Epi Ellie’s outrageous competition on the popularity of Nerds and Smarties!

With a little more effort, you can start diving into the data and merging it with other sources. If you want to stay on the data visualization path, you can easily exploit the full benefits of ggplot2 to analyze the results! If you want to learn more, I have a few more examples in my recent blog post analyzing the relative popularity of The Bachelor franchise series over time.


### What’s new on arXiv

The success of Deep Learning and its potential use in many safety-critical applications has motivated research on formal verification of Neural Network (NN) models. In this context, verification means verifying whether a NN model satisfies certain input-output properties. Despite the reputation of learned NN models as black boxes, and the theoretical hardness of proving useful properties about them, researchers have been successful in verifying some classes of models by exploiting their piecewise linear structure and taking insights from formal methods such as Satisfiability Modulo Theory. However, these methods are still far from scaling to realistic neural networks. To facilitate progress in this crucial area, we make two key contributions. First, we present a unified framework based on branch and bound that encompasses previous methods. This analysis results in the identification of new methods that combine the strengths of multiple existing approaches, accomplishing a speedup of two orders of magnitude compared to the previous state of the art. Second, we propose a new data set of benchmarks which includes a collection of previously released test cases. We use the benchmark to provide a thorough experimental comparison of existing algorithms and identify the factors impacting the hardness of verification problems.
Recommender systems play a pivotal role in alleviating the problem of information overload. Latent factor models have been widely used for recommendation. Most existing latent factor models mainly utilize the interaction information between users and items, although some recently extended models utilize some auxiliary information to learn a unified latent factor for users and items. The unified latent factor only represents the characteristics of users and the properties of items from the aspect of purchase history. However, the characteristics of users and the properties of items may stem from different aspects, e.g., the brand-aspect and category-aspect of items. Moreover, the latent factor models usually use a shallow projection, which cannot capture the characteristics of users and items well. In this paper, we propose a Neural network based Aspect-level Collaborative Filtering model (NeuACF) to exploit different aspect latent factors. Through modelling the rich object properties and relations in a recommender system as a heterogeneous information network, NeuACF first extracts different aspect-level similarity matrices of users and items respectively through different meta-paths, and then feeds an elaborately designed deep neural network with these matrices to learn aspect-level latent factors. Finally, the aspect-level latent factors are fused for the top-N recommendation. Moreover, to fuse information from different aspects more effectively, we further propose NeuACF++ to fuse aspect-level latent factors with a self-attention mechanism. Extensive experiments on three real-world datasets show that NeuACF and NeuACF++ significantly outperform both existing latent factor models and recent neural network models.
Learning representations that accurately capture long-range dependencies in sequential inputs — including text, audio, and genomic data — is a key problem in deep learning. Feed-forward convolutional models capture only feature interactions within finite receptive fields, while recurrent architectures can be slow and difficult to train due to vanishing gradients. Here, we propose Temporal Feature-Wise Linear Modulation (TFiLM) — a novel architectural component inspired by adaptive batch normalization and its extensions — that uses a recurrent neural network to alter the activations of a convolutional model. This approach expands the receptive field of convolutional sequence models with minimal computational overhead. Empirically, we find that TFiLM significantly improves the learning speed and accuracy of feed-forward neural networks on a range of generative and discriminative learning tasks, including text classification and audio super-resolution.
The selection of variables with high-dimensional and missing data is a major challenge and very few methods are available to solve this problem. Here we propose a new method — adaptive Bayesian SLOPE — which is an extension of the SLOPE method of sorted $l_1$ regularization within a Bayesian framework and which allows one to simultaneously estimate the parameters and select variables for large data despite missing values. The method follows the idea of the Spike and Slab LASSO, but replaces the Laplace mixture prior with the frequentist-motivated ‘SLOPE’ prior, which targets control of the False Discovery Rate. The regression parameters and the noise variance are estimated using a stochastic approximation EM algorithm, which can incorporate missing values as well as latent model parameters, such as the signal magnitude and its sparsity. Extensive simulations highlight the method’s good behavior in terms of power, FDR and estimation bias under a wide range of simulation scenarios. Finally, we consider an application to severely traumatized patients from Paris hospitals, predicting platelet levels, and demonstrate, beyond the advantage of selecting relevant variables, which is crucial for interpretation, excellent predictive capabilities. The methodology is implemented in the R package ABSLOPE, which incorporates C++ code to improve the efficiency of the proposed method.
Pre-training Transformers on large-scale raw texts and fine-tuning on the desired task have achieved state-of-the-art results on diverse NLP tasks. However, it is unclear what the learned attention captures. The attention computed by attention heads seems not to match human intuitions about hierarchical structures. This paper proposes Tree Transformer, which adds an extra constraint to attention heads of the bidirectional Transformer encoder in order to encourage the attention heads to follow tree structures. The tree structures can be automatically induced from raw texts by our proposed “Constituent Attention” module, which is simply implemented by self-attention between two adjacent words. With a training procedure identical to BERT’s, the experiments demonstrate the effectiveness of Tree Transformer in terms of inducing tree structures, better language modeling, and further learning more explainable attention scores.
What makes a paper independently reproducible? Debates on reproducibility center around intuition or assumptions but lack empirical results. Our field focuses on releasing code, which is important, but is not sufficient for determining reproducibility. We take the first step toward a quantifiable answer by manually attempting to implement 255 papers published from 1984 until 2017, recording features of each paper, and performing statistical analysis of the results. For each paper, we did not look at the authors’ code, if released, in order to prevent bias toward discrepancies between code and paper.
In the context of machine learning, a prediction problem exhibits predictive multiplicity if there exist several ‘good’ models that attain identical or near-identical performance (i.e., accuracy, AUC, etc.). In this paper, we study the effects of multiplicity in human-facing applications, such as credit scoring and recidivism prediction. We introduce a specific notion of multiplicity — predictive multiplicity — to describe the existence of good models that output conflicting predictions. Unlike existing notions of multiplicity (e.g., the Rashomon effect), predictive multiplicity reflects irreconcilable differences in the predictions of models with comparable performance, and presents new challenges for common practices such as model selection and local explanation. We propose measures to evaluate the predictive multiplicity in classification problems. We present integer programming methods to compute these measures for a given dataset by solving empirical risk minimization problems with discrete constraints. We demonstrate how these tools can inform stakeholders on a large collection of recidivism prediction problems. Our results show that real-world prediction problems often admit many good models that output wildly conflicting predictions, and support the need to report predictive multiplicity in model development.
In class-incremental learning, a model learns continuously from a sequential data stream in which new classes occur. Existing methods often rely on static architectures that are manually crafted. These methods can be prone to capacity saturation because a neural network’s ability to generalize to new concepts is limited by its fixed capacity. To understand how to expand a continual learner, we focus on the neural architecture design problem in the context of class-incremental learning: at each time step, the learner must optimize its performance on all classes observed so far by selecting the most competitive neural architecture. To tackle this problem, we propose Continual Neural Architecture Search (CNAS): an autoML approach that takes advantage of the sequential nature of class-incremental learning to efficiently and adaptively identify strong architectures in a continual learning setting. We employ a task network to perform the classification task and a reinforcement learning agent as the meta-controller for architecture search. In addition, we apply network transformations to transfer weights from the previous learning step and to reduce the size of the architecture search space, thus saving a large amount of computational resources. We evaluate CNAS on the CIFAR-100 dataset under varied incremental learning scenarios with limited computational power (1 GPU). Experimental results demonstrate that CNAS outperforms architectures that are optimized for the entire dataset. In addition, CNAS is at least an order of magnitude more efficient than naively using existing autoML methods.
Partial multi-label learning (PML), which tackles the problem of learning multi-label prediction models from instances with overcomplete noisy annotations, has recently started gaining attention from the research community. In this paper, we propose a novel adversarial learning model, PML-GAN, under a generalized encoder-decoder framework for partial multi-label learning. The PML-GAN model uses a disambiguation network to identify noisy labels and uses a multi-label prediction network to map the training instances to the disambiguated label vectors, while deploying a generative adversarial network as an inverse mapping from label vectors to data samples in the input feature space. The learning of the overall model corresponds to a minimax adversarial game, which enhances the correspondence of input features with the output labels. Extensive experiments are conducted on multiple datasets, and the proposed model demonstrates state-of-the-art performance for partial multi-label learning.
Because it is not feasible to collect training data for every language, there is a growing interest in cross-lingual transfer learning. In this paper, we systematically explore zero-shot cross-lingual transfer learning on reading comprehension tasks with a language representation model pre-trained on a multilingual corpus. The experimental results show that with pre-trained language representations zero-shot learning is feasible, and translating the source data into the target language is not necessary and even degrades the performance. We further explore what the model learns in the zero-shot setting.
The distribution of collective firing of neurons, known as a neural avalanche, obeys a power law. Three proposed explanations of this emergent scale-free behavior are criticality, neutral theory, and self-organized criticality. We show that the neutral theory of neural avalanches can be unified with criticality, which requires fine tuning of control parameters, and rule out self-organized criticality. We study a model of the brain for which the dynamics are governed by neutral theory. We identify the tuning parameters, which are consistent with experiments, and show that scale-free neural avalanches occur only at the critical point. The scaling hypothesis provides a unified explanation of the power laws which characterize the critical point. The critical exponents characterizing the avalanche distributions and divergence of the response functions are shown to be consistent with the predictions of the scaling hypothesis. We use a universal scaling function for the avalanche profile to find that the firing rate for avalanches of different sizes shows data collapse after appropriate rescaling. Critical slowing-down and algebraic relaxation of avalanches demonstrate that the dynamics are also consistent with the system being at a critical point. We discuss how our results can motivate future empirical studies of criticality in the brain.
We study the problem of online clustering where a clustering algorithm has to assign a new point that arrives to one of $k$ clusters. The specific formulation we use is the $k$-means objective: At each time step the algorithm has to maintain a set of $k$ candidate centers and the loss incurred is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the $k$-means objective ($\mathcal{C}$) in hindsight. We show that provided the data lies in a bounded region, an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centers at each round) implies an offline efficient algorithm for the $k$-means problem. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + \epsilon) \mathcal{C}$ and present a no-regret algorithm with runtime $O\left(T \cdot \mathrm{poly}(\log T, k, d, 1/\epsilon)^{k(d+O(1))}\right)$. Our algorithm is based on maintaining an incremental coreset and an adaptive variant of the MWUA. We show that naïve online algorithms, such as Follow The Leader, fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data.
Over the past years, there has been an increased research interest in the problem of detecting anomalies in temporal streaming data. Among the many algorithms proposed every year, there exists no single general method that has been shown to outperform the others across different anomaly types, applications, and datasets. Furthermore, experimental studies conducted using existing methods lack reliability since they attempt to assess the superiority of the algorithms without studying their shared properties and differences thoroughly. In this paper, we propose SAFARI, a general framework that abstracts and unifies the fundamental tasks in streaming anomaly detection and provides a flexible and extensible anomaly detection procedure. SAFARI facilitates more elaborate algorithm comparisons by making it possible to isolate the effects of the shared and unique characteristics of different algorithms on detection performance. Using SAFARI, we have implemented different anomaly detectors and identified a research gap, which motivates us to propose a novel learning strategy in this work. We then conducted an extensive evaluation study on 20 detectors composed with SAFARI and compared their performance using real-world benchmark datasets with different properties. Finally, we discuss their benefits and drawbacks in depth and draw a set of conclusions to guide future users of SAFARI.
In this paper, we focus on the separability of classes with the cross-entropy loss function for classification problems by theoretically analyzing the intra-class distance and inter-class distance (i.e. the distance between any two points belonging to the same class and different classes, respectively) in the feature space, i.e. the space of representations learnt by neural networks. Specifically, we consider an arbitrary network architecture having a fully connected final layer with Softmax activation and trained using the cross-entropy loss. We derive expressions for the value and the distribution of the squared L2 norm of the product of a network dependent matrix and a random intra-class and inter-class distance vector (i.e. the vector between any two points belonging to the same class and different classes), respectively, in the learnt feature space (or the transformation of the original data) just before Softmax activation, as a function of the cross-entropy loss value. The main result of our analysis is the derivation of a lower bound for the probability with which the inter-class distance is more than the intra-class distance in this feature space, as a function of the loss value. We do so by leveraging some empirical statistical observations with mild assumptions and sound theoretical analysis. As per intuition, the probability with which the inter-class distance is more than the intra-class distance decreases as the loss value increases, i.e. the classes are better separated when the loss value is low. To the best of our knowledge, this is the first work of theoretical nature trying to explain the separability of classes in the feature space learnt by neural networks trained with the cross-entropy loss function.
Spoken Language Understanding (SLU) mainly involves two tasks, intent detection and slot filling, which are generally modeled jointly in existing works. However, most existing models fail to fully utilize co-occurrence relations between slots and intents, which restricts their potential performance. To address this issue, in this paper we propose a novel Collaborative Memory Network (CM-Net) built on a well-designed block named the CM-block. The CM-block first captures slot-specific and intent-specific features from memories in a collaborative manner, and then uses these enriched features to enhance local context representations, based on which the sequential information flow leads to more specific (slot and intent) global utterance representations. Through stacking multiple CM-blocks, our CM-Net is able to alternately perform information exchange among the specific memories, local contexts and the global utterance, thus incrementally enriching each other. We evaluate the CM-Net on two standard benchmarks (ATIS and SNIPS) and a self-collected corpus (CAIS). Experimental results show that the CM-Net achieves state-of-the-art results on ATIS and SNIPS by most criteria, and significantly outperforms the baseline models on CAIS. Additionally, we make the CAIS dataset publicly available for the research community.
We present CrypTFlow, a first of its kind system that converts TensorFlow inference code into Secure Multi-party Computation (MPC) protocols at the push of a button. To do this, we build three components. Our first component, Athos, is an end-to-end compiler from TensorFlow to a variety of semi-honest MPC protocols. The second component, Porthos, is an improved semi-honest 3-party protocol that provides significant speedups for TensorFlow-like applications. Finally, to provide malicious secure MPC protocols, our third component, Aramis, is a novel technique that uses hardware with integrity guarantees to convert any semi-honest MPC protocol into an MPC protocol that provides malicious security. The security of the protocols output by Aramis relies on hardware for integrity and MPC for confidentiality. Moreover, our system, through the use of a new float-to-fixed compiler, matches the inference accuracy of the plaintext floating-point counterparts of these networks. We experimentally demonstrate the power of our system by showing the secure inference of real-world neural networks such as ResNet50, DenseNet121, and SqueezeNet over the ImageNet dataset with running times of about 30 seconds for semi-honest security and under two minutes for malicious security. Prior work in the area of secure inference (SecureML, MiniONN, HyCC, ABY3, CHET, EzPC, Gazelle, and SecureNN) has been limited to semi-honest security of toy networks with 3–4 layers over tiny datasets such as MNIST or CIFAR, which have 10 classes. In contrast, our largest network has 200 layers, 65 million parameters and over 1000 ImageNet classes. Even on MNIST/CIFAR, CrypTFlow outperforms prior work.

### There is No Such Thing as a Free Lunch: Part 1

You have heard the expression “there is no such thing as a free lunch” – well, in machine learning the same principle holds. In fact, there is even a theorem of the same name.

### Distilled News

It may seem like an odd problem, but some images just don’t have enough hashtags. But don’t take my word for it, take Facebook’s: they tasked an entire team to build a state-of-the-art neural net and deployed the result on Instagram. Hashtags are super useful. Not only do they help users find content, they also help recommendation systems curate content for users, and that has intrinsic value. Better content recommendations mean happier users, which translates to a longer time spent on your site and ultimately more money earned.
In this example, a set of business rules and an NLP (Natural Language Processing) model were created to automate the answering of return requests from customers. In particular, I’ll be focusing on the NLP side. Tf-idf is used to parse customers’ notes and either allow or stop the automatic pipeline of answers to client return requests. To address the rise in return requests, we used supervised machine learning (NLP) tools to further increase the number of responses handled automatically, increasing productivity and reducing the time to answer.
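As an illustration of how tf-idf weighs the terms of a customer note, here is a minimal dependency-free sketch. The notes are invented for the example; a real pipeline would more likely use a library implementation such as scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf weights for a list of tokenised documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term counts in this document
        weights.append(
            {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        )
    return weights

# Hypothetical customer notes (already tokenised), not the article's data.
notes = [
    "item arrived broken want refund".split(),
    "wrong size want exchange".split(),
    "item broken box damaged".split(),
]
w = tfidf(notes)
# Rare terms such as "refund" score higher than common ones such as "want",
# which is what lets a downstream rule or classifier route the request.
```

The weights, not the raw words, are what the business rules or a classifier then operate on.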
Forecasting of gold and oil prices has garnered major attention from academics, investors and government agencies alike. These two commodities are known for their substantial influence on the global economy. I will show here how to use the Granger causality test to examine the relationships between multiple variables in a time series, and how to use a Vector Autoregressive (VAR) model to forecast future gold and oil prices from historical data on gold prices, silver prices, crude oil prices, a stock index, interest rates and the USD rate.
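The core of a Granger causality check can be sketched in a few lines: fit one equation of a VAR(1) for the target series with and without the other series' lag, and see whether adding the lag reduces the residual sum of squares. This is a toy sketch on synthetic data (a real analysis of the price series would use statsmodels' grangercausalitytests and VAR):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two series (not real gold/oil data):
# y depends on the lag of x, so x should "Granger-cause" y.
n = 500
x = rng.normal(size=n)
noise = 0.1 * rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + noise[t]

def rss(target, regressors):
    """Residual sum of squares of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(target))] + regressors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return float(resid @ resid)

# Restricted model: y_t regressed on its own lag only.
rss_restricted = rss(y[1:], [y[:-1]])
# Unrestricted model: y's lag plus x's lag (one equation of a VAR(1)).
rss_full = rss(y[1:], [y[:-1], x[:-1]])

# The Granger test turns this drop in RSS into an F statistic; here the
# drop is large because lagged x genuinely helps predict y.
```

The same comparison run in the other direction (does lagged y help predict x?) is what makes the test directional.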
Bias in AI programming, both conscious and unconscious, is an issue of concern raised by scholars, the public, and the media alike. Given the implications of usage in hiring, credit, social benefits, policing, and legal decisions, they have good reason to be. AI bias occurs when a computer algorithm makes prejudiced decisions based on data and/or programming rules. The problem of bias is not only with coding (or programming), but also with the datasets that are used to train AI algorithms, in what some call the ‘discrimination feedback loop.’
My first computer was a Commodore Vic-20 in 1981. I bought the device because of this incredible urge to program in BASIC as a result of Mr. Ted Becker’s course on computer programming. I vaguely remember the leap from the painstaking process of programming using punch cards to writing code and watching your program run immediately, once you resolved all of the syntax errors of course. Nonetheless, it was thrilling and addictive! In hindsight, a part of that addiction was a result of being able to code these complex calculations involving products and sums to ridiculously large numbers. Indeed, my first use of a home computer was defined by my vision for the ultimate calculator. I spent hours performing complex calculations that I would never have been able to perform with a calculator. But why was this the case? What choice would I make today if I were presented with a new technology that is so different that it blows away anything that would be considered light years ahead, given all of the current advances in technology? What role did the scientific calculator play in defining how we view the modern computer? Finally, what would be the ideal device for today’s data analytics professional? The answers to these questions will be the focus of this article.
The age of information has bestowed upon humanity one of the biggest explosions of tech-focused jobs ever. While an abundance of the power behind big successful companies such as Uber, Facebook, AirBnB, and Amazon is their ingenuity and convenience for consumers, their success can also be attributed to the utilization of data – and lots of it. It is undeniable that data is now the new resource that everyone is flocking to. Even companies such as Coca-Cola and Pepsi are using consumer data to make better marketing campaigns and strategic decisions about their products. This has resulted in data being called ‘the new oil’, a phrase originally coined by mathematician Clive Humby and sensationalized by the Economist. While many disagree with this claim on both literal and figurative points, the point is obvious: data has extraordinary value and it’s only going to become more valuable in the future. Understanding how to navigate a world full of data is crucial for maintaining the sociological, economic, environmental, and physical well-being of humans and the rest of the planet. Data scientists can make this easier by breaking down the abundance of obscure aspects pertaining to data into clear terms. So I thought of ways to advance the data science mind for all, while sticking true to the actual science.
Many believe that the coming of the age of Machine Intelligence – MI (I prefer the term Machine Intelligence to the term Artificial Intelligence as it captures a great sense of possibility) is as epochal an event as when humans learnt to make and control fire. Up until very recently MI was still in its infancy, based in essence on super-fast computer-aided pattern recognition. The most widespread use of the first generation of MI was in the banal business of targeting digital advertising using surreptitiously gathered usage data, with the benefits largely going to the likes of Google and Facebook. It is not clear whether it benefited consumers in any significant way. In fact it is emerging that MI-driven targeting techniques have been misused to sow dissent and strife in societies and communities. Over the past few years the second generation of MI has begun to emerge through a set of neural network techniques based on the principle of Deep Learning.
One of the most widely used Natural Language Processing & Supervised Machine Learning (ML) tasks across different problems and use cases is so-called Text Classification. It is an example of a Supervised Machine Learning task, since a labeled dataset containing text documents and their labels is used to train a text classifier.
Deep learning is everywhere…from classifying images and translating languages to building a self-driving car. All these tasks are being driven by computers rather than manual human effort. And no, for doing so you don’t need to be a magician, you just need to have a solid grasp on deep learning techniques. And yes, it is quite possible to learn it on your own! So, what is Deep learning? It is a phrase used for complex neural networks. What does it mean? A deep learning system is self-teaching, learning as it goes by filtering information through multiple hidden layers, in a similar way to humans. Without neural networks, there would be no deep learning. And this means only one thing – if you want to master deep learning, start with neural networks! In this post, I gathered everything to systematize all the most important modern knowledge about neural networks: What is a neuron? What is a neural network and what is the history of its development? How do we ‘train’ neural networks? What is gradient descent and backpropagation and more… If you want to know the answers to all these questions, let’s jump right in!
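To make "training" concrete, here is gradient descent on the smallest possible network, a single sigmoid neuron, learning a toy threshold function. The data and hyperparameters are invented purely for illustration:

```python
import math

# Toy data: label is 1 when x > 0.5, else 0. Linearly separable on purpose.
data = [(k / 100.0, 1.0 if k > 50 else 0.0) for k in range(100)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0   # the neuron's weight and bias
lr = 1.0          # learning rate

for epoch in range(2000):
    gw = gb = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)   # forward pass: the neuron's prediction
        gw += (p - y) * x        # gradient of cross-entropy loss w.r.t. w
        gb += (p - y)            # ... and w.r.t. b
    w -= lr * gw / len(data)     # one gradient-descent step
    b -= lr * gb / len(data)

accuracy = sum(
    (sigmoid(w * x + b) > 0.5) == (y == 1.0) for x, y in data
) / len(data)
# On this separable toy set the accuracy should end up close to 1.0.
```

Backpropagation is exactly this gradient computation, applied layer by layer through the hidden layers of a deeper network.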
Artificial Intelligence (AI) has been part of computing since the 1950s. But it’s only been since 2000 that AI systems have been able to accomplish useful tasks like classifying images or understanding spoken language. And only very recently has Machine Learning advanced to a point such that significant AI computations can be performed on the smartphones and tablets available to students. MIT is building tools into App Inventor that will enable even beginning students to create original AI applications that would have been advanced research a decade ago. This creates new opportunities for students to explore the possibilities of AI and empowers students as creators of the digital future. AI with MIT App Inventor includes tutorial lessons as well as suggestions for student explorations and project work. Each unit also includes supplementary teaching materials: lesson plans, slides, unit outlines, assessments and alignment to the Computer Science Teachers of America (CSTA) K12 Computing Standards. As with all MIT App Inventor efforts, the emphasis is on active constructionist learning where students create projects and programs that instantiate their ideas.
Overview
• Web scraping is a highly effective method to extract data from websites (depending on the website’s regulations)
• Learn how to perform web scraping in Python using the popular BeautifulSoup library
• We will cover different types of data that can be scraped, such as text and images
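BeautifulSoup is the library the post covers; purely to illustrate the idea of scraping text and image URLs without adding a dependency, here is a sketch built on Python's standard-library html.parser instead (the HTML snippet is made up):

```python
from html.parser import HTMLParser

class Scraper(HTMLParser):
    """Collect headline text and image sources from an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "img":
            # keep the src attribute of every image
            self.images.extend(v for k, v in attrs if k == "src")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.headlines.append(data.strip())

# A tiny made-up page; a real scrape would fetch it with urllib or requests.
html = '<h2>Top story</h2><p>Body text.</p><img src="/a.png"><h2>Other</h2>'
s = Scraper()
s.feed(html)
# s.headlines == ["Top story", "Other"]; s.images == ["/a.png"]
```

BeautifulSoup wraps this same event-driven parsing in a much friendlier tree API, which is why the tutorial reaches for it.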
In this post, I will show you how to create interactive world maps and how to show these in the form of an R Shiny app. As the Shiny app cannot be embedded into this blog, I will direct you to the live app, and on my GitHub I show how to embed a Shiny app in your R Markdown files, which is a really cool and innovative way of preparing interactive documents. To show you how to adapt the interface of the app to the choices of the users, we’ll make use of two data sources, such that the user can choose what data they want to explore and the app adapts the possible input choices to the users’ previous choices. The data sources here are about childlessness and gender inequality, which is the focus of my PhD research, in which I computationally analyse the effects of gender and parental status on socio-economic inequalities. We’ll start by loading and cleaning the data, after which we will build our interactive world maps in R Shiny. Let’s first load the required packages into RStudio.
We are entering a new era of training machine learning models with on-device capability. In this tutorial I will be using PyTorch and PySyft to train a deep learning neural network using a federated approach. Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralised data. It enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well. The goal is a machine learning setting where a high-quality centralised model is trained on data distributed over a large number of clients, each with unreliable and relatively slow network connections.
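The PySyft specifics are beyond a short snippet, but the federated idea itself, clients fitting a shared model on their own data and sending back only model updates for averaging, can be sketched in plain Python. This is a toy one-parameter model on synthetic client data, not the tutorial's code:

```python
import random

random.seed(1)

# Each "device" holds its own private data for fitting y = 2*x (toy task).
clients = [
    [(x, 2.0 * x) for x in (random.random() for _ in range(20))]
    for _ in range(5)
]

w_global = 0.0  # the shared model: a single weight

for rnd in range(50):            # communication rounds
    local_ws = []
    for data in clients:
        w = w_global             # client starts from the shared model
        for x, y in data:        # local SGD on the client's own data
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= 0.1 * grad
        local_ws.append(w)       # only the weight is sent, never the data
    # Server averages the client models (federated averaging)
    w_global = sum(local_ws) / len(local_ws)
```

After the rounds, w_global converges to the true slope even though no client ever shares a raw data point, which is the whole appeal of the approach.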
If we want a machine learning model to be able to generalize these forms together, we need to map them to a shared representation. But when are two different words the same for our purposes? It depends.
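The trade-off in "when are two words the same" shows up already in the crudest normalisation. Here is a toy sketch (the suffix list and thresholds are made up for illustration, not a real stemmer): each merging step helps the model generalise across forms, and each one can also conflate words we might want to keep distinct.

```python
def normalize(token, strip_suffixes=True, lowercase=True):
    """Map surface word forms to a shared key. Each merging step helps
    generalisation, and can also merge words we'd rather keep distinct."""
    t = token.lower() if lowercase else token
    if strip_suffixes:
        for suffix in ("ing", "ed", "s"):  # crude stemming, illustration only
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
    return t

# "walks", "walking" and "walked" collapse into one shared representation...
print({normalize(w) for w in ["walks", "walking", "walked"]})  # {'walk'}
# ...but the same machinery conflates the country "US" with the pronoun "us".
print(normalize("US") == normalize("us"))  # True
```

Whether that last merge is acceptable depends entirely on the task, which is the "it depends" above in miniature.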

### What’s the p-value good for: I answer some questions.

Martin King writes:

For a couple of decades (from about 1988 to 2006) I was employed as a support statistician, and became very interested in the p-value issue; hence my interest in your contribution to this debate. (I am not familiar with the p-value ‘reconciliation’ literature, as published after about 2005.) I would hugely appreciate it, if you might find the time to comment further on some of the questions listed in this document.

I would be particularly interested in learning more about your views on strict Neyman-Pearson hypothesis testing, based on critical values (critical regions), given an insistence on power calculations among research funding organisations (i.e., first section headed ‘p-value thresholds’), and the long-standing recommendation that biomedical researchers should focus on confidence intervals instead of p-values (i.e., penultimate section headed ‘estimation and confidence intervals’).

Here are some excerpts from King’s document that I will respond to:

My main question is about ‘dichotomous thinking’ and p-value thresholds. McShane and Gal (2017, page 888) refer to “dichotomous thinking and similar errors”. Is it correct to say that dichotomous thinking is an error? . . .

If funding bodies insist on strict hypothesis testing (otherwise why the insistence on power analysis, as opposed to some other assessment of adequate precision), is it fair to criticise researchers for obeying the rules dictated by the method? In summary, before banning p-value thresholds, do you have to persuade the funding bodies to abandon their insistence on power calculations, and allow applicants more flexibility in showing that a proposed study has sufficient precision? . . .

This brings us to the second question regarding what should be taught in statistics courses aimed at biomedical researchers. A teacher might want the freedom to design courses that assume an ideal world in which statisticians and researchers are free to adopt a rational approach of their choice. Thus, a teacher might decide to drop frequentist methods (if she/he regards frequentist statistics as nonsense) and focus on the alternatives. But this creates a problem for the course recipients if grant-awarding bodies and journal editors insist on frequentist statistics. . . .

It is suggested (McShane et al. 2018) that researchers often fail to provide sufficient information on currently subordinate factors. I spent many years working in an experimental biomedical environment, and it is my impression that most experimental biomedical researchers do present this kind of information. (They do not spend time doing experiments that are not expected to work or collecting data that are not expected to yield useful and substantial information. It is my impression that some authors go to the extreme in attempting to present an argument for relevance and plausibility.) Do you have a specific literature in mind where it is common to see results offered with no regard for motivation, relevance, mechanism, plausibility etc. (apart from data dredging/data mining studies in which mechanism and plausibility might be elusive)? . . .

For many years it had not occurred to me that there is a distinction between looking at p-values (or any other measure of evidence) obtained as a participant in a research study, versus looking at third-party results given in some publication, because the latter have been through several unknown filters (researcher selection, significance filter etc). Although others had commented on this problem, it was your discussions on the significance filter that prompted me to fully realise the importance of this issue. Is it a fact that there is no mechanism by which readers can evaluate the strength of evidence in many published studies? I realise that pre-registration has been proposed as a partial solution to this problem. But it is my impression that, of necessity, much experimental and basic biomedical science research takes the form of an iterative and adaptive learning process, as outlined by Box and Tiao (pages 4-5), for example. I assume that many would find it difficult to see how pre-registration (with constant revision) would work in this context, without imposing a massive obstacle to making progress.

And now my response:

1. Yes, I think dichotomous frameworks are usually a mistake in science. With rare exceptions, I don’t think it makes sense to say that an effect is there or not there. Instead I’d say that effects vary.

Sometimes we don’t have enough data to distinguish an effect from zero, and that can be a useful thing to say. Reporting that an effect is not statistically significant can be informative, but I don’t think it should be taken as an indication that the true effect is zero; it just tells us that our data and model do not give us enough precision to distinguish the effect from zero.
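The precision point can be made concrete with a two-line calculation (the effect size and sample sizes are invented for illustration): the same true effect is "not significant" in a small study and overwhelmingly "significant" in a large one, so the verdict reflects precision, not whether the effect exists.

```python
import math

def z_for(effect, sd, n):
    """z statistic for a sample mean with known sd: effect / (sd / sqrt(n))."""
    return effect / (sd / math.sqrt(n))

# Identical true effect (0.2 sd), two sample sizes; only precision differs.
small = z_for(0.2, 1.0, 50)     # z ~ 1.41: below the usual 1.96 cutoff
large = z_for(0.2, 1.0, 2000)   # z ~ 8.94: far above it
```

Declaring the first study "no effect" and the second "an effect" describes the studies' sample sizes, not two different realities.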

2. Sometimes decisions have to be made. That’s fine. But then I think the decisions should be made based on estimated costs, benefits, and probabilities—not based on the tail-area probability with respect to a straw-man null hypothesis.
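A decision rule built on costs and benefits looks nothing like a p-value cutoff. A toy sketch (all probabilities and payoffs here are invented numbers, just to show the shape of the rule):

```python
def decide_by_expected_value(p_effect, benefit_if_real, cost_of_acting):
    """Act when expected benefit exceeds cost. The decision depends on the
    stakes, not on crossing a fixed tail-area threshold."""
    return p_effect * benefit_if_real > cost_of_acting

# A long-shot effect (20% chance it's real) can still be worth acting on
# when the payoff dwarfs the cost of acting...
print(decide_by_expected_value(0.2, benefit_if_real=100.0, cost_of_acting=5.0))
# ...while a near-certain but tiny effect may not justify an expensive action.
print(decide_by_expected_value(0.9, benefit_if_real=1.0, cost_of_acting=5.0))
```

The point is that two situations with identical p-values can warrant opposite decisions once costs and benefits enter.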

3. If scientists in the real world are required to do X, Y, and Z, then, yes, we should train them on how to do X, Y, and Z, but we should also explain why these actions can be counterproductive to larger goals of scientific discovery, public health, etc.

Perhaps a sports analogy will help. Suppose you’re a youth coach, and your players would like to play in an adult league that uses what you consider to be poor strategies. Short term, you need to teach your players these poor strategies so they can enter the league on the league’s terms. But you should also teach them the strategies that will ultimately be more effective so that, once they’ve established themselves, or if they happen to play with an enlightened coach, they can really shine.

4. Regarding “currently subordinate factors”: In many of the examples we’ve discussed over the years on this blog, published papers do not include raw data or anything close to it; they don’t give details on what data were collected, how the data were processed, or what data were excluded. Yes, there will be lots of discussion of motivation, relevance, mechanism, plausibility etc. of the theories, but not much thought about data quality. Some quick examples include the evolutionary psychology literature, where the days of peak fertility were mischaracterized or measurement of finger length was characterized as a measure of testosterone. There’s often a problem that data and measurements are really noisy, and authors of published papers (a) don’t even address the point and (b) don’t seem to think it matters, under the (fallacious) reasoning that, once you have achieved statistical significance, measurement error doesn’t matter.

5. Preregistration is fine for what it is, but I agree that it does not resolve issues of research quality. At best, preregistration makes it more difficult for people to make strong claims from noise (although they can still do it!), hence it provides an indirect incentive for people to gather better data and run stronger studies. But it’s just an incentive; a noisy study that is preregistered is still a noisy study.

Summary

I think that p-values and statistical significance as used in practice are a noise magnifier, and I think people would be better off reporting what they find without the need to declare statistical significance.
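The "noise magnifier" claim, essentially the significance filter, can be demonstrated in a short simulation (the effect size and standard error are invented numbers): estimates are unbiased on average, but the subset that clears the significance threshold systematically exaggerates the true effect.

```python
import random

random.seed(1)
TRUE_EFFECT, SE = 0.1, 0.2   # a small effect, measured noisily

# Many replications of the same noisy study.
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]
# Only "significant" results survive the filter into publication.
significant = [e for e in estimates if abs(e) > 1.96 * SE]

mean_all = sum(estimates) / len(estimates)
mean_sig = sum(significant) / len(significant)
# mean_all is close to 0.1 (estimation itself is unbiased), but mean_sig
# is several times larger: conditioning on significance turns noise into
# exaggerated published effect sizes.
```

This is why reporting everything, rather than only what crosses a threshold, gives readers a less distorted picture.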

There are times when p-values can be useful: it can help to know that a certain data + model are weak enough that we can’t rule out some simple null hypothesis.

I don’t think the p-value is a good measure of the strength of evidence for some claim, and for several reasons I don’t think it makes sense to compare p-values. But using the p-value as one piece of evidence in a larger argument about data quality can make sense.

Finally, the above comments apply not just to p-values but to any method used for null hypothesis significance testing.

### The problem with metrics is a big problem for AI

The practice of optimizing metrics is neither new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so.

### Fall foliage colors mapped

For The Washington Post, Lauren Tierney and Joe Fox mapped fall foliage colors across the United States:

Forested areas in the United States host a variety of tree species. The evergreens shed leaves gradually, as promised in their name. The leaves of deciduous varieties change from green to yellow, orange or red before letting go entirely. Using USDA forest species data, we mapped the thickets of fall colors you may encounter in the densely wooded parts of the country.

Nice. Be sure to click through to the full story to see leaf profiles and an animation of the changing colors as fall arrives.


### Elsevier > Association for Psychological Science

Everyone dunks on Elsevier. But here’s a case where they behaved well. Jordan Anaya points us to this article from Retraction Watch:

In May, [psychology professor Barbara] Fredrickson was last author of a paper in Psychoneuroendocrinology claiming to show that loving-kindness meditation slowed biological aging, specifically that it kept telomeres — which protect chromosomes — from shortening. The paper caught the attention of Harris Friedman, a retired researcher from University of Florida who had scrutinized some of Fredrickson’s past work, for what Friedman, in an interview with Retraction Watch, called an “extraordinary claim.”

Friedman, along with three colleagues, looked deeper. When they did, they found a few issues. One was that the control group in the study seemed to show a remarkably large decrease in telomere length, which made the apparent differences between the groups seem larger. The quartet — Friedman, Nicholas Brown, Douglas MacDonald and James Coyne — also found a coding error.

Friedman and his colleagues wanted to write a piece for the journal that would address all of these issues, but they were told they could submit a letter of only 500 words. They did, and it was published in August. The journal also published a corrigendum about the coding error last month — but only after having changed the article without notice first.

Friedman had hoped that the journal would credit him and his colleagues in the corrigendum, which it did not. But it was a letter that the journal published on August 24 that really caught his eye (as well as the eye of a PubPeer commenter, whose comment was flagged for us.) It read, in its entirety:

As Corresponding Author of “Loving-kindness meditation slows biological aging in novices: Evidence from a 12-week randomized controlled trial,” I decline to respond to the Letter authored by Friedman, MacDonald, Brown and Coyne. I stand by the peer review process that the primary publication underwent to appear in this scholarly journal. Readers should be made aware that the current criticisms continue a long line of misleading commentaries and reanalyses by this set of authors that (a) repeatedly targets me and my collaborators, (b) dates back to 2013, and (c) spans multiple topic areas. I take this history to undermine the professional credibility of these authors’ opinions and approaches.

When Friedman saw the letter, he went straight to the journal’s publisher, Elsevier, and said it was defamatory, and had no business appearing in a peer-reviewed journal.

The journal has now removed the letter, and issued a notice of temporary removal. Fredrickson hasn’t responded to our requests for comment.

As Friedman noted, however, the letter’s language, which is undeniably sharp, is “coming from the loving-kindness researcher.”

Jordan writes:

I didn’t realize Friedman asked the journal to take down the response. To me I would have been happy the response was posted since it made Fredrickson look really bad—if her critics’ points are truly wrong and have been wrong over the course of multiple years then it should be easy for her to dunk on her critics with a scientific response.

I disagree. Mud can stick. Better to have the false statement removed, or at least flagged with a big RETRACTED watermark, rather than having it out there to confuse various outsiders.

Anyway, say what you want about Elsevier. At least they’re willing to retract false and defamatory claims that they publish. The Association for Psychological Science won’t do that. When I pointed out that they’d made false and defamatory statements about me and another researcher, they just refused to do anything.

It’s sad that a purportedly serious professional organization is worse on ethics than a notorious publisher.

But maybe we should look on the bright side. It’s good news that a notorious publisher is better on ethics than a serious professional organization.

At this point I think it would be pretty cool if the Association for Psychological Science would outsource its ethics decisions to Elsevier or some other outside party.

In the meantime, I suggest that Fredrickson send that letter to Perspectives on Psychological Science. They’d probably have no problem with it!

P.S. Fredrickson’s webpage says, “She has authored 100+ peer-reviewed articles and book chapters . . .” I guess they’ll have to change that to 99+.

### issuer: Local issue tracking, no net required


The goal of issuer is to provide a simple issue tracker, hosted on your local file system, for those users who don’t want to or are disallowed from using cloud-based code repositories.

Online code repositories often provide an issue tracker to allow developers, reviewers, and users to report bugs, submit feature requests, and so on. However, many developers either choose to work offline or work on enterprise networks where use of cloud services may be prohibited.

issuer is an add-in for RStudio’s desktop IDE. It works entirely locally, with no requirement for a cloud service or even a network connection.

You can install the development version of issuer from GitHub with:

devtools::install_github("WilDoane/issuer")


### Who is winning the race for Westminster?

See which parties British voters plan to support