# My Data Science Blogs

## August 25, 2019

### Distilled News

Understanding Accuracy, Recall, Precision, ROC, AUC and F1-score. The confusion matrix yields the most complete suite of metrics for evaluating the performance of a classification algorithm such as logistic regression or decision trees. It is typically used for binary classification problems but can be applied to multi-label problems by simply binarizing the output. Put plainly, the confusion matrix can tell us the accuracy, recall, precision, ROC, AUC, and F1-score of a classification model. We shall look at these metrics closely in a few minutes…
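
As a quick illustration of how these metrics fall out of a confusion matrix, here is a minimal sketch in plain Python (the counts are invented for the example):

```python
# Metrics from a binary confusion matrix, with made-up counts for illustration.
tp, fp, fn, tn = 45, 5, 10, 40

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)          # a.k.a. sensitivity / true positive rate
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # → 0.85, 0.9, 0.8181..., 0.8571...
```

Libraries such as scikit-learn compute the same quantities, but the arithmetic above is all a confusion matrix needs.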
There are innumerable texts available on the net about Bayesian networks, but most of them are heavy on mathematical formulas and concepts and thus quite difficult to understand. Here, I have tried to explain the topic as simply as possible, with minimal equations and a real-world example. A Bayesian network is a very important tool for understanding the dependency among events and assigning probabilities to them, thus ascertaining how probable, or what the chance of occurrence of, one event is given another. For example, suppose you get scolded at school by your teacher for being late, and there could be many reasons for being late, such as waking up late or a traffic jam. Here, the scolding depends on events like waking up late or a traffic jam, i.e., these causes have a direct influence on your being scolded. This can be efficiently represented using a Bayesian network, as we will see soon.
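
To make the scolding example concrete, here is a tiny hand-rolled sketch (all probabilities are invented for illustration) that computes the marginal probability of being scolded by enumerating over the parent events:

```python
# A tiny Bayesian network for the "scolded at school" example, evaluated by
# brute-force enumeration. All probabilities are made up for illustration.
p_late_wakeup = 0.2
p_traffic = 0.3
# Conditional probability table: P(scolded | late_wakeup, traffic)
p_scolded = {(True, True): 0.95, (True, False): 0.8,
             (False, True): 0.6, (False, False): 0.05}

# Marginal P(scolded) = sum over parent states of P(parents) * P(scolded | parents)
total = 0.0
for wake in (True, False):
    for jam in (True, False):
        p_parents = (p_late_wakeup if wake else 1 - p_late_wakeup) * \
                    (p_traffic if jam else 1 - p_traffic)
        total += p_parents * p_scolded[(wake, jam)]
print(round(total, 3))  # → 0.341
```

Real Bayesian-network libraries do the same bookkeeping, just far more efficiently over larger graphs.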
Worldwide access to vast amounts of data has changed the business landscape. Competitive marketing depends on knowing how to manage, process, and analyze that data. This article describes the path organizations need to take from collecting data to maximizing its use. Today’s organizations are undergoing a challenging transformation process around their technical systems. The static software platforms that might have stored and processed a business’ data are no longer sustainable in the current web environment. Enterprises need cutting-edge technology to collect big data in real-time, analyze that data, and then get the information they need to stay competitive in today’s marketplace.
Over the past few years, financial-news sentiment analysis has taken off as a commercial natural language processing (NLP) application. Like any other type of sentiment analysis, there are two main approaches: one, more traditional, uses sentiment-labelled word lists (which we will also refer to as dictionaries); the other uses sentiment classifiers based on language models trained on huge corpora (such as Amazon product reviews or IMDB film reviews). For domain-specific sentiment analysis, these latter language models tend to perform poorly. Hardly a surprise: a medical article reads nothing like a film review. In this respect, transfer learning is an interesting, growing field. Currently, though, a dictionary still lies at the core of many domain-specific sentiment-analysis applications. Within finance, those trying to build on open-source resources will likely end up with Notre Dame’s Loughran-McDonald (LM) word lists, which were created by analysing over fifty thousand earnings reports from the 1994-2008 period. This dictionary has been used by, among others, Google, Fidelity, Citadel, Dow Jones, and S&P Global.
Mostly, because of valid data security and privacy concerns. Simply put, data security and privacy concerns are stalling the enterprise adoption of cloud-native AI/ML. These concerns are material, real, and valid, but they conflict with AI/ML’s voracious appetite for data, and until that conflict is addressed, enterprise adoption of cloud-native AI/ML will stall.
A major challenge of text data is extracting meaningful patterns and using those patterns to find actionable insights.
NLP can be thought of as a two-part problem:
• Processing. Converting the text data from its original form into a form the computer can understand. This includes data cleaning and feature extraction.
• Analysis. Using the processed data to extract insights and make predictions.
Here we will focus on the processing step.
One of the main tasks when working with text data is creating text-based features. You might want to find certain patterns in the text, such as email addresses or phone numbers buried in a large body of text. While achieving such functionality may sound fairly trivial, it is much simpler if we use the power of Python’s regex module. For example, let’s say you are tasked with counting the punctuation marks in a particular piece of text (using Shakespeare here). How do you normally go about it?
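
As a sketch of the regex approach (the sample line and pattern are just illustrative choices), counting punctuation takes a single call to Python's built-in re module:

```python
import re

text = "To be, or not to be: that is the question!"  # a Shakespeare line

# [^\w\s] matches any character that is neither word-like nor whitespace,
# which covers standard punctuation in one pass.
count = len(re.findall(r"[^\w\s]", text))
print(count)  # → 3
```

The same one-liner pattern generalizes: swap in a phone-number or email regex and `re.findall` returns the matches themselves rather than just a count.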
In this article I describe how to train a neural network to evaluate loans that are offered on the crowd lending platform Lending Club. I also cover how to test the model, how to adjust the risk in loan selection, and how to use the model to make automatic investments using Lending Club’s API.
In this guide, I’ll use BERT to train a sentiment analysis classifier and Cortex to deploy it as a web API on AWS. The API will autoscale to handle production workloads, support rolling updates so that models can be updated without any downtime, stream logs to make debugging easy, and support inference on CPUs and GPUs.
This post will take a look at the ways some teams have attempted to classify various Data Science roles. It will focus on how job titles are organized at two large companies with mature Data Science programs, and will also introduce some essential non-technical skills from the perspective of one Data Scientist at a smaller organization. It will consider the trajectory of Data Science careers as perceived by industry practitioners and thought leaders. Finally, it will offer recommendations for some Data Science ‘adjacent’ skills to obtain in order to add value and remain relevant in this rapidly evolving industry.
1. zip: Combine Multiple Lists in Python
2. gmplot: Plot the GPS Coordinates in your Dataset on Google Maps
3. category_encoders: Encode your Categorical Variables using 15 Different Encoding Schemes
4. progress_apply: Monitor the Time you Spend on Data Science Tasks
5. pandas_profiling: Generate a Detailed Report of your Dataset
6. grouper: Grouping Time Series Data
7. unstack: Transform the Index into Columns of your Dataframe
8. %matplotlib notebook: Interactive Plots in your Jupyter Notebook
9. %%time: Check the Running Time of a Particular Block of Python Code
10. rpy2: R and Python in the Same Jupyter Notebook!
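
As a taste of the list above, item 1 (zip) needs nothing beyond the standard library; a minimal sketch:

```python
# Tip 1 from the list above: zip combines multiple lists element-wise.
names = ["ada", "grace", "alan"]
scores = [95, 88, 91]

paired = list(zip(names, scores))
print(paired)  # → [('ada', 95), ('grace', 88), ('alan', 91)]

# zip is also handy for building a lookup dict in one line:
lookup = dict(zip(names, scores))
```

The other items (gmplot, category_encoders, pandas_profiling, etc.) are third-party packages and need a pip install first.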
Python library for efficient multi-threaded data processing, with support for out-of-memory datasets. If you are an R user, chances are that you have already been using the data.table package. data.table is an extension of R’s data.frame, and it is the go-to package for R users when it comes to the fast aggregation of large data (including 100GB in RAM). R’s data.table is a very versatile, high-performance package thanks to its ease of use, convenience, and programming speed. It is a fairly famous package in the R community, with over 400k downloads per month and almost 650 CRAN and Bioconductor packages using it (source). So, what is in it for Python users? Well, the good news is that there also exists a Python counterpart to the data.table package, called datatable, which has a clear focus on big data support and high performance, handling both in-memory and out-of-memory datasets with multi-threaded algorithms. In a way, it can be called data.table’s younger sibling.
The big challenge for organizations looking to make use of advanced machine learning is getting access to large volumes of clean, accurate, complete, and well-labeled data to train ML models. More advanced forms of ML like deep learning neural networks require especially large volumes of data to create accurate models. In this 2019 report, AI analyst firm Cognilytica shares valuable insight to help you get to market quickly with data and vendors that you can trust.
• How to optimize and accelerate the data preparation that takes up almost 80% of AI project time
• Important factors to consider when choosing third-party data labeling vendors and data providers
• What solutions and hiring needs to expect from the machine learning and AI data preparation market in the next few years
LSTM is a very special kind of Recurrent Neural Network (RNN) which works, for many tasks, much better than the standard version. Almost all exciting results based on recurrent neural networks have been achieved with LSTMs.
Welcome to this tutorial on single-image super-resolution. The goal of super-resolution (SR) is to recover a high-resolution image from a low-resolution input, or as they might say on any modern crime show, enhance! To accomplish this goal, we will be deploying the super-resolution convolutional neural network (SRCNN) using Keras. This network was published in the paper ‘Image Super-Resolution Using Deep Convolutional Networks’ by Chao Dong et al. in 2014. You can read the full paper at https://…/1501.00092.
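
Before diving in, here is a hedged sketch of what the SRCNN definition might look like in Keras, assuming TensorFlow is installed. The 9-1-5 filter sizes follow the paper; the optimizer and loss here are just a plausible setup, not necessarily what the tutorial uses:

```python
# A sketch of the SRCNN 9-1-5 architecture (Dong et al., 2014) in Keras.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D

inputs = Input(shape=(None, None, 1))                         # luminance channel
x = Conv2D(64, 9, padding="same", activation="relu")(inputs)  # patch extraction
x = Conv2D(32, 1, padding="same", activation="relu")(x)       # non-linear mapping
outputs = Conv2D(1, 5, padding="same")(x)                     # reconstruction
srcnn = Model(inputs, outputs)
srcnn.compile(optimizer="adam", loss="mse")
```

Note that SRCNN takes a bicubically upscaled image as input, so the output has the same spatial dimensions as the input.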
Using non-stationary time series data in forecasting models produces unreliable and spurious results that lead to poor understanding and poor forecasts. The solution is to transform the time series data so that it becomes stationary. ADF and KPSS are quick statistical tests for stationarity that help you understand the data you are dealing with. Most statistical forecasting methods are based on the assumption that the time series is approximately stationary. A stationary series is relatively easy to predict: you simply forecast that its statistical properties will be the same in the future as they have been in the past. Analysis of time series patterns is the first step in converting non-stationary data into stationary data (for example, by trend removal), so that statistical forecasting methods can be applied. There are three fundamental steps to building a quality time series forecasting model: making the data stationary, selecting the right model, and evaluating model accuracy. This article will focus on the first step: making the data stationary.
Civilizations evolve through strategic forgetting of once-vital life skills. But can machines do all our remembering? When I was a student, in the distant past when most computers were still huge mainframes, I had a friend whose PhD advisor insisted that he carry out a long and difficult atomic theory calculation by hand. This led to page after page of pencil scratches, full of mistakes, so my friend finally gave in to his frustration. He snuck into the computer lab one night and wrote a short code to perform the calculation. Then he laboriously copied the output by hand, and gave it to his professor. Perfect, his advisor said – this shows you are a real physicist. The professor was never any the wiser about what had happened. While I’ve lost touch with my friend, I know many others who’ve gone on to forge successful careers in science without mastering the pencil-and-paper heroics of past generations.
In this blog post I will demonstrate how deep learning can be applied even when we don’t have much data. I have created my own custom car-vs-bus classifier with 100 images of each category: the training set has 70 images per category, while the validation set holds the remaining 30.

Continue Reading…

### Dummy Is As Dummy Does

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.]

In the 1975 edition of “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences” by Jacob Cohen, an interesting approach to handling missing values in numeric variables was proposed with the aim of improving on the traditional single-value imputation, as described below:

– First, impute missing values with the mean or median
– Then create a dummy variable to flag the imputed values

In the setting of a regression model, both the imputed and the dummy variables would be included, and the number of independent variables is therefore doubled.
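
A minimal sketch of the two-step recipe, in plain Python with invented data:

```python
# Cohen's dummy-imputation approach: mean-impute each missing value and
# add a 0/1 flag marking which rows were imputed. Toy data for illustration.
from statistics import mean

x = [3.2, None, 5.1, None, 4.4]

observed = [v for v in x if v is not None]
fill = mean(observed)

x_imputed = [v if v is not None else fill for v in x]
x_missing = [0 if v is not None else 1 for v in x]   # the dummy variable

print(x_imputed, x_missing)
```

Both `x_imputed` and `x_missing` would then enter the regression, which is exactly how the variable count ends up doubled.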

Although the aforementioned approach has long been criticized and was eventually abandoned by Cohen himself in a later edition of the book, I was told that this obsolete technique is still being actively used.

Out of my own curiosity, I applied this dummy imputation approach to the data used in https://statcompute.wordpress.com/2019/05/04/why-use-weight-of-evidence and then compared it with the WoE imputation in the context of Logistic Regression.

Below are my observations:

– Since the dummy approach adds a flag variable for each numeric variable with missing values, the final model tends to have more independent variables, which is not desirable in terms of model parsimony. For instance, there are 7 independent variables in the model with dummy imputation and only 5 in the model with the WoE approach.

– The model performance doesn’t seem to justify the use of more independent variables in the regression with dummy imputation. As shown in the output below, the ROC statistic from the model with the WoE approach is significantly better than the one with dummy imputation based on DeLong’s test, which is also consistent with the result of the Vuong test.


Continue Reading…

### How will AI Change Healthcare? Robot-Assisted Surgery and Virtual Nurses are Just the Start

Technology is driving a major revolution in patient care. These days, wearable devices and healthcare applications on smartphones are putting the power into patients’ hands, allowing them to be more involved in their healthcare and in improving their overall health and wellness. At the same time, robots are now making

The post How will AI Change Healthcare? Robot-Assisted Surgery and Virtual Nurses are Just the Start appeared first on Dataconomy.

Continue Reading…

### The US is separating families. There's overwhelming evidence that's bad for kids

Both children and parents face increased mental health risks when they are separated in the immigration process

Recent immigration raids in Mississippi are likely to be followed by more. The raids, which targeted almost 700 Latino immigrants (and by extension, the Latino community) were followed by television images of traumatized children crying out for their parents.

Despite news coverage of the families’ grief, the White House has ordered US Immigration and Customs Enforcement (Ice) agents to identify more workplace targets across the country, according to a report by CNN.


Continue Reading…

### R Packages worth a look

Rational Approximations of Fractional Stochastic Partial Differential Equations (rSPDE)
Functions that compute rational approximations of fractional elliptic stochastic partial differential equations. The package also contains functions for common statistical usage of these approximations. The main reference for the methods is Bolin and Kirchner (2019) <arXiv:1711.04333>; the full citation can be generated by the citation function in R.

Power and Sample Size for Linear Mixed Effect Models (pass.lme)
Power and sample size calculation for testing fixed effect coefficients in multilevel linear mixed effect models with one or more independent populations. Laird, Nan M. and Ware, James H. (1982) <doi:10.2307/2529876>.

Degrees of Freedom Adjustment for Robust Standard Errors (dfadjust)
Computes small-sample degrees of freedom adjustments for heteroskedasticity-robust standard errors, and for clustered standard errors in linear regression. See Imbens and Kolesár (2016) <doi:10.1162/REST_a_00552> for a discussion of these adjustments.

Virome Sequencing Analysis Result Browser (viromeBrowser)
Experiments in which highly complex virome sequencing data is generated are difficult to visualize and unpack for a person without programming experience. After processing of the raw sequencing data by a next generation sequencing (NGS) processing pipeline the usual output consists of contiguous sequences (contigs) in fasta format and an annotation file linking the contigs to a reference sequence. The virome analysis browser app imports an annotation file and a corresponding fasta file containing the annotated contigs. It facilitates browsing of annotations in multiple files and allows users to select and export specific annotated sequences from the associated fasta files. Various annotation quality thresholds can be set to filter contigs from the annotation files. Further inspection of selected contigs can be done in the form of automatic open reading frame (ORF) detection. Separate contigs and/or separate ORFs can be downloaded in nucleotide or amino acid format for further analysis.

Continue Reading…

### What’s new on arXiv

Due to their dynamic nature, chaotic time series are difficult to predict. In conventional signal processing approaches, signals are treated either in the time or in the space domain only. Spatio-temporal analysis of a signal provides more advantages over conventional uni-dimensional approaches by harnessing information from both the temporal and spatial domains. Herein, we propose a spatio-temporal extension of RBF neural networks for the prediction of chaotic time series. The proposed algorithm utilizes the concept of time-space orthogonality and separately deals with the temporal dynamics and spatial non-linearity (complexity) of the chaotic series. The proposed RBF architecture is explored for the prediction of the Mackey-Glass time series and results are compared with the standard RBF. The spatio-temporal RBF is shown to outperform the standard RBFNN by achieving significantly reduced estimation error.
To make efficient use of limited spectral resources, we in this work propose a deep actor-critic reinforcement learning based framework for dynamic multichannel access. We consider both a single-user case and a scenario in which multiple users attempt to access channels simultaneously. We employ the proposed framework as a single agent in the single-user case, and extend it to a decentralized multi-agent framework in the multi-user scenario. In both cases, we develop algorithms for the actor-critic deep reinforcement learning and evaluate the proposed learning policies via experiments and numerical results. In the single-user model, in order to evaluate the performance of the proposed channel access policy and the framework’s tolerance against uncertainty, we explore different channel switching patterns and different switching probabilities. In the case of multiple users, we analyze the probabilities of each user accessing channels with favorable channel conditions and the probability of collision. We also address a time-varying environment to identify the adaptive ability of the proposed framework. Additionally, we provide comparisons (in terms of both the average reward and time efficiency) between the proposed actor-critic deep reinforcement learning framework, Deep-Q network (DQN) based approach, random access, and the optimal policy when the channel dynamics are known.
Graphs have become a crucial way to represent large, complex and often temporal datasets across a wide range of scientific disciplines. However, when graphs are used as input to machine learning models, this rich temporal information is frequently disregarded during the learning process, resulting in suboptimal performance on certain temporal inference tasks. To combat this, we introduce Temporal Neighbourhood Aggregation (TNA), a novel vertex representation model architecture designed to capture both topological and temporal information to directly predict future graph states. Our model exploits hierarchical recurrence at different depths within the graph to enable exploration of changes in temporal neighbourhoods, whilst requiring no additional features or labels to be present. The final vertex representations are created using variational sampling and are optimised to directly predict the next graph in the sequence. Our claims are reinforced by extensive experimental evaluation on both real and synthetic benchmark datasets, where our approach demonstrates superior performance compared to competing methods, out-performing them at predicting new temporal edges by as much as 23% on real-world datasets, whilst also requiring fewer overall model parameters.
Normalization layers are essential in a Deep Convolutional Neural Network (DCNN). Various normalization methods have been proposed. The statistics used to normalize the feature maps can be computed at batch, channel, or instance level. However, in most existing methods, the normalization for each layer is fixed. Batch-Instance Normalization (BIN) is one of the first proposed methods that combine two different normalization methods to achieve diverse normalization for different layers. However, two potential issues exist in BIN: first, the Clip function is not differentiable at input values of 0 and 1; second, the combined feature map does not have a normalized distribution, which is harmful for signal propagation in a DCNN. In this paper, an Instance-Layer Normalization (ILN) layer is proposed, using the Sigmoid function for the feature map combination and cascading group normalization. The performance of ILN is validated on image segmentation of the Right Ventricle (RV) and Left Ventricle (LV) using U-Net as the network architecture. The results show that the proposed ILN outperforms previous traditional and popular normalization methods, with noticeable accuracy improvements for most validations, supporting the effectiveness of the proposed ILN.
We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After this initial learning phase, our agent can quickly adapt to any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.
In this paper, we propose a fully automated system to extend knowledge graphs using external information from web-scale corpora. The designed system leverages a deep learning based technology for relation extraction that can be trained by a distantly supervised approach. In addition to that, the system uses a deep learning approach for knowledge base completion by utilizing the global structure information of the induced KG to further refine the confidence of the newly discovered relations. The designed system does not require any effort for adaptation to new languages and domains as it does not use any hand-labeled data, NLP analytics and inference rules. Our experiments, performed on a popular academic benchmark demonstrate that the suggested system boosts the performance of relation extraction by a wide margin, reporting error reductions of 50%, resulting in relative improvement of up to 100%. Also, a web-scale experiment conducted to extend DBPedia with knowledge from Common Crawl shows that our system is not only scalable but also does not require any adaptation cost, while yielding substantial accuracy gain.
We provide an introduction to a very specific toy model of memristive networks, for which an exact differential equation for the internal memory, containing the Kirchhoff laws, is known. In particular, we highlight how the circuit topology enters the dynamics via an analysis of directed graphs. We also highlight the connection between the asymptotic states of memristors and the Ising model, and the relation to the dynamics and statics of disordered systems.
In this work, we present X-SQL, a new network architecture for the problem of parsing natural language to SQL query. X-SQL proposes to enhance the structural schema representation with the contextual output from BERT-style pre-training model, and together with type information to learn a new schema representation for down-stream tasks. We evaluated X-SQL on the WikiSQL dataset and show its new state-of-the-art performance.
This is an introduction to free probability theory, covering the basic combinatorial and analytic theory, as well as the relations to random matrices and operator algebras. The material is mainly based on the two books of the lecturer, one joint with Nica and one joint with Mingo. Free probability is here restricted to the scalar-valued setting; the operator-valued version is treated in the subsequent lecture series on ‘Non-Commutative Distributions’. The material here was presented in the winter term 2018/19 at Saarland University in 26 lectures of 90 minutes each. The lectures were recorded and can be found online at https://…/index.html
We propose a novel approach for estimating the difficulty and transferability of supervised classification tasks. Unlike previous work, our approach is solution agnostic and does not require or assume trained models. Instead, we estimate these values using an information theoretic approach: treating training labels as random variables and exploring their statistics. When transferring from a source to a target task, we consider the conditional entropy between two such variables (i.e., label assignments of the two tasks). We show analytically and empirically that this value is related to the loss of the transferred model. We further show how to use this value to estimate task hardness. We test our claims extensively on three large scale data sets — CelebA (40 tasks), Animals with Attributes 2 (85 tasks), and Caltech-UCSD Birds 200 (312 tasks) — together representing 437 classification tasks. We provide results showing that our hardness and transferability estimates are strongly correlated with empirical hardness and transferability. As a case study, we transfer a learned face recognition model to CelebA attribute classification tasks, showing state of the art accuracy for tasks estimated to be highly transferable.
Active learning (AL) on attributed graphs has received increasing attention with the prevalence of graph-structured data. Although AL has been widely studied for alleviating label sparsity issues with the conventional independent and identically distributed (i.i.d.) data, how to make it effective over attributed graphs remains an open research question. Existing AL algorithms on graphs attempt to reuse the classic AL query strategies designed for i.i.d. data. However, they suffer from two major limitations. First, different AL query strategies calculated in distinct scoring spaces are often naively combined to determine which nodes should be labelled. Second, the AL query engine and the learning of the classifier are treated as two separate processes, resulting in unsatisfactory performance. In this paper, we propose a SEmi-supervised Adversarial active Learning (SEAL) framework on attributed graphs, which fully leverages the representation power of deep neural networks and devises a novel AL query strategy in an adversarial way. Our framework learns two adversarial components: a graph embedding network that encodes both the unlabelled and labelled nodes into a latent space, expecting to trick the discriminator to regard all nodes as already labelled, and a semi-supervised discriminator network that distinguishes the unlabelled from the existing labelled nodes in the latent space. The divergence score, generated by the discriminator in a unified latent space, serves as the informativeness measure to actively select the most informative node to be labelled by an oracle. The two adversarial components form a closed loop to mutually and simultaneously reinforce each other towards enhancing the active learning performance. Extensive experiments on four real-world networks validate the effectiveness of the SEAL framework with superior performance improvements to state-of-the-art baselines.
A new challenge for knowledge graph reasoning started in 2018. Deep learning has promoted the application of artificial intelligence (AI) techniques to a wide variety of social problems. Accordingly, being able to explain the reason for an AI decision is becoming important to ensure the secure and safe use of AI techniques. Thus, we, the Special Interest Group on Semantic Web and Ontology of the Japanese Society for AI, organized a challenge calling for techniques that reason and/or estimate which characters are criminals while providing a reasonable explanation based on an open knowledge graph of a well-known Sherlock Holmes mystery story. This paper presents a summary report of the first challenge held in 2018, including the knowledge graph construction, the techniques proposed for reasoning and/or estimation, the evaluation metrics, and the results. The first prize went to an approach that formalized the problem as a constraint satisfaction problem and solved it using a lightweight formal method; the second prize went to an approach that used SPARQL and rules; the best resource prize went to a submission that constructed word embedding of characters from all sentences of Sherlock Holmes novels; and the best idea prize went to a discussion multi-agents model. We conclude this paper with the plans and issues for the next challenge in 2019.
Entity alignment is the task of linking entities with the same real-world identity from different knowledge graphs (KGs), which has been recently dominated by embedding-based methods. Such approaches work by learning KG representations so that entity alignment can be performed by measuring the similarities between entity embeddings. While promising, prior works in the field often fail to properly capture complex relation information that commonly exists in multi-relational KGs, leaving much room for improvement. In this paper, we propose a novel Relation-aware Dual-Graph Convolutional Network (RDGCN) to incorporate relation information via attentive interactions between the knowledge graph and its dual relation counterpart, and further capture neighboring structures to learn better entity representations. Experiments on three real-world cross-lingual datasets show that our approach delivers better and more robust results over the state-of-the-art alignment methods by learning better KG representations.
Recent years have witnessed a surge of interest in machine learning on graphs and networks with applications ranging from vehicular network design to IoT traffic management to social network recommendations. Supervised machine learning tasks in networks such as node classification and link prediction require us to perform feature engineering that is known and agreed to be the key to success in applied machine learning. Research efforts dedicated to representation learning, especially representation learning using deep learning, have shown us ways to automatically learn relevant features from vast amounts of potentially noisy, raw data. However, most of the methods are not adequate to handle heterogeneous information networks, which represent most real-world data today. The methods cannot preserve the structure and semantics of multiple types of nodes and links well enough, capture higher-order heterogeneous connectivity patterns, and ensure coverage of nodes for which representations are generated. We propose a novel efficient algorithm, motif2vec, that learns node representations or embeddings for heterogeneous networks. Specifically, we leverage higher-order, recurring, and statistically significant network connectivity patterns in the form of motifs to transform the original graph to motif graph(s), conduct biased random walk to efficiently explore higher order neighborhoods, and then employ a heterogeneous skip-gram model to generate the embeddings. Unlike previous efforts that use different graph meta-structures to guide the random walk, we use graph motifs to transform the original network and preserve the heterogeneity. We evaluate the proposed algorithm on multiple real-world networks from diverse domains and against existing state-of-the-art methods on multi-class node classification and link prediction tasks, and demonstrate its consistent superiority over prior work.
We develop the first quantum algorithm for the constrained portfolio optimization problem. The algorithm has running time $\widetilde{O} \left( n\sqrt{r} \frac{\zeta \kappa}{\delta^2} \log \left(1/\epsilon\right) \right)$, where $r$ is the number of positivity and budget constraints, $n$ is the number of assets in the portfolio, $\epsilon$ the desired precision, and $\delta, \kappa, \zeta$ are problem-dependent parameters related to the well-conditioning of the intermediate solutions. If only a moderately accurate solution is required, our quantum algorithm can achieve a polynomial speedup over the best classical algorithms with complexity $\widetilde{O} \left( \sqrt{r}n^\omega\log(1/\epsilon) \right)$, where $\omega$ is the matrix multiplication exponent that has a theoretical value of around $2.373$, but is closer to $3$ in practice. We also provide some experiments to bound the problem-dependent factors arising in the running time of the quantum algorithm, and these experiments suggest that for most instances the quantum algorithm can potentially achieve an $O(n)$ speedup over its classical counterpart.
Nowadays online social networks are used extensively for personal and commercial purposes. This widespread popularity makes them an ideal platform for advertisements. Social media can be used for both direct and word-of-mouth (WoM) marketing. Although WoM marketing is considered more effective and requires lower advertising cost, it is currently under-utilized. To do WoM marketing, we need to identify a set of people who can use their authoritative position in a social network to promote a given product. In this paper, we show how to do WoM marketing in a Facebook group, which is a question-and-answer type of social network. We also present the concept of reinforced WoM marketing, where multiple authorities together promote a product to increase the effectiveness of marketing. We perform our experiments on a Facebook group dataset consisting of 0.3 million messages and 10 million user reactions.
In this paper, a simple topology of Capsule Network (CapsNet) is investigated for the problem of image colorization. The generative and segmentation capabilities of the original CapsNet topology, which was proposed for the image classification problem, are leveraged for colorization by modifying the network as follows: 1) the original CapsNet model is adapted to map the grayscale input to the output in the CIE Lab colorspace; 2) the feature detector part of the model is updated with deeper feature layers inherited from a pre-trained VGG-19 model, with its weights, in order to transfer low-level image representation capability to this model; 3) the margin loss function is replaced with a Mean Squared Error (MSE) loss to minimize the image-to-image mapping error. The resulting model is named the Colorizer Capsule Network (ColorCapsNet). The performance of ColorCapsNet is evaluated on the DIV2K dataset, and promising results are obtained that motivate further investigation of Capsule Networks for the image colorization problem.
It is no exaggeration to say that since the introduction of Bitcoin, blockchains have become a disruptive technology that has shaken the world. However, the rising popularity of the paradigm has led to a flurry of proposals addressing variations and/or trying to solve problems stemming from the initial specification. This has added considerable complexity to current blockchain ecosystems, amplified by the absence of detail in many accompanying blockchain whitepapers. Through this paper, we set out to explain blockchains in a simple way, taming that complexity through the deconstruction of the blockchain into three simple, critical components common to all known systems: membership selection, consensus mechanism and structure. We propose an evaluation framework with insight into system models, desired properties and analysis criteria, using the decoupled components as criteria. We use this framework to provide clear and intuitive overviews of the design principles behind the analyzed systems and the properties achieved. We hope our effort will help clarify the current state of blockchain proposals and provide directions for the analysis of future proposals.
Recommender Systems are nowadays successfully used by all major web sites (from e-commerce to social media) to filter content and make suggestions in a personalized way. Academic research largely focuses on the value of recommenders for consumers, e.g., in terms of reduced information overload. To what extent and in which ways recommender systems create business value is, however, much less clear, and the literature on the topic is scattered. In this research commentary, we review existing publications on field tests of recommender systems and report which business-related performance measures were used in such real-world deployments. We summarize common challenges of measuring the business value in practice and critically discuss the value of algorithmic improvements and offline experiments as commonly done in academic environments. Overall, our review indicates that various open questions remain both regarding the realistic quantification of the business effects of recommenders and the performance assessment of recommendation algorithms in academia.
Despite a multitude of empirical studies, little consensus exists on whether neural networks are able to generalise compositionally, a controversy that, in part, stems from a lack of agreement about what it means for a neural model to be compositional. As a response to this controversy, we present a set of tests that provide a bridge between, on the one hand, the vast amount of linguistic and philosophical theory about compositionality and, on the other, the successful neural models of language. We collect different interpretations of compositionality and translate them into five theoretically grounded tests that are formulated on a task-independent level. In particular, we provide tests to investigate (i) if models systematically recombine known parts and rules, (ii) if models can extend their predictions beyond the length they have seen in the training data, (iii) if models' composition operations are local or global, (iv) if models' predictions are robust to synonym substitutions, and (v) if models favour rules or exceptions during training. To demonstrate the usefulness of this evaluation paradigm, we instantiate these five tests on a highly compositional data set which we dub PCFG SET and apply the resulting tests to three popular sequence-to-sequence models: a recurrent, a convolution-based and a transformer model. We provide an in-depth analysis of the results that uncovers the strengths and weaknesses of these three architectures and points to potential areas of improvement.
In industrial data analytics, one of the fundamental problems is to utilize the temporal correlation of industrial data to make timely predictions in the production process, such as fault prediction and yield prediction. However, traditional prediction models are fixed while the conditions of the machines change over time, so prediction errors increase as time passes. In this paper, we propose a general data renewal model to deal with this. Combined with a similarity function and a loss function, it estimates when to update the existing prediction model, then updates it according to the evaluation function iteratively and adaptively. We have applied the data renewal model to two prediction algorithms. The experiments demonstrate that the data renewal model can effectively identify changes in the data, and update and optimize the prediction model so as to improve the accuracy of prediction.
Practical application of Reinforcement Learning (RL) often involves risk considerations. We study a generalized approximation scheme for risk measures, based on Monte-Carlo simulations, where the risk measures need not necessarily be \emph{coherent}. We demonstrate that, even in simple problems, measures such as the variance of the reward-to-go do not capture the risk in a satisfactory manner. In addition, we show how a risk measure can be derived from a model's realizations. We propose a neural architecture for estimating the risk and suggest the risk critic architecture that can be used to optimize a policy under general risk measures. We conclude our work with experiments that demonstrate the efficacy of our approach.
Echo State Networks (ESNs) are known for their fast and precise one-shot learning of time series, but they often need good hyper-parameter tuning for best performance. For this, good validation is key, yet usually a single validation split is used. In this rather practical contribution we suggest several schemes for cross-validating ESNs and introduce an efficient algorithm for implementing them. The component that dominates the time complexity of the already quite fast ESN training remains constant (does not scale up with $k$) in our proposed method of doing $k$-fold cross-validation. The component that does scale linearly with $k$ starts dominating only in some not very common situations. Thus, in many situations, $k$-fold cross-validation of ESNs can be done for virtually the same time complexity as a simple single-split validation. Space complexity can also remain the same. We also discuss when the proposed validation schemes for ESNs could be beneficial and empirically investigate them on several different real-world datasets.
Given a sparse rating matrix and an auxiliary matrix of users or items, how can we accurately predict missing ratings considering different data contexts of entities? Many previous studies showed that utilizing additional information alongside rating data helps improve performance. However, existing methods are limited in that 1) they ignore the fact that the data contexts of the rating and auxiliary matrices are different, 2) they have restricted capability of expressing independence information of users or items, and 3) they assume the relation between a user and an item is linear. We propose DaConA, a neural network based method for recommendation with a rating matrix and an auxiliary matrix. DaConA is designed with the following three main ideas. First, we propose a data context adaptation layer to extract pertinent features for different data contexts. Second, DaConA represents each entity with a latent interaction vector and a latent independence vector. Unlike in previous methods, neither of the two vectors is limited in size. Lastly, while previous matrix factorization based methods predict missing values through the inner product of latent vectors, DaConA learns a non-linear function of them via a neural network. We show that DaConA is a generalized algorithm including the standard matrix factorization and the collective matrix factorization as special cases. Through comprehensive experiments on real-world datasets, we show that DaConA provides the state-of-the-art accuracy.
The Shapley value has become a popular method to attribute the prediction of a machine-learning model on an input to its base features. The Shapley value [1] is known to be the unique method that satisfies certain desirable properties, and this motivates its use. Unfortunately, despite this uniqueness result, there are a multiplicity of Shapley values used in explaining a model’s prediction. This is because there are many ways to apply the Shapley value that differ in how they reference the model, the training data, and the explanation context. In this paper, we study an approach that applies the Shapley value to conditional expectations (CES) of sets of features (cf. [2]) that subsumes several prior approaches within a common framework. We provide the first algorithm for the general version of CES. We show that CES can result in counterintuitive attributions in theory and in practice (we study a diabetes prediction task); for instance, CES can assign non-zero attributions to features that are not referenced by the model. In contrast, we show that an approach called the Baseline Shapley (BS) does not exhibit counterintuitive attributions; we support this claim with a uniqueness (axiomatic) result. We show that BS is a special case of CES, and CES with an independent feature distribution coincides with a randomized version of BS. Thus, BS fits into the CES framework, but does not suffer from many of CES’s deficiencies.
One of the challenging questions in time series forecasting is how to find the best algorithm. In recent years, a recommender system scheme has been developed for time series analysis using a meta-learning approach. This system selects the best forecasting method with consideration of the time series characteristics. In this paper, we propose a novel approach focusing on some of the unanswered questions resulting from the use of meta-learning in time series forecasting. Three main gaps in previous works are addressed: analyzing various subsets of top forecasters as inputs for meta-learners; evaluating the effect of forecasting error measures; and assessing the role of the dimensionality of the feature space on the forecasting errors of meta-learners. All of these objectives are achieved with the help of a diverse state-of-the-art pool of forecasters and meta-learners. For this purpose, first, a pool of forecasting algorithms is implemented on the NN5 competition dataset and ranked based on the two error measures. Then, six machine-learning classifiers, known as meta-learners, are trained on the extracted features of the time series in order to assign the most suitable forecasting method for the various subsets of the pool of forecasters. Furthermore, two dimensionality-reduction methods are implemented in order to investigate the role of feature space dimension on the performance of meta-learners. In general, it was found that meta-learners were able to defeat all of the individual benchmark forecasters; this performance was improved even after applying the feature selection method.
Relation extraction aims to extract relational facts from sentences. Previous models mainly rely on manually labeled datasets, seed instances or human-crafted patterns, and distant supervision. However, human annotation is expensive, human-crafted patterns suffer from semantic drift, and distant supervision samples are usually noisy. Domain adaptation methods enable leveraging labeled data from a different but related domain. However, different domains usually have various textual relation descriptions and different label spaces (the source label space is usually a superset of the target label space). To solve these problems, we propose a novel relation-gated adversarial learning model for relation extraction, which extends adversarial-based domain adaptation. Experimental results show that the proposed approach outperforms previous domain adaptation methods on partial domain adaptation and can improve the accuracy of distantly supervised relation extraction through fine-tuning.
Generic image recognition is a fundamental and fairly important visual problem in computer vision. One major challenge of this task lies in the fact that a single image usually contains multiple objects while the labels are still one-hot; another is noisy and sometimes missing labels from human annotation. In this paper, we focus on tackling these challenges in two different image recognition problems with a unified framework: multi-model ensemble and noisy data recognition. As is well known, the best performing deep neural models are usually ensembles of multiple base-level networks, as ensembling can mitigate the variation or noise contained in the dataset. Unfortunately, the space required to store these many networks, and the time required to execute them at runtime, prohibit their use in applications where test sets are large (e.g., ImageNet). In this paper, we present a method for compressing large, complex trained ensembles into a single network, where the knowledge from a variety of trained deep neural networks (DNNs) is distilled and transferred to a single DNN. In order to distill diverse knowledge from different trained (teacher) models, we propose an adversarial-based learning strategy in which we define a block-wise training loss to guide and optimize the predefined student network to recover the knowledge in the teacher models, and simultaneously promote the discriminator network to distinguish teacher vs. student features. Extensive experiments on CIFAR-10/100, SVHN, ImageNet and the iMaterialist Challenge Dataset demonstrate the effectiveness of our MEAL method. On ImageNet, our ResNet-50 based MEAL achieves 21.79%/5.99% top-1/5 val error, which outperforms the original model by 2.06%/1.14%. On the iMaterialist Challenge Dataset, our MEAL obtains a remarkable top-3 improvement of 1.15% (official evaluation metric) over a strong ResNet-101 baseline model.
Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness efficiently.

Continue Reading…

## August 24, 2019

### RcppExamples 0.1.9

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new version of the RcppExamples package is now on CRAN.

The RcppExamples package provides a handful of short, concrete working examples showing how to set up basic R data structures in C++. It also provides a simple example of packaging with Rcpp.

This release brings a number of small fixes, including two from contributed pull requests (extra thanks for those!), and updates the package in a few spots. The NEWS extract follows:

#### Changes in RcppExamples version 0.1.9 (2019-08-24)

• Extended DateExample to use more new Rcpp features

• Do not print DataFrame result twice (Xikun Han in #3)

• Missing parenthesis added in man page (Chris Muir in #5)

• Rewrote StringVectorExample slightly to not run afoul of the -Wnoexcept-type warning for C++17-related name mangling changes

• Updated NAMESPACE and RcppExports.cpp to add registration

• Removed the no-longer-needed #define for new Datetime vectors

Courtesy of CRANberries, there is also a diffstat report for the most recent release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Continue Reading…

### If you did not already know

Joint Accuracy- and Latency-Aware Deep Structure Decoupling (JALAD)
Recent years have witnessed a rapid growth of deep-network based services and applications. A practical and critical problem has thus emerged: how to effectively deploy deep neural network models such that they can be executed efficiently. Conventional cloud-based approaches usually run the deep models in data center servers, causing large latency because a significant amount of data has to be transferred from the edge of the network to the data center. In this paper, we propose JALAD, a joint accuracy- and latency-aware execution framework, which decouples a deep neural network so that a part of it runs at edge devices and the other part inside the conventional cloud, while only a minimum amount of data has to be transferred between them. Though the idea seems straightforward, we face challenges including i) how to find the best partition of a deep structure; ii) how to deploy the component at an edge device that has only limited computation power; and iii) how to minimize the overall execution latency. Our answers to these questions are a set of strategies in JALAD, including 1) a normalization-based in-layer data compression strategy that jointly considers compression rate and model accuracy; 2) a latency-aware deep decoupling strategy to minimize the overall execution latency; and 3) an edge-cloud structure adaptation strategy that dynamically changes the decoupling for different network conditions. Experiments demonstrate that our solution can significantly reduce the execution latency: it speeds up the overall inference execution with a guaranteed model accuracy loss. …
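The split-point search at the heart of such edge-cloud decoupling can be illustrated with a toy calculation (a sketch only; the function name, cost model, and all numbers below are made up for illustration, not JALAD's actual strategy or measurements). For each candidate split, total latency is edge compute up to the split, plus transfer of the intermediate data, plus cloud compute for the remaining layers:

```python
def best_split(edge_ms, cloud_ms, out_kb, bandwidth_kb_ms):
    """Return (split_index, latency): run layers [0..split) on the edge,
    the rest in the cloud. split == 0 means everything runs in the cloud."""
    n = len(edge_ms)
    best = None
    for split in range(n + 1):
        edge = sum(edge_ms[:split])          # compute on the slow edge device
        cloud = sum(cloud_ms[split:])        # compute on the fast cloud server
        # data crossing the network is the output at the split point
        # (or the raw input when split == 0)
        transfer = out_kb[split] / bandwidth_kb_ms
        total = edge + transfer + cloud
        if best is None or total < best[1]:
            best = (split, total)
    return best

# hypothetical per-layer costs: edge is ~10x slower than cloud,
# and intermediate outputs shrink as we go deeper into the network
edge_ms = [5.0, 20.0, 40.0, 10.0]
cloud_ms = [0.5, 2.0, 4.0, 1.0]
out_kb = [600.0, 300.0, 80.0, 20.0, 4.0]  # out_kb[i]: data size if split before layer i

split, latency = best_split(edge_ms, cloud_ms, out_kb, bandwidth_kb_ms=10.0)
print(split, round(latency, 1))  # → 2 38.0
```

With these made-up numbers, running the first two layers at the edge wins: deeper splits pay too much edge compute, while shallower splits pay too much transfer, which is exactly the trade-off a latency-aware decoupling strategy navigates.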

Hierarchically-Clustered Representation Learning (HCRL)
The joint optimization of representation learning and clustering in the embedding space has experienced a breakthrough in recent years. In spite of the advance, clustering with representation learning has been limited to flat-level categories, which often involves cohesive clustering with a focus on instance relations. To overcome the limitations of flat clustering, we introduce hierarchically-clustered representation learning (HCRL), which simultaneously optimizes representation learning and hierarchical clustering in the embedding space. Compared with a few prior works, HCRL firstly attempts to consider a generation of deep embeddings from every component of the hierarchy, not just leaf components. In addition to obtaining hierarchically clustered embeddings, we can reconstruct data by the various abstraction levels, infer the intrinsic hierarchical structure, and learn the level-proportion features. We conducted evaluations with image and text domains, and our quantitative analyses showed competent likelihoods and the best accuracies compared with the baselines. …

Topic Model
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is. …
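The "about 9 times more dog words" arithmetic follows directly from the mixture view: under a topic model, a word's probability in a document is the document's topic proportions weighted by each topic's word distribution. A minimal sketch with made-up distributions:

```python
# Two hypothetical topics with toy word distributions
topics = {
    "cats": {"cat": 0.5, "meow": 0.4, "the": 0.1},
    "dogs": {"dog": 0.5, "bone": 0.4, "the": 0.1},
}
# a document that is 10% about cats and 90% about dogs
doc_mix = {"cats": 0.1, "dogs": 0.9}

vocab = {w for dist in topics.values() for w in dist}
# p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)
p_word = {w: sum(doc_mix[t] * topics[t].get(w, 0.0) for t in topics)
          for w in vocab}

print(round(p_word["dog"] / p_word["cat"], 1))  # → 9.0
```

Topic modeling inverts this generative story: given only the observed word counts, it recovers the topic word distributions and each document's mixing proportions.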

Datification / Datafication
A concept that tracks the conception, development, storage and marketing of all types of data, both for business and life. It has grown in popularity of late to capture how data measures things and organizations in order to compete and win. It is about making business visible. …

Continue Reading…

### Visualizing the relationship between multiple variables

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical.

For all the code in this post in one file, click here.

The GGally::ggpairs() function does a really good job of visualizing the pairwise relationship for a group of variables. Let’s demonstrate this on a small segment of the vehicles dataset from the fueleconomy package:

library(fueleconomy)
data(vehicles)
df <- vehicles[1:100, ]
str(df)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	100 obs. of  12 variables:
#  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
#  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
#  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
#  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
#  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
#  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
#  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
#  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
#  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
#  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
#  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
#  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...


Let’s see how GGally::ggpairs() visualizes relationships between quantitative variables:

library(GGally)
quant_df <- df[, c("cyl", "hwy", "cty")]
ggpairs(quant_df)


• Along the diagonal, we see a density plot for each of the variables.
• Below the diagonal, we see scatterplots for each pair of variables.
• Above the diagonal, we see the (Pearson) correlation between each pair of variables.

The visualization changes a little when we have a mix of quantitative and categorical variables. Below, fuel is a categorical variable while hwy is a quantitative variable.

mixed_df <- df[, c("fuel", "hwy")]
ggpairs(mixed_df)


• For a categorical variable on the diagonal, we see a barplot depicting the number of times each category appears.
• In one of the corners (top-right), for each categorical value we have a boxplot for the quantitative variable.
• In one of the corners (bottom-left), for each categorical value we have a histogram for the quantitative variable.

The only behavior for GGally::ggpairs() we haven’t observed yet is for a pair of categorical variables. In the code fragment below, all 3 variables are categorical:

cat_df <- df[, c("fuel", "make", "drive")]
ggpairs(cat_df)


For each pair of categorical variables, we have a barplot depicting the number of times each pair of categorical values appears.

You may have noticed that the plots above the diagonal are essentially transposes of the plots below the diagonal, and so they don't really convey any more information. What follows below is my attempt to make the plots above the diagonal more useful. Instead of plotting the transposed barplot, I will plot a heatmap showing the relative proportion of observations having each pair of categorical values.

First, the scaffold for the plot. I will use the gridExtra package to put several ggplot2 objects together. The code below puts the same barplots below the diagonal, variable names along the diagonal, and empty canvases above the diagonal. (Notice that I need some tricks to make the barplots with the variables as strings, namely the use of aes_string() and as.formula() within facet_grid().)

library(gridExtra)
library(tidyverse)
grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
  for (j in 1:ncol(cat_df)) {
    idx <- idx + 1

    # get feature names (note that i & j are reversed)
    x_feat <- names(cat_df)[j]
    y_feat <- names(cat_df)[i]

    if (i < j) {
      # empty canvas
      grobs[[idx]] <- ggplot() + theme_void()
    } else if (i == j) {
      # just the name of the variable
      label_df <- data.frame(x = 0, y = 0, label = x_feat)
      grobs[[idx]] <- ggplot(label_df, aes(x = x, y = y, label = label),
                             fontface = "bold", hjust = 0.5) +
        geom_text() +
        coord_cartesian(xlim = c(-1, 1), ylim = c(-1, 1)) +
        theme_void()
    } else {
      # 2-dimensional barplot
      grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
        geom_bar() +
        facet_grid(as.formula(paste(y_feat, "~ ."))) +
        theme(legend.position = "none", axis.title = element_blank())
    }
  }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))


This is essentially showing the same information as GGally::ggpairs(). To add the frequency proportion heatmaps, replace the code in the (i < j) branch with the following:

# frequency proportion heatmap
# get frequency proportions
freq_df <- cat_df %>%
  group_by_at(c(x_feat, y_feat)) %>%
  summarize(proportion = n() / nrow(cat_df)) %>%
  ungroup()

# get all pairwise combinations of values
temp_df <- expand.grid(unique(cat_df[[x_feat]]),
                       unique(cat_df[[y_feat]]))
names(temp_df) <- c(x_feat, y_feat)

# join to get frequency proportion
temp_df <- temp_df %>%
  left_join(freq_df, by = c(setNames(x_feat, x_feat),
                            setNames(y_feat, y_feat))) %>%
  replace_na(list(proportion = 0))

grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) +
  geom_tile(aes(fill = proportion)) +
  geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
  scale_fill_gradient(low = "white", high = "#007acc") +
  theme(legend.position = "none", axis.title = element_blank())


Notice that each heatmap has its own limits for the color scale. If you want to have the same color scale for all the plots, you can add limits = c(0, 1) to the scale_fill_gradient() layer of the plot.

The one thing we lose here over the GGally::ggpairs() version is the marginal barplot for each variable. This is easy to add but then we don’t really have a place to put the variable names. Replacing the code in the (i == j) branch with the following is one possible option.

# df for positioning the variable name
label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2,
                       y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)
# marginal barplot with variable name on top
grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
  geom_bar() +
  geom_label(data = label_df, aes(x = x, y = y, label = label),
             size = 5)


In this final version, we clean up some of the axes so that more of the plot space can be devoted to the plot itself, not the axis labels:

theme_update(legend.position = "none", axis.title = element_blank())

grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
  for (j in 1:ncol(cat_df)) {
    idx <- idx + 1

    # get feature names (note that i & j are reversed)
    x_feat <- names(cat_df)[j]
    y_feat <- names(cat_df)[i]

    if (i < j) {
      # frequency proportion heatmap
      # get frequency proportions
      freq_df <- cat_df %>%
        group_by_at(c(x_feat, y_feat)) %>%
        summarize(proportion = n() / nrow(cat_df)) %>%
        ungroup()

      # get all pairwise combinations of values
      temp_df <- expand.grid(unique(cat_df[[x_feat]]),
                             unique(cat_df[[y_feat]]))
      names(temp_df) <- c(x_feat, y_feat)

      # join to get frequency proportion
      temp_df <- temp_df %>%
        left_join(freq_df, by = c(setNames(x_feat, x_feat),
                                  setNames(y_feat, y_feat))) %>%
        replace_na(list(proportion = 0))

      grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) +
        geom_tile(aes(fill = proportion)) +
        geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
        scale_fill_gradient(low = "white", high = "#007acc") +
        theme(axis.ticks = element_blank(), axis.text = element_blank())
    } else if (i == j) {
      # df for positioning the variable name
      label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2,
                             y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)
      # marginal barplot with variable name on top
      grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
        geom_bar() +
        geom_label(data = label_df, aes(x = x, y = y, label = label),
                   size = 5)
    } else {
      # 2-dimensional barplot
      grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
        geom_bar() +
        facet_grid(as.formula(paste(y_feat, "~ ."))) +
        theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())
    }
  }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Magister Dixit

“Data Scientist may be a prestigious title but it doesn’t reflect our area of specialization or the depth of our experience. As legions of newly minted Data Scientists are granted degrees over the next few years the problem for both employee and employer will only grow worse.” William Vorhies (August 11, 2015)

### Finding out why

Python Library: xswap

Python-wrapped C/C++ library for degree-preserving network randomization: XSwap is an algorithm for degree-preserving network randomization (permutation). Permuted networks can be used for a number of purposes in network analysis, including for generating counterfactual distributions of features when only the network’s degree sequence is maintained or for computing a prior probability of an edge given only the network’s degree sequence. Overall, permuted networks allow one to quantify the effects of degree on analysis and prediction methods. Understanding this effect is useful when a network’s degree sequence is subject to biases. This implementation is a modified version of the algorithm due to Hanhijärvi et al. with two additional parameters (allow_self_loops and allow_antiparallel), which enable greater generalizability to bipartite, directed, and undirected networks.
There are some mind-blowing facts and games related to mathematics that even mathematics cannot explain. I just like to call those facts ‘The Abracadabra of Mathematics’. Let’s take a look at one of those games or tricks. Assume that you are in a classroom with at least 25 students and I am the instructor. I give each of you a blank paper on which to write a digit, any number from 0 to 9, inclusive. After you have written your digit and folded your paper, I will collect the papers. Of course I have no idea what you wrote on the paper. However, I guarantee that I will know the most-written number in the classroom.
Explainable models in Artificial Intelligence are often employed to ensure the transparency and accountability of AI systems. The fidelity of the explanations depends on the algorithms used as well as on the fidelity of the data. Many real-world datasets have missing values that can greatly influence explanation fidelity. The standard way to deal with such scenarios is imputation. This can, however, lead to situations where the imputed values correspond to counterfactual settings. Acting on explanations from AI models with imputed values may lead to unsafe outcomes. In this paper, we explore different settings where AI models with imputation can be problematic and describe ways to address such scenarios.
Modern control theories such as systems engineering approaches try to solve nonlinear system problems by revealing causal relationships or correlations among the components; most of those approaches focus on controlling sophisticatedly modeled white-box systems. We suggest applying an actor-critic reinforcement learning approach to control a nonlinear, complex, black-box system. We demonstrated this approach on an artificial greenhouse environment simulator, all of whose control inputs have several side effects, so a human cannot easily figure out how to control the system. Our approach succeeded in maintaining the target conditions at least 20 times longer than PID control and Deep Q-Learning.
We address measurement error bias in propensity score (PS) analysis due to covariates that are latent variables. In the setting where latent covariate $X$ is measured via multiple error-prone items $\mathbf{W}$, PS analysis using several proxies for $X$ — the $\mathbf{W}$ items themselves, a summary score (mean/sum of the items), or the conventional factor score (cFS , i.e., predicted value of $X$ based on the measurement model) — often results in biased estimation of the causal effect, because balancing the proxy (between exposure conditions) does not balance $X$. We propose an improved proxy: the conditional mean of $X$ given the combination of $\mathbf{W}$, the observed covariates $Z$, and exposure $A$, denoted $X_{WZA}$. The theoretical support, which applies whether $X$ is latent or not (but is unobserved), is that balancing $X_{WZA}$ (e.g., via weighting or matching) implies balancing the mean of $X$. For a latent $X$, we estimate $X_{WZA}$ by the inclusive factor score (iFS) — predicted value of $X$ from a structural equation model that captures the joint distribution of $(X,\mathbf{W},A)$ given $Z$. Simulation shows that PS analysis using the iFS substantially improves balance on the first five moments of $X$ and reduces bias in the estimated causal effect. Hence, within the proxy variables approach, we recommend this proxy over existing ones. We connect this proxy method to known results about weighting/matching functions (Lockwood & McCaffrey, 2016; McCaffrey, Lockwood, & Setodji, 2013). We illustrate the method in handling latent covariates when estimating the effect of out-of-school suspension on risk of later police arrests using Add Health data.
With recent advances in deep learning, neuroimaging studies increasingly rely on convolutional networks (ConvNets) to predict diagnosis based on MR images. To gain a better understanding of how a disease impacts the brain, the studies visualize the salience maps of the ConvNet highlighting voxels within the brain majorly contributing to the prediction. However, these salience maps are generally confounded, i.e., some salient regions are more predictive of confounding variables (such as age) than the diagnosis. To avoid such misinterpretation, we propose in this paper an approach that aims to visualize confounder-free saliency maps that only highlight voxels predictive of the diagnosis. The approach incorporates univariate statistical tests to identify confounding effects within the intermediate features learned by ConvNet. The influence from the subset of confounded features is then removed by a novel partial back-propagation procedure. We use this two-step approach to visualize confounder-free saliency maps extracted from synthetic and two real datasets. These experiments reveal the potential of our visualization in producing unbiased model-interpretation.

### If you did not already know

Intersection over Union (IoU)
Intersection over Union (IoU) is the most popular evaluation metric used in the object detection benchmarks.
“Jaccard Index”
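
To make the definition concrete, here is a small R sketch of IoU for two axis-aligned boxes; the c(xmin, ymin, xmax, ymax) box encoding and the example boxes are my own illustration, not from the entry:

```r
# IoU of two axis-aligned boxes, each given as c(xmin, ymin, xmax, ymax)
iou <- function(a, b) {
  ix <- max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap width (0 if disjoint)
  iy <- max(0, min(a[4], b[4]) - max(a[2], b[2]))  # overlap height (0 if disjoint)
  inter <- ix * iy
  area_a <- (a[3] - a[1]) * (a[4] - a[2])
  area_b <- (b[3] - b[1]) * (b[4] - b[2])
  inter / (area_a + area_b - inter)                # intersection / union
}

iou(c(0, 0, 2, 2), c(1, 1, 3, 3))  # intersection 1, union 4 + 4 - 1 = 7, so 1/7
```

The same number is the Jaccard index of the two boxes viewed as point sets, which is why the two names are used interchangeably.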

Self-Paced Sparse Coding (SPSC)
Sparse coding (SC) is attracting more and more attention due to its comprehensive theoretical studies and its excellent performance in many signal processing applications. However, most existing sparse coding algorithms are nonconvex and are thus prone to becoming stuck into bad local minima, especially when there are outliers and noisy data. To enhance the learning robustness, in this paper, we propose a unified framework named Self-Paced Sparse Coding (SPSC), which gradually include matrix elements into SC learning from easy to complex. We also generalize the self-paced learning schema into different levels of dynamic selection on samples, features and elements respectively. Experimental results on real-world data demonstrate the efficacy of the proposed algorithms. …

PublicSelf Model
Most of agents that learn policy for tasks with reinforcement learning (RL) lack the ability to communicate with people, which makes human-agent collaboration challenging. We believe that, in order for RL agents to comprehend utterances from human colleagues, RL agents must infer the mental states that people attribute to them because people sometimes infer an interlocutor’s mental states and communicate on the basis of this mental inference. This paper proposes PublicSelf model, which is a model of a person who infers how the person’s own behavior appears to their colleagues. We implemented the PublicSelf model for an RL agent in a simulated environment and examined the inference of the model by comparing it with people’s judgment. The results showed that the agent’s intention that people attributed to the agent’s movement was correctly inferred by the model in scenes where people could find certain intentionality from the agent’s behavior. …

Event-triggered Learning
Efficient exchange of information is an essential aspect of intelligent collective behavior. Event-triggered control and estimation achieve some efficiency by replacing continuous data exchange between agents with intermittent, or event-triggered communication. Typically, model-based predictions are used at times of no data transmission, and updates are sent only when the prediction error grows too large. The effectiveness in reducing communication thus strongly depends on the quality of the prediction model. In this article, we propose event-triggered learning as a novel concept to reduce communication even further and to also adapt to changing dynamics. By monitoring the actual communication rate and comparing it to the one that is induced by the model, we detect mismatch between model and reality and trigger learning of a new model when needed. Specifically for linear Gaussian dynamics, we derive different classes of learning triggers solely based on a statistical analysis of inter-communication times and formally prove their effectiveness with the aid of concentration inequalities. …

### R Packages worth a look

Peak Functions for Peak Detection in Univariate Time Series (scorepeak)
Provides peak functions, which enable us to detect peaks in time series. The methods implemented in this package are based on Girish Keshav Palshikar (2009) <https://…ms_for_Peak_Detection_in_Time-Series>.

Utility Functions for Large-scale Data (bigutilsr)
Utility functions for large-scale data. For now, package ‘bigutilsr’ mainly includes functions for outlier detection.

Bootstrap Tests for Equality of 2, 3, or 4 Population Variances (testequavar)
Tests the hypothesis that variances are homogeneous or not using the bootstrap. The procedure uses a variance-based statistic and is derived from a normal-theory test. The test equivalently expresses the hypothesis as a function of the log contrasts of the population variances. A box-type acceptance region is constructed to test the hypothesis. See Cahoy (2010) <doi:10.1016/j.csda.2010.04.012>.

Item Response Theory Models (Rirt)
Parameter estimation, computation of probability, information, and (log-)likelihood, and visualization of item/test characteristic curves and item/test information functions for three uni-dimensional item response theory models: the 3-parameter-logistic model, generalized partial credit model, and graded response model. The full documentation and tutorials are at <https://…/Rirt>.

### Science and Technology links (August 24th 2019)

1. The net contribution of the Amazon ecosystem to the world’s oxygen is effectively zero. Furthermore, there is a lot of oxygen in the air and it would be excessively difficult to lower or increase the ratio. In effect, we are not going to run out of oxygen on Earth any time soon, even if we tried to.
2. According to Nature, the Earth is getting greener. The main driver is climate change and CO2 which acts as a fertilizer. A large fraction of this greening is happening in China and India. These countries have increased their food production by 35% since 2000.
3. Vitamin D supplements appear to reduce cancer risks. Vitamin D supplements are typically available as either vitamin D2 or D3. We know that D2 supplements do not lower mortality risks. We do not have enough data to assess vitamin D3. (Zhang et al., 2019)
4. There is no evidence that plastic particles in water are harmful to human beings.
5. CNBC reports that Mackey, a vegan who runs the successful Whole Foods chain, is critical of the fashionable Beyond Meat “plant-based meat”:

I don’t think eating highly processed foods is healthy. I think people thrive on eating whole foods. As for health, I will not endorse that, and that is about as big of criticism that I will do in public.

Meanwhile the stock price of Beyond Meat went up 800% in a few months.

### More on why Cass Sunstein should be thanking, not smearing, people who ask for replications

Recently we discussed law professor and policy intellectual Cass Sunstein’s statement that people who ask for social science findings to be replicated are like the former East German secret police.

In that discussion I alluded to a few issues:

1. The replication movement is fueled in large part by high-profile work, lauded by Sunstein and other Ivy League luminaries, that did not replicate.

2. Until outsiders loudly criticized the unreplicated work, those unreplicated claims were essentially uncriticized in the popular and academic press. And the criticism had to be loud, Loud, LOUD. Recall the Javert paradox.

3. That work wasn’t just Gladwell and NPR-bait, it also had real-world implications.

For example, check this out from the Nudge blog, several years ago:

As noted above, Sunstein had no affiliation with that blog. My point is that his brand was, unwittingly, promoting bad research.

And this brings me to my main point for today. Sunstein likens research critics to the former East German secret police, echoing something that a psychology professor wrote a few years ago regarding “methodological terrorists.” But . . . without these hateful people who are some cross between the Stasi and Al Qaeda, those destructive little second-stringers etc. . . . without them, Sunstein would I assume still be promoting claims based on garbage research. (And, yes, sure, Wansink’s claims could still be true, research flaws notwithstanding: It’s possible that the guy just had a great intuition about behavior and was right every time—but then it’s still a mistake to present those intuitions as being evidence-based.)

For example, see this recent post:

The link states that “A field study and a laboratory study with American participants found that calorie counts to the left (vs. right) decreased calories ordered by 16.31%.” 16.31%, huh? OK, I’ll believe it when it’s replicated for real, not before. The point is that, without the research critics—including aggressive research critics, the Javerts who annoy Sunstein and his friends so much—junk science would expand until it entirely filled up the world of policy analysis. Gresham, baby, Gresham.

So, again, Sunstein should be thanking, not smearing, people who ask for replications.

The bearer of bad tidings is your friend, not your enemy.

P.S. Probably not a good idea to believe anything Brian Wansink has ever written, at least not until you see clearly documented replication. This overview by Elizabeth Nolan Brown gives some background on the problems with Wansink’s work, along with discussions of some political concerns:

For the better half of a decade, American public schools have been part of a grand experiment in “choice architecture” dressed up as simple, practical steps to spur healthy eating. But new research reveals the “Smarter Lunchrooms” program is based largely on junk science.

Smarter Lunchrooms, launched in 2010 with funding from the U.S. Department of Agriculture (USDA) . . . is full of “common sense,” TED-talk-ready, Malcolm Gladwell-esque insights into how school cafeterias can encourage students to select and eat more nutritious foods. . . . This “light touch” is the foundation upon which Wansink, a former executive director of the USDA’s Center for Nutrition Policy and Promotion and a drafter of U.S. Dietary Guidelines, has earned ample speaking and consulting gigs and media coverage. . . .

The first serious study testing the program’s effectiveness was published just this year. At the end of nine weeks, students in Smarter Lunchroom cafeterias consumed an average of 0.10 more fruit units per day—the equivalent of about one or two bites of an apple. Wansink and company called it a “significant” increase in fruit consumption.

But “whether this increase is meaningful and has real world benefit is questionable,” Robinson* writes.

Nonetheless, the USDA claims that the “strategies that the Smarter Lunchrooms Movement endorses have been studied and proven effective in a variety of schools across the nation.” More than 29,000 U.S. public schools now employ Smarter Lunchrooms strategies, and the number of school food service directors trained on these tactics increased threefold in 2015 over the year before.

Also this:

One study touted by the USDA even notes that since food service directors who belong to professional membership associations were more likely to know about the Smarter Lunchrooms program, policy makers and school districts “consider allocating funds to encourage [directors] to engage more fully in professional association meetings and activities.”

But now that Wansink’s work has been discredited, the government will back off and stop wasting all this time and money, right?

Ummm . . .

A spokesman for the USDA told The Washington Post that while they had some concerns about the research coming out of Cornell, “it’s important to remember that Smarter Lunchrooms strategies are based upon widely researched principles of behavioral economics, as well as a strong body of practice that supports their ongoing use.”

Brown summarizes:

We might disagree on whether federal authorities should micromanage lunchroom menus or if local school districts should have more control, and what dietary principles they should follow; whether the emphasis of school cafeterias should be fundraising or nutrition; or whether school meals need more funding. But confronting these challenges head-on is a hell of a lot better than a tepid consensus for feel-good fairytales about banana placement.

Or celebrating the “coolest behavioral finding of 2019.”

P.P.S. I did some internet searching and came across this tweet by Sunstein from 2018:

On one hand I appreciate that dude linked to our blog. On the other hand . . . I hate twitter! What does Sunstein mean by saying my post was “ill-considered and graceless”? I’d appreciate knowing what exactly I wrote was “ill-considered” and what exactly was graceless? I’m soooo sick of the happy-talk Harvard world where every bit of research is “cool” and scientists are all pals, promoting each other’s work and going on NPR.

Again:

Science is hard. I appreciate the value of public intellectuals like Sunstein, who read up on the latest science and collaborate with scientists and do their best to adapt scientific ideas to the real world, to influence public and private decision making in a positive way. I don’t agree with every one of Sunstein’s ideas, but then again I don’t agree with every one of any policy entrepreneurs’ ideas. That’s fine: it’s not their job to make me agree with them. Policy is controversial, and that’s part of the point too.

OK, fine. The point is, science-based policy advice ultimately depends on science—or, at least it should. And if you want to depend on science, you need to be open to criticism. Not every “cool” or “masterpiece” idea is correct. Even Ivy League professors make mistakes, all the time. (I know it: I’ve been an Ivy League professor forever, and I make lots of mistakes all the time, all by myself.)

So to take legitimate criticism—in this case, when someone sent me a New York Times article that included the following passage:

Knowing a person’s political leanings should not affect your assessment of how good a doctor she is — or whether she is likely to be a good accountant or a talented architect. But in practice, does it? Recently we conducted an experiment to answer that question. Our study . . . found that knowing about people’s political beliefs did interfere with the ability to assess those people’s expertise in other, unrelated domains.

and I pointed out that, no, the research article in question never said anything about doctors, accountants, architects, or any professional skills . . . When I did that, I was helping out Sunstein. I was pointing out that he was over-claiming. If you want to make science-based policy and you want to do it right, you want to avoid going beyond what your data are telling you. Or, if you want to extrapolate, then do so, but make the extrapolation step clear.

Is it “ill-considered and graceless” to point out an error in published work? No, it’s not. I can’t speak to graceless, but I think it was ill-considered to refer to Brian Wansink’s work as “masterpieces” and I think it was ill-considered to write something in the New York Times that mischaracterized their work. I can understand how such things happen—we all make mistakes, even in our areas of expertise—but, again, criticism can help us do better.

Again, I hate twitter because it encourages people to sling insults (“ill-considered,” “graceless,” “Stasi,” etc.) without backing them up. If you knew you had to back it up, you might reconsider the insults.

### Magister Dixit

“A computer certainly may not reason as well as a scientist but the little it can, logically and objectively, may contribute greatly when applied to our entire body of knowledge.” Dr. Olivier Lichtarge

### Document worth reading: “A Survey of Optimization Methods from a Machine Learning Perspective”

Machine learning is developing rapidly; it has made many theoretical breakthroughs and is widely applied in various fields. Optimization, as an important part of machine learning, has attracted much attention from researchers. With the exponential growth of data and the increase in model complexity, optimization methods in machine learning face more and more challenges. A lot of work on solving optimization problems or improving optimization methods in machine learning has been proposed. A systematic retrospective and summary of optimization methods from the perspective of machine learning is of great significance, as it can offer guidance for the development of both optimization and machine learning research. In this paper, we first describe the optimization problems in machine learning. Then, we introduce the principles and progress of commonly used optimization methods. Next, we summarize the applications and developments of optimization methods in some popular machine learning fields. Finally, we explore and give some challenges and open problems for optimization in machine learning. A Survey of Optimization Methods from a Machine Learning Perspective

### Let’s get it right

President-elect of the European Commission Ursula von der Leyen made clear in her recently unveiled policy agenda that not only will artificial intelligence (AI) be a key component of European digital strategy, but the cornerstone of the European AI plan will be to develop ‘AI made in Europe’ that is more ethical than AI made anywhere else in the world. What this means is not always clear, since there is no universal consensus on ethics. However, most European policymakers are less concerned about the ‘what’ and more about the ‘why.’ As explained by former Vice-President for the Digital Single Market Andrus Ansip, ‘Ethical AI is a win-win proposition that can become a competitive advantage for Europe.’ This idea that Europe can become the global leader in AI simply by creating the most ethical AI systems, rather than by competing to build the best-performing ones, has become the conventional wisdom in Brussels, repeated ad nauseam by those tasked with charting a course for Europe’s AI future. But it is a delusion built on three fallacies: that there is a market for AI that is ethical-by-design, that other countries are not interested in AI ethics, and that Europeans have a competitive advantage in producing AI systems that are more ethical than those produced elsewhere.
Civilizations evolve through strategic forgetting of once-vital life skills. But can machines do all our remembering? When I was a student, in the distant past when most computers were still huge mainframes, I had a friend whose PhD advisor insisted that he carry out a long and difficult atomic theory calculation by hand. This led to page after page of pencil scratches, full of mistakes, so my friend finally gave in to his frustration. He snuck into the computer lab one night and wrote a short code to perform the calculation. Then he laboriously copied the output by hand, and gave it to his professor. Perfect, his advisor said – this shows you are a real physicist. The professor was never any the wiser about what had happened. While I’ve lost touch with my friend, I know many others who’ve gone on to forge successful careers in science without mastering the pencil-and-paper heroics of past generations.

Article: AI and Collective Action

Towards a more responsible development of artificial intelligence, with a research paper from OpenAI. On the 10th of July, team members of OpenAI released a paper on arXiv called The Role of Cooperation in Responsible AI Development, by Amanda Askell, Miles Brundage and Gillian Hadfield. One of the main statements in the article goes as follows: ‘Competition between AI companies could decrease the incentives of each company to develop responsibly by increasing their incentives to develop faster. As a result, if AI companies would prefer to develop AI systems with risk levels that are closer to what is socially optimal – as we believe many do – responsible AI development can be seen as a collective action problem.’ How, then, is it proposed we approach this problem?
Big Data powering Big Money, the return of direct democracy, and the tyranny of the minority. Nowadays, artificial intelligence (AI) is one of the most widely discussed phenomena. AI is poised to fundamentally alter almost every dimension of human life – from healthcare and social interactions to military and international relations. However, it is worth considering the effects of the advent of AI in politics – since politics are one of the fundamental pillars of today’s societal system, and understanding the dangers that AI poses for politics is crucial to combat AI’s negative implications, while at the same time maximizing the benefits stemming from the new opportunities in order to strengthen democracy.
Systems that augment sensory abilities are increasingly employing AI and machine learning (ML) approaches, with applications ranging from object recognition and scene description tools for blind users to sound awareness tools for d/Deaf users. However, unlike many other AI-enabled technologies, these systems provide information that is already available to non-disabled people. In this paper, we discuss unique AI fairness challenges that arise in this context, including accessibility issues with data and models, ethical implications in deciding what sensory information to convey to the user, and privacy concerns both for the primary user and for others.
The recent 21st Century Cures Act propagates innovations to accelerate the discovery, development, and delivery of 21st century cures. It includes the broader application of Bayesian statistics and the use of evidence from clinical expertise. An example of the latter is the use of trial-external (or historical) data, which promises more efficient or ethical trial designs. We propose a Bayesian meta-analytic approach to leveraging historical data for time-to-event endpoints, which are common in oncology and cardiovascular diseases. The approach is based on a robust hierarchical model for piecewise exponential data. It allows for various degrees of between trial-heterogeneity and for leveraging individual as well as aggregate data. An ovarian carcinoma trial and a non-small-cell cancer trial illustrate methodological and practical aspects of leveraging historical data for the analysis and design of time-to-event trials.
Using adversarial machine learning, researchers can trick machines – potentially with fatal consequences. But the legal system hasn’t caught up. Imagine you’re cruising in your new Tesla, autopilot engaged. Suddenly you feel yourself veer into the other lane, and you grab the wheel just in time to avoid an oncoming car. When you pull over, pulse still racing, and look over the scene, it all seems normal. But upon closer inspection, you notice a series of translucent stickers leading away from the dotted lane divider. And to your Tesla, these stickers represent a non-existent bend in the road that could have killed you. In April this year, a research team at the Chinese tech giant Tencent showed that a Tesla Model S in autopilot mode could be tricked into following a bend in the road that didn’t exist simply by adding stickers to the road in a particular pattern. Earlier research in the U.S. had shown that small changes to a stop sign could cause a driverless car to mistakenly perceive it as a speed limit sign. Another study found that by playing tones indecipherable to a person, a malicious attacker could cause an Amazon Echo to order unwanted items.
The development of artificial general intelligence offers tremendous benefits and terrible risks. There is no easy definition for artificial intelligence, or A.I. Scientists can’t agree on what constitutes ‘true A.I.’ versus what might simply be a very effective and fast computer program. But here’s a shot: intelligence is the ability to perceive one’s environment accurately and take actions that maximize the probability of achieving given objectives. It doesn’t mean being smart, in a sense of having a great store of knowledge, or the ability to do complex mathematics.

### Changing the variable inside an R formula

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I recently encountered a situation where I wanted to run several linear models, but where the response variables would depend on previous steps in the data analysis pipeline. Let me illustrate using the mtcars dataset:

data(mtcars)
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


Let’s say I wanted to fit a linear model of mpg vs. hp and get the coefficients. This is easy:

lm(mpg ~ hp, data = mtcars)$coefficients
#> (Intercept)          hp
#> 30.09886054 -0.06822828


But what if I wanted to fit a linear model of y vs. hp, where y is a response variable that I won’t know until runtime? Or what if I want to fit 3 linear models: each of mpg, disp, drat vs. hp? Or what if I want to fit 300 such models? There has to be a way to do this programmatically. It turns out that there are at least 4 different ways to achieve this in R. For all these methods, let’s assume that the responses we want to fit models for are in a character vector:

response_list <- c("mpg", "disp", "drat")


Here are the 4 ways I know (in decreasing order of preference):

1. as.formula()

as.formula() converts a string to a formula object. Hence, we can programmatically create the formula we want as a string, then pass that string to as.formula():

for (y in response_list) {
  lmfit <- lm(as.formula(paste(y, "~ hp")), data = mtcars)
  print(lmfit$coefficients)
}
#> (Intercept)          hp
#> 30.09886054 -0.06822828
#> (Intercept)          hp
#>    20.99248     1.42977
#> (Intercept)          hp
#>  4.10990867 -0.00349959


2. Don’t specify the data option

Passing the data = mtcars option to lm() gives us more succinct and readable code. However, lm() also accepts the response and predictor vectors themselves:

for (y in response_list) {
lmfit <- lm(mtcars[[y]] ~ mtcars$hp)
print(lmfit$coefficients)
}
#> (Intercept)   mtcars$hp
#> 30.09886054 -0.06822828
#> (Intercept)   mtcars$hp
#>    20.99248     1.42977
#> (Intercept)   mtcars$hp
#>  4.10990867 -0.00349959


3. get()

get() searches for an R object by name and returns that object if it exists.

for (y in response_list) {
lmfit <- lm(get(y) ~ hp, data = mtcars)
print(lmfit$coefficients)
}
#> (Intercept)          hp
#> 30.09886054 -0.06822828
#> (Intercept)          hp
#>    20.99248     1.42977
#> (Intercept)          hp
#>  4.10990867 -0.00349959


4. eval(parse())

This one is a little complicated. parse() returns the parsed but unevaluated expressions, while eval() evaluates those expressions (in a specified environment).

for (y in response_list) {
lmfit <- lm(eval(parse(text = y)) ~ hp, data = mtcars)
print(lmfit$coefficients)
}
#> (Intercept)          hp
#> 30.09886054 -0.06822828
#> (Intercept)          hp
#>    20.99248     1.42977
#> (Intercept)          hp
#>  4.10990867 -0.00349959


Of course, for any of these methods, we could replace the outer loop with apply() or purrr::map().
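For instance, method 1 can be written functionally with lapply() (a minimal sketch; reformulate() is the base R helper that builds the same formula object as the paste() plus as.formula() combination above):

```r
# Fit one model per response with lapply() instead of a for loop.
# reformulate("hp", response = y) builds the formula y ~ hp.
models <- lapply(response_list, function(y) {
  lm(reformulate("hp", response = y), data = mtcars)
})
names(models) <- response_list

# Collect all coefficient vectors into one matrix,
# one column per response variable.
sapply(models, function(fit) fit$coefficients)
```

The result is a 2 x 3 matrix of intercepts and slopes, which is often more convenient than printing inside a loop.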



### Distilled News

The robot simulator V-REP, with integrated development environment, is based on a distributed control architecture: each object/model can be individually controlled via an embedded script, a plugin, a ROS or BlueZero node, a remote API client, or a custom solution. This makes V-REP very versatile and ideal for multi-robot applications. Controllers can be written in C/C++, Python, Java, Lua, Matlab or Octave. V-REP is used for fast algorithm development, factory automation simulations, fast prototyping and verification, robotics related education, remote monitoring, safety double-checking, etc.
Statistical modeling, partial pooling, multilevel modeling, hierarchical modeling. Pricing is a common problem faced by any e-commerce business, and one that can be addressed effectively by Bayesian statistical methods. The Mercari Price Suggestion data set from Kaggle seems to be a good candidate for the Bayesian models I wanted to learn. If you remember, the purpose of the data set is to build a model that automatically suggests the right price for any given product for Mercari website sellers. Here I attempt to see whether we can solve this problem with Bayesian statistical methods, using PyStan.
In the previous post we learnt about the importance of latent variables in Bayesian modelling. Now, starting from this post, we will see Bayesian methods in action. We will walk through different aspects of machine learning and see how Bayesian methods help us design solutions, as well as the additional capabilities and insights they offer. The sections that follow are generally known as Bayesian inference. In this post we will see how Bayesian methods can be used to do clustering on given data.
Mobile health (mHealth) is considered one of the most transformative drivers for health informatics delivery of ubiquitous medical applications. Machine learning has proven to be a powerful tool in classifying medical images for detecting various diseases. However, supervised machine learning requires a large amount of data to train the model, whose storage and processing pose considerable system requirements challenges for mobile applications. Therefore, many studies focus on deploying cloud-based machine learning, which takes advantage of the Internet connection to outsource data intensive computing. However, this approach comes with certain drawbacks such as those related to latency and privacy, which need to be considered in the context of sensitive data. To tackle these challenges of mHealth applications, we present an on-device inference App and use a dataset of skin cancer images to demonstrate a proof of concept. We pre-trained a Convolutional Neural Network model using 10,015 skin cancer images. The model is then deployed on a mobile device, where the inference process takes place, i.e. when presented with new test image all computations are executed locally where the test data remains. This approach reduces latency, saves bandwidth and improves privacy.
The advent of social networks, big data and e-commerce has re-emphasized the importance of analyzing a unique type of data structure: one which depicts relationships among its entities, also known as a graph. It is imperative to briefly introduce the concept of a ‘graph’ before I venture into the introduction of graph analytics. Let’s start by looking at a sample graph of friends presented below. I will be using the same graph in some of the following sections to further explain the concepts of graph analytics.
Solving package size issues of fbprophet serverless deployment. I assume you’re reading this post because you’re looking for ways to use the awesome fbprophet (Facebook open source forecasting) library on AWS Lambda and you’re already familiar with the various issues around getting it done. I will be using a python 3.6 example, but the approach is applicable to other runtimes as well as other large ML libraries.
For a recent data science project, I collaborated with several other Lambda School students to search for food deserts in L.A. County. A general definition of a food desert is an area that does not have access, within one mile, to a grocery store/market providing fresh, healthy food options, such as fruits, vegetables, meats, etc. We wanted to test the theory that residents of lower-income neighborhoods are more likely to live in a food desert.
DVC tracks ML models and data sets. DVC is built to make ML models shareable and reproducible. It is designed to handle large files, data sets, machine learning models, and metrics as well as code.
If you are in the process of deploying large-scale data systems into production or if you are using large-scale data in production now, this book is for you. In it we address the difference in big data hype versus serious large-scale projects that bring real value in a wide variety of enterprises. Whether this is your first large-scale data project or you are a seasoned user, you will find helpful content that should reinforce your chances for success. Here, we speak to business team leaders; CIOs, CDOs, and CTOs; business analysts; machine learning and artificial intelligence (AI) experts; and technical developers to explain in practical terms how to take big data analytics and machine learning/AI into production successfully. We address why this is challenging and offer ways to tackle those challenges. We provide suggestions for best practice, but the book is intended as neither a technical reference nor a comprehensive guide to how to use big data technologies. You can understand it regardless of whether you have a deep technical background. That said, we think that you’ll also benefit if you’re technically adept, not so much from a review of tools as from fundamental ideas about how to make your work easier and more effective. The book is based on our experience and observations of real-world use cases so that you can gain from what has made others successful.
We have been discussing several reinforcement learning problems, in each of which we try to find the optimal policy by repeatedly playing the game and updating our estimates. In essence, we estimate the (state, value) pair or (state, action, value) pair, and based on the estimates we generate a policy by taking the action that gives the highest value. But this is not the only way to do it. In this paragraph, I will introduce the Monte Carlo method, which is another way to estimate the value of a state, or the value of a policy. The Monte Carlo method covers a broad range of methods, but all follow the same principle: sampling. The idea is straightforward and intuitive: if you are not sure of the value of a state, just sample it, i.e. keep visiting that state and average over the rewards obtained from simulated actions by interacting with the environment.
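To make the sampling idea concrete (a toy illustration of my own, not code from the article), estimating a state's value reduces to averaging simulated rewards:

```r
# Toy Monte Carlo estimate: the value of a state is the expected reward,
# approximated by repeatedly visiting the state and averaging the rewards.
set.seed(42)
simulate_reward <- function() rnorm(1, mean = 5, sd = 2)  # true mean unknown to the agent
returns <- replicate(10000, simulate_reward())
v_hat <- mean(returns)  # approaches the true value 5 as the sample grows
```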
A walk through for calculating several popular association matrices. Ever been in a scenario where you needed to come up with pairwise covariance, correlation, or cosine matrices for data on the fly without the help of a function? Probably not.
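As a rough sketch of what such a walkthrough covers (my own illustration, not the article's code), all three matrices can be computed from the same cross-products:

```r
# Pairwise covariance, correlation and cosine matrices "by hand",
# checked against the built-ins cov() and cor().
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
n <- nrow(X)

Xc   <- scale(X, center = TRUE, scale = FALSE)          # column-centred data
covm <- crossprod(Xc) / (n - 1)                         # covariance
corm <- covm / tcrossprod(sqrt(diag(covm)))             # correlation
cosm <- crossprod(X) / tcrossprod(sqrt(colSums(X^2)))   # cosine similarity

all.equal(covm, cov(X))  # TRUE
all.equal(corm, cor(X))  # TRUE
```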
In the article, we’ll explore some architectural design patterns that support the machine learning model life cycle.
Artificial Intelligence (AI) is no longer a field restricted only to research papers and academia. Businesses and organizations across diverse domains in the industry are building large-scale applications powered by AI. The questions to think about here would be, ‘Do we trust decisions made by AI models?’ and ‘How does a machine learning or deep learning model make its decisions?’. Interpreting machine learning or deep learning models has always been a task often overlooked in the entire data science lifecycle since data scientists or machine learning engineers would be more involved with actually pushing things out to production or getting a model up and running.
Chart.xkcd is a chart library that plots ‘sketchy’, ‘cartoony’ or ‘hand-drawn’ styled charts.
Data Shapley provides us with one way of finding and correcting or eliminating low-value (and potentially harmful) data points from a training set. In today’s paper choice, Konstantinov & Lampert provide a way of assessing the value of datasets as a whole. The idea is that you might be learning e.g. a classifier by combining data from multiple sources. By assigning a value (weighting) to each of those data sources we can intelligently combine them to get the best possible performance out of the resulting trained model. So if you need more data in order to improve the performance of your model, ‘Robust learning from untrusted sources’ provides an approach that lets you tap into additional, potentially noisier, sources. It’s similar in spirit to Snorkel which we looked at last year, and is designed to let you incorporate data from multiple ‘weakly supervised’ (i.e. noisy) data sources. Snorkel replaces labels with probability-weighted labels, and then trains the final classifier using those.
Want to know how a product owner can find out heaps of insight about how their products are received by customers, without having to read millions of articles and reviews? A sentiment analyser is the answer. These things can be hooked up to Twitter, review sites, databases or all of the above, utilising neural networks in Keras. It's a great lazy way to understand how a product is viewed by a large group of customers in a very short space of time.
Finding An Optimum Investment Portfolio Using Monte-Carlo Simulation In Python From Start To End. This article focuses on generating an optimum investment portfolio via Monte-Carlo simulation. I have implemented an end-to-end application in Python and this article documents the solution so that a wider audience can benefit from it. The article will explain the required financial, mathematical and programming knowledge of investment management in an easy-to-understand manner. Hence, this article is useful for data scientists, programmers, mathematicians and those who are interested in finance.
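The core of such a simulation can be sketched in a few lines (an illustrative sketch with simulated return data, not the article's Python implementation):

```r
# Monte-Carlo search for the portfolio weights that maximise the Sharpe ratio.
set.seed(1)
rets  <- matrix(rnorm(250 * 4, mean = 0.0005, sd = 0.01), ncol = 4)  # fake daily returns
mu    <- colMeans(rets)
sigma <- cov(rets)

best <- list(sharpe = -Inf)
for (i in 1:10000) {
  w   <- runif(4); w <- w / sum(w)                  # random weights summing to 1
  ret <- sum(w * mu) * 252                          # annualised return
  vol <- sqrt(drop(t(w) %*% sigma %*% w) * 252)     # annualised volatility
  s   <- ret / vol                                  # Sharpe ratio (risk-free rate 0)
  if (s > best$sharpe) best <- list(sharpe = s, weights = w)
}
best$weights  # candidate optimum weights
```

In practice the fake returns matrix would be replaced by historical price data, and each candidate portfolio's return/volatility pair can also be kept to plot the efficient frontier.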
Every Mental Health Awareness Day (October 10), there is a peak in search interest for ‘mental health’ on Google Trends. However, this past October, there was the highest search interest ever seen. Mental health in the United States is growing as a part of the global conversation – partially due to its destigmatization, but mainly due to its relevance to technology (most notably social media), domestic terrorism, and drug addiction. While there has been a ton of pre-existing analysis on this topic conducted both by kagglers and nonprofits, I hope to use time series anomaly detection techniques for a different perspective on the suicide statistics datasets.
When applying deep Reinforcement Learning (RL) to robotics, we are faced with a conundrum: how do we train a robot to do a task when deep learning requires hundreds of thousands, even millions, of examples? To achieve 96% grasp success on never-before-seen objects, researchers at Google and Berkeley trained a robotic agent through 580,000 real-world grasp attempts. This feat took seven robots and several weeks to accomplish. Without Google resources, it may seem hopeless for the average ML practitioner. We cannot expect to easily run hundreds of thousands of iterations of training using a physical robot, which is subject to wear-and-tear and requires human supervision, neither of which comes cheap. It would be much more feasible if we could pretrain such RL algorithms to drastically reduce the number of real world attempts needed.


### If you did not already know

FastDeepIoT
Deep neural networks show great potential as solutions to many sensing application problems, but their excessive resource demand slows down execution time, posing a serious impediment to deployment on low-end devices. To address this challenge, recent literature focused on compressing neural network size to improve performance. We show that changing neural network size does not proportionally affect performance attributes of interest, such as execution time. Rather, extreme run-time nonlinearities exist over the network configuration space. Hence, we propose a novel framework, called FastDeepIoT, that uncovers the non-linear relation between neural network structure and execution time, then exploits that understanding to find network configurations that significantly improve the trade-off between execution time and accuracy on mobile and embedded devices. FastDeepIoT makes two key contributions. First, FastDeepIoT automatically learns an accurate and highly interpretable execution time model for deep neural networks on the target device. This is done without prior knowledge of either the hardware specifications or the detailed implementation of the used deep learning library. Second, FastDeepIoT informs a compression algorithm how to minimize execution time on the profiled device without impacting accuracy. We evaluate FastDeepIoT using three different sensing-related tasks on two mobile devices: Nexus 5 and Galaxy Nexus. FastDeepIoT further reduces the neural network execution time by $48\%$ to $78\%$ and energy consumption by $37\%$ to $69\%$ compared with the state-of-the-art compression algorithms. …

MXNET-MPI
Existing Deep Learning frameworks exclusively use either Parameter Server(PS) approach or MPI parallelism. In this paper, we discuss the drawbacks of such approaches and propose a generic framework supporting both PS and MPI programming paradigms, co-existing at the same time. The key advantage of the new model is to embed the scaling benefits of MPI parallelism into the loosely coupled PS task model. Apart from providing a practical usage model of MPI in cloud, such framework allows for novel communication avoiding algorithms that do parameter averaging in Stochastic Gradient Descent(SGD) approaches. We show how MPI and PS models can synergistically apply algorithms such as Elastic SGD to improve the rate of convergence against existing approaches. These new algorithms directly help scaling SGD clusterwide. Further, we also optimize the critical component of the framework, namely global aggregation or allreduce using a novel concept of tensor collectives. These treat a group of vectors on a node as a single object allowing for the existing single vector algorithms to be directly applicable. We back our claims with sufficient empirical evidence using large scale ImageNet 1K data. Our framework is built upon MXNET but the design is generic and can be adapted to other popular DL infrastructures. …

MobiVSR
Visual speech recognition (VSR) is the task of recognizing spoken language from video input only, without any audio. VSR has many applications as an assistive technology, especially if it could be deployed in mobile devices and embedded systems. The need for intensive computational resources and a large memory footprint are two of the major obstacles in developing neural network models for VSR in a resource constrained environment. We propose a novel end-to-end deep neural network architecture for word level VSR called MobiVSR with a design parameter that aids in balancing the model’s accuracy and parameter count. We use depthwise-separable 3D convolution for the first time in the domain of VSR and show how it makes our model efficient. MobiVSR achieves an accuracy of 73\% on a challenging Lip Reading in the Wild dataset with 6 times fewer parameters and a 20 times smaller memory footprint than the current state of the art. MobiVSR can also be compressed to 6 MB by applying post training quantization. …

Feature Fusion Learning (FFL)
We propose a learning framework named Feature Fusion Learning (FFL) that efficiently trains a powerful classifier through a fusion module which combines the feature maps generated from parallel neural networks. Specifically, we train a number of parallel neural networks as sub-networks, then we combine the feature maps from each sub-network using a fusion module to create a more meaningful feature map. The fused feature map is passed into the fused classifier for overall classification. Unlike existing feature fusion methods, in our framework, an ensemble of sub-network classifiers transfers its knowledge to the fused classifier and then the fused classifier delivers its knowledge back to each sub-network, mutually teaching one another in an online-knowledge distillation manner. This mutually teaching system not only improves the performance of the fused classifier but also obtains performance gain in each sub-network. Moreover, our model is more beneficial because different types of network can be used for each sub-network. We have performed a variety of experiments on multiple datasets such as CIFAR-10, CIFAR-100 and ImageNet and proved that our method is more effective than other alternative methods in terms of performance of both sub-networks and the fused classifier. …


### Burgundy wine investors have beaten the stockmarket

Scarcity and complexity have made Pinot Noir a new status symbol


## August 23, 2019

### Fresh from the Python Package Index

en-qai-sm
English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.

faircorels
FairCORELS, a modified version of CORELS to build fair and interpretable models. Welcome to FairCorels, a Python library for learning fair and interpretable models using the Certifiably Optimal RulE ListS (CORELS) algorithm!

flexi-hash-embedding
PyTorch Extension Library of Optimized Scatter Operations. This PyTorch Module hashes and sums variably-sized dictionaries of features into a single fixed-size embedding. Feature keys are hashed, which is ideal for streaming contexts and online-learning such that we don’t have to memorize a mapping between feature keys and indices. Multiple variable-length features are grouped by example and then summed. Feature embeddings are scaled by their values, enabling linear features rather than just one-hot features.

JLpy-utils-package
Custom methods for various data science, computer vision, and machine learning operations in Python

kde-gpu
We implemented a Nadaraya-Watson kernel density and kernel conditional probability estimator using CUDA through CuPy. It is much faster than the CPU version but requires a GPU with high memory.

matchzoo-py
Facilitating the design, comparison and sharing of deep text matching models.

naive-feature-selection
Naive Feature Selection. This package solves the Naive Feature Selection problem described in [the paper](https://…/1905.09884 ).

pencilbox
A pencilbox for your Jupyter notebooks.

pyAudioProcessing
Audio processing-feature extraction and building machine learning models from audio data.

scrolldown
Keep long running notebook in jupyter scrolled to the bottom

styletransfer
Transfer the style of one image to another using PyTorch

tf-object-detection
A Thin Wrapper around Tensorflow Object Detection API for Easy Installation and Use. This is a thin wrapper around [Tensorflow Object Detection API](https://…/object_detection ) for easy installation and use. The original [installation procedure](https://…/installation.md ) contains multiple manual steps that make dependency management difficult. This repository creates a pip package that automates the installation so that you can install the API with a single pip install.

titanfe
titan Data Flow Engine for Python

trendet
trendet is a Python package for trend detection on stock time series data

twodlearn
Easy development of machine learning models


### What’s new on arXiv

Ant Colony Optimization (ACO) is a metaheuristic proposed by Marco Dorigo in 1991 based on the behavior of biological ants. Pheromone laying and the selection of the shortest route with the help of pheromone inspired the development of the first ACO algorithm. Since the presentation of the first such algorithm, many researchers have worked and published their research in this field. Though initial results were not so promising, recent developments have made this metaheuristic a significant algorithm in Swarm Intelligence. This research presents a brief overview of recent developments in ACO algorithms in terms of both applications and algorithmic developments. For application developments, multi-objective optimization, continuous optimization and time-varying NP-hard problems have been presented. For algorithmic developments, hybridization and parallel architectures have been investigated.
Open-domain dialog systems (also known as chatbots) have increasingly drawn attention in natural language processing. Some of the recent work aims at incorporating affect information into sequence-to-sequence neural dialog modeling, making the response emotionally richer, while others use hand-crafted rules to determine the desired emotion response. However, they do not explicitly learn the subtle emotional interactions captured in real human dialogs. In this paper, we propose a multi-turn dialog system capable of learning and generating emotional responses that so far only humans know how to do. Compared to two baseline models, offline experiments show that our method performs the best in perplexity scores. Further human evaluations confirm that our chatbot can keep track of the conversation context and generate emotionally more appropriate responses while performing equally well on grammar.
Multi-Task Learning (MTL) aims at boosting the overall performance of each individual task by leveraging useful information contained in multiple related tasks. It has shown great success in natural language processing (NLP). Currently, a number of MTL architectures and learning mechanisms have been proposed for various NLP tasks. However, there is no systematic in-depth exploration and comparison of different MTL architectures and learning mechanisms. In this paper, we conduct a thorough examination of typical MTL methods on a broad range of representative NLP tasks. Our primary goal is to understand the merits and demerits of existing MTL methods in NLP tasks, thus devising new hybrid architectures intended to combine their strengths.
A large number of engineering, science and computational problems have yet to be solved in a computationally efficient way. One of the emerging challenges is how evolving technologies grow towards autonomy and intelligent decision making. This leads to collection of large amounts of data from various sensing and measurement technologies, e.g., cameras, smart phones, health sensors, smart electricity meters, and environment sensors. Hence, it is imperative to develop efficient algorithms for generation, analysis, classification, and illustration of data. Meanwhile, data is structured purposefully through different representations, such as large-scale networks and graphs. We focus on data science as a crucial area, specifically on the curse of dimensionality (CoD), which is due to the large amount of generated/sensed/collected data. This motivates researchers to think about optimization and to apply nature-inspired algorithms, such as evolutionary algorithms (EAs), to solve optimization problems. Although these algorithms look non-deterministic, they are robust enough to reach an optimal solution. Researchers do not adopt evolutionary algorithms unless they face a problem which is suffering from placement in a local optimal solution, rather than the global optimal solution. In this chapter, we first develop a clear and formal definition of the CoD problem, next we focus on feature extraction techniques and categories, then we provide a general overview of meta-heuristic algorithms, their terminology, and desirable properties of evolutionary algorithms.
We introduce a heuristic model of Quantum Computing and apply it to argue that a deep understanding of quantum computing is unlikely to be helpful to address current bottlenecks in Artificial Intelligence Alignment. Our argument relies on the claims that Quantum Computing leads to compute overhang instead of algorithmic overhang, and that the difficulties associated with the measurement of quantum states do not invalidate any major assumptions of current Artificial Intelligence Alignment research agendas. We also discuss tripwiring, adversarial blinding, informed oversight and side effects as possible exceptions.
Accuracy is the most important parameter, among a few others, that defines the effectiveness of a machine learning algorithm. Higher accuracy is always desirable. Now, there is a vast number of well-established learning algorithms already present in the scientific domain. Each one of them has its own merits and demerits. Merits and demerits are evaluated in terms of accuracy, speed of convergence, complexity of the algorithm, generalization property, and robustness, among many others. Also, the learning algorithms are data-distribution dependent. Each learning algorithm is suitable for a particular distribution of data. Unfortunately, no dominant classifier exists for all data distributions, and the data distribution of the task at hand is usually unknown. No one classifier can be discriminative enough if the number of classes is huge. So the underlying problem is that a single classifier is not enough to classify the whole sample space correctly. This thesis is about exploring different techniques of combining classifiers so as to obtain optimal accuracy. Three classifiers are implemented, namely plain old nearest neighbor on raw pixels, a structural-feature-extracted nearest neighbor, and a Gabor-feature-extracted nearest neighbor. Five different combination strategies are devised, tested on Tibetan character images, and analyzed.
We explore the intersection of human and machine creativity by generating sculptural objects through machine learning. This research raises questions about both the technical details of automatic art generation and the interaction between AI and people, as both artists and the audience of art. We introduce two algorithms for generating 3D point clouds and then discuss their actualization as sculpture and incorporation into a holistic art installation. Specifically, the Amalgamated DeepDream (ADD) algorithm solves the sparsity problem caused by the naive DeepDream-inspired approach and generates creative and printable point clouds. The Partitioned DeepDream (PDD) algorithm further allows us to explore more diverse 3D object creation by combining point cloud clustering algorithms and ADD.
In low-resource settings where vital registration of death is not routine it is often of critical interest to determine and study the cause of death (COD) for individuals and the cause-specific mortality fraction (CSMF) for populations. Post-mortem autopsies, considered the gold standard for COD assignment, are often difficult or impossible to implement due to deaths occurring outside the hospital, expense, and/or cultural norms. For this reason, Verbal Autopsies (VAs) are commonly conducted, consisting of a questionnaire administered to next of kin recording demographic information, known medical conditions, symptoms, and other factors for the decedent. This article proposes a novel class of hierarchical factor regression models that avoid restrictive assumptions of standard methods, allow both the mean and covariance to vary with COD category, and can include covariate information on the decedent, region, or events surrounding death. Taking a Bayesian approach to inference, this work develops an MCMC algorithm and validates the FActor Regression for Verbal Autopsy (FARVA) model in simulation experiments. An application of FARVA to real VA data shows improved goodness-of-fit and better predictive performance in inferring COD and CSMF over competing methods. Code and a user manual are made available at https://…/farva.
Deep neural networks (DNNs) have demonstrated impressive performance on many challenging machine learning tasks. However, DNNs are vulnerable to adversarial inputs generated by adding maliciously crafted perturbations to the benign inputs. As a growing number of attacks have been reported to generate adversarial inputs of varying sophistication, the defense-attack arms race has been accelerated. In this paper, we present MODEF, a cross-layer model diversity ensemble framework. MODEF intelligently combines unsupervised model denoising ensemble with supervised model verification ensemble by quantifying model diversity, aiming to boost the robustness of the target model against adversarial examples. Evaluated using eleven representative attacks on popular benchmark datasets, we show that MODEF achieves remarkable defense success rates, compared with existing defense methods, and provides a superior capability of repairing adversarial inputs and making correct predictions with high accuracy in the presence of black-box attacks.
A cross-modal retrieval process is to use a query in one modality to obtain relevant data in another modality. The challenging issue of cross-modal retrieval lies in bridging the heterogeneous gap for similarity computation, which has been broadly discussed in image-text, audio-text, and video-text cross-modal multimedia data mining and retrieval. However, the gap in temporal structures of different data modalities is not well addressed due to the lack of alignment relationship between temporal cross-modal structures. Our research focuses on learning the correlation between different modalities for the task of cross-modal retrieval. We have proposed an architecture: Supervised-Deep Canonical Correlation Analysis (S-DCCA), for cross-modal retrieval. In this forum paper, we will talk about how to exploit triplet neural networks (TNN) to enhance the correlation learning for cross-modal retrieval. The experimental result shows the proposed TNN-based supervised correlation learning architecture can get the best result when the data representation extracted by supervised learning.
Previous research on empathetic dialogue systems has mostly focused on generating responses given certain emotions. However, being empathetic not only requires the ability of generating emotional responses, but more importantly, requires the understanding of user emotions and replying appropriately. In this paper, we propose a novel end-to-end approach for modeling empathy in dialogue systems: Mixture of Empathetic Listeners (MoEL). Our model first captures the user emotions and outputs an emotion distribution. Based on this, MoEL will softly combine the output states of the appropriate Listener(s), which are each optimized to react to certain emotions, and generate an empathetic response. Human evaluations on empathetic-dialogues (Rashkin et al., 2018) dataset confirm that MoEL outperforms multitask training baseline in terms of empathy, relevance, and fluency. Furthermore, the case study on generated responses of different Listeners shows high interpretability of our model.
Tensor methods have become a promising tool to solve high-dimensional problems in the big data era. By exploiting possible low-rank tensor factorization, many high-dimensional model-based or data-driven problems can be solved to facilitate decision making or machine learning. In this paper, we summarize the recent applications of tensor computation in obtaining compact models for uncertainty quantification and deep learning. In uncertainty analysis where obtaining data samples is expensive, we show how tensor methods can significantly reduce the simulation or measurement cost. To enable the deployment of deep learning on resource-constrained hardware platforms, tensor methods can be used to significantly compress an over-parameterized neural network model or directly train a small-size model from scratch via optimization or statistical techniques. Recent Bayesian tensorized neural networks can automatically determine their tensor ranks in the training process.
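The compression idea the survey describes can be illustrated with a plain low-rank matrix factorization, the simplest relative of the tensor decompositions discussed. This numpy sketch is illustrative only and not taken from the paper; the layer size and rank are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense layer weight matrix that happens to be (nearly) low-rank.
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 256))

# Truncated SVD keeps only the top-r factors.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
W_hat = U[:, :r] * s[:r] @ Vt[:r, :]

# Parameter counts: dense vs. factored form.
dense_params = W.size                    # 256*256 = 65536
factored_params = 256 * r + r + r * 256  # U_r, s_r, V_r = 4104

rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(dense_params, factored_params, rel_err)
```

When the weight matrix (or tensor) really is close to low-rank, the factored form stores far fewer parameters at negligible reconstruction error, which is the premise behind the compression results the paper summarizes.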
Recurrent Neural Network (RNN) and its variations such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have become standard building blocks for learning online data of sequential nature in many research areas, including natural language processing and speech data analysis. In this paper, we present a new methodology to significantly reduce the number of parameters in RNNs while maintaining performance that is comparable to or even better than that of classical RNNs. The new proposal, referred to as Restricted Recurrent Neural Network (RRNN), restricts the weight matrices corresponding to the input data and hidden states at each time step to share a large proportion of parameters. The new architecture can be regarded as a compression of its classical counterpart, but it does not require pre-training or sophisticated parameter fine-tuning, both of which are major issues in most existing compression techniques. Experiments on natural language modeling show that compared with its classical counterpart, the restricted recurrent architecture generally produces comparable results at about a 50% compression rate. In particular, the Restricted LSTM can outperform the classical RNN with even fewer parameters.
Most existing recommender systems represent a user’s preference with a feature vector, which is assumed to be fixed when predicting this user’s preferences for different items. However, the same vector cannot accurately capture a user’s varying preferences on all items, especially when considering the diverse characteristics of various items. To tackle this problem, in this paper, we propose a novel Multimodal Attentive Metric Learning (MAML) method to model users’ diverse preferences for various items. In particular, for each user-item pair, we propose an attention neural network, which exploits the item’s multimodal features to estimate the user’s special attention to different aspects of this item. The obtained attention is then integrated into a metric-based learning method to predict the user’s preference for this item. The advantage of metric learning is that it can naturally overcome the problem of dot product similarity, which is adopted by matrix factorization (MF) based recommendation models but does not satisfy the triangle inequality property. In addition, it is worth mentioning that the attention mechanism can not only help model users’ diverse preferences towards different items, but also overcome the geometrically restrictive problem caused by collaborative metric learning. Extensive experiments on large-scale real-world datasets show that our model can substantially outperform the state-of-the-art baselines, demonstrating the potential of modeling users’ diverse preferences for recommendation.
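The dot-product issue mentioned above is easy to see numerically: a dissimilarity derived from dot-product scoring can violate the triangle inequality (it is not a metric at all), while the Euclidean distance used in metric learning cannot. A small sketch, not from the paper:

```python
import numpy as np

def dot_dissim(x, y):
    """Dissimilarity implied by dot-product scoring (higher dot = more similar)."""
    return -float(np.dot(x, y))

def euclid(x, y):
    return float(np.linalg.norm(x - y))

a, b, c = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])

# Triangle inequality requires d(a,c) <= d(a,b) + d(b,c):
print(dot_dissim(a, c), dot_dissim(a, b) + dot_dissim(b, c))  # 1.0 vs 0.0 -> violated
print(euclid(a, c), euclid(a, b) + euclid(b, c))              # 2.0 vs 2.83 -> holds
```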
Matrix factorization (MF) is one of the most efficient methods for rating predictions. MF learns user and item representations by factorizing the user-item rating matrix. Further, textual contents are integrated into conventional MF to address the cold-start problem. However, the textual contents do not reflect all aspects of the items. In this paper, we propose a model that leverages the information hidden in item co-clicks (i.e., items that are often clicked together by a user) in learning item representations. We develop TCMF (Textual Co Matrix Factorization) that learns the user and item representations jointly from the user-item matrix, textual contents, and an item co-click matrix built from click data. Item co-click information captures relationships between items which are not captured via textual contents. The experiments on two real-world datasets (MovieTweetings and Bookcrossing) demonstrate that our method outperforms competing methods in terms of rating prediction. Further, we show that the proposed model can learn effective item representations by comparing it with state-of-the-art methods on a classification task that uses the item representations as input vectors.
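For intuition, an item co-click matrix like the one TCMF factorizes can be built by counting item pairs clicked by the same user. A minimal sketch with hypothetical user and item IDs (not the paper's code or data):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical click logs: user -> items that user clicked.
clicks = {
    "u1": ["i1", "i2", "i3"],
    "u2": ["i2", "i3"],
    "u3": ["i1", "i3"],
}

# Symmetric co-click counts: items clicked together by the same user.
co_click = defaultdict(int)
for items in clicks.values():
    for a, b in combinations(sorted(set(items)), 2):
        co_click[(a, b)] += 1

print(dict(co_click))  # {('i1','i2'): 1, ('i1','i3'): 2, ('i2','i3'): 2}
```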
Traditionally, graph quality metrics focus on readability, but recent studies show the need for metrics which are more specific to the discovery of patterns in graphs. Cluster analysis is a popular task within graph analysis, yet there is no metric that explicitly quantifies how well a drawing of a graph represents its cluster structure. We define a clustering quality metric measuring how well a node-link drawing of a graph represents the clusters contained in the graph. Experiments with deforming graph drawings verify that our metric effectively captures variations in the visual cluster quality of graph drawings. We then use our metric to examine how well different graph drawing algorithms visualize cluster structures in various graphs; the results confirm that some algorithms which have been specifically designed to show cluster structures perform better than other algorithms.
Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.
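As one concrete instance of the approaches this article surveys, the canonical FedAvg aggregation step simply takes a data-size-weighted average of each client's locally trained parameters. A toy numpy sketch (illustrative only; the client weights and sizes are made up):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style aggregation)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients with different amounts of local data; their data never leaves
# the device -- only the locally updated parameters are sent for averaging.
w1, w2, w3 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
global_w = fedavg([w1, w2, w3], [100, 100, 200])
print(global_w)  # [0.75 0.75]
```

The statistical heterogeneity challenges the article discusses arise exactly because these client updates come from non-identically distributed local datasets.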
Continuing advances in neural interfaces have enabled simultaneous monitoring of spiking activity from hundreds to thousands of neurons. To interpret these large-scale data, several methods have been proposed to infer latent dynamic structure from high-dimensional datasets. One recent line of work uses recurrent neural networks in a sequential autoencoder (SAE) framework to uncover dynamics. SAEs are an appealing option for modeling nonlinear dynamical systems, and enable a precise link between neural activity and behavior on a single-trial basis. However, the very large parameter count and complexity of SAEs relative to other models has caused concern that SAEs may only perform well on very large training sets. We hypothesized that with a method to systematically optimize hyperparameters (HPs), SAEs might perform well even in cases of limited training data. Such a breakthrough would greatly extend their applicability. However, we find that SAEs applied to spiking neural data are prone to a particular form of overfitting that cannot be detected using standard validation metrics, which prevents standard HP searches. We develop and test two potential solutions: an alternate validation method (‘sample validation’) and a novel regularization method (‘coordinated dropout’). These innovations prevent overfitting quite effectively, and allow us to test whether SAEs can achieve good performance on limited data through large-scale HP optimization. When applied to data from motor cortex recorded while monkeys made reaches in various directions, large-scale HP optimization allowed SAEs to better maintain performance for small dataset sizes. Our results should greatly extend the applicability of SAEs in extracting latent dynamics from sparse, multidimensional data, such as neural population spiking activity.

Continue Reading…

### Improvements to RSwitch in v1.3.0

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It’s only been a couple days since the initial version of my revamped take on RSwitch but there have been numerous improvements since then worth mentioning.

For starters, there’s a new app icon that uses the blue and gray from the official (modern) R logo to help visually associate it with R:

In similar fashion, the menubar icon now looks better in dark mode (I may still tweak it a bit, tho).

There are also some new features in the menu bar from v1.0.0:

• numbered shortcuts for each R version
• handy menubar links to R resources on the internet
• two menubar links to make it easier to download the latest RStudio dailies by hand (if you’re not using something like Homebrew already for that) and the latest R-devel macOS distribution tarball
• saner/cleaner alerts

On tap for 1.4.0 is using Notification Center for user messaging vs icky alerts and, perhaps, some TouchBar icons for Mac folk with capable MacBook Pro models.

### FIN

As usual, kick the tyres & file issues, questions, feature requests & PRs where you like.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Continue Reading…

### R Packages worth a look

Computes the Scores for Different Implicit Measures (implicitMeasures)
A tool for computing the scores for the Implicit Association Test (IAT; Greenwald, McGhee & Schwartz (1998) <doi:10.1037/0022-3514.74.6.1464>) and the Single Category-IAT (SC-IAT: Karpinski & Steinman (2006) <doi:10.1037/0022-3514.91.1.16>). Functions for preparing the data (both for the IAT and the SC-IAT), plotting the results, and obtaining a table with descriptive statistics of the implicit measure scores are provided.

Methods for Calculating the Loreau & Hector 2001 BEF Partition (partitionBEFsp)
A collection of functions that can be used to estimate selection and complementarity effects, sensu Loreau & Hector (2001) <doi:10.1038/35083573>, even in cases where data are only available for a random subset of species (i.e. incomplete sample-level data). A full derivation and explanation of the statistical corrections used here is available in Clark et al. (2019) <doi:10.1111/2041-210X.13285>.

Virome Sequencing Analysis Result Browser (viromeBrowser)
Experiments in which highly complex virome sequencing data is generated are difficult to visualize and unpack for a person without programming experience. After processing of the raw sequencing data by a next generation sequencing (NGS) processing pipeline the usual output consists of contiguous sequences (contigs) in fasta format and an annotation file linking the contigs to a reference sequence. The virome analysis browser app imports an annotation file and a corresponding fasta file containing the annotated contigs. It facilitates browsing of annotations in multiple files and allows users to select and export specific annotated sequences from the associated fasta files. Various annotation quality thresholds can be set to filter contigs from the annotation files. Further inspection of selected contigs can be done in the form of automatic open reading frame (ORF) detection. Separate contigs and/or separate ORFs can be downloaded in nucleotide or amino acid format for further analysis.

Chinese Word Segmentation (Rwordseg)
Provides interfaces and useful tools for Chinese word segmentation. Implements a segmentation algorithm based on the Hidden Markov Model (HMM) in native R code. Methods for the HHMM-based Chinese lexical analyzer are as described in: Hua-Ping Zhang et al. (2003) <doi:10.3115/1119250.1119280>.

Continue Reading…

### Polyglot FizzBuzz in R (Plus: “Why Can’t Johnny Code?”)


I caught this post on the The Surprising Number Of Programmers Who Can’t Program from the Hacker News RSS feed. Said post links to another, classic post on the same subject and you should read both before continuing.

Back? Great! Let’s dig in.

### Why does hrbrmstr care about this?

Offspring #3 completed his Freshman year at UMaine Orono last year but wanted to stay academically active over the summer (he’s majoring in astrophysics and knows he’ll need some programming skills to excel in his field) and took an introductory C++ course from UMaine that was held virtually, with 1 lecture per week (14 weeks IIRC) and 1 assignment due per week with no other grading.

After seeing what passes for a standard (UMaine is not exactly on the top list of institutions to attend if one wants to be a computer scientist) intro C++ course, I’m not really surprised “Johnny can’t code”. Thirteen weeks in, the class finally started covering OO concepts, and the course is ending with a scant intro to polymorphism. Prior to this, most of the assignments were just variations on each other (read from stdin, loop with conditionals, print output) with no program going over 100 LoC (that includes comments and spacing). This wasn’t a “compsci for non-compsci majors” course, either. Anyone majoring in an area of study that requires programming could have taken this course to fulfill one of the requirements, and they’d be set on a path of forever using StackOverflow copypasta to try to get their future work done.

I’m fairly certain most of #3’s classmates could not program fizzbuzz without googling and even more certain most have no idea they weren’t really “coding in C++” most of the course.

If this is how most other middling colleges are teaching the basics of computer programming, it’s no wonder employers are having a difficult time finding qualified talent.

### You have an “R” tag — actually, a few language tags — on this post, so where’s the code?

After the article triggered the lament in the previous section, a crazy, @coolbutuseless-esque thought came into my head: “I wonder how many different language FizzBuzz solutions can be created from within R?”.

The criteria for that notion is/was that there needed to be some Rcpp::cppFunction(), reticulate::py_run_string(), V8 context eval()-type way to have the code in-R but then run through those far-superior-to-any-other-language’s polyglot extensibility constructs.

Before getting lost in the weeds, there were some other thoughts on language inclusion:

• Should Java be included? I :heart: {rJava}, but cat()-ing Java code out and running system() to compile it first seemed like cheating (even though that’s kinda just what cppFunction() does). Toss a note into a comment if you think a Java example should be added (or add said Java example in a comment or link to it in one!).
• I think Julia should be in this example list but do not care enough about it to load {JuliaCall} and craft an example (again, link or post one if you can crank it out quickly).
• I think Lua could be in this example given the existence of {luar}. If you agree, give it a go!
• Go & Rust compiled code can also be called in R (thanks to Romain & Jeroen) once they’re turned into C-compatible libraries. Should this polyglot example show this as well?
• What other languages am I missing?

### The aforementioned “weeds”

One criterion for each language fizzbuzz example is that it needs to be readable, not hacky-cool. That doesn’t mean the solutions still can’t be a bit creative. We’ll lightly go through each one I managed to code up. First we’ll need some helpers:

suppressPackageStartupMessages({
library(purrr)
library(dplyr)
library(reticulate)
library(V8)
library(Rcpp)
})


The R, JavaScript, and Python implementations are all in the microbenchmark() call way down below. Up here are C and C++ versions. The C implementation is boring and straightforward, but we’re using Rprintf() so we can capture the output vs have any output buffering woes impact the timings.

cppFunction('
void cbuzz() {

// super fast plain C

for (unsigned int i=1; i<=100; i++) {
if      (i % 15 == 0) Rprintf("FizzBuzz\\n");
else if (i %  3 == 0) Rprintf("Fizz\\n");
else if (i %  5 == 0) Rprintf("Buzz\\n");
else Rprintf("%d\\n", i);
}

}
')


The cbuzz() example is just fine even in C++ land, but we can take advantage of some C++11 vectorization features to stay formally in C++-land and play with some fun features like lambdas. This will be a bit slower than the C version plus consume more memory, but shows off some features some folks might not be familiar with:

cppFunction('
void cppbuzz() {

std::vector<int> numbers(100); // will eventually be 1:100
std::iota(numbers.begin(), numbers.end(), 1); // kinda sorta equiva of our R 1:100 but not exactly true

std::vector<std::string> fb(100); // fizzbuzz strings holder

// transform said 1..100 into fizbuzz strings
std::transform(
numbers.begin(), numbers.end(),
fb.begin(),
[](int i) -> std::string { // lambda expression are cool like a fez
if      (i % 15 == 0) return("FizzBuzz");
else if (i %  3 == 0) return("Fizz");
else if (i %  5 == 0) return("Buzz");
else return(std::to_string(i));
}
);

// round it out with use of for_each and another lambda
// this turns out to be slightly faster than range-based for-loop
// collection iteration syntax.
std::for_each(
fb.begin(), fb.end(),
[](std::string s) { Rcout << s << std::endl; }
);

}
',
plugins = c('cpp11'))


Both of those functions are now available to R.

Next, we need to prepare to run JavaScript and Python code, so we’ll initialize both of those environments:

ctx <- v8()

py_config() # not 100% necessary but I keep my needed {reticulate} options in env vars for reproducibility


Then, we tell R to capture all the output. Using sink() is a bit better than capture.output() in this use case since it avoids nesting calls, and we need to handle Python stdout the same way py_capture_output() does to be fair in our measurements:

output_tools <- import("rpytools.output")
restore_stdout <- output_tools$start_stdout_capture()
cap <- rawConnection(raw(0), "r+")
sink(cap)


There are a few implementations below across the tidy and base R multiverse. Some use vectorization; some do not. This will let us compare overall “speed” of solution. If you have another suggestion for a readable solution in R, drop a note in the comments:

microbenchmark::microbenchmark(

  # tidy_vectors_case() is slowest but you get all sorts of type safety
  # for free along with very readable idioms.
  tidy_vectors_case = map_chr(1:100, ~{
    case_when(
      (.x %% 15 == 0) ~ "FizzBuzz",
      (.x %% 3 == 0) ~ "Fizz",
      (.x %% 5 == 0) ~ "Buzz",
      TRUE ~ as.character(.x)
    )
  }) %>%
    cat(sep="\n"),

  # tidy_vectors_if() has old-school if/else syntax but still
  # forces us to ensure type safety which is cool.
  tidy_vectors_if = map_chr(1:100, ~{
    if (.x %% 15 == 0) return("FizzBuzz")
    if (.x %% 3 == 0) return("Fizz")
    if (.x %% 5 == 0) return("Buzz")
    return(as.character(.x))
  }) %>%
    cat(sep="\n"),

  # walk() just replaces for but stays in vector-land which is cool
  tidy_walk = walk(1:100, ~{
    if (.x %% 15 == 0) cat("FizzBuzz\n")
    if (.x %% 3 == 0) cat("Fizz\n")
    if (.x %% 5 == 0) cat("Buzz\n")
    cat(.x, "\n", sep="")
  }),

  # vapply() gets us some similar type assurance, albeit with arcane syntax
  base_proper = vapply(1:100, function(.x) {
    if (.x %% 15 == 0) return("FizzBuzz")
    if (.x %% 3 == 0) return("Fizz")
    if (.x %% 5 == 0) return("Buzz")
    return(as.character(.x))
  }, character(1), USE.NAMES = FALSE) %>%
    cat(sep="\n"),

  # sapply() is def lazy but this can outperform vapply() in some
  # circumstances (like this one) and is a bit less arcane.
  base_lazy = sapply(1:100, function(.x) {
    if (.x %% 15 == 0) return("FizzBuzz")
    if (.x %% 3 == 0) return("Fizz")
    if (.x %% 5 == 0) return("Buzz")
    return(.x)
  }, USE.NAMES = FALSE) %>%
    cat(sep="\n"),

  # for loops...ugh. might as well just use C
  base_for = for(.x in 1:100) {
    if (.x %% 15 == 0) cat("FizzBuzz\n")
    else if (.x %% 3 == 0) cat("Fizz\n")
    else if (.x %% 5 == 0) cat("Buzz\n")
    else cat(.x, "\n", sep="")
  },

  # ok, we'll just use C!
  c_buzz = cbuzz(),

  # we can go back to vector-land in C++
  cpp_buzz = cppbuzz(),

  # some <3 for javascript
  js_readable = ctx$eval('
for (var i=1; i <101; i++){
if      (i % 15 == 0) console.log("FizzBuzz")
else if (i %  3 == 0) console.log("Fizz")
else if (i %  5 == 0) console.log("Buzz")
else console.log(i)
}
'),

# icky readable, non-vectorized python

python = reticulate::py_run_string('
for x in range(1, 101):
if (x % 15 == 0):
print("Fizz Buzz")
elif (x % 5 == 0):
print("Buzz")
elif (x % 3 == 0):
print("Fizz")
else:
print(x)
')

) -> res


Turn off output capturing:

sink()
if (!is.null(restore_stdout)) invisible(output_tools$end_stdout_capture(restore_stdout))


We used microbenchmark(), so here are the results:

res
## Unit: microseconds
##               expr       min         lq        mean     median         uq       max neval cld
##  tidy_vectors_case 20290.749 21266.3680 22717.80292 22231.5960 23044.5690 33005.960   100   e
##    tidy_vectors_if   457.426   493.6270   540.68182   518.8785   577.1195   797.869   100   b
##          tidy_walk   970.455  1026.2725  1150.77797  1065.4805  1109.9705  8392.916   100   c
##        base_proper   357.385   375.3910   554.13973   406.8050   450.7490 13907.581   100   b
##          base_lazy   365.553   395.5790   422.93719   418.1790   444.8225   587.718   100  ab
##           base_for   521.674   545.9155   576.79214   559.0185   584.5250   968.814   100   b
##             c_buzz    13.538    16.3335    18.18795    17.6010    19.4340    33.134   100  a
##           cpp_buzz    39.405    45.1505    63.29352    49.1280    52.9605  1265.359   100  a
##        js_readable   107.015   123.7015   162.32442   174.7860   187.1215   270.012   100  ab
##             python  1581.661  1743.4490  2072.04777  1884.1585  1985.8100 12092.325   100   d


Said results are largely meaningless since this is a toy example, but I wanted to show that Jeroen’s {V8} can be super fast, especially when there’s no value marshaling to be done, and that some things you may have thought should be faster, aren’t.

### FIN

Definitely add links or code for changes or additions (especially the aforementioned other languages). Hopefully my lament about the computer science program at UMaine is not universally true for all the programming courses there.

Continue Reading…

### Pros and Cons of Top Data Science Online Courses

[This article was first published on R – data technik, and kindly contributed to R-bloggers].
There are a variety of data science courses online, but which one is the best? Find out the pros and cons of each!

## Coursera, EdX, etc

These MOOCs have been around for several years now and continue to grow. But are they really the best option for learning online?

Pros:

• Lots of Topics including R and Python
• Affordable and even a free option
• Well thought out curriculum from professors in great schools

Cons:

• Not easily translatable to industry
• Not taught by current industry professionals, but instead academics

Now, these MOOCs are still worth checking out to see if they work for you, but beware that you may get tired of analyzing the iris data set.

## PluralSight

Pros:

• Lots of Topics in R, Python, and databases
• Easy to skip around through the user interface instead of going in order
• Taught by industry veterans in top companies that know current trends and expectations
• You can use your own apps – Anaconda and RStudio – on your computer and not in the website itself

Cons:

• Still just a bit limited on their data courses, but growing quickly

## DataCamp

Pros:

• Great options for beginners to intermediate
• Courses build on each other, fairly good examples
• Most instructors have spent time in the industry

Cons:

• You have to use their in-website coding tool
• Exercises are not always that clear
• Never know if your app will work the same way on your own computer

So that’s a quick overview of options for learning online. Of course blogs are fantastic, too, and Stack Overflow can really be helpful! Feel free to add your recommendations, too! Check out PluralSight’s great offer today!

To leave a comment for the author, please follow the link and comment on their blog: R – data technik.
Continue Reading…

### Nothing but NumPy: Understanding & Creating Neural Networks with Computational Graphs from Scratch

Entirely implemented with NumPy, this extensive tutorial provides a detailed review of neural networks followed by guided code for creating one from scratch with computational graphs.

Continue Reading…

### Top Handy SQL Features for Data Scientists

Whenever we hear "data," the first thing that comes to mind is SQL! SQL comes with easy and quick to learn features to organize and retrieve data, as well as perform actions on it in order to gain useful insights.

Continue Reading…

### Yes, you can include prior information on quantities of interest, not just on parameters in your model

Nick Kavanagh writes:

I studied economics in college and never heard more than a passing reference to Bayesian stats. I started to encounter Bayesian concepts in the workplace and decided to teach myself on the side. I was hoping to get your advice on a problem that I recently encountered. It has to do with the best way to encode prior information into a model in which the prior knowledge pertains to the overall effect of some change (not the values of individual parameters). I haven’t seen this question addressed before and thought it might be a good topic for a blog post.

I’m building a model to understand the effects of advertising on sales, controlling for other factors like pricing. A simplified version of the model is presented below.

sales = alpha + beta_ad * ad_spend + beta_price * log(price)

Additional units of advertising will, at some point, yield lower incremental sales. This non-linearity is incorporated into the model through a variable transformation — f(ad_spend, s) — where the parameter s determines the rate of diminishing returns.
sales = alpha + beta_ad * f(ad_spend, s) + beta_price * log(price)

Outside the model, I have estimates of the impact of advertising on sales obtained through randomized experiments. These experiments don’t provide estimates of beta_ad and s. They simply tell you that “increasing advertising spend by $100K generated 400 [300, 500] incremental sales.” The challenge is that different sets of parameter values for beta_ad and s yield very similar results in terms of incremental sales. I’m struggling with the best way to incorporate the experimental results into the model.

My reply:

In Stan this is super-easy: You can put priors on anything, including combinations of parameters. Consider this code fragment:

model {
target += normal_lpdf(y | a + b*x, sigma);  // data model
target += normal_lpdf(a | 0, 10);           // weak prior information on a
target += normal_lpdf(b | 0, 10);           // weak prior information on b
target += normal_lpdf(a + 5*b | 4.5, 0.2);  // strong prior information on a + 5*b
}


In this example, you have prior information on the linear combination, a + 5*b, an estimate of 4.5 with standard error 0.2, from some previous experiment.

The key is that prior information is, mathematically, just more data.
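That point can be made concrete outside Stan too: treat the prior on a + 5*b as one extra, precision-weighted pseudo-observation in a least-squares fit. A numpy sketch of the same example (the simulated data and noise level are made up, and the weak priors on a and b are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data y = a + b*x + noise, with a=1, b=0.7.
sigma = 1.0
x = rng.uniform(0, 10, 50)
y = 1.0 + 0.7 * x + rng.normal(0, sigma, 50)

# Treat the prior "a + 5*b ~ Normal(4.5, 0.2)" as one extra, precisely
# measured data point: a row [1, 5] with response 4.5 and sd 0.2.
X = np.column_stack([np.ones_like(x), x])
X_aug = np.vstack([X, [1.0, 5.0]])
y_aug = np.append(y, 4.5)
w = np.append(np.full(50, 1 / sigma**2), 1 / 0.2**2)  # precision weights

# Precision-weighted least squares (the normal-normal MAP estimate).
WX = X_aug * w[:, None]
a_hat, b_hat = np.linalg.solve(X_aug.T @ WX, X_aug.T @ (w * y_aug))
print(a_hat + 5 * b_hat)  # pulled toward 4.5 by the pseudo-observation
```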

You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan.
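To see Kavanagh's identifiability problem numerically, pick a hypothetical diminishing-returns transform (his letter leaves f unspecified; spend^s is just one common choice): quite different (beta_ad, s) pairs reproduce the same experimental lift, which is exactly why a prior on the quantity of interest helps where per-parameter priors struggle.

```python
import numpy as np

# Hypothetical diminishing-returns transform; illustrative only.
def f(spend, s):
    return spend ** s  # 0 < s < 1 gives diminishing returns

base, delta, lift = 500_000, 100_000, 400  # experiment: +$100K -> 400 sales

betas = {}
for s in (0.35, 0.50):
    diff = f(base + delta, s) - f(base, s)
    betas[s] = lift / diff  # calibrate beta_ad so this (beta_ad, s) pair
    # reproduces the experimental lift exactly
    print(s, betas[s], betas[s] * diff)
```

Both pairs match the 400-sale experiment perfectly despite very different parameter values, so the data alone cannot distinguish them.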

P.S. I’ve updated the comments on the above code in response to Lakeland’s suggestion in comments.

Continue Reading…

### Order Matters: Alibaba’s Transformer-based Recommender System

Alibaba, the largest e-commerce platform in China, is a powerhouse not only when it comes to e-commerce, but also when it comes to recommender systems research. Their latest paper, Behaviour Sequence Transformer for E-commerce Recommendation in Alibaba, is yet another publication that pushes the state of the art in recommender systems.

Continue Reading…

### History of slavery in America

USA Today looks at some of the numbers on 17th century slavery in America. The format, with zooms in and out and shifts to different views, focuses both on scale and the individuals.

Tags: ,

Continue Reading…

### Finding out why

Mining causality from text is a complex and crucial natural language understanding task. Most early attempts at its solution can be grouped into two categories: 1) utilizing co-occurrence frequency and world knowledge for causality detection; 2) extracting cause-effect pairs by using connectives and syntax patterns directly. However, because causality has various linguistic expressions, these methods cannot avoid the problems of noisy data and ignored implicit expressions. In this paper, we present a neural causality detection model, namely Multi-level Causality Detection Network (MCDN), to address this problem. Specifically, we adopt multi-head self-attention to acquire semantic features at the word level and integrate a novel Relation Network to infer causality at the segment level. To the best of our knowledge, this is the first time the Relation Network has been applied to causality tasks. The experimental results on the AltLex dataset demonstrate that: a) MCDN is highly effective for ambiguous and implicit causality inference; b) compared with a regular text classification task, causality detection requires stronger inference capability; c) the proposed approach achieved state-of-the-art performance.
Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a data management problem. In this paper, we first make a distinction between associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning. We then review existing works and identify future opportunities for applying data management techniques to causal algorithmic fairness.
All networks can be analyzed at multiple scales. A higher scale of a network is made up of macro-nodes: subgraphs that have been grouped into individual nodes. Recasting a network at higher scales can have useful effects, such as decreasing the uncertainty in the movement of random walkers across the network while also decreasing the size of the network. However, the task of finding such a macroscale representation is computationally difficult, as the set of all possible scales of a network grows exponentially with the number of nodes. Here we compare various methods for finding the most informative scale of a network, discovering that an approach based on spectral analysis outperforms greedy and gradient descent-based methods. We then use this procedure to show how several structural properties of preferential attachment networks vary across scales. We describe how meso- and macroscale representations of networks can have significant benefits over their underlying microscale, which include properties such as increase in determinism, a decrease in degeneracy, a lower entropy rate of random walkers on the network, an increase in global network efficiency, and higher values for a variety of centrality measures than the microscale.
We use an analogy between non-isomorphic mathematical structures defined over the same set, and the algebras induced by associative and causal levels of information, to argue that Reinforcement Learning, in its current formulation, is not a causal problem, independently of whether the motivation behind it has to do with an agent taking actions.
Regression modelling typically assumes homogeneity of the conditional distribution of responses Y given features X. For inhomogeneous data, with latent groups having potentially different underlying distributions, the hidden group structure can be crucial for estimation and prediction, and standard regression models may be severely confounded. Worse, in the multivariate setting, the presence of such inhomogeneity can easily pass undetected. To allow for robust and interpretable regression modelling in the heterogeneous data setting we put forward a class of mixture models that couples together both the multivariate marginal on X and the conditional Y | X to capture the latent group structure. This joint modelling approach allows for group-specific regression parameters, automatically controlling for the latent confounding that may otherwise pose difficulties, and offers a novel way to deal with suspected distributional shifts in the data. We show how the latent variable model can be regularized to provide scalable solutions with explicit sparsity. Estimation is handled via an expectation-maximization algorithm. We illustrate the key ideas via empirical examples.
A dilated causal one-dimensional convolutional neural network architecture is proposed for quantile regression. The model can forecast any arbitrary quantile, and it can be trained jointly on multiple similar time series. An application to Value at Risk forecasting shows that QCNN outperforms linear quantile regression and constant quantile estimates.
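The key ingredient that lets a network "forecast any arbitrary quantile" is the pinball (quantile) loss. A minimal numpy sketch of that loss and its defining property follows; the network architecture and data here are not from the paper, only the loss is illustrated:

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Mean pinball (quantile) loss at level q: under-prediction is
    penalized with weight q, over-prediction with weight 1 - q."""
    diff = y - pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# The constant that minimizes the pinball loss is the empirical q-quantile,
# which is what makes this loss suitable for quantile forecasting (e.g. VaR).
y = np.random.default_rng(0).normal(size=1000)
q = 0.95
best = np.quantile(y, q)
```

Training a model on this loss at q = 0.95 therefore pushes its output toward the conditional 95th percentile rather than the conditional mean.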

Continue Reading…

### Distilled News

In this post, we will talk about the analysis of time series data with trend and seasonal components. An econometric approach will be followed to model the statistical properties of the data. The business objective here is forecasting. We attempt to explain various concepts involved in time series modelling, such as time series components, serial correlation, model fitting, metrics, etc. We will use the SARIMAX model provided by the statsmodels library to model both seasonality and trend in the data. SARIMA (Seasonal ARIMA) is capable of modelling seasonality and trend together, unlike ARIMA, which can only model trend.
It’s easy to take a CSV file and feed it into various ML models. But the difficulty lies in implementing the end-to-end process of getting a video, extracting images, uploading them to Google Cloud Storage, and later performing AutoML on them, entirely in Python. Nowadays, most companies have their own in-built models; if not, they use Google ML models or others.
ktrain is a library to help build, train, debug, and deploy neural networks in the deep learning software framework, Keras. Inspired by the fastai library, with only a few lines of code, ktrain allows you to easily:
• estimate an optimal learning rate for your model given your data using a learning rate finder
• employ learning rate schedules such as the triangular learning rate policy, 1cycle policy, and SGDR to more effectively train your model
• employ fast and easy-to-use pre-canned models for both text classification (e.g., NBSVM, fastText, GRU with pretrained word embeddings) and image classification (e.g., ResNet, Wide Residual Networks, Inception)
• load and preprocess text and image data from a variety of formats
• inspect data points that were misclassified to help improve your model
• leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data
When first starting to learn how to optimise machine learning models, I would often find, after getting to the model-building stage, that I had to keep going back to the data to better handle the types of features present in the dataset. Over time I have found that one of the first steps to take before building models is to carefully review the variable types present in the data, and to try to determine up front the best transformation process to achieve optimal model performance. In the following post I describe the process I take to identify and transform four common variable types. I am using a dataset taken from the ‘machine learning with a heart’ warm-up competition hosted on the https://www.drivendata.org website. The full dataset can be downloaded here https://…/. DrivenData hosts regular online challenges based on solving social problems. I have recently started to engage in some of these competitions in an effort to use some of my skills for a good cause, and also to gain experience with datasets and problems that I don’t usually encounter in my day-to-day work.
A step-by-step guide to performing additive and multiplicative decomposition. Last time, we talked about the main patterns found in time series data. We saw that trend, season, and cycle are the most common variations in data recorded through time. However, each of these patterns might affect the time series in different ways. In fact, when choosing a forecasting model, after identifying patterns like trend and season, we need to understand how each one behaves in the series. With this goal in mind, let’s explore two different pre-processing techniques: additive and multiplicative decomposition.
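The two schemes can be sketched with the classical moving-average procedure in plain numpy (statsmodels' `seasonal_decompose` implements the same idea); this is a simplified illustration, not the post's code:

```python
import numpy as np

def decompose(y, period, model="additive"):
    """Classical decomposition via a centered moving average.
    Returns (trend, seasonal, residual); trend has NaNs at the edges."""
    n = len(y)
    # A centered moving average over one full period estimates the trend.
    kernel = np.ones(period) / period
    if period % 2 == 0:  # even period: average two offset windows to center it
        kernel = np.convolve(np.ones(2) / 2, kernel)
    half = len(kernel) // 2
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(y, kernel, mode="valid")

    # Additive: y = trend + seasonal + resid; multiplicative: y = trend * seasonal * resid.
    detrended = y - trend if model == "additive" else y / trend
    # Seasonal component: average the detrended values at each position in the cycle.
    seasonal = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    seasonal = np.tile(seasonal, n // period + 1)[:n]
    if model == "additive":
        resid = y - trend - seasonal
    else:
        resid = y / (trend * seasonal)
    return trend, seasonal, resid
```

Choosing between the two comes down to whether the seasonal swings stay constant as the level grows (additive) or scale with it (multiplicative).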
It can sometimes be hard to see how today’s clunky stethoscopes will turn into tomorrow’s Star Trek tricorders. This post will help you better envision that path by explaining one concrete development in healthcare: a technology that determines your heart rate just from video.
Research Findings:
• There is a diversity crisis in the AI sector across gender and race.
• The AI sector needs a profound shift in how it addresses the current diversity crisis.
• The overwhelming focus on ‘women in tech’ is too narrow and likely to privilege white women over others.
• Fixing the ‘pipeline’ won’t fix AI’s diversity problems.
• The use of AI systems for the classification, detection, and prediction of race and gender is in urgent need of re-evaluation.
I’ve always been sceptical of deepfakes. What are they good for? I’ve never understood the excitement over the perceived utility of deepfakes for disinformation in information warfare. Information warfare does not need deepfakes; cheapfakes are more than enough. Finally, someone has found a use for deepfakes as offensive cyber tools, so let’s deep-dive into deepfakes!
In the previous article, I introduced the concept of topic modeling and walked through the code for developing your first topic model using the Latent Dirichlet Allocation (LDA) method in Python with the sklearn implementation. Building on that understanding, in this article we’ll go a few steps deeper by outlining the framework to quantitatively evaluate topic models through the measure of topic coherence, and share a code template in Python using the Gensim implementation to allow for end-to-end model development.
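As a rough illustration of what a coherence measure actually computes, here is a simplified UMass-style coherence in plain Python (Gensim's `CoherenceModel` provides this and other variants; the toy corpus and topics below are invented for illustration):

```python
from itertools import combinations
from math import log

def umass_coherence(topic_words, documents):
    """Simplified UMass coherence: sum over word pairs of
    log((D(wi, wj) + 1) / D(wj)), where D counts documents containing
    the given words. Higher means the topic's words co-occur more."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += log((doc_count(wi, wj) + 1) / doc_count(wj))
    return score

# Toy corpus: a topic whose words co-occur should score higher.
docs = [["cat", "dog", "pet"], ["dog", "pet"], ["car", "engine"], ["car", "road"]]
coherent = umass_coherence(["dog", "pet"], docs)
incoherent = umass_coherence(["dog", "engine"], docs)
```

In practice you would compute this over the top-N words of each LDA topic and use the average coherence to compare models with different numbers of topics.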
Whenever statisticians are asked to make inferences about some population parameter, which cannot be observed, they need to start from a representative sample of that population. However, once they have obtained an estimate of that parameter (called a statistic), how can they state whether it corresponds to the real parameter, since the latter is unknown? Because the two cannot be compared directly (again, one is not observable), it is necessary to make some assumptions and run so-called hypothesis tests. These tests evaluate how likely it is that the estimate equals the real parameter. The idea is that there always exists a situation that can be considered the default: the conservative scenario, the one you had better keep if you are not sufficiently sure of your assumption. This is the Null Hypothesis, or H0. On the other hand, there is the alternative scenario which, if accepted, changes the status quo ante. It is the Alternative Hypothesis, or H1.
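The H0/H1 logic can be made concrete with a small one-sample test for a mean, implemented from scratch with a normal approximation for the p-value (the bottling example and the data are invented for illustration):

```python
from math import erf, sqrt
from statistics import mean, stdev

def one_sample_test(sample, mu0):
    """Test H0: population mean == mu0 against H1: mean != mu0.
    Returns the test statistic and a two-sided p-value
    (normal approximation, reasonable for moderate sample sizes)."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    p = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))  # 2 * P(Z > |t|)
    return t, p

# H0 (the default, conservative scenario): the machine fills 500 ml on average.
sample = [497.1, 498.5, 499.2, 496.8, 498.0, 497.5, 498.9, 497.7,
          498.3, 497.0, 498.6, 497.9]
t, p = one_sample_test(sample, 500.0)
# A small p-value means the data are unlikely under H0, so we reject it.
```

Rejecting H0 at, say, the 5% level means that an estimate this far from 500 would occur less than 5% of the time if H0 were true.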
No matter how well you build a model, no one will know about it if you cannot ship it. However, many data scientists want to focus on model building and skip the rest, such as data ingestion and model serving. DevOps for data scientists is very important. There are multiple ways to deliver your model as an API for a downstream product team.
One of my biggest regrets as a data scientist is that I avoided learning Python for too long. I always figured that other languages provided parity in terms of accomplishing data science tasks, but now that I’ve made the leap to Python there is no looking back. I’ve embraced a language that can help data scientists quickly take ideas from conception to prototype to production. And that last term, production, is perhaps the most important aspect of the ever-evolving discipline of data science. Knowing how to build machine learning models is useful, but new tools such as AutoML are beginning to commoditize the work of data scientists. Instead of having a data scientist build a bespoke model for a single product, you can now build robust models that scale to a portfolio of products. As new roles emerge, such as applied scientist, with a hybrid of ML engineering and data science competencies, there are new opportunities for data science.
Sentiment analysis is a hot topic in NLP, but this technology is increasingly relevant in the financial markets, which are in large part driven by investor sentiment. With so many reports and economic bulletins being generated on a daily basis, one of the big challenges for policymakers is to extract meaningful information in a short period of time to inform policy decisions. In this example, two reports from the European Central Bank website (available from the relevant GitHub repository) are converted into text format, and then a logistic regression is used to rank keywords by positive and negative sentiment.
Rapid advancements in science and research have led to an enormous amount of digital scholarly data being produced and collected every day. This scholarly data can take the form of scientific publications, books, teaching materials, and many other scholarly sources of information. Over time, these information sources build complex relationships among themselves through citation and co-authorship, leading to the formation of big scholarly networks that become increasingly challenging to decode.
In this blog, we are going to learn how to do data cleaning in Python. Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent finding, cleaning, and reorganizing huge amounts of data, which is an inefficient data strategy. The reason data scientists are hired in the first place is to develop algorithms and build machine learning models, and these are typically the parts of the job that they enjoy most. If you are just stepping into this field, or planning to make your career in it, it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers. In this tutorial, we are going to use Python’s NumPy and Pandas libraries to clean data and see in how many ways we can use them.
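A short pandas sketch of the kinds of cleaning steps mentioned; the toy data, the -999 sentinel, and the specific rules are illustrative assumptions, not the tutorial's dataset:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "name": [" Alice", "BOB", "bob", None, "Carol "],
    "age":  ["34", "29", "29", "41", "-999"],          # strings plus a sentinel value
    "city": ["NYC", "nyc", "New York", "Boston", "boston"],
})

df = raw.copy()
df["name"] = df["name"].str.strip().str.title()        # inconsistent formatting
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # malformed records -> NaN
df.loc[df["age"] < 0, "age"] = np.nan                  # nonsensical outliers -> missing
df["city"] = df["city"].str.lower().replace({"new york": "nyc"})
df = df.drop_duplicates(subset=["name", "age"])        # duplicate records
df = df.dropna(subset=["name"])                        # rows missing a key field
```

Each line maps to one of the problems listed above: whitespace and casing, unparseable values, sentinel outliers, inconsistent category labels, duplicates, and missing keys.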

Continue Reading…

### Four short links: 23 August 2019

Open Source Economics, Program Synthesis, YouTube Influence, and ChatBot Papers

1. The Economics of Open Source (CJ Silverio) -- I'm going to tell you a story about who owns the Javascript language commons, how we got into the situation that the language commons is *owned* by someone, and why we need to change it.
2. State of the Art in Program Synthesis -- conference, with talks to be posted afterwards, run by a YC startup. Program Synthesis is one of the most exciting fields in software today, in my humble opinion: Programs that write programs are the happiest programs in the world, in the words of Andrew Hume. It'll give coders superpowers, or make us redundant, but either way it's interesting.
3. Alternative Influence (Data and Society) -- amazing report. Extremely well-written, it lays out how the alt right uses YouTube. These strategies reveal a tension underlying the content produced by these influencers: while they present themselves as news sources, their content strategies often more accurately consist of marketing and advertising approaches. These approaches are meant to provoke feelings, memories, emotions, and social ties. In this way, the “accuracy” of their messaging can be difficult to assess through traditional journalistic tactics like fact-checking. Specifically, they recount ideological testimonials that frame ideology in terms of personal growth and self-betterment. They engage in self-branding techniques that present traditional, white, male-dominated values as desirable and aspirational. They employ search engine optimization (SEO) to highly rank their content against politically charged keywords. And they strategically use controversy to gain attention and frame political ideas as fun entertainment.
4. Chatbot and Related Research Paper Notes with Images -- Papers related to chatbot models in chronological order spanning about five years from 2014. Some papers are not about chatbots, but I included them because they are interesting, and they may provide insights into creating new and different conversation models. For each paper I provided a link, the names of the authors, and GitHub implementations of the paper (noting the deep learning framework) if I happened to find any. Since I tried to make these notes as concise as possible, they are in no way summarizing the papers, but are merely a starting point to get the hang of what each paper is about and to mention the main concepts with the help of pictures.

Continue reading Four short links: 23 August 2019.

Continue Reading…

### R Packages worth a look

Larger-than-RAM Disk-Based Data Manipulation Framework (disk.frame)
A disk-based data manipulation tool for working with arbitrarily large datasets as long as they fit on disk.

Cross-Entropy Optimisation of Noisy Functions (noisyCE2)
Cross-Entropy optimisation of unconstrained deterministic and noisy functions, as illustrated in Rubinstein and Kroese (2004, ISBN: 978-1-4419-1940-3), through a highly flexible and customisable function which allows users to define custom variable domains, sampling distributions, updating and smoothing rules, and stopping criteria. Several built-in methods and settings make the package very easy to use for standard optimisation problems.

Read Tabular Data from Diverse Sources and Easily Make Them Tidy (tidycells)
Provides utilities to read cells from complex tabular data and perform heuristic-detection-based ‘structural assignment’ of those cells to a columnar or tidy format. The read functionality can read structured, partially structured, or unstructured tabular data from various types of documents. The ‘structural assignment’ functionality supports both supervised and unsupervised assignment of cell data to a columnar/tidy format. Multiple disconnected blocks of tables in a single sheet are also handled appropriately. These tools are suitable for unattended conversion of messy tables into a consumable format (usable for further analysis and data wrangling).

Nonparametric Item Response Theory (KernSmoothIRT)
Fits nonparametric item and option characteristic curves using kernel smoothing. It allows for optimal selection of the smoothing bandwidth using cross-validation and a variety of exploratory plotting tools. The kernel smoothing is based on methods described in Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.

Continue Reading…

### What's new on arXiv

In the absence of sufficient data variation (e.g., scanner and protocol variability) in annotated data, deep neural networks (DNNs) tend to overfit during training. As a result, their performance is significantly lower on data from unseen sources compared to the performance on data from the same source as the training data. Semi-supervised domain adaptation methods can alleviate this problem by tuning networks to new target domains without the need for annotated data from these domains. Adversarial domain adaptation (ADA) methods are a popular choice that aim to train networks in such a way that the features generated are domain agnostic. However, these methods require careful dataset-specific selection of hyperparameters such as the complexity of the discriminator in order to achieve a reasonable performance. We propose to use knowledge distillation (KD) — an efficient way of transferring knowledge between different DNNs — for semi-supervised domain adaptation of DNNs. It does not require dataset-specific hyperparameter tuning, making it generally applicable. The proposed method is compared to ADA for segmentation of white matter hyperintensities (WMH) in magnetic resonance imaging (MRI) scans generated by scanners that are not a part of the training set. Compared with both the baseline DNN (trained on the source domain only and without any adaptation to the target domain) and with using ADA for semi-supervised domain adaptation, the proposed method achieves significantly higher WMH dice scores.
Fast Fourier transform was included in the Top 10 Algorithms of 20th Century by Computing in Science & Engineering. In this paper, we provide a new simple derivation of both the discrete Fourier transform and fast Fourier transform by means of elementary linear algebra. We start the exposition by introducing the convolution product of vectors, represented by a circulant matrix, and derive the discrete Fourier transform as the change of basis matrix that diagonalizes the circulant matrix. We also generalize our approach to derive the Fourier transform on any finite abelian group, where the case of Fourier transform on the Boolean cube is especially important for many applications in theoretical computer science.
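The central fact of this derivation, that the DFT matrix diagonalizes a circulant matrix with the DFT of its first column on the diagonal, can be checked numerically in a few lines (a generic numpy verification, not the paper's code):

```python
import numpy as np

n = 8
c = np.random.default_rng(1).normal(size=n)

# Circulant matrix whose first column is c: C[i, j] = c[(i - j) mod n].
# Multiplying by C performs circular convolution with c.
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

# DFT matrix F: the change of basis that diagonalizes every circulant.
w = np.exp(-2j * np.pi / n)
F = np.array([[w ** (i * j) for j in range(n)] for i in range(n)])

# F C F^{-1} is diagonal, with the DFT of c on the diagonal.
D = F @ C @ np.linalg.inv(F)
diag_ok = np.allclose(D - np.diag(np.diag(D)), 0)
eig_ok = np.allclose(np.diag(D), np.fft.fft(c))
```

Both checks hold, which is exactly the convolution theorem in matrix form: convolution by c in the standard basis becomes entrywise multiplication by fft(c) in the Fourier basis.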
As a general-purpose generative model architecture, the VAE has been widely used in image and natural language processing. A VAE maps high-dimensional sample data into continuous latent variables with unsupervised learning. By sampling in the latent variable space, a VAE can construct new image or text data. As a general-purpose generative model, however, the vanilla VAE cannot fit well to various datasets and neural networks with different structures. Because of the need to balance reconstruction accuracy against the convenience of latent variable sampling during training, VAEs often suffer from a problem known as ‘posterior collapse’, and images reconstructed by VAEs are often blurred. In this paper, we analyze the main cause of these problems: the lack of mutual information between the sample variable and the latent feature variable during training. To maintain mutual information during training, we propose an auxiliary softmax multi-classification network structure to improve the training of the VAE, named VAE-AS. We use the MNIST and Omniglot datasets to test the VAE-AS model. The test results show that VAE-AS has clear effects on adjusting mutual information and solving the posterior collapse problem.
Collaborative filtering has been widely used in recommendation systems to recommend items that users might like. However, collaborative filtering based recommendation systems are vulnerable to shilling attacks. Malicious users tend to increase or decrease the recommended frequency of target items by injecting fake profiles. In this paper, we propose a Kalman filter-based attack detection model, which statistically analyzes the difference between the actual rating and the predicted rating calculated by the model to find potentially abnormal time periods. The Kalman filter filters out suspicious ratings based on the abnormal time period and identifies suspicious users based on the source of these ratings. The experimental results show that our method achieves much better detection performance for shilling attacks than traditional methods.
Convolutional Neural Networks (CNNs) provide excellent performance when used for image classification. The classical method of training CNNs is by labeling images in a supervised manner as in ‘input image belongs to this label’ (Positive Learning; PL), which is a fast and accurate method if the labels are assigned correctly to all images. However, if inaccurate labels, or noisy labels, exist, training with PL will provide wrong information, thus severely degrading performance. To address this issue, we start with an indirect learning method called Negative Learning (NL), in which the CNNs are trained using a complementary label as in ‘input image does not belong to this complementary label.’ Because the chances of selecting a true label as a complementary label are low, NL decreases the risk of providing incorrect information. Furthermore, to improve convergence, we extend our method by adopting PL selectively, termed as Selective Negative Learning and Positive Learning (SelNLPL). PL is used selectively to train upon expected-to-be-clean data, whose choices become possible as NL progresses, thus resulting in superior performance of filtering out noisy data. With simple semi-supervised training technique, our method achieves state-of-the-art accuracy for noisy data classification, proving the superiority of SelNLPL’s noisy data filtering ability.
We start with a brief introduction to reinforcement learning (RL): its success stories, basics, an example, issues, the ICML 2019 Workshop on RL for Real Life, how to use it, study material, and an outlook. Then we discuss a selection of RL applications, including recommender systems, computer systems, energy, finance, healthcare, robotics, and transportation.
The use of statistical software in academia and enterprises has been evolving over recent years. More often than not, students, professors, and workers have all had, at some point, exposure to statistical software. Sometimes, difficulties arise when dealing with such software. Very few people have the theoretical knowledge to clearly understand software configurations or settings, and sometimes even the presented results. Very often, users are required by academies or enterprises to present reports without the time to explore or understand the results, or the tasks required to optimally prepare the data or software settings. In this work, we present a statistical overview of some theoretical concepts, to provide fast access to them.
Despite numerous research work in reinforcement learning (RL) and the recent successes obtained by combining it with deep learning, deep reinforcement learning (DRL) is still facing many challenges. Some of them, like the ability to abstract actions or the difficulty to explore the environment with sparse rewards, can be addressed by the use of intrinsic motivation. In this article, we provide a survey on the role of intrinsic motivation in DRL. We categorize the different kinds of intrinsic motivations and detail their interests and limitations. Our investigation shows that the combination of DRL and intrinsic motivation enables the learning of more complicated and more generalisable behaviours than standard DRL. We provide an in-depth analysis describing learning modules through a unifying scheme composed of information theory, compression theory, and reinforcement learning. We then explain how these modules could serve as building blocks over a complete developmental architecture, highlighting the numerous outlooks of the domain.
The paper discusses regularization properties of artificial data for deep learning. Artificial datasets make it possible to train neural networks when real data are in short supply. It is demonstrated that the artificial data generation process, described as injecting noise into high-level features, bears several similarities to existing regularization methods for deep neural networks. One can treat this property of artificial data as a kind of ‘deep’ regularization. It is thus possible to regularize hidden layers of the network by generating the training data in a certain way.
The problem of event extraction is a relatively difficult task for low resource languages due to the non-availability of sufficient annotated data. Moreover, the task becomes complex for tail (rarely occurring) labels wherein extremely less data is available. In this paper, we present a new dataset (InDEE-2019) in the disaster domain for multiple Indic languages, collected from news websites. Using this dataset, we evaluate several rule-based mechanisms to augment deep learning based models. We formulate our problem of event extraction as a sequence labeling task and perform extensive experiments to study and understand the effectiveness of different approaches. We further show that tail labels can be easily incorporated by creating new rules without the requirement of large annotated data.
Linear and non-linear measures of heart rate variability (HRV) are widely investigated as non-invasive indicators of health. Stress has a profound impact on heart rate, and different meditation techniques have been found to modulate heartbeat rhythm. This paper aims to explore the process of identifying appropriate metrics from HRV analysis for sonification. Sonification is a type of auditory display involving the process of mapping data to acoustic parameters. This work explores the use of auditory display in aiding the analysis of HRV, leveraged by unsupervised machine learning techniques. Unsupervised clustering helps select the appropriate features to improve the interpretability of the sonification. Vocal synthesis sonification techniques are employed to increase comprehension and learnability of the processed data displayed through sound. These analyses are early steps in building a real-time sound-based biofeedback training system.
Network representation learning, a fundamental research problem which aims at learning low-dimension node representations on graph-structured data, has been extensively studied in the research community. By generalizing the power of neural networks on graph-structured data, graph neural networks (GNNs) achieve superior capability in network representation learning. However, the node features of many real-world graphs could be high-dimensional and sparse, rendering the learned node representations from existing GNN architectures less expressive. The main reason is that these models directly use the raw node features as input for message passing and have limited power in capturing sophisticated interactions between features. In this paper, we propose a novel GNN framework for learning node representations that incorporate high-order feature interactions on feature-sparse graphs. Specifically, the proposed message aggregator and feature factorizer extract two channels of embeddings from the feature-sparse graph, characterizing the aggregated node features and high-order feature interactions, respectively. Furthermore, we develop an attentive fusion network to seamlessly combine the information from the two different channels and learn the feature interaction-aware node representations. Extensive experiments on various datasets demonstrate the effectiveness of the proposed framework on a variety of graph learning tasks.
A massive number of well-trained deep networks have been released by developers online. These networks may focus on different tasks and in many cases are optimized for different datasets. In this paper, we study how to exploit such heterogeneous pre-trained networks, known as teachers, so as to train a customized student network that tackles a set of selective tasks defined by the user. We assume no human annotations are available, and each teacher may be either single- or multi-task. To this end, we introduce a dual-step strategy that first extracts the task-specific knowledge from the heterogeneous teachers sharing the same sub-task, and then amalgamates the extracted knowledge to build the student network. To facilitate the training, we employ a selective learning scheme where, for each unlabelled sample, the student learns adaptively from only the teacher with the least prediction ambiguity. We evaluate the proposed approach on several datasets and experimental results demonstrate that the student, learned by such adaptive knowledge amalgamation, achieves performances even better than those of the teachers.
The objective of change-point detection is to discover the abrupt property changes lying behind time-series data. In this paper, we first summarize the definition and in-depth implications of change-point detection. We then elaborate on traditional and some alternative model-based change-point detection algorithms. Finally, we go a bit further into the theory and look into future research directions.
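As a point of reference for the model-based algorithms such a survey covers, the simplest case, a single abrupt shift in the mean, can be detected with a few lines of numpy (a generic least-squares sketch, not one of the paper's methods):

```python
import numpy as np

def detect_mean_shift(x):
    """Single change-point in the mean: pick the split that maximizes the
    size-weighted squared difference between the two segment means."""
    n = len(x)
    best_k, best_score = None, -np.inf
    for k in range(2, n - 1):
        left, right = x[:k], x[k:]
        score = (k * (n - k) / n) * (left.mean() - right.mean()) ** 2
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Simulated series with an abrupt mean shift at index 100.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
k = detect_mean_shift(x)  # should land near the true change point
```

Real change-point methods generalize this idea to multiple changes, other properties than the mean, and online (streaming) settings.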
Electronic health records are an important source for clinical research and applications, and errors inevitably occur in the data, which can lead to severe harm to both patients and hospital services. One such error is a mismatch between diagnoses and prescriptions, which we refer to as a ‘medication anomaly’ in this paper, and which clinicians used to identify and correct manually. With the development of machine learning techniques, researchers are able to train specific models for the task, but the process still requires expert knowledge to construct proper features, and few semantic relations are considered. In this paper, we propose a simple yet effective detection method that tackles the problem by detecting the semantic inconsistency between diagnoses and prescriptions. Unlike traditional outlier or anomaly detection, the scheme uses continuous bag-of-words to construct the semantic connection between specific central words and their surrounding context. The detection of medication anomalies is transformed into identifying the least probable central word given its context. To help distinguish anomalies from normal context, we also incorporate a ranking accumulation strategy. The experiments were conducted on two real hospital electronic medical records, and the top-N accuracy of the proposed method increased by 3.91 to 10.91% and 0.68 to 2.13% on the two datasets, respectively, which is highly competitive with other traditional machine-learning-based approaches.
In this paper, we study the optimal multiple stopping problem under the filtration consistent nonlinear expectations. The reward is given by a set of random variables satisfying some appropriate assumptions rather than an RCLL process. We first construct the optimal stopping time for the single stopping problem, which is no longer given by the first hitting time of processes. We then prove by induction that the value function of the multiple stopping problem can be interpreted as the one for the single stopping problem associated with a new reward family, which allows us to construct the optimal multiple stopping times. If the reward family satisfies some strong regularity conditions, we show that the reward family and the value functions can be aggregated by some progressive processes. Hence, the optimal stopping times can be represented as hitting times.
Regulatory compliance is an organization’s adherence to laws, regulations, guidelines and specifications relevant to its business. Compliance officers responsible for maintaining adherence constantly struggle to keep up with the large amount of changes in regulatory requirements. Keeping up with the changes entails two main tasks: fetching the regulatory announcements that actually contain changes of interest, and incorporating those changes in the business process. In this paper we focus on the first task and present a Compliance Change Tracking System that gathers regulatory announcements from government sites, news sites, and email subscriptions; classifies their importance (i.e., actionability) through a hierarchical classifier; and classifies business process applicability through a multi-class classifier. For these classifiers, we experiment with several approaches, such as vanilla classification methods (e.g., Naive Bayes, logistic regression), hierarchical classification methods, a rule-based approach, and a hybrid approach with various preprocessing and feature selection methods; and show that despite the richness of other models, simple hierarchical classification with bag-of-words features works best for the actionability classifier and multi-class logistic regression works best for the applicability classifier. The system has been deployed in global delivery centers and has received positive feedback from payroll compliance officers.
Most of the existing generative adversarial networks (GAN) for text generation suffer from the instability of reinforcement learning training algorithms such as policy gradient, leading to unstable performance. To tackle this problem, we propose a novel framework called Adversarial Reward Augmented Maximum Likelihood (ARAML). During adversarial training, the discriminator assigns rewards to samples which are acquired from a stationary distribution near the data rather than the generator’s distribution. The generator is optimized with maximum likelihood estimation augmented by the discriminator’s rewards instead of policy gradient. Experiments show that our model can outperform state-of-the-art text GANs with a more stable training process.
We investigate the impact of filter choice on forecast accuracy in state space models. The filters are used both to estimate the posterior distribution of the parameters, via a particle marginal Metropolis-Hastings (PMMH) algorithm, and to produce draws from the filtered distribution of the final state. Multiple filters are entertained, including two new data-driven methods. Simulation exercises are used to document the performance of each PMMH algorithm, in terms of computation time and the efficiency of the chain. We then produce the forecast distributions for the one-step-ahead value of the observed variable, using a fixed number of particles and Markov chain draws. Despite distinct differences in efficiency, the filters yield virtually identical forecasting accuracy, with this result holding under both correct and incorrect specification of the model. This invariance of forecast performance to the specification of the filter also characterizes an empirical analysis of S&P500 daily returns.
Word analogy tasks have tended to be handcrafted, involving permutations of hundreds of words with dozens of relations, mostly morphological relations and named entities. Here, we propose modeling commonsense knowledge down to word-level analogical reasoning. We present CA-EHN, the first commonsense word analogy dataset containing 85K analogies covering 5K words and 6K commonsense relations. This was compiled by leveraging E-HowNet, an ontology that annotates 88K Chinese words with their structured sense definitions and English translations. Experiments show that CA-EHN stands out as a great indicator of how well word representations embed commonsense structures, which is crucial for future end-to-end models to generalize inference beyond training corpora. The dataset is publicly available at \url{https://…/CA-EHN}.
As deep learning applications are becoming more and more pervasive in robotics, the question of evaluating the reliability of inferences becomes a central question in the robotics community. This domain, known as predictive uncertainty, has come under the scrutiny of research groups developing Bayesian approaches adapted to deep learning such as Monte Carlo Dropout. Unfortunately, for the time being, the real goal of predictive uncertainty has been swept under the rug. Indeed, these approaches are solely evaluated in terms of raw performance of the network prediction, while the quality of their estimated uncertainty is not assessed. Evaluating such uncertainty prediction quality is especially important in robotics, as actions shall depend on the confidence in perceived information. In this context, the main contribution of this article is to propose a novel metric that is adapted to the evaluation of relative uncertainty assessment and directly applicable to regression with deep neural networks. To experimentally validate this metric, we evaluate it on a toy dataset and then apply it to the task of monocular depth estimation.
Knowledge graphs have attracted lots of attention in academic and industrial environments. Despite their usefulness, popular knowledge graphs suffer from incompleteness of information, especially in their type assertions. This has encouraged research in the automatic discovery of entity types. In this context, multiple works were developed to utilize logical inference on ontologies and statistical machine learning methods to learn type assertions in knowledge graphs. However, these approaches suffer from limited performance on noisy data, limited scalability, and dependence on labeled training samples. In this work, we propose a new unsupervised approach that learns to categorize entities into a hierarchy of named groups. We show that our approach is able to effectively learn entity groups using a scalable procedure in noisy and sparse datasets. We evaluate our approach on a set of popular knowledge graph benchmarking datasets, and we publish a collection of the resulting group hierarchies.
Recognizing multiple labels of images is a practical and challenging task, and significant progress has been made by searching semantic-aware regions and modeling label dependency. However, current methods cannot locate the semantic regions accurately due to the lack of part-level supervision or semantic guidance. Moreover, they cannot fully explore the mutual interactions among the semantic regions and do not explicitly model the label co-occurrence. To address these issues, we propose a Semantic-Specific Graph Representation Learning (SSGRL) framework that consists of two crucial modules: 1) a semantic decoupling module that incorporates category semantics to guide learning semantic-specific representations and 2) a semantic interaction module that correlates these representations with a graph built on the statistical label co-occurrence and explores their interactions via a graph propagation mechanism. Extensive experiments on public benchmarks show that our SSGRL framework outperforms current state-of-the-art methods by a sizable margin, e.g. with an mAP improvement of 2.5%, 2.6%, 6.7%, and 3.1% on the PASCAL VOC 2007 & 2012, Microsoft-COCO and Visual Genome benchmarks, respectively. Our codes and models are available at https://…/SSGRL.
Items in modern recommender systems are often organized in hierarchical structures. These hierarchical structures and the data within them provide valuable information for building personalized recommendation systems. In this paper, we propose a general hierarchical Bayesian learning framework, i.e., \emph{HBayes}, to learn both the structures and associated latent factors. Furthermore, we develop a variational inference algorithm that is able to learn model parameters with a fast empirical convergence rate. The proposed HBayes is evaluated on two real-world datasets from different domains. The results demonstrate the benefits of our approach on item recommendation tasks, and show that it can outperform the state-of-the-art models in terms of precision, recall, and normalized discounted cumulative gain. To encourage reproducible results, we make our code public on a git repo: \url{https://…/ycruhk4t}.
The challenge of nowcasting and forecasting the effect of natural disasters (e.g. earthquakes, floods, hurricanes) on assets, people and society is of primary importance for assessing the ability of such systems to recover from extreme events. Traditional disaster recovery estimates, such as surveys and interviews, are usually costly, time consuming and do not scale. Here we present a methodology to indirectly estimate the post-emergency recovery status (‘downtime’) of small businesses in urban areas by looking at their online posting activity on social media. Analysing the time series of posts before and after an event, we quantify the downtime of small businesses for three natural disasters that occurred in Nepal, Puerto Rico and Mexico. A convenient and reliable method for nowcasting the post-emergency recovery status of economic activities could help local governments and decision makers to better target their interventions and distribute the available resources more effectively.
Transition-based and graph-based dependency parsers have previously been shown to have complementary strengths and weaknesses: transition-based parsers exploit rich structural features but suffer from error propagation, while graph-based parsers benefit from global optimization but have restricted feature scope. In this paper, we show that, even though some details of the picture have changed after the switch to neural networks and continuous representations, the basic trade-off between rich features and global optimization remains essentially the same. Moreover, we show that deep contextualized word embeddings, which allow parsers to pack information about global sentence structure into local feature representations, benefit transition-based parsers more than graph-based parsers, making the two approaches virtually equivalent in terms of both accuracy and error profile. We argue that the reason is that these representations help prevent search errors and thereby allow transition-based parsers to better exploit their inherent strength of making accurate local decisions. We support this explanation by an error analysis of parsing experiments on 13 languages.
The last decade has seen tremendous developments in memory and storage technologies, starting with Flash Memory and continuing with the upcoming Storage-Class Memories. Combined with an explosion of data processing, data analytics, and machine learning, this led to a segmentation of the memory and storage market. Consequently, the traditional storage hierarchy, as we know it today, might be replaced by a multitude of storage hierarchies, with potentially different depths, each tailored for specific workloads. In this context, we explore in this ‘Kurz Erklärt’ the state of memory technologies and reflect on their future use with a focus on data management systems.
With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia’s metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and, using simple and language-independent machine-learning tools, we leverage this observation to achieve state-of-the-art results in controversiality prediction. In addition, we analyze and make available a new dataset of concepts labeled for controversiality. It is significantly larger than existing datasets, and grades concepts on a 0-10 scale, rather than treating controversiality as a binary label.

Continue Reading…

### If you did not already know

DeepTracker
Deep convolutional neural networks (CNNs) have achieved remarkable success in various fields. However, training an excellent CNN is practically a trial-and-error process that consumes a tremendous amount of time and computer resources. To accelerate the training process and reduce the number of trials, experts need to understand what has occurred in the training process and why the resulting CNN behaves as it does. However, current popular training platforms, such as TensorFlow, only provide very general information, such as training/validation errors, which is far from enough to serve this purpose. To bridge this gap and help domain experts with their training tasks in a practical environment, we propose a visual analytics system, DeepTracker, to facilitate the exploration of the rich dynamics of CNN training processes and to identify the unusual patterns that are hidden behind the huge amount of training logs. Specifically, we combine a hierarchical index mechanism and a set of hierarchical small multiples to help experts explore the entire training log at different levels of detail. We also introduce a novel cube-style visualization to reveal the complex correlations among multiple types of heterogeneous training data including neuron weights, validation images, and training iterations. Three case studies are conducted to demonstrate how DeepTracker provides its users with valuable knowledge in an industry-level CNN training process, namely, in our case, training ResNet-50 on the ImageNet dataset. We show that our method can be easily applied to other state-of-the-art ‘very deep’ CNN models. …

Principled Bayesian Workflow
Experiments in research on memory, language, and in other areas of cognitive science are increasingly being analyzed using Bayesian methods. This has been facilitated by the development of probabilistic programming languages such as Stan, and easily accessible front-end packages such as brms. However, the utility of Bayesian methods ultimately depends on the relevance of the Bayesian model, in particular whether or not it accurately captures the structure of the data and the data analyst’s domain expertise. Even with powerful software, the analyst is responsible for verifying the utility of their model. To accomplish this, we introduce a principled Bayesian workflow (Betancourt, 2018) to cognitive science. Using a concrete working example, we describe basic questions one should ask about the model: prior predictive checks, computational faithfulness, model sensitivity, and posterior predictive checks. The running example for demonstrating the workflow is data on reading times with a linguistic manipulation of object versus subject relative sentences. This principled Bayesian workflow also demonstrates how to use domain knowledge to inform prior distributions. It provides guidelines and checks for valid data analysis, avoiding overfitting complex models to noise, and capturing relevant data structure in a probabilistic model. Given the increasing use of Bayesian methods, we aim to discuss how these methods can be properly employed to obtain robust answers to scientific questions. …

Linear Quadratic Estimation (LQE)
Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing noise (random variations) and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those based on a single measurement alone. More formally, the Kalman filter operates recursively on streams of noisy input data to produce a statistically optimal estimate of the underlying system state. The filter is named after Rudolf (Rudy) E. Kálmán, one of the primary developers of its theory. The Kalman filter has numerous applications in technology. A common application is for guidance, navigation and control of vehicles, particularly aircraft and spacecraft. Furthermore, the Kalman filter is a widely applied concept in time series analysis used in fields such as signal processing and econometrics. Kalman filters also are one of the main topics in the field of Robotic motion planning and control, and sometimes included in Trajectory optimization. …
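As a toy illustration of the recursive predict/update cycle described above (this sketch is not from the quoted definition; the function name and noise settings are illustrative), a one-dimensional Kalman filter estimating a constant state from noisy measurements might look like:

```python
def kalman_1d(measurements, q=1e-5, r=0.01):
    """Minimal scalar Kalman filter: estimate a constant state.

    q is the process-noise variance, r the measurement-noise variance.
    """
    x, p = 0.0, 1.0  # initial state estimate and its variance
    estimates = []
    for z in measurements:
        # predict: state is modeled as constant, variance grows by q
        p = p + q
        # update: blend prediction and measurement via the Kalman gain
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates
```

Each loop iteration is one recursive step: the previous estimate and its variance are all the filter needs, which is what makes it suitable for streaming data.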

Deep Confidence
Deep learning architectures have proved versatile in a number of drug discovery applications, including the modelling of in vitro compound activity. While controlling for prediction confidence is essential to increase the trust, interpretability and usefulness of virtual screening models in drug discovery, techniques to estimate the reliability of the predictions generated with deep learning networks remain largely underexplored. Here, we present Deep Confidence, a framework to compute valid and efficient confidence intervals for individual predictions using the deep learning technique Snapshot Ensembling and conformal prediction. Specifically, Deep Confidence generates an ensemble of deep neural networks by recording the network parameters throughout the local minima visited during the optimization phase of a single neural network. This approach serves to derive a set of base learners (i.e., snapshots) with comparable predictive power on average, that will however generate slightly different predictions for a given instance. The variability across base learners and the validation residuals are in turn harnessed to compute confidence intervals using the conformal prediction framework. Using a set of 24 diverse IC50 data sets from ChEMBL 23, we show that Snapshot Ensembles perform on par with Random Forest (RF) and ensembles of independently trained deep neural networks. In addition, we find that the confidence regions predicted using the Deep Confidence framework span a narrower set of values. Overall, Deep Confidence represents a highly versatile error prediction framework that can be applied to any deep learning-based application at no extra computational cost. …

Continue Reading…

### Organize Why R? 2019 pre-meeting in your city

[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Why R? pre-meetings are R meetups that support local R groups and promote the Why R? 2019 Conference.

The purpose of these meetings is to provide a space for professional networking and knowledge exchange between practitioners and students from the fields of statistical machine learning, programming, optimization, and data science.

We have already co-organized 12 R meetups around Central and Eastern Europe, among others in Warsaw, Prague, Amsterdam, Copenhagen, and Wroclaw. We often provide beverages and pizza, cover transportation and accommodation costs for speakers, and pay for the after-party. Thanks to those meetings, we have had a chance to understand the needs and areas of interest of R users around Europe.

Those meetings would not happen without the help, time, and energy that local R hosts have invested in this project!

If you’d like to organize an R meetup in your city and are looking for support to cover snacks or transport for the speakers, do not hesitate to contact us at kontakt_at_whyr.pl. We would be more than happy to brainstorm potential presentation topics, share ideas for networking, and suggest speakers in your region.

# Register for Why R?!

Why R? pre-meetings also promote the Why R? 2019 Conference, which will be held 26–29 September in Warsaw. Check out the ticket prices below.

Regular registration ends August 31st! It’s the last chance to get tickets at reasonable prices before Late registration starts.

Register here and see you at the conference!

To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Continue Reading…

### Eliminating Tail Calls in Python Using Exceptions

I was working through Kyle Miller’s excellent note, “Tail call recursion in Python”, and decided to experiment with variations of the techniques.

The idea is: one may want to eliminate use of the Python language call stack in the case of “tail calls” (function calls whose result is not used by the calling function, but instead immediately returned). Tail call elimination can both speed up programs and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.

The note correctly points out that Python purposely does not have a goto statement, a tool one might use to implement true tail call elimination. So Kyle Miller built up a data-structure based replacement for the call stack, which allows one to work around the stack-limit for a specific function (without changing any Python configuration, and without changing the behavior of other functions).

Of course Python does have some exotic control-flow constructs: raise and yield. So I decided to build an exception-based solution of our own using raise.

Please read on for how we do this, and for some examples.

Let’s see an example of the problem. Notice the (silly) self-calling function doesn’t succeed, as it runs out the call stack before finishing its calculation.

In [1]:
def recursive_example(n, d=1):
    if n <= 1:
        return d
    else:
        return recursive_example(n - 1, d + 1)

try:
    recursive_example(10000)
except Exception as ex:
    print(ex)

maximum recursion depth exceeded in comparison


Of course, catching excess recursion neatly (as Python did above) is a feature. It is one way to stop possible runaway recursions.

However, if we want one particular function to exceed this limit (especially for tail calls, which should require no memory overhead!): we need to set up a framework similar to “Tail call recursion in Python”.

First we build a “thunk” to represent the evaluation of a function with all arguments specified, but that hasn’t happened yet. We implement pending calculations with the class data_algebra.pending_eval.PendingFunctionEvaluation (source here). The extra bit is: we have PendingFunctionEvaluation extend Exception, so we can use raise to jump out of our current function context.

Then, when we have what would normally be a “tail call” of the form “return f(x)“, we instead write “raise PendingFunctionEvaluation(f, x)“. The idea is: we end our current function by raising the exception, and the exception itself has the instructions for the desired next step or continuation of the calculation. An outer wrapper then iteratively evaluates any PendingFunctionEvaluations encountered. Thus any tail recursion is replaced by iteration, and we have eliminated the stack and memory use of the tail calls. It should also be possible to use a return-style notation with the PendingFunctionEvaluation wrapper, but we feel the raise notation more clearly documents intent.
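The actual implementation lives in the data_algebra package; purely for intuition, here is a minimal sketch of such an exception-based trampoline, with illustrative names (PendingCall, eval_with_trampoline) rather than the package’s API:

```python
class PendingCall(Exception):
    """An exception that carries the next (tail) call to perform."""
    def __init__(self, fn, *args, **kwargs):
        Exception.__init__(self)
        self.fn, self.args, self.kwargs = fn, args, kwargs

def eval_with_trampoline(fn, *args, **kwargs):
    """Repeatedly evaluate pending calls until a plain value is returned."""
    while True:
        try:
            return fn(*args, **kwargs)
        except PendingCall as pc:
            fn, args, kwargs = pc.fn, pc.args, pc.kwargs

def count_up(n, d=1):
    if n <= 1:
        return d
    # tail call expressed as a raised exception; the current frame
    # unwinds before the next call starts, so stack depth stays constant
    raise PendingCall(count_up, n - 1, d + 1)
```

With this sketch, eval_with_trampoline(count_up, 100000) completes where the plain recursive version would exhaust the stack.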

An example is given here:

In [2]:
import data_algebra.pending_eval as pe

def recursive_example_ex(n, d=1):
    if n <= 1:
        return d
    else:
        # eliminate tail-call by using exception
        # instead of return recursive_example_ex(n-1, d+1)
        raise pe.PendingFunctionEvaluation(
            recursive_example_ex, n - 1, d + 1)

pe.eval_using_exceptions(recursive_example_ex, 100000)

Out[2]:
100000

Nota bene: the raise will throw through any intermediate functions, so any non-tail calls (direct or indirect) to these throwing functions would have to use the eval_using_exceptions() guard! After working some examples, we have concluded that the original return-based mechanism is better: the exceptions are too hard to manage and don’t add much. For our adaptation of the return-based example, please see here.
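For comparison, the return-based mechanism favored above can be sketched as follows (again with illustrative names, not the data_algebra API): the tail position returns a small thunk object instead of making a call, and a driver loop evaluates thunks until an ordinary value appears.

```python
class Thunk:
    """A deferred function call, returned (not raised) from tail position."""
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

def trampoline(fn, *args):
    """Drive the computation: evaluate thunks until a plain value appears."""
    result = Thunk(fn, *args)
    while isinstance(result, Thunk):
        result = result.fn(*result.args)
    return result

def count_up_ret(n, d=1):
    if n <= 1:
        return d
    # tail call expressed as a returned thunk
    return Thunk(count_up_ret, n - 1, d + 1)
```

Because ordinary return values flow normally here, non-tail callers need no special guard, which is the manageability advantage just noted.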

We can also specialize the method for method-calls as we show below. The pattern we are using is a simple one: methods ending in an underbar raise exceptions in place of tail-calls (and only call the underbar versions of methods), and an outer method without an underbar performs the exception handling.

In [3]:
class C:

    def f_(self, n, d=1):
        if n <= 1:
            return d
        else:
            # Eliminate tail-call by using an exception.
            # instead of: return self.f_(n-1, d+1), use:
            raise pe.PendingFunctionEvaluation(
                self.f_, n - 1, d + 1)

    def f(self, n, d=1):
        return pe.eval_using_exceptions(self.f_, n=n, d=d)

o = C()
o.f(100000)

Out[3]:
100000

And there you have it: low-space, exception-based tail call elimination. This is one of the ideas we are considering using to remove the deeply nested object traversal limit from the upcoming Python version of rquery (the other being a non-recursive tree-visit iterator).

Continue Reading…

### Magister Dixit

“Modeling (especially linear modeling) tends to find the larger generalities first; modeling with larger datasets usually helps to work out nuances, ‘small disjuncts’, and other nonlinearities that are difficult or impossible to capture from smaller datasets.” Enric Junqué de Fortuny, David Martens, Foster Provost (2014)

Continue Reading…

### Exploration of 3D Fractals

[This article was first published on exploRations in R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

To leave a comment for the author, please follow the link and comment on their blog: exploRations in R.


Continue Reading…

### Which countries dominate the world’s dinner tables?

America has a culinary deficit, whereas Italy boasts a vast surplus

Continue Reading…

### A Shiny App for JS Mediation

[This article was first published on R Blog on Cillian McHugh, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

# Background

This is a brief post about making my first Shiny App (see also). I made this app following a meeting of the Advancing Social Cognition lab (ASC-Lab) where we discussed this paper by Yzerbyt et al. (2018) proposing a new method for mediation analysis. Any attempt to detail the differences in methods is well beyond the scope of a blog post. The take home message is that the method proposed by Yzerbyt et al. (2018) is less prone to Type I errors (or false positives) than the most commonly used methods (e.g., Hayes 2017). In addition to identifying a problem and proposing a solution, the authors also provide the tools to implement their solution with an R package (Batailler et al. 2019). Unfortunately, not everyone uses R, and this is why I set about developing a simple way for SPSS users to access this new method.

# Regression and JS Mediation

Before I describe the Shiny App, I’ll briefly demonstrate the two functions that are included in it. I’ll use the built-in dataset mtcars and investigate the relationship between 1/4 mile time (qsec), gross horsepower (hp) and weight (wt), specifically:

• does horsepower predict 1/4 mile time?
• and is this relationship mediated by weight?

## Set up the dataframe

For ease of reusing code (particularly later on) I’ll save mtcars as a dataframe df and rename the variables of interest as iv (predictor variable), dv (outcome variable), and mediator.

df <- mtcars          # create df from mtcars

# create new variables with generic names
df$dv <- df$qsec      # save 1/4 mile time as dv
df$iv <- df$hp        # save horsepower as iv
df$mediator <- df$wt  # save weight as mediator

## Simple Regression

Before running the mediation I’ll run a quick regression to assess the nature of the relationship between the variables.

fit <- lm(dv ~ iv + mediator, data=df)  # save the regression in an object 'fit'
summary(fit)                            # show the results
##
## Call:
## lm(formula = dv ~ iv + mediator, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.8283 -0.4055 -0.1464  0.3519  3.7030
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.825585   0.671867  28.020  < 2e-16 ***
## iv          -0.027310   0.003795  -7.197 6.36e-08 ***
## mediator     0.941532   0.265897   3.541  0.00137 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 29 degrees of freedom
## Multiple R-squared:  0.652,  Adjusted R-squared:  0.628
## F-statistic: 27.17 on 2 and 29 DF,  p-value: 2.251e-07

As you can see from the output, 1/4 mile time is predicted by both horsepower and by weight.

## Simple Mediation

Now that we have a picture of the relationships between the variables we can run the mediation analysis. The code for this is detailed below.

JS_model <- mdt_simple(data = df, # create an object 'JS_model'
DV = dv,
IV = iv,
M  = mediator)
add_index(JS_model)               # display the results of the mediation
## Test of mediation (simple mediation)
## ==============================================
##
## Variables:
##
## - IV: iv
## - DV: dv
## - M: mediator
##
## Paths:
##
## ====  ==============  =====  ======================
## Path  Point estimate     SE  APA
## ====  ==============  =====  ======================
## a              0.009  0.002  t(30) = 4.80, p < .001
## b              0.942  0.266  t(29) = 3.54, p = .001
## c             -0.018  0.003  t(30) = 5.49, p < .001
## c'            -0.027  0.004  t(29) = 7.20, p < .001
## ====  ==============  =====  ======================
##
## Indirect effect index:
##
## - type: Indirect effect
## - point estimate: 0.00885
## - confidence interval:
##   - method: Monte Carlo (5000 iterations)
##   - level: 0.05
##   - CI: [0.00337; 0.0156]
##
## Fitted models:
##
## - X -> Y
## - X -> M
## - X + M -> Y
• Here we can see that horsepower predicts both 1/4 mile time and weight.
• There is also an indirect effect of horsepower on 1/4 mile time through weight.

# Building a Shiny App

The full code for the app is below; in the next sections I’ll go through some of the key pieces of code.

## The Geography of the Shiny App

The Shiny App has two panels.

• On the left we have:
• The data upload option
• A dropdown menu for selecting the data you wish to use (the uploaded file, the mtcars data set, or the iris data set)
• Dropdown menus for defining each of your variables,
• Text describing the App
• On the right we have:
• The output of the regression
• The output from the mediation analysis

The code for generating these panels is below (comments above relevant lines describe the purpose of the various sections):

# UI for app
ui<-(pageWithSidebar(

# We use headerPanel() to give a title to our app
headerPanel("JS Mediation"),

# use sidebarPanel() to create the content of the side panel (panel on the left)
sidebarPanel
(
# use fileInput() to create a dialogue for inputting a file
fileInput("file1", "Upload SPSS File",
multiple = TRUE,
accept = c(".sav")),
# create a horizontal line break
tags$hr(), # create a dropdown menu for selecting the dataset to be used selectInput("dataset","Data:", choices =list(iris = "iris", mtcars = "mtcars", uploaded_file = "inFile"), selected=NULL), # create a dropdown menu for selecting the dependent variable to be used selectInput("dv","Dependent Variable:", choices = NULL), # create a dropdown menu for selecting the Independent variable to be used selectInput("iv","Independent Variable:", choices = NULL), # create a dropdown menu for selecting the mediator to be used selectInput("mediator","Mediator:", choices = NULL) #, # use HTML() to input formatted text describing the App ,HTML('In response to this paper by Yzerbyt, Batailler and Judd (2018) which outined a new method of conducting mediation analyses (with less susceptability to false positives than Hayes’ PROCESS) I created a ShinyApp so that their R-package could be used by SPSS users. Upload your SPSS file above and select the variables you wish to compare.') ,br(),br(),br() ,HTML('Yzerbyt, V., Muller, D., Batailler, C., & Judd, C. M. (2018). New Recommendations for Testing Indirect Effects in Mediational Models: The Need to Report and Test Component Paths. Journal of Personality and Social Psychology: Attitudes and Social Cognition, 115(6), 929–943. http://dx.doi.org/10.1037/pspa0000132') ), # use mainPanel() to create the panel on the right where the output of our tests will be mainPanel( # give a title to the the first output h3("Summary of Regression Model"), # report the result of the regression, saved in the object 'fit' verbatimTextOutput("fit"), # give a title for the second output h3("Mediation Results"), # report the result of the mediation, saved in the object 'mediation' verbatimTextOutput("mediation") ) ))  ## The Backend of the Shiny App Above we have the code for setting up and modifying the look and feel of our app. Below we go through the code for making the app do what it is supposed to do. 
The code in full is at the bottom of this post; however, I have isolated specific sections of code below to describe their function.

### Inputting data from file

The code below runs read.spss() on whatever file you have uploaded using the dialogue box in the side panel and creates a dataframe called inFile.

upload_data <- reactive({
inFile <- input$file1
if (is.null(inFile))
return(NULL)
read.spss(input$file1$datapath, to.data.frame = TRUE)
})

observeEvent(input$file1, { inFile <<- upload_data() })

### Selecting data and variables

The code below retrieves information about the dataset that is selected and displays the variables associated with it in the dropdown menus for each of your variables (IV, DV, & mediator).

# update variables based on the data
observe({
# make sure the upload exists
if(!exists(input$dataset)) return()
# retrieve names of columns (variable names) and save as 'var.opts'
var.opts <- colnames(get(input$dataset))
# set var.opts as the options for the dropdown menus
updateSelectInput(session, "dv", choices = var.opts)
updateSelectInput(session, "iv", choices = var.opts)
updateSelectInput(session, "mediator", choices = var.opts)
})

### Setting up data for analysis

Below we extract the data and variables selected in the dropdown menus and save them as objects that we can use in functions. Specifically, we create a list obj which contains the data and the variable names dv, iv, and mediator.

# get data object
get_data <- reactive({
if(!exists(input$dataset)) return() # if no upload
check<-function(x){is.null(x) || x==""}
if(check(input$dataset)) return()
# retrieve the selected data and variables and save them in the list 'obj'
obj <- list(data = get(input$dataset),
            dv = input$dv,
            iv = input$iv,
            mediator = input$mediator)
# require all to be set to proceed
if(any(sapply(obj, check))) return()
# make sure choices had a chance to update
check <- function(obj){
!all(c(obj$dv, obj$iv, obj$mediator) %in% colnames(obj$data))
}
if(check(obj)) return()
# return 'obj' on completion
obj
})

### Running the analyses

Now that we can retrieve the selected data and variables, we can turn them into a dataframe and run our analyses on them.

#### Regression

The code below creates an object output$fit which contains the output of the regression.

output$fit <- renderPrint({
# create an object 'dataset_list', which is a list that contains the selected data and variables
dataset_list <- get_data()
# isolate the elements in the list as separate objects
a <- dataset_list$dv
b <- dataset_list$iv
m <- dataset_list$mediator
c <- dataset_list$data
# create a dataframe 'df' from the object 'c', the selected dataset
df <- `colnames<-`(
cbind.data.frame(
# we extract and use the variables from 'c' that have the same names as those selected
c[which(colnames(c)==a)],
c[which(colnames(c)==b)],
c[which(colnames(c)==m)]
), c("dv","iv","mediator"))
# now we have a dataframe df with 3 variables named 'dv', 'iv', and 'mediator'
# we need to ensure the data is numeric
df$dv <- suppressWarnings(as.numeric(df$dv))
df$iv <- suppressWarnings(as.numeric(df$iv))
df$mediator <- suppressWarnings(as.numeric(df$mediator))
# using the same code previously discussed, we run the regression
fit <- lm(dv ~ iv + mediator, data = df)
summary(fit) # show results
})

#### Mediation

Below we follow mostly the same steps to create our dataframe, and this time we run the mediation instead of the regression.

output$mediation <- renderPrint({
# create an object 'dataset_list', which is a list that contains the selected data and variables
dataset_list <- get_data()

# isolate the elements in the list as separate objects
a <- dataset_list$dv
b <- dataset_list$iv
m <- dataset_list$mediator
c <- dataset_list$data

# create a dataframe 'df' from the object 'c', the selected dataset
df <- `colnames<-`(
cbind.data.frame(
# we extract and use the variables from 'c' that have the same names as those selected
c[which(colnames(c)==a)],
c[which(colnames(c)==b)],
c[which(colnames(c)==m)]
), c("dv","iv","mediator"))
# now we have a dataframe df with 3 variables named 'dv', 'iv', and 'mediator'

# we need to ensure data is numeric
df$dv <- suppressWarnings(as.numeric(df$dv))
df$iv <- suppressWarnings(as.numeric(df$iv))
df$mediator <- suppressWarnings(as.numeric(df$mediator))

# and we run the mediation using the same code as at the beginning of this post
JS_model <- mdt_simple(data = df,
DV = dv,
IV = iv,
M  = mediator)
add_index(JS_model)
})

# Conclusion

Above I have described how I went about making my first Shiny app, which makes a new method of mediation analysis accessible to SPSS users. Feel free to try it out (although I have not paid for a premium Shiny account, so it may time out).

Both the mtcars and iris datasets are preloaded in the app, in case you want to try it but don’t have any SPSS files to upload. If you are an R user, hopefully this post helps you make your own Shiny apps to bring R functionality to your SPSS-using colleagues. Many thanks to the examples online that helped me, particularly this example for uploading files and working with them.

(Also, if you have any suggestions for improving the app, or if I have left anything out, let me know.)

library(shiny)
library(foreign)
library(purrr)
library(dplyr)
library("devtools")
#install.packages("JSmediation")
library(JSmediation)

# UI for app
ui<-(pageWithSidebar(

# We use headerPanel() to give a title to our app
headerPanel("JS Mediation"),

# use sidebarPanel() to create the content of the side panel (panel on the left)
sidebarPanel
(
# use fileInput() to create a dialogue for inputting a file
fileInput("file1", "Upload SPSS File",
multiple = TRUE,
accept = c(".sav")),
# create a horizontal line break
tags$hr(),
# create a dropdown menu for selecting the dataset to be used
selectInput("dataset", "Data:",
            choices = list(iris = "iris", mtcars = "mtcars", uploaded_file = "inFile"),
            selected = NULL),
# create a dropdown menu for selecting the dependent variable to be used
selectInput("dv", "Dependent Variable:", choices = NULL),
# create a dropdown menu for selecting the independent variable to be used
selectInput("iv", "Independent Variable:", choices = NULL),
# create a dropdown menu for selecting the mediator to be used
selectInput("mediator", "Mediator:", choices = NULL),
# use HTML() to input formatted text describing the app
HTML('In response to this paper by Yzerbyt, Batailler and Judd (2018), which outlined a new method of conducting mediation analyses (with less susceptibility to false positives than Hayes’ PROCESS), I created a Shiny app so that their R package could be used by SPSS users. Upload your SPSS file above and select the variables you wish to compare.'),
br(), br(), br(),
HTML('Yzerbyt, V., Muller, D., Batailler, C., & Judd, C. M. (2018). New Recommendations for Testing Indirect Effects in Mediational Models: The Need to Report and Test Component Paths. Journal of Personality and Social Psychology: Attitudes and Social Cognition, 115(6), 929–943. http://dx.doi.org/10.1037/pspa0000132')
),
# use mainPanel() to create the panel on the right where the output of our tests will be
mainPanel(
# give a title to the first output
h3("Summary of Regression Model"),
# report the result of the regression, saved in the object 'fit'
verbatimTextOutput("fit"),
# give a title for the second output
h3("Mediation Results"),
# report the result of the mediation, saved in the object 'mediation'
verbatimTextOutput("mediation")
)
))

# shiny server-side code for each call
server <- (function(input, output, session){

# update variables based on the data
observe({
#browser()
# make sure the upload exists
if(!exists(input$dataset)) return()
var.opts <- colnames(get(input$dataset))
updateSelectInput(session, "dv", choices = var.opts)
updateSelectInput(session, "iv", choices = var.opts)
updateSelectInput(session, "mediator", choices = var.opts)
})

# get data object
get_data <- reactive({
if(!exists(input$dataset)) return() # if no upload
check<-function(x){is.null(x) || x==""}
if(check(input$dataset)) return()
obj <- list(data = get(input$dataset),
            dv = input$dv,
            iv = input$iv,
            mediator = input$mediator)
# require all to be set to proceed
if(any(sapply(obj, check))) return()
# make sure choices had a chance to update
check <- function(obj){
!all(c(obj$dv, obj$iv, obj$mediator) %in% colnames(obj$data))
}
if(check(obj)) return()
obj
})

upload_data <- reactive({
inFile <- input$file1
if (is.null(inFile))
return(NULL)
# could also store in a reactiveValues
read.spss(input$file1$datapath, to.data.frame = TRUE)
})

observeEvent(input$file1, { inFile <<- upload_data() })

# create regression output
output$fit <- renderPrint({

dataset_list <- get_data()

a <- dataset_list$dv
b <- dataset_list$iv
m <- dataset_list$mediator
c <- dataset_list$data

df <- `colnames<-`(
cbind.data.frame(
c[which(colnames(c)==a)],
c[which(colnames(c)==b)],
c[which(colnames(c)==m)]
), c("dv","iv","mediator"))

df$dv <- suppressWarnings(as.numeric(df$dv))
df$iv <- suppressWarnings(as.numeric(df$iv))
df$mediator <- suppressWarnings(as.numeric(df$mediator))

fit <- lm(dv ~ iv + mediator, data=df)
summary(fit) # show results
})

# create mediation output
output$mediation <- renderPrint({
dataset_list <- get_data()
a <- dataset_list$dv
b <- dataset_list$iv
m <- dataset_list$mediator
c <- dataset_list$data
df <- `colnames<-`(
cbind.data.frame(
c[which(colnames(c)==a)],
c[which(colnames(c)==b)],
c[which(colnames(c)==m)]
), c("dv","iv","mediator"))
df$dv <- suppressWarnings(as.numeric(df$dv))
df$iv <- suppressWarnings(as.numeric(df$iv))
df$mediator <- suppressWarnings(as.numeric(df$mediator))

JS_model <- mdt_simple(data = df,
DV = dv,
IV = iv,
M  = mediator)
add_index(JS_model)
})
# #JS_model
})

# Create Shiny app ----
shinyApp(ui, server)

# References

Batailler, Cédric, Dominique Muller, Vincent Yzerbyt, and Charles Judd. 2019. JSmediation: Mediation Analysis Using Joint Significance. https://CRAN.R-project.org/package=JSmediation.

Hayes, Andrew F. 2017. Introduction to Mediation, Moderation, and Conditional Process Analysis, Second Edition: A Regression-Based Approach. Guilford Publications.

Yzerbyt, Vincent, Dominique Muller, Cédric Batailler, and Charles M Judd. 2018. “New Recommendations for Testing Indirect Effects in Mediational Models: The Need to Report and Test Component Paths.” Journal of Personality and Social Psychology: Attitudes and Social Cognition 115 (6): 929–43. https://doi.org/http://dx.doi.org/10.1037/pspa0000132.

1. The purpose of this post is to demonstrate the code for these analyses; as such, there may be issues with the analyses reported – I haven’t checked any assumptions or anything.

2. In order to enable people to use the app for their own analysis, I needed a way for them to upload their data into the app. After a bit of googling I found this example for uploading .csv files. I copied the code and modified it to use read.spss() from the package foreign instead of read.csv().

To leave a comment for the author, please follow the link and comment on their blog: R Blog on Cillian McHugh.


Continue Reading…

## August 22, 2019

### Fresh from the Python Package Index

tuesday
A Python package for automation. You can easily automate most system commands with your voice. Using this package you can build a Jarvis-like program, regardless of your coding knowledge. Easy voice commands to work with.

cltkv1
Classical Language Toolkit

predicode
Simulations and interface to analytical solutions of predictive coding

allennlp-pvt-nightly
An open-source NLP research library, built on PyTorch.

analytics-utils
Package with functions for data analytics

data-view
Automated view of dataset

dibb
A framework for Distributed Black-Box {optimization, search, algorithms}

jiayan
The NLP toolkit designed for Classical Chinese.

larrydata
Library of helper modules for common data tasks using AWS resources such as S3, SQS, MTurk and others. Library of utilities for common data tasks using AWS. While boto3 is a great interface for interacting with AWS services, it can be overly complex for data scientists and others who want to perform simple operations on data without worrying about API-specific interactions and parameters. Larrydata provides a simple wrapper for S3, MTurk, and other data-oriented services to let you focus on the data rather than syntax.

MatchZoo-test
Facilitating the design, comparison and sharing of deep text matching models.

proto-dist-ml
Prototype-based Machine Learning on Distance Data.

telegram-botup
Library for developing Telegram bots

tensorflow-radam
RAdam implemented in Keras & TensorFlow

testbook
A testing package using Jupyter Notebooks

Continue Reading…

### Book Memo: “Agile Machine Learning”

Effective Machine Learning Inspired by the Agile Manifesto

Build resilient applied machine learning teams that deliver better data products through adapting the guiding principles of the Agile Manifesto. Bringing together talented people to create a great applied machine learning team is no small feat. With developers and data scientists both contributing expertise in their respective fields, communication alone can be a challenge. Agile Machine Learning teaches you how to deliver superior data products through agile processes and to learn, by example, how to organize and manage a fast-paced team challenged with solving novel data problems at scale, in a production environment. The authors’ approach models the ground-breaking engineering principles described in the Agile Manifesto. The book provides further context and contrasts the original principles with the requirements of systems that deliver a data product.

Continue Reading…

### Multilevel structured (regression) and post-stratification

My enemies are all too familiar. They’re the ones who used to call me friend – Jawbreaker

Well I am back from Australia where I gave a whole pile of talks and drank more coffee than is probably a good idea. So I’m pretty jetlagged and I’m supposed to be writing my tenure packet, so obviously I’m going to write a long-ish blog post about a paper that we’ve done on survey estimation that just appeared on arXiv. We, in this particular context, is my stellar grad student Alex Gao, the always stunning Lauren Kennedy, the eternally fabulous Andrew Gelman, and me.

What is our situation?

When data is a representative sample from the population of interest, life is peachy. Tragically, this never happens.

Maybe a less exciting way to say that would be that your sample is representative of a population, but it might not be an interesting population. An example of this would be a psychology experiment where the population is mostly psychology undergraduates at the PI’s university. The data can make reasonable conclusions about this population (assuming sufficient sample size and decent design etc), but this may not be a particularly interesting population for people outside of the PI’s lab. Lauren and Andrew have a really great paper about this!

It’s also possible that the population that is being represented by the data is difficult to quantify.  For instance, what is the population that an opt-in online survey generalizes to?

Moreover, it’s very possible that the strata of the population have been unevenly sampled on purpose. Why would someone visit such violence upon their statistical inference? There are many, many reasons, but one of the big ones is ensuring that you get enough samples from a rare population that’s of particular interest to the study. Even though there are good reasons to do this, it can still bork your statistical analysis.

All in all, dealing with non-representative data is a difficult thing, and it will surprise exactly no one to hear that there are a whole pile of approaches that have been proposed from the middle of last century onwards.

Maybe we can weight it

Maybe the simplest method for dealing with non-representative data is to use sample weights. The purest form of this idea occurs when the population is stratified into $J$ subgroups of interest and data is drawn independently at random from the $j$th population with probability $\pi_j$.  From this data it is easy to compute the sample average for each subgroup, which we will call $\bar{y}_j$. But how do we get an estimate of the population average from this?

Well just taking the average of the averages probably won’t work–if one of the subgroups has a different average from the others it’s going to give you the wrong answer.  The correct answer, aka the one that gives an unbiased estimate of the mean, was derived by Horvitz and Thompson in the early 1950s. To get an unbiased estimate of the mean you need to use the subgroup means and the sampling probabilities.  The Horvitz-Thompson estimator has the form

$\frac{1}{J}\sum_{j=1}^J\frac{\bar{y}_j}{\pi_j}$.
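As a quick numeric sketch of this inverse-probability idea (using the general per-unit Horvitz–Thompson form, with entirely made-up numbers), compare the naive sample mean with the weighted estimate when one stratum is heavily oversampled:

```python
# Hypothetical population: 200 units in two strata of 100 each, with constant
# responses 10 and 20, sampled with inclusion probabilities 0.5 and 0.1.
# Each sampled unit is a (response, inclusion probability) pair.
N = 200
sample = [(10, 0.5)] * 50 + [(20, 0.1)] * 10

naive_mean = sum(y for y, _ in sample) / len(sample)  # ignores the design
ht_mean = sum(y / pi for y, pi in sample) / N         # inverse-probability weighted

print(round(naive_mean, 2))  # 11.67 -- pulled toward the oversampled stratum
print(round(ht_mean, 2))     # 15.0  -- the true population mean
```

The oversampled low-response stratum drags the naive mean down, while weighting each observation by the inverse of its inclusion probability recovers the population mean exactly in this idealized setup.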

Now, it is a truth universally acknowledged, if perhaps not universally understood, that unbiasedness is really only a meaningful thing if a lot of other things are going very well in your inference. In this case, it really only holds if the data was sampled from the population with the given probabilities.  Most of the time that doesn’t really happen. One of the problems is non-response bias, which (as you can maybe infer from the name) is the bias induced by non-response.

(There are ways through this, like raking, but I’m not going to talk about those today).

Poststratification: flipping the problem on its head

One way to think about poststratification is that instead of making assumptions about how the observed sample was produced from the population, we make assumptions about how the observed sample can be used to reconstruct the rest of the population.  We then use this reconstructed population to estimate the population quantities of interest (like the population mean).

The advantage of this viewpoint is that we are very good at prediction. It is one of the fundamental problems in statistics (and machine learning, because why not). This viewpoint also suggests that our target may not necessarily be unbiasedness but rather good prediction of the population. It also suggests that, if we can stomach a little bias, we can get much tighter estimates of the population quantity than survey weights can give. That is, we can trade off bias against variance!
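The poststratification arithmetic itself is tiny once the subgroup predictions exist; here is a minimal sketch with hypothetical cell counts and predicted means (the counts play the role of known population composition):

```python
# Hypothetical sketch: known population cell counts plus per-cell predicted
# means give the poststratified population estimate as a weighted average.
pop_counts = {"18-34": 300, "35-64": 500, "65+": 200}     # e.g. from a census
cell_means = {"18-34": 0.70, "35-64": 0.55, "65+": 0.40}  # model predictions

N = sum(pop_counts.values())
poststrat_estimate = sum(pop_counts[c] * cell_means[c] for c in pop_counts) / N
print(round(poststrat_estimate, 3))  # 0.565
```

All of the statistical difficulty lives in producing good cell-level predictions; the final weighting step is just this sum.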

Of course, anyone who tells you they’re doing assumption free inference is a dirty liar, and the fewer assumptions we have the more desperately we cling to them. (Beware the almost assumption-free inference. There be monsters!) So let’s talk about the two giant assumptions that we are going to make in order for this to work.

Giant assumption 1: We know the composition of our population. In order to reconstruct the population from the sample, we need to know how many people or things should be in each subgroup. This means that we are restricted in how we can stratify the population. For surveys of people, we typically build out our population information from census data, as well as from smaller official surveys like the American Community Survey (for estimating things about the US! The ACS is less useful in Belgium). (This assumption can be relaxed somewhat by clever people like Lauren and Andrew, but poststratifying to a variable that isn’t known in the population is definitely an advanced skill.)

Giant assumption 2: The people who didn’t answer the survey are like the people who did. There are a few ways to formalize this, but one that is clear to me is that we need two things. First, the people who were asked to participate in the survey in subgroup j are a random sample of subgroup j. Second, the people who actually answered the survey in subgroup j are a random sample of the people who were asked. These sorts of missing at random, missing completely at random, or ignorability assumptions are pretty much impossible to verify in practice. There are various clever things you can do to relax some of them (e.g. throw a handful of salt over your left shoulder and whisper “causality” into a mouldy tube sock found under a teenage boy’s bed), but for the most part this is the assumption that we are making.

A thing that I hadn’t really appreciated until recently is that this also gives us some way to do model assessment and checking.  There are two ways we can do this. Firstly we can treat the observed data as the full population and fit our model to a random subsample and use that to assess the fit by estimating the population quantity of interest (like the mean). The second method is to assess how well the prediction works on left out data in each subgroup. This is useful because poststratification explicitly estimates the response in the unobserved population, so how good the predictions are (in each subgroup!) is a good thing to know!

This means that tools like LOO-CV are still useful, although rather than looking at a global LOO-elpd score, it would be more useful to look at it for each unique combination of stratifying variables. That said, we have a lot more work to do on model choice for survey data.

So if we have a way to predict the responses of the unobserved members of the population, we can make estimates based on non-representative samples. So how do we do this prediction?

Enter Mister P

Mister P (or MRP) is a grand old dame. Since Andrew and Thomas Little  introduced it in the mid-90s, a whole lot of hay has been made from the technique. It stands for Multilevel Regression and Poststratification and it kinda does what it says on the box.

It uses multilevel regression to predict what unobserved data in each subgroup would look like, and then uses poststratification to fill in the rest of the population values and make predictions about the quantities of interest.

(This is a touch misleading. What it does is estimate the distribution of each subgroup mean and then uses poststratification to turn these into an estimate the distribution of the mean for the whole population. Mathematically it’s the same thing, but it’s much more convenient than filling in each response in the population.)

And there is scads of literature suggesting that this approach works very well. Especially if the multilevel structure and the group-level predictors are chosen well.

But no method is perfect and in our paper we launch at one possible corner of the framework that can be improved. In particular, we look at the effect that using structured priors within the multilevel regression will have on the poststratified estimates. These changes turn out not to massively change whole population quantities, but can greatly improve the estimates within subpopulations.

What are the challenges with using multilevel regression in this context?

The standard formulation of Mister P treats each stratifying variable the same (allowing for a varying intercept and maybe some group-specific effects). But maybe not all stratifying variables are created equal.  (But all stratifying variables will be discrete because it is not the season for suffering. November is the season for suffering.)

Demographic variables like gender or race/ethnicity have a number of levels that are more or less exchangeable. Exchangeability has a technical definition, but one way to think about it is that a priori we think that the size of the effect of a particular gender on the response has the same distribution as the size of the effect of another gender on the response (perhaps after conditioning on some things).

From a modelling perspective, we can codify this as making the effect of each level of the demographic variable a different independent draw from the same normal distribution.

In this setup, information is shared between different levels of the demographic variable because we don’t know what the mean and standard deviation of the normal distribution will be. These parameters are (roughly) estimated using information from the overall effect of that variable (total pooling) and from the variability of the effects estimated independently for each group (no pooling).

But this doesn’t necessarily make sense for every type of demographic variable. One example that we used in the paper is age, where it may make more sense to pool information more strongly from nearby age groups than from distant age groups. A different example would be something like state, where it may make sense to pool information from nearby states rather from the whole country.

We can incorporate this type of structured pooling using what we call structured priors in the multilevel model. Structured priors are everywhere: Gaussian processes, time series models (like AR(1) models), conditional autogregressive (CAR) models, random walk priors, and smoothing splines are all commonly used examples.
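To see the local-pooling behaviour concretely, here is a minimal sketch (all numbers made up, and using the equivalent difference-penalty view of a first-order random-walk prior rather than a full Bayesian model):

```python
import numpy as np

# Penalizing squared differences between *adjacent* age-group effects
# (a first-order random walk, viewed as a penalty) pulls an outlying group
# toward its neighbours rather than toward the global mean.
raw = np.array([1.0, 1.2, 5.0, 1.6, 1.8])  # one noisy age-group effect sticks out
n = len(raw)
D = np.diff(np.eye(n), axis=0)             # rows compute theta[j+1] - theta[j]
lam = 2.0                                  # penalty strength (roughly 1/tau^2)

# minimize ||theta - raw||^2 + lam * ||D theta||^2  =>  (I + lam D'D) theta = raw
smoothed = np.linalg.solve(np.eye(n) + lam * D.T @ D, raw)

assert raw[2] > smoothed[2] > raw[1]  # the spike is shrunk toward its neighbours
```

Larger `lam` (a tighter random-walk prior) smooths more aggressively; the point is only that information flows between neighbouring levels, which is the structured pooling the paper exploits.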

But just because you can do something doesn’t mean you should. This leads to the question that inspired this work:

When do structured priors help MRP?

Structured priors typically lead to more complex models than the iid varying intercept model that a standard application of the MRP methodology uses. This extra complexity means that we have more room to achieve our goal of predicting the unobserved survey responses.

But as the great sages say: with low power comes great responsibility.

If the sample size is small or if the priors are set wrong, this extra flexibility can lead to high-variance predictions and will lead to worse estimation of the quantities of interest. So we need to be careful.

As much as I want it to, this isn’t going to turn into a(nother) blog post about priors. But it’s worth thinking about. I’ve written about it at length before and will write about it at length again. (Also there’s the wiki!)

But to get back to the question, the answer depends on how we want to pool information. In a standard multilevel model, we augment the information within each subgroup with information from the whole population. For instance, if we are estimating a mean and we have one varying intercept, it’s a tedious algebra exercise to show that

$\mathbb{E}(\mu_j \mid y)\approx\frac{\frac{n_j}{\sigma^2} \bar{y}_j+\frac{1}{\tau^2}\bar{y}}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}}$,

so we’ve borrowed some extra information from the raw mean of the data $\bar{y}$ to augment the local means $\bar{y}_j$ when they don’t have enough information.
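Plugging hypothetical numbers into the partial-pooling formula above makes the behaviour concrete: small subgroups are shrunk hard toward the overall mean, large ones barely move.

```python
# Hypothetical values for the partial-pooling formula: the subgroup mean
# y_bar_j is shrunk toward the overall mean y_bar, and the shrinkage
# weakens as the subgroup sample size n_j grows.
sigma2, tau2 = 1.0, 0.25   # within-group and between-group variances
y_bar, y_bar_j = 0.0, 2.0  # overall mean and subgroup j's raw mean

def pooled_mean(n_j):
    w_data, w_prior = n_j / sigma2, 1.0 / tau2
    return (w_data * y_bar_j + w_prior * y_bar) / (w_data + w_prior)

print(pooled_mean(1))              # 0.4  -- heavy shrinkage toward y_bar
print(round(pooled_mean(100), 2))  # 1.92 -- large groups keep their own mean
```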

But if our population is severely unbalanced and the different groups have vastly different responses, this type of pooling may not be appropriate.

A canny reader might say, “Well, what if we put weights in so we can shrink to a better estimate of the population mean?” Well, that turns out to be very difficult.

Everybody needs good neighbours (especially when millennials don’t answer the phone)

The solution we went with was to use a random walk prior on age. This type of prior prioritizes pooling toward nearby age categories. We found that this makes a massive difference to the subpopulation estimates, especially when some age groups are less likely to answer the phone than others.

We put this all together into a detailed simulation study that showed that you can get some real advantages to doing this!

We also used this technique to analyze some phone survey data from The Annenberg Public Policy Center of the University of Pennsylvania about popular support for marriage equality in 2008. This example was chosen because, even in 2008, young people had a tendency not to answer their phones. Moreover, we expect the support for marriage equality to be different among different age groups.  Things went well.

How to bin ordinal variables (don’t!)

One of the advantages of our strategy is that we can treat variables like age at their natural resolution (eg year) while modelling, and then predict the distribution of the responses in an aggregated category where we have enough demographic information to do poststratification.

This breaks an awkward dependence between modelling choices and the assumptions needed to do poststratification.
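The aggregation step can be sketched in a few lines (all numbers hypothetical): the model predicts per year of age, and the per-year predictions are then combined using per-year population counts into the coarser cells for which poststratification information exists.

```python
# Hypothetical sketch: model at the fine (year) resolution, then poststratify
# at the coarser resolution where population counts are actually available.
year_pred  = {18: 0.80, 19: 0.78, 20: 0.76, 21: 0.74}  # model predictions per year
year_count = {18: 50, 19: 40, 20: 60, 21: 50}          # population counts per year
buckets    = {"18-19": [18, 19], "20-21": [20, 21]}    # available poststrat cells

bucket_est = {
    b: sum(year_count[a] * year_pred[a] for a in ages)
       / sum(year_count[a] for a in ages)
    for b, ages in buckets.items()
}
print({b: round(v, 3) for b, v in bucket_est.items()})  # {'18-19': 0.791, '20-21': 0.751}
```

The modelling resolution and the poststratification resolution no longer need to be the same, which is exactly the decoupling described above.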

Things that are still to be done!

No paper is complete, so there are a few things we think are worth looking at now that we know that this type of strategy works.

• Model selection: How can you tell which structure is best?
• Prior choice: Always an issue!
• Interactions: Some work has been done on using BART with MRP (they call it … BARP). This should cover interaction modelling, but doesn’t really allow for the types of structured modelling we’re using in this paper.
• Different structures: In this paper, we used an AR(1) model and a second order random walk  model (basically a spline!). Other options include spatial models and Gaussian process models. We expect them to work the same way.

What’s in a name? (AKA the tl;dr)

I (and really no one else) really want to call this Ms P, which would stand for Multilevel Structured regression with Poststratification.

But regardless of name, the big lessons of this paper are:

1. Using structured priors allows us to pool information in a more problem-appropriate way than standard multilevel models do, especially when stratifying our population according to an ordinal or spatial variable.
2. Structured priors are especially useful when one of the stratifying variables is ordinal (like age) and the response is expected to depend (possibly non-linearly) on this variable.
3. The gain from using structured priors increases when certain levels of the ordinal stratifying variable are over- or under-sampled. (Eg if young people stop answering phone surveys.)

So go forth and introduce yourself to Ms P. You’ll like her.

Continue Reading…

### National Scale Interactive Computing

This is an invited post from Jim Colliander, Professor of Mathematics at UBC and Director of the Pacific Institute for the Mathematical Sciences.¹

In 2017, the Pacific Institute for the Mathematical Sciences (PIMS), in partnership with Compute Canada and Cybera, launched Syzygy, a cloud-hosted interactive computing platform that delivers JupyterHub deployments for universities across Canada.

Syzygy has been used by over 16,000 students at 20 universities. The main results of the Syzygy experiment so far are:

1. Demand for interactive computing is ubiquitous² and growing strongly at universities.
2. The Jupyter ecosystem is an effective way to deliver interactive computing.
3. A scalable, sustainable, and cost-effective interactive computing service for universities is needed as soon as possible.

### Demand for interactive computing

Both research and teaching at universities are adapting to major societal changes driven by explosions in data and computational tools. New educational programs that prepare students to think computationally are emerging, while research strategies are changing in ways that are more open, reproducible, collaborative, and interdisciplinary. These transformations are inextricably linked and are accelerating demand for interactive computing. The Syzygy experiment has shown that using Jupyter in educational programs drives interest in using Jupyter for research (and vice versa). For example, students in mathematics, statistics, and computer science collaborated with a researcher from St. Paul’s Hospital in Vancouver using Syzygy to identify new pathways to prevent death from sepsis. Research communities typically need access to deeper computational resources and often span multiple universities, but the common thread is the need to expand access to interactive computing.

### Technical milestone achieved

The Syzygy project has demonstrated that it’s possible to deploy tools for interactive computation at a national scale rapidly and efficiently using an entirely open source technology stack. Students, faculty and staff across Canada use Syzygy to access Jupyter through their browsers with their university single-sign-on credentials. The JupyterHubs range from a “standard” configuration to bespoke environments with specially curated tools and data integrations. This richness is possible because of the architecture of the Syzygy and Jupyter projects and the flexibility of the underlying cloud resources. As a case-study, Syzygy demonstrates that the Jupyter community has achieved a significant technical milestone: interactive computing can be delivered at national scale using cloud technologies.

### Service level requirements

The validation that interactive computing can be technically delivered at national scale prompts universities to ask a variety of questions. Can interactive computing service be delivered robustly? How will users be supported? What are the uptime expectations? What is the data security policy? How is privacy protected? Can the robustness of the service be clarified in a service level agreement? Syzygy, as an experimental service offered to universities at no charge and without a service level agreement, does not properly address these questions. To advance on their education-research-service mission and address growing demand, universities need a reliable interactive computing service with a service level agreement.

### What’s next?

How should universities address their needs for interactive computing over the next five years? Right now, universities are following two primary approaches:

• 🙏 Ad hoc: faculty figure out how to meet their own needs for interactive computing; IT staff deploy JupyterHub on local or commercial cloud servers; this approach gives universities control over their deployments and hardware, though it requires time and expertise that many may not have.
• 🎩 Use a cloud provider’s service: Google Colab, Amazon SageMaker, Microsoft Azure Notebooks, IBM Watson Studio; this approach lets universities launch interactive computing services quickly, at the cost of flexibility and the risk of becoming reliant on a particular vendor’s closed-source, proprietary software.

These approaches are not sustainable over the long term. If universities all deploy their own JupyterHub services, many will need technical expertise they do not currently have, and the duplication of effort will be significant. If universities rely on hosted cloud notebook services, the reliance on proprietary technology will impair their ability to switch between cloud vendors, change hardware, customize software, and so on. Vendor lock-in will limit universities’ ability to respond to changes in the price of the service, and they will lose agility in responding to changes in faculty, staff, and student computing needs.

There is a third option that addresses the issues with these two approaches and generates other benefits for universities:

• 🤔 Form an interactive computing consortium: universities collaborate to build an interactive computing service provider aligned with their missions to better serve their students, facilitate research, and avoid risks associated with vendor lock-in.

To retain control over their interactive computing stacks, avoid dependence³ on cloud providers, and accelerate the emergence of new programs, universities should work together to deploy interactive computing environments in a vendor-agnostic manner. This might take the form of a consortium, an organization dedicated to serving the needs of universities through customized shared infrastructure for interactive computing. The consortium would also ensure that universities continue to play a leadership role in the development of the interactive computing tools used for education and research.

The Syzygy experiment confirmed that growing demand for interactive computation within universities can be supplied with the available technologies advanced by the Jupyter open source community. In the coming year, we aim to build upon the success of the Syzygy experiment and seed an initial node of a consortium in Canada with the intention of fostering a global network of people invested in advanced interactive computing. If you are interested in partnering, please get in touch! See this Jupyter Community Forum post to continue the discussion.

1. The author gratefully acknowledges feedback on this piece from Ian Allison, Lindsey Heagy, Chris Holdgraf, Fernando Perez, and Lindsay Sill.
2. Interactive computing needs have been identified in agriculture, applied mathematics, astronomy, chemistry, climate science, computer science, data science, digital humanities, ecology, economics, engineering, genomics, geoscience, health sciences, K-12 education, neuroscience, political science, physics, pure mathematics, statistics, and sociology.
3. Relying on commercial cloud vendors to provide the interactive computing service for universities risks recreating the problems associated with scientific publishing that emerged with the internet.

National Scale Interactive Computing was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Continue Reading…

### Stancon is happening now.

Hi, everyone!

Continue Reading…

### Document worth reading: “Survey on Evaluation Methods for Dialogue Systems”

In this paper we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part during the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost and time intensive. Thus, much work has been put into finding methods, which allow to reduce the involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented dialogue systems, conversational dialogue systems, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for the dialogue systems and then by presenting the evaluation methods regarding this class. Survey on Evaluation Methods for Dialogue Systems

Continue Reading…

### eBook: How to Enhance Privacy in Data Science

Check out this eBook, How to Enhance Privacy in Data Science, to equip yourself with tools for enhancing privacy in data science, including transforming data in ways that protect privacy, an overview of the challenges and opportunities of privacy-aware analytics, and more.

Continue Reading…

### ✚ Chart Different, Then Adjust (The Process #53)

Practicality will make itself known whether you want it to or not. So, try different visual forms and take it from there. Read More

Continue Reading…

### ‘mRpostman’ – IMAP Tools for R in a Tidy Way

[This article was first published on R on ALLAN V. C. QUADROS, and kindly contributed to R-bloggers].

mRpostman is an R package that helps you easily connect to your IMAP (Internet Message Access Protocol) server and execute commands, such as listing mailboxes and searching and fetching messages, in a tidy way. It calls ‘curl’ in the background when issuing the IMAP commands (all credit to Jeroen Ooms and Daniel Stenberg).

So far, I have tested mRpostman with Gmail, Yahoo Mail and AOL Mail, but it should also work with other mail providers. I would be happy to hear other successful experiences from users.

ATTENTION: Before you start, you have to enable “less secure apps access” in your mail account settings.

Check out a detailed vignette HERE showing how to use the package!

I hope you enjoy mRpostman.


Continue Reading…

### Proptech and the proper use of technology for house sales prediction

Using the ATTOM dataset, we extracted data on sales transactions in the USA, loans, and estimated property values. We developed an optimal prediction model from correlations in the time and status of ownership, as well as seasonal fluctuations in sales.

Continue Reading…

### Guest Blog: How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas

At Virgin Hyperloop One, we work on making Hyperloop a reality, so we can move passengers and cargo at airline speeds but at a fraction of the cost of air travel. In order to build a commercially viable system, we collect and analyze a large, diverse quantity of data, including Devloop Test Track runs, numerous test rigs, and various simulation, infrastructure and socio-economic data. Most of our scripts handling that data are written using Python libraries with pandas as the main data processing tool that glues everything together. In this blog post, we want to share with you our experiences of scaling our data analytics using Koalas, achieving massive speedups with minor code changes.

As we continue to grow and build new stuff, so do our data processing needs. Due to the increasing scale and complexity of our data operations, our pandas-based Python scripts were too slow to meet our business needs. This led us to Spark, with the hopes of fast processing times and flexible data storage as well as on-demand scalability. We were, however, struggling with the “Spark switch” – we would have to make a lot of custom changes to migrate our pandas-based codebase to PySpark. We needed a solution that was not only much faster, but also would ideally not require rewriting code. These challenges drove us to research other options and we were very happy to discover that there exists an easy way to skip that tedious step: the Koalas package, recently open-sourced by Databricks.

As described in the Koalas Readme,

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

(…)

Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.

Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).

In this article I will try to show that this is (mostly) true and why Koalas is worth trying out. By making changes to less than 1% of our pandas lines, we were able to run our code with Koalas and Spark. We were able to reduce the execution times by more than 10x, from a few hours to just a few minutes, and since the environment is able to scale horizontally, we’re prepared for even more data.

## Quick Start

Before installing Koalas, make sure that you have your Spark cluster set up and can use it with PySpark. Then, simply run:

pip install koalas

or, for conda users:

conda install koalas -c conda-forge

Refer to Koalas Readme for more details.

A quick sanity check after installation:

import databricks.koalas as ks
kdf = ks.DataFrame({'column1': [4.0, 8.0], 'column2': [1.0, 2.0]})
kdf

As you can see, Koalas can render pandas-like interactive tables. How convenient!

## Example with basic operations

For the sake of this article, we generated some test data consisting of 4 columns and a parameterized number of rows.

import pandas as pd
## generate 1M rows of test data
pdf = generate_pd_test_data( 1e6 )
pdf.head(3)
>>>     timestamp pod_id trip_id speed_mph
0 7.522523 pod_13 trip_6 79.340006
1 22.029855 pod_5 trip_22 65.202122
2 21.473178 pod_20 trip_10 669.901507

We’d like to assess some key descriptive analytics across all pod-trips, for example: What is the trip time per pod-trip?

Operations needed:

1. Group by ['pod_id','trip_id']
2. For every trip, calculate the trip_time as last timestamp – first timestamp.
3. Calculate distribution of the pod-trip times (mean, stddev)

### The short & slow ( pandas ) way:

(snippet #1)

import pandas as pd
# take the grouped.max (last timestamp) and join with grouped.min (first timestamp)
gdf = pdf.groupby(['pod_id','trip_id']).agg({'timestamp': ['min','max']})
gdf.columns = ['timestamp_first','timestamp_last']
gdf['trip_time_sec'] = gdf['timestamp_last'] - gdf['timestamp_first']
gdf['trip_time_hours'] = gdf['trip_time_sec'] / 3600.0
# calculate the statistics on trip times
pd_result = gdf.describe()

### The long & fast ( PySpark ) way:

(snippet #2)

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
# import pandas df to spark (this line is not used for profiling)
sdf = spark.createDataFrame(pdf)
# sort by timestamp and group by pod_id, trip_id
sdf = sdf.sort(F.desc('timestamp'))
sdf = sdf.groupBy("pod_id", "trip_id").agg(F.max('timestamp').alias('timestamp_last'),
                                           F.min('timestamp').alias('timestamp_first'))
# add another column trip_time_sec as the difference between first and last
sdf = sdf.withColumn('trip_time_sec', sdf['timestamp_last'] - sdf['timestamp_first'])
sdf = sdf.withColumn('trip_time_hours', sdf['trip_time_sec'] / 3600.0)
# calculate the statistics on trip times
sdf.select(F.col('timestamp_last'), F.col('timestamp_first'),
           F.col('trip_time_sec'), F.col('trip_time_hours')).summary().toPandas()


### The short & fast ( Koalas ) way:

(snippet #3)

import databricks.koalas as ks
# import pandas df to koalas (and so also spark) (this line is not used for profiling)
kdf = ks.from_pandas(pdf)
# the code below is the same as the pandas version
gdf = kdf.groupby(['pod_id','trip_id']).agg({'timestamp': ['min','max']})
gdf.columns = ['timestamp_first','timestamp_last']
gdf['trip_time_sec'] = gdf['timestamp_last'] - gdf['timestamp_first']
gdf['trip_time_hours'] = gdf['trip_time_sec'] / 3600.0
ks_result = gdf.describe().to_pandas()

Note that for snippets #1 and #3, the code is exactly the same, and so the “Spark switch” is seamless! For most pandas scripts, you can even try changing the import line from import pandas as pd to import databricks.koalas as pd, and some scripts will run fine with minor adjustments, within the limitations explained below.

### Results

All the snippets have been verified to return the same pod-trip-times results. The describe and summary methods for pandas and Spark are slightly different, as explained here, but this should not affect performance too much.
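For reference, the slight difference is mostly in labels and which statistics are reported. A small sketch of what pandas' describe() returns for a numeric column (Spark's summary() reports the same quartiles but labels the standard deviation "stddev"), using toy data:

```python
import pandas as pd

# Toy numeric series, purely for illustrating the describe() output shape.
s = pd.Series([1.0, 2.0, 3.0, 4.0])
stats = s.describe()
print(list(stats.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```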

Sample results:

## Advanced Example: UDFs and complicated operations

We’re now going to try to solve a more complex problem with the same dataframe, and see how pandas and Koalas implementations differ.

Goal: Analyze the average speed per pod-trip:

1. Group by ['pod_id','trip_id']
2. For every pod-trip calculate the total distance travelled by finding the area below the velocity (time) chart (method explained here):
3. Sort the grouped df by timestamp column.
4. Calculate diffs of timestamps.
5. Multiply the diffs with the speed – this will result in the distance traveled in that time diff.
6. Sum the distance_travelled column – this will give us total distance travelled per pod-trip.
7. Calculate the trip time as timestamp.last – timestamp.first (as in the previous paragraph).
8. Calculate the average_speed as distance_travelled / trip time.
9. Calculate distribution of the pod-trip times (mean, stddev).
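The distance integration in steps 3 to 6 can be checked on a toy example in plain pandas. The trace below is hypothetical (made-up timestamps and speeds for a single pod-trip), but the operations mirror the snippets that follow:

```python
import pandas as pd

# Hypothetical single pod-trip trace (illustrative numbers only).
trip = pd.DataFrame({
    'timestamp': [0.0, 10.0, 20.0, 30.0],   # step 3: sorted by timestamp
    'speed_mph': [50.0, 60.0, 60.0, 40.0],
})
trip = trip.sort_values('timestamp')
# Step 4: diffs of timestamps (the first diff is NaN and is skipped by sum()).
trip['time_diff'] = trip['timestamp'].diff()
# Steps 5-6: rectangle-rule area under the speed curve = total distance.
distance = (trip['time_diff'] * trip['speed_mph']).sum()
# Step 7: trip time as last timestamp minus first timestamp.
trip_time = trip['timestamp'].iloc[-1] - trip['timestamp'].iloc[0]
print(distance, trip_time)   # 1600.0 30.0
```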

We decided to implement this task using a custom apply function and UDF (user defined functions).

### The pandas way:

(snippet #4)

import pandas as pd

def calc_distance_from_speed(gdf):
    gdf = gdf.sort_values('timestamp')
    gdf['time_diff'] = gdf['timestamp'].diff()
    return pd.DataFrame({
        'distance_miles': [(gdf['time_diff'] * gdf['speed_mph']).sum()],
        'travel_time_sec': [gdf['timestamp'].iloc[-1] - gdf['timestamp'].iloc[0]]
    })

results = pdf.groupby(['pod_id','trip_id']).apply(calc_distance_from_speed)
results['distance_km'] = results['distance_miles'] * 1.609
results['avg_speed_mph'] = results['distance_miles'] / results['travel_time_sec'] / 60.0
results['avg_speed_kph'] = results['avg_speed_mph'] * 1.609
results.describe()

### The PySpark way:

(snippet #5)

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
import pyspark.sql.functions as F

schema = StructType([
    StructField("pod_id", StringType()),
    StructField("trip_id", StringType()),
    StructField("distance_miles", DoubleType()),
    StructField("travel_time_sec", DoubleType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calculate_distance_from_speed(gdf):
    gdf = gdf.sort_values('timestamp')
    gdf['time_diff'] = gdf['timestamp'].diff()
    return pd.DataFrame({
        'pod_id': [gdf['pod_id'].iloc[0]],
        'trip_id': [gdf['trip_id'].iloc[0]],
        'distance_miles': [(gdf['time_diff'] * gdf['speed_mph']).sum()],
        'travel_time_sec': [gdf['timestamp'].iloc[-1] - gdf['timestamp'].iloc[0]]
    })

# spark_df is the Spark DataFrame created from pdf, e.g. spark.createDataFrame(pdf)
sdf = spark_df.groupby("pod_id", "trip_id").apply(calculate_distance_from_speed)
sdf = sdf.withColumn('distance_km', F.col('distance_miles') * 1.609)
sdf = sdf.withColumn('avg_speed_mph', F.col('distance_miles') / F.col('travel_time_sec') / 60.0)
sdf = sdf.withColumn('avg_speed_kph', F.col('avg_speed_mph') * 1.609)
sdf = sdf.orderBy(sdf.pod_id, sdf.trip_id)
sdf.summary().toPandas()  # summary calculates almost the same results as describe

### The Koalas way:

(snippet #6)

import databricks.koalas as ks

def calc_distance_from_speed_ks(gdf) -> ks.DataFrame[str, str, float, float]:
    gdf = gdf.sort_values('timestamp')
    gdf['meanspeed'] = (gdf['timestamp'].diff() * gdf['speed_mph']).sum()
    gdf['triptime'] = (gdf['timestamp'].iloc[-1] - gdf['timestamp'].iloc[0])
    return gdf[['pod_id','trip_id','meanspeed','triptime']].iloc[0:1]

kdf = ks.from_pandas(pdf)
results = kdf.groupby(['pod_id','trip_id']).apply(calc_distance_from_speed_ks)
# due to current limitations of the package, groupby.apply() returns c0 .. c3 column names
results.columns = ['pod_id', 'trip_id', 'distance_miles', 'travel_time_sec']
# spark groupby does not set the groupby cols as index and does not sort them
results = results.set_index(['pod_id','trip_id']).sort_index()
results['distance_km'] = results['distance_miles'] * 1.609
results['avg_speed_mph'] = results['distance_miles'] / results['travel_time_sec'] / 60.0
results['avg_speed_kph'] = results['avg_speed_mph'] * 1.609
results.describe()

Koalas’ implementation of apply is based on PySpark’s pandas_udf which requires schema information, and this is why the definition of the function has to also define the type hint. The authors of the package introduced new custom type hints, ks.DataFrame and ks.Series. Unfortunately, the current implementation of the apply method is quite cumbersome, and it took a bit of an effort to arrive at the same result (column names change, groupby keys not returned). However, all the behaviors are appropriately explained in the package documentation.

## Performance

To assess the performance of Koalas, we profiled the code snippets for different numbers of rows.

The profiling experiment was done on Databricks platform, using the following cluster configurations:

• Spark driver node (also used to execute the pandas scripts): 8 CPU cores, 61GB RAM.

• 15 Spark worker nodes: 4 CPU cores, 30.5GB RAM each (total: 60 CPU cores / 457.5GB RAM).

Every experiment was repeated 10 times, and the clips shown below indicate the min and max times for the executions.

### Basic ops

When the data is small, the initialization operations and data transfer are huge in comparison to the computations, so pandas is much faster (marker a). For larger amounts of data, pandas’ processing times exceed those of the distributed solutions (marker b). Koalas then shows some performance hits, but it gets closer to PySpark as the data size increases (marker c).

### UDFs

For the UDF profiling, as noted in the PySpark and Koalas documentation, performance decreases dramatically. This is why we needed to decrease the number of rows we tested with by 100x vs the basic ops case. For each test case, Koalas and PySpark show a striking similarity in performance, indicating a consistent underlying implementation. During experimentation, we discovered that there exists a much faster way of executing that set of operations using PySpark window functions; however, this is not currently implemented in Koalas, so we decided to compare only the UDF versions.

### Discussion

Koalas seems to be the right choice if you want to make your pandas code immediately scalable and executable on bigger datasets that are not possible to process on a single node. After the quick swap to Koalas, just by scaling your Spark cluster, you can handle bigger datasets and improve processing times significantly. Your performance should be comparable to PySpark’s (though 5 to 50% lower, depending on the dataset size and the cluster).

On the other hand, the Koalas API layer does cause a visible performance hit, especially in comparison to the native Spark. At the end of the day, if computational performance is your key priority, you should consider switching from Python to Scala.

## Limitations and differences

During your first few hours with Koalas, you might wonder, “Why is this not implemented?!” Currently, the package is still under development and is missing some pandas API functionality, but much of it should be implemented in the next few months (for example groupby.diff() or kdf.rename()).

Also, from my experience as a contributor to the project, some features are either too complicated to implement with the Spark API or were skipped due to a significant performance hit. For example, DataFrame.values requires materializing the entire working set in a single node’s memory, and so is suboptimal and sometimes not even possible. Instead, if you need to retrieve some final results on the driver, you can call DataFrame.to_pandas() or DataFrame.to_numpy().

Another important thing to mention is that Koalas’ execution chain is different from pandas’: when you execute operations on the dataframe, they are put on a queue of operations but not executed. Only when the results are needed, e.g. when calling kdf.head() or kdf.to_pandas(), are the operations executed. That can be misleading for somebody who has never worked with Spark, since pandas executes everything line by line.
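The deferred-execution pattern described above can be sketched in a few lines of plain Python. The class and method names here are purely illustrative, not Koalas or Spark APIs:

```python
# A minimal sketch of deferred ("lazy") execution, the pattern Koalas
# inherits from Spark. LazyFrame, map, and collect are made-up names.
class LazyFrame:
    def __init__(self, data):
        self._data = data
        self._ops = []              # queued operations, not yet executed

    def map(self, fn):
        self._ops.append(fn)        # record the step; do no work yet
        return self

    def collect(self):              # only now does the whole chain execute
        out = self._data
        for fn in self._ops:
            out = [fn(x) for x in out]
        return out

lf = LazyFrame([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
# Nothing has run yet; results materialize only when collect() is called,
# much as kdf.head() or kdf.to_pandas() triggers Spark execution.
print(lf.collect())   # [11, 21, 31]
```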

## Conclusions

Koalas helped us to reduce the burden to “Spark-ify” our pandas code. If you’re also struggling with scaling your pandas code, you should try it too! If you are desperately missing any behavior or found inconsistencies with pandas, please open an issue so that as a community we can ensure that the package is actively and continually improved. Also, feel free to contribute!

## Resources

1. Koalas github: https://github.com/databricks/koalas
2. Koalas documentation: https://koalas.readthedocs.io
3. Code snippets from this article: https://gist.github.com/patryk-oleniuk/043f97ae9c405cbd13b6977e7e6d6fbc .

--

Try Databricks for free. Get started today.

The post Guest Blog: How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas appeared first on Databricks.

Continue Reading…

### How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions

As machine learning evolves, the need for tools and platforms that automate the lifecycle management of training and testing datasets is becoming increasingly important. Fast growing technology companies like Uber or LinkedIn have been forced to build their own in-house data lifecycle management solutions to power different groups of machine learning models.

Continue Reading…

Thanks for reading!