# My Data Science Blogs

## February 21, 2017

### If you did not already know: “Gini Impurity”

Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category. … Gini Impurity
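In code, the definition reduces to one line: with class proportions p_i, the impurity is the sum of p_i(1 - p_i) over classes, i.e. 1 - Σ p_i². A minimal Python sketch (illustrative, not tied to any particular CART implementation):

```python
from collections import Counter

def gini_impurity(labels):
    """Probability that a randomly drawn element is mislabeled when
    labels are assigned at random according to the label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    # 1 - sum of squared class proportions, equivalent to sum p_i * (1 - p_i).
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["a", "a", "a"]))       # 0.0: pure node, minimum impurity
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5: even two-class split
```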

### What’s new on arXiv

This work develops a distributed optimization strategy with guaranteed exact convergence for a broad class of left-stochastic combination policies. The resulting exact diffusion strategy is shown in Part II to have a wider stability range and superior convergence performance than the EXTRA strategy. The exact diffusion solution is applicable to non-symmetric left-stochastic combination matrices, while most earlier developments on exact consensus implementations are limited to doubly-stochastic matrices; these latter matrices impose stringent constraints on the network topology. Similar difficulties arise for implementations with right-stochastic policies, which are common in push-sum consensus solutions. The derivation of the exact diffusion strategy in this work relies on reformulating the aggregate optimization problem as a penalized problem and resorting to a diagonally-weighted incremental construction. Detailed stability and convergence analyses are pursued in Part II and are facilitated by examining the evolution of the error dynamics in a transformed domain. Numerical simulations illustrate the theoretical conclusions.
We introduce a semi-supervised discrete choice model to calibrate discrete choice models when relatively few requests have both choice sets and stated preferences but the majority only have the choice sets. Two classic semi-supervised learning algorithms, the expectation maximization algorithm and the cluster-and-label algorithm, have been adapted to our choice modeling problem setting. We also develop two new algorithms based on the cluster-and-label algorithm. The new algorithms use the Bayesian Information Criterion to evaluate a clustering setting to automatically adjust the number of clusters. Two computational studies including a hotel booking case and a large-scale airline itinerary shopping case are presented to evaluate the prediction accuracy and computational effort of the proposed algorithms. Algorithmic recommendations are rendered under various scenarios.
Part I of this work developed the exact diffusion algorithm to remove the bias that is characteristic of distributed solutions for deterministic optimization problems. The algorithm was shown to be applicable to a larger set of combination policies than earlier approaches in the literature. In particular, the combination matrices are not required to be doubly stochastic or right-stochastic, which impose stringent conditions on the graph topology and communications protocol. In this Part II, we examine the convergence and stability properties of exact diffusion in some detail and establish its linear convergence rate. We also show that it has a wider stability range than the EXTRA consensus solution, meaning that it is stable for a wider range of step-sizes and can, therefore, attain faster convergence rates. Analytical examples and numerical simulations illustrate the theoretical findings.
The multilabel learning problem with a large number of labels, features, and data points has generated tremendous interest recently. A recurring theme of these problems is that only a few labels are active in any given data point compared to the total number of labels. However, only a small number of existing works take direct advantage of this inherent extreme sparsity in the label space. By virtue of the Restricted Isometry Property (RIP), satisfied by many random ensembles, we propose a novel procedure for multilabel learning known as RIPML. During the training phase, in RIPML, labels are projected onto a random low-dimensional subspace, followed by solving a least-squares problem in this subspace. Inference is done by a k-nearest neighbor (kNN) based approach. We demonstrate the effectiveness of RIPML by conducting extensive simulations and comparing results with state-of-the-art linear dimensionality reduction based approaches.
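The training/inference split the abstract describes can be sketched in a few lines; everything below (shapes, the Gaussian projection, the plain least-squares solve) is an illustrative assumption, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L, m = 200, 20, 50, 10        # samples, features, labels, projected dim

X = rng.normal(size=(n, d))                          # feature matrix
Y = (rng.random(size=(n, L)) < 0.05).astype(float)   # sparse multilabel matrix

# Training: project labels onto a random low-dimensional subspace
# (Gaussian matrices satisfy RIP with high probability), then solve a
# least-squares problem in that subspace.
A = rng.normal(size=(m, L)) / np.sqrt(m)    # random projection of labels
Z = Y @ A.T                                 # projected labels, shape (n, m)
W, *_ = np.linalg.lstsq(X, Z, rcond=None)   # regression so that X @ W ≈ Z

# Inference: map a query into the projected space, then average the
# original label vectors of its k nearest neighbors there.
def predict(x, k=5):
    dists = np.linalg.norm(Z - x @ W, axis=1)
    nearest = np.argsort(dists)[:k]
    return Y[nearest].mean(axis=0)   # per-label scores

print(predict(X[0]).shape)   # (50,): one score per label
```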
The aim of this paper is to give an overview of domain adaptation and transfer learning with a specific view on visual applications. After a general motivation, we first position domain adaptation in the larger transfer learning problem. Second, we try to address and analyze briefly the state-of-the-art methods for different types of scenarios, first describing the historical shallow methods, addressing both the homogeneous and the heterogeneous domain adaptation methods. Third, we discuss the effect of the success of deep convolutional architectures which led to new type of domain adaptation methods that integrate the adaptation within the deep architecture. Fourth, we overview the methods that go beyond image categorization, such as object detection or image segmentation, video analyses or learning visual attributes. Finally, we conclude the paper with a section where we relate domain adaptation to other machine learning solutions.
This article studies the Gram random matrix model $G=\frac1T\Sigma^{\rm T}\Sigma$, $\Sigma=\sigma(WX)$, classically found in random neural networks, where $X=[x_1,\ldots,x_T]\in\mathbb{R}^{p\times T}$ is a (data) matrix of bounded norm, $W\in\mathbb{R}^{n\times p}$ is a matrix of independent zero-mean unit variance entries, and $\sigma:\mathbb{R}\to\mathbb{R}$ is a Lipschitz continuous (activation) function — $\sigma(WX)$ being understood entry-wise. We prove that, as $n,p,T$ grow large at the same rate, the resolvent $Q=(G+\gamma I_T)^{-1}$, for $\gamma>0$, has a similar behavior as that met in sample covariance matrix models, involving notably the moment $\Phi=\frac{T}n\mathrm{E}[G]$, which provides in passing a deterministic equivalent for the empirical spectral measure of $G$. This result, established by means of concentration of measure arguments, enables the estimation of the asymptotic performance of single-layer random neural networks. This in turn provides practical insights into the underlying mechanisms into play in random neural networks, entailing several unexpected consequences, as well as a fast practical means to tune the network hyperparameters.
The time complexity of data clustering has been viewed as fundamentally quadratic, slowing with the number of data items, as each item is compared for similarity to preceding items. Clustering of large data sets has been infeasible without resorting to probabilistic methods or to capping the number of clusters. Here we introduce MIMOSA, a novel class of algorithms which achieve linear time computational complexity on clustering tasks. MIMOSA algorithms mark and match partial-signature keys in a hash table to obtain exact, error-free cluster retrieval. Benchmark measurements, on clustering a data set of 10,000,000 news articles by news topic, found that a MIMOSA implementation finished more than four orders of magnitude faster than a standard centroid implementation.
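The linear-time core, grouping items by exact signature keys in a hash table in one pass, can be illustrated as follows; this is only a toy sketch of that idea, as MIMOSA's mark-and-match scheme over partial signatures is considerably more involved:

```python
from collections import defaultdict

def hash_cluster(items, signature):
    """One pass over the items, O(n) time: items sharing a signature key
    land in the same hash-table bucket, giving exact cluster retrieval."""
    clusters = defaultdict(list)
    for item in items:
        clusters[signature(item)].append(item)
    return list(clusters.values())

headlines = ["tax reform bill", "tax reform vote", "storm hits coast"]
# Hypothetical signature: the two leading tokens of the headline.
clusters = hash_cluster(headlines, lambda h: tuple(h.split()[:2]))
print(len(clusters))   # 2: the two tax-reform items share a bucket
```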
In the era of big data, reducing data dimensionality is critical in many areas of science. Widely used Principal Component Analysis (PCA) addresses this problem by computing a low dimensional data embedding that maximally explains variance of the data. However, PCA has two major weaknesses. Firstly, it only considers linear correlations among variables (features), and secondly it is not suitable for categorical data. We resolve these issues by proposing Maximally Correlated Principal Component Analysis (MCPCA). MCPCA computes transformations of variables whose covariance matrix has the largest Ky Fan norm. Variable transformations are unknown, can be nonlinear and are computed in an optimization. MCPCA can also be viewed as a multivariate extension of Maximal Correlation. For jointly Gaussian variables we show that the covariance matrix corresponding to the identity (or the negative of the identity) transformations majorizes covariance matrices of non-identity functions. Using this result we characterize global MCPCA optimizers for nonlinear functions of jointly Gaussian variables for every rank constraint. For categorical variables we characterize global MCPCA optimizers for the rank one constraint based on the leading eigenvector of a matrix computed using pairwise joint distributions. For a general rank constraint we propose a block coordinate descent algorithm and show its convergence to stationary points of the MCPCA optimization. We compare MCPCA with PCA and other state-of-the-art dimensionality reduction methods including Isomap, LLE, multilayer autoencoders (neural networks), kernel PCA, probabilistic PCA and diffusion maps on several synthetic and real datasets. We show that MCPCA consistently provides improved performance compared to other methods.
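The objective's Ky Fan k-norm is simply the sum of the k largest singular values of a matrix, which is easy to compute directly:

```python
import numpy as np

def ky_fan_norm(M, k):
    """Ky Fan k-norm: sum of the k largest singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)   # singular values, descending
    return float(s[:k].sum())

M = np.diag([3.0, 2.0, 1.0])
print(ky_fan_norm(M, 2))   # 5.0 (top two singular values: 3 + 2)
```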

### Predictive Analytics World for Business, June 19-22, Chicago

Join your peers at Predictive Analytics World for Business and tap the potential of predictive analytics to optimize business. You will grasp it, own it, and put it to use by learning from the best of the best.

### Distilled News

This exercise focuses on linear regression with both analytical (normal equation) and numerical (gradient descent) methods. We will start with linear regression with one variable. From this part of the exercise, we will create plots that help to visualize how gradient descent gets the coefficient of the predictor and the intercept. In the second part, we will implement linear regression with multiple variables.
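The two solution methods for one-variable linear regression can be sketched side by side; the synthetic data and step size below are illustrative, not the exercise's own:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)   # intercept 2, slope 3
X = np.column_stack([np.ones_like(x), x])             # prepend intercept column

# Analytical: normal equation, theta = (X'X)^{-1} X'y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Numerical: batch gradient descent on mean squared error
theta_gd = np.zeros(2)
alpha = 0.01                      # learning rate (illustrative)
for _ in range(20000):
    grad = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd -= alpha * grad

print(theta_ne)   # ≈ [2, 3] (intercept, slope)
print(theta_gd)   # gradient descent converges to the same coefficients
```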
Only Homo sapiens, of all the descendants of Homo erectus, survived on earth, whereas other species such as Homo soloensis, Homo denisova, Homo neanderthalensis and Homo floresiensis faded away more than 40,000 years ago. What advantages did Homo sapiens possess that helped them flourish while other species went extinct? Apparently a cognitive revolution (according to Prof. Yuval Harari in his famous book Sapiens), triggered by some kind of genetic mutation, provided Homo sapiens with more cerebral power, and thus they acquired an ability not possessed by any other species – the ability to imagine things that did not exist. This ability helped them invent things including powerful communicative languages, religion, tools and more. Does the current cognitive revolution, ushered in under various names such as artificial intelligence and cognitive computing, provide more powers, this time to machines, and bring unprecedented progress to human life? Are public and private institutions geared towards initiating, leading and supporting this change? How can leaders at private enterprises reap the benefits of AI, and how do leaders at public organizations and government institutions create policies and regulations to handle issues arising from the growing adoption of AI? It is imperative that leaders in these organizations stay conscious of the latest developments in the field of artificial intelligence.
factoextra is an R package that makes it easy to extract and visualize the output of exploratory multivariate data analyses.
Recently I have been starting to use dplyr for handling my data in R. It makes everything a lot smoother! My previous workflow – running an SQL query, storing the results as CSV, loading it in RStudio – is now history. With dplyr you can directly query d
R is widely taught in business courses and, hence, known by most data scientists with business background. However, when it comes to optimization and Operations Research, many other languages are used. Especially for optimization, solutions range from Microsoft Excel solvers to modeling environments such as Matlab and GAMS. Most of these are non-free and require students to learn yet another language. Because of this, we propose to use R in optimization problems of Operations Research, since R is open source, comes for free and is widely known. Furthermore, R provides a multitude of numerical optimization packages that are readily available. At the same time, R is widely used in industry, making it a suitable and skillful tool to lever the potential of numerical optimization.
A very useful machine learning method which, despite its simplicity, is incredibly successful in many real-world applications is the Naive Bayes classifier. I am currently taking a machine learning module as part of my data science college course, and this week’s practical work involved a classification problem using the Naive Bayes method. I thought I’d write a two-part piece about the project and the underlying concepts of probability, both to consolidate my own understanding and to share my method with others who may find it useful.
Following on from Part 1 of this two-part post, I would now like to explain how the Naive Bayes classifier works before applying it to a classification problem involving breast cancer data.
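Before Part 2's application to the breast cancer data, the classifier's mechanics can be sketched from scratch. This is a generic Gaussian Naive Bayes illustration with made-up toy data, not the course's code:

```python
import math
from collections import defaultdict

def fit(X, y):
    """Estimate per-class priors and per-feature (mean, variance)."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    priors, stats = {}, {}
    for c, rows in groups.items():
        priors[c] = len(rows) / len(y)
        feats = []
        for col in zip(*rows):
            m = sum(col) / len(col)
            v = sum((u - m) ** 2 for u in col) / len(col) + 1e-9  # smoothed
            feats.append((m, v))
        stats[c] = feats
    return priors, stats

def predict(x, priors, stats):
    """Maximize log P(c) + sum_j log N(x_j | mean, var); the 'naive'
    assumption is feature independence given the class."""
    best, best_lp = None, -math.inf
    for c in priors:
        lp = math.log(priors[c])
        for xj, (m, v) in zip(x, stats[c]):
            lp += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy data: two well-separated classes (the labels are hypothetical).
X = [(1.0, 2.0), (1.2, 1.9), (6.0, 7.0), (5.8, 7.2)]
y = ["benign", "benign", "malignant", "malignant"]
priors, stats = fit(X, y)
print(predict((1.1, 2.1), priors, stats))   # benign
```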
Time series data is an important area of analysis, especially if you do a lot of web analytics. To analyse time series effectively, it helps to understand the interaction between general seasonality in activity and the underlying trend. These interactions are typically classified as either additive or multiplicative. This post looks at how we can classify a given time series as one or the other to facilitate further processing.
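One simple heuristic for this classification: decompose the series both ways and prefer the model whose residuals are less variable. A Python sketch on an illustrative synthetic series; the moving-average decomposition and the comparison rule are assumptions of this sketch, not necessarily the post's method:

```python
import numpy as np

def classify_seasonality(y, period):
    """Decompose with a centered moving-average trend and per-phase
    seasonal averages, both additively and multiplicatively; the model
    with the smaller standardized residual spread wins."""
    t = np.convolve(y, np.ones(period) / period, mode="valid")  # trend
    y_c = y[period // 2 : period // 2 + len(t)]                 # align with trend
    idx = np.arange(len(t)) % period                            # seasonal phase

    # Additive: y = trend + seasonal + residual
    detr_add = y_c - t
    seas_add = np.array([detr_add[idx == i].mean() for i in range(period)])
    resid_add = detr_add - seas_add[idx]

    # Multiplicative: y = trend * seasonal * residual
    detr_mul = y_c / t
    seas_mul = np.array([detr_mul[idx == i].mean() for i in range(period)])
    resid_mul = detr_mul / seas_mul[idx]

    # Standardize the additive residual by the series level so the two
    # spreads are comparable.
    cv_add = resid_add.std() / y_c.mean()
    cv_mul = resid_mul.std()
    return "additive" if cv_add < cv_mul else "multiplicative"

n = np.arange(120)
trend = 10 + 0.1 * n
season = 1 + 0.3 * np.sin(2 * np.pi * n / 12)
print(classify_seasonality(trend * season, 12))                          # multiplicative
print(classify_seasonality(trend + 3 * np.sin(2 * np.pi * n / 12), 12))  # additive
```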
There is a renaissance happening in the field of artificial intelligence, though for many long-term practitioners in the field it is not too obvious. I wrote earlier about the pushback that many are making against the developments of Deep Learning. Deep Learning is, however, an extremely radical departure from classical methods. One researcher who recognized that our classical approaches to Artificial General Intelligence (AGI) were all but broken is Monica Anderson.
TensorFlow was the new kid on the block when it was introduced in 2015 and has become the most used deep learning framework last year. I jumped on the train a few months after the first release and began my journey into deep learning during my master’s thesis. It took a while to get used to the computation graph and session model, but since then I’ve got my head around most of the quirks and twists. This short article is no introduction to TensorFlow, but instead offers some quick tips, mostly focused on performance, that reveal common pitfalls and may boost your model and training performance to new levels. We’ll start with preprocessing and your input pipeline, visit graph construction and move on to debugging and performance optimizations.

## February 20, 2017

### coauthorship and citation networks

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

As I discovered (!) the Annals of Applied Statistics in my mailbox just prior to taking the local train to Dauphine for the first time in 2017 (!), I started reading it on the way, but did not get any further than the first discussion paper by Pengsheng Ji and Jiashun Jin on coauthorship and citation networks for statisticians. I found the whole exercise intriguing, I must confess, with little to support a whole discussion on the topic. I may have read the paper too superficially as a métro pastime, but to me it sounded more like a post-hoc analysis than a statistical exercise, something like looking at the network, or rather at the output of software for representing networks, and making sense of clumps and sub-networks a posteriori. (In a way this reminded me of my first SAS project at school, on the patterns of vacations in France. It was in 1983, on punched cards. And we spent a while cutting & pasting, in a literal sense, the 80-column graphs produced by SAS on endless listings.)

It may be that part of the interest in the paper is self-centred. I do not think analysing a similar dataset in another field like deconstructionist philosophy or Korean raku would have attracted the same attention. Looking at the clusters and the names on the pictures is obviously making sense, if more at a curiosity than a scientific level, as I do not think this brings much in terms of ranking and evaluating research (despite what Bernard Silverman suggests in his preface) or understanding collaborations (beyond the fact that people in the same subfield or same active place like Duke tend to collaborate). Speaking of curiosity, I was quite surprised to spot my name in one network and even more to see that I was part of the “High-Dimensional Data Analysis” cluster, rather than of the “Bayes” cluster. I cannot fathom how I ended up in that theme, as I cannot think of a single paper of mine pertaining to either high dimensions or data analysis [to force the trait just a wee bit!]. Maybe thanks to my joint paper with Peter Mueller. (I tried to check the data itself but cannot trace my own papers in the raw datafiles.)

I also wonder what the point is of looking at only four major journals in the field, missing for instance most of computational statistics and biostatistics, not to mention machine learning or econometrics. This results in a somewhat narrow niche, though it obviously recovers the main authors in the [corresponding] field. Some major players in computational stats still make it onto the lists, like Gareth Roberts or Håvard Rue, but under the wrong categorisation of spatial statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Magister Dixit

“We are in the Golden Age of Data. For those of us on the front-lines, it doesn’t feel that way. Every step forward this technology takes, the need for deeper analytics takes two. We’re constantly catching up.” Michal Klos ( January 28, 2015 )

### R Weekly

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

During my Monday morning ritual of avoiding work, I found this publication that is written in R, for people who use R – R Weekly. The authors do a pretty awesome job of aggregating useful, entertaining, and informative content about what’s happening around our favorite programming language. Check it out, give the authors some love on GitHub, and leave a like if you find something useful there.

Have a good week,

Kiefer Smith


### rxNeuralNet vs. xgBoost vs. H2O

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Recently, I did a session at a local user group in Ljubljana, Slovenia, where I introduced the new algorithms available in the MicrosoftML package for Microsoft R Server 9.0.3.

For the datasets, I used two from (still currently) running Kaggle competitions. In the last part, I did image detection and prediction on the MNIST dataset and compared the performance and accuracy between the packages.

MNIST Handwritten digit database is available here.

Starting off with rxNeuralNet, we have to build a NET# model (the neural network definition) for it to work with.

Model for Neural network:

const { T = true; F = false; }

input Picture [28, 28];

hidden C1 [5 * 13^2]
from Picture convolve {
InputShape  = [28, 28];
KernelShape = [ 5,  5];
Stride      = [ 2,  2];
MapCount = 5;
}

hidden C2 [50, 5, 5]
from C1 convolve {
InputShape  = [ 5, 13, 13];
KernelShape = [ 1,  5,  5];
Stride      = [ 1,  2,  2];
Sharing     = [ F,  T,  T];
MapCount = 10;
}

hidden H3 [100]
from C2 all;

// Output layer definition.
output Result [10]
from H3 all;

Once we have this, we can work out with rxNeuralNet algorithm:

model_DNN_GPU <- rxNeuralNet(label ~.
,data = dataTrain
,type = "multi"
,numIterations = 10
,normalize = "no"
#,acceleration = "gpu" # enable this if you have a CUDA driver
,miniBatchSize = 64 # set to 1 if you run into CUDA driver problems, otherwise 64
,netDefinition = netDefinition
,optimizer = sgd(learningRate = 0.1, lRateRedRatio = 0.9, lRateRedFreq = 10)
)

Then do the prediction and calculate accuracy matrix:

DNN_GPU_score <- rxPredict(model_DNN_GPU, dataTest, extraVarsToWrite = "label")
sum(DNN_GPU_score$label == DNN_GPU_score$PredictedLabel)/nrow(DNN_GPU_score)

Accuracy for this model is:

[1] 0.9789

When working with the H2O package, the following code was executed to set comparable parameters for the neural network:

model_h20 <- h2o.deeplearning(x = 2:785
,y = 1   # label for label
,training_frame = train_h2o
,activation = "RectifierWithDropout"
,input_dropout_ratio = 0.2 # % of inputs dropout
,hidden_dropout_ratios = c(0.5,0.5,0.5) # % for nodes dropout, one ratio per hidden layer
,balance_classes = TRUE
,hidden = c(50,100,100)
,momentum_stable = 0.99
,nesterov_accelerated_gradient = T # use it for speed
,epochs = 15)

When the test dataset is scored against the learned model:

h2o.confusionMatrix(model_h20)
100-(416/9978)*100

the result is a confusion matrix, from which the accuracy is:

# [1] 95.83083

For comparison, I have added xgBoost (eXtreme Gradient Boosting), but this time I will not focus on it.

Time comparison across the packages (in seconds), from left to right: H2O, MicrosoftML with GPU acceleration, MicrosoftML without GPU acceleration, and xgBoost.

As for the accuracy of the trained model, here are results (based on my tests):

MicrosoftML – Neural Network – 97.8 %

H2O – Deep Learning – 95.3 %

xgBoost – 94.9 %

As always, code and dataset are available at GitHub.

Happy R-ing


### Should the Problems with Polls Make Us Worry about the Quality of Health Surveys? (my talk at CDC tomorrow)

My talk at CDC this Tuesday, February 21, 2017, 12:00 noon, 2400 Century Center, Room 1015C:

Should the Problems with Polls Make Us Worry about the Quality of Health Surveys?

Response rates in public opinion polls have been steadily declining for more than half a century and are currently heading toward the 0% mark. We have learned much in recent years about the problems this is causing and how we can improve data collection and statistical analysis to get better estimates of opinion and opinion trends. In this talk, we review research in this area and then discuss the relevance of this work to similar problems in health surveys.

### strcode – structure your code better

(This article was first published on Rsome - A blog about some R stuff, and kindly contributed to R-bloggers)

I am pleased to announce my package strcode, a package that should make structuring code easier. You can install it from GitHub, a CRAN submission is planned at a later stage.

# A concept for code structuring

The main feature of the package is its function to insert code breaks. These are helpful in breaking the code down into smaller sections. We suggest three levels of granularity for code structuring, wherein higher-level blocks can contain lower-level blocks:

You can notice from the above that:

• The number of #’s used in front of the break characters (___, ..., . .) corresponds to the level of granularity of the separator.
• The break characters ___, ..., . . were chosen to reflect the level of granularity, namely ___ has a much higher visual density than . ..
• Each block has an (optional) short title on what that block is about.

Every title line ends with ####. Therefore, the titles are recognized by RStudio as sections. This has the advantage that you can get a quick summary of your code in RStudio’s code pane as depicted below.

In addition, it means that you can fold sections as you can fold function declarations or if statements.

The package strcode provides an RStudio Add-in to insert each of the three separators presented above – with a single click. To invoke the Add-in, simply click on the button Addins in your RStudio Window and select the separator you want to insert.

By default, a Shiny Gadget will open in the viewer pane where you can also specify the title of the section (optional) and whether or not a unique identifier/anchor should be added to the section (see below).

If you prefer to insert a separator without the Shiny Gadget, you can set the option strcode$insert_with_shiny to FALSE, which inserts only the break. The separators all have length 80. The value is looked up in the global option strcode$char_length and can therefore be changed by the user as well. The length of the separators is meant to correspond to the character-width limit you use.

# Anchoring sections

Sometimes it is required to refer to a code section, which can be done by a title. A better way, however, is to use a unique hash sequence – let us call it a code anchor – to create an arguably unique reference to that section. A code anchor in strcode is enclosed by #< and ># so all anchors can be found using regular expressions. You can add section breaks that include a hash. Simply tick the box when you insert the break via the Shiny Gadget. The outcome might look like this

# Summarizing code

Once code has been structured with the separators introduced above, it can easily be summarized in a compact form. This is particularly handy when the code base is large, when a lot of people work on the code or when new people join a project. The function sum_str is designed for the purpose of extracting separators and respective comments, in order to provide high level code summaries. Thanks to RStudio’s API, you can even create summaries of the file you are working on, simply by typing sum_str() in the console.

The outcome might look like this:

sum_str is highly customizable and flexible, with a host of options. For example, you can omit level 2 or level 3 sections from the summary, summarize multiple files at once, write the summaries to a file, and more.

# Insert a code anchor

Code anchors might prove helpful in other situations where one wants to anchor a single line. That is also possible with strcode. An example of a code anchor is the following:

The hash sequences in strcode are produced with the R package digest.

To wrap it up, strcode tries to make structuring code easy and fun while keeping things simple. Give it a go. Feedback to the package is most welcome. If you find any bugs (in particular in sum_str) or if you have any suggestions for extensions of the package, feel free to open an issue on the GitHub repo of the package.

# Appendix

As an appendix to this post, I would like to give a real-world example of how using strcode can improve the legibility of code.


### NEC Labs: Researcher

Seeking multiple researchers to work in the area of data analytics and mining. The ideal candidates must have expertise in data mining and statistical learning, and be able to develop algorithms that analyze massive amounts of data to build innovative analytics applications.

### Tracking the fortunes of America’s white working-class men

DURING Donald Trump’s inauguration speech he declared that America’s “forgotten men and women” will be “forgotten no longer”. Then, earlier this month, he vowed to bring back jobs to states that have been “hurt so badly” by globalisation. By “forgotten” people, he means above all white working-class men.

### R Packages worth a look

Generalized Poisson Binomial Distribution (GPB)
Functions for the Generalized Poisson Binomial distribution, providing the cdf, pmf, quantile function, and random number generation.
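For intuition, the pmf of the (ordinary, non-generalized) Poisson binomial distribution can be built by dynamic programming over the trials; a Python sketch of that idea only, since the package itself handles the generalized distribution and also provides the cdf, quantiles, and random numbers:

```python
import numpy as np

def poisson_binomial_pmf(p):
    """pmf of the number of successes among independent Bernoulli(p_i)
    trials, built by repeated convolution over the trials."""
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.convolve(pmf, [1.0 - pi, pi])
    return pmf

pmf = poisson_binomial_pmf([0.2, 0.5, 0.9])
print(pmf.sum())      # 1.0 up to float precision
print(pmf.argmax())   # 2: two successes is the most likely count
```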

The Critical Care Clinical Data Processing Tools (cleanEHR)
A toolset to deal with the Critical Care Health Informatics Collaborative dataset. It is created to address various data reliability and accessibility problems of electronic healthcare records (EHR). It provides a unique platform which enables data manipulation, transformation, reduction, anonymisation, cleaning and validation.

Statistical Disclosure Control Substitution Matrix Calculator (sdcTarget)
Classes and methods to calculate and evaluate target matrices for statistical disclosure control.

### Sir Austin Bradford Hill for the Data Scientist: An xkcd Story

This is an attempt to explain Hill’s criteria using xkcd comics, both because it seemed fun, and also to give causal inference instructors some variety in which xkcd comic they include in lectures.

### Data Science for Doctors – Part 4 : Inferential Statistics (1/5)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series, and it aims to partially cover the subject of inferential statistics. Researchers rarely have the capacity to test many patients, or to try a new treatment on many patients, so making inferences from a sample is a necessary skill. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for sample, mean, sd, sort, and pnorm. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You may also need to load the ggplot2 library, and to install and load the moments package (for skewness and kurtosis):
install.packages("moments")
library(moments)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass == 0), ]

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generate a sampling distribution (10,000 iterations) of sample size 50 for the variable mass. You are encouraged to experiment with different sample sizes and iteration counts in order to see the impact they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better sense of what you are working with.

Exercise 2

Find the mean and standard error (standard deviation) of the sampling distribution. You are encouraged to use the values from the original distribution (data$mass) in order to understand how the mean and standard deviation are derived, as well as the importance of the sample size to the distribution.

Exercise 3

Find the skewness and kurtosis of the distribution you generated before.

Exercise 4

Suppose that we ran an experiment in which a sample of 50 people from the population followed an organic food diet, and their average mass was 30.5. What is the z-score for a mean of 30.5?

Exercise 5

What is the probability of drawing a sample of 50 with a mean less than 30.5? Use the z-table if you feel you need to.
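A sketch of the arithmetic behind Exercises 4 and 5 in Python (the population mean and standard deviation below are invented placeholders; in the exercise you would use mean(data$mass) and sd(data$mass)):

```python
import math

def z_score(sample_mean, pop_mean, pop_sd, n):
    """z = (xbar - mu) / (sigma / sqrt(n)): standard errors between sample mean and mu."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

def normal_cdf(z):
    """P(Z < z) for the standard normal -- what a z-table gives you."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# hypothetical population values, standing in for mean(data$mass) and sd(data$mass)
pop_mean, pop_sd = 32.4, 6.9

z = z_score(30.5, pop_mean, pop_sd, n=50)
p = normal_cdf(z)  # probability of drawing a sample of 50 with a mean below 30.5
print(round(z, 3), round(p, 4))
```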

Exercise 6

Suppose that you repeated the experiment with a larger sample of 150 and found the average mass to be 31. Compute the z-score for this mean.

Exercise 7

What is the probability of drawing a sample of 150 with a mean less than 31?

Exercise 8

Suppose everybody adopted the diet from the experiment. Find the margin of error for 95% of sample means.

Exercise 9

What would be our interval estimate that is 95% likely to contain the population mean if everyone in our population adopted the organic diet?

Exercise 10

Find the interval estimates for 98% and 99% likelihood.
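Exercises 8 through 10 all reduce to margin of error = z* · σ/√n, for the critical value z* matching the desired confidence level. A hedged Python sketch (the population values are again invented placeholders):

```python
import math

def inverse_normal_cdf(p, tol=1e-10):
    """Critical value z* with P(Z < z*) = p, found by bisection on the normal CDF."""
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def interval_estimate(sample_mean, pop_sd, n, confidence):
    """(lower, upper) interval around the sample mean at the given confidence level."""
    z_star = inverse_normal_cdf(1 - (1 - confidence) / 2)
    margin = z_star * pop_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# hypothetical values standing in for sd(data$mass) and the sample mean
for conf in (0.95, 0.98, 0.99):
    lo, hi = interval_estimate(30.5, 6.9, n=50, confidence=conf)
    print(conf, (round(lo, 2), round(hi, 2)))  # intervals widen as confidence rises
```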

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Creativity is Crucial in Data Science

Creativity and innovation are integral to data science, and going forward in the world of AI, these are the things that will give humans an edge over the machines.

### Data for Democracy: The First Two Months of D4D

Let's hear about how data science is used for democracy and the well-being of human societies by the Data for Democracy organisation.

### Deep Learning, Artificial Intuition and the Quest for AGI

Deep Learning systems exhibit behavior that appears biological despite not being based on biological material. It so happens that humanity has luckily stumbled upon Artificial Intuition in the form of Deep Learning.

### The 20 Most Popular Business Intelligence Tools

Given the enormous amount of Business Intelligence software solutions available, narrowing down the right one for your business can be a tedious process. How does a business start implementing this software? One way to start is by looking at systems that are popular among peers, because those products are the

The post The 20 Most Popular Business Intelligence Tools appeared first on Dataconomy.

### Text skew correction with OpenCV and Python

Today’s tutorial is a Python implementation of my favorite blog post by Félix Abecassis on the process of text skew correction (i.e., “deskewing text”) using OpenCV and image processing functions.

Given an image containing a rotated block of text at an unknown angle, we need to correct the text skew by:

1. Detecting the block of text in the image.
2. Computing the angle of the rotated text.
3. Rotating the image to correct for the skew.

We typically apply text skew correction algorithms in the field of automatic document analysis, but the process itself can be applied to other domains as well.

Looking for the source code to this post?

## Text skew correction with OpenCV and Python

The remainder of this blog post will demonstrate how to deskew text using basic image processing operations with Python and OpenCV.

We’ll start by creating a simple dataset that we can use to evaluate our text skew corrector.

We’ll then write Python and OpenCV code to automatically detect and correct the text skew angle in our images.

### Creating a simple dataset

Similar to Félix’s example, I have prepared a small dataset of four images that have been rotated by a given number of degrees:

Figure 1: Our four example images that we’ll be applying text skew correction to with OpenCV and Python.

The text block itself is from Chapter 11 of my book, Practical Python and OpenCV, where I’m discussing contours and how to utilize them for image processing and computer vision.

The filenames of the four files follow:

$ ls images/
neg_28.png  neg_4.png  pos_24.png  pos_41.png

The first part of the filename specifies whether our image has been rotated counter-clockwise (negative) or clockwise (positive). The second component of the filename is the actual number of degrees the image has been rotated by.

The goal of our text skew correction algorithm is to correctly determine the direction and angle of the rotation, then correct for it.

To see how our text skew correction algorithm is implemented with OpenCV and Python, be sure to read the next section.

### Deskewing text with OpenCV and Python

To get started, open up a new file and name it correct_skew.py. From there, insert the following code:

# import the necessary packages
import numpy as np
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image file")
args = vars(ap.parse_args())

# load the image from disk
image = cv2.imread(args["image"])

Lines 2-4 import our required Python packages. We’ll be using OpenCV via our cv2 bindings, so if you don’t already have OpenCV installed on your system, please refer to my list of OpenCV install tutorials to help you get your system set up and configured.

We then parse our command line arguments on Lines 7-10. We only need a single argument here, --image, which is the path to our input image. The image is then loaded from disk on Line 13.
Our next step is to isolate the text in the image:

# convert the image to grayscale and flip the foreground
# and background to ensure foreground is now "white" and
# the background is "black"
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.bitwise_not(gray)

# threshold the image, setting all foreground pixels to
# 255 and all background pixels to 0
thresh = cv2.threshold(gray, 0, 255,
    cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

Our input images contain text that is dark on a light background; however, to apply our text skew correction process, we first need to invert the image so the text is light on a dark background. When applying computer vision and image processing operations, it’s common for the foreground to be represented as light while the background (the part of the image we are not interested in) is dark.

A thresholding operation (Lines 23 and 24) is then applied to binarize the image:

Figure 2: Applying a thresholding operation to binarize our image. Our text is now white on a black background.
Given this thresholded image, we can now compute the minimum rotated bounding box that contains the text regions:

# grab the (x, y) coordinates of all pixel values that
# are greater than zero, then use these coordinates to
# compute a rotated bounding box that contains all
# coordinates
coords = np.column_stack(np.where(thresh > 0))
angle = cv2.minAreaRect(coords)[-1]

# the cv2.minAreaRect function returns values in the
# range [-90, 0); as the rectangle rotates clockwise the
# returned angle trends to 0 -- in this special case we
# need to add 90 degrees to the angle
if angle < -45:
    angle = -(90 + angle)

# otherwise, just take the inverse of the angle to make
# it positive
else:
    angle = -angle

Line 30 finds all (x, y)-coordinates in the thresh image that are part of the foreground. We pass these coordinates into cv2.minAreaRect, which then computes the minimum rotated rectangle that contains the entire text region.

The cv2.minAreaRect function returns angle values in the range [-90, 0). As the rectangle is rotated clockwise, the angle value increases towards zero. When zero is reached, the angle is set back to -90 degrees and the process continues.

Note: For more information on cv2.minAreaRect, please see this excellent explanation by Adam Goodwin.
Lines 37 and 38 handle the case where the angle is less than -45 degrees, in which case we need to add 90 degrees to the angle and take the inverse. Otherwise, Lines 42 and 43 simply take the inverse of the angle.

Now that we have determined the text skew angle, we need to apply an affine transformation to correct for the skew:

# rotate the image to deskew it
(h, w) = image.shape[:2]
center = (w // 2, h // 2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
rotated = cv2.warpAffine(image, M, (w, h),
    flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

Lines 46 and 47 determine the center (x, y)-coordinate of the image.
We pass the center coordinates and rotation angle into cv2.getRotationMatrix2D (Line 48). This rotation matrix M is then used to perform the actual transformation on Lines 49 and 50.

Finally, we display the results to our screen:

# draw the correction angle on the image so we can validate it
cv2.putText(rotated, "Angle: {:.2f} degrees".format(angle),
    (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

# show the output image
print("[INFO] angle: {:.3f}".format(angle))
cv2.imshow("Input", image)
cv2.imshow("Rotated", rotated)
cv2.waitKey(0)

Line 53 draws the angle on our image so we can verify that the output image matches the rotation angle (you would obviously want to remove this line in a document processing pipeline). Lines 57-60 handle displaying the output image.

### Skew correction results

To grab the code + example images used inside this blog post, be sure to use the “Downloads” section at the bottom of this post. From there, execute the following command to correct the skew for our neg_4.png image:

$ python correct_skew.py --image images/neg_4.png
[INFO] angle: -4.086

Figure 3: Applying skew correction using OpenCV and Python.

Here we can see that the input image has a counter-clockwise skew of 4 degrees. Applying our skew correction with OpenCV detects this 4-degree skew and corrects for it.
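The only subtle step in the pipeline is the angle bookkeeping, and it can be checked without OpenCV at all. Here is a minimal pure-Python restatement of the normalization rule applied to the cv2.minAreaRect angle (the function name is mine, not part of the tutorial's code): the returned value is the angle passed to cv2.getRotationMatrix2D, where positive means a counter-clockwise correction.

```python
def deskew_angle(rect_angle):
    """Map a cv2.minAreaRect angle in [-90, 0) to the rotation needed to
    straighten the text (positive = counter-clockwise correction)."""
    if not -90 <= rect_angle < 0:
        raise ValueError("expected an angle in [-90, 0)")
    if rect_angle < -45:
        # the rectangle is closer to vertical: add 90 degrees, then negate
        return -(90 + rect_angle)
    # otherwise simply negate the angle to flip its sign
    return -rect_angle

# text skewed ~4 degrees counter-clockwise reports a rect angle near -86,
# so the correction is a small clockwise (negative) rotation
print(deskew_angle(-86.0))  # -> -4.0
# text skewed ~24 degrees clockwise reports a rect angle near -24,
# so the correction is a counter-clockwise (positive) rotation
print(deskew_angle(-24.0))  # -> 24.0
```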

Here is another example, this time with a counter-clockwise skew of 28 degrees:

$ python correct_skew.py --image images/neg_28.png
[INFO] angle: -28.009

Figure 4: Deskewing images using OpenCV and Python.

Again, our skew correction algorithm is able to correct the input image.

This time, let’s try a clockwise skew:

$ python correct_skew.py --image images/pos_24.png
[INFO] angle: 23.974

Figure 5: Correcting for skew in text regions with computer vision.

And finally a more extreme clockwise skew of 41 degrees:

$ python correct_skew.py --image images/pos_41.png
[INFO] angle: 41.037

Figure 6: Deskewing text with OpenCV.

Regardless of skew angle, our algorithm is able to correct for skew in images using OpenCV and Python.

## Interested in learning more about computer vision and OpenCV?

If you’re interested in learning more about the fundamentals of computer vision and image processing, be sure to take a look at my book, Practical Python and OpenCV. Inside the book you’ll learn the basics of computer vision and OpenCV, working your way up to more advanced topics such as face detection, object tracking in video, and handwriting recognition, all with lots of examples, code, and detailed walkthroughs. If you’re interested in learning more (and how my book can teach you these algorithms in less than a single weekend), just click the button below:

## Summary

In today’s blog post I provided a Python implementation of Félix Abecassis’ approach to skew correction. The algorithm itself is quite straightforward, relying only on basic image processing techniques such as thresholding, computing the minimum-area rotated rectangle, and then applying an affine transformation to correct the skew.

We would commonly use this type of text skew correction in an automatic document analysis pipeline where our goal is to digitize a set of documents, correct for text skew, and then apply OCR to convert the text in the image to machine-encoded text.

I hope you enjoyed today’s tutorial! To be notified when future blog posts are published, be sure to enter your email address in the form below!

## Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!
The post Text skew correction with OpenCV and Python appeared first on PyImageSearch.

### Top Stories, Feb 13-19: 17 More Must-Know Data Science Interview Questions and Answers; Removing Outliers Using Standard Deviation in Python

• 17 More Must-Know Data Science Interview Questions and Answers
• Removing Outliers Using Standard Deviation in Python
• Natural Language Processing Key Terms, Explained
• KDnuggets Top Blogger: An Interview with Brandon Rohrer

### Blind Spot

X pointed me to this news article reporting an increase in the death rate among young adults in the United States (translated from the French):

According to a survey published on January 26 by the scientific journal The Lancet, the mortality rate of young Americans aged 25 to 35 rose between 1999 and 2014, even though this rate has fallen steadily across the richest countries as a whole for forty years. . . . It is mainly young white women who are pushing the figures upward . . . Analysis of the statistics collected from the National Center for Health Statistics shows that the mortality rate of 25-year-old white women rose at an average annual rate of 3% over the fifteen years considered, and of 2.3% for women in their thirties. For men of the same age, the annual growth in the mortality rate was 1.9%.

I ran this by Jonathan Auerbach to see what he thought. After all, it’s the Lancet, which seems to specialize in papers of high publicity and low content, so it’s not like I’m gonna believe anything in there without careful scrutiny. As part of our project, Jonathan had already run age-adjusted estimates for different ethnic groups at every decade of age. These time series should be better than what was in the paper discussed in the above news article because, in addition to age adjusting, we also estimated separate trends for each state, fitting some sort of hierarchical model in Stan.
Jonathan reported that we found a similar increase in death rates for women after adjustment, but that there are comparable increases for men after breaking down by state. Here are the estimated trends in age-adjusted death rates for non-Hispanic white women aged 25-34:

And here are the estimated trends for men:

In the graphs for the women, certain states with too few observations were removed. (It would be fine to estimate these trends from the raw data, but for simplicity we retrieved some aggregates from the CDC website, and it didn’t provide numbers for every state and every year.)

Anyway, the above graphs show what you can do with Stan. We’re not quite sure what to do with all these analyses: we don’t have stories to go with them, so it’s not clear where they could be published. But at least we can blog them in response to headlines on mortality trends.

P.S. The Westlake titles keep on coming. It’s not just that they are so catchy—after all, that’s their point—but how apt they are, each time. And the amazing thing is, I’m using them in order. Those phrases work for just about anything. I’m just looking forward to a month or so from now, when I’ve worked my way down to the comedy titles lower on the list.

The post Blind Spot appeared first on Statistical Modeling, Causal Inference, and Social Science.

### SQL Slammer Is Back. Here’s What You Need to Know

First reported back in 2002, the SQL Slammer virus caught fire in January of 2003 and spread worldwide.

The post SQL Slammer Is Back. Here’s What You Need to Know appeared first on Thomas LaRock. If you liked this post then consider subscribing to the IS [NOT] NULL newsletter: http://thomaslarock.com/is-not-null-newsletter/

### PCA – hierarchical tree – partition: Why do we need to choose for visualizing data?
(This article was first published on François Husson, and kindly contributed to R-bloggers)

Principal component methods such as PCA (principal component analysis) or MCA (multiple correspondence analysis) can be used as a pre-processing step before clustering. But principal component methods also provide a framework to visualize data, so clustering results can be represented on the map given by the principal component method.

In the figure below, the hierarchical tree is drawn in 3D on the principal component map (using the first two components obtained with PCA). A partition has then been made, and individuals are coloured according to the cluster they belong to. Thus, the graph simultaneously shows the information given by the principal component map, the hierarchical tree, and the clusters (see the function HCPC in the FactoMineR package).

library(FactoMineR)
temperature <- read.table("http://factominer.free.fr/livre/temperat.csv", header=TRUE,
    sep=";", dec=".", row.names=1)
res.pca <- PCA(temperature[1:23,], scale.unit=TRUE, ncp=Inf, graph=FALSE,
    quanti.sup=13:16, quali.sup=17)
res.hcpc <- HCPC(res.pca)

The approaches complement one another in two ways:

• firstly, a continuous view (the trend identified by the principal components) and a discontinuous view (the clusters) of the same data set are both represented in a unique framework;
• secondly, the two-dimensional map provides no information about the position of the individuals in the other dimensions; the tree and the clusters, defined from more dimensions, offer some information “outside of the map”; two individuals close together on the map can be in the same cluster (and therefore not too far from one another along the other dimensions) or in two different clusters (as they are far from one another along other dimensions).

So why do we need to choose when we want to better visualize the data?
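The same "principal components first, clustering on top" idea can be sketched outside of FactoMineR. Below is a hedged Python illustration (standard library only; the function names and data are mine, and this is not HCPC): for two-dimensional data the first principal component of the 2x2 covariance matrix has a closed form, and the scores along it can feed any clustering step.

```python
import math

def first_principal_component(points):
    """Unit vector of the first PC for 2-D data (closed form for a 2x2 covariance)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # covariance matrix entries
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]]
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

def project(points, direction):
    """Coordinate of each centered point along the given direction."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    dx, dy = direction
    return [(x - mx) * dx + (y - my) * dy for x, y in points]

# two loose groups of points along the diagonal y = x
points = [(0, 0), (1, 1.1), (0.9, 1), (5, 5.2), (5.1, 4.9), (6, 6)]
pc1 = first_principal_component(points)
scores = project(points, pc1)
# a trivial "clustering" on the first component: the sign of the PC score
clusters = [0 if s < 0 else 1 for s in scores]
print(pc1, clusters)
```

Real use would keep several components and run a hierarchical clustering on them, as HCPC does; the point here is only that the component scores and the clusters live in the same space, which is what makes the combined plot possible.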
The example shows the common use of PCA and clustering methods, but rather than PCA we can use correspondence analysis on contingency tables, or multiple correspondence analysis on categorical variables. If you want to learn more, you can watch this video, or you can enroll in this MOOC (free) and read this unpublished paper.

To leave a comment for the author, please follow the link and comment on their blog: François Husson.

### Gaussian Naive Bayes Classifier implementation in Python

Building a Gaussian Naive Bayes classifier in Python. In this post, we are going to implement the Naive Bayes classifier in Python using my favorite machine learning library, scikit-learn. Next, we are going to use the trained Naive Bayes (supervised classification) model to predict the Census Income. As we discussed the Bayes theorem in naive Bayes classifier …

The post Gaussian Naive Bayes Classifier implementation in Python appeared first on Dataaspirant.

### Four short links: 20 Feb 2017

Car Security, Civ Math, Free Mindstorms, and Chinese AI Research

1. Used Cars Still Controllable From Previous Owners' Phone -- “The car is really smart, but it’s not smart enough to know who its owner is, so it’s not smart enough to know it’s been resold,” Henderson told CNNTech. “There’s nothing on the dashboard that tells you ‘the following people have access to the car.'”
2. Mathematics of Civilization V -- a beautiful obsession. (Theoretically beautiful. The page is full of LaTeX-rendered math and graphs, and is less than beautiful.)
3. Seymour Papert's Mindstorms, Free -- a classic, available online as a PDF.
4.
China's AI Research (The Atlantic) -- Yet as the research matures in China, Ng says, it is also becoming its own distinct community. After a recent international meeting in Barcelona, he recalls seeing Chinese-language write-ups of the talks circulate right away. He never found any in English. The language issue creates a kind of asymmetry: Chinese researchers usually speak English, so they have the benefit of access to all the work disseminated in English. The English-speaking community, on the other hand, is much less likely to have access to work within the Chinese AI community.

### How to drive shareholder value with artificial intelligence

Your company is probably already doing AI and machine learning, but it needs a road map.

### More tidyverse: using dplyr functions

This week, we return to our “Getting Started With R” series. Today we are going to look at some tools from the dplyr package. Hadley Wickham, the creator of dplyr, calls it “A Grammar of Data Manipulation.”

## filter()

Use filter() for subsetting data by rows. It takes logical expressions as inputs, and returns all rows of your data for which those expressions are true. To demonstrate, let’s start by loading the tidyverse library (which includes dplyr), and we’ll also load the gapminder data.

library(tidyverse)
library(gapminder)

Here’s how filter() works:

filter(gapminder, lifeExp < 30)

Produces this output:

# A tibble: 2 × 6
      country continent  year lifeExp     pop gdpPercap
       <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
1 Afghanistan      Asia  1952  28.801 8425333  779.4453
2      Rwanda    Africa  1992  23.599 7290203  737.0686

## The pipe operator

The pipe operator is one of the great features of the tidyverse. In base R, you often find yourself calling functions nested within functions nested within… you get the idea.
The pipe operator %>% takes the object on the left-hand side, and “pipes” it into the function on the right-hand side. For example:

> gapminder %>% head()
# A tibble: 6 × 6
      country continent  year lifeExp      pop gdpPercap
       <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
1 Afghanistan      Asia  1952  28.801  8425333  779.4453
2 Afghanistan      Asia  1957  30.332  9240934  820.8530
3 Afghanistan      Asia  1962  31.997 10267083  853.1007
4 Afghanistan      Asia  1967  34.020 11537966  836.1971
5 Afghanistan      Asia  1972  36.088 13079460  739.9811
6 Afghanistan      Asia  1977  38.438 14880372  786.1134

This is the equivalent of saying “head(gapminder)”. So far, that doesn’t seem a lot easier… but wait a bit and you’ll see the beauty of the pipe.

## select()

We talked about using filter() to subset data by rows. We can use select() to do the same thing for columns:

> select(gapminder, year, lifeExp)
# A tibble: 1,704 × 2
    year lifeExp
   <int>   <dbl>
1   1952  28.801
2   1957  30.332
3   1962  31.997
4   1967  34.020
5   1972  36.088
6   1977  38.438
7   1982  39.854
8   1987  40.822
9   1992  41.674
10  1997  41.763
# ... with 1,694 more rows

Here’s the same thing, but using pipes, and sending it through head() to make the display more compact:

> gapminder %>%
+   select(year, lifeExp) %>%
+   head(4)
# A tibble: 4 × 2
   year lifeExp
  <int>   <dbl>
1  1952  28.801
2  1957  30.332
3  1962  31.997
4  1967  34.020

We are going to be making some changes to the gapminder data, so let’s start by creating a copy of the data. That way, we don’t have to worry about changing the original data.

new_gap <- gapminder

## mutate()

mutate() is a function that defines a new variable and inserts it into your tibble. For example, gapminder has GDP per capita and population; if we multiply these we get the GDP.

new_gap %>% mutate(gdp = pop * gdpPercap)

Note that the above code creates the new field and displays the resulting tibble; we would have had to use the “<-” operator to save the new field in our tibble.

## arrange()

arrange() reorders the rows in a data frame.
The gapminder data is currently arranged by country, and then by year. But what if we wanted to look at it by year, and then by country?

new_gap %>% arrange(year, country)

# A tibble: 1,704 × 6
       country continent  year lifeExp      pop  gdpPercap
        <fctr>    <fctr> <int>   <dbl>    <int>      <dbl>
1  Afghanistan      Asia  1952  28.801  8425333   779.4453
2      Albania    Europe  1952  55.230  1282697  1601.0561
3      Algeria    Africa  1952  43.077  9279525  2449.0082
4       Angola    Africa  1952  30.015  4232095  3520.6103
5    Argentina  Americas  1952  62.485 17876956  5911.3151
6    Australia   Oceania  1952  69.120  8691212 10039.5956
7      Austria    Europe  1952  66.800  6927772  6137.0765
8      Bahrain      Asia  1952  50.939   120447  9867.0848
9   Bangladesh      Asia  1952  37.484 46886859   684.2442
10     Belgium    Europe  1952  68.000  8730405  8343.1051
# ... with 1,694 more rows

## group_by() and summarize()

The group_by() function adds grouping information to your data, which then allows you to do computations by groups. The summarize() function is a natural partner for group_by(): it takes a dataset with n observations, calculates the requested summaries, and returns a dataset with one observation per group:

new_gap %>%
  group_by(continent) %>%
  summarize(n = n())

The functions you’ll apply within summarize() include classical statistical summaries, like mean(), median(), var(), sd(), mad(), IQR(), min(), and max(). Remember, they are functions that take n inputs and distill them down into one output.

new_gap %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))

# A tibble: 5 × 2
  continent avg_lifeExp
     <fctr>       <dbl>
1    Africa    48.86533
2  Americas    64.65874
3      Asia    60.06490
4    Europe    71.90369
5   Oceania    74.32621

## A wondrous example

To fully appreciate the wonders of the pipe operator and the dplyr data manipulation commands, take a look at this example. It comes from Jenny Bryan’s excellent course, STAT 545, at the University of British Columbia (to whom I owe a debt for much of the information included in this series of blog posts).
new_gap %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>%
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>%
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(-1, wt = worst_le_delta) %>%
  arrange(worst_le_delta)

Source: local data frame [5 x 3]
Groups: continent [5]

  continent     country worst_le_delta
     <fctr>      <fctr>          <dbl>
1    Africa      Rwanda        -20.421
2      Asia    Cambodia         -9.097
3  Americas El Salvador         -1.511
4    Europe  Montenegro         -1.464
5   Oceania   Australia          0.170

To quote Jenny: “Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.”

### Accessing the contents of a stanfit object

I was just needing this. Then, lo and behold, I found it on the web. It’s credited to the Stan Development Team, but I assume it was written by Ben and Jonah. Good to have this all in one place.

The post Accessing the contents of a stanfit object appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Forest of Numbers

To celebrate the ten-year anniversary of the National Art Center in Tokyo, Emmanuelle Moureaux made the Forest of Numbers.

The installation “Forest of Numbers” visualized the decade of the future from 2017 to 2026, creating a sense of stillness across the large exhibition space. More than 60,000 suspended numeral figures from 0 to 9 were regularly aligned in three-dimensional grids. A section was removed, creating a path that cut through the installation and invited visitors to wander inside the colorful forest filled with numbers.

### Is my time series additive or multiplicative?
Time series data is an important area of analysis, especially if you do a lot of web analytics. To analyse time series effectively, it helps to understand the interaction between general seasonality in activity and the underlying trend. The interactions between trend and seasonality are typically classified as either additive or multiplicative. This post looks at how we can classify a given time series as one or the other to facilitate further processing.

## Additive or multiplicative?

It’s important to understand the difference between a multiplicative time series and an additive one before we go any further. There are three components to a time series:

• trend: how things are changing overall
• seasonality: how things change within a given period, e.g. a year, month, week, or day
• error/residual/irregular: activity not explained by the trend or the seasonal value

How these three components interact determines the difference between a multiplicative and an additive time series.

In a multiplicative time series, the components multiply together to make the time series. If you have an increasing trend, the amplitude of seasonal activity increases. Everything becomes more exaggerated. This is common when you’re looking at web traffic.

In an additive time series, the components are added together to make the time series. If you have an increasing trend, you still see roughly the same size peaks and troughs. This is often seen in indexed time series where the absolute value is growing but changes stay relative.

You can have a time series that is somewhere in between the two, a statistician’s “it depends”, but I’m interested in attaining a quick classification so I won’t be handling this complication here.

## There’s a package for that

When I first started doing time series analysis, the only way to visualise how a time series splits into different components was to use base R. About the time I was feeling the pain, someone released a ggplot2 time series extension!
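Before turning to the R packages, the distinction can be stated directly in code. A small synthetic illustration in Python (the trend and seasonal numbers are invented for the example): with the same trend and the same seasonal pattern, the additive series keeps a constant seasonal swing, while the multiplicative one's swing grows with the trend.

```python
# invented components: a rising trend and a repeating 4-period season
trend = [100 + 10 * t for t in range(12)]
seasonal = [1.2, 0.8, 1.1, 0.9] * 3              # multiplicative factors around 1
seasonal_add = [20 * (s - 1) for s in seasonal]  # additive offsets around 0

additive = [t + s for t, s in zip(trend, seasonal_add)]
multiplicative = [t * s for t, s in zip(trend, seasonal)]

def swing(series, period=4):
    """Peak-to-trough size of each seasonal cycle."""
    cycles = [series[i:i + period] for i in range(0, len(series), period)]
    return [max(c) - min(c) for c in cycles]

print(swing(additive))        # constant amplitude
print(swing(multiplicative))  # amplitude grows with the trend
```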
I’ll be using ggseas where I can. We’ll use the nzbop data set from ggseas to, first of all, examine a single time series, and then process all the time series in the dataset to determine if they’re multiplicative or additive.

```r
sample_ts <- nzdata[Account == "Current account" &
                    Category == "Services; Exports total",
                    .(TimePeriod, Value)]
```

```
TimePeriod  Value
1971-06-30     55
1971-09-30     56
1971-12-31     60
1972-03-31     65
1972-06-30     65
1972-09-30     63
```

I’ll be using other packages (like data.table) and will only show relevant code snippets as I go along. You can get the whole script in a GIST.

## Decomposing the data

To be able to determine whether the time series is additive or multiplicative, it has to be split into its components. Existing functions to decompose the time series include decompose(), which allows you to pass whether the series is multiplicative or not, and stl(), which handles only additive series unless you transform the data. I could use stl() with a multiplicative series if I transform the time series by taking the log. For either function, though, I need to know whether the series is additive or multiplicative first.

### The trend

The first component to extract is the trend. There are a number of ways you can do this, and some of the simplest involve calculating a moving average or median.

```r
sample_ts[, trend := zoo::rollmean(Value, 8, fill = NA, align = "right")]
```

```
TimePeriod  Value    trend
2014-03-31   5212 4108.625
2014-06-30   3774 4121.750
2014-09-30   3698 4145.500
2014-12-31   4752 4236.375
2015-03-31   6154 4376.500
2015-06-30   4543 4478.875
```

A moving median is less sensitive to outliers than a moving mean. It doesn’t work well, though, if you have a time series that includes periods of inactivity: lots of 0s can result in very weird trends.

### The seasonality

Seasonality is the cyclical pattern that occurs in our time series once the trend has been removed. Of course, the way to de-trend the data needs to be additive or multiplicative depending on what type your time series is.
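To make the decompose()/stl() point above concrete, here is a hedged sketch of how you would hand a series to either function once its type is known. The quarterly series `vals` is made up for illustration and is not the nzbop data:

```r
# Illustrative only: a made-up, strictly positive quarterly ts object
vals <- ts(50 + 1:40 + 5 * sin(2 * pi * (1:40) / 4), frequency = 4)

# decompose() takes the type directly
dec_add  <- decompose(vals, type = "additive")
dec_mult <- decompose(vals, type = "multiplicative")

# stl() only decomposes additively, so log-transform a multiplicative
# series first: the log turns multiplication of components into addition
stl_mult <- stl(log(vals), s.window = "periodic")
```

Either way, you have to commit to a type up front, which is exactly the chicken-and-egg problem this post works around.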
Since we don’t know the type of time series at this point, we’ll do both.

```r
sample_ts[, `:=`(detrended_a = Value - trend,
                 detrended_m = Value / trend)]
```

```
TimePeriod  Value    trend detrended_a detrended_m
2014-03-31   5212 4108.625    1103.375   1.2685509
2014-06-30   3774 4121.750    -347.750   0.9156305
2014-09-30   3698 4145.500    -447.500   0.8920516
2014-12-31   4752 4236.375     515.625   1.1217137
2015-03-31   6154 4376.500    1777.500   1.4061465
2015-06-30   4543 4478.875      64.125   1.0143172
```

To work out the seasonality, we need to work out what the typical de-trended values are over a cycle. Here I will calculate the mean value for the observations in Q1, Q2, Q3, and Q4.

```r
sample_ts[, `:=`(seasonal_a = mean(detrended_a, na.rm = TRUE),
                 seasonal_m = mean(detrended_m, na.rm = TRUE)),
          by = .(quarter(TimePeriod))]
```

```
TimePeriod  Value    trend detrended_a detrended_m seasonal_a seasonal_m
2014-03-31   5212 4108.625    1103.375   1.2685509   574.1919  1.2924422
2014-06-30   3774 4121.750    -347.750   0.9156305  -111.2878  1.0036648
2014-09-30   3698 4145.500    -447.500   0.8920516  -219.8363  0.9488803
2014-12-31   4752 4236.375     515.625   1.1217137   136.7827  1.1202999
2015-03-31   6154 4376.500    1777.500   1.4061465   574.1919  1.2924422
2015-06-30   4543 4478.875      64.125   1.0143172  -111.2878  1.0036648
```

My actual needs aren’t over long economic periods, so I’m not using a better seasonality system for this blog post. There are much better mechanisms than this.

### The remainder

Now that we have our two components, we can calculate the residual in both situations and see which has the better fit.
```r
sample_ts[, `:=`(residual_a = detrended_a - seasonal_a,
                 residual_m = detrended_m / seasonal_m)]
```

```
TimePeriod  Value    trend detrended_a detrended_m seasonal_a seasonal_m residual_a residual_m
2014-03-31   5212 4108.625    1103.375   1.2685509   574.1919  1.2924422   529.1831  0.9815146
2014-06-30   3774 4121.750    -347.750   0.9156305  -111.2878  1.0036648  -236.4622  0.9122871
2014-09-30   3698 4145.500    -447.500   0.8920516  -219.8363  0.9488803  -227.6637  0.9401098
2014-12-31   4752 4236.375     515.625   1.1217137   136.7827  1.1202999   378.8423  1.0012620
2015-03-31   6154 4376.500    1777.500   1.4061465   574.1919  1.2924422  1203.3081  1.0879763
2015-06-30   4543 4478.875      64.125   1.0143172  -111.2878  1.0036648   175.4128  1.0106135
```

## Visualising decomposition

I’ve done the number crunching, but you could also perform a visual decomposition. ggseas gives us the function ggsdc(), which we can use.

```r
ggsdc(sample_ts, aes(x = TimePeriod, y = Value),
      method = "decompose", frequency = 4, s.window = 8,
      type = "additive") +
  geom_line() +
  ggtitle("Additive") +
  theme_minimal()
```

The different decompositions produce differently distributed residuals. We need to assess these to identify which decomposition is the better fit.

## Assessing fit

After decomposing our data, we need to compare the residuals. As we’re just trying to classify the time series, we don’t need to do anything particularly sophisticated. A big part of this exercise is to produce a quick function that could be used to perform an initial classification in a batch processing environment, so simpler is better.

We’re going to check how much correlation between data points is still encoded within the residuals. This is the auto-correlation function (ACF), and R has a function, acf(), for calculating it. As some of the correlations could be negative, we will select the type with the smallest sum of squares of correlation values.

```r
ssacf <- function(x) sum(acf(x, na.action = na.omit, plot = FALSE)$acf ^ 2)
```
```r
sample_ts[, .(ts_type = compare_ssacf(residual_a, residual_m))]
```

```
ts_type
Multiplicative
```
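Before putting everything together, here is a standalone, hedged sketch of why the sum-of-squared-ACF comparison works, using simulated residuals rather than the nzbop series (the data and seed are mine, purely for illustration):

```r
# The residual series with less leftover autocorrelation indicates
# the better-fitting decomposition.
ssacf <- function(x) sum(acf(x, na.action = na.omit, plot = FALSE)$acf ^ 2)
compare_ssacf <- function(add, mult) {
  ifelse(ssacf(add) < ssacf(mult), "Additive", "Multiplicative")
}

set.seed(1)
good_residual <- rnorm(100)                                  # white noise: no structure left
bad_residual  <- as.numeric(arima.sim(list(ar = 0.8), 100))  # strongly autocorrelated

compare_ssacf(good_residual, bad_residual)  # "Additive": the white noise wins
```

The white noise has essentially no autocorrelation beyond lag 0, so its sum of squared ACF values is small; the AR(1) residual carries structure at many lags and scores far higher.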
## Putting it all together

This isn’t a fully generalized function (as it doesn’t have configurable lags, medians, seasonality, etc.), but if I had to run this exercise over multiple time series from this dataset, my overall function and usage would look like:

```r
ssacf <- function(x) sum(acf(x, na.action = na.omit, plot = FALSE)$acf ^ 2)

compare_ssacf <- function(add, mult) {
  ifelse(ssacf(add) < ssacf(mult), "Additive", "Multiplicative")
}

additive_or_multiplicative <- function(dt) {
  m <- copy(dt)
  m[, trend := zoo::rollmean(Value, 8, fill = "extend", align = "right")]
  m[, `:=`(detrended_a = Value - trend,
           detrended_m = Value / trend)]
  m[Value == 0, detrended_m := 0]
  m[, `:=`(seasonal_a = mean(detrended_a, na.rm = TRUE),
           seasonal_m = mean(detrended_m, na.rm = TRUE)),
    by = .(quarter(TimePeriod))]
  m[is.infinite(seasonal_m), seasonal_m := 1]
  m[, `:=`(residual_a = detrended_a - seasonal_a,
           residual_m = detrended_m / seasonal_m)]
  compare_ssacf(m$residual_a, m$residual_m)
}

# Applying it to all time series in the table
sample_ts <- nzdata[, .(Type = additive_or_multiplicative(.SD)),
                    .(Account, Category)]
```

```
Account           Category                                        Type
Current account   Inflow total                                    Additive
Current account   Balance                                         Multiplicative
Current account   Goods; Exports (fob) total                      Additive
Current account   Services; Exports total                         Multiplicative
Current account   Primary income; Inflow total                    Multiplicative
Current account   Secondary income; Inflow total                  Multiplicative
Current account   Goods balance                                   Multiplicative
Current account   Services balance                                Multiplicative
Current account   Primary income balance                          Additive
Current account   Outflow total                                   Additive
Current account   Goods; Imports (fob) total                      Additive
Current account   Services; Imports total                         Additive
Current account   Primary income; Outflow total                   Additive
Current account   Secondary income; Outflow total                 Multiplicative
Capital account   Inflow total                                    Multiplicative
Capital account   Outflow total                                   Multiplicative
NA                Net errors and omissions                        Multiplicative
Financial account Foreign inv. in NZ total                        Multiplicative
Financial account Balance                                         Additive
Financial account Foreign inv. in NZ; Direct inv. liabilities     Additive
Financial account Foreign inv. in NZ; Portfolio inv. liabilities  Multiplicative
Financial account Foreign inv. in NZ; Other inv. liabilities      Multiplicative
Financial account NZ inv. abroad total                            Multiplicative
Financial account NZ inv. abroad; Direct inv. assets              Multiplicative
Financial account NZ inv. abroad; Portfolio inv. assets           Additive
Financial account NZ inv. abroad; Financial derivative assets     Multiplicative
Financial account NZ inv. abroad; Other inv. assets               Additive
Financial account NZ inv. abroad; Reserve assets                  Multiplicative
```

## Conclusion

This is a very simple way of quickly assessing whether multiple time series are additive or multiplicative. It gives an effective starting point for conditionally processing batches of time series. Get the GIST of the code used throughout this blog to work through it yourself. If you’ve got an easier way of classifying time series, let me know in the comments!

The post Is my time series additive or multiplicative? appeared first on Locke Data.

Continue Reading…

### Thesis: Sparse Grids for Big Data: Exploiting Parsimony for Large-Scale Learning by Valeriy Khakhutskyy

Congratulations Dr. Khakhutskyy!

Sparse Grids for Big Data: Exploiting Parsimony for Large-Scale Learning by Valeriy Khakhutskyy

High-dimensional data analysis becomes ubiquitous in both science and industry. An important tool for data analysis is supervised learning with non-parametric models, which estimates the dependency between target and input variables without imposing explicit assumptions on the data. This generality, however, comes at a price of computational costs that grow exponentially with the dimensionality of the input. In general, nonparametric models cannot evade this curse of dimensionality unless the problem exhibits certain properties. Hence, to facilitate large-scale supervised learning, this thesis focuses on two such properties: the existence of a low-dimensional manifold in the data and the discounting importance of high-order interactions between input variables. Often a problem would exhibit both these properties to a certain degree.
To identify and exploit these properties, this work extends the notion of parsimony for hierarchical sparse grid models. It develops learning algorithms that simultaneously optimise the model parameters and the model structure to befit the problem at hand. The new algorithms for adaptive sparse grids increase the range of computationally feasible supervised learning problems. They decrease the computation costs for training sparse grid models and the memory footprint of the resulting models. Hence, the algorithms can be used for classification and regression on high-dimensional data. Furthermore, they improve the interpretability of the sparse grid model and are suitable for learning the structure of the underlying data distribution. Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there ! Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin. Continue Reading… ### Book Memo: “Knowledge Graph and Semantic Computing: Semantic, Knowledge, and Linked Big Data”  First China Conference, CCKS 2016, Beijing, China, September 19-22, 2016, Revised Selected Papers This book constitutes the refereed proceedings of the first China Conference on Knowledge Graph and Semantic Computing, CCKS, held in Beijing, China, in September 2016. The 19 revised full papers presented together with 6 shared tasks were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on knowledge representation and learning; knowledge graph construction and information extraction; linked data and knowledge-based systems; shared tasks. 
Continue Reading…

### R Packages worth a look

**Rcpp Hidden Markov Model** (RcppHMM)
Collection of functions to evaluate sequences, decode hidden states and estimate parameters from a single or multiple sequences of a discrete time Hidden Markov Model. The observed values can be modeled by a multinomial distribution for categorical emissions, a mixture of Gaussians for continuous data and also a mixture of Poissons for discrete values. It includes functions for random initialization, simulation, backward or forward sequence evaluation, Viterbi or forward-backward decoding and parameter estimation using an Expectation-Maximization approach.

**Dynamic OpenAPI/Swagger Client** (rapiclient)
Access services specified in OpenAPI (formerly Swagger) format. It is not a code generator. The client is generated dynamically as a list of R functions.

**Chunk Processing or Split-Apply-Combine on Delimited Files (CSV etc.)** (fileplyr)
Perform chunk processing or split-apply-combine on data in a delimited file (example: CSV) across multiple cores of a single machine with low memory footprint. These functions are a convenient wrapper over the versatile package ‘datadr’.

**Fixed Rank Kriging** (FRK)
Fixed Rank Kriging is a tool for spatial/spatio-temporal modelling and prediction with large datasets. The approach, discussed in Cressie and Johannesson (2008), decomposes the field, and hence the covariance function, using a fixed set of n basis functions, where n is typically much smaller than the number of data points (or polygons) m. The method naturally allows for non-stationary, anisotropic covariance functions and the use of observations with varying support (with known error variance). The projected field is a key building block of the Spatial Random Effects (SRE) model, on which this package is based. The package FRK provides helper functions to model, fit, and predict using an SRE with relative ease. Reference: Cressie, N. and Johannesson, G. (2008) <DOI:10.1111/j.1467-9868.2007.00633.x>.
**Functions for Solving Multiple-criteria Decision-making Problems** (OutrankingTools)
Functions to process ‘outranking’ ELECTRE methods existing in the literature. See, e.g., http://…/ELECTRE about the outranking approach and the foundations of ELECTRE methods.

**Estimates and Bounds for the False Discovery Proportion, by Permutation** (confSAM)
For multiple testing. Computes estimates and confidence bounds for the False Discovery Proportion (FDP), the fraction of false positives among all rejected hypotheses. The methods in the package use permutations of the data. Doing so, they take into account the dependence structure in the data.

**OPUS Miner Algorithm for Filtered Top-k Association Discovery** (opusminer)
Provides a simple R interface to the OPUS Miner algorithm (implemented in C++) for finding the top-k productive, non-redundant itemsets from transaction data. The OPUS Miner algorithm uses the OPUS search algorithm to efficiently discover the key associations in transaction data, in the form of self-sufficient itemsets, using either leverage or lift. See <http://…/> for more information in relation to the OPUS Miner algorithm.

**Co-Clustering with Document-Term Matrix** (Rcoclust)
Several co-clustering algorithms are implemented for sparse binary and contingency matrices. They aim at a simultaneous clustering of the rows and the columns via an objective function.

Continue Reading…

### future: Reproducible RNGs, future_lapply() and more

(This article was first published on jottR, and kindly contributed to R-bloggers)

future 1.3.0 is available on CRAN. With futures, it is easy to write R code once, which the user can choose to evaluate in parallel using whatever resources s/he has available, e.g. a local machine, a set of local machines, a set of remote machines, a high-end compute cluster (via future.BatchJobs and soon also future.batchtools), or in the cloud (e.g. via googleComputeEngineR). Futures make it easy to harness any resources at hand.
Thanks to great feedback from the community, this new version provides:

- **A convenient lapply() function**
  - Added future_lapply(), which works like lapply() and gives identical results, with the difference that futures are used internally. Depending on the user’s choice of plan(), these calculations may be processed sequentially, in parallel, or distributed on multiple machines.
  - Perfectly reproducible random number generation (RNG) is guaranteed given the same initial seed, regardless of the type of futures used and choice of load balancing. Argument future.seed = TRUE (default) will use a random initial seed, which may also be specified as future.seed = <integer>. L’Ecuyer-CMRG RNG streams are used internally.
  - Load balancing can be controlled by argument future.scheduling, which is a scalar adjusting how many futures each worker should process.
- **Clarifies the distinction between developer and end user**
  - The end user controls what future strategy to use by default, e.g. plan(multiprocess) or plan(cluster, workers = c("machine1", "machine2", "remote.server.org")).
  - The developer controls whether futures should be resolved eagerly (default) or lazily, e.g. f <- future(..., lazy = TRUE). Because of this, plan(lazy) is now deprecated.
- **Is even more friendly to multi-tenant compute environments**
  - availableCores() returns the number of cores available to the current R process. On a regular machine, this typically corresponds to the number of cores on the machine (parallel::detectCores()). If option mc.cores or environment variable MC_CORES is set, then that will be returned. However, on compute clusters using schedulers such as SGE, Slurm, and TORQUE / PBS, the function detects the number of cores allotted to the job by the scheduler and returns that instead. This way developers don’t have to adjust their code to match a certain compute environment; the default works everywhere.
  - With the new version, it is possible to override the fallback value used when nothing else is specified, so that it is not the number of cores on the machine but instead the value of option future.availableCores.fallback or environment variable R_FUTURE_AVAILABLE_FALLBACK. For instance, by setting R_FUTURE_AVAILABLE_FALLBACK=1 system-wide in HPC environments, any user running outside of the scheduler will automatically use single-core processing unless explicitly requesting more cores. This lowers the risk of overloading the CPU by mistake.
  - Analogously to how availableCores() returns the number of cores, the new function availableWorkers() returns the host names available to the R process. The default is rep("localhost", times = availableCores()), but when using HPC schedulers it may be the host names of other compute nodes allocated to the job.

For full details on updates, please see the NEWS file. The future package installs out-of-the-box on all operating systems.

## A quick example

The bootstrap example of help("clusterApply", package = "parallel") adapted to make use of futures.

```r
library("future")
library("boot")

run <- function(...) {
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 5000, sim = "parametric",
       ran.gen = cd4.rg, mle = cd4.mle)
}

# base::lapply()
system.time(boot <- lapply(1:100, FUN = run))
###    user  system elapsed
### 133.637   0.000 133.744

# Sequentially on the local machine
plan(sequential)
system.time(boot0 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
###    user  system elapsed
### 134.916   0.003 135.039

# In parallel on the local machine (with 8 cores)
plan(multisession)
system.time(boot1 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
###    user  system elapsed
###   0.960   0.041  29.527

stopifnot(all.equal(boot1, boot0))
```

## What’s next?

The future.BatchJobs package, which builds on top of BatchJobs, provides future strategies for various HPC schedulers, e.g.
SGE, Slurm, and TORQUE / PBS. For example, by using plan(batchjobs_torque) instead of plan(multiprocess), your futures will be resolved distributed on a compute cluster instead of in parallel on your local machine. That’s it! However, since last year, the BatchJobs package has been decommissioned and the authors recommend everyone use their new batchtools package instead. Just like BatchJobs, it is a very well written package, but at the same time it is more robust against cluster problems and it also supports more types of HPC schedulers. Because of this, I’ve been working on future.batchtools, which I hope to be able to release soon.

Finally, I’m really keen on looking into how futures can be used with Shaun Jackman’s lambdar, which is a proof-of-concept that allows you to execute R code on Amazon’s “serverless” AWS Lambda framework. My hope is that, in a not too far future (pun not intended*), we’ll be able to resolve our futures on AWS Lambda using plan(aws_lambda).

Happy futuring!

(*) Alright, I admit, it was intended.

To leave a comment for the author, please follow the link and comment on their blog: jottR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

### If you did not already know: “Process Mining”

Process mining is a process management technique that allows for the analysis of business processes based on event logs. The basic idea is to extract knowledge from event logs recorded by an information system. Process mining aims at improving this by providing techniques and tools for discovering process, control, data, organizational, and social structures from event logs.
… Process Mining

Continue Reading…

### Magister Dixit

“The best way to predict the future is to invent it.” Alan Kay (1971)

Continue Reading…

### My Podcasting Setup

I’ve gotten a number of inquiries over the last 2 years about my podcasting setup and I’ve been meaning to write about it but…. But here it is! I actually wanted to write this because I felt like there wasn’t a ton of good information about this on the Internet that wasn’t aimed at people who want to do it professionally; this is for more “casual” podcasters. So here’s what I’ve got.

There are roughly two types of podcasts: the kind you record with everyone in the same room and the kind you record where everyone is in different rooms. They both require slightly different setups, so I’ll talk about both. For me, Elizabeth Matsui and I record The Effort Report locally because we’re both at Johns Hopkins. But Hilary Parker and I record Not So Standard Deviations remotely because she’s on the other side of the country most of the time.

## Recording Equipment

When Hilary and I first started, we just used the microphone attached to the headphones you get with your iPhone or whatever. That’s okay, but the sound feels very “narrow” to me. That said, it’s a good way to get started and it likely costs you nothing.

The next level up for many people is the Blue Yeti USB Microphone, which is a perfectly fine microphone and not too expensive. Also, it uses USB (as opposed to the more professional XLR), so it connects to any computer, which is nice. However, it typically retails for $120, which isn’t nothing, and there are probably cheaper microphones that are just as good. For example, Jason Snell recommends the Audio Technica ATR2100, which is only about $70.

If you’re willing to shell out a little more money, I’d highly recommend the Zoom H4n portable recorder. This is actually two things: a microphone and a recorder.
It has a nice stereo microphone built into the top along with two XLR inputs on the bottom that allow you to record from external mics. It records to SD cards, so it’s great for a portable setup where you don’t want to carry a computer around with you. It retails for about $200, so it’s not cheap, but in my opinion it is worth every penny. I’ve been using my H4n for years now.

Because we do a lot of recording for our online courses here, we’ve actually got a bit more equipment in the office. So for in-person podcasts I sometimes record using a Sennheiser MKH416-P48US attached to an Auray MS-5230T microphone stand, which is decidedly not cheap but is a great piece of hardware.

By the way, a microphone stand is great to have, if you can get one, so you don’t have to set the microphone on your desk or table. That way if you bump the table by accident or generally like to bang the table, it won’t get picked up on the microphone. It’s not something to get right away, but maybe later when you make the big time.

## Recording Software

If you’re recording by yourself, you can just hook up your microphone to your computer and record to any old software that records sound (on the Mac you can use QuickTime). If you have multiple people, you can either

1. Speak into the same mic and have both your voices recorded on the same audio file
2. Use separate mics (and separate computers) and record separately onto separate audio files. This requires syncing the audio files in an editor, but that’s not too big a deal if you only have 2-3 people.

For local podcasts, I actually just use the H4n and record directly to the SD card. This creates separate WAV files for each microphone that are already synced so you can just plop them in the editor.

For remote podcasts, you’ll need some communication software. Hilary and I use Zencastr which has its own VoIP system that allows you to talk to anyone by just sending your guests a link. So I create a session in Zencastr, send Hilary the link for the session, she logs in (without needing any credentials) and we just start talking. The web site records the audio directly off of your microphone and then uploads the audio files (one for each guest) to Dropbox. The service is really nice and there are now a few just like it. Zencastr costs $20 a month right now but there is a limited free tier. The other approach is to use something like Skype and then use something like ecamm call-recorder to record the conversation. The downside with this approach is that if you have any network trouble that messes up the audio, then you will also record that. However, Zencastr (and related services) do not work on iOS devices and other devices that use WebKit based browsers. So if you have someone calling in on a mobile device via Skype or something, then you’ll have to use this approach. Otherwise, I prefer the Zencastr approach and can’t really see any downside except for the cost. ## Editing Software There isn’t a lot of software that’s specifically designed for editing podcasts. I actually started off editing podcasts in Final Cut Pro X (nonlinear video editor) because that’s what I was familiar with. But now I use Logic Pro X, which is not really designed for podcasts, but it’s a real digital audio workstation and has nice features (like strip silence). But I think something like Audacity would be fine for basic editing. The main thing I need to do with editing is merge the different audio tracks together and cut off any extraneous material at the beginning or the end. I don’t usually do a lot of editing in the middle unless there’s a major mishap like a siren goes by or a cat jumps on the computer. Once the editing is done I bounce to an AAC or MP3 file for uploading. 
## Hosting

You’ll need a service for hosting your audio files if you don’t have your own server. You can technically host your audio files anywhere, but specific services have niceties like auto-generating the RSS feed. For Not So Standard Deviations I use SoundCloud and for The Effort Report I use Libsyn. Of the two services, I think I prefer Libsyn, because it’s specifically designed for podcasting and has somewhat better analytics. The web site feels a little bit like it was designed in 2003, but otherwise it works great. Libsyn also has features for things like advertising and subscriptions, but I don’t use any of those. SoundCloud is fine but wasn’t really designed for podcasting and sometimes feels a little unnatural.

## Summary

If you’re interested in getting started in podcasting, here’s my bottom line:

1. Get a partner. It’s more fun that way!
2. If you and your partner are remote, use Zencastr or something similar.
3. Splurge for the Zoom H4n if you can; otherwise get a reasonably cheap microphone like the Audio Technica or the Yeti.
4. Don’t focus too much on editing. Just clip off the beginning and the end.
5. Host on Libsyn.

Continue Reading…

## February 19, 2017

### ComSciCon: Science Communication Workshop for Graduate Students

Nathan Sanders writes:

A few years ago, you were kind enough to post a notice on your blog about our science communication conference by grad students, for grad students (ComSciCon). I thought I’d let you know that our program has been going strong and growing, and we’re excited to have opened our application for the 2017 workshop (open through March 1). Below is a blurb. For both ComSciCon and our sister program of ‘bites’ sites, I [Sanders] also remain really interested in finding grad students or other statisticians who share an interest in communication to diverse audiences. The bites sites are daily readers’ digests of the literature in different fields, authored by graduate students and aimed at undergrads (e.g.
see Astrobites, http://astrobites.com). We have a variety of initiatives to help this model spread to other fields and to bring these resources into classrooms.

Here’s the blurb:

Applications are now open for ComSciCon 2017, the 5th annual Communicating Science workshop, to be held in Cambridge, MA on June 8-10th 2017. Graduate students at U.S. institutions in all fields of science, technology, engineering, health, mathematics, and related fields are encouraged to apply. The application will close on March 1st. Acceptance to the workshop is competitive; attendance is free and travel support and lodging will be provided to accepted applicants. Participants will build the communication skills that scientists and other technical professionals need to express complex ideas to the general public, experts in other fields, and their peers. In addition to panel discussions (on topics such as Media and Journalism, Science Advocacy, and Diversity in Science Communication), ample time is allotted for networking with science communication experts and developing science outreach collaborations with fellow graduate students. You can visit our website to submit an application or learn more about our workshop programs and participants. You can also follow us on Twitter (@comscicon) and use #comscicon17! If you have any questions or concerns, please contact us at comscicon17@comscicon.org.

I have no connection to this conference but I think it’s a good general idea, so I hope it goes well.

Continue Reading…

### SatRday and visual inference of vine copulas

(This article was first published on Pareto's Playground, and kindly contributed to R-bloggers)

# SatRday

From the 16th to the 18th of February, satRday was held in the City of Cape Town in South Africa. The programme kicked off with two days of workshops and then the conference on Saturday.
The workshops were divided into three large sections:

**R and Git.** Integrating version control through git and RStudio has never been this easy. If you are not already following the principle of Pull, Commit and Push in your workflow, I recommend you visit the website http://happygitwithr.com/, which will help you correct your current workflow indiscretions! If you are, however, an avid user of GitHub and regularly push your commits with proper messages, here is a gem of a website that made my day: starlogs.

**Shiny.** The Shiny workshop was an amazing exhibition of what can be done in an afternoon with the power of Shiny and RStudio! We got to build an interactive dashboard investigating disease data from South Africa using the new flexdashboard.

**Training models.** Once again I realised how powerful the caret package can be for training and testing a wide range of models in a modular way without too much hassle or fuss. I am really keen to start playing with the h2o package after a brief overview of its ensemble capabilities was shown.

## The Conference

The Saturday saw a multitude of speakers take the stage to discuss some of the most interesting topics and applications of R that I have seen in a very long time. The full programme can be viewed on the website. I also had the opportunity to talk about how we can combine Shiny and the visNetwork htmlwidget as a way to conduct analysis of high-dimensional vine copulas. R, as open source software with a large following, provides quick and easy access to complex models and methods that are used and applied widely within academics but, more importantly, in practice. This increase in complexity, both theoretical and mathematical, has resulted in an obligation on practitioners to break down walls of jargon and mathematical hyperbole into easily digestible actions and insight. My attempt at addressing this philosophical idea was to create a small package called VineVizR.
The package can be viewed and downloaded here. I used this package and its functions to build a Shiny app where one can explore the interlinkages between 23 of South Africa’s largest companies by market cap using vine copulas. Vine copulas allow us to model complex linkages among assets, but currently the visual illustration of vine copulae from the VineCopula package offers a bland plotting output that is hard to interpret. Combining the visNetwork html widget with the VineCopula RVM output offers an interactive and visually appealing way to understand and interpret your output. The app can be found on my ShinyApp page. I hope you enjoy this small app and the metadata that has been integrated into it. If you feel that you still want to play a bit more, go download the package and play with the VizVineR function, where you can create your own groups! But be aware, I don’t see myself as a developer – so when you peek under the hood and see any improvements that can be made, let me know.

### View of the dashboard

To leave a comment for the author, please follow the link and comment on their blog: Pareto's Playground. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... Continue Reading…

### Document worth reading: “Deep Reinforcement Learning: An Overview”

We give an overview of recent exciting achievements of deep reinforcement learning (RL). We start with background on deep learning and reinforcement learning, as well as an introduction to testbeds. Next we discuss Deep Q-Network (DQN) and its extensions, asynchronous methods, policy optimization, reward, and planning. After that, we talk about attention and memory, unsupervised learning, and learning to learn.
Then we discuss various applications of RL, including games, in particular AlphaGo, robotics, spoken dialogue systems (a.k.a. chatbots), machine translation, text sequence prediction, neural architecture design, personalized web services, healthcare, finance, and music generation. We mention topics/papers not reviewed yet. After listing a collection of RL resources, we close with discussions. Deep Reinforcement Learning: An Overview Continue Reading…

### Building Shiny App exercises part 7

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Connect widgets & plots

In the seventh part of our journey we are ready to connect more of the widgets we created before with our k-means plot in order to totally control its output. Of course we will also reform the plot itself properly in order to make it a real k-means plot. Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin! Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. First of all, let’s move the widgets we are going to use from the sidebarPanel into the mainPanel, specifically under our plot.

Exercise 1

Remove the textInput from your server.R file. Then place the checkboxGroupInput and the selectInput in the same row as the sliderInput. Name them “Variable X” and “Variable Y” respectively. HINT: Use fluidRow and column.

Create a reactive expression

Reactive expressions are expressions that can read reactive values and call other reactive expressions. Whenever a reactive value changes, any reactive expressions that depended on it are marked as “invalidated” and will automatically re-execute if necessary.
If a reactive expression is marked as invalidated, any other reactive expressions that recently called it are also marked as invalidated. In this way, invalidations ripple through the expressions that depend on each other. A reactive expression is created like this:

example <- reactive({ })

Exercise 2

Place a reactive expression in server.R, at any spot except inside output$All, and name it “Data”. HINT: Use reactive

Now let’s connect your selectInput with the variables of your dataset as in the example below.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),
  sidebarLayout(
    sidebarPanel(h2("Menu"),
      selectInput('ycol', 'Y Variable', names(iris))
    ),
    mainPanel(h1("Main"))
  )
))
#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$ycol)]
  })
})

Exercise 3

Put the variables of the iris dataset as inputs in your selectInput as “Variable Y”. HINT: Use names.

Exercise 4

Do the same for checkboxGroupInput and “Variable X”. HINT: Use names. Select the fourth variable as default, like the example below.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),
  sidebarLayout(
    sidebarPanel(h2("Menu"),
      checkboxGroupInput("xcol", "Variable X", names(iris), selected = names(iris)[[4]]),
      selectInput("ycol", "Y Variable", names(iris), selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))

#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
})

Exercise 5

Make the second variable the default choice for both widgets. HINT: Use selected.

Now follow the example below to create a new reactive function and place there the automated function for k-means calculation.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),
  sidebarLayout(
    sidebarPanel(h2("Menu"),
      checkboxGroupInput("xcol", "Variable X", names(iris), selected = names(iris)[[4]]),
      selectInput("ycol", "Y Variable", names(iris), selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))

#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
  example2 <- reactive({
    kmeans(example())
  })
})

Exercise 6

Create the reactive function Clusters and put in there the function kmeans, which will be applied to the function Data. HINT: Use reactive.

Connect your plot with the widgets

It is time to connect your plot with the widgets.

Exercise 7

Put Data inside renderPlot as the first argument, replacing the data that you have chosen to be plotted until now. Moreover, delete xlab and ylab.

Improve your k-means visualization
You can change the colours of your clusters automatically by copying and pasting this part of code as the first argument of renderPlot, before the plot function:

palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
          "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

We will choose to have up to nine clusters, so we choose nine colours.

Exercise 8

Set the min of your sliderInput to 1, max to 9 and value to 4, and use the palette function to give colours. This is how you can give different colors to your clusters. To activate these colors, put this part of code into your plot function:

col = Clusters()$cluster,

Exercise 9

Activate the palette function.

To make your clusters easier to spot, you can fully colour their points by adding this to your plot function:

pch = 20, cex = 3

Exercise 10

Fully color the points of your plot.
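To see how the pieces from the exercises fit together, here is a rough single-file sketch (not the solution set’s own code): it names the reactives Data and Clusters as in the exercises, but the widget ids "clusters" and "All" are assumptions, since those ids are not shown in the excerpts above.

```r
# Sketch only: combines the widgets, reactives and plot from Exercises 1-10.
# Assumptions: sliderInput id "clusters", plot output id "All".
library(shiny)

ui <- fluidPage(
  titlePanel("Shiny App"),
  sidebarLayout(
    sidebarPanel(h2("Menu"),
      checkboxGroupInput("xcol", "Variable X", names(iris)[1:4],
                         selected = names(iris)[[2]]),
      selectInput("ycol", "Y Variable", names(iris)[1:4],
                  selected = names(iris)[[2]]),
      sliderInput("clusters", "Cluster count", min = 1, max = 9, value = 4)
    ),
    mainPanel(h1("Main"), plotOutput("All"))
  )
)

server <- function(input, output) {
  # Reactive data: the two selected columns of iris (Exercise 2)
  Data <- reactive({
    iris[, c(input$xcol[1], input$ycol)]
  })
  # Reactive clustering: k-means on the reactive data (Exercise 6)
  Clusters <- reactive({
    kmeans(Data(), input$clusters)
  })
  output$All <- renderPlot({
    # Nine colours for up to nine clusters (Exercise 8)
    palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
              "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))
    plot(Data(), col = Clusters()$cluster, pch = 20, cex = 3)
  })
}

shinyApp(ui, server)
```

The exercises use the two-file ui.R/server.R layout; the single-file form above is equivalent and just easier to paste and run.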


### Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

factoextra is an R package that makes it easy to extract and visualize the output of exploratory multivariate data analyses, including:

1. Principal Component Analysis (PCA), which is used to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

2. Correspondence Analysis (CA), which is an extension of the principal component analysis suited to analyse a large contingency table formed by two qualitative variables (or categorical data).

3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.

4. Multiple Factor Analysis (MFA) dedicated to datasets where variables are organized into groups (qualitative and/or quantitative variables).

5. Hierarchical Multiple Factor Analysis (HMFA): An extension of MFA in a situation where the data are organized into a hierarchical structure.

6. Factor Analysis of Mixed Data (FAMD), a particular case of the MFA, dedicated to analyze a data set containing both quantitative and qualitative variables.

There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.

However, the result is presented differently depending on the package used. To help in the interpretation and in the visualization of multivariate analyses – such as cluster analysis and dimensionality reduction – we developed an easy-to-use R package named factoextra.

• The R package factoextra has flexible and easy-to-use methods to quickly extract, in a human-readable standard data format, the analysis results from the different packages mentioned above.

• It produces a ggplot2-based elegant data visualization with less typing.

• It also contains many functions facilitating clustering analysis and visualization.

We’ll use i) the FactoMineR package (Sebastien Le, et al., 2008) to compute PCA, (M)CA, FAMD, MFA and HCPC; ii) and the factoextra package for extracting and visualizing the results.

FactoMineR is a great package, and my favorite, for computing principal component methods in R. It’s very easy to use and very well documented. The official website is available at: http://factominer.free.fr/. Thanks to François Husson for his impressive work.

The figure below shows the methods whose outputs can be visualized using the factoextra package. The official online documentation is available at: http://www.sthda.com/english/rpkgs/factoextra.

## Why use factoextra?

1. The factoextra R package can handle the results of PCA, CA, MCA, MFA, FAMD and HMFA from several packages, for extracting and visualizing the most important information contained in your data.

2. After PCA, CA, MCA, MFA, FAMD and HMFA, the most important row/column elements can be highlighted using:
• their cos2 values, corresponding to their quality of representation on the factor map
• their contributions to the definition of the principal dimensions.

If you want to do this, the factoextra package provides a convenient solution.

3. PCA and (M)CA are sometimes used for prediction problems: one can predict the coordinates of new supplementary variables (quantitative and qualitative) and supplementary individuals using the information provided by a previously performed PCA or (M)CA. This can be done easily using the FactoMineR package.

If you want to make predictions with PCA/MCA and to visualize the position of the supplementary variables/individuals on the factor map using ggplot2, then factoextra can help you: it’s quick, you write less and do more…

4. Several functions from different packages – FactoMineR, ade4, ExPosition, stats – are available in R for performing PCA, CA or MCA. However, the components of the output vary from package to package.

No matter which package you decide to use, factoextra can give you a human-understandable output.
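As a small sketch of this package-agnostic behaviour, the same extractor can be applied to PCA results produced by different packages (here stats::prcomp; the FactoMineR call is shown commented out):

```r
# Sketch: factoextra's get_eig() accepts PCA objects from different packages
# and returns the eigenvalue table in the same standardized format.
library(factoextra)

res.prcomp <- prcomp(USArrests, scale. = TRUE)   # PCA via the stats package
get_eig(res.prcomp)  # columns: eigenvalue, variance.percent, cumulative...

# res.fmr <- FactoMineR::PCA(USArrests, scale.unit = TRUE, graph = FALSE)
# get_eig(res.fmr)   # same table format, despite the different input object
```

The point is that downstream code (screeplots, variable plots) does not need to know which package ran the analysis.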

## Installing FactoMineR

The FactoMineR package can be installed and loaded as follows:

# Install
install.packages("FactoMineR")

library("FactoMineR")

• factoextra can be installed from CRAN as follows:

install.packages("factoextra")

• Or, install the latest development version from GitHub:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
library("factoextra")

## Main functions in the factoextra package

See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.

### Visualizing dimension reduction analysis outputs

Functions and descriptions:

• fviz_eig (or fviz_eigenvalue): Extract and visualize the eigenvalues/variances of dimensions.
• fviz_pca: Graph of individuals/variables from the output of Principal Component Analysis (PCA).
• fviz_ca: Graph of column/row variables from the output of Correspondence Analysis (CA).
• fviz_mca: Graph of individuals/variables from the output of Multiple Correspondence Analysis (MCA).
• fviz_mfa: Graph of individuals/variables from the output of Multiple Factor Analysis (MFA).
• fviz_famd: Graph of individuals/variables from the output of Factor Analysis of Mixed Data (FAMD).
• fviz_hmfa: Graph of individuals/variables from the output of Hierarchical Multiple Factor Analysis (HMFA).
• fviz_ellipses: Draw confidence ellipses around the categories.
• fviz_cos2: Visualize the quality of representation of the row/column variables from the results of the PCA, CA, MCA functions.
• fviz_contrib: Visualize the contributions of row/column elements from the results of the PCA, CA, MCA functions.

### Extracting data from dimension reduction analysis outputs

Functions and descriptions:

• get_eigenvalue: Extract the eigenvalues/variances of dimensions.
• get_pca: Extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from Principal Component Analysis (PCA) outputs.
• get_ca: Extract all the results (coordinates, squared cosine, contributions) for the active column/row variables from Correspondence Analysis outputs.
• get_mca: Extract results from Multiple Correspondence Analysis outputs.
• get_mfa: Extract results from Multiple Factor Analysis outputs.
• get_famd: Extract results from Factor Analysis of Mixed Data outputs.
• get_hmfa: Extract results from Hierarchical Multiple Factor Analysis outputs.
• facto_summarize: Subset and summarize the output of factor analyses.

### Clustering analysis and visualization

Functions and descriptions:

• dist (fviz_dist, get_dist): Enhanced distance matrix computation and visualization.
• get_clust_tendency: Assess clustering tendency.
• fviz_nbclust (fviz_gap_stat): Determine and visualize the optimal number of clusters.
• fviz_dend: Enhanced visualization of dendrograms.
• fviz_cluster: Visualize clustering results.
• fviz_mclust: Visualize model-based clustering results.
• fviz_silhouette: Visualize silhouette information from clustering.
• hcut: Compute hierarchical clustering and cut the tree.
• hkmeans (hkmeans_tree, print.hkmeans): Hierarchical k-means clustering.
• eclust: Visual enhancement of clustering analysis.

## Dimension reduction and factoextra

As depicted in the figure below, the type of analysis to be performed depends on the data set formats and structures.

In this section we start by illustrating classical methods – such as PCA, CA and MCA – for analyzing a data set containing continuous variables, contingency table and qualitative variables, respectively.

We continue by discussing advanced methods – such as FAMD, MFA and HMFA – for analyzing a data set containing a mix of variables (qualitatives & quantitatives) organized or not into groups.

Finally, we show how to perform hierarchical clustering on principal components (HCPC), which is useful for clustering a data set containing only qualitative variables, or mixed data of qualitative and quantitative variables.
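As a sketch of that HCPC workflow (HCPC comes from FactoMineR; factoextra’s fviz_dend and fviz_cluster accept its output). The iris data and nb.clust = 3 are illustrative choices, and a PCA result is used here for simplicity; for qualitative data you would feed HCPC an MCA result instead:

```r
# Sketch: hierarchical clustering on principal components (HCPC).
library(FactoMineR)
library(factoextra)

res.pca  <- PCA(iris[, -5], ncp = 5, graph = FALSE)   # keep 5 components
res.hcpc <- HCPC(res.pca, nb.clust = 3, graph = FALSE)

fviz_dend(res.hcpc)      # dendrogram of the hierarchy
fviz_cluster(res.hcpc)   # individuals coloured by cluster on the factor map
```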

### Principal component analysis

• Data: decathlon2 [in factoextra package]
• PCA function: FactoMineR::PCA()
• Visualization factoextra::fviz_pca()

Read more about computing and interpreting principal component analysis at: Principal Component Analysis (PCA).

library("factoextra")
data("decathlon2")
df <- decathlon2[1:23, 1:10]
2. Principal component analysis:
library("FactoMineR")
res.pca <- PCA(df,  graph = FALSE)
3. Extract and visualize eigenvalues/variances:
# Extract eigenvalues/variances
get_eig(res.pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.1242133        41.242133                    41.24213
## Dim.2   1.8385309        18.385309                    59.62744
## Dim.3   1.2391403        12.391403                    72.01885
## Dim.4   0.8194402         8.194402                    80.21325
## Dim.5   0.7015528         7.015528                    87.22878
## Dim.6   0.4228828         4.228828                    91.45760
## Dim.7   0.3025817         3.025817                    94.48342
## Dim.8   0.2744700         2.744700                    97.22812
## Dim.9   0.1552169         1.552169                    98.78029
## Dim.10  0.1219710         1.219710                   100.00000
# Visualize eigenvalues/variances
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))

4. Extract and visualize results for variables:

# Extract the results for variables
var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the variables"
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
# Coordinates of variables
head(var$coord)
##                   Dim.1       Dim.2      Dim.3       Dim.4      Dim.5
## X100m        -0.8506257 -0.17939806  0.3015564  0.03357320 -0.1944440
## Long.jump     0.7941806  0.28085695 -0.1905465 -0.11538956  0.2331567
## Shot.put      0.7339127  0.08540412  0.5175978  0.12846837 -0.2488129
## High.jump     0.6100840 -0.46521415  0.3300852  0.14455012  0.4027002
## X400m        -0.7016034  0.29017826  0.2835329  0.43082552  0.1039085
## X110m.hurdle -0.7641252 -0.02474081  0.4488873 -0.01689589  0.2242200
# Contribution of variables
head(var$contrib)
##                  Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
## X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
## Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
## Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
## High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
## X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
## X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193
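The contribution table above can be cross-checked against its usual definition; the formula below (cos2 = coord², contribution = percentage of a component’s total cos2 across variables) is my assumption about how these values are computed, worth verifying against the FactoMineR documentation:

```r
# Sketch: recompute variable contributions from their coordinates.
# Assumption: contrib = 100 * cos2 / colSums(cos2), with cos2 = coord^2.
library(FactoMineR)
library(factoextra)

data("decathlon2")
res.pca <- PCA(decathlon2[1:23, 1:10], graph = FALSE)
var <- get_pca_var(res.pca)

cos2    <- var$coord^2
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100

head(contrib)   # compare with head(var$contrib) shown above
```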
# Graph of variables: default plot
fviz_pca_var(res.pca, col.var = "black")

It’s possible to control variable colors using their contributions (“contrib”) to the principal axes:

# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var="contrib",
repel = TRUE # Avoid text overlapping
)
5. Variable contributions to the principal axes:
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

6. Extract and visualize results for individuals:
1. Extract and visualize results for individuals:
# Extract the results for individuals
ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the individuals"
## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
# Coordinates of individuals
head(ind$coord)
##                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
## SEBRLE     0.1955047  1.5890567  0.6424912  0.08389652  1.16829387
## CLAY       0.8078795  2.4748137 -1.3873827  1.29838232 -0.82498206
## BERNARD   -1.3591340  1.6480950  0.2005584 -1.96409420  0.08419345
## YURKOV    -0.8889532 -0.4426067  2.5295843  0.71290837  0.40782264
## ZSIVOCZKY -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
## McMULLEN   0.1212195 -1.0139102 -0.8625170  1.34164291  1.62151286
# Graph of individuals
# 1. Use repel = TRUE to avoid overplotting
# 2. Control automatically the color of individuals using the cos2
# cos2 = the quality of the individuals on the factor map
# Use points only
fviz_pca_ind(res.pca, col.ind = "cos2",
repel = TRUE # Avoid text overlapping (slow if many points)
)
# Biplot of individuals and variables
fviz_pca_biplot(res.pca, repel = TRUE)
7. Color individuals by groups:
# Compute PCA on the iris data set
# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[,-5], graph = FALSE)

# Visualize
# Use habillage to specify groups for coloring
fviz_pca_ind(iris.pca,
label = "none", # hide individual labels
habillage = iris$Species,  # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE  # Concentration ellipses
)

### Correspondence analysis

• Data: housetasks [in factoextra]
• CA function: FactoMineR::CA()
• Visualize with factoextra::fviz_ca()

Read more about computing and interpreting correspondence analysis at: Correspondence Analysis (CA).

• Compute CA:

# Loading data
data("housetasks")
# Computing CA
library("FactoMineR")
res.ca <- CA(housetasks, graph = FALSE)

• Extract results for row/column variables:

# Result for row variables
get_ca_row(res.ca)
# Result for column variables
get_ca_col(res.ca)

• Biplot of rows and columns:

fviz_ca_biplot(res.ca, repel = TRUE)

To visualize only row points or column points, type this:

# Graph of row points
fviz_ca_row(res.ca, repel = TRUE)
# Graph of column points
fviz_ca_col(res.ca)
# Visualize row contributions on axis 1
fviz_contrib(res.ca, choice = "row", axes = 1)
# Visualize column contributions on axis 1
fviz_contrib(res.ca, choice = "col", axes = 1)

### Multiple correspondence analysis

• Data: poison [in factoextra]
• MCA function: FactoMineR::MCA()
• Visualization: factoextra::fviz_mca()

Read more about computing and interpreting multiple correspondence analysis at: Multiple Correspondence Analysis (MCA).

1. Computing MCA:

library(FactoMineR)
data(poison)
res.mca <- MCA(poison, quanti.sup = 1:2, quali.sup = 3:4, graph = FALSE)

2. Extract results for variables and individuals:

# Extract the results for variable categories
get_mca_var(res.mca)
# Extract the results for individuals
get_mca_ind(res.mca)

3. Contribution of variables and individuals to the principal axes:

# Visualize variable category contributions on axis 1
fviz_contrib(res.mca, choice = "var", axes = 1)
# Visualize individual contributions on axis 1
# select the top 20
fviz_contrib(res.mca, choice = "ind", axes = 1, top = 20)

4. Graph of individuals:

# Color by groups
# Add concentration ellipses
# Use repel = TRUE to avoid overplotting
grp <- as.factor(poison[, "Vomiting"])
fviz_mca_ind(res.mca, habillage = grp, addEllipses = TRUE, repel = TRUE)

5. Graph of variable categories:

fviz_mca_var(res.mca, repel = TRUE)

6. Biplot of individuals and variables:

fviz_mca_biplot(res.mca, repel = TRUE)

### Advanced methods

The factoextra R package also has functions that support the visualization of advanced methods such as FAMD, MFA and HMFA.

## Cluster analysis and factoextra

To learn more about cluster analysis, you can refer to the book available at: Practical Guide to Cluster Analysis in R. The main parts of the book include:

• distance measures,
• partitioning clustering,
• hierarchical clustering,
• cluster validation methods, as well as
• advanced clustering methods such as fuzzy clustering, density-based clustering and model-based clustering.

The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.

### Partitioning clustering

# 1. Loading and preparing data
data("USArrests")
df <- scale(USArrests)
# 2. Compute k-means
set.seed(123)
km.res <- kmeans(scale(USArrests), 4, nstart = 25)
# 3. Visualize
library("factoextra")
fviz_cluster(km.res, data = df,
             palette = c("#00AFBB", "#2E9FDF", "#E7B800", "#FC4E07"),
             ggtheme = theme_minimal(),
             main = "Partitioning Clustering Plot"
)

### Hierarchical clustering

library("factoextra")
# Compute hierarchical clustering and cut into 4 clusters
res <- hcut(USArrests, k = 4, stand = TRUE)
# Visualize
fviz_dend(res, rect = TRUE, cex = 0.5,
          k_colors = c("#00AFBB", "#2E9FDF", "#E7B800", "#FC4E07"))

### Determine the optimal number of clusters

# Optimal number of clusters for k-means
library("factoextra")
my_data <- scale(USArrests)
fviz_nbclust(my_data, kmeans, method = "gap_stat")

## Acknowledgment

I would like to thank Fabian Mundt for his active contributions to factoextra.
We sincerely thank all developers for their efforts behind the packages that factoextra depends on, namely ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and more.

## References

• H. Wickham (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
• Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2016). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.5.
• Sebastien Le, Julie Josse, Francois Husson (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, 25(1), 1-18. doi:10.18637/jss.v025.i01
• Tal Galili (2015). dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. doi:10.1093/bioinformatics/btv428

## Infos

This analysis has been performed using R software (ver. 3.3.2) and factoextra (ver. 1.0.4.999).

To leave a comment for the author, please follow the link and comment on their blog: Easy Guides. Continue Reading…

### “Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.”

Javier Benitez writes: This is a paragraph from Kathryn Montgomery’s book, How Doctors Think:

If medicine were practiced as if it were a science, even a probabilistic science, my daughter’s breast cancer might never have been diagnosed in time.
At 28, she was quite literally off the charts, far too young, an unlikely patient who might have eluded the attention of anyone reasoning “scientifically” from general principles to her improbable case. Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.

I [Benitez] am not sure I agree with her assessment. I have been doing some reading on the history and philosophy of science; there’s not much on the philosophy of medicine, and this is a tough question to answer, at least for me.

I would think that science, done right, should help, not hinder, the cause of cancer decision making. (Incidentally, the relevant science here would necessarily be probabilistic, so I wouldn’t speak of “even” a probabilistic science as if it were worth considering any deterministic science of cancer diagnosis.) So how to think about the above quote? I have a few directions, in no particular order:

1. Good science should help, but bad science could hurt. It’s possible that there’s enough bad published work in the field of cancer diagnosis that a savvy doctor is better off ignoring a lot of it, performing his or her own meta-analysis, as it were, partially pooling the noisy and biased findings toward some more reasonable theory-based model.

2. I haven’t read the book where this quote comes from, but the natural question is: how did the doctor diagnose the cancer in that case? Presumably the information used by the doctor could be folded into a scientific diagnostic procedure.

3. There’s also the much-discussed cost-benefit angle. Early diagnosis can save lives, but it can also have costs in dollars and health when there is misdiagnosis.

To the extent that I have a synthesis of all these ideas, it’s through the familiar idea of anomalies. Science (that is, probability theory plus data plus models of data plus empirical review and feedback) is supposed to be the optimal way to make decisions under uncertainty.
So if doctors have a better way of doing it, this suggests that the science they’re using is incomplete, and they should be able to do better. The idea here is to think of the “science” of cancer diagnosis not as a static body of facts or even as a method of inquiry, but as a continuously developing network of conjectures and models and data. To put it another way, it can make sense to “ignore the requirements of science.” And when you make that decision, you should explain why you’re doing it: what information you have that moves you away from what would be the “science-based” decision.

Benitez adds some more background: As I’m sure you already know, what and how science is practiced means different things to different people. Although pretty significant, this is just one quote from her book :) I may be wrong, but I think she is a literary scholar interested in the epistemology of medicine. Here are a few links to give you more context on the book: You probably already know we memorize lots of facts and get very little training in statistics and philosophy, so asking a doctor if the practice of medicine is a science is a challenging question. I also think it’s a very important question and addressing it would benefit the field.

This all raises interesting questions. I agree that it would be a mistake to call medicine a science. As is often the case, I like the Wikipedia definition (“Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe”). Medicine uses a lot of science, and there is a science of medicine (the systematic study of what is done in medicine and what are the outcomes of medical decisions), but the practice of medicine proceeds case by case.
It’s similar with the practice of education, or for that matter the practice of scientific research, or the practice of nursing, or the practice of truck driving, or any job: it uses science, and it can be the subject of scientific inquiry, but it is not itself a science. Continue Reading…

### Data Links #91

## New data analysis competitions

## How-to

## Privacy

• Beyond providing support for police operations, the OPSOC also monitors the social media activity of protesters during demonstrations. Bottom line: the earlier you understand that all your social media postings are public, regardless of the access level you have set up, the better.

## Tech

• Now, a team at the University of North Carolina, Chapel Hill, has detected brain growth changes linked to autism in children as young as 6 months old. And it piqued our interest because a deep-learning algorithm was able to use that data to predict whether a child at high risk of autism would be diagnosed with the disorder at 24 months. The algorithm correctly predicted the eventual diagnosis in high-risk children with 81 percent accuracy and 88 percent sensitivity. That's pretty damn good compared with behavioral questionnaires, which yield information that leads to early autism diagnoses (at around 12 months old) that are just 50 percent accurate.

• The Truth About The Trump Data Team That People Are Freaking Out About. Another article disputing the claim that Cambridge Analytica basically performed black magic during the campaign. We already had a few of these links in Data Links #89. But interviews with 13 former employees, campaign staffers, and executives at other Republican consulting firms who have seen Cambridge Analytica's work suggest that its psychological approach was not actually used by the Trump campaign and, furthermore, the company has never provided evidence that it even works.
Rather than a sinister breakthrough in political technology, the Cambridge Analytica story appears to be part of the traditional contest among consultants on a winning political campaign to get their share of credit, and to win future clients.

• Looking across various times and locations, the researchers saw that communities undergoing a demographic change due to immigration clearly don't experience significant increases in crime. In fact, even though these communities may have felt like they were in flux due to population changes, crime was either stable or declined in communities that were incorporating many immigrants.

• The research project, called Detox, began last year and used machine learning methods to flag comments that contain personal attacks. The researchers looked at 14 years of Wikipedia comments for patterns in abusive behaviour. Detox is part of Jigsaw's Conversation AI project, which aims to build open-source AI tools for web forums and social media platforms to use in the fight against online harassment.

Data Links is a periodic blog post published on Sundays (specific time may vary) which contains interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed or this one, which only contains these articles, if you are not interested in other things I might publish. Have you read an article you liked and would like to suggest it for the next issue? Just contact me! Continue Reading…

### T-Shirts!!

The Becoming a Data Scientist tees are ready to sell! I ordered a couple myself before posting them for sale, to make sure the quality was good. They came out great!! And if you order from Teespring before March 1 using this link: Becoming a Data Scientist Store – Free Shipping, you'll get free shipping on your order! The design is a combination of those submitted to our contest by Amarendranath “Amar” Reddy and Ryne & Alexis.
You can see their design submissions and read more about them on the finalists post! They are each receiving prizes for being selected. Thanks Amar, Ryne, and Alexis for the awesome design! There are a variety of styles and colors available. The Premium Tee is 100% cotton. The Women’s Premium is a 50/50 cotton/poly blend, and is cut to fit more snugly. They are available in navy blue, gray, purple, and black. There’s even a long-sleeve version! I make anywhere from $2 to $7 on each order (it’s print-on-demand, so not cheap enough for me to make a significant profit yet, and my proceeds will be lower with the free shipping offer, but I want to reward those of you who are excited to flaunt your Becoming a Data Scientist status!) and every dollar earned from these will be going to the fund that helps support my new small team of assistants, who you’ll meet soon! Also, the more of them I sell, the lower the per-shirt printing cost, so please share with all of your friends! Here are photos of me wearing the shirt, but this was before I made the front design slightly smaller (so it doesn’t wrap into the armpit), moved the back design slightly higher, and made the gray dots (data points?) transparent so the color of the shirt will show through (see store images above for the current design). You can see that the teal came out as a lighter blue in printing. This is the “Premium Tee” style in “New Navy”. Here’s a model wearing a simulated version of the shirt.

### Order yours here, with Free Shipping Until March 1!

Update: Kids sizes now available, too! (the design is on the front for kids’ shirts) Continue Reading…

### Document worth reading: “Object Oriented Analysis using Natural Language Processing concepts: A Review”

The Software Development Life Cycle (SDLC) starts with eliciting requirements of the customers in the form of a Software Requirement Specification (SRS).
SRS document needed for software development is mostly written in Natural Language(NL) convenient for the client. From the SRS document only, the class name, its attributes and the functions incorporated in the body of the class are traced based on pre-knowledge of analyst. The paper intends to present a review on Object Oriented (OO) analysis using Natural Language Processing (NLP) techniques. This analysis can be manual where domain expert helps to generate the required diagram or automated system, where the system generates the required diagram, from the input in the form of SRS. Object Oriented Analysis using Natural Language Processing concepts: A Review Continue Reading… ### R Packages worth a look Spatial Downscaling Using Bias Correction Approach (spdownscale) Spatial downscaling of climate data (Global Circulation Models/Regional Climate Models) using quantile-quantile bias correction technique. Rainforest Plots for Meta-Analysis (metaviz) Creates rainforest plots (proposed by Schild & Voracek, 2015 <doi:10.1002/jrsm.1125>), a variant and enhancement of the classic forest plot for meta-analysis. In the near future, the ‘metaviz’ package will be extended by further, established as well as novel, plotting options for visualizing meta-analytic data. Probability of Default Calibration (LDPD) Implementation of most popular approaches to PD (probability of default) calibration: Quasi Moment Matching algorithm (D. Tasche), algorithm proposed by M. van der Burgt, K. Pluto and D. Tasche’s most prudent estimation methodology. Performs spatial NBDA in a Bayesian context (spatialnbda) Network based diffusion analysis (NBDA) allows inference on the asocial and social transmission of information. This may involve the social transmission of a particular behaviour such as tool use, for example. For the NBDA, the key parameters estimated are the social effect and baseline rate parameters. 
The baseline rate parameter gives the rate at which the behaviour is first performed (or acquired) asocially amongst the individuals in a given population. The social effect parameter quantifies the effect of the social associations amongst the individuals on the rate at which each individual first performs or displays the behaviour. Spatial NBDA involves incorporating spatial information in the analysis. This is done by incorporating social networks derived from spatial point patterns (of the home bases of the individuals under study). In addition, a spatial covariate such as vegetation cover or slope may be included in the modelling process. Statistical Methods for Trapezoidal Fuzzy Numbers (FuzzyStatTra) The aim of the package is to provide some basic functions for doing statistics with trapezoidal fuzzy numbers. In particular, the package contains several functions for simulating trapezoidal fuzzy numbers, as well as for calculating some central tendency measures (mean and two types of median), some scale measures (variance, ADD, MDD, Sn, Qn, Tn and some M-estimators), one diversity index and one inequality index. Moreover, functions for calculating the 1-norm distance, the mid/spr distance and the (phi,theta)-wabl/ldev/rdev distance between fuzzy numbers are included, as well as a function to calculate the value phi-wabl given a sample of trapezoidal fuzzy numbers. Computation of P Values and Bayes Factors for Conditioning Data (condir) Set of functions for the easy analysis of conditioning data. Discrete Event Simulation for R (simmer) simmer is a discrete event simulation package for the R language. It is developed with my own specific requirements for simulating day-to-day hospital processes and thus might not be suited for everyone. It is designed to be as simple to use as possible and tries to be compatible with the chaining/piping workflow introduced by the magrittr package.
GitHub Linear Regression with Non-Constant Variances (lmvar) Runs a linear regression in which both the expected value and the variance can vary per observation. The expected values mu follows the standard linear model mu = X_mu * beta_mu. The standard deviation sigma follows the model log(sigma) = X_sigma * beta_sigma. The package comes with two vignettes: ‘Intro’ gives an introduction, ‘Math’ gives mathematical details. Continue Reading… ### Predicting food preferences with sparklyr (machine learning) (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers) This week I want to show how to run machine learning applications on a Spark cluster. I am using the sparklyr package, which provides a handy interface to access Apache Spark functionalities via R. The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines. ### Apache Spark Apache Spark™ can be used to perform large-scale data analysis workflows by taking advantage of its parallel cluster-computing setup. Because machine learning processes are iterative, running models on parallel clusters can vastly increase the speed of training. Obviously, Spark’s power comes to pass when dispatching it to external clusters, but for demonstration purposes, I am running the demo on a local Spark instance. #### MLlib Spark’s distributed machine learning library MLlib sits on top of the Spark core framework. It implements many popular machine learning algorithms, plus many helper functions for data preprocessing. With sparklyr you can easily access MLlib. You can work with a couple of different machine learning algorithms and with functions for manipulating features and Spark dataframes. Additionally, you can also perform SQL queries. sparklyr also implements dplyr, making it especially convenient for handling data. 
If you don’t have Spark installed locally, run: library(sparklyr) spark_install(version = "2.0.0")  Now we can connect to a local Spark instance: library(sparklyr) sc <- spark_connect(master = "local", version = "2.0.0")  ### Preparations Before I start with the analysis, I am preparing my custom ggplot2 theme and load the packages tidyr (for gathering data for plotting), dplyr (for data manipulation) and ggrepel (for non-overlapping text labels in plots). library(tidyr) library(ggplot2) library(ggrepel) library(dplyr) my_theme <- function(base_size = 12, base_family = "sans"){ theme_minimal(base_size = base_size, base_family = base_family) + theme( axis.text = element_text(size = 12), axis.title = element_text(size = 14), panel.grid.major = element_line(color = "grey"), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "aliceblue"), strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1), strip.text = element_text(face = "bold", size = 12, color = "black"), legend.position = "right", legend.justification = "top", panel.border = element_rect(color = "grey", fill = NA, size = 0.5) ) }  ### The Data Of course, the power of Spark lies in speeding up operations on large datasets. But because that’s not very handy for demonstration, I am here working with a small dataset: the raw data behind The FiveThirtyEight International Food Association’s 2014 World Cup. This dataset is part of the fivethirtyeight package and provides scores for how each person rated their preference of the dishes from several countries. The following categories could be chosen: • 5: I love this country’s traditional cuisine. I think it’s one of the best in the world. • 4: I like this country’s traditional cuisine. I think it’s considerably above average. • 3: I’m OK with this county’s traditional cuisine. I think it’s about average. • 2: I dislike this country’s traditional cuisine. I think it’s considerably below average. 
• 1: I hate this country’s traditional cuisine. I think it’s one of the worst in the world. • N/A: I’m unfamiliar with this country’s traditional cuisine. Because I think that whether someone doesn’t know a country’s cuisine is in itself information, I recoded NAs to 0. library(fivethirtyeight) food_world_cup[food_world_cup == "N/A"] <- NA food_world_cup[, 9:48][is.na(food_world_cup[, 9:48])] <- 0 food_world_cup$gender <- as.factor(food_world_cup$gender) food_world_cup$location <- as.factor(food_world_cup$location)  The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines, general knowledge and interest in different cuisines, age, gender, income, education level and/ or location. Before I do any machine learning, however, I want to get to know the data. First, I am calculating the percentages for each preference category and plot them with a pie chart that is facetted by country. # calculating percentages per category and country percentages <- food_world_cup %>% select(algeria:vietnam) %>% gather(x, y) %>% group_by(x, y) %>% summarise(n = n()) %>% mutate(Percent = round(n / sum(n) * 100, digits = 2)) # rename countries & plot percentages %>% mutate(x_2 = gsub("_", " ", x)) %>% mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>% mutate(x_2 = gsub("And", "and", x_2)) %>% ggplot(aes(x = "", y = Percent, fill = y)) + geom_bar(width = 1, stat = "identity") + theme_minimal() + coord_polar("y", start = 0) + facet_wrap(~ x_2, ncol = 8) + scale_fill_brewer(palette = "Set3") + labs(fill = "")  As we can already see from the distributions, there are big differences between countries. Some countries’ cuisines are very well known and liked (e.g. Italy, Mexico and the United States), while others are only known to a handful of people (e.g. Algeria, Croatia, Ghana, etc.). 
#### Imputing missing values In the demographic information, there are a few missing values. These, I want to impute: library(mice) dataset_impute <- mice(food_world_cup[, -c(1, 2)], print = FALSE) food_world_cup <- cbind(food_world_cup[, 2, drop = FALSE], mice::complete(dataset_impute, 1))  #### Transforming preference values Strictly speaking, the preference data is categorical. But because using them as factor levels would make the models much more complex and calculation-intensive, I am converting the factors to numbers and transform them by dividing through the mean of non-zero values for each country. food_world_cup[8:47] <- lapply(food_world_cup[8:47], as.numeric) countries <- paste(colnames(food_world_cup)[-c(1:7)]) for (response in countries) { food_world_cup[paste(response, "trans", sep = "_")] <- food_world_cup[response] / mean(food_world_cup[food_world_cup[response] > 0, response]) }  As we can see by plotting the distribution of transformed values, they are far from being normal distributions. Moreover, we still see big differences between well known and not well known countries – this means that the data is unbalanced. food_world_cup %>% gather(x, y, algeria_trans:vietnam_trans) %>% mutate(x_2 = gsub("_trans", "", x)) %>% mutate(x_2 = gsub("_", " ", x_2)) %>% mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>% mutate(x_2 = gsub("And", "and", x_2)) %>% ggplot(aes(y)) + geom_density(fill = "navy", alpha = 0.7) + my_theme() + facet_wrap(~ x_2, ncol = 8) + labs(x = "transformed preference")  #### Most liked cuisines and gender biases Then, I wanted to know which countries were the most liked and whether we see a gender bias in preference for some country’s cuisines. The first plot shows the sum of preference values for each country, separated by male and female answers. 
food_world_cup_gather <- food_world_cup %>%
  collect %>%
  gather(country, value, algeria:vietnam)

food_world_cup_gather$value <- as.numeric(food_world_cup_gather$value)
food_world_cup_gather$country <- as.factor(food_world_cup_gather$country)

food_world_cup_gather <- food_world_cup_gather %>%
  mutate(x_2 = gsub("_", " ", country)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))

order <- aggregate(food_world_cup_gather$value, by = list(food_world_cup_gather$x_2), FUN = sum)

food_world_cup_gather %>%
  mutate(x_2 = factor(x_2, levels = order$Group.1[order(order$x, decreasing = TRUE)])) %>%
  ggplot(aes(x = x_2, y = value, fill = gender)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set1") +
  my_theme() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(fill = "Gender", x = "", y = "sum of preferences")

Cuisines from Italy, the United States and Mexico were the most liked; those from Ghana, Cameroon and the Ivory Coast were the least liked (that they were also less well known plays into it, of course). No country shows an obvious gender bias, though. To dig a bit deeper into gender differences, I am calculating the differences between the mean preferences of males and females for each country. Here, I am using the country list that I defined earlier when transforming the values. Because I want to paste the country list to dplyr’s functions, I need to use standard evaluation (SE). By default, dplyr uses non-standard evaluation (NSE), i.e. variable names are given without quotation marks. This makes it possible to easily convert between R and SQL code. But each dplyr function also has a version to use with SE: they each have an underscore ("_") appended to the function name, here e.g. mutate_each_().
food_world_cup %>% collect %>% mutate_each_(funs(as.numeric), countries) %>% group_by(gender) %>% summarise_each_(funs(mean), countries) %>% summarise_each_(funs(diff), countries) %>% gather(x, y) %>% mutate(x_2 = gsub("_", " ", x)) %>% mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>% mutate(x_2 = gsub("And", "and", x_2)) %>% ggplot(aes(x = x_2, y = y)) + geom_bar(stat = "identity", fill = "navy", alpha = 0.7) + my_theme() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + labs(x = "", y = "difference\nbetween gender")  All gender differences are very close to zero, so it seems that men and women generally have similar food preferences. ### Spark Now, that I’m somewhat familiar with the data, I copy it to the Spark instance: food_world_cup <- copy_to(sc, food_world_cup)  #### Principal Component Analysis (PCA) One of the functions of Spark’s machine learning functions is for PCA: ml_pca(). I am using it to get an idea about where the countries fall on a 2-dimensional plane of the first two principal components. pca <- food_world_cup %>% mutate_each_(funs(as.numeric), countries) %>% ml_pca(features = paste(colnames(food_world_cup)[-c(1:47)])) library(tibble) as.data.frame(pca$components) %>%
rownames_to_column(var = "labels") %>%
mutate(x_2 = gsub("_trans", "", labels)) %>%
mutate(x_2 = gsub("_", " ", x_2)) %>%
mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
mutate(x_2 = gsub("And", "and", x_2)) %>%
ggplot(aes(x = PC1, y = PC2, color = x_2, label = x_2)) +
geom_point(size = 2, alpha = 0.6) +
geom_text_repel() +
labs(x = paste0("PC1: ", round(pca$explained.variance[1], digits = 2) * 100, "% variance"), y = paste0("PC2: ", round(pca$explained.variance[2], digits = 2) * 100, "% variance")) +
my_theme() +
guides(fill = FALSE, color = FALSE)


The least well known countries cluster on the top right, while the most liked are on the bottom right. PC2 seems to be mainly driven by how many 0s there are in the data.

#### Preparing the data for machine learning

First, I need to convert the factor strings of the non-country features into indexes. To do so, I am using the ft_string_indexer() function. For other feature transformation functions, check out sparklyr’s website.

food_world_cup <- tbl(sc, "food_world_cup") %>%
ft_string_indexer(input_col = "interest", output_col = "interest_idx") %>%
ft_string_indexer(input_col = "gender", output_col = "gender_idx") %>%
ft_string_indexer(input_col = "age", output_col = "age_idx") %>%
ft_string_indexer(input_col = "household_income", output_col = "household_income_idx") %>%
ft_string_indexer(input_col = "education", output_col = "education_idx") %>%
ft_string_indexer(input_col = "location", output_col = "location_idx") %>%
ft_string_indexer(input_col = "knowledge", output_col = "knowledge_idx")


Now I can divide the data into training and test sets using the sdf_partition() function.

partitions <- food_world_cup %>%
sdf_partition(training = 0.75, test = 0.25, seed = 753)


#### Modeling

Now, I run the Random Forest algorithm to predict each country’s preference based on all other countries’ preferences and the demographic information. Check back with sparklyr’s website to find out about other machine learning algorithms you can use.

For each country (as response variable), I am defining the features as all other countries’ transformed values and the indexed factor variables.

Then, I am filtering out all data rows where the response variable was 0. When I left the 0s in the data, the countries with many 0s introduced a very strong bias to the prediction. Here, I needed to use standard evaluation again, this time with lazyeval’s interp() function and a formula.

I am using the ml_random_forest() function to run classification models. Initially, I run a model on all features, then extract the 10 features with the highest importance and re-run the model on this subset of features. This improved the prediction accuracy compared to the all-feature models.

The sdf_predict() function is used to predict the classes of the test set. To obtain the quality metrics F1, weighted precision and weighted recall, I need to copy the prediction table to the Spark instance and run the ml_classification_eval() function.

Finally, I am combining the output tables from all countries’ models for comparison.

library(lazyeval)

for (response in countries) {

features <- colnames(partitions$training)[-grep(response, colnames(partitions$training))]
features <- features[grep("_trans|_idx", features)]

fit <- partitions$training %>%
  filter_(interp(~ var > 0, var = as.name(response))) %>%
  ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")

feature_imp <- ml_tree_feature_importance(sc, fit)
features <- as.character(feature_imp[1:10, 2])

fit <- partitions$training %>%
filter_(interp(~ var > 0, var = as.name(response))) %>%
ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")

partitions$test <- partitions$test %>%
filter_(interp(~ var > 0, var = as.name(response)))

pred <- sdf_predict(fit, partitions$test) %>%
  collect
pred_2 <- as.data.frame(table(pred[[response]], pred$prediction))
pred_2$response <- response
pred_sc <- select(pred, -rawPrediction, -probability)
pred_sc <- copy_to(sc, pred_sc, overwrite = TRUE)
feature_imp$response <- response

f1 <- ml_classification_eval(pred_sc, response, "prediction", metric = "f1")
wP <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedPrecision")
wR <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedRecall")

ml_eval <- data.frame(response = response,
f1 = f1,
weightedPrecision = wP,
weightedRecall = wR)

if (response == "algeria") {
feature_imp_df <- feature_imp
ml_eval_df <- ml_eval
pred_df <- pred_2
} else {
feature_imp_df <- rbind(feature_imp_df, feature_imp)
ml_eval_df <- rbind(ml_eval_df, ml_eval)
pred_df <- rbind(pred_df, pred_2)
}
}


#### Model evaluation

Now, I can compare the prediction accuracies for each country’s model by plotting the F1, weighted precision and weighted recall values.

• Precision

In classification models, precision gives the proportion of predicted positives that are true positives (i.e. how many of the samples that were classified as “5” were actually a “5”). If we have a high precision, this means that most of the samples our model labeled as a given class actually belong to that class.

• Recall

Recall, or sensitivity, gives the proportion of all true “5” samples that were correctly classified as “5”. If we have a high recall, this means that our model detected most of the true instances of each class.

• F1 score

The F1 score is the harmonic mean of precision and recall, so it balances the two metrics.

Let’s say we had 10 “5s” in the training data. The prediction table classified 8 samples as “5s”, 6 of which were correctly classified. In this example, we have 2 false positives and 4 false negatives. This means we would have a precision of 6/8 and a recall of 6/10.
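The arithmetic in this example can be checked directly (plain base R, no Spark needed):

```r
# Worked example from the text: 10 true "5s" in the data,
# 8 samples predicted as "5", 6 of them correctly.
tp <- 6                      # true positives
fp <- 8 - tp                 # false positives: 2
fn <- 10 - tp                # false negatives: 4

precision <- tp / (tp + fp)  # 6/8  = 0.75
recall    <- tp / (tp + fn)  # 6/10 = 0.60
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
# precision 0.750, recall 0.600, f1 0.667
```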

results <- ml_eval_df %>%
mutate(x_2 = gsub("_", " ", response)) %>%
mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
mutate(x_2 = gsub("And", "and", x_2))

order <- results$x_2[order(results$f1, decreasing = TRUE)]

gather(results, x, y, f1:weightedRecall) %>%
mutate(x_2 = factor(x_2, levels = order)) %>%
ggplot(aes(x = x_2, y = y, fill = x)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
scale_fill_brewer(palette = "Set1") +
my_theme() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
labs(fill = "", color = "", x = "", y = "value")


The plot above shows that the model predicting preference for Spanish dishes had the highest accuracy, while the models for Italy, India and Ireland had the lowest.

To see which features had been used in the more and less successful models, I am plotting the values of the 10 final features for classifying Spanish, Greek and Italian food preference categories 1 to 5 in the original dataset.

feats <- feature_imp_df %>%
  filter(response == "spain") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("spain", as.character(feats$feature))) %>%
  gather(x, y, -spain) %>%
  filter(spain > 0) %>%
  group_by(spain, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = spain, fill = mean)) +
  geom_tile(width = 0.9, height = 0.9) +
  scale_fill_gradient2(low = "white", high = "red", name = "mean") +
  my_theme() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(x = "", y = "Spain")

feats <- feature_imp_df %>%
  filter(response == "greece") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("greece", as.character(feats$feature))) %>%
gather(x, y, -greece) %>%
filter(greece > 0) %>%
group_by(greece, x) %>%
summarise(mean = mean(y)) %>%
mutate(x_2 = gsub("_trans", "", x)) %>%
mutate(x_2 = gsub("_idx", "", x_2)) %>%
mutate(x_2 = gsub("_", " ", x_2)) %>%
mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
mutate(x_2 = gsub("And", "and", x_2)) %>%
ggplot(aes(x = x_2, y = greece, fill = mean)) +
geom_tile(width = 0.9, height = 0.9) +
scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
my_theme() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(x = "", y = "Greece")


feats <- feature_imp_df %>%
filter(response == "italy") %>%
slice(1:10)

as.data.frame(food_world_cup) %>%
select_(.dots = c("italy", as.character(feats$feature))) %>%
  gather(x, y, -italy) %>%
  filter(italy > 0) %>%
  group_by(italy, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = italy, fill = mean)) +
  geom_tile(width = 0.9, height = 0.9) +
  scale_fill_gradient2(low = "white", high = "red", name = "mean") +
  my_theme() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(x = "", y = "Italy")

### Conclusions

This dataset was a bit difficult because the preferences were highly unbalanced, i.e. the distributions of preferences were very different between countries. Some countries’ cuisines were largely unknown, while others’ were well known and generally popular. If I kept the unknown class (0) in the prediction models, countries with many 0s jumped up in prediction accuracy. But this was mainly due to the fact that the chance of correctly classifying a sample as 0 was disproportionately high (leading to high precision). Removing the 0s while working on raw preference values showed the same effect as before, but now countries with a majority of 5s (e.g. Italy) were overly “accurate”. Transforming the data and removing the 0s led to a more evenly distributed accuracy between well known and less well known countries. But by removing them, the number of samples used for modeling varied strongly between countries, introducing another type of bias. It was not possible to avoid bias entirely with this dataset. But I wanted to show it as an example nonetheless. Usually, machine learning examples show datasets where the models worked very well, leaving the reader in awe of the powers of machine learning.
And while there are certainly powerful and impressive prediction models, real-life data is not always as simple. As can be seen with this dataset, it’s not always as straight forward to build a good classification model, even though you’d intuitively assume that it should be easy to predict food preferences based on which other cuisines someone likes… The model could potentially be improved by engineering features, modeling factors and/ or different algorithms but I’ll leave this for another time. Next week, I’ll be looking into how to use the h2o package with Spark using rsparkling. Other machine learning topics I have covered include ## R version 3.3.2 (2016-10-31) ## Platform: x86_64-apple-darwin13.4.0 (64-bit) ## Running under: macOS Sierra 10.12.1 ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] tibble_1.2 fivethirtyeight_0.1.0 dplyr_0.5.0 ## [4] ggrepel_0.6.5 ggplot2_2.2.1 tidyr_0.6.1 ## [7] sparklyr_0.5.1 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.9 knitr_1.15.1 magrittr_1.5 ## [4] rappdirs_0.3.1 munsell_0.4.3 colorspace_1.3-2 ## [7] R6_2.2.0 plyr_1.8.4 stringr_1.1.0 ## [10] httr_1.2.1 tools_3.3.2 parallel_3.3.2 ## [13] grid_3.3.2 gtable_0.2.0 config_0.2 ## [16] DBI_0.5-1 withr_1.0.2 htmltools_0.3.5 ## [19] lazyeval_0.2.0 yaml_2.1.14 assertthat_0.1 ## [22] rprojroot_1.2 digest_0.6.12 RColorBrewer_1.1-2 ## [25] base64enc_0.1-3 evaluate_0.10 rmarkdown_1.3 ## [28] labeling_0.3 stringi_1.1.2 scales_0.4.1 ## [31] backports_1.0.5 jsonlite_1.2  To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. 
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more... Continue Reading…

### Bar bar plots but not Babar plots

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

You might have heard of the “bar bar plots” movement, whose goal is to prevent using (let’s use ggplot2 language, shall we) geom_bar when you could have used e.g. geom_boxplot or geom_histogram, because the bar plot hides the variability of the distributions you say you’re comparing, even if you add some sort of error bar. I whole-heartedly agree with this movement, but in this post I’d like to present some OK bar plots, ones that represent counts of individuals in different categories. You might know them as geom_bar(blabla, stat = "identity") or geom_col. They’re still bar plots and they’re great, in particular when you make puns out of them, which is what I did with Miles McBain. The reason I mentioned bar plots to Miles was my wanting to recycle some old code of mine (again) for a blog post. It was code from my “let’s test all the kitsch R plots have to offer” period, when I had started using Thomas Leeper’s colourlovers package for adding patterns on bars. colourlovers is a great package for getting trendy colour palettes, although I’ll admit I’m a viridis person. But colourlovers also has patterns, about which the package documentation states: “Patterns are images created on COLOURlovers using a specified palette of colors. They serve as examples of how to use the palette in practice.” At the time I read those sentences I got really excited, because I was convinced one could do more with patterns, like filling bar plots with them. In this post, I’ll prove you can, with my non-generalizable workflow.
library("dplyr")
library("tidyr")
library("httr")
library("grid")
library("ggplot2")
library("colourlovers")

# General comments on using patterns

A pattern is a square image that I convert to a raster. This raster is a matrix with 200 rows and 200 columns. In this matrix, $x_{i,j}$ represents the hex code of the colour of a pixel. Below I show how to get a pattern figure and how to plot it on its own.

# get pattern
catsBlue <- clpattern(id = "3363032")
# get the URL of the picture corresponding to the pattern
imageURL <- catsBlue$imageUrl
# get the picture
picture <-  content(GET(as.character(imageURL)))
# convert it to a raster
img <- as.raster(picture)
# plot it!
plot(img)


This pattern can also be seen as a puzzle piece: one can copy the pattern several times and put each copy side by side and/or pile them up and get something that looks homogeneous with no problem at the border of each piece.

In ggplot2, I’ll add the patterns with geom_point. The main issue will be to re-size puzzle pieces (how big should a single puzzle piece be for the graph to look nice?), to paste pieces together (how many puzzle pieces?), and to crop the puzzle (the puzzle should not be bigger than the bar it covers, for instance).
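As a minimal sketch of that paste-and-crop logic (illustrative only, not the post’s actual code: `tile_and_crop` is a hypothetical helper, and a plain character matrix of hex colours stands in for the 200 x 200 raster):

```r
# Hypothetical helper: repeat a square "puzzle piece" enough times to
# cover a width x height area (in pixels), then crop to the exact size.
tile_and_crop <- function(img, width, height) {
  nx <- ceiling(width  / ncol(img))  # pieces needed side by side
  ny <- ceiling(height / nrow(img))  # pieces needed piled up
  row_strip <- do.call(cbind, replicate(nx, img, simplify = FALSE))
  full      <- do.call(rbind, replicate(ny, row_strip, simplify = FALSE))
  full[seq_len(height), seq_len(width), drop = FALSE]  # crop the puzzle
}

# a tiny 2 x 2 "pattern" of hex colours
piece <- matrix(c("#FFFFFF", "#000000", "#000000", "#FFFFFF"), nrow = 2)
dim(tile_and_crop(piece, width = 5, height = 3))
# a 3 x 5 matrix: two-and-a-half pieces across, one-and-a-half up, cropped
```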

# Writing code for decorating a bar plot

Here is the original bar plot I aim to make fancier.

# Normal plot
c <- ggplot(mtcars, aes(factor(cyl)))
c <- c + geom_bar(width = .5,
aes(fill = factor(cyl)))+
theme(text = element_text(size=20))
c


Now, since my puzzle pieces are squares, I want the plot to have an x/y ratio such that the distance between two major grid lines on the x axis is the same as the distance between two major grid lines on the y axis.

plotInfo <- ggplot_build(c)
extentX <- diff(plotInfo$layout$panel_ranges[[1]]$x.major_source)[1]
extentY <- diff(plotInfo$layout$panel_ranges[[1]]$y.major_source)[1]
c <- c + coord_fixed(ratio = extentX/extentY)
c


I’ll admit once again that since writing the code and publishing it I’ve slept many times and thus forgotten how I came up with the solution. Today I had to change the code for getting extentY and extentX a bit, since ggplot_build() gives a slightly different output since the newest version of ggplot2 was released.

I shall now get three patterns from colourlovers. I went to the COLOURlovers website itself to find three patterns using the same template with different colours, so I know their IDs.

# get the patterns
get_patterns <- function(id_blue, id_red, id_green){
blue <- clpattern(id = id_blue)
red <- clpattern(id = id_red)
green <- clpattern(id = id_green)
list(red, green, blue)
}

patterns <- get_patterns(id_blue = 4348319,
id_red = 4376078,
id_green = 4360659)


I shall first get one colour from each pattern and re-do the figure with these colours. I do this to have a legend later. I don’t want to try and reproduce parts of the patterns to get them in the legend.

library("purrr")
get_colors <- function(pattern){
pattern$colors[3]$hex
}
addhash <- function(color){
paste0("#", color)
}

colors <- map(patterns, get_colors) %>%
map(addhash) %>%
unlist()

c2 <- c + scale_fill_manual(values = colors)
c2


Now let’s add the patterns to the graph! It’s quite inelegant since I use a loop for adding things successively to the graph.

add_patterns <- function(plot, patterns, colors){
for (i in 1:length(levels(factor(mtcars$cyl)))){
imageURL <- patterns[[i]]$imageUrl
# get pattern
picture <-  content(GET(as.character(imageURL)))
# picture is a 4 dimensional array
img <- as.raster(picture)

# we have to repeat the data.frame/pattern
# I repeat it so that one extentY = one square
xmax <- 1
ymax <- sum(mtcars$cyl == levels(factor(mtcars$cyl))[i])

size <- ceiling(ymax*2/extentY)

img2 <- apply(img,MARGIN=2,function(x) rep(x,size))

# from matrix to data.frame
img2 <- tbl_df(as.data.frame(as.matrix(img2)))

# I change the column names so that they correspond to x coordinates
names(img2) <- seq(i - 0.25, to = i + 0.25, length = ncol(img2))

# I transform the img2 so that I have one line = 1 x, 1 y and 1 colour
dataPlot <- img2 %>%
mutate(y = seq(from = size/2*extentY, to = 0, length = nrow(img2)))%>%
gather(x, value, 1:ncol(img2)) %>%
# filter part of the pattern that doesn't fit in the original bar
filter(y <= ymax)  %>%
mutate(x = as.numeric(x))

plot <- plot + geom_point(data = dataPlot, aes(x, y), col = dataPlot$value)
}
plot
}

add_patterns(plot = c2, patterns = patterns, colors = colors) +
ggtitle("Babar bar plot")

We’ve just made a Babar bar plot! Thanks Miles for this pun. Now let’s apply the same principles again, and by principles I mean applying the colourlovers code I’ve just presented, and using puns developed with Miles. A recipe for success!

# Other punny plots

## Barber plot

patterns <- get_patterns(id_blue = 3490878,
id_red = 3277163,
id_green = 4038034)
colors <- map(patterns, get_colors) %>%
map(addhash) %>%
unlist()
c2 <- c + scale_fill_manual(values = colors)
add_patterns(plot = c2, patterns = patterns, colors = colors) +
ggtitle("Barber bar plot")

This is one plot decoration you could use to illustrate something about Occam’s razor.

## Bark plot

patterns <- get_patterns(id_blue = 1816764,
id_red = 1825775,
id_green = 1815897)
colors <- map(patterns, get_colors) %>%
map(addhash) %>%
unlist()
c2 <- c + scale_fill_manual(values = colors)
add_patterns(plot = c2, patterns = patterns, colors = colors) +
ggtitle("Bark bar plot")

I’m less convinced by this one, maybe it’d have looked better with barking dogs?

## BART plot

patterns <- get_patterns(id_blue = 1902711,
id_red = 1916712,
id_green = 1930435)
colors <- map(patterns, get_colors) %>%
map(addhash) %>%
unlist()
c2 <- c + scale_fill_manual(values = colors)
add_patterns(plot = c2, patterns = patterns, colors = colors) +
ggtitle("BART bar plot")

This one is the loudest bar plot you’ll ever make, since it was inspired by San Francisco BART. I might have come up with this idea after not finding a Bart Simpson pattern. I told Miles I’d stop once I’d only think of a barf plot, which I don’t want to draw ever.

As I had said the first time I made a bar plot with patterns, I’m not sure how useful it is, but you never know when you’re gonna need a fancy plot to get someone’s attention, right?
To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

### Using Rcpp with C++11, C++14 and C++17

(This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers)

### Background

When we started the Rcpp Gallery in late 2012, a few of us spent the next four weeks diligently writing articles, ensuring that at least one new article would be posted per day. Two early articles covered the then-budding support for C++11. Both First steps in using C++11 with Rcpp and A first lambda function with C++11 and Rcpp started by actually showing how to set the compiler directive via the PKG_CXXFLAGS environment variable. Both posts were updated a few months later to demonstrate the C++11 plugin that was added in Rcpp release 0.10.3 in March of 2013, almost four years ago. Many posts here, on StackOverflow and of course on the mailing list make use of the plugin to tell the compiler to turn on C++11 support. In the early days, this often meant partial support via (in the case of g++) the -std=c++0x switch. This is still supported by Rcpp, as the Windows side of things in particular was relying on older compilers like g++ version 4.6.* until more recently (see below).

But some things got a lot better with R 3.1.0. As the NEWS announcement dryly noted:

There is support for compiling C++11 code in packages on suitable platforms: see ‘Writing R Extensions’.

which was coupled with support for selecting the C++11 language standard via either CXX_STD in src/Makevars, or the SystemRequirements line in DESCRIPTION.
In late 2011, Martyn Plummer also wrote a helpful R Journal article on best practices for portable C++ packages. And as we alluded to above, Rcpp has supported C++11 since the dawn of time (because, as we detail below, it ultimately comes down to what your compiler supports, and what R facilitates). Many CRAN packages have by now taken advantage of this increased support for C++11 in particular. As of today, we see 88 CRAN packages declaring this via DESCRIPTION and 127 CRAN packages via src/Makevars. And of course, almost all use Rcpp along with C++ to take advantage of the R and C++ integration it offers.

### Rcpp: Sitting between your Compiler and R

So what are the defining parameters for support by Rcpp? In essence, Rcpp is guided by just two (external) factors:

• the support in the provided compiler, and
• the support offered by R for package building,

and both are worth detailing as we do in the next two sections.

### Compiler Support

First, the choice of compiler matters. Rcpp operates on top of R, and (like any R package) it is driven entirely by the build instructions from R. It can therefore be dependent on the compiler choices offered by R. This meant g++ version 4.6.3 for Windows for years. And this got a lot better when Rtools switched to g++ version 4.9.3 with R 3.3.0 not quite a year ago. It now means that support for C++11 is almost universal. (Some small things are still missing in g++-4.9.3; notably complete support for the C++ Standard Library, leading us recently to backport get_time from LLVM into RcppCCTZ. Also note that macOS / OS X has its own dependency chain due to compiler, and release, choices made by the R package build system for that platform.)

Now, that is just the minimum available compiler for a particular platform, albeit a very important one as it defines what binary CRAN packages on Windows can support. Other platforms, however, have a faster release cadence.
g++-5 and clang++-3.3 are now fairly common and have (near-)complete C++11 support. The most recent Ubuntu release 16.10 even defaults to g++ version 6.2.0, and Debian already has version 6.3.0 (and binaries of version 7.* in its experimental branch). Similarly, recent versions of clang are available both directly in the distribution and via nightly builds from dedicated PPAs. And for these ‘6.*’ versions of g++, the default C++ standard is already C++14, the follow-up standard release to C++11.

For example, C++14 extends support for auto to return values, so that we can compile a simple program using such an auto return type without any additional switches or flags if the compiler is as recent as g++ version 6. A plain g++ -o demoprog demoprog.cpp will do (provided the snippet was saved as file demoprog.cpp), as C++14 is the default for this compiler. Otherwise, adding -std=c++14 explicitly instructs the compiler to compile as C++14. Moreover, it also works from R, returning

[1] 2

(on a machine such as the one used to write this piece) without requiring any additional plugins. As a reminder, cppFunction() supports these via the plugin= argument. On another machine, we might use Rcpp::cppFunction("auto doubleMe(const int &x) { return x + x; }", plugin="cpp14"). Similarly, in code sourced via sourceCpp() we would use an attribute; see the Rcpp Attributes vignette for details.

### Support by R for Packages: C++14 coming soon via R-devel

As noted above, the R 3.1.0 release started to add support for modern C++ by enabling packages to select C++11. The next release will be R 3.4.0. While as of now without an announced release date, it is likely to be shipped this April. It will bring support for C++14. The current draft of its version of Writing R Extensions shows that CXX_STD = CXX14 can be used to select this language standard in a package. Several build variables extend support to CXX1Y beyond the existing CXX1X; see the Writing R Extensions manual for full details.
To illustrate, on a machine with g++ version 6.*, nothing has to be turned on for C++14 in what will be R 3.4.0. On a machine where g++-5 is the default, the CXX1XSTD value may be empty as this compiler defaults to C++11; we would expect CXX1YSTD = -std=c++14 there (or the ‘gnu’ variant). As we noted above, well over one hundred packages on CRAN already use C++11. We expect to see C++14 being used once R 3.4.0 is released later this spring.

### Extensions on the Rcpp side: C++17

As we wrote in the opening paragraph, a plugin for C++11 has been part of Rcpp for several years. And a plugin for C++14 was added by Dan in Rcpp 0.12.4 about a year ago. The current development sources of Rcpp, corresponding to interim GitHub release 0.12.9.3, added a plugin for C++17, as experimental support exists in g++ and clang++ right now. With that, a suitably recent compiler, and a version of Rcpp that is at least release 0.12.9.3, the following example (motivated by this example section of the C++ Standard post) also builds.

### Summary

This note discussed where Rcpp stands with respect to “modern C++”. As a brief summary:

• Rcpp supports any C++ language standard the underlying compiler supports: C++98, C++11, C++14, C++17;
• Packages using Rcpp can deploy every language standard supported by R: currently C++98 and C++11, and very soon C++14;
• Package distribution may need to reflect the build infrastructure; on Windows this means g++-4.9.3 with near-complete C++11 support;
• Local development can be more experimental, and even C++17 is now supported by Rcpp as well;
• Portable packages should specify the C++ language standard they expect (unless it is C++98).

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery.
### padr::pad does now do group padding

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

A few weeks ago padr was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you have missed this, see the introduction blog or vignette("padr") for a general introduction. In v0.2.0 the pad function is extended with a group argument, which makes your life a lot easier when you want to do padding within groups. In the Examples of padr in v0.1.0 I showed that padding over multiple groups could be done by using padr in conjunction with dplyr and tidyr.

library(dplyr)
library(padr)
# padding a data.frame on group level
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp = rep(LETTERS[1:3], each = 4),
y = runif(12, 10, 20) %>% round(0),
date = sample(day_var, 12, TRUE)) %>%
arrange(grp, date)

x_df_grp %>% group_by(grp) %>% do(pad(.)) %>% ungroup %>% tidyr::fill(grp)

I quite quickly realized this is an unsatisfactory solution, for two reasons:

1) It is a hassle. It is the goal of padr to make datetime preparation as swift and pain free as possible. Having to manually fill your grouping variable(s) after padding is not exactly in concordance with that goal.

2) It does not work when one or both of the start_val and end_val arguments are specified. The start and/or the end of the time series of a group are then no longer bounded by an original observation, and thus don’t have a value of the grouping variable(s). Forward filling with tidyr::fill will incorrectly fill the grouping variable(s) as a result.
It was therefore necessary to expand pad, so that the grouping variable(s) no longer contain missing values after padding. The group argument takes a character vector with the column name(s) of the variables to group by. Padding will be done on each of the groups formed by the unique combination of the grouping variables. This is of course just the distinct of the variable, if there is only one grouping variable. The result of the date padding will be exactly the same as before this addition (meaning the datetime variable does not change). However, the returned data frame will no longer have missing values for the grouping variables on the padded rows. The new version of the section in the Examples of padr is:

day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each = 4),
grp2 = letters[1:2],
y = runif(12, 10, 20) %>% round(0),
date = sample(day_var, 12, TRUE)) %>%
arrange(grp1, grp2, date)

# pad by one grouping var
x_df_grp %>% pad(group = 'grp1')

# pad by two grouping vars
x_df_grp %>% pad(group = c('grp1', 'grp2'))

### Bug fixes

Besides the additional argument, there were two bug fixes in this version:

• pad no longer breaks when the datetime variable contains only one value. It returns x and a warning if start_val and end_val are NULL, and will do proper padding when one or both are specified.
• The fill_ functions now throw a meaningful error when you forget to specify at least one column to apply the function on.

### v0.2.1

Right before posting this blog, Doug Friedman found out that in v0.2.0 the by argument no longer functioned. This bug was fixed in the patch release v0.2.1. I hope you (still) enjoy working with padr; let me know when you find a bug or have ideas for improvement.

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.
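For readers outside R, the behaviour described above can be sketched in a few lines of pure Python (a rough illustration with helper names of my own, not padr's implementation): each group's monthly series is completed to a full month grid, and the inserted rows keep their group label instead of going missing.

```python
from datetime import date

def month_seq(start, end):
    # complete sequence of first-of-month dates from start to end
    months = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        months.append(date(y, m, 1))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

def pad_by_group(rows):
    # rows: (group, date, value) triples; pad each group's series to a
    # full month grid, keeping the group label on inserted rows
    out = []
    for g in sorted({grp for grp, _, _ in rows}):
        obs = {d: v for grp, d, v in rows if grp == g}
        for d in month_seq(min(obs), max(obs)):
            out.append((g, d, obs.get(d)))
    return out

rows = [("A", date(2016, 1, 1), 10),
        ("A", date(2016, 4, 1), 12),
        ("B", date(2016, 2, 1), 15)]
padded = pad_by_group(rows)
# Group A is padded to Jan..Apr (Feb and Mar get None); B keeps one row.
```

Note that each group is bounded by its own first and last observation here; padr's start_val and end_val extend those bounds, which is exactly why forward-filling the group label after the fact breaks down.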
## February 18, 2017

### If you did not already know: “Generalized Mallows Model Latent Dirichlet Allocation (GMM-LDA)”

Modeling document structure is of great importance for discourse analysis and related applications. The goal of this research is to capture the document intent structure by modeling documents as a mixture of topic words and rhetorical words. While the topics are relatively unchanged through one document, the rhetorical functions of sentences usually change following certain orders in discourse. We propose GMM-LDA, a topic modeling based Bayesian unsupervised model, to analyze the document intent structure cooperating with order information. Our model is flexible in that it has the ability to combine the annotations and do supervised learning. Additionally, entropic regularization can be introduced to model the significant divergence between topics and intents. We perform experiments in both unsupervised and supervised settings; results show the superiority of our model over several state-of-the-art baselines.
… Generalized Mallows Model Latent Dirichlet Allocation (GMM-LDA)

### Book Memo: “Simulation and the Monte Carlo Method”

This accessible new edition explores the major topics in Monte Carlo simulation that have arisen over the past 30 years and presents a sound foundation for problem solving. Simulation and the Monte Carlo Method, Third Edition reflects the latest developments in the field and presents a fully updated and comprehensive account of the state-of-the-art theory, methods and applications that have emerged in Monte Carlo simulation since the publication of the classic First Edition more than a quarter of a century ago. While maintaining its accessible and intuitive approach, this revised edition features a wealth of up-to-date information that facilitates a deeper understanding of problem solving across a wide array of subject areas, such as engineering, statistics, computer science, mathematics, and the physical and life sciences. The book begins with a modernized introduction that addresses the basic concepts of probability, Markov processes, and convex optimization. Subsequent chapters discuss the dramatic changes that have occurred in the field of the Monte Carlo method, with coverage of many modern topics including: Markov chain Monte Carlo, variance reduction techniques such as importance (re-)sampling and the transform likelihood ratio method, the score function method for sensitivity analysis, the stochastic approximation method and the stochastic counterpart method for Monte Carlo optimization, the cross-entropy method for rare event estimation and combinatorial optimization, and applications of Monte Carlo techniques for counting problems. An extensive range of exercises is provided at the end of each chapter, as well as a generous sampling of applied examples.
### Analytical and Numerical Solutions to Linear Regression Problems

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

This exercise focuses on linear regression with both analytical (normal equation) and numerical (gradient descent) methods. We will start with linear regression with one variable. From this part of the exercise, we will create plots that help to visualize how gradient descent gets the coefficient of the predictor and the intercept. In the second part, we will implement linear regression with multiple variables.

## Linear regression with one variable

In this part of the exercise, we will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. We would like to use this data to help us select which city to expand to next. The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. The first column refers to the population size in 10,000s and the second column refers to the profit in $10,000s.

library(ggplot2)
library(data.table)
library(magrittr)
library(caret)
library(fields)
library(plot3D)

Load the data and display the first 6 observations:

ex1data1 <- fread("ex1data1.txt",col.names=c("population","profit"))
head(ex1data1)
population  profit
1:     6.1101 17.5920
2:     5.5277  9.1302
3:     8.5186 13.6620
4:     7.0032 11.8540
5:     5.8598  6.8233
6:     8.3829 11.8860

### Plotting the Data

Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population). Many other problems that we encounter in real life are multi-dimensional and can’t be plotted on a 2-d plot.

ex1data1%>%ggplot(aes(x=population, y=profit))+
geom_point(color="blue",size=4,alpha=0.5)+
ylab('Profit in $10,000s')+
xlab('Population of City in 10,000s')+
ggtitle ('Figure 1: Scatter plot of training data')+
theme(plot.title = element_text(size = 16,colour="red"))

Before implementing anything, we add a column of ones to the population variable to account for the intercept term, and set up the design matrix X and the response vector y.

X=cbind(1,ex1data1$population)
y=ex1data1$profit
head(X)

     [,1]   [,2]
[1,]    1 6.1101
[2,]    1 5.5277
[3,]    1 8.5186
[4,]    1 7.0032
[5,]    1 5.8598
[6,]    1 8.3829

The function below calculates cost based on the equation given above.

computeCost=function(X,y,theta){
z=((X%*%theta)-y)^2
return(sum(z)/(2*nrow(X)))
}

Now, we can calculate the initial cost by initializing the parameters to 0.

theta=matrix(rep(0,ncol(X)))
round(computeCost(X,y,theta),2)

32.07

### Gradient descent

Next, we will implement gradient descent by calling the computeCost function above. A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. Our value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm. The gradient descent function below returns the cost in every iteration and the optimal parameters for the number of iterations and learning rate alpha specified.

gradientDescent=function(X, y, theta, alpha, iters){
gd=list()
cost=rep(0,iters)
for(k in 1:iters){
z=rep(0,ncol(X))
for(i in 1:ncol(X)){
for(j in 1:nrow(X)){
z[i]=z[i]+(((X[j,]%*%theta)-y[j])*X[j,i])
}
}
theta= theta-((alpha/nrow(X))*z)
cost[k]=computeCost(X,y,theta)
}
gd$theta= theta
gd$cost=cost
gd
}

Now, let’s use the gradientDescent function to find the parameters, making sure that our cost never increases. Let’s use 1500 iterations and a learning rate alpha of 0.01 for now. Later, we will see the effect of these values in our application.

iterations = 1500
alpha = 0.01
theta= matrix(rep(0, ncol(X)))
gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations)
theta=gradientDescent_results$theta
theta
[,1]
[1,] -3.630291
[2,]  1.166362
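The batch update implemented above, θ := θ − (α/m)·Xᵀ(Xθ − y), can also be checked on made-up data in a few lines of Python (an illustration of the same algorithm on hypothetical numbers of mine, not the article's R code): the cost history never increases and the parameters approach the true values.

```python
# Batch gradient descent for y ~ t0 + t1*x, mirroring the R update
# theta <- theta - (alpha/nrow(X)) * z. Made-up data with y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m = len(xs)

def cost(t0, t1):
    # J(theta) = (1/2m) * sum of squared residuals
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

t0 = t1 = 0.0
alpha = 0.05
history = [cost(t0, t1)]
for _ in range(2000):
    g0 = sum(t0 + t1 * x - y for x, y in zip(xs, ys)) / m
    g1 = sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m
    t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    history.append(cost(t0, t1))
# history is non-increasing and (t0, t1) converges to about (1, 2)
```

If alpha is pushed too high the cost history starts increasing instead, which is exactly the diagnostic the article recommends watching.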

### Plotting the cost function as a function of the number of iterations

We expect that the cost should be decreasing, or at least not increasing, if our implementation is correct. Let’s plot the cost function as a function of the number of iterations and make sure our implementation makes sense.

data.frame(Cost=gradientDescent_results$cost,Iterations=1:iterations)%>%
ggplot(aes(x=Iterations,y=Cost))+geom_line(color="blue")+
ggtitle("Cost as a function of number of iteration")+
theme(plot.title = element_text(size = 16,colour="red"))

Here is the plot:

As the plot above shows, the cost decreases with the number of iterations and gets close to convergence within 1500 iterations.

### Plot the linear fit

Now, since we know the parameters (slope and intercept), we can plot the linear fit on the scatter plot.

ex1data1%>%ggplot(aes(x=population, y=profit))+
geom_point(color="blue",size=3,alpha=0.5)+
ylab('Profit in $10,000s')+
xlab('Population of City in 10,000s')+
ggtitle ('Figure 1: Scatter plot of training data') +
geom_abline(intercept = theta[1], slope = theta[2],col="red",show.legend=TRUE)+
theme(plot.title = element_text(size = 16,colour="red"))+
annotate("text", x = 12, y = 20, label = paste0("Profit = ",round(theta[1],4),"+",round(theta[2],4)," * Population"))

Here is the plot:

### Visualizing J(θ)

To understand the cost function J(θ) better, let’s now plot the cost over a 2-dimensional grid of θ0 and θ1 values. The global minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.

Intercept=seq(from=-10,to=10,length=100)
Slope=seq(from=-1,to=4,length=100)
# initialize cost values to a matrix of 0's
Cost = matrix(0,length(Intercept), length(Slope));
for(i in 1:length(Intercept)){
for(j in 1:length(Slope)){
t = c(Intercept[i],Slope[j])
Cost[i,j]= computeCost(X, y, t)
}
}
persp3D(Intercept,Slope,Cost,theta=-45, phi=25, expand=0.75,lighting = TRUE,
ticktype="detailed", xlab="Intercept", ylab="Slope",
zlab="",axes=TRUE, main="Surface")
image.plot(Intercept,Slope,Cost, main="Contour, showing minimum")
points(theta[1],theta[2],col='red',pch=4,lwd=6)

Here is the plot:

### Normal Equation

Since linear regression has a closed-form solution, we can also solve it analytically; this is called the normal equation. It is given by the formula below. We do not need to iterate or choose a learning rate. However, we need to calculate the inverse of a matrix, which makes it slow when the number of records is very large. Gradient descent is applicable to other machine learning techniques as well, and the gradient descent method is more appropriate even for linear regression when the number of observations is very large.
$\theta = (X^TX)^{-1}X^Ty$

theta2=solve((t(X)%*%X))%*%t(X)%*%y
theta2
[,1]
[1,] -3.895781
[2,]  1.193034
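As a cross-check on the formula $\theta = (X^TX)^{-1}X^Ty$, here is a tiny pure-Python version for simple regression on made-up data of mine (y = 1 + 2x, not the food-truck data): with one feature plus an intercept, the 2×2 system can be inverted by hand.

```python
# Closed-form fit: with one feature plus an intercept, X^T X is a
# 2x2 matrix, so (X^T X)^{-1} X^T y reduces to the textbook formulas.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
n = len(xs)

sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))

det = n * sxx - sx * sx                      # determinant of X^T X
intercept = (sxx * sy - sx * sxy) / det
slope = (n * sxy - sx * sy) / det
# One shot, no iterations: intercept = 1.0, slope = 2.0
```

Unlike gradient descent, there is no learning rate to tune and no convergence to monitor, at the price of a matrix inversion that scales poorly with many features.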

There is a very small difference between the parameters we got from the normal equation and those from gradient descent. Let’s increase the number of iterations and see if they get closer to each other. I increased the number of iterations from 1500 to 15000.

iterations = 15000
alpha = 0.01
theta= matrix(rep(0, ncol(X)))
gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations)
theta=gradientDescent_results$theta
theta

          [,1]
[1,] -3.895781
[2,]  1.193034

As you can see, now the results from the normal equation and gradient descent are the same.

### Using caret package

By the way, we can use packages developed by experts in the field to perform our machine learning tasks. There are many machine learning packages in R for different types of machine learning tasks. To verify that we get the same results, let’s use the caret package, which is among the most commonly used machine learning packages in R.

my_lm <- train(profit~population, data=ex1data1,method = "lm")
my_lm$finalModel$coefficients

(Intercept) -3.89578087831187
population  1.1930336441896

## Linear regression with multiple variables

In this part of the exercise, we will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

ex1data2 <- fread("ex1data2.txt",col.names=c("size","bedrooms","price"))
head(ex1data2)

   size bedrooms  price
1: 2104        3 399900
2: 1600        3 329900
3: 2400        3 369000
4: 1416        2 232000
5: 3000        4 539900
6: 1985        4 299900

### Feature Normalization

House sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.

ex1data2=as.data.frame(ex1data2)
for(i in 1:(ncol(ex1data2)-1)){
ex1data2[,i]=(ex1data2[,i]-mean(ex1data2[,i]))/sd(ex1data2[,i])
}

### Gradient Descent

Previously, we implemented gradient descent on a univariate regression problem.
The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

X=cbind(1,ex1data2$size,ex1data2$bedrooms)
y=ex1data2$price
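The z-score scaling used in the feature normalization step is easy to sketch in pure Python (using the six house sizes and bedroom counts shown in head(ex1data2)); after scaling, every column has mean 0 and sample standard deviation 1, so no single feature dominates the gradient.

```python
import statistics

# z-score each column: subtract the mean, divide by the sample sd,
# mirroring (x - mean(x)) / sd(x) in the R loop.
def normalize(col):
    mu, sd = statistics.mean(col), statistics.stdev(col)
    return [(v - mu) / sd for v in col]

sizes = [2104, 1600, 2400, 1416, 3000, 1985]
bedrooms = [3, 3, 3, 2, 4, 4]
z_sizes, z_bedrooms = normalize(sizes), normalize(bedrooms)
# Each scaled column now has mean 0 and sample sd 1, so house size no
# longer dwarfs the bedroom count in the gradient updates.
```

Note that statistics.stdev, like R's sd, is the sample standard deviation (dividing by n − 1), so the two computations agree.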

iterations = 6000
alpha = 0.01
theta= matrix(rep(0, ncol(X)))
gradientDescent_results=gradientDescent(X,y,theta,alpha,iterations)
theta=gradientDescent_results$theta

The first rows of the design matrix X (after feature normalization, with the added intercept column) look like this:

     [,1]        [,2]       [,3]
[1,]    1  0.13000987 -0.2236752
[2,]    1 -0.50418984 -0.2236752
[3,]    1  0.50247636 -0.2236752
[4,]    1 -0.73572306 -1.5377669
[5,]    1  1.25747602  1.0904165
[6,]    1 -0.01973173  1.0904165

And the parameters from gradient descent are:

theta

           [,1]
[1,] 340412.660
[2,] 110631.050
[3,]  -6649.474

### Normal Equation

theta2=solve((t(X)%*%X))%*%t(X)%*%y
theta2

           [,1]
[1,] 340412.660
[2,] 110631.050
[3,]  -6649.474

### Using caret package

ex1data2 <- fread("ex1data2.txt",col.names=c("size","bedrooms","price"))
my_lm <- train(price~size+bedrooms, data=ex1data2,method = "lm",
preProcess = c("center","scale"))
my_lm$finalModel$coefficients

(Intercept) 340412.659574468
size        110631.050278846
bedrooms    -6649.4742708198

## Summary

In this post, we saw how to implement numerical and analytical solutions to linear regression problems using R. We also used caret, the famous R machine learning package, to verify our results. The data sets are from the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave; I am doing the exercises in that course with R. You can get the code from this Github repository.

Related Post

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

### Magister Dixit

“Organizations that use ML to make user-impacting decisions must be able to fully explain the data and algorithms that resulted in a particular decision.” European Union’s General Data Protection Regulation (GDPR) ( Dec.
2016 )

### Putting It All Together

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The kind folks over at @RStudio gave a nod to my recently CRAN-released epidata package in their January data package roundup, and I thought it might be useful to give it one more showcase using the recent CRAN update to ggalt and the new hrbrthemes (github-only for now) packages.

### Labor force participation rate

The U.S. labor force participation rate (LFPR) is an oft-overlooked and under- or mis-reported economic indicator. I’ll borrow the definition from Investopedia:

The participation rate is a measure of the active portion of an economy’s labor force. It refers to the number of people who are either employed or are actively looking for work. During an economic recession, many workers often get discouraged and stop looking for employment, resulting in a decrease in the participation rate.

Population age distributions and other factors are necessary to honestly interpret this statistic. Parties in power usually dismiss/ignore this statistic outright and their opponents tend to wholly embrace it for criticism (it’s an easy target if you’re naive). “Yay” partisan democracy.

Since the LFPR has nuances when looked at categorically, let’s take a look at it by attained education level to see how that particular view has changed over time (at least since the gov-quants have been tracking it). We can easily grab this data with epidata::get_labor_force_participation_rate() (and we’ll set up some library() calls while we’re at it):

library(epidata)
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(ggalt)
library(tidyverse)
library(stringi)

part_rate <- get_labor_force_participation_rate("e")

glimpse(part_rate)

## Observations: 457
## Variables: 7
## $ date              <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 1979-05-01...
## $all <dbl> 0.634, 0.634, 0.635, 0.636, 0.636, 0.637, 0.637, 0.637, 0.638, 0.638, 0... ##$ less_than_hs      <dbl> 0.474, 0.475, 0.475, 0.475, 0.475, 0.474, 0.474, 0.473, 0.473, 0.473, 0...
## $high_school <dbl> 0.690, 0.691, 0.692, 0.692, 0.693, 0.693, 0.694, 0.694, 0.695, 0.696, 0... ##$ some_college      <dbl> 0.709, 0.710, 0.711, 0.712, 0.712, 0.713, 0.712, 0.712, 0.712, 0.712, 0...
## $bachelor's_degree <dbl> 0.771, 0.772, 0.772, 0.773, 0.772, 0.772, 0.772, 0.772, 0.772, 0.773, 0... ##$ advanced_degree   <dbl> 0.847, 0.847, 0.848, 0.848, 0.848, 0.848, 0.847, 0.847, 0.848, 0.848, 0...

One of the easiest things to do is to use ggplot2 to make a faceted line chart by attained education level. But let’s change the labels so they are a bit easier on the eyes in the facets, and switch the facet order from alphabetical to something more useful:

```r
gather(part_rate, category, rate, -date) %>%
  mutate(category=stri_replace_all_fixed(category, "_", " "),
         category=stri_trans_totitle(category),
         category=stri_replace_last_regex(category, "Hs$", "High School"),
         category=factor(category, levels=c("Advanced Degree", "Bachelor's Degree",
                                            "Some College", "High School",
                                            "Less Than High School", "All"))) -> part_rate
```

Now, we’ll make a simple line chart, tweaking the aesthetics just a bit:

```r
ggplot(part_rate) +
  geom_line(aes(date, rate, group=category)) +
  scale_y_percent(limits=c(0.3, 0.9)) +
  facet_wrap(~category, scales="free") +
  labs(x=paste(format(range(part_rate$date), "%Y-%b"), collapse=" to "),
       y="Participation rate (%)",
       title="U.S. Labor Force Participation Rate",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="XY", axis="XY")
```

The “All” view is interesting in that the LFPR has held fairly “steady” between 60% & 70%. Those individual and fractional percentage points actually translate to real humans, so the “minor” fluctuations do matter.
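To put those “minor” fluctuations in human terms, a quick back-of-the-envelope calculation helps (the population figure below is an illustrative, rounded assumption, not a number from the post or the epidata package):

```r
# How many people does a 0.5 percentage-point move in the LFPR represent?
# NOTE: civ_pop is an assumed, rounded figure for the U.S. civilian
# noninstitutional population, used only for illustration.
civ_pop <- 250e6
delta   <- 0.005                 # 0.5 percentage points
people  <- civ_pop * delta
people                           # 1.25 million people
```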

It’s also interesting to see the direct contrast between the starting historical rate and the current rate (you could also do the same with min/max rates, etc.). We can use a “dumbbell” chart to compare the 1978 value to today’s value, but we’ll need to reshape the data a bit first:

```r
group_by(part_rate, category) %>%
  arrange(date) %>%
  slice(c(1, n())) %>%
  ungroup() %>%
  spread(date, rate) %>%   # one column per endpoint date for the dumbbell
  filter(category != "All") %>%
  mutate(category=factor(category, levels=rev(levels(category)))) -> rate_range
```
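The `group_by()` + `slice(c(1, n()))` idiom above keeps just the first and last row of each category; a tiny sketch with made-up data (not the EPI series) shows the effect:

```r
library(dplyr)

# Toy stand-in for the long-format participation data (values are invented)
df <- tibble(
  category = rep(c("a", "b"), each = 3),
  date     = c(1, 2, 3, 1, 2, 3),
  rate     = c(0.50, 0.60, 0.70, 0.80, 0.75, 0.70)
)

first_last <- df %>%
  group_by(category) %>%
  arrange(date) %>%
  slice(c(1, n())) %>%   # row 1 and row n() within each group
  ungroup()

nrow(first_last)         # 4: the first & last observation per category
```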

```r
filter(part_rate, category == "All") %>%   # the "All" series supplies the year labels
  arrange(date) %>%
  slice(c(1, n())) %>%
  mutate(lab=lubridate::year(date)) -> lab_df
```

(We’ll be using the extra data frame to add labels to the chart.)

Now, we can compare the various ranges, once again tweaking aesthetics a bit:

```r
ggplot(rate_range) +
  geom_dumbbell(aes(y=category, x=`1978-12-01`, xend=`2016-12-01`),
                size=3, color="#e3e2e1",
                colour_x="#5b8124", colour_xend="#bad744",
                dot_guide=TRUE, dot_guide_size=0.25) +
  geom_text(data=lab_df, aes(x=rate, y=5.25, label=lab), vjust=0) +
  scale_x_percent(limits=c(0.375, 0.9)) +
  labs(x=NULL, y=NULL,
       title=sprintf("U.S. Labor Force Participation Rate %s-Present", lab_df$lab[1]),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="X")
```

### Fin

One takeaway from both these charts is that it’s probably important to take education level into account when talking about the labor force participation rate. The get_labor_force_participation_rate() function — along with most other functions in the epidata package — also has options to factor the data by sex, race and age, so you can pull in all those views to get a more nuanced & informed understanding of this economic health indicator.
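The function argument shown earlier ("e" for attained education) suggests similar single-letter codes for the other demographic cuts; the codes below are assumptions for illustration, so check the package help before relying on them:

```r
# Assumed demographic codes for epidata::get_labor_force_participation_rate(),
# mirroring the "e" (education) code used earlier in the post.
# These are illustrative guesses -- verify with ?get_labor_force_participation_rate.
demo_codes <- c(gender = "g", race = "r", age = "a", education = "e")

# e.g., pull the participation rate broken out by gender (network call):
# part_rate_g <- get_labor_force_participation_rate(demo_codes[["gender"]])
```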


### Pizzagate and Kahneman, two great flavors etc.

1. The pizzagate story (of Brian Wansink, the Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years”) keeps developing.

Last week someone forwarded me an email from the deputy dean of the Cornell business school regarding concerns about some of Wansink’s work. This person asked me to post the letter (which he assured me “was written with the full expectation that it would end up being shared”) but I wasn’t so interested in this institutional angle so I passed it along to Retraction Watch, along with links to Wansink’s contrite note and a new post by Jordan Anaya listing some newly-discovered errors in yet another paper by Wansink.

Since then, Retraction Watch ran an interview with Wansink, in which the world-renowned eating behavior expert continued with a mixture of contrition and evasion, along with insights into his workflow, for example this:

Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.

This is weird for two reasons. First, how do you say “we realized we asked . . .”? What’s to realize? If you asked the question that way, wouldn’t you already know this? Second, who eats 12 pieces of pizza? I guess they must be really small pieces!

Wansink also pulls one out of the Bargh/Baumeister/Cuddy playbook:

Across all sorts of studies, we’ve had really high replication of our findings by other groups and other studies. This is particularly true with field studies. One reason some of these findings are cited so much is because other researchers find the same types of results.

Ummm . . . I’ll believe it when I see the evidence. And not before.

In our struggle to understand Wansink’s mode of operation, I think we should start from the position that he’s not trying to cheat; rather, he just doesn’t know what he’s doing. Think of it this way: it’s possible that he doesn’t write the papers that get published, he doesn’t produce the tables with all the errors, he doesn’t analyze the data, maybe he doesn’t even collect the data. I have no idea who was out there passing out survey forms in the pizza restaurant—maybe some research assistants? He doesn’t design the survey forms—that’s how it is that he just realized that they asked that bizarre 0-to-12-pieces-of-pizza question. Also he’s completely out of the loop on statistics. When it comes to stats, this guy makes Satoshi Kanazawa look like Uri Simonsohn. That explains why his response to questions about p-hacking or harking was, “Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it.”

What Wansink has been doing for several years is organizing studies, making sure they get published, and doing massive publicity. For years and years and years, he’s been receiving almost nothing but positive feedback. (Yes, five years ago someone informed his lab of serious, embarrassing flaws in one of his papers, but apparently that inquiry was handled by one of his postdocs. So maybe the postdoc never informed Wansink of the problem, or maybe Wansink just thought this was a one-off in his lab, somebody else’s problem, and ignored it.)

When we look at things from the perspective of Wansink receiving nothing but acclaim for so many years and from so many sources (from students and postdocs in his lab, students in his classes, the administration of Cornell University, the U.S. government, news media around the world, etc., not to mention the continuing flow of accepted papers in peer-reviewed journals), the situation becomes more clear. It would be a big jump for him to accept that this is all a house of cards, that there’s no there there, etc.

Here’s an example of how this framing can help our understanding:

Someone emailed this question to me regarding that original “failed study” that got the whole ball rolling:

I’m still sort of surprised that they weren’t able to p-hack the original hypothesis, which was presumably some correlate with the price paid (either perceived quality, or amount eaten, or time spent eating, or # trips to the bathroom, or …).

My response:

I suspect the answer is that Wansink was not “p-hacking” or trying to game the system. My guess is that he’s legitimately using these studies to inform his thinking–that is, he forms many of his hypotheses and conclusions based on his data. So when he was expecting to see X, but he didn’t see X, he learned something! (Or thought he learned something; given the noise level in his experiments, it might be that his original hypothesis happened to be true, irony of ironies.) Sure, if he’d seen X at p=0.06, I expect he would’ve been able to find a way to get statistical significance, but when X didn’t show up at all, he saw it as a failed study. So, from Wansink’s point of view, the later work by the student really did have value in that they learned something new from their data.

I really don’t like the “p-hacking” frame because it “gamifies” the process in a way that I don’t think is always appropriate. I prefer the “forking paths” analogy: Wansink and his students went down one path that led nowhere, then they tried other paths.

2. People keep pointing me to a recent statement by Daniel Kahneman in a comment on a blog by Ulrich Schimmack, Moritz Heene, and Kamini Kesavan, who wrote that the “priming research” of Bargh and others that was featured in Kahneman’s book “is a train wreck” and should not be considered “as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.” Here’s Kahneman:

I accept the basic conclusions of this blog. To be clear, I do so (1) without expressing an opinion about the statistical techniques it employed and (2) without stating an opinion about the validity and replicability of the individual studies I cited.

What the blog gets absolutely right is that I placed too much faith in underpowered studies. As pointed out in the blog, and earlier by Andrew Gelman, there is a special irony in my mistake because the first paper that Amos Tversky and I published was about the belief in the “law of small numbers,” which allows researchers to trust the results of underpowered studies with unreasonably small samples. We also cited Overall (1969) for showing “that the prevalence of studies deficient in statistical power is not only wasteful but actually pernicious: it results in a large proportion of invalid rejections of the null hypothesis among published results.” Our article was written in 1969 and published in 1971, but I failed to internalize its message.

My position when I wrote “Thinking, Fast and Slow” was that if a large body of evidence published in reputable journals supports an initially implausible conclusion, then scientific norms require us to believe that conclusion. Implausibility is not sufficient to justify disbelief, and belief in well-supported scientific conclusions is not optional. This position still seems reasonable to me – it is why I think people should believe in climate change. But the argument only holds when all relevant results are published.

I knew, of course, that the results of priming studies were based on small samples, that the effect sizes were perhaps implausibly large, and that no single study was conclusive on its own. What impressed me was the unanimity and coherence of the results reported by many laboratories. I concluded that priming effects are easy for skilled experimenters to induce, and that they are robust. However, I now understand that my reasoning was flawed and that I should have known better. Unanimity of underpowered studies provides compelling evidence for the existence of a severe file-drawer problem (and/or p-hacking). The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for broad hypotheses: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions. A case can therefore be made for priming on this indirect evidence. But I have changed my views about the size of behavioral priming effects – they cannot be as large and as robust as my chapter suggested.

I am still attached to every study that I cited, and have not unbelieved them, to use Daniel Gilbert’s phrase. I would be happy to see each of them replicated in a large sample. The lesson I have learned, however, is that authors who review a field should be wary of using memorable results of underpowered studies as evidence for their claims.

Following up on Kahneman’s remarks, neuroscientist Jeff Bowers added:

There is another reason to be sceptical of many of the social priming studies. You [Kahneman] wrote:

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions.

However, there is an important constraint on subliminal priming that needs to be taken into account. That is, such effects are very short-lived, on the order of seconds. So any claim that a masked prime affects behavior for an extended period of time seems at odds with these more basic findings. Perhaps social priming is more powerful than basic cognitive findings, but it does raise questions. Here is a link to an old paper showing that masked *repetition* priming is short-lived. Presumably semantic effects will be even more transient.

And psychologist Hal Pashler followed up:

One might ask if this is something about repetition priming, but associative semantic priming is also fleeting. In our JEP:G paper failing to replicate money priming we noted:

For example, Becker, Moscovitch, Behrmann, and Joordens (1997) found that lexical decision priming effects disappeared if the prime and target were separated by more than 15 seconds, and similar findings were reported by Meyer, Schvaneveldt, and Ruddy (1972). In brief, classic priming effects are small and transient even if the prime and measure are strongly associated (e.g., NURSE-DOCTOR), whereas money priming effects are [purportedly] large and relatively long-lasting even when the prime and measure are seemingly unrelated (e.g., a sentence related to money and the desire to be alone).

Kahneman’s statement is stunning because it seems so difficult for people to admit their mistakes, and in this case he’s not just saying he got the specifics wrong, he’s pointing to a systematic error in his ways of thinking.

You don’t have to be Thomas W. Kuhn to know that you can learn more from failure than success, and that a key way forward is to push push push to understand anomalies. Not to sweep them under the rug but to face them head-on.

3. Now return to Wansink. He’s in a tough situation. His career is based on publicity, and now he has bad publicity. And there no easy solution for him, as once he starts to recognize problems with his research methods, the whole edifice collapses. Similarly for Baumeister, Bargh, Cuddy, etc. The cost of admitting error is so high that they’ll go to great lengths to avoid facing the problems in their research.

It’s easier for Kahneman to admit his errors because, yes, this does suggest that some of the ideas behind “heuristics and biases” or “behavioral economics” have been overextended (yes, I’m looking at you, claims of voting and political attitudes being swayed by shark attacks, college football, and subliminal smiley faces), but his core work with Tversky is not threatened. Similarly, I can make no-excuses corrections of my paper that was wrong because of our data coding error, and my other paper with the false theorem.

P.S. Hey! I just realized that the above examples illustrate two of Clarke’s three laws.

The post Pizzagate and Kahneman, two great flavors etc. appeared first on Statistical Modeling, Causal Inference, and Social Science.

### cricketr and yorkr books – Paperback now in Amazon

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

My books are now available on Amazon in both Paperback and Kindle versions:

- Cricket Analytics with cricketr
- Beaten by sheer pace!: Cricket analytics with yorkr

The cricketr and yorkr packages are written in R, and both are available on CRAN. The books contain details on how to use these R packages to analyze the performance of cricketers.

cricketr is based on data from ESPN Cricinfo Statsguru and can analyze Test, ODI and T20 batsmen & bowlers. yorkr is based on data from Cricsheet and can analyze ODI, T20 and IPL matches; it covers batsmen, bowlers, matches and teams.

Cricket Analytics with cricketr
You can access the paperback at Cricket analytics with cricketr

Beaten by sheer pace! Cricket Analytics with yorkr
You can buy the paperback from Amazon at Beaten by sheer pace: Cricket analytics with yorkr