# My Data Science Blogs

## January 22, 2018

### Call for participants: Workshop and Advanced school "Statistical physics and machine learning back together" in Cargese, Corsica, France, August 20-31, 2018

Lenka sent me the following the other day:

Dear Colleagues and Friends,

This is the announcement and call for participants of the workshop and advanced school "Statistical physics and machine learning back together", which will take place in Cargese, Corsica, France, during August 20-31, 2018. Please forward this to your colleagues and students who may be interested.

Researchers, students and postdocs interested in participating in the event are invited to apply on the website http://cargese.krzakala.org
(or http://www.lps.ens.fr/~krzakala/WEBSITE_Cargese2018/home.htm ) by February 28th, 2018.

The capacity of the Cargese amphitheatre is limited; due to this constraint, participants will be selected from the applicants.

The main goal of this event is to gather the community of researchers working on questions that relate in some way statistical physics and high dimensional statistical inference and learning. The format will be several (~10) 3h introductory lectures, and about thrice as many invited talks.

The topics include:
• Energy/loss landscapes in disordered systems, machine learning and inference problems
• Computational and statistical thresholds and trade-offs
• Theory of artificial multilayer neural networks
• Rigorous approaches to spin glasses and related models of statistical inference
• Parallels between optimisation algorithms and dynamics in physics
• Vindicating the replica and cavity method rigorously
• Current trends in variational Bayes inference
• Developments in message passing algorithms
• Applications of machine learning in physics
• Information processing in biological systems

Lecturers:
• Gerard Ben Arous (Courant Institute)
• Giulio Biroli (CEA Saclay, France)
• Nicolas Brunel (Duke University)
• Yann LeCun (Courant Institute and Facebook)
• Michael Jordan (UC Berkeley)
• Stephane Mallat (ENS et college de France)
• Andrea Montanari (Stanford)
• Dmitry Panchenko (University of Toronto, Canada)
• Sundeep Rangan (New York University)
• Riccardo Zecchina (Politecnico Turin, Italy)

Speakers:
• Antonio C Auffinger (Northwestern University)
• Afonso Bandeira (Courant Institute, NYU)
• Jean Barbier (Queen Mary, London)
• Quentin Berthet (Cambridge UK)
• Jean-Philippe Bouchaud (CFM, Paris)
• Joan Bruna (Courant Institute, NYU)
• Patrick Charbonneau (Duke)
• Amir Dembo (Stanford)
• Allie Fletcher (UCLA)
• Silvio Franz (Paris-Orsay)
• Surya Ganguli (Stanford)
• Alice Guionnet (ENS Lyon)
• Aukosh Jagganath (Harvard)
• Yoshiyuki Kabashima (Tokyo Tech)
• Christina Lee (MIT)
• Marc Lelarge (ENS, Paris)
• Tengyu Ma (Princeton)
• Marc Mezard (ENS, Paris)
• Leo Miolane (ENS, Paris)
• Remi Monasson (ENS, Paris)
• Cristopher Moore (Santa Fe Institute)
• Giorgio Parisi (Roma La Sapienza)
• Will Perkins (Birmingham)
• Federico Ricci-Tersenghi (Roma La Sapienza)
• Cindy Rush (Columbia Univ.)
• Levent Sagun (CEA Saclay)
• S. S. Schoenholz (Google Brain)
• Phil Schniter (Ohio State University)
• David Jason Schwab (Northwestern University)
• Guilhem Semerjian (ENS, Paris)
• Alexandre Tkatchenko (University of Luxembourg)
• Naftali Tishby (Hebrew University)
• Pierfrancesco Urbani (CNRS, Paris)
• Francesco Zamponi (ENS, Paris)

With best regards the organizers

Florent Krzakala and Lenka Zdeborova

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.


### Four short links: 22 January 2018

Corporate Surveillance, Crowds and Decisions, Crawling Robot Infant, and Personal Data Representatives

1. Corporate Surveillance in Everyday Life -- a quite detailed report on how thousands of companies monitor, analyze, and influence the lives of billions. Who are the main players in today’s digital tracking? What can they infer from our purchases, phone calls, web searches, and Facebook likes? How do online platforms, tech companies, and data brokers collect, trade, and make use of personal data? (via BoingBoing)
2. Smaller Crowds Outperform Larger Crowds and Individuals in Realistic Task Conditions -- We derive this nonmonotonic relationship between group size and accuracy from the Condorcet jury theorem and use simulations and further analyses to show that it holds under a variety of assumptions. We further show that situations favoring moderately sized groups occur in a variety of real-life situations, including political, medical, and financial decisions and general knowledge tests. These results have implications for the design of decision-making bodies at all levels of policy. Take with the usual kilogram-sized pinch of social science salt. (via Marginal Revolution)
3. Robotic Crawling Infant -- the least-cute infant robot I've seen this year. The research is into what infants pick up from the carpet/ground as they crawl. (via IEEE Spectrum)
4. Personal Data Representatives: An Idea (Tom Steinberg) -- it is time to allow people to nominate trusted representatives who can make decisions about our personal data for us, so that we can get on with our lives.
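The claim in item 2 rests on the Condorcet jury theorem, which is easy to compute exactly in its simplest setting: n independent voters, each correct with probability p, deciding by simple majority. A minimal base-R sketch of that setting (homogeneous voters only; the paper's nonmonotonic group-size effect requires richer assumptions, such as varying task difficulty):

```r
# Probability that a simple majority of n independent voters is correct,
# when each voter is correct with probability p (n odd, so no ties).
majority_accuracy <- function(n, p) {
    k <- (n + 1) / 2                     # votes needed for a majority
    sum(dbinom(k:n, size = n, prob = p))
}

# Condorcet: for p > 1/2, accuracy grows with group size ...
sapply(c(1, 3, 11, 101), majority_accuracy, p = 0.6)

# ... and for p < 1/2 it shrinks, so a bigger crowd is not automatically better.
sapply(c(1, 3, 11, 101), majority_accuracy, p = 0.4)
```

For p above one half accuracy increases with n, and below one half it decreases; the paper's point is that once tasks vary in difficulty, moderately sized groups can dominate both individuals and large crowds.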

### Top Stories, Jan 15-21: The Value of Semi-Supervised Machine Learning; A Day in the Life of an AI Developer

Also: Managing Machine Learning Workflows with Scikit-learn Pipelines Part 2: Integrating Grid Search; Generative Adversarial Networks, an overview; Learning Curves for Machine Learning; Top 10 TED Talks for Data Scientists and Machine Learning Engineers

### Surprise, the world was warmer again in 2017

According to NASA estimates, 2017 was the second warmest year on record since 1880. Henry Fountain, Jugal K. Patel, and Nadja Popovich reporting for The New York Times:

What made the numbers unexpected was that last year had no El Niño, a shift in tropical Pacific weather patterns that is usually linked to record-setting heat and that contributed to record highs the previous two years. In fact, last year should have benefited from a weak version of the opposite phenomenon, La Niña, which is generally associated with lower atmospheric temperatures.

### Data to identify Wikipedia rabbit holes

The Wikimedia Foundation’s Analytics team is releasing a monthly clickstream dataset. The dataset represents—in aggregate—how readers reach a Wikipedia article and navigate to the next. Previously published as a static release, this dataset is now available as a series of monthly data dumps for English, Russian, German, Spanish, and Japanese Wikipedias.

### Document worth reading: “An Introduction to the Practical and Theoretical Aspects of Mixture-of-Experts Modeling”

Mixture-of-experts (MoE) models are a powerful paradigm for modeling of data arising from complex data generating processes (DGPs). In this article, we demonstrate how different MoE models can be constructed to approximate the underlying DGPs of arbitrary types of data. Due to the probabilistic nature of MoE models, we propose the maximum quasi-likelihood (MQL) estimator as a method for estimating MoE model parameters from data, and we provide conditions under which MQL estimators are consistent and asymptotically normal. The blockwise minorization-maximization (blockwise-MM) algorithm framework is proposed as an all-purpose method for constructing algorithms for obtaining MQL estimators. An example derivation of a blockwise-MM algorithm is provided. We then present a method for constructing information criteria for estimating the number of components in MoE models and provide justification for the classic Bayesian information criterion (BIC). We explain how MoE models can be used to conduct classification, clustering, and regression and we illustrate these applications via a pair of worked examples. An Introduction to the Practical and Theoretical Aspects of Mixture-of-Experts Modeling

### If you did not already know

Triangle Generative Adversarial Network ($\Delta$-GAN)
A Triangle Generative Adversarial Network ($\Delta$-GAN) is developed for semi-supervised cross-domain joint distribution matching, where the training data consists of samples from each domain, and supervision of domain correspondence is provided by only a few paired samples. $\Delta$-GAN consists of four neural networks, two generators and two discriminators. The generators are designed to learn the two-way conditional distributions between the two domains, while the discriminators implicitly define a ternary discriminative function, which is trained to distinguish real data pairs and two kinds of fake data pairs. The generators and discriminators are trained together using adversarial learning. Under mild assumptions, in theory the joint distributions characterized by the two generators concentrate to the data distribution. In experiments, three different kinds of domain pairs are considered, image-label, image-image and image-attribute pairs. Experiments on semi-supervised image classification, image-to-image translation and attribute-based image generation demonstrate the superiority of the proposed approach. …

BoostJet
Recommenders have become widely popular in recent years because of their broader applicability in many e-commerce applications. These applications rely on recommenders for generating advertisements for various offers or providing content recommendations. However, the quality of the generated recommendations depends on user features (like demography, temporality), offer features (like popularity, price), and user-offer features (like implicit or explicit feedback). Current state-of-the-art recommenders do not explore such diverse features concurrently while generating the recommendations. In this paper, we first introduce the notion of Trackers which enables us to capture the above-mentioned features and thus incorporate users’ online behaviour through statistical aggregates of different features (demography, temporality, popularity, price). We also show how to capture offer-to-offer relations, based on their consumption sequence, leveraging neural embeddings for offers in our Offer2Vec algorithm. We then introduce BoostJet, a novel recommender which integrates the Trackers along with the neural embeddings using MatrixNet, an efficient distributed implementation of gradient boosted decision tree, to improve the recommendation quality significantly. We provide an in-depth evaluation of BoostJet on Yandex’s dataset, collecting online behaviour from tens of millions of online users, to demonstrate the practicality of BoostJet in terms of recommendation quality as well as scalability. …

Boruta
Machine learning methods are often used to classify objects described by hundreds of attributes; in many applications of this kind a great fraction of attributes may be totally irrelevant to the classification problem. Moreover, usually one cannot decide a priori which attributes are relevant. In this paper we present an improved version of the algorithm for identification of the full set of truly important variables in an information system. It is an extension of the random forest method which utilises the importance measure generated by the original algorithm. It compares, in an iterative fashion, the importances of original attributes with the importances of their randomised copies. We analyse performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem, namely on identification of the sequence motifs that are important for aptameric activity of short RNA sequences. …
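The shadow-attribute idea behind Boruta can be sketched without the random forest machinery: duplicate each attribute, permute the copies to destroy any real signal, and keep only the original attributes whose importance beats the best shadow. A toy base-R sketch using absolute correlation as a stand-in importance measure (the real algorithm iterates this with random forest importances, which is what lets it reject null attributes reliably):

```r
set.seed(1)
n <- 500
x_real  <- rnorm(n)                      # truly relevant attribute
x_noise <- rnorm(n)                      # irrelevant attribute
y <- x_real + rnorm(n, sd = 0.5)         # outcome depends only on x_real

X <- data.frame(x_real = x_real, x_noise = x_noise)

# Shadow attributes: permuted copies of every column
shadows <- as.data.frame(lapply(X, sample))
names(shadows) <- paste0("shadow_", names(X))

importance <- function(col) abs(cor(col, y))
imp_real   <- sapply(X, importance)
imp_shadow <- sapply(shadows, importance)

# Keep attributes that beat the strongest shadow; a null attribute can
# occasionally sneak past a single-shot comparison, hence Boruta's iteration.
keep <- names(imp_real)[imp_real > max(imp_shadow)]
keep
```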

### Book Memo: “Nature-Inspired Algorithms and Applied Optimization”

 This book reviews the state-of-the-art developments in nature-inspired algorithms and their applications in various disciplines, ranging from feature selection and engineering design optimization to scheduling and vehicle routing. It introduces each algorithm and its implementation with case studies as well as extensive literature reviews, and also includes self-contained chapters featuring theoretical analyses, such as convergence analysis and no-free-lunch theorems so as to provide insights into the current nature-inspired optimization algorithms. Topics include ant colony optimization, the bat algorithm, B-spline curve fitting, cuckoo search, feature selection, economic load dispatch, the firefly algorithm, the flower pollination algorithm, knapsack problem, octonian and quaternion representations, particle swarm optimization, scheduling, wireless networks, vehicle routing with time windows, and maximally different alternatives. This timely book serves as a practical guide and reference resource for students, researchers and professionals.

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

One of the great features of R is the possibility to quickly access web-services. While some companies have the habit and policy to document their APIs, there is still a large chunk of undocumented but great web-services that help the regular data scientist.

In the following short post, I will show how we can turn a simple web-service into a nice R function.
The example I am going to use is the translation service from the Linguee team: DeepL.
Just like Google Translate, DeepL features a simple text field: when a user types in text, the translation appears in a second text box. Users can choose between the languages.

In order to see how the service works in the backend, let’s have a quick look at the network traffic.
For that we open the browser’s developer tools and jump to the network tab. Next, we type in a sentence and see which requests (XHR) are made. The interface repeatedly sends JSON requests to the following endpoint: “https://www.deepl.com/jsonrpc”.

Looking at a single request, we can quickly identify the parameters that we typed in (grey area, in the lower right corner). We copy these into R and assign them to a variable.

Using a service to format the JSON (e.g. https://jsonformatter.curiousconcept.com/) we can turn the blob into a readable JSON document. Next, we convert the JSON string into an R object (a nested list) by applying a simple JSON-to-R translation:

Finally, we evaluate the string as R code; this gives us the DeepL web service's parameters as a nested R list.
All we have to do now is wrap the parameters in an R function and use variables to change the important ones:
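The wrapping pattern can be sketched as follows. All the field names below are assumptions reconstructed from what such a captured jsonrpc request looked like at the time; the endpoint is private and undocumented and its schema may well have changed, so treat this purely as an illustration:

```r
# Hypothetical sketch: build the jsonrpc request body as a nested R list,
# lifting the user-visible parameters into function arguments.
# All field names here are assumptions, not a documented API.
deepl_body <- function(text, from = "EN", to = "DE") {
    list(jsonrpc = "2.0",
         method  = "LMT_handle_jobs",    # method name as once observed; may differ today
         params  = list(
             jobs = list(list(kind = "default", raw_en_sentence = text)),
             lang = list(source_lang_user_selected = from,
                         target_lang = to)))
}

# POSTing it would then be a one-liner, e.g. with httr:
# httr::POST("https://www.deepl.com/jsonrpc", body = deepl_body(txt), encode = "json")

body <- deepl_body("Hallo Welt", from = "DE", to = "EN")
```

Changing a translation direction is now just a matter of changing the `from`/`to` arguments rather than editing a copied JSON blob.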

I hope the post helps you turn more web-services into R-functions/packages.
If you are looking for other translation services have a look at the translate or translateR packages.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

## January 21, 2018

### #15: Tidyverse and data.table, sitting side by side … (Part 1)

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the fifteenth post in the rarely rational R rambling series, or R4 for short. There are two posts I have been meaning to get out for a bit, and hope to get to shortly—but in the meantime we are going to start something else.

Another longer-running idea I had was to present some simple application cases with (one or more) side-by-side code comparisons. Why? Well at times it feels like R, and the R community, are being split. You’re either with one (increasingly "religious" in their defense of their deemed-superior approach) side, or the other. And that is of course utter nonsense. It’s all R after all.

Programming, just like other fields using engineering methods and thinking, is about making choices, and trading off between certain aspects. A simple example is the fairly well-known trade-off between memory use and speed: think e.g. of a hash map allowing for faster lookup at the cost of some more memory. Generally speaking, solutions are rarely limited to just one way, or just one approach. So it pays off to know your tools, and to choose wisely among all available options. Having choices is having options, and those tend to have non-negative premiums to take advantage of. Locking yourself into one and only one paradigm can never be better.

In that spirit, I want to (eventually) show a few simple comparisons of code being done two distinct ways.

One obvious first candidate for this is the gunsales repository with some R code which backs an earlier NY Times article. I got involved for a similar reason, and updated the code from its initial form. Then again, this project also helped motivate what we did later with the x13binary package, which permits automated installation of the X13-ARIMA-SEATS binary to support Christoph’s excellent seasonal CRAN package (and website), for which we now have a forthcoming JSS paper. But the actual code example is not that interesting, and a bit further off the mainstream, because of the more specialised seasonal ARIMA modeling.

But then this week I found a much simpler and shorter example, and quickly converted its code. The code comes from the inaugural datascience 1 lesson at the Crosstab, a fabulous site by G. Elliot Morris (who may be the highest-energy undergrad I have come across lately) focussed on political polling, forecasts, and election outcomes. Lesson 1 is a simple introduction, and averages some polls of the 2016 US Presidential Election.

#### Complete Code using Approach "TV"

Elliot does a fine job walking the reader through his code so I will be brief and simply quote it in one piece:


## Getting the polls

library(readr)
polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

## Wrangling the polls

library(dplyr)
polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
library(lubridate)
polls_2016 <- polls_2016 %>%
    mutate(end_date = ymd(end_date))
polls_2016 <- polls_2016 %>%
    right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date),
                                              max(polls_2016$end_date), by="days")))

## Average the polls

polls_2016 <- polls_2016 %>%
    group_by(end_date) %>%
    summarise(Clinton = mean(Clinton),
              Trump = mean(Trump))

library(zoo)
rolling_average <- polls_2016 %>%
    mutate(Clinton.Margin = Clinton - Trump,
           Clinton.Avg = rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right"))

library(ggplot2)
ggplot(rolling_average) +
    geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
    geom_point(aes(x=end_date, y=Clinton.Margin))
It uses five packages to i) read some data off the interwebs, ii) filter / subset / modify it, iii) perform a right (outer) join with a generated sequence of dates, iv) average the per-day polls and then create rolling averages over 14 days, and v) plot. Several standard verbs are used: filter(), mutate(), right_join(), group_by(), and summarise(). One non-tidyverse function is rollapply(), which comes from zoo, a popular package for time-series data.

#### Complete Code using Approach "DT"

As I will show below, we can do the same with fewer packages as data.table covers the reading, slicing/dicing and time conversion. We still need zoo for its rollapply() and of course the same plotting code:


## Getting the polls

library(data.table)
pollsDT <- fread("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv")

## Wrangling the polls

pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
                                              max(pollsDT[,end_date]), by="days")), on="end_date"]

## Average the polls

library(zoo)
pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
pollsDT[, Clinton.Margin := Clinton - Trump]
pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right")]

library(ggplot2)
ggplot(pollsDT) +
    geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
    geom_point(aes(x=end_date, y=Clinton.Margin))

This uses several of the components of data.table which are often called [i, j, by=...]. Rows are selected (i), columns are either modified (via := assignment) or summarised (via =), and grouping is undertaken via by=.... The outer join is done by having one data.table object indexed by another, and is pretty standard too. That allows us to do all transformations in three lines. We then create the per-day average by grouping by day, compute the margin, and construct its rolling average as before. The resulting chart is, unsurprisingly, the same.
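The [i, j, by] pattern can be seen in isolation on a toy table (a sketch with made-up data; requires the data.table package):

```r
library(data.table)

dt <- data.table(grp = c("a", "a", "b", "b"), val = 1:4)

dt[val > 1]                                  # i: select rows
dt[, dbl := 2 * val]                         # j with := : add a column by reference
agg <- dt[, .(total = sum(val)), by = grp]   # j with = inside .(), grouped via by
agg
```

Note the reference semantics of `:=`: it modifies `dt` in place, with no copy and no assignment needed.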

We can look at how the two approaches do on getting the data read into our session. For simplicity, we will read a local file to keep the (fixed) download aspect out of it:

R> url <- "http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"
R> file <- "/tmp/poll-responses-clean.tsv"
R> res <- microbenchmark(tidy=suppressMessages(readr::read_tsv(file)),
+                       dt=data.table::fread(file))
R> res
Unit: milliseconds
expr     min      lq    mean  median      uq      max neval
tidy 6.67777 6.83458 7.13434 6.98484 7.25831  9.27452   100
dt 1.98890 2.04457 2.37916 2.08261 2.14040 28.86885   100
R> 

That is a clear relative difference, though the absolute amount of time is not that relevant for such a small (demo) dataset.

#### Benchmark Processing

We can also look at the processing part:

R> rdin <- suppressMessages(readr::read_tsv(file))
R>
R> library(dplyr)
R> library(lubridate)
R> library(zoo)
R>
R> transformTV <- function(polls_2016=rdin) {
+     polls_2016 <- polls_2016 %>%
+         filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
+     polls_2016 <- polls_2016 %>%
+         mutate(end_date = ymd(end_date))
+     polls_2016 <- polls_2016 %>%
+         right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date),
+                                                   max(polls_2016$end_date), by="days")))
+     polls_2016 <- polls_2016 %>%
+         group_by(end_date) %>%
+         summarise(Clinton = mean(Clinton),
+                   Trump = mean(Trump))
+
+     rolling_average <- polls_2016 %>%
+         mutate(Clinton.Margin = Clinton-Trump,
+                Clinton.Avg =  rollapply(Clinton.Margin,width=14,
+                                         FUN=function(x){mean(x, na.rm=TRUE)},
+                                         by=1, partial=TRUE, fill=NA, align="right"))
+ }
R>
R> transformDT <- function(dtin) {
+     pollsDT <- copy(dtin) ## extra work to protect from reference semantics for benchmark
+     pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
+     pollsDT[, end_date := as.IDate(end_date)]
+     pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
+                                                   max(pollsDT[,end_date]), by="days")), on="end_date"]
+     pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)),
+                        by=end_date][, Clinton.Margin := Clinton-Trump]
+     pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
+                                        FUN=function(x){mean(x, na.rm=TRUE)},
+                                        by=1, partial=TRUE, fill=NA, align="right")]
+ }
R>
R> res <- microbenchmark(tidy=suppressMessages(transformTV(rdin)),
+                       dt=transformDT(dtin))
R> res
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
tidy 12.54723 13.18643 15.29676 13.73418 14.71008 104.5754   100
dt  7.66842  8.02404  8.60915  8.29984  8.72071  17.7818   100
R> 

Not quite a factor of two on the small data set, but again a clear advantage. data.table has a reputation for doing really well for large datasets; here we see that it is also faster for small datasets.

#### Side-by-side

Stripping out the reading as well as the plotting, both of which are about the same, we can compare the essential data operations.

#### Summary

We found a simple task solved using code and packages from an increasingly popular sub-culture within R, and contrasted it with a second approach. We find the second approach to i) have fewer dependencies, ii) use less code, and iii) run faster.

Now, undoubtedly the former approach will have its staunch defenders (and that is all good and well; after all, choice is good, and even thirty years later some still debate vi versus emacs endlessly) but I thought it instructive to at least be able to make an informed comparison.

#### Acknowledgements

My thanks to G. Elliot Morris for a fine example, and of course a fine blog and (if somewhat hyperactive) Twitter account.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


### Magister Dixit

“Academia and business are two different worlds.” Kamil Bartocha (26 April 2015)

### Advisory on Multiple Assignment dplyr::mutate() on Databases

I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases.

In this note I exhibit a troublesome example, and a systematic solution.

First let’s set up dplyr, our database, and some example data.

library("dplyr")
##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
packageVersion("dplyr")
## [1] '0.7.4'
packageVersion("dbplyr")
## [1] '1.2.0'
db <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")

d <- dplyr::copy_to(
db,
data.frame(xorig = 1:5,
yorig = sin(1:5)),
"d")

Now suppose somewhere in one of your projects somebody (maybe not even you) has written code that looks somewhat like the following.

d %>%
mutate(
delta = 0,
x0 = xorig + delta,
y0 = yorig + delta,
delta = delta + 1,
x1 = xorig + delta,
y1 = yorig + delta,
delta = delta + 1,
x2 = xorig + delta,
y2 = yorig + delta
) %>%
select(-xorig, -yorig, -delta) %>%
knitr::kable()
| x0|         y0| x1|         y1| x2|         y2|
|--:|----------:|--:|----------:|--:|----------:|
|  1|  0.8414710|  1|  0.8414710|  1|  0.8414710|
|  2|  0.9092974|  2|  0.9092974|  2|  0.9092974|
|  3|  0.1411200|  3|  0.1411200|  3|  0.1411200|
|  4| -0.7568025|  4| -0.7568025|  4| -0.7568025|
|  5| -0.9589243|  5| -0.9589243|  5| -0.9589243|

Notice the above gives an incorrect result: all of the x_i columns are identical, and all of the y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). But if this were truly “incorrect dplyr code” we should have seen an error or exception. So unless you can be certain you have no code like this in a database-backed dplyr project, you cannot be certain you have not run into this problem, which silently corrupts data and results.

The issue is: dplyr on databases does not appear to give strong enough guarantees about the order in which the assignments inside a single mutate() are executed. The running counter “delta” takes only one value for the entire lifetime of the dplyr::mutate() statement (which is clearly not what the user would want).

The fix is to break up the dplyr::mutate() into a series of smaller mutates that do not exhibit the problem. There is a trade-off: breaking up a dplyr::mutate() on a database causes deeper statement nesting, and a potential loss of performance. However, correct results should come before speed.

One automated variation of the fix is to use seplyr‘s statement partitioner. seplyr can factor the large mutate into a minimal number of very safe sub-mutates (and use dplyr to execute them).

```r
d %>%
  seplyr::mutate_se(
    seplyr::quote_mutate(
      delta = 0,
      x0 = xorig + delta,
      y0 = yorig + delta,
      delta = delta + 1,
      x1 = xorig + delta,
      y1 = yorig + delta,
      delta = delta + 1,
      x2 = xorig + delta,
      y2 = yorig + delta
    )) %>%
  select(-xorig, -yorig, -delta) %>%
  knitr::kable()
```

```
 x0         y0  x1         y1  x2        y2
  1  0.8414710   2  1.8414710   3  2.841471
  2  0.9092974   3  1.9092974   4  2.909297
  3  0.1411200   4  1.1411200   5  2.141120
  4 -0.7568025   5  0.2431975   6  1.243197
  5 -0.9589243   6  0.0410757   7  1.041076
```

The above notation is, however, a bit clunky for everyday use. We did not use the more direct seplyr::mutate_nse() because (to lower maintenance effort) we are deprecating the direct non-standard evaluation methods in seplyr in favor of code using seplyr::quote_mutate() or wrapr::qae().

One can instead use seplyr as a code inspecting and re-writing tool with seplyr::factor_mutate().

```r
cat(seplyr::factor_mutate(
  delta = 0,
  x0 = xorig + delta,
  y0 = yorig + delta,
  delta = delta + 1,
  x1 = xorig + delta,
  y1 = yorig + delta,
  delta = delta + 1,
  x2 = xorig + delta,
  y2 = yorig + delta
))
```

```
Warning in seplyr::factor_mutate(delta = 0, x0 = xorig + delta, y0 = yorig
+ : Mutate should be split into more than one stage.
```

```r
mutate(delta = 0) %>%
  mutate(x0 = xorig + delta,
         y0 = yorig + delta) %>%
  mutate(delta = delta + 1) %>%
  mutate(x1 = xorig + delta,
         y1 = yorig + delta) %>%
  mutate(delta = delta + 1) %>%
  mutate(x2 = xorig + delta,
         y2 = yorig + delta)
```


seplyr::factor_mutate() both issued a warning and produced the factored code snippet seen above. We think this is in fact a different issue than the one explored in our prior note on dependency-driven result corruption, and that fixes for the first issue did not fix this one, last time we looked.

And that is why you should continue to be careful when using multi-assignment dplyr::mutate() statements with database-backed data.

### ggplot2 Time Series Heatmaps: revisited in the tidyverse

(This article was first published on MarginTale, and kindly contributed to R-bloggers)

I revisited my previous post on creating beautiful time series calendar heatmaps in ggplot, moving the code into the tidyverse.
To obtain the example shown in the original post, simply use the code there. I hope the commented code is self-explanatory – enjoy!

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R Packages worth a look

Spectral Transmittance Data for Filters (photobiologyFilters)
A data only package with spectral ‘transmittance’ data for frequently used filters and similar materials. Plastic sheets and films; optical glass and ordinary glass and some labware.

Generating Crosswords from Word Lists (crossword.r)
Generate crosswords from a list of words.

Plot Methods for Computer Experiments Design and Surrogate (DiceView)
View 2D/3D sections or contours of computer experiments designs, surrogates or test functions.

Knowledge Space Theory Input/Output (kstIO)
Knowledge space theory by Doignon and Falmagne (1999) <doi:10.1007/978-3-642-58625-5> is a set- and order-theoretical framework which proposes mathematical formalisms to operationalize knowledge structures in a particular domain. The ‘kstIO’ package provides basic functionalities to read and write KST data from/to files to be used together with the ‘kst’, ‘pks’ or ‘DAKS’ packages.

Simultaneous CIs for Ratios of Means of Log-Normal Populations with Zeros (LN0SCIs)
Construct simultaneous confidence intervals for ratios of means of log-normal populations with zeros. The package also has a Python module that does the same thing, and can be applied to multiple comparisons of parameters of any k mixture distributions. Four methods are provided: two based on a generalized pivotal quantity with order statistics, using the Wilson quantity of Li et al. (2009) <doi:10.1016/j.spl.2009.03.004> (GPQW) and the Hannig (2009) <doi:10.1093/biomet/asp050> quantity (GPQH); and two based on the two-step MOVER intervals of Amany H, Abdel K (2015) <doi:10.1080/03610918.2013.767911>, from which fiducial generalized pivotal two-step MOVER intervals are deduced based on the Wilson quantity (FMW) and on Hannig's quantity (FMWH). All of these approaches can be found in our paper, which has been submitted.

## Privacy

• Invisible Manipulation: 10 ways our data is being used against us. A good summary by the people of Privacy International.

The era where we were in control of the data on our own computers has been replaced with devices containing sensors we cannot control, storing data we cannot access, in operating systems we cannot monitor, in environments where our rights are rendered meaningless. Soon the default will shift from us interacting directly with our devices to interacting with devices we have no control over and no knowledge that we are generating data. Below we outline 10 ways in which this exploitation and manipulation is already happening.

• China testing facial-recognition surveillance system in Xinjiang – report. Every article I read about this is scarier than the previous one. In this one...

Chinese surveillance chiefs are testing a facial-recognition system that alerts authorities when targets stray more than 300 metres from their home or workplace, as part of a surveillance push that critics say has transformed the country’s western fringes into a high-tech police state.

• At the height of the Cold War, during the winter of 1980, FBI agents recorded a phone call in which a man arranged a secret meeting with the Soviet embassy in Washington, D.C. On the day of his appointment, however, agents were unable to catch sight of the man entering the embassy. At the time, they had no way to put a name to the caller from just the sound of his voice, so the spy remained anonymous. Over the next five years, he sold details about several secret U.S. programs to the USSR.

It wasn’t until 1985 that the FBI, thanks to intelligence provided by a Russian defector, was able to establish the caller as Ronald Pelton, a former analyst at the National Security Agency. The next year, Pelton was convicted of espionage.

Today, FBI and NSA agents would have identified Pelton within seconds of his first call to the Soviets. A classified NSA memo from January 2006 describes NSA analysts using a “technology that identifies people by the sound of their voices” to successfully match old audio files of Pelton to one another. “Had such technologies been available twenty years ago,” the memo stated, “early detection and apprehension could have been possible, reducing the considerable damage Pelton did to national security.”

## Tech

• Deep Empathy. The results are not always good, but the concept is novel and interesting.

Deep Empathy utilizes deep learning to learn the characteristics of Syrian neighborhoods affected by conflict, and then simulates how cities around the world would look in the midst of a similar conflict. Can this approach -- familiar in a range of artistic applications -- help us to see recognizable elements of our lives through the lens of those experiencing vastly different circumstances, theoretically a world away? And by helping an AI learn empathy, can this AI teach us to care?

• Alibaba says its deep neural network model has outscored humans in a global reading test, paving the way for the underlying technology to reduce the need for human input.

Next: have an AI read a movie script and tell the writer where it doesn't make sense. It's sorely needed.

• Algorithms for predicting recidivism are commonly used to assess a criminal defendant’s likelihood of committing a crime. These predictions are used in pretrial, parole, and sentencing decisions. Proponents of these systems argue that big data and advanced machine learning make these analyses more accurate and less biased than humans. We show, however, that the widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise. We further show that a simple linear predictor provided with only two features is nearly equivalent to COMPAS with its 137 features.

## Visualizations

Data Links is a periodic blog post published on Sundays (specific time may vary) which contains interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed or this one, which only contains these articles, if you are not interested in other things I might publish.

Have you read an article you liked and would you like to suggest it for the next issue? Just contact me!

### How smartly.io productized Bayesian revenue estimation with Stan

Markus Ojala writes:

Bayesian modeling is becoming mainstream in many application areas. Applying it still requires a lot of knowledge about distributions and modeling techniques, but recent developments in probabilistic programming languages have made it much more tractable. Stan is a promising language that suits single-analysis cases well. With the improvements in approximation methods, it can scale to production level if care is taken in defining and validating the model. The model described here is the basis for the model we are running in production, with various additional improvements.

He begins with some background:

Online advertisers are moving to optimizing total revenue on ad spend instead of just pumping up the amount of conversions or clicks. Maximizing revenue is tricky as there is huge random variation in the revenue amounts brought in by individual users. If this isn’t taken into account, it’s easy to react to the wrong signals and waste money on less successful ad campaigns. Luckily, Bayesian inference allows us to make justified decisions on a granular level by modeling the variation in the observed data.

Probabilistic programming languages, like Stan, make Bayesian inference easy. . . .

Sure, we know all that. But then comes the new part, at least it’s new to me:

In this blog post, we describe our experiences in getting Stan running in production.

Ojala discusses the “Use case: Maximizing the revenue on ad spend” and provides lots of helpful detail—not just the Stan code itself, but background on how they set up the model, and intermediate steps such as the first try which didn’t work because the model was insufficiently constrained and they needed to add prior information. As Ojala puts it:

What’s nice about Stan is that our model definition turns almost line-by-line into the final code. However, getting this model to fit by Stan is hard, as we haven’t specified any limits for the variables, or given sensible priors.

His solution is multilevel modeling:

The issue with the first approach is that the ad set estimates would be based on just the data from the individual ad sets. In this case, one random large $1,000 purchase can affect the mean estimate of a single ad set radically if there are only tens of conversion events (which is a common case). As such large revenue events could have happened also in other ad sets, we can get better estimates by sharing information between the ad sets. With multilevel modeling, we can implement a partial pooling approach to share information. . . . It’s BDA come to life! (But for some reason this paper is labeled, “Tauman, Chapter 6.”) He continues with model development and posterior predictive checking! I’m lovin it. Also this excellent point: After you get comfortable in writing models, Stan is an expressive language that takes away the need to write custom optimization or sampling code to fit your model. It allows you to modify the model and add complexity easily. Now let’s get to the challenges: The model fitting can easily crash if the learning diverges. In most cases that can be fixed by adding sensible limits and informative priors for the variables and possibly adding a custom initialization for the parameters. Also non-centered parametrization is needed for hierarchical models. These are must-haves for running the model in production. You want the model to fit 100% of the cases and not just 90% which would be fine in interactive mode. However, finding the issues with the model is hard. What to do? The best is to start with really simple model and add stuff step-by-step. Also running the model against various data sets and producing posterior plots automatically helps in identifying the issues early. And some details: We are using the PyStan Python interface that wraps the compilation and calling of the code. To avoid recompiling the models always, we precompile them and pickle them . . . For scheduling, we use Celery which is a nice distributed task queue. . . . 
We are now running the revenue model for thousands of different accounts every night with varying amount of campaigns, ad sets and revenue observations. The longest run takes couple minutes. Most of our customers still use the conversion optimization but are transitioning to use the revenue optimization feature. Overall, about one million euros of advertising spend on daily level is managed with our Predictive Budget Allocation. In future, we see that Stan or some other probabilistic programming language plays a big role in the optimization features of Smartly.io. That’s awesome. We used the BSD license for Stan so it could be free and open source and anyone could use it inside their software, however they like. This sort of thing is exactly what we were hoping to see. Continue Reading… ### Book Memo: “Stream Processing with Apache Spark”  Best Practices for Scaling and Optimizing Apache Spark To build analytics tools that provide faster insights, knowing how to process data in real time is a must, and moving from batch processing to stream processing is absolutely required. Fortunately, the Spark in-memory framework/platform for processing data has added an extension devoted to fault-tolerant stream processing: Spark Streaming. If you’re familiar with Apache Spark and want to learn how to implement it for streaming jobs, this practical book is a must. Continue Reading… ### Document worth reading: “Deep Learning for Case-based Reasoning through Prototypes: A Neural Network that Explains its Predictions” Deep neural networks are widely used for classification. These deep models often suffer from a lack of interpretability — they are particularly difficult to understand because of their non-linear nature. As a result, neural networks are often treated as ‘black box’ models, and in the past, have been trained purely to optimize the accuracy of predictions. 
In this work, we create a novel network architecture for deep learning that naturally explains its own reasoning for each prediction. This architecture contains an autoencoder and a special prototype layer, where each unit of that layer stores a weight vector that resembles an encoded training input. The encoder of the autoencoder allows us to do comparisons within the latent space, while the decoder allows us to visualize the learned prototypes. The training objective has four terms: an accuracy term, a term that encourages every prototype to be similar to at least one encoded input, a term that encourages every encoded input to be close to at least one prototype, and a term that encourages faithful reconstruction by the autoencoder. The distances computed in the prototype layer are used as part of the classification process. Since the prototypes are learned during training, the learned network naturally comes with explanations for each prediction, and the explanations are loyal to what the network actually computes. Deep Learning for Case-based Reasoning through Prototypes: A Neural Network that Explains its Predictions Continue Reading… ### R Weekly Bulletin Vol – XIV (This article was first published on R programming, and kindly contributed to R-bloggers) This week’s R bulletin covers some interesting ways to list functions, to list files and illustrates the use of double colon operator. We will also cover functions like path.package, fill.na, and rank. Click To TweetHope you like this R weekly bulletin. Enjoy reading! ### Shortcut Keys 1. New document – Ctrl+Shift+N 2. Close active document – Ctrl+W 3. Close all open documents – Ctrl+Shift+W ### Problem Solving Ideas #### How to list functions from an R package We can view the functions from a particular R package by using the “jwutil”s package. Install the package and use the lsf function from the package. 
The syntax of the function is given as: lsf(pkg) Where pkg is a character string containing package name. The function returns a character vector of function names in the given package. Example: library(jwutil) library(rowr) lsf("rowr") #### How to list files with a particular extension To list files with a particular extension, one can use the pattern argument in the list.files function. For example to list CSV files use the following syntax: Example: # This will list all the csv files present in the current working directory. # To list files in any other folder, you need to provide the folder path. files = list.files(pattern = "\\.csv$")

# $at the end means that this is end of the string. # Adding \. ensures that you match only files with extension .csv list.files(path = "C:/Users/MyFolder", pattern = "\\.csv$")

#### Using the double colon operator

The double colon operator is used to access exported variables in a namespace. The syntax is given as:

pkg::name

Where pkg is the package name symbol or literal character string. The name argument is the variable name symbol or literal character string.

The expression pkg::name returns the value of the exported variable from the package if it has a namespace. The package will be loaded if it was not loaded already before the call. Using the double colon operator has its advantage when we have functions of the same name but from different packages. In such a case, the sequence in which the libraries are loaded is important.

To see the help documentation for these colon operators, run the following command in R: ?'::' or help(":::")

Example:

```r
library("dplyr")

first = c(1:6)
second = c(3:9)

dplyr::intersect(first, second)
[1] 3 4 5 6
base::intersect(first, second)
[1] 3 4 5 6
```

In this example, we have two functions with the same name but from different R packages. In some cases, functions with the same name can produce different results. By specifying the package name with the double colon operator, R knows exactly which package to look in for the function.

### Functions Demystified

#### path.package function

The path.package function returns the path to the locations where the given packages are found. If no package is specified, the function returns the paths of all currently attached packages. The syntax of the function is:

path.package(package, quiet = FALSE)

The quiet argument defaults to FALSE, in which case the function warns if a named package is not attached and gives an error if none are attached; setting quiet = TRUE suppresses these messages.

Example:

```r
path.package("stats")
```

#### fill.na function

There are different R packages which have functions to fill NA values. The fill.na function is part of the mefa package and replaces each NA value with the nearest non-NA value above it in the same column. The syntax of the function is:

fill.na(x)

Where, x can be a vector, a matrix or a data frame.

Example:

```r
library(mefa)
x = c(12, NA, 15, 17, 21, NA)
fill.na(x)
```

#### rank function

The rank function returns the sample ranks of the values in a vector. Ties (i.e., equal values) and missing values can be handled in several ways.

rank(x, na.last = TRUE, ties.method = c("average", "first", "random", "max", "min"))

where,
x: numeric, complex, character or logical vector
na.last: for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if “keep” they are kept with rank NA
ties.method: a character string specifying how ties are treated

Examples:

```r
x = c(3, 5, 1, -4, NA, Inf, 90, 43)
rank(x)

rank(x, na.last = FALSE)
```
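As a quick illustration of the ties.method argument (a small example of ours, not from the bulletin):

```r
x = c(10, 20, 20, 30)

rank(x, ties.method = "average")  # 1.0 2.5 2.5 4.0 -- ties share the mean rank
rank(x, ties.method = "first")    # 1 2 3 4 -- ties broken by position
rank(x, ties.method = "max")      # 1 3 3 4 -- ties all get the largest rank
```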

### Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.



### Magister Dixit

"Naming is an art, yet be careful not to add surplus meaning by being overly creative." Joel Cadwell (August 21, 2014)

### Who wants to work at Google?

(This article was first published on R – Journey of Analytics, and kindly contributed to R-bloggers)

In this tutorial, we will explore the open roles at Google, and try to see what common attributes Google is looking for, in future employees.

This dataset is a compilation of job descriptions of 1200+ open roles at Google offices across the world. This dataset is available for download from the Kaggle website, and contains text information about job location, title, department, minimum, preferred qualifications and responsibilities of the position. You can download the dataset here, and run the code on the Kaggle site itself here.

Using this dataset we will try to answer the following questions:

1. Where are the open roles?
2. Which departments have the most openings?
3. What are the minimum and preferred educational qualifications needed to get hired at Google?
4. How much experience is needed?
5. What categories of roles are the most in demand?

### Step 1 – Data Preparation and Cleaning:

The data is all in free-form text, so we do need to do a fair amount of cleanup to remove non-alphanumeric characters. Some of the job locations have special characters too, so we remove those using basic string manipulation functions. Once we read in the file, this is the snapshot of the resulting dataframe:
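That cleanup step might look something like the following sketch; the clean_text() helper and the example strings are our illustration, not the article's actual code, and the real column contents may differ:

```r
# Hypothetical helper: keep letters, digits, spaces and a few common
# punctuation marks, then collapse the leftover whitespace.
clean_text <- function(x) {
  x <- gsub("[^[:alnum:][:space:],&/-]", " ", x)
  trimws(gsub("[[:space:]]+", " ", x))
}

clean_text("Sales & Account Management??")  # "Sales & Account Management"
clean_text("New  York,  USA")               # "New York, USA"
```

The same helper can then be applied column-by-column before any of the keyword searches below.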

### Step 2 – Analysis:

Now we will use R programming to identify patterns in the data that help us answer the questions of interest.

### a) Job Categories:

First let us look at which departments have the most open roles. Surprisingly, there are more open roles in the "Marketing and Communications" and "Sales & Account Management" categories than in the traditional technical business units (like Software Engineering or Networking).

### b) Full-time versus internships:

Let us see how many roles are full-time and how many are for students. As expected, only ~13% of roles are for students, i.e. internships. The majority are full-time positions.

### c) Technical Roles:

Since Google is a predominantly technical company, let us see how many positions need technical skills, irrespective of the business unit (job category).

a) Roles related to “Google Cloud”:

To check this, we investigate how many roles have the phrase either in the job title or the responsibilities. As shown in the graph below, ~20% of the roles are related to Cloud infrastructure, clearly showing that Google is making Cloud services a high priority.
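That check can be sketched with grepl() on a couple of toy rows; the column names and example strings here are our assumptions, not necessarily those in the Kaggle file:

```r
# Toy stand-in for the jobs data frame.
jobs <- data.frame(
  title            = c("Cloud Solutions Architect", "Account Manager"),
  responsibilities = c("Design scalable systems", "Manage client accounts"),
  stringsAsFactors = FALSE
)

# Flag a role if "cloud" appears in the title or the responsibilities.
is_cloud <- grepl("cloud", tolower(jobs$title)) |
  grepl("cloud", tolower(jobs$responsibilities))

mean(is_cloud)  # share of cloud-related roles; 0.5 for this toy data
```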

b) Senior Roles and skills :

A quick word search also reveals how many senior roles (roles that require 10+ years of experience) use the word "strategy" in their list of requirements, under either qualifications or responsibilities. Word association analysis can also show this (not shown here).

### Educational Qualifications:

Here we are basically parsing the “min_qual” and “pref_qual” columns to see the minimum qualifications needed for the role. If we only take the minimum qualifications into consideration, we see that 80% of the roles explicitly ask for a bachelors degree. Less than 5% of roles ask for a masters or PhD.

However, when we consider the “preferred” qualifications, the ratio increases to a whopping ~25%. Thus, a fourth of all roles would be more suited to candidates with masters degrees and above.
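A sketch of that parsing step: the "min_qual" and "pref_qual" names follow the article, but the toy strings and the exact keyword pattern are our illustration:

```r
# Toy stand-in for the qualification columns.
quals <- data.frame(
  min_qual  = c("BA/BS degree or equivalent", "Bachelor's degree in Computer Science"),
  pref_qual = c("Master's degree in a technical field", "PhD in Machine Learning"),
  stringsAsFactors = FALSE
)

# Share of roles whose minimum vs preferred qualifications mention a
# graduate degree.
grad_pattern <- "master|phd"
c(min  = mean(grepl(grad_pattern, tolower(quals$min_qual))),
  pref = mean(grepl(grad_pattern, tolower(quals$pref_qual))))
```

The same pattern-matching approach extends to detecting "bachelor" or "engineering" keywords for the other ratios quoted in this section.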

Google is famous for hiring engineers for all types of roles. So we will read the job qualification requirements to identify what percentage of roles require a technical degree or a degree in Engineering. As seen from the data, 35% specifically ask for an Engineering or Computer Science degree, including roles in marketing and non-engineering departments.

### Years of Experience:

We see that 30% of the roles require at least 5 years of experience, while 35% of roles need even more. So if you did not get hired at Google right after graduation, no worries: you have a better chance after gaining strong experience at other companies.

### Role Locations:

The dataset does not have the geographical coordinates for mapping. However, this is easily overcome by using the geocode() function and the amazing rworldmap package. We are only plotting the locations, so some places will have more roles than others. We see open roles in all parts of the world; however, the most positions are in the US, followed by the UK, and then Europe as a whole.

### Responsibilities – Word Cloud:

Let us create a word cloud to see what skills are most needed for the Cloud engineering roles. We see that words like "partner", "custom solutions", "cloud", "strategy", and "experience" are more frequent than any specific technical skills. This shows that the Google Cloud roles are best filled by senior resources, where leadership and business skills become more significant than expertise in a specific technology.
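The term-frequency step behind such a word cloud can be sketched in base R (the responsibility strings below are toy examples of ours; a package such as wordcloud would then draw words sized by these counts):

```r
# Toy responsibility strings standing in for the real column.
resp <- c("Partner with customers on cloud strategy",
          "Deliver custom cloud solutions and build partner experience")

# Tokenize to lowercase words and drop very short tokens.
words <- unlist(strsplit(tolower(resp), "[^a-z]+"))
words <- words[nchar(words) > 3]

sort(table(words), decreasing = TRUE)  # "cloud" and "partner" lead here
```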

### Conclusion:

So who has the best chance of getting hired at Google?

For most of the roles (from this dataset), a candidate with the following traits has the best chance of getting hired:

1. 5+ years of experience.
2. Engineering or Computer Science bachelor’s degree.
3. Masters degree or higher.
4. Working in the US.

The code for this script and graphs are available here on the Kaggle website. If you liked it, don’t forget to upvote the script.  And don’t forget to share!

### Next Steps:

You can tweak the code to perform the same analysis, but on a subset of data. For example, only roles in a specific department, location (HQ in California) or Google Cloud related roles.

Thanks and happy coding!

(Please note that this post has been reposted from the main blog site at http://blog.journeyofanalytics.com/ )


### If you did not already know

Recurrent Gaussian Processes (RGP)
We define Recurrent Gaussian Processes (RGP) models, a general family of Bayesian nonparametric models with recurrent GP priors which are able to learn dynamical patterns from sequential data. Similar to Recurrent Neural Networks (RNNs), RGPs can have different formulations for their internal states, distinct inference methods and be extended with deep structures. In such context, we propose a novel deep RGP model whose autoregressive states are latent, thereby performing representation and dynamical learning simultaneously. To fully exploit the Bayesian nature of the RGP model we develop the Recurrent Variational Bayes (REVARB) framework, which enables efficient inference and strong regularization through coherent propagation of uncertainty across the RGP layers and states. We also introduce a RGP extension where variational parameters are greatly reduced by being reparametrized through RNN-based sequential recognition models. We apply our model to the tasks of nonlinear system identification and human motion modeling. The promising results obtained indicate that our RGP model maintains its high flexibility while being able to avoid overfitting and being applicable even when larger datasets are not available. …

Algebraic Statistics
Algebraic statistics is the use of algebra to advance statistics. Algebra has been useful for experimental design, parameter estimation, and hypothesis testing. Traditionally, algebraic statistics has been associated with the design of experiments and multivariate analysis (especially time series). In recent years, the term “algebraic statistics” has been sometimes restricted, sometimes being used to label the use of algebraic geometry and commutative algebra in statistics. …

TensorLayer
Deep learning has enabled major advances in the fields of computer vision, natural language processing, and multimedia among many others. Developing a deep learning system is arduous and complex, as it involves constructing neural network architectures, managing training/trained models, tuning optimization process, preprocessing and organizing data, etc. TensorLayer is a versatile Python library that aims at helping researchers and engineers efficiently develop deep learning systems. It offers rich abstractions for neural networks, model and data management, and parallel workflow mechanism. While boosting efficiency, TensorLayer maintains both performance and scalability. TensorLayer was released in September 2016 on GitHub, and has helped people from academia and industry develop real-world applications of deep learning. …

## January 20, 2018

### Rcpp 0.12.15: Numerous tweaks and enhancements

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The fifteenth release in the 0.12.* series of Rcpp landed on CRAN today after just a few days of gestation in incoming/.

This release follows the 0.12.0 release from July 2015, the 0.12.1 release in September 2015, the 0.12.2 release in November 2015, the 0.12.3 release in January 2016, the 0.12.4 release in March 2016, the 0.12.5 release in May 2016, the 0.12.6 release in July 2016, the 0.12.7 release in September 2016, the 0.12.8 release in November 2016, the 0.12.9 release in January 2017, the 0.12.10 release in March 2017, the 0.12.11 release in May 2017, the 0.12.12 release in July 2017, the 0.12.13 release in late September 2017, and the 0.12.14 release in November 2017, making it the nineteenth release at the steady and predictable bi-monthly release frequency.

Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 1288 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with another 91 in BioConductor.

This release contains a pretty large number of pull requests by a wide variety of authors. Most of these pull requests are very focused on a particular issue at hand. One was larger and ambitious with some forward-looking code for R 3.5.0; however this backfired a little on Windows and is currently "parked" behind a #define. Full details are below.

#### Changes in Rcpp version 0.12.15 (2018-01-16)

• Changes in Rcpp API:

• Calls from exception handling to Rf_warning() now correctly set an initial format string (Dirk in #777 fixing #776).

• The ‘new’ Date and Datetime vectors now have is_na methods too. (Dirk in #783 fixing #781).

• Protect more temporary SEXP objects produced by wrap (Kevin in #784).

• Use public R APIs for new_env (Kevin in #785).

• Evaluation of R code is now safer when compiled against R 3.5 (you also need to explicitly define RCPP_PROTECTED_EVAL before including Rcpp.h). Longjumps of all kinds (condition catching, returns, restarts, debugger exit) are appropriately detected and handled, e.g. the C++ stack unwinds correctly (Lionel in #789). [ Committed but subsequently disabled in release 0.12.15 ]

• The new function Rcpp_fast_eval() can be used for performance-sensitive evaluation of R code. Unlike Rcpp_eval(), it does not try to catch errors with tryEval in order to avoid the catching overhead. While this is safe thanks to the stack unwinding protection, this also means that R errors are not transformed to an Rcpp::exception. If you are relying on error rethrowing, you have to use the slower Rcpp_eval(). On old R versions Rcpp_fast_eval() falls back to Rcpp_eval() so it is safe to use against any versions of R (Lionel in #789). [ Committed but subsequently disabled in release 0.12.15 ]

• Overly-clever checks for NA have been removed (Kevin in #790).

• The included tinyformat has been updated to the current version; Rcpp-specific changes are now more isolated (Kirill in #791).

• Overly picky fall-through warnings by gcc-7 regarding switch statements are now pre-empted (Kirill in #792).

• Permit compilation on ANDROID (Kenny Bell in #796).

• Improve support for NVCC, the CUDA compiler (Iñaki Ucar in #798 addressing #797).

• Speed up tests for NA and NaN (Kirill and Dirk in #799 and #800).

• Rearrange stack unwind test code, keep test disabled for now (Lionel in #801).

• Further condition away protect unwind behind #define (Dirk in #802).

• Changes in Rcpp Attributes:

• Addressed a missing Rcpp namespace prefix when generating a C++ interface (James Balamuta in #779).

• Changes in Rcpp Documentation:

• The Rcpp FAQ now shows Rcpp::Rcpp.plugin.maker() and not the outdated ::: usage applicable to non-exported functions.

Thanks to CRANberries, you can also look at a diff to the previous release. As always, details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R Packages worth a look

Gapped-Kmer Support Vector Machine (gkmSVM)
Imports the ‘gkmSVM’ v2.0 functionalities into R <http://…/>. It also uses the ‘kernlab’ library (a separate R package by different authors) for various SVM algorithms.

Computing P-Values of the K-S Test for (Dis)Continuous Null Distribution (KSgeneral)
Computes a p-value of the one-sample two-sided (or one-sided, as a special case) Kolmogorov-Smirnov (KS) statistic, for any fixed critical level, and an arbitrary, possibly large sample size for a pre-specified purely discrete, mixed or continuous cumulative distribution function (cdf) under the null hypothesis. If a data sample is supplied, ‘KSgeneral’ computes the p-value corresponding to the value of the KS test statistic computed based on the user provided data sample. The package ‘KSgeneral’ implements a novel, accurate and efficient method named Exact-KS-FFT, expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using Fast Fourier Transform (FFT). The package can also be used to compute and plot the complementary cdf of the KS statistic which is known to depend on the hypothesized distribution when the latter is discontinuous (i.e. purely discrete or mixed).
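As a rough, stdlib-only illustration of what this package computes, the sketch below carries out the classical one-sample two-sided KS test against a continuous (standard normal) null. Note this is exactly the setting KSgeneral goes beyond: its Exact-KS-FFT method handles discrete and mixed null cdfs, which this asymptotic computation does not. The sample shift of 0.3 is invented for illustration.

```python
import math
import random

random.seed(0)

def norm_cdf(x):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

# A sample drawn from a distribution that deviates from the
# standard-normal null (shifted mean, purely for illustration).
n = 50
sample = sorted(random.gauss(0.3, 1.0) for _ in range(n))

# One-sample two-sided KS statistic: the largest gap between the
# empirical cdf and the hypothesized cdf, checked on both sides of
# each jump of the empirical cdf.
d = max(
    max((i + 1) / n - norm_cdf(x), norm_cdf(x) - i / n)
    for i, x in enumerate(sample)
)

# Asymptotic p-value from the Kolmogorov distribution; this limit
# holds for a continuous null cdf, which is precisely the assumption
# that KSgeneral's Exact-KS-FFT method removes.
t = d * math.sqrt(n)
pvalue = 2.0 * sum(
    (-1) ** (k - 1) * math.exp(-2.0 * (k * t) ** 2) for k in range(1, 101)
)
pvalue = min(max(pvalue, 0.0), 1.0)

print(f"D = {d:.3f}, asymptotic p-value = {pvalue:.4f}")
```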

Access Elevation Data from Various APIs (elevatr)
Several web services are available that provide access to elevation data. This package provides access to several of those services and returns elevation data either as a SpatialPointsDataFrame from point elevation services or as a raster object from raster elevation services. Currently, the package supports access to the Mapzen Elevation Service <https://…/>, Mapzen Terrain Service <https://…/>, Amazon Web Services Terrain Tiles <https://…/> and the USGS Elevation Point Query Service <http://…/>.

Rank-Based Estimation and Prediction in Random Effects Nested Models (rlme)
Estimates robust rank-based fixed effects and predicts robust random effects in two- and three-level random effects nested models. The methodology is described in Bilgic & Susmann (2013) <https://…/>.

Estimating the Error Variance in a High-Dimensional Linear Model (natural)
Implementation of the two error variance estimation methods in high-dimensional linear models of Yu, Bien (2017) <arXiv:1712.02412>.

### How to get a sense of Type M and type S errors in neonatology, where trials are often very small? Try fake-data simulation!

Tim Disher read my paper with John Carlin, “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors,” and followed up with a question:

I am a doctoral student conducting research within the field of neonatology, where trials are often very small, and I have long suspected that many intervention effects are potentially inflated.

I am curious as to whether you have any thoughts as to how the methods you describe could be applied within the context of a meta-analysis. My initial thought was to do one of:

1. Approach the issue in a similar way to how minimum information size has been adapted for meta-analysis e.g. assess the risk of type S and M errors on the overall effect estimate as if it came from a single large trial.

2. Calculate the type S and M errors for each trial individually and use a simulation approach where trials are drawn from a mean effect adjusted for inflation % chance of opposite sign.

I think this is the first time we’ve fielded a question from a nurse, so let’s go for it. My quick comment is that you should get the best understanding of what your statistical procedure is doing by simulating fake data. Start with a model of the world, a generative model with hyperparameters set to reasonable values; then simulate fake data; then apply whatever statistical procedure you think might be used, including any selection for statistical significance that might occur; then compare the estimates to the assumed true values. Then repeat that simulation a bunch of times; this should give you a sense of type M and type S errors and all sorts of things.
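To make those steps concrete, here is a minimal fake-data simulation in Python. The assumed true effect of 0.1 and standard error of 0.5 are invented numbers, chosen only to mimic a small, noisy trial; this is a sketch of the procedure, not an analysis of any real study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical numbers for a small, noisy trial: an assumed true
# effect of 0.1 and a standard error of 0.5 (both invented).
true_effect = 0.1
se = 0.5
n_sims = 100_000

# Steps 1-2: generative model plus fake data, one estimate per trial.
estimates = rng.normal(true_effect, se, n_sims)

# Step 3: mimic selection for statistical significance (|z| > 1.96).
significant = estimates[np.abs(estimates) / se > 1.96]

# Step 4: compare the selected estimates to the assumed truth.
power = significant.size / n_sims
type_s = float(np.mean(significant < 0))                    # wrong sign, given significance
type_m = float(np.mean(np.abs(significant))) / true_effect  # exaggeration ratio

print(f"power ~ {power:.3f}, type S ~ {type_s:.2f}, exaggeration ~ {type_m:.1f}")
```

With numbers like these, power is only a few percent, significant results carry the wrong sign roughly a quarter of the time, and they overestimate the true effect by an order of magnitude, which is exactly the pathology the paper describes for underpowered studies.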

### Winter solstice challenge #3: the winner is Bianca Kramer!

(This article was first published on chem-bla-ics, and kindly contributed to R-bloggers)

Part of the winning submission in the category ‘best tool’.

A bit later than intended, but I am pleased to announce the winner of the Winter solstice challenge: Bianca Kramer! Of course, she was the only contender, but her solution is awesome! In fact, I am surprised no one took her tool, ran it on their own data and just submitted that (which was perfectly well within the scope of the challenge).

Best Tool: Bianca Kramer
The best tool (see the code snippet on the right) uses R and a few R packages (rorcid, rjson, httpcache) and services like ORCID and CrossRef (and the I4OC project), and the (also awesome) oadoi.org project. The code is available on GitHub.

Highest Open Knowledge Score: Bianca Kramer
I did not check the self-reported score of 54%, but since no one challenged it, Bianca wins this category too.

So, what next? First, start calculating your own Open Knowledge Scores. Just to be prepared for the next challenge in 11 months. Of course, there is still a lot to explore. For example, how far should we recurse with calculating this score? The following tweet by Daniel Gonzales visualizes the importance so clearly (go RT it!):

We have all been there, and I really think we should not teach our students that it is normal to have to trust your current read and not be able to look up details. I do not know how much time Gonzales spent traversing this trail, but it should not take more than a minute, IMHO. Clearly, any paper in this trail that is not Open will require a lookup, and if your library does not have access, an ILL will make the traversal much, much longer. Unacceptable. And many seem to agree, because Sci-Hub seems to be getting more popular every day. About the latter, almost two years ago I wrote Sci-Hub: a sign on the wall, but not a new sign.
Of course, in the end, it is the scholars who should just make their knowledge open, so that every citizen can benefit from it (keep in mind, a European goal is for half the population to complete higher education, so half of the population should basically be able to read the primary literature!).
That completes the circle back to the winner. After all, Bianca Kramer has done really important work on how scientists can exactly do that: make their research open. I was shocked to see this morning that Bianca did not have a Scholia page yet, but that is fixed now (though far from complete):
Other papers that should be read more include:
Congratulations, Bianca!


### Magister Dixit

“One robust way to determine if two time series, xt and yt, are related is to analyze whether there exists an equation like yt=βxt+ut such that the residuals (ut) are stationary (their mean and variance do not change when shifted in time).” aschinchon.wordpress.com ( 08.05.2015 )
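The quoted idea is essentially an Engle–Granger-style cointegration check, and it can be sketched in a few lines of numpy. All numbers below are invented, and the lag-1 autoregressive coefficient is used as a crude stand-in for a formal unit-root test such as ADF.

```python
import numpy as np

rng = np.random.default_rng(1)

# A shared random-walk trend makes x_t and y_t cointegrated
# (all values here are invented for illustration).
n = 2000
trend = np.cumsum(rng.normal(size=n))
x = trend + rng.normal(size=n)
y = 2.0 * trend + rng.normal(size=n)

# OLS slope for y_t = beta * x_t + u_t, as in the quote.
beta = float(np.dot(x, y) / np.dot(x, x))
u = y - beta * x

def ar1(series):
    # Lag-1 autoregressive coefficient: values near 1 suggest a unit
    # root (non-stationarity); values well below 1, mean reversion.
    s0, s1 = series[:-1], series[1:]
    return float(np.dot(s0, s1) / np.dot(s0, s0))

# y itself wanders like a random walk, but the regression residuals
# revert to their mean, which is the hallmark of a genuine relation.
print(f"beta ~ {beta:.3f}")
print(f"AR(1) of y: {ar1(y):.3f}")
print(f"AR(1) of residuals: {ar1(u):.3f}")
```

In practice one would replace the AR(1) check with a proper stationarity test on the residuals, but the contrast between the two coefficients already makes the point.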

### What’s new on arXiv

Systems are typically made from simple components regardless of their complexity. While the function of each part is easily understood, higher order functions are emergent properties and are notoriously difficult to explain. In networked systems, both digital and biological, each component receives inputs, performs a simple computation, and creates an output. When these components have multiple outputs, we intuitively assume that the outputs are causally dependent on the inputs but are themselves independent of each other given the state of their shared input. However, this intuition can be violated for components with probabilistic logic, as these typically cannot be decomposed into separate logic gates with one output each. This violation of conditional independence on the past system state is equivalent to instantaneous interaction — the idea is that some information between the outputs is not coming from the inputs and thus must have been created instantaneously. Here we compare evolved artificial neural systems with and without instantaneous interaction across several task environments. We show that systems without instantaneous interactions evolve faster, to higher final levels of performance, and require fewer logic components to create a densely connected cognitive machinery.
Survival analysis/time-to-event models are extremely useful as they can help companies predict when a customer will buy a product, churn or default on a loan, and therefore help them improve their ROI. In this paper, we introduce a new method to calculate survival functions using the Multi-Task Logistic Regression (MTLR) model as its base and a deep learning architecture as its core. Based on the Concordance index (C-index) and Brier score, this method outperforms the MTLR in all the experiments disclosed in this paper as well as the Cox Proportional Hazard (CoxPH) model when nonlinear dependencies are found.
We propose methods for learning vector representations of SQL workloads to support a variety of administration tasks and application features, including query recommendation, workload summarization, index selection, identifying expensive queries, and predicting query reuse. We consider vector representations of both raw SQL text and optimized query plans under various assumptions and pre-processing strategies, and evaluate these methods on multiple real SQL workloads by comparing with results of task and application feature metrics in the literature. We find that simple algorithms based on these generic vector representations compete favorably with previous approaches that require a number of assumptions and task-specific heuristics. We then present a new embedding strategy specialized for queries based on tree-structured Long Short Term Memory (LSTM) network architectures that improves on the text-oriented embeddings for some tasks. We find that the general approach, when trained on a large corpus of SQL queries, provides a robust foundation for a variety of workload analysis tasks. We conclude by considering how workload embeddings can be deployed as a core database system feature to support database maintenance and novel applications.
Panel data, also known as longitudinal data, consist of a collection of time series. Each time series, which could itself be multivariate, comprises a sequence of measurements taken on a distinct unit. Mechanistic modeling involves writing down scientifically motivated equations describing the collection of dynamic systems giving rise to the observations on each unit. A defining characteristic of panel systems is that the dynamic interaction between units should be negligible. Panel models therefore consist of a collection of independent stochastic processes, generally linked through shared parameters while also having unit-specific parameters. To give the scientist flexibility in model specification, we are motivated to develop a framework for inference on panel data permitting the consideration of arbitrary nonlinear, partially observed panel models. We build on iterated filtering techniques that provide likelihood-based inference on nonlinear partially observed Markov process models for time series data. Our methodology depends on the latent Markov process only through simulation; this plug-and-play property ensures applicability to a large class of models. We demonstrate our methodology on a toy example and two epidemiological case studies. We address inferential and computational issues arising for large panel datasets.
The increasing use of deep neural networks for safety-critical applications, such as autonomous driving and flight control, raises concerns about their safety and reliability. Formal verification can address these concerns by guaranteeing that a deep learning system operates as intended, but the state-of-the-art is limited to small systems. In this work-in-progress report we give an overview of our work on mitigating this difficulty, by pursuing two complementary directions: devising scalable verification techniques, and identifying design choices that result in deep learning systems that are more amenable to verification.
The driving force behind the recent success of LSTMs has been their ability to learn complex and non-linear relationships. Consequently, our inability to describe these relationships has led to LSTMs being characterized as black boxes. To this end, we introduce contextual decomposition (CD), an interpretation algorithm for analysing individual predictions made by standard LSTMs, without any changes to the underlying model. By decomposing the output of a LSTM, CD captures the contributions of combinations of words or variables to the final prediction of an LSTM. On the task of sentiment analysis with the Yelp and SST data sets, we show that CD is able to reliably identify words and phrases of contrasting sentiment, and how they are combined to yield the LSTM’s final prediction. Using the phrase-level labels in SST, we also demonstrate that CD is able to successfully extract positive and negative negations from an LSTM, something which has not previously been done.
We derive and study a significance test for determining if a panel of functional time series is separable. In the context of this paper, separability means that the covariance structure factors into the product of two functions, one depending only on time and the other depending only on the coordinates of the panel. Separability is a property which can dramatically improve computational efficiency by substantially reducing model complexity. It is especially useful for functional data as it implies that the functional principal components are the same for each member of the panel. However such an assumption must be verified before proceeding with further inference. Our approach is based on functional norm differences and provides a test with well controlled size and high power. We establish our procedure quite generally, allowing one to test separability of autocovariances as well. In addition to an asymptotic justification, our methodology is validated by a simulation study. It is applied to functional panels of particulate pollution and stock market data.
A recommender system aims to recommend items that a user is interested in among many items. The need for the recommender system has been expanded by the information explosion. Various approaches have been suggested for providing meaningful recommendations to users. One of the proposed approaches is to consider a recommender system as a Markov decision process (MDP) problem and try to solve it using reinforcement learning (RL). However, existing RL-based methods have an obvious drawback. To solve an MDP in a recommender system, they encountered a problem with the large number of discrete actions that bring RL to a larger class of problems. In this paper, we propose a novel RL-based recommender system. We formulate a recommender system as a gridworld game by using a biclustering technique that can reduce the state and action space significantly. Using biclustering not only reduces space but also improves the recommendation quality effectively handling the cold-start problem. In addition, our approach can provide users with some explanation why the system recommends certain items. Lastly, we examine the proposed algorithm on a real-world dataset and achieve a better performance than the widely used recommendation algorithm.
The past decade has seen a growth in the development and deployment of educational technologies for assisting college-going students in choosing majors, selecting courses and acquiring feedback based on past academic performance. Grade prediction methods seek to estimate a grade that a student may achieve in a course that she may take in the future (e.g., next term). Accurate and timely prediction of students’ academic grades is important for developing effective degree planners and early warning systems, and ultimately improving educational outcomes. Existing grade prediction methods mostly focus on modeling the knowledge components associated with each course and student, and often overlook other factors such as the difficulty of each knowledge component, course instructors, student interest, capabilities and effort. In this paper, we propose additive latent effect models that incorporate these factors to predict students’ next-term grades. Specifically, the proposed models take into account four factors: (i) student’s academic level, (ii) course instructors, (iii) student global latent factor, and (iv) latent knowledge factors. We compared the new models with several state-of-the-art methods on students of various characteristics (e.g., whether a student transferred in or not). The experimental results demonstrate that the proposed methods significantly outperform the baselines on the grade prediction problem. Moreover, we perform a thorough analysis on the importance of different factors and how these factors can practically assist students in course selection, and finally improve their academic performance.
We present FusedGAN, a deep network for conditional image synthesis with controllable sampling of diverse images. Fidelity, diversity and controllable sampling are the main quality measures of a good image generation model. Most existing models are insufficient in all three aspects. The FusedGAN can perform controllable sampling of diverse images with very high fidelity. We argue that controllability can be achieved by disentangling the generation process into various stages. In contrast to stacked GANs, where multiple stages of GANs are trained separately with full supervision of labeled intermediate images, the FusedGAN has a single stage pipeline with a built-in stacking of GANs. Unlike existing methods, which requires full supervision with paired conditions and images, the FusedGAN can effectively leverage more abundant images without corresponding conditions in training, to produce more diverse samples with high fidelity. We achieve this by fusing two generators: one for unconditional image generation, and the other for conditional image generation, where the two partly share a common latent space thereby disentangling the generation. We demonstrate the efficacy of the FusedGAN in fine grained image generation tasks such as text-to-image, and attribute-to-face generation.
Recent advances in meta-learning demonstrate that deep representations combined with the gradient descent method have sufficient capacity to approximate any learning algorithm. A promising approach is the model-agnostic meta-learning (MAML) which embeds gradient descent into the meta-learner. It optimizes for the initial parameters of the learner to warm-start the gradient descent updates, such that new tasks can be solved using a small number of examples. In this paper we elaborate the gradient-based meta-learning, developing two new schemes. First, we present a feedforward neural network, referred to as T-net, where the linear transformation between two adjacent layers is decomposed as T W such that W is learned by task-specific learners and the transformation T, which is shared across tasks, is meta-learned to speed up the convergence of gradient updates for task-specific learners. Second, we present MT-net where gradient updates in the T-net are guided by a binary mask M that is meta-learned, restricting the updates to be performed in a subspace. Empirical results demonstrate that our method is less sensitive to the choice of initial learning rates than existing meta-learning methods, and achieves the state-of-the-art or comparable performance on few-shot classification and regression tasks.
To create a new IR test collection at minimal cost, we must carefully select which documents merit human relevance judgments. Shared task campaigns such as NIST TREC determine this by pooling search results from many participating systems (and often interactive runs as well), thereby identifying the most likely relevant documents in a given collection. While effective, it would be preferable to be able to build a new test collection without needing to run an entire shared task. Toward this end, we investigate multiple active learning (AL) strategies which, without reliance on system rankings: 1) select which documents human assessors should judge; and 2) automatically classify the relevance of remaining unjudged documents. Because scarcity of relevant documents tends to yield highly imbalanced training data for model estimation, we investigate sampling strategies to mitigate class imbalance. We report experiments on four TREC collections with varying scarcity of relevant documents, reporting labeling accuracy achieved, as well as rank correlation when evaluating participant systems using these labels vs. NIST judgments. Results demonstrate the effectiveness of our approach, coupled with further analysis showing how varying relevance scarcity, within and across collections, impacts findings.
The concept of innateness is rarely discussed in the context of artificial intelligence. When it is discussed, or hinted at, it is often in the context of trying to reduce the amount of innate machinery in a given system. In this paper, I consider as a test case a recent series of papers by Silver et al (Silver et al., 2017a) on AlphaGo and its successors that have been presented as an argument that ‘even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance’, ‘starting tabula rasa.’ I argue that these claims are overstated, for multiple reasons. I close by arguing that artificial intelligence needs greater attention to innateness, and I point to some proposals about what that innateness might look like.
Owing to the rapid development of deep neural network (DNN) techniques and the emergence of large scale face databases, face recognition has achieved great success in recent years. During the training process of a DNN, the face features and classification vectors to be learned will interact with each other, while the distribution of face features will largely affect the convergence status of the network and the face similarity computing in the test stage. In this work, we formulate jointly the learning of face features and classification vectors, and propose a simple yet effective centralized coordinate learning (CCL) method, which enforces the features to be dispersedly spanned in the coordinate space while ensuring the classification vectors to lie on a hypersphere. An adaptive angular margin is further proposed to enhance the discrimination capability of face features. Extensive experiments are conducted on six face benchmarks, including those with large age gaps and hard negative samples. Trained only on the small-scale CASIA Webface dataset with 460K face images from about 10K subjects, our CCL model demonstrates high effectiveness and generality, showing consistently competitive performance across all six benchmark databases.
CBCT images suffer from acute shading artifacts primarily due to scatter. Numerous image-domain correction algorithms have been proposed in the literature that use patient-specific planning CT images to estimate shading contributions in CBCT images. However, in the context of radiosurgery applications such as gamma knife, planning images are often acquired through MRI which impedes the use of polynomial fitting approaches for shading correction. We present a new shading correction approach that is independent of planning CT images. Our algorithm is based on the assumption that true CBCT images follow a uniform volumetric intensity distribution per material, and scatter perturbs this uniform texture by contributing cupping and shading artifacts in the image domain. The framework is a combination of fuzzy C-means coupled with a neighborhood regularization term and Otsu’s method. Experimental results on artificially simulated craniofacial CBCT images are provided to demonstrate the effectiveness of our algorithm. Spatial non-uniformity is reduced from 16% to 7% in soft tissue and from 44% to 8% in bone regions. With shading-correction, thresholding based segmentation accuracy for bone pixels is improved from 85% to 91% when compared to thresholding without shading-correction. The proposed algorithm is thus practical and qualifies as a plug and play extension into any CBCT reconstruction software for shading correction.
Multilayered artificial neural networks are becoming a pervasive tool in a host of application fields. At the heart of this deep learning revolution are familiar concepts from applied and computational mathematics; notably, in calculus, approximation theory, optimization and linear algebra. This article provides a very brief introduction to the basic ideas that underlie deep learning from an applied mathematics perspective. Our target audience includes postgraduate and final year undergraduate students in mathematics who are keen to learn about the area. The article may also be useful for instructors in mathematics who wish to enliven their classes with references to the application of deep learning techniques. We focus on three fundamental questions: what is a deep neural network? how is a network trained? what is the stochastic gradient method? We illustrate the ideas with a short MATLAB code that sets up and trains a network. We also show the use of state-of-the art software on a large scale image classification problem. We finish with references to the current literature.
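To make the article's third question concrete, here is a short sketch of the stochastic gradient method on a toy least-squares problem. The article itself works in MATLAB; this is an independent Python/numpy illustration with invented sizes and values, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised problem: recover w_true from noisy linear data
# (all dimensions and values invented for illustration).
n, d = 1000, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Stochastic gradient method: at each step, move against the gradient
# of the loss evaluated at a single randomly chosen sample, instead of
# the full-data gradient.
w = np.zeros(d)
lr = 0.01  # fixed step size (learning rate)
for _ in range(50_000):
    i = rng.integers(n)
    grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # gradient of (x_i . w - y_i)^2
    w -= lr * grad

print("learned weights:", np.round(w, 2))
```

Each step is cheap (one sample, not the whole data set), which is why this method, rather than full-batch gradient descent, scales to the large training sets used in deep learning.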
Gaussian graphical models are used for determining conditional relationships between variables. This is accomplished by identifying off-diagonal elements in the inverse-covariance matrix that are non-zero. When the ratio of variables (p) to observations (n) approaches one, the maximum likelihood estimator of the covariance matrix becomes unstable and requires shrinkage estimation. Whereas several classical (frequentist) methods have been introduced to address this issue, Bayesian methods remain relatively uncommon in practice and methodological literatures. Here we introduce a Bayesian method for estimating sparse matrices, in which conditional relationships are determined with projection predictive selection. Through simulation and an applied example, we demonstrate that the proposed method often outperforms both classical and alternative Bayesian estimators with respect to frequentist risk and consistently made the fewest false positives. We end by discussing limitations and future directions, as well as contributions to the Bayesian literature on the topic of sparsity.
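The premise here, that conditional independence corresponds to zeros in the inverse covariance, can be checked numerically in a few lines of numpy. This sketch illustrates only that premise on an invented three-variable chain; it does not implement the paper's projection predictive selection or any shrinkage machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

# Chain x1 -> x2 -> x3: x1 and x3 are marginally correlated but
# conditionally independent given x2 (coefficients are invented).
n = 200_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Zeros in the inverse covariance (precision) matrix encode
# conditional independence between the corresponding variables.
precision = np.linalg.inv(np.cov(X, rowvar=False))

# The marginal correlation of x1 and x3 is clearly non-zero, yet the
# (1, 3) entry of the precision matrix is approximately zero.
print("corr(x1, x3):", round(float(np.corrcoef(x1, x3)[0, 1]), 2))
print("precision[0, 2]:", round(float(precision[0, 2]), 3))
```

The hard part the paper addresses is that with p close to n this plain matrix inverse becomes unstable, so the sample precision matrix can no longer be trusted to reveal the zeros.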
We present N2Net, a system that implements binary neural networks using commodity switching chips deployed in network switches and routers. Our system shows that these devices can run simple neural network models, whose input is encoded in the network packets’ header, at packet processing speeds (billions of packets per second). Furthermore, our experience highlights that switching chips could support even more complex models, provided that some minor and cheap modifications to the chip’s design are applied. We believe N2Net provides an interesting building block for future end-to-end networked systems.
The generalized linear model plays an important role in statistical analysis and the related design issues are undoubtedly challenging. The state-of-the-art works mostly apply to design criteria on the estimates of regression coefficients. It is of importance to study optimal designs for generalized linear models, especially on the prediction aspects. In this work, we propose a prediction-oriented design criterion, I-optimality, and develop an efficient sequential algorithm of constructing I-optimal designs for generalized linear models. Through establishing the General Equivalence Theorem of the I-optimality for generalized linear models, we obtain an insightful understanding for the proposed algorithm on how to sequentially choose the support points and update the weights of support points of the design. The proposed algorithm is computationally efficient with guaranteed convergence property. Numerical examples are conducted to evaluate the feasibility and computational efficiency of the proposed algorithm.
Residual learning with skip connections permits training ultra-deep neural networks and obtains superb performance. Building in this direction, DenseNets proposed a dense connection structure where each layer is directly connected to all of its predecessors. The densely connected structure leads to better information flow and feature reuse. However, the overly dense skip connections also bring about the problems of potential risk of overfitting, parameter redundancy and large memory consumption. In this work, we analyze the feature aggregation patterns of ResNets and DenseNets under a uniform aggregation view framework. We show that both structures densely gather features from previous layers in the network but combine them in their respective ways: summation (ResNets) or concatenation (DenseNets). We compare the strengths and drawbacks of these two aggregation methods and analyze their potential effects on the networks’ performance. Based on our analysis, we propose a new structure named SparseNets which achieves better performance with fewer parameters than DenseNets and ResNets.
In this paper we try to organize machine teaching as a coherent set of ideas. Each idea is presented as varying along a dimension. The collection of dimensions then form the problem space of machine teaching, such that existing teaching problems can be characterized in this space. We hope this organization allows us to gain deeper understanding of individual teaching problems, discover connections among them, and identify gaps in the field.
With the demand for machine learning increasing, so does the demand for tools which make it easier to use. Automated machine learning (AutoML) tools have been developed to address this need, such as the Tree-Based Pipeline Optimization Tool (TPOT) which uses genetic programming to build optimal pipelines. We introduce Layered TPOT, a modification to TPOT which aims to create pipelines equally good as the original, but in significantly less time. This approach evaluates candidate pipelines on increasingly large subsets of the data according to their fitness, using a modified evolutionary algorithm to allow for separate competition between pipelines trained on different sample sizes. Empirical evaluation shows that, on sufficiently large datasets, Layered TPOT indeed finds better models faster.
Nonnegative matrix factorization (NMF) is one of the most frequently used matrix factorization models in data analysis. A significant reason for the popularity of NMF is its interpretability and the ‘parts of whole’ interpretation of its components. Recently, max-times, or subtropical, matrix factorization (SMF) has been introduced as an alternative model with an equally interpretable ‘winner takes it all’ interpretation. In this paper we propose a new mixed linear–tropical model and a new algorithm, called Latitude, that combines NMF and SMF and is able to smoothly alternate between the two. In our model, the data is modeled using latent factors and latent parameters that control whether the factors are interpreted as NMF or SMF features, or as their mixtures. We present an algorithm for our novel matrix factorization. Our experiments show that our algorithm improves over both baselines, and can yield interpretable results that reveal more of the latent structure than either NMF or SMF alone.
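A minimal sketch of the difference between the two factorization semantics, using small hypothetical nonnegative factor matrices (the values are illustrative, not from the paper):

```python
import numpy as np

B = np.array([[1.0, 0.5],
              [0.2, 1.0]])
C = np.array([[0.5, 1.0],
              [1.0, 0.3]])

# Standard (NMF-style) product: each entry sums contributions over all factors,
# the 'parts of whole' semantics.
nmf_prod = B @ C

# Subtropical (SMF-style) max-times product: each entry is the max over factors
# of B[i, k] * C[k, j], the 'winner takes it all' semantics -- every entry is
# explained by a single dominant factor rather than a sum of parts.
smf_prod = np.max(B[:, :, None] * C[None, :, :], axis=1)

print(nmf_prod)
print(smf_prod)
```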
Transfer learning has revolutionized computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Fine-tuned Language Models (FitLaM), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a state-of-the-art language model. Our method significantly outperforms the state-of-the-art on five text classification tasks, reducing the error by 18-24% on the majority of datasets. We open-source our pretrained models and code to enable adoption by the community.
In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a fixed, large step size and propose a novel assumption on the objective function, under which this method achieves improved convergence rates (to a neighborhood of the optimal solutions). We then empirically demonstrate that these assumptions hold for logistic regression and standard deep neural networks on classical data sets. Thus our analysis helps to explain when efficient behavior can be expected from the SGD method in training classification models and deep neural networks.
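As an illustration of the setting (not the paper's exact assumptions or analysis), here is fixed-step-size SGD on a synthetic logistic regression problem; the data, step size, and iteration count are all assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def grad(w, i):
    # Stochastic gradient of the logistic loss at a single sample i.
    p = 1.0 / (1.0 + np.exp(-X[i] @ w))
    return (p - y[i]) * X[i]

w = np.zeros(d)
step = 0.5                    # fixed, fairly large step size
for _ in range(5000):
    w -= step * grad(w, rng.integers(n))

# For nearly separable data the direction (not scale) of w should align
# with w_true.
print(np.round(w / np.linalg.norm(w), 2))
```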
While existing machine learning models have achieved great success for sentiment classification, they typically do not explicitly capture sentiment-oriented word interaction, which can lead to poor results for fine-grained analysis at the snippet level (a phrase or sentence). Factorization Machines provide a possible approach to learning element-wise interactions for recommender systems, but they are not directly applicable to our task due to their inability to model contexts and word sequences. In this work, we develop two Position-aware Factorization Machines which consider word interaction, context and position information. Such information is jointly encoded in a set of sentiment-oriented word interaction (SWI) vectors. Compared to traditional word embeddings, SWI vectors explicitly capture sentiment-oriented word interactions and simplify parameter learning. Experimental results show that while they have comparable performance with state-of-the-art methods for document-level classification, they benefit snippet/sentence-level sentiment analysis.
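The pairwise-interaction term of a standard factorization machine, which this line of work builds on, can be sketched in NumPy; the O(nk) reformulation below is the standard FM computational trick, and all values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 6, 3
x = rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))   # one latent vector per feature

# Naive O(n^2) pairwise-interaction term: sum_{i<j} <v_i, v_j> * x_i * x_j.
pair = sum((V[i] @ V[j]) * x[i] * x[j]
           for i in range(n_features) for j in range(i + 1, n_features))

# Equivalent O(n*k) form used by factorization machines:
# 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ].
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))

print(np.isclose(pair, fast))   # the two forms agree
```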

### Distilled News

DataScience: Elevate is a full-day event dedicated to data science best practices. Register today to hear from experts at Uber, Facebook, Salesforce, and more. DataScience: Elevate provides a closer look at how today’s top companies use machine learning and artificial intelligence to do better business. Free to attend, this multi-city event features presentations, panels, and networking sessions designed to elevate data science work and connect you with the companies that are driving change in enterprise data science.
Imagine a world where machines understand what you want and how you are feeling when you call customer care – if you are unhappy about something, you speak to a person quickly; if you are looking for specific information, you may not need to talk to a person at all (unless you want to!). This is going to be the new order of the world – you can already see this happening to a good degree. Check out the highlights of 2017 in the data science industry and you can see the breakthroughs that deep learning brought to fields that were difficult to crack before. One such field that deep learning has the potential to help solve is audio/speech processing, especially given its unstructured nature and vast impact. So for the curious ones out there, I have compiled a list of tasks that are worth getting your hands dirty with when starting out in audio processing. I’m sure there will be a few more breakthroughs to come using deep learning. The article is structured to explain each task and its importance. There is also a research paper that goes into the details of each specific task, along with a case study to help you get started in solving it. So let’s get cracking!
In the past few years, machine learning (ML) has revolutionized the way we do business. A disruptive breakthrough that differentiates machine learning from other approaches to automation is a step away from rules-based programming. ML algorithms allowed engineers to leverage data without explicitly programming machines to follow specific paths of problem-solving. Instead, machines themselves arrive at the right answers based on the data they have. This capability made business executives reconsider the ways they use data to make decisions. In layman’s terms, machine learning is applied to make forecasts on incoming data using historic data as a training example. For instance, you may want to predict customer lifetime value in an eCommerce store, measured as the net profit of the future relationship with a customer. If you already have historic data on different customer interactions with your website and net profits associated with these customers, you may want to use machine learning. It will allow for early detection of those customers who are likely to bring the most net profit, enabling you to focus greater service effort on them. While there are multiple learning styles, i.e. the approaches to training algorithms using data, the most common style is called supervised learning. This time, we’ll talk about this branch of data science and explain why it is considered low-hanging fruit for businesses that plan to embark on the ML initiative, additionally describing the most common use cases.
We’ve compiled a list of the hottest events and conferences from the world of Data Science, Machine Learning and Artificial Intelligence happening in 2018. Below are all the links you need to get yourself to these great events!
In this article, we have outlined some of the Scala libraries that can be very useful while performing major data scientific tasks. They have proved to be highly helpful and effective for achieving the best results.
In this article, I’ll talk about Generative Adversarial Networks, or GANs for short. GANs are one of the very few machine learning techniques which have given good performance for generative tasks, or more broadly, unsupervised learning. In particular, they have given splendid performance for a variety of image generation related tasks. Yann LeCun, one of the forefathers of deep learning, has called them “the best idea in machine learning in the last 10 years”. Most importantly, the core conceptual ideas associated with a GAN are quite simple to understand (and in fact, you should have a good idea about them by the time you finish reading this article).
How are you monitoring your Python applications? Take the short survey – the results will be published on KDnuggets and you will get all the details.
Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible.
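A minimal sketch of the idea with one binary confounder, estimating the propensity score by stratum frequencies and comparing a naive difference of means with an inverse-probability-weighted (IPW) estimate; the data-generating setup is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Binary confounder X affects both treatment assignment and outcome.
x = rng.binomial(1, 0.5, n)
# Treatment is more likely when x == 1, so assignment is not random.
t = rng.binomial(1, np.where(x == 1, 0.8, 0.2))
# True treatment effect is +2.0; the confounder adds +1.0 to the outcome.
y = 2.0 * t + 1.0 * x + rng.normal(0, 1, n)

# Naive difference of means is biased upward by the confounder.
naive = y[t == 1].mean() - y[t == 0].mean()

# Propensity score e(x) = P(T=1 | X=x), estimated by stratum frequencies.
e = np.array([t[x == v].mean() for v in (0, 1)])[x]

# IPW estimate of the average treatment effect recovers ~2.0.
ipw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
print(round(naive, 2), round(ipw, 2))
```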
TensorFlow 1.4 was released a few weeks ago with an implementation of gradient boosting, called TensorFlow Boosted Trees (TFBT). Unfortunately, the paper does not have any benchmarks, so I ran some against XGBoost. For many Kaggle-style data mining problems, XGBoost has been the go-to solution since its release in 2014. It’s probably as close to an out-of-the-box machine learning algorithm as you can get today, as it gracefully handles un-normalized or missing data while being accurate and fast to train.
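Gradient boosting itself is simple to sketch: the toy version below fits regression stumps to residuals under squared loss, which is the core idea behind both TFBT and XGBoost, minus all the engineering (regularization, missing-value handling, parallelism) that makes those libraries fast and robust. The data, learning rate, and threshold grid are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

def fit_stump(x, residual):
    # Best single-split regression tree (a 'stump') by squared error,
    # over a fixed grid of candidate thresholds.
    best = None
    for s in np.linspace(1, 9, 17):
        left, right = residual[x < s].mean(), residual[x >= s].mean()
        err = ((np.where(x < s, left, right) - residual) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left, right)
    _, s, left, right = best
    return lambda q: np.where(q < s, left, right)

# Gradient boosting for squared loss: repeatedly fit stumps to residuals
# (the residuals are the negative gradient of the loss).
pred, lr, trees = np.zeros_like(y), 0.3, []
for _ in range(100):
    tree = fit_stump(x, y - pred)
    trees.append(tree)
    pred += lr * tree(x)

print(round(float(np.mean((pred - y) ** 2)), 3))
```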
In a previous post, I outlined emerging applications of reinforcement learning (RL) in industry. I began by listing a few challenges facing anyone wanting to apply RL, including the need for large amounts of data, and the difficulty of reproducing research results and deriving the error estimates needed for mission-critical applications. Nevertheless, the success of RL in certain domains has been the subject of much media coverage. This has sparked interest, and companies are beginning to explore some of the use cases and applications I described in my earlier post. Many tasks and professions, including software development, are poised to incorporate some forms of AI-powered automation. In this post, I’ll describe how RISE Lab’s Ray platform continues to mature and evolve just as companies are examining use cases for RL. Assuming one has identified suitable use cases, how does one get started with RL? Most companies that are thinking of using RL for pilot projects will want to take advantage of existing libraries.
Any programming environment should be optimized for its task, and not all tasks are alike. For example, if you are exploring uncharted mountain ranges, the portability of a tent is essential. However, when building a house to weather hurricanes, investing in a strong foundation is important. Similarly, when beginning a new data science programming project, it is prudent to assess how much effort should be put into ensuring the code is reproducible. Note that it is certainly possible to go back later and “shore up” the reproducibility of a project where it is weak. This is often the case when an “ad-hoc” project becomes an important production analysis. However, the first step in starting a project is to make a decision regarding the trade-off between the amount of time to set up the project and the probability that the project will need to be reproducible in arbitrary environments.
Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations.
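The idea can be sketched in pure Python: enumerate every hyperparameter combination, fit on training data, and keep the combination with the best validation score. In practice one would use an off-the-shelf implementation such as scikit-learn's GridSearchCV over a Pipeline; the toy one-parameter ridge model and data below are illustrative assumptions:

```python
from itertools import product

# Toy data: y is roughly 3*x plus noise, split into train/validation.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 3.2, 5.9, 9.1, 11.8, 15.2]
x_tr, y_tr = xs[:4], ys[:4]
x_va, y_va = xs[4:], ys[4:]

def fit_ridge(x, y, alpha):
    # Closed-form 1-D ridge without intercept: w = sum(x*y) / (sum(x*x) + alpha)
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def mse(w, x, y):
    return sum((w * a - b) ** 2 for a, b in zip(x, y)) / len(x)

# The "grid": every combination of the listed hyperparameter values is tried.
grid = {"alpha": [0.0, 0.1, 1.0, 10.0]}
best = min(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: mse(fit_ridge(x_tr, y_tr, **params), x_va, y_va),
)
print(best)
```

With several parameters, `product` enumerates their full cross-product, which is exactly why grid search cost grows multiplicatively with each added hyperparameter.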

### Version 2.2.2 Released

(This article was first published on ggtern: ternary diagrams in R, and kindly contributed to R-bloggers)

ggtern version 2.2.2 has just been submitted to CRAN, and it includes a number of new features. This time around, I have adapted the hexbin geometry (and stat), and additionally, created an almost equivalent geometry which operates on a triangular mesh rather than a hexagonal mesh. There are some subtle differences which give some added functionality, and together these will provide an additional level of richness to ternary diagrams produced with ggtern, when the data-set is perhaps significantly large and points themselves start to lose their meaning from visual clutter.

### Ternary Hexbin

Firstly, let’s look at the ternary hexbin which, as the name suggests, has the capability to bin points in a regular hexagonal grid to produce a pseudo-surface. In the original ggplot version, this geometry is somewhat limiting since it only performs a ‘count’ on the number of points in each bin; however, it is not hard to imagine the value of performing a ‘mean’, a ‘standard deviation’, or another user-defined scalar function on a mapping provided by the user:

set.seed(1)
n  = 5000
df = data.frame(x     = runif(n),
                y     = runif(n),
                z     = runif(n),
                value = runif(n))

ggtern(df, aes(x, y, z)) +
  geom_hex_tern(bins = 5, aes(value = value, fill = ..count..))

Now, because we can define user functions, we can do something a little more fancy. Here we will calculate the mean within each hexagon and also superimpose a text label over the top.

ggtern(df, aes(x, y, z)) +
  theme_bw() +
  geom_hex_tern(bins = 5, fun = mean, aes(value = value, fill = ..stat..)) +
  stat_hex_tern(bins = 5, fun = mean,
                geom = 'text',
                aes(value = value, label = sprintf("%.2f", ..stat..)),
                size = 3, color = 'white')

### Ternary Tribin

The ternary tribin operates much the same, except that the binwidth no longer has meaning; instead, the density (number of panels) of the triangular mesh is controlled exclusively by the ‘bins’ argument. Using the same data as above, let’s create some equivalent plots:

ggtern(df, aes(x, y, z)) +
  geom_tri_tern(bins = 5, aes(value = value, fill = ..count..))

There is a subtle difference in the labelling in the stat_tri_tern usage below: we have introduced a ‘centroid’ parameter. This is because the orientation of each polygon is not consistent (some point up, some point down); unlike the hexbin, where the centroid is returned by default and the construction of the polygons is handled by the geometry, for the tribin this is all handled in the stat.

ggtern(df, aes(x, y, z)) +
  theme_bw() +
  geom_tri_tern(bins = 5, fun = mean, aes(value = value, fill = ..stat..)) +
  stat_tri_tern(bins = 5, fun = mean,
                geom = 'text',
                aes(value = value, label = sprintf("%.2f", ..stat..)),
                size = 3, color = 'white', centroid = TRUE)

These new geometries have been on the cards for quite some time; several users have requested them. Many thanks to Laurie and Steve from QEDInsight for partially supporting the development of this work, and for forcing me to pull my finger out and get it done. Hopefully we will see these in some awesome publications this year. Until this is accepted on CRAN, you will have to download from my bitbucket repo.

Cheers.

Hamo

The post Version 2.2.2 Released appeared first on ggtern: ternary diagrams in R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Document worth reading: “An Overview on Data Representation Learning: From Traditional Feature Learning to Recent Deep Learning”

Over roughly the past 100 years, many representation learning approaches have been proposed to learn the intrinsic structure of data, including both linear and nonlinear ones, supervised and unsupervised ones. In particular, deep architectures have been widely applied for representation learning in recent years and have delivered top results in many tasks, such as image classification, object detection and speech recognition. In this paper, we review the development of data representation learning methods. Specifically, we investigate both traditional feature learning algorithms and state-of-the-art deep learning models. The history of data representation learning is introduced, while available resources (e.g. online course, tutorial and book information) and toolboxes are provided. Finally, we conclude this paper with remarks and some interesting research directions on data representation learning. An Overview on Data Representation Learning: From Traditional Feature Learning to Recent Deep Learning

### Data Driven DIY

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

Statisfix –

## Which fixing should I buy?

I have a bathroom cabinet to put up.

It needs to go onto a tiled plasterboard (drywall) wall.
Because of the tiles, I can’t use the fixings I normally use to keep heavy objects fixed to the wall.
And bog standard rawlplugs aren’t going to do the job.

YouTube to the rescue – more specifically, this fine chap at Ultimate Handyman.

Not only does he demonstrate how to use the fixings, but also produced this strangely mesmerising strength test showing how much weight the fixings support before the plasterboard gives out.

As well as the strength of the fixing, I need to consider the price of the fixings, and also, the size of the hole required (which in turn, will also impact the overall cost of the job if I have to buy new drill bits).
Plus – I’ve never had to drill into tiles before so the smaller the hole, the better.

Here’s my code to try and visualise what to buy:

What would you go for?

In the end, I had to buy snap toggles (had never heard of them before), they were much easier to install than spring toggles.

Early days, but the cabinet is up, feels solid and worth the time spent on it.

Unlike this post.

Until next time..


## January 19, 2018

### Because it's Friday: Principles and Values

Most companies publish mission and vision statements, and some also publish a detailed list of principles that underlie the company ethos. But what makes a good collection of principles, and does writing them down really matter? At the recent Monktoberfest conference, Bryan Cantrill argued that yes, they do matter, mostly by way of some really egregious counterexamples.

That's all from the blog for this week. We'll be back on Monday — have a great weekend!

### Book Memo: “Advances in Hybridization of Intelligent Methods”

Models, Systems and Applications. This book presents recent research on the hybridization of intelligent methods, which refers to combining methods to solve complex problems. It discusses hybrid approaches covering different areas of intelligent methods and technologies, such as neural networks, swarm intelligence, machine learning, reinforcement learning, deep learning, agent-based approaches, knowledge-based systems and image processing. The book includes extended and revised versions of invited papers presented at the 6th International Workshop on Combinations of Intelligent Methods and Applications (CIMA 2016), held in The Hague, the Netherlands, in August 2016. The book is intended for researchers and practitioners from academia and industry interested in using hybrid methods for solving complex problems.

Google has recently released a Jupyter Notebook platform called Google Colaboratory. You can run Python code in a browser, share results, and save your code for later. It currently does not support R code.


### The Friday #rstats PuzzleR : 2018-01-19

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Peter Meissner (@marvin_dpr) released crossword.r to CRAN today. It’s a spiffy package that makes it dead simple to generate crossword puzzles.

He also made a super spiffy javascript library to pair with it, which can turn crossword model output into an interactive puzzle.

I thought I’d combine those two creations with a way to highlight new/updated packages from the previous week, cool/useful packages in general, and some R functions that might come in handy. Think of it as a weekly way to get some R information while having a bit of fun!

This was a quick, rough creation and I’ll be changing the styles a bit for next Friday’s release, but Peter’s package is so easy to use that I have absolutely no excuse to not keep this a regular feature of the blog.

I’ll release a static, ggplot2 solution to each puzzle the following Monday(s). If you solve it before then, tweet a screen shot of your solution with the tag #rstats #puzzler and I’ll pick the first time-stamped one to highlight the following week.

I’ll also get a GitHub setup for suggestions/contributions to this effort + to hold the puzzle data.


### If you did not already know

Robust Multiple Signal Classification (MUSIC)
In this paper, we introduce a new framework for robust multiple signal classification (MUSIC). The proposed framework, called robust measure-transformed (MT) MUSIC, is based on applying a transform to the probability distribution of the received signals, i.e., transformation of the probability measure defined on the observation space. In robust MT-MUSIC, the sample covariance is replaced by the empirical MT-covariance. By judicious choice of the transform we show that: 1) the resulting empirical MT-covariance is B-robust, with bounded influence function that takes negligible values for large norm outliers, and 2) under the assumption of spherically contoured noise distribution, the noise subspace can be determined from the eigendecomposition of the MT-covariance. Furthermore, we derive a new robust measure-transformed minimum description length (MDL) criterion for estimating the number of signals, and extend the MT-MUSIC framework to the case of coherent signals. The proposed approach is illustrated in simulation examples that show its advantages as compared to other robust MUSIC and MDL generalizations. …

Cumulative Gains Model Quality Metric
In developing risk models, developers employ a number of graphical and numerical tools to evaluate the quality of candidate models. These traditionally involve numerous measures, including the KS statistic or one of many Area Under the Curve (AUC) methodologies on ROC and cumulative gains charts. Typical employment of these methodologies involves one of two scenarios. The first is as a tool to evaluate one or more models and ascertain the effectiveness of each model. The second, however, is the inclusion of such a metric in the model building process itself, such as the way Ferri et al. proposed to use Area Under the ROC curve in the splitting criterion of a decision tree. However, these methods fail to address situations involving competing models where one model is not strictly above the other. Nor do they address differing values of end points, as the magnitudes of these typical measures may vary depending on target definition, making standardization difficult. Some of these problems are starting to be addressed. Marcade, Chief Technology Officer of the software vendor KXEN, gives an overview of several metric techniques and proposes a new solution to the problem in data mining techniques. Their software uses two statistics called KI and KR. We will examine the shortfalls he addresses more thoroughly and propose a new metric which can be used as an improvement to the KI and KR statistics. Although useful in a machine learning sense of developing a model, these same issues and solutions apply to evaluating a single model’s performance, as related by Siddiqi and Mays with respect to risk scorecards. We will not specifically give examples of each application of the new statistics but rather make the claim that it is useful in most situations where an AUC or model separation statistic (such as KS) is used. …

Probabilistic D-Clustering
We present a new iterative method for probabilistic clustering of data. Given clusters, their centers and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster in question. This assumption is our working principle. The method is a generalization, to several centers, of the Weiszfeld method for solving the Fermat-Weber location problem. At each iteration, the distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for all data points, and the centers are updated as convex combinations of these points, with weights determined by the above principle. Computations stop when the centers stop moving. Progress is monitored by the joint distance function, a measure of distance from all cluster centers, that evolves during the iterations, and captures the data in its low contours. The method is simple, fast (requiring a small number of cheap iterations) and insensitive to outliers. …
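The iteration just described can be sketched in NumPy; the p²/d update weights below follow the generalized Weiszfeld scheme and should be read as an assumption of this sketch rather than the authors' exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated 2-D blobs of 50 points each.
pts = np.vstack([rng.normal(0, 0.3, (50, 2)),
                 rng.normal(3, 0.3, (50, 2))])
centers = np.array([[0.5, 0.5], [2.5, 2.5]])   # rough initial guesses

for _ in range(20):
    # Euclidean distances of every point to every center (n x k).
    d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    # Working principle: membership probability inversely proportional
    # to distance, normalized over clusters.
    p = (1.0 / d) / (1.0 / d).sum(axis=1, keepdims=True)
    # Centers become convex combinations of the points, with
    # Weiszfeld-style weights p**2 / d.
    w = p ** 2 / d
    centers = (w.T @ pts) / w.sum(axis=0)[:, None]

print(np.round(centers, 1))
```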

### Tracking America in the age of Trump

During his first year as America’s president, Donald Trump attempted to redefine what it means to be leader of the free world. He has seen White House staffers come and go; been embroiled in scandal; waged war against “fake news”; and offended friends and foes alike.

### Curb your imposterism, start meta-learning

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

Recently, there has been a lot of attention for the imposter syndrome. Even seasoned programmers admit they suffer from feelings of anxiety and low self-esteem. Some share their personal stories, which can be comforting for those suffering in silence. Here, I focus on a method that helped me grow confidence in recent years. It is a simple, yet very effective way to deal with being overwhelmed by the many things a data scientist can acquaint him- or herself with.

## Two Faces of the Imposter Demon

I think imposterism can be broken into two related entities. The first is feeling you are falling short on a personal level. That is, you think you are not intelligent enough, you think you don’t have perseverance, or any other way to describe that you are not up to the job. Most advice for overcoming imposterism focuses on this part. I do not. Rather, I focus on the second foe: the feeling that you don’t know enough. This can be closely related to the feeling of failing on a personal level; you might feel you don’t know enough because you are too slow a learner. However, I think it is helpful to approach it as objectively as possible. The feeling of not knowing enough can be combated more actively. Not by learning as much as you can, but by considering not knowing a choice, rather than an imperfection.

## You can’t have it all

The field of data science is incredibly broad, comprising, among many other things, getting data out of computer systems, preparing data in databases, principles of distributed computing, building and interpreting statistical models, data visualization, building machine learning pipelines, text analysis, translating business problems into data problems, and communicating results to stakeholders. To make matters worse, for each and every topic there are several, if not dozens, of databases, languages, packages and tools. This means, by definition, no one is going to have mastery of everything the field comprises. And thus there are things you do not, and never will, know.

## Learning new stuff

To stay effective you have to keep up with developments within the field. New packages will aid your data preparation, new tools might process data in a faster way and new machine learning models might give superior results, just to name a few. I think a great deal of imposterism comes from feeling you can’t keep up. There is a constant list in the back of your head with cool new stuff you still have to try out. This is where meta-learning comes into play: actively deciding what you will and will not learn. For my peace of mind it is crucial to decide the things I am not going to do. I keep a log (a Google Sheets document) that has two simple tabs. The first is a collector of stuff I come across in blogs and on twitter. These are things that do look interesting but need a more thorough look. I also add things that I come across in the daily job, such as a certain part of SQL I don’t fully grasp yet. Once in a while I empty the collector, trying to pick up the small stuff right away and moving the larger things either to the second tab or to will-not-do. The second tab holds the larger things I am actually going to learn. With time at hand at work or at home I work on learning the things on the second tab. More about this later.

## Define Yourself

So you cannot have it all, you have to choose. What can be of good help when choosing is to have a definition of your unique data science profile. Here is mine:

I have thorough knowledge of statistical models and know how to apply them. I am a good R programmer, both in interactive analysis and in software development. I know enough about databases to work effectively with them; if necessary I can do the full data preparation in SQL. I know enough math to understand new models and read text books, but I can’t derive and prove new stuff on my own. I have a good understanding of the principles of machine learning and can apply most of the algorithms in practice. My oral and written communication are quite good, which helps me in translating back and forth between data and business problems.

That’s it, focused on what I do well and where I am effective. Some things that are not in there: building a full data pipeline on a Hadoop cluster, telling data stories with d3.js, creating custom algorithms for a business, optimizing a database, effective use of python, and many more. If someone comes to me with one of these tasks, it is just “Sorry, I am not your guy”.

I used to feel that I had to know everything. For instance, I started to learn python because I thought a good data scientist should know it as well as R. Eventually, I realized I will never be good at python, because I will always use R as my bread-and-butter. I know enough python to cooperate in a project where it is used, but that’s it and that it will remain. Rather, I spend time and effort now on improving what I already do well. This is not because I think R is superior to python. I just happen to know R, and I am content with knowing R very well at the cost of not having access to all the wonderful work done in python. I will never learn d3.js, because I don’t know JavaScript and it would take me ages to learn. Rather, I might focus on learning Stan, which is much more fitting to my profile. I think it is both effective and stress-alleviating to go deep on the things you are good at and deliberately choose things you will not learn.

## The meta-learning

I told you about the collector, now a few more words about the meta-learning tab. It has three simple columns: what I am going to learn and how I am going to do that are the first two obvious ones. The most important, however, is why I am going to learn it. For me there are only two valid reasons: either I am very interested in the topic and I envision enjoying doing it, or it will allow me to do my current job more effectively. I stressed current there because scrolling the requirements of job openings is about the perfect way to feed your imposter monster. Focus on what you are doing now and have faith you will pick up new skills if a future job demands it.

Meta-learning gives me focus, relaxation and efficiency. At its core it is defining yourself as a data scientist and deliberately choosing what you are and, more importantly, what you are not going to learn. I have experienced that doing this with rigor actively fights imposterism. Now, what works for me might not work for you; maybe a different system fits you better. However, I think everybody benefits from defining the data scientist he/she is and actively choosing what not to learn.


### Learn Data Science Without a Degree

But how do you learn data science? Let’s take a look at some of the steps you can take to begin your journey into data science without needing a degree, including Springboard’s Data Science Career Track.

### R Packages worth a look

Allows you to retrieve information from the ‘Google Knowledge Graph’ API <https://…/knowledge.html> and process it in R in various forms. The ‘Knowledge Graph Search’ API lets you find entities in the ‘Google Knowledge Graph’. The API uses standard ‘schema.org’ types and is compliant with the ‘JSON-LD’ specification.

Fast Region-Based Association Tests on Summary Statistics (sumFREGAT)
An adaptation of classical region/gene-based association analysis techniques that uses summary statistics (P values and effect sizes) and correlations between genetic variants as input. It is a tool to perform the most common and efficient gene-based tests on the results of genome-wide association (meta-)analyses without having the original genotypes and phenotypes at hand.

Monotonic Association on Zero-Inflated Data (mazeinda)
Methods for calculating and testing the significance of pairwise monotonic association on zero-inflated data, based on the work of Pimentel (2009) <doi:10.4135/9781412985291.n2>. Computation of association of vectors from one or multiple sets can be performed in parallel thanks to the packages ‘foreach’ and ‘doMC’.

Computing Envelope Estimators (Renvlp)
Provides a general routine, envMU(), which allows estimation of the M envelope of span(U) given root n consistent estimators of M and U. The routine envMU() does not presume a model. This package implements response envelopes (env()), partial response envelopes (penv()), envelopes in the predictor space (xenv()), heteroscedastic envelopes (henv()), simultaneous envelopes (stenv()), scaled response envelopes (senv()), scaled envelopes in the predictor space (sxenv()), groupwise envelopes (genv()), weighted envelopes (weighted.env(), weighted.penv() and weighted.xenv()), envelopes in logistic regression (logit.env()), and envelopes in Poisson regression (pois.env()). For each of these model-based routines the package provides inference tools including bootstrap, cross validation, estimation and prediction, and hypothesis testing on coefficients (except for weighted envelopes). Tools for selection of dimension include AIC, BIC and likelihood ratio testing. Background is available in Cook, R. D., Forzani, L. and Su, Z. (2016) <doi:10.1016/j.jmva.2016.05.006>. Optimization is based on a clockwise coordinate descent algorithm.

Model Based Random Forest Analysis (mobForest)
Functions to implement the random forest method for model-based recursive partitioning. The mob() function, developed by Zeileis et al. (2008) within the ‘party’ package, is modified to construct model-based decision trees based on random forest methodology. The main input function mobforest.analysis() takes all input parameters to construct trees and computes out-of-bag errors, predictions, and the overall accuracy of the forest. The algorithm performs parallel computation using cluster functions within the ‘parallel’ package.

### President Trump’s first year, through The Economist’s covers

SATURDAY January 20th marks one year since Donald Trump’s inauguration as the 45th President of the United States. Over the intervening months the world has been forced to come to terms with—and repeatedly adjust to—having Mr Trump in the White House. His first 365 days have hurtled by like an out-of-control fairground ride.

### Porn traffic before and after the missile alert in Hawaii

PornHub compared minute-to-minute traffic on their site before and after the missile alert to an average Saturday (okay for work). Right after the alert there was a dip as people rushed for shelter, but not long after the false alarm notice, traffic appears to spike.

Some interpret this as people rushed to porn after learning that a missile was not headed towards their home. Maybe that’s part of the reason, but my guess is that Saturday morning porn consumers woke earlier than usual.


### Edelweiss: Data Scientist

Seeking a Data Scientist for building, validating and deploying machine learning models on unstructured data for various business problems.

### Plot2txt for quantitative image analysis

Plot2txt converts images into text and other representations, helping create semi-structured data from binary, using a combination of machine learning and other algorithms.

### The Trumpets of Lilliput

Gur Huberman pointed me to this paper by George Akerlof and Pascal Michaillat that gives an institutional model for the persistence of false belief. The article begins:

This paper develops a theory of promotion based on evaluations by the already promoted. The already promoted show some favoritism toward candidates for promotion with similar beliefs, just as beetles are more prone to eat the eggs of other species. With such egg-eating bias, false beliefs may not be eliminated by the promotion system. Our main application is to scientific revolutions: when tenured scientists show favoritism toward candidates for tenure with similar beliefs, science may not converge to the true paradigm. We extend the statistical concept of power to science: the power of the tenure test is the probability (absent any bias) of denying tenure to a scientist who adheres to the false paradigm, just as the power of any statistical test is the probability of rejecting a false null hypothesis. . . .

It was interesting to see a mathematical model for the persistence of errors, and I agree that there must be something to their general point that people are motivated to support work that confirms their beliefs and to discredit work that disconfirms their beliefs. We’ve seen a lot of this sort of analysis at the individual level (“motivated reasoning,” etc.) and it makes sense to think of this at an interpersonal or institutional level too.

There were, however, some specific aspects of their model that I found unconvincing, partly on statistical grounds and partly based on my understanding of how social science works within society:

1. Just as I don’t think it is helpful to describe statistical hypotheses as “true” or “false,” I don’t think it’s helpful to describe scientific paradigms as “true” or “false.” Also, I’m no biologist, but I’m skeptical of a statement such as, “With the beetles, the more biologically fit species does not always prevail.” What does it mean to say a species is “more biologically fit”? If they survive and reproduce, they’re fit, no? And if a species’ eggs get eaten before they’re hatched, that reduces the species’s fitness.

In the article, they modify “true” and “false” to “Better” and “Worse,” but I have pretty much the same problem here, which is that different paradigms serve different purposes, so I don’t see how it typically makes sense to speak of one paradigm as giving “a more correct description of the world,” except in some extreme cases. For example, a few years ago I reviewed a pop-science book that was written from a racist paradigm. Is that paradigm “more correct” or “less correct” than a non-racist paradigm? It depends on what questions are being asked, and what non-racist paradigm is being used as a comparison.

2. Also the whole academia-tenure framework seems misplaced, in that the most important scientific paradigms are rooted in diverse environments, not just academia. For example, the solid-state physics paradigm led to transistors (developed at Bell Labs, not academia) and then of course is dominant in industry. Even goofy theories such as literary postmodernism (is this “Better” or “Worse”? How could we ever tell?) exist as much in the news media as in academe; indeed, if we didn’t keep hearing about deconstructionism in the news media, we’d never have known of its existence. And of course recent trendy paradigms in social psychology (embodied cognition, etc.) are associated with self-help gurus, Gladwell books, etc., as much as with academia. I think that a big part of the success of that sort of work in academia is because of its success in the world of business consulting. The wheelings and dealings of tenure committees are, I suspect, the least of it.

Beyond all this—or perhaps explaining my above comments—is my irritation at people who use university professors as soft targets. Silly tenured professors ha ha. Bad science is a real problem but I think it’s ludicrous to attribute that to the tenure system. Suppose there was no such thing as academic tenure, then I have a feeling that social and biomedical science research would be even more fad-driven.

I sent the above comments to the authors, and Akerlof replied:

I think that your point of view and ours are surprisingly on the same track; in fact the paper answers Thomas Kuhn’s question: what makes science so successful. The point is rather subtle and is in the back pages: especially regarding the differences between promotions of scientists and promotion of surgeons who did radical mastectomies.

The post The Trumpets of Lilliput appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Registration and talk proposals open Monday for useR!2018

Registration will open on Monday (January 22) for useR! 2018, the official R user conference to be held in Brisbane, Australia July 10-13. If you haven't been to a useR! conference before, it's a fantastic opportunity to meet and mingle with other R users from around the world, see talks on R packages and applications, and attend tutorials for deep dives on R-related topics. This year's conference will also feature keynotes from Jenny Bryan, Steph De Silva, Heike Hofmann, Thomas Lin Pedersen, Roger Peng and Bill Venables. It's my favourite conference of the year, and I'm particularly looking forward to this one.

This video from last year's conference in Brussels (a sell-out with over 1,100 attendees) will give you a sense of what a useR! conference is like:

The useR! conference is brought to you by the R Foundation and is 100% community-led. That includes the content: the vast majority of talks come directly from R users. If you've written an R package, performed an interesting analysis with R, or simply have something to share of interest to the R community, consider proposing a talk by submitting an abstract. (Abstract submissions are open now.) Most talks are 20 minutes, but you can also propose a 5-minute lightning talk or a poster. If you're not sure what kind of talk you might want to give, check out the program from useR!2017 for inspiration. R-Ladies, which promotes gender diversity in the R community, can also provide guidance on abstracts. Note that all proposals must comply with the conference code of conduct.

Early-bird registrations close on March 15, and while general registration will be open until June my advice is to get in early, as this year's conference is likely to sell out once again. If you want to propose a talk, submissions are due by March 2 (but early submissions have a better chance of being accepted). Follow @useR2018_conf on Twitter for updates about the conference, and click the links below to register or submit an abstract. I look forward to seeing you in Brisbane!

Update Jan 19: Registrations will now open January 22

useR! 2018: Registration; Abstract submission

### Managing Machine Learning Workflows with Scikit-learn Pipelines Part 2: Integrating Grid Search

Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations.
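The mechanics are easy to illustrate without any library (a minimal sketch of exhaustive search over a parameter grid; scikit-learn's GridSearchCV layers cross-validation and pipeline integration on top of this same idea, and the `alpha`/`depth` hyperparameters and toy scoring function below are made up for illustration):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every hyperparameter combination; return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical objective: peaks at alpha=0.1, depth=3.
def score(alpha, depth):
    return -((alpha - 0.1) ** 2) - (depth - 3) ** 2

best, _ = grid_search({"alpha": [0.01, 0.1, 1.0], "depth": [2, 3, 4]}, score)
print(best)  # → {'alpha': 0.1, 'depth': 3}
```

The cost grows multiplicatively with each parameter added, which is why grid search is usually paired with cross-validated pipelines rather than run blindly over large grids.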

### A lesson from the Charles Armstrong plagiarism scandal: Separation of the judicial and the executive functions

Charles Armstrong is a history professor at Columbia University who, so I’ve heard, has plagiarized and faked references for an award-winning book about Korean history. The violations of the rules of scholarship were so bad that the American Historical Association “reviewed the citation issue after being notified by a member of the concerns some have about the book” and, shortly after that, Armstrong relinquished the award. More background here.

To me, the most interesting part of the story is that Armstrong was essentially forced to give in, especially surprising given how aggressive his original response was, attacking the person whose work he’d stolen.

It’s hard to imagine that Columbia University could’ve made Armstrong return the prize, given that the university gave him a “President’s Global Innovation Fund Grant” many months after the plagiarism story had surfaced.

The reason why, I think, is that the American Historical Association had this independent committee.

And that gets us to the point raised in the title of this post.

Academic and corporate environments are characterized by an executive function with weak to zero legislative or judicial functions. That is, decisions are made based on consequences, with very few rules. Yes, we have lots of little rules and red tape, but no real rules telling the executives what to do.

Evaluating every decision based on consequences seems like it could be a good idea, but it leads to situations where wrongdoers are left in place, as in any given situation it seems like too much trouble to deal with the problem.

An analogy might be with the famous probability-matching problem. Suppose someone shuffles a deck of 100 cards, 70 red and 30 black, and then starts pulling out cards, one at a time, asking you to guess the color of each. You’ll maximize your expected number of correct answers by simply guessing Red, Red, Red, Red, Red, etc. In each case, that’s the right guess, but put it together and your guesses are not representative. Similarly, if for each scandal the university makes the locally optimal decision to do nothing, the result is that nothing is ever done.
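A quick simulation confirms the arithmetic (a sketch using only the standard library): always guessing red is right 70% of the time, while matching your guesses to the 70/30 proportions is right only 0.7² + 0.3² = 58% of the time.

```python
import random

random.seed(42)
deck = ["red"] * 70 + ["black"] * 30
trials = 2_000

always_red = matching = 0
for _ in range(trials):
    random.shuffle(deck)
    for card in deck:
        if card == "red":
            always_red += 1  # "always red" is correct for every red card
        # probability matching: guess red 70% of the time, black 30%
        guess = "red" if random.random() < 0.7 else "black"
        if guess == card:
            matching += 1

draws = trials * 100
print(round(always_red / draws, 2))  # → 0.7
print(round(matching / draws, 2))    # about 0.58
```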

This analogy is not perfect: I’m not recommending that the university sanction 30% of its profs at random—for one thing, that could be me! But it demonstrates the point that a series of individually reasonable decisions can be unreasonable in aggregate.

Anyway, one advantage of a judicial branch—or, more generally, a fact-finding institution that is separate from enforcement and policymaking—is that its members can feel free to look for truth, damn the consequences, because that’s their role.

So, instead of the university weighing the negatives of keeping a barely-repentant plagiarist on the faculty against the embarrassment of sanctioning a tenured professor, there can be an independent committee of the American Historical Association just judging the evidence.

It’s a lot easier to judge the evidence if you don’t have direct responsibility for what will be done with the evidence. Or, to put it another way, it’s easier to be a judge if you don’t also have to play the roles of jury and executioner.

P.S. I see that Armstrong was recently quoted in Newsweek regarding Korea policy. Maybe they should’ve interviewed the dude he copied from instead. Why not go straight to the original, no?


### Introducing RLlib: A composable and scalable reinforcement learning library

RISE Lab’s Ray platform adds libraries for reinforcement learning and hyperparameter tuning.

In a previous post, I outlined emerging applications of reinforcement learning (RL) in industry. I began by listing a few challenges facing anyone wanting to apply RL, including the need for large amounts of data, and the difficulty of reproducing research results and deriving the error estimates needed for mission-critical applications. Nevertheless, the success of RL in certain domains has been the subject of much media coverage. This has sparked interest, and companies are beginning to explore some of the use cases and applications I described in my earlier post. Many tasks and professions, including software development, are poised to incorporate some forms of AI-powered automation. In this post, I’ll describe how RISE Lab’s Ray platform continues to mature and evolve just as companies are examining use cases for RL.

Assuming one has identified suitable use cases, how does one get started with RL? Most companies that are thinking of using RL for pilot projects will want to take advantage of existing libraries.

There are several open source projects that one can use to get started. From a technical perspective, there are a few things to keep in mind when considering a library for RL:

• Support for existing machine learning libraries. Because RL typically uses gradient-based or evolutionary algorithms to learn and fit policy functions, you will want it to support your favorite library (TensorFlow, Keras, PyTorch, etc.).
• Scalability. RL is computationally intensive, and having the option to run in a distributed fashion becomes important as you begin using it in key applications.
• Composability. RL algorithms typically involve simulations and many other components. You will want a library that lets you reuse components of RL algorithms (such as policy graphs, rollouts), that is compatible with multiple deep learning frameworks, and that provides composable distributed execution primitives (nested parallelism).

## Introducing Ray RLlib

Ray is a distributed execution platform (from UC Berkeley’s RISE Lab) aimed at emerging AI applications, including those that rely on RL. RISE Lab recently released RLlib, a scalable and composable RL library built on top of Ray:

RLlib is designed to support multiple deep learning frameworks (currently TensorFlow and PyTorch) and is accessible through a simple Python API. It currently ships with the following popular RL algorithms (more to follow):

It’s important to note that there is no dominant pattern for computing and composing RL algorithms and components. As such, we need a library that can take advantage of parallelism at multiple levels and physical devices. RLlib is an open source library for the scalable implementation of algorithms that connect the evolving set of components used in RL applications. In particular, RLlib enables rapid development because it makes it easy to build scalable RL algorithms through the reuse and assembly of existing implementations (“parallelism encapsulation”). RLlib also lets developers use neural networks created with several popular deep learning frameworks, and it integrates with popular third-party simulators.

Software for machine learning needs to run efficiently on a variety of hardware configurations, both on-premise and on public clouds. Ray and RLlib are designed to deliver fast training times on a single multi-core node or in a distributed fashion, and these software tools provide efficient performance on heterogeneous hardware (whatever the ratio of CPUs to GPUs might be).

## Examples: Text summarization and AlphaGo Zero

The best way to get started is to apply RL on some of your existing data sets. To that end, a relatively recent application of RL is in text summarization. Here’s a toy example to try—use RLlib to summarize unstructured text (note that this is not a production-grade model):

# Complete notebook available here: https://goo.gl/n6f43h
# Note: `summarization` and the trained `agent` are defined in the notebook;
# this snippet shows only the inference step.
document = """Insert your sample text here
"""
summary = summarization.summarize(agent, document)
print("Original document length is {}".format(len(document)))
print("Summary length is {}".format(len(summary)))


Text summarization is just one of several possible applications. A recent RISE Lab paper provides other examples, including an implementation of the main algorithm used in AlphaGo Zero in about 70 lines of RLlib pseudocode.

## Hyperparameter tuning with RayTune

Another common example involves model building. Data scientists spend a fair amount of time conducting experiments, many of which involve tuning parameters for their favorite machine learning algorithm. As deep learning (and RL) become more popular, data scientists will need software tools for efficient hyperparameter tuning and other forms of experimentation and simulation. RayTune is a new distributed, hyperparameter search framework for deep learning and RL. It is built on top of Ray and is closely integrated with RLlib. RayTune is based on grid search and uses ideas from early stopping, including the Median Stopping Rule and HyperBand.
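The Median Stopping Rule mentioned above is simple to state: halt a trial whose best result so far falls below the median of the other trials' results at the same training step. Here is a minimal sketch of the idea (RayTune's actual scheduler interface differs; the accuracy curves are made up):

```python
from statistics import median

def should_stop(trial_scores, other_trials, step):
    """Median stopping: halt a trial whose best score up to `step`
    is below the median of other trials' scores at that step."""
    peers = [t[step] for t in other_trials if len(t) > step]
    if len(peers) < 2:
        return False  # not enough evidence to stop anyone
    return max(trial_scores[:step + 1]) < median(peers)

# Hypothetical accuracy curves: three completed trials, one lagging trial.
completed = [[0.5, 0.6, 0.7], [0.4, 0.55, 0.65], [0.45, 0.5, 0.6]]
running = [0.2, 0.25, 0.3]
print(should_stop(running, completed, step=2))  # → True
```

Early-stopping rules like this free up workers for fresh hyperparameter configurations instead of letting doomed trials run to completion.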

There is a growing list of open source software tools available to companies wanting to explore deep learning and RL. We are in an empirical era, and we need tools that enable quick experiments in parallel, while letting us take advantage of popular software libraries, algorithms, and components. Ray just added two libraries that will let companies experiment with reinforcement learning and also efficiently search through the space of neural network architectures.

Reinforcement learning applications involve multiple components, each of which presents opportunities for distributed computation. Ray RLlib adopts a programming model that enables the easy composition and reuse of components, and takes advantage of parallelism at multiple levels and physical devices. Over the short term, RISE Lab plans to add more RL algorithms, APIs for integration with online serving, support for multi-agent scenarios, and an expanded set of optimization strategies.

Related resources:

### Four short links: 19 January 2018

Pricing, Windows Emulation, Toxic Tech Culture, and AI Futures

1. Pricing Summary -- quick and informative read. Three-part tariff (3PT)—Again, the software has a base platform fee, but the fee is $25,000 because it includes the first 150K events free. Each marginal event costs $0.15. In academic research and theory, the three-part tariff is proven to be best. It provides many different ways for the sales team to negotiate on price and captures the most value.
2. Wine 3.0 Released -- the Windows compatibility layer now runs Photoshop CC 2018! Astonishing work.
3. Getting Free of Toxic Tech Culture (Val Aurora and Susan Wu) -- We didn’t realize how strongly we’d unconsciously adopted this belief that people in tech were better than those who weren’t until we started to imagine ourselves leaving tech and felt a wave of self-judgment and fear. Early on, Valerie realized that she unconsciously thought of literally every single job other than software engineer as “for people who weren’t good enough to be a software engineer” – and that she thought this because other software engineers had been telling her that for her entire career. This.
4. The Future Computed: Artificial Intelligence and its Role in Society -- Microsoft's book on the AI-enabled future. Three chapters: The Future of Artificial Intelligence; Principles, Policies, and Laws for the Responsible Use of AI; and AI and the Future of Jobs and Work.
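The three-part tariff from the first link is straightforward to compute (a sketch using the figures quoted above: a $25,000 base fee that includes 150K events, then $0.15 per marginal event):

```python
def three_part_tariff(events, base_fee=25_000, included=150_000, per_event=0.15):
    """Total price under a three-part tariff: flat base fee plus a
    per-event charge for usage beyond the included allowance."""
    return base_fee + max(0, events - included) * per_event

print(three_part_tariff(100_000))  # → 25000.0 (within the allowance)
print(three_part_tariff(200_000))  # → 32500.0 (50K marginal events)
```

The negotiating room the link describes comes from having three levers (base fee, allowance, marginal rate) rather than a single unit price.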

### What can Text Mining tell us about Snapchat’s new update?

Last week, Snapchat unveiled a major redesign of their app that received quite a bit of negative feedback. As a video-sharing platform that has integrated itself into users’ daily lives, Snapchat relies on simplicity and ease of use. So when large numbers of these users begin to express pretty serious frustration about the app’s new design, it’s a big threat to their business.

You can bet that right now Snapchat are analyzing exactly how big a threat this backlash is by monitoring the conversation online. This is a perfect example of businesses leveraging the Voice of their Customer with tools like Natural Language Processing. Businesses that track their product’s reputation online can quantify how serious events like this are and make informed decisions on their next steps. In this blog, we’ll give a couple of examples of how you can dive into online chatter and extract important insights on customer opinion.

This TechCrunch article pointed out that 83% of Google Play Store reviews in the immediate aftermath of the update gave the app one or two stars. But as we mentioned in a blog last week, star rating systems aren’t enough – they don’t tell you why people feel the way they do and most of the time people base their star rating on a lot more than how they felt about a product or service.

To get accurate and in-depth insights, you need to understand exactly what a reviewer is positive or negative about, and to what degree they feel this way. This can only be done effectively with text mining.

So in this short blog, we’re going to use text mining to:

1. Analyze a sample of the Play Store reviews to see what Snapchat users mentioned in reviews posted since the update.
2. Gather and analyze a sample of 1,000 tweets mentioning “Snapchat update” to see if the reaction was similar on social media.

In each of these analyses, we’ll use the AYLIEN Text Analysis API, which comes with a free plan that’s ideal for testing it out on small datasets like the ones we’ll use in this post.

## What did the app reviewers talk about?

As TechCrunch pointed out, 83% of reviews since the update shipped received one or two stars, which gives us a high-level overview of the sentiment shown towards the redesign. But to dig deeper, we need to look into the reviews and see what people were actually talking about in all of these reviews.

As a sample, we gathered the 40 reviews readily available on the Google Play Store and saved them in a spreadsheet. We can analyze what people were talking about in them by using our Text Analysis API’s Entities feature. This feature analyzes a piece of text and extracts the people, places, organizations and things mentioned in it.

One of the types of entities returned to us is a list of keywords. To get a quick look into what the reviewers were talking about in a positive and negative light, we visualized the keywords extracted along with the average sentiment of the reviews they appeared in.

From the 40 reviews, our Text Analysis API extracted 498 unique keywords. Below you can see a visualization of the keywords extracted and the average sentiment of the reviews they appeared in from most positive (1) to most negative (-1).
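The aggregation behind such a chart is simple (a sketch on made-up data, not the AYLIEN API itself): for each keyword, average the sentiment scores of the reviews it appeared in.

```python
from collections import defaultdict

# Hypothetical (keywords, sentiment score) pairs, one per review,
# with sentiment on the same -1 (negative) to 1 (positive) scale.
reviews = [
    (["bitmoji", "love"], 0.8),
    (["layout", "stories"], -0.7),
    (["layout", "unintuitive"], -0.9),
]

scores_by_keyword = defaultdict(list)
for keywords, sentiment in reviews:
    for kw in keywords:
        scores_by_keyword[kw].append(sentiment)

avg_sentiment = {kw: sum(s) / len(s) for kw, s in scores_by_keyword.items()}
print(round(avg_sentiment["layout"], 2))  # → -0.8
```

Plotting `avg_sentiment` for the most frequent keywords gives exactly the kind of positive-to-negative ranking described above.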

First of all, you’ll notice that keywords like “love” and “great” are high on the chart, while “frustrating” and “terrible” are low on the scale – which is what you’d expect. But if you look at keywords that refer to Snapchat, you’ll see that “Bitmoji” appears high on the chart, while “stories,” “layout,” and “unintuitive” all appear low on the chart, giving an insight into what Snapchat’s users were angry about.

## How did Twitter react to the Snapchat update?

Twitter is such an accurate gauge of what the general public is talking about that the US Geological Survey uses it to monitor for earthquakes – because the speed at which people react to earthquakes on Twitter outpaces even their own seismic data feeds! So if people Tweet about earthquakes during the actual earthquakes, they are absolutely going to Tweet their opinions of Snapchat updates.

To get a snapshot of the Twitter conversation, we gathered 1,000 Tweets that mentioned the update. To gather the Tweets, we ran a search using the Twitter Search API (this is really easy – take a look at our beginners’ guide to doing this in Python).

After we gathered our Tweets, we analyzed them with our Sentiment Analysis feature and as you can see, the Tweets were overwhelmingly negative:

[Chart: Sentiment of 1,000 Tweets about Snapchat]

Quantifying the positive, negative, and neutral sentiment shown towards the update on Twitter is useful, but using text mining we can go one step further and extract the keywords mentioned in every one of these Tweets. To do this, we use the Text Analysis API’s Entities feature.

Disclaimer: this being Twitter, there was quite a bit of opinion expressed in a NSFW manner 😉

[Chart: Most mentioned keywords on Twitter in Tweets about “Snapchat update”]

The number of expletives we identified as keywords reinforces the severity of the opinion expressed towards the update. You can see that “stories” and “story” are two of the few prominently-featured keywords that referred to feature updates while keywords like “awful” and “stupid” are good examples of the most-mentioned keywords in reaction to the update as a whole.

It’s clear that text mining processes like sentiment analysis and entity extraction can provide a detailed overview of public reaction to an event by extracting granular information from product reviews and social media chatter.

If you can think of insights you could extract with text mining about topics that matter to you, our Text Analysis API allows you to analyze 1,000 documents per day free of charge and getting started with our tools couldn’t be easier – click on the image below to sign up.

The post What can Text Mining tell us about Snapchat’s new update? appeared first on AYLIEN.

### On Random Weights for Texture Generation in One Layer Neural Networks

Continuing up on the use of random projections (which in the context of DNNs is really about NN with random weights), today we have:

Recent work in the literature has shown experimentally that one can use the lower layers of a trained convolutional neural network (CNN) to model natural textures. More interestingly, it has also been experimentally shown that only one layer with random filters can also model textures although with less variability. In this paper we ask the question as to why one layer CNNs with random filters are so effective in generating textures. We theoretically show that one layer convolutional architectures (without a non-linearity) paired with an energy function used in previous literature, can in fact preserve and modulate frequency coefficients in a manner so that random weights and pretrained weights will generate the same type of images. Based on the results of this analysis we question whether similar properties hold in the case where one uses one convolution layer with a non-linearity. We show that in the case of ReLu non-linearity there are situations where only one input will give the minimum possible energy whereas in the case of no nonlinearity, there are always infinite solutions that will give the minimum possible energy. Thus we can show that in certain situations adding a ReLu non-linearity generates less variable images.
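The linear-case claim has a quick sanity check (a toy sketch, not the authors' code): convolution multiplies each frequency coefficient by the filter's corresponding coefficient, so a random filter rescales frequencies but cannot move them. A pure-Python demonstration with a single-frequency signal:

```python
import cmath
import math
import random

def dft(x):
    """Discrete Fourier transform (naive O(n^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_conv(x, w):
    """Circular convolution of two equal-length real signals."""
    n = len(x)
    return [sum(x[(t - s) % n] * w[s] for s in range(n)) for t in range(n)]

random.seed(0)
n = 32
x = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]  # one active frequency, k=3
w = [random.gauss(0, 1) for _ in range(n)]                 # random, untrained filter

y = circular_conv(x, w)
active = lambda sig: [abs(c) > 1e-6 for c in dft(sig)]
# The random filter modulates, but does not relocate, the active frequencies:
print(active(x) == active(y))
```

This matches the abstract's statement that, absent a non-linearity, random and pretrained weights "preserve and modulate frequency coefficients" in the same way.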


### 501 days of Summer (school)

(This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers)

As I anticipated earlier, we’re now ready to open registration for our Summer School in Florence (I was waiting for UCL to set up the registration system and thought it may take much longer than it actually did – so well done UCL!).

We’ll probably have a few changes here and there in the timetable – we’re thinking of introducing some new topics, and I think I’ll certainly merge a couple of my intro lectures to leave some time for those…

Nothing is fixed yet and we’re in the process of deliberating all the changes – but I’ll post as soon as we have a clearer plan for the revised timetable.

Here’s the advert (which I’ve sent out to some relevant mailing list, also).

Summer school: Bayesian methods in health economics
Date: 4-8 June 2018
Venue: CISL Study Center, Florence (Italy)

COURSE ORGANISERS: Gianluca Baio, Chris Jackson, Nicky Welton, Mark Strong, Anna Heath

OVERVIEW:
This summer school is intended to provide an introduction to Bayesian analysis and MCMC methods using R and MCMC sampling software (such as OpenBUGS and JAGS), as applied to cost-effectiveness analysis and typical models used in health economic evaluations. We will present a range of modelling strategies for cost-effectiveness analysis as well as recent methodological developments for the analysis of the value of information.

The course is intended for health economists, statisticians, and decision modellers interested in the practice of Bayesian modelling and will be based on a mixture of lectures and computer practicals, although the emphasis will be on examples of applied analysis: software and code to carry out the analyses will be provided. Participants are encouraged to bring their own laptops for the practicals.

We shall assume a basic knowledge of standard methods in health economics and some familiarity with a range of probability distributions, regression analysis, Markov models and random-effects meta-analysis. However, statistical concepts are reviewed in the context of applied health economic evaluations in the lectures.

The summer school is hosted in the beautiful complex of the Centro Studi Cisl, overlooking, and a short distance from, Florence (Italy). The registration fees include full board accommodation in the Centro Studi.

More information can be found at the summer school webpage. Registration is available from the UCL Store. For more details or enquiries, email Dr Gianluca Baio.


### R Packages worth a look

Parallel Runs of Reverse Depends (prrd)
Reverse depends for a given package are queued such that multiple workers can run the tests in parallel.

Critical Line Algorithm in Pure R (CLA)
Implements ‘Markowitz’ Critical Line Algorithm (‘CLA’) for classical mean-variance portfolio optimization. Care has been taken for correctness in light of previous buggy implementations.

Extension for ‘R6’ Base Class (r6extended)
Useful methods and data fields to extend the bare bones ‘R6’ class provided by the ‘R6’ package – ls-method, hashes, warning- and message-method, general get-method and a debug-method that assigns self and private to the global environment.

Run Predictions Inside the Database (tidypredict)
It parses a fitted ‘R’ model object and returns a formula in ‘Tidy Eval’ code that calculates the predictions. It works with several database back-ends because it leverages ‘dplyr’ and ‘dbplyr’ for the final ‘SQL’ translation of the algorithm. It currently supports lm(), glm() and randomForest() models.

Bayesian Structure Learning in Graphical Models using Birth-Death MCMC (BDgraph)
Provides statistical tools for Bayesian structure learning in undirected graphical models for continuous, discrete, and mixed data. The package implements recent improvements in the Bayesian graphical models literature, including Mohammadi and Wit (2015) <doi:10.1214/14-BA889> and Mohammadi et al. (2017) <doi:10.1111/rssc.12171>. To speed up the computations, the BDMCMC sampling algorithms are implemented in parallel using OpenMP in C++.

### Distilled News

In this article a few simple applications of Markov chains are going to be discussed as solutions to a few text processing problems. These problems appeared as assignments in a few courses; the descriptions are taken directly from the courses themselves.
Remember that I told you last time that Python if statements are similar to how our brain processes conditions in everyday life? That’s true for for loops too. You go through your shopping list until you’ve collected every item on it. The dealer gives a card to each player until everyone has five. The athlete does push-ups until reaching one hundred… Loops everywhere! As for for loops in Python: they are perfect for processing repetitive programming tasks. In this article, I’ll show you everything you need to know about them: the syntax, the logic and best practices too!
This post shows you how to label hundreds of thousands of images in an afternoon. You can use the same approach whether you are labeling images or labeling traditional tabular data (e.g., identifying cyber security attacks or potential part failures).
I’m contemplating the idea of teaching a course on simulation next fall, so I have been exploring various topics that I might include. (If anyone has great ideas either because you have taught such a course or taken one, definitely drop me a note.) Monte Carlo (MC) simulation is an obvious one. I like the idea of talking about importance sampling, because it sheds light on the idea that not all MC simulations are created equally. I thought I’d do a brief blog to share some code I put together that demonstrates MC simulation generally, and shows how importance sampling can be an improvement.
Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.3 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R (version 3.4.3) and updates the bundled packages (specifically: checkpoint, curl, doParallel, foreach, and iterators) to new versions. MRO is 100% compatible with all R packages. MRO 3.4.3 points to a fixed CRAN snapshot taken on January 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).
Making deep learning simple and accessible to enterprises: Polyaxon aims to be an enterprise-grade open source platform for building, training, and monitoring large scale deep learning applications. It includes an infrastructure, set of tools, proven algorithms, and industry models to enable your organization to innovate faster. Polyaxon is a platform-agnostic with no lock-in. You keep full ownership and control of sensitive data on-premise or in the cloud.
When approaching problems with sequential data, such as natural language tasks, recurrent neural networks (RNNs) typically top the choices. While the temporal nature of RNNs is a natural fit for these problems with text data, convolutional neural networks (CNNs), which are tremendously successful when applied to vision tasks, have also demonstrated efficacy in this space. In our LSTM tutorial, we took an in-depth look at how long short-term memory (LSTM) networks work and used TensorFlow to build a multi-layered LSTM network to model stock market sentiment from social media content. In this post, we will briefly discuss how CNNs are applied to text data while providing some sample TensorFlow code to build a CNN that can perform binary classification tasks similar to our stock market sentiment model.

### Book Memo: “Stochastic Modelling in Production Planning”

Methods for Improvement and Investigations on Production System Behaviour. Alexander Hübl develops models for production planning and analyzes performance indicators to investigate production system behaviour. He extends the existing literature by considering the uncertainty of customer-required lead times and processing times, as well as by increasing the complexity of multi-machine, multi-item production models. The results are, on the one hand, a decision support system for determining capacity and a further development of the production planning method Conwip; on the other hand, the author develops the JIT intensity and analytically proves the effects of dispatching rules on production lead time.

### The difference between me and you is that I’m not on fire

“Eat what you are while you’re falling apart and it opened a can of worms. The gun’s in my hand and I know it looks bad, but believe me I’m innocent.” – Mclusky

While the next episode of Madam Secretary buffers on terrible hotel internet, I (the other other white meat) thought I’d pop in to say a long, convoluted hello. I’m in New York this week visiting Andrew and the Stan crew (because it’s cold in Toronto and I somehow managed to put all my teaching on Mondays. I’m Garfield without the spray tan.).

So I’m in a hotel on the Upper West Side (or, like, maybe the upper upper west side. I’m in the 100s. Am I in Harlem yet? All I know is that I’m a block from my favourite bar [which, as a side note, Aki does not particularly care for] where I am currently not sitting and writing this because last night I was there reading a book about the rise of the surprisingly multicultural anti-immigration movement in Australia and, after asking what my book was about, some bloke started asking me for my genealogy and “how Australian I am” and really I thought that it was both a bit much and a serious misunderstanding of what someone who is reading a book with headphones on was looking for in a social interaction.) going through the folder of emails I haven’t managed to answer in the last couple of weeks looking for something fun to pass the time.

And I found one. Ravi Shroff from the Department of Applied Statistics, Social Science and Humanities at NYU (side note: applied statistics gets short shrift in a lot of academic stats departments around the world, which is criminal. So I will always love a department that leads with it in the title. I’ll also say that my impression when I wandered in there for a couple of hours at some point last year was that, on top of everything else, this was an uncommonly friendly group of people. Really, it’s my second favourite statistics department in North America, obviously after Toronto who agreed to throw a man into a volcano every year as part of my startup package after I got really into both that Tori Amos album from 1996 and cultural appropriation. Obviously I’m still processing the trauma of being 11 in 1996 and singularly unable to sacrifice any young men to the volcano goddess.) sent me an email a couple of weeks ago about constructing interpretable decision rules.

(Meta-structural diversion: I started writing this with the new year, new me idea that every blog post wasn’t going to devolve into, say, 500 words on how Medúlla is Björk’s Joanne, but that resolution clearly lasted for less time than my tenure as an Olympic torch relay runner. But if you’ve not learnt to skip the first section of my posts by now, clearly reinforcement learning isn’t for you.)

#### To hell with good intentions

Ravi sent me his paper Simple rules for complex decisions by Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel and Daniel Goldstein and it’s one of those deals where the title really does cover the content.

This is my absolute favourite type of statistics paper: it eschews the bright shiny lights of ultra-modern methodology in favour of the much harder road of taking a collection of standard tools and shaping them into something completely new.

Why do I prefer the latter? Well it’s related to the age old tension between “state-of-the-art” methods and “stuff-people-understand” methods. The latter are obviously preferred as they’re much easier to push into practice. This is in spite of the former being potentially hugely more effective. Practically, you have to balance “black box performance” with “interpretability”. Where you personally land on that Pareto frontier is between you and your volcano goddess.

This paper proposes a simple decision rule for binary classification problems and shows fairly convincingly that it can be almost as effective as much more complicated classifiers.

#### There ain’t no fool in Ferguson

The paper proposes a Select-Regress-and-Round method for constructing decision rules that works as follows:

1. Select: Choose a small number $k$ of features $\mathbf{x}$ that will be used to build the classifier.
2. Regress: Use a logistic-lasso to estimate the classifier $h(\mathbf{x}) = (\mathbf{x}^T\mathbf{\beta} \geq 0 \text{ ? } 1 \text{ : } 0)$.
3. Round: Choose $M$ possible levels of effect and build weights

$w_j = \text{Round} \left( \frac{M \beta_j}{\max_i|\beta_i|}\right)$.

The new classifier (which chooses between options 1 and 0) selects 1 if

$\sum_{j=1}^k w_j x_j > 0$.

In the paper they use $k=10$ features and $M = 3$ levels.  To interpret this classifier, we can consider each level as a discrete measure of importance.  For example, when we have $M=3$ we have seven levels of importance from “very high negative effect”, through “no effect”, to “very high positive effect”. In particular

• $w_j=0$: The $j$th feature has no effect
• $w_j= \pm 1$: The $j$th feature has a low effect (positive or negative)
• $w_j = \pm 2$: The $j$th feature has a medium effect (positive or negative)
• $w_j = \pm 3$: The $j$th feature has a high effect (positive or negative).
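The round step and the resulting decision rule are simple enough to sketch in a few lines of Python. This is a minimal illustration, not code from the paper: the select and regress steps are omitted, and the coefficient vector below is made-up numbers standing in for a fitted logistic-lasso.

```python
import numpy as np

def round_weights(beta, M=3):
    """Round step: map real lasso coefficients to integer weights in {-M, ..., M}."""
    return np.round(M * beta / np.max(np.abs(beta))).astype(int)

def classify(X, w):
    """Decision rule: predict 1 exactly when the integer-weighted sum is positive."""
    return (X @ w > 0).astype(int)

# Hypothetical fitted coefficients for k = 5 selected features (made-up numbers)
beta = np.array([2.1, -0.4, 0.0, 1.0, -1.9])
w = round_weights(beta)  # [3, -1, 0, 1, -3]: high+, low-, none, low+, high-
X = np.array([[1, 0, 1, 1, 0],   # weighted sum:  3 + 0 + 1 =  4 -> predict 1
              [0, 1, 0, 0, 1]])  # weighted sum: -1 - 3     = -4 -> predict 0
print(w, classify(X, w))
```

Note that a practitioner never needs to see `beta`: the rule they apply is just "add 3 points for feature 1, subtract 1 for feature 2, …, predict 1 if the total is positive".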

A couple of key things make this idea work. Firstly, the initial selection phase allows people to “sense check” the initial group of features while also forcing the decision rule to depend on only a small number of features, which greatly improves people’s ability to interpret the rule. The regression phase then works out which of those features are actually used (the number of active features can be less than $k$). Finally, the rounding phase gives a qualitative weight to each feature.

This is a transparent way of building a decision rule, as the effect of each feature used to make the decision is clearly specified.  But does it work?

#### She will only bring you happiness

The most surprising thing in this paper is that this very simple strategy for building a decision rule works fairly well. Probably unsurprisingly, complicated, uninterpretable decision rules constructed through random forests typically do work better than this simple decision rule.  But the select-regress-round strategy doesn’t do too badly.  It might be possible to improve the performance by tweaking the first two steps to allow for some low-order interactions. For binary features, this would allow for classifiers where neither X nor Y is a strong indicator of success, but their co-occurrence (XY) is.
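That interaction tweak can be sketched as a preprocessing step before the regress phase: for binary features, augment the design matrix with all pairwise products, and let the lasso decide which co-occurrences matter. The helper below is hypothetical (not from the paper), and only numpy is assumed.

```python
import numpy as np
from itertools import combinations

def add_pairwise_interactions(X):
    """Augment a binary feature matrix with all pairwise products x_i * x_j,
    so the lasso in the regress step can pick up co-occurrence effects."""
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + pairs)

X = np.array([[1, 1, 0],
              [1, 0, 1]])
# resulting columns: x0, x1, x2, x0*x1, x0*x2, x1*x2
print(add_pairwise_interactions(X))
```

The cost is interpretability pressure: the select step would now have to keep the total number of active terms (main effects plus interactions) small for the rule to stay readable.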

Even without this tweak, the select-regress-round classifier performs about as well as logistic regression and logistic lasso models that use all possible features (see the figure from the paper), although it performs worse than the random forest.  It also doesn’t appear that the rounding process has much of an effect on the quality of the classifier.

#### This man will not hang

The substantive example in this paper has to do with whether or not a judge decides to grant bail, where the event you’re trying to predict is a failure to appear at trial. The results in this paper suggest that the select-regress-round rule leads to a consistently lower rate of failure compared to the “expert judgment” of the judges.  It also works, on this example, almost as well as a random forest classifier.

There’s some cool methodology stuff in here about how to actually build, train, and evaluate classification rules when, for any particular experimental unit (a person getting or not getting bail in this case), you can only observe one of the potential outcomes.  This paper uses some ideas from the causal analysis literature to work around that problem.

I guess the real question I have about this type of decision rule is how it would be applied in practice for this sort of example.  In particular, would judges be willing to use this type of system?  The obvious advantage of implementing it in practice is that it is data driven and, therefore, the decisions are potentially less likely to fall prey to implicit and unconscious biases. The obvious downside is that I am personally more than the sum of my demographic features (or other measurable quantities), and this type of system would treat me like the average person who shares the $k$ features with me.