My Data Science Blogs

May 26, 2017

Testing the Hierarchical Risk Parity algorithm

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be a modified version of the Adaptive Asset Allocation backtest from AllocateSmartly, using the Hierarchical Risk Parity algorithm from the last post, because Adam Butler was eager to see my results. On the whole, consistent with what Adam Butler told me he had seen, HRP does not generate outperformance when applied to a small, carefully constructed, diversified-by-selection universe of asset classes. Its theoretically superior properties come into play on a universe of hundreds or even several thousand assets.

First off, I would like to thank Matthew Barry for helping me modify my HRP algorithm so that it does not use the global environment for recursion. You can find his GitHub here.

Here is the modified HRP code.

covMat <- read.csv('cov.csv', header = FALSE)
corMat <- read.csv('corMat.csv', header = FALSE)

clustOrder <- hclust(dist(corMat), method = 'single')$order

getIVP <- function(covMat) {
  # inverse-variance portfolio weights
  invDiag <- 1/diag(as.matrix(covMat))
  weights <- invDiag/sum(invDiag)
  return(weights)
}

getClusterVar <- function(covMat, cItems) {
  # variance of a cluster under inverse-variance weighting
  covMatSlice <- covMat[cItems, cItems]
  weights <- getIVP(covMatSlice)
  cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights
  return(cVar)
}

getRecBipart <- function(covMat, sortIx) {
  w <- rep(1, ncol(covMat))
  w <- recurFun(w, covMat, sortIx)
  return(w)
}

recurFun <- function(w, covMat, sortIx) {
  # split the ordered assets in half and allocate by inverse cluster variance
  subIdx <- 1:trunc(length(sortIx)/2)
  cItems0 <- sortIx[subIdx]
  cItems1 <- sortIx[-subIdx]
  cVar0 <- getClusterVar(covMat, cItems0)
  cVar1 <- getClusterVar(covMat, cItems1)
  alpha <- 1 - cVar0/(cVar0 + cVar1)

  # scoping mechanics using w as a free parameter
  w[cItems0] <- w[cItems0] * alpha
  w[cItems1] <- w[cItems1] * (1-alpha)

  if(length(cItems0) > 1) {
    w <- recurFun(w, covMat, cItems0)
  }
  if(length(cItems1) > 1) {
    w <- recurFun(w, covMat, cItems1)
  }
  return(w)
}

out <- getRecBipart(covMat, clustOrder)

Here, covMat and corMat are the matrices from the last post. In fact, this function could be further modified by encapsulating the clustering order within the getRecBipart function, but in the interest of keeping the code as similar to Marcos Lopez de Prado's code as I could, I'll leave it as is.
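For readers who want to sanity-check the recursion outside of R, here is a compact Python sketch of the same recursive bisection logic (an illustration with my own function names, not the code used in the backtest):

```python
import numpy as np

def get_ivp(cov):
    """Inverse-variance portfolio weights for a covariance (sub-)matrix."""
    inv_diag = 1.0 / np.diag(cov)
    return inv_diag / inv_diag.sum()

def get_cluster_var(cov, items):
    """Variance of a cluster under inverse-variance weighting."""
    sub = cov[np.ix_(items, items)]
    w = get_ivp(sub)
    return w @ sub @ w

def rec_bipart(cov, sort_ix):
    """Recursively split the ordered assets in half and allocate by inverse cluster variance."""
    w = np.ones(cov.shape[0])

    def recurse(items):
        if len(items) < 2:
            return
        split = len(items) // 2
        c0, c1 = items[:split], items[split:]
        var0, var1 = get_cluster_var(cov, c0), get_cluster_var(cov, c1)
        alpha = 1.0 - var0 / (var0 + var1)
        w[c0] *= alpha        # the lower-variance half gets the larger share
        w[c1] *= 1.0 - alpha
        recurse(c0)
        recurse(c1)

    recurse(list(sort_ix))
    return w

# equal-variance, uncorrelated assets end up with equal weights
print(rec_bipart(np.eye(4), [0, 1, 2, 3]))
```

On a real covariance matrix, sort_ix would come from the single-linkage clustering order, exactly as in the R code above.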

Anyhow, the backtest will follow. One thing I will mention is that I'm using Quandl's EOD database, as Yahoo has really screwed up their financial database (i.e. some sector SPDRs have broken data, dividends not adjusted, etc.). While this database is a $50/month subscription, I believe free users can access it up to 150 times in 60 days, so that should be enough to run the backtests from this blog, so long as you save your downloaded time series for later use with write.zoo.

This code needs the tseries library for the portfolio.optim function for the minimum variance portfolio (Dr. Kris Boudt has a course on this at DataCamp), along with the other standard packages.

A helper function for this backtest (and really, any other momentum rotation backtest) is the appendMissingAssets function, which simply adds on assets not selected to the final weighting and re-orders the weights by the original ordering.


Quandl.api_key("YOUR_AUTHENTICATION_HERE") # not displaying my own api key, sorry 😦

# function to append missing (i.e. not selected) asset names and sort into original order
appendMissingAssets <- function(wts, allAssetNames, wtsDate) {
  absentAssets <- allAssetNames[!allAssetNames %in% names(wts)]
  absentWts <- rep(0, length(absentAssets))
  names(absentWts) <- absentAssets
  wts <- c(wts, absentWts)
  wts <- xts(t(wts), = wtsDate)
  wts <- wts[,allAssetNames]
  return(wts)
}

Next, we make the call to Quandl to get our data.

symbols <- c("SPY", "VGK",	"EWJ",	"EEM",	"VNQ",	"RWX",	"IEF",	"TLT",	"DBC",	"GLD")	

rets <- list()
for(i in 1:length(symbols)) {
  # quandl command to download from EOD database. Free users should use write.zoo in this loop.
  returns <- Return.calculate(Quandl(paste0("EOD/", symbols[i]), start_date="1990-12-31", type = "xts")$Adj_Close)
  colnames(returns) <- symbols[i]
  rets[[i]] <- returns
}
rets <- na.omit(, rets))

While Josh Ulrich fixed quantmod to actually get Yahoo data after Yahoo broke the API, the problem is that the Yahoo data is now garbage as well, and I’m not sure how much Josh Ulrich can do about that. I really hope some other provider can step up and provide free, usable EOD data so that I don’t have to worry about readers not being able to replicate the backtest, as my policy for this blog is that readers should be able to replicate the backtests so they don’t just nod and take my word for it. If you are or know of such a provider, please leave a comment so that I can let the blog readers know all about you.

Next, we initialize the settings for the backtest.

invVolWts <- list()
minVolWts <- list()
hrpWts <- list()
ep <- endpoints(rets, on =  "months")
nMonths = 6 # month lookback (6 as per parameters from allocateSmartly)
nVol = 20 # day lookback for volatility (20 ibid)

The AAA backtest actually uses a 126-day lookback rather than a 6-month one, but since the strategy trades at the end of every month, a 6-month lookback is effectively the same thing, give or take a few days out of 126, and the code is less complex this way.

Next, we have our actual backtest.

for(i in 1:(length(ep)-nMonths)) {

  # get returns subset and compute absolute momentum
  retSubset <- rets[c(ep[i]:ep[(i+nMonths)]),]
  retSubset <- retSubset[-1,]
  moms <- Return.cumulative(retSubset)

  # select top performing assets and subset returns for them
  highRankAssets <- rank(moms) >= 6 # top 5 assets
  posReturnAssets <- moms > 0 # positive momentum assets
  selectedAssets <- highRankAssets & posReturnAssets # intersection of the above
  selectedSubset <- retSubset[,selectedAssets] # subset returns slice

  if(sum(selectedAssets)==0) { # if no qualifying assets, zero weight for period

    wts <- xts(t(rep(0, ncol(retSubset))), = last(index(retSubset)))
    colnames(wts) <- colnames(retSubset)
    invVolWts[[i]] <- minVolWts[[i]] <- hrpWts[[i]] <- wts

  } else if (sum(selectedAssets)==1) { # if one qualifying asset, invest fully into it

    wts <- xts(t(rep(0, ncol(retSubset))), = last(index(retSubset)))
    colnames(wts) <- colnames(retSubset)
    wts[, which(selectedAssets==1)] <- 1
    invVolWts[[i]] <- minVolWts[[i]] <- hrpWts[[i]] <- wts

  } else { # otherwise, use weighting algorithms

    cors <- cor(selectedSubset) # correlation
    volSubset <- tail(selectedSubset, nVol) # 20 day volatility
    vols <- StdDev(volSubset)
    covs <- t(vols) %*% vols * cors

    # minimum volatility using portfolio.optim from tseries
    minVolRets <- t(matrix(rep(1, sum(selectedAssets))))
    minVolWt <- portfolio.optim(x=minVolRets, covmat = covs)$pw
    names(minVolWt) <- colnames(covs)
    minVolWt <- appendMissingAssets(minVolWt, colnames(retSubset), last(index(retSubset)))
    minVolWts[[i]] <- minVolWt

    # inverse volatility weights
    invVols <- 1/vols
    invVolWt <- invVols/sum(invVols)
    invNames <- colnames(invVolWt)
    invVolWt <- as.numeric(invVolWt)
    names(invVolWt) <- invNames
    invVolWt <- appendMissingAssets(invVolWt, colnames(retSubset), last(index(retSubset)))
    invVolWts[[i]] <- invVolWt

    # hrp weights
    clustOrder <- hclust(dist(cors), method = 'single')$order
    hrpWt <- getRecBipart(covs, clustOrder)
    names(hrpWt) <- colnames(covs)
    hrpWt <- appendMissingAssets(hrpWt, colnames(retSubset), last(index(retSubset)))
    hrpWts[[i]] <- hrpWt
  }
}
In a few sentences, this is what happens:

The algorithm takes a subset of the returns (the past six months at every month), and computes absolute momentum. It then ranks the ten absolute momentum calculations, and selects the intersection of the top 5, and those with a return greater than zero (so, a dual momentum calculation).

If no assets qualify, the algorithm invests in nothing. If there's only one asset that qualifies, the algorithm invests fully in that one asset. If there are two or more qualifying assets, the algorithm computes a covariance matrix by multiplying the 20-day volatilities with the 126-day correlation matrix (that is, sd_20' %*% sd_20 * (elementwise) cor_126). It then computes normalized inverse volatility weights using the volatility from the past 20 days, a minimum variance portfolio with the portfolio.optim function, and lastly, the hierarchical risk parity weights using the HRP code above from Marcos Lopez de Prado's paper.
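As a concrete illustration of that covariance construction (with made-up volatility and correlation numbers, not values from the backtest): the covariance matrix is the outer product of the 20-day volatility vector with itself, multiplied elementwise by the correlation matrix, and the inverse-volatility weights are just normalized reciprocals of the volatilities.

```python
import numpy as np

# hypothetical 20-day vols and 126-day correlations for three selected assets
vols = np.array([0.01, 0.02, 0.04])
cors = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])

# covs = sd' %*% sd * cor (elementwise), as in the R code above
covs = np.outer(vols, vols) * cors

# normalized inverse-volatility weights
inv_vol = 1.0 / vols
inv_vol_wts = inv_vol / inv_vol.sum()

print(np.round(covs, 6))
print(np.round(inv_vol_wts, 3))  # most weight goes to the least volatile asset
```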

Lastly, the program puts together all of the weights, and adds a cash investment for any period without any investments.

invVolWts <- round(, invVolWts), 3) # round for readability
minVolWts <- round(, minVolWts), 3)
hrpWts <- round(, hrpWts), 3)

# allocate to cash if no allocation made due to all negative momentum assets
invVolWts$cash <- 0; invVolWts$cash <- 1-rowSums(invVolWts)
hrpWts$cash <- 0; hrpWts$cash <- 1-rowSums(hrpWts)
minVolWts$cash <- 0; minVolWts$cash <- 1-rowSums(minVolWts)

# cash value will be zero
rets$cash <- 0

# compute backtest returns
invVolRets <- Return.portfolio(R = rets, weights = invVolWts)
minVolRets <- Return.portfolio(R = rets, weights = minVolWts)
hrpRets <- Return.portfolio(R = rets, weights = hrpWts)

Here are the results:

compare <- cbind(invVolRets, minVolRets, hrpRets)
colnames(compare) <- c("invVol", "minVol", "HRP")
rbind(table.AnnualizedReturns(compare), maxDrawdown(compare), CalmarRatio(compare))  
                             invVol    minVol       HRP
Annualized Return         0.0872000 0.0724000 0.0792000
Annualized Std Dev        0.1208000 0.1025000 0.1136000
Annualized Sharpe (Rf=0%) 0.7221000 0.7067000 0.6968000
Worst Drawdown            0.1548801 0.1411368 0.1593287
Calmar Ratio              0.5629882 0.5131956 0.4968234

In short, in the context of a small, carefully-selected and allegedly diversified (I’ll let Adam Butler speak for that one) universe dominated by the process of which assets to invest in as opposed to how much, the theoretical upsides of an algorithm which simultaneously exploits a covariance structure without needing to invert a covariance matrix can be lost.

However, this test (albeit from 2007 onwards, thanks to ETF inception dates combined with lookback burn-in) confirms what Adam Butler himself told me, which is that HRP hasn't impressed him, and from this backtest, I can see why. That said, in the context of dual momentum rank selection, I'm not convinced that any weighting scheme will realize much better performance than any other.

Thanks for reading.

NOTE: I am always interested in networking and hearing about full-time opportunities related to my skill set. My LinkedIn profile can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.


Machine Learning Basics: a Guide for the Perplexed

Artificial Intelligence and Machine Learning play a bigger part in our lives today than most people can imagine. We use intelligent services and applications every day that rely heavily on Machine Learning advances. Voice activation services like Siri or Alexa, image recognition services like Snapchat or Google Image Search, and even self driving cars all rely on the ability of machines to learn and adapt.

If you’re new to Machine Learning, it can be very easy to get bogged down in buzzwords and complex concepts of this dark art. With this in mind, we thought we’d put together a quick introduction to the basics of Machine Learning and how it works.

Note: This post is aimed at newbies – if you know a Bayesian model from a CNN, head on over to the research section of our blog, where you’ll find posts on more advanced subjects.

So what exactly is Machine Learning?

Machine Learning refers to a process that is used to train machines to imitate human intuition – to make decisions without having been told what exactly to do.

Machine Learning is a subfield of computer science, and you'll find it defined in many ways, but the simplest is probably still Arthur Samuel's definition from 1959: "Machine Learning gives computers the ability to learn without being explicitly programmed". Machine Learning explores how programs, or more specifically algorithms, learn from data and make predictions based on it. These algorithms differ from traditional programs by not relying on strictly coded instructions, but by making data-driven, informed predictions or decisions based on sample training inputs. Its applications in the real world are highly varied, but the one common element is that every Machine Learning program learns from past experience in order to make predictions in the future.
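In the spirit of Samuel's definition, here is a toy illustration (with made-up numbers) of a program inferring a rule from data rather than having it hard-coded: a one-parameter least-squares fit that recovers y ≈ 2x from noisy examples.

```python
# fit y ≈ a*x from example pairs instead of hard-coding the rule
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

# closed-form least squares for a line through the origin: a = sum(x*y) / sum(x^2)
a = sum(x * y for x, y in data) / sum(x * x for x, y in data)

print(round(a, 2))  # close to 2: the program inferred the rule from the data
```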

Machine Learning can be used to process massive amounts of data efficiently as part of a particular task or problem. It relies on specific representations of data, or "features", in order to recognise something. Just as a person who sees a cat can recognize it from visual features like its shape, its tail length, and its markings, Machine Learning algorithms learn from patterns and features in previously analyzed data.

Different types of Machine Learning

There are many types of Machine Learning programs or algorithms. The most common ones can be split into three categories or types:

    1. Supervised Machine Learning
    2. Unsupervised Machine Learning
    3. Reinforcement Learning

1. Supervised Machine Learning

Supervised learning refers to how a Machine Learning application has been trained to recognize patterns and features in data. It is “supervised”, meaning it has been trained or taught using correctly labeled (usually by a human) training data.

The way supervised learning works isn't too different to how we learn as humans. Think of how you teach a child: when a child sees a dog, you point at it and say "Look! A dog!". What you're doing here, essentially, is labelling that animal as a "dog". It might take a few hundred repetitions, but after a while the child will see another dog somewhere and say "dog" of their own accord. They do this by recognising the features of a dog and the association of those features with the label "dog". A supervised Machine Learning model works in much the same way.

It’s easily explained using an everyday example that you have certainly come across. Let’s consider how your email provider catches spam. First, the algorithm used is trained on a dataset or list of thousands of examples of emails that are labelled as “Spam” or “Not spam”. This dataset can be referred to as “training data”. The “training data” allows the algorithm to build up a detailed picture of what a Spam email looks like. After this training process, the algorithm should be able to decide what label (Spam or Not spam) should be assigned to future emails based on what it has learned from the training set. This is a common example of a Classification algorithm – a supervised algorithm trained on pre-labeled data.

[Figure: Training a spam classifier]
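The train-then-classify loop described above can be boiled down to a few lines: count how often each word appears in labelled spam and ham, then score a new message by which class its words favour. This is a toy, Naive-Bayes-style sketch on made-up messages, not a production filter:

```python
from collections import Counter

# tiny labelled training set (made up for illustration)
spam = ["win money now", "free money offer", "win a free prize"]
ham  = ["meeting at noon", "project update attached", "lunch at noon"]

spam_counts = Counter(w for msg in spam for w in msg.split())
ham_counts  = Counter(w for msg in ham for w in msg.split())

def classify(message):
    """Label a message by comparing add-one-smoothed word frequencies."""
    spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
    spam_score = ham_score = 1.0
    for w in message.split():
        spam_score *= (spam_counts[w] + 1) / (spam_total + 1)
        ham_score  *= (ham_counts[w] + 1) / (ham_total + 1)
    return "Spam" if spam_score > ham_score else "Not spam"

print(classify("free money"))    # words seen mostly in spam messages
print(classify("noon meeting"))  # words seen mostly in ham messages
```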

2. Unsupervised Machine Learning

Unsupervised learning takes a different approach. As you can probably gather from the name, unsupervised learning algorithms don't rely on pre-labeled training data to learn. Instead, they attempt to recognize patterns and structure in the data on their own. These patterns can then be used to make decisions or predictions when new data is introduced to the problem.

Think back to how supervised learning teaches a child how to recognise a dog, by showing it what a dog looks like and assigning the label "dog". Unsupervised learning is the equivalent of leaving the child to their own devices and not telling them the correct word or label to describe the animal. After a while, they would start to recognize that many animals, while similar to each other, have their own characteristics and features, meaning they can be grouped together: cats with cats and dogs with dogs. The child has not been told the correct label for a cat or a dog, but based on the features identified, they can decide to group similar animals together. An unsupervised model works in the same way, identifying features, structure, and patterns in data which it uses to group or cluster similar data together.

Amazon’s “customers also bought” feature is a good example of unsupervised learning in action. Millions of people buy different combinations of books on Amazon every day, and these transactions provide a huge amount of data on people’s tastes. An unsupervised learning algorithm analyzes this data to find patterns in these transactions, and returns relevant books as suggestions. As trends change or new books are published, people will buy different combinations of books, and the algorithm will adjust its recommendations accordingly, all without needing help from a human. This is an example of a clustering algorithm – an unsupervised algorithm that learns by identifying common groupings of data.

[Figure: Clustering visualization]
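A stripped-down sketch of the clustering idea: plain k-means on toy one-dimensional data, with k=2 and made-up values, purely for illustration.

```python
def kmeans_1d(points, k=2, iters=20):
    """Plain k-means on 1-D data: assign each point to the nearest centre, then re-average."""
    centres = points[:k]  # naive initialisation from the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # move each centre to the mean of its cluster (keep it if the cluster is empty)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# two obvious groups: values near 1 and values near 10
data = [0.9, 1.1, 1.0, 9.8, 10.2, 10.0]
print(kmeans_1d(data))  # the centres settle near the two groups
```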

Supervised Versus Unsupervised Algorithms

Each of these two methods has its own strengths and weaknesses, and which one should be used depends on a number of factors:

    Whether labelled data is available to use for training
    Whether the desired outcome is already known
    Whether we have a specific task in mind or want a program for very general use
    Whether the task at hand is resource or time sensitive
Put simply, supervised learning is excellent at tasks where there is a degree of certainty about the potential outcomes, whereas unsupervised learning thrives in situations where the context is more unknown.

In the case of supervised learning algorithms, the range of problems they can solve can be constrained by their reliance on training data, which is often difficult or expensive to obtain. In addition, a supervised algorithm can usually only be used in the context you trained it for. Imagine a food classifier that has only been trained on pictures of hot dogs: sure, it might do an excellent job at recognising hot dogs in images, but when it's shown an image of a pizza, all it knows is that the image doesn't contain a hot dog.

[Figure: The limits of supervised learning – HBO's Silicon Valley]

Unsupervised learning approaches also have many drawbacks: they are more complex, they need much more computational power, and they are nowhere near as well understood theoretically as supervised learning. However, they have recently been at the center of ML research and are often referred to as the next frontier in AI. Unsupervised learning gives machines the ability to learn by themselves and to extract information about the context you put them in, which, essentially, is the core challenge of Artificial Intelligence. Compared with supervised learning, unsupervised learning offers a way to teach machines something resembling common sense.

3. Reinforcement Learning

Reinforcement learning is the third approach that you'll most commonly come across. A reinforcement learning program tries to teach itself to perform a task accurately by continually giving itself feedback based on its surroundings, and continually updating its behaviour based on this feedback. Reinforcement learning allows machines to automatically decide how to behave in a particular environment in order to maximize performance, based on "reward" feedback or a reinforcement signal. This approach can only be used in an environment where the program can take signals from its surroundings as positive or negative feedback.

[Figure: Reinforcement Learning in action]

Imagine you’re programming a self-driving car to teach itself to become better at driving. You would program it to understand certain actions – like going off the road for example – is bad by providing negative feedback as a reinforcement signal. The car will then look at data where it went off the road before, and try to avoid similar outcomes. For instance, if the car sees a pattern like when it didn’t slow down at a corner it was more likely to end up driving off the road, but when it slowed down this outcome was less likely, it would slow down at corners more.


So this concludes our introduction to the basics of Machine Learning. We hope it provides you with some grounding as you try to get familiar with some of the more advanced concepts of Machine Learning. If you're interested in Natural Language Processing and how Machine Learning is used in NLP specifically, keep an eye on our blog, as we're going to cover how Machine Learning has been applied to the field. If you want to read some in-depth posts on Machine Learning, Deep Learning, and NLP, check out the research section of our blog.


The post Machine Learning Basics: a Guide for the Perplexed appeared first on AYLIEN.


R Packages worth a look

Constructing an Epistemic Model for the Games with Two Players (EpistemicGameTheory)
Constructing an epistemic model such that, for every player i and for every choice c(i) which is optimal, there is one type that expresses common belief in rationality.

Bayesian Network Belief Propagation (BayesNetBP)
Belief propagation methods in Bayesian Networks to propagate evidence through the network. The implementation of these methods is based on the article: Cowell, RG (2005). Local Propagation in Conditional Gaussian Bayesian Networks <http://…/>.

Quick Generalized Full Matching (quickmatch)
Provides functions for constructing near-optimal generalized full matching. Generalized full matching is an extension of the original full matching method to situations with more intricate study designs. The package is made with large data sets in mind and derives matches more than an order of magnitude quicker than other methods.

Convolution of Gamma Distributions (coga)
Convolution of gamma distributions in R. The convolution of gamma distributions is the sum of series of gamma distributions and all gamma distributions here can have different parameters. This package can calculate density, distribution function and do simulation work.

Graphical User Interface for Generalized Multistate Simulation Model (GUIgems)
A graphical user interface for the R package Gems. Beyond the functionality of the Gems package, GUIgems allows adding states to a defined model, merging states for the analysis, and plotting progression paths between states based on the simulated cohort. There is also a module in GUIgems which allows the user to compare costs and QALYs between different cohorts.


Science and Technology links (May 26th, 2017)

Linux, the operating system driving cloud computing and the web, was developed using an open source model. For a time, Linux was seen as a direct competitor to Microsoft, but things have changed and Microsoft is happy to see Linux as just a piece of technology. Because of how large and complicated the software got, the Linux project manager, Linus Torvalds, ended up writing his own source control tool, Git. It quickly became a standard. Today, Windows is built on Git. This shows the power of open source. "Open source" is a concept just as powerful and important for our civilization as "the scientific method". Though both science and open source can wipe out business models, they are also engines of innovation making us all richer. Microsoft does not use Linux and Git because it gave up on having viable competitors, but rather because it understands that fighting open source is about as useful as fighting the scientific method. (As an aside, modern implementations of Git are accelerated with compressed bitmaps called EWAH, something I personally worked on.)

Organic food is better for you, right? Galgano et al. (2016) find no evidence of such benefits:

The organic food market is growing in response to an ever increasing demand for organic products. They are often considered more nutritious, healthier, and free from pesticides than conventional foods. However, the results of scientific studies do not show that organic products are more nutritious and safer than conventional foods.

Ok but organic food is better for the environment, right? Maybe not because organic farming requires more land and more animals:

Furthermore, higher on-farm acidification potential and global warming potential per kilogram organic milk implies that higher ammonia, methane, and nitrous oxide emissions occur on farm per kilogram organic milk than for conventional milk. Total acidification potential and global warming potential per kilogram milk did not differ between the selected conventional and organic farms.

A company called Warby Parker has a mobile app you can use to sidestep optometrists and opticians entirely in some cases. The main point seems to be that a lot of what opticians do can be computerized easily, and that it is not hard to check your own prescription.

Omega-3 fats rejuvenate the muscles of old people:

Omega-3 polyunsaturated fatty acids reduce mitochondrial oxidant emissions, increase postabsorptive muscle protein synthesis, and enhance anabolic responses to exercise in older adults.

You have heard that teenage acne was unrelated to diet, right? Not so fast:

We found a positive association between intake of skim milk and acne. This finding suggests that skim milk contains hormonal constituents, or factors that influence endogenous hormones, in sufficient quantities to have biological effects in consumers.

The epidemic incidence of adolescent acne in Western milk-consuming societies can be explained by the increased insulin- and IGF-1-stimulation of sebaceous glands mediated by milk consumption.

We found a positive association with acne for intake of total milk and skim milk. We hypothesize that the association with milk may be because of the presence of hormones and bioactive molecules in milk.

Keytruda is the first cancer drug that targets a genetic dysfunction rather than specific cancer type. This type of drug might open the door to cheap and effective genetic cancer therapies. Note that Keytruda is actually approved for use in the US, so this is not purely speculative.

Volvo makes self-driving garbage-collecting trucks. A report from Goldman Sachs suggests that self-driving cars could destroy 25,000 jobs per month in the US. That sounds like a lot, but I don't think it is anything dramatic, if true. See also the Sector Disruption Report (May 2017) by James Arbib and Tony Seba, which sees the effect of self-driving cars as enormous:

This will keep an additional $1 trillion per year in Americans’ pockets by 2030, potentially generating the largest infusion of consumer spending in history

Arthritis is a terrible disease where some people end up living in near constant pain. Yet it could be prevented with good diet and exercise, according to an article in Nature.

Jeff Bezos, Amazon’s CEO, wants us to build permanent settlements on the Moon. This is the same man who wants to deliver us goods using drones, and help cure aging through the clearance of senescent cells.

Scientists have linked 52 genes to intelligence. Let me caution you: no, we cannot build super-smart babies by manipulating genes. Not yet, at least.


The Evolving Science of Sentiment and Emotion AI, Sentiment Analysis Symposium, June 27-28

News, sentiment, and emotion drive markets, both consumer markets and financial markets, making text and sentiment analysis essential tools for research and insights. Use code KDNUGGETS to save; early registration ends May 31.


A spat over language erupts at the World Bank

A WAR of words has flared up at the World Bank. Paul Romer, its new chief economist, has been stripped of control of the research division. An internal memo claimed that the change was to bring the operations department and research arm closer together.


Our general election poll tracker

The Economist's poll tracker comprises an overall rolling average from several polls and a breakdown of the headline figure by three different variables (sex, age and social grade).


3rd Valencian Summer School in Machine Learning is on the Horizon!

Since 2011, BigML has been at the forefront of the Machine Learning revolution, which is now in full swing. Big and small companies alike are seeing the tremendous payoff from using Machine Learning to make data-driven decisions. As all industries become transformed by predictive applications, BigML’s mission to democratize Machine Learning is more relevant than […]


Who is the caretaker? Evidence-based probability estimation with the bnlearn package

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Juan M. Lavista Ferres , Senior Director of Data Science at Microsoft

In what was one of the most viral episodes of 2017, political science professor Robert E. Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad's interview.

The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe, on Facebook and Twitter, along with reporters from reliable sources, thought the woman who went after the children was their nanny, when in fact she was Robert's wife, Jung-a Kim, who is Korean.

The confusion over this episode caused a second viral wave, calling out that people who thought she was the nanny should feel bad for stereotyping.

We decided to embrace the uncertainty and take a data-science-based approach to estimating the chances that the person was the nanny or the mother of the children, based on the evidence people had from watching the news.

What evidence did viewers of the video have?

  • the husband is American Caucasian
  • the husband is a professional
  • there are two kids
  • the caretaker is Asian

We then look for probability values for these statistics. (Given that Professor Kelly is American, all statistics are based on US data.)

We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.

net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]")



The last step is to fit the parameters of the Bayesian network conditional on its structure: combines the network structure with the conditional probability tables defined above into a fully specified Bayesian network.

yn <- c("yes", "no")
ca <- c("caucacian","other")
ao <- c("asian","other")
nw <- c("nanny","wife")

cptHusbandDemographics <- matrix(c(0.85, 0.15), ncol=2, dimnames=list(NULL, ca)) #[1]
cptHusbandIsProfessional <- matrix(c(0.81, 0.19), ncol=2, dimnames=list(NULL, yn)) #[2]
cptNannyDemographics <- matrix(c(0.06, 0.94), ncol=2, dimnames=list(NULL, ao)) # [3]
cptWifeDemographics <- matrix(c(0.01, 0.99, 0.33, 0.67), ncol=2, dimnames=list("WifeDemographics"=ao, "HusbandDemographics"=ca)) #[1]
cptStayAtHomeMom <- c(0.3, 0.7, 0.14, 0.86, 0.125, 0.875, 0.125, 0.875) #[4]

dim(cptStayAtHomeMom) <- c(2, 2, 2)
dimnames(cptStayAtHomeMom) <- list("StayAtHomeMom"=yn, "WifeDemographics"=ao, "HusbandIsProfessional"=yn)

cptHouseholdHasNanny <- c(0.01, 0.99, 0.035, 0.965, 0.00134, 0.99866, 0.00134, 0.99866) #[5]
dim(cptHouseholdHasNanny) <- c(2, 2, 2)
dimnames(cptHouseholdHasNanny) <- list("HouseholdHasNanny"=yn, "StayAtHomeMom"=yn, "HusbandIsProfessional"=yn)

cptCaretaker <- c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999)
dim(cptCaretaker) <- c(2, 2, 2)
dimnames(cptCaretaker) <- list("Caretaker"=nw, "StayAtHomeMom"=yn, "HouseholdHasNanny"=yn)

cptCaretakerEthnicity <- c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99,
                           0.01, 0.99, 0.99, 0.01, 0.01, 0.99, 0.01, 0.99)
dim(cptCaretakerEthnicity) <- c(2, 2, 2, 2)
dimnames(cptCaretakerEthnicity) <- list("CaretakerEthnicity"=ao, "Caretaker"=nw, "WifeDemographics"=ao, "NannyDemographics"=ao)

net.disc <- custom.fit(net, dist=list(HusbandDemographics=cptHusbandDemographics, HusbandIsProfessional=cptHusbandIsProfessional, WifeDemographics=cptWifeDemographics, StayAtHomeMom=cptStayAtHomeMom, HouseholdHasNanny=cptHouseholdHasNanny, Caretaker=cptCaretaker, NannyDemographics=cptNannyDemographics, CaretakerEthnicity=cptCaretakerEthnicity))

Once we have the model, we can query the network using cpquery to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence the output is that the probability that she is the wife is 90% vs. 10% that she is the nanny.

probWife  <- cpquery(net.disc, (Caretaker == "wife"),
                     (HusbandDemographics == "caucacian" & HusbandIsProfessional == "yes" & CaretakerEthnicity == "asian"),
                     n = 1000000)
probNanny <- cpquery(net.disc, (Caretaker == "nanny"),
                     (HusbandDemographics == "caucacian" & HusbandIsProfessional == "yes" & CaretakerEthnicity == "asian"),
                     n = 1000000)

[1] "The probability that the caretaker is his wife  = 0.898718647764449"
[1] "The probability that the caretaker is the nanny = 0.110892031547457"
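Because cpquery estimates each probability by independent Monte Carlo simulation, the two figures need not sum exactly to one (above they total roughly 1.01). A minimal sketch of renormalizing them for reporting, using the values printed above:

```r
# Renormalize two independently simulated estimates so they sum to 1
probs <- c(wife = 0.898718647764449, nanny = 0.110892031547457)
round(probs / sum(probs), 3)
```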

In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!

The bnlearn package is available on CRAN. You can find the R code behind this post here on GitHub or here as a Jupyter Notebook.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...


Bay Area Apache Spark Meetup Summary

On May 16, we held our monthly Bay Area Apache Spark Meetup (BASM) at SalesforceIQ in Palo Alto.

In all, we had three Apache Spark related talks: two from Salesforce’s Data Engineering and Machine Learning team, and one from Databricks’ Apache Spark Structured Streaming team.

For those who missed the meetup, below are all the videos and links to the presentation slides; you can peruse the slides and view the videos at your leisure. To those who helped and attended, thank you for participating and for your continued community support.

Dependency Injection in Apache Spark Applications

View the slides here

Identifying Pricing Request Emails Using Apache Spark and Machine Learning

View the slides here

Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark

View the slides here

What’s Next

Our next BASM will be held on the eve of the 10th Spark Summit on June 5. You don’t have to be registered for the Spark Summit to attend the meetup, so RSVP Now!



The post Bay Area Apache Spark Meetup Summary appeared first on Databricks.


What is an Ontology? The simplest definition you’ll find… or your money back*

This post takes the concept of an ontology and presents it in a clear and simple manner, devoid of the complexities that often surround such explanations.


Wise Practitioner – Manufacturing Predictive Analytics Interview Series: Richard Semmes at Siemens PLM

By: Bala Deshpande, Conference Co-Chair, Predictive Analytics World for Manufacturing 2017

In anticipation of his upcoming presentation at Predictive Analytics World Manufacturing (Chicago, June 19-22, 2017), Closing the Loop with Predictive Product Performance, we interviewed Richard Semmes, Senior Director, R&D at Siemens PLM. View the Q-and-A below for a glimpse of what’s in store at the PAW Manufacturing conference.

Q: What are the challenges in translating the lessons of predictive analytics from other verticals into manufacturing?

A: The objective for predictive analytics in manufacturing is really to enable actionable business decisions that impact the way you design, build, or service your products.  The most successful practitioners of predictive analytics in manufacturing use continuously updated data from many sources throughout their supply chain.  The biggest challenges center on data ETL, aggregation, and continuous updates.

Q: In your work with predictive analytics, what behavior do your models predict?

A: Our models predict the performance of mechatronic products.  We use predictive analytics to connect real world IoT data to the Digital Twin models of the virtual world.  That allows manufacturers of physical goods to proactively manage their businesses by better understanding what is going to happen in their factories as well as their products in the field.

Q: How does predictive analytics deliver value at your organization? What is one specific way in which it actively drives decisions?

A: Predictive analytics serves to find issues with products we did not know existed.  We use predictive models to understand the correlation between product features and product performance.  We use that insight to proactively manage those products in the field as well as optimizing the product through design changes. 

Q: Can you describe a successful result, such as the predictive lift (or accuracy) of your model or the ROI of an analytics initiative?

A: In one instance, we trained models using environmental data as well as IoT data from a very complex machine that produces other products.  The trained model was able to show us the environmental and job characteristics that had the best correlation to job failure.  That information can be used to warn the operator that there is increased risk of failure and it can be used to improve the machines to better handle those adverse situations.

Q: What surprising discovery have you unearthed in your data?

A: The extent to which environmental data should be taken into consideration when creating predictive models.  While it is obvious that weather and other environmental state can influence product performance, the extent to which including environmental conditions helps discover product feature correlations is significant.

Q: Sneak preview: Please tell us a take-away that you will provide during your talk at Predictive Analytics World for Manufacturing.

A: You don’t need an army of data scientists to reap the benefits of predictive analytics in your business.​


Don't miss Richard’s conference presentation, Closing the Loop with Predictive Product Performance, at PAW Manufacturing, on June 20, 2017 from 1:30 to 2:15 pm. Click here to register for attendance. 

By: Bala Deshpande, Founder, Simafore and Conference Co-Chair of Predictive Analytics World for Manufacturing.

The post Wise Practitioner – Manufacturing Predictive Analytics Interview Series: Richard Semmes at Siemens PLM appeared first on Analytical Worlds Blog - Predictive Analytics and Text Analytics - by Eric Siegel, Ph.D..


An Introduction to the MXNet Python API

This post outlines an entire 6-part tutorial series on the MXNet deep learning library and its Python API. In-depth and descriptive, this is a great guide for anyone looking to start leveraging this powerful neural network library.


Amazon Kinesis vs. Apache Kafka For Big Data Analysis

Data processing today is done in the form of pipelines, which include various steps such as aggregation, sanitization, and filtering, and which finally generate insights by applying various statistical models. Amazon Kinesis is a platform for building pipelines for streaming data at the scale of terabytes per hour. Parts of the Kinesis platform are

The post Amazon Kinesis vs. Apache Kafka For Big Data Analysis appeared first on Dataconomy.


Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do

Above is my talk at the 2017 New York R conference. Look, no slides!

The talk went well. I think the video would be more appealing to listen to if they’d mixed in more of the crowd noise. Then you’d hear people laughing at all the right spots.

P.S. Here’s my 2016 NYR talk, and my 2015 NYR talk.

Damn! I’m giving away all my material for free. I’ll have to come up with some entirely new bits when they call me up to give that Ted talk.

The post Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do appeared first on Statistical Modeling, Causal Inference, and Social Science.


Managing Spark data handles in R

When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame.


Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark) much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file handles in that they cannot be safely serialized and restored (i.e., you cannot save them into a .RDS file and then restore them into another session). This means that when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.

Let’s set-up our example Spark cluster:


library("sparklyr")
suppressPackageStartupMessages(library("dplyr"))
library("tidyr")

# Please see the following video for installation help
# spark_install(version = "2.0.2")

# set up a local "practice" Spark instance
sc <- spark_connect(master = "local",
                    version = "2.0.2")

Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths of parquet data we wish to work with from an external file that is easy for the user to edit.

# Read user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
##   tableName tablePath
## 1   data_01   data_01
## 2   data_02   data_02
## 3   data_03   data_03

We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.

readParquets <- function(userSpecification) {
  userSpecification <- as_data_frame(userSpecification)
  # read each parquet table into Spark, keeping the handle alongside its name
  userSpecification$handle <- lapply(
    seq_len(nrow(userSpecification)),
    function(i) {
      spark_read_parquet(sc,
                         name = userSpecification$tableName[[i]],
                         path = userSpecification$tablePath[[i]])
    })
  userSpecification
}

tableCollection <- readParquets(userSpecification)
## # A tibble: 3 x 3
##   tableName tablePath          handle
##       <chr>     <chr>          <list>
## 1   data_01   data_01 <S3: tbl_spark>
## 2   data_02   data_02 <S3: tbl_spark>
## 3   data_03   data_03 <S3: tbl_spark>

A data.frame is a great place to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) {
  tableCollection <- as_data_frame(tableCollection)
  # get the references
  tableCollection$handle <-
    lapply(tableCollection$tableName,
           function(tableNamei) {
             dplyr::tbl(sc, tableNamei)
           })
  # and tableNames to handles for convenience
  # and printing
  names(tableCollection$handle) <- tableCollection$tableName
  # add in some details (note: nrow can be expensive)
  tableCollection$nrow <- vapply(tableCollection$handle,
                                 sparklyr::sdf_nrow, numeric(1))
  tableCollection$ncol <- vapply(tableCollection$handle,
                                 sparklyr::sdf_ncol, numeric(1))
  tableCollection
}

tableCollection <- addDetails(userSpecification)

# convenient printing
print(tableCollection)
## # A tibble: 3 x 5
##   tableName tablePath          handle  nrow  ncol
##       <chr>     <chr>          <list> <dbl> <dbl>
## 1   data_01   data_01 <S3: tbl_spark>    10     1
## 2   data_02   data_02 <S3: tbl_spark>    10     2
## 3   data_03   data_03 <S3: tbl_spark>    10     3
# look at the top of each table (also forces
# evaluation!).
lapply(tableCollection$handle, head)
## $data_01
## Source:   query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
## # A tibble: 6 x 1
##        a_01
##       <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
## $data_02
## Source:   query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
## # A tibble: 6 x 2
##        a_02       b_02
##       <dbl>      <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
## $data_03
## Source:   query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
## # A tibble: 6 x 3
##         a_03      b_03        c_03
##        <dbl>     <dbl>       <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140

A particularly slick trick is to expand the columns column into a taller table that allows us to quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) {
  tableCollection$columns <-
    lapply(tableCollection$handle, colnames)
  columnMap <- tableCollection %>%
    select(tableName, columns) %>%
    tidyr::unnest(columns)
  columnMap
}
columnMap <- columnDictionary(tableCollection)
## # A tibble: 6 x 2
##   tableName columns
##       <chr>   <chr>
## 1   data_01    a_01
## 2   data_02    a_02
## 3   data_02    b_02
## 4   data_03    a_03
## 5   data_03    b_03
## 6   data_03    c_03

The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", better documented intent, and a more versatile workflow.

The principles we are using include:

  • Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier.
  • Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).


What Data Scientists Should Know About Hiring, Sharing, and Collaborating

In this post we summarize some of our most recent and favorite answers on Quora to questions from the community about hiring junior data scientists, sharing work with the public, and collaborating.

Much is said in the data science and machine learning space about new and exciting methods, tools, and advancements (like deep learning, automl, or the latest RNN playing super mario brothers).

However, as data science matures from the playground to the boardroom, practitioners and managers are finding a new set of challenges on their hands: Challenges around people, processes, and careers. Challenges that require soft skills, not models.

This is apparent from the questions starting to appear on Quora. The following three questions and answers highlight some of the soft skills required to grow as a data scientist within an organization:

What do you look for when hiring an entry-level data scientist?

There is no lack of breathless reports warning anyone who will listen about the coming catastrophe that is the data science skills gap. IBM predicts that by 2020 the demand for data scientists will grow by 28%, and yet Quora is filled with dozens if not hundreds of posts from people who are transitioning into data science from Ph.D. programs, engineering, and many other disciplines.

"What do you look for when hiring an entry-level data scientist? Would a master’s in Data Science or a bootcamp be beneficial?"

This question contains many facets we've seen before, packed into a single question. Our answer included the following...

The three traits to look for in a junior data scientist:

  1. They have the drive and determination to be a self-directed learner.
  2. They understand the fundamentals of “enough” programming.
  3. They understand how to analyze data when the goals and metrics are not explicit or time-boxed.

The real conclusion was that, as a hiring manager, I want signals that tell me you will be productive at the things they don’t teach you in school. I want to know you understand how to be independent, how to write code, and how to drive to insights when everyone is busy and no one has time to help mentor you.

A master's degree or a boot camp certification are both signals that I will take into account, but neither is make-or-break. It’s everything else around your CV that motivates me to take the conversation further.

View full answer on Quora.

What’s the best way for data scientists to share their work?

Few people enjoy standing in front of a crowd and being in the spotlight. A study showed that nearly 27 million Americans have an explicit “fear” of public speaking.

Considering the deeply technical nature of the work, and the many ways in which an analysis can go awry, it can feel like an especially daunting task to share one’s work as a data scientist. However, communicating your insights is one of the most critical skills for a successful career in data science. A recent article by Emma Walker, Data Scientist at Qriously, even called communication the “critical skill” many data scientists are missing.

Telling data scientists they just have to get better at something is not particularly helpful, so instead we broke it down into this:

Five ways that data scientists at different comfort levels with public speaking can share their work with the public:

  1. Create a really nice web portfolio or GitHub page.
  2. Find your voice on social media.
  3. Get involved with the local Meetup community.
  4. Talk at conferences—from local to regional and even national.
  5. Mentor someone who is earlier along in their evolution.

No one can expect that just publishing work on the internet will get their work in front of people. The internet is a firehose full of people who want to get noticed. There is no substitute for being your own passionate advocate, standing in front of people, and excitedly telling them about the work that you’ve done.

View full answer on Quora.

What are best practices for collaboration between data scientists?

Once a data scientist has had their work noticed, and once they’ve been hired as a data scientist at an organization, the truism that “data science is a team sport” becomes a daily reality. However, collaboration in team sports covers a broad range, from the near-perfect synchronicity of a rowing team to the chaos of Battle of the Nations; teamwork can mean many things to many people. Nor does it help that collaboration is a vague, often misappropriated term.

Genuine, practical, productive collaboration is the combination of four principles:

  1. Shared context creates an environment in which the penalty for communication or collaboration is minimized because everyone has the necessary information “paged in.” They’re able to operate on it without paying a significant cost of context switching.
  2. Discussion and communication, when it happens fluidly and is recorded in a system of record, is a tool that allows layered, compounded knowledge to be built. Learning from others' work and experiments is significantly cheaper than having to discover from first principles.
  3. Discoverability acknowledges that while context and discussion are powerful, they’re much less so if locked away behind impenetrable navigation. Providing search, taxonomy, hierarchy, and ontology that make navigation between topics and insights easy, and that provide serendipity, can shortcut learning curves and encourage future collaboration.
  4. Reuse is often the most desirable outcome of collaboration, but it has to be a goal as well as a principle. Making your work trivially reusable allows others to leverage it and “stand on the shoulders of giants.” Taking lessons learned and then generalizing them into a reusable template can impact the pace of innovation in an organization.

Collaboration isn’t something that can be solved with a silver bullet. It requires people devoted to a collaborative environment, tools which support, enhance, and leverage collaborative strategies, and processes that ensure adherence to collaborative ideals.

View full answer on Quora.


It's exciting to discuss the latest new approach or algorithm, but there are many interesting questions beginning to come out surrounding the people, processes, and careers of data scientists.

Although some might argue that Aristotle was the first data scientist—and therefore the field is 2,500 years old—questions around how this work is done, by whom, and in what environment are still relevant today. A rich, vibrant conversation about all of the aspects of data science, from technology to people, helps to strengthen the field and to define the kind of community and profession we hope to be.

The post What Data Scientists Should Know About Hiring, Sharing, and Collaborating appeared first on Data Science Blog by Domino.


Intelligent Bits: 26 May 2017

Sukiyaki in French style, brick-and-mortar conversion tracking, route-based pricing, and technological productivity.

  1. Deep recipe transfer — Many recent results have shown the ability to transfer visual patterns and styles across images. Now researchers demonstrate how neural nets can adapt recipes to the culinary styles of particular geographic regions.
  2. Brick, mortar, and bucks — Google can now associate digital ad campaigns with in-store visits and sales by applying machine learning to its wealth of user data, including geolocation, search history, web browsing, app interactions, and now credit card transaction records.
  3. How much for that ride? — Uber applies machine learning to route-based pricing in an effort to become more sustainable by predicting how much you’re willing to pay.
  4. Phew, false alarm — Contrary to popular outcry about technological dislocation of labor, this think tank argues that more innovation is needed to drive productivity and, therefore, jobs.


Four short links: 26 May 2017

Service Availability, Data Share, Eventual Consistency Explained, and Reproducible Deep Learning

  1. The Calculus of Service Availability -- A service cannot be more available than the intersection of all its critical dependencies. If your service aims to offer 99.99% availability, then all of your critical dependencies must be significantly more than 99.99% available. Internally at Google, we use the following rule of thumb: critical dependencies must offer one additional 9 relative to your service—in the example case, 99.999% availability—because any service will have several critical dependencies, as well as its own idiosyncratic problems. This is called the "rule of the extra 9."
  2. datproject -- open source, crypto-guaranteed distributed data share, designed for versioned data sets.
  3. How Your Data is Stored -- eventual consistency VERY LUCIDLY explained. It follows the original (entertaining) paper by Leslie Lamport but spells everything out clearly for non-computer-scientists.
  4. OpenAI Baselines -- open source implementations of the interesting published algorithms in deep learning. The papers often gloss over some of the details, so a full and working implementation truly lets others build on research. It's like the reproducibility project for deep learning.
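The "rule of the extra 9" in item 1 is easy to check numerically: assuming independent failures, a service's availability is bounded by the product of its own availability and those of its critical dependencies. A quick sketch in R:

```r
# A 99.99%-target service with five "five-nines" critical dependencies
own  <- 0.9999
deps <- rep(0.99999, 5)
prod(c(own, deps))
# The five dependencies together add only about 0.005% unavailability on
# top of the service's own 0.01%, so they don't dominate the error budget.
```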


Obi-Wan saying “Hello there” 67 million times

This clip of Obi-Wan saying “Hello there” 67 million times amused me too much.

I think there’s a lesson in averages or small multiples hidden somewhere in there.



Running a word count application using Spark

How to use Apache Spark’s Resilient Distributed Dataset (RDD) API.


Versioning R model objects in SQL Server

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

High-level info

If you build a model and never update it you’re missing a trick. Behaviours change so your model will tend to perform worse over time. You’ve got to regularly refresh it, whether that’s adjusting the existing model to fit the latest data (recalibration) or building a whole new model (retraining), but this means you’ve got new versions of your model that you have to handle. You need to think about your methodology for versioning R model objects, ideally before you lose any versions.

You could store models with ye olde YYYYMMDD style of versioning but that means regularly changing your code to use the latest model version. I’m too lazy for that!

If we’re storing our R model objects in SQL Server then we can utilise another SQL Server capability, temporal tables, to take the pain out of versioning and make it super simple.

Temporal tables will track changes automatically so you would overwrite the previous model with the new one and it would keep a copy of the old one automagically in a history table. You get to always use the latest version via the main table but you can then write temporal queries to extract any version of the model that’s ever been implemented. Super neat!
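As a sketch of such a temporal query (standard SQL Server FOR SYSTEM_TIME syntax, against the table defined later in this post; the timestamp is a placeholder):

```sql
-- Fetch the model exactly as it was at a given moment
SELECT [name], [modelObj], [ValidFrom], [ValidTo]
FROM companyModels
FOR SYSTEM_TIME AS OF '2017-01-01 00:00:00'
WHERE [name] = 'mymodel';
```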

For some of you, if you’re not interested in the technical details you can drop off now with the knowledge that you can store your models in a non-destructive but easy to use way in SQL Server if you need to.

If you want to see how it’s done, read on!

The technical info

Below is a working example of how to do versioning R model objects in SQL Server:

  • define a versioned model table
  • write a model into the model table
  • create a new model and overwrite the old
  • create a prediction using the latest model on the fly

Note this doesn’t tell you what changed, just that something did change. To identify model changes you will need to load up the models and compare their coefficients or other attributes to identify what exactly changed.
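For instance, a self-contained sketch of diffing the coefficients of two lm() versions (toy data; in practice myModelV1 and myModelV2 would be unserialized from the main and history tables):

```r
set.seed(1)
someData <- data.frame(column2 = 1:10)
someData$column1 <- 2 * someData$column2 + rnorm(10)

myModelV1 <- lm(column1 ~ column2, someData)
myModelV2 <- lm(column1 ~ column2, someData[-1, ])  # "retrained" version

# Which coefficients moved, and by how much?
coef(myModelV2) - coef(myModelV1)
```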

SQL objects

A temporal table

A normal table for storing a model might look like

CREATE TABLE [companyModels] (
  [id] int NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1,1)
, [name] varchar(200) NOT NULL
, [modelObj] varbinary(max)
, CONSTRAINT unique_modelname UNIQUE ([name]))

If we’re turning it into a temporal table we need to add some extra columns, but we won’t worry about these extra columns day-to-day.

CREATE TABLE [companyModels] (
  [id] int NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1,1)
, [name] varchar(200) NOT NULL
, [modelObj] varbinary(max)
, [ValidFrom] datetime2 (2) GENERATED ALWAYS AS ROW START
, [ValidTo] datetime2 (2) GENERATED ALWAYS AS ROW END
, PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
, CONSTRAINT unique_modelname UNIQUE ([name]))
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.companyModelsHistory));

A stored procedure

As we have the ID and Valid* columns in our table, we can’t use RODBC’s simple table functions, and the ID column doesn’t play nicely with RevoScaleR’s rxWriteObject() function, as that wants to insert a NULL. It’s not all bad though, because we can get around it by using a stored procedure.

This stored procedure will perform an INSERT if it can’t find a model by the given name, and will perform an UPDATE if it does find a match.

CREATE PROCEDURE modelUpsert
@modelname  varchar(200) ,
@modelobj varbinary(max)
AS
WITH MySource as (
    select @modelname as modelName, @modelobj as modelObj
)
MERGE companymodels AS MyTarget
USING MySource
     ON MySource.modelname = MyTarget.[name]
WHEN MATCHED THEN UPDATE SET
    modelObj = MySource.modelObj
WHEN NOT MATCHED THEN INSERT ([name], modelObj)
    VALUES (MySource.modelName, MySource.modelObj);


Build a model

We need a model to save!

To be able to store the model in the database we need to use the serialize() function to turn it into some raw character thingies and then combine them together with paste0() so they go in the same row.

myModelV1 <- lm(column1~column2, someData)
preppedDF <- data.frame(modelname = "mymodel",
                        modelobj = paste0(serialize(myModelV1, NULL)
                                      ,collapse = ""),
                        stringsAsFactors = FALSE)

Call the stored procedure

We need RODBC and RODBCext for executing our stored procedure in our database.


library(RODBC)
library(RODBCext)

dbstring <- 'Driver={ODBC Driver 13 for SQL Server};Server=XXX;Database=XXX;Uid=XXX;Pwd=XXX'
dbconn <- RODBC::odbcDriverConnect(dbstring)

RODBCext::sqlExecute(dbconn, "exec modelUpsert @modelname=? , @modelobj=?",
                     data = preppedDF)

This will now have our model in our database table.

RODBC::sqlQuery(dbconn, "select * from companymodels")
# 1 row
RODBC::sqlQuery(dbconn, "select * from companymodelshistory")
# 0 row

You should get one row in the main table and no rows in the history table.

Rinse and repeat

If we make a change to our model and then push the new model with the same name, we’ll still get one row in our core table, but now we’ll get a row in our history table that contains our v1 model.

myModelV2 <- lm(column1~column2, someData[-1,])
preppedDF <- data.frame(modelname = "mymodel",
                        modelobj = paste0(serialize(myModelV2, NULL)
                                      ,collapse = ""),
                        stringsAsFactors = FALSE)
RODBCext::sqlExecute(dbconn, "exec modelUpsert @modelname=? , @modelobj=?",
                     data = preppedDF)

RODBC::sqlQuery(dbconn, "select * from companymodels")
# 1 row
RODBC::sqlQuery(dbconn, "select * from companymodelshistory")
# 1 row

Using our model in SQL

If we want to use our model for predictions in SQL, we can now retrieve it from the table along with some input data and get back our input data plus a prediction.

DECLARE @mymodel varbinary(max) =
               (SELECT modelObj
                FROM companymodels
                WHERE [name]='mymodel');
EXEC sp_execute_external_script
@language = N'R',
@script = N'
    OutputDataSet <- cbind(InputDataSet,
        prediction = predict(unserialize(as.raw(model)), InputDataSet))
    ',
@input_data_1 = N'SELECT 42 as column2',
@params = N'@model varbinary(max)',
@model = @mymodel
WITH RESULT SETS ((column2 int, prediction float));

The post Versioning R model objects in SQL Server appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data.


Relaxed Wasserstein with Applications to GANs

If you are figuring what GANs are about, check on these two Highly Technical Reference pages on the subject. In the meantime:

Relaxed Wasserstein with Applications to GANs by Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang
Generative Adversarial Networks (GANs) provide a versatile class of models for generative modeling. To improve the performance of machine learning models, there has recently been interest in designing objective functions based on Wasserstein distance rather than Jensen-Shannon (JS) divergence. In this paper, we propose a novel asymmetric statistical divergence called Relaxed Wasserstein (RW) divergence as a generalization of Wasserstein-L2 distance of order 2. We show that RW is dominated by Total Variation (TV) and Wasserstein-L2 distance, and establish continuity, differentiability, and duality representation of RW divergence. Finally, we provide a nonasymptotic moment estimate and a concentration inequality for RW divergence. Our experiments show that RWGANs with Kullback-Leibler (KL) divergence produce recognizable images with a ReLU Multi-Layer Perceptron (MLP) generator in fewer iterations, compared to Wasserstein-L1 GAN (WGAN).

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.


RQGIS release 1.0.0

(This article was first published on R – jannesm, and kindly contributed to R-bloggers)

Today we are proud to announce a major release of the RQGIS package, which provides an interface between R and QGIS. We have completely rewritten RQGIS, which now uses reticulate to establish a tunnel to the Python QGIS API. To make RQGIS even more user-friendly, we have implemented, among others, the following features:

  • set_env now caches its output, so you have to call it only once
  • find_algorithms now supports regular expressions
  • run_qgis now accepts R named arguments as input (see example below). The load_output argument is now logical; if TRUE, all specified output files are loaded into R.
  • RQGIS now supports simple features

For a detailed overview of all changes, please refer to the release news.

Let’s use RQGIS for calculating the slope and the aspect of a digital elevation model. First of all, we have to attach RQGIS and a digital elevation model (dem) to the global environment:

library("RQGIS")
data("dem", package = "RQGIS")

Next, we need to find all paths necessary to run the Python QGIS API successfully. Please note that the output of set_env is cached and reused in subsequent function calls. Note also that set_env is always faster when you indicate the path to your QGIS installation, in my case e.g. set_env(root = "C:/OSGeo4W64"). Secondly, we open a QGIS Python application with open_app.
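A minimal sketch of these two steps (the root path shown is machine-specific, taken from a Windows OSGeo4W installation; omit it to let set_env search for QGIS on its own):

```r
library("RQGIS")

# find and cache all QGIS-related paths; the explicit root is machine-specific
set_env(root = "C:/OSGeo4W64")

# start a QGIS Python application through the reticulate tunnel
open_app()
```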


Now that we have established a tunnel to the QGIS Python API, we are ready for some geoprocessing. First, we need a geoalgorithm that computes the slope and the aspect of a digital elevation model. To find the name of such a geoalgorithm, we use a regular expression that searches for the terms slope and aspect among all available QGIS geoalgorithms.
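The matches listed below come from find_algorithms; the exact regular expression used is not shown in this excerpt, so the pattern here is only an assumed example of such a call:

```r
# hypothetical pattern: match geoalgorithm descriptions mentioning slope and aspect
find_algorithms(search_term = "slope.*aspect")
```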

## [1] "r.slope.aspect - Generates raster layers of slope, aspect, curvatures and partial derivatives from a elevation raster layer.--->grass7:r.slope.aspect"
## [2] "Slope, aspect, curvature----------------------------->saga:slopeaspectcurvature"                                                                      
## [3] "r.slope.aspect - Generates raster layers of slope, aspect, curvatures and partial derivatives from a elevation raster layer.--->grass:r.slope.aspect"

We will use grass7:r.slope.aspect. To retrieve its function parameters and arguments you can run get_usage("grass7:r.slope.aspect"). Use get_args_man to collect the parameter-argument list including all default values:

params <- get_args_man("grass7:r.slope.aspect")

As with previous RQGIS releases, you can still modify the parameter-argument list and submit it to run_qgis:

params$elevation <- dem
params$slope <- "slope.tif"
params$aspect <- "aspect.tif"
run_qgis(alg = "grass7:r.slope.aspect", params = params)

But now you can also use R named arguments in run_qgis, i.e. you can specify the geoalgorithm's parameters directly in run_qgis (an interface adapted from the rgrass7 package). Here, we specify the input and output rasters. For all other parameters, default values are used automatically. For more information on the R named arguments, please refer to the documentation by running ?run_qgis and/or ?pass_args.

out <- run_qgis(alg = "grass7:r.slope.aspect", elevation = dem, 
                slope = "slope.tif", aspect = "aspect.tif", 
                load_output = TRUE)
## $slope
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\RtmpmsSSA4/slope.tif"
## $aspect
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\RtmpmsSSA4/aspect.tif"
## $dxy
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\ac7b8544e8194dd1a1b8710e6091f1f3\\dxy.tif"
## $dxx
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\1576d9dc93434a578b3aeb16bedb17a2\\dxx.tif"
## $tcurvature
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\afea27676cc049faaa8526a486f13f70\\tcurvature.tif"
## $dx
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\2d71fd26b1aa4868a5cdfd0d7ad47a0c\\dx.tif"
## $dy
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\458f38f6c71947d3a37532e4bc6a6a53\\dy.tif"
## $pcurvature
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\80ad6fa1843d4d3a92ed0b4c6a39dcfa\\pcurvature.tif"
## $dyy
## [1] "C:\\Users\\pi37pat\\AppData\\Local\\Temp\\processing5bb46293bfb243f092a57ce9cf50348b\\6c52d235c8614a719954f1f744e3fef1\\dyy.tif"

Setting load_output to TRUE automatically loads the resulting QGIS output back into R. Since we have specified two output files, the output is loaded into R as a list (here named out) with two elements: a slope and an aspect raster. However, in the background QGIS calculates all terrain attributes and derivatives provided by grass7:r.slope.aspect and saves them to a temporary location. Of course, you can still access these layers from there if you wish.

Before running RQGIS, you need to install third-party software. We wrote a package vignette, which guides you through the installation process of QGIS, GRASS and SAGA on various platforms. To access the vignette, please run:

vignette("install_guide", package = "RQGIS")

For more information on RQGIS and examples of how to use RQGIS, please refer to its GitHub page and my blog.

To leave a comment for the author, please follow the link and comment on their blog: R – jannesm.


rtdists 0.7-2: response time distributions now with Rcpp and faster

It took us quite a while but we have finally released a new version of rtdists to CRAN which provides a few significant improvements. As a reminder, rtdists

[p]rovides response time distributions (density/PDF, distribution function/CDF, quantile function, and random generation): (a) Ratcliff diffusion model based on C code by Andreas and Jochen Voss and (b) linear ballistic accumulator (LBA)  with different distributions underlying the drift rate.

The main reason it took us relatively long to push the new version was that we had a problem with the C code for the diffusion model that we needed to sort out first. Specifically, the CDF (i.e., pdiffusion) in versions prior to 0.5-2 did not produce correct results in many cases (one consequence of this is that the model predictions given in the previous blog post are wrong). As a temporary fix, we resorted to the correct but slow numerical integration of the PDF (i.e., ddiffusion) to obtain the CDF in version 0.5-2 and later. Importantly, it appears as if the error was not present in fastdm, which is the source of the C code we use. Matthew Gretton carefully investigated the original C code, changed it such that it connects to R via Rcpp, and realized that there are two different variants of the CDF, a fast variant and a precise variant. Up to this point we had only used the fast variant and, as it turns out, this was responsible for our incorrect results. We now use the precise variant by default (which only seems to be marginally slower), as it produces the correct results for all cases we have tested (and we have tested quite a few).

In addition to a few more minor changes (see NEWS for full list), we made two more noteworthy changes. First, all diffusion functions as well as rLBA received a major performance update, mainly in situations with trial-wise parameters. Now it should almost always be fastest to call the diffusion functions (e.g., ddiffusion) only once with vectorized parameters instead of calling it several times for different sets of parameters. The speed up with the new version depends on the number of unique parameter sets, but even with only a few different sets the speed up should be clearly noticeable. For completely trial-wise parameters the speed-up should be quite dramatic.
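As a quick sketch of what this means in practice (all parameter values below are made up for illustration), a single vectorized call over trial-wise parameters now replaces, and should outperform, an explicit loop:

```r
library("rtdists")

rt <- c(0.6, 0.7, 0.8)      # response times for three trials
a  <- c(1.0, 1.2, 1.0)      # trial-wise threshold separation

# one vectorized call across all trials:
d_vec <- ddiffusion(rt, response = "upper", a = a, v = 2, t0 = 0.3)

# the equivalent, but slower, trial-by-trial version:
d_loop <- vapply(seq_along(rt), function(i)
  ddiffusion(rt[i], response = "upper", a = a[i], v = 2, t0 = 0.3),
  numeric(1))

all.equal(d_vec, d_loop)    # both give the same densities
```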

Second, I also updated the vignette, which now uses the tidyverse in, I believe, a somewhat more reasonable manner. Specifically, it is now built on nested data (via tidyr::nest) and purrr::map instead of relying heavily on dplyr::do. The problem I had with dplyr::do is that it often leads to somewhat ugly syntax. The changes in the vignette are mainly due to me reading Chapter 25 in the great R for Data Science book by Wickham and Grolemund. However, I still prefer lattice over ggplot2.

Example Analysis

To show the now correct behavior of the diffusion CDF let me repeat the example from the last post. As a reminder, we somewhat randomly pick one participant from the speed_acc data set and fit both diffusion model and LBA to the data.


library("rtdists")
data(speed_acc)   # Exp. 1; Wagenmakers, Ratcliff, Gomez, & McKoon (2008, JML)
# remove excluded trials:
speed_acc <- droplevels(speed_acc[!speed_acc$censor,]) 
# create numeric response variable where 1 is an error and 2 a correct response: 
speed_acc$corr <- with(speed_acc, as.numeric(stim_cat == response))+1 
# select data from participant 11, accuracy condition, non-word trials only
p11 <- speed_acc[speed_acc$id == 11 & 
                   speed_acc$condition == "accuracy" & 
                   speed_acc$stim_cat == "nonword",] 
# proportion of errors (1) and correct responses (2):
prop.table(table(p11$corr))
#          1          2 
# 0.04166667 0.95833333 

ll_lba <- function(pars, rt, response) {
  d <- dLBA(rt = rt, response = response, 
            A = pars["A"], 
            b = pars["A"]+pars["b"], 
            t0 = pars["t0"], 
            mean_v = pars[c("v1", "v2")], 
            sd_v = c(1, pars["sv"]))
  if (any(d == 0)) return(1e6)
  else return(-sum(log(d)))
}

start <- c(runif(3, 0.5, 3), runif(2, 0, 0.2), runif(1))
names(start) <- c("A", "v1", "v2", "b", "t0", "sv")
p11_norm <- nlminb(start, ll_lba, lower = c(0, -Inf, 0, 0, 0, 0), 
                   rt=p11$rt, response=p11$corr)
# $par
#          A         v1         v2          b         t0         sv 
#  0.1182940 -2.7409230  1.0449963  0.4513604  0.1243441  0.2609968 
# $objective
# [1] -211.4202
# $convergence
# [1] 0

ll_diffusion <- function(pars, rt, response) 
{
  densities <- ddiffusion(rt, response=response, 
                          a=pars["a"], v=pars["v"], t0=pars["t0"], 
                          sz=pars["sz"], st0=pars["st0"], sv=pars["sv"])
  if (any(densities == 0)) return(1e6)
  return(-sum(log(densities)))
}

start <- c(runif(2, 0.5, 3), 0.1, runif(3, 0, 0.5))
names(start) <- c("a", "v", "t0", "sz", "st0", "sv")
p11_diff <- nlminb(start, ll_diffusion, lower = 0, 
                   rt=p11$rt, response=p11$corr)
# $par
#         a         v        t0        sz       st0        sv 
# 1.3206011 3.2727202 0.3385602 0.4621645 0.2017950 1.0551706 
# $objective
# [1] -207.5487
# $convergence
# [1] 0

As is common, we pass the negative summed log-likelihood to the optimization algorithm (here trusty nlminb) and hence lower values of objective indicate a better fit. Results show that the LBA provides a somewhat better account. The interesting question is whether this somewhat better fit translates into a visibly better fit when comparing observed and predicted quantiles.

# quantiles:
q <- c(0.1, 0.3, 0.5, 0.7, 0.9)

## observed data:
(p11_q_c <- quantile(p11[p11$corr == 2, "rt"], probs = q))
#    10%    30%    50%    70%    90% 
# 0.4900 0.5557 0.6060 0.6773 0.8231 
(p11_q_e <- quantile(p11[p11$corr == 1, "rt"], probs = q))
#    10%    30%    50%    70%    90% 
# 0.4908 0.5391 0.5905 0.6413 1.0653 

### LBA:
# predicted error rate  
(pred_prop_correct_lba <- pLBA(Inf, 2, 
                               A = p11_norm$par["A"], 
                               b = p11_norm$par["A"]+p11_norm$par["b"], 
                               t0 = p11_norm$par["t0"], 
                               mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]), 
                               sd_v = c(1, p11_norm$par["sv"])))
# [1] 0.9581342

(pred_correct_lba <- qLBA(q*pred_prop_correct_lba, response = 2, 
                          A = p11_norm$par["A"], 
                          b = p11_norm$par["A"]+p11_norm$par["b"], 
                          t0 = p11_norm$par["t0"], 
                          mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]), 
                          sd_v = c(1, p11_norm$par["sv"])))
# [1] 0.4871710 0.5510265 0.6081855 0.6809796 0.8301286
(pred_error_lba <- qLBA(q*(1-pred_prop_correct_lba), response = 1, 
                        A = p11_norm$par["A"], 
                        b = p11_norm$par["A"]+p11_norm$par["b"], 
                        t0 = p11_norm$par["t0"], 
                        mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]), 
                        sd_v = c(1, p11_norm$par["sv"])))
# [1] 0.4684374 0.5529575 0.6273737 0.7233961 0.9314820

### diffusion:
# same result as when using Inf, but faster:
(pred_prop_correct_diffusion <- pdiffusion(rt = 20, response = "upper",
                                           a=p11_diff$par["a"], v=p11_diff$par["v"],
                                           t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                                           st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]))
# [1] 0.964723

(pred_correct_diffusion <- qdiffusion(q*pred_prop_correct_diffusion, 
                                      response = "upper",
                                      a=p11_diff$par["a"], v=p11_diff$par["v"],
                                      t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                                      st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]))
# [1] 0.4748271 0.5489903 0.6081182 0.6821927 0.8444566
(pred_error_diffusion <- qdiffusion(q*(1-pred_prop_correct_diffusion), 
                                    response = "lower",
                                    a=p11_diff$par["a"], v=p11_diff$par["v"],
                                    t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                                    st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]))
# [1] 0.4776565 0.5598018 0.6305120 0.7336275 0.9770047

### plot predictions

par(mfrow=c(1,2), cex=1.2)
plot(p11_q_c, q*prop.table(table(p11$corr))[2], pch = 2, ylim=c(0, 1), xlim = c(0.4, 1.3), ylab = "Cumulative Probability", xlab = "Response Time (sec)", main = "LBA")
points(p11_q_e, q*prop.table(table(p11$corr))[1], pch = 2)
lines(pred_correct_lba, q*pred_prop_correct_lba, type = "b")
lines(pred_error_lba, q*(1-pred_prop_correct_lba), type = "b")
legend("right", legend = c("data", "predictions"), pch = c(2, 1), lty = c(0, 1))

plot(p11_q_c, q*prop.table(table(p11$corr))[2], pch = 2, ylim=c(0, 1), xlim = c(0.4, 1.3), ylab = "Cumulative Probability", xlab = "Response Time (sec)", main = "Diffusion")
points(p11_q_e, q*prop.table(table(p11$corr))[1], pch = 2)
lines(pred_correct_diffusion, q*pred_prop_correct_diffusion, type = "b")
lines(pred_error_diffusion, q*(1-pred_prop_correct_diffusion), type = "b")

The fit plot compares observed quantiles (as triangles) with predicted quantiles (circles connected by lines). Here we decided to plot the 10%, 30%, 50%, 70% and 90% quantiles. In each plot, the x-axis shows RTs and the y-axis cumulative probabilities. From this it follows that the upper line and points correspond to the correct trials (which are common) and the lower line and points to the incorrect trials (which are uncommon). For both models the fit looks pretty good especially for the correct trials. However, it appears the LBA does a slightly better job in predicting very fast and slow trials here, which may be responsible for the better fit in terms of summed log-likelihood. In contrast, the diffusion model seems somewhat better in predicting the long tail of the error trials.

Checking the CDF

Finally, we can also check whether the analytical CDF does in fact correspond to the empirical CDF of the data. For this we can compare the analytical CDF function pdiffusion to the empirical CDF obtained from random deviates. One thing one needs to be careful about is that pdiffusion provides the ‘defective’ CDF that only approaches one if one adds the CDF for both response boundaries. Consequently, to compare the empirical CDF for one response with the analytical CDF, we need to scale the latter to also go from 0 to 1 (simply by dividing it by its maximal value). Here we will use the parameters values obtained in the previous fit.

rand_rts <- rdiffusion(1e5, a=p11_diff$par["a"], v=p11_diff$par["v"],
                       t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                       st0=p11_diff$par["st0"], sv=p11_diff$par["sv"])
plot(ecdf(rand_rts[rand_rts$response == "upper","rt"]))

normalised_pdiffusion = function(rt,...) pdiffusion(rt,...)/pdiffusion(rt=Inf,...) 
curve(normalised_pdiffusion(x, response = "upper",
                            a=p11_diff$par["a"], v=p11_diff$par["v"],
                            t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                            st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]),
      add=TRUE, col = "yellow", lty = 2)

This figure shows that the analytical CDF (in yellow) lies perfectly on top of the empirical CDF (in black). If it does not for you, you are still using an old version of rtdists. We have also added a series of unit tests to rtdists that compare the empirical CDF to the analytical CDF (using ks.test) for a variety of parameter values, to catch such a problem if it ever occurs again.



R Packages worth a look

A Faster Implementation of the Poisson-Binomial Distribution (poisbinom)
Provides the probability, distribution, and quantile functions and random number generator for the Poisson-Binomial distribution. This package relies on FFTW to implement the discrete Fourier transform, so that it is much faster than the existing implementation of the same algorithm in R.

Evolutionary Monte Carlo (EMC) Methods for Clustering (EMCC)
Evolutionary Monte Carlo methods for clustering, temperature ladder construction and placement. This package implements methods introduced in Goswami, Liu and Wong (2007) <doi:10.1198/106186007X255072>. The paper above introduced probabilistic genetic-algorithm-style crossover moves for clustering. The paper applied the algorithm to several clustering problems including Bernoulli clustering, biological sequence motif clustering, BIC based variable selection, mixture of Normals clustering, and showed that the proposed algorithm performed better both as a sampler and as a stochastic optimizer than the existing tools, namely, Gibbs sampling, “split-merge” Metropolis-Hastings algorithm, K-means clustering, and the MCLUST algorithm (in the package ‘mclust’).

JAR Files of the Apache Commons Mathematics Library (commonsMath)
Java JAR files for the Apache Commons Mathematics Library for use by users and other packages.

Interactive Graphs with R (RJSplot)
Creates interactive graphs with ‘R’. It joins the data analysis power of R and the visualization libraries of JavaScript in one package.

Simulation Education (simEd)
Contains various functions to be used for simulation education, including queueing simulation functions, variate generation functions capable of producing independent streams and antithetic variates, functions for illustrating random variate generation for various discrete and continuous distributions, and functions to compute time-persistent statistics. Also contains two queueing data sets (one fabricated, one real-world) to facilitate input modeling.


MapReduce in Two Modern Paintings

(This article was first published on novyden, and kindly contributed to R-bloggers)

Two years ago we had a rare family outing to the Dallas Museum of Art (my son is a teenager and he's into sports, after all). It had an excellent exhibition of modern art, and the DMA allowed taking pictures. Two hours and a dozen pictures later my weekend was over, but thanks to Google Photos I just stumbled upon those pictures again. Suddenly, I realized that two paintings I captured make up an illustration of one of the most important concepts in big data.

There are multiple papers, tutorials and web pages about MapReduce, and to truly understand and use it one should study at least a few of them thoroughly. And there are many illustrations of MapReduce structure and architecture out there.

But the power of art can express more with less, with just two paintings. First, we have Erró's Foodscape, 1964:

It illustrates variety, richness, the potential of insight if consumed properly, and of course, scale. The painting is boundless, emphasizing no end to the table surface in all four directions. If we zoom in (you can find a better quality image here), it contains many types of food and drinks, packaging and presentations, varying in color, texture and origin. All of these represent big data so much better than any kind of flowchart diagram.

The second and final painting is Wayne Thiebaud's Salads, Sandwiches, and Desserts, 1962:

If we think of how MapReduce works, this seemingly infinite table (also fittingly resembling a conveyor line) looks like the result of a split-apply-combine executed on the Foodscape items. Indeed, each vertical group is a combination of the same type of finished and plated food, ready to serve and combined into variably sized sets (find a better quality image here).
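For readers who prefer code to canvas, the split-apply-combine idea the painting evokes can be sketched in a few lines of base R (the food items and prices are, of course, invented):

```r
# "map" each item to a (type, value) pair, then group and reduce by key
foods <- data.frame(type  = c("salad", "dessert", "salad", "sandwich"),
                    price = c(4, 3, 5, 6))

by_type <- split(foods$price, foods$type)    # the "shuffle": group values by key
totals  <- vapply(by_type, sum, numeric(1))  # the "reduce": combine each group
totals                                       # dessert = 3, salad = 9, sandwich = 6
```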

And again, I'd like to remind you of the importance of taking your kids to a museum.

To leave a comment for the author, please follow the link and comment on their blog: novyden.


Document worth reading: “A Survey on Domain-Specific Languages for Machine Learning in Big Data”

The amount of data generated in the modern society is increasing rapidly. New problems and novel approaches of data capture, storage, analysis and visualization are responsible for the emergence of the Big Data research field. Machine Learning algorithms can be used in Big Data to make better and more accurate inferences. However, because of the challenges Big Data imposes, these algorithms need to be adapted and optimized to specific applications. One important decision made by software engineers is the choice of the language that is used in the implementation of these algorithms. Therefore, this literature survey identifies and describes domain-specific languages and frameworks used for Machine Learning in Big Data. By doing this, software engineers can then make more informed choices and beginners have an overview of the main languages used in this domain. A Survey on Domain-Specific Languages for Machine Learning in Big Data


Flu tracker 2017: weekly updates on symptoms around Australia

The FluTracking program is a national public health initiative that conducts an online survey of more than 20,000 Australians weekly to report symptoms of coughing and fever. Using this information, researchers can determine the onset of the flu season by region, severity of influenza strains and the effectiveness of current vaccines.

Here, you can see weekly updates on the flu situation, with the graph showing the percentage of participants who have experienced a cough and a fever in the past week compared against a five-year average.

The map shows the percentage of participants reporting the same symptoms by postcode, where there were 10 or more participants. Use the slider to change the weeks shown.

You can join the FluTracking survey here.


Magister Dixit

“Facts speak louder than statistics.” Mr. Justice Streatfield ( 1950 )


Support for presidential candidates at elite law firms in 2012 and 2016

Paul Campos writes:

Thought these data were extreme enough to be of general interest.

OK, before you click on the link, here’s the story: Campos looked up the presidential campaign contributions at 11 top law firms. (I’m not sure where his data came from; maybe the same source as here?) Guess what percentage of contributions went to Mitt Romney in 2012? What about Donald Trump in 2016?

Make your guesses, then click on the link above to find out the answer.

The numbers are indeed striking, and I have nothing to add—really there’s nothing I can say at all, given that no data or link have been supplied. I do, however, wonder what would happen if we took the people in the comments section at the above-linked post, and put them in the same room as the commenters at Marginal Revolution. Matter and anti-matter (or maybe it’s the other way around). I can’t even imagine.

P.S. Campos added a link to the data in his post.

The post Support for presidential candidates at elite law firms in 2012 and 2016 appeared first on Statistical Modeling, Causal Inference, and Social Science.


What's new on arXiv

Input Fast-Forwarding for Better Deep Learning

This paper introduces a new architectural framework, known as input fast-forwarding, that can enhance the performance of deep networks. The main idea is to incorporate a parallel path that sends representations of input values forward to deeper network layers. This scheme is substantially different from ‘deep supervision’ in which the loss layer is re-introduced to earlier layers. The parallel path provided by fast-forwarding enhances the training process in two ways. First, it enables the individual layers to combine higher-level information (from the standard processing path) with lower-level information (from the fast-forward path). Second, this new architecture reduces the problem of vanishing gradients substantially because the fast-forwarding path provides a shorter route for gradient backpropagation. In order to evaluate the utility of the proposed technique, a Fast-Forward Network (FFNet), with 20 convolutional layers along with parallel fast-forward paths, has been created and tested. The paper presents empirical results that demonstrate improved learning capacity of FFNet due to fast-forwarding, as compared to GoogLeNet (with deep supervision) and CaffeNet, which are 4x and 18x larger in size, respectively. All of the source code and deep learning models described in this paper will be made available to the entire research community

The Prediction Advantage: A Universally Meaningful Performance Measure for Classification and Regression

We introduce the Prediction Advantage (PA), a novel performance measure for prediction functions under any loss function (e.g., classification or regression). The PA is defined as the performance advantage relative to the Bayesian risk restricted to knowing only the distribution of the labels. We derive the PA for well-known loss functions, including 0/1 loss, cross-entropy loss, absolute loss, and squared loss. In the latter case, the PA is identical to the well-known R-squared measure, widely used in statistics. The use of the PA ensures meaningful quantification of prediction performance, which is not guaranteed, for example, when dealing with noisy imbalanced classification problems. We argue that among several known alternative performance measures, PA is the best (and only) quantity ensuring meaningfulness for all noise and imbalance levels.

Selective Classification for Deep Neural Networks

Selective classification techniques (also known as reject option) have not yet been considered in the context of deep neural networks (DNNs). These techniques can potentially significantly improve DNNs prediction performance by trading-off coverage. In this paper we propose a method to construct a selective classifier given a trained neural network. Our method allows a user to set a desired risk level. At test time, the classifier rejects instances as needed, to grant the desired risk (with high probability). Empirical results over CIFAR and ImageNet convincingly demonstrate the viability of our method, which opens up possibilities to operate DNNs in mission-critical applications. For example, using our method an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9%, and almost 60% test coverage.

Interpreting Blackbox Models via Model Extraction

Interpretability has become an important issue as machine learning is increasingly used to inform consequential decisions. We propose an approach for interpreting a blackbox model by extracting a decision tree that approximates the model. Our model extraction algorithm avoids overfitting by leveraging blackbox model access to actively sample new training points. We prove that as the number of samples goes to infinity, the decision tree learned using our algorithm converges to the exact greedy decision tree. In our evaluation, we use our algorithm to interpret random forests and neural nets trained on several datasets from the UCI Machine Learning Repository, as well as control policies learned for three classical reinforcement learning problems. We show that our algorithm improves over a baseline based on CART on every problem instance. Furthermore, we show how an interpretation generated by our approach can be used to understand and debug these models.

An effective algorithm for hyperparameter optimization of neural networks

A major challenge in designing neural network (NN) systems is to determine the best structure and parameters for the network given the data for the machine learning problem at hand. Examples of parameters are the number of layers and nodes, the learning rates, and the dropout rates. Typically, these parameters are chosen based on heuristic rules and manually fine-tuned, which may be very time-consuming, because evaluating the performance of a single parametrization of the NN may require several hours. This paper addresses the problem of choosing appropriate parameters for the NN by formulating it as a box-constrained mathematical optimization problem, and applying a derivative-free optimization tool that automatically and effectively searches the parameter space. The optimization tool employs a radial basis function model of the objective function (the prediction accuracy of the NN) to accelerate the discovery of configurations yielding high accuracy. Candidate configurations explored by the algorithm are trained to a small number of epochs, and only the most promising candidates receive full training. The performance of the proposed methodology is assessed on benchmark sets and in the context of predicting drug-drug interactions, showing promising results. The optimization tool used in this paper is open-source.

Causal inference for social network data

We extend recent work by van der Laan (2014) on causal inference for causally connected units to more general social network settings. Our asymptotic results allow for dependence of each observation on a growing number of other units as sample size increases. We are not aware of any previous methods for inference about network members in observational settings that allow the number of ties per node to increase as the network grows. While previous methods have generally implicitly focused on one of two possible sources of dependence among social network observations, we allow for both dependence due to contagion, or transmission of information across network ties, and for dependence due to latent similarities among nodes sharing ties. We describe estimation and inference for causal effects that are specifically of interest in social network settings.

Grounded Recurrent Neural Networks

In this work, we present the Grounded Recurrent Neural Network (GRNN), a recurrent neural network architecture for multi-label prediction which explicitly ties labels to specific dimensions of the recurrent hidden state (we call this process ‘grounding’). The approach is particularly well-suited for extracting large numbers of concepts from text. We apply the new model to address an important problem in healthcare of understanding what medical concepts are discussed in clinical text. Using a publicly available dataset derived from Intensive Care Units, we learn to label a patient’s diagnoses and procedures from their discharge summary. Our evaluation shows a clear advantage to using our proposed architecture over a variety of strong baselines.

Towards Interrogating Discriminative Machine Learning Models

It is oftentimes impossible to understand how machine learning models reach a decision. While recent research has proposed various technical approaches to provide some clues as to how a learning model makes individual decisions, they cannot provide users with ability to inspect a learning model as a complete entity. In this work, we propose a new technical approach that augments a Bayesian regression mixture model with multiple elastic nets. Using the enhanced mixture model, we extract explanations for a target model through global approximation. To demonstrate the utility of our approach, we evaluate it on different learning models covering the tasks of text mining and image recognition. Our results indicate that the proposed approach not only outperforms the state-of-the-art technique in explaining individual decisions but also provides users with an ability to discover the vulnerabilities of a learning model.

MMD GAN: Towards Deeper Understanding of Moment Matching Network

Generative moment matching network (GMMN) is a deep generative model that differs from Generative Adversarial Network (GAN) by replacing the discriminator in GAN with a two-sample test based on kernel maximum mean discrepancy (MMD). Although some theoretical guarantees of MMD have been studied, the empirical performance of GMMN is still not as competitive as that of GAN on challenging and large benchmark datasets. The computational efficiency of GMMN is also less desirable in comparison with GAN, partially due to its requirement for a rather large batch size during training. In this paper, we propose to improve both the model expressiveness of GMMN and its computational efficiency by introducing adversarial kernel learning techniques as a replacement for the fixed Gaussian kernel in the original GMMN. The new approach combines the key ideas in both GMMN and GAN, hence we name it MMD-GAN. The new distance measure in MMD-GAN is a meaningful loss that enjoys the advantage of weak topology and can be optimized via gradient descent with relatively small batch sizes. In our evaluation on multiple benchmark datasets, including MNIST, CIFAR-10, CelebA and LSUN, MMD-GAN significantly outperforms GMMN and is competitive with other representative GAN works.
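The kernel MMD at the heart of GMMN and MMD-GAN is simple to compute for small samples. Here is a minimal pure-Python sketch of the biased squared-MMD estimate with a fixed Gaussian kernel on scalar data (the fixed kernel the paper replaces, not its learned adversarial kernel; function names are mine, not the authors'):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    k = lambda a, b: gaussian_kernel(a, b, sigma)
    xx = sum(k(a, b) for a in X for b in X) / (len(X) ** 2)
    yy = sum(k(a, b) for a in Y for b in Y) / (len(Y) ** 2)
    xy = sum(k(a, b) for a in X for b in Y) / (len(X) * len(Y))
    return xx + yy - 2 * xy

same = [0.0, 1.0, 2.0]
print(mmd2_biased(same, same))          # identical samples: 0.0
print(mmd2_biased(same, [10.0, 11.0]))  # well-separated samples give a large value
```

In GMMN this quantity (over minibatches of real and generated data) replaces the discriminator loss, which is why batch size matters so much for the plain estimator.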

Nonparametric Preference Completion

We consider the task of collaborative preference completion: given a pool of items, a pool of users and a partially observed item-user rating matrix, the goal is to recover the personalized ranking of each user over all of the items. Our approach is nonparametric: we assume that each item i and each user u have unobserved features x_i and y_u, and that the associated rating is given by g_u(f(x_i,y_u)) where f is Lipschitz and g_u is a monotonic transformation that depends on the user. We propose a k-nearest neighbors-like algorithm and prove that it is consistent. To the best of our knowledge, this is the first consistency result for the collaborative preference completion problem in a nonparametric setting. Finally, we conduct experiments on the Netflix and Movielens datasets that suggest that our algorithm has some advantages over existing neighborhood-based methods and that its performance is comparable to some state-of-the art matrix factorization methods.

Deep Rotation Equivariant Network

Recently, learning equivariant representations has attracted considerable research attention. Dieleman et al. introduce four operations which can be inserted into a CNN to learn deep representations equivariant to rotation. However, in their approach feature maps must be copied and rotated four times in each layer, which incurs substantial running-time and memory overhead. To address this problem, we propose the Deep Rotation Equivariant Network (DREN), consisting of cycle layers, isotonic layers and decycle layers. Our proposed layers apply rotation transformations to filters rather than feature maps, achieving a speedup of more than 2x with even less memory overhead. We evaluate DRENs on the Rotated MNIST and CIFAR-10 datasets and demonstrate that they can improve the performance of state-of-the-art architectures. Our code is released on GitHub.
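The core trick above is that rotating a small filter is far cheaper than rotating every feature map. A hypothetical pure-Python sketch of the 90-degree filter rotation such a layer would apply (illustrative only, not the authors' code):

```python
def rotate90(filt):
    """Rotate a square 2D filter 90 degrees clockwise."""
    return [list(row) for row in zip(*filt[::-1])]

filt = [[1, 2],
        [3, 4]]
r1 = rotate90(filt)  # [[3, 1], [4, 2]]
# Four successive rotations return the original filter,
# which is why a 4-element rotation group suffices.
r4 = rotate90(rotate90(rotate90(r1)))
print(r4 == filt)  # True
```

A k x k filter has k^2 entries, while a feature map typically has orders of magnitude more, which is where the claimed speedup comes from.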

Fast-Slow Recurrent Neural Networks

Processing sequential data of variable length is a major challenge in a wide range of applications, such as speech recognition, language modeling, generative image modeling and machine translation. Here, we address this challenge by proposing a novel recurrent neural network (RNN) architecture, the Fast-Slow RNN (FS-RNN). The FS-RNN incorporates the strengths of both multiscale RNNs and deep transition RNNs as it processes sequential data on different timescales and learns complex transition functions from one time step to the next. We evaluate the FS-RNN on two character level language modeling data sets, Penn Treebank and Hutter Prize Wikipedia, where we improve state of the art results to 1.19 and 1.25 bits-per-character (BPC), respectively. In addition, an ensemble of two FS-RNNs achieves 1.20 BPC on Hutter Prize Wikipedia outperforming the best known compression algorithm with respect to the BPC measure. We also present an empirical investigation of the learning and network dynamics of the FS-RNN, which explains the improved performance compared to other RNN architectures. Our approach is general as any kind of RNN cell is a possible building block for the FS-RNN architecture, and thus can be flexibly applied to different tasks.

General Algorithmic Search

In this paper we present a metaheuristic for global optimization called General Algorithmic Search (GAS). Specifically, GAS is a stochastic, single-objective method that evolves a swarm of agents in search of a global extremum. Numerical simulations with a sample of 31 test functions show that GAS outperforms Basin Hopping, Cuckoo Search, and Differential Evolution, especially in concurrent optimization, i.e., when several runs with different initial settings are executed and the first best wins. Python codes of all algorithms and complementary information are available online.

Learning with Average Top-k Loss

In this work, we introduce the average top-k (AT_k) loss as a new ensemble loss for supervised learning, defined as the average over the k largest individual losses on a training dataset. We show that the AT_k loss is a natural generalization of the two widely used ensemble losses, namely the average loss and the maximum loss, and can combine their advantages and mitigate their drawbacks to better adapt to different data distributions. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based methods. We provide an intuitive interpretation of the AT_k loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further give a learning theory analysis of MAT_k learning on the classification calibration of the AT_k loss and the error bounds of AT_k-SVM. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets.
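The AT_k loss is concrete enough to sketch directly from its definition. A minimal Python illustration (mine, not the paper's code) showing how it interpolates between the maximum loss (k=1) and the average loss (k=n):

```python
def average_top_k_loss(losses, k):
    """Average of the k largest individual losses (the AT_k ensemble loss)."""
    if not 1 <= k <= len(losses):
        raise ValueError("k must be between 1 and the number of losses")
    top_k = sorted(losses, reverse=True)[:k]
    return sum(top_k) / k

losses = [0.1, 2.0, 0.5, 1.4]
print(average_top_k_loss(losses, 1))            # maximum loss: 2.0
print(average_top_k_loss(losses, len(losses)))  # average loss: 1.0
print(average_top_k_loss(losses, 2))            # average of the two largest: 1.7
```

Intermediate k focuses training on the hardest examples without letting a single outlier dominate, which is the trade-off the abstract describes.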

Dense Transformer Networks

The key idea of current deep learning methods for dense prediction is to apply a model on a regular patch centered on each pixel to make pixel-wise predictions. These methods are limited in the sense that the patches are determined by network architecture instead of learned from data. In this work, we propose the dense transformer networks, which can learn the shapes and sizes of patches from data. The dense transformer networks employ an encoder-decoder architecture, and a pair of dense transformer modules are inserted into each of the encoder and decoder paths. The novelty of this work is that we provide technical solutions for learning the shapes and sizes of patches from data and efficiently restoring the spatial correspondence required for dense prediction. The proposed dense transformer modules are differentiable, thus the entire network can be trained. We apply the proposed networks on natural and biological image segmentation tasks and show superior performance is achieved in comparison to baseline methods.

Sequential noise-induced escapes for oscillatory network dynamics

Weakly-normal basis vector fields in RKHS with an application to shape Newton methods

The Benefit of Being Flexible in Distributed Computation

Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation

Efficiently applying attention to sequential data with the Recurrent Discounted Attention unit

Bayesian Pool-based Active Learning with Abstention Feedbacks

Second-Order Word Embeddings from Nearest Neighbor Topological Features

Uplift Modeling with Multiple Treatments and General Response Types

A study on exponential-size neighborhoods for the bin packing problem with conflicts

Clinical Intervention Prediction and Understanding using Deep Networks

Predictive Analytics for Enhancing Travel Time Estimation in Navigation Apps of Apple, Google, and Microsoft

Discontinuous Hamiltonian Monte Carlo for sampling discrete parameters

Designs for estimating the treatment effect in networks with interference

Data-driven Random Fourier Features using Stein Effect

Model-free causal inference of binary experimental data

Convolution estimates and number of disjoint partitions

Statistical Convergence Analysis of Gradient EM on General Gaussian Mixture Models

Conscious and controlling elements in combinatorial group testing problems with more defectives

Critical two-point function for long-range $O(n)$ models below the upper critical dimension

Self-Organized Supercriticality and Oscillations in Networks of Stochastic Spiking Neurons

Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram Classification

Safe Model-based Reinforcement Learning with Stability Guarantees

Random ordering formula for sofic and Rokhlin entropy of Gibbs measures

Hashing as Tie-Aware Learning to Rank

Simple Pricing Schemes for the Cloud

Joint Rate Control and Power Allocation for Non-Orthogonal Multiple Access Systems

Flexible Cache-Aided Networks with Backhauling

Exact Recovery of Number of Blocks in Blockmodels

On the multiply robust estimation of the mean of the g-functional

Sequence Summarization Using Order-constrained Kernelized Feature Subspaces

Generative Model with Coordinate Metric Learning for Object Recognition Based on 3D Models

Sufficient conditions for the existence of a path-factor which are related to odd components

Deep Learning Improves Template Matching by Normalized Cross Correlation

Journalists’ information needs, seeking behavior, and its determinants on social media

Substitution invariant Sturmian words and binary trees

Fully reliable error control for evolutionary problems

Which bridge estimator is optimal for variable selection?

Multi-Task Learning for Contextual Bandits

Dictionary-based Monitoring of Premature Ventricular Contractions: An Ultra-Low-Cost Point-of-Care Service

Robust Data Geometric Structure Aligned Close yet Discriminative Domain Adaptation

VANETs Meet Autonomous Vehicles: A Multimodal 3D Environment Learning Approach

On Using Time Without Clocks via Zigzag Causality

Self-supervised learning of visual features through embedding images into text topic spaces

Representing the suffix tree with the CDAWG

Higher order Cheeger inequalities for Steklov eigenvalues

On the Success Probability of Decoding (Partial) Unit Memory Codes

Restriction of odd degree characters of $\mathfrak{S}_n$

Efficient Covariance Approximations for Large Sparse Precision Matrices

Combinatorial n-fold Integer Programming and Applications

Towards Understanding the Invertibility of Convolutional Neural Networks

Bayesian Compression for Deep Learning

Packing parameters in graphs: New bounds and a solution to an open problem

Alliance formation with exclusion in the spatial public goods game

Stochastic decomposition applied to large-scale hydro valleys management

Daisy cubes and distance cube polynomial

On the Möbius Function and Topology of General Pattern Posets

On The Fixatic Number of Graphs

Continual Learning with Deep Generative Replay

Stochastic Sequential Neural Networks with Structured Inference

Tree-Structured Modelling of Varying Coefficients

The de Bruijn-Erdös-Hanani theorem

Inclusive Flavour Tagging Algorithm

V2X Meets NOMA: Non-Orthogonal Multiple Access for 5G Enabled Vehicular Networks

Non-orthogonal Multiple Access for High-reliable and Low-latency V2X Communications in 5G Systems

An experimental study of graph-based semi-supervised classification with additional node information

Open-Category Classification by Adversarial Sample Generation

Hajós’ cycle conjecture for small graphs

Weighted Poisson-Delaunay Mosaics

Non-Stationary Spectral Kernels

Efficient algorithm for large spectral partitions

A counterexample to Comon’s conjecture

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

A causal approach to analysis of censored medical costs in the presence of time-varying treatment

A.Ya. Khintchine’s Work in Probability Theory

Speeding up Dynamic Programming on DAGs through a Fast Approximation of Path Cover

Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning

Small Sets with Large Difference Sets

Adaptive Detrending to Accelerate Convolutional Gated Recurrent Unit Training for Contextual Video Recognition

Three observations on spectra of zero-nonzero patterns

Threshold functions for small subgraphs: an analytic approach

A Two-Level Graph Partitioning Problem Arising in Mobile Wireless Communications

Group divisible (K_4-e)-packings with any minimum leave

Optimization of the Jaccard index for image segmentation with the Lovász hinge

STFT with Adaptive Window Width Based on the Chirp Rate

Continuous testing for Poisson process intensities: A new perspective on scanning statistics

Characterizing path-like trees from linear configurations

Beyond Parity: Fairness Objectives for Collaborative Filtering

A Bayesian Mallows approach to non-transitive pair comparison data: how human are sounds?

When Will AI Exceed Human Performance? Evidence from AI Experts

Boundary Crossing Probabilities for General Exponential Families

Power Systems Data Fusion based on Belief Propagation

Matrix-product structure of repeated-root constacyclic codes over finite fields

Causal Effect Inference with Deep Latent-Variable Models

From source to target and back: symmetric bi-directional adaptive GAN

Deep Investigation of Cross-Language Plagiarism Detection Methods

Transition to Shock Fluctuations in TASEP and Last Passage Percolation

Multi-Level Variational Autoencoder: Learning Disentangled Representations from Grouped Observations

Parsing with CYK over Distributed Representations: ‘Classical’ Syntactic Parsing in the Novel Era of Neural Networks

How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval

Joint Distribution Optimal Transportation for Domain Adaptation

Improved Semi-supervised Learning with GANs using Manifold Invariances

Perturbation of Conservation Laws and Averaging on Manifolds

Audio-replay attack detection countermeasures

More Circulant Graphs exhibiting Pretty Good State Transfer

Anti-spoofing Methods for Automatic Speaker Verification System

Transport and optics at the node in a nodal loop semimetal

Flow-GAN: Bridging implicit and prescribed learning in generative models

Quantum Channel Capacities Per Unit Cost

Sharp threshold for $K_4$-percolation

Modeling flow in porous media with double porosity/permeability: A stabilized mixed formulation, error analysis, and numerical solutions

Linearizable Iterators for Concurrent Data Structures


May 25, 2017

the bullet graph

The following is a guest post written by Bill Dean. After a recent workshop, Bill shared with me his affinity for bullet graphs. I've never used one before—though can see the potential—so invited him to share his views and an example approach here. Bill leads an engineering and data science team at Microsoft that enables groups across the company to analyze and act on customer feedback at scale. He loves (and makes) both BBQ and data visualizations but hasn’t yet mixed the two. For more on Bill or to connect with him, check out LinkedIn or Twitter.

While some chart series stand on their own, it sometimes makes sense to let them stand alone-ish. That is, you want a series to stand near some benchmark (e.g., closest competitor, last year’s performance, goals) without giving up too much of the spotlight. Enter bullet graphs. Bullet graphs were developed by Stephen Few, and they provide a great way to gain information density without the cognitive load of some other chart options. They have been around a while but are used much less frequently than simple two-series bar charts or even malformed eye-candy gauges.

For example, take a typical water bill (Figure 1, below, roughly replicated from my personal bill), which does a decent job of drawing your eyes to the current year’s data via a two-series bar chart. This is actually a good starting point, as I’ve seen data like this represented as two pie charts. Still, there’s a lurking confusion factor: users can misinterpret the chart because of its two-month groupings. At first glance, one might think that January is the white bar, February the black bar, and so on in some sort of tick-tock visual joke. Instead, we’re billed for two months at a time (e.g., a January-February unit), and each bar is the two-month total for that year. I’m going to start with the water bill as-is, and will adjust it towards a bullet graph through a path that’s familiar…especially for Cole’s readers.

Figure 1: Original Water Bill


While it’s a pretty good start, the white bars representing the “previous year’s consumption” series make my eyes bleed, so I’ll help them stand out less by making those bars gray and removing the border. In Figure 2, both series still have sufficient contrast ratios against the background and each other, which is good for everyone…especially “low-vision” users. It will also help to move the legend up towards the top of the chart to inform users ahead of the visual.

Figure 2: Updated Water Bill with Gray


If the exact numbers matter, I could consider adding data labels on the inside end of the bars. Only do this if you can ensure there’s always enough vertical bar space for contrasting text and sufficient width to accommodate reasonably large numbers.

Figure 3: Updated Water Bill with Inside Labels


Figure 3 feels like a natural improvement over the original chart (even though I want to remove the y-axis, I’ll leave it here for continuity). I could leave this alone, but I’m going to transform it into a simple bullet graph. Next, I’d like to put the previous year’s series behind the current year’s series so the direct comparison is more obvious. I can do this by clicking once into my first data series and making it “Secondary.” Because this will move them on top of each other, I’ll want to ensure the context (previous year series) is in the background, lighter in color (it already is), and wider.

Figure 4: Simple Bullet Graph


Now, my chart looks different because the primary and secondary axes are using ranges that make sense for each series independently. I’ll need to intervene and ensure they’re both set to the same range. Do this by right-clicking on each axis and using the “Edit Series” icon to set the scale (in this case, 0 to 3500). I can also delete the axis on the right (or both, if I add the value as a data label). Dealer’s choice, really. For additional decluttering, I’ve opted to remove the line at the base of the bars.

Figure 5: Simple Bullet Graph with Consistent Axes Range


Now, I’ve got a chart that draws my attention to this year’s trend with the context of what my family consumed last year (Figure 5). There’s also a side benefit: the two-month format is no longer confusing, as the labels are grouped and centered under both bars. It would be just as easy to have made this a horizontal bar chart instead (often better for longer labels and non-time groupings). This is how that would look (I’ve also used data labels in lieu of the axis).

Figure 6: Horizontal Simple Bullet Graph


If my water company wanted to add a little more information and a little peer pressure, they might use a more complete bullet graph with zones of guilt and a small marker that represents my local neighborhood, a goal they set for me, or (in this case) last year’s consumption for the same period.

Figure 7: Single Bullet Graph


A full series with this information would be more informative.

We can start with the data table and walk through how to do it. We’ll start with some reference ranges that will add up to the full range of the chart you want to see and will create zones in the background for context.

Table 1: Monthly Consumption Data (in percent)


Start by highlighting the entire table in Excel and using the Insert tab to add a 100% stacked chart. It will be hideous!

Figure 8. Default Stack Chart


Excel draws the table in reverse order (the top cell value is at the bottom of each chart column with each subsequent value piled on top). This is important to know so you start with what you want in the background at the top of the table. 
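Outside Excel, the same zone bookkeeping is just cumulative differences. A small Python sketch (with made-up thresholds, since the post's actual table values aren't reproduced here) of turning zone upper bounds into the individual segment heights a stacked chart draws bottom-to-top:

```python
def zone_segments(upper_bounds):
    """Convert cumulative zone upper bounds (in percent) into the
    individual segment heights a stacked chart draws bottom-to-top."""
    segments, previous = [], 0
    for bound in upper_bounds:
        segments.append(bound - previous)
        previous = bound
    return segments

# Hypothetical thresholds: Conservative up to 40%, High up to 75%,
# Extremely High up to 100%.
print(zone_segments([40, 75, 100]))  # [40, 35, 25]
```

The segments always sum to the chart's full range, which is exactly the property the 100% stacked chart relies on.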

The blue, orange, gray sequence for each period in Figure 8 is the Conservative, High, and Extremely High data from my table, respectively. You’ll want to right-click the bottom series and color it a dark gray; the orange series should be a noticeable bit lighter than that, and the gray series a few notches lighter still. Set the outline to “None” (or make it white to have the borders pop a little).

Figure 9. Stack Chart Mid-Formatting


From here, you’ll want to right-click one of the series and adjust the gap width to 40% or so to ensure the bars are nice and wide. When you hover over the untouched series, you’ll see that they are the current and previous years. Let’s go ahead and make those “secondary” by right-clicking and selecting “Change Series Chart Type.” Check the box for both the Current and Previous years’ data. For the current year, we’ll change it to the “stacked column” chart, while the previous year will be set to a “stacked line with markers.”

Figure 10: Stack Chart Changing Chart Type and Secondary Axes


While we’re at it, we can eliminate the chart border and the axis lines. We’re civilized, after all.

Figure 11: Stack Chart with Secondary Axes and Mixed Chart Types


Formatting the Current Year Consumption to Black and removing the ‘Previous Year’ connecting lines gets us REALLY close to the final chart.

Figure 12: Stack Chart Starting to Look Like a Bullet Graph


Here we can tweak the Previous Year Consumption marker to ensure it can be seen a bit more clearly. Right-click on the dots and select “Format Data Series” (Figure 13). Under Marker, choose Built-in and click on the wide dash. The default size is 5, but it will look much better around 18.

Figure 13: Format Data Series Marker


It’s not quite done, because you’ll need to ensure both axes are set to 0-100%. Do this, then delete one of them (most people delete the right one).

Figure 14: Bullet Graph Core Completed


We’re on the home stretch: now you can adjust the colors, add a title, and size it appropriately to fit your dashboard, report, etc. There’s also a really helpful feature for down the line, so that you can consistently format all similar charts EXACTLY the same way: right-click your chart, click “Save as Template,” name it, and save it.

Next time you have a set of data like this, you can start with whatever chart you want, right-click, and select “Change Series Chart Type” > “Templates” > Chart Template.

Figure 15: Chart Template Dialogue


Figure 16: Bullet Graph with Title, Lighter Ranges, Less Clutter


It’s probably more realistic that each month pair has different thresholds for Conservative, High, and Extremely High usage, so I’ve made the zones differently sized in Figure 17.

Figure 17: Bullet Graph with Variable Ranges


...and with a navy bar.

Figure 18: Bullet Graph with Navy Bar


...and another, even bluer option in Figure 19 (in honor of “Towel Day”).

Figure 19: Bullet Graph with Hooloovoo


Jon Peltier also has instructions if you’d like to see other ways to make bullet charts, in both horizontal and vertical varieties.

Huge thanks to Bill for this thorough and informative post! You can download the Excel file that contains his graphs.


If you did not already know

PANFIS++ google
The concept of an evolving intelligent system (EIS) provides an effective avenue for data stream mining because it is capable of coping with two prominent issues: online learning and rapidly changing environments. We note at least three uncharted territories of existing EISs: data uncertainty, temporal system dynamics, and redundant data streams. This book chapter aims at delivering a concrete solution to this problem with the algorithmic development of a novel learning algorithm, namely PANFIS++. PANFIS++ is a generalized version of PANFIS built on three important components: 1) an online active learning scenario is developed to overcome redundant data streams; this module actively selects data streams for the training process, thereby expediting execution time and enhancing generalization performance; 2) PANFIS++ is built upon an interval type-2 fuzzy system environment, which incorporates the so-called footprint of uncertainty and provides a degree of tolerance for data uncertainty; 3) PANFIS++ is structured under a recurrent network architecture with a self-feedback loop, meant to tackle the temporal system dynamics. The efficacy of PANFIS++ has been numerically validated through numerous real-world and synthetic case studies, where it delivers the highest predictive accuracy while retaining the lowest complexity. …
Exact Soft Confidence-Weighted Learning google
In this paper, we propose a new Soft Confidence-Weighted (SCW) online learning scheme, which enables the conventional confidence-weighted learning method to handle non-separable cases. Unlike the previous confidence-weighted learning algorithms, the proposed soft confidence-weighted learning method enjoys all four salient properties: (i) large margin training, (ii) confidence weighting, (iii) capability to handle non-separable data, and (iv) adaptive margin. Our experimental results show that the proposed SCW algorithms significantly outperform the original CW algorithm. When comparing with a variety of state-of-the-art algorithms (including AROW, NAROW and NHERD), we found that SCW generally achieves better or at least comparable predictive accuracy, but enjoys a significant advantage in computational efficiency (i.e., a smaller number of updates and lower time cost). …
Maximum Margin Principal Components google
Principal Component Analysis (PCA) is a very successful dimensionality reduction technique, widely used in predictive modeling. A key factor in its widespread use in this domain is the fact that the projection of a dataset onto its first $K$ principal components minimizes the sum of squared errors between the original data and the projected data over all possible rank $K$ projections. Thus, PCA provides optimal low-rank representations of data for least-squares linear regression under standard modeling assumptions. On the other hand, when the loss function for a prediction problem is not the least-squares error, PCA is typically a heuristic choice of dimensionality reduction — in particular for classification problems under the zero-one loss. In this paper we target classification problems by proposing a straightforward alternative to PCA that aims to minimize the difference in margin distribution between the original and the projected data. Extensive experiments show that our simple approach typically outperforms PCA on any particular dataset, in terms of classification error, though this difference is not always statistically significant, and despite being a filter method is frequently competitive with Partial Least Squares (PLS) and Lasso on a wide range of datasets. …


United Nations University: Data Science Lead

Seeking a candidate to play a central role in a fully funded, multimillion-dollar project to rapidly accelerate the global creation, exchange, and uptake of knowledge about modern slavery, human trafficking, forced labour, and child labour – and how to fight them successfully.


Visualizing your fitted Stan model using ShinyStan without interfering with your Rstudio session

ShinyStan is great, but I don’t always use it because when you call it from R, it freezes up your R session until you close the ShinyStan window.

But it turns out that it doesn’t have to be that way. Imad explains:

You can open up a new session via the RStudio menu bar (Session >> New Session), which should have the same working directory as wherever you were prior to running launch_shinystan(). This will let you work on whatever you need to work on while simultaneously running ShinyStan (albeit via two RStudio sessions).

OK, good to know.

The post Visualizing your fitted Stan model using ShinyStan without interfering with your Rstudio session appeared first on Statistical Modeling, Causal Inference, and Social Science.


Artificial human intelligence

When we use the words artificial intelligence, we typically mean artificial machine intelligence: training machines to act like human beings. What is actually happening is the opposite: we are developing artificial human intelligence, as in, humans are being trained to think like machines.

Example 1: I recently called a cab company and told them I was at Union Square. The dispatcher was taking a long time to respond, and eventually asked me whether I was on the east or west side of Manhattan. Miffed, I wondered how it was possible not to know where Union Square is. The guy explained that the system required him to enter an exact address.

When we didn't have the intelligent software, cab dispatchers responded to Union Square, or an exact address, or a set of cross streets, or whichever way the passenger chose to describe their location. Now, we must figure out what the software knows, and then speak to it in its language. After many years of such training, I doubt these dispatchers could locate Grand Central on a map!

Example 2: You call the customer service line of your bank, and get to talk to the intelligent service agent (a bot). She asks you to explain your need. You speak to her as if speaking to a human. She doesn't understand you, and wants you to try again. After a few fruitless exchanges, you learn that the only way to get through to her is to figure out her vocabulary. She only understands maybe a few dozen words. So you map your real need to the "nearest neighbor" in the space of the bot's vocabulary. The next time you call the same bank, you start with the bot's vocabulary.

Example 3: I have observed how some students learn coding. Their lifeline is Google which leads them to StackExchange or similar Q-and-A websites. They find the code they want, and copy and paste it to their screens. They press the Run button. It fails, with an error message. Suspecting that the error may have to do with the placement of a parenthesis, they move the parenthesis to a different spot. They press Run again. It fails again. They move the parenthesis to a third spot, and press Run. It fails again. Eventually, it is in the right slot and the program runs. This brute-force, try-all-possibilities paradigm is frequently used by machines because they can run calculations really fast. Now, humans are learning in the same way.

Alarm has been raised over the possibility that machines will eventually control humans. This is possible if humans adopt artificial human intelligence, that is to say, learn to think like machines.



Continue Reading…



Book Memo: “Functional Programming in R”

Advanced Statistical Programming for Data Science, Analysis and Finance
Master functions and discover how to write functional programs in R. In this concise book, you’ll make your functions pure by avoiding side-effects; you’ll write functions that manipulate other functions, and you’ll construct complex functions using simpler functions as building blocks. In Functional Programming in R, you’ll see how we can replace loops, which can have side-effects, with recursive functions that can more easily avoid them. In addition, the book covers why you shouldn’t use recursion when loops are more efficient and how you can get the best of both worlds. Functional programming is a style of programming, like object-oriented programming, but one that focuses on data transformations and calculations rather than objects and state. Where in object-oriented programming you model your programs by describing which states an object can be in and how methods will reveal or modify that state, in functional programming you model programs by describing how functions translate input data to output data. Functions themselves are considered to be data you can manipulate and much of the strength of functional programming comes from manipulating functions; that is, building more complex functions by combining simpler functions.

Continue Reading…



Machine Learning Anomaly Detection: The Ultimate Design Guide

Considering building a machine learning anomaly detection system for your high-velocity business? Learn how with Anodot's ultimate three-part guide.

Continue Reading…



Regular Expression & Treemaps to Visualize Emergency Department Visits

(This article was first published on R – Incidental Ideas, and kindly contributed to R-bloggers)

It’s been a while since my last post on some TB WHO data. A lot has happened since then, including the opportunity to attend the Open Data Science Conference (ODSC) East held in Boston, MA. Over a two-day period I had the opportunity to listen to a number of leaders in various industries and fields. It…

To leave a comment for the author, please follow the link and comment on their blog: R – Incidental Ideas. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…



Predicting Flight Delays with Random Forests: Alumni Spotlight on Stacy Karthas

At The Data Incubator we run a free eight-week data science fellowship to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring data scientists. Stacy was a Fellow in our Winter 2017 cohort who landed a job with one of our hiring partners, AdTheorent.

Tell us about your background. How did it set you up to be a great data scientist?

I received my Bachelor of Science degrees in mathematics and physics from the University of New Hampshire. I then went on to graduate school at Stony Brook University. I graduated with my master’s degree in Physics in December 2016. During my master’s degree, I did research in Nuclear Heavy Ion Physics with a focus on the analysis of gluons and their products as they traversed our detector. The data analysis, simulation, and clustering algorithms I worked on prepared me to become a data scientist because it was a physical application of many of the tools used by data scientists.

What do you think you got out of The Data Incubator?

The Data Incubator gave me the chance to solidify my data science knowledge. It helped me pull together tools and concepts I had been using during all of my previous research experiences. I learned a lot of new machine learning concepts and how they could be applied to real world data.

What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?

Python is key. Learning as much as you can before the program is very important. I would also suggest taking an online course or reading a bit about machine learning before the program starts. Also, it is easier if you try to relate the concepts back to something you’ve already done. It was easier for me to visualize how clustering algorithms worked because I had been working on my own for a few months.

What’s your favorite thing you learned while at The Data Incubator? This can be a technology, concept, or whatever you want!

My favorite thing I learned at The Data Incubator was how to create models with scikit-learn. Because of my limited background in Python, the fact that you can use such a convenient package to do some very solid machine learning was very neat!

Describe your Data Incubator Capstone Project

My capstone project was an app that predicted whether or not a domestic flight in the US would be delayed. This was based on date, time of day, airline, airport, etc.

How did you come up with the idea for the project?

Millions of passengers take domestic flights every day, whether for business or for pleasure. The worst thing about flying is that you have to build in extra time in case of a delay: at least 15% of flights are delayed by more than 10 minutes, and many are delayed for hours. I thought that I could create an app that would allow people (and myself) to find an airline or flight that is not likely to be delayed, so as to minimize the chance of this hassle.

What technologies did you use and what skills did you learn at TDI that you applied to the project?

I used scikit-learn’s random forest classifier to build my prediction model, along with other packages to assist in evaluating and cross-validating my results. I also used Flask and Heroku to deploy my app. Some of my visualizations used matplotlib, seaborn, plotly, and d3.
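As a rough sketch of that kind of pipeline (the features, data, and parameters below are illustrative stand-ins, not Stacy's actual code or the real flight dataset), a scikit-learn random forest delay classifier might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Toy stand-ins for flight features: month, hour of day, airline id, airport id.
X = rng.integers(0, 20, size=(500, 4))
# Toy label: 1 when the flight is "delayed"; in reality this comes from DOT on-time data.
y = (X[:, 1] > 12).astype(int)  # toy rule: late-day flights tend to be delayed

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

scores = cross_val_score(clf, X_train, y_train, cv=5)  # cross-validate the model
delay_probability = clf.predict_proba(X_test)[:, 1]    # P(delay) per flight
```

An app like the one described would then rank airlines or flights by this predicted delay probability.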

What was your most surprising or interesting finding?

I thought it was interesting just how poorly some of the airlines performed. Generally, the larger airlines tended to have worse on-time statistics and small airlines like Alaskan and Hawaiian had short delays in general.

Describe the business application for this project (how could a company use your work or your data)

Time is money. I can think of two ways a business would want to use this. The first is that they don’t want to send their employees on business trips to have them waiting around in the airport so it would be best to book with airlines that have fewer delays. Additionally, this app would promote competition and accountability among airlines. They would be able to promote themselves with their on-time statistics in addition to customers holding airlines to higher standards.

Do you have an interesting visualization to share?






The causes of delay by time of day indicate that delays tend to stack, meaning that delays earlier in the day tend to cause delays later in the day.

Where are you going to be working, and tell us a little about your new job!

I am currently working as an associate data scientist at AdTheorent. AdTheorent is a digital media company that bids on mobile and web ad space for its clients. My job is to build models that help increase the likelihood that an advertisement will perform well (be clicked, be seen, or lead someone to buy the product).

Continue Reading…



Data science platforms are on the rise and IBM is leading the way

Download the 2017 Gartner Magic Quadrant for Data Science Platforms today to learn why IBM is named a leader in data science and to find out why data science, analytics, and machine learning are the engines of the future.

Continue Reading…



Deployment of Pre-Trained Models on Azure Container Services

This post is authored by Mathew Salvaris, Ilia Karmanov and Jaya Mathew.

Data scientists and engineers routinely encounter issues when moving their final functional software and code from their development environment (laptop, desktop) to a test environment, or from a staging environment to production. These difficulties primarily stem from differences between the underlying software environments and infrastructure, and they eventually end up costing businesses a lot of time and money, as data scientists and engineers work towards narrowing down these incompatibilities and either modify software or update environments to meet their needs.

Containers end up being a great solution in such scenarios, as the entire runtime environment (application, libraries, binaries and other configuration files) get bundled into a package to ensure smooth portability of software across different environments. Using containers can, therefore, improve the speed at which apps can be developed, tested, deployed and shared among users working in different environments. Docker is a leading software container platform for enabling developers, operators and enterprises to overcome their application portability issue.

The goal of Azure Container Services (ACS) is to provide a container hosting environment by using popular open-source tools and technologies. Like all software, deploying machine learning (ML) models can be tricky due to the plethora of libraries used and their dependencies. In this tutorial, we will demonstrate how to deploy a pre-trained deep learning model using ACS. ACS enables the user to configure, construct and manage a cluster of virtual machines preconfigured to run containerized applications. Once the cluster is set up, DC/OS is used for scheduling and orchestration. This is an ideal setup for any ML application, since Docker containers offer great flexibility in the libraries used and are scalable on demand, all while ensuring that the application is performant.

The Docker image used in this tutorial contains a simple Flask web application with Nginx web server and uses Microsoft’s Cognitive Toolkit (CNTK) as the deep learning framework, with a pretrained ResNet 152 model. Our web application is a simple image classification service, where the user submits an image, and the application returns the class the image belongs to. This end-to-end tutorial is split into four sections, namely:

  • Create Docker image of our application (00_BuildImage.ipynb).
  • Test the application locally (01_TestLocally.ipynb).
  • Create an ACS cluster and deploy our web app (02_DeployOnACS.ipynb).
  • Test our web app (03_TestWebApp.ipynb, 04_SpeedTestWebApp.ipynb).

Each section has an accompanying Jupyter notebook with step-by-step instructions on how to create, deploy and test the web application.

Create Docker Image of the Application (00_BuildImage.ipynb)

The Docker image in this tutorial contains three main elements, namely: the web application (web app), pretrained model, and the driver for executing our model, based on the requests made to the web application. The Docker image is based on an Ubuntu 16.04 image to which we added the necessary Python dependencies and installed CNTK (another option would be to test our application in an Ubuntu Data Science Virtual Machine from Azure portal). An important point to be aware of is that the Flask web app is run on port 5000, so we have created a proxy from port 88 to port 5000 using Nginx to expose port 88 in the container. Once the container is built, it is pushed to a public Docker hub account so that the ACS cluster can access it.
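To make the driver concrete, here is a minimal sketch of what such a Flask scoring app could look like. The route name, the classify_image helper, and the returned label are hypothetical; the actual tutorial code lives in the accompanying notebooks.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify_image(image_bytes):
    # Placeholder for the CNTK ResNet-152 scoring call used in the tutorial.
    # A real driver would decode the bytes and run the pretrained network.
    return [["Labrador retriever", 0.91]]

@app.route("/api/v1/classify", methods=["POST"])
def classify():
    # The user submits an image; the app returns the predicted class.
    predictions = classify_image(request.get_data())
    return jsonify({"predictions": predictions})

# In the container, Flask serves on port 5000, and Nginx proxies the
# exposed port 88 to it, e.g.:
# app.run(host="0.0.0.0", port=5000)
```

The Nginx proxy in front of the app is what makes port 88 the only port the container needs to expose.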

Test the Application Locally (01_TestLocally.ipynb)

Having short feedback loops while debugging is very important and ensures quick iterations. Docker images allow the user to do this: the user can run the application locally and check its functionality before going through the entire process of deploying the app to ACS. This notebook outlines the process of spinning up the Docker container locally and configuring it properly. Once the container is up and running, the user can send requests to be scored using the model and check the model's performance.

Create an ACS Cluster and Deploy the Web App (02_DeployOnACS.ipynb)

In this notebook, the Azure CLI is used to create an ACS cluster with two nodes (this can also be done via the Azure portal). Each node is a D2 VM, which is quite small but sufficient for this tutorial. Once ACS is set up, to deploy the app, the user needs to create an SSH tunnel into the head node. This ensures that the user can send the JSON application schema to Marathon.

In the schema, we have mapped port 80 of the host to port 88 of the container (users can choose different ports as well). This tutorial deploys only one instance of the application (the user can scale this up, but that will not be discussed here). Marathon has a web dashboard that can be accessed through the SSH tunnel by simply pointing the web browser to the tunnel created for deploying the application schema.

Test the Web App (03_TestWebApp.ipynb, 04_SpeedTestWebApp.ipynb)

Once the application has been successfully deployed the user can send scoring requests. The illustration below shows examples of some of the results returned from the application. The ResNet 152 model seems to be fairly accurate, even when parts of the subject (in the image) are occluded.

Further, the average response time for these requests is less than a second, which is very performant. Note that this tutorial was run on a virtual machine in the same region as the ACS. Response times across regions may be slower but the performance is still acceptable for a single container on a single VM.

After running the tutorial, to delete ACS and free up the associated Azure resources, run the cells at the end of the 02_DeployOnACS.ipynb notebook.

We hope you found this interesting – do share your thoughts or comments with us below.

Mathew, Ilia & Jaya


Continue Reading…



Using sparklyr in Databricks

Try this notebook on Databricks; it contains all the instructions explained in this post.

In September 2016, RStudio announced sparklyr, a new R interface to Apache Spark. sparklyr’s interface to Spark follows the popular dplyr syntax. At Databricks, we provide the best place to run Apache Spark and all applications and packages powered by it, from all the languages that Spark supports. sparklyr’s addition to the Spark ecosystem not only complements SparkR but also extends Spark’s reach to new users and communities.

Today, we are happy to announce that sparklyr can be seamlessly used in Databricks clusters running Apache Spark 2.2 or higher. In this blog post, we show how you can install and configure sparklyr in Databricks. We also introduce some of the latest improvements in Databricks R Notebooks.

Clean R Namespace

When we released R notebooks in 2015, we integrated SparkR into the notebook: the SparkR package was imported by default in the namespace, and both the Spark and SQL context objects were initialized and configured. Thousands of users have run R and Spark code in R notebooks. We learned that some of them use our notebooks as a convenient way to do single-node R data analysis. For these users, the pre-loaded SparkR functions masked several functions from other popular packages, most notably dplyr.

To improve the experience of users who wish to use R notebooks for single node analysis and the new sparklyr users starting with Spark 2.2, we are not importing SparkR by default any more. Users who are interested in single-node R data science can launch single node clusters with large instances and comfortably run their existing single-node R analysis in a clean R namespace.

For users who wish to use SparkR, the SparkSession object is still initialized and ready to be used right after they import SparkR.

sparklyr in Databricks

We collaborated with our friends at RStudio to enable sparklyr to seamlessly work in Databricks clusters. Starting with sparklyr version 0.6, there is a new connection method in sparklyr: databricks. When calling spark_connect(method = "databricks") in a Databricks R Notebook, sparklyr will connect to the Spark cluster of that notebook. As this cluster is fully managed, you do not need to specify any other information, such as the version or SPARK_HOME.

Installing sparklyr

sparklyr is under active development, and new versions are regularly released with API improvements and bug fixes. We do not pre-install sparklyr, so that our users can install and enjoy the latest version of the package. You can install the latest development version from GitHub:

devtools::install_github("rstudio/sparklyr")

Once sparklyr 0.6 is released to CRAN, installation will be as simple as:

install.packages("sparklyr")

Configuring sparklyr connection

Configuring the sparklyr connection in Databricks could not be simpler:

sc <- spark_connect(method = "databricks")

Using sparklyr API

After setting up the sparklyr connection, you can use all sparklyr APIs. You can import and combine sparklyr with dplyr or MLlib. You can also use sparklyr extensions. Note that if the extension packages include third-party JARs, you may need to install those JARs as libraries in your workspace.

library(dplyr)
library(ggplot2)

iris_tbl <- copy_to(sc, iris)
iris_summary <- iris_tbl %>%
    mutate(Sepal_Width = ROUND(Sepal_Width * 2) / 2) %>%
    group_by(Species, Sepal_Width) %>%
    summarize(count = n(),
              Sepal_Length = mean(Sepal_Length),
              stdev = sd(Sepal_Length)) %>%
    collect

ggplot(iris_summary,
       aes(Sepal_Width, Sepal_Length, color = Species)) +
    geom_line(size = 1.2) +
    geom_errorbar(aes(ymin = Sepal_Length - stdev,
                      ymax = Sepal_Length + stdev),
                  width = 0.05) +
    geom_text(aes(label = count),
              vjust = -0.2, hjust = 1.2, color = "black")
Using SparkR and sparklyr Together

We find SparkR and sparklyr complementary. You can use the packages next to each other in a single notebook or job. To do so you can import SparkR along with sparklyr in Databricks notebooks. The SparkR connection is pre-configured in the notebook, and after importing the package, you can start using SparkR API. Also, remember that some of the functions in SparkR mask a number of functions in dplyr.

The following objects are masked from ‘package:dplyr’:

    arrange, between, coalesce, collect, contains, count, cume_dist,
    dense_rank, desc, distinct, explain, filter, first, group_by,
    intersect, lag, last, lead, mutate, n, n_distinct, ntile,
    percent_rank, rename, row_number, sample_frac, select, sql,
    summarize, union

If you import SparkR after dplyr, you can reference the masked dplyr functions by using their fully qualified names, for example, dplyr::arrange(). Similarly, if you import dplyr after SparkR, the corresponding functions in SparkR are masked by dplyr.

Alternatively, you can selectively detach one of the two packages if you do not need it.
For example: detach("package:SparkR")



We are continuously improving Databricks R Notebooks to keep them the best place to perform reproducible R data analysis, whether it is on distributed data with Apache Spark or single-node computation using packages from the rich existing R ecosystem.

As we demonstrated with a few easy steps, you can now seamlessly use sparklyr on Databricks. You can try it out in our Community Edition with Databricks Runtime Beta 3.0 that includes the latest release candidate build of Apache Spark 2.2.


Try Databricks for free. Get started today.

The post Using sparklyr in Databricks appeared first on Databricks.

Continue Reading…



Two Sigma Financial Modeling Challenge, Winner's Interview: 2nd Place, Nima Shahbazi, Chahhou Mohamed

Our Two Sigma Financial Modeling Challenge ran from December 2016 to March 2017. Asked to search for signal in financial markets data with limited hardware and computational time, this competition attracted over 2000 competitors. In this winners' interview, 2nd place winners Nima and Chahhou describe how paying close attention to unreliable engineered features was important to building a successful model.

The basics

What was your background prior to entering this challenge?

Nima: I am a final-year PhD student in the Data Mining and Database Group at York University. I love problem solving and challenging myself to find the best model for regression, classification, and clustering problems. Prior to entering this challenge I had entered many competitions on Kaggle, including predictive modeling, NLP, recommendation systems, and deep learning image segmentation, and ended up in the top 1% in all of them.

Chahhou: I received the MS degree in computer engineering from the Université Libre de Bruxelles (Belgium) and the Ph.D. degree in computer science from the University Hassan I, Settat, Morocco. I am currently a professor at the Faculty of Science at Dhar al Mahraz (Fes).

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Nima: I previously worked in big data analytics, specifically on the Forex market. I developed many algorithmic trading strategies based on historical stock prices and news feeds. Besides that, I have participated in many Kaggle competitions (won Rossmann and Home Depot), and am currently a Kaggle Grandmaster.
Chahhou: I have no experience/knowledge in finance but a lot of experience in machine learning from Kaggle competitions and back from school.

How did you get started competing on Kaggle?

Nima: Around 2 years ago, I was informed by one of my friends that the IEEE International Conference on Data Mining series (ICDM) had established an interesting contest in data mining. In that competition, participants were tasked with identifying a set of user connections across different devices without using common user handle information such as name, email, phone number, etc. Moreover, participants were asked to figure out the likelihood that a set of different IDs from different domains belong to the same user, and at what performance level. I got very excited and started working on that problem. Finally, I ranked 7th in that competition and learned lots of new approaches for machine learning and data mining problems. I realized that in order to be successful in this field you should challenge yourself with real-world problems. Although theoretical knowledge is a must, without experience on real-world problems you will not be able to succeed.
Chahhou: I found Kaggle randomly when I started learning data mining and machine learning 3 years ago. Since then, it became my favorite “tool” for learning and teaching.

What made you decide to enter this competition?

Both: This is the first code competition where hardware and computation time are limited for all the participants. Also, you are not allowed to view the test data, which makes this problem even more challenging and indeed closer to real-life use of machine learning. Other competitions expose all the test data (without the target, of course), but that is way too different. We think that these kinds of competitions will force participants to use more robust modeling on unforeseen data.

Let’s get technical

What preprocessing and feature engineering did you do?

We have 7 types of features:
a. Feature: the actual feature value
b. Lag1.Feature: the feature value at the previous timestamp: feature(t-1)
c. lag1.Feature_diff: feature(t) - feature(t-1)
d. lag1.Feature_absdiff: abs(feature(t) - feature(t-1))
e. lag1.Feature_sumlag: feature(t) + feature(t-1)
f. Feature_AMean: feature → groupby(timestamp) → mean
python code: data.groupby('timestamp').mean()
g. Feature_deMean: feature → groupby(timestamp) → actual − mean
python code: data.groupby('timestamp').apply(lambda x: x - x.mean())
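A rough pandas sketch of the seven feature types above (the frame layout and column names are illustrative, not the winners' actual code; the de-mean step uses transform, which is equivalent to the groupby/apply shown above):

```python
import pandas as pd

# Toy frame: one row per (timestamp, id) pair, as in the competition data.
data = pd.DataFrame({
    "timestamp": [0, 0, 1, 1, 2, 2],
    "id":        [1, 2, 1, 2, 1, 2],
    "feature":   [1.0, 2.0, 2.0, 4.0, 3.0, 6.0],
})

data = data.sort_values(["id", "timestamp"])
g = data.groupby("id")["feature"]

data["lag1_feature"]         = g.shift(1)                              # b. value at t-1
data["lag1_feature_diff"]    = data["feature"] - data["lag1_feature"]  # c. difference
data["lag1_feature_absdiff"] = data["lag1_feature_diff"].abs()         # d. absolute diff
data["lag1_feature_sumlag"]  = data["feature"] + data["lag1_feature"]  # e. sum with lag

# f. per-timestamp cross-sectional mean, g. de-meaned value
data["feature_amean"]  = data.groupby("timestamp")["feature"].transform("mean")
data["feature_demean"] = data["feature"] - data["feature_amean"]
```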

What supervised learning methods did you use?

We found that Extra Trees and Ridge models were the best fit for this dataset, given the nature of the data and the time constraint. Financial data are highly noisy and unstructured, and we believe that for super noisy datasets, using solid basic models to capture the very weak signal is more effective.

After a lot of tuning, we came up with an ensemble of two Extra Trees and two Ridge models. Each model uses a different set of features.
Features in our models are selected in two steps:
a) First compute the correlation with the output y for all features (including the engineered ones) and keep only the highly-correlated ones.
b) Then, select the features (forward selection) for Ridge and Extra Trees based on our designed time-series-CV.
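Step (a) can be sketched like this (the threshold, data, and column names are illustrative; the winners' actual cutoff is not stated):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy data: five candidate features, of which only f0 carries signal for y.
X = pd.DataFrame(rng.normal(size=(1000, 5)),
                 columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] + 0.1 * rng.normal(size=1000)

# Keep only the features whose absolute correlation with y clears a threshold.
corr = X.corrwith(y).abs()
keep = corr[corr > 0.1].index.tolist()
```

Forward selection in step (b) would then grow the feature set for each model, adding a surviving feature only if it improves the time-series CV score.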
The validation of our ensembles was based on 4 time-series-CV:
CV_1: train from timestamp 0 to 658 and validate from timestamp 659 to 1248
CV_2: train from timestamp 658 to 1248 and validate from timestamp 1249 to 1812
CV_3: train from timestamp 0 to 658 and validate from timestamp 1249 to 1812
CV_4: train from timestamp 0 to 905 and validate from timestamp 906 to 1812
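The four walk-forward splits above reduce to plain timestamp ranges. A sketch with a toy frame (the real data has many columns per timestamp):

```python
import pandas as pd

# (train_start, train_end, valid_start, valid_end) timestamp ranges from the post.
CV_SPLITS = [
    (0,   658,  659,  1248),   # CV_1
    (658, 1248, 1249, 1812),   # CV_2
    (0,   658,  1249, 1812),   # CV_3
    (0,   905,  906,  1812),   # CV_4
]

def time_series_split(data, spec):
    t0, t1, v0, v1 = spec
    train = data[data["timestamp"].between(t0, t1)]
    valid = data[data["timestamp"].between(v0, v1)]
    return train, valid

# Toy frame with one row per timestamp, standing in for the real data.
data = pd.DataFrame({"timestamp": range(1813), "y": 0.0})
for name, spec in zip(["CV_1", "CV_2", "CV_3", "CV_4"], CV_SPLITS):
    train, valid = time_series_split(data, spec)
    print(name, len(train), len(valid))
```

Note that each validation window starts strictly after its training window ends, so no future information leaks into training.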

Note that it is important to remove outliers from validation sets before you start evaluating your models. In the pre-designed Kaggle framework you cannot remove the outliers from the pre-defined validation set; thus, we designed our own validation sets and framework.
Based on the cumulative R_score curve, model selection was done following this simple guideline: a model is kept if the cumulative R_score curve of the new ensemble is above the curve of the old ensemble at each timestamp and for all CVs.
The following figure shows the cumulative R_scores for all models and the ensemble.

Cumulative R_score curves for all CVs

What was your most important insight into the data and were you surprised by any of your findings?

We had to pay attention to some of the engineered features (mostly features from the Feature_AMean category), as they were highly unreliable. A feature can show very good performance on a given CV and very bad results on other CVs. The study of fundamental_62 gave us some interesting insights about this feature: its standard deviation by timestamp shows an interesting pattern. Surprisingly, this pattern and the Volume pattern (at the same scale) have the same peaks at "slightly" regular intervals (roughly every 100 timestamps).

Also, technical_29 and technical_34 together boost the performance of the Extra Trees models on CVs when they are used with technical_20, technical_30, technical_20_deMean, and technical_30_deMean. In addition, Fundamental_62_AMean has a high correlation with the output and a great impact on our CVs and the leaderboard for our Ridge models. The following figure shows the importance of the features used in the Extra Trees model:

Which tools did you use?

For preprocessing and exploratory data analysis we used R and Python, but the code submission format was Python only. We also designed a hash-based system in our Python code so that we could add engineered lag features easily and quickly.

How did you spend your time on this competition? (For example: what proportion on feature engineering vs. machine learning?)

We spent more than 80% on feature engineering, feature selection and understanding the data, and 20% on model ensembling, and tuning.

What was the run time for both training and prediction of your winning solution?

The winning solution took around 45 minutes to run and predict on Kaggle kernel.

How well did you do on public/private leaderboard?

The leaderboard shake-up from public to private was quite impressive; many teams had overfit the public leaderboard. Some teams dropped more than 900 places on the private leaderboard. We were one of the few teams that did well on both the public and private leaderboards.

Words of wisdom:

What have you taken away from this competition?

We learned a lot from the forums and kernels. Kaggle is a really great place to improve your machine learning knowledge and share your ideas, but you will not get truly involved until you are actively competing in a real competition.

How do you compare Kaggle competitions with Academia state of the art results?

In our opinion, beating the academic state of the art is way easier than winning a Kaggle competition, and we have worked on both sides. In a Kaggle competition, lots and lots of people around the world look at the data, try lots of things, and use a very pragmatic approach. Besides that, in Kaggle competitions you can monitor your score (on the public leaderboard) against the other competitors, which makes you try harder and dig for more insights. It may be the power of the competition that moves you forward.

Do you have any advice for those just getting started in data science?

Everyone with very simple knowledge of data mining tools can create a model, but only a few can create something useful. Therefore, make sure you understand the math behind regression and classification and the way a model learns. The most important thing to learn is how learning methods use regularization to avoid overfitting. Second, fully understand the principles of cross-validation, and which type of cross-validation fits your problem. Finally, do not spend lots of time tuning the model. Instead, spend your time extracting features and understanding the data. The more you play with the data, the more interesting insights you will find.


How did your team form? and how did your team work together?

I (Nima) knew that this market is very noisy and that blending more good models would increase our chance of winning, so I asked Chahhou, who was also doing quite well in the competition, to team up. We teamed up two weeks before the merger deadline, started to ensemble our models, and used Slack to communicate and share ideas.

Just for fun:

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
We would really like to see more cancer-related data mining competitions, as solving these problems is more rewarding.

What is your dream job?
Nima: President ☺

Short Bio:

Nima Shahbazi is a final-year PhD student in the Data Mining and Database Group at York University. His current research interests include mining data streams, big data analytics, and deep learning. He has tackled many machine learning problems, including predictive modeling, classification, natural language processing, image segmentation with deep learning, and recommendation systems, and has ranked in the top 1% in these modeling competitions. He was also admitted to the NextAI program, which aims to make Canada the world leader in AI innovation.

Chahhou Mohamed received his MS degree in computer engineering from the Université Libre de Bruxelles (Belgium) and his Ph.D. in computer science from the University Hassan I, Settat, Morocco. He is currently a professor at the Faculty of Science at Dhar al Mahraz (Fes).

Continue Reading…



Shiny Application Layouts Exercises (Part-10)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Shiny Application Layouts – Shiny Themes

In the last part of the series we will check out which themes are available in the shinythemes package. More specifically we will create a demo app with a selector from which you can choose the theme you want.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly, you can review what you learned in the “Building Shiny App” series and in this series, as we will build basic Shiny components in order to present them in the proper way.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Create the App.

In order to see how themes affect the application components we have seen until now, we need to create them.

As we are going to use tags here, a good idea is to use tagList() in order to create our app. tagList() is ideal for users who wish to create their own sets of tags.

Let’s see an example of the skeleton of our application and then create our own step by step before applying to it the theme selector.
tabPanel("Navbar 1",
         sidebarPanel(
           sliderInput("slider", "Slider input:", 1, 100, 50),
           actionButton("action", "Action")
         ),
         mainPanel(
           h4("Verbatim Text"),
           verbatimTextOutput("vtxt"),
           h1("Header 1"),
           h2("Header 2"))),
tabPanel("Navbar 2")

function(input, output) {
  output$vtxt <- renderText({ input$slider })
  output$table <- renderTable({ iris })
}


Exercise 1

Create a UI using tag List with the form of a Navbar Page and name it “Themes”. HINT: Use tagList() and navbarPage().

Exercise 2

Your Navbar Page should have two tab Panels named “Navbar 1” and “Navbar 2”. HINT: Use tabPanel().

Exercise 3

In “Navbar 1” add sidebar and main panel. HINT: Use sidebarPanel() and mainPanel().

Exercise 4

Create three tab panels inside the main panel. Name them “Table”, “Text” and “Header” respectively. HINT: Use tabsetPanel() and tabPanel().

Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

Exercise 5

In the tab panel “Table” add a table of the iris dataset. Name it “Iris”. HINT : Use tableOutput() and renderTable({}).

Exercise 6

In the tab panel “Text” add verbatim text and name it “Vtext”. HINT: Use verbatimTextOutput().

Exercise 7

Add a slider and an action button in the sidebar. Connect the slider with the “Text” tab panel. Use tags to name the action button. HINT: Use sliderInput(), actionButton(), renderText() and tags.

Exercise 8

In the tab panel “Header” add two headers with size h1 and h2 respectively.


Exercise 9

Install and load the package shinythemes.

Exercise 10

Place themeSelector() inside your tagList() and then use different shiny themes to see how they affect all the components you created so far.
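Putting the pieces together, one possible shape for the finished app looks like the sketch below. This is not the official answer set (that is linked above); the input and output IDs simply follow the exercise hints, and it assumes the shiny and shinythemes packages are installed:

```r
library(shiny)
library(shinythemes)

ui <- tagList(
  themeSelector(),  # live theme picker from shinythemes (Exercise 10)
  navbarPage("Themes",
    tabPanel("Navbar 1",
      sidebarPanel(
        sliderInput("slider", "Slider input:", 1, 100, 50),
        actionButton("action", "Action")
      ),
      mainPanel(
        tabsetPanel(
          tabPanel("Table", tableOutput("Iris")),
          tabPanel("Text", verbatimTextOutput("Vtext")),
          tabPanel("Header", h1("Header 1"), h2("Header 2"))
        )
      )
    ),
    tabPanel("Navbar 2")
  )
)

server <- function(input, output) {
  output$Iris  <- renderTable({ head(iris) })  # Exercise 5
  output$Vtext <- renderText({ input$slider }) # Exercises 6-7
}

# shinyApp(ui, server)  # uncomment to launch the app
```

Switching entries in the theme selector restyles every component above without any further code changes.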

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

How A Data Scientist Can Improve Productivity

Data science projects involve iterative processes and may need changes to the data at every iteration. Data versioning, data pipelines, and data workflows make a data scientist's life easier; let's see how.

Continue Reading…


Read More

Graphons and "Inferencing"

In episode two of season three Neil takes us through the basics on dropout, we chat about the definition of inference (It's more about context than you think!) and hear an interview with Jennifer Chayes of Microsoft.

Continue Reading…


Read More

Unsupervised Investments (II): A Guide to AI Accelerators and Incubators

A meticulously compiled, as-extensive-as-possible list of every accelerator, incubator, or program the author has read about or bumped into over the past months. It looks like there are at least 29 of them. An interesting read for a wide variety of potentially interested parties, far beyond only the investor.

Continue Reading…


Read More

CI for difference between independent R square coefficients

(This article was first published on R code – Serious Stats, and kindly contributed to R-bloggers)

In an earlier blog post I provided R code for a CI of a difference in R square for dependent and non-dependent correlations. This was based on a paper by Zou (2007). That paper also provides a method for calculating the CI of a difference in independent R square coefficients based on the limits of the CI for a single R square coefficient. I’ve also been experimenting with knitr and unfortunately haven’t yet worked out how to merge the R markdown output with my blog template, so I’m linking to RPubs for convenience.

You can find the function and a few more details here.
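The core of Zou's (2007) approach is to combine the CI limits of each individual R square into a CI for their difference. A minimal base-R sketch of that step (the function name and argument layout here are illustrative, not the code from the linked RPubs page, and the individual R square limits are assumed to be supplied already):

```r
# CI for the difference between two independent R^2 coefficients,
# following Zou (2007): the limits for the difference are built from
# each coefficient's own CI limits.
#   r2.1, r2.2 : the two R^2 estimates
#   l1, u1     : lower/upper CI limits for r2.1
#   l2, u2     : lower/upper CI limits for r2.2
rsq.diff.ci <- function(r2.1, l1, u1, r2.2, l2, u2) {
  diff  <- r2.1 - r2.2
  lower <- diff - sqrt((r2.1 - l1)^2 + (u2 - r2.2)^2)
  upper <- diff + sqrt((u1 - r2.1)^2 + (r2.2 - l2)^2)
  c(lower = lower, upper = upper)
}

# e.g. an R^2 of .50 (CI .40 to .60) versus an R^2 of .30 (CI .20 to .40):
rsq.diff.ci(.50, .40, .60, .30, .20, .40)
```

Note that the interval is asymmetric around the point difference whenever the individual CIs are asymmetric around their R square values.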



Filed under: comparing correlations, confidence intervals, R code, serious stats Tagged: confidence intervals, R, R code, software, statistics

To leave a comment for the author, please follow the link and comment on their blog: R code – Serious Stats.

Continue Reading…


Read More

What If Robots Did the Hiring at Fox News?

My newest Bloomberg View column is out:

What If Robots Did the Hiring at Fox News?

Continue Reading…


Read More


Love is all around: Popular words in pop hits

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Data scientist Giora Simchoni recently published a fantastic analysis of the history of pop songs on the Billboard Hot 100 using the R language. Giora used the rvest package in R to scrape data from the Ultimate Music Database site for the 350,000 chart entries (and 35,000 unique songs) since 1940, and used those data to create and visualize several measures of song popularity over time.

A novel measure that Giora calculates is "area under the song curve": for every week a song spends in the Hot 100, add 100 minus its chart position, and sum over all weeks. By that measure, the most popular (and also longest-charting) song of all time is Radioactive by Imagine Dragons:

Imagine Dragons

It turns out that calculating this "song integral" is pretty simple in R when you use the tidyverse:

calculateSongIntegral <- function(positions) {
  sum(100 - positions)
}

billboard %>%
  filter(EntryDate >= date_decimal(1960)) %>%
  group_by(Artist, Title) %>%
  summarise(positions = list(ThisWeekPosition)) %>%
  mutate(integral = map_dbl(positions, calculateSongIntegral)) %>%
  group_by(Artist, Title) %>%
  tally(integral) %>%
  arrange(desc(n))  # final step assumed: rank songs by the integral
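As a quick sanity check of the measure, here is the same helper applied to a made-up three-week chart run (toy numbers, not real Billboard data):

```r
# "Area under the song curve": each charting week contributes
# (100 - position), so higher positions and longer runs both add area.
calculateSongIntegral <- function(positions) {
  sum(100 - positions)
}

# A song charting for three weeks at positions 40, 10 and 25:
calculateSongIntegral(c(40, 10, 25))  # (100-40) + (100-10) + (100-25) = 225
```

A week at #1 contributes 99 points, a week at #100 contributes nothing, which is why a long-charting hit like Radioactive dominates the ranking.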

Another fascinating chart included in Giora's post is this analysis of the most frequent words to appear in song titles, by decade. He used the tidytext package to extract individual words from song titles and then rank them by frequency of use:


So it seems as though Love Is All Around (#41, October 1994) after all! For more analysis of the Billboard Hot 100 data, including Top-10 rankings for various measures of song popularity and the associated R code, check out Giora's post linked below.

Sex, Drugs and Data: Billboard Bananas

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Continue Reading…


Read More

Using AI to create new jobs

Tim O’Reilly delves into past technological transitions, speculates on the possibilities of AI, and looks at what's keeping us from making the right choices to govern our creations.

Continue reading Using AI to create new jobs.

Continue Reading…


Read More

Peeking into the black box: Lessons from the front lines of machine-learning product launches

Grace Huang shares lessons learned from running and interpreting machine-learning experiments.

Continue reading Peeking into the black box: Lessons from the front lines of machine-learning product launches.

Continue Reading…


Read More

Lessons from piloting the London Office of Data Analytics

Eddie Copeland explores how the London Office of Data Analytics overcame the barriers to joining, analyzing, and acting upon public sector data at city scale.

Continue reading Lessons from piloting the London Office of Data Analytics.

Continue Reading…


Read More

Is finance ready for AI?

Aida Mehonic explores the role artificial intelligence might play in the financial world.

Continue reading Is finance ready for AI?.

Continue Reading…


Read More

Accelerate analytics and AI innovations with Intel

Ziya Ma outlines the challenges for applying machine learning and deep learning at scale and shares solutions that Intel has enabled for customers and partners.

Continue reading Accelerate analytics and AI innovations with Intel.

Continue Reading…


Read More

San Francisco housing debate: A yimby responds

Phil Price recently wrote two much-argued-about posts here and here on the yimby (“yes in my backyard”) movement in San Francisco. One of the people disagreeing with him is Sonja Trauss, who writes:

Phil makes a pretty basic mistake of reasoning in his post, namely, that the high income residents of the proposed new housing move to SF if and only if the new housing is built.

The reality is that 84% of the residents of new buildings already live in SF (so they’re already spending money in SF etc.) That’s from the SF Controller’s office. Not to mention, no one moves only because they see that a new building is finished. They move for work, or family.

I’m one of the founders of the YIMBY movement in the Bay, so it’s my job to write refutations of essays like Phil’s.

Here’s the response from Trauss:

Last week Phil wrote a post [] and a follow up []. He says he’ll “post again when I can figure out whether or not I really am completely wrong.”

The question is, wrong about what?

A few years ago I had a roommate who brought home a giant rabbit to live with us and the cat, Marmalade. We put the two creatures together wondering what would happen. The rabbit was as big as the cat, would the cat nonetheless chase the rabbit? Answer: No. Marmalade was disgusted by the rabbit. The cat crept around the rabbit at a safe distance, horrified but unable to look away from what was apparently the ugliest cat Marmalade had ever seen. The rabbit, on the other hand, did not recognize Marmalade as a living creature at all. The rabbit made no distinction between Marmalade, a throw pillow, or a balled up sweatshirt, which alarmed Marmalade highly as long as they lived together.

Marmalade was missing critical information that he would have needed to understand what he was seeing. There was nothing wrong with the rabbit at all, it was a normal giant rabbit. Marmalade thought there was something inexplicable about the situation because he started with the wrong assumption.

Likewise, Phil’s question about YIMBYs was sincere. After meandering through a mental model of how population & rents interact, he got to his point,

“Given all of the above, … Why are these people promoting policies that are so bad for them?”

The answer to “why are people promoting policies that are bad for them?” is always the same thing – the asker has wrong or missing information about the constituency’s (1) material conditions, or (2) values or goals, or else he doesn’t understand the policy proposal or how it works.

Another clue that a question is ill-formed is if it leads an otherwise intelligent person to a self-contradictory or incoherent answer:

“…So this is my new theory: the YIMBY and BARF people know that building more market-rate housing in San Francisco will make median rents go up, and that this will be bad for them, but they want to do it anyway because it’s a thumb in the eye of the ‘already-haves’, those smug people who already have a place they like and are trying to slam the door behind them.”

So the answer to Phil’s question lies with him, not with us – Which of the assumptions Phil made were wrong?

Basically all of them.

The first bad assumption was about the goal of the YIMBY policy program. Phil got the idea somewhere that our policy goal has a singular focus on median rents, in San Francisco. It’s never been true that our focus was only on San Francisco. One of the interesting things about our movement is that we encourage people to be politically active across city and county boundaries. When I started SFBARF in 2014, I lived in West Oakland. My activities were targeted at San Francisco, and I started the club specifically to protect low rents in Oakland.

It’s also not true that we would measure our success by only or primarily looking at the median rent. Ours is an anti-displacement, pro-short commute & pro-integration effort, which means the kinds of metrics we are interested in are commute times, population within X commute distance of job centers, rate of evictions, vacancy rate, measures of economic & racial integration in Bay Area cities, number of people leaving the Bay Area per year, in addition to public opinion data like do people feel like housing is hard to find? Do people like their current housing situation? If we do look at statistics about rent, the median is a lot less interesting than the full distribution.

Prices are a side-effect, a symptom, a signal of an underlying situation. In California generally and the Bay in particular the underlying housing situation is one of real shortage. What is the unmet demand for housing in the Bay? Do 10 million people want to live here? 20 million? Who knows, we can only be sure at this point that it’s more than the 7 million that do live here.

Because the way we distribute housing currently is (mostly) via the price mechanism, the way most people experience their displacement is by being priced out. But distributing housing stock by some other method wouldn’t solve the displacement problem.

Suppose the total demand for housing in the Bay is 20 million people. Currently we have housing for about 7 million people. If we distributed the limited housing we have by lottery, 13 million people would experience displacement as losing the lottery. If we distributed it via political favoritism, people’s experience of displacement would be finding out their application for housing wasn’t granted. Either way it doesn’t matter. If 20 million people want housing, and you only have housing for 7 million of them, then 13 million people are out of luck, no matter how you distribute it.

[If you want to get a really intense feeling for why prices are a distraction, I recommend learning to prove both of the welfare theorems. [] Spend some hours imagining the tangent lines or planes attaching themselves to the convex set, then imagine the convex set ballooning out to meet the prices. Then read Red Plenty.]

Phil believes in induced demand, that feeding the need for housing will only create more need for housing. I don’t think this is true but I also don’t think it matters. The reason I don’t think it’s true is that if regional economies worked like that, Detroit, St. Louis, Baltimore, Philadelphia would not exist in their current state. How can induced demand folks explain either temporary regional slowdowns, like the dot com busts, or the permanent obsolescence of the rust belt? The reason I don’t think it matters is that continuous growth in the Bay Area doesn’t sound like a bad thing to me. The Bay would be better, more livable, a better value with 20 million or more people in it.

The reason the YIMBY movement exists, in part, is that the previous generation of rent advocates were singularly obsessed with prices, like Phil is. They thought, ‘if only we could fix prices, our problem would go away,’ so they focused on price fixing policies like rent control & were indifferent or actively hostile to shortage ending practices like building housing. In fact, hostility to building new housing, and insistence that prices alone should be the focus of activism are philosophies that fit nicely together. The result, after almost 40 years, is that prices are worse than ever, but population level hasn’t changed very much.

So the first answer to “What’s wrong with Phil?” is that Phil thinks prices are the primary focus of our activism, when in fact allocations are what we are interested in.

The second and third problems with Phil are assumptions he made in thinking through his toy model of the SF housing market. This chain of reasoning purports to show that building more housing in SF increases median rent in SF (but decreases it in other parts of the Bay). This is the reasoning that led Phil to think YIMBYs are pursuing policies with outcomes that are counter to our own interests. Of course, as explained above, Phil fully identified “our interests” with “SF median rent” which was already wrong.

The biggest problem with Phil’s model is “…for better or worse I am assuming, essentially axiomatically, that building expensive units draws in additional high-income renters and buyers [to San Francisco] …” who would not have lived there otherwise. The second problem is he imagines that all of their disposable income would be new to the City. Whether these assumptions are actually true or not has a huge impact on the conclusion Phil reaches.

It happens all the time that people make axiomatic assumptions about things they think they can’t know, in order to eventually reason to the outcome that they want. This is a perfectly fine activity, but reasoning from a guess about a state of the world to the outcome one already believes isn’t social science. It’s rationalization.

Whether N new housing units actually results in N new high income people spending new disposable income in San Francisco is an interesting research question. I would love for Phil to pursue it. I would recommend talking to demographers, economic planners and econometricians.

If pursuing the data seems too labor intensive, then Phil can try to assume the opposite of his axiom about the necessary demographics of population growth, and see if he can still reason to his desired conclusion. If he can, that result should be published at Berkeley Daily Planet, which is a blog for anti-housing ideology.

Alternately, Phil could abandon this particular line of investigation altogether because it’s not relevant to the initial question. We’ve already established that prices aren’t the fundamental focus of YIMBY activism, and Marmalade wasn’t looking at a grotesque cat, but an ordinary rabbit.

I hope you feel your mystery is solved Phil. Next time you have a question you should ask your local YIMBYs at

In my own neighborhood in NY, I’m a yimby: a few months ago someone put up a flyer in my building warning us that some developer was planning to build a high-rise nearby, and we should all go to some community board meeting and protest. My reaction was annoyance—I almost wanted to go to the community board meeting just to say that I have no problems with someone building another high-rise in my neighborhood. That said, much depends on the details: new housing is good, but I do sometimes read about sweetheart tax deals where the developer builds and then the taxpayer is stuck with huge bills for infrastructure. In any particular example, the details have to matter, and this isn’t a debate I’m planning to jump into.

P.S. The cat picture at the top of this post comes from Steven Johnson. It is not the cat discussed by Trauss. But I’m guessing it’s the same color!

The post San Francisco housing debate: A yimby responds appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Will Data Science Eliminate Data Science?

There are elements of what we do which are AI complete. Eventually, Artificial General Intelligence will eliminate the data scientist, but it’s not around the corner.

Continue Reading…


Read More

Thanks for reading!