# My Data Science Blogs

## November 21, 2017

### Magister Dixit

“One of the keys to success in big data analytics projects is building strong ties between data analysts and business units.” Ed Burns ( August 2014 )

### How to Build Your Own Blockchain Part 4.2 — Ethereum Proof of Work Difficulty Explained

We’re back at it in the Proof of Work difficulty spectrum, this time going through how Ethereum’s difficulty changes over time. This is part 4.2 of the part 4 series, where part 4.1 was about Bitcoin’s PoW difficulty, and the following 4.3 will be about jbc’s PoW difficulty.

TL;DR

To calculate the difficulty for the next Ethereum block, you calculate the time it took to mine the previous block. If that time difference was greater than the goal time, the difficulty goes down so the next block should be mined more quickly. If it was less than the goal time, the difficulty goes up so the next block should take longer to mine.

There are three parts to determining the new difficulty: offset, which determines the standard amount of change from one difficulty to the next; sign, which determines if the difficulty should go up or down; and bomb, which adds on extra difficulty depending on the block’s number.

These numbers are calculated slightly differently for the different forks, Frontier, Homestead, and Metropolis, but the overall formula for calculating the next difficulty is

target = parent.difficulty + (offset * sign) + bomb
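As a toy illustration of this formula (the numbers here are made up for illustration, not real chain values):

```python
# Toy numbers, purely illustrative -- not real chain values
parent_difficulty = 2048000

offset = parent_difficulty // 2048  # the standard unit of change (BLOCK_DIFF_FACTOR)
sign = -1                           # parent block took longer than the goal time
bomb = 2 ** 5                       # small extra term driven by block number

target = parent_difficulty + (offset * sign) + bomb
print(target)  # 2047032: difficulty eased by one offset, plus a tiny bomb
```

Each fork computes offset, sign, and bomb slightly differently, but they all plug into this same line.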

### Pre notes

For the following code examples, this will be the class of the block.

class Block():
    def __init__(self, number, timestamp, difficulty, uncles=None):
        self.number = number
        self.timestamp = timestamp
        self.difficulty = difficulty
        self.uncles = uncles

The data I use to show the code is correct was grabbed from Etherscan.

The code doesn’t include edge cases in calculating the difficulty either. For the most part, edge cases aren’t involved in calculating the difficulty target. By not including them, it makes the following code much easier to understand. I’ll talk about the edge cases at the end to make sure they’re not completely ignored.

Finally, for the beginning fork, I talk about what the different variables and functions do. For the later forks, Homestead and Metropolis, I only talk about the changes.

Also, here’s all the code I threw in a Github repo! If you don’t want to write all the code yourself, you should at least clone it locally and run it yourself. 1000% feel free to make pull requests if you want to add more parts to it or think I formatted the code wrong.

### The Beginning — Frontier

In the beginning, there was Frontier. I’ll jump right in by giving a bullet point section going over the config variables.

• DIFF_ADJUSTMENT_CUTOFF — The number of seconds Ethereum is aiming to mine each block in.
• BLOCK_DIFF_FACTOR — Helps calculate how much the current difficulty can change.
• EXPDIFF_PERIOD — Denotes after how many blocks the bomb amount is updated.
• EXPDIFF_FREE_PERIODS — How many EXPDIFF_PERIODS are ignored before including the bomb in difficulty calculation.

And now descriptions of the functions.

calc_frontier_sign — Calculates whether the next difficulty value should go up or down. In the case of Frontier, if the previous block was mined quicker than the 13 second DIFF_ADJUSTMENT_CUTOFF, then the sign will be 1, meaning we want the difficulty to be higher, with the goal that the next block be mined more slowly. If the previous block took longer than 13 seconds, then the sign will be -1 and the next difficulty will be lower. The overall point of this is that the goal for block mining time is ~12.5 seconds. Take a look at Vitalik Buterin’s post where he talks about choosing 12 seconds as the minimum average block mining time.

calc_frontier_offset — Offset is the value that determines how much or how little the difficulty will change. In Frontier, this is the parent’s difficulty integer divided by the BLOCK_DIFF_FACTOR. Since it’s divided by 2048, which is 2^11, offset can also be calculated by parent_difficulty >> 11 if you want to look at it in terms of shifting bits. Since 1.0 / 2048 == 0.00048828125, it means that the offset will only change the difficulty by 0.0488% per change. Not that much, which is good because we don’t want the difficulty to change a ton with each different mined block. But if the time becomes consistently under the 13 second cutoff, the difficulty will slowly grow to compensate.
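A quick check of that equivalence in plain Python, using a difficulty value quoted later in the post:

```python
# The two forms of the offset calculation agree: floor-divide by 2048 == shift right by 11
parent_difficulty = 34982465665323
offset = parent_difficulty // 2048
assert offset == parent_difficulty >> 11
print(offset / float(parent_difficulty) * 100)  # ~0.0488% change per step
```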

calc_frontier_bomb — The bomb. The bomb adds an amount of difficulty that doubles after every EXPDIFF_PERIOD blocks are mined. In the Frontier world, this value is incredibly small. For example, at block 1500000, the bomb is 2 ** ((1500000 // 100000) - 2) == 2 ** 13 == 8192. The difficulty of the block is 34982465665323. That’s a huge difference, meaning the bomb had essentially zero effect. This will change later.

calc_frontier_difficulty — Once you have the values for sign, offset, and bomb, the new difficulty is (parent.difficulty + offset * sign) + bomb. Let’s say that the time it took to mine the parent’s block was 15 seconds. In this case, the difficulty will go down by offset * -1, and then add the small amount of the bomb at the end. If the time to mine the parent’s block was 8 seconds, the difficulty will increase by offset + bomb.

In order to understand it fully, go through the code that follows and look at the calculations.

config = dict(
    DIFF_ADJUSTMENT_CUTOFF=13,
    BLOCK_DIFF_FACTOR=2048,
    EXPDIFF_PERIOD=100000,
    EXPDIFF_FREE_PERIODS=2,
)

def calc_frontier_offset(parent_difficulty):
    offset = parent_difficulty // config['BLOCK_DIFF_FACTOR']
    return offset

def calc_frontier_sign(parent_timestamp, child_timestamp):
    time_diff = child_timestamp - parent_timestamp
    if time_diff < config['DIFF_ADJUSTMENT_CUTOFF']:
        sign = 1
    else:
        sign = -1
    return sign

def calc_frontier_bomb(parent_number):
    period_count = (parent_number + 1) // config['EXPDIFF_PERIOD']
    period_count -= config['EXPDIFF_FREE_PERIODS']
    bomb = 2**(period_count)
    return bomb

def calc_frontier_difficulty(parent, child_timestamp):
    offset = calc_frontier_offset(parent.difficulty)
    sign = calc_frontier_sign(parent.timestamp, child_timestamp)
    bomb = calc_frontier_bomb(parent.number)
    target = (parent.difficulty + offset * sign) + bomb
    return offset, sign, bomb, target

### The Middle — Homestead

The Homestead fork, which took place at block number 1150000 in March of 2016, made a couple of big changes to calculating the sign.

calc_homestead_sign — Instead of having a single number, DIFF_ADJUSTMENT_CUTOFF (which is different for the Homestead fork), that makes the difficulty go up or down, Homestead takes a slightly different approach. If you look at the code, you’ll see that there are groupings of the sign rather than just 1 or -1. If the time_diff between parent and child is in [0,9], the sign will be 1, meaning the difficulty needs to increase. If the time_diff is in [10,19], the sign will be 0, meaning the difficulty should stay as it is. If the time_diff is in [20,29], the sign becomes -1. If time_diff is in [30,39], the sign is -2, and so on.

This does two things. First, Homestead doesn’t want to treat a block that took 50 seconds to mine the same as a block that took 15 seconds to mine. If a block took 50 seconds, then the next difficulty really does need to be easier, and by more than a single offset. Second, instead of DIFF_ADJUSTMENT_CUTOFF representing the goal time, this switches the aim point to the midpoint of the range of time_diffs with a sign of 0: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], meaning ~14.5 seconds, not including the bomb.
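The bucket behavior can be sanity-checked with a couple of lines of Python (a sketch of the sign rule only, ignoring the -99 floor covered in the edge-cases section):

```python
# Quick numeric check of the Homestead buckets: sign = 1 - time_diff // 10
def homestead_sign(time_diff):
    # ignores the -99 floor covered in the edge-cases section
    return 1 - time_diff // 10

for t in (5, 14, 25, 35, 50):
    print(t, homestead_sign(t))
# 5 -> 1, 14 -> 0, 25 -> -1, 35 -> -2, 50 -> -4
```

Note that a 50-second block pulls the difficulty down by four offsets, not just one.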

config = dict(
    HOMESTEAD_DIFF_ADJUSTMENT_CUTOFF=10,
    BLOCK_DIFF_FACTOR=2048,
    EXPDIFF_PERIOD=100000,
    EXPDIFF_FREE_PERIODS=2,
)

def calc_homestead_offset(parent_difficulty):
    offset = parent_difficulty // config['BLOCK_DIFF_FACTOR']
    return offset

def calc_homestead_sign(parent_timestamp, child_timestamp):
    time_diff = child_timestamp - parent_timestamp
    sign = 1 - (time_diff // config['HOMESTEAD_DIFF_ADJUSTMENT_CUTOFF'])
    return sign

def calc_homestead_bomb(parent_number):
    period_count = (parent_number + 1) // config['EXPDIFF_PERIOD']
    period_count -= config['EXPDIFF_FREE_PERIODS']
    bomb = 2**(period_count)
    return bomb

def calc_homestead_difficulty(parent, child_timestamp):
    offset = calc_homestead_offset(parent.difficulty)
    sign = calc_homestead_sign(parent.timestamp, child_timestamp)
    bomb = calc_homestead_bomb(parent.number)
    target = (parent.difficulty + offset * sign) + bomb
    return offset, sign, bomb, target

### The Current — Metropolis

There are a couple of differences from Homestead. The first is that DIFF_ADJUSTMENT_CUTOFF is now 9, which means that, without uncles, the target block time is the midpoint of [9, 10, 11, 12, 13, 14, 15, 16, 17], or ~13 seconds.

The second takes into account whether or not there are uncles included in the block. An uncle in Ethereum language refers to the case where two nodes mine a child block from the same parent at nearly the same time. So if you’re mining a child block from a parent that has a “sibling”, you’re able to pick one of the siblings to mine from, but also record that you noticed the other block. In that case, Ethereum wants to make the new difficulty larger, by another offset, to make it less likely that two natural forks grow much longer.

Now the biggest difference is delaying the impact of the bomb. Check out the code for calc_metropolis_bomb, where not only do we subtract the value of EXPDIFF_FREE_PERIODS, but also METROPOLIS_DELAY_PERIODS, which is 30 time periods. A huge number. Instead of talking about the bomb here, I’ll devote a section to it after this.
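As a quick numeric check of the ~13 second target mentioned above (a sketch of the sign rule only, not the full implementation):

```python
# Sketch of the Metropolis sign rule (ignoring the -99 floor)
def metropolis_sign(time_diff, has_uncles=False):
    return (2 if has_uncles else 1) - time_diff // 9

# Without uncles, the sign stays 0 for time_diff in [9, 17]; the midpoint is 13 seconds
band = [t for t in range(60) if metropolis_sign(t) == 0]
print(band)
print(sum(band) / len(band))  # 13.0
```

With uncles, the whole band shifts up by one offset, nudging the target time higher.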

config = dict(
    METROPOLIS_DIFF_ADJUSTMENT_CUTOFF=9,
    BLOCK_DIFF_FACTOR=2048,
    EXPDIFF_PERIOD=100000,
    EXPDIFF_FREE_PERIODS=2,
    METROPOLIS_DELAY_PERIODS=30,
)

def calc_metropolis_offset(parent_difficulty):
    offset = parent_difficulty // config['BLOCK_DIFF_FACTOR']
    return offset

def calc_metropolis_sign(parent, child_timestamp):
    if parent.uncles:
        uncles = 2
    else:
        uncles = 1
    time_diff = child_timestamp - parent.timestamp
    sign = uncles - (time_diff // config['METROPOLIS_DIFF_ADJUSTMENT_CUTOFF'])
    return sign

def calc_metropolis_bomb(parent_number):
    period_count = (parent_number + 1) // config['EXPDIFF_PERIOD']
    period_count -= config['METROPOLIS_DELAY_PERIODS'] # chop off 30, meaning go back 3M blocks in time
    period_count -= config['EXPDIFF_FREE_PERIODS'] # chop off 2 more for good measure
    bomb = 2**(period_count)
    return bomb

def calc_metropolis_difficulty(parent, child_timestamp):
    offset = calc_metropolis_offset(parent.difficulty)
    sign = calc_metropolis_sign(parent, child_timestamp)
    bomb = calc_metropolis_bomb(parent.number)
    target = (parent.difficulty + offset * sign) + bomb
    return offset, sign, bomb, target

### Going Deeper with the Bombs

If you look at one of the difficulty charts online, you’ll see a recent run of huge jumps every 100000 blocks, and then a giant drop-off about a month ago. Screenshot time for those not wanting to click the link:

Each horizontal line indicates a 3 second change in time it takes to mine a block.

What’s the point of having a bomb like this? A big goal of Ethereum is to move away from Proof of Work, which requires energy and time to create and validate a new block, to Proof of Stake, which is described in the Ethereum Wiki. In order to force nodes to move to the Proof of Stake implementation, a “bomb” that doubles its impact on the difficulty every 100000 blocks would eventually make it take so long to mine a new block that nodes running on the old fork won’t be able to keep up anymore. This is where the term “Ice Age” comes from; the blockchain would be frozen in time.

This is a good way to manage future changes, but we also ran into the issue that the new Proof of Stake implementation, called Casper, wasn’t ready before the giant difficulty spikes seen in that graph. That’s where the Metropolis fork came into play: it eliminated the effect the bomb has on the difficulty for now, though in a few years it will come back into play, by which point the switch to Casper will (hopefully) be ready to roll. That being said, predicting when features like this are ready for production and adoption is difficult, so if Casper isn’t ready quickly enough, another fork can move the bomb back in time again.

Bomb Math

Besides the description of what the bombs do to the difficulty as seen on those graphs, it’s worth it to show the math behind why the difficulty is rising to those levels, and what that means for how long it takes to mine a block.

Going back, remember that offset is the unit of difficulty change, and that one offset’s worth of difficulty corresponds to roughly DIFF_ADJUSTMENT_CUTOFF seconds of block mining time.

For example, if the previous time difference between blocks was in [18,26], we say that decreasing the difficulty by one offset should move the time to mine a block back to the [9,17] range, or DIFF_ADJUSTMENT_CUTOFF seconds lower.

So if we increase the difficulty by offset, we expect the time it takes to mine a block to increase by DIFF_ADJUSTMENT_CUTOFF. If we increase the difficulty by half of an offset, the mining time should increase by about DIFF_ADJUSTMENT_CUTOFF / 2.

So if we want to see how much a bomb will influence the change in mining time, we want the ratio of bomb and offset.

(bomb / offset) * DIFF_ADJUSTMENT_CUTOFF

Here’s the code that calculates the expected mining time and compares it to the actual average, where the actual average mining time was gathered from this chart by hovering the cursor around the time these blocks were mined.

def calc_mining_time(block_number, difficulty, actual_average_mining_time, calc_bomb, calc_offset):
    bomb = calc_bomb(block_number)
    offset = calc_offset(difficulty)
    bomb_offset_ratio = bomb / float(offset)
    goal_mining_time = 14.5 # target time with no bomb, in seconds
    # each full offset of extra difficulty adds ~10 seconds (DIFF_ADJUSTMENT_CUTOFF)
    calculated_average_mining_time = goal_mining_time + bomb_offset_ratio * 10
    print "Bomb: %s" % bomb
    print "Offset: %s" % offset
    print "Bomb Offset Ratio: %s" % bomb_offset_ratio
    print "Actual Avg Mining Time: %s" % actual_average_mining_time
    print "Calculated Mining Time: %s" % calculated_average_mining_time
    print

block_number = 4150000
difficulty = 1711391947807191
actual_average_mining_time = 0.35 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time, calc_homestead_bomb, calc_homestead_offset)

block_number = 4250000
difficulty = 2297313428231280
actual_average_mining_time = 0.4 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time, calc_homestead_bomb, calc_homestead_offset)

block_number = 4350000
difficulty = 2885956744389112
actual_average_mining_time = 0.5 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time, calc_homestead_bomb, calc_homestead_offset)

block_number = 4546050
difficulty = 1436507601685486
actual_average_mining_time = 0.23 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time, calc_metropolis_bomb, calc_metropolis_offset)

When run:

Bomb: 549755813888
Offset: 835640599515
Bomb Offset Ratio: 0.657885476372
Actual Avg Mining Time: 21.0
Calculated Mining Time: 21.0788547637

Bomb: 1099511627776
Offset: 1121735072378
Bomb Offset Ratio: 0.980188330427
Actual Avg Mining Time: 24.0
Calculated Mining Time: 24.3018833043

Bomb: 2199023255552
Offset: 1409158566596
Bomb Offset Ratio: 1.56052222062
Actual Avg Mining Time: 30.0
Calculated Mining Time: 30.1052222062

Bomb: 8192
Offset: 701419727385
Bomb Offset Ratio: 1.16791696614e-08
Actual Avg Mining Time: 13.8
Calculated Mining Time: 14.5000001168

Look how large the bomb offset ratio is for the later Homestead blocks, and how tiny it becomes after the Metropolis block! With the loss of the bomb’s impact, the time difference drops a crap ton.

As of now, the bomb doesn’t affect the difficulty by pretty much any amount. It’ll be back though. If Ethereum continues to mine blocks at a rate of ~14.5 seconds, it’ll mine 100,000 blocks in around 17 days. Multiply that time by 30, which accounts for the METROPOLIS_DELAY_PERIODS, and we’ll be back in a state where the bomb makes a difference in around a year and a half, no matter how much the hash power increases.
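The back-of-envelope arithmetic behind those numbers:

```python
# Rough timeline for the bomb's return (assumes a steady ~14.5 second block time)
seconds_per_block = 14.5
blocks_per_period = 100000  # EXPDIFF_PERIOD

days_per_period = seconds_per_block * blocks_per_period / (60 * 60 * 24)
print(days_per_period)  # ~16.8 days to mine 100,000 blocks

delay_days = days_per_period * 30  # 30 = METROPOLIS_DELAY_PERIODS
print(delay_days / 365)  # ~1.4 years before the delayed bomb catches up
```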

Bomb Equilibrium

The last part of bomb dealing is to quickly explain why a slight increase in the bomb raises the overall difficulty to a value far larger than the bomb value itself.

The way to think about this is to take the calculated expected mining time and, from there, calculate the difficulty required for the current hash rate to mine blocks in that time period. When the bomb amount changes, the difficulty will continue to rise or fall until it reaches the difficulty for that time, and then hover at that equilibrium. If you look at the blocks right after a bomb change, you’ll see it takes more than a few blocks to get to the new level.

### The Edge Cases

There are a couple edge cases here not mentioned in the code above.

• MIN_DIFF=131072, which is the minimum difficulty a block can have. Considering the current difficulty is incredibly larger than that, it’s pointless to think about now, but when Ethereum was generated it was probably useful to have a minimum difficulty. It can also be referred to as 2 ** 17.
• Smallest sign = -99. Imagine a situation where, for some reason, the sign for calculating the next block would be -1000. The next block’s difficulty would drop by an incredible amount, and it would take something like 1000 blocks to get back to the desired ~14 second mining time, since the largest difficulty-increasing sign is 1 (or 2 if the block includes an uncle). The odds are very, very unlikely that anything lower than -99 would happen, but it still needs to be covered.
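A minimal sketch of how those two clamps could be layered on top of the earlier functions (the names here are illustrative, not from the Ethereum source):

```python
# Sketch of the two edge-case clamps (illustrative names, not the actual client code)
MIN_DIFF = 131072  # 2 ** 17, the minimum difficulty a block can have
MIN_SIGN = -99     # floor on how negative the sign can go

def clamped_sign(raw_sign):
    return max(raw_sign, MIN_SIGN)

def clamped_difficulty(target):
    return max(target, MIN_DIFF)

print(clamped_sign(-1000))      # -99
print(clamped_difficulty(1000)) # 131072
```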

### Final Questions

Like before, here’s a list of questions I had when writing this that didn’t get put in the main post.

I’m sure this question will come up a lot when only talking about Proof of Work difficulty: it’s great that we know how the difficulty is calculated, but how to validate that a block’s header is valid is beyond this post.

I’m not going to explain it here, but probably in a future post when I decide how jbc’s header should be calculated after I implement transactions.

I will note that Bitcoin’s header is incredibly simple, where the values are smashed together (making sure the bits are combined with the right endianness). Ethereum’s is much more complicated, dealing with transactions using a hash method rather than a Merkle tree.

How do you go through and figure this out?

There are tons of posts out there like this one, but frankly, most are very high level, with descriptions and numbers but not much code showing how those numbers are calculated. My goal is to do all of those.

That means that, to understand them completely enough to talk about them all, I go through and look at the code bases. Here’s the calc_difficulty function from the Python repo, calcDifficulty from C++, and calcDifficulty from Go. Another bigger point that I keep making is that just looking at the code doesn’t do much at all; what you need to do is implement similar code yourself.

All of those time estimations include a ~ before a number. Why is that necessary?

That’s a really good question. The main answer is uncertainty in how many nodes are trying to mine blocks, as well as randomness. If you look at the difficulty chart and zoom in to a timespan of a couple of days, you’ll see how random it gets. It’s the same with the average mining time for a specific day. They fluctuate, so we can’t say exactly what time to expect.

Proof of Work is all I hear people talk about, but it isn’t universal, right?

Correct! I mentioned it above, and I’m betting that if you’re reading this much about Ethereum and you’re all the way at the bottom of this post, you’ve heard the term Proof of Stake. This is the newer option, which no major cryptocurrency blockchain has implemented yet. There are other types of block validation out there too. The big upcoming enterprise versions of blockchains probably won’t be completely based on Proof of Work at all.

The biggest indicator of what type of validation blockchains will use depends on the use cases. Cryptocurrencies need something completely un-fraudable. Same with a blockchain that might store information on who owns what piece of land. We don’t want people to be able to change ownership. But a blockchain that’s used to store something less insanely valuable doesn’t need to waste all that energy. I know some of these other Proofs of validity will come into play in the next few years.

You wrote something confusing / didn’t explain it well enough. What should I do?

Get in freaking contact. Seriously, there are tons of posts out there that talk about results but don’t say how they were calculated, assuming that everyone is as smart as the author and knows what’s going on. I’m trying to be the opposite, where I don’t say something without fully explaining it. So if there’s something confusing or wrong, contact me and I’ll make sure to fix the issue.

Can I DM you on Twitter?

@jack_schultz

Do you like cats or dogs better?

Cats.

### How to Build a Geographic Dashboard with Real-Time Data

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In this post, I show how to build an interactive geographic dashboard using Displayr, Plotly and R. It is particularly fascinating in that it tracks the real-time position of military aircraft. To do so, I am first going to bring in some data from two separate sources (regions based on the size of military air-force, and then the real-time tracking of aircraft positions). The dashboard displays the dynamic data in two ways: region shading (to indicate the size of the military force per country) and point markers (for the aircraft positions). I then build a map to neatly and attractively display all this data.

## Reading tabular data from the web

I am going to shade each country of the map according to the size of its military air-force. To do this I need a list of aircraft per country. Fortunately, the ever useful Wikipedia has just what I need (here). The following code reads in the data and cleans it to produce a neat table.

library(httr)
library(XML)

# Extract table via URL
url &amp;amp;lt;- "https://en.wikipedia.org/wiki/List_of_countries_by_level_of_military_equipment#List"
r &amp;amp;lt;- GET(url)
airforces &amp;amp;lt;- readHTMLTable(doc = content(r, "text"))[[2]]

# Clean required columns
airforces &amp;amp;lt;- airforces[-1, c("Country[note 1]", "Military aircraft[note 3]")]
colnames(airforces) &amp;amp;lt;- c("Country", "MilitaryAircraft")
remove.bracket.content &amp;amp;lt;- function(s) {
return(gsub("\$.+\$", "", s))
}
airforces &amp;amp;lt;- data.frame(apply(airforces, 2, remove.bracket.content))
airforces$MilitaryAircraft &amp;amp;lt;- as.numeric(gsub(",", "", airforces$MilitaryAircraft))
airforces


## Pooling real-time data from across the globe

Compared to the above, the second source of data is more dynamic. I am using ADS-B, which pools real-time flight tracking information from across the globe. Not all military operations are top secret. Apparently, some aircraft operated by the armed forces broadcast their positions publicly.

To collate this information, I construct a URL to retrieve a JSON object with information about military aircraft (JSON is a flexible text format used for data exchange). Then I parse the JSON to create a data.frame.

library(jsonlite)

url <- paste0(url, "fMilQ=TRUE") # appends the military filter to the ADS-B endpoint URL

positions <- fromJSON(url)$acList
if (length(positions) != 0) {
    positions <- positions[positions$Type != "TEST", ]
    positions <- positions[!is.na(positions$Lat), ]
}
positions

## Shading countries of a map

The following code produces a plotly map of the world. The countries are shaded according to the size of the air-force, with the scale shown on a legend. In plotly terminology, each layer of the map is known as a trace.

library(plotly)
library(flipFormat)

# specify map area and projection
g <- list(scope = "world",
          showframe = FALSE,
          showcoastlines = TRUE,
          projection = list(type = 'mercator'),
          lonaxis = list(range = c(-140, 179)),
          lataxis = list(range = c(-55, 70)),
          resolution = 50) # higher resolution

# shade countries by airforce size
p <- plot_geo(airforces) %>%
    add_trace(data = airforces, name = "Airforce",
              z = ~MilitaryAircraft, color = ~MilitaryAircraft,
              colors = 'Blues', locations = ~Country,
              marker = list(line = list(color = toRGB("grey"), width = 0.5)),
              showscale = TRUE,
              locationmode = "country names",
              colorbar = list(title = 'Airforce', separatethousands = TRUE)) %>%
    config(displayModeBar = F) %>%
    layout(geo = g,
           margin = list(l = 0, r = 0, t = 0, b = 0, pad = 0),
           paper_bgcolor = 'transparent')

## Adding markers for the aircraft

Finally, I add markers showing the airplane locations as another trace. I use a different color for those with speed less than 200 knots and altitude less than 2000 feet. The hover text contains more detail about the aircraft.

# create a vector with the status of each aeroplane
aircolors = rep("airborne", nrow(positions))
aircolors[positions$Spd < 200 & positions$Alt < 2000] <- "ground/approach"
hovertext = paste0("Operator:", positions$Op,
                   "\nModel:", positions$Mdl,
                   "\nAltitude(ft):", sapply(positions$Alt, FormatAsReal))
hoverinfo = rep("all", nrow(positions))

p = add_trace(p, data = positions, x = positions$Long, y = positions$Lat,
              color = aircolors, hovertext = hovertext, showlegend = FALSE)


The final result is shown below. Since broadcasting of position is purely voluntary, I suspect that the Top Gun flights are not shown!

While the map above shows the required information, it can easily be made more useful and appealing. Displayr allows me to add a control to switch between regions, text and image annotations, and a background. Here is a link to the finished dashboard, with a screenshot shown below.

Thinking about it again, if you can’t see any military aircraft at all on the radar, then you should really worry!

## Try it yourself

You can explore the dashboards used in this post within Displayr. All the code is embedded within the pages (look on the right-hand panel after you click on an object). Click here to see the copy of the working document (without background and finishing touches). Click here to open a copy of the finished document that includes the background, control box, and other finishing touches.


### If you did not already know

Sketched Subspace Clustering (Sketch-SC)
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data. Subspace clustering (SC) is a relatively recent method that is able to successfully classify nonlinearly separable data in a multitude of settings. In spite of their high clustering accuracy, SC methods incur prohibitively high computational complexity when processing large volumes of high-dimensional data. Inspired by random sketching approaches for dimensionality reduction, the present paper introduces a randomized scheme for SC, termed Sketch-SC, tailored for large volumes of high-dimensional data. Sketch-SC accelerates the computationally heavy parts of state-of-the-art SC approaches by compressing the data matrix across both dimensions using random projections, thus enabling fast and accurate large-scale SC. Performance analysis as well as extensive numerical tests on real data corroborate the potential of Sketch-SC and its competitive performance relative to state-of-the-art scalable SC approaches. …

Horseshoe Estimator
This paper proposes a new approach to sparse-signal detection called the horseshoe estimator. We show that the horseshoe is a close cousin of the lasso in that it arises from the same class of multivariate scale mixtures of normals, but that it is almost universally superior to the double-exponential prior at handling sparsity. A theoretical framework is proposed for understanding why the horseshoe is a better default “sparsity” estimator than those that arise from powered-exponential priors. Comprehensive numerical evidence is presented to show that the difference in performance can often be large. Most importantly, we show that the horseshoe estimator corresponds quite closely to the answers one would get if one pursued a full Bayesian model-averaging approach using a “two-groups” model: a point mass at zero for noise, and a continuous density for signals. Surprisingly, this correspondence holds both for the estimator itself and for the classification rule induced by a simple threshold applied to the estimator. We show how the resulting thresholded horseshoe can also be viewed as a novel Bayes multiple-testing procedure. …

subgraph2vec
In this paper, we present subgraph2vec, a novel approach for learning latent representations of rooted subgraphs from large graphs inspired by recent advancements in Deep Learning and Graph Kernels. These latent representations encode semantic substructure dependencies in a continuous vector space, which is easily exploited by statistical models for tasks such as graph classification, clustering, link prediction and community detection. subgraph2vec leverages on local information obtained from neighbourhoods of nodes to learn their latent representations in an unsupervised fashion. We demonstrate that subgraph vectors learnt by our approach could be used in conjunction with classifiers such as CNNs, SVMs and relational data clustering algorithms to achieve significantly superior accuracies. Also, we show that the subgraph vectors could be used for building a deep learning variant of Weisfeiler-Lehman graph kernel. Our experiments on several benchmark and large-scale real-world datasets reveal that subgraph2vec achieves significant improvements in accuracies over existing graph kernels on both supervised and unsupervised learning tasks. Specifically, on two realworld program analysis tasks, namely, code clone and malware detection, subgraph2vec outperforms state-of-the-art kernels by more than 17% and 4%, respectively. …

### What’s new on arXiv

In this work we propose a deep architecture for the classification of multivariate time series. By means of a recurrent and untrained reservoir we generate a vectorial representation that embeds the temporal relationships in the data. To overcome the limitations of the reservoir vanishing memory, we introduce a bidirectional reservoir, whose last state captures also the past dependencies in the input. We apply dimensionality reduction to the final reservoir states to obtain compressed fixed size representations of the time series. These are subsequently fed into a deep feedforward network, which is trained to perform the final classification. We test our architecture on benchmark datasets and on a real-world use-case of blood samples classification. Results show that our method performs better than a standard echo state network, and it can be trained much faster than a fully-trained recurrent network.
Principal component regression is a linear regression model with principal components as regressors. This type of modelling is particularly useful for prediction in settings with high-dimensional covariates. Surprisingly, the existing literature treating of Bayesian approaches is relatively sparse. In this paper, we aim at filling some gaps through the following practical contribution: we introduce a Bayesian approach with detailed guidelines for a straightforward implementation. The approach features two characteristics that we believe are important. First, it effectively involves the relevant principal components in the prediction process. This is achieved in two steps. The first one is model selection; the second one is to average out the predictions obtained from the selected models according to model averaging mechanisms, allowing to account for model uncertainty. The model posterior probabilities are required for model selection and model averaging. For this purpose, we include a procedure leading to an efficient reversible jump algorithm. The second characteristic of our approach is whole robustness, meaning that the impact of outliers on inference gradually vanishes as they approach plus or minus infinity. The conclusions obtained are consequently consistent with the majority of observations (the bulk of the data).
There is an increasing interest in exploiting mobile sensing technologies and machine learning techniques for mental health monitoring and intervention. Researchers have effectively used contextual information, such as mobility, communication and mobile phone usage patterns for quantifying individuals’ mood and wellbeing. In this paper, we investigate the effectiveness of neural network models for predicting users’ level of stress by using the location information collected by smartphones. We characterize the mobility patterns of individuals using the GPS metrics presented in the literature and employ these metrics as input to the network. We evaluate our approach on the open-source StudentLife dataset. Moreover, we discuss the challenges and trade-offs involved in building machine learning models for digital mental health and highlight potential future work in this direction.
The motion analysis of human skeletons is crucial for human action recognition, which is one of the most active topics in computer vision. In this paper, we propose a fully end-to-end action-attending graphic neural network (A$^2$GNN) for skeleton-based action recognition, in which each irregular skeleton is structured as an undirected attribute graph. To extract high-level semantic representations from skeletons, we perform local spectral graph filtering on the constructed attribute graphs, analogous to the standard image convolution operation. Considering that not all joints are informative for action analysis, we design an action-attending layer to detect salient action units (AUs) by adaptively weighting skeletal joints. Herein the filtering responses are parameterized into a weighting function independent of the order of input nodes. To further encode continuous motion variations, the deep features learnt from skeletal graphs are gathered along consecutive temporal slices and then fed into a recurrent gated network. Finally, the spectral graph filtering, action-attending and recurrent temporal encoding are integrated and jointly trained, for the sake of robust action recognition as well as the intelligibility of human actions. To evaluate our A$^2$GNN, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D dataset. The experimental results demonstrate that our network achieves state-of-the-art performance.
Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image. In this paper, a new model is proposed for generating explanations by utilizing localized grounding of constituent phrases in generated explanations to ensure image relevance. Specifically, we introduce a phrase-critic model to refine (re-score/re-rank) generated candidate explanations and employ a relative-attribute inspired ranking loss using ‘flipped’ phrases as negative examples for training. At test time, our phrase-critic model takes an image and a candidate explanation as input and outputs a score indicating how well the candidate explanation is grounded in the image.
Convergence of Generative Adversarial Networks (GANs) in a high-resolution setting, under a computational constraint of GPU memory capacity (12 GB to 24 GB), has been beset with difficulty due to the known lack of convergence-rate stability. In order to boost network convergence of DCGAN (Deep Convolutional Generative Adversarial Networks) and achieve good-looking high-resolution results, we propose a new layered network structure, HR-DCGAN, that incorporates current state-of-the-art techniques for this effect. A novel dataset, CZ Faces (CZF), containing human faces from different ethnic groups in a wide variety of illumination conditions and image resolutions, is introduced. We conduct extensive experiments on CelebA and CZF.
Very important breakthroughs in data-centric machine learning algorithms led to impressive performance in transactional point applications such as detecting anger in speech, alerts from a Face Recognition system, or EKG interpretation. Non-transactional applications, e.g. medical diagnosis beyond the EKG results, require AI algorithms that integrate deeper and broader knowledge in their problem-solving capabilities, e.g. integrating knowledge about anatomy and physiology of the heart with EKG results and additional patient findings. The same holds for military aerial interpretation, where knowledge about enemy doctrines on force composition and spread helps immensely in situation assessment beyond image recognition of individual objects. The Double Deep Learning approach advocates integrating data-centric machine self-learning techniques with machine-teaching techniques to leverage the power of both and overcome their corresponding limitations. To take AI to the next level, it is essential that we rebalance the roles of data and knowledge. Data is important, but knowledge, both deep and commonsense, is equally important. An initiative is proposed to build a Wikipedia for Smart Machines, meaning the target readers are not humans, but rather smart machines. Named ReKopedia, the goal is to develop methodologies, tools, and automatic algorithms to convert the knowledge that we all learn in schools, universities and during our professional life into Reusable Knowledge structures that smart machines can use in their inference algorithms. Ideally, ReKopedia would be an open-source shared knowledge repository similar to the well-known shared open-source software code repositories. Examples in the article are based on, or inspired by, real-life non-transactional AI systems I deployed over decades of an AI career that benefit hundreds of millions of people around the globe.
In this paper, we propose a novel deep learning architecture for multi-label zero-shot learning (ML-ZSL), which is able to predict multiple unseen class labels for each input instance. Inspired by the way humans utilize semantic knowledge between objects of interest, we propose a framework that incorporates knowledge graphs for describing the relationships between multiple labels. Our model learns an information propagation mechanism from the semantic label space, which can be applied to model the interdependencies between seen and unseen class labels. With such investigation of structured knowledge graphs for visual reasoning, we show that our model can be applied to solving multi-label classification and ML-ZSL tasks. Compared to state-of-the-art approaches, comparable or improved performances can be achieved by our method.
We propose a simple yet effective technique to simplify the training and the resulting model of neural networks. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. Based on the sparsified gradients, we further simplify the model by eliminating the rows or columns that are seldom updated, which reduces the computational cost of both training and decoding, and can potentially accelerate decoding in real-world applications. Surprisingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The model simplification results show that we can adaptively simplify the model, which can often be reduced by around 9x, without any loss of accuracy or even with improved accuracy.
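The top-$k$ sparsification step described above can be sketched in NumPy. This is an illustration of the idea, not the authors' implementation:

```python
import numpy as np

def sparsify_gradient(grad, k):
    # Keep only the k largest-magnitude entries of the gradient and
    # zero out the rest, so that only the corresponding rows/columns
    # of the weight matrix receive updates.
    flat = np.abs(grad).ravel()
    if k >= flat.size:
        return grad.copy()
    # k-th largest magnitude serves as the keep/drop threshold.
    threshold = np.partition(flat, -k)[-k]
    return np.where(np.abs(grad) >= threshold, grad, 0.0)
```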
The development of the mlpack C++ machine learning library (http://www.mlpack.org ) has required the design and implementation of a flexible, robust optimization system that is able to solve the types of arbitrary optimization problems that arise throughout machine learning. In this paper, we present the generic optimization framework that we have designed for mlpack. A key priority in the design was ease of implementation of both new optimizers and new objective functions to be optimized; therefore, implementing a new optimizer requires only one method and implementing a new objective function requires at most four functions. This leads to simple and intuitive code, which, for fast prototyping and experimentation, is of paramount importance. When compared to the optimization frameworks of other libraries, we find that mlpack's framework supports more types of objective functions, is able to make optimizations that other frameworks do not, and seamlessly supports user-defined objective functions and optimizers.
Local Binary Pattern (LBP) is a traditional descriptor for texture analysis that gained attention in the last decade. Being robust to several transformations, such as illumination changes, translation and scaling, LBPs achieved state-of-the-art results in several applications. However, LBPs are not able to capture high-level features from the image, merely encoding features with low abstraction levels. In this work, we propose Deep LBP, which borrows ideas from the deep learning community to improve LBP expressiveness. By using parametrized, data-driven LBP, we enable successive applications of the LBP operators with increasing abstraction levels. We validate the relevance of the proposed idea on several datasets from a wide range of applications. Deep LBP improved the performance of traditional and multiscale LBP in all cases.
Machine learning models are vulnerable to adversarial examples: minor, in many cases imperceptible, perturbations to classification inputs. Among other suspected causes, adversarial examples exploit ML models that offer no well-defined indication as to how well a particular prediction is supported by training data, yet are forced to confidently extrapolate predictions in areas of high entropy. In contrast, Bayesian ML models, such as Gaussian Processes (GP), inherently model the uncertainty accompanying a prediction in the well-studied framework of Bayesian Inference. This paper is the first to explore adversarial examples and their impact on uncertainty estimates for Gaussian Processes. To this end, we first present three novel attacks on Gaussian Processes: GPJM and GPFGS exploit forward derivatives in GP latent functions, and Latent Space Approximation Networks mimic the latent space representation in unsupervised GP models to facilitate attacks. Further, we show that these new attacks compute adversarial examples that transfer to non-GP classification models, and vice versa. Finally, we show that GP uncertainty estimates not only differ between adversarial examples and benign data, but also between adversarial examples computed by different algorithms.
A user can be represented as what he/she does along the history. A common way to deal with the user modeling problem is to manually extract all kinds of aggregated features over the heterogeneous behaviors, which may fail to fully represent the data itself due to limited human instinct. Recent works usually use RNN-based methods to give an overall embedding of a behavior sequence, which can then be exploited by downstream applications. However, this can only preserve very limited information, or aggregated memories of a person. When a downstream application uses the modeled user features, it may lose the integrity of the specific, highly correlated behaviors of the user, and introduce noise derived from unrelated behaviors. This paper proposes an attention-based user behavior modeling framework called ATRank, which we mainly use for recommendation tasks. Our model handles heterogeneous user behaviors by projecting all types of behaviors into multiple latent semantic spaces, where influence can be exerted among the behaviors via self-attention. Downstream applications can then use the user behavior vectors via vanilla attention. Experiments show that ATRank achieves better performance and a faster training process. We further extend ATRank to use one unified model to predict different types of user behaviors at the same time, showing performance comparable with the highly optimized individual models.
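The self-attention at the core of such models can be sketched generically; this is plain scaled dot-product self-attention in NumPy, not ATRank's multi-space formulation:

```python
import numpy as np

def self_attention(X):
    # X: (sequence_length, d) matrix of behavior embeddings.
    # Each output row is a weighted mix of all rows, with weights
    # from a softmax over scaled dot-product similarity scores.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    # Numerically stable row-wise softmax.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X
```

The output has the same shape as the input, so attention layers of this kind can be stacked or followed by a pooling step to produce a fixed-size user vector.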
We propose a test of independence of two multivariate random vectors, given a sample from the underlying population. Our approach, which we call MINT, is based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently-developed efficient entropy estimators derived from nearest neighbour distances. The proposed critical values, which may be obtained from simulation (in the case where one marginal is known) or resampling, guarantee that the test has nominal size, and we provide local power analyses, uniformly over classes of densities whose mutual information satisfies a lower bound. Our ideas may be extended to provide new goodness-of-fit tests of normal linear models based on assessing the independence of our vector of covariates and an appropriately-defined notion of an error vector. The theory is supported by numerical studies on both simulated and real data.
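The entropy decomposition that MINT relies on, I(X; Y) = H(X) + H(Y) − H(X, Y), can be illustrated with a naive plug-in estimator for discrete samples. The paper itself uses nearest-neighbour entropy estimators for continuous densities; this sketch only shows the decomposition:

```python
import numpy as np
from collections import Counter

def plug_in_entropy(samples):
    # Naive plug-in entropy estimate for discrete samples (in nats).
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * np.log(c / n) for c in counts.values())

def mutual_information(x, y):
    # I(X; Y) = H(X) + H(Y) - H(X, Y): the decomposition that lets
    # mutual information be estimated from three entropy estimates.
    joint = list(zip(x, y))
    return plug_in_entropy(x) + plug_in_entropy(y) - plug_in_entropy(joint)
```

For independent variables the estimate is near zero; for identical variables it equals the marginal entropy, which is the intuition behind using it as an independence test statistic.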

## November 20, 2017

### How do groups work on Linux?

Hello! Last week, I thought I knew how users and groups worked on Linux. Here is what I thought:

1. Every process belongs to a user (like julia)
2. When a process tries to read a file owned by a group, Linux a) checks if the user julia can access the file, and b) checks which groups julia belongs to, and whether any of those groups owns & can access that file
3. If either of those is true (or if the ‘any’ bits are set right) then the process can access the file

So, for example, if a process is owned by the julia user and julia is in the awesome group, then the process would be allowed to read this file.

-r--r--r-- 1 bork awesome     6872 Sep 24 11:09 file.txt


I had not thought carefully about this, but if pressed I would have said that it probably checks the /etc/group file at runtime to see what groups you’re in.

### that is not how groups work

I found out at work last week that, no, what I describe above is not how groups work. In particular Linux does not check which groups a process’s user belongs to every time that process tries to access a file.

Here is how groups actually work! I learned this by reading Chapter 9 (“Process Credentials”) of The Linux Programming Interface which is an incredible book. As soon as I realized that I did not understand how users and groups worked, I opened up the table of contents with absolute confidence that it would tell me what’s up, and I was right.

### how users and groups checks are done

The key new insight for me was pretty simple! The chapter starts out by saying that user and group IDs are attributes of the process:

• real user ID and group ID;
• effective user ID and group ID;
• saved set-user-ID and saved set-group-ID;
• file-system user ID and group ID (Linux-specific); and
• supplementary group IDs.

This means that the way Linux actually does group checks to see if a process can read a file is:

• look at the process’s group IDs & supplementary group IDs (from the attributes on the process, not by looking them up in /etc/group)
• look at the group on the file
• see if they match

Generally when doing access control checks it uses the effective user/group ID, not the real user/group ID. Technically when accessing a file it actually uses the file-system ids but those are usually the same as the effective uid/gid.
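A rough sketch of that check in Python (the kernel does this natively; this is just to show where the group list comes from, and it ignores ACLs, capabilities, and the owner/other permission bits):

```python
import os
import stat

def can_read_via_group(path):
    # The process's group credentials: effective GID plus the
    # supplementary groups -- NOT a fresh lookup in /etc/group.
    process_groups = set(os.getgroups()) | {os.getegid()}
    st = os.stat(path)
    # Match the file's owning group and check the group-read bit.
    return st.st_gid in process_groups and bool(st.st_mode & stat.S_IRGRP)
```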

### Adding a user to a group doesn’t put existing processes in that group

Here’s another fun example that follows from this: if I create a new panda group and add myself (bork) to it, then run groups to check my group memberships – I’m not in the panda group!

bork@kiwi~> sudo addgroup panda
Adding group `panda' (GID 1001) ...
Done.
bork@kiwi~> sudo adduser bork panda
Adding user `bork' to group `panda' ...
Adding user bork to group panda
Done.
bork@kiwi~> groups


no panda in that list! To double check, let’s try making a file owned by the panda group and see if I can access it:

$ touch panda-file.txt
$ sudo chown root:panda panda-file.txt
$ sudo chmod 660 panda-file.txt
$ cat panda-file.txt
cat: panda-file.txt: Permission denied


Sure enough, I can’t access panda-file.txt. No big surprise there. My shell didn’t have the panda group as a supplementary GID before, and running adduser bork panda didn’t do anything to change that.

### how do you get your groups in the first place?

So this raises kind of a confusing question, right – if processes have groups baked into them, how do you get assigned your groups in the first place? Obviously you can’t assign yourself more groups (that would defeat the purpose of access control).

It’s relatively clear how processes I execute from my shell (bash/fish) get their groups – my shell runs as me, and it has a bunch of group IDs on it. Processes I execute from my shell are forked from the shell so they get the same groups as the shell had.

So there needs to be some “first” process that has your groups set on it, and all the other processes you start inherit their groups from that. That process is called your login shell and it’s run by the login program (/bin/login) on my laptop. login runs as root and calls a C function called initgroups to set up your groups (by reading /etc/group). It’s allowed to set up your groups because it runs as root.
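In Python, the group list that initgroups computes can be approximated like this. This is a sketch of the lookup only; the real initgroups(3) is a C library call, and actually installing the list via os.setgroups requires root:

```python
import grp
import pwd

def supplementary_groups(username):
    # What initgroups(3) effectively computes: the user's primary
    # group from the passwd database, plus every group database
    # entry that lists the user as a member.
    gids = {pwd.getpwnam(username).pw_gid}
    for g in grp.getgrall():
        if username in g.gr_mem:
            gids.add(g.gr_gid)
    return sorted(gids)
```

A login process computes this list once, installs it with os.setgroups (root only), and then drops privileges; every process forked from the resulting shell inherits the list.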

### let’s try logging in again!

So! Let’s say I am running in a shell, and I want to refresh my groups! From what we’ve learned about how groups are initialized, I should be able to run login to refresh my groups and start a new login shell!

Let’s try it:

$ sudo login bork
$ groups
$ cat panda-file.txt # it works! I can access the file owned by panda now!


Sure enough, it works! Now the new shell that login spawned is part of the panda group! Awesome!

This won’t affect any other shells I already have running. If I really want the new panda group everywhere, I need to restart my login session completely, which means quitting my window manager and logging in again.

### newgrp

Somebody on Twitter told me that if you want to start a new shell with a new group that you’ve been added to, you can use newgrp. Like this:

sudo addgroup panda
sudo adduser bork panda
newgrp panda # starts a new shell, and you don't have to be root to run it!

You can accomplish the same(ish) thing with sg panda bash, which will start a bash shell that runs with the panda group.

### setuid sets the effective user ID

I’ve also always been a little vague about what it means for a process to run as “setuid root”. It turns out that setuid sets the effective user ID! So if I (julia) run a setuid root process (like passwd), then the real user ID will be set to julia, and the effective user ID will be set to root.

passwd needs to run as root, but it can look at its real user ID to see that julia started the process, and prevent julia from editing any passwords except for julia’s password.

### that’s all!

There are a bunch more details about all the edge cases and exactly how everything works in The Linux Programming Interface, so I will not get into all the details here. That book is amazing. Everything I talked about in this post is from Chapter 9, which is a 17-page chapter inside a 1300-page book.
The thing I love most about that book is that reading 17 pages about how users and groups work is really approachable, self-contained, super useful, and I don’t have to tackle all 1300 pages of it at once to learn helpful things :)

### Call for Bids to Host KDD-202x

ACM SIGKDD Executive Committee hereby invites proposals to host the annual KDD Conference in 2020 and later. Proposals are due Jan 31, 2018.

### Distilled News

This article aims to provide the perfect starting point to nudge you to use Docker for your Data Science workflows! I will cover both of the useful aspects of Docker: setting up your system without installing the tools, and creating your own data science environment.

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, outliers, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, AI and many more.

Yet another one of these One Picture tutorials, and in some ways in the same old-fashioned style as our Type I versus Type II Errors in One Picture.

Concatenate TensorFlow tensors along a given dimension using the TensorFlow concat functionality, and then check the shape of the concatenated tensor using the TensorFlow shape functionality.

If you want to build trust in machine learning, try treating it like a human: ask it the same type of questions.

IBM announced new software to deliver faster time to insight for high performance data analytics (HPDA) workloads, such as Spark, TensorFlow and Caffe, for AI, Machine Learning and Deep Learning. Based on the same software, which will be deployed for the Department of Energy’s CORAL Supercomputer Project at both Oak Ridge and Lawrence Livermore, IBM will enable new solutions for any enterprise running HPDA workloads.
New to this launch is Deep Learning Impact (DLI), a set of software tools to help users develop AI models with the leading open-source deep learning frameworks, like TensorFlow and Caffe. The DLI tools are complementary to the PowerAI deep learning enterprise software distribution already available from IBM. Also new is web access and simplified user interfaces for IBM Spectrum LSF Suites, combining a powerful workload management platform with the flexibility of remote access. Finally, the latest version of IBM Spectrum Scale software ensures support to move workloads such as unified file, object and HDFS from where the data is stored to where it is analyzed.

Sharing one platform has some obvious benefits for Data Science and Data Engineering teams, but technical, language and process challenges often make this difficult. Learn how one company implemented a single cloud platform for R, Python and other workloads, and some of the unexpected benefits they discovered along the way.

In Part I of this series, the original GAN paper was presented. Although it was clever and gave state-of-the-art results at the time, much has been improved upon since. In this post I’ll talk about the contributions from the Deep Convolutional GAN (DCGAN) paper.

1. Overview: Deep Learning Frameworks compared (96K views) – 5 minutes
2. Playlist: TensorFlow tutorial by Sentdex (114K views) – 4.5 hours
3. Individual tutorial: TensorFlow tutorial 02: Convolutional Neural Network (69.7K views) – 36 minutes
4. Overview: How to predict stock prices easily (210K views) – 9 minutes
5. Tutorial: Introduction to Deep Learning with Python and the Theano library (201K views) – 52 minutes
6. Playlist: PyTorch Zero to All (3K views) – 2 hours 15 minutes
7. Individual tutorial: TensorFlow tutorial (43.9K views) – 49 minutes
8. Playlist: Deep Learning with Python (1.8K views) – 83 minutes
9. Playlist: Deep Learning with Keras – Python (30.3K views) – 85 minutes
10. Free online course: Deep Learning by Andrew Ng (Full course) (28K views) – 4-week course

If you develop methods for data analysis, you might only be conducting gentle tests of your method on idealized data. This leads to “fragile research”, which breaks when released into the wild. Here, I share 3 ways to make your methods robust.

Many users tell me that R is slow. With old R releases that is 100% true, provided old R versions used their own numerical libraries instead of optimized numerical libraries. But numerical libraries do not tell the complete story. In many cases slow code execution can be attributed to inefficient code, and in precise terms to not following one or more of these good practices:
• Using the byte-code compiler
• Vectorizing operations
• Using simple data structures (i.e. using data frames instead of matrices in large computing instances)
• Re-using results

In a recent project, I was looking to plot data from different variables along the same time axis. The difficulty was that some of these variables I wanted as point plots, while others I wanted as box plots. Because I work with the tidyverse, I wanted to produce these plots with ggplot2. Faceting was the obvious first step, but it took me quite a while to figure out how best to combine facets with point plots (where I have one value per time point) and box plots (where I have multiple values per time point).

Machine Learning (ML) is now a de facto skill for every quantitative job and almost every industry has embraced it, even though the fundamentals of the field are not new at all. However, what does it mean to teach a machine? Unfortunately, even for moderately technical people coming from different backgrounds, the answer to this question is not apparent at first. This sounds like a conceptual and jargon issue, but it lies in the very success of supervised learning algorithms.

The pool package makes it easier for Shiny developers to connect to databases.
Up until now, there wasn’t a clearly good way to do this. As a Shiny app author, if you connect to a database globally (outside of the server function), your connection won’t be robust because all sessions would share that connection (which could leave most users hanging when one of them is using it, or even all of them if the connection breaks). But if you try to connect each time that you need to make a query (e.g. for every reactive you have), your app becomes a lot slower, as it can take on the order of seconds to establish a new connection. The pool package solves this problem by taking care of when to connect and disconnect, allowing you to write performant code that automatically reconnects to the database only when needed. So, if you are a Shiny app author who needs to connect and interact with databases inside your apps, keep reading, because this package was created to make your life easier.

Customer churn occurs when customers or subscribers stop doing business with a company or service; it is also known as customer attrition, or loss of clients or customers. One industry in which churn rates are particularly useful is the telecommunications industry, because most customers have multiple options from which to choose within a geographic location. As with predicting employee turnover, we are going to predict customer churn using a telecom dataset. We will introduce Logistic Regression, Decision Tree, and Random Forest. But this time, we will do all of the above in R. Let’s get started!

Please vote in the new KDnuggets poll, which examines the methods and tools used for a real-world application or project.

We introduce a general framework for developing time series models, generating features and preprocessing the data, and explore the potential to automate this process in order to apply advanced machine learning algorithms to almost any time series problem.
### Twitter coverage of the Australian Bioinformatics & Computational Biology Society Conference 2017

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

You know the drill by now. Grab the tweets. Generate the report using RMarkdown. Push to Github. Publish to RPubs. This time it’s the Australian Bioinformatics & Computational Biology Society Conference 2017, including the COMBINE symposium. Looks like a good time was had by all in Adelaide.

A few quirks this time around. First, the rtweet package went through a brief phase of returning lists instead of nice data frames. I hope that’s been discarded as a bad idea. Second, the hashtag search returned results that did not contain said hashtags but were definitely from the meeting; I have yet to get to the bottom of that one. I treated ABACBS and COMBINE as one event for this analysis. Third, given that most Twitter users have had 280 characters since about November 7, is this reflected in the conference tweets? It is not. In fact, most tweets seem to hit the hard limit of 140, which makes me wonder whether users’ clients were not updated, or the R package has a hard-coded character limit in its parsing functions (longest = 148 characters).

Filed under: australia, bioinformatics, meetings, R, statistics Tagged: abacbs, twitter

### How to make Python easier for the R user: revoscalepy

by Siddarth Ramesh, Data Scientist, Microsoft

I’m an R programmer.
To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist. Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python; there are packages that do this, such as pmml. But often this is more challenging, because the production system might demand that the entire end-to-end workflow is built exclusively in Python. That’s sometimes tough, because there are aspects of statistical model building that are more intuitive in R than in Python.

Python has many strengths, such as its robust data structures (e.g. dictionaries), compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require going back to basic statistics and Machine Learning, for which the classic Data Science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft: revoscalepy.

### Why revoscalepy?

revoscalepy is the Python implementation of the R package RevoScaleR included with Microsoft Machine Learning Server. The methods in revoscalepy are the same, and more importantly, the way the R user can view data is the same. This matters because, for an R programmer, being able to understand the data’s shape and structure is one of the challenges of getting used to Python.
In Python, data types are different, preprocessing the data is different, and the criteria for feeding the processed dataset into a model are different. To understand how revoscalepy eases the transition from R to Python, the following sections will compare building a decision tree using revoscalepy with building a decision tree using sklearn. The Titanic dataset from Kaggle will be used for this example. To be clear, this post is written from an R user’s perspective, as many of the challenges this post will outline are standard practice for native Python users.

### revoscalepy versus sklearn

revoscalepy works on Python 3.5, and can be downloaded as a part of Microsoft Machine Learning Server. Once downloaded, set the Python environment path to the python executable in the MML directory, and then import the packages. The first chunk of code imports the revoscalepy, numpy, pandas, and sklearn packages, and reads in the Titanic data. Pandas has some R roots in that it has its own implementation of DataFrames, as well as methods that resemble R’s exploratory methods.

import revoscalepy as rp
import numpy as np
import pandas as pd
import sklearn as sk

titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()

### Preprocessing with sklearn

One of the challenges for an R user working with sklearn is that sklearn’s decision tree model can only handle numeric datatypes. Pandas has a categorical type that looks like factors in R, but sklearn’s decision tree does not integrate with it. As a result, numerically encoding the categorical data becomes a mandatory step. This example will use a one-hot encoder to shape the categories in a way that sklearn’s decision tree understands. The side effect of having to one-hot encode the variables is that if the dataset contains high-cardinality features, it can be memory intensive and computationally expensive, because each category becomes its own binary column.
While one-hot encoding itself is not a difficult transformation to implement in Python and gives good results, it is still an extra step that the R programmer has to carry out manually. The following chunk of code detaches the categorical columns, label- and one-hot-encodes them, and then reattaches the encoded columns to the rest of the dataset.

    from sklearn import tree

    le = sk.preprocessing.LabelEncoder()
    x = titanic_data.select_dtypes(include=[object])
    x = x.drop(['Name', 'Ticket', 'Cabin'], 1)
    x = pd.concat([x, titanic_data['Pclass']], axis=1)
    x['Pclass'] = x['Pclass'].astype('object')
    x = pd.DataFrame(x)
    x = x.fillna('Missing')
    x_cats = x.apply(le.fit_transform)
    enc = sk.preprocessing.OneHotEncoder()
    enc.fit(x_cats)
    onehotlabels = enc.transform(x_cats).toarray()
    encoded_titanic_data = pd.concat([pd.DataFrame(titanic_data.select_dtypes(include=[np.number])),
                                      pd.DataFrame(onehotlabels)], axis=1)

At this point, there are more columns than before, and the columns no longer have semantic names (they have been enumerated). This means that if the decision tree is visualized, it will be difficult to understand without the extra step of renaming these columns.

### Preprocessing with revoscalepy

Unlike sklearn, revoscalepy reads pandas' 'category' type like factors in R. This section of code iterates through the DataFrame, finds the string types, and converts them to 'category'. In pandas, the ordered argument can be set to False to prevent ordered factors.

    titanic_data_object_types = titanic_data.select_dtypes(include=['object'])
    titanic_data_object_types_columns = np.array(titanic_data_object_types.columns)
    for column in titanic_data_object_types_columns:
        titanic_data[column] = titanic_data[column].astype('category', ordered=False)
    titanic_data['Pclass'] = titanic_data['Pclass'].astype('category', ordered=False)

With that, the dataset is ready to be fed into the revoscalepy model.
### Training models with sklearn

One difference between implementing a model in R and in sklearn is that sklearn does not use formulas. Formulas are important and useful for modeling because they provide a consistent framework for developing models of varying complexity. With formulas, users can easily express different variable treatments, such as '+' for separate independent variables, ':' for interaction terms, and '*' to include both a variable and its interactions, along with many other convenient constructions. Within a formula, users can do mathematical calculations, create factors, and include more complex entities such as third-order interactions. Formulas also allow for building highly complex models, such as mixed-effects models, which are next to impossible to build without them. In Python, packages such as statsmodels offer more intuitive ways to build certain statistical models, but statsmodels has a limited selection of models and does not include tree-based ones. With sklearn, model.fit expects the independent and dependent terms to be columns from the DataFrame; interactions must be created manually as a preprocessing step for more complex examples. The code below trains the decision tree:

    model = tree.DecisionTreeClassifier(max_depth=50)
    x = encoded_titanic_data.drop(['Survived'], 1)
    x = x.fillna(-1)
    y = encoded_titanic_data['Survived']
    model = model.fit(x, y)

### Training models with revoscalepy

revoscalepy brings back formulas. Granted, users cannot view the formula the same way as in R, because formulas are strings in Python. However, porting code from R to Python is an easy transition, because formulas are read the same way by the revoscalepy functions as by the model-fit functions in R. The code below fits the decision tree in revoscalepy.
    # rx_dtree works with formulas, just like models in R
    form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
    titanic_data_tree = rp.rx_dtree(form, titanic_data, max_depth=50)

The resulting object, titanic_data_tree, is the same structural object that RxDTree() would create in R. RxDTree() integrates with the rpart library in R, and so rx_dtree() indirectly does too, if the user wants to export the object back into R for further exploratory analysis of the decision tree.

### Conclusion

This workflow should make clear how revoscalepy can help with transliteration between R and Python. sklearn has different preprocessing considerations because the data must be fed into the model differently. The advantage of revoscalepy is that R programmers can easily convert their R code to Python without thinking too much about the 'Pythonic way' of implementing it. Categories replace factors, rx_dtree() reads the R-like formula, and the arguments are similar to their R equivalents. Looking at the big picture, revoscalepy is one way to ease Python for the R user, and future posts will cover other ways to transition between R and Python.

Microsoft Docs: Introducing revoscalepy

Continue Reading…

### One Network to Solve Them All --- Solving Linear Inverse Problems using Deep Projection Models - implementation -

Wow, using the same network to perform inpainting, compressive sensing and super-resolution on the precious! While deep learning methods have achieved state-of-the-art performance in many challenging inverse problems like image inpainting and super-resolution, they invariably involve problem-specific training of the networks. Under this approach, different problems require different networks. In scenarios where we need to solve a wide variety of problems, e.g., on a mobile camera, it is inefficient and costly to use these specially-trained networks.
On the other hand, traditional methods using signal priors can be used in all linear inverse problems but often have worse performance on challenging tasks. In this work, we provide a middle ground between the two kinds of methods --- we propose a general framework to train a single deep neural network that solves arbitrary linear inverse problems. The proposed network acts as a proximal operator for an optimization algorithm and projects non-image signals onto the set of natural images defined by the decision boundary of a classifier. In our experiments, the proposed framework demonstrates superior performance over traditional methods using a wavelet sparsity prior and achieves performance comparable to that of specially-trained networks on tasks including compressive sensing and pixel-wise inpainting.

A TensorFlow implementation is here: https://github.com/rick-chang/OneNet

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there!

Continue Reading…

### R charts in a Tweet

Twitter recently doubled the maximum length of a tweet to 280 characters, and while all users now have access to longer tweets, few have taken advantage of the opportunity. Bob Rudis used the rtweet package to analyze tweets sent with the #rstats hashtag since 280-character tweets were introduced, and most still kept below the old 140-character limit. The actual percentage differed by the Twitter client being used; I've reproduced the charts for the top 10 clients below. (Click the chart for even more clients, and details of the analysis, including the R code that generated it.) That said, some have embraced the new freedom with gusto, not least my Microsoft colleague Paige Bailey, who demonstrated that you can fit some pretty nice R charts — and even animations! — into just 280 characters. For more 280-character examples in R and other languages follow this Twitter thread, and for more on analyzing tweets with rtweet follow the link below.
Continue Reading…

### Document worth reading: “Frankenstein’s Legacy – Four Conversations About Artificial Intelligence, Machine Learning, and the Modern World”

Frankenstein’s Legacy: Four Conversations about Artificial Intelligence, Machine Learning, and the Modern World is a collaboration between Carnegie Mellon University’s ETC Press, dSHARP, the Carnegie Mellon University Libraries, and the Alumni Association’s CMUThink program.
This book is part of a university-wide series celebrating the two-hundredth anniversary of the publication of Mary Shelley’s Frankenstein. The project sprang specifically from the Frankenstein 200: Perils and Potential panel hosted by Digital Scholarship Strategist Rikk Mulligan. Each of the four panel participants – Jeffrey Bigham, David Danks, Barry Luokkala, and Molly Wright Steenson – sat down with ETC Editor Brad King for wide-ranging discussions about artificial intelligence, machine learning, and the impact of these technologies on the world in which we live. Those conversations were edited into the manuscript that you’re now reading. The book – part of the ETC Press “In Conversation With” series – is a conversational examination and explanation of some of the most powerful technological tools in our society.

Frankenstein’s Legacy – Four Conversations About Artificial Intelligence, Machine Learning, and the Modern World

Continue Reading…

### NVIDIA DGX Systems – Deep Learning Software Whitepaper

Download this whitepaper from NVIDIA DGX Systems to gain insight into the engineering expertise and innovation behind the pre-optimized deep learning frameworks available only on NVIDIA DGX Systems, and learn how to dramatically reduce your engineering costs using today’s most popular frameworks.

Continue Reading…

### Kaggle’s Advanced Regression Competition: Predicting Housing Prices in Ames, Iowa

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction

Kaggle.com is a website designed for data scientists and data enthusiasts to connect and compete with each other. It is an open community that hosts forums and competitions in the wide field of data. In each Kaggle competition, competitors are given a training data set, which is used to train their models, and a test data set, used to test their models.
Kagglers can then submit their predictions to see how their score (e.g., accuracy or error) compares to others’. As a team, we joined the House Prices: Advanced Regression Techniques Kaggle challenge to test our model building and machine learning skills. For this competition, we were tasked with predicting housing prices of residences in Ames, Iowa. Our training data set included 1460 houses (i.e., observations) accompanied by 79 attributes (i.e., features, variables, or predictors) and the sale price for each house. Our test set included 1459 houses with the same 79 attributes, but the sale price was excluded, as this was our target variable. To view our code (split between R and Python) and our presentation slides for this project, see our shared GitHub repository.

Understanding the Data

Of the 79 variables provided, 51 were categorical and 28 were continuous. Our first step was to combine the training and test sets into a single set, both to account for the total missing values and to fully understand all the classes of each categorical variable. That is, there might be missing values or class types in the test set that do not appear in the training set.

Processing the Data

Response Variable

As our response variable, Sale Price, is continuous, we used regression models. One assumption of linear regression models is that the error between the observed and expected values (i.e., the residuals) should be normally distributed. Violations of this assumption often stem from a skewed response variable. Sale Price has a right skew, so we log + 1 transform it to normalize its distribution.

Missing Data

Machine learning algorithms do not handle missing values very well, so we must understand the missing values in our data to determine the best way to handle them. We found that 34 of the predictor variables have values that are interpreted by R and Python as missing (i.e., “NA” and “NaN”).
Below we describe some of the ways we treated these missing data.

1) NA/NaN is actually a class: In many instances, what R and Python interpret as a missing value is actually a class of the variable. For example, Pool Quality comprises five classes: Excellent, Good, Fair, Typical, and NA. The NA class describes houses that do not have a pool, but our coding languages interpret houses of NA class as missing values instead of a class of the Pool Quality variable. Our solution was to impute most of the NA/NaN values to a value of “None.”

2) Not every NA/NaN corresponds to a missing attribute: While most NA/NaN values corresponded to an actual class for different variables, some actually represented missing data. For example, three houses with NA/NaN values for Pool Quality also have a non-zero value for the variable Pool Area (square footage of pool). These three houses likely have a pool, but its quality was not assessed or entered into the data set. Our solution was to first calculate the mean Pool Area for each class of Pool Quality, then impute the missing Pool Quality classes based on how close each house’s Pool Area was to the mean Pool Area of each Pool Quality class. For example, the first row in the picture below on the left has a Pool Area of 368 square feet. The average Pool Area for houses with Excellent pool quality (Ex) is about 360 square feet (picture on the right). Therefore, we imputed this house to have a Pool Quality of Excellent.

3) Domain knowledge: Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors in its price, so dropping these variables seems ill-advised.
Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage of the neighborhood in which each house was located.

4) Imputing with the mode: Most variables have some intuitive relationship to other variables, and imputation can be based on these related features. But some missing values occur in variables with no apparent relation to others. For example, the Electrical variable, which describes the electrical system, was missing for a single observation. Our solution was to simply find the most common class of this categorical variable and impute it for the missing value.

Ordinal Categories

For linear (but not tree-based) models, categorical variables must be converted to numeric form. There are two types of categorical features: ordinal, where there is an inherent order to the classes (e.g., Excellent is greater than Good, which is greater than Fair), and nominal, where there is no obvious order (e.g., red, green, and blue). Our solution for ordinal variables was to simply assign the classes numbers corresponding to their relative ranks. For example, Kitchen Quality has five classes: Excellent, Good, Typical, Fair, and Poor, which we encoded (i.e., converted) to the numbers 5, 4, 3, 2, and 1, respectively.

Nominal Categories

Ranking nominal categories is not appropriate, as there is no actual order among the classes. Our solution was to one-hot encode these variables, which creates a new variable for each class, with values of zero (not present) or one (present).

Outliers

An outlier can be defined quantitatively (i.e., statistically) or qualitatively. We opted for the qualitative definition when looking for outliers: observations that are abnormally far from other values. Viewing the relationship between Above Ground Living Area and Sale Price, we noticed some very large areas for very low prices.
Our solution was to remove these observations, as they fit our chosen definition of an outlier and might increase our models’ errors.

Skewness

While there are few assumptions regarding the independent variables of regression models, transforming skewed variables toward a normal distribution can often improve model performance. Our solution was to log + 1 transform several of the predictors.

Near-Zero Predictors

Predictors with very low variance offer little predictive power to models. Our solution was to find the ratio of the second most frequent value to the most frequent value for each predictor, and to remove variables where this ratio was less than 0.05. This roughly translates to dropping variables where 95% or more of the values are the same.

Feature Engineering

Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed by manipulating these features in some way.
Below are just some examples of the features we created:

• Remodeled (categorical): Yes or No if Year Built differs from Year Remodeled. If the year a house was remodeled differs from the year it was built, the remodeling likely increases property value.

• Seasonality (categorical): Month Sold combined with Year Sold. While more houses were sold during the summer months, this likely varies across years, especially during the period in which these houses were sold, which coincides with the housing crash (2006-2010).

• New House (categorical): Yes or No if Year Sold equals Year Built. If a house was sold the same year it was built, we might expect it was in high demand and might have a higher Sale Price.

• Total Area (continuous): the sum of all variables that describe the area of different sections of a house. Because many variables pertain to the square footage of different aspects of each house, we might expect the total square footage to have a strong influence on Sale Price.

Models and Results

Now that we have prepared our data set, we can train our models and use them to predict Sale Price. We trained and tested dozens of versions of the models described below, with different combinations of engineered features and processed variables. The information in the table represents our best results for each model: the pros and cons of each model type, the optimal hyperparameters found through either grid search or Bayesian optimization, our test score, and the score we received from Kaggle. Our scores are the root mean square error (RMSE) of our predictions, a metric describing the difference between the observed values and our predicted values for Sale Price; scores closer to zero are better. For brevity, we will not describe the details of the different models.
However, see the following links for more information about how each model is used to create predictions: random forest, gradient boosting, XGBoost, elastic net regularization for regression.

Below are plots summarizing the variables that contribute most to each model’s prediction of Sale Price. For most models, predictors related to square footage (Area), quality (the various Quality measures), and age (Year Built) have the strongest impact on the predictions. There is no visualization for our best model, which was an ensemble of four other models. The predictions of this ensembled model are calculated by averaging the predictions of the separate models (two linear regression models and two tree-based models). The idea is that each model’s predictions include error both above and below the real values, so the averaged predictions of the best models may have less overall error than any single model.

One note is that tree-based models (random forest, gradient boosting, and XGBoost) cannot indicate whether each variable influences Sale Price positively or negatively; they can only indicate how important that variable is to the model’s predictions overall. In contrast, linear models can show which variables influence Sale Price positively and which negatively. In the figure immediately above, the strongest predictor, residency in a Commercial Zone, is actually negatively related to Sale Price.

Conclusions

The objective of this Kaggle competition was to build models to predict the housing prices of residences in Ames, IA. Our best model resulted in an RMSE of 0.1071, which translates to an error of about $9000 (or about 5%) for the average-priced house.

While this error is quite low, the interpretability of our model is poor. Each model within our ensemble differs with respect to the variables it finds most important for predicting Sale Price. The best way to interpret the ensemble is to look for variables shared among its constituent models. The variables seen as the most important or strongest predictors across our models were those related to square footage, the age and condition of the home, the neighborhood where the house was located, the city zone where the house was located, and the year the house was sold.
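The averaging behind the ensemble is simple to sketch. Below is a minimal illustration, with made-up numbers rather than the competition data, of why averaging models whose errors fall on opposite sides of the truth can reduce RMSE:

```python
import numpy as np

def rmse(y, p):
    """Root mean square error, the competition's scoring metric."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(p)) ** 2)))

# Toy target plus two hypothetical models with opposite biases
y_true = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = y_true + 0.5   # consistently predicts high
pred_b = y_true - 0.5   # consistently predicts low

# Ensemble prediction: the mean of the individual predictions
ensemble = np.mean([pred_a, pred_b], axis=0)

print(rmse(y_true, pred_a))    # 0.5
print(rmse(y_true, pred_b))    # 0.5
print(rmse(y_true, ensemble))  # 0.0 -- the opposite errors cancel
```

Real models' errors are only partially offsetting, so the gain is smaller than in this idealized case, but the mechanism is the same.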

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A Course in Semantic Technologies for Designing a Proof-of-Concept, starting Nov 30

Ontotext's live, online training is designed to improve your understanding of how semantic technology works and to help you make the best use of it. Preparation starts Nov 30 and the live class is Dec 7.

### R Packages worth a look

Optimal Thresholding Fisher’s P-Value Combination Method (TFisher)
We provide the cumulative distribution function (CDF), quantile, and statistical power calculator for a collection of thresholding Fisher’s p-value combination methods, including Fisher’s p-value combination method, the truncated product method and, in particular, the soft-thresholding Fisher’s p-value combination method, which is proven to be optimal in some contexts of signal detection. The p-value calculator for the omnibus version of these tests is also included. For reference, please see Hong Zhang and Zheyang Wu, ‘Optimal Thresholding of Fisher’s P-value Combination Tests for Signal Detection’, submitted.
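For context, here is a minimal sketch of the plain (unthresholded) Fisher combination that these methods generalize; the thresholded variants restrict the sum to p-values below a cutoff and adjust the null distribution accordingly. This is an illustration using scipy, not the TFisher API:

```python
import numpy as np
from scipy import stats

def fisher_combination(pvals):
    """Fisher's method: under H0, -2 * sum(log p_i) ~ chi-square with 2k df."""
    pvals = np.asarray(pvals, dtype=float)
    stat = -2.0 * np.sum(np.log(pvals))
    return stat, float(stats.chi2.sf(stat, df=2 * len(pvals)))

# Three individually unimpressive p-values combine into clearer evidence
stat, p_comb = fisher_combination([0.01, 0.20, 0.50])

# Sanity check: combining a single p-value returns that p-value
_, p_single = fisher_combination([0.30])
```

For the three p-values above, the combined p-value lands between 0.01 and 0.05, i.e. stronger than two of the three inputs on their own.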

Large Data Sets (ldat)
Tools for working with vectors and data sets that are too large to keep in memory. Extends the basic functionality provided in the ‘lvec’ package. Provides basic statistical functionality for ‘lvec’ objects, such as arithmetic operations and calculating means and sums. Also implements ‘data.frame’-like objects that store their data in ‘lvec’ objects.

Random Wishart Matrix Generation (rWishart)
An expansion of R’s ‘stats’ random Wishart matrix generation. This package allows the user to generate singular Wishart matrices, Uhlig (1994) <doi:10.1214/aos/1176325375>, and pseudo-Wishart matrices, Diaz-Garcia et al. (1997) <doi:10.1006/jmva.1997.1689>. In addition, the user can generate Wishart matrices with fractional degrees of freedom, Adhikari (2008) <doi:10.1061/(ASCE)0733-9399(2008)134:12(1029)>, commonly used in volatility modeling. Users can also use this package to create random covariance matrices.
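For readers who just want the standard (integer degrees of freedom, non-singular) case, a Wishart draw can be sketched directly from its definition as a sum of outer products of multivariate normals. This is not the rWishart API, and the singular and fractional cases the package handles need more care:

```python
import numpy as np

def rwishart(df, scale, rng):
    """One Wishart(df, scale) draw via W = G @ G.T, where the df columns of G
    are i.i.d. N(0, scale). Assumes integer df >= dimension (non-singular case)."""
    L = np.linalg.cholesky(scale)                     # scale = L @ L.T
    G = L @ rng.standard_normal((scale.shape[0], df))
    return G @ G.T

rng = np.random.default_rng(0)
W = rwishart(df=10, scale=np.eye(3), rng=rng)
# W is a random symmetric positive-definite matrix with E[W] = df * scale
```

Because W is positive definite by construction, such draws also serve as random covariance matrices, the last use case the blurb mentions.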

Simple Matrix Construction (Massign)
Constructing matrices for quick prototyping can be a nuisance, requiring the user to think about how to fill the matrix with values using the matrix() function. The %<-% operator solves that issue by allowing the user to construct matrices using code that shows the actual matrices.

Multivariate Empirical Density Function (MEPDF)
Based on the input data, an n-dimensional cube with sub-cells of user-specified side length is created. The number of sample points falling in each sub-cell is counted, and with the cell volume and overall sample size, an empirical probability can be computed. A number of cubes of higher resolution can be superimposed. The basic method stems from J. L. Bentley, ‘Multidimensional Divide and Conquer’ (1980) <doi:10.1145/358841.358850>.
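The counting-and-normalizing scheme described here is, in its single-cube form, what numpy's histogramdd computes. A sketch of that basic step (not the MEPDF interface, which adds the superimposed higher-resolution cubes):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.standard_normal((10_000, 2))   # a 2-dimensional sample

# Partition the sampled range into a grid of sub-cells, count the points in
# each cell, and normalize by (sample size * cell volume) to get an
# empirical density -- density=True performs exactly this normalization
density, edges = np.histogramdd(sample, bins=20, density=True)

# With a uniform grid, the cell densities integrate to 1 over the range
cell_volume = np.prod([e[1] - e[0] for e in edges])
total_mass = density.sum() * cell_volume
```

The `edges` arrays give the cell boundaries along each axis, so individual empirical cell probabilities are `density * cell_volume`.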

Renewal Hawkes Process (RHawkes)
Simulate a renewal Hawkes (RHawkes) self-exciting process, with a given immigrant hazard rate function and offspring density function. Calculate the likelihood of a RHawkes process with given hazard rate function and offspring density function for an (increasing) sequence of event times. Calculate the Rosenblatt residuals of the event times. Predict future event times based on observed event times up to a given time. For details see Chen and Stindl (2017) <doi:10.1080/10618600.2017.1341324>.
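As background for what the package simulates, here is a minimal sketch of Ogata's thinning algorithm for the classical Hawkes special case with an exponential offspring kernel; the renewal variant that RHawkes implements replaces the constant immigrant rate with a general renewal hazard, which this sketch does not cover:

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """Ogata thinning for a classical Hawkes process with intensity
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))."""
    events, t = [], 0.0
    while True:
        # The current intensity bounds the intensity until the next event,
        # because the exponential kernel only decays between events
        lam_bar = mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)
        if t >= horizon:
            return np.array(events)
        lam_t = mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
        if rng.uniform() * lam_bar <= lam_t:   # accept with prob lam_t / lam_bar
            events.append(t)

rng = np.random.default_rng(1)
ev = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=100.0, rng=rng)
```

The branching ratio alpha/beta must be below 1 for the process to be stable; here it is 2/3, so each immigrant event spawns on average two descendants in total.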

### “A Bias in the Evaluation of Bias Comparing Randomized Trials with Nonexperimental Studies”

Jessica Franklin writes:

Given your interest in post-publication peer review, I thought you might be interested in our recent experience criticizing a paper published in BMJ last year by Hemkens et al. I realized that the method used for the primary analysis was biased, so we published a criticism with mathematical proof of the bias (we tried to publish in BMJ, but it was a no go). Now there has been some back and forth between the Hemkens group and us on the BMJ rapid response page, and BMJ is considering a retraction, but no action yet. I don’t really want to comment too much on the specifics, as I don’t want to escalate the tension here, but this has all been pretty interesting, at least to me.

Interesting, in part because both sides in the dispute include well-known figures in epidemiology: John Ioannidis is a coauthor on the Hemkens et al. paper, and Kenneth Rothman is a coauthor on the Franklin et al. criticism.

Background

The story starts with the paper by Hemkens et al., who performed a meta-analysis on “16 eligible RCD studies [observational studies using ‘routinely collected data’], and 36 subsequent published randomized controlled trials investigating the same clinical questions (with 17 275 patients and 835 deaths),” and they found that the observational studies overestimated efficacy of treatments compared to the later randomized experiments.

Their message: be careful when interpreting observational studies.

One thing I wonder about, though, is how much of this is due to the time ordering of the studies. Forget for a moment about which studies are observational and which are experimental. In any case, I’d expect the first published study on a topic to show statistically significant results—otherwise it’s less likely to be published in the first place—whereas anything could happen in a follow-up. Thus, I’d expect to see earlier studies overestimate effect sizes relative to later studies, irrespective of which studies are observational and which are experimental. This is related to the time-reversal heuristic.

To put it another way: The Hemkens et al. project is itself an observational study, and in their study there is complete confounding between two predictors: (a) whether a result came from an observational study or an experiment, and (b) whether the result was published first or second. So I think it’s impossible to disentangle the predictive value of (a) and (b).

The criticism and the controversy

Here are the data from Hemkens et al.:

Franklin et al. expressed the following concern:

In a recent meta-analysis by Hemkens et al. (Hemkens et al. 2016), the authors compared published RCD [routinely collected data] studies and subsequent RCTs [randomized controlled trials] using the ROR, but inverted the clinical question and corresponding treatment effect estimates for all study questions where the RCD estimate was > 1, thereby ensuring that all RCD estimates indicated protective effects.

Here’s the relevant bit from Hemkens et al.:

For consistency, we inverted the RCD effect estimates where necessary so that each RCD study indicated an odds ratio less than 1 (that is, swapping the study groups so that the first study group has lower mortality risk than the second).

So, yeah, that’s what they did.

On one hand, I can see where Hemkens et al. were coming from. To the extent that the original studies purported to be definitive, it makes sense to code them in the same direction, so that you’re asking how the replications compared to what was expected.

On the other hand, Franklin et al. have a point, that in the absence of any differences, the procedure of flipping all initial estimates to have odds ratios less than 1 will bias the estimate of the difference.
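Franklin et al.'s point is easy to verify by simulation. Under a null where the observational study and the trial estimate the same (zero) effect, flipping every pair so the first estimate looks protective produces a systematically nonzero ratio of odds ratios. The noise level below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # simulated study pairs

# Null world: RCD studies and subsequent RCTs both estimate a true log
# odds ratio of 0, each with independent sampling noise
log_or_rcd = rng.normal(0.0, 0.5, n)
log_or_rct = rng.normal(0.0, 0.5, n)

# The inversion convention: swap the study groups in each pair so that the
# RCD estimate always indicates a protective effect (log OR < 0)
flip = np.where(log_or_rcd > 0, -1.0, 1.0)
log_ror = log_or_rcd * flip - log_or_rct * flip   # log ratio of odds ratios

# An unbiased procedure would average near 0; flipping makes the RCD
# estimates look systematically more protective
print(float(log_ror.mean()))   # clearly negative (around -0.4), not 0
```

The size of the artifact depends on the sampling noise (here the mean flipped RCD estimate is -E|Z|, the negative mean absolute value of the noise), but its direction is guaranteed: flipping on the observed sign builds the bias in.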

Beyond this, the above graph shows a high level of noise in the comparisons, as some of the follow-up randomized trials have standard errors that are essentially infinite. (What do you say about an estimated odds ratio that can be anywhere from 0.2 to 5?) Hemkens et al. appear to be using some sort of weighting procedure, but the relevant point here is that only a few of these studies have enough data to tell us anything at all.

My take on these papers

The above figure tells the story: The 16 observational studies appear to show a strong correlation between standard error and estimated effect size. This makes sense. Go, for example, to the bottom of the graph: I don’t know anything about Hahn 2010, Fonoarow 2008, Moss 2003, Kim 2008, and Cabell 2005, but all these studies are estimated to cut mortality by 50% or more, which seems like a lot, especially considering the big standard errors. It’s no surprise that these big estimates fail to reappear under independent replication. Indeed, as noted above, I’d expect that big estimates from randomized experiments would also generally fail to reappear under independent replication.

Franklin et al. raise a valid criticism: Even if there is no effect at all, the method used by Hemkens et al. will create the appearance of an effect: in short, the Hemkens et al. estimate is indeed biased.

Put it all together, and I think that the sort of meta-analysis performed by Hemkens et al. is potentially valuable, but maybe it would’ve been enough for them to stop with the graph on the left in the above image. It’s not clear that anything is gained from their averaging; also there’s complete confounding in their data between timing (which of the two studies came first) and mode (observational or experimental).

The discussion

Here are some juicy bits from the online discussion at the BMJ site:

02 August 2017, José G Merino, US Research Editor, The BMJ:

Last August, a group led by Jessica Franklin submitted to us a criticism of the methods used by the authors of this paper, calling into question some of the assumptions and conclusion reached by Lars Hemkens and his team. We invited Franklin and colleagues to submit their comments as a rapid response rather than as a separate paper but they declined and instead published the paper in Epidemiological Methods (Epidem Meth 2-17;20160018, DOI 10.1515/em-2016-0018.) We would like to alert the BMJ’s readers about the paper, which can be found here: https://www.degruyter.com/view/j/em.ahead-of-print/em-2016-0018/em-2016-0018.xml?format=INT

We asked Hemkens and his colleagues to submit a response to the criticism. That report is undergoing statistical review at The BMJ. We will post the response shortly.

14 September 2017, Lars G Hemkens, senior researcher, Despina G Contopoulos-Ioannidis, John P A Ioannidis:

The arguments and analyses of Franklin et al. [1] are flawed and misleading. . . . It is trivial that the direction of comparisons is essential in meta-epidemiological research comparing analytic approaches. It is also essential that there must be a rule for consistent coining of the direction of comparisons. The fact that there are theoretically multiple ways to define such rules and apply the ratio-of-odds ratio method doesn’t invalidate the approach in any way. . . . We took in our study the perspective of clinicians facing new evidence, having no randomized trials, and having to decide whether they use a new promising treatment. In this situation, a treatment would be seen as promising when there are indications for beneficial effects in the RCD-study, which we defined as having better survival than the comparator (that is a OR < 1 for mortality in the RCD-study) . . . it is the only reasonable and useful selection rule in real life . . . The theoretical simulation of Franklin et al. to make all relative risk estimates <1 in RCTs makes no sense in real life and is without any relevance for patient care or health-care decision making. . . . Franklin et al. included in their analysis a clinical question where both subsequent trials were published simultaneously making it impossible to clearly determine which one is the first (Gnerlich 2007). Franklin et al. selected the data which better fit to their claim. . . .

21 September 2017, Susan Gruber, Biostatistician:

The rapid response of Hemkens, Contopoulos-Ioannidis, and Ioannidis overlooks the fact that a metric of comparison can be systematic, transparent, replicable, and also wrong. Franklin et al. clearly explains and demonstrates that inverting the OR based on RCD study result (or on the RCT result) yields a misleading statistic. . . .

02 October 2017, Jessica M. Franklin, Assistant Professor of Medicine, Sara Dejene, Krista F. Huybrechts, Shirley V. Wang, Martin Kulldorff, and Kenneth J. Rothman:

In a recent paper [1], we provided mathematical proof that the inversion rule used in the analysis of Hemkens et al. [2] results in positive bias of the pooled relative odds ratio . . . In their response, Hemkens et al [3] do not address this core statistical problem with their analysis. . . .

We applaud the transparency with which Hemkens et al reported their analyses, which allowed us to replicate their findings independently as well as to illustrate the inherent bias in their statistical method. Our paper was originally submitted to BMJ, as recently revealed by a journal editor [4], and it was reviewed there by two prominent biostatisticians and an epidemiologist. All three reviewers recognized that we had described a fundamental flaw in the statistical approach invented and used by Hemkens et al. We believe that everyone makes mistakes, and acknowledging an honest mistake is a badge of honor. Thus, based on our paper and those three reviews, we expected Hemkens et al. and the journal editors simply to acknowledge the problem and to retract the paper. Their reaction to date is disappointing.

13 November 2017, José G Merino, US Research Editor, Elizabeth Loder, Head of Research, The BMJ:

We acknowledge receipt of this letter that includes a request for retraction of the paper. We take this request very seriously. Before we make a decision on this request, we – The BMJ’s editors and statisticians – are reviewing all the available information. We hope to reach a decision that will maintain the integrity of the scientific literature, acknowledge legitimate differences of opinion about the methods used in the analysis of data, and is fair to all the participants in the debate. We will post a rapid response once we make a decision on this issue.

The discussion also includes contributions from others on unrelated aspects of the problem; here I’m focusing on the Franklin et al. critique and the Hemkens et al. paper.

Good on ya, BMJ

I love how the BMJ is handling this. The discussion is completely open, and the journal editor is completely non-judgmental. All so much better than my recent experience with the Association for Psychological Science, where the journal editor brushed me off in a polite but content-free way, and then the chair of the journal’s publication board followed up with some gratuitous rudeness. The BMJ is doing it right, and the psychology society has a few things to learn from them.

Also, just to make my position on this clear: I don’t see why anyone would think the Hemkens et al. paper should be retracted; a link to the criticisms would seem to be enough.

Just last week I got an email from someone who thought that our conclusion in our Epi Methods paper that use of the pooled ROR without inversion is “just as flawed” was too strong. I think they are right, so we will now be preparing a correction to our paper to modify this statement. So the circle of post-publication peer review continues…

Yes, exactly!

### «smooth» package for R. Common ground. Part II. Estimators

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

### A bit about estimates of parameters

Hi everyone! Today I want to tell you about parameter estimation in smooth functions. But before going into details, there are several things that I want to note. In this post we will discuss bias, efficiency and consistency of parameter estimates, so I will use phrases like “efficient estimator”, implying that we are talking about some optimisation mechanism that gives efficient estimates of parameters. People without a statistical background may find it hard to understand what the hell is going on and why we should care, so I decided to give a brief explanation. Although there are strict statistical definitions of the aforementioned terms (you can easily find them in Wikipedia or anywhere else), I do not want to copy-paste them here, because there are only a couple of important points worth mentioning in our context. So, let’s get started.

Bias refers to the expected difference between the estimated value of a parameter (on a specific sample) and the “true” one. Having unbiased estimates of parameters is important because they should lead to more accurate forecasts (at least in theory). For example, if an estimated parameter is equal to zero while in fact it should be 0.5, then the model does not take the provided information into account correctly and as a result will produce less accurate point forecasts and incorrect prediction intervals. In inventory management this may mean that we constantly order 100 units less than needed only because the parameter is lower than it should be.

Efficiency means that, as the sample size increases, the estimated parameters will not change substantially: they will vary in a narrow range (the variance of the estimates will be small). With inefficient estimates, increasing the sample size from 50 to 51 observations may change a parameter from 0.1 to, let’s say, 10. This is bad because the values of parameters usually influence both point forecasts and prediction intervals, so the inventory decision may differ radically from day to day. For example, we may decide that we urgently need 1000 units of product on Monday, and order them, just to realise on Tuesday that we only need 100. Obviously this is an exaggeration, but no one wants to deal with such an erratic stocking policy, so we need efficient estimates of parameters.

Consistency means that our estimates of parameters get closer to the stable values (what statisticians would refer to as the “true” values) as the sample size increases. This is important because otherwise the estimates of parameters will diverge and become less and less realistic. This once again influences both point forecasts and prediction intervals, which will be less meaningful than they should have been. In a way, consistency means that with the increase of the sample size the parameter estimates become more efficient and less biased, which in turn means that the more observations we have, the better. There is a prejudice in the world of practitioners that the situation in the market changes so fast that old observations become useless very quickly. As a result many companies just throw away the old data. Although the statement about market changes is, in general, true, forecasters tend to work with models that take this into account (e.g. Exponential smoothing, ARIMA) and adapt to potential changes. So we may benefit from the old data because it allows us to get more consistent estimates of parameters. Just keep in mind that you can always remove the annoying bits of data, but you can never un-throw away the data.
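As a toy illustration of consistency (my own Python sketch, nothing to do with the smooth internals): the error of a sample mean computed on ever-larger portions of a series tends to shrink towards zero, which is exactly why keeping the old data helps.

```python
import numpy as np

# A toy check of consistency: estimates of a mean computed on ever-larger
# samples settle near the true value as the sample size grows.
rng = np.random.default_rng(0)
true_mean = 0.5
data = rng.normal(true_mean, 1.0, 100_000)

for n in (50, 500, 5_000, 50_000):
    est = data[:n].mean()
    print(n, round(abs(est - true_mean), 4))  # error tends to shrink with n
```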

Having clarified these points, we can proceed to the topic of today’s post.

### One-step-ahead estimators of smooth functions

We already know that the default estimator used for

smooth

functions is Mean Squared Error (for the one-step-ahead forecast). If the residuals are distributed normally / log-normally, then the minimum of MSE gives estimates that also maximise the respective likelihood function. As a result the estimates of parameters become nice: consistent and efficient. It is also known in statistics that the minimum of MSE gives mean estimates of the parameters, which means that MSE also produces unbiased estimates of parameters (if the model is specified correctly and bla-bla-bla). This works very well when we deal with symmetric distributions of random variables, but it may perform poorly otherwise.
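The claim that the MSE minimum corresponds to a mean estimate is easy to verify numerically. A minimal Python sketch (my own illustration, unrelated to the smooth internals): fit a constant to a skewed sample by grid-searching the MSE.

```python
import numpy as np

# A quick numerical check: with a constant "model", the value minimising
# MSE over a sample is the sample mean.
rng = np.random.default_rng(1)
sample = rng.exponential(2.0, 1_000)           # deliberately skewed data

cs = np.linspace(sample.min(), sample.max(), 2_001)
mse_vals = ((sample[None, :] - cs[:, None]) ** 2).mean(axis=1)
c_opt = cs[np.argmin(mse_vals)]

print(c_opt, sample.mean())                    # the two nearly coincide
```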

In this post we will use the series N1823 for our examples:

library(Mcomp)
x <- ts(c(M3$N1823$x,M3$N1823$xx),frequency=frequency(M3$N1823$x))

Plot the data in order to see what we have:

plot(x)

N1823 series

The data seems to have slight multiplicative seasonality which changes over time, but it is hard to say for sure. Anyway, in order to simplify things, we will apply an ETS(A,A,N) model to this data, so we can see how the different estimators behave. We will withhold 18 observations, as is usually done for monthly data in M3.

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T)

N1823 and ETS(A,A,N) with MSE

Here’s the output:

Time elapsed: 0.08 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.147 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 629.249
Cost function type: MSE; Cost function value: 377623.069

Information criteria:
AIC     AICc      BIC
1703.389 1703.977 1716.800
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -14%; Bias: -74.1%; MAPE: 16.8%; SMAPE: 15.1%
MASE: 0.855; sMAE: 13.4%; RelMAE: 1.047; sMSE: 2.4%

It is hard to draw any reasonable conclusions from the graph and the output, but it seems that we slightly overforecast the data. At least the prediction interval covers all the values in the holdout. Relative MAE is equal to 1.047, which means that the model produced forecasts less accurate than Naive. Let’s have a look at the residuals:

qqnorm(resid(ourModel))
qqline(resid(ourModel))

QQ-plot of the residuals from ETS(A,A,N) with MSE

The residuals of this model do not look normal; a lot of empirical quantiles are far from the theoretical ones. If we conduct the Shapiro-Wilk test, then we have to reject the hypothesis of normality of the residuals at the 5% level:

shapiro.test(resid(ourModel))
> p-value = 0.001223

This may indicate that other estimators may do a better job. And there is a magical parameter

cfType

in the

smooth

functions which allows estimating models differently. It defines which cost function to use in the optimisation. You can select the following estimators instead of MSE:

• MAE – Mean Absolute Error:

\label{eq:MAE}
\text{MAE} = \frac{1}{T} \sum_{t=1}^T |e_{t+1}|

The minimum of MAE gives median estimates of the parameters. MAE is considered a more robust estimator than MSE, so if you have an asymmetric distribution, give MAE a try. It gives consistent, but not efficient, estimates of parameters. Asymptotically, if the distribution of the residuals is normal, the MAE estimates converge to the MSE ones (which follows from the connection between the mean and the median of the normal distribution).
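The median property is just as easy to check numerically. The sketch below (again my own Python illustration) repeats the constant-fit experiment with MAE and recovers the sample median instead of the mean, so on skewed data the two estimators genuinely disagree.

```python
import numpy as np

# With a constant "model", the value minimising MAE is the sample median.
rng = np.random.default_rng(2)
sample = rng.exponential(2.0, 1_001)           # skewed: mean > median

cs = np.linspace(sample.min(), sample.max(), 2_001)
mae_vals = np.abs(sample[None, :] - cs[:, None]).mean(axis=1)
c_opt = cs[np.argmin(mae_vals)]

print(c_opt, np.median(sample), sample.mean())  # minimiser sits at the median
```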

Let’s see what happens with the same model, on the same data when we use MAE:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MAE")

N1823 and ETS(A,A,N) with MAE

Time elapsed: 0.09 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.101 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 636.546
Cost function type: MAE; Cost function value: 462.675

Information criteria:
AIC     AICc      BIC
1705.879 1706.468 1719.290
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -5.1%; Bias: -32.1%; MAPE: 12.9%; SMAPE: 12.4%
MASE: 0.688; sMAE: 10.7%; RelMAE: 0.842; sMSE: 1.5%

There are several things to note from the graph and the output. First, the smoothing parameter alpha is smaller than in the previous case. Second, Relative MAE is smaller than one, which means that the model in this case outperformed Naive. Comparing this value with the one from the previous model, we can conclude that MAE worked well as an estimator of parameters for this data. Finally, the graph shows that the point forecasts go through the middle of the holdout sample, which is reflected in the lower values of the error measures. The residuals are still not normally distributed, but this is expected, because they won’t become normal just because we used a different estimator:

QQ-plot of the residuals from ETS(A,A,N) with MAE

• HAM – Half Absolute Moment:

\label{eq:HAM}
\text{HAM} = \frac{1}{T} \sum_{t=1}^T \sqrt{|e_{t+1}|}

This is an even more robust estimator than MAE. On count data its minimum corresponds to the mode of the data. In the case of continuous data the minimum of this estimator corresponds to something not yet well studied, between the mode and the median. The paper about this thing is currently in a draft stage, but you can already give it a try if you want. This is also a consistent, but not efficient, estimator.
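To give some intuition about where the HAM minimum lies, here is a small Python sketch (my own illustration of the objective, not the smooth implementation). Because sqrt(|e|) is concave away from zero, the objective is concave between observations, so its minimum is attained at one of the data points and a search over the sample suffices.

```python
import numpy as np

# The HAM objective sum(sqrt(|e|)) evaluated at every data point; on
# right-skewed data its minimiser typically lands below the median,
# towards the mode.
rng = np.random.default_rng(3)
sample = rng.lognormal(0.0, 1.0, 2_000)        # mode ~ 0.37, median ~ 1

ham_vals = np.sqrt(np.abs(sample[None, :] - sample[:, None])).mean(axis=1)
c_opt = sample[np.argmin(ham_vals)]

print(c_opt, np.median(sample), sample.mean())
```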

The same example, the same model, but HAM as an estimator:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="HAM")

N1823 and ETS(A,A,N) with HAM

Time elapsed: 0.06 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.001 0.001
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 666.439
Cost function type: HAM; Cost function value: 19.67

Information criteria:
AIC     AICc      BIC
1715.792 1716.381 1729.203
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.7%; Bias: -14.1%; MAPE: 11.4%; SMAPE: 11.4%
MASE: 0.63; sMAE: 9.8%; RelMAE: 0.772; sMSE: 1.3%

This estimator produced even more accurate forecasts in this example, forcing the smoothing parameters to become close to zero. Note that the residuals standard deviation in the case of HAM is larger than in the case of MAE, which in turn is larger than in the case of MSE. This means that one-step-ahead parametric and semiparametric prediction intervals will be widest for HAM, narrower for MAE and narrowest for MSE. However, given that the smoothing parameters in the last model are close to zero, the multiple-steps-ahead prediction intervals of HAM may be narrower than the ones of MSE.

Finally, it is worth noting that the optimisation of models using different estimators takes different time. MSE is the slowest, while HAM is the fastest estimator. I don’t have a detailed explanation of this, but it obviously happens because of the shapes of the cost function surfaces. So if you are in a hurry and need to estimate something somehow, you can give HAM a try. Just remember that information criteria may become inapplicable in this case.

### Multiple-steps-ahead estimators of smooth functions

While the three estimators above are calculated based on the one-step-ahead forecast error, the next three are based on multiple-steps-ahead forecast errors. They can be useful if you want to have a more stable and “conservative” model (a paper on this topic is currently in its final stage). Prior to v2.2.1 these estimators had different names, so be aware!

• MSE$$_h$$ – Mean Squared Error for h-steps ahead forecast:

\label{eq:MSEh}
\text{MSE}_h = \frac{1}{T} \sum_{t=1}^T e_{t+h}^2

The idea of this estimator is very simple: if you are interested in 5-steps-ahead forecasts, then optimise over this horizon, not one step ahead. However, by using this estimator we shrink the smoothing parameters towards zero, forcing the model to become closer to deterministic and robust to outliers. This applies both to ETS and ARIMA, although the models behave slightly differently. The effect of shrinkage increases with the increase of $$h$$. The forecast accuracy may increase for that specific horizon, but it will almost surely decrease for all the other horizons. Keep in mind that this is, in general, a biased and inefficient estimator with a much slower convergence to the true values than the one-step-ahead estimators. This estimator is eventually consistent, but it may need a very large sample to become so. This means that it may produce parameter values very close to zero even when larger values are warranted by the data. I personally would advise using this thing on large samples (for instance, on high frequency data). By the way, Nikos Kourentzes, Rebecca Killick and I are working on a paper on that topic, so stay tuned.
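A minimal sketch of what this cost function measures, using simple exponential smoothing as the model (my own Python illustration; the function and variable names are mine, and smooth's internals differ). SES produces flat multi-step forecasts, so the h-step error compares y[t+h] with the level fitted at time t.

```python
import numpy as np

# MSE_h sketch: average squared h-steps-ahead in-sample forecast errors.
def fitted_levels(y, alpha, l0):
    levels = np.empty(len(y))
    prev = l0
    for i, obs in enumerate(y):
        prev = alpha * obs + (1 - alpha) * prev
        levels[i] = prev
    return levels

def mse_h(y, alpha, l0, h):
    levels = fitted_levels(y, alpha, l0)
    errors = y[h:] - levels[:-h]               # e_{t+h} = y_{t+h} - l_t
    return float(np.mean(errors ** 2))

y = np.arange(1.0, 101.0)                      # trend data: flat forecasts lag behind
print(mse_h(y, 0.5, y[0], 1), mse_h(y, 0.5, y[0], 5))  # 5-step cost is far larger
```

On this trending series the 5-step cost dwarfs the 1-step cost, which is why optimising it pulls the parameters in a different direction than the usual MSE.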

Here’s what happens when we use this estimator:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MSEh")

N1823 and ETS(A,A,N) with MSEh

Time elapsed: 0.24 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0     0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 657.781
Cost function type: MSEh; Cost function value: 550179.34

Information criteria:
AIC     AICc      BIC
30393.86 30404.45 30635.25
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -10.4%; Bias: -62%; MAPE: 14.9%; SMAPE: 13.8%
MASE: 0.772; sMAE: 12.1%; RelMAE: 0.945; sMSE: 1.8%

As you can see, the smoothing parameters are now equal to zero, which gives us the straight line going through all the data. If we had 1008 observations instead of 108, the parameters would not be shrunk to zero, because the model would need to adapt to changes in order to minimise the respective cost function.

• TMSE – Trace Mean Squared Error:

The need for having a specific 5 steps ahead forecast is not common, so it makes sense to work with something that deals with one to h steps ahead:
\label{TMSE}
\text{TMSE} = \sum_{j=1}^h \frac{1}{T} \sum_{t=1}^T e_{t+j}^2

This estimator is more reasonable than MSE$$_h$$ because it takes into account all the errors from one to h steps ahead. This is desired behaviour in inventory management, because we are not so much interested in how much we will sell tomorrow or next Monday, but rather in how much we will sell from tomorrow until next Monday. However, the variance of h-steps-ahead forecast errors is usually larger than the variance of one-step-ahead errors (because of the increasing uncertainty), which leads to the effect of “masking”: the latter are hidden behind the former. As a result, if we use TMSE as the estimator, the final values are influenced more by the long-term errors than by the short-term ones (see the Taieb and Atiya, 2015 paper). This estimator is not recommended if short-term forecasts are more important than long-term ones. Plus, this is still a less efficient and more biased estimator than the one-step-ahead estimators, with slow convergence to the true values, similar to MSE$$_h$$ but slightly better.
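The definition is easy to compute directly. The sketch below (my own Python illustration, with a naive flat forecast standing in for a fitted model) shows both that TMSE is just the sum of the per-horizon MSEs and the masking effect: the long horizons contribute most of the total.

```python
import numpy as np

# TMSE sketch with a naive "last observed value" forecast.
rng = np.random.default_rng(4)
y = np.arange(200.0) + rng.normal(0, 1, 200)   # trend + noise

def mse_j(y, j):
    errors = y[j:] - y[:-j]                    # e_{t+j} with a flat forecast
    return float(np.mean(errors ** 2))

h = 6
per_horizon = [mse_j(y, j) for j in range(1, h + 1)]
tmse = sum(per_horizon)

print([round(m, 1) for m in per_horizon])      # grows roughly like j^2
print(round(per_horizon[-1] / tmse, 2))        # the longest horizon dominates
```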

This is what happens in our example:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="TMSE")

N1823 and ETS(A,A,N) with TMSE

Time elapsed: 0.2 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.075 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 633.48
Cost function type: TMSE; Cost function value: 7477097.717

Information criteria:
AIC     AICc      BIC
30394.36 30404.94 30635.75
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -7.5%; Bias: -48.9%; MAPE: 13.4%; SMAPE: 12.6%
MASE: 0.704; sMAE: 11%; RelMAE: 0.862; sMSE: 1.5%

Comparing the model estimated using TMSE with the same one estimated using MSE and MSE$$_h$$, it is worth noting that the smoothing parameters in this model are greater than in case of MSE$$_h$$, but less than in case of MSE. This demonstrates that there is a shrinkage effect in TMSE, forcing the parameters towards zero, but the inclusion of one-step-ahead errors makes model slightly more flexible than in case of MSE$$_h$$. Still, it is advised to use this estimator on large samples, where the estimates of parameters become more efficient and less biased.

• GTMSE – Geometric Trace Mean Squared Error:

This is similar to TMSE, but derived from the so-called Trace Forecast Likelihood (which I may discuss at some point in one of the future posts). The idea here is to take logarithms of each MSE$$_j$$ and then sum them up:
\label{eq:GTMSE}
\text{GTMSE} = \sum_{j=1}^h \log \left( \frac{1}{T} \sum_{t=1}^T e_{t+j}^2 \right)

Logarithms make the variances of errors several steps ahead closer to each other. For example, if the variance of the one-step-ahead error is equal to 100 and the variance of the 10-steps-ahead error is equal to 1000, then their logarithms will be 4.6 and 6.9 respectively. As a result, when GTMSE is used as an estimator, the model takes into account both short- and long-term errors, so this is a more balanced estimator of parameters than MSE$$_h$$ and TMSE. This estimator is more efficient than both TMSE and MSE$$_h$$ because of the log-scale, and it converges to the true values faster than the previous two, but it can still be biased on small samples.
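The numbers in the example are easy to verify: the variances differ by a factor of ten, but their logs differ far less, so no single horizon can drown out the others in the GTMSE sum.

```python
import math

# Log-compression of the per-horizon variances from the text.
print(round(math.log(100), 1), round(math.log(1000), 1))   # 4.6 6.9
```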

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="GTMSE")

N1823 and ETS(A,A,N) with GTMSE

Time elapsed: 0.18 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0     0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 649.253
Cost function type: GTMSE; Cost function value: 232.419

Information criteria:
AIC     AICc      BIC
30402.77 30413.36 30644.16
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -8.2%; Bias: -53.8%; MAPE: 13.8%; SMAPE: 12.9%
MASE: 0.72; sMAE: 11.3%; RelMAE: 0.882; sMSE: 1.6%

In our example GTMSE shrinks both smoothing parameters towards zero and makes the model deterministic, which corresponds to the MSE$$_h$$ case. However, the initial values are slightly different, which leads to slightly different forecasts. Once again, it is advised to use this estimator on large samples.

Keep in mind that all these multiple-steps-ahead estimators take more time to calculate, because the model needs to produce h-steps-ahead forecasts from each observation in the sample.

• Analytical multiple steps ahead estimators.

There is also a non-documented feature in

smooth

functions (currently available only for pure additive models) – analytical versions of the multiple-steps-ahead estimators. In order to use them, we need to add “a” in front of the desired estimator: aMSE$$_h$$, aTMSE, aGTMSE. In this case only the one-step-ahead forecast error is produced, and the structure of the applied state-space model is then used to reconstruct the multiple-steps-ahead estimators. This feature is useful if you want to use a multiple-steps-ahead estimator on a small sample, where the multi-step errors cannot be calculated appropriately. It is also useful on large samples, when estimation time matters.

These estimators have similar properties to their empirical counterparts, but work faster and are based on asymptotic properties. Here is an example of analytical MSE$$_h$$ estimator:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="aMSEh")

N1823 and ETS(A,A,N) with aMSEh

Time elapsed: 0.11 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0     0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 627.818
Cost function type: aMSEh; Cost function value: 375907.976

Information criteria:
AIC     AICc      BIC
30652.15 30662.74 30893.55
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.9%; Bias: -14.6%; MAPE: 11.7%; SMAPE: 11.6%
MASE: 0.643; sMAE: 10%; RelMAE: 0.787; sMSE: 1.3%

The resulting smoothing parameters are shrunk towards zero, similar to MSE$$_h$$, but the initial values are slightly different, which leads to different forecasts. Note that the time elapsed in this case is 0.11 seconds instead of 0.24 as in MSE$$_h$$. The difference in time may increase with the increase of sample size and forecasting horizon.
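To illustrate what “reconstructing the estimator from the model structure” means, here is a sketch for the simplest pure additive case, ETS(A,N,N), i.e. the local level model. The formula below is the textbook h-step-ahead forecast error variance for that model; the actual smooth implementation may differ.

```python
# Analytical h-step MSE for ETS(A,N,N): the one-step residual variance plus
# the variance accumulated through the level updates along the horizon.
def a_mse_h(sigma2, alpha, h):
    # Var(e_{t+h}) = sigma^2 * (1 + (h - 1) * alpha^2) for ETS(A,N,N)
    return sigma2 * (1 + (h - 1) * alpha ** 2)

print(a_mse_h(1.0, 0.3, 1))            # one step ahead it is just sigma^2
print(round(a_mse_h(1.0, 0.3, 5), 2))  # uncertainty grows with the horizon
print(a_mse_h(1.0, 0.0, 5))            # alpha = 0: nothing accumulates
```

Only one-step residuals are needed to evaluate this, which is exactly why the analytical versions work on small samples and run fast.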

• Similar to MSE, there are empirical multi-step MAE and HAM in
smooth

functions (e.g. MAE$$_h$$ and THAM). However, they are currently implemented mainly “because I can” and for fun, so I cannot give you any recommendations about them.

### Conclusions

Now that we discussed all the possible estimators that you can use with

smooth

, you are most probably confused and completely lost. The question that may naturally appear after you have read this post is: “What should I do?” Frankly speaking, I cannot give you an appropriate answer or a set of universal recommendations, because this is still an under-researched problem. However, I have some advice.

First, Nikos Kourentzes and Juan Ramon Trapero found that in the case of high frequency data (they used solar irradiation data) using MSE$$_h$$ and TMSE leads to an increase in forecasting accuracy in comparison with the conventional MSE. However, in order to achieve good accuracy with MSE$$_h$$ you need to estimate $$h$$ separate models, while with TMSE you need to estimate only one. So TMSE is faster than MSE$$_h$$, but at the same time leads to forecasts at least as accurate as those of MSE$$_h$$ for all the steps from 1 to h.

Second, if you have asymmetrically distributed residuals in the model after using MSE, give MAE or HAM a try – they may improve your model and its accuracy.

Third, the analytical counterparts of the multi-step estimators can be useful in one of two situations: 1. When you deal with very large samples (e.g. high frequency data) and want advanced estimation methods that still work fast. 2. When you work with a small sample, but want to use the properties of these estimators anyway.

Finally, don’t use MSE$$_h$$, TMSE and GTMSE if you are interested in the values of parameters of models – they will almost surely be inefficient and biased. This applies to both ETS and ARIMA models, which will become close to their deterministic counterparts in this case. Use conventional MSE instead.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### New Poll: Which Data Science / Machine Learning methods and tools you used?

Please vote in new KDnuggets poll which examines the methods and tools used for a real-world application or project.

### Automated Feature Engineering for Time Series Data

We introduce a general framework for developing time series models, generating features and preprocessing the data, and exploring the potential to automate this process in order to apply advanced machine learning algorithms to almost any time series problem.

### Virginia Tech: Assistant Professors in Data Analytics

Seeking applicants for two tenure-track assistant professor positions in data analytics, with research depth and breadth in data analytics, data mining, machine learning, deep learning, artificial intelligence, and related domains.

### IÉSEG School of Management: Assistant, Associate or Full Professor in Marketing Analytics

Seeking candidates whose teaching and research interests are related to marketing analytics. Applicants should possess a PhD and be able to provide evidence of publications (and/or demonstrate the potential to publish) in reputable academic journals.

### Dataiku 4.1.0: More support for R users!

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

# Introduction

Recently, Dataiku 4.1.0 was released; it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are:

Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects.

Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code.

Deployment of data science products: As a data scientist you can produce many interesting deliverables, e.g. graphs, data transformations, analyses, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web apps, model APIs, productionized model APIs and data pipelines.

There is a free version which already contains a lot of useful features, and there is a paid version with “enterprise features“. See the Dataiku website for more info.

# Improved R Support in 4.1.0

Among the many new features, the one that interests me the most as an R user is the improved support for R. Previous versions of Dataiku already had some support for R; this version adds the following improvements. There is now support for:

## R Code environments

In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages.

In Dataiku, whenever you make use of R – in R recipes, Shiny, R Markdown or when creating R APIs – you can select a specific R code environment to use.

## R Markdown reports & Shiny applications

If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and Shiny apps become part of the Dataiku environment, and so:

• They are managed in the environment. You will have a good overview of all reports and apps and see who has created/edited them.
• You can make use of all data that is already available in the Dataiku environment.
• Moreover, the resulting reports and Shiny apps can be embedded inside Dataiku dashboards.

The figure above shows an R Markdown report in Dataiku; the interface provides a nice way to edit the report, alter settings, and publish the report. Below is an example dashboard in Dataiku with a Markdown and a Shiny report.

## Creating R APIs

Once you have created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API; in fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API. Moreover, the Dataiku API Node, which can be installed on a (separate) production server, imports the R models that you have created in the GUI and can take care of load balancing, high availability, and scaling of real-time scoring.

The following three figures give you an overview of how easy it is to work with the R API functionality.

First, define an API endpoint and R (prediction) function.

Then, define the R function; it can make use of data in Dataiku, R objects created earlier, or any R library you need.

Next, test and deploy the R function. Dataiku provides a handy interface to test your function/API.

Finally, once you are satisfied with the R API, you can package it; that package can then be imported on a production server with the Dataiku API Node installed, from which you can serve API requests.

# Conclusion

The new Dataiku 4.1.0 version has a lot to offer anyone involved in a data science project. The system already had wide-ranging support for Python; now, with the improved support for R, it is even more useful to a very large group of data scientists.

Cheers, Longhow.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### DataScience.com Adds Former U.S. Chief Data Scientist DJ Patil to Advisory Board

Former U.S. Chief Data Scientist DJ Patil will be lending his expertise to DataScience.com’s product, engineering, and R&D teams as they expand the features of the company’s enterprise data science platform.

### Save the Date: PyImageConf 2018

Imagine taking the practical, hands-on teaching style of the PyImageSearch blog…

…and translating it to a live, in person conference.

Sound interesting?

If so, mark the date on your calendar now:

PyImageConf is taking place on August 26th-28th 2018 at the San Francisco Hyatt Regency.

Confirmed speaker/workshop hosts include:

• François Chollet (creator of Keras, researcher at Google)
• Katherine Scott (Planet Labs/co-author of SimpleCV)
• Davis King (creator of dlib)
• Satya Mallick (author at LearnOpenCV)
• Joseph Howse (author of six computer vision/OpenCV books)
• Adam Geitgey (author of Machine Learning is Fun! series at Medium)
• Jeff Bass (Raspberry Pi hacker)
• …and more to come!

## Save the Date: PyImageConf 2018

PyImageConf is a practical, hands-on conference featuring computer vision, deep learning, and OpenCV experts.

If you want to rub elbows, make connections, and learn from the leading computer vision and deep learning experts (and hang out with us in San Francisco for a few days)…you’ll want to get on the wait list now.

### Who is speaking/presenting?

Figure 1: PyImageConf 2018 speakers include Adrian Rosebrock, François Chollet, Katherine Scott, Davis King, Satya Mallick, Joseph Howse, Adam Geitgey, Jeff Bass, and more.

PyImageConf has put together the biggest names in computer vision, deep learning, and OpenCV education to give you the best possible live, hands-on training and talks.

Each speaker is well known for their writing, teaching, online courses, and contributions to open source projects.

Right now there are eight confirmed speakers/workshop hosts, with more to be added later.

If you’re looking for a computer vision/deep learning conference with the foremost educators and speakers, this is it.

### When?

PyImageConf will take place on August 26-28th 2018.

Tickets will go on sale in January 2018.

If you want a ticket make sure you claim your spot in line!

### Where?

The conference will be hosted in the newly remodeled Hyatt Regency San Francisco, right on the beautiful San Francisco Bay waterfront.

Talks will take place in their exquisite state-of-the-art ballroom and workshops/breakout sessions will utilize the hotel’s dedicated workspaces.

Take a look at the photos below to get a feel for the space — I’m sure you’ll agree that this will be the computer vision and deep learning event of the year:

Figure 1: PyImageConf 2018 will be held at the newly renovated Hyatt Regency in San Francisco, CA. The hotel is right on the San Francisco Bay waterfront and within walking distance of popular attractions, restaurants, and bars.

Figure 2: The conference itself will be held in the Grand Ballroom with state-of-the-art audio/visual equipment and reliable high-speed WiFi.

Figure 3: Along with the talks we’ll have a number of smaller, intimate workshops where you’ll learn techniques you can apply to your work and projects.

Figure 4: Registration and evening events will take place in the Hyatt Regency ballroom.

### How much?

I’m currently finalizing ticket prices with my conference coordinator, but expect them to be in the $700-$900 range.

This may seem like a lot, but keep in mind this includes two full days of talks, workshops, and training — most conferences would charge well over $1,500-$2,500 for this type of event.

If you’re a PyImageSearch Gurus member you will enjoy first dibs on the tickets (and a discount).

Take the time now to check with your stakeholders (spouse, kids, etc.) and earmark funds to grab a ticket.

### How many tickets will there be?

The conference will be small and intimate, capped at 200 attendees.

I’m purposely keeping the conference small to enable you to:

• Learn from the speakers and presenters
• Have 1-on-1 time with experts in computer vision and deep learning
• Better network with your peers and colleagues

Once the 200 tickets sell out, they’re all gone and I will not be adding more.

### Tickets will sell out…so get on the early bird list!

If you’re a longtime PyImageSearch reader I don’t need to tell you how fast these types of sales sell out.

A few years ago when I ran a Kickstarter campaign for the PyImageSearch Gurus course all early bird specials sold out within 30 minutes.

In January of this year I launched the Deep Learning for Computer Vision with Python Kickstarter — all early bird specials were claimed in under 15 minutes.

Given that there will only be 200 tickets available, I expect all tickets to sell out within 10 minutes of the tickets going live.

So, if you want a ticket to PyImageConf…

…make sure you click here and get on the early bird list!

## Summary

Here’s the TL;DR:

• PyImageConf will take place on August 26-28th 2018 at the San Francisco Hyatt Regency.
• I’m capping the conference at 200 attendees to keep the event small, intimate, and hands-on.
• Confirmed speakers include François Chollet (Keras), Davis King (dlib), Katherine Scott (SimpleCV/Planet Labs), Satya Mallick (LearnOpenCV), Adrian Rosebrock (that’s me!) and others.
• Tickets will go on sale in January 2018.
• If you want a ticket, you’ll need to claim your spot in line (tickets will sell out FAST).

The post Save the Date: PyImageConf 2018 appeared first on PyImageSearch.

### A pivotal episode in the unfolding of the replication crisis

Axel Cleeremans writes:

I appreciated your piece titled “What has happened down here is the winds have changed”. Your mini-history of what happened was truly enlightening — but you didn’t explicitly mention our failure to replicate Bargh’s slow walking effect. This was absolutely instrumental in triggering the replication crisis. As you know, the article was covered by the science journalist Ed Yong and came shortly after the Stapel affair. It was the first failure to replicate a classic priming effect that attracted so much attention. Yong’s blog post about it attracted a response from John Bargh and further replies from Yong, as you indirectly point to. But our article and the entire exchange between Yong and Bargh is also what triggered an extended email discussion involving many of the actors involved in this entire debate (including E. J. Wagenmakers, Hal Pashler, Fritz Strack and about 30 other people). That discussion was initiated by Daniel Kahneman after he and I discussed what to make of our failure to replicate Bargh’s findings. This email discussion continued for about two years and eventually resulted in further attempts to replicate, as they are unfolding now.

I was aware of the Bargh issue but I’d only read Wagenmakers (and Bargh’s own unfortunate writings) on the issue; I’d never followed up to read the original, so this is good to know. One thing I like about having these exchanges on a blog, rather than circulating emails, is that all the discussion is in one place and is open to all to read and participate.

### Top Stories, Nov 13-19: The 10 Statistical Techniques Data Scientists Need to Master; Best Online Masters in Data Science and Analytics – a comprehensive, unbiased survey

Also: A Day in the Life of a Data Scientist; Top 10 Videos on Deep Learning in Python; 8 Ways to Improve Your Data Science Skills in 2 Years; Machine Learning Algorithms: Which One to Choose for Your Problem; Top 10 Machine Learning Algorithms for Beginners

### Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Customer churn occurs when customers or subscribers stop doing business with a company or service; it is also known as customer attrition, or the loss of clients or customers. One industry in which churn rates are particularly useful is the telecommunications industry, because most customers have multiple options from which to choose within a geographic location.

Similar in concept to predicting employee turnover, we are going to predict customer churn using a telecom dataset. We will use Logistic Regression, Decision Tree, and Random Forest. But this time, we will do all of the above in R. Let’s get started!

## Data Preprocessing

The data was downloaded from IBM Sample Data Sets. Each row represents a customer; each column contains one of that customer’s attributes:

library(plyr)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(caret)
library(MASS)
library(randomForest)
library(party)

str(churn)
'data.frame':	7043 obs. of  21 variables:
 $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
 $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
 $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
 $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
 $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
 $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
 $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
 $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
 $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
 $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
 $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
 $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
 $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
 $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
 $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
 $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...

The tenure variable has been binned into tenure_group (used in the plots and models below), so the raw tenure column is removed:

churn$tenure <- NULL

## Exploratory Data Analysis and Feature Selection

Correlation between numeric variables:

numeric.var <- sapply(churn, is.numeric)
corr.matrix <- cor(churn[,numeric.var])
corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables", method="number")

Gives this plot:

Monthly Charges and Total Charges are correlated, so one of them will be removed from the model. We remove Total Charges.

churn$TotalCharges <- NULL


Bar plots of categorical variables

p1 <- ggplot(churn, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p2 <- ggplot(churn, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + xlab("Senior Citizen") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p3 <- ggplot(churn, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p4 <- ggplot(churn, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol=2)


Gives this plot:

p5 <- ggplot(churn, aes(x=PhoneService)) + ggtitle("Phone Service") + xlab("Phone Service") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p6 <- ggplot(churn, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + xlab("Multiple Lines") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p7 <- ggplot(churn, aes(x=InternetService)) + ggtitle("Internet Service") + xlab("Internet Service") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p8 <- ggplot(churn, aes(x=OnlineSecurity)) + ggtitle("Online Security") + xlab("Online Security") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p5, p6, p7, p8, ncol=2)


Gives this plot:

p9 <- ggplot(churn, aes(x=OnlineBackup)) + ggtitle("Online Backup") + xlab("Online Backup") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p10 <- ggplot(churn, aes(x=DeviceProtection)) + ggtitle("Device Protection") + xlab("Device Protection") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p11 <- ggplot(churn, aes(x=TechSupport)) + ggtitle("Tech Support") + xlab("Tech Support") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p12 <- ggplot(churn, aes(x=StreamingTV)) + ggtitle("Streaming TV") + xlab("Streaming TV") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p9, p10, p11, p12, ncol=2)


Gives this plot:

p13 <- ggplot(churn, aes(x=StreamingMovies)) + ggtitle("Streaming Movies") + xlab("Streaming Movies") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p14 <- ggplot(churn, aes(x=Contract)) + ggtitle("Contract") + xlab("Contract") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p15 <- ggplot(churn, aes(x=PaperlessBilling)) + ggtitle("Paperless Billing") + xlab("Paperless Billing") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p16 <- ggplot(churn, aes(x=PaymentMethod)) + ggtitle("Payment Method") + xlab("Payment Method") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p17 <- ggplot(churn, aes(x=tenure_group)) + ggtitle("Tenure Group") + xlab("Tenure Group") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p13, p14, p15, p16, p17, ncol=2)


Gives this plot:

All of the categorical variables seem to have a reasonably broad distribution; therefore, all of them will be kept for further analysis.

## Logistic Regression

First, we split the data into training and testing sets:

set.seed(2017)
intrain <- createDataPartition(churn$Churn, p=0.7, list=FALSE)
training <- churn[intrain,]
testing <- churn[-intrain,]

Confirm the splitting is correct:

dim(training); dim(testing)
[1] 4924   19
[1] 2108   19

Fitting the Logistic Regression Model

LogModel <- glm(Churn ~ ., family=binomial(link="logit"), data=training)
print(summary(LogModel))

Call:
glm(formula = Churn ~ ., family = binomial(link = "logit"), data = training)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0370  -0.6793  -0.2861   0.6590   3.1608

Coefficients:
                                      Estimate Std. Error z value Pr(>|z|)
(Intercept)                          -2.030149   1.008223  -2.014 0.044053 *
genderMale                           -0.039006   0.077686  -0.502 0.615603
SeniorCitizenYes                      0.194408   0.101151   1.922 0.054611 .
PartnerYes                           -0.062031   0.092911  -0.668 0.504363
DependentsYes                        -0.017974   0.107659  -0.167 0.867405
PhoneServiceYes                      -0.387097   0.788745  -0.491 0.623585
MultipleLinesYes                      0.345052   0.214799   1.606 0.108187
InternetServiceFiber optic            1.022836   0.968062   1.057 0.290703
InternetServiceNo                    -0.829521   0.978909  -0.847 0.396776
OnlineSecurityYes                    -0.393647   0.215993  -1.823 0.068379 .
OnlineBackupYes                      -0.113220   0.213825  -0.529 0.596460
DeviceProtectionYes                  -0.025797   0.213317  -0.121 0.903744
TechSupportYes                       -0.306423   0.220920  -1.387 0.165433
StreamingTVYes                        0.399209   0.395912   1.008 0.313297
StreamingMoviesYes                    0.338852   0.395890   0.856 0.392040
ContractOne year                     -0.805584   0.127125  -6.337 2.34e-10 ***
ContractTwo year                     -1.662793   0.216346  -7.686 1.52e-14 ***
PaperlessBillingYes                   0.338724   0.089407   3.789 0.000152 ***
PaymentMethodCredit card (automatic) -0.018574   0.135318  -0.137 0.890826
PaymentMethodElectronic check         0.373214   0.113029   3.302 0.000960 ***
PaymentMethodMailed check             0.001400   0.136446   0.010 0.991815
MonthlyCharges                       -0.005001   0.038558  -0.130 0.896797
tenure_group0-12 Month                1.858899   0.205956   9.026  < 2e-16 ***
tenure_group12-24 Month               0.968497   0.201452   4.808 1.53e-06 ***
tenure_group24-48 Month               0.574822   0.185500   3.099 0.001943 **
tenure_group48-60 Month               0.311790   0.200096   1.558 0.119185
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5702.8  on 4923  degrees of freedom
Residual deviance: 4094.4  on 4898  degrees of freedom
AIC: 4146.4

Number of Fisher Scoring iterations: 6

Feature Analysis

The top three most relevant features are Contract, tenure_group and PaperlessBilling.

anova(LogModel, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit
Response: Churn
Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                              4923     5702.8
gender            1     0.39      4922     5702.4 0.5318602
SeniorCitizen     1    95.08      4921     5607.3 < 2.2e-16 ***
Partner           1   107.29      4920     5500.0 < 2.2e-16 ***
Dependents        1    27.26      4919     5472.7 1.775e-07 ***
PhoneService      1     1.27      4918     5471.5 0.2597501
MultipleLines     1     9.63      4917     5461.8 0.0019177 **
InternetService   2   452.01      4915     5009.8 < 2.2e-16 ***
OnlineSecurity    1   183.83      4914     4826.0 < 2.2e-16 ***
OnlineBackup      1    69.94      4913     4756.1 < 2.2e-16 ***
DeviceProtection  1    47.58      4912     4708.5 5.287e-12 ***
TechSupport       1    82.78      4911     4625.7 < 2.2e-16 ***
StreamingTV       1     4.90      4910     4620.8 0.0269174 *
StreamingMovies   1     0.36      4909     4620.4 0.5461056
Contract          2   309.25      4907     4311.2 < 2.2e-16 ***
PaperlessBilling  1    14.21      4906     4297.0 0.0001638 ***
PaymentMethod     3    38.85      4903     4258.1 1.867e-08 ***
MonthlyCharges    1     0.10      4902     4258.0 0.7491553
tenure_group      4   163.67      4898     4094.4 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Analyzing the deviance table, we can see the drop in deviance when adding each variable one at a time. Adding InternetService, Contract and tenure_group significantly reduces the residual deviance. Other variables, such as PaymentMethod and Dependents, improve the model less even though they all have low p-values.

Assessing the predictive ability of the Logistic Regression model:

testing$Churn <- as.character(testing$Churn)
testing$Churn[testing$Churn=="No"] <- "0"
testing$Churn[testing$Churn=="Yes"] <- "1"
fitted.results <- predict(LogModel, newdata=testing, type='response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != testing$Churn)
print(paste('Logistic Regression Accuracy',1-misClasificError))
[1] "Logistic Regression Accuracy 0.796489563567362"
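The accuracy above comes from thresholding the predicted probabilities at 0.5 and comparing against the true labels. A minimal sketch of that logic, in Python for illustration (the probabilities and labels below are made up):

```python
# Threshold predicted probabilities at 0.5, then score against true labels.
probs = [0.12, 0.81, 0.46, 0.64, 0.09]  # hypothetical predicted P(churn)
truth = [0, 1, 1, 1, 0]                 # hypothetical actual labels

preds = [1 if p > 0.5 else 0 for p in probs]
accuracy = sum(p == t for p, t in zip(preds, truth)) / len(truth)
print(accuracy)
```

This mirrors the ifelse(fitted.results > 0.5, 1, 0) and mean(...) steps in the R code; choosing a different threshold trades off false positives against false negatives.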



Logistic Regression Confusion Matrix

print("Confusion Matrix for Logistic Regression"); table(testing$Churn, fitted.results > 0.5)
[1] "Confusion Matrix for Logistic Regression"

    FALSE TRUE
  0  1392  156
  1   273  287

Odds Ratio

One of the interesting performance measurements in logistic regression is the odds ratio. Basically, an odds ratio measures the odds of an event happening.

exp(cbind(OR=coef(LogModel), confint(LogModel)))
Waiting for profiling to be done...
                                            OR      2.5 %     97.5 %
(Intercept)                          0.1313160 0.01815817  0.9461855
genderMale                           0.9617454 0.82587632  1.1199399
SeniorCitizenYes                     1.2145919 0.99591418  1.4807053
PartnerYes                           0.9398537 0.78338071  1.1276774
DependentsYes                        0.9821863 0.79488224  1.2124072
PhoneServiceYes                      0.6790251 0.14466019  3.1878587
MultipleLinesYes                     1.4120635 0.92707245  2.1522692
InternetServiceFiber optic           2.7810695 0.41762316 18.5910286
InternetServiceNo                    0.4362582 0.06397364  2.9715699
OnlineSecurityYes                    0.6745919 0.44144273  1.0296515
OnlineBackupYes                      0.8929545 0.58709919  1.3577947
DeviceProtectionYes                  0.9745328 0.64144877  1.4805460
TechSupportYes                       0.7360754 0.47707096  1.1344691
StreamingTVYes                       1.4906458 0.68637788  3.2416264
StreamingMoviesYes                   1.4033353 0.64624171  3.0518161
ContractOne year                     0.4468271 0.34725066  0.5717469
ContractTwo year                     0.1896086 0.12230199  0.2861341
PaperlessBillingYes                  1.4031556 1.17798691  1.6725920
PaymentMethodCredit card (automatic) 0.9815977 0.75273387  1.2797506
PaymentMethodElectronic check        1.4523952 1.16480721  1.8145076
PaymentMethodMailed check            1.0014007 0.76673087  1.3092444
MonthlyCharges                       0.9950112 0.92252949  1.0731016
tenure_group0-12 Month               6.4166692 4.30371945  9.6544837
tenure_group12-24 Month              2.6339823 1.78095906  3.9256133
tenure_group24-48 Month              1.7768147 1.23988035  2.5676783
tenure_group48-60 Month              1.3658675 0.92383315  2.0261505

## Decision Tree

Decision Tree Visualization

For illustration purposes, we use only three variables to plot the decision tree: Contract, tenure_group and PaperlessBilling.

tree <- ctree(Churn ~ Contract + tenure_group + PaperlessBilling, training)
plot(tree, type='simple')

Gives this plot:

1. Of the three variables we use, Contract is the most important for predicting whether a customer churns.
2. If a customer has a one-year or two-year contract, he or she is less likely to churn, regardless of PaperlessBilling.
3. On the other hand, a customer on a month-to-month contract, in the 0-12 month tenure group, and using PaperlessBilling is more likely to churn.

Decision Tree Confusion Matrix

We use all the variables to produce the confusion matrix table and make predictions.

pred_tree <- predict(tree, testing)
print("Confusion Matrix for Decision Tree"); table(Predicted = pred_tree, Actual = testing$Churn)
[1] "Confusion Matrix for Decision Tree"
Actual
Predicted   No  Yes
No  1395  346
Yes  153  214



Decision Tree Accuracy

p1 <- predict(tree, training)
tab1 <- table(Predicted = p1, Actual = training$Churn)
tab2 <- table(Predicted = pred_tree, Actual = testing$Churn)
print(paste('Decision Tree Accuracy',sum(diag(tab2))/sum(tab2)))
[1] "Decision Tree Accuracy 0.763282732447818"
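The reported accuracy is just the diagonal of the confusion matrix divided by the total number of predictions. Recomputing it from the decision-tree confusion matrix shown above, sketched in Python for illustration:

```python
# Confusion matrix reported above for the decision tree on the test set:
# rows are predictions (No, Yes), columns are actual values (No, Yes).
matrix = [[1395, 346],
          [153,  214]]

correct = matrix[0][0] + matrix[1][1]    # diagonal: correctly classified
total = sum(sum(row) for row in matrix)  # all test-set predictions
print(correct / total)  # ≈ 0.7633, matching the accuracy above
```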



The Decision Tree’s accuracy (0.763) is actually a bit lower than Logistic Regression’s (0.796). Let’s see if we can do better using Random Forest.

## Random Forest

Random Forest Initial Model

rfModel <- randomForest(Churn ~., data = training)
print(rfModel)
Call:
randomForest(formula = Churn ~ ., data = training)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4

OOB estimate of  error rate: 20.65%
Confusion matrix:
No Yes class.error
No  3245 370   0.1023513
Yes  647 662   0.4942704



The error rate is relatively low when predicting “No”, and the error rate is much higher when predicting “Yes”.
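The class.error column in the OOB confusion matrix is simply the fraction of each actual class that was misclassified. Recomputing both values from the matrix above, sketched in Python for illustration:

```python
# OOB confusion matrix reported above: rows are actual classes.
no_correct, no_wrong = 3245, 370    # actual No:  predicted No, predicted Yes
yes_wrong, yes_correct = 647, 662   # actual Yes: predicted No, predicted Yes

print(round(no_wrong / (no_correct + no_wrong), 7))     # error on "No"
print(round(yes_wrong / (yes_wrong + yes_correct), 7))  # error on "Yes"
```

The asymmetry reflects the class imbalance: "No" is the majority class, so the forest defaults toward it and misses roughly half of the churners.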

Random Forest Prediction and Confusion Matrix

pred_rf <- predict(rfModel, testing)
caret::confusionMatrix(pred_rf, testing$Churn)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1381  281
       Yes  167  279

               Accuracy : 0.7875
                 95% CI : (0.7694, 0.8048)
    No Information Rate : 0.7343
    P-Value [Acc > NIR] : 9.284e-09

                  Kappa : 0.4175
 Mcnemar's Test P-Value : 9.359e-08

            Sensitivity : 0.8921
            Specificity : 0.4982
         Pos Pred Value : 0.8309
         Neg Pred Value : 0.6256
             Prevalence : 0.7343
         Detection Rate : 0.6551
   Detection Prevalence : 0.7884
      Balanced Accuracy : 0.6952

       'Positive' Class : No

Random Forest Error Rate

plot(rfModel)

Gives this plot:

We use this plot to help determine the number of trees. As the number of trees increases, the OOB error rate decreases, then becomes almost constant. We are not able to decrease the OOB error rate after about 100 to 200 trees.

Tune Random Forest Model

t <- tuneRF(training[, -18], training[, 18], stepFactor = 0.5, plot = TRUE, ntreeTry = 200, trace = TRUE, improve = 0.05)

Gives this plot:

We use this plot to choose mtry. The OOB error rate is lowest when mtry is 2, so we choose mtry = 2.

Fit the Random Forest Model After Tuning

rfModel_new <- randomForest(Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE)
print(rfModel_new)

Call:
 randomForest(formula = Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.7%
Confusion matrix:
      No Yes class.error
No  3301 314   0.0868603
Yes  656 653   0.5011459

The OOB error rate decreased to 19.7% from 20.65% earlier.

Random Forest Predictions and Confusion Matrix After Tuning

pred_rf_new <- predict(rfModel_new, testing)
caret::confusionMatrix(pred_rf_new, testing$Churn)
Confusion Matrix and Statistics

Reference
Prediction   No  Yes
No  1404  305
Yes  144  255

Accuracy : 0.787
95% CI : (0.7689, 0.8043)
No Information Rate : 0.7343
P-Value [Acc > NIR] : 1.252e-08

Kappa : 0.3989
Mcnemar's Test P-Value : 4.324e-14

Sensitivity : 0.9070
Specificity : 0.4554
Pos Pred Value : 0.8215
Neg Pred Value : 0.6391
Prevalence : 0.7343
Detection Rate : 0.6660
Detection Prevalence : 0.8107
Balanced Accuracy : 0.6812

'Positive' Class : No



The accuracy did not increase, but the sensitivity improved compared with the initial Random Forest model.
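Sensitivity and specificity follow directly from the tuned model’s confusion matrix (caret treats "No" as the positive class here). Recomputing them, sketched in Python for illustration:

```python
# Tuned random-forest confusion matrix reported above:
no_no, no_yes = 1404, 305    # predicted No:  actual No, actual Yes
yes_no, yes_yes = 144, 255   # predicted Yes: actual No, actual Yes

sensitivity = no_no / (no_no + yes_no)      # actual "No" correctly identified
specificity = yes_yes / (no_yes + yes_yes)  # actual "Yes" correctly identified
print(round(sensitivity, 4), round(specificity, 4))  # 0.907 0.4554
```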

Random Forest Feature Importance

varImpPlot(rfModel_new, sort=T, n.var = 10, main = 'Top 10 Feature Importance')


Gives this plot:

## Summary

From the above example, we can see that Logistic Regression and Random Forest performed better than Decision Tree for customer churn analysis on this particular dataset.

Throughout the analysis, I have learned several important things:
1. Features such as tenure_group, Contract, PaperlessBilling, MonthlyCharges and InternetService appear to play a role in customer churn.
2. There does not seem to be a relationship between gender and churn.
3. Customers on a month-to-month contract, with PaperlessBilling, and within 12 months of tenure are more likely to churn; on the other hand, customers with a one- or two-year contract, more than 12 months of tenure, and no PaperlessBilling are less likely to churn.

Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.

Related Post


### Scrollama.js, a JavaScript library for scrollytelling

Russell Goldenberg released Scrollama.js in an effort to make scrollytelling more straightforward to implement.

Scrollytelling can be complicated to implement and difficult to make performant. The goal of this library is to provide a simple interface for creating scroll-driven interactives and improve user experience by reducing scroll jank. It offers (optional) methods to implement the common scrollytelling pattern to reduce more involved DOM calculations. For lack of a better term, I refer to it as the sticky graphic pattern, whereby the graphic scrolls into view, becomes “stuck” for a duration of steps, then exits and “unsticks” when the steps conclude.

Bookmarked for later.


### To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

### More New R Jobs

1. Full-Time
R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
17 Nov 2017
2. Full-Time
Customer Success Representative RStudio, Inc. – Posted by jclemens1
Anywhere
17 Nov 2017
3. Full-Time
Data Science Engineer Bonify – Posted by arianna@meritocracy
Berlin Berlin, Germany
17 Nov 2017
4. Freelance
Dashboard/Visualization developer in R/Shiny International Finance Corporation – Posted by Oleksiy
Anywhere
14 Nov 2017
5. Part-Time
Development of User-Defined Calculations and Graphical Outputs for WHO’s Influenza Data World Health Organization – Posted by aspenh
Anywhere
6 Nov 2017
6. Full-Time
Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1
Chicago Illinois, United States
2 Nov 2017
7. Full-Time
Business Data Analytics Faculty Maryville University – Posted by tlorden
St. Louis Missouri, United States
2 Nov 2017
8. Full-Time
PROGRAMMER/SOFTWARE DEVELOPMENT ENGINEER/COMPUTATIONAL AND MACHINE LEARNING SPECIALIST fchiava
Cambridge Massachusetts, United States
10 Oct 2017
9. Full-Time
Data Scientist @ London Hiscox – Posted by alan.south
London England, United Kingdom
13 Sep 2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)

### Hunter Kelly joins AYLIEN to take our engineering to the next level

It’s an exciting time here at AYLIEN – in the past couple of months, we’ve moved office, closed a funding round, and added six people to the team. We’re delighted to announce our most recent hire and our first Chief Architect and Principal Engineer, Hunter Kelly.

The AYLIEN tech infrastructure has grown to quite a scale at this point. In addition to serving over 30,000 users with three product offerings, we’re also a fully-functional AI research lab that houses five full-time researchers, who in turn feed their findings back into the products. With such a complex architecture and backends that handle highly demanding tasks, bringing in an engineer with the breadth and quality of experience that Hunter has is a huge boost to us as we move into the next phase of our journey.

At first glance, Hunter’s career has followed a seemingly meandering path through some really interesting companies. After graduating from UC Berkeley in the 90s, he joined the Photoscience Department at Pixar in California, became one of the first engineers in Google’s Dublin office, and at NewBay, he designed and built a multi-petabyte storage solution for handling user-generated content, still in use by some of the world’s largest telcos today. Hunter is joining us from Zalando’s Fashion Insight Centre, where as the first engineer in the Dublin office he kicked off the Fashion Content Platform, which was shortlisted as a finalist in the 2017 DatSci Awards.

The common thread in those roles, while perhaps not obvious, is data. Hunter brings this rich experience working on data, both from an engineering and data science perspective, to focus on one extremely important problem – how can we leverage data to solve our hardest problems?

This question is central to AI research, and Hunter’s expertise is a perfect fit with AYLIEN’s mission to make Natural Language Processing hassle-free for developers. Our APIs handle the heavy lifting so developers can leverage Deep NLP in a couple of lines of code, and the ease with which our users do this is down to the great work our science and engineering teams do. Adding Hunter to the intersection of these teams will add a huge amount to our capabilities here and we’re really excited about the great work we can get done.

“I’m really excited to be joining AYLIEN at this point in time. I think that AI and Machine Learning are incredibly powerful tools that everyone should be able to leverage. I really look forward to being able to bring my expertise and experience, particularly with large-scale and streaming data platforms, to AYLIEN to help broaden their already impressive offerings. AI is just reaching that critical point of moving beyond academia and reaching wide-scale adoption. Making its power accessible to the wider community beyond very focused experts is a really interesting and exciting challenge.”

When he’s not in AYLIEN, Hunter can be found spending time with his wife and foster children, messing around learning yet another programming language, painting minis, playing board games, tabletop RPGs, and wargames, or spending too much time playing video games. He’s also been known to do some salsa dancing, traveling, sailing, and scuba diving.

Check out Hunter’s talks on his most recent work at ClojureConj and this year’s Kafka Summit.

### Four short links: 20 November 2017

Ancient Data, Tech Ethics, Session Replay, and Cache Filesystem

1. Trade, Merchants, and the Lost Cities of the Bronze Age -- We analyze a large data set of commercial records produced by Assyrian merchants in the 19th Century BCE. Using the information collected from these records, we estimate a structural gravity model of long-distance trade in the Bronze Age. We use our structural gravity model to locate lost ancient cities. (via WaPo)
2. Tech Ethics Curriculum -- a Google sheet of tech ethics courses, with pointers to syllabi.
3. Session Replay Scripts (Ed Felten) -- lately, more and more sites use “session replay” scripts. These scripts record your keystrokes, mouse movements, and scrolling behavior, along with the entire contents of the pages you visit, and send them to third-party servers. Unlike typical analytics services that provide aggregate statistics, these scripts are intended for the recording and playback of individual browsing sessions, as if someone is looking over your shoulder. (via BoingBoing)
4. RubiX -- Cache File System optimized for columnar formats and object stores.

### How to keep graph databases both flexible and secure

Graph databases are now common within a range of  industries such as life sciences, healthcare, financial services, government and intelligence. Graphs are particularly valuable in these sectors because of the complex nature of the data and need for powerful, yet flexible data analytics. In addition, graph databases allow visibility across

The post How to keep graph databases both flexible and secure appeared first on Dataconomy.

### Can Blockchain Supercharge the Gig Economy?

Countless words have been written in recent years about the rise of the ‘gig economy’. As the natural evolution of the freelance model, the gig economy is based on short-term and temporary jobs handled by contractors and other independent workers. Instead of hiring full-time staff, freelancers offer a concise way

The post Can Blockchain Supercharge the Gig Economy? appeared first on Dataconomy.

### Housing Prices in Ames, Iowa: Kaggle’s Advanced Regression Competition

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction

Residential real estate prices are fascinating… and frustrating. The homebuyer, the home-seller, the real estate agent, the economist, and the banker are all interested in housing prices, but they’re sufficiently subtle that no one person or industry has a complete understanding.

For our third overall project and first group project we were assigned Kaggle’s Advanced Regression Techniques Competition. The goal, for the project and the original competition, was to predict housing prices in Ames, Iowa. While this particular competition is no longer active, the premise proved to be a veritable playground for testing our knowledge of data cleaning, exploratory data analysis, statistics, and, most importantly, machine learning. What follows is the combined work of Quentin Picard, Paul Ton, Kathryn Bryant, and Hans Lau.

Data

As stated on the Kaggle competition description page, the data for this project was compiled by Dean De Cock for educational purposes, and it includes 79 predictor variables (house attributes) and one target variable (price). As a result of the educational nature of the competition, the data was pre-split into a training set and a test set; the two datasets were given in the forms of csv files, each around 450 KB in size. Each of the predictor variables could fall under one of the following:

• lot/land variables
• location variables
• age variables
• basement variables
• roof variables
• garage variables
• kitchen variables
• room/bathroom variables
• utilities variables
• appearance variables
• external features (pools, porches, etc.) variables

The specifics of the 79 predictor variables are omitted for brevity but can be found in the data_description.txt file on the competition website. The target variable was Sale Price, given in US dollars.

Project Overview (Process)

The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.

Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but  we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing with curriculum than by anything else.

Throughout the project our group wanted to mimic a real-world machine learning project as much as possible, which meant that although we were given both a training set and a test set, we opted to treat the given test set as if it were “future” data. As a result, we further split the Kaggle training data into a private training set and a private testing set, with an 80/20 split, respectively. This allowed us to evaluate models in two ways before predicting on the Kaggle test data: with RMSE  of predictions made on the private test set and with cross validation RMSE of the entire training set.

Given the above choices, the process for training and evaluating each individual model was broken down as follows:

1. Grid search to tune hyperparameters (if applicable) on the private train set
2. Fit model to private train set with tuned parameters
3. Predict on private test set; if RMSE okay, proceed.
4. Fit new model to Kaggle train set with private train hyperparameters
5. Cross-validate on Kaggle train set; if CV RMSE okay, proceed.
6. Predict on Kaggle test set
7. Submit predictions to Kaggle
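In Python with scikit-learn, the steps above can be sketched roughly as follows. The model class (ElasticNet), parameter grid, and variable names are illustrative stand-ins, not the group's actual code:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

def evaluate_model(X_kaggle_train, y_kaggle_train, X_kaggle_test, param_grid):
    # Steps 1-3: tune on the private train set, sanity-check RMSE on the private test set.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_kaggle_train, y_kaggle_train, test_size=0.2, random_state=42)
    search = GridSearchCV(ElasticNet(), param_grid,
                          scoring='neg_mean_squared_error', cv=5)
    search.fit(X_tr, y_tr)
    rmse_private = np.sqrt(mean_squared_error(y_te, search.predict(X_te)))

    # Steps 4-5: refit on the full Kaggle train set with the same hyperparameters,
    # then cross-validate there as a second check.
    model = ElasticNet(**search.best_params_).fit(X_kaggle_train, y_kaggle_train)
    cv_rmse = np.sqrt(-cross_val_score(
        model, X_kaggle_train, y_kaggle_train,
        scoring='neg_mean_squared_error', cv=5)).mean()

    # Step 6: predict on the Kaggle test set (step 7 is the manual submission).
    return rmse_private, cv_rmse, model.predict(X_kaggle_test)
```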

Computing RMSE of predictions made on the private test set served as a group sanity check, in that if anything was amiss with a model we found out at this point and could correct for it before proceeding. The decision to fit a new model to the Kaggle train set in step 4 using the private train hyperparameters found in step 1 was one for which we never felt completely at ease; we were trying to minimize overfitting to the Kaggle training set, but also recognized that we weren’t using all the possible information we could by not re-tuning using the full training set. One benefit to this choice was that we never saw any major discrepancies between our own cross validation scores in step 5 and the Kaggle scores from step 7. With more time, we would have further investigated whether keeping the private train parameters or re-tuning for the final models was more beneficial.

The last noteworthy choice we made in our overall process was to feed differently-cleaned datasets to our linear-based models and our tree-based models. In linear-based models (linear, ridge, LASSO, elastic net, spline), prediction values are continuous and sensitive to outliers, so we opted to sacrifice information about “rare” houses in favor of gaining better predictions on “common” houses. We also wanted to minimize the likelihood of rare levels showing up only in our test data and/or causing columns of all 0’s in dummified data. We executed this tradeoff by releveling any nominal categorical variables that contained extremely rare classes, where “rare” was defined as “occurring in less than 1% of the observations.”

For example, the Heating variable had six levels but four of them combined accounted for only about 1% of the observations; so, we combined these four levels into a new ‘other’ level so that releveled Heating had only three levels, all accounting for 1% or more of the observations. Depending on the variable, we either created an ‘other’ level as in the example or we grouped rare levels into existing levels according to level similarity in the variable documentation.
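The releveling described above can be sketched in pandas as follows; the function name and the 1% threshold default are ours, but the logic follows the Heating example:

```python
import pandas as pd

def relevel_rare(series, threshold=0.01, other_label='other'):
    # Collapse any level occurring in less than `threshold` of the
    # observations into a single 'other' level.
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    return series.where(~series.isin(rare), other_label)
```

For a variable like Heating, where four of six levels together cover only about 1% of observations, this leaves three levels, each accounting for 1% or more of the data.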

We opted not to relevel data fed into our tree-based models because tree predictions are more robust to outliers and rare classes; trees can separate rare observations from others through splits, which prevents common observation predictions from being distorted by rare observations in fitting.

Data Cleaning and EDA

We were fortunate with the Kaggle data in that it came to us relatively clean. The only basic cleaning tasks were to correct typos in levels of categorical variables, specify numeric or categorical variables in R, and rename variables beginning with numbers to satisfy R’s variable name requirements. There were a number of  “quality” and “condition” variables that had levels of Poor, Fair, Typical/Average, Good, and Excellent which we label encoded as integers 1-5 to preserve their inherent ordinality. For nominal categorical variables, we used one-hot encoding from the ‘vtreat’ package in R. (Most machine learning algorithms in Python require all variables to have numeric values, and although R’s machine learning algorithms can handle nominal categorical variables in non-numeric/string form, R’s computations effectively use one-hot encoding under the hood in these cases.)

After taking care of these basic cleaning issues, we needed to address missingness in our data. Below is a plot that shows the variables containing missing values and the degree to which missingness occurs in each (shown in yellow):

For all variables except Garage Year Built and Lot Frontage, we performed basic mode imputation. Mode imputation was chosen for simplicity and because it could be done with both categorical and numerical data. Of course, mode imputation  has the negative side effect of artificially decreasing variance within the affected variable and therefore would not be appropriate for variables with higher degrees of missingness. It was for this reason, in fact, that we approached missingness differently for Garage Year Built and Lot Frontage.
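As a minimal sketch (function and column names are ours, not the project's code), mode imputation in pandas works for categorical and numeric columns alike:

```python
import pandas as pd

def mode_impute(df, columns):
    # Fill missing values in each column with that column's mode.
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```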

For missing values in Garage Year Built, we imputed the Year Built for the house. We justified this because most garages are built at the same time as the house, and houses without garages get no penalty or benefit by having the Garage Year Built equal to the Year Built for the house.

Lot Frontage, the variable with the greatest degree of missingness, required some research and exploration to arrive at an imputation strategy. Lot Frontage is defined as the linear feet of street connected to the property. Given that most properties were either regularly or only slightly irregularly shaped according to Lot Shape, we deduced that most properties would have Lot Frontage values correlated with Lot Area values. However, since length is measured in units and area is measured in square units, we found it most appropriate to relate log(Lot Frontage) with log(Lot Area), so as to get a linear relationship between the two. See plot below.

Thus, we imputed missing values for Lot Frontage by fitting a linear model of log(Lot Frontage) regressed onto log(Lot Area), predicting on the missing values for Lot Frontage using Lot Area, and then exponentiating (inverse log-ing) the result.
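The original was done in R; an equivalent sketch in Python (function and argument names are illustrative) fits the log-log regression and exponentiates the predictions back:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def impute_lot_frontage(frontage, area):
    # Fit log(Lot Frontage) ~ log(Lot Area) on observed rows,
    # predict on missing rows, then exponentiate back.
    frontage = frontage.astype(float).copy()
    missing = np.isnan(frontage)
    X_obs = np.log(area[~missing]).reshape(-1, 1)
    model = LinearRegression().fit(X_obs, np.log(frontage[~missing]))
    preds = model.predict(np.log(area[missing]).reshape(-1, 1))
    frontage[missing] = np.exp(preds)
    return frontage
```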

The next step in EDA after finding and addressing missingness was to look at outliers. Looking at boxplots of both Sale Price and log(Sale Price) we saw that there were quite a few outliers, where ‘outliers’ were mathematically defined to be observations lying more than 1.5*IQR (interquartile range) above the third quartile or below the first quartile.
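That standard 1.5*IQR rule can be sketched as:

```python
import numpy as np

def iqr_outliers(x):
    # Flag observations more than 1.5 * IQR below Q1 or above Q3.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```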

Although removing the outliers may have improved our predictions for average-priced homes, we were hesitant to do so due to the relatively small size of our sample (~1600 observations in Kaggle training set). We felt that removing any outliers without further justification than them simply being outliers would likely jeopardize our predictions for houses in the extremes.

Since outliers would have the most impact on the fit of linear-based models, we further investigated outliers by training a basic multiple linear regression model on the Kaggle training set with all observations included; we then looked at the resulting influence and studentized residuals plots:

From these, we saw that there were only two observations that could justifiably be removed: observation 1299 and observation 251. These were both beyond or on the lines representing Cook’s Distance, meaning that as individual observations they had a significant impact on the regression formula; as such, we removed these two observations from consideration.

The last bit of preprocessing we did was dealing with multicollinearity. This, as for outliers, was cleaning done mostly for the benefit of linear-based models; in fact, it was done only for vanilla multiple linear regression since regularization in ridge, LASSO, and elastic net models deals with collinearity by construction. To eliminate collinearity issues, we used the findLinearCombos() function from the ‘caret’ package in R on our dummified data. This function identified linear combinations between predictor variables and allowed us to easily drop linearly dependent variables.
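A comparable check in Python (a rough sketch in the spirit of caret's findLinearCombos(), not its implementation) can greedily keep columns that increase the rank of the design matrix:

```python
import numpy as np

def drop_linear_combos(X):
    # Keep a column only if it increases the matrix rank; a column that
    # does not is a linear combination of the columns already kept.
    keep = []
    for j in range(X.shape[1]):
        candidate = X[:, keep + [j]]
        if np.linalg.matrix_rank(candidate) > len(keep):
            keep.append(j)
    return keep
```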

Feature Engineering

For feature engineering we took a two-pronged approach: we used anecdotal data/personal knowledge and we did research.

• Garage interaction: Garage Quality * Number of Cars Garage Holds.
• If a home has a really great or really poor garage, the impact of that quality component on price will be exacerbated by the size of the garage.
• Total number of bathrooms: Full Bath + Half Bath + Basement Full Bath + Basement Half Bath.
• In our experience, houses are often listed in terms of total number of bedrooms and total number of bathrooms. Our data had total number of bedrooms, but lacked total number of bathrooms.
• Average room size: Above-Ground Living Area / Total Number of Rooms Above Ground.
• “Open concept” homes have been gaining popularity and homes with large rooms have always been popular, and with the provided variables we believe average room size might address both of these trends.
• Bathroom to room ratio: (Full Bath + Half Bath) / Number of Bedrooms Above Ground
• The number of bathrooms desired in a house depends on the number of bedrooms. A home with one bedroom will not be viewed nearly as negatively for having only one bathroom as would a house with three bedrooms.
• Comparative size of living area: Above-Ground Living Area / mean(Above-Ground Living Area)
• This variable attempts to capture house size as it directly compares to other houses, but in a manner different from mere number of rooms.
• Landscape-ability interaction: Lot Shape * Land Contour
• Landscaped homes tend to sell for more (see this article) and the ability for a lot to be easily landscaped is, in part, determined by the shape of the lot (regular, slightly irregular, irregular, very irregular) and the land contour (level, banked, hillside, depressed). Certain combinations may be workable (regular shape, hillside) while other combinations (very irregular shape, hillside) may make landscaping difficult and/or expensive.

Of the six features that we added, only “landscape-ability” resulted from research; we either already had the important industry variables in the data (total above-ground square footage or neighborhood, for example) or we did not have the information we needed to create the desired variables. Additional features we would have liked to have added include geolocation data for proximity to schools and shopping centers, quality of nearby schools, and specific rooms/house features remodeled (if applicable).
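The six engineered features can be sketched in pandas as below. The column names are assumptions loosely matching the Ames field descriptions (not necessarily the exact names in the dataset), and the quality and shape/contour variables are assumed to already be label encoded as ordinal integers, as described in the cleaning section:

```python
import pandas as pd

def add_features(df):
    # Column names are illustrative; quality/shape/contour columns are
    # assumed to be ordinal-encoded integers.
    out = df.copy()
    out['GarageInteraction'] = out['GarageQual'] * out['GarageCars']
    out['TotalBaths'] = (out['FullBath'] + out['HalfBath']
                         + out['BsmtFullBath'] + out['BsmtHalfBath'])
    out['AvgRoomSize'] = out['GrLivArea'] / out['TotRmsAbvGrd']
    out['BathToBedRatio'] = (out['FullBath'] + out['HalfBath']) / out['BedroomAbvGr']
    out['ComparativeSize'] = out['GrLivArea'] / out['GrLivArea'].mean()
    out['Landscapeability'] = out['LotShape'] * out['LandContour']
    return out
```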

One step that we glossed over a bit was extensive adding and removing of features. We noticed with just a few trials that removing existing features seemed to negatively impact the performance of our models and that the models improved when we added features. Given that the regularized linear models would give preference to more important features via larger coefficients and tree-based models would give preference to more important features by splitting preferences, we opted not to spend too much time manually adding and removing features. Essentially, we allowed our models to select the most predictive features themselves.

Modeling

For our individual models, we trained the following:

• Multiple linear regression
• Ridge regression
• LASSO regression
• Elastic net regression
• Spline regression
• Basic decision tree
• Random forest
• XGBoosted tree

Overall our models consistently had RMSE values between .115 and .145. As stated earlier, our cross validation scores were usually very close to our scores on Kaggle. To our surprise, our linear-based models did very well compared to our tree-based models. Even the much-hyped XGBoost model was at best on par with our linear-based spline and elastic net models, and our random forest models tended to be much worse. An abbreviated table of our results is shown below:

|  | Elastic Net | Spline | Random Forest | Gradient Boost | XGBoost | Ensemble |
| --- | --- | --- | --- | --- | --- | --- |
| Cross Validation | 0.12157 | 0.11181 | 0.14171 | 0.12491 | 0.12282 | 0.11227 |
| Held-out Test | 0.11762 | 0.11398 | 0.13834 | 0.11403 | 0.11485 | n/a |
| Kaggle | 0.12107 | 0.11796 | n/a | n/a | n/a | 0.11710 |

As we endeavored to find a final, “best” model via ensembling and stacking, we followed the advice of our instructor (and successful Kaggler) Zeyu by attempting to combine models with different strengths and weaknesses.

We took the predictions from each of our best base models (Spline, Gradient Boost, XGBoost) as the features to use for the next level. Our best result came from a weighted average of the three, with weights determined by a grid search. The blend of 76% Spline, 14.5% Gradient Boost, and 9.5% XGBoost gave us a 0.11227 RMSE on our training set and 0.11710 on Kaggle.
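A grid search over convex blend weights for three base models can be sketched like this (function name, step size, and interface are ours, not the group's code):

```python
import numpy as np
from itertools import product

def search_blend_weights(preds, y_true, step=0.005):
    # preds: (n_samples, 3) matrix of base-model predictions.
    # Search convex weights (summing to 1) minimizing RMSE of the blend.
    best_w, best_rmse = None, np.inf
    grid = np.arange(0.0, 1.0 + step, step)
    for w1, w2 in product(grid, grid):
        if w1 + w2 > 1.0:
            continue
        w = np.array([w1, w2, 1.0 - w1 - w2])
        rmse = np.sqrt(np.mean((preds @ w - y_true) ** 2))
        if rmse < best_rmse:
            best_w, best_rmse = w, rmse
    return best_w, best_rmse
```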

We also experimented with using a second level meta-model to do the stacking, but with a linear meta-model, the coefficients were highly unstable because the predictions were so closely related to each other, and with a gradient boost meta-model, we were unable to beat our best base model alone.

Conclusions

As mentioned above, we were surprised by the strength of our various linear models in comparison to our tree models. We suspect this had a lot to do with the data itself, in that Sale Price (or rather, log(Sale Price)) likely has a relatively linear relationship with the predictor variables. This highlights one of the most important takeaways from this project: that linear models have a real place in machine learning.

In situations like this, where a simpler model and a more complex model perform similarly, the better interpretability and easier training of the simpler model may also prove to be deciding factors in model choice. In most cases, stacking multiple base-layer models into a second layer is impractical, despite performing slightly better.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### [For Members] Getting Started with Network Graphs in R

Add the vertices. Connect them with edges. Repeat as necessary. Read More

### The data fix

Rachel Thomas's article came onto my twitter feed. It caught my attention because it has a click-baity title "How (and why) to create a good validation set."

Or, I thought it was click bait but she is really serious about this. (For those not familiar with the literature, we don't use all historical data to build machine learning models. The historical data are split, typically at random, into training and validation sets. The validation set is supposed to simulate new data the algorithms haven't seen before, a sort of honest check of the model.) She makes some alarmist claims here:

• there is such a thing as a "poorly chosen" validation set
• random selection is not a good way to make a validation set, a "poor choice for many real-world problems"
• the analyst should manufacture a validation set
• the validation set should be representative of future, currently-unseen, data

Even though I don't like any of her advice, I can't disagree with her diagnosis:

An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again.

***

One of the examples given is a response function that has a time trend.

If this model does not detect the trend, indeed the prediction will have poor accuracy on  real-world data. She is making a claim that a validation set based on a pre-post time split is better than a random selection.

Since this is a simple linear trend, either way of making the validation set will capture the trend. So what makes the model fail in production is not the presence of this trend but a shift of the trajectory after the model is deployed. But the choice of validation set won't help prevent the problem.

The downside of the pre-post split is the presence of many time-varying predictors. A naive example: if an on-off switch just happens to be pressed at that time split point, then all your training examples have the "on" condition while your validation examples have "off".

Manufacturing the validation set to reflect some unknown future trends creates a conceptual difficulty. The training set is now materially different from the validation set, so why would we expect the trained model to perform well on the validation set? And how much degradation in validation set performance is considered the right price to pay for potentially better in-market performance? That question boils down to how much you want to generalize from the data, and it sits at the core of the statistical view of the modeling problem.

***

The subtext of the article is that if the model doesn't work, fix the data. I tend to want to fix the model. If it doesn't work in production because the nature of the time trend has shifted, then adjust the model to include the new time trend.

Diagnosing the difference between production data and historical data is part of good model hygiene. It's very hard to predict unexpected shifts in the data and even if you could, you wouldn't have any training data to support such shifts.

The "data fix" is not the solution. Refining one's model is.

PS. While I don't agree with designing your validation set, I do advise selecting your historical dataset carefully, and think about which units to include or exclude from the modeling process, which Rachel discusses at the end of her post.

### Document worth reading: “Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey”

Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data, text documents. Researchers have published many articles in the field of topic modeling and applied in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling, which Latent Dirichlet allocation (LDA) is one of the most popular methods in this field. Researchers have proposed various models based on the LDA in topic modeling. According to previous work, this paper can be very useful and valuable for introducing LDA approaches in topic modeling. In this paper, we investigated scholarly articles highly (between 2003 to 2016) related to Topic Modeling based on LDA to discover the research development, current trends and intellectual structure of topic modeling. Also, we summarize challenges and introduce famous tools and datasets in topic modeling based on LDA. Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey

### No to inferential thresholds

Harry Crane points us to this new paper, “Why ‘Redefining Statistical Significance’ Will Not Improve Reproducibility and Could Make the Replication Crisis Worse,” and writes:

Quick summary: Benjamin et al. claim that FPR would improve by factors greater than 2 and replication rates would double under their plan. That analysis ignores the existence and impact of “P-hacking” on reproducibility. My analysis accounts for P-hacking and shows that FPR and reproducibility would improve by much smaller margins and quite possibly could decline depending on some other factors.

I am not putting forward a specific counterproposal here. I am instead examining the argument in favor of redefining statistical significance in the original Benjamin et al. paper.

From the concluding section of Crane’s paper:

The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.

Defenders of the proposal will inevitably criticize this conclusion as “perpetuating the status quo,” as one of them already has [12]. Such a rebuttal is in keeping with the spirit of the original RSS [redefining statistical significance] proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors. The RSS proposal is just the latest in a long line of recommendations aimed at resolving the crisis while perpetuating the cult of statistical significance [22] and propping up the flailing and failing scientific establishment under which the crisis has thrived.

I like Crane’s style. I can’t say that I tried to follow the details, because his paper is all about false positive rates, and I think that whole false positive thing is inappropriate in most science and engineering contexts that I’ve seen, as I’ve written many times (see, for example, here and here).

I think the original sin of all these methods is the attempt to get certainty or near-certainty from noisy data. These thresholds are bad news—and, as Hal Stern and I wrote a while ago, it’s not just because of the 0.049 or 0.051 thing. Remember this: a z-score of 3 gives you a (two-sided) p-value of 0.003, and a z-score of 1 gives you a p-value of 0.32. One of these is super significant—“p less than 0.005”! Wow!—and the other is the ultimate statistical nothingburger. But if you have two different studies, and one gives p=0.003 and the other gives p=0.32, the difference between them is not at all remarkable. You could easily get both these results from the exact same underlying result, based on nothing but sampling variation, or measurement error, or whatever.

So, scientists and statisticians: All that thresholding you’re doing? It’s not doing what you think it’s doing. It’s just a magnification of noise.

So I’m not really inclined to follow the details of Crane’s argument regarding false positive rates etc., but I’m supportive of his general attitude that thresholds are a joke.

Post-publication review, not “ever expanding regulation”

Crane’s article also includes this bit:

While I am sympathetic to the sentiment prompting the various responses to RSS [1, 11, 15, 20], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic ‘calls to action’. Those calling for bigger and better scientific regulations ought not forget that another regulation—the 5% significance level—lies at the heart of the crisis.

As a coauthor of one of the cited papers ([15], to be precise), let me clarify that we are not “calling for bigger and better scientific regulations,” nor are we advocating for pre-registered studies (although we do believe such studies have their place), nor are we proposing to “ban” anything, nor are we offering any “generic calls to action.” Of all the things on that list, the only thing we’re suggesting is “better study design”—and our suggestions for better study design are in no way a call for “ever expanding scientific regulation.”

The post No to inferential thresholds appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Book Memo: “An Introduction to Statistical Learning”

With Applications in R. An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.

### Data science courses in R (/python/and others) for $10 at Udemy (Black Friday sale)

Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (offering 50%–90% discounts) applies to hundreds of their courses – including many on R programming, data science, machine learning, etc.

• R programming

Introductory R courses:

Top data science courses (not in R):

### RcppEigen 0.3.3.3.1

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

A maintenance release 0.3.3.3.1 of RcppEigen is now on CRAN (and will get to Debian soon). It brings Eigen 3.3.* to R.

The impetus was a request from CRAN to change the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up. Otherwise, Haiku-OS is now supported and a minor Travis tweak was made.

The complete NEWS file entry follows.

#### Changes in RcppEigen version 0.3.3.3.1 (2017-11-19)

• Compilation under Haiku-OS is now supported (Yu Gong in #45).

• The Rcpp.plugin.maker helper function is called via :: as it is in fact exported (yet we had old code using :::).

• A spurious argument was removed from an example call.

• Travis CI now uses https to fetch the test runner script.

Courtesy of CRANberries, there is also a diffstat report for the most recent release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RcppClassic 0.9.9

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

A maintenance release RcppClassic 0.9.9 is now at CRAN. This package provides a maintained version of the otherwise deprecated first Rcpp API; no new projects should use it.

Per a request from CRAN, we changed the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up.

Courtesy of CRANberries, there are changes relative to the previous release.

Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.



### Magister Dixit

“Before jumping on the Big Data bandwagon, I think it is important to ask the question of whether the problem you have requires much data.” Pradyumna S. Upadrashta ( February 13, 2015 )

Sometimes, when I write a really long blog post, I forget what the point was at the end. I suppose I could just update the previous post…but that feels wrong for some reason.

I meant to make one final point in my last post about how better data analyses help you reason about the data. In particular, I meant to tie the discussion about garbage collection to the section on data analysis. With respect to garbage collection in programming, I wrote

The promise of garbage collection is clear: the programmer doesn’t have to think about memory. Lattner’s criticism is that when building large-scale systems the programmer always has to think about memory, and as such, garbage collection makes it harder to do so. The goal of [Automatic Reference Counting] is to make it easier for other people to understand your code and to allow programmers to reason clearly about memory.

To pick this apart just a little bit, it’s tempting to think that “not having to think about memory management” is a worthy goal, but Lattner’s point is that it’s not. At least not right now. Better to make tools that will help people to think more effectively about this important topic (a “bicycle for the mind”).

The analogy for data analysis is that it’s tempting to think of the goal of methods development as removing the need to think about data and assumptions. The “ultimate” method is one where you don’t have to worry about distributions or nonlinearities or interactions or anything like that. But I don’t see that as the goal. Good methods, and good analyses, help us think about all those things much more efficiently. So what I might say is that

When doing large-scale data analyses, the data analyst always has to think about the data and assumptions, and as such, some approaches can actually make that harder to do than others. The goal of a good data analysis is to make it easier to reason about how the data are related to the result, relative to the assumptions you make about the data and the models.

### M4 Forecasting Competition: response from Spyros Makridakis

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

Following my post on the M4 competition yesterday, Spyros Makridakis sent me these comments for posting here.
I would like to thank Rob, my friend and co-author, for his insightful remarks concerning the upcoming M4 competition. As Rob says, the two of us have talked a great deal about competitions and I certainly agree with him about the “ideal” forecasting competition. In this reply, I will explain why I have deviated from the “ideal”, mostly for practical reasons and to ensure higher participation.


### Where to live in the Netherlands based on temperature XKCD style

(This article was first published on Clean Code, and kindly contributed to R-bloggers)

After seeing a plot of best places to live in Spain and the USA based on the weather, I had to
chime in and do the same thing for the Netherlands. The idea is simple, determine where you want to live based on your temperature preferences.

First the end result:

In the xkcd comic we see that the top left of the graph represents “if you hate cold and hate heat”; going down from the top left to the bottom left, the winters get colder, ending in “if you love cold and hate heat”. Going to the right, the heat (and humidity) goes up, ending in “if you love cold and love heat”. Going up to the top right: “if you hate cold and love heat”.

This post explains how to make the plot, to see where I got the data and what procedures I took look at https://github.com/RMHogervorst/xkcd_weather_cities_nl.

### Inspiration

According to this post by Maëlle Salmon, inspired by xkcd, we can determine our ideal city by looking at winter temperature and humidex (a combination of humidity and summer heat).

I’ve seen versions for major cities in the USA (the post by Maëlle) and for Spain by Claudia Guirao. There is even one for France, in French of course.

So I had to make one for the Netherlands too. There is just one small detail: the Netherlands is tiny; the United States is approximately 237 times larger than the Netherlands. From The Hague to the German border directly east it is 164 km (101 miles), and from Leeuwarden to Maastricht in the south it is 262 km (162 miles). Essentially, my home country has a moderate sea climate with warm summers and relatively mild winters.

I expect no really big differences between the cities. I use the average temperature over December, January and February for the winter temperature and calculate the humidex using the comf package. The humidex combines humidity and temperature, and its values are interpreted as follows:

• 20 to 29: Little to no discomfort
• 30 to 39: Some discomfort
• 40 to 45: Great discomfort; avoid exertion
• Above 45: Dangerous; heat stroke quite possible
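As a rough illustration of what the humidex captures, here is a base-R sketch of the standard Environment Canada formula. The post itself uses comf::calcHumx(); the function and constants below are the textbook approximation, not the comf implementation:

```r
# Humidex from air temperature (deg C) and relative humidity (%).
# Uses the Magnus approximation for water vapour pressure; this is
# a sketch of the standard formula, not the comf package's code.
humidex <- function(temp, rh) {
  e <- 6.112 * 10^(7.5 * temp / (237.7 + temp)) * rh / 100  # vapour pressure, hPa
  temp + 0.5555 * (e - 10)
}

humidex(30, 70)  # a humid 30 deg C day lands around 41: "great discomfort"
```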

For cities I went the extremely lazy way and took all of the available weather stations provided by the Dutch weather service (KNMI, Royal Netherlands Meteorological Institute). There are 46 stations in the Netherlands. These are not always in the cities, but they broadly cover the entire country.

### Plot like XKCD

The xkcd package has wonderful plotting abilities and a cool font that you have to install. I copied and modified the code from Maëlle’s post, because it is really good!

If you want to see how I did the data cleaning go to the github page.

First we plot all of the stations in the Netherlands; most of these places will probably not be familiar to non-Dutch people.

library("xkcd")
library("ggplot2")
library("extrafont")
library("ggrepel")

xrange <- range(result$humidex_avg)
yrange <- range(result$wintertemp)

set.seed(3456)
ggplot(result, aes(humidex_avg, wintertemp)) +
  geom_point() +
  geom_text_repel(aes(label = NAME), family = "xkcd",
                  max.iter = 50000, size = 3) +
  ggtitle("Where to live in The Netherlands \nbased on your temperature preferences",
          subtitle = "Data from KNMI, 2016-2017") +
  xlab("Summer heat and humidity via Humidex") +
  ylab("Winter temperature in Celsius degrees") +
  xkcdaxis(xrange = xrange, yrange = yrange) +
  theme_xkcd() +
  theme(text = element_text(size = 13, family = "xkcd"))


But what does that mean in the grand scheme of things? As you might notice, the range is very small. It would be interesting to add some US cities and perhaps some Spanish cities to compare against. For fun I also added the Dutch Caribbean islands.

xrange2 <- range(c(18, 40))  # modified these by hand to increase the margins
yrange2 <- range(c(-5, 40))
USA <- tribble(
  ~NAME, ~humidex_avg, ~wintertemp,
  "DETROIT, MI", 27, 0,
  "NASHVILLE, TN", 33, 9,
  "FORT MEYERS, FL", 37, 20
)
SPAIN <- tribble(
  ~NAME, ~humidex_avg, ~wintertemp,
  "TENERIFE, SPAIN", 24, 13,
  "BARCELONA, SPAIN", 32, 10
)
D_CARI <- tribble(
  ~NAME, ~humidex_avg, ~wintertemp,
  "Bonaire, Caribbean Netherlands", 27, calcHumx(29, 76),
  "Sint Eustatius, Caribbean Netherlands", 28, calcHumx(28, 77),
  "Saba, Caribbean Netherlands", 30, calcHumx(30, 76)
)

set.seed(3456)
ggplot(result, aes(humidex_avg, wintertemp)) +
  geom_point(alpha = .7) +
  geom_text_repel(aes(label = NAME),
                  family = "xkcd",
                  max.iter = 50000, size = 2) +
  geom_text_repel(data = USA, aes(humidex_avg, wintertemp, label = NAME),
                  family = "xkcd", max.iter = 50000, size = 2, color = "blue") +
  geom_point(data = USA, aes(humidex_avg, wintertemp), color = "blue") +
  geom_text_repel(data = SPAIN, aes(humidex_avg, wintertemp, label = NAME),
                  family = "xkcd", max.iter = 50000, size = 2, color = "red") +
  geom_point(data = SPAIN, aes(humidex_avg, wintertemp), color = "red") +
  geom_text_repel(data = D_CARI, aes(humidex_avg, wintertemp, label = NAME),
                  family = "xkcd", max.iter = 50000, size = 2, color = "orange") +
  geom_point(data = D_CARI, aes(humidex_avg, wintertemp), color = "orange") +
  labs(title = "Where to live in The Netherlands based on \nyour temperature preferences \nCompared with some places in Spain, Caribbean NL and USA",
       subtitle = "Data from KNMI, 2016-2017, added Spain and US cities",
       x = "Summer heat and humidity via Humidex",
       y = "Winter temperature in Celsius degrees",
       caption = "includes Caribbean Netherlands") +
  xkcdaxis(xrange = xrange2, yrange = yrange2) +
  theme_xkcd() +
  theme(text = element_text(size = 16, family = "xkcd"))


Finally, we can compare the winter temperature and summer humidex in the Netherlands by placing the points on a map, using the coordinates of the measuring stations.

NL <- map_data(map = "world", region = "Netherlands")
result %>%
  # backticks needed because the column names contain parentheses
  rename(LON = `LON(east)`, LAT = `LAT(north)`) %>%
  ggplot(aes(LON, LAT)) +
  geom_point(aes(color = wintertemp)) +
  geom_text_repel(aes(label = NAME),
                  family = "xkcd", size = 3,
                  max.iter = 50000) +
  geom_polygon(data = NL, aes(x = long, y = lat, group = group),
               fill = NA, color = "black") +
  coord_map() +
  labs(title = "Wintertemperature in NL",
       subtitle = "data from KNMI (2016, 2017)",
       x = "", y = "") +
  theme_xkcd() +
  theme(text = element_text(size = 16, family = "xkcd"))

result %>%
  rename(LON = `LON(east)`, LAT = `LAT(north)`) %>%
  ggplot(aes(LON, LAT)) +
  geom_point(aes(color = humidex_avg)) +
  geom_text_repel(aes(label = NAME),
                  family = "xkcd", size = 3,
                  max.iter = 50000) +
  geom_polygon(data = NL, aes(x = long, y = lat, group = group),
               fill = NA, color = "black") +
  coord_map() +
  labs(title = "Humidex in NL",
       subtitle = "data from KNMI (2016, 2017)",
       x = "", y = "") +
  theme_xkcd() +
  theme(text = element_text(size = 12, family = "xkcd")) +
  scale_color_continuous(low = "white", high = "red")


Now show us what your country looks like!

Where to live in the Netherlands based on temperature XKCD style was originally published at Clean Code on November 20, 2017.


(This article was first published on QuestionFlow, and kindly contributed to R-bloggers)

An introduction to the keyholder package: tools for keeping track of information about rows.

# Prologue

During development of my other R package (ruler), I encountered the following problem: how to track the rows of a data frame after application of some user-defined function? It is assumed that this function takes a data frame as input, subsets it (possibly creating new columns, but not rows) and returns the result. A typical example using dplyr and magrittr’s pipe:

suppressMessages(library(dplyr))

# Custom mtcars for a clearer explanation
mtcars_tbl <- mtcars %>%
  select(mpg, vs, am) %>%
  as_tibble()

# A handy way of creating a function with one argument
modify <- . %>%
  mutate(vs_am = vs * am) %>%
  filter(vs_am == 1) %>%
  arrange(desc(mpg))

# The question is: which rows of mtcars_tbl are returned?
mtcars_tbl %>% modify()
## # A tibble: 7 x 4
##     mpg    vs    am vs_am
##
## 1  33.9     1     1     1
## 2  32.4     1     1     1
## 3  30.4     1     1     1
## 4  30.4     1     1     1
## 5  27.3     1     1     1
## # ... with 2 more rows

To solve this problem I ended up creating the package keyholder, which became my first CRAN release. You can install its stable version with:

install.packages("keyholder")

This post describes the basics of the design and the main use cases of keyholder. For more information see its vignette, Introduction to keyholder.

# Overview

suppressMessages(library(keyholder))

The main idea of the package is to create an S3 class keyed_df, which indicates that the original data frame (or tibble) has the attribute keys. A “key” is any vector (even a list) with the same length as the number of rows in the data frame. Keys are stored as a tibble in the attribute keys, so one data frame can have multiple keys. In other words, keys can be considered columns of the data frame which are hidden from subsetting functions but are updated according to them.

To achieve that, those functions should be generic and have a method for keyed_df implemented. Look here for the list of functions supported by keyholder. As of version 0.1.1 they are all one- and two-table dplyr verbs for local data frames, plus the `[` function.
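The mechanics can be illustrated with a minimal base-R sketch. This shows the idea only, not keyholder’s actual implementation: keys live in an attribute, and a `[` method subsets them together with the rows (the helper set_keys() is made up for this illustration).

```r
# Minimal sketch of the keyed-data-frame idea: keys are stored in
# an attribute and a `[` method keeps them in sync with the rows.
set_keys <- function(df, keys) {
  stopifnot(nrow(keys) == nrow(df))
  attr(df, "keys") <- keys
  class(df) <- c("keyed_df", class(df))
  df
}

`[.keyed_df` <- function(x, i, ...) {
  res <- NextMethod()                                    # normal subsetting
  attr(res, "keys") <- attr(x, "keys")[i, , drop = FALSE]  # subset keys too
  res
}

d <- set_keys(mtcars[, c("mpg", "vs", "am")],
              data.frame(id = seq_len(nrow(mtcars))))
sub <- d[d$vs == 1, ]
attr(sub, "keys")$id  # original row numbers of the surviving rows
```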

# Create and manipulate keys

There are two distinct ways of creating keys: by assigning them and by using existing columns:

# By assigning
mtcars_tbl_1 <- mtcars_tbl
keys(mtcars_tbl_1) <- tibble(rev_id = nrow(mtcars_tbl_1):1)
mtcars_tbl_1
## # A keyed object. Keys: rev_id
## # A tibble: 32 x 3
##     mpg    vs    am
## *
## 1  21.0     0     1
## 2  21.0     0     1
## 3  22.8     1     1
## 4  21.4     1     0
## 5  18.7     0     0
## # ... with 27 more rows

# By using existing columns
mtcars_keyed <- mtcars_tbl %>% key_by(vs)
mtcars_keyed
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
##     mpg    vs    am
## *
## 1  21.0     0     1
## 2  21.0     0     1
## 3  22.8     1     1
## 4  21.4     1     0
## 5  18.7     0     0
## # ... with 27 more rows

To get keys, use keys() (which always returns a tibble) or pull_key() (similar to dplyr::pull() but for keys):

mtcars_keyed %>% keys()
## # A tibble: 32 x 1
##      vs
## *
## 1     0
## 2     0
## 3     1
## 4     1
## 5     0
## # ... with 27 more rows

mtcars_keyed %>% pull_key(vs)
##  [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1

To restore keys (create the respective columns in the data frame), use restore_keys():

# Column vs didn't change in output because it was restored from keys
mtcars_keyed %>%
  mutate(vs = 2) %>%
  restore_keys(vs)
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
##     mpg    vs    am
##
## 1  21.0     0     1
## 2  21.0     0     1
## 3  22.8     1     1
## 4  21.4     1     0
## 5  18.7     0     0
## # ... with 27 more rows

To stop having keys, use unkey():

mtcars_keyed %>% unkey()
## # A tibble: 32 x 3
##     mpg    vs    am
## *
## 1  21.0     0     1
## 2  21.0     0     1
## 3  22.8     1     1
## 4  21.4     1     0
## 5  18.7     0     0
## # ... with 27 more rows

# Use cases

## Track rows

To track rows after application of a user-defined function, one can create a key with row numbers as values. keyholder has a wrapper use_id() for this:

# use_id() removes all existing keys and creates key ".id"
mtcars_track <- mtcars_tbl %>%
use_id()

mtcars_track %>% pull_key(.id)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32

Now rows are tracked:

mtcars_track %>%
modify() %>%
pull_key(.id)
## [1] 20 18 19 28 26  3 32

# Make sure of correct result
mtcars_tbl %>%
mutate(id = seq_len(n())) %>%
modify() %>%
pull(id)
## [1] 20 18 19 28 26  3 32

The reason for using a “key id” instead of a “column id” is that modify() could hypothetically perform differently depending on the columns of its input. For example, it could use dplyr’s scoped variants of verbs or simply check the input’s column structure.

## Restore information

During development of tools for data analysis, one may need to ensure that certain columns don’t change after application of some function. This can be achieved by keying those columns and restoring them later (note that this can change the order of columns):

weird_modify <- . %>% transmute(new_col = vs + 2 * am)

# Suppose there is a need for all columns to stay untouched in the output
mtcars_tbl %>%
  key_by(everything()) %>%
  weird_modify() %>%
  # This can be replaced by its scoped variant: restore_keys_all()
  restore_keys(everything()) %>%
  unkey()
## # A tibble: 32 x 4
##   new_col   mpg    vs    am
##
## 1       2  21.0     0     1
## 2       2  21.0     0     1
## 3       3  22.8     1     1
## 4       1  21.4     1     0
## 5       0  18.7     0     0
## # ... with 27 more rows

## Hide columns

In actual data analysis the following situation can happen: one needs to modify all but a handful of columns with dplyr::mutate_if().

is_integerish <- function(x) {all(x == as.integer(x))}

if_modify <- . %>% mutate_if(is_integerish, ~ . * 10)

mtcars_tbl %>% if_modify()
## # A tibble: 32 x 3
##     mpg    vs    am
##
## 1  21.0     0    10
## 2  21.0     0    10
## 3  22.8    10    10
## 4  21.4    10     0
## 5  18.7     0     0
## # ... with 27 more rows

Suppose the column vs should appear unchanged in the output. This can be achieved in several ways, which differ slightly but significantly. The first is to key by vs, apply the function and restore vs from the keys.

mtcars_tbl %>%
  key_by(vs) %>%
  if_modify() %>%
  restore_keys(vs)
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
##     mpg    vs    am
##
## 1  21.0     0    10
## 2  21.0     0    10
## 3  22.8     1    10
## 4  21.4     1     0
## 5  18.7     0     0
## # ... with 27 more rows

The advantage is that it doesn’t change the order of columns. The disadvantage is that the modification function is actually applied to the column, which can be undesirable in some cases.

The second approach is similar, but after keying by vs one can remove this column from the data frame. This way the column vs is moved to the last position.

mtcars_hidden_vs <- mtcars_tbl %>% key_by(vs, .exclude = TRUE)

mtcars_hidden_vs
## # A keyed object. Keys: vs
## # A tibble: 32 x 2
##     mpg    am
## *
## 1  21.0     1
## 2  21.0     1
## 3  22.8     1
## 4  21.4     0
## 5  18.7     0
## # ... with 27 more rows

mtcars_hidden_vs %>%
  if_modify() %>%
  restore_keys(vs)
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
##     mpg    am    vs
##
## 1  21.0    10     0
## 2  21.0    10     0
## 3  22.8    10     1
## 4  21.4     0     1
## 5  18.7     0     0
## # ... with 27 more rows

# Conclusions

• It might be a good idea to extract some package functionality into a separate package, as this can lead to one more useful tool.
• The package keyholder offers functionality for keeping track of arbitrary data about rows after application of some user-defined function. This is done by creating the special attribute “keys”, which is updated after every change in rows (subsetting, ordering, etc.).

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
## [1] keyholder_0.1.1 bindrcpp_0.2    dplyr_0.7.4
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     bookdown_0.5     assertthat_0.2.0 digest_0.6.12
##  [5] rprojroot_1.2    R6_2.2.2         backports_1.1.1  magrittr_1.5
##  [9] evaluate_0.10.1  blogdown_0.2     rlang_0.1.4      stringi_1.1.5
## [13] rmarkdown_1.7    tools_3.4.2      stringr_1.2.0    glue_1.2.0
## [17] yaml_2.1.14      compiler_3.4.2   pkgconfig_2.0.1  htmltools_0.3.6
## [21] bindr_0.1        knitr_1.17       tibble_1.3.4


### Content Evaluation: What is the Value of Social Media?

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

Like most bloggers, I check my analytics stats on a regular basis. What I do not really look at is social shares. For most blog posts, traffic follows a clear pattern: two days of increased traffic, followed by a steady decrease to the base traffic volume.
The amplitude varies massively depending on how interesting/viral the post was. However, in the long term it is not this virality that drives the majority of the traffic but rather the search ranking of individual posts, as 80% of the base traffic comes through organic search results.

There is plenty of discussion on the value of a share/like with very different take-aways.
I thought it would be great to analyse if social shares do increase my traffic.

In addition, I realized that it is pretty easy to get the number of Facebook and LinkedIn shares for each of my blog posts from r-bloggers. (Please find the script at the end of the post.)

For the analysis, I scraped the social share numbers and merged them with my Google Analytics stats.
To compare traffic between blog posts, I normalise the total traffic per post by the number of days the post has been online.
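The normalisation itself is simple; a sketch with made-up numbers (the real post uses scraped Google Analytics data):

```r
# Hypothetical per-post stats: normalise total pageviews by the
# number of days each post has been online.
posts <- data.frame(
  post        = c("a", "b", "c"),
  pageviews   = c(1200, 300, 4500),
  days_online = c(400, 50, 900)
)
posts$normalizedPageViews <- posts$pageviews / posts$days_online
posts$normalizedPageViews  # 3, 6 and 5 daily visits
```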

To get a first picture of the data we can plot the normalized traffic over the number of shares (both by blog post).

Next, we can analyse how the variables are correlated. I included the total pageviews as reference.

We see that LinkedIn and Facebook shares are positively correlated with r = .56, and that the correlation between shares and normalized pageviews is about the same.
Looking at total page views, we see that they have a much lower correlation with shares, both from Facebook and from LinkedIn.

To determine the “value” of social media to drive traffic volume, we can regress the number of social shares on the normalized traffic numbers.

Dependent variable: normalizedPageViews

|                     | (1)                   | (2)                   | (3)                   |
|---------------------|-----------------------|-----------------------|-----------------------|
| shares              | 0.043** (0.017)       | 0.043** (0.017)       | 0.026 (0.022)         |
| linkedIn1           | 0.138 (0.126)         | 0.138 (0.126)         | 0.101 (0.129)         |
| DaysOnline          |                       |                       | -0.008 (0.006)        |
| Constant            | 1.462 (1.114)         | 1.462 (1.114)         | 5.744 (3.571)         |
| Observations        | 33                    | 33                    | 33                    |
| R2                  | 0.341                 | 0.341                 | 0.375                 |
| Adjusted R2         | 0.297                 | 0.297                 | 0.310                 |
| Residual Std. Error | 5.042 (df = 30)       | 5.042 (df = 30)       | 4.994 (df = 29)       |
| F Statistic         | 7.755*** (df = 2; 30) | 7.755*** (df = 2; 30) | 5.802*** (df = 3; 29) |

Note: *p<0.1; **p<0.05; ***p<0.01

As in the correlation plot, we see that highly shared posts have more daily visits. Taking the estimates at face value, one can also state that a LinkedIn share is worth three times a Facebook share in terms of additional traffic.
In the last regression I also included a variable (DaysOnline) to capture my learning effect: the longer a post has been online, the lower its daily traffic (e.g. posts published when I started blogging).
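A self-contained sketch of the kind of regression behind the table above (the data here are simulated, not the real blog stats; variable names follow the regression output):

```r
# Simulate 33 posts and regress daily traffic on shares and age,
# mirroring model (3) above. None of these numbers are real data.
set.seed(1)
n <- 33
posts <- data.frame(
  shares     = rpois(n, 40),
  DaysOnline = sample(30:900, n)
)
posts$normalizedPageViews <- 1.5 + 0.04 * posts$shares + rnorm(n)

fit <- lm(normalizedPageViews ~ shares + DaysOnline, data = posts)
round(coef(summary(fit)), 3)  # the slope on shares recovers roughly 0.04
```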

While writing this post, I realized that the script can also be used to analyse, on r-bloggers:

• which topics are highly shared
• which authors are popular
• how social sharing changed over time

Have fun!


### Timing in R

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

As time goes on, your R scripts are probably getting longer and more complicated, right? Timing parts of your script could save you precious time when re-running code over and over again. Today I’m going to go through the four main functions for doing so.

## Nested timings

### 1) Sys.time()

Sys.time() takes a “snap-shot” of the current time and so it can be used to record start and end times of code.

start_time = Sys.time()
Sys.sleep(0.5)
end_time = Sys.time()

To calculate the difference, we just use a simple subtraction

end_time - start_time
## Time difference of 0.5061 secs

Notice it creates a neat little message for the time difference.
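One caveat worth knowing (my addition, not from the post): the difftime object picks its own units (seconds, minutes, …) depending on the magnitude, so for programmatic use it is safer to convert explicitly:

```r
# Force the difference into seconds so downstream code never has
# to guess which units difftime chose to print.
start_time <- Sys.time()
Sys.sleep(0.5)
elapsed <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
elapsed  # roughly 0.5
```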

### 2) The tictoc package

You can install the CRAN version of tictoc via

install.packages("tictoc")

whilst the most recent development is available via the tictoc GitHub page.

library("tictoc")

Like Sys.time(), tictoc also gives us the ability to nest timings within code. However, we now have some more options to customise our timing. At its most basic it acts like Sys.time():

tic()
Sys.sleep(0.5)
toc()
## 0.506 sec elapsed

Now for a more contrived example.

# start timer for the entire section, notice we can name sections of code
tic("total time")
# start timer for first subsection
tic("Start time til half way")
Sys.sleep(2)
# end timer for the first subsection, log = TRUE tells toc to give us a message
toc(log = TRUE)
## Start time til half way: 2.037 sec elapsed

Now to start the timer for the second subsection

tic("Half way til end")
Sys.sleep(2)
# end timer for second subsection
toc(log = TRUE)
## Half way til end: 2.012 sec elapsed
# end timer for entire section
toc(log = TRUE)
## total time: 4.067 sec elapsed

We can view the results as a list (format = TRUE returns this list in a nice format), rather than raw code

tic.log(format = TRUE)
## [[1]]
## [1] "Start time til half way: 2.037 sec elapsed"
##
## [[2]]
## [1] "Half way til end: 2.012 sec elapsed"
##
## [[3]]
## [1] "total time: 4.067 sec elapsed"

We can also reset the log via

tic.clearlog()

## Comparing functions

### 1) system.time()

Why oh WHY did R choose to give system.time() a lower case s and Sys.time() an upper case s? Anyway… system.time() can be used to time functions without having to take note of the start and end times.

system.time(Sys.sleep(0.5))
##    user  system elapsed
##   0.000   0.000   0.501
system.time(Sys.sleep(1))
##    user  system elapsed
##   0.000   0.000   1.003

We only want to take notice of the “elapsed” time; for the definitions of the “user” and “system” times, see this thread.
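A quick way to see the difference, as a rough sketch: Sys.sleep() consumes wall-clock (“elapsed”) time but almost no CPU (“user”) time, whereas CPU-bound work accrues both:

```r
# Sleeping: elapsed time passes, but the CPU is idle.
t_sleep <- system.time(Sys.sleep(0.3))
# CPU-bound work: user time roughly tracks elapsed time.
t_cpu   <- system.time(sum(as.numeric(seq_len(1e7))))

t_sleep[["elapsed"]]    # about 0.3
t_sleep[["user.self"]]  # near 0
```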

For a repeated timing, we would use the replicate() function.

system.time(replicate(10, Sys.sleep(0.1)))
##    user  system elapsed
##   0.000   0.000   1.007

### 2) The microbenchmark package

You can install the CRAN version of microbenchmark via

install.packages("microbenchmark")

Alternatively you can install the latest update via the microbenchmark GitHub page.

library("microbenchmark")

At its most basic, microbenchmark() can be used to time single pieces of code.

# times = 10: repeat the test 10 times
# unit = "s": output in seconds
microbenchmark(Sys.sleep(0.1), times = 10, unit = "s")
## Unit: seconds
##            expr    min     lq   mean median     uq    max neval
##  Sys.sleep(0.1) 0.1001 0.1006 0.1005 0.1006 0.1006 0.1006    10

Notice we get a nicely formatted table of summary statistics. We can record our times in anything from seconds to nanoseconds(!!!!). Already this is better than system.time(). Not only that, but we can compare sections of code in an easy way and name them for an easy-to-read output.

sleep = microbenchmark(sleepy = Sys.sleep(0.1),
                       sleepier = Sys.sleep(0.2),
                       sleepiest = Sys.sleep(0.3),
                       times = 10,
                       unit = "s")

As well as this (more?!), microbenchmark comes with two built-in plotting functions.

microbenchmark:::autoplot.microbenchmark(sleep)

microbenchmark:::boxplot.microbenchmark(sleep)

These provide quick and efficient ways of visualising our timings.

## Conclusion

Sys.time() and system.time() have their place, but in most cases we can do better. The tictoc and microbenchmark packages are particularly useful and make it easy to store timings for later use, and the range of options for both packages stretches far past those of Sys.time() and system.time(). The built-in plotting functions are handy too.

Thanks for chatting!


## November 19, 2017

### Whats new on arXiv

This paper presents a method for adding multiple tasks to a single deep neural network while avoiding catastrophic forgetting. Inspired by network pruning techniques, we exploit redundancies in large deep networks to free up parameters that can then be employed to learn new tasks. By performing iterative pruning and network re-training, we are able to sequentially ‘pack’ multiple tasks into a single network while ensuring minimal drop in performance and minimal storage overhead. Unlike prior work that uses proxy losses to maintain accuracy on older tasks, we always optimize for the task at hand. We perform extensive experiments on a variety of network architectures and large-scale datasets, and observe much better robustness against catastrophic forgetting than prior work. In particular, we are able to add three fine-grained classification tasks to a single ImageNet-trained VGG-16 network and achieve accuracies close to those of separately trained networks for each task.
Deep generative neural networks have proven effective at both conditional and unconditional modeling of complex data distributions. Conditional generation enables interactive control, but creating new controls often requires expensive retraining. In this paper, we develop a method to condition generation without retraining the model. By post-hoc learning latent constraints, value functions that identify regions in latent space that generate outputs with desired attributes, we can conditionally sample from these regions with gradient-based optimization or amortized actor functions. Combining attribute constraints with a universal ‘realism’ constraint, which enforces similarity to the data distribution, we generate realistic conditional images from an unconditional variational autoencoder. Further, using gradient-based optimization, we demonstrate identity-preserving transformations that make the minimal adjustment in latent space to modify the attributes of an image. Finally, with discrete sequences of musical notes, we demonstrate zero-shot conditional generation, learning latent constraints in the absence of labeled data or a differentiable reward function. Code with dedicated cloud instance has been made publicly available (https://goo.gl/STGMGx ).
In this paper, we consider the problem of optimizing the quantiles of the cumulative rewards of Markov Decision Processes (MDP), which we refer to as Quantile Markov Decision Processes (QMDP). Traditionally, the goal of a Markov Decision Process (MDP) is to maximize expected cumulative reward over a defined horizon (possibly infinite). In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. Our QMDP framework provides analytical results characterizing the optimal QMDP solution and presents algorithms for solving the QMDP. We illustrate the model with two experiments: a grid game and an HIV optimal treatment experiment.
Most algorithms for reinforcement learning work by estimating action-value functions. Here we present a method that uses Lagrange multipliers, the costate equation, and multilayer neural networks to compute policy gradients. We show that this method can find solutions to time-optimal control problems, driving nonlinear mechanical systems quickly to a target configuration. On these tasks its performance is comparable to that of deep deterministic policy gradient, a recent action-value method.
We present a deep generative model for learning to predict classes not seen at training time. Unlike most existing methods for this problem, which represent each class as a point (via a semantic embedding), we represent each seen/unseen class using a class-specific latent-space distribution, conditioned on class attributes. We use these latent-space distributions as a prior for a supervised variational autoencoder (VAE), which also facilitates learning highly discriminative feature representations for the inputs. The entire framework is learned end-to-end using only the seen-class training data. The model infers corresponding attributes of a test image by maximizing the VAE lower bound; the inferred attributes may be linked to labels not seen when training. We further extend our model to a (1) semi-supervised/transductive setting by leveraging unlabeled unseen-class data via an unsupervised learning module, and (2) few-shot learning where we also have a small number of labeled inputs from the unseen classes. We compare our model with several state-of-the-art methods through a comprehensive set of experiments on a variety of benchmark data sets.
Recommenders have become widely popular in recent years because of their broader applicability in many e-commerce applications. These applications rely on recommenders for generating advertisements for various offers or providing content recommendations. However, the quality of the generated recommendations depends on user features (like demography, temporality), offer features (like popularity, price), and user-offer features (like implicit or explicit feedback). Current state-of-the-art recommenders do not explore such diverse features concurrently while generating the recommendations. In this paper, we first introduce the notion of Trackers which enables us to capture the above-mentioned features and thus incorporate users’ online behaviour through statistical aggregates of different features (demography, temporality, popularity, price). We also show how to capture offer-to-offer relations, based on their consumption sequence, leveraging neural embeddings for offers in our Offer2Vec algorithm. We then introduce BoostJet, a novel recommender which integrates the Trackers along with the neural embeddings using MatrixNet, an efficient distributed implementation of gradient boosted decision tree, to improve the recommendation quality significantly. We provide an in-depth evaluation of BoostJet on Yandex’s dataset, collecting online behaviour from tens of millions of online users, to demonstrate the practicality of BoostJet in terms of recommendation quality as well as scalability.
This work introduces a model of distributed learning in the spirit of Yao’s communication complexity model. We consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. To naturally fit into the framework of learning theory, we allow the players to send each other labelled examples, where each example costs one unit of communication. This model can also be thought of as a distributed version of sample compression schemes. We study several fundamental questions in this model. For example, we define the analogues of the complexity classes P, NP and coNP, and show that in this model P equals the intersection of NP and coNP. The proof does not seem to follow from the analogous statement in classical communication complexity; in particular, our proof uses different techniques, including boosting and metric properties of VC classes. This framework allows us to prove, in the context of distributed learning, unconditional separations between various learning contexts, like realizable versus agnostic learning, and proper versus improper learning. The proofs here are based on standard ideas from communication complexity as well as learning theory and geometric constructions in Euclidean space. As a corollary, we also obtain lower bounds that match the performance of algorithms from previous works on distributed classification.
Knowledge bases (KB) constructed through information extraction from text play an important role in query answering and reasoning. In this work, we study a particular reasoning task, the problem of discovering causal relationships between entities, known as causal discovery. There are two contrasting types of approaches to discovering causal knowledge. One approach attempts to identify causal relationships from text using automatic extraction techniques, while the other approach infers causation from observational data. However, extractions alone are often insufficient to capture complex patterns and full observational data is expensive to obtain. We introduce a probabilistic method for fusing noisy extractions with observational data to discover causal knowledge. We propose a principled approach that uses the probabilistic soft logic (PSL) framework to encode well-studied constraints to recover long-range patterns and consistent predictions, while cheaply acquired extractions provide a proxy for unseen observations. We apply our method to gene regulatory networks and show the promise of exploiting KB signals in causal discovery, suggesting a critical, new area of research.
Generative Adversarial Networks (GANs) have received wide attention in the machine learning field because of their massive potential to learn high-dimensional, complex real data. Specifically, they do not need to make further distribution assumptions and can simply infer real-like samples from latent space. This powerful property has led GANs to be applied to various applications such as image synthesis, image attribute editing and semantic decomposition of images. In this review paper, we look into the details of GANs: we first show how they operate and the fundamental meaning of their objective functions, and then point to GAN variants applied to a vast range of tasks.
We present a conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only a few examples from each. Our method, called the Relation Network (RN), is trained end-to-end from scratch. During meta-learning, it learns to learn a deep distance metric to compare a small number of images within episodes, each of which is designed to simulate the few-shot setting. Once trained, an RN is able to classify images of new classes by computing relation scores between query images and the few examples of each new class without further updating the network. Besides providing improved performance on few-shot learning, our framework is easily extended to zero-shot learning. Extensive experiments on four datasets demonstrate that our simple approach provides a unified and effective approach for both of these two tasks.
Increasingly many real world tasks involve data in multiple modalities or views. This has motivated the development of many effective algorithms for learning a common latent space to relate multiple domains. However, most existing cross-view learning algorithms assume access to paired data for training. Their applicability is thus limited as the paired data assumption is often violated in practice: many tasks have only a small subset of data available with pairing annotation, or even no paired data at all. In this paper we introduce Deep Matching Autoencoders (DMAE), which learn a common latent space and pairing from unpaired multi-modal data. Specifically we formulate this as a cross-domain representation learning and object matching problem. We simultaneously optimise parameters of representation learning auto-encoders and the pairing of unpaired multi-modal data. This framework elegantly spans the full regime from fully supervised, semi-supervised, and unsupervised (no paired data) multi-modal learning. We show promising results in image captioning, and on a new task that is uniquely enabled by our methodology: unsupervised classifier learning.
Machine translation is going through a radical revolution, driven by the explosive development of deep learning techniques using Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). In this paper, we consider a special case in machine translation problems, targeting the translation of natural language into Structured Query Language (SQL) for data retrieval over relational databases. Although generic CNN and RNN learn the grammar structure of SQL when trained with sufficient samples, the accuracy and training efficiency of the model can be dramatically improved when the translation model is deeply integrated with the grammar rules of SQL. We present a new encoder-decoder framework, with a suite of new approaches, including new semantic features fed into the encoder as well as new grammar-aware states injected into the memory of the decoder. These techniques help the neural network focus on understanding the semantics of the operations in natural language and save effort on SQL grammar learning. The empirical evaluation on real-world databases and queries shows that our approach outperforms the state-of-the-art solution by a significant margin.
Understanding the flow of information in Deep Neural Networks is a challenging problem that has gained increasing attention over the last few years. While several methods have been proposed to explain network predictions, only a few attempts to analyze them from a theoretical perspective have been made in the past. In this work we analyze various state-of-the-art attribution methods and prove unexplored connections between them. We also show how some methods can be reformulated and more conveniently implemented. Finally, we perform an empirical evaluation with six attribution methods on a variety of tasks and architectures and discuss their strengths and limitations.
Question Answering has come a long way from answer sentence selection and relational QA to reading and comprehension. We move our attention to abstractive question answering, by which we enable machines to read passages and answer questions by generating answers. We frame the problem as sequence-to-sequence learning, where the encoder is a network that models the relation between question and passage, thereby relying solely on passage and question content to form an abstraction of the answer. Not being able to retain facts and making repetitions are common mistakes that affect the overall legibility of answers. To counter these issues, we employ a copying mechanism and maintain a coverage vector in our model, respectively. Our results on MS-MARCO demonstrate its superiority over baselines, and we also show qualitative examples where we improved in terms of correctness and readability.
Deep neural networks have proved very successful on archetypal tasks for which large training sets are available, but when the training data are scarce, their performance suffers from overfitting. Many existing methods of reducing overfitting are data-independent, and their efficacy is often limited when the training set is very small. Data-dependent regularizations are mostly motivated by the observation that data of interest lie close to a manifold, which is typically hard to parametrize explicitly and often requires human input of tangent vectors. These methods typically only focus on the geometry of the input data, and do not necessarily encourage the networks to produce geometrically meaningful features. To resolve this, we propose a new framework, the Low-Dimensional-Manifold-regularized neural Network (LDMNet), which incorporates a feature regularization method that focuses on the geometry of both the input data and the output features. In LDMNet, we regularize the network by encouraging the combination of the input data and the output features to sample a collection of low dimensional manifolds, which are searched efficiently without explicit parametrization. To achieve this, we directly use the manifold dimension as a regularization term in a variational functional. The resulting Euler-Lagrange equation is a Laplace-Beltrami equation over a point cloud, which is solved by the point integral method without increasing the computational complexity. We demonstrate two benefits of LDMNet in the experiments. First, we show that LDMNet significantly outperforms widely-used network regularizers such as weight decay and DropOut. Second, we show that LDMNet can be designed to extract common features of an object imaged via different modalities, which proves to be very useful in real-world applications such as cross-spectral face recognition.
In this paper, we develop a local rank correlation measure which quantifies the performance of dimension reduction methods. The local rank correlation is easily interpretable, and robust against the extreme skewness of nearest neighbor distributions in high dimensions. Some benchmark datasets are studied. We find that the local rank correlation closely corresponds to our visual interpretation of the quality of the output. In addition, we demonstrate that the local rank correlation is useful in estimating the intrinsic dimensionality of the original data, and in selecting a suitable value of tuning parameters used in some algorithms.

### If you did not already know

Explicit Semantic Analysis (ESA)
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf-idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words. Typically, the text corpus is Wikipedia, though other corpora including the Open Directory Project have been used. ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization and has been used by this pair of researchers to compute what they refer to as ‘semantic relatedness’ by means of cosine similarity between the aforementioned vectors, collectively interpreted as a space of ‘concepts explicitly defined and described by humans’, where Wikipedia articles (or ODP entries, or otherwise titles of documents in the knowledge base corpus) are equated with concepts. The name ‘explicit semantic analysis’ contrasts with latent semantic analysis (LSA), because the use of a knowledge base makes it possible to assign human-readable labels to the concepts that make up the vector space. ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that the knowledge base contains topically orthogonal concepts. However, it was later shown by Anderka and Stein that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property; in their experiments, Anderka and Stein used newswire stories as ‘concepts’. To explain this observation, links have been shown between ESA and the generalized vector space model. 
Gabrilovich and Markovitch replied to Anderka and Stein by pointing out that their experimental result was achieved using ‘a single application of ESA (text similarity)’ and ‘just a single, extremely small and homogenous test collection of 50 news documents’. Cross-language explicit semantic analysis (CL-ESA) is a multilingual generalization of ESA. CL-ESA exploits a document-aligned multilingual reference collection (e.g., again, Wikipedia) to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.
http://…-explicit-semantic-analysis-esa-explained
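
The core representation is simple enough to sketch. Below is a minimal, self-contained Python illustration (not Gabrilovich and Markovitch's implementation; the toy "concept" corpus and all names are invented for illustration): a word is a vector of tf-idf weights across concept documents, a text is the centroid of its words' vectors, and relatedness is cosine similarity in that concept space.

```python
import math
from collections import Counter

# Toy "knowledge base": each document stands in for one concept
# (in real ESA these would be Wikipedia articles).
concepts = {
    "Music":    "guitar piano melody chord guitar song",
    "Cooking":  "recipe oven dish flavor dish pan",
    "Computer": "cpu memory program code program disk",
}

docs = list(concepts.values())
n_docs = len(docs)

def tfidf_vector(word):
    """A word is a vector of tf-idf weights, one per concept document."""
    df = sum(1 for d in docs if word in d.split())
    if df == 0:
        return [0.0] * n_docs
    idf = math.log(n_docs / df)
    return [Counter(d.split())[word] * idf for d in docs]

def text_vector(text):
    """A text is the centroid of its words' concept vectors."""
    words = text.split()
    vecs = [tfidf_vector(w) for w in words]
    return [sum(col) / len(words) for col in zip(*vecs)]

def cosine(u, v):
    """'Semantic relatedness' as cosine similarity in concept space."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(text_vector("guitar melody"), text_vector("piano song")))   # → 1.0 (both texts live in the Music concept)
print(cosine(text_vector("guitar melody"), text_vector("recipe oven")))  # → 0.0 (disjoint concepts)
```

Note the interpretability the glossary entry emphasises: each dimension of the vector is a named, human-readable concept, unlike the latent dimensions of LSA.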

P-Tree Programming
We propose a novel method for automatic program synthesis. P-Tree Programming represents the program search space through a single probabilistic prototype tree. From this prototype tree we form program instances which we evaluate on a given problem. The error values from the evaluations are propagated through the prototype tree. We use them to update the probability distributions that determine the symbol choices of further instances. The iterative method is applied to several symbolic regression benchmarks from the literature. It outperforms standard Genetic Programming to a large extent. Furthermore, it relies on a concise set of parameters which are held constant for all problems. The algorithm can be employed for most of the typical computational intelligence tasks such as classification, automatic program induction, and symbolic regression. …

DeepPath
We study the problem of learning to reason in large scale knowledge graphs (KGs). More specifically, we describe a novel reinforcement learning framework for learning multi-hop relational paths: we use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, our approach includes a reward function that takes the accuracy, diversity, and efficiency into consideration. Experimentally, we show that our proposed method outperforms a path-ranking based algorithm and knowledge graph embedding methods on Freebase and Never-Ending Language Learning datasets. …

### Automatic output format in Rmarkdown

(This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers)

I am writing an Rmarkdown document with plenty of tables, and I want them in a decent format, e.g. kable. However, I don’t want to format them one by one. For example, I have created the following data frame in dplyr

data2 %>% group_by(uf) %>% summarise(n = n()) %>% arrange(desc(n))

One solution to the output format of this data frame would be to assign it to an object in R, and then format it by using the kable function.

t1 <- data2 %>% group_by(uf) %>% summarise(n = n()) %>% arrange(desc(n))
knitr::kable(t1)

However, if your document has hundreds of these queries and you need a faster way to compile the document, while keeping the kable style automatically, avoiding naming the data frame and even avoiding calling the kable function over that name, you can use the printr package. Just add the following piece of code inside a chunk at the beginning of your document and voilà.

library(printr)

Now, all of your data frames will have a decent style, and you do not need to worry about this issue. For example, I have knitted a presentation by using printr and the first code in this post, and this is the result:


### R Packages worth a look

Maximum Likelihood Inference on Multi-State Trees (ML.MSBD)
Inference of a multi-state birth-death model from a phylogeny, comprising a number of states N, birth and death rates for each state and on which edges each state appears. Inference is done using a hybrid approach: states are progressively added in a greedy approach. For a fixed number of states N the best model is selected via maximum likelihood. Reference: J. Barido-Sottani and T. Stadler (2017) <doi:10.1101/215491>.

Nearest Neighbor Based Multiple Imputation for Survival Data with Missing Covariates (NNMIS)
Imputation for both missing covariates and censored observations (optional) for survival data with missing covariates by the nearest neighbor based multiple imputation algorithm as described in Hsu et al. (2006) <doi:10.1002/sim.2452>, Long et al. (2012) <doi:10.5705/ss.2010.069>, Hsu et al. (2014) <doi:10.1080/10543406.2014.888444>, and Hsu and Yu (2017) <arXiv:1710.04721>. Note that the current version can only impute for a situation with one missing covariate.

Weight of Evidence Based Segmentation of a Variable (woeR)
Segment a numeric variable based on a dichotomous dependent variable by using the weight of evidence (WOE) approach (Ref: Siddiqi, N. (2006) <doi:10.1002/9781119201731.biblio>). The underlying algorithm adopts a recursive approach to create segments that are diverse with respect to their WOE values and meet the demands of user-defined parameters. The algorithm also aims to maintain a monotonic trend in WOE values of consecutive segments. As such, it can be particularly helpful in improving robustness of linear and logistic regression models.
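
The WOE calculation itself is straightforward. Here is a minimal sketch in plain Python rather than the woeR package, with made-up illustration bins and counts, using the standard formula WOE = ln((goods in segment / total goods) / (bads in segment / total bads)):

```python
import math

# Hypothetical segmentation of a numeric variable into three bins,
# with counts of the dichotomous outcome ("good"/"bad") in each bin.
bins = [
    {"range": "< 20",  "good": 50,  "bad": 150},
    {"range": "20-40", "good": 200, "bad": 200},
    {"range": ">= 40", "good": 350, "bad": 50},
]

total_good = sum(b["good"] for b in bins)  # 600
total_bad = sum(b["bad"] for b in bins)    # 400

def woe(b):
    # WOE = ln( (% of all goods falling in this bin) / (% of all bads) )
    return math.log((b["good"] / total_good) / (b["bad"] / total_bad))

for b in bins:
    print(b["range"], round(woe(b), 3))
```

In this toy data the WOE values increase monotonically across consecutive bins, which is exactly the property the package description says the algorithm tries to maintain.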

A Shiny User Interface of Time Warping for Improved Gradient Matching (shinyKGode)
Interactive shiny application to perform inference of non-linear differential equations via gradient matching. Three pre-defined models (Lotka-Volterra, FitzHugh-Nagumo, and Biopathway) are provided, and users can also load their own models (in the Systems Biology Markup Language format) into the application.

Diagnostics for Pharmacometric Models (xpose)
Diagnostics for non-linear mixed-effects (population) models from ‘NONMEM’ <http://…/>. ‘xpose’ facilitates data import and creation of numerical run summaries, and provides ‘ggplot2’-based graphics for data exploration and model diagnostics.

High-Dimensional Mediation Analysis (HIMA)
Allows users to estimate and test high-dimensional mediation effects based on sure independent screening and minimax concave penalty techniques. A joint significance test is used for the mediation effect. Haixiang Zhang, Yinan Zheng, Zhou Zhang, Tao Gao, Brian Joyce, Grace Yoon, Wei Zhang, Joel Schwartz, Allan Just, Elena Colicino, Pantel Vokonas, Lihui Zhao, Jinchi Lv, Andrea Baccarelli, Lifang Hou, Lei Liu (2016) <doi:10.1093/bioinformatics/btw351>.

## Privacy

• Another week, another AWS bucket is left unattended. This one was big, though: Defense Department Spied On Social Media, Left All Its Collected Data Exposed To Anyone. Original investigation.

The UpGuard Cyber Risk Team can now disclose that three publicly downloadable cloud-based storage servers exposed a massive amount of data collected in apparent Department of Defense intelligence-gathering operations. The repositories appear to contain billions of public internet posts and news commentary scraped from the writings of many individuals from a broad array of countries, including the United States, by CENTCOM and PACOM, two Pentagon unified combatant commands charged with US military operations across the Middle East, Asia, and the South Pacific.

## Tech

• The [University of Washington] team used machine-learning-based tools to analyze the language in nearly 800 movie scripts, quantifying how much power and agency those scripts give to individual characters. In their study, recently presented in Denmark at the 2017 Conference on Empirical Methods in Natural Language Processing, the researchers found subtle but widespread gender bias in the way male and female characters are portrayed.

• This one has been controversial this week. Original: The Ivory Tower Can’t Keep Ignoring Tech. A reply: Cathy O’Neil Sleepwalks into Punditry. The original point:

These days, big data, artificial intelligence and the tech platforms that put them to work have huge influence and power. Algorithms choose the information we see when we go online, the jobs we get, the colleges to which we’re admitted and the credit cards and insurance we are issued. It goes without saying that when computers are making decisions, a lot can go wrong.

Our lawmakers desperately need this explained to them in an unbiased way so they can appropriately regulate, and tech companies need to be held accountable for their influence over all elements of our lives. But academics have been asleep at the wheel, leaving the responsibility for this education to well-paid lobbyists and employees who’ve abandoned the academy.

• On May 23, an email landed in the sales inbox of a San Francisco startup called HiQ Labs, politely asking the company to go out of business. HiQ is a “people analytics” firm that creates software tools for corporate human resources departments. Its Skill Mapper graphically represents the credentials and abilities of a workforce; its Keeper service identifies when employees are at risk of leaving for another job. Both draw the overwhelming majority of their data from a single trove: the material that is posted—with varying degrees of timeliness, detail, accuracy, and self-awareness—by the 500 million people on the social networking site LinkedIn.

• Meet The Spreadsheet That Can Solve NYC Transit (and the Man Who Made It). True Data Science, but not sexy.

The spreadsheet is called the Balanced Transportation Analyzer, or BTA. It has 72 separate worksheets, many of which contain over a thousand rows and dozens of columns. Komanoff made the spreadsheet for a single purpose: to be the most comprehensive accounting possible of how a congestion charge in Manhattan would affect New York City.

## Visualizations

• CO2 concentration and global mean temperature, 1958

Data Links is a periodic blog post published on Sundays (specific time may vary) which contains interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed or this one, which only contains these articles, if you are not interested in other things I might publish.

Have you read an article you liked and would you like to suggest it for the next issue? Just contact me!

### Why is R slow? some explanations and MKL/OpenBLAS setup to try to fix this

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

# Introduction

Many users tell me that R is slow. With old R releases that is 100% true, because old R versions used their own internal numerical libraries instead of optimized numerical libraries.

But numerical libraries do not tell the complete story. In many cases slow code execution can be attributed to inefficient code, and more precisely to not following one or more of these good practices:

• Using byte-code compiler
• Vectorizing operations
• Using simple data structures (i.e. using matrices instead of data frames in large computing instances)
• Re-using results
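
These practices are not specific to R; any interpreted language pays the same costs. Below is a minimal Python sketch of the vectorization and result-reuse points (the R analogues would be vectorized calls such as rowSums(), byte-compilation via compiler::cmpfun(), and caching intermediate objects); the timings and the `expensive` helper are illustrative only:

```python
import time

n = 200_000
data = list(range(n))

# Vectorizing: one call into optimized C code instead of an
# interpreted loop that executes n bytecode iterations.
t0 = time.perf_counter()
total_loop = 0
for x in data:
    total_loop += x
loop_s = time.perf_counter() - t0

t0 = time.perf_counter()
total_vec = sum(data)  # single "vectorized" call
vec_s = time.perf_counter() - t0

assert total_loop == total_vec == n * (n - 1) // 2

# Re-using results: compute an expensive intermediate once, then reuse it.
cache = {}
def expensive(k):
    if k not in cache:  # memoize instead of recomputing
        cache[k] = sum(i * i for i in range(k))
    return cache[k]

expensive(n)  # computed once
expensive(n)  # second call is just a dictionary lookup

print(f"loop: {loop_s:.4f}s, vectorized: {vec_s:.4f}s")
```

On a typical machine the vectorized sum is many times faster than the loop, for exactly the interpreter-overhead reason discussed below.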

I would add another good practice: “Use the tidyverse”. Because tidyverse packages such as dplyr benefit from Rcpp, their C++ backend can be faster than the equivalent operations in base (i.e. plain vanilla) R.

The idea of this post is to clarify some ideas. R does not compete with C or C++, because they are different languages; R with the data.table package may compete with Python with the numpy library. This does not mean that I’m defending R over Python or vice versa. The reason is that both R and Python are implemented as interpreters, while C and C++ are compiled, and this means that C and C++ will always be faster because, in really over-simplified terms, compiled implementations are closer to the machine.

# Basic setup for general usage

As an Ubuntu user I can say the basic R installation from the Canonical or CRAN repositories works for most of the things I do on my laptop.

When I use RStudio Server Pro® it’s a different story, because I really want to optimize things: when I work with large data (i.e. 100GB in RAM), even a 3% gain in resource efficiency or execution time is valuable.

Installing R with OpenBLAS will give you a tremendous performance boost, and that will work for most laptop situations. I explain how to do that in detail for Ubuntu 17.10 and Ubuntu 16.04, but a general setup would be as simple as one of these two options:

# 1: Install from Canonical (default Ubuntu repository)
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install libopenblas-dev r-base

# 2: Install from CRAN mirror
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys  51716619E084DAB9
printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install libopenblas-dev r-base

# 3: Install RStudio (bonus)
sudo apt-get install gdebi
sudo gdebi rstudio-xenial-1.1.383-amd64.deb
printf '\nexport QT_STYLE_OVERRIDE=gtk\n' | sudo tee -a ~/.profile


Option (1) is a substitute for (2). It’s totally up to you which one to use, and both will give you a really fast R compared to installing it without OpenBLAS.

# Benchmarking different R setups

I already use R with OpenBLAS just like the setup above. I will compile parallel R instances to do the benchmarking.

## Installing Intel® MKL numerical libraries

My benchmarks indicate that, in my case, it’s worth the time it takes to install Intel® MKL. Execution time is strongly reduced for some operations when compared to R with OpenBLAS.

Run this to install MKL:

# keys taken from https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB

sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update && sudo apt-get install intel-mkl-64bit


## Installing CRAN R with MKL

To compile it from source (in this case it’s the only option) run these lines:

# key added after sudo apt-get update returned a warning following this guide: https://support.rstudio.com/hc/en-us/articles/218004217-Building-R-from-source
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys  51716619E084DAB9
printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list

# you need to enable the multiverse repo, otherwise packages such as xvfb won't be found
sudo rm -rf /etc/apt/sources.list
printf 'deb http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse
deb-src http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse
deb-src http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse
deb http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse
deb-src http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse\n' | sudo tee -a /etc/apt/sources.list

sudo apt-get update
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove

sudo apt-get build-dep r-base

cd ~/GitHub/r-with-intel-mkl
wget https://cran.r-project.org/src/base/R-3/R-3.4.2.tar.gz
tar xzvf R-3.4.2.tar.gz

cd R-3.4.2
source /opt/intel/mkl/bin/mklvars.sh intel64
# NOTE: $MKL must hold the MKL linker flags for your system before running configure
./configure --prefix=/opt/R/R-3.4.2-intel-mkl --enable-R-shlib --with-blas="$MKL" --with-lapack
make && sudo make install
printf '\nexport RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R\n' | tee -a ~/.profile


## Installing CRAN R with OpenBLAS

So as not to interfere with my working installation, I compiled a parallel instance from source:

cd ~/GitHub/r-with-intel-mkl/
rm -rf R-3.4.2
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2

./configure --prefix=/opt/R/R-3.4.2-openblas --enable-R-shlib --with-blas --with-lapack
make && sudo make install
printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R\n' | tee -a ~/.profile


## Installing CRAN R with no optimized numerical libraries

There is plenty of discussion, and strong evidence from different stakeholders in the R community, indicating that this is by far the most inefficient option. I compiled it only to make the benchmark complete:

cd ~/GitHub/r-with-intel-mkl/
rm -rf R-3.4.2
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2

./configure --prefix=/opt/R/R-3.4.2-defaults --enable-R-shlib
make && sudo make install
printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R\n' | tee -a ~/.profile


## Installing Microsoft© R Open with MKL

This R version includes MKL by default and is supposed to be easy to install. I could not get it to run, which is unfortunate because several articles (like this post by Brett Klamer) state that this R version is really efficient, though no different from standard CRAN R with the MKL numerical libraries.

In any case here’s the code to install this version:

cd ~/GitHub/r-with-intel-mkl
wget https://mran.blob.core.windows.net/install/mro/3.4.2/microsoft-r-open-3.4.2.tar.gz
tar xzvf microsoft-r-open-3.4.2.tar.gz
cd microsoft-r-open
sudo ./install.sh
printf 'export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R\n' | tee -a ~/.profile

# it was not possible to start /opt/microsoft/ropen/3.4.2/lib64/R/bin/R
# the error is:
# *** caught segfault ***
# address 0x50, cause 'memory not mapped'

# removing Microsoft R
# https://mran.microsoft.com/documents/rro/installation#revorinst-uninstall steps did not work
sudo apt-get remove 'microsoft-r-.*'
sudo apt-get autoclean && sudo apt-get autoremove


## Benchmark results

My scripts above edit ~/.profile. This lets me open RStudio with each of the differently configured R instances on my computer.

I released the benchmark results and scripts on GitHub. The idea is to run the same scripts from ATT© and Microsoft© to see how different setups perform.

To work with CRAN R with MKL I had to edit ~/.profile because of how I configured the instances. So I ran nano ~/.profile and commented out the last part of the file to obtain this result:

#export RSTUDIO_WHICH_R=/usr/bin/R
export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R
#export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R
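The editing itself is mechanical, so it can be scripted. Here is a sketch of the toggle using sed, run on a throwaway copy of the file so nothing real is touched (the real edits go in ~/.profile):

```shell
# Demonstrate the instance switch on a disposable copy of the profile.
profile=/tmp/profile-demo
cat > "$profile" <<'EOF'
#export RSTUDIO_WHICH_R=/usr/bin/R
export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R
EOF

# comment out every export, then uncomment the instance we want (OpenBLAS here)
sed -i 's|^export RSTUDIO_WHICH_R|#export RSTUDIO_WHICH_R|' "$profile"
sed -i 's|^#\(export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R\)$|\1|' "$profile"

grep '^export' "$profile"   # exactly one active export should remain
```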


After that I log out and log back in, then open RStudio to run the benchmark.

The other two cases are similar: I obtained the benchmark results by editing ~/.profile, logging out and in, and opening RStudio with the corresponding instance.

As an example, each result starts with the R version and the corresponding numerical libraries used in that session. All other results are reported in the same way.

And here are the results from the ATT© benchmarking script:

| Task | CRAN R with MKL (seconds) | CRAN R with OpenBLAS (seconds) | CRAN R with no optimized libraries (seconds) |
|---|---|---|---|
| Creation, transp., deformation of a 2500×2500 matrix | 0.68 | 0.68 | 0.67 |
| 2400×2400 normal distributed random matrix ^1000 | 0.56 | 0.56 | 0.56 |
| Sorting of 7,000,000 random values | 0.79 | 0.79 | 0.79 |
| 2800×2800 cross-product matrix (b = a' * a) | 0.3 | 0.36 | 14.55 |
| Linear regr. over a 3000×3000 matrix (c = a \ b') | 0.17 | 0.22 | 6.98 |
| FFT over 2,400,000 random values | 0.33 | 0.33 | 0.33 |
| Eigenvalues of a 640×640 random matrix | 0.22 | 0.49 | 0.74 |
| Determinant of a 2500×2500 random matrix | 0.2 | 0.22 | 2.99 |
| Cholesky decomposition of a 3000×3000 matrix | 0.31 | 0.21 | 5.76 |
| Inverse of a 1600×1600 random matrix | 0.2 | 0.21 | 2.79 |
| 3,500,000 Fibonacci numbers calculation (vector calc) | 0.54 | 0.54 | 0.54 |
| Creation of a 3000×3000 Hilbert matrix (matrix calc) | 0.23 | 0.24 | 0.23 |
| Grand common divisors of 400,000 pairs (recursion) | 0.27 | 0.29 | 0.3 |
| Creation of a 500×500 Toeplitz matrix (loops) | 0.28 | 0.28 | 0.28 |
| Escoufier's method on a 45×45 matrix (mixed) | 0.22 | 0.23 | 0.28 |
| Total time for all 15 tests | 5.3 | 5.62 | 37.78 |
| Overall mean (weighted mean) | 0.31 | 0.32 | 0.93 |

And here are the results from the Microsoft© benchmarking script:

| Task | CRAN R with MKL (seconds) | CRAN R with OpenBLAS (seconds) | CRAN R with no optimized libraries (seconds) |
|---|---|---|---|
| Matrix multiply | 5.985 | 13.227 | 165.18 |
| Cholesky Factorization | 1.061 | 2.349 | 26.762 |
| Singular Value Decomposition | 7.268 | 18.325 | 47.076 |
| Principal Components Analysis | 14.932 | 40.612 | 162.338 |
| Linear Discriminant Analysis | 26.195 | 43.75 | 117.537 |
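Both tables measure mostly BLAS-bound linear algebra. As a rough illustration of what such a benchmark script does, here is a minimal timing harness in Python/numpy; the actual scripts are R code from ATT© and Microsoft©, and the task sizes here are deliberately small so the sketch runs quickly.

```python
# Minimal timing harness in the spirit of the benchmark scripts above:
# repeat each BLAS-bound task a few times and keep the best wall-clock time.
import time
import numpy as np

def bench(label, fn, reps=3):
    """Run fn() `reps` times and report the best wall-clock time."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    print(f"{label}: {min(times):.3f}s")
    return min(times)

rng = np.random.RandomState(42)
a = rng.rand(300, 300)

bench("cross-product matrix (a' * a)", lambda: a.T @ a)
bench("Cholesky decomposition", lambda: np.linalg.cholesky(a.T @ a + 300 * np.eye(300)))
bench("matrix inverse", lambda: np.linalg.inv(a))
```

The spread you see between BLAS implementations comes almost entirely from tasks like these; scalar-heavy tasks (sorting, recursion, loops) barely move, which matches the tables above.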

# Actions after benchmarking results

I decided to go with Intel MKL, and I'll write another post benchmarking the things I do every day beyond what these scripts consider.

To clean my system, I deleted all R instances except the MKL one:

sudo apt-get remove r-base r-base-dev
sudo apt-get remove 'r-cran.*'

sudo apt-get autoclean && sudo apt-get autoremove

sudo apt-get build-dep r-base

sudo rm -rf /opt/R/R-3.4.2-openblas
sudo rm -rf /opt/R/R-3.4.2-defaults
sudo ln -s /opt/R/R-3.4.2-intel-mkl/bin/R  /usr/bin/R


I edited ~/.profile so the final lines are:

export RSTUDIO_WHICH_R=/usr/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R
#export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R


And I also decided to set up my user packages directory from scratch:

# remove installed user packages
rm -rf ~/R

# create new user packages directory
mkdir -p ~/R/x86_64-pc-linux-gnu-library/3.4

# install common packages
R --vanilla << EOF
install.packages(c("tidyverse","data.table","dtplyr","devtools","roxygen2","bit64","pacman"), repos = "https://cran.rstudio.com/")
q()
EOF

# Export to HTML/Excel
R --vanilla << EOF
install.packages(c("htmlTable","openxlsx"), repos = "https://cran.rstudio.com/")
q()
EOF

# Blog tools
R --vanilla << EOF
install.packages(c("knitr","rmarkdown"), repos='http://cran.us.r-project.org')
q()
EOF
sudo apt-get install python-pip
sudo pip install --upgrade --force-reinstall markdown rpy2==2.7.8 pelican==3.6.3

# PDF extraction tools
sudo apt-get install libpoppler-cpp-dev default-jre default-jdk
sudo R CMD javareconf
R --vanilla << EOF
library(devtools)
install.packages(c("rJava","pdftools"), repos = "https://cran.rstudio.com/")
install_github("ropensci/tabulizer")
q()
EOF

# TTF/OTF fonts usage
R --vanilla << EOF
install.packages("showtext", repos = "https://cran.rstudio.com/")
q()
EOF

# Cairo for graphic devices
sudo apt-get install libgtk2.0-dev libxt-dev libcairo2-dev
R --vanilla << EOF
install.packages("Cairo", repos = "https://cran.rstudio.com/")
q()
EOF


# Concluding remarks

The benchmarks presented here are in no way a definitive end to the long discussion on numerical libraries. My results do show some evidence that, because of the speed gains in some operations, I should use MKL.

One of the advantages of the setup I explained is that you can use MKL with Python. In that case `numpy` calculations will be boosted.
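For example, you can check which BLAS/LAPACK your numpy build is linked against and exercise a BLAS-bound operation directly. This is a sketch; the `show_config` output varies by build.

```python
# Check which BLAS/LAPACK this numpy build is linked against, and run a
# BLAS-bound operation (the cross-product matrix dispatches to dgemm).
import numpy as np

np.show_config()  # reports MKL, OpenBLAS, or reference BLAS, depending on the build

rng = np.random.RandomState(0)
a = rng.rand(500, 500)
b = a.T @ a           # BLAS-bound: this is where MKL vs OpenBLAS matters
print(b.shape)        # (500, 500)
```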

Using MKL with AMD© processors might not provide a significant improvement compared to using OpenBLAS. This is because MKL uses processor-specific instructions that work well with i3 or i5 processors but not necessarily with non-Intel models.


### Spatial models for demographic trends?

Jon Minton writes:

You may be interested in a commentary piece I wrote early this year, which was published recently in the International Journal of Epidemiology, where I discuss your work on identifying an aggregation bias in one of the key figures in Case & Deaton’s (in)famous 2015 paper on rising morbidity and mortality in middle-aged White non-Hispanics in the US.

Colour versions of the figures are available in the ‘supplementary data’ link in the above. (The long delay between writing, submitting, and the publication of the piece in IJE in some ways supports the arguments I make in the commentary, that timeliness is key, and blogs – and arxiv – allow for a much faster pace of research and analysis.)

An example of the more general approach I try to promote to looking at outcomes which vary by age and year is provided below, where I used data from the Human Mortality Database to produce a 3D printed ‘data cube’ of log mortality by age and year, whose features I then discuss. [See here and here.]

Seeing the data arranged in this way also makes it possible to see when the data quality improves, for example, as you can see the texture of the surface change from smooth (imputed within 5/10 year intervals) to rough.

I agree with your willingness to explore data visually to establish ground truths which your statistical models then express and explore more formally. (For example, in your identification of cohort effects in US voting preferences.) To this end I continue to find heat maps and contour plots of outcomes arranged by year and age a simple but powerful approach to pattern-finding, which I am now using as a starting point for statistical model specification.

The arrangement of data by year and age conceptually involves thinking about a continuous ‘data surface’ much like a spatial surface.

Given this, what are your thoughts on using spatial models which account for spatial autocorrelation, such as in R’s CARBayes package, to model demographic data as well?

I agree that visualization is important.

Regarding your question about a continuous surface: yes, this makes sense. But my instinct is that we’d want something tailored to the problem; I doubt that a CAR model makes sense in your example. Those models are rotationally symmetric, which doesn’t seem like a property you’d want here.

If you do want to fit Bayesian CAR models, I suggest you do it in Stan.

Minton responded:

I agree that additional structure and different assumptions to those made by CAR would be needed. I’m thinking more about the general principle of modeling continuous age-year-rate surfaces. In the case of fertility modeling, for example, I was able to follow enough of this paper (my background is as an engineer rather than statistician) to get a sense that it formalises the way I intuit the data.

In the case of fertility, I also agree with using cohort and age as the surface’s axes rather than year and age. I produced the figure in this poster, where I munged Human Fertility Database and (less quality assured but more comprehensive) Human Fertility Collection data together and re-arranged year-age fertility rates by cohort to produce slightly crude estimates of cumulative cohort fertility levels. The thick solid line shows at which age different cohorts ‘achieve’ replacement fertility levels (2.05), which for most countries veers off into infinity if not achieved by around the age of 43. The USA is unusual in regaining replacement fertility levels after losing them, which I assume is a secondary effect of high migration, and migrant cohorts bringing with them a different fertility schedule than non-migrants. The tiles are arranged from most to least fertile in the last recorded year, but the trends show these ranks will change over time, and the USA may move to top place.

The post Spatial models for demographic trends? appeared first on Statistical Modeling, Causal Inference, and Social Science.

### If you did not already know

Deep Texture Encoding Network (Deep TEN)
We propose a Deep Texture Encoding Network (Deep-TEN) with a novel Encoding Layer integrated on top of convolutional layers, which ports the entire dictionary learning and encoding pipeline into a single model. Current methods build from distinct components, using standard encoders with separate off-the-shelf features such as SIFT descriptors or pre-trained CNN features for material recognition. Our new approach provides an end-to-end learning framework, where the inherent visual vocabularies are learned directly from the loss function. The features, dictionaries and the encoding representation for the classifier are all learned simultaneously. The representation is orderless and therefore is particularly useful for material and texture recognition. The Encoding Layer generalizes robust residual encoders such as VLAD and Fisher Vectors, and has the property of discarding domain specific information which makes the learned convolutional features easier to transfer. Additionally, joint training using multiple datasets of varied sizes and class labels is supported resulting in increased recognition performance. The experimental results show superior performance as compared to state-of-the-art methods using gold-standard databases such as MINC-2500, Flickr Material Database, KTH-TIPS-2b, and two recent databases 4D-Light-Field-Material and GTOS. The source code for the complete system is publicly available. …

LibLinear
LibLinear is a linear classifier for data with millions of instances and features. It supports
• L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
• L1-regularized classifiers (after version 1.4): L2-loss linear SVM and logistic regression (LR)
• L2-regularized support vector regression (after version 1.9): L2-loss linear SVR and L1-loss linear SVR. …
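LibLinear itself is a C library with many wrappers. As a toy illustration of the L2-regularized logistic regression objective it solves, here is a plain-numpy gradient-descent sketch; the real library uses a trust-region Newton method, not gradient descent, so this only shows the objective, not LibLinear's algorithm.

```python
# Toy sketch of the L2-regularized logistic regression objective:
#   min_w  0.5 * w'w + C * sum_i log(1 + exp(-y_i * w'x_i))
import numpy as np

def train_l2_logreg(X, y, C=1.0, lr=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(margins))   # per-example loss derivative weight
        grad = w - C * (X.T @ (y * sigma))      # regularizer gradient + loss gradient
        w -= lr * grad
    return w

# linearly separable toy data with labels in {-1, +1}, as LibLinear expects
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_l2_logreg(X, y)
print(np.sign(X @ w))  # recovers the training labels
```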

Root Mean Square Error (RMSE)
Taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard deviation. …
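As a concrete example, RMSE is a few lines in any language; a minimal Python sketch:

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error: the square root of the mean squared difference."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 (perfect predictions)
print(rmse([0.0, 0.0], [3.0, 4.0]))            # sqrt((9 + 16) / 2) ≈ 3.536
```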