“One of the keys to success in big data analytics projects is building strong ties between data analysts and business units.” Ed Burns ( August 2014 )
We’re back at it in the Proof of Work difficulty spectrum, this time going through how Ethereum’s difficulty changes over time. This is part 4.2 of the part 4 series, where part 4.1 was about Bitcoin’s PoW difficulty, and the following 4.3 will be about jbc’s PoW difficulty.
TL;DR
To calculate the difficulty for the next Ethereum block, you look at how long the previous block took to mine. If that time was greater than the goal time, the difficulty goes down so that the next block can be mined more quickly. If it was less than the goal time, the difficulty goes up so that the next block takes longer to mine.
There are three parts to determining the new difficulty: offset, which determines the standard amount of change from one difficulty to the next; sign, which determines whether the difficulty should go up or down; and bomb, which adds extra difficulty depending on the block’s number.
These numbers are calculated slightly differently for the different forks, Frontier, Homestead, and Metropolis, but the overall formula for calculating the next difficulty is
target = parent.difficulty + (offset * sign) + bomb
For the following code examples, this will be the class of the block.
```python
class Block():
    def __init__(self, number, timestamp, difficulty, uncles=None):
        self.number = number
        self.timestamp = timestamp
        self.difficulty = difficulty
        self.uncles = uncles
```
The data I use to show the code is correct was grabbed from Etherscan.
The code doesn’t include edge cases in calculating the difficulty either. For the most part, edge cases aren’t involved in calculating the difficulty target. By not including them, it makes the following code much easier to understand. I’ll talk about the edge cases at the end to make sure they’re not completely ignored.
Finally, for the first fork, I talk about what the different variables and functions do. For the later forks, Homestead and Metropolis, I only talk about the changes.
Also, here’s all the code I threw in a Github repo! If you don’t want to write all the code yourself, you should at least clone it locally and run it yourself. 1000% feel free to make pull requests if you want to add more parts to it or think I formatted the code wrong.
In the beginning, there was Frontier. I’ll jump right in by giving a bullet point section going over the config variables.
DIFF_ADJUSTMENT_CUTOFF — The number of seconds that Ethereum aims to mine each block in.
BLOCK_DIFF_FACTOR — Helps calculate how much the current difficulty can change.
EXPDIFF_PERIOD — Denotes after how many blocks the bomb amount is updated.
EXPDIFF_FREE_PERIODS — How many EXPDIFF_PERIODs are ignored before including the bomb in the difficulty calculation.

And now descriptions of the functions.
calc_frontier_sign — Calculates whether the next difficulty value should go up or down. In the case of Frontier, if the previous block was mined more quickly than the 13-second DIFF_ADJUSTMENT_CUTOFF, the sign will be 1, meaning we want the difficulty to be higher so that the next block is mined more slowly. If the previous block took longer than 13 seconds, the sign will be -1 and the next difficulty will be lower. The overall point is that the goal for block mining time is ~12.5 seconds. Take a look at Vitalik Buterin’s post where he talks about choosing 12 seconds as the minimum average block mining time.
calc_frontier_offset — Offset is the value that determines how much or how little the difficulty will change. In Frontier, this is the parent’s difficulty integer-divided by BLOCK_DIFF_FACTOR. Since it’s divided by 2048, which is 2^11, offset can also be calculated as parent_difficulty >> 11 if you want to look at it in terms of shifting bits. Since 1.0 / 2048 == 0.00048828125, the offset will only change the difficulty by about 0.0488% per change. Not that much, which is good because we don’t want the difficulty to change a ton with each mined block. But if block times stay consistently under the 13-second cutoff, the difficulty will slowly grow to compensate.
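As a quick sanity check, here’s the equivalence between the division and the bit shift, and where the 0.0488% figure comes from. The difficulty value is just an illustration:

```python
parent_difficulty = 34982465665323  # an example Frontier-era difficulty

# Integer division by 2048 (2^11) is the same as shifting right by 11 bits
offset = parent_difficulty // 2048
assert offset == parent_difficulty >> 11

# The per-block change is offset / parent_difficulty, about 0.0488%
print(round(100.0 * offset / parent_difficulty, 4))  # prints: 0.0488
```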
calc_frontier_bomb — The bomb. The bomb adds an amount of difficulty that doubles every EXPDIFF_PERIOD blocks. In the Frontier world, this value is incredibly small. For example, at block 1500000, the bomb is 2 ** ((1500000 // 100000) - 2) == 2 ** 13 == 8192. The difficulty of the block is 34982465665323. That’s a huge difference, meaning the bomb had essentially zero effect. This will change later.
calc_frontier_difficulty — Once you have the values for sign, offset, and bomb, the new difficulty is (parent.difficulty + offset * sign) + bomb. Let’s say the time it took to mine the parent’s block was 15 seconds. In this case, the difficulty will go down by offset * -1, with the small amount of the bomb added at the end. If the time to mine the parent’s block was 8 seconds, the difficulty will increase by offset + bomb.
In order to understand it fully, go through the code that follows and look at the calculations.
```python
config = dict(
    DIFF_ADJUSTMENT_CUTOFF=13,
    BLOCK_DIFF_FACTOR=2048,
    EXPDIFF_PERIOD=100000,
    EXPDIFF_FREE_PERIODS=2,
)

def calc_frontier_offset(parent_difficulty):
    offset = parent_difficulty // config['BLOCK_DIFF_FACTOR']
    return offset

def calc_frontier_sign(parent_timestamp, child_timestamp):
    time_diff = child_timestamp - parent_timestamp
    if time_diff < config['DIFF_ADJUSTMENT_CUTOFF']:
        sign = 1
    else:
        sign = -1
    return sign

def calc_frontier_bomb(parent_number):
    period_count = (parent_number + 1) // config['EXPDIFF_PERIOD']
    period_count -= config['EXPDIFF_FREE_PERIODS']
    bomb = 2**(period_count)
    return bomb

def calc_frontier_difficulty(parent, child_timestamp):
    offset = calc_frontier_offset(parent.difficulty)
    sign = calc_frontier_sign(parent.timestamp, child_timestamp)
    bomb = calc_frontier_bomb(parent.number)
    target = (parent.difficulty + offset * sign) + bomb
    return offset, sign, bomb, target
```
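As a worked example, here’s the same Frontier arithmetic inlined, using the block-1500000 difficulty from above and a made-up 8-second block time:

```python
# Illustrative Frontier calculation: parent difficulty borrowed from the
# block-1500000 example above; the 8-second timing is made up.
parent_difficulty = 34982465665323
parent_number = 1499999
time_diff = 8  # child_timestamp - parent_timestamp

offset = parent_difficulty // 2048                   # BLOCK_DIFF_FACTOR
sign = 1 if time_diff < 13 else -1                   # DIFF_ADJUSTMENT_CUTOFF
bomb = 2 ** ((parent_number + 1) // 100000 - 2)      # EXPDIFF_PERIOD, FREE_PERIODS
target = parent_difficulty + offset * sign + bomb

print(sign, bomb)  # prints: 1 8192  (8s < 13s, so difficulty goes up; bomb is tiny)
```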
The Homestead fork, which took place at block number 1150000 in March of 2016, has a couple of big changes in calculating the sign.
calc_homestead_sign — Instead of a single DIFF_ADJUSTMENT_CUTOFF (which is different for the Homestead fork) that makes the difficulty go up or down, Homestead takes a slightly different approach. If you look at the code, you’ll see that there are groupings of the sign rather than just 1 or -1. If the time_diff between parent and child is in [0, 9], the sign will be 1, meaning the difficulty needs to increase. If the time_diff is in [10, 19], the sign will be 0, meaning the difficulty should stay as it is. If the time_diff is in [20, 29], the sign becomes -1. If the time_diff is in [30, 39], the sign is -2, etc.
This does two things. First, Homestead doesn’t want to treat a block that took 50 seconds to mine the same as a block that took 15 seconds to mine. If a block took 50 seconds, the next difficulty does need to be that much easier. Second, instead of DIFF_ADJUSTMENT_CUTOFF representing the goal time, this switches the aim point to the midpoint of the range of time_diffs with a sign of 0: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], meaning ~14.5 seconds, not including the bomb.
```python
config = dict(
    HOMESTEAD_DIFF_ADJUSTMENT_CUTOFF=10,
    BLOCK_DIFF_FACTOR=2048,
    EXPDIFF_PERIOD=100000,
    EXPDIFF_FREE_PERIODS=2,
)

def calc_homestead_offset(parent_difficulty):
    offset = parent_difficulty // config['BLOCK_DIFF_FACTOR']
    return offset

def calc_homestead_sign(parent_timestamp, child_timestamp):
    time_diff = child_timestamp - parent_timestamp
    sign = 1 - (time_diff // config['HOMESTEAD_DIFF_ADJUSTMENT_CUTOFF'])
    return sign

def calc_homestead_bomb(parent_number):
    period_count = (parent_number + 1) // config['EXPDIFF_PERIOD']
    period_count -= config['EXPDIFF_FREE_PERIODS']
    bomb = 2**(period_count)
    return bomb

def calc_homestead_difficulty(parent, child_timestamp):
    offset = calc_homestead_offset(parent.difficulty)
    sign = calc_homestead_sign(parent.timestamp, child_timestamp)
    bomb = calc_homestead_bomb(parent.number)
    target = (parent.difficulty + offset * sign) + bomb
    return offset, sign, bomb, target
```
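To see the bucketing in action, here’s a small standalone restatement of the sign formula applied to a few hypothetical time differences:

```python
# Standalone restatement of the Homestead sign formula
def homestead_sign(time_diff, cutoff=10):
    return 1 - (time_diff // cutoff)

# [0,9] -> 1, [10,19] -> 0, [20,29] -> -1, [30,39] -> -2, ...
print([homestead_sign(t) for t in (5, 12, 25, 34, 50)])  # prints: [1, 0, -1, -2, -4]
```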
On to the Metropolis fork. There are a couple of differences from Homestead. The first is that the DIFF_ADJUSTMENT_CUTOFF is now 9, which means that, without uncles, the target block time is the midpoint of [9, 10, 11, 12, 13, 14, 15, 16, 17], or ~13 seconds.
The second takes into account whether or not there are uncles included in the block. An uncle, in Ethereum language, refers to the case where two nodes each mine a child block from the same grandparent. If you’re mining a child block from a parent that has a “sibling”, you pick one of the siblings to mine from, but you can also record that you noticed the other block. In that case, Ethereum wants to make the new difficulty larger, by another offset, to reduce the chance of two natural forks growing much longer.
Now the biggest difference is dissolving the impact of the bomb. Check out the code for calc_metropolis_bomb, where we subtract not only EXPDIFF_FREE_PERIODS but also METROPOLIS_DELAY_PERIODS, which is 30 time periods. A huge number. Instead of talking about the bomb here, I’ll have a section devoted to it after this.
```python
config = dict(
    METROPOLIS_DIFF_ADJUSTMENT_CUTOFF=9,
    BLOCK_DIFF_FACTOR=2048,
    EXPDIFF_PERIOD=100000,
    EXPDIFF_FREE_PERIODS=2,
    METROPOLIS_DELAY_PERIODS=30,
)

def calc_metropolis_offset(parent_difficulty):
    offset = parent_difficulty // config['BLOCK_DIFF_FACTOR']
    return offset

def calc_metropolis_sign(parent, child_timestamp):
    if parent.uncles:
        uncles = 2
    else:
        uncles = 1
    time_diff = child_timestamp - parent.timestamp
    sign = uncles - (time_diff // config['METROPOLIS_DIFF_ADJUSTMENT_CUTOFF'])
    return sign

def calc_metropolis_bomb(parent_number):
    period_count = (parent_number + 1) // config['EXPDIFF_PERIOD']
    period_count -= config['METROPOLIS_DELAY_PERIODS']  # chop off 30, meaning go back 3M blocks in time
    period_count -= config['EXPDIFF_FREE_PERIODS']  # chop off 2 more for good measure
    bomb = 2**(period_count)
    return bomb

def calc_metropolis_difficulty(parent, child_timestamp):
    offset = calc_metropolis_offset(parent.difficulty)
    sign = calc_metropolis_sign(parent, child_timestamp)
    bomb = calc_metropolis_bomb(parent.number)
    target = (parent.difficulty + offset * sign) + bomb
    return offset, sign, bomb, target
```
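Here’s a quick standalone sketch showing how including an uncle shifts the sign by one at the same block time:

```python
# Standalone restatement of the Metropolis sign formula
def metropolis_sign(has_uncles, time_diff, cutoff=9):
    uncles = 2 if has_uncles else 1
    return uncles - (time_diff // cutoff)

# With an uncle present, the same 13-second block gets one extra offset
print(metropolis_sign(False, 13), metropolis_sign(True, 13))  # prints: 0 1
```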
If you look at one of the difficulty charts online, you’ll see a recent run of huge increasing jumps every 100000 blocks, and then a giant drop off about a month ago. Screenshot time for those not wanting to click the link:
What’s the point of having a bomb like this? A big goal of Ethereum is to move away from Proof of Work, which requires energy and time to create and validate a new block, to Proof of Stake, which is described in the Ethereum Wiki. In order to force nodes to move to the Proof of Stake implementation, a “bomb” that doubles its impact on the difficulty every 100000 blocks would eventually make it take so long to mine a new block that nodes running on the old fork won’t be able to continue. This is where the term “Ice Age” comes from; the blockchain would be frozen in time.
This is a good way to manage future changes, but we ran into the issue that the new Proof of Stake implementation, called Casper, wasn’t ready before the giant difficulty spikes seen in that graph. That’s where the Metropolis fork came into play: it eliminated the effect the bomb has on the difficulty for now, but in a few years the bomb will come back into play, by which point the switch to Casper will (hopefully) be ready to roll. That said, predicting when features like this are ready for production and adoption is difficult, so if Casper isn’t ready quickly enough, another fork can move the bomb back in time again.
Bomb Math
Besides the description of what the bombs do to the difficulty as seen on those graphs, it’s worth showing the math behind why the difficulty is rising to those levels, and what that means for how long it takes to mine a block.
Going back, remember that offset represents the amount of difficulty change that shifts a block’s mining time by about DIFF_ADJUSTMENT_CUTOFF seconds. For example, if the previous time difference between blocks was in [18, 26], then decreasing the difficulty by one offset should move the time to mine a block back to the [9, 17] range, or DIFF_ADJUSTMENT_CUTOFF seconds lower.
So if we increase the difficulty by offset, we expect the time it takes to mine a block to increase by DIFF_ADJUSTMENT_CUTOFF. If we increase the difficulty by half of an offset, the mining time should increase by about DIFF_ADJUSTMENT_CUTOFF / 2.
So if we want to see how much a bomb will influence the change in mining time, we want the ratio of bomb and offset.
(bomb / offset) * DIFF_ADJUSTMENT_CUTOFF
Here’s the code that calculates the expected mining time and compares it to the actual average time, where the actual average mining time was gathered from this chart by hovering the cursor over the times when these blocks were mined.
```python
def calc_mining_time(block_number, difficulty, actual_average_mining_time,
                     calc_bomb, calc_offset):
    homestead_goal_mining_time = 14.5  # about that
    bomb = calc_bomb(block_number)
    offset = calc_offset(difficulty)
    bomb_offset_ratio = bomb / float(offset)
    seconds_adjustment = bomb_offset_ratio * config['HOMESTEAD_DIFF_ADJUSTMENT_CUTOFF']
    calculated_average_mining_time = homestead_goal_mining_time + seconds_adjustment
    print("Bomb: %s" % bomb)
    print("Offset: %s" % offset)
    print("Bomb Offset Ratio: %s" % bomb_offset_ratio)
    print("Seconds Adjustment: %s" % seconds_adjustment)
    print("Actual Avg Mining Time: %s" % actual_average_mining_time)
    print("Calculated Mining Time: %s" % calculated_average_mining_time)
    print()

block_number = 4150000
difficulty = 1711391947807191
actual_average_mining_time = 0.35 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time,
                 calc_homestead_bomb, calc_homestead_offset)

block_number = 4250000
difficulty = 2297313428231280
actual_average_mining_time = 0.4 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time,
                 calc_homestead_bomb, calc_homestead_offset)

block_number = 4350000
difficulty = 2885956744389112
actual_average_mining_time = 0.5 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time,
                 calc_homestead_bomb, calc_homestead_offset)

block_number = 4546050
difficulty = 1436507601685486
actual_average_mining_time = 0.23 * 60
calc_mining_time(block_number, difficulty, actual_average_mining_time,
                 calc_metropolis_bomb, calc_metropolis_offset)
```
When run:
```
Bomb: 549755813888
Offset: 835640599515
Bomb Offset Ratio: 0.657885476372
Seconds Adjustment: 6.57885476372
Actual Avg Mining Time: 21.0
Calculated Mining Time: 21.0788547637

Bomb: 1099511627776
Offset: 1121735072378
Bomb Offset Ratio: 0.980188330427
Seconds Adjustment: 9.80188330427
Actual Avg Mining Time: 24.0
Calculated Mining Time: 24.3018833043

Bomb: 2199023255552
Offset: 1409158566596
Bomb Offset Ratio: 1.56052222062
Seconds Adjustment: 15.6052222062
Actual Avg Mining Time: 30.0
Calculated Mining Time: 30.1052222062

Bomb: 8192
Offset: 701419727385
Bomb Offset Ratio: 1.16791696614e-08
Seconds Adjustment: 1.16791696614e-07
Actual Avg Mining Time: 13.8
Calculated Mining Time: 14.5000001168
```
Look how large the bomb offset ratio is for the later Homestead blocks, and how tiny it becomes after the Metropolis block! With the loss of the bomb’s impact, the time difference drops a crap ton.
As of now, the bomb barely affects the difficulty at all. It’ll be back though. If Ethereum continues to mine blocks at a rate of ~14.5 seconds, it’ll have 100,000 mined blocks in around 17 days. Multiply that by 30, which accounts for the METROPOLIS_DELAY_PERIODS, and we’ll be back in a state where the bomb makes a difference in around a year and a half, no matter how much the hash power increases.
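The back-of-the-envelope arithmetic behind those numbers:

```python
seconds_per_block = 14.5      # rough post-Metropolis average block time
blocks_per_period = 100000    # EXPDIFF_PERIOD
delay_periods = 30            # METROPOLIS_DELAY_PERIODS

days_per_period = seconds_per_block * blocks_per_period / (60 * 60 * 24)
years_until_bomb = days_per_period * delay_periods / 365

print(round(days_per_period, 1))   # prints: 16.8  (about 17 days per period)
print(round(years_until_bomb, 2))  # prints: 1.38  (about a year and a half)
```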
Bomb Equilibrium
The last part of dealing with the bomb is to quickly explain why a slight increase in the bomb raises the overall difficulty to a value much higher than the bomb value itself.
The way this works is by taking the calculated expected mining time and, from there, calculating the difficulty required by the current hash rate to mine blocks in that time period. When the bomb amount changes, the difficulty will continue to rise or fall until it reaches the difficulty for that time, and then hover in that equilibrium. If you look at the blocks right after a bomb change, you’ll see it takes more than a few blocks to get to the new level.
There are a couple edge cases here not mentioned in the code above.
Like before, here’s a list of questions I had when writing this that didn’t get put in the main post.
What’s in the header?
I’m sure this question will come up a lot when only talking about Proof of Work difficulty. It’s great that we know how difficulty is calculated, but how to validate that a block’s header is valid is beyond this post.
I’m not going to explain it here, but probably in a future post when I decide how jbc’s header should be calculated after I implement transactions.
I will note that Bitcoin’s header is incredibly simple, with the values smashed together (making sure that the bits are combined with the right endianness). Ethereum’s is much more complicated, with transactions hashed into a Merkle Patricia trie rather than Bitcoin’s simple Merkle tree.
How do you go through and figure this out?
There are tons of posts out there like this one, but frankly, most are very high level, with descriptions and numbers but not much code that calculates those numbers. My goal is to do all of those.
That means that to understand them all completely, I go through and look at the code bases. Here’s the calc_difficulty function from the Python repo, calcDifficulty from C++, and calcDifficulty from Go. The bigger point that I keep making is that just looking at the code doesn’t do much at all; what you need to do is implement similar code yourself.
All of those time estimations include a ~ before a number. Why is that necessary?
That’s a really good question. And the main answer is uncertainty in how many nodes are trying to mine blocks, as well as randomness. If you look at the difficulty chart and zoom in to a timespan of a couple days, you’ll see how random it gets. It’s the same with the average mining time for a specific day. They fluctuate, and so we can’t say exactly what time we expect.
Proof of Work is all I hear people talk about, but it isn’t universal, right?
Correct! I mentioned it above, and I’m betting that if you’re reading this much about Ethereum and you’re all the way at the bottom of this post, you’ve heard the term Proof of Stake. This is the new option that no major cryptocurrency blockchain has implemented yet. There are other types of block validation out there too. The big upcoming Enterprise versions of blockchains probably won’t be based on Proof of Work at all.
The biggest indicator of what type of validation blockchains will use depends on the use cases. Cryptocurrencies need something completely un-fraudable. Same with a blockchain that might store information on who owns what piece of land. We don’t want people to be able to change ownership. But a blockchain that’s used to store something less insanely valuable doesn’t need to waste all that energy. I know some of these other Proofs of validity will come into play in the next few years.
You wrote something confusing / didn’t explain it well enough. What should I do?
Get in freaking contact. Seriously, there are tons of posts out there that talk about results but don’t say how they were calculated, and assume that everyone is as smart as the author and knows what’s going on. I’m trying to be the opposite, where I don’t say something without fully explaining it. So if there’s something confusing or wrong, contact me and I’ll make sure to fix the issue.
Can I DM you on Twitter?
Do you like cats or dogs better?
Cats.
In this post, I show how to build an interactive geographic dashboard using Displayr, Plotly and R. It is particularly fascinating in that it tracks the real-time position of military aircraft. To do so, I am first going to bring in some data from two separate sources (regions based on the size of military air-force, and then the real-time tracking of aircraft positions). The dashboard displays the dynamic data in two ways: region shading (to indicate the size of the military force per country) and point markers (for the aircraft positions). I then build a map to neatly and attractively display all this data.
I am going to shade each country of the map according to the size of its military air-force. To do this I need a list of aircraft per country. Fortunately, the ever useful Wikipedia has just what I need (here). The following code reads in the data and cleans it to produce a neat table.
```r
library(httr)
library(XML)

# Extract table via URL
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_level_of_military_equipment#List"
r <- GET(url)
airforces <- readHTMLTable(doc = content(r, "text"))[[2]]

# Clean required columns
airforces <- airforces[-1, c("Country[note 1]", "Military aircraft[note 3]")]
colnames(airforces) <- c("Country", "MilitaryAircraft")
remove.bracket.content <- function(s) {
    return(gsub("\\[.+\\]", "", s))
}
airforces <- data.frame(apply(airforces, 2, remove.bracket.content))
airforces$MilitaryAircraft <- as.numeric(gsub(",", "", airforces$MilitaryAircraft))
airforces
```
Compared to the above, the second source of data is more dynamic. I am using ADS-B, which pools real-time flight tracking information from across the globe. Not all military operations are top secret. Apparently, some aircraft operated by the armed forces broadcast their positions publicly.
To collate this information, I construct a URL to retrieve a JSON object with information about military aircraft (JSON is a flexible text format used for data exchange). Then I parse the JSON to create a data.frame.
```r
library(jsonlite)

url <- "http://public-api.adsbexchange.com/VirtualRadar/AircraftList.json?"
url <- paste0(url, "fMilQ=TRUE")
positions <- fromJSON(url)$acList
if (length(positions) != 0) {
    positions <- positions[positions$Type != "TEST", ]
    positions <- positions[!is.na(positions$Lat), ]
}
positions
```
The following code produces a plotly map of the world. The countries are shaded according to the size of the air-force, with the scale shown on a legend. In plotly terminology, each layer of the map is known as a trace.
```r
library(plotly)
library(flipFormat)

# specify map area and projection
g <- list(scope = "world",
          showframe = FALSE,
          showcoastlines = TRUE,
          projection = list(type = 'mercator'),
          lonaxis = list(range = c(-140, 179)),
          lataxis = list(range = c(-55, 70)),
          resolution = 50)  # higher resolution

# shade countries by airforce size
p <- plot_geo(airforces) %>%
    add_trace(data = airforces,
              name = "Airforce",
              z = ~MilitaryAircraft,
              color = ~MilitaryAircraft,
              colors = 'Blues',
              locations = ~Country,
              marker = list(line = list(color = toRGB("grey"), width = 0.5)),
              showscale = TRUE,
              locationmode = "country names",
              colorbar = list(title = 'Airforce', separatethousands = TRUE)) %>%
    config(displayModeBar = F) %>%
    layout(geo = g,
           margin = list(l = 0, r = 0, t = 0, b = 0, pad = 0),
           paper_bgcolor = 'transparent')
```
Finally, I add markers showing the airplane locations as another trace. I use a different color for those with speed less than 200 knots and altitude less than 2000 feet. The hover text contains more detail about the aircraft.
```r
# create a vector with the status of each aeroplane
aircolors = rep("airborne", nrow(positions))
aircolors[positions$Spd < 200 & positions$Alt < 2000] = "ground/approach"
hovertext = paste0("Operator:", positions$Op,
                   "\nModel:", positions$Mdl,
                   "\nAltitude(ft):", sapply(positions$Alt, FormatAsReal))
hoverinfo = rep("all", nrow(positions))
p = add_trace(p,
              data = positions,
              x = positions$Long,
              y = positions$Lat,
              color = aircolors,
              hovertext = hovertext,
              showlegend = FALSE)
```
The final result is shown below. Since broadcasting of position is purely voluntary, I suspect that the Top Gun flights are not shown!
While the map above shows the required information, it can easily be made more useful and appealing. Displayr allows me to add a control to switch between regions, text and image annotations, and a background. Here is a link to the finished dashboard, with a screenshot shown below.
Thinking about it again, if you can’t see any military aircraft at all on the radar, then you should really worry!
You can explore the dashboards used in this post within Displayr. All the code is embedded within the pages (look on the right-hand panel after you click on an object). Click here to see a copy of the working document (without background and finishing touches). Click here to open a copy of the finished document that includes the background, control box, and other finishing touches.
Sketched Subspace Clustering (Sketch-SC)
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data. Subspace clustering (SC) is a relatively recent method that is able to successfully classify nonlinearly separable data in a multitude of settings. In spite of their high clustering accuracy, SC methods incur prohibitively high computational complexity when processing large volumes of high-dimensional data. Inspired by random sketching approaches for dimensionality reduction, the present paper introduces a randomized scheme for SC, termed Sketch-SC, tailored for large volumes of high-dimensional data. Sketch-SC accelerates the computationally heavy parts of state-of-the-art SC approaches by compressing the data matrix across both dimensions using random projections, thus enabling fast and accurate large-scale SC. Performance analysis as well as extensive numerical tests on real data corroborate the potential of Sketch-SC and its competitive performance relative to state-of-the-art scalable SC approaches. …
Horseshoe Estimator
This paper proposes a new approach to sparse-signal detection called the horseshoe estimator. We show that the horseshoe is a close cousin of the lasso in that it arises from the same class of multivariate scale mixtures of normals, but that it is almost universally superior to the double-exponential prior at handling sparsity. A theoretical framework is proposed for understanding why the horseshoe is a better default “sparsity” estimator than those that arise from powered-exponential priors. Comprehensive numerical evidence is presented to show that the difference in performance can often be large. Most importantly, we show that the horseshoe estimator corresponds quite closely to the answers one would get if one pursued a full Bayesian model-averaging approach using a “two-groups” model: a point mass at zero for noise, and a continuous density for signals. Surprisingly, this correspondence holds both for the estimator itself and for the classification rule induced by a simple threshold applied to the estimator. We show how the resulting thresholded horseshoe can also be viewed as a novel Bayes multiple-testing procedure. …
subgraph2vec
In this paper, we present subgraph2vec, a novel approach for learning latent representations of rooted subgraphs from large graphs inspired by recent advancements in Deep Learning and Graph Kernels. These latent representations encode semantic substructure dependencies in a continuous vector space, which is easily exploited by statistical models for tasks such as graph classification, clustering, link prediction and community detection. subgraph2vec leverages on local information obtained from neighbourhoods of nodes to learn their latent representations in an unsupervised fashion. We demonstrate that subgraph vectors learnt by our approach could be used in conjunction with classifiers such as CNNs, SVMs and relational data clustering algorithms to achieve significantly superior accuracies. Also, we show that the subgraph vectors could be used for building a deep learning variant of Weisfeiler-Lehman graph kernel. Our experiments on several benchmark and large-scale real-world datasets reveal that subgraph2vec achieves significant improvements in accuracies over existing graph kernels on both supervised and unsupervised learning tasks. Specifically, on two realworld program analysis tasks, namely, code clone and malware detection, subgraph2vec outperforms state-of-the-art kernels by more than 17% and 4%, respectively. …
Bidirectional deep echo state networks
An Efficient Bayesian Robust Principal Component Regression
Action-Attending Graphic Neural Network
Grounding Visual Explanations (Extended Abstract)
High-Resolution Deep Convolutional Generative Adversarial Networks
Multi-Label Zero-Shot Learning with Structured Knowledge Graphs
A generic and fast C++ optimization framework
ATRank: An Attention-Based User Behavior Modeling Framework for Recommendation
Nonparametric independence testing via mutual information
• Accuracy of inference on the physics of binary evolution from gravitational-wave observations
• Language-Based Image Editing with Recurrent Attentive Models
• Network Geometry and Complexity
• Chromatic Number and Dichromatic Polynomial of Digraphs
• On cordial labeling of hypertrees
• Bayesian Best-Arm Identification for Selecting Influenza Mitigation Strategies
• One Model for the Learning of Language
• Grammatical facial expression recognition using customized deep neural network architecture
• Neighborhood selection with application to social networks
• Spatio-Temporal Motifs for Optimized Vehicle-to-Vehicle (V2V) Communications
• Local Density of the Bose Glass Phase
• GA-PSO-Optimized Neural-Based Control Scheme for Adaptive Congestion Control to Improve Performance in Multimedia Applications
• On the Verification and Computation of Strong Nash Equilibrium
• Poverty Mapping Using Convolutional Neural Networks Trained on High and Medium Resolution Satellite Images, With an Application in Mexico
• Fast ordered sampling of DNA sequence variants
• Estimating stationary characteristic functions of stochastic systems via semidefinite programming
• Attend and Interact: Higher-Order Object Interactions for Video Understanding
• Entanglement contour perspective for strong area law violation in a disordered long-range hopping model
• Mosquito detection with low-cost smartphones: data acquisition for malaria research
• Conditional Markov Chain Search for the Simple Plant Location Problem improves upper bounds on twelve Körkel-Ghosh instances
• Student Success Prediction in MOOCs
• Question Asking as Program Generation
• Numerical time integration of lumped parameter systems governed by implicit constitutive relations
• Grounded Objects and Interactions for Video Captioning
• Adaptive active queue management controller for TCP communication networks using PSO-RBF models
• Exploring the Use of Shatter for AllSAT Through Ramsey-Type Problems
• 3D Reconstruction of Incomplete Archaeological Objects Using a Generative Adversary Network
• Free energy of bipartite spherical Sherrington–Kirkpatrick model
• Mobile Video Object Detection with Temporally-Aware Feature Maps
• Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
• Thoracic Disease Identification and Localization with Limited Supervision
• Shape Inpainting using 3D Generative Adversarial Network and Recurrent Convolutional Networks
• Improvements to context based self-supervised learning
• Dimensionality Reduction on Grassmannian via Riemannian Optimization: A Generalized Perspective
• Non local branching Brownians with annihilation and free boundary problems
• A Unified Method for Exact Inference in Random-effects Meta-analysis via Monte Carlo Conditioning
• VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
• An $O^*(1.84^k)$ Parameterized Algorithm for the Multiterminal Cut Problem
• Average treatment effects in the presence of unknown interference
• Improving Palliative Care with Deep Learning
• Multi-objective risk-averse two-stage stochastic programming problems
• Ubenwa: Cry-based Diagnosis of Birth Asphyxia
• Training a network to attend like human drivers saves it from common but misleading loss functions
• Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
• Vision Based Railway Track Monitoring using Deep Learning
• A Resizable Mini-batch Gradient Descent based on a Randomized Weighted Majority
• Towards Self-organized Large-Scale Shape Formation: A Cognitive Agent-Based Computing Approach
• Multi-Objective Maximization of Monotone Submodular Functions with Cardinality Constraint
• Using KL-divergence to focus Deep Visual Explanation
• Generic algorithms for scheduling applications on heterogeneous multi-core platforms
• Towards dense volumetric pancreas segmentation in CT using 3D fully convolutional networks
• Evolution of Social Power for Opinion Dynamics Networks
• Best rank-$k$ approximations for tensors: generalizing Eckart-Young
• xUnit: Learning a Spatial Activation Function for Efficient Image Restoration
• Stochastic Non-convex Ordinal Embedding with Stabilized Barzilai-Borwein Step Size
• Renormalization of local times of super-Brownian motion
• Chinese Typeface Transformation with Hierarchical Adversarial Network
• A scale-dependent finite difference method for time fractional derivative relaxation type equations
• A Fusion-based Gender Recognition Method Using Facial Images
• Reconstruction of a random phase dynamics network from observations
• Separating Style and Content for Generalized Style Transfer
• An almost-linear time algorithm for uniform random spanning tree generation
• Fast Recurrent Fully Convolutional Networks for Direct Perception in Autonomous Driving
• Association schemes on the Schubert cells of a Grassmannian
• GPI-based Secrecy Rate Maximization Beamforming Scheme for Wireless Transmission with AN-aided Directional Modulation
• A unified deep artificial neural network approach to partial differential equations in complex geometries
• AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding
• Contributed Discussion to Computationally Efficient Multivariate Spatio-Temporal Models for High-Dimensional Count-Valued Data by Bradley et al
• Local neighbourhoods for first passage percolation on the configuration model
• Large Neural Network Based Detection of Apnea, Bradycardia and Desaturation Events
• Optimal Index Codes via a Duality between Index Coding and Network Coding
• Improved Bayesian Compression
• A note on convergence of solutions of total variation regularized linear inverse problems
• Win Prediction in Esports: Mixed-Rank Match Prediction in Multi-player Online Battle Arena Games
• Pseudo-positive regularization for deep person re-identification
• Modelling dark current and hot pixels in imaging sensors
• Detecting hip fractures with radiologist-level performance using deep neural networks
• Image Matters: Jointly Train Advertising CTR Model with Image Representation of Ad and User Behavior
• Towards a Combinatorial proof of Gessel’s conjecture on two-sided Gamma positivity: A reduction to simple permutations
• Asymptotic normality in Crump-Mode-Jagers processes: the lattice case
• Classification of postoperative surgical site infections from blood measurements with missing data using recurrent neural networks
• Optimal rates of linear convergence of the averaged alternating modified reflections method for two subspaces
• Random walk on a randomly oriented honeycomb lattice
• Multiwinner Elections with Diversity Constraints
• Graph Clustering using Effective Resistance
• Nonlinear oscillatory mixing in the generalized Landau scenario
• Discovery of Complex Anomalous Patterns of Sexual Violence in El Salvador
• Learning a Robust Representation via a Deep Network on Symmetric Positive Definite Manifolds
• Liouville quantum gravity on the annulus
• Cut-off phenomenon for random walks on free orthogonal quantum groups
• Calibration of Distributionally Robust Empirical Optimization Models
• Random affine simplexes
• Learning to Play Othello with Deep Neural Networks
• Cautious NMPC with Gaussian Process Dynamics for Miniature Race Cars
• Dependent landmark drift: robust point set registration based on the Gaussian mixture model with a statistical shape model
• On optimal coding of non-linear dynamical systems
• Evolving soft locomotion in aquatic and terrestrial environments: effects of material properties and environmental transitions
• Unsupervised Reverse Domain Adaption for Synthetic Medical Images via Adversarial Training
• Loom: Query-aware Partitioning of Online Graphs
• Bounds for the Nakamura number
• Partial Truthfulness in Minimal Peer Prediction Mechanisms with Limited Knowledge
• Superpixels Based Segmentation and SVM Based Classification Method to Distinguish Five Diseases from Normal Regions in Wireless Capsule Endoscopy
• Depth Assisted Full Resolution Network for Single Image-based View Synthesis
• On the chromatic number of almost s-stable Kneser graphs
• Driven to Distraction: Self-Supervised Distractor Learning for Robust Monocular Visual Odometry in Urban Environments
• Classifying optimal binary subspace codes of length 8, constant dimension 4 and minimum distance 6
• Dynamic Matching: Reducing Integral Algorithms to Approximately-Maximal Fractional Algorithms
• Segmenting Brain Tumors with Symmetry
• Neural Motifs: Scene Graph Parsing with Global Context
• The Complexity of Multiwinner Voting Rules with Variable Number of Winners
• Nearly Optimal Stochastic Approximation for Online Principal Subspace Estimation
• Hardening Quantum Machine Learning Against Adversaries
• A Parallelizable Acceleration Framework for Packing Linear Programs
• On the Existence of Densities for Functional Data and their Link to Statistical Privacy
• Multiresolution and Hierarchical Analysis of Astronomical Spectroscopic Cubes using 3D Discrete Wavelet Transform
• Predict Responsibly: Increasing Fairness by Learning To Defer
• ADVISE: Symbolism and External Knowledge for Decoding Advertisements
• Neon2: Finding Local Minima via First-Order Oracles
• Self-similar growth-fragmentations as scaling limits of Markov branching processes
Hello! Last week, I thought I knew how users and groups worked on Linux. Here is what I thought:

- every process belongs to a user (like julia)
- when a process tries to read a file, Linux a) checks if the user julia can access the file, and b) checks which groups julia belongs to, and whether any of those groups owns & can access that file

So, for example, if a process is owned by the julia user and julia is in the awesome group, then the process would be allowed to read this file.
dr--r--r-- 1 bork awesome 6872 Sep 24 11:09 file.txt
I had not thought carefully about this, but if pressed I would have said that it probably checks the /etc/group file at runtime to see what groups you’re in.
I found out at work last week that, no, what I describe above is not how groups work. In particular Linux does not check which groups a process’s user belongs to every time that process tries to access a file.
Here is how groups actually work! I learned this by reading Chapter 9 (“Process Credentials”) of The Linux Programming Interface which is an incredible book. As soon as I realized that I did not understand how users and groups worked, I opened up the table of contents with absolute confidence that it would tell me what’s up, and I was right.
The key new insight for me was pretty simple! The chapter starts out by saying that user and group IDs are attributes of the process.
This means that the way Linux actually does group checks to see if a process can read a file is: it compares the file’s group against the group IDs attached to the process itself (it does not read /etc/group). Generally when doing access control checks it uses the effective user/group ID, not the real user/group ID. Technically when accessing a file it actually uses the file-system IDs, but those are usually the same as the effective uid/gid.
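The real/effective/supplementary credentials of a process are easy to inspect from any language’s bindings; here is a small Python sketch using the standard os module, which wraps the underlying getuid(2), geteuid(2), and getgroups(2) system calls:

```python
# Inspect the credentials of the *current process* - these are attributes
# of the process, not something looked up in /etc/group at access time.
import os

real_uid = os.getuid()               # who actually started the process
effective_uid = os.geteuid()         # what the kernel uses for permission checks
supplementary_gids = os.getgroups()  # the group list baked into the process

print(real_uid, effective_uid, supplementary_gids)
```

For an ordinary (non-setuid) process the real and effective UIDs will be the same number.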
Here’s another fun example that follows from this: if I create a new panda group and add myself (bork) to it, then run groups to check my group memberships – I’m not in the panda group!
bork@kiwi~> sudo addgroup panda
Adding group `panda' (GID 1001) ...
Done.
bork@kiwi~> sudo adduser bork panda
Adding user `bork' to group `panda' ...
Adding user bork to group panda
Done.
bork@kiwi~> groups
bork adm cdrom sudo dip plugdev lpadmin sambashare docker lxd
no panda in that list! To double check, let’s try making a file owned by the panda group and see if I can access it:
$ touch panda-file.txt
$ sudo chown root:panda panda-file.txt
$ sudo chmod 660 panda-file.txt
$ cat panda-file.txt
cat: panda-file.txt: Permission denied
Sure enough, I can’t access panda-file.txt. No big surprise there. My shell didn’t have the panda group as a supplementary GID before, and running adduser bork panda didn’t do anything to change that.
So this raises kind of a confusing question, right – if processes have groups baked into them, how do you get assigned your groups in the first place? Obviously you can’t assign yourself more groups (that would defeat the purpose of access control).
It’s relatively clear how processes I execute from my shell (bash/fish) get their groups – my shell runs as me, and it has a bunch of group IDs on it. Processes I execute from my shell are forked from the shell so they get the same groups as the shell had.
So there needs to be some “first” process that has your groups set on it, and all the other processes you start inherit their groups from that. That process is called your login shell, and it’s run by the login program (/bin/login) on my laptop. login runs as root and calls a C function called initgroups to set up your groups (by reading /etc/group). It’s allowed to set up your groups because it runs as root.
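You can see the gap between the group database and a process’s own group list directly. A Unix-only Python sketch using the standard grp and pwd modules compares what /etc/group says (the data initgroups reads at login) against what the running process actually carries:

```python
# Compare the group *database* with the group list of this process.
# A group added after login appears in db_gids but not process_gids.
import grp
import os
import pwd

username = pwd.getpwuid(os.getuid()).pw_name

# Group IDs /etc/group lists for this user, plus the user's primary group
db_gids = {g.gr_gid for g in grp.getgrall() if username in g.gr_mem}
db_gids.add(pwd.getpwnam(username).pw_gid)

# Group IDs attached to this running process
process_gids = set(os.getgroups())

print("in database but not on process:", db_gids - process_gids)
```

Right after running adduser bork panda, the panda GID would show up in the difference above until a new login shell is started.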
So! Let’s say I am running in a shell, and I want to refresh my groups! From what we’ve learned about how groups are initialized, I should be able to run login to refresh my groups and start a new login shell!
Let’s try it:
$ sudo login bork
$ groups
bork adm cdrom sudo dip plugdev lpadmin sambashare docker lxd panda
$ cat panda-file.txt # it works! I can access the file owned by `panda` now!
Sure enough, it works! Now the new shell that login spawned is part of the panda group! Awesome!
This won’t affect any other shells I already have running. If I really want the new panda group everywhere, I need to restart my login session completely, which means quitting my window manager and logging in again.
Somebody on Twitter told me that if you want to start a new shell with a new group that you’ve been added to, you can use newgrp. Like this:
sudo addgroup panda
sudo adduser bork panda
newgrp panda # starts a new shell, and you don't have to be root to run it!
You can accomplish the same(ish) thing with sg panda bash, which will start a bash shell that runs with the panda group.
I’ve also always been a little vague about what it means for a process to run as “setuid root”. It turns out that setuid sets the effective user ID! So if I (julia) run a setuid root process (like passwd), then the real user ID will be set to julia, and the effective user ID will be set to root.
passwd needs to run as root, but it can look at its real user ID to see that julia started the process, and prevent julia from editing any passwords except for julia’s password.
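A hedged sketch (not passwd’s actual code, just the shape of the check) of how a setuid-root program can use the real UID to decide who launched it:

```python
# Sketch of a setuid-root program's authorization check: the effective
# UID is root, but the *real* UID still says who started the process.
import os


def may_edit_password(caller_real_uid: int, target_uid: int) -> bool:
    """Root may edit anyone's password; everyone else only their own."""
    return caller_real_uid == 0 or caller_real_uid == target_uid


# A process launched by UID 1000 may touch UID 1000's entry, nothing else
print(may_edit_password(1000, 1000))  # True
print(may_edit_password(1000, 0))     # False
```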
There are a bunch more details about all the edge cases and exactly how everything works in The Linux Programming Interface so I will not get into all the details here. That book is amazing. Everything I talked about in this post is from Chapter 9, which is a 17-page chapter inside a 1300-page book.
The thing I love most about that book is that reading 17 pages about how users and groups work is really approachable, self-contained, super useful, and I don’t have to tackle all 1300 pages of it at once to learn helpful things :)
ACM SIGKDD Executive Committee hereby invites proposals to host the annual KDD Conference in 2020 and later. Proposals due Jan 31, 2018.
A Step Towards Reproducible Data Science : Docker for Data Science Workflows
Programming Languages for Data Science and ML – With Source Code Illustrations
The Gaussian Correlation Inequality in One Picture
Concatenate TensorFlow Tensors Along A Given Dimension
How Can We Trust Machine Learning and AI?
IBM Introduces New Software to Ease Adoption of AI, Machine Learning and Deep Learning
How (& Why) Data Scientists and Data Engineers Should Share a Platform
Generative Adversarial Networks – Part II
Top 10 Videos on Deep Learning in Python
Why is R slow? some explanations and MKL/OpenBLAS setup to try to fix this
How to combine point and boxplots in timeline charts with ggplot2 facets
Teaching to machines: What is learning in machine learning entails?
Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest
New Poll: Which Data Science / Machine Learning methods and tools you used?
Automated Feature Engineering for Time Series Data
You know the drill by now. Grab the tweets. Generate the report using RMarkdown. Push to Github. Publish to RPubs.
This time it’s the Australian Bioinformatics & Computational Biology Society Conference 2017, including the COMBINE symposium. Looks like a good time was had by all in Adelaide.
A couple of quirks this time around. First, the rtweet package went through a brief phase of returning lists instead of nice data frames. I hope that’s been discarded as a bad idea.
Second, the hashtag search returned some results that did not contain said hashtags but were definitely from the meeting. Yet to get to the bottom of that one. I treated ABACBS and COMBINE as one event for this analysis.
Third, given that most Twitter users have had 280 characters since about November 7, is this reflected in the conference tweets? It is not. In fact, most tweets seem to hit the hard limit of 140, which makes me wonder if users’ clients were not updated, or the R package has a hard-coded character limit in its parsing functions (longest = 148 characters).
by Siddarth Ramesh, Data Scientist, Microsoft
I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.
Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. But many times, this is more challenging because the production system might demand that the entire end to end workflow is built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than Python.
Python has many strengths: robust data structures such as Dictionaries, compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, areas where the classic Data Science packages in Python are less intuitive than R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.
Revoscalepy is the Python implementation of the R package RevoScaleR included with Microsoft Machine Learning Server.
The methods in ‘revoscalepy’ are the same, and more importantly, the way the R user can view data is the same. The reason this is so important is that for an R programmer, being able to understand the data shape and structure is one of the challenges with getting used to Python. In Python, data types are different, preprocessing the data is different, and the criteria to feed the processed dataset into a model is different.
To understand how revoscalepy eases the transition from R to Python, the following section will compare building a decision tree using revoscalepy with building a decision tree using sklearn. The Titanic dataset from Kaggle will be used for this example. To be clear, this post is written from an R user’s perspective, as many of the challenges this post will outline are standard practices for native Python users.
revoscalepy works on Python 3.5, and can be downloaded as part of Microsoft Machine Learning Server. Once downloaded, set the Python environment path to the python executable in the MML directory, and then import the packages.
The first chunk of code imports the revoscalepy, numpy, pandas, and sklearn packages, and imports the Titanic data. Pandas has some R roots in that it has its own implementation of DataFrames as well as methods that resemble R’s exploratory methods.
import revoscalepy as rp
import numpy as np
import pandas as pd
import sklearn as sk

titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()
One of the challenges as an R user with using sklearn is that the decision tree model for sklearn can only handle the numeric datatype. Pandas has a categorical type that looks like factors in R, but sklearn’s Decision Tree does not integrate with this. As a result, numerically encoding the categorical data becomes a mandatory step. This example will use a one-hot encoder to shape the categories in a way that sklearn’s decision tree understands.
The side effect of having to one-hot encode the variable is that if the dataset contains high cardinality features, it can be memory intensive and computationally expensive because each category becomes its own binary column. While implementing one-hot encoding itself is not a difficult transformation in Python and provides good results, it is still an extra step for an R programmer to have to manually implement. The following chunk of code detaches the categorical columns, label and one-hot encodes them, and then reattaches the encoded columns to the rest of the dataset.
from sklearn import tree

le = sk.preprocessing.LabelEncoder()
x = titanic_data.select_dtypes(include=[object])
x = x.drop(['Name', 'Ticket', 'Cabin'], 1)
x = pd.concat([x, titanic_data['Pclass']], axis = 1)
x['Pclass'] = x['Pclass'].astype('object')
x = pd.DataFrame(x)
x = x.fillna('Missing')
x_cats = x.apply(le.fit_transform)
enc = sk.preprocessing.OneHotEncoder()
enc.fit(x_cats)
onehotlabels = enc.transform(x_cats).toarray()
encoded_titanic_data = pd.concat([pd.DataFrame(titanic_data.select_dtypes(include=[np.number])), pd.DataFrame(onehotlabels)], axis = 1)
At this point, there are more columns than before, and the columns no longer have semantic names (they have been enumerated). This means that if a decision tree is visualized, it will be difficult to understand without going through the extra step of renaming these columns.
Unlike sklearn, revoscalepy reads pandas’ ‘category’ type like factors in R. This section of code iterates through the DataFrame, finds the string types, and converts those types to ‘category’. In pandas, there is an argument to set the order to False, to prevent ordered factors.
titanic_data_object_types = titanic_data.select_dtypes(include = ['object'])
titanic_data_object_types_columns = np.array(titanic_data_object_types.columns)
for column in titanic_data_object_types_columns:
    titanic_data[column] = titanic_data[column].astype('category', ordered = False)
titanic_data['Pclass'] = titanic_data['Pclass'].astype('category', ordered = False)
This dataset is already ready to be fed into the revoscalepy model.
One difference between implementing a model in R and in sklearn in Python is that sklearn does not use formulas.
Formulas are important and useful for modeling because they provide a consistent framework to develop models with varying degrees of complexity. With formulas, users can easily apply different types of variable cases, such as ‘+’ for separate independent variables, ‘:’ for interaction terms, and ‘*’ to include both the variable and its interaction terms, along with many other convenient calculations. Within a formula, users can do mathematical calculations, create factors, and include more complex entities such as third-order interactions. Furthermore, formulas allow for building highly complex models such as mixed effect models, which are next to impossible to build without them. In Python, there are packages such as ‘statsmodels’ which have more intuitive ways to build certain statistical models. However, statsmodels has a limited selection of models, and does not include tree-based models.
With sklearn, model.fit expects the independent and dependent terms to be columns from the DataFrame. Interactions must be created manually as a preprocessing step for more complex examples. The code below trains the decision tree:
model = tree.DecisionTreeClassifier(max_depth = 50)
x = encoded_titanic_data.drop(['Survived'], 1)
x = x.fillna(-1)
y = encoded_titanic_data['Survived']
model = model.fit(x, y)
revoscalepy brings back formulas. Granted, users cannot view the formula the same way as they can in R, because formulas are strings in Python. However, importing code from R to Python is an easy transition because formulas are read the same way in the revoscalepy functions as the model fit functions in R. The below code fits the Decision Tree in revoscalepy.
# rx_dtree works with formulas, just like models in R
form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
titanic_data_tree = rp.rx_dtree(form, titanic_data, max_depth = 50)
The resulting object, titanic_data_tree, is the same structural object that RxDTree() would create in R. RxDTree() integrates with the rpart library in R, meaning rx_dtree() indirectly does too if the user wants to export the object back into R for further exploratory analysis around this decision tree.
From the workflow, it should be clear how revoscalepy can help with transliteration between R and Python. Sklearn has different preprocessing considerations because the data must be fed into the model differently. The advantage of revoscalepy is that R programmers can easily convert their R code to Python without thinking too much about the ‘Pythonic way’ of implementing their R code. Categories replace factors, rx_dtree() reads the R-like formula, and the arguments are similar to the R equivalent. Looking at the big picture, revoscalepy is one way to ease Python for the R user, and future posts will cover other ways to transition between R and Python.
Microsoft Docs: Introducing revoscalepy
While deep learning methods have achieved state-of-the-art performance in many challenging inverse problems like image inpainting and super-resolution, they invariably involve problem-specific training of the networks. Under this approach, different problems require different networks. In scenarios where we need to solve a wide variety of problems, e.g., on a mobile camera, it is inefficient and costly to use these specially-trained networks. On the other hand, traditional methods using signal priors can be used in all linear inverse problems but often have worse performance on challenging tasks. In this work, we provide a middle ground between the two kinds of methods --- we propose a general framework to train a single deep neural network that solves arbitrary linear inverse problems. The proposed network acts as a proximal operator for an optimization algorithm and projects non-image signals onto the set of natural images defined by the decision boundary of a classifier. In our experiments, the proposed framework demonstrates superior performance over traditional methods using a wavelet sparsity prior and achieves comparable performance of specially-trained networks on tasks including compressive sensing and pixel-wise inpainting.
Twitter recently doubled the maximum length of a tweet to 280 characters, and while all users now have access to longer tweets, few have taken advantage of the opportunity. Bob Rudis used the rtweet package to analyze tweets sent with the #rstats hashtag since 280-char tweets were introduced, and most still kept below the old 280-character limit. The actual percentage differed by the Twitter client being used; I've reproduced the charts for the top 10 clients below. (Click the chart for even more clients, and details of the analysis including the R code that generated it.)
That being said, some have embraced the new freedom with gusto, not least my Microsoft colleague Paige Bailey who demonstrated you can fit some pretty nice R charts — and even animations! — into just 280 characters:
df <- expand.grid(x = 1:10, y=1:10)
df$angle <- runif(100, 0, 2*pi)
df$speed <- runif(100, 0, sqrt(0.1 * df$x))
ggplot(df, aes(x, y)) +
geom_point() +
geom_spoke(aes(angle = angle, radius = speed))
y'all twitterpeople give me 280 characters? yr just gonna get code samples pic.twitter.com/hyGEE2DxGy
— @DynamicWebPaige (@DynamicWebPaige) November 8, 2017
x <- getURL("https://t.co/ivZZvodbNK")
b <- read.csv(text=x)
c <- get_map(location=c(-122.080954,36.971709), maptype="terrain", source="google", zoom=14)
ggmap(c) +
geom_path(data=b, aes(color=elevation), size=3)+
scale_color_gradientn(colours=rainbow(7), breaks=seq(25, 200, 25)) pic.twitter.com/7WdQLR56uZ
— @DynamicWebPaige (@DynamicWebPaige) November 11, 2017
library(dygraphs)
lungDeaths <- cbind(mdeaths, fdeaths)
dygraph(lungDeaths) %>%
dySeries("mdeaths", label = "Male") %>%
dySeries("fdeaths", label = "Female") %>%
dyOptions(stackedGraph = TRUE) %>%
dyRangeSelector(height = 20)
I ❤️ R's interactive visualizations SO MUCH pic.twitter.com/LevjElly3L
— @DynamicWebPaige (@DynamicWebPaige) November 16, 2017
library(plot3D)
par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
x <- seq(0, 2*pi,length.out=50)
y <- seq(0, pi,length.out=50)
M <- mesh(x, y)
surf3D(x = (3+2*cos(M$x)) * cos(M$y),
y = (3+2*cos(M$x)) * sin(M$y),
z = 2 * sin(M$x),
colkey=FALSE,
bty="b2") pic.twitter.com/a6GwQTaGYC
— @DynamicWebPaige (@DynamicWebPaige) November 17, 2017
For more 280-character examples in R and other languages follow this Twitter thread, and for more on analyzing tweets with rtweet follow the link below.
rud.is: Twitter Outer Limits : Seeing How Far Have Folks Fallen Down The Slippery Slope to “280” with rtweet
Frankenstein’s Legacy: Four Conversations about Artificial Intelligence, Machine Learning, and the Modern World is a collaboration between Carnegie Mellon University’s ETC Press, dSHARP, and the Carnegie Mellon University Libraries, together with the Alumni Association’s CMUThink program. This book is part of a university-wide series celebrating the two-hundredth anniversary of the publication of Mary Shelley’s Frankenstein. The project sprang specifically from the Frankenstein 200: Perils and Potential panel hosted by Digital Scholarship Strategist Rikk Mulligan. Each of the four panel participants – Jeffrey Bigham, David Danks, Barry Luokkala, and Molly Wright Steenson – sat down with ETC Editor Brad King for wide-ranging discussions about artificial intelligence, machine learning, and the impact of these technologies on the world in which we live. Those conversations were edited into the manuscript that you’re now reading. The book – part of the ETC Press “In Conversation With” series – is a conversational examination and explanation of some of the most powerful technological tools in our society. Frankenstein’s Legacy – Four Conversations About Artificial Intelligence, Machine Learning, and the Modern World
Download this whitepaper from NVIDIA DGX Systems, and gain insight into the engineering expertise and innovation found in pre-optimized deep learning frameworks available only on NVIDIA DGX Systems and learn how to dramatically reduce your engineering costs using today’s most popular frameworks.
Introduction
Kaggle.com is a website designed for data scientists and data enthusiasts to connect and compete with each other. It is an open community that hosts forums and competitions in the wide field of data. In each Kaggle competition, competitors are given a training data set, which is used to train their models, and a test data set, used to test their models. Kagglers can then submit their predictions to view how well their score (e.g., accuracy or error) compares to others’.
As a team, we joined the House Prices: Advanced Regression Techniques Kaggle challenge to test our model building and machine learning skills. For this competition, we were tasked with predicting housing prices of residences in Ames, Iowa. Our training data set included 1460 houses (i.e., observations) accompanied by 79 attributes (i.e., features, variables, or predictors) and the sales price for each house. Our testing set included 1459 houses with the same 79 attributes, but sales price was not included as this was our target variable.
To view our code (split between R and Python) and our presentation slides for this project, see our shared GitHub repository.
Understanding the Data
Of the 79 variables provided, 51 were categorical and 28 were continuous.
Our first step was to combine these data sets into a single set both to account for the total missing values and to fully understand all the classes for each categorical variable. That is, there might be missing values or different class types in the test set that are not in the training set.
Processing the Data
Response Variable
As our response variable, Sale Price, is continuous, we will be utilizing regression models. One assumption of linear regression models is that the error between the observed and expected values (i.e., the residuals) should be normally distributed. Violations of this assumption often stem from a skewed response variable. Sale Price has a right skew, so we log + 1 transform it to normalize its distribution.
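A minimal sketch of that transform in Python (the column name SalePrice is the Ames one, but the toy values here are illustrative, not taken from the actual data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; real prices have a long right tail.
train = pd.DataFrame({"SalePrice": [129500.0, 181000.0, 214000.0, 755000.0]})

# log(1 + x) compresses the right tail toward a more normal shape.
train["SalePrice_log"] = np.log1p(train["SalePrice"])

# Predictions made on the log scale are mapped back with the inverse, expm1.
recovered = np.expm1(train["SalePrice_log"])
```

Working on the log scale also means prediction errors are roughly multiplicative, which suits prices well.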
Missing Data
Machine learning algorithms do not handle missing values very well, so we must obtain an understanding of the missing values in our data to determine the best way to handle them. We find that 34 of the predictor variables have values that are interpreted by R and Python as missing (i.e., “NA” and “NaN”). Below we describe examples of some of the ways we treated these missing data.
1) NA/NaN is actually a class:
In many instances, what R and Python interpret as a missing value is actually a class of the variable. For example, Pool Quality is comprised of 5 classes: Excellent, Good, Fair, Typical, and NA. The NA class describes houses that do not have a pool, but our coding languages interpret houses of NA class as a missing value instead of a class of the Pool Quality variable.
Our solution was to impute most of the NA/NaN values to a value of “None.”
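In pandas this is a one-line fill; a sketch assuming a hypothetical PoolQC column (the real data has roughly twenty such NA-means-absent columns):

```python
import pandas as pd

# Toy frame standing in for the Ames data; NA in PoolQC means "no pool".
df = pd.DataFrame({"PoolQC": ["Ex", None, "Gd", None]})

# Columns where NA is really a class get the literal string "None".
na_means_none = ["PoolQC"]  # extend with the other NA-as-class columns
df[na_means_none] = df[na_means_none].fillna("None")
```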
2) Not every NA/NaN corresponds to a missing attribute:
While we found that most NA/NaN values corresponded to an actual class for different variables, some NA/NaN values actually represented missing data. For example, we find that three houses with NA/NaN values for Pool Quality, also have a non-zero value for the variable, Pool Area (square footage of pool). These three houses likely have a pool, but its quality was not assessed or input into the data set.
Our solution was to first calculate mean Pool Area for each class of Pool Quality, then impute the missing Pool Quality classes based on how close that house’s Pool Area was to the mean Pool Areas for each Pool Quality class. For example, the first row in the below picture on the left has a Pool Area of 368 square feet. The average Pool Area for houses with Excellent pool quality (Ex) is about 360 square feet (picture on the right). Therefore, we imputed this house to have a Pool Quality of Excellent.
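A sketch of this nearest-class-mean imputation (the toy areas are made up; the real data has three such houses):

```python
import pandas as pd

# One house has a pool area but no quality rating.
df = pd.DataFrame({
    "PoolQC":   ["Ex", "Gd", "Fa", None],
    "PoolArea": [360,  512,  444,  368],
})

# Mean pool area for each known quality class.
class_means = df.dropna(subset=["PoolQC"]).groupby("PoolQC")["PoolArea"].mean()

# Assign each missing row the class whose mean area is closest to its own.
missing = df["PoolQC"].isna()
df.loc[missing, "PoolQC"] = df.loc[missing, "PoolArea"].apply(
    lambda area: (class_means - area).abs().idxmin()
)
```

Here the 368 sq ft pool is closest to the Excellent-class mean of 360, matching the example in the text.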
3) Domain knowledge:
Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.
Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.
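A sketch of that neighborhood-median fill with a pandas groupby (the neighborhood names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"],
    "LotFrontage":  [60.0,    80.0,    None,    45.0,      None],
})

# Fill each missing frontage with the median of its own neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```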
4) Imputing with mode:
Most variables have some intuitive relationship to other variables, and imputation can be based on these related features. But some missing values are found in variables with no apparent relation to others. For example, the Electrical variable, which describes the electrical system, was missing for a single observation.
Our solution was to simply find the most common class for this categorical variable and impute for this missing value.
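A minimal sketch (the classes shown are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Electrical": ["SBrkr", "SBrkr", "FuseA", None]})

# mode() returns the most frequent class; use it for the lone missing value.
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
```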
Ordinal Categories
For linear (but not tree-based) models, categorical variables must be converted to numeric form. There are two types of categorical features: ordinal, where there is an inherent order to the classes (e.g., Excellent is greater than Good, which is greater than Fair), and nominal, where there is no obvious order (e.g., red, green, and blue).
Our solution for ordinal variables was to simply assign the classes a number corresponding to their relative ranks. For example, Kitchen Quality has five classes: Excellent, Good, Typical, Fair, and Poor, which we encoded (i.e., converted) to the numbers 5, 4, 3, 2, and 1, respectively.
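A sketch of this encoding, assuming the standard Ames quality abbreviations (Ex, Gd, TA, Fa, Po):

```python
import pandas as pd

df = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "Fa", "Po"]})

# Explicit rank map: Excellent > Good > Typical > Fair > Poor.
quality_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
df["KitchenQual"] = df["KitchenQual"].map(quality_map)
```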
Nominal Categories
The ranking of nominal categories is not appropriate as there is no actual rank among the classes.
Our solution was to one-hot encode these variables, which creates a new variable for each class with values of zero (not present) or one (present).
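With pandas this is `pd.get_dummies`; a small sketch on a hypothetical RoofStyle column:

```python
import pandas as pd

df = pd.DataFrame({"RoofStyle": ["Gable", "Hip", "Gable"]})

# One new 0/1 column per class; the prefix keeps column names readable.
encoded = pd.get_dummies(df, columns=["RoofStyle"], prefix="RoofStyle")
```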
Outliers
An outlier can be defined with a quantitative (i.e., statistical) or qualitative definition. We opted for the qualitative version when looking for outliers: observations that are abnormally far from other values. Viewing the relationship between Above Ground Living Area and Sale Price, we noticed some very large areas for very low prices.
Our solution was to remove these observations as we thought they fit our chosen definition of an outlier, and because they might increase our models’ errors.
Skewness
While there are few assumptions regarding the independent variables of regression models, often transforming skewed variables to a normal distribution can improve model performance.
Our solution was to log + 1 transform several of the predictors.
Near Zero Predictors
Predictors with very low variance offer little predictive power to models.
Our solution was to find the ratio of the second most frequent value to the most frequent value for each predictor, and to remove variables where this ratio was less than 0.05. This roughly translates to dropping variables where 95% or more of the values are the same.
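A sketch of the frequency-ratio filter described above (toy columns; the 0.05 cutoff is the one from the text):

```python
import pandas as pd

df = pd.DataFrame({
    "Street":     ["Pave"] * 99 + ["Grvl"],    # ratio 1/99, near-zero variance
    "HouseStyle": ["1Story", "2Story"] * 50,   # ratio 1.0, kept
})

def freq_ratio(s):
    counts = s.value_counts()
    # Ratio of the second most frequent value to the most frequent one.
    return counts.iloc[1] / counts.iloc[0] if len(counts) > 1 else 0.0

near_zero = [c for c in df.columns if freq_ratio(df[c]) < 0.05]
df = df.drop(columns=near_zero)
```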
Feature Engineering
Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed when manipulating these features in some way. Below are just some examples of the features we created:
Models and Results
Now that we have prepared our data set, we can begin training our models and use them to predict Sale Price.
We trained and tested dozens of versions of the models described below with different combinations of engineered features and processed variables. The information in the table represents our best results for each model. The table explains the pros and cons of each model type, the optimal hyperparameters found through either grid search or Bayesian optimization, our test score, and the score we received from Kaggle. Our scores are the root mean square error (RMSE) of our predictions, a metric describing the difference between the observed values and our predicted values for Sale Price; scores closer to zero are better.
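Consistent with the log + 1 transform applied to Sale Price earlier, the RMSE is computed on the log scale; a sketch of the metric (the example prices are made up):

```python
import numpy as np

# RMSE on the log(price + 1) scale.
def rmse_log(y_true, y_pred):
    lt, lp = np.log1p(y_true), np.log1p(y_pred)
    return float(np.sqrt(np.mean((lt - lp) ** 2)))

# Made-up prices purely to exercise the metric.
score = rmse_log(np.array([200000.0, 150000.0]),
                 np.array([210000.0, 140000.0]))
```

On the log scale an RMSE of, say, 0.06 corresponds to a typical relative error of roughly 6%, which is why the scores read like percentages.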
For brevity, we will not describe the details of the different models. However, see the following links for more information about how each model is used to create predictions: random forest, gradient boost, XGBoost, elastic net regularization for regression.
Below are plots summarizing variables that contribute most to the respective model’s prediction of Sale Price.
For most models, predictors related to square footage (Area), quality (different Quality measures), and age (Year Built) have the strongest impact on each model’s predictions.
There is no visualization for our best model, which was an ensemble of four other models. The predictions for this ensembled model are calculated by averaging the predictions from the separate models (two linear regression models and two tree-based models). The idea is that each model’s predictions include error both above and below the real values, and the averaged predictions of the best models might have less overall error than any one single model.
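A sketch of such an unweighted ensemble (the model names and predictions are hypothetical):

```python
import numpy as np

# Hypothetical per-model predictions for three houses (log-price scale).
preds = {
    "lasso":   np.array([12.0, 11.8, 12.4]),
    "enet":    np.array([12.1, 11.7, 12.5]),
    "gboost":  np.array([11.9, 11.9, 12.3]),
    "xgboost": np.array([12.0, 11.8, 12.4]),
}

# Simple (unweighted) average of the four models' predictions.
ensemble = np.mean(list(preds.values()), axis=0)
```

Weighted averages (heavier weight on the best single model) are a common refinement, but the plain mean already cancels some of each model's idiosyncratic error.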
One note is that tree-based models (random forest, gradient boosting, and XGBoost) cannot provide an idea of positive or negative influence for each variable on Sale Price; rather, they can only provide an idea of how important that variable is to the models’ predictions overall. In contrast, linear models can provide information about which variables positively and negatively influence Sale Price. For the figure immediately above, the strongest predictor, residency in a Commercial Zone, is actually negatively related to Sale Price.
Conclusions
The objective of this Kaggle competition was to build models to predict housing prices of different residences in Ames, IA. Our best model resulted in an RMSE of 0.1071, which translates to an error of about $9000 (or about 5%) for the average-priced house.
While this error is quite low, the interpretability of our model is poor. Each model found within our ensembled model varies with respect to the variables that are most important to predicting Sale Price. The best way to interpret our ensemble is to look for shared variables among its constituent models. The variables seen as most important or as strongest predictors through our models were those related to square footage, the age and condition of the home, the neighborhood where the house was located, the city zone where the house was located, and the year the house was sold.
Optimal Thresholding Fisher’s P-Value Combination Method (TFisher)
We provide the cumulative distribution function (CDF), quantile, and statistical power calculator for a collection of thresholding Fisher’s p-value combination methods, including Fisher’s p-value combination method, the truncated product method and, in particular, the soft-thresholding Fisher’s p-value combination method, which is proven to be optimal in some contexts of signal detection. The p-value calculator for the omnibus version of these tests is also included. For reference, please see Hong Zhang and Zheyang Wu, ‘Optimal Thresholding of Fisher’s P-value Combination Tests for Signal Detection’, submitted.
Large Data Sets (ldat)
Tools for working with vectors and data sets that are too large to keep in memory. Extends the basic functionality provided in the ‘lvec’ package. Provides basic statistical functionality for ‘lvec’ objects, such as arithmetic operations and calculating means and sums. Also implements ‘data.frame’-like objects that store their data in ‘lvec’ objects.
Random Wishart Matrix Generation (rWishart)
An expansion of R’s ‘stats’ random Wishart matrix generation. This package allows the user to generate singular Wishart matrices, Uhlig (1994) <doi:10.1214/aos/1176325375>, and pseudo-Wishart matrices, Diaz-Garcia et al. (1997) <doi:10.1006/jmva.1997.1689>. In addition, the user can generate Wishart matrices with fractional degrees of freedom, Adhikari (2008) <doi:10.1061/(ASCE)0733-9399(2008)134:12(1029)>, commonly used in volatility modeling. Users can also use this package to create random covariance matrices.
Simple Matrix Construction (Massign)
Constructing matrices for quick prototyping can be a nuisance, requiring the user to think about how to fill the matrix with values using the matrix() function. The %<-% operator solves that issue by allowing the user to construct matrices using code that shows the actual matrices.
Multivariate Empirical Density Function (MEPDF)
Based on the input data, an n-dimensional cube with sub-cells of user-specified side length is created. The number of sample points that fall in each sub-cell is counted, and with the cell volume and overall sample size an empirical probability can be computed. A number of cubes of higher resolution can be superimposed. The basic method stems from J. L. Bentley (1980), ‘Multidimensional Divide and Conquer’ <doi:10.1145/358841.358850>.
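The counting scheme the description outlines can be sketched in a few lines (this is an illustration with numpy, not the package’s own code):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=(1000, 2))  # toy 2-D sample

# Cut the unit square into sub-cells of side length 0.25 and count points.
side = 0.25
edges = [np.arange(0.0, 1.0 + side, side)] * 2
counts, _ = np.histogramdd(sample, bins=edges)

# Empirical density per cell = count / (cell volume * sample size).
density = counts / (side ** 2 * sample.shape[0])
```

By construction the density integrates to one over the cube, which is the sanity check to run on any such estimator.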
Renewal Hawkes Process (RHawkes)
Simulate a renewal Hawkes (RHawkes) self-exciting process, with a given immigrant hazard rate function and offspring density function. Calculate the likelihood of a RHawkes process with given hazard rate function and offspring density function for an (increasing) sequence of event times. Calculate the Rosenblatt residuals of the event times. Predict future event times based on observed event times up to a given time. For details see Chen and Stindl (2017) <doi:10.1080/10618600.2017.1341324>.
Jessica Franklin writes:
Given your interest in post-publication peer review, I thought you might be interested in our recent experience criticizing a paper published in BMJ last year by Hemkens et al. I realized that the method used for the primary analysis was biased, so we published a criticism with mathematical proof of the bias (we tried to publish in BMJ, but it was a no go). Now there has been some back and forth between the Hemkens group and us on the BMJ rapid response page, and BMJ is considering a retraction, but no action yet. I don’t really want to comment too much on the specifics, as I don’t want to escalate the tension here, but this has all been pretty interesting, at least to me.
Interesting, in part because both sides in the dispute include well-known figures in epidemiology: John Ioannidis is a coauthor on the Hemkens et al. paper, and Kenneth Rothman is a coauthor on the Franklin et al. criticism.
Background
The story starts with the paper by Hemkens et al., who performed a meta-analysis on “16 eligible RCD studies [observational studies using ‘routinely collected data’], and 36 subsequent published randomized controlled trials investigating the same clinical questions (with 17 275 patients and 835 deaths),” and they found that the observational studies overestimated efficacy of treatments compared to the later randomized experiments.
Their message: be careful when interpreting observational studies.
One thing I wonder about, though, is how much of this is due to the time ordering of the studies. Forget for a moment about which studies are observational and which are experimental. In any case, I’d expect the first published study on a topic to show statistically significant results—otherwise it’s less likely to be published in the first place—whereas anything could happen in a follow-up. Thus, I’d expect to see earlier studies overestimate effect sizes relative to later studies, irrespective of which studies are observational and which are experimental. This is related to the time-reversal heuristic.
To put it another way: The Hemkens et al. project is itself an observational study, and in their study there is complete confounding between two predictors: (a) whether a result came from an observational study or an experiment, and (b) whether the result was published first or second. So I think it’s impossible to disentangle the predictive value of (a) and (b).
The criticism and the controversy
Here are the data from Hemkens et al.:
Franklin et al. expressed the following concern:
In a recent meta-analysis by Hemkens et al. (Hemkens et al. 2016), the authors compared published RCD [routinely collected data] studies and subsequent RCTs [randomized controlled trials] using the ROR, but inverted the clinical question and corresponding treatment effect estimates for all study questions where the RCD estimate was > 1, thereby ensuring that all RCD estimates indicated protective effects.
Here’s the relevant bit from Hemkens et al.:
For consistency, we inverted the RCD effect estimates where necessary so that each RCD study indicated an odds ratio less than 1 (that is, swapping the study groups so that the first study group has lower mortality risk than the second).
So, yeah, that’s what they did.
On one hand, I can see where Hemkens et al. were coming from. To the extent that the original studies purported to be definitive, it makes sense to code them in the same direction, so that you’re asking how the replications compared to what was expected.
On the other hand, Franklin et al. have a point, that in the absence of any differences, the procedure of flipping all initial estimates to have odds ratios less than 1 will bias the estimate of the difference.
Beyond this, the above graph shows a high level of noise in the comparisons, as some of the follow-up randomized trials have standard errors that are essentially infinite. (What do you say about an estimated odds ratio that can be anywhere from 0.2 to 5?) Hemkens et al. appear to be using some sort of weighting procedure, but the relevant point here is that only a few of these studies have enough data to tell us anything at all.
My take on these papers
The above figure tells the story: The 16 observational studies appear to show a strong correlation between standard error and estimated effect size. This makes sense. Go, for example, to the bottom of the graph: I don’t know anything about Hahn 2010, Fonoarow 2008, Moss 2003, Kim 2008, and Cabell 2005, but all these studies are estimated to cut mortality by 50% or more, which seems like a lot, especially considering the big standard errors. It’s no surprise that these big estimates fail to reappear under independent replication. Indeed, as noted above, I’d expect that big estimates from randomized experiments would also generally fail to reappear under independent replication.
Franklin et al. raise a valid criticism: Even if there is no effect at all, the method used by Hemkens et al. will create the appearance of an effect: in short, the Hemkens et al. estimate is indeed biased.
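That bias is easy to see in a small simulation: even when both study types estimate a null effect, flipping every pair so the first study’s estimate shows OR < 1 pushes the pooled log relative odds ratio well below zero. A sketch (this is my illustration of the mechanism, not the authors’ actual analysis):

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs = 10_000

# Under the null, both the observational (RCD) study and the RCT estimate
# log(OR) = 0, each with independent sampling noise.
log_or_rcd = rng.normal(0.0, 0.5, n_pairs)
log_or_rct = rng.normal(0.0, 0.5, n_pairs)

# Inversion rule: flip each study pair so the RCD estimate shows OR < 1.
flip = log_or_rcd > 0
log_or_rcd = np.where(flip, -log_or_rcd, log_or_rcd)
log_or_rct = np.where(flip, -log_or_rct, log_or_rct)

# Pooled log relative odds ratio (RCD minus RCT); zero would mean no bias.
pooled_log_ror = float(np.mean(log_or_rcd - log_or_rct))
```

The flipped RCD estimates become negative by construction while the RCT estimates stay centered at zero, so the pooled difference is negative even though there is no real effect.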
Put it all together, and I think that the sort of meta-analysis performed by Hemkens et al. is potentially valuable, but maybe it would’ve been enough for them to stop with the graph on the left in the above image. It’s not clear that anything is gained from their averaging; also there’s complete confounding in their data between timing (which of the two studies came first) and mode (observational or experimental).
The discussion
Here are some juicy bits from the online discussion at the BMJ site:
02 August 2017, José G Merino, US Research Editor, The BMJ:
Last August, a group led by Jessica Franklin submitted to us a criticism of the methods used by the authors of this paper, calling into question some of the assumptions and conclusion reached by Lars Hemkens and his team. We invited Franklin and colleagues to submit their comments as a rapid response rather than as a separate paper but they declined and instead published the paper in Epidemiological Methods (Epidem Meth 2-17;20160018, DOI 10.1515/em-2016-0018.) We would like to alert the BMJ’s readers about the paper, which can be found here: https://www.degruyter.com/view/j/em.ahead-of-print/em-2016-0018/em-2016-0018.xml?format=INT
We asked Hemkens and his colleagues to submit a response to the criticism. That report is undergoing statistical review at The BMJ. We will post the response shortly.
14 September 2017, Lars G Hemkens, senior researcher, Despina G Contopoulos-Ioannidis, John P A Ioannidis:
The arguments and analyses of Franklin et al. [1] are flawed and misleading. . . . It is trivial that the direction of comparisons is essential in meta-epidemiological research comparing analytic approaches. It is also essential that there must be a rule for consistent coining of the direction of comparisons. The fact that there are theoretically multiple ways to define such rules and apply the ratio-of-odds ratio method doesn’t invalidate the approach in any way. . . . We took in our study the perspective of clinicians facing new evidence, having no randomized trials, and having to decide whether they use a new promising treatment. In this situation, a treatment would be seen as promising when there are indications for beneficial effects in the RCD-study, which we defined as having better survival than the comparator (that is a OR < 1 for mortality in the RCD-study) . . . it is the only reasonable and useful selection rule in real life . . . The theoretical simulation of Franklin et al. to make all relative risk estimates <1 in RCTs makes no sense in real life and is without any relevance for patient care or health-care decision making. . . . Franklin et al. included in their analysis a clinical question where both subsequent trials were published simultaneously making it impossible to clearly determine which one is the first (Gnerlich 2007). Franklin et al. selected the data which better fit to their claim. . . .
21 September 2017, Susan Gruber, Biostatistician:
The rapid response of Hemkens, Contopoulos-Ioannidis, and Ioannidis overlooks the fact that a metric of comparison can be systematic, transparent, replicable, and also wrong. Franklin et. al. clearly explains and demonstrates that inverting the OR based on RCD study result (or on the RCT result) yields a misleading statistic. . . .
02 October 2017, Jessica M. Franklin, Assistant Professor of Medicine, Sara Dejene, Krista F. Huybrechts, Shirley V. Wang, Martin Kulldorff, and Kenneth J. Rothman:
In a recent paper [1], we provided mathematical proof that the inversion rule used in the analysis of Hemkens et al. [2] results in positive bias of the pooled relative odds ratio . . . In their response, Hemkens et al [3] do not address this core statistical problem with their analysis. . . .
We applaud the transparency with which Hemkens et al reported their analyses, which allowed us to replicate their findings independently as well as to illustrate the inherent bias in their statistical method. Our paper was originally submitted to BMJ, as recently revealed by a journal editor [4], and it was reviewed there by two prominent biostatisticians and an epidemiologist. All three reviewers recognized that we had described a fundamental flaw in the statistical approach invented and used by Hemkens et al. We believe that everyone makes mistakes, and acknowledging an honest mistake is a badge of honor. Thus, based on our paper and those three reviews, we expected Hemkens et al. and the journal editors simply to acknowledge the problem and to retract the paper. Their reaction to date is disappointing.
13 November 2017, José G Merino, US Research Editor, Elizabeth Loder, Head of Research, The BMJ:
We acknowledge receipt of this letter that includes a request for retraction of the paper. We take this request very seriously. Before we make a decision on this request, we -The BMJ’s editors and statisticians – are reviewing all the available information. We hope to reach a decision that will maintain the integrity of the scientific literature, acknowledge legitimate differences of opinion about the methods used in the analysis of data, and is fair to all the participants in the debate. We will post a rapid response once we make a decision on this issue.
The discussion also includes contributions from others on unrelated aspects of the problem; here I’m focusing on the Franklin et al. critique and the Hemkens et al. paper.
Good on ya, BMJ
I love how the BMJ is handling this. The discussion is completely open, and the journal editor is completely non-judgmental. All so much better than my recent experience with the Association for Psychological Science, where the journal editor brushed me off in a polite but content-free way, and then the chair of the journal’s publication board followed up with some gratuitous rudeness. The BMJ is doing it right, and the psychology society has a few things to learn from them.
Also, just to make my position on this clear: I don’t see why anyone would think the Hemkens et al. paper should be retracted; a link to the criticisms would seem to be enough.
P.S. Franklin adds:
Just last week I got an email from someone who thought that our conclusion in our Epi Methods paper that use of the pooled ROR without inversion is “just as flawed” was too strong. I think they are right, so we will now be preparing a correction to our paper to modify this statement. So the circle of post-publication peer review continues…
Yes, exactly!
The post “A Bias in the Evaluation of Bias Comparing Randomized Trials with Nonexperimental Studies” appeared first on Statistical Modeling, Causal Inference, and Social Science.
Hi everyone! Today I want to tell you about parameter estimation of smooth functions. But before going into details, there are several things that I want to note. In this post we will discuss bias, efficiency and consistency of estimates of parameters, so I will use phrases like “efficient estimator”, implying that we are talking about some optimisation mechanism that gives efficient estimates of parameters. For people without a statistical background it is probably not obvious what the hell is going on and why we should care, so I decided to give a brief explanation. Although there are strict statistical definitions of the aforementioned terms (you can easily find them in Wikipedia or anywhere else), I do not want to copy-paste them here, because there are only a couple of important points worth mentioning in our context. So, let’s get started.
Bias refers to the expected difference between the estimated value of parameter (on a specific sample) and the “true” one. Having unbiased estimates of parameters is important because they should lead to more accurate forecasts (at least in theory). For example, if the estimated parameter is equal to zero, while in fact it should be 0.5, then the model would not take the provided information into account correctly and as a result will produce less accurate point forecasts and incorrect prediction intervals. In inventory this may mean that we constantly order 100 units less than needed only because the parameter is lower than it should be.
Efficiency means that if the sample size increases, then the estimated parameters will not change substantially, they will vary in a narrow range (variance of estimates will be small). In the case with inefficient estimates the increase of sample size from 50 to 51 observations may lead to the change of a parameter from 0.1 to, let’s say, 10. This is bad because the values of parameters usually influence both point forecasts and prediction intervals. As a result the inventory decision may differ radically from day to day. For example, we may decide that we urgently need 1000 units of product on Monday, and order it just to realise on Tuesday that we only need 100. Obviously this is an exaggeration, but no one wants to deal with such an erratic stocking policy, so we need to have efficient estimates of parameters.
Consistency means that our estimates of parameters will get closer to the stable values (what statisticians would refer to as “true” values) as the sample size increases. This is important because in the opposite case estimates of parameters will diverge and become less and less realistic. This once again influences both point forecasts and prediction intervals, which will be less meaningful than they should have been. In a way consistency means that with the increase of the sample size the parameters will become more efficient and less biased. This in turn means that the more observations we have, the better. There is a prejudice in the world of practitioners that the situation in the market changes so fast that old observations become useless very quickly. As a result many companies just throw away the old data. Although, in general, the statement about market changes is true, forecasters tend to work with models that take this into account (e.g. exponential smoothing, ARIMA). These models adapt to potential changes. So we may benefit from the old data because it allows us to get more consistent estimates of parameters. Just keep in mind that you can always remove the annoying bits of data, but you can never un-throw away the data.
Having clarified these points, we can proceed to the topic of today’s post.
We already know that the default estimator used for smooth functions is Mean Squared Error (for one step ahead forecasts). If the residuals are distributed normally / log-normally, then the minimum of MSE gives estimates that also maximise the respective likelihood function. As a result the estimates of parameters become nice: consistent and efficient. It is also known in statistics that the minimum of MSE gives mean estimates of the parameters, which means that MSE also produces unbiased estimates of parameters (if the model is specified correctly and bla-bla-bla). This works very well when we deal with symmetric distributions of random variables. But it may perform poorly otherwise.
In this post we will use the series N1823 for our examples:
library(Mcomp)
x <- ts(c(M3$N1823$x,M3$N1823$xx),frequency=frequency(M3$N1823$x))
Plot the data in order to see what we have:
plot(x)
The data seems to have slight multiplicative seasonality, which changes over the time, but it is hard to say for sure. Anyway, in order to simplify things, we will apply an ETS(A,A,N) model to this data, so we can see how the different estimators behave. We will withhold 18 observations as it is usually done for monthly data in M3.
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T)
Here’s the output:
Time elapsed: 0.08 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.147 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 629.249
Cost function type: MSE; Cost function value: 377623.069
Information criteria:
     AIC     AICc      BIC
1703.389 1703.977 1716.800
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -14%; Bias: -74.1%; MAPE: 16.8%; SMAPE: 15.1%
MASE: 0.855; sMAE: 13.4%; RelMAE: 1.047; sMSE: 2.4%
It is hard to make any reasonable conclusions from the graph and the output, but it seems that we slightly overforecast the data. At least the prediction interval covers all the values in the holdout. Relative MAE is equal to 1.047, which means that the model produced forecasts less accurate than Naive. Let’s have a look at the residuals:
qqnorm(resid(ourModel))
qqline(resid(ourModel))
The residuals of this model do not look normal; a lot of the empirical quantiles are far from the theoretical ones. If we conduct a Shapiro-Wilk test, then we have to reject the hypothesis of normality of the residuals at the 5% level:
shapiro.test(resid(ourModel))
> p-value = 0.001223
This may indicate that other estimators may do a better job. And there is a magical parameter cfType in the smooth functions which allows estimating models differently. It controls what to use and how to use it. You can select the following estimators instead of MSE:
\begin{equation} \label{eq:MAE}
\text{MAE} = \frac{1}{T} \sum_{t=1}^T |e_{t+1}|
\end{equation}
The minimum of MAE gives median estimates of the parameters. MAE is considered to be a more robust estimator than MSE. If you have an asymmetric distribution, give MAE a try. It gives consistent, but not efficient, estimates of parameters. Asymptotically, if the distribution of the residuals is normal, the estimates of MAE converge to those of MSE (which follows from the connection between the mean and the median of the normal distribution).
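The mean-versus-median point is easy to verify numerically: fitting a single constant to a skewed sample by grid search, MSE picks the sample mean while MAE picks the median. A quick illustration (in Python, for brevity, on toy data):

```python
import numpy as np

# Right-skewed toy sample: the fitted constant differs by loss function.
sample = np.array([1.0, 1.0, 2.0, 3.0, 10.0])
grid = np.linspace(0.0, 10.0, 10001)

# Evaluate each candidate constant under the two losses.
mse = np.array([np.mean((sample - c) ** 2) for c in grid])
mae = np.array([np.mean(np.abs(sample - c)) for c in grid])

best_mse = grid[int(np.argmin(mse))]  # lands on the sample mean
best_mae = grid[int(np.argmin(mae))]  # lands on the sample median
```

The single outlier at 10 drags the MSE solution upward but leaves the MAE solution at the median, which is the robustness being described here.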
Let’s see what happens with the same model, on the same data when we use MAE:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MAE")
Time elapsed: 0.09 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.101 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 636.546
Cost function type: MAE; Cost function value: 462.675
Information criteria:
     AIC     AICc      BIC
1705.879 1706.468 1719.290
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -5.1%; Bias: -32.1%; MAPE: 12.9%; SMAPE: 12.4%
MASE: 0.688; sMAE: 10.7%; RelMAE: 0.842; sMSE: 1.5%
There are several things to note from the graph and the output. First, the smoothing parameter alpha is smaller than in the previous case. Second, the Relative MAE is smaller than one, which means that the model in this case outperformed Naive. Comparing this value with the one from the previous model, we can conclude that MAE worked well as an estimator of parameters for this data. Finally, the graph shows that the point forecasts go through the middle of the holdout sample, which is reflected in the lower values of the error measures. The residuals are still not normally distributed, but this is expected, because they won't become normal just because we used a different estimator.
\begin{equation} \label{eq:HAM}
\text{HAM} = \frac{1}{T} \sum_{t=1}^T \sqrt{|e_{t+1}|}
\end{equation}
This is an even more robust estimator than MAE. On count data its minimum corresponds to the mode of the data. In the case of continuous data, the minimum of this estimator corresponds to something not yet well studied, between the mode and the median. The paper about this estimator is currently in a draft stage, but you can already give it a try if you want. This is also a consistent, but not efficient, estimator.
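As a rough illustration (again in Python, purely for exposition — not the smooth package's code), HAM averages the square roots of the absolute errors, so a large error is penalised far less than under MSE or MAE:

```python
import math

def ham(e):
    """HAM: the mean of the square roots of the absolute errors --
    an even flatter penalty than MAE, so outliers pull the fitted
    parameters even less."""
    return sum(math.sqrt(abs(x)) for x in e) / len(e)

# Contribution of a single error of 10 to each cost function:
# 100 under MSE, 10 under MAE, but only sqrt(10) ~ 3.16 under HAM.
print(10.0 ** 2, abs(10.0), ham([10.0]))
```

A tenfold larger error thus costs a hundred times more under MSE, ten times more under MAE, but only about 3.16 times more under HAM — which is why its minimum gravitates towards the most "typical" observations.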
The same example, the same model, but HAM as an estimator:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="HAM")
Time elapsed: 0.06 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.001 0.001
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 666.439
Cost function type: HAM; Cost function value: 19.67
Information criteria:
     AIC     AICc      BIC
1715.792 1716.381 1729.203
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.7%; Bias: -14.1%; MAPE: 11.4%; SMAPE: 11.4%
MASE: 0.63; sMAE: 9.8%; RelMAE: 0.772; sMSE: 1.3%
This estimator produced even more accurate forecasts in this example, forcing the smoothing parameters to become close to zero. Note that the residuals standard deviation in the case of HAM is larger than in the case of MAE, which in its turn is larger than in the case of MSE. This means that one-step-ahead parametric and semiparametric prediction intervals will be widest for HAM, narrower for MAE, and narrowest for MSE. However, given that the smoothing parameters in the last model are close to zero, the multiple-steps-ahead prediction intervals of HAM may be narrower than the ones of MSE.
Finally, it is worth noting that the optimisation of models using different estimators takes different time. MSE is the slowest, while HAM is the fastest estimator. I don't have a detailed explanation for this, but it presumably happens because of the shapes of the cost function surfaces. So if you are in a hurry and need to estimate something somehow, you can give HAM a try. Just remember that information criteria may become inapplicable in this case.
While the three estimators above are calculated based on the one-step-ahead forecast error, the next three are based on multiple-steps-ahead errors. They can be useful if you want a more stable and “conservative” model (a paper on this topic is currently in the final stage). Prior to v2.2.1 these estimators had different names, so be aware!
\begin{equation} \label{eq:MSEh}
\text{MSE}_h = \frac{1}{T} \sum_{t=1}^T e_{t+h}^2
\end{equation}
The idea of this estimator is very simple: if you are interested in 5-steps-ahead forecasts, then optimise over this horizon, not one step ahead. However, by using this estimator, we shrink the smoothing parameters towards zero, forcing the model to become closer to the deterministic and robust to outliers. This applies to both ETS and ARIMA, although the models behave slightly differently. The effect of shrinkage increases with the increase of \(h\). The forecast accuracy may increase for that specific horizon, but it will almost surely decrease for all the other horizons. Keep in mind that this is, in general, a biased and inefficient estimator with a much slower convergence to the true values than the one-step-ahead estimators. It is eventually consistent, but may need a very large sample to become so. This means that this estimator may produce parameter values very close to zero even when they are not really needed for the data. I would personally advise using it on large samples (for instance, on high frequency data). By the way, Nikos Kourentzes, Rebecca Killick and I are working on a paper on that topic, so stay tuned.
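A minimal Python sketch of the idea (illustrative only; `mse_h`, `forecaster` and `ses_forecaster` are hypothetical names, not part of the smooth package): the cost is computed from h-steps-ahead in-sample errors rather than one-step-ahead ones:

```python
def mse_h(actuals, forecaster, h):
    """MSE_h: mean squared h-steps-ahead in-sample forecast error.
    `forecaster(t, h)` is a hypothetical callable returning the
    model's h-steps-ahead forecast made from origin t."""
    errors = [actuals[t + h] - forecaster(t, h)
              for t in range(len(actuals) - h)]
    return sum(e * e for e in errors) / len(errors)

def ses_forecaster(y, alpha):
    """Toy forecaster: simple exponential smoothing, whose forecast
    is flat -- the same level for every horizon h."""
    level = [y[0]]
    for obs in y[1:]:
        level.append(alpha * obs + (1 - alpha) * level[-1])
    return lambda t, h: level[t]

y = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]
# Optimising this cost over alpha would tune the model specifically
# for 5-steps-ahead accuracy rather than one-step-ahead accuracy.
print(mse_h(y, ses_forecaster(y, 0.3), h=5))
```

Optimising this cost instead of the usual one-step-ahead MSE is what pushes the smoothing parameters towards zero, as the es() output below demonstrates.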
Here’s what happens when we use this estimator:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MSEh")
Time elapsed: 0.24 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha beta
    0    0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 657.781
Cost function type: MSEh; Cost function value: 550179.34
Information criteria:
     AIC     AICc      BIC
30393.86 30404.45 30635.25
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -10.4%; Bias: -62%; MAPE: 14.9%; SMAPE: 13.8%
MASE: 0.772; sMAE: 12.1%; RelMAE: 0.945; sMSE: 1.8%
As you can see, the smoothing parameters are now equal to zero, which gives us the straight line going through all the data. If we had 1008 observations instead of 108, the parameters would not be shrunk to zero, because the model would need to adapt to changes in order to minimise the respective cost function.
The need for having a specific 5 steps ahead forecast is not common, so it makes sense to work with something that deals with one to h steps ahead:
\begin{equation} \label{TMSE}
\text{TMSE} = \sum_{j=1}^h \frac{1}{T} \sum_{t=1}^T e_{t+j}^2
\end{equation}
This estimator is more reasonable than MSE\(_h\) because it takes into account all the errors from one- to h-steps-ahead forecasts. This is a desired behaviour in inventory management, because we are not so much interested in how much we will sell tomorrow or next Monday, but rather in how much we will sell from tomorrow until next Monday. However, the variance of forecast errors h steps ahead is usually larger than the variance of one-step-ahead errors (because of the increasing uncertainty), which leads to an effect of “masking”: the latter is hidden behind the former. As a result, if we use TMSE as the estimator, the final values are influenced more by the long-term errors than by the short-term ones (see the Taieb and Atiya, 2015 paper). This estimator is not recommended if short-term forecasts are more important than long-term ones. Plus, it is still a less efficient and more biased estimator than the one-step-ahead estimators, with slow convergence to the true values, similar to MSE\(_h\) but slightly better.
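In the same illustrative Python style (again, `forecaster` is a hypothetical callable, not smooth's API), TMSE simply sums the per-horizon MSEs from 1 to h:

```python
def tmse(actuals, forecaster, h):
    """Trace MSE: the sum of MSE_j over horizons j = 1..h, so every
    horizon contributes -- though the larger long-horizon error
    variances tend to dominate the total (the "masking" effect)."""
    total = 0.0
    for j in range(1, h + 1):
        errs = [actuals[t + j] - forecaster(t, j)
                for t in range(len(actuals) - j)]
        total += sum(e * e for e in errs) / len(errs)
    return total

# Toy check with a naive (random walk) forecaster: forecast = origin value.
y = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]
naive = lambda t, j: y[t]
print(tmse(y, naive, h=2))  # MSE_1 + MSE_2 = 19/7 + 1
```

Because each MSE\(_j\) enters the sum on its raw scale, the horizons with the largest error variance dominate the optimisation — the masking effect described above.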
This is what happens in our example:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="TMSE")
Time elapsed: 0.2 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.075 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 633.48
Cost function type: TMSE; Cost function value: 7477097.717
Information criteria:
     AIC     AICc      BIC
30394.36 30404.94 30635.75
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -7.5%; Bias: -48.9%; MAPE: 13.4%; SMAPE: 12.6%
MASE: 0.704; sMAE: 11%; RelMAE: 0.862; sMSE: 1.5%
Comparing the model estimated using TMSE with the same one estimated using MSE and MSE\(_h\), it is worth noting that the smoothing parameters in this model are greater than in the case of MSE\(_h\), but less than in the case of MSE. This demonstrates that there is a shrinkage effect in TMSE, forcing the parameters towards zero, but the inclusion of one-step-ahead errors makes the model slightly more flexible than in the case of MSE\(_h\). Still, it is advised to use this estimator on large samples, where the estimates of the parameters become more efficient and less biased.
This estimator is similar to TMSE, but is derived from the so-called Trace Forecast Likelihood (which I may discuss at some point in a future post). The idea here is to take the logarithm of each MSE\(_j\) and then sum them up:
\begin{equation} \label{eq:GTMSE}
\text{GTMSE} = \sum_{j=1}^h \log \left( \frac{1}{T} \sum_{t=1}^T e_{t+j}^2 \right)
\end{equation}
Logarithms make the variances of errors several steps ahead closer to each other. For example, if the variance of the one-step-ahead error is equal to 100 and the variance of the 10-steps-ahead error is equal to 1000, then their logarithms will be 4.6 and 6.9 respectively. As a result, when GTMSE is used as an estimator, the model takes into account both short- and long-term errors, so this is a more balanced estimator of parameters than MSE\(_h\) and TMSE. It is more efficient than both TMSE and MSE\(_h\) because of the log-scale, and it converges to the true values faster than the previous two, but it can still be biased on small samples.
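Continuing the illustrative Python sketches (hypothetical names, not the smooth API), GTMSE sums the logarithms of the per-horizon MSEs; the numerical example from the text shows how the log compresses the scale:

```python
import math

def gtmse(actuals, forecaster, h):
    """Generalised Trace MSE: the sum of log(MSE_j) over j = 1..h.
    The logarithm compresses the scale, so long-horizon variances
    no longer drown out the short-horizon ones."""
    total = 0.0
    for j in range(1, h + 1):
        errs = [actuals[t + j] - forecaster(t, j)
                for t in range(len(actuals) - j)]
        total += math.log(sum(e * e for e in errs) / len(errs))
    return total

# The numerical example from the text: error variances of 100 and
# 1000 become comparable once logged.
print(round(math.log(100), 1), round(math.log(1000), 1))  # 4.6 6.9
```

A tenfold difference in variance shrinks to a difference of \(\log 10 \approx 2.3\) on the log scale regardless of the horizon, which is why short-term errors still influence the fit.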
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="GTMSE")
Time elapsed: 0.18 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha beta
    0    0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 649.253
Cost function type: GTMSE; Cost function value: 232.419
Information criteria:
     AIC     AICc      BIC
30402.77 30413.36 30644.16
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -8.2%; Bias: -53.8%; MAPE: 13.8%; SMAPE: 12.9%
MASE: 0.72; sMAE: 11.3%; RelMAE: 0.882; sMSE: 1.6%
In our example GTMSE shrinks both smoothing parameters towards zero and makes the model deterministic, which corresponds to MSE\(_h\). However, the initial values are slightly different, which leads to slightly different forecasts. Once again, it is advised to use this estimator on large samples.
Keep in mind that all those multiple steps ahead estimators take more time for the calculation, because the model needs to produce h-steps-ahead forecasts from each observation in sample.
There is also an undocumented feature in the smooth functions (currently available only for pure additive models): analytical versions of the multiple-steps-ahead estimators. To use them, add “a” in front of the desired estimator: aMSE\(_h\), aTMSE, aGTMSE. In this case only the one-step-ahead forecast error is produced, and then the structure of the applied state-space model is used to reconstruct the multiple-steps-ahead estimators. This feature is useful if you want to use a multiple-steps-ahead estimator on small samples, where the multi-step errors cannot be calculated appropriately. It is also useful on large samples, when estimation time matters.
These estimators have similar properties to their empirical counterparts, but work faster and are based on asymptotic properties. Here is an example of analytical MSE\(_h\) estimator:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="aMSEh")
Time elapsed: 0.11 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha beta
    0    0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 627.818
Cost function type: aMSEh; Cost function value: 375907.976
Information criteria:
     AIC     AICc      BIC
30652.15 30662.74 30893.55
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.9%; Bias: -14.6%; MAPE: 11.7%; SMAPE: 11.6%
MASE: 0.643; sMAE: 10%; RelMAE: 0.787; sMSE: 1.3%
The resulting smoothing parameters are shrunk towards zero, similar to MSE\(_h\), but the initial values are slightly different, which leads to different forecasts. Note that the time elapsed in this case is 0.11 seconds instead of 0.24 as in MSE\(_h\). The difference in time may increase with the increase of sample size and forecasting horizon.
There are also multiple-steps-ahead analogues of MAE and HAM in the smooth functions (e.g. MAE\(_h\) and THAM). However, they are currently implemented mainly “because I can” and for fun, so I cannot give you any recommendations about them.
Now that we have discussed all the estimators available in smooth, you are most probably confused and completely lost. The question that naturally appears after reading this post is: “What should I do?” Frankly speaking, I cannot give you an appropriate answer or a set of universal recommendations, because this is still an under-researched problem. However, I have some advice.
First, Nikos Kourentzes and Juan Ramon Trapero found that in the case of high frequency data (they used solar irradiation data), using MSE\(_h\) and TMSE leads to an increase in forecasting accuracy in comparison with the conventional MSE. However, in order to achieve good accuracy with MSE\(_h\), you need to estimate \(h\) separate models, while with TMSE you need to estimate only one. So TMSE is faster than MSE\(_h\), while leading to forecasts at least as accurate for all the steps from 1 to h.
Second, if you have asymmetrically distributed residuals in the model after using MSE, give MAE or HAM a try – they may improve your model and its accuracy.
Third, analytical counterparts of the multi-step estimators can be useful in one of two situations: 1. When you deal with very large samples (e.g. high frequency data) and want to use advanced estimation methods, but need them to work fast. 2. When you work with a small sample, but want to use the properties of these estimators anyway.
Finally, don’t use MSE\(_h\), TMSE and GTMSE if you are interested in the values of parameters of models – they will almost surely be inefficient and biased. This applies to both ETS and ARIMA models, which will become close to their deterministic counterparts in this case. Use conventional MSE instead.
Recently, Dataiku 4.1.0 was released, and it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are:
Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects.
Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code.
Deployment of data science products: As a data scientist you can produce many interesting deliverables, e.g. graphs, data transformations, analyses, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web apps, model APIs, productionized model APIs and data pipelines.
There is a free version which already contains a lot of useful features, and there is a paid version with “enterprise features“. See the Dataiku website for more info.
Among many new features, the one that interests me the most as an R user is the improved support for R. In previous versions of Dataiku there was already some support for R; this version brings the following improvements. There is now support for:
In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages.
In Dataiku, whenever you make use of R (in R recipes, Shiny, R Markdown or when creating R APIs) you can select a specific R code environment to use.
If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version, you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and Shiny apps become part of the Dataiku environment, and so:
The figure above shows an R Markdown report in Dataiku; the interface provides a nice way to edit the report, alter settings and publish the report. Below is an example dashboard in Dataiku with a Markdown and a Shiny report.
Once you have created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API; in fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API. Moreover, the Dataiku API Node, which can be installed on a (separate) production server, imports the R models that you have created in the GUI and can take care of load balancing, high availability and scaling of real-time scoring.
The following three figures give you an overview of how easy it is to work with the R API functionality.
First, define an API endpoint and R (prediction) function.
Then, define the R function; it can make use of data in Dataiku, R objects created earlier, or any R library you need.
Then, test and deploy the R function. Dataiku provides a handy interface to test your function/API.
Finally, once you are satisfied with the R API, you can make a package of it; that package can then be imported on a production server with the Dataiku API Node installed, from which you can then serve API requests.
The new Dataiku 4.1.0 version has a lot to offer anyone involved in a data science project. The system already had wide-ranging support for Python, but now, with the improved support for R, it is even more useful to a very large group of data scientists.
Cheers, Longhow.
Former U.S. Chief Data Scientist DJ Patil will be lending his expertise to DataScience.com’s product, engineering, and R&D teams as they expand the features of the company’s enterprise data science platform.
Imagine taking the practical, hands-on teaching style of the PyImageSearch blog…
…and translating it to a live, in person conference.
Sound interesting?
If so, mark the date on your calendar now:
PyImageConf is taking place on August 26th-28th 2018 at the San Francisco Hyatt Regency.
Confirmed speaker/workshop hosts include:
To learn more about PyImageConf (and why you should attend), keep reading.
PyImageConf is a practical, hands-on conference featuring computer vision, deep learning, and OpenCV experts.
If you want to rub elbows, make connections, and learn from the leading computer vision and deep learning experts (and hang out with us in San Francisco for a few days)…you’ll want to get on the wait list now.
PyImageConf has put together the biggest names in computer vision, deep learning, and OpenCV education to give you the best possible live, hands-on training and talks.
Each speaker is known for their writing, teaching, online courses, and contributions to open source projects.
Right now there are eight confirmed speakers/workshop hosts with more to be added later:
If you’re looking for a computer vision/deep learning conference with the foremost educators and speakers, this is it.
PyImageConf will take place on August 26-28th 2018.
Tickets will go on sale in January 2018.
If you want a ticket make sure you claim your spot in line!
The conference will be hosted in the newly remodeled San Francisco Regency Hyatt, right on the beautiful San Francisco Bay waterline.
Talks will take place in their exquisite state-of-the-art ballroom and workshops/breakout sessions will utilize the hotel’s dedicated workspaces.
Take a look at the photos below to get a feel for the space — I’m sure you’ll agree that this will be the computer vision and deep learning event of the year:
I’m currently finalizing ticket prices with my conference coordinator, but expect the prices to be in the $700-$900 range.
This may seem like a lot, but keep in mind this includes two full days of talks, workshops, and training — most conferences would charge well over $1,500-$2,500 for this type of event.
If you’re a PyImageSearch Gurus member you will enjoy first dibs on the tickets (and a discount).
Take the time now to check with your stakeholders (spouse, kids, etc.) and earmark funds to grab a ticket.
The conference will be small and intimate, capped at 200 attendees.
I’m purposely keeping the conference small to enable you to:
Once the 200 tickets sell out, they’re all gone and I will not be adding more.
If you’re a longtime PyImageSearch reader I don’t need to tell you how fast these types of sales sell out.
A few years ago when I ran a Kickstarter campaign for the PyImageSearch Gurus course all early bird specials sold out within 30 minutes.
In January of this year I launched the Deep Learning for Computer Vision with Python Kickstarter — all early bird specials were claimed in under 15 minutes.
Given that there will only be 200 tickets available, I expect all tickets to sell out within 10 minutes of the tickets going live.
So, if you want a ticket to PyImageConf…
…make sure you click here and get on the early bird list!
Here’s the TL;DR;
The post Save the Date: PyImageConf 2018 appeared first on PyImageSearch.
Axel Cleeremans writes:
I appreciated your piece titled “What has happened down here is the winds have changed”. Your mini-history of what happened was truly enlightening — but you didn’t explicitly mention our failure to replicate Bargh’s slow walking effect. This was absolutely instrumental in triggering the replication crisis. As you know, the article was covered by the science journalist Ed Yong and came shortly after the Stapel affair. It was the first failure to replicate a classic priming effect that attracted so much attention. Yong’s blog post about it attracted a response from John Bargh and further replies from Yong, as you indirectly point to. But our article and the entire exchange between Yong and Bargh is also what triggered an extended email discussion involving many of the actors involved in this entire debate (including E. J. Wagenmakers, Hal Pashler, Fritz Strack and about 30 other people). That discussion was initiated by Daniel Kahneman after he and I discussed what to make of our failure to replicate Bargh’s findings. This email discussion continued for about two years and eventually resulted in further attempts to replicate, as they are unfolding now.
I was aware of the Bargh issue but I’d only read Wagenmakers (and Bargh’s own unfortunate writings) on the issue; I’d never followed up to read the original, so this is good to know. One thing I like about having these exchanges on a blog, rather than circulating emails, is that all the discussion is in one place and is open to all to read and participate.
The post A pivotal episode in the unfolding of the replication crisis appeared first on Statistical Modeling, Causal Inference, and Social Science.
Customer churn, also known as customer attrition, occurs when customers or subscribers stop doing business with a company or service. It is also referred to as loss of clients or customers. One industry in which churn rates are particularly useful is telecommunications, because most customers have multiple options to choose from within a geographic location.
Similar to the concept of predicting employee turnover, we are going to predict customer churn using a telecom dataset. We will introduce Logistic Regression, Decision Tree, and Random Forest. But this time, we will do all of the above in R. Let’s get started!
The data was downloaded from IBM Sample Data Sets. Each row represents a customer, each column contains that customer’s attributes:
library(plyr)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(caret)
library(MASS)
library(randomForest)
library(party)

churn <- read.csv('Telco-Customer-Churn.csv')
str(churn)
'data.frame': 7043 obs. of 21 variables:
 $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
 $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
 $ SeniorCitizen   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
 $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
 $ tenure          : int 1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
 $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
 $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
 $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
 $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
 $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
 $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
 $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
 $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
 $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
 $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
 $ MonthlyCharges  : num 29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num 29.9 1889.5 108.2 1840.8 151.7 ...
 $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
The raw data contains 7043 rows (customers) and 21 columns (features). The “Churn” column is our target.
We use sapply to check the number of missing values in each column. We find that there are 11 missing values in the “TotalCharges” column, so let’s remove all rows with missing values.
sapply(churn, function(x) sum(is.na(x)))
      customerID           gender    SeniorCitizen          Partner
               0                0                0                0
      Dependents           tenure     PhoneService    MultipleLines
               0                0                0                0
 InternetService   OnlineSecurity     OnlineBackup DeviceProtection
               0                0                0                0
     TechSupport      StreamingTV  StreamingMovies         Contract
               0                0                0                0
PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges
               0                0                0               11
           Churn
               0
churn <- churn[complete.cases(churn), ]
1. We will change “No internet service” to “No” for six columns: “OnlineSecurity”, “OnlineBackup”, “DeviceProtection”, “TechSupport”, “StreamingTV”, “StreamingMovies”.
cols_recode1 <- c(10:15)
for(i in 1:ncol(churn[,cols_recode1])) {
  churn[,cols_recode1][,i] <- as.factor(mapvalues(churn[,cols_recode1][,i],
                                                  from = c("No internet service"), to = c("No")))
}
2. We will change “No phone service” to “No” for column “MultipleLines”
churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, from=c("No phone service"), to=c("No")))
3. Since the minimum tenure is 1 month and the maximum tenure is 72 months, we can group them into five tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Month”, “48–60 Month”, “> 60 Month”.
min(churn$tenure); max(churn$tenure)
[1] 1
[1] 72
group_tenure <- function(tenure){
  if (tenure >= 0 & tenure <= 12){
    return('0-12 Month')
  } else if (tenure > 12 & tenure <= 24){
    return('12-24 Month')
  } else if (tenure > 24 & tenure <= 48){
    return('24-48 Month')
  } else if (tenure > 48 & tenure <= 60){
    return('48-60 Month')
  } else if (tenure > 60){
    return('> 60 Month')
  }
}
churn$tenure_group <- sapply(churn$tenure, group_tenure)
churn$tenure_group <- as.factor(churn$tenure_group)
4. Change the values in column “SeniorCitizen” from 0 or 1 to “No” or “Yes”.
churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen, from=c("0","1"), to=c("No", "Yes")))
5. Remove the columns we do not need for the analysis.
churn$customerID <- NULL
churn$tenure <- NULL
Correlation between numeric variables
numeric.var <- sapply(churn, is.numeric)
corr.matrix <- cor(churn[,numeric.var])
corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables", method="number")
Gives this plot:
The Monthly Charges and Total Charges are correlated. So one of them will be removed from the model. We remove Total Charges.
churn$TotalCharges <- NULL
Bar plots of categorical variables
p1 <- ggplot(churn, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p2 <- ggplot(churn, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + xlab("Senior Citizen") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p3 <- ggplot(churn, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p4 <- ggplot(churn, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol=2)
p5 <- ggplot(churn, aes(x=PhoneService)) + ggtitle("Phone Service") + xlab("Phone Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p6 <- ggplot(churn, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + xlab("Multiple Lines") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p7 <- ggplot(churn, aes(x=InternetService)) + ggtitle("Internet Service") + xlab("Internet Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p8 <- ggplot(churn, aes(x=OnlineSecurity)) + ggtitle("Online Security") + xlab("Online Security") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p5, p6, p7, p8, ncol=2)
p9 <- ggplot(churn, aes(x=OnlineBackup)) + ggtitle("Online Backup") + xlab("Online Backup") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p10 <- ggplot(churn, aes(x=DeviceProtection)) + ggtitle("Device Protection") + xlab("Device Protection") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p11 <- ggplot(churn, aes(x=TechSupport)) + ggtitle("Tech Support") + xlab("Tech Support") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p12 <- ggplot(churn, aes(x=StreamingTV)) + ggtitle("Streaming TV") + xlab("Streaming TV") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p9, p10, p11, p12, ncol=2)
p13 <- ggplot(churn, aes(x=StreamingMovies)) + ggtitle("Streaming Movies") + xlab("Streaming Movies") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p14 <- ggplot(churn, aes(x=Contract)) + ggtitle("Contract") + xlab("Contract") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p15 <- ggplot(churn, aes(x=PaperlessBilling)) + ggtitle("Paperless Billing") + xlab("Paperless Billing") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p16 <- ggplot(churn, aes(x=PaymentMethod)) + ggtitle("Payment Method") + xlab("Payment Method") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p17 <- ggplot(churn, aes(x=tenure_group)) + ggtitle("Tenure Group") + xlab("Tenure Group") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p13, p14, p15, p16, p17, ncol=2)
All of the categorical variables seem to have a reasonably broad distribution, so all of them will be kept for further analysis.
First, we split the data into training and testing sets
set.seed(2017)
intrain <- createDataPartition(churn$Churn, p = 0.7, list = FALSE)
training <- churn[intrain,]
testing <- churn[-intrain,]
Confirm the splitting is correct
dim(training); dim(testing)
[1] 4924 19
[1] 2108 19
Fitting the Logistic Regression Model
LogModel <- glm(Churn ~ ., family=binomial(link="logit"), data=training)
print(summary(LogModel))

Call:
glm(formula = Churn ~ ., family = binomial(link = "logit"), data = training)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0370  -0.6793  -0.2861   0.6590   3.1608

Coefficients:
                                      Estimate Std. Error z value Pr(>|z|)
(Intercept)                          -2.030149   1.008223  -2.014 0.044053 *
genderMale                           -0.039006   0.077686  -0.502 0.615603
SeniorCitizenYes                      0.194408   0.101151   1.922 0.054611 .
PartnerYes                           -0.062031   0.092911  -0.668 0.504363
DependentsYes                        -0.017974   0.107659  -0.167 0.867405
PhoneServiceYes                      -0.387097   0.788745  -0.491 0.623585
MultipleLinesYes                      0.345052   0.214799   1.606 0.108187
InternetServiceFiber optic            1.022836   0.968062   1.057 0.290703
InternetServiceNo                    -0.829521   0.978909  -0.847 0.396776
OnlineSecurityYes                    -0.393647   0.215993  -1.823 0.068379 .
OnlineBackupYes                      -0.113220   0.213825  -0.529 0.596460
DeviceProtectionYes                  -0.025797   0.213317  -0.121 0.903744
TechSupportYes                       -0.306423   0.220920  -1.387 0.165433
StreamingTVYes                        0.399209   0.395912   1.008 0.313297
StreamingMoviesYes                    0.338852   0.395890   0.856 0.392040
ContractOne year                     -0.805584   0.127125  -6.337 2.34e-10 ***
ContractTwo year                     -1.662793   0.216346  -7.686 1.52e-14 ***
PaperlessBillingYes                   0.338724   0.089407   3.789 0.000152 ***
PaymentMethodCredit card (automatic) -0.018574   0.135318  -0.137 0.890826
PaymentMethodElectronic check         0.373214   0.113029   3.302 0.000960 ***
PaymentMethodMailed check             0.001400   0.136446   0.010 0.991815
MonthlyCharges                       -0.005001   0.038558  -0.130 0.896797
tenure_group0-12 Month                1.858899   0.205956   9.026  < 2e-16 ***
tenure_group12-24 Month               0.968497   0.201452   4.808 1.53e-06 ***
tenure_group24-48 Month               0.574822   0.185500   3.099 0.001943 **
tenure_group48-60 Month               0.311790   0.200096   1.558 0.119185
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5702.8  on 4923  degrees of freedom
Residual deviance: 4094.4  on 4898  degrees of freedom
AIC: 4146.4

Number of Fisher Scoring iterations: 6
Feature Analysis
The top three most-relevant features include Contract, tenure_group and PaperlessBilling.
anova(LogModel, test="Chisq")

Analysis of Deviance Table
Model: binomial, link: logit
Response: Churn
Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                              4923     5702.8
gender            1     0.39      4922     5702.4 0.5318602
SeniorCitizen     1    95.08      4921     5607.3 < 2.2e-16 ***
Partner           1   107.29      4920     5500.0 < 2.2e-16 ***
Dependents        1    27.26      4919     5472.7 1.775e-07 ***
PhoneService      1     1.27      4918     5471.5 0.2597501
MultipleLines     1     9.63      4917     5461.8 0.0019177 **
InternetService   2   452.01      4915     5009.8 < 2.2e-16 ***
OnlineSecurity    1   183.83      4914     4826.0 < 2.2e-16 ***
OnlineBackup      1    69.94      4913     4756.1 < 2.2e-16 ***
DeviceProtection  1    47.58      4912     4708.5 5.287e-12 ***
TechSupport       1    82.78      4911     4625.7 < 2.2e-16 ***
StreamingTV       1     4.90      4910     4620.8 0.0269174 *
StreamingMovies   1     0.36      4909     4620.4 0.5461056
Contract          2   309.25      4907     4311.2 < 2.2e-16 ***
PaperlessBilling  1    14.21      4906     4297.0 0.0001638 ***
PaymentMethod     3    38.85      4903     4258.1 1.867e-08 ***
MonthlyCharges    1     0.10      4902     4258.0 0.7491553
tenure_group      4   163.67      4898     4094.4 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analyzing the deviance table, we can see the drop in deviance from adding each variable one at a time. Adding InternetService, Contract and tenure_group significantly reduces the residual deviance. Other variables, such as PaymentMethod and Dependents, seem to improve the model less, even though they all have low p-values.
Assessing the predictive ability of the Logistic Regression model
testing$Churn <- as.character(testing$Churn)
testing$Churn[testing$Churn=="No"] <- "0"
testing$Churn[testing$Churn=="Yes"] <- "1"
fitted.results <- predict(LogModel, newdata=testing, type='response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != testing$Churn)
print(paste('Logistic Regression Accuracy', 1-misClasificError))
[1] "Logistic Regression Accuracy 0.796489563567362"
Logistic Regression Confusion Matrix
print("Confusion Matrix for Logistic Regression"); table(testing$Churn, fitted.results > 0.5)
[1] "Confusion Matrix for Logistic Regression"
    FALSE TRUE
  0  1392  156
  1   273  287
Odds Ratio
One of the interesting performance measurements in logistic regression is the odds ratio. Basically, an odds ratio measures how the odds of churn change with each predictor: values above 1 indicate increased odds of churn, and values below 1 indicate decreased odds.
exp(cbind(OR=coef(LogModel), confint(LogModel)))
Waiting for profiling to be done...
                                            OR      2.5 %     97.5 %
(Intercept)                          0.1313160 0.01815817  0.9461855
genderMale                           0.9617454 0.82587632  1.1199399
SeniorCitizenYes                     1.2145919 0.99591418  1.4807053
PartnerYes                           0.9398537 0.78338071  1.1276774
DependentsYes                        0.9821863 0.79488224  1.2124072
PhoneServiceYes                      0.6790251 0.14466019  3.1878587
MultipleLinesYes                     1.4120635 0.92707245  2.1522692
InternetServiceFiber optic           2.7810695 0.41762316 18.5910286
InternetServiceNo                    0.4362582 0.06397364  2.9715699
OnlineSecurityYes                    0.6745919 0.44144273  1.0296515
OnlineBackupYes                      0.8929545 0.58709919  1.3577947
DeviceProtectionYes                  0.9745328 0.64144877  1.4805460
TechSupportYes                       0.7360754 0.47707096  1.1344691
StreamingTVYes                       1.4906458 0.68637788  3.2416264
StreamingMoviesYes                   1.4033353 0.64624171  3.0518161
ContractOne year                     0.4468271 0.34725066  0.5717469
ContractTwo year                     0.1896086 0.12230199  0.2861341
PaperlessBillingYes                  1.4031556 1.17798691  1.6725920
PaymentMethodCredit card (automatic) 0.9815977 0.75273387  1.2797506
PaymentMethodElectronic check        1.4523952 1.16480721  1.8145076
PaymentMethodMailed check            1.0014007 0.76673087  1.3092444
MonthlyCharges                       0.9950112 0.92252949  1.0731016
tenure_group0-12 Month               6.4166692 4.30371945  9.6544837
tenure_group12-24 Month              2.6339823 1.78095906  3.9256133
tenure_group24-48 Month              1.7768147 1.23988035  2.5676783
tenure_group48-60 Month              1.3658675 0.92383315  2.0261505
Decision Tree visualization
For illustration purposes, we are going to use only three variables for plotting the decision tree: "Contract", "tenure_group" and "PaperlessBilling".
tree <- ctree(Churn ~ Contract + tenure_group + PaperlessBilling, training)
plot(tree, type='simple')
Gives this plot:
1. Of the three variables we used, Contract is the most important for predicting whether or not a customer will churn.
2. If a customer is on a one-year or two-year contract, they are less likely to churn whether or not they use PaperlessBilling.
3. On the other hand, a customer on a month-to-month contract, in the 0–12 month tenure group, who uses PaperlessBilling is more likely to churn.
Decision Tree Confusion Matrix
We use all of the variables to produce the confusion matrix and make predictions.
pred_tree <- predict(tree, testing)
print("Confusion Matrix for Decision Tree"); table(Predicted = pred_tree, Actual = testing$Churn)
[1] "Confusion Matrix for Decision Tree"
         Actual
Predicted   No  Yes
      No  1395  346
      Yes  153  214
Decision Tree Accuracy
p1 <- predict(tree, training)
tab1 <- table(Predicted = p1, Actual = training$Churn)
tab2 <- table(Predicted = pred_tree, Actual = testing$Churn)
print(paste('Decision Tree Accuracy', sum(diag(tab2))/sum(tab2)))
[1] "Decision Tree Accuracy 0.763282732447818"
The Decision Tree accuracy (0.763) is no improvement over Logistic Regression; in fact it is slightly lower. Let's see if we can do better using Random Forest.
Random Forest Initial Model
rfModel <- randomForest(Churn ~ ., data = training)
print(rfModel)

Call:
 randomForest(formula = Churn ~ ., data = training)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of error rate: 20.65%
Confusion matrix:
      No Yes class.error
No  3245 370   0.1023513
Yes  647 662   0.4942704
The error rate is relatively low when predicting "No", but much higher when predicting "Yes".
Random Forest Prediction and Confusion Matrix
pred_rf <- predict(rfModel, testing)
caret::confusionMatrix(pred_rf, testing$Churn)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1381  281
       Yes  167  279

               Accuracy : 0.7875
                 95% CI : (0.7694, 0.8048)
    No Information Rate : 0.7343
    P-Value [Acc > NIR] : 9.284e-09

                  Kappa : 0.4175
 Mcnemar's Test P-Value : 9.359e-08

            Sensitivity : 0.8921
            Specificity : 0.4982
         Pos Pred Value : 0.8309
         Neg Pred Value : 0.6256
             Prevalence : 0.7343
         Detection Rate : 0.6551
   Detection Prevalence : 0.7884
      Balanced Accuracy : 0.6952

       'Positive' Class : No
Random Forest Error Rate
plot(rfModel)
Gives this plot:
We use this plot to help us determine the number of trees. As the number of trees increases, the OOB error rate decreases, and then becomes almost constant. We are not able to decrease the OOB error rate after about 100 to 200 trees.
Tune Random Forest Model
t <- tuneRF(training[, -18], training[, 18], stepFactor = 0.5, plot = TRUE, ntreeTry = 200, trace = TRUE, improve = 0.05)
Gives this plot:
We use this plot to get some idea of which mtry value to choose. The OOB error rate is lowest when mtry is 2, so we choose mtry = 2.
Fit the Random Forest Model After Tuning
rfModel_new <- randomForest(Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE)
print(rfModel_new)

Call:
 randomForest(formula = Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 2

        OOB estimate of error rate: 19.7%
Confusion matrix:
      No Yes class.error
No  3301 314   0.0868603
Yes  656 653   0.5011459
OOB error rate decreased to 19.7% from 20.65% earlier.
Random Forest Predictions and Confusion Matrix After Tuning
pred_rf_new <- predict(rfModel_new, testing)
caret::confusionMatrix(pred_rf_new, testing$Churn)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1404  305
       Yes  144  255

               Accuracy : 0.787
                 95% CI : (0.7689, 0.8043)
    No Information Rate : 0.7343
    P-Value [Acc > NIR] : 1.252e-08

                  Kappa : 0.3989
 Mcnemar's Test P-Value : 4.324e-14

            Sensitivity : 0.9070
            Specificity : 0.4554
         Pos Pred Value : 0.8215
         Neg Pred Value : 0.6391
             Prevalence : 0.7343
         Detection Rate : 0.6660
   Detection Prevalence : 0.8107
      Balanced Accuracy : 0.6812

       'Positive' Class : No
Compared with the initial Random Forest model, the accuracy did not increase, but the sensitivity improved.
Random Forest Feature Importance
varImpPlot(rfModel_new, sort=T, n.var = 10, main = 'Top 10 Feature Importance')
From the above example, we can see that Logistic Regression and Random Forest performed better than Decision Tree for customer churn analysis for this particular dataset.
Throughout the analysis, I have learned several important things:
1. Features such as tenure_group, Contract, PaperlessBilling, MonthlyCharges and InternetService appear to play a role in customer churn.
2. There does not seem to be a relationship between gender and churn.
3. Customers on a month-to-month contract, with PaperlessBilling, and within 12 months of tenure are more likely to churn. On the other hand, customers with a one- or two-year contract, more than 12 months of tenure, who do not use PaperlessBilling, are less likely to churn.
Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.
Russell Goldenberg released Scrollama.js in an effort to make scrollytelling more straightforward to implement.
Scrollytelling can be complicated to implement and difficult to make performant. The goal of this library is to provide a simple interface for creating scroll-driven interactives and improve user experience by reducing scroll jank. It offers (optional) methods to implement the common scrollytelling pattern to reduce more involved DOM calculations. For lack of a better term, I refer to it as the sticky graphic pattern, whereby the graphic scrolls into view, becomes “stuck” for a duration of steps, then exits and “unsticks” when the steps conclude.
Bookmarked for later.
Tags: JavaScript, scrollytelling
Just visit this link and post a new R job to the R community.
You can post a job for free (and there are also “featured job” options available for extra exposure).
Job seekers: please follow the links below to learn more and apply for your R job of interest:
On R-users.com you can see all the R jobs that are currently available.
R-users Resumes
R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.
(You may also look at previous R jobs posts.)
It’s an exciting time here at AYLIEN – in the past couple of months, we’ve moved office, closed a funding round, and added six people to the team. We’re delighted to announce our most recent hire and our first Chief Architect and Principal Engineer, Hunter Kelly.
The AYLIEN tech infrastructure has grown to quite a scale at this point. In addition to serving over 30,000 users with three product offerings, we're also a fully-functional AI research lab that houses five full-time researchers, who in turn feed their findings back into the products. With such a complex architecture and backends that have highly demanding tasks, bringing in an engineer with the breadth and quality of experience that Hunter has is a huge boost to us as we move into the next phase of our journey.
At first glance, Hunter’s career has followed a seemingly meandering path through some really interesting companies. After graduating from UC Berkeley in the 90s, he joined the Photoscience Department at Pixar in California, became one of the first engineers in Google’s Dublin office, and at NewBay, he designed and built a multi-petabyte storage solution for handling user-generated content, still in use by some of the world’s largest telcos today. Hunter is joining us from Zalando’s Fashion Insight Centre, where as the first engineer in the Dublin office he kicked off the Fashion Content Platform, which was shortlisted as a finalist in the 2017 DatSci Awards.
The common thread in those roles, while perhaps not obvious, is data. Hunter brings this rich experience working on data, both from an engineering and data science perspective, to focus on one extremely important problem – how can we leverage data to solve our hardest problems?
This question is central to AI research, and Hunter’s expertise is a perfect fit with AYLIEN’s mission to make Natural Language Processing hassle-free for developers. Our APIs handle the heavy lifting so developers can leverage Deep NLP in a couple of lines of code, and the ease with which our users do this is down to the great work our science and engineering teams do. Adding Hunter to the intersection of these teams will add a huge amount to our capabilities here and we’re really excited about the great work we can get done.
Here’s what Hunter had to say about joining the team:
“I’m really excited to be joining AYLIEN at this point in time. I think that AI and Machine Learning are incredibly powerful tools that everyone should be able to leverage. I really look forward to being able to bring my expertise and experience, particularly with large-scale and streaming data platforms, to AYLIEN to help broaden their already impressive offerings. AI is just reaching that critical point of moving beyond academia and reaching wide-scale adoption. Making its power accessible to the wider community beyond very focused experts is a really interesting and exciting challenge.”
When he’s not in AYLIEN, Hunter can be found spending time with his wife and foster children, messing around learning yet another programming language, painting minis, playing board games, tabletop RPG’s and wargames, or spending too much time playing video games. He’s also been known to do some Salsa dancing, traveling, sailing, and scuba diving.
Check out Hunter’s talks on his most recent work at ClojureConj and this year’s Kafka Summit.
Ancient Data, Tech Ethics, Session Replay, and Cache Filesystem
Continue reading Four short links: 20 November 2017.
Graph databases are now common within a range of industries such as life sciences, healthcare, financial services, government and intelligence. Graphs are particularly valuable in these sectors because of the complex nature of the data and need for powerful, yet flexible data analytics. In addition, graph databases allow visibility across
The post How to keep graph databases both flexible and secure appeared first on Dataconomy.
Countless words have been written in recent years about the rise of the ‘gig economy’. As the natural evolution of the freelance model, the gig economy is based on short-term and temporary jobs handled by contractors and other independent workers. Instead of hiring full-time staff, freelancers offer a concise way
The post Can Blockchain Supercharge the Gig Economy? appeared first on Dataconomy.
Introduction
Residential real estate prices are fascinating… and frustrating. The homebuyer, the home-seller, the real estate agent, the economist, and the banker are all interested in housing prices, but they’re sufficiently subtle that no one person or industry has a complete understanding.
For our third overall project and first group project we were assigned Kaggle’s Advanced Regression Techniques Competition. The goal, for the project and the original competition, was to predict housing prices in Ames, Iowa. While this particular competition is no longer active, the premise proved to be a veritable playground for testing our knowledge of data cleaning, exploratory data analysis, statistics, and, most importantly, machine learning. What follows is the combined work of Quentin Picard, Paul Ton, Kathryn Bryant, and Hans Lau.
Data
As stated on the Kaggle competition description page, the data for this project was compiled by Dean De Cock for educational purposes, and it includes 79 predictor variables (house attributes) and one target variable (price). As a result of the educational nature of the competition, the data was pre-split into a training set and a test set; the two datasets were given in the forms of csv files, each around 450 KB in size. Each of the predictor variables could fall under one of the following:
The specifics of the 79 predictor variables are omitted for brevity but can be found in the data_description.txt file on the competition website. The target variable was Sale Price, given in US dollars.
Project Overview (Process)
The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.
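As a concrete illustration of that metric (a hedged sketch with made-up prices, not our actual evaluation code), RMSE of log Sale Price can be computed like this:

```python
import math

def rmse_log(y_true, y_pred):
    """Root mean squared error of the log of prices (the Kaggle metric)."""
    n = len(y_true)
    sq_err = sum((math.log(t) - math.log(p)) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq_err / n)

# Toy prices only: predictions within a few percent of truth give a small score.
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 330000]
print(round(rmse_log(actual, predicted), 4))  # → 0.0519
```

Working on the log scale means the metric penalizes relative (percentage) errors, so a $10k miss on a cheap house costs more than the same miss on an expensive one.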
Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing with curriculum than by anything else.
Throughout the project our group wanted to mimic a real-world machine learning project as much as possible, which meant that although we were given both a training set and a test set, we opted to treat the given test set as if it were “future” data. As a result, we further split the Kaggle training data into a private training set and a private testing set, with an 80/20 split, respectively. This allowed us to evaluate models in two ways before predicting on the Kaggle test data: with RMSE of predictions made on the private test set and with cross validation RMSE of the entire training set.
Given the above choices, the process for training and evaluating each individual model was broken down as follows:
Computing RMSE of predictions made on the private test set served as a group sanity check: if anything was amiss with a model, we found out at this point and could correct it before proceeding. The decision in step 4 to fit a new model to the Kaggle training set using the private-train hyperparameters found in step 1 was one with which we never felt completely at ease; we were trying to minimize overfitting to the Kaggle training set, but we also recognized that we weren't using all the information we could by not re-tuning on the full training set. One benefit of this choice was that we never saw any major discrepancies between our own cross validation scores in step 5 and the Kaggle scores from step 7. With more time, we would have further investigated whether keeping the private-train parameters or re-tuning for the final models was more beneficial.
The last noteworthy choice we made in our overall process was to feed differently-cleaned datasets to our linear-based models and our tree-based models. In linear-based models (linear, ridge, LASSO, elastic net, spline), prediction values are continuous and sensitive to outliers, so we opted to sacrifice information about "rare" houses in favor of better predictions on "common" houses. We also wanted to minimize the likelihood of rare levels showing up only in our test data and/or causing columns of all 0's in dummified data. We executed this tradeoff by releveling any nominal categorical variables that contained extremely rare classes, where "rare" was defined as "occurring in less than 1% of the observations."
For example, the Heating variable had six levels but four of them combined accounted for only about 1% of the observations; so, we combined these four levels into a new ‘other’ level so that releveled Heating had only three levels, all accounting for 1% or more of the observations. Depending on the variable, we either created an ‘other’ level as in the example or we grouped rare levels into existing levels according to level similarity in the variable documentation.
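The releveling logic can be sketched as follows (a pure-Python illustration with hypothetical level counts; our actual cleaning was done in R):

```python
from collections import Counter

def relevel_rare(values, threshold=0.01, other="Other"):
    """Collapse levels occurring in less than `threshold` of observations into one level."""
    counts = Counter(values)
    n = len(values)
    rare = {lvl for lvl, c in counts.items() if c / n < threshold}
    return [other if v in rare else v for v in values]

# Hypothetical Heating-like column: "Wall" and "Floor" each make up < 1% of 400 rows.
heating = ["GasA"] * 300 + ["GasW"] * 60 + ["Grav"] * 36 + ["Wall"] * 3 + ["Floor"]
print(sorted(set(relevel_rare(heating))))  # → ['GasA', 'GasW', 'Grav', 'Other']
```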
We opted not to relevel data fed into our tree-based models because tree predictions are more robust to outliers and rare classes; trees can separate rare observations from others through splits, which prevents common observation predictions from being distorted by rare observations in fitting.
Data Cleaning and EDA
We were fortunate with the Kaggle data in that it came to us relatively clean. The only basic cleaning tasks were to correct typos in levels of categorical variables, specify numeric or categorical variables in R, and rename variables beginning with numbers to satisfy R’s variable name requirements. There were a number of “quality” and “condition” variables that had levels of Poor, Fair, Typical/Average, Good, and Excellent which we label encoded as integers 1-5 to preserve their inherent ordinality. For nominal categorical variables, we used one-hot encoding from the ‘vtreat’ package in R. (Most machine learning algorithms in Python require all variables to have numeric values, and although R’s machine learning algorithms can handle nominal categorical variables in non-numeric/string form, R’s computations effectively use one-hot encoding under the hood in these cases.)
After taking care of these basic cleaning issues, we needed to address missingness in our data. Below is a plot that shows the variables containing missing values and the degree to which missingness occurs in each (shown in yellow):
For all variables except Garage Year Built and Lot Frontage, we performed basic mode imputation. Mode imputation was chosen for simplicity and because it could be done with both categorical and numerical data. Of course, mode imputation has the negative side effect of artificially decreasing variance within the affected variable and therefore would not be appropriate for variables with higher degrees of missingness. It was for this reason, in fact, that we approached missingness differently for Garage Year Built and Lot Frontage.
For missing values in Garage Year Built, we imputed the Year Built for the house. We justified this because most garages are built at the same time as the house, and houses without garages get no penalty or benefit by having the Garage Year Built equal to the Year Built for the house.
Lot Frontage, the variable with the greatest degree of missingness, required some research and exploration to arrive at an imputation strategy. Lot Frontage was defined as the linear feet of street connected to the property. Given that most properties were either regularly or only slightly irregularly shaped according to Lot Shape, we deduced that most properties would have Lot Frontage values correlated with their Lot Area values. However, since length is measured in units and area is measured in square units, we found it most appropriate to relate log(Lot Frontage) with log(Lot Area), so as to get a linear relationship between the two. See plot below.
Thus, we imputed missing values for Lot Frontage by fitting a linear model of log(Lot Frontage) regressed onto log(Lot Area), predicting on the missing values for Lot Frontage using Lot Area, and then exponentiating (inverse log-ing) the result.
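In outline, the imputation looks like this (a self-contained Python sketch on synthetic lot data; the actual fit was done in R):

```python
import math

def fit_loglog(frontage, area):
    """OLS fit of log(frontage) on log(area); returns (intercept, slope)."""
    xs = [math.log(a) for a in area]
    ys = [math.log(f) for f in frontage]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope

def impute_frontage(area, intercept, slope):
    """Predict on the log scale, then exponentiate back to linear feet."""
    return math.exp(intercept + slope * math.log(area))

# Synthetic lots where frontage is exactly sqrt(area), so the fit recovers slope 0.5
# and a 10,000 sq ft lot gets an imputed frontage of 100 feet.
areas = [4000.0, 9000.0, 16000.0, 12000.0]
frontages = [a ** 0.5 for a in areas]
b0, b1 = fit_loglog(frontages, areas)
print(round(impute_frontage(10000.0, b0, b1), 1))  # → 100.0
```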
The next step in EDA after addressing missingness was to look at outliers. Looking at boxplots of both Sale Price and log(Sale Price), we saw quite a few outliers, where 'outliers' were mathematically defined as observations lying more than 1.5*IQR (interquartile range) above the third quartile or below the first quartile.
Although removing the outliers may have improved our predictions for average-priced homes, we were hesitant to do so due to the relatively small size of our sample (~1600 observations in Kaggle training set). We felt that removing any outliers without further justification than them simply being outliers would likely jeopardize our predictions for houses in the extremes.
Since outliers would have the most impact on the fit of linear-based models, we further investigated outliers by training a basic multiple linear regression model on the Kaggle training set with all observations included; we then looked at the resulting influence and studentized residuals plots:
From these, we saw that there were only two observations that could justifiably be removed: observation 1299 and observation 251. These were both beyond or on the lines representing Cook’s Distance, meaning that as individual observations they had a significant impact on the regression formula; as such, we removed these two observations from consideration.
The last bit of preprocessing we did was dealing with multicollinearity. This, as for outliers, was cleaning done mostly for the benefit of linear-based models; in fact, it was done only for vanilla multiple linear regression since regularization in ridge, LASSO, and elastic net models deals with collinearity by construction. To eliminate collinearity issues, we used the findLinearCombos() function from the ‘caret’ package in R on our dummified data. This function identified linear combinations between predictor variables and allowed us to easily drop linearly dependent variables.
Feature Engineering
For feature engineering we took a two-pronged approach: we used anecdotal data/personal knowledge and we did research.
Of the six features that we added, only “landscape-ability” resulted from research; we either already had the important industry variables in the data (total above-ground square footage or neighborhood, for example) or we did not have the information we needed to create the desired variables. Additional features we would have liked to have added include geolocation data for proximity to schools and shopping centers, quality of nearby schools, and specific rooms/house features remodeled (if applicable).
One step that we glossed over a bit was extensive adding and removing of features. We noticed with just a few trials that removing existing features seemed to negatively impact the performance of our models and that the models improved when we added features. Given that the regularized linear models would give preference to more important features via larger coefficients and tree-based models would give preference to more important features by splitting preferences, we opted not to spend too much time manually adding and removing features. Essentially, we allowed our models to select the most predictive features themselves.
Modeling
For our individual models, we trained the following:
Overall our models consistently had RMSE values between .115 and .145. As stated earlier, our cross validation scores were usually very close to our scores on Kaggle. To our surprise, our linear-based models did very well compared to our tree-based models. Even the much-hyped XGBoost model was at best on par with our linear-based spline and elastic net models, and our random forest models tended to be much worse. An abbreviated table of our results is shown below:
| | Elastic Net | Spline | Random Forest | Gradient Boost | XGBoost | Ensemble |
| --- | --- | --- | --- | --- | --- | --- |
| Cross Validation | 0.12157 | 0.11181 | 0.14171 | 0.12491 | 0.12282 | 0.11227 |
| Held-out Test | 0.11762 | 0.11398 | 0.13834 | 0.11403 | 0.11485 | n/a |
| Kaggle | 0.12107 | 0.11796 | n/a | n/a | n/a | 0.11710 |
As we endeavored to find a final, “best” model via ensembling and stacking, we followed the advice of our instructor (and successful Kaggler) Zeyu by attempting to combine models with different strengths and weaknesses.
We took the predictions from each of our best base models (Spline, Gradient Boost, XGBoost) as the features to use for the next level. Our best result came from a weighted average of the three, with weights determined by a grid search. The blend of 76% Spline, 14.5% Gradient Boost, and 9.5% XGBoost gave us a 0.11227 RMSE on our training set and 0.11710 on Kaggle.
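A minimal version of that weighted-average grid search might look like this (illustrative Python with toy predictions; the step size and model names are stand-ins, not our actual pipeline):

```python
import itertools
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def best_blend(preds, y_true, step=0.05):
    """Grid-search convex weights over three base models' predictions, minimizing RMSE."""
    grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best = (float("inf"), None)
    for w1, w2 in itertools.product(grid, repeat=2):
        w3 = round(1.0 - w1 - w2, 2)
        if w3 < 0:  # keep weights non-negative and summing to 1
            continue
        blend = [w1 * a + w2 * b + w3 * c for a, b, c in zip(*preds)]
        best = min(best, (rmse(y_true, blend), (w1, w2, w3)))
    return best

# Toy example where the first "model" is exactly right: it should get all the weight.
y = [1.0, 2.0, 3.0]
spline, gboost, xgb = [1.0, 2.0, 3.0], [1.1, 2.2, 2.9], [0.9, 1.8, 3.2]
err, weights = best_blend((spline, gboost, xgb), y)
print(weights)  # → (1.0, 0.0, 0.0)
```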
We also experimented with using a second level meta-model to do the stacking, but with a linear meta-model, the coefficients were highly unstable because the predictions were so closely related to each other, and with a gradient boost meta-model, we were unable to beat our best base model alone.
Conclusions
As mentioned above, we were surprised by the strength of our various linear models in comparison to our tree models. We suspect this had a lot to do with the data itself, in that Sale Price (or rather, log(Sale Price)) likely has a relatively linear relationship with the predictor variables. This highlights one of the most important takeaways from this project: that linear models have a real place in machine learning.
In situations like this, where performance between a simpler model and a more complex model is similar, the better interpretability and ease of training of the simpler model may also prove to be deciding factors in model choice. In most cases, stacking multiple base-layer models into a second layer is impractical, despite performing slightly better.
Rachel Thomas's article came onto my twitter feed. It caught my attention because it has a click-baity title "How (and why) to create a good validation set."
Or, I thought it was click bait but she is really serious about this. (For those not familiar with the literature, we don't use all historical data to build machine learning models. The historical data are split, typically at random, into training and validation sets. The validation set is supposed to simulate new data the algorithms haven't seen before, a sort of honest check of the model.) She makes some alarmist claims here:
Even though I don't like any of her advice, I can't disagree with her diagnosis:
An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again.
***
One of the examples given is a response function that has a time trend.
If this model does not detect the trend, indeed the prediction will have poor accuracy on real-world data. She is making a claim that a validation set based on a pre-post time split is better than a random selection.
Since this is a simple linear trend, either way of making the validation set will capture the trend. So what makes the model fail in production is not the presence of this trend but a shift of the trajectory after the model is deployed. But the choice of validation set won't help prevent the problem.
The downside of the pre-post split is the presence of many time-varying predictors. A naive example: if an on-off switch just happens to be pressed at that time split point, then all your training examples have the "on" condition while your validation examples have "off".
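The two splitting strategies are easy to contrast in a few lines of R (the data frame here is simulated purely for illustration):

```r
# Random vs pre/post time split of a toy data set with a date column.
set.seed(42)
df <- data.frame(date = as.Date("2017-01-01") + 0:99, y = rnorm(100))

# Random split: validation rows are scattered across the whole period,
# so any smooth time trend appears in both sets.
val_random <- sample(nrow(df), 20)

# Pre/post split: the last 20 days are held out, so the validation set
# sits entirely after the training period -- and entirely after any
# "switch press" that happens near the cutoff.
val_time <- which(df$date > max(df$date) - 20)

length(intersect(val_time, 1:80))  # 0: no overlap with the "pre" period
```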
Manufacturing the validation set to reflect some unknown future trends creates a conceptual difficulty. The training set is now materially different from the validation set, so why would we expect the trained model to perform well on the validation set? And how much degradation in validation-set performance is considered the right price to pay for potentially better in-market performance? That question boils down to how far you want to generalize from the data, and it lies at the core of the statistical view of the modeling problem.
***
The subtext of the article is that if the model doesn't work, fix the data. I tend to want to fix the model. If it doesn't work in production because the nature of the time trend has shifted, then adjust the model to include the new time trend.
Diagnosing the difference between production data and historical data is part of good model hygiene. It's very hard to predict unexpected shifts in the data and even if you could, you wouldn't have any training data to support such shifts.
The "data fix" is not the solution. Refining one's model is.
PS. While I don't agree with designing your validation set, I do advise selecting your historical dataset carefully, and think about which units to include or exclude from the modeling process, which Rachel discusses at the end of her post.
Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among text documents. Researchers have published many articles in the field of topic modeling and applied it in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling; Latent Dirichlet Allocation (LDA) is one of the most popular in this field. Researchers have proposed various models based on LDA for topic modeling. This paper can be very useful and valuable for introducing LDA-based approaches in topic modeling. In this paper, we investigated scholarly articles (from 2003 to 2016) highly related to topic modeling based on LDA to discover the research development, current trends and intellectual structure of topic modeling. We also summarize challenges and introduce well-known tools and datasets for topic modeling based on LDA.
Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey
Harry Crane points us to this new paper, “Why ‘Redefining Statistical Significance’ Will Not Improve Reproducibility and Could Make the Replication Crisis Worse,” and writes:
Quick summary: Benjamin et al. claim that FPR would improve by factors greater than 2 and replication rates would double under their plan. That analysis ignores the existence and impact of “P-hacking” on reproducibility. My analysis accounts for P-hacking and shows that FPR and reproducibility would improve by much smaller margins and quite possibly could decline depending on some other factors.
I am not putting forward a specific counterproposal here. I am instead examining the argument in favor of redefining statistical significance in the original Benjamin et al. paper.
From the concluding section of Crane’s paper:
The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.
Defenders of the proposal will inevitably criticize this conclusion as “perpetuating the status quo,” as one of them already has [12]. Such a rebuttal is in keeping with the spirit of the original RSS [redefining statistical significance] proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors. The RSS proposal is just the latest in a long line of recommendations aimed at resolving the crisis while perpetuating the cult of statistical significance [22] and propping up the flailing and failing scientific establishment under which the crisis has thrived.
I like Crane’s style. I can’t say that I tried to follow the details, because his paper is all about false positive rates, and I think that whole false positive thing is inappropriate in most science and engineering contexts that I’ve seen, as I’ve written many times (see, for example, here and here).
I think the original sin of all these methods is the attempt to get certainty or near-certainty from noisy data. These thresholds are bad news—and, as Hal Stern and I wrote a while ago, it’s not just because of the 0.049 or 0.051 thing. Remember this: a z-score of 3 gives you a (two-sided) p-value of 0.003, and a z-score of 1 gives you a p-value of 0.32. One of these is super significant—“p less than 0.005”! Wow!—and the other is the ultimate statistical nothingburger. But if you have two different studies, and one gives p=0.003 and the other gives p=0.32, the difference between them is not at all remarkable. You could easily get both these results from the exact same underlying result, based on nothing but sampling variation, or measurement error, or whatever.
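The arithmetic is easy to check in a couple of lines of R (this is just an illustration of the point, not part of the quoted paper):

```r
# Two-sided normal-theory p-values for the z-scores in the text
two_sided_p <- function(z) 2 * pnorm(-abs(z))

round(two_sided_p(3), 3)  # 0.003
round(two_sided_p(1), 2)  # 0.32

# The difference between the two z-scores is 3 - 1 = 2, with standard
# error sqrt(2) if the estimates are independent, so z = sqrt(2):
round(two_sided_p((3 - 1) / sqrt(2)), 2)  # 0.16 -- not remarkable at all
```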
So, scientists and statisticians: All that thresholding you’re doing? It’s not doing what you think it’s doing. It’s just a magnification of noise.
So I’m not really inclined to follow the details of Crane’s argument regarding false positive rates etc., but I’m supportive of his general attitude that thresholds are a joke.
Post-publication review, not “ever expanding regulation”
Crane’s article also includes this bit:
While I am sympathetic to the sentiment prompting the various responses to RSS [1, 11, 15, 20], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic ‘calls to action’. Those calling for bigger and better scientific regulations ought not forget that another regulation—the 5% significance level—lies at the heart of the crisis.
As a coauthor of one of the cited papers ([15], to be precise), let me clarify that we are not “calling for bigger and better scientific regulations,” nor are we advocating for pre-registered studies (although we do believe such studies have their place), nor are we proposing to “ban” anything, nor are we offering any “generic calls to action.” Of all the things on that list, the only thing we’re suggesting is “better study design”—and our suggestions for better study design are in no way a call for “ever expanding scientific regulation.”
The post No to inferential thresholds appeared first on Statistical Modeling, Causal Inference, and Social Science.
Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (offering 50%–90% discounts) covers hundreds of their courses, including many on R programming, data science, and machine learning.
Click here to browse ALL (R and non-R) courses
Advanced R courses:
Introductory R courses:
Top data science courses (not in R):
A maintenance release 0.3.3.3.1 of RcppEigen is now on CRAN (and will get to Debian soon). It brings Eigen 3.3.* to R.
The impetus was a request from CRAN to change the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up. Otherwise, Haiku-OS is now supported and a minor Travis tweak was made.
The complete NEWS file entry follows.
Changes in RcppEigen version 0.3.3.3.1 (2017-11-19)
Compilation under Haiku-OS is now supported (Yu Gong in #45).
The Rcpp.plugin.maker helper function is called via :: as it is in fact exported (yet we had old code using :::).
A spurious argument was removed from an example call.
Travis CI now uses https to fetch the test runner script.
Courtesy of CRANberries, there is also a diffstat report for the most recent release.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
A maintenance release RcppClassic 0.9.9 is now at CRAN. This package provides a maintained version of the otherwise deprecated first Rcpp API; no new projects should use it.
Per a request from CRAN, we changed the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up.
Courtesy of CRANberries, there are changes relative to the previous release.
Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.
“Before jumping on the Big Data bandwagon, I think it is important to ask the question of whether the problem you have requires much data.” Pradyumna S. Upadrashta ( February 13, 2015 )
Sometimes, when I write a really long blog post, I forget what the point was at the end. I suppose I could just update the previous post…but that feels wrong for some reason.
I meant to make one final point in my last post about how better data analyses help you reason about the data. In particular, I meant to tie together the discussion about garbage collection to the section on data analysis. With respect to garbage collection in programming, I wrote
The promise of garbage collection is clear: the programmer doesn’t have to think about memory. Lattner’s criticism is that when building large-scale systems the programmer always has to think about memory, and as such, garbage collection makes it harder to do so. The goal of [Automatic Reference Counting] is to make it easier for other people to understand your code and to allow programmers to reason clearly about memory.
To pick this apart just a little bit, it’s tempting to think that “not having to think about memory management” is a worthy goal, but Lattner’s point is that it’s not. At least not right now. Better to make tools that will help people to think more effectively about this important topic (a “bicycle for the mind”).
I think the analogy for data analysis is that I think it’s tempting to think of the goal of methods development as removing the need to think about data and assumptions. The “ultimate” method is one where you don’t have to worry about distributions or nonlinearities or interactions or anything like that. But I don’t see that as the goal. Good methods, and good analyses, help us think about all those things much more efficiently. So what I might say is that
When doing large-scale data analyses, the data analyst always has to think about the data and assumptions, and as such, some approaches can actually make that harder to do than others. The goal of the good data analysis is to make it easier to reason about how the data are related to the result, relative to the assumptions you make about the data and the models.
Following my post on the M4 competition yesterday, Spyros Makridakis sent me these comments for posting here.
I would like to thank Rob, my friend and co-author, for his insightful remarks concerning the upcoming M4 competition. As Rob says, the two of us have talked a great deal about competitions and I certainly agree with him about the “ideal” forecasting competition. In this reply, I will explain why I have deviated from the “ideal”, mostly for practical reasons and to ensure higher participation.
After seeing a plot of best places to live in Spain and the USA based on the weather, I had to chime in and do the same thing for the Netherlands. The idea is simple: determine where you want to live based on your temperature preferences.
First the end result:
How to read this plot?
In this xkcd comic we see that the top left of the graph represents “if you hate cold and hate heat”. If you go down from the top left to the bottom left, the winters get colder, ending in “if you love cold and hate heat”. Going to the right, the heat (and humidity) goes up, ending in “if you love cold and love heat”. Going up to the top right: “if you hate cold and love heat”.
This post explains how to make the plot, to see where I got the data and what procedures I took look at https://github.com/RMHogervorst/xkcd_weather_cities_nl.
According to this post by Maëlle Salmon, inspired by xkcd, we can determine our ideal city by looking at winter temperature and humidex (a combination of humidity and summer heat).
I’ve seen major cities in the USA (post by Maëlle) and where to live in Spain by Claudia Guirao. There is even one on France, in French of course.
So I had to make one for the Netherlands too. There is just a small detail: the Netherlands is tiny. The United States is approximately 237 times larger than the Netherlands. From The Hague to the German border directly east is 164 km (101 miles) and from Leeuwarden to Maastricht in the south is 262 km (162 miles). Essentially my home country has a moderate sea climate with warm summers and relatively mild winters.
I expect there to be no real big differences between the cities. I use the average temperature over December, January and February for winter temperature and calculate the humidex using the comf package. The humidex is a combination of humidity and temperature.
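The humidex itself is straightforward to compute; here is a minimal sketch using the Magnus approximation for vapour pressure, implemented directly so it does not depend on the comf package (comf's calcHumx may give slightly different values):

```r
# Humidex from air temperature (degrees C) and relative humidity (%).
# Vapour pressure via the Magnus approximation; this is a sketch and
# may differ slightly from comf::calcHumx().
humidex <- function(temp_c, rel_hum) {
  e <- (rel_hum / 100) * 6.112 * exp(17.67 * temp_c / (temp_c + 243.5))
  temp_c + 5 / 9 * (e - 10)
}

humidex(30, 70)  # around 41: "great discomfort" territory
```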
For cities I went the extremely lazy way and took all of the available weather stations provided by the Dutch weather service (KNMI, the Royal Netherlands Meteorological Institute). There are 46 stations in the Netherlands. These are not always in the cities but sort of broadly cover the entire country.
The xkcd package has wonderful plotting abilities and a cool font that you have to install. I copied and modified the code from the post of Maëlle, because it is really good!
If you want to see how I did the data cleaning go to the github page.
First we plot all of the stations in the Netherlands; most of these places will probably not be familiar to non-Dutch people.
library("xkcd")
library("ggplot2")
library("extrafont")
library("ggrepel")
xrange <- range(result$humidex_avg)
yrange <- range(result$wintertemp)
set.seed(3456)
ggplot(result,
aes(humidex_avg, wintertemp)) +
geom_point() +
geom_text_repel(aes(label = NAME), family = "xkcd",
max.iter = 50000, size = 3)+
ggtitle("Where to live in The Netherlands \nbased on your temperature preferences",
subtitle = "Data from KNMI, 2016-2017") +
xlab("Summer heat and humidity via Humidex")+
ylab("Winter temperature in Celsius degrees") +
xkcdaxis(xrange = xrange,
yrange = yrange)+
theme_xkcd() +
theme(text = element_text(size = 13, family = "xkcd"))
But what does that mean in the grand scheme of things? As you might notice, the range is very small. It would be interesting to add some US cities and perhaps some Spanish cities to compare against. For fun I also added the Dutch Caribbean islands.
library("tibble") # for tribble()
library("comf") # for calcHumx()
xrange2 <- range(c(18,40)) # modified these by hand to increase the margins
yrange2 <- range(c(-5,40))
USA <- tribble(
~NAME, ~humidex_avg, ~wintertemp,
"DETROIT, MI", 27, 0,
"NASHVILLE, TN", 33, 9,
"FORT MEYERS, FL",37, 20
)
SPAIN <- tribble(
~NAME, ~humidex_avg, ~wintertemp,
"MADRID, SPAIN", 27, 8,
"TENERIFE, SPAIN", 24, 13,
"BARCELONA, SPAIN",32, 10
)
D_CARI <- tribble(
~NAME, ~humidex_avg, ~wintertemp,
# humidex via calcHumx(temperature, humidity); winter temperature in degrees C
"Bonaire, Caribbean Netherlands", calcHumx(29, 76), 27,
"Sint Eustatius, Caribbean Netherlands", calcHumx(28, 77), 28,
"Saba, Caribbean Netherlands", calcHumx(30, 76), 30
)
set.seed(3456)
ggplot(result,
aes(humidex_avg, wintertemp)) +
geom_point(alpha = .7) +
geom_text_repel(aes(label = NAME),
family = "xkcd",
max.iter = 50000, size = 2)+
geom_text_repel(data = USA, aes(humidex_avg, wintertemp, label = NAME), family = "xkcd",
max.iter = 50000, size = 2, color = "blue")+
geom_point(data = USA, aes(humidex_avg, wintertemp), color = "blue")+
geom_text_repel(data = SPAIN, aes(humidex_avg, wintertemp, label = NAME), family = "xkcd",
max.iter = 50000, size = 2, color = "red")+
geom_point(data = SPAIN, aes(humidex_avg, wintertemp),color = "red")+
geom_text_repel(data = D_CARI, aes(humidex_avg, wintertemp, label = NAME), family = "xkcd",
max.iter = 50000, size = 2, color = "orange")+
geom_point(data = D_CARI, aes(humidex_avg, wintertemp), color = "orange")+
labs(title = "Where to live in The Netherlands based on \nyour temperature preferences \nCompared with some places in Spain, Caribbean NL and USA",
subtitle = "Data from KNMI, 2016-2017, added Spain and US cities",
x = "Summer heat and humidity via Humidex",
y = "Winter temperature in Celsius degrees",
caption = "includes Caribbean Netherlands"
) +
xkcdaxis(xrange = xrange2,
yrange = yrange2)+
theme_xkcd() +
theme(text = element_text(size = 16, family = "xkcd"))
Finally we can compare the winter temperature and summer humidex in the Netherlands by placing the points on a map using the coordinates of the measuring stations.
library("dplyr") # for rename() and the pipe
NL <- map_data(map = "world", region = "Netherlands")
result %>%
rename(LON = `LON(east)`, LAT = `LAT(north)`) %>%
ggplot( aes(LON, LAT))+
geom_point(aes(color = wintertemp))+
geom_text_repel(aes(label = NAME),
family = "xkcd", size = 3,
max.iter = 50000)+
geom_polygon(data = NL, aes(x=long, y = lat, group = group), fill = NA, color = "black") +
coord_map()+
labs(title = "Wintertemperature in NL",
subtitle = "data from KNMI (2016, 2017)",
x = "", y = "")+
theme_xkcd()+
theme(text = element_text(size = 16, family = "xkcd"))
result %>%
rename(LON = `LON(east)`, LAT = `LAT(north)`) %>%
ggplot( aes(LON, LAT))+
geom_point(aes(color = humidex_avg))+
geom_text_repel(aes(label = NAME),
family = "xkcd", size = 3,
max.iter = 50000)+
geom_polygon(data = NL, aes(x=long, y = lat, group = group), fill = NA, color = "black") +
coord_map()+
labs(title = "Humidex in NL",
subtitle = "data from KNMI (2016, 2017)",
x = "", y = "")+
theme_xkcd()+
theme(text = element_text(size = 12, family = "xkcd"))+
scale_color_continuous(low = "white", high = "red")
Now show us what your country looks like!
Where to live in the Netherlands based on temperature XKCD style was originally published at Clean Code on November 20, 2017.
Introduction to keyholder package. Tools for keeping track of information about rows.
During development of my other R package (ruler), I encountered the following problem: how to track the rows of a data frame after application of some user defined function? It is assumed that this function takes a data frame as input, subsets it (with possible creation of new columns, but not rows) and returns the result. A typical example using dplyr and magrittr’s pipe:
suppressMessages(library(dplyr))
# Custom `mtcars` for more clear explanation
mtcars_tbl <- mtcars %>%
select(mpg, vs, am) %>%
as_tibble()
# A handy way of creating function with one argument
modify <- . %>%
mutate(vs_am = vs * am) %>%
filter(vs_am == 1) %>%
arrange(desc(mpg))
# The question is: which rows of `mtcars_tbl` are returned?
mtcars_tbl %>% modify()
## # A tibble: 7 x 4
## mpg vs am vs_am
## <dbl> <dbl> <dbl> <dbl>
## 1 33.9 1 1 1
## 2 32.4 1 1 1
## 3 30.4 1 1 1
## 4 30.4 1 1 1
## 5 27.3 1 1 1
## # ... with 2 more rows
To solve this problem I ended up creating the package keyholder, which became my first CRAN release. You can install its stable version with:
install.packages("keyholder")
This post describes the basics of the design and main use cases of keyholder. For more information see its vignette Introduction to keyholder.
suppressMessages(library(keyholder))
The main idea of the package is to create an S3 class keyed_df, which indicates that the original data frame (or tibble) has an attribute keys. A “key” is any vector (even a list) with the same length as the number of rows in the data frame. Keys are stored as a tibble in the attribute keys, and so one data frame can have multiple keys. In other words, keys can be considered as columns of the data frame which are hidden from subsetting functions but are updated according to them.
To achieve that, those functions should be generic and have a method for keyed_df implemented. Look here for the list of functions supported by keyholder. As of version 0.1.1 they are all one- and two-table dplyr verbs for local data frames and the [ function.
There are two distinct ways of creating keys: by assigning and by using existing columns:
# By assigning
mtcars_tbl_1 <- mtcars_tbl
keys(mtcars_tbl_1) <- tibble(rev_id = nrow(mtcars_tbl_1):1)
mtcars_tbl_1
## # A keyed object. Keys: rev_id
## # A tibble: 32 x 3
## mpg vs am
## * <dbl> <dbl> <dbl>
## 1 21.0 0 1
## 2 21.0 0 1
## 3 22.8 1 1
## 4 21.4 1 0
## 5 18.7 0 0
## # ... with 27 more rows
# By using existing columns
mtcars_keyed <- mtcars_tbl %>% key_by(vs)
mtcars_keyed
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
## mpg vs am
## * <dbl> <dbl> <dbl>
## 1 21.0 0 1
## 2 21.0 0 1
## 3 22.8 1 1
## 4 21.4 1 0
## 5 18.7 0 0
## # ... with 27 more rows
To get keys use keys() (which always returns a tibble) or pull_key() (similar to dplyr::pull() but for keys):
mtcars_keyed %>% keys()
## # A tibble: 32 x 1
## vs
## * <dbl>
## 1 0
## 2 0
## 3 1
## 4 1
## 5 0
## # ... with 27 more rows
mtcars_keyed %>% pull_key(vs)
## [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
To restore keys (create respective columns in the data frame) use restore_keys():
# Column `vs` didn't change in output because it was restored from keys
mtcars_keyed %>%
mutate(vs = 2) %>%
restore_keys(vs)
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
## mpg vs am
## <dbl> <dbl> <dbl>
## 1 21.0 0 1
## 2 21.0 0 1
## 3 22.8 1 1
## 4 21.4 1 0
## 5 18.7 0 0
## # ... with 27 more rows
To stop having keys use unkey():
mtcars_keyed %>% unkey()
## # A tibble: 32 x 3
## mpg vs am
## * <dbl> <dbl> <dbl>
## 1 21.0 0 1
## 2 21.0 0 1
## 3 22.8 1 1
## 4 21.4 1 0
## 5 18.7 0 0
## # ... with 27 more rows
To track rows after application of a user defined function one can create a key with row numbers as values. keyholder has a wrapper use_id() for this:
# `use_id()` removes all existing keys and creates key ".id"
mtcars_track <- mtcars_tbl %>%
use_id()
mtcars_track %>% pull_key(.id)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32
Now rows are tracked:
mtcars_track %>%
modify() %>%
pull_key(.id)
## [1] 20 18 19 28 26 3 32
# Make sure of correct result
mtcars_tbl %>%
mutate(id = seq_len(n())) %>%
modify() %>%
pull(id)
## [1] 20 18 19 28 26 3 32
The reason for using a “key id” instead of a “column id” is that modify() can hypothetically perform differently depending on the columns of its input. For example, it can use dplyr’s scoped variants of verbs or simply check its input’s column structure.
During development of tools for data analysis one may need to ensure that certain columns don’t change after application of some function. This can be achieved by keying those columns and restoring them later (note that this can change the order of columns):
weird_modify <- . %>% transmute(new_col = vs + 2 * am)
# Suppose there is a need for all columns to stay untouched in the output
mtcars_tbl %>%
key_by(everything()) %>%
weird_modify() %>%
# This can be replaced by its scoped variant: restore_keys_all()
restore_keys(everything()) %>%
unkey()
## # A tibble: 32 x 4
## new_col mpg vs am
## <dbl> <dbl> <dbl> <dbl>
## 1 2 21.0 0 1
## 2 2 21.0 0 1
## 3 3 22.8 1 1
## 4 1 21.4 1 0
## 5 0 18.7 0 0
## # ... with 27 more rows
In actual data analysis the following situation can happen: one should modify all but a handful of columns with dplyr::mutate_if().
is_integerish <- function(x) {all(x == as.integer(x))}
if_modify <- . %>% mutate_if(is_integerish, ~ . * 10)
mtcars_tbl %>% if_modify()
## # A tibble: 32 x 3
## mpg vs am
## <dbl> <dbl> <dbl>
## 1 21.0 0 10
## 2 21.0 0 10
## 3 22.8 10 10
## 4 21.4 10 0
## 5 18.7 0 0
## # ... with 27 more rows
Suppose column vs should appear unchanged in the output. This can be achieved in several ways, which differ slightly but significantly. The first one is to key by vs, apply the function and restore vs from the keys.
mtcars_tbl %>%
key_by(vs) %>%
if_modify() %>%
restore_keys(vs)
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
## mpg vs am
## <dbl> <dbl> <dbl>
## 1 21.0 0 10
## 2 21.0 0 10
## 3 22.8 1 10
## 4 21.4 1 0
## 5 18.7 0 0
## # ... with 27 more rows
The advantage is that it doesn’t change the order of columns. The disadvantage is that it actually applies modification function to column, which can be undesirable in some cases.
The second approach is similar, but after keying by vs one can remove this column from the data frame. This way column vs is moved to the last position.
mtcars_hidden_vs <- mtcars_tbl %>% key_by(vs, .exclude = TRUE)
mtcars_hidden_vs
## # A keyed object. Keys: vs
## # A tibble: 32 x 2
## mpg am
## * <dbl> <dbl>
## 1 21.0 1
## 2 21.0 1
## 3 22.8 1
## 4 21.4 0
## 5 18.7 0
## # ... with 27 more rows
mtcars_hidden_vs %>%
if_modify() %>%
restore_keys(vs)
## # A keyed object. Keys: vs
## # A tibble: 32 x 3
## mpg am vs
## <dbl> <dbl> <dbl>
## 1 21.0 10 0
## 2 21.0 10 0
## 3 22.8 10 1
## 4 21.4 0 1
## 5 18.7 0 0
## # ... with 27 more rows
keyholder offers functionality for keeping track of arbitrary data about rows after application of some user defined function. This is done by creating a special attribute “keys” which is updated after every change in rows (subsetting, ordering, etc.).
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
## [1] LC_CTYPE=ru_UA.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=ru_UA.UTF-8 LC_COLLATE=ru_UA.UTF-8
## [5] LC_MONETARY=ru_UA.UTF-8 LC_MESSAGES=ru_UA.UTF-8
## [7] LC_PAPER=ru_UA.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] keyholder_0.1.1 bindrcpp_0.2 dplyr_0.7.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 bookdown_0.5 assertthat_0.2.0 digest_0.6.12
## [5] rprojroot_1.2 R6_2.2.2 backports_1.1.1 magrittr_1.5
## [9] evaluate_0.10.1 blogdown_0.2 rlang_0.1.4 stringi_1.1.5
## [13] rmarkdown_1.7 tools_3.4.2 stringr_1.2.0 glue_1.2.0
## [17] yaml_2.1.14 compiler_3.4.2 pkgconfig_2.0.1 htmltools_0.3.6
## [21] bindr_0.1 knitr_1.17 tibble_1.3.4
Like most bloggers, I check my analytics stats on a regular basis. What I do not really look at is social shares. For most blog posts traffic follows a clear pattern: two days of increased traffic, followed by a steady decrease to the base traffic volume.
The amplitude varies massively depending on how interesting/viral the post was. However, in the long-term this virality does not drive the majority of the traffic but rather the search ranking of individual posts, as 80% of the base traffic comes through organic search results.
There is plenty of discussion on the value of a share/like with very different take-aways.
I thought it would be great to analyse if social shares do increase my traffic.
In addition, I realized that it is pretty easy to get the number of Facebook and LinkedIn shares for each of my blog posts from r-bloggers. (Please find the script at the end of the post.)
For the analysis, I scraped the social share numbers and merged them with my Google Analytics stats.
In order to compare the traffic between blog posts, I normalise the total traffic per post by the number of days the post has been online.
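The normalisation step looks roughly like this (column names and numbers are illustrative, not my actual analytics export):

```r
# Normalise total pageviews per post by the number of days online,
# so older posts and newer posts become comparable.
posts <- data.frame(
  post        = c("post_a", "post_b"),
  pageviews   = c(3000, 900),
  days_online = c(600, 60)
)
posts$normalizedPageViews <- posts$pageviews / posts$days_online
posts$normalizedPageViews  # 5 daily views for post_a, 15 for post_b
```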
To get a first picture of the data we can plot the normalized traffic over the number of shares (both by blog post).
Next, we can analyse how the variables are correlated. I included the total pageviews as reference.
We see that LinkedIn and Facebook shares are positively correlated with r = .56, and that the correlation between shares and normalized pageviews is of the same size. When looking at total pageviews, we see that they have a much lower correlation with shares, both from Facebook and from LinkedIn.
To determine the “value” of social media in driving traffic volume, we can regress the normalized traffic numbers on the number of social shares.
Dependent variable: normalizedPageViews

| | (1) | (2) | (3) |
|---|---|---|---|
| shares | 0.043^{**} (0.017) | 0.043^{**} (0.017) | 0.026 (0.022) |
| linkedIn1 | 0.138 (0.126) | 0.138 (0.126) | 0.101 (0.129) |
| DaysOnline | | | -0.008 (0.006) |
| Constant | 1.462 (1.114) | 1.462 (1.114) | 5.744 (3.571) |
| Observations | 33 | 33 | 33 |
| R^{2} | 0.341 | 0.341 | 0.375 |
| Adjusted R^{2} | 0.297 | 0.297 | 0.310 |
| Residual Std. Error | 5.042 (df = 30) | 5.042 (df = 30) | 4.994 (df = 29) |
| F Statistic | 7.755^{***} (df = 2; 30) | 7.755^{***} (df = 2; 30) | 5.802^{***} (df = 3; 29) |

Note: ^{*}p<0.1; ^{**}p<0.05; ^{***}p<0.01; standard errors in parentheses.
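The specifications in the table can be sketched as follows (on simulated data with the same variable names; `d` is a stand-in for the merged shares/analytics data, not the real data set):

```r
# Regressions of normalized daily pageviews on share counts,
# mirroring columns (1) and (3) of the table; data are simulated.
set.seed(1)
d <- data.frame(
  shares     = rpois(33, 50),
  linkedIn   = factor(rbinom(33, 1, 0.5)),  # dummy, printed as linkedIn1
  DaysOnline = sample(30:600, 33)
)
d$normalizedPageViews <- 1.5 + 0.04 * d$shares + rnorm(33)

m1 <- lm(normalizedPageViews ~ shares + linkedIn, data = d)
m3 <- lm(normalizedPageViews ~ shares + linkedIn + DaysOnline, data = d)
summary(m1)$coefficients
```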
As in the correlation plot, we see that highly shared posts have more daily visits. If you take the estimates at face value, one can also state that a LinkedIn share is worth three times a Facebook share in terms of additional traffic.
In the last regression I also included a variable (DaysOnline) to capture my learning effect: the longer a post has been online, the lower its daily traffic (e.g. posts published when I started blogging).
While writing this post, I realized that the script can also be used to analyse;
Have fun!
As time goes on, your R scripts are probably getting longer and more complicated, right? Timing parts of your script could save you precious time when re-running code over and over again. Today I’m going to go through the 4 main functions for doing so.
Sys.time()
Sys.time() takes a “snap-shot” of the current time and so it can be used to record start and end times of code.
start_time = Sys.time()
Sys.sleep(0.5)
end_time = Sys.time()
To calculate the difference, we just use a simple subtraction
end_time - start_time
## Time difference of 0.5061 secs
Notice it creates a neat little message for the time difference.
tictoc
You can install the CRAN version of tictoc via
install.packages("tictoc")
whilst the most recent development version is available via the tictoc GitHub page.
library("tictoc")
Like Sys.time(), tictoc also gives us the ability to nest timings within code, but with more options to customise our timing. At its most basic it acts like Sys.time():
tic()
Sys.sleep(0.5)
toc()
## 0.506 sec elapsed
Now for a more contrived example.
# start timer for the entire section, notice we can name sections of code
tic("total time")
# start timer for first subsection
tic("Start time til half way")
Sys.sleep(2)
# end timer for the first subsection, log = TRUE tells toc to give us a message
toc(log = TRUE)
## Start time til half way: 2.037 sec elapsed
Now to start the timer for the second subsection
tic("Half way til end")
Sys.sleep(2)
# end timer for second subsection
toc(log = TRUE)
## Half way til end: 2.012 sec elapsed
# end timer for entire section
toc(log = TRUE)
## total time: 4.067 sec elapsed
We can view the results as a list (format = TRUE returns the list as nicely formatted strings) rather than as raw output:
tic.log(format = TRUE)
## [[1]]
## [1] "Start time til half way: 2.037 sec elapsed"
##
## [[2]]
## [1] "Half way til end: 2.012 sec elapsed"
##
## [[3]]
## [1] "total time: 4.067 sec elapsed"
We can also reset the log via
tic.clearlog()
system.time()
Why oh WHY did R choose to give system.time() a lower case s and Sys.time() an upper case S? Anyway… system.time() can be used to time functions without having to take note of the start and end times ourselves.
system.time(Sys.sleep(0.5))
## user system elapsed
## 0.000 0.000 0.501
system.time(Sys.sleep(1))
## user system elapsed
## 0.000 0.000 1.003
We only want to take notice of the “elapsed” time; for the definitions of the “user” and “system” times, see this thread.
For a repeated timing, we would use the replicate()
function.
system.time(replicate(10, Sys.sleep(0.1)))
## user system elapsed
## 0.000 0.000 1.007
You can install the CRAN
version of microbenchmark via
install.packages("microbenchmark")
Alternatively you can install the latest update via the microbenchmark GitHub page.
library("microbenchmark")
At its most basic, microbenchmark() can be used to time single pieces of code.
# times = 10: repeat the test 10 times
# unit = "s": output in seconds
microbenchmark(Sys.sleep(0.1), times = 10, unit = "s")
## Unit: seconds
## expr min lq mean median uq max neval
## Sys.sleep(0.1) 0.1001 0.1006 0.1005 0.1006 0.1006 0.1006 10
Notice we get a nicely formatted table of summary statistics. We can record our times in anything from seconds to nanoseconds(!!!!). Already this is better than system.time(). Not only that, but we can easily compare sections of code and name them for an easy-to-read output.
sleep = microbenchmark(sleepy = Sys.sleep(0.1),
sleepier = Sys.sleep(0.2),
sleepiest = Sys.sleep(0.3),
times = 10,
unit = "s")
As well as this (there’s more?!), microbenchmark comes with two built-in plotting functions.
microbenchmark:::autoplot.microbenchmark(sleep)
microbenchmark:::boxplot.microbenchmark(sleep)
These provide quick and efficient ways of visualising our timings.
Sys.time() and system.time() have their place, but for most cases we can do better. The tictoc and microbenchmark packages are particularly useful and make it easy to store timings for later use, and the range of options for both packages stretches far beyond those of Sys.time() and system.time(). The built-in plotting functions are also handy.
Thanks for chatting!
PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models
Quantile Markov Decision Process
Zero-Shot Learning via Class-Conditioned Deep Generative Models
BoostJet: Towards Combining Statistical Aggregates with Neural Embeddings for Recommendations
On Communication Complexity of Classification Problems
Using Noisy Extractions to Discover Causal Knowledge
How Generative Adversarial Nets and its variants Work: An Overview of GAN
Learning to Compare: Relation Network for Few-Shot Learning
An Encoder-Decoder Framework Translating Natural Language to Database Queries
A unified view of gradient-based attribution methods for Deep Neural Networks
An Abstractive approach to Question Answering
LDMNet: Low Dimensional Manifold Regularized Neural Networks
A New Method for Performance Analysis in Nonlinear Dimensionality Reduction
• Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition
• Random gradient extrapolation for distributed and stochastic optimization
• Online Allocation with Traffic Spikes: Mixing Adversarial and Stochastic Models
• Fast Predictive Simple Geodesic Regression
• Predicting vehicular travel times by modeling heterogeneous influences between arterial roads
• End-to-end Training for Whole Image Breast Cancer Diagnosis using An All Convolutional Design
• Large-scale Analysis of Opioid Poisoning Related Hospital Visits in New York State
• Detecting Egregious Conversations between Customers and Virtual Agents
• A Parameter Estimation Method Using Linear Response Statistics: Numerical Scheme
• WebRelate: Integrating Web Data with Spreadsheets using Examples
• CMU LiveMedQA at TREC 2017 LiveQA: A Consumer Health Question Answering System
• Rigid Graph Compression: Motif-based rigidity analysis for disordered fiber networks
• Finer Grained Entity Typing with TypeNet
• ORBIT: Ordering Based Information Transfer Across Space and Time for Global Surface Water Monitoring
• Avalanche precursors of failure in hierarchical fuse networks
• Robust and Precise Vehicle Localization based on Multi-sensor Fusion in Diverse City Scenes
• Set complexity of construction of a regular polygon
• Hierarchical Modeling of Seed Variety Yields and Decision Making for Future Planting Plans
• Global convergence rates of augmented Lagrangian methods for constrained convex programming
• K3, L3, LP, RM3, A3, FDE: How to Make Many-Valued Logics Work for You
• Fronthaul-Aware Group Sparse Precoding and Signal Splitting in SWIPT C-RAN
• Understanding the Changing Roles of Scientific Publications via Citation Embeddings
• Bootstrapped synthetic likelihood
• The Mpemba index and anomalous relaxation
• Cograph Editing in $O(3^n n)$ time and $O(2^n)$ space
• Least informative distributions in Maximum q-log-likelihood estimation
• On monotonicity of some functionals with variable exponent under symmetrization
• AOGNets: Deep AND-OR Grammar Networks for Visual Recognition
• Knowledge transfer for surgical activity prediction
• Local eigenvalue statistics of one-dimensional random non-selfadjoint pseudo-differential operators
• Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning
• Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy
• An Optimal and Progressive Approach to Online Search of Top-k Important Communities
• Categorical data analysis using a skewed Weibull regression model
• Pricing Football Players using Neural Networks
• Predictive Independence Testing, Predictive Conditional Independence Testing, and Predictive Graphical Modelling
• A new characterization of the dual polar graphs
• Packing nearly optimal Ramsey R(3,t) graphs
• Solution Uniqueness of Convex Piecewise Affine Functions Based Optimization with Applications to Constrained $\ell_1$ Minimization
• Crowdsourcing Question-Answer Meaning Representations
• On Analyzing Job Hop Behavior and Talent Flow Networks
• Occlusion Aware Unsupervised Learning of Optical Flow
• Linear-Cost Covariance Functions for Gaussian Random Fields
• Using experimental game theory to transit human values to ethical AI
• NISP: Pruning Networks using Neuron Importance Score Propagation
• Consistency of Hill Estimators in a Linear Preferential Attachment Model
• On Channel Reciprocity to Activate Uplink Channel Training for Downlink Data Transmission
• Optimal Load Balancing in Millimeter Wave Cellular Heterogeneous Networks
• Priming Neural Networks
• Learning Deeply Supervised Visual Descriptors for Dense Monocular Reconstruction
• Enhanced Array Aperture using Higher Order Statistics for DoA Estimation
• Budget-Constrained Multi-Armed Bandits with Multiple Plays
• Defense against Universal Adversarial Perturbations
• A Design-Time/Run-Time Application Mapping Methodology for Predictable Execution Time in MPSoCs
• Enhanced Attacks on Defensively Distilled Deep Neural Networks
• When Mobile Blockchain Meets Edge Computing: Challenges and Applications
• Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition
• Learning from Millions of 3D Scans for Large-scale 3D Face Recognition
• HandSeg: A Dataset for Hand Segmentation from Depth Images
• 3D Face Reconstruction from Light Field Images: A Model-free Approach
• Zero-Annotation Object Detection with Web Knowledge Transfer
• HodgeRank with Information Maximization for Crowdsourced Pairwise Ranking Aggregation
• Less-forgetful Learning for Domain Expansion in Deep Neural Networks
• Physical-Layer Schemes for Wireless Coded Caching
• Learning to Find Good Correspondences
• Optimal Selection of Interconnections in Composite Systems for Structural Controllability
• Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
• Gamma-positivity in combinatorics and geometry
• A kind of orthogonal polynomials and related identities II
• On evolutionary selection of blackjack strategies
• Superpixel clustering with deep features for unsupervised road segmentation
• Bayesian uncertainty quantification in linear models for diffusion MRI
• Remedies against the Vocabulary Gap in Information Retrieval
• Hindsight policy gradients
• A Law of Large Numbers in the Supremum Norm for a Multiscale Stochastic Spatial Gene Network
• Parametric Manifold Learning Via Sparse Multidimensional Scaling
• Stability of optimal spherical codes
• A Revisit on Deep Hashings for Large-scale Content Based Image Retrieval
• Global versus Localized Generative Adversarial Nets
• Probabilities of incidence between lines and a plane curve over finite fields
• Sub-committee Approval Voting and Generalised Justified Representation Axioms
• Natural Language Guided Visual Relationship Detection
• Utility maximization via decoupling fields
• From Algorithmic Black Boxes to Adaptive White Boxes: Declarative Decision-Theoretic Ethical Programs as Codes of Ethics
• Permutations sorted by a finite and an infinite stack in series
• Frame Interpolation with Multi-Scale Deep Loss Functions and Generative Adversarial Networks
• On the critical densities of minor-closed classes
• Dynamical characterization of combinatorially rich sets near zero
• Integrated Face Analytics Networks through Cross-Dataset Hybrid Training
• Gaussian Process Decentralized Data Fusion Meets Transfer Learning in Large-Scale Distributed Cooperative Perception
• The signature of robot action success in EEG signals of a human observer: Decoding and visualization using deep convolutional neural networks
• Adjusting for selective non-participation with re-contact data in the FINRISK 2012 survey
• The Perception-Distortion Tradeoff
• Two Birds with One Stone: Iteratively Learn Facial Attributes with GANs
• Extensions of the Hitsuda-Skorokhod integral
• Sequences, Items And Latent Links: Recommendation With Consumed Item Packs
• Improving Consistency and Correctness of Sequence Inpainting using Semantically Guided Generative Adversarial Network
• On first exit times and their means for Brownian bridges
• Robust Unsupervised Domain Adaptation for Neural Networks via Moment Alignment
• SUPRA: Open Source Software Defined Ultrasound Processing for Real-Time Applications
• Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification
• 3D Trajectory Reconstruction of Dynamic Objects Using Planarity Constraints
• Switch chain mixing times through triangle counts
• ConvAMR: Abstract meaning representation parsing
• Learning Compositional Visual Concepts with Mutual Consistency
• Reliable Video Streaming over mmWave with Multi Connectivity and Network Coding
• Zero-Shot Learning via Category-Specific Visual-Semantic Mapping
• Beyond Sparsity: Tree Regularization of Deep Models for Interpretability
• A novel low-rank matrix completion approach to estimate missing entries in Euclidean distance matrices
• Nonlinear dependencies on Brazilian equity network from mutual information minimum spanning trees
• A Low-Rank Rounding Heuristic for Semidefinite Relaxation of Hydro Unit Commitment Problems
• Neurology-as-a-Service for the Developing World
• Addressing Cross-Lingual Word Sense Disambiguation on Low-Density Languages: Application to Persian
• Stationary states of boundary driven exclusion processes with nonreversible boundary dynamics
• Power Diagram Detection with Applications to Information Elicitation
• On the probability of nonexistence in binomial subsets
• Converting P-Values in Adaptive Robust Lower Bounds of Posterior Probabilities to increase the reproducible Scientific ‘Findings’
• A Forward-Backward Approach for Visualizing Information Flow in Deep Networks
• Boolean Extremes and Dagum Distributions
• A Novel Framework for Robustness Analysis of Visual QA Models
• Deceptiveness of internet data for disease surveillance
• Uniform weak convergence of poverty measures with relative poverty lines
Explicit Semantic Analysis (ESA)
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf-idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words. Typically, the text corpus is Wikipedia, though other corpora including the Open Directory Project have been used. ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization and has been used by this pair of researchers to compute what they refer to as ‘semantic relatedness’ by means of cosine similarity between the aforementioned vectors, collectively interpreted as a space of ‘concepts explicitly defined and described by humans’, where Wikipedia articles (or ODP entries, or otherwise titles of documents in the knowledge base corpus) are equated with concepts.

The name ‘explicit semantic analysis’ contrasts with latent semantic analysis (LSA), because the use of a knowledge base makes it possible to assign human-readable labels to the concepts that make up the vector space.

ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that the knowledge base contains topically orthogonal concepts. However, it was later shown by Anderka and Stein that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property; in their experiments, Anderka and Stein used newswire stories as ‘concepts’. To explain this observation, links have been shown between ESA and the generalized vector space model.
Gabrilovich and Markovitch replied to Anderka and Stein by pointing out that their experimental result was achieved using ‘a single application of ESA (text similarity)’ and ‘just a single, extremely small and homogenous test collection of 50 news documents’. Cross-language explicit semantic analysis (CL-ESA) is a multilingual generalization of ESA. CL-ESA exploits a document-aligned multilingual reference collection (e.g., again, Wikipedia) to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.
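To make the idea concrete, here is a toy sketch of the ESA representation: a word becomes a row of a tf-idf matrix over “concept” documents, and relatedness is the cosine similarity of those rows. The three-document corpus and all names are invented for illustration.

```r
# Toy sketch of ESA: a word is a row of the tf-idf matrix over "concept"
# documents; relatedness is cosine similarity. Corpus and names are invented.
concepts <- list(
  cat_article = c("cat", "cat", "pet", "fur"),
  dog_article = c("dog", "pet", "fur", "bark"),
  car_article = c("car", "engine", "wheel")
)
vocab <- sort(unique(unlist(concepts)))
tf    <- sapply(concepts, function(doc) table(factor(doc, levels = vocab)))
idf   <- log(length(concepts) / rowSums(tf > 0))
tfidf <- tf * idf  # rows = words, columns = concepts

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(tfidf["pet", ], tfidf["fur", ])  # words sharing concepts -> high
cosine(tfidf["cat", ], tfidf["car", ])  # words in disjoint concepts -> 0

# A document is represented as the centroid of its words' vectors
doc_vec <- colMeans(tfidf[c("cat", "pet"), ])
```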
http://…-explicit-semantic-analysis-esa-explained …
P-Tree Programming
We propose a novel method for automatic program synthesis. P-Tree Programming represents the program search space through a single probabilistic prototype tree. From this prototype tree we form program instances which we evaluate on a given problem. The error values from the evaluations are propagated through the prototype tree. We use them to update the probability distributions that determine the symbol choices of further instances. The iterative method is applied to several symbolic regression benchmarks from the literature. It outperforms standard Genetic Programming to a large extent. Furthermore, it relies on a concise set of parameters which are held constant for all problems. The algorithm can be employed for most of the typical computational intelligence tasks such as classification, automatic program induction, and symbolic regression. …
DeepPath
We study the problem of learning to reason in large scale knowledge graphs (KGs). More specifically, we describe a novel reinforcement learning framework for learning multi-hop relational paths: we use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, our approach includes a reward function that takes the accuracy, diversity, and efficiency into consideration. Experimentally, we show that our proposed method outperforms a path-ranking based algorithm and knowledge graph embedding methods on Freebase and Never-Ending Language Learning datasets. …
I am writing an Rmarkdown document with plenty of tables, and I want them in a decent format, e.g. kable. However, I don’t want to format them one by one. For example, I have created the following data frame in dplyr
data2 %>% group_by(uf) %>%
summarise(n = n()) %>%
arrange(desc(n))
One solution to the output format of this data frame would be to name it as an object in R, and then give it a format by using the kable function.
t1 <- data2 %>%
group_by(uf) %>%
summarise(n = n()) %>% arrange(desc(n))
knitr::kable(t1)
However, if your document has hundreds of these queries and you need a faster way to compile it while keeping the kable style automatically (avoiding naming the data frame and even calling the kable function on that name), you can use the printr package. Just add the following piece of code inside a chunk at the beginning of your document and voilà.
library(printr)
Now, all of your data frames will have a decent style, and you do not need to worry about this issue. For example, I have knitted a presentation by using printr and the first code in this post, and this is the result:
Maximum Likelihood Inference on Multi-State Trees (ML.MSBD)
Inference of a multi-states birth-death model from a phylogeny, comprising a number of states N, birth and death rates for each state and on which edges each state appears. Inference is done using a hybrid approach: states are progressively added in a greedy approach. For a fixed number of states N the best model is selected via maximum likelihood. Reference: J. Barido-Sottani and T. Stadler (2017) <doi:10.1101/215491>.
Nearest Neighbor Based Multiple Imputation for Survival Data with Missing Covariates (NNMIS)
Imputation for both missing covariates and censored observations (optional) for survival data with missing covariates by the nearest neighbor based multiple imputation algorithm as described in Hsu et al. (2006) <doi:10.1002/sim.2452>, Long et al. (2012) <doi:10.5705/ss.2010.069>, Hsu et al. (2014) <doi:10.1080/10543406.2014.888444>, and Hsu and Yu (2017) <arXiv:1710.04721>. Note that the current version can only impute for a situation with one missing covariate.
Weight of Evidence Based Segmentation of a Variable (woeR)
Segment a numeric variable based on a dichotomous dependent variable by using the weight of evidence (WOE) approach (Ref: Siddiqi, N. (2006) <doi:10.1002/9781119201731.biblio>). The underlying algorithm adopts a recursive approach to create segments that are diverse in respect of their WOE values and meet the demands of user-defined parameters. The algorithm also aims to maintain a monotonic trend in WOE values of consecutive segments. As such, it can be particularly helpful in improving robustness of linear and logistic regression models.
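For intuition, the WOE of a segment is the log ratio of its share of events to its share of non-events. Here is a minimal sketch with fixed cut points; it illustrates the formula only, not the woeR API or its recursive segmentation, and the data are simulated.

```r
# Minimal WOE sketch: log of (segment's share of events / share of non-events).
# Fixed cut points for illustration; woeR chooses segments recursively instead.
woe_by_segment <- function(x, y, breaks) {
  seg    <- cut(x, breaks = breaks, include.lowest = TRUE)
  events <- tapply(y == 1, seg, sum)
  nons   <- tapply(y == 0, seg, sum)
  log((events / sum(events)) / (nons / sum(nons)))
}

set.seed(42)
x <- runif(1000, 0, 100)
y <- rbinom(1000, 1, plogis((x - 50) / 15))  # event rate increases with x
w <- woe_by_segment(x, y, breaks = c(0, 25, 50, 75, 100))
w  # segments with more events get higher WOE
```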
A Shiny User Interface of Time Warping for Improved Gradient Matching (shinyKGode)
Interactive shiny application to perform inference of non-linear differential equations via gradient matching. Three pre-defined models (Lotka-Volterra, FitzHugh-Nagumo, and Biopathway) are provided, and users can also load their own models (in the Systems Biology Markup Language format) into the application.
Diagnostics for Pharmacometric Models (xpose)
Diagnostics for non-linear mixed-effects (population) models from ‘NONMEM’ <http://…/>. ‘xpose’ facilitates data import, creates numerical run summaries and provides ‘ggplot2’-based graphics for data exploration and model diagnostics.
High-Dimensional Mediation Analysis (HIMA)
Allows estimating and testing high-dimensional mediation effects based on sure independence screening and minimax concave penalty techniques. A joint significance test is used for the mediation effect. Haixiang Zhang, Yinan Zheng, Zhou Zhang, Tao Gao, Brian Joyce, Grace Yoon, Wei Zhang, Joel Schwartz, Allan Just, Elena Colicino, Pantel Vokonas, Lihui Zhao, Jinchi Lv, Andrea Baccarelli, Lifang Hou, Lei Liu (2016) <doi:10.1093/bioinformatics/btw351>.
Another week, another AWS bucket is left unattended. This one was big, though: Defense Department Spied On Social Media, Left All Its Collected Data Exposed To Anyone. Original investigation.
The UpGuard Cyber Risk Team can now disclose that three publicly downloadable cloud-based storage servers exposed a massive amount of data collected in apparent Department of Defense intelligence-gathering operations. The repositories appear to contain billions of public internet posts and news commentary scraped from the writings of many individuals from a broad array of countries, including the United States, by CENTCOM and PACOM, two Pentagon unified combatant commands charged with US military operations across the Middle East, Asia, and the South Pacific.
New tool quantifies power imbalance between female and male characters in Hollywood movie scripts.
The [University of Washington] team used machine-learning-based tools to analyze the language in nearly 800 movie scripts, quantifying how much power and agency those scripts give to individual characters. In their study, recently presented in Denmark at the 2017 Conference on Empirical Methods in Natural Language Processing, the researchers found subtle but widespread gender bias in the way male and female characters are portrayed.
This one has been controversial this week. Original: The Ivory Tower Can’t Keep Ignoring Tech. A reply: Cathy O’Neil Sleepwalks into Punditry. The original point:
These days, big data, artificial intelligence and the tech platforms that put them to work have huge influence and power. Algorithms choose the information we see when we go online, the jobs we get, the colleges to which we’re admitted and the credit cards and insurance we are issued. It goes without saying that when computers are making decisions, a lot can go wrong.
Our lawmakers desperately need this explained to them in an unbiased way so they can appropriately regulate, and tech companies need to be held accountable for their influence over all elements of our lives. But academics have been asleep at the wheel, leaving the responsibility for this education to well-paid lobbyists and employees who’ve abandoned the academy.
The Brutal Fight to Mine Your Data and Sell It to Your Boss. Plus Hacker News discussion.
On May 23, an email landed in the sales inbox of a San Francisco startup called HiQ Labs, politely asking the company to go out of business. HiQ is a “people analytics” firm that creates software tools for corporate human resources departments. Its Skill Mapper graphically represents the credentials and abilities of a workforce; its Keeper service identifies when employees are at risk of leaving for another job. Both draw the overwhelming majority of their data from a single trove: the material that is posted—with varying degrees of timeliness, detail, accuracy, and self-awareness—by the 500 million people on the social networking site LinkedIn.
Meet The Spreadsheet That Can Solve NYC Transit (and the Man Who Made It). True Data Science, but not sexy.
The spreadsheet is called the Balanced Transportation Analyzer, or BTA. It has 72 separate worksheets, many of which contain over a thousand rows and dozens of columns. Komanoff made the spreadsheet for a single purpose: to be the most comprehensive accounting possible of how a congestion charge in Manhattan would affect New York City.
CO₂ concentration and global mean temperature, past vs present: https://www.reddit.com/r/dataisbeautiful/comments/7ch00f/co%E2%82%82_concentration_and_global_mean_temperature/
Data Links is a periodic blog post published on Sundays (specific time may vary) which contains interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed or this one, which only contains these articles, if you are not interested in other things I might publish.
Have you read an article you liked and would you like to suggest it for the next issue? Just contact me!
Many users tell me that R is slow. With old R releases that was largely true, since old R versions used their own numerical libraries instead of optimized ones.
But numerical libraries don’t tell the complete story. In many cases slow code execution can be attributed to inefficient code, in precise terms to not following one or more good coding practices.
I would add another good practice: use the tidyverse. Since tidyverse packages such as dplyr benefit from Rcpp, their C++ backend can be faster than the equivalents in base (i.e. plain vanilla) R.
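As a concrete example of the “inefficient code” point (my own illustration, not part of the benchmark scripts below): growing an object inside a loop forces R to reallocate and copy it on every iteration, while a vectorized version allocates once.

```r
# Growing a vector inside a loop vs. a vectorized computation.
# Illustration only; not part of the benchmark scripts discussed below.
n <- 3e4
grow <- function() {
  out <- numeric(0)
  for (i in 1:n) out <- c(out, i^2)  # copies `out` on every iteration
  out
}
vectorized <- function() (1:n)^2

t_grow <- system.time(r1 <- grow())["elapsed"]
t_vec  <- system.time(r2 <- vectorized())["elapsed"]
identical(r1, r2)                      # TRUE: same result
c(loop = t_grow, vectorized = t_vec)   # the loop is dramatically slower
```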
The idea of this post is to clarify some points. R does not compete with C or C++, since they are different kinds of languages; R with the data.table package competes with Python and the numpy library. This does not mean that I’m defending R over Python or vice versa. Both R and Python are implemented as interpreters, while C and C++ are compiled, which means that, in really over-simplified terms, C and C++ will always be faster because compiled code runs closer to the machine.
As an Ubuntu user I can say the basic R installation from the Canonical or CRAN repositories works for most of the things I do on my laptop.
When I use RStudio Server Pro© that’s a different story: when I work with large data (i.e. 100 GB in RAM), even a 3% gain in resource efficiency or execution time is valuable, so I really want to optimize things.
Installing R with OpenBLAS will give you a tremendous performance boost, and that will work for most laptop situations. I explain how to do that in detail for Ubuntu 17.10 and Ubuntu 16.04, but a general setup is as simple as one of these two options:
# 1: Install from Canonical (default Ubuntu repository)
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install libopenblas-dev r-base
# 2: Install from CRAN mirror
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install libopenblas-dev r-base
# 3: Install RStudio (bonus)
cd ~/Downloads
wget https://download1.rstudio.org/rstudio-xenial-1.1.383-amd64.deb
sudo apt-get install gdebi
sudo gdebi rstudio-xenial-1.1.383-amd64.deb
printf '\nexport QT_STYLE_OVERRIDE=gtk\n' | sudo tee -a ~/.profile
Option (1) is a substitute for option (2). It’s totally up to you which one to use, and both will give you a really fast R compared to installing it without OpenBLAS.
I already use R with OpenBLAS just like the setup above. I will compile parallel R instances to do the benchmarking.
My benchmarks indicate that, in my case, it’s worth taking the time to install Intel© MKL. Execution time is strongly reduced for some operations compared to R with OpenBLAS.
Run this to install MKL:
# keys taken from https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update && sudo apt-get install intel-mkl-64bit
To compile R against MKL from source (in this case it’s the only option), run these lines:
# key added after sudo apt-get update returned a warning following this guide: https://support.rstudio.com/hc/en-us/articles/218004217-Building-R-from-source
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list
# you need to enable the multiverse repo, or packages such as xvfb won't be found
sudo rm -rf /etc/apt/sources.list
printf 'deb http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse
deb-src http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse\n
deb http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse
deb-src http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse\n
deb http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse
deb-src http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse\n' | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove
sudo apt-get upgrade --with-new-pkgs
sudo apt-get build-dep r-base
cd ~/GitHub/r-with-intel-mkl
wget https://cran.r-project.org/src/base/R-3/R-3.4.2.tar.gz
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2
source /opt/intel/mkl/bin/mklvars.sh intel64
MKL="-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm"
./configure --prefix=/opt/R/R-3.4.2-intel-mkl --enable-R-shlib --with-blas="$MKL" --with-lapack
make && sudo make install
printf '\nexport RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R\n' | tee -a ~/.profile
So as not to interfere with my working installation, I decided to compile a parallel instance from source:
cd ~/GitHub/r-with-intel-mkl/
rm -rf R-3.4.2
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2
./configure --prefix=/opt/R/R-3.4.2-openblas --enable-R-shlib --with-blas --with-lapack
make && sudo make install
printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R\n' | tee -a ~/.profile
There is a lot of discussion and strong evidence from different stakeholders in the R community indicating that this is by far the least efficient option. I compiled it just to have a complete benchmark:
cd ~/GitHub/r-with-intel-mkl/
rm -rf R-3.4.2
tar xzvf R-3.4.2.tar.gz
cd R-3.4.2
./configure --prefix=/opt/R/R-3.4.2-defaults --enable-R-shlib
make && sudo make install
printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R\n' | tee -a ~/.profile
This R version includes MKL by default and is supposed to be easy to install. I could not make it run, which is unfortunate because several articles (like this post by Brett Klamer) state that this R version is really efficient, though no different from standard CRAN R with MKL numerical libraries.
In any case here’s the code to install this version:
cd ~/GitHub/r-with-intel-mkl
wget https://mran.blob.core.windows.net/install/mro/3.4.2/microsoft-r-open-3.4.2.tar.gz
tar xzvf microsoft-r-open-3.4.2.tar.gz
cd microsoft-r-open
sudo ./install.sh
printf 'export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R\n' | tee -a ~/.profile
# it was not possible to start /opt/microsoft/ropen/3.4.2/lib64/R/bin/R
# the error is:
# *** caught segfault ***
# address 0x50, cause 'memory not mapped'
# removing Microsoft R
# https://mran.microsoft.com/documents/rro/installation#revorinst-uninstall steps did not work
sudo apt-get remove 'microsoft-r-.*'
sudo apt-get autoclean && sudo apt-get autoremove
My scripts above edit ~/.profile. This is so I can open RStudio and work with differently configured R instances on my computer.
I released the benchmark results and scripts on GitHub. The idea is to run the same benchmarking scripts from ATT and Microsoft to see how the different setups perform.
To work with CRAN R with MKL I had to edit ~/.profile because of how I configured the instances. So I ran nano ~/.profile and commented out the last part of the file to obtain this result:
#export RSTUDIO_WHICH_R=/usr/bin/R
export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R
#export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R
After that I logged out and back in, then opened RStudio to run the benchmark.
The other two cases are similar: the benchmark results were obtained by editing ~/.profile, logging out and back in, and opening RStudio with the corresponding instance. As an example, each result starts with the R version and the numerical libraries used in that session; all other results are reported in the same way.
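Editing ~/.profile by hand for every switch is easy to get wrong. As a convenience sketch (the file name profile.example and the TARGET variable below are hypothetical, and the paths mirror the instances built above), a sed one-liner can comment out every RSTUDIO_WHICH_R export and re-enable just the one you want:

```shell
# Demo on a throwaway copy of the relevant ~/.profile lines.
PROFILE=profile.example
printf '%s\n' \
  '#export RSTUDIO_WHICH_R=/usr/bin/R' \
  'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R' \
  '#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R' \
  '#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R' > "$PROFILE"
TARGET=openblas   # which instance to activate
# 1) comment out every active export, 2) uncomment the line matching $TARGET
sed -i -e 's/^export RSTUDIO_WHICH_R/#export RSTUDIO_WHICH_R/' \
       -e "/$TARGET/s/^#export/export/" "$PROFILE"
grep '^export' "$PROFILE"   # -> export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
```

To apply it to the real file you would point PROFILE at ~/.profile instead, then log out and back in as before.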
And here are the results from the ATT benchmarking script:
Task | CRAN R with MKL (seconds) | CRAN R with OpenBLAS (seconds) | CRAN R with no optimized libraries (seconds) |
---|---|---|---|
Creation, transp., deformation of a 2500×2500 matrix (sec) | 0.68 | 0.68 | 0.67 |
2400×2400 normal distributed random matrix ^1000 | 0.56 | 0.56 | 0.56 |
Sorting of 7,000,000 random values | 0.79 | 0.79 | 0.79 |
2800×2800 cross-product matrix (b = a’ * a) | 0.3 | 0.36 | 14.55 |
Linear regr. over a 3000×3000 matrix (c = a \ b’) | 0.17 | 0.22 | 6.98 |
FFT over 2,400,000 random values | 0.33 | 0.33 | 0.33 |
Eigenvalues of a 640×640 random matrix | 0.22 | 0.49 | 0.74 |
Determinant of a 2500×2500 random matrix | 0.2 | 0.22 | 2.99 |
Cholesky decomposition of a 3000×3000 matrix | 0.31 | 0.21 | 5.76 |
Inverse of a 1600×1600 random matrix | 0.2 | 0.21 | 2.79 |
3,500,000 Fibonacci numbers calculation (vector calc) | 0.54 | 0.54 | 0.54 |
Creation of a 3000×3000 Hilbert matrix (matrix calc) | 0.23 | 0.24 | 0.23 |
Grand common divisors of 400,000 pairs (recursion) | 0.27 | 0.29 | 0.3 |
Creation of a 500×500 Toeplitz matrix (loops) | 0.28 | 0.28 | 0.28 |
Escoufier’s method on a 45×45 matrix (mixed) | 0.22 | 0.23 | 0.28 |
Total time for all 15 tests | 5.3 | 5.62 | 37.78 |
Overall mean (weighted mean) | 0.31 | 0.32 | 0.93 |
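To put the totals in perspective, a quick bit of shell arithmetic on the last rows shows the unoptimized build is roughly seven times slower overall than the MKL build (37.78 s vs. 5.3 s):

```shell
# total ATT benchmark time: no-optimized-libraries build vs. MKL build
awk 'BEGIN { printf "%.1fx\n", 37.78 / 5.3 }'   # -> 7.1x
```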
And here are the results from the Microsoft benchmarking script:
Task | CRAN R with MKL (seconds) | CRAN R with OpenBLAS (seconds) | CRAN R with no optimized libraries (seconds) |
---|---|---|---|
Matrix multiply | 5.985 | 13.227 | 165.18 |
Cholesky Factorization | 1.061 | 2.349 | 26.762 |
Singular Value Decomposition | 7.268 | 18.325 | 47.076 |
Principal Components Analysis | 14.932 | 40.612 | 162.338 |
Linear Discriminant Analysis | 26.195 | 43.75 | 117.537 |
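Here the gap is even larger; for the matrix-multiply row alone, the unoptimized build takes about 28 times longer than the MKL build:

```shell
# matrix multiply: no-optimized-libraries build (165.18 s) vs. MKL build (5.985 s)
awk 'BEGIN { printf "%.0fx\n", 165.18 / 5.985 }'   # -> 28x
```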
I decided to go with Intel MKL, and I'll write another post benchmarking things I do every day beyond what is considered in these scripts.
To clean my system I deleted all R instances but MKL:
sudo apt-get remove r-base r-base-dev
sudo apt-get remove 'r-cran.*'
sudo apt-get autoclean && sudo apt-get autoremove
sudo apt-get build-dep r-base
sudo rm -rf /opt/R/R-3.4.2-openblas
sudo rm -rf /opt/R/R-3.4.2-defaults
sudo ln -s /opt/R/R-3.4.2-intel-mkl/bin/R /usr/bin/R
I edited ~/.profile so the final lines are:
export RSTUDIO_WHICH_R=/usr/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R
#export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R
#export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R
And I also decided to configure my user packages directory from zero:
# remove installed user packages
rm -rf ~/R
# create new user packages directory
mkdir -p ~/R/x86_64-pc-linux-gnu-library/3.4
# install common packages
R --vanilla << EOF
install.packages(c("tidyverse","data.table","dtplyr","devtools","roxygen2","bit64","pacman"), repos = "https://cran.rstudio.com/")
q()
EOF
# Export to HTML/Excel
R --vanilla << EOF
install.packages(c("htmlTable","openxlsx"), repos = "https://cran.rstudio.com/")
q()
EOF
# Blog tools
R --vanilla << EOF
install.packages(c("knitr","rmarkdown"), repos='http://cran.us.r-project.org')
q()
EOF
sudo apt-get install python-pip
sudo pip install --upgrade --force-reinstall markdown rpy2==2.7.8 pelican==3.6.3
# PDF extraction tools
sudo apt-get install libpoppler-cpp-dev default-jre default-jdk
sudo R CMD javareconf
R --vanilla << EOF
library(devtools)
install.packages(c("rJava","pdftools"), repos = "https://cran.rstudio.com/")
install_github("ropensci/tabulizer")
q()
EOF
# TTF/OTF fonts usage
R --vanilla << EOF
install.packages("showtext", repos = "https://cran.rstudio.com/")
q()
EOF
# Cairo for graphic devices
sudo apt-get install libgtk2.0-dev libxt-dev libcairo2-dev
R --vanilla << EOF
install.packages("Cairo", repos = "https://cran.rstudio.com/")
q()
EOF
The benchmarks presented here are in no way a definitive end to the long discussion on numerical libraries. My results do provide some evidence that, because it is faster for several operations, I should use MKL.
One of the advantages of the setup I explained is that you can use MKL with Python as well; in that case numpy calculations will also be boosted.
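If you want to confirm which backend your Python stack is actually using, numpy can report its own build configuration (this assumes numpy is installed; the exact output format varies between numpy versions):

```shell
# Print the BLAS/LAPACK configuration numpy was built against and keep
# only the first line mentioning a BLAS implementation (mkl, openblas, ...).
python3 -c 'import numpy; numpy.show_config()' | grep -i -m1 -E 'mkl|openblas|blas'
```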
Using MKL with AMD processors might not provide an important improvement compared to using OpenBLAS. This is because MKL uses processor-specific instructions that work well with Intel i3 or i5 processors but not necessarily with non-Intel models.
Jon Minton writes:
You may be interested in a commentary piece I wrote early this year, which was published recently in the International Journal of Epidemiology, where I discuss your work on identifying an aggregation bias in one of the key figures in Case & Deaton’s (in)famous 2015 paper on rising morbidity and mortality in middle-aged White non-Hispanics in the US.
Colour versions of the figures are available in the ‘supplementary data’ link in the above. (The long delay between writing, submitting, and the publication of the piece in IJE in some ways supports the arguments I make in the commentary, that timeliness is key, and blogs – and arxiv – allow for a much faster pace of research and analysis.)
An example of the more general approach I try to promote to looking at outcomes which vary by age and year is provided below, where I used data from the Human Mortality Database to produce a 3D printed ‘data cube’ of log mortality by age and year, whose features I then discuss. [See here and here.]
Seeing the data arranged in this way also makes it possible to see when the data quality improves, for example, as you can see the texture of the surface change from smooth (imputed within 5/10 year intervals) to rough.
I agree with your willingness to explore data visually to establish ground truths which your statistical models then express and explore more formally. (For example, in your identification of cohort effects in US voting preferences.) To this end I continue to find heat maps and contour plots of outcomes arranged by year and age a simple but powerful approach to pattern-finding, which I am now using as a starting point for statistical model specification.
The arrangement of data by year and age conceptually involves thinking about a continuous ‘data surface’ much like a spatial surface.
Given this, what are your thoughts on using spatial models which account for spatial autocorrelation, such as in R’s CARBayes package, to model demographic data as well?
My reply:
I agree that visualization is important.
Regarding your question about a continuous surface: yes, this makes sense. But my instinct is that we’d want something tailored to the problem; I doubt that a CAR model makes sense in your example. Those models are rotationally symmetric, which doesn’t seem like a property you’d want here.
If you do want to fit Bayesian CAR models, I suggest you do it in Stan.
Minton responded:
I agree that additional structure and different assumptions to those made by CAR would be needed. I’m thinking more about the general principle of modeling continuous age-year-rate surfaces. In the case of fertility modeling, for example, I was able to follow enough of this paper (my background is as an engineer rather than statistician) to get a sense that it formalises the way I intuit the data.
In the case of fertility, I also agree with using cohort and age as the surface’s axes rather than year and age. I produced the figure in this poster, where I munged Human Fertility Database and (less quality assured but more comprehensive) Human Fertility Collection data together and re-arranged year-age fertility rates by cohort to produce slightly crude estimates of cumulative cohort fertility levels. The thick solid line shows the age at which different cohorts ‘achieve’ replacement fertility levels (2.05), which for most countries veers off into infinity if not achieved by around the age of 43. The USA is unusual in regaining replacement fertility levels after losing them, which I assume is a secondary effect of high migration, with migrant cohorts bringing a different fertility schedule than non-migrants. The tiles are arranged from most to least fertile in the last recorded year, but the trends show these ranks will change over time, and the USA may move to top place.
The post Spatial models for demographic trends? appeared first on Statistical Modeling, Causal Inference, and Social Science.
Deep Texture Encoding Network (Deep TEN)
We propose a Deep Texture Encoding Network (Deep-TEN) with a novel Encoding Layer integrated on top of convolutional layers, which ports the entire dictionary learning and encoding pipeline into a single model. Current methods build from distinct components, using standard encoders with separate off-the-shelf features such as SIFT descriptors or pre-trained CNN features for material recognition. Our new approach provides an end-to-end learning framework, where the inherent visual vocabularies are learned directly from the loss function. The features, dictionaries and the encoding representation for the classifier are all learned simultaneously. The representation is orderless and therefore is particularly useful for material and texture recognition. The Encoding Layer generalizes robust residual encoders such as VLAD and Fisher Vectors, and has the property of discarding domain specific information which makes the learned convolutional features easier to transfer. Additionally, joint training using multiple datasets of varied sizes and class labels is supported resulting in increased recognition performance. The experimental results show superior performance as compared to state-of-the-art methods using gold-standard databases such as MINC-2500, Flickr Material Database, KTH-TIPS-2b, and two recent databases 4D-Light-Field-Material and GTOS. The source code for the complete system is publicly available. …
LibLinear
LibLinear is a linear classifier for data with millions of instances and features. It supports
• L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
• L1-regularized classifiers (after version 1.4): L2-loss linear SVM and logistic regression (LR)
• L2-regularized support vector regression (after version 1.9): L2-loss linear SVR and L1-loss linear SVR. …
Root Mean Square Error (RMSE)
Taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard deviation. …
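As a small worked example (the observed/predicted pairs below are made up for illustration), RMSE can be computed from paired values in a single awk pass:

```shell
# columns: observed predicted; RMSE = sqrt(mean((obs - pred)^2))
printf '3 2.5\n-0.5 0.0\n2 2\n7 8\n' \
  | awk '{ d = $1 - $2; s += d * d; n++ } END { printf "%.3f\n", sqrt(s / n) }'
# -> 0.612
```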