My Data Science Blogs

April 24, 2017

View BACPAC Files Using This One Weird Trick

I did not think it was possible to view the contents of a BACPAC file. However, last month while at SQL Bits I was reminded about this one weird trick to view the contents of a BACPAC file.

The post View BACPAC Files Using This One Weird Trick appeared first on Thomas LaRock.

If you liked this post then consider subscribing to the IS [NOT] NULL newsletter: http://thomaslarock.com/is-not-null-newsletter/


Magister Dixit

“When a new, powerful tool comes along, there’s a tendency to think it can solve more problems than it actually can. Computers have not made offices paperless, for example, and Predator drones haven’t made a significant dent in the annual number of terrorist acts. … While big data tools are, indeed, very powerful, the results they deliver tend to be only as good as the strategy behind their deployment. A closer look at successful big data projects offers clues as to why they are successful … and why others fall short of the mark.” Patrick Marshall ( 08.07.2014 )



The Future of String Arrays in MATLAB

At MathWorks, we almost never talk in detail about product features before they are released. That's why this week's Art of MATLAB blog caught my eye. Guest blogger Dave Bergstein describes in some detail how we expect support for string arrays to evolve in the future. If you do a lot of data analysis or text processing in MATLAB, it is worth your while to take a look at Dave's post.


Stan in St. Louis this Friday

This Friday afternoon I (Jonah) will be speaking about Stan at Washington University in St. Louis. The talk is open to the public, so anyone in the St. Louis area who is interested in Stan is welcome to attend. Here are the details:

Title: Stan: A Software Ecosystem for Modern Bayesian Inference
Jonah Sol Gabry, Columbia University

Neuroimaging Informatics and Analysis Center (NIAC) Seminar Series
Friday April 28, 2017, 1:30-2:30pm
NIL Large Conference Room
#2311, 2nd Floor, East Imaging Bldg.
4525 Scott Avenue, St. Louis, MO

medicine.wustl.edu (NIAC)

The post Stan in St. Louis this Friday appeared first on Statistical Modeling, Causal Inference, and Social Science.


Best Data Science Courses from Udemy (only $10 till Apr 29)

Here is a list of the best courses in data science from Udemy, covering Data Science, Machine Learning, Python, Spark, Tableau, and Hadoop - only $10 until April 29, 2017.


Document worth reading: “Experimental Analysis of Design Elements of Scalarizing Functions-based Multiobjective Evolutionary Algorithms”

In this paper we systematically study the importance, i.e., the influence on performance, of the main design elements that differentiate scalarizing functions-based multiobjective evolutionary algorithms (MOEAs). This class of MOEAs includes Multiobjective Genetic Local Search (MOGLS) and Multiobjective Evolutionary Algorithm Based on Decomposition (MOEA/D) and has proved to be very successful in multiple computational experiments and practical applications. The two algorithms share the same common structure and differ only in two main aspects. Using three different multiobjective combinatorial optimization problems, i.e., the multiobjective symmetric traveling salesperson problem, the traveling salesperson problem with profits, and the multiobjective set covering problem, we show that the main differentiating design element is the mechanism for parent selection, while the selection of weight vectors, either random or uniformly distributed, is practically negligible if the number of uniform weight vectors is sufficiently large. Experimental Analysis of Design Elements of Scalarizing Functions-based Multiobjective Evolutionary Algorithms



How disinformation spreads in a network

Disinformation is kind of a problem these days, yeah? Fatih Erikli uses a simulation that works like a disaster spread model applied to social networks to give an idea of how disinformation spreads.

I tried to visualize how disinformation becomes post-truth through the people subscribed in a network. We can think of this network as a social media platform such as Facebook or Twitter. The nodes (points) in the map represent individuals and the edges (lines) show the relationships between them in the community. The disinformation will be forwarded to their audience by the unconscious internet (community) members.

Set the “consciousness” parameter and select a node to run.
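For readers who want to tinker with the idea offline, here is a minimal sketch of this kind of spread simulation, assuming a random graph built with networkx. It is only an illustration of the mechanism described above, not Erikli's actual code, and the network size, edge probability, and default consciousness value are arbitrary choices.

# Illustrative sketch only -- not Erikli's implementation. A claim starts at one
# node; each exposed neighbor forwards it unless their "consciousness" stops them.
import random
import networkx as nx

def spread(graph, seed, consciousness=0.3):
	# Breadth-first spread: an exposed node adopts and forwards the claim
	# with probability (1 - consciousness).
	believed, frontier = {seed}, [seed]
	while frontier:
		next_frontier = []
		for node in frontier:
			for neighbor in graph.neighbors(node):
				if neighbor not in believed and random.random() > consciousness:
					believed.add(neighbor)
					next_frontier.append(neighbor)
		frontier = next_frontier
	return believed

g = nx.erdos_renyi_graph(200, 0.03)             # hypothetical 200-person network
reached = spread(g, seed=0, consciousness=0.3)
print(len(reached), "of", g.number_of_nodes(), "nodes ended up believing the claim")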


FAIR research being presented at ICLR 2017

Facebook AI Research (FAIR) researchers and engineers are converging at the 5th annual International Conference on Learning Representations (ICLR) 2017 in Toulon, France this week. ICLR brings together the top artificial intelligence and machine learning experts to discuss how to best learn meaningful and useful representations of data for application areas such as vision, speech, audio and natural language processing.

At ICLR they will be sharing their latest research in AI through 18 conference and workshop track papers, including a subset of their work on dialog systems which is outlined here.  The complete list of FAIR research papers being presented at ICLR is:

An Analytical Formula of Population Gradient for Two-Layered ReLU network and its Applications in Convergence and Critical Point Analysis
Yuandong Tian

Automatic Rule Extraction from Long Short Term Memory Networks
James Murdoch and Arthur Szlam

CommAI: Evaluating the First Steps Towards a Useful General AI
Marco Baroni, Armand Joulin, Allan Jabri, Germán Kruszewski, Angeliki Lazaridou, Klemen Simonic, and Tomas Mikolov

Dialogue Learning With Human-in-the-Loop
Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston

DSD: Dense-Sparse-Dense Training for Deep Neural Networks

Efficient Softmax Approximation for GPUs
Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou

Episodic Exploration for Deep Deterministic Policies for StarCraft Micromanagement
Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala

Improving Neural Language Models with a Continuous Cache
Edouard Grave, Armand Joulin, and Nicolas Usunier

Learning End-to-End Goal-Oriented Dialog
Antoine Bordes, Y-Lan Boureau, and Jason Weston

Learning Through Dialogue Interactions by Asking Questions
Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston

LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation
Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh

Multi-Agent Cooperation and the Emergence of (Natural) Language
Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni

Revisiting Classifier Two-Sample Tests
David Lopez-Paz and Maxime Oquab

Towards Principled Methods for Training Generative Adversarial Networks
Martin Arjovsky and Leon Bottou

Tracking the World State with Recurrent Entity Networks
Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun

Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning
Yuxin Wu and Yuandong Tian

Unsupervised Cross-Domain Image Generation
Yaniv Taigman, Adam Polyak, and Lior Wolf

Variable Computation in Recurrent Neural Networks
Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov


Industrial Asset Management – Slaying hurdles to get the most from your assets

A well-structured asset performance management (APM) plan can give real-time visibility of equipment reliability while predicting possible failures.


NFL draft performance vs. expectations

Reuben Fischer-Baum for The Washington Post looks at how professional football teams' draft picks have performed relative to expectations.

By comparing how much value teams should get given their set of picks with how much value they actually get, we can calculate which franchises make the most of their draft selections. Approximate Value (AV), a stat created by Pro Football Reference that measures how well a player performed overall in a season, is useful here. Based on this metric, we find that the Browns draftees have underperformed in the NFL given their draft position, especially when compared to the draftees of a team like, say, the Seahawks.

My main takeaway is that teams seem to know what they’re gonna get. Overall at least. Save a few teams who outdid expectations and a few who failed pretty badly, everyone else sticks towards the baseline. But it’s also really random year-to-year, which is essentially what makes sports interesting.

See also the player-level comparison for professional basketball from last year.

And, just a random observation, it felt weird reading this sports piece with “Democracy Dies in Darkness” at the header of The Washington Post site. But maybe that’s just me.


Testing Polynomial Equality

Problem: Determine if two polynomial expressions represent the same function. Specifically, if p(x_1, x_2, \dots, x_n) and q(x_1, x_2, \dots, x_n) are polynomials with inputs, outputs, and coefficients in a field F, the problem is to determine whether p(\mathbf{x}) = q(\mathbf{x}) for every \mathbf{x} \in F^n, in time polynomial in the number of bits required to write down p and q.

Solution: Let d be the maximum degree of all terms in p, q. Choose a finite set S \subset F with |S| > 2d. Repeat the following process 100 times:

  1. Choose inputs z_1, z_2, \dots, z_n \in S uniformly at random.
  2. Check if p(z_1, \dots, z_n) = q(z_1, \dots, z_n).

If every single time the two polynomials agree, accept the claim that they are equal. If they disagree on any input, reject. You will be wrong with probability at most 2^{-100}.
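As a concrete illustration (my own sketch, not part of the original post), here is the procedure in Python, assuming the polynomials are given as black-box Python callables with integer coefficients and we evaluate them over the integers modulo a prime chosen larger than 2d:

import random

def polynomials_equal(p, q, num_vars, max_degree, prime, trials=100):
	# Randomized identity test: evaluate both polynomials at random points
	# modulo `prime`. When p != q, each trial fails to notice with probability
	# at most max_degree / prime < 1/2, so after `trials` independent trials
	# the chance of a wrong "equal" verdict is at most 2**(-trials).
	assert prime > 2 * max_degree
	for _ in range(trials):
		z = [random.randrange(prime) for _ in range(num_vars)]
		if p(*z) % prime != q(*z) % prime:
			return False   # a disagreement is conclusive: p != q
	return True            # equal with probability at least 1 - 2**(-trials)

# Example: (x + 1)**2 and x**2 + 2*x + 1 agree everywhere.
p_example = lambda x: (x + 1) ** 2
q_example = lambda x: x ** 2 + 2 * x + 1
print(polynomials_equal(p_example, q_example, num_vars=1, max_degree=2, prime=101))   # True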

Discussion: At first glance it’s unclear why this problem is hard.

If you have two representations of polynomials p, q, say expressed in algebraic notation, why can’t you just do the algebra to convert them both into the same format, and see if they’re equal?

Unfortunately, that conversion can take exponential time. For example, suppose you have the polynomial p(x) = (x+1)^{1000}. Though it only takes a few bits to write down, expressing it in a "canonical form," often the monomial form a_0 + a_1x + \dots + a_d x^d, would require exponentially many bits in the size of the original representation. In general, it's unknown how to algorithmically transform polynomials into a "canonical form" (so that they can be compared) in subexponential time.
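To put a rough number on that blow-up (my own back-of-the-envelope check, not from the post): the factored form takes about a dozen characters to write down, while a single coefficient of the expanded form already needs close to a thousand bits.

import math

# (x + 1)**1000 expands into 1001 monomials, and the middle binomial
# coefficient C(1000, 500) alone needs nearly a thousand bits to store.
middle_coefficient = math.comb(1000, 500)
print(middle_coefficient.bit_length())   # ~995 bits for just this one coefficient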

Instead, the best we know how to do is treat the polynomials as black boxes and plug values into them.

Indeed, for single variable polynomials it’s well known that a nonzero degree d polynomial has at most d roots. A similar result is true for polynomials with many variables, and so we can apply that result to the polynomial p - q to determine if p = q. This theorem is so important (and easy to prove) that it deserves the name of lemma.

The Schwartz-Zippel lemma. Let p be a nonzero polynomial of total degree d \geq 0 over a field F. Let S be a finite subset of F and let z_1, \dots, z_n be chosen uniformly at random from S. The probability that p(z_1, \dots, z_n) = 0 is at most d / |S|.
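Before the proof, here is a quick empirical sanity check of the bound for one concrete polynomial (my own illustration, not part of the original post):

from itertools import product

# p(x, y) = (x - 1)(y - 2) is nonzero and has total degree 2. With
# S = {0, ..., 9}, the lemma guarantees Pr[p(z1, z2) = 0] <= 2/10.
S = range(10)
p = lambda x, y: (x - 1) * (y - 2)
zeros = sum(1 for x, y in product(S, S) if p(x, y) == 0)
print(zeros / len(S) ** 2)   # 0.19, just under the guaranteed bound of 0.2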

Proof. By induction on the number of variables n. For the case of n=1, it's the usual fact that a single-variable polynomial of degree d can have at most d roots. Now for the inductive step, assume this is true for all polynomials with n-1 variables, and we will prove it for n variables. Write p as a polynomial in the variable x_1, whose coefficients are other polynomials:

\displaystyle p(x_1, \dots, x_n) = \sum_{k=0}^d Q_k(x_2, \dots, x_n) x_1^k

Here we've grouped p by the powers of x_1, so that Q_k is the coefficient of x_1^k. This is useful because we'll be able to apply the inductive hypothesis to one of the Q_k's, which have fewer variables.

Indeed, since p is not the zero polynomial, some Q_k must be nonzero. If the only nonzero Q_k is Q_0, then p doesn't depend on x_1 at all and we can apply the inductive hypothesis to it directly. Otherwise, take the largest k > 0 for which Q_k is nonzero. The degree of Q_k is at most d-k, because the term x_1^k Q_k has degree at most d.

By the inductive hypothesis, if we choose z_2, \dots, z_n and plug them into Q_k, we get zero with probability at most \frac{d-k}{|S|}. The crucial part is that if this coefficient is nonzero, then p(x_1, z_2, \dots, z_n) is a nonzero polynomial in x_1, even if an unlucky choice of x_1 = z_1 later makes the evaluation p(z_1, \dots, z_n) equal to zero.

To think about it a different way, imagine we're evaluating the polynomial in phases. In the first phase, we pick the z_2, \dots, z_n. We could also pick z_1 independently but not reveal what it is, for the sake of this story. Then we plug in the z_2, \dots, z_n, and the result is a one-variable polynomial whose leading coefficient is Q_k(z_2, \dots, z_n). The inductive hypothesis tells us that this one-variable polynomial is the zero polynomial with probability at most \frac{d-k}{|S|}. (It's probably a smaller probability, since all the coefficients have to be zero, but we're just considering the leading one for the sake of generality and simplicity.)

Indeed, the resulting polynomial after we plug in z_2, \dots, z_n has degree k, so we can apply the inductive hypothesis to it as well, and the probability that it’s zero for a random choice of z_1 is at most k / |S|.

Finally, the probability \textup{Pr}[B] that p evaluates to zero can be bounded using basic probability algebra. Let A be the event that Q_k(z_2, \dots, z_n) is zero, and B the event that p(z_1, z_2, \dots, z_n) is zero.

Then \textup{Pr}[B] = \textup{Pr}[B \textup{ and } A] + \textup{Pr}[B \textup{ and } !A] = \textup{Pr}[B \mid A] \textup{Pr}[A] + \textup{Pr}[B \mid !A] \textup{Pr}[!A].

Note the two quantities above that we don’t know are \textup{Pr}[B \mid A] and \textup{Pr}[!A], so we’ll bound them from above by 1. The rest of the quantities add up to exactly what we want, and so

\displaystyle  \textup{Pr}[B] \leq \frac{d-k}{|S|} + \frac{k}{|S|} = \frac{d}{|S|},

which proves the theorem.

\square

While this theorem is almost trivial to prove (it's elementary induction, and the obvious kind), it can be used to solve polynomial identity testing, as well as to find perfect matchings in graphs and to test numbers for primality.

But while the practical questions are largely solved–it’s hard to imagine a setting where you’d need faster primality testing than the existing randomized algorithms–the theory and philosophy of the result is much more interesting.

Indeed, checking two polynomials for equality has no known deterministic polynomial time algorithm. It’s one of a small class of problems, like integer factoring and the discrete logarithm, which are not known to be efficiently solvable in theory, but are also not known to be NP-hard, so there is still hope. The existence of this randomized algorithm increases hope (integer factorization sure doesn’t have one!). And more generally, the fact that there are so few natural problems in this class makes one wonder whether randomness is actually beneficial at all. From a polynomial-time-defined-as-efficient perspective, can every problem efficiently solvable with access to random bits also be solved without such access? In the computational complexity lingo, does P = BPP? Many experts think the answer is yes.



Data Science for Operational Excellence (Part-3)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Optimized transportation planning is a task usually left to the firm's logistics department. However, it is often difficult to visualize, especially if there are many points involved in the logistics network. R and its packages can help solve this issue. Our goal here is to expand on logistics network visualization. In order to do that, we will use packages such as ggmap and leaflet.

Answers to the exercises are available here.

Exercise 1
Load libraries: ggmap, fields, lpSolve, leaflet, dplyr, magrittr. Use the following vectors to create a new one called allCitiesAux: soyaCities <- c("Sapezal","Sorriso", "Nova Mutum", "Diamantino", "Cascavel") , transhipment <- c("Alto Araguaia", "Cascavel"), ports <- c("Santos", "Paranagua").
Exercise 2
Use the function geocode to collect latitude and longitude for all cities.
Exercise 3
Create a data frame with the column names City, lat and lng.
Exercise 4
Create a matrix that contains the distances between all cities. We will use this in the lp.transport function (from lpSolve), so remember that rows must be offer points and columns demand points.

Learn more about geo visualization in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

  • Build advanced map visualizations
  • Work with different sources for maps
  • And much more

Exercise 5
Create row.signs, row.rhs, col.signs, col.rhs. For that, remember to set a seed equal to 123 and that all soya must be exported through ports. For the "right hand side" variables, use randomly generated numbers: port demands should be between 300 and 600, and soya production should be between 50 and 300.
Exercise 6
Solve the transportation problem and change columns and row names to match the names from the cost matrix.
Exercise 7
Create a list of data frames to store all segments present in the solution. For example, one of these segments should be Sapezal to Santos.
Exercise 8
Create a map using leaflet and add lines for each segment based on the list of data frames created previously.
Exercise 9
Create a list of data frames to store road routes extracted using the route function from ggmap.
Exercise 10
Create a new map using leaflet that, instead of showing straight lines from origin to destination, shows road routes.


Jupyter Digest: 24 April 2017

Python cheat sheet, open source DL guide, Keen IO, and digital signal processing.

  • Python Cheat Sheet. This handy guide from Julian Gaal (@juliangaal on GitHub), focused on data science, provides a quick and well-formatted reference for common NumPy and Matplotlib functions. So, if you can never remember how to add ticks to a plot, split an array, or you just want to share a great trick you've learned, this is worth checking out.
  • Introduction to Artificial Neural Networks and Deep Learning: A Practical Guide with Applications in Python. If you're wondering what deep learning is all about, this open source guide by Sebastian Raschka (@rasbt) hits all the right notes (TensorFlow, RNN, etc). It's still in the early stages (mostly just appendices about math), but if the chapters are close to this quality, it's going to be a great book.

  • Analyzing your Keen IO data with Jupyter Notebooks. Keen IO is a handy analytics package that you can embed in your website or mobile application to track event data. In addition to the easy-to-use APIs, it also has a nice front-end app where you can do queries and get essential information. We've used it almost since it came out at O'Reilly, and it just keeps getting better and better. However, while the dashboards are nice and keep the folks back home happy, sometimes you just want to throw that stuff into Pandas and go deeper. This refreshingly concise article by Joanne Cheng (@joannecheng) shows you how, and has a nice companion notebook with details.

  • Cacophony for the whole family. Everything Allen Downey (@AllenDowney) writes is a model for how to explain a complex subject in clear, compelling language. But, he's also really funny. This excerpt from Think DSP, which shows how to use digital signal processing to simulate an elementary school band, resonated with me because I've been listening to my kids practice on their recorders. This article nails the experience. Plus, it's a cool example of how to generate audio files with python in the notebook. (Be sure to keep your volume low, though!)

Continue reading Jupyter Digest: 24 April 2017.


Top Stories, Apr 17-23: Awesome Deep Learning: Most Cited Deep Learning Papers; The Value of Exploratory Data Analysis

Awesome Deep Learning: Most Cited Deep Learning Papers; The Value of Exploratory Data Analysis; 10 Free Must-Read Books for Machine Learning and Data Science; Forrester vs Gartner on Data Science Platforms and Machine Learning Solutions; Data Science for the Layman


How to Battle the Data Wheel of Death

Data not Constantly Maintained -> Data Becomes Irrelevant -> People Lose Trust -> Use Data Less. We examine 4 reasons for this wheel of death, and what you can do about it.


SQL Server 2017 to add Python support

One of the major announcements from yesterday's Data Amp event was that SQL Server 2017 will add Python as a supported language. Just as with the continued R support, SQL Server 2017 will allow you to process data in the database using any Python function or package without needing to export the data from the database, and use SQL Server itself as an operationalization platform for production applications using Python code. In addition, the high-performance and distributed statistical and machine learning functions from the RevoScaleR and MicrosoftML packages in Microsoft R will be available as Python functions for use within SQL Server.

Python

The video below provides more details on the capabilities and the underlying architecture, and includes a demo of deploying and running Python code within SQL Server 2017.

SQL Server 2017 will also be available for the first time on Linux (including within Docker containers), and the new version is currently in public preview. In case you missed it yesterday, check out our post on the many new features supporting R in Microsoft R Server 9.1. And for more on the many updates coming in SQL Server 2017, take a look at the official blog post linked below.

SQL Server Blog: Delivering AI with data: the next generation of Microsoft’s data platform


This is a short open letter to those that believe scientists have a "liberal bias" and question their objectivity. I suspect that for many conservatives, this Saturday's March for Science served as confirmation of this fact. In this post I will try to convince you that this is not the case, specifically by pointing out how scientists often annoy the left as much as the right.

First, let me emphasize that scientists are highly appreciative of members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF. Although the current administration did propose a 20% cut, we are aware that, generally speaking, support for scientific research has traditionally been bipartisan.

It is true that the typical data-driven scientist will disagree, sometimes strongly, with many stances that are considered conservative. For example, most scientists will argue that:

  1. Climate change is real and is driven largely by increased carbon dioxide and other human-made emissions into the atmosphere.
  2. Evolution needs to be part of children’s education and creationism has no place in Science class.
  3. Homosexuality is not a choice.
  4. Science must be publicly funded because the free market is not enough to make science thrive.

But scientists will also hold positions that are often criticized heavily by some of those who identify as politically left wing:

  1. Current vaccination programs are safe and need to be enforced: without herd immunity, thousands of children would die.
  2. Genetically modified organisms (GMOs) are safe and are indispensable to fight world hunger. There is no need for warning labels.
  3. Using nuclear energy to power our electrical grid is much less harmful than using natural gas, oil and coal and, currently, more viable than renewable energy.
  4. Alternative medicine, such as homeopathy, naturopathy, faith healing, reiki, and acupuncture, is pseudo-scientific quackery.

The timing of the announcement of the March for Science, along with the organizers' focus on environmental issues and diversity, may have made it seem like a partisan or left-leaning event, but please also note that many scientists criticized the organizers for this very reason and there was much debate in general. Most scientists I know that went to the march did so not necessarily because they are against Republican administrations, but because they are legitimately concerned about some of the choices of this particular administration and the future of our country if we stop funding and trusting science.

If you haven't already seen this Neil deGrasse Tyson video on the importance of Science to everyone, I highly recommend it.


R 3.4.0 now available

R 3.4.0, the latest release of the R programming language (codename: "You Stupid Darkness"), is now available. This is the annual major update to the R language engine, and provides improved performance for R programs. The source code was released by the R Core Team on Friday and binaries for Windows, Mac and Linux are available for download now from CRAN.

The most significant change in this release is that the JIT ('Just In Time') byte-code compiler is now enabled at level 3 by default. This means that top-level loops will be compiled before being run, and all functions will be compiled on their first or second use. Byte-compilation has been available for packages for some time, but the inclusion of this just-in-time compiler means you'll see similar performance benefits for your own R code without having to take any additional steps to compile it.

There have been various other performance improvements as well. Sorting vectors of numbers is faster (thanks to the use of the radix-sort algorithm by default). Tables with missing values compute quicker. Long strings no longer cause slowness in the str function. And the sapply function is faster when applied to arrays with dimension names.

Also on the performance front, R will now use some higher-performance BLAS routines for linear algebra operations (for example, DGEMV is now used instead of DGEMM). R falls back to single-threaded (but higher-precision) built-in linear algebra functions when missing values are present in the data, and the check for that situation has also been sped up, which should further improve performance.

For package authors using compiled C/C++/Fortran code, there is a new function to register the routines that should be exposed to other packages. Registering functions speeds up the process of calling compiled code, particularly on Windows systems. The gain is measured in the order of microseconds per call, but when these functions are called thousands or millions of times the impact can be noticeable.

This update also makes a number of tweaks to the language to improve consistency, especially in corner cases. R is now more consistent in its handling of missing values when constructing tables. Some common programmer errors, like comparing a vector with a zero-length array, now generate warnings. And there have also been some accuracy improvements for extreme values in some statistical functions.

This is just a summary of the changes; check out the announcement linked below for the detailed list. As always, big thanks go to the volunteers of the R Core Group for continuing to push the boundaries of R's capabilities with each annual release and the several patches that improve it each year. (By the way, the R Foundation now accepts donations via its website. If you benefit from using R, consider contributing.)

R-announce mailing list: R 3.4.0 is released



How Automated ML is Transforming the Predictive Analytics Landscape

Learn how DataRobot automates predictive modeling, and how our platform can deliver these same types of insights and a substantial productivity boost to your machine learning endeavors, on Tuesday, May 2nd at 1:00 pm ET.


Data Science Dividends – A Gentle Introduction to Financial Data Analysis

This post outlines some very basic methods for performing financial data analysis using Python, Pandas, and Matplotlib, focusing mainly on stock price data. A good place for beginners to start.


Eye blink detection with OpenCV, Python, and dlib

In last week’s blog post, I demonstrated how to perform facial landmark detection in real-time in video streams.

Today, we are going to build upon this knowledge and develop a computer vision application that is capable of detecting and counting blinks in video streams using facial landmarks and OpenCV.

To build our blink detector, we’ll be computing a metric called the eye aspect ratio (EAR), introduced by Soukupová and Čech in their 2016 paper, Real-Time Eye Blink Detection Using Facial Landmarks.

Unlike traditional image processing methods for computing blinks, which typically involve some combination of:

  1. Eye localization.
  2. Thresholding to find the whites of the eyes.
  3. Determining if the “white” region of the eyes disappears for a period of time (indicating a blink).

The eye aspect ratio is instead a much more elegant solution that involves a very simple calculation based on the ratio of distances between facial landmarks of the eyes.

This method for eye blink detection is fast, efficient, and easy to implement.

To learn more about building a computer vision system to detect blinks in video streams using OpenCV, Python, and dlib, just keep reading.


Eye blink detection with OpenCV, Python, and dlib

Our blink detection blog post is divided into four parts.

In the first part we’ll discuss the eye aspect ratio and how it can be used to determine if a person is blinking or not in a given video frame.

From there, we’ll write Python, OpenCV, and dlib code to (1) perform facial landmark detection and (2) detect blinks in video streams.

Based on this implementation we’ll apply our method to detecting blinks in example webcam streams along with video files.

Finally, I’ll wrap up today’s blog post by discussing methods to improve our blink detector.

Understanding the “eye aspect ratio” (EAR)

As we learned from our previous tutorial, we can apply facial landmark detection to localize important regions of the face, including eyes, eyebrows, nose, ears, and mouth:

Figure 1: Detecting facial landmarks in a video stream in real time.

This also implies that we can extract specific facial structures by knowing the indexes of the particular face parts:

Figure 2: Applying facial landmarks to localize various regions of the face, including eyes, eyebrows, nose, mouth, and jawline.

In terms of blink detection, we are only interested in two sets of facial structures — the eyes.

Each eye is represented by 6 (x, y)-coordinates, starting at the left-corner of the eye (as if you were looking at the person), and then working clockwise around the remainder of the region:

Figure 3: The 6 facial landmarks associated with the eye.

Based on this image, we should take away one key point:

There is a relation between the width and the height of these coordinates.

Based on the work by Soukupová and Čech in their 2016 paper, Real-Time Eye Blink Detection using Facial Landmarks, we can then derive an equation that reflects this relation called the eye aspect ratio (EAR):

Figure 4: The eye aspect ratio equation.

Where p1, …, p6 are 2D facial landmark locations.
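Since the figure itself isn't reproduced here, the equation it shows (as given in Soukupová and Čech's paper, and implemented in the code below) is:

EAR = (||p2 - p6|| + ||p3 - p5||) / (2 ||p1 - p4||)

where || · || denotes the Euclidean distance between two landmark points.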

The numerator of this equation computes the distance between the vertical eye landmarks while the denominator computes the distance between horizontal eye landmarks, weighting the denominator appropriately since there is only one set of horizontal points but two sets of vertical points.

Why is this equation so interesting?

Well, as we’ll find out, the eye aspect ratio is approximately constant while the eye is open, but will rapidly fall to zero when a blink is taking place.

Using this simple equation, we can avoid image processing techniques and simply rely on the ratio of eye landmark distances to determine if a person is blinking.

To make this more clear, consider the following figure from Soukupová and Čech:

Figure 5: Top-left: A visualization of eye landmarks when the eye is open. Top-right: Eye landmarks when the eye is closed. Bottom: Plotting the eye aspect ratio over time. The dip in the eye aspect ratio indicates a blink (Figure 1 of Soukupová and Čech).

On the top-left we have an eye that is fully open — the eye aspect ratio here would be large(r) and relatively constant over time.

However, once the person blinks (top-right) the eye aspect ratio decreases dramatically, approaching zero.

The bottom figure plots a graph of the eye aspect ratio over time for a video clip. As we can see, the eye aspect ratio is constant, then rapidly drops close to zero, then increases again, indicating a single blink has taken place.

In our next section, we’ll learn how to implement the eye aspect ratio for blink detection using facial landmarks, OpenCV, Python, and dlib.

Detecting blinks with facial landmarks and OpenCV

To get started, open up a new file and name it detect_blinks.py. From there, insert the following code:

# import the necessary packages
from scipy.spatial import distance as dist
from imutils.video import FileVideoStream
from imutils.video import VideoStream
from imutils import face_utils
import numpy as np
import argparse
import imutils
import time
import dlib
import cv2

To access either our video file on disk (FileVideoStream) or built-in webcam/USB camera/Raspberry Pi camera module (VideoStream), we'll need to use my imutils library, a set of convenience functions to make working with OpenCV easier.

If you do not have imutils installed on your system (or if you're using an older version), make sure you install/upgrade using the following command:

$ pip install --upgrade imutils

Note: If you are using Python virtual environments (as all of my OpenCV install tutorials do), make sure you use the workon command to access your virtual environment first, and then install/upgrade imutils.

Otherwise, most of our imports are fairly standard — the exception is dlib, which contains our implementation of facial landmark detection.

If you haven’t installed dlib on your system, please follow my dlib install tutorial to configure your machine.

Next, we'll define our eye_aspect_ratio function:

# ... (script so far, unchanged from the blocks above) ...
def eye_aspect_ratio(eye):
	# compute the euclidean distances between the two sets of
	# vertical eye landmarks (x, y)-coordinates
	A = dist.euclidean(eye[1], eye[5])
	B = dist.euclidean(eye[2], eye[4])

	# compute the euclidean distance between the horizontal
	# eye landmark (x, y)-coordinates
	C = dist.euclidean(eye[0], eye[3])

	# compute the eye aspect ratio
	ear = (A + B) / (2.0 * C)

	# return the eye aspect ratio
	return ear

This function accepts a single required parameter, the (x, y)-coordinates of the facial landmarks for a given eye.

Lines 16 and 17 compute the distance between the two sets of vertical eye landmarks while Line 21 computes the distance between horizontal eye landmarks.

Finally, Line 24 combines both the numerator and denominator to arrive at the final eye aspect ratio, as described in Figure 4 above.

Line 27 then returns the eye aspect ratio to the calling function.

Let’s go ahead and parse our command line arguments:

# ... (script so far, unchanged from the blocks above) ...
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--shape-predictor", required=True,
	help="path to facial landmark predictor")
ap.add_argument("-v", "--video", type=str, default="",
	help="path to input video file")
args = vars(ap.parse_args())

Our detect_blinks.py script requires a single command line argument, followed by a second optional one:

  • --shape-predictor: This is the path to dlib's pre-trained facial landmark detector. You can download the detector, along with the source code and example videos for this tutorial, using the "Downloads" section at the bottom of this blog post.
  • --video: This optional switch controls the path to an input video file residing on disk. If you instead want to work with a live video stream, simply omit this switch when executing the script.
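For example, a typical invocation looks like this (shape_predictor_68_face_landmarks.dat is dlib's standard 68-point model; the video filename is just a placeholder for your own file):

$ python detect_blinks.py --shape-predictor shape_predictor_68_face_landmarks.dat --video my_blink_video.mp4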

We now need to set two important constants that you may need to tune for your own implementation, and initialize two other important variables, so be sure to pay attention to this explanation:

# ... (script so far, unchanged from the blocks above) ...
# define two constants, one for the eye aspect ratio to indicate
# blink and then a second constant for the number of consecutive
# frames the eye must be below the threshold
EYE_AR_THRESH = 0.3
EYE_AR_CONSEC_FRAMES = 3

# initialize the frame counters and the total number of blinks
COUNTER = 0
TOTAL = 0

When determining if a blink is taking place in a video stream, we need to calculate the eye aspect ratio.

If the eye aspect ratio falls below a certain threshold and then rises back above it, we'll register a "blink". EYE_AR_THRESH is this threshold value. We default it to 0.3 as this is what has worked best for my applications, but you may need to tune it for your own application.

We then have an important constant, EYE_AR_CONSEC_FRAMES, which is set to 3 to indicate that three successive frames with an eye aspect ratio less than EYE_AR_THRESH must occur in order for a blink to be registered.

Again, depending on the frame processing throughput rate of your pipeline, you may need to raise or lower this number for your own implementation.

Lines 44 and 45 initialize two counters. COUNTER is the number of successive frames that have an eye aspect ratio less than EYE_AR_THRESH, while TOTAL is the total number of blinks that have taken place while the script has been running.
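To make the role of these two constants and two counters concrete, here is a minimal sketch of the per-frame counting logic they drive. This is a paraphrase of the behavior described above, not the post's exact code:

# Sketch only -- not the post's exact implementation. At ~30 frames per second,
# a blink lasting roughly 100 ms keeps the eye "closed" for about 3 consecutive
# frames, which is why EYE_AR_CONSEC_FRAMES defaults to 3.
if ear < EYE_AR_THRESH:
	COUNTER += 1                        # eye appears closed in this frame
else:
	if COUNTER >= EYE_AR_CONSEC_FRAMES:
		TOTAL += 1                      # a full blink has just finished
	COUNTER = 0                         # eye is open again; reset the streak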

Now that our imports, command line arguments, and constants have been taken care of, we can initialize dlib’s face detector and facial landmark detector:

# ... (script so far, unchanged from the blocks above) ...
# initialize dlib's face detector (HOG-based) and then create
# the facial landmark predictor
print("[INFO] loading facial landmark predictor...")
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(args["shape_predictor"])

The dlib library uses a pre-trained face detector which is based on a modification to the Histogram of Oriented Gradients + Linear SVM method for object detection.

We then initialize the actual facial landmark predictor on Line 51.

You can learn more about dlib's facial landmark detector (i.e., how it works, what dataset it was trained on, etc.) in this blog post.

The facial landmarks produced by dlib follow an indexable list, as I describe in this tutorial:

Figure 6: The full set of facial landmarks that can be detected via dlib (higher resolution).

We can therefore determine the starting and ending array slice index values for extracting (x, y)-coordinates for both the left and right eye below:

# ... (script so far, unchanged from the blocks above) ...
# grab the indexes of the facial landmarks for the left and
# right eye, respectively
(lStart, lEnd) = face_utils.FACIAL_LANDMARKS_IDXS["left_eye"]
(rStart, rEnd) = face_utils.FACIAL_LANDMARKS_IDXS["right_eye"]

Using these indexes we’ll be able to extract eye regions effortlessly.

Next, we need to decide if we are working with a file-based video stream or a live USB/webcam/Raspberry Pi camera video stream:

# ... (script so far, unchanged from the blocks above) ...
# start the video stream thread
print("[INFO] starting video stream thread...")
vs = FileVideoStream(args["video"]).start()
fileStream = True
# vs = VideoStream(src=0).start()
# vs = VideoStream(usePiCamera=True).start()
# fileStream = False
time.sleep(1.0)

If you’re using a file video stream, then leave the code as is.

Otherwise, if you want to use a built-in webcam or USB camera, uncomment Line 62.

For a Raspberry Pi camera module, uncomment Line 63.

If you have uncommented either Line 62 or Line 63, then uncomment Line 64 as well to indicate that you are not reading a video file from disk.

Finally, we have reached the main loop of our script:

# ... (script so far, unchanged from the blocks above) ...
# loop over frames from the video stream
while True:
	# if this is a file video stream, then we need to check if
	# there are any more frames left in the buffer to process
	if fileStream and not vs.more():
		break

	# grab the frame from the threaded video file stream, resize
	# it, and convert it to grayscale
	frame = vs.read()
	frame = imutils.resize(frame, width=450)
	gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

	# detect faces in the grayscale frame
	rects = detector(gray, 0)

On Line 68 we start looping over frames from our video stream.

If we are accessing a video file stream and there are no more frames left in the video, we break from the loop (Lines 71 and 72).

Line 77 reads the next frame from our video stream, followed by resizing it and converting it to grayscale (Lines 78 and 79).

We then detect faces in the grayscale frame on Line 82 via dlib’s built-in face detector.

We now need to loop over each of the faces in the frame and then apply facial landmark detection to each of them:

# ... (script so far, unchanged from the blocks above) ...
	# loop over the face detections
	for rect in rects:
		# determine the facial landmarks for the face region, then
		# convert the facial landmark (x, y)-coordinates to a NumPy
		# array
		shape = predictor(gray, rect)
		shape = face_utils.shape_to_np(shape)

		# extract the left and right eye coordinates, then use the
		# coordinates to compute the eye aspect ratio for both eyes
		leftEye = shape[lStart:lEnd]
		rightEye = shape[rStart:rEnd]
		leftEAR = eye_aspect_ratio(leftEye)
		rightEAR = eye_aspect_ratio(rightEye)

		# average the eye aspect ratio together for both eyes
		ear = (leftEAR + rightEAR) / 2.0

Line 89 determines the facial landmarks for the face region, while Line 90 converts these (x, y)-coordinates to a NumPy array.

Using our array slicing techniques from earlier in this script, we can extract the (x, y)-coordinates for both the left and right eye, respectively (Lines 94 and 95).

From there, we compute the eye aspect ratio for each eye on Lines 96 and 97.

Following the suggestion of Soukupová and Čech, we average the two eye aspect ratios together to obtain a better blink estimate (making the assumption that a person blinks both eyes at the same time, of course).
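
For reference, writing the six landmarks of one eye as p1 through p6 (with p1 and p4 being the horizontal corners of the eye), the per-eye value returned by eye_aspect_ratio() above is

EAR = (||p2 - p6|| + ||p3 - p5||) / (2 * ||p1 - p4||)

and the frame-level value used for thresholding later in the script is simply the mean of the left and right EARs.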

Our next code block simply handles visualizing the facial landmarks for the eye regions themselves:

		# compute the convex hull for the left and right eye, then
		# visualize each of the eyes
		leftEyeHull = cv2.convexHull(leftEye)
		rightEyeHull = cv2.convexHull(rightEye)
		cv2.drawContours(frame, [leftEyeHull], -1, (0, 255, 0), 1)
		cv2.drawContours(frame, [rightEyeHull], -1, (0, 255, 0), 1)

You can read more about extracting and visualizing individual facial landmark regions in this post.

At this point we have computed our (averaged) eye aspect ratio, but we haven’t actually determined if a blink has taken place — this is taken care of in the next section:

		# check to see if the eye aspect ratio is below the blink
		# threshold, and if so, increment the blink frame counter
		if ear < EYE_AR_THRESH:
			COUNTER += 1

		# otherwise, the eye aspect ratio is not below the blink
		# threshold
		else:
			# if the eyes were closed for a sufficient number of
			# frames, then increment the total number of blinks
			if COUNTER >= EYE_AR_CONSEC_FRAMES:
				TOTAL += 1

			# reset the eye frame counter
			COUNTER = 0

Line 111 makes a check to see if the eye aspect ratio is below our blink threshold — if it is, we increment the number of consecutive frames that indicate a blink is taking place (Line 112).

Otherwise, Line 116 handles the case where the eye aspect ratio is not below the blink threshold.

In this case, we make another check on Line 119 to see if a sufficient number of consecutive frames contained an eye aspect ratio below our pre-defined threshold.

If the check passes, we increment the TOTAL number of blinks (Line 120).

We then reset the eye frame counter COUNTER (Line 123).

Our final code block simply handles drawing the number of blinks on our output frame, as well as displaying the current eye aspect ratio:

		# draw the total number of blinks on the frame along with
		# the computed eye aspect ratio for the frame
		cv2.putText(frame, "Blinks: {}".format(TOTAL), (10, 30),
			cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
		cv2.putText(frame, "EAR: {:.2f}".format(ear), (300, 30),
			cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
 
	# show the frame
	cv2.imshow("Frame", frame)
	key = cv2.waitKey(1) & 0xFF
 
	# if the `q` key was pressed, break from the loop
	if key == ord("q"):
		break

# do a bit of cleanup
cv2.destroyAllWindows()
vs.stop()

To see our eye blink detector in action, proceed to the next section.

Blink detection results

Before executing any of these examples, be sure to use the “Downloads” section of this guide to download the source code + example videos + pre-trained dlib facial landmark predictor. From there, you can unpack the archive and start playing with the code.

Over this past weekend I was traveling out to Las Vegas for a conference. While I was waiting for my plane to board, I sat at the gate and put together the code for this blog post — this involved recording a simple video of myself that I could use to evaluate the blink detection software.

To apply our blink detector to the example video, just execute the following command:

$ python detect_blinks.py \
	--shape-predictor shape_predictor_68_face_landmarks.dat \
	--video blink_detection_demo.mp4

And as you’ll see, we can successfully count the number of blinks in the video using OpenCV and facial landmarks:

Later, at my hotel, I recorded a live stream of the blink detector in action and turned it into a screencast.

To access my built-in webcam I executed the following command (taking care to uncomment the correct VideoStream class, as detailed above):

$ python detect_blinks.py \
	--shape-predictor shape_predictor_68_face_landmarks.dat

Here is the output of the live blink detector along with my commentary:

Improving our blink detector

This blog post focused solely on using the eye aspect ratio as a quantitative metric to determine if a person has blinked in a video stream.

However, due to noise in a video stream, subpar facial landmark detections, or fast changes in viewing angle, a simple threshold on the eye aspect ratio could produce a false-positive detection, reporting that a blink had taken place when in reality the person had not blinked.

To make our blink detector more robust to these challenges, Soukupová and Čech recommend:

  1. Computing the eye aspect ratio for the N-th frame, along with the eye aspect ratios for N – 6 and N + 6 frames, then concatenating these eye aspect ratios to form a 13 dimensional feature vector.
  2. Training a Support Vector Machine (SVM) on these feature vectors.

Soukupová and Čech report that the combination of the temporal-based feature vector and SVM classifier helps reduce false-positive blink detections and improves the overall accuracy of the blink detector.
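
As a rough sketch of that idea (this is not code from the paper or from the detector above), assume you have a per-frame array of eye aspect ratios, ears, and matching per-frame blink labels, labels, obtained from annotated footage; the helper build_windows and the synthetic stand-in data below are purely illustrative, and scikit-learn's LinearSVC plays the role of the linear SVM:

import numpy as np
from sklearn.svm import LinearSVC

def build_windows(ears, labels, radius=6):
	# for each frame N, concatenate the EARs of frames N - 6 through N + 6
	# into a single 13-dim feature vector; frames too close to either end
	# of the recording are skipped
	X, y = [], []
	for n in range(radius, len(ears) - radius):
		X.append(ears[n - radius:n + radius + 1])
		y.append(labels[n])
	return np.array(X), np.array(y)

# synthetic stand-ins for per-frame EAR values and per-frame blink labels;
# in a real experiment these would come from hand-annotated video
rng = np.random.RandomState(42)
ears = rng.uniform(0.15, 0.35, size=1000)
labels = (ears < 0.21).astype(int)

X, y = build_windows(ears, labels)
clf = LinearSVC()
clf.fit(X, y)

# at run time, classify the 13-dim window centered on the current frame
print(clf.predict(X[:5]))

In practice the trained classifier would replace the fixed EYE_AR_THRESH / EYE_AR_CONSEC_FRAMES rule used above, since it can learn what a blink looks like over a short window of frames rather than relying on a single per-frame threshold.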

Summary

In this blog post I demonstrated how to build a blink detector using OpenCV, Python, and dlib.

The first step in building a blink detector is to perform facial landmark detection to localize the eyes in a given frame from a video stream.

Once we have the facial landmarks for both eyes, we compute the eye aspect ratio for each eye, which gives us a singular value, relating the distances between the vertical eye landmark points to the distances between the horizontal landmark points.

Once we have the eye aspect ratio, we can threshold it to determine if a person is blinking — the eye aspect ratio will remain approximately constant when the eyes are open and then will rapidly approach zero during a blink, then increase again as the eye opens.

To improve our blink detector, Soukupová and Čech recommend constructing a 13-dim feature vector of eye aspect ratios (N-th frame, N – 6 frames, and N + 6 frames), followed by feeding this feature vector into a Linear SVM for classification.

Of course, a natural extension of blink detection is drowsiness detection which we’ll be covering in the next two weeks here on the PyImageSearch blog.

To be notified when the drowsiness detection tutorial is published, be sure to enter your email address in the form below!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!

The post Eye blink detection with OpenCV, Python, and dlib appeared first on PyImageSearch.

Continue Reading…

Collapse

Read More

Stan without frontiers, Bayes without tears

This recent comment thread reminds me of a question that comes up from time to time, which is how to teach Bayesian statistics to students who aren’t comfortable with calculus. For continuous models, probabilities are integrals. And in just about every example except the one at 47:16 of this video, there are multiple parameters, so probabilities are multiple integrals.

So how to teach this to the vast majority of statistics users who can’t easily do multivariate calculus?

I dunno, but I don’t know that this has anything in particular to do with Bayes. Think about classical statistics, at least the sort that gets used in political science. Linear regression requires multivariate calculus too (or some pretty slick algebra or geometry) to get that least-squares solution. Not to mention the interpretation of the standard error. And then there’s logistic regression. Going further we move to popular machine learning methods which are really gonna seem like nothing more than black boxes. Kidz today all wanna do deep learning or random forests or whatever. And that’s fine. But no way are most of them learning the math behind it.

Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works.

So, in keeping with this attitude, teach Stan. Students set up the model, they push the button, they get the answers. No integrals required. Yes, you have to work with posterior simulations so there is integration implicitly—the conceptual load is not zero—but I think (hope?) that this approach of using simulations to manage uncertainty is easier and more direct than expressing everything in terms of integrals.

But it’s not just model fitting, it’s also model building and model checking. Cross validation, graphics, etc. You need less mathematical sophistication to evaluate a method than to construct it.

About ten years ago I wrote an article, “Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .” After briefly talking about a course that uses the BDA book and assumes that students know calculus, I continued:

My applied regression and multilevel modeling class has no derivatives and no integrals—it actually has less math than a standard regression class, since I also avoid matrix algebra as much as possible! What it does have is programming, and this is an area where many of the students need lots of practice. The course is Bayesian in that all inference is implicitly about the posterior distribution. There are no null hypotheses and alternative hypotheses, no Type 1 and Type 2 errors, no rejection regions and confidence coverage.

It’s my impression that most applied statistics classes don’t get into confidence coverage etc., but they can still mislead students by giving the impression that those classical principles are somehow fundamental. My class is different because I don’t pretend in that way. Instead I consider a Bayesian approach as foundational, and I teach students how to work with simulations.

My article continues:

Instead, the course is all about models, understanding the models, estimating parameters in the models, and making predictions. . . . Beyond programming and simulation, probably the Number 1 message I send in my applied statistics class is to focus on the deterministic part of the model rather than the error term. . . .

Even a simple model such as y = a + b*x + error is not so simple if x is not centered near zero. And then there are interaction models—these are incredibly important and so hard to understand until you’ve drawn some lines on paper. We draw lots of these lines, by hand and on the computer. I think of this as Bayesian as well: Bayesian inference is conditional on the model, so you have to understand what the model is saying.

The post Stan without frontiers, Bayes without tears appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…

Collapse

Read More

PAW for Manufacturing Chicago Needs You

Join us again this year in Chicago for Predictive Analytics World for Manufacturing, June 19-22, and receive 20% off current conference rates for two-day passes with the discount code MISSYOU.

Continue Reading…

Collapse

Read More

Stephen Curry for MVP, because he makes his teammates better the most

The choice for Most Valuable Player in the NBA is only minimally about the numbers, but it’s fun to look anyways. FiveThirtyEight makes the case for Stephen Curry. I particularly like the chart that shows how other players on a team fare when an MVP candidate doesn’t play.

Not only do virtually all of his teammates (10 of 11 players with at least 30 shots, representing over 1,700 shots taken without him) shoot worse without Curry on the court to draw attention, they shoot dramatically worse. Overall, Curry’s teammates shoot 7.3 percentage points worse with Curry off the court, with his average teammate shooting 8.3 points worse. Among our MVP candidates, LeBron has the next-highest impact on average teammate shooting (3.9 points), followed by Westbrook (2.5 points). When it comes to opening up a team’s offense, Curry has no equal.

Full player breakdowns.


Continue Reading…

Collapse

Read More

errors 0.0.1

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

In physics, engineering and other disciplines, a single number, a bare quantity, is useless, meaningless. When you measure something, your results will be within the precision of your equipment, and then you need to propagate those errors up to whatever final indirect measure you are interested in. Finally, you need to express it properly. For instance, the elementary charge:

library(errors)

e <- set_errors(1.6021766208e-19, 0.0000000098e-19)
print(e, digits=2, notation="plus-minus")
## (1.6021766208 +/- 0.0000000098)e-19
print(e, digits=2, notation="parenthesis")
## 1.6021766208(98)e-19

Propagation of uncertainty (or propagation of error) is easy, but too laborious. Lately, for various reasons, I have been growing sick of this task, so I decided to write this small package. In a nutshell, the errors package, which is already on CRAN, provides support for painless automatic error propagation in numerical operations and pretty printing.
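
For context, the bookkeeping being automated here is the standard first-order (Taylor) propagation rule: for a derived quantity f(x, y),

σ_f² ≈ (∂f/∂x)² σ_x² + (∂f/∂y)² σ_y²

applied operation by operation, under the assumption that the errors in x and y are uncorrelated.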

With errors, you can add errors to your numeric vectors:

x <- 1:10
errors(x) <- 0.1
x
## errors: 0.1 0.1 0.1 0.1 0.1 ...
##  [1]  1  2  3  4  5  6  7  8  9 10
errors(x)
##  [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
errors(x) <- seq(0.1, 1, 0.1)
errors(x)
##  [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

The set_errors() method is a pipe-friendly version of the above:

(x <- set_errors(1:10, seq(0.1, 1, 0.1)))
## errors: 0.1 0.2 0.3 0.4 0.5 ...
##  [1]  1  2  3  4  5  6  7  8  9 10

Then, you simply work with your quantities without having to worry about errors anymore:

df <- as.data.frame(x)

(df$`3x` <- 3*x)
## errors: 0.3 0.6 0.9 1.2 1.5 ...
##  [1]  3  6  9 12 15 18 21 24 27 30
(df$`x^2` <- x^2)
## errors: 0.2 0.8 1.8 3.2 5 ...
##  [1]   1   4   9  16  25  36  49  64  81 100
(df$`sin(x)` <- sin(x))
## errors: 0.054030230586814 0.0832293673094285 0.296997748980134 0.261457448345445 0.141831092731613 ...
##  [1]  0.8414710  0.9092974  0.1411200 -0.7568025 -0.9589243 -0.2794155
##  [7]  0.6569866  0.9893582  0.4121185 -0.5440211
(df$`cumsum(x)` <- cumsum(x))
## errors: 0.1 0.223606797749979 0.374165738677394 0.547722557505166 0.741619848709566 ...
##  [1]  1  3  6 10 15 21 28 36 45 55

Finally, you probably need to report your data in a consistent way. No problem: just print your table.

df
##         x     3x     x^2  sin(x) cumsum(x)
## 1  1.0(1) 3.0(3)  1.0(2) 0.84(5)    1.0(1)
## 2  2.0(2) 6.0(6)  4.0(8) 0.91(8)    3.0(2)
## 3  3.0(3) 9.0(9)    9(2)  0.1(3)    6.0(4)
## 4  4.0(4)  12(1)   16(3) -0.8(3)   10.0(5)
## 5  5.0(5)  15(2)   25(5) -1.0(1)   15.0(7)
## 6  6.0(6)  18(2)   36(7) -0.3(6)  21.0(10)
## 7  7.0(7)  21(2)  49(10)  0.7(5)     28(1)
## 8  8.0(8)  24(2)  60(10)  1.0(1)     36(1)
## 9  9.0(9)  27(3)  80(20)  0.4(8)     45(2)
## 10  10(1)  30(3) 100(20) -0.5(8)     55(2)

By default, errors uses parenthesis notation (which is more compact) and a single significant digit for errors. If you prefer the plus-minus notation or you need more significant digits, just pass the notation and digits arguments to the print() method, as in the first example, or set them as global options with options(errors.notation="plus-minus", errors.digits=2).

The inner workings of this package have been inspired by Edzer Pebesma and his excellent units package. As a next step, I envision numeric vectors with errors and units for R. Thus, I publicly invite Edzer to collaborate with me in making our packages work together.

Article originally published in Enchufa2.es: errors 0.0.1.

To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Four short links: 24 April 2017

Sports Analytics, Book Recommendations, Breakthrough Tech, and Engineering Leadership

  1. Bringing IoT to Sports Analytics (Adrian Colyer) -- IoT + cricket, two great tastes ... You might initially think that gyroscopes and accelerometers would be great for [spin tracking]: but gyroscopes can only cope with spins up to about 5rps (while professional bowlers can generate spin of over 30rps), and gravity is not measured in accelerometers during free-fall. This leaves magnetometers.
  2. Book Recommendations by TED 2017 Speakers -- a good and diverse list. Thankfully, they're not all "Mindfulness Sales Techniques of the Renaissance: A Historic Look at Believing in Yourself for Happiness and Profit" type of bollocks.
  3. Ten Breakthrough Technologies for 2017 (MIT TR) -- the good news is: some of them are hard to turn into advertising business models! They are: reversing paralysis, self-driving trucks, paying with your face, practical quantum computers, the 360-degree selfie, hot solar cells, Gene Therapy 2.0, the cell atlas, botnets of things, and reinforcement learning.
  4. How I Lead 4 Years of Computer Engineering -- Life is like a real-time strategy game. It's great to know beforehand how others played it when they were at your situation. In this short post, I'm summing up scenarios I faced, decisions I made, and their outcomes during four years of computer engineering.

Continue reading Four short links: 24 April 2017.

Continue Reading…

Collapse

Read More

R 3.4.0 is released – with new speed upgrades and bug-fixes

R 3.4.0 (codename “You Stupid Darkness”) was released 3 days ago. You can get the latest binaries version from here. (or the .tar.gz source code from here). The full list of bug fixes and new features is provided below.

As mentioned two months ago by David Smith, R 3.4.0 introduces several major changes aimed at improving the performance of R in various ways. These include:

  • The JIT (‘Just In Time’) byte-code compiler is now enabled by default at its level 3. This means functions will be compiled on first or second use and top-level loops will be compiled and then run. (Thanks to Tomas Kalibera for extensive work to make this possible.) For now, the compiler will not compile code containing explicit calls to browser(): this is to support single stepping from the browser() call. JIT compilation can be disabled for the rest of the session using compiler::enableJIT(0) or by setting environment variable R_ENABLE_JIT to 0.
  • Matrix products now consistently bypass BLAS when the inputs have NaN/Inf values. Performance of the check of inputs has been improved. Performance when BLAS is used is improved for matrix/vector and vector/matrix multiplication (DGEMV is now used instead of DGEMM). One can now choose from alternative matrix product implementations via options(matprod = ). The “internal” implementation is not optimized for speed but consistent in precision with other summations in R (using long double accumulators where available). “blas” calls BLAS directly for best speed, but usually with undefined behavior for inputs with NaN/Inf.
  • Speedup in simplify2array() and hence sapply() and mapply() (for the case of names and common length > 1), thanks to Suharto Anggono’s PR#17118.
  • Accumulating vectors in a loop is faster – Assigning to an element of a vector beyond the current length now over-allocates by a small fraction. The new vector is marked internally as growable, and the true length of the new vector is stored in the truelength field. This makes building up a vector result by assigning to the next element beyond the current length more efficient, though pre-allocating is still preferred. The implementation is subject to change and not intended to be used in packages at this time.
  • C-LEVEL FACILITIES have been extended.
  • Radix sort (which can be considered more efficient for some cases) is now chosen by method = “auto” for sort.int() for double vectors (and hence used for sort() for unclassed double vectors), excluding ‘long’ vectors. sort.int(method = “radix”) no longer rounds double vectors. The default method until R 3.2.0 was “shell”. A minimal comparison between the two shows that for very short vectors (100 values), “shell” performs better. From 1,000 values they are comparable, and for larger vectors “radix” is 2-3 times faster (which is probably the use case we care about most). More about this can be read in ?sort.int

 

#> library(microbenchmark)
#> 
#> set.seed(2017-04-24)
#> x <- rnorm(100) # very short vector (100 values)
#> microbenchmark(shell = sort.int(x, method = "shell"), radix = sort.int(x, method = "radix"))
Unit: microseconds
  expr    min     lq     mean median     uq    max neval cld
 shell 15.775 16.606 17.80971 17.989 18.543 33.211   100  a 
 radix 32.657 34.595 35.67700 35.148 35.702 88.561   100   b
#> 
#> set.seed(2017-04-24)
#> x <- rnorm(1000) # 1000 values
#> microbenchmark(shell = sort.int(x, method = "shell"), radix = sort.int(x, method = "radix"))
Unit: microseconds
  expr    min     lq     mean median      uq    max neval cld
 shell 53.414 55.074 56.54395 56.182 57.0120 96.034   100   b
 radix 45.665 46.772 48.04222 47.325 48.1555 78.598   100  a 
#> 
#> set.seed(2017-04-24)
#> x <- rnorm(1e6) # a much larger vector (size assumed for illustration)
#> microbenchmark(shell = sort.int(x, method = "shell"), radix = sort.int(x, method = "radix"))
Unit: milliseconds
  expr      min       lq      mean    median        uq      max neval cld
 shell 93.33140 95.94478 107.75347 103.02756 115.33709 221.0800   100   b
 radix 38.18241 39.01516  46.47038  41.45722  47.49596 159.3518   100  a 
#> 
#>

More about the changes in R can be read in the nice post by David Smith, or in the list of changes given below.

 

Upgrading to R 3.4.0 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install 
setInternet2(TRUE) # only for R versions older than 3.3.0
installr::updateR() # updating R.
# If you wish it to go faster, run: installr::updateR(T)

Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

CHANGES IN R 3.4.0

SIGNIFICANT USER-VISIBLE CHANGES

  • (Unix-alike) The default methods for download.file() and url() now choose "libcurl" except for file:// URLs. There will be small changes in the format and wording of messages, including in rare cases if an issue is a warning or an error. For example, when HTTP re-direction occurs, some messages refer to the final URL rather than the specified one.Those who use proxies should check that their settings are compatible (see ?download.file: the most commonly used forms work for both "internal" and "libcurl").
  • table() has been amended to be more internally consistent and become back compatible to R <= 2.7.2 again. Consequently, table(1:2, exclude = NULL) no longer contains a zero count for <NA>, but useNA = "always" continues to do so.
  • summary.default() no longer rounds, but its print method does resulting in less extraneous rounding, notably of numbers in the ten thousands.
  • factor(x, exclude = L) behaves more rationally when x or L are character vectors. Further, exclude = <factor> now behaves as documented for long.
  • Arithmetic, logic (&, |) and comparison (aka ‘relational’, e.g., <, ==) operations with arrays now behave consistently, notably for arrays of length zero.Arithmetic between length-1 arrays and longer non-arrays had silently dropped the array attributes and recycled. This now gives a warning and will signal an error in the future, as it has always for logic and comparison operations in these cases (e.g., compare matrix(1,1) + 2:3 and matrix(1,1) < 2:3).
  • The JIT (‘Just In Time’) byte-code compiler is now enabled by default at its level 3. This means functions will be compiled on first or second use and top-level loops will be compiled and then run. (Thanks to Tomas Kalibera for extensive work to make this possible.)For now, the compiler will not compile code containing explicit calls to browser(): this is to support single stepping from the browser() call.JIT compilation can be disabled for the rest of the session using compiler::enableJIT(0) or by setting environment variable R_ENABLE_JIT to 0.
  • xtabs() works more consistently with NAs, also in its result no longer setting them to 0. Further, a new logical option addNA allows to count NAs where appropriate. Additionally, for the case sparse = TRUE, the result’s dimnames are identical to the default case’s.
  • Matrix products now consistently bypass BLAS when the inputs have NaN/Inf values. Performance of the check of inputs has been improved. Performance when BLAS is used is improved for matrix/vector and vector/matrix multiplication (DGEMV is now used instead of DGEMM).One can now choose from alternative matrix product implementations via options(matprod = ). The "internal" implementation is not optimized for speed but consistent in precision with other summations in R (using long double accumulators where available). "blas" calls BLAS directly for best speed, but usually with undefined behavior for inputs with NaN/Inf.
  • factor() now uses order() to sort its levels, not sort.list(). This makes factor() support custom vector-like objects if methods for the appropriate generics are defined. This change has the side effect of making factor() succeed on empty or length-one non-atomic vector(-like) types (e.g., list), where it failed before.

NEW FEATURES

  • User errors such as integrate(f, 0:1, 2) are now caught.
  • Add signature argument to debug(), debugonce(), undebug() and isdebugged() for more conveniently debugging S3 and S4 methods. (Based on a patch by Gabe Becker.)
  • Add utils::debugcall() and utils::undebugcall() for debugging the function that would be called by evaluating the given expression. When the call is to an S4 generic or standard S3 generic, debugcall() debugs the method that would be dispatched. A number of internal utilities were added to support this, most notably utils::isS3stdGeneric(). (Based on a patch by Gabe Becker.)
  • Add utils::strcapture(). Given a character vector and a regular expression containing capture expressions, strcapture() will extract the captured tokens into a tabular data structure, typically a data.frame.
  • str() and strOptions() get a new option drop.deparse.attr with improved but changed default behaviour for expressions. For expression objects x, str(x) now may remove extraneous white space and truncate long lines.
  • str(<looooooooong_string>) is no longer very slow; inspired by Mikko Korpela’s proposal in PR#16527.
  • str(x)’s default method is more “accurate” and hence somewhat more generous in displaying character vectors; this will occasionally change R outputs (and need changes to some ‘*.Rout(.save)’ files).
    For a classed integer vector such as x <- xtabs(~ c(1,9,9,9)), str(x) now shows both the class and "int", instead of only the latter.
  • isSymmetric(m) is much faster for large asymmetric matrices m via pre-tests and a new option tol1 (with which strict back compatibility is possible but not the default).
  • The result of eigen() now is of class "eigen" in the default case when eigenvectors are computed.
  • Zero-length date and date-time objects (of classes "POSIX[cl]?t") now print() “recognizably”.
  • xy.coords() and xyz.coords() get a new setLab option.
  • The method argument of sort.list(), order() and sort.int() gains an "auto" option (the default) which should behave the same as before when method was not supplied.
  • stopifnot(E, ..) now reports differences when E is a call to all.equal() and that is not true.
  • boxplot(<formula>, *) gain optional arguments drop, sep, and lex.order to pass to split.default() which itself gains an argument lex.order to pass to interaction() for more flexibility.
  • The plot() method for ppr() has enhanced default labels (xmin and main).
  • sample.int() gains an explicit useHash option (with a back compatible default).
  • identical() gains an ignore.srcref option which drops "srcref" and similar attributes when true (as by default).
  • diag(x, nrow = n) now preserves typeof(x), also for logical, integer and raw x (and as previously for complex and numeric).
  • smooth.spline() now allows direct specification of lambda, gets a hatvalues() method and keeps tol in the result, and optionally parts of the internal matrix computations.
  • addNA() is faster now, e.g. when applied twice. (Part of PR#16895.)
  • New option rstandard(<lm>, type = "predicted") provides the “PRESS”–related leave-one-out cross-validation errors for linear models.
  • After seven years of deprecation, duplicated factor levels now produce a warning when printed and an error in levels<- instead of a warning.
  • Invalid factors, e.g., with duplicated levels (invalid but constructable) now give a warning when printed, via new function .valid.factor().
  • sessionInfo() has been updated for Apple’s change in OS naming as from ‘10.12’ (‘macOS Sierra’ vs ‘OS X El Capitan’). Its toLatex() method now includes the running component.
  • options(interrupt=) can be used to specify a default action for user interrupts. For now, if this option is not set and the error option is set, then an unhandled user interrupt invokes the error option. (This may be dropped in the future as interrupt conditions are not error conditions.)
  • In most cases user interrupt handlers will be called with a "resume" restart available. Handlers can invoke this restart to resume computation. At the browser prompt the r command will invoke a "resume" restart if one is available. Some read operations cannot be resumed properly when interrupted and do not provide a "resume" restart.
  • Radix sort is now chosen by method = "auto" for sort.int() for double vectors (and hence used for sort() for unclassed double vectors), excluding ‘long’ vectors. sort.int(method = "radix") no longer rounds double vectors.
  • The default and data.frame methods for stack() preserve the names of empty elements in the levels of the ind column of the return value. Set the new drop argument to TRUE for the previous behavior.
  • Speedup in simplify2array() and hence sapply() and mapply() (for the case of names and common length > 1), thanks to Suharto Anggono’s PR#17118.
  • table(x, exclude = NULL) now sets useNA = "ifany" (instead of "always"). Together with the bug fixes for this case, this recovers more consistent behaviour compatible to older versions of R. As a consequence, summary() for a logical vector no longer reports (zero) counts for NA when there are no NAs.
  • dump.frames() gets a new option include.GlobalEnv which allows to also dump the global environment, thanks to Andreas Kersting’s proposal in PR#17116.
  • system.time() now uses message() instead of cat() when terminated early, such that suppressMessages() has an effect; suggested by Ben Bolker.
  • citation() supports ‘inst/CITATION’ files from package source trees, with lib.loc pointing to the directory containing the package.
  • try() gains a new argument outFile with a default that can be modified via options(try.outFile = .), useful notably for Sweave.
  • The unexported low-level functions in package parallel for passing serialized R objects to and from forked children now support long vectors on 64-bit platforms. This removes some limits on higher-level functions such as mclapply() (but returning gigabyte results from forked processes via serialization should be avoided if at all possible).
  • Connections now print() without error even if invalid, e.g. after having been destroyed.
  • apropos() and find(simple.words = FALSE) no longer match object names starting with . which are known to be internal objects (such as .__S3MethodsTable__.).
  • Convenience function hasName() has been added; it is intended to replace the common idiom !is.null(x$name) without the usually unintended partial name matching.
  • strcapture() no longer fixes column names nor coerces strings to factors (suggested by Bill Dunlap).
  • strcapture() returns NA for non-matching values in x (suggested by Bill Dunlap).
  • source() gets new optional arguments, notably exprs; this is made use of in the new utility function withAutoprint().
  • sys.source() gets a new toplevel.env argument. This argument is useful for frameworks running package tests; contributed by Tomas Kalibera.
  • Sys.setFileTime() and file.copy(copy.date = TRUE) will set timestamps with fractions of seconds on platforms/filesystems which support this.
  • (Windows only.) file.info() now returns file timestamps including fractions of seconds; it has done so on other platforms since R 2.14.0. (NB: some filesystems do not record modification and access timestamps to sub-second resolution.)
  • The license check enabled by options(checkPackageLicense = TRUE) is now done when the package’s namespace is first loaded.
  • ppr() and supsmu() get an optional trace argument, and ppr(.., sm.method = ..spline) is no longer limited to sample size n <= 2500.
  • The POSIXct method for print() gets optional tz and usetz arguments, thanks to a report from Jennifer S. Lyon.
  • New function check_packages_in_dir_details() in package tools for analyzing package-check log files to obtain check details.
  • Package tools now exports function CRAN_package_db() for obtaining information about current packages in the CRAN package repository, and several functions for obtaining the check status of these packages.
  • The (default) Stangle driver Rtangle allows annotate to be a function and gets a new drop.evalFALSE option.
  • The default method for quantile(x, prob) should now be monotone in prob, even in border cases, see PR#16672.
  • bug.report() now tries to extract an email address from a BugReports field, and if there is none, from a Contacts field.
  • The format() and print() methods for object.size() results get new options standard and digits; notably, standard = "IEC" and standard = "SI" allow more standard (but less common) abbreviations than the default ones, e.g. for kilobytes. (From contributions by Henrik Bengtsson.)
  • If a reference class has a validity method, validObject will be called automatically from the default initialization method for reference classes.
  • tapply() gets new option default = NA allowing to change the previously hardcoded value.
  • read.dcf() now consistently interprets any ‘whitespace’ to be stripped to include newlines.
  • The maximum number of DLLs that can be loaded into R e.g. via dyn.load() can now be increased by setting the environment variable R_MAX_NUM_DLLS before starting R.
  • Assigning to an element of a vector beyond the current length now over-allocates by a small fraction. The new vector is marked internally as growable, and the true length of the new vector is stored in the truelength field. This makes building up a vector result by assigning to the next element beyond the current length more efficient, though pre-allocating is still preferred. The implementation is subject to change and not intended to be used in packages at this time.
  • Loading the parallel package namespace no longer sets or changes the .Random.seed, even if R_PARALLEL_PORT is unset. NB: This can break reproducibility of output, and did for a CRAN package.
  • Methods "wget" and "curl" for download.file() now give an R error rather than a non-zero return value when the external command has a non-zero status.
  • Encoding name "utf8" is mapped to "UTF-8". Many implementations of iconv accept "utf8", but not GNU libiconv (including the late 2016 version 1.15).
  • sessionInfo() shows the full paths to the library or executable files providing the BLAS/LAPACK implementations currently in use (not available on Windows).
  • The binning algorithm used by bandwidth selectors bw.ucv(), bw.bcv() and bw.SJ() switches to a version linear in the input size n for n > nb/2. (The calculations are the same, but for larger n/nb it is worth doing the binning in advance.)
  • There is a new option PCRE_study which controls when grep(perl = TRUE) and friends ‘study’ the compiled pattern. Previously this was done for 11 or more input strings: it now defaults to 10 or more (but most examples need many more for the difference from studying to be noticeable).
  • grep(perl = TRUE) and friends can now make use of PCRE’s Just-In-Time mechanism, for PCRE >= 8.20 on platforms where JIT is supported. It is used by default whenever the pattern is studied (see the previous item). (Based on a patch from Mikko Korpela.) This is controlled by a new option PCRE_use_JIT. Note that in general this makes little difference to the speed, and may take a little longer: its benefits are most evident on strings of thousands of characters. As a side effect it reduces the chances of C stack overflow in the PCRE library on very long strings (millions of characters, but see next item).

    Warning: segfaults were seen using PCRE with JIT enabled on 64-bit Sparc builds.

  • There is a new option PCRE_limit_recursion for grep(perl = TRUE) and friends to set a recursion limit taking into account R‘s estimate of the remaining C stack space (or 10000 if that is not available). This reduces the chance of C stack overflow, but because it is conservative may report a non-match (with a warning) in examples that matched before. By default it is enabled if any input string has 1000 or more bytes. (PR#16757)
  • getGraphicsEvent() now works on X11(type = "cairo") devices. Thanks to Frederick Eaton (for reviving an earlier patch).
  • There is a new argument onIdle for getGraphicsEvent(), which allows an R function to be run whenever there are no pending graphics events. This is currently only supported on X11 devices. Thanks to Frederick Eaton.
  • The deriv() and similar functions now can compute derivatives of log1p(), sinpi() and similar one-argument functions, thanks to a contribution by Jerry Lewis.
  • median() gains a formal ... argument, so methods with extra arguments can be provided.
  • strwrap() reduces indent if it is more than half width rather than giving an error. (Suggested by Bill Dunlap.)
  • When the condition code in if(.) or while(.) is not of length one, an error instead of a warning may be triggered by setting an environment variable, see the help page.
  • Formatting and printing of bibliography entries (bibentry) is more flexible and better documented. Apart from setting options(citation.bibtex.max = 99) you can also use print(<citation>, bibtex=TRUE) (or format(..)) to get the BibTeX entries in the case of more than one entry. This also affects citation(). Contributions to enable style = "html+bibtex" are welcome.

C-LEVEL FACILITIES

  • Entry points R_MakeExternalPtrFn and R_ExternalPtrFn are now declared in header ‘Rinternals.h’ to facilitate creating and retrieving an R external pointer from a C function pointer without ISO C warnings about the conversion of function pointers.
  • There was an exception for the native Solaris C++ compiler to the dropping (in R 3.3.0) of legacy C++ headers from headers such as ‘R.h’ and ‘Rmath.h’ — this has now been removed. That compiler has strict C++98 compliance hence does not include extensions in its (non-legacy) C++ headers: some packages will need to request C++11 or replace non-C++98 calls such as lgamma: see §1.6.4 of ‘Writing R Extensions’.Because it is needed by about 70 CRAN packages, headers ‘R.h’ and ‘Rmath.h’ still declare
    use namespace std;

    when included on Solaris.

  • When included from C++, the R headers now use forms such as std::FILE directly rather than including the line
    using std::FILE;

    C++ code including these headers might be relying on the latter.

  • Headers ‘R_ext/BLAS.h’ and ‘R_ext/Lapack.h’ have many improved declarations including const for double-precision complex routines. Inter alia this avoids warnings when passing ‘string literal’ arguments from C++11 code.
  • Headers for Unix-only facilities ‘R_ext/GetX11Image.h’, ‘R_ext/QuartzDevice.h’ and ‘R_ext/eventloop.h’ are no longer installed on Windows.
  • No-longer-installed headers ‘GraphicsBase.h’, ‘RGraphics.h’, ‘Rmodules/RX11.h’ and ‘Rmodules/Rlapack.h’ which had a LGPL license no longer do so.
  • HAVE_UINTPTR_T is now defined where appropriate by Rconfig.h so that it can be included before Rinterface.h when CSTACK_DEFNS is defined and a C compiler (not C++) is in use. Rinterface.h now includes C header ‘stdint.h’ or C++11 header ‘cstdint’ where needed.
  • Package tools has a new function package_native_routine_registration_skeleton() to assist adding native-symbol registration to a package. See its help and §5.4.1 of ‘Writing R Extensions’ for how to use it. (At the time it was added it successfully automated adding registration to over 90% of CRAN packages which lacked it. Many of the failures were newly-detected bugs in the packages, e.g. 50 packages called entry points with varying numbers of arguments and 65 packages called entry points not in the package.)

INSTALLATION on a UNIX-ALIKE

  • readline headers (and not just the library) are required unless configuring with –with-readline=no.
  • configure now adds a compiler switch for C++11 code, even if the compiler supports C++11 by default. (This ensures that g++ 6.x uses C++11 mode and not its default mode of C++14 with ‘GNU extensions’.) The tests for C++11 compliance are now much more comprehensive. For gcc < 4.8, the tests from R 3.3.0 are used in order to maintain the same behaviour on Linux distributions with long-term support.
  • An alternative compiler for C++11 is now specified with CXX11, not CXX1X. Likewise C++11 flags are specified with CXX11FLAGS and the standard (e.g., -std=gnu++11) is specified with CXX11STD.
  • configure now tests for a C++14-compliant compiler by testing some basic features. This by default tries flags for the compiler specified by CXX11, but an alternative compiler, options and standard can be specified by variables CXX14, CXX14FLAGS and CXX14STD (e.g., -std=gnu++14).
  • There is a new macro CXXSTD to help specify the standard for C++ code, e.g. -std=c++98. This makes it easier to work with compilers which default to a later standard: for example, with CXX=g++6 CXXSTD=-std=c++98 configure will select commands for g++ 6.x which conform to C++11 and C++14 where specified but otherwise use C++98.
  • Support for the defunct IRIX and OSF/1 OSes and Alpha CPU has been removed.
  • configure checks that the compiler specified by $CXX $CXXFLAGS is able to compile C++ code.
  • configure checks for the required header ‘sys/select.h’ (or ‘sys/time.h’ on legacy systems) and system call select and aborts if they are not found.
  • If available, the POSIX 2008 system call utimensat will be used by Sys.setFileTime() and file.copy(copy.date = TRUE). This may result in slightly more accurate file times. (It is available on Linux and FreeBSD but not macOS.)
  • The minimum version requirement for libcurl has been reduced to 7.22.0, although at least 7.28.0 is preferred and earlier versions are little tested. (This is to support Debian 7 ‘Wheezy’ LTS and Ubuntu ‘Precise’ 12.04 LTS, although the latter is close to end-of-life.)
  • configure tests for a C++17-compliant compiler. The tests are experimental and subject to change in the future.

INCLUDED SOFTWARE

  • (Windows only) Tcl/Tk version 8.6.4 is now included in the binary builds. The ‘tcltk*.chm’ help file is no longer included; please consult the online help at http://www.tcl.tk/man/ instead.
  • The version of LAPACK included in the sources has been updated to 3.7.0: no new routines have been added to R.

PACKAGE INSTALLATION

  • There is support for compiling C++14 or C++17 code in packages on suitable platforms: see ‘Writing R Extensions’ for how to request this.
  • The order of flags when LinkingTo other packages has been changed so their include directories come earlier, before those specified in CPPFLAGS. This will only have an effect if non-system include directories are included with -I flags in CPPFLAGS (and so not the default -I/usr/local/include which is treated as a system include directory on most platforms).
  • Packages which register native routines for .C or .Fortran need to be re-installed for this version (unless installed with R-devel SVN revision r72375 or later).
  • Make variables with names containing CXX1X are deprecated in favour of those using CXX11, but for the time being are still made available via file ‘etc/Makeconf’. Packages using them should be converted to the new forms and made dependent on R (>= 3.4.0).

UTILITIES

  • Running R CMD check --as-cran with _R_CHECK_CRAN_INCOMING_REMOTE_ false now skips tests that require remote access. The remaining (local) tests typically run quickly compared to the remote tests.
  • R CMD build will now give priority to vignettes produced from files in the ‘vignettes’ directory over those in the ‘inst/doc’ directory, with a warning that the latter are being ignored.
  • R CMD config gains a --all option for printing names and values of all basic configure variables. It now knows about all the variables used for the C++98, C++11 and C++14 standards.
  • R CMD check now checks that output files in ‘inst/doc’ are newer than the source files in ‘vignettes’.
  • For consistency with other package subdirectories, files named ‘*.r’ in the ‘tests’ directory are now recognized as tests by R CMD check. (Wish of PR#17143.)
  • R CMD build and R CMD check now use the union of R_LIBS and .libPaths(). They may not be equivalent, e.g., when the latter is determined by R_PROFILE.
  • R CMD build now preserves dates when it copies files in preparing the tarball. (Previously on Windows it changed the dates on all files; on Unix, it changed some dates when installing vignettes.)
  • The new option R CMD check --no-stop-on-test-error allows running the remaining tests (under ‘tests/’) even if one gave an error.
  • Check customization via environment variables to detect side effects of .Call() and .External() calls which alter their arguments is described in §8 of the ‘R Internals’ manual.
  • R CMD check now checks any BugReports field to be non-empty and a suitable single URL.
  • R CMD check --as-cran now NOTEs if the package does not register its native routines or does not declare its intentions on (native) symbol search. (This will become a WARNING in due course.)

DEPRECATED AND DEFUNCT

  • (Windows only) Function setInternet2() is defunct.
  • Installation support for readline emulations based on editline (aka libedit) is deprecated.
  • Use of the C/C++ macro NO_C_HEADERS is defunct and silently ignored.
  • unix.time(), a traditional synonym for system.time(), has been deprecated (a short example of this and the following item follows after this list).
  • structure(NULL, ..) is now deprecated as you cannot set attributes on NULL.
  • Header ‘Rconfig.h’ no longer defines SUPPORT_OPENMP; instead use _OPENMP (as documented for a long time).
  • (C-level Native routine registration.) The deprecated styles member of the R_CMethodDef and R_FortranMethodDef structures has been removed. Packages using these will need to be re-installed for R 3.4.0.
  • The deprecated support for PCRE versions older than 8.20 will be removed in R 3.4.1. (Versions 8.20–8.31 will still be accepted but remain deprecated.)
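
A short illustration of the unix.time() and structure(NULL, ..) deprecations noted above; this is a sketch of the behaviour under R 3.4.0 and the exact warning messages may differ:

# unix.time() still works for now but emits a deprecation warning
unix.time(Sys.sleep(0.1))    # use system.time(Sys.sleep(0.1)) instead

# attributes cannot be set on NULL, so structure(NULL, ...) now warns
structure(NULL, class = "myclass")    # deprecated
structure(list(), class = "myclass")  # use an empty list (or other object) instead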

BUG FIXES

  • Getting or setting body() or formals() on non-functions for now signals a warning and may become an error for setting.
  • match(x, t), duplicated(x) and unique(x) work as documented for complex numbers with NAs or NaNs, where all those containing NA do match, whereas in the case of NaN‘s both real and imaginary parts must match, compatibly with how print() and format() work for complex numbers.
  • deparse(<complex>, options = "digits17") prints more nicely now, mostly thanks to a suggestion by Richie Cotton.
  • Rotated symbols in plotmath expressions are now positioned correctly on x11(type = "Xlib"). (PR#16948)
  • as<-() avoids an infinite loop when a virtual class is interposed between a subclass and an actual superclass.
  • Fix level propagation in unlist() when the list contains zero-length lists or factors.
  • Fix S3 dispatch on S4 objects when the methods package is not attached.
  • Internal S4 dispatch sets .Generic in the method frame for consistency with standardGeneric(). (PR#16929)
  • Fix order(x, decreasing = TRUE) when x is an integer vector containing MAX_INT. Ported from a fix Matt Dowle made to data.table.
  • Fix caching by callNextMethod(), resolves PR#16973 and PR#16974.
  • grouping() puts NAs last, to be consistent with the default behavior of order().
  • Point mass limit cases: qpois(-2, 0) now gives NaN with a warning and qgeom(1, 1) is 0. (PR#16972)
  • table() no longer drops an "NaN" factor level, and better obeys exclude = <chr>, thanks to Suharto Anggono’s patch for PR#16936. Also, in the case of exclude = NULL and NAs, these are tabulated correctly (again). Further, table(1:2, exclude = 1, useNA = "ifany") no longer erroneously reports <NA> counts. Additionally, all cases of empty exclude are equivalent, and useNA is not overwritten when specified (as it was by exclude = NULL).
  • wilcox.test(x, conf.int=TRUE) no longer errors out in cases where the confidence interval is not available, such as for x = 0:2.
  • droplevels(f) now keeps <NA> levels when present.
  • In integer arithmetic, NULL is now treated as integer(0) whereas it was previously treated as double(0).
  • The radix sort considers NA_real_ and NaN to be equivalent in rank (like the other sort algorithms).
  • When index.return=TRUE is passed to sort.int(), the radix sort treats NAs like sort.list() does (like the other sort algorithms).
  • When in tabulate(bin, nbin) length(bin) is larger than the maximal integer, the result is now of type double and hence no longer silently overflows to wrong values. (PR#17140)
  • as.character.factor() respects S4 inheritance when checking the type of its argument. (PR#17141)
  • The factor method for print() no longer sets the class of the factor to NULL, which would violate a basic constraint of an S4 object.
  • formatC(x, flag = f) allows two new flags, and signals an error for invalid flags also in the case of character formatting.
  • Reading from file("stdin") now also closes the connection and hence no longer leaks memory when reading from a full pipe, thanks to Gábor Csárdi, see thread starting at https://stat.ethz.ch/pipermail/r-devel/2016-November/073360.html.
  • Failure to create file in tempdir() for compressed pdf() graphics device no longer errors (then later segfaults). There is now a warning instead of error and compression is turned off for the device. Thanks to Alec Wysoker (PR#17191).
  • Asking for methods() on "|" returns only S3 methods. See https://stat.ethz.ch/pipermail/r-devel/2016-December/073476.html.
  • dev.capture() using Quartz Cocoa device (macOS) returned invalid components if the back-end chose to use ARGB instead of RGBA image format. (Reported by Noam Ross.)
  • seq("2", "5") now works too, equivalently to "2":"5" and seq.int().
  • seq.int(to = 1, by = 1) is now correct, other cases are integer (instead of double) when seq() is integer too, and the “non-finite” error messages are consistent between seq.default() and seq.int(), no longer mentioning NaN etc.
  • rep(x, times) and rep.int(x, times) now work when times is larger than the largest value representable in an integer vector. (PR#16932)
  • download.file(method = "libcurl") does not check for URL existence before attempting downloads; this is more robust to servers that do not support HEAD or range-based retrieval, but may create empty or incomplete files for aborted download requests.
  • Bandwidth selectors bw.ucv(), bw.bcv() and bw.SJ() now avoid integer overflow for large sample sizes.
  • str() no longer shows "list output truncated", in cases that list was not shown at all. Thanks to Neal Fultz (PR#17219)
  • Fix for cairo_pdf() (and svg() and cairo_ps()) when replaying a saved display list that contains a mix of grid and graphics output. (Report by Yihui Xie.)
  • The str() and as.hclust() methods for "dendrogram" now also work for deeply nested dendrograms thanks to non-recursive implementations by Bradley Broom.
  • sample() now uses two uniforms for added precision when the uniform generator is Knuth-TAOCP, Knuth-TAOCP-2002, or a user-defined generator and the population size is 2^25 or greater.
  • If a vignette in the ‘vignettes’ directory is listed in ‘.Rbuildignore’, R CMD build would not include it in the tarball, but would include it in the vignette database, leading to a check warning. (PR#17246)
  • tools::latexToUtf8() infinite looped on certain inputs. (PR#17138)
  • terms.formula() ignored argument names when determining whether two terms were identical. (PR#17235)
  • callNextMethod() was broken when called from a method that augments the formal arguments of a primitive generic.
  • Coercion of an S4 object to a vector during sub-assignment into a vector failed to dispatch through the as.vector() generic (often leading to a segfault).
  • Fix problems in command completion: Crash (PR#17222) and junk display in Windows, handling special characters in filenames on all systems.

logo

Continue Reading…

Collapse

Read More

Data Analytics Is The Key Skill for The Modern Engineer

Many process manufacturing owner-operators in this next phase of a digital shift have engaged in technology pilots to explore options for reducing costs, meeting regulatory compliance, and/or increasing overall equipment effectiveness (OEE). Despite this transformation, the adoption of advanced analytics tools still presents certain challenges. The extensive and complicated tooling

The post Data Analytics Is The Key Skill for The Modern Engineer appeared first on Dataconomy.

Continue Reading…

Collapse

Read More

Building an R training environment

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

I recently delivered a day of training at SQLBits and I really upped my game in terms of infrastructure for it. The resultant solution was super smooth and mitigated all the install issues and preparation for attendees. This meant we got to spend the whole day doing R, instead of troubleshooting.

I’m so happy with the solution for an online R training environment that I want to share the solution, so you can take it and use it for when you need to do training.

TL;DR

Workflow for constructing a training environment with RStudio

To get a web-facing Rstudio server, with login details for users, and all the prerequisites installed:

  1. Generate usernames and passwords
  2. Build a Docker image based on rocker that contains your users and any prerequisites
  3. Run docker image on VM with a port binding to make the Rstudio accessible externally

Where the magick happens

The first thing I wanted was user names and passwords for people. I wanted these preprepared and, ideally, on something with my contact info. I’d used moo before so I knew I could do up to 100 custom cards – I just needed to make them. I used the package random and the package magick to generate credentials and write them on to PDFs that could then be uploaded to moo.

You can read the whole gist but the key code includes:

  • A random password generation step
write.csv(data.frame(usernames=paste0("u",stringr::str_pad(1:n,pad = "0",width = 3))
                     ,pwords=random::randomStrings(n,len = 6,digits = FALSE,loweralpha = FALSE))
          ,"users100.csv",row.names = FALSE)
  • A read of my blank card back
myimage <- magick::image_read(myfile,density = 500)
  • In a loop (there’s probably a better way!; a stitched-together sketch follows after these fragments)
    • Writing out the text to the card back
    • Save a new card back
myimage2 <- magick::image_annotate(  myimage, "lockedata.biz", font = "Roboto", size = 90, location = "+350+75" )
magick::image_write(myimage2, paste0("GeneratedBacks/",basename, up[i,"usernames"], ".", baseext))
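
For completeness, the fragments above could be stitched together roughly as below. This is only a sketch: myfile, basename and baseext are assumed to describe the blank card-back image, and the annotation text and location are taken from the fragments rather than from the full gist.

# read the generated credentials and the blank card back once
up      <- read.csv("users100.csv", stringsAsFactors = FALSE)
myimage <- magick::image_read(myfile, density = 500)

for (i in seq_len(nrow(up))) {
  # write the text onto the card back
  myimage2 <- magick::image_annotate(myimage, "lockedata.biz",
                                     font = "Roboto", size = 90, location = "+350+75")
  # save a new card back, one per username
  magick::image_write(myimage2,
                      paste0("GeneratedBacks/", basename, up[i, "usernames"], ".", baseext))
}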

[Images: the filled-in card back, the blank card back before writing, and the card front]

Making users

The next, and much more tricksy, bit was writing a script that would correctly create users. Peter Shaw helped me out initially and the code looks like:

for userdetails in `cat users.csv`
do
    user=`echo $userdetails | cut -f 1 -d ,`
    passwd=`echo $userdetails | cut -f 2 -d ,`
    useradd $user -m -p `mkpasswd $passwd`
done

This goes through each line in the CSV and adds a user with a specified password. We use the mkpasswd utility here to encrypt the password. mkpasswd is part of the whois program and we’ll need that tidbit later.

This needs to be done not only to give people login info but a home directory to save their stuff into. If they don’t have home directories then Rstudio throws a tantrum with them.

rocker

The next bit was getting some sort of server with Rstudio installed. Dirk Eddelbuettel made a bunch of Docker containers available under the rocker brand. One of these has Rstudio server and the tidyverse installed. Using this preconfigured container would mean I’d only have to add my finishing touches to it to get everything I needed.

Dockerfile

The last but one step was building my own customised Docker container that used rocker as a base, created all my users, and installed anything extra I wanted on top of the tidyverse.

FROM rocker/tidyverse
MAINTAINER Steph Locke <steph@itsalocke.com>
RUN apt-get update && apt-get install -y libudunits2-0 libudunits2-dev whois
RUN  R -e 'devtools::install_github("lockedata/TextAnalysis")' 
RUN  R -e 'devtools::install_github("dgrtwo/widyr")' 
ADD https://gist.githubusercontent.com/stephlocke/0036331e7a3338e965149833e92c1360/raw/607fb01602e143671c83216a4c5f1ad2deb10bf6/mkusers.sh /
ADD https://gist.githubusercontent.com/stephlocke/0036331e7a3338e965149833e92c1360/raw/6d967c19d9c73cecd1e2d4da0eed2cd646790bd5/users.csv /
RUN chmod 777 /mkusers.sh
RUN /mkusers.sh

This starts with the tidyverse & Rstudio then:

  • adds the requisite programs for dependencies in my package and whois for mkpasswd to be able to work
  • installs packages from github, notably the one designed to facilitate the day of text analysis
  • gets the shell script and the csv from the gist
  • makes the shell script executable and then runs it

C’est fini!

The final step was to actually run this container on a Digital Ocean virtual machine (but it could be run anywhere) like so:

docker pull stephlocke/montypythonwkrshopdocker 
docker run -d -p 80:8787 stephlocke/montypythonwkrshopdocker

You can take my github repo with the Dockerfile and customise it to suit your own requirements, build it in Docker and then use it whenever you need an R training environment that can be customised to your needs and is accessible from a web browser.

You can use it for general working environments but make sure to read my data persistence in Docker post first as the run command does not have an external volume mapped and if your host for docker crashes, everything will be lost permanently. An extra incentive to use source control maybe!

The post Building an R training environment appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data.


Continue Reading…

Collapse

Read More

#ICLR2017 Monday Afternoon Program

 
ICLR 2017 is taking place in Toulon this week; there will be a blog post for each half day that features direct links to papers and attendant code where available. The meeting will be featured live on Facebook here at: https://www.facebook.com/iclr.cc/ . If you want to say hi, I am around.
 
Afternoon Session – Session Chair: Joan Bruna (sponsored by Baidu) 14.30 - 15.10 Invited talk 2: Benjamin Recht
15.10 - 15.30 Contributed Talk 3: Understanding deep learning requires rethinking generalization - BEST PAPER AWARD
15.30 - 15.50 Contributed Talk 4: Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
15.50 - 16.10 Contributed Talk 5: Towards Principled Methods for Training Generative Adversarial Networks
16.10 - 16.30 Coffee Break
16.30 - 18.20 Poster Session 2 (Conference Papers, Workshop Papers)
18.20 - 18.30 Group photo at stadium attached to Neptune Congress Center.
 
C1: Neuro-Symbolic Program Synthesis
C2: Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy (code)
C3: Trained Ternary Quantization (code)
C4: DSD: Dense-Sparse-Dense Training for Deep Neural Networks (code)
C5: A Compositional Object-Based Approach to Learning Physical Dynamics (code, project site)
C6: Multilayer Recurrent Network Models of Primate Retinal Ganglion Cells
C7: Improving Generative Adversarial Networks with Denoising Feature Matching (chainer implementation)
C8: Transfer of View-manifold Learning to Similarity Perception of Novel Objects
C9: What does it take to generate natural textures?
C10: Emergence of foveal image sampling from learning to attend in visual scenes
C11: PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications
C12: Learning to Optimize
C13: Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
C14: Optimal Binary Autoencoding with Pairwise Correlations
C15: On the Quantitative Analysis of Decoder-Based Generative Models (evaluation code)
C16: Adversarial machine learning at scale
C17: Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks
C18: Capacity and Learnability in Recurrent Neural Networks
C19: Deep Learning with Dynamic Computation Graphs  (TensorFlow code)
C20: Exploring Sparsity in Recurrent Neural Networks
C21: Structured Attention Networks (code)
C22: Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning
C23: Variational Lossy Autoencoder
C24: Learning to Query, Reason, and Answer Questions On Ambiguous Texts
C25: Deep Biaffine Attention for Neural Dependency Parsing
C26: A Compare-Aggregate Model for Matching Text Sequences (code)
C27: Data Noising as Smoothing in Neural Network Language Models
C28: Neural Variational Inference For Topic Models
C29: Bidirectional Attention Flow for Machine Comprehension (code, page)
C30: Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
C31: Stochastic Neural Networks for Hierarchical Reinforcement Learning
C32: Learning Invariant Feature Spaces to Transfer Skills with Reinforcement Learning (video)
C33: Third Person Imitation Learning
 
W1: Audio Super-Resolution using Neural Networks (code)
W2: Semantic embeddings for program behaviour patterns
W3: De novo drug design with deep generative models : an empirical study
W4: Memory Matching Networks for Genomic Sequence Classification
W5: Char2Wav: End-to-End Speech Synthesis
W6: Fast Chirplet Transform Injects Priors in Deep Learning of Animal Calls and Speech
W7: Weight-averaged consistency targets improve semi-supervised deep learning results
W8: Particle Value Functions
W9: Out-of-class novelty generation: an experimental foundation
W10: Performance guarantees for transferring representations (presentation, video)
W11: Generative Adversarial Learning of Markov Chains
W12: Short and Deep: Sketching and Neural Networks
W13: Understanding intermediate layers using linear classifier probes
W14: Symmetry-Breaking Convergence Analysis of Certain Two-layered Neural Networks with ReLU nonlinearity
W15: Neural Combinatorial Optimization with Reinforcement Learning (TensorFlow code)
W16: Tactics of Adversarial Attacks on Deep Reinforcement Learning Agents
W17: Adversarial Discriminative Domain Adaptation (workshop extended abstract)
W18: Efficient Sparse-Winograd Convolutional Neural Networks
W19: Neural Expectation Maximization 

 
 
 
 
 

Continue Reading…

Collapse

Read More

The triumph of polling in France’s first-round election

BACKERS of Marine Le Pen were shy, supporters of Emmanuel Macron were soft, and timid pollsters were herding. And who trusts polls anymore anyway, following the surprise victories of Brexit and Donald Trump?

Continue Reading…

Collapse

Read More


Document worth reading: “Deep Architectures for Modulation Recognition”

We survey the latest advances in machine learning with deep neural networks by applying them to the task of radio modulation recognition. Results show that radio modulation recognition is not limited by network depth and further work should focus on improving learned synchronization and equalization. Advances in these areas will likely come from novel architectures designed for these tasks or through novel training methods. Deep Architectures for Modulation Recognition


Continue Reading…

Collapse

Read More

#ICLR2017 Monday Morning Program

 
ICLR 2017 is taking place in Toulon this week; there will be a blog post for each half day that features direct links to papers from the Open Review section. The meeting will be featured live on Facebook here at: https://www.facebook.com/iclr.cc/ . If you want to say hi, I am around.

Monday April 24, 2017

Morning Session – Session Chair: Dhruv Batra

7.00 - 8.45 Registration
8.45 - 9.00 Opening Remarks
9.00 - 9.40 Invited talk 1: Eero Simoncelli
9.40 - 10.00 Contributed talk 1: End-to-end Optimized Image Compression
10.00 - 10.20 Contributed talk 2: Amortised MAP Inference for Image Super-resolution
10.20 - 10.30 Coffee Break
10.30 - 12.30 Poster Session 1

C1: Making Neural Programming Architectures Generalize via Recursion (slides, code, video)
C2: Learning Graphical State Transitions (code)
C3: Distributed Second-Order Optimization using Kronecker-Factored Approximations
C4: Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes
C5: Neural Program Lattices
C6: Diet Networks: Thin Parameters for Fat Genomics
C7: Unsupervised Cross-Domain Image Generation  (TensorFlow implementation )
C8: Towards Principled Methods for Training Generative Adversarial Networks
C9: Recurrent Mixture Density Network for Spatiotemporal Visual Attention
C10: Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer (PyTorch code)
C11: Pruning Filters for Efficient ConvNets
C12: Stick-Breaking Variational Autoencoders
C13: Identity Matters in Deep Learning
C14: On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
C15: Recurrent Hidden Semi-Markov Model
C16: Nonparametric Neural Networks
C17: Learning to Generate Samples from Noise through Infusion Training
C18: An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax
C19: Highway and Residual Networks learn Unrolled Iterative Estimation
C20: Soft Weight-Sharing for Neural Network Compression (Tutorial)
C21: Snapshot Ensembles: Train 1, Get M for Free
C22: Towards a Neural Statistician
C23: Learning Curve Prediction with Bayesian Neural Networks
C24: Learning End-to-End Goal-Oriented Dialog
C25: Multi-Agent Cooperation and the Emergence of (Natural) Language
C26: Efficient Vector Representation for Documents through Corruption ( code)
C27: Improving Neural Language Models with a Continuous Cache
C28: Program Synthesis for Character Level Language Modeling
C29: Tracking the World State with Recurrent Entity Networks (TensorFlow implementation)
C30: Reinforcement Learning with Unsupervised Auxiliary Tasks (blog post, an implementation )
C31: Neural Architecture Search with Reinforcement Learning (slides, some implementation of appendix A)
C32: Sample Efficient Actor-Critic with Experience Replay
C33: Learning to Act by Predicting the Future
 
 
 
 

Continue Reading…

Collapse

Read More

NLP with AirBnb

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Basic NLP in R

With text survey data about AirBnb

R is a very powerful programming language with many great built in statistical functions and a large repository of additional libraries. In this post I will introduce some basic text analysis to generate a ‘stemmed’ wordcloud and frequency chart from text data.


[Images: AirBnb word cloud and top-6 term frequency chart]

AirBnb


Recently, I had the opportunity to help someone with a research paper written about AirBNB. The author was trying to analyze how customers feel about the company, so she decided to create a survey with Survey Monkey, and had 45 respondents.

This seemed like a good opportunity for some text mining and text analysis in R.
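
As a taster before the full walkthrough, here is a minimal sketch of such a pipeline using the tm, SnowballC and wordcloud packages. The object name responses (a character vector of survey answers) and the exact cleaning steps are my assumptions, not the author's code.

library(tm)
library(SnowballC)
library(wordcloud)

# build and clean a corpus from the free-text survey answers
corpus <- VCorpus(VectorSource(responses))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# term frequencies across all responses
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# stemmed wordcloud and a top-6 frequency chart
wordcloud(names(freq), freq, max.words = 50)
barplot(head(freq, 6), las = 2, main = "Top 6 stemmed terms")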

Continue to the next page for instructions for creating the above visualizations with NLP techniques.

The post NLP with AirBnb appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.


Continue Reading…

Collapse

Read More

Book Memo: “Matrix and Tensor Factorization Techniques for Recommender Systems”

This book presents the algorithms used to provide recommendations by exploiting matrix factorization and tensor decomposition techniques. It highlights well-known decomposition methods for recommender systems, such as Singular Value Decomposition (SVD), UV-decomposition, Non-negative Matrix Factorization (NMF), etc. and describes in detail the pros and cons of each method for matrices and tensors. This book provides a detailed theoretical mathematical background of matrix/tensor factorization techniques and a step-by-step analysis of each method on the basis of an integrated toy example that runs throughout all its chapters and helps the reader to understand the key differences among methods. It also contains two chapters, where different matrix and tensor methods are compared experimentally on real data sets, such as Epinions, GeoSocialRec, Last.fm, BibSonomy, etc. and provides further insights into the advantages and disadvantages of each method.

Continue Reading…

Collapse

Read More

If you did not already know: “mapnik”

Mapnik is a high-powered rendering library that can take GIS data from a number of sources (ESRI shapefiles, PostGIS databases, etc.) and use them to render beautiful 2-dimensional maps. It’s used as the underlying rendering solution for a lot of online mapping services, most notably including MapQuest and the OpenStreetMap project, so it’s a truly production-quality framework. And, despite being written in C++, it comes with bindings for Python and Node, so you can leverage it in the language of your choice.
Render Google Maps Tiles with Mapnik and Python
mapnik google


Continue Reading…

Collapse

Read More

The real lesson of the human genome project

When I was pursuing my PhD, the human genome project was often regarded as both overly ambitious (maybe even impossible) and full of possibilities. To many people’s surprise, the project was declared complete back in 2003, much earlier than expected. Today, we have a “roughly” complete map of the human genome.

Many announced, too soon it turns out, that it would quickly bring new cures, new therapies, that were previously unthinkable. Fifteen years later, the concrete results are somewhat thin on the ground.

But it made one very important difference. Before we could track down people’s genes, it was often believed that our destiny was written out in our DNA. You were what your genes said you were.

So people quickly moved to cataloging the genes of the super smart, the genes of the athletes, the genes of the centenarians. And what have they found? Not a whole lot.

There are a few genes that might help you live a bit longer. There are a few genes more frequent in athletes. And so forth. But the effects are often quite weak. There are exceptions here and there. Some genes give you a much higher chance of getting some cancers, or Alzheimer’s and so forth… but they affect relatively few of us in aggregate.

Chances are that if you have your genome sequenced, you will find that you are slightly lucky in some ways and slightly unlucky in others… but it is all quite boring.

If we could build a human body from the DNA up, selecting only the best genes, the result would be nothing like a superman.

If you have a toned body, some of your genes are activated, but it turns out that your nerdy friend who barely looks like a man has more or less the same genes; they are just silenced.

And so, I think that the main lesson is that human biology is far more malleable than many people thought. And this means that the badly dressed kid next to you, who looks like an unhealthy idiot? To a close approximation, he probably has a comparable genetic legacy to yours.

Continue Reading…

Collapse

Read More

Magister Dixit

“Data science is a fancy way to say using numbers and names to answer a question.” Brandon Rohrer ( Dec 19, 2015 )


Continue Reading…

Collapse

Read More

Science really is non-partisan: facts and skepticism annoy everybody

This is a short open letter to those that believe scientists have a “liberal bias” and question their objectivity. I suspect that for many conservatives, this Saturday’s March for Science served as confirmation of this fact. In this post I will try to convince you that this is not the case specifically by pointing out how scientists often annoy the left as much as the right.

First, let me emphasize that scientists are highly appreciative of members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF. Although the current administration did propose a 20% cut to NIH, we are aware that, generally speaking, support for scientific research has traditionally been bipartisan.

It is true that the typical data-driven scientists will disagree, sometimes strongly, with many stances that are considered conservative. For example, most scientists will argue that:

  1. Climate change is real and is driven largely by increased carbon dioxide and other human-made emissions into the atmosphere.
  2. Evolution needs to be part of children’s education and creationism has no place in Science class.
  3. Homosexuality is not a choice.
  4. Science must be publicly funded because the free market is not enough to make science thrive.

But scientists will also hold positions that are often criticized heavily by some of those who identify as politically left wing:

  1. Current vaccination programs are safe and need to be enforced: without herd immunity thousands of children would die.
  2. Genetically modified organisms (GMOs) are safe and are indispensable to fight world hunger. There is no need for warning labels.
  3. Using nuclear energy to power our electrical grid is much less harmful than using natural gas, oil and coal and, currently, more viable than renewable energy.
  4. Alternative medicine, such as homeopathy, naturopathy, faith healing, reiki, and acupuncture, is pseudo-scientific quackery.

The timing of the announcement of the March for Science, along with the organizers’ focus on environmental issues and diversity, may have made it seem like a partisan or left-leaning event, but please also note that many scientists criticized the organizers for this very reason and there was much debate in general. Most scientists I know that went to the march did so not necessarily because they are against Republican administrations, but because they are legitimately concerned about some of the choices of this particular administration and the future of our country if we stop funding and trusting science.

If you haven’t already seen this Neil Degrasse Tyson video on the importance of Science to everyone, I highly recommend it.

Continue Reading…

Collapse

Read More

Hacking maps with ggplot2

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

This is a very short post on mapping with ggplot2.

Quite often, when mapping some data, we do not need to follow scrupulously the formal requirements of geographical maps – the idea is just to show the spatial dimension of the data. For instance, the network of rivers is not the most important information when we map election outcomes. Thus, the simplified mapping allows quite some freedom in transforming the geodata. The classical example of such geodata transformation is the repositioning and rescaling of Alaska and Hawaii so that they can be mapped alongside the mainland of the US. As one may see in this example, usually such geodata transformations utilize quite complex GIS tools in order to reposition an object in the coordinate system.

The interesting feature of mapping with ggplot2 is that, before the actual plotting, geodata has to be fortified (ggplot2::fortify) – transformed to a simple dataframe object. Since fortified geodata is basically a dataframe, some simple transformations could be made really easily.

In my last paper, I needed to show a two-dimensional grouping of the European NUTS-2 regions in 4 quadrants according to GDP per capita and the share of working-age population (see Figure 8 in the preprint). In line with the study setting, I did the grouping separately for Western, Southern, and Eastern Europe. I decided that the most straightforward way to show that on a map would be to visually separate the 3 subregions of Europe. The task is easily doable by tweaking the fortified geodata object – see the code below.

First, the code to prepare the R session and load the (already prepared) data.

library(tidyverse) # version 1.1.1
library(extrafont) # version 0.17
library(ggthemes) # version 3.4.0
font <- "Roboto Condensed"
library(hrbrthemes) # version 0.1.0
# The code is tested on a PC-win7-x64
# R version 3.3.3


# load the prepared geodata and stat data
load("https://ikashnitsky.github.io/doc/misc/map-hacking/map-hacking.Rdata")

# fortify the spatial objects
bord <- fortify(Sborders)
fort <- fortify(Sn2, region = 'id')

Next, I hack the geodata (long and lat variables) moving groups of NUTS-2 regions (Western, Southern, and Eastern Europe) apart. The appropriate values to move the groups of regions were found empirically.

# hack geodata to separate macro-regions
fort_hack <- fort %>% 
        left_join(df %>% select(id, subregion), 'id') %>% 
        mutate(long = ifelse(subregion=='E', long + 5e5, long),
               long = ifelse(subregion=='S', long + 2e5, long),
               lat = ifelse(subregion=='S', lat - 5e5, lat),
               long = ifelse(subregion=='W', long - 2e5, long))

Finally, we are ready to create the schematic map.

# create color pallete
brbg <- RColorBrewer::brewer.pal(11,"BrBG")
brbg4 <- brbg[c(4,9,2,11)]

# create the two-dim legend
ggleg <- ggplot()+
        coord_equal(xlim = c(0,1), ylim = c(0,1), expand = c(0,0))+
        annotate('rect', xmin = .45, xmax = .6, ymin = .1, ymax = .25, 
                 fill = brbg4[1], color = NA)+
        annotate('rect', xmin = .45, xmax = .6, ymin = .4, ymax = .55, 
                 fill = brbg4[2], color = NA)+
        annotate('rect', xmin = .75, xmax = .9, ymin = .1, ymax = .25, 
                 fill = brbg4[3], color = NA)+
        annotate('rect', xmin = .75, xmax = .9, ymin = .4, ymax = .55, 
                 fill = brbg4[4], color = NA)+
        annotate('rect', xmin = .05, xmax = .95, ymin = .05, ymax = .95, 
                 fill = NA, color = "grey20")+
        
        annotate('text', x = .35, y = c(.175, .475), vjust = .5, hjust = 1,
                 size = 6, fontface = 2, label = c('POOR', 'RICH'), family = font) + 
        annotate('text', x = c(.525, .825), y = .65, vjust = 0, hjust = .5,
                 size = 6, fontface = 2, label = c('LOW', 'HIGH'), family = font)+
        annotate('text', x = .1, y = .9, vjust = 1, hjust = 0,
                 size = 7, fontface = 2, label = "LEGEND", family = font)+
        theme_map()

# create the blank map
basemap <- ggplot()+
        coord_equal(ylim=c(900000,5400000), xlim=c(2500000, 7000000), expand = c(0,0))+
        theme_map()+
        theme(panel.border=element_rect(color = 'black',size=.5,fill = NA),
              legend.position = 'none')

# the main map
map_temp <- basemap + 
        geom_map(map = fort_hack, data = df, aes(map_id=id, fill=group))+
        scale_fill_manual(values = brbg4[c(3, 1, 4, 2)])

# now combine the map and the legend
map <- ggplot() + 
        coord_equal(xlim = c(0,1), ylim = c(0,1), expand = c(0,0))+
        annotation_custom(ggplotGrob(map_temp), xmin = 0, xmax = 1, ymin = 0, ymax = 1)+
        annotation_custom(ggplotGrob(ggleg), xmin = 0.72, xmax = 0.99, ymin = 0.72, ymax = 0.99)+
        labs(title = "Labour force and income in EU-27 NUTS-2 regions",
             subtitle = "Within each of the three macro-regions of Europe - Westren, Southern, and Eastern -\nNUTS-2 regions are classified in 4 groups according to the level of GDP per capita\nand the share of working age population in 2008",
             caption = "Data: Eurostat\nAuthor: Ilya Kashnitsky (ikashnitsky.github.io)")+
        theme_ipsum_rc(plot_title_size = 30, subtitle_size = 20, caption_size = 15)

And here is the result.


[Figure: the resulting schematic map of EU-27 NUTS-2 regions with the two-dimensional legend]

To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.


Continue Reading…

Collapse

Read More

April 23, 2017

The meta-hype algorithm


Kevin Lewis pointed me to this article:

There are several methods for building hype. The wealth of currently available public relations techniques usually forces the promoter to judge, a priori, what will likely be the best method. Meta-hype is a methodology that facilitates this decision by combining all identified hype algorithms pertinent for a particular promotion problem. Meta-hype generates a final press release that is at least as good as any of the other models considered for hyping the claim. The overarching aim of this work is to introduce meta-hype to analysts and practitioners. This work compares the performance of journal publication, preprints, blogs, twitter, Ted talks, NPR, and meta-hype to predict successful promotion. A nationwide database including 89,013 articles, tweets, and news stories. All algorithms were evaluated using the total publicity value (TPV) in a test sample that was not included in the training sample used to fit the prediction models. TPV for the models ranged between 0.693 and 0.720. Meta-hype was superior to all but one of the algorithms compared. An explanation of meta-hype steps is provided. Meta-hype is the first step in targeted hype, an analytic framework that yields double hyped promotion with fewer assumptions than the usual publicity methods. Different aspects of meta-hype depending on the context, its function within the targeted promotion framework, and the benefits of this methodology in the addiction to extreme claims are discussed.

I can’t seem to find the link right now, but you get the idea.

The post The meta-hype algorithm appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…

Collapse

Read More

R Packages worth a look

Uncertainty Propagation Analysis (spup)
Uncertainty propagation analysis in spatial environmental modelling following methodology described in Heuvelink et al. (2017) <doi:10.1080/13658810601063951> and Brown and Heuvelink (2007) <doi:10.1016/j.cageo.2006.06.015>. The package provides functions for examining the uncertainty propagation starting from input data and model parameters, via the environmental model onto model outputs. The functions include uncertainty model specification, stochastic simulation and propagation of uncertainty using Monte Carlo (MC) techniques. Uncertain variables are described by probability distributions. Both numerical and categorical data types are handled. Spatial auto-correlation within an attribute and cross-correlation between attributes is accommodated for. The MC realizations may be used as input to the environmental models called from R, or externally.

An Easy Way to Report ROC Analysis (reportROC)
Provides an easy way to report the results of ROC analysis, including: 1. an ROC curve. 2. the value of Cutoff, SEN (sensitivity), SPE (specificity), AUC (Area Under Curve), AUC.SE (the standard error of AUC), PLR (positive likelihood ratio), NLR (negative likelihood ratio), PPV (positive predictive value), NPV (negative predictive value).

Estimating Finite Population Total (fpest)
Given the values of sampled units and selection probabilities, the desraj function in the package computes the estimated value of the total as well as its estimated variance.

Regression Analysis Based on Win Loss Endpoints (WLreg)
Use various regression models for the analysis of win loss endpoints adjusting for non-binary and multivariate covariates.

Smoothed Bootstrap and Random Generation from Kernel Densities (kernelboot)
Smoothed bootstrap and functions for random generation from univariate and multivariate kernel densities. It does not estimate kernel densities.


Continue Reading…

Collapse

Read More

Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization is here.)

[Chart: the New York Times plot of home prices against school quality for the New York/New Jersey region]

This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measure the population of each district (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.

***

One positive about this chart is the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that tradeoff because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.

***

The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:

[Image: an airline flight route map out of New York]

The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.

***

In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.

***

A more subtle issue can be seen when comparing San Francisco and Boston regions:

[Charts: the corresponding plots for the San Francisco and Boston regions]

One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities are close to the average grade level in each chart, so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal positions of the dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.

***

Finally, I'd recommend aggregating the data, and not plot individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable). 

***

If you haven't already, see this related post on my sister blog for a discussion of the data analysis.

 

 

 

 

Continue Reading…

Collapse

Read More

Dispute over analysis of school quality and home prices shows social science is hard

Most of my friends with families fret over school quality when deciding where to buy their homes. It's well known that good school districts are also associated with expensive houses. A feedback cycle is at work here: home prices surge where there are good schools; only richer people can afford to buy such homes; wealth brings other advantages, and so the schools tend to have better students, which leads to higher test scores.

A few weeks ago, the New York Times Upshot team ran some visual analytics to illustrate the correlation between school quality and home prices (link). Good for them, they went beyond a simple bivariate analysis. They also added commute time to the mix. The primary concern appears to be comparing so-called central cities with their suburbs in their respective regions. Some of these suburbs have better schools but the commute times are on average higher. Is it worth it to move to such neighborhoods?

This article has attracted much controversy. Alexander Russo summarizes some of the complaints here. The key objections cited in his report are as follows:

  • using test scores as a proxy for school quality is simple-minded
  • no accounting for other important variables such as demographics, backgrounds and so on (i.e. no attempt to control for other variables)
  • not causal: no one believes that moving a randomly selected student from a "bad" district to a "good" district will cause his/her test scores to jump by say two grade levels
  • using grade equivalent units which no one can understand

There are also complaints about the visual design. I will have some comments in a separate post on the sister blog. (The post should be available soon after this post goes live.)

[Chart: the New York Times plot of home prices against school quality for the New York/New Jersey region]

***

I think #2 and #3 above are very important caveats. Point #1 is definitely a concern but critics admit that there are no great proxies. I will get to Point #4 below.

For me, the biggest issue is something missing completely from the original analysis: what decision scientists call the "indifference curves". What underlies this analysis is the assumption that families are making trade-offs between commute times, school quality and home prices. An indifference curve makes explicit what these trade-offs are: for example, how many more dollars per square foot is someone willing to pay in order to send kids to a school that is one grade level above another school? These curves are generally not linear. The value of moving from 3.5 to 4 grade levels is not as high as that of going from 1.5 to 2 grade levels, even though both shifts are precisely half a grade.

Once these trade-offs are in focus, we see the relevance of complaint #2. In the New York area chart, shown above, almost all of the districts are in the lower right quadrant relative to NYC, which means that a lot of districts on average have higher test scores and lower home prices than NYC. So, if those two factors were the primary reasons for selecting where to live, then no one would rationally choose to live in NYC! The addition of the third factor indicates that commute time may be another key consideration. Is commute time able to explain all of the data? That's not something easy to see from this chart, especially when commute time is reduced to two levels (under/over 30 minutes) and encoded in the colors.

***

The test score data came from the Stanford Education Data Archive (link). After you give them your email address, you will find a note saying that one standard deviation is roughly three grade levels. This raises a different kind of issue! Are the differences plotted on the chart meaningful? Usually, meaningful differences are at least two standard deviations, which would be six or more grade levels.

"Grade level equivalent" is a complicated concept. The same test is administered to students of different grades, so for example, the 5th-grade test is given to students in 3rd, 5th and 7th grades. The average scores of 3rd, 5th and 7th graders are computed. These are then mapped back to the 5th-grade scale. Anyone in 5th grade who scored above the average 7th grader is said to be two grade levels "ahead".

The fact that one standard deviation equals three grade levels tells me that the score distributions of 5th graders and 7th graders taking the 5th grade test overlap substantially. It seems likely that the difference between the 5th and 7th grader is not statistically significant - which, if true, is a sad commentary on our education standards.
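
A back-of-the-envelope check, assuming (my assumption, not the article's) that the score distributions are roughly normal with equal spread: a two-grade-level gap is then 2/3 of a standard deviation, and the chance that a randomly chosen 7th grader outscores a randomly chosen 5th grader on the same test is only about 68%.

gap_sd <- 2 / 3          # a two-grade-level gap expressed in standard deviation units
pnorm(gap_sd / sqrt(2))  # P(7th grader beats 5th grader) is roughly 0.68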

Back to the chart above. If my interpretation based on the note from the Stanford Education Data Archive is correct, then almost all of the horizontal differences on the NYT charts are practically meaningless.

***

Now, click here to read my comments on the data visualization.

 

 

Continue Reading…

Collapse

Read More

The long game towards understanding dialog

Building an effective dialog system

At Facebook AI Research (FAIR), understanding dialogue is an ambitious and long-term AI research goal.

A truly effective dialogue system will be an assistive technology that will likely include a chatbot-type of system being able to interact with people through natural language communication. This could help people better understand the world around them and communicate more effectively with others, effectively bridging communication gaps. Researching and developing these kinds of technologies will only become more important as the amount of digital content continues to grow.

The attempt to understand and interpret dialogue is not a new one. As far back as 20 years, there were several efforts to build a machine a person could talk to and teach how to have a conversation. These incorporated technology and engineering, but were single purposed with a very narrow focus, using pre-programmed scripted responses.

Thanks to progress in machine learning, particularly in the last few years, having AI agents being able to converse with people in natural language has become a more realistic endeavor that is garnering attention from both the research community and industry.

However, most of today’s dialogue systems continue to be scripted: their natural language understanding module may be based on machine learning, but what they execute or answer is in general dictated by if/then statements or rules engines. While they are an improvement on what existed decades ago, that is in large part due to the large databases of content used to create and script their responses.

Tackling the challenge from both ends

Getting to a natural language dialogue state with a chatbot remains a challenge and will require a number of research breakthroughs. At FAIR we have chosen to tackle the problem from both ends: general AI and reasoning by machines through communication as well as conducting research grounded in current dialog systems, using lessons learned from exposing actual chatbots to people. Our strength lies in embracing the diversity that spans both approaches. From long-term, fundamental research like the CommAI initiative, to shorter-term applied efforts such as FastText or Facebook M. Through these, combined with our team’s expertise across the AI spectrum, from deep learning for NLP to reinforcement learning, computer vision, and engineering, we hope to deliver significant natural language dialog advancements.

An important aspect of FAIR work on dialog is how we ground it in clear foundations:

  • Strong baselines: advanced learning systems for NLP problems should effectively deliver a performance improvement relative to more traditional methods. To that end, we built FastText with the goal of providing the very best results achievable with relatively simple and well understood techniques.
  • Clear(er) evaluations: evaluating dialog systems is a hard problem. At FAIR we came up with better tools to do just that. At ICLR 2017, we are sharing findings and tools from our dialogue initiatives with the academic community. These include the CommAI environment [1] to train and evaluate reasoning models, and the bAbI tasks for dialog [4] which can be used to test end-to-end dialog models. Thanks to our collaboration with Facebook M, these tools have been tested with models in real production conditions.
  • Open research: FAIR publishes almost all its research in conferences or through preprints. Similarly, code and data, including the two evaluation initiatives cited above are released as open-source. Just as there is diversity of work in FAIR, there is also tremendous diversity across the AI community. We believe that open dialog and shared tools and learnings will lead to bigger advancements overall.

Making progress through shared learnings

At ICLR, we are presenting 7 papers that illustrate the quality, innovation and breadth of FAIR dialog research. Lazaridou et al. [6] and the CommAI team [1] propose directions for enabling systems to discover and use basic communication skills, a first step towards artificial general intelligence. Li et al. present two papers that study how end-to-end dialog systems can improve themselves from ongoing live conversations [2, 5]. Bordes et al. introduce the bAbI tasks for dialog [4] to test end-to-end dialog systems in goal-oriented scenarios. And we present two papers on machine reading, by Grave et al. [3] and Henaff et al. [7], that push the boundaries of text understanding by machines.

References
[1] CommAI: Evaluating the First Steps Towards a Useful General AI, M Baroni, A Joulin, A Jabri, G Kruszewski, A Lazaridou, K Simonic, T Mikolov
[2] Dialogue Learning With Human-In-The-Loop, J Li, AH Miller, S Chopra, MA Ranzato, J Weston
[3] Improving Neural Language Models with a Continuous Cache, E Grave, A Joulin, N Usunier
[4] Learning End-to-end Goal-oriented Dialog, A Bordes, YL Boureau, J Weston
[5] Learning Through Dialogue Interactions, J Li, AH Miller, S Chopra, MA Ranzato, J Weston
[6] Multi-Agent Cooperation and the Emergence of (Natural) Language, A Lazaridou, A Peysakhovich, M Baroni
[7] Tracking the World State with Recurrent Entity Networks, M Henaff, J Weston, A Szlam, A Bordes, Y LeCun

Continue Reading…

Collapse

Read More

Experimental Design Exercises

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises we shall follow the practice of conducting an experimental study. A researcher wants to see whether working out has any influence on body mass. Three groups of subjects with similar food and sport habits were included in the experiment. Each group was subjected to a different set of exercises. Body mass was measured before and after the workout. The focus of the research is the difference in body mass between groups, measured after working out. In order to examine these effects, we shall use the paired t test, the t test for independent samples, one-way and two-way analysis of variance, and analysis of covariance.

You can download the dataset here. The data are fictitious.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load the data. Calculate descriptive statistics and test for the normality of both initial and final measurements for whole sample and for each group.
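
One possible approach in base R is sketched below; the file and column names (experiment.csv, group, initial, final) are assumptions for illustration, not taken from the actual dataset:

dat <- read.csv("experiment.csv")                    # hypothetical file name
summary(dat)                                         # descriptive statistics
shapiro.test(dat$initial); shapiro.test(dat$final)   # normality, whole sample
tapply(dat$initial, dat$group, shapiro.test)         # normality per group, initial
tapply(dat$final, dat$group, shapiro.test)           # normality per group, final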

Exercise 2

Is there an effect of the exercises, and what is the size of that effect for each group? (Tip: You should use the paired t test.)
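
A minimal sketch, using the same assumed column names as above:

t.test(dat$final, dat$initial, paired = TRUE)        # whole sample
by(dat, dat$group,
   function(g) t.test(g$final, g$initial, paired = TRUE))   # per group
# Cohen's d for paired data as a simple effect size
mean(dat$final - dat$initial) / sd(dat$final - dat$initial)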

Exercise 3

Is the variance of the body mass on final measurement the same for each of the three groups? (Tip: Use Levene’s test for homogeneity of variances)
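
For example, with the car package (same assumed column names):

library(car)                                         # provides leveneTest()
dat$group <- factor(dat$group)
leveneTest(final ~ group, data = dat)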

Exercise 4

Is there a difference between groups on final measurement and what is the effect size? (Tip: Use one-way ANOVA)
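
A sketch with aov() and eta squared as the effect size (assumed column names as before):

fit_aov <- aov(final ~ group, data = dat)
summary(fit_aov)
ss <- summary(fit_aov)[[1]][["Sum Sq"]]
ss[1] / sum(ss)                                      # eta squared = SS_between / SS_total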

Learn more about statistics for your experimental design in the online course Learn By Example: Statistics and Data Science in R. In this course you will learn how to:

  • Work through regression problems
  • Use different statistical tests and interpret them
  • And much more

Exercise 5

Between which groups does the difference in body mass appear after working out? (Tip: Conduct a post-hoc test.)

Exercise 6

What is the impact of age and of the working-out program on body mass at the final measurement? (Tip: Use a two-way between-groups ANOVA.)
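
A sketch, assuming the dataset also contains an age-group column called age (an assumption, not a known column name):

dat$age <- factor(dat$age)                           # assumed age-group column
fit_2way <- aov(final ~ group * age, data = dat)
summary(fit_2way)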

Exercise 7

What is the origin of the effect of the working-out program between subjects of different ages? (Tip: You should conduct a post-hoc test.)

Exercise 8

Is there a linear relationship between initial and final measurement of body mass for each group?

Exercise 9

Is there a significant difference in body mass on final measurement between groups, while controlling for initial measurement?
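
A sketch of the corresponding ANCOVA, with the initial measurement entered as a covariate (assumed column names as before):

fit_ancova <- aov(final ~ initial + group, data = dat)
summary(fit_ancova)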

Exercise 10

How much of the variance is explained by independent variable? How much of the variance is explained by covariate?

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Data Links #100

New data analysis competitions

Tech

  • EU launches public consultation into fears about future of internet.

    A succession of surveys over the coming weeks will ask people for their views on everything from privacy and security to artificial intelligence, net neutrality, big data and the impact of the digital world on jobs, health, government and democracy.

  • Another article on the blackbox problem. Our Machines Now Have Knowledge We'll Never Understand.

    We are increasingly relying on machines that derive conclusions from models that they themselves have created, models that are often beyond human comprehension, models that "think" about the world differently than we do.

    But this comes with a price. This infusion of alien intelligence is bringing into question the assumptions embedded in our long Western tradition. We thought knowledge was about finding the order hidden in the chaos. We thought it was about simplifying the world. It looks like we were wrong. Knowing the world may require giving up on understanding it.

  • Algorithm Aims to Predict Bickering Among Couples.

    Most conflict-monitoring experiments with real-life couples have previously taken place in the controlled settings of psychology labs. Researchers with the Couple Mobile Sensing Project at the University of Southern California, in Los Angeles, took a different approach by studying couples in their normal living conditions using wearable devices and smartphones to collect data. Their early field trial with 34 couples has shown that the combination of wearable devices and artificial intelligence based on machine learning could lead to the future of smartphone apps acting as relationship counselors.

  • Cambridge Analytica Explained: Data and Elections.

    Recently, the data mining firm Cambridge Analytica has been the centre of tons of debate around the use of profiling and micro-targeting in political elections. We've written this analysis to explain what it all means, and the consequences of becoming predictable to companies and political campaigns.

    [...] The idea that a single company influenced an entire election is also difficult to maintain because every single candidate used some form of profiling and micro-targeting to persuade voters — including Hillary Clinton and Trump's competitors in the primaries. Not every campaign used personality profiles but that doesn't make it any less invasive or creepy!


Data Links is a periodic blog post published on Sundays (specific time may vary) which contains interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed or this one, which only contains these articles, if you are not interested in other things I might publish.

Have you read an article you liked and would you like to suggest it for the next issue? Just contact me!

Continue Reading…

Collapse

Read More

Would you prefer three N=300 studies or one N=900 study?

Stephen Martin started off with a question:

I’ve been thinking about this thought experiment:


Imagine you’re given two papers.
Both papers explore the same topic and use the same methodology. Both were preregistered.
Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).
Paper B has a novel study with confirmed hypotheses (n=900).
*Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

I’m reasonably certain the answer is that both papers provide the same amount of evidence, by essentially the likelihood principle, and if anything, one should trust the estimates of paper B more (unless you meta-analyzed paper A, which should give you the same answer as paper B, more or less).

However, my intuition was correct that most people in this group would choose paper A (See https://www.facebook.com/groups/853552931365745/permalink/1343285629059137/ for poll results).

My reasoning is that if you are observing data from the same DGP, then where you cut the data off is arbitrary; why would flipping a coin 10x, 10x, 10x, 10x, 10x provide more evidence than flipping the coin 50x? The method in paper A essentially just collected 300, drew a line, collected 300, drew a line, then collected 300 more, and called them three studies; this has no more information in sum (in a fisherian sense, the information would just add together) than if you didn’t arbitrarily cut the data into sections.

If you read the comments in this group (which has researchers predominantly from the NHST world), one sees the fallacy that merely passing a threshold more times means you have more evidence. They use p*p*p to justify it (even though that doesn’t make sense, because one could partition the data into 10 n=90 sets and get ‘more evidence’ by this logic; in fact, you could have 90 p-values of ~.967262 and get a p-value of .05). They use Fisher’s method to say the p-value could be low (~.006), even though when combined, the p-value would actually be even lower (~.0007). One employs only Neyman-Pearson logic, and this results in a Type I error probability of .05^3.

I replied:

What do you mean by “confirmed hypotheses,” and what do you mean by a replication being “successful”? And are you assuming that the data are identical in the two scenarios?

To which Martin answered:

I [Martin], in a sense, left it ambiguous because I suspected that knowing nothing else, people would put paper A, even though asymptotically it should provide the same information as paper B.

I also left ‘confirmed hypothesis’ vague, because I didn’t want to say one must use one given framework. Basically, the hypotheses were supported by whatever method one uses to judge support (whether it be p-values, posteriors, bayes factors, whatever).

Successful replication as in, the hypotheses were supported again in the replication studies.

Finally, my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.

That said, if you are experimenter A gaining three n=300 samples, your data should asymptotically (or, over infinite datasets) equal that of experimenter B gaining one n=900 sample (over infinite datasets), in the sense that the expected total information is equal, and the accumulated evidence should be equal. Therefore, even if any given two papers have different datasets, asymptotically they should provide equal information, and there’s not a good reason to prefer three smaller studies over 1 larger one.

Yet, knowing nothing else, people assumed paper A, I think, because three studies is more intuitively appealing than one large study, even if the two could be interchangeable had you divided the larger sample into three, or combined the smaller samples into 1.

From my perspective, Martin’s question can’t really be answered because I don’t know what’s in papers A and B, and I don’t know what is meant by a replication being “successful.” I think the answer depends a lot on these pieces of information, and I’m still not quite sure what Martin’s getting at here. But maybe some of you have thoughts on this one.
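
To see the pooling intuition in numbers, here is a small simulation in R under the simplest possible setup (one-sample t-tests on normal data with a modest true effect; a sketch, not anything from the original exchange). It compares Fisher's method applied to three n = 300 studies with a single test on the pooled n = 900 sample:

# Three n = 300 "studies" versus one pooled n = 900 analysis
set.seed(123)
effect <- 0.1                                        # true effect, in SD units
x <- matrix(rnorm(900, mean = effect), ncol = 3)     # columns = the three studies
p_each <- apply(x, 2, function(xi) t.test(xi)$p.value)
# Fisher's method: -2 * sum(log(p)) ~ chi-squared with 2k df under the null
p_fisher <- pchisq(-2 * sum(log(p_each)), df = 2 * length(p_each),
                   lower.tail = FALSE)
p_pooled <- t.test(as.vector(x))$p.value             # one analysis of all 900 points
c(fisher = p_fisher, pooled = p_pooled)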

The post Would you prefer three N=300 studies or one N=900 study? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…

Collapse

Read More

The fish shell is awesome

3 years ago, I switched from using bash or zsh to fish (https://fishshell.com/). I like it so much that I wanted to write a blog post about why! There are a few fish features that mean that I’ll probably never switch back to bash.

no configuration

First – I know that zsh+oh-my-zsh is really awesome, and you can almost certainly configure zsh to do everything I’m gonna describe in this post. The thing I like personally about fish is that I don’t have to configure it! I just install it on all my computers, it comes with a lot of awesome features out of the box, and I don’t need to do any configuration.

My fish configuration file literally just sets some environment variables and that’s it.

feature 1: autosuggestions

This is the one true reason I adore fish, I don’t care about any other feature nearly as much.

The first amazing thing that fish does is autocompletion from my shell history. As I type, it’ll automatically suggest (in light grey) a command that I ran recently. I can press the right arrow key to accept the completion, or keep typing to ignore it.

Here’s what that looks like in practice: (in this example I just typed the “v” key and it guessed that I want to run the previous vim command again)

This autocompletion is also contextual. If I type ‘v’ in my home directory, it’ll just suggest vim, but if I cd into work/homepage and type v, it’ll suggest the file I edited when I was in that directory. I LOVE that it takes the directory I’m in into account and it means that the autocompletions work so much better.

The filename autocompletions are also smarter than in bash: I can type ls 2017, press tab, and it’ll autocomplete to ls outreachy-flyer-2017-May.jpg.

feature 2: really good syntax highlighting

When I type a command that doesn’t exist, fish highlights it in red, like this. python4 gets highlighted in red, and python3 in blue, because python3 is a real program on my system and python4 isn’t.

This is nice and it helps me all the time to see when I make typos.

feature 3: loops that are easier to use

Did you see that for loop in that screen shot before? Yeah!

for i in *.yaml
  echo $i
end

It’s so READABLE. All the control flow statements just end with end, not done or fi or esac. And it actually has a usable editor for loops. I can use my arrow keys to go up to the first line and edit something if I made a mistake.

You will also notice at this point that fish’s loops are real different from bash loops. bash/sh syntax really does not work inside fish.

feature 4: no more ctrl+r

This one isn’t so important but I still like it.

I search my history all the time! For example, I never remember where my fish configuration is because it’s kind of in a weird spot. But I remember that it’s called config.fish so I just need to search for that.

In bash to search your history, you use “ctrl+r” and then it says “reverse-i-search” for some reason and then you find your thing. In fish, here’s how it works

  1. type config.fish
  2. press the up arrow. vim ~/.config/fish/config.fish appears
  3. right, that’s the file! Press enter. Done!

This is not really so different than bash’s ctrl+r but for some reason it makes me happy that I can just press the up arrow any time to search my history.

fish is great! maybe try it!

some other random cool stuff about fish:

  • it comes with built-in autocompletions for a lot of commands; you don’t have to do anything special to install them
  • also apparently it automatically generates completions by parsing man pages? that’s so smart and amazing. I just learned that today!
  • you can configure it from a web browser (:D :D). I don’t really configure my fish but I think this is cool because they’re like “hey, you don’t have to learn our configuration language to make some basic changes, just go to this webpage!“. It’s such a different approach!
  • It has a great 1-page tutorial that walks you through all the basic features and the differences from other shells.

fish isn’t POSIX compliant, and that still trips me up sometimes, it’s similar but just different enough that I sometimes get confused about how to set environment variables. This doesn’t make me love fish any less, though! If I ever need to run a bash script or something I just switch to bash for a couple minutes, it takes like 1 second to type bash =)

Continue Reading…

Collapse

Read More

Using R: a function that adds multiple ggplot2 layers

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Another interesting thing that an R course participant identified: Sometimes one wants to make a function that returns multiple layers to be added to a ggplot2 plot. One could think that just adding them and returning would work, but it doesn’t. I think it has to do with how + is evaluated. There are a few workarounds that achieve similar results and may save typing.

First, some data to play with: this is a built-in dataset of chickens growing:

library(ggplot2)

data(ChickWeight)
diet1 <- subset(ChickWeight, Diet == 1)
diet2 <- subset(ChickWeight, Diet == 2)

This is just an example that shows the phenomenon. The first two functions will work, but combining them won’t.

add_line <- function(df) {
  geom_line(aes(x = Time, y = weight, group = Chick), data = df)
}

add_points <- function(df) {
  geom_point(aes(x = Time, y = weight), data = df)
}

add_line_points <- function(df) {
  add_line(df) + add_points(df)
}

## works
(plot1 <- ggplot() + add_line(diet1) + add_points(diet1))

## won't work: non-numeric argument to binary operator
try((plot2 <- ggplot() + add_line_points(diet1)))

First, you can get the same result by putting the mappings and data in the ggplot() call itself. This only works if all layers are going to plot the same data, but that covers quite a few cases:

## bypasses the issue by putting mappings in ggplot()
(plot3 <- ggplot(aes(x = Time, y = weight, group = Chick), data = diet1) +   
  geom_line() + geom_point())

One way is to write a function that takes the plot object as input, and returns a modified version of it. If we use the pipe operator %>%, found in the magrittr package, it even gets a ggplot2 like feel:

## bypasses the issue and gives a similar feel with pipes

library(magrittr)

add_line_points2 <- function(plot, df, ...) {
  plot +
    geom_line(aes(x = Time, y = weight, group = Chick), ..., data = df) +
    geom_point(aes(x = Time, y = weight), ..., data = df)
}

(plot4 <- ggplot() %>% add_line_points2(diet1)
   %>% add_line_points2(diet2, colour = "red"))

Finally, in many cases, one can stick all the data in a combined data frame, and avoid building up the plot from different data frames altogether.

## plot the whole dataset at once
(plot5 <- ggplot(aes(x = Time, y = weight, group = Chick, colour = Diet),
                 data = ChickWeight) +
   geom_line() + geom_point())

Okay, maybe that plot is a bit too busy to be good. But note how the difference between plotting a single diet and plotting all diets at the same time is just one more mapping in aes(). No looping or custom functions required.
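
As an aside not covered in the post: ggplot2 also accepts a list of layers on the right-hand side of +, so a function that returns a list of layers is yet another workaround (a small sketch along the lines of the examples above; recent ggplot2 versions support adding lists):

## returns a list of layers; + adds each element in turn
add_line_points3 <- function(df, ...) {
  list(
    geom_line(aes(x = Time, y = weight, group = Chick), ..., data = df),
    geom_point(aes(x = Time, y = weight), ..., data = df)
  )
}

(plot6 <- ggplot() + add_line_points3(diet1) + add_line_points3(diet2, colour = "red"))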

I hope that was of some use.

Posted in: computer stuff, data analysis. Tagged: ggplot2, R

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

R Packages worth a look

Dynamic Correlation Analysis for High Dimensional Data (DCA)
Finding dominant latent signals that regulate dynamic correlation between many pairs of variables.

Encrypt and Decrypt Strings, R Objects and Files (safer)
A consistent interface to encrypt and decrypt strings, R objects and files using symmetric key encryption.

Create and Evaluate NONMEM Models in a Project Context (nonmemica)
Systematically creates and modifies NONMEM(R) control streams. Harvests NONMEM output, builds run logs, creates derivative data, generates diagnostics. NONMEM (ICON Development Solutions <http://…/> ) is software for nonlinear mixed effects modeling. See ‘package?nonmemica’.

Tabular Reporting API (flextable)
Create pretty tables for ‘Microsoft Word’, ‘Microsoft PowerPoint’ and ‘HTML’ documents. Functions are provided to let users create tables, modify and format their content. It extends package ‘officer’ that does not contain any feature for customized tabular reporting. Function ‘tabwid’ produces an ‘htmlwidget’ ready to be used in ‘Shiny’ or ‘R Markdown (*.Rmd)’ documents. See the ‘flextable’ website for more information.

Connect to ‘DocuSign’ API (docuSignr)
Connect to the ‘DocuSign’ Rest API <https://…/RESTAPIGuide.htm>, which supports embedded signing, and sending of documents.


Continue Reading…

Collapse

Read More

Sunday Morning Insight: "No Need for the Map of a Cat, Mr Feynman" or The Long Game in Nanopore Sequencing.



About 5 weeks ago, we wondered how we could tell if the world was changing right before our eyes? Well, this is happening: instance #2 just got more real:


Nanopore sequencing is a promising technique for genome sequencing due to its portability, ability to sequence long reads from single molecules, and to simultaneously assay DNA methylation. However until recently nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (~30x theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ~3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further we developed a new approach exploiting the long-read capability of this system and found that adding an additional 5x-coverage of "ultra-long" reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at: https://github.com/nanopore-wgs-consortium/NA12878.
Here is some context:

And previously on Nuit Blanche:
 
Credit: NASA, JPL




Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Continue Reading…

Collapse

Read More

Document worth reading: “A Review on Algorithms for Constraint-based Causal Discovery”

Causal discovery studies the problem of mining causal relationships between variables from data, which is of primary interest in science. During the past decades, a significant amount of progress has been made toward this fundamental data mining paradigm. In recent years, with the availability of abundant, large-sized and complex observational data, constraint-based approaches have gradually attracted a lot of interest and have been widely applied to many diverse real-world problems due to their fast running speed and easy generalization to the problem of causal insufficiency. In this paper, we aim to review the constraint-based causal discovery algorithms. Firstly, we discuss the learning paradigm of the constraint-based approaches. Secondly and primarily, the state-of-the-art constraint-based causal inference algorithms are surveyed with detailed analysis. Thirdly, several related open-source software packages and benchmark data repositories are briefly summarized. In conclusion, some open problems in constraint-based causal discovery are outlined for future research. A Review on Algorithms for Constraint-based Causal Discovery


Continue Reading…

Collapse

Read More

Explaining complex machine learning models with LIME

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

The classification decisions made by machine learning models are usually difficult – if not impossible – to understand by our human brains. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well – often with better results than achieved by humans. But it also makes them inherently hard to explain, especially to non-data scientists.

This is especially true if we aim to develop machine learning models for medical diagnostics: high accuracies on test samples might not be enough to sell them to clinicians. Doctors and patients alike will be less inclined to trust a decision made by a model that they don’t understand.

Therefore, we would like to be able to explain in concrete terms why a model classified a case with a certain label, e.g. why one breast mass sample was classified as “malignant” and not as “benign”.

Local Interpretable Model-Agnostic Explanations (LIME) is an attempt to make these complex models at least partly understandable. The method has been published in

“Why Should I Trust You?” Explaining the Predictions of Any Classifier. By Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin from the University of Washington in Seattle

lime is able to explain all models for which we can obtain prediction probabilities (in R, that is every model that works with predict(type = "prob")). It makes use of the fact that linear models are easy to explain because they are based on linear relationships between features and class labels: The complex model function is approximated by locally fitting linear models to permutations of the original training set.

For each permutation, a linear model is fit, and the permuted instances are weighted by their similarity to the original data, so that misclassifying instances close to the original observation is penalized more heavily (positive weights support a decision, negative weights contradict it). This gives an approximation of how much (and in which way) each feature contributed to a decision made by the model.
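
To make the idea more concrete, here is a toy sketch of such locally weighted fitting. It only illustrates the principle (all names and numbers below are made up); it is not the internals of the lime package:

# Toy illustration of the LIME principle: perturb an observation,
# weight the perturbations by proximity to it, and fit a weighted
# linear model whose coefficients act as the local explanation.
set.seed(1)
x0 <- c(f1 = 0.5, f2 = -1.2)                          # observation to explain
perturb <- as.data.frame(replicate(2, rnorm(500)))    # perturbed samples
names(perturb) <- names(x0)
black_box <- function(d) plogis(3 * d$f1 - 2 * d$f2^2)   # stand-in for a complex model
y_hat <- black_box(perturb)                           # black-box predictions
dists <- sqrt(rowSums(sweep(as.matrix(perturb), 2, x0)^2))
w <- exp(-dists^2 / 0.75)                             # proximity kernel weights
local_fit <- lm(y_hat ~ f1 + f2, data = perturb, weights = w)
coef(local_fit)                                       # local feature weights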

The code for lime was originally made available for Python, but the awesome Thomas Lin Pedersen has already created an implementation in R. It is not on CRAN (yet, I assume), but you can install it via Github:

devtools::install_github("thomasp85/lime")

The data I am using is the World Happiness Data from my last post. So, let’s train a neural network on this data to predict three classes of the happiness scores: low, medium and high.

load("data_15_16.RData")
# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)
set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data  <- data_15_16[-index, ]
set.seed(42)
model_mlp <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "mlp",
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 5, 
                                                  verboseIter = FALSE))

The explain function

The central function of lime is lime(). It creates the function that is used in the next step to explain the model’s predictions.

We can give a couple of options. Check the help ?lime for details, but the most important to think about are:

  • Should continuous features be binned? And if so, into how many bins?

Here, I am keeping the default bin_continuous = TRUE but specify 5 instead of 4 (the default) bins with n_bins = 5.

library(lime)

explain <- lime(train_data, model_mlp, bin_continuous = TRUE, n_bins = 5, n_permutations = 1000)

Now, let’s look at how the model is explained. Here, I am not going to look at all test cases but I’m randomly choosing three cases with correct predictions and three with wrong predictions.

pred <- data.frame(sample_id = 1:nrow(test_data),
                   predict(model_mlp, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")

Beware that we need to give our test-set data table row names (the sample names or IDs) so that they are displayed in the headers of the explanatory plots below.

library(tidyverse)
pred_cor <- filter(pred, correct == "correct")
pred_wrong <- filter(pred, correct == "wrong")

test_data_cor <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_cor$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)

test_data_wrong <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_wrong$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)

The explain function from above can now be used with our test samples. Further options we can specify are:

  • How many features do we want to use in the explanatory function?

Let’s say we have a big training set with 100 features. Looking at all features and trying to understand them all could be more confusing than helpful. And very often, a handful of very important features will be enough to predict test samples with a reasonable accuracy (see also my last post on OneR). So, we can choose how many features we want to look at with the n_features option.

  • How do we want to choose these features?

Next, we specify how we want this subset of features to be found. The default, auto, uses forward selection if we choose n_features <= 6 and uses the features with highest weights otherwise. We can also directly choose feature_select = "forward_selection", feature_select = "highest_weights" or feature_select = "lasso_path". Again, check ?lime for details.

In our example dataset, we only have 7 features and I want to look at the top 5.

I also want to have explanation for all three class labels in the response variable (low, medium and high happiness), so I am choosing n_labels = 3.

explanation_cor <- explain(test_data_cor, n_labels = 3, n_features = 5)
explanation_wrong <- explain(test_data_wrong, n_labels = 3, n_features = 5)

It will return a tidy tibble object that we can plot with plot_features():

plot_features(explanation_cor, ncol = 3)

plot_features(explanation_wrong, ncol = 3)

The information in the output tibble is described in the help function ?lime and can be viewed with

tibble::glimpse(explanation_cor)

So, what does this tell us, now? Let’s look at case 22 (the first row of our plot for correctly predicted classes): This sample has been correctly predicted to come from the medium happiness group because it

  • has a dystopia value between 2.03 & 2.32,
  • a trust/government corruption score below 0.05,
  • a GDP/economy score between 1.06 and 1.23 and
  • a life expectancy score between 0.59 and 0.7.

From the explanation for the label “high” we can also see that this case has a family score bigger than 1.12, which is more representative of high happiness samples.

pred %>%
  filter(sample_id == 22)
##   sample_id        low   medium       high actual prediction correct
## 1        22 0.02906327 0.847562 0.07429938 medium     medium correct

The explanatory function identified dystopia as the most strongly supporting feature for this prediction. Dystopia is an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables […]

The explanatory plot tells us, for each feature and class label, in which range of values a representative data point would fall. If it does, this counts as support for the prediction; if it does not, it is scored as contradictory. For case 22 and the feature dystopia, the data point 2.27 falls within the range for medium happiness (between 2.03 and 2.32) with a high weight.

When we look at where this case falls on the range of values for this feature, we can see that it is indeed very close to the median of the medium training cases and further away from the medians for the high and low training cases. The other supportive features show the same trend.

train_data %>%
  gather(x, y, Economy..GDP.per.Capita.:Dystopia.Residual) %>%
  ggplot(aes(x = Happiness.Score.l, y = y)) +
    geom_boxplot(alpha = 0.8, color = "grey") + 
    geom_point(data = gather(test_data[22, ], x, y, Economy..GDP.per.Capita.:Dystopia.Residual), color = "red", size = 3) +
    facet_wrap(~ x, scales = "free", ncol = 4)

An overview over the top 5 explanatory features for case 22 is stored in:

as.data.frame(explanation_cor[1:9]) %>%
  filter(case == "22")
##    case  label label_prob   model_r2 model_intercept
## 1    22 medium 0.84756196 0.05004205       0.5033729
## 2    22 medium 0.84756196 0.05004205       0.5033729
## 3    22 medium 0.84756196 0.05004205       0.5033729
## 4    22 medium 0.84756196 0.05004205       0.5033729
## 5    22 medium 0.84756196 0.05004205       0.5033729
## 6    22   high 0.07429938 0.06265119       0.2293890
## 7    22   high 0.07429938 0.06265119       0.2293890
## 8    22   high 0.07429938 0.06265119       0.2293890
## 9    22   high 0.07429938 0.06265119       0.2293890
## 10   22   high 0.07429938 0.06265119       0.2293890
## 11   22    low 0.02906327 0.07469729       0.3528088
## 12   22    low 0.02906327 0.07469729       0.3528088
## 13   22    low 0.02906327 0.07469729       0.3528088
## 14   22    low 0.02906327 0.07469729       0.3528088
## 15   22    low 0.02906327 0.07469729       0.3528088
##                          feature feature_value feature_weight
## 1              Dystopia.Residual       2.27394     0.14690100
## 2  Trust..Government.Corruption.       0.03005     0.06308598
## 3       Economy..GDP.per.Capita.       1.13764     0.02944832
## 4       Health..Life.Expectancy.       0.66926     0.02477567
## 5                     Generosity       0.00199    -0.01326503
## 6                         Family       1.23617     0.13629781
## 7                     Generosity       0.00199    -0.07514534
## 8  Trust..Government.Corruption.       0.03005    -0.07574480
## 9              Dystopia.Residual       2.27394    -0.07687559
## 10      Economy..GDP.per.Capita.       1.13764     0.07167086
## 11                        Family       1.23617    -0.14932931
## 12      Economy..GDP.per.Capita.       1.13764    -0.12738346
## 13                    Generosity       0.00199     0.09730858
## 14             Dystopia.Residual       2.27394    -0.07464384
## 15 Trust..Government.Corruption.       0.03005     0.06220305
##                                       feature_desc
## 1         2.025072 < Dystopia.Residual <= 2.320632
## 2        Trust..Government.Corruption. <= 0.051198
## 3  1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 4  0.591822 < Health..Life.Expectancy. <= 0.701046
## 5                           Generosity <= 0.123528
## 6                                1.119156 < Family
## 7                           Generosity <= 0.123528
## 8        Trust..Government.Corruption. <= 0.051198
## 9         2.025072 < Dystopia.Residual <= 2.320632
## 10 1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 11                               1.119156 < Family
## 12 1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 13                          Generosity <= 0.123528
## 14        2.025072 < Dystopia.Residual <= 2.320632
## 15       Trust..Government.Corruption. <= 0.051198

In a similar way, we can explore why some predictions were wrong.


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_0.5.0       purrr_0.2.2       readr_1.1.0      
##  [4] tidyr_0.6.1       tibble_1.3.0      tidyverse_1.1.1  
##  [7] RSNNS_0.4-9       Rcpp_0.12.10      lime_0.1.0       
## [10] caret_6.0-73      ggplot2_2.2.1     lattice_0.20-35  
## [13] doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3    
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.6.0    assertthat_0.2.0   glmnet_2.0-5      
##  [4] rprojroot_1.2      digest_0.6.12      psych_1.7.3.21    
##  [7] R6_2.2.0           plyr_1.8.4         backports_1.0.5   
## [10] MatrixModels_0.4-1 stats4_3.3.3       evaluate_0.10     
## [13] httr_1.2.1         hrbrthemes_0.1.0   lazyeval_0.2.0    
## [16] readxl_0.1.1       minqa_1.2.4        SparseM_1.76      
## [19] extrafontdb_1.0    car_2.1-4          nloptr_1.0.4      
## [22] Matrix_1.2-8       rmarkdown_1.4      labeling_0.3      
## [25] splines_3.3.3      lme4_1.1-12        extrafont_0.17    
## [28] stringr_1.2.0      foreign_0.8-67     munsell_0.4.3     
## [31] hunspell_2.3       broom_0.4.2        modelr_0.1.0      
## [34] mnormt_1.5-5       mgcv_1.8-17        htmltools_0.3.5   
## [37] nnet_7.3-12        codetools_0.2-15   MASS_7.3-45       
## [40] ModelMetrics_1.1.0 grid_3.3.3         nlme_3.1-131      
## [43] jsonlite_1.4       Rttf2pt1_1.3.4     gtable_0.2.0      
## [46] DBI_0.6-1          magrittr_1.5       scales_0.4.1      
## [49] stringi_1.1.5      reshape2_1.4.2     xml2_1.1.1        
## [52] tools_3.3.3        forcats_0.2.0      hms_0.3           
## [55] pbkrtest_0.4-7     yaml_2.1.14        colorspace_1.3-2  
## [58] rvest_0.3.2        knitr_1.15.1       haven_1.0.0       
## [61] quantreg_5.29

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

A classical analysis (Radio Swiss classic program)

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I am not a classical music expert at all, but I happen to have friends who are, and am even married to someone playing the cello (and the ukulele!). I appreciate listening to such music from time to time, in particular Baroque music. A friend made me discover Radio Swiss classic, an online radio playing classical music all day and all night long, with a quite nice variety, and very little speaking between pieces, with no ads (thank you, funders of the radio!). Besides, the voices telling me which piece has just been played are really soothing, so Radio Swiss classic is a good one in my opinion.

Today, instead of anxiously waiting for the results of the French presidential elections, I decided to download the program of the radio in the last years and have a quick look at it, since after all, the website says that the radio aims at relaxing people.

Scraping the program

My webscraping became a bit more elegant because I followed the advice of EP alias expersso, who by the way should really start blogging. I downloaded programs from September 2008 onwards because that’s when I met the friend who told me about Radio Swiss Classic.

dates <- seq(from = lubridate::ymd("2008-09-01"),
             to = lubridate::ymd("2017-04-22"),
             by = "1 day")


base_url <- "http://www.radioswissclassic.ch/en/music-programme/search/"

get_one_day_program <- function(date, base_url){
  # in order to see progress
  message(date)
  
  # build URL
  date_as_string <- as.character(date)
  date_as_string <- stringr::str_replace_all(date_as_string, "-", "")
  url <- paste0(base_url, date_as_string)
  
  # read page
  page <- try(xml2::read_html(url),
              silent = TRUE)
  if(is(page, "try-error")){
    message("horribly wrong")

    closeAllConnections()
    return(NULL)
  }else{
    
    # find all times, artists and pieces
    times <- xml2::xml_text(xml2::xml_find_all(page, 
                                               xpath="//span[@class='time hidden-xs']//text()"))
    artists <- xml2::xml_text(xml2::xml_find_all(page, 
                                                 xpath="//span[@class='titletag']//text()"))
    pieces <- xml2::xml_text(xml2::xml_find_all(page, 
                                                xpath="//span[@class='artist']//text()"))
    # the last artist and piece are the current ones
    artists <- artists[1:(length(artists) - 1)]
    pieces <- pieces[1:(length(pieces) - 1)]
    
    # get a timedate from each time
    timedates <- paste(as.character(date), times)
    timedates <- lubridate::ymd_hm(timedates)
    timedates <- lubridate::force_tz(timedates, tz = "Europe/Zurich")
    
    # format the output
    program <- tibble::tibble(time = timedates,
                              artist = artists,
                              piece = pieces)
    
    return(program)
  }
  
}

programs <- purrr::map(dates, get_one_day_program, 
                       base_url = base_url)

programs <- dplyr::bind_rows(programs)

save(programs, file = "data/radioswissclassic_programs.RData")

There were some days without any program on the website, for which the website said something was horribly wrong with the server.

load("data/radioswissclassic_programs.RData")
wegot <- length(unique(lubridate::as_date(programs$time)))
wewanted <- length(seq(from = lubridate::ymd("2008-09-01"),
                       to = lubridate::ymd("2017-04-22"),
                       by = "1 day"))

However, I got a program for approximately 0.96 of the days.

Who are the most popular composers?

library("magrittr")
table(programs$artist) %>%
  broom::tidy() %>%
  dplyr::arrange(desc(Freq)) %>%
  head(n = 20) %>%
  knitr::kable()
Composer                        Freq
Wolfgang Amadeus Mozart        37823
Ludwig van Beethoven           20936
Joseph Haydn                   18140
Franz Schubert                 15596
Antonio Vivaldi                14947
Johann Sebastian Bach          12003
Felix Mendelssohn-Bartholdy    11541
Antonin Dvorak                 10265
Gioachino Rossini               9591
Frédéric Chopin                 8470
Piotr Iljitsch Tchaikowsky      8092
Georg Friedrich Händel          7935
Tomaso Albinoni                 6175
Gaetano Donizetti               5945
Giuseppe Verdi                  5639
Johannes Brahms                 5526
Johann Nepomuk Hummel           5439
Camille Saint-Saëns             5395
Luigi Boccherini                5130
Johann Christian Bach           4976

I’ll have to admit that I don’t even know all the composers in this table, but they’re actually all famous according to my live-in classical music expert. Radio Swiss classic allows listeners to rate pieces, so the most popular ones are programmed more often, and well, I guess the person making the programs also tends to program famous composers quite often.

library("ggplot2")
library("hrbrthemes")
table(programs$artist) %>%
  broom::tidy() %>%
  ggplot() +
  geom_histogram(aes(Freq)) +
  scale_x_log10() +
  theme_ipsum(base_size = 14) 

[Figure: histogram of play counts per composer, with a log10 x-axis]

Interestingly, but not that surprisingly I guess given the popularity of, say, Mozart, the distribution of occurrences by composers seems to be log-normally distributed.
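
Out of curiosity, a quick way to eyeball that log-normal impression (a sketch, not part of the original post) is a normal Q-Q plot of the log counts:

# Q-Q plot of log10 play counts per composer against a normal distribution
composer_counts <- as.numeric(table(programs$artist))
qqnorm(log10(composer_counts))
qqline(log10(composer_counts))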

How long are pieces?

On the website of Radio Swiss classic it is stated that pieces are longer in the evening than during the day, which I wanted to try and see. Because the program of the radio was not corrected for time changes (i.e. on 25-hour days there are only 24 hours of music according to the online program), I shall only look at pieces whose duration is shorter than 60 minutes, which solves the issue of missing days at the same time.

programs <- dplyr::arrange(programs, time)
programs <- dplyr::mutate(programs,
                          duration = difftime(lead(time, 1),
                                       time,
                                       units = "min"))

programs <- dplyr::mutate(programs,
                          duration = ifelse(duration > 60,
                                            NA, duration))
programs <- dplyr::mutate(programs,
                          hour = as.factor(lubridate::hour(time)))

programs %>%
ggplot() +
  geom_boxplot(aes(hour, duration))+
  theme_ipsum(base_size = 14) 

[Figure: boxplots of piece duration (minutes) by hour of the day]

I don’t find the difference between day and night that striking, maybe I could try to define day and night to have a prettier figure (but I won’t do any test, I soon need to go watch TV).

programs %>%
  dplyr::mutate(night = (lubridate::hour(time) <= 4 | lubridate::hour(time) >= 20)) %>%
ggplot() +
  geom_boxplot(aes(night, duration))+
  theme_ipsum(base_size = 14)

[Figure: boxplots of piece duration (minutes) for night vs. day]

Conclusion

The website also states that the pieces are more lively in the morning, but I have no data to which to match the titles of the pieces to investigate that claim. Well I have not even looked for such data. Another extension that I would find interesting would be to match each composer’s name to a style and then see how often each style is played. Now I’ll stop relaxing and go stuff my face with food in front of the election results!

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

April 22, 2017

R code to accompany Real-World Machine Learning (Chapter 6): Exploring NYC Taxi Data

(This article was first published on data prone - R, and kindly contributed to R-bloggers)

Abstract

The rwml-R Github repo is updated with R code for exploratory data analysis of New York City taxi data from Chapter 6 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. Examples given include reading large data files with the fread function from data.table, joining data frames by multiple variables with inner_join, and plotting categorical and numerical data with ggplot2.

Data for NYC taxi example

The data files for the examples in Chapter 6 of the book are available at
http://www.andresmh.com/nyctaxitrips/.
They are compressed as a 7-Zip file archive
(e.g. with p7zip), so you will
need to have the 7z command available in your path to decompress and load
the data.
(On a mac, you can use Homebrew to install p7zip with
the command brew install p7zip.)

Using fread (and dplyr…again)

As in Chapter 5, the fread function from the
data.table library is used to quickly read in a sample of the rather large
data files. It is similar to read.table but faster and more convenient.
The following code reads in the first 50k lines of data from one of the
trip data files and one of the fare data files. The mutate and filter
functions from dplyr are used to clean up the data (e.g. remove data
with unrealistic latitude and longitude values). The trip and fare data are
combined with the inner_join function from the dplyr package.

tripFile1 <- "../data/trip_data_1.csv"
fareFile1 <- "../data/trip_fare_1.csv"
npoints <- 50000
tripData <- fread(tripFile1, nrows=npoints, stringsAsFactors = TRUE) %>%
  mutate(store_and_fwd_flag = 
           replace(store_and_fwd_flag, which(store_and_fwd_flag == ""), "N")) %>%
  filter(trip_distance > 0 & trip_time_in_secs > 0 & passenger_count > 0) %>%
  filter(pickup_longitude < -70 & pickup_longitude > -80) %>%
  filter(pickup_latitude > 0 & pickup_latitude < 41) %>%
  filter(dropoff_longitude < 0 & dropoff_latitude > 0)
tripData$store_and_fwd_flag <- factor(tripData$store_and_fwd_flag)
fareData <- fread(fareFile1, nrows=npoints, stringsAsFactors = TRUE)
dataJoined <- inner_join(tripData, fareData)
remove(fareData, tripData)

Exploring the data

In the complete code-through, plots of categorical and numerical
features of the data are made using
ggplot2, including a visualization of the pickup locations in latitude and
longitude space which is shown below. With slightly less than 50,000 data
points, we can clearly see the street layout of downtown Manhattan.
Many of the trips originate in the other boroughs of New York, too.

The latitude/longitude of pickup locations. Note that the x-axis is flipped, compared to a regular map.
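
For reference, a plot along these lines can be produced from the dataJoined data frame built above (a sketch, not the book's exact code; scale_x_reverse() reproduces the flipped x-axis mentioned in the caption):

library(ggplot2)

# Scatter plot of pickup locations; small, translucent points reveal the street grid
ggplot(dataJoined, aes(x = pickup_longitude, y = pickup_latitude)) +
  geom_point(size = 0.1, alpha = 0.3) +
  scale_x_reverse() +
  labs(x = "pickup longitude", y = "pickup latitude")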

Feedback welcome

If you have any feedback on the rwml-R project, please
leave a comment below or use the Tweet button.
As with any of my projects, feel free to fork the rwml-R repo
and submit a pull request if you wish to contribute.
For convenience, I’ve created a project page for rwml-R with
the generated HTML files from knitr, including a page with
all of the event-modeling examples from chapter 6.


To leave a comment for the author, please follow the link and comment on their blog: data prone - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Magister Dixit

“Twitter is probably the best place to start conversations about data science.” John Foreman


Continue Reading…

Collapse

Read More

Thanks for reading!