My Data Science Blogs

July 23, 2018

The global slump in press freedom

ACROSS the world, freedom of the press is atrophying. According to scores compiled by Freedom House, a think-tank, the muzzling of journalists and independent news media is at its worst point in 13 years.

Continue Reading…


Read More

When wife earns more than husband, they report a lesser gap

Marta Murray-Close and Misty L. Heggeness for the Census Bureau compared income responses from the Current Population Survey against income tax reports. The former can be fudged, whereas the latter is accurate by law. The researchers found a statistical difference that suggests when a wife makes more than a husband, they report a lesser gap in the survey.

This paper compares the earnings reported for husbands and wives in the Current Population Survey with their “true” earnings from administrative income-tax records. Estimates from OLS regressions show that survey respondents react to violations of the norm that husbands earn more than their wives by inflating their reports of husbands’ earnings and deflating their reports of wives’ earnings. On average, the gap between a husband’s survey and administrative earnings is 2.9 percentage points higher if his wife earns more than he does, and the gap between a wife’s survey and administrative earnings in 1.5 percentage points lower if she earns more than her husband does. These findings suggest that gendered social norms can influence survey reports of seemingly objective outcomes and that their impact may be heterogeneous not just between genders but also within gender.

The gap shift didn’t change much, regardless if the wife reported or the husband did. However, it’s interesting that the shift tended towards a boost for the husband’s income when the wife reported and a bump down for the wife’s income when the husband reported.

Tags: ,

Continue Reading…


Read More

Are vectorized random number generators actually useful?

Our processors benefit from “SIMD” instructions. These instructions can operate on several values at once, thus greatly accelerating some algorithms. Earlier, I reported that you can multiply the speed of common (fast) random number generators such as PCG and xorshift128+ by a factor of three or four by vectorizing them using SIMD instructions.

A reader challenged me: is this actually useful in practice?

There are problems where the generation of random number is critical to the performance. That is the case in many simulations, for example. The simplest and best known one is probably the random shuffling of arrays. The standard algorithm is quite simple:

for (i = size; i > 1; i--) {
  var i = random_number(i);
  switch_values(array, i, j);

If you are interested, O’Neill wrote a whole blog post of this specific problem.

So can I accelerate the shuffling of an array using SIMD instructions?

So I threw together a vectorized shuffle and a regular (scalar) shuffle, both of them using O’Neill’s PCG32. The net result? I almost double the speed using SIMD instructions when the array is in cache:

SIMD shuffle 3.5 cycles per entry
scalar shuffle 6.6 cycles per entry

My code is available. I do not do anything sophisticated so I expect it is possible to do a lot better. My sole goal was to demonstrate that SIMD random number generators are practical.

Continue Reading…


Read More

Building A Data Science Product in 10 Days

At startups, you often have the chance to create products from scratch. In this article, the author will share how to quickly build valuable data science products, using his first project at Instacart as an example.

Continue Reading…


Read More

Some Tufte basics brought to you by your favorite birds

Someone sent me this via Twitter, found on the Data is Beautiful reddit:


The chart does not deliver on its promise: It's tough to know which birds like which seeds.

The original chart was also provided in the reddit:


I can see why someone would want to remake this visualization.

Let's just apply some Tufte fixes to it, and see what happens.

Our starting point is this:


First, consider the colors. Think for a second: order the colors of the cells by which ones stand out most. For me, the order is white > yellow > red > green.

That is a problem because for this data, you'd like green > yellow > red > white. (By the way, it's not explained what white means. I'm assuming it means the least preferred, so not preferred that one wouldn't consider that seed type relevant.)

Compare the above with this version that uses a one-dimensional sequential color scale:


The white color still stands out more than necessary. Fix this using a gray color.


What else is grabbing your attention when it shouldn't? It's those gridlines. Push them into the background using white-out.


The gridlines are also too thick. Here's a slimmed-down look:


The visual is much improved.

But one more thing. Let's re-order the columns (seeds). The most popular seeds are shown on the left, and the least on the right in this final revision.


Look for your favorite bird. Then find out which are its most preferred seeds.

Here is an animated gif to see the transformation. (Depending on your browser, you may have to click on it to view it.)



Continue Reading…


Read More

Six reasons to think twice about your data lake strategy

Since data has been called the “oil” of the new economy, it’s easy to assume that more is better. You can never have too much oil, so the same goes for data too, right? Hence there has been a lot of hype about data lakes over the past few years.

The post Six reasons to think twice about your data lake strategy appeared first on Dataconomy.

Continue Reading…


Read More

A Guide to AI, Machine Learning, and Deep Learning Talks at Spark + AI Summit Europe

Within a couple of years of its release as an open-source machine learning and deep learning framework, TensorFlow has seen an amazing rate of adoption. Consider the number of stars on its github page: over 105K; look at the number of contributors: 1500+; and observe its growing penetration and pervasiveness in verticals: from medical imaging to gaming; from computer vision to voice recognition and natural language processing.

As in Spark + AI Summit in San Francisco, so too in Spark + AI Summit Europe, we have seen high-caliber technical talks about the use of TensorFlow and other deep learning frameworks as part of the new tracks: AI, Productionizing ML, and Deep Learning Techniques. In this blog, we highlight a few talks that caught our eye, in their promise and potential. It always helps to have some navigational guidance if you are new to the summit or technology.

For example, look how Logical Clocks AB is using Apache Spark and TensorFlow to herald novel methods for developing distributed learning and training. Jim Dowling in his talk, Distributed Deep Learning with Apache Spark and TensorFlow, will explore myriad ways Apache Spark is combined with deep learning frameworks such as TensorFlow, TensorFlowonSpark, Horovod, and Deep Learning Pipelines to build deep learning applications.

Closely related to the above talk in integrating popular deep learning frameworks, such as TensorFlow, Keras or PyTorch, as first-class citizens on Apache Spark is Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark. Messrs Tim Hunter and Xiangrui Meng will share how to unify data and AI in order to simplify building production-ready AI applications. Continuing on how Spark integrates well with deep learning frameworks, Messrs Tim Hunter and Debajyoti Roy will discuss how to accomplish Geospatial Analytics with Apache Spark and Deep Learning Pipelines.

Now, if you are a news junky and wonder how to decipher and discern its emotive content, you may marvel at the use of sophisticated algorithms and techniques behind it and how to apply it. In this fascinating use case, An AI Use Case: Market Event Impact Determination via Sentiment and Emotion Analysis, Messrs Lei Gao and Jin Wang, both from IBM, will reveal the technical mastery. Similarly, if you are curious how social media images, when analyzed and categorized employing AI, are engaging, say for marketing campaigns, this session from Jack McCush will prove equally absorbing: Predicting Social Engagement of Social Images with Deep Learning.

At the Spark + AI Summit in San Francisco, Jeffrey Yau’s talk on Time Series Forecasting Using Recurrent Neural Network and Vector Autoregressive Model: When and How was a huge hit. He will repeat it in London on how two specific techniques—Vector Autoregressive (VAR) Models and Recurrent Neural Network (RNN)—can be applied to financial models. Also, IBM’s Nick Pentreath will provide an overview of RNN, modeling time series and sequence data, common architectures, and optimizing techniques in his talk, Recurrent Neural Networks for Recommendations and Personalization.

Data and AI are all about scale—about productizing ML models, about managing data infrastructure that supports the ML code. Three talks seem to fit the bill that may be of interest to you. First is from Josef Habdan, an aviation use case aptly titled, RealTime Clustering Engine Using Structured Streaming and SparkML Running on Billions of Rows a Day. Second is from Messrs Gaurav Agarwal and Mani Parkhe, from Uber and Databricks, Intelligence Driven User Communications at Scale. And third deals with productionizing scalable and high volume Apache Spark pipelines for smart-homes with hundreds of thousands of users. Erni Durdevic of Quby in his talk will share, Lessons Learned Developing and Managing High Volume Apache Spark Pipelines in Production.

At Databricks we cherish our founders’ academic roots, so all previous summits have had research tracks. Research in technology heralds paradigm shifts—for instance, at UC Berkeley AMPLab, it led to Spark; at Google, it led to TensorFlow; at CERN it led to the world wide web. Nikolay Malitsky’s talk, Spark-MPI: Approaching the Fifth Paradigm, will address the existing impedance mismatch between data-intensive and compute-intensive ecosystems by presenting the Spark-MPI approach, based on the MPI Process Management Interface (PMI). And Intel researchers Qi Xie and Sophia Sun will share a case study: Accelerating Apache Spark with FPGAs: A Case Study for 10TB TPCx-HS Spark Benchmark Acceleration with FPGA.

And finally, if you’re new to TensorFlow, Keras or PyTorch and want to learn how it all fits in the grand scheme of data and AI, you can enroll in a training course offered on both AWS and Azure: Hands-on Deep Learning with Keras, Tensorflow, and Apache Spark. Or to get a cursory and curated Tale of Three Deep Learning Frameworks, attend Brooke Wenig’s and my talk.

What’s Next

You can also peruse and pick sessions from the schedule, too. In the next blog, we will share our picks from sessions related to Data Science, Developer and Deep Dives.

If you have not registered yet, take advantage of the early bird before July 27, 2018 and save £300. See you there!

Read More

Find out more about initial keynotes: Spark + AI Summit Europe Agenda Announced


Try Databricks for free. Get started today.

The post A Guide to AI, Machine Learning, and Deep Learning Talks at Spark + AI Summit Europe appeared first on Databricks.

Continue Reading…


Read More

Simple object tracking with OpenCV

Today’s tutorial kicks off a new series of blog posts on object tracking, arguably one of the most requested topics here on PyImageSearch.

Object tracking is the process of:

  1. Taking an initial set of object detections (such as an input set of bounding box coordinates)
  2. Creating a unique ID for each of the initial detections
  3. And then tracking each of the objects as they move around frames in a video, maintaining the assignment of unique IDs

Furthermore, object tracking allows us to apply a unique ID to each tracked object, making it possible for us to count unique objects in a video. Object tracking is paramount to building a person counter (which we’ll do later in this series).

An ideal object tracking algorithm will:

  • Only require the object detection phase once (i.e., when the object is initially detected)
  • Will be extremely fast — much faster than running the actual object detector itself
  • Be able to handle when the tracked object “disappears” or moves outside the boundaries of the video frame
  • Be robust to occlusion
  • Be able to pick up objects it has “lost” in between frames

This is a tall order for any computer vision or image processing algorithm and there are a variety of tricks we can play to help improve our object trackers.

But before we can build such a robust method we first need to study the fundamentals of object tracking.

In today’s blog post, you will learn how to implement centroid tracking with OpenCV, an easy to understand, yet highly effective tracking algorithm.

In future posts in this object tracking series, I’ll start going into more advanced kernel-based and correlation-based tracking algorithms.

To learn how to get started building your first object tracking with OpenCV, just keep reading!

Looking for the source code to this post?
Jump right to the downloads section.

Simple object tracking with OpenCV

In the remainder of this post, we’ll be implementing a simple object tracking algorithm using the OpenCV library.

This object tracking algorithm is called centroid tracking as it relies on the Euclidean distance between (1) existing object centroids (i.e., objects the centroid tracker has already seen before) and (2) new object centroids between subsequent frames in a video.

We’ll review the centroid algorithm in more depth in the following section. From there we’ll implement a Python class to contain our centroid tracking algorithm and then create a Python script to actually run the object tracker and apply it to input videos.

Finally, we’ll run our object tracker and examine the results, noting both the positives and the drawbacks of the algorithm.

The centroid tracking algorithm

The centroid tracking algorithm is a multi-step process. We will review each of the tracking steps in this section.

Step #1: Accept bounding box coordinates and compute centroids

Figure 1: To build a simple object tracking algorithm using centroid tracking, the first step is to accept bounding box coordinates from an object detector and use them to compute centroids.

The centroid tracking algorithm assumes that we are passing in a set of bounding box (x, y)-coordinates for each detected object in every single frame.

These bounding boxes can be produced by any type of object detector you would like (color thresholding + contour extraction, Haar cascades, HOG + Linear SVM, SSDs, Faster R-CNNs, etc.), provided that they are computed for every frame in the video.

Once we have the bounding box coordinates we must compute the “centroid”, or more simply, the center (x, y)-coordinates of the bounding box. Figure 1 above demonstrates accepting a set of bounding box coordinates and computing the centroid.

Since these are the first initial set of bounding boxes presented to our algorithm we will assign them unique IDs.

Step #2: Compute Euclidean distance between new bounding boxes and existing objects

Figure 2: Three objects are present in this image for simple object tracking with Python and OpenCV. We need to compute the Euclidean distances between each pair of original centroids (red) and new centroids (green).

For every subsequent frame in our video stream we apply Step #1 of computing object centroids; however, instead of assigning a new unique ID to each detected object (which would defeat the purpose of object tracking), we first need to determine if we can associate the new object centroids (yellow) with the old object centroids (purple). To accomplish this process, we compute the Euclidean distance (highlighted with green arrows) between each pair of existing object centroids and input object centroids.

From Figure 2 you can see that we have this time detected three objects in our image. The two pairs that are close together are two existing objects.

We then compute the Euclidean distances between each pair of original centroids (purple) and new centroids (purple). But how do we use the Euclidean distances between these points to actually match them and associate them?

The answer is in Step #3.

Step #3: Update (x, y)-coordinates of existing objects

Figure 3: Our simple centroid object tracking method has associated objects with minimized object distances. What do we do about the object in the bottom left though?

The primary assumption of the centroid tracking algorithm is that a given object will potentially move in between subsequent frames, but the distance between the centroids for frames F_t and F_{t + 1} will be smaller than all other distances between objects.

Therefore, if we choose to associate centroids with minimum distances between subsequent frames we can build our object tracker.

In Figure 3 you can see how our centroid tracker algorithm chooses to associate centroids that minimize their respective Euclidean distances.

But what about the lonely point in the bottom-left?

It didn’t get associated with anything — what do we do with it?

Step #4: Register new objects

Figure 4: In our object tracking with Python and OpenCV example, we have a new object that wasn’t matched with an existing object, so it is registered as object ID #3.

In the event that there are more input detections than existing objects being tracked, we need to register the new object. “Registering” simply means that we are adding the new object to our list of tracked objects by:

  1. Assigning it a new object ID
  2. Storing the centroid of the bounding box coordinates for that object

We can then go back to Step #2 and repeat the pipeline of steps for every frame in our video stream.

Figure 4 demonstrates the process of using the minimum Euclidean distances to associate existing object IDs and then registering a new object.

Step #5: Deregister old objects

Any reasonable object tracking algorithm needs to be able to handle when an object has been lost, disappeared, or left the field of view.

Exactly how you handle these situations is really dependent on where your object tracker is meant to be deployed, but for this implementation, we will deregister old objects when they cannot be matched to any existing objects for a total of N subsequent frames.

Object tracking project structure

To see today’s project structure in your terminal, simply use the


$ tree --dirsfirst
├── pyimagesearch
│   ├──
│   └──
├── deploy.prototxt
└── res10_300x300_ssd_iter_140000.caffemodel

1 directory, 5 files


  module is not pip-installable — it is included with today’s “Downloads” (which you’ll find at the bottom of this post). Inside you’ll find the
  file which contains the


  class is an important component used in the
  driver script.

The remaining

  files are part of the OpenCV deep learning face detector. They are necessary for today’s face detection + tracking method, but you could easily use another form of detection (more on that later).

Be sure that you have NumPy, SciPy, and imutils installed before you proceed:

$ pip install numpy scipy imutils

…in addition to having OpenCV 3.3+ installed. If you follow one of my OpenCV install tutorials, be sure to replace the tail end of the

  command to grab at least OpenCV 3.3 (and update the paths in the CMake command). You’ll need 3.3+ to ensure you have the DNN module.

Implementing centroid tracking with OpenCV

Before we can apply object tracking to our input video streams, we first need to implement the centroid tracking algorithm. While you’re digesting this centroid tracker script, just keep in mind Steps 1-5 above and review the steps as necessary.

As you’ll see, the translation of steps to code requires quite a bit of thought, and while we perform all steps, they aren’t linear due to the nature of our various data structures and code constructs.

I would suggest

  1. Reading the steps above
  2. Reading the code explanation for the centroid tracker
  3. And finally reading the steps above once more

This process will bring everything full circle and allow you to wrap your head around the algorithm.

Once you’re sure you understand the steps in the centroid tracking algorithm, open up the
  inside the
  module and let’s review the code:

# import the necessary packages
from scipy.spatial import distance as dist
from collections import OrderedDict
import numpy as np

class CentroidTracker():
	def __init__(self, maxDisappeared=50):
		# initialize the next unique object ID along with two ordered
		# dictionaries used to keep track of mapping a given object
		# ID to its centroid and number of consecutive frames it has
		# been marked as "disappeared", respectively
		self.nextObjectID = 0
		self.objects = OrderedDict()
		self.disappeared = OrderedDict()

		# store the number of maximum consecutive frames a given
		# object is allowed to be marked as "disappeared" until we
		# need to deregister the object from tracking
		self.maxDisappeared = maxDisappeared

On Lines 2-4 we import our required packages and modules —

 , and


  class is defined on Line 6. The constructor accepts a single parameter, the maximum number of consecutive frames a given object has to be lost/disappeared for until we remove it from our tracker (Line 7).

Our constructor builds four class variables:

  • nextObjectID
     : A counter used to assign unique IDs to each object (Line 12). In the case that an object leaves the frame and does not come back for
      frames, a new (next) object ID would be assigned.
  • objects
     : A dictionary that utilizes the object ID as the key and the centroid (x, y)-coordinates as the value (Line 13).
  • disappeared
     : Maintains number of consecutive frames (value) a particular object ID (key) has been marked as “lost”for (Line 14).
  • maxDisappeared
     : The number of consecutive frames an object is allowed to be marked as “lost/disappeared” until we deregister the object.

Let’s define the

  method which is responsible for adding new objects to our tracker:

def register(self, centroid):
		# when registering an object we use the next available object
		# ID to store the centroid
		self.objects[self.nextObjectID] = centroid
		self.disappeared[self.nextObjectID] = 0
		self.nextObjectID += 1


  method is defined on Line 21. It accepts a
  and then adds it to the
  dictionary using the next available object ID.

The number of times an object has disappeared is initialized to

  in the
  dictionary (Line 25).

Finally, we increment the

  so that if a new object comes into view, it will be associated with a unique ID (Line 26).

Similar to our

  method, we also need a

def deregister(self, objectID):
		# to deregister an object ID we delete the object ID from
		# both of our respective dictionaries
		del self.objects[objectID]
		del self.disappeared[objectID]

Just like we can add new objects to our tracker, we also need the ability to remove old ones that have been lost or disappeared from our the input frames themselves.


  method is defined on Line 28. It simply deletes the
  in both the
  dictionaries, respectively (Lines 31 and 32).

The heart of our centroid tracker implementation lives inside the


def update(self, rects):
		# check to see if the list of input bounding box rectangles
		# is empty
		if len(rects) == 0:
			# loop over any existing tracked objects and mark them
			# as disappeared
			for objectID in self.disappeared.keys():
				self.disappeared[objectID] += 1

				# if we have reached a maximum number of consecutive
				# frames where a given object has been marked as
				# missing, deregister it
				if self.disappeared[objectID] > self.maxDisappeared:

			# return early as there are no centroids or tracking info
			# to update
			return self.objects

The update method, defined on Line 34, accepts a list of bounding box rectangles, presumably from an object detector (Haar cascade, HOG + Linear SVM, SSD, Faster R-CNN, etc.). The format of the

  parameter is assumed to be a tuple with this structure:
(startX, startY, endX, endY)

If there are no detections, we’ll loop over all object IDs and increment their

  count (Lines 37-41). We’ll also check if we have reached the maximum number of consecutive frames a given object has been marked as missing. If that is the case we need to remove it from our tracking systems (Lines 46 and 47). Since there is no tracking info to update, we go ahead and
  early on Line 51.

Otherwise, we have quite a bit of work to do over the next seven code blocks in the


# initialize an array of input centroids for the current frame
		inputCentroids = np.zeros((len(rects), 2), dtype="int")

		# loop over the bounding box rectangles
		for (i, (startX, startY, endX, endY)) in enumerate(rects):
			# use the bounding box coordinates to derive the centroid
			cX = int((startX + endX) / 2.0)
			cY = int((startY + endY) / 2.0)
			inputCentroids[i] = (cX, cY)

On Line 54 we’ll initialize a NumPy array to store the centroids for each


Then, we loop over bounding box rectangles (Line 57) and compute the centroid and store it in the

  list (Lines 59-61).

If there are currently no objects we are tracking, we’ll register each of the new objects:

# if we are currently not tracking any objects take the input
		# centroids and register each of them
		if len(self.objects) == 0:
			for i in range(0, len(inputCentroids)):

Otherwise, we need to update any existing object (x, y)-coordinates based on the centroid location that minimizes the Euclidean distance between them:

# otherwise, are are currently tracking objects so we need to
		# try to match the input centroids to existing object
		# centroids
			# grab the set of object IDs and corresponding centroids
			objectIDs = list(self.objects.keys())
			objectCentroids = list(self.objects.values())

			# compute the distance between each pair of object
			# centroids and input centroids, respectively -- our
			# goal will be to match an input centroid to an existing
			# object centroid
			D = dist.cdist(np.array(objectCentroids), inputCentroids)

			# in order to perform this matching we must (1) find the
			# smallest value in each row and then (2) sort the row
			# indexes based on their minimum values so that the row
			# with the smallest value as at the *front* of the index
			# list
			rows = D.min(axis=1).argsort()

			# next, we perform a similar process on the columns by
			# finding the smallest value in each column and then
			# sorting using the previously computed row index list
			cols = D.argmin(axis=1)[rows]

The updates to existing tracked objects take place beginning at the

  on Line 72. The goal is to track the objects and to maintain correct object IDs — this process is accomplished by computing the Euclidean distances between all pairs of
 , followed by associating object IDs that minimize the Euclidean distance.

Inside of the else block beginning on Line 72, we will:

  • Grab
      values (Lines 74 and 75).
  • Compute the distance between each pair of existing object centroids and new input centroids (Line 81). The output NumPy array shape of our distance map
      will be
    (# of object centroids, # of input centroids)
  • To perform the matching we must (1) Find the smallest value in each row, and (2) Sort the row indexes based on the minimum values (Line 88). We perform a very similar process on the columns, finding the smallest value in each column, and then sorting them based on the ordered rows (Line 93). Our goal is to have the index values with the smallest corresponding distance at the front of the lists.

The next step is to use the distances to see if we can associate object IDs:

# in order to determine if we need to update, register,
			# or deregister an object we need to keep track of which
			# of the rows and column indexes we have already examined
			usedRows = set()
			usedCols = set()

			# loop over the combination of the (row, column) index
			# tuples
			for (row, col) in zip(rows, cols):
				# if we have already examined either the row or
				# column value before, ignore it
				# val
				if row in usedRows or col in usedCols:

				# otherwise, grab the object ID for the current row,
				# set its new centroid, and reset the disappeared
				# counter
				objectID = objectIDs[row]
				self.objects[objectID] = inputCentroids[col]
				self.disappeared[objectID] = 0

				# indicate that we have examined each of the row and
				# column indexes, respectively

Inside the code block above, we:

  • Initialize two sets to determine which row and column indexes we have already used (Lines 98 and 99). Keep in mind that a set is similar to a list but it contains only unique values.
  • Then we loop over the combinations of
    (row, col)
      index tuples (Line 103) in order to update our object centroids:
    • If we’ve already used either this row or column index, ignore it and
        to loop (Lines 107 and 108).
    • Otherwise, we have found an input centroid that:
      • 1. Has the smallest Euclidean distance to an existing centroid
      • 2. And has not been matched with any other object
      • In that case, we update the object centroid (Lines 113-115) and make sure to add the
          to their respective

There are likely indexes in our

  sets that we have NOT examined yet:

# compute both the row and column index we have NOT yet
			# examined
			unusedRows = set(range(0, D.shape[0])).difference(usedRows)
			unusedCols = set(range(0, D.shape[1])).difference(usedCols)

So we must determine which centroid indexes we haven’t examined yet and store them in two new convenient sets (

 ) on Lines 124 and 125.

Our final check handles any objects that have become lost or if they’ve potentially disappeared:

# in the event that the number of object centroids is
			# equal or greater than the number of input centroids
			# we need to check and see if some of these objects have
			# potentially disappeared
			if D.shape[0] >= D.shape[1]:
				# loop over the unused row indexes
				for row in unusedRows:
					# grab the object ID for the corresponding row
					# index and increment the disappeared counter
					objectID = objectIDs[row]
					self.disappeared[objectID] += 1

					# check to see if the number of consecutive
					# frames the object has been marked "disappeared"
					# for warrants deregistering the object
					if self.disappeared[objectID] > self.maxDisappeared:

To finish up:

  • If the number of object centroids is greater than or equal to the number of input centroids (Line 131):
    • We need to verify if any of these objects are lost or have disappeared by looping over unused row indexes if any (Line 133).
    • In the loop, we will:
      • 1. Increment their
          count in the dictionary (Line 137).
      • 2. Check if the
          count exceeds the
          threshold (Line 142), and, if so we’ll deregister the object (Line 143).

Otherwise, the number of input centroids is greater than the number of existing object centroids, so we have new objects to register and track:

# otherwise, if the number of input centroids is greater
			# than the number of existing object centroids we need to
			# register each new input centroid as a trackable object
				for col in unusedCols:

		# return the set of trackable objects
		return self.objects

We loop over the

  indexes (Line 149) and we register each new centroid (Line 150). Finally, we’ll return the set of trackable objects to the calling method (Line 153).

Understanding the centroid tracking distance relationship

Our centroid tracking implementation was quite long, and admittedly, the most confusing aspect of the algorithm is Lines 81-93.

If you’re having trouble following along with what that code is doing you should consider opening a Python shell and performing the following experiment:

>>> from scipy.spatial import distance as dist
>>> import numpy as np
>>> np.random.seed(42)
>>> objectCentroids = np.random.uniform(size=(2, 2))
>>> centroids = np.random.uniform(size=(3, 2))
>>> D = dist.cdist(objectCentroids, centroids)
>>> D
array([[0.82421549, 0.32755369, 0.33198071],
       [0.72642889, 0.72506609, 0.17058938]])

Once you’ve started a Python shell in your terminal with the

  command, import
  as shown on Lines 1 and 2).

Then, set a seed for reproducibility (Line 3) and generate 2 (random) existing

  (Line 4) and 3
  (Line 5).

From there, compute the Euclidean distance between the pairs (Line 6) and display the results (Lines 7-9). The result is a matrix

  of distances with two rows (# of existing object centroids) and three columns (# of new input centroids).

Just like we did earlier in the script, let’s find the minimum distance in each row and sort the indexes based on this value:

>>> D.min(axis=1)
array([0.32755369, 0.17058938])
>>> rows = D.min(axis=1).argsort()
>>> rows
array([1, 0])

First, we find the minimum value for each row, allowing is to figure out which existing object is closest to the new input centroid (Lines 10 and 11). By then sorting on these values (Line 12) we can obtain the indexes of these

  (Lines 13 and 14).

In this case, the second row (index

 ) has the smallest value and then the first row (index
 ) has the next smallest value.

Using a similar process for the columns:

>>> D.argmin(axis=1)
array([1, 2])
>>> cols = D.argmin(axis=1)[rows]
>>> cols
array([2, 1])

…we first examine the values in the columns and find the index of the value with the smallest column (Lines 15 and 16).

We then sort these values using our existing

  (Lines 17-19).

Let’s print the results and analyze them:

>>> print(list(zip(rows, cols)))
[(1, 2), (0, 1)]

The final step is to combine them using

  (Lines 20). The resulting list is printed on Line 21.

Analyzing the results, we find that:

  • D[1, 2]
      has the smallest Euclidean distance implying that the second existing object will be matched against the third input centroid.
  • And
    D[0, 1]
      has the next smallest Euclidean distance which implies that the first existing object will be matched against the second input centroid.

I’d like to reiterate here that now that you’ve reviewed the code, you should go back and review the steps to the algorithm in the previous section. From there you’ll be able to associate the code with the more linear steps outlined here.

Implementing the object tracking driver script

Now that we have implemented our

  class, let’s put it to work with an object tracking driver script.

The driver script is where you can use your own preferred object detector, provided that it produces a set of bounding boxes. This could be a Haar Cascade, HOG + Linear SVM, YOLO, SSD, Faster R-CNN, etc. For this example script, I’m making use of OpenCV’s deep learning face detector, but feel free to make your own version of the script which implements a different detector.

Inside this script, we will:

  • Work with a live
      object to grab frames from your webcam
  • Load and utilize OpenCV’s deep learning face detector
  • Instantiate our
      and use it to track face objects in the video stream
  • And display our results which includes bounding boxes and object ID annotations overlaid on the frames

When you’re ready, open up
  from today’s “Downloads” and follow along:

# import the necessary packages
from pyimagesearch.centroidtracker import CentroidTracker
from import VideoStream
import numpy as np
import argparse
import imutils
import time
import cv2

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--prototxt", required=True,
	help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
	help="path to Caffe pre-trained model")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
args = vars(ap.parse_args())

First, we specify our imports. Most notably we’re using the

  class that we just reviewed. We’re also going to use
  and OpenCV.

We have three command line arguments which are all related to our deep learning face detector:

  • --prototxt
     : The path to the Caffe “deploy” prototxt.
  • --model
     : The path to the pre-trained model models.
  • --confidence
     : Our probability threshold to filter weak detections. I found that a default value of
      is sufficient.

The prototxt and model files come from OpenCV’s repository and I’m including them in the “Downloads” for your convenience.

Note: In case you missed it at the start of this section, I’ll repeat that you can use any detector you wish. As an example, we’re using a deep learning face detector which produces bounding boxes. Feel free to experiment with other detectors, just be sure that you have capable hardware to keep up with the more complex ones (some may run best with a GPU, but today’s face detector can easily run on a CPU).

Next, let’s perform our initializations:

# initialize our centroid tracker and frame dimensions
ct = CentroidTracker()
(H, W) = (None, None)

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

# initialize the video stream and allow the camera sensor to warmup
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()

In the above block, we:

  • Instantiate our
      (Line 21). Recall from the explanation in the previous section that this object has three methods: (1)
     , (2)
     , and (3)
     . We’re only going to use the
      method as it will register and deregister objects automatically. We also initialize
      (our frame dimensions) to
      (Line 22).
  • Load our serialized deep learning face detector model from disk using OpenCV’s DNN module (Line 26).
  • Start our
      (Line 30). With
      handy, we’ll be able to capture frames from our camera in our next
      loop. We’ll allow our camera
      seconds to warm up (Line 31).

Now let’s begin our

  loop and start tracking face objects:

# loop over the frames from the video stream
while True:
	# read the next frame from the video stream and resize it
	frame =
	frame = imutils.resize(frame, width=400)

	# if the frame dimensions are None, grab them
	if W is None or H is None:
		(H, W) = frame.shape[:2]

	# construct a blob from the frame, pass it through the network,
	# obtain our output predictions, and initialize the list of
	# bounding box rectangles
	blob = cv2.dnn.blobFromImage(frame, 1.0, (W, H),
		(104.0, 177.0, 123.0))
	detections = net.forward()
	rects = []

We loop over frames and

  them to a fixed width (while preserving aspect ratio) on Lines 34-47. Our frame dimensions are grabbed as needed (Lines 40 and 41).

Then we pass the frame through the CNN object detector to obtain predictions and object locations (Lines 46-49).

We initialize a list of

 , our bounding box rectangles on Line 50.

From there, let’s process the detections:

# loop over the detections
	for i in range(0, detections.shape[2]):
		# filter out weak detections by ensuring the predicted
		# probability is greater than a minimum threshold
		if detections[0, 0, i, 2] > args["confidence"]:
			# compute the (x, y)-coordinates of the bounding box for
			# the object, then update the bounding box rectangles list
			box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])

			# draw a bounding box surrounding the object so we can
			# visualize it
			(startX, startY, endX, endY) = box.astype("int")
			cv2.rectangle(frame, (startX, startY), (endX, endY),
				(0, 255, 0), 2)

We loop over the detections beginning on Line 53. If the detection exceeds our confidence threshold, indicating a valid detection, we:

  • Compute the bounding box coordinates and append them to the
      list (Lines 59 and 60)
  • Draw a bounding box around the object (Lines 64-66)

Finally, let’s call

  on our centroid tracker object,

# update our centroid tracker using the computed set of bounding
	# box rectangles
	objects = ct.update(rects)

	# loop over the tracked objects
	for (objectID, centroid) in objects.items():
		# draw both the ID of the object and the centroid of the
		# object on the output frame
		text = "ID {}".format(objectID)
		cv2.putText(frame, text, (centroid[0] - 10, centroid[1] - 10),
			cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2), (centroid[0], centroid[1]), 4, (0, 255, 0), -1)

	# show the output frame
	cv2.imshow("Frame", frame)
	key = cv2.waitKey(1) & 0xFF

	# if the `q` key was pressed, break from the loop
	if key == ord("q"):

# do a bit of cleanup


  call on Line 70 handles the heavy lifting in our simple object tracker with Python and OpenCV script.

We would be done here and ready to loop back to the top if we didn’t care about visualization.

But that’s no fun!

On Lines 73-79 we display the centroid as a filled in circle and the unique object ID number text. Now we’ll be able to visualize the results and check to see if our

  properly keeps track of our objects by associating the correct IDs with the objects in the video stream.

We’ll display the frame on Line 82 until the quit key (“q”) has been pressed (Lines 83-87). If the quit key is pressed, we simply 

  and perform cleanup (Lines 87-91).

Centroid object tracking results

To see our centroid tracker in action using the “Downloads” section of this blog post to download the source code and OpenCV face detector. From there, open up a terminal and execute the following command:

$ python --prototxt deploy.prototxt \
	--model res10_300x300_ssd_iter_140000.caffemodel
[INFO] loading model...
[INFO] starting video stream...

Below you can see an example of a single face (my face) being detected and tracked:

This second example includes two objects being correctly detected and tracked:

Notice how I even though the second face is “lost” once I move the book cover outside the view of the camera, our object tracking is able to pick the face back up again when it comes into view. If the face had existed outside the field of view for more than 50 frames, the object would have been deregistered.

The final example animation here demonstrates tracking three unique objects:

Again, despite object ID #2 being unsuccessfully detected between some frames, our object tracking algorithm is able to find it again and associate it with its original centroid.

For a more detailed demonstration of our object tracker, including commentary, be sure to refer to the video below:

Limitations and drawbacks

While our centroid tracker worked great in this example, there are two primary drawbacks of this object tracking algorithm.

The first is that it requires that object detection step to be run on every frame of the input video.

  • For very fast object detectors (i.e., color thresholding and Haar cascades) having to run the detector on every input frame is likely not an issue.
  • But if you are (1) using a significantly more computationally expensive object detector such as HOG + Linear SVM or deep learning-based detectors on (2) a resource-constrained device, your frame processing pipeline will slow down tremendously as you will be spending the entire pipeline running a very slow detector.

The second drawback is related to the underlying assumptions of the centroid tracking algorithm itself — centroids must lie close together between subsequent frames.

  • This assumption typically holds, but keep in mind we are representing our 3D world with 2D frames — what happens when an object overlaps with another one?
  • The answer is that object ID switching could occur.
  • If two or more objects overlap each other to the point where their centroids intersect and instead have the minimum distance to the other respective object, the algorithm may (unknowingly) swap the object ID.
  • It’s important to understand that the overlapping/occluded object problem is not specific to centroid tracking — it happens for many other object trackers as well, including advanced ones.
  • However, the problem is more pronounced with centroid tracking as we relying strictly on the Euclidean distances between centroids and no additional metrics, heuristics, or learned patterns.

As long as you keep these assumptions and limitations in mind when using centroid tracking the algorithm will work wonderfully for you.


In today’s blog post you learned how to perform simple object tracking with OpenCV using an algorithm called centroid tracking.

The centroid tracking algorithm works by:

  1. Accepting bounding box coordinates for each object in every frame (presumably by some object detector).
  2. Computing the Euclidean distance between the centroids of the input bounding boxes and the centroids of existing objects that we already have examined.
  3. Updating the tracked object centroids to their new centroid locations based on the new centroid with the smallest Euclidean distance.
  4. And if necessary, marking objects as either “disappeared” or deregistering them completely.

Our centroid tracker performed well in this example tutorial but has two primary downsides:

  1. It requires that we run an object detector for each frame of the video — if your object detector is computationally expensive to run you would not want to utilize this method.
  2. It does not handle overlapping objects well and due to the nature of the Euclidean distance between centroids, it’s actually possible for our centroids to “swap IDs” which is far from ideal.

Despite its downsides, centroid tracking can be used in quite a few object tracking applications provided (1) your environment is somewhat controlled and you don’t have to worry about potentially overlapping objects and (2) your object detector itself can be run in real-time.

If you enjoyed today’s blog post, be sure to download the code using the form below. I’ll be back next week with another object tracking tutorial!


If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!

The post Simple object tracking with OpenCV appeared first on PyImageSearch.

Continue Reading…


Read More

From no-data to data: The awkward transition

I was going to write a post with the above title, but now I don’t remember what I was going to say!

The post From no-data to data: The awkward transition appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Comparison of Top 6 Python NLP Libraries

Today, we want to outline and compare the most popular and helpful natural language processing libraries, based on our experience.

Continue Reading…


Read More

Top 20 Python AI and Machine Learning Open Source Projects

Getting into Machine Learning and AI is not an easy task. Many aspiring professionals and enthusiasts find it hard to establish a proper path into the field, given the enormous amount of resources available today. The field is evolving constantly and it is crucial that we keep up with the pace of this rapid development. In order to cope with this overwhelming speed of evolution and innovation, a good way to stay updated and knowledgeable on the advances of ML, is to engage with the community by contributing to the many open-source projects and tools that are used daily by advanced professionals.

Here we update the information and examine the trends since our previous post Top 20 Python Machine Learning Open Source Projects (Nov 2016).

Tensorflow has moved to the first place with triple-digit growth in contributors. Scikit-learn dropped to 2nd place, but still has a very large base of contributors.

Compared to 2016, the projects with the fastest growth in number of contributors were

  1. TensorFlow, 169% up, from 493 to 1324 contributors
  2. Deap, 86% up, from 21 to 39 contributors
  3. Chainer, 83% up, from 84 to 154 contributors
  4. Gensim, 81% up, from 145 to 262 contributors
  5. Neon, 66% up, from 47 to 78 contributors
  6. Nilearn, 50% up, from 46 to 69 contributors

Also new in 2018:

  1. Keras, 629 contributors
  2. PyTorch, 399 contributors


Fig. 1: Top 20 Python AI and Machine Learning projects on Github.

Size is proportional to the number of contributors, and color represents to the change in the number of contributors - red is higher, blue is lower. Snowflake shape is for Deep Learning projects, round for other projects.

We see that Deep Learning projects like TensorFlow, Theano, and Caffe are among the most popular.

The list below gives projects in descending order based on the number of contributors on Github. The change in number of contributors is versus 2016 KDnuggets Post on Top 20 Python Machine Learning Open Source Projects.

We hope you enjoy going through the documentation pages of each of these to start collaborating and learning the ways of Machine Learning using Python.

  1. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization. The system is designed to facilitate research in machine learning, and to make it quick and easy to transition from research prototype to production system.
    Contributors: 1324 (168% up), Commits: 28476, Stars: 92359. Github URL: Tensorflow

  2. Scikit-learn is simple and efficient tools for data mining and data analysis, accessible to everybody, and reusable in various context, built on NumPy, SciPy, and matplotlib, open source, commercially usable – BSD license.
    Contributors: 1019 (39% up), Commits: 22575, Github URL: Scikit-learn

  3. Keras, a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
    Contributors: 629 (new), Commits: 4371, Github URL: Keras

  4. PyTorch, Tensors and Dynamic neural networks in Python with strong GPU acceleration.
    Contributors: 399 (new), Commits: 6458, Github URL: pytorch

  5. Theano allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
    Contributors: 327 (24% up), Commits: 27931, Github URL: Theano

  6. Gensim is a free Python library with features such as scalable statistical semantics, analyze plain-text documents for semantic structure, retrieve semantically similar documents.
    Contributors: 262 (81% up), Commits: 3549, Github URL: Gensim

  7. Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors.
    Contributors: 260 (21% up), Commits: 4099, Github URL: Caffe

  8. Chainer is a Python-based, standalone open source framework for deep learning models. Chainer provides a flexible, intuitive, and high performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational auto-encoders.
    Contributors: 154 (84% up), Commits: 12613, Github URL: Chainer

  9. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
    Contributors: 144 (33% up), Commits: 9729, Github URL: Statsmodels

  10. Shogun is Machine learning toolbox which provides a wide range of unified and efficient Machine Learning (ML) methods. The toolbox seamlessly allows to easily combine multiple data representations, algorithm classes, and general purpose tools.
    Contributors: 139 (32% up), Commits: 16362, Github URL: Shogun

  11. Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc) using mathematical expressions, and Theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU).
    Contributors: 119 (3.5% up), Commits: 7119, Github URL: Pylearn2

  12. NuPIC is an open source project based on a theory of neocortex called Hierarchical Temporal Memory (HTM). Parts of HTM theory have been implemented, tested, and used in applications, and other parts of HTM theory are still being developed.
    Contributors: 85 (12% up), Commits: 6588, Github URL: NuPIC

  13. Neon is Nervana's Python-based deep learning library. It provides ease of use while delivering the highest performance.
    Contributors: 78 (66% up), Commits: 1112, Github URL: Neon

  14. Nilearn is a Python module for fast and easy statistical learning on NeuroImaging data. It leverages the scikit-learn Python toolbox for multivariate statistics with applications such as predictive modelling, classification, decoding, or connectivity analysis.
    Contributors: 69 (50% up), Commits: 6198, Github URL: Nilearn

  15. Orange3 is open source machine learning and data visualization for novice and expert. Interactive data analysis workflows with a large toolbox.
    Contributors: 53 (33% up), Commits: 8915, Github URL: Orange3

  16. Pymc is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems.
    Contributors: 39 (5.4% up), Commits: 2721, Github URL: Pymc

  17. Deap is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent. It works in perfect harmony with parallelisation mechanism such as multiprocessing and SCOOP.
    Contributors: 39 (86% up), Commits: 1960, Github URL: Deap

  18. Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into memory so that many processes may share the same data.
    Contributors: 35 (46% up), Commits: 527, Github URL: Annoy

  19. PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.
    Contributors: 32 (3% up), Commits: 992, Github URL: PyBrain

  20. Fuel is a data pipeline framework which provides your machine learning models with the data they need. It is planned to be used by both the Blocks and Pylearn2 neural network libraries.
    Contributors: 32 (10% up), Commits: 1116, Github URL: Fuel

The contributor and commit numbers were recorded in February 2018.

Editor's note: This was originally posted on KDNuggets, and has been reposted with permission. Author Ilan Reinstein is a physicist and data scientist.

Continue Reading…


Read More

Rank Minimization for Snapshot Compressive Imaging - implementation -

Yang just sent me the following:

Hi Igor,

I am writing regarding a paper on compressive sensing you may find of interest, co-authored with Xin Yuan, Jinli Suo, David Brady, and Qionghai Dai. We get exciting results on snapshot compressive imaging (SCI), i.e., encoding each frame of an image sequence with a spectral-, temporal-, or angular- variant random mask and summing them pixel-by-pixel to form one-shot measurement. Snapshot compressive hyperspectral, high-speed, and ligh-field imaging are among representatives.

We combine rank minimization to exploit the nonlocal self-similarity of natural scenes, which is widely acknowledged in image/video processing and alternating minimization approach to solve this problem. Results of both simulation and real data from four different SCI systems, where measurement noise is dominant, demonstrate that our proposed algorithm leads to significant improvements (>4dB in PSNR) and more robustness to noise compared with current state-of-the-art algorithms.

Paper arXiv link:
Github repository link:

Here is an animated demo for visualization and comparison with the state-of-the-art algorithms, , i.e., GMM-TP (TIP'14), MMLE-GMM (TIP'15), MMLE-MFA (TIP'15), and GAP-TV (ICIP'16).
Yang ( 

Thanks Yang !

Snapshot compressive imaging (SCI) refers to compressive imaging systems where multiple frames are mapped into a single measurement, with video compressive imaging and hyperspectral compressive imaging as two representative applications. Though exciting results of high-speed videos and hyperspectral images have been demonstrated, the poor reconstruction quality precludes SCI from wide applications.This paper aims to boost the reconstruction quality of SCI via exploiting the high-dimensional structure in the desired signal. We build a joint model to integrate the nonlocal self-similarity of video/hyperspectral frames and the rank minimization approach with the SCI sensing process. Following this, an alternating minimization algorithm is developed to solve this non-convex problem. We further investigate the special structure of the sampling process in SCI to tackle the computational workload and memory issues in SCI reconstruction. Both simulation and real data (captured by four different SCI cameras) results demonstrate that our proposed algorithm leads to significant improvements compared with current state-of-the-art algorithms. We hope our results will encourage the researchers and engineers to pursue further in compressive imaging for real applications.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Continue Reading…


Read More

How to take machine learning from exploration to implementation

Recognizing the interest in ML, the Strata Data Conference program is designed to help companies adopt ML across large sections of their existing operations.

Interest in machine learning (ML) has been growing steadily, and many companies and organizations are aware of the potential impact these tools and technologies can have on their underlying operations and processes. The reality is that we are still in the early phases of adoption, and a majority of companies have yet to deploy ML across their operations. The results of a recent survey we conducted (with 11,000+ respondents, a full report is forthcoming) bears this out—only about 15% of respondents worked for companies that have extensive experience using ML in production:

machine learning adoption

In this post, I’ll describe how one can go from “exploration and evaluation” to actual “implementation” of ML technologies. Along the way, I’ll highlight key sections of the upcoming Strata Data conference in New York this September. Recognizing the interest in ML, we assembled a program to help companies adopt ML across large sections of their existing operations.

Understanding key technologies and methods

Machine learning has enormous potential, but in order to unleash its promise, one needs to identify appropriate problems and use cases. The key to using any new set of tools and technologies is to understand what they can and cannot do. How do you put your organization in a position to take advantage of ML technologies? Because ML has the potential to affect every aspect of an organization, we are highlighting several companies that have invested resources in training and organizing their workforce on these new technologies.

Data preparation, governance, and privacy

Much of ML in use within companies falls under supervised learning, which means proper training data (or labeled examples) are essential. The rise of deep learning has made this even more pronounced, as many modern neural network architectures rely on very large amounts of training data. Issues pertaining to data security, privacy, and governance persist and are not necessarily unique to ML applications. But the hunger for large amounts of training data, the advent of new regulations like GDPR, and the importance of managing risk means a stronger emphasis on reproducibility and data lineage are very much needed.

As companies gain more experience, they start augmenting existing data sets with data that holds the potential to improve their existing ML models. There is growing interest in exploring alternative data sets and types. How exactly can companies augment their existing data sets with external data sources? Can decentralization technologies (like blockchains) pave the way for new forms of data exchanges?

Data integration and data platforms

Data used for ML models tends to come from a variety of sources. Because ML relies on large amounts of (labeled) data, learning how to design, build, and maintain robust data pipelines is important for productionizing machine learning. Depending on the application, companies typically have to grapple with variety (disparate data sources), volume (particularly for deep learning systems that are data hungry), and velocity (ingesting data from sensors and edge devices).

Over the last few years, many companies have begun rolling out data platforms for business intelligence and business analytics. More recently, companies have started to expand toward platforms that can support growing teams of data scientists. Common features of modern data science platforms include: support for notebooks and open source machine learning libraries, project management (collaboration and reproducibility), feature stores (for storing and sharing useful variables), and model visualization.

Model building: Some standard use cases

Machine learning has become commonly used for recommendation and personalization applications. The rise of new algorithms always tends to bring renewed interest. Deep learning has caused many companies to evaluate their existing recommenders, and many have begun to use neural networks to either supplement or replace their existing models. Deep learning has also reinvigorated interest in applications involving two data types found within many companies: text (natural language processing and understanding) and temporal data (correlations, anomalies, forecasting).

Model lifecycle management

Companies are realizing that machine learning model development is not quite the same as software development. Completion of the ML model building process doesn’t automatically translate to a working system. The data community is still in the process of building tools to help manage the entire lifecycle, which also includes model deployment, monitoring, and operations. While tools and best practices are just beginning to emerge and be shared, model lifecycle management is one of the most active areas in the data space.

Ethics and privacy

Newly introduced regulations in Europe (GDPR) and California (Consumer Privacy Act) have placed concepts like “user control” and “privacy-by-design” at the forefront for companies wanting to deploy ML. The good news is that there are new privacy-preserving tools and techniques—including differential privacy—that are becoming available for both business intelligence and ML applications.

Ethics and compliance are areas of interest to many in the data community. Beyond privacy, data professionals are much more engaged in topics such as fairness, transparency, and explainability in machine learning. Are data sets that are being used for model training representative of the broader population? For certain application domains and settings, transparency and interpretability are essential and regulators may require more transparent models, even at the expense of power and accuracy. More generally, how do companies mitigate risk when using ML?

Case studies

Putting all these topics together into working ML products—data movement and storage, model building, ML lifecycle management, ethics and privacy—requires experience. One of the best ways to learn is to hear presentations from peers who have successfully (and repeatedly) used machine learning to reinvigorate key facets of their operations. We showcase many case studies at the upcoming Strata Data conference in NYC; here are a selection of talks from a few application domains:

Continue reading How to take machine learning from exploration to implementation.

Continue Reading…


Read More

Be smart with responsive web design

By being intentional and deliberate in your approach, you can build an excellent user experience that performs well regardless of screen size.

Responsive Web Design

Mobile is no longer “the future.” As mentioned in Chapter 1, handsets are the primary Internet access method ( for a vast number of global Internet users. People are primarily using handsets to access the Internet, and these devices present their own unique set of challenges. Between the tremendous amount of latency on mobile networks (see “Mobile Networks”) and hardware challenges like WiFi signal strength and battery power (see “Mobile Hardware”), it’s more important than ever that we design and develop sites that are as high performing and efficient as possible. We need to aim for no unnecessary overhead for our users and optimize for perceived performance on all screen sizes.

The challenge with responsive web design sites is that it can be very easy to accidentally deliver unnecessary content like too-large images or unused CSS and JavaScript. Because the process of creating a responsively designed site can often include adding markup and functionality to optimize your layout and content for smaller screens, it’s no surprise that many sites deliver the same page weight or additional page weight to mobile devices without the designers and developers even realizing it.

Many creators of responsive sites are already going above and beyond in their decision-making process: reflowing content, choosing to hide or show various elements, making smart decisions about hierarachy, and more. We need to build an additional step into this responsive web design workflow: ensuring that we are delivering only the necessary content in terms of page weight and requests, not just information architecture.

Guy Podjarny found that the majority of responsively designed sites are currently delivering roughly the same page weight to small and large screens. But it doesn’t have to be this way: responsive web design is not inherently bad for performance, and we can be smart about what we deliver to our users. By being intentional in your approach to designing a responsive site and deliberate with what kinds of assets you require your users to download, you can build an excellent user experience that performs well regardless of screen size.

Deliberately Loading Content

Because we so often create a responsive site by adding things like more media queries for various screen sizes, it’s easy to forget that we may also be adding a ton of extra overhead for our users. This is especially true when a design starts with a desktop version and is then edited to scale down for smaller screens: what happens to those assets that have been optimized for the desktop view? Too often these are left as is; images are always served at the same size (just scaled down visually, through CSS), or fonts continue to be delivered and implemented as they are on desktop. We need to be deliberate with how we load content and ensure we are delivering only the bytes that our user absolutely needs.


Images should be served at the size at which they are displayed on the page to eliminate any unnecessary overhead for your users. In Figure 1-1, we can see a screenshot of Google’s home page with Chrome DevTools open. The size at which the Google logo is displayed is smaller than the actual height and width of the logo file.

This means that users are downloading unnecessary bytes, since their browsers downloaded an image that’s unnecessarily large for how it’s displayed. As you inspect an image in Chrome DevTools, you’ll be able to see the height and width of the image as it is displayed on the page as well as the image’s “natural” size, which can often be different than the display size (Figure 1-2).

In Figure 1-2, we can see that Google may be sending a retina-sized version of the image to users. Since retina displays cram twice as many pixels into their screens, a designer or developer can send an image twice as large as necessary and scale it down for display in the browser. This technique makes images look crisp on retina displays. Unfortunately, it also means users who aren’t using retina displays will download unnecessary image file bytes.

Figure 1-1. In this example, we can see that the size at which the Google logo is displayed is smaller than the actual size of the logo file.
Figure 1-2. Chrome DevTools will tell you how large an image is naturally as well as its actual displayed dimensions on the page.

Inspect the images on your site and see if there are opportunities for serving appropriately sized files. You have a few different ways to tell the browser which image to serve: RESS solutions, CSS media queries, and the new picture specification.

RESS, which stands for responsive web design with server-side components, is one option for creating and serving correctly sized images. You can improve performance by choosing which assets to serve to your user on the server side, rather than optimizing them on the client side. Your server can make smart decisions by looking at a user agent string, from which it can guess things like your user’s screen size, device capabilities like touch, and more. Tools like Adaptive Images ( detect your user’s screen size and will automatically create, cache, and deliver correctly sized images based on your defined breakpoints (see Figure 1-3). In his book High Performance Responsive Design (O’Reilly), Tom Barker outlines a number of RESS techniques and how to implement them.

Figure 1-3. In this example from the Adaptive Images site (, you can see different pixel widths and heights were generated from a single image with the Adaptive Images tool, as well as the different file sizes of the resulting images.

However, there are a number of downsides to RESS solutions. RESS won’t respond to client-size changes (e.g., if a user rotates the device from portrait to landscape). Let’s say you’re using RESS to send a perfectly resized image to your user’s browser. If that user rotates her device and your responsive layout changes, your server won’t know to send a new image to fit the new layout. This is why techniques like media queries and the new picture specification tend to be better solutions for responsive images.

There has been a lot of research done to determine which methods are best for displaying a correctly sized image using CSS in a responsive design, thanks in particular to Tim Kadlec ( and Cloud Four ( However, browsers can do unexpected things as they determine which image(s) to download for your page with CSS, which is why it’s important to test your site’s performance and ensure that you are asking your users’ browsers to download only the necessary resources.

For example, simply setting display: none to an element will not prevent a browser from downloading the image:

<div id="hide">
  <img src="image.jpg" alt="Image" />

/* Seriously, don't do this.
   Browsers will still download the image. */

@media (max-width: 600px) {
  #hide {
    display: none;

The same goes for applying display: none to an element with a background-image; the image will still be downloaded:

<div id="hide"></div>

/* Again, don't do this.
   Browsers will still download the image. */

#hide {
  background: url(image.jpg);

@media (max-width: 600px) {
  #hide {
    display: none;

Instead, if you want to hide an image from displaying with CSS in a responsive design, you can try hiding the parent element of the element with a background-image:

<div id="parent">

/* Hide the parent element;
   Browsers will not download the image. */

#parent div {
  background: url(image.jpg);

@media (max-width: 600px) {
  #parent {
    display: none;

Alternatively, you could apply different media queries to tell the browser which background-image is appropriate to download at which screen size. A browser will download an image when it matches a media query:

<div id="match"></div>

@media (min-width: 601px) {
  #match {
    background: url(big.jpg);

@media (max-width: 600px) {
  #match {
    background: url(small.jpg);

Note that if media queries overlap, older browsers will download both images.

But what about serving up retina images with CSS? We can ensure that only the retina version is downloaded for most browsers by using a media query to serve the retina version:

<div id="match"></div>

#match {
  background: url(regular.png);

@media (-webkit-min-device-pixel-ratio: 1.5),
  (min--moz-device-pixel-ratio: 1.5),
  (-o-min-device-pixel-ratio: 3/2),
  (min-device-pixel-ratio: 1.5) {
    #match {
      background: url(retina.png);

Devices running Android 2.x that have a device pixel ratio equal to or above 1.5 will unfortunately download both versions of the image (regular.png as well as retina.png), but as Kadlec notes in his article (, it’s unlikely that you will encounter a retina device running Android 2.x.

Your best bet for displaying a correctly sized picture in modern browsers is to take advantage of the picture element in HTML. picture is currently supported in Chrome 38, Firefox 33, and Opera 25, and is a part of the new picture specification ( This new specification allows you to tell the browser which image file to download and when, and it includes a fallback for browsers that don’t support the picture element.

Here’s a simple example of the picture element that uses a media query to determine which image file to download. The first source to match, top to bottom, is the resource that gets picked for the browser to download:

  <source media="(min-width: 800px)" srcset="big.png">
  <source media="(min-width: 400px)" srcset="small.png">
  <img src="small.png" alt="Description">

Check out how amazing this is. Not only are we able to match media attributes to tell the browser which image file to download, but we also have a low-resolution image that will be downloaded by browsers that don’t support the picture element. Picturefill ( is a polyfill that enables support for the picture element in browsers that don’t currently support it, so you can start using picture today! A good rule of thumb here is that all the images defined in the same picture element should be able to be described with the same alt attribute.

You can use the picture element to serve retina images when applicable, too!

  <source media="(min-width: 800px)"
    srcset="big.png 1x, big-hd.png 2x">
  <source media="(min-width: 600px)"
    srcset="medium.png 1x, medium-hd.png 2x">
  <img src="small.png" srcset="small-hd.png 2x"

In this example, srcset tells the browser which image to choose at different pixel densities. Again, we’re saving overhead for our users by being precise and telling the browser exactly which single image file is the right one to retrieve and display.

One additional superpower of the picture element is the type attribute:

  <source type="image/svg+xml" srcset="pic.svg">
  <img src="pic.png" alt="Description">

We can tell our user’s browser to ignore an image source unless it recognizes the contents of the type attribute. In this example, browsers that recognize SVG will download the SVG file, and the rest will download the fallback PNG. Again, we’re able to tell the browser exactly which single image file is the right one to download and display, saving our user from unnecessary page weight overhead.

But what about fluid designs? Or what if you just have a handful of different image sizes, and want your user’s browser to choose the most appropriate resource without listing specific viewport sizes or screen resolutions? The picture specification can help with these, too, using the sizes attribute. sizes follows this syntax:

sizes="[media query] [length],
       [media query] [length],
       [default length]"

Each media query in the sizes attribute will relate to a length that the image will be displayed on the page, relative to the viewport size. So if you have a length of 33.3vw, the browser understands that the image will be displayed at 33% of the viewport width. If you have a length of 100vw, the browser understands that the image will be displayed at 100% of the viewport width. This math helps the browser choose which image will be most appropriate to retrieve and show to your user.

sizes is smart because it will look through each media query to see which applies before figuring out the correct image to download. In this example, we can tell the browser that at larger screens, the image will be shown at 33% of the viewport, but the default width of the image is 100% of the viewport:

sizes="(min-width: 1000px) 33.3vw, 

The browser looks in the srcset list of images to see their dimensions. We can tell the browser the width of each image in our list with the syntax image.jpg 360w, where image.jpg is the path to the image file and 360w indicates that this image is 360 px wide:

<img srcset="small.jpg 400w,
       medium.jpg 800w,
       big.jpg 1600w"
     sizes="(min-width: 1000px) 33.3vw,

With this list of images in srcset and list of display widths in sizes, browsers can pick the best image to fetch and display to your user based on media query and viewport size. This comes in handy when you use a content management system, too; allow your CMS to generate the sources and markup for your image. This way, a CMS user has to upload only one version and not worry about how it will be displayed at different screen sizes. Note that, as demonstrated in this example, you can use the new picture specification without using the picture element!

You can use all of the pieces of this new specification in concert to give your user’s browser a ton of power in choosing which image should be downloaded and displayed. You’ll be able to choose to serve differently cropped images at different screen sizes, as well as retina-optimized images for high-pixel-density devices, and you can give the browser the power to choose the right image for the job based on media query. All of this is excellent for performance.


Font files can add a huge amount of overhead to your site because they require additional requests and increase page weight. As discussed in “Optimizing Web Fonts,” there are several ways of optimizing your font files to ensure they are as high performing as possible. One additional consideration you can make in your responsive design is to load your custom font file only on larger screens. This is something we do at Etsy, as we would rather save our users from downloading the extra font file overhead if they’re on a mobile device.

To do this, set your normal fallback fonts on your content. Then use a media query to only apply your web font to content at a large breakpoint:

@font-face {
  font-family: 'FontName';
  src: url('fontname.woff') format('woff');

body {
  font-family: Georgia, serif;

@media (min-width: 1000px) {
  body {
    font-family: 'FontName', Georgia, serif;

This will download and apply the font file only if the user’s device matches the media query. All browsers (except Internet Explorer 8 and lower) are smart about downloading a font file only if it applies. Internet Explorer 8 and lower will download all @font-face files referenced in a page’s CSS file, even if they aren’t used on the page.


While you’ll make many decisions about how to create your site’s responsive web design during the actual design and development process, it’s important to take a beat before you begin any work to consider your overall approach and how it will impact performance. Building performance into project documentation, taking the time to look at your site from a mobile-first perspective, and figuring out how you’re going to measure the performance of your site across media queries will help you to create a speedy, responsively designed site.

Project Documentation

If possible, incorporate performance into your project documentation for any project (not just responsive web designs!). For a responsive site, you’ll want to benchmark and continue to measure the same standard performance metrics like total page weight, total page load time, and perceived performance using the Speed Index. But you’ll also want to be able to set goals for devices and media queries, not just an average overall page using your design.

As we’ll discuss in “Approach New Designs with a Performance Budget,” there are ways to make compromises on site speed as you develop. By setting a performance budget, you’ll be able to make concessions as you balance aesthetics and performance. For any responsive web design, you’ll be making these same concessions; maybe you’ll want to serve a large image at a particular media query that puts you over your budget, so you’ll decide to not deliver extra font weights to make up the time. Table 1-1 outlines an example performance budget for a responsive web design.

Table 1-1. Example responsive web design budget
Measure Goal Notes

Total page load time

2 seconds

For all breakpoints

Total page weight

500 KB

min-width: 900 px

Total page weight

300 KB

max-width: 640 px

Speed Index


For all breakpoints

Set some expectations within your project documentation about how you expect to avoid unnecessary page weight or requests to any device. In addition, make it clear that you will be measuring these things for each media query or screen size and what your goals are, as in Table 1-1. These kinds of budgets can get a bit fuzzy. For example, what happens if you rotate a device and it switches between budgets? It’s essential to have a baseline indicating the importance of performance to set expectations for those who are working on the project. Remember that this will benefit not just your mobile users, but your desktop users as well.

Mobile First

A mobile-first approach to designing any site will help you in so many areas. It will prompt you to:

  • Ask critical questions up front (“What is the core purpose of this page?”).

  • Identify the most important functionality and content for your users.

  • Establish design patterns and how they will change across screen sizes.

  • Think about your site from an accessibility perspective (“How accessible will this be for people on slower connections or less capable devices?”).

By starting with a mobile-first approach, you can attempt to avoid the square peg/round hole mentality that many designers and developers fall into when they try to reshape a desktop experience for mobile devices. You can progressively enhance your site by adding functionality, incorporating more powerful animations and styles, and taking advantage of newer devices’ capabilities, all while keeping track of performance implications as you add on.

The mobile experience shouldn’t be bare-bones. It should be a deliberate experience; designers and developers should use the benefits of, and be cognizant of the limitations for, each platform their site will be rendered on. Mobile isn’t just an add-on to desktop, and desktop isn’t just an add-on to mobile. Content parity doesn’t mean that each platform’s experience should be identical. We should be designing and developing with our users’ needs in mind.

A mobile-first approach forces you to ask these important questions about core user needs early and will help you with the performance of your site. An experience with intention about your users will help you focus on what kinds of assets are being delivered to them. An approach in which you make hard decisions about functionality and content hierarchy at small screen sizes will help you keep your total page weight and number of requests down. A site that starts with the most important content and assets, rather than tacking on media queries to handle smaller screen sizes, will be a huge help in keeping your performance under control.

For your responsive site, consider your smallest screen sizes first. Reorder your CSS to deliver small screen styles first, and use progressive enhancement to add content and capabilities as screen sizes get larger. Deliver correctly sized assets, ensure there’s no scrolling jank, and make the page’s core functionality interactive as quickly as possible. From there, you can make decisions about how to share larger assets on larger screens, reflow content in your hierarchy, and continue to be deliberate about performance in your overall user experience.

Measure Everything

In Chapter 6, we’ll cover how to continue to measure your performance as you iterate and test your designs. You’ll use all of these tactics on a responsively designed site, just as you would any other site. But there are some additional considerations for measuring a responsive web design.

Primarily, you need to ensure that only the appropriate content is being loaded at each breakpoint. Don’t join the other 72% of websites that are serving up the same size responsive design site across screen sizes.

If you’re able to, implement automated tests that measure the total page weight for each of your chosen breakpoints. Tom Barker included an excellent chapter on continuous web performance testing in his book High Performance Responsive Design, which outlines how to implement Phantom JS tests that measure each breakpoint’s performance, including YSlow score and total page weight.

You can also test this manually. Emulate a device using Chrome DevTools and use the Resources panel to see which image size is being downloaded for that device. Here is an example set of media queries in which I choose to serve a different image based on breakpoint:

@media (min-width: 601px) {
  section {
    background: url(big.png);

@media (max-width: 600px) {
  section {
    background: url(small.png);

I want to make sure not only that the correct image is downloaded for a particular device size, but that both images aren’t downloaded. I used Chrome DevTools with caching disabled to emulate a Google Nexus 10 that would match the larger media query (Figure 1-4), and a Google Nexus 4 that would match the smaller media query (Figure 1-5).

Figure 1-4. In this example, I emulated a Google Nexus 10 to see which image would be downloaded. In the Network panel, we can see that big.png was called.

Each emulated device correctly downloaded only the image that was needed. We can also see the total page size transferred: 7.3 KB for the larger device, and 2.9 KB for the smaller device. Continue to check on the resources and total page weight being delivered to each breakpoint determined in your project plans to ensure that you’re meeting your goals.

For measuring total page load time and Speed Index at each breakpoint, check out WebPagetest’s drop-downs for browser (Figure 1-6) and connection speed (Figure 1-7).

The Dulles, Virginia, WebPagetest location includes a number of mobile browsers in the Browser drop-down. This testing location includes physical devices, like the iPhone 4 and the Nexus 5, on which you can test.

Figure 1-5. After switching the emulator to the Google Nexus 4 and refreshing the page, we can see that small.png was called instead of big.png.
Figure 1-6. You can choose from an assortment of mobile browsers in your WebPagetest run.
Figure 1-7. You can choose from an assortment of emulated connection speeds in your WebPagetest run.

The different connections listed in the Connections drop-down are created using traffic shaping. This means Chrome DevTools will emulate what a user may experience on this type of connection, but the results will be more consistent across tests because the test is actually happening on WiFi.

Compare the results for each breakpoint to make sure that your total page load time and Speed Index meets or beats the goal outlined in your project documentation.

All of the other techniques in this book will also help you optimize your responsive web design for performance. As you design your responsive site, be deliberate about which assets are downloaded by your users. Develop a performance budget at each breakpoint and use a mobile-first approach when designing and developing the site. Be sure to also check out Tom Barker’s book, High Performance Responsive Design, for more in-depth details on optimizing both the backend and frontend of your responsively designed website for performance.

As always, measuring performance as you work and as your site ages will help you keep page load time under control. In the next chapter, we’ll dive into tools and routines for checking in on the performance of your site to help you get a holistic view of your user experience over time.

Continue reading Be smart with responsive web design.

Continue Reading…


Read More

Four short links: 23 July 2018

State Sponsored Trolling, Public Standards, Explorable Explanations, and iOS Network Debugging

  1. State Sponsored Trolling (Institute For The Future) -- authoritarians around the world have mastered social media. Bloomberg did some great follow-up work on the IFTF report. (via Cory Doctorow)
  2. Public Resource Wins Right to Publish Standards Used in Law -- The question in this case is whether private organizations whose standards have been incorporated by reference can invoke copyright and trademark law to prevent the unauthorized copying and distribution of their works. [...] Because the district court erred in its application of both fair use doctrines, we reverse and remand, leaving for another day the far thornier question of whether standards retain their copyright after they are incorporated by reference into law.
  3. Explorable Explanations -- explanations and simulators for things to help you learn them. Regular readers will know I'm a huge fan of simulations as learning tools.
  4. Wormholy -- debug network iOS apps from within the app: Add it to your project, and that's all! Shake your device or your simulator and Wormholy will appear. In case, for whatever reason, the Charles proxy doesn't do it for you.

Continue reading Four short links: 23 July 2018.

Continue Reading…


Read More

Classify all the Things (with Multiple Labels)

Derrick Higgins of American Family Insurance presented a talk, “Classify all the Things (with multiple labels): The most common type of modeling task no one talks about” at Rev. Higgins covers multilabel classification, a few methods used for multiclass prediction, and existing toolkits. This blog post provides highlights, the video, and a transcript of the talk.

Session Summary

At Derrick Higgins’ talk, “Classify all the Things (with multiple labels): The most common type of modeling task no one talks about”,  he advocates that data scientists consider “adding the appropriate tools to their toolkit for multilabel tasks”. He introduces the distinction between multilabel and multiclass classification, presents a few methods used for multilabel prediction, and provides pointers to helpful toolkits.  Key highlights from the session include

  • exploring multilabel datasets in the public domain including BibTeX, delicious, birds, medical, and scene
  • delving into multilabel versus extreme multilabel versus collaborative filtering
  • strategies data scientists can pursue to model multilabel tasks, including label powerset, RAKEL, Conditional Bernoulli Mixtures as well as structured prediction
  • coverage of toolkits such as sk-multilearn, pyramid, svm-struct, pyStruct, and more.

For more insights from this talk, watch the video or read through the transcript.


Video Transcript of the Presentation

Derrick Higgins:

All right, thanks, everybody, for coming out today. I am Derrick Higgins, the head of the data science and analytics lab at American Family. We are a special projects team there, we’ve been working on a lot of different things. This talk is inspired by a couple of projects that came our way recently, where all of the tasks seem to fall into a particular framing.

If you look at the kind of texts that aspiring data scientists are exposed to when they start in the field: look at blogs, you look at introductory machine learning textbooks and so on, you kind of get the perspective that there are basically two things that data scientists are being asked to do today. There are classification tasks and there are regression tasks.

This is from one of my favorite introductory machine learnings textbooks by Christopher Bishop: “In classification problems, the task is to assign new inputs to one of a number of discrete classes or categories. However, there are many other pattern recognition tasks, which we shall refer to as regression problems, in which the outputs represent the values of continuous variables.” So he’s not actually saying there’s nothing else in the world besides these two types of problems, but it’s kind of suggestive.

This is not actually that crazy, because I’m sure everybody here in the room knows, there are just a lot of tasks that fit into one frame or the other. If a data scientist has a good set of tools that she can use for classification tasks, multiclass classification tasks, for regression tasks, maybe some other stuff that’s unsupervised, clustering and dimensionality reduction, there’s just a lot that she can do. There’s a lot of tasks out there that she’ll be well equipped to take on. As her career progresses or as she takes on more and more responsibility, she’s going to need to broaden her tool set a little bit and add some more heavy-duty things for tasks like segmentation, machine translation. But really, just with that core set of tools for multiclass classification and for regression, she’s going to be pretty well equipped for those tasks.

Today, I want to talk about a slightly different but related set of tasks, in particular multilabel classification, and suggest that maybe a lot of data scientists out there might consider adding the appropriate tools to their toolkit for multilabel tasks, given how frequently they come up. Okay, so, multiclass versus multilabel classification, probably a lot of folks are already familiar with the distinction or what’s meant by the two. Multiclass classification is by far the more common task framing; we’re trying to take an input vector of Xs, so M input features, and map each of them to the most likely class of a set of N possible labels. The output space is guaranteed to be, the output labels are guaranteed to be mutually exclusive, and at least one, exactly one, has to be assigned to every input vector.

The multilabel framing is slightly different in that we have, again, a set of N output labels, but they’re not mutually exclusive and we’re not guaranteed that any of them are actually going to be appropriate for a given input instance. An example task of this type would be one that involves looking at information, say, images related to issues that occur in somebody’s home and trying to identify attributes of those instances. One instance may be an issue that is in the interior of the home represents a hazard and is electrical. Another may be a plumbing issue in the interior, so we have multiple attributes for these different types of instances.

A couple of interesting things to note or keep in mind about these types of tasks: one, these labels that we’re trying to predict can sometimes be very related to one another, either positively or negatively. So some of the attributes that we look at in an image to determine whether it’s electrical may be very similar to the attributes that we need to look at to determine if it’s a hazard or not.

The other thing about this sort of task is that there may be constraints that are hard or soft that, in the output space, they determine how coherent the labeling is. It could be that certain types of issues are both interior and exterior issues, so maybe there’s something to do with the windows, which are sort of at the border between interior of the home and exterior of the home. But it’s not going to be common. That’s going to be, by far, dispreferred to pass through a labeling that assigns both interior and exterior labels. Even more of a hard constraint, I guess, would be if you have some sort of a taxonomic relationship between your labels. We may have even more specific tags that say there’s an issue having to do with wiring, or there’s an issue having to do with junction boxes, and it would be confusing for downstream users if those specific tags were applied and not the more general electric tags, the electrical tag as well.

Once you start looking for these kind of multilabel tasks, it seems like they kind of crop up all over. Here’s some examples of multilabel datasets that are out there in the public domain. You can go to various repositories and download them. A couple of these, BibTeX and Delicious, have to do with semantic tags that users have assigned, in the case of BibTeX, to bibliographic entries; in the case of Delicious, to web pages. The Birds dataset is actually a set of audio snippets where the task is to identify the species of birds that are present in the environment, based on the vocalizations that you can hear in the audio clips. Of course, there might be multiple birds that are present. Medical is a dataset of free text clinicians’ notes, and the tasks there is to identify diagnosis codes that should be assigned to a patient based on the notes that the doctor or clinician took.

And then finally, the Scene dataset is a dataset of images that are classified by scene types, so they could be a beach scene, there could be a city scene, and of course there could be, well, maybe not city and beach so much. I’m from Chicago, so that’s okay. Maybe city and sunset are a reasonable combination. These are pretty diverse in terms of the number of labels, ranging from six labels to up to a thousand labels, almost, for Delicious, and then the number of rows as well. Under a thousand up to, like, 16,000 for Delicious. That kind of spans the space of what I really want to talk about in terms of multilabel modeling.

In the multilabel classification task type I’m talking about, we have these predictive features, this matrix of features X where we’ve got N instances and L features to use for prediction associated with each of those, and then we have these M labels where M is sort of a manageable number. It’s, like, less than a thousand tags or labels that we want to assign to each of our input instances.

There is another type of task that people talk about sometimes called extreme multilabel classification, and that is what you might expect; we have kind of a less manageable number of labels that you want to predict for each of your input instances. It might be 100,000 labels, it might be a million tags, and these things can be, for instance, songs from a playlist a user has chosen to play, or items from an inventory that a user has bought. So semantically, in terms of what you’re trying to achieve, often it’s very similar to collaborative filtering tasks, with the main difference being that, in this extreme multilabel classification task, you have these additional predictive features where, traditionally, in collaborative filtering, you don’t have the features that you’re using to predict as well, you have just this set of products or songs or labels, whatever, and then a set of users or instances that are associated with them.

Okay, so, at a high level, multilabel tasks, there are three general strategies I’m going to say you can pursue to try and model these tasks. The first one is really tempting, it’s just sort of put your head in the sand, pretend there’s no problem, deal with it using the tools you have and forget you ever heard the word “multilabel.” This is maybe not as crazy as it might sound, because there are a lot of tasks out there, including some of the ones that my team has been working on, where it is not strictly a multiclass classification task, but it’s pretty close. It’s almost always the case, in fact, that your train vectors and your test vectors that you’re trying to do predictions for are one hot vectors of labels. It’s almost always the case that at least one label is applied and it’s almost always the case that it’s a unique label.

So you can imagine if we kind of focus our attention just on the sub-part of this home issue classification task that has to do with the sub-system of the home that is affected, then electrical issues are going to be pretty common, plumbing issues are going to be pretty common, issues to do with a specific appliance might be common. And occasionally you’ll get something where there’s an electrical issue to do with the water heater and some associated plumbing stuff related to that, but it’s going to be far less frequent than any of these tags occurring independently. It might be possible to sort of massage your data in some way and treat the task as if it was just multiclass classification, and I think that is important, at least, to include as a reasonable baseline when you’re doing this kind of modeling.

If we were to do this graphically, what we’re doing here is some sort of sub-sampling, throwing away training data that has multiple classes associated with it or somehow transforming the data to discard labels that we think are less important. And then once we’ve done that, we can just train your favorite multiclass classifier and then do one hot encoding of the outputs, and you get a matrix that you can feed into whatever your downstream multilabel evaluations are and compare with whatever fancier methods you might consider.

That was strategy one. Strategy two, then, is, okay, bite the bullet, realize we actually have a multilabel problem that we need to deal with in some way, but, you know, I’ve got this set of tools; is there some way that I can transform the task or find a way that I can apply the tools that I have, apply what I’ve got to just solve the problem that I have? There are a variety of problem transformation approaches to this general set of tasks, and the first of these, which is called binary relevance, is probably the first thing that you all thought of when I put up the multilabel task to begin with.

Namely, that you’re just going to build a bunch of binary classifiers for all of the labels in your output space, which basically amounts to, just assuming the conditional independence of all of the labels that you’re dealing with, and just, I guess, in passing I’ll note that I’m going to be using, bold X, as the vector of input features, and bold Y as the feature, the vector of output labels. It’s pretty conventional. So again, graphically, this is what it looks like. We’ve got a bunch of input instances, we’re going to pass them to a bunch of independent classifiers that are associated with each of our output classifiers. Sorry, output labels: one separate classifier for interior, one classifier for exterior, one classifier for hazard and so on.

Great, we have a way of doing the task that allows for multiple labels to be assigned to any given input instance, but we’re not really handling the dependency between the labels at all. There’s no way of kind of penalizing combinations that we think are disfavorable or should be disfavored or are unlikely, such as interior and exterior at the same time.

So that’s that one extreme of our set of options for transforming the task, where we just assume that all the labels are completely independent. This label powerset method, then, is at the extreme opposite end of the scale, where we’re going to say, actually, any combination of labels is entirely unique, distinct and shares no characteristics with any other combination of labels that we could assign to a given input instance. It’s called label powerset because we’re basically taking the powerset of these two to the N possible combinations of labels as the set of classes that we can consider assigning in a multiclass classifier. We have some classifier that’s going to do scoring in the output space, and then we’re going to normalize over the entire powerset of label combinations to get a probability for that.

Now, I should say, in reality, just a small subset of the powerset of label combinations are actually going to be instantiated in our training data, so we don’t typically have to deal with two to the N, but still, we have to deal with a pretty large number of label combinations. So this is what it looks like graphically: approximately, we’re going to just build one big multiclass classifier, and it’s going to have a ton of output classes. It’s going to have hazard assigned on its own as one possible output class, it’s going to have electrical as another output class. And then unrelated to either of these, it’s going to have a class like exterior-slash-hazard-slash-electrical, and then it’s going to have yet another appliance-slash-hazard output class, and just more than I can depict on this slide. All of these different combinations of labels that could be applied.

This is kind of a brute-force extreme approach; obviously there are some problems with it. For example, the lack of information-sharing between labels that would seem to be, or between classes that would seem to be related, like interior-slash-hazard and interior-hazard-electrical. And also sparse data issues, where the support for each of these classes is likely to be very small, or least some of these classes are likely to have very low support in their training data. But if there are really kind of unique, distinct characteristics that some of these combinations of labels have, like interior hazards legitimately having different evidence behind them, than exterior hazards or hazards that don’t have either of those tags, it could be affected.

We mentioned some of the drawbacks of label powerset, and there’s this random K-labelsets method, also called RAKEL, which is an attempt to address some of the drawbacks of label powerset while still leveraging the fundamental method. I’m going to go a little bit out of order here, but the idea is that you’re going to build label powerset classifiers, but only for sets of K labels at a time, so K may be three, K may be four. You’re going to be a label powerset that’s only responsible for, say, four labels, and then you’re going to build a whole ensemble of those. This is saying in our set of label powerset classifiers, each is responsible for K labels, and they are J of them in total.

Each label powerset classifier in the ensemble is going to assign a score to a given label, so if we’re trying to figure out the score for label Y-sub-i, which could be electrical, then we’re going to count that classifier’s score for all the label subsets that include that label, Y-sub-i, so we’re going to count the score for electrical, count the score for hazard and electrical, count the score for interior electrical, we’re going to add all of those up, and then we’re going to normalize by the score of all the label subsets that that label powerset classifier is responsible for. That’s each classifier in the ensemble of label powerset classifiers, and then we’re going to aggregate across all those label powerset classifiers to get an overall score for each label at the RAKEL level. We do that by basically averaging over all of the label powerset classifiers which have that label in its output space, because not every classifier in the ensemble is going to have the opportunity to even assign a label like electrical, because it may not be among the K that it’s responsible for.

That’s probably much less clear than a picture; we’re just going to have, instead of one big multiclass classifier, we’re going to have a number of big multiclass classifiers where the output space of each is the powerset of a subset of the total set of labels. This is a case where K equals three, this multiclass classifier is going to be responsible for hazard, electrical and interior. This one’s going to be responsible for plumbing, appliance and interior, and this one’s going to be responsible for hazard, plumbing and exterior. They’re each going to assign a score to the labels in their set, and then those are going to be aggregated using the method that I mentioned on the prior slide.

So yeah, RAKEL attempts to address some of the problems with label powerset, but it comes at the cost of introducing a couple of hyper-parameters. Now we have this hyper-parameter J, which is the number of ensembles we’re going to use. We also have hyper-parameter K, which is the number of labels that’s in each label powerset classifier. We might have to search over those.

Okay, so somewhat simpler to describe is classifier chains, and this is what it sounds like: you’re going to build a bunch of classifiers, but you’re just going to apply them in serial. The output of the first classifier will be a single label, and then that predicted label will be an input to the next classifier in the chain, and you’ll go like that through the entire sequence. The architecture here is pretty simple: one classifier is first going to predict this interior label, the predicted value of the interior label will be input to a classifier of the next exterior, and so on down the line so we predict all our labels. So that’s a relatively simple way of introducing dependencies among the different labels and allowing for some conditioning of each on the next.

But again, there are drawbacks as well. One of those is that, of course, errors in each of these classifiers are going to propagate, and so the longer this chain is, the bigger problem it’s going to be. The other problem is that there’s really no rationale for why we decided to predict interior first; we could’ve predicted interior last and conditioned that on electrical and so, and the random order in which you choose to chain these classifiers may not be optimal.

All right, so the last method I’m going to talk about that involves problem transformation is something called Conditional Bernoulli Mixtures. These are mixture models; the basic idea is that you’re going to do a soft assignment of your input instances to a cluster or to mixture components, and then based on that cluster, you’re going to use a specialized classifier to assign the labels. In particular, it’s going to be a binary relevance classifier, which means that the label assignment within each mixture component is going to be independent. So the individual classifiers in your ensemble at the second level are going to treat the labels as independent, but because of the mixture model structure, because this is conditioned on the mixture component membership for each instance, the labels are going to be dependent in the model as a whole.

There are two models, or two types of models that need to be trained in a Conditional Bernoulli Mixture model: there’s this top-level model that says, okay, I’ve got these capital-K clusters or capital-K components in my mixture model, which one does this instance belong to? And then there is this lower-level model, which is just our standard binary relevance model. So if we’re in cluster three or if we’re in mixture component three, mostly, what is the probability of each label, based on the characteristics of the instance and based on what we know about this cluster?

So here’s how that looks. We’ve got this sort of hierarchical structure to a model, where we’ve got this multiclass classifier that we’re training. Again, it could kind of be whatever your favorite classifier is, subject to some constraints I’m going to talk about related to training. And then at the lowest level, you’ve got these binary relevance models, which are relatively simple, just independent binary models for each of the labels. So two issues with Conditional Bernoulli Mixtures, one has to do with training: it’s not simply a model that we can train end-to-end using gradient descent or something like that because, you’ll recall a couple slides ago, we have this Z. This Z here is the latent class distribution or latent mixture component distribution for a given instance, and that we can’t observe. So we have to estimate it iteratively.

Instead, we have to do Expectation Maximization, where we come up with some estimate of Z and then, based on value, the current estimate of Z we’ll train each of our classifiers, and then we’ll update our estimate of Z so that the training procedure’s a little bit more complicated and also more time-consuming. And then inference can also be complicated as well. Doing inference in this model is not, what we want do is not actually just find the expectation from the model for each of the labels independently. Instead, we want to find the optimal label assignment across the entire label space, which may be different from considering each of the labels in independence.

For the details on that, I’m just going to refer you to the paper. I think the citation may have fallen off the bottom of the slide here. Happy to provide the slides and citations afterwards. But there’s some complexity there in doing dynamic programming or sampling to get the inference.

That’s a very quick survey of different methods you can use related to transforming the problem into something we can approach with standard tools for multiclass classification or binary classification. The third and final general strategy that we have for dealing with multilabel classification tasks is to really pull out the heavy machinery and decide we’re going to treat this as a structured prediction task. You may be familiar with structured prediction in other contexts; it’s kind of a standard set of tools and methodology for predicting structured objects, rather than predicting something that’s like a number or a class.

This comes up a lot in computer vision or natural language processing, for instance. Computer vision’s typical structured prediction task might be image segmentation, where you’ve got a bunch of pixel-label pairings that you want to predict and they’re all strongly related to one another. Or in natural language processing, where you’ve got this dependency structure that you want to assign to a sentence, or a tag structure that may associate a given tag with each word. There’s strong interdependencies between ways that part of speech tags are associated with each word.

In these types of contexts, you don’t really have much of a choice. These are just structured objects, and you need to find some way of fitting these into a standard classification framework, and structured prediction’s a way of doing that. But the same methodologies can be applied even when you don’t have quite so much structure in your output space as in the multilabel case where we have just a bunch of labels that may be inter-correlated or anti-correlated.

In a multilabel classification task, we’re trying to take a set of inputs, Xs, and map them into this space. Map them into a set of Ys, a set of labels, and the fundamental idea of structured prediction is we’re going to transform that task from mapping inputs into outputs into one in which we map pairings of inputs and candidate outputs into some score. So instead of generating predictions, we’re going to be critiquing predictions that somebody has already given us, somehow.

The challenge there, of course, is that we don’t anymore have a very straightforward way doing inference. In sort of a more standard classification paradigm, we can just feed information forward to our network or apply our classifier and get the optimal Y that is predicted. In a structured prediction framework, we actually have to somehow compute this argmax, find the Y that maximizes this scoring function.

This is, I promise, the last diagram of this type that I’m going to show. Basically, the idea is you’re going to now pair up your candidate labelings with your input instances and then pass this to a scoring function. That’s the final of structured prediction. Maybe the most popular way of doing structured prediction is a framework called conditional random fields. This is really popular in natural language processing, especially, where you’ll see linear chain conditional random fields, where the features have to do with the label assignments from adjacent words or something like that. But basically, it’s just a log-linear model. It’s a globally normalized log-linear model that’s going to apply to an instance and the entire candidate labeling that we’ve assigned to it. Log-linear model–this is just the normalizing constant that we typically don’t have to compute.

The thing that is different about a conditional random field is that we have two different types of features that are associated with parameters, these lambdas in the model. There are these feature functions F, that have to do with how strongly our predictive features are associated with our output labels, and then there are these feature functions G that have to do with how strongly pairs of output labels are related to one another. Basically, whether the labels are correlated with one another.

There is some complexity to making the training efficient, especially because you have to do inference in the process of training, but once you solve these mathematical problems, they can be trained using convex optimization. Again, because we’re doing approximate inference to compute this argmax, there are a couple of approaches there that could be used. You can consider only the supported combinations, so when you’re considering all the combinations that could be candidates for putting into this scoring function, you could only consider the ones that actually showed up in your training set, which is going to be much smaller than the set of possibilities — the powerset. Or you can do binary pruning, where you basically just build, in parallel, a binary relevance classifier, and then you only consider labels that get a score above some threshold when computing the different combinations that could be scored.

Very similar to conditional random fields are structural SVMs, very similar in that they use these feature functions that operate between feature label pairs and pairs of labels that come from a factor graph. Again, there’s some complexity in how to actually optimize these things. It’s an approximation to the standard support vector machine training problem where you iteratively select a subset of the constraints that you would be trying to satisfy, and the trade-off there is some epsilon difference in the accuracy of the convergence. So I’m not going to go into that in great detail, but it’s very, very similar to the conditional random fields setting.

And then last structured prediction method that I’m going to talk about is deep value networks. As with anything in data science, yes, there is a bleeding edge: deep neural networks’ approach to doing structured prediction, actually, a couple of them. Deep value networks is the one that came out in a, I think, ICML paper last year, so I’ll talk about that one.

All these structured prediction methods basically are critiquing functions: you take your input X, your candidate output Y, and compute some score. For deep value networks, the score specifically is a task-specific loss function, so if you’re doing image segmentation, a standard evaluation metric there is IOU, intersection over union, comparing sets of pixels. If you’re doing natural language processing, it might be some measure of constituency overlap for dependency structures. But for multilabel classification, one of the metrics we care about is instance F1. The deep value network is going to take these pairs of feature vectors and candidate labelings and see if it can predict what the instance F1 value will be for those labelings.

So if we’re going to do this, you need a couple things. First thing is you need training data; you need to have paired feature-prediction-loss function triples. Not pairs, triples. There are a few ways to generate these: you can randomly generate these candidate labelings from your dataset, or you can select according to some model internal criteria, based on how likely those labelings are or based on how informative that labeling will be for the model. Actually, in the paper they use a sort of hybrid between these, sampling the candidate labelings according to all three of these criteria together.

And then once you’ve done that, you just train the network to try and predict this loss function based on the sample of synthetic training data that you’ve put together. It’s kind of interesting, at the inference stage as well where we, again, we have to compute the argmax for this scoring function for deep value networks, because it’s a neural network, we can actually use backpropagation to do this, because your network is going to take a vector of input features that’s paired with a candidate labeling, and then the output’s going to be the loss function.

What you’ll do is, at inference time, we’ll input the feature vector together with a dummy labeling, which is just all zeroes. You’ll feed it forward through the network to get the estimated loss value, and then you’ll backpropagate it in order to minimize that loss. But you won’t change any of the weights in the networks, those will be frozen. You’ll only change the Y values in the input space. Just do that until convergence to get the best Y, or the best Y vector, which is the label that you want to assign.

I can see you’re all very excited now to learn about the new methodologies you’re going to use for multilabel tasks. You want to go home and do this, you realize you have three multilabel datasets sitting on the shelf and, where’s the toolkit I can use for all this stuff? The bad news is there’s not really a toolkit that you can use for all of these methods. There is a thing called sk-multilearn, which is great; it sounds like it should be sklearn for multilabel modeling. It does have some good functionality in there, but really, it’s much around the methods for problem transformation, as opposed to the structured prediction stuff. The structured prediction stuff is a little bit more distributed across different toolkits.

The pyramid toolkit has some good stuff for structured prediction, although not structured SVMs, you have to get those from another aisle in the supermarket. And the deep value networks are just in a git project that you can download. You kind of have to mix and match a little bit to do all this different stuff.
I want to close with just a brief evaluation, but before I do that, a couple quick notes about evaluation, how to do it in the multilabel context. This is typically what we have, is input to our evaluations, we’ve got a set of known gold-standard labels for each of our N instances, and then whatever the classifier structure we use, we get predicted labels for each of those instances, and then we want to compute our evaluation metrics based on these two matrices. But there are two ways to do it, for at least many of the metrics.

One is instance-based evaluation, whether you’re computing precision or recall or f-measure or whatever, you’re going instance by instance and computing that score for each instance and then averaging or aggregating across all the rows. The other, unsurprisingly, is label-based aggregation, where you’re computing precision, recall, f-measure for each label, each column independently, and then aggregating across those. Just be aware those both exist, they both get reported in published papers, they’re both reported by different toolkits, and they’re not at all comparable. So make sure you’re computing the right thing and comparing the right way across different toolkits, which you, unfortunately, will have to use.

Also, multiple evaluation metrics that you might care about; I’ve been talking about F1, in particular, instance F1, which is the harmonic mean of precision and recall. Another important and more stringent evaluation criteria is subset accuracy, which is how often you actually assign the exactly correct subset of the total number of the total set of labels to each instance. That’s actually the measure that many of these methods are optimizing for. Overlap score is very similar to this IOU in image segmentation in the intersection between the predicted and gold-standard labels divided by the union of the two, allowing for partial credit, much like F1. And then hamming loss is just the total number of zero-one misses across all of the individual instance-label combinations in the test set.

Okay, so, here’s the evaluations, and this is far from comprehensive, this is just kind of what I could do in the course of a couple weeks, using methods from different toolkits. One evaluation measure, which is instance F1, and two different datasets, BibTeX being a little bit bigger, Medical being a little bit smaller along both dimensions. First place to start, maybe, is binary relevance, because it’s kind of the first thing that most of us would try, and it does okay. But then if you compare just putting our heads in the sand and treating it as a multiclass problem, maybe it doesn’t do quite as well for BibTeX, but it actually does better for Medical, observing the constraint that no more than one label, in fact, exactly one label has to be assigned to every instance.

It may be useful to think about why that would be, that we’re bringing in this external knowledge, this external constraint, that you can’t have no label assigned to any given instance. That can help to kind of counteract shrinkage issues that can happen with these binary relevance models, where they’re reluctant to go out on a limb and predict a rare class.

The problem transformation approaches: label powerset does really badly on BibTeX, but does a little bit better on the Medical dataset, maybe related to the fact that there are many more labels, there are about three times at many labels in the BibTeX dataset as in Medical. RAKEL still doesn’t do as well as binary relevance here, but for Medical, does a little bit better. Classifier chain gives us a little bit of a bump for Medical, but it is basically the same as binary relevance. And then the Conditional Bernoulli Mixtures, among all the problem transformation approaches, performed the best. Possibly because it’s the most complex model.

Then looking at the structured prediction methods, conditional random fields, I mean, I guess there’s a reason they’re so popular. They do very well across both datasets. Structured SVMs I think I probably didn’t give a fair shake here because I didn’t let the algorithm run to convergence; I ran out of patience after a couple days. And then deep value networks; this is a really interesting case, because relative to the baseline here of binary relevance, deep value networks do great on BibTeX, kind of competitive with CRF, but really, really awful on Medical.

I’m not sure exactly how to allocate the blame here, but I will say BibTeX is a dataset that was reported on in the associated paper for deep value networks, so I was able to replicate the results that were reported there. But I was unable to get anything working very well at all for Medical. Could be that I didn’t try the right hyper-parameters or something, or it could be something more fundamental to the internals of the model here where, again, these different ways of synthesizing training data for the neural network just don’t work well for this Medical task.

Maybe that resonates with some of you who have tried bleeding edge neural network techniques that were reported a year ago at ICML and had difficulty putting them to practical use. Anyway, that’s all I have, so thanks for your attention.

This transcript has been edited for readability.

The post Classify all the Things (with Multiple Labels) appeared first on Data Science Blog by Domino.

Continue Reading…


Read More

Document worth reading: “Machine Learning and Cognitive Technology for Intelligent Wireless Networks”

The ability to dynamically and efficiently allocate resources to meet the need of growing diversity in services and user behavior marks the future of wireless networks, giving rise to intelligent processing, which aims at enabling the system to perceive and assess the available resources, to autonomously learn to adapt to the perceived wireless environment, and to reconfigure its operating mode to maximize the utility of the available resources. The perception capability and reconfigurability are the essential features of cognitive technology while modern machine learning techniques project effectiveness in system adaptation. In this paper, we discuss the development of the cognitive technology and machine learning techniques and emphasize their roles in improving both spectrum and energy efficiency of the future wireless networks. We describe in detail the state-of-the-art of cognitive technology, covering spectrum sensing and access approaches that may enhance spectrum utilization and curtail energy consumption. We discuss powerful machine learning algorithms that enable spectrum- and energy-efficient communications in dynamic wireless environments. We also present practical applications of these techniques to the existing and future wireless communication systems, such as heterogeneous networks and device-to-device communications, and identify some research opportunities and challenges in cognitive technology and machine learning as applied to future wireless networks. Machine Learning and Cognitive Technology for Intelligent Wireless Networks

Continue Reading…


Read More

Real-time data visualization using R and data extracting from SQL Server

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

In the previous post, I have showed how to visualize near real-time data using Python and Dash module.  And it is time to see one of the many ways, how to do it in R. This time, I will not use any additional frames for visualization, like shiny, plotly or any others others, but will simply use base R functions and RODBC package to extract data from SQL Server.

Extracting data from SQL Server will and simulating inserts in SQL Server table will primarily simulate the near real-time data. If you have followed the previous post, you will notice that I am using same T-SQL table and query to extract real-time data.

First, we will create a sample table in SQL Server and populate it with some sample data:


USE Test;

CREATE TABLE dbo.LiveStatsFromSQLServer
,Num tinyint NOT NULL)

And populate it with some sample data:

-- Insert some test data
INSERT INTO dbo.LiveStatsFromSQLServer(num)
GO 10

Now, that we have SQL foundations set up, let’s focus on R code.

First we set the environment variable and the RODBC library:

# create env for storing the variables/data frames between the functions
assign("getREnvironment", new.env(), envir = .GlobalEnv)

We will generate a function for extracting data from SQL Server and storing it in environment data.frame variable:

# Function to read data from SQL Server
getSQLServerData <- function()
#extract environment settings for storing data
getREnvironment <- get("getREnvironment", envir = .GlobalEnv, mode = "environment")
#get the SQL Server data
con <- odbcDriverConnect('driver={SQL Server};
db_df <- sqlQuery(con, 'SELECT 
                         TOP 20 id
                        FROM LiveStatsFromSQLServer ORDER BY id DESC')
#overwrite existing data with new data
df_overwrite <- db_df
getREnvironment$db_df <- data.frame(df_overwrite)
try(assign("getREnvironment", getREnvironment, envir = .GlobalEnv))
invisible() #do not print the results


Once we have this function registered, we can now create a small for loop that will update the plot with newly fetched data from SQL Server:

# Plot graph 
n=1000 #nof iterations
windowQuery=20 # syncronised with TOP clause in SELECT statement
for (i in 1:(n-windowQuery)) {
  getREnvironment <- get("getREnvironment", envir = .GlobalEnv, mode = "environment")
  data <- getREnvironment$db_df
  plot( data$id, data$num, type='l',main='Realtime data from SQL Server')


Once we run the complete R code, we need to trigger and run also the new inserts in SQL Server Management studio:

-- Do some inserts to mimic the data stream
INSERT INTO dbo.LiveStatsFromSQLServer(num)
WAITFOR DELAY '00:00:00.500'
GO 100

Once we do this, we can observe the realtime data from SQL Server being plotted in R environment (R Studio).

As always, complete code is available at Github.

Happy R-coding! 🙂

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

DALEX and H2O: Machine Learning Model Interpretability And Feature Explanation

(This article was first published on - Articles, and kindly contributed to R-bloggers)

As advanced machine learning algorithms are gaining acceptance across many organizations and domains, machine learning interpretability is growing in importance to help extract insight and clarity regarding how these algorithms are performing and why one prediction is made over another. There are many methodologies to interpret machine learning results (i.e. variable importance via permutation, partial dependence plots, local interpretable model-agnostic explanations), and many machine learning R packages implement their own versions of one or more methodologies. However, some recent R packages that focus purely on ML interpretability agnostic to any specific ML algorithm are gaining popularity. One such package is DALEX and this post covers what this package does (and does not do) so that you can determine if it should become part of your preferred machine learning toolbox.

We implement machine learning models using H2O, a high performance ML toolkit. Let’s see how DALEX and H2O work together to get the best of both worlds with high performance and feature explainability!

Similar Articles You Might Enjoy

Data science for business tutorials:

LIME for black-box model interpretability and feature:

Expected Value Framework for ML Classification in Business:

Learning Trajectory

We’ll cover the following topics on DALEX in this article:

  1. Advantages & disadvantages: a quick breakdown of what DALEX does and does not do.

  2. Replication requirements: what you’ll need to reproduce the analysis.

  3. DALEX procedures: necessary functions for downstream explainers.

  4. Residual diagnostics: understanding and comparing errors.

  5. Variable importance: permutation based importance score.

  6. Predictor-response relationship: PDP and ALE plots.

  7. Local interpretation: explanations for a single prediction.

Get The Best Resources In Data Science. Every Friday!

Sign up for our free “5 Topic Friday” Newsletter. Every week, I’ll send you the five coolest topics in data science for business that I’ve found that week. These could be new R packages, free books, or just some fun to end the week on.

Sign Up For Five-Topic-Friday!

DALEX and H2O: Machine Learning Model Interpretability And Feature Explanation

By Brad Boehmke, Director of Data Science at 84.51°

1.0 Advantages & disadvantages

DALEX is an R package with a set of tools that help to provide Descriptive mAchine Learning EXplanations ranging from global to local interpretability methods. In particular, it makes comparing performance across multiple models convenient. However, as is, there are some problems with this package scaling to wider data sets commonly used by organizations. The following provides a quick list of its pros and cons:


  • ML model and package agnostic: can be used for any supervised regression and binary classification ML model where you can customize the format of the predicted output.
  • Provides convenient approaches to compare results across multiple models.
  • Residual diagnostics: allows you to compare residual distributions.
  • Variable importance: uses a permutation-based approach for variable importance, which is model agnostic, and accepts any loss function to assess importance.
  • Partial dependence plots: leverages the pdp package.
  • Provides an alternative to PDPs for categorical predictor variables (merging path plots).
  • Includes a unique and intuitive approach for local intepretation.


  • Some functions do not scale well to wide data (many predictor variables)
  • Currently only supports regression and binary classification problems (i.e. no multinomial support).
  • Only provides permutation-based variable importance scores (which become slow as number of features increase).
  • PDP plots can only be performed one variable at a time (options for two-way interaction PDP plots).
  • Does not provide ICE curves.
  • Does not provide alternative local interpretation algorithms (i.e. LIME, SHAP values).

2.0 Replication requirements

We leverage the following packages:

# load required packages

# initialize h2o session
## H2O is not running yet, starting it now...
## Note:  In case of errors look at the following log files:
##     C:\Users\mdancho\AppData\Local\Temp\Rtmp6f3d5n/h2o_mdancho_started_from_r.out
##     C:\Users\mdancho\AppData\Local\Temp\Rtmp6f3d5n/h2o_mdancho_started_from_r.err
## Starting H2O JVM and connecting: . Connection successful!
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         7 seconds 710 milliseconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version: 
##     H2O cluster version age:    5 months and 3 days !!! 
##     H2O cluster name:           H2O_started_from_R_mdancho_dpc744 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.51 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.4 (2018-03-15)

To demonstrate model visualization techniques we’ll use the employee attrition data that has been included in the rsample package. This demonstrates a binary classification problem (“Yes” vs. “No”) but the same process that you’ll observe can be used for a regression problem.

To demonstrate DALEX’s capabilities we’ll use the employee attrition data that has been included in the rsample package. This demonstrates a binary classification problem (“Yes” vs. “No”) but the same process that you’ll observe can be used for a regression problem.
I perform a few house cleaning tasks on the data prior to converting to an h2o object and splitting.

NOTE: To use some of DALEX’s functions, categorical predictor variables need to be converted to factors. Also, I force ordered factors to be unordered as h2o does not support ordered categorical variables.

# classification data
df <- rsample::attrition %>% 
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames = c("train","valid","test"))
names(splits) <- c("train","valid","test")

# variable names for resonse & features
y <- "Attrition"
x <- setdiff(names(df), y) 

We will explore how to visualize a few of the more common machine learning algorithms implemented with h2o. For brevity I train default models and do not emphasize hyperparameter tuning. The following produces a regularized logistic regression, random forest, and gradient boosting machine models; all of which provide AUCs ranging between .75-.79. Although these models have distinct AUC scores, our objective is to understand how these models come to this conclusion in similar or different ways based on underlying logic and data structure.

# elastic net model 
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123

# random forest model
rf <- h2o.randomForest(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123

# gradient boosting machine model
gbm <-  h2o.gbm(
  x = x, 
  y = y,
  training_frame = splits$train,
  validation_frame = splits$valid,
  ntrees = 1000,
  stopping_metric = "AUC",    
  stopping_rounds = 10,         
  stopping_tolerance = 0.005,
  seed = 123

# model performance
h2o.auc(glm, valid = TRUE)
## [1] 0.7870935
h2o.auc(rf, valid = TRUE)
## [1] 0.7681021
h2o.auc(gbm, valid = TRUE)
## [1] 0.7468242

3.0 DALEX procedures

The DALEX architecture can be split into three primary operations:

  1. Any supervised regression or binary classification model with defined input (X) and output (Y) where the output can be customized to a defined format can be used.
  2. The machine learning model is converted to an “explainer” object via DALEX::explain(), which is just a list that contains the training data and meta data on the machine learning model.
  3. The explainer object can be passed onto multiple functions that explain different components of the given model.

DALEX Architecture

DALEX Application Process and Architecture

Although DALEX does have native support for some ML model objects (i.e. lm, randomForest), it does not have native many of the preferred ML packages produced more recently (i.e. h2o, xgboost, ranger). To make DALEX compatible with these objects, we need three things:

  1. x_valid: Our feature set needs to be in its original form not as an h2o object.
  2. y_valid: Our response variable needs to be a numeric vector. For regression problems this is simple, as it will already be in this format. For binary classification this requires you to convert the responses to 0/1.
  3. pred: a custom predict function that returns a vector of numeric values. For binary classification problems, this means extracting the probability of the response.
# convert feature data to non-h2o objects
x_valid <-$valid)[, x]

# make response variable numeric binary vector
y_valid <- as.vector(as.numeric(as.character(splits$valid$Attrition)))
## [1] 0 0 0 0 0 0
# create custom predict function
pred <- function(model, newdata)  {
  results <-, as.h2o(newdata)))

pred(rf, x_valid) %>% head()
## [1] 0.18181818 0.27272727 0.06060606 0.54545455 0.03030303 0.42424242

Once you have these three components, you can now create your explainer objects for each ML model. Considering I used a validation set to compute the AUC, we want to use that same validation set for ML interpretability.

# elastic net explainer
explainer_glm <- explain(
  model = glm,
  data = x_valid,
  y = y_valid,
  predict_function = pred,
  label = "h2o glm"

# random forest explainer
explainer_rf <- explain(
  model = rf,
  data = x_valid,
  y = y_valid,
  predict_function = pred,
  label = "h2o rf"

# GBM explainer
explainer_gbm <- explain(
  model = gbm,
  data = x_valid,
  y = y_valid,
  predict_function = pred,
  label = "h2o gbm"

# example of explainer object
## [1] "explainer"
##                  Length Class            Mode     
## model              1    H2OBinomialModel S4       
## data              30    data.frame       list     
## y                233    -none-           numeric  
## predict_function   1    -none-           function 
## link               1    -none-           function 
## class              1    -none-           character
## label              1    -none-           character

4.0 Residual diagnostics

As we saw earlier, the GLM model had the highest AUC followed by the random forest model then GBM. However, a single accuracy metric can be a poor indicator of performance. Assessing residuals of predicted versus actuals can allow you to identify where models deviate in their predictive accuracy. We can use DALEX::model_performance to compute the predictions and residuals. Printing the output returns residual quantiles and plotting the output allows for easy comparison of absolute residual values across models.

In this example, the residuals are comparing the probability of attrition to the binary attrition value (1-yes, 0-no). Looking at the quantiles you can see that the median residuals are lowest for the GBM model. And looking at the boxplots you can see that the GBM model also had the lowest median absolute residual value. Thus, although the GBM model had the lowest AUC score, it actually performs best when considering the median absoluate residuals. However, you can also see a higher number of residuals in the tail of the GBM residual distribution (left plot) suggesting that there may be a higher number of large residuals compared to the GLM model. This helps to illustrate how your residuals behave similarly and differently across models.

# compute predictions & residuals
resids_glm <- model_performance(explainer_glm)
resids_rf  <- model_performance(explainer_rf)
resids_gbm <- model_performance(explainer_gbm)

# assess quantiles for residuals
##          0%         10%         20%         30%         40% 
## -0.99155845 -0.70432615  0.01281214  0.03402030  0.06143281 
##         50%         60%         70%         80%         90% 
##  0.08362550  0.10051641  0.12637877  0.17583980  0.22675709 
##        100% 
##  0.47507569
##          0%         10%         20%         30%         40% 
## -0.96969697 -0.66666667  0.00000000  0.03030303  0.06060606 
##         50%         60%         70%         80%         90% 
##  0.09090909  0.12121212  0.15151515  0.18181818  0.27272727 
##        100% 
##  0.66666667
##          0%         10%         20%         30%         40% 
## -0.96307337 -0.75623698  0.03258538  0.04195091  0.05344621 
##         50%         60%         70%         80%         90% 
##  0.06382511  0.07845749  0.09643740  0.11312648  0.18169305 
##        100% 
##  0.66208105
# create comparison plot of residuals for each model
p1 <- plot(resids_glm, resids_rf, resids_gbm)
p2 <- plot(resids_glm, resids_rf, resids_gbm, geom = "boxplot")

gridExtra::grid.arrange(p1, p2, nrow = 1)

plot of chunk plot-residuals

5.0 Variable importance

An important task in ML interpretation is to understand which predictor variables are relatively influential on the predicted outcome. Many ML algorithms have their own unique ways to quantify the importance or relative influence of each feature (i.e. coefficients for linear models, impurity for tree-based models). However, other algorithms like naive Bayes classifiers and support vector machines do not. This makes it difficult to compare variable importance across multiple models.

DALEX uses a model agnostic variable importance measure computed via permutation. This approach follows the following steps:

For any given loss function do
1: compute loss function for full model (denote _full_model_)
2: randomize response variable, apply given ML, and compute loss function (denote _baseline_)
3: for variable j
     | randomize values
     | apply given ML model
     | compute & record loss function

To compute the permuted variable importance we use DALEX::variable_importance(). The printed output just provides a data frame with the output and plotting the three variable importance objects allows us to compare the most influential variables for each model. How do we interpret this plot?

  1. Left edge of x-axis is the loss function for the _full_model_. The default loss function is squared error but any custom loss function can be supplied.
  2. The first item listed in each plot is _baseline_. This value represents the loss function when our response values are randomized and should be a good indication of the worst-possible loss function value when there is no predictive signal in the data.
  3. The length of the remaining variables represent the variable importance. The larger the line segment, the larger the loss when that variable is randomized.

The results provide some interesting insights. First, the shifted x-axis left edge helps to illustrate the difference in the RMSE loss between the three models (i.e. GLM model has the lowest RMSE suggesting that the greater number of tail residuals in the GBM model is likely penalizing the RMSE score. Second, we can see which variables are consistently influential across all models (i.e. OverTime, EnvironmentSatisfaction, Age), variables that are influential in two but not all three (i.e. BusinessTravel, WorkLifeBalance), and variables which are only influential in one model but not others (i.e. DailyRate, YearsInCurrentRole). This helps you to see if models are picking up unique structure in the data or if they are using common logic.

In this example, all three models appear to be largely influenced by the OverTime, EnvironmentSatisfaction, Age, TotalWorkingYears, and JobLevel variables. This gives us confidences that these features have strong predictive signals.

TIP: You can incorporate custom loss functions using the loss_function argument.

# compute permutation-based variable importance
vip_glm <- variable_importance(explainer_glm, n_sample = -1, loss_function = loss_root_mean_square) 
vip_rf  <- variable_importance(explainer_rf, n_sample = -1, loss_function = loss_root_mean_square)
vip_gbm <- variable_importance(explainer_gbm, n_sample = -1, loss_function = loss_root_mean_square)

plot(vip_glm, vip_rf, vip_gbm, max_vars = 10)

plot of chunk vip

One downfall of the permutation-based approach to variable importance is it can become slow. Since the algorithm loops through and applies a model for each predictor variable, the more features in your model the longer it will take. For this example, which includes 30 features, it takes 81 seconds to compute variable importance for all three models. However, when tested on a data set with 100 predictors it took nearly 5 minutes to compute.

TIP: variable_importance includes an n_sample argument that, by default, will sample only 1000 observations to try increase the speed of computation. Adjusting n_sample = -1 as I did in the above code chunk just means to use all observations.

6.0 Predictor-response relationship

Once we’ve identified influential variables across all three models, next we likely want to understand how the relationship between these influential variables and the predicted response differ between the models. This helps to indicate if each model is responding to the predictor signal similarly or if one or more models respond differently. For example, we saw that the Age variable was one of the most influential variables across all three models. The below partial dependence plot illustrates that the GBM and random forest models are using the Age signal in a similar non-linear manner; however, the GLM model is not able to capture this same non-linear relationship. So although the GLM model may perform better (re: AUC score), it may be using features in biased or misleading ways.

# compute PDP for a given variable --> uses the pdp package
pdp_glm  <- variable_response(explainer_glm, variable =  "Age", type = "pdp")
pdp_rf   <- variable_response(explainer_rf,  variable =  "Age", type = "pdp")
pdp_gbm  <- variable_response(explainer_gbm, variable =  "Age", type = "pdp")

plot(pdp_glm, pdp_rf, pdp_gbm)

plot of chunk pdp

Although you can use PDPs for categorical predictor variables, DALEX provides merging path plots originally provided by the factoMerger package. For example, the EnvironmentSatisfaction variable captures the level of satisfaction regarding the working environment among employees. This variable showed up in all three models’ top 10 most influential variable lists. We can use type = "factor" to create a merging path plot and it shows very similar results for each model. Those employees that have low level of satisfaction have, on average, higher probabilities of attrition. Whereas, employees with medium to very high have about the same likelihood of attriting. The left side of the plot is the merging path plot, which shows the similarity between groups via hierarchical clustering. It illustrates that employees with medium and high satisfaction are most similar, then these employees are next most similar to employees with very high satisfaction. Then finally, the least similar group is the low satisfaction employees.

cat_glm  <- variable_response(explainer_glm, variable = "EnvironmentSatisfaction", type = "factor")
cat_rf   <- variable_response(explainer_rf,  variable = "EnvironmentSatisfaction", type = "factor")
cat_gbm  <- variable_response(explainer_gbm, variable = "EnvironmentSatisfaction", type = "factor")

plot(cat_glm, cat_rf, cat_gbm)

plot of chunk pdp-categorical

7.0 Local interpretation

The previous plots help us to understand our model from a global perspective by illustrating errors, identifying the variables with the largest overall impact, and understanding predictor-response relationships across all observations. However, often, we also need to perform local interpretation which allows us to understand why a particular prediction was made for an observation. Understanding and comparing how a model uses the predictor variables to make a given prediction can provide trust to you (the analyst) and also the stakeholder(s) that will be using the model output for decision making purposes.

Although LIME and SHAP (1, 2) values have recently become popular for local ML interpretation, DALEX uses a process called break down to compute localized variable importance scores.

There are two break down approaches that can be applied. The default is called step up and the algorithm performs the following steps:

existing_data = validation data set used in explainer
new_ob = single observation to perform local interpretation on
p = number of predictors
l = list of predictors
baseline = mean predicted response of existing_data

for variable i in {1,...,p} do
  for variable j in {1,...,l} do
    | substitue variable j in existing_data with variable j value in new_ob
    | predicted_j = mean predicted response of altered existing_data
    | diff_j = absolute difference between baseline - predicted
    | reset existing_data
  | t = variable j with largest diff value
  | contribution for variable t = diff value for variable t 
  | remove variable t from l

This is called step up because, essentially, it sweeps through each column, identifies the column with the largest difference score, adds that variable to the list as the most important, sweeps through the remaining columns, identifies the column with the largest score, adds that variable to the list as second most important, etc. until all variables have been assessed.

An alternative approach is called the step down which follows a similar algorithm but rather than remove the variable with the largest difference score on each sweep, it removes the variable with the smallest difference score. Both approaches are analogous to backward stepwise selection where step up removes variables with largest impact and step down removes variables with smallest impact.

To perform the break down algorithm on a single observation, use the DALEX::prediction_breakdown function. The output is a data frame with class “prediction_breakdown_explainer” that lists the contribution for each variable.

TIP: The default approach is step up but you can perform step down by adding the following argument direction = "down".

# create a single observation
new_cust <- splits$valid[1, ] %>%

# compute breakdown distances
new_cust_glm <- prediction_breakdown(explainer_glm, observation = new_cust)
new_cust_rf  <- prediction_breakdown(explainer_rf, observation = new_cust)
new_cust_gbm <- prediction_breakdown(explainer_gbm, observation = new_cust)
# class of prediction_breakdown output
## [1] "prediction_breakdown_explainer" "data.frame"
# check out the top 10 influential variables for this observation
new_cust_gbm[1:10, 1:5] %>%
  variable contribution variable_name variable_value cummulative
1 (Intercept) 0.0000000 Intercept 1 0.0000000
JobRole + JobRole = Laboratory_Technician 0.0377084 JobRole Laboratory_Technician 0.0377084
StockOptionLevel + StockOptionLevel = 0 0.0243714 StockOptionLevel 0 0.0620798
MaritalStatus + MaritalStatus = Single 0.0242334 MaritalStatus Single 0.0863132
JobLevel + JobLevel = 1 0.0318771 JobLevel 1 0.1181902
Age + Age = 32 0.0261924 Age 32 0.1443826
BusinessTravel + BusinessTravel = Travel_Frequently 0.0210466 BusinessTravel Travel_Frequently 0.1654292
RelationshipSatisfaction + RelationshipSatisfaction = High 0.0108112 RelationshipSatisfaction High 0.1762404
Education + Education = College 0.0016912 Education College 0.1779315
PercentSalaryHike + PercentSalaryHike = 13 0.0001158 PercentSalaryHike 13 0.1780473

We can plot the entire list of contributions for each variable of a particular model. We can see that several predictors have zero contribution, while others have positive and negative contributions. For the GBM model, the predicted value for this individual observation was positively influenced (increased probability of attrition) by variables such as JobRole, StockOptionLevel, and MaritalStatus. Alternatively, variables such as JobSatisfaction, OverTime, and EnvironmentSatisfaction reduced this observations probability of attriting.


plot of chunk unnamed-chunk-3

For data sets with a small number of predictors, you can compare across multiple models in a similar way as with earlier plotting (plot(new_cust_glm, new_cust_rf, new_cust_gbm)). However, with wider data sets, this becomes cluttered and difficult to interpret. Alternatively, you can filter for the largest absolute contribution values. This causes the output class to lose its prediction_breakdown_explainer class so we can plot the results with ggplot.

Each model has a similar prediction that the new observation has a low probability of predicting:

  • GLM: .12
  • random forest: 0.18
  • GBM: 0.06

However, how each model comes to that conclusion in a slightly different way. However, there are several predictors that we see consistently having a positive or negative impact on this observations’ probability of attriting (i.e. OverTime, EnvironmentSatisfaction, JobSatisfaction are reducing this employees probability of attriting while JobLevel, MaritalStatus, StockOptionLevel, and JobLevel are all increasing the probability of attriting). Consequently, we can have a decent amount of trust that these are strong signals for this observation regardless of model. However, when each model picks up unique signals in variables that the other models do not capture (i.e. DistanceFromHome, NumCompaniesWorked), its important to be careful how we communicate these signals to stakeholders. Since these variables do not provide consistent signals across all models we should use domain experts or other sources to help validate whether or not these predictors are trustworthy. This will help us understand if the model is using proper logic that translates well to business decisions.


# filter for top 10 influential variables for each model and plot
list(new_cust_glm, new_cust_rf, new_cust_gbm) %>%
  purrr::map(~ top_n(., 11, wt = abs(contribution))) %>%, .) %>%
  mutate(variable = paste0(variable, " (", label, ")")) %>%
  ggplot(aes(contribution, reorder(variable, contribution))) +
  geom_point() +
  geom_vline(xintercept = 0, size = 3, color = "white") +
  facet_wrap(~ label, scales = "free_y", ncol = 1) +

plot of chunk unnamed-chunk-4

Unfortunately, a major drawback to DALEX’s implementation of these algorithm’s is that they are not parallelized. Consequently, wide data sets become extremely slow. For example, performing the previous three prediction_breakdown functions on this attrition data set with 30 predictors takes about 12 minutes. However, this grows exponentially as more predictors are added. When we apply a single instance of prediction_breakdown to the Ames housing data (80 predictors), it took over 3 hours to execute!

# ames data
ames.h2o <- as.h2o(AmesHousing::make_ames())

# create local observation
local_ob <-[1, ])

# variable names for resonse & features
y <- "Sale_Price"
x <- setdiff(names(ames.h2o), y)

# random forest model
rf <- h2o.randomForest(
  x = x, 
  y = y,
  training_frame = ames.h2o,
  ntrees = 1000

# get features for explainer
x_valid <-[, x]
y_valid <- as.vector(ames.h2o[y])

# create custom predict function
pred <- function(model, newdata)  {
  results <- as.vector(predict(model, as.h2o(newdata)))

# create explainer
ames_rf <- explain(
  model = rf,
  data = x_valid,
  y = y_valid,
  predict_function = pred,
  label = "ames"

# time to compute prediction break down
  ames_example  <- prediction_breakdown(ames_rf, observation = local_ob)

Looking at the underlying code for the prediction_breakdown function (it simply calls breakDown::broken.default), there are opportunities for integrating parallelization capabilities (i.e. via foreach package). Consequently, prior to adding it to your preferred ML toolkit, you should determine:

  1. if you are satisfied with its general alorithmic approach,
  2. do you typically use wide data sets, and if so…
  3. what is your appetite and bandwidth for integrating parallelization (either in your own version or collaborating with the package authors),
  4. and how is performance after parallelization (do you see enough speed improvement to justify use).

Next Steps: Take The Data Science For Business With R Course!

If interested in learning more about modeling using H2O and model explanations, definitely check out Data Science For Business With R (DS4B 201-R). Over the course of 10 weeks, the the student learn all of the steps to solve a $15M employee turnover problem with H2O along with a host of other tools, frameworks, and techniques.

The students love it. Here’s a comment we received from one of our students, Siddhartha Choudhury, Data Architect at Accenture.


“To be honest, this course is the best example of an end to end project I have seen from business understanding to communication.”

Siddhartha Choudhury, Data Architect at Accenture

See for yourself why our students have rated Data Science For Business With R (DS4B 201-R) a 9.0 of 10.0 for Course Satisfaction!

Get Started Today!

Data Science For Business With Python (DS4B 201-P)

P.S. Did we mention with have a DS4B Python Course coming?!?! Well we do! Favio Vazquez, Principle Data Scientist at OXXO, is building the Python equivalent of DS4B 201. The problem changes: Customer Churn! The tools will be H2O, LIME, and a host of other tools implemented in Python. More information is forthcoming. Sign up for Business Science University to stay updated.

About The Author

This MACHINE LEARNING TUTORIAL comes from Brad Boehmke, Director of Data Science at 84.51°, where he and his team develops algorithmic processes, solutions, and tools that enable 84.51° and its analysts to efficiently extract insights from data and provide solution alternatives to decision-makers. Brad is not only a talented data scientist, he’s an adjunct professor at the University of Cincinnati, Wake Forest, and Air Force Institute of Technology. Most importantly, he’s an active contributor to the Data Science Community and he enjoys giving back via advanced machine learning education available at the UC Business Analytics R Programming Guide!

Additional DALEX Resources

The following provides resources to learn more about the DALEX package:

  • DALEX GitHub repo:
  • breakDown package which is called by DALEX:
  • Paper that explains the prediction break down algorithm link

Business Science University

If you are looking to take the next step and learn Data Science For Business (DS4B), Business Science University is for you! Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You’ll learn:

  • Data Science Framework: Business Science Problem Framework
  • Tidy Eval
  • H2O Automated Machine Learning
  • LIME Feature Explanations
  • Sensitivity Analysis
  • Tying data science to financial improvement (ROI-Driven Data Science)

Data Science For Business With R Virtual Workshop

Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed.

What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop using the R Statistical Programming Language. Here’s an example of a Shiny app you will create.

Get 15% OFF!

HR 301 Shiny Application: Employee Prediction

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in DS4B 301 (Building A Shiny Web App)

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.

Get 15% OFF!

Data Science For Business With Python (DS4B 201-P)

Did we mention with have a DS4B Python Course coming?!?! Well we do! Favio Vazquez, Principle Data Scientist at OXXO, is building the Python equivalent of DS4B 201. The problem changes: Customer Churn! The tools will be H2O, LIME, and a host of other tools implemented in Python. More information is forthcoming. Sign up for Business Science University to stay updated.

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:

To leave a comment for the author, please follow the link and comment on their blog: - Articles. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Book Memo: “Fundamentals of Clinical Data Science”

This open access book comprehensively covers the fundamentals of clinical data science, focusing on data collection, modelling and clinical applications. Topics covered in the first section on data collection include: data sources, data at scale (big data), data stewardship (FAIR data) and related privacy concerns. Aspects of predictive modelling using techniques such as classification, regression or clustering, and prediction model validation will be covered in the second section. The third section covers aspects of mobile clinical decision support systems, operational excellence and value-based healthcare.

Continue Reading…


Read More

If you did not already know

Geometric Generalization Based Zero-Shot Learning Test google
Raven’s Progressive Matrices are one of the widely used tests in evaluating the human test taker’s fluid intelligence. Analogously, this paper introduces geometric generalization based zero-shot learning tests to measure the rapid learning ability and the internal consistency of deep generative models. Our empirical research analysis on state-of-the-art generative models discern their ability to generalize concepts across classes. In the process, we introduce Infinit World, an evaluable, scalable, multi-modal, light-weight dataset and Zero-Shot Intelligence Metric ZSI. The proposed tests condenses human-level spatial and numerical reasoning tasks to its simplistic geometric forms. The dataset is scalable to a theoretical limit of infinity, in numerical features of the generated geometric figures, image size and in quantity. We systematically analyze state-of-the-art model’s internal consistency, identify their bottlenecks and propose a pro-active optimization method for few-shot and zero-shot learning. …

SNIPER google
We present SNIPER, an algorithm for performing efficient multi-scale training in instance level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER processes context regions around ground-truth instances (referred to as chips) at the appropriate scale. For background sampling, these context-regions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER only processes 30% more pixels compared to the commonly used single scale training at 800×1333 pixels on the COCO dataset. But, it also observes samples from extreme resolutions of the image pyramid, like 1400×2000 pixels. As SNIPER operates on resampled low resolution chips (512×512 pixels), it can have a batch size as large as 20 on a single GPU even with a ResNet-101 backbone. Therefore it can benefit from batch-normalization during training without the need for synchronizing batch-normalization statistics across GPUs. SNIPER brings training of instance level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high resolution images for instance level visual recognition tasks might not be correct. Our implementation based on Faster-RCNN with a ResNet-101 backbone obtains an mAP of 47.6% on the COCO dataset for bounding box detection and can process 5 images per second with a single GPU. …

Machine Learning Canvas google
A framework to connect the dots between data collection, machine learning, and value creation …

Continue Reading…


Read More

REST APIs and Plumber

(This article was first published on R Views, and kindly contributed to R-bloggers)

Moving R resources from development to production can be a challenge, especially when the resource isn’t something like a shiny application or rmarkdown document that can be easily published and consumed. Consider, as an example, a customer success model created in R. This model is responsible for taking customer data and returning a predicted outcome, like the likelihood the customer will churn. Once this model is developed and validated, there needs to be some way for the model output to be leveraged by other systems and individuals within the company.

Traditionally, moving this model into production has involved one of two approaches: either running customer data through the model on a batch basis and caching the results in a database, or handing the model definition off to a development team to translate the work done in R into another language, such as Java or Scala. Both approaches have significant downsides. Batch processing works, but it misses real-time updates. For example, if the batch job runs every night and a customer calls in the next morning and has a heated conversation with support, the model output will have no record of that exchange when the customer calls the customer loyalty department later the same day to cancel their service. In essence, model output is served on a lag, which can sometimes lead to critical information loss. However, the other option requires a large investment of time and resources to convert an existing model into another language just for the purpose of exposing that model as a real-time service. Neither of these approaches is ideal; to solve this problem, the optimal solution is to expose the existing R model as a service that can be easily accessed by other parts of the organization.

plumber is an R package that allows existing R code to be exposed as a web service through special decorator comments. With minimal overhead, R programmers and analysts can use plumber to create REST APIs that expose their work to any number of internal and external systems. This solution provides real-time access to processes and services created entirely in R, and can effectively eliminate the need to perform batch operations or technical hand-offs in order to move R code into production.

This post will focus on a brief introduction to RESTful APIs, then an introduction to the plumber package and how it can be used to expose R services as API endpoints. In subsequent posts, we’ll build a functioning web API using plumber that integrates with Slack and provides real-time customer status reports.

Web APIs

For some, APIs (Application Programming Interface) are things heard of but seldom seen. However, whether seen or unseen, APIs are part of everyday digital life. In fact, you’ve likely used a web API from within R, even if you didn’t recognize it at the time! Several R packages are simply wrappers around popular web APIs, such as tidycensus and gh. Web APIs are a common framework for sharing information across a network, most commonly through HTTP.


To understand how HTTP requests work, it’s helpful to know the players involved. A client makes a request to a server, which interprets the request and provides a response. An HTTP request can be thought of simply as a packet of information sent to the server, which the server attempts to interpret and respond to. Every time you visit a URL in a web browser, an HTTP request is made and the response is rendered by the browser as the website you see. It is possible to inspect this interaction using the development tools in a browser.

As seen above, this request is composed of a URL and a request method, which in the case of a web browser accessing a website, is GET.


There are several components of an HTTP request, but here we’ll mention on only a few.

  • URL: the address or endpoint for the request
  • Verb / method: a specific method invoked on the endpoint (GET, POST, DELETE, PUT)
  • Headers: additional data sent to the server, such as who is making the request and what type of response is expected
  • Body: data sent to the server outside of the headers, common for POST and PUT requests

In the browser example above, a GET request was made by the web browser to


The API response mirrors the request to some extent. It includes headers that contain information about the response and a body that contains any data returned by the API. The headers include the HTTP status code that informs the client how the request was received, along with details about the content that’s being delivered. In the example of a web browser accessing, we can see below that the response headers include the status code (200) along with details about the response content, including the fact that the content returned is HTML. This HTML content is what the browser renders into a webpage.


The httr package provides a nice framework for working with HTTP requests in R. The following basic example demonstrates some of what we’ve already learned by using httr and, which provides a playground of sorts for HTTP requests.

# A simple GET request
response <- GET("")
## Response []
##   Date: 2018-07-23 14:57
##   Status: 200
##   Content-Type: application/json
##   Size: 266 B
## {"args":{},"headers":{"Accept":"application/json, text/xml, application/...

In this example we’ve made a GET request to and received a response. We know our request was successful because we see that the status is 200. We also see that the response contains data in JSON format. The Getting started with httr page provides additional examples of working with HTTP requests and responses.


Representational State Transfer (REST) is an architectural style for APIs that includes specific constraints for building APIs to ensure that they are consistent, performant, and scalable. In order to be considered truly RESTful, an API must meet each of the following six constraints:

  • Uniform interface: clearly defined interface between client and server
  • Stateless: state is managed via the requests themselves, not through reliance on an external service
  • Cacheable: responses should be cacheable in order to improve scalability
  • Client-Server: clear separation of client and server, each with it’s on distinct responsibilities in the exchange
  • Layered System: there may be intermediaries between the client and the server, but the client should be unaware of them
  • Code on Demand: the response can include logic executable by the client

We could spend a lot of time diving further into each of these specifications, but that is beyond the scope of this post. More detail about REST can be found here.


Creating RESTful APIs using R is straightforward using the plumber package. Even if you have never written an API, plumber makes it easy to turn existing R functions into API endpoints. Developing plumber endpoints is simply a matter of providing specialized R comments before R functions. plumber recognizes both #' and #* comments, although the latter is recommended in order to avoid potential conflicts with roxygen2. The following defines a plumber endpoint that simply returns the data provided in the request query string.


#* @apiTitle Simple API

#* Echo provided text
#* @param text The text to be echoed in the response
#* @get /echo
function(text = "") {
    message_echo = paste("The text is:", text)

Here we’ve defined a simple function that takes a parameter, text, and returns it with some additional comments as part of a list. By default, plumber will serialize the object returned from a function into JSON using the jsonlite package. We’ve provided specialized comments to inform plumber that this endpoint is available at api-url/echo and will respond to GET requests.

There are a few ways this plumber script can be run locally. First, assuming the file is saved as plumber.R, the following code would start a local web server hosting the API.

plumber::plumb("plumber.R")$run(port = 5762)

Once the web server has started, the API can be interacted with using any set of HTTP tools. We could even interact with it using httr as demonstrated earlier, although we would need to open a separate R session to do so since the current R session is busy serving the API.

The other method for running the API requires a recent preview build of the RStudio IDE. Recent preview builds include features that make it easier to work with plumber. When editing a plumber script in a recent version of the IDE, a “Run API” icon will appear in the top right hand corner of the source editor. Clicking this button will automatically run a line of code similar to the one we ran above to start a web server hosting the API. A swagger-generated UI will be rendered in the Viewer pane, and the API can be interacted with directly from within this UI.

Now that we have a running plumber API, we can query it using curl from the command line to investigate it’s behavior.

$ curl "localhost:5762/echo" | jq '.'
  "message_echo": [
    "The text is: "

In this case, we queried the API without providing any additional data or parameters. As a result, the text parameter is the default empty string, as seen in the response. In order to pass a value to our underlying function, we can define a query string in the request as follows:

$ curl "localhost:5762/echo?text=Hi%20there" | jq '.'
  "message_echo": [
    "The text is: Hi there"

In this case, the text parameter is defined as part of the query string, which is appended to the end of the URL. Additional parameters could be defined by separating each key-value pair with &. It’s also possible to pass the parameter as part of the request body. However, to leverage this method of data delivery, we need to update our API definition so that the /echo endpoint also accepts POST requests. We’ll also update our API to consider multiple parameters, and return the parsed parameters along with the entire request body.


#* @apiTitle Simple API

#* Echo provided text
#* @param text The text to be echoed in the response
#* @param number A number to be echoed in the response
#* @get /echo
#* @post /echo
function(req, text = "", number = 0) {
    message_echo = paste("The text is:", text),
    number_echo = paste("The number is:", number),
    raw_body = req$postBody

With this new API definition, the following curl request can be made to pass parameters to the API via the request body.

$ curl --data "text=Hi%20there&number=42&other_param=something%20else" "localhost:5762/echo" | jq '.'
  "message_echo": [
    "The text is: Hi there"
  "number_echo": [
    "The number is: 42"
  "raw_body": [

Notice that we passed more than just text and number in the request body. plumber parses the request body and matches any arguments found in the R function definition. Additional arguments, like other_param in this case, are ignored. plumber can parse the request body if it is URL-encoded or JSON. The following example shows the same request, but with the request body encoded as JSON.

$ curl --data '{"text":"Hi there", "number":"42", "other_param":"something else"}' "localhost:5762/echo" | jq '.'
  "message_echo": [
    "The text is: Hi there"
  "number_echo": [
    "The number is: 42"
  "raw_body": [
    "{\"text\":\"Hi there\", \"number\":\"42\", \"other_param\":\"something else\"}"

While these examples are fairly simple, they demonstrate the extraordinary facility of plumber. Thanks to plumber, it is now a fairly straightforward process to expose R functions so they can be consumed and leveraged by any number of systems and processes. We’ve only scratched the surface of its capabilities and, as mentioned, future posts will walk through the creation of a Slack app using plumber. Comprehensive documentation for plumber can be found here.


Up until now, we’ve just been interacting with our APIs in our local development environment. That’s great for development and testing, but when it comes time to expose an API to external services, we don’t want our laptop held responsible (at least, I don’t!). There are several deployment methods for plumber outlined in the documentation. The most straightforward method of deployment is to use RStudio Connect. When editing a plumber script in recent versions of the RStudio IDE, a blue publish button will appear in the top right-hand corner of the source editor. Clicking this button brings up a menu that enables the user to publish the API to an instance of RStudio Connect. Once published, API access and performance can be configured through RStudio Connect and the API can be leveraged by external systems and processes.


Web APIs are a powerful mechanism for providing systematic access to computational processes. Writing APIs with plumber makes it easy for others to take advantage of the work you’ve created in R without the need to rely on batch processing or code rewriting. plumber is exceptionally flexible and can be used to define a wide variety of endpoints. These endpoints can be used to integrate R with other systems. As an added bonus, downstream consumers of these APIs require no knowledge of R. They only need to know how to properly interact with the API via HTTP. plumber provides a convenient and reliable bridge between R and other systems and/or languages used within an organization.

To leave a comment for the author, please follow the link and comment on their blog: R Views. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Gifski on CRAN: the fastest GIF encoder in the universe

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

cover image

The gifski package which was demonstrated in May at eRum 2018 in Budapest is now on CRAN. Gifski is a simple but powerful package which can hopefully take away an important performance bottleneck for generating animated graphics in R.

What is Gifski

Gifski is a multi-threaded high-quality GIF encoder written in Rust. It can create animated GIF images with thousands of colors per frame and do so much faster than other software. The Gifski Website has more technical details and beautiful examples.

The R package wraps the Rust crate and can be installed in the usual way from CRAN. One of the major benefits of Rust is that it has no runtime, so the R package has no dependencies.


On Linux you need to install cargo to compile the rust code, but the package does not require any external libraries. Cargo automatically compiles and links all Rust code when building the R package. If you are on MacOS, try installing from source to see how it works:

install.packages("gifski", type = "source")

How to Use

The R interface is very simple: either generate a GIF from a set of images, or directly from the R graphics device. The ?gifski manual page contains example of both. The gifski() function combines a set of PNG images into a single animated GIF file:

# Convert png files to gif
par(ask = FALSE)
for(i in 1:10)
  plot(rnorm(i * 10), main = i)
png_files <- sprintf("frame%03d.png", 1:10)
gif_file <- tempfile(fileext = ".gif")
gifski(png_files, gif_file)

Alternatively the save_gif() function captures plots R generated in an R expression and saves them as a animated gif, just like e.g. animation::saveGIF() but much faster:

# Example borrowed from gganimate
makeplot <- function(){
  datalist <- split(gapminder, gapminder$year)
  lapply(datalist, function(data){
    p <- ggplot(data, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
      scale_size("population", limits = range(gapminder$pop)) + geom_point() + ylim(20, 90) +
      scale_x_log10(limits = range(gapminder$gdpPercap)) + ggtitle(data$year) + theme_classic()

# High Definition images:
gif_file <- save_gif(makeplot(), width = 800, height = 450, res = 92)


Gifski shows a progress meter while generating the GIF. Running this example shows that the GIF encoding is no longer a serious overhead: time spent in encoding is only a small fraction of the total time to generate the plot. Hopefully this will make it easier to generate animations with hundreds or even thousands of frames using for example the gganimate package.

More about Rust in R

This is the first CRAN package that interfaces a Rust library. In this case the R package itself does not contain any Rust code because we can call Rust directly from C.

If you are interested in learning more about using Rust in R packages, have a look at my slides from eRum 2018. Also see the r-rust organization on Github for more examples R packages, especially the hellorust package.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

future 1.9.0 – Output from The Future

(This article was first published on JottR on R, and kindly contributed to R-bloggers)

future 1.9.0 – Unified Parallel and Distributed Processing in R for Everyone – is on CRAN. This is a milestone release:

Standard output is now relayed from futures back to the master R session –
regardless of where the futures are processed!

Disclaimer: A future’s output is relayed only after it is resolved and when its value is retrieved by the master R process. In other words, the output is not streamed back in a “live” fashion as it is produced. Also, it is only the standard output that is relayed. See below, for why the standard error cannot be relayed.

Illustration of communication by mechanical semaphore in 1800s France. Lines of towers supporting semaphore masts were built within visual distance of each other. The arms of the semaphore were moved to different positions, to spell out text messages. The operators in the next tower would read the message and pass it on. Invented by Claude Chappee in 1792, semaphore was a popular communication technology in the early 19th century until the telegraph replaced it. (source:
Relaying standard output from far away


Assume we have access to three machines with R installed on our local network. We can distribute our R processing to these machines using futures by:

> library(future)
> plan(cluster, workers = c("n1", "n2", "n3"))
> nbrOfWorkers()
[1] 3

With the above, future expressions will now be processed across those three machines. To see which machine a future ends up being resolved by, we can output the hostname, e.g.

> printf <- function(...) cat(sprintf(...))

> f <- future({
+   printf("Hostname: %s\n",[["nodename"]])
+   42
+ })
> v <- value(f)
Hostname: n1
> v
[1] 42

We see that this particular future was resolved on the n1 machine. Note how the output is relayed when we call value(). This means that if we call value() multiple times, the output will also be relayed multiple times, e.g.

> v <- value(f)
Hostname: n1
> value(f)
Hostname: n1
[1] 42

This is intended and by design. In case you are new to futures, note that a future is only evaluated once. In other words, calling value() multiple times will not re-evaluate the future expression.

The output is also relayed when using future assignments (%<-%). For example,

> v %<-% {
+   printf("Hostname: %s\n",[["nodename"]])
+   42
+ }
> v
Hostname: n1
[1] 42
> v
[1] 42

In this case, the output is only relayed the first time we print v. The reason for this is because when first set up, v is a promise (delayed assignment), and as soon as we “touch” (here print) it, it will internally call value() on the underlying future and then be resolved to a regular variable v. This is also intended and by design.

In the spirit of the Future API, any output behaves exactly the same way regardless of future backend used. In the above, we see that output can be relayed from three external machines back to our local R session. We would get the exact same if we run our futures in parallel, or sequentially, on our local machine, e.g.

> plan(sequential)
 v %<-% {
   printf("Hostname: %s\n",[["nodename"]])
> v
Hostname: my-laptop
[1] 42

This also works when we use nested futures wherever the workers are located (local or remote), e.g.

> plan(list(sequential, multiprocess))
> a %<-% {
+   printf("PID: %d\n", Sys.getpid())
+   b %<-% {
+     printf("PID: %d\n", Sys.getpid())
+     42
+   }
+   b	
+ }
> a
PID: 360547
PID: 484252
[1] 42

Higher-Level Future Frontends

The core Future API, that is, the explicit future()value() functions and the implicit future-assignment operator %<-% function, provides the foundation for all of the future ecosystem. Because of this, relaying of output will work out of the box wherever futures are used. For example, when using future.apply we get:

> library(future.apply)
> plan(cluster, workers = c("n1", "n2", "n3"))
> printf <- function(...) cat(sprintf(...))

> y <- future_lapply(1:5, FUN = function(x) {
+   printf("Hostname: %s (x = %g)\n",[["nodename"]], x)
+   sqrt(x)
+ })
Hostname: n1 (x = 1)
Hostname: n1 (x = 2)
Hostname: n2 (x = 3)
Hostname: n3 (x = 4)
Hostname: n3 (x = 5)
> unlist(y)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

and similarly when, for example, using foreach:

> library(doFuture)
> registerDoFuture()
> plan(cluster, workers = c("n1", "n2", "n3"))
> printf <- function(...) cat(sprintf(...))

> y <- foreach(x = 1:5) %dopar% {
+   printf("Hostname: %s (x = %g)\n",[["nodename"]], x)
+   sqrt(x)
+ }
Hostname: n1 (x = 1)
Hostname: n1 (x = 2)
Hostname: n2 (x = 3)
Hostname: n3 (x = 4)
Hostname: n3 (x = 5)
> unlist(y)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

What about standard error?

Unfortunately, it is not possible to relay output sent to the standard error (stderr), that is, output by message(), cat(..., file = stderr()), and so on, is not taken care of. This is due to a limitation in R, preventing us from capturing stderr in a reliable way. The gist of the problem is that, contrary to stdout (“output”), there can only be a single stderr (“message”) sink active in R at any time. What really is the show stopper is that if we allocate such a message sink, it will be stolen from us the moment other code/functions request the message sink. In other words, message sinks cannot be used reliably in R unless one fully controls the whole software stack. As long as this is the case, it is not possible to collect and relay stderr in a consistent fashion across all future backends (*). But, of course, I’ll keep on trying to find a solution to this problem. If anyone has a suggestion for a workaround or a patch to R, please let me know.

(*) The callr package captures stdout and stderr in a consistent manner, so for the future.callr backend, we could indeed already now relay stderr. We could probably also find a solution for future.batchtools backends, which targets HPC job schedulers by utilizing the batchtools package. However, if code becomes dependent on using specific future backends, it will limit the end users’ options – we want to avoid that as far as ever possible. Having said this, it is possible that we’ll start out supporting stderr by making it an optional feature of the Future API.

Poor Man’s debugging

Because the output is also relayed when there is an error, e.g.

> x <- "42"
> f <- future({
+   str(list(x = x))
+   log(x)
+ })
> value(f)
List of 1
 $ x: chr "42"
Error in log(x) : non-numeric argument to mathematical function

it can be used for simple troubleshooting and narrowing down errors. For example,

> library(doFuture)
> registerDoFuture()
> plan(multiprocess)
> nbrOfWorkers()
[1] 2
> x <- list(1, "2", 3, 4, 5)
> y <- foreach(x = x) %dopar% {
+   str(list(x = x))
+   log(x)
+ }
List of 1
 $ x: num 1
List of 1
 $ x: chr "2"
List of 1
 $ x: num 3
List of 1
 $ x: num 4
List of 1
 $ x: num 5
Error in { : 
  task 2 failed - "non-numeric argument to mathematical function"

From the error message, we get that there was an “non-numeric argument” (element) passed to a function. By adding the str(), we can also see that it is of type character and what its value is. This will help us go back to the data source (x) and continue the troubleshooting there.

What’s next?

Progress bar information is one of several frequently requested features in the future framework. I hope to attack the problem of progress bars and progress messages in higher-level future frontends such as future.apply. Ideally, this can be done in a uniform and generic fashion to meet all needs. A possible implementation that has been discussed, is to provide a set of basic hook functions (e.g. on-start, on-resolved, on-value) that any ProgressBar API (e.g. jobstatus) can build upon. This could help avoid tie-in to a particular progress-bar implementation.

Another feature I’d like to get going is (optional) benchmarking of processing time and memory consumption. This type of information will help optimize parallel and distributed processing by identifying and understand the various sources of overhead involved in parallelizing a particular piece of code in a particular compute environment. This information will also help any efforts trying to automate load balancing. It may even be used for progress bars that try to estimate the remaining processing time (“ETA”).

So, lots of work ahead. Oh well …

Happy futuring!

See also

To leave a comment for the author, please follow the link and comment on their blog: JottR on R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

July 22, 2018

Distilled News

Introduction to Artificial Neural Network Model

In this Machine Learning tutorial, we will take you through the introduction of Artificial Neural network Model. First of all, we will discuss the multilayer Perceptron network next with the Radial Basis Function Network, they both are supervised learning model. At last, we will cover the Kohonen Model which follows Unsupervised learning and the difference between Multilayer Perceptron network and Radial Basis Function Network.

A Brief Overview of Outlier Detection Techniques

Outliers are extreme values that deviate from other observations on data , they may indicate a variability in a measurement, experimental errors or a novelty. In other words, an outlier is an observation that diverges from an overall pattern on a sample.

RStudio:addins part 4 – Unit testing coverage investigation and improvement, made easy

A developer always pays his technical debts! And we have a debt to pay to the gods of coding best practices, as we did not present many unit tests for our functions yet. Today we will show how to efficiently investigate and improve unit test coverage for our R code, with focus on functions governing our RStudio addins, which have their own specifics.

MiKTeX Behind a Windows Firewall

I´ve always had problems with MiKTeX on my work computer. I can install it just fine, or get IT to install it, but then the package manager doesn´t work because of our firewall. You can set up a local repository to get around this problem, and I will show you how. I´m just doing a basic setup here, just enough to compile the RStudio Rmd template to PDF. ‘Why do I need MiKTeX ‘, you might ask. Well, because if you want to create a PDF from an RMarkdown file in RStudio it is required. Otherwise you get this polite error message.

Whether to use a data frame in R?

In this post, I try to show you in which situations using a data frame is appropriate, and in which it´s not.

CVXR: A Direct Standardization Example

In our first blog post, we introduced CVXR, an R package for disciplined convex optimization, and showed how to model and solve a non-negative least squares problem using its interface. This time, we will tackle a non-parametric estimation example, which features new atoms as well as more complex constraints.

Divide and recombine (D&R) data science projects for deep analysis of big data and high computational complexity

The focus of data science is data analysis. This article begins with a categorization of the data science technical areas that play a direct role in data analysis. Next, big data are addressed, which create computational challenges due to the data size, as does the computational complexity of many analytic methods. Divide and recombine (D&R) is a statistical approach whose goal is to meet the challenges. In D&R, the data are divided into subsets, an analytic method is applied independently to each subset, and the outputs are recombined. This enables a large component of embarrassingly-parallel computation, the fastest parallel computation. DeltaRho open-source software implements D&R. At the front end, the analyst programs in R. The back end is the Hadoop distributed file system and parallel compute engine. The goals of D&R are the following: access to thousands of methods of machine learning, statistics, and data visualization; deep analysis of the data, which means analysis of the detailed data at their finest granularity; easy programming of analyses; and high computational performance. To succeed, D&R requires research in all of the technical areas of data science. Network cybersecurity and climate science are two subject-matter areas with big, complex data benefiting from D&R. We illustrate this by discussing two datasets, one from each area. The first is the measurements of 13 variables for each of 10,615,054,608 queries to the Spamhaus IP address blacklisting service. The second has 50,632 3-hourly satellite rainfall estimates at 576,000 locations.

Continue Reading…


Read More

Amazon’s Hanging Cable Problem (Golden Gate Edition)

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)


In this post we use R’s capabilities to solve nonlinear equation systems in order to answer an extension of the hanging cable problem to suspension bridges. We then use R and ggplot to overlay the solution to an image of the Golden Gate Bridge in order to bring together theory and practice.

Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.


The so called Amazon’s hanging cable problem explained in this youtube video (watched 2.4 mio times!1) goes as follows:

A cable of 80 meters (m) is hanging from the top of two poles that are both 50 m from the ground. What is the distance between the two poles, to one decimal place, if the center of the cable is:

  1. 20 m above the ground?
  2. 10 m above the ground?

Allegedly, (b) has been used as an Amazon interview question, however, the problem is much older and has otherwise nothing to do with Amazon. Can you solve (b)? Or even (a)? The problem can be illustrated as follows:

Screenshot from Presh Talwalkar’s website.

Hint: The solution to (a) is concisely described in Chatterjee and Nita (2010) and for (b) you need to do little more than just think. So instead of applying at Amazon, let’s take the question to the next level: Apply for the orange belt in R: How you wouldn’t solve the hanging cable problem by instead solving the hanging cable problem suspension bridge style!

As explained in the video the catenary curve is the geometric shape, a cable assumes under its own weight when supported only at its ends. If instead the cable supports a uniformly distributed vertical load, the cable has the shape of a parabolic curve. This would for example be the case for a suspension bridge with a horizontal suspended deck, if the cable itself is not too heavy compared to the road sections. A prominent example of a suspension bridges is the Golden Gate Bridge, which we will use as motivating example for this post.

Solving the Cable Problem

Parabola Shape

Rephrasing the cable problem as the ‘suspension bridge problem‘ we need to solve a two-component non-linear equation system:

  1. the first component ensures that the parabolic curve with vertex at \((0,0)\) goes through the poles at the x-values \(-x\) and \(x\). In other words: the distance between the two poles is \(2x\). Note that the coordinate system is aligned such that the lowest point of the cable is at the origo.

  2. the second component ensures that the arc-length of the parabola is as given by the problem. Since the parabola is symmetric it is sufficient to study the positive x-axis

The two criteria are converted into an equation system as follows: \[
\begin{align*} a x^2 &= 50 – \text{height above ground} \\
\int_0^x \sqrt{1 + \left(\frac{d}{du} a u^2\right)^2} du &= 40.

Here, the general equation for arc-length of a function \(y=f(u)\) has been used. Solving the arc-length integral for a parabola can either be done by numerical integration or by solving the integral analytically or just look up the resulting analytic expression as eqn 4.25 in Spiegel (1968). Subtracting the RHS from each LHS gives us a non-linear equation system with unknowns \(a\) and \(x\) of the form

y_1(a,x) \\
0 \\

Writing this in R code:

## Height of function at the location x from center is (pole_height - height_above_ground)
y1_parabola <- function(a, x, pole_height=50, above_ground=20) {
  a*x^2 - (pole_height - above_ground)

## Arc-length of the parabola between [-x,x] is given as cable_length
y2_parabola <- function(a,x, cable_length=80, arc_method=c("analytic","numeric")) {

  ##Arc-length of a parabola a*u^2 within interval [0,x]
  if(arc_method == "numeric") {
    f <- function(u) return( sqrt(1 + (2*a*u)^2))
    half_arclength <- integrate(f, lower=0, upper=x)$value
  } else if (arc_method=="analytic") {
    half_arclength <-  1/(4*a)*(2*a*x*sqrt(4*a^2*x^2+1) + asinh(2*a*x))

  ##The equation: s = cable_length/2
  half_arclength - cable_length/2

## The non-linear equation system \bm{y}(\theta) = \bm{0}, where the LHS
## is given by a list with two components containing y_1(\theta) and y_2(\theta)
f_sys <- function(theta, y, pole_height=50, above_ground=20, cable_length=80, ...) {
  a <- theta[1]
  x <- exp(theta[2]) ##ensure x is positive

  c(y[[1]](a,x, pole_height=pole_height, above_ground=above_ground),
    y[[2]](a,x, cable_length=cable_length, ...))

##Helper function to transform theta parameter vector to (a,x)'
theta2ax <- function(theta) {
  c(a=theta[1], x=exp(theta[2]))

To ensure \(x>0\) we re-parametrized the equations with \(\theta_2 = \log(x)\) and provide the function theta2ax to backtransform the result. We can now use the nleqslv package to solve the non-linear equation system using a one-liner:

y_parabola <- list(y1_parabola, y2_parabola)
sol_parabola <- nleqslv(x=c(0.1,0.1),f_sys, y=y_parabola,  arc_method="analytic")
##           a           x 
##  0.05355207 23.66859605

In other words, for a cable of length 80m the pole of a suspension bridge will be located 23.7m from the origo, which means the two poles of the bridge will be 47.3m apart, which is also the span of the bridge.

Using arc_method="numeric" instead of the analytic solution gives

##           a           x 
##  0.05355207 23.66859605

It is re-assuring to see that the numerical integration method yields the same result as the analytic method. The analytic method has mathematical beauty, the numerical method allows the data scientist to solve the problem without diving into formula compendiums or geometry.

Catenary Shape

Using the same code, but with the y-functions formulated for the catenary case we obtain

## Value of y=f(u) evaluated at u=x
y1_catenary <- function(a,x, pole_height=50, above_ground=20) {
  a * cosh(x/a) - a - (pole_height- above_ground)
## Arc-length condition
y2_catenary <- function(a,x, cable_length=80) {
  a * sinh(x/a) - cable_length/2

## Solve equation system
y_catenary <- list(y1_catenary, y2_catenary)
sol_catenary <- nleqslv(x=c(0.1,0.1),f_sys, y=y_catenary, method="Newton")
##        a        x 
## 11.66667 22.70229

In other words the solution to the original cable problem is \(x=22.7 m\) whereas the answer to the suspension bridge version is \(x=23.7m\). The difference to the parabolic form can be seen from the following graph:

Testing the theory

We test our theory by studying the cable of the Golden Gate suspension bridge. Shown below is a photograph by D Ramey Logan available under a CC BY 4.0 license. For presentation in this post the image was tilted by -0.75 degrees (around the camera’s view axis) with the imager package to make sea level approximately horizontal. Parabolic and catenary overlays (no real difference between the two) were done using the theory described above.

##Preprocess image
img <- imager::load.image(file.path(fullFigPath, "Golden_Gate_Bridge.png"))
img <- imager::imrotate(img, angle=-0.75, interpolation=1)
img <- imager::resize(img,-50,-50, interpolation_type=5)

We manually identify center, sea level and poles from the image and use annotation_raster to overlay the image on the ggplot of the corresponding parabola and catenary. See the code on github for details.

The fit is not perfect, which is due to the camera’s direction not being orthogonal to the plane spanned by the bridge – for example the right pole appears to be closer to the camera than the left pole2. We scaled and ‘offsetted’ the image so the left pole is at distance 640m from origo, but did not correct for the tilting around the \(y\)-axis. Furthermore, distances are being distorted by the lens, which might explain the poles being too small. Rectification and perspective control of such images is a photogrammetric method beyond the scope of this post!


This post may not to impress a Matlab coding engineer, but it shows how R has developed into a versatile tool going way beyond statistics: We used its optimization and image analysis capabilities. Furthermore, given an analytic form of \(y(\theta)\), R can symbolically determine the Jacobian and, hence, implement the required Newton-Raphson solving of the non-linear equation system directly – see the Appendix. In other words: R is also a full stack mathematical problem solving tool!

As a challenge to the interested reader: Can you write R code, for example using imager, which automatically identifies poles and cable in the image and based on the known specification of these parameters of the Golden Gate Bridge (pole height: 230m, span 1280m, clearance above sea level: 67.1m), and perform a rectification of the image? If yes, Stockholm University’s Math Department hires for Ph.D. positions every April! The challenge could work well as pre-interview project. 😉

Appendix – Newton-Raphson Algorithm

Because the y_1(a,x) and y_2(a,x) are both available in closed analytic form, one can form the Jacobian of non-linear equations system by combining the two gradients. This can be achieved symbolically using the deriv or Deriv::Deriv functions in R.

Given starting value \(\theta\) the iterative procedure to find the root of the non-linear equation system \(y(\theta) = 0\) is given by (Nocedal and Wright 2006, Sect. 11.1)

\theta^{(k+1)} = \theta^k – J(\theta^k)^{-1} y(\theta),

where \(J\) is the Jacobian of the system, which in this case is a 2×2 matrix.

gradient_y1 <- Deriv::Deriv(y1_parabola, x=c("a","x"))

y2_parabola_analytic <- function(a,x, cable_length=80) {
  1/(4*a)*(2*a*x*sqrt(4*a^2*x^2+1) + asinh(2*a*x)) - cable_length/2

gradient_y2 <- Deriv::Deriv(y2_parabola_analytic, x=c("a","x"))

J <- function(theta, pole_height=50, above_ground=20, cable_length=80, ...) {
  a <- theta[1]
  x <- exp(theta[2]) #  x <- exp(theta[2])

  ##Since we use x = exp(theta[2])=g(theta[2]) we need the chain rule to find the gradient in theta
  ##this is g'(theta[2]) = exp(theta[2]) = x
  rbind(gradient_y1(a,x, pole_height=pole_height, above_ground=above_ground)* c(1, x),
        gradient_y2(a,x, cable_length=cable_length)  * c(1, x))

By iterating Newton-Raphson steps we can find the solution of the equation system manually:

##Start values
theta <- c(0.1,log(10))
thetanew <- c(0.1,log(20))
##Log with the values
log <- t(theta2ax(theta))

##Iterate Newton-Raphson steps until convergence
while ( (sum(thetanew - theta)^2 / sum(theta^2)) > 1e-15) {
  theta <- thetanew
  ##Update step
  thetanew <- theta - solve(J(theta=theta)) %*% f_sys(theta, y=y_parabola, arc_method="analytic")
  ##Add to log
  log <- rbind(log, theta2ax(thetanew))

##Look at the steps taken
##               a        x
## [1,] 0.10000000 10.00000
## [2,] 0.02667392 25.46647
## [3,] 0.04632177 25.43589
## [4,] 0.05270610 23.75416
## [5,] 0.05354318 23.66953
## [6,] 0.05355207 23.66860
## [7,] 0.05355207 23.66860

We show the moves of the algorithm in a 2D contour plot for \(r(a,x) = \sqrt{y_1(a,x)^2 + y_2(a,x)^2}\). The solution to the system has \(r(a,x)=0\). See the code on github for details.


Chatterjee, N., and B. G. Nita. 2010. “The Hanging Cable Problem for Practical Applications.” Atlantic Electronic Journal of Mathematics 4 (1): 70–77.

Nocedal, J., and S. J. Wright. 2006. Numerical Optimization. 2nd ed. Springer-Verlag.

Spiegel, M. R. 1968. Mathematical Handbook of Formulas and Tables. Schaum’s Outline Series. McGraw-Hill Book Company.

  1. As of 2018-07-23.↩

  2. A manual investigation using the "Map | Map Object" Filter in Gimp showed that the angle of tilting around the y-axis is about 20 degrees.↩

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice.... offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Evaluation of the 2018 FIFA World Cup Forecast

(This article was first published on Achim Zeileis, and kindly contributed to R-bloggers)

A look back the 2018 FIFA World Cup in Russia to check whether our tournament forecast based on the bookmaker consensus model was any good…

How surprising was the tournament?

Last week France won the 2018 FIFA World Cup in a match against Croatia in Russia, thus delivering an entertaining final to a sportful tournament. Many perceived the course of the tournament as very unexpected and surprising because many of the “usual” favorites like Brazil, Germany, Spain, or Argentina did not even make it to the semi-finals. And in contrast, teams like host Russia and finalist Croatia proceeded further than expected. However, does this really mean that expectations of experts and fans were so wrong? Or, how surprising was the result given pre-tournament predictions?

Therefore, we want to take a critical look back at our own Probabilistic Forecast for the 2018 FIFA World Cup based on the bookmaker consensus model that aggregated the expert judgments of 26 bookmakers and betting exchanges. A set of presentation slides (in PDF format) with explanations of the model and its evaluation are available to accompany this blog post: slides.pdf


Despite some surprises in the tournament, the probabilistic bookmaker consensus forecast fitted reasonably well. Although it is hard to evaluate probabilistic forecasts with only one realization of the tournament but by and large most outcomes do not deviate systematically from the probabilities assigned to them.

However, there is one notable exception: Expectations about defending champion Germany were clearly wrong. “Die Mannschaft” was predicted to advance from the group stage to the round of 16 with probability 89.1% – and they not only failed to do so but instead came in last in their group.

Other events that were perceived as surprising were not so unlikely to begin, e.g., for Argentina it was more likely to get eliminated before the quarter finals (predicted probability: 51%) than to proceed further. Or they were not unlikely conditional on previous tournament events. Examples for the latter are the pre-tournament prediction for Belgium beating Brazil in a match (40%) or Russia beating Spain (33%). Of course, another outcome of those matches was more likely but compared with these predictions the results were maybe not as surprising as perceived by many. Finally, the pre-tournament prediction of Croatia making it to the final was only 6% but conditional on the events from the round of 16 (especially with Spain being eliminated) this increased to 27% (only surpassed by England with 36%).

Tournament animation

The animated GIF below shows the pre-tournament predictions for each team winning the 2018 FIFA world cup. In the animation the teams that “survived” over the course of the tournament are highlighted. This clearly shows that the elimination of Germany (winning probability: 15.8%) was the big surprise in the group stage but otherwise almost all of the teams expected to proceed also did so. Afterwards, two of the other main favorites Brazil (16.6%) and Spain (12.5%) dropped out but eventually the fourth team with double-digit winning probability (France, 12.1%) prevailed.

tournament animation


Compared to other rankings of the teams in the tournament, the bookmaker consensus model did quite well. To illustrate this we compute the Spearman rank correlation of observed partial tournament ranking (1 FRA, 2 CRO, 3 BEL, 4 ENG, 6.5 URU, 6.5 BRA, …) with the bookmaker consensus model as well as Elo and FIFA rating.

Method Correlation
Bookmaker consensus
Elo rating
FIFA rating

Match probabilities

As there is no good way to assess the predicted winning probabilities for winning the title with only one realization of the tournament, we at least (roughly) assess the quality of the predicted probabilities for the individual matches. To do so, we split the 63 matches into three groups, depending on the winning probability of the stronger team.

pairwise probability evaluation

This gives us matches that were predicted to be almost even (50-58%), had moderate advantages for the stronger team (58-72%), or clear advantages for the stronger team (72-85%). It turns out that in the latter two groups the average predicted probabilities (dashed red line) match the actual observed proportions quite well. Only in the “almost even” group, the stronger teams won slightly more often than expected.

Group stage probabilities

As already mentioned above, there was only one big surprise in the group stage – with Germany being eliminated. As the tables below show, most other results from the group rankings conformed quite well with the predicted probabilities to “survive” the group stage.

Prob. (in %)
Prob. (in %)
Prob. (in %)
Prob. (in %)
Prob. (in %)
Prob. (in %)
Prob. (in %)
Prob. (in %)

To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Magister Dixit

“Being patient with other people and really respecting other people is very important in succeeding.” Anna Smith

Continue Reading…


Read More

R Packages worth a look

The Double-Gap Life Expectancy Forecasting Model (MortalityGaps)
Life expectancy is highly correlated over time among countries and between males and females. These associations can be used to improve forecasts. Here …

Penalized Partial Least Squares (ppls)
Contains linear and nonlinear regression methods based on Partial Least Squares and Penalization Techniques. Model parameters are selected via cross-va …

Statistical Inference on Lineup Fairness (r4lineups)
Since the early 1970s eyewitness testimony researchers have recognised the importance of estimating properties such as lineup bias (is the lineup biase …

Continue Reading…


Read More

How to add Trend Lines in R Using Plotly

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

1. Global trend lines

One of the simplest methods to identify trends is to fit a ordinary least squares regression model to the data. The model most people are familiar with is the linear model, but you can add other polynomial terms for extra flexibility. In practice, avoid polynomials of degrees larger than three because they are less stable.

Global trendlines with plotly

Below, we use the EuStockMarkets data set (available in R data sets) to construct linear, quadratic and cubic trend lines.

xx = EuStockMarkets[,1] = attr(xx, "tsp")
tt = seq([1], to =[2], by=1/[3])
data.fmt = list(color=rgb(0.8,0.8,0.8,0.8), width=4)
line.fmt = list(dash="solid", width = 1.5, color=NULL)

ti = 1:length(xx)
m1 = lm(xx~ti)
m2 = lm(xx~ti+I(ti^2))
m3 = lm(xx~ti+I(ti^2)+I(ti^3))

p.glob = plot_ly(x=tt, y=xx, type="scatter", mode="lines", line=data.fmt, name="Data")
p.glob = add_lines(p.glob, x=tt, y=predict(m1), line=line.fmt, name="Linear")
p.glob = add_lines(p.glob, x=tt, y=predict(m2), line=line.fmt, name="Quadratic")
p.glob = add_lines(p.glob, x=tt, y=predict(m3), line=line.fmt, name="Cubic")
p.glob = layout(p.glob, title = "Global smoothers")

2. Local smoothers

Global models assume that the time series follows a single trend. For many data sets, however, we would want to relax this assumption. In the following section, we demonstrate the use of local smoothers using the Nile data set. It contains measurements of the annual river flow of the Nile over 100 years and is less regular than the EuStockMarkets data set.

i. Moving averages

The moving average (also known as running mean) method consists of taking the mean of a fixed number of nearby points.  Even with this simple method we see that the question of how to choose the neighborhood is crucial for local smoothers. Increasing the bandwidth from 5 to 20 suggests that there is a gradual decrease in annual river flow from 1890 to 1905 instead of a sharp decrease at around 1900.

Many packages include functions to compute the running mean such as caTools::runmean and forecast::ma, which may have additional features, but filter in the base stats package can be used to compute moving averages without installing additional packages.

Moving averages in plotly

xx = Nile = attr(xx, "tsp")
tt = seq([1], to =[2], by=1/[3])

rmean20 = stats::filter(xx, rep(1/20, 20), side=2)
rmean5 = stats::filter(xx, rep(1/5, 5), side=2)

p.rm = plot_ly(x=tt, y=xx, type="scatter", mode="lines", line=data.fmt, name="Data")
p.rm = add_lines(p.rm, x=tt, y=rmean20, line=line.fmt, name="Bandwidth = 20")
p.rm = add_lines(p.rm, x=tt, y=rmean5, line=line.fmt, name="Bandwidth = 5")
p.rm = layout(p.rm, title = "Running mean")

ii. Running line smoothers

The running line smoother reduces the bias by fitting a linear regression in a local neighborhood of the target value. A popular algorithm using the running line smoother is Friedman’s super smoother supsmu, which by default uses cross-validation to find the best span.

Running line smoothers in plotly

rlcv = supsmu(tt, xx)
rlst = supsmu(tt, xx, span = 0.05)
rllt = supsmu(tt, xx, span = 0.75)

p.rl = plot_ly(x=tt, y=xx, type="scatter", mode="lines", line = data.fmt, name="Data")
p.rl = add_lines(p.rl, x=tt, y=rllt$y, line=line.fmt, name="Span = 0.75")
p.rl = add_lines(p.rl, x=tt, y=rlst$y, line=line.fmt, name="Span = 0.05")
p.rl = add_lines(p.rl, x=tt, y=rlcv$y, line=line.fmt, name="Cross-validated span")
p.rl = layout(p.rl, title = "Running line smoothers")

iii. Kernel smoothers

An alternative approach to specifying a neighborhood is to decrease weights further away from the target value. These estimates are much smoother than the results from either the running mean or running line smoothers.

Kernel smoothers in plotly

ks1 = ksmooth(tt, xx, "normal", 20, x.points=tt)
ks2 = ksmooth(tt, xx, "normal", 5, x.points=tt)

p.ks = plot_ly(x=tt, y=xx, type="scatter", mode="lines", line=data.fmt, name="Data")
p.ks = add_lines(p.ks, x=ks1$x, y=ks1$y, line=line.fmt, name="Bandwidth = 20")
p.ks = add_lines(p.ks, x=ks1$x, y=ks2$y, line=line.fmt, name="Bandwidth = 5")
p.ks = layout(p.ks, title = "Kernel smoother")\

iv. Smoothing splines

Splines consist of a piece-wise polynomial with pieces defined by a sequence of knots where the pieces join smoothly. A smoothing splines is estimated by minimizing a criterion containing a penalty for both goodness of fit, and smoothness. The trade-off between the two is controlled by the smoothing parameter lambda, which is typically chosen by cross-validation.

Smoothing splines trend lines in plotly

In the base package, smooth.spline can be used to compute splines, but it is more common to use the GAM function in mgcv. Both functions use cross-validation to choose the default smoothing parameter; but as seen in the chart above, the results vary between implementations. Another advantage to using GAM is that it allows estimation of confidence intervals.

sp.base = smooth.spline(tt, xx) = gam(xx~s(tt, bs="cr"))
sp.gam = gam(xx~s(tt))
sp.pred = predict(sp.gam, type="response",
sp.df = data.frame(x=sp.gam$model[,2], y=sp.pred$fit,
                    lb=as.numeric(sp.pred$fit - (1.96 * sp.pred$,
                    ub=as.numeric(sp.pred$fit + (1.96 * sp.pred$
sp.df = sp.df[order(sp.df$x),]

pp = plot_ly(x=tt, y=xx, type="scatter", mode="lines", line=data.fmt, name="Data")
pp = add_lines(pp, x=tt, y=sp.pred$fit, name="GAM", line=list(color="#366092", width=2))
pp = add_ribbons(pp, x=sp.df$x, ymin=sp.df$lb, ymax=sp.df$ub, name="GAM 95% CI", line=list(color="#366092", opacity=0.4, width=0))
pp = add_lines(pp, x=tt, y=predict(sp.base)$y, name="smooth.spline", line=list(color="orange", width=2))
pp = layout(pp, title="Smoothing splines")


LOESS (Locally Estimated Scatterplot Smoother) combines local regression with kernels by using locally weighted polynomial regression (by default, quadratic regression with tri-cubic weights). It also allows estimation of approximate confidence intervals. However, it is important to note that unlike supsmu, smooth.spline or gam, loess does not use cross-validation. By default, the span is set to 0.75; that is, the estimated smooth at each target value consists of a local regression constructed using 75% of the data points closest to the target value. This span is fairly large and results in estimated values that are smoother than those from other methods.

LOESS trend line with different spans in plotly

ll.rough = loess(xx~tt, span=0.1)
ll.smooth = loess(xx~tt, span=0.75)

p.ll = plot_ly(x=tt, y=xx, type="scatter", mode="lines", line=data.fmt, name="Data")
p.ll = add_lines(p.ll, x=tt, y=predict(ll.smooth), line=line.fmt, name="Span = 0.75")
p.ll = add_lines(p.ll, x=tt, y=predict(ll.rough), line=line.fmt, name="Span = 0.10")
p.ll = layout(p.ll, title="LOESS")

LOESS trend line with confidence intervals in plotly

ll.smooth = loess(xx~tt, span=0.75)
ll.pred = predict(ll.smooth, se = TRUE)
ll.df = data.frame(x=ll.smooth$x, fit=ll.pred$fit,
lb = ll.pred$fit - (1.96 * ll.pred$se),
ub = ll.pred$fit + (1.96 * ll.pred$se))
ll.df = ll.df[order(ll.df$tt),]

p.llci = plot_ly(x=tt, y=xx, type="scatter", mode="lines", line=data.fmt, name="Data")
p.llci = add_lines(p.llci, x=tt, y=ll.pred$fit, name="Mean", line=list(color="#366092", width=2))
p.llci = add_ribbons(p.llci, x=ll.df$tt, ymin=ll.df$lb, ymax=ll.df$ub, name="95% CI", line=list(opacity=0.4, width=0, color="#366092"))
p.llci = layout(p.llci, title = "LOESS with confidence intervals")

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Marvel size chart

The logistics of being a 60-foot man must be a pain.

Tags: , ,

Continue Reading…


Read More

Of statistics class and judo class: Beyond the paradigm of sequential education

In judo class they kinda do the same thing every time: you warm up and then work on different moves. Different moves in different classes, and there are different levels, but within any level the classes don’t really have a sequence. You just start where you start, practice over and over, and gradually improve. Different students in the class are at different levels, both when it comes to specific judo expertise and also general strength, endurance, and flexibility, so it wouldn’t make sense to set up the class sequentially. Even during the semester, some people show up at the dojo once a week, others twice or three times a week.

Now compare that with how we organize our statistics classes. Each course has a sequence, everyone starts on week 1 and ends on week 15 (or less for a short course), and the assumption is that everyone starts at the same level, has the same abilities, will put in the same level of effort, and can learn at the same pace. We know this isn’t so, and we try our best to adapt to the levels of all the students in the class, but the baseline is uniformity.

The main way in which we adapt to different levels of students is to offer many different courses, so each student can jump in at his or her own level. But this still doesn’t account for different abilities, different amounts of time spent during the semester, and so forth.

Also, and relatedly, I don’t think the sequential model works so well within a course, even setting aside the differences between students. There is typically only a weak ordering of the different topics within a statistics course, and to really learn the material you have to keep going back and practicing what you’ve learned.

The sequential model works well in a textbook—it’s good to be able to find what you need, and see how it relates to the other material you’ll be learning. But in a course, I’m thinking we’d be better off moving toward the judo model, in which we have a bunch of classes with regular hours and students can drop in and practice, setting their schedules as they see fit. We could then assess progress using standardized tests instead of course grades.

P.S. I’m not an expert on judo, so please take the above description as approximate. This post is really about statistics teaching, not judo.

The post Of statistics class and judo class: Beyond the paradigm of sequential education appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

If you did not already know

Belief Propagation google
Belief propagation, also known as sum-product message passing is a message passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory and has demonstrated empirical success in numerous applications including low-density parity-check codes, turbo codes, free energy approximation, and satisfiability. The algorithm was first proposed by Judea Pearl in 1982, who formulated this algorithm on trees, and was later extended to polytrees. It has since been shown to be a useful approximate algorithm on general graphs. …

DSS-MAS google
For some decision processes a significant added value is achieved when enterprises’ internal Data Warehouse (DW) can be integrated and combined with external data gained from web sites of competitors and other relevant Web sources. In this paper we discuss the agent-based integration approach using ontologies (DSS-MAS). In this approach data from internal DW and external sources are scanned by coordinated group of agents, while semantically integrated and relevant data is reported to business users according to business rules. After data from internal DW, Web sources and business rules are acquired, agents using these data and rules can infer new knowledge and therefore facilitate decision making process. Knowledge represented in enterprises’ ontologies is acquired from business users without extensive technical knowledge using user friendly user interface based on constraints and predefined templates. The approach presented in the paper was verified using the case study from the domain of mobile communications with the emphasis on supply and demand of mobile phones. …

Deep Sparse Subspace Clustering google
In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One of the appealing advantages brought by DSSC is: when original real-world data do not meet the class-specific linear subspace distribution assumption, DSSC can employ neural networks to make the assumption valid with its hierarchical nonlinear transformations. To the best of our knowledge, this is among the first deep learning based subspace clustering methods. Extensive experiments are conducted on four real-world datasets to show the proposed DSSC is significantly superior to 12 existing methods for subspace clustering. …

Continue Reading…


Read More

Book Memo: “Data Analytics in Professional Soccer”

Performance Analysis Based on Spatiotemporal Tracking Data
Daniel Link explores how data analytics can be used for studying performance in soccer. Based on spatiotemporal data from the German Bundesliga, the six individual studies in this book present innovative mathematical approaches for game analysis and player assessment. The findings can support coaches and analysts to improve performance of their athletes and inspire other researchers to advance the research field of sports analytics.

Continue Reading…


Read More

Document worth reading: “Universal gradient descent”

In this small book we collect many different and useful facts around gradient descent method. First of all we consider gradient descent with inexact oracle. We build a general model of optimized function that include composite optimization approache, level’s methods, proximal methods etc. Then we investigate primal-dual properties of the gradient descent in general model set-up. At the end we generalize method to universal one. Universal gradient descent

Continue Reading…


Read More

Magister Dixit

“In Information visualization it is often helpful to see data from a different angle and compare it with other information, update it, network it.” Joachim Sauter

Continue Reading…


Read More

Where to get help with your R question?

(This article was first published on Posts on Maëlle's R blog, and kindly contributed to R-bloggers)

Last time I blogged, I offered my obnoxious helpful advice for blog content and promotion. Today, let me again be the agony aunt you didn’t even write to! Imagine you have an R question, i.e. a question related to how you can do something with R, and your search engine efforts haven’t been too successful: where should you ask it to increase your chance of its getting answered? You could see this post as my future answer to stray suboptimal Twitter R questions, or as a more general version of Scott Chamberlain’s excellent talk about how to get help related to rOpenSci software in the 2017-03-07 rOpenSci comm call.

I think that the general journey to getting answers to your R questions is first trying your best to get answers locally in the documentation of R, then to use a search engine, and then to post a well-formulated question somewhere. My post is aimed at helping you find that somewhere. Note that it’s considered best practice to ask in one somewhere at once, and to then move on to another somewhere if you haven’t got any answer!

One thing this post isn’t about is how to ask for help to humans, which is a topic that’s e.g. covered very well in Jenny Bryan’s talk about reprex in the same 2017-03-07 rOpenSci comm call, but I’ll link out to useful resources I’ve found. This post is also not about how to ask for help to e.g. Google, and I don’t know of a good search engine guide yet although e.g. “It can be particularly helpful to paste an error message into a search engine to find out whether others have solved a problem that you encountered.” in is true.

Public question platforms vs. safe spaces?

I’ll start this post with a general comment for newbies who aren’t at ease enough yet to post their question, no matter what type of questions (see later), to anywhere public: find your safe and friendly space!

Lean on your current friend(s)

Do you have an R friend? Maybe this colleague who actually convinced you to start using R? Well, ask this person for help! Remind them it’s their fault you’re struggling now.

And if you’re not a newbie but are reading this now, be this R friend! Mentor newbies on their way to be better R users and askers.

Find new friends

Join a community, and ask your questions there. More specifically, you can join the R4DS community; your local/remote R-Ladies community (as a reminder, this doesn’t mean women only but instead “minority genders (including but not limited to cis/trans women, trans men, non-binary, genderqueer, agender)”); this French-speaking community. Such communities most often have Slack workspaces or equivalent, with an r-help channel, as well as a code of conduct and a general friendly attitude. Build your confidence there, and soon, you’ll be out asking (and answering!) R questions in the wilderness as well.

Note, if you know of any list of such friendly communities, please tell me. I only know of this list (of lists) of R User Groups and Conferences by David Smith.

Spotlight on my typology of R questions

One thing I especially like about Scott Chamberlain’s slidedeck is his typology of methods to get help, “methods and what they’re good for”. I’ll build this post based on a typology of questions, that also intersects with a typology of your newbiness-nervousness.

I tend to think of R questions as pertaining to one of these categories:

  • is this a bug, why am I stuck with this code? how do I install rJava on my laptop? (problem!) Problems encompass actual bugs, and problems where the problem is your code rather than the package (I’ll admit that I find joy in my code’s sometimes not being the problem, although this doesn’t solve the problem.).

  • is there a tool in R for doing foo? how do I learn about bar? (question!)

  • what’s the best plotting library in R? dependencies, is less more? (debate!)

I agree that the difference between debate and question might be thin, in my mind question questions have more of a trivia aspect, with answers being easier.

Where to ask your problem question

Scenario: you wrote R code, and it doesn’t do what you expected it to, which includes the glorious cases of not running at all or of using packages you haven’t even been able to install. What do you do then?

  • If your problem is linked to a package in particular, i.e. you know the function that’s causing you sorrow comes from the foo package, then your question is a foo question or bug report. So what you need to find now is the platform to report bugs and ask questions about foo.

  • When there’s no other indication or you’re not sure foo or bar is the actual culprit, use a more general site like Stack Overflow, tagging your question with “r” and other relevant labels. I was pleased to see that the bottom of includes a list of question venues including Stack Overflow, but also all R mailing lists.

Where to ask your question or debate question

How to find out whether there’s a tool to do image manipulation in R? How to know what’s best practice for your special package development challenge? How to get a book recommendation?

Maybe Twitter, to which I’ll dedicate a whole section a bit later. But a good guess is also trying to locate where the experts, or people interested in the answer, normally interact. Your finding your happy place will probably be a bit of a trial-and-error process, so while asking your first question might be more difficult, things should be easier as you learn to navigate the different communities and their activity. Here is some inspiration:

  • The R SIG (special interest group) mailing lists. Now I don’t use them at all because I’m not keen on receiving and sending many emails, but I can see there’s a great diversity of groups, maybe one that’ll suit your interests!

  • If your question is related to open science, reproducibility, documenting data, extracting data out of something (a PDF! a PNG!), package development, rOpenSci’s discuss forum might be the right venue.

  • The RStudio community forum has many categories including the tidyverse and the RStudio IDE of course but also about package development; R in production, at scale, and in complex environments, etc.

  • For a question related to outbreak analytics and R programming, check out the R Epidemics Consortium mailing list or Slack.

What is Twitter good for?

Anyone can use Twitter for what they want, depending on the accounts they follow and on their way to consume their timelines and hashtags, so the following is probably even more personal to me than the rest. I think Twitter questions that have higher chances to be answered (because I like answering them!) are the trivia ones, i.e. “is there something to do foo”, and the more controversial ones if only for having fun. That said, if you don’t have a huge following yet, or post at the “wrong” time, even when using the right hashtags or so your question might be ignored, keep that in mind.

What Twitter is not good not is, in my opinion:

  • Asking a question about a package by tagging the maintainer especially when the package docs state there’s another venue for that. But then, as a maintainer, one should probably handle such mistakes gracefully (like Yihui did here).

  • Too much code I think, even if tools like Sean Kross’ codefinch can help. Your tweet with a question can circumvent the difficulty by advertising a gist like codefinch makes you do (i.e. a tweet with an “attachment”), or by being a link to your actual question on Stack Overflow (i.e. a tweet to try and draw attention to your question).

  • Building an archive of questions and answers. It won’t be easy at all to search whether a question has already been answered. You can counteract this loss of knowledge by asking questions somewhere else, or building something out of the answers you’ve been given, e.g. a blog post listing resources for geospatial analysis in R or resources about transport and GIS in R like Steph Locke and Steph de Silva, respectively, did after getting answers on Twitter.

  • Surprise questions. At least I am scared of questions phrased like this: “Who has time to help me with ?” (e.g. XML, web scraping). They sound like a trap to me! I’d rather know more precisely what the person wants.


I think it’s as important to ask your questions in the right place as it is to ask them in the right way. I hope this post provided some clues as to know where to ask R questions. Feel free to comment below with questions/suggestions!

I’ll round up this post with a few links of resources to learn how to ask your question:

  • The guidelines and code of conduct of the venues where you’ve chosen to post your question, if there are some!

  • the 2017-03-07 rOpenSci comm call “How to Ask Questions so they get Answered! Possibly by Yourself!” was so, so good. I’ve already mentioned the talk by Jenny Bryan about reprex and Scott Chamberlain’s talk about getting help about rOpenSci’s software. Furthermore, JD Long reviewed the historical challenges of getting R help.

  • I’ve seen very useful posts on RStudio Community forum when browsing the FAQ tag. In particular, I like “Tips for Introducing Non-Programming-Problem Discussions” and this more general list of FAQs.

Good luck with your questions! The Disqus thingy below is good for questions about this post!

To leave a comment for the author, please follow the link and comment on their blog: Posts on Maëlle's R blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

July 21, 2018

R Packages worth a look

Fast Algorithms for Best Subset Selection (L0Learn)
Highly optimized toolkit for (approximately) solving L0-regularized learning problems. The algorithms are based on coordinate descent and local combina …

Simulation of Graphically Constrained Matrices (gmat)
Implementation of the simulation method for Gaussian graphical models described in Córdoba et al. (2018) <arXiv:1807.03090>. The package also pro …

Graph-Based Landscape De-Fragmentation (gDefrag)
Provides a set of tools to help the de-fragmentation process. It works by prioritizing the different sections of linear infrastructures (e.g. roads, po …

Continue Reading…


Read More

Mini-workshop: The Future of Random Projections II, 1pm-4pm, May 2nd, 2018, Paris, France

Florent Krzakala and I are organizing a second mini-workshop on The Future of Random Projections  un-originally titled "The Future of Random Projections II".

As data is getting richer, the need to make sense of it is becoming paramount in many different areas. In this context of large scale learning, Random Projections offer a way to be useful in a variety of unsupervised and supervised learning techniques. In this workshop, we will explore the different use of this transform  from the point of you of several research areas featured in each talk.

We will be streaming it during the event and the video will then be on YouTube. For those of you in Paris, it is going to be on May 2nd, 2018 at IPGG. you can register here whether you are in Paris or not so as to receive information on the link for the streaming. The workshop is hosted by LightOn.

Streaming video:

Here is the four main speakers we will have. The event will start at 1:00PM Paris time and should stop on or before 4:00PM.

1:00 pm - 1:30 pm
Title: "Time for dithering! Quantized random embeddings with RIP random matrices."

Abstract: Quantized compressive sensing (QCS) deals with the problem of coding compressive measurements of low-complexity signals (e.g., sparse vectors in a given basis, low-rank matrices) with quantized, finite precision representations, i.e., a mandatory process involved in any practical sensing model. While the resolution of this quantization clearly impacts the quality of signal reconstruction, there even exist incompatible combinations of quantization functions and sensing matrices that proscribe arbitrarily low reconstruction error when the number of measurements increases.

In this introductory talk, we will see that a large class of random matrix constructions, i.e., known to respect the restricted isometry property (RIP) in the compressive sensing literature, can be made "compatible" with a simple scalar and uniform quantizer (e.g., a rescaled rounding operation). This compatibility is simply ensured by the addition of a uniform random vector, or random "dithering", to the compressive signal measurements before quantization.

In this context, we will first study how quantized, dithered random projections of "low-complexity" signals is actually an efficient dimensionality reduction technique that preserves the distances of low-complexity signals up to some controllable additive and multiplicative distortions. Second, the compatibility of RIP sensing matrices with the dithered quantization process will be demonstrated by the existence of (at least) one signal reconstruction method, the projected back projection (PBP), which achieves low reconstruction error, decaying when the number of measurements increases. Finally, by leveraging the quasi-isometry property reached by quantized, dithered random embeddings, we will show how basic signal classification (or clustering) can be realized from their QCS observations, i.e., without a reconstruction step. Here also the complexity, or intrinsic dimension, of the observed signals drives the final classification accuracy.

1:30pm - 2:00pm  Julien Mairal, Inria Grenoble
Title: Foundations of Deep Learning from a Kernel Point of View.

Abstract: In the past years, deep neural networks such as convolutional or recurrent ones have become highly popular for solving various prediction problems, notably in computer vision and natural language processing. Conceptually close to approaches that were developed several decades ago, they greatly benefit from the large amounts of labeled data that have been available recently, allowing to learn huge numbers of model parameters without worrying too much about overfitting. Before the resurgence of neural networks, non-parametric models based on positive definite kernels were one of the most dominant topics in machine learning. These approaches are still widely used today because of several attractive features. Kernel methods are indeed versatile; as long as a positive definite kernel is specified for the type of data considered—e.g., vectors, sequences, graphs, or sets—a large class of machine learning algorithms originally defined for linear models may be used. Kernel methods also admit natural mechanisms to control the learning capacity and reduce overfitting. In this talk, we will consider both paradigms and show how they are related. We will notably show that the reproducing kernel point of view allows to derive theoretical results for classical convolutional neural networks.

2:00pm - 2:10 pm small break

2:10pm - 2:40pm: Dmitry Ulyanov, Skoltech Institute
Title: Deep Image Prior

Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, superresolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them and to restore images based on flash-no flash input pairs.Apart from its diverse applications, our approach highlights the inductive bias captured by standard generator network architectures. It also bridges the gap between two very popular families of image restoration methods: learning-based methods using deep convolutional networks and learning-free methods based on handcrafted image priors such as self-similarity.

2:40pm - 3:10pm: Kurt Cutajar , EURECOM,
Title “Random Feature Expansions for Deep Gaussian Processes”

Abstract: The widespread application of machine learning in safety-critical domains such as medical diagnosis and autonomous driving has sparked a renewed interest in probabilistic models which produce principled uncertainty estimates alongside predictions. The composition of multiple Gaussian processes as a deep Gaussian process (DGP) enables a deep probabilistic nonparametric approach to flexibly tackle complex machine learning problems with sound quantification of uncertainty. However, traditional inference approaches for DGP models have limited scalability and are notoriously cumbersome to construct. Inspired by recent advances in the field of Bayesian deep learning, in this talk I shall present an alternative formulation of DGPs based on random feature expansions. This yields a practical learning framework which significantly advances the state-of-the-art in inference for DGPs, and enables accurate quantification of uncertainty. The scalability and performance of our proposal is showcased on several datasets with up to 8 million observations, and various DGP architectures with up to 30 hidden layers.

3:10pm - 4:00pm Coffee break.

Credit image: Rich Baraniuk

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Continue Reading…


Read More

SuperDataScience Podcast: Insights from the Founder of KDnuggets

I talk to Kirill Eremenko about my journey to data science, how KDnuggets started, why you should start honing your machine learning engineering skills at this very moment, what's the future of data science, and more.

Continue Reading…


Read More

simmer 4.0.0

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The 4.0.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN under a new license: we decided to switch to GPL >= 2. Most notably in this major release, the C++ core has been refactorised and exposed under inst/include. This is not a big deal for most users, but it enables extensions. As an example of this, simmer.mon is an experimental package that links to simmer and extends its monitoring facilities to provide a new DBI-based backend. Not a very efficient one, but it demonstrates how to extend the simmer core from another package.

Exception handling has been remarkably improved. In previous releases, errors were reported to happen in the run() method, which is… everything that can happen, obviously. In this version, errors are catched and more information is provided, particularly about the simulation time, the arrival and the activity involved:


bad.traj <- trajectory() %>%
  timeout(function() NA)

simmer() %>%
  add_generator("dummy", bad.traj, at(pi)) %>%
## Error: 'dummy0' at 3.14 in 'Timeout':
##  missing value (NA or NaN returned)

Another improvement has to do with attributes. These are commonly used to build incremental indices, but some boilerplate was needed to initialise them. Now this is automatic (and configurable):

index.traj <- trajectory() %>%
  set_global("index", 1, mod="+", init=10)

simmer() %>%
  add_generator("dummy", index.traj, at(1:3), mon=2) %>%
  run() %>%
##   time name   key value replication
## 1    1      index    11           1
## 2    2      index    12           1
## 3    3      index    13           1

Finally, the log_ activity was created for occasional debugging, but we noticed that simmer users use it a lot more to know what is happening when they build models, but so much output is annoying when a model is complete. Therefore, we have implemented simulation-scoped logging levels to be able to turn on and off specific messages on demand:

log.traj <- trajectory() %>%
  log_("This will be always printed") %>% # level=0
  log_("This can be disabled", level=1)

simmer(log_level=1) %>%
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed
## 3.14159: dummy0: This can be disabled
simmer() %>% # log_level=0
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed

See below for a comprehensive list of changes.

New features:

  • The C++ core has been refactorised into a header-only library under inst/include (#147 closing #145). Therefore, from now on it is possible to extend the C++ API from another package by listing simmer under the LinkingTo field in the DESCRIPTION file.
  • New generic monitor constructor enables the development of new monitoring backends in other packages (179f656, as part of #147).
  • New simulation-scoped logging levels. The log_ activity has a new argument level which determines whether the message is printed depending on a global log_level defined in the simmer constructor (#152).
  • set_attribute and set_global gain a new argument to automatically initialise new attributes (#157). Useful to update counters and indexes in a single line, without initialisation boilerplate.

Minor changes and fixes:

  • Enhanced exception handling, with more informative error messages (#148).
  • Refactorisation of the printing methods and associated code (#149).
  • Allow empty trajectories in sources and activities with sub-trajectories (#151 closing #150).
  • Enable -DRCPP_PROTECTED_EVAL (Rcpp >=, which provides fast evaluation of R expressions by leveraging the new stack unwinding protection API (R >= 3.5.0).
  • Replace backspace usage in vector’s ostream method (2b2f43e).
  • Fix namespace clashes with rlang and purrr (#154).

Article originally published in simmer 4.0.0.

To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

SQL Server on VMware Boot Camp at VMworld 2018

I am extremely proud to announce that I am presenting in conjunction with VMware Corp. an all-day boot camp for advanced SQL Server operations, performance tuning, and business continuity strategies at this year’s VMworld 2018 USA in Las Vegas on Sunday, August 26th, at the Mandalay Bay conference center.

Oleg Ulyanov from VMware Corp. and I are teaming up together to present this all-day intensive workshop.

Workshop Abstract: Learn the essential skills necessary to run SQL Server implementations on VMware vSphere. VMware expert Oleg Ulyanov will be joined by Microsoft MVP David Klee from Heraflux to cover best practices and optimal approaches to deployment, operation, and management of SQL Server on VMware vSphere. This technical workshop delivers real-life, instructor-led, live training and incorporates the recommended design and configuration practices for architecting business-critical databases on VMware vSphere infrastructure. Subjects covered in depth will include performance, monitoring , high availability and disaster recovery with failover cluster instances or availability groups, and advanced storage architectures.

Register for this exciting workshop as part of your VMworld conference admission ASAP, as seats are filling up! SQL Server is not an easy application to manage in any environment, so come learn many real-world tips and tricks for making these databases run as fast and smooth as possible.

Continue Reading…


Read More

Science and Technology links (July 21st, 2018)

  1. It seems that something called “Cartesian Genetic Programming” could be a match for Deep Learning.
  2. If you look at the best paid individuals in most fields, most of them are men. Why is that? There are many possible explanations. One of them is that men take more risks. If you hope to ever get larger rewards, it helps to take large risks.

    It is undeniable, I think, that boys take more physical risks than girls. But do men really take more risks than women in their careers?

    Biologists tend to tell us that females are more risk adverse. The usual story is that a female’s ability to reproduce is mostly limited by the availability of food and shelter, whereas a male’s ability to reproduce has far more to do with its ability to impress females, or to impose its will. We can test it out by entering animals in games. For example, Orsini et al. investigated the issue in rats:

    Consistent with findings in human subjects, females were more risk averse, choosing the large, risky reward significantly less than males. This effect was not due to differences in shock reactivity or body weight, and risk-taking in females was not modulated by estrous phase. Systemic amphetamine administration decreased risk-taking in both males and females; however, females exhibited greater sensitivity to amphetamine, suggesting that dopaminergic signaling may partially account for sex differences in risk-taking.

    Social scientists have another, different take. For example, Nelson, an economist, finds the whole question metaphysical:

    The statement “women are more risk averse than men” is fundamentally a metaphysical assertion about unobservable essences or characteristics, and therefore cannot be empirically proven or disproven. A review of the empirical literature, with attention paid to the misleading nature of generic beliefs and statements, the proper interpretation of statistical results, and the quantitative magnitudes of detectable differences and similarities, sheds doubt on whether statements such as these should have any place in an empirical science that aspires to objectivity. The widespread acceptance of such statements appears to perhaps be rooted more in confirmation bias than in reality.

  3. A honey bee has about one million neurons and a billion synapses. The best smartphones have many more transistors than honey bees have synapses.
  4. Do human beings have “general” intelligence? Kevin Kelly, a well-respected writer, does not think so:

    We like to call our human intelligence “general purpose,” because compared with other kinds of minds we have met, it can solve more types of problems, but as we build more and more synthetic minds we’ll come to realize that human thinking is not general at all. It is only one species of thinking.

  5. Our cells are powered by mitochondria (tiny cells living inside our cells). If your mitochondria cease to function well enough, your cells are as good as dead. But could they be revived? It seems so, at least as far as the heart of babies is concerned: fresh mitochondria can revive flagging cells and enable them to quickly recover.
  6. Each time your cells divide, the end of your chromosomes (the telomeres) shortens. Thus a given human cell can only divide so many time. This sounds terrible, but it is less terrible because cells have ways to elongated their telomeres when needed (telomerase)… however, most cells will just divide a fixed number of times. At that point, the cell either dies (apoptosis) or, rarely, becomes senescent. However, the cells don’t literally run out of telomeres. So what happens? According to Ly et al. the problem is structural: the telomeres loop around, hiding the end of the chromosome. When the telomeres are too short to effectively hide the end of the chromosome, it is treated as defective by our biochemistry.
  7. Will we continue to get faster computers? Denning and Lewis think so:

    These analyses show that the conditions exist at all three levels of the computing ecosystem to sustain exponential growth. They support the optimism of many engineers that many additional years of exponential growth are likely. Moore’s Law was sustained for five decades. Exponential growth is likely to be sustained for many more.

  8. There is a huge garbage patch in the Pacific ocean. It is huge: about the size of the state of Texas. A large fraction, nearly half of it, is made of fishing gear.
  9. We making progress against cancer, but it is not enough. In 2018, an estimated 1,735,350 new cases of cancer will be diagnosed in the United States and 609,640 people will die from the disease.
  10. Bill Gates wants to help diagnose Alzheimer’s early. It is believed that by the time you have symptoms of Alzheimer’s, you have already extensive damages. Many researchers believe it could be far easier to prevent these damages than to reverse them.
  11. It is believed that Alzheimer’s might be caused by protein aggregation in the brain. But it is not entirely clear how these proteins causes problems. New research suggests that their proteins induce cellular senescence in the brain. So cell senescence could be a key factor in the emergence of Alzheimers’.

Continue Reading…


Read More

RStudio:addins part 4 – Unit testing coverage investigation and improvement, made easy

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)


A developer always pays his technical debts! And we have a debt to pay to the gods of coding best practices, as we did not present many unit tests for our functions yet. Today we will show how to efficiently investigate and improve unit test coverage for our R code, with focus on functions governing our RStudio addins, which have their own specifics.

As a practical example, we will do a simple resctructuring of one of our functions to increase its test coverage from a mere 34% to over 90%.

The pretty rewards for your tests

The pretty rewards for your tests

Fly-through of unit testing in R

Much has been written on the importance of unit testing, so we will not spend more time on convincing the readers, but rather very quickly provide a few references in case the reader is new to unit testing with R. In the later parts of the article we assume that these basics are known.

In a few words

  • devtools – Makes package development easier by providing R functions that simplify common tasks
  • testthat– Is the most popular unit testing package for R
  • covr– Helps track test coverage for R packages and view reports locally or (optionally) upload the results

For a start guide to use testthat within a package, visit the Testing section of R packages by Hadley Wickham. I would also recommend checking out the showcase on the 2.0.0 release of the testthat itself.

Investigating test coverage within a package

For the purpose of investigating the test coverage of a package we can use the covr package. Within an R project, we can call the package_coverage() function to get a nicely printed high-level overview, or we can provide a specific path to a package root directory and call it as follows:

# This looks much prettier in the R console ;)
## jhaddins Coverage: 59.05%
## R/viewSelection.R: 34.15%
## R/addRoxytag.R: 40.91%
## R/makeCmd.R: 92.86%

For a deeper investigation, converting the results to a data.frame might be very useful. The below shows the count of number of times that given expression was called during the running of our tests for each group of code lines:

covResults <- covr::package_coverage(pkgPath)[, c(1:3, 5, 11)]
##             filename         functions first_line last_line value
## 1     R/addRoxytag.R            roxyfy         10        12     6
## 2     R/addRoxytag.R            roxyfy         11        11     2
## 3     R/addRoxytag.R            roxyfy         13        15     4
## 4     R/addRoxytag.R            roxyfy         14        14     2
## 5     R/addRoxytag.R            roxyfy         16        16     2
## 6     R/addRoxytag.R            roxyfy         17        17     2
## 7     R/addRoxytag.R            roxyfy         18        18     2
## 8     R/addRoxytag.R        addRoxytag         29        29     0
## 9     R/addRoxytag.R        addRoxytag         30        37     0
## 10    R/addRoxytag.R        addRoxytag         32        34     0
## 11    R/addRoxytag.R        addRoxytag         38        38     0
## 12    R/addRoxytag.R    addRoxytagCode         44        44     0
## 13    R/addRoxytag.R    addRoxytagLink         50        50     0
## 14    R/addRoxytag.R     addRoxytagEqn         56        56     0
## 15       R/makeCmd.R           makeCmd         20        24     5
## 16       R/makeCmd.R           makeCmd         21        21     0
## 17       R/makeCmd.R           makeCmd         23        23     5
## 18       R/makeCmd.R           makeCmd         25        27     5
## 19       R/makeCmd.R           makeCmd         26        26     4
## 20       R/makeCmd.R           makeCmd         28        32     5
## 21       R/makeCmd.R           makeCmd         33        35     5
## 22       R/makeCmd.R           makeCmd         34        34     2
## 23       R/makeCmd.R           makeCmd         36        38     5
## 24       R/makeCmd.R           makeCmd         37        37     1
## 25       R/makeCmd.R           makeCmd         39        39     5
## 26       R/makeCmd.R      replaceTilde         48        50     1
## 27       R/makeCmd.R      replaceTilde         49        49     1
## 28       R/makeCmd.R      replaceTilde         51        51     1
## 29       R/makeCmd.R        executeCmd         61        61     5
## 30       R/makeCmd.R        executeCmd         62        66     5
## 31       R/makeCmd.R        executeCmd         68        72     3
## 32       R/makeCmd.R        executeCmd         69        69     0
## 33       R/makeCmd.R        executeCmd         71        71     3
## 34       R/makeCmd.R runCurrentRscript         90        90     1
## 35       R/makeCmd.R runCurrentRscript         91        91     1
## 36       R/makeCmd.R runCurrentRscript         92        96     1
## 37       R/makeCmd.R runCurrentRscript         93        95     1
## 38       R/makeCmd.R runCurrentRscript         94        94     0
## 39 R/viewSelection.R     viewSelection          7         7     0
## 40 R/viewSelection.R     viewSelection          8        12     0
## 41 R/viewSelection.R     viewSelection         10        10     0
## 42 R/viewSelection.R     viewSelection         13        13     0
## 43 R/viewSelection.R  getFromSysframes         24        24     6
## 44 R/viewSelection.R  getFromSysframes         25        25     3
## 45 R/viewSelection.R  getFromSysframes         26        26     3
## 46 R/viewSelection.R  getFromSysframes         28        28     3
## 47 R/viewSelection.R  getFromSysframes         29        29     3
## 48 R/viewSelection.R  getFromSysframes         30        30     3
## 49 R/viewSelection.R  getFromSysframes         31        31    92
## 50 R/viewSelection.R  getFromSysframes         32        32    92
## 51 R/viewSelection.R  getFromSysframes         33        33    92
## 52 R/viewSelection.R  getFromSysframes         34        34     2
## 53 R/viewSelection.R  getFromSysframes         37        37     1
## 54 R/viewSelection.R        viewObject         56        56     3
## 55 R/viewSelection.R        viewObject         57        57     3
## 56 R/viewSelection.R        viewObject         58        58     3
## 57 R/viewSelection.R        viewObject         61        61     0
## 58 R/viewSelection.R        viewObject         64        64     0
## 59 R/viewSelection.R        viewObject         65        65     0
## 60 R/viewSelection.R        viewObject         66        66     0
## 61 R/viewSelection.R        viewObject         69        69     0
## 62 R/viewSelection.R        viewObject         70        70     0
## 63 R/viewSelection.R        viewObject         71        71     0
## 64 R/viewSelection.R        viewObject         74        74     0
## 65 R/viewSelection.R        viewObject         76        76     0
## 66 R/viewSelection.R        viewObject         77        77     0
## 67 R/viewSelection.R        viewObject         79        79     0
## 68 R/viewSelection.R        viewObject         81        81     0
## 69 R/viewSelection.R        viewObject         82        82     0
## 70 R/viewSelection.R        viewObject         83        83     0
## 71 R/viewSelection.R        viewObject         88        88     0
## 72 R/viewSelection.R        viewObject         89        89     0
## 73 R/viewSelection.R        viewObject         91        91     0
## 74 R/viewSelection.R        viewObject         92        92     0
## 75 R/viewSelection.R        viewObject         93        93     0
## 76 R/viewSelection.R        viewObject         96        96     0

Calling covr::zero_coverage with a overage object returned by package_coverage will provide a data.frame with locations that have 0 test coverage. The nice thing about running it within RStudio is that it outputs the results on the Markers tab in RStudio, where we can easily investigate:

zeroCov <- covr::zero_coverage(covResults)
zero_coverage markers

zero_coverage markers

Test coverage for RStudio addin functions

Investigating our code, let us focus on the results for the viewSelection.R, which has a very weak 34% test coverage. We can analyze exactly which lines have no test coverage in a specific file:

zeroCov[zeroCov$filename == "R/viewSelection.R", "line"]
##  [1]  7  8  9 10 11 12 13 61 64 65 66 69 70 71 74 76 77 79 81 82 83 88 89
## [24] 91 92 93 96

Looking at the code, we can see that the first chuck of lines – 7:13 represent the viewSelection function, which just calls lapply and invisibly returns NULL.
The main weak spot however is the function viewObject, out of which we only test the early return in case of invalid chr argument provided. None of the other functionality is tested.

The reason behind this is that when running the tests, RStudio functionality is not available and therefore we would not be able to test even the not-so-well designed return values, as they are almost always preceded by a call to rstudioapi or other RStudio-related functionality such as the object viewer, because that is what they are designed to do. This means we must restructure the code in such a way that we contain the RStudio-dependent functionality to a necessary minimum, keeping a big majority of the code testable – only calling the side-effecting rstudioapi when actually executing the addin functionality itself.

Rewriting an addin function for better coverage

We will now show one potential way to solve this issue for the particular case of our viewObject function.

The idea behind the solution is to only return the arguments for the call to the RStudio API related functionality, instead of executing them in the function itself – hence the rename to getViewArgs.

This way we can test the function’s return value against the expected arguments and only execute them with in the addin execution wrapper itself. A picture may be worth a thousand words, so here is the diff with relevant changes:

Refactoring for testability

Refactoring for testability

Testing the rewritten function and gained coverage

Now that our return values are testable across the entire getViewArgs function, we can easily write tests to cover the entire function, a couple examples:

test_that("getViewArgs for function"
        , expect_equal(
          , list(what = "View", args = list(x = reshape, title = "reshape"))
test_that("getViewArgs for data.frame"
        , expect_equal(
          , list(what = "View",
                 args = list(x = data.frame(
                     height = c(58, 59, 60, 61, 62, 63, 64, 65,
                                66, 67, 68, 69, 70, 71, 72),
                     weight = c(115, 117, 120, 123, 126, 129, 132, 135,
                                139, 142, 146, 150, 154, 159, 164)
                   title = "datasets::women"

Looking at the test coverage provided after our changes, we can see that we are at more than 90% percent coverage for viewSelection.R:

# This looks much prettier in the R console ;)
covResults <- covr::package_coverage(pkgPath)
## jhaddins Coverage: 82.05%
## R/addRoxytag.R: 40.91%
## R/viewSelection.R: 90.57%
## R/makeCmd.R: 92.86%

And looking at the lines that not covered for viewSelection.R, we can indeed see that the only uncovered lines left are in fact those with the viewSelection function, which is responsible only for executing the addin itself:

covResults <-
covResults[covResults$filename == "R/viewSelection.R" &
             covResults$value == 0, c(1:3, 5, 11)]
##             filename     functions first_line last_line value
## 39 R/viewSelection.R viewSelection         13        17     0
## 40 R/viewSelection.R viewSelection         15        15     0
## 41 R/viewSelection.R viewSelection         16        16     0

In the ideal world we would of course want to also automate the testing of our addin execution itself by examining if their effects in the RStudio IDE are as expected, however this is far beyond the scope of this post. For some of our addin functionality we can however even directly test the side-effects, such as when the addin should produce a file with certain content.

TL;DR – Just give me the package


Did you find the article helpful or interesting? Help others find it by sharing

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

“A Headline That Will Make Global-Warming Activists Apoplectic”

I saw this article in the newspaper today, “2017 Was One of the Hottest Years on Record. And That Was Without El Niño,” subtitled, “The world in 2017 saw some of the highest average surface temperatures ever recorded, surprising scientists who had expected sharper retreat from recent record years,” and accompanied by the above graph, and this reminded me of something.

A few years ago there was a cottage industry among some contrarian journalists, making use of the fact that 1998 was a particularly hot year (by the standards of its period) to cast doubt on the global warming trend. Ummmm, where did I see this? . . . Here, I found it! It was a post by Stephen Dubner on the Freakonomics blog, entitled, “A Headline That Will Make Global-Warming Activists Apoplectic,” and continuing:

The BBC is responsible. The article, by the climate correspondent Paul Hudson, is called “What Happened to Global Warming?” Highlights:

For the last 11 years we have not observed any increase in global temperatures. And our climate models did not forecast it, even though man-made carbon dioxide, the gas thought to be responsible for warming our planet, has continued to rise. So what on Earth is going on?


According to research conducted by Professor Don Easterbrook from Western Washington University last November, the oceans and global temperatures are correlated. . . . Professor Easterbrook says: “The PDO cool mode has replaced the warm mode in the Pacific Ocean, virtually assuring us of about 30 years of global cooling.”

Let the shouting begin. Will Paul Hudson be drummed out of the circle of environmental journalists? Look what happened here, when Al Gore was challenged by a particularly feisty questioner at a conference of environmental journalists.

We have a chapter in SuperFreakonomics about global warming and it too will likely produce a lot of shouting, name-calling, and accusations ranging from idiocy to venality. It is curious that the global-warming arena is so rife with shrillness and ridicule. Where does this shrillness come from? . . .

No shrillness here. Professor Don Easterbrook from Western Washington University seems to have screwed up his calculations somewhere, but that happens. And Dubner did not make this claim himself; he merely featured a news article that featured this particular guy and treated him like an expert. Actually, Dubner and his co-author Levitt also wrote, “we believe that rising global temperatures are a man-made phenomenon and that global warming is an important issue to solve,” so I could never quite figure out in their blog he was highlighting an obscure scientist who was claiming that we were virtually assured of 30 years of cooling.

Anyway, we all make mistakes; what’s important is to learn from them. I hope Dubner and his Freaknomics colleagues learn from this particular prediction that went awry. Remember, back in 2009 when Dubner was writing about “A Headline That Will Make Global-Warming Activists Apoplectic,” and Don Easterbrook was “virtually assuring us of about 30 years of global cooling,” the actual climate-science experts were telling us that things would be getting hotter. The experts were pointing out that oft-repeated claims such as “For the last 11 years we have not observed any increase in global temperatures . . .” were pivoting off the single data point of 1998, but Dubner and Levitt didn’t want to hear it. Fiddling while the planet burns, one might say.

It’s not that the experts are always right, but it can make sense to listen to their reasoning instead of going on about apoplectic activists, feisty questioners, and shrillness.

The post “A Headline That Will Make Global-Warming Activists Apoplectic” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Nerd childhood piece: how I decided to become a mathematician

I’ve got a new Bloomberg View column out about how I decided to become a mathematician:

How Dominoes Helped Make Me a Mathematician

Proofs can be empowering

My other Bloomberg columns are listed here.

Continue Reading…


Read More

Book Memo: “Mathematics of Big Data”

Spreadsheets, Databases, Matrices, and Graphs
The first book to present the common mathematical foundations of big data analysis across a range of applications and technologies.Today, the volume, velocity, and variety of data are increasing rapidly across a range of fields, including Internet search, healthcare, finance, social media, wireless devices, and cybersecurity. Indeed, these data are growing at a rate beyond our capacity to analyze them. The tools-including spreadsheets, databases, matrices, and graphs-developed to address this challenge all reflect the need to store and operate on data as whole sets rather than as individual elements. This book presents the common mathematical foundations of these data sets that apply across many applications and technologies. Associative arrays unify and simplify data, allowing readers to look past the differences among the various tools and leverage their mathematical similarities in order to solve the hardest big data challenges.The book first introduces the concept of the associative array in practical terms, presents the associative array manipulation system D4M (Dynamic Distributed Dimensional Data Model), and describes the application of associative arrays to graph analysis and machine learning. It provides a mathematically rigorous definition of associative arrays and describes the properties of associative arrays that arise from this definition. Finally, the book shows how concepts of linearity can be extended to encompass associative arrays. Mathematics of Big Data can be used as a textbook or reference by engineers, scientists, mathematicians, computer scientists, and software engineers who analyze big data.

Continue Reading…


Read More

R Packages worth a look

Adaptive Sparsity Models (AdaptiveSparsity)
Implements Figueiredo EM algorithm for adaptive sparsity (Jeffreys prior) (see Figueiredo, M.A.T.; , ‘Adaptive sparseness for supervised learning,’ Pat …

Greedy Experimental Design Construction (GreedyExperimentalDesign)
Computes experimental designs for a two-arm experiment with covariates by greedily optimizing a balance objective function. This optimization provides …

FIS ‘MarketMap C-Toolkit’ (rhli)
Complete access from ‘R’ to the FIS ‘MarketMap C-Toolkit’ (‘FAME C-HLI’). ‘FAME’ is a fully integrated software and database management system from FIS …

Continue Reading…


Read More

MiKTeX Behind a Windows Firewall

(This article was first published on TeachR, and kindly contributed to R-bloggers)

I’ve always had problems with MiKTeX on my work computer.
I can install it just fine, or get IT to install it, but then the package manager doesn’t work because of our firewall.
You can set up a local repository to get around this problem, and I will show you how.
I’m just doing a basic setup here, just enough to compile the RStudio Rmd template to PDF.

“Why do I need MiKTeX?”, you might ask.
Well, because if you want to create a PDF from an RMarkdown file in RStudio it is required.
Otherwise you get this polite error message.

RStudio: No pdflatex error message

I downloaded the 32-bit basic version from

MiKTeX Basic 32-bit Download

You’ll probably just want to pick the default options for the install EXCEPT for
the question that asks whether you want MiKTeX to install missing packages automatically.
I’m saying “no” to that because that is where the firewall gets in the way.


MiKTeX will update your PATH system variable, but to get RStudio to recognize that it has changed you will need to restart it.
Trying to knit the RStudio Rmd template to PDF right after install gives a new error, that it cannot find the file “url.sty”.
Actually, if you had done the default install, MiKTeX would have tried to install the needed package automatically.
But, this would have just failed if the needed port is not open on your firewall.

One way to get the packages you need is to create a local package repo and install the packages “manually”.

I’ve created a github repo with a minimal set of LaTeX packages to get this working,
So, you can clone it straight, download a ZIP, or fork and clone. Up to you.
I’m going to assume you know how to get these files from github and put them somewhere inside your firewall.
I’ve put mine in a folder called c:\home\git\Other\MiKTeX_pkgs.

Next, open the MiKTeX package manager on your computer.
It was just installed.
Select change package repository from the repository menu.
It will then ask you what kind of repo you want.

You want a local repo

On the next screen browse to the folder where you cloned the git repo, and point to the “repo-local” subdirectory.

You can send me an email if I've lost you

MiKTeX will take forever to synchronize the package database, but once it is done, you can select the packages you want installed.
Unfortunately, I have not been able to figure out how to create an index file that just has the downloaded packages in it.
So, the package manager will still list all of the packages available at the CTAN mirror.
I know, it sucks.
Then you can run the batch file scripts\100-mpm-install.bat as an admin and it will install the packages for you.

If you decide you want to add more packages to your local repo, just put the file name in the DOWNLOAD file
and re-run the script ./scripts/050-download-locals.bat and then scripts\100-mpm-install.bat.

Let me know if this helps or if it doesn’t work for you.

To leave a comment for the author, please follow the link and comment on their blog: TeachR. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Whats new on arXiv

Latent Dirichlet Allocation (LDA) for Topic Modeling of the CFPB Consumer Complaints

A text mining approach is proposed based on latent Dirichlet allocation (LDA) to analyze the Consumer Financial Protection Bureau (CFPB) consumer complaints. The proposed approach aims to extract latent topics in the CFPB complaint narratives, and explores their associated trends over time. The time trends will then be used to evaluate the effectiveness of the CFPB regulations and expectations on financial institutions in creating a consumer oriented culture that treats consumers fairly and prioritizes consumer protection in their decision making processes. The proposed approach can be easily operationalized as a decision support system to automate detection of emerging topics in consumer complaints. Hence, the technology-human partnership between the proposed approach and the CFPB team could certainly improve consumer protections from unfair, deceptive or abusive practices in the financial markets by providing more efficient and effective investigations of consumer complaint narratives.

Anomaly Detection for Water Treatment System based on Neural Network with Automatic Architecture Optimization

We continue to develop our neural network (NN) based forecasting approach to anomaly detection (AD) using the Secure Water Treatment (SWaT) industrial control system (ICS) testbed dataset. We propose genetic algorithms (GA) to find the best NN architecture for a given dataset, using the NAB metric to assess the quality of different architectures. The drawbacks of the F1-metric are analyzed. Several techniques are proposed to improve the quality of AD: exponentially weighted smoothing, mean p-powered error measure, individual error weight for each variable, disjoint prediction windows. Based on the techniques used, an approach to anomaly interpretation is introduced.

Preventing Poisoning Attacks on AI based Threat Intelligence Systems

As AI systems become more ubiquitous, securing them becomes an emerging challenge. Over the years, with the surge in online social media use and the data available for analysis, AI systems have been built to extract, represent and use this information. The credibility of this information extracted from open sources, however, can often be questionable. Malicious or incorrect information can cause a loss of money, reputation, and resources; and in certain situations, pose a threat to human life. In this paper, we use an ensembled semi-supervised approach to determine the credibility of Reddit posts by estimating their reputation score to ensure the validity of information ingested by AI systems. We demonstrate our approach in the cybersecurity domain, where security analysts utilize these systems to determine possible threats by analyzing the data scattered on social media websites, forums, blogs, etc.

Statistical Model Compression for Small-Footprint Natural Language Understanding

In this paper we investigate statistical model compression applied to natural language understanding (NLU) models. Small-footprint NLU models are important for enabling offline systems on hardware restricted devices, and for decreasing on-demand model loading latency in cloud-based systems. To compress NLU models, we present two main techniques, parameter quantization and perfect feature hashing. These techniques are complementary to existing model pruning strategies such as L1 regularization. We performed experiments on a large scale NLU system. The results show that our approach achieves 14-fold reduction in memory usage compared to the original models with minimal predictive performance impact.

Analyzing Hypersensitive AI: Instability in Corporate-Scale Machine Learning

Predictive geometric models deliver excellent results for many Machine Learning use cases. Despite their undoubted performance, neural predictive algorithms can show unexpected degrees of instability and variance, particularly when applied to large datasets. We present an approach to measure changes in geometric models with respect to both output consistency and topological stability. Considering the example of a recommender system using word2vec, we analyze the influence of single data points, approximation methods and parameter settings. Our findings can help to stabilize models where needed and to detect differences in informational value of data points on a large scale.

Semantic Parsing: Syntactic assurance to target sentence using LSTM Encoder CFG-Decoder

Semantic parsing can be defined as the process of mapping natural language sentences into a machine interpretable, formal representation of its meaning. Semantic parsing using LSTM encoder-decoder neural networks have become promising approach. However, human automated translation of natural language does not provide grammaticality guarantees for the sentences generate such a guarantee is particularly important for practical cases where a data base query can cause critical errors if the sentence is ungrammatical. In this work, we propose an neural architecture called Encoder CFG-Decoder, whose output conforms to a given context-free grammar. Results are show for any implementation of such architecture display its correctness and providing benchmark accuracy levels better than the literature.

Linear Programming Approximations for Index Coding

Index coding, a source coding problem over broadcast channels, has been a subject of both theoretical and practical interest since its introduction (by Birk and Kol, 1998). In short, the problem can be defined as follows: there is an input \textbf{x} \triangleq (\textbf{x}_1, \dots, \textbf{x}_n), a set of n clients who each desire a single symbol \textbf{x}_i of the input, and a broadcaster whose goal is to send as few messages as possible to all clients so that each one can recover its desired symbol. Additionally, each client has some predetermined ‘side information,’ corresponding to certain symbols of the input \textbf{x}, which we represent as the ‘side information graph’ \mathcal{G}. The graph \mathcal{G} has a vertex v_i for each client and a directed edge (v_i, v_j) indicating that client i knows the jth symbol of the input. Given a fixed side information graph \mathcal{G}, we are interested in determining or approximating the ‘broadcast rate’ of index coding on the graph, i.e. the fewest number of messages the broadcaster can transmit so that every client gets their desired information. Using index coding schemes based on linear programs (LPs), we take a two-pronged approach to approximating the broadcast rate. First, extending earlier work on planar graphs, we focus on approximating the broadcast rate for special graph families such as graphs with small chromatic number and disk graphs. In certain cases, we are able to show that simple LP-based schemes give constant-factor approximations of the broadcast rate, which seem extremely difficult to obtain in the general case. Second, we provide several LP-based schemes for the general case which are not constant-factor approximations, but which strictly improve on the prior best-known schemes.

A Projection Pursuit Forest Algorithm for Supervised Classification

This paper presents a new ensemble learning method for classification problems called projection pursuit random forest (PPF). PPF uses the PPtree algorithm introduced in Lee et al. (2013). In PPF, trees are constructed by splitting on linear combinations of randomly chosen variables. Projection pursuit is used to choose a projection of the variables that best separates the classes. Utilizing linear combinations of variables to separate classes takes the correlation between variables into account which allows PPF to outperform a traditional random forest when separations between groups occurs in combinations of variables. The method presented here can be used in multi-class problems and is implemented into an R (R Core Team, 2018) package, PPforest, which is available on CRAN, with development versions at https://…/PPforest.

Imparting Interpretability to Word Embeddings

As an ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words but the vector corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute, interpretable meaning. We introduce an additive modification to the objective function of the embedding learning algorithm that encourages the embedding vectors of words that are semantically related a predefined concept to take larger values along a specified dimension, while leaving the original semantic learning mechanism mostly unaffected. In other words, we align words that are already determined to be related, along predefined concepts. Therefore, we impart interpretability to the word embedding by assigning meaning to its vector dimensions. The predefined concepts are derived from an external lexical resource, which in this paper is chosen as Roget’s Thesaurus. We observe that alignment along the chosen concepts is not limited to words in the Thesaurus and extends to other related words as well. We quantify the extent of interpretability and assignment of meaning from our experimental results. We also demonstrate the preservation of semantic coherence of the resulting vector space by using word-analogy and word-similarity tests. These tests show that the interpretability-imparted word embeddings that are obtained by the proposed framework do not sacrifice performances in common benchmark tests.

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

In this work, we propose an alternative solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a novel regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we propose the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al., 2017). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.

Bounded Information Rate Variational Autoencoders

This paper introduces a new member of the family of Variational Autoencoders (VAE) that constrains the rate of information transferred by the latent layer. The latent layer is interpreted as a communication channel, the information rate of which is bound by imposing a pre-set signal-to-noise ratio. The new constraint subsumes the mutual information between the input and latent variables, combining naturally with the likelihood objective of the observed data as used in a conventional VAE. The resulting Bounded-Information-Rate Variational Autoencoder (BIR-VAE) provides a meaningful latent representation with an information resolution that can be specified directly in bits by the system designer. The rate constraint can be used to prevent overtraining, and the method naturally facilitates quantisation of the latent variables at the set rate. Our experiments confirm that the BIR-VAE has a meaningful latent representation and that its performance is at least as good as state-of-the-art competing algorithms, but with lower computational complexity.

Attend and Rectify: a Gated Attention Mechanism for Fine-Grained Recovery

We propose a novel attention mechanism to enhance Convolutional Neural Networks for fine-grained recognition. It learns to attend to lower-level feature activations without requiring part annotations and uses these activations to update and rectify the output likelihood distribution. In contrast to other approaches, the proposed mechanism is modular, architecture-independent and efficient both in terms of parameters and computation required. Experiments show that networks augmented with our approach systematically improve their classification accuracy and become more robust to clutter. As a result, Wide Residual Networks augmented with our proposal surpasses the state of the art classification accuracies in CIFAR-10, the Adience gender recognition task, Stanford dogs, and UEC Food-100.

FuzzerGym: A Competitive Framework for Fuzzing and Learning

Fuzzing is a commonly used technique designed to test software by automatically crafting program inputs. Currently, the most successful fuzzing algorithms emphasize simple, low-overhead strategies with the ability to efficiently monitor program state during execution. Through compile-time instrumentation, these approaches have access to numerous aspects of program state including coverage, data flow, and heterogeneous fault detection and classification. However, existing approaches utilize blind random mutation strategies when generating test inputs. We present a different approach that uses this state information to optimize mutation operators using reinforcement learning (RL). By integrating OpenAI Gym with libFuzzer we are able to simultaneously leverage advancements in reinforcement learning as well as fuzzing to achieve deeper coverage across several varied benchmarks. Our technique connects the rich, efficient program monitors provided by LLVM Santizers with a deep neural net to learn mutation selection strategies directly from the input data. The cross-language, asynchronous architecture we developed enables us to apply any OpenAI Gym compatible deep reinforcement learning algorithm to any fuzzing problem with minimal slowdown.

Understanding and Improving Interpolation in Autoencoders via an Adversarial Regularizer

Autoencoders provide a powerful framework for learning compressed representations by encoding all of the information needed to reconstruct a data point in a latent code. In some cases, autoencoders can ‘interpolate’: By decoding the convex combination of the latent codes for two datapoints, the autoencoder can produce an output which semantically mixes characteristics from the datapoints. In this paper, we propose a regularization procedure which encourages interpolated outputs to appear more realistic by fooling a critic network which has been trained to recover the mixing coefficient from interpolated data. We then develop a simple benchmark task where we can quantitatively measure the extent to which various autoencoders can interpolate and show that our regularizer dramatically improves interpolation in this setting. We also demonstrate empirically that our regularizer produces latent codes which are more effective on downstream tasks, suggesting a possible link between interpolation abilities and learning useful representations.

Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks

Systematic compositionality is the ability to recombine meaningful units with regular and predictable outcomes, and it’s seen as key to humans’ capacity for generalization in language. Recent work has studied systematic compositionality in modern seq2seq models using generalization to novel navigation instructions in a grounded environment as a probing tool, requiring models to quickly bootstrap the meaning of new words. We extend this framework here to settings where the model needs only to recombine well-trained functional words (such as ‘around’ and ‘right’) in novel contexts. Our findings confirm and strengthen the earlier ones: seq2seq models can be impressively good at generalizing to novel combinations of previously-seen input, but only when they receive extensive training on the specific pattern to be generalized (e.g., generalizing from many examples of ‘X around right’ to ‘jump around right’), while failing when generalization requires novel application of compositional rules (e.g., inferring the meaning of ‘around right’ from those of ‘right’ and ‘around’).

A Hand-Held Multimedia Translation and Interpretation System with Application to Diet Management
Minimizing convex quadratic with variable precision Krylov methods
Guess who Multilingual approach for the automated generation of author-stylized poetry
Clinical Text Classification with Rule-based Features and Knowledge-guided Convolutional Neural Networks
Signal Alignment for Humanoid Skeletons via the Globally Optimal Reparameterization Algorithm
Real-Time Stereo Vision for Road Surface 3-D Reconstruction
Eigenspace-Based Minimum Variance Combined with Delay Multiply and Sum Beamformer: Application to Linear-Array Photoacoustic Imaging
High-Mobility Wideband Massive MIMO Communications: Doppler Compensation, Analysis and Scaling Law
A Fixed-Parameter Linear-Time Algorithm to Compute Principal Typings of Planar Flow Networks
Entanglement Transitions from Holographic Random Tensor Networks
Universal Scaling Theory of the Boundary Geometric Tensor in Disordered Metals
Ricci curvature for parametric statistics via optimal transport
Comparative study of Discrete Wavelet Transforms and Wavelet Tensor Train decomposition to feature extraction of FTIR data of medicinal plants
Weakly Monotone Fock Space and Monotone Convolution of the Wigner Law
NIP omega-categorical structures: the rank 1 case
Hierarchical Multi Task Learning With CTC
Real-time digital signal recovery for a low-pass transfer function system with multiple complex poles
The trinomial transform triangle
The classification of homogeneous finite-dimensional permutation structures
Continuous approximation of $(M_t,M_t, 1)$ distributions with application to production
A Holistic Approach to Forecasting Wholesale Energy Market Prices
Reconstructing Latent Orderings by Spectral Clustering
Datamining a medieval medical text reveals patterns in ingredient choice that reflect biological activity against the causative agents of specified infections
Distributed Second-order Convex Optimization
A Scalable MCEM Estimator for Spatio-Temporal Autoregressive Models
Representational efficiency outweighs action efficiency in human program induction
Fast and Deterministic Approximations for $k$-Cut
CT Image Enhancement Using Stacked Generative Adversarial Networks and Transfer Learning for Lesion Segmentation Improvement
Minimum distance computation of linear codes via genetic algorithms with permutation encoding
Take a Look Around: Using Street View and Satellite Images to Estimate House Prices
Approximation Schemes for Low-Rank Binary Matrix Approximation Problems
What kind of content are you prone to tweet Multi-topic Preference Model for Tweeters
Once reinforced random walk on $\mathbb{Z}\times Γ$
How Consumer Empathy Assist Power Grid in Demand Response
Automatic Identification of Ineffective Online Student Questions in Computing Education
A $φ$-Competitive Algorithm for Scheduling Packets with Deadlines
Efficient Power Flow Management and Peak Shaving in a Microgrid-PV System
A Novel Scheme for Support Identification and Iterative Sampling of Bandlimited Graph Signals
Tomlinson-Harashima Precoded Rate-Splitting for Multiuser MIMO Systems
Evaluating Word Embeddings in Multi-label Classification Using Fine-grained Name Typing
Efficient Training on Very Large Corpora via Gramian Estimation
A Tale of Santa Claus, Hypergraphs and Matroids
Tracking Sparse mmWave Channel: Performance Analysis under Intra-Cluster Angular Spread
Is the SIC Outcome There When Nobody Looks
Achievable Rate maximization by Passive Intelligent Mirrors
Asymptotically Optimal Estimation Algorithm for the Sparse Signal with Arbitrary Distributions
Performance, Power, and Area Design Trade-offs in Millimeter-Wave Transmitter Beamforming Architectures
Few-Shot Adaptation for Multimedia Semantic Indexing
Negative Imaginary State Feedback Control with a Prescribed Degree of Stability
Coexistence of scale invariant and rhythmic behavior in self-organized criticality
A Machine Learning Approach for Detecting Students at Risk of Low Academic Achievement
Isolating effects of age with fair representation learning when assessing dementia
Disorder-robust entanglement transport
Efficient Sampling of Bandlimited Graph Signals
Exponential Stabilization for Ito Stochastic Systems with Multiple Input Delays
Monocular Object Orientation Estimation using Riemannian Regression and Classification Networks
Convex Relaxations in Power System Optimization: A Brief Introduction
Stability of generalized Petersen graphs
UAV-Based in-band Integrated Access and Backhaul for 5G Communications
Cooperative Adaptive Cruise Control for Connected Autonomous Vehicles by Factoring Communication-Related Constraints
Optimal estimation of Gaussian mixtures via denoised method of moments
Limiting spectral distribution of the product of truncated Haar unitary matrices
ArticulatedFusion: Real-time Reconstruction of Motion, Geometry and Segmentation Using a Single Depth Camera
Chest X-rays Classification: A Multi-Label and Fine-Grained Problem
Normalization of ternary generalized pseudostandard words
Ricci-flat graphs with girth four
Towards Explainable and Controllable Open Domain Dialogue Generation with Dialogue Acts
Visual Domain Adaptation with Manifold Embedded Distribution Alignment
Machine Learning Based Featureless Signalling
A hybrid algorithm for the two-trust-region subproblem
On the modular Erdös-Burgess constant
Simple robust genomic prediction and outlier detection for a multi-environmental field trial
Searching for network modules
In pixels we trust: From Pixel Labeling to Object Localization and Scene Categorization
Label Aggregation via Finding Consensus Between Models
Deep Sequential Multi-camera Feature Fusion for Person Re-identification
Mr. DLib’s Living Lab for Scholarly Recommendations
SPDEs with Space-Mean Dynamics
Deep Adaptive Proposal Network for Object Detection in Optical Remote Sensing Images
Quantifying Volatility Reduction in German Day-ahead Spot Market in the Period 2006 through 2016
Sequence to Logic with Copy and Cache
On the Phase Tracking Reference Signal (PT-RS) Design for 5G New Radio (NR)
QoS and Coverage Aware Dynamic High Density Vehicle Platooning (HDVP)
Birkhoff-von Neumann Graphs that are PM-compact
Automated Phenotyping of Epicuticular Waxes of Grapevine Berries Using Light Separation and Convolutional Neural Networks
Indexing Execution Patterns in Workflow Provenance Graphs through Generalized Trie Structures
Generative Adversarial Networks for MR-CT Deformable Image Registration
Can We Assess Mental Health through Social Media and Smart Devices Addressing Bias in Methodology and Evaluation
MITK-ModelFit: generic open-source framework for model fits and their exploration in medical imaging – design, implementation and application on the example of DCE-MRI
Test-time augmentation with uncertainty estimation for deep learning-based medical image segmentation
Stochastic Quantization for the Edwards Measure of Fractional Brownian Motion with $Hd=1$
Green function of a random walk in a cone
Speeding up the Hyperparameter Optimization of Deep Convolutional Neural Networks
Revisiting Cross Modal Retrieval
On some special classes of contact $B_0$-VPG graphs
An entropy generation formula on $RCD(K,\infty)$ spaces
Fuzzy quantification for linguistic data analysis and data mining
ISIC 2018-A Method for Lesion Segmentation
Localization of disordered harmonic chain with long-range correlation
Image Reconstruction via Variational Network for Real-Time Hand-Held Sound-Speed Imaging
Delay and Communication Tradeoffs for Blockchain Systems with Lightweight IoT Clients
Modeling Visual Context is Key to Augmenting Object Detection Datasets
Semi-Dense 3D Reconstruction with a Stereo Event Camera
Selective Zero-Shot Classification with Augmented Attributes
On the almost-principal minors of a symmetric matrix
Can Artificial Intelligence Reliably Report Chest X-Rays : Radiologist Validation of an Algorithm trained on 1.2 Million X-Rays
On the Sweep Map for Fuss Rational Dyck Paths
Two algorithms for a fully coupled and consistently macroscopic PDE-ODE system modeling a moving bottleneck on a road
Conditional Random Fields as Recurrent Neural Networks for 3D Medical Imaging Segmentation
Stochastic Model Predictive Control with Discounted Probabilistic Constraints
Guided Upsampling Network for Real-Time Semantic Segmentation
Three for one and one for three: Flow, Segmentation, and Surface Normals
Prophet Secretary Through Blind Strategies
Robust Oil-spill Forensics and Petroleum Source Differentiation using Quantized Peak Topography Maps
A Microservice-enabled Architecture for Smart Surveillance using Blockchain Technology
Edge colourings and topological graph polynomials
Improving Simple Models with Confidence Profiles
Finding Minimum Volume Circumscribing Ellipsoids Using Copositive Programming
A Strategy of MR Brain Tissue Images’ Suggestive Annotation Based on Modified U-Net
Harmonic functions on mated-CRT maps
Hybrid scene Compression for Visual Localization
An invariance principle for ergodic scale-free random environments
Exact Algorithms for Finding Well-Connected 2-Clubs in Real-World Graphs: Theory and Experiments
Using Deep Neural Networks to Translate Multi-lingual Threat Intelligence
Exact asymptotics for Duarte and supercritical rooted kinetically constrained models
Bio-Measurements Estimation and Support in Knee Recovery through Machine Learning
Emulating malware authors for proactive protection using GANs over a distributed image visualization of the dynamic file behavior
Optimal Las Vegas Approximate Near Neighbors in $\ell_p$
Self-Organizing Maps as a Storage and Transfer Mechanism in Reinforcement Learning
Limited Memory Kelley’s Method Converges for Composite Convex and Submodular Objectives
Attention-Guided Curriculum Learning for Weakly Supervised Classification and Localization of Thoracic Diseases on Chest Radiographs
Positional Value in Soccer: Expected League Points Added above Replacement
An expansion formula for type A and Kronecker quantum cluster algebras
A unified theory of adaptive stochastic gradient descent as Bayesian filtering
Partial recovery bounds for clustering with the relaxed $K$means
Realization Spaces of Uniform Phased Matroids
A geometric integration approach to nonsmooth, nonconvex optimisation
Transfer Learning for Action Unit Recognition
Capsule Networks against Medical Imaging Data Challenges
Compositional GAN: Learning Conditional Image Composition
Nested Covariance Determinants and Restricted Trek Separation in Gaussian Graphical Models
A linear-time algorithm for generalized trust region problems

Continue Reading…


Read More

Magister Dixit

“It’s not enough to tell someone, ‘This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.’ As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback. … It really comes back to this big divide, this bottleneck, between the domain expert and the machine learning expert. I saw that as the most challenging problem facing us when we try to really make machine learning widely applied in the world. I saw both machine learning experts and domain experts as being difficult to scale up. There’s only a few of each kind of expert produced every year. I thought, how can I scale up machine learning expertise? I thought the best thing that I could do is to build software that doesn’t take a machine learning expert to use, so that the domain experts can use them to build their own applications. That’s what prompted me to do research in automating machine learning while at MSR [Microsoft Research].” Alice Zheng ( 2015 )

Continue Reading…


Read More

R.devices – Into the Void

(This article was first published on JottR on R, and kindly contributed to R-bloggers)

R.devices 2.16.0 – Unified Handling of Graphics Devices – is on CRAN. With this release, you can now easily suppress unwanted graphics, e.g. graphics produced by one of those do-everything-in-one-call functions that we all bump into once in a while. To suppress graphics, the R.devices package provides graphics device nulldev(), and function suppressGraphics(), which both send any produced graphics into the void. This works on all operating systems, including Windows.

Guillaume Nery base jumping at Dean’s Blue Hole, filmed on breath hold by Julie Gautier


plot(1:100, main = "Some Ignored Graphics")
  plot(1:100, main = "Some Ignored Graphics")

Other Features

Some other reasons for using the R.devices package:

  • No need to call – Did you ever forgot to call, or did a function call produce an error causing not to be reached, leaving a graphics device open? By using one of the toPDF(), toPNG(), … functions, or the more general devEval() function, is automatically taken care of.

  • No need to specify filename extension – Did you ever switch from using png() to, say, pdf(), and forgot to update the filename resulting in a my_plot.png file that is actually a PDF file? By using one of the toPDF(), toPNG(), … functions, or the more general devEval() function, filename extensions are automatically taken care of – just specify the part without the extension.

  • Specify the aspect ratio – rather than having to manually calculate device-specific arguments width or height, e.g. toPNG("my_plot", { plot(1:10) }, aspectRatio = 2/3). This is particularly useful when switching between device types, or when outputting to multiple ones at the same time.

  • Unified API for graphics options – conveniently set (most) graphics options including those that can otherwise only be controlled via arguments, e.g. devOptions("png", width = 1024).

  • Control where figure files are saved – the default is folder figures/ but can be set per device type or globally, e.g. devOptions("*", path = "figures/col/").

  • Easily produce EPS and faviconstoEPS() and toFavicon() are friendly wrappers for producing EPS and favicon graphics.

  • Capture and replay graphics – for instance, use future::plan(remote, workers = ""); p %<-% capturePlot({ plot(1:10) }) to produce graphics on a remote machine, and then display it locally by printing p.

Some more examples

R.devices::toPDF("my_plot", {
  plot(1:100, main = "Amazing Graphics")
### [1] "figures/my_plot.pdf"
R.devices::toPNG("my_plot", {
  plot(1:100, main = "Amazing Graphics")
### [1] "figures/my_plot.png"
R.devices::toEPS("my_plot", {
  plot(1:100, main = "Amazing Graphics")
### [1] "figures/my_plot.eps"
R.devices::devEval(c("png", "pdf", "eps"), name = "my_plot", {
  plot(1:100, main = "Amazing Graphics")
}, aspectRatio = 1.3)
### $png
### [1] "figures/my_plot.png"
### $pdf
### [1] "figures/my_plot.pdf"
### $eps
### [1] "figures/my_plot.eps"

See also

To leave a comment for the author, please follow the link and comment on their blog: JottR on R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Regional population structures at a glance

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)


I am happy to announce that our paper is published today in The Lancet.

Kashnitsky, I., & Schöley, J. (2018). Regional population structures at a glance. The Lancet, 392(10143), 209–210.

At a glance

Demographic history of a population is imprinted in its age structure. A meaningful representation of regional population age structures can tell numerous demographic stories – at a glance. To produce such a snapshot of regional populations, we use an innovative approach of ternary colour coding.

Here is the map:


We let the data speak colours

With ternary colour coding, each element of a three-dimensional array of compositional data is represented with a unique colour. The resulting colours show direction and magnitude of deviations from the centrepoint, which represents the average age of the European population, and is dark grey. The hue component of a colour encodes the direction of deviation: yellow indicates an elderly population (>65 years), cyan indicates people of working age (15–64 years), and magenta indicates children (0–14 years).

The method is very flexible, and one can easily produce these meaningful colours using our R package tricolore. Just explore the capabilities of the package in a built-in shiny app using the following lines of code:


Replication materials at github

Folow us on Twitter: @ikahhnitsky, @jschoeley.


To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

July 20, 2018

If you did not already know

Model Features google
A key question in Reinforcement Learning is which representation an agent can learn to efficiently reuse knowledge between different tasks. Recently the Successor Representation was shown to have empirical benefits for transferring knowledge between tasks with shared transition dynamics. This paper presents Model Features: a feature representation that clusters behaviourally equivalent states and that is equivalent to a Model-Reduction. Further, we present a Successor Feature model which shows that learning Successor Features is equivalent to learning a Model-Reduction. A novel optimization objective is developed and we provide bounds showing that minimizing this objective results in an increasingly improved approximation of a Model-Reduction. Further, we provide transfer experiments on randomly generated MDPs which vary in their transition and reward functions but approximately preserve behavioural equivalence between states. These results demonstrate that Model Features are suitable for transfer between tasks with varying transition and reward functions. …

Penalized Splines of Propensity Prediction (PSPP) google
Little and An (2004, Statistica Sinica 14, 949-968) proposed a penalized spline of propensity prediction (PSPP) method of imputation of missing values that yields robust model-based inference under the missing at random assumption. The propensity score for a missing variable is estimated and a regression model is fitted that includes the spline of the estimated logit propensity score as a covariate. The predicted unconditional mean of the missing variable has a double robustness (DR) property under misspecification of the imputation model. We show that a simplified version of PSPP, which does not center other regressors prior to including them in the prediction model, also has the DR property. We also propose two extensions of PSPP, namely, stratified PSPP and bivariate PSPP, that extend the DR property to inferences about conditional means. These extended PSPP methods are compared with the PSPP method and simple alternatives in a simulation study and applied to an online weight loss study conducted by Kaiser Permanente..
‘Robust-squared’ Imputation Models Using BART

RuleMatrix google
With the growing adoption of machine learning techniques, there is a surge of research interest towards making machine learning systems more transparent and interpretable. Various visualizations have been developed to help model developers understand, diagnose, and refine machine learning models. However, a large number of potential but neglected users are the domain experts with little knowledge of machine learning but are expected to work with machine learning systems. In this paper, we present an interactive visualization technique to help users with little expertise in machine learning to understand, explore and validate predictive models. By viewing the model as a black box, we extract a standardized rule-based knowledge representation from its input-output behavior. We design RuleMatrix, a matrix-based visualization of rules to help users navigate and verify the rules and the black-box model. We evaluate the effectiveness of RuleMatrix via two use cases and a usability study. …

Continue Reading…


Read More

Distilled News

Echoes of the Future

This report discusses ways to combine graphics output from the ‘graphics’ package and the ‘grid’ package in R and introduces a new function echoGrob in the ‘gridGraphics’ package.

6 data analytics trends that will dominate 2018

As businesses transform into data-driven enterprises, data technologies and strategies need to start delivering value. Here are four data analytics trends to watch in the months ahead.
• Data lakes will need to demonstrate business value or die
• The CDO will come of age
• Rise of the data curator
• Data governance strategies will be key themes for all C-level executives
• The proliferation of metadata management continues
• Predictive analytics helps improve data quality

The end of errors in ANOVA reporting

Psychology is still (unfortunately) massively using analysis of variance (ANOVA). Despite its relative simplicity, I am very often confronted to errors in its reporting, for instance in student´s theses or manuscripts. Beyond the incomplete, uncomprehensible or just wrong reporting, one can find a tremendous amount of genuine errors (that could influence the results and their intepretation), even in published papers! (See the excellent statcheck to quickly check the stats of a paper). This error proneness can be at least partially explained by the fact that copy/pasting the (appropriate) values of any statistical software and formatting them textually is a very annoying process. How to end it

Top 8 Sites To Make Money By Uploading Files

pay to upload sites have been really in now days and is indeed an easy way to make few bugs online. Its one of those ways to make money online that doesn´t need any effort on your part. Each time you upload your files to their servers and someone downloads them, you get paid a certain amount.

Amazon Alexa and Accented English

Earlier this spring, one of my data science friends here in SLC got in contact with me about some fun analysis. My friend Dylan Zwick is a founder at Pulse Labs, a voice-testing startup, and they were chatting with the Washington Post about a piece on how devices like Amazon Alexa deal with accented English. The piece is published today in the Washington Post and turned out really interesting! Let´s walk through the analysis I did for Dylan and Pulse Labs.

Explaining Black-Box Machine Learning Models – Code Part 1: tabular data + caret + iml

This is code that will accompany an article that will appear in a special edition of a German IT magazine. The article is about explaining black-box machine learning models. In that article I´m showcasing three practical examples:
1.Explaining supervised classification models built on tabular data using caret and the iml package
2.Explaining image classification models with keras and lime
3.Explaining text classification models with lime

Benchmarking Feature Selection Algorithms with Xy()

Feature Selection is one of the most interesting fields in machine learning in my opinion. It is a boundary point of two different perspectives on machine learning – performance and inference. From a performance point of view, feature selection is typically used to increase the model performance or to reduce the complexity of the problem in order to optimize computational efficiency. From an inference perspective, it is important to extract variable importance to identify key drivers of a problem. Many people argue that in the era of deep learning feature selection is not important anymore. As a method of representation learning, deep learning models can find important features of the input data on their own. Those features are basically nonlinear transformations of the input data space. However, not every problem is suited to be approached with neural nets (actually, many problems). In many practical ML applications feature selection plays a key role on the road to success.

Causation in a Nutshell

Knowing the who, what, when, where, etc., is vital in marketing. Predictive analytics can also be useful for many organizations. However, also knowing the why helps us better understand the who, what, when, where, and so on, and the ways they are tied together. It also helps us predict them more accurately. Knowing the why increases their value to marketers and increases the value of marketing. Analysis of causation can be challenging, though, and there are differences of opinion among authorities. The statistical orthodoxy is that randomized experiments are the best approach. Experiments in many cases are infeasible or unethical, however. They also can be botched or be so artificial that they do not generalize to real world conditions. They may also fail to replicate. They are not magic.

Autoencoder as a Classifier using Fashion-MNIST Dataset

In this tutorial, you will learn & understand how to use autoencoder as a classifier in Python with Keras. You’ll be using Fashion-MNIST dataset as an example.

Receiver Operating Characteristic Curves Demystified (in Python)

In Data Science, evaluating model performance is very important and the most commonly used performance metric is the classification score. However, when dealing with fraud datasets with heavy class imbalance, a classification score does not make much sense. Instead, Receiver Operating Characteristic or ROC curves offer a better alternative. ROC is a plot of signal (True Positive Rate) against noise (False Positive Rate). The model performance is determined by looking at the area under the ROC curve (or AUC). The best possible AUC is 1 while the worst is 0.5 (the 45 degrees random line). Any value less than 0.5 means we can simply do the exact opposite of what the model recommends to get the value back above 0.5. While ROC curves are common, there aren´t that many pedagogical resources out there explaining how it is calculated or derived. In this blog, I will reveal, step by step, how to plot an ROC curve using Python. After that, I will explain the characteristics of a basic ROC curve.

Opportunity: data lakes offer a ‘360-degree view’ to an organisation

Data lakes provide a solution for businesses looking to harness the power of data. Stuart Wells, executive vice president, chief product and technology officer at FICO, discusses with Information Age how approaching data in this way can lead to better business decisions.

Using the AWS Glue Data Catalog as the Metastore for Hive

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. AWS Glue crawlers can automatically infer schema from source data in Amazon S3 and store the associated metadata in the Data Catalog. For more information about the Data Catalog, see Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.

Continue Reading…


Read More

Document worth reading: “Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning”

The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning. It is written with an INFORMS audience in mind, specifically those readers who are familiar with the basics of optimization algorithms, but less familiar with machine learning. We begin by deriving a formulation of a supervised learning problem and show how it leads to various optimization problems, depending on the context and underlying assumptions. We then discuss some of the distinctive features of these optimization problems, focusing on the examples of logistic regression and the training of deep neural networks. The latter half of the tutorial focuses on optimization algorithms, first for convex logistic regression, for which we discuss the use of first-order methods, the stochastic gradient method, variance reducing stochastic methods, and second-order methods. Finally, we discuss how these approaches can be employed to the training of deep neural networks, emphasizing the difficulties that arise from the complex, nonconvex structure of these models. Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning

Continue Reading…


Read More

Marvel Cinematic Universe as a 3-D network

The Straits Times visualized the Marvel Cinematic Universe with a 3-D browsable network. Link colors represent type of relationship, and proximity naturally represents commonalities between characters. Click on individual characters for information on each. Turn on the sound for extra dramatics.

Tags: , ,

Continue Reading…


Read More

Ready your Skills for a Cloud-First World with Google

The Machine Learning with TensorFlow on Google Cloud Platform Specialization on Coursera will help you jumpstart your career, includes hands-on labs, and takes you from a strategic overview to practical skills in building real-world, accurate ML models.

Continue Reading…


Read More

A hex sticker wall, created with R

Bill Venables, member of the R Foundation, co-author of the Introduction to R manual, R package developer, and one of the smartest and nicest (a rare combo!) people you will ever meet, received some gifts from the user!2018 conference committee in recognition of his longtime service to the R community. He received a book of reminiscences from those whose lives he has touched, and this blanket:

The design on the blanket is the "Hex Sticker Wall" that was on design during the conference, created by Mitchell O'Hara-Wild:

The design for the wall was created — naturally — using R. Mitchell created an R script that will arrange a folder of hexagon images according to a specified layout. He then used a map of Australia to lay out the hex sticker coordinates within the map boundaries and plotted that to create the Hexwall. If you have a similar collection of hex images you could use the same technique to arrange them into any shape you like. The details are provided in the blog post linked below.

Mitchell O'Hara-Wild: UseR! 2018 Feature Wall

Continue Reading…


Read More

A hex sticker wall, created with R

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Bill Venables, member of the R Foundation, co-author of the Introduction to R manual, R package developer, and one of the smartest and nicest (a rare combo!) people you will ever meet, received some gifts from the user!2018 conference committee in recognition of his longtime service to the R community. He received a book of reminiscences from those whose lives he has touched, and this blanket:

The design on the blanket is the “Hex Sticker Wall” that was on design during the conference, created by Mitchell O’Hara-Wild:

The design for the wall was created — naturally — using R. Mitchell created an R script that will arrange a folder of hexagon images according to a specified layout. He then used a map of Australia to lay out the hex sticker coordinates within the map boundaries and plotted that to create the Hexwall. If you have a similar collection of hex images you could use the same technique to arrange them into any shape you like. The details are provided in the blog post linked below.

Mitchell O’Hara-Wild: UseR! 2018 Feature Wall

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Thanks for reading!