My Data Science Blogs

September 24, 2017

RcppGSL 0.3.3

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance update RcppGSL 0.3.3 is now on CRAN. It switched the vignette to the our new pinp package and its two-column pdf default.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

No user-facing new code or features were added. The NEWS file entries follow below:

Changes in version 0.3.3 (2017-09-24)

  • We also check for gsl-config at package load.

  • The vignette now uses the pinp package in two-column mode.

  • Minor other fixes to package and testing infrastructure.

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Linear Programming and Healthy Diets — Part 2

Previously in this series:

Foods of the Father

My dad’s an interesting guy.

Every so often he picks up a health trend and/or weight loss goal that would make many people’s jaw drop. For example, we once went on a 5-day, 50-mile backpacking trip in the Grand Tetons, and my dad brought one of these per day for dinner, and had vitamin tablets for the rest of his sustenance. The rest of us planned for around 3,000 calories per day. He’s tried the “high fat” and “no fat” diets, and quite a few others. He’s concerned with losing weight, but also living longer, so he’s into caloric restriction among other things.

Recently he asked me to help him optimize his diet. He described a scheme he was performing by hand to minimize the number of calories he consumed per day while maintaining the minimum nutrients required by the FDA’s recommendations. He had a spreadsheet with the nutrients for each food, and a spreadsheet with the constraints on each nutrient. He wanted to come up with a collection of meals (or just throw all the food into a blender) that taste within reason but meet these criteria.

He was essentially solving a linear program by hand, roughly as best as one can, with a few hundred variables! After asking me whether there was “any kind of math” that could help him automate his laborious efforts, I decided to lend a hand. After all, it’s been over three years since I promised my readers I’d apply linear programming to optimize a diet (though it was optimizing for cost rather than calories).

Though it never went beyond what linear programming can handle, pretty quickly my dad’s requests specialized beyond what would interest a general reader. Perhaps this is the nature of math consulting, but it seems when you give someone what they want, they realize that’s not what they wanted.

But the basic ideas are still relevant enough. My solution is a hundred-ish lines of python to set up the input, using Google’s open source operations research tools as the core solver. Disclaimer: I work for Google but I don’t work on the team that wrote this tool. Also nothing in this post represents the views of my employer. It’s just me, and the scale of this problem is laughable for Google to care about anyway.

So this post is half tutorial showing how to use the or-tools python wrapper (it’s only somewhat documented), and half showing a realistic use case for linear programming.

However, don’t let this post dissuade you from the belief that linear programming is useful beyond dieting. People use linear programming to solve all kinds of interesting problems. Here are a few I came across in just the last few weeks:

And that’s not even to mention the ubiquitous applications in operations research (network flow, production optimization, economics) that every large company relies on. The applications seem endless!

As usual, all of the code and data we use in the making of this post is available at this blog’s Github page.


If you remember how linear programs work, you can safely skip this section.

As a refresher, let’s outline how to model the nutrition problem as a linear program and recall the basic notation. The variables are food in 100 gram increments. So x_1 might be the amount of canned peas consumed, x_2 lobster meat, etc. All variables would have to be nonnegative, of course. The objective is to minimize the total number of calories consumed. If c_1 \geq 0 is the amount of calories in 100g of canned peas, then one would pay c_1x_1 in calories contributed by peas. If we have n different variables, then the objective function is the linear combination

\textup{cost}(\mathbf{x}) = \sum_{j=1}^n c_j x_j

We’re using boldface \mathbf{x} to represent the vector (x_1, \dots, x_n). Likewise, \mathbf{c} will represent the cost vector of foods (c_1, \dots, c_n). As we’ve seen many times, we can compactly write the sum above as an inner product \langle \mathbf{c}, \mathbf{x} \rangle.

Finally, we require that the amount of each nutrient combined in the stuff we buy meets some threshold. So for each nutrient we have a constraint. The easiest one is calories; we require the total number of calories consumed is at least (say) 2,000. So if a_j represents the number of calories in food j, we require \sum_{j=1}^n a_j x_j \geq 2000. We might also want to restrict a maximum number of calories, but in general having a diet with more calories implies higher cost, and so when the linear program minimizes cost we should expect it not to produce a diet with significantly more than 2,000 calories.

Since we have one set of nutrient information for each pair of (nutrient, food), we need to get fancier with the indexing. I’ll call a_{i,j} the amount of nutrient i in food j. Note that A = (a_{i,j}) will be a big matrix, so I’m saying that nutrients i represent the rows of the matrix and foods j represent the columns. That is, each row of the matrix represents the amount of one specific nutrient in all the foods, and each column represents the nutritional content of a single food. We’ll always use n to denote the number of foods, and m to denote the number of nutrients.

Finally, we have a lower and upper bound for each nutrient, which behind the scenes are converted into lower bounds (possibly negating the variables). This isn’t required to write the program, as we’ll see. In notation, we require that for every 1 \leq i \leq m, the nutrient constraint \sum_{j=1}^n a_{i,j} x_j \geq b_i is satisfied. If we again use vector notation for the constraints \mathbf{b}, we can write the entire set of constraints as a “matrix equation”

A \mathbf{x} \geq \mathbf{b}

And this means each entry of the vector you get from multiplying the left-hand-side is greater than the corresponding entry on the right-hand-side. So the entire linear program is summarized as follows

\displaystyle \begin{aligned} \textup{min } & \langle \mathbf{c} , \mathbf{x} \rangle  \\  \textup{such that } & A \mathbf{x}  \geq \mathbf{b} \\ & \mathbf{x}  \geq \mathbf{0} \end{aligned}

That’s the syntactical form of our linear program. Now all (!) we need to do is pick a set of foods and nutrients, and fill in the constants for A, \mathbf{c}, \mathbf{b}.

Nutrients and Foods

The easier of the two pieces of data is the nutrient constraints. The system used in the United States is called the Dietary Reference Intake system. It consists of five parts, which I’ve paraphrased from Wikipedia.

  • Estimated Average Requirements (EAR), expected to satisfy the needs of 50% of the people in an age group.
  • Recommended Dietary Allowances (RDA), the daily intake level considered sufficient to meet the requirements of 97.5% of healthy individuals (two standard deviations).
  • Adequate Intake (AI), where no RDA has been established. Meant to complement the RDA, but has less solid evidence.
  • Tolerable upper intake levels (UL), the highest level of daily consumption that have not shown harmful side effects.

While my dad come up with his own custom set of constraints, the ones I’ve posted on the github repository are essentially copy/paste from the current RDA/AI as a lower bound, with the UL as an upper bound. The values I selected are in a csv. Missing values in the upper bound column mean there is no upper bound. And sorry ladies, since it’s for my dad I chose the male values. Women have slightly different values due to different average size/weight.

Nutrient values for food are a little bit harder, because nutrient data isn’t easy to come by. There are a few databases out there, all of which are incomplete, and some of which charge for use. My dad spent a long time hunting down the nutrients (he wanted some additional special nutrients) for his top 200-odd foods.

Instead, in this post we’ll use the USDA’s free-to-use database of 8,000+ foods. It comes in a single, abbreviated, oddly-formatted text file which I’ve parsed into a csv and chosen an arbitrary subset of 800-ish foods to play around with.

Python OR Tools

Google’s ortools docs ask you to download a tarball to install their package, but I found that was unnecessary (perhaps it’s required to attach a third-party solver to their interface?). Instead, one can just pip install it.

pip3 install py3-ortools

Then in a python script, you can import the ortools library and create a simple linear program:

from ortools.linear_solver import pywraplp

solver = pywraplp.Solver('my_LP', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

x = solver.NumVar(0, 10, 'my first variable')
y = solver.NumVar(0, 10, 'my second variable')

solver.Add(x + y <= 7) solver.Add(x - 2 * y >= -2)

objective = solver.Objective()
objective.SetCoefficient(x, 3)
objective.SetCoefficient(y, 1)

status = solver.Solve()

if status not in [solver.OPTIMAL, solver.FEASIBLE]:
    raise Exception('Unable to find feasible solution')


This provides the basic idea of the library. You can use python’s operator overloading (to an extent) to make the constraints look nice in the source code.

Setting up the food LP

The main file contains a definition for a class, which, in addition to loading the data, encapsulates all the variables and constraints.

class DietOptimizer(object):
    def __init__(self, nutrient_data_filename='nutrients.csv',

        self.food_table = # load data into a list of dicts
        self.constraints_data = # load data into a list of dicts

        self.solver = pywraplp.Solver('diet_optimizer', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

        self.objective = self.solver.Objective()
        for row in self.food_table:
            name = row['description']
            var = self.variable_dict[name]
            calories_in_food = row[calories_name]
            self.objective.SetCoefficient(var, calories_in_food)

We’ll get into the variables and constraints momentarily, but before that we can see the solve method

    def solve(self):
            Return a dictionary with 'foods' and 'nutrients' keys representing
            the solution and the nutrient amounts for the chosen diet
        status = self.solver.Solve()
        if status not in [self.solver.OPTIMAL, self.solver.FEASIBLE]:
            raise Exception('Unable to find feasible solution')

        chosen_foods = {
            food_name: var.solution_value()
            for food_name, var in self.variable_dict.items() if var.solution_value() > 1e-10

        self.chosen_foods = chosen_foods

        nutrients = {
            row['nutrient']: self.nutrients_in_diet(chosen_foods, row['nutrient'])
            for row in self.constraints_table

        return {
            'foods': chosen_foods,
            'nutrients': nutrients,

Here nutrients_in_diet is a helper function which, given a dictionary of foods and a nutrient, outputs the nutrient contents for that food. This can be used independently of the solver to evaluate the nutrient contents of a proposed diet.

Next we have the method to create the variables.

    def create_variable_dict(self):
            The variables are the amount of each food to include, denominated in units of 100g
        self.variable_dict = dict(
            (row['description'], self.solver.NumVar(0, 10, row['description']))
            for row in self.food_table

Each food must be present in a nonnegative amount, and I’ve imposed a cap of 10 (1kg) for any individual food. The reason for this is that I originally had a “water” constraint, and the linear program decided to optimize for that by asking one to drink 2L of red wine per day. I neglected to put in an alcohol nutrient (because it was not already there and I’m lazy), so instead I limited the amount of any individual food. It still seems like a reasonable constraint to impose that nobody would want to eat more than 1kg of any single food on one day.

Finally, we can construct the constraints. The core method takes a nutrient and a lower and upper bound:

    def create_constraint(self, nutrient_name, lower, upper):
            Each constraint is a lower and upper bound on the
            sum of all food variables, scaled by how much of the
            relevant nutrient is in that food.
        if not lower:
            print('Warning! Nutrient %s has no lower bound!'.format(nutrient_name))

        sum_of_foods = self.foods_for_nutrient(nutrient_name)
        constraint_lb = lower <= sum_of_foods
        self.constraint_dict[nutrient_name + ' (lower bound)'] = constraint_lb

        if not upper:
            return  # no upper bound in the data

        constraint_ub = sum_of_foods <= upper
        self.constraint_dict[nutrient_name + ' (upper bound)'] = constraint_ub

This method is mostly bookkeeping, while foods_for_nutrient does the individual nutrient lookup. Note that one is not allowed to do a double-ended inequality like self.solver.Add(lower <= sum_of_foods <= upper). If you try, ortools will ignore one end of the bound.

    def foods_for_nutrient(self, nutrient_name, scale_by=1.0):
        # a helper function that computes the scaled sum of all food variables
        # for a given nutrient
        relevant_foods = []
        for row in self.food_table:
            var = self.variable_dict[row['description']]
            nutrient_amount = row[nutrient_name]
            if nutrient_amount > 0:
                relevant_foods.append(scale_by * nutrient_amount * var)

        if len(relevant_foods) == 0:
            print('Warning! Nutrient %s has no relevant foods!'.format(nutrient_name))

        return SumArray(relevant_foods)

Here we are a bit inefficient by iterating through the entire table, instead of just those foods containing the nutrient in question. But there are only a few hundred foods in our sample database (8,000 if you use the entire SR28 database), and so the optimization isn’t necessary.

Also note that while ortools allows one to use the sum method, it does so in a naive way, because sum([a, b, c]) becomes ((a + b) + c), which is a problem because if the list is too long their library exceeds Python’s default recursion limit. Instead we construct a SumArray by hand.

Finally, though we omitted it here for simplicity, throughout the code in the Github repository you’ll see references to percent_constraints. This exists because some nutrients, like fat, are recommended to be restricted to a percentage of calories, not an absolute amount. So we define a mechanism to specify a nutrient should be handled with percents, and a mapping from grams to calories. This ends up using the scale_by parameter above, both to scale fat by 9 calories per gram, and to scale calories to be a percentage. Cf. the special function for creating percent constraints.

Finally, we have methods just for pretty-printing the optimization problem and the solution, called summarize_optimization_problem and summarize_solution, respectively.

Running the solver

Invoking the solver is trivial.

if __name__ == "__main__":
    solver = DietOptimizer()
    # solver.summarize_optimization_problem()
    solution = solver.solve()

With the example foods and constraints in the github repo, the result is:


   38.5g: CURRY POWDER

Nutrient totals

    1700.0 mg   calcium                   [1700.0, 2100.0]   
     130.0 g    carbohydrate              [130.0, ]          
     550.0 mg   choline                   [550.0, 3500.0]    
       3.3 mg   copper                    [0.9, 10.0]        
      60.5 g    dietary fiber             [38.0, ]           
     549.7 μg   dietary folate            [400.0, 1000.0]    
    1800.0 kcal energy                    [1800.0, 2100.0]   
      32.4 mg   iron                      [18.0, 45.0]       
     681.7 mg   magnesium                 [420.0, ]          
       7.3 mg   manganese                 [2.3, 11.0]        
      35.0 mg   niacin                    [16.0, 35.0]       
      11.7 mg   pantothenic acid          [5.0, ]            
    2554.3 mg   phosphorus                [1250.0, 4000.0]   
      14.0 g    polyunsaturated fatty acids  [1.6, 16.0]        
    4700.0 mg   potassium                 [4700.0, ]         
     165.2 g    protein                   [56.0, ]           
       2.8 mg   riboflavin                [1.3, ]            
     220.8 μg   selenium                  [55.0, 400.0]      
    1500.0 mg   sodium                    [1500.0, 2300.0]   
       2.4 mg   thiamin                   [1.2, ]            
      59.4 g    total fat                 [20.0, 35.0]        (29.7% of calories)
    3000.0 μg   vitamin a                 [900.0, 3000.0]    
      23.0 μg   vitamin b12               [2.4, ]            
       2.4 mg   vitamin b6                [1.7, 100.0]       
     157.6 mg   vitamin c                 [90.0, 2000.0]     
     893.0 iu   vitamin d                 [400.0, 4000.0]    
      15.0 mg   vitamin e                 [15.0, 1000.0]     
     349.4 μg   vitamin k                 [120.0, ]          
      17.2 mg   zinc                      [11.0, 40.0] 

Unfortunately, this asks for a kilo of raw alfalfa sprouts, which I definitely would not eat. The problem is that alfalfa is ridiculously nutritious. Summarizing the diet with the print_details flag set, we see they contribute a significant amount of nearly every important nutrient.

	18.8% of calcium (mg)
	16.2% of carbohydrate (g)
	26.2% of choline (mg)
	47.3% of copper (mg)
	31.4% of dietary fiber (g)
	65.5% of dietary folate (μg)
	12.8% of energy (kcal)
	29.7% of iron (mg)
	39.6% of magnesium (mg)
	25.6% of manganese (mg)
	13.7% of niacin (mg)
	48.2% of pantothenic acid (mg)
	27.4% of phosphorus (mg)
	29.3% of polyunsaturated fatty acids (g)
	16.8% of potassium (mg)
	24.2% of protein (g)
	45.1% of riboflavin (mg)
	2.7% of selenium (μg)
	4.0% of sodium (mg)
	31.9% of thiamin (mg)
	11.6% of total fat (g)
	2.7% of vitamin a (μg)
	13.9% of vitamin b6 (mg)
	52.0% of vitamin c (mg)
	1.3% of vitamin e (mg)
	87.3% of vitamin k (μg)
	53.5% of zinc (mg)

But ignoring that, we have some reasonable sounding foods: fish, sweet potato, rosemary (okay that’s a ton of rosemary), egg and wine. I bet someone could make a tasty meal from those rough ingredients.

Extensions and Exercises

No tutorial would be complete without exercises. All of these are related to the actual linear program modeling problem.

Food groups: Suppose you had an additional column for each food called “food group.” You want to create a balanced meal, so you add additional constraint for each food group requiring some food, but not too much, from each group. Furthermore, for certain foods like spices, one could add a special constraint for each spice requiring not more than, say, 20g of any given spice. Or else, as one can see, the linear program can produce diets involving obscenely large amounts of spices.

Starting from a given set of foods: Supposing you have an idea for a recipe (or couple of recipes for a day’s meals), but you want to add whatever else is needed to make it meet the nutritional standards. Modify the LP to allow for this.

Integer variations: The ortools package supports integer programming as well. All you need to do to enable this is change the solver type to CBC_MIXED_INTEGER_PROGRAMMING. The solver will run as normal, and now you can create integer-valued variables using solver.IntVar instead of NumVar. Using binary variables, one can define logical OR constraints (figure out how this must work). Define a new binary variable for each food, and define a constraint that makes this variable 0 when the food is not used, and 1 when the food is used. Then add a term to the optimization problem that penalizes having too many different foods in a daily diet.

(Harder) Sampling: Part of the motivation for this project is to come up with a number of different dishes that are all “good” with respect to this optimization problem. Perhaps there is more than one optimal solution, or perhaps there are qualitatively different diets that are close enough to optimal. However, this implementation produces a deterministic output. Find a way to introduce randomness into the program, so that you can get more than one solution.

Feel free to suggest other ideas, and extend or rewrite the model to do something completely different. The sky’s the limit!

Continue Reading…


Read More

Chess records page

Chess records page (no, not on the first page, or the second page, or the third page, of a google search of *chess records*).

There’s lots of good stuff here, enough to fill much of a book if you so desire. As we’ve discussed, chess games are in the public domain so if you take material on chess games from an existing book or website without crediting the person who compiled this material, you’re not actually plagiarizing.

The post Chess records page appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Profiling Go programs with pprof

Last week me and my cool coworker Josh were debugging some memory problems in a Go program using pprof.

There’s a bunch of pprof documentation on the internet but I found a few things confusing so here are some notes so I can find them easily.

First – when working with pprof it’s good to be running a recent version of Go! For example Go 1.8 adds mutex profiles so you can see mutex contention.

in this post I’ll

  • link to the useful pprof resource I found
  • explain what a pprof profile is
  • give an example of how to look at a heap profile of a Go program
  • explain a few things about the heap profiler works (what do the stack traces mean? how are they collected?)
  • most importantly (to me), deconstruct an example pprof protobuf file so we understand what a pprof profile actually is

This post won’t really explain in detail how to to use pprof to diagnose performance issues in Go programs, but I think these fundamentals (“what even is a pprof file”) will help me do that more easily.

pprof basics

pprof lets you collect CPU profiles, traces, and heap profiles for your Go programs. The normal way to use pprof seems to be:

  1. Set up a webserver for getting Go profiles (with import _ "net/http/pprof")
  2. Run curl localhost:$PORT/debug/pprof/$PROFILE_TYPE to save a profile
  3. Use go tool pprof to analyze said profile

You can also generate pprof profiles in your code using the pprof package but I haven’t done that.

Useful pprof reading

Here is every useful link I’ve found so far about pprof on the internet. Basically the material on the internet about pprof seems to be the official documentation + rakyll’s amazing blog.

(there are probably also talks about pprof but I am too impatient to watch talks, that’s part of why I write lots of blog posts and give few talks)

What’s a profile? What kinds of profiles can I get?

When understanding how things work I like to start at the beginning. What is a “profile” exactly?

Well, let’s read the documentation! The 7th time I looked at the runtime/pprof docs, I read this very useful sentence:

A Profile is a collection of stack traces showing the call sequences that led to instances of a particular event, such as allocation. Packages can create and maintain their own profiles; the most common use is for tracking resources that must be explicitly closed, such as files or network connections.

Each Profile has a unique name. A few profiles are predefined:

goroutine    - stack traces of all current goroutines
heap         - a sampling of all heap allocations
threadcreate - stack traces that led to the creation of new OS threads
block        - stack traces that led to blocking on synchronization primitives
mutex        - stack traces of holders of contended mutexes

There are 7 places you can get profiles in the default webserver: the ones mentioned above

and also 2 more: the CPU profile and the CPU trace.

To analyze these profiles (lists of stack traces), the tool to use is go tool pprof, which is a bunch of tools for visualizing stack traces.

super confusing note: the trace endpoint (/debug/pprof/trace?seconds=5), unlike all the rest, outputs a file that is not a pprof profile. Instead it’s a trace and you can view it using go tool trace (not go tool pprof).

You can see the available profiles with http://localhost:6060/debug/pprof/ in your browser. Except it doesn’t tell you about /debug/pprof/profile or /debug/pprof/trace for some reason.

All of these kinds of profiles (goroutine, heap allocations, etc) are just collections of stacktraces, maybe with some metadata attached. If we look at the pprof protobuf definition, you see that a profile is mostly a bunch of Samples.

A sample is basically a stack trace. That stack trace might have some extra information attached to it! For example in a heap profile, the stack trace has a number of bytes of memory attached to it. I think the Samples are the most important part of the profile.

We’re going to deconstruct what exactly is inside a pprof file later, but for now let’s start by doing a quick example of what analyzing a heap profile looks like!

Getting a heap profile with pprof

I’m mostly interested in debugging memory problems right now. So I decided to write a program that allocates a bunch of memory to profile with pprof.

func main() {
    // we need a webserver to get the pprof webserver
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    fmt.Println("hello world")
    var wg sync.WaitGroup
    go leakyFunction(wg)

func leakyFunction(wg sync.WaitGroup) {
    defer wg.Done()
    s := make([]string, 3)
    for i:= 0; i < 10000000; i++{
        s = append(s, "magical pandas")
        if (i % 100000) == 0 {
            time.Sleep(500 * time.Millisecond)

Basically this just starts a goroutine leakyFunction that allocates a bunch of memory and then exits eventually.

Getting a heap profile of this program is really easy – we just need to run go tool pprof http://localhost:6060/debug/pprof/heap. This puts us into an interactive mode where we run top

$ go tool pprof  http://localhost:6060/debug/pprof/heap
    Fetching profile from http://localhost:6060/debug/pprof/heap
    Saved profile in /home/bork/pprof/pprof.localhost:6060.inuse_objects.inuse_space.004.pb.gz
    Entering interactive mode (type "help" for commands)
(pprof) top
    34416.04kB of 34416.04kB total (  100%)
    Showing top 10 nodes out of 16 (cum >= 512.04kB)
          flat  flat%   sum%        cum   cum%
       33904kB 98.51% 98.51%    33904kB 98.51%  main.leakyFunction

I can also do the same thing outside interactive mode with go tool pprof -top http://localhost:6060/debug/pprof/heap.

This basically tells us that main.leakyFunction is using 339MB of memory. Neat!

We can also generate a PNG profile like this: go tool pprof -png http://localhost:6060/debug/pprof/heap > out.png.

Here’s what that looks like (I ran it at a different time so it’s only using 100MBish of memory).

what do the stack traces in a heap profile mean?

This is not complicated but also was not 100% obvious to me. The stack traces in the heap profile are the stack trace at time of allocation.

So the stack traces in the heap profile might be for code that is not running anymore – like maybe a function allocated a bunch of memory, returned, and a different function that should be freeing that memory is misbehaving. So the function to blame for the memory leak might be totally different than the function listed in the heap profile.

alloc_space vs inuse_space

go tool pprof has the option to show you either allocation counts or in use memory. If you’re concerned with the amount of memory being used, you probably want the inuse metrics, but if you’re worried about time spent in garbage collection, look at allocations!

  -inuse_space      Display in-use memory size
  -inuse_objects    Display in-use object counts
  -alloc_space      Display allocated memory size
  -alloc_objects    Display allocated object counts

I was originally confused about this works – the profiles have already be collected! How can I make this choice after the fact? I think how the heap profiles work is – allocations are recorded at some sample rate. Then every time one of those allocation is freed, that’s also recorded. So you get a history of both allocations and frees for some sample of memory activity. Then when it comes time to analyze your memory usage, you can decide where you want inuse memory or total allocation counts!

You can read the source for the memory profiler here: It has a lot of useful comments! For example here are the comments about setting the sample rate:

// MemProfileRate controls the fraction of memory allocations
// that are recorded and reported in the memory profile.
// The profiler aims to sample an average of
// one allocation per MemProfileRate bytes allocated.

// To include every allocated block in the profile, set MemProfileRate to 1.
// To turn off profiling entirely, set MemProfileRate to 0.

// The tools that process the memory profiles assume that the
// profile rate is constant across the lifetime of the program
// and equal to the current value. Programs that change the
// memory profiling rate should do so just once, as early as
// possible in the execution of the program (for example,
// at the beginning of main).

pprof fundamentals: deconstructing a pprof file

When I started working with pprof I was confused about what was actually happening. It was generating these heap profiles named like pprof.localhost:6060.inuse_objects.inuse_space.004.pb.gz – what is that? How can I see the contents?

Well, let’s take a look!! I wrote an even simpler Go program to get the simplest possible heap profile.

package main

import "runtime"
import "runtime/pprof"
import "os"
import "time"

func main() {
    go leakyFunction()
    time.Sleep(500 * time.Millisecond)
    f, _ := os.Create("/tmp/profile.pb.gz")
    defer f.Close()

func leakyFunction() {
    s := make([]string, 3)
    for i:= 0; i < 10000000; i++{
        s = append(s, "magical pprof time")

This program just allocates some memory, writes a heap profile, and exits. Pretty simple. Let’s look at this file /tmp/profile.pb.gz! You can download a gunzipped version profile.pb here: profile.pb. I installed protoc using these directions.

profile.pb is a protobuf file, and it turns out you can view protobuf files with protoc, the protobuf compiler.

go get
protoc --decode=perftools.profiles.Profile  $GOPATH/src/ --proto_path $GOPATH/src/

The output of this is a bit long, you can view it all here: output.

Here’s a summary though of what’s in this heap profile file! This contains 1 sample. A sample is a stack trace, and this stack trace has 2 locations: 1 and 2. What are locations 1 and 2? Well they correspond to mappings 1 and 2, which in turn correspond to filenames 7 and 8.

If we look at the string table, we see that filenames 7 and 8 are these two:

string_table: "/home/bork/work/experiments/golang-pprof/leak_simplest"
string_table: "[vdso]"
sample {
  location_id: 1
  location_id: 2
  value: 1
  value: 34717696
  value: 1
  value: 34717696
mapping {
  id: 1
  memory_start: 4194304
  memory_limit: 5066752
  filename: 7
mapping {
  id: 2
  memory_start: 140720922800128
  memory_limit: 140720922808320
  filename: 8
location {
  id: 1
  mapping_id: 1
  address: 5065747
location {
  id: 2
  mapping_id: 1
  address: 4519969
string_table: ""
string_table: "alloc_objects"
string_table: "count"
string_table: "alloc_space"
string_table: "bytes"
string_table: "inuse_objects"
string_table: "inuse_space"
string_table: "/home/bork/work/experiments/golang-pprof/leak_simplest"
string_table: "[vdso]"
string_table: "[vsyscall]"
string_table: "space"
time_nanos: 1506268926947477256
period_type {
  type: 10
  unit: 4
period: 524288

pprof files don’t always contain function names

One interesting thing about this pprof file profile.pb is that it doesn’t contain the names of the functions we’re running! But If I run go tool pprof on it, it prints out the name of the leaky function. How did you do that, go tool pprof?!

go tool pprof -top  profile.pb 
59.59MB of 59.59MB total (  100%)
      flat  flat%   sum%        cum   cum%
   59.59MB   100%   100%    59.59MB   100%  main.leakyFunction
         0     0%   100%    59.59MB   100%  runtime.goexit

I answered this with strace, obviously – I straced go tool pprof and this is what I saw:

5015  openat(AT_FDCWD, "/home/bork/pprof/binaries/leak_simplest", O_RDONLY|O_CLOEXEC <unfinished ...>
5015  openat(AT_FDCWD, "/home/bork/work/experiments/golang-pprof/leak_simplest", O_RDONLY|O_CLOEXEC) = 3

So it seems that go tool pprof noticed that the filename in profile.pb was /home/bork/work/experiments/golang-pprof/leak_simplest, and then it just opened up that file on my computer and used that to get the function names. Neat!

You can also pass the binary to go tool pprof like go tool pprof -out $BINARY_FILE myprofile.pb.gz. Sometimes pprof files contain function names and sometimes they don’t, I haven’t figured out what determines that yet.

pprof keeps improving!

also I found out that thanks to the great work of people like rakyll, pprof keeps getting better!! For example There’s this pull request which is being worked on RIGHT NOW which adds flamegraph support to the pprof web interface. Flamegraphs are the best thing in the universe so I’m very excited for that to be available.

If I got someting wrong (I probably did) let me know!!

Continue Reading…


Read More

R Packages worth a look

Access Domains and Search Popular Websites (websearchr)
Functions that allow for accessing domains and a number of search engines.

Hyphenation and Syllable Counting for Text Analysis (sylly)
Provides the hyphenation algorithm used for ‘TeX’/’LaTeX’ and similar software, as proposed by Liang (1983, <https://…/> ). Mainly contains the function hyphen() to be used for hyphenation/syllable counting of text objects. It was originally developed for and part of the ‘koRpus’ package, but later released as a separate package so it’s lighter to have this particular functionality available for other packages. Support for various languages needs be added on-the-fly or by plugin packages; this package does not include any language specific data. Due to some restrictions on CRAN, the full package sources are only available from the project homepage. To ask for help, report bugs, request features, or discuss the development of the package, please subscribe to the koRpus-dev mailing list (<> ).

Optimal Design and Statistical Power of Cost-Efficient Multilevel Randomized Trials (odr)
Calculate the optimal sample allocation that minimizes variance of treatment effect in a multilevel randomized trial under fixed budget and cost structure, perform power analyses with and without accommodating costs and budget. The reference for proposed methods is: Shen, Z., & Kelcey, B. (under review). Optimal design of cluster randomized trials under condition- and unit-specific cost structures. 2018 American Educational Research Association (AERA) annual conference.

Fit and Predict a Gaussian Process Model with (Time-Series) Binary Response (binaryGP)
Allows the estimation and prediction for binary Gaussian process model. The mean function can be assumed to have time-series structure. The estimation methods for the unknown parameters are based on penalized quasi-likelihood/penalized quasi-partial likelihood and restricted maximum likelihood. The predicted probability and its confidence interval are computed by Metropolis-Hastings algorithm. More details can be seen in Sung et al (2017) <arXiv:1705.02511>.

Distributions and Gradients (dng)
Provides density, distribution function, quantile function and random generation for the split-t distribution, and computes the mean, variance, skewness and kurtosis for the split-t distribution (Li, F, Villani, M. and Kohn, R. (2010) <doi:10.1016/j.jspi.2010.04.031>).

Simulation Based Inference of Lasso Estimator (EAinference)
Estimator augmentation methods for statistical inference on high-dimensional data, as described in Zhou, Q. (2014) <doi:10.1080/01621459.2014.946035> and Zhou, Q. and Min, S. (2017) <doi:10.1214/17-EJS1309>. It provides several simulation-based inference methods: (a) Gaussian and wild multiplier bootstrap for lasso, group lasso, scaled lasso, scaled group lasso and their de-biased estimators, (b) importance sampler for approximating p-values in these methods, (c) Markov chain Monte Carlo lasso sampler with applications in post-selection inference.

Continue Reading…


Read More

Document worth reading: “Resource Elasticity for Distributed Data Stream Processing: A Survey and Future Directions”

Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructures, and Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. This paper surveys state of the art on stream processing engines and mechanisms for exploiting resource elasticity features of cloud computing in stream processing. Resource elasticity allows for an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges on achieving elastic systems that can make efficient resource management decisions based on current load. This work examines some of these challenges and discusses solutions proposed in the literature to address them. Resource Elasticity for Distributed Data Stream Processing: A Survey and Future Directions

Continue Reading…


Read More

If you did not already know

Kanri Distance (KDC) google
Kanri’s proprietary combination of patented statistical and process methods provides a uniquely powerful and insightful ability to evaluate large data sets with multiple variables. While many tools evaluate patterns and dynamics for large data, only the Kanri Distance Calculator allows users to understand where they stand with respect to a desired target state and the specific contribution of each variable toward the overall distance from the target state. The Kanri model not only calculates the relationship of variables within the overall data set, but more importantly mathematically teases out the interaction between each of them. This combination of relational insights fuels Kanri’s breakthrough distance calculator. It answers the question “In a world of exponentially expanding data how do I find the variables that will solve my problem and it helps quickly to reach that conclusion.” But the Kanri model does not stop there. Kanri tells you exactly, formulaically how much each variable contributes. The Kanri Distance Calculator opens a new world of solution development possibilities that can apply the power of massive data sets to an individual…or to an individualized objective. …

Probably Approximately Correct Learning (PAC Learning,WARL) google
In computational learning theory, probably approximately correct learning (PAC learning) is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the “probably” part), the selected function will have low generalization error (the “approximately correct” part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). …

Local Projections google
In this paper, we propose a novel approach for outlier detection, called local projections, which is based on concepts of Local Outlier Factor (LOF) (Breunig et al., 2000) and RobPCA (Hubert et al., 2005). By using aspects of both methods, our algorithm is robust towards noise variables and is capable of performing outlier detection in multi-group situations. We are further not reliant on a specific underlying data distribution. For each observation of a dataset, we identify a local group of dense nearby observations, which we call a core, based on a modification of the k-nearest neighbours algorithm. By projecting the dataset onto the space spanned by those observations, two aspects are revealed. First, we can analyze the distance from an observation to the center of the core within the projection space in order to provide a measure of quality of description of the observation by the projection. Second, we consider the distance of the observation to the projection space in order to assess the suitability of the core for describing the outlyingness of the observation. These novel interpretations lead to a univariate measure of outlyingness based on aggregations over all local projections, which outperforms LOF and RobPCA as well as other popular methods like PCOut (Filzmoser et al., 2008) and subspace-based outlier detection (Kriegel et al., 2009) in our simulation setups. Experiments in the context of real-word applications employing datasets of various dimensionality demonstrate the advantages of local projections. …

Continue Reading…


Read More

September 23, 2017

Magister Dixit

“Failure is a great teacher.” Daniel Tunkelang

Continue Reading…


Read More

Book Memo: “Beautiful Data”

The Stories Behind Elegant Data Solutions
In this insightful book, you’ll learn from the best data practitioners in the field just how wide-ranging – and beautiful – working with data can be. Join 39 contributors as they explain how they developed simple and elegant solutions on projects ranging from the Mars lander to a Radiohead video. With Beautiful Data, you will:
• Explore the opportunities and challenges involved in working with the vast number of datasets made available by the Web
• Learn how to visualize trends in urban crime, using maps and data mashups
• Discover the challenges of designing a data processing system that works within the constraints of space travel
• Learn how crowdsourcing and transparency have combined to advance the state of drug research
• Understand how new data can automatically trigger alerts when it matches or overlaps pre-existing data
• Learn about the massive infrastructure required to create, capture, and process DNA data
That’s only small sample of what you’ll find in Beautiful Data. For anyone who handles data, this is a truly fascinating book.

Continue Reading…


Read More

RcppCNPy 0.2.7

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of the RcppCNPy package arrived on CRAN yesterday.

RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.

This version updates internals for function registration, but otherwise mostly switches the vignette over to the shiny new pinp two-page template and package.

Changes in version 0.2.7 (2017-09-22)

  • Vignette updated to Rmd and use of pinp package

  • File src/init.c added for dynamic registration

CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

RcppClassic 0.9.8

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A bug-fix release RcppClassic 0.9.8 for the very recent 0.9.7 release which fixes a build issue on macOS introduced in 0.9.7. No other changes.

Courtesy of CRANberries, there are changes relative to the previous release.

Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Upcoming data preparation and modeling article series

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.


vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.

vtreat handles, in a statistically sound fashion:

In our (biased) opinion opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat‘s current public open-source implementation is for in-memory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).

vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.

A new feature of vtreat version 0.6.0 is called “custom coders.” Win-Vector LLC‘s Dr. Nina Zumel is going to start a short article series to show how this new interface can be used to extend vtreat methodology to include the very powerful method of partial pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding including: ordinal/faithful coders, monotone coders, unimodal coders, and set-valued coders.

Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.

Continue Reading…


Read More

Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-9)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.

In this series of article I have tried to help you create an intuition on how probabilities work. To do so, we have been using simulations to see how concrete random situation can unfold and learn simple statistics and probabilistic concepts. In today’s set, I would like to show you some deceptively difficult situations that will challenge the way you understand probability and statistics. By doing so, you will practice the simulation technique we have seen in past set, refined your intuition and, hopefully help you avoid some pitfall when you do your own statistical analysis.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

Exercise 1
Suppose that there are exactly 365 days in a year and that the distribution of birthday in the population is uniform, meaning that the proportion of birth on any given day is the same throughout the year. In a group of 25 people, what is the probability that at least two individuals share the same birthday? Use a simulation to answer that question, then repeat this process for group of 0,10,20,…,90 and 100 people and plot the results.

Of course, when the group size is of 366 we know that the probability that two people share the same birthday is equal to 1, since there are more people than day in the year and for a group of zero person this probability is equal to 0. What is counterintuitive here is the rate at which the probability of observing this grow. From the graph we can see that with just about 23 people we have a probability of about 50% of observing two people having the same birthday and that a group of about 70 people will have almost 100% chance to see this happening.

Exercise 2
Here’s a problem that can someday save your life. Imagine you are a war prisoner in an Asian Communist country and your jailer is getting bored. So to past the time, they set up a Russian roulette game where you and another inmate play against one another. A jailer takes a six-shooter revolver, put two bullets in two consecutive chamber, spin the chamber and give the gun to your opponent, who place the gun to his temple and pull the trigger. Luckily for him, the chamber was empty and the gun is passed to you. Now you have a choice to make: you can let the chamber as it is and play or you can spin the chamber before playing. Use 10000 simulations of both choices to find which choice give you the highest probability to survive.

The key details in this problem is that the bullet are in consecutive chamber. This mean that if your opponent pulls the trigger on an empty chamber, and that you don’t spin the chamber, it’s impossible that you pull the trigger on the second bullet. You can only have an empty chamber of pull the trigger on the first bullet, which means that you have 25% chance of dying vs 2/6=33% chance of dying if you spin the chamber.
Exercise 3
What is the probability that a mother, whose is pregnant with nonidentical twin, give birth to two boys, if we know that one of the unborn child is a boy, but we cannot identifie which one is the boy?

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Work with about different binomial and logistic regression techniques
  • Know how to compare regression models and choose the right fit
  • And much more

Exercise 4
Two friends play head or tail to pass the time. To make this game more fun they decide to gamble pennies, so for each coin flip one friend call head or tail and if he calls right, he gets a penny and lose one otherwise. Let’s say that they have 40 and 30 pennies respectively and that they will play until someone has all the pennies.

  1. Create a function that simulate a complete game and return how many coin flip has been done and who win.
  2. In average, how many coin flip is needed before someone has all the pennies.
  3. Plot the histogram of the number of coin flipped during a simulation.
  4. What is the probability that someone wins a coin flip?
  5. What is the probability that each friend wins all the pennies? Why is it different than the probability of winning a single coin flip?

When the number of coin flip get high enough, the probability of someone winning often enough to get all the pennies rise to 100%. Maybe they will have to play 24h a day for weeks, but someday, someone will lose often enough to be penniless. In this context, the player who started with the most money have a huge advantage since they can survive a much longer losing streak than their opponent.

In fact, in this context where the probability of winning a single game is equal for each opponent the probability of winning all the money is equal to the proportion of the money they start with. That’s in part why the casino always win since they got more money than each gambler that plays against them, as long they get them to play long enough they will win. The fact that they propose game where they have greater chance to win help them quite a bit too.
Exercise 5
A classic counter intuitive is the Monty Hall problem. Here’s the scenario, if you never heard of it: you are on a game show where you can choose one of three doors and if a prize is hidden behind this door, you win this prize. Here’s the twist: after you choose a door, the game show host open one of the two other doors to show that there’s no prize behind it. At this point, you can choose to look behind the door you choose in the first place to see if there’s a prize or you can choose to switch door and look behind the door you left out.

  1. Simulate 10 000 games where you choose to look behind the first door you have chosen to estimate the probability of winning if you choose to look behind this door.
  2. Repeat this process, but this time choose to switch door.
  3. Why the probabilities are different?

When you pick the first door, you have 1/3 chance to have the right door. When the show host open one of the door you didn’t pick he gives you a huge amount of information on where the price is because he opened a door with no prize behind it. So the second door has more chance to hide the prize than the door you took in the first place. Our simulation tell us that this probability is about 1/2. So, you should always switch door since this gives you a higher probability of winning the prize.

To better understand this, imagine that the Grand Canyon is filled with small capsule with a volume of a cube centimeter. Of all those capsules only one has a piece of paper and if you pick this capsule, you win a 50% discount on a tie. You choose a capsule at random and then all the other trillion capsules are discarded except one, such than the winning capsule is still in play. Assuming you really want this discount, which capsule would you choose?

Exercise 6
This problem is a real life example of a statistical pitfall that can easily be encountered in real life and has been published by Steven A. Julious and Mark A. Mullee. In this dataset, we can see if a a medical treatment for kidney stone has been effective. There are two treatments that can be used: treatment A which include all open surgical procedure and treatment B which include small puncture surgery and the kidney stone are classified in two categories depending on his size, small or large stones.

  1. Compute the success rate (number of success/total number of cases) of both treatments.
  2. Which treatment seems the more successful?
  3. Create a contingency table of the success.
  4. Compute the success rate of both treatments when treating small kidney stones.
  5. Compute the success rate of both treatments when treating large kidney stones.
  6. Which treatment is more successful for small kidney stone? For large kidney stone?

This is an example of the Simpson paradox, which is a situation where an effect appears to be present for the set of all observations, but disappears when the observations are categorized and the analysis is done on each group. It is important to test for this phenomenon since in practice most observations can be classified in sub classes and, as the last example showed, this can change drastically the result of your analysis.

Exercise 7

  1. Download this dataset and do a linear regression with the variable X and Y. Then, compute the slope of the trend line of the regression.
  2. Do a scatter plot of the variable X and Y and add the trend line to the graph.
  3. Repeat this process of each of the three categories.

We can see that the general trend of the data is different from the trends of each of the categories. In other words, the Simpson paradox can also be observed in a regression context. The moral of the story is: make sure that all the variables are included in your analysis or you gonna have a bad time!

Exercise 8
For this problem you must know what’s a true positive, false positive, true negative and false negative in a classification problem. You can look at this page for a quick review of those concepts.

A big data algorithm has been developed to detect potential terrorist by looking at their behavior on the internet, their consummation habit and their traveling. To develop this classification algorithm, the computer scientist used data from a population where there’s a lot of known terrorist since they needed data about the habits of real terrorist to validate their work. In this dataset, you will find observations from this high risk population and observations taken from a low risk population.

  1. Compute the true positive rate, the false positive rate, the true negative rate and the false negative rate of this algorithm for the population that has a high risk of terrorism.
  2. Repeat this process for the remaining observations. Is there a difference between those rate?

It is a known fact that false positive rate are a lot higher in low-incidence population and this is known as . Basically, when the incidence of a certain condition in the population is lower than the average false positive rate of a test, using that test on this population will result in a much higher false positive cases than usual. This is in part due to the fact that the diminution of true positive case make the proportion of false positive so much higher. As a consequence: don’t trust to much your classification algorithm!

Exercise 9

  1. Generate a population of 10000 values from a normal distribution of mean 42 and standard deviation of 10.
  2. Create a sample of 10 observations and estimate the mean of the population. Repeat this 200 times.
  3. Compute the variation of the estimation.
  4. Create a sample of 50 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
  5. Create a sample of 100 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
  6. Create a sample of 500 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
  7. Plot the variance of the estimation of the means done with different sample size.

As you can see, the variance of the estimation of the mean is inversely proportional to the sample size, but this is not a linear relationship. A small sample can create an estimation that is a lot farther to the real value than a sample with more observations. Let’s see why this information is relevant to this set.
Exercise 10
A private school advertise that their small size help their student achieve better grade. In their advertisement, they claim that last year’s students have had an average 5 points higher than the average at the standardize state’s test and since no large school has such a high average, that’s proof that small school help student achieve better results.

Suppose that there is 200000 students in the state, their results at the state test was distributed normally with a mean of 76% and a standard deviation of 15, the school had 100 students and that an average school count 750 student. Does the school claim can be explained statistically?

A school can be seen as a sample of the population of student. A large school, like a large sample, has a lot more chance to be representative of the student’s population and their average score will often be near the population average, while small school can show average a lot more extreme just because they have a smaller body of student. I’m not saying that no school are better than other, but we must look at a lot of results to be sure we are not only in presence of a statistical abnormality.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Getting the right uncertainties when fitting multilevel models

Cesare Aloisi writes:

I am writing you regarding something I recently stumbled upon in your book Data Analysis Using Regression and Multilevel/Hierarchical Models which confused me, in hopes you could help me understand it. This book has been my reference guide for many years now, and I am extremely grateful for everything I learnt from you.

On page 261, a 95% confidence interval for the intercept in a certain group (County 26) is calculated using only the standard error of the “random effect” (the county-level error). The string is as follows:

coef(M1)$county[26,1] + c(-2,2)*se.ranef(M1)$county[26]

My understanding is that, since the group-level prediction (call it y.hat_j = coef(M1)$county[26,1]) is a linear combination of a global average and a group-level deviation from the average (y.hat_j = beta_0 + eta_j), then the variance of y.hat_j should be the sum of the covariances of beta_0 and eta_j, not just the variance of eta_j, as the code on page 261 seems to imply. In other words:

Var(y.hat_j) = Var(beta_0) + Var(eta_j) + 2Cov(beta_0, eta_j)

Admittedly, lme4 does not provide an estimate for the last term, the covariance between “fixed” and “random” effects. Was the code used in the book to simplify the calculations, or was there some deeper reason to it that I failed to grasp?

My reply: The short answer is that it’s difficult to get this correct in lmer but very easy when using stan_lmer() in the rstanarm package. That’s what I recommend, and that’s what we’ll be doing in the 2nd edition of our book.

The post Getting the right uncertainties when fitting multilevel models appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Getting started with Deep Learning for Computer Vision with Python

This blog post is intended for readers who have purchased a copy of my new book, Deep Learning for Computer Vision with Python.

Inside this tutorial you’ll learn how to:

  • Download the books, code, datasets, and any extras associated with your purchase.
  • Obtain your email receipt and invoice.
  • Access the companion website associated with Deep Learning for Computer Vision with Python.
  • Post an issue, submit a bug, or report a typo using the companion website.
  • Reactivate an expired download link.

If you have any other questions related to the book, please send me an email or use the contact form.

Getting started with Deep Learning for Computer Vision with Python

Thank you for picking up a copy of Deep Learning for Computer Vision with Python

I appreciate your support of both myself and the PyImageSearch blog. Without you, PyImageSearch would not be possible.

My goal is to ensure you receive a huge return on both your investment of time and finances. To ensure you get off on the right foot, this guide will help you get started with your brand new copy of Deep Learning for Computer Vision with Python.

Downloading the files

After you successfully checkout and purchase your copy of Deep Learning for Computer Vision with Python you will be redirected to a page that looks similar to the one below:

Figure 1: The “Downloads Page” you can use to download the files associated with your purchase of Deep Learning for Computer Vision with Python.

This is your purchase page and where you will be able to download your files. Left click on each file and your download will start.

All files that start with the prefix

  are part of the Starter Bundle. Files that start with
  are part of the Practitioner Bundle. And finally, file names that start with
  are part of the ImageNet Bundle.

File names that include

  contain the PDF of the respective bundle. File names including
  contain your code/datasets associated for the bundle. For example, the file name
  contains all code/datasets associated with the Starter Bundle. The file name
  contains your PDF of the Starter Bundle.

Finally, the
  file contains your pre-configured Ubuntu VirtualBox virtual machine.

Note: At this time only the Starter Bundle contents have been released. The contents of the Practitioner Bundle and ImageNet Bundle will be released in October.

If you close this tab in your browser and need to access it again, simply:

  1. Open up your inbox.
  2. Find the email receipt (see section below).
  3. Click on the “View Purchase Online” link.

From there you’ll be able to access the downloads page.

Please go ahead and download these files at your earliest convenience. The service I use to handle payments and distribution of digital downloads automatically expires URLs after four days for security reasons. If your download ever expires, no problem at all, just refer to the “Reactivating an expired download” section below.

Your email receipt and invoice

A few minutes after you purchase your copy of Deep Learning for Computer Vision with Python you’ll receive an email with the subject: “Your purchase from PyImageSearch is complete”.

Inside this email you’ll find links to view/print your invoice as well as access the downloads page:

Figure 2: After purchasing your copy of Deep Learning for Computer Vision with Python you will receive an email containing your receipt/invoice and link to re-access the downloads page.

If you did not receive this email, please ensure you are examining the inbox/email address you used when checking out. If you used PayPal you’ll want to check the email address associated with your PayPal account.

If you still cannot find the email, no worries! Please just email me or send me a message from via the contact form and include any pertinent information, such as:

  • The email address the purchase should be listed under.
  • Your name.
  • Any other relevant information you may have (purchase number, whether the payment was made via credit card or PayPal, if a friend/colleague purchased for you etc.).

From there I can double-check the database and ensure you receive your email receipt and downloads link.

Accessing the companion website

Your purchase of Deep Learning for Computer Vision with Python includes access to the supplementary material/companion website.

To access the companion website:

  1. Download the PDF of the Starter Bundle.
  2. Open the Starter Bundle to the “Companion Website” section (page 15 of the PDF).
  3. Follow the link to the companion website.
  4. Register your account on the companion website by creating a username and password.

From there you’ll be able to access the companion website:

Figure 3: The Deep Learning for Computer Vision with Python companion website.

Right now the companion website includes links to (1) configure your development environment and (2) report a bug. In the future this website will contain additional supplementary material.

Posting an issue, bug report, or typo

The most important reason you should create your account on the companion website is to report an issue, bug, or typo.

You can do this by clicking the “Issues” button in the header of the companion website:

Figure 4: If you encounter an error when using the book, please check the “Issues” page inside the companion website.

You’ll then see a list of all open tickets.

You can search these tickets by clicking the “Apply Filters” button.

If no ticket matches your query, click “Create New Ticket” and fill out the required fields:

Figure 5: If no (already submitted) bug report matches your error, please create a new ticket so myself and others in the PyImageSearch community can help you.

From there, myself and the rest of the PyImageSearch community can help you with the problem.

You can always email me regarding any issues as well; however, I may refer you to the companion website to post the bug so:

  1. I can keep track of the issue and ensure your problem is resolved in a timely manner.
  2. Other readers can learn from the issue if they encounter it as well.

Since Deep Learning for Computer Vision with Python is a brand new book, there are bound to be many questions. By using the issue tracker we can keep all bugs organized while ensuring the community can learn from other questions as well.

Reactivating an expired download

The service I use to handle payments and distribution of digital downloads automatically expires URLs after four days for security reasons.

If your URL ever expires, no problem at all — simply email me or send me a message and I can reactivate your purchase for you.


In this tutorial you learned how to get started with your new purchase of Deep Learning for Computer Vision with Python.

If you have a question that is not discussed in this guide, please shoot me an email or send me a message — I’ll be happy to discuss the problem with you.

Otherwise, if your question is specifically related to a chapter, a piece of code, an error message, or anything pertinent to the actual contents of the book, please refer to the “Posting an issue, bug report, or typo” section above.

Thank you again for purchasing a copy of Deep Learning for Computer Vision with Python.

I feel incredibly excited and privileged to guide you on your journey to deep learning mastery.

Without you, this blog would not be possible.

Have a wonderful day and happy reading!

P.S. If you haven’t already purchased a copy of Deep Learning for Computer Vision with Python, you can do so here.

The post Getting started with Deep Learning for Computer Vision with Python appeared first on PyImageSearch.

Continue Reading…


Read More

Whats new on arXiv

Practical Machine Learning for Cloud Intrusion Detection: Challenges and the Way Forward

Operationalizing machine learning based security detections is extremely challenging, especially in a continuously evolving cloud environment. Conventional anomaly detection does not produce satisfactory results for analysts that are investigating security incidents in the cloud. Model evaluation alone presents its own set of problems due to a lack of benchmark datasets. When deploying these detections, we must deal with model compliance, localization, and data silo issues, among many others. We pose the problem of ‘attack disruption’ as a way forward in the security data science space. In this paper, we describe the framework, challenges, and open questions surrounding the successful operationalization of machine learning based security detections in a cloud environment and provide some insights on how we have addressed them.

Deconvolutional Latent-Variable Model for Text Sequence Matching

A latent-variable model is introduced for text matching, inferring sentence representations by jointly optimizing generative and discriminative objectives. To alleviate typical optimization challenges in latent-variable models for text, we employ deconvolutional networks as the sequence decoder (generator), providing learned latent codes with more semantic information and better generalization. Our model, trained in an unsupervised manner, yields stronger empirical predictive performance than a decoder based on Long Short-Term Memory (LSTM), with less parameters and considerably faster training. Further, we apply it to text sequence-matching problems. The proposed model significantly outperforms several strong sentence-encoding baselines, especially in the semi-supervised setting.

Feature Engineering for Predictive Modeling using Reinforcement Learning

Feature engineering is a crucial step in the process of predictive modeling. It involves the transformation of given feature space, typically using mathematical functions, with the objective of reducing the modeling error for a given target. However, there is no well-defined basis for performing effective feature engineering. It involves domain knowledge, intuition, and most of all, a lengthy process of trial and error. The human attention involved in overseeing this process significantly influences the cost of model generation. We present a new framework to automate feature engineering. It is based on performance driven exploration of a transformation graph, which systematically and compactly enumerates the space of given options. A highly efficient exploration strategy is derived through reinforcement learning on past examples.

Lazy stochastic principal component analysis

Stochastic principal component analysis (SPCA) has become a popular dimensionality reduction strategy for large, high-dimensional datasets. We derive a simplified algorithm, called Lazy SPCA, which has reduced computational complexity and is better suited for large-scale distributed computation. We prove that SPCA and Lazy SPCA find the same approximations to the principal subspace, and that the pairwise distances between samples in the lower-dimensional space is invariant to whether SPCA is executed lazily or not. Empirical studies find downstream predictive performance to be identical for both methods, and superior to random projections, across a range of predictive models (linear regression, logistic lasso, and random forests). In our largest experiment with 4.6 million samples, Lazy SPCA reduced 43.7 hours of computation to 9.9 hours. Overall, Lazy SPCA relies exclusively on matrix multiplications, besides an operation on a small square matrix whose size depends only on the target dimensionality.

Handling Factors in Variable Selection Problems

Factors are categorical variables, and the values which these variables assume are called levels. In this paper, we consider the variable selection problem where the set of potential predictors contains both factors and numerical variables. Formally, this problem is a particular case of the standard variable selection problem where factors are coded using dummy variables. As such, the Bayesian solution would be straightforward and, possibly because of this, the problem, despite its importance, has not received much attention in the literature. Nevertheless, we show that this perception is illusory and that in fact several inputs like the assignment of prior probabilities over the model space or the parameterization adopted for factors may have a large (and difficult to anticipate) impact on the results. We provide a solution to these issues that extends the proposals in the standard variable selection problem and does not depend on how the factors are coded using dummy variables. Our approach is illustrated with a real example concerning a childhood obesity study in Spain.

Class-Splitting Generative Adversarial Networks

Generative Adversarial Networks (GANs) produce systematically better quality samples when class label information is provided., i.e. in the conditional GAN setup. This is still observed for the recently proposed Wasserstein GAN formulation which stabilized adversarial training and allows considering high capacity network architectures such as ResNet. In this work we show how to boost conditional GAN by augmenting available class labels. The new classes come from clustering in the representation space learned by the same GAN model. The proposed strategy is also feasible when no class information is available, i.e. in the unsupervised setup. Our generated samples reach state-of-the-art Inception scores for CIFAR-10 and STL-10 datasets in both supervised and unsupervised setup.

Neural Optimizer Search with Reinforcement Learning

We present an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. We train a Recurrent Neural Network controller to generate a string in a domain specific language that describes a mathematical update equation based on a list of primitive functions, such as the gradient, running average of the gradient, etc. The controller is trained with Reinforcement Learning to maximize the performance of a model after a few epochs. On CIFAR-10, our method discovers several update rules that are better than many commonly used optimizers, such as Adam, RMSProp, or SGD with and without Momentum on a ConvNet model. We introduce two new optimizers, named PowerSign and AddSign, which we show transfer well and improve training on a variety of different tasks and architectures, including ImageNet classification and Google’s neural machine translation system.

Analyzing users’ sentiment towards popular consumer industries and brands on Twitter

Social media serves as a unified platform for users to express their thoughts on subjects ranging from their daily lives to their opinion on consumer brands and products. These users wield an enormous influence in shaping the opinions of other consumers and influence brand perception, brand loyalty and brand advocacy. In this paper, we analyze the opinion of 19M Twitter users towards 62 popular industries, encompassing 12,898 enterprise and consumer brands, as well as associated subject matter topics, via sentiment analysis of 330M tweets over a period spanning a month. We find that users tend to be most positive towards manufacturing and most negative towards service industries. In addition, they tend to be more positive or negative when interacting with brands than generally on Twitter. We also find that sentiment towards brands within an industry varies greatly and we demonstrate this using two industries as use cases. In addition, we discover that there is no strong correlation between topic sentiments of different industries, demonstrating that topic sentiments are highly dependent on the context of the industry that they are mentioned in. We demonstrate the value of such an analysis in order to assess the impact of brands on social media. We hope that this initial study will prove valuable for both researchers and companies in understanding users’ perception of industries, brands and associated topics and encourage more research in this field.

Uniquely labelled geodesics of Coxeter groups
Anisotropic Functional Fourier Deconvolution from indirect long-memory observations
Numerical reconstruction of the first band(s) in an inverse Hill’s problem
Extreme Value Estimation for Discretely Sampled Continuous Processes
Data-Driven Model Predictive Control of Autonomous Mobility-on-Demand Systems
Inter-Subject Analysis: Inferring Sparse Interactions with Dense Intra-Graphs
Minimum Covariance Determinant and Extensions
Multi-Resolution Functional ANOVA for Large-Scale, Many-Input Computer Experiments
Multi-camera Multi-Object Tracking
A Unified Approach to the Global Exactness of Penalty and Augmented Lagrangian Functions I: Parametric Exactness
Estimated Depth Map Helps Image Classification
A Deep-Reinforcement Learning Approach for Software-Defined Networking Routing Optimization
A Flocking-based Approach for Distributed Stochastic Optimization
On the Design of LQR Kernels for Efficient Controller Learning
On Compiling DNNFs without Determinism
Near Optimal Sketching of Low-Rank Tensor Regression
Covert Wireless Communication with Artificial Noise Generation
Persistence Flamelets: multiscale Persistent Homology for kernel density exploration
Talagrand Concentration Inequalities for Stochastic Partial Differential Equations
Supervised Learning with Indefinite Topological Kernels
On the Use of Machine Translation-Based Approaches for Vietnamese Diacritic Restoration
Statistical Methods for Ecological Breakpoints and Prediction Intervals
Cost Adaptation for Robust Decentralized Swarm Behaviour
Variational Memory Addressing in Generative Models
Irreversibility of mechanical and hydrodynamic instabilities
Discrete-Time Polar Opinion Dynamics with Susceptibility
Accelerating PageRank using Partition-Centric Processing
Deep Recurrent NMF for Speech Separation by Unfolding Iterative Thresholding
Hypergraph Theory: Applications in 5G Heterogeneous Ultra-Dense Networks
Maximal Moments and Uniform Modulus of Continuity for Stable Random Fields
The k-tacnode process
Fractional iterated Ornstein-Uhlenbeck Processes
Learning RBM with a DC programming Approach
Large Vocabulary Automatic Chord Estimation Using Deep Neural Nets: Design Framework, System Variations and Limitations
Local Private Hypothesis Testing: Chi-Square Tests
SceneCut: Joint Geometric and Object Segmentation for Indoor Scenes
Chromatic number, Clique number, and Lovász’s bound: In a comparison
Semi-Automated Nasal PAP Mask Sizing using Facial Photographs
SpectralFPL: Online Spectral Learning for Single Topic Models
Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization
Analysis of Wireless-Powered Device-to-Device Communications with Ambient Backscattering
Convergence characteristics of the generalized residual cutting method
Visual Question Generation as Dual Task of Visual Question Answering
Temporal Multimodal Fusion for Video Emotion Classification in the Wild
The size of $3$-uniform hypergraphs with given matching number and codegree
A First Derivative Potts Model for Segmentation and Denoising Using MILP
3D Deformable Object Manipulation using Fast Online Gaussian Process Regression
Human Pose Estimation using Global and Local Normalization
Self-Dual Codes better than the Gilbert–Varshamov bound
Convolutional neural networks that teach microscopes how to image
Learning Complex Swarm Behaviors by Exploiting Local Communication Protocols with Deep Reinforcement Learning
Bayesian nonparametric inference for the M/G/1 queueing systems based on the marked departure process
Neural network identification of people hidden from view with a single-pixel, single-photon detector
Sorting with Recurrent Comparison Errors
Real-time predictive maintenance for wind turbines using Big Data frameworks
Assumption-Based Approaches to Reasoning with Priorities
Hysteretic percolation from locally optimal decisions
A Communication-Efficient Distributed Data Structure for Top-k and k-Select Queries
The power of big data sparse signal detection tests on nonparametric detection boundaries
Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer’s Disease
On Composite Quantum Hypothesis Testing
A New Framework for $\mathcal{H}_2$-Optimal Model Reduction
Hybrid Beamforming Based on Implicit Channel State Information for Millimeter Wave Links
Speech Recognition Challenge in the Wild: Arabic MGB-3
Secure Energy Efficiency Optimization for MISO Cognitive Radio Network with Energy Harvesting
Blood-based metabolic signatures in Alzheimer’s disease
Alternating least squares as moving subspace correction
Spectral Asymptotics for Krein-Feller-Operators with respect to Random Recursive Cantor Measures
Connectedness of random set attractors
On the distribution of monochromatic complete subgraphs and arithmetic progressions
Influence of Clustering on Cascading Failures in Interdependent Systems
Down the Large Rabbit Hole
Playing for Benchmarks
AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection
Stochastic parameterization identification using ensemble Kalman filtering combined with expectation-maximization and Newton-Raphson maximum likelihood methods
Density of the set of probability measures with the martingale representation property
H-DenseUNet: Hybrid Densely Connected UNet for Liver and Liver Tumor Segmentation from CT Volumes
Symbolic Optimal Control
Efficient Column Generation for Cell Detection and Segmentation
Beyond the Sharp Null: Randomization Inference, Bounded Null Hypotheses, and Confidence Intervals for Maximum Effects
On the multi-dimensional elephant random walk
Extended-Alphabet Finite-Context Models
Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness
Non-Depth-First Search against Independent Distributions on an AND-OR Tree
Stable-like fluctuations of Biggins’ martingales
Multi-label Pixelwise Classification for Reconstruction of Large-scale Urban Areas
On the precise determination of the Tsallis parameters in proton – proton collisions at LHC energies
Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE
Distributed Submodular Minimization And Motion Coordination Over Discrete State Space
Berezinskii-Kosteriltz-Thouless transition in disordered multi-channel Luttinger liquids
If and When a Driver or Passenger is Returning to Vehicle: Framework to Infer Intent and Arrival Time
Urban Land Cover Classification with Missing Data Using Deep Convolutional Neural Networks
On Andrews–Warnaar’s identities of partial theta functions
A new ‘3D Calorimetry’ of hot nuclei
Inducing Distant Supervision in Suggestion Mining through Part-of-Speech Embeddings
Quantum Autoencoders via Quantum Adders with Genetic Algorithms
Bidirected Graphs I: Signed General Kotzig-Lovász Decomposition
On the $l^p$-norm of the Discrete Hilbert transform
Learned Features are better for Ethnicity Classification
Dynamic Evaluation of Neural Sequence Models
Perturbative Black Box Variational Inference

Continue Reading…


Read More

R Packages worth a look

Gui for Simulating Time Series (tsgui)
This gui shows realisations of times series, currently ARMA and GARCH processes. It might be helpful for teaching and studying.

Sample Size Calculation for Mean and Proportion Comparisons in Phase 3 Clinical Trials (SampleSize4ClinicalTrials)
The design of phase 3 clinical trials can be classified into 4 types: (1) Testing for equality;(2) Superiority trial;(3) Non-inferiority trial; and (4) Equivalence trial according to the goals. Given that none of the available packages combines these designs in a single package, this package has made it possible for researchers to calculate sample size when comparing means or proportions in phase 3 clinical trials with different designs. The ssc function can calculate the sample size with pre-specified type 1 error rate,statistical power and effect size according to the hypothesis testing framework. Furthermore, effect size is comprised of true treatment difference and non-inferiority or equivalence margins which can be set in ssc function. (Reference: Yin, G. (2012). Clinical Trial Design: Bayesian and Frequentist Adaptive Methods. John Wiley & Sons.)

Rcmdr’ Plugin for Alpha-Sutte Indicator ‘sutteForecastR’ (RcmdrPlugin.sutteForecastR)
The ‘sutteForecastR’ is a package of Alpha-Sutte indicator. To make the ‘sutteForecastR’ user friendly, so we develop an ‘Rcmdr’ plug-in based on the Alpha-Sutte indicator function.

Simulation Studies with Stan (rstansim)
Provides a set of functions to facilitate and ease the running of simulation studies of Bayesian models using ‘stan’. Provides functionality to simulate data, fit models, and manage simulation results.

Credit Risk Scorecard (scorecard)
Makes the development of credit risk scorecard easily and efficiently by providing functions such as information value, variable filter, optimal woe binning, scorecard scaling and performance evaluation etc. The references including: 1. Refaat, M. (2011, ISBN: 9781447511199). Credit Risk Scorecard: Development and Implementation Using SAS. 2. Siddiqi, N. (2006, ISBN: 9780471754510). Credit risk scorecards. Developing and Implementing Intelligent Credit Scoring.

An Orthogonality Constrained Optimization Approach for Semi-Parametric Dimension Reduction Problems (orthoDr)
Utilize an orthogonality constrained optimization algorithm of Wen & Yin (2013) <DOI:10.1007/s10107-012-0584-1> to solve a variety of dimension reduction problems in the semiparametric framework, such as Ma & Zhu (2013) <DOI:10.1214/12-AOS1072>, and Sun, Zhu, Wang & Zeng (2017) <arXiv:1704.05046>. It also serves as a general purpose optimization solver for problems with orthogonality constraints.

Continue Reading…


Read More

How Random Forests improve simple Regression Trees?

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

Regression Trees

In this post I am going to discuss some features of Regression Trees an Random Forests. Regression Trees are know to be very unstable, in other words, a small change in your data may drastically change your model. The Random Forest uses this instability as an advantage through bagging (you can see details about bagging here) resulting on a very stable model.

The first question is how a Regression Tree works. Suppose, fore example, that we have the number of points scored by a set of basketball players and we want to relate it to the player’s weight an height. The Regression Tree will simply split the height-weight space and assign a number of points to each partition. The figure below shows two different representations for a small tree. In the left we have the tree itself and in the right how the space is partitioned (the blue line shows the first partition and the red lines the following partitions). The numbers in the end of the tree (and in the partitions) represent the value of the response variable. Therefore, if a basketball player is higher than 1.85 meters and weights more than 100kg it is expected to score 27 points (I invented this data =] ).


You might be asking how I chose the partitions. In general, in each node the partition is chosen through a simple optimization problem to find the best pair variable-observation based on how much the new partition reduces the model error.

What I want to illustrate here is how unstable a Regression Tree can be. The package tree has some examples that I will follow here with some small modifications. The example uses computer CPUs data and the objective is to build a model for the CPU performance based on some characteristics. The data has 209 CPU observations that will be used to estimate two Regression Trees. Each tree will be estimate from a random re-sample with replacement. Since the data comes from the same place, it would be desirable to have similar results on both models.


data(cpus, package = "MASS") # = Load Data

# = First Tree
set.seed(1) # = Seed for Replication
tree1 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
             data = cpus[sample(1:209, 209, replace = TRUE), ])
plot(tree1);  text(tree1)

plot of chunk unnamed-chunk-28

# = Second Tree
tree2 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
             data = cpus[sample(1:209,209, replace=TRUE), ])
plot(tree2);  text(tree2)

plot of chunk unnamed-chunk-28

As you can see, the two trees are different from the start. We can use some figures to verify. First let us calculate the predictions of each model in the real data (not the re-sample). The first figure is a scatterplot of both predictions and the second figure shows their boxplots. Although the scatterplot shows some relation between the two predictions, it is far from good.

# = Calculate predictions
pred = data.frame(p1 = predict(tree1, cpus) ,p2 = predict(tree2, cpus))
# = Plots
g1 = ggplot(data = pred) + geom_point(aes(p1, p2))
g2 = ggplot(data = melt(pred)) + geom_boxplot(aes(variable, value))
grid.arrange(g1, g2, ncol = 2)

plot of chunk unnamed-chunk-29

Random Forest

As mentioned before, the Random Forest solves the instability problem using bagging. We simply estimate the desired Regression Tree on many bootstrap samples (re-sample the data many times with replacement and re-estimate the model) and make the final prediction as the average of the predictions across the trees. There is one small (but important) detail to add. The Random Forest adds a new source of instability to the individual trees. Every time we calculate a new optimal variable-observation point to split the tree, we do not use all variables. Instead, we randomly select 2/3 of the variables. This will make the individual trees even more unstable, but as I mentioned here, bagging benefits from instability.

The question now is: how much improvement do we get from the Random Forest. The following example is a good illustration. I broke the CPUs data into a training sample (first 150 observations) and a test sample (remaining observations) and estimated a Regression Tree and a Random Forest. The performance is compared using the mean squared error.

# = Regression Tree
tree_fs = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
               data = cpus[1:150, ])

# = Random Forest
set.seed(1) # = Seed for replication
rf = randomForest(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
                  data=cpus[1:150, ], nodesize = 10, importance = TRUE)

# = Calculate MSE
mse_tree = mean((predict(tree_fs, cpus[-c(1:150), ]) - log(cpus$perf)[-c(1:150)])^2)
mse_rf = mean((predict(rf, cpus[-c(1:150), ]) - log(cpus$perf[-c(1:150)]))^2)

c(rf = mse_rf, tree = mse_tree)
##        rf      tree
## 0.2884766 0.5660053

As you can see, the Regression Tree has an error twice as big as the Random Forest. The only problem is that by using a combination of trees any kind of interpretation becomes really hard. Fortunately, there are importance measures that allow us to at least know which variables are more relevant in the Random Forest. In our case, both importance measures pointed to the cache size as the most important variable.

##        %IncMSE IncNodePurity
## syct  22.60512     22.373601
## mmin  19.46153     21.965340
## mmax  24.84038     27.239772
## cach  27.92483     33.536185
## chmin 13.77196     13.352793
## chmax 17.61297      8.379306

Finally, we can see how the model error decreases as we increase the number of trees in the Random Forest with the following code:


plot of chunk unnamed-chunk-32

If you liked this post, you can find more details on Regression Trees and Random forest in the book Elements of Statistical learning, which can be downloaded direct from the authors page here.

To leave a comment for the author, please follow the link and comment on their blog: R – insightR. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Book Memo: “Probabilistic Metric Spaces”

This distinctly nonclassical treatment focuses on developing aspects that differ from the theory of ordinary metric spaces, working directly with probability distribution functions rather than random variables. The two-part treatment begins with an overview that discusses the theory’s historical evolution, followed by a development of related mathematical machinery. The presentation defines all needed concepts, states all necessary results, and provides relevant proofs. The second part opens with definitions of probabilistic metric spaces and proceeds to examinations of special classes of probabilistic metric spaces, topologies, and several related structures, such as probabilistic normed and inner-product spaces. Throughout, the authors focus on developing aspects that differ from the theory of ordinary metric spaces, rather than simply transferring known metric space results to a more general setting.

Continue Reading…


Read More

Magister Dixit

“Data science can directly enable a strategic differentiator if the company’s core competency depends on its data and analytic capabilities. When this happens, the company becomes supportive to data science instead of the other way around.” Eric Colson, Brad Klingenberg, Jeff Magnusson ( March 31, 2015 )

Continue Reading…


Read More

September 22, 2017

If you did not already know

Apache Lucy google
The Apache Lucy search engine library provides full-text search for dynamic programming languages. …

TrueSkill Ranking System google
TrueSkill is a Bayesian ranking algorithm developed by Microsoft Research and used in the Xbox matchmaking system built to address some perceived flaws in the Elo rating system. It is an extension of the Glicko rating system to multiplayer games. The purpose of a ranking system is to both identify and track the skills of gamers in a game (mode) in order to be able to match them into competitive matches. The TrueSkill ranking system only uses the final standings of all teams in a game in order to update the skill estimates (ranks) of all gamers playing in this game. Ranking systems have been proposed for many sports but possibly the most prominent ranking system in use today is ELO. …

Intrablocks Correspondence Analysis (IBCA) google
We propose a new method to describe contingency tables with double partition structures in columns and rows. Furthermore, we propose new superimposed representations, based on the introduction of variable dilations for the partial clouds associated with the partitions of the columns and the rows. …

Continue Reading…


Read More

Welcome to R/exams

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Welcome everybody, we are proud to introduce the brand new web page and
blog This provides a central access point for
the open-source software “exams” implemented in the
R system for statistical computing.
R/exams is a one-for-all exams generator using (potentially) dynamic
exercises written in R/Markdown or R/LaTeX and it can export a variety of
formats from learning management systems to classical written exams.
This post gives a brief overview of what has happened so far and what
you can expect to find here in the next months.



The package was originally started more than a decade ago to facilitate
classical written exams with randomized exercises for large-lecture courses.
Like many other teachers of introductory statistics and mathematics courses,
we were in need of infrastructure for conducting written exams with about 1,000-1,500
students. A team effort of colleagues at WU Wirtschaftsuniversität Wien
lead to a large collection of dynamic exercises and the software was eventually
released at

Over the years learning management systems (like
OLAT, etc.)
became easily available at virtually every university, creating a desire to
employ the same dynamic exercises also for online tests and quizzes. Hence,
the R/exams infrastructure was completely reimplemented allowing to export
the same exercises not only to written exams (with automatic scanning
and evaluation) but also to learning management systems, the live voting
system ARSnova, as well as customizable standalone
files in PDF, HTML, Docx, and beyond.

Despite (or rather because of) the flexibility of the software, novice R/exams
users often struggled with adopting it because the documentation provided in
the package is either somewhat technical and/or targeted towards more experienced
R users.


Hence, this web page and blog make it easy for new users to explore the possibilities
of R/exams before reading about a lot of technical details. It also provides accessible
guides to common tasks and examples for dynamic exercises with different complexity.
For a first tour you can check out the one-for-all approach of
the package based on (potentially) dynamic exercise templates
for generating large numbers of personalized exams/quizzes/tests, either for
e-learning or classical written exams (with
automatic evaluation).

Some tutorials already explain the installation of R/exams (with
dependencies like LaTeX or pandoc) and the first steps in writing dynamic exercises
using either Markdown or LaTeX markup along with randomization in R. There is
also a gallery of exercise templates, ranging from basic multiple-choice
questions with simple shuffling to more advanced exercises with random data, graphics,


For the next months we plan to write more tutorial blog posts that help to accomplish
common tasks, e.g., hands-on guidance for exporting exercises from R to Moodle or
tips how to write good randomized exercises. If you want to give us feedback or ask
us to cover certain topics please feel free to contact us – be it via e-mail,
discussion forum, or twitter. Also if you want to link R/exams-based materials or share
share experiences of using R/exams in a guest post, please let us know.


Big shout to all contributors that helped R/exams to grow so much
over the last years. A special thank you goes to Patrik Keller, Hanna Krimm, and Reto
Stauffer who established the infrastructure for this web page (using R/Markdown and Jekyll)
and designed graphics/icons/photos/etc. Thanks also to everyone sharing this post,
especially on and

To leave a comment for the author, please follow the link and comment on their blog: R/exams. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Distilled News

Data liquidity in the age of inference

It’s a special time in the evolutionary history of computing. Oft-used terms like big data, machine learning, and artificial intelligence have become popular descriptors of a broader underlying shift in information processing. While traditional rules-based computing isn’t going anywhere, a new computing paradigm is emerging around probabilistic inference, where digital reasoning is learned from sample data rather than hardcoded with boolean logic. This shift is so significant that a new computing stack is forming around it with emphasis on data engineering, algorithm development, and even novel hardware designs optimized for parallel computing workloads, both within data centers and at endpoints.

Some Social Network Analysis with Python

The following problems appeared in the programming assignments in the coursera course Applied Social Network Analysis in Python. The descriptions of the problems are taken from the assignments. The analysis is done using NetworkX.

Instacart Market Basket Analysis, Winner’s Interview: 2nd place, Kazuki Onodera

Our recent Instacart Market Basket Analysis competition challenged Kagglers to predict which grocery products an Instacart consumer will purchase again and when. Imagine, for example, having milk ready to be added to your cart right when you run out, or knowing that it’s time to stock up again on your favorite ice cream. This focus on understanding temporal behavior patterns makes the problem fairly different from standard item recommendation, where user needs and preferences are often assumed to be relatively constant across short windows of time. Whereas Netflix might be fine assuming you want to watch another movie similar to the one you just watched, it’s less clear that you’ll want to reorder a fresh batch of almond butter or toilet paper if you bought them yesterday. We interviewed Kazuki Onodera (aka ONODERA on Kaggle), a data scientist at Yahoo! JAPAN, to understand how he used complex feature engineering, gradient boosted tree models, and special modeling of the competition’s F1 evaluation metric to win 2nd place.

A Solution to Missing Data: Imputation Using R

Handling missing values is one of the worst nightmares a data analyst dreams of. In situations, a wise analyst ‘imputes’ the missing values instead of dropping them from the data.

How to escape saddlepoints efficiently

Michael Jordan discusses recent results in gradient-based optimization for large-scale data analysis.

Exploratory Data Analysis of Tropical Storms in R

The disastrous impact of recent hurricanes, Harvey and Irma, generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms so I found a data set on and started some basic Exploratory data analysis (EDA). EDA is crucial to starting any project. Through EDA you can start to identify errors & inconsistencies in your data, find interesting patterns, see correlations and start to develop hypotheses to test. For most people, basic spreadsheets and charts are handy and provide a great place to start. They are an easy-to-use method to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick-off the EDA process but those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, let’s get started.

Multi-Dimensional Reduction and Visualisation with t-SNE

t-SNE is a very powerful technique that can be used for visualising (looking for patterns) in multi-dimensional data. Great things have been said about this technique. In this blog post I did a few experiments with t-SNE in R to learn about this technique and its uses. Its power to visualise complex multi-dimensional data is apparent, as well as its ability to cluster data in an unsupervised way. What’s more, it is also quite clear that t-SNE can aid machine learning algorithms when it comes to prediction and classification. But the inclusion of t-SNE in machine learning algorithms and ensembles has to be ‘crafted’ carefully, since t-SNE was not originally intended for this purpose. All in all, t-SNE is a powerful technique that merits due attention.

30 Essential Data Science, Machine Learning & Deep Learning Cheat Sheets

This collection of data science cheat sheets is not a cheat sheet dump, but a curated list of reference materials spanning a number of disciplines and tools.

Continue Reading…


Read More

Esri: Data Scientist

Seeking a Data Scientist with excellent conversation and communication skills, passionate about changing the world, literally, through machine learning.

Continue Reading…


Read More

Because it's Friday: Blue skies or SkyNet?

I enjoyed attending the O'Reilly AI conference in San Francisco this week. There were many thought-provoking talks, but in the days since then my thoughts kept returning to one thing: incentives

One way of training an AI involves a reward function: a simple equation that increases in value as the AI gets closer to its goal. The AI then explores possibilities in such a way that its reward is maximized. I saw one amusing video of a robot arm that was being trained to pick up a block and then extend its arm as far as possible and set the block down at the limits of its reach. The reward function for such an activity is simple: the reward value is the distance of the block from its original position. Unfortunately, the robot learned not to carefully pick up and set down the block, but instead reached back and whacked the block as hard as it possibly could, sending the block careening across the room.

Artifical Intelligences learn do what we incentivize them to to. But what if those incentives end up not aligning with our interests, despite our best intentions? That's the theme of Tim O'Reilly's keynote below, and it's well worth watching.

That's all from is here at the blog for this week. We'll be back on Monday with more: have a great weekend, and see you then!

Continue Reading…


Read More

The Easy Button for R & Python on Spark, Webinar Oct 18

Learn five solid reasons to use managed services for Cloudera for R, Python and other advanced analytics on Spark & Hadoop in the cloud.

Continue Reading…


Read More

Stan Weekly Roundup, 22 September 2017

This week (and a bit from last week) in Stan:

  • Paul-Christian Bürkner‘s paper on brms (a higher-level interface to RStan, which preceded rstanarm and is still widely used and recommended by our own devs) was just published as a JStatSoft article. If you follow the link, the abstract explains what brms does.

  • Ben Goodrich and Krzysztof Sakrejda have been working on standalone functions in RStan. The bottleneck has been random number generators. As an application, users want to write Stan functions that they can use for efficient calculations inside R; it’s easier than C++ and we have a big stats library with derivatives backing it up.

  • Ben Goodrich has also been working on multi-threading capabilities for RStan.

  • Sean Talts has been continuing the wrestling match with continuous integration. Headaches continue from new compiler versions, Travis timeouts, and fragile build scripts.

  • Sean Talts has been working with Sebastian code reviewing MPI. The goal is to organize the code so that it’s easy to test and maintain (the two go together in well written code along with readability and crisp specs for the API boundaries).

  • Sean Talts has been working on his course materials for Simulation-Based Algorithmic Calibration (SBAC), the new name for applying the diagnostic of Cook et al.

  • Bob Carpenter has been working on writing a proper language spec for Stan looking forward to tuples, ragged and sparse structures, and functions for Stan 3. That’s easy; the denotational semantics will be more challenging as it has to be generic in terms of types and discuss how Stan compiles to a function with transforms.

  • Bob Carpenter has also been working on getting variable declarations through the model concept. After that, a simple base class and constant correctness for the model class to make it easier to use outside of Stan’s algorithms.

  • Michael Betancourt and Sean Talts have been prepping their physics talks for the upcoming course at MIT. There are post-it notes at metric heights in our office and they filmed themselves dropping a ball while filming a phone’s stopwatch (clever! hope that’s not too much of a spoiler for the course).

  • Michael Betancourt is also working on organizing the course before StanCon that’ll benefit NumFOCUS.

  • Jonah Gabry and T.J. Mahr released BayesPlot 1.4 with some new visualizations from Jonah’s paper.

  • Jonah Gabry working on Mac/C++ issues with R and has had to communicate with the R devs themselves it’s gotten so deep.

  • Lauren Kennedy has joined the team at Columbia taking over for Jonah Gabry at the population research center in the School of Social Work; she’ll be focusing on population health. She’ll also be working us (specifically with Andrew Gelman) on survey weighting and multilevel regression and poststratification with rstanarm.

  • Dan Simpson has been working on sparse autodiff architecture and talking with Aki and me about parallelization.

  • Dan Simpson and Andrew Gelman have been plotting how to marginalize out random effects in multilevel linear regression.

  • Andrew Gelman has been working with Jennifer Hill and others on revising his and Jennifer’s regression book. It’ll come out in two parts, the first of which is (tentatively?) titled Regression and Other Stories. Jonah Gabry has been working on the R packages for it and beefing up bayesplot and rstanarm to handle it.

  • Andrew Gelman is trying to write all the workflow stuff down with everyone else including Sean Talts, Michael Betancourt, and Daniel Simpson. As usual, a simple request from Andrew to write a short paper has spun out into real developments on the computational and methodological front.

  • Aki Vehtari, Andrew Gelman and others are revising the expectation propagation (EP) paper; we’re excited about the data parallel aspects of this.

  • Aki Vehtari gave a talk on priors for GPs at the summer school for GPs in Sheffield. He reports there were even some Stan users there using Stan for GPs. Their lives should get easier over the next year or two. Videos of the talks: Priors and integration over hyperparameters for GPs and On Bayesian model selection and model averaging.

  • Aki Vehtari reports there are 190 students in his Bayes class in Helsinki!

  • Michael Betancourt, Dan Simpson, and Aki Vehtari wrote comments on a paper about frequentist properties of horseshoe priors. Aki’s revised horseshoe prior paper has also been accepted.

  • Ben Bales wrote some generic array append code and also some vectorized random number generators which I reviewed and should go in soon.

  • Bill Gillespie is working on a piecewise linear interpolation function for Stan’s math library; he already added it to Torsten in advance of the ACoP tutorial he’s doing next month. He’ll be looking at a 1D integrator as an exercise, picking up from where Marco Inacio left off (it’s based on some code by John Cook).

  • Bill Gillespie is trying to hire a replacement for Charles Margossian at Metrum. He’s looking for someone who wants to work on Stan and pharmacology, preferably with experience in C++ and numerical analysis.

  • Krzysztof Sakrejda started a postdoc working on statistical modeling for global scale demographics for reproductive health with Leontine Alkema at UMass Amherst.

  • Krzysztof Sakrejda is working with makefiles trying to simplify them and inadvertently solved some of our clang++ compiler issues for CmdStan.

  • Matthijs Vákár got a pull request in for GLMs to speed up logistic regression by a factor of four or so by introducing analytic derivatives.

  • Matthijs Vákár is also working on higher-order imperative semantics for probabilistic programming languages like Stan.

  • Mitzi Morris finished last changes for pull request on base expression type refactor (this will pave the way for tuples, sparse matrices, ragged arrays, and functional types—hence all the semantic activity).

  • Mitzi Morris is also refactoring the local variable type inference system to squash a meta-bug that surfaced with ternary operators and will simplify the code.

  • Charles Margossian is finishing a case study on the algebraic solver to submit for the extended StanCon deadline. While he’s knee-deep in first-year grad student courses in measure theory and statistics.

  • Breck Baldwin and others have been talking to DataCamp (hi, Rasmus!) and Coursera. We’ll be getting some Stan classes out over the next year or two. Coordinating with DataCamp is easy, Coursera plus Columbia less so.

(video links added by Aki)

The post Stan Weekly Roundup, 22 September 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

New Deep Learning Blog on MATLAB Central

I'd like to invite everyone to check out the new Deep Learning blog that just launched today on MATLAB Central.

I happen to know the author, and he's an OK fellow.

Continue Reading…


Read More

Document worth reading: “A Survey of Neural Network Techniques for Feature Extraction from Text”

This paper aims to catalyze the discussions about text feature extraction techniques using neural network architectures. The research questions discussed in the paper focus on the state-of-the-art neural network techniques that have proven to be useful tools for language processing, language generation, text classification and other computational linguistics tasks. A Survey of Neural Network Techniques for Feature Extraction from Text

Continue Reading…


Read More

Science and Technology links (September 22th, 2017)

Mass is preserved. When you lose fat, where does the fat goes? It has to go somewhere. No, it is not transformed in “energy” or “heat”. Neither “energy” nor “heat” have atomic mass. I asked the question on social networks and most people got the wrong answer. This YouTube video gives the correct answer.

I had been assuming that the US consumption of electricity was on the rise. I have gadgets everywhere around me, these use electricity, right? Actually, no, electricity usage, in absolute value, is stagnant in the US, which means that the per capita usage is down:

Economic growth in advanced economies does not require increased energy consumption. Real GDP rose 12.7% in the U.S. between 2008 and 2015. During the same time period, electric consumption declined by 0.3%.

Climate scientists badly failed to predict CO2 emissions:

Global CO2 emission intensity increased despite all major scenarios projecting a decline.

Influential climate scientists have also revised their predictions:

Computer modelling used a decade ago to predict how quickly global average temperatures would rise may have forecast too much warming, a study has found. (…) The Earth warmed more slowly than the models forecast, meaning the planet has a slightly better chance of meeting the goals set out in the Paris climate agreement, including limiting global warming to 1.5C above pre-industrial levels. (…) The study, published this week in the journal Nature Geoscience, does not play down the threat which climate change has to the environment, and maintains that major reductions in emissions must be attained.(…) But the findings indicate the danger may not be as acute as was previously thought.

So you think you should take your antibiotics even after you feel fine? No. Exposing your body to antibiotics longer than needed is counterproductive, as it helps develop antibiotic resistance. Doctor Brad Spellberg is the chief medical officer for the Los Angeles County, and we writes:

There are some chronic infections, such as tuberculosis, where you do indeed have to take long courses of antibiotics, not to prevent resistance but rather to cure the infection. But for most acute bacterial infections, short courses of antibiotics result in equivalent cure rates and with less chance of causing the emergence of antibiotic resistance among the bacteria in and on your body.

Llewelyn et al. (2017) write:

However, the idea that stopping antibiotic treatment early encourages antibiotic resistance is not supported by evidence, while taking antibiotics for longer than necessary increases the risk of resistance.

(Source: HackerNews).

Uber has taken the world by storm, using a small mobile app to allow people to either offer cab services or call a cab service, without the infrastructure that normally supports cab services. The City of London will not renew Uber’s license. The government fears that Uber is a threat to Londoners’ safety and security.

Human beings’ death rate follows a Gompertz’ law which means that the probability that you will die goes up exponentially, year after year. This explains why, even though there are billions of us, none of us get to live beyond 120 years old. Not all living beings follow a Gompertz’ law. Plants do not. However, wooden utility poles do follow a Gompertz’ law just like us.

P. D. Mangan reports that exercise helps keep cancer away:

Exercise prevents cancer (…) A recent meta-analysis found up to 42% less risk for 10 different cancers, including esophageal, liver, lung, kidney, and many other cancers. (…) An interesting question is how exercise prevents cancer, and some recent research sheds light on this. (…) Fat tissue generates cytokines that promote the proliferation of cancer cells, and physical activity diminishes or abolishes the effect, which is dose-dependent, i.e. more exercise means less cancer promoting cytokines. (…) In animals (mice) that were implanted with tumor cells, voluntary running reduced tumor growth by over 60%. The researchers believe that exercise mobilized natural killer (NK) cells, which attack cancer.

Continue Reading…


Read More

Big Data Analytics with H20 in R Exercises -Part 1

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

We have dabbled with RevoScaleR before , In this exercise we will work with H2O , another high performance R library which can handle big data very effectively .It will be a series of exercises with increasing degree of difficulty . So Please do this in sequence .
H2O requires you to have Java installed in your system .So please install Java before trying with H20 .As always check the documentation before trying these exercise set .
Answers to the exercises are available here.
If you want to install the latest release from H20 , install it via this instructions .

Exercise 1
Download the latest stable release from h20 and initialize the cluster

Exercise 2
Check the cluster information via clusterinfo

Exercise 3
You can see how h2o works via the demo function , Check H2O’s glm via demo method .

Exercise 4

down load the loan.csv from H2O’s github repo and import it using H2O .
Exercise 5
Check the type of imported loan data and notice that its not a dataframe , check the summary of the loan data .
Hint -use h2o.summary()

Exercise 6
One might want to transfer a dataframe from R environment to H2O , use as.h2o to conver the mtcars dataframe as a H2OFrame

Learn more about importing big data in the online course Data Mining with R: Go from Beginner to Advanced. In this course you will learn how to

  • work with different data import techniques,
  • know how to import data and transform it for a specific moddeling or analysis goal,
  • and much more.

Exercise 7

Check the dimension of the loan H2Oframe via h2o.dim

Exercise 8
Find the colnames from the H2OFrame of loan data.

Exercise 9

Check the histogram of the loan amount of loan H2Oframe .

Exercise 10
Find the mean of loan amount by each home ownership group from the loan H2OFrame

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Tutorial: Launch a Spark and R cluster with HDInsight

If you'd like to get started using R with Spark, you'll need to set up a Spark cluster and install R and all the other necessary software on the nodes. A really easy way to achieve that is to launch an HDInsight cluster on Azure, which is just a managed Spark cluster with some useful extra components. You'll just need to configure the components you'll need, in our case R and Microsoft R Server, and RStudio Server.

This tutorial explains how to launch an HDInsight cluster for use with R. It explains how to size the cluster and launch the cluster, connect to it via SSH, install Microsoft R Server (with R) on each of the nodes, and install RStudio Server community edition to use as an IDE on the edge node. (If you find you need a larger or smaller cluster after you've set it up, it's easy to resize the cluster dynamically.) Once you have the cluster up an running, here are some things you can try:

To get started with R and Spark, follow the instructions for setting up a HDInsight cluster at the link below.

Microsoft Azure Documentation: Get started using R Server on HDInsight

Continue Reading…


Read More

Tutorial: Launch a Spark and R cluster with HDInsight

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you'd like to get started using R with Spark, you'll need to set up a Spark cluster and install R and all the other necessary software on the nodes. A really easy way to achieve that is to launch an HDInsight cluster on Azure, which is just a managed Spark cluster with some useful extra components. You'll just need to configure the components you'll need, in our case R and Microsoft R Server, and RStudio Server.

This tutorial explains how to launch an HDInsight cluster for use with R. It explains how to size the cluster and launch the cluster, connect to it via SSH, install Microsoft R Server (with R) on each of the nodes, and install RStudio Server community edition to use as an IDE on the edge node. (If you find you need a larger or smaller cluster after you've set it up, it's easy to resize the cluster dynamically.) Once you have the cluster up an running, here are some things you can try:

To get started with R and Spark, follow the instructions for setting up a HDInsight cluster at the link below.

Microsoft Azure Documentation: Get started using R Server on HDInsight

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Putting Machine Learning in Production

In machine learning, going from research to production environment requires a well designed architecture. This blog shows how to transfer a trained model to a prediction server.

Continue Reading…


Read More

Bay Area Apache Spark Meetup at HPE/Aruba Networks Summary

On September 7th, we held our monthly Bay Area Apache Spark Meetup (BASM) at HPE/Aruba Networks in Santa Clara. We had two Apache Spark related talks: one from Aruba Networks’ Data Engineering team and other from Databricks’ Machine Learning team.

For those who missed the meetup, below is the video and link to each presentation and slides. You can peruse slides and view the video at your leisure. To those who helped and attended, thank you for participating and continued community support. And thanks to HPE/Aruba Networks for hosting and sponsoring our meetup.

Techniques to Use Pyspark for Data Correlation to Support Scalability and Complexity

by John Conley (Aruba Networks)

View the slides here

Web-Scale Graph Analytics with Apache Spark

by Joseph Bradley (Databricks)

View the slides here

What’s Next

Our next Apache Spark Meetup will be a Pre-Spark Summit Meetup, on October 24th, 2017, in Dublin, Ireland. If you live in Dublin or are going to be in Dublin, Ireland, RSVP now!

Also, Spark Summit EU, 2017 will be in Dublin, Ireland, and all members of this group can register with 15% discount code: BayAreaMU


Try Databricks for free. Get started today.

The post Bay Area Apache Spark Meetup at HPE/Aruba Networks Summary appeared first on Databricks.

Continue Reading…


Read More

BigML Summer 2017 Release and Webinar: Deepnets!

BigML’s Summer 2017 Release is here! Join us on Thursday October 5, 2017, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 07:00 PM CEST (Valencia, Spain. GMT +02:00) for a FREE live webinar to discover the latest update of the BigML platform. We will be presenting Deepnets, a highly effective supervised learning method that […]

Continue Reading…


Read More

Ensemble Learning to Improve Machine Learning Results

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), bias (boosting), or improve predictions (stacking).

Continue Reading…


Read More

The Rohingya refugee crisis is the worst in decades

ON AUGUST 25th a group of militant Rohingya Muslims attacked police bases in northern Myanmar. The army retaliated with untrammelled fury, burning villages, killing civilians and raping women. More than 420,000 terrified Rohingyas have crossed the border into Bangladesh.

Continue Reading…


Read More

Your deep learning + Python Ubuntu virtual machine

When it comes to working with deep learning + Python I highly recommend that you use a Linux environment.

Deep learning tools can be more easily configured and installed on Linux, allowing you to develop and run neural networks quickly.

Of course, configuring your own deep learning + Python + Linux development environment can be quite the tedious task, especially if you are new to Linux, a beginner at working the command line/terminal, or a novice when compiling and installing packages by hand.

In order to help you jump start your deep learning + Python education, I have created an Ubuntu virtual machine with all necessary deep learning libraries you need to successful (including Keras, TensorFlow, scikit-learn, scikit-image, OpenCV, and others) pre-configured and pre-installed.

This virtual machine is part of all three bundles of my book, Deep Learning for Computer Vision with Python. After you purchase your copy you’ll be able to download the virtual machine and get started with deep learning immediately.

In the remainder of this tutorial I’ll show you:

  • How to download and install VirtualBox for managing, creating, and importing virtual machines.
  • How to import the pre-configured Ubuntu virtual machine for deep learning.
  • How to access the pre-installed deep learning libraries on the virtual machine.

Let’s go ahead and get started.

Your deep learning + Python virtual machine

Your purchase of Deep Learning for Computer Vision with Python includes a pre-configured Ubuntu virtual machine for deep learning. In the following sections I’ll show you how easy it is to import your Ubuntu deep learning virtual machine.

This tutorial is broken down into three parts to make it easy to digest and understand:

  1. Download and install VirtualBox.
  2. Download and import your pre-configured Ubuntu deep learning virtual machine.
  3. Access the Python development environment inside the deep learning virtual machine.

Step #1: Download and install VirtualBox

The first step is to download VirtualBox, a free open source platform for managing virtual machines.

VirtualBox will run on macOS, Linux, and Windows.

We call the physical hardware VirtualBox is running on your host machine. The virtual machine that will be imported into VirtualBox is the guest machine.

To install VirtualBox, first visit the downloads page and then select the appropriate binaries for your operating system:

Figure 1: VirtualBox downloads.

From there install the software on your system following the provided instructions — I’ll be using macOS in this example, but again, these instructions will also work on Linux and Windows as well:

Figure 2: Installing VirtualBox on macOS

Step #2: Download your deep learning virtual machine

Now that VirtualBox is installed you need to download the pre-configured Ubuntu virtual machine associated with your purchase of Deep Learning for Computer Vision with Python:

Figure 3: Downloading the pre-configured Ubuntu deep learning virtual machine.

The file is approximately 4GB so depending on your internet connection this download make take some time to complete.

Once you have downloaded the
  file, unarchive it and you’ll find a file named
DL4CV Ubuntu VM.ova
. I have placed this file on my Desktop:

Figure 4: The DL4CV Ubuntu VM.ova file.

This is the actual file that you will be importing into the VirtualBox manager.

Step #3: Import the deep learning virtual machine into VirtualBox

Go ahead and open up the VirtualBox manager.

From there select

File => Import Appliance...

Figure 5: Importing the pre-configured Ubuntu deep learning virtual machine.

Once the dialog opens you’ll want to navigate to where the

DL4CV Ubuntu VM.ova
  file resides on disk:

Figure 6: Selecting the pre-configured Ubuntu deep learning virtual machine.

Finally, you can click “Import” and allow the virtual machine to import:

Figure 7: Importing the Ubuntu deep learning virtual machine may take 3-4 minutes depending on your system.

The entire import process should take only a few minutes.

Step #4: Boot the deep learning virtual machine

Now that the deep learning virtual machine has been imported we need to boot it.

From the VirtualBox manager  select the “DL4CV Ubuntu VM” on the left pane of the window and then click “Start”:

Figure 8: Booting the pre-configured Ubuntu deep learning virtual machine.

Once the virtual machine has booted you can login using the following credentials:

  • Username:
  • Password:

Figure 9: Logging into the deep learning virtual machine.

Step #5: Access the deep learning Python virtual environment

The next step after logging into the VM is to launch a terminal:

Figure 10: Launching a terminal window.

From there, execute

workon dl4cv
  to access the Python + deep learning development environment:

Figure 11: Accessing the dl4cv deep learning + Python development environment.

Notice that my prompt now has the text

  preceding it, implying that I am inside the
  Python virtual environment.

You can run

pip freeze
  to see all the Python libraries installed.

I have included a screenshot below demonstrating how to import Keras, TensorFlow, and OpenCV from a Python shell:

Figure 12: Importing Keras, TensorFlow, and OpenCV into our deep learning Python virtual environment.

Step #6: (Optional) Install Guest Additions on virtual machine

An optional step you may wish to perform is installing the VirtualBox Guest Additions on your machine.

The Guest Additions package allow you to:

  • Copy and paste from the virtual machine to your host (and vice versa)
  • Share folders between the virtual machine and host
  • Adjust screen resolution
  • etc.

You can install the Guest Additions by selecting

Devices => Install Guest Additions...
  from the VirtualBox menu at the top of your screen.

Executing code from Deep Learning for Computer Vision with Python on your virtual machine

There are multiple methods to access the Deep Learning for Computer Vision with Python source code + datasets from your virtual machine.

By far the easiest method is to simply open Firefox and download the .zip archives from the “Your Purchase” page after buying your copy of Deep Learning for Computer Vision with Python. I would recommend forwarding the receipt email to yourself so you can login to your inbox via Firefox and then download the code + datasets.

You may also use your favorite SFTP/FTP client to transfer the code from your host machine to the virtual machine.

Of course, you can always manually write the code inside Ubuntu virtual machine using the built-in text editor as you follow along with the book.

Tips for using the deep learning virtual machine

When using the Ubuntu VirtualBox virtual machine for deep learning I recommend the following:

  • Use Sublime Text as a lightweight code editor. Sublime Text is my favorite code editor for Linux. It’s simple, easy to use, and very lightweight, making it ideal for a virtual machine.
  • Use PyCharm as a full blown IDE. When it comes to Python IDEs, it’s hard to beat PyCharm. I personally don’t like using PyCharm inside a virtual machine as it’s quite resource hungry. You’ll also need to configure PyCharm to use the
      Python environment once installed.

Troubleshooting and FAQ

In this section I detail answers to frequently asked questions and problems regarding the pre-configured Ubuntu deep learning virtual machine.

How do I boot my deep learning virtual machine?

Once your VM has been imported, select the “DL4CV Ubuntu VM” on the left-hand side of the VirtualBox software, then click the “Start” button. Your VM will then boot.

What is the username and password for the Ubuntu deep learning virtual machine?

The username is

  and the password is

How do I run Python scripts that access deep learning libraries?

The Deep Learning for Computer Vision with Python virtual machine uses Python virtual environments to help organize Python modules and keep them separate from the system install of Python.

To access the virtual environment simply execute

workon dl4cv
  from the shell. Form there you’ll have access to deep learning/computer vision libraries such as TensorFlow, Keras, OpenCV, scikit-learn, scikit-image, etc.

How can I access my GPU from the Ubuntu virtual machine?

The short answer is that you cannot access your GPU from the virtual machine.

A virtual machine abstracts your hardware and creates an artificial barrier between your host machine and your guest machine. Peripherals such as your GPU, USB ports, etc. on your physical computer cannot be accessed by the virtual machine.

If you would like to use your GPU for deep learning I would suggest configuring your native development environment.

I am receiving an error message related to “VT-x/AMD-V hardware acceleration is not available for your system”. What do I do?

If you are getting an error message similar to the following:

Figure 13: Resolving “VT-x/AMD-V hardware acceleration is not available for your system” errors.

Then you likely need to check your BIOS and ensure virtualization is enabled. If you’re on Windows you might also need to disable Hyper-V mode.

To resolve the problem:

  1. Disable Hyper-V mode from the Windows control panel (if using the Windows operating system). Take a look at the answers to this question, which is the same problem you are having. Disabling Hyper-V is different on different Windows versions, but following the answer to the question above you should be able to find your solution. That said, also make sure you do step 2 below as well.
  2. Check your BIOS. The next time you boot your system, go into the BIOS and ensure that Virtualization is enabled (normally it’s under some sort of “advanced setting”). If virtualization is not enabled, then the VM will not be able to boot.


In today’s tutorial I demonstrated:

  • How to download and install VirtualBox, the software used to manage virtual machines.
  • How to import and launch your Ubuntu deep learning virtual machine.
  • How to access the deep learning development environment once Ubuntu has booted.

All purchases of my book, Deep Learning for Computer Vision with Python, include a copy of my pre-configured virtual machine.

This virtual machine is by far the fastest way to get up and running with deep learning and computer vision using the Python programming language.

If you would like to learn more about my new book (and pick up a copy yourself), just click here.

The post Your deep learning + Python Ubuntu virtual machine appeared first on PyImageSearch.

Continue Reading…


Read More

30 Essential Data Science, Machine Learning & Deep Learning Cheat Sheets

This collection of data science cheat sheets is not a cheat sheet dump, but a curated list of reference materials spanning a number of disciplines and tools.

Continue Reading…


Read More

Air rage update

So. Marcus Crede, Carol Nickerson, and I published a letter in PPNAS criticizing the notorious “air rage” article. (Due to space limitations, our letter contained only a small subset of the many possible criticisms of that paper.) Our letter was called “Questionable association between front boarding and air rage.”

The authors of the original paper, Katherine DeCelles and Michael Norton, published a response in which they concede nothing. They state that their hypotheses are “are predicated on decades of theoretical and empirical support across the social sciences” and they characterize their results as “consistent with theory.” I have no reason to dispute either of these claims, but at the same time these theories are so flexible that they could predict just about anything, including, I suspect, the very opposite of the claims made in the paper. As usual, there’s a confusion between a general scientific theory and some very specific claims regarding regression coefficients in some particular fitted model.

Considering the DeCelles and Norton reply in a context-free sense, it reads as reasonable: yes, it is possible for the signs and magnitudes of estimates to change when adding controls to a regression. The trouble is that their actual data seem to be of low quality, and due to the observational nature of their study, there are lots of interactions not included in the model that are possibly larger than their main effects (for example, interactions of plane configuration with type of flight, interactions with alcohol consumption, nonlinearities in the continuous predictors such as number of seats and flight difference).

The whole thing is interesting in that it reveals the challenge of interpreting this sort of exchange from the outside. how it is possible for researchers to string together paragraphs that have the form or logical argument, in support of whatever claim they’d like to make. Of course someone could say the same about us. . . .

One good thing about slogans such as “correlation does not imply causation” is that they get right to the point.

The post Air rage update appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Multi-Dimensional Reduction and Visualisation with t-SNE

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

t-SNE is a very powerful technique that can be used for visualising (looking for patterns) in multi-dimensional data. Great things have been said about this technique. In this blog post I did a few experiments with t-SNE in R to learn about this technique and its uses.

Its power to visualise complex multi-dimensional data is apparent, as well as its ability to cluster data in an unsupervised way.

What’s more, it is also quite clear that t-SNE can aid machine learning algorithms when it comes to prediction and classification. But the inclusion of t-SNE in machine learning algorithms and ensembles has to be ‘crafted’ carefully, since t-SNE was not originally intended for this purpose.

All in all, t-SNE is a powerful technique that merits due attention.


Let’s start with a brief description. t-SNE stands for t-Distributed Stochastic Neighbor Embedding and its main aim is that of dimensionality reduction, i.e., given some complex dataset with many many dimensions, t-SNE projects this data into a 2D (or 3D) representation while preserving the ‘structure’ (patterns) in the original dataset. Visualising high-dimensional data in this way allows us to look for patterns in the dataset.

t-SNE has become so popular because:

  • it was demonstrated to be useful in many situations,
  • it’s incredibly flexible, and
  • can often find structure where other dimensionality-reduction algorithms cannot.

A good introduction on t-SNE can be found here. The original paper on t-SNE and some visualisations can be found at this site. In particular, I like this site which shows how t-SNE is used for unsupervised image clustering.

While t-SNE itself is computationally heavy, a faster version exists that uses what is known as the Barnes-Hut approximation. This faster version allows t-SNE to be applied on large real-world datasets.

t-SNE for Exploratory Data Analysis

Because t-SNE is able to provide a 2D or 3D visual representation of high-dimensional data that preserves the original structure, we can use it during initial data exploration. We can use it to check for the presence of clusters in the data and as a visual check to see if there is some ‘order’ or some ‘pattern’ in the dataset. It can aid our intuition about what we think we know about the domain we are working in.

Apart from the initial exploratory stage, visualising information via t-SNE (or any other algorithm) is vital throughout the entire analysis process – from those first investigations of the raw data, during data preparation, as well as when interpreting the outputs of machine learning algorithms and presenting insights to others. We will see further on that we can use t-SNE even during the predcition/classification stage itself.

Experiments on the Optdigits dataset

In this post, I will apply t-SNE to a well-known dataset, called optdigits, for visualisation purposes.

The optdigits dataset has 64 dimensions. Can t-SNE reduce these 64 dimensions to just 2 dimension while preserving structure in the process? And will this structure (if present) allow handwritten digits to be correctly clustered together? Let’s find out.

tsne package

We will use the tsne package that provides an exact implementation of t-SNE (not the Barnes-Hut approximation). And we will use this method to reduce dimensionality of the optdigits data to 2 dimensions. Thus, the final output of t-SNE will essentially be an array of 2D coordinates, one per row (image). And we can then plot these coordinates to get the final 2D map (visualisation). The algorithm runs in iterations (called epochs), until the system converges. Every
\(K\) number of iterations and upon convergence, t-SNE can call a user-supplied callback function, and passes the list of 2D coordinates to it. In our callback function, we plot the 2D points (one per image) and the corresponding class labels, and colour-code everything by the class labels.

traindata <- read.table("optdigits.tra", sep=",")
trn <- data.matrix(traindata)


cols <- rainbow(10)

# this is the epoch callback function used by tsne. 
# x is an NxK table where N is the number of data rows passed to tsne, and K is the dimension of the map. 
# Here, K is 2, since we use tsne to map the rows to a 2D representation (map).
ecb = function(x, y){ plot(x, t='n'); text(x, labels=trn[,65], col=cols[trn[,65] +1]); }

tsne_res = tsne(trn[,1:64], epoch_callback = ecb, perplexity=50, epoch=50)

The images below show how the clustering improves as more epochs pass.

As one can see from the above diagrams (especially the last one, for epoch 1000), t-SNE does a very good job in clustering the handwritten digits correctly.

But the algorithm takes some time to run. Let’s try out the more efficient Barnes-Hut version of t-SNE, and see if we get equally good results.

Rtsne package

The Rtsne package can be used as shown below. The perplexity parameter is crucial for t-SNE to work correctly – this parameter determines how the local and global aspects of the data are balanced. A more detailed explanation on this parameter and other aspects of t-SNE can be found in this article, but a perplexity value between 30 and 50 is recommended.

traindata <- read.table("optdigits.tra", sep=",")
trn <- data.matrix(traindata)


# perform dimensionality redcution from 64D to 2D
tsne <- Rtsne(as.matrix(trn[,1:64]), check_duplicates = FALSE, pca = FALSE, perplexity=30, theta=0.5, dims=2)

# display the results of t-SNE
cols <- rainbow(10)
plot(tsne$Y, t='n')
text(tsne$Y, labels=trn[,65], col=cols[trn[,65] +1])

Gives this plot:

Note how clearly-defined and distinctly separable the clusters of handwritten digits are. We have only a minimal amount of incorrect entries in the 2D map.

Of Manifolds and the Manifold Assumption

How can t-SNE achieve such a good result? How can it ‘drop’ 64-dimensional data into just 2 dimensions and still preserve enough (or all) structure to allow the classes to be separated?

The reason has to do with the mathematical concept of manifolds. A Manifold is a \(d\)-dimensional surface that lives in an \(D\)-dimensional space, where \(d < D\). For the 3D case, imagine a 2D piece of paper that is embedded within 3D space. Even if the piece of paper is crumpled up extensively, it can still be ‘unwrapped’ (uncrumpled) into the 2D plane that it is. This 2D piece of paper is a manifold in 3D space. Or think of an entangled string in 3D space – this is a 1D manifold in 3D space.

Now there’s what is known as the Manifold Hypothesis, or the >Manifold Assumption, that states that natural data (like images, etc.) forms lower dimensional manifolds in their embedding space. If this assumption holds (there are theoretical and experimental evidence for this hypothesis), then t-SNE should be able to find this lower-dimensional manifold, ‘unwrap it’, and present it to us as a lower-dimensional map of the original data.

t-SNE vs. SOM

t-SNE is actually a tool that does something similar to Self-Organising Maps (SOMs), though the underlying process is quite different. We have used a SOM on the optdigits dataset in a previous blog and obtained the following unsupervised clustering shown below.

The t-SNE map is more clear than that obtained via a SOM and the clusters are separated much better.

t-SNE for Shelter Animals dataset

In a previous blog, I applied machine learning algorithms for predicting the outcome of shelter animals. Let’s now apply t-SNE to the dataset – I am using the cleaned and modified data as described in this blog entry.

trn <- read.csv('')
trn <- data.matrix(trn)


# scale the data first prior to running t-SNE
trn[,-1]  <- scale(trn[,-1])

tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2)

# display the results of t-SNE
cols <- rainbow(5)
plot(tsne$Y, t='n')
text(tsne$Y, labels=as.numeric(trn[,1]), col=cols[trn[,1]])

Gives this plot:

Here, while it is evident that there is some structure and patterns to the data, clustering by OutcomeType has not happened.

Let’s now use t-SNE to perform dimensionality reduction to 3D on the same dataset. We just need to set the dims parameter to 3 instead of 2. And we will use package rgl for plotting the 3D map produced by t-SNE.

tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2)

#display results of t-SNE
plot3d(tsne$Y, col=cols[trn[,1]])
legend3d("topright", legend = '0':'5', pch = 16, col = rainbow(5))

Gives this plot:

This 3D map has a richer structure than the 2D version, but again the resulting clustering is not done by OutcomeType.

One possible reason could be that the Manifold Assumption is failing for this dataset when trying to reduce to 2D and 3D. Please note that the Manifold Assumption cannot be always true for a simple reason. If it were, we could take an arbitrary set of \(N\)-dimensional points and conclude that they lie on some \(M\)-dimensional manifold (where \(M < N\)). We could then take those \(M\)-dimensional points and conclude that they lie on some other lower-dimensional manifold (with, say, \(K\) dimensions, where \(K < M\)). We could repeat this process indefinitely and eventually conclude that our arbitrary \(N\)-dimensional dataset lies on a 1D manifold. Because this cannot be true for all datasets, the manifold assumption cannot always be true. In general, if you have \(N\)-dimensional points being generated by a process with \(N\) degrees of freedom, your data will not lie on a lower-dimensional manifold. So probably the Animal Shelter dataset has degrees of freedom higher than 3.

So we can conclude that t-SNE did not aid much the initial exploratory analysis for this dataset.

t-SNE and machine learning?

One disadvantage of t-SNE is that there is currently no incremental version of this algorithm. In other words, it is not possible to run t-SNE on a dataset, then gather a few more samples (rows), and “update” the t-SNE output with the new samples. You would need to re-run t-SNE from scratch on the full dataset (previous dataset + new samples). Thus t-SNE works only in batch mode.

This disadvantage appears to makes it difficult for t-SNE to be used in a machine learning system. But as we will see in a future post, it is still possible to use t-SNE (with care) in a machine learning solution. And the use of t-SNE can improve classification results, sometimes markedly. The limitation is the extra non-realtime processing brought about by t-SNE’s batch mode nature.

Stay tuned.

    Related Post

    1. Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
    2. Analyzing Obesity across USA
    3. Can we predict flu deaths with Machine Learning and R?
    4. Graphical Presentation of Missing Data; VIM Package
    5. Creating Graphs with Python and GooPyCharts

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Continue Reading…


    Read More

    On debugging

    My favorite advice on debugging is from Professor Norman Matloff:

    Finding your bug is a process of confirming the many things that you believe are true – until you find one that is not true.


    I would add to this my own observation:

    You are not truly “debugging” until you (temporarily) remove all desire to fix the problem. Investigation must be entirely devoted to finding and seeing the problem.

    Continue Reading…


    Read More

    My advice on dplyr::mutate()

    There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

    Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.

    Vlcsnap 00887

    “Character is what you are in the dark.”

    John Whorfin quoting Dwight L. Moody.

    I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

    What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr.

    • Disclaimer: Win-Vector LLC has no official standing with RStudio, or dplyr development.
    • However:

      “One need not have been Caesar in order to understand Caesar.”

      Alternately: Georg Simmmel or Max Webber.

      Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.

    From experience we have distilled a lot of best practices. And below we will share one.

    From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:

    Note that you can refer to columns that you’ve just created:

       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours

    Let’s try that with database backed data:

    # [1] ‘0.7.3’
    db <- DBI::dbConnect(RSQLite::SQLite(), 
    flights <- copy_to(db, 
           gain = arr_delay - dep_delay,
           hours = air_time / 60,
           gain_per_hour = gain / hours
    # # Source:   lazy query [?? x 22]
    # # Database: sqlite 3.19.3 [:memory:]
    # year month   day dep_time sched_dep_time        ...
    # <int> <int> <int>    <int>          <int>       ...
    #   1  2013     1     1      517            515   ...
    # ...

    That worked. One of the selling points of dplyr is a lot of dplyr is source-generic or source-agnostic: meaning it can be run against different data providers (in-memory, databases, Spark).

    However, if a new user tries to extend such an example (say adding gain_per_minutes) they run into this:

           gain = arr_delay - dep_delay,
           hours = air_time / 60,
           gain_per_hour = gain / hours,
           gain_per_minute = 60 * gain_per_hour
    # Error in rsqlite_send_query(conn@ptr, statement) : 
    #   no such column: gain_per_hour

    (Some detail on the failing query are here.)

    It is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.

    This quickly leads to strong feelings of learned helplessness and anxiety.

    Our rule for dplyr::mutate() has been for some time:

    Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).

    Under this rule neither of the above mutates are allowed. The second should be written as (switching to pipe-notation):

    flights %>%
      mutate(gain = arr_delay - dep_delay,
             hours = air_time / 60) %>%
      mutate(gain_per_hour = gain / hours) %>%
      mutate(gain_per_minute = 60 * gain_per_hour)

    And the above works.

    If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.

    dplyr documentation (such as “help(mutate)“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.

    Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.

    However, confusion around re-use of column names has been driving dplyr issues for quite some time:

    It makes sense to work in a reliable and teachable sub-dialect of dplyr that will serve users well (or barring that, you can use an adapter, such as seplyr). In production you must code to what systems are historically reliably capable of, not just the specification. “Works for the instructor” is not an acceptable level of dependability.

    Continue Reading…


    Read More

    Data liquidity in the age of inference

    Probabilistic computation holds too much promise for it to be stifled by playing zero sum games with data.

    It's a special time in the evolutionary history of computing. Oft-used terms like big data, machine learning, and artificial intelligence have become popular descriptors of a broader underlying shift in information processing. While traditional rules-based computing isn’t going anywhere, a new computing paradigm is emerging around probabilistic inference, where digital reasoning is learned from sample data rather than hardcoded with boolean logic. This shift is so significant that a new computing stack is forming around it with emphasis on data engineering, algorithm development, and even novel hardware designs optimized for parallel computing workloads, both within data centers and at endpoints.

    A funny thing about probabilistic inference is that when models work well, they’re probably right most of the time, but always wrong at least some of the time. From a mathematics perspective, this is because such models take a numerical approach to problem analysis, as opposed to an analytical one. That is, they learn patterns from data (with various levels of human involvement) that have certain levels of statistical significance, but remain somewhat ignorant to any physics-level intuition related to those patterns, whether represented by math theorems, conjectures, or otherwise. However, that’s also precisely why probabilistic inference is so incredibly powerful. Many real-world systems are so multivariate, complex, and even stochastic that analytical math models do not exist and remain tremendously difficult to develop. In the meanwhile, their physics-ignorant, FLOPS-happy, and often brutish machine learning counterparts can develop deductive capabilities that don’t nicely follow any known rules, yet still almost always arrive at the correct answers.

    Continue reading Data liquidity in the age of inference.

    Continue Reading…


    Read More

    4 data-driven ways to digitize your business

    In today’s digital landscape, customers expect you to deliver products and services in a fast and efficient manner. Heavyweights like Amazon and Google have set a bar in terms of operations, and they’ve set it high. An increasing need for more streamlined and efficient processes, combined with advancing technologies has

    The post 4 data-driven ways to digitize your business appeared first on Dataconomy.

    Continue Reading…


    Read More

    Four short links: 22 September 2017

    Molecular Robots, Distributed Deep Nets, SQL Notebook, and Super-Accurate GPS

    1. Scientists Create World’s First ‘Molecular Robot’ Capable Of Building Molecules -- Each individual robot is capable of manipulating a single molecule and is made up of just 150 carbon, hydrogen, oxygen and nitrogen atoms. To put that size into context, a billion billion of these robots piled on top of each other would still only be the same size as a single grain of salt. The robots operate by carrying out chemical reactions in special solutions which can then be controlled and programmed by scientists to perform the basic tasks. (via Slashdot)
    2. Distributed Deep Neural Networks -- in Adrian Colyer's words: DDNNs partition networks between mobile/embedded devices, cloud (and edge), although the partitioning is static. What’s new and very interesting here though is the ability to aggregate inputs from multiple devices (e.g., with local sensors) in a single model, and the ability to short-circuit classification at lower levels in the model (closer to the end devices) if confidence in the classification has already passed a certain threshold. It looks like both teams worked independently and in parallel on their solutions. Overall, DDNNs are shown to give lower latency decisions with higher accuracy than either cloud or devices working in isolation, as well as fault tolerance in the sense that classification accuracy remains high even if individual devices fail. (via Morning Paper)
    3. Franchise -- an open-source notebook for sql.
    4. Super-Accurate GPS Chips Coming to Smartphones in 2018 (IEEE Spectrum) -- 30cm accuracy (today: 5m), will help with the reflections you get in cities, and with 50% energy savings.

    Continue reading Four short links: 22 September 2017.

    Continue Reading…


    Read More

    Document worth reading: “Image Segmentation Algorithms Overview”

    The technology of image segmentation is widely used in medical image processing, face recognition pedestrian detection, etc. The current image segmentation techniques include region-based segmentation, edge detection segmentation, segmentation based on clustering, segmentation based on weakly-supervised learning in CNN, etc. This paper analyzes and summarizes these algorithms of image segmentation, and compares the advantages and disadvantages of different algorithms. Finally, we make a prediction of the development trend of image segmentation with the combination of these algorithms. Image Segmentation Algorithms Overview

    Continue Reading…


    Read More

    Mining USPTO full text patent data – Analysis of machine learning and AI related patents granted in 2017 so far – Part 1

    (This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

    The United States Patent and Trademark office (USPTO) provides immense amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought that it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML), data science, and artificial intelligence (AI).

    I started this exploration by downloading the full text data (excluding images) for all patents that were assigned by the USPTO within the year 2017 up to the time of writing (Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System).

    In this blog post – “Part 1” – I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? etc.

    “Part 2” will be an in depth analysis of the language used in patent titles, descriptions, and claims, and “Part 3” will be on experimentation with with deep neural nets applied to the patents’ title, description, and claims text.

    First, I curated a patent full text dataset consisting of “machine learning and AI related” patents.
    I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are related to ML or AI in any/some capacity. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is utilized as a part of the invention in any way whatsoever. The subset of relevant patents was determined by a keyword search as specified by the following definition.

    Definition: For the purposes of this blog post, a machine learning or AI related patent is a patent that contains at least one of the keywords
    “machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model”
    in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).1

    With this keyword matching approach a total of 6665 patents have been selected. The bar graph below shows how many times each keyword got matched.

    plot of keyword match frequencies

    Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.

    Some example patents

    Here are three (randomly chosen) patents from the resulting dataset. For each printed are the invention title, the patent assignee, as well as one instance of the keyword match within the patent text.

    ## $`2867`
    ## [1] "Fuselage manufacturing system"
    ## [3] "... using various techniques. For example, at least
    ##      one of an artificial intelligence program, a
    ##      knowledgebase, an expert ..."
    ## $`1618`
    ## [1] "Systems and methods for validating wind farm
    ##      performance measurements"
    ## [2] "General Electric Company"
    ## [3] "... present disclosure leverages and fuses
    ##      accurate available sensor data using machine
    ##      learning algorithms. That is, the more ..."
    ## $`5441`
    ## [1] "Trigger repeat order notifications"
    ## [2] "Accenture Global Services Limited"
    ## [3] "... including user data obtained from a user
    ##      device; obtaining a predictive model that
    ##      estimates a likelihood of ..."

    And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.

    ## $`5742`
    ## [1] "Adaptive demodulation method and apparatus using an
    ##      artificial neural network to improve data recovery
    ##      in high speed channels"
    ## [2] "Kelquan Holdings Ltd."
    ## [3] "... THE INVENTION\nh-0002\n1 The present invention
    ##      relates to a neural network based integrated
    ##      demodulator that mitigates ..."
    ## $`3488`
    ## [1] "Cluster-trained machine learning for image processing"
    ## [2] "Amazon Technologies, Inc."
    ## [3] "... BACKGROUND\nh-0001\n1 Artificial neural networks,
    ##      especially deep neural network ..."
    ## $`3103`
    ## [1] "Methods and apparatus for machine learning based
    ##      malware detection"
    ## [2] "Invincea, Inc."
    ## [3] "... a need exists for methods and apparatus that can
    ##      use machine learning techniques to reduce the amount ..."

    Who holds these patents (inventors and assignees)?

    The first question I would like to address is who files most of the machine learning and AI related patents.

    Each patent specifies one or several inventors, i.e., the individuals who made the patented invention, and a patent assignee which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.

    plot of chunk unnamed-chunk-5

    It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).

    There are a number of interesting follow-up questions which for now I leave unanswered (hard to answer without additional data):

    • What is the count of ML and AI related patents by industry or type of company (e.g., big tech companies vs. startups vs. reserach universities vs. government)?
    • Who is deriving the most financial benefit by holding ML or AI related patents (either through licensing or by driving out the competition)?

    Where do these inventions come from (geographically)?

    Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US.
    In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active.
    The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).

    plot of chunk unnamed-chunk-16

    The results aren’t that surprising.
    However, we see that most (ML and AI related) inventions patented with the USPTO were made in the US. I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather the in the US.

    Alternatively, we can also map the number of patents per inventors’ origin countries.

    plot of chunk unnamed-chunk-17

    Sadly, there seem to be entire groups of countries (e.g., almost the entire African continent) which seem to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions.
    Whether it is a lack of access, infrastructure, education, political treaties or something else is an intriguing question.

    Patent application and publication dates, and duration of patent examination process

    Each patent has a date of filing and an assignment date attached to it. Based on the provided dates one can try to address questions such as:
    When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.

    For the set of ML and AI related patents that were granted between Jan 3 and Sept 12 2017 the following figure depicts…

    • …in the top panel: number of patents (y-axis) per their original month of filing (x-axis),
    • …in the bottom panel: the number of patents (y-axis) that were assigned (approved) per week (x-axis) in 2017 so far.

    plot of chunk unnamed-chunk-7

    The patent publication dates plot suggests that the number of ML and AI related patents seems to be increasing slightly throughout the year 2017.
    The patent application dates plot suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents in 2017). However, the range is quite extensive spanning 0.24-12.57 years.

    The distribution of the duration from patent filing date to approval is depicted by following figure.

    plot of chunk unnamed-chunk-9

    So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:

    • “Printing and dispensing system for an electronic gaming device that provides an undisplayed outcome” (~12.57 years to approval; assignee: Aim Management, Inc.)
    • “Apparatus for biological sensing and alerting of pharmaco-genomic mutation” (~12.24 years to approval; assignee: NA)
    • “System for tracking a player of gaming devices” (~12.06 years to approval; assignee: Aim Management, Inc.)
    • “Device, method, and computer program product for customizing game functionality using images” (~11.72 years to approval; assignee: NOKIA TECHNOLOGIES OY)
    • “Method for the spectral identification of microorganisms” (~11.57 years to approval; assignee: MCGILL UNIVERSITY, and HEALTH CANADA)

    Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…

    We can also look at the five patents with the shortest approval time:

    • “Home device application programming interface” (~91 days to approval; assignee: ESSENTIAL PRODUCTS, INC.)
    • “Avoiding dazzle from lights affixed to an intraoral mirror, and applications thereof” (~104 days to approval; assignee: DENTAL SMARTMIRROR, INC.)
    • “Media processing methods and arrangements” (~106 days to approval; assignee: Digimarc Corporation)
    • “Machine learning classifier that compares price risk score, supplier risk score, and item risk score to a threshold” (~111 days to approval; assignee: ACCENTURE GLOBAL SOLUTIONS LIMITED)
    • “Generating accurate reason codes with complex non-linear modeling and neural networks” (~111 days to approval; assignee: SAS INSTITUTE INC.)

    Interstingly the patent approved in the shortest amount of time among all 6665 analysed (ML and AI related) patents is some smart home thingy from Andy Rubin’s hyped up company Essential.

    Do big tech companies get their patents approved faster than other companies (e.g., startups)?

    The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.

    plot of chunk unnamed-chunk-15

    Indeed some big tech companies, such as AT&T or Samsung, manage to push their patent application though the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.

    Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.


    There is definitely a lot of promise in the USPTO full text patent data.
    Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more.
    The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:

    Creative Commons License
    This work (blog post and included figures) is licensed under a Creative Commons Attribution 4.0 International License.

    1. There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start. 

    To leave a comment for the author, please follow the link and comment on their blog: Alexej's blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Continue Reading…


    Read More

    Whats new on arXiv

    varbvs: Fast Variable Selection for Large-scale Regression

    We introduce varbvs, a suite of functions written in R and MATLAB for regression analysis of large-scale data sets using Bayesian variable selection methods. We have developed numerical optimization algorithms based on variational approximation methods that make it feasible to apply Bayesian variable selection to very large data sets. With a focus on examples from genome-wide association studies, we demonstrate that varbvs scales well to data sets with hundreds of thousands of variables and thousands of samples, and has features that facilitate rapid data analyses. Moreover, varbvs allows for extensive model customization, which can be used to incorporate external information into the analysis. We expect that the combination of an easy-to-use interface and robust, scalable algorithms for posterior computation will encourage broader use of Bayesian variable selection in areas of applied statistics and computational biology. The most recent R and MATLAB source code is available for download at Github (https://…/varbvs ), and the R package can be installed from CRAN (https://…/package=varbvs ).

    Unsupervised Machine Learning for Networking: Techniques, Applications and Research Challenges

    While machine learning and artificial intelligence have long been applied in networking research, the bulk of such works has focused on supervised learning. Recently there has been a rising trend of employing unsupervised machine learning using unstructured raw network data to improve network performance and provide services such as traffic engineering, anomaly detection, Internet traffic classification, and quality of service optimization. The interest in applying unsupervised learning techniques in networking emerges from their great success in other fields such as computer vision, natural language processing, speech recognition, and optimal control (e.g., for developing autonomous self-driving cars). Unsupervised learning is interesting since it can unconstrain us from the need of labeled data and manual handcrafted feature engineering thereby facilitating flexible, general, and automated methods of machine learning. The focus of this survey paper is to provide an overview of the applications of unsupervised learning in the domain of networking. We provide a comprehensive survey highlighting the recent advancements in unsupervised learning techniques and describe their applications for various learning tasks in the context of networking. We also provide a discussion on future directions and open research issues, while also identifying potential pitfalls. While a few survey papers focusing on the applications of machine learning in networking have previously been published, a survey of similar scope and breadth is missing in literature. Through this paper, we advance the state of knowledge by carefully synthesizing the insights from these survey papers while also providing contemporary coverage of recent advances.

    On Collaborative Compressive Sensing Systems: The Framework, Design and Algorithm

    We propose a collaborative compressive sensing (CCS) framework consisting of a bank of K compressive sensing (CS) systems that share the same sensing matrix but have different sparsifying dictionaries. This CCS system is guaranteed to yield better performance than each individual CS system in a statistical sense, while with the parallel computing strategy, it requires the same time as that needed for each individual CS system to conduct compression and signal recovery. We then provide an approach to designing optimal CCS systems by utilizing a measure that involves both the sensing matrix and dictionaries and hence allows us to simultaneously optimize the sensing matrix and all the K dictionaries under the same scheme. An alternating minimization-based algorithm is derived for solving the corresponding optimal design problem. We provide a rigorous convergence analysis to show that the proposed algorithm is convergent. Experiments with real images are carried out and show that the proposed CCS system significantly improves on existing CS systems in terms of the signal recovery accuracy.

    Deep Reinforcement Learning for Event-Driven Multi-Agent Decision Processes

    The incorporation of macro-actions (temporally extended actions) into multi-agent decision problems has the potential to address the curse of dimensionality associated with such decision problems. Since macro-actions last for stochastic durations, multiple agents executing decentralized policies in cooperative environments must act asynchronously. We present an algorithm that modifies Generalized Advantage Estimation for temporally extended actions, allowing a state-of-the-art policy optimization algorithm to optimize policies in Dec-POMDPs in which agents act asynchronously. We show that our algorithm is capable of learning optimal policies in two cooperative domains, one involving real-time bus holding control and one involving wildfire fighting with unmanned aircraft. Our algorithm works by framing problems as ‘event-driven decision processes,’ which are scenarios where the sequence and timing of actions and events are random and governed by an underlying stochastic process. In addition to optimizing policies with continuous state and action spaces, our algorithm also facilitates the use of event-driven simulators, which do not require time to be discretized into time-steps. We demonstrate the benefit of using event-driven simulation in the context of multiple agents taking asynchronous actions. We show that fixed time-step simulation risks obfuscating the sequence in which closely-separated events occur, adversely affecting the policies learned. Additionally, we show that arbitrarily shrinking the time-step scales poorly with the number of agents.

    Curriculum Learning of Visual Attribute Clusters for Multi-Task Classification

    Visual attributes, from simple objects (e.g., backpacks, hats) to soft-biometrics (e.g., gender, height, clothing) have proven to be a powerful representational approach for many applications such as image description and human identification. In this paper, we introduce a novel method to combine the advantages of both multi-task and curriculum learning in a visual attribute classification framework. Individual tasks are grouped after performing hierarchical clustering based on their correlation. The clusters of tasks are learned in a curriculum learning setup by transferring knowledge between clusters. The learning process within each cluster is performed in a multi-task classification setup. By leveraging the acquired knowledge, we speed-up the process and improve performance. We demonstrate the effectiveness of our method via ablation studies and a detailed analysis of the covariates, on a variety of publicly available datasets of humans standing with their full-body visible. Extensive experimentation has proven that the proposed approach boosts the performance by 4% to 10%.

    A textual transform of multivariate time-series for prognostics

    Prognostics or early detection of incipient faults is an important industrial challenge for condition-based and preventive maintenance. Physics-based approaches to modeling fault progression are infeasible due to multiple interacting components, uncontrolled environmental factors and observability constraints. Moreover, such approaches to prognostics do not generalize to new domains. Consequently, domain-agnostic data-driven machine learning approaches to prognostics are desirable. Damage progression is a path-dependent process and explicitly modeling the temporal patterns is critical for accurate estimation of both the current damage state and its progression leading to total failure. In this paper, we present a novel data-driven approach to prognostics that employs a novel textual representation of multivariate temporal sensor observations for predicting the future health state of the monitored equipment early in its life. This representation enables us to utilize well-understood concepts from text-mining for modeling, prediction and understanding distress patterns in a domain agnostic way. The approach has been deployed and successfully tested on large scale multivariate time-series data from commercial aircraft engines. We report experiments on well-known publicly available benchmark datasets and simulation datasets. The proposed approach is shown to be superior in terms of prediction accuracy, lead time to prediction and interpretability.

    OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning

    Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.

    Contrastive Principal Component Analysis

    We present a new technique called contrastive principal component analysis (cPCA) that is designed to discover low-dimensional structure that is unique to a dataset, or enriched in one dataset relative to other data. The technique is a generalization of standard PCA, for the setting where multiple datasets are available — e.g. a treatment and a control group, or a mixed versus a homogeneous population — and the goal is to explore patterns that are specific to one of the datasets. We conduct a wide variety of experiments in which cPCA identifies important dataset-specific patterns that are missed by PCA, demonstrating that it is useful for many applications: subgroup discovery, visualizing trends, feature selection, denoising, and data-dependent standardization. We provide geometrical interpretations of cPCA and show that it satisfies desirable theoretical guarantees. We also extend cPCA to nonlinear settings in the form of kernel cPCA. We have released our code as a python package and documentation is on Github.

    SBG-Sketch: A Self-Balanced Sketch for Labeled-Graph Stream Summarization

    Applications in various domains rely on processing graph streams, e.g., communication logs of a cloud-troubleshooting system, road-network traffic updates, and interactions on a social network. A labeled-graph stream refers to a sequence of streamed edges that form a labeled graph. Label-aware applications need to filter the graph stream before performing a graph operation. Due to the large volume and high velocity of these streams, it is often more practical to incrementally build a lossy-compressed version of the graph, and use this lossy version to approximately evaluate graph queries. Challenges arise when the queries are unknown in advance but are associated with filtering predicates based on edge labels. Surprisingly common, and especially challenging, are labeled-graph streams that have highly skewed label distributions that might also vary over time. This paper introduces Self-Balanced Graph Sketch (SBG-Sketch, for short), a graphical sketch for summarizing and querying labeled-graph streams that can cope with all these challenges. SBG-Sketch maintains synopsis for both the edge attributes (e.g., edge weight) as well as the topology of the streamed graph. SBG-Sketch allows efficient processing of graph-traversal queries, e.g., reachability queries. Experimental results over a variety of real graph streams show SBG-Sketch to reduce the estimation errors of state-of-the-art methods by up to 99%.

    VCExplorer: A Interactive Graph Exploration Framework Based on Hub Vertices with Graph Consolidation

    Graphs have been widely used to model different information networks, such as the Web, biological networks and social networks (e.g. Twitter). Due to the size and complexity of these graphs, how to explore and utilize these graphs has become a very challenging problem. In this paper, we propose, VCExplorer, a new interactive graph exploration framework that integrates the strengths of graph visualization and graph summarization. Unlike existing graph visualization tools where vertices of a graph may be clustered into a smaller collection of super/virtual vertices, VCExplorer displays a small number of actual source graph vertices (called hubs) and summaries of the information between these vertices. We refer to such a graph as a HA-graph (Hub-based Aggregation Graph). This allows users to appreciate the relationship between the hubs, rather than super/virtual vertices. Users can navigate through the HA- graph by ‘drilling down’ into the summaries between hubs to display more hubs. We illustrate how the graph aggregation techniques can be integrated into the exploring framework as the consolidated information to users. In addition, we propose efficient graph aggregation algorithms over multiple subgraphs via computation sharing. Extensive experimental evaluations have been conducted using both real and synthetic datasets and the results indicate the effectiveness and efficiency of VCExplorer for exploration.

    Temporal Pattern Mining from Evolving Networks

    Recently, evolving networks are becoming a suitable form to model many real-world complex systems, due to their peculiarities to represent the systems and their constituting entities, the interactions between the entities and the time-variability of their structure and properties. Designing computational models able to analyze evolving networks becomes relevant in many applications. The goal of this research project is to evaluate the possible contribution of temporal pattern mining techniques in the analysis of evolving networks. In particular, we aim at exploiting available snapshots for the recognition of valuable and potentially useful knowledge about the temporal dynamics exhibited by the network over the time, without making any prior assumption about the underlying evolutionary schema. Pattern-based approaches of temporal pattern mining can be exploited to detect and characterize changes exhibited by a network over the time, starting from observed snapshots.

    Distributed Lance-William Clustering Algorithm

    One important tool is the optimal clustering of data into useful categories. Dividing similar objects into a smaller number of clusters is of importance in many applications. These include search engines, monitoring of academic performance, biology and wireless networks. We first discuss a number of clustering methods. We present a parallel algorithm for the efficient clustering of objects into groups based on their similarity to each other. The input consists of an n by n distance matrix. This matrix would have a distance ranking for each pair of objects. The smaller the number, the more similar the two objects are to each other. We utilize parallel processors to calculate a hierarchal cluster of these n items based on this matrix. Another advantage of our method is distribution of the large n by n matrix. We have implemented our algorithm and have found it to be scalable both in terms of processing speed and storage.

    Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties [Extended Version]

    In knowledge bases such as Wikidata, it is possible to assert a large set of properties for entities, ranging from generic ones such as name and place of birth to highly profession-specific or background-specific ones such as doctoral advisor or medical condition. Determining a preference or ranking in this large set is a challenge in tasks such as prioritisation of edits or natural-language generation. Most previous approaches to ranking knowledge base properties are purely data-driven, that is, as we show, mistake frequency for interestingness. In this work, we have developed a human-annotated dataset of 350 preference judgments among pairs of knowledge base properties for fixed entities. From this set, we isolate a subset of pairs for which humans show a high level of agreement (87.5% on average). We show, however, that baseline and state-of-the-art techniques achieve only 61.3% precision in predicting human preferences for this subset. We then analyze what contributes to one property being rated as more important than another one, and identify that at least three factors play a role, namely (i) general frequency, (ii) applicability to similar entities and (iii) semantic similarity between property and entity. We experimentally analyze the contribution of each factor and show that a combination of techniques addressing all the three factors achieves 74% precision on the task. The dataset is available at

    ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

    Single-source and top-k SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-k SimRank queries that provides a non-trivial theoretical guarantee in the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to several non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions significantly outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billion edges, using the idea of pooling.

    Interplay of Coulomb interactions and disorder in three dimensional quadratic band crossings without time-reversal or particle-hole symmetry
    Orbits for eighteen visual binaries and two double-line spectroscopic binaries observed with HRCAM on the CTIO SOAR 4m telescope, using a new Bayesian orbit code based on Markov Chain Monte Carlo
    A Driven Tagged Particle in Asymmetric Simple Exclusion Processes
    Orthogonal Series Density Estimation for Complex Surveys
    Time-Optimal Collaborative Guidance using the Generalized Hopf Formula
    On Upper Approximations of Pareto Fronts
    Queuing with Heterogeneous Users: Block Probability and Sojourn times
    A Destroying Driven Tagged Particle in Symmetric Simple Exclusion Processes
    Derivation of Network Reprogramming Protocol with Z3
    Optimal projection of observations in a Bayesian setting
    High-dimensional posterior consistency for hierarchical non-local priors in regression
    Yaglom limits for R-transient chains with non-trivial Martin boundary
    A Server-based Approach for Predictable GPU Access with Improved Analysis
    A Memristive Neural Network Computing Engine using CMOS-Compatible Charge-Trap-Transistor (CTT)
    A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent
    Learning of Coordination Policies for Robotic Swarms
    Localization in the Disordered Holstein model
    Secure Beamforming in Full-Duplex SWIPT Systems
    Dynamic Cross-Layer Beamforming in Hybrid Powered Communication Systems With Harvest-Use-Trade Strategy
    Multilevel mixed effects parametric survival analysis
    Estimating model evidence using ensemble-based data assimilation with localization – The model selection problem
    An Attention-based Collaboration Framework for Multi-View Network Representation Learning
    Construction C*: an improved version of Construction C
    On Graphs and the Gotsman-Linial Conjecture for d = 2
    Distributed event-triggered control for multi-agent formation stabilization and tracking
    Unique Information via Dependency Constraints
    Verifying Properties of Binarized Deep Neural Networks
    Think Globally, Embed Locally — Locally Linear Meta-embedding of Words
    An Optimality Proof for the PairDiff operator for Representing Relations between Words
    Controllability and data-driven identification of bipartite consensus on nonlinear signed networks
    Deep Lattice Networks and Partial Monotonic Functions
    Random matrices: repulsion in spectrum
    Concentration of distances in Wigner matrices
    Sieve: Actionable Insights from Monitored Metrics in Microservices
    Property Testing in High Dimensional Ising models
    A Voting-Based System for Ethical Decision Making
    Higher Distance Energies and Expanders with Structure
    Blind Estimation of Sparse Broadband Massive MIMO Channels with Ideal and One-bit ADCs
    Subset Testing and Analysis of Multiple Phenotypes (STAMP)
    Reversible Joint Hilbert and Linear Canonical Transform Without Distortion
    Online Learning of a Memory for Learning Rates
    Empowering In-Memory Relational Database Engines with Native Graph Processing
    Covering Numbers for Semicontinuous Functions
    Some inequalities for $k$-colored partition functions
    The cohomology of abelian Hessenberg varieties and the Stanley-Stembridge conjecture
    Measuring Player Retention and Monetization using the Mean Cumulative Function
    Equilibrium fluctuations for the weakly asymmetric discrete Atlas model
    SegFlow: Joint Learning for Video Object Segmentation and Optical Flow
    The Fourth Characteristic of a Semimartingale
    A shared latent space matrix factorisation method for recommending new trial evidence for systematic review updates
    Exponential concentration for zeroes of stationary Gaussian processes
    Transfer learning from synthetic to real images using variational autoencoders for robotic applications
    Real-time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs
    Latent Embeddings for Collective Activity Recognition
    A Central Limit Theorem for Fleming-Viot Particle Systems with Hard Killing
    Information-Coupled Turbo Codes for LTE Systems
    Careful prior specification avoids incautious inference for log-Gaussian Cox point processes
    Stochastic Channel Modeling for Diffusive Mobile Molecular Communication Systems
    A stencil scaling approach for accelerating matrix-free finite element implementations
    Complexity of Finding Perfect Bipartite Matchings Minimizing the Number of Intersecting Edges
    The Life in 1-Consensus
    Block-Diagonal Solutions to Lyapunov Inequalities and Generalisations of Diagonal Dominance
    Efficient Graph Edit Distance Computation and Verification via Anchor-aware Lower Bound Estimation
    On $2$-chains inside thin subsets of $\mathbb{R}^d$ and product of distances
    Affordable and Energy-Efficient Cloud Computing Clusters: The Bolzano Raspberry Pi Cloud Cluster Experiment
    Updating the silent speech challenge benchmark with deep learning
    Miscorrection-free Decoding of Staircase Codes
    Time-dependent reflection at the localization transition
    Differential transcendence & algebraicity criteria for the series counting weighted quadrant walks
    Atomic Norm Denoising-Based Joint Channel Estimation and Faulty Antenna Detection for Massive MIMO
    Higher Order Concentration of Measure
    UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning
    Towards a better understanding of the matrix product function approximation algorithm in application to quantum physics
    Berry-Esseen Bounds for typical weighted sums
    Bandits with Delayed Anonymous Feedback
    A note on the 4-girth-thickness of K_{n,n,n}
    Specification tests in semiparametric transformation models
    On Energy Efficient Uplink Multi-User MIMO with Shared LNA Control
    Using marginal structural models to adjust for treatment drop-in when developing clinical prediction models
    Learning quadrangulated patches for 3D shape parameterization and completion
    Open Source Dataset and Deep Learning Models for Online Digit Gesture Recognition on Touchscreens
    Integrating hyper-parameter uncertainties in a multi-fidelity Bayesian model for the estimation of a probability of failure
    Forbidden Subgraphs for Chorded Pancyclicity
    The random pinning model with correlated disorder given by a renewal set
    De-identification of medical records using conditional random fields and long short-term memory networks
    EMR-based medical knowledge representation and inference via Markov random fields and distributed representation learning
    Linear Quadratic Games with Costly Measurements
    Using Parameterized Black-Box Priors to Scale Up Model-Based Policy Search for Robotics
    Constructing a Hierarchical User Interest Structure based on User Profiles
    Bayesian Optimization with Automatic Prior Selection for Data-Efficient Direct Policy Search
    A Byzantine Fault-Tolerant Ordering Service for the Hyperledger Fabric Blockchain Platform
    Stock-out Prediction in Multi-echelon Networks
    Spatial features of synaptic adaptation affecting learning performance
    NOMA Assisted Wireless Caching: Strategies and Performance Analysis
    Iterated Stochastic Integrals in Infinite Dimensions – Approximation and Error Estimates
    Stochastic Burgers’ Equation on the Real Line: Regularity and Moment Estimates
    Synchronization in Kuramoto-Sakaguchi ensembles with competing influence of common noise and global coupling
    An Expectation Conditional Maximization approach for Gaussian graphical models
    New Examples of Dimension Zero Categories
    Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks
    Characterization and enumeration of 3-regular permutation graphs
    Error-tolerant Multisecant Method for Nonlinearly Constrained Optimization
    Equilibrium-Independent Dissipativity with Quadratic Supply Rates
    Text Compression for Sentiment Analysis via Evolutionary Algorithms

    Continue Reading…


    Read More

    Thermal structure of Hurricane Maria

    Hurricane Maria touched down in Puerto Rico. This visualization by Joshua Stevens at NASA shows what the thermal structure of the storm looked like, based on data collected by the Terra satellite.

    Colder clouds, which are generally higher in the atmosphere, are shown with white. Somewhat warmer, lower clouds appear purple. The image reveals a very well-defined eye surrounded by high clouds on all sides—an indication that the storm was very intense.


    Tags: , ,

    Continue Reading…


    Read More

    Basics of Entity Resolution

    Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

    Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

    Naming Your Problem

    Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

    How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

    Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

    The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

    1. Deduplication: eliminating duplicate (exact) copies of repeated data.
    2. Record linkage: identifying records that reference the same entity across different sources.
    3. Canonicalization: converting data with more than one possible representation into a standard form.

    Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

    About Dedupe

    Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as it's primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

    How Dedupe Works

    Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

    Testing Out Dedupe

    Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

    In your terminal (we recommend doing so inside a virtual environment):

    git clone
    cd dedupe-examples
    pip install unidecode
    pip install future
    pip install dedupe

    Then we'll run the file to see what dedupe can do:


    Blocking and Affine Gap Distance

    Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

    But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

    Silvrback blog image

    How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

    When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

    fields = [
        {'field' : 'Name', 'type': 'String'},
        {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
        {'field' : 'Address', 'type': 'String', 'has missing' : True},
        {'field' : 'Purchases', 'type': 'String'},

    Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking , affine gap distance, and active learning.

    Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

    • the first three digits of the phone number
    • the full name
    • the first five characters of the name
    • a random 4-gram within the city name

    Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, which is a variation on Hamming distance that makes subsequent consecutive deletions or insertions cheaper.

    Silvrback blog image

    Silvrback blog image

    Silvrback blog image

    Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).

    Silvrback blog image

    The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

    Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

    Active Learning

    You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

    Do these records refer to the same thing?
    (y)es / (n)o / (u)nsure / (f)inished

    Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:

    Silvrback blog image

    You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

    • (y)es: confirms that the two references are to the same entity
    • (n)o: labels the two references as not the same entity
    • (u)nsure: does not label the two references as the same entity or as different entities
    • (f)inished: ends the active learning session and triggers the supervised learning phase

    Silvrback blog image

    As you can see in the example above, some comparisons decisions are very easy. The first contains zero for zero hits on all four attributes being examined, so the verdict is most certainly a non-match. On the second, we have a 3/4 exact match, with the fourth being fuzzy in that one entity contains a piece of the matched entity; Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label it as such to enable the supervised learning that comes after the active learning.

    The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

    fields = [
        {'field' : 'Site name', 'type': 'String'},
        {'field' : 'Address', 'type': 'String'},
        {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
        {'field' : 'Phone', 'type': 'String', 'has missing' : True},

    A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

    Something a Bit More Challenging

    In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

    The data set we used was pulled from the website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

    Silvrback blog image

    The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

    Database Field Field Description
    NAMELAST Last name of entity
    NAMEFIRST First name of entity
    NAMEMID Middle name of entity
    UIN Unique Identification Number
    BDGNBR Badge Number
    Type of Access Access type to White House
    TOA Time of arrival
    POA Post on arrival
    TOD Time of departure
    POD Post on departure
    APPT_MADE_DATE When the appointment date was made
    APPT_START_DATE When the appointment date is scheduled to start
    APPT_END_DATE When the appointment date is scheduled to end
    APPT_CANCEL_DATE When the appointment date was canceled
    Total_People Total number of people scheduled to attend
    LAST_UPDATEDBY Who was the last person to update this event
    POST Classified as 'WIN'
    LastEntryDate When the last update to this instance
    TERMINAL_SUFFIX ID for terminal used to process visitor
    visitee_namelast The visitee's last name
    visitee_namefirst The visitee's first name
    MEETING_LOC The location of the meeting
    MEETING_ROOM The room number of the meeting
    CALLER_NAME_LAST The authorizing person for the visitor's last name
    CALLER_NAME_FIRST The authorizing person for the visitor's first name
    CALLER_ROOM The authorizing person's room for the visitor
    Description Description of the event or visit
    RELEASE_DATE The date this set of logs were released to the public

    Loading the Data

    Using the API, the White House Visitor Log Requests can be exported in a variety of formats to include, .json, .csv, and .xlsx, .pdf, .xlm, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

    import requests
    def getData(url,fname):
        Download the dataset from the webpage.
        response = requests.get(url)
        with open(fname, 'w') as f:
    DATAURL = ""
    ORIGFILE = "fixtures/whitehouse-visitors.csv"

    Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

    Tailoring the Code

    Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

    import csv
    import psycopg2
    from dateutil import parser
    from datetime import datetime
    conn = None
    DATABASE = your_db_name
    USER = your_user_name
    HOST = your_hostname
    PASSWORD = your_password
        conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST, password=PASSWORD)
        print ("I've connected")
        print ("I am unable to connect to the database")
    cur = conn.cursor()

    The other challenge with our dataset are the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type.

    This function takes the csv data in as input, parses the datetime fields we're interested in ('lastname','firstname','uin','apptmade','apptstart','apptend', 'meeting_loc'.), and outputs a database table that retains the desired columns. Keep in mind this will take a while to run.

    def dateParseSQL(nfile):
        cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                      (visitor_id SERIAL PRIMARY KEY,
                      lastname    varchar,
                      firstname   varchar,
                      uin         varchar,
                      apptmade    varchar,
                      apptstart   varchar,
                      apptend     varchar,
                      meeting_loc varchar);''')
        with open(nfile, 'rU') as infile:
            reader = csv.reader(infile, delimiter=',')
            next(reader, None)
            for row in reader:
                for field in DATEFIELDS:
                    if row[field] != '':
                            dt = parser.parse(row[field])
                            row[field] = dt.toordinal()  # We also tried dt.isoformat()
                sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                       VALUES (%s,%s,%s,%s,%s,%s,%s)"
                cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
        print ("All done!")

    About 60 of our rows had ASCII characters, which we dropped using this SQL command:

    delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';

    For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

    import tempfile
    import argparse
    import csv
    import os
    import dedupe
    import psycopg2
    from psycopg2.extras import DictCursor

    Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

    KEY_FIELD = 'visitor_id'
    SOURCE_TABLE = 'visitors'
    FIELDS =  [{'field': 'firstname', 'variable name': 'firstname',
                   'type': 'String','has missing': True},
                  {'field': 'lastname', 'variable name': 'lastname',
                   'type': 'String','has missing': True},
                  {'field': 'uin', 'variable name': 'uin',
                   'type': 'String','has missing': True},
                  {'field': 'meeting_loc', 'variable name': 'meeting_loc',
                   'type': 'String','has missing': True}

    We modified a function Dan wrote to generate the predicate blocks:

    def candidates_gen(result_set):
        lset = set
        block_id = None
        records = []
        i = 0
        for row in result_set:
            if row['block_id'] != block_id:
                if records:
                    yield records
                block_id = row['block_id']
                records = []
                i += 1
                if i % 10000 == 0:
                    print ('{} blocks'.format(i))
            smaller_ids = row['smaller_ids']
            if smaller_ids:
                smaller_ids = lset(smaller_ids.split(','))
                smaller_ids = lset([])
            records.append((row[KEY_FIELD], row, smaller_ids))
        if records:
            yield records

    And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

    def find_dupes(args):
        deduper = dedupe.Dedupe(FIELDS)
        with psycopg2.connect(database=args.dbname,
                              cursor_factory=DictCursor) as con:
            with con.cursor() as c:
                c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
                row = c.fetchone()
                count = row['count']
                sample_size = int(count * args.sample)
                print ('Generating sample of {} records'.format(sample_size))
                with con.cursor('deduper') as c_deduper:
                    c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                    temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                    deduper.sample(temp_d, sample_size)
                if os.path.exists(
                    print ('Loading training file from {}'.format(
                    with open( as tf:
                print ('Starting active learning')
                print ('Starting training')
                deduper.train(ppc=0.001, uncovered_dupes=5)
                print ('Saving new training file to {}'.format(
                with open(, 'w') as training_file:
                print ('Creating blocking_map table')
                    DROP TABLE IF EXISTS blocking_map
                    CREATE TABLE blocking_map
                    (block_key VARCHAR(200), %s INTEGER)
                    """ % KEY_FIELD)
                for field in deduper.blocker.index_fields:
                    print ('Selecting distinct values for "{}"'.format(field))
                    c_index = con.cursor('index')
                        SELECT DISTINCT %s FROM %s
                        """ % (field, SOURCE_TABLE))
                    field_data = (row[field] for row in c_index)
                    deduper.blocker.index(field_data, field)
                print ('Generating blocking map')
                c_block = con.cursor('block')
                    SELECT * FROM %s
                    """ % SOURCE_TABLE)
                full_data = ((row[KEY_FIELD], row) for row in c_block)
                b_data = deduper.blocker(full_data)
                print ('Inserting blocks into blocking_map')
                csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', delete=False)
                csv_writer = csv.writer(csv_file)
                f = open(, 'r')
                c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
                print ('Indexing blocks')
                    CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                c.execute("DROP TABLE IF EXISTS plural_key")
                c.execute("DROP TABLE IF EXISTS plural_block")
                c.execute("DROP TABLE IF EXISTS covered_blocks")
                c.execute("DROP TABLE IF EXISTS smaller_coverage")
                print ('Calculating plural_key')
                    CREATE TABLE plural_key
                    (block_key VARCHAR(200),
                    block_id SERIAL PRIMARY KEY)
                    INSERT INTO plural_key (block_key)
                    SELECT block_key FROM blocking_map
                    GROUP BY block_key HAVING COUNT(*) > 1
                print ('Indexing block_key')
                    CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                print ('Calculating plural_block')
                    CREATE TABLE plural_block
                    AS (SELECT block_id, %s
                    FROM blocking_map INNER JOIN plural_key
                    USING (block_key))
                    """ % KEY_FIELD)
                print ('Adding {} index'.format(KEY_FIELD))
                    CREATE INDEX plural_block_%s_idx
                        ON plural_block (%s)
                    """ % (KEY_FIELD, KEY_FIELD))
                    CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                    ON plural_block (block_id, %s)
                    """ % (KEY_FIELD, KEY_FIELD))
                print ('Creating covered_blocks')
                    CREATE TABLE covered_blocks AS
                        (SELECT %s,
                                string_agg(CAST(block_id AS TEXT), ','
                                ORDER BY block_id) AS sorted_ids
                         FROM plural_block
                         GROUP BY %s)
                     """ % (KEY_FIELD, KEY_FIELD))
                print ('Indexing covered_blocks')
                    CREATE UNIQUE INDEX covered_blocks_%s_idx
                        ON covered_blocks (%s)
                    """ % (KEY_FIELD, KEY_FIELD))
                print ('Committing')
                print ('Creating smaller_coverage')
                    CREATE TABLE smaller_coverage AS
                        (SELECT %s, block_id,
                            TRIM(',' FROM split_part(sorted_ids,
                            CAST(block_id AS TEXT), 1))
                            AS smaller_ids
                         FROM plural_block
                         INNER JOIN covered_blocks
                         USING (%s))
                    """ % (KEY_FIELD, KEY_FIELD))
                print ('Clustering...')
                c_cluster = con.cursor('cluster')
                    SELECT *
                    FROM smaller_coverage
                    INNER JOIN %s
                        USING (%s)
                    ORDER BY (block_id)
                    """ % (SOURCE_TABLE, KEY_FIELD))
                clustered_dupes = deduper.matchBlocks(
                        candidates_gen(c_cluster), threshold=0.5)
                print ('Creating entity_map table')
                c.execute("DROP TABLE IF EXISTS entity_map")
                    CREATE TABLE entity_map (
                        %s INTEGER,
                        canon_id INTEGER,
                        cluster_score FLOAT,
                        PRIMARY KEY(%s)
                    )""" % (KEY_FIELD, KEY_FIELD))
                print ('Inserting entities into entity_map')
                for cluster, scores in clustered_dupes:
                    cluster_id = cluster[0]
                    for key_field, score in zip(cluster, scores):
                            INSERT INTO entity_map
                                (%s, canon_id, cluster_score)
                            VALUES (%s, %s, %s)
                            """ % (KEY_FIELD, key_field, cluster_id, score))
                print ('Indexing head_index')
                c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--dbname', dest='dbname', default='whitehouse', help='database name')
        parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
        parser.add_argument('-t', '--training', default='training.json', help='name of training file')
        args = parser.parse_args()

    Active Learning Observations

    We ran multiple experiments:

    • Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
    • Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

    We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic since those are much more commonly occurring in this dataset. For example, are two records containing the name Michael Grant the same entity?

    Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed what seemed to be many nicknames or that could have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

    Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

    One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe for use has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase. Thus, the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

    lastname : KUZIEMKO
    firstname : ILYANA
    meeting_loc : WH
    lastname : KUZIEMKO
    firstname : ILYANA
    meeting_loc : WH
    Do these records refer to the same thing?
    (y)es / (n)o / (u)nsure / (f)inished

    Some were hard:

    lastname : Desimone
    firstname : Daniel
    meeting_loc : OEOB
    lastname : DeSimone
    firstname : Daniel
    meeting_loc : WH
    Do these records refer to the same thing?
    (y)es / (n)o / (u)nsure / (f)inished


    What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one that generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, have the same meeting location and the same uin (which confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier — people who seem to spend a lot of time at the White House like staffers and other political appointees.

    During the dedupe process, we computed there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.

    Within-Visit Duplicates

    lastname : Ryan
    meeting_loc : OEOB
    firstname : Patrick
    uin : U62671
    lastname : Ryan
    meeting_loc : OEOB
    firstname : Patrick
    uin : U62671

    Across-Visit Duplicates (Frequent Fliers)

    lastname : TANGHERLINI
    meeting_loc : OEOB
    firstname : DANIEL
    uin : U02692
    lastname : TANGHERLINI
    meeting_loc : NEOB
    firstname : DANIEL
    uin : U73085
    lastname : ARCHULETA
    meeting_loc : WH
    firstname : KATHERINE
    uin : U68121
    lastname : ARCHULETA
    meeting_loc : OEOB
    firstname : KATHERINE
    uin : U76331

    Silvrback blog image


    In this beginners guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

    Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion about a number of other important topics to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

    District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

    Continue Reading…


    Read More

    Data Exploration with Python, Part 2

    This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out Data Exploration with Python, Part 1!

    Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.

    When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let's review where we should be at this point and then continue our exploration process with data preparation.

    In Part 1 of this series, we were introduced to the data exploration framework we will be using. As a reminder, here is what that framework looks like.

    Exploratory Data Analysis Framework

    We also introduced the example data set we are going to be using to illustrate the different phases and stages of the framework. Here is what that looks like.

    EPA Vehicle Fuel Economy Data

    We then familiarized ourselves with our data set by identifying the types of information and entities encoded within it. We also reviewed several data transformation and visualization methods that we will use later to explore and analyze it. Now we are at the last stage of the framework's Prep Phase, the Create stage, where our goal will be to create additional categorical fields that will make our data easier to explore and allow us to view it from new perspectives.

    Renaming Columns to be More Intuitive

    Before we dive in and start creating categories, however, we have an opportunity to improve our categorization efforts by examining the columns in our data and making sure their labels intuitively convey what they represent. Just as with the other aspects of preparation, changing them now will save us from having to remember what displ or co2TailpipeGpm mean when they show up on a chart later. In my experience, these small, detail-oriented enhancements to the beginning of your process usually compound and preserve cognitive cycles that you can later apply to extracting insights.

    We can use the code below to rename the columns in our vehicles data frame.

    vehicles.columns = ['Make','Model','Year','Engine Displacement','Cylinders',
                        'Transmission','Drivetrain','Vehicle Class','Fuel Type',
                        'Fuel Barrels/Year','City MPG','Highway MPG','Combined MPG',
                        'CO2 Emission Grams/Mile','Fuel Cost/Year']

    Thinking About Categorization

    Now that we have changed our column names to be more intuitive, let's take a moment to think about what categorization is and examine the categories that currently exist in our data set. At the most basic level, categorization is just a way that humans structure information — how we hierarchically create order out of complexity. Categories are formed based on attributes that entities have in common, and they present us with different perspectives from which we can view and think about our data.

    Our primary objective in this stage is to create additional categories that will help us further organize our data. This will prove beneficial not only for the exploratory analysis we will conduct but also for any supervised machine learning or modeling that may happen further down the data science pipeline. Seasoned data scientists know that the better your data is organized, the better downstream analyses you will be able to perform and the more informative features you will have to feed into your machine learning models.

    In this stage of the framework, we are going to create additional categories in 3 distinct ways:

    • Category Aggregations
    • Binning Continuous Variables
    • Clustering

    Now that we have a better idea of what we are doing and why, let's get started.

    Aggregating to Higher-Level Categories

    The first way we are going to create additional categories is by identifying opportunities to create higher-level categories out of the variables we already have in our data set. In order to do this, we need to get a sense of what categories currently exist in the data. We can do this by iterating through our columns and printing out the name, the number of unique values, and the data type for each.

    def unique_col_values(df):
        for column in df:
            print("{} | {} | {}".format(
                df[column].name, len(df[column].unique()), df[column].dtype
    Make | 126 | object
    Model | 3491 | object
    Year | 33 | int64
    Engine Displacement | 65 | float64
    Cylinders | 9 | float64
    Transmission | 43 | object
    Drivetrain | 7 | object
    Vehicle Class | 34 | object
    Fuel Type | 13 | object
    Fuel Barrels/Year | 116 | float64
    City MPG | 48 | int64
    Highway MPG | 49 | int64
    Combined MPG | 45 | int64
    CO2 Emission Grams/Mile | 550 | float64
    Fuel Cost/Year | 58 | int64

    From looking at the output, it is clear that we have some numeric columns (int64 and float64) and some categorical columns (object). For now, let's focus on the six categorical columns in our data set.

    • Make: 126 unique values
    • Model: 3,491 unique values
    • Transmission: 43 unique values
    • Drivetrain: 7 unique values
    • Vehicle Class: 34 unique values
    • Fuel Type: 13 unique values

    When aggregating and summarizing data, having too many categories can be problematic. The average human is said to have the ability to hold 7 objects at a time in their short-term working memory. Accordingly, I have noticed that once you exceed 8-10 discrete values in a category, it becomes increasingly difficult to get a holistic picture of how the entire data set is divided up.

    What we want to do is examine the values in each of our categorical variables to determine where opportunities exist to aggregate them into higher-level categories. The way this is typically done is by using a combination of clues from the current categories and any domain knowledge you may have (or be able to acquire).

    For example, imagine aggregating by Transmission, which has 43 discrete values in our data set. It is going to be difficult to derive insights due to the fact that any aggregated metrics are going to be distributed across more categories than you can hold in short-term memory. However, if we examine the different transmission categories with the goal of finding common features that we can group on, we would find that all 43 values fall into one of two transmission types, Automatic or Manual.

    Category Aggregations - Transmission

    Let's create a new Transmission Type column in our data frame and, with the help of the loc method in pandas, assign it a value of Automatic where the first character of Transmission is the letter A and a value of Manual where the first character is the letter M.

    AUTOMATIC = "Automatic"
    MANUAL = "Manual"
                 'Transmission Type'] = AUTOMATIC
                 'Transmission Type'] = MANUAL

    We can apply the same logic to the Vehicle Class field. We originally have 34 vehicle classes, but we can distill those down into 8 vehicle categories, which are much easier to remember.

    Category Aggregations - Vehicle Class

    small = ['Compact Cars','Subcompact Cars','Two Seaters','Minicompact Cars']
    midsize = ['Midsize Cars']
    large = ['Large Cars']
    vehicles.loc[vehicles['Vehicle Class'].isin(small), 
                 'Vehicle Category'] = 'Small Cars'
    vehicles.loc[vehicles['Vehicle Class'].isin(midsize), 
                 'Vehicle Category'] = 'Midsize Cars'
    vehicles.loc[vehicles['Vehicle Class'].isin(large), 
                 'Vehicle Category'] = 'Large Cars'
    vehicles.loc[vehicles['Vehicle Class'].str.contains('Station'), 
                 'Vehicle Category'] = 'Station Wagons'
    vehicles.loc[vehicles['Vehicle Class'].str.contains('Truck'), 
                 'Vehicle Category'] = 'Pickup Trucks'
    vehicles.loc[vehicles['Vehicle Class'].str.contains('Special Purpose'), 
                 'Vehicle Category'] = 'Special Purpose'
    vehicles.loc[vehicles['Vehicle Class'].str.contains('Sport Utility'), 
                 'Vehicle Category'] = 'Sport Utility'
    vehicles.loc[(vehicles['Vehicle Class'].str.lower().str.contains('van')),
                 'Vehicle Category'] = 'Vans & Minivans'

    Next, let's look at the Make and Model fields, which have 126 and 3,491 unique values respectively. While I can't think of a way to get either of those down to 8-10 categories, we can create another potentially informative field by concatenating Make and the first word of the Model field together into a new Model Type field. This would allow us to, for example, categorize all Chevrolet Suburban C1500 2WD vehicles and all Chevrolet Suburban K1500 4WD vehicles as simply Chevrolet Suburbans.

    vehicles['Model Type'] = (vehicles['Make'] + " " +

    Finally, let's look at the Fuel Type field, which has 13 unique values. On the surface, that doesn't seem too bad, but upon further inspection, you'll notice some complexity embedded in the categories that could probably be organized more intuitively.

    vehicles['Fuel Type'].unique()
    array(['Regular', 'Premium', 'Diesel', 'Premium and Electricity',
           'Premium or E85', 'Premium Gas or Electricity', 'Gasoline or E85',
           'Gasoline or natural gas', 'CNG', 'Regular Gas or Electricity',
           'Midgrade', 'Regular Gas and Electricity', 'Gasoline or propane'],

    This is interesting and a little tricky because there are some categories that contain a single fuel type and others that contain multiple fuel types. In order to organize this better, we will create two sets of categories from these fuel types. The first will be a set of columns that will be able to represent the different combinations, while still preserving the individual fuel types.

    vehicles['Gas'] = 0
    vehicles['Ethanol'] = 0
    vehicles['Electric'] = 0
    vehicles['Propane'] = 0
    vehicles['Natural Gas'] = 0
    vehicles.loc[vehicles['Fuel Type'].str.contains(
            'Regular|Gasoline|Midgrade|Premium|Diesel'),'Gas'] = 1
    vehicles.loc[vehicles['Fuel Type'].str.contains('E85'),'Ethanol'] = 1
    vehicles.loc[vehicles['Fuel Type'].str.contains('Electricity'),'Electric'] = 1
    vehicles.loc[vehicles['Fuel Type'].str.contains('propane'),'Propane'] = 1
    vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),'Natural Gas'] = 1

    As it turns out, 99% of the vehicles in our database have gas as a fuel type, either by itself or combined with another fuel type. Since that is the case, let's create a second set of categories - specifically, a new Gas Type field that extracts the type of gas (Regular, Midgrade, Premium, Diesel, or Natural) each vehicle accepts.

    vehicles.loc[vehicles['Fuel Type'].str.contains(
            'Regular|Gasoline'),'Gas Type'] = 'Regular'
    vehicles.loc[vehicles['Fuel Type'] == 'Midgrade',
                 'Gas Type'] = 'Midgrade'
    vehicles.loc[vehicles['Fuel Type'].str.contains('Premium'),
                 'Gas Type'] = 'Premium'
    vehicles.loc[vehicles['Fuel Type'] == 'Diesel',
                 'Gas Type'] = 'Diesel'
    vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),
                 'Gas Type'] = 'Natural'

    An important thing to note about what we have done with all of the categorical fields in this section is that, while we created new categories, we did not overwrite the original ones. We created additional fields that will allow us to view the information contained within the data set at different (often higher) levels. If you need to drill down to the more granular original categories, you can always do that. However, now we have a choice whereas before we performed these category aggregations, we did not.

    Creating Categories from Continuous Variables

    The next way we can create additional categories in our data is by binning some of our continuous variables - breaking them up into different categories based on a threshold or distribution. There are multiple ways you can do this, but I like to use quintiles because it gives me one middle category, two categories outside of that which are moderately higher and lower, and then two extreme categories at the ends. I find that this is a very intuitive way to break things up and provides some consistency across categories. In our data set, I've identified 4 fields that we can bin this way.

    Silvrback blog image

    Binning essentially looks at how the data is distributed, creates the necessary number of bins by splitting up the range of values (either equally or based on explicit boundaries), and then categorizes records into the appropriate bin that their continuous value falls into. Pandas has a qcut method that makes binning extremely easy, so let's use that to create our quintiles for each of the continuous variables we identified.

    efficiency_categories = ['Very Low Efficiency', 'Low Efficiency',
                             'Moderate Efficiency','High Efficiency',
                             'Very High Efficiency']
    vehicles['Fuel Efficiency'] = pd.qcut(vehicles['Combined MPG'],
                                          5, efficiency_categories)
    engine_categories = ['Very Small Engine', 'Small Engine','Moderate Engine',
                         'Large Engine', 'Very Large Engine']
    vehicles['Engine Size'] = pd.qcut(vehicles['Engine Displacement'],
                                      5, engine_categories)
    emission_categories = ['Very Low Emissions', 'Low Emissions',
                            'Moderate Emissions','High Emissions',
                            'Very High Emissions']
    vehicles['Emissions'] = pd.qcut(vehicles['CO2 Emission Grams/Mile'],
                                     5, emission_categories)
    fuelcost_categories = ['Very Low Fuel Cost', 'Low Fuel Cost',
                           'Moderate Fuel Cost','High Fuel Cost',
                           'Very High Fuel Cost']
    vehicles['Fuel Cost'] = pd.qcut(vehicles['Fuel Cost/Year'],
                                    5, fuelcost_categories)

    Clustering to Create Additional Categories

    The final way we are going to prepare our data is by clustering to create additional categories. There are a few reasons why I like to use clustering for this. First, it takes multiple fields into consideration together at the same time, whereas the other categorization methods only consider one field at a time. This will allow you to categorize together entities that are similar across a variety of attributes, but might not be close enough in each individual attribute to get grouped together.

    Clustering also creates new categories for you automatically, which takes much less time than having to comb through the data yourself identifying patterns across attributes that you can form categories on. It will automatically group similar items together for you.

    The third reason I like to use clustering is because it will sometimes group things in ways that you, as a human, may not have thought of. I'm a big fan of humans and machines working together to optimize analytical processes, and this is a good example of value that machines bring to the table that can be helpful to humans. I'll write more about my thoughts on that in future posts, but for now, let's move on to clustering our data.

    The first thing we are going to do is isolate the columns we want to use for clustering. These are going to be columns with numeric values, as the clustering algorithm will need to compute distances in order to group similar vehicles together.

    cluster_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                       'City MPG','Highway MPG','Combined MPG',
                       'CO2 Emission Grams/Mile', 'Fuel Cost/Year']

    Next, we want to scale the features we are going to cluster on. There are a variety of ways to normalize and scale variables, but I'm going to keep things relatively simple and just use Scikit-Learn's MaxAbsScaler, which will divide each value by the max absolute value for that feature. This will preserve the distributions in the data and convert the values in each field to a number between 0 and 1 (technically -1 and 1, but we don't have any negatives).

    from sklearn import preprocessing
    scaler = preprocessing.MaxAbsScaler()
    vehicle_clusters = scaler.fit_transform(vehicles[cluster_columns])
    vehicle_clusters = pd.DataFrame(vehicle_clusters, columns=cluster_columns)

    Now that our features are scaled, let's write a couple of functions. The first function we are going to write is a kmeans_cluster function that will k-means cluster a given data frame into a specified number of clusters. It will then return a copy of the original data frame with those clusters appended in a column named Cluster.

    from sklearn.cluster import KMeans
    def kmeans_cluster(df, n_clusters=2):
        model = KMeans(n_clusters=n_clusters, random_state=1)
        clusters = model.fit_predict(df)
        cluster_results = df.copy()
        cluster_results['Cluster'] = clusters
        return cluster_results

    Our second function, called summarize_clustering is going to count the number of vehicles that fall into each cluster and calculate the cluster means for each feature. It is going to merge the counts and means into a single data frame and then return that summary to us.

    def summarize_clustering(results):
        cluster_size = results.groupby(['Cluster']).size().reset_index()
        cluster_size.columns = ['Cluster', 'Count']
        cluster_means = results.groupby(['Cluster'], as_index=False).mean()
        cluster_summary = pd.merge(cluster_size, cluster_means, on='Cluster')
        return cluster_summary

    We now have functions for what we need to do, so the next step is to actually cluster our data. But wait, our kmeans_cluster function is supposed to accept a number of clusters. How do we determine how many clusters we want?

    There are a number of approaches for figuring this out, but for the sake of simplicity, we are just going to plug in a couple of numbers and visualize the results to arrive at a good enough estimate. Remember earlier in this post where we were trying to aggregate our categorical variables to less than 8-10 discrete values? We are going to apply the same logic here. Let's start out with 8 clusters and see what kind of results we get.

    cluster_results = kmeans_cluster(vehicle_clusters, 8)
    cluster_summary = summarize_clustering(cluster_results)

    After running the couple of lines of code above, your cluster_summary should look similar to the following.

    Silvrback blog image

    By looking at the Count column, you can tell that there are some clusters that have significantly more records in them (ex. Cluster 7) and others that have significantly fewer (ex. Cluster 3). Other than that, though, it is difficult to notice anything informative about the summary. I don't know about you, but to me, the rest of the summary just looks like a bunch of decimals in a table.

    This is a prime opportunity to use a visualization to discover insights faster. With just a couple import statements and a single line of code, we can light this summary up in a heatmap so that we can see the contrast between all those decimals and between the different clusters.

    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)

    Silvrback blog image

    In this heatmap, the rows represent the features and the columns represent the clusters, so we can compare how similar or differently columns look to each other. Our goal for clustering these features is to ultimately create meaningful categories out of the clusters, so we want to get to the point where we can clearly distinguish one from the others. This heatmap allows us to do this quickly and visually.

    With this goal in mind, it is apparent that we probably have too many clusters because:

    • Clusters 3, 4, and 7 look pretty similar
    • Clusters 2 and 5 look similar as well
    • Clusters 0 and 6 are also a little close for comfort

    From the way our heatmap currently looks, I'm willing to bet that we can cut the number of clusters in half and get clearer boundaries. Let's re-run the clustering, summary, and heatmap code for 4 clusters and see what kind of results we get.

    cluster_results = kmeans_cluster(vehicle_clusters, 4)
    cluster_summary = summarize_clustering(cluster_results)
    sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)

    Silvrback blog image

    These clusters look more distinct, don't they? Clusters 1 and 3 look like they are polar opposites of each other, cluster 0 looks like it’s pretty well balanced across all the features, and cluster 2 looks like it’s about half-way between Cluster 0 and Cluster 1.

    We now have a good number of clusters, but we still have a problem. It is difficult to remember what clusters 0, 1, 2, and 3 mean, so as a next step, I like to assign descriptive names to the clusters based on their properties. In order to do this, we need to look at the levels of each feature for each cluster and come up with intuitive natural language descriptions for them. You can have some fun and can get as creative as you want here, but just keep in mind that the objective is for you to be able to remember the characteristics of whatever label you assign to the clusters.

    • Cluster 1 vehicles seem to have large engines that consume a lot of fuel, process it inefficiently, produce a lot of emissions, and cost a lot to fill up. I'm going to label them Large Inefficient.
    • Cluster 3 vehicles have small, fuel efficient engines that don't produce a lot of emissions and are relatively inexpensive to fill up. I'm going to label them Small Very Efficient.
    • Cluster 0 vehicles are fairly balanced across every category, so I'm going to label them Midsized Balanced.
    • Cluster 2 vehicles have large engines but are more moderately efficient than the vehicles in Cluster 1, so I'm going to label them Large Moderately Efficient.

    Now that we have come up with these descriptive names for our clusters, let's add a Cluster Name column to our cluster_results data frame, and then copy the cluster names over to our original vehicles data frame.

    cluster_results['Cluster Name'] = ''
    cluster_results['Cluster Name'][cluster_results['Cluster']==0] = 'Midsized Balanced'
    cluster_results['Cluster Name'][cluster_results['Cluster']==1] = 'Large Inefficient'
    cluster_results['Cluster Name'][cluster_results['Cluster']==2] = 'Large Moderately Efficient'
    cluster_results['Cluster Name'][cluster_results['Cluster']==3] = 'Small Very Efficient'
    vehicles = vehicles.reset_index().drop('index', axis=1)
    vehicles['Cluster Name'] = cluster_results['Cluster Name']


    In this post, we examined several ways to prepare a data set for exploratory analysis. First, we looked at the categorical variables we had and attempted to find opportunities to roll them up into higher-level categories. After that, we converted some of our continuous variables into categorical ones by binning them into quintiles based on how relatively high or low their values were. Finally, we used clustering to efficiently create categories that automatically take multiple fields into consideration. The result of all this preparation is that we now have several columns containing meaningful categories that will provide different perspectives of our data and allow us to acquire as many insights as possible.

    Now that we have these meaningful categories, our data set is in really good shape, which means that we can move on to the next phase of our data exploration framework. In the next post, we will cover the first two stages of the Explore Phase and demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data. Make sure to subscribe to the DDL blog so that you get notified when we publish it!

    District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

    Continue Reading…


    Read More

    Data Exploration with Python, Part 3

    This is the third post in our Data Exploration with Python series. Before reading this post, make sure to check out Part 1 and Part 2!

    Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post.

    We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:

    • The higher-level categories we created via category aggregations.
    • The quintiles we created by binning our continuous variables.
    • The clusters we generated via k-means clustering based on numeric variables.

    Now, without further ado, let's embark on our insight-finding mission!

    Making Our Data Smaller: Filter + Aggregate

    One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: filtering and aggregating. With filtering, you are essentially removing either rows or columns (or both rows and columns) in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records in your data set that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group.

    To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a group_field that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names.

    def agg_count(df, group_field):
        grouped = df.groupby(group_field, as_index=False).size()
        grouped.sort(ascending = False)
        grouped = pd.DataFrame(grouped).reset_index()
        grouped.columns = [group_field, 'Count']
        return grouped

    Now that we have this function in our toolkit, let's use it. Suppose we were looking at the Vehicle Category field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the agg_count function to transform it to show what we wanted to know.

    vehicles_2016 = vehicles[vehicles['Year']==2016]
    category_counts = agg_count(vehicles_2016, 'Vehicle Category')

    Filter Aggregate Count

    This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart.

    ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
    ax.set(xlabel='\n Number of Vehicles Manufactured')
    sns.plt.title('Vehicles Manufactured by Category (2016) \n')

    Vehicles Manufactured 2016 Bar Chart

    Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985).

    vehicles_1985 = vehicles[vehicles['Year']==1985]
    category_counts = agg_count(vehicles, 'Vehicle Category')
    ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
    ax.set(xlabel='\n Number of Vehicles Manufactured')
    sns.plt.title('Vehicles Manufactured by Category (1985) \n')

    Vehicles Manufactured 1985 Bar Chart

    If we wanted to stick with the year 2016 but drill down to the more granular Vehicle Class, we could do that as well.

    class_counts = agg_count(vehicles_2016, 'Vehicle Class')
    ax = sns.barplot(data=class_counts, x='Count', y='Vehicle Class')
    ax.set(xlabel='\n Number of Vehicles Manufactured')
    sns.plt.title('Vehicles Manufactured by Class (2016) \n')

    Silvrback blog image

    We could also look at vehicle counts by manufacturer.

    make_counts = agg_count(vehicles_2016, 'Make')
    ax = sns.barplot(data=make_counts, x='Count', y='Make')
    ax.set(xlabel='\n Number of Vehicles Manufactured')
    sns.plt.title('Vehicles Manufactured by Make (2016) \n')

    Silvrback blog image

    What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our agg_count function. Below, instead of filtering by Year, I've filtered on the Fuel Efficiency field, which contains the fuel efficiency quintiles we generated in the last post. Let's choose the Very High Efficiency value so that we can see how many very efficient vehicles each manufacturer has made.

    very_efficient = vehicles[vehicles['Fuel Efficiency']=='Very High Efficiency']
    make_counts = agg_count(very_efficient, 'Make')
    ax = sns.barplot(data=make_counts, x='Count', y='Make')
    ax.set(xlabel='\n Number of Vehicles Manufactured')
    sns.plt.title('Very Fuel Efficient Vehicles by Make \n')

    Very Efficient Vehicles by Make

    What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called agg_avg that calculates the mean of a designated numerical field.

    def agg_avg(df, group_field, calc_field):
        grouped = df.groupby(group_field, as_index=False)[calc_field].mean()
        grouped = grouped.sort(calc_field, ascending = False)
        grouped.columns = [group_field, 'Avg ' + str(calc_field)]
        return grouped

    We can then simply swap out the agg_count function with our new agg_avg function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the Combined MPG field, by vehicle category.

    category_avg_mpg = agg_avg(vehicles_2016, 'Vehicle Category', 'Combined MPG')
    ax = sns.barplot(data=category_avg_mpg, x='Avg Combined MPG', y='Vehicle Category')
    ax.set(xlabel='\n Average Combined MPG')
    sns.plt.title('Average Combined MPG by Category (2016) \n')

    Average Fuel Efficiency by Vehicle Category

    Pivoting the Data for More Detail

    Up until this point, we've been looking at our data at a pretty high level, aggregating up by a single variable. Sure, we were able to drill down from Vehicle Category to Vehicle Class to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we're going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.

    First, we will create a pivot_count function, similar to the agg_count function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify.

    def pivot_count(df, rows, columns, calc_field):
        df_pivot = df.pivot_table(values=calc_field, 
                                 ).dropna(axis=0, how='all')
        return df_pivot

    We will then use this function on our vehicles_2016 data frame and pivot it out with the Fuel Efficiency quintiles we created in the last post representing the rows, the Engine Size quintiles representing the columns, and then counting the number of vehicles that had a Combined MPG value.

    effic_size_pivot = pivot_count(vehicles_2016,'Fuel Efficiency',
                                   'Engine Size','Combined MPG')

    Filter Pivot Count Fuel Efficiency vs. Engine Size

    This is OK, but it would be faster to analyze visually. Let's create a heatmap that will color the magnitude of the counts and present us with a more intuitive view.

    sns.heatmap(effic_size_pivot, annot=True, fmt='g')
    ax.set(xlabel='\n Engine Size')
    sns.plt.title('Fuel Efficiency vs. Engine Size (2016) \n')

    Fuel Efficiency by Engine Size Heatmap 2016

    Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here's what this heatmap looks like for 1985.

    effic_size_pivot = pivot_count(vehicles_1985,'Fuel Efficiency',
                                   'Engine Size','Combined MPG')
    fig, ax = plt.subplots(figsize=(15,8))
    sns.heatmap(effic_size_pivot, annot=True, fmt='g')
    ax.set(xlabel='\n Engine Size')
    sns.plt.title('Fuel Efficiency vs. Engine Size (1985) \n')

    Fuel Efficiency by Engine Size Heatmap 1985

    With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for any of the axes (rows or columns), and it will display all the different combinations of values for those variables.

    effic_size_category = pivot_count(vehicles_2016,
                                      ['Engine Size','Fuel Efficiency'],
                                      'Vehicle Category','Combined MPG')
    fig, ax = plt.subplots(figsize=(20,10))
    sns.heatmap(effic_size_category, annot=True, fmt='g')
    ax.set(xlabel='\n Vehicle Category')
    sns.plt.title('Fuel Efficiency + Engine Size vs. Vehicle Category (2016) \n')

    Engine Size and Fuel Efficiency by Vehicle Category Heatmap

    In this heatmap, we have Engine Size and Fuel Efficiency combinations represented by the rows, and we've added a third variable (the Vehicle Category) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year.

    As a final example for this section, let's create a pivot heatmap that plots Make against Vehicle Category for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company's current offerings in terms of the breadth vs. depth of vehicle types they make.

    effic_size_pivot = pivot_count(vehicles_2016, 'Make',
                                   'Vehicle Category','Combined MPG')
    fig, ax = plt.subplots(figsize=(20,20))
    sns.heatmap(effic_size_pivot, annot=True, fmt='g')
    ax.set(xlabel='\n Vehicle Category')
    sns.plt.title('Make vs. Vehicle Category (2016) \n')

    Make vs. Vehicle Category 2016

    Visualizing Changes Over Time

    So far in this post, we've been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a multi_line function that accepts a data frame and x/y fields and then plots them on a multiline chart.

    def multi_line(df, x, y):
        ax = df.groupby([x, y]).size().unstack(y).plot(figsize=(15,8), cmap="Set2")

    Let's use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year.

    multi_line(vehicles, 'Year', 'Vehicle Category')
    ax.set(xlabel='\n Year')
    sns.plt.title('Vehicle Categories Over Time \n')

    Vehicle Categories Over Time

    We can see from the chart that Small Cars have generally dominated across the board and that there was a small decline in the late 90s that then started to pick up again in the early 2000s. We can also see the introduction and increase in popularity of SUVs starting in the late 90s, and the decline in popularity of trucks in recent years.

    If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW had the most number of vehicles last year and we saw in the pivot heatmap that those were mostly small cars, let's filter for just their vehicles to see whether they have always made a lot of small cars or if this is more of a recent phenomenon.

    bmw = vehicles[vehicles['Make'] == 'BMW']
    multi_line(bmw, 'Year', 'Vehicle Category')
    ax.set(xlabel='\n Year')
    sns.plt.title('BMW Vehicle Categories Over Time \n')

    BMW Vehicle Categories Over Time

    We can see in the chart above that they started off making a reasonable number of small cars, and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, who started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture less of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.

    toyota = vehicles[vehicles['Make'] == 'Toyota']
    multi_line(toyota, 'Year', 'Vehicle Category')
    ax.set(xlabel='\n Year')
    sns.plt.title('Toyota Vehicle Categories Over Time \n')

    Toyota Vehicle Categories Over Time

    Examining Relationships Between Variables

    The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data. Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations are, what typical combinations of numerical values exist in our data, and which combinations are anomalies.

    For looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird's eye view of the relationships between all the numerical fields in my data set. With just a couple lines of code, we can not only create a scatter matrix, but we can also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post.

    select_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                       'City MPG','Highway MPG','Combined MPG',
                       'CO2 Emission Grams/Mile', 'Fuel Cost/Year', 'Cluster Name']
    sns.pairplot(vehicles[select_columns], hue='Cluster Name', size=3)

    Scatter Matrix with Cluster Hue

    From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are also some hyperbolic relationships in there as well, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can also get a sense of the size of our clusters, how they are distributed, and the level of overlap we have between them.

    Once we have this high-level overview, we can zoom in on anything that we think looks interesting. For example, let's take a closer look at Engine Displacement plotted against Combined MPG.

    sns.lmplot('Engine Displacement', 'Combined MPG', data=vehicles, 
               hue='Cluster Name', size=8, fit_reg=False)

    Displacement MPG Scatter Plot

    In addition to being able to see that there is a hyperbolic correlation between these two variables, we can see that our Small Very Efficient cluster resides in the upper left, followed by our Midsized Balanced cluster that looks smaller and more compact than the others. After that, we have our Large Moderately Efficient cluster and finally our Large Inefficient cluster on the bottom right.

    We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you're at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!


    We have covered quite a bit in this post, and I hope I've provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data. Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you're able to find in the comments!

    Also, make sure to stay tuned because in the next (and final) post of this series, I'm going to cover how to identify and think about the different networks that are present in your data and how to explore them using graph analytics. Click the subscribe button below and enter your email so that you receive a notification as soon as it's published!

    District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

    Continue Reading…


    Read More

    If you did not already know

    Pattern Sequence based Forecasting (PSF) google
    This paper discusses about PSF, an R package for Pattern Sequence based Forecasting (PSF) algorithm used for univariate time series future prediction. The PSF algorithm consists of two major parts: clustering and prediction techniques. Clustering part includes selection of cluster size and then labeling of time series data with reference to various clusters. Whereas, the prediction part include functions like optimum window size selection for specific patterns and prediction of future values with reference to past pattern sequences. The PSF package consists of various functions to implement PSF algorithm. It also contains a function, which automates all other functions to obtain optimum prediction results. The aim of this package is to promote PSF algorithm and to ease its implementation with minimum efforts. This paper describe all the functions in PSF package with their syntax and simple examples. Finally, the usefulness of this package is discussed by comparing it with auto.arima, a well known time series forecasting function available on CRAN repository. …

    Spinnaker google
    Spinnaker is an open source, multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence. …

    Spreadmart google
    A spreadmart (spreadsheet data mart) is a situation in which a company’s employees has inconsistent views of corporate data because each department relies on the data from their own spreadsheets. …

    Continue Reading…


    Read More

    Swift as a low-level programming language?

    Modern processors come with very powerful instructions that are simply not available in many high-level languages JavaScript or Python. Here are a few examples:

    • Most programming languages allow you to multiply two 64-bit integers and to get the 64-bit results (it is typically written as x = a * b). But the multiplication of two 64-bit integers actually occupies 128 bits. Reasonably enough, most programming languages (including C, Java, Go, Swift, C++,…) do not have 128-bit integers as a standard data type. So these programming languages have no way to represent the result, other than as two 64-bit integers. In practice, they often offer no direct way to get to the most significant 64 bits. Yet 64-bit processors (x64, ARM Aarch64…) have no problem computing the most significant 64 bits.
    • Similarly, processors have specialized instructions to compute population counts (also called Hamming weight). A population count is the number of ones contained in the binary representation of a machine word. It is a critical operation in many advanced algorithms and data structures. You can compute it with shifts and additions, but it is much, much slower than when using the dedicated processors that all modern processors have to solve this problem. And I should stress that you can beat the dedicated instructions with vector instructions.

    This disconnect between programming languages and processors is somewhat problematic because programmers have to get around the problem by doing more work, essentially letting the perfectly good instructions that processors offer go to waste. To be clear, that’s not an issue for most people, but most people do not deal with difficult programming challenges.

    That is, 95% of all programming tasks can be solved with Python or JavaScript. Then, out of the remaining 5%, the vast majority are well served with something like Java or C#. But then, for the top 1% of all programming problems, the most challenging ones, then you need all of the power you can get your hands on. Historically, people have been using C or C++ for these problems.

    I like C and C++, and they are fine for most things. These languages are aging well, in some respect. However, they also carry a fair amount of baggage.

    But what else is there?

    Among many other good choices, there is Apple’s Swift.

    Swift is progressing fast. Swift 4.0 is a step forward. And, in some sense, it is beating C and C++. Let me consider the two problems I mentioned: getting the most significant bits of a product and computing the population count. C and C++ offer no native way to solve these problems. At best, there are common, non-portable, extensions that help.

    Swift 4.0 solves both problems optimally in my view:

    • value1.multipliedFullWidth(by: value2).high gets mapped to the mulx instruction on my x64 laptop
    • value.nonzeroBitCount gets mapped to the popcnt instruction on my x64 laptop

    (The phrase nonzeroBitCount is a weird way to describe the population count, but I can live with it.)

    If you call these operations in a tight loop, they seem to generate very efficient code. In fact, consider the case where you repeatedly call value.nonzeroBitCount over an array:

    func popcntArray( _  value  : inout [UInt64] ) -> Int {
        return value.reduce( 0, { x, y in
            x &+ popCnt(y)

    The compiler does not use popcnt, but rather the more efficient vectorized approach (see Muła et al.). That’s because Swift benefits from the powerful LLVM machinery in the back-end.

    I wrote a small Swift 4.0 program to illustrate my points. You can compile a swift program for your hardware using a command such as swiftc myprogram.swift -O -Xcc -march=native. You can then use the lldb debugger to automatically get the assembly produced by a given function.

    Conclusion. If you are not targeting iOS, it is crazy to use Swift for high-performance low-level programming language at this time. However, it is getting less and less crazy.

    Continue Reading…


    Read More

    R Packages worth a look

    Core Inflation (Inflation)
    Provides access to core inflation functions. Four different core inflation functions are provided. The well known trimmed means, exclusion and double weighing methods, alongside the new Triple Filter method introduced in Ferreira et al. (2016) <>.

    Give your Dependencies Stars on GitHub! (ThankYouStars)
    A tool for starring GitHub repositories.

    pinp’ is not ‘PNAS’ (pinp)
    A ‘PNAS’-alike style for ‘rmarkdown’, derived from the ‘Proceedings of the National Academy of Sciences of the United States of America’ (PNAS, see <> ) LaTeX style, and adapted for use with ‘markdown’ and ‘pandoc’.

    Transitive Index Numbers for Cross-Sections and Panel Data (multilaterals)
    Computing transitive (and non-transitive) index numbers (Coelli et al., 2005 <doi:10.1007/b136381>) for cross-sections and panel data. For the calculation of transitive indexes, the EKS (Coelli et al., 2005 <doi:10.1007/b136381>; Rao et al., 2002 <doi:10.1007/978-1-4615-0851-9_4>) and Minimum spanning tree (Hill, 2004 <doi:10.1257/0002828043052178>) methods are implemented. Traditional fixed-base and chained indexes, and their growth rates, can also be derived using the Paasche, Laspeyres, Fisher and Tornqvist formulas.

    Spatial Floor Simulation (Isotropic) (SpatialFloor)
    Spatial floor simulation with exponential/Gaussian variance-covariance function (isotropic), with specification of distance function, nugget, sill, range. The methodology follows Nole A.C. Cressie (2015) <doi:10.1002/9781119115151>.

    Continue Reading…


    Read More

    September 21, 2017

    Will Stanton hit 61 home runs this season?

    [edit: Juho Kokkala corrected my homework. Thanks! I updated the post. Also see some further elaboration in my reply to Andrew’s comment. As Andrew likes to say …]

    So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple back-of-the-envelope Bayesian model and see what the posterior event probability estimate is.

    Sampling notation

    A simple model that assumes a home run rate per at bat with a uniform (conjugate) prior:

    \theta \sim \mbox{Beta}(1, 1)

    The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.

    56 \sim \mbox{Binomial}(555, \theta)

    Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the at-bats in the rest of the season is Poisson.

    \mathit{ab} \sim \mbox{Poisson}(10 \times 555 / 149)

    We then take the number of home runs to be binomial given the number of at bats and the home run rate.

    h \sim \mbox{Binomial}(\mathit{ab}, \theta)

    Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.

    \mbox{gte61} = \mbox{I}[h \geq (61 - 56)]

    Event probability

    The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable,

    \displaystyle \mbox{Pr}[h \geq (61 - 56)] \\[6pt] \hspace*{3em}  \displaystyle { } = \ \int_{\theta} \ \sum_{ab} \, \ \mathrm{I}[h \geq 61 - 56] \\ \hspace*{8em} \ \times \ \mbox{Binomial}(h \mid ab, \theta) \\[6pt] \hspace*{8em} \ \times \ \mbox{Poisson}(ab \mid 10 \ \times \ 555 / 149) \\[6pt] \hspace*{8em} \ \times \ \mbox{Beta}(\theta \mid 1 + 56, 1 + 555 - 56) \ \mathrm{d}\theta.

    Computation in R

    The posterior for \theta is analytic because the prior is conjugate, letting us simulate the posterior chance of success given the observed successes (56) and number of trials (555). The number of at bats is independent and also easy to simulate with a Poisson random number generator. We then simulate the number of hits on the outside as a random binomial, and finally, we compare it to the total and then report the fraction of simulations in which the simulated number of home runs put Stanton at 61 or more:

    > sum(rbinom(1e5,
                 rpois(1e5, 10 * 555 / 149),
                 rbeta(1e5, 1 + 56, 1 + 555 - 56))
           >= (61 - 56)) / 1e5
    [1] 0.34

    That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.


    The above is intended for recreational use only and is not intended for serious bookmaking.


    You guessed it—code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistics. It’d be even better hierarchically with more players (that’ll pull the estimate for \theta down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.

    The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.

    The post Will Stanton hit 61 home runs this season? appeared first on Statistical Modeling, Causal Inference, and Social Science.

    Continue Reading…


    Read More

    Product Launch: Amped up Kernels Resources + Code Tips & Hidden Cells

    Kaggle’s kernels focused engineering team has been working hard to make our coding environment one that you want to use for all of your side projects. We’re excited to announce a host of new changes that we believe make Kernels the default place you’ll want to train your competition models, explore open data, and build your data science portfolio. Here’s exactly what’s changed:

    Additional Computational Resources (doubled and tripled)

    • Execution time: Now your kernels can run for up to 60 minutes instead of our past 20 minute limit.
    • CPUs: Use up to four CPUs for multithreaded workloads..
    • RAM: Work with twice as much data with 16 GB of RAM available for every kernel.
    • Disk space: Create killer output with 1 GB of disk space.

    New resources

    Code Tips

    Code tips catch common mistakes as you work through coding a kernel. They will pop up when you run code with an identifiable error and significantly cut down your troubleshooting time.

    Code Tips GIF

    Here are some examples of the most common code tips you’ll run into:

    Although you specified the "R" language, you might be writing Python code. Was this intentional? If not, start a Python script instead.

    Couldn't show a character. Did you happen to load binary data as text?

    Did you mean "from bs4 import BeautifulSoup"?

    Did you mean "ggplot2"?

    Do you mean pandas.DataFrame?

    Hidden Cells

    You publish public kernels so you can share your data science work to build a portfolio, get feedback, and help others learn. We've added the ability to hide cells, making it possible to present your analysis cleanly so people can focus on what's useful. If viewers want to dig in, it’s still possible to unhide cells to see the dirty details.

    Improved Reliability

    We know how frustrating it is to lose work, get stuck on Kaggle-side errors, or simply have a workbench that’s down when you’re trying to get up and running. It’s a priority of the Kernels engineering team to improve reliability so you can count on your code. Here are a few recent improvements:

    • Fewer notebook disconnections
    • Notebook editor now works with “Block Third Party Cookies” browser setting enabled
    • More reliable notebook auto-save

    What else can we do to make Kernels your default data science and ML workbench? Aurelio will be monitoring this forum post closely...let us know what you'd like to see next!

    Continue Reading…


    Read More

    Distilled News

    Introduction to Pseudo-Labelling : A Semi-Supervised learning technique

    We have made huge progress in solving Supervised machine learning problems. That also means that we need a lot of data to build our image classifiers or sales forecasters. The algorithms search patterns through the data again and again. But, that is not how human mind learns. A human brain does not require millions of data for training with multiple iterations of going through the same image for understanding a topic. All it needs is a few guiding points to train itself on the underlying patterns. Clearly, we are missing something in current machine learning approaches. Thankfully, there is a line of research which specifically caters to this question. Can we build a system capable of requiring minimal amount of supervision which can learn majority of the tasks on its own. In this article, I would like to cover one such technique called pseudo-labelling. I will give an intuitive explanation of what pseudo-labelling is and then provide a hands-on implementation of it.

    Introducing: Unity Machine Learning Agents

    Our two previous blog entries implied that there is a role games can play in driving the development of Reinforcement Learning algorithms. As the world’s most popular creation engine, Unity is at the crossroads between machine learning and gaming. It is critical to our mission to enable machine learning researchers with the most powerful training scenarios, and for us to give back to the gaming community by enabling them to utilize the latest machine learning technologies. As the first step in this endeavor, we are excited to introduce Unity Machine Learning Agents.

    GAN-Collection: Collection of various GAN models implemented in torch7

    Torch implementation of various types of GANs (e.g. DCGAN, ALI, Context-encoder, DiscoGAN, CycleGAN).

    AI industry overview: Revisiting Big Data, Big Model, and Big Compute in the AI era

    AI Conference chairs Ben Lorica and Roger Chen reveal the current AI trends they’ve observed in industry.

    Monte Carlo Simulations & the “SimDesign” Package in R

    Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how well-suited MCS designs are to classroom demonstrations, and how they provide a hands-on method for students to become acquainted with complex statistical concepts. In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate-analyze-summarize coding structure over the typical “for-loop” strategy are discussed.

    Answer probability questions with simulation (part-2)

    This is the second exercise set on answering probability questions with simulation. Finishing the first exercise set is not a prerequisite. The difficulty level is about the same – thus if you are looking for a challenge aim at writing up faster more elegant algorithms. As always, it pays off to read the instructions carefully and think about what the solution should be before starting to code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated and slow.

    Enterprise-ready dashboards with Shiny and databases

    Inside the enterprise, a dashboard is expected to have up-to-the-minute information, to have a fast response time despite the large amount of data that supports it, and to be available on any device. An end user may expect that clicking on a bar or column inside a plot will result in either a more detailed report, or a list of the actual records that make up that number. This article will cover how to use a set of R packages, along with Shiny, to meet those requirements.

    Networks with R

    In order to practice with network data with R, we have been playing with the Padgett (1994) Florentine’s wedding dataset (discussed in the lecture).

    Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016

    R has a lot of packages for users to analyse posts on social media. As an experiment in this field, I decided to start with the biggest one: Facebook. I decided to look at the Facebook activity of Donald Trump and Hillary Clinton during the 2016 presidential election in the United States. The winner may be more famous for his Twitter account than his Facebook one, but he still used it to great effect to help pick off his Republican rivals in the primaries and to attack Hillary Clinton in the general election. For this work we’re going to be using the Rfacebook package developed by Pablo Barbera, plus his excellent how-to guide.

    Continue Reading…


    Read More

    Thanks for reading!