# My Data Science Blogs

## September 24, 2017

### RcppGSL 0.3.3

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance update RcppGSL 0.3.3 is now on CRAN. It switched the vignette to the our new pinp package and its two-column pdf default.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

No user-facing new code or features were added. The NEWS file entries follow below:

#### Changes in version 0.3.3 (2017-09-24)

• We also check for gsl-config at package load.

• The vignette now uses the pinp package in two-column mode.

• Minor other fixes to package and testing infrastructure.

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Linear Programming and Healthy Diets — Part 2

Previously in this series:

## Foods of the Father

Every so often he picks up a health trend and/or weight loss goal that would make many people’s jaw drop. For example, we once went on a 5-day, 50-mile backpacking trip in the Grand Tetons, and my dad brought one of these per day for dinner, and had vitamin tablets for the rest of his sustenance. The rest of us planned for around 3,000 calories per day. He’s tried the “high fat” and “no fat” diets, and quite a few others. He’s concerned with losing weight, but also living longer, so he’s into caloric restriction among other things.

Recently he asked me to help him optimize his diet. He described a scheme he was performing by hand to minimize the number of calories he consumed per day while maintaining the minimum nutrients required by the FDA’s recommendations. He had a spreadsheet with the nutrients for each food, and a spreadsheet with the constraints on each nutrient. He wanted to come up with a collection of meals (or just throw all the food into a blender) that taste within reason but meet these criteria.

He was essentially solving a linear program by hand, roughly as best as one can, with a few hundred variables! After asking me whether there was “any kind of math” that could help him automate his laborious efforts, I decided to lend a hand. After all, it’s been over three years since I promised my readers I’d apply linear programming to optimize a diet (though it was optimizing for cost rather than calories).

Though it never went beyond what linear programming can handle, pretty quickly my dad’s requests specialized beyond what would interest a general reader. Perhaps this is the nature of math consulting, but it seems when you give someone what they want, they realize that’s not what they wanted.

But the basic ideas are still relevant enough. My solution is a hundred-ish lines of python to set up the input, using Google’s open source operations research tools as the core solver. Disclaimer: I work for Google but I don’t work on the team that wrote this tool. Also nothing in this post represents the views of my employer. It’s just me, and the scale of this problem is laughable for Google to care about anyway.

So this post is half tutorial showing how to use the or-tools python wrapper (it’s only somewhat documented), and half showing a realistic use case for linear programming.

However, don’t let this post dissuade you from the belief that linear programming is useful beyond dieting. People use linear programming to solve all kinds of interesting problems. Here are a few I came across in just the last few weeks:

And that’s not even to mention the ubiquitous applications in operations research (network flow, production optimization, economics) that every large company relies on. The applications seem endless!

As usual, all of the code and data we use in the making of this post is available at this blog’s Github page.

## Refresher

If you remember how linear programs work, you can safely skip this section.

As a refresher, let’s outline how to model the nutrition problem as a linear program and recall the basic notation. The variables are food in 100 gram increments. So $x_1$ might be the amount of canned peas consumed, $x_2$ lobster meat, etc. All variables would have to be nonnegative, of course. The objective is to minimize the total number of calories consumed. If $c_1 \geq 0$ is the amount of calories in 100g of canned peas, then one would pay $c_1x_1$ in calories contributed by peas. If we have $n$ different variables, then the objective function is the linear combination

$\textup{cost}(\mathbf{x}) = \sum_{j=1}^n c_j x_j$

We’re using boldface $\mathbf{x}$ to represent the vector $(x_1, \dots, x_n)$. Likewise, $\mathbf{c}$ will represent the cost vector of foods $(c_1, \dots, c_n)$. As we’ve seen many times, we can compactly write the sum above as an inner product $\langle \mathbf{c}, \mathbf{x} \rangle$.

Finally, we require that the amount of each nutrient combined in the stuff we buy meets some threshold. So for each nutrient we have a constraint. The easiest one is calories; we require the total number of calories consumed is at least (say) 2,000. So if $a_j$ represents the number of calories in food $j$, we require $\sum_{j=1}^n a_j x_j \geq 2000$. We might also want to restrict a maximum number of calories, but in general having a diet with more calories implies higher cost, and so when the linear program minimizes cost we should expect it not to produce a diet with significantly more than 2,000 calories.

Since we have one set of nutrient information for each pair of (nutrient, food), we need to get fancier with the indexing. I’ll call $a_{i,j}$ the amount of nutrient $i$ in food $j$. Note that $A = (a_{i,j})$ will be a big matrix, so I’m saying that nutrients $i$ represent the rows of the matrix and foods $j$ represent the columns. That is, each row of the matrix represents the amount of one specific nutrient in all the foods, and each column represents the nutritional content of a single food. We’ll always use $n$ to denote the number of foods, and $m$ to denote the number of nutrients.

Finally, we have a lower and upper bound for each nutrient, which behind the scenes are converted into lower bounds (possibly negating the variables). This isn’t required to write the program, as we’ll see. In notation, we require that for every $1 \leq i \leq m$, the nutrient constraint $\sum_{j=1}^n a_{i,j} x_j \geq b_i$ is satisfied. If we again use vector notation for the constraints $\mathbf{b}$, we can write the entire set of constraints as a “matrix equation”

$A \mathbf{x} \geq \mathbf{b}$

And this means each entry of the vector you get from multiplying the left-hand-side is greater than the corresponding entry on the right-hand-side. So the entire linear program is summarized as follows

\displaystyle \begin{aligned} \textup{min } & \langle \mathbf{c} , \mathbf{x} \rangle \\ \textup{such that } & A \mathbf{x} \geq \mathbf{b} \\ & \mathbf{x} \geq \mathbf{0} \end{aligned}

That’s the syntactical form of our linear program. Now all (!) we need to do is pick a set of foods and nutrients, and fill in the constants for $A, \mathbf{c}, \mathbf{b}$.

## Nutrients and Foods

The easier of the two pieces of data is the nutrient constraints. The system used in the United States is called the Dietary Reference Intake system. It consists of five parts, which I’ve paraphrased from Wikipedia.

• Estimated Average Requirements (EAR), expected to satisfy the needs of 50% of the people in an age group.
• Recommended Dietary Allowances (RDA), the daily intake level considered sufficient to meet the requirements of 97.5% of healthy individuals (two standard deviations).
• Adequate Intake (AI), where no RDA has been established. Meant to complement the RDA, but has less solid evidence.
• Tolerable upper intake levels (UL), the highest level of daily consumption that have not shown harmful side effects.

While my dad come up with his own custom set of constraints, the ones I’ve posted on the github repository are essentially copy/paste from the current RDA/AI as a lower bound, with the UL as an upper bound. The values I selected are in a csv. Missing values in the upper bound column mean there is no upper bound. And sorry ladies, since it’s for my dad I chose the male values. Women have slightly different values due to different average size/weight.

Nutrient values for food are a little bit harder, because nutrient data isn’t easy to come by. There are a few databases out there, all of which are incomplete, and some of which charge for use. My dad spent a long time hunting down the nutrients (he wanted some additional special nutrients) for his top 200-odd foods.

Instead, in this post we’ll use the USDA’s free-to-use database of 8,000+ foods. It comes in a single, abbreviated, oddly-formatted text file which I’ve parsed into a csv and chosen an arbitrary subset of 800-ish foods to play around with.

## Python OR Tools

Google’s ortools docs ask you to download a tarball to install their package, but I found that was unnecessary (perhaps it’s required to attach a third-party solver to their interface?). Instead, one can just pip install it.

pip3 install py3-ortools


Then in a python script, you can import the ortools library and create a simple linear program:

from ortools.linear_solver import pywraplp

solver = pywraplp.Solver('my_LP', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

x = solver.NumVar(0, 10, 'my first variable')
y = solver.NumVar(0, 10, 'my second variable')

solver.Add(x + y <= 7) solver.Add(x - 2 * y >= -2)

objective = solver.Objective()
objective.SetCoefficient(x, 3)
objective.SetCoefficient(y, 1)
objective.SetMaximization()

status = solver.Solve()

if status not in [solver.OPTIMAL, solver.FEASIBLE]:
raise Exception('Unable to find feasible solution')

print(x.solution_value())
print(y.solution_value())


This provides the basic idea of the library. You can use python’s operator overloading (to an extent) to make the constraints look nice in the source code.

## Setting up the food LP

The main file diet_optimizer.py contains a definition for a class, which, in addition to loading the data, encapsulates all the variables and constraints.

class DietOptimizer(object):
def __init__(self, nutrient_data_filename='nutrients.csv',
nutrient_constraints_filename='constraints.csv'):

self.food_table = # load data into a list of dicts
self.constraints_data = # load data into a list of dicts

self.solver = pywraplp.Solver('diet_optimizer', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)
self.create_variable_dict()
self.create_constraints()

self.objective = self.solver.Objective()
for row in self.food_table:
name = row['description']
var = self.variable_dict[name]
calories_in_food = row[calories_name]
self.objective.SetCoefficient(var, calories_in_food)
self.objective.SetMinimization()


We’ll get into the variables and constraints momentarily, but before that we can see the solve method

    def solve(self):
'''
Return a dictionary with 'foods' and 'nutrients' keys representing
the solution and the nutrient amounts for the chosen diet
'''
status = self.solver.Solve()
if status not in [self.solver.OPTIMAL, self.solver.FEASIBLE]:
raise Exception('Unable to find feasible solution')

chosen_foods = {
food_name: var.solution_value()
for food_name, var in self.variable_dict.items() if var.solution_value() > 1e-10
}

self.chosen_foods = chosen_foods

nutrients = {
row['nutrient']: self.nutrients_in_diet(chosen_foods, row['nutrient'])
for row in self.constraints_table
}

return {
'foods': chosen_foods,
'nutrients': nutrients,
}


Here nutrients_in_diet is a helper function which, given a dictionary of foods and a nutrient, outputs the nutrient contents for that food. This can be used independently of the solver to evaluate the nutrient contents of a proposed diet.

Next we have the method to create the variables.

    def create_variable_dict(self):
'''
The variables are the amount of each food to include, denominated in units of 100g
'''
self.variable_dict = dict(
(row['description'], self.solver.NumVar(0, 10, row['description']))
for row in self.food_table
)


Each food must be present in a nonnegative amount, and I’ve imposed a cap of 10 (1kg) for any individual food. The reason for this is that I originally had a “water” constraint, and the linear program decided to optimize for that by asking one to drink 2L of red wine per day. I neglected to put in an alcohol nutrient (because it was not already there and I’m lazy), so instead I limited the amount of any individual food. It still seems like a reasonable constraint to impose that nobody would want to eat more than 1kg of any single food on one day.

Finally, we can construct the constraints. The core method takes a nutrient and a lower and upper bound:

    def create_constraint(self, nutrient_name, lower, upper):
'''
Each constraint is a lower and upper bound on the
sum of all food variables, scaled by how much of the
relevant nutrient is in that food.
'''
if not lower:
print('Warning! Nutrient %s has no lower bound!'.format(nutrient_name))
return

sum_of_foods = self.foods_for_nutrient(nutrient_name)
constraint_lb = lower <= sum_of_foods
self.constraint_dict[nutrient_name + ' (lower bound)'] = constraint_lb

if not upper:
return  # no upper bound in the data

constraint_ub = sum_of_foods <= upper
self.constraint_dict[nutrient_name + ' (upper bound)'] = constraint_ub


This method is mostly bookkeeping, while foods_for_nutrient does the individual nutrient lookup. Note that one is not allowed to do a double-ended inequality like self.solver.Add(lower <= sum_of_foods <= upper). If you try, ortools will ignore one end of the bound.

    def foods_for_nutrient(self, nutrient_name, scale_by=1.0):
# a helper function that computes the scaled sum of all food variables
# for a given nutrient
relevant_foods = []
for row in self.food_table:
var = self.variable_dict[row['description']]
nutrient_amount = row[nutrient_name]
if nutrient_amount > 0:
relevant_foods.append(scale_by * nutrient_amount * var)

if len(relevant_foods) == 0:
print('Warning! Nutrient %s has no relevant foods!'.format(nutrient_name))
return

return SumArray(relevant_foods)


Here we are a bit inefficient by iterating through the entire table, instead of just those foods containing the nutrient in question. But there are only a few hundred foods in our sample database (8,000 if you use the entire SR28 database), and so the optimization isn’t necessary.

Also note that while ortools allows one to use the sum method, it does so in a naive way, because sum([a, b, c]) becomes ((a + b) + c), which is a problem because if the list is too long their library exceeds Python’s default recursion limit. Instead we construct a SumArray by hand.

Finally, though we omitted it here for simplicity, throughout the code in the Github repository you’ll see references to percent_constraints. This exists because some nutrients, like fat, are recommended to be restricted to a percentage of calories, not an absolute amount. So we define a mechanism to specify a nutrient should be handled with percents, and a mapping from grams to calories. This ends up using the scale_by parameter above, both to scale fat by 9 calories per gram, and to scale calories to be a percentage. Cf. the special function for creating percent constraints.

Finally, we have methods just for pretty-printing the optimization problem and the solution, called summarize_optimization_problem and summarize_solution, respectively.

## Running the solver

Invoking the solver is trivial.

if __name__ == "__main__":
solver = DietOptimizer()
# solver.summarize_optimization_problem()
solution = solver.solve()
solver.summarize_solution(solution)


With the example foods and constraints in the github repo, the result is:

Diet:
--------------------------------------------------

298.9g: ALCOHOLIC BEV,WINE,TABLE,WHITE,MUSCAT
1000.0g: ALFALFA SEEDS,SPROUTED,RAW
38.5g: CURRY POWDER
2.1g: CUTTLEFISH,MXD SP,CKD,MOIST HEAT
31.3g: EGG,WHL,CKD,HARD-BOILED
24.0g: LOTUS ROOT,CKD,BLD,DRND,WO/SALT
296.5g: MACKEREL,JACK,CND,DRND SOL
161.0g: POMPANO,FLORIDA,CKD,DRY HEAT
87.5g: ROSEMARY,FRESH
239.1g: SWEET POTATO,CKD,BKD IN SKN,FLESH,WO/ SALT

Nutrient totals
--------------------------------------------------

1700.0 mg   calcium                   [1700.0, 2100.0]
130.0 g    carbohydrate              [130.0, ]
550.0 mg   choline                   [550.0, 3500.0]
3.3 mg   copper                    [0.9, 10.0]
60.5 g    dietary fiber             [38.0, ]
549.7 μg   dietary folate            [400.0, 1000.0]
1800.0 kcal energy                    [1800.0, 2100.0]
32.4 mg   iron                      [18.0, 45.0]
681.7 mg   magnesium                 [420.0, ]
7.3 mg   manganese                 [2.3, 11.0]
35.0 mg   niacin                    [16.0, 35.0]
11.7 mg   pantothenic acid          [5.0, ]
2554.3 mg   phosphorus                [1250.0, 4000.0]
14.0 g    polyunsaturated fatty acids  [1.6, 16.0]
4700.0 mg   potassium                 [4700.0, ]
165.2 g    protein                   [56.0, ]
2.8 mg   riboflavin                [1.3, ]
220.8 μg   selenium                  [55.0, 400.0]
1500.0 mg   sodium                    [1500.0, 2300.0]
2.4 mg   thiamin                   [1.2, ]
59.4 g    total fat                 [20.0, 35.0]        (29.7% of calories)
3000.0 μg   vitamin a                 [900.0, 3000.0]
23.0 μg   vitamin b12               [2.4, ]
2.4 mg   vitamin b6                [1.7, 100.0]
157.6 mg   vitamin c                 [90.0, 2000.0]
893.0 iu   vitamin d                 [400.0, 4000.0]
15.0 mg   vitamin e                 [15.0, 1000.0]
349.4 μg   vitamin k                 [120.0, ]
17.2 mg   zinc                      [11.0, 40.0]


Unfortunately, this asks for a kilo of raw alfalfa sprouts, which I definitely would not eat. The problem is that alfalfa is ridiculously nutritious. Summarizing the diet with the print_details flag set, we see they contribute a significant amount of nearly every important nutrient.

1000.0g: ALFALFA SEEDS,SPROUTED,RAW
18.8% of calcium (mg)
16.2% of carbohydrate (g)
26.2% of choline (mg)
47.3% of copper (mg)
31.4% of dietary fiber (g)
65.5% of dietary folate (μg)
12.8% of energy (kcal)
29.7% of iron (mg)
39.6% of magnesium (mg)
25.6% of manganese (mg)
13.7% of niacin (mg)
48.2% of pantothenic acid (mg)
27.4% of phosphorus (mg)
29.3% of polyunsaturated fatty acids (g)
16.8% of potassium (mg)
24.2% of protein (g)
45.1% of riboflavin (mg)
2.7% of selenium (μg)
4.0% of sodium (mg)
31.9% of thiamin (mg)
11.6% of total fat (g)
2.7% of vitamin a (μg)
13.9% of vitamin b6 (mg)
52.0% of vitamin c (mg)
1.3% of vitamin e (mg)
87.3% of vitamin k (μg)
53.5% of zinc (mg)


But ignoring that, we have some reasonable sounding foods: fish, sweet potato, rosemary (okay that’s a ton of rosemary), egg and wine. I bet someone could make a tasty meal from those rough ingredients.

## Extensions and Exercises

No tutorial would be complete without exercises. All of these are related to the actual linear program modeling problem.

Food groups: Suppose you had an additional column for each food called “food group.” You want to create a balanced meal, so you add additional constraint for each food group requiring some food, but not too much, from each group. Furthermore, for certain foods like spices, one could add a special constraint for each spice requiring not more than, say, 20g of any given spice. Or else, as one can see, the linear program can produce diets involving obscenely large amounts of spices.

Starting from a given set of foods: Supposing you have an idea for a recipe (or couple of recipes for a day’s meals), but you want to add whatever else is needed to make it meet the nutritional standards. Modify the LP to allow for this.

Integer variations: The ortools package supports integer programming as well. All you need to do to enable this is change the solver type to CBC_MIXED_INTEGER_PROGRAMMING. The solver will run as normal, and now you can create integer-valued variables using solver.IntVar instead of NumVar. Using binary variables, one can define logical OR constraints (figure out how this must work). Define a new binary variable for each food, and define a constraint that makes this variable 0 when the food is not used, and 1 when the food is used. Then add a term to the optimization problem that penalizes having too many different foods in a daily diet.

(Harder) Sampling: Part of the motivation for this project is to come up with a number of different dishes that are all “good” with respect to this optimization problem. Perhaps there is more than one optimal solution, or perhaps there are qualitatively different diets that are close enough to optimal. However, this implementation produces a deterministic output. Find a way to introduce randomness into the program, so that you can get more than one solution.

Feel free to suggest other ideas, and extend or rewrite the model to do something completely different. The sky’s the limit!

### Chess records page

Chess records page (no, not on the first page, or the second page, or the third page, of a google search of *chess records*).

There’s lots of good stuff here, enough to fill much of a book if you so desire. As we’ve discussed, chess games are in the public domain so if you take material on chess games from an existing book or website without crediting the person who compiled this material, you’re not actually plagiarizing.

The post Chess records page appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Profiling Go programs with pprof

Last week me and my cool coworker Josh were debugging some memory problems in a Go program using pprof.

There’s a bunch of pprof documentation on the internet but I found a few things confusing so here are some notes so I can find them easily.

First – when working with pprof it’s good to be running a recent version of Go! For example Go 1.8 adds mutex profiles so you can see mutex contention.

in this post I’ll

• link to the useful pprof resource I found
• explain what a pprof profile is
• give an example of how to look at a heap profile of a Go program
• explain a few things about the heap profiler works (what do the stack traces mean? how are they collected?)
• most importantly (to me), deconstruct an example pprof protobuf file so we understand what a pprof profile actually is

This post won’t really explain in detail how to to use pprof to diagnose performance issues in Go programs, but I think these fundamentals (“what even is a pprof file”) will help me do that more easily.

### pprof basics

pprof lets you collect CPU profiles, traces, and heap profiles for your Go programs. The normal way to use pprof seems to be:

1. Set up a webserver for getting Go profiles (with import _ "net/http/pprof")
2. Run curl localhost:$PORT/debug/pprof/$PROFILE_TYPE to save a profile
3. Use go tool pprof to analyze said profile

You can also generate pprof profiles in your code using the pprof package but I haven’t done that.

Here is every useful link I’ve found so far about pprof on the internet. Basically the material on the internet about pprof seems to be the official documentation + rakyll’s amazing blog.

(there are probably also talks about pprof but I am too impatient to watch talks, that’s part of why I write lots of blog posts and give few talks)

### What’s a profile? What kinds of profiles can I get?

When understanding how things work I like to start at the beginning. What is a “profile” exactly?

Well, let’s read the documentation! The 7th time I looked at the runtime/pprof docs, I read this very useful sentence:

A Profile is a collection of stack traces showing the call sequences that led to instances of a particular event, such as allocation. Packages can create and maintain their own profiles; the most common use is for tracking resources that must be explicitly closed, such as files or network connections.

Each Profile has a unique name. A few profiles are predefined:

goroutine    - stack traces of all current goroutines
heap         - a sampling of all heap allocations
threadcreate - stack traces that led to the creation of new OS threads
block        - stack traces that led to blocking on synchronization primitives
mutex        - stack traces of holders of contended mutexes


There are 7 places you can get profiles in the default webserver: the ones mentioned above

and also 2 more: the CPU profile and the CPU trace.

To analyze these profiles (lists of stack traces), the tool to use is go tool pprof, which is a bunch of tools for visualizing stack traces.

super confusing note: the trace endpoint (/debug/pprof/trace?seconds=5), unlike all the rest, outputs a file that is not a pprof profile. Instead it’s a trace and you can view it using go tool trace (not go tool pprof).

You can see the available profiles with http://localhost:6060/debug/pprof/ in your browser. Except it doesn’t tell you about /debug/pprof/profile or /debug/pprof/trace for some reason.

All of these kinds of profiles (goroutine, heap allocations, etc) are just collections of stacktraces, maybe with some metadata attached. If we look at the pprof protobuf definition, you see that a profile is mostly a bunch of Samples.

A sample is basically a stack trace. That stack trace might have some extra information attached to it! For example in a heap profile, the stack trace has a number of bytes of memory attached to it. I think the Samples are the most important part of the profile.

We’re going to deconstruct what exactly is inside a pprof file later, but for now let’s start by doing a quick example of what analyzing a heap profile looks like!

### Getting a heap profile with pprof

I’m mostly interested in debugging memory problems right now. So I decided to write a program that allocates a bunch of memory to profile with pprof.

func main() {
// we need a webserver to get the pprof webserver
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
fmt.Println("hello world")
var wg sync.WaitGroup
go leakyFunction(wg)
wg.Wait()
}

func leakyFunction(wg sync.WaitGroup) {
defer wg.Done()
s := make([]string, 3)
for i:= 0; i < 10000000; i++{
s = append(s, "magical pandas")
if (i % 100000) == 0 {
time.Sleep(500 * time.Millisecond)
}
}
}


Basically this just starts a goroutine leakyFunction that allocates a bunch of memory and then exits eventually.

Getting a heap profile of this program is really easy – we just need to run go tool pprof http://localhost:6060/debug/pprof/heap. This puts us into an interactive mode where we run top

### pprof keeps improving!

also I found out that thanks to the great work of people like rakyll, pprof keeps getting better!! For example There’s this pull request https://github.com/google/pprof/pull/188 which is being worked on RIGHT NOW which adds flamegraph support to the pprof web interface. Flamegraphs are the best thing in the universe so I’m very excited for that to be available.

If I got someting wrong (I probably did) let me know!!

### R Packages worth a look

Access Domains and Search Popular Websites (websearchr)
Functions that allow for accessing domains and a number of search engines.

Hyphenation and Syllable Counting for Text Analysis (sylly)
Provides the hyphenation algorithm used for ‘TeX’/’LaTeX’ and similar software, as proposed by Liang (1983, <https://…/> ). Mainly contains the function hyphen() to be used for hyphenation/syllable counting of text objects. It was originally developed for and part of the ‘koRpus’ package, but later released as a separate package so it’s lighter to have this particular functionality available for other packages. Support for various languages needs be added on-the-fly or by plugin packages; this package does not include any language specific data. Due to some restrictions on CRAN, the full package sources are only available from the project homepage. To ask for help, report bugs, request features, or discuss the development of the package, please subscribe to the koRpus-dev mailing list (<http://korpusml.reaktanz.de> ).

Optimal Design and Statistical Power of Cost-Efficient Multilevel Randomized Trials (odr)
Calculate the optimal sample allocation that minimizes variance of treatment effect in a multilevel randomized trial under fixed budget and cost structure, perform power analyses with and without accommodating costs and budget. The reference for proposed methods is: Shen, Z., & Kelcey, B. (under review). Optimal design of cluster randomized trials under condition- and unit-specific cost structures. 2018 American Educational Research Association (AERA) annual conference.

Fit and Predict a Gaussian Process Model with (Time-Series) Binary Response (binaryGP)
Allows the estimation and prediction for binary Gaussian process model. The mean function can be assumed to have time-series structure. The estimation methods for the unknown parameters are based on penalized quasi-likelihood/penalized quasi-partial likelihood and restricted maximum likelihood. The predicted probability and its confidence interval are computed by Metropolis-Hastings algorithm. More details can be seen in Sung et al (2017) <arXiv:1705.02511>.

Provides density, distribution function, quantile function and random generation for the split-t distribution, and computes the mean, variance, skewness and kurtosis for the split-t distribution (Li, F, Villani, M. and Kohn, R. (2010) <doi:10.1016/j.jspi.2010.04.031>).

Simulation Based Inference of Lasso Estimator (EAinference)
Estimator augmentation methods for statistical inference on high-dimensional data, as described in Zhou, Q. (2014) <doi:10.1080/01621459.2014.946035> and Zhou, Q. and Min, S. (2017) <doi:10.1214/17-EJS1309>. It provides several simulation-based inference methods: (a) Gaussian and wild multiplier bootstrap for lasso, group lasso, scaled lasso, scaled group lasso and their de-biased estimators, (b) importance sampler for approximating p-values in these methods, (c) Markov chain Monte Carlo lasso sampler with applications in post-selection inference.

### Document worth reading: “Resource Elasticity for Distributed Data Stream Processing: A Survey and Future Directions”

Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructures, and Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. This paper surveys state of the art on stream processing engines and mechanisms for exploiting resource elasticity features of cloud computing in stream processing. Resource elasticity allows for an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges on achieving elastic systems that can make efficient resource management decisions based on current load. This work examines some of these challenges and discusses solutions proposed in the literature to address them. Resource Elasticity for Distributed Data Stream Processing: A Survey and Future Directions

### If you did not already know

Kanri Distance (KDC)
Kanri’s proprietary combination of patented statistical and process methods provides a uniquely powerful and insightful ability to evaluate large data sets with multiple variables. While many tools evaluate patterns and dynamics for large data, only the Kanri Distance Calculator allows users to understand where they stand with respect to a desired target state and the specific contribution of each variable toward the overall distance from the target state. The Kanri model not only calculates the relationship of variables within the overall data set, but more importantly mathematically teases out the interaction between each of them. This combination of relational insights fuels Kanri’s breakthrough distance calculator. It answers the question “In a world of exponentially expanding data how do I find the variables that will solve my problem and it helps quickly to reach that conclusion.” But the Kanri model does not stop there. Kanri tells you exactly, formulaically how much each variable contributes. The Kanri Distance Calculator opens a new world of solution development possibilities that can apply the power of massive data sets to an individual…or to an individualized objective. …

Probably Approximately Correct Learning (PAC Learning,WARL)
In computational learning theory, probably approximately correct learning (PAC learning) is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the “probably” part), the selected function will have low generalization error (the “approximately correct” part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). …

Local Projections
In this paper, we propose a novel approach for outlier detection, called local projections, which is based on concepts of Local Outlier Factor (LOF) (Breunig et al., 2000) and RobPCA (Hubert et al., 2005). By using aspects of both methods, our algorithm is robust towards noise variables and is capable of performing outlier detection in multi-group situations. We are further not reliant on a specific underlying data distribution. For each observation of a dataset, we identify a local group of dense nearby observations, which we call a core, based on a modification of the k-nearest neighbours algorithm. By projecting the dataset onto the space spanned by those observations, two aspects are revealed. First, we can analyze the distance from an observation to the center of the core within the projection space in order to provide a measure of quality of description of the observation by the projection. Second, we consider the distance of the observation to the projection space in order to assess the suitability of the core for describing the outlyingness of the observation. These novel interpretations lead to a univariate measure of outlyingness based on aggregations over all local projections, which outperforms LOF and RobPCA as well as other popular methods like PCOut (Filzmoser et al., 2008) and subspace-based outlier detection (Kriegel et al., 2009) in our simulation setups. Experiments in the context of real-word applications employing datasets of various dimensionality demonstrate the advantages of local projections. …

## September 23, 2017

### Magister Dixit

“Failure is a great teacher.” Daniel Tunkelang

### Book Memo: “Beautiful Data”

 The Stories Behind Elegant Data Solutions In this insightful book, you’ll learn from the best data practitioners in the field just how wide-ranging – and beautiful – working with data can be. Join 39 contributors as they explain how they developed simple and elegant solutions on projects ranging from the Mars lander to a Radiohead video. With Beautiful Data, you will: • Explore the opportunities and challenges involved in working with the vast number of datasets made available by the Web • Learn how to visualize trends in urban crime, using maps and data mashups • Discover the challenges of designing a data processing system that works within the constraints of space travel • Learn how crowdsourcing and transparency have combined to advance the state of drug research • Understand how new data can automatically trigger alerts when it matches or overlaps pre-existing data • Learn about the massive infrastructure required to create, capture, and process DNA data That’s only small sample of what you’ll find in Beautiful Data. For anyone who handles data, this is a truly fascinating book.

### RcppCNPy 0.2.7

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of the RcppCNPy package arrived on CRAN yesterday.

RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.

This version updates internals for function registration, but otherwise mostly switches the vignette over to the shiny new pinp two-page template and package.

#### Changes in version 0.2.7 (2017-09-22)

• Vignette updated to Rmd and use of pinp package

• File src/init.c added for dynamic registration

CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RcppClassic 0.9.8

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A bug-fix release RcppClassic 0.9.8 for the very recent 0.9.7 release which fixes a build issue on macOS introduced in 0.9.7. No other changes.

Courtesy of CRANberries, there are changes relative to the previous release.

Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Upcoming data preparation and modeling article series

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.

vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.

vtreat handles, in a statistically sound fashion:

In our (biased) opinion opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat‘s current public open-source implementation is for in-memory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).

vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.

A new feature of vtreat version 0.6.0 is called “custom coders.” Win-Vector LLC‘s Dr. Nina Zumel is going to start a short article series to show how this new interface can be used to extend vtreat methodology to include the very powerful method of partial pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding including: ordinal/faithful coders, monotone coders, unimodal coders, and set-valued coders.

### Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-9)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like Mathematics. As a consequence, in those class emphasis is put on leaning equations, solving calculus problems and creating mathematics models instead of building an intuition for probabilistic problems. But, if you read this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistic problems by writing simple R query and how to think in probabilistic terms.

In this series of article I have tried to help you create an intuition on how probabilities work. To do so, we have been using simulations to see how concrete random situation can unfold and learn simple statistics and probabilistic concepts. In today’s set, I would like to show you some deceptively difficult situations that will challenge the way you understand probability and statistics. By doing so, you will practice the simulation technique we have seen in past set, refined your intuition and, hopefully help you avoid some pitfall when you do your own statistical analysis.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

Exercise 1
Suppose that there are exactly 365 days in a year and that the distribution of birthday in the population is uniform, meaning that the proportion of birth on any given day is the same throughout the year. In a group of 25 people, what is the probability that at least two individuals share the same birthday? Use a simulation to answer that question, then repeat this process for group of 0,10,20,…,90 and 100 people and plot the results.

Of course, when the group size is of 366 we know that the probability that two people share the same birthday is equal to 1, since there are more people than day in the year and for a group of zero person this probability is equal to 0. What is counterintuitive here is the rate at which the probability of observing this grow. From the graph we can see that with just about 23 people we have a probability of about 50% of observing two people having the same birthday and that a group of about 70 people will have almost 100% chance to see this happening.

Exercise 2
Here’s a problem that can someday save your life. Imagine you are a war prisoner in an Asian Communist country and your jailer is getting bored. So to past the time, they set up a Russian roulette game where you and another inmate play against one another. A jailer takes a six-shooter revolver, put two bullets in two consecutive chamber, spin the chamber and give the gun to your opponent, who place the gun to his temple and pull the trigger. Luckily for him, the chamber was empty and the gun is passed to you. Now you have a choice to make: you can let the chamber as it is and play or you can spin the chamber before playing. Use 10000 simulations of both choices to find which choice give you the highest probability to survive.

The key details in this problem is that the bullet are in consecutive chamber. This mean that if your opponent pulls the trigger on an empty chamber, and that you don’t spin the chamber, it’s impossible that you pull the trigger on the second bullet. You can only have an empty chamber of pull the trigger on the first bullet, which means that you have 25% chance of dying vs 2/6=33% chance of dying if you spin the chamber.
Exercise 3
What is the probability that a mother, whose is pregnant with nonidentical twin, give birth to two boys, if we know that one of the unborn child is a boy, but we cannot identifie which one is the boy?

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

• Work with about different binomial and logistic regression techniques
• Know how to compare regression models and choose the right fit
• And much more

Exercise 4
Two friends play head or tail to pass the time. To make this game more fun they decide to gamble pennies, so for each coin flip one friend call head or tail and if he calls right, he gets a penny and lose one otherwise. Let’s say that they have 40 and 30 pennies respectively and that they will play until someone has all the pennies.

1. Create a function that simulate a complete game and return how many coin flip has been done and who win.
2. In average, how many coin flip is needed before someone has all the pennies.
3. Plot the histogram of the number of coin flipped during a simulation.
4. What is the probability that someone wins a coin flip?
5. What is the probability that each friend wins all the pennies? Why is it different than the probability of winning a single coin flip?

When the number of coin flip get high enough, the probability of someone winning often enough to get all the pennies rise to 100%. Maybe they will have to play 24h a day for weeks, but someday, someone will lose often enough to be penniless. In this context, the player who started with the most money have a huge advantage since they can survive a much longer losing streak than their opponent.

In fact, in this context where the probability of winning a single game is equal for each opponent the probability of winning all the money is equal to the proportion of the money they start with. That’s in part why the casino always win since they got more money than each gambler that plays against them, as long they get them to play long enough they will win. The fact that they propose game where they have greater chance to win help them quite a bit too.
Exercise 5
A classic counter intuitive is the Monty Hall problem. Here’s the scenario, if you never heard of it: you are on a game show where you can choose one of three doors and if a prize is hidden behind this door, you win this prize. Here’s the twist: after you choose a door, the game show host open one of the two other doors to show that there’s no prize behind it. At this point, you can choose to look behind the door you choose in the first place to see if there’s a prize or you can choose to switch door and look behind the door you left out.

1. Simulate 10 000 games where you choose to look behind the first door you have chosen to estimate the probability of winning if you choose to look behind this door.
2. Repeat this process, but this time choose to switch door.
3. Why the probabilities are different?

When you pick the first door, you have 1/3 chance to have the right door. When the show host open one of the door you didn’t pick he gives you a huge amount of information on where the price is because he opened a door with no prize behind it. So the second door has more chance to hide the prize than the door you took in the first place. Our simulation tell us that this probability is about 1/2. So, you should always switch door since this gives you a higher probability of winning the prize.

To better understand this, imagine that the Grand Canyon is filled with small capsule with a volume of a cube centimeter. Of all those capsules only one has a piece of paper and if you pick this capsule, you win a 50% discount on a tie. You choose a capsule at random and then all the other trillion capsules are discarded except one, such than the winning capsule is still in play. Assuming you really want this discount, which capsule would you choose?

Exercise 6
This problem is a real life example of a statistical pitfall that can easily be encountered in real life and has been published by Steven A. Julious and Mark A. Mullee. In this dataset, we can see if a a medical treatment for kidney stone has been effective. There are two treatments that can be used: treatment A which include all open surgical procedure and treatment B which include small puncture surgery and the kidney stone are classified in two categories depending on his size, small or large stones.

1. Compute the success rate (number of success/total number of cases) of both treatments.
2. Which treatment seems the more successful?
3. Create a contingency table of the success.
4. Compute the success rate of both treatments when treating small kidney stones.
5. Compute the success rate of both treatments when treating large kidney stones.
6. Which treatment is more successful for small kidney stone? For large kidney stone?

This is an example of the Simpson paradox, which is a situation where an effect appears to be present for the set of all observations, but disappears when the observations are categorized and the analysis is done on each group. It is important to test for this phenomenon since in practice most observations can be classified in sub classes and, as the last example showed, this can change drastically the result of your analysis.

Exercise 7

1. Download this dataset and do a linear regression with the variable X and Y. Then, compute the slope of the trend line of the regression.
2. Do a scatter plot of the variable X and Y and add the trend line to the graph.
3. Repeat this process of each of the three categories.

We can see that the general trend of the data is different from the trends of each of the categories. In other words, the Simpson paradox can also be observed in a regression context. The moral of the story is: make sure that all the variables are included in your analysis or you gonna have a bad time!

Exercise 8
For this problem you must know what’s a true positive, false positive, true negative and false negative in a classification problem. You can look at this page for a quick review of those concepts.

A big data algorithm has been developed to detect potential terrorist by looking at their behavior on the internet, their consummation habit and their traveling. To develop this classification algorithm, the computer scientist used data from a population where there’s a lot of known terrorist since they needed data about the habits of real terrorist to validate their work. In this dataset, you will find observations from this high risk population and observations taken from a low risk population.

1. Compute the true positive rate, the false positive rate, the true negative rate and the false negative rate of this algorithm for the population that has a high risk of terrorism.
2. Repeat this process for the remaining observations. Is there a difference between those rate?

It is a known fact that false positive rate are a lot higher in low-incidence population and this is known as . Basically, when the incidence of a certain condition in the population is lower than the average false positive rate of a test, using that test on this population will result in a much higher false positive cases than usual. This is in part due to the fact that the diminution of true positive case make the proportion of false positive so much higher. As a consequence: don’t trust to much your classification algorithm!

Exercise 9

1. Generate a population of 10000 values from a normal distribution of mean 42 and standard deviation of 10.
2. Create a sample of 10 observations and estimate the mean of the population. Repeat this 200 times.
3. Compute the variation of the estimation.
4. Create a sample of 50 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
5. Create a sample of 100 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
6. Create a sample of 500 observations and estimate the mean of the population. Repeat this 200 times and compute the variation of these estimations.
7. Plot the variance of the estimation of the means done with different sample size.

As you can see, the variance of the estimation of the mean is inversely proportional to the sample size, but this is not a linear relationship. A small sample can create an estimation that is a lot farther to the real value than a sample with more observations. Let’s see why this information is relevant to this set.
Exercise 10
A private school advertise that their small size help their student achieve better grade. In their advertisement, they claim that last year’s students have had an average 5 points higher than the average at the standardize state’s test and since no large school has such a high average, that’s proof that small school help student achieve better results.

Suppose that there is 200000 students in the state, their results at the state test was distributed normally with a mean of 76% and a standard deviation of 15, the school had 100 students and that an average school count 750 student. Does the school claim can be explained statistically?

A school can be seen as a sample of the population of student. A large school, like a large sample, has a lot more chance to be representative of the student’s population and their average score will often be near the population average, while small school can show average a lot more extreme just because they have a smaller body of student. I’m not saying that no school are better than other, but we must look at a lot of results to be sure we are not only in presence of a statistical abnormality.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Getting the right uncertainties when fitting multilevel models

Cesare Aloisi writes:

I am writing you regarding something I recently stumbled upon in your book Data Analysis Using Regression and Multilevel/Hierarchical Models which confused me, in hopes you could help me understand it. This book has been my reference guide for many years now, and I am extremely grateful for everything I learnt from you.

On page 261, a 95% confidence interval for the intercept in a certain group (County 26) is calculated using only the standard error of the “random effect” (the county-level error). The string is as follows:

coef(M1)$county[26,1] + c(-2,2)*se.ranef(M1)$county[26]

text(tsne$Y, labels=trn[,65], col=cols[trn[,65] +1])  Gives this plot: Note how clearly-defined and distinctly separable the clusters of handwritten digits are. We have only a minimal amount of incorrect entries in the 2D map. ## Of Manifolds and the Manifold Assumption How can t-SNE achieve such a good result? How can it ‘drop’ 64-dimensional data into just 2 dimensions and still preserve enough (or all) structure to allow the classes to be separated? The reason has to do with the mathematical concept of manifolds. A Manifold is a $$d$$-dimensional surface that lives in an $$D$$-dimensional space, where $$d < D$$. For the 3D case, imagine a 2D piece of paper that is embedded within 3D space. Even if the piece of paper is crumpled up extensively, it can still be ‘unwrapped’ (uncrumpled) into the 2D plane that it is. This 2D piece of paper is a manifold in 3D space. Or think of an entangled string in 3D space – this is a 1D manifold in 3D space. Now there’s what is known as the Manifold Hypothesis, or the >Manifold Assumption, that states that natural data (like images, etc.) forms lower dimensional manifolds in their embedding space. If this assumption holds (there are theoretical and experimental evidence for this hypothesis), then t-SNE should be able to find this lower-dimensional manifold, ‘unwrap it’, and present it to us as a lower-dimensional map of the original data. ## t-SNE vs. SOM t-SNE is actually a tool that does something similar to Self-Organising Maps (SOMs), though the underlying process is quite different. We have used a SOM on the optdigits dataset in a previous blog and obtained the following unsupervised clustering shown below. The t-SNE map is more clear than that obtained via a SOM and the clusters are separated much better. ## t-SNE for Shelter Animals dataset In a previous blog, I applied machine learning algorithms for predicting the outcome of shelter animals. Let’s now apply t-SNE to the dataset – I am using the cleaned and modified data as described in this blog entry. trn <- read.csv('train.modified.csv.zip') trn <- data.matrix(trn) require(Rtsne) # scale the data first prior to running t-SNE trn[,-1] <- scale(trn[,-1]) tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2) # display the results of t-SNE cols <- rainbow(5) plot(tsne$Y, t='n')
text(tsne$Y, labels=as.numeric(trn[,1]), col=cols[trn[,1]])  Gives this plot: Here, while it is evident that there is some structure and patterns to the data, clustering by OutcomeType has not happened. Let’s now use t-SNE to perform dimensionality reduction to 3D on the same dataset. We just need to set the dims parameter to 3 instead of 2. And we will use package rgl for plotting the 3D map produced by t-SNE. tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2) #display results of t-SNE require(rgl) plot3d(tsne$Y, col=cols[trn[,1]])
legend3d("topright", legend = '0':'5', pch = 16, col = rainbow(5))



Gives this plot:

This 3D map has a richer structure than the 2D version, but again the resulting clustering is not done by OutcomeType.

One possible reason could be that the Manifold Assumption is failing for this dataset when trying to reduce to 2D and 3D. Please note that the Manifold Assumption cannot be always true for a simple reason. If it were, we could take an arbitrary set of $$N$$-dimensional points and conclude that they lie on some $$M$$-dimensional manifold (where $$M < N$$). We could then take those $$M$$-dimensional points and conclude that they lie on some other lower-dimensional manifold (with, say, $$K$$ dimensions, where $$K < M$$). We could repeat this process indefinitely and eventually conclude that our arbitrary $$N$$-dimensional dataset lies on a 1D manifold. Because this cannot be true for all datasets, the manifold assumption cannot always be true. In general, if you have $$N$$-dimensional points being generated by a process with $$N$$ degrees of freedom, your data will not lie on a lower-dimensional manifold. So probably the Animal Shelter dataset has degrees of freedom higher than 3.

So we can conclude that t-SNE did not aid much the initial exploratory analysis for this dataset.

## t-SNE and machine learning?

One disadvantage of t-SNE is that there is currently no incremental version of this algorithm. In other words, it is not possible to run t-SNE on a dataset, then gather a few more samples (rows), and “update” the t-SNE output with the new samples. You would need to re-run t-SNE from scratch on the full dataset (previous dataset + new samples). Thus t-SNE works only in batch mode.

This disadvantage appears to makes it difficult for t-SNE to be used in a machine learning system. But as we will see in a future post, it is still possible to use t-SNE (with care) in a machine learning solution. And the use of t-SNE can improve classification results, sometimes markedly. The limitation is the extra non-realtime processing brought about by t-SNE’s batch mode nature.

Stay tuned.

Related Post

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### On debugging

My favorite advice on debugging is from Professor Norman Matloff:

Finding your bug is a process of confirming the many things that you believe are true – until you find one that is not true.

I would add to this my own observation:

You are not truly “debugging” until you (temporarily) remove all desire to fix the problem. Investigation must be entirely devoted to finding and seeing the problem.

There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.

“Character is what you are in the dark.”

John Whorfin quoting Dwight L. Moody.

I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr.

• Disclaimer: Win-Vector LLC has no official standing with RStudio, or dplyr development.
• However:

“One need not have been Caesar in order to understand Caesar.”

Alternately: Georg Simmmel or Max Webber.

Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.

From experience we have distilled a lot of best practices. And below we will share one.

From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:

Note that you can refer to columns that you’ve just created:

mutate(flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)


Let’s try that with database backed data:

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
# [1] ‘0.7.3’

db <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")
flights <- copy_to(db,
nycflights13::flights,
&aposflights&apos)

mutate(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# # Source:   lazy query [?? x 22]
# # Database: sqlite 3.19.3 [:memory:]
# year month   day dep_time sched_dep_time        ...
# <int> <int> <int>    <int>          <int>       ...
#   1  2013     1     1      517            515   ...
# ...


That worked. One of the selling points of dplyr is a lot of dplyr is source-generic or source-agnostic: meaning it can be run against different data providers (in-memory, databases, Spark).

However, if a new user tries to extend such an example (say adding gain_per_minutes) they run into this:

mutate(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
gain_per_minute = 60 * gain_per_hour
)
# Error in rsqlite_send_query(conn@ptr, statement) :
#   no such column: gain_per_hour


(Some detail on the failing query are here.)

It is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.

This quickly leads to strong feelings of learned helplessness and anxiety.

Our rule for dplyr::mutate() has been for some time:

Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).

Under this rule neither of the above mutates are allowed. The second should be written as (switching to pipe-notation):

flights %>%
mutate(gain = arr_delay - dep_delay,
hours = air_time / 60) %>%
mutate(gain_per_hour = gain / hours) %>%
mutate(gain_per_minute = 60 * gain_per_hour)


And the above works.

If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.

dplyr documentation (such as “help(mutate)“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.

Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.

However, confusion around re-use of column names has been driving dplyr issues for quite some time:

It makes sense to work in a reliable and teachable sub-dialect of dplyr that will serve users well (or barring that, you can use an adapter, such as seplyr). In production you must code to what systems are historically reliably capable of, not just the specification. “Works for the instructor” is not an acceptable level of dependability.

### Data liquidity in the age of inference

Probabilistic computation holds too much promise for it to be stifled by playing zero sum games with data.

It's a special time in the evolutionary history of computing. Oft-used terms like big data, machine learning, and artificial intelligence have become popular descriptors of a broader underlying shift in information processing. While traditional rules-based computing isn’t going anywhere, a new computing paradigm is emerging around probabilistic inference, where digital reasoning is learned from sample data rather than hardcoded with boolean logic. This shift is so significant that a new computing stack is forming around it with emphasis on data engineering, algorithm development, and even novel hardware designs optimized for parallel computing workloads, both within data centers and at endpoints.

A funny thing about probabilistic inference is that when models work well, they’re probably right most of the time, but always wrong at least some of the time. From a mathematics perspective, this is because such models take a numerical approach to problem analysis, as opposed to an analytical one. That is, they learn patterns from data (with various levels of human involvement) that have certain levels of statistical significance, but remain somewhat ignorant to any physics-level intuition related to those patterns, whether represented by math theorems, conjectures, or otherwise. However, that’s also precisely why probabilistic inference is so incredibly powerful. Many real-world systems are so multivariate, complex, and even stochastic that analytical math models do not exist and remain tremendously difficult to develop. In the meanwhile, their physics-ignorant, FLOPS-happy, and often brutish machine learning counterparts can develop deductive capabilities that don’t nicely follow any known rules, yet still almost always arrive at the correct answers.

Continue reading Data liquidity in the age of inference.

In today’s digital landscape, customers expect you to deliver products and services in a fast and efficient manner. Heavyweights like Amazon and Google have set a bar in terms of operations, and they’ve set it high. An increasing need for more streamlined and efficient processes, combined with advancing technologies has

The post 4 data-driven ways to digitize your business appeared first on Dataconomy.

### Four short links: 22 September 2017

Molecular Robots, Distributed Deep Nets, SQL Notebook, and Super-Accurate GPS

1. Scientists Create World’s First ‘Molecular Robot’ Capable Of Building Molecules -- Each individual robot is capable of manipulating a single molecule and is made up of just 150 carbon, hydrogen, oxygen and nitrogen atoms. To put that size into context, a billion billion of these robots piled on top of each other would still only be the same size as a single grain of salt. The robots operate by carrying out chemical reactions in special solutions which can then be controlled and programmed by scientists to perform the basic tasks. (via Slashdot)
2. Distributed Deep Neural Networks -- in Adrian Colyer's words: DDNNs partition networks between mobile/embedded devices, cloud (and edge), although the partitioning is static. What’s new and very interesting here though is the ability to aggregate inputs from multiple devices (e.g., with local sensors) in a single model, and the ability to short-circuit classification at lower levels in the model (closer to the end devices) if confidence in the classification has already passed a certain threshold. It looks like both teams worked independently and in parallel on their solutions. Overall, DDNNs are shown to give lower latency decisions with higher accuracy than either cloud or devices working in isolation, as well as fault tolerance in the sense that classification accuracy remains high even if individual devices fail. (via Morning Paper)
3. Franchise -- an open-source notebook for sql.
4. Super-Accurate GPS Chips Coming to Smartphones in 2018 (IEEE Spectrum) -- 30cm accuracy (today: 5m), will help with the reflections you get in cities, and with 50% energy savings.

### Document worth reading: “Image Segmentation Algorithms Overview”

The technology of image segmentation is widely used in medical image processing, face recognition pedestrian detection, etc. The current image segmentation techniques include region-based segmentation, edge detection segmentation, segmentation based on clustering, segmentation based on weakly-supervised learning in CNN, etc. This paper analyzes and summarizes these algorithms of image segmentation, and compares the advantages and disadvantages of different algorithms. Finally, we make a prediction of the development trend of image segmentation with the combination of these algorithms. Image Segmentation Algorithms Overview

### Mining USPTO full text patent data – Analysis of machine learning and AI related patents granted in 2017 so far – Part 1

(This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

The United States Patent and Trademark office (USPTO) provides immense amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought that it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML), data science, and artificial intelligence (AI).

I started this exploration by downloading the full text data (excluding images) for all patents that were assigned by the USPTO within the year 2017 up to the time of writing (Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System).

In this blog post – “Part 1” – I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? etc.

“Part 2” will be an in depth analysis of the language used in patent titles, descriptions, and claims, and “Part 3” will be on experimentation with with deep neural nets applied to the patents’ title, description, and claims text.

First, I curated a patent full text dataset consisting of “machine learning and AI related” patents.
I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are related to ML or AI in any/some capacity. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is utilized as a part of the invention in any way whatsoever. The subset of relevant patents was determined by a keyword search as specified by the following definition.

Definition: For the purposes of this blog post, a machine learning or AI related patent is a patent that contains at least one of the keywords
“machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model”
in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).1

With this keyword matching approach a total of 6665 patents have been selected. The bar graph below shows how many times each keyword got matched.

Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.

### Some example patents

Here are three (randomly chosen) patents from the resulting dataset. For each printed are the invention title, the patent assignee, as well as one instance of the keyword match within the patent text.

And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.

## Who holds these patents (inventors and assignees)?

The first question I would like to address is who files most of the machine learning and AI related patents.

Each patent specifies one or several inventors, i.e., the individuals who made the patented invention, and a patent assignee which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.

It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).

There are a number of interesting follow-up questions which for now I leave unanswered (hard to answer without additional data):

• What is the count of ML and AI related patents by industry or type of company (e.g., big tech companies vs. startups vs. reserach universities vs. government)?
• Who is deriving the most financial benefit by holding ML or AI related patents (either through licensing or by driving out the competition)?

## Where do these inventions come from (geographically)?

Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US.
In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active.
The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).

The results aren’t that surprising.
However, we see that most (ML and AI related) inventions patented with the USPTO were made in the US. I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather the in the US.

Alternatively, we can also map the number of patents per inventors’ origin countries.

Sadly, there seem to be entire groups of countries (e.g., almost the entire African continent) which seem to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions.
Whether it is a lack of access, infrastructure, education, political treaties or something else is an intriguing question.

## Patent application and publication dates, and duration of patent examination process

Each patent has a date of filing and an assignment date attached to it. Based on the provided dates one can try to address questions such as:
When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.

For the set of ML and AI related patents that were granted between Jan 3 and Sept 12 2017 the following figure depicts…

• …in the top panel: number of patents (y-axis) per their original month of filing (x-axis),
• …in the bottom panel: the number of patents (y-axis) that were assigned (approved) per week (x-axis) in 2017 so far.

The patent publication dates plot suggests that the number of ML and AI related patents seems to be increasing slightly throughout the year 2017.
The patent application dates plot suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents in 2017). However, the range is quite extensive spanning 0.24-12.57 years.

The distribution of the duration from patent filing date to approval is depicted by following figure.

So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:

• “Printing and dispensing system for an electronic gaming device that provides an undisplayed outcome” (~12.57 years to approval; assignee: Aim Management, Inc.)
• “Apparatus for biological sensing and alerting of pharmaco-genomic mutation” (~12.24 years to approval; assignee: NA)
• “System for tracking a player of gaming devices” (~12.06 years to approval; assignee: Aim Management, Inc.)
• “Device, method, and computer program product for customizing game functionality using images” (~11.72 years to approval; assignee: NOKIA TECHNOLOGIES OY)
• “Method for the spectral identification of microorganisms” (~11.57 years to approval; assignee: MCGILL UNIVERSITY, and HEALTH CANADA)

Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…

We can also look at the five patents with the shortest approval time:

• “Home device application programming interface” (~91 days to approval; assignee: ESSENTIAL PRODUCTS, INC.)
• “Avoiding dazzle from lights affixed to an intraoral mirror, and applications thereof” (~104 days to approval; assignee: DENTAL SMARTMIRROR, INC.)
• “Media processing methods and arrangements” (~106 days to approval; assignee: Digimarc Corporation)
• “Machine learning classifier that compares price risk score, supplier risk score, and item risk score to a threshold” (~111 days to approval; assignee: ACCENTURE GLOBAL SOLUTIONS LIMITED)
• “Generating accurate reason codes with complex non-linear modeling and neural networks” (~111 days to approval; assignee: SAS INSTITUTE INC.)

Interstingly the patent approved in the shortest amount of time among all 6665 analysed (ML and AI related) patents is some smart home thingy from Andy Rubin’s hyped up company Essential.

### Do big tech companies get their patents approved faster than other companies (e.g., startups)?

The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.

Indeed some big tech companies, such as AT&T or Samsung, manage to push their patent application though the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.

Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.

## Conclusion

There is definitely a lot of promise in the USPTO full text patent data.
Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more.
The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:

1. There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Whats new on arXiv

We introduce varbvs, a suite of functions written in R and MATLAB for regression analysis of large-scale data sets using Bayesian variable selection methods. We have developed numerical optimization algorithms based on variational approximation methods that make it feasible to apply Bayesian variable selection to very large data sets. With a focus on examples from genome-wide association studies, we demonstrate that varbvs scales well to data sets with hundreds of thousands of variables and thousands of samples, and has features that facilitate rapid data analyses. Moreover, varbvs allows for extensive model customization, which can be used to incorporate external information into the analysis. We expect that the combination of an easy-to-use interface and robust, scalable algorithms for posterior computation will encourage broader use of Bayesian variable selection in areas of applied statistics and computational biology. The most recent R and MATLAB source code is available for download at Github (https://…/varbvs ), and the R package can be installed from CRAN (https://…/package=varbvs ).
While machine learning and artificial intelligence have long been applied in networking research, the bulk of such works has focused on supervised learning. Recently there has been a rising trend of employing unsupervised machine learning using unstructured raw network data to improve network performance and provide services such as traffic engineering, anomaly detection, Internet traffic classification, and quality of service optimization. The interest in applying unsupervised learning techniques in networking emerges from their great success in other fields such as computer vision, natural language processing, speech recognition, and optimal control (e.g., for developing autonomous self-driving cars). Unsupervised learning is interesting since it can unconstrain us from the need of labeled data and manual handcrafted feature engineering thereby facilitating flexible, general, and automated methods of machine learning. The focus of this survey paper is to provide an overview of the applications of unsupervised learning in the domain of networking. We provide a comprehensive survey highlighting the recent advancements in unsupervised learning techniques and describe their applications for various learning tasks in the context of networking. We also provide a discussion on future directions and open research issues, while also identifying potential pitfalls. While a few survey papers focusing on the applications of machine learning in networking have previously been published, a survey of similar scope and breadth is missing in literature. Through this paper, we advance the state of knowledge by carefully synthesizing the insights from these survey papers while also providing contemporary coverage of recent advances.
We propose a collaborative compressive sensing (CCS) framework consisting of a bank of $K$ compressive sensing (CS) systems that share the same sensing matrix but have different sparsifying dictionaries. This CCS system is guaranteed to yield better performance than each individual CS system in a statistical sense, while with the parallel computing strategy, it requires the same time as that needed for each individual CS system to conduct compression and signal recovery. We then provide an approach to designing optimal CCS systems by utilizing a measure that involves both the sensing matrix and dictionaries and hence allows us to simultaneously optimize the sensing matrix and all the $K$ dictionaries under the same scheme. An alternating minimization-based algorithm is derived for solving the corresponding optimal design problem. We provide a rigorous convergence analysis to show that the proposed algorithm is convergent. Experiments with real images are carried out and show that the proposed CCS system significantly improves on existing CS systems in terms of the signal recovery accuracy.
The incorporation of macro-actions (temporally extended actions) into multi-agent decision problems has the potential to address the curse of dimensionality associated with such decision problems. Since macro-actions last for stochastic durations, multiple agents executing decentralized policies in cooperative environments must act asynchronously. We present an algorithm that modifies Generalized Advantage Estimation for temporally extended actions, allowing a state-of-the-art policy optimization algorithm to optimize policies in Dec-POMDPs in which agents act asynchronously. We show that our algorithm is capable of learning optimal policies in two cooperative domains, one involving real-time bus holding control and one involving wildfire fighting with unmanned aircraft. Our algorithm works by framing problems as ‘event-driven decision processes,’ which are scenarios where the sequence and timing of actions and events are random and governed by an underlying stochastic process. In addition to optimizing policies with continuous state and action spaces, our algorithm also facilitates the use of event-driven simulators, which do not require time to be discretized into time-steps. We demonstrate the benefit of using event-driven simulation in the context of multiple agents taking asynchronous actions. We show that fixed time-step simulation risks obfuscating the sequence in which closely-separated events occur, adversely affecting the policies learned. Additionally, we show that arbitrarily shrinking the time-step scales poorly with the number of agents.
Visual attributes, from simple objects (e.g., backpacks, hats) to soft-biometrics (e.g., gender, height, clothing) have proven to be a powerful representational approach for many applications such as image description and human identification. In this paper, we introduce a novel method to combine the advantages of both multi-task and curriculum learning in a visual attribute classification framework. Individual tasks are grouped after performing hierarchical clustering based on their correlation. The clusters of tasks are learned in a curriculum learning setup by transferring knowledge between clusters. The learning process within each cluster is performed in a multi-task classification setup. By leveraging the acquired knowledge, we speed-up the process and improve performance. We demonstrate the effectiveness of our method via ablation studies and a detailed analysis of the covariates, on a variety of publicly available datasets of humans standing with their full-body visible. Extensive experimentation has proven that the proposed approach boosts the performance by 4% to 10%.
Prognostics or early detection of incipient faults is an important industrial challenge for condition-based and preventive maintenance. Physics-based approaches to modeling fault progression are infeasible due to multiple interacting components, uncontrolled environmental factors and observability constraints. Moreover, such approaches to prognostics do not generalize to new domains. Consequently, domain-agnostic data-driven machine learning approaches to prognostics are desirable. Damage progression is a path-dependent process and explicitly modeling the temporal patterns is critical for accurate estimation of both the current damage state and its progression leading to total failure. In this paper, we present a novel data-driven approach to prognostics that employs a novel textual representation of multivariate temporal sensor observations for predicting the future health state of the monitored equipment early in its life. This representation enables us to utilize well-understood concepts from text-mining for modeling, prediction and understanding distress patterns in a domain agnostic way. The approach has been deployed and successfully tested on large scale multivariate time-series data from commercial aircraft engines. We report experiments on well-known publicly available benchmark datasets and simulation datasets. The proposed approach is shown to be superior in terms of prediction accuracy, lead time to prediction and interpretability.
Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.
We present a new technique called contrastive principal component analysis (cPCA) that is designed to discover low-dimensional structure that is unique to a dataset, or enriched in one dataset relative to other data. The technique is a generalization of standard PCA, for the setting where multiple datasets are available — e.g. a treatment and a control group, or a mixed versus a homogeneous population — and the goal is to explore patterns that are specific to one of the datasets. We conduct a wide variety of experiments in which cPCA identifies important dataset-specific patterns that are missed by PCA, demonstrating that it is useful for many applications: subgroup discovery, visualizing trends, feature selection, denoising, and data-dependent standardization. We provide geometrical interpretations of cPCA and show that it satisfies desirable theoretical guarantees. We also extend cPCA to nonlinear settings in the form of kernel cPCA. We have released our code as a python package and documentation is on Github.
Applications in various domains rely on processing graph streams, e.g., communication logs of a cloud-troubleshooting system, road-network traffic updates, and interactions on a social network. A labeled-graph stream refers to a sequence of streamed edges that form a labeled graph. Label-aware applications need to filter the graph stream before performing a graph operation. Due to the large volume and high velocity of these streams, it is often more practical to incrementally build a lossy-compressed version of the graph, and use this lossy version to approximately evaluate graph queries. Challenges arise when the queries are unknown in advance but are associated with filtering predicates based on edge labels. Surprisingly common, and especially challenging, are labeled-graph streams that have highly skewed label distributions that might also vary over time. This paper introduces Self-Balanced Graph Sketch (SBG-Sketch, for short), a graphical sketch for summarizing and querying labeled-graph streams that can cope with all these challenges. SBG-Sketch maintains synopsis for both the edge attributes (e.g., edge weight) as well as the topology of the streamed graph. SBG-Sketch allows efficient processing of graph-traversal queries, e.g., reachability queries. Experimental results over a variety of real graph streams show SBG-Sketch to reduce the estimation errors of state-of-the-art methods by up to 99%.
Graphs have been widely used to model different information networks, such as the Web, biological networks and social networks (e.g. Twitter). Due to the size and complexity of these graphs, how to explore and utilize these graphs has become a very challenging problem. In this paper, we propose, VCExplorer, a new interactive graph exploration framework that integrates the strengths of graph visualization and graph summarization. Unlike existing graph visualization tools where vertices of a graph may be clustered into a smaller collection of super/virtual vertices, VCExplorer displays a small number of actual source graph vertices (called hubs) and summaries of the information between these vertices. We refer to such a graph as a HA-graph (Hub-based Aggregation Graph). This allows users to appreciate the relationship between the hubs, rather than super/virtual vertices. Users can navigate through the HA- graph by ‘drilling down’ into the summaries between hubs to display more hubs. We illustrate how the graph aggregation techniques can be integrated into the exploring framework as the consolidated information to users. In addition, we propose efficient graph aggregation algorithms over multiple subgraphs via computation sharing. Extensive experimental evaluations have been conducted using both real and synthetic datasets and the results indicate the effectiveness and efficiency of VCExplorer for exploration.
Recently, evolving networks are becoming a suitable form to model many real-world complex systems, due to their peculiarities to represent the systems and their constituting entities, the interactions between the entities and the time-variability of their structure and properties. Designing computational models able to analyze evolving networks becomes relevant in many applications. The goal of this research project is to evaluate the possible contribution of temporal pattern mining techniques in the analysis of evolving networks. In particular, we aim at exploiting available snapshots for the recognition of valuable and potentially useful knowledge about the temporal dynamics exhibited by the network over the time, without making any prior assumption about the underlying evolutionary schema. Pattern-based approaches of temporal pattern mining can be exploited to detect and characterize changes exhibited by a network over the time, starting from observed snapshots.
One important tool is the optimal clustering of data into useful categories. Dividing similar objects into a smaller number of clusters is of importance in many applications. These include search engines, monitoring of academic performance, biology and wireless networks. We first discuss a number of clustering methods. We present a parallel algorithm for the efficient clustering of objects into groups based on their similarity to each other. The input consists of an n by n distance matrix. This matrix would have a distance ranking for each pair of objects. The smaller the number, the more similar the two objects are to each other. We utilize parallel processors to calculate a hierarchal cluster of these n items based on this matrix. Another advantage of our method is distribution of the large n by n matrix. We have implemented our algorithm and have found it to be scalable both in terms of processing speed and storage.
In knowledge bases such as Wikidata, it is possible to assert a large set of properties for entities, ranging from generic ones such as name and place of birth to highly profession-specific or background-specific ones such as doctoral advisor or medical condition. Determining a preference or ranking in this large set is a challenge in tasks such as prioritisation of edits or natural-language generation. Most previous approaches to ranking knowledge base properties are purely data-driven, that is, as we show, mistake frequency for interestingness. In this work, we have developed a human-annotated dataset of 350 preference judgments among pairs of knowledge base properties for fixed entities. From this set, we isolate a subset of pairs for which humans show a high level of agreement (87.5% on average). We show, however, that baseline and state-of-the-art techniques achieve only 61.3% precision in predicting human preferences for this subset. We then analyze what contributes to one property being rated as more important than another one, and identify that at least three factors play a role, namely (i) general frequency, (ii) applicability to similar entities and (iii) semantic similarity between property and entity. We experimentally analyze the contribution of each factor and show that a combination of techniques addressing all the three factors achieves 74% precision on the task. The dataset is available at http://www.kaggle.com/srazniewski/wikidatapropertyranking.
Single-source and top-$k$ SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee in the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to several non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions significantly outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billion edges, using the idea of pooling.

### Thermal structure of Hurricane Maria

Hurricane Maria touched down in Puerto Rico. This visualization by Joshua Stevens at NASA shows what the thermal structure of the storm looked like, based on data collected by the Terra satellite.

Colder clouds, which are generally higher in the atmosphere, are shown with white. Somewhat warmer, lower clouds appear purple. The image reveals a very well-defined eye surrounded by high clouds on all sides—an indication that the storm was very intense.

Ugh.

Tags: , ,

### Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

1. Deduplication: eliminating duplicate (exact) copies of repeated data.
2. Record linkage: identifying records that reference the same entity across different sources.
3. Canonicalization: converting data with more than one possible representation into a standard form.

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as it's primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

### How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

### Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example.py from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe


Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py


### Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
{'field' : 'Name', 'type': 'String'},
{'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
{'field' : 'Address', 'type': 'String', 'has missing' : True},
{'field' : 'Purchases', 'type': 'String'},
]


Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking , affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

• the first three digits of the phone number
• the full name
• the first five characters of the name
• a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, which is a variation on Hamming distance that makes subsequent consecutive deletions or insertions cheaper.

Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).

The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

## Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

• (y)es: confirms that the two references are to the same entity
• (n)o: labels the two references as not the same entity
• (u)nsure: does not label the two references as the same entity or as different entities
• (f)inished: ends the active learning session and triggers the supervised learning phase

As you can see in the example above, some comparisons decisions are very easy. The first contains zero for zero hits on all four attributes being examined, so the verdict is most certainly a non-match. On the second, we have a 3/4 exact match, with the fourth being fuzzy in that one entity contains a piece of the matched entity; Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label it as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
{'field' : 'Site name', 'type': 'String'},
{'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
{'field' : 'Phone', 'type': 'String', 'has missing' : True},
]


A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

## Something a Bit More Challenging

In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the WhiteHouse.gov website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

Database Field Field Description
NAMELAST Last name of entity
NAMEFIRST First name of entity
NAMEMID Middle name of entity
UIN Unique Identification Number
Type of Access Access type to White House
TOA Time of arrival
POA Post on arrival
TOD Time of departure
POD Post on departure
APPT_START_DATE When the appointment date is scheduled to start
APPT_END_DATE When the appointment date is scheduled to end
APPT_CANCEL_DATE When the appointment date was canceled
Total_People Total number of people scheduled to attend
LAST_UPDATEDBY Who was the last person to update this event
POST Classified as 'WIN'
LastEntryDate When the last update to this instance
TERMINAL_SUFFIX ID for terminal used to process visitor
visitee_namelast The visitee's last name
visitee_namefirst The visitee's first name
MEETING_LOC The location of the meeting
MEETING_ROOM The room number of the meeting
CALLER_NAME_LAST The authorizing person for the visitor's last name
CALLER_NAME_FIRST The authorizing person for the visitor's first name
CALLER_ROOM The authorizing person's room for the visitor
Description Description of the event or visit
RELEASE_DATE The date this set of logs were released to the public

Using the API, the White House Visitor Log Requests can be exported in a variety of formats to include, .json, .csv, and .xlsx, .pdf, .xlm, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url,fname):
"""
"""
response = requests.get(url)
with open(fname, 'w') as f:
f.write(response.content)

ORIGFILE = "fixtures/whitehouse-visitors.csv"

getData(DATAURL,ORIGFILE)


Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

## Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None

DATABASE = your_db_name
USER = your_user_name
HOST = your_hostname

try:
print ("I've connected")
except:
print ("I am unable to connect to the database")
cur = conn.cursor()


The other challenge with our dataset are the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type.

This function takes the csv data in as input, parses the datetime fields we're interested in ('lastname','firstname','uin','apptmade','apptstart','apptend', 'meeting_loc'.), and outputs a database table that retains the desired columns. Keep in mind this will take a while to run.

def dateParseSQL(nfile):
cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
(visitor_id SERIAL PRIMARY KEY,
lastname    varchar,
firstname   varchar,
uin         varchar,
apptstart   varchar,
apptend     varchar,
meeting_loc varchar);''')
conn.commit()
with open(nfile, 'rU') as infile:
for field in DATEFIELDS:
if row[field] != '':
try:
dt = parser.parse(row[field])
row[field] = dt.toordinal()  # We also tried dt.isoformat()
except:
continue
sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
VALUES (%s,%s,%s,%s,%s,%s,%s)"
cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
conn.commit()
print ("All done!")

dateParseSQL(ORIGFILE)


About 60 of our rows had ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';


For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor


Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS =  [{'field': 'firstname', 'variable name': 'firstname',
'type': 'String','has missing': True},
{'field': 'lastname', 'variable name': 'lastname',
'type': 'String','has missing': True},
{'field': 'uin', 'variable name': 'uin',
'type': 'String','has missing': True},
{'field': 'meeting_loc', 'variable name': 'meeting_loc',
'type': 'String','has missing': True}
]


We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
lset = set
block_id = None
records = []
i = 0
for row in result_set:
if row['block_id'] != block_id:
if records:
yield records

block_id = row['block_id']
records = []
i += 1

if i % 10000 == 0:
print ('{} blocks'.format(i))

smaller_ids = row['smaller_ids']
if smaller_ids:
smaller_ids = lset(smaller_ids.split(','))
else:
smaller_ids = lset([])

records.append((row[KEY_FIELD], row, smaller_ids))

if records:
yield records


And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
deduper = dedupe.Dedupe(FIELDS)

with psycopg2.connect(database=args.dbname,
host='localhost',
cursor_factory=DictCursor) as con:
with con.cursor() as c:
c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
row = c.fetchone()
count = row['count']
sample_size = int(count * args.sample)

print ('Generating sample of {} records'.format(sample_size))
with con.cursor('deduper') as c_deduper:
c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
temp_d = dict((i, row) for i, row in enumerate(c_deduper))
deduper.sample(temp_d, sample_size)
del(temp_d)

if os.path.exists(args.training):
with open(args.training) as tf:

print ('Starting active learning')
dedupe.convenience.consoleLabel(deduper)

print ('Starting training')
deduper.train(ppc=0.001, uncovered_dupes=5)

print ('Saving new training file to {}'.format(args.training))
with open(args.training, 'w') as training_file:
deduper.writeTraining(training_file)

deduper.cleanupTraining()

print ('Creating blocking_map table')
c.execute("""
DROP TABLE IF EXISTS blocking_map
""")
c.execute("""
CREATE TABLE blocking_map
(block_key VARCHAR(200), %s INTEGER)
""" % KEY_FIELD)

for field in deduper.blocker.index_fields:
print ('Selecting distinct values for "{}"'.format(field))
c_index = con.cursor('index')
c_index.execute("""
SELECT DISTINCT %s FROM %s
""" % (field, SOURCE_TABLE))
field_data = (row[field] for row in c_index)
deduper.blocker.index(field_data, field)
c_index.close()

print ('Generating blocking map')
c_block = con.cursor('block')
c_block.execute("""
SELECT * FROM %s
""" % SOURCE_TABLE)
full_data = ((row[KEY_FIELD], row) for row in c_block)
b_data = deduper.blocker(full_data)

print ('Inserting blocks into blocking_map')
csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', delete=False)
csv_writer = csv.writer(csv_file)
csv_writer.writerows(b_data)
csv_file.close()

f = open(csv_file.name, 'r')
c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
f.close()

os.remove(csv_file.name)

con.commit()

print ('Indexing blocks')
c.execute("""
CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
""")
c.execute("DROP TABLE IF EXISTS plural_key")
c.execute("DROP TABLE IF EXISTS plural_block")
c.execute("DROP TABLE IF EXISTS covered_blocks")
c.execute("DROP TABLE IF EXISTS smaller_coverage")

print ('Calculating plural_key')
c.execute("""
CREATE TABLE plural_key
(block_key VARCHAR(200),
block_id SERIAL PRIMARY KEY)
""")
c.execute("""
INSERT INTO plural_key (block_key)
SELECT block_key FROM blocking_map
GROUP BY block_key HAVING COUNT(*) > 1
""")

print ('Indexing block_key')
c.execute("""
CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
""")

print ('Calculating plural_block')
c.execute("""
CREATE TABLE plural_block
AS (SELECT block_id, %s
FROM blocking_map INNER JOIN plural_key
USING (block_key))
""" % KEY_FIELD)

c.execute("""
CREATE INDEX plural_block_%s_idx
ON plural_block (%s)
""" % (KEY_FIELD, KEY_FIELD))
c.execute("""
CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
ON plural_block (block_id, %s)
""" % (KEY_FIELD, KEY_FIELD))

print ('Creating covered_blocks')
c.execute("""
CREATE TABLE covered_blocks AS
(SELECT %s,
string_agg(CAST(block_id AS TEXT), ','
ORDER BY block_id) AS sorted_ids
FROM plural_block
GROUP BY %s)
""" % (KEY_FIELD, KEY_FIELD))

print ('Indexing covered_blocks')
c.execute("""
CREATE UNIQUE INDEX covered_blocks_%s_idx
ON covered_blocks (%s)
""" % (KEY_FIELD, KEY_FIELD))
print ('Committing')

print ('Creating smaller_coverage')
c.execute("""
CREATE TABLE smaller_coverage AS
(SELECT %s, block_id,
TRIM(',' FROM split_part(sorted_ids,
CAST(block_id AS TEXT), 1))
AS smaller_ids
FROM plural_block
INNER JOIN covered_blocks
USING (%s))
""" % (KEY_FIELD, KEY_FIELD))
con.commit()

print ('Clustering...')
c_cluster = con.cursor('cluster')
c_cluster.execute("""
SELECT *
FROM smaller_coverage
INNER JOIN %s
USING (%s)
ORDER BY (block_id)
""" % (SOURCE_TABLE, KEY_FIELD))
clustered_dupes = deduper.matchBlocks(
candidates_gen(c_cluster), threshold=0.5)

print ('Creating entity_map table')
c.execute("DROP TABLE IF EXISTS entity_map")
c.execute("""
CREATE TABLE entity_map (
%s INTEGER,
canon_id INTEGER,
cluster_score FLOAT,
PRIMARY KEY(%s)
)""" % (KEY_FIELD, KEY_FIELD))

print ('Inserting entities into entity_map')
for cluster, scores in clustered_dupes:
cluster_id = cluster[0]
for key_field, score in zip(cluster, scores):
c.execute("""
INSERT INTO entity_map
(%s, canon_id, cluster_score)
VALUES (%s, %s, %s)
""" % (KEY_FIELD, key_field, cluster_id, score))

c_cluster.close()
c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
con.commit()

if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
parser.add_argument('-t', '--training', default='training.json', help='name of training file')
args = parser.parse_args()
find_dupes(args)


## Active Learning Observations

We ran multiple experiments:

• Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
• Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic since those are much more commonly occurring in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed what seemed to be many nicknames or that could have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe for use has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase. Thus, the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


## Results

What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one that generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, have the same meeting location and the same uin (which confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier — people who seem to spend a lot of time at the White House like staffers and other political appointees.

During the dedupe process, we computed there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.

### Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671


### Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085

lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331


## Conclusion

In this beginners guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion about a number of other important topics to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### Data Exploration with Python, Part 2

This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out Data Exploration with Python, Part 1!

Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.

When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let's review where we should be at this point and then continue our exploration process with data preparation.

In Part 1 of this series, we were introduced to the data exploration framework we will be using. As a reminder, here is what that framework looks like.

We also introduced the example data set we are going to be using to illustrate the different phases and stages of the framework. Here is what that looks like.

We then familiarized ourselves with our data set by identifying the types of information and entities encoded within it. We also reviewed several data transformation and visualization methods that we will use later to explore and analyze it. Now we are at the last stage of the framework's Prep Phase, the Create stage, where our goal will be to create additional categorical fields that will make our data easier to explore and allow us to view it from new perspectives.

## Renaming Columns to be More Intuitive

Before we dive in and start creating categories, however, we have an opportunity to improve our categorization efforts by examining the columns in our data and making sure their labels intuitively convey what they represent. Just as with the other aspects of preparation, changing them now will save us from having to remember what displ or co2TailpipeGpm mean when they show up on a chart later. In my experience, these small, detail-oriented enhancements to the beginning of your process usually compound and preserve cognitive cycles that you can later apply to extracting insights.

We can use the code below to rename the columns in our vehicles data frame.

vehicles.columns = ['Make','Model','Year','Engine Displacement','Cylinders',
'Transmission','Drivetrain','Vehicle Class','Fuel Type',
'Fuel Barrels/Year','City MPG','Highway MPG','Combined MPG',
'CO2 Emission Grams/Mile','Fuel Cost/Year']


Now that we have changed our column names to be more intuitive, let's take a moment to think about what categorization is and examine the categories that currently exist in our data set. At the most basic level, categorization is just a way that humans structure information — how we hierarchically create order out of complexity. Categories are formed based on attributes that entities have in common, and they present us with different perspectives from which we can view and think about our data.

Our primary objective in this stage is to create additional categories that will help us further organize our data. This will prove beneficial not only for the exploratory analysis we will conduct but also for any supervised machine learning or modeling that may happen further down the data science pipeline. Seasoned data scientists know that the better your data is organized, the better downstream analyses you will be able to perform and the more informative features you will have to feed into your machine learning models.

In this stage of the framework, we are going to create additional categories in 3 distinct ways:

• Category Aggregations
• Binning Continuous Variables
• Clustering

Now that we have a better idea of what we are doing and why, let's get started.

### Aggregating to Higher-Level Categories

The first way we are going to create additional categories is by identifying opportunities to create higher-level categories out of the variables we already have in our data set. In order to do this, we need to get a sense of what categories currently exist in the data. We can do this by iterating through our columns and printing out the name, the number of unique values, and the data type for each.

def unique_col_values(df):
for column in df:
print("{} | {} | {}".format(
df[column].name, len(df[column].unique()), df[column].dtype
))

unique_col_values(vehicles)

Make | 126 | object
Model | 3491 | object
Year | 33 | int64
Engine Displacement | 65 | float64
Cylinders | 9 | float64
Transmission | 43 | object
Drivetrain | 7 | object
Vehicle Class | 34 | object
Fuel Type | 13 | object
Fuel Barrels/Year | 116 | float64
City MPG | 48 | int64
Highway MPG | 49 | int64
Combined MPG | 45 | int64
CO2 Emission Grams/Mile | 550 | float64
Fuel Cost/Year | 58 | int64


From looking at the output, it is clear that we have some numeric columns (int64 and float64) and some categorical columns (object). For now, let's focus on the six categorical columns in our data set.

• Make: 126 unique values
• Model: 3,491 unique values
• Transmission: 43 unique values
• Drivetrain: 7 unique values
• Vehicle Class: 34 unique values
• Fuel Type: 13 unique values

When aggregating and summarizing data, having too many categories can be problematic. The average human is said to have the ability to hold 7 objects at a time in their short-term working memory. Accordingly, I have noticed that once you exceed 8-10 discrete values in a category, it becomes increasingly difficult to get a holistic picture of how the entire data set is divided up.

What we want to do is examine the values in each of our categorical variables to determine where opportunities exist to aggregate them into higher-level categories. The way this is typically done is by using a combination of clues from the current categories and any domain knowledge you may have (or be able to acquire).

For example, imagine aggregating by Transmission, which has 43 discrete values in our data set. It is going to be difficult to derive insights due to the fact that any aggregated metrics are going to be distributed across more categories than you can hold in short-term memory. However, if we examine the different transmission categories with the goal of finding common features that we can group on, we would find that all 43 values fall into one of two transmission types, Automatic or Manual.

Let's create a new Transmission Type column in our data frame and, with the help of the loc method in pandas, assign it a value of Automatic where the first character of Transmission is the letter A and a value of Manual where the first character is the letter M.

AUTOMATIC = "Automatic"
MANUAL = "Manual"

vehicles.loc[vehicles['Transmission'].str.startswith('A'),
'Transmission Type'] = AUTOMATIC

vehicles.loc[vehicles['Transmission'].str.startswith('M'),
'Transmission Type'] = MANUAL


We can apply the same logic to the Vehicle Class field. We originally have 34 vehicle classes, but we can distill those down into 8 vehicle categories, which are much easier to remember.

small = ['Compact Cars','Subcompact Cars','Two Seaters','Minicompact Cars']
midsize = ['Midsize Cars']
large = ['Large Cars']

vehicles.loc[vehicles['Vehicle Class'].isin(small),
'Vehicle Category'] = 'Small Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(midsize),
'Vehicle Category'] = 'Midsize Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(large),
'Vehicle Category'] = 'Large Cars'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Station'),
'Vehicle Category'] = 'Station Wagons'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Truck'),
'Vehicle Category'] = 'Pickup Trucks'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Special Purpose'),
'Vehicle Category'] = 'Special Purpose'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Sport Utility'),
'Vehicle Category'] = 'Sport Utility'

vehicles.loc[(vehicles['Vehicle Class'].str.lower().str.contains('van')),
'Vehicle Category'] = 'Vans & Minivans'


Next, let's look at the Make and Model fields, which have 126 and 3,491 unique values respectively. While I can't think of a way to get either of those down to 8-10 categories, we can create another potentially informative field by concatenating Make and the first word of the Model field together into a new Model Type field. This would allow us to, for example, categorize all Chevrolet Suburban C1500 2WD vehicles and all Chevrolet Suburban K1500 4WD vehicles as simply Chevrolet Suburbans.

vehicles['Model Type'] = (vehicles['Make'] + " " +
vehicles['Model'].str.split().str.get(0))


Finally, let's look at the Fuel Type field, which has 13 unique values. On the surface, that doesn't seem too bad, but upon further inspection, you'll notice some complexity embedded in the categories that could probably be organized more intuitively.

vehicles['Fuel Type'].unique()

array(['Regular', 'Premium', 'Diesel', 'Premium and Electricity',
'Gasoline or natural gas', 'CNG', 'Regular Gas or Electricity',
'Midgrade', 'Regular Gas and Electricity', 'Gasoline or propane'],
dtype=object)


This is interesting and a little tricky because there are some categories that contain a single fuel type and others that contain multiple fuel types. In order to organize this better, we will create two sets of categories from these fuel types. The first will be a set of columns that will be able to represent the different combinations, while still preserving the individual fuel types.

vehicles['Gas'] = 0
vehicles['Ethanol'] = 0
vehicles['Electric'] = 0
vehicles['Propane'] = 0
vehicles['Natural Gas'] = 0

vehicles.loc[vehicles['Fuel Type'].str.contains(

vehicles.loc[vehicles['Fuel Type'].str.contains('E85'),'Ethanol'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('Electricity'),'Electric'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('propane'),'Propane'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),'Natural Gas'] = 1


As it turns out, 99% of the vehicles in our database have gas as a fuel type, either by itself or combined with another fuel type. Since that is the case, let's create a second set of categories - specifically, a new Gas Type field that extracts the type of gas (Regular, Midgrade, Premium, Diesel, or Natural) each vehicle accepts.

vehicles.loc[vehicles['Fuel Type'].str.contains(
'Regular|Gasoline'),'Gas Type'] = 'Regular'

vehicles.loc[vehicles['Fuel Type'] == 'Diesel',
'Gas Type'] = 'Diesel'

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),
'Gas Type'] = 'Natural'


An important thing to note about what we have done with all of the categorical fields in this section is that, while we created new categories, we did not overwrite the original ones. We created additional fields that will allow us to view the information contained within the data set at different (often higher) levels. If you need to drill down to the more granular original categories, you can always do that. However, now we have a choice whereas before we performed these category aggregations, we did not.

### Creating Categories from Continuous Variables

The next way we can create additional categories in our data is by binning some of our continuous variables - breaking them up into different categories based on a threshold or distribution. There are multiple ways you can do this, but I like to use quintiles because it gives me one middle category, two categories outside of that which are moderately higher and lower, and then two extreme categories at the ends. I find that this is a very intuitive way to break things up and provides some consistency across categories. In our data set, I've identified 4 fields that we can bin this way.

Binning essentially looks at how the data is distributed, creates the necessary number of bins by splitting up the range of values (either equally or based on explicit boundaries), and then categorizes records into the appropriate bin that their continuous value falls into. Pandas has a qcut method that makes binning extremely easy, so let's use that to create our quintiles for each of the continuous variables we identified.

efficiency_categories = ['Very Low Efficiency', 'Low Efficiency',
'Moderate Efficiency','High Efficiency',
'Very High Efficiency']

vehicles['Fuel Efficiency'] = pd.qcut(vehicles['Combined MPG'],
5, efficiency_categories)

engine_categories = ['Very Small Engine', 'Small Engine','Moderate Engine',
'Large Engine', 'Very Large Engine']

vehicles['Engine Size'] = pd.qcut(vehicles['Engine Displacement'],
5, engine_categories)

emission_categories = ['Very Low Emissions', 'Low Emissions',
'Moderate Emissions','High Emissions',
'Very High Emissions']

vehicles['Emissions'] = pd.qcut(vehicles['CO2 Emission Grams/Mile'],
5, emission_categories)

fuelcost_categories = ['Very Low Fuel Cost', 'Low Fuel Cost',
'Moderate Fuel Cost','High Fuel Cost',
'Very High Fuel Cost']

vehicles['Fuel Cost'] = pd.qcut(vehicles['Fuel Cost/Year'],
5, fuelcost_categories)


### Clustering to Create Additional Categories

The final way we are going to prepare our data is by clustering to create additional categories. There are a few reasons why I like to use clustering for this. First, it takes multiple fields into consideration together at the same time, whereas the other categorization methods only consider one field at a time. This will allow you to categorize together entities that are similar across a variety of attributes, but might not be close enough in each individual attribute to get grouped together.

Clustering also creates new categories for you automatically, which takes much less time than having to comb through the data yourself identifying patterns across attributes that you can form categories on. It will automatically group similar items together for you.

The third reason I like to use clustering is because it will sometimes group things in ways that you, as a human, may not have thought of. I'm a big fan of humans and machines working together to optimize analytical processes, and this is a good example of value that machines bring to the table that can be helpful to humans. I'll write more about my thoughts on that in future posts, but for now, let's move on to clustering our data.

The first thing we are going to do is isolate the columns we want to use for clustering. These are going to be columns with numeric values, as the clustering algorithm will need to compute distances in order to group similar vehicles together.

cluster_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
'City MPG','Highway MPG','Combined MPG',
'CO2 Emission Grams/Mile', 'Fuel Cost/Year']


Next, we want to scale the features we are going to cluster on. There are a variety of ways to normalize and scale variables, but I'm going to keep things relatively simple and just use Scikit-Learn's MaxAbsScaler, which will divide each value by the max absolute value for that feature. This will preserve the distributions in the data and convert the values in each field to a number between 0 and 1 (technically -1 and 1, but we don't have any negatives).

from sklearn import preprocessing
scaler = preprocessing.MaxAbsScaler()

vehicle_clusters = scaler.fit_transform(vehicles[cluster_columns])
vehicle_clusters = pd.DataFrame(vehicle_clusters, columns=cluster_columns)


Now that our features are scaled, let's write a couple of functions. The first function we are going to write is a kmeans_cluster function that will k-means cluster a given data frame into a specified number of clusters. It will then return a copy of the original data frame with those clusters appended in a column named Cluster.

from sklearn.cluster import KMeans

def kmeans_cluster(df, n_clusters=2):
model = KMeans(n_clusters=n_clusters, random_state=1)
clusters = model.fit_predict(df)
cluster_results = df.copy()
cluster_results['Cluster'] = clusters
return cluster_results


Our second function, called summarize_clustering is going to count the number of vehicles that fall into each cluster and calculate the cluster means for each feature. It is going to merge the counts and means into a single data frame and then return that summary to us.

def summarize_clustering(results):
cluster_size = results.groupby(['Cluster']).size().reset_index()
cluster_size.columns = ['Cluster', 'Count']
cluster_means = results.groupby(['Cluster'], as_index=False).mean()
cluster_summary = pd.merge(cluster_size, cluster_means, on='Cluster')
return cluster_summary


We now have functions for what we need to do, so the next step is to actually cluster our data. But wait, our kmeans_cluster function is supposed to accept a number of clusters. How do we determine how many clusters we want?

There are a number of approaches for figuring this out, but for the sake of simplicity, we are just going to plug in a couple of numbers and visualize the results to arrive at a good enough estimate. Remember earlier in this post where we were trying to aggregate our categorical variables to less than 8-10 discrete values? We are going to apply the same logic here. Let's start out with 8 clusters and see what kind of results we get.

cluster_results = kmeans_cluster(vehicle_clusters, 8)
cluster_summary = summarize_clustering(cluster_results)


After running the couple of lines of code above, your cluster_summary should look similar to the following.

By looking at the Count column, you can tell that there are some clusters that have significantly more records in them (ex. Cluster 7) and others that have significantly fewer (ex. Cluster 3). Other than that, though, it is difficult to notice anything informative about the summary. I don't know about you, but to me, the rest of the summary just looks like a bunch of decimals in a table.

This is a prime opportunity to use a visualization to discover insights faster. With just a couple import statements and a single line of code, we can light this summary up in a heatmap so that we can see the contrast between all those decimals and between the different clusters.

import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)


In this heatmap, the rows represent the features and the columns represent the clusters, so we can compare how similar or differently columns look to each other. Our goal for clustering these features is to ultimately create meaningful categories out of the clusters, so we want to get to the point where we can clearly distinguish one from the others. This heatmap allows us to do this quickly and visually.

With this goal in mind, it is apparent that we probably have too many clusters because:

• Clusters 3, 4, and 7 look pretty similar
• Clusters 2 and 5 look similar as well
• Clusters 0 and 6 are also a little close for comfort

From the way our heatmap currently looks, I'm willing to bet that we can cut the number of clusters in half and get clearer boundaries. Let's re-run the clustering, summary, and heatmap code for 4 clusters and see what kind of results we get.

cluster_results = kmeans_cluster(vehicle_clusters, 4)
cluster_summary = summarize_clustering(cluster_results)

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)


These clusters look more distinct, don't they? Clusters 1 and 3 look like they are polar opposites of each other, cluster 0 looks like it’s pretty well balanced across all the features, and cluster 2 looks like it’s about half-way between Cluster 0 and Cluster 1.

We now have a good number of clusters, but we still have a problem. It is difficult to remember what clusters 0, 1, 2, and 3 mean, so as a next step, I like to assign descriptive names to the clusters based on their properties. In order to do this, we need to look at the levels of each feature for each cluster and come up with intuitive natural language descriptions for them. You can have some fun and can get as creative as you want here, but just keep in mind that the objective is for you to be able to remember the characteristics of whatever label you assign to the clusters.

• Cluster 1 vehicles seem to have large engines that consume a lot of fuel, process it inefficiently, produce a lot of emissions, and cost a lot to fill up. I'm going to label them Large Inefficient.
• Cluster 3 vehicles have small, fuel efficient engines that don't produce a lot of emissions and are relatively inexpensive to fill up. I'm going to label them Small Very Efficient.
• Cluster 0 vehicles are fairly balanced across every category, so I'm going to label them Midsized Balanced.
• Cluster 2 vehicles have large engines but are more moderately efficient than the vehicles in Cluster 1, so I'm going to label them Large Moderately Efficient.

Now that we have come up with these descriptive names for our clusters, let's add a Cluster Name column to our cluster_results data frame, and then copy the cluster names over to our original vehicles data frame.

cluster_results['Cluster Name'] = ''
cluster_results['Cluster Name'][cluster_results['Cluster']==0] = 'Midsized Balanced'
cluster_results['Cluster Name'][cluster_results['Cluster']==1] = 'Large Inefficient'
cluster_results['Cluster Name'][cluster_results['Cluster']==2] = 'Large Moderately Efficient'
cluster_results['Cluster Name'][cluster_results['Cluster']==3] = 'Small Very Efficient'

vehicles = vehicles.reset_index().drop('index', axis=1)
vehicles['Cluster Name'] = cluster_results['Cluster Name']


## Conclusion

In this post, we examined several ways to prepare a data set for exploratory analysis. First, we looked at the categorical variables we had and attempted to find opportunities to roll them up into higher-level categories. After that, we converted some of our continuous variables into categorical ones by binning them into quintiles based on how relatively high or low their values were. Finally, we used clustering to efficiently create categories that automatically take multiple fields into consideration. The result of all this preparation is that we now have several columns containing meaningful categories that will provide different perspectives of our data and allow us to acquire as many insights as possible.

Now that we have these meaningful categories, our data set is in really good shape, which means that we can move on to the next phase of our data exploration framework. In the next post, we will cover the first two stages of the Explore Phase and demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data. Make sure to subscribe to the DDL blog so that you get notified when we publish it!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### Data Exploration with Python, Part 3

This is the third post in our Data Exploration with Python series. Before reading this post, make sure to check out Part 1 and Part 2!

Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post.

We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:

• The higher-level categories we created via category aggregations.
• The quintiles we created by binning our continuous variables.
• The clusters we generated via k-means clustering based on numeric variables.

Now, without further ado, let's embark on our insight-finding mission!

## Making Our Data Smaller: Filter + Aggregate

One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: filtering and aggregating. With filtering, you are essentially removing either rows or columns (or both rows and columns) in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records in your data set that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group.

To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a group_field that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names.

def agg_count(df, group_field):
grouped = df.groupby(group_field, as_index=False).size()
grouped.sort(ascending = False)

grouped = pd.DataFrame(grouped).reset_index()
grouped.columns = [group_field, 'Count']
return grouped


Now that we have this function in our toolkit, let's use it. Suppose we were looking at the Vehicle Category field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the agg_count function to transform it to show what we wanted to know.

vehicles_2016 = vehicles[vehicles['Year']==2016]
category_counts = agg_count(vehicles_2016, 'Vehicle Category')


This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart.

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Category (2016) \n')


Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985).

vehicles_1985 = vehicles[vehicles['Year']==1985]
category_counts = agg_count(vehicles, 'Vehicle Category')

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Category (1985) \n')


If we wanted to stick with the year 2016 but drill down to the more granular Vehicle Class, we could do that as well.

class_counts = agg_count(vehicles_2016, 'Vehicle Class')

ax = sns.barplot(data=class_counts, x='Count', y='Vehicle Class')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Class (2016) \n')


We could also look at vehicle counts by manufacturer.

make_counts = agg_count(vehicles_2016, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Make (2016) \n')


What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our agg_count function. Below, instead of filtering by Year, I've filtered on the Fuel Efficiency field, which contains the fuel efficiency quintiles we generated in the last post. Let's choose the Very High Efficiency value so that we can see how many very efficient vehicles each manufacturer has made.

very_efficient = vehicles[vehicles['Fuel Efficiency']=='Very High Efficiency']
make_counts = agg_count(very_efficient, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Very Fuel Efficient Vehicles by Make \n')


What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called agg_avg that calculates the mean of a designated numerical field.

def agg_avg(df, group_field, calc_field):
grouped = df.groupby(group_field, as_index=False)[calc_field].mean()
grouped = grouped.sort(calc_field, ascending = False)
grouped.columns = [group_field, 'Avg ' + str(calc_field)]
return grouped


We can then simply swap out the agg_count function with our new agg_avg function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the Combined MPG field, by vehicle category.

category_avg_mpg = agg_avg(vehicles_2016, 'Vehicle Category', 'Combined MPG')

ax = sns.barplot(data=category_avg_mpg, x='Avg Combined MPG', y='Vehicle Category')
ax.set(xlabel='\n Average Combined MPG')
sns.plt.title('Average Combined MPG by Category (2016) \n')


## Pivoting the Data for More Detail

Up until this point, we've been looking at our data at a pretty high level, aggregating up by a single variable. Sure, we were able to drill down from Vehicle Category to Vehicle Class to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we're going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.

First, we will create a pivot_count function, similar to the agg_count function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify.

def pivot_count(df, rows, columns, calc_field):
df_pivot = df.pivot_table(values=calc_field,
index=rows,
columns=columns,
aggfunc=np.size
).dropna(axis=0, how='all')
return df_pivot


We will then use this function on our vehicles_2016 data frame and pivot it out with the Fuel Efficiency quintiles we created in the last post representing the rows, the Engine Size quintiles representing the columns, and then counting the number of vehicles that had a Combined MPG value.

effic_size_pivot = pivot_count(vehicles_2016,'Fuel Efficiency',
'Engine Size','Combined MPG')


This is OK, but it would be faster to analyze visually. Let's create a heatmap that will color the magnitude of the counts and present us with a more intuitive view.

sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
sns.plt.title('Fuel Efficiency vs. Engine Size (2016) \n')


Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here's what this heatmap looks like for 1985.

effic_size_pivot = pivot_count(vehicles_1985,'Fuel Efficiency',
'Engine Size','Combined MPG')

fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
sns.plt.title('Fuel Efficiency vs. Engine Size (1985) \n')


With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for any of the axes (rows or columns), and it will display all the different combinations of values for those variables.

effic_size_category = pivot_count(vehicles_2016,
['Engine Size','Fuel Efficiency'],
'Vehicle Category','Combined MPG')

fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap(effic_size_category, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
sns.plt.title('Fuel Efficiency + Engine Size vs. Vehicle Category (2016) \n')


In this heatmap, we have Engine Size and Fuel Efficiency combinations represented by the rows, and we've added a third variable (the Vehicle Category) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year.

As a final example for this section, let's create a pivot heatmap that plots Make against Vehicle Category for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company's current offerings in terms of the breadth vs. depth of vehicle types they make.

effic_size_pivot = pivot_count(vehicles_2016, 'Make',
'Vehicle Category','Combined MPG')

fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
sns.plt.title('Make vs. Vehicle Category (2016) \n')


## Visualizing Changes Over Time

So far in this post, we've been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a multi_line function that accepts a data frame and x/y fields and then plots them on a multiline chart.

def multi_line(df, x, y):
ax = df.groupby([x, y]).size().unstack(y).plot(figsize=(15,8), cmap="Set2")


Let's use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year.

multi_line(vehicles, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
sns.plt.title('Vehicle Categories Over Time \n')


We can see from the chart that Small Cars have generally dominated across the board and that there was a small decline in the late 90s that then started to pick up again in the early 2000s. We can also see the introduction and increase in popularity of SUVs starting in the late 90s, and the decline in popularity of trucks in recent years.

If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW had the most number of vehicles last year and we saw in the pivot heatmap that those were mostly small cars, let's filter for just their vehicles to see whether they have always made a lot of small cars or if this is more of a recent phenomenon.

bmw = vehicles[vehicles['Make'] == 'BMW']

multi_line(bmw, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
sns.plt.title('BMW Vehicle Categories Over Time \n')


We can see in the chart above that they started off making a reasonable number of small cars, and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, who started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture less of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.

toyota = vehicles[vehicles['Make'] == 'Toyota']

multi_line(toyota, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
sns.plt.title('Toyota Vehicle Categories Over Time \n')


## Examining Relationships Between Variables

The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data. Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations are, what typical combinations of numerical values exist in our data, and which combinations are anomalies.

For looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird's eye view of the relationships between all the numerical fields in my data set. With just a couple lines of code, we can not only create a scatter matrix, but we can also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post.

select_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
'City MPG','Highway MPG','Combined MPG',
'CO2 Emission Grams/Mile', 'Fuel Cost/Year', 'Cluster Name']

sns.pairplot(vehicles[select_columns], hue='Cluster Name', size=3)


From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are also some hyperbolic relationships in there as well, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can also get a sense of the size of our clusters, how they are distributed, and the level of overlap we have between them.

Once we have this high-level overview, we can zoom in on anything that we think looks interesting. For example, let's take a closer look at Engine Displacement plotted against Combined MPG.

sns.lmplot('Engine Displacement', 'Combined MPG', data=vehicles,
hue='Cluster Name', size=8, fit_reg=False)


In addition to being able to see that there is a hyperbolic correlation between these two variables, we can see that our Small Very Efficient cluster resides in the upper left, followed by our Midsized Balanced cluster that looks smaller and more compact than the others. After that, we have our Large Moderately Efficient cluster and finally our Large Inefficient cluster on the bottom right.

We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you're at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!

## Conclusion

We have covered quite a bit in this post, and I hope I've provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data. Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you're able to find in the comments!

Also, make sure to stay tuned because in the next (and final) post of this series, I'm going to cover how to identify and think about the different networks that are present in your data and how to explore them using graph analytics. Click the subscribe button below and enter your email so that you receive a notification as soon as it's published!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### If you did not already know

Pattern Sequence based Forecasting (PSF)
This paper discusses about PSF, an R package for Pattern Sequence based Forecasting (PSF) algorithm used for univariate time series future prediction. The PSF algorithm consists of two major parts: clustering and prediction techniques. Clustering part includes selection of cluster size and then labeling of time series data with reference to various clusters. Whereas, the prediction part include functions like optimum window size selection for specific patterns and prediction of future values with reference to past pattern sequences. The PSF package consists of various functions to implement PSF algorithm. It also contains a function, which automates all other functions to obtain optimum prediction results. The aim of this package is to promote PSF algorithm and to ease its implementation with minimum efforts. This paper describe all the functions in PSF package with their syntax and simple examples. Finally, the usefulness of this package is discussed by comparing it with auto.arima, a well known time series forecasting function available on CRAN repository. …

Spinnaker
Spinnaker is an open source, multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence. …

A spreadmart (spreadsheet data mart) is a situation in which a company’s employees has inconsistent views of corporate data because each department relies on the data from their own spreadsheets. …

### Swift as a low-level programming language?

Modern processors come with very powerful instructions that are simply not available in many high-level languages JavaScript or Python. Here are a few examples:

• Most programming languages allow you to multiply two 64-bit integers and to get the 64-bit results (it is typically written as x = a * b). But the multiplication of two 64-bit integers actually occupies 128 bits. Reasonably enough, most programming languages (including C, Java, Go, Swift, C++,…) do not have 128-bit integers as a standard data type. So these programming languages have no way to represent the result, other than as two 64-bit integers. In practice, they often offer no direct way to get to the most significant 64 bits. Yet 64-bit processors (x64, ARM Aarch64…) have no problem computing the most significant 64 bits.
• Similarly, processors have specialized instructions to compute population counts (also called Hamming weight). A population count is the number of ones contained in the binary representation of a machine word. It is a critical operation in many advanced algorithms and data structures. You can compute it with shifts and additions, but it is much, much slower than when using the dedicated processors that all modern processors have to solve this problem. And I should stress that you can beat the dedicated instructions with vector instructions.

This disconnect between programming languages and processors is somewhat problematic because programmers have to get around the problem by doing more work, essentially letting the perfectly good instructions that processors offer go to waste. To be clear, that’s not an issue for most people, but most people do not deal with difficult programming challenges.

That is, 95% of all programming tasks can be solved with Python or JavaScript. Then, out of the remaining 5%, the vast majority are well served with something like Java or C#. But then, for the top 1% of all programming problems, the most challenging ones, then you need all of the power you can get your hands on. Historically, people have been using C or C++ for these problems.

I like C and C++, and they are fine for most things. These languages are aging well, in some respect. However, they also carry a fair amount of baggage.

But what else is there?

Among many other good choices, there is Apple’s Swift.

Swift is progressing fast. Swift 4.0 is a step forward. And, in some sense, it is beating C and C++. Let me consider the two problems I mentioned: getting the most significant bits of a product and computing the population count. C and C++ offer no native way to solve these problems. At best, there are common, non-portable, extensions that help.

Swift 4.0 solves both problems optimally in my view:

• value1.multipliedFullWidth(by: value2).high gets mapped to the mulx instruction on my x64 laptop
• value.nonzeroBitCount gets mapped to the popcnt instruction on my x64 laptop

(The phrase nonzeroBitCount is a weird way to describe the population count, but I can live with it.)

If you call these operations in a tight loop, they seem to generate very efficient code. In fact, consider the case where you repeatedly call value.nonzeroBitCount over an array:

func popcntArray( _  value  : inout [UInt64] ) -> Int {
return value.reduce( 0, { x, y in
x &+ popCnt(y)
})
}


The compiler does not use popcnt, but rather the more efficient vectorized approach (see Muła et al.). That’s because Swift benefits from the powerful LLVM machinery in the back-end.

I wrote a small Swift 4.0 program to illustrate my points. You can compile a swift program for your hardware using a command such as swiftc myprogram.swift -O -Xcc -march=native. You can then use the lldb debugger to automatically get the assembly produced by a given function.

Conclusion. If you are not targeting iOS, it is crazy to use Swift for high-performance low-level programming language at this time. However, it is getting less and less crazy.

### R Packages worth a look

Core Inflation (Inflation)
Provides access to core inflation functions. Four different core inflation functions are provided. The well known trimmed means, exclusion and double weighing methods, alongside the new Triple Filter method introduced in Ferreira et al. (2016) <https://goo.gl/UYLhcj>.

Give your Dependencies Stars on GitHub! (ThankYouStars)
A tool for starring GitHub repositories.

pinp’ is not ‘PNAS’ (pinp)
A ‘PNAS’-alike style for ‘rmarkdown’, derived from the ‘Proceedings of the National Academy of Sciences of the United States of America’ (PNAS, see <https://www.pnas.org> ) LaTeX style, and adapted for use with ‘markdown’ and ‘pandoc’.

Transitive Index Numbers for Cross-Sections and Panel Data (multilaterals)
Computing transitive (and non-transitive) index numbers (Coelli et al., 2005 <doi:10.1007/b136381>) for cross-sections and panel data. For the calculation of transitive indexes, the EKS (Coelli et al., 2005 <doi:10.1007/b136381>; Rao et al., 2002 <doi:10.1007/978-1-4615-0851-9_4>) and Minimum spanning tree (Hill, 2004 <doi:10.1257/0002828043052178>) methods are implemented. Traditional fixed-base and chained indexes, and their growth rates, can also be derived using the Paasche, Laspeyres, Fisher and Tornqvist formulas.

Spatial Floor Simulation (Isotropic) (SpatialFloor)
Spatial floor simulation with exponential/Gaussian variance-covariance function (isotropic), with specification of distance function, nugget, sill, range. The methodology follows Nole A.C. Cressie (2015) <doi:10.1002/9781119115151>.

## September 21, 2017

### Will Stanton hit 61 home runs this season?

[edit: Juho Kokkala corrected my homework. Thanks! I updated the post. Also see some further elaboration in my reply to Andrew’s comment. As Andrew likes to say …]

So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple back-of-the-envelope Bayesian model and see what the posterior event probability estimate is.

Sampling notation

A simple model that assumes a home run rate per at bat with a uniform (conjugate) prior:

$\theta \sim \mbox{Beta}(1, 1)$

The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.

$56 \sim \mbox{Binomial}(555, \theta)$

Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the at-bats in the rest of the season is Poisson.

$\mathit{ab} \sim \mbox{Poisson}(10 \times 555 / 149)$

We then take the number of home runs to be binomial given the number of at bats and the home run rate.

$h \sim \mbox{Binomial}(\mathit{ab}, \theta)$

Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.

$\mbox{gte61} = \mbox{I}[h \geq (61 - 56)]$

Event probability

The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable,

$\displaystyle \mbox{Pr}[h \geq (61 - 56)] \\[6pt] \hspace*{3em} \displaystyle { } = \ \int_{\theta} \ \sum_{ab} \, \ \mathrm{I}[h \geq 61 - 56] \\ \hspace*{8em} \ \times \ \mbox{Binomial}(h \mid ab, \theta) \\[6pt] \hspace*{8em} \ \times \ \mbox{Poisson}(ab \mid 10 \ \times \ 555 / 149) \\[6pt] \hspace*{8em} \ \times \ \mbox{Beta}(\theta \mid 1 + 56, 1 + 555 - 56) \ \mathrm{d}\theta.$

Computation in R

The posterior for $\theta$ is analytic because the prior is conjugate, letting us simulate the posterior chance of success given the observed successes (56) and number of trials (555). The number of at bats is independent and also easy to simulate with a Poisson random number generator. We then simulate the number of hits on the outside as a random binomial, and finally, we compare it to the total and then report the fraction of simulations in which the simulated number of home runs put Stanton at 61 or more:

> sum(rbinom(1e5,
rpois(1e5, 10 * 555 / 149),
rbeta(1e5, 1 + 56, 1 + 555 - 56))
>= (61 - 56)) / 1e5
[1] 0.34


That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.

Disclaimer

The above is intended for recreational use only and is not intended for serious bookmaking.

Exercise

You guessed it—code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistics. It’d be even better hierarchically with more players (that’ll pull the estimate for $\theta$ down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.

The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.

The post Will Stanton hit 61 home runs this season? appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Product Launch: Amped up Kernels Resources + Code Tips & Hidden Cells

Kaggle’s kernels focused engineering team has been working hard to make our coding environment one that you want to use for all of your side projects. We’re excited to announce a host of new changes that we believe make Kernels the default place you’ll want to train your competition models, explore open data, and build your data science portfolio. Here’s exactly what’s changed:

## Additional Computational Resources (doubled and tripled)

• Execution time: Now your kernels can run for up to 60 minutes instead of our past 20 minute limit.
• RAM: Work with twice as much data with 16 GB of RAM available for every kernel.
• Disk space: Create killer output with 1 GB of disk space.

## Code Tips

Code tips catch common mistakes as you work through coding a kernel. They will pop up when you run code with an identifiable error and significantly cut down your troubleshooting time.

Here are some examples of the most common code tips you’ll run into:

Although you specified the "R" language, you might be writing Python code. Was this intentional? If not, start a Python script instead.

Couldn't show a character. Did you happen to load binary data as text?

Did you mean "from bs4 import BeautifulSoup"?

Did you mean "ggplot2"?

Do you mean pandas.DataFrame?

## Hidden Cells

You publish public kernels so you can share your data science work to build a portfolio, get feedback, and help others learn. We've added the ability to hide cells, making it possible to present your analysis cleanly so people can focus on what's useful. If viewers want to dig in, it’s still possible to unhide cells to see the dirty details.

## Improved Reliability

We know how frustrating it is to lose work, get stuck on Kaggle-side errors, or simply have a workbench that’s down when you’re trying to get up and running. It’s a priority of the Kernels engineering team to improve reliability so you can count on your code. Here are a few recent improvements:

• Fewer notebook disconnections
• Notebook editor now works with “Block Third Party Cookies” browser setting enabled
• More reliable notebook auto-save

What else can we do to make Kernels your default data science and ML workbench? Aurelio will be monitoring this forum post closely...let us know what you'd like to see next!

### Distilled News

We have made huge progress in solving Supervised machine learning problems. That also means that we need a lot of data to build our image classifiers or sales forecasters. The algorithms search patterns through the data again and again. But, that is not how human mind learns. A human brain does not require millions of data for training with multiple iterations of going through the same image for understanding a topic. All it needs is a few guiding points to train itself on the underlying patterns. Clearly, we are missing something in current machine learning approaches. Thankfully, there is a line of research which specifically caters to this question. Can we build a system capable of requiring minimal amount of supervision which can learn majority of the tasks on its own. In this article, I would like to cover one such technique called pseudo-labelling. I will give an intuitive explanation of what pseudo-labelling is and then provide a hands-on implementation of it.
Our two previous blog entries implied that there is a role games can play in driving the development of Reinforcement Learning algorithms. As the world’s most popular creation engine, Unity is at the crossroads between machine learning and gaming. It is critical to our mission to enable machine learning researchers with the most powerful training scenarios, and for us to give back to the gaming community by enabling them to utilize the latest machine learning technologies. As the first step in this endeavor, we are excited to introduce Unity Machine Learning Agents.
Torch implementation of various types of GANs (e.g. DCGAN, ALI, Context-encoder, DiscoGAN, CycleGAN).
AI Conference chairs Ben Lorica and Roger Chen reveal the current AI trends they’ve observed in industry.
Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how well-suited MCS designs are to classroom demonstrations, and how they provide a hands-on method for students to become acquainted with complex statistical concepts. In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate-analyze-summarize coding structure over the typical “for-loop” strategy are discussed.
This is the second exercise set on answering probability questions with simulation. Finishing the first exercise set is not a prerequisite. The difficulty level is about the same – thus if you are looking for a challenge aim at writing up faster more elegant algorithms. As always, it pays off to read the instructions carefully and think about what the solution should be before starting to code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated and slow.
Inside the enterprise, a dashboard is expected to have up-to-the-minute information, to have a fast response time despite the large amount of data that supports it, and to be available on any device. An end user may expect that clicking on a bar or column inside a plot will result in either a more detailed report, or a list of the actual records that make up that number. This article will cover how to use a set of R packages, along with Shiny, to meet those requirements.
In order to practice with network data with R, we have been playing with the Padgett (1994) Florentine’s wedding dataset (discussed in the lecture).
R has a lot of packages for users to analyse posts on social media. As an experiment in this field, I decided to start with the biggest one: Facebook. I decided to look at the Facebook activity of Donald Trump and Hillary Clinton during the 2016 presidential election in the United States. The winner may be more famous for his Twitter account than his Facebook one, but he still used it to great effect to help pick off his Republican rivals in the primaries and to attack Hillary Clinton in the general election. For this work we’re going to be using the Rfacebook package developed by Pablo Barbera, plus his excellent how-to guide.