# My Data Science Blogs

## July 21, 2017

### Hoag Hospital: Strategic Analytics Analyst Lead

Seeking a Strategic Analytics Analyst Lead to partner with internal and external clients to deliver data-driven insights that support decision-making throughout the organization.

### Book Memo: “Multicriteria and Clustering”

Classification Techniques in Agrifood and Environment. This book provides an introduction to operational research methods and their application in the agrifood and environmental sectors. It explains the need for multicriteria decision analysis and teaches users how to use recent advances in multicriteria and clustering classification techniques in practice. Further, it presents some of the most common methodologies for statistical analysis and mathematical modeling, and discusses in detail ten examples that explain and show "hands-on" how operational research can be used in key decision-making processes at enterprises in the agricultural food and environmental industries. As such, the book offers a valuable resource especially well suited as a textbook for postgraduate courses.

## July 20, 2017

### Document worth reading: “Provable benefits of representation learning”

There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machines, etc. To study the relative merits of these techniques, it's essential to formalize the definition and goals of representation learning, so that they all become instances of the same definition. This paper introduces such a formal framework that also formalizes the utility of learning the representation. It is related to previous Bayesian notions, but with some new twists. We show the usefulness of our framework by exhibiting simple and natural settings, linear mixture models and loglinear models, where the power of representation learning can be formally shown. In these examples, representation learning can be performed provably and efficiently under plausible assumptions (despite being NP-hard in general), and furthermore: (i) it greatly reduces the need for labeled data (semi-supervised learning), (ii) it allows solving classification tasks when simpler approaches like nearest neighbors require too much data, and (iii) it is more powerful than manifold learning methods. Provable benefits of representation learning

### Magister Dixit

“Data is becoming a powerful and most valuable commodity in 21st century. It is leading to scientific insights and new ways of understanding human behaviour. Data can also make you rich. Very rich.” SBS documentary “The Age of Big Data”

### Building a Data-Driven Culture

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

What does it mean to be data-driven? Everything we do is driven by data. Even our intuition is really an accumulation of data points through experience. However, experience isn't easily won or shared, and often we settle for repeating previous actions, or defer to the resident HiPPO (highest paid person's opinion)!

When we talk about being data-driven, what we actually mean is that we would like to make decisions based on the best data, made available to the most people. What does that mean for business, and how do you start?

By far the most difficult thing in being data-driven is getting the right data in the first place. CIOs frequently inherit a legacy of software applications that have been created with functionality in mind, not reuse of data. Data almost seems relegated to be a side effect of the operation of the machine, and hard to liberate. That’s changing, but it’s still remarkably easy to forget to think about exploiting data when you’re creating new systems. If we maintain a focus on the life cycle of data, we can avoid designing ourselves into silos.

The way in which data is made available is also key. There’s a difference between data for exploration, and data for operation. You wouldn’t expect anybody to pick a driving route just by watching the car’s speedometer. You can’t do without it, but neither is it enough. A dashboard is just table stakes: great for monitoring, but not for understanding patterns and correlations. Productive use of data in business involves learning the art of asking questions and finding that fingertip feel for the problem. That’s one reason that data lakes have people excited, as they hold the promise of making it quicker and cheaper to ask questions.

As with all business buzzwords, “data-driven” does us a disservice at the same time as highlighting a really important concept. We actually want to be “data-informed,” because humans are still the best decision makers we have in business. The aim is to force-multiply our most valuable asset: people. We should automate away all the messy bits of making sense of raw data, and present the human in the loop with the most useful information possible.

The point of being data-driven is to be able to take action. It’s an upgrade to the traditional reporting functions of analytics: we’re moving the use of data into the everyday operation of our business. It’s not a coincidence that companies who are lauded for being data-driven have also updated their infrastructure and working practices in order to be agile and move fast. Having the best data is of limited use if you’re unable to exploit it in the marketplace.

Creating a data-driven culture is a big undertaking, but you can start small and expand out. Start with a manageable problem and a willing team, where you can make an impact in three months or less. Work backwards from your business goals, and figure out what data or tools people need to make a difference. Do you need new data, faster data, more complete data? Does the team have the freedom to explore and ask questions?

Lastly, always hold the business goal in mind. In your words and plans, use the language of creating value, and fostering, rather than smothering, innovation. Being data-driven is about making the data work for you, helping you go further, faster.

The post Building a Data-Driven Culture appeared first on Silicon Valley Data Science.

### Joyplots tutorial with insect data 🐛 🐞🦋

This tutorial shows you how to get up and running with joyplots.

Joyplots are a really nice visualization, which let you pull apart a dataset and plot density for several factors separately but on the same axis. It's particularly useful if you want to avoid drawing a new facet for each level of a factor but still want to directly compare them to each other.
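Before diving into the real dataset, here's a minimal, self-contained sketch of the idea using simulated data (the column names and numbers are made up for illustration, and it assumes the ggjoy package is installed for the plotting step):

```r
# three groups whose densities we want to compare on one shared x axis
set.seed(1)
df <- data.frame(
  value = c(rnorm(200, mean = 0), rnorm(200, mean = 2), rnorm(200, mean = 4)),
  group = rep(c("a", "b", "c"), each = 200)
)

# geom_joy() draws one density ridge per level of `group`, stacked on a
# single shared axis instead of one facet per group
if (requireNamespace("ggjoy", quietly = TRUE)) {
  library(ggplot2)
  library(ggjoy)
  ggplot(df, aes(x = value, y = group)) + geom_joy()
}
```

Each ridge is a density estimate for one group, so the overlapping shapes can be compared directly without scanning across separate facets.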

This plot of when in the day Americans do different activities, made by Henrik Lindberg, is a really good example of the type of analysis well-suited to a joyplot.

I’ll be using a Kaggle Kernels notebook for this walkthrough, analyzing a dataset of insects caught in a light trap set on the roof of the University of Copenhagen’s Zoological Museum. You can create your own notebook here to code along.

```r
# load in our libraries
library(tidyverse) # loads in all the tidyverse libraries
library(lubridate) # to make dealing with dates easier
library(ggjoy)     # the brand new ggjoy package!

# read in data & convert it to a tibble (a special type of dataframe with
# a lot of nice qualities); the file name below is a placeholder for the
# Kaggle dataset file
bugs <- as_tibble(read.csv("light_trap.csv"))

# take a look at the first couple rows to make sure it all loaded in alright
head(bugs)
```

Ok, all of that looks good. Now, let's see which months were the most popular for insects to visit the trap.

```r
# add a column with the month of each observation. mdy() tells the lubridate package what
# format our dates are in & month() says we only want the month from the date
bugs$month <- month(mdy(bugs$date1))
```

```r
# list of months for labelling the graph
monthList <- c("Jan","Feb","Mar","April","May", "June","July","Aug","Sep","Oct","Nov","Dec")
```

```r
# remap months from numbers (3:12) to words (March-December)
bugs$month <- plyr::mapvalues(bugs$month, levels(as.factor(bugs$month)), monthList[3:12])

# plot the number of bugs caught by month
ggplot(data = bugs, aes(x = month, y = individuals)) +
  geom_point() +
  scale_x_discrete(limit = monthList)

head(bugs)
```

Not surprisingly, most insects showed up in the summer. (Denmark is in the Northern Hemisphere, so summer runs from June to September.) Can we peel apart the two orders of insects in our dataset using ggjoy to see if they show up at different times of year?

```r
# we're going to have to do some data manipulation to get there.
# let's get the total number of insects observed on each day (binning over years)
bugs$dayInYear <- yday(mdy(bugs$date1))

# joyplot of when insects were observed by order. Scale changes how tall the peaks are
ggplot(data = bugs, aes(x = dayInYear, y = order)) +
  geom_joy(scale = 0.9) +
  theme_joy()
```

So it looks like both orders (moths and butterflies, or Lepidoptera, and beetles, or Coleoptera) tend to show up at roughly the same time. Another good use of joyplots is to see how events have shifted over time. Let's see if there have been any shifts in when insects are observed over the years the light trap has been set up.

```r
# joyplot of dates on which insects were observed by year of observation
ggplot(data = bugs, aes(x = dayInYear, y = as.factor(year))) +
  geom_joy(scale = 0.9) +
  theme_joy()
```

Maybe a little bit of shift. Just eyeballing it, it looks like there hasn't been a shift of the mass of observations to earlier or later in the year. Rather, it almost looks as if the peak of observations has spread out, as if the "insect season" has become longer. We can test that by looking at the change in the variance of the days in the year on which bugs are observed.
```r
# look at the variance
varianceByYear <- bugs %>%
  group_by(year) %>%
  summarise(variance = sd(dayInYear))

# plot variance by year
ggplot(varianceByYear, aes(year, variance)) +
  geom_line() +
  geom_smooth(method = 'lm') # this function adds the fitted line (w/ confidence interval)
```

Sure enough, it looks like there has been increasing variance in which days of the year insects are observed in this light trap, an observation I probably wouldn't have thought to look for if I hadn't had a joyplot of this data.

And that's it! For more visualization tutorials, check out Meg Risdal's post on "Seventeen Ways to Map Data in Kaggle Kernels: Tutorials for Python and R Users". And go here to fork this notebook and play with the code even further. Good luck!

### Quirks about running Rcpp on Windows through RStudio

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.

I kept running into problems with Rcpp complaining that (a) RTools wasn't installed, and (b) the C++ compiler couldn't find Rcpp.h. First, devtools::find_rtools was giving a positive result, so (a) was not true. Second, I noticed that the wrong C++ compiler was being called. Even more frustrating was the fact that everything worked if I used a native R console rather than RStudio. So there was nothing inherently wrong with the code or setup, but rather with the environment RStudio was creating. After some searching of the interwebs and StackOverflow, the following solution worked for me.
I added the following lines to my global .Rprofile file:

```r
Sys.setenv(PATH = paste(Sys.getenv("PATH"),
                        "C:/RBuildTools/3.4/bin/",
                        "C:/RBuildTools/3.4/mingw_64/bin",
                        sep = ";"))
Sys.setenv(BINPREF = "C:/RBuildTools/3.4/mingw_64/bin/")
```

Note that C:/RBuildTools is the default location suggested when I installed RTools. This solution is indicated here, but I have the reverse issue of the default setup working in R and not in the latest RStudio. However, the solution still works!!

Note that instead of putting it in the global .Rprofile, you could put it in a project-specific .Rprofile, or even in your R code, as long as it is run before loading Rcpp or derivative packages. Note also that if you use binary packages that use Rcpp, there is no problem. It is only an issue when you're compiling C++ code, either your own or when building a package from source. And, as far as I can tell, only on Windows.

Hope this prevents someone else from 3 hours of heartburn trying to make Rcpp work on a Windows box. And, if this has already been fixed in RStudio, please comment and I'll be happy to update this post.

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### Quickly Check your id Variables

(This article was first published on That's so Random, and kindly contributed to R-bloggers)

Virtually every dataset has them: id variables that link a record to a subject and/or time point. Often one column, or a combination of columns, forms the unique id of a record. For instance, the combination of patient_id and visit_id, or ip_address and visit_time.
The first step in most of my analyses is almost always checking the uniqueness of a variable, or a combination of variables. If it is not unique, many assumptions about the data may be wrong, or there are data quality issues. Since I do this so often, I decided to make a little wrapper around this procedure. The unique_id function will return TRUE if the evaluated variables indeed are the unique key to a record. If not, it will return all the records for which the id variable(s) are duplicated, so we can pinpoint the problem right away. It uses dplyr v.0.7.1, so make sure that it is loaded.

```r
library(dplyr)
some_df <- data_frame(a   = c(1, 2, 3, 3, 4),
                      b   = 101:105,
                      val = round(rnorm(5), 1))
some_df %>% unique_id(a)
```

```
## # A tibble: 2 x 3
##       a     b   val
## 1     3   103  -1.8
## 2     3   104   1.1
```

```r
some_df %>% unique_id(a, b)
```

```
## [1] TRUE
```

Here you find the source code of the function. You can also obtain it by installing the package accompanying this blog using devtools::install_github("edwinth/thatssorandom").

```r
unique_id <- function(x, ...) {
  id_set <- x %>% select(...)
  id_set_dist <- id_set %>% distinct
  if (nrow(id_set) == nrow(id_set_dist)) {
    TRUE
  } else {
    non_unique_ids <- id_set %>%
      filter(id_set %>% duplicated) %>%
      distinct()
    suppressMessages(
      inner_join(non_unique_ids, x) %>% arrange(...)
    )
  }
}
```

To leave a comment for the author, please follow the link and comment on their blog: That's so Random.

### Quick Tips for Getting A Data Science Team Off the Ground

Should you start a data science team? Or not? It isn't an easy decision.
This blog post provides tips to help leaders at startups and early-stage companies decide whether it is the right time to start building a data science team.

#### Why Data Science?

An increasing number of startups and early-stage companies are realizing they need to tap into data science to grow or stay competitive. They've recognized that collecting and storing data is fruitless unless it can drive insights that propel the business forward. If you're a manager charged with building a data science team from scratch, or expanding a fledgling one, it can be hard to get the timing right or know what to prioritize. This blog post will help you ensure you're investing in the right people, processes and technology to make your young team flourish.

#### Prioritize Timing

Don't launch a data science team just because everyone seems to be doing it. You're jumping in too early if you haven't been collecting any data or haven't gleaned basic business intelligence from the information you have. You need to understand what's happening today before building tools to predict what's next.

#### Hiring The Best People

**Building a strong team.** There are no unicorns. You probably won't find someone who excels at dissecting high-level problems, strategizing about moving the business forward, pitching the team's work to other stakeholders, doing the heavy technical lifting and integrating tools into existing production systems. Instead, think of the team as a balanced organism in which people with different strengths come together symbiotically.

Reflect deeply on your existing skill set and what you need to round out the team. If you have a savvy leadership that knows exactly what problems they want a data scientist to tackle, hire a technical-leaning person first. If you're having trouble integrating data science into the business, prioritize a hire who can cultivate those relationships and spark cultural change. Avoid bringing on a junior data scientist without a strong mentor.
People who are scrappy, business-curious and low-ego will be an asset in a young team. And don't underestimate the importance of diversity in fostering better outcomes, even in a small team.

**Recruiting good candidates.** Data scientists are in high demand, so you'll probably have to do some wooing. Pitch potential hires on the opportunity to work on problems that impact the business, and assure them that they'll have access to interesting data to do so. Strong candidates want to know that data science is a priority at the company, not a buzzword. Give them concrete examples of projects they could work on to drive revenue and value. Invite them to spend a day getting to know the team. Communicate that the company will see them not as a cog in the machine but as a thought worker with a seat at the table.

While interviewing candidates, don't just ask about their technical skills. How well can they get to the heart of problems they could be tackling? How effectively can they communicate the importance of their work? How well can they prioritize? After all, solving the right problems and selling data science to business stakeholders are key aspects of the job.

**Onboarding seamlessly.** Try to give new hires access to data and business stakeholders on Day One. Get them involved in meetings that will help them understand pressing business concerns, and give them examples of how data science insights have impacted the company in the past. Acquaint them with the standard suite of tools your team uses, but leave room for flexibility and experimentation. New hires will be happier and more productive if they can leverage tools they already know. Make sure newbies understand how they will be evaluated.

#### Running Your Team

**Picking a focus.** With a small team, you won't be able to do everything. Identify one or two priorities that have buy-in from executive stakeholders.
Ideally, these priorities will also have clear business value and relatively short timelines, and they won't require vast changes in how other teams operate. Before you start a project, take time to ensure you're solving the right problem. Be clear on who you're building for and how it's going to help the business. If you take six months to build a "perfect" model, it could be obsolete by the time you finish. Being late to the game will also hurt your credibility, because your work will no longer be relevant and the business will have moved on.

**Iterating quickly and encouraging collaboration.** Deploy products early and often. Data science is about experimentation, and most experiments fail. Identify the flops quickly so you can course-correct. Find ways to shorten the feedback loop. For example, consider a first iteration that is pure business logic, rather than an algorithm, or provide opportunities for soft launches. And make sure you celebrate experiments that lead to no results, to encourage bold experimentation.

Collaboration is also critical to working efficiently and making an impact. Working together allows your team to tackle bigger problems, leverage individual strengths and avoid depending too much on one person. Collaborative data science involves ensuring repeatability (the same process produces the same outputs), reproducibility (team members can easily recreate statistical tests, empirical experiments and computational functions), and replicability (the same experiment can be repeated twice, collecting and analyzing data in the same way and arriving at the same results). Provide opportunities, such as "lunch and learns," that encourage team members to share their work in a supportive environment. Establish frequent communication with the business side to get buy-in and feedback (don't forget to use language they understand and tie everything to business objectives).
The strongest data science teams are proactive partners in business discussions rather than request-fillers.

**Establishing best practices early.** Don't make the mistake of thinking your team is too small to adopt best practices for documentation, code review and feedback. It will be harder to change behavior down the line, and insights you generate could end up being a cornerstone of the business. Failing to install good systems early could mean losing institutional knowledge or going down pointless rabbit holes.

**Deciding where your team will live.** Having a centralized team is important for facilitating collaboration, building on existing work, improving quality through code reviews and providing career development pathways. Meanwhile, embedding data scientists in various divisions steeps them in that aspect of the business and generates better innovations. A happy medium is the hybrid "hub and spoke" model: team members spend two or three days embedded with a team, and the rest of the week sitting with their data science colleagues.

#### How Technology Can Help

Selecting the best technology for your team will have a massive impact on its success. The right data science platform will help integrate new hires, make your team more efficient and enable collaboration, so you can quickly make a visible impact.

**Easing in new hires.** Some companies can take up to six weeks to onboard new team members, wasting precious time. If past data sources, experiments and discussions are stored on individual desktops or lost in email chains, they can be difficult or impossible to trace. The solution is a data science platform that automatically stores existing data, tools, connections, packages, libraries and code. That means new team members are up and running in days. Since they can easily reproduce previous experiments and view past results, they're in a position to start contributing immediately. Data scientists also want to use familiar tools.
Pick a platform that allows new hires to stick with their preferred programming languages and software while supplementing gaps in knowledge, such as working with Git or AWS.

**Enabling innovation and collaboration.** A data science platform can help iteration happen quickly by allowing your team to efficiently run and track multiple experiments at the same time. It can also allow you to easily share and discuss results with colleagues and business stakeholders. Choose a platform that can communicate results to non-technical users without overwhelming them with the complex model behind the scenes. Sharing your work through easy-to-digest visualizations, interactive dashboards and web apps will do wonders to get buy-in across the company. Getting feedback in real time will also keep your team from going off course and will speed up development cycles.

**Protecting institutional knowledge.** With a small team, you don't want aspects of your work to become dependent on a single member who can leave at any time. A data science platform that automatically maintains a system of record ensures everyone's contributions remain an asset of the company. The archive also prevents the team from reinventing the wheel or losing the knowledge underlying a key revenue-driver. Choose a platform that tracks code, data, environments and other factors in one place, without anyone having to input them manually.

A growing number of companies are recognizing the value of investing in data science. But launching or building a new team can be daunting. Creating an effective data science team requires getting your timing right, attracting the best people and integrating them seamlessly. It also means focusing on key problems and establishing practices that enable your team to collaborate and quickly make an impact. Adopting the right data science platform can help your budding team thrive.
If you are looking for additional and in-depth insight to help data science leaders identify existing gaps and direct future data science investment, download Domino's whitepaper, Data Science Maturity Model.

The post Quick Tips for Getting A Data Science Team Off the Ground appeared first on Data Science Blog by Domino.

### Boost economy with immigration

Want to increase the GDP? Easy. Let more immigrants in. Lena Groeger for ProPublica:

> In an analysis for ProPublica, Adam Ozimek and Mark Zandi at Moody's Analytics, an independent economics firm, estimated that for every 1 percent increase in U.S. population made of immigrants, GDP rises 1.15 percent. So a simple way to get to Trump's 4 percent GDP bump? Take in about 8 million net immigrants per year. To show you what that really looks like, we've charted the effect below. You can see for yourself what might happen to the economy if we increased immigration to the highest rates in history or dropped it to zero, and everything in between.

The interactive in the article lets you pose the what-if with various immigration rates. Give it a try.

### Big Data Innovation, Data Visualization Summits, Boston, Sep 7-8

Visualize your data, demonstrate its value, and tailor your pitch: learn how from the industry leaders in Boston.

### R Packages worth a look

**Bayesian Graphical Lasso (BayesianGLasso)**
Implements a data-augmented block Gibbs sampler for simulating the posterior distribution of concentration matrices for specifying the topology and parameterization of a Gaussian Graphical Model (GGM). This sampler was originally proposed in Wang (2012) <doi:10.1214/12-BA729>.

**Stochastic Gradient Markov Chain Monte Carlo (sgmcmc)**
Provides functions that perform popular stochastic gradient Markov chain Monte Carlo (SGMCMC) methods on user-specified models.
The required gradients are automatically calculated using 'TensorFlow' <https://…/>, an efficient library for numerical computation. This means only the log likelihood and log prior functions need to be specified. The methods implemented include stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian Monte Carlo (SGHMC), stochastic gradient Nose-Hoover thermostat (SGNHT), and their respective control variate versions for increased efficiency.

**Subgroup Discovery and Bump Hunting (subgroup.discovery)**
Developed to assist in discovering interesting subgroups in high-dimensional data. The PRIM implementation is based on the 1998 paper 'Bump hunting in high-dimensional data' by Jerome H. Friedman and Nicholas I. Fisher <doi:10.1023/A:1008894516817>. PRIM involves finding a set of 'rules' which combined imply unusually large (or small) values of some other target variable. Specifically, one tries to find a set of subregions in which the target variable is substantially larger than the overall mean. The objective of bump hunting in general is to find regions in the input (attribute/feature) space with relatively high (low) values for the target variable. The regions are described by simple rules of the type if: condition-1 and … and condition-n then: estimated target value. Given the data (or a subset of the data), the goal is to produce a box B within which the target mean is as large as possible. There are many problems where finding such regions is of considerable practical interest. Often these are problems where a decision maker can, in a sense, choose or select the values of the input variables so as to optimize the value of the target variable. In bump hunting it is customary to follow a so-called covering strategy. This means that the same box construction (rule induction) algorithm is applied sequentially to subsets of the data.
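To make the peeling idea concrete, here is a small illustrative sketch of a one-variable PRIM-style peel in base R (this is not the subgroup.discovery API; the function and variable names are invented for the example):

```r
# An illustrative one-variable sketch of PRIM-style "peeling": repeatedly
# trim the alpha fraction of one tail of the input variable, keeping
# whichever trim raises the target mean the most, and stop when no trim
# improves it (or when the box gets too small).
prim_peel_1d <- function(x, y, alpha = 0.1, min_n = 10) {
  lo <- min(x); hi <- max(x)
  repeat {
    keep <- x >= lo & x <= hi
    if (sum(keep) <= min_n) break
    lo2 <- as.numeric(quantile(x[keep], alpha))      # candidate new lower edge
    hi2 <- as.numeric(quantile(x[keep], 1 - alpha))  # candidate new upper edge
    m_lo <- mean(y[x >= lo2 & x <= hi])  # target mean after peeling the low tail
    m_hi <- mean(y[x >= lo & x <= hi2])  # target mean after peeling the high tail
    if (max(m_lo, m_hi) <= mean(y[keep])) break      # no improvement: stop
    if (m_lo >= m_hi) lo <- lo2 else hi <- hi2
  }
  c(lower = lo, upper = hi)
}

# toy data: the target is high only when x is between 4 and 6
set.seed(42)
x <- runif(500, 0, 10)
y <- as.numeric(x > 4 & x < 6) + rnorm(500, sd = 0.1)
box <- prim_peel_1d(x, y)
box  # the box edges should close in on roughly [4, 6]
```

Real implementations peel over many input variables at once and then "paste" the box back out; the covering strategy then repeats the whole procedure on the data left outside each accepted box.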
**Estimate (Generalized) Linear Mixed Models with Factor Structures (PLmixed)**
Utilizes the 'lme4' package and the optim() function from 'stats' to estimate (generalized) linear mixed models (GLMM) with factor structures using a profile likelihood approach, as outlined in Jeon and Rabe-Hesketh (2012) <doi:10.3102/1076998611417628>. Factor analysis and item response models can be extended to allow for an arbitrary number of nested and crossed random effects, making it useful for multilevel and cross-classified models.

**A Monadic Pipeline System (rmonad)**
A monadic solution to pipeline analysis. All operations, and the errors, warnings and messages they emit, are merged into a directed graph. Infix binary operators mediate when values are stored, how exceptions are handled, and where pipelines branch and merge. The resulting structure may be queried for debugging or report generation. 'rmonad' complements, rather than competes with, non-monadic pipeline packages like 'magrittr' or 'pipeR'.

### Deep Learning, AI Assistant Summits London feature DeepMind and much more, Sep 21-22 – KDnuggets Offer

The Deep Learning Summit London and the AI Assistant Summit London will be continuing the RE•WORK Global Summit Series this September 21 & 22. The Early Bird discount ends on July 28th. Register now to guarantee a spot at the Summit, and use the discount code KDNUGGETS to save 20% on all tickets.

### In America, you are what you eat

WHAT is soppressata? Google searches for the Italian meat surged last week, thanks to a column by David Brooks in the New York Times in which he recounted an awkward lunch at an upscale delicatessen with "a friend with only a high-school degree".

### RuleML Keynote

My colleague Eric Mazeran gave a keynote on "ML, Optimization and Rules: time for agility and convergences" at the RuleML conference. I co-authored the material with him, and the slides can be found here.
It was well received, as it gives a global view of how these three technologies can be used together. I'd like to comment on one slide of his presentation (slide 15 if you download the deck). Here is a slightly modified version of it:

It captures what I believe is the ideal data science project. It starts with data about some part of the world we are interested in (typically some business-related data), and a business question we try to answer. I discussed the trap of starting data science projects without a business question in Start With A Question. Examples of relevant business questions include: Which customers are likely to renew their yearly subscription? Which customers should I target with my marketing campaign? Which products should I recommend to which customers? What maintenance operations should I perform first? How should I replenish my inventory to best meet future demand?

Let's use the first question for the sake of clarity: which customers are likely to renew their yearly subscription? The first thing data scientists would do is look at the available data and explore it to see if it can help answer the question. They will use various techniques, ranging from data visualization to statistical and machine learning algorithms. Their goal is to find patterns that are correlated with customer subscription renewal. Finding these patterns is worth it: they can then be shared with decision makers. Data science projects can stop there, in which case they are Data Mining projects. Other projects move to the next stage.

The next stage is to use whatever patterns were discovered in the first stage to make predictions. For instance, if I have created a statistical or a machine learning model that predicts with good accuracy which customers will renew and which won't, then I can use that model regularly, say every month, to identify which customers are unlikely to renew. What the model outputs is a probability of customers renewing.
This predicted probability is very useful information for decision makers. For instance, the marketing department may look at the customers with a low predicted probability of renewal and target them with incentive actions. Projects can stop at this stage, in which case they are Machine Learning projects.

When projects stop after stage two, the eventual action (e.g. targeting some customers with incentives) is left to human decision makers. While this makes sense in many cases, in others one wants to automate decisions. This is what the third stage is about. There are mainly two ways to automate the action piece. The first is to implement some business logic via a rule-based system. For instance, a rule could be: if the predicted probability of renewal is less than 0.1 and the customer is a preferred customer, then send a coupon to the customer. This is the simplest way to transform machine learning output into actions. It is applicable when actions can be taken one at a time, just by looking at some context and a machine learning prediction. The second way applies when a number of actions have to be defined together. For instance, if we have a limited budget for coupons, we should not use a fixed probability threshold to select which customers receive a coupon; we should rather focus on those most likely to respond to it. A very common use case is when machine learning is used to forecast future demand (for instance, future sales for each product and each store). Then we can use mathematical optimization to find the best replenishment plan for each store. I discussed this machine learning + optimization pattern in Optimization Is Ready For Big Data: Part 1, Volume. Recent progress in machine learning makes it truly actionable now.

The last stage of the process is to monitor the consequences of the actions on the world we are interested in.
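Before moving to that last stage, note that the rule-based option above is simple enough to sketch in a few lines (the 0.1 threshold comes from the example; the field names are illustrative):

```python
def coupon_action(customer):
    # Toy business rule: a low predicted renewal probability for a
    # preferred customer triggers a coupon; otherwise do nothing.
    # "p_renewal" and "preferred" are hypothetical field names.
    if customer["p_renewal"] < 0.1 and customer["preferred"]:
        return "send_coupon"
    return "no_action"

print(coupon_action({"p_renewal": 0.05, "preferred": True}))   # send_coupon
print(coupon_action({"p_renewal": 0.40, "preferred": True}))   # no_action
```

A rule engine generalizes this pattern, but the essence is the same: machine learning supplies the probability, and the rule turns it into an action.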
By monitoring their effects we can determine whether the machine learning predictions were accurate. That information can be analyzed by data scientists, which can lead to better models. In effect, we are executing the process as a continuous loop, where each iteration builds upon data produced by the previous one. This continuous loop is at the heart of learning machines. There is much more content in Eric's keynote, and I encourage readers to have a look at his deck.

Let me conclude with a remark on how the above relates to analytics. Stage 1 corresponds to Descriptive Analytics, stage 2 to Predictive Analytics, and stage 3 to Prescriptive Analytics. Stage 4 can be called learning from feedback, and has no real equivalent in traditional analytics. Interested readers can read more about the various analytics stages in the Analytics Landscape.

Continue Reading…

### IEEE ICDM 2017 Call For Award Nominations, due Aug 15

Nominations sought for outstanding research and service contributions in the field of data mining and data science.

Continue Reading…

### Digging Deeper with your Channel Analysis

Direct or Search? Organic Search or Paid Search? Social Media or Email? All of the above or All of the above? Understanding which digital channels are most effective at acquiring valuable customers is difficult to do on a regular basis. However, channel analysis is necessary so your business can answer questions like, “Where should we focus our new advertising campaigns?” or “How can we quantify how the last campaign performed?” Let’s take a deeper dive into how we can use data to drive channel analysis.

In a previous post, we discussed three simple metrics for an omni-channel business: percent of customer acquisitions, lifetime revenue, and repeat order probability. Knowing these generalized metrics is crucial, but what else can we uncover?
For this post, we will discuss some additional digital channel metrics you should be considering when analyzing customer behavior.

Sessions

Identifying the channels that drive the most traffic can help you determine where to focus your next marketing efforts. At the same time, you should be asking yourself, “Is any of this information surprising?” Did you expect certain channels to perform better or worse relative to the others? In the example below, you can see how much traffic came in through each channel, with the majority of sessions coming from the direct channel. You can get a good sense of how your customer base is growing by comparing this to changes from the last period or year.

| Channel | Sessions | Change from last month | Change from last year |
|---|---|---|---|
| Direct | 15,318 | 205.49% | 433.12% |
| Organic Search | 3,988 | 30.22% | 1.09% |
| Paid Search | 2,104 | 4.06% | 130.03% |
| Email | 587 | -40.95% | -10.11% |
| Social | 120 | -20.44% | 52.72% |

Conversion rates

We can’t get the whole story from sessions alone. After all, high traffic can only go so far if the conversion rate is low. Using the same example, we can see that although the sessions for email are low, it has by far the highest conversion rate, which has gone up 46.23% since last year. Does this mean you should focus all of your efforts on improving your email campaigns? Probably not, but it should get you thinking about how to slowly increase the traffic coming in through email. Direct, on the other hand, drives the most traffic but has one of the lowest conversion rates.

| Channel | Conversion Rate | Change from last month | Change from last year |
|---|---|---|---|
| Direct | 0.29% | -90.11% | -83.30% |
| Organic Search | 1.87% | 7.14% | 14.35% |
| Paid Search | 1.33% | 1.90% | -3.55% |
| Email | 6.28% | 2.71% | 46.23% |
| Social | 0.21% | -90.41% | -93.63% |

Revenue by channel

Now that we’ve looked at sessions and conversion rates, we’ll finish with revenue.
Some patterns we’ve seen so far in this example:

• Direct brings in high traffic but low conversion
• Email brings in low traffic but high conversion
• Organic and paid search perform somewhere in between

Bringing in revenue will now help us paint a better picture of what the data is really telling us. Direct traffic has grown the most over the last year but still does not bring in the most revenue. Even though email had the highest conversion rate, paid search still generates more revenue due to its higher traffic.

| Channel | Revenue | Change from last month | Change from last year |
|---|---|---|---|
| Direct | $7,497.44 | 205.49% | 433.12% |
| Organic Search | $9,240.32 | 30.22% | 1.09% |
| Paid Search | $4,663.00 | 4.06% | 130.03% |
| Email | $2,699.20 | -40.95% | -10.11% |
| Social | $460.83 | -20.44% | 52.72% |
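Metrics like these are easy to recompute from raw per-channel tallies. A minimal sketch in Python (the order counts below are hypothetical, chosen only to land near the conversion rates shown above):

```python
# Hypothetical raw tallies per channel; only the session counts match the
# tables above, and the order counts are invented for illustration.
channels = {
    "Direct":         {"sessions": 15318, "orders": 44, "revenue": 7497.44},
    "Organic Search": {"sessions": 3988,  "orders": 75, "revenue": 9240.32},
    "Email":          {"sessions": 587,   "orders": 37, "revenue": 2699.20},
}

for name, c in channels.items():
    conversion = 100 * c["orders"] / c["sessions"]       # conversion rate, %
    rev_per_session = c["revenue"] / c["sessions"]       # revenue per visit
    print(f"{name}: {conversion:.2f}% conversion, ${rev_per_session:.2f}/session")
```

Revenue per session is a useful companion metric here, since it blends traffic volume and conversion quality into a single number per channel.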

In Summary

If you’re a business with an online presence, performing channel analysis is an important step toward increasing your customer acquisition and retention rates. Your data may tell a different story, and how you decide to interpret and act on it will depend on your goals as a business.

These are just a few examples of how you can analyze the performance of your various channels. If you’re interested in exploring this further, contact us today!

The post Digging Deeper with your Channel Analysis appeared first on The Data Point.

### Emotional Intelligence for Data Science Teams

Here are three lessons for making and demonstrating a greater business impact in your organization, according to Domino Labs’ most successful customers.

### How big data and AI will reshape the automotive industry

The O’Reilly Data Show Podcast: Evangelos Simoudis on next-generation mobility services.

In this episode of the Data Show, I spoke with Evangelos Simoudis, co-founder of Synapse Partners and a frequent contributor to O’Reilly. He recently published a book entitled The Big Data Opportunity in Our Driverless Future, and I wanted to get his thoughts on the transportation industry and the role of big data and analytics in its future. Simoudis is an entrepreneur, and he also advises and invests in many technology startups. He became interested in the automotive industry long before the current wave of autonomous vehicle startups was in the planning stages.

Continue reading How big data and AI will reshape the automotive industry.

### Data Analysis for Life Sciences

Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on EdX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The EdX courses are linked below, which you can take for free, or simply follow the series of YouTube videos and materials provided in the index.

A companion book and associated R Markdown documents are also available for download.

Genomics Data Analysis Series

For links to all of the course components, including videos and supplementary materials, follow the link below.


### Design by Evolution: How to evolve your neural network with AutoML

The gist (tl;dr): Time to evolve! I’m gonna give a basic example (in PyTorch) of using evolutionary algorithms to tune the hyper-parameters of a DNN.

### How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?

Someone who wishes to remain anonymous writes in with a story:

Linking to a new paper by Jorge Luis García, James J. Heckman, and Anna L. Ziff, the economist Sue Dynarski makes this “joke” on Facebook—or maybe it’s not a joke:

How does one adjust standard errors to account for the fact that N of papers on an experiment > N of participants in the experiment?

Clicking through, the paper uses data from the “Abecedarian” (ABC) childhood intervention program of the 1970s. Well, the related ABC & “CARE” experiments, pooled together. From Table 3 on page 7, the ABC experiment has 58 treatment and 56 control students, while CARE has 17 treatment and 23 control. If you type “abecedarian” into Google Scholar, sure enough, you get 9,160 results! OK, but maybe some of those just have citations or references to other papers on that project… If you restrict the search to papers with “abecedarian” in the title, you still get 180 papers. If you search for the word “abecedarian” on Google Scholar (not necessarily in the title) and restrict to papers by Jim Heckman, you get 86 results.

That’s not why I thought to email you though.

Go to pages 7-8 of this new paper where they explain why they merged the ABC and CARE studies:

CARE included an additional arm of treatment. Besides the services just described, those in the treatment group also received home visiting from birth to age 5. Home visiting consisted of biweekly visits focusing on parental problem-solving skills. There was, in addition, an experimental group that received only the home visiting component, but not center-based care.[fn 17] In light of previous analyses, we drop this last group from our analysis. The home visiting component had very weak estimated effects.[fn 18] These analyses justify merging the treatment groups of ABC and CARE, even though that of CARE received the additional home-visiting component.[fn 19] We henceforth analyze the samples so generated as coming from a single ABC/CARE program.

OK, they merged some interventions (garden of forking paths?) because they wanted more data. But, how do they know that home visits had weak effects? Let’s check their explanation in footnote 18:

18: Campbell et al. (2014) test and do not reject the hypothesis of no treatment effects for this additional component of CARE.

Yep. Jim Heckman and coauthors conclude that the effects are “very weak” because they ran some tests and couldn’t reject the null. If you go deep into the supplementary material of the cited paper, to tables S15(a) and S15(b), sure enough you find that these “did not reject the null” conclusions are drawn from interventions with 12-13 control and 11-14 treatment students (S15(a)) or 15-16 control and 18-20 treatment students (S15(b)). Those are pretty small sample sizes…

This jumped out at me and I thought you might be interested too.

My reply: This whole thing is unfortunate but it is consistent with the other writings of Heckman and his colleagues in this area: huge selection bias and zero acknowledgement of the problem. It makes me sad because Heckman’s fame came from models of selection bias, but he doesn’t see it when it’s right in front of his face. See here, for example.

The topic is difficult to write about for a few reasons.

First, Heckman is a renowned scholar and he is evidently careful about what he writes. We’re not talking about Brian Wansink or Satoshi Kanazawa here. Heckman works on important topics, his studies are not done on the cheap, and he’s eminently reasonable in his economic discussions. He’s just making a statistical error, over and over again. It’s a subtle error, though, that has taken us (the statistics profession) something like a decade to fully process. Making this mistake doesn’t make Heckman a bad guy, and that’s part of the problem: When you tell a quantitative researcher that they made a statistical error, you often get a defensive reaction, as if you accused them of being stupid, or cheating. But lots of smart, honest people have made this mistake. That’s one of the reasons we have formal statistical methods in the first place: people get lots of things wrong when relying on instinct. Probability and statistics are important, but they’re not quite natural to our ways of thinking.

Second, who wants to be the grinch who’s skeptical about early childhood intervention? Now, just to be clear, there’s lots of room to be skeptical about Heckman’s claims and still think that early childhood intervention is a good idea. For example, this paper by Garcia, Heckman, Leaf, and Prados reports a benefit/cost ratio of 7.3. So they could be overestimating their effect by a factor of 7 and still have a favorable ratio. The point is, if for whatever reason you support universal day care or whatever, you have a motivation not to worry too much about the details of a study that supports your position.

Again, I’m not saying that Heckman and his colleagues are doing this. I can only assume they’re reporting what, to them, are their best estimates. Unfortunately these methods are biased. But a lot of people with classical statistics and econometrics training don’t realize this: they think regression coefficients are unbiased estimates, but nobody ever told them that the biases can be huge when there is selection for statistical significance.
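How big can that bias get? A toy simulation (mine, not from the post) makes the point: when a small true effect is estimated noisily and only the "significant" estimates survive, the survivors overstate the effect several-fold.

```python
import random
import statistics

random.seed(1)

true_effect = 0.1   # small true effect
se = 0.2            # standard error of each noisy estimate

# Many hypothetical studies, each producing one noisy estimate.
estimates = [random.gauss(true_effect, se) for _ in range(10_000)]

# Keep only estimates that clear |z| > 1.96, i.e. "p < 0.05".
significant = [e for e in estimates if abs(e) > 1.96 * se]

print(f"mean of all estimates:       {statistics.mean(estimates):.2f}")
print(f"mean |significant estimate|: {statistics.mean(map(abs, significant)):.2f}")
```

The unconditional mean recovers the true effect (about 0.1), while the average magnitude among significant results is several times larger: exactly the type M ("magnitude") error being described.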

And, remember, selection for statistical significance is not just about the “file drawer” and it’s not just about “p-hacking.” It’s about researcher degrees of freedom and forking paths that researchers themselves don’t always realize until they try to replicate their own studies. I don’t think Heckman and his colleagues have dozens of unpublished papers hiding in their file drawers, and I don’t think they’re running their data through dozens of specifications until they find statistical significance. So it’s not the file drawer and it’s not p-hacking as is often understood. But these researchers do have nearly unlimited degrees of freedom in their data coding and analysis, they do interpret “non-significant” differences as null and “significant” differences at face value, they have forking paths all over the place, and their estimates of magnitudes of effects are biased in the positive direction. It’s kinda funny but also kinda sad, that there’s so much concern for rigor in the design of these studies and in the statistical estimators used in the analysis, but lots of messiness in between, lots of motivation on the part of the researchers to find success after success after success, and lots of motivation for scholarly journals and the news media to publicize the results uncritically. These motivations are not universal—there’s clearly a role in the ecosystem for critics within academia, the news media, and in the policy community—but I think there are enough incentives for success within Heckman’s world to keep him and his colleagues from seeing what’s going wrong.

Again, it’s not easy—it took the field of social psychology about a decade to get a handle on the problem, and some are still struggling. So I’m not slamming Heckman and his colleagues. I think they can and will do better. It’s just interesting, when considering the mistakes that accomplished people make, to ask, How did this happen?

P.S. This is an important topic. It’s not ovulation-and-voting or air rage or himmicanes or anything silly like that: We’re talking about education policy that could affect millions of kids! And I’m not saying I have all the answers, or anything close to that. No, it’s the opposite: data are relevant to these questions, and I’m not close to the data. What’s needed is an integration of theory with observational and experimental data, and it’s great that academic economists such as Heckman have put so much time into studying these problems. I see my role as statistician as a helper. For better or worse, statistics is a big part of the story, and when people are making statistical errors, we should be correcting them. But these corrections are not the end of the story; they’re just necessary adjustments to keep research on the right track.

### Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

1. Deduplication: eliminating duplicate (exact) copies of repeated data.
2. Record linkage: identifying records that reference the same entity across different sources.
3. Canonicalization: converting data with more than one possible representation into a standard form.
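On toy data (invented here), the first and third tasks take only a few lines; record linkage is harder because it requires fuzzy matching across sources, which is where a tool like Dedupe comes in:

```python
# Toy name records, invented for illustration.
records = ["Jane Doe", "Jane Doe", "J. Doe", "jane doe"]

# 1. Deduplication: drop exact copies, keeping first occurrences.
deduped = list(dict.fromkeys(records))
print(deduped)    # ['Jane Doe', 'J. Doe', 'jane doe']

# 3. Canonicalization: map the surviving variants to one standard form
# (lowercase, strip punctuation) — a deliberately naive normalization rule.
canonical = sorted({r.lower().replace(".", "").strip() for r in deduped})
print(canonical)  # ['j doe', 'jane doe']
```

Note that naive normalization still leaves 'j doe' and 'jane doe' as separate entities; deciding that they refer to the same person is precisely the fuzzy-matching problem the rest of this post addresses.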

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

### How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

### Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example.py from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe


Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py


### Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
    {'field' : 'Name', 'type': 'String'},
    {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Address', 'type': 'String', 'has missing' : True},
    {'field' : 'Purchases', 'type': 'String'},
]


Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking, affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

• the first three digits of the phone number
• the full name
• the first five characters of the name
• a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, a variation on string edit distance that makes subsequent consecutive deletions or insertions cheaper.

Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).
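That area-code predicate can be sketched directly; the phone numbers below are invented, but the record IDs and resulting blocks mirror the example:

```python
from collections import defaultdict

# Invented phone numbers; the IDs echo the example's 202, 334, and NULL blocks.
records = {
    452: "2025550123",
    821: "2025559876",
    233: "3345550111",
    699: "3345550199",
    720: "",
}

blocks = defaultdict(list)
for rid, phone in records.items():
    predicate = phone[:3] or None   # first three digits; None when missing
    blocks[predicate].append(rid)

print(dict(blocks))   # {'202': [452, 821], '334': [233, 699], None: [720]}
```

Only records sharing a block key are ever compared against each other, which is what keeps the number of pairwise comparisons tractable on large datasets.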

The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

## Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and longer if your dataset is large. You are presented with four options:

• (y)es: confirms that the two references are to the same entity
• (n)o: labels the two references as not the same entity
• (u)nsure: does not label the two references as the same entity or as different entities
• (f)inished: ends the active learning session and triggers the supervised learning phase

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

As you can see in the example above, some comparison decisions are very easy. The first pair has zero hits on all four attributes being examined, so the verdict is most certainly a non-match. The second is a 3/4 exact match, with the fourth attribute being fuzzy in that one entity contains a piece of the other: Ryerson vs. Chicago Public Schools Ryerson. A human can discern these as two references to the same entity, and we can label them as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Phone', 'type': 'String', 'has missing' : True},
]


A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

## Something a Bit More Challenging

In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the WhiteHouse.gov website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

| Database Field | Field Description |
|---|---|
| NAMELAST | Last name of entity |
| NAMEFIRST | First name of entity |
| NAMEMID | Middle name of entity |
| UIN | Unique Identification Number |
| Type of Access | Access type to White House |
| TOA | Time of arrival |
| POA | Post on arrival |
| TOD | Time of departure |
| POD | Post on departure |
| APPT_START_DATE | When the appointment is scheduled to start |
| APPT_END_DATE | When the appointment is scheduled to end |
| APPT_CANCEL_DATE | When the appointment was canceled |
| Total_People | Total number of people scheduled to attend |
| LAST_UPDATEDBY | Who last updated this event |
| POST | Classified as 'WIN' |
| LastEntryDate | When the last update to this instance occurred |
| TERMINAL_SUFFIX | ID for terminal used to process visitor |
| visitee_namelast | The visitee's last name |
| visitee_namefirst | The visitee's first name |
| MEETING_LOC | The location of the meeting |
| MEETING_ROOM | The room number of the meeting |
| CALLER_NAME_LAST | The authorizing person's last name |
| CALLER_NAME_FIRST | The authorizing person's first name |
| CALLER_ROOM | The authorizing person's room |
| Description | Description of the event or visit |
| RELEASE_DATE | The date this set of logs was released to the public |

Using the API, the White House Visitor Log Requests can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url, fname):
    """Download the file at url and save it to fname."""
    response = requests.get(url)
    with open(fname, 'wb') as f:  # response.content is bytes
        f.write(response.content)

ORIGFILE = "fixtures/whitehouse-visitors.csv"

getData(DATAURL, ORIGFILE)  # DATAURL is the CSV export URL from the API


Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

## Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None

DATABASE = 'your_db_name'
USER = 'your_user_name'
HOST = 'your_hostname'

try:
    conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST)
    print("I've connected")
except psycopg2.Error:
    print("I am unable to connect to the database")

cur = conn.cursor()


The other challenge with our dataset is the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type.
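To illustrate why the normalization helps, here is a stdlib-only sketch with made-up date strings (the post itself uses dateutil's parser, which infers the format automatically): two differently formatted renderings of the same day collapse to one ordinal, which is easy to compare even when stored as a varchar.

```python
from datetime import datetime

# Two renderings of the same appointment day, as might appear in a messy CSV
# (formats here are invented for illustration)
raw_a = "2009-03-15 10:30"
raw_b = "3/15/2009 10:30"

dt_a = datetime.strptime(raw_a, "%Y-%m-%d %H:%M")
dt_b = datetime.strptime(raw_b, "%m/%d/%Y %H:%M")

# Both collapse to the same day number, so direct comparison now works
assert dt_a.toordinal() == dt_b.toordinal()
```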

This function takes the csv data as input, parses the datetime fields we're interested in, and outputs a database table that retains the desired columns ('lastname', 'firstname', 'uin', 'apptmade', 'apptstart', 'apptend', 'meeting_loc'). Keep in mind this will take a while to run.

DATEFIELDS = [10, 11, 12]  # column indices of apptmade, apptstart, apptend in the raw CSV

def dateParseSQL(nfile):
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                    (visitor_id SERIAL PRIMARY KEY,
                     lastname    varchar,
                     firstname   varchar,
                     uin         varchar,
                     apptmade    varchar,
                     apptstart   varchar,
                     apptend     varchar,
                     meeting_loc varchar);''')
    conn.commit()
    with open(nfile, 'rU') as infile:
        reader = csv.reader(infile)
        next(reader)  # skip the header row
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except ValueError:
                        continue
            sql = """INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc)
                     VALUES (%s,%s,%s,%s,%s,%s,%s)"""
            cur.execute(sql, (row[0], row[1], row[3], row[10], row[11], row[12], row[21]))
            conn.commit()
    print("All done!")

dateParseSQL(ORIGFILE)


About 60 of our rows had non-ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';
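The same `[^[:ascii:]]` test can be expressed in Python, which is handy for spot-checking rows before loading them (a sketch, not from the original post):

```python
import re

# Matches any character outside the 7-bit ASCII range,
# mirroring PostgreSQL's [^[:ascii:]] character class
non_ascii = re.compile(r'[^\x00-\x7f]')

assert non_ascii.search("Müller") is not None  # row would be deleted
assert non_ascii.search("Miller") is None      # row is kept
```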


For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor


Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS = [
    {'field': 'firstname', 'variable name': 'firstname',
     'type': 'String', 'has missing': True},
    {'field': 'lastname', 'variable name': 'lastname',
     'type': 'String', 'has missing': True},
    {'field': 'uin', 'variable name': 'uin',
     'type': 'String', 'has missing': True},
    {'field': 'meeting_loc', 'variable name': 'meeting_loc',
     'type': 'String', 'has missing': True},
]


We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
    lset = set  # local alias for set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print('{} blocks'.format(i))

        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])

        records.append((row[KEY_FIELD], row, smaller_ids))

    if records:
        yield records
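The generator above is essentially a streaming group-by on `block_id`. A self-contained sketch of the same idea, using `itertools.groupby` over toy DictCursor-style rows (values invented for illustration):

```python
from itertools import groupby

# Rows arrive ordered by block_id, as in the SQL query that feeds candidates_gen
rows = [
    {'visitor_id': 1, 'block_id': 7, 'smaller_ids': ''},
    {'visitor_id': 2, 'block_id': 7, 'smaller_ids': '1'},
    {'visitor_id': 3, 'block_id': 9, 'smaller_ids': ''},
]

# Group consecutive rows sharing a block_id into (key, row, smaller_ids) records
blocks = [
    [(r['visitor_id'], r, set(filter(None, r['smaller_ids'].split(','))))
     for r in group]
    for _, group in groupby(rows, key=lambda r: r['block_id'])
]

assert len(blocks) == 2             # two predicate blocks
assert [rec[0] for rec in blocks[0]] == [1, 2]
```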


And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)

    with psycopg2.connect(database=args.dbname,
                          host='localhost',
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)
                del temp_d

            if os.path.exists(args.training):
                print('Loading training file from {}'.format(args.training))
                with open(args.training) as tf:
                    deduper.readTraining(tf)

            print('Starting active learning')
            dedupe.convenience.consoleLabel(deduper)

            print('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)

            deduper.cleanupTraining()

            print('Creating blocking_map table')
            c.execute("DROP TABLE IF EXISTS blocking_map")
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("SELECT DISTINCT %s FROM %s" % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)
                c_index.close()

            print('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("SELECT * FROM %s" % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

            print('Inserting blocks into blocking_map')
            csv_file = tempfile.NamedTemporaryFile(mode='w', prefix='blocks_', delete=False)
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()

            with open(csv_file.name, 'r') as f:
                c.copy_expert("COPY blocking_map FROM STDIN CSV", f)

            os.remove(csv_file.name)
            con.commit()

            print('Indexing blocks')
            c.execute("CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)")
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                 block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print('Indexing block_key')
            c.execute("CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)")

            print('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                    FROM blocking_map INNER JOIN plural_key
                    USING (block_key))
                """ % KEY_FIELD)

            c.execute("""
                CREATE INDEX plural_block_%s_idx
                ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                (SELECT %s,
                        string_agg(CAST(block_id AS TEXT), ','
                                   ORDER BY block_id) AS sorted_ids
                 FROM plural_block
                 GROUP BY %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print('Committing')
            con.commit()

            print('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                                                 CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                 FROM plural_block
                 INNER JOIN covered_blocks
                 USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))
            con.commit()

            print('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                candidates_gen(c_cluster), threshold=0.5)

            print('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
            c.execute("""
                CREATE TABLE entity_map (
                %s INTEGER,
                canon_id INTEGER,
                cluster_score FLOAT,
                PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                    c.execute("""
                        INSERT INTO entity_map
                        (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            c_cluster.close()
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
            con.commit()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-d', '--dbname', required=True, help='name of the database to deduplicate')
    parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json', help='name of training file')
    args = parser.parse_args()
    find_dupes(args)


## Active Learning Observations

We ran multiple experiments:

• Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
• Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic since those are much more commonly occurring in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed a lot of variation in the way middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed many pairs that looked like nickname variants but could have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


## Results

What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, have the same meeting location, and have the same uin (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier — people who seem to spend a lot of time at the White House, like staffers and other political appointees.

During the dedupe process, we computed that there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.

### Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671


### Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085

lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331
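The records above suggest a simple heuristic, sketched here with invented toy rows: within-visit duplicates share the full (name, uin) key and can be caught by exact grouping, while frequent fliers share only the name across visits and therefore need fuzzy matching like dedupe's.

```python
from collections import Counter

# Toy rows mimicking the two duplicate patterns described above (invented values)
visits = [
    ("RYAN", "PATRICK", "U62671"),        # duplicate request forms: same uin
    ("RYAN", "PATRICK", "U62671"),
    ("TANGHERLINI", "DANIEL", "U02692"),  # frequent flier: same person, new uin per visit
    ("TANGHERLINI", "DANIEL", "U73085"),
]

# Within-visit duplicates share the full (lastname, firstname, uin) key...
within = [k for k, n in Counter(visits).items() if n > 1]
# ...frequent fliers only share the name, so exact grouping can't resolve them
by_name = Counter((last, first) for last, first, _ in visits)

assert within == [("RYAN", "PATRICK", "U62671")]
assert by_name[("TANGHERLINI", "DANIEL")] == 2
```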


## Conclusion

In this beginner's guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion about a number of other important topics to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### Populating a GRAKN.AI Knowledge Graph with the World

This updated article describes how to move SQL data into a GRAKN.AI knowledge graph.

### Adopting AI in the Enterprise: Ford Motor Company

Dimitar Filev on bringing cutting-edge computational intelligence to cars and the factories that build them.

Driverless cars aren’t the only application for deep learning on the road: neural networks have begun to make their way into every corner of the automotive industry, from supply-chain management to engine controllers.

In this installment of our ongoing series on artificial intelligence (AI) and machine learning (ML) in the enterprise, we speak with Dimitar Filev, executive technical leader at Ford Research & Advanced Engineering, who leads the team focused on control methods and computational intelligence.

## What was the first application of AI and ML at Ford?

Ford research lab has been conducting systematic research on computational intelligence—one of the branches of AI—for more than 20 years. About 15 years ago, Ford Motor Company introduced one of the first large-scale industrial applications of neural networks. Ford researchers developed and implemented, in mass-produced cars, an innovative misfire detection system—a neural-net-based classifier of crankshaft acceleration patterns for diagnosing engine misfire (undesirable combustion failure that has a negative impact on performance and emissions). Multiple other AI applications to Ford product and manufacturing followed this success.

## How do you leverage AI and ML today to create a better product?

We can think of two categories of ML and AI applications in our vehicles. In addition to the obvious applications in driverless cars, Ford has also developed AI-based technologies that enable different functions in support of vehicle engineering. These are not always visible to the driver.

As I mentioned before, we used recurrent-neural-net-based classifiers for misfire detection in V10 engines; we also use them for intruder detection when the driver is away from vehicle. We also use fuzzy logic-type rule-based gain scheduling controllers integrated with the battery control systems of hybrid-electric vehicles.

In our supply chain, neural networks are the main drivers behind the inventory management system recommending specific vehicle configurations to dealers, and evolutionary computing algorithms (in conjunction with dynamic semantic network-based expert systems) are deployed in support of resource management in assembly plants.

## Are there other use cases within Ford today?

Another group of AI applications is driven by the fact that current vehicles have evolved into complex mobile cyber systems with increasing computational power and resources generating gigabytes of information per hour, continuously connected, with information flowing to, from, and through the platform. Increased capability of vehicle systems, along with the growing customer demand for new features, product improvement, personalization, rich information utilization, etc., are some of the drivers for introducing machine learning techniques in modern vehicles.

The most common AI applications involve direct driver interaction, including advisory systems that monitor acceleration and braking patterns to provide on-board evaluations of a driver’s preferences and intentions for different purposes—characterization of the driver, advice for fuel-efficient driving and safe driving, auto-selecting the optimal suspension and steering modes, simplifying the human-machine interface by estimating the most likely next destination, and preferred settings of the climate control, etc. These systems use traditional AI methods—rule-based, Markov models, clustering; they do not require special hardware. One of their distinctive features is to be intelligent enough to identify the level of acceptance of provided recommendations, and avoid drivers’ annoyance.

Recent extensive development of autonomous vehicles is the driver for deep learning applications to vehicle localization, object detection, classification, and tracking. We can expect in the near future a wide range of novel deep-learning-based features and user experiences in our cars and trucks, innovative mobility solutions, and intelligent automation systems in our manufacturing plants.

## What steps have you needed to take to build a team that can grasp and apply recent advances in AI and ML?

We have several centers of excellence in machine learning and AI, with focus on robotics, next-generation autonomous driving, and data analytics. Our goal is to expand the AI-based methodology and development tools throughout the company and to make them part of the commonly used engineering tools, similar to Matlab and Simulink.

Building centers of excellence in AI and ML was not too challenging since, as I mentioned earlier, we had engineers and researchers with backgrounds and experience in conventional neural networks, fuzzy logic, expert systems, Markov decision processes, evolutionary computing, and other main areas of computational intelligence. This created the foundation that we are now upgrading with a state-of-the-art expertise in deep learning methods and tools. We continue to expand this critical mass of experienced engineers by hiring more computer specialists with strong educational backgrounds in AI and ML.

## Where does Ford hope to gain a competitive advantage in applying AI and ML?

AI provides an opportunity to better use available information for creating new features and driver-aware, personalized vehicles that would better fit to the specific customer. In addition, machine learning is an irreplaceable enabler for creating smart driver-assist systems and fully autonomous vehicles. Increased connectivity is one of the major drivers expanding the capability of the on-board infotainment and control systems by incorporating cloud resources. In the near future, we can envision seamless integration of vehicle on-board systems with cloud-based intelligent agents, self-organizing algorithms, and other AI tools that would broaden the range of user experiences offered by our mobility solutions.

## Are there any areas where you've considered leveraging AI/ML but found that the technology isn't ready yet?

I don’t think so—just the opposite. It seems that the AI/ML toolbox is growing exponentially and ahead of mass applications. We are witnessing an interesting reality—while many research areas (e.g., control engineering, computer programming, cybernetics) were driven by the need for new technical solutions, the AI revolution that is happening now is inspired by the advances in machine learning research. Besides the stimulating effect of some remarkable successes (e.g., Google DeepMind), I would like to mention two important and unique enablers of this rapid development—first, the quick proliferation of research ideas and results that are made immediately available by posting recent publications on arxiv.org or other public websites; and second, the wide accessibility to open source AI software development tools—TensorFlow, Neon, Torch, Digits, Theano, just to mention a few. The challenge now is to mature the most effective and innovative AI solutions, and to integrate them within new features and customer experiences.

## Is Ford interested in partnering with other Silicon Valley companies and startups? Are there any initiatives you'd like to see the community focus on?

Ford Motor Company has partnerships with a number of high-tech companies and startups around the world, and, of course, in Silicon Valley. We are an active member of the innovative Silicon Valley community through our Research & Innovation Center in Palo Alto and are always interested in working with new companies and startups.

## What's the most promising or interesting advancement you've seen in AI recently, and how do you think it will impact Ford?

It is hard to outline the most interesting one, for the progress in AI is enormous. The number of publications, patents, and software products in the AI area is exponentially growing, and almost every day we are witnessing new accomplishments, novel approaches, and smart applications. I am specifically interested in the developments in reinforcement learning, intelligent agents, game theory, and Markov decision processes since they open the door to new advancements in the field of automated reasoning, decision-making and optimal control, and their automotive and mobility applications.

### Chris Stetson on system migrations and defining a microservices reference architecture

The O’Reilly Podcast: Helping developers improve performance, security, and service discoverability.

In this podcast episode, O’Reilly’s Jeff Bleiel talks with Chris Stetson, chief architect and head of engineering at NGINX. They discuss Stetson’s experiences working on microservices-based systems and how a microservices reference architecture can ease a development team’s pain when shifting from a monolithic application to many individualized microservices.

Like most developers, Stetson started off writing monolithic applications before moving over to a service-oriented architecture, where he broke apart different components of the application. “So many developers will approach building an application as a monolith because they don’t have to build out the infrastructure, orchestration tools, networking capabilities, and contracts between the different components,” he said. “However, many developers and teams today will approach their application as a monolith with the idea that there will be a clear separation of concerns for different parts that can easily be broken out.”

According to Stetson, the benefits for developers in adopting microservices are similar to those of the Agile movement, in that you may only have a couple weeks to work on a feature or piece of functionality. "Microservices encapsulate a set of functions and services that are constrained to a single set of concerns," he said. "As a result, developers need to optimize around those concerns, which will help them build a really powerful and complete system that is harder to accomplish with a large monolithic application."

Stetson’s passion for optimizing systems around microservices led him to spearhead the creation of NGINX’s Microservices Reference Architecture. “This reference architecture was our attempt to understand how we could help our customers build a microservice application while helping them improve aspects of their architecture related to performance, security, service discovery, and circuit breaker pattern functionality within the environment,” he said. “The reference architecture is an actual photo-sharing application, similar to Flickr or Shutterfly. We chose that application idea since it’s one everyone is familiar with, and it showcases powerful asymmetric computing requirements.”

This reference architecture includes three different networking models: the Proxy Model, the Router Mesh, and the Fabric Model. Stetson mentioned how the Proxy Model is similar in function to, and complements, Kubernetes’ Ingress Controller. (Kubernetes is an open source orchestration tool for managing the deployment and instances of containerized applications.) “Kubernetes has a very powerful framework for organizing microservices, allowing effective communication between services, and providing network segmentation,” he said. “It offers a lot of great services for systems to take advantage of in order to perform traffic management within a microservice application.”

This post and podcast is a collaboration between O'Reilly and NGINX. See our statement of editorial independence.

### Edward Callahan on reactive microservice deployments

The O’Reilly Podcast: Modify your existing pipeline to embrace failure in isolation.

In this podcast episode, I talk about reactive microservice deployments with Edward Callahan, a senior engineer with Lightbend. We discuss the difference between a normal deployment pipeline and one that’s fully reactive, as well as the impact reactive deployments have on software teams.

Callahan mentioned how a deployment platform must be developer and operator friendly in order to enable the highly productive, iterative development being sought by enterprises undergoing software-led transformations. However, it can be very easy for software teams to get frustrated with the operational tooling generally available. “Integration is often cumbersome on the development process,” he said. “Development and operations teams are demanding more from the operational machinery they depend on for the success of their applications and services.”

For enterprises already developing reactive applications, these development teams are starting to realize their applications should be deployed to an equally reactive deployment platform. “With the complexity of managing state in a distributed deployment being handled reactively, the deployment workflow becomes a simplified and reliable pipeline,” Callahan said. “This frees developers to address business needs instead of the many details of delivering clustered services.”

Callahan said the same core reactive principles that define an application's design—responsiveness, resiliency, elasticity, and message driven—can also be applied to a reactive deployment pipeline. Such a pipeline has the following characteristics:

• Developer and operations friendly—The deployment pipeline should support ease of testing, continuous delivery, cluster conveniences, and composability.
• Application-centric logging, telemetry, and monitoring—Meaningful, actionable data is far more valuable than petabytes of raw telemetry. How many messages are in a given queue and how long it is taking to service those events is far more indicative of the service response times that are tied to your service-level agreements.
• Application-centric process monitoring—A fundamental aspect of monitoring is that the supervisory system automatically restarts services if they terminate unexpectedly.
• Elastic and scalable—Scaling the number of instances of a service and scaling the resources of a cluster. Clusters need some amount of spare capacity or headroom.

According to Callahan, the main difference between a normal deployment pipeline and a reactive one is the ability for the system to embrace failure in isolation. “Failure cannot be avoided,” he said. “You must embrace failure and seek to keep your services available despite it, even if this requires operating in a degraded manner. Let it crash! Instead of attempting to repair nodes when they fail, you replace the failing resources with new ones.”
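The replace-don't-repair idea can be sketched in a few lines of Python (a toy illustration with invented names, not any particular platform's API): the supervisor restores capacity by swapping failed instances for fresh ones rather than fixing them in place.

```python
# Toy "let it crash" supervision: failed workers are replaced with
# fresh instances rather than repaired in place (all names invented)
class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.alive = True

def supervise(pool):
    """Return a pool of the same size, replacing any dead worker."""
    return [w if w.alive else Worker(w.wid) for w in pool]

pool = [Worker(0), Worker(1), Worker(2)]
pool[1].alive = False          # a service instance crashes
pool = supervise(pool)         # capacity restored without repair
assert len(pool) == 3 and all(w.alive for w in pool)
```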

When making the move to a reactive deployment pipeline, software teams need to remain flexible in the face of change. They also must stay mindful of any potential entrapments resulting from vendor lock-in around a new platform. “Standards really do help here,” Callahan said. “Watch for the standards as you move through the journey of building your own reactive deployment pipeline.”

This post is a collaboration between O'Reilly and Lightbend. See our statement of editorial independence.

Continue reading Edward Callahan on reactive microservice deployments.

### Four short links: 20 July 2017

SQL Equivalence, Streaming Royalties, Open Source Publishing, and Serial Entitlement

1. Introducing Cosette -- a SQL solver for automatically checking semantic equivalences of SQL queries. With Cosette, one can easily verify the correctness of SQL rewrite rules, find errors in buggy SQL rewrites, build auto-graders for SQL assignments, develop SQL optimizers, bust “fake SQLs,” etc. Open source, from the University of Washington.
2. Streaming Services Royalty Rates Compared (Information is Beautiful) -- the lesson is that it's more profitable to work for a streaming service than to be an artist hosted on it.
3. Editoria -- open source web-based, end-to-end, authoring, editing, and workflow tool that presses and library publishers can leverage to create modern, format-flexible, standards-compliant, book-length works. Funded by the Mellon Foundation, Editoria is a project of the University of California Press and the California Digital Library.
4. The Al Capone Theory of Sexual Harassment (Val Aurora) -- The U.S. government recognized a pattern in the Al Capone case: smuggling goods was a crime often paired with failing to pay taxes on the proceeds of the smuggling. We noticed a similar pattern in reports of sexual harassment and assault: often people who engage in sexually predatory behavior also faked expense reports, plagiarized writing, or stole credit for other people’s work.

### Preparing for the Plumber v0.4.0 Release

(This article was first published on Trestle Technology, LLC - R, and kindly contributed to R-bloggers)

Plumber is a package which allows you to create web APIs from your R code. If you’re new to Plumber, you can take a look at www.rplumber.io to learn more about how to use the package to create your APIs.

Over the years, we’ve noticed a handful of things that we wished we’d done differently in the development of Plumber. In particular, there were some decisions that prioritized convenience over security which we wanted to roll back. We’ve decided to bite the bullet and make these changes before more users start using Plumber. The v0.4.0 release of Plumber includes a handful of breaking changes that mitigate these issues all at once. Our hope in getting all of these out of the way at once is to ensure that users building on Plumber moving forward have confidence that any breaking change under consideration has already been made and that we shouldn’t have this kind of churn again anytime soon.

If you’re already using Plumber, we strongly encourage you to read the v0.4.0 migration guide below. As you’ll see, most of these changes affect the hosting or low-level interactions in Plumber. The dominant mode in which people use Plumber (adding comments to their existing R functions) has not changed; there are no breaking changes in how those files are interpreted.

v0.4.0 of Plumber will be deployed to CRAN in the coming days. We strongly encourage you to try out the v0.4.0 version of Plumber ahead of time to make sure that you’ve migrated your APIs correctly.

devtools::install_github("trestletech/plumber")


Alternatively, you can continue using the last release of Plumber which didn’t include any of these breaking changes (v0.3.3) until you’re ready to upgrade by using the following command:

devtools::install_github("trestletech/plumber", ref="v0.3.3")


You can see the full release notes for v0.4.0 here.

## Plumber v0.4.0 Migration Guide

There are a number of changes that users should consider when preparing to upgrade to Plumber v0.4.0.

1. Plumber no longer accepts external connections by default. The host parameter for the run() method now defaults to 127.0.0.1, meaning that Plumber will only listen for incoming requests from the local machine on which it’s running — not from any other machine on the network. This is done for security reasons so that you don’t accidentally expose a Plumber API that you’re developing to your entire network. To restore the old behavior in which Plumber listened for connections from any machine on the network, use $run(host="0.0.0.0"). Note that if you’re deploying to an environment that includes an HTTP proxy (such as the DigitalOcean servers which use nginx), having Plumber listen only on 127.0.0.1 is likely the right default, as your proxy — not Plumber — is the one receiving external connections.
2. Plumber no longer sets the Access-Control-Allow-Origin HTTP header to *. This was previously done for convenience, but given the security implications we’re reversing this decision. The previous behavior allowed web browsers to make requests of your API from other domains using JavaScript if the request used only standard HTTP headers and was a GET, HEAD, or POST request. These requests will no longer work by default. If you wish to allow an endpoint to be accessible from other origins in a web browser, you can use res$setHeader("Access-Control-Allow-Origin", "*") in an endpoint or filter.
3. Rather than setting the default port to 8000, the port is now randomly selected. This ensures that a shared server (like RStudio Server) will be able to support multiple people developing Plumber APIs concurrently without them having to manually identify an available port. This can be controlled by specifying the port parameter in the run() method or by setting the plumber.port option.
4. The object-oriented model for Plumber routers has changed. If you’re calling any of the following methods on your Plumber router, you will need to modify your code to use the newer alternatives: addFilter, addEndpoint, addGlobalProcessor, and addAssets. The code around these functions has undergone a major rewrite and some breaking changes have been introduced. These four functions are still supported with a deprecation warning in 0.4.0, but support is only best-effort. Certain parameters on these methods are no longer supported, so you should thoroughly test any Plumber API that leverages any of these methods before deploying version 0.4.0. Updated documentation for using Plumber programmatically is now available.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### Behind the Scenes of BigML’s Time Series Forecasting

BigML’s Time Series Forecasting model uses Exponential Smoothing under the hood. This blog post, the last of our series of six about Time Series, will explore the technical details of Exponential Smoothing models to help you gain insight into your forecasting results. Exponential Smoothing Explained To understand Exponential Smoothing, let’s first focus on the smoothing part […]
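As background to the post above, the simplest member of the Exponential Smoothing family keeps a running "level" that weights recent observations more heavily than older ones. A minimal Python sketch (simple exponential smoothing only; BigML's models also handle trend and seasonality components not shown here):

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    alpha in (0, 1]: higher values weight recent observations more heavily.
    """
    level = series[0]  # initialize the level with the first observation
    levels = [level]
    for x in series[1:]:
        # New level: blend the latest observation with the previous level.
        level = alpha * x + (1 - alpha) * level
        levels.append(level)
    return levels

# The one-step-ahead forecast is simply the last smoothed level.
data = [10.0, 12.0, 11.0, 13.0]
print(simple_exponential_smoothing(data, alpha=0.5)[-1])  # 12.0
```

Choosing alpha close to 1 makes the forecast track the most recent value; alpha close to 0 averages over a long history.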

### Gender representation in comic books

Amanda Shendruk for The Pudding analyzed how genders are represented differently in comic books, focusing on “naming conventions, types of superpowers, and the composition of teams to see how male and female genders are portrayed.” The charts are good, but I’m pretty sure the animated GIFs for a handful of female characters make the piece.


### Python Cheat Sheet for Data Science

The printable version of this cheat sheet

It’s common when first learning Python for Data Science to have trouble remembering all the syntax that you need. While at Dataquest we advocate getting used to consulting the Python documentation, sometimes it’s nice to have a handy reference, so we’ve put together this cheat sheet to help you out!

If you’re interested in learning Python, we have a free Python Programming: Beginner course which can start you on your data science journey.

## Key Basics, Printing and Getting Help

- `x = 3` assigns 3 to the variable `x`
- `print(x)` prints the value of `x`
- `type(x)` returns the type of the variable `x` (in this case, `int` for integer)
- `help(x)` shows documentation for the data type of `x`
- `help(print)` shows documentation for the `print()` function

 f...
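The basics above can be tried directly in a Python interpreter; a runnable version:

```python
x = 3            # assign 3 to the variable x
print(x)         # print the value of x
print(type(x))   # <class 'int'>: x holds an integer
help(print)      # show documentation for the built-in print() function
```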

### R Packages worth a look

Differential Expression Analysis Using a Bottom-Up Model (denoiSeq)
Given count data from two conditions, it determines which transcripts are differentially expressed across the two conditions using Bayesian inference of the parameters of a bottom-up model for PCR amplification. This model is developed in Ndifon Wilfred, Hilah Gal, Eric Shifrut, Rina Aharoni, Nissan Yissachar, Nir Waysbort, Shlomit Reich Zeliger, Ruth Arnon, and Nir Friedman (2012), <http://…/15865.full>, and results in a distribution for the counts that is a superposition of the binomial and negative binomial distributions.

Apply Functions to Multiple Multidimensional Arguments (multiApply)
The base apply function and its variants, as well as the related functions in the ‘plyr’ package, typically apply user-defined functions to a single argument (or a list of vectorized arguments in the case of mapply). The ‘multiApply’ package extends this paradigm to functions taking a list of multiple unidimensional or multidimensional arguments (or combinations thereof) as input, which can have different numbers of dimensions as well as different dimension lengths.

Basic Functions for Pre-Processing Microarrays (PreProcess)
Provides classes to pre-process microarray gene expression data as part of the OOMPA collection of packages described at <http://…/>.

Visualize Reproducibility and Replicability in a Comparison of Scientific Studies (scifigure)
Users may specify what fundamental qualities of a new study have or have not changed in an attempt to reproduce or replicate an original study. A comparison of the differences is visualized. Visualization approach follows Patil, Peng, and Leek (2016) <doi:10.1101/066803>.

Two Stage Forecasting (TSF) for Long Memory Time Series in Presence of Structural Break (TSF)
Forecasting of long memory time series in presence of structural break by using TSF algorithm by Papailias and Dias (2015) <doi:10.1016/j.ijforecast.2015.01.006>.

### Book Memo: “Aggregated Search”

 The goal of aggregated search is to provide integrated search across multiple heterogeneous search services in a unified interface—a single query box and a common presentation of results. In the web search domain, aggregated search systems are responsible for integrating results from specialized search services, or verticals, alongside the core web results. For example, search portals such as Google, Bing, and Yahoo! provide access to vertical search engines that focus on different types of media (images and video), different types of search tasks (search for local businesses and online products), and even applications that can help users complete certain tasks (language translation and math calculations). This monograph provides a comprehensive summary of previous research in aggregated search. It starts by describing why aggregated search requires unique solutions. It then discusses different sources of evidence that are likely to be available to an aggregated search system, as well as different techniques for integrating evidence in order to make vertical selection and presentation decisions. Next, it surveys different evaluation methodologies for aggregated search and discusses prior user studies that have aimed to better understand how users behave with aggregated search interfaces. It proceeds to review different advanced topics in aggregated search. It concludes by highlighting the main trends and discussing short-term and long-term areas for future work.

### What’s new on arXiv

Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance measure of the task considered, be it combinatorial and non-decomposable. We successfully apply Houdini to a range of applications such as speech recognition, pose estimation and semantic segmentation. In all cases, the attacks based on Houdini achieve higher success rate than those based on the traditional surrogates used to train the models while using a less perceptible adversarial perturbation.
We present an implementation of a probabilistic first-order logic called TensorLog, in which classes of logical queries are compiled into differentiable functions in a neural-network infrastructure such as Tensorflow or Theano. This leads to a close integration of probabilistic logical reasoning with deep-learning infrastructure: in particular, it enables high-performance deep learning frameworks to be used for tuning the parameters of a probabilistic logic. Experimental results show that TensorLog scales to problems involving hundreds of thousands of knowledge-base triples and tens of thousands of examples.
The cooperative hierarchical structure is a common and significant data structure observed in, or adopted by, many research areas, such as: text mining (author-paper-word) and multi-label classification (label-instance-feature). Renowned Bayesian approaches for cooperative hierarchical structure modeling are mostly based on topic models. However, these approaches suffer from a serious issue in that the number of hidden topics/factors needs to be fixed in advance and an inappropriate number may lead to overfitting or underfitting. One elegant way to resolve this issue is Bayesian nonparametric learning, but existing work in this area still cannot be applied to cooperative hierarchical structure modeling. In this paper, we propose a cooperative hierarchical Dirichlet process (CHDP) to fill this gap. Each node in a cooperative hierarchical structure is assigned a Dirichlet process to model its weights on the infinite hidden factors/topics. Together with measure inheritance from hierarchical Dirichlet process, two kinds of measure cooperation, i.e., superposition and maximization, are defined to capture the many-to-many relationships in the cooperative hierarchical structure. Furthermore, two constructive representations for CHDP, i.e., stick-breaking and international restaurant process, are designed to facilitate the model inference. Experiments on synthetic and real-world data with cooperative hierarchical structures demonstrate the properties and the ability of CHDP for cooperative hierarchical structure modeling and its potential for practical application scenarios.
Most neural machine translation (NMT) models are based on the sequential encoder-decoder framework, which makes no use of syntactic information. In this paper, we improve this model by explicitly incorporating source-side syntactic trees. More specifically, we propose (1) a bidirectional tree encoder which learns both sequential and tree structured representations; (2) a tree-coverage model that lets the attention depend on the source-side syntax. Experiments on Chinese-English translation demonstrate that our proposed models outperform the sequential attentional model as well as a stronger baseline with a bottom-up tree encoder and word coverage.
Pairwise ranking methods are the basis of many widely used discriminative training approaches for structure prediction problems in natural language processing (NLP). Decomposing the problem of ranking hypotheses into pairwise comparisons enables simple and efficient solutions. However, neglecting the global ordering of the hypothesis list may hinder learning. We propose a listwise learning framework for structure prediction problems such as machine translation. Our framework directly models the entire translation list’s ordering to learn parameters which may better fit the given listwise samples. Furthermore, we propose top-rank enhanced loss functions, which are more sensitive to ranking errors at higher positions. Experiments on a large-scale Chinese-English translation task show that both our listwise learning framework and top-rank enhanced listwise losses lead to significant improvements in translation quality.
Information extraction and user intention identification are central topics in modern query understanding and recommendation systems. In this paper, we propose DeepProbe, a generic information-directed interaction framework which is built around an attention-based sequence-to-sequence (seq2seq) recurrent neural network. DeepProbe can rephrase, evaluate, and even actively ask questions, leveraging the generative ability and likelihood estimation made possible by seq2seq models. DeepProbe makes decisions based on a derived uncertainty (entropy) measure conditioned on user inputs, possibly with multiple rounds of interaction. Three applications, namely a rewriter, a relevance scorer, and a chatbot for ad recommendation, were built around DeepProbe, with the first two serving as precursory building blocks for the third. We first use the seq2seq model in DeepProbe to rewrite a user query into a standard query form, which is submitted to an ordinary recommendation system. Secondly, we evaluate DeepProbe’s seq2seq model-based relevance scoring. Finally, we build a chatbot prototype capable of active user interactions, which can ask questions that maximize information gain, allowing for a more efficient user intention identification process. We evaluate the first two applications by 1) comparing with baselines by BLEU and AUC, and 2) human judge evaluation. Both demonstrate significant improvements over current state-of-the-art systems, proving their value as useful tools on their own, and at the same time laying a good foundation for the ongoing chatbot application.
Although neural networks can achieve state-of-the-art performance in recognizing images, they often suffer a tremendous defeat from adversarial examples: inputs generated by applying imperceptible but intentional perturbations to samples from the datasets. How to defend against adversarial examples is an important problem that is well worth researching. So far, only two well-known methods, adversarial training and defensive distillation, have provided a significant defense. In contrast to existing methods that are mainly based on the model itself, we address the problem purely based on the adversarial examples themselves. In this paper, a novel idea and the first framework based on Generative Adversarial Nets, named AE-GAN, capable of resisting adversarial examples is proposed. Extensive experiments on benchmark datasets indicate that AE-GAN is able to defend against adversarial examples effectively.
In this paper, we propose jointly learning attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on either model exist (e.g., for the task of image captioning), training such existing network architectures typically requires pre-defined label sequences. For multi-label classification, it would be desirable to have a robust inference process, so that prediction errors do not propagate and affect performance. Our proposed model uniquely integrates attention and Long Short Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interest with varying sizes without prior knowledge of a particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, prediction of multiple labels can be efficiently achieved by our proposed network model.
We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.
We introduce Latent Gaussian Process Regression which is a latent variable extension allowing modelling of non-stationary processes using stationary GP priors. The approach is built on extending the input space of a regression problem with a latent variable that is used to modulate the covariance function over the input space. We show how our approach can be used to model non-stationary processes but also how multi-modal or non-functional processes can be described where the input signal cannot fully disambiguate the output. We exemplify the approach on a set of synthetic data and provide results on real data from geostatistics.
Visual object tracking is a challenging computer vision task with numerous real-world applications. Here we propose a simple but efficient Spectral Filter Tracking (SFT) method. To characterize the rotational and translational invariance of tracking targets, the candidate image region is modeled as a pixelwise grid graph. Instead of conventional graph matching, we convert tracking into a plain least squares regression problem to estimate the best center coordinate of the target. But different from the holistic regression of correlation filter based methods, SFT can operate on localized surrounding regions of each pixel (i.e., vertex) by using spectral graph filters, which thus is more robust to local variations and cluttered backgrounds. To bypass the eigenvalue decomposition of the graph Laplacian matrix L, we parameterize spectral graph filters as polynomials of L by spectral graph theory, in which L^k exactly encodes a k-hop local neighborhood of each vertex. Finally, the filter parameters (i.e., polynomial coefficients) as well as the feature projecting functions are jointly integrated into the regression model.
We consider the task of one-shot learning of visual categories. In this paper we explore a Bayesian procedure for updating a pretrained convnet to classify a novel image category for which data is limited. We decompose this convnet into a fixed feature extractor and softmax classifier. We assume that the target weights for the new task come from the same distribution as the pretrained softmax weights, which we model as a multivariate Gaussian. By using this as a prior for the new weights, we demonstrate competitive performance with state-of-the-art methods whilst also being consistent with ‘normal’ methods for training deep networks on large data.
State-of-the-art object recognition Convolutional Neural Networks (CNNs) are shown to be fooled by image agnostic perturbations, called universal adversarial perturbations. It is also observed that these perturbations generalize across multiple networks trained on the same target data. However, these algorithms require training data on which the CNNs were trained and compute adversarial perturbations via complex optimization. The fooling performance of these approaches is directly proportional to the amount of available training data. This makes them unsuitable for practical attacks, since it is unreasonable for an attacker to have access to the training data. In this paper, for the first time, we propose a novel data independent approach to generate image agnostic perturbations for a range of CNNs trained for object recognition. We further show that these perturbations are transferable across multiple network architectures trained either on same or different data. In the absence of data, our method generates universal adversarial perturbations efficiently via fooling the features learned at multiple layers, thereby causing CNNs to misclassify. Experiments demonstrate impressive fooling rates and surprising transferability for the proposed universal perturbations generated without any training data.
Ongoing innovations in recurrent neural network architectures have provided a steady influx of apparently state-of-the-art results on language modelling benchmarks. However, these have been evaluated using differing code bases and limited computational resources, which represent uncontrolled sources of experimental variation. We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset.
Representing texts as fixed-length vectors is central to many language processing tasks. Most traditional methods build text representations based on the simple Bag-of-Words (BoW) representation, which loses the rich semantic relations between words. Recent advances in natural language processing have shown that semantically meaningful representations of words can be efficiently acquired by distributed models, making it possible to build text representations based on a better foundation called the Bag-of-Word-Embedding (BoWE) representation. However, existing text representation methods using BoWE often lack sound probabilistic foundations or cannot well capture the semantic relatedness encoded in word vectors. To address these problems, we introduce the Spherical Paragraph Model (SPM), a probabilistic generative model based on BoWE, for text representation. SPM has good probabilistic interpretability and can fully leverage the rich semantics of words, the word co-occurrence information as well as the corpus-wide information to help the representation learning of texts. Experimental results on topical classification and sentiment analysis demonstrate that SPM can achieve new state-of-the-art performances on several benchmark datasets.
Digital pathology has advanced substantially over the last decade; however, tumor localization continues to be a challenging problem due to highly complex patterns and textures in the underlying tissue bed. The use of convolutional neural networks (CNNs) to analyze such complex images has been well adopted in digital pathology. In recent years, however, the architecture of CNNs has changed with the introduction of inception modules, which have shown great promise for classification tasks. In this paper, we propose a modified ‘transition’ module which learns global average pooling layers from filters of varying sizes to encourage class-specific filters at multiple spatial resolutions. We demonstrate the performance of the transition module in AlexNet and ZFNet, for classifying breast tumors in two independent datasets of scanned histology sections, where the transition module proved superior.
The study of reaction times and their underlying cognitive processes is an important field in psychology. Reaction times are usually modeled with the ex-Gaussian distribution, because it provides a good fit to many empirical datasets. The complexity of this distribution makes computational tools an essential element of the field, so there is a strong need for efficient and versatile computational tools for research in this area. In this manuscript we discuss some mathematical details of the ex-Gaussian distribution and apply the ExGUtils package, a set of functions and numerical tools developed in Python for the numerical analysis of data involving the ex-Gaussian probability density. In order to validate the package, we present an extensive analysis of fits obtained with it, discuss advantages of and differences between the least squares and maximum likelihood methods, and quantitatively evaluate the goodness of the obtained fits (a point usually overlooked in most of the literature in the area). The analysis allows one to identify outliers in the empirical datasets and to determine rigorously whether data trimming is needed and at which points it should be done.
Generative Adversarial Networks (GANs) have been shown to be able to sample impressively realistic images. GAN training consists of a saddle point optimization problem that can be thought of as an adversarial game between a generator which produces the images, and a discriminator, which judges if the images are real. Both the generator and the discriminator are commonly parametrized as deep convolutional neural networks. The goal of this paper is to disentangle the contribution of the optimization procedure and the network parametrization to the success of GANs. To this end we introduce and study Generative Latent Optimization (GLO), a framework to train a generator without the need to learn a discriminator, thus avoiding challenging adversarial optimization problems. We show experimentally that GLO enjoys many of the desirable properties of GANs: learning from large data, synthesizing visually-appealing samples, interpolating meaningfully between samples, and performing linear arithmetic with noise vectors.

### Distilled News

Heatmapping is a simple and efficient way to analyze visitor interaction and user behavior on your website. If you are in a Conversion Rate Optimization (aka. CRO) project with your e-commerce or startup (or any other online) business, it’s indispensable to run some website heatmaps – such as click, mouse movement or scroll heatmaps.
In my last post I did some drawings based on L-Systems. These drawings are done sequentially. At any step, the state of the drawing can be described by the position (coordinates) and the orientation of the pencil. In that case I only used two kinds of operators: drawing a straight line and turning a constant angle.
Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial deep neural networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) and a new variant of DCNNs that is 1) multi-scale and 2) has convolution filter sizes that change depending on the eccentricity with respect to the center of fixation. Such networks, which we call eccentricity-dependent, are a computational model of the feedforward path of the primate visual cortex. Our results reveal that the eccentricity-dependent model, trained on target objects in isolation, can recognize such targets in the presence of flankers if the targets are near the center of the image, whereas DCNNs cannot. Also, for all tested networks trained on targets in isolation, we find that recognition accuracy decreases the closer the flankers are to the target and the more flankers there are. We find that visual similarity between the target and the flankers also plays a role, and that pooling in early layers of the network leads to more crowding. Additionally, we show that incorporating the flankers into the images of the training set does not improve performance with crowding.
Predictive maintenance is widely considered to be the obvious next step for any business with high-capital assets: harness machine learning to control rising equipment maintenance costs and pave the way for self maintenance through artificial intelligence (AI).
What makes BI tools great? What features are important while selecting a good BI tool? Let’s have a look.
If you use an API key to access a secure service, or need to use a password to access a protected database, you’ll need to provide these ‘secrets’ in your R code somewhere. That’s easy to do if you just include those keys as strings in your code — but it’s not very secure. This means your private keys and passwords are stored in plain-text on your hard drive, and if you email your script they’re available to anyone who can intercept that email. It’s also really easy to inadvertently include those keys in a public repo if you use Github or similar code-sharing services. To address this problem, Gábor Csárdi and Andrie de Vries created the secret package for R. The secret package integrates with OpenSSH, providing R functions that allow you to create a vault for keys on your local machine, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by you or by people you trust.
How can you take into account and compare information from different sources? Multiple Factor Analysis is a principal component method that deals with datasets containing quantitative and/or categorical variables structured by groups. Here is a course with videos that presents the method named Multiple Factor Analysis.
How do you analyse categorical data? Here is a course with videos that presents Multiple Correspondence Analysis in a French way. The best-known use of Multiple Correspondence Analysis is surveys. Four videos present a course on MCA, highlighting how to interpret the data. You will then find videos presenting how to implement MCA in FactoMineR, how to deal with missing values in MCA thanks to the missMDA package, and lastly a video on drawing interactive graphs with Factoshiny. Finally, you will see that the new FactoInvestigate package automatically produces an interpretation of your MCA results. With this course, you will be able to perform and interpret MCA results on your own.

### The Viral Recurring Decimal: Euler Problem 26

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Proposed solution to Euler Problem 26 in the R language. Find the value of d < 1000 for which 1/d contains the longest recurring decimal cycle. Continue reading
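The problem statement above is self-contained, so here is a quick illustrative sketch in Python (the post itself solves the problem in R): the recurring cycle of 1/d closes exactly when a remainder repeats during long division.

```python
def cycle_length(d):
    """Length of the recurring cycle in the decimal expansion of 1/d.

    Simulates long division: the cycle closes as soon as a remainder repeats.
    """
    seen = {}
    remainder, position = 1, 0
    while remainder != 0 and remainder not in seen:
        seen[remainder] = position
        remainder = (remainder * 10) % d
        position += 1
    return 0 if remainder == 0 else position - seen[remainder]

# The value of d < 1000 for which 1/d has the longest recurring cycle.
best_d = max(range(2, 1000), key=cycle_length)
print(best_d, cycle_length(best_d))  # → 983 982
```

Terminating decimals (where the remainder hits zero) get a cycle length of 0, so they never win.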

The post The Viral Recurring Decimal: Euler Problem 26 appeared first on The Devil is in the Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

## July 19, 2017

### Document worth reading: “Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes”

It is widely observed that deep learning models with learned parameters generalize well, even with many more model parameters than the number of training samples. We systematically investigate the underlying reasons why deep neural networks often generalize well, and reveal the difference between the minima (with the same training error) that generalize well and those that don’t. We show that it is the characteristics of the landscape of the loss function that explain the good generalization capability. For the landscape of the loss function of deep networks, the volume of the basin of attraction of good minima dominates over that of poor minima, which guarantees that optimization methods with random initialization converge to good minima. We theoretically justify our findings by analyzing 2-layer neural networks, and show that the low-complexity solutions have a small norm of the Hessian matrix with respect to the model parameters. For deeper networks, extensive numerical evidence helps to support our arguments. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

### Top KDnuggets tweets, Jul 12-18: 10 Free #MustRead Books for #MachineLearning and #DataScience; Why #AI and Machine Learning?

Also top 32 Reasons #DataScience Projects and Teams Fail; Text Classifier Algorithms in #MachineLearning; The 4 Types of #Data #Analytics: Descriptive, Diagnostic ...

### Machine learning best practices: detecting rare events

This is the second post in my series of machine learning best practices. If you missed it, read the first post, Machine learning best practices: the basics. As we go along, all ten tips will be archived at this machine learning best practices page.

Machine learning commonly requires the use of highly unbalanced data. When detecting fraud or isolating manufacturing defects, for example, the target event is extremely rare – often way below 1 percent.  So, even if you’re using a model that’s 99 percent accurate, it might not correctly classify these rare events.

### What can you do to find the needles in the haystack?

A lot of data scientists frown when they hear the word sampling. I like to use the term focused data selection, where you construct a biased training data set by oversampling or undersampling.  As a result, my training data may end up slightly more balanced, often with a 10 percent event level or more (See Figure 1). This higher ratio of events can help the machine learning algorithm learn to better isolate the event signal.

For reference, undersampling removes observations at random to downsize the majority class. Oversampling up-sizes the minority class at random to decrease the level of class disparity.

Figure 1: Develop biased samples through under and oversampling.  The plus sign represents duplicated examples.
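A bare-bones sketch of the undersampling idea in plain Python (the fraud counts and the helper function are invented for illustration, and not tied to any particular toolkit):

```python
import random

def undersample(events, non_events, target_ratio=0.10, seed=42):
    """Keep every rare event and randomly drop majority-class cases until
    events make up roughly `target_ratio` of the training sample."""
    rng = random.Random(seed)
    majority_keep = int(len(events) * (1 - target_ratio) / target_ratio)
    sampled = rng.sample(non_events, min(majority_keep, len(non_events)))
    return events + sampled

# 100 fraud cases hidden among 99,900 legitimate transactions (~0.1% events)
events = [("fraud", i) for i in range(100)]
non_events = [("ok", i) for i in range(99_900)]
train = undersample(events, non_events)
print(len(train), len(events) / len(train))  # 1000 samples, 10% event level
```

Oversampling would instead duplicate (or synthesize) minority-class rows until the same ratio is reached.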

Another rare event modeling strategy is to use decision processing to place greater weight on correctly classifying the event.

TABLE 1: The cost of inaccurately identifying fraud

The table above shows the cost associated with each decision outcome. In this scenario, classifying a fraudulent case as not fraudulent has an expected cost of $500. There's also a $100 cost associated with falsely classifying a non-fraudulent case as fraudulent.

Rather than developing a model based on some statistical assessment criterion, here the goal is to select the model that minimizes total cost: total cost = (false negatives × $500) + (false positives × $100). In this strategy, accurately specifying the cost of the two types of misclassification is the key to the success of the algorithm.
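A minimal sketch of that selection rule, with hypothetical confusion counts for three candidate models:

```python
COST_FN = 500  # cost of calling a fraudulent case "not fraudulent"
COST_FP = 100  # cost of calling a legitimate case "fraudulent"

# Hypothetical confusion counts for three candidate models.
candidates = {
    "model_a": {"false_negatives": 40, "false_positives": 120},
    "model_b": {"false_negatives": 25, "false_positives": 300},
    "model_c": {"false_negatives": 60, "false_positives": 10},
}

def total_cost(counts):
    return counts["false_negatives"] * COST_FN + counts["false_positives"] * COST_FP

best = min(candidates, key=lambda name: total_cost(candidates[name]))
for name in candidates:
    print(name, total_cost(candidates[name]))
print("lowest-cost model:", best)  # model_c wins despite missing more fraud
```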

My next post will be about combining lots of models. If there are other tips you want me to cover, or if you have tips of your own to share, leave a comment here.

The post Machine learning best practices: detecting rare events appeared first on Subconscious Musings.

### If you did not already know

Byzantine Gradient Descent
We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning — a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function. …
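The robust aggregation step is the heart of the method. As an illustrative sketch only (the paper aggregates the geometric median of means of gradients across machine groups; the handful of two-dimensional "gradients" below are invented), here is Weiszfeld's classic iteration, which shows why the geometric median shrugs off a Byzantine report:

```python
import math

def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld's iteration for the geometric median of a set of points."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        weights = [1.0 / max(math.dist(p, current), eps) for p in points]
        total = sum(weights)
        current = [sum(w * p[i] for w, p in zip(weights, points)) / total
                   for i in range(dim)]
    return current

# Four honest machines report gradients near the truth; one machine lies.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05]]
byzantine = [[1000.0, -1000.0]]
aggregated = geometric_median(honest + byzantine)
print(aggregated)  # stays close to [1, 2]; plain averaging would be ruined
```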

Least-Angle Regression (LARS)
In statistics, least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. Suppose we expect a response variable to be determined by a linear combination of a subset of potential covariates. Then the LARS algorithm provides a means of producing an estimate of which variables to include, as well as their coefficients. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one’s correlations with the residual. …
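LARS itself needs some careful linear algebra, but its close relative, incremental forward stagewise regression, shows the core move in a few lines: repeatedly nudge the coefficient of the covariate most correlated with the current residual. The sketch below (with a toy, invented design matrix) is illustrative, not a faithful LARS implementation:

```python
def forward_stagewise(X, y, step=0.005, iters=4000):
    """Incremental forward stagewise regression: at each step, bump the
    coefficient of the covariate most correlated with the current residual."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                    for i in range(n)]
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j_best = max(range(p), key=lambda j: abs(corr[j]))
        beta[j_best] += step if corr[j_best] > 0 else -step
    return beta

# Noiseless toy data generated from y = 2*x0 - 1*x1.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2 * a - b for a, b in X]
beta_hat = forward_stagewise(X, y)
print(beta_hat)  # approaches [2, -1]
```

Shrinking the step size traces out the same kind of coefficient path, indexed by the L1 norm of the parameter vector, that LARS computes exactly.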

False Positive Rate
In statistics, when performing multiple comparisons, the term false positive ratio, also known as the false alarm ratio, usually refers to the probability of falsely rejecting the null hypothesis for a particular test. The false positive rate (or “false alarm rate”) usually refers to the expectancy of the false positive ratio.
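A minimal numeric illustration with invented counts, computing the observed ratio over a batch of tests:

```python
# Out of 990 true null hypotheses tested, 50 were incorrectly rejected.
false_positives = 50   # nulls incorrectly rejected
true_negatives = 940   # nulls correctly retained

false_positive_ratio = false_positives / (false_positives + true_negatives)
print(false_positive_ratio)  # ≈ 0.0505
```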

### Ligature fonts for R

(This article was first published on R – Benomics, and kindly contributed to R-bloggers)

Ligature fonts are fonts which sometimes map multiple characters to a single glyph, either for readability or just because it looks neat. Importantly, this only affects the rendering of the text with said font, while the distinct characters remain in the source.

The Apple Chancery font with and without ligatures enabled.

Maybe ligatures are an interesting topic in themselves if you’re into typography, but it’s the relatively modern monospaced variants which are significantly more useful in the context of R programming.

Two of the most popular fonts in this category are:

• Fira Code — an extension of Fira Mono which really goes all out providing a wide range of ligatures for obscure Haskell operators, as well as the more standard set which will be used when writing R
• Hasklig — a fork of Source Code Pro (in my opinion a nicer base font) which is more conservative with the ligatures it introduces

Here’s some code to try out with these ligature fonts, first rendered via bog-standard monospace font:

library(magrittr)
library(tidyverse)

filtered_storms <- dplyr::storms %>%
  filter(category == 5, year >= 2000) %>%
  unite("date", year:day, sep = "-") %>%
  group_by(name) %>%
  filter(pressure == max(pressure)) %>%
  mutate(date = as.Date(date)) %>%
  arrange(desc(date)) %>%
  ungroup() %T>%
  print()


Here’s the same code rendered with Hasklig:

Some of the glyphs on show here are:

• A single arrow glyph for less-than hyphen (<-)
• Altered spacing around two colons (::)
• Joined up double-equals

Fira Code takes this a bit further and also converts >= to a single glyph:

In my opinion these fonts are a nice and clear way of reading and writing R. In particular the single arrow glyph harks back to the APL keyboards with real arrow keys, for which our modern two-character <- is a poor substitute.

One downside could be a bit of confusion when showing your IDE to someone else, or maybe writing slightly longer lines than it appears, but personally I’m a fan and my RStudio is now in Hasklig.


### ICML is changing its constitution

Andrew McCallum has been leading an initiative to update the bylaws of IMLS, the organization which runs ICML. I expect most people aren’t interested in such details. However, the bylaws change rarely and can have an impact over a long period of time, so they do have some real importance. I’d like to hear comments from anyone with a particular interest before this year’s ICML.

In my opinion, the most important aspect of the bylaws is the at-large election of members of the board which is preserved. Most of the changes between the old and new versions are aimed at better defining roles, committees, etc… to leave IMLS/ICML better organized.

Anyways, please comment if you have a concern or thoughts.

### Magister Dixit

“However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz.” Foster Provost, Tom Fawcett ( 2013 )

### Neville’s Method of Polynomial Interpolation

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 1 of 5 in the series Numerical Analysis

Neville’s method evaluates a polynomial that passes through a given set of $x$ and $y$ points for a particular $x$ value using the Newton polynomial form. Neville’s method is similar to a now-defunct procedure named Aitken’s algorithm and is based on the divided differences recursion relation (“Neville’s Algorithm”, n.d.).

It was stated in a previous post on Lagrangian polynomial interpolation that, given data points at $x_0, x_1, x_2, \cdots, x_n$, there exists a Lagrange polynomial passing through the points with indices $m_1, m_2, \cdots, m_k$, where each index is a distinct integer with $0 \leq m_i \leq n$. This Lagrange polynomial is denoted $P_{m_1, m_2, \cdots, m_k}(x)$.

###### Neville’s Method

Neville’s method can be stated as follows:

Let a function $f$ be defined at points $x_0, x_1, \cdots, x_k$ where $x_j$ and $x_i$ are two distinct members. For each $k$, there exists a Lagrange polynomial $P$ that interpolates the function $f$ at the $k + 1$ points $x_0, x_1, \cdots, x_k$. The $k$th Lagrange polynomial is defined as:

$\large{P(x) = \frac{(x - x_j) P_{0,1,\cdots,j-1,j+1,\cdots,k}(x) - (x - x_i) P_{0,1,\cdots,i-1,i+1,\cdots,k}(x)}{(x_i - x_j)}}$

The $P_{0,1,\cdots,j-1,j+1,\cdots,k}$ and $P_{0,1,\cdots,i-1,i+1,\cdots,k}$ are often denoted $\hat{Q}$ and $Q$, respectively, for ease of notation.

$\large{P(x) = \frac{(x - x_j) \hat{Q}(x) - (x - x_i) Q(x)}{(x_i - x_j)}}$

The interpolating polynomials can thus be generated recursively, which we will see in the following example:

###### Neville’s Method Example

Consider the following table of $x$ and corresponding $y$ values.

| x   | y        |
|-----|----------|
| 8.1 | 16.9446  |
| 8.3 | 17.56492 |
| 8.6 | 18.50515 |
| 8.7 | 18.82091 |

Suppose we are interested in interpolating a polynomial that passes through these points to approximate the resulting $y$ value from an $x$ value of 8.4.

We can construct the interpolating polynomial approximations using the function above:

$Q_{1,1} = \frac{(8.4 - x_0)Q_{1,0} - (8.4 - x_1) Q_{0,0}}{x_1 - x_0} = \frac{(8.4 - 8.1)(17.56492) - (8.4 - 8.3)(16.9446)}{8.3 - 8.1} = 17.87508$
$Q_{2,1} = \frac{(8.4 - x_1)Q_{2,0} - (8.4 - x_2)Q_{1,0}}{(x_2 - x_1)} = \frac{(8.4 - 8.3)(18.50515) - (8.4 - 8.6)(17.56492)}{(8.6 - 8.3)} = 17.87833$
$Q_{3,1} = \frac{(8.4 - x_2)Q_{3,0} - (8.4 - x_3)Q_{2,0}}{(x_3 - x_2)} = \frac{(8.4 - 8.6)(18.82091) - (8.4 - 8.7)(18.50515)}{(8.7 - 8.6)} = 17.87363$

The approximated values $Q_{1,1} = 17.87508, Q_{2,1} = 17.87833, Q_{3,1} = 17.87363$ are then used in the next iteration.

$Q_{2,2} = \frac{(8.4 - x_0)Q_{2,1} - (8.4 - x_2)Q_{1,1}}{(x_2 - x_0)} = \frac{(8.4 - 8.1)(17.87833) - (8.4 - 8.6)(17.87508)}{(8.6 - 8.1)} = 17.87703$
$Q_{3,2} = \frac{(8.4 - x_1)Q_{3,1} - (8.4 - x_3)Q_{2,1}}{(x_3 - x_1)} = \frac{(8.4 - 8.3)(17.87363) - (8.4 - 8.7)(17.87833)}{(8.7 - 8.3)} = 17.877155$

Then the final iteration yields the approximated $y$ value for the given $x$ value.

$Q_{3,3} = \frac{(8.4 - x_0)Q_{3,2} - (8.4 - x_3)Q_{2,2}}{(x_3 - x_0)} = \frac{(8.4 - 8.1)(17.877155) - (8.4 - 8.7)(17.87703)}{(8.7 - 8.1)} = 17.8770925$

Therefore $17.8770925$ is the approximated value at the point $8.4$.

###### Neville’s Method in R

The following function is an implementation of Neville’s method for interpolating and evaluating a polynomial.

poly.neville <- function(x, y, x0) {
  n <- length(x)
  q <- matrix(data = 0, n, n)
  q[, 1] <- y

  for (i in 2:n) {
    for (j in i:n) {
      q[j, i] <- ((x0 - x[j - i + 1]) * q[j, i - 1] - (x0 - x[j]) * q[j - 1, i - 1]) /
        (x[j] - x[j - i + 1])
    }
  }

  res <- list('Approximated value' = q[n, n], 'Neville iterations table' = q)
  return(res)
}


Let’s test this function to see if it reports the same result as what we found earlier.

x <- c(8.1, 8.3, 8.6, 8.7)
y <- c(16.9446, 17.56492, 18.50515, 18.82091)

poly.neville(x, y, 8.4)
## $`Approximated value`
## [1] 17.87709
## 
## $`Neville iterations table`
##          [,1]     [,2]     [,3]     [,4]
## [1,] 16.94460  0.00000  0.00000  0.00000
## [2,] 17.56492 17.87508  0.00000  0.00000
## [3,] 18.50515 17.87833 17.87703  0.00000
## [4,] 18.82091 17.87363 17.87716 17.87709


The approximated value is reported as $17.87709$, the same value we calculated previously (minus a few decimal places). The function also outputs the iteration table that stores the intermediate results.

The pracma package contains the neville() function which also performs Neville’s method of polynomial interpolation and evaluation.

library(pracma)

neville(x, y, 8.4)
## [1] 17.87709


The neville() function reports the same approximated value that we found with our manual calculations and function.

###### References

Burden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.

Cheney, E. W., & Kincaid, D. (2013). Numerical mathematics and computing (6th ed.). Boston, MA: Brooks/Cole, Cengage Learning.

Neville’s algorithm. (2016, January 2). In Wikipedia, The Free Encyclopedia. From https://en.wikipedia.org/w/index.php?title=Neville%27s_algorithm&oldid=697870140

The post Neville’s Method of Polynomial Interpolation appeared first on Aaron Schlegel.


### The last gasp

EVERY two years the World Health Organisation (WHO) takes stock of the efforts of governments around the globe to curb smoking. The latest report, published today, shows that only a single country, Turkey, has implemented to the fullest degree all of the measures recommended by the WHO.

### PAW Keynotes: Tips, Tricks, Mistakes, and Examples

PAW Business, Oct 29 - Nov 2, 2017 in NYC, will be packed with the top machine learning and predictive analytics experts, practitioners, authors, business thought leaders - check the keynotes.

### ArCo Package v 0.2 is on

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

The ArCo package 0.2 is now available on CRAN. The functions are now more user-friendly. The new features are:

• Default function for estimation if the user does not inform the functions fn and p.fn. The default model is Ordinary Least Squares.
• The user can now add extra arguments to the fn function in the call.
• The data will be automatically coerced when possible.


### Hacking in silico protein engineering with Machine Learning and AI, explained

Proteins are building blocks of all living matter. Although tremendous progress has been made, protein engineering remains laborious, expensive and truly complicated. Here is how Machine Learning can help.

### Data wrangling : Transforming (2/3)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Data wrangling is a task of great importance in data analysis: it is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader’s data wrangling skills. This is the third part of the series, and it aims to cover transforming your data. This can include filtering, summarizing, and ordering your data by different means, as well as combining various data sets, creating new variables, and many other manipulation tasks. In this post, we will go through a few more advanced transformation tasks on the mtcars data set.

Before proceeding, it might be helpful to look over the help pages for group_by, ungroup, summary, summarise, arrange, mutate, and cumsum.

install.packages("dplyr")
library(dplyr)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a new object named cars_cyl and assign to it the mtcars data frame grouped by the variable cyl
Hint: be careful about the data type of the variable; to be used for grouping, it has to be a factor.

Exercise 2

Remove the grouping from the object cars_cyl

Exercise 3

Print out the summary statistics of the mtcars data frame using the summary function and pipeline symbols %>%.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

• Work with popular libraries such as dplyr
• Learn about methods such as pipelines
• And much more

Exercise 4

Make a more descriptive summary statistics output containing the 4 quantiles, the mean, the standard deviation and the count.

Exercise 5

Print out the average *hp* for every cyl category

Exercise 6

Print out the mtcars data frame sorted by hp (ascending order)

Exercise 7

Print out the mtcars data frame sorted by hp (descending order)

Exercise 8

Create a new object named cars_per containing the mtcars data frame along with a new variable called performance and calculated as performance = hp/mpg

Exercise 9

Print out the cars_per data frame, sorted by performance in descending order and create a new variable called rank indicating the rank of the cars in terms of performance.

Exercise 10

To wrap everything up, we will use the iris data set. Print out the mean of every variable for every Species and create two new variables called Sepal.Density and Petal.Density, calculated as Sepal.Density = Sepal.Length * Sepal.Width and Petal.Density = Sepal.Length * Petal.Width respectively.


### This one takes time to make, takes even more time to read

Reader Matt F. contributed this confusing chart from Wired, accompanying an article about Netflix viewing behavior.

Matt doesn't like this chart. He thinks the main insight - most viewers drop out after the first episode - is too obvious. And there are more reasons why the chart doesn't work.

This is an example of a high-effort, low-reward chart. See my return-on-effort matrix for more on this subject.

The high effort is due to several design choices.

The most attention-grabbing part of the chart is the blue, yellow and green bars. The blue and yellow together form a unity, while the green color refers to something else entirely. The shows in blue are classified as "savored," meaning that "viewers" on average took in less than two hours per day "to complete the season." The shows in yellow are just the opposite and labeled "devoured." The distinction between savored and devoured shows appears to be a central thesis of the article.

The green cell measures something else unrelated to the average viewer's speed of consumption. It denotes a single episode, the "watershed" after which "at least 70 percent of viewers will finish the season." The watershed episode exists for all shows, the only variability is which episode. The variability is small because all shows experience a big drop-off in audience after episode 1, the slope of the audience curve is decreasing with further episodes, and these shows have a small number of episodes (6 to 13). In the shows depicted, with a single exception of BoJack Horseman, the watershed occurs in episode 2, 3, or 4.

Beyond the colors, readers will consider the lengths of the bars. The labels are typically found on the horizontal axis but here, they are found facing the wrong way on pink columns on the right edge of the chart. These labels are oriented in a way that makes readers think they represent column heights.

The columns look like they are all roughly the same height but on close inspection, they are not! Their heights are not given on top of the columns but on the side of the vertical axis.

The bar lengths show the total number of minutes of season 1 of each of these shows. This measure is a peripheral piece of information that adds little to the chart.

The vertical axis indicates the proportion of viewers who watched all episodes within one week of viewing. This segmentation of viewers is related to the segmentation of the shows (blue/yellow) as they are both driven by the speed of consumption.

Not surprisingly, the higher the elevation of the bar, the more likely it is yellow. A higher bar means more people are binge-watching, which should imply the show is more likely classified as "devoured". Despite the correlation, these two ways of measuring the speed of consumption are not consistent. The average show on the chart has about 7 hours of content. If consumed within one week, it requires only one hour of viewing per day... so the average show would be classified as "savored" even though the average viewer can be labeled a binge-watcher who finishes in one week.

***

[After taking a breath of air] We may have found the interesting part of this chart - the show Orange is the New Black is considered a "devoured" show and yet only half the viewers finish all episodes within one week, a much lower proportion than most of the other shows. Given the total viewing hours of about 12, if the viewer watches two hours per day, it should take 6 days to finish the series, within the one-week cutoff. So this means that the viewers may be watching more than one episode at a time, but taking breaks between viewing sessions.

The following chart brings out the exceptional status of this show:

PS. Above image was replaced on 7/19/2017 based on feedback from the commenters. Labels and legend added.

### Road Lane Line Detection using Computer Vision models

A tutorial on how to implement a computer vision data pipeline for road lane detection used by self-driving cars.

### On Unlimited Sampling

Ayush Bhandari just let me know about the interesting approach of Unlimited Sampling in an email exchange:

...In practice, ADCs clip or saturate whenever the amplitude of the signal x exceeds the ADC threshold L. The typical solution is to de-clip the signal, and various methods have been proposed for this purpose.

Based on a new ADC hardware which allows for sampling using the principle
y = mod(x,L)

where x is bandlimited and L is the ADC threshold, we show that sampling about \pi e (~10) times faster than the Nyquist rate guarantees recovery of x from y. For this purpose we outline a new, stable recovery procedure.

Paper and slides are here.

There is also the PhysOrg coverage. Thanks Ayush ! Here is the paper:

Shannon's sampling theorem provides a link between the continuous and the discrete realms, stating that bandlimited signals are uniquely determined by their values on a discrete set. This theorem is realized in practice using so-called analog-to-digital converters (ADCs). Unlike Shannon's sampling theorem, ADCs are limited in dynamic range. Whenever a signal exceeds some preset threshold, the ADC saturates, resulting in aliasing due to clipping. The goal of this work is to analyze an alternative approach that does not suffer from these problems. Our work is based on recent developments in ADC design, which allow for ADCs that reset rather than saturate, thus producing modulo samples. An open problem that remains is: given such modulo samples of a bandlimited function as well as the dynamic range of the ADC, how can the original signal be recovered, and what are the sufficient conditions that guarantee perfect recovery? In this work, we prove such sufficiency conditions and complement them with a stable recovery algorithm. Our results are not limited to certain amplitude ranges; in fact, the same circuit architecture allows for the recovery of arbitrarily large amplitudes as long as some estimate of the signal norm is available when recovering. Numerical experiments that corroborate our theory indeed show that it is possible to perfectly recover functions that take values orders of magnitude higher than the ADC's threshold.
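A toy Python illustration of the folding (my own sketch, not the authors' recovery algorithm): when the signal is oversampled enough that consecutive samples change by less than L/2, every wrap shows up as a jump bigger than L/2 and can be undone by simple unwrapping.

```python
import math

L = 1.0  # ADC threshold: the converter outputs y = mod(x, L) instead of clipping

def fold(x):
    return [v % L for v in x]

def unfold(y):
    """Naive unwrapping: add or subtract L whenever a jump exceeds L/2."""
    x, offset = [y[0]], 0.0
    for prev, cur in zip(y, y[1:]):
        if cur - prev > L / 2:
            offset -= L
        elif cur - prev < -L / 2:
            offset += L
        x.append(cur + offset)
    return x

# A smooth signal whose amplitude (2.5) far exceeds the threshold L = 1.
t = [i / 200 for i in range(400)]
x = [2.5 * math.sin(2 * math.pi * v) for v in t]
y = fold(x)
recovered = unfold(y)
err = max(abs(a - b) for a, b in zip(recovered, x))
print(err < 1e-6)  # recovery is exact up to floating-point rounding
```

Here the first sample happens to lie in [0, L); in general this naive scheme only recovers the signal up to an additive multiple of L, which is one reason the paper's actual algorithm is more subtle.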

h/t Laurent.

### Integrating Apache Airflow with Databricks

This blog post is part of our series of internal engineering blogs on Databricks platform, infrastructure management, integration, tooling, monitoring, and provisioning.

Today, we are excited to announce native Databricks integration in Apache Airflow, a popular open source workflow scheduler. This blog post illustrates how you can set up Airflow and use it to trigger Databricks jobs.

One very popular feature of Databricks’ Unified Analytics Platform (UAP) is the ability to convert a data science notebook directly into production jobs that can be run regularly. While this feature unifies the workflow from exploratory data science to production data engineering, some data engineering jobs can contain complex dependencies that are difficult to capture in notebooks. To support these complex use cases, we provide REST APIs so jobs based on notebooks and libraries can be triggered by external systems. Of these, one of the most common schedulers used by our customers is Airflow. We are happy to share that we have also extended Airflow to support Databricks out of the box.

## Airflow Basics

Airflow is a generic workflow scheduler with dependency management. Besides its ability to schedule periodic jobs, Airflow lets you express explicit dependencies between different stages in your data pipeline.

Each ETL pipeline is represented as a directed acyclic graph (DAG) of tasks (not to be mistaken for Spark’s own DAG scheduler and tasks). Dependencies are encoded into the DAG by its edges: for any given edge, the downstream task is only scheduled if the upstream task completed successfully. For example, in the example DAG below, tasks B and C will only be triggered after task A completes successfully. Task D will then be triggered when tasks B and C both complete successfully.
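This scheduling rule is just topological ordering. A tiny pure-Python sketch of the example DAG (illustrative only, not Airflow code):

```python
# Tasks and upstream dependencies from the example above:
# B and C depend on A; D depends on both B and C.
dependencies = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

def run_order(deps):
    """Kahn-style topological batches: a task is scheduled only after all
    of its upstream tasks have completed successfully."""
    done, batches = set(), []
    while len(done) < len(deps):
        ready = sorted(t for t in deps
                       if t not in done and all(u in done for u in deps[t]))
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        done.update(ready)
        batches.append(ready)  # tasks in one batch could run in parallel
    return batches

print(run_order(dependencies))  # → [['A'], ['B', 'C'], ['D']]
```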

The tasks in Airflow are instances of an “operator” class and are implemented as small Python scripts. Since they are simply Python scripts, operators in Airflow can perform many tasks: they can poll for some precondition to be true (also called a sensor) before succeeding, perform ETL directly, or trigger external systems like Databricks.

## Native Databricks Integration in Airflow

We implemented an Airflow operator called DatabricksSubmitRunOperator, enabling a smoother integration between Airflow and Databricks. Through this operator, we can hit the Databricks Runs Submit API endpoint, which can externally trigger a single run of a jar, Python script, or notebook. After making the initial request to submit the run, the operator will continue to poll for the result of the run. When it completes successfully, the operator will return, allowing downstream tasks to run.

We’ve contributed the DatabricksSubmitRunOperator upstream to the open-source Airflow project. However, the integrations will not be cut into a release branch until Airflow 1.9.0 is released. Until then, to use this operator you can install Databricks’ fork of Airflow, which is essentially Airflow version 1.8.1 with our DatabricksSubmitRunOperator patch applied.

pip install --upgrade "git+git://github.com/databricks/incubator-airflow.git@1.8.1-db1#egg=apache-airflow[databricks]"


## Airflow with Databricks Tutorial

In this tutorial, we’ll set up a toy Airflow 1.8.1 deployment which runs on your local machine and also deploy an example DAG which triggers runs in Databricks.

The first thing we will do is initialize the sqlite database. Airflow will use it to track miscellaneous metadata. In a production Airflow deployment, you’ll want to edit the configuration to point Airflow to a MySQL or Postgres database but for our toy example, we’ll simply use the default sqlite database. To perform the initialization run:

airflow initdb


The SQLite database and default configuration for your Airflow deployment will be initialized in ~/airflow.

In the next step, we’ll write a DAG that runs two Databricks jobs with one linear dependency. The first Databricks job will trigger a notebook located at /Users/airflow@example.com/PrepareData, and the second will run a jar located at dbfs:/lib/etl-0.1.jar. To save time, we’ve already gone ahead and written the DAG for you here.

From a mile-high view, the DAG script essentially constructs two DatabricksSubmitRunOperator tasks and then sets the dependency at the end with the set_downstream method. A skeleton version of the code looks something like this:

notebook_task = DatabricksSubmitRunOperator(
    …)

spark_jar_task = DatabricksSubmitRunOperator(
    …)

notebook_task.set_downstream(spark_jar_task)


In reality, there are some other details we need to fill in to get a working DAG file. The first step is to set some default arguments which will be applied to each task in our DAG.

args = {
    'owner': 'airflow',
    'email': ['airflow@example.com'],
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2)
}


The two interesting arguments here are depends_on_past and start_date. If depends_on_past is true, it signals Airflow that a task should not be triggered unless the previous instance of the task completed successfully. The start_date argument determines when the first task instance will be scheduled.

The next section of our DAG script actually instantiates the DAG.

dag = DAG(
    dag_id='example_databricks_operator',
    default_args=args,
    schedule_interval='@daily')


In this DAG, we give it a unique ID, attach the default arguments we declared earlier, and give it a daily schedule. Next, we’ll write the specification of the cluster that will run our tasks.

new_cluster = {
    'spark_version': '2.1.0-db3-scala2.11',
    'node_type_id': 'r3.xlarge',
    'aws_attributes': {
        'availability': 'ON_DEMAND'
    },
    'num_workers': 8
}


The schema of this specification matches the new_cluster field of the Runs Submit endpoint. For your example DAG, you may want to decrease the number of workers or change the instance size to something smaller.
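For instance, a scaled-down spec for experimenting might look like this (a hypothetical variant of the spec above, keeping the same field names from the new_cluster schema but with a single worker instead of eight):

```python
# Hypothetical scaled-down cluster spec for trying out the example DAG;
# the field names follow the new_cluster schema of the Runs Submit endpoint.
small_cluster = {
    'spark_version': '2.1.0-db3-scala2.11',
    'node_type_id': 'r3.xlarge',
    'aws_attributes': {
        'availability': 'ON_DEMAND'
    },
    'num_workers': 1  # down from 8 to keep the toy run cheap
}
```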

Finally, we’ll instantiate the DatabricksSubmitRunOperator and register it with our DAG.

notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        'notebook_path': '/Users/airflow@example.com/PrepareData',
    },
}

# Example of using the JSON parameter to initialize the operator.
notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    dag=dag,
    json=notebook_task_params)


In this piece of code, the json parameter takes a Python dictionary that matches the request body of the Runs Submit endpoint.

To add another task downstream of this one, we instantiate the DatabricksSubmitRunOperator again and use the special set_downstream method on the notebook_task operator instance to register the dependency.

# Example of using the named parameters of DatabricksSubmitRunOperator
# to initialize the operator.
spark_jar_task = DatabricksSubmitRunOperator(
    task_id='spark_jar_task',
    dag=dag,
    new_cluster=new_cluster,
    spark_jar_task={
        'main_class_name': 'com.example.ProcessData'
    },
    libraries=[
        {
            'jar': 'dbfs:/lib/etl-0.1.jar'
        }
    ])

notebook_task.set_downstream(spark_jar_task)



This task runs a jar located at dbfs:/lib/etl-0.1.jar.

Notice that in the notebook_task, we used the JSON parameter to specify the full specification for the submit run endpoint and that in the spark_jar_task, we flattened the top level keys of the submit run endpoint into parameters for the DatabricksSubmitRunOperator. Although both ways of instantiating the operator are equivalent, the latter method does not allow you to use any new top level fields like spark_python_task or spark_submit_task. For more detailed information about the full API of DatabricksSubmitRunOperator, please look at the documentation here.

Now that we have our DAG, to install it in Airflow, create the directory ~/airflow/dags and copy the DAG file into it.

At this point, Airflow should be able to pick up the DAG.

$ airflow list_dags
[2017-07-06 10:27:23,868] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-07-06 10:27:24,238] {models.py:168} INFO - Filling up the DagBag from /Users/andrew/airflow/dags

-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
example_bash_operator
example_branch_dop_operator_v3
example_branch_operator
example_databricks_operator


We can also visualize the DAG in the web UI. To start it up, run airflow webserver and connect to localhost:8080. Clicking into the “example_databricks_operator,” you’ll see many visualizations of your DAG. Here is the example:

At this point, a careful observer might also notice that we don’t specify information such as the hostname, username, and password to a Databricks shard anywhere in our DAG. To configure this, we use the connection primitive of Airflow, which allows us to reference credentials stored in a database from our DAG. By default, every DatabricksSubmitRunOperator sets its databricks_conn_id parameter to “databricks_default,” so for our DAG, we’ll have to add a connection with the ID “databricks_default.”

The easiest way to do this is through the web UI. Clicking into the “Admin” on the top and then “Connections” in the dropdown will show you all your current connections. For our use case, we’ll add a connection for “databricks_default.” The final connection should look something like this:

Now that we have everything set up for our DAG, it’s time to test each task. To do this for the notebook_task, we would run airflow test example_databricks_operator notebook_task 2017-07-01, and for the spark_jar_task we would run airflow test example_databricks_operator spark_jar_task 2017-07-01. To run the DAG on a schedule, you would invoke the scheduler daemon process with the command airflow scheduler.

If everything goes well, after starting the scheduler, you should be able to see backfilled runs of your DAG start to run in the web UI.

## Next Steps

In conclusion, this blog post provides an easy example of setting up Airflow integration with Databricks. It demonstrates how Databricks’ extension to and integration with Airflow lets you invoke computation on the Databricks platform through the Databricks Runs Submit API. For more detailed instructions on how to set up a production Airflow deployment, please look at the official Airflow documentation.

Also, if you want to try this tutorial on Databricks, sign up for a free trial today.

--

The post Integrating Apache Airflow with Databricks appeared first on Databricks.

### New R Course: Writing Efficient R Code

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hello R users, we’ve got a brand new course today: Writing Efficient R Code by Colin Gillespie.

The beauty of R is that it is built for performing data analysis. The downside is that sometimes R can be slow, thereby obstructing our analysis. For this reason, it is essential to become familiar with the main techniques for speeding up your analysis, so you can reduce computational time and get insights as quickly as possible.

Take me to chapter 1!

Writing Efficient R Code features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master in writing efficient, quick, R code!

What you’ll learn:

Chapter 1: The Art of Benchmarking

In order to make your code go faster, you need to know how long it takes to run.

Chapter 2: Fine Tuning – Efficient Base R

R is flexible because you can often solve a single problem in many different ways. Some ways can be several orders of magnitude faster than others.

Chapter 3: Diagnosing Problems – Code Profiling

Profiling helps you locate the bottlenecks in your code.

Chapter 4: Turbo Charged Code – Parallel Programming

Some problems can be solved faster using multiple cores on your machine. This chapter shows you how to write R code that runs in parallel.

Learn how to write efficient R code today!

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!

Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest versions of R and Stan.) If problems occur, please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.

Class structure and example topics for the three days:

Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with hands-on exercises
Real-life Stan
Bayesian workflow

Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)

Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.

### Machine Learning Explained: supervised learning, unsupervised learning, and reinforcement learning

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Machine learning is often split between three main types of learning: supervised learning, unsupervised learning, and reinforcement learning. Knowing the differences between these three types of learning is necessary for any data scientist.

## The big picture

The type of learning is defined by the problem you want to solve and is intrinsic to the goal of your analysis:

• You have a target, a value or a class to predict. For instance, let’s say you want to predict the revenue of a store from different inputs (day of the week, advertising, promotion). Your model will then be trained on historical data and use it to forecast future revenues. Hence the model is supervised: it knows what to learn.
• You have unlabelled data and look for patterns or groups in these data. For example, you want to cluster clients according to the type of products they order, how often they purchase your product, their last visit, … Instead of doing it manually, unsupervised machine learning will automatically discriminate between different clients.
• You want to attain an objective. For example, you want to find the best strategy to win a game with specified rules. Once these rules are specified, reinforcement learning techniques will play this game many times to find the best strategy.

## On supervised learning

Supervised learning groups together different techniques which all share the same principles:

• The training dataset contains input data (your predictors) and the values you want to predict (which can be numeric or not).
• The model will use the training data to learn a link between the inputs and the outputs. The underlying idea is that the training data can be generalized and that the model can be used on new data with some accuracy.

Some supervised learning algorithms:

• Linear and logistic regression
• Support vector machine
• Naive Bayes
• Neural network
• Classification trees and random forest

Supervised learning is often used for expert systems in image recognition, speech recognition, forecasting, and in some specific business domains (targeting, financial analysis, …).
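To make the supervised principle concrete, here is a tiny, self-contained sketch (an illustration added here, not from the original article): a 1-nearest-neighbor classifier that learns the input-output link from labeled training pairs and then applies it to new, unseen inputs.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Predict the label of x as the label of its closest training input."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    best = min(range(len(train_X)), key=lambda i: distances[i])
    return train_y[best]

# Labeled training data: inputs (predictors) paired with target classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
train_y = ['small', 'small', 'large', 'large']

# The learned link generalizes to new, unseen inputs.
print(nearest_neighbor_predict(train_X, train_y, (1.1, 0.9)))  # small
print(nearest_neighbor_predict(train_X, train_y, (8.5, 9.2)))  # large
```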

## On unsupervised learning

Cluster Analysis from Wikipedia

On the other hand, unsupervised learning does not use output data (at least output data that are different from the input). Unsupervised algorithms can be split into different categories:

• Clustering algorithms, such as K-means, hierarchical clustering or mixture models. These algorithms try to discriminate and separate the observations into different groups.
• Dimensionality reduction algorithms (which are mostly unsupervised) such as PCA, ICA or autoencoders. These algorithms find the best representation of the data with fewer dimensions.
• Anomaly detection algorithms, which find outliers in the data, i.e. observations that do not follow the patterns of the data set.

Most of the time unsupervised learning algorithms are used to pre-process the data, during the exploratory analysis or to pre-train supervised learning algorithms.
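As an illustration (added here, not from the original article), here is a minimal k-means in pure Python that separates 1-D observations into two groups without any labels:

```python
def kmeans_1d(points, k=2, iters=20):
    """Minimal k-means on 1-D data: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its cluster."""
    centroids = sorted(points)[:k]  # naive initialization: the k smallest points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # keep a centroid in place if its cluster went empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

clusters, centroids = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
print(centroids)  # [2.0, 11.0]
```

The two clusters emerge purely from the structure of the data, which is exactly what makes the method unsupervised.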

## On reinforcement learning

Reinforcement learning algorithms try to find the best ways to earn the greatest reward. Rewards can be winning a game, earning more money or beating other opponents. They present state-of-the-art results on very human tasks; for instance, this paper from the University of Toronto shows how a computer can beat humans at old-school Atari video games.

Reinforcement learning algorithms follow a circular loop of steps:

Given its own state and the environment’s state, the agent will choose the action which will maximize its reward, or will explore a new possibility. This action will change the environment’s and the agent’s states, and will also be interpreted to give a reward to the agent. By performing this loop many times, the agent will improve its behavior.
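The loop above can be sketched with a tiny epsilon-greedy bandit agent (an illustrative toy added here, not from the original article): at each step the agent either exploits the action with the highest estimated reward or explores a random one, then updates its estimates from the reward it observes.

```python
import random

def epsilon_greedy_bandit(arm_rewards, steps=200, epsilon=0.1, seed=0):
    """Toy reinforcement-learning loop: choose an action (arm), observe a
    reward, update the value estimate, repeat."""
    rng = random.Random(seed)
    k = len(arm_rewards)
    estimates = [0.0] * k
    counts = [0] * k
    for arm in range(k):  # try each action once to initialize the estimates
        counts[arm] = 1
        estimates[arm] = arm_rewards[arm]
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore a new possibility
        else:
            arm = max(range(k), key=lambda i: estimates[i])  # exploit
        reward = arm_rewards[arm]  # the environment returns a reward
        counts[arm] += 1
        # incremental average update of the value estimate
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

estimates = epsilon_greedy_bandit([1.0, 5.0, 3.0])
print(max(range(3), key=lambda i: estimates[i]))  # 1 (the best action)
```

With deterministic rewards the estimates converge immediately, so the agent ends up preferring the action with the highest reward.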

Reinforcement learning already performs well on ‘small’ dynamic systems and is definitely a field to follow in the years to come.


### Katie Moussouris on how organizations should and shouldn’t respond to reported vulnerabilities

The O’Reilly Security Podcast: Why legal responses to bug reports are an unhealthy reflex, thinking through first steps for a vulnerability disclosure policy, and the value of learning by doing.

In this episode, O’Reilly’s Courtney Nash talks with Katie Moussouris, founder and CEO of Luta Security. They discuss why many organizations have a knee-jerk legal response to a bug report (and why your organization shouldn’t), the first steps organizations should take in formulating a vulnerability disclosure program, and how learning through experience and sharing knowledge benefits all.

### Make Your Plans for Stans (-s + Con)

This post is by Mike

A friendly reminder that registration is open for StanCon 2018, which will take place over three days, from Wednesday January 10, 2018 to Friday January 12, 2018, at the beautiful Asilomar Conference Grounds in Pacific Grove, California.

Detailed information about registration and accommodation at Asilomar, including fees and instructions, can be found on the event website.  Early registration ends on Friday November 10, 2017 and no registrations will be accepted after Wednesday December 20, 2017.

We have an awesome set of invited speakers this year that is worth the price of attendance alone:

• Susan Holmes (Department of Statistics, Stanford University)
• Sean Taylor and Ben Letham (Facebook Core Data Science)
• Manuel Rivas (Department of Biomedical Data Science, Stanford University)
• Talia Weiss (Department of Physics, Massachusetts Institute of Technology)
• Sophia Rabe-Hesketh and Daniel Furr (Educational Statistics and Biostatistics, University of California, Berkeley)

Contributed talks will proceed as last year, with each submission consisting of self-contained knitr or Jupyter notebooks that will be made publicly available after the conference.  Last year’s contributed talks were awesome and we can’t wait to see what users will submit this year.  For details on how to submit see the submission website.  The final deadline for submissions is Saturday September 16, 2017 5:00:00 AM GMT.

This year we are going to try to support as many student scholarships as we can; if you are a student who would love to come but may not have the funding, then don’t hesitate to submit a short application!

Finally, we are still actively looking for sponsors!  If you are interested in supporting StanCon 2018, or know someone who might be, then please contact the organizing committee.

The post Make Your Plans for Stans (-s + Con) appeared first on Statistical Modeling, Causal Inference, and Social Science.