# My Data Science Blogs

## March 25, 2017

### More than half the funds laundered in a major Russian scheme went via the UK

The Global Laundromat, a money-laundering scheme that ran from 2010-2014, funnelled more dirty money through the UK than any other country.

More than $20bn (£16bn) was shifted out of Russia between 2010 and 2014 in a large-scale money laundering operation called the Global Laundromat. The scheme was run by criminals with links to the Russian government, and moved billions in dirty money into Europe through shell companies. The majority of the companies involved in laundering the money were registered in the UK: over 50% of the money transacted through the Laundromat flowed to shell companies registered in London, Birmingham and Scotland.

### Distilled News

Here are three eBooks available for free.

QR decomposition is another technique for decomposing a matrix into a form that is easier to work with in further applications. The QR decomposition technique decomposes a square or rectangular matrix, which we will denote as A, into two components, Q and R.

I’m delighted to announce the general availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will be shortly followed by R Tools 1.0 for Visual Studio 2017 in early May.

We recently released survminer version 0.3, which includes many new features to help in visualizing and summarizing survival analysis results.

I’ve just finished a major overhaul of my widely read article, “Why R is Hard to Learn”. It describes the main complaints I’ve heard from the participants in my workshops, and how those complaints can often be mitigated.

### Saturday Morning Video: #NIPS2016 Symposium, Recurrent Neural Networks and Other Machines that Learn Algorithms

From the minisymposium program page: full session videos are available (Session 1, Session 2, Session 3), individual videos and slides are listed below, and the full playlist is also available.
- 2:00–2:20: Jürgen Schmidhuber, Introduction to Recurrent Neural Networks and Other Machines that Learn Algorithms
- 2:20–2:40: Paul Werbos, Deep Learning in Recurrent Networks: From Basics to New Data on the Brain
- 2:40–3:00: Li Deng, Three Cool Topics on RNN
- 3:00–3:20: Risto Miikkulainen, Scaling Up Deep Learning through Neuroevolution
- 3:20–3:40: Jason Weston, New Tasks and Architectures for Language Understanding and Dialogue with Memory
- 3:40–4:00: Oriol Vinyals, Recurrent Nets Frontiers (slides unavailable)
- 4:00–4:30: Coffee break
- 4:30–4:50: Mike Mozer, Neural Hawkes Process Memories
- 4:50–5:10: Ilya Sutskever, Meta Learning in the Universe
- 5:10–5:30: Marcus Hutter, Asymptotically Fastest Solver of All Well-Defined Problems (unfortunately unable to attend; J. Schmidhuber stood in for him)
- 5:30–5:50: Nando de Freitas, Learning to Learn, to Program, to Explore and to Seek Knowledge
- 5:50–6:10: Alex Graves, Differentiable Neural Computer
- 6:30–7:30: Light dinner break/posters
- 7:30–7:50: Nal Kalchbrenner, Generative Modeling as Sequence Learning
- 7:50–9:00: Panel discussion. Topic: the future of machines that learn algorithms. Panelists: Ilya Sutskever, Jürgen Schmidhuber, Li Deng, Paul Werbos, Risto Miikkulainen, Sepp Hochreiter. Moderator: Alex Graves.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there! Liked this entry? Subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on LinkedIn.
### Saturday Morning Video: The Role of Multi-Agent Learning in Artificial Intelligence Research at DeepMind

Thore Graepel talks about the role of multi-agent learning in artificial intelligence research at DeepMind. The video cannot be embedded, so here is the link: https://youtu.be/CvL-KV3IBcM

### Book Memo: “Heuristic Search” – The Emerging Science of Problem Solving

This book aims to provide a general overview of heuristic search, to present the basic steps of the most popular heuristics, and to stress their hidden difficulties as well as their opportunities. It provides a comprehensive understanding of heuristic search, the applications of which are now widely used in a variety of industries including engineering, finance, sport, management and medicine. It intends to aid researchers and practitioners in solving complex combinatorial and global optimisation problems, and to spark interest in this exciting decision-science-based subject. It will provide the reader with challenging and lively methodologies through which they will be able to design and analyse their own techniques.

### Document worth reading: “Evaluation-as-a-Service: Overview and Outlook”

Evaluation in empirical computer science is essential to show progress and to assess the technologies developed.
Several research domains such as information retrieval have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted up until now. In recent years, however, several new challenges have emerged that do not fit this paradigm very well: extremely large data sets, confidential data sets as found in the medical domain, and rapidly changing data sets as often encountered in industry. Also, crowdsourcing has changed the way that industry approaches problem-solving, with companies now organizing challenges and handing out monetary awards to incentivize people to work on their challenges, particularly in the field of machine learning. This white paper is based on discussions at a workshop on Evaluation-as-a-Service (EaaS). EaaS is the paradigm of not providing data sets to participants and having them work on the data locally, but of keeping the data central and allowing access via Application Programming Interfaces (API), Virtual Machines (VM) or other possibilities to ship executables. The objective of this white paper is to summarize and compare the current approaches and to consolidate the experiences of these approaches in order to outline the next steps of EaaS, particularly towards sustainable research infrastructures. This white paper summarizes several existing approaches to EaaS and analyzes their usage scenarios as well as their advantages and disadvantages. The many factors influencing EaaS are reviewed, as is the environment in terms of motivations for the various stakeholders, from funding agencies to challenge organizers, researchers and participants, to industry interested in supplying real-world problems for which it requires solutions.
Evaluation-as-a-Service: Overview and Outlook

### If you did not already know: “Paragraph Vector”

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore the semantics of the words. For example, ‘powerful’, ‘strong’ and ‘Paris’ are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

GitXiv: Paragraph Vector

## March 24, 2017

### Suggests and Vignettes

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

Dirk Eddelbuettel quite rightly reminded us the other day that Suggests is not Depends. I am sorry to say that I am one of those who are using Suggests… “casually”. Mea culpa. I must say that this is restricted to vignettes: there are no tests nor examples using suggested packages. But I am not checking whether my suggested packages are available at all, which is definitely wrong. And I understand that it must be frustrating to run reverse dependencies on a package as popular as Rcpp when the rest of us are using Suggests so… casually.
So I was definitely determined to solve this, and I finally managed to find a very simple solution that may be helpful to other maintainers. Our simmer package has seven vignettes. Two of them are very introductory and do not use any external package. But as you try to demonstrate more advanced features and use cases, you start needing some other tools; and their use could be intensive, so that checking suggested packages for every call or every code chunk might not scale. However, I realised that the important thing for those advanced vignettes is just to make the story they tell available to your users, and anyway they are always built and available online on CRAN. Therefore, I decided to add the following at the beginning of each vignette:

required <- c("pkg1", "pkg2", "pkgn")
if (!all(unlist(lapply(required, function(pkg) requireNamespace(pkg, quietly = TRUE)))))
  knitr::opts_chunk$set(eval = FALSE)

Problem solved. Yes, I know, I am still taking knitr for granted. But given that it has its own entry (VignetteBuilder) in the DESCRIPTION, I think this is fair enough. I only hope that Dirk will unblacklist simmer after our next release.

Content extracted from Enchufa2.es: Suggests and Vignettes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### On a First Name Basis with Statistics Sweden

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

## Abstract

Judging from recent R-Bloggers posts, it appears that many data scientists are concerned with scraping data from various media sources (Wikipedia, Twitter, etc.). However, one should be aware that well-structured and high-quality datasets are available through national and regional bureaus of statistics. Increasingly, these are offered to the public through direct database access, e.g., using a REST API. We illustrate the usefulness of such an approach by accessing data from Statistics Sweden.

## Introduction

Scandinavian countries are world-class when it comes to public registries. So when in need of reliable population data, this is the place to look. As an example, we access Statistics Sweden data by their API using the pxweb package developed by @MansMeg, @antagomir and @LCHansson. Love was the first speaker at a Stockholm R-Meetup some years ago, where I also gave a talk. Funny how such R-Meetups become useful many years later!

library(pxweb)
library(dplyr)  # for the %>% pipelines and data manipulation used below

By browsing the Statistics Sweden (in Swedish: Statistiska Centralbyrån (SCB)) data using their web interface one sees that they have two relevant first name datasets: one containing the tilltalsnamn of newborns for each year during 1998-2016 and one for the years 2004-2016. Note: A tilltalsnamn in Sweden is the first name (of several possible first names) by which a person is usually addressed. About 2/3 of the persons in the Swedish name registry indicate which of their first names is their tilltalsnamn. For the remaining persons it is automatically implied that their tilltalsnamn is the first of the first names. Also note: For reasons of data protection the 1998-2016 dataset contains only first names used 10 or more times in a given year, the 2004-2016 dataset contains only first names used 2 or more times in a given year.

Downloading such data through the SCB web-interface is cumbersome, because the downloads are limited to 50,000 data cells per query. Hence, one has to do several manual queries to get hold of the relevant data. This is where their API becomes a real time-saver. Instead of trying to fiddle with the API directly using rjson or RJSONIO we use the specially designed pxweb package to fetch the data. One can either use the web-interface to determine the name of the desired data matrix to query or navigate directly through the api using pxweb:

d <- interactive_pxweb(api = "api.scb.se", version = "v1", lang = "en")

and select Population followed by Name statistics and then BE0001T04Ar or BE0001T04BAr, respectively, in order to obtain the relevant data and api download url. This leads to the following R code for download:

names10 <- get_pxweb_data(
url = "http://api.scb.se/OV0104/v1/doris/en/ssd/BE/BE0001/BE0001T04Ar",
dims = list(Tilltalsnamn = c('*'),
ContentsCode = c('BE0001AH'),
Tid = c('*')),
clean = TRUE) %>% as.tbl

For better usability we rename the columns a little and replace NA counts by zero. For visualization we pick five random lines of the dataset.

names10 <- names10 %>% select(-observations) %>%
  rename(firstname = `first name normally used`, counts = values) %>%
  mutate(counts = ifelse(is.na(counts), 0, counts))
## Look at 5 random lines
names10 %>% slice(sample(seq_len(nrow(names10)), size = 5))
## # A tibble: 5 × 3
##   firstname   year counts
##      <fctr> <fctr>  <dbl>
## 1   Leandro   2011     15
## 2    Marlon   2004      0
## 3    Andrej   2009      0
## 4     Ester   2002     63
## 5   Muhamed   1998      0

Note: Each spelling variant of a name in the data is treated as a unique name. In similar fashion we download the BE0001AL dataset as names2.

We now join the two datasets into one large data.frame by

names <- rbind(data.frame(names2,type="min02"), data.frame(names10,type="min10"))

and thus got everything in place to compute the name collision probability over time using the birthdayproblem package (as shown in previous posts).

library(birthdayproblem)
collision <- names %>% group_by(year,type) %>% do({
data.frame(p=pbirthday_up(n=26L, prob= .$counts / sum(.$counts),method="mase1992")$prob, gini= ineq::Gini(.$counts))
}) %>% ungroup %>% mutate(year=as.numeric(as.character(year)))
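The quantity computed above by pbirthday_up can also be approximated by straightforward simulation. Here is a language-neutral Python sketch; the 1000 hypothetical names and the Zipf-like frequency weights are invented for this illustration and have nothing to do with the SCB data:

```python
import random

def collision_prob(weights, class_size=26, trials=5000, seed=1):
    """Monte Carlo estimate of the probability that at least two of
    `class_size` children share a name, given name frequency weights."""
    random.seed(seed)
    names = list(range(len(weights)))
    hits = 0
    for _ in range(trials):
        sample = random.choices(names, weights=weights, k=class_size)
        if len(set(sample)) < class_size:  # some name was drawn twice
            hits += 1
    return hits / trials

# 1000 hypothetical names: uniform frequencies vs. a skewed (Zipf-like)
# distribution; the skewed case collides far more often.
p_uniform = collision_prob([1.0] * 1000)
p_zipf = collision_prob([1.0 / (i + 1) for i in range(1000)])
```

The comparison mirrors the min02/min10 curves above: the more concentrated the name distribution, the higher the collision probability in a class of 26.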

And the resulting probabilities based on the two datasets min02 (at least two instances of the name in a given year) and min10 (at least ten instances of the name in a given year) can easily be visualized over time.

library(ggplot2)
ggplot(collision, aes(x=year, y=p, color=type)) + geom_line(size=1.5) +
  scale_y_continuous(label=scales::percent, limits=c(0,1)) +
  xlab("Year") + ylab("Probability") +
  ggtitle("Probability of a name collision in a class of 26 kids born in year YYYY") +
  scale_colour_discrete(name = "Dataset")

As seen in similar plots for other countries, there is a decline in the collision probability over time. Note also that the two curves are upper limits to the true collision probabilities. The true probabilities, i.e. taking all tilltalsnamn into account, would be based on a hypothetical min1 dataset. These probabilities would be slightly, but not substantially, below the min2 line. The same problem occurs, e.g., in the corresponding England and Wales data. There, Table 6 lists all first names with 3 or more uses, but does not state how many newborns have a name occurring once or twice, respectively. With all due respect for the need to anonymise the name statistics, it’s hard to understand why this summary figure is not reported automatically, so that one would at least be able to compute correct totals or collision probabilities.

## Summary

Altogether, I was still quite happy to get proper individual name data, so the collision probabilities are – unlike in some of my previous blog analyses – exact!

### What's new on arXiv

We propose a supervised algorithm for generating type embeddings in the same semantic vector space as a given set of entity embeddings. The algorithm is agnostic to the derivation of the underlying entity embeddings. It does not require any manual feature engineering, generalizes well to hundreds of types and achieves near-linear scaling on Big Graphs containing many millions of triples and instances by virtue of an incremental execution. We demonstrate the utility of the embeddings on a type recommendation task, outperforming a non-parametric feature-agnostic baseline while achieving 15x speedup and near-constant memory usage on a full partition of DBpedia. Using state-of-the-art visualization, we illustrate the agreement of our extensionally derived DBpedia type embeddings with the manually curated domain ontology. Finally, we use the embeddings to probabilistically cluster about 4 million DBpedia instances into 415 types in the DBpedia ontology.
The least-squares support vector machine is a frequently used kernel method for non-linear regression and classification tasks. Here we discuss several approximation algorithms for the least-squares support vector machine classifier. The proposed methods are based on randomized block kernel matrices, and we show that they provide good accuracy and reliable scaling for multi-class classification problems with relatively large data sets. Also, we present several numerical experiments that illustrate the practical applicability of the proposed methods.
In this work, we propose several online methods to build a \emph{learning curriculum} from a given set of target-task-specific training tasks in order to speed up reinforcement learning (RL). These methods can decrease the total training time needed by an RL agent compared to training on the target task from scratch. Unlike traditional transfer learning, we consider creating a sequence from several training tasks in order to provide the most benefit in terms of reducing the total time to train. Our methods utilize the learning trajectory of the agent on the curriculum tasks seen so far to decide which tasks to train on next. An attractive feature of our methods is that they are weakly coupled to the choice of the RL algorithm as well as the transfer learning method. Further, when there is domain information available, our methods can incorporate such knowledge to further speed up the learning. We experimentally show that these methods can be used to obtain suitable learning curricula that speed up the overall training time on two different domains.
Convolutional Neural Networks have been a subject of great importance over the past decade and great strides have been made in their utility for producing state of the art performance in many computer vision problems. However, the behavior of deep networks is yet to be fully understood and is still an active area of research. In this work, we present an intriguing behavior: pre-trained CNNs can be made to improve their predictions by structurally perturbing the input. We observe that these perturbations – referred as Guided Perturbations – enable a trained network to improve its prediction performance without any learning or change in network weights. We perform various ablative experiments to understand how these perturbations affect the local context and feature representations. Furthermore, we demonstrate that this idea can improve performance of several existing approaches on semantic segmentation and scene labeling tasks on the PASCAL VOC dataset and supervised classification tasks on MNIST and CIFAR10 datasets.
When using reinforcement learning (RL) algorithms to evaluate a policy it is common, given a large state space, to introduce some form of approximation architecture for the value function (VF). The exact form of this architecture can have a significant effect on the accuracy of the VF estimate, however, and determining a suitable approximation architecture can often be a highly complex task. Consequently there is a large amount of interest in the potential for allowing RL algorithms to adaptively generate (i.e. to learn) approximation architectures. We investigate a method of adapting approximation architectures which uses feedback regarding the frequency with which an agent has visited certain states to guide which areas of the state space to approximate with greater detail. We introduce an algorithm based upon this idea which adapts a state aggregation approximation architecture on-line. Assuming $S$ states, we demonstrate theoretically that – provided the following relatively non-restrictive assumptions are satisfied: (a) the number of cells $X$ in the state aggregation architecture is of order $\sqrt{S}\ln{S}\log_2{S}$ or greater, (b) the policy and transition function are close to deterministic, and (c) the prior for the transition function is uniformly distributed – our algorithm can guarantee, assuming we use an appropriate scoring function to measure VF error, error which is arbitrarily close to zero as $S$ becomes large. It is able to do this despite having only $O(X\log_2{S})$ space complexity (and negligible time complexity). We conclude by generating a set of empirical results which support the theoretical results.
In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four families of problems for which some of the commonly used existing algorithms fail or suffer significant difficulty. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.
Deep neural networks achieve unprecedented performance levels over many tasks and scale well with large quantities of data, but performance in the low-data regime and tasks like one shot learning still lags behind. While recent work suggests many hypotheses from better optimization to more complicated network structures, in this work we hypothesize that having a learnable and more expressive similarity objective is an essential missing component. Towards overcoming that, we propose a network design inspired by deep residual networks that allows the efficient computation of this more expressive pairwise similarity objective. Further, we argue that regularization is key in learning with small amounts of data, and propose an additional generator network based on the Generative Adversarial Networks where the discriminator is our residual pairwise network. This provides a strong regularizer by leveraging the generated data samples. The proposed model can generate plausible variations of exemplars over unseen classes and outperforms strong discriminative baselines for few shot classification tasks. Notably, our residual pairwise network design outperforms previous state-of-the-art on the challenging mini-Imagenet dataset for one shot learning by getting over 55% accuracy for the 5-way classification task over unseen classes.
Commercial establishments like restaurants, service centres and retailers have several sources of customer feedback about products and services, most of which need not be as structured as rated reviews provided by services like Yelp, or Amazon, in terms of sentiment conveyed. For instance, Amazon provides a fine-grained score on a numeric scale for product reviews. Some sources, however, like social media (Twitter, Facebook), mailing lists (Google Groups) and forums (Quora) contain text data that is much more voluminous, but unstructured and unlabelled. It might be in the best interests of a business establishment to assess the general sentiment towards their brand on these platforms as well. This text could be pipelined into a system with a built-in prediction model, with the objective of generating real-time graphs on opinion and sentiment trends. Although tasks like the one described above have been explored with respect to document classification problems in the past, the implementation described in this paper, by virtue of learning a continuous function rather than a discrete one, offers a lot more depth of insight as compared to document classification approaches. This study aims to explore the validity of such a continuous function predicting model to quantify sentiment about an entity, without the additional overhead of manual labelling, and computational preprocessing & feature extraction. This research project also aims to design and implement a re-usable document regression pipeline as a framework, Rapid-Rate\cite{rapid_rate}, that can be used to predict document scores in real-time.
The traditional bag-of-words approach has found a wide range of applications in computer vision. The standard pipeline consists of a generation of a visual vocabulary, a quantization of the features into histograms of visual words, and a classification step for which usually a support vector machine in combination with a non-linear kernel is used. Given large amounts of data, however, the model suffers from a lack of discriminative power. This applies particularly for action recognition, where the vast amount of video features needs to be subsampled for unsupervised visual vocabulary generation. Moreover, the kernel computation can be very expensive on large datasets. In this work, we propose a recurrent neural network that is equivalent to the traditional bag-of-words approach but enables for the application of discriminative training. The model further allows to incorporate the kernel computation into the neural network directly, solving the complexity issue and allowing to represent the complete classification system within a single network. We evaluate our method on four recent action recognition benchmarks and show that the conventional model as well as sparse coding methods are outperformed.
Knowledge bases of real-world facts about entities and their relationships are useful resources for a variety of natural language processing tasks. However, because knowledge bases are typically incomplete, it is useful to be able to perform knowledge base completion, i.e., predict whether a relationship not in the knowledge base is likely to be true. This article presents an overview of embedding models of entities and relationships for knowledge base completion, with up-to-date experimental results on two standard evaluation tasks of link prediction (i.e. entity prediction) and triple classification.
We study deep neural networks for classification of images with quality distortions. We first show that networks fine-tuned on distorted data greatly outperform the original networks when tested on distorted data. However, fine-tuned networks perform poorly on quality distortions that they have not been trained for. We propose a mixture of experts ensemble method that is robust to different types of distortions. The ‘experts’ in our model are trained on a particular type of distortion. The output of the model is a weighted sum of the expert models, where the weights are determined by a separate gating network. The gating network is trained to predict optimal weights for a particular distortion type and level. During testing, the network is blind to the distortion level and type, yet can still assign appropriate weights to the expert models. We additionally investigate weight sharing methods for the mixture model and show that improved performance can be achieved with a large reduction in the number of unique network parameters.
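The gated weighted sum described in this abstract can be sketched in a few lines. The following Python toy uses stand-in scalar "experts" and a hand-written gating function (none of which come from the paper) purely to show how softmax gating weights combine expert outputs:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_predict(x, experts, gating):
    """Weighted sum of expert outputs; weights come from a gating function."""
    weights = softmax(gating(x))
    outputs = [expert(x) for expert in experts]
    return sum(w * o for w, o in zip(weights, outputs))

# Toy setup: one "expert" per distortion type, plus a gate that scores them.
experts = [lambda x: x + 1.0,   # expert tuned for, say, blur
           lambda x: x * 2.0]   # expert tuned for, say, noise
gating = lambda x: [x, -x]      # higher score -> more weight for expert 0
y = mixture_predict(3.0, experts, gating)
```

In the paper both the experts and the gate are neural networks trained on distorted images; the mechanics of the weighted combination are the same.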
Unsupervised segmentation and clustering of unlabelled speech are core problems in zero-resource speech processing. Most competitive approaches lie at methodological extremes: some follow a Bayesian approach, defining probabilistic models with convergence guarantees, while others opt for more efficient heuristic techniques. Here we introduce an approximation to a segmental Bayesian model that falls in between, with a clear objective function but using hard clustering and segmentation rather than full Bayesian inference. Like its Bayesian counterpart, this embedded segmental k-means model (ES-KMeans) represents arbitrary-length word segments as fixed-dimensional acoustic word embeddings. On English and Xitsonga data, ES-KMeans outperforms a leading heuristic method in word segmentation, giving similar scores to the Bayesian model while being five times faster with fewer hyperparameters. However, there is a trade-off in cluster purity, with the Bayesian model’s purer clusters yielding about 10% better unsupervised word error rates.

### Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

1. Deduplication: eliminating duplicate (exact) copies of repeated data.
2. Record linkage: identifying records that reference the same entity across different sources.
3. Canonicalization: converting data with more than one possible representation into a standard form.
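To make the three tasks concrete, here is a toy sketch in Python (the record values are invented for illustration): deduplication drops the exact copy, canonicalization reduces phone numbers to a standard form, and the fuzzy name variant is the record linkage territory that Dedupe handles.

```python
records = [
    {"name": "John Smith",  "phone": "555-0123"},
    {"name": "John Smith",  "phone": "555-0123"},   # exact copy: deduplication removes it
    {"name": "SMITH, JOHN", "phone": "(555) 0123"}, # same entity, different form: record linkage territory
]

def canonical_phone(phone):
    """Canonicalization: reduce a phone number to a single standard form (digits only)."""
    return "".join(ch for ch in phone if ch.isdigit())

# Deduplication: remove exact copies by hashing each record's sorted items
unique = [dict(items) for items in {tuple(sorted(r.items())) for r in records}]

print(len(records), len(unique))      # 3 records, 2 unique
print(canonical_phone("(555) 0123"))  # 5550123
```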

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post explores some basic approaches to entity resolution using one of those tools, the Python Dedupe library: we will cover its basic functionality, walk through how the library works under the hood, and demonstrate it on two different datasets.

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

### How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

### Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example.py from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe


Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py


### Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to compare the records field by field. The advantage of this approach is most pronounced when some fields are much more likely than others to assist in identifying matches. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
    {'field' : 'Name', 'type': 'String'},
    {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Address', 'type': 'String', 'has missing' : True},
    {'field' : 'Purchases', 'type': 'String'},
]


Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking, affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

• the first three digits of the phone number
• the full name
• the first five characters of the name
• a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, a variation on edit (Levenshtein) distance that makes runs of consecutive deletions or insertions cheaper than the same edits scattered throughout a string.
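Dedupe implements this distance internally; purely as an illustration, here is a minimal sketch of the classic affine gap (Gotoh) recurrences with arbitrary example costs (match 0, mismatch 2, gap open 3, gap extend 1 are our choices, not Dedupe's values). Note that one gap of two characters costs an open plus an extend, while two separated one-character gaps cost two opens:

```python
def affine_gap(a, b, match=0, mismatch=2, gap_open=3, gap_extend=1):
    """Affine gap distance via Gotoh's algorithm: a gap of length k costs
    gap_open + (k - 1) * gap_extend, so consecutive edits are cheaper."""
    inf = float('inf')
    n, m = len(a), len(b)
    # M: best cost ending in a (mis)match; X: gap consuming a; Y: gap consuming b
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = min(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])

print(affine_gap("abcd", "aXYbcd"))  # 4: one gap of length 2 (open 3 + extend 1)
print(affine_gap("abcd", "aXbYcd"))  # 6: two separate gaps (open 3 + open 3)
```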

Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).
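A minimal sketch of that area-code blocking, using the record IDs from the example above (the names and phone numbers here are invented placeholders):

```python
from itertools import combinations

# Toy purchase-log records; the IDs match the example, other values are invented
records = {
    452: {'name': 'Bob Beaulieu',    'phone': '202-555-0171'},
    821: {'name': 'Robert Beaulieu', 'phone': '202-555-0171'},
    233: {'name': 'Sara Childers',   'phone': '334-555-0188'},
    699: {'name': 'Sarah Childers',  'phone': '334-555-0144'},
    720: {'name': 'S. Childers',     'phone': None},
}

def area_code(record):
    """Blocking predicate: the first three digits of the phone number."""
    return record['phone'][:3] if record['phone'] else None

# Group record IDs into blocks by predicate value
blocks = {}
for rid, rec in records.items():
    blocks.setdefault(area_code(rec), []).append(rid)

# Only records within the same block are compared pairwise
pairs = [p for ids in blocks.values() for p in combinations(sorted(ids), 2)]
print(blocks)  # {'202': [452, 821], '334': [233, 699], None: [720]}
print(pairs)   # [(452, 821), (233, 699)] -- 2 comparisons instead of 10
```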

The relative weight of these different features can be learned during the active learning process and expressed numerically to ensure that the features most predictive of matches weigh most heavily in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.
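Dedupe's actual clustering is hierarchical with centroid linkage, but the gist (group records whose pairwise match probability clears a threshold) can be sketched with a much simpler greedy stand-in. The record IDs and scores below are invented for illustration:

```python
def greedy_clusters(pair_scores, threshold=0.5):
    """Greedy sketch: merge record pairs into clusters when their
    predicted match probability exceeds the threshold."""
    clusters = []
    # Consider the most confident pairs first
    for (a, b), score in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            continue
        for cluster in clusters:
            if a in cluster or b in cluster:
                cluster.update((a, b))
                break
        else:
            clusters.append({a, b})
    return clusters

scores = {(452, 821): 0.92, (821, 764): 0.81, (233, 699): 0.15}
print(greedy_clusters(scores))  # one cluster of {452, 764, 821}; the 0.15 pair stays apart
```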

## Active Learning

Dedupe runs as a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking whether they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and longer if your dataset is large. You are presented with four options:

• (y)es: confirms that the two references are to the same entity
• (n)o: labels the two references as not the same entity
• (u)nsure: does not label the two references as the same entity or as different entities
• (f)inished: ends the active learning session and triggers the supervised learning phase

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.
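Under the hood, the pairs you are shown are chosen for informativeness: roughly, the model proposes the pairs it is least sure about. A hand-rolled sketch of that selection rule (not Dedupe's actual code):

```python
def most_uncertain_pair(scored_pairs):
    """Return the pair whose predicted match probability is closest to 0.5,
    i.e. the pair the model is least sure about."""
    return min(scored_pairs, key=lambda pair: abs(scored_pairs[pair] - 0.5))

scored = {('rec1', 'rec2'): 0.98, ('rec3', 'rec4'): 0.51, ('rec5', 'rec6'): 0.07}
print(most_uncertain_pair(scored))  # ('rec3', 'rec4')
```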

As you can see in the example above, some comparison decisions are very easy. The first pair has no matching values on any of the four attributes being examined, so the verdict is almost certainly a non-match. In the second, we have exact matches on three of the four attributes, with the fourth being fuzzy in that one entity contains a piece of the other: Ryerson vs. Chicago Public Schools Ryerson. A human can discern these as two references to the same entity, and we can label them as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning, and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Phone', 'type': 'String', 'has missing' : True},
]


A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

## Something a Bit More Challenging

In order to try Dedupe on a more challenging project, we decided to deduplicate the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases with multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the WhiteHouse.gov website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

| Database Field | Field Description |
| --- | --- |
| NAMELAST | Last name of entity |
| NAMEFIRST | First name of entity |
| NAMEMID | Middle name of entity |
| UIN | Unique Identification Number |
| Type of Access | Access type to White House |
| TOA | Time of arrival |
| POA | Post on arrival |
| TOD | Time of departure |
| POD | Post on departure |
| APPT_START_DATE | When the appointment is scheduled to start |
| APPT_END_DATE | When the appointment is scheduled to end |
| APPT_CANCEL_DATE | When the appointment was canceled |
| Total_People | Total number of people scheduled to attend |
| LAST_UPDATEDBY | Who was the last person to update this event |
| POST | Classified as 'WIN' |
| LastEntryDate | When the last update to this instance was made |
| TERMINAL_SUFFIX | ID for terminal used to process visitor |
| visitee_namelast | The visitee's last name |
| visitee_namefirst | The visitee's first name |
| MEETING_LOC | The location of the meeting |
| MEETING_ROOM | The room number of the meeting |
| CALLER_NAME_LAST | The last name of the person authorizing the visitor |
| CALLER_NAME_FIRST | The first name of the person authorizing the visitor |
| CALLER_ROOM | The authorizing person's room |
| Description | Description of the event or visit |
| RELEASE_DATE | The date this set of logs was released to the public |

Using the API, the White House Visitor Log Requests can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url, fname):
    """Download the file at url and save it to fname."""
    response = requests.get(url)
    with open(fname, 'wb') as f:  # response.content is bytes, so write in binary mode
        f.write(response.content)

ORIGFILE = "fixtures/whitehouse-visitors.csv"

getData(DATAURL, ORIGFILE)  # DATAURL holds the CSV export URL


Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

## Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

DATABASE = "your_db_name"
USER = "your_user_name"
HOST = "your_hostname"

try:
    conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST)
    print("I've connected")
except psycopg2.Error:
    print("I am unable to connect to the database")

cur = conn.cursor()


The other challenge with our dataset is the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type.

This function takes the csv data as input, parses the datetime fields we're interested in, and outputs a database table that retains the desired columns ('lastname', 'firstname', 'uin', 'apptmade', 'apptstart', 'apptend', 'meeting_loc'). Keep in mind this will take a while to run.

DATEFIELDS = [10, 11, 12]  # column indices of apptmade, apptstart, apptend (inferred from the INSERT below)

def dateParseSQL(nfile):
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                    (visitor_id SERIAL PRIMARY KEY,
                     lastname    varchar,
                     firstname   varchar,
                     uin         varchar,
                     apptmade    varchar,
                     apptstart   varchar,
                     apptend     varchar,
                     meeting_loc varchar);''')
    conn.commit()
    with open(nfile, 'rU') as infile:
        reader = csv.reader(infile)
        next(reader)  # skip the header row
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except ValueError:
                        continue
            sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                   VALUES (%s,%s,%s,%s,%s,%s,%s)"
            cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
            conn.commit()
    print("All done!")

dateParseSQL(ORIGFILE)


About 60 of our rows contained non-ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';
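An equivalent pre-filter in Python, had we wanted to drop those rows before loading them into the database (a sketch, not part of the original pipeline):

```python
def is_ascii_row(firstname, lastname):
    """True when both name fields contain only ASCII characters,
    mirroring the [^[:ascii:]] regex used in the SQL delete."""
    return all(ord(ch) < 128 for ch in firstname + lastname)

print(is_ascii_row("Kathy", "Edwards"))  # True
print(is_ascii_row("José", "Núñez"))     # False
```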


For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor


Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS = [
    {'field': 'firstname', 'variable name': 'firstname',
     'type': 'String', 'has missing': True},
    {'field': 'lastname', 'variable name': 'lastname',
     'type': 'String', 'has missing': True},
    {'field': 'uin', 'variable name': 'uin',
     'type': 'String', 'has missing': True},
    {'field': 'meeting_loc', 'variable name': 'meeting_loc',
     'type': 'String', 'has missing': True}
]


We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
    lset = set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print('{} blocks'.format(i))

        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])

        records.append((row[KEY_FIELD], row, smaller_ids))

    if records:
        yield records


And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)

    with psycopg2.connect(database=args.dbname,
                          host='localhost',
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)
                del temp_d

            if os.path.exists(args.training):
                with open(args.training) as tf:
                    deduper.readTraining(tf)  # load any previously labeled examples

            print('Starting active learning')
            dedupe.convenience.consoleLabel(deduper)

            print('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)

            deduper.cleanupTraining()

            print('Creating blocking_map table')
            c.execute("""
                DROP TABLE IF EXISTS blocking_map
                """)
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("""
                    SELECT DISTINCT %s FROM %s
                    """ % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)
                c_index.close()

            print('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("""
                SELECT * FROM %s
                """ % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

            print('Inserting blocks into blocking_map')
            csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', delete=False)
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()

            f = open(csv_file.name, 'r')
            c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
            f.close()

            os.remove(csv_file.name)

            con.commit()

            print('Indexing blocks')
            c.execute("""
                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                """)
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                 block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print('Indexing block_key')
            c.execute("""
                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                """)

            print('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                    FROM blocking_map INNER JOIN plural_key
                    USING (block_key))
                """ % KEY_FIELD)

            c.execute("""
                CREATE INDEX plural_block_%s_idx
                ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                (SELECT %s,
                        string_agg(CAST(block_id AS TEXT), ','
                                   ORDER BY block_id) AS sorted_ids
                 FROM plural_block
                 GROUP BY %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print('Committing')

            print('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                                                 CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                 FROM plural_block
                 INNER JOIN covered_blocks
                 USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))
            con.commit()

            print('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                candidates_gen(c_cluster), threshold=0.5)

            print('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
            c.execute("""
                CREATE TABLE entity_map (
                %s INTEGER,
                canon_id INTEGER,
                cluster_score FLOAT,
                PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                    c.execute("""
                        INSERT INTO entity_map
                        (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            c_cluster.close()
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
            con.commit()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-d', '--dbname', required=True, help='database name')  # args.dbname is used above but was missing from the original argument list
    parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json', help='name of training file')
    args = parser.parse_args()
    find_dupes(args)


## Active Learning Observations

We ran multiple experiments:

• Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
• Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and sounded more domestic, since those occur much more frequently in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed a lot of variation in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed many apparent nicknames that could either be informal variants of the same name or references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


## Results

What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, and to have the same meeting location and the same uin (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier: people who seem to spend a lot of time at the White House, like staffers and other political appointees.

During the dedupe process, we found 332,606 potential duplicates within the data set of 1,048,576 entities. Figures like these are what we would expect for this particular dataset, knowing that people make repeat visits for business and social functions.

### Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671


### Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085

lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331


## Conclusion

In this beginner's guide to entity resolution, we learned what it means to identify entities and their possible duplicates within and across records. To examine this data beyond the scope of this blog post, we would need to determine which records are true duplicates, which would require additional information to canonicalize these entities and would in turn allow entities to be indexed for future assessments. Ultimately, we saw the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion about a number of other important topics to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

### No-op: The case of Case and Deaton

In responding to some recent blog comments I noticed an overlap between our two most recent posts:

2. When does research have active opposition?

The first post was all about the fascinating patterns you can find by analyzing and graphing data from the CDC Wonder website, which has information on all the deaths in the United States over a fifteen-year period. The post was motivated by the release of a new article by economists Anne Case and Angus Deaton, who pulled out a few things from these data and, from this, spun a media-friendly story of the struggle of white Americans. In that post, I emphasized that I think Case and Deaton’s work has positive value and that I hope journalists will use that work as a starting point to explore these questions more deeply by interviewing knowledgeable actuaries, demographers, and public health experts.

The second post explored how it was that now-disgraced eating-behavior researcher Brian Wansink managed to stay at the top of the heap, maintaining media exposure, government grants, and policy influence for something like 10 years even while his sloppy research practices were all in plain sight, as it were, in his published work. In that post I suggest one reason that Wansink stayed afloat all this time was that his claims were pretty much innocuous, and he was working in a noncompetitive field, so there was nobody out there with any motivation to examine his work with a critical eye.

And here’s the mash-up: Case and Deaton are writing about an important topic—mortality trends!—but their message is basically simpatico to all parts of the political spectrum. Struggling working-class white people, that’s a story that both left and right can get behind. There’s nobody on the other side!

Indeed, when the original Case/Deaton story came out a bit over a year ago, it was framed by many as an increase in the death rate of middle-aged white men, because that was what everyone was expecting to hear—even though the actual data (when correctly age-adjusted) showed a decrease in the death rate of middle-aged white men in recent years (the increase was only among women), and even though Case and Deaton themselves never claimed that anything was happening with men in particular.

The news media—left, right, and center—had a pre-existing narrative of middle-aged white malaise, and they slotted the Case and Deaton reports into that narrative.

Why did the media not interview any questioning voices? Why did we not hear from actuaries, demographers, and public health experts with other takes on the matter? Why no alternative perspectives? Because there was no natural opposition.

And it does seem that the news media need opposition, not just other perspectives. After the original Case and Deaton paper came out, I did some quick calculations, then some more careful calculations, and realized that their headline claim—an increase in mortality among middle-aged white Americans—was wrong. But when I wrote about it, and when I spoke with journalists, I made it clear that, although Case and Deaton made a mistake by not age adjusting (and another mistake by not disaggregating by sex), their key conclusion—their comparison with trends among other groups and in other countries—held up, so I was in agreement with Case and Deaton’s main point, even if I thought they were wrong about the direction of the trend and I was skeptical about their comparisons of different education levels.

Journalists’ take on this was, pretty much, that there was no controversy so everything Case and Deaton said should be taken at face value.

I don’t think this was the worst possible outcome: based on my read of the data, Case and Deaton are making reasonable points. I just wish there were a way for their story to motivate better news coverage. There are lots of experts in demographics and public health who could add a lot to this discussion.

As I wrote in my earlier post, Case and Deaton found some interesting patterns. They got the ball rolling. Read their paper, talk with them, get their perspective. Then talk with other experts: demographers, actuaries, public health experts. Talk with John Bound, Arline Geronimus, Javier Rodriguez, and Timothy Waidmann, who specifically raised concerns about comparisons of time series broken down by education. Talk with Chris Schmid, author of the paper, “Increased mortality for white middle-aged Americans not fully explained by causes suggested.” You don’t need to talk with me—I’m just a number cruncher and claim no expertise on causes of death. But click on the link, wait 20 minutes for it to download and take a look at our smoothed mortality rate trends by state. There’s lots and lots there, much more than can be captured in any simple story.

P.S. Just to emphasize: I’m making no equivalence between Wansink and Case/Deaton. Wansink’s published work is riddled with errors and his data quality appears to be approximately zero; Case and Deaton are serious scholars, and all I’ve said about them is that they’ve made a couple of subtle statistical errors which have not invalidated their key conclusions. But all of them, for different reasons, have made claims that have elicited little opposition, hence unfortunate gaps in media coverage.

The post No-op: The case of Case and Deaton appeared first on Statistical Modeling, Causal Inference, and Social Science.

Learn Anomaly Detection, Deep Learning, or Customer Analytics in R online at Statistics.com with top instructors who are leaders in the field. Use code 3CAP17 before March 30 to save $170. Continue Reading…

### Because it's Friday: Run Ollie, Run!

When it comes to winning so much you get sick of winning, you need at least a basic level of competence. But when it came to the Crufts dog show this year, Ollie just didn't have it: (With thanks to reader MB.) To be fair, though, Ollie did much better than our own Jack Russell, Easy, would have done! That's all from us for this week. Have a great weekend, and we'll be back with more on Monday. See you then! Continue Reading…

### Turner: Advisor Analytics Architect

Seeking a candidate who leverages knowledge of the organization's information, application, and infrastructure environment, as well as the current technology landscape, to work with Operations Research and Data Scientists and other team members to design and implement a holistic, optimized analytics platform. Continue Reading…

### Comparing subreddits, with Latent Semantic Analysis in R

FiveThirtyEight published a fascinating article this week about the subreddits that provided support to Donald Trump during his campaign, and continue to do so today. Reddit, for those not in the know, is a popular online social community organized into thousands of discussion topics, called subreddits (their names all begin with "r/"). Most subreddits are useful forums for interesting discussions among like-minded people, but some of them are toxic. (That toxicity extends to some of the names, which is reflected in some of the screenshots below — apologies in advance.) The article looks at various popular and notorious subreddits and finds those most similar to the main subreddit devoted to Donald Trump, and also to those of the other main contenders in the 2016 presidential campaign, Hillary Clinton and Bernie Sanders.
The underlying method used to compare subreddits is quite ingenious. It's based on a concept you might call "subreddit algebra": you can "add" two subreddits and find a third that reflects the intersection of the two. (One example they give: adding r/nba to r/minnesota gives you r/timberwolves, the subreddit for Minnesota's NBA team.) Then you can apply the same process to subtraction: if you remove all the posts like those in the mainstream r/politics site from those in r/The_Donald, you're left with posts that look like those in several toxic subreddits. The statistical technique used to identify posts that are "similar" to one another is Latent Semantic Analysis, and the article gives a nice illustration of using it to compare subreddits. The analysis was performed in R, and the code is available on GitHub. The code makes heavy use of the lsa package for R, which provides a number of functions for performing latent semantic analysis. The triangular plot shown above — known as a ternary diagram — was created using the ggtern package. For the complete subreddit analysis, and the list of subreddits closest to Donald Trump based on the analysis, check out the FiveThirtyEight article linked below. FiveThirtyEight: Dissecting Trump's Most Rabid Following
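The "subreddit algebra" idea is easy to sketch. The toy example below (a Python sketch with an invented five-word vocabulary and made-up counts) represents each subreddit as a vector of term counts, adds two vectors, and finds the nearest remaining subreddit by cosine similarity. The real analysis additionally applies LSA's singular value decomposition to reduce the term space before comparing, which this sketch skips.

```python
import math

# Invented toy term-count vectors for a handful of subreddits
# (vocabulary: game, team, minnesota, lake, basketball)
vectors = {
    "r/nba":          [40, 60, 1, 0, 80],
    "r/minnesota":    [5, 10, 70, 50, 2],
    "r/timberwolves": [30, 50, 40, 5, 60],
    "r/cooking":      [2, 1, 0, 3, 0],
}

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "Subreddit algebra": add two vectors, then find the nearest other subreddit.
combo = [a + b for a, b in zip(vectors["r/nba"], vectors["r/minnesota"])]
best = max((s for s in vectors if s not in ("r/nba", "r/minnesota")),
           key=lambda s: cosine(combo, vectors[s]))
print(best)  # with these toy counts: r/timberwolves
```

Subtraction works the same way with element-wise differences (floored at zero), which is how the article gets from r/The_Donald minus r/politics to the toxic subreddits.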
Continue Reading…

### R Packages worth a look

Biological Relevance Testing (brt): Analyses of large-scale -omics datasets commonly use p-values as the indicators of statistical significance. However, considering p-value alone neglects the importance of effect size (i.e., the mean difference between groups) in determining the biological relevance of a significant difference. Here, we present a novel algorithm for computing a new statistic, the biological relevance testing (BRT) index, in the frequentist hypothesis testing framework to address this problem.

Finding Optimal Three-Group Splits Based on a Survival Outcome (rolr): Provides fast procedures for exploring all pairs of cutpoints of a single covariate with respect to survival and determining optimal cutpoints using a hierarchical method and various ordered logrank tests.

‘Htmlwidgets’ in Responsive ‘iframes’ (widgetframe): Provides two functions, ‘frameableWidget()’ and ‘frameWidget()’. The ‘frameableWidget()’ is used to add extra code to a ‘htmlwidget’ which allows it to be rendered correctly inside a responsive ‘iframe’. The ‘frameWidget()’ is a ‘htmlwidget’ which displays content of another ‘htmlwidget’ inside a responsive ‘iframe’. These functions allow for easier embedding of ‘htmlwidgets’ in content management systems such as ‘wordpress’, ‘blogger’ etc. They also allow for separation of widget content from main HTML content where CSS of the main HTML could interfere with the widget.

Gradient and Vesselness Tools for Arrays and NIfTI Images (vesselr): Simple functions capable of providing gradient, hessian, and vesselness for a given 3-dimensional volume.
Continue Reading…

### Magister Dixit

“When information is cheap, attention becomes expensive.” James Gleick Continue Reading…

### Why smoking is still so widespread

MORE than 50 years after it became clear that smoking kills, the habit remains the leading preventable cause of death, with an annual toll of nearly 6m lives. A study published this week in the Lancet, a medical journal, helps to explain why it is so enduring. Continue Reading…

### One statement employing two resistance tactics to fend off the data

I want to parse this statement by new EPA Chief Scott Pruitt, as quoted in this New York Times article: I think that measuring with precision human activity on the climate is something very challenging to do and there’s tremendous disagreement about the degree of impact, so no, I would not agree that it’s a primary contributor to the global warming that we see. I'm not going to talk about the politics of climate change in this post, but rather to point out that Pruitt used two popular resistance tactics adopted by people who don't like what the data are suggesting. I have been party to countless such exchanges in business meetings. Tactic #1: Claiming that data should be ignored until it is made "precise". Since nothing can be 100% precise, the request for more precise data is akin to a demand to eliminate "terrorism". Imprecision, like terrorism, is a feature of the world we live in. One can work to decrease imprecision, or reduce terrorism, but one can't and won't eradicate either. This tactic is specifically tailored to data analysts, who, being logical thinkers, will never vouch for anything 100%. It works particularly well on ethical data analysts. People using Tactic #1 are daring the analysts to stand up and make false claims of 100% precision. (The tactic might not work on, say, IBM marketers, who have made some stupendous claims.) The demand for more "precision" always leads to a demand for more analysis. The cycle continues.
Tactic #1 stipulates a black-and-white world: the data is either 100% precise (good) or 100% imprecise (bad), and anything in between is lumped with the 100% imprecise. If this criterion were applied to all business decisions, there would be no risk-taking, no investments of any kind, and capitalism would grind to a halt. If Pfizer refused to spend money developing new drugs unless the data were 100% precise, it would never start any project. If a real estate developer refused to take out a loan unless he were 100% sure all of the space could be rented out at desirable prices within the first year, he would never undertake any project. However, when it comes to such decisions, the same decision-makers who fear the scourge of imprecision suddenly recast themselves as "betting men." Tactic #2: Pointing to disagreement as a reason to refute the conclusion. An easy way to derail meetings is to have a few people blow smoke at the data. The questions raised are frequently trivial, sometimes irrelevant, but the questioning produces an air of doubt. For example, an analyst might conclude that the overall customer satisfaction rating has been trending down. Even if the aggregate rating is in decline, there will certainly exist a few counties or neighborhoods in which the rating has sharply risen. The people resisting the data will insist on investigating those counties, even if they contain only 0.01% of the customer base. Imagine you are a third party to the debate, with no knowledge of the subject matter (for example, you are the controller attending a meeting about customer loyalty). You listen to a prolonged discussion of cases that may or may not contradict the data analyst's conclusion. You only know what you are hearing at the meeting. It's not surprising that you think there is "tremendous disagreement" and that the conclusion may be dubious.
In reality, there are only a few loud dissenters, objecting because the conclusion does not confirm their preconceptions. Any budding data analyst has to be prepared to handle these situations. Pruitt's quote is a perfect encapsulation of the common data-busting tactics. He is saying (a) the data is imprecise and (b) some people disagree, therefore (c) I reject your conclusion and (d) I am free to believe whatever I want. P.S. Scott Klein pointed me to this related article by the great Tim Harford, which covers similar ground while filling in some history. It's a Financial Times article, so you may need to figure out their leaky paywall ("The problem with facts" by Tim Harford; the link goes to his website, not to FT, which is behind a paywall). Continue Reading…

### Apple: Software Engineer – Local Search

Seeking strong developers with a passion for relevance and the end-user experience, to contribute to monitoring, modeling, and improving local search for Siri, Spotlight, Safari and other end points for users interacting with Apple Maps. Continue Reading…

### Science and Technology links (March 24, 2017)

There are many claims that innovation is slowing down. In the 20th century, we went from horses to planes. What have we done lately? We have not cured cancer or old age. We did get the iPhone. There is that. But so what? There are many claims that Moore’s law, the observation that processors get twice as good every two years or so, is faltering if not failing entirely. Then there is Eroom’s law, the observation that new medical drugs are getting exponentially more expensive. I don’t think that anyone questions the fact that we are still on an exponential curve… but it matters whether we make progress at a rate of 1% a year or 5% a year. So what might be happening? Why would we slow down? Some believe that all of the low-hanging fruit has been picked.
So we invented the airplane, the car, and the iPhone, that was easy, but whatever remains is too hard. There is also the theory that as we do more research, we start duplicating our efforts in vain. Knott looked at the data and found something else: One thought is that if R&D has truly gotten harder, it should have gotten harder for everyone. (…) That’s not what I found when I examined 40 years of financial data for all publicly traded U.S. firms. I found instead that maximum RQ [R&D productivity] was actually increasing over time! (…) I restricted attention to a particular sector, e.g., manufacturing or services. I found that maximum RQ was increasing within sectors as well. I then looked at coarse definitions of industry, such as Measuring Equipment (Standard Industrial Classification 38), then successively more narrow definitions, such as Surgical, Medical, And Dental Instruments (SIC 384), then Dental Equipment (SIC 3843). What I found was that as I looked more narrowly, maximum RQ did decrease over time (…) What the pattern suggests is that while opportunities within industries decline over time, as they do, companies respond by creating new industries with greater technological opportunity. The way I understand this finding is that once an industry reaches maturity, further optimizations will provide diminishing returns… until someone finds a different take on the problem and invents a new industry. With time, animals accumulate senescent cells. These are cells that should die (by apoptosis) but somehow stick around. This happens very rarely, so no matter how old you are, you have very few senescent cells, to the point where a biologist would have a hard time finding them. But they cause trouble, a lot of trouble it seems. They might be responsible for a sizeable fraction of age-related health conditions. Senolytics are agents that help remove senescent cells from your tissues. 
There is a natural product (quercetin), found in apples and health stores, that is a mild senolytic. (I do not recommend you take quercetin, though eating apples is fine.) A few years ago, I had not heard about senolytics. Judging by the Wikipedia page, the idea emerged around 2013. A quick search in Google Scholar seems to confirm that 2013 is roughly accurate. (Update: Josh Mitteldorf credits work by Jan van Deursen of Mayo Clinic dating back to 2011.) You may want to remember this term. Anyhow, the BBC reported on a recent trial in mice: They have rejuvenated old mice to restore their stamina, coat of fur and even some organ function. The findings, published in the journal Cell, showed liver function was easily restored and the animals doubled the distance they would run in a wheel. Dr de Keizer said: “We weren’t planning to look at their hair, but it was too obvious to miss.” “In terms of mouse work we are pretty much done, we could look at specific age-related diseases eg osteoporosis, but we should now prepare for clinical translation.” At this point, the evidence is very strong that removing senescent cells is both practical and beneficial. It seems very likely that, in the near future, older people will be healthier thanks to senolytics. However, details matter. For example, senescent cells help your skin heal, so removing all of your senescent cells all the time would not be a good thing. Moreover, senolytics are likely slightly toxic; after all, they cause some of your cells to die, so you would not want to overdose. You probably just want to maintain senescent cells at a low level, through periodic “cleansing”. How best to achieve this is a matter of research. Are professors going to move to YouTube and make a living there? Some are doing it now. Professor Steve Keen has gone to YouTube to ask people to fund his research. Professor Jordan Peterson claims that he makes something like $10k a month through donations supporting his YouTube channel.
I am not exactly sure who supports these people and what it all means.

We are inserting synthetic cartilage in people with arthritis.

It seems that the sugar industry paid scientists to dismiss the health concerns regarding sugar:

The article draws on internal documents to show that an industry group called the Sugar Research Foundation wanted to “refute” concerns about sugar’s possible role in heart disease. The SRF then sponsored research by Harvard scientists that did just that. The result was published in the New England Journal of Medicine in 1967, with no disclosure of the sugar industry funding.

I think we should all be aware that sugar in large quantities puts you at risk of obesity, heart disease, and diabetes. True dark chocolate is probably fine, however.

It seems that when it comes to fitness, high-intensity exercise (interval training) works really well, no matter your age: it improves muscle mitochondrial function and hypertrophy at all ages. [Translation: you have more energy (mitochondrial function) and larger muscles (hypertrophy).] So the treadmill and the long walks? They may help a bit, but if you want to get in shape, you had better crank up the intensity.

John P. A. Ioannidis has made a name for himself by criticizing modern-day science. His latest paper is Meta-assessment of bias in science, and the gist of it is:

we consistently observed that small, early, highly-cited studies published in peer-reviewed journals were likely to overestimate effects.

What does this mean in concrete terms? Whenever you hear about a breakthrough for the first time, take it with a grain of salt. Wait for the results to be confirmed independently. Also, per the paper's results, we may consider established researchers more reliable.
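A quick simulation makes the small-study bias concrete (a sketch with made-up parameters, not Ioannidis's actual method): if many small studies of a modest true effect are filtered by statistical significance, the surviving estimates systematically overshoot the truth.

```python
import random
import statistics

def simulate(true_effect=0.2, n=20, n_studies=2000, seed=42):
    """Run many small studies of the same true effect; return the mean
    estimate over all studies and over only the 'significant' ones."""
    rng = random.Random(seed)
    all_est, significant_est = [], []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        est = statistics.mean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        all_est.append(est)
        if abs(est / se) > 1.96:  # crude significance filter
            significant_est.append(est)
    return statistics.mean(all_est), statistics.mean(significant_est)

overall, published = simulate()
print(f"true effect 0.20 | all studies {overall:.2f} | significant only {published:.2f}")
# The significant-only average lands well above the true effect.
```

Larger studies shrink the standard error, so far fewer results get filtered out and the inflation mostly disappears, which is why waiting for bigger, independent replications is good advice.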

Viagra not only helps with erectile dysfunction; it seems that it keeps heart disease at bay too. But Viagra is off patent at this point, so pharmaceutical companies are unlikely to shell out millions to market it for other uses. Maybe the government or academics should do this kind of research?

When thinking about computer performance, we often think of the microprocessor. However, storage and memory are often just as important as processors for performance. The latest boost we got was solid-state disks (SSDs), and what a difference they make! Intel is now commercializing what might be the start of a new breakthrough (3D XPoint). Like a disk, 3D XPoint memory is persistent, but it has latency closer to that of internal memory. Also, unlike our solid-state drives, this memory is byte addressable: you can modify individual bytes without having to rewrite entire pages of memory. In effect, Intel is blurring the distinction between storage and memory. For less than $2,000, you can now get a fancy disk holding hundreds of gigabytes that works a bit like internal memory. The long-term picture is that we may get more and more computers with persistent memory that has nearly the performance of our current volatile memory, without the need to be powered all the time. This would allow our computers to have a lot more memory. Of course, for this to happen, we need more than just 3D XPoint, but chances are good that competitors are hard at work building new types of persistent memory. Leonardo da Vinci once invented a “self-supporting bridge”. Basically, given a few straight planks, you can quickly build a strong bridge without nails or rope. You just assemble the planks and you are done. It is quite impressive: I would really like to know how da Vinci’s mind worked. Whether it was ever practical, I do not know. But I came across a cute video of a dad and his son building one. We have been told repeatedly that the sun is bad for us. Lindqvist et al., in Avoidance of sun exposure as a risk factor for major causes of death, find contrary evidence. If you are to believe their results, it is true that if you spend more time in the sun, you are more likely to die of cancer.
However, this is because you are less likely to die of other causes: Women with active sun exposure habits were mainly at a lower risk of cardiovascular disease (CVD) and noncancer/non-CVD death as compared to those who avoided sun exposure. As a result of their increased survival, the relative contribution of cancer death increased in these women. Nonsmokers who avoided sun exposure had a life expectancy similar to smokers in the highest sun exposure group, indicating that avoidance of sun exposure is a risk factor for death of a similar magnitude as smoking. Compared to the highest sun exposure group, life expectancy of avoiders of sun exposure was reduced by 0.6–2.1 years. Sun exposure is good for your health and makes you live longer. No, we do not know why. Our DNA carries the genetic code that makes us what we are. Our cells use DNA as a set of recipes to make useful proteins. We know that as we age, our DNA does not change much. We know because if we take elderly identical twins, their genetic code is very similar. So the body is quite careful not to let our genes get corrupted. Random mutations do occur, but a single defective cell is hardly cause for concern. Maybe you are not impressed to learn that your cells preserve their genetic code very accurately, but you should be. Each day, over 50 billion of your cells die through apoptosis and must be replaced. You need 2 million new red blood cells per second alone. Anyhow, DNA is not the only potential source of trouble. DNA is not used directly to make proteins; our cells use RNA instead. So there is a whole complicated process to get from DNA to protein, and even if your DNA is sane, the produced protein could still be bad. A Korean team recently showed that something called “nonsense-mediated mRNA decay” (NMD), a quality-control process for RNA, could extend or shorten the lifespan of worms when tweaked.
Thus, even if you have good genes, it is possible that your cells could start making junk instead of useful proteins as you grow older. Our bodies are built and repaired by our stem cells. Though we have much to learn, we know that injecting stem cells into a damaged tissue may help make it healthier. In the future, it is conceivable that we may regenerate entire organs in vivo (in your body) by stem cell injections. But we need to produce the stem cells first. The latest trend in medicine is “autologous stem cell transplantation”: we take your own cells, modify them as needed, and then reinject them as appropriate stem cells where they may help. This is simpler, for obvious reasons, than using donated stem cells. For one thing, these are your own cells, so they are not likely to be rejected as foreign. But many sick people are quite old. Are the stem cells from old people still good enough? In Regenerative capacity of autologous stem cell transplantation in elderly, Gonzalez-Garza and Cruz-Vega tell us that they are: stem cells from elderly donors are capable of self-renewal and differentiation in vitro. That’s true even though the gene expression of stem cells taken from elderly donors differs from that of younger donors. In a Nature article, researchers report being able to cause a tooth to regrow using stem cells. The subject (a dog) saw a whole new tooth grow and become fully functional. If this works, then we might soon be able to do the same in human beings. Can you imagine regrowing a whole new tooth as an adult? It seems that we can do it. Back in 2010, researchers set up the ImageNet challenge. The idea was to take a large collection of images and to ask a computer what was in each image. For the first few years, the computers were far worse than human beings. Then they got better and better. And better.
Today, machines have long surpassed human beings, to the point of making the challenge less relevant, in the same way Deep Blue defeating Kasparov made computer chess programs less exciting. It seems the competition is closing with a last workshop: “The workshop will mark the last of the ImageNet Challenge competitions, and focus on unanswered questions and directions for the future.” I don’t think that the researchers imagined, back in 2010, that the competition would be so quickly defeated. Predicting the future is hard. A new company, Egenesis, wants to build genetically modified pigs that can be used as organ donors for human beings. George Church from Harvard is behind the company. Intel, the company that makes the microprocessors in your PCs, is creating an Artificial Intelligence group. Artificial Intelligence is quickly reaching peak hype. Greg Linden, reacting to “TensorFlow [an AI library] is the new foundation for Computer Science”: “No. No, it’s not.” Currently, if you want to stop an incoming rocket or a drone, you have to use a missile of your own. That’s expensive. It looks like Lockheed Martin has a laser powerful enough to stop a rocket or a drone. In time, this should be much more cost-effective. Want to protect an airport from rockets and drones? Deploy lasers around it. Next question: can you build drones that are impervious to lasers? Andy Pavlo is a computer science professor who tries to have real-world impact. How do you do such a thing? From his blog: (…) the best way to have the most impact (…) is to build a system that solves real-world problems for people. Andy is right, of course, but what is amazing is that this should even be a question in the first place. Simply put: it is really hard to have an impact on the world by writing academic papers. Very hard. There is a new planet in our solar system: “it is almost 10 times heavier than the Earth”. Maybe. Western Digital sells 14TB disks, filled with helium. This is huge.
Netflix is moving from a rating system based on 5 stars to a thumbs-up, thumbs-down model. In a New Scientist article, we learn about new research regarding the “rejuvenation of old blood”. It is believed that older people have too much of some factors in their blood, and that simply regularizing these levels would have a rejuvenating effect. But, of course, it may also be the case that old blood is missing some “youthful factors”, and that tissues other than the blood, such as the bone marrow, need them. This new research supports this view: When Geiger’s team examined the bone marrow of mice, they found that older animals have much lower levels of a protein called osteopontin. To see if this protein has an effect on blood stem cells, the team injected stem cells into mice that lacked osteopontin and found that the cells rapidly aged. But when older stem cells were mixed in a dish with osteopontin and a protein that activates it, they began to produce white blood cells just as young stem cells do. This suggests osteopontin makes stem cells behave more youthfully (EMBO Journal, doi.org/b4jp). “If we can translate this into a treatment, we can make old blood young again,” Geiger says. Tech people often aggregate in specific locations, such as Silicon Valley, where there are jobs, good universities, great experts and a lot of capital. This leads to rising costs of living and high real estate prices. Meanwhile, you can buy houses for next to nothing if you go elsewhere. It seems that the price differential keeps rising. Will it go on forever? Tyler Cowen says that it won’t. He blames high real estate prices on the fact that technology disproportionately benefits specific individuals. However, he says, technology invariably starts to benefit a wider share of the population, and when it does, real estate prices tend toward a fairer equilibrium.
Continue Reading…

### Building Shiny App Exercises (part-8)

(This article was first published on R-exercises, and kindly contributed to R-bloggers) Transform your App into a Dashboard. Now that we have covered the basics you need to know in order to build your App, it is time to enhance its appearance and functionality. The interface is very important for the user, as it must not only be friendly but also easy to use. In this part we will transform your Shiny App into a beautiful Shiny Dashboard. First we will create the interface, and then, step by step, we will “move” the App you built in the previous parts into it. In part 8 we will move the app step by step into your dashboard, and in the last two parts we will enhance its appearance even more and, of course, deploy it. Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin! Answers to the exercises are available here. INSTALLATION The packages that we are going to use are shinydashboard and shiny. To install them, run: install.packages("shinydashboard") install.packages("shiny") Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.). Exercise 1 Install the package shinydashboard and the package shiny in your working directory. BASICS A dashboard has three parts: a header, a sidebar, and a body. Here’s the most minimal possible UI for a dashboard page. ## ui.R ## library(shinydashboard)   dashboardPage( dashboardHeader(), dashboardSidebar(), dashboardBody() ) Exercise 2 Add a dashboardPage and then Header, Sidebar and Body into your UI. HINT: Use dashboardPage, dashboardHeader, dashboardSidebar, dashboardBody.
First of all, we should give it a title, like below: ## ui.R ## library(shinydashboard)   dashboardPage( dashboardHeader(title = "Dashboard"), dashboardSidebar(), dashboardBody() ) Exercise 3 Name your dashboard “Shiny App”. HINT: Use title. Next, we can add content to the sidebar. For this example we’ll add menu items that behave like tabs. These function similarly to Shiny’s tabPanels: when you click on one menu item, it shows a different set of content in the main body. There are two parts that need to be done. First, you need to add menuItems to the sidebar, with appropriate tabNames. ## Sidebar content dashboardSidebar( sidebarMenu( menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")), menuItem("Widgets", tabName = "widgets", icon = icon("th")) ) ) Exercise 4 Create three menuItem, name them “DATA TABLE”, “SUMMARY” and “K-MEANS” respectively. Make sure to use a distinct tabName for each one of them. The icon is of your choice. HINT: Use menuItem, tabName and icon. In the body, add tabItems with corresponding values for tabName: ## Body content dashboardBody( tabItems( tabItem(tabName = "dashboard", h2("Dashboard"), fluidRow( box() ) ), tabItem(tabName = "widgets", h2("WIDGETS") ) ) ) Exercise 5 Add tabItems in dashboardBody. Be sure to give each tabItem the same tabName as its menuItem to get them linked. HINT: Use tabItems, tabItem, h2. Obviously, this dashboard isn’t very useful. We’ll need to add components that actually do something. In the body we can add boxes that have content. First let’s create a box for our dataTable in the corresponding tabItem. ## Body content dashboardBody( tabItems( tabItem(tabName = "dashboard", h2("Dashboard"), fluidRow( box() ) ), tabItem(tabName = "widgets", h2("WIDGETS") ) ) ) Exercise 6 Specify the fluidRow and create a box inside the “DATA TABLE” tabItem. HINT: Use fluidRow and box. Exercise 7 Do the same for the other two tabItem.
Create one fluidRow and one box in the "SUMMARY" tabItem, and another fluidRow with four boxes in the "K-MEANS" tabItem.

Now just copy and paste the code below, which you used in part 7, to move your dataTable inside the "DATA TABLE" tabItem.

```r
#ui.R
dataTableOutput("Table"), width = 400

#server.R
output$Table <- renderDataTable(
  iris, options = list(
    lengthMenu = list(c(10, 20, 30, -1), c('10', '20', '30', 'ALL')),
    pageLength = 10))
```

Exercise 8

Place the sample code above in the right place in order to add the dataTable “Table” inside the “DATA TABLE” tabItem.
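To see how the pieces of these exercises fit together, here is a minimal self-contained sketch of the app at this stage (our own illustration, not the official answer key: the tab name "dt", the table icon, and the full-width box are our choices):

```r
library(shiny)
library(shinydashboard)

# A single-tab dashboard holding the part-7 dataTable
ui <- dashboardPage(
  dashboardHeader(title = "Shiny App"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("DATA TABLE", tabName = "dt", icon = icon("table"))
    )
  ),
  dashboardBody(
    tabItems(
      tabItem(tabName = "dt",
        fluidRow(
          box(width = 12, dataTableOutput("Table"))
        )
      )
    )
  )
)

server <- function(input, output) {
  output$Table <- renderDataTable(
    iris, options = list(
      lengthMenu = list(c(10, 20, 30, -1), c('10', '20', '30', 'ALL')),
      pageLength = 10))
}

# shinyApp(ui, server)  # uncomment to run locally
```

The remaining tabs ("SUMMARY", "K-MEANS") are added the same way, one `tabItem` per `menuItem`.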

Now just copy and paste the code below, which you used in part 7, to move the dataTable "Table2" inside the "SUMMARY" tabItem.
```r
#ui.R
dataTableOutput("Table2"), width = 400

#server.R
sumiris <- as.data.frame.array(summary(iris))
output$Table2 <- renderDataTable(sumiris)
```

Exercise 9

Place the sample code above in the right place in order to add the dataTable "Table2" inside the "SUMMARY" tabItem. Do the same for the last exercise: you just have to put the code from part 7 inside the "K-MEANS" tabItem.

Exercise 10

Place the K-Means plot and the three widgets from part 7 inside the four boxes you created before.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

### PeachIH: Data Scientist, Machine Learning Engineer

An Artificial Intelligence and Machine Learning (AI/ML) healthcare company is seeking candidates for multiple positions: Data Scientists, Big Data Engineers, Machine Learning Engineers, Software Engineers (mobile, web, reporting).

Continue Reading…

### RIP AlchemyAPI: What to consider before migrating to Watson

### The end of AlchemyAPI

Since the very beginning we've competed head to head with AlchemyAPI. We have always had a lot in common with them, from our initial mission of bringing NLP to the masses to launching features and products with similar goals and objectives. Even though we wouldn't have admitted it then, we looked up to Alchemy in many ways. The news of their acquisition by IBM meant a couple of things: 1. our biggest competitor was going to be incorporated into a much larger platform and ecosystem, and 2. the NLP-as-a-service market started to gather even more hype. But more than that, the acquisition opened an opportunity for us to gather some of the market share left by the fallout. What's happened since then has been interesting.
We hear them mentioned less and less in our sales calls, and we've noticed more and more customers make the move to us rather than to Watson/Bluemix/Knowledge Studio…whatever it is you need to move to 😉 When we ask our ex-AlchemyAPI users why they've decided to move to us over sticking with IBM, one particular theme rings true: the experience we provide is better than working with IBM Watson. When they talk about experience they refer to some key points:

• Support
• Developer experience
• Flexibility
• Transparent pricing*
• Ease of use

* I dare you to try to figure out how much your monthly spend will be on a Watson service!

### The deprecation of the service

Yes, all of this was inevitable, and to be honest we were surprised it took this long. AlchemyAPI had built a strong brand, and for a lot of the reasons mentioned above it made sense for IBM to hold on to that Alchemy branding and to try to win over their user base. In the last couple of weeks IBM announced they're going to kill the AlchemyAPI services by deprecating the AlchemyLanguage and AlchemyNews products. It's being touted as a rebrand, and for the most part it is. However, there are some key elements you need to be aware of in both products before you consider making the move to Watson.

### So what's actually happening?

IBM are shutting down two core AlchemyAPI services, AlchemyLanguage and AlchemyNews. As of the 7th of April you can't create any new instances of either Alchemy service, and support will cease on the 7th of April 2018 for both products. IBM's advice is to switch to one of their other existing services: Watson NLU or Watson Natural Language Classifier for AlchemyLanguage users, and the Watson Discovery News service for AlchemyNews users. As we already mentioned, this was all expected to happen eventually, and it all seems pretty straightforward, right?
Not really…

### What you need to know

We've been investigating this for the last couple of days and we're still a little confused as to what a typical migration will look like for users. Figuring out which of the 120+ services on Bluemix you need is hard, deciding which Watson service you should migrate to is confusing, and it's really not clear what elements of AlchemyAPI they are keeping and what exactly they're dumping in the bin. Even though they've made the effort to phase out the Alchemy services with the customer in mind, in our opinion they haven't made working with Bluemix and/or Watson easy from the get-go, no matter which service you're using. This means existing Alchemy users are going to be faced with a number of challenges: changes in pricing, flexibility and accessibility, and the deprecation of some core features.

### 1. Using Watson/Bluemix; have they lost the built-for-developers feel?

#### Access and ease of use

At AYLIEN we obsess over the "Developer Experience". We take steps to make it as easy as possible for our users to get up and running with the service. Our documentation is clear, complete and easy to understand and navigate. We provide SDKs for 7 different programming languages. We have a robust and fully functional free product offering, and we do everything we can to make even our paid plans accessible to development teams of all shapes and sizes. The accessibility of our tech is something we feel very strongly about at AYLIEN, and that's something we believe was important to the AlchemyAPI team too. However, it's difficult to say the same about IBM. We can't comment on the details of their strategy, but from what we hear from speaking with our users, and from the accessibility of Bluemix and Watson as a whole, things are looking a little grim for developers hoping to use the Watson language services.
• You need a Bluemix account to access the Watson services, which is only available on a 30-day trial
• For most of the Watson services you'll only find 3 to 5 SDKs available
• The pricing structure of many of the Watson services is extremely prohibitive and downright confusing (more on that below)

#### Flexibility and support

What we hear time and time again from our customers is that working with us is just easier. Put simply, they don't want to deal with a beast of an enterprise like IBM. They want to know that if they have a feature request or feedback they'll be listened to; they don't want to jump through hoops and engage in a drawn-out sales process in order to use the service; they want to know that the tech they're using is constantly evolving and that the team behind it is passionate about advancing it; but above all, they want the reassurance that if something goes wrong they can pick up the phone and talk to someone who cares.

We do everything we can to make sure our customers are getting the most out of our APIs. We'll run on-boarding calls with new users, we consult with our users on how they can integrate with their solutions, and from time to time we'll work directly with our customers to customize existing features to their specific needs. This level of service is only possible because of our ability to stay flexible and agile.

Sample support thread – Yes, our founder still handles some support queries

### 2. Pricing; what's going to happen to your monthly cost?

If you do plan on sitting down to figure out how your cost might change after your move to Watson, we recommend grabbing a coffee and a snack to get you through it. We have created some comparison tables below to provide a quick overview of some of the savings you can make by migrating to AYLIEN over Watson. They are broken down by service and describe what you should expect to pay as an AlchemyLanguage or AlchemyNews user moving to the Watson services vs what you would pay with AYLIEN.
#### AlchemyLanguage users

Note: the Watson NLU and NLC services are charged on a tiered basis based on the number of Natural Language Units you use. Like our pricing it's volume based, so we were able to compare some price points based on example volumes.

##### Pricing example:

| Hits / NLU units | AYLIEN | Watson | Saving |
|---|---|---|---|
| 180,000 | $199 | $540 | $341 |
| 1,000,000 | $649 | $1,500 | $851 |
| 2,000,000 | $649 | $2,500 | $1,851 |

#### AlchemyNews users

##### Pricing Example:

Note: Watson Discovery News and the AYLIEN News API are priced a little differently. Our pricing is based on how many stories you collect and analyze (we don't care how many queries you make), while Watson Discovery News charges per query made, plus enrichments. In Watson Discovery News, however, there is a limit of 50 results (stories) per query, which means it's not too difficult to compare the pricing based on an example number of stories collected.

| Stories | AYLIEN | Watson | Saving |
|---|---|---|---|
| 100,000 | $211 | $200 | -$11 |
| 1,000,000 | $1,670 | $2,000 | $330 |
| 2,000,000 | $2,871 | $4,000 | $1,129 |

## – We're running an Alchemy Amnesty offer –

We know that many of Alchemy's hardcore fans are looking for an alternative that is just as user friendly and powerful, which is why we're running an Alchemy Amnesty: we're giving away 2 months free on any of our plans to any customer moving from AlchemyAPI to AYLIEN. To avail of the offer, sign up here and just drop us an email to sales@aylien.com.

### 3. Features; what's going and what's staying?

The AlchemyLanguage features are being incorporated into Watson NLU and Watson NLC, while the AlchemyNews product is being incorporated into a larger product named Watson Discovery News. As part of the migration, however, there are a number of changes to the feature set, which we've set out below. We've only listed the features that are either changing or being canned altogether.

#### AlchemyLanguage (Watson NLU vs AYLIEN Text API)

• Entity & Concept Extraction
• Sentiment Analysis
• Language Detection
• Image Tagging (separate service)
• Related Phrases (separate service)
• Summarization
• Hashtag Suggestion
• Article Extraction
• Semantic Labelling
• Microformat Extraction
• Date Extraction

#### AlchemyNews (Watson Discovery News vs AYLIEN News API)

• Customized Sources
• Source Rank (Blekko vs Alexa)
• Industry Taxonomies (1 vs 2)
• Languages (1 vs 6)
• Summarization
• Clustering
• Social Stats
• Similar Articles/Stories
• Deduplication

Don't just take our word for it; make up your own mind. All the details you need to start testing our service for each of our solutions are listed below.

Text API
News API

### Final thoughts

Users are moving to AYLIEN for a variety of reasons that we outlined above. The primary drivers are ease of use, flexibility, support, feature set and pricing. If you're dreading making the move to IBM and you miss the experience of dealing with a dev-friendly team, we've got you covered ;).

The post RIP AlchemyAPI: What to consider before migrating to Watson appeared first on AYLIEN.
Continue Reading…

### Getting Started with Deep Learning

This post approaches getting started with deep learning from a framework perspective. Gain a quick overview and comparison of available tools for implementing neural networks to help choose what's right for you.

Continue Reading…

### Around The Blogs In 78 Hours

Yes, this is an issue, and the blogs are helping to see through some of it:

• Jort and his team have released Audioset
• Thomas
• Bob

Continue Reading…

### Superpixels in imager

(This article was first published on R – dahtah, and kindly contributed to R-bloggers)

Superpixels are used in image segmentation as a pre-processing step. Instead of segmenting pixels directly, we first group similar pixels into "superpixels", which can then be processed further (and more cheaply).

(image from Wikimedia)

The current version of imager doesn't implement them, but it turns out that SLIC superpixels are particularly easy to implement. SLIC is essentially k-means applied to pixels, with some bells and whistles. We could use k-means to segment images based on colour alone. To get good results on colour segmentation the CIELAB colour space is appropriate, because it tries to be perceptually uniform.

```r
library(tidyverse)
library(imager)
im <- load.image("https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Aster_Tataricus.JPG/1024px-Aster_Tataricus.JPG")
#Convert to CIELAB colour space, then create a data.frame with three colour channels as columns
d <- sRGBtoLab(im) %>% as.data.frame(wide="c") %>% dplyr::select(-x,-y)
#Run k-means with 2 centers
km <- kmeans(d,2)
#Turn cluster index into an image
seg <- as.cimg(km$cluster,dim=c(dim(im)[1:2],1,1))
plot(im,axes=FALSE)
highlight(seg==1)
```


We mostly manage to separate the petals from the rest, with a few errors here and there.
SLIC does pretty much the same thing, except we (a) use many more centers and (b) we add pixel coordinates as features in the clustering. The latter ensures that only adjacent pixels get grouped together.

The code below implements SLIC. It’s mostly straightforward:

```r
#Compute SLIC superpixels
#im: input image
#nS: number of superpixels
#compactness: determines compactness of superpixels;
#low values will result in pixels with weird shapes
#... further arguments passed to kmeans
slic <- function(im,nS,compactness=1,...)
{
    #If image is in colour, convert to CIELAB
    if (spectrum(im) == 3) im <- sRGBtoLab(im)

    #The pixel coordinates vary over 1...width(im) and 1...height(im)
    #Pixel values can be over a widely different range
    #We need our features to have similar scales, so
    #we compute relative scales of spatial dimensions to colour dimensions
    sc.spat <- (dim(im)[1:2]*.28) %>% max #Scale of spatial dimensions
    sc.col <- imsplit(im,"c") %>% map_dbl(sd) %>% max

    #Scaling ratio for pixel values
    rat <- (sc.spat/sc.col)/(compactness*10)

    X <- as.data.frame(im*rat,wide="c") %>% as.matrix
    #Generate initial centers from a grid
    ind <- round(seq(1,nPix(im)/spectrum(im),l=nS))
    #Run k-means
    km <- kmeans(X,X[ind,],...)

    #Return segmentation as image (pixel values index cluster)
    seg <- as.cimg(km$cluster,dim=c(dim(im)[1:2],1,1))
    #Superpixel image: each pixel is given the colour of the superpixel it belongs to
    sp <- map(1:spectrum(im),~ km$centers[km$cluster,2+.]) %>% do.call(c,.) %>% as.cimg(dim=dim(im))
    #Correct for ratio
    sp <- sp/rat
    if (spectrum(im)==3)
    {
        #Convert back to RGB
        sp <- LabtosRGB(sp)
    }
    list(km=km,seg=seg,sp=sp)
}
```

Use it as follows:

```r
#400 superpixels
out <- slic(im,400)
#Superpixels
plot(out$sp,axes=FALSE)
#Segmentation
plot(out$seg,axes=FALSE)
#Show segmentation on original image
(im*add.colour(abs(imlap(out$seg)) == 0)) %>% plot(axes=FALSE)
```


The next step is to segment the superpixels but I’ll keep that for another time.


### IBM Chief Data Officer Strategy Summit, March 29-30, San Francisco – free VIP passes

Join over 150 Chief Data Officers, Chief Analytics Officers and other senior data leaders in San Francisco. A few VIP complimentary places are still available.

### Writing a conference abstract the data science way

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Conferences are an ideal platform to share your work with the wider community. However, as we all know, conferences require potential speakers to submit abstracts about their talks. And writing abstracts is not necessarily the most rewarding work out there. I have actually never written one, so when asked to prepare abstracts for this year's conferences I didn't really know where to start.

So, I did what any sane person would do: get data. As Mango has organised a number of EARL conferences, there is a good supply of abstracts available, both accepted and not accepted. In this blogpost I'm going to use the tidytext package to analyse these abstracts and see what distinguishes the accepted abstracts from the rest.

Disclaimer: the objective of this blogpost is not to present a rigorous investigation into conference abstracts but rather an exploration of, and potential use for, the tidytext package.

## The data

I don’t know what it’s like for other conferences, but for EARL all abstracts are submitted through an online form. I’m not sure if these forms are stored in a database, but I received them as PDFs. To convert the PDFs to text I make use of the pdftotext program, as outlined in this stackoverflow thread.

```r
list_of_files <- list.files("Abstracts Received/", pattern=".pdf", full.names=TRUE)
convert_pdf <- function(fileName){
  outputFile <- paste0('"', gsub('.pdf', '.txt', fileName), '"')
  inputFile <- paste0('"', fileName, '"')
  # I turned the layout option on because the input is a somewhat tabular layout
  command <- paste('"C:/Program Files/xpdf/bin64/pdftotext.exe" -layout', inputFile, outputFile)
  system(command, wait=FALSE)
}
result <- map(list_of_files, convert_pdf)
failures <- length(list_of_files) - sum(unlist(result))
if(failures > 0){
  warning(sprintf("%i files failed to convert", failures))
}
```


Now that I have converted the files I can read them in and extract the relevant fields.

```r
# load functions extract_abstracts(), remove_punctuation()
source('Scripts/extract_abstracts.R')

input_files <- list.files("Abstracts Received/", pattern=".txt", full.names=TRUE)

# (the step that reads the acceptance results into `acceptance_results` was
# lost in conversion; only the tail of that pipeline survives)
acceptance_results <- acceptance_results %>%
  mutate(Title=remove_punctuation(Title), Title=str_sub(Title, end=15)) %>%
  rename(TitleShort=Title)

# the conversion to text created files with an "incomplete final line" for which
# readLines generates a warning, hence the suppressWarnings
input_data <- suppressWarnings(map_df(input_files, extract_abstracts)) %>%
  mutate(AbstractID=row_number(), TitleShort=str_sub(Title, end=15)) %>%
  left_join(acceptance_results, by="TitleShort") %>%
  filter(!is.na(Accepted)) %>%
  select(AbstractID, Abstract, Accepted)
```
```r
# a code chunk that ends with a plot is a good code chunk
qplot(map_int(input_data$Abstract, nchar)) +
  labs(title="Length of abstracts", x="Number of characters", y="Number of abstracts")
```

## The analysis

So, now that I have the data ready I can apply some tidytext magic. I will first convert the data into a tidy format, then clean it up a bit and finally create a few visualisations.

```r
data(stop_words)
tidy_abstracts <- input_data %>%
  mutate(Abstract=remove_punctuation(Abstract)) %>%
  unnest_tokens(word, Abstract) %>% # abracadabra!
  anti_join(stop_words %>% filter(word!="r")) %>% # in this case R is a word
  filter(is.na(as.numeric(word))) # filter out numbers

# my personal mantra: a code chunk that ends with a plot is a good code chunk
tidy_abstracts %>%
  count(AbstractID, Accepted) %>%
  ggplot() +
  geom_density(aes(x=n, colour=Accepted), size=1) +
  labs(title="Distribution of number of words per abstract", x="Number of words")
```

The abstracts with a higher number of words have a slight advantage, but I wouldn't bet on it. There is something to be said for being succinct. But what really matters is obviously content, so let's have a look at what words are commonly used.

```r
tidy_abstracts %>%
  count(word, Accepted, sort=TRUE) %>% # count the number of observations per category and word
  group_by(Accepted) %>%
  top_n(20) %>% # select the top 20 counts per category
  ungroup() %>%
  ggplot() +
  geom_col(aes(x=word, y=n, fill=Accepted), show.legend = FALSE) +
  coord_flip() +
  labs(x="", y="Count", title="Wordcount by Acceptance category") +
  facet_grid(~ Accepted)
```

Certainly an interesting graph! It may have been better to show the proportions instead of counts, as the number of abstracts in each category is not equal. Nevertheless, the conclusion remains the same. The words "r" and "data" are clearly the most common.
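As a side note, the within-category proportions mentioned above are a one-liner with dplyr. Here is a toy sketch with made-up data (our illustration, not the post's code):

```r
library(dplyr)

# Toy stand-in for tidy_abstracts: one row per (abstract, word) occurrence
toy <- tibble(
  Accepted = c("yes", "yes", "yes", "no", "no"),
  word     = c("r", "data", "r", "r", "package")
)

# Share of each word within its acceptance category, instead of raw counts
props <- toy %>%
  count(Accepted, word) %>%
  group_by(Accepted) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
props
```

Plotting `prop` instead of `n` in the `geom_col` call above would make the two facets directly comparable.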
However, what is more interesting is that abstracts in the "yes" category use certain words significantly more often than abstracts in the "no" category, and vice versa ("more often" because a missing bar doesn't necessarily mean a zero observation). For example, the words "science", "production" and "performance" occur more often in the "yes" category. Vice versa, the words "tools", "product", "package" and "company(ies)" occur more often in the "no" category. Also, the word "application" occurs in its singular form in the "no" category and in its plural form in the "yes" category. Certainly, at EARL we like our applications to be plural; it is in the name after all.

There is one important caveat with the above analysis, and that is to do with the frequency of words within abstracts. The overall frequencies aren't really that high, and one abstract's heavy use of a particular word can make it seem more important than it really is. Luckily the tidytext package provides a solution for that, as I can now easily calculate the TF-IDF score.

```r
tidy_abstracts %>%
  count(Accepted, word, sort=TRUE) %>% # count the number of observations per category and word
  bind_tf_idf(word, Accepted, n) %>% # calculate tf-idf
  group_by(Accepted) %>%
  top_n(10) %>% # select the top 10 scores per category
  ungroup() %>%
  ggplot() +
  geom_col(aes(x=word, tf_idf, fill=Accepted), show.legend = FALSE) +
  labs(x="", y="TF-IDF", title="TF-IDF by Acceptance category") +
  coord_flip() +
  facet_grid(~ Accepted)
```

Note that I have aggregated the counts over the Acceptance category, as I'm interested in what words are important within a category and not within a particular abstract. There isn't an obvious pattern visible in the results, but I can certainly hypothesise. Words like "algorithm", "effects", "visualize", "ml" and "optimization" point strongly towards the application side of things, whereas words like "concept", "objects" and "statement" are softer and more generic.
XBRL is the odd one out here, but interesting in its own right; whoever submitted that abstract should perhaps consider re-submitting, as it's quite unique.

## Next Steps

That's it for this blogpost, but here are some next steps I would take if I had more time:

• Add more abstracts from previous years / other conferences
• Analyse combinations of words (n-grams) to work towards what kind of sentences should go into an abstract
• The content isn't the only thing that matters. By adding more metadata (time of submission, previously presented, etc.) the model can be made more accurate
• Try out topic modeling on the accepted abstracts to help with deciding what streams would make sense
• Train a neural network with all abstracts and generate a winning abstract [insert evil laugh]

## Conclusion

In this blogpost I have explored text data taken from abstract submissions to the EARL conference using the fabulous tidytext package. I analysed words from abstracts that were accepted versus those that weren't, and also compared their TF-IDF scores. If you want to know more about the tidytext package, come to the Web Scraping and Text Mining workshop my colleagues Nic Crane and Beth Ashlee will be giving preceding the LondonR meetup this Tuesday, the 28th of March. Also, if this blogpost has made you want to write an abstract, we are still accepting submissions for EARL London and EARL San Francisco (I promise I won't use it for a blogpost). As always, the code for this post can be found on GitHub.

Continue Reading…

### Some Random Weekend Reading

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Joseph Rickert

Few of us have enough time to read, and most of us already have depressingly deep stacks of material that we would like to get through. However, sometimes a random encounter with something interesting is all that it takes to regenerate enthusiasm.
Just in case you are not going to get to a book store with a good technical section this weekend, here are a few not-quite-random reads.

Deep Learning by Goodfellow, Bengio and Courville is a solid, self-contained introduction to Deep Learning that begins with Linear Algebra and ends with discussions of research topics such as Autoencoders, Representation Learning, and Boltzmann Machines. The online layout extends an invitation to click anywhere and begin reading. Sampling the chapters, I found the text to be engaging reading; much more interesting and lucid than just an online resource. For some Deep Learning practice with R and H2O, have a look at the post Deep Learning in R by Kutkina and Feuerriegel.

However, if you are under the impression that getting a handle on Deep Learning will get you totally up to speed with neural network buzzwords, you may be disappointed. Extreme Learning Machines, which "aim to break the barriers between the conventional artificial learning techniques and biological learning mechanisms", are sure to take you even deeper into the abyss. For a succinct introduction to ELMs with an application to handwritten digit classification, have a look at the recent paper by Pang and Yang. For more than an afternoon's worth of reading, browse through the IEEE Intelligent Systems issue on Extreme Learning Machines here, and the other resources collected here. See the announcement of the 2014 conference for the full context of the quote above.

For something a little lighter and closer to home, Christopher Gandrud's page on the networkD3 package is sure to set you browsing through Sankey Diagrams and Force Directed Drawing Algorithms. Finally, if you are like me and think that weekends are for catching up on things that you should probably already know, but on which you might be a bit shaky, remember that you can never know enough about GitHub.
Compliments of GitHub's Carolyn Shin, here is some online GitHub reading: GitHub Guides, GitHub on Demand Training, and an online version of the Pro Git Book.

Reading recommendations go both ways. Please feel free to comment with some recommendations of your own.

To leave a comment for the author, please follow the link and comment on their blog: RStudio.

Continue Reading…

### Key Takeaways from Strata + Hadoop World 2017 San Jose

The focus is increasingly shifting from storing and processing Big Data in an efficient way to applying traditional and new machine learning techniques to drive higher value from the data at hand.

Continue Reading…

### Leaf Classification Competition: 1st Place Winner's Interview, Ivan Sosnovik

Can you see the random forest for its leaves? The Leaf Classification playground competition ran on Kaggle from August 2016 to February 2017. Kagglers were challenged to correctly identify 99 classes of leaves based on images and pre-extracted features. In this winner's interview, Kaggler Ivan Sosnovik shares his first-place approach. He explains how he had better luck using logistic regression and random forest algorithms over XGBoost or convolutional neural networks in this feature engineering competition.

## Brief intro

I am an MSc student in Data Analysis at Skoltech, Moscow. I joined Kaggle about a year ago when I attended my first ML course at university. The first competition was What's Cooking. Since then, I've participated in several Kaggle competitions, but didn't pay much attention to them. It was more like a bit of practice to understand how ML approaches work. Ivan Sosnovik on Kaggle.
The idea of Leaf Classification was very simple and challenging. It seemed like I wouldn't have to stack so many models and the solution could be elegant. Moreover, the total volume of data was just 100+ MB and the process of learning could be performed even with a laptop. It was very promising, because the majority of the computations were supposed to be done on my MacBook Air with a 1.3 GHz Intel Core i5 and 4 GB RAM. I have worked with black-and-white images before. And there is a forest near my house. However, neither gave me much of an advantage in this competition.

## Let's get technical

When I joined the competition, several kernels with top-20% scores had been published. The solutions used the initially extracted features and Logistic Regression. It gave . And no significant improvement could be achieved by tuning the parameters. In order to enhance the quality, feature engineering had to be performed. It seemed like no one had done it, because the top solution had a slightly better score than mine.

### Feature engineering

I did first things first and plotted the images for each of the classes.

10 images from the train set for each of 7 randomly chosen classes.

The raw images had different resolution, rotation, aspect ratio, width, and height. However, the variation of each of the parameters within a class is less than between classes. Therefore, some informative features could be constructed just on the fly. They are:

• width and height
• aspect ratio: width / height
• square: width * height
• is orientation horizontal: int(width > height)

Another very useful feature that seemed to help is the average value of the pixels of the image. I added these features to the already extracted ones. Logistic regression enhanced the result. However, most of the work was yet to be done. All of the above-described features represent nothing about the content of the image.

##### PCA

Despite the success of neural networks as feature extractors, I still like PCA.
It is simple and allows one to get a useful representation of the image in . First of all, the images were rescaled to the size of . Then PCA was applied. The components were added to the set of previously extracted features.

Eigenvalues of the covariance matrix.

The number of components was varied. Finally, I used principal components. This approach showed .

##### Moments and hull

In order to generate even more features, I used OpenCV. There is a great tutorial on how to get the moments and hull of the image. I also added some pairwise multiplications of several features. The final set of features is the following:

• Initial features
• height, width, ratio etc.
• PCA
• Moments

The Logistic Regression demonstrated .

## The main idea

All of the above demonstrated a good result, one that would be appropriate for a real-life application. However, it could be enhanced.

### Uncertainty

The majority of objects had a certain decision: there was only one class with and the rest had . However, I found several objects with uncertainty in the prediction, like this:

Prediction of logistic regression.

The set of confusion classes was small (15 classes divided into several subgroups), so I decided to look at the pictures of the leaves and check if I could classify them myself. Here is the result:

Quercus confusion group. Eucalyptus and Cornus confusion group.

I must admit that Quercus (Oak) leaves look almost the same across different subspecies. I assume that I could distinguish Eucalyptus from Cornus, but the classification of subspecies seems complicated to me.

### Can you really see the random forest for the leaves?

The key idea of my solution was to create another classifier, which makes predictions only for the confusion classes. The first one I tried was RandomForestClassifier from sklearn, and it gave an excellent result after the tuning of hyperparameters.
The random forest was trained on the same data as the logistic regression, but only the objects from the confusion classes were used. If the logistic regression gave uncertain predictions for an object, then the prediction of the random forest classifier was used. The random forest gave the probabilities for the 15 classes; the rest were assumed to be absolute .

The final pipeline is the following:

Final pipeline.

### Threshold

The leaderboard score was calculated on the whole dataset. That is why some risky approaches could be used in this competition. Submissions are evaluated using the multi-class logloss:

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}),$$

where $N$ and $M$ are the number of objects and classes respectively, $p_{ij}$ is the prediction and $y_{ij}$ is the indicator: $y_{ij} = 1$ if object $i$ is in class $j$, otherwise it equals $0$. If the model correctly chose the class, then the following approach will decrease the overall logloss; otherwise, it will increase dramatically. After thresholding, I got the score of . That's it. All the labels are correct.

## What else?

I tried several methods that showed appropriate results but were not used in the final pipeline. Moreover, I had some ideas on how to make the solution more elegant. In this section, I'll try to discuss them.

#### XGBoost

XGBoost by dmlc is a great tool. I've used it in several competitions before and decided to train it on the initially extracted features. It demonstrated the same score as logistic regression, or even worse, and its time consumption was way bigger.

#### Submission blending

Before I came up with the idea of a Random Forest as the second classifier, I tried different one-model methods, and therefore collected lots of submissions. The trivial idea is to blend the submissions: to use the mean of the predictions, or a weighted mean. The result did not impress me either.

#### Neural networks

Neural networks were one of the first ideas I tried to implement. Convolutional Neural Networks are good feature extractors, and therefore could be used as a first-level model or even as the main classifier.
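To make the thresholding trick above concrete, here is a small base-R sketch of the multi-class logloss (our illustration, not the author's code); note how pushing a correct-but-uncertain prediction towards 1 lowers the loss, while a wrong confident prediction would blow it up:

```r
# Multi-class logloss sketch. y: vector of true class labels (1..M);
# p: N x M matrix of predicted probabilities, rows summing to 1.
# Predictions are clipped away from 0 and 1, as Kaggle does, so a
# confidently wrong prediction is heavily penalised but stays finite.
mlogloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(log(p[cbind(seq_along(y), y)]))
}

# Two predictions for a 3-class problem; the true class is 1 both times
y <- c(1, 1)
p <- rbind(c(0.8, 0.1, 0.1),  # confident and correct
           c(0.4, 0.3, 0.3))  # uncertain but correct
round(mlogloss(y, p), 4)   # 0.5697

# Thresholding the uncertain row to (1, 0, 0) lowers the loss;
# had the true class been 2 or 3, the loss would explode instead
p2 <- rbind(c(0.8, 0.1, 0.1), c(1, 0, 0))
round(mlogloss(y, p2), 4)  # 0.1116
```

This is exactly why the approach is "risky": it only pays off when the model's confident guesses are actually correct.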
The original images came in different resolutions. I rescaled them to a common size. The training of a CNN on my laptop was too time-consuming to choose the right architecture in a reasonable time, so I abandoned this idea after several hours of training. I believe that CNNs could give accurate predictions for this dataset.

## Bio

I am Ivan Sosnovik. I am a second-year master’s student at Skoltech and MIPT. Deep learning and applied mathematics are of great interest to me. You can visit my GitHub to check some stunning projects. Continue Reading… ### Unsupervised Investments: A Comprehensive Guide to AI Investors This article presents a list of 80 funds investing in Artificial Intelligence and Machine Learning. Continue Reading… ### When does research have active opposition? A reporter was asking me the other day about the Brian Wansink “pizzagate” scandal. The whole thing is embarrassing for journalists and bloggers who’ve been reporting on this guy’s claims entirely uncritically for years. See here, for example. Or here and here. Or here, here, here, and here. Or here. Or here, here, here, . . . The journalist on the phone was asking me some specific questions: What did I think of Wansink’s work (I think it’s incredibly sloppy, at best), Should Wansink release his raw data (I don’t really care), What could Wansink do at this point to restore his reputation (Nothing’s gonna work at this point), etc. But then I thought of another question: How was Wansink able to get away with it for so long? Remember, he got called on his research malpractice a full 5 years ago; he followed up with some polite words and zero action, and his reputation wasn’t dented at all. The problem, it seems to me, is that Wansink has had virtually no opposition all these years. It goes like this. If you do work on economics, you’ll get opposition. Write a paper claiming the minimum wage helps people and you’ll get criticism on the right.
Write a paper claiming the minimum wage hurts people and you’ll get criticism on the left. Some—maybe most—of this criticism may be empty, but the critics are motivated to use whatever high-quality arguments are at their disposal, so as to improve their case. Similarly with any policy-related work. Do research on the dangers of cigarette smoking, or global warming, or anything else that threatens a major industry, and you’ll get attacked. This is not to say that these attacks are always (or never) correct, just that you’re not going to get your work accepted for free. What about biomedical research? Lots of ambitious biologists are running around, all aiming for that elusive Nobel Prize. And, so I’ve heard, many of the guys who got the prize are pushing everyone in their labs to continue publishing purported breakthrough after breakthrough in Cell, Science, Nature, etc. . . . What this means is that, if you publish a breakthrough of your own, you can be sure that the sharks will be circling, and lots of top labs will be out there trying to shoot you down. It’s a competitive environment. You might be able to get a quick headline or two, but shaky lab results won’t be able to sustain a Wansink-like ten-year reign at the top of the charts. Even food research will get opposition if it offends powerful interests. Claim to have evidence that sugar is bad for you, or milk is bad for you, and yes you might well get favorable media treatment, but the exposure will come with criticism. If you make this sort of inflammatory claim and your research is complete crap, then there’s a good chance someone will call you on it. Wansink, though, his story is different. Yes, he’s occasionally poked at the powers that be, but his research papers address major policy debates only obliquely. There’s no particular reason for anyone to oppose a claim that men eat differently when with men than with women, or that buffet pricing affects or does not affect how much people eat, or whatever. 
Wansink’s work flies under the radar. Or, to mix metaphors, he’s in the Goldilocks position, with topics that are not important for anyone to care about disputing, but interesting and quirky enough to appeal to the editors at the New York Times, NPR, Freakonomics, Marginal Revolution, etc. It’s similar with embodied cognition, power pose, himmicanes, ages ending in 9, and other PPNAS-style Gladwell bait. Nobody has much motivation to question these claims, so they can stay afloat indefinitely, generating entire literatures in peer-reviewed journals, only to collapse years or decades later when someone pops the bubble via a preregistered non-replication or a fatal statistical criticism. We hear a lot about the self-correcting nature of science, but—at least until recently—there seems to have been a lot of published science that’s completely wrong, but which nobody bothered to check. Or, when people did check, no one seemed to care. A couple weeks ago we had a new example, a paper out of Harvard called, “Caught Red-Minded: Evidence-Induced Denial of Mental Transgressions.” My reaction when reading this paper was somewhere between: (1) Huh? As recently as 2016, the Journal of Experimental Psychology: General was still publishing this sort of slop? and (2) Hmmm, the authors are pretty well known, so the paper must have some hidden virtues. But now I’m realizing that, yes, the paper may well have hidden virtues—that’s what “hidden” means, that maybe these virtues are there but I don’t see them—but, yes, serious scholars really can release low-quality research, when there’s no feedback mechanism to let them know there are problems. OK, there are some feedback mechanisms. There are journal referees, there are outside critics like me or Uri Simonsohn who dispute forking path p-value evidence on statistical grounds, and there are endeavors such as the replication project that have revealed systemic problems in social psychology. 
But referee reports are hidden (you can respond to them by just submitting to a new journal), and the problem with peer review is the peers; and the other feedback mechanisms are relatively new, and some established figures in psychology and other fields have had trouble adjusting. Everything’s changing—look at Pizzagate, power pose, etc., where the news media are starting to wise up, and pretty soon it’ll just be NPR, PPNAS, and Ted standing in a very tiny circle, tweeting these studies over and over again to each other—but as this is happening, I think it’s useful to look back and consider how it is that certain bubbles have been kept afloat for so many years, how it is that the U.S. government gave millions of dollars in research grants to a guy who seems to have trouble counting pizza slices. The post When does research have active opposition? appeared first on Statistical Modeling, Causal Inference, and Social Science. Continue Reading… ### R Weekly Bulletin Vol – I (This article was first published on R programming, and kindly contributed to R-bloggers) We are starting with R weekly bulletins, which will contain some interesting ways and methods to write code in R and solve nagging problems. We will also cover R functions and shortcut keys for beginners. We understand that there can be more than one way of writing code in R, and the solutions listed in the bulletins may not be the sole reference point for you. Nevertheless, we believe that the solutions listed will be helpful to many of our readers. Hope you like our R weekly bulletins. Enjoy reading them!

### Shortcut Keys

1. To move cursor to R Source Editor – Ctrl+1
2. To move cursor to R Console – Ctrl+2
3. To clear the R console – Ctrl+L

### Problem Solving Ideas

#### Creating user input functionality

To create the user input functionality in R, we can make use of the readline function. This gives us the flexibility to set the input values for variables of our choice when the code is run.
Example: Suppose that we have coded a backtesting strategy. We want to have the flexibility to choose the backtest period. To do so, we can create a user input “n”, signifying the backtest period in years, and add the line shown below at the start of the code. When the code is run, it will prompt the user to enter the value for “n”. Upon entering the value, the R code will get executed for the set period and produce the desired output. Note that readline returns a character string, so wrap it in as.numeric if a number is needed.

n = readline(prompt = "Enter the backtest period in years: ")

#### Refresh code every x seconds

To refresh code every x seconds, we can use the while loop and the Sys.sleep function. The while loop keeps executing the enclosed block of commands as long as the condition remains satisfied. We enclose the code in the while statement and keep the condition as TRUE. By keeping the condition as TRUE, it will keep looping. At the end of the code, we add the Sys.sleep function and specify the delay time in seconds. This way the code will get refreshed after every “x” seconds.

Example: In this example, we initialize the x value to zero. The code is refreshed every 1 second, and it will keep printing the value of x. One can hit the escape button on the keyboard to terminate the code.

x = 0
while (TRUE) {
  x = x + 1
  print(x)
  Sys.sleep(1)
}

#### Running multiple R scripts sequentially

To run multiple R scripts, one can have a main script which will contain the names of the scripts to be run. Running the main script will lead to the execution of the other R scripts. Assume the name of the main script is “NSE Stocks.R”. In this script, we will mention the names of the scripts we wish to run within the source function. In this example, we wish to run the “Top gainers.R” and the “Top losers.R” scripts. These will be part of “NSE Stocks.R” as shown below, and we run the main script to run these 2 scripts.
source("Top gainers.R")
source("Top losers.R")

Enclosing the R script name within the source function causes R to accept its input from the named file. Input is read and parsed from that file until the end of the file is reached, then the parsed expressions are evaluated sequentially in the chosen environment. Alternatively, one can also place the R script names in a vector and use the sapply function.

Example:

filenames = c("Top gainers.R", "Top losers.R")
sapply(filenames, source)

#### Converting a date in the American format to Standard date format

The American date format is of the type mm/dd/yyyy, whereas the ISO 8601 standard format is yyyy-mm-dd. To convert a date from the American format to the standard date format, we will use the as.Date function along with its format argument. The example below illustrates the method.

Example:

# date in American format
dt = "07/24/2016"

# If we call the as.Date function on the date, it will throw an error,
# as the default format assumed by the as.Date function is yyyy-mm-dd.
as.Date(dt)
Error in charToDate(x): character string is not in a standard unambiguous format

# Correct way of formatting the date
as.Date(dt, format = "%m/%d/%Y")
[1] "2016-07-24"

#### How to remove all the existing files from a folder

To remove all the files from a particular folder, one can use the unlink function. Specify the path of the folder as the argument to the function. A forward slash with an asterisk is added at the end of the path. The syntax is given below.

unlink("path/*")

Example:

unlink("C:/Users/Documents/NSE Stocks/*")

This will remove all the files present in the “NSE Stocks” folder.

### Functions Demystified

#### write.csv function

If you want to save a data frame or matrix in a csv file, R provides the write.csv function.
The syntax for the write.csv function is given as:

write.csv(x, file = "filename", row.names = FALSE)

If we specify row.names=TRUE, the function prepends each row with a label taken from the row.names attribute of your data. If your data doesn’t have row names, then the function just uses the row numbers. A column header line is written by default. Note that write.csv ignores the col.names argument; if you do not want the column headers, use write.table with sep = "," and col.names = FALSE instead.

Example:

# Create a data frame
Ticker = c("PNB","CANBK","KTKBANK","IOB")
Percent_Change = c(2.30,-0.25,0.50,1.24)
df = data.frame(Ticker,Percent_Change)
write.csv(df, file="Banking Stocks.csv", row.names=FALSE)

This will write the data contained in the “df” dataframe to the “Banking Stocks.csv” file. The file gets saved in the R working directory.

#### fix function

The fix function opens the code of the function provided as its argument in an editor, allowing you to view and edit it. (To simply view the code, you can also just type the function name without parentheses.)

Example:

fix(sd)

The underlying code for the standard deviation function is as shown below. This is displayed when we execute the fix function with “sd” as the argument.

function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm))

#### download.file function

The download.file function helps download a file from a website. This could be a webpage, a csv file, an R file, etc. The syntax for the function is given as:

download.file(url, destfile)

where:
url – the Uniform Resource Locator (URL) of the file to be downloaded
destfile – the location to save the downloaded file, i.e. path with a file name

Example: In this example, the function will download the file from the path given in the “url” argument, and save it in the “Skills” folder on the D drive under the name “wacc.xls”.

url = "http://www.exinfm.com/excel%20files/betawacc.xls"
destfile = "D:/Skills/wacc.xls"
download.file(url, destfile)

### Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.
The post R Weekly Bulletin Vol – I appeared first on . To leave a comment for the author, please follow the link and comment on their blog: R programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... Continue Reading… ### Wise Practitioner – Predictive Workforce Analytics Interview Series: Haig Nalbantian at Mercer By: Greta Roberts, Conference Chair, Predictive Analytics World for Workforce 2017 In anticipation of his upcoming Predictive Analytics World for Workforce conference presentation, The Pay Equity Revolution: How Advanced Analytics are Helping to Close the Gender Pay Gap in Organizations, we interviewed Haig Nalbantian, Senior Partner, Co-leader Mercer Workforce Sciences Institute at Mercer. View the Q-and-A below to see how Haig Nalbantian has incorporated predictive analytics into the workforce of Mercer. Also, glimpse what’s in store for the new PAW Workforce conference, May 14-18, 2017. Q: How is a specific line of business / business unit using your predictive analytics method to inform decisions? A: We’ve been conducting pay equity modeling and assessments either alone or as part of a broader workforce analysis since the early 90s. In the past five years or so, this area of work has grown enormously. More and more of our clients – in the US and, increasingly in Europe as well – are conducting annual pay audits to proactively address pay equity issues for women and minorities. In working with us, they choose to rely on comprehensive predictive models of base pay and total compensation that account for the multiple individual, group and market factors that drive pay in organizations. 
In this way, they not only isolate the effects of specific demographics themselves, thereby assessing if and to what extent there are unexplained pay disparities associated with gender or race, but also get a deeper insight about explained differences – that is, of the root causes of persistent differences that show up in raw (unadjusted) comparisons of pay levels. While those concerned with legal challenges regarding pay equity commonly use statistical controls to explain pay differences and reduce estimates of the size of pay disparities, the more strategically-minded leaders in this domain use these same controls to better understand why pay disparities exist and what can be done systematically to reduce or eliminate them in a sustainable way. I am pleased to see more organizations moving away from a predominately legal or compliance view of pay equity to a more expansive strategic view that seeks to address systemic sources of gender and racial disparities in pay. Mercer’s When Women Thrive research has shown that aggressive evaluation and management of pay equity is a leading indicator of greater success in other aspects of employment equity. Specifically, those organizations which have specialized, independent teams using statistical methods to assess and ensure pay equity as part of the annual compensation process are significantly more likely to do better in securing a more diverse workforce and leadership team. Focus on pay equity and you are likely to end up with better diversity outcomes overall. Many of our clients do, in fact, rely on our predictive modeling approach to pay equity, commissioning us, on an annual basis, to estimate statistical models of pay determination to assess if and to what extent pay disparities exist and make adjustments where bona fide pay gaps are found. They typically do this work as part of the annual compensation review. 
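The distinction drawn above between raw (unadjusted) pay gaps and gaps that remain after controlling for non-demographic drivers of pay can be illustrated with a toy calculation. All numbers here are invented, and a single role control stands in for the full multivariate regression models described in the interview:

```python
# Toy illustration: raw gender pay gap vs. the gap after controlling for role.
employees = [
    # (gender, role, pay) -- invented data
    ("F", "analyst", 52_000), ("F", "analyst", 54_000),
    ("M", "analyst", 53_000), ("M", "analyst", 55_000),
    ("F", "manager", 90_000),
    ("M", "manager", 92_000), ("M", "manager", 94_000),
]

def mean(xs):
    return sum(xs) / len(xs)

# Raw gap: compare overall averages, ignoring role entirely.
raw_gap = (mean([p for g, _, p in employees if g == "M"])
           - mean([p for g, _, p in employees if g == "F"]))

# Adjusted gap: average the M-F difference within each role.
roles = {r for _, r, _ in employees}
within = [mean([p for g, rr, p in employees if g == "M" and rr == r])
          - mean([p for g, rr, p in employees if g == "F" and rr == r])
          for r in roles]
adjusted_gap = mean(within)

print(raw_gap)       # large: men are over-represented in the higher-paid role
print(adjusted_gap)  # much smaller: within the same role, pay is nearly equal
```

Here the raw gap is driven mostly by who holds which role, which is exactly the "explained differences" (and access-to-roles) point made in the interview: the within-role gap is far smaller than the headline number.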
Q: If HR were 100% ready and the data were available, what would your boldest approach to pay equity deliver? A: In the best of all worlds, organizations will evaluate and address pay equity in the broader context of what the organization actually rewards. Our team has undertaken analyses of the drivers of pay across literally hundreds of organizations in the US and abroad for almost twenty-five years now. We find the drivers of pay vary significantly across and even within organizations. They also vary over time as changing business models and objectives and changing labor market dynamics force organizations to adapt their rewards to help drive corresponding changes in their workforce. Effective pay equity practices must account for such changes and help ensure that pay equity actions align with evolving reward strategies. So, for example, if a new business strategy places a premium on certain new roles, it is important, from a gender pay equity perspective, not only to know that women in those roles are paid on par with comparable men, but that women are getting the opportunity to access these new and valued roles. If these new roles command higher pay, disproportionate representation of men would end up increasing the raw pay gap and likely diminishing the prospects of women to be successful in the organization. A successful pay equity process will keep tabs on underlying changes in what is being valued by the organization to ensure women, minorities and other groups of interest are not systematically disadvantaged by market- or internally-driven shifts in the valuation of skills, knowledge, capabilities, experience, behaviors and roles. Properly designed, a pay equity assessment is folded into the annual compensation review; it becomes an opportunity to assess the strategic alignment of rewards with business goals. Most of our clients pursue this approach.
A pay equity review is not a one-time study; it is an ongoing process of rewards review, one that is of significant strategic importance to the organization. Q: Do you think "black box" workforce predictive methods will become widely embraced in the pay equity domain? A: “Black box solutions” are for functional tacticians at best, not practitioners of strategic workforce management. Strategic workforce management requires understanding and effectively communicating the story within the data. By design, black box solutions bypass the story, substituting claims of “predictive validity” instead. Time may prove me wrong, but I have yet to see a compelling human capital storyline emerge from statistical relationships or algorithmically-generated predictions alone. Explanatory analytics – understanding what’s behind relationships detected in the data – is, in my view, central to building and articulating a story that can engage leaders and compel action. Since I view pay equity as fundamental to reward strategy, I am reluctant to embrace the use of automated data analytics as the basis of pay equity assessments. If pay equity is part and parcel of rewards alignment, there is no substitute for careful modeling and interpretation of the drivers of rewards. Q: Is there a risk of making the pay equity process too complex? A: Our domain of workforce analytics always carries the risk of being overwhelmed by complexity of approach or analytical techniques. This has never deterred our team, however, from pursuing a more sophisticated technical solution if we are sure that solution will lead to more accurate conclusions and better results. The proof ultimately is in the results achieved. 
As I mentioned in my interview last year, sports analytics has definitely added complexity to the statistics tracked and followed by front office professionals, field managers and coaches, players, player representatives and sports journalists, but they have gained speedy adoption in the industry. Few of these stakeholders really grasp the technical dimensions of sports analytics. Nonetheless, they are pervasively used – because they work, because they lead to better decisions and more targeted investments. Staying away from sophisticated analytics on grounds of complexity is a cop out, one that is becoming increasingly untenable in the HR field. The analytics used for pay equity are not all that complex. Most HR leaders have a basic understanding of multivariate regression analysis. Even if they don’t, they can readily understand that measuring pay disparities and determining their sources requires accounting for other non-demographic factors that also influence pay levels. That’s what good modeling will accomplish. More complex is the way in which the methodology is practically applied and how the results are translated into action. So, for example; if pay strategies and pay determination are different across business units, functions, geographies, occupations and job families, do you need to model each of these separately? What determines the degree of segmentation used? Technical requirements, such as minimum required population sizes for statistical modeling, may trade off against practical business considerations. There is no pure science to inform such decisions. Similarly, once you identify pay disparities or, for instance, employees who are “under-paid” relative to peers – i.e. “outliers” – how do you close the gaps? Do you address outliers only in groups where demographic disparities have been detected? Should you make adjustments for women and non-whites only? 
Implementation questions such as these are generally more “complex” and challenging to navigate than are issues around methodology. Seldom do we get drawn into detailed conversations about statistical techniques. On the other hand, we do have extensive discussions about implementation issues and the “philosophy” behind pay actions. In sum, complexity is not a major barrier for workforce analysts in the pay equity area. A richer explanation of such issues is found in Gaertner, S., Greenfield, G. and Levine, B., “Pay Equity: New Pressures, New Challenges,” Human Resource Executive Online, April 12, 2016. Q: What is one specific way in which predictive analytics is driving workforce decisions? A: Pay equity is perhaps the area where we see the most tangible results from our predictive modeling work. First of all, clients don’t ask us to do this work if they are not prepared to act on the results. Organizations understand that you don’t sit on pay disparities if you find them. You have to take reasonable action to remedy bona fide pay inequities once found. Due diligence is always required in implementing pay actions. No statistical model can alone determine if there are pay disparities, certainly not at an individual level. For one thing, there is always the potential for error in the raw data on which such models are estimated. Further, there is statistical error in the estimation of the models themselves. Not all relevant factors influencing pay may be captured in the organization's archival workforce (HRIS) data. And some jobs or career levels may be so thinly populated that it is impossible to make accurate statistical comparisons that account for differences in job or role. At a certain point, judgement comes into play.
Once individual outliers are identified, you need to carefully review them to sort out those cases where there are good technical or business explanations for the pay differences observed and those differences related to gender or race that remain unexplained. The modeling helps narrow the field for such hands-on review, but it does not bypass this need entirely. As in most areas of workforce analytics, science and art come together to render the best solution. Still, there is no question that the analytics delivered here are hugely impactful. When you do this work, you know you are going to have an immediate effect on the client organization and the employees whose pay is at issue. Doing such consequential work is very satisfying. But it carries a huge responsibility. Because you will deliver point estimates of pay differences that may translate into actual payouts to individuals, you cannot rely on large sample sizes to overcome any data error. Precision in working the data you have is critical. Those who do this work have to be on their toes. Always! Q: How does business culture need to evolve to realize the full promise of predictive workforce analytics such as pay equity modeling? A: I think I largely answered this question in my response to the first question above where I reference Mercer’s When Women Thrive study. That study showed that pay equity is basically the tip of the spear in organizations’ efforts to secure gender diversity in their leadership and workforce generally. If you don’t get the pay side right, it is unlikely you’ll be doing well on the representation, promotion, retention, hiring or performance sides either. Rewards are consequential. They signal what is valued in an organization. If you don’t signal you value women, minorities or other groups of interest, you are unlikely to secure them as a vital, engaged, representative and effective part of your workforce. So start with pay equity. But don’t stop there.
If I am clear about anything in our field, it is that effective human capital management requires a systems view. The dynamic process that produces your workforce – we call it your “internal labor market” – consists of multiple moving parts that interact with each other continuously to affect the mix of talent embodied in your workforce. What happens on the reward side influences what happens on the retention side, the development side, the performance side; and vice versa. The best analytics will de-mystify this process, help you understand what drives it and, thereby, help you shape your internal labor market to meet the needs of your business. Workforce diversity and pay equity should be seen in this light. In the end, they are all about the business. Organizations that do in fact recognize their workforce as an asset need to know what’s happening to that asset and the return they’re getting on that asset. Taking a systems view helps deliver and better process this information. Workforce analytics teams can help foster this view in the way they analyze data and communicate results. This approach enhances the power of their work. It also helps engage leadership in a way traditional HR often failed to do. Such engagement makes all the difference in making the resulting strategies successful. —————— Don't miss Haig’s conference presentation, The Pay Equity Revolution: How Advanced Analytics are Helping to Close the Gender Pay Gap in Organizations, at PAW Workforce, on Wednesday, May 17, 2017, from 2:15 to 3:00 pm. Click here to register for attendance. By: Greta Roberts, CEO, Talent Analytics, Corp.
@gretaroberts and Conference Chair of Predictive Analytics World for Workforce Continue Reading… ### Web data acquisition: parsing json objects with tidyjson (Part 3) (This article was first published on R-posts.com, and kindly contributed to R-bloggers) Part 2 of this series collected example flight data in json format, describing the libraries and the structure of the POST request necessary to gather the data in a json object. Although the process generated and transferred a proper response locally, the collected data were neither in a structure suitable for data analysis nor immediately readable. They appear as just a long string of information, nested and separated according to the JavaScript object notation syntax. Thus, to visualize the deeply nested json object and make it human-readable and understandable for further processing, the json content can be copied and pasted into a common online parser. The tool allows you to select each node of the tree and observe the data structure down to the variables and data of interest for the statistical analysis. The bulk of the relevant information for the purpose of the analysis of flight prices is hidden in the tripOption node, as shown in the following figure (only 50 flight solutions were requested). However, looking deeper into the object, several other elements are provided, such as the distance in miles, the segment, the duration, the carrier, etc. The R parser to transform the json structure into a usable dataframe requires the dplyr library for the pipe operator (%>%), to streamline the code and make the parser more readable. Nevertheless, the library actually wrangling through the lines is tidyjson and its powerful functions:

• enter_object: enters and dives into a data object;
• gather_array: stacks a JSON array;
• spread_values: creates new columns from values, assigning a specific type (e.g. jstring, jnumber).
library(dplyr)    # for pipe operator %>% and other dplyr functions
library(tidyjson) # https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html

data_items <- datajson %>%
  spread_values(kind = jstring("kind")) %>%
  spread_values(trips.kind = jstring("trips","kind")) %>%
  spread_values(trips.rid = jstring("trips","requestId")) %>%
  enter_object("trips","tripOption") %>%
  gather_array %>%
  spread_values(
    id = jstring("id"),
    saleTotal = jstring("saleTotal")) %>%
  enter_object("slice") %>%
  gather_array %>%
  spread_values(slice.kind = jstring("kind")) %>%
  spread_values(slice.duration = jstring("duration")) %>%
  enter_object("segment") %>%
  gather_array %>%
  spread_values(
    segment.kind = jstring("kind"),
    segment.duration = jnumber("duration"),
    segment.id = jstring("id"),
    segment.cabin = jstring("cabin")) %>%
  enter_object("leg") %>%
  gather_array %>%
  spread_values(
    segment.leg.aircraft = jstring("aircraft"),
    segment.leg.origin = jstring("origin"),
    segment.leg.destination = jstring("destination"),
    segment.leg.mileage = jnumber("mileage")) %>%
  select(kind, trips.kind, trips.rid, saleTotal, id, slice.kind, slice.duration,
         segment.kind, segment.duration, segment.id, segment.cabin,
         segment.leg.aircraft, segment.leg.origin, segment.leg.destination,
         segment.leg.mileage)

head(data_items)

kind trips.kind trips.rid saleTotal 1 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38 2 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR178.38 3 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20 4 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR235.20 5 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60 6 qpxExpress#tripsSearch qpxexpress#tripOptions UnxCOx4nKIcIOpRiG0QBOe EUR248.60 id slice.kind slice.duration 1 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo 510 2 ftm7QA6APQTQ4YVjeHrxLI006 qpxexpress#sliceInfo 510 3 ftm7QA6APQTQ4YVjeHrxLI009
qpxexpress#sliceInfo 490 4 ftm7QA6APQTQ4YVjeHrxLI009 qpxexpress#sliceInfo 490 5 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo 355 6 ftm7QA6APQTQ4YVjeHrxLI007 qpxexpress#sliceInfo 355 segment.kind segment.duration segment.id segment.cabin 1 qpxexpress#segmentInfo 160 GixYrGFgbbe34NsI COACH 2 qpxexpress#segmentInfo 235 Gj1XVe-oYbTCLT5V COACH 3 qpxexpress#segmentInfo 190 Grt369Z0shJhZOUX COACH 4 qpxexpress#segmentInfo 155 GRvrptyoeTfrSqg8 COACH 5 qpxexpress#segmentInfo 100 GXzd3e5z7g-5CCjJ COACH 6 qpxexpress#segmentInfo 105 G8axcks1R8zJWKrN COACH segment.leg.aircraft segment.leg.origin segment.leg.destination segment.leg.mileage 1 320 FCO IST 859 2 77W IST LHR 1561 3 73H FCO ARN 1256 4 73G ARN LHR 908 5 319 FCO STR 497 6 319 STR LHR 469  Data are now in an R-friendly structure despite not yet ready for analysis. As can be observed from the first rows, each record has information on a single segment of the flight selected. A further step of aggregation using some SQL is needed in order to end up with a dataframe of flights data suitable for statistical analysis. Next up, the aggregation, some data analysis and data visualization to complete the journey through the web data acquisition using R. #R #rstats #maRche #json #curl #tidyjson #Rbloggers This post is also shared in www.r-bloggers.com and LinkedIn To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... 
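As an addendum to the walkthrough above, the deferred aggregation step could be sketched with dplyr (already loaded for the parser). The data frame below is a hypothetical stand-in mimicking a few of the columns of data_items; the real grouping will depend on the analysis. In SQL terms this is simply a GROUP BY on the flight-option id.

```r
library(dplyr)

# Toy stand-in for data_items: two segments per flight option.
seg <- data.frame(
  id                  = c("opt1", "opt1", "opt2", "opt2"),
  saleTotal           = c("EUR178.38", "EUR178.38", "EUR235.20", "EUR235.20"),
  segment.duration    = c(160, 235, 190, 155),
  segment.leg.mileage = c(859, 1561, 1256, 908),
  stringsAsFactors    = FALSE
)

# Collapse segments into one row per flight option:
flights <- seg %>%
  group_by(id, saleTotal) %>%
  summarise(total_duration = sum(segment.duration),
            total_mileage  = sum(segment.leg.mileage),
            n_segments     = n())
```

After this step each row describes a whole flight option (total duration, total mileage, number of segments), which is the shape needed for price analysis.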
Continue Reading…

### R – Change column names in a spatial dataframe

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

Ordnance Survey have a great OpenRoads dataset, but unfortunately it contains a column called ‘primary’, which is a keyword in SQL. This makes it challenging/impossible to import the OpenRoads dataset into a SQL database (e.g. GRASS) without changing the offending column name. Enter R! Or any other capable programming language. The following script reads a folder of shp files, changes a given column name and overwrites the original files. Note the use of the ‘@’ symbol to call a slot from the S4 class object (thanks Stack Exchange).

library(rgdal)

f = list.files("~/Downloads/OS/data", pattern="shp")
f = substr(f, 1, nchar(f) - 4)

lapply(f, function(i){
  x = readOGR("/home/mspencer/Downloads/OS/data", i)
  colnames(x@data)[11] = "prim"
  writeOGR(x,
           "/home/mspencer/Downloads/OS/data",
           i,
           "ESRI Shapefile",
           overwrite_layer=T)
})

To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.

Continue Reading…

### Neural Networks for Learning Lyrics

(This article was first published on More or Less Numbers, and kindly contributed to R-bloggers)

I created a Twitter account which was inspired by a couple of Twitter accounts that applied a particular type of machine learning technique to learn how two (at the time) presidential hopefuls spoke. I thought, why not see what a model like this could do with lyrics from my favorite rock n roll artist?
Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture that can be used to produce sentences or phrases by learning from text. The two Twitter accounts that inspired this were @deeplearnthebern and @deepdrumpf, which use this technique to produce phrases and sentences. I scraped a little more than 300 of his songs and fed them to an LSTM model using R and the mxnet library. I primarily used mxnet.io to build and train the model... great site and tools. The tutorials on their site are very helpful, particularly this one. The repository containing the code for the scraper and other information is here. Follow deeplearnbruce for tweets that are hopefully entertaining for Springsteen fans or anyone else.

To leave a comment for the author, please follow the link and comment on their blog: More or Less Numbers.

Continue Reading…

### Four short links: 24 March 2017

Philosophy, Interactive Simulations, Reproducible Data Science, and Visualizing Connectivity

1. Daniel Dennett's Science of the Soul (New Yorker) -- AI starts a lot of really lousy philosophical debates. Now we need to catch up to the philosophers who have been working on escaping nonsense for years. [T]he researchers, anticipating the discussion’s inexorable transformation into a meditation on “Westworld,” clutched their heads and sighed.
2. Loopy -- a tool to make interactive simulations of complex systems.
3. Reproducible Data Analysis in Jupyter -- This series of videos presents a case study in how I personally approach reproducible data analysis within the Jupyter notebook.
4. Visualizing Connectivity in Large Graphs -- We have two main contributions: (1) two novel visualization techniques that work in concert for summarizing graph connectivity; and (2) Graffinity, an open source implementation of these visualizations supplemented by detail views to enable a complete analysis workflow. Graffinity was designed in close collaboration with neuroscientists and is optimized for connectomics data analysis, yet the technique is applicable across domains. We validate the connectivity overview and our open source tool with illustrative examples using flight and connectomics data.

Continue reading Four short links: 24 March 2017.

Continue Reading…

### Lesser known purrr tricks

purrr is a package that extends R’s functional programming capabilities. It brings a lot of new stuff to the table, and in this post I show you some of the most useful (at least to me) functions included in purrr.

## Getting rid of loops with map()

library(purrr)

numbers <- list(11, 12, 13, 14)

map_dbl(numbers, sqrt)
## [1] 3.316625 3.464102 3.605551 3.741657

You might wonder why this is preferable to a for loop. It’s a lot less verbose, and you do not need to initialise any kind of structure to hold the result. If you google “create empty list in R” you will see that this is a very common question. With the map() family of functions, however, there is no need for an initial structure. map_dbl() returns an atomic vector of doubles, but if you use map() you will get a list back. Try them all out!
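The other typed variants follow the same pattern (a quick sketch with made-up values): each coerces the output to the named type and fails loudly if it can't.

```r
library(purrr)

numbers <- list(11, 12, 13, 14)

# map_lgl() returns an atomic logical vector:
map_lgl(numbers, ~ .x %% 2 == 0)   # is each number even?
## [1] FALSE  TRUE FALSE  TRUE

# map_chr() returns an atomic character vector:
map_chr(numbers, ~ paste0("id_", .x))
## [1] "id_11" "id_12" "id_13" "id_14"
```

The `~` formula is purrr shorthand for an anonymous function of `.x`.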
## Map conditionally

#### map_if()

# Create a helper function that returns TRUE if a number is even
is_even <- function(x){
  !as.logical(x %% 2)
}

map_if(numbers, is_even, sqrt)
## [[1]]
## [1] 11
##
## [[2]]
## [1] 3.464102
##
## [[3]]
## [1] 13
##
## [[4]]
## [1] 3.741657

#### map_at()

map_at(numbers, c(1, 3), sqrt)
## [[1]]
## [1] 3.316625
##
## [[2]]
## [1] 12
##
## [[3]]
## [1] 3.605551
##
## [[4]]
## [1] 14

map_if() and map_at() take one more argument than map(): for map_if(), a predicate function (a function that returns TRUE or FALSE), and for map_at(), a vector of positions. This allows you to map your function only when certain conditions are met, which is also something a lot of people google for.

## Map a function with multiple arguments

numbers2 <- list(1, 2, 3, 4)

map2(numbers, numbers2, `+`)
## [[1]]
## [1] 12
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 16
##
## [[4]]
## [1] 18

You can map two lists to a function which takes two arguments using map2(). You can even map an arbitrary number of lists to any function using pmap(). By the way, try this: `+`(1, 3) and see what happens.

## Don’t stop execution of your function if something goes wrong

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

numbers_with_error <- list(1, 2, 3, "spam", 4)

map(numbers_with_error, possible_sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] NA
##
## [[5]]
## [1] 2

Another very common issue is wanting a loop to keep running even when something goes wrong. In most cases the loop simply stops at the error, but you would like it to continue and to see where it failed. Try googling “skip error in a loop” or some variation of it and you’ll see that a lot of people really just want that. This is possible by combining map() and possibly(). Most solutions involve tryCatch(), which I personally do not find very easy to use.
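Since pmap() comes up above without an example, here is a quick sketch (with made-up values): it takes a list of lists and a function with one argument per list, applying the function element-wise across all of them.

```r
library(purrr)

prices   <- list(10, 20, 30)
quantity <- list(1, 2, 3)
discount <- list(0, 0.1, 0.5)

# pmap_dbl() maps the three lists in parallel through a
# three-argument function and returns an atomic double vector:
pmap_dbl(list(prices, quantity, discount),
         function(p, q, d) p * q * (1 - d))
## [1] 10 36 45
```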
## Don’t stop execution of your function if something goes wrong and capture the error

safe_sqrt <- safely(sqrt, otherwise = NA_real_)

map(numbers_with_error, safe_sqrt)
## [[1]]
## [[1]]$result
## [1] 1
##
## [[1]]$error
## NULL
##
##
## [[2]]
## [[2]]$result
## [1] 1.414214
##
## [[2]]$error
## NULL
##
##
## [[3]]
## [[3]]$result
## [1] 1.732051
##
## [[3]]$error
## NULL
##
##
## [[4]]
## [[4]]$result
## [1] NA
##
## [[4]]$error
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
##
## [[5]]
## [[5]]$result
## [1] 2
##
## [[5]]$error
## NULL

safely() is very similar to possibly(), but it returns a list of lists: each element is a list of the result and the accompanying error. If there is no error, the error component is NULL; if there is an error, it contains the error message.

## Transpose a list

safe_result_list <- map(numbers_with_error, safe_sqrt)

transpose(safe_result_list)
## $result
## $result[[1]]
## [1] 1
##
## $result[[2]]
## [1] 1.414214
##
## $result[[3]]
## [1] 1.732051
##
## $result[[4]]
## [1] NA
##
## $result[[5]]
## [1] 2
##
##
## $error
## $error[[1]]
## NULL
##
## $error[[2]]
## NULL
##
## $error[[3]]
## NULL
##
## $error[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
## $error[[5]]
## NULL

Here we transposed the above list. This means that we still have a list of lists, but where the first list holds all the results (which you can then access with safe_result_list$result) and the second list holds all the errors (which you can access with safe_result_list$error). This can be quite useful!

## Apply a function to a lower depth of a list

transposed_list <- transpose(safe_result_list)

transposed_list %>%
  at_depth(2, is_null)
## $result
## $result[[1]]
## [1] FALSE
##
## $result[[2]]
## [1] FALSE
##
## $result[[3]]
## [1] FALSE
##
## $result[[4]]
## [1] FALSE
##
## $result[[5]]
## [1] FALSE
##
##
## $error
## $error[[1]]
## [1] TRUE
##
## $error[[2]]
## [1] TRUE
##
## $error[[3]]
## [1] TRUE
##
## $error[[4]]
## [1] FALSE
##
## $error[[5]]
## [1] TRUE

Sometimes working with lists of lists can be tricky, especially when we want to apply a function to the sub-lists. This is easily done with at_depth()!

## Set names of list elements

name_element <- c("sqrt()", "ok?")

set_names(transposed_list, name_element)
## $`sqrt()`
## $`sqrt()`[[1]]
## [1] 1
##
## $`sqrt()`[[2]]
## [1] 1.414214
##
## $`sqrt()`[[3]]
## [1] 1.732051
##
## $`sqrt()`[[4]]
## [1] NA
##
## $`sqrt()`[[5]]
## [1] 2
##
##
## $`ok?`
## $`ok?`[[1]]
## NULL
##
## $`ok?`[[2]]
## NULL
##
## $`ok?`[[3]]
## NULL
##
## $`ok?`[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
##
## $`ok?`[[5]]
## NULL

## Reduce a list to a single value

reduce(numbers, `*`)
## [1] 24024

reduce() applies the function `*` iteratively to the list of numbers. There’s also accumulate():

accumulate(numbers, `*`)
## [1]    11   132  1716 24024

which keeps the intermediary results.

This function is very general, and you can reduce anything:

Matrices:

mat1 <- matrix(rnorm(10), nrow = 2)
mat2 <- matrix(rnorm(10), nrow = 2)
mat3 <- matrix(rnorm(10), nrow = 2)

list_mat <- list(mat1, mat2, mat3)

reduce(list_mat, `+`)
##            [,1]       [,2]       [,3]       [,4]      [,5]
## [1,] -0.5228188  0.4813357  0.3808749 -1.1678164 0.3080001
## [2,] -3.8330509 -0.1061853 -3.8315768  0.3052248 0.3486929

even data frames:

df1 <- as.data.frame(mat1)
df2 <- as.data.frame(mat2)
df3 <- as.data.frame(mat3)

list_df <- list(df1, df2, df3)

reduce(list_df, dplyr::full_join)
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
##            V1         V2          V3         V4         V5
## 1  0.01587062  0.8570925  1.04330594 -0.5354500  0.7557203
## 2 -0.46872345  0.3742191 -1.88322431  1.4983888 -1.2691007
## 3 -0.60675851 -0.7402364 -0.49269182 -0.4884616 -1.0127531
## 4 -1.49619518  1.0714251  0.06748534  0.6650679  1.1709317
## 5  0.06806907  0.3644795 -0.16973919 -0.1439047  0.5650329
## 6 -1.86813223 -1.5518295 -2.01583786 -1.8582319  0.4468619

Hope you enjoyed this list of useful functions! If you enjoy the content of my blog, you can follow me on twitter.

### Modules vs. microservices

Apply modular system design principles while avoiding the operational complexity of microservices.

Much has been said about moving from monoliths to microservices. Besides rolling off the tongue nicely, it also seems like a no-brainer to chop up a monolith into microservices. But is this approach really the best choice for your organization? It’s true that there are many drawbacks to maintaining a messy monolithic application. But there is a compelling alternative which is often overlooked: modular application development. In this article, we'll explore what this alternative entails and show how it relates to building microservices.

## Microservices for modularity

"With micro-services we can finally have teams work independently", or "our monolith is too complex, which slows us down." These expressions are just a few of the many reasons that lead development teams down the path of microservices. Another one is the need for scalability and resilience. What developers collectively seem to be yearning for is a modular approach to system design and development. Modularity in software development can be boiled down into three guiding principles:

### Fastest way to alphabetize your bookshelf

Sorting algorithms. Apparently there are endless ways to visualize them in various contexts, and somehow it never gets old. Here, sorting is shown in the context of books on a shelf.


### Paris Machine Learning Hors Serie #10 : Workshop SPARK (atelier 1)

Leonardo Noleto, data scientist at KPMG, walks us through the process of cleaning and transforming raw data into "clean" data with Apache Spark.

Apache Spark is a general-purpose open source framework designed for distributed data processing. It is an extension of the MapReduce model, with the advantage of being able to process data in memory and interactively. Spark offers a set of components for data analysis: Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graphs).

This workshop focuses on the fundamentals of Spark and its data-processing paradigm through the Python programming interface (more precisely, PySpark).

Installation, configuration, processing on a cluster, Spark Streaming, MLlib and GraphX will not be covered in this workshop.

The material to install is here.

Objectives

• Understand the fundamentals of Spark and place it within the Big Data ecosystem;
• Know how it differs from Hadoop MapReduce;
• Use RDDs (Resilient Distributed Datasets);
• Use the most common actions and transformations to manipulate and analyze data;
• Write a data transformation pipeline;
• Use the PySpark programming API.

This workshop is the first in a series of two on Apache Spark. To attend the next workshops, you must have attended the previous ones or be comfortable with the topics already covered.

What are the prerequisites?

• Know the basics of the Python language (or learn them quickly via this online course: Python Introduction)
• Have some exposure to data processing with R, Python or Bash (why not?)
• No prior knowledge of distributed processing or Apache Spark is required. This is an introductory workshop. People who already have some experience with Spark (in Scala, Java or R) may get bored (this is a workshop for beginners).

How should I prepare for this workshop?

• Bring a reasonably modern laptop with at least 4 GB of memory and a web browser installed. You must be able to connect to the Internet over Wi-Fi.
• Follow the instructions to prepare for the workshop (Docker installation + the workshop's Docker image).
• The data to clean is included in the Docker image. The exercises will be provided during the workshop as Jupyter notebooks.
• The notebook is here: https://goo.gl/emkoee

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### What’s new on arXiv

A typical IR system that delivers and stores information is affected by the problem of matching user queries against the content available on the web. Using an ontology, the extracted terms are represented as a network graph consisting of nodes, edges, index terms, etc. The IR approaches mentioned above provide relevance, thus satisfying the user's query. The paper also emphasizes analyzing multimedia documents and performs calculations on the extracted terms using different statistical formulas. The proposed model reduces the semantic gap and satisfies user needs efficiently.
The proposed methodology is procedural, i.e. it follows a finite number of steps to extract documents relevant to the user's query. It is based on data-mining principles for analyzing web data. Data mining first integrates the data to build a warehouse, then extracts useful information with the help of an algorithm. The extracted documents are represented using a vector-based statistical approach, which describes each document as a set of terms.
In all but the most trivial optimization problems, the structure of the solutions exhibits complex interdependencies between the input parameters. Decades of research with stochastic search techniques have shown the benefit of explicitly modeling the interactions between sets of parameters and the overall quality of the solutions discovered. We demonstrate a novel method, based on learning deep networks, to model the global landscapes of optimization problems. To represent the search space concisely and accurately, the deep networks must encode information about the underlying parameter interactions and their contributions to the quality of the solution. Once the networks are trained, they are probed to reveal parameter combinations with high expected performance with respect to the optimization task. These estimates are used to initialize fast, randomized, local search algorithms, which in turn expose more information about the search space that is subsequently used to refine the models. We demonstrate the technique on multiple optimization problems that have arisen in a variety of real-world domains, including: packing, graphics, job scheduling, layout and compression. The problems include combinatorial search spaces, discontinuous and highly non-linear spaces, and span binary, higher-cardinality discrete, as well as continuous parameters. Strengths, limitations, and extensions of the approach are extensively discussed and demonstrated.
In social and economic studies many of the collected variables are measured on a nominal scale, often with a large number of categories. The definition of categories is usually not unambiguous, and different classification schemes using either a finer or a coarser grid are possible. Categorisation has an impact when such a variable is included as a covariate in a regression model: a too-fine grid will result in imprecise estimates of the corresponding effects, whereas with a too-coarse grid important effects will be missed, resulting in biased effect estimates and poor predictive performance. To achieve automatic grouping of levels with essentially the same effect, we adopt a Bayesian approach and specify the prior on the level effects as a location mixture of spiky normal components. Fusion of level effects is induced by a prior on the mixture weights which encourages empty components. Model-based clustering of the effects during MCMC sampling makes it possible to simultaneously detect categories which have essentially the same effect size and identify variables with no effect at all. The properties of this approach are investigated in simulation studies. Finally, the method is applied to analyse the effects of high-dimensional categorical predictors on income in Austria.
This paper describes a method for clustering data that are spread out over large regions and whose dimensions are on different scales of measurement. The algorithm was developed for a robotics application consisting of sorting and storing objects in an unsupervised way. The toy dataset used to validate the application consists of Lego bricks of different shapes and colors. The uncontrolled lighting conditions, together with the use of RGB color features, yield data with a large spread and different levels of measurement across data dimensions. To cope with the combination of these two characteristics, we have developed a new weighted K-means algorithm, called gap-ratio K-means, which weights each dimension of the feature space before running the K-means algorithm. The weight associated with a feature is proportional to the ratio between the biggest gap separating two consecutive data points and the average of all the other gaps. This method is compared with two other variants of K-means on the Lego bricks clustering problem as well as on two other common classification datasets.
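A minimal sketch of the gap-ratio weighting idea as described in the abstract (my own reading; the paper's exact formulation may differ): weight each feature by the ratio of its largest consecutive gap to the average of the remaining gaps, rescale, then run standard K-means.

```r
# Gap-ratio feature weighting (sketch based on the abstract above; details
# may differ from the paper). For each feature: sort its values, take the
# gaps between consecutive points, and weight the feature by the ratio of
# the largest gap to the mean of the other gaps.
gap_ratio_weights <- function(X) {
  apply(X, 2, function(col) {
    gaps <- diff(sort(col))
    max(gaps) / mean(gaps[-which.max(gaps)])
  })
}

# Weighted K-means: rescale each feature by its weight, then run kmeans().
gap_ratio_kmeans <- function(X, centers, ...) {
  w <- gap_ratio_weights(X)
  kmeans(sweep(X, 2, w, `*`), centers, ...)
}

# Toy example: two well-separated groups in column 1, pure noise on a
# much larger scale in column 2.
set.seed(1)
X  <- cbind(c(rnorm(20, 0), rnorm(20, 5)), rnorm(40, 0, 100))
cl <- gap_ratio_kmeans(X, centers = 2)
```

The weighting upweights features with one dominant gap (i.e. clear cluster separation) and downweights features whose gaps are all of similar size, regardless of the feature's raw scale.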

Our work titled Online Human-Bot Interactions: Detection, Estimation, and Characterization has been accepted for publication at the prestigious International AAAI Conference on Web and Social Media (ICWSM 2017) to be held in Montreal, Canada in May 2017!

The goal of this study was twofold: first, we aimed at understanding how difficult it is to detect social bots on Twitter, both for machine learning models and for humans. Second, we wanted to perform a census of the Twitter population to estimate how many accounts are controlled not by humans, but by computer software (bots).

To address the first question, we developed a family of machine learning models that leverages over one thousand features characterising the online behaviour of Twitter accounts. We then trained these models with manually annotated collections of examples of human- and bot-controlled accounts across the spectrum of complexity, ranging from simple bots to very sophisticated ones fueled by advanced AI. We discovered that, while human accounts and simple bots are very easy to identify, both by other humans and by our models, there exists a family of sophisticated social AIs that systematically escape identification by our models and by human snap judgment.

Our second finding reveals that a significant fraction of Twitter accounts, between 9% and 15%, are likely social bots. This translates into nearly 50 million accounts, according to recent estimates that put the Twitter user base at above 320 million. Although not all bots are dangerous, many are used for malicious purposes: in the past, for example, Twitter bots have been used to manipulate public opinion during election times, to manipulate the stock market, and by extremist groups for radical propaganda.

## Cite as:

Onur Varol, Emilio Ferrara, Clayton Davis, Filippo Menczer, Alessandro Flammini. Online Human-Bot Interactions: Detection, Estimation, and Characterization. ICWSM 2017

## Press Coverage

1. News That 48 Million Of Twitter’s Users May Be Bots Could Impact Its Valuation – Forbes
2. Fake accounts scandal weighs on Twitter boss – The Times
3. Pressure Grows on Twitter CEO Dorsey Amid Bot Scandal – The Street
4. CMO Today: Marketers and Political Wonks Gather for SXSW – The Wall Street Journal
5. Huge number of Twitter accounts are not operated by humans – ABC News
6. Early Twitter investor Chris Sacca says he ‘hates’ the stock, calls bot issue ‘embarrassing’ – CNBC
7. Up to 48 million Twitter accounts are bots, study says – CNET
8. R u bot or not? – VICE
9. New Machine Learning Framework Uncovers Twitter’s Vast Bot Population – VICE/Motherboard
10. A Whopping 48 Million Twitter Accounts Are Actually Just Bots, Study Says – Tech Times
11. 15 Percent Of Twitter Accounts May Be Bots [STUDY] – Value Walk
12. Why the Rise of Bots is a Concern for Social Networks – Entrepreneur
13. Study reveals whopping 48M Twitter accounts are actually bots – CBS News
14. Twitter is home to nearly 48 million bots, according to report – The Daily Dot
15. As many as 48 million Twitter accounts aren’t people, says study – CNBC
16. New Study Says 48 Million Accounts On Twitter Are Bots – We are social media
17. Almost 48 million Twitter accounts are bots – Axios
18. Twitter user accounts: around 15% or 48 million are bots [study] – The Vanguard
19. Report: 48 Million Twitter Accounts Are Bots – Breitbart
20. Rise of the TWITTERBOTS – Daily Mail
21. 15 per cent of Twitter is bots, but not the Kardashian kind – The Inquirer
22. 48 mn Twitter accounts are bots, says study – The Economic Times
23. 9-15 per cent of Twitter accounts are bots, reveals study – Financial Express
24. Nearly 48 million Twitter accounts are bots: study – Deccan herald
25. Study: Nearly 48 Million Twitter Accounts Are Fake; Many Push Political Agendas – The Libertarian Republic
26. As many as 48 million accounts on Twitter are actually bots, study finds – Sacramento Bee
27. Study Reveals Roughly 48M Twitter Accounts Are Actually Bots – CBS DFW
28. Up to 48 million Twitter accounts may be Bots – Financial Buzz
29. Up to 15% of Twitter accounts are not real people – Blasting News
30. Tech Bytes: Twitter is Being Invaded by Bots – WDIO Eyewitness News
31. About 9-15% of Twitter accounts are bots: Study – The Indian Express
32. Twitter Has Nearly 48 Million Bot Accounts, So Don’t Get Hurt By All Those Online Trolls – India Times
33. Twitter May Have 45 Million Bots on Its Hands – Investopedia
35. 9-15% of Twitter accounts are bots: Study – MENA FN
36. Up To 15 Percent Of Twitter Users Are Bots, Study Says – Vocativ
37. 48 million active Twitter accounts could be bots – Gearbrain
38. Study: 15% of Twitter accounts could be bots – Marketing Dive
39. 15% of Twitter users are actually bots, study claims – MemeBurn
40. Almost 48 million Twitter accounts are bots – Click Lancashire
41. As many as 48 million or around 15% of Twitter accounts are bots – TechWorm
42. Twitter Has an Overwhelming 48 Million Bot Accounts – GineersNow

## Press in non-English media

1. Bis zu 48 von 319 Mio. Twitter-Nutzern sind Bots – Kronen Zeitung (in German)
2. Bad Bot oder Mensch – das ist hier die Frage – Medien Milch (in German)
3. Studie: Bis zu 48 Millionen Twitter-Nutzer sind in Wirklichkeit Bots – T3N (in German)
4. Der Aufstieg der Twitter-Bots: 48 Millionen Nutzer sind nicht menschlich – Studie – Sputnik News (in German)
5. Studie: Bis zu 48 Millionen Nutzer auf Twitter sind Bots – der Standard (in German)
6. “Blade Runner”-Test für Twitter-Accounts: Bot oder Mensch? – der Standard (in German)
8. 15 Prozent Social Bots? – DLF24 (in German)
9. TWITTER: IST JEDER SIEBTE USER EIN BOT? – UberGizmo (in German)
10. Twitter: Bis zu 48 Millionen Bot-Profile – Heise (in German)
11. Studie: Bis zu 15 Prozent aller aktiven, englischsprachigen Twitter-Konten sind Bots – Netzpolitik (in German)
12. Automatische Erregung – Wiener Zeitung (in German)
13. Un 15% de los usuarios de Twitter son bots – El Mundo (in Spanish)
14. Al menos el 15 por ciento de las cuentas de Twitter son bots – El Tiempo (in Spanish)
15. 15 por ciento de las cuentas de Twitter son ‘bots’: estudio – CNET (in Spanish)
16. El 15% de las cuentas de Twitter son bots – Gestion (in Spanish)
17. 48 de los 319 millones de usuarios activos de Twitter son bots – TIC Beat (in Spanish)
18. 15% de las cuentas de Twitter son ‘bots’ – Merca 2.0 (in Spanish)
19. 48 de los 319 de usuarios activos en Twitter son bots – MDZ (in Spanish)
21. Twitter compterait 48 millions de comptes gérés par des robots – MeltyStyle (in French)
22. Twitter : 48 millions de comptes sont des bots – blog du moderateur (in French)
23. ’30 tot 50 miljoen actieve Twitter-accounts zijn bots’ – NOS (in Dutch)
24. 48 εκατομμύρια χρήστες στο Twitter δεν είναι άνθρωποι, σύμφωνα με έρευνα Πηγή – LiFo (in Greek)
25. 48 triệu người dùng Twitter là bot và mối nguy hại – Khoa Hoc Phattrien (in Vietnamese)

### Document worth reading: “Automatic Sarcasm Detection: A Survey”

Automatic detection of sarcasm has witnessed interest from the sentiment analysis research community. With diverse approaches, datasets and analyses that have been reported, there is an essential need to have a collective understanding of the research in this area. In this survey of automatic sarcasm detection, we describe datasets, approaches (both supervised and rule-based), and trends in sarcasm detection research. We also present a research matrix that summarizes past work, and list pointers to future work. Automatic Sarcasm Detection: A Survey

### Becoming a Data Scientist Podcast Episode 15: David Meza

David Meza is Chief Knowledge Architect at NASA, and talks to Renee in this episode about his educational background, his early work at NASA, and examples of his work with multidisciplinary teams. He also describes a project involving a graph database that improved search capabilities so NASA engineers could more easily find “lessons learned”.

Link to podcast Episode 15 audio
Podcast’s RSS feed for podcast subscription apps
Podcast on Stitcher
Podcast on iTunes

Podcast Video Playlist:

Mentioned in the episode:

NASA.gov

MS Access

Neutral Buoyancy Lab

civil servant

graph database

JSC – Johnson Space Center

topic modeling

Southern Data Science Conference in Atlanta, GA on April 7, 2017 (Coupon code RENEE takes 15% off ticket price)

Thanks to DataCamp for sponsoring this episode!

DataCamp discount link in Data Science Learning Club forums (only visible to logged-in users)

### Becoming a Data Scientist Podcast Episode 16: Randy Olson

Renee interviews Randal S. Olson, Senior Data Scientist in the Institute for Biomedical Informatics at UPenn, about his path to becoming a data scientist, his interesting data science blog posts, and his work with non-data-scientists and students.

Podcast Video Playlist:

More about the Data Science Learning Club:
Data Science Learning Club Welcome Message
Data Science Learning Club Activity 16 – Genetic Algorithms
Data Science Learning Club Meet & Greet

Mentioned in the episode:

bytecode

Dr. Kenneth Stanley at the University of Central Florida

evolutionary algorithm

Michigan State University Artificial Intelligence

BEACON NSF Science and Technology Center at MSU

Randal S. Olson publications

Randy’s blog

Data Is Beautiful Reddit

Randy on:
github
Patreon

Becoming a Data Scientist T-Shirts!

Thanks to DataCamp for sponsoring this episode!

DataCamp discount link in Data Science Learning Club forums (only visible to logged-in users)

### Book Memo: “Markov Decision Processes in Practice”

 This book presents classical Markov Decision Processes (MDP) for real-life applications and optimization. MDP allows users to develop and formally support approximate and simple decision rules, and this book showcases state-of-the-art applications in which MDP was key to the solution approach. The book is divided into six parts. Part 1 is devoted to the state-of-the-art theoretical foundation of MDP, including approximate methods such as policy improvement, successive approximation and infinite state spaces, as well as an instructive chapter on Approximate Dynamic Programming. The remaining five parts cover specific, non-exhaustive application areas. Part 2 covers MDP healthcare applications, including screening procedures, appointment scheduling, ambulance scheduling and blood management. Part 3 explores MDP modeling within transportation, ranging from public to private transportation, and from airports and traffic lights to car parking and charging electric cars. Part 4 contains three chapters that illustrate the structure of approximate policies for production and manufacturing settings. In Part 5, communications is highlighted as an important application area for MDP, including Gittins indices, down-to-earth call centers and wireless sensor networks. Finally, Part 6 is dedicated to financial modeling, offering an instructive review of financial portfolios and derivatives under proportional transaction costs. The MDP applications in this book illustrate a variety of both standard and non-standard aspects of MDP modeling and its practical use. The book should appeal to practitioners, academic researchers and educators with a background in, among other fields, operations research, mathematics, computer science, and industrial engineering.

### RApiDatetime 0.0.1

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

Very happy to announce a new package of mine is now up on the CRAN repository network: RApiDatetime.

It provides six entry points for C-level functions of the R API for Date and Datetime calculations: asPOSIXlt and asPOSIXct convert between long and compact datetime representations, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetime. These six functions are all fairly essential and useful, but not one of them was previously exported by R. Hence the need to put them together in this package to complete the accessible API somewhat.

These should be helpful for fellow package authors, as many of us either keep our own partial copies of some of this code or farm the work back out to R to get it done.

As a simple (yet real!) illustration, here is an actual Rcpp function which we could now cover at the C level rather than having to go back up to R (via Rcpp::Function()):

inline Datetime::Datetime(const std::string &s, const std::string &fmt) {
    Rcpp::Function strptime("strptime");    // we cheat and call strptime() from R
    Rcpp::Function asPOSIXct("as.POSIXct"); // and we need to convert to POSIXct
    m_dt = Rcpp::as<double>(asPOSIXct(strptime(s, fmt)));
    update_tm();
}

I had taken a first brief stab at this about two years ago, but never finished. With the recent emphasis on C-level function registration, coupled with a possible use case from anytime, I more or less put this together last weekend.

It currently builds and tests fine on POSIX-alike operating systems. If someone with some skill and patience in working on Windows would like to help complete the Windows side of things then I would certainly welcome help and pull requests.

For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Distilled News

From the previous post on “Poor Data Management Practices“, the discussion ended with a high level approach to one possible solution for data silos. Traditional approaches for solving the data silo problem can cost millions of dollars (even for a moderately sized company), and typically requires a huge effort in integration work (e.g., data modeling, system engineering, software design, and development). In this post Flafka, the unofficial name for integrating Flume as a producer for Kafka, is presented as another possible big data solution for data silos.
I’m fresh off my annual field trip to the Strata+Hadoop conference in San Jose last week. This is always exciting, energizing, and exhausting, but it remains the single best place to pick up on what’s changing in our profession. This conference is on a world tour with four more stops before repeating next year. The New York show is supposed to be a little bigger (hard to imagine) but the San Jose show is closest to our intellectual birthplace. After all, this is the place where to call yourself a nerd would be regarded as a humble brag. I’ll try to briefly share the major themes and changes I found this year and will write later in more depth about some of these.
We explore the use of Evolution Strategies, a class of black box optimization algorithms, as an alternative to popular RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using hundreds to thousands of parallel workers, ES can solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training time. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
Have trouble setting and tracking goals for your business or yourself? And what if you have thousands of data points coming from multiple data sources? At Statsbot, we believe that Artificial Intelligence can change the way people work with data. At first, we saved analysts from tedious and boring work gathering data from numerous analytics platforms and sharing insights with teammates.

## March 23, 2017

### Magister Dixit

“Digital leaders know their data. They convert their information into actionable business insight. Considering that more data is shared online every second today than was stored in the entire Internet 20 years ago, it’s no wonder that differentiating products and services requires advanced tools.” Mark Barrenechea ( September 11, 2015 )

### Announcing R Tools 1.0 for Visual Studio 2015

This post is authored by Shahrokh Mortazavi, Partner Director of Program Management at Microsoft.

I’m delighted to announce the General Availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will be shortly followed by R Tools 1.0 for Visual Studio 2017 in early May. RTVS is a free and open source plug-in that turns Visual Studio into a powerful and productive R development environment. Check out this video for a quick tour of its core features:

#### Core IDE Features

RTVS builds on Visual Studio, which means you get numerous features for free, from support for multiple languages to world-class editing and debugging, to over 7,000 extensions for every conceivable need.

• A polyglot IDE – VS supports R, Python, C++, C#, Node.js, SQL, etc. projects simultaneously.
• Editor – complete editing experience for R scripts and functions, including detachable/tabbed windows, syntax highlighting, and much more.
• IntelliSense – aka auto-completion, available in both the editor and the Interactive R window.
• R Interactive Window – work with the R console directly from within Visual Studio.
• History window – view, search, select previous commands and send to the Interactive window.
• Variable Explorer – drill into your R data structures and examine their values.
• Plotting – see all your R plots in a Visual Studio tool window.
• Debugging – breakpoints, stepping, watch windows, call stacks and more.
• R Markdown – R Markdown/knitr support with export to Word and HTML.
• Git – source code control via Git and GitHub.
• Extensions – over 7,000 extensions covering a wide spectrum from data to languages to productivity.
• Help – use ? and ?? to view R documentation within Visual Studio.

RTVS includes various features that address the needs of individual data scientists as well as data science teams, for example:

#### SQL Server 2016

RTVS integrates with SQL Server 2016 R Services and SQL Server Tools for Visual Studio 2015. These separate downloads enhance RTVS with support for syntax coloring and IntelliSense, interactive queries, and deployment of stored procedures directly from Visual Studio.

#### Microsoft R Client

Use the stock CRAN R interpreter, or the enhanced Microsoft R Client and its ScaleR functions that support multi-core and cluster computing for practicing data science at scale.

#### Visual Studio Team Services

Integrated support for git, continuous integration, agile tools, release management, testing, reporting, bug and work-item tracking through Visual Studio Team Services. Use our hosted service or host it yourself, privately.

#### Remoting

Whether it’s data governance, security, or running large jobs on a powerful server, RTVS workspaces enable setting up your own R server or connecting to one in the cloud.

We’re very excited to officially bring another language to the Visual Studio family! Along with Python Tools for Visual Studio, you have the two main languages for tackling almost any ML and analytics related challenge. Very soon (~May), we’ll release RTVS for VS2017 as well. We’ll also resurrect the “Data Science workload” in VS2017, which gives you R, Python, F# and all their respective package distros in one convenient install.

Beyond that, we’re looking forward to hearing from you on what features we should focus on next! R package development? Mixed R+C debugging? Model deployment? VS Code/R for cross-platform development? Please let us know on the GitHub repo.

Shahrokh

Resources

### Evolution Strategies as a Scalable Alternative to Reinforcement Learning - implementation -

We explore the use of Evolution Strategies, a class of black box optimization algorithms, as an alternative to popular RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using hundreds to thousands of parallel workers, ES can solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training time. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
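The update the abstract describes, perturbing the parameters with Gaussian noise and moving in the direction of the reward-weighted perturbations, can be sketched in a few lines of NumPy. This is only a toy illustration of the basic ES estimator on a trivial problem; the function name and all hyperparameters are illustrative, and the paper's full method adds reward shaping, antithetic sampling, and massive parallelism.

```python
import numpy as np

def evolution_strategies(f, theta, sigma=0.1, alpha=0.02, pop_size=50, iters=300, seed=0):
    """Basic ES: estimate a gradient from the rewards of Gaussian-perturbed
    copies of the parameters, then take a small step along that estimate."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        noise = rng.standard_normal((pop_size, theta.size))      # one perturbation per "worker"
        rewards = np.array([f(theta + sigma * n) for n in noise])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize rewards
        theta = theta + alpha / (pop_size * sigma) * noise.T @ rewards
    return theta

# Toy problem: maximize reward = -||theta - target||^2
target = np.array([1.0, -2.0, 0.5])
reward = lambda th: -np.sum((th - target) ** 2)
solution = evolution_strategies(reward, np.zeros(3))
```

Note that no backpropagation is needed anywhere: only reward evaluations, which is why the approach parallelizes so well.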

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there!

### so what?

"What is the point?" This is a question that comes up often in my workshops when we are looking at graphs and discussing how they can be improved. Other flavors of this same basic question take the form: "What is the message?", "What is the story here?" or the concise, "So what?"

Too often, when we communicate with data, we don't make our point clear. We leave our audience guessing. Your audience should never have to guess what message you want them to know. The onus is on the person communicating the information (you!) to make that clear.

I've been thinking a lot about story lately (in preparation for my recent Tapestry presentation and also for an upcoming project). The word "story" has become a buzzword. Everyone wants to "tell a story with data." But very often, when we use this phrase, we don't really mean story. We mean what I mentioned above—the point, the key takeaway, the so what?

I've started to draw a distinction when I talk about story into two types: story-with-a-lower-case-'s' and Story-with-a-capital-'S.' The latter is Story in the real sense of the word. A Story has key critical components—there is a structure, a shape—it has a plot, a rising action, a point of climax where tensions reach their highest, a falling action, and a resolution. (Related note, Jon Schwabish is currently running a series of blog posts on story—the capital S kind—that starts off with his thoughts on the question what is Story?). But I veer slightly off track.

Today, my focus is on story-with-a-lower-case-'s,' which, in my view, is the minimum level of "story" when you are communicating for explanatory purposes with data. It's not really story at all, but rather the point—the so what? For every graph you show, for every slide you show: make the point clear to your audience. This can be through your spoken words in a live meeting or presentation, or physically written down on the page if the document is meant to stand on its own. Don't assume two people looking at the same graph or slide will interpret it the same way. That means if there is a key takeaway—which there absolutely should be if you're at the point of communicating the information—you need to make that point clearly to your audience. Put it into words!

Speaking of words, in slideware-land (PowerPoint, Keynote, and similar), the title bar on each slide is precious real estate (this is similar to a section heading in a written report). This is the first thing your audience encounters when they see your slide. Too often we underutilize this space, filling it with a merely descriptive title. Think about instead using this precious real estate for an active title. Put your key takeaway—the so what?—there. It makes sense when we stop and think about it: use your title strategically. (If you need more evidence than common sense, check out Michelle Borkin's Tapestry talk where she demonstrates the importance of effective titles and also covers some other interesting learnings for communicating with data from her studies at Northeastern University.)

Let's check out the importance of having a clear so what? and the impact effective titling can have through an example. The following is a visual I discussed in a conference presentation recently:

I originally came across this graphic when combing through recent posts on viz.wtf (an entertaining potpourri of what not to do when visualizing data). I know, I know—it's not really data visualization at all, just a visual made to look infographic-y with some numbers in it. Being clear on the so what? can help us better understand how to best visualize this data.

I did a little digging, and it turns out that the graphic above was originally part of an article in The Daily Texan:

Note the title: "DWI rates increase in months following departure of Uber and Lyft." It turns out there already was an effective title making the so what? clear.

Now that we know what point we're trying to make, we can visualize the data to make this point more effectively. As a related note, these numbers don't actually appear to be DWI rates, as described in the original graphic. A little research reveals this data is likely the number of DWI arrests. I'll make that clear in the title and also spell out what DWI is the first time it's used, in case anyone in my audience is unfamiliar. If we want to show an increase over time, a line graph could do that effectively:

The line graph allows us to clearly see the trend: decreasing DWIs January through April and increasing April forward. But remember that part about making my so what? clear via words? In the above, my audience is left to interpret the data themselves and draw their own conclusions. If I'm the one presenting the data, I should assist in that process. In the following, I've added a subtitle making the takeaway clear. Also, there was an important event—the departure of Uber and Lyft—which I've annotated on the graph directly for context.

I put the main point into words via the subtitle. I gave my audience the added bonus of tying these words visually to the relevant data points through consistent use of color. This means that after my audience reads the words, they know exactly where to look in the graph for evidence of the point that is being made. If I were showing this graph on a slide, I could have the takeaway "DWIs increase in the months following Uber and Lyft departures" as the slide title (and remove it from the subtitle, leaving just the main title on the graph).
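As a rough sketch of this pattern (an active takeaway title, color-matched to the line it describes, with the key event annotated directly on the graph), here is one way it might look in matplotlib. The monthly counts below are made up for illustration and are not the real Austin DWI figures:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]
dwis = [420, 400, 390, 380, 430, 470, 500]   # illustrative values only
departure = 3                                # hypothetical month of departure

fig, ax = plt.subplots()
ax.plot(months, dwis, color="steelblue")
ax.axvline(departure, color="gray", linestyle="--")
ax.annotate("Uber & Lyft depart", xy=(departure, 430), color="gray")
# Main title describes the data; the takeaway sits right above the graph,
# colored to match the line it refers to
fig.suptitle("Driving While Intoxicated (DWI) arrests by month", x=0.125, ha="left")
ax.set_title("DWIs increase in the months following ridesharing departures",
             loc="left", color="steelblue", fontsize=10)
fig.savefig("dwi_arrests.png")
```

Swapping the title strings lets the same figure serve as a standalone graphic or as a slide body under an active slide title.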

Bottom line: make your so what? clear via words on every graph and every slide. Don't leave your audience guessing, or leave your important takeaway to chance!

Recurrent Neural Networks (RNNs) are powerful models that achieve unparalleled performance on several pattern recognition problems. However, training of RNNs is a computationally difficult task owing to the well-known ‘vanishing/exploding’ gradient problems. In recent years, several algorithms have been proposed for training RNNs. These algorithms either exploit no (or limited) curvature information and have cheap per-iteration complexity, or attempt to gain significant curvature information at the cost of increased per-iteration cost. The former set includes diagonally-scaled first-order methods such as ADAM and ADAGRAD, while the latter consists of second-order algorithms like Hessian-Free Newton and K-FAC. In this paper, we present a novel stochastic quasi-Newton algorithm (adaQN) for training RNNs. Our approach retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme. The method is judicious in storing and retaining L-BFGS curvature pairs, which is indirectly used as a means of controlling the quality of the steps. We present numerical experiments on two language modeling tasks and show that adaQN performs on par with, if not better than, popular RNN training algorithms. These results suggest that quasi-Newton algorithms have the potential to be a viable alternative to first- and second-order methods for training RNNs. … adaQN

### TensorFlow RNN Tutorial

On the deep learning R&D team at SVDS, we have investigated Recurrent Neural Networks (RNN) for exploring time series and developing speech recognition capabilities. Many products today rely on deep neural networks that implement recurrent layers, including products made by companies like Google, Baidu, and Amazon.

However, when developing our own RNN pipelines, we did not find many simple and straightforward examples of using neural networks for sequence learning applications like speech recognition. Many examples were either powerful but quite complex, like the actively developed DeepSpeech project from Mozilla under Mozilla Public License, or were too simple and abstract to be used on real data.

In this post, we’ll provide a short tutorial for training an RNN for speech recognition; we’re including code snippets throughout, and you can find the accompanying GitHub repository here. The software we’re using is a mix of code borrowed from and inspired by existing open source projects. Below is a video example of machine speech recognition on a 1906 Edison Phonograph advertisement. The video includes a running trace of sound amplitude, extracted spectrogram, and predicted text.

Since we have extensive experience with Python, we used a well-documented package that has been advancing by leaps and bounds: TensorFlow. Before you get started, if you are brand new to RNNs, we highly recommend you read Christopher Olah’s excellent overview of RNN Long Short-Term Memory (LSTM) networks here.

## Speech recognition: audio and transcriptions

Until the 2010s, the state of the art in speech recognition was phonetic-based approaches with separate components for pronunciation, acoustic, and language models. Speech recognition in the past and today both rely on decomposing sound waves into frequency and amplitude using Fourier transforms, yielding a spectrogram as shown below.

Training the acoustic model for a traditional speech recognition pipeline that uses Hidden Markov Models (HMM) requires speech+text data, as well as a word-to-phoneme dictionary. HMMs are generative probabilistic models for sequential data, and are typically evaluated using Levenshtein word error distance, a metric for measuring the difference between two strings.
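The Levenshtein word error distance just mentioned is straightforward to compute with dynamic programming over word tokens. A minimal sketch (the function names are ours, not from any speech toolkit):

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (insertions, deletions, substitutions)."""
    d = list(range(len(hyp) + 1))          # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, d[j] = d[j], min(d[j] + 1,              # delete reference token
                                        d[j - 1] + 1,          # insert hypothesis token
                                        prev_diag + (r != h))  # substitute (or match)
    return d[len(hyp)]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

# "quick" substituted and "lazy" deleted: 2 edits over 6 reference words
print(word_error_rate("the quick brown fox was lazy", "the fast brown fox was"))  # 2/6 ≈ 0.333
```

Dividing the edit distance by the reference length is what turns the raw distance into the word error rate (WER) quoted later in this post.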

These models can be simplified and made more accurate with speech data that is aligned with phoneme transcriptions, but this is a tedious manual task. Because of this effort, phoneme-level transcriptions are less likely to exist for large sets of speech data than word-level transcriptions. For more information on existing open source speech recognition tools and models, check out our colleague Cindi Thompson’s recent post.

## Connectionist Temporal Classification (CTC) loss function

We can discard the concept of phonemes when using neural networks for speech recognition by using an objective function that allows for the prediction of character-level transcriptions: Connectionist Temporal Classification (CTC). Briefly, CTC enables the computation of probabilities of multiple sequences, where the sequences are the set of all possible character-level transcriptions of the speech sample. The network uses the objective function to maximize the probability of the character sequence (i.e., chooses the most likely transcription), and calculates the error for the predicted result compared to the actual transcription to update network weights during training.
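The decoding rule this objective induces, collapse consecutive repeated labels and then drop the blank symbol, can be sketched for greedy (best-path) decoding in a few lines. The blank character below is an illustrative placeholder:

```python
import itertools

BLANK = '_'  # stand-in for the CTC blank symbol

def ctc_best_path_decode(frame_labels):
    """Greedy CTC decoding: merge consecutive repeats, then remove blanks."""
    merged = (label for label, _ in itertools.groupby(frame_labels))
    return ''.join(label for label in merged if label != BLANK)

# Ten frames of per-frame argmax labels decode to "cat"
print(ctc_best_path_decode(['c', 'c', '_', 'a', 'a', '_', 't', 't', '_', '_']))  # cat
# A blank between two identical labels is what preserves double letters
print(ctc_best_path_decode(['b', 'e', '_', 'e', 'p', '_']))  # beep
```

The full CTC loss sums the probability of every frame labeling that collapses to the target transcription; this sketch only shows why repeated frames and blanks do not corrupt the output.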

It is important to note that the character-level error used by a CTC loss function differs from the Levenshtein word error distance often used in traditional speech recognition models. For character-generating RNNs, the character and word error distances will be similar in phonetic languages such as Esperanto and Croatian, where individual sounds correspond to distinct characters. Conversely, the character versus word error will be quite different for a non-phonetic language like English.

If you want to learn more about CTC, there are many papers and blog posts that explain it in more detail. We will use TensorFlow’s CTC implementation, and research continues on CTC-related implementations and improvements, such as this recent paper from Baidu. In order to utilize algorithms developed for traditional or deep learning speech recognition models, our team structured our speech recognition platform for modularity and fast prototyping:

## Importance of data

It should be no surprise that creating a system that transforms speech into its textual representation requires having (1) digital audio files and (2) transcriptions of the words that were spoken. Because the model should generalize to decode any new speech samples, the more examples we can train the system on, the better it will perform. We researched freely available recordings of transcribed English speech; some examples that we have used for training are LibriSpeech (1000 hours), TED-LIUM (118 hours), and VoxForge (130 hours). The chart below includes information on these datasets including total size in hours, sampling rate, and annotation.

In order to easily access data from any data source, we store all data in a flat format. This flat format has a single .wav and a single .txt per datum. For example, you can find example LibriSpeech training datum ‘211-122425-0059’ in our GitHub repo as 211-122425-0059.wav and 211-122425-0059.txt. These data filenames are loaded into the TensorFlow graph using a datasets object class that assists TensorFlow in efficiently loading and preprocessing the data, and in moving individual batches of data from CPU to GPU memory. An example of the data fields in the datasets object is shown below:

class DataSet:
    def __init__(self, txt_files, thread_count, batch_size, numcep, numcontext):
        ...  # store the file list and batching parameters

    def from_directory(self, dirpath, start_idx=0, limit=0, sort=None):
        return txt_filenames(dirpath, start_idx=start_idx, limit=limit, sort=sort)

    def next_batch(self, batch_size=None):
        idx_list = range(self._start_idx, self._end_idx)
        txt_files = [self._txt_files[i] for i in idx_list]
        wav_files = [x.replace('.txt', '.wav') for x in txt_files]
        # Load audio and text into memory
        (audio, text) = get_audio_and_transcript(txt_files,
                                                 wav_files,
                                                 self._numcep,
                                                 self._numcontext)


## Feature representation

In order for a machine to recognize audio data, the data must first be converted from the time domain to the frequency domain. There are several methods for creating features for machine learning on audio data, including binning by arbitrary frequencies (e.g., every 100 Hz) or using binning that matches the frequency bands of the human ear. The typical human-centric transformation for speech data is to compute Mel-frequency cepstral coefficients (MFCC), either 13 or 26 different cepstral features, as input for the model. After this transformation, the data is stored as a matrix of frequency coefficients (rows) over time (columns).
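The core time-to-frequency step can be sketched with NumPy alone; MFCC then adds mel-scale filter banks, a log, and a discrete cosine transform on top of the magnitudes computed here. The frame and hop lengths below are illustrative (25 ms and 10 ms at 16 kHz):

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames and take FFT magnitudes,
    giving a (frequency x time) matrix like the one described above."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)               # taper frame edges to reduce leakage
    return np.abs(np.fft.rfft(frames, axis=1)).T  # rows: frequency bins, cols: time

# One second of a 440 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (201, 98): 201 frequency bins, 98 frames
```

With a 400-sample frame, each bin is 40 Hz wide, so the tone's energy lands in bin 11 (440 / 40) of every frame.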

Because speech sounds do not occur in isolation and do not have a one-to-one mapping to characters, we can capture the effects of coarticulation (the articulation of one sound influencing the articulation of another) by training the network on overlapping windows (tens of milliseconds) of audio data that capture sound from before and after the current time index. Example code showing how to obtain MFCC features and how to create windows of audio data is shown below:

# Load wav files
fs, audio = wav.read(audio_filename)

# Get mfcc coefficients
orig_inputs = mfcc(audio, samplerate=fs, numcep=numcep)

# For each time slice of the training set, we need to copy the context
train_inputs = np.array([], np.float32)
train_inputs.resize((orig_inputs.shape[0], numcep + 2 * numcep * numcontext))

time_slices = range(train_inputs.shape[0])
for time_slice in time_slices:
    # Pick up to numcontext time slices in the past,
    # and pad with empty mfcc features where the context runs off the start
    need_empty_past = max(0, (time_slices[0] + numcontext) - time_slice)
    empty_source_past = list(empty_mfcc for empty_slots in range(need_empty_past))
    data_source_past = orig_inputs[max(0, time_slice - numcontext):time_slice]
    assert len(empty_source_past) + len(data_source_past) == numcontext
    ...


For our RNN example, we use 9 time slices before and 9 after, for a total of 19 time points per window. With 26 cepstral coefficients, this is 494 data points per 25 ms observation. Depending on the data sampling rate, we recommend 26 cepstral features for 16,000 Hz and 13 cepstral features for 8,000 Hz. Below is an example of data loading windows on 8,000 Hz data:
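The context-window construction described above amounts to stacking shifted, zero-padded copies of the feature matrix. A sketch (the names here are illustrative, not from the tutorial repository):

```python
import numpy as np

def add_context(features, numcontext=9):
    """Stack each time slice with `numcontext` past and future slices,
    zero-padding at the edges of the utterance."""
    n_steps, numcep = features.shape
    padded = np.vstack([np.zeros((numcontext, numcep)),
                        features,
                        np.zeros((numcontext, numcep))])
    window = 2 * numcontext + 1          # 9 past + current + 9 future = 19
    return np.hstack([padded[i:i + n_steps] for i in range(window)])

# 100 time slices of 26 cepstral coefficients -> 494 features per slice
mfcc_features = np.random.rand(100, 26)
windows = add_context(mfcc_features)
print(windows.shape)  # (100, 494)
```

The first and last slices have all-zero past and future context respectively, matching the "empty mfcc features" padding in the excerpt above.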

If you would like to learn more about converting analog to digital sound for RNN speech recognition, check out Adam Geitgey’s machine learning post.

## Modeling the sequential nature of speech

Long Short-Term Memory (LSTM) layers are a type of recurrent neural network (RNN) architecture that is useful for modeling data that has long-term sequential dependencies. They are important for time series data because they essentially remember past information at the current time point, which influences their output. This context is useful for speech recognition because of its temporal nature. If you would like to see how LSTM cells are instantiated in TensorFlow, we’ve included example code below from the LSTM layer of our DeepSpeech-inspired Bi-Directional Neural Network (BiRNN).

with tf.name_scope('lstm'):
    # Forward direction cell:
    lstm_fw_cell = tf.contrib.rnn.BasicLSTMCell(n_cell_dim, forget_bias=1.0, state_is_tuple=True)
    # Backward direction cell:
    lstm_bw_cell = tf.contrib.rnn.BasicLSTMCell(n_cell_dim, forget_bias=1.0, state_is_tuple=True)

    # Now we feed layer_3 into the LSTM BRNN cell and obtain the LSTM BRNN output.
    outputs, output_states = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=lstm_fw_cell,
        cell_bw=lstm_bw_cell,
        # Input is the previous fully connected layer before the LSTM
        inputs=layer_3,
        dtype=tf.float32,
        time_major=True,
        sequence_length=seq_length)

    tf.summary.histogram("activations", outputs)


For more details about this type of network architecture, there are some excellent overviews of how RNNs and LSTM cells work. Additionally, there continues to be research on alternatives to using RNNs for speech recognition, such as with convolutional layers which are more computationally efficient than RNNs.

## Network training and monitoring

Because we trained our network using TensorFlow, we were able to visualize the computational graph as well as monitor the training, validation, and test performance from a web portal with very little extra effort using TensorBoard. Using tips from Dandelion Mané’s great talk at the 2017 TensorFlow Dev Summit, we utilize tf.name_scope to add node and layer names, and write our summary out to file. The result is an automatically generated, understandable computational graph, such as this example of a Bi-Directional Neural Network (BiRNN) below. The data is passed amongst different operations from bottom left to top right. The different nodes can be labelled and colored with namespaces for clarity. In this example, teal ‘fc’ boxes correspond to fully connected layers, and the green ‘b’ and ‘h’ boxes correspond to biases and weights, respectively.

We utilized the TensorFlow provided tf.train.AdamOptimizer to control the learning rate. The AdamOptimizer improves on traditional gradient descent by using momentum (moving averages of the gradients), facilitating efficient dynamic adjustment of the effective step size for each parameter. We can track the loss and error rate by creating summary scalars of the label error rate:

# Create a placeholder for the summary statistics
with tf.name_scope("accuracy"):
    # Compute the edit (Levenshtein) distance of the top path
    distance = tf.edit_distance(tf.cast(self.decoded[0], tf.int32), self.targets)

    # Compute the label error rate (accuracy)
    self.ler = tf.reduce_mean(distance, name='label_error_rate')
    self.ler_placeholder = tf.placeholder(dtype=tf.float32, shape=[])
    self.train_ler_op = tf.summary.scalar("train_label_error_rate", self.ler_placeholder)
    self.dev_ler_op = tf.summary.scalar("validation_label_error_rate", self.ler_placeholder)
    self.test_ler_op = tf.summary.scalar("test_label_error_rate", self.ler_placeholder)
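The moving-average machinery inside AdamOptimizer can be sketched in plain NumPy; a toy one-parameter quadratic stands in for the network loss, and all names and hyperparameters here are illustrative:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    squared gradient (v), bias-corrected, then a per-parameter scaled step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize (theta - 3)^2; its gradient is 2 * (theta - 3)
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * (theta - 3), m, v, t)
print(round(theta, 2))  # close to 3.0
```

Dividing by the root of the squared-gradient average is what makes the step size roughly scale-free per parameter, which is why Adam needs less learning-rate tuning than plain SGD.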


## How to improve an RNN

Now that we have built a simple LSTM RNN network, how do we improve our error rate? Luckily for the open source community, many large companies have published the math that underlies their best performing speech recognition models. In September 2016, Microsoft released a paper on arXiv describing how they achieved a 6.9% error rate on the NIST 2000 Switchboard data. They utilized several different acoustic and language models on top of their convolutional+recurrent neural network. Several key improvements that have been made by the Microsoft team and other researchers in the past 4 years include:

• using language models on top of character based RNNs
• using convolutional neural nets (CNNs) for extracting features from the audio
• ensemble models that utilize multiple RNNs

It is important to note that the language models pioneered in traditional speech recognition models over the past few decades are again proving valuable in deep learning speech recognition models.

Modified from: “A Historical Perspective of Speech Recognition,” Xuedong Huang, James Baker, and Raj Reddy. Communications of the ACM, Vol. 57 No. 1, Pages 94-103, 2014.

We have provided a GitHub repository with a script that provides a working and straightforward implementation of the steps required to train an end-to-end speech recognition system using RNNs and the CTC loss function in TensorFlow. We have included example data from the LibriVox corpus in the repository. The data is separated into folders:

• Train: train-clean-100-wav (5 examples)
• Test: test-clean-wav (2 examples)
• Dev: dev-clean-wav (2 examples)

When training on this handful of examples, you will quickly notice that the model overfits the training data to ~0% word error rate (WER), while the test and dev sets sit at ~85% WER. The reason the test error rate is not 100% is that out of the 29 possible character choices (a-z, apostrophe, space, blank), the network will quickly learn that:

• certain characters (e, a, space, r, s, t) are more common
• consonant-vowel-consonant is a pattern in English
• increased signal amplitude of the MFCC input sound features corresponds to characters a-z

The results of a training run using the default configurations in the GitHub repository are shown below:

If you would like to train a performant model, you can add additional .wav and .txt files to these folders, or create a new folder and update configs/neural_network.ini with the folder locations. Note that it can take quite a lot of computational power to process and train on just a few hundred hours of audio, even with a powerful GPU.