# My Data Science Blogs

## June 18, 2019

### If you did not already know

Elastic Distributed Training
With an increasing demand for training powers for deep learning algorithms and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource utilization and reduce cost. In this process, different tasks may receive varying numbers of machines at different time, a setting we call elastic distributed training. Despite the recent successes in large mini-batch distributed training, these methods are rarely tested in elastic distributed training environments and suffer degraded performance in our experiments, when we adjust the learning rate linearly immediately with respect to the batch size. One difficulty we observe is that the noise in the stochastic momentum estimation is accumulated over time and will have delayed effects when the batch size changes. We therefore propose to smoothly adjust the learning rate over time to alleviate the influence of the noisy momentum estimation. Our experiments on image classification, object detection and semantic segmentation have demonstrated that our proposed Dynamic SGD method achieves stabilized performance when varying the number of GPUs from 8 to 128. We also provide theoretical understanding on the optimality of linear learning rate scheduling and the effects of stochastic momentum. …

Explanation-assisted Guess (ExAG)
While there have been many proposals on how to make AI algorithms more transparent, few have attempted to evaluate the impact of AI explanations on human performance on a task using AI. We propose a Twenty-Questions style collaborative image guessing game, Explanation-assisted Guess Which (ExAG) as a method of evaluating the efficacy of explanations in the context of Visual Question Answering (VQA) – the task of answering natural language questions on images. We study the effect of VQA agent explanations on the game performance as a function of explanation type and quality. We observe that ‘effective’ explanations are not only conducive to game performance (by almost 22% for ‘excellent’ rated explanations), but also helpful when VQA system answers are erroneous or noisy (by almost 30% compared to no explanations). We also see that players develop a preference for explanations even when penalized and that the explanations are mostly rated as ‘helpful’. …

Distributed Online Linear Regression
We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in ${\cal O}(T^{3/4})$ when the decision set is unbounded and in ${\cal O}(\sqrt{T})$ in case of bounded decision set. …

Quantitative CBA
Quantitative CBA is a postprocessing algorithm for association rule classification algorithm CBA (Liu et al, 1998). QCBA uses original, undiscretized numerical attributes to optimize the discovered association rules, refining the boundaries of literals in the antecedent of the rules produced by CBA. Some rules as well as literals from the rules can consequently be removed, which makes the resulting classifier smaller. One-rule classification and crisp rules make CBA classification models possibly most comprehensible among all association rule classification algorithms. These viable properties are retained by QCBA. The postprocessing is conceptually fast, because it is performed on a relatively small number of rules that passed data coverage pruning in CBA. Benchmark of our QCBA approach on 22 UCI datasets shows average 53% decrease in the total size of the model as measured by the total number of conditions in all rules. Model accuracy remains on the same level as for CBA. …

### Magister Dixit

“You can use an eraser on the drafting table or a sledgehammer on the construction site.” Frank Lloyd Wright

## June 17, 2019

### R Packages worth a look

Stan Models for the Pairwise Comparison Factor Model (pcFactorStan)
Provides convenience functions and pre-programmed Stan models related to the pairwise comparison factor model. Its purpose is to make fitting pairwise …

Spatial and Spatio-Temporal Bayesian Model for Circular Data (CircSpaceTime)
Implementation of Bayesian models for spatial and spatio-temporal interpolation of circular data using Gaussian Wrapped and Gaussian Projected distribu …

Complex Pearson Distributions (cpd)
Probability mass function, distribution function, quantile function and random generation for the Complex Triparametric Pearson (CTP) and Complex Bipar …

The Topic SCORE Algorithm to Fit Topic Models (TopicScore)
Provides implementation of the ‘Topic SCORE’ algorithm that is proposed by Tracy Ke and Minzhe Wang. The singular value decomposition step is optimized …

### Le Monde puzzle [#1104]

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A palindromic Le Monde mathematical puzzle:

In a monetary system where every palindromic amount between 1 and 10⁸ has a coin, find the amounts less than 10³ that cannot be paid with fewer than three coins. Determine whether 20,191,104 can be paid with two coins. Similarly, determine whether 11,042,019 can be paid with two or three coins.

Which can be solved in a few lines of R code:

coin=sort(c(1:9,(1:9)*11,outer(1:9*101,(0:9)*10,"+")))
amounz=sort(unique(c(coin,as.vector(outer(coin,coin,"+")))))
amounz=amounz[amounz<1e3]


and produces 10 amounts that cannot be paid with one or two coins. It is also easy to check that three coins are enough to cover all amounts below 10³. For the second question, starting with n¹=20,188,102, a simple downward search of palindromic pairs (n¹,n²) such that n¹+n²=20,191,104 led to n¹=16,755,761 and n²=3,435,343. And starting with n¹=11,033,011, the same search does not produce any solution, while there are three coins such that n¹+n²+n³=11,042,019, for instance n¹=11,022,011, n²=20,002, and n³=6.
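The reported decompositions are easy to cross-check outside R. A minimal Python sketch (independent of the code above) that verifies the two-coin pair for 20,191,104 and exhaustively confirms that no two palindromic coins sum to 11,042,019:

```python
def is_palindrome(n):
    s = str(n)
    return s == s[::-1]

# The palindromic coin pair reported for 20,191,104
assert is_palindrome(16755761) and is_palindrome(3435343)
assert 16755761 + 3435343 == 20191104

# The palindromic coin triple reported for 11,042,019
assert all(map(is_palindrome, (11022011, 20002, 6)))
assert 11022011 + 20002 + 6 == 11042019

# Exhaustive check: no palindromic pair sums to 11,042,019.
# Generate all palindromes below the target by mirroring half-strings.
target = 11042019
pals = set(range(1, 10))  # single-digit palindromes
for half in range(1, 10000):
    s = str(half)
    pals.add(int(s + s[::-1]))                                # even length
    pals.update(int(s + d + s[::-1]) for d in "0123456789")   # odd length
pals = {p for p in pals if p < target}
assert not any((target - p) in pals for p in pals)
print("verified")
```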


### Facebook Research at CVPR 2019


### Distilled News

I often find myself viewing and reviewing dataframes throughout the course of an analysis, and a substantial amount of time can be spent rewriting the same code to do this. inspectdf is an R package designed to make common exploratory tools a bit more useful and easy to use. In particular, it’s very powerful to be able to quickly see the contents of categorical features. In this article, we’ll summarise how to use the inspect_cat() function from inspectdf for summarising and visualising categorical columns.
This article is the first of a series in which I will cover the whole process of developing a machine learning project. In this article we focus on training a supervised learning text classification model in Python. The motivation behind writing these articles is the following: as a learning data scientist who has been working with data science tools and machine learning models for a fair amount of time, I’ve found that many articles on the internet, in books, or in the literature in general focus strongly on the modeling part. That is, we are given a certain dataset (with the labels already assigned if it is a supervised learning problem), try several models and obtain a performance metric. And the process ends there.
In my previous post, we got a thorough understanding of Entropy, Cross-Entropy, and KL-Divergence in an intuitive way and also by calculating their values through examples. In case you missed it, please go through it once before proceeding to the finale. In this post, we will apply these concepts and check the results in a real dataset. Also, it will give us good intuition on how to use these concepts in modeling various day-to-day machine learning problems. So, let’s get started.
This online collection of tutorials was created by graduate students in psychology as a resource for other experimental psychologists interested in using R for statistical analyses and graphics. Each chapter was created to provide an overview of how to code a particular topic in the R language. Who is this book for? This book was designed for psychologists already familiar with the statistics they need to utilize, but who have zero experience programming and working in R. Many of the authors of these tutorials had never used R prior to taking the course in which this collection of tutorials was created. In one semester, they were able to gain enough proficiency in R to independently create one of the tutorials included here.
The General Data Protection Regulation (GDPR), the new privacy law for the European Union (EU), went into effect on May 25, 2018. One year later, there is mounting evidence that the law has not produced its intended outcomes; moreover, the unintended consequences are severe and widespread. This article documents the challenges associated with the GDPR, including the various ways in which the law has impacted businesses, digital innovation, the labor market, and consumers. Specifically, the evidence shows that the GDPR:
• Negatively affects the EU economy and businesses
• Drains company resources
• Hurts European tech startups
• Reduces competition in digital advertising
• Is too complicated for businesses to implement
• Fails to increase trust among users
• Negatively impacts users’ online access
• Is too complicated for consumers to understand
• Is not consistently implemented across member states
• Strains resources of regulators
The modelDown package turns classification or regression models into HTML static websites. With one command you can convert one or more models into a website with visual and tabular model summaries. Summaries like model performance, feature importance, single feature response profiles and basic model audits. The modelDown uses DALEX explainers. So it’s model agnostic (feel free to combine random forest with glm), easy to extend and parameterise.
I was really excited when Google announced their Digital Wellbeing program, back in May 2018, especially Dashboard. It tracks all your app interactions on the phone and even helps you to limit app usage by setting time restrictions on different apps. But as of October 2018, Google still hasn’t rolled out that feature to all Android P users and is in beta even for Pixel users. So I decided to check out my own statistics with the data available at hand.
When starting out in data science, DevOps tasks are the last thing you should be worrying about. Trying to master all (or most) aspects of data science requires a tremendous amount of time and practice. Nevertheless, if you should happen to attend a boot camp or some other type of school, it is very likely that you are going to have to complete group projects sooner or later. However, coordinating these without any DevOps knowledge can prove to be quite the challenge. How do we share code? How do we deal with very expensive computations? How do we make sure everyone is using the same environment? Questions like these can easily stall the progress of any data science project.
Over recent weeks, months, and even years, many tools have emerged that promise to make the field of data science more accessible. This isn’t an easy task, considering the complexity of most parts of the data science and machine learning pipeline. Nonetheless, many libraries and tools, including Keras, FastAI, and Weka, have made it significantly easier to create a data science project by providing an easy-to-use, high-level interface and many prebuilt components.
I applied StyleGAN to images of Unicode characters to see if it could invent new characters.
One of the most important skills that every Data Scientist must master is the ability to explore data properly. Thorough exploratory data analysis (EDA) is essential in order to ensure the integrity of your gathered data and performed analysis. The example used in this tutorial is an exploratory analysis of historical SAT and ACT data to compare participation and performance between SAT and ACT exams in different States. By the end of this tutorial, we will have gained data-driven insight into potential issues regarding standardized testing in the United States. The focus of this tutorial is to demonstrate the exploratory data analysis process, as well as provide an example for Python programmers who want to practice working with data. For this analysis, I examined and manipulated available CSV data files containing data about the SAT and ACT for both 2017 and 2018 in a Jupyter Notebook. Exploring data through well-constructed visualizations and descriptive statistics is a great way to become acquainted with the data you’re working with and formulate hypotheses based on your observations.
Hypotheses are assumptions about the data which may or may not be true. In this post we’ll discuss the statistical process of evaluating the truth of a hypothesis – a process known as hypothesis testing. Most statistical analysis has its genesis in comparing two types of distributions: the population distribution and the sample distribution. Let’s understand these terms through an example. Suppose we want to statistically test our hypothesis that, on average, the performance of students in a standard aptitude test has improved in the last decade. We’re given a dataset containing the marks (maximum marks = 250) of 100 randomly selected students who appeared in the exam in 2009 and 2019.
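The comparison described there boils down to a two-sample test statistic. A minimal sketch with Welch's t-statistic computed by hand, on made-up marks (not the post's actual dataset):

```python
import math

# Hypothetical marks (max 250) for ten students in each year; illustration only
marks_2009 = [140, 152, 148, 160, 155, 149, 151, 158, 147, 150]
marks_2019 = [150, 162, 158, 170, 165, 159, 161, 168, 157, 160]

def t_statistic(a, b):
    # Welch's two-sample t-statistic: (mean_b - mean_a) / sqrt(var_a/n_a + var_b/n_b)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (mb - ma) / math.sqrt(va / len(a) + vb / len(b))

t = t_statistic(marks_2009, marks_2019)
print(round(t, 2))  # 3.89: a t-statistic this large favors "2019 scores improved"
```

Comparing the statistic against the t-distribution's critical value then yields the p-value for the hypothesis test.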
Detailed explanation to implement Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent from scratch in Python.
Researchers from Adobe and UC Berkeley have unveiled an interesting way to combat the spread of image manipulation – using AI to spot edited photos. The AI was trained to recognize instances where the Face-Aware Liquify feature of Photoshop was used to edit images. The feature enables you to easily tweak and exaggerate facial features, for example, widening the eyes or literally turning a frown into a smile. This particular feature is popular when it comes to editing faces, and was chosen because the effects can be extremely subtle. The results were astonishing. While humans could spot the edits just 53% of the time (only a little over chance), the AI sometimes achieved results as high as 99%. Part of the reason the AI performs so much better than the human eye is that it can also access low-level image data, as opposed to simply relying on visual cues. So, why is this important?
The purpose of this article is to provide a step-by-step tutorial on how to use BERT for a multi-class classification task. BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations from Google that aims to solve a wide range of Natural Language Processing tasks. The model is based on an unsupervised, deeply bidirectional system and achieved state-of-the-art results when it was first released to the public in 2018.
In this tutorial, I will quickly go through the details of four of the famous CNN architectures and how they differ from each other by explaining their W3H (When, Why, What and How).
Most real-world data contains missing values. They occur for many reasons, such as observations that were never recorded or data corruption.

### Detecting Bias with SHAP

StackOverflow’s annual developer survey concluded earlier this year, and they have graciously published the (anonymized) 2019 results for analysis. They’re a rich view into the experience of software developers around the world — what’s their favorite editor? how many years of experience? tabs or spaces? and crucially, salary. Software engineers’ salaries are good, and sometimes both eye-watering and news-worthy.

The tech industry is also painfully aware that it does not always live up to its purported meritocratic ideals. Pay isn’t a pure function of merit, and story after story tells us that factors like name-brand school, age, race, and gender have an effect on outcomes like salary.

Can machine learning do more than predict things? Can it explain salaries and so highlight cases where these factors might be undesirably causing pay differences? This example will sketch how standard models can be augmented with SHAP (SHapley Additive exPlanations) to detect individual instances whose predictions may be concerning, and then dig deeper into the specific reasons the data leads to those predictions.

## Model Bias or Data (about) Bias?

While this topic is often characterized as detecting “model bias”, a model is merely a mirror of the data it was trained on. If the model is ‘biased’ then it learned that from the historical facts of the data. Models are not the problem per se; they are an opportunity to analyze data for evidence of bias.

Explaining models isn’t new, and most libraries can assess the relative importance of the inputs to a model. These are aggregate views of inputs’ effects. However, the output of some machine learning models has highly individual effects: is your loan approved? will you receive financial aid? are you a suspicious traveller?

Indeed, StackOverflow offers a handy calculator to estimate one’s expected salary, based on its survey. We can only speculate about how accurate the predictions are overall, but all that a developer particularly cares about is his or her own prospects.

The right question may not be, does the data suggest bias overall? but rather, does the data show individual instances of bias?

## Assessing the Survey Data

The 2019 data is, thankfully, clean and free of data problems. It contains responses to 85 questions from about 88,000 developers.

This example focuses only on full-time developers. The data set contains plenty of relevant information, like years of experience, education, role, and demographic information. Notably, this data set doesn’t contain information about bonuses and equity, just salary.

It also has responses to wide-ranging questions about attitudes on blockchain, fizz buzz, and the survey itself. These are excluded here as unlikely to reflect the experience and skills that presumably should determine compensation. Likewise, for simplicity, it will also only focus on US-based developers.

The data needs a little more transformation before modeling. Several questions allow multiple responses, like “What are your greatest challenges to productivity as a developer?” These single questions yield multiple yes/no responses and need to be broken out into multiple yes/no features.
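That breakout can be sketched in plain Python (the actual transformation in the post is done in Spark); the question and response strings here are hypothetical stand-ins for the survey's multi-select fields:

```python
# Hypothetical multi-select responses: each respondent picks zero or more options,
# stored as a single semicolon-delimited string (as in the survey CSV export).
responses = [
    "Meetings;Distracting work environment",
    "Meetings",
    "",
]

# Collect the full set of options across respondents,
# then emit one yes/no feature per option.
options = sorted({opt for r in responses for opt in r.split(";") if opt})
dummies = [{opt: (opt in r.split(";")) for opt in options} for r in responses]

print(dummies[0])  # {'Distracting work environment': True, 'Meetings': True}
```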

Some multiple-choice questions like “Approximately how many people are employed by the company or organization you work for?” afford responses like “2-9 employees”. These are effectively binned continuous values, and it may be useful to map them back to inferred continuous values like “2” so that the model may consider their order and relative magnitude. This translation is unfortunately manual and entails some judgment calls.
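One way to sketch that manual mapping in Python (the response strings follow the survey's format, but the choice of numeric anchor per bin is exactly the kind of judgment call described above):

```python
# Map binned company-size responses to a representative numeric value.
# Using the lower bound of each bin is one defensible judgment call;
# bin midpoints would be another.
org_size_map = {
    "Just me - I am a freelancer, sole proprietor, etc.": 1,
    "2-9 employees": 2,
    "10 to 19 employees": 10,
    "20 to 99 employees": 20,
    "100 to 499 employees": 100,
    "500 to 999 employees": 500,
    "1,000 to 4,999 employees": 1000,
    "5,000 to 9,999 employees": 5000,
    "10,000 or more employees": 10000,
}

print(org_size_map["2-9 employees"])  # 2, as in the example above
```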

The Apache Spark code that can accomplish this is in the accompanying notebook, for the interested.

## Model Selection with Apache Spark

With the data in a more machine-learning-friendly form, the next step is to fit a regression model that predicts salary from these features. The data set itself, after filtering and transformation with Spark, is a mere 4MB, containing 206 features from about 12,600 developers, and could easily fit in memory as a pandas DataFrame on your wristwatch, let alone a server.

xgboost, a popular gradient-boosted trees package, can fit a model to this data in minutes on a single machine, without Spark. xgboost offers many tunable “hyperparameters” that affect the quality of the model: maximum depth, learning rate, regularization, and so on. Rather than guess, simple standard practice is to try lots of settings of these values and pick the combination that results in the most accurate model.

Fortunately, this is where Spark comes back in. It can build hundreds of these models in parallel and collect the results of each. Because the data set is small, it’s simple to broadcast it to the workers, create a bunch of combinations of those hyperparameters to try, and use Spark to apply the same simple non-distributed xgboost code that could build a model locally to the data with each combination.

...
def train_model(params):
  (max_depth, learning_rate, reg_alpha, reg_lambda, gamma, min_child_weight) = params
  xgb_regressor = XGBRegressor(objective='reg:squarederror', max_depth=max_depth,\
    learning_rate=learning_rate, reg_alpha=reg_alpha, reg_lambda=reg_lambda, gamma=gamma,\
    min_child_weight=min_child_weight, n_estimators=3000, base_score=base_score,\
    importance_type='total_gain', random_state=0)
  xgb_model = xgb_regressor.fit(b_X_train.value, b_y_train.value,\
    eval_set=[(b_X_test.value, b_y_test.value)],\
    eval_metric='rmse', early_stopping_rounds=30)
  n_estimators = len(xgb_model.evals_result()['validation_0']['rmse'])
  y_pred = xgb_model.predict(b_X_test.value)
  mae = mean_absolute_error(y_pred, b_y_test.value)
  rmse = sqrt(mean_squared_error(y_pred, b_y_test.value))
  return (params + (n_estimators,), (mae, rmse), xgb_model)

...

max_depth =        np.unique(np.geomspace(3, 7, num=5, dtype=np.int32)).tolist()
learning_rate =    np.unique(np.around(np.geomspace(0.01, 0.1, num=5), decimals=3)).tolist()
reg_alpha =        [0] + np.unique(np.around(np.geomspace(1, 50, num=5), decimals=3)).tolist()
reg_lambda =       [0] + np.unique(np.around(np.geomspace(1, 50, num=5), decimals=3)).tolist()
gamma =            np.unique(np.around(np.geomspace(5, 20, num=5), decimals=3)).tolist()
min_child_weight = np.unique(np.geomspace(5, 30, num=5, dtype=np.int32)).tolist()

parallelism = 128
param_grid = [(choice(max_depth), choice(learning_rate), choice(reg_alpha),\
choice(reg_lambda), choice(gamma), choice(min_child_weight)) for _ in range(parallelism)]

params_evals_models = sc.parallelize(param_grid, parallelism).map(train_model).collect()


That will create a lot of models. To track and evaluate the results, mlflow can log each one with its metrics and hyperparameters, and view them in the notebook’s Experiment. Here, one hyperparameter over many runs is compared to the resulting accuracy (mean absolute error):

The single model that showed the lowest error on the held-out validation data set is of interest. It yielded a mean absolute error of about $28,000 on salaries that average about $119,000. Not terrible, although we should bear in mind that the model can only explain most of the variation in salary.

## Interpreting the xgboost Model

Although the model can be used to predict future salaries, instead, the question is what the model says about the data. What features seem to matter most when predicting salary accurately? The xgboost model itself computes a notion of feature importance:

import mlflow.sklearn
best_run_id = "..."
model = mlflow.sklearn.load_model("runs:/" + best_run_id + "/xgboost")
sorted((zip(model.feature_importances_, X.columns)), reverse=True)[:6]


Factors like years of coding professionally, organization size, and using Windows are most “important”. This is interesting, but hard to interpret. The values reflect relative, not absolute, importance. That is, the effect isn’t measured in dollars. The definition of importance here (total gain) is also specific to how decision trees are built and is hard to map to an intuitive interpretation. The important features don’t necessarily correlate positively with salary, either.

More importantly, this is a ‘global’ view of how much features matter in aggregate. Factors like gender and ethnicity don’t show up on this list until farther along. This doesn’t mean these factors aren’t still significant. For one, features can be correlated, or interact. It’s possible that factors like gender correlate with other features that the trees selected instead, and this to some degree masks their effect.

The more interesting question is not so much whether these factors matter overall — it’s possible that their average effect is relatively small — but, whether they have a significant effect in some individual cases. These are the instances where the model is telling us something important about individuals’ experience, and to those individuals, that experience is what matters.

## Applying SHAP for Developer-Level Explanations

Fortunately, a set of techniques for more theoretically sound model interpretation at the individual prediction level has emerged over the past five years or so. They are collectively known as “SHapley Additive exPlanations”, and conveniently, they are implemented in the Python package shap.
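To make the idea concrete, here is a from-scratch computation of exact Shapley values for a toy three-feature model; the model and the numbers are invented for illustration, and the shap package implements this far more efficiently for real models. The defining “additive” property is that the per-feature values sum to the prediction minus the prediction for a background (average) instance:

```python
from itertools import permutations

# Toy linear "model": predicted salary from three features (illustration only)
def model(x):
    return 50000 + 2000 * x["years_exp"] + 5000 * x["has_degree"] + 100 * x["org_size"]

# The instance to explain, and a background instance standing in for the average
instance   = {"years_exp": 10, "has_degree": 1, "org_size": 100}
background = {"years_exp": 5,  "has_degree": 0, "org_size": 50}

def shapley_values(model, instance, background):
    # Average each feature's marginal contribution over all orderings of features
    features = list(instance)
    phi = {f: 0.0 for f in features}
    perms = list(permutations(features))
    for order in perms:
        x = dict(background)
        prev = model(x)
        for f in order:
            x[f] = instance[f]                    # add feature f to the coalition
            cur = model(x)
            phi[f] += (cur - prev) / len(perms)   # its marginal contribution
            prev = cur
    return phi

phi = shapley_values(model, instance, background)

# Additivity: background prediction + sum of Shapley values == prediction
assert abs(model(background) + sum(phi.values()) - model(instance)) < 1e-6
print(phi)  # each feature's effect, in dollars
```

For a linear model like this toy one, each Shapley value reduces to the coefficient times the feature's deviation from the background; for tree ensembles, shap's TreeExplainer computes the same quantity without enumerating permutations.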

## Examining the Effects of Gender with SHAP Values

We came to look specifically at the effects of gender, race, and other factors that presumably should not be predictive of salary at all. This example will examine the effect of gender, though this by no means suggests that it’s the only, or most important, type of bias to look for.

Gender is not binary, and the survey recognizes responses of “Man”, “Woman”, and “Non-binary, genderqueer, or gender non-conforming” as well as “Trans” separately. (Note that while the survey also separately records responses about sexuality, these are not considered here.) SHAP computes the effect on predicted salary for each of these. For a male developer (identifying only as male), the effect of gender is not just the effect of being male, but of not identifying as female, transgender, and so on.

SHAP values let us read off the sum of these effects for developers identifying as each of the four categories:

While male developers’ gender explains a modest -$230 to +$890 of predicted salary, with a mean of about +$225, for female developers the range is wider, from about -$4,260 to -$690, with a mean of -$1,320. The results for transgender and non-binary developers are similar, though slightly less negative.
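Reading off the summed gender effect per respondent is just a sum over the gender-related SHAP columns. A minimal sketch, with hypothetical SHAP values standing in for shap's per-column output (the column names are shortened here for readability):

```python
# Hypothetical per-respondent SHAP values for the four gender-related columns
shap_rows = {
    1001: {"Gender_Man": 250.0, "Gender_Woman": 0.0,
           "Gender_Non_binary": 0.0, "Trans": -25.0},
    1002: {"Gender_Man": 0.0, "Gender_Woman": -1100.0,
           "Gender_Non_binary": 0.0, "Trans": -220.0},
}

# Total gender-related effect on predicted salary, per respondent, in dollars
gender_effect = {rid: sum(cols.values()) for rid, cols in shap_rows.items()}
print(gender_effect)  # {1001: 225.0, 1002: -1320.0}
```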

When evaluating what this means below, it’s important to recall the limitations of the data and model here:

• Correlation isn’t causation; ‘explaining’ predicted salary is suggestive, but doesn’t prove, that a feature directly caused salary to be higher or lower
• The model isn’t perfectly accurate
• This is just 1 year of data, and only from US developers
• This reflects only base salary, not bonuses or stock, which can vary more widely

## Gender and Interacting Features

The SHAP library offers interesting visualizations that leverage its ability to isolate the effect of feature interactions. For example, the values above suggest that developers who identify as male are predicted to earn a slightly higher salary than others, but is there more to it? A dependence plot like this one can help:

Dots are developers. Developers at the left are those who don’t identify as male; at the right, those who do, predominantly those identifying as only male. (The points are randomly spread horizontally for clarity.) The y-axis is the SHAP value: what identifying as male or not explains about predicted salary for each developer. As above, those not identifying as male show overall negative SHAP values that vary widely, while the rest consistently show a small positive SHAP value.

What’s behind that variance? SHAP can select a second feature whose effect varies most given the value of, here, identifying as male or not.  It selects the answer “I work on what seems most important or urgent” to the question “How structured or planned is your work?”  Among developers identifying as male, those who answered this way (red points) appear to have slightly higher SHAP values. Among the rest, the effect is more mixed but seems to have generally lower SHAP values.

Interpretation is left to the reader, but perhaps: are male developers who feel empowered in this sense also enjoying slightly higher salaries, while other developers enjoy this where it goes hand in hand with lower-paying roles?

## Exploring Instances with Outsized Gender Effects

What about investigating the developer whose salary is most negatively affected? Just as it’s possible to look at the effect of gender-related features overall, it’s possible to search for the developer whose gender-related features had the largest impact on predicted salary. This person is female, and the effect is negative. According to the model, she is predicted to earn about $4,260 less per year because of her gender.

The predicted salary, just over $157,000, is fairly accurate in this case, as her actual reported salary is $150,000. The three most positive and three most negative features influencing her predicted salary are that she:

• Has a college degree (only) (+$18,200)
• Has 10 years professional experience (+$9,400)
• Identifies as East Asian (+$9,100)
• Works 40 hours per week (-$4,000)
• Does not identify as male (-$4,250)
• Works at a medium-sized org of 100-499 employees (-$9,700)

Given the magnitude of the effect on the predicted salary of not identifying as male, we might stop here and investigate the details of this case offline to gain a better understanding of the context around this developer and whether her experience, or salary, or both, need a change.

## Explaining Interactions

There is more detail available within that -$4,260. SHAP can break down the effects of these features into interactions. The total effect of identifying as female on the prediction can be broken down into the effect of identifying as female and being an engineering manager, of identifying as female and working with Windows, and so on.

The effect on predicted salary explained by the gender factors per se only adds up to about -$630. Rather, SHAP assigns most of the effects of gender to interactions with other features:

gender_interactions = interaction_values[gender_feature_locs].sum(axis=0)
max_c = np.argmax(gender_interactions)
min_c = np.argmin(gender_interactions)
print(X.columns[max_c])
print(gender_interactions[max_c])
print(X.columns[min_c])
print(gender_interactions[min_c])

DatabaseWorkedWith_PostgreSQL
110.64005
Ethnicity_East_Asian
-1372.6714

Identifying as female and working with PostgreSQL affects predicted salary slightly positively, whereas identifying as female and as East Asian affects predicted salary more negatively. Interpreting these values at this level of granularity is difficult in this context, but this additional level of explanation is available.

## Applying SHAP with Apache Spark

SHAP values are computed independently for each row, given the model, and so this could have also been done in parallel with Spark. The following example computes SHAP values in parallel and similarly locates developers with outsized gender-related SHAP values:

X_df = pruned_parsed_df.drop("ConvertedComp").repartition(16)
X_columns = X_df.columns

def add_shap(rows):
  rows_pd = pd.DataFrame(rows, columns=X_columns)
  shap_values = explainer.shap_values(rows_pd.drop(["Respondent"], axis=1))
  return [Row(*([int(rows_pd["Respondent"][i])] + [float(f) for f in shap_values[i]]))\
    for i in range(len(shap_values))]

shap_df = X_df.rdd.mapPartitions(add_shap).toDF(X_columns)

effects_df = shap_df.\
  withColumn("gender_shap", col("Gender_Woman") + col("Gender_Man") +\
    col("Gender_Non_binary__genderqueer__or_gender_non_conforming") + col("Trans")).\
  select("Respondent", "gender_shap")

top_effects_df = effects_df.filter(abs(col("gender_shap")) >= 2500).orderBy("gender_shap")

## Clustering SHAP values

Applying Spark is advantageous when there are a large number of predictions to assess with SHAP.
Given that output, it’s also possible to use Spark to cluster the results with, for example, bisecting k-means:

assembler = VectorAssembler(inputCols=[c for c in to_review_df.columns if c != "Respondent"],\
    outputCol="features")
assembled_df = assembler.transform(shap_df).cache()

clusterer = BisectingKMeans().setFeaturesCol("features").setK(50).setMaxIter(50).setSeed(0)
cluster_model = clusterer.fit(assembled_df)
transformed_df = cluster_model.transform(assembled_df).select("Respondent", "prediction")

The cluster whose total gender-related SHAP effects are most negative might bear some further investigation. What are the SHAP values of the respondents in that cluster? What do the members of the cluster look like with respect to the overall developer population?

min_shap_cluster_df = transformed_df.filter("prediction = 5").\
    join(effects_df, "Respondent").\
    join(X_df, "Respondent").\
    select(gender_cols).groupBy(gender_cols).count().orderBy(gender_cols)

all_shap_df = X_df.select(gender_cols).groupBy(gender_cols).count().orderBy(gender_cols)
expected_ratio = transformed_df.filter("prediction = 5").count() / X_df.count()

display(min_shap_cluster_df.join(all_shap_df, on=gender_cols).\
    withColumn("ratio", (min_shap_cluster_df["count"] / all_shap_df["count"]) / expected_ratio).\
    orderBy("ratio"))

Developers identifying as female (only) are represented in this cluster at almost 2.8x the rate of the overall developer population, for example. This isn’t surprising given the earlier analysis. This cluster could be further investigated to assess other factors specific to this group that contribute to overall lower predicted salary.

## Conclusion

This type of analysis with SHAP can be run for any model, and at scale too. As an analytical tool, it turns models into data detectives, surfacing individual instances whose predictions suggest that they deserve more examination.
The output of SHAP is easily interpretable and yields intuitive plots that can be assessed case-by-case by business users. Of course, this analysis isn’t limited to examining questions of gender, age or race bias. More prosaically, it could be applied to customer churn models. There, the question is not just “will this customer churn?” but “why is the customer churning?” A customer who is canceling due to price may be offered a discount, while one canceling due to limited usage might need an upsell. Finally, this analysis can be run as part of a model validation process. Model validation often focuses on the overall accuracy of a model. It should also focus on the model’s ‘reasoning’, or which features contributed most to the predictions. SHAP can also help detect when too many individual predictions’ explanations are at odds with overall feature importance. Try Databricks for free. Get started today. The post Detecting Bias with SHAP appeared first on Databricks. Continue Reading…

### Data-driven to Model-driven: The Strategic Shift Being Made by Leading Organizations

You can have all the data you want, do all the machine learning you want, but if you aren’t running your business on models, you’ll soon be left behind. In this webinar, we will demystify the model-driven business. Continue Reading…

### Data Science Jobs Report 2019: Python Way Up, TensorFlow Growing Rapidly, R Use Double SAS

Data science jobs continue to grow in 2019, and this report shares the change and spread of jobs by software over recent years.
Continue Reading…

### Top Stories, Jun 10 – 16: Best resources for developers transitioning into data science; 5 Useful Statistics Data Scientists Need to Know

The Infinity Stones of Data Science; What you need to know about the Modern Open-Source Data Science ecosystem; Scalable Python Code with Pandas UDFs: A Data Science Application; Become a Pro at Pandas Continue Reading…

### Random Search and Reproducibility for Neural Architecture Search

** Nuit Blanche is now on Twitter: @NuitBlog ** Neural architecture search (NAS) is a promising research direction that has the potential to replace expert-designed networks with learned, task-specific architectures. In order to help ground the empirical results in this field, we propose new NAS baselines that build off the following observations: (i) NAS is a specialized hyperparameter optimization problem; and (ii) random search is a competitive baseline for hyperparameter optimization. Leveraging these observations, we evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm on two standard NAS benchmarks: PTB and CIFAR-10. Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks. Additionally, random search with weight-sharing outperforms random search with early-stopping, achieving a state-of-the-art NAS result on PTB and a highly competitive result on CIFAR-10. Finally, we explore the existing reproducibility issues of published NAS results. An implementation of the paper is at: https://github.com/liamcli/randomNAS_release The code base requires the following additional repositories:
Continue Reading…

### What should we do with “legacy” Java 8 applications?

Java is a mature programming language. It was improved over many successive versions. Mostly, new Java versions did not break your code. Thus Java was a great, reliable platform. For some reason, the Oracle engineers decided to break things after Java 8. You cannot “just” upgrade from Java 8 to the following versions. You have to update your systems, sometimes in a significant way. For management purposes, my employer uses an ugly Java application, launched by browsers via something called Java Web Start. I am sure that my employer’s application was very modern when it launched, but it is now tragically old and ugly. Oracle ended maintenance of Java 8 in January. It may stop making Java 8 available publicly at the end of 2020. Yet my employer’s application won’t work with anything beyond Java 8. Java on the desktop is not ideal. For business applications, you are much better off with a pure Web application: it is easier to maintain and secure, and it is more portable. Our IT staff knows this; they are not idiots. They are preparing a Web equivalent that should launch… some day… But it is complicated. They do not have infinite budgets and there are many stakeholders. What do we do while something more modern is being built? If you are a start-up, you can just switch to an open-source build of Java 8 like OpenJDK. But we are part of a large organization. We want to rely on supported software: doing otherwise would be irresponsible. So what do we do? I think that their current plan is just to stick with Java 8.
They have an Oracle license, so they can keep on installing Java 8 on PCs even if Oracle pulls the plug. But is that wise? I think that a better solution would be to switch to Amazon Corretto. Amazon recruited James Gosling, Java’s inventor. It feels like the future of Java may be moving into Amazon’s hands. Continue Reading…

### How to Use Python’s datetime

Python's datetime package is a convenient set of tools for working with dates and times. With just the five tricks that I’m about to show you, you can handle most of your datetime processing needs. Continue Reading…

### Online/Incremental Learning with Keras and Creme

In this tutorial, you will learn how to perform online/incremental learning with Keras and Creme on datasets too large to fit into memory. A few weeks ago I showed you how to use Keras for feature extraction and online learning — we used that tutorial to perform transfer learning and recognize classes the original CNN was never trained on. To accomplish that task we needed to use Keras to train a very simple feedforward neural network on the features extracted from the images. However, what if we didn’t want to train a neural network? What if we instead wanted to train a Logistic Regression, Naive Bayes, or Decision Tree model on top of the data? Or what if we wanted to perform feature selection or feature processing before training such a model? You may be tempted to use scikit-learn — but you’ll soon realize that scikit-learn does not treat incremental learning as a “first-class citizen” — only a few online learning implementations are included in scikit-learn, and they are awkward to use, to say the least. Instead, you should use Creme, which:

• Implements a number of popular algorithms for classification, regression, feature selection, and feature preprocessing.
• Has an API similar to scikit-learn.
• Makes it super easy to perform online/incremental learning.

In the remainder of this tutorial I will show you how to:

1. Use Keras + pre-trained CNNs to extract robust, discriminative features from an image dataset.
2. Utilize Creme to perform incremental learning on a dataset too large to fit into RAM.

Let’s get started! To learn how to perform online/incremental learning with Keras and Creme, just keep reading! Looking for the source code to this post? Jump right to the downloads section.

## Online/Incremental Learning with Keras and Creme

In the first part of this tutorial, we’ll discuss situations where we may want to perform online learning or incremental learning. We’ll then discuss why the Creme machine learning library is the appropriate choice for incremental learning. We’ll be using Kaggle’s Dogs vs. Cats dataset for this tutorial — we’ll spend some time briefly reviewing the dataset. From there, we’ll take a look at our directory structure for the project. Once we understand the directory structure, we’ll implement a Python script that will be used to extract features from the Dogs vs. Cats dataset using Keras and a CNN pre-trained on ImageNet. Given our extracted features (which will be too big to fit into RAM), we’ll use Creme to train a Logistic Regression model in an incremental learning fashion, ensuring that:

1. We can still train our classifier, despite the extracted features being too large to fit into memory.
2. We can still obtain high accuracy, even though we didn’t have access to “all” features at once.

### Why Online Learning/Incremental Learning?

Figure 1: Multi-class incremental learning with Creme allows for machine learning on datasets which are too large to fit into memory (image source).

Whether you’re working with image data, text data, audio data, or numerical/categorical data, you’ll eventually run into a dataset that is too large to fit into memory. What then?

• Do you head to Amazon, NewEgg, etc. and purchase an upgraded motherboard with maxed-out RAM?
• Do you spin up a high-memory instance on a cloud service provider like AWS or Azure?
You could look into one of those options — and in some cases, they are totally reasonable avenues to explore. But my first choice would be to apply online/incremental learning. Using incremental learning you can work with datasets too large to fit into RAM and apply popular machine learning techniques, including:

• Feature preprocessing
• Feature selection
• Classification
• Regression
• Clustering
• Ensemble methods
• …and more!

Incremental learning can be super powerful — and today you’ll learn how to apply it to your own data.

### Why Creme for Incremental Learning?

Figure 2: Creme is a library specifically tailored to incremental learning. The API is similar to that of scikit-learn’s, which will make you feel at home while putting it to work on large datasets where incremental learning is required.

Neural networks and deep learning are a form of incremental learning — we can train such networks on one sample or one batch at a time. However, just because we can apply neural networks to a problem doesn’t mean we should. Instead, we need to bring the right tool to the job. Just because you have a hammer in your hand doesn’t mean you would use it to bang in a screw. Incremental learning algorithms encompass a set of techniques used to train models in an incremental fashion. We often utilize incremental learning when a dataset is too large to fit into memory. The scikit-learn library does include a small handful of online learning algorithms; however:

1. It does not treat incremental learning as a first-class citizen.
2. The implementations are awkward to use.

Enter the Creme library — a library exclusively dedicated to incremental learning with Python. The library itself is fairly new, but last week I had some time to hack around with it. I really enjoyed the experience and found the scikit-learn-inspired API very easy to use.
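To make "training one sample at a time" concrete before we bring in Creme, here is a minimal, dependency-free sketch of incremental learning: a binary logistic regression updated by stochastic gradient descent on each sample as it streams by, with prequential (test-then-train) accuracy tracking. The toy data stream and learning rate are invented for illustration; Creme's actual predict_one/fit_one API appears later in this post:

```python
import math
import random

class OnlineLogisticRegression:
    """Binary logistic regression trained one sample at a time via SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit_one(self, x, y):
        # gradient of the log-loss with respect to the pre-sigmoid score
        err = self.predict_proba(x) - y
        self.b -= self.lr * err
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]

# stream a toy linearly separable dataset: y = 1 iff x0 + x1 > 1;
# no sample is ever stored -- each is seen once, scored, then learned from
random.seed(0)
model = OnlineLogisticRegression(2)
correct = 0
for _ in range(5000):
    x = [random.random(), random.random()]
    y = 1 if x[0] + x[1] > 1 else 0
    correct += (model.predict_proba(x) > 0.5) == y  # test, then train
    model.fit_one(x, y)
print(correct / 5000)
```

Early predictions are near chance and accuracy climbs as the weights adapt, which is the same "running accuracy" behavior you will see in the Creme training log later in this tutorial.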
After going through the rest of this tutorial, I think you’ll agree with me when I say, Creme is a great little library and I wish the developers and maintainers all the best with it — I hope that the library continues to grow.

### The Dogs vs. Cats Dataset

Figure 3: In today’s example, we’re using Kaggle’s Dogs vs. Cats dataset. We’ll extract features with Keras producing a rather large features CSV. From there, we’ll apply incremental learning with Creme.

The dataset we’ll be using here today is Kaggle’s Dogs vs. Cats dataset. The dataset includes 25,000 examples, evenly distributed:

• Dogs: 12,500 images
• Cats: 12,500 images

Our goal is to apply transfer learning to:

1. Extract features from the dataset using Keras and a pre-trained CNN.
2. Use online/incremental learning via Creme to train a classifier on top of the features in an incremental fashion.

### Setting up your Creme environment

While Creme requires a simple pip install, we have some other packages to install for today’s example too. Today’s required packages include:

1. imutils and OpenCV (a dependency of imutils)
2. scikit-learn
3. TensorFlow
4. Keras
5. Creme

First, head over to my pip install opencv tutorial to install OpenCV in a Python virtual environment. The OpenCV installation instructions suggest an environment named cv but you can name yours whatever you’d like. From there, install the rest of the packages in your environment:

$ workon cv
$ pip install imutils
$ pip install scikit-learn
$ pip install tensorflow # or tensorflow-gpu
$ pip install keras
$ pip install creme

Let’s ensure everything is properly installed by launching a Python interpreter:

$ workon cv
$ python
>>> import cv2
>>> import imutils
>>> import sklearn
>>> import keras
Using TensorFlow backend.
>>> import creme
>>>

Provided that there are no errors, your environment is ready for incremental learning.

### Project Structure

Figure 4: Download train.zip from the Kaggle Dogs vs. Cats downloads page for this incremental learning with Creme project.

To set up your project, please follow the following steps:

1. Use the “Downloads” section of this blog post and follow the instructions to download the code.
2. Download the code to somewhere on your system. For example, you could download it to your ~/Desktop or ~/Downloads folder.
3. Open a terminal, cd into the same folder where the zip resides. Unzip/extract the files via unzip keras-creme-incremental-learning.zip . Keep your terminal open.
4. Log into Kaggle (required for downloading data).
5. Head to the Kaggle Dogs vs. Cats “Data” page.
6. Click the little download button next to only the train.zip file. Save it into ~/Desktop/keras-creme-incremental-learning/ (or wherever you extracted the blog post files).
7. Back in your terminal, extract the dataset via unzip train.zip .

Now let’s review our project structure:

$ tree --dirsfirst --filelimit 10
.
├── train [25000 entries]
├── train.zip
├── features.csv
├── extract_features.py
└── train_incremental.py

1 directory, 4 files

You should see a train/ directory with 25,000 files. This is where your actual dog and cat images reside. Let’s list a handful of them:

$ ls train | sort -R | head -n 10
dog.271.jpg
cat.5399.jpg
dog.3501.jpg
dog.5453.jpg
cat.7122.jpg
cat.2018.jpg
cat.2298.jpg
dog.3439.jpg
dog.1532.jpg
cat.1742.jpg

As you can see, the class label (either “cat” or “dog”) is included in the first few characters of the filename. We’ll parse the class name out later. Back to our project tree: alongside the train/ directory are train.zip and features.csv . These files are not included with the “Downloads”. You should have already downloaded and extracted train.zip from Kaggle’s website. We will learn how to extract features and generate the large 12GB+ features.csv file in the next section. The two Python scripts we’ll be reviewing are extract_features.py and train_incremental.py . Let’s begin by extracting features with Keras!

### Extracting Features with Keras

Before we can perform incremental learning, we first need to perform transfer learning and extract features from our Dogs vs. Cats dataset. To accomplish this task, we’ll be using the Keras deep learning library and the ResNet50 network (pre-trained on ImageNet). Using ResNet50, we’ll allow our images to forward propagate to a pre-specified layer. We’ll then take the output activations of that layer and treat them as a feature vector. Once we have feature vectors for all images in our dataset, we’ll then apply incremental learning. Let’s go ahead and get started.
Open up the extract_features.py file and insert the following code:

# import the necessary packages
from sklearn.preprocessing import LabelEncoder
from keras.applications import ResNet50
from keras.applications import imagenet_utils
from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img
from imutils import paths
import numpy as np
import argparse
import pickle
import random
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
	help="path to input dataset")
ap.add_argument("-c", "--csv", required=True,
	help="path to output CSV file")
ap.add_argument("-b", "--batch-size", type=int, default=32,
	help="batch size for the network")
args = vars(ap.parse_args())

On Lines 2-12, all the packages necessary for extracting features are imported. Most notably this includes ResNet50 . ResNet50 is the convolutional neural network (CNN) we are using for transfer learning (Line 3). Three command line arguments are then parsed via Lines 15-22:

• --dataset : The path to our input dataset (i.e. Dogs vs. Cats).
• --csv : File path to our output CSV file.
• --batch-size : By default, we’ll use a batch size of 32 . This will accommodate most CPUs and GPUs.

Let’s go ahead and load our model:

# load the ResNet50 network and store the batch size in a convenience
# variable
print("[INFO] loading network...")
model = ResNet50(weights="imagenet", include_top=False)
bs = args["batch_size"]

On Line 27, we load the model while specifying two parameters:

• weights="imagenet" : Pre-trained ImageNet weights are loaded for transfer learning.
• include_top=False : We do not include the fully-connected head with the softmax classifier. In other words, we chop off the head of the network.

With weights loaded, and by loading our model without the head, we are now ready for transfer learning.
We will use the output values of the network directly, storing the results as feature vectors. Our feature vectors will each be 100,352-dim (i.e. 7 x 7 x 2048, which are the dimensions of the output volume of ResNet50 without the FC layer head). From here, let’s grab our imagePaths and extract our labels:

# grab all image paths in the input directory and randomly shuffle
# the paths
imagePaths = list(paths.list_images(args["dataset"]))
random.seed(42)
random.shuffle(imagePaths)

# extract the class labels from the image paths, then encode the
# labels
labels = [p.split(os.path.sep)[-1].split(".")[0] for p in imagePaths]
le = LabelEncoder()
labels = le.fit_transform(labels)

On Lines 32-34, we proceed to grab all imagePaths and randomly shuffle them. From there, our class labels are extracted from the paths themselves (Line 38). Each image path has the format:

• train/cat.0.jpg
• train/dog.0.jpg
• etc.

In a Python interpreter, we can test Line 38 for sanity. As you develop the parsing + list comprehension, your interpreter might look like this:

$ python
>>> import os
>>> label = "train/cat.0.jpg".split(os.path.sep)[-1].split(".")[0]
>>> label
'cat'
>>> imagePaths = ["train/cat.0.jpg", "train/dog.0.jpg", "train/dog.1.jpg"]
>>> labels = [p.split(os.path.sep)[-1].split(".")[0] for p in imagePaths]
>>> labels
['cat', 'dog', 'dog']
>>>

Lines 39 and 40 then instantiate and fit our label encoder, ensuring we can convert the string class labels to integers.
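As a rough illustration of what that fit_transform call is doing (a pure-Python stand-in, not scikit-learn's actual implementation), the encoder maps the sorted unique string labels to integer indices:

```python
# stand-in for LabelEncoder.fit_transform: sorted unique labels -> indices
labels = ["cat", "dog", "dog", "cat", "dog"]
classes = sorted(set(labels))                 # ['cat', 'dog']
encoded = [classes.index(l) for l in labels]
print(encoded)                                # [0, 1, 1, 0, 1]
```

So “cat” becomes 0 and “dog” becomes 1, which is why the class column of the CSV file we build next contains 0s and 1s.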

Let’s define our CSV columns and write them to the file:

# define our set of columns
cols = ["feat_{}".format(i) for i in range(0, 7 * 7 * 2048)]
cols = ["class"] + cols

# open the CSV file for writing and write the columns names to the
# file
csv = open(args["csv"], "w")
csv.write("{}\n".format(",".join(cols)))

We’ll be writing our extracted features to a CSV file.

The Creme library requires that the CSV file has a header and includes a name for each of the columns, namely:

1. The name of the column for the class label
2. A name for each of the features

Line 43 creates column names for each of the 7 x 7 x 2048 = 100,352 features while Line 44 defines the class name column (which will store the class label).

Thus, the first five rows and ten columns of our CSV file will look like this:

$ head -n 5 features.csv | cut -d ',' -f 1-10
class,feat_0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Notice how the class is the first column. Then the columns span from feat_0 all the way to feat_100351 for a total of 100,352 features. If you edit the command to print more than 10 columns — say 5,000 — then you’ll see that not all the values are 0. Moving on, let’s proceed to loop over the images in batches:

# loop over the images in batches
for (b, i) in enumerate(range(0, len(imagePaths), bs)):
	# extract the batch of images and labels, then initialize the
	# list of actual images that will be passed through the network
	# for feature extraction
	print("[INFO] processing batch {}/{}".format(b + 1,
		int(np.ceil(len(imagePaths) / float(bs)))))
	batchPaths = imagePaths[i:i + bs]
	batchLabels = labels[i:i + bs]
	batchImages = []

We’ll loop over imagePaths in batches of size bs (Line 52). Lines 58 and 59 then grab the batch of paths and labels, while Line 60 initializes a list to hold the batch of images. Let’s loop over the current batch:

	# loop over the images and labels in the current batch
	for imagePath in batchPaths:
		# load the input image using the Keras helper utility while
		# ensuring the image is resized to 224x224 pixels
		image = load_img(imagePath, target_size=(224, 224))
		image = img_to_array(image)

		# preprocess the image by (1) expanding the dimensions and
		# (2) subtracting the mean RGB pixel intensity from the
		# ImageNet dataset
		image = np.expand_dims(image, axis=0)
		image = imagenet_utils.preprocess_input(image)

		# add the image to the batch
		batchImages.append(image)

Looping over paths in the batch (Line 63), we will load each image , preprocess it, and gather it into batchImages . The image itself is loaded on Line 66.
We’ll preprocess the image by:

• Resizing to 224×224 pixels via the target_size parameter on Line 66.
• Converting to array format (Line 67).
• Adding a batch dimension (Line 72).
• Performing mean subtraction (Line 73).

Note: If these preprocessing steps appear foreign, please refer to Deep Learning for Computer Vision with Python where I cover them in detail. Finally, the image is added to the batch via Line 76. In order to extract features, we’ll now pass the batch of images through our network:

	# pass the images through the network and use the outputs as our
	# actual features, then reshape the features into a flattened
	# volume
	batchImages = np.vstack(batchImages)
	features = model.predict(batchImages, batch_size=bs)
	features = features.reshape((features.shape[0], 7 * 7 * 2048))

	# loop over the class labels and extracted features
	for (label, vec) in zip(batchLabels, features):
		# construct a row that consists of the class label and extracted
		# features
		vec = ",".join([str(v) for v in vec])
		csv.write("{},{}\n".format(label, vec))

# close the CSV file
csv.close()

Our batch of images is sent through the network via Lines 81 and 82. Keep in mind that we have removed the fully-connected head layer of the network. Instead, the forward propagation stops prior to the average pooling layer. We will treat the output of this layer as a list of features , also known as a “feature vector”. The output dimension of the volume is (batch_size, 7, 7, 2048). We can thus reshape the features into a NumPy array of shape (batch_size, 7 * 7 * 2048) , treating the output of the CNN as a feature vector. Maintaining our batch efficiency, the features and associated class labels are written to our CSV file (Lines 86-90). Inside the CSV file, the class label is the first field in each row (enabling us to easily extract it from the row during training). The feature vec follows. The features CSV file is closed via Line 93, as the last step of our script.
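The reshape step is worth pausing on: it merely flattens each image's 7 x 7 x 2048 activation volume into a single row without touching any values. A small NumPy sketch, where the zero-filled array is a stand-in for real ResNet50 activations:

```python
import numpy as np

# stand-in for a batch of ResNet50 conv features: (batch_size, 7, 7, 2048)
features = np.zeros((32, 7, 7, 2048), dtype=np.float32)

# flatten each 7x7x2048 activation volume into a single 100,352-dim vector
flat = features.reshape((features.shape[0], 7 * 7 * 2048))
print(flat.shape)  # (32, 100352)
```

Each of those 100,352-dim rows becomes one line of the CSV file, which is why the file grows so large.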
### Applying feature extraction with Keras

Now that we’ve coded up extract_features.py , let’s apply it to our dataset. Make sure you have:

1. Used the “Downloads” section of this tutorial to download the source code.
2. Downloaded the Dogs vs. Cats dataset from Kaggle’s website.

Open up a terminal and execute the following command:

$ python extract_features.py --dataset train --csv features.csv
Using TensorFlow backend.
[INFO] processing batch 1/782
[INFO] processing batch 2/782
[INFO] processing batch 3/782
...
[INFO] processing batch 780/782
[INFO] processing batch 781/782
[INFO] processing batch 782/782

Using an NVIDIA K80 GPU, the entire feature extraction process took 20m45s.

You could also use your CPU but keep in mind that the feature extraction process will take much longer.

After your script finishes running, take a look at the output size of features.csv :

$ ls -lh features.csv
-rw-rw-r-- 1 ubuntu ubuntu 12G Jun 10 11:16 features.csv

The resulting file is over 12GB! And if we were to load that file into RAM, assuming 32-bit floats for the feature vectors, we would need 10.03GB! Your machine may or may not have that much RAM… but that’s not the point. Eventually, you will encounter a dataset that is too large for you to work with in main memory. When that time comes, you need to use incremental learning.

### Incremental Learning with Creme

If you’re at this point in the tutorial then I will assume you have extracted features from the Dogs vs. Cats dataset using Keras and ResNet50 (pre-trained on ImageNet). But what now? We’ve made the assumption that the entire dataset of extracted feature vectors is too large to fit into memory — how can we train a machine learning classifier on that data? Open up the train_incremental.py file and let’s find out:

# import the necessary packages
from creme.linear_model import LogisticRegression
from creme.multiclass import OneVsRestClassifier
from creme.preprocessing import StandardScaler
from creme.compose import Pipeline
from creme.metrics import Accuracy
from creme import stream
import argparse

Lines 2-8 import packages required for incremental learning with Creme. We’ll be taking advantage of Creme’s implementation of LogisticRegression . Creme’s stream module includes a super convenient CSV data generator. Throughout training, we’ll calculate and print out our current Accuracy with Creme’s built-in metrics tool. Let’s now use argparse to parse our command line arguments:

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--csv", required=True,
	help="path to features CSV file")
ap.add_argument("-n", "--cols", type=int, required=True,
	help="# of feature columns in the CSV file (excluding class column)")
args = vars(ap.parse_args())

Our two command line arguments include:

• --csv : The path to our input CSV features file.
• --cols : Dimensions of our feature vector (i.e. how many columns there are in our feature vector).

Now that we’ve parsed our command line arguments, we need to specify the data types of our CSV file to use Creme’s stream module properly:

# construct our data dictionary which maps the data types of the
# columns in the CSV file to built-in data types
print("[INFO] building column names...")
types = {"feat_{}".format(i): float for i in range(0, args["cols"])}
types["class"] = int

Line 21 builds a dictionary mapping each feature column of our CSV to a data type (float). We will have 100,352 floats. Similarly, Line 22 specifies that our class column is an integer type. Next, let’s initialize our data generator and construct our pipeline:

# create a CSV data generator for the extracted Keras features
dataset = stream.iter_csv(args["csv"], target_name="class", types=types)

# construct our pipeline
model = Pipeline([
	("scale", StandardScaler()),
	("learn", OneVsRestClassifier(
		binary_classifier=LogisticRegression()))
])

Line 25 creates a CSV iterator that will stream features + class labels to our model. Lines 28-32 then construct the model pipeline which:

• First performs standard scaling (scales data to have zero mean and unit variance).
• Then trains our Logistic Regression model in an incremental fashion (one data point at a time).

Logistic Regression is a binary classifier, meaning that it can be used to predict only two classes (which is exactly what the Dogs vs. Cats dataset is). However, if you want to recognize > 2 classes, you need to wrap LogisticRegression in a OneVsRestClassifier which fits one binary classifier per class. Note: There’s no harm in wrapping LogisticRegression in a OneVsRestClassifier for binary classification so I chose to do so here, just so you can see how it’s done — just keep in mind that it’s not required for binary classification but is required for > 2 classes.
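A streaming StandardScaler cannot see the whole column up front, so it must maintain running estimates of the mean and variance and scale each sample using the statistics seen so far. Here is a minimal sketch of that idea using Welford's one-pass algorithm (my own illustration of the technique, not Creme's actual implementation):

```python
import math

class RunningStandardScaler:
    """Online zero-mean/unit-variance scaling via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        # population std of the samples seen so far (1.0 before it's defined)
        std = math.sqrt(self.m2 / self.n) if self.n > 1 and self.m2 > 0 else 1.0
        return (x - self.mean) / std

scaler = RunningStandardScaler()
for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    scaler.update(v)

# running mean and std after the stream: 5.0 and 2.0 (up to rounding)
print(round(scaler.mean, 6), round(math.sqrt(scaler.m2 / scaler.n), 6))
print(round(scaler.transform(9.0), 6))  # (9 - 5) / 2
```

One pass, constant memory: exactly the constraint the rest of this pipeline operates under.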
Let’s put Creme to work to train our model:

# initialize our metric
print("[INFO] starting training...")
metric = Accuracy()

# loop over the dataset
for (i, (X, y)) in enumerate(dataset):
	# make predictions on the current set of features, train the
	# model on the features, and then update our metric
	preds = model.predict_one(X)
	model = model.fit_one(X, y)
	metric = metric.update(y, preds)
	print("[INFO] update {} - {}".format(i, metric))

# show the accuracy of the model
print("[INFO] final - {}".format(metric))

Line 36 initializes our metric (i.e., accuracy). From there, we begin to loop over our dataset (Line 39). Inside the loop, we:

• Make a prediction on the current data point (Line 42). There are 25,000 data points (images), so this loop will run that many times.
• Update the model weights based on the prediction (Line 43).
• Update and display our accuracy metric (Lines 44 and 45).

Finally, the accuracy of the model is displayed in the terminal (Line 48).

### Incremental Learning Results

We are now ready to apply incremental learning using Keras and Creme. Make sure you have:

1. Used the “Downloads” section of this tutorial to download the source code.
2. Downloaded the Dogs vs. Cats dataset from Kaggle’s website.

From there, open up a terminal and execute the following command:

$ python train_incremental.py --csv features.csv --cols 100352
[INFO] building column names...
[INFO] starting training...
[INFO] update 0 - Accuracy: 0.
[INFO] update 1 - Accuracy: 0.
[INFO] update 2 - Accuracy: 0.333333
[INFO] update 3 - Accuracy: 0.5
[INFO] update 4 - Accuracy: 0.6
[INFO] update 5 - Accuracy: 0.5
[INFO] update 6 - Accuracy: 0.571429
[INFO] update 7 - Accuracy: 0.625
[INFO] update 8 - Accuracy: 0.666667
[INFO] update 9 - Accuracy: 0.7
[INFO] update 10 - Accuracy: 0.727273
[INFO] update 11 - Accuracy: 0.75
[INFO] update 12 - Accuracy: 0.769231
[INFO] update 13 - Accuracy: 0.714286
[INFO] update 14 - Accuracy: 0.733333
[INFO] update 15 - Accuracy: 0.75
[INFO] update 16 - Accuracy: 0.705882
[INFO] update 17 - Accuracy: 0.722222
[INFO] update 18 - Accuracy: 0.736842
[INFO] update 19 - Accuracy: 0.75
[INFO] update 21 - Accuracy: 0.761906
...
[INFO] update 24980 - Accuracy: 0.9741
[INFO] update 24981 - Accuracy: 0.974101
[INFO] update 24982 - Accuracy: 0.974102
[INFO] update 24983 - Accuracy: 0.974103
[INFO] update 24984 - Accuracy: 0.974104
[INFO] update 24985 - Accuracy: 0.974105
[INFO] update 24986 - Accuracy: 0.974107
[INFO] update 24987 - Accuracy: 0.974108
[INFO] update 24988 - Accuracy: 0.974109
[INFO] update 24989 - Accuracy: 0.97411
[INFO] update 24990 - Accuracy: 0.974111
[INFO] update 24991 - Accuracy: 0.974112
[INFO] update 24992 - Accuracy: 0.974113
[INFO] update 24993 - Accuracy: 0.974114
[INFO] update 24994 - Accuracy: 0.974115
[INFO] update 24995 - Accuracy: 0.974116
[INFO] update 24996 - Accuracy: 0.974117
[INFO] update 24997 - Accuracy: 0.974118
[INFO] update 24998 - Accuracy: 0.974119
[INFO] update 24999 - Accuracy: 0.97412
[INFO] final - Accuracy: 0.97412

After only 21 samples, our Logistic Regression model is obtaining 76.19% accuracy.

Letting the model train on all 25,000 samples, we reach 97.412% accuracy, which is quite respectable. The process took 6hr48m on my system.

Again, the key point here is that our Logistic Regression model was trained in an incremental fashion — we were not required to store the entire dataset in memory at once. Instead, we could train our Logistic Regression classifier one sample at a time.

## Summary

In this tutorial, you learned how to perform online/incremental learning with Keras and the Creme machine learning library.

Using Keras and ResNet50 pre-trained on ImageNet, we applied transfer learning to extract features from the Dogs vs. Cats dataset.

We have a total of 25,000 images in the Dogs vs. Cats dataset. The output volume of ResNet50 is 7 x 7 x 2048 = 100,352-dim. Assuming 32-bit floats for our 100,352-dim feature vectors, that implies that trying to store the entire dataset in memory at once would require 10.03GB of RAM.
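That RAM figure is easy to sanity-check with back-of-the-envelope arithmetic (a quick sketch; all the numbers come from the paragraph above):

```python
# estimate the RAM needed to hold every feature vector in memory at once
num_images = 25000          # images in the Dogs vs. Cats dataset
feature_dim = 7 * 7 * 2048  # flattened ResNet50 output volume
bytes_per_float = 4         # 32-bit floats

total_bytes = num_images * feature_dim * bytes_per_float
print("feature dimension:", feature_dim)                   # 100352
print("RAM needed: {:.3f} GB".format(total_bytes / 1e9))   # ~10.035 GB
```

Roughly 10GB either way, which is exactly why streaming the rows from disk is attractive here.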

Not all machine learning practitioners will have that much RAM on their machines.

And more to the point — even if you do have sufficient RAM for this dataset, you will eventually encounter a dataset that exceeds the physical memory on your machine.

When that occasion arises you should apply online/incremental learning.

Using the Creme library, we trained a Logistic Regression classifier (wrapped in a OneVsRestClassifier) one sample at a time, enabling us to obtain 97.412% accuracy on the Dogs vs. Cats dataset.

I hope you enjoyed today’s tutorial!

Feel free to use the code in this blog post as a starting point for your own projects where online/incremental learning is required.

The post Online/Incremental Learning with Keras and Creme appeared first on PyImageSearch.

### The publication asymmetry: What happens if the New England Journal of Medicine publishes something that you think is wrong?

After reading my news article on the replication crisis, retired cardiac surgeon Gerald Weinstein wrote:

I have long been disappointed by the quality of research articles written by people and published by editors who should know better. Previously, I had published two articles on experimental design written with your colleague Bruce Levin [of the Columbia University biostatistics department]:

Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548.

Weinstein GS and Levin B: The effect of crossover on the statistical power of randomized studies. Ann. Thorac. Surg. 1989;48:490-495.

I [Weinstein] would like to point out some additional problems with such studies in the hope that you could address them in some future essays. I am focusing on one recent article in the New England Journal of Medicine because it is typical of so many other clinical studies:

Alirocumab and Cardiovascular Outcomes after Acute Coronary Syndrome

November 7, 2018 DOI: 10.1056/NEJMoa1801174

BACKGROUND

Patients who have had an acute coronary syndrome are at high risk for recurrent ischemic cardiovascular events. We sought to determine whether alirocumab, a human monoclonal antibody to proprotein convertase subtilisin–kexin type 9 (PCSK9), would improve cardiovascular outcomes after an acute coronary syndrome in patients receiving high-intensity statin therapy.

METHODS

We conducted a multicenter, randomized, double-blind, placebo-controlled trial involving 18,924 patients who had an acute coronary syndrome 1 to 12 months earlier, had a low-density lipoprotein (LDL) cholesterol level of at least 70 mg per deciliter (1.8 mmol per liter), a non-high-density lipoprotein cholesterol level of at least 100 mg per deciliter (2.6 mmol per liter), or an apolipoprotein B level of at least 80 mg per deciliter, and were receiving statin therapy at a high-intensity dose or at the maximum tolerated dose. Patients were randomly assigned to receive alirocumab subcutaneously at a dose of 75 mg (9462 patients) or matching placebo (9462 patients) every 2 weeks. The dose of alirocumab was adjusted under blinded conditions to target an LDL cholesterol level of 25 to 50 mg per deciliter (0.6 to 1.3 mmol per liter). The primary end point was a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.

RESULTS

The median duration of follow-up was 2.8 years. A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group (hazard ratio, 0.85; 95% confidence interval [CI], 0.78 to 0.93; P<0.001). A total of 334 patients (3.5%) in the alirocumab group and 392 patients (4.1%) in the placebo group died (hazard ratio, 0.85; 95% CI, 0.73 to 0.98). The absolute benefit of alirocumab with respect to the composite primary end point was greater among patients who had a baseline LDL cholesterol level of 100 mg or more per deciliter than among patients who had a lower baseline level. The incidence of adverse events was similar in the two groups, with the exception of local injection-site reactions (3.8% in the alirocumab group vs. 2.1% in the placebo group).

Here are some major problems I [Weinstein] have found in this study:

1. Misleading terminology: the “primary composite endpoint.” Many drug studies, such as those concerning PCSK9 inhibitors (which are supposed to lower LDL or “bad” cholesterol) use the term “primary endpoint” which is actually “a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.” [Emphasis added]

Obviously, a “composite primary endpoint” is an oxymoron (which of the primary colors are composites?) but, worse, the term is so broad that it casts doubt on any conclusions drawn. For example, stroke is generally an embolic phenomenon and may be caused by atherosclerosis, but also may be due to atrial fibrillation in at least 15% of cases. Including stroke in the “primary composite endpoint” is misleading, at best.

By casting such a broad net, the investigators seem to be seeking evidence from any of the four elements in the so-called primary endpoint. Instead of being specific as to which types of events are prevented, the composite primary endpoint obscures the clinical benefit.

2. The use of relative risks, odds ratios, or hazard ratios to obscure clinically insignificant absolute differences. “A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group.” This is an absolute difference of only 1.6%. Such small differences are unlikely to be clinically important, or even replicated in subsequent studies, yet the authors obscure this fact by citing hazard ratios. Only in a supplemental appendix (available online) does this become apparent. Note the enlarged and prominently displayed hazard ratio, drawing attention away from the almost nonexistent difference in event rates (and the lack of error bars). Of course, when the absolute differences are small, the ratio of two small numbers can be misleadingly large.
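Weinstein's arithmetic can be reproduced directly from the event counts quoted in the abstract (a quick sketch; the number-needed-to-treat line is my addition for illustration, not part of his letter):

```python
# event counts quoted in the abstract
alirocumab_events, alirocumab_n = 903, 9462
placebo_events, placebo_n = 1052, 9462

risk_treated = alirocumab_events / alirocumab_n   # ~9.5%
risk_placebo = placebo_events / placebo_n         # ~11.1%

arr = risk_placebo - risk_treated   # absolute risk reduction
rr = risk_treated / risk_placebo    # relative risk (close to, but not the same
                                    # as, the reported hazard ratio of 0.85)
nnt = 1 / arr                       # number needed to treat

print("absolute risk reduction: {:.1%}".format(arr))   # 1.6%
print("relative risk: {:.2f}".format(rr))              # 0.86
print("number needed to treat: {:.0f}".format(nnt))    # 64
```

A 1.6% absolute reduction and a roughly 0.86 ratio are both correct summaries of the same data; the complaint is about which one gets the headline.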

I am concerned because this type of thing is appearing more and more frequently. Minimally effective drugs are being promoted at great expense, and investigators are unthinkingly adopting questionable methods in search of new treatments. No wonder they can’t be repeated.

I suggested to Weinstein that he write a letter to the journal, and he replied:

Unfortunately, the New England Journal of Medicine has a strict 175-word limit on letters to the editor.

In addition, they have not been very receptive to my previous submissions. Today they rejected my short letter on an article that reached a conclusion that was the opposite of the data due to a similar category error, even though I kept it within that word limit.

“I am sorry that we will not be able to publish your recent letter to the editor regarding the Perner article of 06-Dec-2018. The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received.” Of course, they have the space to publish articles that are false on their face.

Here is the letter they rejected:

Re: Pantoprazole in Patients at Risk for Gastrointestinal Bleeding in the ICU

(December 6, 2018 N Engl J Med 2018; 379:2199-2208)

This article appears to reach an erroneous conclusion based on its own data. The study implies that pantoprazole is ineffective in preventing GI bleeding in ICU patients when, in fact, the results show that it is effective.

The purpose of the study was to evaluate the effectiveness of pantoprazole in preventing GI bleeding. Instead, the abstract shifts gears and uses death within 90 days as the primary endpoint and the Results section focuses on “at least one clinically important event (a composite of clinically important gastrointestinal bleeding, pneumonia, Clostridium difficile infection, or myocardial ischemia).” For mortality and for the composite “clinically important event,” relative risks, confidence intervals and p-values are given, indicating no significant difference between pantoprazole and control, but a p-value was not provided for GI bleeding, which is the real primary endpoint, even though “In the pantoprazole group, 2.5% of patients had clinically important gastrointestinal bleeding, as compared with 4.2% in the placebo group.” According to my calculations, the chi-square value is 7.23, with a p-value of 0.0072, indicating that pantoprazole is effective at the p<0.05 level in decreasing gastrointestinal bleeding in ICU patients. [emphasis added]

My concern is that clinicians may be misled into believing that pantoprazole is not effective in preventing GI bleeding in ICU patients when the study indicates that it is, in fact, effective.
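For what it's worth, the letter's chi-square figure checks out. A sketch, assuming arm sizes of roughly 1645 (pantoprazole) and 1653 (placebo) taken from the published trial report, and converting the quoted percentages back to approximate event counts:

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function for df = 1
    return chi2, p

# ~2.5% of 1645 pantoprazole patients bled, ~4.2% of 1653 placebo patients
bled_pan, ok_pan = 41, 1645 - 41
bled_pla, ok_pla = 69, 1653 - 69

chi2, p = chi_square_2x2(bled_pan, ok_pan, bled_pla, ok_pla)
print("chi-square = {:.2f}, p = {:.4f}".format(chi2, p))  # 7.23, 0.0072
```

The result matches the letter's stated chi-square of 7.23 and p-value of 0.0072.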

This sort of mislabeling of end-points is now commonplace in many medical journals. I am hoping you can shed some light on this. Perhaps you might be able to get the NY Times or the NEJM to publish an essay by you on this subject, as I believe the quality of medical publications is suffering from this practice.

I have no idea. I’m a bit intimidated by medical research with all its specialized measurements and models. So I don’t think I’m the right person to write this essay; indeed I haven’t even put in the work to evaluate Weinstein’s claims above.

But I do think they’re worth sharing, just because there is this “publication asymmetry” in which, once something appears in print, especially in a prestigious journal, it becomes very difficult to criticize (except in certain cases when there’s a lot of money, politics, or publicity involved).

### The Machine Learning Puzzle, Explained

Lots of moving parts go into creating a machine learning model. Let's take a look at some of these core concepts and see how the machine learning puzzle comes together.

### Four short links: 17 June 2019

Multiverse Databases, Detecting Photoshopping, Simulation Platform, and Tail-Call Optimization: The Musical

1. Towards Multiverse Databases (Morning Paper) -- The central idea behind multiverse databases is to push the data access and privacy rules into the database itself. The database takes on responsibility for authorization and transformation, and the application retains responsibility only for authentication and correct delegation of the authenticated principal on a database call. Such a design rules out an entire class of application errors, protecting private data from accidentally leaking.
2. Detecting Photoshopped Fakes (Verge) -- Adobe worked with Berkeley researchers to develop software that can spot Photoshopping in an image. (via BoingBoing).
3. Open Sourcing AI Habitat (Facebook) -- a new simulation platform created by Facebook AI that’s designed to train embodied agents (such as virtual robots) in photo-realistic 3D environments. [...] To illustrate the benefits of this new platform, we’re also sharing Replica, a data set of hyperrealistic 3D reconstructions of a staged apartment, retail store, and other indoor spaces.
4. Tail-Call Optimization: The Musical (YouTube) -- you're welcome.

### Story formats for data

Financial Times, in an effort to streamline a part of the data journalism process, developed templates for data stories. They call it the Story Playbook:

The Playbook is also an important driver of culture change in the newsroom. We have a rich and familiar vocabulary for print: The basement (A sometimes light-hearted, 350-word story that sits below the fold on the front page), for example, or the Page 3 (a 900–1200 word story at the top of the third page that is the day’s most substantive analysis article). For FT journalists, catflaps, birdcages, and skylines need no explanation.

The story playbook creates the equivalent for online stories, by introducing a vocabulary that provides a shared point of reference for everyone in the newsroom.

### Magister Dixit

“Big data is not about the data. (Making the point that while data is plentiful and easy to collect, the real value is in the analytics.)” Gary King

### Towards Learning of Filter-Level Heterogeneous Compression of Convolutional Neural Networks - implementation -


Recently, deep learning has become a de facto standard in machine learning with convolutional neural networks (CNNs) demonstrating spectacular success on a wide variety of tasks. However, CNNs are typically very demanding computationally at inference time. One of the ways to alleviate this burden on certain hardware platforms is quantization relying on the use of low-precision arithmetic representation for the weights and the activations. Another popular method is the pruning of the number of filters in each layer. While mainstream deep learning methods train the neural networks weights while keeping the network architecture fixed, the emerging neural architecture search (NAS) techniques make the latter also amenable to training. In this paper, we formulate optimal arithmetic bit length allocation and neural network pruning as a NAS problem, searching for the configurations satisfying a computational complexity budget while maximizing the accuracy. We use a differentiable search method based on the continuous relaxation of the search space proposed by Liu et al. (2019a). We show, by grid search, that heterogeneous quantized networks suffer from a high variance which renders the benefit of the search questionable. For pruning, improvement over homogeneous cases is possible, but it is still challenging to find those configurations with the proposed method. The code is publicly available at https://github.com/yochaiz/Slimmable and https://github.com/yochaiz/darts-UNIQ.


### Wayward legend takes sides in a chart of two sides, plus data woes

Reader Chris P. submitted the following graph, found on Axios:

From a Trifecta Checkup perspective, the chart has a clear question: are consumers getting what they wanted to read in the news they are reading?

Nevertheless, the chart is a visual mess, and the underlying data analytics fail to convince. So, it’s a Type DV chart. (See this overview of the Trifecta Checkup for the taxonomy.)

***

The designer did something tricky with the axis, but the trick went off the rails. The underlying data consist of two sets of ranks, one for news people consumed and the other for news people wanted covered. With 14 topics included in the study, the two data series contain the same values, 1 to 14. The trick is to collapse both axes onto one. The trouble is that the same value occurs twice, and the reader must differentiate the plot symbols (triangle or circle) to figure out which is which.

It does not help that the lines look like arrows suggesting movement. Without first reading the text, readers may assume that topics changed in rank between two periods of time: some topics moved right, increasing in importance, while others shifted left.

The design wisely separated the 14 topics into three logical groups. The blue group comprises news topics for which “want covered” ranking exceeds the “read” ranking. The orange group has the opposite disposition such that the data for “read” sit to the right side of the data for “want covered”. Unfortunately, the legend up top does more harm than good: it literally takes sides!

**

Here, I've put the data onto a scatter plot:

The two sets of ranks are basically uncorrelated, as the regression line is almost flat, with “R-squared” of 0.02.

The analyst tried to "rescue" the data in the following way. Draw the 45-degree line, and color the points above the diagonal blue, and those below the diagonal orange. Color the points on the line gray. Then, write stories about those three subgroups.

Further, the ranking of what was read came from Parse.ly, which appears to be surveillance data (“traffic analytics”), while the ranking of what people want covered came from an Axios/SurveyMonkey poll. As far as I could tell, there was no attempt to establish that the two populations are compatible and comparable.

### Book Memo: “Mathematical Theories of Machine Learning”

Theory and Applications
This book studies mathematical theories of machine learning. The first part of the book explores the optimality and adaptivity of choosing step sizes of gradient descent for escaping strict saddle points in non-convex optimization problems. In the second part, the authors propose algorithms to find local minima in nonconvex optimization and to obtain global minima in some degree from the Newton Second Law without friction. In the third part, the authors study the problem of subspace clustering with noisy and missing data, a problem well motivated by practical applications: data subject to stochastic Gaussian noise and/or incomplete data with uniformly missing entries. In the last part, the authors introduce a novel VAR model with Elastic-Net regularization and its equivalent Bayesian model, allowing for both stable sparsity and group selection.

### R Packages worth a look

Implementation of SCORE, SCORE+ and Mixed-SCORE (ScorePlus)
Implementation of community detection algorithm SCORE in the paper J. Jin (2015) <arXiv:1211.5803>, and SCORE+ in J. Jin, Z. Ke and S. Luo (2018) …

Fitting Discrete Distribution Models to Count Data (scModels)
Provides functions for fitting discrete distribution models to count data. Included are the Poisson, the negative binomial and, most importantly, a new …

Interface to the ‘JWSACruncher’ of ‘JDemetra+’ (rjwsacruncher)
‘JDemetra+’ (<https://…/jdemetra-app>) is the seasonal adjustment softw …

Creating Visuals for Publication (utile.visuals)
A small set of functions for making visuals for publication in ‘ggplot2’. Key functions include geom_stepconfint() for drawing a step confidence interv …

### Forecasting tools in development

(This article was first published on - R, and kindly contributed to R-bloggers)

As I’ve been writing up a progress report for my NIGMS R35 MIRA award, I’ve been reminded of how much of the work that we’ve been doing is focused on forecasting infrastructure. A common theme in the Reich Lab is making operational forecasts of infectious disease outbreaks. The operational aspect means that we focus on everything from developing and adapting statistical methods to be used in forecasting applications to thinking about the data science toolkit that you need to store, evaluate, and visualize forecasts. To that end, in addition to working closely with the CDC on their FluSight initiative, we’ve been doing a lot of collaborative work on new R packages and data repositories that I hope will be useful beyond the confines of our lab. Some of these projects are fully operational, used in our production flu forecasts for the CDC, and some have even gone through a level of code peer review. Others are in earlier stages of development. My hope is that in putting this list out there (see below the fold) we will generate some interest (and possibly find some new open-source collaborators) for these projects.

Here is a partial list of in-progress software that we’ve been working on:

• sarimaTD is an R package that serves as a wrapper to some of the ARIMA modeling functionality in the forecast R package. We found that we consistently wanted to be specifying some transformations (T) and differencing (D) in specific ways that we have found useful in modeling infectious disease time-series data, so we made it easy for us and others to use specifications.
• ForecastFramework is an R package that we have collaborated on with our colleagues at the Johns Hopkins Infectious Disease Dynamics lab. We’ve blogged about this before, and we see a lot of potential in this object-oriented framework for both standardizing how datasets are specified/accessed and how models are generated. That said, there is still a long way to go to document and make this infrastructure usable by a wide audience. The most success I’ve had using it so far was having PhD students write forecast models for a seminar I taught this spring. I used a single script that could run and score forecasts from each model, with a very simple plug-and-play interface to the models because they had been specified appropriately.
• Zoltar is a new repository (in alpha-ish release right now) for forecasts that we have been working on over the last year. It was initially designed with our CDC flu forecast use-case in mind, although the forecast structure is quite general, and with predx integration on the way (see next bullet) we are hoping that this will broaden the scope of possible use cases for Zoltar. To help facilitate our and others use of Zoltar, we are working on two interfaces to the Zoltar API, zoltpy for python and zoltr for R. Check out the documentation, especially for zoltr. There is quite a bit of data available!
• predx is an R package designed by my colleague and friend Michael Johansson of the US Centers for Disease Control and Prevention and OutbreakScience. Evan Ray, from the Reich Lab team, has contributed to it as well. The goal of predx is to define some general classes of data for both probabilistic and point forecasts, to better standardize ways that we might want to store and operate on these data.
• d3-foresight is the main engine behind our interactive forecast visualizations for flu in the US. We have also integrated it with Zoltar, so that you can view forecasts stored in Zoltar (note, kind of a long load time for that last link) using some of the basic d3-foresight functionality.

The lion’s share of the credit for all of the above are due to some combination of Matthew Cornell, Abhinav Tushar, Katie House, and Evan Ray.


### The race to become Britain’s next PM

After the first round of voting, Boris Johnson is still the clear favourite

### The Greenland ice sheet is melting unusually fast

It is losing so much water that it may raise global sea levels by a millimetre this year

## June 16, 2019

### Distilled News

TensorFrames is an open source library created by Apache Spark contributors. Its functions and parameters are named the same as in the TensorFlow framework. Under the hood, it is an Apache Spark DSL (domain-specific language) wrapper for Apache Spark DataFrames. It allows us to manipulate the DataFrames with TensorFlow functionality. And no, it is not a pandas DataFrame; it is based on the Apache Spark DataFrame.
The area under the precision-recall curve (AUPRC) is another performance metric that you can use to evaluate a classification model. If your model achieves a perfect AUPRC, it means your model can find all of the positive samples (perfect recall) without accidentally marking any negative samples as positive (perfect precision.) It’s important to consider both recall and precision together, because you could achieve perfect recall (but bad precision) using a naive classifier that marked everything positive, and you could achieve perfect precision (but bad recall) using a naive classifier that marked everything negative.
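The two naive-classifier cases in that summary are easy to demonstrate by hand (a toy sketch with made-up labels, not code from the summarized article):

```python
# toy ground truth: 3 positives, 7 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # no positive predictions
    recall = tp / (tp + fn)
    return precision, recall

# mark everything positive: perfect recall, poor precision
print(precision_recall(y_true, [1] * 10))  # (0.3, 1.0)
# mark everything negative: vacuously perfect precision, zero recall
print(precision_recall(y_true, [0] * 10))  # (1.0, 0.0)
```

Neither naive strategy looks good once both numbers are examined together, which is exactly the point of summarizing the whole precision-recall curve.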
Human beings see images by capturing reflected light rays, which is a very complex task. So how can machines be programmed to perform a similar task? Computers see images as matrices, which need to be processed to extract meaning from them. Image segmentation is a method of partitioning an image into segments, with each segment representing a different entity. Convolutional Neural Networks are successful for simpler images but haven’t given good results for complex images. This is where other algorithms like U-Net and Res-Net come into play.
10 Neural Network Architectures
• Perceptrons
• Convolutional Neural Networks
• Recurrent Neural Networks
• Long Short-Term Memory
• Gated Recurrent Unit
• Hopfield Network
• Boltzmann Machine
• Deep Belief Networks
• Autoencoders
In this post, I share slides and notes from a keynote that Roger Chen and I gave at the 2019 Artificial Intelligence conference in New York City. In this short summary, I highlight results from a survey (AI Adoption in the Enterprise) and describe recent trends in AI. Over the past decade, AI and machine learning (ML) have become extremely active research areas: the web site arxiv.org had an average daily upload of around 100 machine learning papers in 2018. With all the research that has been conducted over the past few years, it’s fair to say that we now have entered the implementation phase for many AI technologies. Companies are beginning to translate research results and developments into products and services.
I first realized the power of the Kalman Filter during Kaggle’s Web Traffic Time Series Forecasting competition, a contest requiring prediction of future web traffic volumes for thousands of Wikipedia pages. In this contest, simple heuristics like ‘median of medians’ were hard to beat and my own modeling choices were mostly ineffective. Of course I blamed my tools, and wondered if anything in the traditional statistical toolbox was up for this task. Then I read a post by a user known only as ‘os,’ 8th place with Kalman filters.
Machine learning is ultimately used to predict outcomes given a set of features. Therefore, anything we can do to generalize the performance of our model is seen as a net gain. Dropout is a technique used to prevent a model from overfitting. Dropout works by randomly setting the outgoing edges of hidden units (neurons that make up hidden layers) to 0 at each update of the training phase. If you take a look at the Keras documentation for the dropout layer, you’ll see a link to a white paper written by Geoffrey Hinton and friends, which goes into the theory behind dropout.
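The mechanic described above can be sketched in a few lines of plain Python (this is the common "inverted dropout" formulation, not Keras's internal code; rescaling survivors by 1/(1 - rate) keeps each unit's expected activation unchanged):

```python
import random

def dropout(activations, rate, rng):
    # zero each unit with probability `rate`; rescale the survivors so the
    # expected value of every activation stays the same as without dropout
    return [a / (1.0 - rate) if rng.random() >= rate else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [1.0] * 8                 # stand-in for hidden-layer outputs
dropped = dropout(hidden, 0.5, rng)
print(dropped)                     # each entry is either 0.0 or 2.0
```

At inference time dropout is simply switched off; the rescaling during training is what makes that valid.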
A recent finding from KPMG’s Global Sourcing Advisory Pulse Survey ‘Robotic Revolution’, suggests that technology experts believe ‘The opportunities [from RPA] are many – so are the adoption challenges… For most organisations, taking advantage of higher-end RPA opportunities will be easier said than done.’
First of all, let’s define what an anomaly in a time series is. The anomaly detection problem for time series can be formulated as finding outlier data points relative to some standard or usual signal. While there are plenty of anomaly types, we’ll focus only on the most important ones from a business perspective, such as unexpected spikes, drops, trend changes, and level shifts. You can solve this problem in two ways: supervised and unsupervised. While the first approach needs some labeled data, the second does not; you need just the raw data. In this article, we will focus on the second approach.
As we have done before (see 2017 data science ecosystem, 2018 data science ecosystem), we examine which tools were part of the same answer – the skillset of the user. We note that this does not necessarily mean that all tools were used together on each project, but having the knowledge and skills to use both tools X and Y makes it more likely that both X and Y were used together on some projects. The results we see are consistent with this assumption. The top tools show surprising stability – we see essentially the same pattern as last year.
We often hear claims à la ‘there is a high correlation between x and y.’ This is especially true with alleged findings about human or social behaviour in psychology, the social sciences or economics. A reported Pearson correlation coefficient of 0.8 indeed seems high in many cases and escapes our critical evaluation of its real meaning. So let’s see what correlation actually means and if it really conveys the information we often believe it does. Inspired by the funny spurious correlation project, as well as Nassim Taleb’s Medium post and Twitter rants in which he laments the total ignorance and misuse of probability and statistics by psychologists (and others), I decided to reproduce his note on how much information the correlation coefficient conveys under the Gaussian distribution.
One method that is very useful for data scientists and data analysts to validate methods or data is Monte Carlo simulation. In this article, you learn how to do a Monte Carlo simulation in Python. Furthermore, you learn how to sample from different statistical probability distributions in Python.
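For a taste of the approach, here is the classic textbook sketch (not code from the linked article): estimating π from uniform random points in the unit square.

```python
import random

def estimate_pi(n_samples, seed=42):
    """Monte Carlo estimate of pi: the fraction of uniform random points in
    the unit square that land inside the quarter unit circle, times 4."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # close to 3.14; the estimate tightens as n grows
```

The same recipe (sample, evaluate, aggregate) carries over to validating statistical methods on simulated data.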
There are countless examples of recommender systems in use across nearly every industry in existence today. Most people understand, at a high level, what recommender systems attempt to achieve. However, not many people understand how they work at a deeper level. This is what Leskovec, Rajaraman, and Ullman dive into in Chapter 9 of their book Mining of Massive Datasets. A great example of a recommender system in use today is Netflix. Whenever you log into Netflix there are various sections such as ‘Trending Now’ or ‘New Releases’, but there is also a section titled ‘Top Picks for You’. This section uses a complex formula that tries to estimate which movies you would enjoy the most. The formula takes into account previous movies you have enjoyed as well as other movies people like you have also enjoyed.
Graphs are becoming central to machine learning these days, whether you’d like to understand the structure of a social network by predicting potential connections, detect fraud, understand the behavior of a car rental service’s customers or make real-time recommendations, for example. In this article, we’ll cover:
• The main graph algorithms
• Illustrations and use-cases
• Examples in Python
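As a taste of those Python examples, here is a sketch of one of the most basic graph algorithms, breadth-first search for shortest paths, on a small invented graph:

```python
from collections import deque

# A small undirected graph as an adjacency list (hypothetical data).
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path (by edge count)."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```

The same traversal skeleton underlies many of the graph algorithms used for link prediction, fraud detection and recommendations.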
In the first article on the topic, Kernel Secrets in Machine Learning, I explained kernels in the most basic way possible. Before reading further, I would advise you to quickly go through that article to get a feel for what a kernel really is, if you have not already. Hopefully, you are going to come to this conclusion: a kernel is a similarity measure between two vectors in a mapped space. Good. Now we can check out and discuss some well-known kernels and also how we can combine kernels to produce other kernels. Keep in mind, for the examples that we are going to use, the x’s are one-dimensional vectors for plotting purposes and we fix the value of x’ to 2. Without further ado, let’s start kerneling.
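In that spirit (one-dimensional x, x′ fixed at 2), here is a hedged sketch of two classic kernels and one way of combining them; sums and products of valid kernels are again valid kernels, though the particular weights below are arbitrary:

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def poly(x, y, degree=2, c=1.0):
    """Polynomial kernel."""
    return (x * y + c) ** degree

def combined(x, y):
    """Sums and products of kernels are again valid kernels."""
    return rbf(x, y) + 0.5 * rbf(x, y, sigma=3.0) * poly(x, y)

# Fix x' = 2 and vary a one-dimensional x, as in the article's plots.
for x in (0.0, 1.0, 2.0, 3.0):
    print(x, combined(x, 2.0))
```

Note how the combined kernel peaks when x equals the fixed x′ = 2, exactly the similarity-measure intuition above.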
This post is not about deep learning. But it might as well be. This is the power of kernels: they are universally applicable in any machine learning algorithm. Why, you might ask? I am going to try to answer this question in this article.
This post is a part of my forthcoming book on Mathematical foundations of Data Science. In this post, we use the Perceptron algorithm to bridge the gap between high school maths and deep learning.
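The bridge really is short: the classic perceptron needs nothing beyond high-school arithmetic. A minimal sketch on invented, linearly separable data (the logical AND function):

```python
def train_perceptron(data, epochs=10, lr=1.0):
    """Classic perceptron on 2-D points labelled +1/-1: update the
    weights only when the current prediction is a mistake."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in data:
            activation = w[0] * x1 + w[1] * x2 + b
            predicted = 1 if activation >= 0 else -1
            if predicted != label:  # mistake-driven update
                w[0] += lr * label * x1
                w[1] += lr * label * x2
                b += lr * label
    return w, b

def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1

# Toy linearly separable data: the AND function.
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(data)
print([predict(w, b, x1, x2) for (x1, x2), _ in data])  # [-1, -1, -1, 1]
```

The update rule is just ‘nudge the line towards the misclassified point’, which is the seed from which gradient-based deep learning grows.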
After decades of heavy slog with no promise of success, quantum computing is suddenly buzzing! Nearly two years ago, IBM made a quantum computer available to the world: the 5-quantum-bit (qubit) resource they now call the IBM Q Experience. It was more like a toy for researchers than a way of getting any serious number crunching done, but 70,000 users worldwide have registered for it, and the qubit count in this resource has now quadrupled. With quantum computing promising so much, and data science currently at the helm, what does quantum computing have to offer AI? Let us explore that in this blog!

### If you did not already know

Estimation of Distribution Algorithm (EDA)
Estimation of distribution algorithms (EDAs), sometimes called probabilistic model-building genetic algorithms (PMBGAs), are stochastic optimization methods that guide the search for the optimum by building and sampling explicit probabilistic models of promising candidate solutions. Optimization is viewed as a series of incremental updates of a probabilistic model, starting with the model encoding the uniform distribution over admissible solutions and ending with the model that generates only the global optima. EDAs belong to the class of evolutionary algorithms. The main difference between EDAs and most conventional evolutionary algorithms is that evolutionary algorithms generate new candidate solutions using an implicit distribution defined by one or more variation operators, whereas EDAs use an explicit probability distribution encoded by a Bayesian network, a multivariate normal distribution, or another model class. Similarly as other evolutionary algorithms, EDAs can be used to solve optimization problems defined over a number of representations from vectors to LISP style S expressions, and the quality of candidate solutions is often evaluated using one or more objective functions.
Level-Based Analysis of the Univariate Marginal Distribution Algorithm
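The univariate special case (the UMDA referenced above) is easy to sketch: keep only per-bit marginal probabilities, fit them to the best sampled individuals, and resample. A hedged toy version on the OneMax problem, with arbitrary parameter choices of my own:

```python
import random

def umda_onemax(n_bits=20, pop_size=50, n_select=20, generations=30, seed=1):
    """Univariate Marginal Distribution Algorithm on OneMax: fit a
    per-bit probability model to the elite, then sample the next
    population from that model."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start from the uniform model
    best = None
    for _ in range(generations):
        pop = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        pop.sort(key=sum, reverse=True)  # OneMax fitness = number of ones
        elite = pop[:n_select]
        # Model update: marginal frequency of each bit among the elite,
        # clamped away from 0/1 to keep exploring.
        probs = [min(0.95, max(0.05, sum(ind[i] for ind in elite) / n_select))
                 for i in range(n_bits)]
        if best is None or sum(pop[0]) > sum(best):
            best = pop[0]
    return best

print(sum(umda_onemax()))  # at or near the optimum of n_bits = 20
```

This is exactly the EDA loop described above, with the probabilistic model reduced to independent Bernoulli marginals.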

Abnormal Event Detection Network (AED-Net)
It has been challenging to detect anomalies in crowded scenes for quite a long time. In this paper, a self-supervised framework, the abnormal event detection network (AED-Net), which is composed of PCAnet and kernel principal component analysis (kPCA), is proposed to address this problem. Using surveillance video sequences of different scenes as raw data, PCAnet is trained to extract high-level semantics of the crowd’s situation. Next, kPCA, a one-class classifier, is trained to determine the anomaly of the scene. In contrast to some prevailing deep learning methods, the framework is completely self-supervised because it utilizes only video sequences in a normal situation. Experiments on global and local abnormal event detection are carried out on the UMN and UCSD datasets, and competitive results with higher EER and AUC compared to other state-of-the-art methods are observed. Furthermore, by adding a local response normalization (LRN) layer, we propose an improvement to the original AED-Net, which is shown to perform better by improving the framework’s generalization capacity in our experiments. …

Fixed-Size Ordinally Forgetting Encoding (FOFE)
Question answering over knowledge base (KB-QA) has recently become a popular research topic in NLP. One popular way to solve the KB-QA problem is to use a pipeline of several NLP modules, including entity discovery and linking (EDL) and relation detection. Recent successes on the KB-QA task usually involve complex network structures with sophisticated heuristics. Inspired by a previous work that builds a strong KB-QA baseline, we propose a simple but general neural model composed of fixed-size ordinally forgetting encoding (FOFE) and deep neural networks, called FOFE-net, to solve the KB-QA problem at different stages. For evaluation, we use two popular KB-QA datasets, SimpleQuestions and WebQSP, and a newly created dataset, FreebaseQA. The experimental results show that FOFE-net performs well on the KB-QA subtasks, entity discovery and linking (EDL) and relation detection, and in turn pushes the overall KB-QA system to achieve strong results on all datasets. …

Q-Graph
Arising user-centric graph applications such as route planning and personalized social network analysis have initiated a shift of paradigms in modern graph processing systems towards multi-query analysis, i.e., processing multiple graph queries in parallel on a shared graph. These applications generate a dynamic number of localized queries around query hotspots such as popular urban areas. However, existing graph processing systems are not yet tailored towards these properties: The employed methods for graph partitioning and synchronization management disregard query locality and dynamism which leads to high query latency. To this end, we propose the system Q-Graph for multi-query graph analysis that considers query locality on three levels. (i) The query-aware graph partitioning algorithm Q-cut maximizes query locality to reduce communication overhead. (ii) The method for synchronization management, called hybrid barrier synchronization, allows for full exploitation of local queries spanning only a subset of partitions. (iii) Both methods adapt at runtime to changing query workloads in order to maintain and exploit locality. Our experiments show that Q-cut reduces average query latency by up to 57 percent compared to static query-agnostic partitioning algorithms. …

### Document worth reading: “On the Implicit Assumptions of GANs”

Generative adversarial nets (GANs) have generated a lot of excitement. Despite their popularity, they exhibit a number of well-documented issues in practice, which apparently contradict theoretical guarantees. A number of enlightening papers have pointed out that these issues arise from unjustified assumptions that are commonly made, but the message seems to have been lost amid the optimism of recent years. We believe the identified problems deserve more attention, and highlight the implications on both the properties of GANs and the trajectory of research on probabilistic models. We recently proposed an alternative method that sidesteps these problems. On the Implicit Assumptions of GANs

### Minimal Key Set is NP hard

It usually gives us a chuckle when we find some natural and seemingly easy data science question is NP-hard. For instance we have written that variable pruning is NP-hard when one insists on finding a minimal sized set of variables (and also why there are no obvious methods for exact large permutation tests).

In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.

Problem: Minimum Cardinality Primary Key

Instance: Vectors x1 through xm, elements of {0,1}^n, and a positive integer k.

Question: Is there a “primary key” of size no more than k? That is: is there a subset P of {1, …, n} with |P| ≤ k such that for any integers a, b with 1 ≤ a < b ≤ m we can find a j in P such that xa(j) does not equal xb(j) (i.e. xa and xb differ at some index named in P, and hence can be distinguished or “told apart”).

Now the standard reference on NP-hardness (Garey and Johnson, Computers and Intractability, Freeman, 1979) does have some NP-hard database examples (such as SR26 Minimum Cardinality Key). However the stated formulations are a bit hard to decode, so we will relate the above problem directly to a more accessible problem: SP8 Hitting Set.

Problem: SP8 Hitting Set

Instance: Collection C of subsets of a finite set S, positive integer K ≤ |S|.

Question: Is there a subset S′ of S with |S′| ≤ K such that S′ contains at least one element from each subset in C?

The idea is: SP8 is thought to be difficult to solve, so if we show how Minimum Cardinality Primary Key could be used to easily solve SP8 this is then evidence Minimum Cardinality Primary Key is also difficult to solve.

So suppose we have an arbitrary instance of SP8 in front of us. Without loss of generality assume S = {1, …, n}, C = {C1, …, Cm}, and all of the Ci are non-empty and distinct.

We build an instance of the Minimum Cardinality Primary Key problem by defining a table with columns named s1 through sn plus d1 through dm.

Now we define the rows of our table:

• Let r0 be the row of all zeros.
• For i from 1 to m let zi be the row with zi(di) = 1 and all other columns equal to zero.
• For i from 1 to m let xi be the row with xi(di) = 1, xi(sj) = 1 for all j in Ci, and all other columns equal to zero.

Now let’s look at what sets of columns form primary keys for the collection of rows r0, zi, xi.

We must have all of the di in P, as each di is the unique index of the only difference between zi and r0. Also, for each i we must have some j in Ci with sj in P, for if there were none we could not tell zi from xi (as they differ only in the columns sj with j in Ci).

This lets us confirm that a good primary key set P is such that S′ = {j | sj in P} is a good hitting set for the SP8 problem. And for any hitting set S′ we have that P = {sj | j in S′} union {d1, …, dm} is a good solution for the Minimum Cardinality Primary Key problem (the di allow us to distinguish r0 from the zi, the zi from each other, r0 from the xi, and the xi from each other; the set hitting property lets us distinguish each zi from the corresponding xi, completing the unique keying of rows by the chosen column set). And the solution sizes are always such that |P| = |S′| + m.

So: if we had a method to solve arbitrary instances of the Minimum Cardinality Primary Key problem, we could then use it to solve arbitrary instances of the SP8 Hitting Set Problem. We would just re-encode the SP8 problem as described above, solve the Minimum Cardinality Primary Key problem, and use the strong correspondence between solutions to these two problems to map the solution back to the SP8 problem. Thus the Minimum Cardinality Primary Key problem is itself NP-hard.
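The reduction is easy to code up and sanity-check on a tiny instance by brute force (a sketch; the exhaustive key search is exponential and only for toy examples):

```python
from itertools import combinations

def build_table(n, C):
    """Build the Minimum Cardinality Primary Key instance from a
    Hitting Set instance with ground set {1..n} and subsets C."""
    m = len(C)
    width = n + m  # columns s1..sn then d1..dm
    rows = [[0] * width]  # r0: all zeros
    for i, Ci in enumerate(C):
        z = [0] * width
        z[n + i] = 1  # z_i: 1 only in column d_i
        rows.append(z)
        x = [0] * width
        x[n + i] = 1  # x_i: 1 in d_i ...
        for j in Ci:
            x[j - 1] = 1  # ... and in s_j for each j in C_i
        rows.append(x)
    return rows

def is_key(rows, cols):
    """A column set is a key iff its projection distinguishes all rows."""
    seen = set()
    for row in rows:
        proj = tuple(row[c] for c in cols)
        if proj in seen:
            return False
        seen.add(proj)
    return True

def min_key_size(rows):
    """Brute-force smallest key size (exponential; toy instances only)."""
    width = len(rows[0])
    for k in range(width + 1):
        for cols in combinations(range(width), k):
            if is_key(rows, cols):
                return k
    return None

# Hitting Set instance: S = {1,2,3}, C = {{1,2},{2,3}}. The minimum
# hitting set is {2}, so the minimum key should have size 1 + m = 3.
rows = build_table(3, [{1, 2}, {2, 3}])
print(min_key_size(rows))  # 3
```

The printed key size minus m recovers the minimum hitting set size, which is exactly the correspondence used in the argument above.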

What makes the problem hard is, as is quite common, the solution size constraint. Without that constraint the problem is trivial: the set of all columns either forms a primary key or it does not, and that is a simple calculation to check. As with the variable pruning problem, we can even try step-wise deleting columns to explore subsets of columns that are still primary table keys, moving us to a non-redundant key set (but possibly not one of minimal size).

### We’re done with our Applied Regression final exam (and solution to question 15)

We’re done with our exam.

And the solution to question 15:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

Both (a) and (b) are true.

(a) is true because everything’s approximately normally distributed so you’d expect a 95% chance for an estimate +/- 2 se’s to contain the true value. In real life we’re concerned with model violations, but here it’s all simulated data so no worries about bias. And n=100 is large enough that we don’t have to worry about the t rather than normal distribution. (Actually, even if n were pretty small, we’d be doing ok with estimates +/- 2 sd’s because we’re using the mad sd which gets wider when the t degrees of freedom are low.)

And (b) is true too because of the central limit theorem. Switching from a normal to a bimodal distribution will affect predictions for individual cases but it will have essentially no effect on the distribution of the estimate, which is an average from 100 data points.
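A quick way to convince yourself is to run the procedure. Here is a rough stdlib-only sketch in which the classical least-squares estimate and standard error stand in for the exam’s median and mad sd (an approximation; the particular bimodal mixture is my own choice):

```python
import random

def simulate_coverage(n_sims=1000, n=100, a=2.0, b=3.0, bimodal=False, seed=7):
    """Fraction of simulations in which b_hat +/- 2 se covers b = 3."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_sims):
        x = [rng.uniform(0, 10) for _ in range(n)]
        if bimodal:
            # Mixture of two normals centred at -2 and +2 (mean zero).
            errors = [rng.gauss(2, 0.5) if rng.random() < 0.5
                      else rng.gauss(-2, 0.5) for _ in range(n)]
        else:
            errors = [rng.gauss(0, 1) for _ in range(n)]
        y = [a + b * xi + ei for xi, ei in zip(x, errors)]
        # Closed-form simple linear regression and standard error of b_hat.
        xbar = sum(x) / n
        sxx = sum((xi - xbar) ** 2 for xi in x)
        b_hat = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
        a_hat = sum(y) / n - b_hat * xbar
        resid = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]
        sigma2 = sum(r ** 2 for r in resid) / (n - 2)
        se_b = (sigma2 / sxx) ** 0.5
        if b_hat - 2 * se_b <= b <= b_hat + 2 * se_b:
            covered += 1
    return covered / n_sims

print(simulate_coverage())              # roughly 0.95
print(simulate_coverage(bimodal=True))  # also roughly 0.95, per the CLT
```

Both runs land near 95% coverage, matching the answers to (a) and (b).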

Common mistakes

Most of the students got (a) correct but not (b). I guess I have to bang even harder on the relative unimportance of the error distribution (except when the goal is predicting individual cases).

### modelDown is now on CRAN!

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

The modelDown package turns classification or regression models into static HTML websites.
With one command you can convert one or more models into a website with visual and tabular model summaries: model performance, feature importance, single feature response profiles and basic model audits.

modelDown uses DALEX explainers, so it’s model agnostic (feel free to combine a random forest with a glm), easy to extend and easy to parameterise.

Here you can browse an example website automatically created for 4 classification models (random forest, gradient boosting, support vector machines, k-nearest neighbours). The R code behind this example is here.

Fun facts:

– archivist hooks are generated for every documented object, so you can easily extract R objects from the HTML website. Try

archivist::aread("MI2DataLab/modelDown_example/docs/repository/574defd6a96ecf7e5a4026699971b1d7")

– session info is automatically recorded. So you can check version of packages available at model development (https://github.com/MI2DataLab/modelDown_example/blob/master/docs/session_info/session_info.txt)

– This package was initially created by Magda Tatarynowicz, Kamil Romaszko and Mateusz Urbański from Warsaw University of Technology as a student project.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### ‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation, with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and how we built on that for the specifics of this simulation study. The slides are on the EdinbR Github.

Genetic simulation is useful for all kinds of things. Sure, simulations are only as good as the theory that underpins them, but the willingness to try things out in simulations is one of the things I always liked about breeding research.

This is my description of the logic of genetic simulation: we think of the genome as a large table of genotypes, drawn from some distribution of allele frequencies.

To make an utterly minimal simulation, we could draw allele frequencies from some distribution (like a Beta distribution), and then draw the genotypes from a binomial distribution. Done!
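That minimal recipe fits in a few lines. A Python sketch of the two-step draw (the post itself builds on R and AlphaSimR; the Beta parameters here are arbitrary):

```python
import random

def simulate_genotypes(n_individuals, n_loci, seed=1):
    """Utterly minimal genetic simulation: allele frequencies drawn from
    a Beta distribution, then genotypes (0/1/2 allele copies) drawn as
    Binomial(2, freq) at each locus."""
    rng = random.Random(seed)
    freqs = [rng.betavariate(0.5, 0.5) for _ in range(n_loci)]
    genotypes = [
        [sum(rng.random() < f for _ in range(2)) for f in freqs]
        for _ in range(n_individuals)
    ]
    return freqs, genotypes

freqs, genotypes = simulate_genotypes(n_individuals=100, n_loci=5)
print(genotypes[0])  # one row of the genotype table, values in {0, 1, 2}
```

Each row is one individual, each column one locus: exactly the ‘large table of genotypes’ picture above. Done!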

However, there is a ton of nuance we would like to have: chromosomes, linkage between variants, sexes, mating, selection …

AlphaSimR addresses all of this, and allows you to throw individuals and populations around to build pretty complicated designs. Here is the small example simulation I used in the talk.

library(AlphaSimR)
library(ggplot2)

## Generate founder chromosomes

FOUNDERPOP <- runMacs(nInd = 1000,
nChr = 10,
segSites = 5000,
inbred = FALSE,
species = "GENERIC")

## Simulation parameters

SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
mean = 100,
var = 10)
SIMPARAM$addSnpChip(nSnpPerChr = 1000)
SIMPARAM$setGender("yes_sys")

## Founding population

pop <- newPop(FOUNDERPOP,
simParam = SIMPARAM)

pop <- setPheno(pop,
varE = 20,
simParam = SIMPARAM)

## Breeding

print("Breeding")
breeding <- vector(length = 11, mode = "list")
breeding[[1]] <- pop

for (i in 2:11) {
print(i)
sires <- selectInd(pop = breeding[[i - 1]],
gender = "M",
nInd = 25,
trait = 1,
use = "pheno",
simParam = SIMPARAM)

dams <- selectInd(pop = breeding[[i - 1]],
nInd = 500,
gender = "F",
trait = 1,
use = "pheno",
simParam = SIMPARAM)

breeding[[i]] <- randCross2(males = sires,
females = dams,
nCrosses = 500,
nProgeny = 10,
simParam = SIMPARAM)
breeding[[i]] <- setPheno(breeding[[i]],
varE = 20,
simParam = SIMPARAM)
}

## Look at genetic gain and shift in causative variant allele frequency

mean_g <- unlist(lapply(breeding, meanG))
sd_g <- sqrt(unlist(lapply(breeding, varG)))

plot_gain <- qplot(x = 1:11,
y = mean_g,
ymin = mean_g - sd_g,
ymax = mean_g + sd_g,
geom = "pointrange",
main = "Genetic mean and standard deviation",
xlab = "Generation", ylab = "Genetic mean")

start_geno <- pullQtlGeno(breeding[[1]], simParam = SIMPARAM)
start_freq <- colSums(start_geno)/(2 * nrow(start_geno))

end_geno <- pullQtlGeno(breeding[[11]], simParam = SIMPARAM)
end_freq <- colSums(end_geno)/(2 * nrow(end_geno))

plot_freq_before <- qplot(start_freq, main = "Causative variant frequency before")
plot_freq_after <- qplot(end_freq, main = "Causative variant frequency after")


This code builds a small livestock population, breeds it for ten generations, and looks at the resulting selection response in the form of a shift of the genetic mean, and the changes in the underlying distribution of causative variants. Here are the resulting plots:


### Distilled News

Since it was introduced a few years ago, Google’s Transformer architecture has been applied to challenges ranging from generating fantasy fiction to writing musical harmonies. Importantly, the Transformer’s high performance has demonstrated that feed forward neural networks can be as effective as recurrent neural networks when applied to sequence tasks, such as language modeling and translation. While the Transformer and other feed forward models used for sequence problems are rising in popularity, their architectures are almost exclusively manually designed, in contrast to the computer vision domain where AutoML approaches have found state-of-the-art models that outperform those that are designed by hand. Naturally, we wondered if the application of AutoML in the sequence domain could be equally successful.
Networks are everywhere. We have social networks like Facebook, competitive product networks, and various networks within an organisation. Also, for STATWORX it is a common task to unveil hidden structures and clusters in a network and visualize them for our customers. In the past, we used the tool Gephi to visualize our results of network analysis. Impressed by its outstandingly pretty and interactive visualizations, our idea was to find a way to do visualizations of the same quality directly in R and present them to our customers in an R Shiny app.
I bet most of us have seen a lot of AI-generated people faces in recent times, be it in papers or blogs. We have reached a stage where it is becoming increasingly difficult to distinguish between actual human faces and faces that are generated by Artificial Intelligence. In this post, I will help the reader to understand how they can create and build such applications on their own. I will try to keep this post as intuitive as possible for starters while not dumbing it down too much.
I read ‘anomaly’ definitions in every kind of context, everywhere. In this chaos the only truth is the variability of the definition, i.e. what counts as an anomaly is completely related to the domain of interest. Detecting this kind of behavior is useful in every business, and the difficulty of detecting these observations depends on the field of application. If you are engaged in a problem of anomaly detection which involves human activity (like prediction of sales or demand), you can take advantage of fundamental assumptions about human behavior and plan a more efficient solution. This is exactly what we do in this post. We try to predict the taxi demand in NYC in a critical time period. We formulate simple and important assumptions about human behavior, which will permit us to find an easy solution to forecast anomalies. All the dirty work is done by a trusty LSTM, developed in Keras, which makes predictions and detects anomalies at the same time!
This article looks at one of the most powerful and state-of-the-art algorithms in Reinforcement Learning (RL), Twin Delayed Deep Deterministic Policy Gradients (TD3) (Fujimoto et al., 2018). By the end of this article you should have a solid understanding of what makes TD3 perform so well, be capable of implementing the algorithm yourself, and be able to use TD3 to train an agent to successfully run in the HalfCheetah environment.
In the last article we talked about the building blocks of a knowledge-graph, now we will go a step further and learn the basic concepts, technologies and languages we need to understand to actually build it.
Ensemble is a Latin-derived word which means ‘union of parts’. The regular classifiers that are often used are prone to making errors. While these errors are inevitable, they can be reduced with the proper construction of a learning classifier. Ensemble learning is a way of generating various base classifiers from which a new classifier is derived that performs better than any constituent classifier. These base classifiers may differ in the algorithm used, hyperparameters, representation or the training set. The key objective of ensemble methods is to reduce bias and variance.
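The simplest such construction is hard majority voting over the base classifiers. A toy sketch with three invented rules for the question ‘is x positive?’:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine base classifiers by simple (hard) majority voting,
    one of the simplest ways to build an ensemble."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three weak, hypothetical base classifiers for "is x positive?".
def clf_a(x): return x > 0   # the correct rule
def clf_b(x): return x > -1  # biased towards True
def clf_c(x): return x > 1   # biased towards False

for x in (-2, -0.5, 0.5, 2):
    print(x, majority_vote([clf_a, clf_b, clf_c], x))
```

On these inputs the two opposite biases cancel out in the vote, illustrating how combining imperfect base classifiers can reduce error.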
Full disclosure: I haven’t watched or read Game of Thrones, but I am hoping to learn a lot about it by analyzing the text. If you would like more background about the basic text processing, you can read my other article. The text from all 5 books can be found on Kaggle. In this article I will be taking the cleaned text and using it to explain the following concepts:
• Vectorization: Bag-of-Words, TF-IDF, and Skip-Thought Vectors
• After Vectorization
• POS tagging
• Named Entity Recognition (NER)
• Chunking and Chinking
• Sentiment Analysis
• Other NLP packages
This is the final part of the series of how to go on from Jupyter Notebooks to software solutions in Data Science. Part 1 covered the basics of setting up the working environment and data exploration. Part 2 dived deep into data pre-processing and modelling. Part 3 will deal with how you can move on from Jupyter, front end development and your daily work in the code. The overall agenda of the series is the following:
• Setting up your working environment [Part 1]
• Important modules for data exploration [Part 1]
• Machine Learning Part 1: Data pre-processing [Part 2]
• Machine Learning Part 2: Models [Part 2]
• Moving on from Jupyter [Part 3]
• Shiny stuff: when do we get a front end? [Part 3]
• Your daily work in the code: keeping standards [Part 3]
Recurrent neural networks have a wide array of applications. These include time series analysis, document classification, speech and voice recognition. In contrast to feedforward artificial neural networks, the predictions made by recurrent neural networks are dependent on previous predictions. To elaborate, imagine we decided to follow an exercise routine where, every day, we alternate between lifting weights, swimming and yoga. We could then build a recurrent neural network to predict today’s workout given what we did yesterday. For example, if we lifted weights yesterday then we’d go swimming today. More often than not, the problems you’ll be tackling in the real world will be a function of the current state as well as other inputs. For instance, suppose we signed up for hockey once a week. If we’re playing hockey on the same day that we’re supposed to lift weights then we might decide to skip the gym. Thus, our model now has to differentiate between the case when we attended a yoga class yesterday and we’re not playing hockey today, and the case when we attended a yoga class yesterday and we are playing hockey today, in which case we’d jump directly to swimming.
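The workout example boils down to a tiny state machine: the prediction depends on the previous state plus an extra input, which is exactly the structure an RNN learns. A literal sketch of the rules above:

```python
# State-machine version of the workout example: today's workout depends
# on yesterday's workout (the recurrent state) and on an extra input
# (whether we play hockey today).
NEXT = {"weights": "swimming", "swimming": "yoga", "yoga": "weights"}

def todays_workout(yesterday, hockey_today=False):
    planned = NEXT[yesterday]
    if hockey_today and planned == "weights":
        return "swimming"  # skip the gym, jump directly to swimming
    return planned

print(todays_workout("weights"))                  # swimming
print(todays_workout("yoga"))                     # weights
print(todays_workout("yoga", hockey_today=True))  # swimming
```

An RNN generalizes this by learning such state-plus-input transitions from data instead of hand-coding them.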
Convolutional Neural Nets (CNNs), which have achieved the greatest performance for image classification, were inspired by the mammalian visual cortex system. In spite of the drastic progress in automated computer vision systems, most of the success of image classification architectures comes from labeled data. The problem is that most real-world data is not labeled. According to Yann LeCun, father of CNNs and professor at NYU, the next ‘big thing’ in artificial intelligence is semi-supervised learning – a type of machine learning task that makes use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data. That is why recently a large research effort has been focused on unsupervised learning that does not leverage a large amount of expensive supervision.
While the technology and tools used by data scientists have grown dramatically, the data science lifecycle has stagnated. In fact, little has changed between the earliest versions of CRISP-DM created over 20 years ago and the more recent lifecycles offered by leading vendors such as Google, Microsoft, and DataRobot. Most versions of the data science lifecycle still address the same set of tasks: understanding the business problem, understanding domain data, acquiring and engineering data, model development and training, and model deployment and monitoring (see Figure 1). But enterprise needs have evolved as data science has become embedded in most companies. Today, model reproducibility, traceability, and verifiability have become fundamental requirements for data science in large enterprises. Unfortunately, these requirements are omitted or significantly underplayed in leading AI/ML lifecycles.
Sentiment Analysis is a classic example of machine learning, which (in a nutshell) is a way of ‘learning’ that enables algorithms to evolve. This ‘learning’ means feeding the algorithm with a massive amount of data so that it can adjust itself and continually improve. Sentiment analysis is the automated process of understanding an opinion about a given subject from written or spoken language. In a world where we generate 2.5 quintillion bytes of data every day, sentiment analysis has become a key tool for making sense of that data. This has allowed companies to get key insights and automate all kinds of processes.
The article assumes that you have some brief idea about the regression techniques that could predict the required variable from a stratified and equitable distribution of records in a dataset that are implemented by a statistical approach. Just kidding! All you need is adequate math to be able to understand basic graphs. Before entering the topic, a little brush up…
k-medoids clustering is a classical clustering machine learning algorithm. It is a sort of generalization of the k-means algorithm. The only difference is that cluster centers can only be elements of the dataset; this yields an algorithm which can use any type of distance function, whereas k-means only provably converges using the L2-norm.
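A naive sketch of the alternating assign/update loop (not an optimized PAM implementation; the initialisation and the toy data are arbitrary choices of mine):

```python
def k_medoids(points, k, dist, max_iter=20):
    """Naive k-medoids: medoids must be actual data points, so any
    distance function works (unlike k-means, which needs the L2 norm)."""
    medoids = list(points[:k])  # simple deterministic initialisation
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest medoid.
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: new medoid = member minimising total distance.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, p) for p in members))
            if members else medoids[i]
            for i, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

# Works with any distance, e.g. absolute difference on 1-D data.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids = k_medoids(points, 2, dist=lambda a, b: abs(a - b))
print(sorted(medoids))  # [2.0, 11.0]
```

Because the medoid is always a data point, swapping in, say, an edit distance on strings requires no change to the algorithm.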
Often, when training a very deep neural network, we want to stop training once the training accuracy reaches a certain desired threshold. Thus, we can achieve what we want (optimal model weights) and avoid wastage of resources (time and computation power). In this brief tutorial, let’s learn how to achieve this in Tensorflow and Keras, using the callback approach, in 4 simple steps.
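In Keras this is done with a custom callback that sets `self.model.stop_training = True` once the metric crosses the threshold. The control flow can be sketched framework-free (the class and loop below are my own stand-ins, not the actual Keras API):

```python
class StopAtThreshold:
    """Minimal stand-in for a Keras-style callback: the training loop
    calls on_epoch_end after each epoch and stops once training
    accuracy reaches the desired threshold."""
    def __init__(self, threshold=0.99):
        self.threshold = threshold
        self.stop_training = False
        self.stopped_epoch = None

    def on_epoch_end(self, epoch, logs):
        if logs.get("accuracy", 0.0) >= self.threshold:
            self.stop_training = True
            self.stopped_epoch = epoch

def train(accuracies, callback):
    """Fake training loop standing in for model.fit: 'accuracies' plays
    the role of the per-epoch training accuracy."""
    for epoch, acc in enumerate(accuracies):
        callback.on_epoch_end(epoch, {"accuracy": acc})
        if callback.stop_training:
            break
    return callback.stopped_epoch

cb = StopAtThreshold(threshold=0.95)
print(train([0.60, 0.85, 0.96, 0.99], cb))  # stops at epoch 2
```

The real Keras version subclasses `tf.keras.callbacks.Callback` and is passed to `model.fit(..., callbacks=[...])`, but the stopping logic is the same.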

### Sudan's government is minimizing the death toll in the Khartoum attack

Two state agencies report a different, smaller number of protesters killed than an independent panel’s count

Death tolls can be hard to calculate. Violence, whether it’s natural or manmade, can create chaos that makes counting difficult. And we rarely pay attention for long enough to see that those who die from their injuries are added to the final count of the dead.

And sometimes, there’s a vested interest in minimizing the numbers. On 3 June, government forces in Sudan violently attacked protesters in Khartoum. The protesters were calling for the transitional military council to hand power over to a civilian-led government. According to Human Rights Watch, protesters were chased, whipped, shot at and, according to several reports, raped.

### If you did not already know

WeCURE
Missing data recovery is an important and yet challenging problem in imaging and data science. Successful models often adopt certain carefully chosen regularization. Recently, the low dimension manifold model (LDMM) was introduced by S. Osher et al. and shown effective in image inpainting. They observed that enforcing low dimensionality on image patch manifold serves as a good image regularizer. In this paper, we observe that having only the low dimension manifold regularization is not enough sometimes, and we need smoothness as well. For that, we introduce a new regularization by combining the low dimension manifold regularization with a higher order Curvature Regularization, and we call this new regularization CURE for short. The key step of solving CURE is to solve a biharmonic equation on a manifold. We further introduce a weighted version of CURE, called WeCURE, in a similar manner as the weighted nonlocal Laplacian (WNLL) method. Numerical experiments for image inpainting and semi-supervised learning show that the proposed CURE and WeCURE significantly outperform LDMM and WNLL respectively. …

Recurrent Convolutional Network (RCN)
Recently, three dimensional (3D) convolutional neural networks (CNNs) have emerged as dominant methods to capture spatiotemporal representations, by adding to pre-existing 2D CNNs a third, temporal dimension. Such 3D CNNs, however, are anti-causal (i.e., they exploit information from both the past and the future to produce feature representations, thus preventing their use in online settings), constrain the temporal reasoning horizon to the size of the temporal convolution kernel, and are not temporal resolution-preserving for video sequence-to-sequence modelling, as, e.g., in spatiotemporal action detection. To address these serious limitations, we present a new architecture for the causal/online spatiotemporal representation of videos. Namely, we propose a recurrent convolutional network (RCN), which relies on recurrence to capture the temporal context across frames at every level of network depth. Our network decomposes 3D convolutions into (1) a 2D spatial convolution component, and (2) an additional hidden state $1\times 1$ convolution applied across time. The hidden state at any time $t$ is assumed to depend on the hidden state at $t-1$ and on the current output of the spatial convolution component. As a result, the proposed network: (i) provides flexible temporal reasoning, (ii) produces causal outputs, and (iii) preserves temporal resolution. Our experiments on the large-scale ‘Kinetics’ dataset show that the proposed method achieves superior performance compared to 3D CNNs, while being causal and using fewer parameters. …

Parallelizable Stack Long Short-Term Memory
Stack Long Short-Term Memory (StackLSTM) is useful for various applications such as parsing and string-to-tree neural machine translation, but it is also known to be notoriously difficult to parallelize for GPU training due to the fact that the computations are dependent on discrete operations. In this paper, we tackle this problem by utilizing state access patterns of StackLSTM to homogenize computations with regard to different discrete operations. Our parsing experiments show that the method scales up almost linearly with increasing batch size, and our parallelized PyTorch implementation trains significantly faster compared to the Dynet C++ implementation. …

Social Relationship Graph Generation Network (SRG-GN)
Socially-intelligent agents are of growing interest in artificial intelligence. To this end, we need systems that can understand social relationships in diverse social contexts. Inferring the social context in a given visual scene not only involves recognizing objects, but also demands a more in-depth understanding of the relationships and attributes of the people involved. To achieve this, one computational approach for representing human relationships and attributes is to use an explicit knowledge graph, which allows for high-level reasoning. We introduce a novel end-to-end-trainable neural network that is capable of generating a Social Relationship Graph – a structured, unified representation of social relationships and attributes – from a given input image. Our Social Relationship Graph Generation Network (SRG-GN) is the first to use memory cells like Gated Recurrent Units (GRUs) to iteratively update the social relationship states in a graph using scene and attribute context. The neural network exploits the recurrent connections among the GRUs to implement message passing between nodes and edges in the graph, and results in significant improvement over previous methods for social relationship recognition. …

### Book Memo: “Managing Your Data Science Projects”

Learn Salesmanship, Presentation, and Maintenance of Completed Models

At first glance, the skills required to work in the data science field appear to be self-explanatory. Do not be fooled. Impactful data science demands an interdisciplinary knowledge of business philosophy, project management, salesmanship, presentation, and more. In Managing Your Data Science Projects, author Robert de Graaf explores important concepts that are frequently overlooked in much of the instructional literature that is available to data scientists new to the field. If your completed models are to be used and maintained most effectively, you must be able to present and sell them within your organization in a compelling way.

## June 15, 2019

### CoderStats Revamp, d3-geomap v3 Release, Python Data Science Handbook Review

I recently worked on some projects that had been dormant for a while, including a revamp of coderstats.net and the release of d3-geomap version 3, and I published a review of the Python Data Science Handbook.

### R Packages worth a look

Composite Grid Gaussian Processes (CGGP)
Run computer experiments using the adaptive composite grid algorithm with a Gaussian process model. The algorithm works best when running an experiment …

Simple Jenkins Client (jenkins)
Manage jobs and builds on your Jenkins CI server <https://…/>. Create and edit projects, s …

Provides a framework to download, parse, and store text datasets on the disk and load them when needed. Includes various sentiment lexicons and labeled …

Robust Backfitting (RBF)
A robust backfitting algorithm for additive models based on (robust) local polynomial kernel smoothers. It includes both bounded and re-descending (ker …

### Document worth reading: “A Survey of the Recent Architectures of Deep Convolutional Neural Networks”

Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art results on various competitive benchmarks. The powerful learning ability of deep CNN is largely achieved with the use of multiple non-linear feature extraction stages that can automatically learn hierarchical representation from the data. Availability of a large amount of data and improvements in the hardware processing units have accelerated the research in CNNs and recently very interesting deep CNN architectures are reported. The recent race in deep CNN architectures for achieving high performance on the challenging benchmarks has shown that the innovative architectural ideas, as well as parameter optimization, can improve the CNN performance on various vision-related tasks. In this regard, different ideas in the CNN design have been explored such as use of different activation and loss functions, parameter optimization, regularization, and restructuring of processing units. However, the major improvement in representational capacity is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is gaining substantial appreciation. This survey thus focuses on the intrinsic taxonomy present in the recently reported CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting and attention. Additionally, it covers the elementary understanding of the CNN components and sheds light on the current challenges and applications of CNNs. A Survey of the Recent Architectures of Deep Convolutional Neural Networks

### Algorithmic bias and social bias

The “algorithmic bias” that concerns me is not so much a bias in an algorithm, but rather a social bias resulting from the demand for, and expectation of, certainty.

### Magister Dixit

“To put it simply, there is too much friction. In any given workflow, you have to go through several levels to get to what you really need. For instance, say you’re part of the customer service team: You use Salesforce to get the information you need to best serve customers. But depending on the information, you have to go across half-a-dozen windows in search for the right sales pitch, product information, or other collateral. You are 15 steps into a workflow before you get to the real starting point. This wastes time, money, and reduces quality of service. This is in sharp contrast to what you have come to expect using consumer products. Think peer-to-peer payment option solutions like Square that make payments as simple as the tap of a button – eliminating dozens of process steps that you would usually go through. This simple, bare-bones approach has changed industries across the board, be it transportation (Uber), insurance (15 minutes can save you…), accounting (TurboTax), retail (Amazon Same-Day), and so on. Enterprises that provide this personalized, contextual experience will thrive and those that don’t will falter.” Mayank Mehta ( August 27, 2015 )

### Introducing the {ethercalc} package

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I mentioned EtherCalc in a previous post and managed to scrounge some time to put together a fledgling {ethercalc} package (it’s also on GitLab, SourceHut, Bitbucket and GitUgh, just sub out the appropriate URL prefix).

I’m creating a package-specific Docker image (there are a couple out there but I’m not supporting their use with the package as they have a CORS configuration that makes EtherCalc API wrangling problematic) for EtherCalc but I would highly recommend that you just use it via the npm module. To do that you need a working Node.js installation and I highly recommend also running a local Redis instance (it’s super lightweight). Linux folks can use their fav package manager for that and macOS folks can use homebrew. Folks on the legacy Windows operating system can visit:

to get EtherCalc going.

I also recommend running EtherCalc and Redis together for performance reasons. EtherCalc will maintain a persistent store for your spreadsheets (they call them “rooms” since EtherCalc supports collaborative editing) with or without Redis, but using Redis makes all EtherCalc ops much, much faster.

Once you have Redis running (on localhost, which is the default) and Node.js + npm installed, you can do the following to install EtherCalc:

$ npm install -g ethercalc # may require sudo on some macOS or *nix systems

The -g tells npm to install the module globally and will work to ensure the ethercalc executable is on your PATH. Like many things one can install from Node.js or, even Python, you may see a cadre of “warnings” and possibly even some “errors”. If you execute the following and see similar messages:

$ ethercalc --host=localhost ## IMPORTANT TO USE --host=localhost
Falling back to vm.CreateContext backend
Express server listening on port 8000 in development mode
Zappa 0.5.0 "You can't do that on stage anymore" orchestrating the show
Connected to Redis Server: localhost:6379


and then go to the URL it gives you and you see something like this:

then you’re all set to continue.

### A [Very] Brief EtherCalc Introduction

For now, if you hit that big, blue “Create Spreadsheet” button, you’ll see something pretty familiar if you’ve used Google Sheets, Excel, LibreOffice Calc (etc):

If you start ethercalc without the --host=localhost it listens on all network interfaces, so other folks on your network can also use it as a local “cloud” spreadsheet app, but also edit with you, just like Google Sheets.

I recommend playing around a bit in EtherCalc before continuing just to see that it is, indeed, a spreadsheet app like all the others you are familiar with, except it has a robust API that we can orchestrate from within R, now.

### Working with {ethercalc}

You can install {ethercalc} from the aforelinked source or via:

install.packages("ethercalc", repos = "https://cinc.rud.is")


where you’ll get a binary install for Windows and macOS (binary builds are for R 3.5.x but should also work for 3.6.x installs).

If you don’t want to drop to a command line interface to start EtherCalc you can use ec_start() to run one that will only be live during your R session.

Once you have EtherCalc running you’ll need to put the URL into an ETHERCALC_HOST environment variable. I recommend adding the following to ~/.Renviron and restarting your R session:

ETHERCALC_HOST=http://localhost:8000


(You’ll get an interactive prompt to provide this if you don’t have the environment variable setup.)

You can verify R can talk to your EtherCalc instance by executing ec_running() and reading the message or examining the (invisible) return value. Post a comment or file an issue (on your preferred social coding site) if you really think you’ve done everything right and still aren’t up and running by this point.

The use-case I setup in the previous blog post was to perform light data entry since scraping was both prohibited and would have taken more time given how the visualization was made. To start a new spreadsheet (remember, EtherCalc folks call these “rooms”), just do:

ec_new("for-blog")


And you should see this appear in your default web browser:

You can do ec_list() to see the names of all “saved” spreadsheets (ec_delete() can remove them, too).

We’ll type in the values from the previous post:

Now, to retrieve those values, we can do:

ec_read("for-blog", col_types="cii")
## # A tibble: 14 x 3
##
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12


That function takes any (relevant to this package use-case) parameter that readr::read_csv() takes (since it uses that under the hood to parse the object that comes back from the API call). If someone adds or modifies any values you can just call ec_read() again to retrieve them.

The ec_export() function lets you download the contents of the spreadsheet (“room”) to a local:

• CSV
• JSON
• HTML
• Markdown
• Excel

file (and it also returns the raw data directly to the R session). So you can do something like:

cat(rawToChar(ec_export("for-blog", "md", "~/Data/survey.md")))
## | ---- | ---- | ---- |
## |Health care|7|1|
## |Climate change|5|2|
## |Education|11|3|
## |Economics|6|4|
## |Science|10|7|
## |Technology|14|8|
## |National Security|1|5|
## |Politics|2|10|
## |Sports|3|14|
## |Immigration|4|6|
## |Arts & entertainment|8|13|
## |U.S. foreign policy|9|9|
## |Religion|12|12|


You can also append to a spreadsheet right from R. We’ll sort that data frame (to prove the append is working and I’m not fibbing) and append it to the existing sheet (this is a toy example, but imagine appending to an always-running EtherCalc instance as a data logger, which folks actually do IRL):

ec_read("for-blog", col_types="cii") %>%
dplyr::arrange(desc(topic)) %>%
ec_append("for-blog")


Note that you can open up EtherCalc to any existing spreadsheets (“rooms”) via ec_view() as well.

### FIN

It’s worth noting that EtherCalc appears to have a limit of around 500,000 “cells” per spreadsheet (“room”). I mention that since if you tried to, say, ec_edit(ggplot2movies::movies, "movies") you would very likely have crashed the running EtherCalc instance had I not coded some guide rails into that function and the ec_append() function to stop you from doing that. It’s a sane limit IMO and Google Sheets does something similar (per-tab) for similar reasons (and both limits are one reason I’m still against using a browser for “everything”, given the limitations of javascript wrangling of DOM elements).

If you’re doing work on large-ish data, spreadsheets in general aren’t the best tools.

And, while you should avoid hand-wrangling data at all costs, ec_edit() is a much faster and feature-rich alternative to R’s edit() function on most systems.

I’ve shown off most of the current functionality of the {ethercalc} package in this post. One function I’ve left out is ec_cmd() which lets you completely orchestrate all EtherCalc operations. It’s powerful enough, and the EtherCalc command structure is gnarly enough, that we’ll have to cover it in a separate post. Also, stay tuned for the aforementioned package-specific EtherCalc Docker image.

Kick the tyres, contribute issues and/or PRs as you’re moved (and on your preferred social coding site) and see if both EtherCalc and {ethercalc} might work for you in place of or along with Excel and/or Google Sheets.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Pharmacometrics meeting in Paris on the afternoon of 11 July 2019

Julie Bertrand writes:

The pharmacometrics group led by France Mentre (IAME, INSERM, Univ Paris) is very pleased to host a free ISoP Statistics and Pharmacometrics (SxP) SIG local event at Faculté Bichat, 16 rue Henri Huchard, 75018 Paris, on Thursday afternoon the 11th of July 2019.

It will feature talks from Professor Andrew Gelman, Columbia University (We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions) and Professor Rob Bies, University at Buffalo (A hybrid genetic algorithm for NONMEM structural model optimization).

We welcome all of you (please register here). Registration is capped at 70 attendees.

### Saturday Morning Videos: AutoML Workshop at ICML 2019

** Nuit Blanche is now on Twitter: @NuitBlog **

Katharina Eggensperger, Matthias Feurer, Frank Hutter, and Joaquin Vanschoren organized the AutoML workshop at ICML, and there are videos of the event that took place yesterday. Awesome! Here is the intro for the workshop:
Machine learning has achieved considerable successes in recent years, but this success often relies on human experts, who construct appropriate features, design learning architectures, set their hyperparameters, and develop new learning algorithms. Driven by the demand for off-the-shelf machine learning methods from an ever-growing community, the research area of AutoML targets the progressive automation of machine learning aiming to make effective methods available to everyone. The workshop targets a broad audience ranging from core machine learning researchers in different fields of ML connected to AutoML, such as neural architecture search, hyperparameter optimization, meta-learning, and learning to learn, to domain experts aiming to apply machine learning to new types of problems.

All the videos are here.

Bayesian optimization is a powerful and flexible tool for AutoML. While BayesOpt was first deployed for AutoML simply as a black-box optimizer, recent approaches perform grey-box optimization: they leverage capabilities and problem structure specific to AutoML such as freezing and thawing training, early stopping, treating cross-validation error minimization as multi-task learning, and warm starting from previously tuned models. We provide an overview of this area and describe recent advances for optimizing sampling-based acquisition functions that make grey-box BayesOpt significantly more efficient.
The mission of AutoML is to make ML available for non-ML experts and to accelerate research on ML. We have a very similar mission at fast.ai and have helped over 200,000 non-ML experts use state-of-the-art ML (via our research, software, & teaching), yet we do not use methods from the AutoML literature. I will share several insights we've learned through this work, with the hope that they may be helpful to AutoML researchers.

AutoML aims at automating the process of designing good machine learning pipelines to solve different kinds of problems. However, existing AutoML systems are mainly designed for isolated learning by training a static model on a single batch of data; while in many real-world applications, data may arrive continuously in batches, possibly with concept drift. This raises a lifelong machine learning challenge for AutoML, as most existing AutoML systems can not evolve over time to learn from streaming data and adapt to concept drift. In this paper, we propose a novel AutoML system for this new scenario, i.e. a boosting tree based AutoML system for lifelong machine learning, which won the second place in the NeurIPS 2018 AutoML Challenge.

In this talk I'll survey work by Google researchers over the past several years on the topic of AutoML, or learning-to-learn. The talk will touch on basic approaches, some successful applications of AutoML to a variety of domains, and sketch out some directions for future AutoML systems that can leverage massively multi-task learning systems for automatically solving new problems.

Recent advances in Neural Architecture Search (NAS) have produced state-of-the-art architectures on several tasks. NAS shifts the efforts of human experts from developing novel architectures directly to designing architecture search spaces and methods to explore them efficiently. The search space definition captures prior knowledge about the properties of the architectures and it is crucial for the complexity and the performance of the search algorithm. However, different search space definitions require restarting the learning process from scratch. We propose a novel agent based on the Transformer that supports joint training and efficient transfer of prior knowledge between multiple search spaces and tasks.
Neural architecture search (NAS) is a promising research direction that has the potential to replace expert-designed networks with learned, task-specific architectures. In order to help ground the empirical results in this field, we propose new NAS baselines that build off the following observations: (i) NAS is a specialized hyperparameter optimization problem; and (ii) random search is a competitive baseline for hyperparameter optimization. Leveraging these observations, we evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm on two standard NAS benchmarks: PTB and CIFAR-10. Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks. Additionally, random search with weight-sharing outperforms random search with early-stopping, achieving a state-of-the-art NAS result on PTB and a highly competitive result on CIFAR-10. Finally, we explore the existing reproducibility issues of published NAS results.
The practical work of deploying a machine learning system is dominated by issues outside of training a model: data preparation, data cleaning, understanding the data set, debugging models, and so on. What does it mean to apply ML to this “grunt work” of machine learning and data science? I will describe first steps towards tools in these directions, based on the idea of semi-automating ML: using unsupervised learning to find patterns in the data that can be used to guide data analysts. I will also describe a new notebook system for pulling these tools together: if we augment Jupyter-style notebooks with data-flow and provenance information, this enables a new class of data-aware notebooks which are much more natural for data manipulation.
Panel Discussion

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

### Question 15 of our Applied Regression final exam (and solution to question 14)

Here’s question 15 of our exam:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).
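The procedure above can be sketched as a simulation (a minimal Python stand-in, hedged: it uses a classical least-squares fit with estimate ± 2 standard errors in place of the posterior median ± 2 mad sd, and assumes a residual sd of 1, which the question leaves unspecified):

```python
import math
import random

random.seed(1)

def covers_once(n=100, a=2.0, b=3.0, sigma=1.0):
    # sigma is an assumption; the exam only says "a normal distribution".
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [a + b * xi + random.gauss(0, sigma) for xi in x]
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a_hat = ybar - b_hat * xbar
    rss = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return abs(b_hat - b) <= 2 * se_b   # does the +/- 2 se interval cover b?

# Repeat 1000 times and look at the empirical coverage.
coverage = sum(covers_once() for _ in range(1000)) / 1000
print(coverage)
```

Running this makes the claim in (a) directly checkable, and swapping in a bimodal error distribution does the same for (b).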

And the solution to question 14:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is, Pr(Pass) = logit^−1(a_j + 0.1x),
for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

(a) For an average classroom, the curve is invlogit(1 + 0.1x), which goes through the 50% point at x = -10, so the easiest way to draw the curve is to extend it outside the range of the data. In the graph itself, though, the x-axis should go from 0 to 50. Recall that invlogit(5) = 0.99, so the probability of passing reaches 99% when x reaches 40. From all this information, you can draw the curve.

(b) The 25th and 75th percentiles of the normal distribution are at the mean +/- 0.67 standard deviations. Thus, the 25th and 75th percentiles of the intercepts are 1 +/- 0.67*2, or -0.34 and 2.34, so the curves to draw are invlogit(-0.34 + 0.1x) and invlogit(2.34 + 0.1x). These are just shifted versions of the curve from (a), shifted by 1.34/0.1 = 13.4 to the left and the right.
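The key curve values can be checked numerically (illustrative Python, using only the numbers given above):

```python
import math

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Average classroom: Pr(Pass) = invlogit(1 + 0.1 * x)
p_at_0 = invlogit(1 + 0.1 * 0)    # probability at the lowest pre-test score
p_at_40 = invlogit(1 + 0.1 * 40)  # invlogit(5), ~0.99 as stated above

# 25th/75th percentile classrooms: intercepts 1 -/+ 0.67*2 = -0.34 and 2.34
p25_mid = invlogit(-0.34 + 0.1 * 25)
p75_mid = invlogit(2.34 + 0.1 * 25)
print(p_at_0, p_at_40, p25_mid, p75_mid)
```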

Common mistakes

Students didn’t always use the range of x. The most common bad answer was to just draw a logistic curve and then put some numbers on the axes.

A key lesson that I had not conveyed well in class: draw and label the axes first, then draw the curve.

### Accelerating the Nelder - Mead Method with Predictive Parallel Evaluation - implementation -


The Nelder–Mead (NM) method has been recently proposed for application in hyperparameter optimization (HPO) of deep neural networks. However, the NM method is not suitable for parallelization, which is a serious drawback for its practical application in HPO. In this study, we propose a novel approach to accelerate the NM method with respect to the parallel computing resources. The numerical results indicate that the proposed method is significantly faster and more efficient when compared with the previous naive approaches with respect to the HPO tabular benchmarks.
The attendant implementation is here.


### Magister Dixit

“Most companies think traditional Business Intelligence (BI) in which data is collected in warehouses, models are created based on business criteria and results are visualized through reports is sufficient. While this is true if your only concern is to answer basic questions like which customers are more profitable, it is not enough to deliver transformative business change like Data Science can. Data Science takes a different approach than BI in that insights and models are derived from the data through the application of statistical and mathematical techniques by Data Scientists. The data drives the modeling and insights. When you let the data guide you – you are less likely to try to use the data to support wrong predispositions or conclusions.” Kristen Paral ( August 23, 2014 )

### R Packages worth a look

Extensible Bootstrapped Split-Half Reliabilities (splithalfr)
Calculates scores and estimates bootstrapped split-half reliabilities for reaction time tasks and questionnaires. The ‘splithalfr’ can be extended with …

Quantile-Optimal Treatment Regimes with Censored Data (QTOCen)
Provides methods for estimation of mean- and quantile-optimal treatment regimes from censored data. Specifically, we have developed distinct functions …

Differential Risk Hotspots in a Linear Network (DRHotNet)
Performs the identification of differential risk hotspots given a marked point pattern (Diggle 2013) <doi:10.1201/b15326> lying on a linear netwo …

Themes for Base Graphics Plots (basetheme)
Functions to create and select graphical themes for the base plotting system. Contains: 1) several custom pre-made themes 2) mechanism for creating new …

### Distilled News

Machine learning can process data imperceptible to humans to produce expected results. These inconceivable patterns are inherent in the data but may make models vulnerable to adversarial attacks. How can developers harness these features to not lose control of AI?
Corporates are battling with technology giants and AI startups for the best and brightest AI talent. They are increasingly outsourcing their AI innovations to startups to ensure they do not get left behind in the race for AI competitive advantage. However, outsourcing presents real and new risks which corporates are often ill-equipped to identify and manage. There are real cultural barriers, implied risks, and questions that corporates should ask before partnering with any AI startup.

### Stabilising transformations: how do I present my results?

(This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers)

ANOVA is routinely used in applied biology for data analyses, although, in some instances, the basic assumptions of normality and homoscedasticity of residuals do not hold. In those instances, most biologists would be inclined to adopt some sort of stabilising transformations (logarithm, square root, arcsin square root…), prior to ANOVA. Yes, there might be more advanced and elegant solutions, but stabilising transformations are suggested in most traditional biometry books, they are very straightforward to apply and they do not require any specific statistical software. I do not think that this traditional technique should be underrated.

However, the use of stabilising transformations has one remarkable drawback, it may hinder the clarity of results. I’d like to give a simple, but relevant example.

# An example with counts

Consider the following dataset, that represents the counts of insects on 15 independent leaves, treated with the insecticides A, B and C (5 replicates):

dataset <- structure(data.frame(
Insecticide = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
.Label = c("A", "B", "C"), class = "factor"),
Count = c(448, 906, 484, 477, 634, 211, 276,
415, 587, 298, 50, 90, 73, 44, 26)),
.Names = c("Insecticide", "Count"))
dataset
##    Insecticide Count
## 1            A   448
## 2            A   906
## 3            A   484
## 4            A   477
## 5            A   634
## 6            B   211
## 7            B   276
## 8            B   415
## 9            B   587
## 10           B   298
## 11           C    50
## 12           C    90
## 13           C    73
## 14           C    44
## 15           C    26

We should not expect that a count variable is normally distributed with equal variances. Indeed, a graph of residuals against expected values shows clear signs of heteroscedasticity.

mod <- lm(Count ~ Insecticide, data=dataset)
plot(mod, which = 1)

In this situation, a logarithmic transformation is often suggested to produce a new normal and homoscedastic dataset. Therefore we take the log-transformed variable and submit it to ANOVA.

model <- lm(log(Count) ~ Insecticide, data=dataset)
print(anova(model), digits=6)
## Analysis of Variance Table
##
## Response: log(Count)
##             Df   Sum Sq Mean Sq F value     Pr(>F)
## Insecticide  2 15.82001 7.91000 50.1224 1.4931e-06 ***
## Residuals   12  1.89376 0.15781
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model)
##
## Call:
## lm(formula = log(Count) ~ Insecticide, data = dataset)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6908 -0.1849 -0.1174  0.2777  0.5605
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    6.3431     0.1777  35.704 1.49e-13 ***
## InsecticideB  -0.5286     0.2512  -2.104   0.0572 .
## InsecticideC  -2.3942     0.2512  -9.529 6.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3973 on 12 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.8753
## F-statistic: 50.12 on 2 and 12 DF,  p-value: 1.493e-06

In this example, the standard error for each mean (SEM) corresponds to $$\sqrt{0.158/5} = 0.178$$. In the end, we might show the following table of means for transformed data:

| Insecticide | Mean (log n.) |
|:---|---:|
| A | 6.343 |
| B | 5.815 |
| C | 3.949 |
| SEM | 0.178 |

Unfortunately, we lose clarity: how many insects did we have on each leaf? If we present a table like this one in our manuscript, we might be asked by our readers or by the reviewer to report the means in the original measurement unit. What should we do, then? Here are some suggestions.

1. We can present the means of the original data with standard deviations. This is clearly less than optimal if we want to convey more than the bare variability of the observed sample. Furthermore, please remember that the means of the original data may not be a good measure of central tendency if the original population is strongly ‘asymmetric’ (skewed)!
2. We can show back-transformed means. Accordingly, if we have done, e.g., a logarithmic transformation, we can exponentiate the means of transformed data and report them back to the original measurement unit. Back-transformed means ‘estimate’ the medians of the original populations, which may be regarded as better measures of central tendency for skewed data.

We suggest the use of the second method. However, this leaves us with the problem of attaching a measure of uncertainty to the back-transformed means. No worries: we can use the delta method to back-transform standard errors. It is straightforward:

1. take the first derivative of the back-transform function [in this case the first derivative of exp(X)=exp(X)] and
2. multiply it by the standard error of the transformed data.

This may simply be done by hand, e.g. $$\exp(6.343) \times 0.178 = 101.19$$ (for insecticide A). This ‘manual’ solution is always available, regardless of the statistical software at hand. With R, we can use the ‘emmeans’ package (Lenth, 2016):

library(emmeans)
countM <- emmeans(model, ~Insecticide, transform = "response")

It is enough to set the argument ‘transform’ to ‘response’, although the transformation must be embedded in the model. That is, it works if we coded the model as:

log(Count) ~ Insecticide

On the contrary, it fails if we coded the model as:

logCount ~ Insecticide

where the transformation was performed prior to fitting.
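The two-step recipe above can also be sketched directly in R. This is a minimal sketch of the manual delta-method computation, not the emmeans machinery; the log-scale means and the SEM are taken from the table of transformed means above:

```r
# Manual delta-method back-transformation (sketch)
logMeans <- c(A = 6.343, B = 5.815, C = 3.985)  # means on the log scale
SEM <- 0.178                                    # common SEM on the log scale

backMeans <- exp(logMeans)        # back-transformed means (estimate the medians)
backSEs   <- exp(logMeans) * SEM  # delta method: the derivative of exp(x) is exp(x)

round(backSEs, 2)  # the SE for insecticide A is close to 101.19, as computed by hand
```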

Obviously, the back-transformed standard error is different for each mean (there is no homogeneity of variances on the original scale, but we knew this already). Back-transformed data might be presented as follows:

| Insecticide | Mean  | SE     |
|-------------|-------|--------|
| A           | 568.5 | 101.19 |
| B           | 335.1 | 59.68  |
| C           | 51.88 | 9.57   |

It would be appropriate to state clearly (e.g. in a footnote) that means and SEs were obtained by back-transformation via the delta method. Far clearer, isn’t it? As I said, there are other solutions, such as fitting a GLM, but stabilising transformations are simple and easily accepted in biological journals.

If you want to know more about the delta method you might start from my post here. A few years ago, some colleagues and I also discussed these issues in a journal paper (Onofri et al., 2010).

Andrea Onofri
University of Perugia (Italy)

# References

1. Lenth, R.V., 2016. Least-Squares Means: The R Package lsmeans. Journal of Statistical Software 69. https://doi.org/10.18637/jss.v069.i01
2. Onofri, A., Carbonell, E.A., Piepho, H.-P., Mortimer, A.M., Cousens, R.D., 2010. Current statistical issues in Weed Research. Weed Research 50, 5–24.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Cricket’s increasing sizzle owes much to India

Scoring rates have surged in the short T20 format

### Exploring Categorical Data With Inspectdf

(This article was first published on Alastair Rushworth, and kindly contributed to R-bloggers)

#### What’s inspectdf and what’s it for?

I often find myself viewing and reviewing dataframes throughout the
course of an analysis, and a substantial amount of time can be spent
rewriting the same code to do this. inspectdf is an R package designed
to make common exploratory tools a bit more useful and easy to use.

In particular, it’s very useful to be able to quickly see the contents of
each column. This article illustrates the inspect_cat() function from
inspectdf for summarising and visualising categorical columns.

First of all, you’ll need to have the inspectdf package installed. You
can get it from github using

library(devtools)
install_github("alastairrushworth/inspectdf")


Then load the package in. We’ll also load dplyr for the starwars
data and for the pipe %>%.

library(inspectdf)
library(dplyr)

# check out the starwars help file
?starwars


#### Tabular summaries using inspect_cat()

The starwars data that comes bundled with dplyr has 7 columns of
character class, and is therefore a nice candidate for illustrating the
use of inspect_cat(). We can see this quickly using the
inspect_types() function from inspectdf.

starwars %>% inspect_types()

## # A tibble: 4 x 4
##   type        cnt  pcnt col_name
##
## 1 character     7 53.8
## 2 list          3 23.1
## 3 numeric       2 15.4
## 4 integer       1  7.69


Using inspect_cat() is very straightforward:

star_cat <- starwars %>% inspect_cat()
star_cat

## # A tibble: 7 x 5
##   col_name     cnt common common_pcnt levels
##
## 1 eye_color     15 brown        24.1
## 2 gender         5 male         71.3
## 3 hair_color    13 none         42.5
## 4 homeworld     49 Naboo        12.6
## 5 name          87 Ackbar        1.15
## 6 skin_color    31 fair         19.5
## 7 species       38 Human        40.2


So what does this tell us? Each row in the tibble returned from
inspect_cat() corresponds to each categorical column (factor,
logical or character) in the starwars dataframe.

• The cnt column tells you how many unique levels there are for each
column. For example, there are 15 unique entries in the eye_color
column.
• The common column prints the most commonly occurring entry. For
example, the most common eye_color is brown. The percentage
occurrence is 24.1% which is shown under common_pcnt.
• A full list of levels and occurrence frequency is provided in the
list column levels.

A table of relative frequencies of eye_color can be retrieved by
typing

star_cat$levels$eye_color

## # A tibble: 15 x 3
##    value           prop   cnt
##
##  1 brown         0.241     21
##  2 blue          0.218     19
##  3 yellow        0.126     11
##  4 black         0.115     10
##  5 orange        0.0920     8
##  6 red           0.0575     5
##  7 hazel         0.0345     3
##  8 unknown       0.0345     3
##  9 blue-gray     0.0115     1
## 10 dark          0.0115     1
## 11 gold          0.0115     1
## 12 green, yellow 0.0115     1
## 13 pink          0.0115     1
## 14 red, blue     0.0115     1
## 15 white         0.0115     1


There isn’t anything here that can’t be obtained by using the base
table() function with some post-processing. inspect_cat() automates
some of that functionality and wraps it into a single, convenient
function.
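For instance, the relative-frequency summary for a single column could be reproduced with base table() along these lines. This is a rough sketch of the kind of post-processing inspect_cat() automates, not the package's actual internals; dplyr is loaded only for the starwars data:

```r
library(dplyr)  # provides the starwars data

# Relative frequencies of eye_color, roughly what inspect_cat() tabulates
tab  <- sort(table(starwars$eye_color), decreasing = TRUE)
prop <- prop.table(tab)

names(tab)[1]              # most common level: "brown"
round(100 * prop[[1]], 1)  # its percentage occurrence: 24.1
```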

#### Visualising categorical columns with show_plot()

An important feature of inspectdf is the ability to visualise
dataframe summaries. Visualising categories can be challenging, because
categorical columns can be very rich and contain many unique levels. A
simple stacked barplot can be produced using show_plot()

star_cat %>% show_plot()


Like the star_cat tibble returned by inspect_cat(), each row of the
plot is a single column, split by the relative frequency of occurrence
of each unique entry.

• Some of the bars are labelled, but in cases where the bars are
small, the labels are not shown. If you encounter categorical
columns with really long strings, labels can be suppressed
altogether with show_plot(text_labels = FALSE).
• Missing values or NAs are shown as gray bars. In this case, there
are quite a few starwars characters whose homeworld is unknown or
missing.

#### Combining rare entries with show_plot()

Some of the categorical columns, like name, seem to have a lot of
unique entries. We should expect this – names are often unique (or
almost unique) in a small dataset. If we scaled this analysis up to a dataset
with millions of rows, there would be so many names with very small
relative frequencies that the name bars would be very difficult to see.
show_plot() can help with this too!

star_cat %>% show_plot(high_cardinality = 1)


By setting the argument high_cardinality = 1, all entries that occur
only once are combined into a single group labelled high cardinality.
This makes it easier to see when some entries occur only once (or
extremely rarely).

• In the above, it’s now obvious that no two people in the starwars
data share the same name, and that many come from a unique
homeworld or species.
• By setting high_cardinality = 2 or even greater, it’s possible to
group the ‘long-tail’ of rare categories even further. With larger
datasets, this becomes increasingly important for visualisation.
• A practical reason to combine rare entries is plotting speed – it
can take a long time to render a plot with tens of thousands (or
more) unique bars! Using the high_cardinality argument can reduce
this dramatically.
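The lumping idea itself is simple. Assuming dplyr is loaded for the starwars data, a base-R sketch mimicking high_cardinality = 1 might look like this (a sketch of the idea, not the package's implementation):

```r
library(dplyr)  # provides the starwars data

x <- starwars$name
counts <- table(x)

# Collapse entries that occur only once into a single group,
# mimicking show_plot(high_cardinality = 1)
x_lumped <- ifelse(x %in% names(counts)[counts <= 1], "high cardinality", x)

# Every name is unique, so all 87 entries collapse into one group
length(unique(x_lumped))  # 1
```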

#### Playing with color options in show_plot()

It’s been pointed out that the default ggplot color theme isn’t
particularly friendly to color-blind audiences. A more color-blind
friendly theme is available by specifying col_palette = 1:
star_cat %>% show_plot(col_palette = 1)


I’m also quite fond of the 80s theme by choosing col_palette = 2:

star_cat %>% show_plot(col_palette = 2)


There are 5 palettes at the moment, so have a play around. Note that the
color palettes have not yet hit the CRAN version of inspectdf – that
will come soon in an update, but for now you can get them from the
github version of the package using the code at the start of the
article.

Any feedback is welcome! Find me on twitter at rushworth_a or write a
github issue.


## June 14, 2019

### If you did not already know

Graph-Adaptive Pruning (GAP)
In this work, we propose a graph-adaptive pruning (GAP) method for efficient inference of convolutional neural networks (CNNs). In this method, the network is viewed as a computational graph, in which the vertices denote the computation nodes and edges represent the information flow. Through topology analysis, GAP is capable of adapting to different network structures, especially the widely used cross connections and multi-path data flow in recent novel convolutional models. The models can be adaptively pruned at vertex-level as well as edge-level without any post-processing, thus GAP can directly get practical model compression and inference speed-up. Moreover, it does not need any customized computation library or hardware support. Finetuning is conducted after pruning to restore the model performance. In the finetuning step, we adopt a self-taught knowledge distillation (KD) strategy by utilizing information from the original model, through which, the performance of the optimized model can be sufficiently improved, without introduction of any other teacher model. Experimental results show the proposed GAP can achieve promising result to make inference more efficient, e.g., for ResNeXt-29 on CIFAR10, it can get 13X model compression and 4.3X practical speed-up with marginal loss of accuracy. …

SlicStan
Stan is a probabilistic programming language that has been increasingly used for real-world scalable projects. However, to make practical inference possible, the language sacrifices some of its usability by adopting a block syntax, which lacks compositionality and flexible user-defined functions. Moreover, the semantics of the language has been mainly given in terms of intuition about implementation, and has not been formalised. This paper provides a formal treatment of the Stan language, and introduces the probabilistic programming language SlicStan — a compositional, self-optimising version of Stan. Our main contributions are: (1) the formalisation of a core subset of Stan through an operational density-based semantics; (2) the design and semantics of the Stan-like language SlicStan, which facilities better code reuse and abstraction through its compositional syntax, more flexible functions, and information-flow type system; and (3) a formal, semantic-preserving procedure for translating SlicStan to Stan. …

Truncated-Uniform-Laplace (Tulap)
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests can be written in terms of linear constraints, and for exchangeable data can always be expressed as a function of the empirical distribution. Using this structure, we prove a ‘Neyman-Pearson lemma’ for binomial data under DP, where the DP-UMP only depends on the sample sum. Our tests can also be stated as a post-processing of a random variable, whose distribution we coin ”Truncated-Uniform-Laplace” (Tulap), a generalization of the Staircase and discrete Laplace distributions. Furthermore, we obtain exact $p$-values, which are easily computed in terms of the Tulap random variable. Using the above techniques, we show that our tests can be applied to give uniformly most accurate one-sided confidence intervals and optimal confidence distributions. We also derive uniformly most powerful unbiased (UMPU) two-sided tests, which lead to uniformly most accurate unbiased (UMAU) two-sided confidence intervals. We show that our results can be applied to distribution-free hypothesis tests for continuous data. Our simulation results demonstrate that all our tests have exact type I error, and are more powerful than current techniques. …

Self Driving Data Curation
Past. Data curation – the process of discovering, integrating, and cleaning data – is one of the oldest data management problems. Unfortunately, it is still the most time consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution). Present. The power of current data curation solutions are not keeping up with the ever changing data ecosystem in terms of volume, velocity, variety and veracity, mainly due to the high human cost, instead of machine cost, needed for providing the ad-hoc solutions mentioned above. Meanwhile, deep learning is making strides in achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to understanding features that are neither domain-specific nor task-specific. Future. Data curation solutions need to keep the pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We will discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some low-hanging fruits about the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions. …

### Prioritizing technical debt as if time and money mattered

Adam Tornhill offers a new perspective on software development that will change how you view code.

Continue reading Prioritizing technical debt as if time and money mattered.

### From the trenches with Rebecca Parsons

Rebecca Parsons shares the story of her career path and her work as an architect.

Continue reading From the trenches with Rebecca Parsons.

### Choices of scale

Michael Feathers explores various scaling strategies in light of research about human cognition and systems cohesion.

### Book Memo: “Applied Data Science”

 Lessons Learned for the Data-Driven Business This book has two main goals: to define data science through the work of data scientists and their results, namely data products, while simultaneously providing the reader with relevant lessons learned from applied data science projects at the intersection of academia and industry. As such, it is not a replacement for a classical textbook (i.e., it does not elaborate on fundamentals of methods and principles described elsewhere), but systematically highlights the connection between theory, on the one hand, and its application in specific use cases, on the other.