My Data Science Blogs

June 18, 2019

Magister Dixit

“You can use an eraser on the drafting table or a sledgehammer on the construction site.” Frank Lloyd Wright


June 17, 2019

R Packages worth a look

Stan Models for the Pairwise Comparison Factor Model (pcFactorStan)
Provides convenience functions and pre-programmed Stan models related to the pairwise comparison factor model. Its purpose is to make fitting pairwise …

Spatial and Spatio-Temporal Bayesian Model for Circular Data (CircSpaceTime)
Implementation of Bayesian models for spatial and spatio-temporal interpolation of circular data using Gaussian Wrapped and Gaussian Projected distribu …

Complex Pearson Distributions (cpd)
Probability mass function, distribution function, quantile function and random generation for the Complex Triparametric Pearson (CTP) and Complex Bipar …

The Topic SCORE Algorithm to Fit Topic Models (TopicScore)
Provides implementation of the ‘Topic SCORE’ algorithm that is proposed by Tracy Ke and Minzhe Wang. The singular value decomposition step is optimized …


Le Monde puzzle [#1104]

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A palindromic Le Monde mathematical puzzle:

In a monetary system where every palindromic amount between 1 and 10⁸ has a coin, find the numbers less than 10³ that cannot be paid with fewer than three coins. Determine whether 20,191,104 can be paid with two coins and, similarly, whether 11,042,019 can be paid with two or three coins.

Which can be solved in a few lines of R code:

# palindromic coins below 1000: 1-9, the multiples of 11 up to 99, and the three-digit palindromes
coin=sort(c(1:9,(1:9)*11,outer(1:9*101,(0:9)*10,"+")))
# every amount reachable with one or two coins, restricted to amounts below 1000
amounz=sort(unique(c(coin,as.vector(outer(coin,coin,"+")))))
amounz=amounz[amounz<1e3]

and produces 10 amounts that cannot be paid with one or two coins. It is also easy to check that three coins are enough to cover all amounts below 10³. For the second question, starting from the palindrome n¹=20,188,102, a simple downward search of palindromic pairs (n¹,n²) such that n¹+n²=20,191,104 led to n¹=16,755,761 and n²=3,435,343. Starting with 11,033,011, the same search produces no two-coin solution for 11,042,019, while there are three coins such that n¹+n²+n³=11,042,019, for instance n¹=11,022,011, n²=20,002, and n³=6.
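For those who want to trace the search in code, here is a rough, unoptimized Python rendering of the downward pair search described above (the post's own solution is in R):

def is_palindrome(n):
    s = str(n)
    return s == s[::-1]

def two_coin_split(total):
    # scan the larger coin downward from the largest candidate below the total
    for a in range(total - 1, total // 2 - 1, -1):
        if is_palindrome(a) and is_palindrome(total - a):
            return a, total - a
    return None  # no pair of palindromic coins sums to the total

# for 20,191,104 the search finds a pair (the post reports 16,755,761 + 3,435,343);
# for 11,042,019 it finds none
print(two_coin_split(20191104))
print(two_coin_split(11042019))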



Facebook Research at CVPR 2019

The post Facebook Research at CVPR 2019 appeared first on Facebook Research.


Distilled News

Exploring Categorical Data With Inspectdf

I often find myself viewing and reviewing dataframes throughout the course of an analysis, and a substantial amount of time can be spent rewriting the same code to do this. inspectdf is an R package designed to make common exploratory tools a bit more useful and easy to use. In particular, it's very powerful to be able to quickly see the contents of categorical features. In this article, we'll summarise how to use the inspect_cat() function from inspectdf for summarising and visualising categorical columns.


Text Classification in Python

This article is the first of a series in which I will cover the whole process of developing a machine learning project. In this article we focus on training a supervised text classification model in Python. The motivation behind writing these articles is the following: as a learning data scientist who has been working with data science tools and machine learning models for a fair amount of time, I've found that many articles on the internet, in books, or in the literature in general strongly focus on the modeling part. That is, we are given a certain dataset (with the labels already assigned if it is a supervised learning problem), try several models and obtain a performance metric. And the process ends there.


A new Tool to your Toolkit, KL Divergence at Work

In my previous post, we got a thorough understanding of Entropy, Cross-Entropy, and KL-Divergence in an intuitive way and also by calculating their values through examples. In case you missed it, please go through it once before proceeding to the finale. In this post, we will apply these concepts and check the results in a real dataset. Also, it will give us good intuition on how to use these concepts in modeling various day-to-day machine learning problems. So, let’s get started.


A Language, not a Letter: Learning Statistics in R

This online collection of tutorials was created by graduate students in psychology as a resource for other experimental psychologists interested in using R for statistical analyses and graphics. Each chapter was created to provide an overview of how to code a particular topic in the R language. Who is this book for? This book was designed for psychologists already familiar with the statistics they need to utilize, but who have zero experience programming and working in R. Many of the authors of these tutorials had never used R prior to taking the course in which this collection of tutorials was created. In one semester, they were able to gain enough proficiency in R to independently create one of the tutorials included here.


What the Evidence Shows About the Impact of the GDPR After One Year

The General Data Protection Regulation (GDPR), the new privacy law for the European Union (EU), went into effect on May 25, 2018. One year later, there is mounting evidence that the law has not produced its intended outcomes; moreover, the unintended consequences are severe and widespread. This article documents the challenges associated with the GDPR, including the various ways in which the law has impacted businesses, digital innovation, the labor market, and consumers. Specifically, the evidence shows that the GDPR:
• Negatively affects the EU economy and businesses
• Drains company resources
• Hurts European tech startups
• Reduces competition in digital advertising
• Is too complicated for businesses to implement
• Fails to increase trust among users
• Negatively impacts users’ online access
• Is too complicated for consumers to understand
• Is not consistently implemented across member states
• Strains resources of regulators


modelDown is now on CRAN!

The modelDown package turns classification or regression models into static HTML websites. With one command you can convert one or more models into a website with visual and tabular model summaries, such as model performance, feature importance, single-feature response profiles, and basic model audits. modelDown uses DALEX explainers, so it is model agnostic (feel free to combine a random forest with a glm), easy to extend, and easy to parameterise.


How your smartphone tells your story: A dive into Android activity data

I was really excited when Google announced their Digital Wellbeing program, back in May 2018, especially Dashboard. It tracks all your app interactions on the phone and even helps you to limit app usage by setting time restrictions on different apps. But as of October 2018, Google still hasn’t rolled out that feature to all Android P users and is in beta even for Pixel users. So I decided to check out my own statistics with the data available at hand.


Causal vs. Statistical Inference

Causal inference, or the problem of causality in general, has received a lot of attention in recent years. The question is simple: is correlation enough for inference? I am going to state the following: the more informed uninformed person is going to pose an argument along the lines of "causation is nothing else than really strong correlation". I hate to break it to you if this is your opinion, but no, it is not; it most certainly is not. I can see that it is relatively easy to get convinced that it is, but once we start thinking about it a bit, we easily come to the realization that it is not. If you are still convinced otherwise after reading this article, please contact me for further discussion because I would be interested in your line of thought.


How To Run Jupyter Notebooks in the Cloud

When starting out in data science, DevOps tasks are the last thing you should be worrying about. Trying to master all (or most) aspects of data science requires a tremendous amount of time and practice. Nevertheless, if you should happen to attend a boot camp or some other type of school, it is very likely that you are going to have to complete group projects sooner or later. However, coordinating these without any DevOps knowledge can prove to be quite the challenge. How do we share code? How do we deal with very expensive computations? How do we make sure everyone is using the same environment? Questions like these can easily stall the progress of any data science project.


Code free Data Science with Microsoft Azure Machine Learning Studio

In recent weeks, months, and even years, a lot of tools have arisen that promise to make the field of data science more accessible. This isn't an easy task considering the complexity of most parts of the data science and machine learning pipeline. Nonetheless, many libraries and tools, including Keras, FastAI, and Weka, have made it significantly easier to create a data science project by providing an easy-to-use high-level interface and a lot of prebuilt components.


Creating New Scripts with StyleGAN

I applied StyleGAN to images of Unicode characters to see if it could invent new characters.


Exploratory Data Analysis Tutorial in Python

One of the most important skills that every Data Scientist must master is the ability to explore data properly. Thorough exploratory data analysis (EDA) is essential in order to ensure the integrity of your gathered data and performed analysis. The example used in this tutorial is an exploratory analysis of historical SAT and ACT data to compare participation and performance between SAT and ACT exams in different States. By the end of this tutorial, we will have gained data-driven insight into potential issues regarding standardized testing in the United States. The focus of this tutorial is to demonstrate the exploratory data analysis process, as well as provide an example for Python programmers who want to practice working with data. For this analysis, I examined and manipulated available CSV data files containing data about the SAT and ACT for both 2017 and 2018 in a Jupyter Notebook. Exploring data through well-constructed visualizations and descriptive statistics is a great way to become acquainted with the data you’re working with and formulate hypotheses based on your observations.


Hypothesis Testing – An Introduction

Hypotheses are our assumptions about the data, which may or may not be true. In this post we'll discuss the statistical process of evaluating whether a hypothesis is true – this process is known as hypothesis testing. Most statistical analysis has its genesis in comparing two types of distributions: the population distribution and the sample distribution. Let's understand these terms through an example. Suppose we want to statistically test our hypothesis that, on average, the performance of students in a standard aptitude test has improved in the last decade. We're given a dataset containing the marks (maximum marks = 250) of 100 randomly selected students who appeared in the exam in 2009 and 2019.


An ‘Equation-to-Code’ Machine Learning Project Walk-Through – Part 3 SGD

Detailed explanation to implement Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent from scratch in Python.


This AI Can Detect Image Manipulation, And Might Well Be Our Savior

Researchers from Adobe and UC Berkeley have unveiled an interesting way to combat the spread of image manipulation – using AI to spot edited photos. The AI was trained to recognize instances where the Face-Aware Liquify feature of Photoshop was used to edit images. The feature enables you to easily tweak and exaggerate facial features, for example, widening the eyes or literally turning a frown into a smile. This particular feature is popular when it comes to editing faces, and was chosen because the effects can be extremely subtle. The results were astonishing. While humans could spot the edits just 53% of the time (only a little over chance), the AI sometimes achieved results as high as 99%. Part of the reason the AI performs so much better than the human eye is that it can also access low-level image data, as opposed to simply relying on visual cues. So, why is this important?


Beginner’s Guide to BERT for Multi-classification Task

The purpose of this article is to provide a step-by-step tutorial on how to use BERT for a multi-classification task. BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations from Google, aimed at solving a wide range of Natural Language Processing tasks. The model is based on an unsupervised, deeply bidirectional system and achieved state-of-the-art results when it was first released to the public in 2018.


The W3H of AlexNet, VGGNet, ResNet, Inception

In this tutorial, I will quickly go through the details of four of the famous CNN architectures and how they differ from each other by explaining their W3H (When, Why, What and How).


A Comprehensive guide on handling Missing Values

Most real-world data contains missing values. They occur for many reasons, such as observations not being recorded or data corruption.


Detecting Bias with SHAP

StackOverflow’s annual developer survey concluded earlier this year, and they have graciously published the (anonymized) 2019 results for analysis. They’re a rich view into the experience of software developers around the world — what’s their favorite editor? how many years of experience? tabs or spaces? and crucially, salary. Software engineers’ salaries are good, and sometimes both eye-watering and news-worthy.

The tech industry is also painfully aware that it does not always live up to its purported meritocratic ideals. Pay isn’t a pure function of merit, and story after story tells us that factors like name-brand school, age, race, and gender have an effect on outcomes like salary.

Can machine learning do more than predict things? Can it explain salaries and so highlight cases where these factors might be undesirably causing pay differences? This example will sketch how standard models can be augmented with SHAP (SHapley Additive exPlanations) to detect individual instances whose predictions may be concerning, and then dig deeper into the specific reasons the data leads to those predictions.

Model Bias or Data (about) Bias?

While this topic is often characterized as detecting “model bias”, a model is merely a mirror of the data it was trained on. If the model is ‘biased’ then it learned that from the historical facts of the data. Models are not the problem per se; they are an opportunity to analyze data for evidence of bias.

Explaining models isn’t new, and most libraries can assess the relative importance of the inputs to a model. These are aggregate views of inputs’ effects. However, the output of some machine learning models has highly individual effects: is your loan approved? will you receive financial aid? are you a suspicious traveller?

Indeed, StackOverflow offers a handy calculator to estimate one’s expected salary, based on its survey. We can only speculate about how accurate the predictions are overall, but all that a developer particularly cares about is his or her own prospects.

The right question may not be "does the data suggest bias overall?" but rather "does the data show individual instances of bias?"

Assessing the Survey Data

The 2019 data is, thankfully, clean and free of data problems. It contains responses to 85 questions from about 88,000 developers.

This example focuses only on full-time developers. The data set contains plenty of relevant information, like years of experience, education, role, and demographic information. Notably, this data set doesn’t contain information about bonuses and equity, just salary.

It also has responses to wide-ranging questions about attitudes on blockchain, fizz buzz, and the survey itself. These are excluded here as unlikely to reflect the experience and skills that presumably should determine compensation. Likewise, for simplicity, it will also only focus on US-based developers.

The data needs a little more transformation before modeling. Several questions allow multiple responses, like “What are your greatest challenges to productivity as a developer?” These single questions yield multiple yes/no responses and need to be broken out into multiple yes/no features.

Some multiple-choice questions like “Approximately how many people are employed by the company or organization you work for?” afford responses like “2-9 employees”. These are effectively binned continuous values, and it may be useful to map them back to inferred continuous values like “2” so that the model may consider their order and relative magnitude. This translation is unfortunately manual and entails some judgment calls.
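As a rough illustration of the two ideas, here is a minimal pandas sketch; the column names and answer strings below are made up for the example, not the survey's actual schema:

import pandas as pd

# hypothetical raw survey columns, not the real schema
df = pd.DataFrame({
    "WorkChallenge": ["Meetings;Distracting work environment", "Meetings", None],
    "OrgSize": ["2-9 employees", "100 to 499 employees", "10,000 or more employees"],
})

# split a multi-select question into one yes/no column per possible answer
df = df.join(df["WorkChallenge"].str.get_dummies(sep=";").add_prefix("WorkChallenge_"))

# map binned ranges back to inferred continuous values (judgment calls)
df["OrgSize"] = df["OrgSize"].map({
    "2-9 employees": 2,
    "100 to 499 employees": 100,
    "10,000 or more employees": 10000,
})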

The Apache Spark code that can accomplish this is in the accompanying notebook, for the interested.

Model Selection with Apache Spark

With the data in a more machine-learning-friendly form, the next step is to fit a regression model that predicts salary from these features. The data set itself, after filtering and transformation with Spark, is a mere 4MB, containing 206 features from about 12,600 developers, and could easily fit in memory as a pandas DataFrame on your wristwatch, let alone a server.

xgboost, a popular gradient-boosted trees package, can fit a model to this data in minutes on a single machine, without Spark. xgboost offers many tunable “hyperparameters” that affect the quality of the model: maximum depth, learning rate, regularization, and so on. Rather than guess, simple standard practice is to try lots of settings of these values and pick the combination that results in the most accurate model.

Fortunately, this is where Spark comes back in. It can build hundreds of these models in parallel and collect the results of each. Because the data set is small, it’s simple to broadcast it to the workers, create a bunch of combinations of those hyperparameters to try, and use Spark to apply the same simple non-distributed xgboost code that could build a model locally to the data with each combination.

...
def train_model(params):
  # one hyperparameter combination in, one fitted xgboost model plus its metrics out
  (max_depth, learning_rate, reg_alpha, reg_lambda, gamma, min_child_weight) = params
  xgb_regressor = XGBRegressor(objective='reg:squarederror', max_depth=max_depth,\
    learning_rate=learning_rate, reg_alpha=reg_alpha, reg_lambda=reg_lambda, gamma=gamma,\
    min_child_weight=min_child_weight, n_estimators=3000, base_score=base_score,\
    importance_type='total_gain', random_state=0)
  # b_X_train, b_y_train, b_X_test, b_y_test are Spark broadcast variables holding the data
  xgb_model = xgb_regressor.fit(b_X_train.value, b_y_train.value,\
    eval_set=[(b_X_test.value, b_y_test.value)],\
    eval_metric='rmse', early_stopping_rounds=30)
  n_estimators = len(xgb_model.evals_result()['validation_0']['rmse'])
  y_pred = xgb_model.predict(b_X_test.value)
  mae = mean_absolute_error(y_pred, b_y_test.value)
  rmse = sqrt(mean_squared_error(y_pred, b_y_test.value))
  return (params + (n_estimators,), (mae, rmse), xgb_model)

...

# candidate values for each hyperparameter, spaced geometrically over a plausible range
max_depth =        np.unique(np.geomspace(3, 7, num=5, dtype=np.int32)).tolist()
learning_rate =    np.unique(np.around(np.geomspace(0.01, 0.1, num=5), decimals=3)).tolist()
reg_alpha =        [0] + np.unique(np.around(np.geomspace(1, 50, num=5), decimals=3)).tolist()
reg_lambda =       [0] + np.unique(np.around(np.geomspace(1, 50, num=5), decimals=3)).tolist()
gamma =            np.unique(np.around(np.geomspace(5, 20, num=5), decimals=3)).tolist()
min_child_weight = np.unique(np.geomspace(5, 30, num=5, dtype=np.int32)).tolist()

# randomly sample one combination per run (choice here is presumably random.choice)
parallelism = 128
param_grid = [(choice(max_depth), choice(learning_rate), choice(reg_alpha),\
  choice(reg_lambda), choice(gamma), choice(min_child_weight)) for _ in range(parallelism)]

# sc is the SparkContext; each task fits one xgboost model and returns its metrics
params_evals_models = sc.parallelize(param_grid, parallelism).map(train_model).collect()

That will create a lot of models. To track and evaluate the results, mlflow can log each one with its metrics and hyperparameters, and view them in the notebook’s Experiment. Here, one hyperparameter over many runs is compared to the resulting accuracy (mean absolute error):

The single model that showed the lowest error on the held-out validation data set is of interest. It yielded a mean absolute error of about $28,000 on salaries that average about $119,000. Not terrible, although we should realize the model can only explain most of the variation in salary.

Interpreting the xgboost Model

Although the model can be used to predict future salaries, instead, the question is what the model says about the data. What features seem to matter most when predicting salary accurately? The xgboost model itself computes a notion of feature importance:

import mlflow.sklearn
best_run_id = "..."
model = mlflow.sklearn.load_model("runs:/" + best_run_id + "/xgboost")
sorted((zip(model.feature_importances_, X.columns)), reverse=True)[:6]

Factors like years of professional coding experience, organization size, and using Windows are most "important". This is interesting, but hard to interpret. The values reflect relative, not absolute, importance; that is, the effect isn't measured in dollars. The definition of importance here (total gain) is also specific to how decision trees are built and is hard to map to an intuitive interpretation. The important features also don't necessarily correlate positively with salary.

More importantly, this is a ‘global’ view of how much features matter in aggregate. Factors like gender and ethnicity don’t show up on this list until farther along. This doesn’t mean these factors aren’t still significant. For one, features can be correlated, or interact. It’s possible that factors like gender correlate with other features that the trees selected instead, and this to some degree masks their effect.

The more interesting question is not so much whether these factors matter overall — it’s possible that their average effect is relatively small — but, whether they have a significant effect in some individual cases. These are the instances where the model is telling us something important about individuals’ experience, and to those individuals, that experience is what matters.

Applying SHAP for Developer-Level Explanations

Fortunately, a set of techniques for more theoretically sound model interpretation at the individual prediction level has emerged over the past five years or so. They are collectively known as "Shapley Additive Explanations", and conveniently, they are implemented in the Python package shap.

Given any model, this library computes "SHAP values" from the model. These values are readily interpretable, as each value is a feature's effect on the prediction, in its units. A SHAP value of 1000 here means "explained +$1,000 of predicted salary". SHAP values are also computed in a way that attempts to isolate the effects of correlation and interaction.

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X, y=y.values)

SHAP values are also computed for every input, not the model as a whole, so these explanations are available for each input individually. It can also estimate the effect of feature interactions separately from the main effect of each feature, for each prediction.

Explaining the Features’ Effects Overall

Developer-level explanations can be aggregated into explanations of the features' effects on salary over the whole data set by simply averaging their absolute values. SHAP's assessment of the overall most important features is similar:

The SHAP values tell a similar story. First, SHAP is able to quantify the effect on salary in dollars, which greatly improves the interpretation of the results. Above is a plot of the absolute effect of each feature on predicted salary, averaged across developers. Years of professional coding experience still dominates, explaining on average almost $15,000 of effect on salary.
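A plot of this kind can be produced directly from the SHAP values computed above, for example with shap's built-in summary plot (a one-line sketch reusing shap_values and X from the earlier snippet):

import shap

# mean absolute SHAP value per feature, shown as a bar chart
shap.summary_plot(shap_values, X, plot_type="bar")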

Examining the Effects of Gender with SHAP Values

We came to look specifically at the effects of gender, race, and other factors that presumably should not, per se, be predictive of salary at all. This example will examine the effect of gender, though this by no means suggests that it's the only, or most important, type of bias to look for.

Gender is not binary, and the survey recognizes responses of “Man”, “Woman”, and “Non-binary, genderqueer, or gender non-conforming” as well as “Trans” separately. (Note that while the survey also separately records responses about sexuality, these are not considered here.) SHAP computes the effect on predicted salary for each of these. For a male developer (identifying only as male), the effect of gender is not just the effect of being male, but of not identifying as female, transgender, and so on.

SHAP values let us read off the sum of these effects for developers identifying as each of the four categories:
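One way to read these sums off is sketched below; the gender column names are taken from the Spark snippet later in this post, and the rest reuses shap_values and X from above:

# gender-related one-hot columns (same names as in the Spark snippet further below)
gender_cols = ["Gender_Man", "Gender_Woman",
               "Gender_Non_binary__genderqueer__or_gender_non_conforming", "Trans"]
gender_idx = [X.columns.get_loc(c) for c in gender_cols]

# total gender-related SHAP effect per developer, in dollars
gender_effect = shap_values[:, gender_idx].sum(axis=1)

# range and mean of that effect for developers identifying with each category
for c in gender_cols:
    subset = gender_effect[X[c].values == 1]
    print(c, subset.min(), subset.mean(), subset.max())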

While male developers' gender explains a modest -$230 to +$890 with a mean of about $225, for female developers the range is wider, from about -$4,260 to -$690 with a mean of -$1,320. The results for transgender and non-binary developers are similar, though slightly less negative.

When evaluating what this means below, it’s important to recall the limitations of the data and model here:

  • Correlation isn’t causation; ‘explaining’ predicted salary is suggestive, but doesn’t prove, that a feature directly caused salary to be higher or lower
  • The model isn’t perfectly accurate
  • This is just 1 year of data, and only from US developers
  • This reflects only base salary, not bonuses or stock, which can vary more widely

Gender and Interacting Features

The SHAP library offers interesting visualizations that leverage its ability to isolate the effect of feature interactions. For example, the values above suggest that developers who identify as male are predicted to earn a slightly higher salary than others, but is there more to it? A dependence plot like this one can help:
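A plot like this can be generated with shap's dependence_plot; a sketch, assuming the male-identification column is named Gender_Man as in the Spark snippet later on:

import shap

# SHAP value of Gender_Man per developer, colored by the feature whose
# interaction with it is strongest (shap selects that feature automatically)
shap.dependence_plot("Gender_Man", shap_values, X, interaction_index="auto")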

Dots are developers. Developers at the left are those who don't identify as male, and at the right, those who do, which are predominantly those identifying only as male. (The points are randomly spread horizontally for clarity.) The y-axis is the SHAP value, or what identifying as male or not explains about predicted salary for each developer. As above, those not identifying as male show overall negative SHAP values that vary widely, while the others consistently show a small positive SHAP value.

What’s behind that variance? SHAP can select a second feature whose effect varies most given the value of, here, identifying as male or not.  It selects the answer “I work on what seems most important or urgent” to the question “How structured or planned is your work?”  Among developers identifying as male, those who answered this way (red points) appear to have slightly higher SHAP values. Among the rest, the effect is more mixed but seems to have generally lower SHAP values.

Interpretation is left to the reader, but perhaps: are male developers who feel empowered in this sense also enjoying slightly higher salaries, while other developers enjoy this where it goes hand in hand with lower-paying roles?

Exploring Instances with Outsized Gender Effects

What about investigating the developer whose salary is most negatively affected? Just as it’s possible to look at the effect of gender-related features overall, it’s possible to search for the developer whose gender-related features had the largest impact on predicted salary. This person is female, and the effect is negative. According to the model, she is predicted to earn about $4,260 less per year because of her gender:
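Locating that respondent is straightforward given the per-developer gender effect computed in the earlier sketch; shap's force_plot is its standard per-prediction visualization (gender_idx is reused from that sketch):

import numpy as np
import shap

# index of the respondent with the most negative total gender-related effect
worst = int(np.argmin(shap_values[:, gender_idx].sum(axis=1)))

# every feature's push on this developer's predicted salary, around the model's base value
shap.force_plot(explainer.expected_value, shap_values[worst, :], X.iloc[worst, :])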

The predicted salary, just over $157,000, is reasonably accurate in this case, as her actual reported salary is $150,000.

The three most positive and negative features influencing predicted salary are that she:

  • Has a college degree (only) (+$18,200)
  • Has 10 years professional experience (+$9,400)
  • Identifies as East Asian (+$9,100)
  • Works 40 hours per week (-$4,000)
  • Does not identify as male (-$4,250)
  • Works at a medium-sized org of 100-499 employees (-$9,700)

Given the magnitude of the effect on the predicted salary of not identifying as male, we might stop here and investigate the details of this case offline to gain a better understanding of the context around this developer and whether her experience, or salary, or both, need a change.

Explaining Interactions

There is more detail available within that -$4,260. SHAP can break down the effects of these features into interactions. The total effect of identifying as female on the prediction can be broken down into the effect of identifying as female and being an engineering manager, of identifying as female and working with Windows, and so on.

The effect on predicted salary explained by the gender factors per se only adds up to about -$630. Rather, SHAP assigns most of the effects of gender to interactions with other features:
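The snippet below refers to interaction_values and gender_feature_locs, which are computed in the accompanying notebook, presumably something along these lines (shap_interaction_values is shap's API for pairwise attributions; the column matching and the reuse of worst and gender_cols from the earlier sketches are assumptions):

# pairwise SHAP interaction values for that developer; shap_interaction_values
# returns an array of shape (n_rows, n_features, n_features)
interaction_values = explainer.shap_interaction_values(X.iloc[[worst]])[0]

# positions of the gender-related columns
gender_feature_locs = [X.columns.get_loc(c) for c in gender_cols]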

gender_interactions = interaction_values[gender_feature_locs].sum(axis=0)
max_c = np.argmax(gender_interactions)
min_c = np.argmin(gender_interactions)
print(X.columns[max_c])
print(gender_interactions[max_c])
print(X.columns[min_c])
print(gender_interactions[min_c])

DatabaseWorkedWith_PostgreSQL
110.64005
Ethnicity_East_Asian
-1372.6714

Identifying as female and working with PostgreSQL affects predicted salary slightly positively, whereas identifying as female and as East Asian affects predicted salary more negatively. Interpreting these values at this level of granularity is difficult in this context, but this additional level of explanation is available.

Applying SHAP with Apache Spark

SHAP values are computed independently for each row, given the model, and so this could have also been done in parallel with Spark. The following example computes SHAP values in parallel and similarly locates developers with outsized gender-related SHAP values:

import pandas as pd
from pyspark.sql import Row
from pyspark.sql.functions import abs, col

# drop the label column and compute SHAP values partition by partition
X_df = pruned_parsed_df.drop("ConvertedComp").repartition(16)
X_columns = X_df.columns

def add_shap(rows):
  # convert the partition to pandas, compute SHAP values, and emit one Row per respondent
  rows_pd = pd.DataFrame(rows, columns=X_columns)
  shap_values = explainer.shap_values(rows_pd.drop(["Respondent"], axis=1))
  return [Row(*([int(rows_pd["Respondent"][i])] + [float(f) for f in shap_values[i]]))
          for i in range(len(shap_values))]

shap_df = X_df.rdd.mapPartitions(add_shap).toDF(X_columns)

# sum the gender-related SHAP columns per respondent and keep the largest total effects
effects_df = shap_df.\
  withColumn("gender_shap", col("Gender_Woman") + col("Gender_Man") + col("Gender_Non_binary__genderqueer__or_gender_non_conforming") + col("Trans")).\
  select("Respondent", "gender_shap")
top_effects_df = effects_df.filter(abs(col("gender_shap")) >= 2500).orderBy("gender_shap")

Clustering SHAP values

Applying Spark is advantageous when there are a large number of predictions to assess with SHAP. Given that output, it’s also possible to use Spark to cluster the results with, for example, bisecting k-means:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans

# assemble the per-respondent SHAP columns into a single feature vector
# (to_review_df is assumed to hold the SHAP value columns being clustered)
assembler = VectorAssembler(inputCols=[c for c in to_review_df.columns if c != "Respondent"],\
  outputCol="features")
assembled_df = assembler.transform(shap_df).cache()

# cluster respondents by their SHAP values with bisecting k-means
clusterer = BisectingKMeans().setFeaturesCol("features").setK(50).setMaxIter(50).setSeed(0)
cluster_model = clusterer.fit(assembled_df)
transformed_df = cluster_model.transform(assembled_df).select("Respondent", "prediction")

The cluster whose total gender-related SHAP effects are most negative might bear some further investigation. What are the SHAP values of those respondents in the cluster? What do the members of the cluster look like with respect to the overall developer population?

# cluster 5 is presumably the cluster with the most negative gender-related effects;
# gender_cols is the list of gender-related column names used above
min_shap_cluster_df = transformed_df.filter("prediction = 5").\
  join(effects_df, "Respondent").\
  join(X_df, "Respondent").\
  select(gender_cols).groupBy(gender_cols).count().orderBy(gender_cols)
all_shap_df = X_df.select(gender_cols).groupBy(gender_cols).count().orderBy(gender_cols)
expected_ratio = transformed_df.filter("prediction = 5").count() / X_df.count()
# compare each gender combination's share of the cluster to its share of the population
display(min_shap_cluster_df.join(all_shap_df, on=gender_cols).\
  withColumn("ratio", (min_shap_cluster_df["count"] / all_shap_df["count"]) / expected_ratio).\
  orderBy("ratio"))

Developers identifying as female (only) are represented in this cluster at almost 2.8x the rate of the overall developer population, for example. This isn’t surprising given the earlier analysis. This cluster could be further investigated to assess other factors specific to this group that contribute to overall lower predicted salary.

Conclusion

This type of analysis with SHAP can be run for any model, and at scale too. As an analytical tool, it turns models into data detectives, surfacing individual instances whose predictions suggest that they deserve more examination. The output of SHAP is easily interpretable and yields intuitive plots that can be assessed case by case by business users.

Of course, this analysis isn’t limited to examining questions of gender, age or race bias. More prosaically, it could be applied to customer churn models. There, the question is not just “will this customer churn?” but “why is the customer churning?” A customer who is canceling due to price may be offered a discount, while one canceling due to limited usage might need an upsell.

Finally, this analysis can be run as part of a model validation process. Model validation often focuses on the overall accuracy of a model. It should also focus on the model’s ‘reasoning’, or what features contributed most to the predictions. With SHAP, it can also help detect when too many individual predictions’ explanations are at odds with overall feature importance.


The post Detecting Bias with SHAP appeared first on Databricks.


Data-driven to Model-driven: The Strategic Shift Being Made by Leading Organizations

You can have all the data you want, do all the machine learning you want, but if you aren’t running your business on models, you’ll soon be left behind. In this webinar, we will demystify the model-driven business.


Data Science Jobs Report 2019: Python Way Up, TensorFlow Growing Rapidly, R Use Double SAS

Data science jobs continue to grow in 2019, and this report shares the change and spread of jobs by software over recent years.


Top Stories, Jun 10 – 16: Best resources for developers transitioning into data science; 5 Useful Statistics Data Scientists Need to Know

The Infinity Stones of Data Science; What you need to know about the Modern Open-Source Data Science ecosystem; Scalable Python Code with Pandas UDFs: A Data Science Application; Become a Pro at Pandas


Random Search and Reproducibility for Neural Architecture Search

Neural architecture search (NAS) is a promising research direction that has the potential to replace expert-designed networks with learned, task-specific architectures. In order to help ground the empirical results in this field, we propose new NAS baselines that build off the following observations: (i) NAS is a specialized hyperparameter optimization problem; and (ii) random search is a competitive baseline for hyperparameter optimization. Leveraging these observations, we evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm on two standard NAS benchmarks: PTB and CIFAR-10. Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks. Additionally, random search with weight-sharing outperforms random search with early-stopping, achieving a state-of-the-art NAS result on PTB and a highly competitive result on CIFAR-10. Finally, we explore the existing reproducibility issues of published NAS results.

An implementation of the paper is at: https://github.com/liamcli/randomNAS_release

The code base requires the following additional repositories:



What should we do with “legacy” Java 8 applications?

Java is a mature programming language. It was improved over many successive versions. Mostly, new Java versions did not break your code. Thus Java was a great, reliable platform.

For some reason, the Oracle engineers decided to break things after Java 8. You cannot “just” upgrade from Java 8 to the following versions. You have to update your systems, sometimes in a significant way.

For management purposes, my employer uses an ugly Java application, launched by browsers via something called Java Web Start. I am sure that my employer's application was very modern when it launched, but it is now tragically old and ugly. Oracle ended public maintenance of Java 8 in January and may stop making Java 8 publicly available at the end of 2020. Yet my employer's application won't work with anything beyond Java 8.

Java on the desktop is not ideal. For business applications, you are much better off with a pure Web application: it is easier to maintain and secure, and it is more portable. Our IT staff knows this; they are not idiots. They are preparing a Web equivalent that should launch… some day. But it is complicated. They do not have infinite budgets, and there are many stakeholders.

What do we do while something more modern is being built?

If you are a start-up, you can just switch to the open-source version of Java 8 like OpenJDK. But we are part of a large organization. We want to rely on supported software: doing otherwise would be irresponsible.

So what do we do?

I think that their current plan is just to stick with Java 8. They have an Oracle license, so they can keep on installing Java 8 on PCs even if Oracle pulls the plug.

But is that wise?

I think that a better solution would be to switch to Amazon Corretto. Amazon recruited James Gosling, Java's inventor. It feels like the future of Java may be moving into Amazon's hands.

Update: RedHat is offering paid support for OpenJDK 8.


How to Use Python’s datetime

Python's datetime package is a convenient set of tools for working with dates and times. With just the five tricks that I’m about to show you, you can handle most of your datetime processing needs.


Online/Incremental Learning with Keras and Creme

In this tutorial, you will learn how to perform online/incremental learning with Keras and Creme on datasets too large to fit into memory.

A few weeks ago I showed you how to use Keras for feature extraction and online learning — we used that tutorial to perform transfer learning and recognize classes the original CNN was never trained on.

To accomplish that task we needed to use Keras to train a very simple feedforward neural network on the features extracted from the images.

However, what if we didn’t want to train a neural network?

What if we instead wanted to train a Logistic Regression, Naive Bayes, or Decision Tree model on top of the data? Or what if we wanted to perform feature selection or feature processing before training such a model?

You may be tempted to use scikit-learn — but you’ll soon realize that scikit-learn does not treat incremental learning as a “first-class citizen” — only a few online learning implementations are included in scikit-learn and they are awkward to use, to say the least.

Instead, you should use Creme, which:

  • Implements a number of popular algorithms for classification, regression, feature selection, and feature preprocessing.
  • Has an API similar to scikit-learn.
  • And makes it super easy to perform online/incremental learning.

In the remainder of this tutorial I will show you how to:

  1. Use Keras + pre-trained CNNs to extract robust, discriminative features from an image dataset.
  2. Utilize Creme to perform incremental learning on a dataset too large to fit into RAM.

Let’s get started!

To learn how to perform online/incremental learning with Keras and Creme, just keep reading!


Online/Incremental Learning with Keras and Creme

In the first part of this tutorial, we’ll discuss situations where we may want to perform online learning or incremental learning.

We’ll then discuss why the Creme machine learning library is the appropriate choice for incremental learning.

We’ll be using Kaggle’s Dogs vs. Cats dataset for this tutorial — we’ll spend some time briefly reviewing the dataset.

From there, we'll take a look at the directory structure for the project.

Once we understand the directory structure, we’ll implement a Python script that will be used to extract features from the Dogs vs. Cats dataset using Keras and a CNN pre-trained on ImageNet.

Given our extracted features (which will be too big to fit into RAM), we’ll use Creme to train a Logistic Regression model in an incremental learning fashion, ensuring that:

  1. We can still train our classifier, despite the extracted features being too large to fit into memory.
  2. We can still obtain high accuracy, even though we didn’t have access to “all” features at once.

Why Online Learning/Incremental Learning?

Figure 1: Multi-class incremental learning with Creme allows for machine learning on datasets which are too large to fit into memory (image source).

Whether you’re working with image data, text data, audio data, or numerical/categorical data, you’ll eventually run into a dataset that is too large to fit into memory.

What then?

  • Do you head to Amazon, NewEgg, etc. and purchase an upgraded motherboard with maxed out RAM?
  • Do you spin up a high memory instance on a cloud service provider like AWS or Azure?

You could look into one of those options — and in some cases, they are totally reasonable avenues to explore.

But my first choice would be to apply online/incremental learning.

Using incremental learning you can work with datasets too large to fit into RAM and apply popular machine learning techniques, including:

  • Feature preprocessing
  • Feature selection
  • Classification
  • Regression
  • Clustering
  • Ensemble methods
  • …and more!

Incremental learning can be super powerful — and today you’ll learn how to apply it to your own data.

Why Creme for Incremental Learning?

Figure 2: Creme is a library specifically tailored to incremental learning. The API is similar to scikit-learn's, which will make you feel at home while putting it to work on large datasets where incremental learning is required.

Neural networks and deep learning are a form of incremental learning — we can train such networks on one sample or one batch at a time.

However, just because we can apply neural networks to a problem doesn’t mean we should.

Instead, we need to bring the right tool to the job. Just because you have a hammer in your hand doesn’t mean you would use it to bang in a screw.

Incremental learning algorithms encompass a set of techniques used to train models in an incremental fashion.

We often utilize incremental learning when a dataset is too large to fit into memory.

The scikit-learn library does include a small handful of online learning algorithms, however:

  1. It does not treat incremental learning as a first-class citizen.
  2. The implementations are awkward to use.
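For example, a handful of scikit-learn estimators do expose a partial_fit method for out-of-core training, but you have to manage the batching yourself and, for classifiers, declare all classes up front. A minimal sketch (load_batches is a placeholder generator, not a real function):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log")  # logistic regression trained with SGD

# feed the data in chunks; the full label set must be supplied on the first call
for X_batch, y_batch in load_batches():
    clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))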

Enter the Creme library — a library exclusively dedicated to incremental learning with Python.

The library itself is fairly new but last week I had some time to hack around with it.

I really enjoyed the experience and found the scikit-learn inspired API very easy to use.

After going through the rest of this tutorial, I think you’ll agree with me when I say, Creme is a great little library and I wish the developers and maintainers all the best with it — I hope that the library continues to grow.

The Dogs vs. Cats Dataset

Figure 3: In today’s example, we’re using Kaggle’s Dogs vs. Cats dataset. We’ll extract features with Keras producing a rather large features CSV. From there, we’ll apply incremental learning with Creme.

The dataset we’ll be using here today is Kaggle’s Dogs vs. Cats dataset.

The dataset includes 25,000 examples, evenly distributed:

  • Dogs: 12,500 images
  • Cats: 12,500 images

Our goal is to apply transfer learning to:

  1. Extract features from the dataset using Keras and a pre-trained CNN.
  2. Use online/incremental learning via Creme to train a classifier on top of the features in an incremental fashion.

Setting up your Creme environment

While Creme requires a simple pip install, we have some other packages to install for today’s example too. Today’s required packages include:

  1. imutils and OpenCV (a dependency of imutils)
  2. scikit-learn
  3. TensorFlow
  4. Keras
  5. Creme

First, head over to my pip install opencv tutorial to install OpenCV in a Python virtual environment. The OpenCV installation instructions suggest an environment named cv, but you can name yours whatever you'd like.

From there, install the rest of the packages in your environment:

$ workon cv
$ pip install imutils
$ pip install scikit-learn
$ pip install tensorflow # or tensorflow-gpu
$ pip install keras
$ pip install creme

Let’s ensure everything is properly installed by launching a Python interpreter:

$ workon cv
$ python
>>> import cv2
>>> import imutils
>>> import sklearn
>>> import keras
Using TensorFlow backend.
>>> import creme
>>>

Provided that there are no errors, your environment is ready for incremental learning.

Project Structure

Figure 4: Download train.zip from the Kaggle Dogs vs. Cats downloads page for this incremental learning with Creme project.

To set up your project, please follow these steps:

  1. Use the “Downloads” section of this blog post and follow the instructions to download the code.
  2. Download the code to somewhere on your system. For example, you could download it to your ~/Desktop or ~/Downloads folder.
  3. Open a terminal and cd into the same folder where the zip resides. Unzip/extract the files via unzip keras-creme-incremental-learning.zip. Keep your terminal open.
  4. Log into Kaggle (required for downloading data).
  5. Head to the Kaggle Dogs vs. Cats “Data” page.
  6. Click the little download button next to only the train.zip file. Save it into ~/Desktop/keras-creme-incremental-learning/ (or wherever you extracted the blog post files).
  7. Back in your terminal, extract the dataset via unzip train.zip.

Now let’s review our project structure:

$ tree --dirsfirst --filelimit 10
.
├── train [25000 entries]
├── train.zip
├── features.csv
├── extract_features.py
└── train_incremental.py

1 directory, 4 files

You should see a train/ directory with 25,000 files. This is where your actual dog and cat images reside. Let's list a handful of them:

$ ls train | sort -R | head -n 10
dog.271.jpg
cat.5399.jpg
dog.3501.jpg
dog.5453.jpg
cat.7122.jpg
cat.2018.jpg
cat.2298.jpg
dog.3439.jpg
dog.1532.jpg
cat.1742.jpg

As you can see, the class label (either “cat” or “dog”) is included in the first few characters of the filename. We’ll parse the class name out later.

Back to our project tree: alongside the train/ directory are train.zip and features.csv. These files are not included with the “Downloads”. You should have already downloaded and extracted train.zip from Kaggle's website. We will learn how to extract features and generate the large 12GB+ features.csv file in the next section.

The two Python scripts we'll be reviewing are extract_features.py and train_incremental.py. Let's begin by extracting features with Keras!

Extracting Features with Keras

Before we can perform incremental learning, we first need to perform transfer learning and extract features from our Dogs vs. Cats dataset.

To accomplish this task, we’ll be using the Keras deep learning library and the ResNet50 network (pre-trained on ImageNet). Using ResNet50, we’ll allow our images to forward propagate to a pre-specified layer.

We’ll then take the output activations of that layer and treat them as a feature vector. Once we have feature vectors for all images in our dataset, we’ll then apply incremental learning.

Let’s go ahead and get started.

Open up the extract_features.py file and insert the following code:

# import the necessary packages
from sklearn.preprocessing import LabelEncoder
from keras.applications import ResNet50
from keras.applications import imagenet_utils
from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img
from imutils import paths
import numpy as np
import argparse
import pickle
import random
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
	help="path to input dataset")
ap.add_argument("-c", "--csv", required=True,
	help="path to output CSV file")
ap.add_argument("-b", "--batch-size", type=int, default=32,
	help="batch size for the network")
args = vars(ap.parse_args())

On Lines 2-12, all the packages necessary for extracting features are imported. Most notably this includes ResNet50. ResNet50 is the convolutional neural network (CNN) we are using for transfer learning (Line 3).

Three command line arguments are then parsed via Lines 15-22:

  • --dataset : The path to our input dataset (i.e. Dogs vs. Cats).
  • --csv : File path to our output CSV file.
  • --batch-size : By default, we'll use a batch size of 32. This will accommodate most CPUs and GPUs.

Let’s go ahead and load our model:

# load the ResNet50 network and store the batch size in a convenience
# variable
print("[INFO] loading network...")
model = ResNet50(weights="imagenet", include_top=False)
bs = args["batch_size"]

On Line 27, we load the model while specifying two parameters:

  • weights="imagenet" : Pre-trained ImageNet weights are loaded for transfer learning.
  • include_top=False : We do not include the fully-connected head with the softmax classifier. In other words, we chop off the head of the network.

With weights loaded, and by loading our model without the head, we are now ready for transfer learning. We will use the output values of the network directly, storing the results as feature vectors.

Our feature vectors will each be 100,352-dim (i.e. 7 x 7 x 2048 which are the dimensions of the output volume of ResNet50 without the FC layer header).

From here, let's grab our imagePaths and extract our labels:

# grab all image paths in the input directory and randomly shuffle
# the paths
imagePaths = list(paths.list_images(args["dataset"]))
random.seed(42)
random.shuffle(imagePaths)

# extract the class labels from the image paths, then encode the
# labels
labels = [p.split(os.path.sep)[-1].split(".")[0] for p in imagePaths]
le = LabelEncoder()
labels = le.fit_transform(labels)

On Lines 32-34, we proceed to grab all imagePaths and randomly shuffle them.

From there, our class labels are extracted from the paths themselves (Line 38). Each image path has the format:

  • train/cat.0.jpg
  • train/dog.0.jpg
  • etc.

In a Python interpreter, we can test Line 38 for sanity. As you develop the parsing + list comprehension, your interpreter might look like this:

$ python
>>> import os
>>> label = "train/cat.0.jpg".split(os.path.sep)[-1].split(".")[0]
>>> label
'cat'
>>> imagePaths = ["train/cat.0.jpg", "train/dog.0.jpg", "train/dog.1.jpg"]
>>> labels = [p.split(os.path.sep)[-1].split(".")[0] for p in imagePaths]
>>> labels
['cat', 'dog', 'dog']
>>>

Lines 39 and 40 then instantiate and fit our label encoder, ensuring we can convert the string class labels to integers.

Let’s define our CSV columns and write them to the file:

# define our set of columns
cols = ["feat_{}".format(i) for i in range(0, 7 * 7 * 2048)]
cols = ["class"] + cols

# open the CSV file for writing and write the columns names to the
# file
csv = open(args["csv"], "w")
csv.write("{}\n".format(",".join(cols)))

We’ll be writing our extracted features to a CSV file.

The Creme library requires that the CSV file has a header and includes a name for each of the columns, namely:

  1. The name of the column for the class label
  2. A name for each of the features

Line 43 creates column names for each of the 7 x 7 x 2048 = 100,352 features while Line 44 defines the class name column (which will store the class label).

Thus, the first five rows and ten columns of our CSV file will look like this:

$ head -n 5 features.csv | cut -d ',' -f 1-10
class,feat_0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Notice how the class is the first column. Then the columns span from feat_0 all the way to feat_100351 for a total of 100,352 features. If you edit the command to print more than 10 columns — say 5,000 — then you'll see that not all the values are 0.

Moving on, let’s proceed to loop over the images in batches:

# loop over the images in batches
for (b, i) in enumerate(range(0, len(imagePaths), bs)):
	# extract the batch of images and labels, then initialize the
	# list of actual images that will be passed through the network
	# for feature extraction
	print("[INFO] processing batch {}/{}".format(b + 1,
		int(np.ceil(len(imagePaths) / float(bs)))))
	batchPaths = imagePaths[i:i + bs]
	batchLabels = labels[i:i + bs]
	batchImages = []

We'll loop over imagePaths in batches of size bs (Line 52).

Lines 58 and 59 then grab the batch of paths and labels, while Line 60 initializes a list to hold the batch of images.

Let’s loop over the current batch:

# loop over the images and labels in the current batch
	for imagePath in batchPaths:
		# load the input image using the Keras helper utility while
		# ensuring the image is resized to 224x224 pixels
		image = load_img(imagePath, target_size=(224, 224))
		image = img_to_array(image)

		# preprocess the image by (1) expanding the dimensions and
		# (2) subtracting the mean RGB pixel intensity from the
		# ImageNet dataset
		image = np.expand_dims(image, axis=0)
		image = imagenet_utils.preprocess_input(image)

		# add the image to the batch
		batchImages.append(image)

Looping over paths in the batch (Line 63), we will load each image, preprocess it, and gather it into batchImages. The image itself is loaded on Line 66.

We’ll preprocess the image by:

  • Resizing to 224×224 pixels via the target_size parameter on Line 66.
  • Converting to array format (Line 67).
  • Adding a batch dimension (Line 72).
  • Performing mean subtraction (Line 73).

Note: If these preprocessing steps appear foreign, please refer to Deep Learning for Computer Vision with Python where I cover them in detail.

Finally, the image is added to the batch via Line 76.

In order to extract features, we’ll now pass the batch of images through our network:

# pass the images through the network and use the outputs as our
	# actual features, then reshape the features into a flattened
	# volume
	batchImages = np.vstack(batchImages)
	features = model.predict(batchImages, batch_size=bs)
	features = features.reshape((features.shape[0], 7 * 7 * 2048))

	# loop over the class labels and extracted features
	for (label, vec) in zip(batchLabels, features):
		# construct a row that consists of the class label and extracted
		# features
		vec = ",".join([str(v) for v in vec])
		csv.write("{},{}\n".format(label, vec))

# close the CSV file
csv.close()

Our batch of images is sent through the network via Lines 81 and 82. 

Keep in mind that we have removed the fully-connected head layer of the network. Instead, the forward propagation stops prior to the average pooling layer. We will treat the output of this layer as a list of features, also known as a “feature vector”.

The output dimension of the volume is (batch_size, 7 x 7 x 2048). We can thus reshape the features into a NumPy array of shape (batch_size, 7 * 7 * 2048), treating the output of the CNN as a feature vector.

Maintaining our batch efficiency, the features and associated class labels are written to our CSV file (Lines 86-90).

Inside the CSV file, the class label is the first field in each row (enabling us to easily extract it from the row during training). The feature vec follows.

The features CSV file is closed via Line 93, as the last step of our script.

Applying feature extraction with Keras

Now that we’ve coded up extract_features.py, let’s apply it to our dataset.

Make sure you have:

  1. Used the “Downloads” section of this tutorial to download the source code.
  2. Downloaded the Dogs vs. Cats dataset from Kaggle’s website.

Open up a terminal and execute the following command:

$ python extract_features.py --dataset train --csv features.csv
Using TensorFlow backend.
[INFO] loading network...
[INFO] processing batch 1/782
[INFO] processing batch 2/782
[INFO] processing batch 3/782
...
[INFO] processing batch 780/782
[INFO] processing batch 781/782
[INFO] processing batch 782/782

Using an NVIDIA K80 GPU the entire feature extraction process took 20m45s.

You could also use your CPU but keep in mind that the feature extraction process will take much longer.

After your script finishes running, take a look at the output size of features.csv:

$ ls -lh features.csv 
-rw-rw-r-- 1 ubuntu ubuntu 12G Jun  10 11:16 features.csv

The resulting file is over 12GB!

And if we were to load that file into RAM, assuming 32-bit floats for the feature vectors, we would need 10.03GB!
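
That figure is easy to sanity-check with a quick back-of-the-envelope calculation (25,000 images, 100,352 features each, 4 bytes per 32-bit float):

# rough memory estimate for holding every extracted feature vector in RAM
num_images = 25000
feats_per_image = 7 * 7 * 2048     # 100,352 features per image
bytes_per_float = 4                # 32-bit floats

print(num_images * feats_per_image * bytes_per_float / 1e9)   # 10.0352 decimal GB, in line with the ~10.03GB figure above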

Your machine may or may not have that much RAM…but that’s not the point. Eventually, you will encounter a dataset that is too large for you to work with in main memory. When that time comes, you need to use incremental learning.

Incremental Learning with Creme

If you’re at this point in the tutorial then I will assume you have extracted features from the Dogs vs. Cats dataset using Keras and ResNet50 (pre-trained on ImageNet).

But what now?

We’ve made the assumption that the entire dataset of extracted feature vectors is too large to fit into memory. How can we train a machine learning classifier on that data?

Open up the train_incremental.py file and let’s find out:

# import the necessary packages
from creme.linear_model import LogisticRegression
from creme.multiclass import OneVsRestClassifier
from creme.preprocessing import StandardScaler
from creme.compose import Pipeline
from creme.metrics import Accuracy
from creme import stream
import argparse

Lines 2-8 import packages required for incremental learning with Creme. We’ll be taking advantage of Creme’s implementation of LogisticRegression. Creme’s stream module includes a super convenient CSV data generator. Throughout training, we’ll calculate and print out our current Accuracy with Creme’s built-in metrics tool.

Let’s now use argparse to parse our command line arguments:

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--csv", required=True,
	help="path to features CSV file")
ap.add_argument("-n", "--cols", type=int, required=True,
	help="# of feature columns in the CSV file (excluding class column)")
args = vars(ap.parse_args())

Our two command line arguments include:

  • --csv: The path to our input CSV features file.
  • --cols: Dimensions of our feature vector (i.e. how many columns there are in our feature vector).

Now that we’ve parsed our command line arguments, we need to specify the data types of our CSV file to use Creme’s stream module properly:

# construct our data dictionary which maps the data types of the
# columns in the CSV file to built-in data types
print("[INFO] building column names...")
types = {"feat_{}".format(i): float for i in range(0, args["cols"])}
types["class"] = int

Line 21 builds a dictionary of data types (floats) for every feature column of our CSV. We will have 100,352 floats.

Similarly, Line 22 specifies that our class column is an integer type.

Next, let’s initialize our data generator and construct our pipeline:

# create a CSV data generator for the extracted Keras features
dataset = stream.iter_csv(args["csv"], target_name="class", types=types)

# construct our pipeline
model = Pipeline([
	("scale", StandardScaler()),
	("learn", OneVsRestClassifier(
		binary_classifier=LogisticRegression()))
])

Line 25 creates a CSV iterator that will stream features + class labels to our model.

Lines 28-32 then construct the model pipeline, which:

  • First performs standard scaling (scales data to have zero mean and unit variance).
  • Then trains our Logistic Regression model in an incremental fashion (one data point at a time).

Logistic Regression is a binary classifier meaning that it can be used to predict only two classes (which is exactly what the Dogs vs. Cats dataset is).

However, if you want to recognize > 2 classes, you need to wrap LogisticRegression in a OneVsRestClassifier which fits one binary classifier per class.

Note: There’s no harm in wrapping LogisticRegression in a OneVsRestClassifier for binary classification, so I chose to do so here just so you can see how it’s done. Keep in mind that it’s not required for binary classification, but it is required for more than two classes.

Let’s put Creme to work to train our model:

# initialize our metric
print("[INFO] starting training...")
metric = Accuracy()

# loop over the dataset
for (i, (X, y)) in enumerate(dataset):
	# make predictions on the current set of features, train the
	# model on the features, and then update our metric
	preds = model.predict_one(X)
	model = model.fit_one(X, y)
	metric = metric.update(y, preds)
	print("[INFO] update {} - {}".format(i, metric))

# show the accuracy of the model
print("[INFO] final - {}".format(metric))

Line 36 initializes our metric (i.e., accuracy).

From there, we begin to loop over our dataset (Line 39). Inside the loop, we:

  • Make a prediction on the current data point (Line 42). There are 25,000 data points (images), so this loop will run that many times.
  • Update the model weights based on the prediction (Line 43).
  • Update and display our accuracy metric (Lines 44 and 45).

Finally, the accuracy of the model is displayed in the terminal (Line 48).

Incremental Learning Results

We are now ready to apply incremental learning using Keras and Creme. Make sure you have:

  1. Used the “Downloads” section of this tutorial to download the source code.
  2. Downloaded the Dogs vs. Cats dataset from Kaggle’s website.

From there, open up a terminal and execute the following command:

$ python train_incremental.py --csv features.csv --cols 100352
[INFO] building column names...
[INFO] starting training...
[INFO] update 0 - Accuracy: 0.
[INFO] update 1 - Accuracy: 0.
[INFO] update 2 - Accuracy: 0.333333
[INFO] update 3 - Accuracy: 0.5
[INFO] update 4 - Accuracy: 0.6
[INFO] update 5 - Accuracy: 0.5
[INFO] update 6 - Accuracy: 0.571429
[INFO] update 7 - Accuracy: 0.625
[INFO] update 8 - Accuracy: 0.666667
[INFO] update 9 - Accuracy: 0.7
[INFO] update 10 - Accuracy: 0.727273
[INFO] update 11 - Accuracy: 0.75
[INFO] update 12 - Accuracy: 0.769231
[INFO] update 13 - Accuracy: 0.714286
[INFO] update 14 - Accuracy: 0.733333
[INFO] update 15 - Accuracy: 0.75
[INFO] update 16 - Accuracy: 0.705882
[INFO] update 17 - Accuracy: 0.722222
[INFO] update 18 - Accuracy: 0.736842
[INFO] update 19 - Accuracy: 0.75
[INFO] update 21 - Accuracy: 0.761906
...
[INFO] update 24980 - Accuracy: 0.9741
[INFO] update 24981 - Accuracy: 0.974101
[INFO] update 24982 - Accuracy: 0.974102
[INFO] update 24983 - Accuracy: 0.974103
[INFO] update 24984 - Accuracy: 0.974104
[INFO] update 24985 - Accuracy: 0.974105
[INFO] update 24986 - Accuracy: 0.974107
[INFO] update 24987 - Accuracy: 0.974108
[INFO] update 24988 - Accuracy: 0.974109
[INFO] update 24989 - Accuracy: 0.97411
[INFO] update 24990 - Accuracy: 0.974111
[INFO] update 24991 - Accuracy: 0.974112
[INFO] update 24992 - Accuracy: 0.974113
[INFO] update 24993 - Accuracy: 0.974114
[INFO] update 24994 - Accuracy: 0.974115
[INFO] update 24995 - Accuracy: 0.974116
[INFO] update 24996 - Accuracy: 0.974117
[INFO] update 24997 - Accuracy: 0.974118
[INFO] update 24998 - Accuracy: 0.974119
[INFO] update 24999 - Accuracy: 0.97412
[INFO] final - Accuracy: 0.97412

After only 21 samples our Logistic Regression model is obtaining 76.19% accuracy.

Letting the model train on all 25,000 samples, we reach 97.412% accuracy which is quite respectable. The process took 6hr48m on my system.

Again, the key point here is that our Logistic Regression model was trained in an incremental fashion — we were not required to store the entire dataset in memory at once. Instead, we could train our Logistic Regression classifier one sample at a time.

Summary

In this tutorial, you learned how to perform online/incremental learning with Keras and the Creme machine learning library.

Using Keras and ResNet50 pre-trained on ImageNet, we applied transfer learning to extract features from the Dogs vs. Cats dataset.

We have a total of 25,000 images in the Dogs vs. Cats dataset. The output volume of ResNet50 is 7 x 7 x 2048 = 100,352-dim. Assuming 32-bit floats for our 100,352-dim feature vectors, that implies that trying to store the entire dataset in memory at once would require 10.03GB of RAM.

Not all machine learning practitioners will have that much RAM on their machines.

And more to the point — even if you do have sufficient RAM for this dataset, you will eventually encounter a dataset that exceeds the physical memory on your machine.

When that occasion arises you should apply online/incremental learning.

Using the Creme library we trained a multi-class Logistic Regression classifier one sample at a time, enabling us to obtain 97.412% accuracy on the Dogs vs. Cats dataset.

I hope you enjoyed today’s tutorial!

Feel free to use the code in this blog post as a starting point for your own projects where online/incremental learning is required.

To download the source code to this post, and be notified when future tutorials are published here on PyImageSearch, just enter your email address in the form below!



The post Online/Incremental Learning with Keras and Creme appeared first on PyImageSearch.

Continue Reading…

Collapse

Read More

The publication asymmetry: What happens if the New England Journal of Medicine publishes something that you think is wrong?

After reading my news article on the replication crisis, retired cardiac surgeon Gerald Weinstein wrote:

I have long been disappointed by the quality of research articles written by people and published by editors who should know better. Previously, I had published two articles on experimental design written with your colleague Bruce Levin [of the Columbia University biostatistics department]:

Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548.

Weinstein GS and Levin B: The effect of crossover on the statistical power of randomized studies. Ann. Thorac. Surg. 1989;48:490-495.

I [Weinstein] would like to point out some additional problems with such studies in the hope that you could address them in some future essays. I am focusing on one recent article in the New England Journal of Medicine because it is typical of so many other clinical studies:

Alirocumab and Cardiovascular Outcomes after Acute Coronary Syndrome

November 7, 2018 DOI: 10.1056/NEJMoa1801174

BACKGROUND

Patients who have had an acute coronary syndrome are at high risk for recurrent ischemic cardiovascular events. We sought to determine whether alirocumab, a human monoclonal antibody to proprotein convertase subtilisin–kexin type 9 (PCSK9), would improve cardiovascular outcomes after an acute coronary syndrome in patients receiving high-intensity statin therapy.

METHODS

We conducted a multicenter, randomized, double-blind, placebo-controlled trial involving 18,924 patients who had an acute coronary syndrome 1 to 12 months earlier, had a low-density lipoprotein (LDL) cholesterol level of at least 70 mg per deciliter (1.8 mmol per liter), a non−high-density lipoprotein cholesterol level of at least 100 mg per deciliter (2.6 mmol per liter), or an apolipoprotein B level of at least 80 mg per deciliter, and were receiving statin therapy at a high-intensity dose or at the maximum tolerated dose. Patients were randomly assigned to receive alirocumab subcutaneously at a dose of 75 mg (9462 patients) or matching placebo (9462 patients) every 2 weeks. The dose of alirocumab was adjusted under blinded conditions to target an LDL cholesterol level of 25 to 50 mg per deciliter (0.6 to 1.3 mmol per liter). “The primary end point was a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.”

RESULTS

The median duration of follow-up was 2.8 years. A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group (hazard ratio, 0.85; 95% confidence interval [CI], 0.78 to 0.93; P<0.001). A total of 334 patients (3.5%) in the alirocumab group and 392 patients (4.1%) in the placebo group died (hazard ratio, 0.85; 95% CI, 0.73 to 0.98). The absolute benefit of alirocumab with respect to the composite primary end point was greater among patients who had a baseline LDL cholesterol level of 100 mg or more per deciliter than among patients who had a lower baseline level. The incidence of adverse events was similar in the two groups, with the exception of local injection-site reactions (3.8% in the alirocumab group vs. 2.1% in the placebo group).

Here are some major problems I [Weinstein] have found in this study:

1. Misleading terminology: the “primary composite endpoint.” Many drug studies, such as those concerning PCSK9 inhibitors (which are supposed to lower LDL or “bad” cholesterol) use the term “primary endpoint” which is actually “a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.” [Emphasis added]

Obviously, a “composite primary endpoint” is an oxymoron (which of the primary colors are composites?) but, worse, the term is so broad that it casts doubt on any conclusions drawn. For example, stroke is generally an embolic phenomenon and may be caused by atherosclerosis, but also may be due to atrial fibrillation in at least 15% of cases. Including stroke in the “primary composite endpoint” is misleading, at best.

By casting such a broad net, the investigators seem to be seeking evidence from any of the four elements in the so-called primary endpoint. Instead of being specific as to which types of events are prevented, the composite primary endpoint obscures the clinical benefit.

2. The use of relative risks, odds ratios or hazard ratios to obscure clinically insignificant differences in absolute differences. “A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group.” This is an absolute difference of only 1.6%. Such small differences are unlikely to be clinically important, or even replicated on subsequent studies, yet the authors obscure this fact by citing hazard ratios. Only in a supplemental appendix (available online), does this become apparent. Note the enlarged and prominently displayed hazard ratio, drawing attention away from the almost nonexistent difference in event rates (and lack of error bars). Of course, when the absolute differences are small, the ratio of two small numbers can be misleadingly large.

I am concerned because this type of thing is appearing more and more frequently. Minimally effective drugs are being promoted at great expense, and investigators are unthinkingly adopting questionable methods in search of new treatments. No wonder they can’t be repeated.
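
To make Weinstein’s second point concrete, here is a quick calculation from the event counts quoted in the abstract above (903/9,462 primary end-point events with alirocumab vs. 1,052/9,462 with placebo). Note this is a simple risk ratio, not the time-to-event hazard ratio the paper reports, so it is only an approximation:

# absolute vs. relative effect, using the counts quoted in the NEJM abstract
events_treat, n_treat = 903, 9462
events_placebo, n_placebo = 1052, 9462

p_treat = events_treat / n_treat          # about 0.095
p_placebo = events_placebo / n_placebo    # about 0.111

arr = p_placebo - p_treat                 # absolute risk reduction, about 0.016 (1.6 points)
rr = p_treat / p_placebo                  # relative risk, about 0.86
nnt = 1 / arr                             # roughly 63 patients treated per event avoided

print(arr, rr, nnt)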

I suggested to Weinstein that he write a letter to the journal, and he replied:

Unfortunately, the New England Journal of Medicine has a strict limit on the number of words in a letter to the editor of 175 words.

In addition, they have not been very receptive to my previous submissions. Today they rejected my short letter on an article that reached a conclusion that was the opposite of the data due to a similar category error, even though I kept it within that word limit.

“I am sorry that we will not be able to publish your recent letter to the editor regarding the Perner article of 06-Dec-2018. The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received.” Of course, they have the space to publish articles that are false on their face.

Here is the letter they rejected:

Re: Pantoprazole in Patients at Risk for Gastrointestinal Bleeding in the ICU

(December 6, 2018 N Engl J Med 2018; 379:2199-2208)

This article appears to reach an erroneous conclusion based on its own data. The study implies that pantoprazole is ineffective in preventing GI bleeding in ICU patients when, in fact, the results show that it is effective.

The purpose of the study was to evaluate the effectiveness of pantoprazole in preventing GI bleeding. Instead, the abstract shifts gears and uses death within 90 days as the primary endpoint and the Results section focuses on “at least one clinically important event (a composite of clinically important gastrointestinal bleeding, pneumonia, Clostridium difficile infection, or myocardial ischemia).” For mortality and for the composite “clinically important event,” relative risks, confidence intervals and p-values are given, indicating no significant difference between pantoprazole and control, but a p-value was not provided for GI bleeding, which is the real primary endpoint, even though “In the pantoprazole group, 2.5% of patients had clinically important gastrointestinal bleeding, as compared with 4.2% in the placebo group.” According to my calculations, the chi-square value is 7.23, with a p-value of 0.0072, indicating that pantoprazole is effective at the p<0.05 level in decreasing gastrointestinal bleeding in ICU patients. [emphasis added]

My concern is that clinicians may be misled into believing that pantoprazole is not effective in preventing GI bleeding in ICU patients when the study indicates that it is, in fact, effective.

This sort of mislabeling of end-points is now commonplace in many medical journals. I am hoping you can shed some light on this. Perhaps you might be able to get the NY Times or the NEJM to publish an essay by you on this subject, as I believe the quality of medical publications is suffering from this practice.

I have no idea. I’m a bit intimidated by medical research with all its specialized measurements and models. So I don’t think I’m the right person to write this essay; indeed I haven’t even put in the work to evaluate Weinstein’s claims above.

But I do think they’re worth sharing, just because there is this “publication asymmetry” in which, once something appears in print, especially in a prestigious journal, it becomes very difficult to criticize (except in certain cases when there’s a lot of money, politics, or publicity involved).

Continue Reading…

Collapse

Read More

The Machine Learning Puzzle, Explained

Lots of moving parts go into creating a machine learning model. Let's take a look at some of these core concepts and see how the machine learning puzzle comes together.

Continue Reading…

Collapse

Read More

Four short links: 17 June 2019

Multiverse Databases, Detecting Photoshopping, Simulation Platform, and Tail-Call Optimization: The Musical

  1. Towards Multiverse Databases (Morning Paper) -- The central idea behind multiverse databases is to push the data access and privacy rules into the database itself. The database takes on responsibility for authorization and transformation, and the application retains responsibility only for authentication and correct delegation of the authenticated principal on a database call. Such a design rules out an entire class of application errors, protecting private data from accidentally leaking.
  2. Detecting Photoshopped Fakes (Verge) -- Adobe worked with Berkeley researchers to develop software that can spot Photoshopping in an image. (via BoingBoing).
  3. Open Sourcing AI Habitat (Facebook) -- a new simulation platform created by Facebook AI that’s designed to train embodied agents (such as virtual robots) in photo-realistic 3D environments. [...] To illustrate the benefits of this new platform, we’re also sharing Replica, a data set of hyperrealistic 3D reconstructions of a staged apartment, retail store, and other indoor spaces.
  4. Tail-Call Optimization: The Musical (YouTube) -- you're welcome.

Continue reading Four short links: 17 June 2019.

Continue Reading…

Collapse

Read More

Story formats for data

Financial Times, in an effort to streamline a part of the data journalism process, developed templates for data stories. They call it the Story Playbook:

The Playbook is also an important driver of culture change in the newsroom. We have a rich and familiar vocabulary for print: The basement (A sometimes light-hearted, 350-word story that sits below the fold on the front page), for example, or the Page 3 (a 900–1200 word story at the top of the third page that is the day’s most substantive analysis article). For FT journalists, catflaps, birdcages, and skylines need no explanation.

The story playbook creates the equivalent for online stories, by introducing a vocabulary that provides a shared point of reference for everyone in the newsroom.

Check them out on GitHub.


Continue Reading…

Collapse

Read More

Magister Dixit

“Big data is not about the data. (Making the point that while data is plentiful and easy to collect, the real value is in the analytics.)” Gary King

Continue Reading…

Collapse

Read More

Towards Learning of Filter-Level Heterogeneous Compression of Convolutional Neural Networks - implementation -

** Nuit Blanche is now on Twitter: @NuitBlog **





Recently, deep learning has become a de facto standard in machine learning with convolutional neural networks (CNNs) demonstrating spectacular success on a wide variety of tasks. However, CNNs are typically very demanding computationally at inference time. One of the ways to alleviate this burden on certain hardware platforms is quantization relying on the use of low-precision arithmetic representation for the weights and the activations. Another popular method is the pruning of the number of filters in each layer. While mainstream deep learning methods train the neural networks weights while keeping the network architecture fixed, the emerging neural architecture search (NAS) techniques make the latter also amenable to training. In this paper, we formulate optimal arithmetic bit length allocation and neural network pruning as a NAS problem, searching for the configurations satisfying a computational complexity budget while maximizing the accuracy. We use a differentiable search method based on the continuous relaxation of the search space proposed by Liu et al. (2019a). We show, by grid search, that heterogeneous quantized networks suffer from a high variance which renders the benefit of the search questionable. For pruning, improvement over homogeneous cases is possible, but it is still challenging to find those configurations with the proposed method. The code is publicly available at https://github.com/yochaiz/Slimmable and https://github.com/yochaiz/darts-UNIQ.



Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn  or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine Learning: Meetup.com || @Archives || LinkedIn || Facebook || @ParisMLGroup
About LightOn: Newsletter || @LightOnIO || on LinkedIn || on CrunchBase || our Blog
About myself: LightOn || Google Scholar || LinkedIn || @IgorCarron || Homepage || ArXiv

Continue Reading…

Collapse

Read More

Wayward legend takes sides in a chart of two sides, plus data woes

Reader Chris P. submitted the following graph, found on Axios:

Axios_newstopics

From a Trifecta Checkup perspective, the chart has a clear question: are consumers getting what they wanted to read in the news they are reading?

Nevertheless, the chart is a visual mess, and the underlying data analytics fail to convince. So, it’s a Type DV chart. (See this overview of the Trifecta Checkup for the taxonomy.)

***

The designer did something tricky with the axis but the trick went off the rails. The underlying data consist of two sets of ranks, one for news people consumed and the other for news people wanted covered. With 14 topics included in the study, the two data series contain the same values, 1 to 14. The trick is to collapse both axes onto one. The trouble is that the same value occurs twice, and the reader must differentiate the plot symbols (triangle or circle) to figure out which is which.

It does not help that the lines look like arrows suggesting movement. Without first reading the text, readers may assume that topics change in rank between two periods of time. Some topics moved right, increasing in importance while others shifted left.

The design wisely separated the 14 topics into three logical groups. The blue group comprises news topics for which “want covered” ranking exceeds the “read” ranking. The orange group has the opposite disposition such that the data for “read” sit to the right side of the data for “want covered”. Unfortunately, the legend up top does more harm than good: it literally takes sides!

**

Here, I've put the data onto a scatter plot:

Redo_junkcharts_aiosnewstopics_1

The two sets of ranks are basically uncorrelated, as the regression line is almost flat, with “R-squared” of 0.02.

The analyst tried to "rescue" the data in the following way. Draw the 45-degree line, and color the points above the diagonal blue, and those below the diagonal orange. Color the points on the line gray. Then, write stories about those three subgroups.

Redo_junkcharts_aiosnewstopics_2

Further, the ranking of what was read came from Parse.ly, which appears to be surveillance data (“traffic analytics”) while the ranking of what people want covered came from an Axios/SurveyMonkey poll. As far as I could tell, there was no attempt to establish that the two populations are compatible and comparable.

Continue Reading…

Collapse

Read More

Book Memo: “Mathematical Theories of Machine Learning”

Theory and Applications
This book studies mathematical theories of machine learning. The first part of the book explores the optimality and adaptivity of choosing step sizes of gradient descent for escaping strict saddle points in non-convex optimization problems. In the second part, the authors propose algorithms to find local minima in nonconvex optimization and to obtain global minima to some degree from the Newton Second Law without friction. In the third part, the authors study the problem of subspace clustering with noisy and missing data, a problem well-motivated by practical applications: data subject to stochastic Gaussian noise and/or incomplete data with uniformly missing entries. In the last part, the authors introduce a novel VAR model with Elastic-Net regularization and its equivalent Bayesian model allowing for both a stable sparsity and a group selection.

Continue Reading…

Collapse

Read More

R Packages worth a look

Implementation of SCORE, SCORE+ and Mixed-SCORE (ScorePlus)
Implementation of community detection algorithm SCORE in the paper J. Jin (2015) <arXiv:1211.5803>, and SCORE+ in J. Jin, Z. Ke and S. Luo (2018) …

Fitting Discrete Distribution Models to Count Data (scModels)
Provides functions for fitting discrete distribution models to count data. Included are the Poisson, the negative binomial and, most importantly, a new …

Interface to the ‘JWSACruncher’ of ‘JDemetra+’ (rjwsacruncher)
JDemetra+’ (<https://…/jdemetra-app> ) is the seasonal adjustment softw …

Creating Visuals for Publication (utile.visuals)
A small set of functions for making visuals for publication in ‘ggplot2’. Key functions include geom_stepconfint() for drawing a step confidence interv …

Continue Reading…

Collapse

Read More

Forecasting tools in development

(This article was first published on - R, and kindly contributed to R-bloggers)

As I’ve been writing up a progress report for my NIGMS R35 MIRA award, I’ve been reminded of how much of the work that we’ve been doing is focused on forecasting infrastructure. A common theme in the Reich Lab is making operational forecasts of infectious disease outbreaks. The operational aspect means that we focus on everything from developing and adapting statistical methods to be used in forecasting applications to thinking about the data science toolkit that you need to store, evaluate, and visualize forecasts. To that end, in addition to working closely with the CDC in their FluSight initiative, we’ve been doing a lot of collaborative work on new R packages and data repositories that I hope will be useful beyond the confines of our lab. Some of these projects are fully operational, used in our production flu forecasts for CDC, and some have even gone through a level of code peer review. Others are in earlier stages of development. My hope is that in putting this list out there (see below the fold) we will generate some interest (and possibly find some new open-source collaborators) for these projects.



Here is a partial list of in-progress software that we’ve been working on:

  • sarimaTD is an R package that serves as a wrapper to some of the ARIMA modeling functionality in the forecast R package. We found that we consistently wanted to be specifying some transformations (T) and differencing (D) in specific ways that we have found useful in modeling infectious disease time-series data, so we made it easy for us and others to use specifications.
  • ForecastFramework is an R package that we have collaborated on with our colleagues at the Johns Hopkins Infectious Disease Dynamics lab. We’ve blogged about this before, and we see a lot of potential in this object-oriented framework for both standardizing how datasets are specified/accessed and how models are generated. That said, there still is a long ways to go to document and make this infrastructure usable by a wide audience. The most success I’ve had using it so far was having PhD students write forecast models for a seminar I taught this spring. I used a single script that could run and score forecasts from each model, with a very simple plug-and-play interface to the models because they had been specified appropriately.
  • Zoltar is a new repository (in alpha-ish release right now) for forecasts that we have been working on over the last year. It was initially designed with our CDC flu forecast use-case in mind, although the forecast structure is quite general, and with predx integration on the way (see next bullet) we are hoping that this will broaden the scope of possible use cases for Zoltar. To help facilitate our and others use of Zoltar, we are working on two interfaces to the Zoltar API, zoltpy for python and zoltr for R. Check out the documentation, especially for zoltr. There is quite a bit of data available!
  • predx is an R package designed by my colleague and friend Michael Johansson of the US Centers for Disease Control and Prevention and OutbreakScience. Evan Ray, from the Reich Lab team, has contributed to it as well. The goal of predx is to define some general classes of data for both probabilistic and point forecasts, to better standardize ways that we might want to store and operate on these data.
  • d3-foresight is the main engine behind our interactive forecast visualizations for flu in the US. We have also integrated it with Zoltar, so that you can view forecasts stored in Zoltar (note, kind of a long load time for that last link) using some of the basic d3-foresight functionality.

The lion’s share of the credit for all of the above are due to some combination of Matthew Cornell, Abhinav Tushar, Katie House, and Evan Ray.

To leave a comment for the author, please follow the link and comment on their blog: - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

The race to become Britain’s next PM

After the first round of voting, Boris Johnson is still the clear favourite

Continue Reading…

Collapse

Read More

The Greenland ice sheet is melting unusually fast

It is losing so much water that it may raise global sea levels by a millimetre this year

Continue Reading…

Collapse

Read More

June 16, 2019

Distilled News

What is TensorFrames? TensorFlow + Apache Spark

TensorFrames is an open source created by Apache Spark contributors. Its functions and parameters are named the same as in the TensorFlow framework. Under the hood, it is an Apache Spark DSL (domain-specific language) wrapper for Apache Spark DataFrames. It allows us to manipulate the DataFrames with TensorFlow functionality. And no, it is not pandas DataFrame, it is based on Apache Spark DataFrame.


Measuring Performance: AUPRC

The area under the precision-recall curve (AUPRC) is another performance metric that you can use to evaluate a classification model. If your model achieves a perfect AUPRC, it means your model can find all of the positive samples (perfect recall) without accidentally marking any negative samples as positive (perfect precision.) It’s important to consider both recall and precision together, because you could achieve perfect recall (but bad precision) using a naive classifier that marked everything positive, and you could achieve perfect precision (but bad recall) using a naive classifier that marked everything negative.
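
For reference (this is not from the excerpted article itself), scikit-learn exposes a standard estimate of this quantity as average precision:

# minimal AUPRC example: average_precision_score is the usual estimator of the
# area under the precision-recall curve
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]   # predicted scores/probabilities

print(average_precision_score(y_true, y_score))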


Introduction to U-Net and Res-Net for Image Segmentation

Human beings are able to see the images by capturing the reflected light rays which is a very complex task. So how can machines be programmed to perform a similar task? Computer sees the images as matrices which need to be processed to get a meaning out of it. Image segmentation is the method to partition the image into various segments with each segment having a different entity. Convolutional Neural Networks are successful for simpler images but haven’t given good results for complex images. This is where other algorithms like U-Net and Res-Net come into play.


Top 10 Neural Network Architectures

10 Neural Network Architectures
• Perceptrons
• Convolutional Neural Networks
• Recurrent Neural Networks
• Long / Short Term Memory
• Gated Recurrent Unit
• Hopfield Network
• Boltzmann Machine
• Deep Belief Networks
• Autoencoders
• Generative Adversarial Network


AI adoption is being fueled by an improved tool ecosystem

In this post, I share slides and notes from a keynote that Roger Chen and I gave at the 2019 Artificial Intelligence conference in New York City. In this short summary, I highlight results from a – survey (AI Adoption in the Enterprise) and describe recent trends in AI. Over the past decade, AI and machine learning (ML) have become extremely active research areas: the web site arxiv.org had an average daily upload of around 100 machine learning papers in 2018. With all the research that has been conducted over the past few years, it’s fair to say that we now have entered the implementation phase for many AI technologies. Companies are beginning to translate research results and developments into products and services.


The Kalman Filter and (Maximum) Likelihood

I first realized the power of the Kalman Filter during Kaggle’s Web Traffic Time Series Forecasting competition, a contest requiring prediction of future web traffic volumes for thousands of Wikipedia pages. In this contest, simple heuristics like ‘median of medians’ were hard to beat and my own modeling choices were mostly ineffective. Of course I blamed my tools, and wondered if anything in the traditional statistical toolbox was up for this task. Then I read a post by a user known only as ‘os,’ 8th place with Kalman filters.


Machine Learning Python Keras Dropout Layer Explained

Machine learning is ultimately used to predict outcomes given a set of features. Therefore, anything we can do to generalize the performance of our model is seen as a net gain. Dropout is a technique used to prevent a model from overfitting. Dropout works by randomly setting the outgoing edges of hidden units (neurons that make up hidden layers) to 0 at each update of the training phase. If you take a look at the Keras documentation for the dropout layer, you’ll see a link to a white paper written by Geoffrey Hinton and friends, which goes into the theory behind dropout.
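
For reference, a minimal model using the layer the excerpt describes might look like the following; the layer sizes and the 0.5 rate are arbitrary choices, and the tf.keras import path is just one common option:

# minimal Keras model with a Dropout layer: during training, half of the
# preceding layer's outputs are randomly zeroed at each update
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),
    Dropout(0.5),                       # dropout rate of 0.5
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])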


Robotic Process Automation

A recent finding from KPMG’s Global Sourcing Advisory Pulse Survey ‘Robotic Revolution’, suggests that technology experts believe ‘The opportunities [from RPA] are many – so are the adoption challenges… For most organisations, taking advantage of higher-end RPA opportunities will be easier said than done.’


Anomaly detection in time series with Prophet library

First of all, let’s define what an anomaly in a time series is. The anomaly detection problem for time series can be formulated as finding outlier data points relative to some standard or usual signal. While there are plenty of anomaly types, we’ll focus only on the most important ones from a business perspective, such as unexpected spikes, drops, trend changes and level shifts. You can solve this problem in two ways: supervised and unsupervised. While the first approach needs some labeled data, the second does not; you just need the raw data. In this article, we will focus on the second approach.


What you need to know: The Modern Open-Source Data Science/Machine Learning Ecosystem

As we have done before (see 2017 data science ecosystem, 2018 data science ecosystem), we examine which tools were part of the same answer – the skillset of the user. We note that this does not necessarily mean that all tools were used together on each project, but having knowledge and skills to used both tools X and Y makes it more likely that both X and Y were used together on some projects. The results we see are consistent with this assumption. The top tools show surprising stability – we see essentially the same pattern as last year.


Why correlation might tell us nothing about outliers

We often hear claims à la ‘there is a high correlation between x and y .’ This is especially true with alleged findings about human or social behaviour in psychology, the social sciences or economics. A reported Pearson correlation coefficient of 0.8 indeed seems high in many cases and escapes our critical evaluation of its real meaning. So let’s see what correlation actually means and if it really conveys the information we often believe it does. Inspired by the funny spurious correlation project as well as Nassim Taleb’s medium post and Twitter rants in which he laments psychologists’ (and not only) total ignorance and misuse of probability and statistics, I decided to reproduce his note on how much information the correlation coefficient conveys under the Gaussian distribution.


Monte Carlo Simulation and Statistical Probability Distributions in Python

One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how to do a Monte Carlo simulation in Python. Furthermore, you learn how to make different Statistical probability distributions in Python.
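
As a minimal, generic illustration of the idea (not the article’s own example), one can estimate a tail probability of the standard normal distribution by sampling:

# Monte Carlo estimate of P(Z > 1.96) for a standard normal variable
import numpy as np

rng = np.random.default_rng(42)
draws = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

print((draws > 1.96).mean())   # close to the theoretical value of about 0.025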


Recommendation Systems using UV-Decomposition

There are countless examples of recommender systems in use across nearly every industry in existence today. Most people understand, at a high level, what recommender systems attempt to achieve. However, not many people understand how they work at a deeper level. This is what Leskovec, Rajaraman, and Ullman dive into in Chapter 9 of their book Mining of Massive Datasets. A great example of a recommender system in use today is Netflix. Whenever you log into Netflix there are various sections such as ‘Trending Now’ or ‘New Releases’, but there is also a section titled ‘Top Picks for You’. This section uses a complex formula that tries to estimate which movies you would enjoy the most. The formula takes into account previous movies you have enjoyed as well as other movies people like you have also enjoyed.


Graph Algorithms (Part 2)

Graphs are becoming central to machine learning these days, whether you’d like to understand the structure of a social network by predicting potential connections, detecting fraud, understand customer’s behavior of a car rental service or making real-time recommendations for example. In this article, we’ll cover :
• The main graph algorithms
• Illustrations and use-cases
• Examples in Python


Kernel Secrets in Machine Learning Pt. 2

In the first article on the topic, Kernel Secrets in Machine Learning, I explained kernels in the most basic way possible. Before reading further I would advise you to quickly go through the article to get a feel of what a kernel really is if you have not already. Hopefully, you are going to come to this conclusion: A kernel is a similarity measure between two vectors in a mapped space. Good. Now we can check out and discuss some well-known kernels and also how do we combine kernels to produce other kernels. Keep in mind, for the examples that we are going to use, the x’s are one-dimensional vectors for plotting purposes and we fix the value of x’ to 2. Without further ado, let’s start kerneling.
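
One concrete, well-known example (not necessarily the one the article works through) is the RBF kernel, which turns squared distance into a similarity; with the excerpt’s setup of one-dimensional x and x' fixed at 2:

# RBF (Gaussian) kernel k(x, x') = exp(-gamma * (x - x')**2), with x' fixed at 2
import numpy as np

def rbf_kernel(x, x_prime=2.0, gamma=0.5):
    return np.exp(-gamma * (x - x_prime) ** 2)

x = np.linspace(-3.0, 7.0, 5)
print(rbf_kernel(x))   # similarity to x' = 2 peaks at x = 2 and decays with distance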


Kernel Secrets in Machine Learning Pt. 1

This post is not about deep learning. But it might as well be. This is the power of kernels. They are universally applicable in any machine learning algorithm. Why, you might ask? I am going to try to answer this question in this article.


How to learn the maths of Data Science using your high school maths knowledge

This post is a part of my forthcoming book on Mathematical foundations of Data Science. In this post, we use the Perceptron algorithm to bridge the gap between high school maths and deep learning.


What is Quantum Computing and How is it Useful for Artificial Intelligence?

After decades of a heavy slog with no promise of success, quantum computing is suddenly buzzing! Nearly two years ago, IBM made a quantum computer available to the world. The 5-quantum-bit (qubit) resource they now call the IBM Q experience. It was more like a toy for researchers than a way of getting any serious number crunching done. But 70,000 users worldwide have registered for it, and the qubit count in this resource has now quadrupled. With so many promises by quantum computing and data science being at the helm currently, are there any offerings by quantum computing for the AI? Let us explore that in this blog!


Machine Learning : Unsupervised – Hierarchical Clustering and Bootstrapping

This article is based on Unsupervised Learning algorithm: Hierarchical Clustering. This is the brief illustration with a practical working example of forming unsupervised hierarchical clusters and testing them to assure that you have formed the right clusters. This is a real-life data world example which can be studied and evaluated as data is provided for personal use and practice. There are variations to each topic in data science but there is a brief basic pattern that can be followed to build models. ‘The Datum’ empowers you to have access to these basic patterns for your lifetime and building upon them as you progress. Consider ‘The Datum’ blogs as your cookbook hand out which will help you learn, refer, and contribute the relevant topics. All listings and models are implemented in R Language using R Studio, and image instances of my work are embedded in this article for your reference.

Continue Reading…

Collapse

Read More

If you did not already know

Estimation of Distribution Algorithm (EDA) google
Estimation of distribution algorithms (EDAs), sometimes called probabilistic model-building genetic algorithms (PMBGAs), are stochastic optimization methods that guide the search for the optimum by building and sampling explicit probabilistic models of promising candidate solutions. Optimization is viewed as a series of incremental updates of a probabilistic model, starting with the model encoding the uniform distribution over admissible solutions and ending with the model that generates only the global optima. EDAs belong to the class of evolutionary algorithms. The main difference between EDAs and most conventional evolutionary algorithms is that evolutionary algorithms generate new candidate solutions using an implicit distribution defined by one or more variation operators, whereas EDAs use an explicit probability distribution encoded by a Bayesian network, a multivariate normal distribution, or another model class. Similarly as other evolutionary algorithms, EDAs can be used to solve optimization problems defined over a number of representations from vectors to LISP style S expressions, and the quality of candidate solutions is often evaluated using one or more objective functions.
Level-Based Analysis of the Univariate Marginal Distribution Algorithm


Abnormal Event Detection Network (AED-Net) google
It is challenging to detect the anomaly in crowded scenes for quite a long time. In this paper, a self-supervised framework, abnormal event detection network (AED-Net), which is composed of PCAnet and kernel principal component analysis (kPCA), is proposed to address this problem. Using surveillance video sequences of different scenes as raw data, PCAnet is trained to extract high-level semantics of crowd’s situation. Next, kPCA,a one-class classifier, is trained to determine anomaly of the scene. In contrast to some prevailing deep learning methods,the framework is completely self-supervised because it utilizes only video sequences in a normal situation. Experiments of global and local abnormal event detection are carried out on UMN and UCSD datasets, and competitive results with higher EER and AUC compared to other state-of-the-art methods are observed. Furthermore, by adding local response normalization (LRN) layer, we propose an improvement to original AED-Net. And it is proved to perform better by promoting the framework’s generalization capacity according to the experiments. …

Fixed-Size Ordinally Forgetting Encoding (FOFE) google
Question answering over knowledge base (KB-QA) has recently become a popular research topic in NLP. One popular way to solve the KB-QA problem is to make use of a pipeline of several NLP modules, including entity discovery and linking (EDL) and relation detection. Recent success on KB-QA task usually involves complex network structures with sophisticated heuristics. Inspired by a previous work that builds a strong KB-QA baseline, we propose a simple but general neural model composed of fixed-size ordinally forgetting encoding (FOFE) and deep neural networks, called FOFE-net to solve KB-QA problem at different stages. For evaluation, we use two popular KB-QA datasets, SimpleQuestions and WebQSP, and a newly created dataset, FreebaseQA. The experimental results show that FOFE-net performs well on KB-QA subtasks, entity discovery and linking (EDL) and relation detection, and in turn pushing overall KB-QA system to achieve strong results on all datasets. …

Q-Graph google
Arising user-centric graph applications such as route planning and personalized social network analysis have initiated a shift of paradigms in modern graph processing systems towards multi-query analysis, i.e., processing multiple graph queries in parallel on a shared graph. These applications generate a dynamic number of localized queries around query hotspots such as popular urban areas. However, existing graph processing systems are not yet tailored towards these properties: The employed methods for graph partitioning and synchronization management disregard query locality and dynamism which leads to high query latency. To this end, we propose the system Q-Graph for multi-query graph analysis that considers query locality on three levels. (i) The query-aware graph partitioning algorithm Q-cut maximizes query locality to reduce communication overhead. (ii) The method for synchronization management, called hybrid barrier synchronization, allows for full exploitation of local queries spanning only a subset of partitions. (iii) Both methods adapt at runtime to changing query workloads in order to maintain and exploit locality. Our experiments show that Q-cut reduces average query latency by up to 57 percent compared to static query-agnostic partitioning algorithms. …

Continue Reading…

Collapse

Read More

Document worth reading: “On the Implicit Assumptions of GANs”

Generative adversarial nets (GANs) have generated a lot of excitement. Despite their popularity, they exhibit a number of well-documented issues in practice, which apparently contradict theoretical guarantees. A number of enlightening papers have pointed out that these issues arise from unjustified assumptions that are commonly made, but the message seems to have been lost amid the optimism of recent years. We believe the identified problems deserve more attention, and highlight the implications on both the properties of GANs and the trajectory of research on probabilistic models. We recently proposed an alternative method that sidesteps these problems. On the Implicit Assumptions of GANs

Continue Reading…

Collapse

Read More

Minimal Key Set is NP hard

It usually gives us a chuckle when we find some natural and seemingly easy data science question is NP-hard. For instance we have written that variable pruning is NP-hard when one insists on finding a minimal sized set of variables (and also why there are no obvious methods for exact large permutation tests).

In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.

Problem: Minimum Cardinality Primary Key

Instance: Vectors x1 through xm, elements of {0,1}^n, and a positive integer k.

Question: Is there a “primary key” of size no more than k? That is: is there a subset P of {1, …, n} with |P| ≤ k such that for any integers a, b with 1 ≤ a < b ≤ m we can find a j in P such that xa(j) doesn’t equal xb(j) (i.e. xa and xb differ at some index named in P, and hence can be distinguished or “told apart”).

Now the standard reference on NP-hardness (Garey and Johnson, Computers and Intractability, Freeman, 1979) does have some NP-hard database examples (such as SR26 Minimum Cardinality Key). However the stated formulations are a bit hard to decode, so we will relate the above problem directly to a more accessible problem: SP8 Hitting Set.

Problem: SP8 Hitting Set

Instance: Collection C of subsets of a finite set S, positive integer K ≤ |S|.

Question: Is there a subset S' of S with |S'| ≤ K such that S' contains at least one element from each subset in C?

The idea is: SP8 is thought to be difficult to solve, so if we show how Minimum Cardinality Primary Key could be used to easily solve SP8 this is then evidence Minimum Cardinality Primary Key is also difficult to solve.

So suppose we have an arbitrary instance of SP8 in front of us. Without loss of generality assume S = {1, …, n}, C = {C1, …, Cm}, and all of the Ci are non-empty and distinct.

We build an instance of the Minimum Cardinality Primary Key problem by defining a table with columns named s1 through sn plus d1 through dm.

Now we define the rows of our table:

  • Let r0 be the row of all zeros.
  • For i from 1 to m let zi be the row with zi(di) = 1 and all other columns equal to zero.
  • For i from 1 to m let xi be the row with xi(di) = 1, xi(sj) = 1 for all j in Ci, and all other columns equal to zero.

Now let’s look at what sets of columns form primary keys for the collection of rows r0, zi, xi.

We must have all of the di in P, as each di is the unique index of the only difference between zi and r0. Also, for any i we must have a j in Ci such that sj is in P, as if there were none we could not tell zi from xi (as they differ only in indices named by Ci).

This lets us confirm that a good primary key set P is such that S' = {j | sj in P} is itself a good hitting set for the SP8 problem. And for any hitting set S' we have that P = {sj | j in S'} union {d1, …, dm} is a good solution for the Minimum Cardinality Primary Key problem (the di allow us to distinguish r0 from each zi, the zi from each other, r0 from each xi, and the xi from each other; the set hitting property lets us distinguish each zi from the corresponding xi, completing the unique keying of rows by the chosen column set). And the solution sizes are always such that |P| = |S'| + m.

So: if we had a method to solve arbitrary instances of the Minimum Cardinality Primary Key problem, we could then use it to solve arbitrary instances of the SP8 Hitting Set Problem. We would just re-encode the SP8 problem as described above, solve the Minimum Cardinality Primary Key problem, and use the strong correspondence between solutions to these two problems to map the solution back to the SP8 problem. Thus the Minimum Cardinality Primary Key problem is itself NP-hard.

What makes the problem hard is, as is quite common, the solution size constraint. Without that constraint the problem is trivial: the set of all columns either forms a primary key or does not, and it is a simple calculation to check that. As with the variable pruning problem, we can even try step-wise deleting columns to explore subsets of columns that are also primary table keys, moving us to a non-redundant key set (but possibly not one of minimal size), as in the sketch below.
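
A minimal sketch of that step-wise idea, written here in Python/pandas purely for illustration (the toy table and column names are made up); it greedily drops columns while the remainder still uniquely keys the rows, ending at a non-redundant (though not necessarily minimum) key:

# greedily delete columns while the remaining set still forms a primary key
import pandas as pd

def is_key(df, cols):
    # a column set is a key if no two rows agree on all of those columns
    return not df.duplicated(subset=list(cols)).any()

def prune_key(df):
    cols = list(df.columns)
    for c in list(cols):
        trial = [x for x in cols if x != c]
        if trial and is_key(df, trial):
            cols = trial            # column c was redundant, so drop it
    return cols

df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [0, 1, 1, 0]})
print(prune_key(df))   # ['b', 'c']: non-redundant, but not guaranteed minimum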

Continue Reading…

Collapse

Read More

We’re done with our Applied Regression final exam (and solution to question 15)

We’re done with our exam.

And the solution to question 15:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

Both (a) and (b) are true.

(a) is true because everything’s approximately normally distributed so you’d expect a 95% chance for an estimate +/- 2 se’s to contain the true value. In real life we’re concerned with model violations, but here it’s all simulated data so no worries about bias. And n=100 is large enough that we don’t have to worry about the t rather than normal distribution. (Actually, even if n were pretty small, we’d be doing ok with estimates +/- 2 sd’s because we’re using the mad sd which gets wider when the t degrees of freedom are low.)

And (b) is true too because of the central limit theorem. Switching from a normal to a bimodal distribution will affect predictions for individual cases but it will have essentially no effect on the distribution of the estimate, which is an average from 100 data points.
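As a quick check (not part of the exam solution), here is a small R simulation of the procedure; lm() estimates and classical standard errors stand in for the median and mad sd, and the error scale and the bimodal mixture below are arbitrary choices:

set.seed(123)
coverage <- function(rerror) {
  mean(replicate(1000, {
    x <- runif(100, 0, 10)
    y <- 2 + 3 * x + rerror(100)
    fit <- summary(lm(y ~ x))$coefficients
    abs(fit["x", "Estimate"] - 3) < 2 * fit["x", "Std. Error"]
  }))
}

## (a) normal errors: coverage close to 0.95
coverage(function(n) rnorm(n))

## (b) bimodal errors (mixture of two shifted normals): still close to 0.95
coverage(function(n) rnorm(n, mean = sample(c(-3, 3), n, replace = TRUE)))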

Common mistakes

Most of the students got (a) correct but not (b). I guess I have to bang even harder on the relative unimportance of the error distribution (except when the goal is predicting individual cases).

Continue Reading…

Collapse

Read More

modelDown is now on CRAN!

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)


The modelDown package turns classification or regression models into HTML static websites.
With one command you can convert one or more models into a website with visual and tabular model summaries. Summaries like model performance, feature importance, single feature response profiles and basic model audits.

modelDown uses DALEX explainers, so it’s model agnostic (feel free to combine a random forest with a glm), easy to extend and parameterise.

Here you can browse an example website automatically created for 4 classification models (random forest, gradient boosting, support vector machines, k-nearest neighbours). The R code behind this example is here.
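As a minimal sketch of that one-command workflow (not the code behind the example website; the dataset and models below are only illustrative), something along these lines should produce a site from two DALEX explainers:

library(DALEX)
library(modelDown)
library(randomForest)

# Two regression models for apartment prices (data shipped with DALEX)
model_lm <- lm(m2.price ~ ., data = apartments)
model_rf <- randomForest(m2.price ~ ., data = apartments)

# Wrap each model in a DALEX explainer
exp_lm <- explain(model_lm, data = apartmentsTest,
                  y = apartmentsTest$m2.price, label = "lm")
exp_rf <- explain(model_rf, data = apartmentsTest,
                  y = apartmentsTest$m2.price, label = "randomForest")

# One call generates the static HTML website summarising both models
modelDown(exp_lm, exp_rf)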

Fun facts:

– archivist hooks are generated for every documented object. So you can easily extract R objects from the HTML website. Try

archivist::aread("MI2DataLab/modelDown_example/docs/repository/574defd6a96ecf7e5a4026699971b1d7")

– session info is automatically recorded. So you can check version of packages available at model development (https://github.com/MI2DataLab/modelDown_example/blob/master/docs/session_info/session_info.txt)

– This package was initially created by Magda Tatarynowicz, Kamil Romaszko, Mateusz Urbański from Warsaw University of Technology as a student project.

To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation, with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and how we built on that to make the specifics of this simulation study. The slides are on the EdinbR Github.

Genetic simulation is useful for all kinds of things. Sure, simulations are only as good as the theory that underpins them, but the willingness to try things out in simulations is one of the things I always liked about breeding research.

This is my description of the logic of genetic simulation: we think of the genome as a large table of genotypes, drawn from some distribution of allele frequencies.

To make an utterly minimal simulation, we could draw allele frequencies from some distribution (like a Beta distribution), and then draw the genotypes from a binomial distribution. Done!
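Something like the following R snippet captures that minimal version (a sketch to make the idea concrete; the sample sizes and Beta parameters are arbitrary):

n_ind  <- 100    # number of individuals
n_loci <- 1000   # number of variants

# Allele frequencies from a Beta distribution
allele_freq <- rbeta(n_loci, shape1 = 0.5, shape2 = 0.5)

# Genotypes as 0/1/2 copies of the alternative allele, drawn binomially
genotypes <- sapply(allele_freq, function(p) rbinom(n_ind, size = 2, prob = p))
dim(genotypes)   # a 100 x 1000 table of genotypes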

However, there is a ton of nuance we would like to have: chromosomes, linkage between variants, sexes, mating, selection …

AlphaSimR addresses all of this, and allows you to throw individuals and populations around to build pretty complicated designs. Here is the small example simulation I used in the talk.

library(AlphaSimR)
library(ggplot2)

## Generate founder chromosomes

FOUNDERPOP <- runMacs(nInd = 1000,
                      nChr = 10,
                      segSites = 5000,
                      inbred = FALSE,
                      species = "GENERIC")


## Simulation parameters

SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
                   mean = 100,
                   var = 10)
SIMPARAM$addSnpChip(nSnpPerChr = 1000)
SIMPARAM$setGender("yes_sys")


## Founding population

pop <- newPop(FOUNDERPOP,
              simParam = SIMPARAM)

pop <- setPheno(pop,
                varE = 20,
                simParam = SIMPARAM)


## Breeding

print("Breeding")
breeding <- vector(length = 11, mode = "list")
breeding[[1]] <- pop

for (i in 2:11) {
    print(i)
    sires <- selectInd(pop = breeding[[i - 1]],
                       gender = "M",
                       nInd = 25,
                       trait = 1,
                       use = "pheno",
                       simParam = SIMPARAM)

    dams <- selectInd(pop = breeding[[i - 1]],
                      nInd = 500,
                      gender = "F",
                      trait = 1,
                      use = "pheno",
                      simParam = SIMPARAM)

    breeding[[i]] <- randCross2(males = sires,
                                females = dams,
                                nCrosses = 500,
                                nProgeny = 10,
                                simParam = SIMPARAM)
    breeding[[i]] <- setPheno(breeding[[i]],
                              varE = 20,
                              simParam = SIMPARAM)
}



## Look at genetic gain and shift in causative variant allele frequency

mean_g <- unlist(lapply(breeding, meanG))
sd_g <- sqrt(unlist(lapply(breeding, varG)))

plot_gain <- qplot(x = 1:11,
                   y = mean_g,
                   ymin = mean_g - sd_g,
                   ymax = mean_g + sd_g,
                   geom = "pointrange",
                   main = "Genetic mean and standard deviation",
                   xlab = "Generation", ylab = "Genetic mean")

start_geno <- pullQtlGeno(breeding[[1]], simParam = SIMPARAM)
start_freq <- colSums(start_geno)/(2 * nrow(start_geno))

end_geno <- pullQtlGeno(breeding[[11]], simParam = SIMPARAM)
end_freq <- colSums(end_geno)/(2 * nrow(end_geno))

plot_freq_before <- qplot(start_freq, main = "Causative variant frequency before") 
plot_freq_after <- qplot(end_freq, main = "Causative variant frequency after") 

This code builds a small livestock population, breeds it for ten generations, and looks at the resulting selection response in the form of a shift of the genetic mean, and the changes in the underlying distribution of causative variants. Here are the resulting plots:

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Distilled News

Applying AutoML to Transformer Architectures

Since it was introduced a few years ago, Google’s Transformer architecture has been applied to challenges ranging from generating fantasy fiction to writing musical harmonies. Importantly, the Transformer’s high performance has demonstrated that feed forward neural networks can be as effective as recurrent neural networks when applied to sequence tasks, such as language modeling and translation. While the Transformer and other feed forward models used for sequence problems are rising in popularity, their architectures are almost exclusively manually designed, in contrast to the computer vision domain where AutoML approaches have found state-of-the-art models that outperform those that are designed by hand. Naturally, we wondered if the application of AutoML in the sequence domain could be equally successful.


Interactive Network Visualization with R

Networks are everywhere. We have social networks like Facebook, competitive product networks or various networks in an organisation. Also, for STATWORX it is a common task to unveil hidden structures and clusters in a network and visualize it for our customers. In the past, we used the tool Gephi to visualize our results in network analysis. Impressed by these outstandingly pretty and interactive visualizations, our idea was to find a way to produce visualizations of the same quality directly in R and present them to our customers in an R Shiny app.


An End to End Introduction to GANs

I bet most of us have seen a lot of AI-generated people faces in recent times, be it in papers or blogs. We have reached a stage where it is becoming increasingly difficult to distinguish between actual human faces and faces that are generated by Artificial Intelligence. In this post, I will help the reader to understand how they can create and build such applications on their own. I will try to keep this post as intuitive as possible for starters while not dumbing it down too much.


Anomaly Detection with LSTM in Keras

I read ‘anomaly’ definitions in every kind of context, everywhere. In this chaos the only truth is the variability of this definition, i.e. what counts as an anomaly is completely related to the domain of interest. Detecting this kind of behaviour is useful in every business, and the difficulty of detecting these observations depends on the field of application. If you are engaged in a problem of anomaly detection which involves human activity (like prediction of sales or demand), you can take advantage of fundamental assumptions about human behaviour and plan a more efficient solution. This is exactly what we are doing in this post. We try to predict taxi demand in NYC in a critical time period. We formulate simple and important assumptions about human behaviour, which will permit us to find an easy solution to forecast anomalies. All the dirty work is done by a trusty LSTM, developed in Keras, which makes predictions and detects anomalies at the same time!


TD3: Learning To Run With AI

This article looks at one of the most powerful and state of the art algorithms in Reinforcement Learning (RL), Twin Delayed Deep Deterministic Policy Gradients (TD3)( Fujimoto et al., 2018). By the end of this article you should have a solid understanding of what makes TD3 perform so well, be capable of implementing the algorithm yourself and use TD3 to train an agent to successfully run in the HalfCheetah environment.


The Data Fabric, Containers, Kubernetes, Knowledge-Graphs, and more…

In the last article we talked about the building blocks of a knowledge-graph, now we will go a step further and learn the basic concepts, technologies and languages we need to understand to actually build it.


Advanced Ensemble Classifiers

Ensemble is a Latin-derived word which means ‘union of parts’. The regular classifiers that are used often are prone to make errors. As much as these errors are inevitable they can be reduced with the proper construction of a learning classifier. Ensemble learning is a way of generating various base classifiers from which a new classifier is derived which performs better than any constituent classifier. These base classifiers may differ in the algorithm used, hyperparameters, representation or the training set. The key objective of the ensemble methods is to reduce bias and variance.


A Game of Words: Vectorization, Tagging, and Sentiment Analysis

Full disclosure: I haven’t watched or read Game of Thrones, but I am hoping to learn a lot about it by analyzing the text. If you would like more background about the basic text processing, you can read my other article. The text from all 5 books can be found on Kaggle. In this article I will be taking the cleaned text and using it to explain the following concepts:
• Vectorization: Bag-of-Words, TF-IDF, and Skip-Thought Vectors
• After Vectorization
• POS tagging
• Named Entity Recognition (NER)
• Chunking and Chinking
• Sentiment Analysis
• Other NLP packages


Data Science as Software: from Notebooks to Tools [Part 3]

This is the final part of the series of how to go on from Jupyter Notebooks to software solutions in Data Science. Part 1 covered the basics of setting up the working environment and data exploration. Part 2 dived deep into data pre-processing and modelling. Part 3 will deal with how you can move on from Jupyter, front end development and your daily work in the code. The overall agenda of the series is the following:
• Setting up your working environment [Part 1]
• Important modules for data exploration [Part 1]
• Machine Learning Part 1: Data pre-processing [Part 2]
• Machine Learning Part 2: Models [Part 2]
• Moving on from Jupyter [Part 3]
• Shiny stuff: when do we get a front end? [Part 3]
• Your daily work in the code: keeping standards [Part 3]


Machine Learning: Recurrent Neural Networks And Long Short Term Memory (LSTM) Python Keras Example

Recurrent neural networks have a wide array of applications. These include time series analysis, document classification, speech and voice recognition. In contrast to feedforward artificial neural networks, the predictions made by recurrent neural networks are dependent on previous predictions. To elaborate, imagine we decided to follow an exercise routine where, every day, we alternate between lifting weights, swimming and yoga. We could then build a recurrent neural network to predict today’s workout given what we did yesterday. For example, if we lifted weights yesterday then we’d go swimming today. More often than not, the problems you’ll be tackling in the real world will be a function of the current state as well as other inputs. For instance, suppose we signed up for hockey once a week. If we’re playing hockey on the same day that we’re supposed to lift weights then we might decide to skip the gym. Thus, our model now has to differentiate between the case when we attended a yoga class yesterday and we’re not playing hockey, and the case when we attended a yoga class yesterday and we’re playing hockey today, in which case we’d jump directly to swimming.


Learning Like Babies

Convolutional Neural Nets (CNNs), a concept that has achieved the greatest performance for image classification, was inspired by the mammalian visual cortex system. In spite of the drastic progress in automated computer vision systems, most of the success of image classification architectures comes from labeled data. The problem is that most of the real world data is not labeled. According to Yann LeCun, father of CNNs and professor at NYU, the next ‘big thing’ in artificial intelligence is semi-supervised learning – a type of machine learning task that makes use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data. That is why recently a large research effort has been focused on unsupervised learning without leveraging a large amount of expensive supervision.


Rethinking the Data Science Life Cycle for the Enterprise

While the technology and tools used by data scientists have grown dramatically, the data science lifecycle has stagnated. In fact, little has changed between the earliest versions of CRISP-DM created over 20 years ago and the more recent lifecycles offered by leading vendors such as Google, Microsoft, and DataRobot. Most versions of the data science lifecycle still address the same set of tasks: understanding the business problem, understanding domain data, acquiring and engineering data, model development and training, and model deployment and monitoring (see Figure 1). But enterprise needs have evolved as data science has become embedded in most companies. Today, model reproducibility, traceability, and verifiability have become fundamental requirements for data science in large enterprises. Unfortunately, these requirements are omitted or significantly underplayed in leading AI/ML lifecycles.


Deep Learning for Sentiment Analysis

Sentiment Analysis is a classic example of machine learning, which (in a nutshell) refers to ‘a way of learning that enables algorithms to evolve: this learning means feeding the algorithm with a massive amount of data so that it can adjust itself and continually improve.’ Sentiment analysis is the automated process of understanding an opinion about a given subject from written or spoken language. In a world where we generate 2.5 quintillion bytes of data every day, sentiment analysis has become a key tool for making sense of that data. This has allowed companies to get key insights and automate all kinds of processes.


Regression with Regularization Techniques.

The article assumes that you have some brief idea about the regression techniques that could predict the required variable from a stratified and equitable distribution of records in a dataset that are implemented by a statistical approach. Just kidding! All you need is adequate math to be able to understand basic graphs. Before entering the topic, a little brush up…


K-Medoids Clustering Using ATS: Unleashing the Power of Templates

k-medoids clustering is a classical clustering machine learning algorithm. It is a sort of generalization of the k-means algorithm. The only difference is that cluster centers can only be one of the elements of the dataset, this yields an algorithm which can use any type of distance function whereas k-means only provably converges using the L2-norm.


How to stop training a neural-network using callback?

Often, when training a very deep neural network, we want to stop training once the training accuracy reaches a certain desired threshold. Thus, we can achieve what we want (optimal model weights) and avoid wastage of resources (time and computation power). In this brief tutorial, let’s learn how to achieve this in Tensorflow and Keras, using the callback approach, in 4 simple steps.

Continue Reading…

Collapse

Read More

Sudan's government is minimizing the death toll in the Khartoum attack

Two state agencies report a different, smaller number of protesters killed than an independent panel’s count

Death tolls can be hard to calculate. Violence, whether it’s natural or manmade, can create chaos that makes counting difficult. And we rarely pay attention for long enough to see that those who die from their injuries are added to the final count of the dead.

And sometimes, there’s a vested interest in minimizing the numbers. On 3 June, government forces in Sudan violently attacked protesters in Khartoum. The protesters were calling for the transitional military council to hand power over to a civilian-led government. According to Human Rights Watch, protesters were chased, whipped, shot at and, according to several reports, raped.

Continue reading...

Continue Reading…

Collapse

Read More

If you did not already know

WeCURE google
Missing data recovery is an important and yet challenging problem in imaging and data science. Successful models often adopt certain carefully chosen regularization. Recently, the low dimension manifold model (LDMM) was introduced by S.Osher et al. and shown effective in image inpainting. They observed that enforcing low dimensionality on image patch manifold serves as a good image regularizer. In this paper, we observe that having only the low dimension manifold regularization is not enough sometimes, and we need smoothness as well. For that, we introduce a new regularization by combining the low dimension manifold regularization with a higher order Curvature Regularization, and we call this new regularization CURE for short. The key step of solving CURE is to solve a biharmonic equation on a manifold. We further introduce a weighted version of CURE, called WeCURE, in a similar manner as the weighted nonlocal Laplacian (WNLL) method. Numerical experiments for image inpainting and semi-supervised learning show that the proposed CURE and WeCURE significantly outperform LDMM and WNLL respectively. …

Recurrent Convolutional Network (RCN) google
Recently, three dimensional (3D) convolutional neural networks (CNNs) have emerged as dominant methods to capture spatiotemporal representations, by adding to pre-existing 2D CNNs a third, temporal dimension. Such 3D CNNs, however, are anti-causal (i.e., they exploit information from both the past and the future to produce feature representations, thus preventing their use in online settings), constrain the temporal reasoning horizon to the size of the temporal convolution kernel, and are not temporal resolution-preserving for video sequence-to-sequence modelling, as, e.g., in spatiotemporal action detection. To address these serious limitations, we present a new architecture for the causal/online spatiotemporal representation of videos. Namely, we propose a recurrent convolutional network (RCN), which relies on recurrence to capture the temporal context across frames at every level of network depth. Our network decomposes 3D convolutions into (1) a 2D spatial convolution component, and (2) an additional hidden state $1\times 1$ convolution applied across time. The hidden state at any time $t$ is assumed to depend on the hidden state at $t-1$ and on the current output of the spatial convolution component. As a result, the proposed network: (i) provides flexible temporal reasoning, (ii) produces causal outputs, and (iii) preserves temporal resolution. Our experiments on the large-scale large ‘Kinetics’ dataset show that the proposed method achieves superior performance compared to 3D CNNs, while being causal and using fewer parameters. …

Parallelizable Stack Long Short-Term Memory google
Stack Long Short-Term Memory (StackLSTM) is useful for various applications such as parsing and string-to-tree neural machine translation, but it is also known to be notoriously difficult to parallelize for GPU training due to the fact that the computations are dependent on discrete operations. In this paper, we tackle this problem by utilizing state access patterns of StackLSTM to homogenize computations with regard to different discrete operations. Our parsing experiments show that the method scales up almost linearly with increasing batch size, and our parallelized PyTorch implementation trains significantly faster compared to the Dynet C++ implementation. …

Social Relationship Graph Generation Network (SRG-GN) google
Socially-intelligent agents are of growing interest in artificial intelligence. To this end, we need systems that can understand social relationships in diverse social contexts. Inferring the social context in a given visual scene not only involves recognizing objects, but also demands a more in-depth understanding of the relationships and attributes of the people involved. To achieve this, one computational approach for representing human relationships and attributes is to use an explicit knowledge graph, which allows for high-level reasoning. We introduce a novel end-to-end-trainable neural network that is capable of generating a Social Relationship Graph – a structured, unified representation of social relationships and attributes – from a given input image. Our Social Relationship Graph Generation Network (SRG-GN) is the first to use memory cells like Gated Recurrent Units (GRUs) to iteratively update the social relationship states in a graph using scene and attribute context. The neural network exploits the recurrent connections among the GRUs to implement message passing between nodes and edges in the graph, and results in significant improvement over previous methods for social relationship recognition. …

Continue Reading…

Collapse

Read More

Book Memo: “Managing Your Data Science Projects”

Learn Salesmanship, Presentation, and Maintenance of Completed Models
At first glance, the skills required to work in the data science field appear to be self-explanatory. Do not be fooled. Impactful data science demands an interdisciplinary knowledge of business philosophy, project management, salesmanship, presentation, and more. In Managing Your Data Science Projects, author Robert de Graaf explores important concepts that are frequently overlooked in much of the instructional literature that is available to data scientists new to the field. If your completed models are to be used and maintained most effectively, you must be able to present and sell them within your organization in a compelling way.

Continue Reading…

Collapse

Read More

June 15, 2019

CoderStats Revamp, d3-geomap v3 Release, Python Data Science Handbook Review

I worked on some projects recently that were asleep for a while, including a revamp of coderstats.net, the release of d3-geomap version 3 and I published a review on the Python Data Science Handbook.


Continue Reading…

Collapse

Read More

R Packages worth a look

Composite Grid Gaussian Processes (CGGP)
Run computer experiments using the adaptive composite grid algorithm with a Gaussian process model. The algorithm works best when running an experiment …

Simple Jenkins Client (jenkins)
Manage jobs and builds on your Jenkins CI server <https://…/>. Create and edit projects, s …

Download and Load Various Text Datasets (textdata)
Provides a framework to download, parse, and store text datasets on the disk and load them when needed. Includes various sentiment lexicons and labeled …

Robust Backfitting (RBF)
A robust backfitting algorithm for additive models based on (robust) local polynomial kernel smoothers. It includes both bounded and re-descending (ker …

Continue Reading…

Collapse

Read More

Document worth reading: “A Survey of the Recent Architectures of Deep Convolutional Neural Networks”

Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art results on various competitive benchmarks. The powerful learning ability of deep CNN is largely achieved with the use of multiple non-linear feature extraction stages that can automatically learn hierarchical representation from the data. Availability of a large amount of data and improvements in the hardware processing units have accelerated the research in CNNs and recently very interesting deep CNN architectures are reported. The recent race in deep CNN architectures for achieving high performance on the challenging benchmarks has shown that the innovative architectural ideas, as well as parameter optimization, can improve the CNN performance on various vision-related tasks. In this regard, different ideas in the CNN design have been explored such as use of different activation and loss functions, parameter optimization, regularization, and restructuring of processing units. However, the major improvement in representational capacity is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is gaining substantial appreciation. This survey thus focuses on the intrinsic taxonomy present in the recently reported CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting and attention. Additionally, it covers the elementary understanding of the CNN components and sheds light on the current challenges and applications of CNNs. A Survey of the Recent Architectures of Deep Convolutional Neural Networks

Continue Reading…

Collapse

Read More

Algorithmic bias and social bias

The “algorithmic bias” that concerns me is not so much a bias in an algorithm, but rather a social bias resulting from the demand for, and expectation of, certainty.

Continue Reading…

Collapse

Read More

Magister Dixit

“To put it simply, there is too much friction. In any given workflow, you have to go through several levels to get to what you really need. For instance, say you’re part of the customer service team: You use Salesforce to get the information you need to best serve customers. But depending on the information, you have to go across half-a-dozen windows in search for the right sales pitch, product information, or other collateral. You are 15 steps into a workflow before you get to the real starting point. This wastes time, money, and reduces quality of service. This is in sharp contrast to what you have come to expect using consumer products. Think peer-to-peer payment option solutions like Square that make payments as simple as the tap of a button – eliminating dozens of process steps that you would usually go through. This simple, bare-bones approach has changed industries across the board, be it transportation (Uber), insurance (15 minutes can save you…), accounting (TurboTax), retail (Amazon Same-Day), and so on. Enterprises that provide this personalized, contextual experience will thrive and those that don’t will falter.” Mayank Mehta ( August 27, 2015 )

Continue Reading…

Collapse

Read More

Introducing the {ethercalc} package

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I mentioned EtherCalc in a previous post and managed to scrounge some time to put together a fledgling {ethercalc} package (it’s also on GitLab, SourceHut, Bitbucket and GitUgh, just sub out the appropriate URL prefix).

I’m creating a package-specific Docker image (there are a couple out there but I’m not supporting their use with the package as they have a CORS configuration that makes EtherCalc API wrangling problematic) for EtherCalc but I would highly recommend that you just use it via the npm module. To do that you need a working Node.js installation and I highly recommend also running a local Redis instance (it’s super lightweight). Linux folks can use their fav package manager for that and macOS folks can use homebrew. Folks on the legacy Windows operating system can visit:

to get EtherCalc going.

I also recommend running EtherCalc and Redis together for performance reasons. EtherCalc will maintain a persistent store for your spreadsheets (they call them “rooms” since EtherCalc supports collaborative editing) with or without Redis, but using Redis makes all EtherCalc ops much, much faster.

Once you have Redis running (on localhost, which is the default) and Node.js + npm installed, you can do the following to install EtherCalc:

$ npm install -g ethercalc # may require `sudo` on some macOS or *nix systems

The -g tells npm to install the module globally and will work to ensure the ethercalc executable is on your PATH. Like many things one can install from Node.js or even Python, you may see a cadre of “warnings” and possibly even some “errors”. If you execute the following and see similar messages:

$ ethercalc --host=localhost ## IMPORTANT TO USE --host=localhost
Please connect to: http://localhost:8000/
Starting backend using webworker-threads
Falling back to vm.CreateContext backend
Express server listening on port 8000 in development mode
Zappa 0.5.0 "You can't do that on stage anymore" orchestrating the show
Connected to Redis Server: localhost:6379

and then go to the URL it gives you and you see something like this:

then you’re all set to continue.

A [Very] Brief EtherCalc Introduction

EtherCalc has a wiki. As such, please hit that to get more info on EtherCalc.

For now, if you hit that big, blue “Create Spreadsheet” button, you’ll see something pretty familiar if you’ve used Google Sheets, Excel, LibreOffice Calc (etc):

If you start ethercalc without the --host=localhost it listens on all network interfaces, so other folks on your network can also use it as a local “cloud” spreadsheet app, but also edit with you, just like Google Sheets.

I recommend playing around a bit in EtherCalc before continuing just to see that it is, indeed, a spreadsheet app like all the others you are familiar with, except it has a robust API that we can orchestrate from within R, now.

Working with {ethercalc}

You can install {ethercalc} from the aforelinked source or via:

install.packages("ethercalc", repos = "https://cinc.rud.is")

where you’ll get a binary install for Windows and macOS (binary builds are for R 3.5.x but should also work for 3.6.x installs).

If you don’t want to drop to a command line interface to start EtherCalc you can use ec_start() to run one that will only be live during your R session.

Once you have EtherCalc running you’ll need to put the URL into an ETHERCALC_HOST environment variable. I recommend adding the following to ~/.Renviron and restarting your R session:

ETHERCALC_HOST=http://localhost:8000

(You’ll get an interactive prompt to provide this if you don’t have the environment variable setup.)

You can verify R can talk to your EtherCalc instance by executing ec_running() and reading the message or examining the (invisible) return value. Post a comment or file an issue (on your preferred social coding site) if you really think you’ve done everything right and still aren’t up and running by this point.
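Putting those setup steps together, a session might look something like this (a sketch using the function names from this post; ec_start() may take additional arguments depending on your install):

library(ethercalc)

ec_start()   # launch an EtherCalc instance that lives only for this R session

# Point the package at the instance (normally set once in ~/.Renviron)
Sys.setenv(ETHERCALC_HOST = "http://localhost:8000")

ec_running() # confirm R can talk to the EtherCalc instance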

The use-case I setup in the previous blog post was to perform light data entry since scraping was both prohibited and would have taken more time given how the visualization was made. To start a new spreadsheet (remember, EtherCalc folks call these “rooms”), just do:

ec_new("for-blog")

And you should see this appear in your default web browser:

You can do ec_list() to see the names of all “saved” spreadsheets (ec_delete() can remove them, too).

We’ll type in the values from the previous post:

Now, to retrieve those values, we can do:

ec_read("for-blog", col_types="cii")
## # A tibble: 14 x 3
##    topic                actually_read say_want_covered
##                                        
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  7 Business                        13               11
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12

That function takes any (relevant to this package use-case) parameter that readr::read_csv() takes (since it uses that under the hood to parse the object that comes back from the API call). If someone adds or modifies any values you can just call ec_read() again to retrieve them.

The ec_export() function lets you download the contents of the spreadsheet (“room”) to a local:

  • CSV
  • JSON
  • HTML
  • Markdown
  • Excel

file (and it also returns the raw data directly to the R session). So you can do something like:

cat(rawToChar(ec_export("for-blog", "md", "~/Data/survey.md")))
## |topic|actually_read|say_want_covered|
## | ---- | ---- | ---- |
## |Health care|7|1|
## |Climate change|5|2|
## |Education|11|3|
## |Economics|6|4|
## |Science|10|7|
## |Technology|14|8|
## |Business|13|11|
## |National Security|1|5|
## |Politics|2|10|
## |Sports|3|14|
## |Immigration|4|6|
## |Arts & entertainment|8|13|
## |U.S. foreign policy|9|9|
## |Religion|12|12|

You can also append to a spreadsheet right from R. We’ll sort that data frame (to prove the append is working and I’m not fibbing) and append it to the existing sheet (this is a toy example, but imagine appending to an always-running EtherCalc instance as a data logger, which folks actually do IRL):

ec_read("for-blog", col_types="cii") %>% 
  dplyr::arrange(desc(topic)) %>% 
  ec_append("for-blog")

Note that you can open up EtherCalc to any existing spreadsheets (“rooms”) via ec_view() as well.

FIN

It’s worth noting that EtherCalc appears to have a limit of around 500,000 “cells” per spreadsheet (“room”). I mention that since if you tried to, say, ec_edit(ggplot2movies::movies, "movies") you would very likely have crashed the running EtherCalc instance had I not coded some guard rails into that function and the ec_append() function to stop you from doing that. It’s a sane limit IMO and Google Sheets does something similar (per-tab) for similar reasons (and both limits are one reason I’m still against using a browser for “everything”, given the limitations of javascript wrangling of DOM elements).

If you’re doing work on large-ish data, spreadsheets in general aren’t the best tools.

And, while you should avoid hand-wrangling data at all costs, ec_edit() is a much faster and feature-rich alternative to R’s edit() function on most systems.

I’ve shown off most of the current functionality of the {ethercalc} package in this post. One function I’ve left out is ec_cmd() which lets you completely orchestrate all EtherCalc operations. It’s powerful enough, and the EtherCalc command structure is gnarly enough, that we’ll have to cover it in a separate post. Also, stay tuned for the aforementioned package-specific EtherCalc Docker image.

Kick the tyres, contribute issues and/or PRs as moved (and on your preferred social coding site) and see if both EtherCalc and {ethercalc} might work for you in place of or along with Excel and/or Google Sheets.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Pharmacometrics meeting in Paris on the afternoon of 11 July 2019

Julie Bertrand writes:

The pharmacometrics group led by France Mentre (IAME, INSERM, Univ Paris) is very pleased to host a free ISoP Statistics and Pharmacometrics (SxP) SIG local event at Faculté Bichat, 16 rue Henri Huchard, 75018 Paris, on Thursday afternoon the 11th of July 2019.

It will feature talks from Professor Andrew Gelman, Columbia University (We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions) and Professor Rob Bies, University at Buffalo (A hybrid genetic algorithm for NONMEM structural model optimization).

We welcome all of you (please register here). Registration is capped at 70 attendees.

If you would like to present some of your work (related to SxP), please contact us by July 1, 2019. Send a title and short abstract (julie.bertrand@inserm.fr).

Continue Reading…

Collapse

Read More

Science and Technology links (June 15th 2019)

Continue Reading…

Collapse

Read More

Saturday Morning Videos: AutoML Workshop at ICML 2019

** Nuit Blanche is now on Twitter: @NuitBlog **


Katharina Eggensperger, Matthias Feurer, Frank Hutter, and Joaquin Vanschoren organized the AutoML workshop at ICML, and there are videos of the event that took place yesterday. Awesome! Here is the intro for the workshop:
Machine learning has achieved considerable successes in recent years, but this success often relies on human experts, who construct appropriate features, design learning architectures, set their hyperparameters, and develop new learning algorithms. Driven by the demand for off-the-shelf machine learning methods from an ever-growing community, the research area of AutoML targets the progressive automation of machine learning aiming to make effective methods available to everyone. The workshop targets a broad audience ranging from core machine learning researchers in different fields of ML connected to AutoML, such as neural architecture search, hyperparameter optimization, meta-learning, and learning to learn, to domain experts aiming to apply machine learning to new types of problems.

All the videos are here.

Bayesian optimization is a powerful and flexible tool for AutoML. While BayesOpt was first deployed for AutoML simply as a black-box optimizer, recent approaches perform grey-box optimization: they leverage capabilities and problem structure specific to AutoML such as freezing and thawing training, early stopping, treating cross-validation error minimization as multi-task learning, and warm starting from previously tuned models. We provide an overview of this area and describe recent advances for optimizing sampling-based acquisition functions that make grey-box BayesOpt significantly more efficient.
The mission of AutoML is to make ML available for non-ML experts and to accelerate research on ML. We have a very similar mission at fast.ai and have helped over 200,000 non-ML experts use state-of-the-art ML (via our research, software, & teaching), yet we do not use methods from the AutoML literature. I will share several insights we've learned through this work, with the hope that they may be helpful to AutoML researchers.



AutoML aims at automating the process of designing good machine learning pipelines to solve different kinds of problems. However, existing AutoML systems are mainly designed for isolated learning by training a static model on a single batch of data; while in many real-world applications, data may arrive continuously in batches, possibly with concept drift. This raises a lifelong machine learning challenge for AutoML, as most existing AutoML systems can not evolve over time to learn from streaming data and adapt to concept drift. In this paper, we propose a novel AutoML system for this new scenario, i.e. a boosting tree based AutoML system for lifelong machine learning, which won the second place in the NeurIPS 2018 AutoML Challenge. 


In this talk I'll survey work by Google researchers over the past several years on the topic of AutoML, or learning-to-learn. The talk will touch on basic approaches, some successful applications of AutoML to a variety of domains, and sketch out some directions for future AutoML systems that can leverage massively multi-task learning systems for automatically solving new problems.


Recent advances in Neural Architecture Search (NAS) have produced state-of-the-art architectures on several tasks. NAS shifts the efforts of human experts from developing novel architectures directly to designing architecture search spaces and methods to explore them efficiently. The search space definition captures prior knowledge about the properties of the architectures and it is crucial for the complexity and the performance of the search algorithm. However, different search space definitions require restarting the learning process from scratch. We propose a novel agent based on the Transformer that supports joint training and efficient transfer of prior knowledge between multiple search spaces and tasks.
Neural architecture search (NAS) is a promising research direction that has the potential to replace expert-designed networks with learned, task-specific architectures. In order to help ground the empirical results in this field, we propose new NAS baselines that build off the following observations: (i) NAS is a specialized hyperparameter optimization problem; and (ii) random search is a competitive baseline for hyperparameter optimization. Leveraging these observations, we evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm on two standard NAS benchmarks—PTB and CIFAR-10. Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks. Additionally, random search with weight-sharing outperforms random search with early-stopping, achieving a state-of-the-art NAS result on PTB and a highly competitive result on CIFAR-10. Finally, we explore the existing reproducibility issues of published NAS results.
The practical work of deploying a machine learning system is dominated by issues outside of training a model: data preparation, data cleaning, understanding the data set, debugging models, and so on. What does it mean to apply ML to this “grunt work” of machine learning and data science? I will describe first steps towards tools in these directions, based on the idea of semi-automating ML: using unsupervised learning to find patterns in the data that can be used to guide data analysts. I will also describe a new notebook system for pulling these tools together: if we augment Jupyter-style notebooks with data-flow and provenance information, this enables a new class of data-aware notebooks which are much more natural for data manipulation.
Panel Discussion





Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn  or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine Learning: Meetup.com || @Archives || LinkedIn || Facebook || @ParisMLGroup
About LightOn: Newsletter || @LightOnIO || on LinkedIn || on CrunchBase || our Blog
About myself: LightOn || Google Scholar || LinkedIn || @IgorCarron || Homepage || ArXiv

Continue Reading…

Collapse

Read More

Question 15 of our Applied Regression final exam (and solution to question 14)

Here’s question 15 of our exam:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

And the solution to question 14:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is, Pr(Pass) = logit^−1(a_j + 0.1x),
for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

(a) For an average classroom, the curve is invlogit(1 + 0.1x), so it goes through the 50% point at x = -10. So the easiest way to draw the curve is to extend it outside the range of the data. But in the graph, the x-axis should go from 0 to 50. Recall that invlogit(5) = 0.99, so the probability of passing reaches 99% when x reaches 40. From all this information, you can draw the curve.

(b) The 25th and 75th percentage points of the normal distribution are at the mean +/- 0.67 standard deviations. Thus, the 25th and 75th percentage points of the intercepts are 1 +/- 0.67*2, or -0.34, 2.34, so the curves to draw are invlogit(-0.34 + 0.1x) and invlogit(2.34 + 0.1x). These are just shifted versions of the curve from (a), shifted by 1.34/0.1 = 13.4 to the left and the right.
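If you want to check the shapes in R, a quick sketch (using base R’s plogis() as the inverse logit) is:

# Average classroom, plus the 25th and 75th percentile classrooms
curve(plogis(1 + 0.1 * x), from = 0, to = 50, ylim = c(0, 1),
      xlab = "Pre-test score, x", ylab = "Pr(Pass)")
curve(plogis(-0.34 + 0.1 * x), add = TRUE, lty = 2)  # 25th percentile classroom
curve(plogis(2.34 + 0.1 * x), add = TRUE, lty = 2)   # 75th percentile classroom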

Common mistakes

Students didn’t always use the range of x. The most common bad answer was to just draw a logistic curve and then put some numbers on the axes.

A key lesson that I had not conveyed well in class: draw and label the axes first, then draw the curve.

Continue Reading…

Collapse

Read More

Accelerating the Nelder - Mead Method with Predictive Parallel Evaluation - implementation -

** Nuit Blanche is now on Twitter: @NuitBlog **



The Nelder–Mead (NM) method has been recently proposed for application in hyperparameter optimization (HPO) of deep neural networks. However, the NM method is not suitable for parallelization, which is a serious drawback for its practical application in HPO. In this study, we propose a novel approach to accelerate the NM method with respect to the parallel computing resources. The numerical results indicate that the proposed method is significantly faster and more efficient when compared with the previous naive approaches with respect to the HPO tabular benchmarks.
The attendant implementation is here.



Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn  or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine Learning: Meetup.com || @Archives || LinkedIn || Facebook || @ParisMLGroup
About LightOn: Newsletter || @LightOnIO || on LinkedIn || on CrunchBase || our Blog
About myself: LightOn || Google Scholar || LinkedIn || @IgorCarron || Homepage || ArXiv

Continue Reading…

Collapse

Read More

Magister Dixit

“Most companies think traditional Business Intelligence (BI) in which data is collected in warehouses, models are created based on business criteria and results are visualized through reports is sufficient. While this is true if your only concern is to answer basic questions like which customers are more profitable, it is not enough to deliver transformative business change like Data Science can. Data Science takes a different approach than BI in that insights and models are derived from the data through the application of statistical and mathematical techniques by Data Scientists. The data drives the modeling and insights. When you let the data guide you – you are less likely to try to use the data to support wrong predispositions or conclusions.” Kristen Paral ( August 23, 2014 )

Continue Reading…

Collapse

Read More

R Packages worth a look

Extensible Bootstrapped Split-Half Reliabilities (splithalfr)
Calculates scores and estimates bootstrapped split-half reliabilities for reaction time tasks and questionnaires. The ‘splithalfr’ can be extended with …

Quantile-Optimal Treatment Regimes with Censored Data (QTOCen)
Provides methods for estimation of mean- and quantile-optimal treatment regimes from censored data. Specifically, we have developed distinct functions …

Differential Risk Hotspots in a Linear Network (DRHotNet)
Performs the identification of differential risk hotspots given a marked point pattern (Diggle 2013) <doi:10.1201/b15326> lying on a linear netwo …

Themes for Base Graphics Plots (basetheme)
Functions to create and select graphical themes for the base plotting system. Contains: 1) several custom pre-made themes 2) mechanism for creating new …

Continue Reading…

Collapse

Read More

Distilled News

Why Machine Learning is vulnerable to adversarial attacks and how to fix it

Machine learning can process data imperceptible to humans to produce expected results. These inconceivable patterns are inherent in the data but may make models vulnerable to adversarial attacks. How can developers harness these features to not lose control of AI?


The risks of AI outsourcing – how to successfully work with AI startups

Corporates are battling with technology giants and AI startups for the best and brightest AI talent. They are increasingly outsourcing their AI innovations to startups to ensure they do not get left behind in the race for AI competitive advantage. However, outsourcing presents real and new risks that corporates are often ill-equipped to identify and manage. There are real cultural barriers, implied risks, and questions that corporates should ask before partnering with any AI startup.

Continue Reading…

Collapse

Read More

Stabilising transformations: how do I present my results?

(This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers)

ANOVA is routinely used in applied biology for data analyses, although, in some instances, the basic assumptions of normality and homoscedasticity of residuals do not hold. In those instances, most biologists would be inclined to adopt some sort of stabilising transformations (logarithm, square root, arcsin square root…), prior to ANOVA. Yes, there might be more advanced and elegant solutions, but stabilising transformations are suggested in most traditional biometry books, they are very straightforward to apply and they do not require any specific statistical software. I do not think that this traditional technique should be underrated.

However, the use of stabilising transformations has one remarkable drawback, it may hinder the clarity of results. I’d like to give a simple, but relevant example.

An example with counts

Consider the following dataset, that represents the counts of insects on 15 independent leaves, treated with the insecticides A, B and C (5 replicates):

dataset <- structure(data.frame(
  Insecticide = structure(c(1L, 1L, 1L, 1L, 1L, 
    2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), 
    .Label = c("A", "B", "C"), class = "factor"), 
  Count = c(448, 906, 484, 477, 634, 211, 276, 
    415, 587, 298, 50, 90, 73, 44, 26)), 
  .Names = c("Insecticide", "Count"))
dataset
##    Insecticide Count
## 1            A   448
## 2            A   906
## 3            A   484
## 4            A   477
## 5            A   634
## 6            B   211
## 7            B   276
## 8            B   415
## 9            B   587
## 10           B   298
## 11           C    50
## 12           C    90
## 13           C    73
## 14           C    44
## 15           C    26

We should not expect that a count variable is normally distributed with equal variances. Indeed, a graph of residuals against expected values shows clear signs of heteroscedasticity.

mod <- lm(Count ~ Insecticide, data=dataset)
plot(mod, which = 1)

In this situation, a logarithmic transformation is often suggested to produce a new normal and homoscedastic dataset. Therefore we take the log-transformed variable and submit it to ANOVA.

model <- lm(log(Count) ~ Insecticide, data=dataset)
print(anova(model), digits=6)
## Analysis of Variance Table
## 
## Response: log(Count)
##             Df   Sum Sq Mean Sq F value     Pr(>F)    
## Insecticide  2 15.82001 7.91000 50.1224 1.4931e-06 ***
## Residuals   12  1.89376 0.15781                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model)
## 
## Call:
## lm(formula = log(Count) ~ Insecticide, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6908 -0.1849 -0.1174  0.2777  0.5605 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.3431     0.1777  35.704 1.49e-13 ***
## InsecticideB  -0.5286     0.2512  -2.104   0.0572 .  
## InsecticideC  -2.3942     0.2512  -9.529 6.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3973 on 12 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.8753 
## F-statistic: 50.12 on 2 and 12 DF,  p-value: 1.493e-06

In this example, the standard error for each mean (SEM) corresponds to \(\sqrt{0.158/5}\). In the end, we might show the following table of means for transformed data:

Insecticide   Means (log n.)
A             6.343
B             5.815
C             3.949
SEM           0.178

Unfortunately, we lose clarity: how many insects did we have on each leaf? If we present a table like this one in our manuscript, we might be asked by our readers or by the reviewer to report the means in the original measurement unit. What should we do, then? Here are some suggestions.

  1. We can present the means of the original data with standard deviations. This is clearly less than optimal, if we want to suggest more than the bare variability of the observed sample. Furthermore, please remember that the means of original data may not be a good measure of central tendency, if the original population is strongly ‘asymmetric’ (skewed)!
  2. We can show back-transformed means. Accordingly, if we have done, e.g., a logarithmic transformation, we can exponentiate the means of transformed data and report them back to the original measurement unit. Back-transformed means ‘estimate’ the medians of the original populations, which may be regarded as better measures of central tendency for skewed data.

We suggest the use of the second method. However, this leaves us with the problem of adding a measure of uncertainty to back-transformed means. No worries, we can use the delta method to back-transform standard errors. It is straightforward:

  1. take the first derivative of the back-transformation, evaluated at the transformed mean [in this case the first derivative of exp(X), which is exp(X) itself], and
  2. multiply it by the standard error of the transformed mean.

This may be done by hand, e.g. \(\exp(6.343) \times 0.178 = 101.19\) for insecticide A. This ‘manual’ solution is always available, regardless of the statistical software at hand. With R, we can use the ‘emmeans’ package (Lenth, 2016):

library(emmeans)
countM <- emmeans(model, ~Insecticide, transform = "response")

It is enough to set the argument ‘transform’ to ‘response’, provided that the transformation is embedded in the model formula. In other words, it works if we coded the model as:

log(Count) ~ Insecticide

In contrast, it fails if we had coded the model as:

logCount ~ Insecticide

where the transformation was performed prior to fitting.
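
For clarity, here is a minimal sketch of the coding that does not allow automatic back-transformation; the transformation happens before the fit, so ‘emmeans’ never sees it (logCount and model2 are just illustrative names):

dataset$logCount <- log(dataset$Count)
model2 <- lm(logCount ~ Insecticide, data = dataset)
# emmeans has no transformation to undo here: the means stay on the log scale
emmeans(model2, ~Insecticide, transform = "response")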

Obviously, the back-transformed standard error is different for each mean (there is no homogeneity of variances on the original scale, but we knew this already). Back-transformed data might be presented as follows:

Insecticide   Mean     SE
A             568.5    101.19
B             335.1     59.68
C              51.88     9.23
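
As a cross-check, the table above can be reproduced by hand from the log-scale means and SEM reported earlier, and compared with the ‘emmeans’ object (a sketch; countM was created in the previous code chunk, and the values agree up to rounding):

logMeans <- c(A = 6.343, B = 5.815, C = 3.949)
semLog <- 0.178
# Delta method: back-transformed SE = exp(log-scale mean) * log-scale SEM
data.frame(Mean = exp(logMeans), SE = exp(logMeans) * semLog)
countM  # printing the emmeans object gives the same means and SEs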

It would be appropriate to state clearly (e.g. in a footnote) that the means and SEs were obtained by back-transformation via the delta method. Far clearer, isn’t it? As I said, there are other solutions, such as fitting a GLM, but stabilising transformations are simple and easily accepted in biological journals.

If you want to know more about the delta method, you might start from my post here. A few years ago, some colleagues and I also discussed these issues in a journal paper (Onofri et al., 2010).

Thanks for reading!

Andrea Onofri
University of Perugia (Italy)

References

  1. Lenth, R.V., 2016. Least-Squares Means: The R Package lsmeans. Journal of Statistical Software 69. https://doi.org/10.18637/jss.v069.i01
  2. Onofri, A., Carbonell, E.A., Piepho, H.-P., Mortimer, A.M., Cousens, R.D., 2010. Current statistical issues in Weed Research. Weed Research 50, 5–24.

To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

Cricket’s increasing sizzle owes much to India

Scoring rates have surged in the short T20 format

Continue Reading…

Collapse

Read More

Exploring Categorical Data With Inspectdf

(This article was first published on Alastair Rushworth, and kindly contributed to R-bloggers)

Exploring categorical data with inspectdf

What’s inspectdf and what’s it for?

I often find myself viewing and reviewing dataframes throughout the
course of an analysis, and a substantial amount of time can be spent
rewriting the same code to do this. inspectdf is an R package designed
to make common exploratory tools a bit more useful and easy to use.

In particular, it’s very powerful to be able to quickly see the contents
of categorical features. In this article, we’ll summarise how to use the
inspect_cat() function from inspectdf for summarising and
visualising categorical columns.

First of all, you’ll need to have the inspectdf package installed. You
can get it from github using

library(devtools)
install_github("alastairrushworth/inspectdf")

Then load the package in. We’ll also load dplyr for the starwars
data and for the pipe %>%.

library(inspectdf)
library(dplyr)

# check out the starwars help file
?starwars

Tabular summaries using inspect_cat()

The starwars data that comes bundled with dplyr has 7 columns that
have character class, and is therefore a nice candidate for illustrating
the use of inspect_cat. We can see this quickly using the
inspect_types() function from inspectdf.

starwars %>% inspect_types()
## # A tibble: 4 x 4
##   type        cnt  pcnt col_name 
##   <chr>     <int> <dbl> <list>   
## 1 character     7 53.8  <chr [7]>
## 2 list          3 23.1  <chr [3]>
## 3 numeric       2 15.4  <chr [2]>
## 4 integer       1  7.69 <chr [1]>

Using inspect_cat() is very straightforward:

star_cat <- starwars %>% inspect_cat()
star_cat
## # A tibble: 7 x 5
##   col_name     cnt common common_pcnt levels           
##   <chr>      <int> <chr>        <dbl> <list>           
## 1 eye_color     15 brown        24.1  <tibble [15 x 3]>
## 2 gender         5 male         71.3  <tibble [5 x 3]> 
## 3 hair_color    13 none         42.5  <tibble [13 x 3]>
## 4 homeworld     49 Naboo        12.6  <tibble [49 x 3]>
## 5 name          87 Ackbar        1.15 <tibble [87 x 3]>
## 6 skin_color    31 fair         19.5  <tibble [31 x 3]>
## 7 species       38 Human        40.2  <tibble [38 x 3]>

So what does this tell us? Each row in the tibble returned by
inspect_cat() corresponds to one categorical column (factor,
logical or character) in the starwars dataframe.

  • The cnt column tells you how many unique levels there are for each
    column. For example, there are 15 unique entries in the eye_color
    column.
  • The common column prints the most commonly occurring entry. For
    example, the most common eye_color is brown. The percentage
    occurrence is 24.1% which is shown under common_pcnt.
  • A full list of levels and occurrence frequency is provided in the
    list column levels.

A table of relative frequencies of eye_color can be retrieved by
typing

star_cat$levels$eye_color
## # A tibble: 15 x 3
##    value           prop   cnt
##    <chr>          <dbl> <int>
##  1 brown         0.241     21
##  2 blue          0.218     19
##  3 yellow        0.126     11
##  4 black         0.115     10
##  5 orange        0.0920     8
##  6 red           0.0575     5
##  7 hazel         0.0345     3
##  8 unknown       0.0345     3
##  9 blue-gray     0.0115     1
## 10 dark          0.0115     1
## 11 gold          0.0115     1
## 12 green, yellow 0.0115     1
## 13 pink          0.0115     1
## 14 red, blue     0.0115     1
## 15 white         0.0115     1

There isn’t anything here that can’t be obtained by using the base
table() function with some post-processing. inspect_cat() automates
some of that functionality and wraps it into a single, convenient
function.
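
For comparison, here is a rough base-R equivalent of the eye_color
summary shown above (a sketch; the column names simply mirror those of
inspect_cat() and are not part of the package):

tab <- sort(table(starwars$eye_color), decreasing = TRUE)
data.frame(value = names(tab),
           prop  = as.numeric(prop.table(tab)),
           cnt   = as.numeric(tab))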

Visualising categorical columns with show_plot()

An important feature of inspectdf is the ability to visualise
dataframe summaries. Visualising categories can be challenging, because
categorical columns can be very rich and contain many unique levels. A
simple stacked barplot can be produced using show_plot()

star_cat %>% show_plot()

Like the star_cat tibble returned by inspect_cat(), each row of the
plot is a single column, split by the relative frequency of occurrence
of each unique entry.

  • Some of the bars are labelled, but in cases where the bars are
    small, the labels are not shown. If you encounter categorical
    columns with really long strings, labels can be suppressed
    altogether with show_plot(text_labels = FALSE).
  • Missing values or NAs are shown as gray bars. In this case, there
    are quite a few starwars characters whose homeworld is unknown or
    missing.

Combining rare entries with show_plot()

Some of the categorical columns, like name, seem to have a lot of
unique entries. We should expect this – names are often unique (or
nearly so) in a small dataset. If we scaled this analysis up to a dataset
with millions of rows, there would be so many names with very small
relative frequencies that the name bars would be very difficult to see.
show_plot() can help with this too!

star_cat %>% show_plot(high_cardinality = 1)

By setting the argument high_cardinality = 1, all entries that occur
only once are combined into a single group labelled ‘high cardinality’.
This makes it easier to see when some entries occur only once (or
extremely rarely).

  • In the above, it’s now obvious that no two people in the starwars
    data share the same name, and that many come from a unique
    homeworld or species.
  • By setting high_cardinality = 2 or even greater, it’s possible to
    group the ‘long-tail’ of rare categories even further. With larger
    datasets, this becomes increasingly important for visualisation.
  • A practical reason to combine rare entries is plotting speed – it
    can take a long time to render a plot with tens of thousands (or
    more) unique bars! Using the high_cardinality argument can reduce
    this dramatically.

Playing with color options in show_plot()

It’s been pointed out that the default ggplot color theme isn’t
particularly friendly to color-blind audiences. A more color-blind
friendly theme is available by specifying col_palette = 1:

star_cat %>% show_plot(col_palette = 1)

I’m also quite fond of the 80s theme, obtained by choosing col_palette = 2:

star_cat %>% show_plot(col_palette = 2)

There are 5 palettes at the moment, so have a play around. Note that the
color palettes have not yet hit the CRAN version of inspectdf – that
will come soon in an update, but for now you can get them from the
github version of the package using the code at the start of the
article.

Comments? Suggestions? Issues?

Any feedback is welcome! Find me on twitter at rushworth_a or write a
github issue.

To leave a comment for the author, please follow the link and comment on their blog: Alastair Rushworth.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…

Collapse

Read More

June 14, 2019

If you did not already know

Graph-Adaptive Pruning (GAP) google
In this work, we propose a graph-adaptive pruning (GAP) method for efficient inference of convolutional neural networks (CNNs). In this method, the network is viewed as a computational graph, in which the vertices denote the computation nodes and edges represent the information flow. Through topology analysis, GAP is capable of adapting to different network structures, especially the widely used cross connections and multi-path data flow in recent novel convolutional models. The models can be adaptively pruned at vertex-level as well as edge-level without any post-processing, thus GAP can directly get practical model compression and inference speed-up. Moreover, it does not need any customized computation library or hardware support. Finetuning is conducted after pruning to restore the model performance. In the finetuning step, we adopt a self-taught knowledge distillation (KD) strategy by utilizing information from the original model, through which, the performance of the optimized model can be sufficiently improved, without introduction of any other teacher model. Experimental results show the proposed GAP can achieve promising result to make inference more efficient, e.g., for ResNeXt-29 on CIFAR10, it can get 13X model compression and 4.3X practical speed-up with marginal loss of accuracy. …

SlicStan google
Stan is a probabilistic programming language that has been increasingly used for real-world scalable projects. However, to make practical inference possible, the language sacrifices some of its usability by adopting a block syntax, which lacks compositionality and flexible user-defined functions. Moreover, the semantics of the language has been mainly given in terms of intuition about implementation, and has not been formalised. This paper provides a formal treatment of the Stan language, and introduces the probabilistic programming language SlicStan — a compositional, self-optimising version of Stan. Our main contributions are: (1) the formalisation of a core subset of Stan through an operational density-based semantics; (2) the design and semantics of the Stan-like language SlicStan, which facilities better code reuse and abstraction through its compositional syntax, more flexible functions, and information-flow type system; and (3) a formal, semantic-preserving procedure for translating SlicStan to Stan. …

Truncated-Uniform-Laplace (Tulap) google
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests can be written in terms of linear constraints, and for exchangeable data can always be expressed as a function of the empirical distribution. Using this structure, we prove a ‘Neyman-Pearson lemma’ for binomial data under DP, where the DP-UMP only depends on the sample sum. Our tests can also be stated as a post-processing of a random variable, whose distribution we coin ”Truncated-Uniform-Laplace” (Tulap), a generalization of the Staircase and discrete Laplace distributions. Furthermore, we obtain exact $p$-values, which are easily computed in terms of the Tulap random variable. Using the above techniques, we show that our tests can be applied to give uniformly most accurate one-sided confidence intervals and optimal confidence distributions. We also derive uniformly most powerful unbiased (UMPU) two-sided tests, which lead to uniformly most accurate unbiased (UMAU) two-sided confidence intervals. We show that our results can be applied to distribution-free hypothesis tests for continuous data. Our simulation results demonstrate that all our tests have exact type I error, and are more powerful than current techniques. …

Self Driving Data Curation google
Past. Data curation – the process of discovering, integrating, and cleaning data – is one of the oldest data management problems. Unfortunately, it is still the most time consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution). Present. The power of current data curation solutions are not keeping up with the ever changing data ecosystem in terms of volume, velocity, variety and veracity, mainly due to the high human cost, instead of machine cost, needed for providing the ad-hoc solutions mentioned above. Meanwhile, deep learning is making strides in achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to understanding features that are neither domain-specific nor task-specific. Future. Data curation solutions need to keep the pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We will discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some low-hanging fruits about the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions. …

Continue Reading…

Collapse

Read More

Prioritizing technical debt as if time and money mattered

Adam Tornhill offers a new perspective on software development that will change how you view code.

Continue reading Prioritizing technical debt as if time and money mattered.

Continue Reading…

Collapse

Read More

From the trenches with Rebecca Parsons

Rebecca Parsons shares the story of her career path and her work as an architect.

Continue reading From the trenches with Rebecca Parsons.

Continue Reading…

Collapse

Read More

Choices of scale

Michael Feathers explores various scaling strategies in light of research about human cognition and systems cohesion.

Continue reading Choices of scale.

Continue Reading…

Collapse

Read More

Book Memo: “Applied Data Science”

Lessons Learned for the Data-Driven Business
This book has two main goals: to define data science through the work of data scientists and their results, namely data products, while simultaneously providing the reader with relevant lessons learned from applied data science projects at the intersection of academia and industry. As such, it is not a replacement for a classical textbook (i.e., it does not elaborate on fundamentals of methods and principles described elsewhere), but systematically highlights the connection between theory, on the one hand, and its application in specific use cases, on the other.

Continue Reading…

Collapse

Read More

Document worth reading: “Lenia – Biology of Artificial Life”

We report a new model of artificial life called Lenia (from Latin lenis ‘smooth’), a two-dimensional cellular automaton with continuous space-time-state and generalized local rule. Computer simulations show that Lenia supports a great diversity of complex autonomous patterns or ‘lifeforms’ bearing resemblance to real-world microscopic organisms. More than 400 species in 18 families have been identified, many discovered via interactive evolutionary computation. We present basic observations of the model regarding the properties of space-time and basic settings. We provide a broad survey of the lifeforms, categorize them into a hierarchical taxonomy, and map their distribution in the parameter hyperspace. We describe their morphological structures and behavioral dynamics, propose possible mechanisms of their self-propulsion, self-organization and plasticity. Finally, we discuss how the study of Lenia would be related to biology, artificial life, and artificial intelligence. Lenia – Biology of Artificial Life

Continue Reading…

Collapse

Read More

How to Learn Python for Data Science the Right Way

The biggest mistake you can make while learning Python for data science is to learn Python programming from courses meant for programmers. Avoid this mistake, and learn Python the right way by following this approach.

Continue Reading…

Collapse

Read More

Thanks for reading!