My Data Science Blogs

April 20, 2018

Document worth reading: “On Cognitive Preferences and the Interpretability of Rule-based Models”

It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption, and recapitulate evidence for and against this postulate. We also report the results of an evaluation in a crowd-sourcing study, which does not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then continue to review criteria for interpretability from the psychological literature, evaluate some of them, and briefly discuss their potential use in machine learning.

Continue Reading…


Read More

How to build deep learning models with SAS

SAS® supports the creation of deep neural network models. Examples of these models include convolutional neural networks, recurrent neural networks, feedforward neural networks and autoencoder neural networks. Let’s examine in more detail how SAS creates deep learning models using SAS® Visual Data Mining and Machine Learning.

Deep learning models with SAS Cloud Analytic Services

SAS Visual Data Mining and Machine Learning takes advantage of SAS Cloud Analytic Services (CAS) to perform what are referred to as CAS actions. You use CAS actions to load data, transform data, compute statistics, perform analytics and create output. Each action is configured by specifying a set of input parameters. Running a CAS action processes the action’s parameters and data, which creates an action result. CAS actions are grouped into CAS action sets.

Deep neural net models are trained and scored using the actions in the deepLearn CAS action set. This action set consists of several actions that support the end-to-end preprocessing, developing and deploying of deep neural network models. This action set provides users with the flexibility to describe their own model directed acyclic graph (DAG) to define the initial deep net structure. There are also actions that support adding and removing of layers from the network structure.

Appropriate model descriptions and parameters are needed to build deep learning models. We first need to define the network topology as a DAG and use this model description to train the parameters of the deep net models.

A deeper dive into the deepLearn SAS CAS Action Set

An overview of the steps involved in training deep neural network models, using the deepLearn CAS action set, is as follows:

  1. Create an empty deep learning model.
    • The BuildModel() CAS action in the deepLearn action set creates an empty deep learning model in the form of a CASTable object.
    • Users can choose from DNN, RNN or CNN network types to build the respective initial network.
  2. Add layers to the model.
    • This can be implemented using the addLayer() CAS action.
    • This CAS action provides the flexibility to add various types of layers, such as the input, convolutional, pooling, fully connected, residual or output as desired.
    • The specified layers are then added to the model table.
    • Each new layer has a unique identifier name associated with it.
    • This action also makes it possible to randomly crop/flip the input layer when images are given as inputs.
  3. Remove layers from the model.
    • Carried out using the removeLayer() CAS action.
    • By specifying the necessary layer name, layers can be removed from the model table.
  4. Perform hyperparameter autotuning.
    • dlTune() helps tune the optimization parameters needed for training the model.
    • Some of the tunable parameters include learning rate, dropout, mini batch size, gradient noise, etc.
    • For tuning, we must specify lower and upper bounds for each parameter, defining the range within which we expect the optimized value to lie.
    • An initial model weights table (in the form of a CASTable) must be specified to initialize the model.
    • An exhaustive search over the specified parameter ranges is then performed, running on the same data multiple times, to determine the optimized parameter values.
    • The model weights that yield the best validation fit error are stored in a CAS table object.
  5. Train the neural net model.
    • The dlTrain() action trains the specified deep learning model for classification or regression tasks.
    • We train the model by supplying the initial model table that was built, the best model weights table produced during hyperparameter tuning, and the predictor and response variables.
    • Trained models such as DNNs can be stored as an ASTORE binary object to be deployed in the SAS Event Stream Processing engine for real-time online scoring of data.
    • Trained models such as DNNs can be stored as an ASTORE binary object to be deployed in the SAS Event Stream Processing engine for real-time online scoring of data.
  6. Score the model.
    • The dlScore() action uses the trained model to score new data sets.
    • Scoring uses the trained model information from the ASTORE binary object to generate predictions for the new data set.
  7. Export the model.
    • The dlExportModel() action exports trained neural net models to other formats.
    • ASTORE is the current binary format supported by CAS.
  8. Import the model weights table.
    • The dlImportModelWeights() action imports model weights information (initially specified as a CAS table object) from external sources.
    • The currently supported format is HDF5.
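
The step sequence above can be sketched in Python. SAS provides a Python client (swat) for invoking CAS actions; since no CAS server is available here, the snippet below uses a small stand-in recorder object in place of a live connection, and every model name, layer and parameter value is a hypothetical placeholder. It illustrates the order of deepLearn actions, not a runnable SAS session.

```python
# Sketch of the deepLearn workflow described above. A real session would
# issue these calls through SAS's swat package against a live CAS server;
# here a recorder object captures (action, params) pairs so the order of
# operations is easy to see. All names and parameter values are hypothetical.

class CASRecorder:
    """Records (action, params) pairs in place of a live CAS connection."""
    def __init__(self):
        self.calls = []

    def invoke(self, action, **params):
        self.calls.append((action, params))
        return {"action": action, "status": "ok"}

conn = CASRecorder()

# 1. Create an empty CNN model table.
conn.invoke("deepLearn.buildModel", model="myCNN", type="CNN")

# 2. Add layers one at a time; each layer gets a unique name.
conn.invoke("deepLearn.addLayer", model="myCNN", name="data",
            layer={"type": "input", "nChannels": 3})
conn.invoke("deepLearn.addLayer", model="myCNN", name="conv1",
            layer={"type": "convolution", "nFilters": 32}, srcLayers=["data"])
conn.invoke("deepLearn.addLayer", model="myCNN", name="out",
            layer={"type": "output"}, srcLayers=["conv1"])

# 4./5. Tune within lower/upper bounds, then train with the best weights.
conn.invoke("deepLearn.dlTune", model="myCNN", table="train_data",
            learningRate={"lb": 1e-4, "ub": 1e-1})
conn.invoke("deepLearn.dlTrain", model="myCNN", table="train_data",
            initWeights="best_tuned_weights", target="label")

# 6. Score new data with the trained model.
conn.invoke("deepLearn.dlScore", model="myCNN", table="new_data")

actions = [a for a, _ in conn.calls]
print(actions[0], "...", actions[-1])
```

In a real session the same sequence would, roughly, be issued on a swat.CAS connection after loading the action set with conn.loadactionset('deepLearn').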

As advances in deep learning are made, SAS will also continue to advance its deepLearn CAS action set.

This blog post is based on a SAS white paper, "How to Do Deep Learning With SAS: An introduction to deep learning neural networks and a guide to building deep learning models using SAS."

Read the complete deep learning paper now.


The post How to build deep learning models with SAS appeared first on Subconscious Musings.

Continue Reading…


Read More

If you did not already know

Stacked Deconvolutional Network (SDN) google
Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, called SDN units, are stacked one by one to integrate contextual information and guarantee the fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training and enhance feature fusion, since the connections improve the flow of information and gradient propagation throughout the network. In addition, hierarchical supervision is applied during the upsampling process of each SDN unit, which guarantees the discrimination of feature representations and benefits network optimization. We carry out comprehensive experiments and achieve new state-of-the-art results on three datasets: PASCAL VOC 2012, CamVid and GATECH. In particular, our best model without CRF post-processing achieves an intersection-over-union score of 86.6% on the test set. …

Data Version Control (DVC) google
DVC makes your data science projects reproducible by automatically building a data dependency graph (DAG). Your code and its dependencies can easily be shared via Git, and data through cloud storage (AWS S3, GCP), all in a single DVC environment. …

Damerau-Levenshtein Distance google
In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau-Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. The Damerau-Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions). In his seminal paper, Damerau stated that these four operations correspond to more than 80% of all human misspellings. Damerau’s paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau-Levenshtein distance has also seen uses in biology to measure the variation between protein sequences. …
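
The definition above maps directly onto a short dynamic program. As a minimal sketch, here is the optimal string alignment variant (the common restricted form of Damerau-Levenshtein, which never edits the same substring twice) in Python:

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein distance extended with
    transposition of adjacent characters (restricted Damerau-Levenshtein)."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("abcd", "acbd"))       # one transposition -> 1
print(osa_distance("kitten", "sitting"))  # classic Levenshtein example -> 3
```

Note that the restricted variant can differ from the unrestricted distance: for “ca” vs “abc” it returns 3, while the unrestricted Damerau-Levenshtein distance is 2.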

Continue Reading…


Read More

Introducing the Anaconda Data Science Certification Program

This program gives data scientists a way to verify their proficiency, and organizations an independent standard for qualifying current and prospective data science experts. Register now!

Continue Reading…


Read More

R/Finance 2018 Registration

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

This year marks the 10th anniversary of the R/Finance Conference! As in prior years, we expect more than 250 attendees from around the world. R users from industry, academia, and government will be joining 50+ presenters covering all areas of finance with R. The conference will take place on June 1st and 2nd at UIC in Chicago.

You can find registration information on the conference website, or you can go directly to the Cvent registration page.

Note that registration fees will increase by 50% at the end of early registration on May 21, 2018.

We are very excited about keynote presentations by JJ Allaire, Li Deng, and Norm Matloff. The conference agenda (currently) includes 18 full presentations and 33 shorter “lightning talks”. As in previous years, several (optional) pre-conference seminars are offered on Friday morning. We’re still working on the agenda, but we have another great lineup of speakers this year!

There is also an (optional) conference dinner at Wyndham Grand Chicago Riverfront in the 39th Floor Penthouse Ballroom and Terrace.  Situated directly on the riverfront, it is a perfect venue to continue conversations while dining and drinking.

We would like to thank our 2018 sponsors for their continued support, which enables us to host such an exciting conference:

  UIC Liautaud Master of Science in Finance
  R Consortium
  William Blair

On behalf of the committee and sponsors, we look forward to seeing you in Chicago!

  Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson, Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich

To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading.

Continue Reading…


Read More

Magister Dixit

“On a scale less grand, but probably more common, data analytics projects reach into all business units. Employees throughout these units must interact with the data science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making. This requires a close interaction between the data scientists and the business people responsible for decision-making. Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.” Foster Provost & Tom Fawcett ( 2013 )

Continue Reading…


Read More

Why Deep Learning is perfect for NLP (Natural Language Processing)

Deep learning brings multiple benefits in learning multiple levels of representation of natural language. Here we will cover the motivation of using deep learning and distributed representation for NLP, word embeddings and several methods to perform word embeddings, and applications.

Continue Reading…


Read More

Painless ODBC + dplyr Connections to Amazon Athena and Apache Drill with R & odbc

(This article was first published on R –, and kindly contributed to R-bloggers)

I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis package, which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and have to deal with Java separately from R, so I’m also comfortable with Java, JDBC and keeping R working with Java. I notified the #rstats Twitterverse about it and it started this thread (click on the embed to go to it — and, yes, this means Twitter is tracking you via this post unless you’ve blocked their JavaScript):

If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc package with the ODBC driver for Athena.

I, and others, have noted that ODBC on macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java and it worked fine in R.

Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an odbc + dplyr connection almost instantly:


library(DBI)
library(odbc)
library(dplyr)

dbConnect(
  odbc::odbc(),
  driver = "Simba Athena ODBC Driver",
  Schema = "redacted",
  AwsRegion = "us-east-1",
  AuthenticationType = "Default Credentials",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

some_tbl <- tbl(con, "redacted")

Apologies for the redaction and lack of output but we’ve removed the default example databases from our work Athena environment and I’m not near my personal systems, so a more complete example will have to wait until later.

The TLDR is that I can now use 100% dplyr idioms with Athena vs add one to the RJDBC driver I made for metis. The metis package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).

The downside is that I’m now even more likely to run up the AWS bill 😉

What About Drill?

I also maintain the sergeant package, which provides REST API and REST query access to Apache Drill along with a REST API DBI driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago, so I made the package (which was also a great learning experience).

I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.


dbConnect(
  odbc::odbc(),
  driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
  ConnectionType = "Zookeeper",
  AuthenticationType = "No Authentication",
  ZkQuorum = "HOST:2181",
  AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc;"
) -> drill_con

(employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`")))
## # Source:   SQL [?? x 16]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
##    employee_id   full_name    first_name last_name position_id   position_title   store_id  
##  1 1             Sheri Nowmer Sheri      Nowmer    1             President        0         
##  2 2             Derrick Whe… Derrick    Whelply   2             VP Country Mana… 0         
##  3 4             Michael Spe… Michael    Spence    2             VP Country Mana… 0         
##  4 5             Maya Gutier… Maya       Gutierrez 2             VP Country Mana… 0         
##  5 6             Roberta Dam… Roberta    Damstra   3             VP Information … 0         
##  6 7             Rebecca Kan… Rebecca    Kanagaki  4             VP Human Resour… 0         
##  7 8             Kim Brunner  Kim        Brunner   11            Store Manager    9         
##  8 9             Brenda Blum… Brenda     Blumberg  11            Store Manager    21        
##  9 10            Darren Stanz Darren     Stanz     5             VP Finance       0         
## 10 11            Jonathan Mu… Jonathan   Murraiin  11            Store Manager    1         
## # ... with more rows, and 9 more variables: department_id , birth_date ,
## #   hire_date , salary , supervisor_id , education_level ,
## #   marital_status , gender , management_role

count(employee, position_title, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
##    position_title            n              
##  1 Store Temporary Checker   268            
##  2 Store Temporary Stocker   264            
##  3 Store Permanent Checker   226            
##  4 Store Permanent Stocker   222            
##  5 Store Shift Supervisor    52             
##  6 Store Permanent Butcher   32             
##  7 Store Manager             24             
##  8 Store Assistant Manager   24             
##  9 Store Information Systems 16             
## 10 HQ Finance and Accounting 8              
## # ... with more rows

Apart from having to use sql(…) to make the table connection work, it was pretty painless, and I had both Athena and Drill working with dplyr verbs in under ten minutes (total).

You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters, and I’ll be updating my ongoing Using Apache Drill with R e-book to include this information. I will also keep maintaining the existing sergeant package, but will be including some additional methods to provide ODBC usage guidance and potentially other helpers if any “gotchas” arise.


The odbc package is super-slick and it’s refreshing to be able to use dplyr verbs with Athena vs gosh-awful SQL. For some of our needs, though, hand-crafted queries will still be necessary, as they are far more optimized than what would likely get pieced together via the dplyr verbs. Those queries can also be put right into sql() with the Athena ODBC driver connection and used via the same dplyr verb magic afterwards.

Today is, indeed, a good day to query!

To leave a comment for the author, please follow the link and comment on their blog: R –

Continue Reading…


Read More

Neural Network based Startup Name Generator

How to build a recurrent neural network to generate suggestions for your new company’s name.

Continue Reading…


Read More

Apple: Commerce Data Scientist – Apple Media Products

Seeking someone with a love for data. This position involves working on very large scale data mining, cleaning, analysis, deep-level processing, machine learning or statistical modeling, metrics tracking and evaluation.

Continue Reading…


Read More

Most banks won’t touch America’s legal pot industry

IT IS often said that markets hate uncertainty. America’s marijuana industry is no exception. Earlier this year Jeff Sessions, the country’s attorney-general, rescinded a set of federal guidelines for marijuana-related businesses operating in states where the drug is legal.

Continue Reading…


Read More

G Research: Data Scientist – Data Intelligence

Seeking a Senior Data Scientist to join the Data Intelligence Team, to analyse and verify the fidelity and cleanliness of diverse data sources and generate analytics to help determine the usefulness of data sets in the investment process.

Continue Reading…


Read More

Bacon Bytes for 20-April


As I’m writing this edition of Bacon Bytes it is snowing. In April. Again. Go home, Winter, you’re drunk.

This past weekend I was having a discussion with friends about how Facebook tracks you even if you aren’t a member of Facebook. And while not an unusual practice, most people have no idea this is happening. Or allowed. So, if you want Facebook to stop tracking you, all you need to do is join Facebook. This is why Zuckerberg is a billionaire and I’m writing weekly recap posts.

In a sure sign that hell has frozen over, Microsoft released its first application for Linux this week. Microsoft built their own custom Linux kernel for their new IoT security service, Azure Sphere. Although it is not available yet (and I haven’t even heard about a private preview), this announcement is most interesting for two reasons. First, Microsoft built their own Linux distribution, which means Linus has won. No word yet on what his winnings are, but I’m guessing it’s a bunch of Bing rewards points. The second reason is that Amazon announced a similar service at re:Invent last year. So, the two major cloud providers are looking to corner the market on IoT security.

Here’s a good example of the current state of IoT security. Hackers stole data from a casino via a thermometer in a lobby fish tank. Just let that sink in for a moment. That’s why Azure Sphere is so important. Microsoft and Amazon need to protect people from themselves.

Microsoft, along with 33 other companies, signed an anti-cyberattack pledge this week. This announcement was timed with the start of the RSA conference, because that’s the place to let the world know you are committed to securing customers’ data and devices. Unless you are Apple, Google, or (surprisingly) Amazon. Those companies chose not to sign the pledge. I’ve no idea why Amazon would not have signed; perhaps nobody could get a hold of Jeff Bezos in time.

SolarWinds released their annual IT Trends report. The report is the result of a survey of over 800 companies worldwide. The biggest takeaway this year is that the report shows the gap that exists between management and labor. If you have ever sat at your desk and thought your management was a bit crazy, this is the report for you. It shows that a gap exists, and has data to explain why. Essentially, people are too busy keeping the lights on to keep pace with emerging technology. This isn’t a new problem, I’m sure, but it is nice to have data to help understand the gap, and to figure out what actions to take next.

If you’ve ever thought about getting up and walking out of YAUM (Yet Another Useless Meeting), Elon Musk has your back. Musk shared some productivity tips, such as leaving meetings or hanging up the phone when needed. Most of the rules Musk lays out for employees follow common sense. If your time is better spent elsewhere, don’t stay in a meeting where you no longer add value. Unfortunately, common sense isn’t so common.

The post Bacon Bytes for 20-April appeared first on Thomas LaRock.

Continue Reading…


Read More

Viacom’s Journey to Improving Viewer Experiences with Real-time Analytics at Scale

With over 4 billion subscribers, Viacom is focused on delivering amazing viewing experiences to their global audiences. Core to this strategy is ensuring petabytes of streaming content is delivered flawlessly through web, mobile and streaming applications. This is critically important during popular live events like the MTV Video Music Awards.

Streaming this much video can strain delivery systems, resulting in long load times, mid-stream freezes and other issues. Not only does this create a poor experience, but it can also result in lost ad dollars. To combat this, Viacom set out to build a scalable analytics platform capable of processing terabytes of streaming data for real-time insights on the viewer experience.

After evaluating a number of technologies, Viacom found their solution in Amazon S3 and the Databricks Unified Analytics Platform powered by Apache Spark™. The rapid scalability of S3, coupled with the ease and processing power of Databricks, enabled Viacom to rapidly deploy and scale Spark clusters and unify their entire analytics stack – from basic SQL to advanced analytics on large-scale streaming and historical datasets – with a single platform.

To learn more, join our webinar How Viacom Revolutionized Audience Experiences with Real-Time Analytics and AI on Apr 25 at 10:00 am PT.

The webinar will cover:

  • Why Viacom chose Databricks, Spark and AWS for scalable real-time insights and AI
  • How a unified platform for ad-hoc, batch, and real-time data analytics enabled them to improve content delivery
  • What it takes to create a self service analytics platform for business users, analysts, and data scientists

Register to attend this session.


Try Databricks for free. Get started today.

The post Viacom’s Journey to Improving Viewer Experiences with Real-time Analytics at Scale appeared first on Databricks.

Continue Reading…


Read More

G Research: Data Scientist

Seeking a data scientist to help us diversify and extend the capabilities of the Security Data Science team.

Continue Reading…


Read More

NLP – Building a Question Answering Model

In this blog, I want to cover the main building blocks of a question answering model.

Continue Reading…


Read More

G Research: Data Science Tooling Expert

Seeking a candidate to identify and research new data science and machine learning technologies and support POCs of these technologies as well as other kinds of underlying infrastructure.

Continue Reading…


Read More

Packaging Shiny applications: A deep dive

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

(Or, how to write a Shiny app.R file that only contains a single line of code)

Mark Sellors, Head of Data Engineering

This post is long overdue. The information contained herein has been built up over years of deploying and hosting Shiny apps, particularly in production environments, and mainly where those Shiny apps are very large and contain a lot of code.

Last year, during some of my conference talks, I told the story of Mango’s early adoption of Shiny and how it wasn’t always an easy path to production for us. In this post I’d like to fill in some of the technical background and provide some information about Shiny app publishing and packaging that is hopefully useful to a wider audience.

I’ve figured out some of this for myself, but the most pivotal piece of information came from Shiny creator, Joe Cheng. Joe told me some time ago, that all you really need in an app.R file is a function that returns a Shiny application object. When he told me this, I was heavily embedded in the publication side and I didn’t immediately understand the implications.

Over time though I came to understand the power and flexibility that this model provides and, to a large extent, that’s what this post is about.

What is Shiny?

Hopefully if you’re reading this you already know, but Shiny is a web application framework for R. It allows R users to develop powerful web applications entirely in R, without having to understand HTML, CSS and JavaScript. It also allows us to embed the statistical power of R directly into those web applications.

Shiny apps generally consist of either a ui.R and a server.R (containing user interface and server-side logic respectively) or a single app.R which contains both.

Why package a Shiny app anyway?

If your app is small enough to fit comfortably in a single file, then packaging your application is unlikely to be worth it. As with any R script though, when it gets too large to be comfortably worked with as a single file, it can be useful to break it up into discrete components.

Publishing a packaged app will be more difficult, but to some extent that will depend on the infrastructure you have available to you.

Pros of packaging

Packaging is one of the many great features of the R language. Packages are fairly straightforward, quick to create and you can build them with a host of useful features like built-in documentation and unit tests.

They also integrate really nicely into Continuous Integration (CI) pipelines and are supported by tools like Travis. You can also get test coverage reports using tools such as covr.

They’re also really easy to share. Even if you don’t publish your package to CRAN, you can still share it on GitHub and have people install it with devtools, or build the package and share that around, or publish the package on a CRAN-like system within your organisation’s firewall.

Cons of packaging

Before you get all excited and start to package your Shiny applications, you should be aware that — depending on your publishing environment — packaging a Shiny application may make it difficult or even impossible to publish to a system like Shiny Server or RStudio Connect, without first unpacking it again.

A little bit of Mango history

This is where Mango were in the early days of our Shiny use. We had a significant disconnect between our data scientists writing the Shiny apps and the IT team tasked with supporting the infrastructure they used. This was before we’d committed to having an engineering team that could sit in the middle and provide a bridge between the two.

When our data scientists would write apps that got a little large, or that they wanted robust tests and documentation for, they would stick them in packages and send them over to me to publish to our original Shiny Server. The problem was: R packages didn’t really mean anything to me at the time. I knew how to install them, but that was about as far as it went. I knew from the Shiny docs that a Shiny app needs certain files (a server.R and ui.R, or an app.R file), but that wasn’t what I got, so I’d send it back to the data science team and tell them that I needed those files or I wouldn’t be able to publish it.

More than once I got back a response along the lines of, “but you just need to load it up and then do runApp()”. But, that’s just not how Shiny Server works. Over time, we’ve evolved a set of best practices around when and how to package a Shiny application.

The first step was taking the leap into understanding Shiny and R packages better. It was here that I started to work in the space between data science and IT.

How to package a Shiny application

If you’ve seen the simple app you get when you choose to create a new Shiny application in RStudio, you’ll be familiar with the basic structure of a Shiny application. You need to have a UI object and a server function.

If you have a look inside the UI object you’ll see that it contains the html that will be used for building your user interface. It’s not everything that will get served to the user when they access the web application — some of that is added by the Shiny framework when it runs the application — but it covers off the elements you’ve defined yourself.

The server function defines the server-side logic that will be executed for your application. This includes code to handle your inputs and produce outputs in response.

The great thing about Shiny is that you can create something awesome quite quickly, but once you’ve mastered the basics, the only limit is your imagination.

For our purposes here, we’re going to stick with the ‘geyser’ application that RStudio gives you when you click to create a new Shiny Web Application. If you open up RStudio, and create a new Shiny app — choosing the single file app.R version — you’ll be able to see what we’re talking about. The small size of the geyser app makes it ideal for further study.

If you look through the code you’ll see that there are essentially three components: the UI object, the server function, and the shinyApp() function that actually runs the app.
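As a reference point, here is a close paraphrase of that single-file geyser app (lightly adapted: the result of shinyApp() is assigned to a variable rather than left as the last expression, so the app object can be inspected or passed to runApp() explicitly):

```r
library(shiny)

# The UI object: a slider input and a plot output
ui <- fluidPage(
  titlePanel("Old Faithful Geyser Data"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 1, max = 50, value = 30)
    ),
    mainPanel(plotOutput("distPlot"))
  )
)

# The server function: renders a histogram of waiting times
# from the built-in 'faithful' dataset
server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- faithful[, 2]  # waiting times between eruptions
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = "darkgray", border = "white")
  })
}

# shinyApp() ties the two together into an app object
app <- shinyApp(ui = ui, server = server)
```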

Building an R package of just those three components is a case of breaking them out into the constituent parts and inserting them into a blank package structure. We have a version of this up on GitHub that you can check out if you want.

The directory layout of the demo project looks like this:

|-- R
|   |-- launchApp.R
|   |-- shinyAppServer.R
|   `-- shinyAppUI.R
|-- inst
|   `-- shinyApp
|       `-- app.R
|-- man
|   |-- launchApp.Rd
|   |-- shinyAppServer.Rd
|   `-- shinyAppUI.Rd
`-- shinyAppDemo.Rproj

Once the app has been adapted to sit within the standard R package structure, we’re almost done. The UI object and server function don’t really need to be exported, and we’ve just put a really thin wrapper function around shinyApp() — I’ve called it launchApp() — which we’ll actually use to launch the app. If you install the package from GitHub with devtools, you can see it in action.


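That install-and-run step might look like the following at the R console (the GitHub repository path is a placeholder; substitute wherever the package actually lives):

```r
# install the demo package from GitHub (repository path is a placeholder)
devtools::install_github("<account>/shinyAppDemo")

# launch the app via the package's exported wrapper function
shinyAppDemo::launchApp()
```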
This will start the Shiny application running locally.

The approach outlined here also works fine with Shiny Modules, either in the same package, or called from a separate package.
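As a sketch of what a packaged module might look like (the counter module below is illustrative, not part of the demo package; it follows the callModule() convention current at the time of writing):

```r
library(shiny)

# UI half of the module: all input/output IDs are namespaced with NS()
counterUI <- function(id) {
  ns <- NS(id)
  tagList(
    actionButton(ns("inc"), "Increment"),
    textOutput(ns("count"))
  )
}

# Server half of the module, invoked from the app's server function
# with: callModule(counterServer, "somecounter")
counterServer <- function(input, output, session) {
  count <- reactiveVal(0)
  observeEvent(input$inc, count(count() + 1))
  output$count <- renderText(count())
}
```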

And that’s almost it! The only thing remaining is how we might deploy this app to Shiny server (including Shiny Server Pro) or RStudio Connect.

Publishing your packaged Shiny app

We already know that Shiny Server and RStudio Connect expect either a ui.R and a server.R or an app.R file. We’re running our application out of a package with none of this, so we won’t be able to publish it until we fix this problem.

The solution we’ve arrived at is to create a directory called ‘shinyApp’ inside the inst directory of the package. For those of you who are new to R packaging, the contents of the ‘inst’ directory are left untouched by the package build process and simply copied to the top level of the installed package, so it’s an ideal place to put little extras like this.

The name ‘shinyApp’ was chosen for consistency with Shiny Server, which uses a ‘ShinyApps’ directory if a user is allowed to serve applications from their home directory.

Inside this directory we create a single ‘app.R’ file with the following line in it:


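For our demo package, that one line just calls the wrapper exported by the package, so the whole of inst/shinyApp/app.R is:

```r
# inst/shinyApp/app.R -- delegates straight to the package
shinyAppDemo::launchApp()
```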
And that really is it. This one file will allow us to publish our packaged application under some circumstances, which we’ll discuss shortly.

Here’s where having a packaged Shiny app can get tricky, so we’re going to talk you through the options and do what we can to point out the pitfalls.

Shiny Server and Shiny Server Pro

Perhaps surprisingly — given that Shiny Server is the oldest method of Shiny app publication — it’s also the easiest one to use with these sorts of packaged Shiny apps. There are basically two ways to publish on Shiny Server: from your home directory on the server (also known as self-publishing), or from a central location, usually the directory ‘/srv/shiny-server’.

The central benefit of this approach is the ability to update the application just by installing a newer version of the package. Sadly though, it’s not always an easy approach to take.

Apps served from home directory (AKA self-publishing)

The first publication method is from a user’s home directory. This is generally used in conjunction with RStudio Server. In the self-publishing model, Shiny Server (and Pro) expects apps to be found in a directory called ‘ShinyApps’ within the user’s home directory. This means that if we install a Shiny app in a package, the final location of the app directory will be inside the installed package, not in the ShinyApps directory. To work around this, we create a link from where the app is expected to be to where it actually is within the installed package structure.

So in the example of our package, we’d do something like this in a terminal session:

# make sure we’re in our home directory
cd ~
# change into the ShinyApps directory
cd ShinyApps
# create a link from our app directory inside the package
ln -s /home/sellorm/R/x86_64-pc-linux-gnu-library/3.4/shinyAppDemo/shinyApp ./testApp

Note: The path you will find your libraries in will differ from the above. Check by running .libPaths()[1] and then dir(.libPaths()[1]) to see if that’s where your packages are installed.

Once this is done, the app should be available at ‘http://&lt;server-address&gt;:3838/&lt;username&gt;/testApp/’. To deploy a new version, just update the installed package; the changes will be served by Shiny Server straight away.

Apps served from a central location (usually /srv/shiny-server)

This is essentially the same as above, but the task of publishing the application generally falls to an administrator of some sort.

Since they would have to transfer files to the server and log in anyway, it shouldn’t be too much of an additional burden to install a package while they’re there. Especially if that makes life easier from then on.

The admin would need to transfer the package to the server, install it and then create a link — just like in the example above — from the expected location, to the installed location.

The great thing with this approach is that when updates are due to be installed the admin only has to update the installed package and not any other files.
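That admin workflow might be sketched as follows (the tarball name and the served app name are illustrative):

```shell
# install the package system-wide from a source tarball
sudo Rscript -e 'install.packages("shinyAppDemo_0.1.0.tar.gz", repos = NULL, type = "source")'

# find the app directory inside the installed package
APP_DIR=$(Rscript -e 'cat(system.file("shinyApp", package = "shinyAppDemo"))')

# link it into the centrally served location
sudo ln -s "$APP_DIR" /srv/shiny-server/testApp
```

On subsequent releases, only the first step needs repeating; the symlink continues to point at the app inside the updated package.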

RStudio Connect

Connect is the next-generation Shiny Server. In terms of features and performance, it’s far superior to its predecessor. One of the best features is the ability to push Shiny app code directly from the RStudio IDE. For the vast majority of users, this is a huge productivity boost, since you no longer have to wait for an administrator to publish your app for you.

Since publishing doesn’t require anyone to log directly into the server, there aren’t really any straightforward opportunities to install a custom package. This means that, in general, publishing a packaged Shiny application isn’t really possible.

There’s only one real workaround for this situation that I’m aware of. If you have an internal CRAN-like repository for your custom packages, you should be able to use that to update Connect, with a little work.

You’d need to have your dev environment and Connect hooked up to the same repo. The updated app package needs to be available in that repo and installed in your dev environment. Then you could publish the single-line app.R, and republish it for each successive package version.

Connect uses packrat under the hood, so when you publish the app.R the packrat manifest will also be sent to the server. Connect will use the manifest to decide which packages are required to run your app. If you’re using a custom package this would get picked up and installed or updated during deployment.

shinyapps.io

It’s not currently possible to publish a packaged application to shinyapps.io. You’d need to make sure your app follows the accepted conventions for creating Shiny apps and only uses files, rather than custom packages.


Packaging Shiny apps can be a real productivity boon for you and your team. In situations where you can integrate that process into other processes, such as automatically running your unit tests or automating publication, it can also help you adopt devops-style workflows.

However, in some instances, the practice can actually make things worse and really slow you down. It’s essential to understand what the publishing workflow is in your organisation before embarking on any significant Shiny packaging project as this will help steer you towards the best course of action.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

Continue Reading…


Read More

4 Ways IT Managers Should Expand Their Skills for Career Growth

Whether you’re already working in IT management or still on your way up, you should be sure of one thing: that you’re as comfortable working with people as you are working with tech. As a leader in IT, you’re responsible for much more than technology. You’ll likely spend as much –

The post 4 Ways IT Managers Should Expand Their Skills for Career Growth appeared first on Dataconomy.

Continue Reading…


Read More

Understanding What is Behind Sentiment Analysis – Part 2

Fine-tuning our sentiment classifier...

Continue Reading…


Read More

Microsoft Weekly Data Science News for April 20, 2018

Here are the latest articles from Microsoft regarding cloud data science products and updates.


Continue Reading…


Read More

AI, Machine Learning and Data Science Roundup: April 2018

A monthly roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications I've noted over the past month or so.

Open Source AI, ML & Data Science News

An interface between R and Python: reticulate.

TensorFlow Hub: A library for reusable machine learning modules.

TensorFlow.js: Browser-based machine learning with WebGL acceleration. 

Download data from Kaggle with the Kaggle API.

Industry News

Tensorflow 1.7 supports the TensorRT library for faster computation on NVIDIA GPUs.

RStudio now provides a Tensorflow template in Paperspace for computation with NVIDIA GPUs.

Google Cloud Text-to-Speech provides natural speech in 32 voices and 12 languages.

Amazon Translate is now generally available.

Microsoft News

ZDNet reviews The Future Computed: "do read it to remind yourself how much preparation is required for the impact of AI".

Microsoft edges closer to quantum computer based on Majorana fermions (Bloomberg).

Microsoft’s Shared Innovation Initiative.

Azure Sphere: a new chip, Linux-based OS, and cloud services to secure IoT devices.

Microsoft’s Brainwave makes Bing’s AI over 10 times faster (Venturebeat).

Improvements for Python developers in the March 2018 release of Visual Studio Code.

A review of the Azure Data Science Virtual Machine with a focus on deep learning with GPUs.

Azure Media Analytics services: motion, face and text detection and semantic tagging for videos.

Learning resources

Training SqueezeNet in Azure with MXNet and the Data Science Virtual Machine.

Microsoft's Professional Program in AI, now available to the public as an EdX course

Run Python scripts on demand with Azure Container Instances.

How to train multiple models simultaneously with Azure Batch AI.

Scaling models to Kubernetes clusters with Azure ML Workbench.

A Beginner’s Guide to Quantum Computing and Q#.

A "weird" introduction to Deep Learning, by  Favio Vázquez.

Berkeley's Foundation of Data Science course now available online, for free.

Find previous editions of the monthly AI roundup here.

Continue Reading…


Read More

Carol Nickerson investigates an unfounded claim of “17 replications”

Carol Nickerson sends along this report in which she carefully looks into the claim that the effect of power posing on feelings of power has replicated 17 times. Also relevant to the discussion is this post from a few months ago by Joe Simmons, Leif Nelson, and Uri Simonsohn.

I am writing about this because the claims of replication have been receiving wide publicity, and so, to the extent that these claims are important and worth publicizing, it’s also important to point out their errors. Everyone makes scientific mistakes—myself included—and the fact that some mistakes were made regarding claimed replications is not intended in any way to represent a personal criticism of anyone involved.

The post Carol Nickerson investigates an unfounded claim of “17 replications” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Traits you’ll find in good managers

Work with your manager to get what you need, when you need it.

Continue reading Traits you’ll find in good managers.

Continue Reading…


Read More

Thinking beyond bots: How AI can drive social impact

A few ways to think differently and integrate innovation and AI into your company's altruistic pursuits.

What do artificial intelligence (AI), invention, and social good have in common? While on the surface they serve very different purposes, at their core, they all require you to do one thing in order to be successful at them: think differently.

Take the act of inventing—in order to develop a great patent, trade secret, or other intellectual property, you need to think outside of the box. Similarly, at the heart of AI is the act of unlocking new capabilities, whether that’s making virtual personal assistants like Alexa more useful, or creating a chatbot that provides a personalized experience to customers. And because of the constantly changing economic and social landscapes, coming up with impactful social good initiatives requires you to constantly approach things through a new lens.

Individually, these fields have seen notable advancements over the past year, including new technologies that are bringing improvements to AI and large companies that are prioritizing giving back. But even more exciting is that we’re seeing more and more business leaders and nonprofits combining AI, innovation, and social good to reach communities in innovative ways, at a scale we’ve never before seen.

There’s no better time than now to explore how your organization approaches your social good efforts. Here are a few ways you can think differently and integrate innovation and AI into your company’s altruistic pursuits.

Approach social good through the mind of an inventor

As a master inventor at IBM, I’m part of the team responsible for helping the company become the leading recipient of U.S. patents for the last quarter century. While developing patents and intellectual properties might not be what you’re setting out to do as part of your humanitarian efforts, the way we approach our jobs as inventors is something that can be applied across all aspects of giving back. Consider the United Nations’ 17 Sustainable Development Goals, which aim to eradicate things like poverty, hunger, disease, and more. These are game-changing initiatives that definitely require new ideas. What’s more, the United Nations estimates that we’re $5 trillion short on resources needed to accomplish these goals. How do we bridge this gap? Well, we need to start thinking differently.

Foundationally, coming up with a great invention is identifying a problem that needs to be solved and coming up with an out-of-the-box idea that’s smart, has the biggest impact, and the lowest risk. To do this, we look around us to see which relevant technologies we can use that are already at our disposal so we don’t have to completely reinvent the wheel if we don’t have to. We also identify which parts of the solution need a completely new idea to be created from scratch. Additionally, we look at the issue we’re trying to solve and the current landscape as a whole so we can predict any issues or future problems that may arise, and we try to address them ahead of time in our invention.

The same approach should be applied to social good—identify the problem you want to solve, the tools that already exist that can help you solve this dilemma, and the resources that need to be created or brought in from outside properties in order to execute your plan. At the heart of social good, similar to most inventions, are the people you’re trying to help. You need to make sure you’re maximizing the reach of your project while also minimizing any risks that may unintentionally create additional problems for the people you’re trying to help. To do this, you need to be creative in your approach.

As an example, this is exactly the approach InvestEd is taking (full disclosure: I am an advisor for InvestEd). They started off by realizing they could commercialize and create social good at the same time by enabling financial education and facilitating microloans for small businesses in emerging markets. Helping these small businesses grow added more value to the small, local communities. And to make their product even better, InvestEd is adding AI capabilities to widen their offerings and provide a more innovative user experience.

AI: Unlocking new capabilities

To grab the value and create disruptive AI technology for social good ideas, we have to think beyond the typical automation activities of a machine. Take Guiding Eyes, for example, which is using AI to discover the secrets behind successful guide dogs. By taking advantage of natural language processing (NLP) on structured and unstructured data, the system they’re using is trained to find correlations to successful dogs among genetic, health, temperament, and environmental factors—and the technology continues to learn and get better. By using AI, Guiding Eyes has seen a 10% increase in guide dog graduation rates, helping the organization meet the growing demand for guide dogs.

There are many other examples of AI being used for the betterment of society. For example, PAWS is an organization that uses machine learning to predict where poachers may strike, or Dr. Eric Elster, who worked with the Walter Reed National Military Medical Center to apply machine learning techniques to improve the treatment of U.S. service members injured in combat.

Best practices for getting started

These are just a few ideas for how AI can be used for social good — there are still plenty of opportunities out there. The challenge is how to get started, so here are three best practices I’d like to share to help people who want to embark on this journey.

  1. First, build your understanding of what AI is and is not through great online learning, such as Intro to Artificial Intelligence, led by Peter Norvig and Sebastian Thrun, on Udacity.
  2. Second, think differently. AI is a different computing model. Instead of thinking about use cases and scenarios, really focus on the problem you want to solve. Think about the ideal scenario for solving it, and then see if a machine can be trained to do that work. Let’s consider personalized education, particularly reading comprehension (which has been shown to have a tremendous impact on a child’s long-term educational performance across all subjects). With a traditional use case approach, we would probably try to develop a general framework that would help in a handful of scenarios. Learning Ovations, however, thought about the ideal scenario and realized there are too many possible scenarios to program, or for any general framework to cover. Instead, they’re training AI to assess each child’s performance (across traditional metrics and some new ones) as a tool for educators and parents. In addition, they’re creating an AI-powered recommendation engine based on each individual school’s curriculum, giving educators another tool to create a customized reading program for each student. In short, Learning Ovations thought differently about how to personalize education.
  3. Third, set aside preconceived notions. There are things that people are better than machines at doing, but there are things machines are better than people at doing—some of which may be surprising. For example, people seem to be more honest in sharing health or financial information with a machine than a person because they don’t worry about being judged. This typically means the machine gets more accurate data to provide recommendations. Thus, recognizing that a machine might be as capable in some areas could unlock whole new capabilities.

When it comes to AI, invention, and social good, the possibilities are endless. Technology will only continue to become more advanced, creating new opportunities to fix societal problems related to health, sustainability, conservation, accessibility, and much more. If you’re thinking of jumping into AI for good, just remember the most important rule: think differently.

Continue reading Thinking beyond bots: How AI can drive social impact.

Continue Reading…


Read More

Four short links: 20 April 2018

Functional Programming, High-Dimensional Data, Games and Datavis, and Container Management

  1. Interview with Simon Peyton-Jones -- I had always assumed that the more bleeding-edge changes to the type system, things like type-level functions, generalized algebraic data types (GADTs), higher rank polymorphism, and existential data types, would be picked up and used enthusiastically by Ph.D. students in search of a topic, but not really used much in industry. But in fact, it turns out that people in companies are using some of these still-not-terribly-stable extensions. I think it's because people in companies are writing software that they want to still be able to maintain and modify in five years time. SPJ is one of the creators of Haskell, and one of the leading thinkers in functional programming.
  2. HyperTools -- A Python toolbox for visualizing and manipulating high-dimensional data. Open source. High-dimensional = "a lot of columns in each row".
  3. What Videogames Have to Teach Us About Data Visualization -- super-interesting exploration of space, storytelling, structure, and annotations.
  4. Titus -- Netflix open-sourced their container management platform. There aren't many companies with the scale problems of Amazon, Netflix, Google, etc., so it's always interesting to see what comes out of them.

Continue reading Four short links: 20 April 2018.

Continue Reading…


Read More

Here’s what you get when you cross dinosaurs and flowers with deep learning

Neural networks have shown usefulness with a number of things, but here is an especially practical use case. Chris Rodley used neural networks to create a hybrid of a dinosaur book and a flower book. The world may never be the same again.


Continue Reading…


Read More

Videos: Computational Theories of the Brain, Simons Institute for the Theory of Computing

Monday, April 16th, 2018
8:30 am – 8:50 am: Coffee and Check-In
8:50 am – 9:00 am: Opening Remarks
9:00 am – 9:45 am: The Prefrontal Cortex as a Meta-Reinforcement Learning System (Matthew Botvinick, DeepMind Technologies Limited, London and University College London)
9:45 am – 10:30 am: Working Memory Influences Reinforcement Learning Computations in Brain and Behavior (Anne Collins, UC Berkeley)
10:30 am – 11:00 am
11:00 am – 11:45 am: Predictive Coding Models of Perception (David Cox, Harvard University)
11:45 am – 12:30 pm: TBA (Sophie Denève, Ecole Normale Supérieure)
12:30 pm – 2:30 pm
2:30 pm – 3:15 pm: Towards Biologically Plausible Deep Learning: Early Inference in Energy-Based Models Approximates Back-Propagation (Asja Fischer, University of Bonn)
3:15 pm – 4:00 pm: Neural Circuitry Underlying Working Memory in the Dorsolateral Prefrontal Cortex (Veronica Galvin, Yale University)
4:00 pm – 5:00 pm

Tuesday, April 17th, 2018
8:30 am – 9:00 am: Coffee and Check-In
9:00 am – 9:45 am: TBA (Surya Ganguli, Stanford University)
9:45 am – 10:30 am: Does the Neocortex Use Grid Cell-Like Mechanisms to Learn the Structure of Objects? (Jeff Hawkins, Numenta)
10:30 am – 11:00 am
11:00 am – 11:45 am: Dynamic Neural Network Structures Through Stochastic Rewiring (Robert Legenstein, Graz University of Technology)
11:45 am – 12:30 pm: Backpropagation and Deep Learning in the Brain (Timothy Lillicrap, DeepMind Technologies Limited, London)
12:30 pm – 2:30 pm
2:30 pm – 3:15 pm: An Algorithmic Theory of Brain Networks (Nancy Lynch, Massachusetts Institute of Technology)
3:15 pm – 4:00 pm: Networks of Spiking Neurons Learn to Learn and Remember (Wolfgang Maass, Graz University of Technology)
4:00 pm – 4:30 pm
4:30 pm – 5:30 pm: Plenary Discussion: What Is Missing in Current Theories of Brain Computation?

Wednesday, April 18th, 2018
8:30 am – 9:00 am: Coffee and Check-In
9:00 am – 9:45 am: Functional Triplet Motifs Underlie Accurate Predictions of Single-Trial Responses in Populations of Tuned and Untuned V1 Neurons (Jason MacLean, University of Chicago)
9:45 am – 10:30 am: The Sparse Manifold Transform (Bruno Olshausen, UC Berkeley)
10:30 am – 11:00 am
11:00 am – 11:45 am: Playing Newton: Automatic Construction of Phenomenological, Data-Driven Theories and Models (Ilya Nemenman, Emory University)
11:45 am – 12:30 pm: A Functional Classification of Glutamatergic Circuits in Cortex and Thalamus (S. Murray Sherman, University of Chicago)
12:30 pm – 2:30 pm
2:30 pm – 3:15 pm: On the Link Between Energy & Information for the Design of Neuromorphic Systems (Narayan Srinivasa, Eta Compute)
3:15 pm – 4:00 pm: Neural Circuit Representation of Multiple Cognitive Tasks: Clustering and Compositionality (XJ Wang, New York University)
4:00 pm – 4:30 pm
4:30 pm – 5:30 pm: Plenary Discussion: How Can One Test/Falsify Current Theories of Brain Computation?

Thursday, April 19th, 2018
8:30 am – 9:00 am: Coffee and Check-In
9:00 am – 9:45 am: Control of Synaptic Plasticity in Deep Cortical Networks (Pieter Roelfsema, University of Amsterdam)
9:45 am – 10:30 am: Computation with Assemblies (Christos Papadimitriou, Columbia University)
10:30 am – 11:00 am
11:00 am – 11:45 am: Capacity of Neural Networks for Lifelong Learning of Composable Tasks (Les Valiant, Harvard University)
11:45 am – 12:30 pm: An Integrated Cognitive Architecture (Greg Wayne, Columbia University)

Continue Reading…


Read More

What’s new on arXiv

Deep Generative Networks For Sequence Prediction

This thesis investigates unsupervised time series representation learning for sequence prediction problems, i.e. generating nice-looking input samples given a previous history, for high dimensional input sequences by decoupling the static input representation from the recurrent sequence representation. We introduce three models based on Generative Stochastic Networks (GSN) for unsupervised sequence learning and prediction. Experimental results for these three models are presented on pixels of sequential handwritten digit (MNIST) data, videos of low-resolution bouncing balls, and motion capture data. The main contribution of this thesis is to provide evidence that GSNs are a viable framework to learn useful representations of complex sequential input data, and to suggest a new framework for deep generative models to learn complex sequences by decoupling static input representations from dynamic time dependency representations.

Short Term Electric Load Forecast with Artificial Neural Networks

This paper presents issues regarding short term electric load forecasting using feedforward and Elman recurrent neural networks. The study cases were developed using measured data representing electrical energy consumption from the Banat area. Thirty-five different structures were considered for both the feedforward and recurrent network cases. For each type of neural network structure, many trainings were performed and the best solution was selected. Short term load forecasting is essential for effective energy consumption management in an open market environment.

Improving Long-Horizon Forecasts with Expectation-Biased LSTM Networks

State-of-the-art forecasting methods using Recurrent Neural Networks (RNN) based on Long-Short Term Memory (LSTM) cells have shown exceptional performance targeting short-horizon forecasts, e.g. given a set of predictor features, forecast a target value for the next few time steps in the future. However, in many applications, the performance of these methods decays as the forecasting horizon extends beyond these few time steps. This paper aims to explore the challenges of long-horizon forecasting using LSTM networks. Here, we illustrate the long-horizon forecasting problem in datasets from neuroscience and energy supply management. We then propose expectation-biasing, an approach motivated by the literature of Dynamic Belief Networks, as a solution to improve long-horizon forecasting using LSTMs. We propose two LSTM architectures along with two methods for expectation biasing that significantly outperform standard practice.

Unlearn What You Have Learned: Adaptive Crowd Teaching with Exponentially Decayed Memory Learners

With the increasing demand for large amount of labeled data, crowdsourcing has been used in many large-scale data mining applications. However, most existing works in crowdsourcing mainly focus on label inference and incentive design. In this paper, we address a different problem of adaptive crowd teaching, which is a sub-area of machine teaching in the context of crowdsourcing. Compared with machines, human beings are extremely good at learning a specific target concept (e.g., classifying the images into given categories) and they can also easily transfer the learned concepts into similar learning tasks. Therefore, a more effective way of utilizing crowdsourcing is by supervising the crowd to label in the form of teaching. In order to perform the teaching and expertise estimation simultaneously, we propose an adaptive teaching framework named JEDI to construct the personalized optimal teaching set for the crowdsourcing workers. In JEDI teaching, the teacher assumes that each learner has an exponentially decayed memory. Furthermore, it ensures comprehensiveness in the learning process by carefully balancing teaching diversity and learner’s accurate learning in terms of teaching usefulness. Finally, we validate the effectiveness and efficacy of JEDI teaching in comparison with the state-of-the-art techniques on multiple data sets with both synthetic learners and real crowdsourcing workers.

Deep Probabilistic Programming Languages: A Qualitative Study

Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses.

Deep Multimodal Subspace Clustering Networks

We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages – multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressiveness layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.

Fast Weight Long Short-Term Memory

Associative memory using fast weights is a short-term memory mechanism that substantially improves the memory capacity and time scale of recurrent neural networks (RNNs). As recent studies introduced fast weights only to regular RNNs, it is unknown whether fast weight memory is beneficial to gated RNNs. In this work, we report a significant synergy between long short-term memory (LSTM) networks and fast weight associative memories. We show that this combination, in learning associative retrieval tasks, results in much faster training and lower test error, a performance boost most prominent at high memory task difficulties.

Understanding Convolutional Neural Network Training with Information Theory

Using information theoretic concepts to understand and explore the inner organization of deep neural networks (DNNs) remains a big challenge. Recently, the concept of an information plane began to shed light on the analysis of multilayer perceptrons (MLPs). We provided an in-depth insight into stacked autoencoders (SAEs) using a novel matrix-based Renyi’s α-entropy functional, enabling for the first time the analysis of the dynamics of learning using information flow in a real-world scenario involving complex network architecture and large data. Despite the great potential of these past works, there are several open questions when it comes to applying information theoretic concepts to understand convolutional neural networks (CNNs). These include for instance the accurate estimation of information quantities among multiple variables, and the many different training methodologies. By extending the novel matrix-based Renyi’s α-entropy functional to a multivariate scenario, this paper presents a systematic method to analyze CNNs training using information theory. Our results validate two fundamental data processing inequalities in CNNs, and also have direct impacts on previous work concerning the training and design of CNNs.

Successive Convexification: A Superlinearly Convergent Algorithm for Non-convex Optimal Control Problems

This paper presents the SCvx algorithm, a successive convexification algorithm designed to solve non-convex optimal control problems with global convergence and superlinear convergence-rate guarantees. The proposed algorithm handles nonlinear dynamics and non-convex state and control constraints by linearizing them about the solution of the previous iterate, and solving the resulting convex subproblem to obtain a solution for the current iterate. Additionally, the algorithm incorporates several safeguarding techniques into each convex subproblem, employing virtual controls and virtual buffer zones to avoid artificial infeasibility, and a trust region to avoid artificial unboundedness. The procedure is repeated in succession, thus turning a difficult non-convex optimal control problem into a sequence of numerically tractable convex subproblems. Using fast and reliable Interior Point Method (IPM) solvers, the convex subproblems can be computed quickly, making the SCvx algorithm well suited for real-time applications. Analysis is presented to show that the algorithm converges both globally and superlinearly, guaranteeing the local optimality of the original problem. The superlinear convergence is obtained by exploiting the structure of optimal control problems, showcasing the superior convergence rate that can be obtained by leveraging specific problem properties when compared to generic nonlinear programming methods. Numerical simulations are performed for an illustrative non-convex quad-rotor motion planning example problem, and corresponding results obtained using a Sequential Quadratic Programming (SQP) solver are provided for comparison. Our results show that the convergence rate of the SCvx algorithm is indeed superlinear, and surpasses that of the SQP-based method by converging in less than half the number of iterations.

State-Space Abstractions for Probabilistic Inference: A Systematic Review

Tasks such as social network analysis, human behavior recognition, or modeling biochemical reactions, can be solved elegantly by using the probabilistic inference framework. However, standard probabilistic inference algorithms work at a propositional level, and thus cannot capture the symmetries and redundancies that are present in these tasks. Algorithms that exploit those symmetries have been devised in different research fields, for example by the lifted inference-, multiple object tracking-, and modeling and simulation-communities. The common idea, that we call state space abstraction, is to perform inference over compact representations of sets of symmetric states. Although they are concerned with a similar topic, the relationship between these approaches has not been investigated systematically. This survey provides the following contributions. We perform a systematic literature review to outline the state of the art in probabilistic inference methods exploiting symmetries. From an initial set of more than 4,000 papers, we identify 116 relevant papers. Furthermore, we provide new high-level categories that classify the approaches, based on the problem classes the different approaches can solve. Researchers from different fields that are confronted with a state space explosion problem in a probabilistic system can use this classification to identify possible solutions. Finally, based on this conceptualization, we identify potentials for future research, as some relevant application domains are not addressed by current approaches.

Exact Distributed Training: Random Forest with Billions of Examples

We introduce an exact distributed algorithm to train Random Forest models as well as other decision forest models without relying on approximating best split search. We explain the proposed algorithm and compare it to related approaches for various complexity measures (time, RAM, disk, and network complexity analysis). We report its running performances on artificial and real-world datasets of up to 18 billion examples. This figure is several orders of magnitude larger than datasets tackled in the existing literature. Finally, we empirically show that Random Forest benefits from being trained on more data, even in the case of already gigantic datasets. Given a dataset with 17.3B examples with 82 features (3 numerical, the others categorical with high arity), our implementation trains a tree in 22h.

CoNet: Collaborative Cross Networks for Cross-Domain Recommendation

The cross-domain recommendation technique is an effective way of alleviating the data sparsity in recommender systems by leveraging the knowledge from relevant domains. Transfer learning is a class of algorithms underlying these techniques. In this paper, we propose a novel transfer learning approach for cross-domain recommendation by using neural networks as the base model. We assume that hidden layers in two base networks are connected by cross mappings, leading to the collaborative cross networks (CoNet). CoNet enables dual knowledge transfer across domains by introducing cross connections from one base network to another and vice versa. CoNet is achieved in multi-layer feedforward networks by adding dual connections and joint loss functions, which can be trained efficiently by back-propagation. The proposed model is evaluated on two real-world datasets and it outperforms baseline models by relative improvements of 3.56\% in MRR and 8.94\% in NDCG, respectively.

Validating Bayesian Inference Algorithms with Simulation-Based Calibration

Verifying the correctness of Bayesian computation is challenging. This is especially true for complex models that are common in practice, as these require sophisticated model implementations and algorithms. In this paper we introduce \emph{simulation-based calibration} (SBC), a general procedure for validating inferences from Bayesian algorithms capable of generating posterior samples. This procedure not only identifies inaccurate computation and inconsistencies in model implementations but also provides graphical summaries that can indicate the nature of the problems that arise. We argue that SBC is a critical part of a robust Bayesian workflow, as well as being a useful tool for those developing computational algorithms and statistical software.
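The core of the SBC procedure can be illustrated on a toy conjugate model where exact posterior draws are available. This is our sketch, not the paper's code; in a real workflow the exact posterior line would be replaced by the algorithm under test:

```r
# Simulation-based calibration sketch for a normal-normal model:
# prior theta ~ N(0, 1), likelihood y | theta ~ N(theta, 1),
# exact posterior theta | y ~ N(y/2, 1/2). If the computation is
# correct, the rank of the prior draw among L posterior draws is
# uniform on 0..L, so the histogram of ranks should look flat.
sbc_ranks <- function(n_sims = 1000, L = 100) {
  ranks <- integer(n_sims)
  for (s in seq_len(n_sims)) {
    theta <- rnorm(1, 0, 1)             # draw from the prior
    y     <- rnorm(1, theta, 1)         # simulate data given theta
    post  <- rnorm(L, y / 2, sqrt(1 / 2))  # stand-in for the sampler under test
    ranks[s] <- sum(post < theta)       # rank statistic in 0..L
  }
  ranks
}
# hist(sbc_ranks())  # deviations from uniformity flag a faulty sampler
```

The graphical summaries the authors mention are essentially diagnostics on this rank histogram: a U shape, a hump, or skew each point to a different kind of computational problem.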

Entropic Spectral Learning in Large Scale Networks

We present a novel algorithm for learning the spectral density of large scale networks using stochastic trace estimation and the method of maximum entropy. The complexity of the algorithm is linear in the number of non-zero elements of the matrix, offering a computational advantage over other algorithms. We apply our algorithm to the problem of community detection in large networks. We show state-of-the-art performance on both synthetic and real datasets.
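The stochastic trace estimation underlying such spectral methods can be illustrated with a Hutchinson-style estimator (our sketch, not the authors' implementation). Each probe costs one matrix-vector product, which is where the linear dependence on non-zero entries comes from:

```r
# Hutchinson trace estimator: for Rademacher vectors z (entries +/-1),
# E[z' A z] = tr(A). Each probe needs one matrix-vector product, so the
# cost per probe is linear in the number of non-zeros of A.
estimate_trace <- function(A, n_probes = 50) {
  n <- nrow(A)
  est <- numeric(n_probes)
  for (i in seq_len(n_probes)) {
    z <- sample(c(-1, 1), n, replace = TRUE)  # Rademacher probe
    est[i] <- drop(crossprod(z, A %*% z))     # z' A z
  }
  mean(est)
}

A <- diag(1:5)       # tr(A) = 15
estimate_trace(A)    # exactly 15 here, since z^2 = 1 on the diagonal
```

In the paper's setting the same trick, applied to powers or functions of the adjacency matrix, yields the moment estimates that the maximum entropy step turns into a spectral density.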

Distribution-based Prediction of the Degree of Grammaticalization for German Prepositions
Using Convex Optimization of Autocorrelation with Constrained Support and Windowing for Improved Phase Retrieval Accuracy
Diagnostic Tests for Nested Sampling Calculations
DPRed: Making Typical Activation Values Matter In Deep Learning Computing
Encoding Longer-term Contextual Multi-modal Information in a Predictive Coding Model
Efficient Soft-Output Gauss-Seidel Data Detector for Massive MIMO Systems
On the coupling of Model Predictive Control and Robust Kalman Filtering
Efficient Channel Estimator with Angle-Division Multiple Access
On indefinite sums weighted by periodic sequences
The Vlasov-Navier-Stokes equations as a mean field limit
Deep Object Co-Segmentation
Terrain RL Simulator
Contextualised Browsing in a Digital Library’s Living Lab
Local Search is a PTAS for Feedback Vertex Set in Minor-free Graphs
The emergent integrated network structure of scientific research
Delete, Retrieve, Generate: A Simple Approach to Sentiment and Style Transfer
Vision Based Dynamic Offside Line Marker for Soccer Games
Personalized neural language models for real-world query auto completion
Are FPGAs Suitable for Edge Computing?
Detecting Linguistic Characteristics of Alzheimer’s Dementia by Interpreting Neural Models
The Fundamental Solution to the p-Laplacian in a class of Hörmander Vector Fields
Multi-Reward Reinforced Summarization with Saliency and Entailment
Efficient Search of Compact QC-LDPC and SC-LDPC Convolutional Codes with Large Girth
On Learning Intrinsic Rewards for Policy Gradient Methods
An Adaptive Clipping Approach for Proximal Policy Optimization
Mage: Online Interference-Aware Scheduling in Multi-Scale Heterogeneous Systems
Optimal Carbon Taxes for Emissions Targets in the Electricity Sector
Objective Bayesian Inference for Repairable System Subject to Competing Risks
A Galerkin Isogeometric Method for Karhunen-Loeve Approximation of Random Fields
Communication-Aware Scheduling of Serial Tasks for Dispersed Computing
Bayesian parameter estimation for relativistic heavy-ion collisions
Robust Machine Comprehension Models via Adversarial Training
On coprime percolation, the visibility graphon, and the local limit of the GCD profile
A Generalized Cover’s Problem
Simplex Queues for Hot-Data Download
Multivariate Gaussian Process Regression for Multiscale Data Assimilation and Uncertainty Reduction
Minimax rate of testing in sparse linear regression
A Capacity-Price Game for Uncertain Renewables Resources
Two-Player Games for Efficient Non-Convex Constrained Optimization
Numerical Integration in Multiple Dimensions with Designed Quadrature
Learning how to be robust: Deep polynomial regression
Zero-shot Learning with Complementary Attributes
Improving Character-based Decoding Using Target-Side Morphological Information for Neural Machine Translation
UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition
Structure from Recurrent Motion: From Rigidity to Recurrency
Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems
Faster Evaluation of Subtraction Games
Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization
Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change
Online Non-Additive Path Learning under Full and Partial Information
Method to solve quantum few-body problems with artificial neural networks
The 1D Schrödinger equation with a spacetime white noise: the average wave function
Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation
Aspect Level Sentiment Classification with Attention-over-Attention Neural Networks
Improving information centrality of a node in complex networks by adding edges
Average Age-of-Information Minimization in UAV-assisted IoT Networks
Homogenization of Periodic Linear Nonlocal Partial Differential Equations
SFace: An Efficient Network for Face Detection in Large Scale Variations
A Mean Field View of the Landscape of Two-Layers Neural Networks
Combating the Control Signal Spoofing Attack in UAV Systems
The Erdös-Sós Conjecture for Spiders
A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization
The weak order on Weyl posets
Fundamental domains for rhombic lattices with dihedral symmetry of order 8
Semi-Supervised Co-Analysis of 3D Shape Styles from Projected Lines
Free to move or trapped in your group: Mathematical modeling of information overload and coordination in crowded populations
Estimation of the extreme value index in a censorship framework: asymptotic and finite sample behaviour
On bounds on bend number of classes of split and cocomparability graphs
$N$-detachable pairs in 3-connected matroids III: the theorem
Fast Channel Estimation for Millimetre Wave Wireless Systems Using Overlapped Beam Patterns
Coexistence of URLLC and eMBB services in the C-RAN Uplink: An Information-Theoretic Study
Ruin probabilities for two collaborating insurance companies
Independent Distributions on a Multi-Branching AND-OR Tree of Height 2
PHD-GIFs: Personalized Highlight Detection for Automatic GIF Creation
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
End-to-end Graph-based TAG Parsing with Neural Networks
Geographical Scheduling for Multicast Precoding in Multi-Beam Satellite Systems
Visualizing the Feature Importance for Black Box Models
A Solution for Large-scale Multi-object Tracking
Squarefree divisor complexes of certain numerical semigroup elements
Variational Disparity Estimation Framework for Plenoptic Image
DEA-based benchmarking for performance evaluation in pay-for-performance incentive plans
Experiments with Universal CEFR Classification
An Economic-Based Analysis of RANKING for Online Bipartite Matching
Rooted complete minors in line graphs with a Kempe coloring
Superframes, A Temporal Video Segmentation
Modular Verification of Vehicle Platooning with Respect to Decisions, Space and Time
Consensus Community Detection in Multilayer Networks using Parameter-free Graph Pruning
Numerical semigroups with a fixed number of gaps of second kind
Deep Face Recognition: A Survey
NTUA-SLP at SemEval-2018 Task 2: Predicting Emojis using RNNs with Context-aware Attention
NTUA-SLP at SemEval-2018 Task 1: Predicting Affective Content in Tweets with Deep Attentive RNNs and Transfer Learning
NTUA-SLP at SemEval-2018 Task 3: Tracking Ironic Tweets using Ensembles of Word and Character Level Attentive RNNs
E- and R-optimality of block designs for treatment-control comparisons
Active Learning for Breast Cancer Identification
Bayesian Metabolic Flux Analysis reveals intracellular flux couplings
The Graph Exploration Problem with Advice
Understanding Individual Neuron Importance Using Information Theory
Temporal Unknown Incremental Clustering (TUIC) Model for Analysis of Traffic Surveillance Videos
A Robot to Shape your Natural Plant: The Machine Learning Approach to Model and Control Bio-Hybrid Systems
High order synaptic learning in neuro-mimicking resistive memories
Platonic solids, Archimedean solids and semi-equivelar maps on the sphere
Is a Finite Intersection of Balls Covered by a Finite Union of Balls in Euclidean Spaces ?
Liveness Detection Using Implicit 3D Features
Index Codes for Interlinked Cycle Structures with Outer Cycles
Alquist: The Alexa Prize Socialbot
Impact of Non-orthogonal Multiple Access on the Offloading of Mobile Edge Computing
Promotion on oscillating and alternating tableaux and rotation of matchings and permutations
Are ResNets Provably Better than Linear Predictors?
An efficient open-source implementation to compute the Jacobian matrix for the Newton-Raphson power flow algorithm
Forecasting the presence and intensity of hostility on Instagram using linguistic and social features
Simulation-based Adversarial Test Generation for Autonomous Vehicles with Machine Learning Components
A General Account of Argumentation with Preferences
A Parallel/Distributed Algorithmic Framework for Mining All Quantitative Association Rules
Stopping Redundancy Hierarchy Beyond the Minimum Distance
Unspeech: Unsupervised Speech Context Embeddings
Quantifying the visual concreteness of words and topics in multimodal datasets
A lower bound on the number of homotopy types of simplicial complexes on $n$ vertices
A local approach to the Erdős-Sós conjecture
A note on number triangles that are almost their own production matrix
A Min.Max Algorithm for Spline Based Modeling of Violent Crime Rates in USA
Solving the Exponential Growth of Symbolic Regression Trees in Geometric Semantic Genetic Programming
On Abelian Longest Common Factor with and without RLE
ECG arrhythmia classification using a 2-D convolutional neural network
Automated detection of vulnerable plaque in intravascular ultrasound images
Active choice of teachers, learning strategies and goals for a socially guided intrinsic motivation learner
Automated diagnosis of pneumothorax using an ensemble of convolutional neural networks with multi-sized chest radiography images
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Unveiling the Power of Deep Tracking
Delayed Blockchain Protocols

Continue Reading…


Read More

Document worth reading: “Statistical Validity and Consistency of Big Data Analytics: A General Framework”

Informatics and technological advancements have triggered generation of huge volume of data with varied complexity in its management and analysis. Big Data analytics is the practice of revealing hidden aspects of such data and making inferences from it. Although storage, retrieval and management of Big Data seem possible through efficient algorithm and system development, concern about statistical consistency remains to be addressed in view of its specific characteristics. Since Big Data does not conform to standard analytics, we need proper modification of the existing statistical theory and tools. Here we propose, with illustrations, a general statistical framework and an algorithmic principle for Big Data analytics that ensure statistical accuracy of the conclusions. The proposed framework has the potential to push forward advancement of Big Data analytics in the right direction. The partition-repetition approach proposed here is broad enough to encompass all practical data analytic problems. Statistical Validity and Consistency of Big Data Analytics: A General Framework

Continue Reading…


Read More

R Packages worth a look

Rapidjson’ C++ Header Files (rapidjsonr)
Provides JSON parsing capability through the ‘Rapidjson’ ‘C++’ header-only library.

Prediction Intervals for Random-Effects Meta-Analysis (pimeta)
An implementation of prediction intervals for random-effects meta-analysis: Higgins et al. (2009) <doi:10.1111/j.1467-985X.2008.00552.x>, Partlett and Riley (2017) <doi:10.1002/sim.7140>, and Nagashima et al. (2018) <arXiv:1804.01054>.

Nonparametric Smoothing of Laplacian Graph Spectra (LPGraph)
A nonparametric method to approximate Laplacian graph spectra of a network with ordered vertices. This provides a computationally efficient algorithm for obtaining an accurate and smooth estimate of the graph Laplacian basis. The approximation results can then be used for tasks like change point detection, k-sample testing, and so on. The primary reference is Mukhopadhyay, S. and Wang, K. (2018, Technical Report).

Simulation of Correlated Systems of Equations with Multiple Variable Types (SimRepeat)
Generate correlated systems of statistical equations which represent repeated measurements or clustered data. These systems contain either: a) continuous normal, non-normal, and mixture variables based on the techniques of Headrick and Beasley (2004) <DOI:10.1081/SAC-120028431> or b) continuous (normal, non-normal and mixture), ordinal, and count (regular or zero-inflated, Poisson and Negative Binomial) variables based on the hierarchical linear models (HLM) approach. Headrick and Beasley’s method for continuous variables calculates the beta (slope) coefficients based on the target correlations between independent variables and between outcomes and independent variables. The package provides functions to calculate the expected correlations between outcomes, between outcomes and error terms, and between outcomes and independent variables, extending Headrick and Beasley’s equations to include mixture variables. These theoretical values can be compared to the simulated correlations. The HLM approach requires specification of the beta coefficients, but permits group and subject-level independent variables, interactions among independent variables, and fixed and random effects, providing more flexibility in the system of equations. Both methods permit simulation of data sets that mimic real-world clinical or genetic data sets (i.e. plasmodes, as in Vaughan et al., 2009, <10.1016/j.csda.2008.02.032>). The techniques extend those found in the ‘SimMultiCorrData’ and ‘SimCorrMix’ packages. Standard normal variables with an imposed intermediate correlation matrix are transformed to generate the desired distributions. Continuous variables are simulated using either Fleishman’s third-order (<DOI:10.1007/BF02293811>) or Headrick’s fifth-order (<DOI:10.1016/S0167-9473(02)00072-5>) power method transformation (PMT). Simulation occurs at the component-level for continuous mixture distributions. 
These components are transformed into the desired mixture variables using random multinomial variables based on the mixing probabilities. The target correlation matrices are specified in terms of correlations with components of continuous mixture variables. Binary and ordinal variables are simulated by discretizing the normal variables at quantiles defined by the marginal distributions. Count variables are simulated using the inverse CDF method. There are two simulation pathways for the multi-variable type systems which differ by intermediate correlations involving count variables. Correlation Method 1 adapts Yahav and Shmueli’s 2012 method <DOI:10.1002/asmb.901> and performs best with large count variable means and positive correlations or small means and negative correlations. Correlation Method 2 adapts Barbiero and Ferrari’s 2015 modification of the ‘GenOrd’ package <DOI:10.1002/asmb.2072> and performs best under the opposite scenarios. There are three methods available for correcting non-positive definite correlation matrices. The optional error loop may be used to improve the accuracy of the final correlation matrices. The package also provides functions to check parameter inputs and summarize the simulated systems of equations.

Interface for ‘GraphFrames’ (graphframes)
A ‘sparklyr’ <https://…/> extension that provides an R interface for ‘GraphFrames’ <https://…/>. ‘GraphFrames’ is a package for ‘Apache Spark’ that provides a DataFrame-based API for working with graphs. Functionality includes motif finding and common graph algorithms, such as PageRank and Breadth-first search.

Continue Reading…


Read More

Magister Dixit

“We must convey what constitutes data, what it can be used for, and why it’s valuable.” Jake Porway ( October 1, 2015 )

Continue Reading…


Read More

Monkeying around with Code and Paying it Forward

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)


This is a story (mostly) about how I started contributing to the rOpenSci package monkeylearn. I can’t promise any life flipturning upside down, but there will be a small discussion about git best practices which is almost as good 🤓. The tl;dr here is nothing novel but is something I wish I’d experienced firsthand sooner. That is, that tinkering with and improving on the code others have written is more rewarding for you and more valuable to others when you contribute it back to the original source.

We all write code all the time to graft additional features onto existing tools or reshape output into forms that fit better in our particular pipelines. Chances are, these are improvements our fellow package users could take advantage of. Plus, if they’re integrated into the package source code, then we no longer need our own wrappers and reshapers and speeder-uppers. That means less code and fewer chances of bugs all around 🙌. So, tinkering with and improving on the code others have written is more rewarding for you and more valuable to others when you contribute it back to the original source.

Some Backstory

My first brush with the monkeylearn package was at work one day when I was looking around for an easy way to classify groups of texts using R. I made the very clever first move of Googling “easy way to classify groups of texts using R” and thanks to the magic of what I suppose used to be PageRank I landed upon a GitHub README for a package called monkeylearn.

A quick install.packages("monkeylearn") and creation of an API key later it started looking like this package would fit my use case. I loved that it sported only two functions, monkeylearn_classify() and monkeylearn_extract(), which did exactly what they said on the tin. They accept a vector of texts and return a dataframe of classifications or keyword extractions, respectively.

For a bit of background, the monkeylearn package hooks into the MonkeyLearn API, which uses natural language processing techniques to take a text input and hand back a vector of outputs (keyword extractions or classifications) along with metadata such as its confidence in the relevance of the classification. There are a set of built-in “modules” (e.g., retail classifier, profanity extractor) but users can also create their own “custom” modules1 by supplying their own labeled training data.

The monkeylearn R package serves as a friendly interface to that API, allowing users to process data using the built-in modules (it doesn’t yet support creating and training of custom modules). In the rOpenSci tradition it’s peer-reviewed and was contributed via the onboarding process.

I began using the package to attach classifications to around 70,000 texts. I soon discovered a major stumbling block: I could not send texts to the MonkeyLearn API in batches. This wasn’t because the monkeylearn_classify() and monkeylearn_extract() functions themselves didn’t accept multiple inputs. Instead, it was because they didn’t explicitly relate inputs to outputs. This became a problem because inputs and outputs are not 1:1; if I send a vector of three texts for classification, my output dataframe might be 10 rows long. However, there was no user-friendly way to know for sure2 whether the first two or the first four output rows, for example, belonged to the first input text.

Here’s an example of what I mean.

texts <- c(
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.",
    "I'm not an ambiturner. I can't turn left.")

(texts_out <- monkeylearn_classify(texts, verbose = FALSE))
## # A tibble: 11 x 4
##    category_id probability label                text_md5                  
##  1    18314767      0.0620 Books                af55421029d7236ca6ecbb281…
##  2    18314954      0.0470 Mystery & Suspense   af55421029d7236ca6ecbb281…
##  3    18314957      0.102  Police Procedural    af55421029d7236ca6ecbb281…
##  4    18313210      0.0820 Party & Occasions    602f1ab2654b88f5c7f5c90e4…
##  5    18313231      0.176  "Party Supplies "    602f1ab2654b88f5c7f5c90e4…
##  6    18313235      0.134  Party Decorations    602f1ab2654b88f5c7f5c90e4…
##  7    18313236      0.406  Decorations          602f1ab2654b88f5c7f5c90e4…
##  8    18314767      0.0630 Books                bdb9881250321ce8abecacd4d…
##  9    18314870      0.0460 Literature & Fiction bdb9881250321ce8abecacd4d…
## 10    18314876      0.0400 Mystery & Suspense   bdb9881250321ce8abecacd4d…
## 11    18314878      0.289  Suspense             bdb9881250321ce8abecacd4d…

So we can see we’ve now got classifications for the texts we fed in as input. The MD5 hash can be used to disambiguate which outputs correspond to which inputs in some cases (see Maëlle’s fantastic Guardian Experience post!). This works great if you either don’t care about classifying your inputs independently of one another or you know that your inputs will never contain empty strings or other values that won’t be sent to the API. In my case, though, my inputs were independent of one another and also could not be counted on to be well-formed. I determined that each had to be classified separately so that I could guarantee a 1:1 match between input and output.
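In the well-behaved case, the hash does let you recover the grouping. A base-R sketch using the texts_out object from above:

```r
# Split the flat output back into one data frame per input text,
# keyed by the MD5 hash (same hash = same input).
by_input <- split(texts_out, texts_out$text_md5)
length(by_input)  # one element per well-formed input (3 here)
```

Note that split() orders the groups by hash, not by input order, and inputs dropped before hashing simply vanish, which is exactly why a guaranteed 1:1 input-to-output mapping was needed here.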

Initial Workaround

My first approach to this problem was to simply treat each text as a separate call. I wrapped monkeylearn_classify() in a function that would send a vector of texts and return a dataframe relating the input in one column to the output in the others. Here is a simplified version of it, sans the error handling and other bells and whistles:

initial_workaround <- function(df, col, verbose = FALSE) {
  quo_col <- enquo(col)
  out <- df %>% 
    mutate(tags = NA_character_)
  for (i in 1:nrow(df)) {
    this_text <- df %>% 
      select(!!quo_col) %>% 
      slice(i) %>% as_vector()
    this_classification <- 
      monkeylearn_classify(this_text, verbose = verbose) %>% 
      select(-text_md5) %>% list()
    out[i, ]$tags <- this_classification
  }
  out
}

Since initial_workaround() takes a dataframe as input rather than a vector, let’s turn our sample into a tibble before feeding it in.

texts_df <- tibble(texts)

And now we’ll run the workaround:

initial_out <- initial_workaround(texts_df, texts)

## # A tibble: 3 x 2
##   texts                                                           tags    
## 1 It is a truth universally acknowledged, that a single man in p… <tibble…
## 2 When Mr. Bilbo Baggins of Bag End announced that he would shor… <tibble…
## 3 I'm not an ambiturner. I can't turn left.                       <tibble…

We see that this retains the 1:1 relationship between input and output, but still allows the output list-col to be unnested.

(initial_out %>% unnest())
## # A tibble: 11 x 4
##    texts                                   category_id probability label  
##    <chr>                                         <int>       <dbl> <chr>  
##  1 It is a truth universally acknowledged…    18314767      0.0620 Books  
##  2 It is a truth universally acknowledged…    18314954      0.0470 Myster…
##  3 It is a truth universally acknowledged…    18314957      0.102  Police…
##  4 When Mr. Bilbo Baggins of Bag End anno…    18313210      0.0820 Party …
##  5 When Mr. Bilbo Baggins of Bag End anno…    18313231      0.176  "Party…
##  6 When Mr. Bilbo Baggins of Bag End anno…    18313235      0.134  Party …
##  7 When Mr. Bilbo Baggins of Bag End anno…    18313236      0.406  Decora…
##  8 I'm not an ambiturner. I can't turn le…    18314767      0.0630 Books  
##  9 I'm not an ambiturner. I can't turn le…    18314870      0.0460 Litera…
## 10 I'm not an ambiturner. I can't turn le…    18314876      0.0400 Myster…
## 11 I'm not an ambiturner. I can't turn le…    18314878      0.289  Suspen…

But, the catch: this approach was quite slow. The real bottleneck here isn’t the for loop; it’s that this requires a round trip to the MonkeyLearn API for each individual text. For just these three meager texts, let’s see how long initial_workaround() takes to finish.

(benchmark <- system.time(initial_workaround(texts_df, texts)))
##    user  system elapsed 
##   0.036   0.001  15.609

It was clear that if classifying 3 inputs was going to take 15.6 seconds, even classifying my relatively small data was going to take a looong time, like on the order of 4 days, just for the first batch of data 🙈. I updated the function to write each row out to an RDS file after it was classified inside the loop (with an addition along the lines of write_rds(out[i, ], glue::glue("some_directory/{i}.rds"))) so that I wouldn’t have to rely on the function successfully finishing execution in one run. Still, I didn’t like my options.
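Sketched out, the resumable version of that loop looked something like the following. classify_row() is a hypothetical stand-in for the per-row call to monkeylearn_classify(), and the paths are illustrative:

```r
# Resumable classification: write each row's result to disk as soon as it
# is classified, and skip rows already on disk, so a crashed or
# interrupted run can pick up where it left off.
for (i in 1:nrow(df)) {
  path <- glue::glue("some_directory/{i}.rds")
  if (file.exists(path)) next                    # classified on a previous run
  readr::write_rds(classify_row(df[i, ]), path)  # classify_row() is hypothetical
}
```

The per-file granularity trades a little disk clutter for the guarantee that no API call is ever paid for twice.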

This classification job was intended to be run every night, and with an unknown amount of input text data coming in every day, I didn’t want it to run for more than 24 hours one day and either a) prevent the next night’s job from running or b) necessitate spinning up a second server to handle the next night’s data.

Diving In

Now that I’m starting to think, I’m just about at the point where I have to start making myself useful.

I’d seen in the package docs and on the MonkeyLearn FAQ that batching up to 200 texts was possible2. So, I decide to first look into the mechanics of how text batching is done in the monkeylearn package.

Was the MonkeyLearn API returning JSON that didn’t relate each individual input to its output? I sort of doubted it. You’d think that an API that was sent a JSON “array” of inputs would send back a hierarchical array to match. My hunch was that either the package was concatenating the input before shooting it off to the API (which would save the user API queries) or rowbinding the output after it was returned. (The rowbinding itself would be fine if each input could somehow be related to its one or many outputs.)

So I fork the package repo and set about rummaging through the source code. Blissfully, everything is nicely commented and the code is quite readable.

I step through monkeylearn_classify() in the debugger and narrow in on a call to what looks like a utility function: monkeylearn_parse(). I find it in utils.R.

The lines in monkeylearn_parse() that matter for our purposes are:

text <- httr::content(output, as = "text",
                        encoding = "UTF-8")
temp <- jsonlite::fromJSON(text)
if(length(temp$result[[1]]) != 0){
  results <-"rbind", temp$result)
  results$text_md5 <- unlist(mapply(rep, vapply(X = request_text,
                                                    FUN = digest::digest,
                                                    FUN.VALUE = character(1),
                                                    algo = "md5"),
                                        unlist(vapply(temp$result, nrow,
                                                      FUN.VALUE = 0)),
                                        SIMPLIFY = FALSE))

So this is where the rowbinding happens – after the fromJSON() call! 🎉

This is good news because it means that the MonkeyLearn API is sending differentiated outputs back in a nested JSON object. The package converts this to a list with fromJSON() and only then is the rbinding applied. That’s why the text_md5 hash is generated during this step: to be able to group outputs that all correspond to a single input (same hash means same input).

I set about copy-pasting monkeylearn_parse() and did a bit of surgery on it, emerging with monkeylearn_parse_each(). monkeylearn_parse_each() skips the rbinding and retains the list structure of each output, which means that its output can be turned into a nested tibble with each row corresponding to one input. That nested tibble can then be related to each corresponding element of the input vector. All that remained was to create a new enclosing analog to monkeylearn_classify() that could use monkeylearn_parse_each().
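To illustrate the idea with a toy example (not the package’s actual code): once each output retains its hash, the flat rows can be collapsed into a nested tibble with one row per input, roughly like so.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the old flat output: two classifications for input "a",
# one for input "b", tied together only by the md5 hash column.
flat_output <- tibble(
  text_md5    = c("hash_a", "hash_a", "hash_b"),
  category_id = c(64708L, 64711L, 64708L),
  probability = c(0.29, 0.49, 0.35)
)

# Collapse back to one nested row per input: same hash means same input
nested_output <- flat_output %>%
  group_by(text_md5) %>%
  nest() %>%
  ungroup()
# nested_output now has one row per distinct hash, with that input's
# classifications tucked into a list-column
```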

Thinking PR thoughts

At this point, I thought that such a function might be useful to some other people using the package so I started building it with an eye toward making a pull request.

Since I’d found it useful to be able to pass in an input dataframe in initial_workaround(), I figured I’d retain that feature of the function. I wanted users to still be able to pass in a bare column name but the package seemed to be light on tidyverse functions unless there was no alternative, so I un-tidyeval’d the function (using deparse(substitute()) instead of a quosure) and gave it the imaginative name…monkeylearn_classify_df(). The rest of the original code was so airtight I didn’t have to change much more to get it working.
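The deparse(substitute()) trick, in miniature (a toy example of the base-R approach, not the package code):

```r
# Accept a bare column name in base R, no tidyeval required
pull_col <- function(df, col) {
  col_name <- deparse(substitute(col))  # turn the bare name into a string
  df[[col_name]]
}

df <- data.frame(txt = c("one", "two"), stringsAsFactors = FALSE)
pull_col(df, txt)  # same as df[["txt"]]
```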

A nice side effect of my plumbing through the guts of the package was that I caught a couple minor bugs (things like the remnants of a for loop remaining in what had been revamped into a while loop) and noticed where there could be some quick wins for improving the package.

After a few more checks I wrote up the description for the pull request which outlined the issue and the solution (though I probably should have first opened an issue, waited for a response, and then submitted a PR referencing the issue as Mara Averick suggests in her excellent guide to contributing to the tidyverse).

I checked the list of package contributors to see if I knew anyone. Far and away the main contributor was Maëlle Salmon! I’d heard of her through the magic of #rstats Twitter and the R-Ladies Global Slack. A minute or two after submitting it I headed over to Slack to give her a heads up that a PR would be heading her way.

In what I would come to know as her usual cheerful, perpetually-on-top-of-it form, Maëlle had already seen it and liked the idea for the new function.

Continuing Work

To make a short story shorter, Maëlle asked me if I’d like to create the extractor counterpart to monkeylearn_classify_df() and become an author on the package with push access to the repo. I said yes, of course, and so we began to strategize over Slack about tradeoffs like which package dependencies we were okay with taking on, whether to go the tidyeval or base route, what the best naming conventions for the new functions should be, etc.

On the naming front, we decided to gracefully deprecate monkeylearn_classify() and monkeylearn_extract() as the newer functions could cover all of the functionality that the older ones did. I don’t know much about cache invalidation, but the naming problem was hard as usual. We settled on naming their counterparts monkey_classify() (which replaced the original monkeylearn_classify_df()) and monkey_extract().


Early on in the process we started talking git conventions. Rather than both work off a development branch, I floated a structure that we typically follow at my company, where each ticket (or in this case, GitHub issue) becomes its own branch off of dev. For instance, issue #33 becomes branch T33 (T for ticket). Each of these feature branches comes off of dev (unless it’s a hotfix) and is merged back into dev and deleted when it passes all the necessary checks. This approach, I am told, stems from the “gitflow” philosophy which, as far as I understand it, is one of many ways to structure a git workflow that mostly doesn’t end in tears.


Like most git strategies, the idea here is to make pull requests as bite-sized as possible; in this case, a PR can only be as big as the issue it’s named from. An added benefit for me, at least, is that this keeps me from wandering off into other parts of the code without first documenting the point in a separate issue and then creating a branch. At most one person is assigned to each ticket/issue, which minimizes merge conflicts. You also leave a nice paper trail, because the branch name directly references the issue front and center in its title. This means you don’t have to explicitly name the issue in the commit or rely on GitHub’s (albeit awesome) keyword branch closing system3.

Finally, since the system is so tied to issues themselves, it encourages very frequent communication between collaborators. Since the issue must necessarily be made before the branch and the accompanying changes to the code, the other contributors have a chance to weigh in on the issue or the approach suggested in its comments before any code is written. In our case, it’s certainly made frequent communication the path of least resistance.

While this branch and PR-naming convention isn’t particular to gitflow (to my knowledge), it did spark a short conversation on Twitter that I think is useful to have:

Thomas Lin Pedersen makes a good point on the topic:

[embedded tweet]

This insight got me thinking that the best approach might be to explicitly name the issue number and give a description in the branch name, like a slug of sorts. I started using a branch syntax like T31-fix-bug-i-just-created which has worked out well for Maëlle and me thus far, making the history a bit more readable.

Main Improvements

As I mentioned, the package was so good to begin with it was difficult to find ways to improve it. Most of the subsequent work I did on monkeylearn was to improve the new monkey_ functions.

The original monkeylearn_ functions discarded inputs such as empty strings that could not be sent to the API. We now retain those empty inputs and return NAs in the response columns for that row. This means that the output is always of the same dimensions as the input. We return an unnested dataframe by default, as the original functions did, but allow the output to be nested if the unnest flag is set to FALSE.

The functions also got more informative messages about which batches are currently being processed and which texts those batches correspond to.

text_w_empties <- c(
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.",
    "",
    "I'm not an ambiturner. I can't turn left.",
    " ")

(empties_out <- monkey_classify(text_w_empties, classifier_id = "cl_5icAVzKR", texts_per_req = 2, unnest = TRUE))
## The following indices were empty strings and could not be sent to the API: 3, 5. They will still be included in the output.
## Processing batch 1 of 2 batches: texts 1 to 2
## Processing batch 2 of 2 batches: texts 2 to 3
## # A tibble: 8 x 4
##   req                                      category_id probability label  
##   <chr>                                          <int>       <dbl> <chr>  
## 1 It is a truth universally acknowledged,…       64708       0.289 Society
## 2 It is a truth universally acknowledged,…       64711       0.490 Relati…
## 3 When Mr. Bilbo Baggins of Bag End annou…       64708       0.348 Society
## 4 When Mr. Bilbo Baggins of Bag End annou…       64713       0.724 Specia…
## 5 ""                                                NA      NA     <na>   
## 6 I'm not an ambiturner. I can't turn lef…       64708       0.125 Society
## 7 I'm not an ambiturner. I can't turn lef…       64710       0.377 Parent…
## 8 " "                                               NA      NA     <na>

So even though the empty string inputs, like the 3rd and 5th, aren’t sent to the API, we can see they’re still included in the output dataframe and assigned the same column names as all of the other outputs. That means that even if unnest is set to FALSE, the output can still be unnested with tidyr::unnest() after the fact.
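For example, a result kept nested with unnest = FALSE can be expanded later (assuming the nested list-column is named res, as in the output shown further down):

```r
# Classify but keep one nested row per input
nested_out <- monkey_classify(text_w_empties, classifier_id = "cl_5icAVzKR",
                              unnest = FALSE, verbose = FALSE)

# Recover the flat, one-row-per-classification shape after the fact
nested_out %>% tidyr::unnest(res)
```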

If a dataframe is supplied, there is now a .keep_all option which allows for all columns of the input to be retained, not just the column that contains the text to be classified. This makes the monkey_ functions work even more like a mutate(); rather than returning an object that has to be joined on the original input, we do that association for the user.

sw <- dplyr::starwars %>% 
  dplyr::select(name, height) %>% 
  dplyr::sample_n(length(text_w_empties))

df <- tibble::tibble(text = text_w_empties) %>% 
  dplyr::bind_cols(sw)

df %>% monkey_classify(text, classifier_id = "cl_5icAVzKR", unnest = FALSE, .keep_all = TRUE, verbose = FALSE)
## # A tibble: 5 x 4
##   name           height text                                       res    
##   <chr>           <int> <chr>                                      <list> 
## 1 Ackbar            180 It is a truth universally acknowledged, t… <data.…
## 2 Luke Skywalker    172 When Mr. Bilbo Baggins of Bag End announc… <data.…
## 3 Lama Su           229 ""                                         <data.…
## 4 Shaak Ti          178 I'm not an ambiturner. I can't turn left.  <data.…
## 5 Shmi Skywalker    163 " "                                        <data.…

We see that the input column, text, is sandwiched between the other columns of the original dataframe (the starwars ones) and the output column res.

The hope is that all of this serves to improve the data safety and user experience of the package.

Developing functions in tandem

Something I’ve been thinking about while working on the twin functions monkey_extract() and monkey_classify() is what the best practice is for developing very similar functions in sync with one another. These two functions are different enough to have different default values (for example, monkey_extract() has a default extractor_id while monkey_classify() has a default classifier_id) but are so similar in other regards as to be sort of embarrassingly parallel.

What I’ve been turning over in my head is the question of how in sync these functions should be during development. As soon as you make a change to one function, should you immediately make the same change to the other? Or is it instead better to work on one function at a time, and, at some checkpoints then batch these changes over to the other function in a big copy-paste job? I’ve been tending toward the latter but it’s seemed a little dangerous to me.

Since there are only two functions to worry about here, creating a function factory to handle them seemed like overkill, but might technically be the best practice. I’d love to hear people’s thoughts on how they go about navigating this facet of package development.
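For what it’s worth, a function factory here might look something like this toy sketch (argument names, defaults, and the worker functions are all invented for illustration):

```r
# Stamp out the two near-twin functions from one template, so the shared
# logic lives in exactly one place and only the defaults differ.
make_monkey_fun <- function(default_id, request_fun) {
  function(input, id = default_id, ...) {
    request_fun(input, id = id, ...)
  }
}

# Hypothetical wiring:
# monkey_classify <- make_monkey_fun(default_classifier_id, classify_request)
# monkey_extract  <- make_monkey_fun(default_extractor_id,  extract_request)
```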

Last Thoughts

My work on the monkeylearn package so far has been rewarding to say the least. It’s inspired me to be not just a consumer but more of an active contributor to open source. Some wise words from Maëlle on this front:

You too can become a contributor to an rOpenSci package! Have a look at the issues tracker of your favorite rOpenSci package(s) e.g. rgbif. Browse issues suitable for beginners over many rOpenSci repos thanks to Lucy D’Agostino McGowan’s contributr Shiny app. Always first ask in a new or existing issue whether your contribution would be welcome, plan a bit with the maintainer, and then have fun! We’d be happy to have you.

Maëlle’s been a fantastic mentor, providing guidance in at least four languages – English, French, R, and emoji, despite the time difference and 👶(!). When it comes to monkeylearn, the hope is to keep improving the core package features, add some more niceties, and look into building out an R-centric way for users to create and train their own custom modules on MonkeyLearn.

On y va!

  1. Custom, to a point. As of this writing, the classifier models you can create use either Naive Bayes or Support Vector Machines, though you can specify other parameters such as use_stemmer and strip_stopwords. Custom extractor modules are coming soon.
  2. That MD5 hash almost provided the solution; each row of the output gets a hash that corresponds to a single input row, so it seemed like the hash was meant to be used to map inputs to outputs. Provided that I knew all of my inputs were non-empty strings (empties are filtered out before they can be sent to the API) and that all of them could be classified, I could have nested the output on its MD5 sum and mapped the indices of the inputs and the outputs 1:1. The trouble was that I knew my input data would be changing, and I wasn’t convinced that all of my inputs would receive well-formed responses from the API. If some texts couldn’t receive a corresponding set of classifications, such a nested output would have fewer rows than the input vector’s length, and there would be no way to tell which input corresponded to which nested output.
  3. Keywords in commits don’t automatically close issues until they’re merged into master, and since we were working off of dev for quite a long time, relying on keywords to automatically close issues would mean our Open Issues list wouldn’t accurately reflect the issues we actually still had to address. It would be cool for GitHub to allow flags, so that maybe “fixes #33 –dev” could close issue #33 when the PR with that phrase in the commit was merged into dev.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Announcing CGPfunctions 0.3 – April 20, 2018

(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

As I continue to learn and grow in using R, I have been trying to develop
the habit of being more formal in documenting and maintaining the
various functions and pieces of code I write. It’s not that I think they
are major inventions, but they are useful and I like having them stored
in one place where I can keep track of them. So I started building them as a
package and even publishing them to CRAN, for any of you who might find
them of interest as well.



A package that includes functions that I find useful for teaching
statistics as well as actually practicing the art. They typically are
not “new” methods but rather wrappers around either base R or other
packages, and concepts I’m trying to master. It currently contains:

  • Plot2WayANOVA which as the name implies conducts a 2 way ANOVA and
    plots the results using ggplot2
  • PlotXTabs which as the name implies plots cross tabulated
    variables using ggplot2
  • neweta which is a helper function that appends the results of a
    Type II eta squared calculation onto a classic ANOVA table
  • Mode which finds the modal value in a vector of data
  • SeeDist which wraps around ggplot2 to provide visualizations of
    univariate data.
  • OurConf is a simulation function that helps you learn about
    confidence intervals


# Install from CRAN
install.packages("CGPfunctions")

# Or the development version from GitHub
# install.packages("devtools")
devtools::install_github("ibecav/CGPfunctions")

Many thanks to Dani Navarro and the book Learning Statistics with R,
whose etaSquared function was the genesis of neweta.

“He who gives up safety for speed deserves neither.”

A shoutout to some other packages I find essential.

  • stringr, for strings.
  • lubridate, for date/times.
  • forcats, for factors.
  • haven, for SPSS, SAS and Stata data.
  • readxl, for .xls and .xlsx files.
  • modelr, for modelling within a pipeline.
  • broom, for turning models into
    tidy data
  • ggplot2, for data visualisation.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern
    re-imagining of data frames.

Leaving Feedback

If you like CGPfunctions, please consider leaving feedback


Contributions in the form of feedback, comments, code, and bug reports
are most welcome. How to contribute:

  • Issues, bug reports, and wish lists: File a GitHub issue.
  • Contact the maintainer ibecav at by email.

To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell.

Continue Reading…


Read More

eRum Competition Winners

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

The results of the eRum competition are in! Before we announce the winners we would like to thank everyone who entered. It has been a pleasure to look at all of the ideas on show.

The Main Competition

The winner of the main competition is Lukasz Janiszewski. Lukasz provided a fantastic visualisation of the locations of each R user/ladies group and all R conferences. You can see his app here. If you want to view his code, you are able to do so in this GitHub repo. The code is contained in the directory erum_jr and the data preparation can be seen in budap.R.

Lukasz made 3 csv files containing the information about the R user, R-Ladies and R conference groups. With the help of an R-bloggers post, he was able to add geospatial information to those csv files. Finally, he scraped each meetup page for information on the R-Ladies groups. Using all of this information, he was able to make an informative, visually appealing dashboard with shiny.

Lukasz will now be jetting off to Budapest, to eRum 2018!

The Secondary Competition

The winner of the secondary competition is Jenny Snape. Jenny provided an excellent script to parse the current .Rmd files and extract the conference and group urls & locations. The script can be found in this GitHub gist. Jenny has written a few words to summarise her script…

“The files on github can be read into R as character vectors (where each line is a element of the vector) using the R readLines() function.

From this character vector, we need to extract the country, the group name and url. This can be done by recognising that each line containing a country starts with a ‘##’ and each line containing the group name and url starts with a ‘*’. Therefore we can use these ‘tags’ to cycle through each element of the character vector and pull out vectors containing the countries, the cities and the urls of the R groups. These vectors can then be cleaned and joined together into a data frame.

I wrote these steps into a function that accepted each R group character vector as an input and returned the final data frame. As one of the data sets contained just R Ladies groups, I fed this in as an argument and returned it as a column in the final data frame in order to differentiate between the different group types. I also returned a variable based on the character vector input in order to differentiate between the different world continents.

Running this function on each of the character vectors creates separate data sets which can then be all joined together. This creates a final dataset containing all the information on each R group: the type of group, the url, the city and the region."
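A simplified illustration of the parse Jenny describes (my own sketch from the description above, not her actual script; the filename is hypothetical):

```r
# Read the .Rmd as a character vector, one element per line
lines <- readLines("r_user_groups.Rmd")  # hypothetical filename

# Country lines start with "##", group lines with "*"
is_country <- grepl("^##", lines)
is_group   <- grepl("^\\*", lines)

# For each line, the index of the most recent country header seen so far
country_idx <- cumsum(is_country)
countries   <- sub("^##\\s*", "", lines[is_country])

# One row per group entry, tagged with its country
groups_df <- data.frame(
  country = countries[country_idx[is_group]],
  entry   = sub("^\\*\\s*", "", lines[is_group]),
  stringsAsFactors = FALSE
)
```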

As well as this, Jenny provided us with a fantastic shiny dashboard, summarising the data.

Jenny has now received a free copy of Efficient R Programming!

Once again, thank you to all who entered and well done to our winners, Lukasz and Jenny!

What next?

We’re in the process of converting Jenny’s & Lukasz’s hard work into a nice dashboard that will be magically updated via our list of useR groups and conferences. It should be ready in a few days.

To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.

Continue Reading…


Read More

April 19, 2018

If you did not already know

Online Multiple Kernel Classification (OMKC) google
Online learning and kernel learning are two active research topics in machine learning. Although each of them has been studied extensively, there is a limited effort in addressing the intersecting research. In this paper, we introduce a new research problem, termed Online Multiple Kernel Learning (OMKL), that aims to learn a kernel based prediction function from a pool of predefined kernels in an online learning fashion. OMKL is generally more challenging than typical online learning because both the kernel classifiers and their linear combination weights must be learned simultaneously. In this work, we consider two setups for OMKL, i.e. combining binary predictions or real-valued outputs from multiple kernel classifiers, and we propose both deterministic and stochastic approaches in the two setups for OMKL. The deterministic approach updates all kernel classifiers for every misclassified example, while the stochastic approach randomly chooses a classifier(s) for updating according to some sampling strategies. Mistake bounds are derived for all the proposed OMKL algorithms. …

Deep Generalized Canonical Correlation Analysis (DGCCA) google
We present Deep Generalized Canonical Correlation Analysis (DGCCA) — a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear two-view representation learning (Deep CCA, (Andrew et al., 2013)) and linear many-view representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA representations on two distinct datasets for three downstream tasks: phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear many-view techniques. …

Distributed Computing google
Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. There are many alternatives for the message passing mechanism, including RPC-like connectors and message queues. A goal and challenge pursued by some computer scientists and practitioners in distributed systems is location transparency; however, this goal has fallen out of favour in industry, as distributed systems are different from conventional non-distributed systems, and the differences, such as network partitions, partial system failures, and partial upgrades, cannot simply be ‘papered over’ by attempts at ‘transparency’ – see CAP theorem. Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing. …

Continue Reading…


Read More


Pastagate!

[relevant picture]

In a news article, “Pasta Is Good For You, Say Scientists Funded By Big Pasta,” Stephanie Lee writes:

The headlines were a fettuccine fanatic’s dream. “Eating Pasta Linked to Weight Loss in New Study,” Newsweek reported this month, racking up more than 22,500 Facebook likes, shares, and comments. The happy news also went viral on the Independent, the New York Daily News, and Business Insider.

What those and many other stories failed to note, however, was that three of the scientists behind the study in question had financial conflicts as tangled as a bowl of spaghetti, including ties to the world’s largest pasta company, the Barilla Group. . . .

They should get together with Big Oregano.

P.S. Our work has many government and corporate sponsors. Make of this what you will.

The post Pastagate! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Distilled News

An Introduction to Graph Theory and Network Analysis (with Python codes)

“A picture speaks a thousand words” is one of the most commonly used phrases. But a graph speaks so much more than that. A visual representation of data, in the form of graphs, helps us gain actionable insights and make better data-driven decisions based on them. But to truly understand what graphs are and why they are used, we need to understand a concept known as Graph Theory. Understanding this concept makes us better programmers. If you have tried to understand this concept before, though, you’ll have come across tons of formulae and dry theoretical concepts. This is why we decided to write this blog post. We have explained the concepts and then provided illustrations so you can follow along and intuitively understand how the functions are performing. This is a detailed post, because we believe that providing a proper explanation of this concept is a much preferred option over succinct definitions. In this article, we will look at what graphs are, their applications and a bit of history about them. We’ll also cover some Graph Theory concepts and then take up a case study using python to cement our understanding. Ready? Let’s dive into it.

An Overview of Regularization Techniques in Deep Learning (with Python code)

One of the most common problems data science professionals face is how to avoid overfitting. Have you come across a situation where your model performed exceptionally well on train data, but was not able to predict test data? Or were you at the top of a competition on the public leaderboard, only to fall hundreds of places in the final rankings? Well – this is the article for you! Avoiding overfitting can single-handedly improve our model’s performance. In this article, we will understand the concept of overfitting and how regularization helps in overcoming the same problem. We will then look at a few different regularization techniques and take a case study in python to further solidify these concepts.

Transfer Learning –Deep Learning for Everyone

Deep Learning, based on deep neural nets is launching a thousand ventures but leaving tens of thousands behind. Transfer Learning (TL), a method of reusing previously trained deep neural nets promises to make these applications available to everyone, even those with very little labeled data.

Notes from the AI frontier: Applications and value of deep learning

An analysis of more than 400 use cases across 19 industries and nine business functions highlights the broad use and significant economic potential of advanced AI techniques.

A Comparative Analysis of ChatBots APIs

Artificial Intelligence is on the rise! Not as a machine rebellion against human creators in the distant future, but as a growing modern trend of using machine-based predictions and decision-making in informational technologies. AI hype is everywhere: self-driving cars, smart image processing (e.g. Prisma), and communication domain use like conversational AI a.k.a. chatbots. The chatbot industry is expanding fast, yet the technologies are still young. Conversational bots used to be rather vacant like the old school text-based game “I smell a Wumpus”, but now they evolved into a top quality business tool. Chatbots offer a new type of simple and friendly interface imperative for browsing information and receiving services. IT experts and industry giants including Google, Microsoft, and Facebook agree that this technology will play a huge role in the future. To enjoy the marvels of Conversational Artificial Intelligence tools (or chatbots, if you are into brevity things), you must master the basics and understand the typical stack. In this article, we will discuss all kinds of instruments you can gear up with, how they are similar and at the same time different from each other, as well as their ups and downs. But before we hop on the journey of discovering these, let’s get into the deeper understanding of the chatbots and their topology.

Deep Learning With Apache Spark – Part 1

First part on a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: What is Spark, basics on Spark+DL and a little more.

Time Series Deep Learning: Forecasting Sunspots With Keras Stateful LSTM In R

Time series prediction (forecasting) has experienced dramatic improvements in predictive accuracy as a result of the data science machine learning and deep learning evolution. As these ML/DL tools have evolved, businesses and financial institutions are now able to forecast better by applying these new technologies to solve old problems. In this article, we showcase the use of a special type of Deep Learning model called an LSTM (Long Short-Term Memory), which is useful for problems involving sequences with autocorrelation. We analyze a famous historical data set called “sunspots” (a sunspot is a solar phenomenon wherein a dark spot forms on the surface of the sun). We’ll show you how you can use an LSTM model to predict sunspots ten years into the future with an LSTM model.

How to run a custom version of Spark on hosted Kubernetes

Do you want to try out a new version of Apache Spark without waiting around for the entire release process? Does running alpha-quality software sound like fun? Does setting up a test cluster sound like work? This is the blog post for you, my friend! We will help you deploy code that hasn’t even been reviewed yet (if that is the adventure you seek). If you’re a little cautious, reading this might sound like a bad idea, and often it is, but it can be a great way to ensure that a PR really fixes your bug, or the new proposed Spark release doesn’t break anything you depend on (and if it does, you can raise the alarm). This post will help you try out new (2.3.0+) and custom versions of Spark on Google/Azure with Kubernetes. Just don’t run this in production without a backup and a very fancy support contract for when things go sideways.

Natural Language Generation with Markovify in Python

In the age of Artificial Intelligence systems, developing solutions that don't sound plastic or artificial is an area where a lot of innovation is happening. While Natural Language Processing (NLP) is primarily focused on consuming natural language text and making sense of it, Natural Language Generation (NLG) is a niche area within NLP concerned with generating text that reads as human-written rather than machine-generated.
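The core idea a library like Markovify automates can be sketched in plain Python: record which words follow each two-word state in a corpus, then walk that table to emit new text. (This is a simplified illustration, not Markovify's actual implementation; the toy corpus and function names are invented for this example.)

```python
import random
from collections import defaultdict

def build_chain(text, state_size=2):
    """Map each state (tuple of consecutive words) to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain

def generate(chain, length=10, seed=None):
    """Walk the chain from a random starting state to produce new text."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(state):]))
        if not followers:          # dead end: no observed continuation
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = ("the cat sat on the mat and the cat saw the dog "
          "and the dog sat on the log")
chain = build_chain(corpus)
print(generate(chain, length=8, seed=42))
```

A larger state size makes the output track the source text more closely; a smaller one rambles more freely. That trade-off is the central tuning knob of Markov-chain text generation.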

Continue Reading…


Read More

5 best practices for delivering design critiques

Real critique helps teams strengthen their designs, products, and services.

Continue reading 5 best practices for delivering design critiques.

Continue Reading…


Read More

By how much does AVX-512 slow down your CPU? A first experiment.

Intel is finally making available processors that support the fancy AVX-512 instruction sets and that can fit nicely in a common server rack. So I went to Dell and ordered a server with a Skylake-X microarchitecture: an Intel Xeon W-2104 CPU @ 3.20GHz.

This processor supports several interesting AVX-512 instruction sets. They are made of very powerful instructions that can manipulate 512-bit vectors.

On the Internet, the word out is that using AVX-512 in your application is going to slow down your whole server, so you should just give up and never use AVX-512 instructions.

Vlad Krasnov from Cloudflare wrote:

If you do not require AVX-512 for some specific high-performance tasks, I suggest you disable AVX-512 execution on your server or desktop, (…)

Table 15-16 in Intel’s optimization manual describes the impact of the various instructions you use on “Turbo Boost” (one of Intel’s frequency scaling technologies). The type of instructions you use determines the “license” you are in. If you avoid AVX-512 and heavy AVX2 instructions (floating-point instructions and multiplications), you get the best boost. If you use light AVX-512 instructions or heavy AVX2 instructions, you get less of a boost… and you get the worst results with heavy AVX-512 instructions.

Intel sends us to a sheet of frequencies. Unfortunately, a quick look did not give me anything on my particular processor (Intel Xeon W-2104).

Intel is not being very clear:

Workloads that execute Intel AVX-512 instructions as a large proportion of their whole instruction count can gain performance compared to Intel AVX2 instructions, even though they may operate at a lower frequency. It is not always easy to predict whether a program’s performance will improve from building it to target Intel AVX-512 instructions.

What I am most interested in is the theory people seem to have that if you use AVX-512 sparingly, it will bring down the performance of your whole program. How could I check this theory?

I picked up a benchmark program that computes the Mandelbrot set. Then, using AVX-512 intrinsics, I added AVX-512 instructions to the program at select places. These instructions do nothing to contribute to the solution, but they cannot be trivially optimized away by the compiler. I used both light and heavy AVX-512 instructions. There are few enough of them so that the overhead is negligible… but if they slowed down the processor in a significant manner, we should be able to measure a difference.

The results?

mode            running time (average over 10)
no AVX-512      1.48 s
light AVX-512   1.48 s
heavy AVX-512   1.48 s

Using spurious AVX-512 instructions made no difference to the running time in my tests. I don’t doubt that the frequency throttling is real, as it is described by Intel and widely reported, but I could not measure it.

This suggests that, maybe, it is less likely to be an issue than is often reported, at least on the type of processors I have. Or else I made a mistake in my tests.

In any case, we need reproducible simple tests. Do you have one?

My code and scripts are available.

Continue Reading…


Read More

Leverage the Power of Data-Literacy

Optimizing your business for AI success is the only way to leverage its growing power; data-literacy represents the foundation of that optimization.

Continue Reading…


Read More

Book Memo: “Optimization in Engineering: Models and Algorithms”

Models and Algorithms
This textbook covers the fundamentals of optimization, including linear, mixed-integer linear, nonlinear, and dynamic optimization techniques, with a clear engineering focus. It carefully describes classical optimization models and algorithms using an engineering problem-solving perspective, and emphasizes modeling issues using many real-world examples related to a variety of application areas. Providing an appropriate blend of practical applications and optimization theory makes the text useful to both practitioners and students, and gives the reader a good sense of the power of optimization and the potential difficulties in applying optimization to modeling real-world systems.
The book is intended for undergraduate and graduate-level teaching in industrial engineering and other engineering specialties. It is also of use to industry practitioners, due to the inclusion of real-world applications, opening the door to advanced courses on both modeling and algorithm development within the industrial engineering and operations research fields.

Continue Reading…


Read More

Book Memo: “Tensor Numerical Methods in Scientific Computing”

This book presents an introduction to modern tensor-structured numerical methods in scientific computing. In recent years, these methods have been shown to provide a powerful tool for efficient computations in higher dimensions, thus overcoming the so-called “curse of dimensionality”, a problem that encompasses various phenomena that arise when analyzing and organizing data in high-dimensional spaces.

Continue Reading…


Read More

Two day workshop: Flexible programming of MCMC and other methods for hierarchical and Bayesian models

(This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

We’ll be giving a two day workshop at the 43rd Annual Summer Institute of Applied Statistics at Brigham Young University (BYU) in Utah, June 19-20, 2018.

Abstract is below, and registration and logistics information can be found here.

This workshop provides a hands-on introduction to using, programming, and sharing Bayesian and hierarchical modeling algorithms using NIMBLE. In addition to learning the NIMBLE system, users will develop hands-on experience with various computational methods. NIMBLE is an R-based system that allows one to fit models specified using BUGS/JAGS syntax but with much more flexibility in defining the statistical model and the algorithm to be used on the model. Users operate from within R, but NIMBLE generates C++ code for models and algorithms for fast computation. I will open with an overview of creating a hierarchical model and fitting the model using a basic MCMC, similarly to how one can use WinBUGS, JAGS, and Stan. I will then discuss how NIMBLE allows the user to modify the MCMC – changing samplers and specifying blocking of parameters. Next I will show how to extend the BUGS syntax with user-defined distributions and functions that provide flexibility in specifying a statistical model of interest. With this background we can then explore the NIMBLE programming system, which allows one to write new algorithms not already provided by NIMBLE, including new MCMC samplers, using a subset of the R language. I will then provide examples of non-MCMC algorithms that have been programmed in NIMBLE and how algorithms can be combined together, using the example of a particle filter embedded within an MCMC. We will see new functionality in NIMBLE that allows one to fit Bayesian nonparametric models and spatial models. I will close with a discussion of how NIMBLE enables sharing of new methods and reproducibility of research. The workshop will include a number of breakout periods for participants to use and program MCMC and other methods, either on example problems or problems provided by participants. In addition, participants will see NIMBLE’s flexibility in action in several real problems.

To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE.

Continue Reading…


Read More

Let’s Admit It: We’re a Long Way from Using “Real Intelligence” in AI

With the growth of AI systems and unstructured data, there is a need for an independent means of data curation, evaluation and measurement of output that does not depend on the natural language constructs of AI and creates a comparative method of how the data is processed.

Continue Reading…


Read More

Hackathon – Hack the city 2018

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)


Hello, R-Users!

Have you ever joined a hackathon? If so, you surely know how fun and stimulating these events are. If not… now’s your chance!

Quantide is collaborating with Hack the City 2018, the 4th edition of the Southern Switzerland’s hackathon. Aside from being partners, we’re actively part of it: we have members in the mentors’ group and on the technical commission of the jury. We also give out a special mention for the best data science project developed with open-source technology, preferably using R as the programming language. It’s the first time R has made an appearance at this hackathon: if you work with R, here’s your chance!

Hack the City is a great occasion to show off your talent and your ideas, in a team or as an individual. Programmers, graphic designers, project designers and makers collaborate to build something in just 48 hours. You can also join the competition from your own home, as long as at least one of your teammates is in Lugano! At the end of the hackathon, a jury will award great prizes to the best projects: 5000, 2500 and 1000 CHF for first, second and third place respectively. There are other special mentions and scholarships, as we have mentioned before.

The last-minute tickets are available now, so if you want to grab this occasion, you should do it fast! Hack the City 2018 will take place in Lugano, from 27th to 29th April. Hope to see you there!

The post Hackathon – Hack the city 2018 appeared first on Quantide – R training & consulting.

To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide - R training & consulting.

Continue Reading…


Read More

Europeans remain welcoming to immigrants

For those who believe that migration can, if managed properly, make a country materially and culturally richer, recent developments in European politics have been worrying.

Continue Reading…


Read More

Postdoc opportunity at AstraZeneca in Cambridge, England, in Bayesian Machine Learning using Stan!

Here it is:

Predicting drug toxicity with Bayesian machine learning models

We’re currently looking for talented scientists to join our innovative academic-style Postdoc. From our centre in Cambridge, UK you’ll be in a global pharmaceutical environment, contributing to live projects right from the start. You’ll take part in a comprehensive training programme, including a focus on drug discovery and development, be given access to our existing Postdoctoral research, and be encouraged to pursue your own independent research. It’s a newly expanding programme spanning a range of therapeutic areas across a wide range of disciplines. . . .

You will be part of the Quantitative Biology group and develop comprehensive Bayesian machine learning models for predicting drug toxicity in liver, heart, and other organs. This includes predicting the mechanism as well as the probability of toxicity by incorporating scientific knowledge into the prediction problem, such as known causal relationships and known toxicity mechanisms. Bayesian models will be used to account for uncertainty in the inputs and propagate this uncertainty into the predictions. In addition, you will promote the use of Bayesian methods across safety pharmacology and biology more generally. You are also expected to present your findings at key conferences and in leading publications.

This project is in collaboration with Prof. Andrew Gelman at Columbia University, and Dr Stanley Lazic at AstraZeneca.

The post Postdoc opportunity at AstraZeneca in Cambridge, England, in Bayesian Machine Learning using Stan! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Umpire strike zone changes to finish games earlier

When watching baseball on television, we get the benefit of seeing whether a pitch entered the strike zone or not. Umpires go by eye, and intentional or not, they tend towards finishing a game over extra innings. Michael Lopez, Brian Mills, and Gus Wezerek for FiveThirtyEight:

The left panel shows the comparative rate of strike calls when, in the bottom of an inning in extras, the batting team is positioned to win — defined as having a runner on base in a tie game — relative to those rates in situations when there’s no runner on base in a tie game. When the home team has a baserunner, umps call more balls, thus setting up more favorable counts for home-team hitters, creating more trouble for the pitcher, and giving the home team more chances to end the game.

I doubt the shift is on purpose, but it’s interesting to see the calls go that way regardless. Also, as a non-baseball-viewer: why isn’t there any replay in baseball yet?


Continue Reading…


Read More

Deploying Deep Learning Models on Kubernetes with GPUs

This post is authored by Mathew Salvaris and Fidan Boylu Uz, Senior Data Scientists at Microsoft.

One of the major challenges that data scientists often face is closing the gap between training a deep learning model and deploying it at production scale. Training of these models is a resource intensive task that requires a lot of computational power and is typically done using GPUs. The resource requirement is less of a problem for deployment since inference tends not to pose as heavy a computational burden as training. However, for inference, other goals also become pertinent such as maximizing throughput and minimizing latency. When inference speed is a bottleneck, GPUs show considerable performance gains over CPUs. Coupled with containerized applications and container orchestrators like Kubernetes, it is now possible to go from training to deployment with GPUs faster and more easily while satisfying latency and throughput goals for production grade deployments.

In this tutorial, we provide step-by-step instructions to go from loading a pre-trained Convolutional Neural Network model to creating a containerized web application that is hosted on a Kubernetes cluster with GPUs on Azure Container Service (AKS). AKS makes it quick and easy to deploy and manage containerized applications without much expertise in managing a Kubernetes environment. It eliminates the complexity and operational overhead of maintaining the cluster by provisioning, upgrading, and scaling resources on demand, without taking the applications offline. AKS reduces the cost and complexity of using a Kubernetes cluster by managing the master nodes, for which the user does not incur a cost. Azure Container Service has been available for a while, and a similar approach was provided in a previous tutorial to deploy a deep learning framework on a Marathon cluster with CPUs. In this tutorial, we focus on two of the most popular deep learning frameworks and provide step-by-step instructions to deploy pre-trained models on a Kubernetes cluster with GPUs.

The tutorial is organized in two parts, one for each deep learning framework, specifically TensorFlow and Keras with TensorFlow backend. Under each framework, there are several notebooks that can be executed to perform the following steps:

  • Develop the model that will be used in the application.
  • Develop the API module that will initialize the model and make predictions.
  • Create Docker image of the application with Flask and Nginx.
  • Test the application locally.
  • Create an AKS cluster with GPUs and deploy the web app.
  • Test the web app hosted on AKS.
  • Perform speed tests to understand latency of the web app.

Below, you will find short descriptions of the steps above.

Develop the Model

As the first step of the tutorial, we load the pre-trained ResNet152 model, pre-process an example image to the required format and call the model to find the top predictions. The code developed in this step will be used in the next step when we develop the API module that initializes the model and makes predictions.
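The preprocessing half of this step can be sketched with NumPy alone. The 224×224 input size and per-channel mean subtraction are the conventional ImageNet settings for ResNet-style models; the notebooks themselves use the framework's own helpers, so treat this as an illustration of the expected input format, not the exact tutorial code:

```python
import numpy as np

# Per-channel ImageNet means (RGB) commonly used for ResNet-style models.
IMAGENET_MEAN = np.array([123.68, 116.779, 103.939], dtype=np.float32)

def preprocess(image):
    """Turn one HxWx3 uint8 image into a 1x224x224x3 float batch.

    Assumes the image has already been resized to 224x224; real code
    would resize with PIL/OpenCV or the framework's utilities.
    """
    x = image.astype(np.float32)
    x -= IMAGENET_MEAN                 # center each channel
    return x[np.newaxis, ...]          # add the batch dimension

def top_predictions(probs, labels, k=3):
    """Return the k highest-probability (label, score) pairs."""
    idx = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in idx]

img = np.zeros((224, 224, 3), dtype=np.uint8)
batch = preprocess(img)
print(batch.shape)  # (1, 224, 224, 3)
```

The model call itself is then a single forward pass on `batch`, followed by `top_predictions` over the returned class probabilities.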

Develop the API

In this step, we develop the API that will call the model. This driver module initializes the model, transforms the input so that it is in the appropriate format and defines the scoring method that will produce the predictions. The API will expect the input to be in JSON format. Once a request is received, the API will convert the JSON encoded request into the image format. The first function of the API loads the model and returns a scoring function. The second function processes the images and uses the first function to score them.
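The two-function structure described above — one function that loads the model once and returns a scoring closure, another that decodes the JSON request and applies it — can be mimicked with a stand-in model. Everything here is invented for illustration; the real driver module wraps a deep learning framework and image decoding:

```python
import json

def init_model():
    """Load the model once and return a scoring closure (the API's first function)."""
    model = {"cat": 0.9, "dog": 0.1}          # stand-in for a loaded network
    def score(image):
        # A real scorer would run the network on the decoded image tensor.
        return model
    return score

def handle_request(body, score):
    """Decode a JSON request, score each image, and encode the response."""
    images = json.loads(body)["images"]
    results = [score(img) for img in images]
    return json.dumps({"predictions": results})

score = init_model()
print(handle_request('{"images": ["img1", "img2"]}', score))
```

Keeping model loading out of the request path this way means the expensive initialization happens once per process, not once per prediction.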

Create Docker Image

In this step, we create the Docker image that has three main parts: the web application, the pre-trained model, and the driver module for executing the model based on the requests made to the web application. The Docker image is based on an Nvidia image to which we only add the necessary Python dependencies and install the deep learning framework to keep the image as lightweight as possible. The Flask web app runs on the default port 80, which is exposed on the Docker image, and Nginx is used to create a proxy from port 80 to port 5000. Once the container is built, we push it to a public Docker Hub account for the AKS cluster to pull in later steps.

Test the Application Locally

In this step, we test our Docker image by pulling it and running it locally. This step is especially important to make sure the image performs as expected before we go through the entire process of deploying to AKS. It substantially reduces debugging time by checking that we can send requests to the Docker container and receive predictions back properly.

Create an AKS Cluster and Deploy

In this step, we use the Azure CLI to log in to Azure, create a resource group for AKS, and create the cluster. We create an AKS cluster with 1 node using the Standard NC6 series with 1 GPU. After the AKS cluster is created, we connect to the cluster and deploy the application by defining the Kubernetes manifest, where we provide the image name, map port 80, and specify Nvidia library locations. We set the number of Kubernetes replicas to 1, which can later be scaled up to meet certain throughput requirements (the latter is out of scope for this tutorial). Kubernetes also has a dashboard that can simply be accessed through a web browser.

Test the Web App

In this step, we test the web application that is deployed on AKS to quickly check if it can produce predictions against images that are sent to the service.

Perform Speed Tests

In this step, we use the deployed service to measure the average response time by sending 100 asynchronous requests with only four concurrent requests at any time. These types of tests are particularly important to perform for deployments with low latency requirements, to make sure the cluster is scaled to meet the demand. The results of the tests suggest that average response times are under a second for both frameworks, with TensorFlow (~20 images/sec) being notably faster than its Keras (~12 images/sec) counterpart on a single K80 GPU.
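The shape of such a test — 100 requests with at most four in flight at any time — can be sketched with the standard library. The request function below is a stub standing in for an HTTP call to the deployed scoring endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(i):
    """Stub for an HTTP POST to the scoring endpoint; returns elapsed seconds."""
    start = time.perf_counter()
    time.sleep(0.01)                   # stand-in for network + inference time
    return time.perf_counter() - start

# 100 requests with at most 4 in flight at any time, as in the tutorial.
with ThreadPoolExecutor(max_workers=4) as pool:
    latencies = list(pool.map(send_request, range(100)))

print(f"average latency: {sum(latencies) / len(latencies):.3f} s")
```

Capping the worker count at four keeps the measured latency representative of moderate load rather than a saturation test.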

As a last step, to delete the AKS and free up the Azure resources, we use the commands provided at the end of the notebook where AKS was created.

We hope you give this tutorial a try! Reach out to us with any comments or questions below.

Mathew & Fidan


We would like to thank William Buchwalter for helping us craft the Kubernetes manifest files, Daniel Grecoe for testing the throughput of the models and lastly Danielle Dean for the useful discussions and proofreading of the blog post.

Continue Reading…


Read More

A Recession Before 2020 Is Likely; On the Distribution of Time Between Recessions

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

I recently saw a Reddit thread in r/PoliticalDiscussion asking the question “If the economy is still booming 2020, how should the Democratic address this?” This gets to an issue that’s been on my mind since at least 2016, maybe even 2014: when will the current period of economic growth end?

For some context, the Great Recession, as economists colloquially call the recession beginning in 2007 and punctuated with the 2008 financial crisis, ended officially in June 2009; it was then that the economy resumed growth. As of this writing, that was about eight years, ten months ago. The longest previous period between recessions was the time between the early 1990s recession and the early 2000s recession that coincided with the collapse of the dot-com bubble; that period was ten years, and it is the only gap longer than the present period between recessions.

There is growing optimism in the economy, most noticeably amongst consumers, and we are finally seeing wages increase in the United States after years of stagnation. Donald Trump and Republicans point to the economy as a reason to vote for Republicans in November (and yet Donald Trump is still historically unpopular and Democrats have a strong chance of capturing the House, and a fair chance at the Senate). Followers of the American economy are starting to ask, “How long can this last?”

In 2016, I was thinking about this issue in relation to the election. I wanted Hillary Clinton to win, but at the same time I feared that a Clinton win would be a short-term gain, long-term loss for Democrats. One reason why is I believe there’s a strong chance of a recession within the next few years.

The 2008 financial crisis was a dramatic event, yet the Dodd-Frank reforms and other policy responses, in my opinion, did not go far enough to address the problems unearthed by the financial crisis. Too-big-to-fail institutions are now a part of law (though the policy jargon is systemically important financial institution, or SIFI). In fact, the scandal surrounding HSBC’s support of money laundering and the Justice Department’s weak response suggested bankers may be too-big-to-jail! Many of the financial products and practices that caused the financial crisis are still legal; the fundamentals that produced the crisis have not changed. Barack Obama and the Democrats (and the Republicans, certainly) failed to break the political back of the bankers.

While I did not think Bernie Sanders’ reforms would necessarily make the American economy better, I thought he would put the fear of God back into the financial sector, and that alone could help keep risky behavior in check. Donald Trump, for all his populist rhetoric, has not demonstrated he’s going to put that fear in them. In fact, the Republicans passed a bill that’s a gift to corporations and top earners. The legacy of the 2008 financial crisis is that the financial sector can make grossly risky bets in the good “get government off our back!” times, but will have their losses covered by taxpayers in the “we need government help!” times. Recessions and financial crises are a part of the process of expropriating taxpayers. (I wrote other articles about this topic: see this article and this article, as well as this paper I wrote for an undergraduate class.)

Given all this, there’s good reason to believe that nothing has changed about the American economy that would change the likelihood of a financial crisis. Since it has been so long since the last one, it’s time to start expecting one, and whoever holds the Presidency will be blamed.

Right now that’s Donald Trump and the Republicans. And I don’t need to tell you that given Trump’s popularity in good economic times is historically low, a recession before the 2020 election would lead to a Republican rout, with few survivors.

And in a Census year, too!

So what is the probability of a recession? The rest of this article will focus on finding a statistical model for duration between elections and using that model to estimate the probability of a recession.

A recent article in the magazine Significance entitled “The Weibull distribution” describes the Weibull distribution, a common and expressive probability distribution (and one I recently taught in my statistics class). This distribution is used to model a lot of phenomena, including survival times, the time until a system fails or how long a patient diagnosed with a disease survives. Time until recession sounds like a “survival time”, so perhaps the Weibull distribution can be used to model it.
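For reference, the Weibull density and survival function with shape k and scale \lambda, in the same parameterization R's pweibull uses, are

f(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^k}, \qquad S(t) = P(T > t) = e^{-(t/\lambda)^k}, \quad t \ge 0.

These are standard facts about the Weibull family, stated here because the conditional recession probability computed below is a ratio of two survival-function values.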

First, I’m going to be doing some bootstrapping, so here’s the seed for replicability:


The dataset below, obtained from this Wikipedia article, contains the time between recessions in the United States. I look only at recessions since the Great Depression, considering this to be the “modern” economic era for the United States. The sample size is necessarily small, at 13 observations.

recessions <- c( 4+ 2/12,  6+ 8/12,  3+ 1/12,  3+ 9/12,  3+ 3/12,  2+ 0/12,
                 8+10/12,  3+ 0/12,  4+10/12,  1+ 0/12,  7+ 8/12, 10+ 0/12,
                 6+ 1/12)



The fitdistrplus package allows for estimating the parameters of statistical distributions using the usual statistical techniques. (I found a Journal of Statistical Software article useful for learning about the package.) I load it below and look at an initial plot to get a sense of appropriate distributions.


library(fitdistrplus)

descdist(recessions, boot = 1000)

## summary statistics
## ------
## min:  1   max:  10 
## median:  4.166667 
## mean:  4.948718 
## estimated sd:  2.71943 
## estimated skewness:  0.51865 
## estimated kurtosis:  2.349399

The recessions dataset is platykurtic though right-skewed, a surprising result. However, that’s not enough to deter me from attempting to use the Weibull distribution to model time between recessions. (I should mention that I am essentially assuming that times between recessions since the Great Depression are independent and identically distributed. This is not obvious or uncontroversial, but I doubt this could be credibly disproven or that assuming dependence would improve the model.) Let’s fit parameters.

fw <- fitdist(recessions, "weibull")
## Fitting of the distribution ' weibull ' by maximum likelihood 
## Parameters : 
##       estimate Std. Error
## shape 2.001576  0.4393137
## scale 5.597367  0.8179352
## Loglikelihood:  -30.12135   AIC:  64.2427   BIC:  65.3726 
## Correlation matrix:
##           shape     scale
## shape 1.0000000 0.3172753
## scale 0.3172753 1.0000000

plot(seq(0, 15, length.out = 1000), dweibull(seq(0, 15, length.out = 1000),
                                              shape = fw$estimate["shape"],
                                              scale = fw$estimate["scale"]),
     col = "blue", type = "l", xlab = "Duration", ylab = "Density",
     main = "Weibull distribution applied to recession duration")


The plots above suggest the fitted Weibull distribution describes the observed distribution well; the Q-Q plot, P-P plot, and the estimated density function all fit well with a Weibull distribution. I also compared the AIC values of the fitted Weibull distribution to two other close candidates, the gamma and log-normal distributions; the Weibull distribution provides the best fit according to the AIC criterion, being twice as reasonable as the log-normal distribution, although only slightly better than the gamma distribution (which is not surprising, given that the two distributions are similar). Given the interpretations that come with the Weibull distribution and the statistical evidence, I believe it provides the better fit and should be used.

Based on the form of the distribution and the estimated parameters we can find a point estimate for the probability of a recession both before the 2018 midterm election and before the 2020 presidential election. That is, if T is the time between recessions, we can estimate

P(T \leq t_0 + \delta \mid T > t_0) = 1 - \frac{S(t_0 + \delta)}{S(t_0)},

where S(t) = P(T > t) is the Weibull survival function, t_0 is the time since the last recession, and \delta is the horizon of interest.

alpha <- fw$estimate["shape"]
beta <- fw$estimate["scale"]

recession_prob_wei <- function(delta, passed, shape, scale) {
  # Computes the probability of a recession within the next delta years given
  # passed years
  # args:
  #   delta: a number representing time to next recession
  #   passed: a number representing time since last recession
  #   shape: the shape parameter of the Weibull distribution
  #   scale: the scale parameter of the Weibull distribution
  if (delta < 0 | passed < 0) {
    stop("Both delta and passed must be non-negative")
  }
  return(1 - pweibull(passed + delta, shape = shape, scale = scale,
                      lower.tail = FALSE) /
             pweibull(passed, shape = shape, scale = scale, lower.tail = FALSE))
}

# Recession prob. before 2018 election point estimate
recession_prob_wei(6/12, 8+10/12, shape = alpha, scale = beta)
## [1] 0.252013

# Before 2020 election
recession_prob_wei(2+6/12, 8+10/12, shape = alpha, scale = beta)
## [1] 0.8005031

Judging by the point estimates, there’s a 25% chance of a recession before the 2018 midterm election and an 80% chance of a recession before the 2020 election.

The code below finds bootstrapped 95% confidence intervals for these numbers.

library(boot)

recession_prob_wei_bootci <- function(data, delta, passed, conf = .95,
                                      R = 1000) {
  # Computes bootstrapped CI for the probability a recession will occur before
  # a certain time given some time has passed
  # args:
  #   data: A numeric vector containing recession data
  #   delta: A nonnegative real number representing maximum time till recession
  #   passed: A nonnegative real number representing time since last recession
  #   conf: A real number between 0 and 1; the confidence level
  #   R: A positive integer for the number of bootstrap replicates
  bootobj <- boot(data, R = R, statistic = function(data, indices) {
    d <- data[indices]
    params <- fitdist(d, "weibull")$estimate
    return(recession_prob_wei(delta, passed, shape = params["shape"],
                              scale = params["scale"]))
  })
  boot.ci(bootobj, type = "perc", conf = conf)
}

# Bootstrapped 95% CI for probability of recession before 2018 election
recession_prob_wei_bootci(recessions, 6/12, 8+10/12, R = 10000)
## Based on 10000 bootstrap replicates
## CALL : 
## boot.ci(boot.out = bootobj, conf = conf, type = "perc")
## Intervals : 
## Level     Percentile     
## 95%   ( 0.1691,  0.6174 )  
## Calculations and Intervals on Original Scale

# Bootstrapped 95% CI for probability of recession before 2020 election
recession_prob_wei_bootci(recessions, 2+6/12, 8+10/12, R = 10000)
## Based on 10000 bootstrap replicates
## CALL : 
## boot.ci(boot.out = bootobj, conf = conf, type = "perc")
## Intervals : 
## Level     Percentile     
## 95%   ( 0.6299,  0.9974 )  
## Calculations and Intervals on Original Scale

These CIs suggest that while the probability of a recession before the 2018 midterm is very uncertain (it could plausibly be anywhere between 17% and 62%), my hunch about 2020 has validity; even the lower bound of that CI suggests a recession before 2020 is likely, and the upper bound is near certainty.

How bad could it be? That’s hard to say. However, these odds make the Republican tax bill and its trillion-dollar deficits look even more irresponsible; that money will be needed to deal with a potential recession’s fallout.

As bad as 2018 looks for Republicans, it could look like a cakewalk compared to 2020.

(And despite the seemingly jubilant tone, this suggests I may have trouble finding a job in the upcoming years.)

I have created a video course published by Packt Publishing entitled Data Acquisition and Manipulation with Python, the second volume in a four-volume set of video courses entitled Taming Data with Python; Excelling as a Data Analyst. This course covers more advanced Pandas topics such as reading in datasets in different formats and from databases, aggregation, and data wrangling. The course then transitions to cover getting data in “messy” formats from Web documents via web scraping. The course covers web scraping using BeautifulSoup, Selenium, and Scrapy. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

5 Reasons to Attend Spark + AI Summit

Spark + AI Summit will be held in San Francisco on June 4-6, 2018. Check out the full agenda and get your ticket before it sells out! Register today with the discount code 5Reasons and get 15% off.

Convergence of Knowledge

For any Apache Spark enthusiast, these summits are the convergence of Spark knowledge. With Spark used by a growing global community of enterprises, academics, contributors, and advocates, attendees have convened at these summits since 2013 to share knowledge. And this summer attendees will return to San Francisco—to an expanded scope and agenda.

Expansion of Scope

Today, unified analytics is paramount for building big data and Artificial Intelligence (AI) applications. AI applications require massive amounts of data to enhance and train machine learning models at scale, and so far Spark has been the only engine that combines large-scale data processing with the execution of state-of-the-art machine learning and AI algorithms in a unified manner.

So we have changed the name and expanded the scope of the summit to focus on and bring to you AI use cases and machine learning technology.

“AI has always been one of the most exciting applications of big data and Apache Spark, so with this change, we are planning to bring in keynotes, talks and tutorials about the latest tools in AI in addition to the great data engineering and data science content we already have” — Matei Zaharia.

For this expanded scope and much more, here are my five reasons as a program chair why you should join us.

1. Keynotes from Distinguished Engineers, Academics and Industry Leaders

Distinguished engineers and academics (Matei Zaharia, Dominique Brezinski, Dawn Song, and Michael I. Jordan) and visionary industry leaders (Ali Ghodsi, Marc Andreessen, and Andrej Karpathy) in the big data and AI industries will share their vision of where Apache Spark and AI are heading in 2018 and beyond.

2. Five New Tracks

To support our expanded scope, we have added five tracks to cover AI and Use Cases, Deep Learning Techniques, Python and Advanced Analytics, Productionizing Machine Learning Models, and Hardware in the Cloud. Combined with all other tracks, these sessions will give you over 180 talks to choose from. And if you miss any sessions, you can peruse the recorded sessions on the summit website later.

3. Apache Spark Training

Update your skills and get the best training from Databricks’ best trainers, who have trained over 3,600 summit attendees. On a day dedicated to training, you can choose from four courses and stay abreast of the latest in Spark 2.3 and Deep Learning: Data Science with Apache Spark; Understand and Apply Deep Learning with Keras, TensorFlow, and Apache Spark; Apache Spark Tuning and Best Practices; and Apache Spark Essentials. Depending on your preference, you can choose to register for each class on either AWS or Azure cloud. Plus, we will offer a half-day Databricks Developer Certification for Apache Spark prep course, after which you can sit for the exam on the same day. Get Databricks Certified!

4. The Bay Area Apache Spark Meetup

Apache Spark Meetups are known for their tech talks. At the summit’s meetup, you can learn what other Spark developers from all over are up to, mingle and enjoy the beverages and camaraderie in an informal setting, and ask burning questions.

5. City By The Bay

San Francisco is a city famed for its restaurants, cable cars, hills, Golden Gate Bridge, and vibrant nightlife. Take a breather after days of cerebral sessions, chill out at the Fisherman’s Wharf, visit MOMA, and much more…

We hope to see you in San Francisco!

What’s Next

With less than six weeks left, tickets are selling fast. If you haven’t yet, register today with the discount code 5Reasons and get 15% off.


Try Databricks for free. Get started today.

The post 5 Reasons to Attend Spark + AI Summit appeared first on Databricks.

Continue Reading…


Read More

Saddling up; getting on the hoRse for the first time

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Laura Swales, Marketing and Events Assistant

This year at Mango we’re proudly sponsoring the Bath Cats & Dogs
Home. To start our fundraising
for them, we decided to run a sweepstake on the Grand National. We asked
for £2 per horse, which would go to the cats and dogs home, and the
winner was promised a bottle of wine for their charitable efforts.

Working in a Data Science company, I knew that I couldn’t simply pick
names out of a hat for the sweepstake; ‘That’s not truly random!’ they
would cry. So in my hour of need, I turned to our two university
placement students, Owen and Caroline, to help me randomise the names in R.


To use an appropriate horse-based metaphor, I would class myself as a
‘non-starter’ in R – I’m not even near the actual race! My knowledge is
practically non-existent (‘Do you just type a lot of random letters?’)
and up until this blog I didn’t even have RStudio on my laptop.

The first hurdle

We began by creating a list of the people who had entered the
sweepstake. Where people had bet on more than one horse, their name
was entered as many times as needed to match the number of bets they
had laid down.

people_list <- c("Matt Glover", "Matt Glover", "Ed Gash",
                 "Ed Gash", "Ed Gash", "Lisa S", "Toby",
                 "Jen", "Jen", "Liz", "Liz", "Andrew M",
                 "Nikki", "Chris James", "Yvi", "Yvi",
                 "Yvi", "Beany", "Karina", "Chrissy", "Enrique",
                 "Pete", "Karis", "Laura", "Ryan", "Ryan", "Ryan",
                 "Ryan", "Ryan", "Owen", "Rich", "Rich", "Matt A",
                 "Matt A", "Matt A", "Matt A", "Matt A", "Matt A", 
                 "Matt A", "Matt A")

I had now associated all the names with the object called people_list.
Next I created an object containing the numbers 1-40 to represent each horse.

horses_list <- 1:40

With the two sets of values ready to go, I wanted to display them in a
table format to make it easier to match names and numbers.

assign <- data.frame(Runners = horses_list, People = people_list)

##   Runners      People
## 1       1 Matt Glover
## 2       2 Matt Glover
## 3       3     Ed Gash
## 4       4     Ed Gash
## 5       5     Ed Gash
## 6       6      Lisa S

Now the data appeared in a table, but had not been randomised. To do
this I used the sample function to jumble up the people_list names.

assign <- data.frame(Runners = horses_list, People = sample(people_list))
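One tweak worth noting (my addition, not part of the original draw): seeding the random number generator first makes the draw reproducible, so anyone can re-run the script and verify the result. A self-contained sketch with a shortened list:

```r
# Seed the RNG so the sweepstake draw can be re-run and verified
set.seed(2018)  # any fixed seed works; this one is arbitrary
people <- c("Matt", "Ed", "Lisa", "Toby", "Jen")
horses <- 1:5
draw <- data.frame(Runners = horses, People = sample(people))
draw  # the same assignment appears every time the script is run
```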

Free Rein

Success! I had a list of numbers (1-40) representing the horses and a
randomly jumbled up list of those taking part in the sweepstake.

At the time of writing (in RMarkdown!), fate had unfortunately handed
me the favourite to win. As you can imagine, this is something
that will not make you popular in the office.

My First Trot

I hope you enjoyed my first attempt in R. I will definitely use it again
to randomise our next sweepstake, though under intense supervision. I
can still hear the cries of ‘FIX!’ around the office. It’s always an
awkward moment when you win your own sweepstake…

Despite the controversy, it was fun to try out R in an accessible way
and it helped me understand some of the basic functions available.
Perhaps I’ll sit in on the next LondonR
workshop and learn some more!

If you’d like to find out more about the Bath Cats & Dogs Home please
visit here.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

Continue Reading…


Read More

Presto for Data Scientists – SQL on anything

Presto enables data scientists to run interactive SQL across multiple data sources. This open source engine supports querying anything, anywhere, and at large scale.

Continue Reading…


Read More

Psychometrics corner: They want to fit a multilevel model instead of running 37 separate correlation analyses

Anouschka Foltz writes:

One of my students has some data, and there is an issue with multiple comparisons. While trying to find out how to best deal with the issue, I came across your article with Martin Lindquist, “Correlations and Multiple Comparisons in Functional Imaging: A Statistical Perspective.” And while my student’s work does not involve functional imaging, I thought that your article may present a solution for our problem.

My student is interested in the relationship between vocabulary size and different vocabulary learning strategies (VLS). He has measured each participant’s approximate vocabulary size with a standardized test (scores between 0 and 10000) and asked each participant how frequently they use each of 37 VLS on a scale from 1 through 5. The 37 VLS fall into five different groups (cognitive, memory, social etc.). He is interested in which VLS correlate with or predict vocabulary size. To see which VLS correlate with vocabulary size, we could run 37 separate correlation analyses, but then we run into the problem that we are doing multiple comparisons and the issue of false positives that goes along with that.

Do you think a multilevel Bayesian approach that uses partial pooling, as you suggest in your paper for functional imaging data, would be appropriate in our case? If so, would you be able to provide me with some more information as to how I can actually run such an analysis? I am working in R, and any information as to which packages and functions would be appropriate for the analysis would be really helpful. I came across the brms package for Advanced Bayesian Multilevel Modeling, but I have not worked with this particular package before and I am not sure if this is exactly what I need.

My reply:

I do think a multilevel Bayesian approach would make sense. I’ve never worked on this particular problem. So I am posting it here on blog on the hope that someone might have a response. This seems like the exact sort of problem where we’d fit a multilevel model rather than running 37 separate analyses!
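For what it's worth, here is a minimal sketch of the kind of partial-pooling model being discussed (my own, with simulated stand-in data, not the student's actual dataset); lme4 is used for speed, and brms::brm would accept essentially the same formula for the fully Bayesian version:

```r
library(lme4)

set.seed(1)
n_subj <- 100; n_vls <- 37

# Long format: one row per (participant, strategy) pair
d <- expand.grid(subject = 1:n_subj, strategy = factor(1:n_vls))
d$freq <- sample(1:5, nrow(d), replace = TRUE)  # self-reported use, 1-5
vocab_scores <- rnorm(n_subj, 5000, 1500)       # placeholder test scores
d$vocab <- vocab_scores[d$subject]

# One multilevel model instead of 37 separate correlations: the per-strategy
# slopes for freq are partially pooled toward a common mean, which shrinks
# noisy estimates and addresses the multiple-comparisons concern
fit <- lmer(vocab ~ freq + (1 + freq | strategy), data = d)
ranef(fit)$strategy  # per-strategy departures from the average slope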

The post Psychometrics corner: They want to fit a multilevel model instead of running 37 separate correlation analyses appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Using Natural Language Processing to Combat Filter Bubbles and Fake News – 360° Stance Detection

The phenomenon of filter bubbles is one of the most alarming developments of the Internet age. ‘Filter bubble’ refers to the situation where people only consume news content expressing political or social viewpoints that they already agree with, and it has led to entrenchment of existing biases and increased polarization in society.

This problem of filter bubbles is due in large part to the success of content recommendation systems on social networks, smartphones, and websites, where the content recommended to the user is heavily oriented toward the same topics and news outlets that he or she has already shown an interest in. Because of this, the views expressed in the recommended content are likely to be similar to what the user has already read, making the content people consume resemble an echo chamber.

To make matters worse, filter bubbles also increase people’s susceptibility towards ‘fake news’, where fabricated news stories spread sensational but false information to readers with a particular stance on a social or political issue. In the final three months of the presidential election in 2016, the most popular fake news stories were shared even more than the most popular legitimate news stories on Facebook.

Facebook’s attempt to mitigate the problem by placing red flags next to content from known fake news sites has come to an unsuccessful end, and the Natural Language Processing community has been making efforts to ameliorate this problem.


360° Stance Detection

To contribute to these efforts, our Science and Engineering teams at AYLIEN have built 360° Stance Detection. This tool gathers multiple news stories about a topic that also mention an entity, and classifies the stance each author is expressing towards a given topic – ‘for,’ ‘against,’ or ‘neutral’.

This tool allows us to automatically source news stories that take different views about a certain topic. For example, we can search for stories about Brexit that mention Theresa May, and retrieve stories that the tool has recognized as supporting Theresa May as well as stories recognized as being against Theresa May.



To showcase how the tool works, we’ve put together a simple demo that you can access here. Given a topic and a keyword, the stance detection tool will gather stories about a given topic and predict the stance of the author of each story toward the keyword. The demo will then display the results on a scatter plot that graphs the stance of the author along with the popularity of the website (according to its Alexa ranking).
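As a rough illustration of what the demo plots (entirely invented numbers, not AYLIEN output), each story's predicted stance is paired with its outlet's Alexa rank:

```r
# Toy data standing in for the demo's per-story predictions
stories <- data.frame(
  outlet = c("Outlet A", "Outlet B", "Outlet C", "Outlet D"),
  stance = c(0.8, -0.6, 0.1, -0.9),   # -1 = against, +1 = for
  alexa  = c(1200, 350, 9000, 150))   # lower rank = more popular site

plot(stories$stance, stories$alexa, ylim = rev(range(stories$alexa)),
     xlab = "Predicted stance toward entity",
     ylab = "Alexa rank (lower = more popular)", pch = 19)
text(stories$stance, stories$alexa, stories$outlet, pos = 3)
```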

Below is a screenshot of the scatter plot showing the predicted stance of each author toward Donald Trump in stories about Robert Mueller. Mueller is the special counsel charged with investigating Russian meddling in the 2016 Presidential Election, a news topic that prompts polarized opinions about Trump, depending on your political stance.



You can see that the model has predicted the stance of a story about Mueller on alt-right stalwart Breitbart as being pro-Trump, while a story about Mueller appearing on Real Clear Politics, a favourite of the left, has been predicted as being anti-Trump. We can also see the stance predicted about Trump in stories about Mueller by a diverse range of publishers like ABC, The Sun, Farming UK, and others.

Here’s a two-minute walkthrough of the demo by our founder, Parsa, that predicts the stance of authors toward Ireland in stories that mention both Ireland and Brexit:

In practice, this tool can be used to make it easy for people to access recommended content that does not simply agree with the content that they have already read, making it easier for people to be regularly exposed to other points of view. It can also be used to analyze the stance of organizations and individuals toward a particular entity in the coverage of an event.

We’ll be using this stance detection tool to look into other topics in future posts. 360° Stance Detection has been accepted to NAACL-HLT 2018, and you can read the submission on arXiv here.


The post Using Natural Language Processing to Combat Filter Bubbles and Fake News – 360° Stance Detection appeared first on AYLIEN.

Continue Reading…


Read More

Python Regular Expressions Cheat Sheet

The tough thing about learning data is remembering all the syntax. While at Dataquest we advocate getting used to consulting the Python documentation, sometimes it's nice to have a handy reference, so we've put together this cheat sheet to help you out!

Continue Reading…


Read More

Upcoming speaking engagments

I have a couple of public appearances coming up soon.


Preparing Datasets – The Ugly Truth & Some Solutions is a great idea of Jim Porzak’s. Jim will speak on problems one is likely to encounter in trying to use real world data for predictive modeling and then I will speak on how the vtreat package helps address these issues. vtreat systematizes a number of routine domain independent data repairs and preparations, leaving you more time to work on important domain specific issues (plus it has citable documentation, helping make your methodology section smaller).

vtreat is the best way to prepare messy real world data for predictive modeling.
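As a flavour of that workflow (a minimal sketch; the toy data and column name are invented, but the designTreatmentsN/prepare pattern is vtreat's documented usage):

```r
library(vtreat)

set.seed(2)
# Toy data with the kinds of problems vtreat repairs: missing values in a
# categorical predictor of a numeric outcome
d <- data.frame(x = sample(c("a", "b", NA), 100, replace = TRUE),
                y = rnorm(100), stringsAsFactors = FALSE)

# Design a treatment plan for a numeric outcome, then apply it to get a
# clean, all-numeric frame safe to pass to a modeling function
treatments <- designTreatmentsN(d, varlist = "x", outcomename = "y")
d_treated <- prepare(treatments, d)
head(d_treated)
```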


rquery: a Query Generator for Working With SQL Data

is an introduction to the rquery query generator system. rquery is a new R package that builds “pipe-able SQL” and includes a number of very powerful data operators and analyses. It includes a number of very neat features, including query pipeline diagrams.


We think rquery (plus cdata) is going to be the best way (easiest to learn, most expressive, easiest to maintain, and most performant) to use R to manipulate data at scale (SQL databases and Spark).

Continue Reading…


Read More

Do You Have the Expertise to be a Big Data Project Manager?

Are you one of those project management professionals inclined to handle big data proposals? Having a strong project management background would add excellent support to your aspirations. However, for setting up a big data project team, you don’t just need good project management skills. You also need to supersize the

The post Do You Have the Expertise to be a Big Data Project Manager? appeared first on Dataconomy.

Continue Reading…


Read More

Derivation of Convolutional Neural Network from Fully Connected Network Step-By-Step

What are the advantages of ConvNets over FC networks in image analysis? How is ConvNet derived from FC networks? Where the term convolution in CNNs came from? These questions are to be answered in this article.

Continue Reading…


Read More

Thanks for reading!